Integrating large language models with explainable fuzzy inference systems for trusty steel defect detection


TL;DR Summary

LE-FIS integrates LLMs with an explainable fuzzy inference system, combining a locally trained, globally predicted (LTGP) detector with genetic-algorithm optimization to deliver efficient, transparent, and trustworthy steel defect detection.

Abstract

Contents lists available at ScienceDirect: Pattern Recognition Letters (journal homepage: www.elsevier.com/locate/patrec).

Integrating large language models with explainable fuzzy inference systems for trusty steel defect detection

Kening Zhang (a), Yung Po Tsang (a), Carman K.M. Lee (a), C.H. Wu (b,*)

(a) Department of Industrial and Systems Engineering, Research Institute for Advanced Manufacturing, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong
(b) Department of Supply Chain and Information Management, The Hang Seng University of Hong Kong, Siu Lek Yuen, N.T., Hong Kong

Keywords: Fuzzy inference system (FIS); Steel defect detection; Explainable artificial intelligence (XAI); Black-box model

Abstract (truncated in the source): In industrial applications, the complexity of machine learning models often makes their decision-making processes difficult to interpret and lack transparency, particularly in the steel manufact…


1. Bibliographic Information

1.1. Title

Integrating large language models with explainable fuzzy inference systems for trusty steel defect detection.

1.2. Authors

The authors of this paper are Kening Zhang, Yung Po Tsang, Carman K.M. Lee, and C.H. Wu. The first three authors are affiliated with The Hong Kong Polytechnic University, and C.H. Wu is with The Hang Seng University of Hong Kong. Their research backgrounds appear to span machine learning, industrial engineering, and explainable AI.

1.3. Journal/Conference

The front matter embedded in the provided text ("Contents lists available at ScienceDirect … Pattern Recognition Letters … www.elsevier.com/locate/patrec") indicates that the paper was published in Pattern Recognition Letters, an Elsevier journal.

1.4. Publication Year

The paper references works published as recently as 2024, so the publication year is 2024 or 2025.

1.5. Abstract

The paper introduces a novel framework named LE-FIS (Large language models-based Explainable Fuzzy Inference System) designed to interpret "black-box" machine learning models used for detecting defects in steel. The methodology involves a two-stage process. First, a deep learning model is trained using a LTGP (locally trained, globally predicted) approach, where images are broken into smaller segments for training, and the model is then used to detect defects on entire images. Second, the LE-FIS framework is used to explain the decisions of this deep learning model by automatically generating fuzzy rules and membership functions. These components are then optimized using a genetic algorithm (GA). Finally, state-of-the-art large language models (LLMs) are employed to translate the technical outputs of LE-FIS into human-understandable explanations. The experimental results indicate that the LTGP model performs effectively in defect detection, and the LE-FIS framework, supported by LLMs, successfully provides a trustworthy and interpretable solution, enhancing transparency and reliability for industrial applications.

The original source is a PDF file located at /files/papers/690485299f2f7e6b6c47c52d/paper.pdf. The ScienceDirect front matter embedded in that file suggests it is the published Pattern Recognition Letters version rather than a preprint.

2. Executive Summary

2.1. Background & Motivation

The core problem this paper addresses is the "black-box" nature of advanced machine learning models, particularly deep learning models like Convolutional Neural Networks (CNNs), when applied to critical industrial tasks such as steel defect detection. While these models achieve high accuracy, their internal decision-making processes are opaque and difficult for humans to understand.

This lack of transparency is a major barrier to adoption in high-stakes environments for several reasons:

  • Quality Control: Engineers need to understand why a defect was flagged to take appropriate corrective action.

  • Regulatory Compliance: Many industries require transparent and justifiable processes for quality assurance.

  • Stakeholder Trust: Building trust with engineers, managers, and customers is difficult if the technology making crucial decisions cannot be explained.

    Prior research has either focused on high-accuracy black-box models or simpler, more interpretable models that lack accuracy. Existing Explainable AI (XAI) methods like LIME or SHAP provide some insight but do not generate inherent, rule-based explanations that are easy to follow. The paper identifies a gap: the need for a system that is not only accurate but can also logically explain its reasoning and provide actionable guidance. This paper's innovative idea is to bridge this gap by building an explainable surrogate model (LE-FIS) that mimics the black-box detector and then uses the power of Large Language Models (LLMs) to translate its logic into natural language.

The overall workflow is illustrated in Figure 1 from the paper, which contrasts machine learning and non-machine learning methods in terms of their accuracy and explainability trade-off.

Fig. 1. Differences in explainability and accuracy of machine learning methods and non-machine learning methods for steel defect detection. The figure shows the overall workflow, from image acquisition and raw image input, through machine learning and non-machine-learning (fuzzy inference) detection methods, to the final defect classification output, highlighting the trade-off between accuracy and explainability: machine learning methods focus on achieving high accuracy at the cost of transparency.

2.2. Main Contributions / Findings

The paper outlines four primary contributions:

  1. A Novel Detection Method (LTGP): The paper proposes a locally trained, globally predicted deep detection method. This approach involves segmenting large steel surface images into smaller patches for training a deep learning model, which is then used to perform detection on the complete, unsegmented images. This is designed to handle large-scale industrial imagery efficiently.
  2. An Explainable Surrogate Model (LE-FIS): The paper introduces LE-FIS, an explainable fuzzy inference system designed to interpret the LTGP black-box model. LE-FIS automatically generates fuzzy rules and membership functions to approximate the behavior of the deep learning model. A genetic algorithm (GA) is used to optimize its parameters for a better fit.
  3. LLM-Powered Interpretation: The authors utilize state-of-the-art LLMs to interpret the fuzzy rules and results from LE-FIS. This step translates the technical, mathematical output of the fuzzy system into clear, professional, and accessible natural language explanations. The paper also establishes evaluation metrics to compare the explanatory performance of different LLMs.
  4. Experimental Validation: The paper provides experimental results demonstrating that the LTGP model achieves strong performance in defect detection, and the complete LE-FIS framework successfully provides a trustworthy and interpretable model. This enhances transparency for industrial steel defect detection, balancing accuracy with explainability.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully understand this paper, one must be familiar with the following concepts:

  • Black-Box vs. White-Box Models:

    • Black-Box Model: An artificial intelligence system whose internal workings are hidden or too complex for humans to understand. Users can see the inputs and outputs but not the decision-making process. Deep neural networks are a prime example.
    • White-Box Model: A model whose internal logic is transparent and easily interpretable. Examples include decision trees or simple linear regression, where the rules and weights are clear.
  • Explainable AI (XAI): A field of artificial intelligence focused on developing methods and techniques that make the results and decisions of AI systems understandable to humans. The goal of XAI is to open up the "black box" and provide insights into how a model arrives at a specific conclusion.

  • Fuzzy Inference System (FIS): A system based on fuzzy logic, which is a form of logic that deals with reasoning that is approximate rather than fixed and exact. An FIS consists of three main stages (a minimal code sketch of these stages follows this list):

    1. Fuzzification: Crisp (numerical) input values (e.g., pixel intensity, texture score) are converted into fuzzy values by determining their degree of membership in predefined fuzzy sets (e.g., "low brightness," "medium texture," "high contrast"). This is done using membership functions.
    2. Inference Engine (Rule Evaluation): A set of human-like IF-THEN rules are applied to the fuzzy values to produce a fuzzy output. For example: IF texture is 'rough' AND brightness is 'low', THEN defect is 'likely'.
    3. Defuzzification: The fuzzy output from the rule evaluation is converted back into a single, crisp numerical value (e.g., a defect probability score).
  • Genetic Algorithm (GA): A search and optimization technique inspired by the process of natural selection. A GA works by evolving a population of candidate solutions to a problem over generations. The basic steps are:

    1. Initialization: Create an initial population of random solutions (chromosomes).
    2. Fitness Evaluation: Evaluate how good each solution is using a "fitness function." In this paper, the fitness is how well the FIS output matches the black-box model's output.
    3. Selection: Select the fittest individuals to be "parents" for the next generation.
    4. Crossover: Combine the genetic material (parameters) of pairs of parents to create new offspring.
    5. Mutation: Introduce small, random changes into the offspring's genes to maintain diversity.
    6. Repeat: Repeat the process of evaluation, selection, crossover, and mutation for many generations until the solution converges.
  • Convolutional Neural Network (CNN) and ResNet:

    • CNN: A type of deep learning model especially effective for image analysis. It uses layers of convolutions (filters) to automatically learn hierarchical features from images, such as edges, textures, and shapes.
    • ResNet (Residual Network): A specific architecture of CNN that introduces "shortcut connections" or "residual blocks." These connections allow the model to skip one or more layers, which helps to mitigate the "vanishing gradient" problem and enables the training of much deeper and more powerful networks. The paper uses ResNet-18.
  • Large Language Models (LLMs): Massive deep learning models trained on vast amounts of text data (e.g., GPT-4, Llama). They excel at understanding, generating, and translating human language. In this paper, they are used as an "explainer" to convert the technical rules of the FIS into fluent, understandable prose.
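To ground the three FIS stages above in something executable, here is a minimal, self-contained Python sketch (not the paper's code); the feature name `texture`, the set centers/widths, and the two rules are illustrative assumptions:

```python
import numpy as np

def bell(x, c, s):
    """Bell-shaped membership: degree to which x belongs to a set centered at c with width s."""
    return 1.0 / (1.0 + ((x - c) / s) ** 2)

# Fuzzification: map a crisp texture score (0..1) to 'smooth'/'medium'/'rough'.
texture = 0.82
memberships = {
    "smooth": bell(texture, c=0.1, s=0.2),
    "medium": bell(texture, c=0.5, s=0.2),
    "rough":  bell(texture, c=0.9, s=0.2),
}

# Inference: two illustrative Sugeno-style rules.
# Rule 1: IF texture is rough  THEN defect_score = 0.9
# Rule 2: IF texture is smooth THEN defect_score = 0.1
w = np.array([memberships["rough"], memberships["smooth"]])  # firing strengths
f = np.array([0.9, 0.1])                                     # rule outputs

# Defuzzification: weighted average of rule outputs -> crisp defect score.
defect_score = np.sum(w * f) / np.sum(w)
print(f"defect score = {defect_score:.3f}")  # close to 0.9 for a rough texture
```

A genetic algorithm, as described in the next bullet, would tune the centers `c`, widths `s`, and rule outputs `f` so that the FIS output tracks a target signal.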

3.2. Previous Works

The authors position their work against existing methods for steel defect detection and explainability.

  • Machine Learning for Defect Detection: The paper acknowledges a rich history of using machine learning for this task.

    • Early methods used neural networks like two-layer feedforward networks [13].
    • More recent and successful methods leverage deep learning, especially CNNs [14, 15], which excel at image-based feature extraction.
    • Unsupervised methods, such as Generative Adversarial Networks (GANs) and autoencoders [11, 16], have been used to detect defects without needing labeled data.
    • Reinforcement learning has also been applied to enhance detection [17]. The common thread is that while these models can be highly accurate, they are typically black boxes.
  • Explainable AI (XAI) in Defect Detection: The paper critiques previous XAI approaches in this domain.

    • Some methods claim explainability but do not fundamentally address the black-box problem [18].
    • Other works [19-21] apply general-purpose XAI tools like LIME (Local Interpretable Model-agnostic Explanations) or SHAP (SHapley Additive exPlanations). These tools are post-hoc, meaning they explain a model's prediction by analyzing how inputs affect the output, but they do not reveal the model's internal logic. The authors argue these methods are not "inherently explainable."
    • Other approaches are limited in scope, for instance, by only addressing classification and not segmentation [22], or by providing interpretations that are still unclear about the exact logic used [23].

3.3. Technological Evolution

The field of automated defect detection has evolved significantly:

  1. Traditional Image Processing: Early methods relied on manually crafted filters and algorithms to detect features like edges, blobs, and textures. These were highly interpretable but brittle and lacked accuracy on complex defects.

  2. Classical Machine Learning: Models like Support Vector Machines (SVMs) and Random Forests were used with hand-engineered features, improving accuracy but still requiring significant domain expertise for feature design.

  3. Deep Learning Revolution: CNNs and other deep models automated feature extraction, leading to a massive leap in accuracy. However, this came at the cost of interpretability, creating the "black-box" problem.

  4. The Rise of XAI: In response, the field of XAI emerged, with post-hoc methods like LIME and SHAP becoming popular. These methods provide valuable insights but are often approximations and do not create a fully transparent model.

    This paper represents a step forward in the XAI timeline by attempting to create an inherently interpretable surrogate model (FIS) and then further enhancing its accessibility through natural language generation (LLM).

3.4. Differentiation Analysis

The core innovation of this paper compared to previous work is its hybrid, multi-stage approach to explainability:

  • Beyond Post-Hoc Explanations: Instead of just applying LIME or SHAP to a black-box model, this paper constructs a completely new, interpretable "white-box" model (the FIS) that is trained to mimic the black-box model. The explanation is then derived from this interpretable surrogate.

  • Generation of Logical Rules: The FIS generates explicit IF-THEN rules, which are more intuitive to a human operator than feature importance scores or heatmaps. This provides a causal, logical explanation.

  • Leveraging LLMs for the "Last Mile" of Explanation: The paper uniquely recognizes that even IF-THEN rules can be technical. It uses LLMs to bridge the final gap, translating these rules into professional, easy-to-read reports with recommendations. This is a novel application of LLMs in the industrial XAI pipeline.

  • Integrated Framework: The LE-FIS is not just a single component but a full framework, from the LTGP training strategy to the GA-based optimization and the final LLM interpretation layer, as shown in the paper's overall architecture diagram (Figure 2).

    Fig. 2. Overall methodology for steel defect detection, showing data preprocessing, the FLEX module (fuzzy inference with genetic algorithm optimization), and the explanation of the output results, illustrating the LE-FIS model structure and the LLM interpretation process.

4. Methodology

4.1. Principles

The core idea of the methodology is to create a trustworthy and explainable system for steel defect detection by combining the predictive power of a deep learning model with the interpretability of a fuzzy inference system, and then making that interpretation accessible through a large language model. The approach can be broken down into four main stages:

  1. Image Preprocessing and Segmentation: Prepare the data by segmenting large images into smaller, manageable patches for efficient training.
  2. Deep Learning for Detection: Train a high-performance "black-box" model (ResNet-18) on these patches to serve as the primary defect detector. This is the LTGP method.
  3. Fuzzy System Fitting: Train an FIS to mimic the predictions of the ResNet-18 model. This FIS serves as an interpretable "surrogate" model. Its parameters are optimized using a Genetic Algorithm (GA).
  4. LLM-based Explanation: Use a prompted LLM to translate the rules and outputs of the FIS into natural language.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Step 1: Preprocessing and Segmentation

To handle large industrial images (e.g., $1600 \times 256$ pixels), the authors employ a locally trained, globally predicted (LTGP) strategy. The training phase starts with preprocessing and segmentation. (A code sketch of the cropping and normalization appears at the end of this step.)

  • Patch Extraction: The original high-resolution image, denoted as $I_{raw}$ with dimensions $H \times W$, is cropped into smaller patches of size $h_c \times w_c$. This is done using a sliding window with vertical stride $s_h$ and horizontal stride $s_w$. A specific patch $I_{crop}^{i,j}$ is extracted using the formula: $ I_{crop}^{i,j} = I_{raw}(i \cdot s_h : i \cdot s_h + h_c,\; j \cdot s_w : j \cdot s_w + w_c) $

    • Symbol Explanation:
      • $I_{crop}^{i,j}$: The cropped patch at the (i, j)-th window position.
      • $I_{raw}$: The original input image.
      • i, j: The row and column indices of the sliding window.
      • $s_h, s_w$: The vertical and horizontal strides, determining how far the window moves in each step.
      • $h_c, w_c$: The height and width of the cropped patch. The paper mentions that zero-padding is applied if the image dimensions are not perfectly divisible by the window size and stride, ensuring the entire image is covered.
  • Normalization: After cropping, each patch is normalized to scale pixel values to a standard range, which helps stabilize the training process. The formula as extracted from the PDF is incomplete, but the surrounding text, which defines $\mu$ and $\sigma$ as the mean and standard deviation of the pixel values in the dataset, indicates standard z-score normalization: $ I_n^{(l)} = \frac{I_r^{(l)} - \mu}{\sigma} $

    • Symbol Explanation:
      • $I_n^{(l)}$: The $l$-th normalized image patch.
      • $I_r^{(l)}$: The $l$-th raw image patch.
      • $\mu$: The mean pixel value across the dataset.
      • $\sigma$: The standard deviation of pixel values across the dataset.
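As noted above, here is a minimal NumPy sketch of the cropping and normalization. `crop_patches` is a hypothetical helper, and the 256-pixel patch size and strides are illustrative choices, not values fixed by the paper:

```python
import numpy as np

def crop_patches(I_raw, h_c, w_c, s_h, s_w):
    """Slide a (h_c, w_c) window over I_raw with strides (s_h, s_w),
    zero-padding the bottom/right border so the whole image is covered."""
    H, W = I_raw.shape
    n_i = int(np.ceil(max(H - h_c, 0) / s_h)) + 1   # window positions per axis
    n_j = int(np.ceil(max(W - w_c, 0) / s_w)) + 1
    pad_h = (n_i - 1) * s_h + h_c - H
    pad_w = (n_j - 1) * s_w + w_c - W
    I = np.pad(I_raw, ((0, pad_h), (0, pad_w)))
    return [I[i * s_h : i * s_h + h_c, j * s_w : j * s_w + w_c]
            for i in range(n_i) for j in range(n_j)]

# Example: a 256x1600 steel image cut into 256x256 patches.
img = np.random.rand(256, 1600)
patches = crop_patches(img, h_c=256, w_c=256, s_h=256, s_w=256)

# Z-score normalization with dataset-level statistics, matching I_n = (I_r - mu) / sigma.
mu, sigma = img.mean(), img.std()
patches = [(p - mu) / sigma for p in patches]
print(len(patches), patches[0].shape)
```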

4.2.2. Step 2: Defect Detection using ResNet-18 (The Black-Box Model)

The cropped and normalized patches are used to train a ResNet-18 model for defect classification. ResNet-18 is chosen for its ability to train deep networks effectively.

  • Residual Blocks: The core component of ResNet is the residual block, which allows gradients to flow more easily through the network. Its operation is defined by: $ y = F(x, \{W_i\}) + x $

    • Symbol Explanation:
      • $x$: The input feature map to the block.
      • $y$: The output feature map.
      • $F(x, \{W_i\})$: The residual mapping (a series of convolutional transformations) that the block learns. The model learns the difference, or residual, between the input and the desired output.
      • $\{W_i\}$: The weights of the convolutional layers within the block. The transformation $F(x)$ typically consists of two or more convolutional layers. The paper specifies a two-layer block: $ F(x, \{W_i\}) = W_2 \cdot \sigma(W_1 \cdot x) $
    • Symbol Explanation:
      • $W_1, W_2$: The weight matrices of the two convolutional layers.
      • $\sigma(\cdot)$: The ReLU (Rectified Linear Unit) activation function, which introduces non-linearity.
  • Final Classification Layers: After passing through multiple residual blocks, the resulting feature map is processed for final classification. (A compact code sketch of the residual block and these layers follows this subsection.)

    1. Global Average Pooling (GAP): The spatial dimensions of the feature map are reduced to a single vector. This reduces the number of parameters and helps prevent overfitting. $ f_{gap} = \frac{1}{H_f \times W_f} \sum_{i=1}^{H_f} \sum_{j=1}^{W_f} f_{rn}(i, j) $
      • Symbol Explanation:
        • $f_{gap}$: The output feature vector after pooling.
        • $f_{rn}(i,j)$: The value of the feature map at position (i, j).
        • $H_f, W_f$: The height and width of the final feature map before GAP.
    2. Fully Connected (FC) Layer: The pooled feature vector is fed into a standard FC layer to produce logits (raw scores) for each defect class. $ z = W_{fc} \cdot f_{gap} + b_{fc} $
      • Symbol Explanation:
        • $z$: The output vector of logits.
        • $W_{fc}$: The weight matrix of the FC layer.
        • $b_{fc}$: The bias term of the FC layer.
    3. Softmax Function: The logits are converted into class probabilities. $ \hat{y}_i = \frac{\exp(z_i)}{\sum_{j=1}^{C} \exp(z_j)}, \quad i = 1, 2, \ldots, C $
      • Symbol Explanation:
        • $\hat{y}_i$: The predicted probability for class $i$.
        • $z_i$: The logit for class $i$.
        • $C$: The total number of defect classes.
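The sketch referenced above, in PyTorch: a two-layer residual block implementing $y = F(x, \{W_i\}) + x$ and a GAP → FC → softmax head. It mirrors the formulas only; the real ResNet-18 additionally uses batch normalization, strided downsampling shortcuts, and a fixed stack of blocks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Two-layer residual block: y = F(x, {W_i}) + x, with F(x) = W2 . relu(W1 . x)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        residual = self.conv2(F.relu(self.conv1(x)))  # F(x, {W1, W2})
        return F.relu(residual + x)                   # shortcut connection

class ClassifierHead(nn.Module):
    """GAP -> FC -> softmax, mirroring the classification layers described above."""
    def __init__(self, channels, num_classes):
        super().__init__()
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, f_rn):
        f_gap = f_rn.mean(dim=(2, 3))    # global average pooling over H_f x W_f
        z = self.fc(f_gap)               # logits
        return torch.softmax(z, dim=1)   # class probabilities

x = torch.randn(1, 64, 32, 32)
probs = ClassifierHead(64, 4)(ResidualBlock(64)(x))
print(probs)  # four defect-class probabilities summing to 1
```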

4.2.3. Step 3: FIS Module for Black-Box Fitting (The Surrogate Model)

This is the core of the LE-FIS framework. An FIS is trained to approximate the output ($\hat{y}$) of the ResNet-18 model, using the same input features. (A sketch of the surrogate-fitting loop follows this subsection.)

  • Fuzzification (Membership Functions): Each input feature $x_i$ (e.g., texture, color from the image) is mapped to a degree of membership in a fuzzy set (e.g., 'low', 'medium', 'high') using a membership function. The paper uses a generalized bell-shaped function: $ \mu_{A_i}(x_i) = \frac{1}{1 + \left( \frac{x_i - c_i}{\sigma_i} \right)^2} $

    • Symbol Explanation:
      • $\mu_{A_i}(x_i)$: The membership degree of input $x_i$ in the fuzzy set $A_i$.
      • $c_i$: The center of the bell curve for this function.
      • $\sigma_i$: The width of the bell curve, controlling its steepness. These parameters ($c_i, \sigma_i$) are learned during optimization.
  • Fuzzy Rules and Inference: The system uses a set of IF-THEN rules. A typical rule is structured as: IF $x_1$ is $A_1$ AND $x_2$ is $A_2$ THEN $z = f(x_1, x_2)$. The "firing strength" or weight ($w_i$) of each rule is calculated by combining the membership degrees of the inputs for that rule, typically using a product (AND operator): $ w_i = \prod_{j=1}^{n} \mu_{A_j}(x_j) $

    • Symbol Explanation:
      • $w_i$: The firing strength of the $i$-th rule.
      • $\mu_{A_j}(x_j)$: The membership degree of input $x_j$ in the fuzzy set $A_j$ associated with this rule.
      • $n$: The number of input features.
  • Defuzzification: The final output of the FIS is a crisp value, calculated as the weighted average of the outputs of all rules. $ z = \frac{\sum_{i=1}^{N} w_i f_i(x_1, x_2)}{\sum_{i=1}^{N} w_i} $

    • Symbol Explanation:
      • $z$: The final crisp output of the FIS.
      • $N$: The total number of fuzzy rules.
      • $w_i$: The firing strength of the $i$-th rule.
      • $f_i(x_1, x_2)$: The output function of the $i$-th rule (for a Sugeno-type FIS, this is often a linear combination of inputs).
  • Optimization with Genetic Algorithm (GA): The goal is to make the FIS output ($z$) as close as possible to the black-box model's prediction ($\hat{y}$). The GA is used to tune the FIS parameters (the membership function parameters $c_i, \sigma_i$ and the rule parameters) by minimizing the Mean-Squared Error (MSE) loss function: $ \mathcal{L} = \frac{1}{m} \sum_{i=1}^{m} \left( z_i - \hat{y}_i \right)^2 $

    • Symbol Explanation:
      • $\mathcal{L}$: The loss function to be minimized.
      • $m$: The number of training samples.
      • $z_i$: The output of the FIS for the $i$-th sample.
      • $\hat{y}_i$: The prediction of the ResNet-18 black-box model for the $i$-th sample.
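Referenced above, a toy sketch of the whole surrogate-fitting loop. It assumes a single 1-D input feature, a two-rule Sugeno FIS, and a deliberately simple GA (truncation selection, arithmetic crossover, Gaussian mutation); the placeholder black-box stands in for ResNet-18 predictions, and none of these choices are prescribed by the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def fis_output(params, x):
    """Single-input, two-rule Sugeno FIS. params = [c1, s1, f1, c2, s2, f2]."""
    c1, s1, f1, c2, s2, f2 = params
    w1 = 1.0 / (1.0 + ((x - c1) / s1) ** 2)  # bell membership, rule 1
    w2 = 1.0 / (1.0 + ((x - c2) / s2) ** 2)  # bell membership, rule 2
    return (w1 * f1 + w2 * f2) / (w1 + w2)    # weighted-average defuzzification

def fitness(params, x, y_hat):
    """MSE between the FIS output and the black-box predictions (lower is better)."""
    return np.mean((fis_output(params, x) - y_hat) ** 2)

# Targets: a placeholder standing in for ResNet-18 predictions on a 1-D feature.
x = rng.uniform(0, 1, 200)
y_hat = 1.0 / (1.0 + np.exp(-10 * (x - 0.5)))

# Minimal GA: evaluate -> select parents -> crossover -> mutate.
pop = rng.uniform(0.05, 1.0, size=(40, 6))
for gen in range(100):
    scores = np.array([fitness(p, x, y_hat) for p in pop])
    parents = pop[np.argsort(scores)[:10]]                  # truncation selection
    mates = parents[rng.integers(0, 10, size=(40, 2))]
    pop = (mates[:, 0] + mates[:, 1]) / 2                   # arithmetic crossover
    pop += rng.normal(0, 0.02, pop.shape)                   # Gaussian mutation
    pop[:, [1, 4]] = np.clip(pop[:, [1, 4]], 0.05, None)    # keep widths positive
print("best MSE (last evaluated generation):", scores.min())
```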

4.2.4. Step 4: Prompt Design for LLMs

After the FIS is trained, its rules and parameters represent an interpretable model. However, to make it truly accessible, the authors use an LLM. They employ Prompt Engineering (PE) to guide the LLM to generate a high-quality explanation. The prompt is structured into six sections (an illustrative template follows this list):

  1. Task: Clearly state the goal: use an LLM to explain the results of a fuzzy logic system for steel defect detection.
  2. Context: Provide background information, including the dataset (Severstal), the task (classification based on image features like color and texture), and the LLM's role.
  3. Example: Give a concrete example of the desired output, such as "Based on the fuzzy logic rules, the color variation and irregularities in texture in the image suggest the presence of a scratch."
  4. Roles: Define the roles: the LE-FIS is the "detector," and the LLM is the "explainer."
  5. Format: Specify the output structure, which should include the detection result, a detailed explanation, and a possible recommendation.
  6. Tone: Instruct the LLM to use a professional yet accessible tone, explaining technical terms clearly for non-experts.
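As an illustration of the six-part structure (the paper's exact wording is not reproduced), such a prompt might be assembled as follows; `detection` and `rules` are hypothetical LE-FIS outputs:

```python
# Illustrative six-part prompt assembled per the structure above.
detection = "scratch"  # hypothetical LE-FIS result for one image
rules = "IF texture is 'rough' AND brightness is 'low' THEN defect is 'likely'"

prompt = f"""
Task: Explain the results of a fuzzy logic system for steel defect detection.
Context: Severstal steel images; defects classified from features such as color and texture.
Example: "Color variation and texture irregularities suggest the presence of a scratch."
Roles: LE-FIS is the detector; you, the LLM, are the explainer.
Format: (1) detection result, (2) detailed explanation, (3) possible recommendation.
Tone: professional yet accessible; define technical terms for non-experts.

Detection result: {detection}
Fired rules: {rules}
"""
print(prompt)
```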

5. Experimental Setup

5.1. Datasets

The study uses the Kaggle Steel Defect Detection dataset. This is a well-known public dataset for benchmarking industrial defect detection models.

  • Source and Scale: The dataset contains 12,568 training images, each with a resolution of $1600 \times 256$ pixels.
  • Characteristics and Domain: The images are of flat steel surfaces. The key feature is that defects are annotated with pixel-wise masks, which allows for training and evaluating segmentation models, not just classification.
  • Data Sample: The dataset contains four types of surface defects:
    • Type 1 (Scratches): Linear, narrow markings.
    • Type 2 (Patches): Irregular, larger defects with rough textures.
    • Type 3 (Dents): Depressions or surface distortions.
    • Type 4 (Cracks): Thin, elongated fractures. The paper notes a significant challenge with this dataset: class imbalance, where some defect types are much more common than others. This requires careful handling during model training to avoid bias.
  • Choice Rationale: This dataset was chosen because it is a standard benchmark in the field and provides the precise pixel-level annotations needed to validate the segmentation performance of the proposed model.

5.2. Evaluation Metrics

The paper uses several standard metrics to evaluate the performance of the defect detection and segmentation. (A short computation sketch follows the definitions below.)

  • Dice Coefficient: This metric measures the overlap or similarity between the predicted segmentation mask and the ground-truth mask. It is particularly useful for imbalanced datasets where the region of interest (the defect) might be very small compared to the background.

    1. Conceptual Definition: The Dice coefficient calculates twice the area of the intersection of the two masks, divided by the sum of their areas. A score of 1 means a perfect match, and 0 means no overlap.
    2. Mathematical Formula: $ \mathrm{Dice} = \frac{2 \times |X \cap Y|}{|X| + |Y|} $
    3. Symbol Explanation:
      • $X$: The set of pixels in the predicted segmentation mask.
      • $Y$: The set of pixels in the ground-truth mask.
      • $|\cdot|$: The number of elements in a set (i.e., the area of the mask).
      • $\cap$: The intersection operator, representing the pixels common to both masks.
  • Precision: This metric measures the accuracy of the positive predictions.

    1. Conceptual Definition: Of all the pixels that the model predicted as defects, what proportion were actual defects? High precision means the model makes few false positive errors.
    2. Mathematical Formula: $ \mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} $
    3. Symbol Explanation:
      • TP (True Positives): The number of defect pixels correctly identified as defects.
      • FP (False Positives): The number of non-defect pixels incorrectly identified as defects.
  • Recall (Sensitivity): This metric measures the model's ability to find all actual positive instances.

    1. Conceptual Definition: Of all the actual defect pixels present in the image, what proportion did the model successfully identify? High recall means the model makes few false negative errors.
    2. Mathematical Formula: $ \mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} $
    3. Symbol Explanation:
      • TP (True Positives): The number of defect pixels correctly identified as defects.
      • FN (False Negatives): The number of actual defect pixels that the model failed to identify.
  • F1-score: This metric provides a single score that balances both Precision and Recall.

    1. Conceptual Definition: The F1-score is the harmonic mean of Precision and Recall. It is a good overall measure of a model's accuracy, especially when the class distribution is uneven.
    2. Mathematical Formula: $ \mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} $
    3. Symbol Explanation:
      • Precision: The precision score.
      • Recall: The recall score.
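The computation sketch referenced at the start of this subsection: all four metrics from a pair of binary masks, using hypothetical toy arrays. Note that pixel-wise Dice and F1 coincide for binary masks, since both equal 2TP / (2TP + FP + FN).

```python
import numpy as np

def seg_metrics(pred, gt):
    """Pixel-wise Dice, precision, recall, and F1 for binary (0/1) masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)    # defect pixels correctly flagged
    fp = np.sum(pred & ~gt)   # background pixels wrongly flagged
    fn = np.sum(~pred & gt)   # defect pixels missed
    dice = 2 * tp / (np.sum(pred) + np.sum(gt))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return dice, precision, recall, f1

pred = np.array([[1, 1, 0], [0, 1, 0]])  # toy predicted mask
gt   = np.array([[1, 0, 0], [0, 1, 1]])  # toy ground-truth mask
print(seg_metrics(pred, gt))  # (0.667, 0.667, 0.667, 0.667)
```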

5.3. Baselines

The paper compares its proposed LE-FIS method against a baseline model described as a "Complex deep learning method [26]". The reference [26] is "A deep learning model for steel surface defect detection" by Li et al. (2024). This baseline is appropriate because it represents the state of the art in pure predictive accuracy, serving as the kind of black-box model that LE-FIS aims to explain while trading off some performance.

6. Results & Analysis

6.1. Core Results Analysis

The paper presents a comprehensive analysis of the performance of its proposed LTGP detector and the LE-FIS framework.

Visual Detection Results: Figure 3 shows qualitative results of the model's predictions. The images display the original steel surface, the ground-truth mask of the defect, and the mask predicted by the model. These examples visually confirm that the model is capable of accurately localizing different types of defects, such as linear scratches and larger "crazing" defects.

Fig. 3. Examples of some of the detection results.

Training Performance: Figure 4 shows the training progress of the underlying deep learning model.

  • Figure 4(a) shows that the validation accuracy increases rapidly and stabilizes at a high level, indicating that the model learns effectively and does not suffer from severe overfitting.

  • Figure 4(b) shows that the Dice coefficient also steadily improves, confirming that the model's segmentation accuracy gets progressively better at matching the ground-truth defect shapes.

    Fig. 4. Accuracy and dice coefficient comparison.

    Genetic Algorithm Optimization: Figure 5 illustrates the optimization process of the FIS parameters by the Genetic Algorithm.

  • Figure 5(a) plots the best and mean fitness values over generations. The sharp drop in the early generations shows that the GA quickly finds good parameter combinations. The subsequent stabilization suggests the algorithm has converged to an optimal solution.

  • Figure 5(b) shows the stall generations, indicating that while the optimization slows down in later stages, it avoids getting stuck in poor local optima. This confirms the effectiveness of the GA for tuning the LE-FIS.

    Fig. 5. The process of parameter change in evolution.

    Fuzzy System Analysis: Figure 6 provides insights into the learned FIS.

  • Figure 6(a) is a t-SNE visualization, which shows that the features used by the FIS create a clear separation between the different defect categories (scratches, patches, dents, cracks). This means the features are discriminative.

  • Figure 6(b) displays the learned membership functions. The smooth curves show how the FIS handles uncertainty by allowing gradual transitions between fuzzy sets (e.g., from 'low' to 'medium' defect likelihood).

    Fig. 6. Relationship between the membership functions and image features. The figure has two panels: (a) a t-SNE visualization of the true steel defect classes, showing the distribution of scratches, patches, dents, and cracks; (b) the three membership functions of input variable 1 plotted against the image feature value.

6.2. Data Presentation (Tables)

Performance Comparison: The following are the results from Table 1 of the original paper:

| Class | Precision (%), [26] | Recall (%), [26] | F1-score (%), [26] | Precision (%), LE-FIS | Recall (%), LE-FIS | F1-score (%), LE-FIS | Support |
|---|---|---|---|---|---|---|---|
| Scratches | 92.75 | 91.92 | 92.33 | 82.18 | 80.45 | 81.30 | 100 |
| Patches | 90.63 | 89.74 | 90.18 | 79.58 | 78.22 | 78.89 | 30 |
| Dents | 88.92 | 87.88 | 88.40 | 77.45 | 75.22 | 76.32 | 15 |
| Cracks | 93.85 | 92.96 | 93.40 | 83.12 | 81.24 | 82.17 | 50 |
| Macro Avg | 91.54 | 90.62 | 91.08 | 80.58 | 78.78 | 79.67 | |
| Weighted Avg | 91.97 | 91.75 | 91.86 | 81.03 | 79.18 | 80.09 | |

Time cost (Training): 223.21 s for the complex deep learning method [26] vs. 54.23 s for LE-FIS.
Time cost (Inference): 4.23 s for the complex deep learning method [26] vs. 1.20 s for LE-FIS.

Analysis: This table highlights the central trade-off of the paper. The complex deep learning method [26] achieves superior performance across all metrics (e.g., a weighted average F1-score of 91.86%). The proposed LE-FIS method has lower scores (weighted average F1-score of 80.09%). However, LE-FIS offers a massive advantage in efficiency: its training time is over 4x faster (54.23s vs. 223.21s) and its inference time is over 3x faster (1.20s vs. 4.23s). The key argument of the paper is that this performance drop is an acceptable price for gaining full model transparency and reliability, which LE-FIS provides.

LLM Explanation Performance: The following are the results from Table 2 of the original paper:

| LLM model | Copilot | ChatGPT-4o | ChatGPT-4o mini | ChatGPT-o1 mini | ChatGPT-4 | Qwen | Mistral | Llama |
|---|---|---|---|---|---|---|---|---|
| Words (IQ) | 700 | 928 | 714 | 1125 | 619 | 754 | 572 | 598 |
| Words (DQ) | 1036 | 1924 | 789 | 1357 | 788 | 738 | 1033 | 1253 |
| Membership function explanation | Yes | Yes | No | No | No | Yes | Yes | No |
| Rules explanation | Yes | Yes | No | Yes | Yes | Yes | No | Yes |
| Feature explanation | No | Yes | No | Yes | No | Yes | No | No |

Analysis: This table compares the ability of different LLMs to explain the LE-FIS output. The authors evaluate them based on the length of the explanation (word count for an initial question, IQ, and a detailed question, DQ) and their ability to explain three key aspects: membership functions, rules, and features.

  • Verbosity: Judging by the word counts, ChatGPT-4o produced by far the longest detailed response (1924 words for DQ), while ChatGPT-o1 mini produced the longest initial response (1125 words for IQ).
  • Capability: There are clear differences in capability. Copilot, ChatGPT-4o, Qwen, and Mistral were able to explain the membership functions, while the other models were not. Most models could explain the fuzzy rules, but only ChatGPT-4o, ChatGPT-o1 mini, and Qwen provided feature-level explanations. Based on this synthesis, the authors selected GPT-4o as the best-performing model for generating the final explanation, as it demonstrated a strong ability to provide detailed, multi-faceted explanations. The paper then presents the refined output from GPT-4o as the final explanation.

6.3. Ablation Studies / Parameter Analysis

The paper does not contain a formal ablation study section where individual components of the LE-FIS model (e.g., GA optimization, specific rule structures) are removed to measure their impact. However, the analysis of the GA optimization process in Figure 5 functions as a form of parameter analysis, demonstrating that the GA effectively converges and finds optimal parameters for the FIS, which is crucial for the model's performance.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully presents a novel framework, LE-FIS, for building an explainable and trustworthy steel defect detection system. The authors combined a powerful deep learning detection method (LTGP) with an interpretable fuzzy inference system (FIS) that acts as a surrogate model. The FIS parameters are optimized with a genetic algorithm to closely mimic the black-box detector. The framework's major innovation is the use of Large Language Models (LLMs) to translate the technical fuzzy logic into human-readable explanations.

The experimental results validate the approach. The LE-FIS surrogate achieves good detection performance (81.03% weighted-average precision); while this is lower than the pure black-box model, it is presented as a satisfactory trade-off for achieving full interpretability and faster computation. The LLM component was shown to effectively generate comprehensive explanations, enhancing the transparency and reliability of the entire system for industrial use.

7.2. Limitations & Future Work

The authors acknowledge several limitations of their work:

  • Data Dependency: The performance of LE-FIS is highly dependent on large volumes of high-quality, labeled training data. Its accuracy may degrade significantly in data-scarce scenarios.

  • Domain Specificity: While effective for steel defect detection, the model's performance may not directly transfer to other industries or defect types without recalibration and fine-tuning, as data distributions and defect characteristics can vary widely.

    Based on these limitations, they propose the following directions for future work:

  • Extend to other defect types and materials.

  • Further optimize the fuzzy inference system.

  • Integrate a real-time feedback loop for improved adaptability on the production line.

7.3. Personal Insights & Critique

This paper offers a thoughtful and practical approach to the pressing challenge of XAI in industrial settings.

Strengths:

  • The core concept of using an FIS as a surrogate model and then an LLM as a "translator" is highly innovative. It creates a multi-layered explanation that is both logically sound (the fuzzy rules) and easily consumable (the natural language output).
  • The LTGP training strategy is a clever and practical solution for handling the very large images common in industrial inspection.
  • The paper's focus on trustworthiness and transparency, not just accuracy, is crucial for real-world adoption of AI in critical systems.

Potential Issues and Areas for Improvement:

  • The Fidelity-Interpretability Trade-off: The central premise rests on a significant trade-off: a ~10% drop in F1-score for explainability. In a safety-critical or high-cost manufacturing process, this might be an unacceptable compromise. The paper justifies it, but a deeper analysis of the cost-benefit in an industrial context would strengthen the argument.

  • The Surrogate Model Gap: The explanation provided is for the LE-FIS surrogate model, not the original ResNet-18 black-box. While the MSE loss is minimized to ensure fidelity, there's no guarantee that the FIS reasons in the same way as the ResNet. The paper could be strengthened by including metrics that specifically measure the "fidelity" or "faithfulness" of the surrogate model to the black-box.

  • Subjectivity in LLM Evaluation: The method for comparing LLMs is somewhat superficial. "Word count" is a poor proxy for explanation quality, and the decision of whether a model "explains" a concept is subjective. A more rigorous evaluation, perhaps involving human operators rating the usefulness and clarity of the explanations, would be more compelling.

  • Genericity of the Final Explanation: The "best explanation" example provided in the text feels quite generic. It explains the general principles of the system rather than providing a specific, instance-based explanation for a particular defect detection. A more powerful system would generate a unique report for each detected defect, e.g., "This area was flagged as a 'scratch' because the fuzzy rules for 'high edge response' and 'low texture variance' fired strongly, with a confidence of 92%." This would be more actionable for an operator.

    Overall, this paper is a valuable contribution that pushes the boundaries of practical XAI. Its a-la-carte approach of combining different AI techniques (ResNet, FIS, GA, LLM) into a single pipeline is a powerful paradigm that could be adapted to many other domains where accuracy and transparency are both required.
