IF-AIP: A machine learning method for the identification of anti-inflammatory peptides using multi-feature fusion strategy
TL;DR Summary
The study introduces IF-AIP, a machine learning model using a voting classifier to identify anti-inflammatory peptides (AIPs). This model integrates eight feature descriptors and five conventional classifiers, optimizing performance with feature selection. It significantly improves accuracy and MCC over existing methods on two independent datasets and correctly identified all 24 novel peptides in a case study.
Abstract
Background: The most commonly used therapy currently for inflammatory and autoimmune diseases is non-specific anti-inflammatory drugs, which have various hazardous side effects. Recently, some anti-inflammatory peptides (AIPs) have been found to be a substitute therapy for inflammatory diseases like rheumatoid arthritis and Alzheimer’s. Therefore, the identification of these AIPs is an emerging topic that is equally important. Methods: In this work, we have proposed an identification model for AIPs using a voting classifier. We used eight different feature descriptors and five conventional machine-learning classifiers. The eight feature encodings were concatenated to get a hybrid feature set. The five baseline models trained on the hybrid feature set were integrated via a voting classifier. Finally, a feature selection algorithm was used to select the optimal feature set for the construction of our final model, named IF-AIP. Results: We tested the proposed model on two independent datasets. On independent data 1, the IF-AIP model shows an improvement of 3%–5.6% in terms of accuracies and 6.7%–10.8% in terms of MCC compared to the existing methods. On the independent dataset 2, our model IF-AIP shows an overall improvement of 2.9%–5.7% in terms of accuracy and 8.3%–8.6% in terms of MCC score compared to the existing methods. A comparative performance analysis was conducted between the proposed model and existing methods using a set of 24 novel peptide sequences. Notably, the IF-AIP method exhibited exceptional accuracy, correctly identifying all 24 peptides as AIPs. The source code, pre-trained models, and all datasets are made available at https://github.com/Mir-Saima/IF-AIP.
1. Bibliographic Information
1.1. Title
IF-AIP: A machine learning method for the identification of anti-inflammatory peptides using multi-feature fusion strategy
1.2. Authors
Saima Gaffar, Mir Tanveerul Hassan, Hilal Tayara, Kil To Chong
1.3. Journal/Conference
The paper was published in Computers in Biology and Medicine. This journal is a peer-reviewed publication focusing on the application of computer science to medicine and biological problems, indicating a strong reputation in bioinformatics and computational biology.
1.4. Publication Year
2023
1.5. Abstract
The most common therapies for inflammatory and autoimmune diseases, non-specific anti-inflammatory drugs, are often associated with hazardous side effects. Anti-inflammatory peptides (AIPs) have emerged as a promising alternative therapy, making their identification a critical and evolving research area. This study proposes IF-AIP, a machine learning model for identifying AIPs. The methodology involves using a voting classifier, incorporating eight distinct feature descriptors concatenated into a hybrid feature set. Five conventional machine learning classifiers are trained on this hybrid set and then integrated via the voting classifier. Subsequently, a feature selection algorithm refines the feature set to construct the final IF-AIP model. When tested on two independent datasets, IF-AIP demonstrated significant performance improvements. On independent dataset 1, it showed a 3%–5.6% increase in accuracy and a 6.7%–10.8% increase in MCC compared to existing methods. For independent dataset 2, IF-AIP exhibited a 2.9%–5.7% improvement in accuracy and an 8.3%–8.6% improvement in MCC. Furthermore, IF-AIP accurately identified all 24 novel peptide sequences in a comparative performance analysis, highlighting its exceptional generalization ability. The source code, pre-trained models, and datasets are publicly available on GitHub.
1.6. Original Source Link
/files/papers/6919eba6110b75dcc59ae31e/paper.pdf
2. Executive Summary
2.1. Background & Motivation
The paper addresses the critical need for safer and more effective treatments for inflammatory and autoimmune diseases. Current therapies, primarily non-specific anti-inflammatory drugs and steroids, are often associated with severe side effects such as brain-blood blockage and various gastrointestinal, cardiovascular, and renal complications. This motivates the search for alternative therapeutic agents.
Recently, anti-inflammatory peptides (AIPs) have been identified as promising candidates due to their potent immunotherapeutic properties and minimal side effects. However, the traditional biological experimental methods for identifying AIPs are time-consuming and expensive. Therefore, there is an urgent need for efficient, automated computational methods, particularly using machine learning, to accurately identify these peptides from sequence information. Existing computational methods often suffer from limitations such as small datasets and a limited number of feature extraction techniques, which restrict their performance and scope.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- Novel Computational Model (IF-AIP): The development of a new machine learning model, IF-AIP, specifically designed for the accurate identification of anti-inflammatory peptides (AIPs).
- Multi-Feature Fusion Strategy: The integration of eight diverse feature descriptors (AAC, DPC, PAAC, APAAC, QSON, SOCN, CKSAAGP, and GTPC) into a hybrid feature set to capture comprehensive peptide sequence information.
- Ensemble Learning with Voting Classifier: The use of a voting classifier to combine the predictions of five baseline machine learning classifiers (Random Forest, Light Gradient Boosting Machine, Extreme Gradient Boosting, Extra Tree Classifier, and CatBoost) trained on the hybrid feature set, leading to more robust and accurate predictions.
- Optimal Feature Selection: The application of a feature selection algorithm to identify an optimal feature set (OFs), which further refines the model's performance by excluding less informative features.
- Enhanced Data Curation: The use of a larger, well-curated dataset, processed with CD-HIT for redundancy removal and SMOTE-Tomek for handling class imbalance, improving the reliability of training.
- Superior Performance:
  - On independent dataset 1, IF-AIP achieved an accuracy of 80.0% and an MCC of 0.579, an improvement of 3%–5.6% in accuracy and 6.7%–10.8% in MCC over existing methods.
  - On independent dataset 2, IF-AIP achieved an accuracy of 77.7% and an MCC of 0.536, an overall improvement of 2.9%–5.7% in accuracy and 8.3%–8.6% in MCC over existing methods.
  - In a case study involving 24 novel peptide sequences, IF-AIP correctly identified all of them as AIPs, whereas PreAIP identified only 14, showcasing its exceptional generalization ability and robustness.

These findings demonstrate that IF-AIP offers better and more consistent predictive performance, making it a viable computational tool for the high-throughput identification of novel AIPs.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand the IF-AIP model, a grasp of several foundational concepts in bioinformatics and machine learning is essential:
- Peptides and Amino Acids:
  - Peptides are short chains of amino acids linked by peptide bonds. They are smaller than proteins and play diverse biological roles. The properties of a peptide are determined by its sequence of amino acids.
  - Amino acids are the fundamental building blocks of peptides and proteins. There are 20 standard amino acids, each with unique side chains that confer different physicochemical properties (e.g., hydrophobicity, charge, size). A peptide sequence is an ordered list of these amino acids.
- Anti-inflammatory Peptides (AIPs): AIPs are specific peptides with the biological activity to reduce or prevent inflammation. They can modulate immune responses, inhibit pro-inflammatory mediators, or promote anti-inflammatory pathways. Their therapeutic potential stems from their targeted action and typically fewer side effects compared to conventional drugs.
- Machine Learning (ML) for Classification: Machine learning is a field of artificial intelligence that enables systems to learn from data without being explicitly programmed. Here it is used for a binary classification task: categorizing peptide sequences as anti-inflammatory peptides (AIPs) or non-AIPs. Classifiers are the algorithms that implement this categorization.
- Feature Descriptors/Encodings: Feature descriptors (also called feature encodings or feature vectors) are numerical representations of raw data (here, peptide sequences) that machine learning algorithms can process. Since peptide sequences are of variable length and symbolic (composed of amino acid letters), they must be transformed into fixed-length numerical vectors. These descriptors capture characteristics such as amino acid frequencies, dipeptide frequencies, or physicochemical properties.
- Imbalanced Datasets and SMOTE-Tomek:
  - An imbalanced dataset is one where the number of samples in one class (the minority class) is significantly lower than in the other (the majority class), which can bias machine learning models toward the majority class.
  - SMOTE (Synthetic Minority Over-sampling Technique) is an oversampling method that creates synthetic examples for the minority class. It takes a minority-class sample, finds its k-nearest neighbors, and generates new samples along the line segments connecting the sample to its neighbors.
  - Tomek links are used in undersampling. A Tomek link exists between two instances of different classes that are each other's nearest neighbors; removing the majority-class member of such a pair cleans the decision boundary and reduces class overlap.
  - SMOTE-Tomek is a hybrid technique that combines SMOTE (oversampling the minority class) with Tomek-link removal (undersampling the majority class), aiming for better balance and cleaner class separation.
- Ensemble Learning and Voting Classifier:
  - Ensemble learning trains multiple learning algorithms (base learners or weak learners) on the same problem, on the premise that combining models can outperform any single model.
  - A voting classifier is a simple yet effective ensemble method: it trains several diverse base classifiers and predicts the class label by majority vote (hard voting) or by averaging predicted probabilities (soft voting), leveraging the strengths of individual models while mitigating their weaknesses.
- Evaluation Metrics: Metrics such as Accuracy, Sensitivity, Specificity, Matthews Correlation Coefficient (MCC), and Area Under the Receiver Operating Characteristic Curve (AUC) quantitatively assess classification performance. They provide complementary views of how well a model discriminates between classes, which is especially important for imbalanced datasets where accuracy alone can be misleading.
3.2. Previous Works
The paper contextualizes its work by reviewing several prior machine learning methods for AIP identification:
- Gupta et al. (2017), AntiInflam [11]: One of the first studies to apply machine learning to AIP identification; it proposed an SVM (Support Vector Machine) classifier.
- Manavalan et al. (2018), AIPpred [12]: Used a Random Forest (RF) classifier and highlighted the effectiveness of Dipeptide Composition (DPC) as a feature extraction method.
- Khatun et al. (2019), PreAIP [13]: Combined Amino Acid Composition (AAC) and conditional entropy features. Five characteristics were selected to train separate RF models, with the final classification derived by combining these five RF classifiers.
- Zhang et al. (2020), AIEpred [14]: An ensemble classifier based on a three-feature representation scheme for encoding peptide sequences.
- Zhao et al. (2021), iAIPs [15]: An RF-based method using three feature encodings: g-gap dipeptide composition (GDC), dipeptide deviation from the expected mean (DDC), and amino acid composition (AAC); the selected features were fed into a Random Forest classifier.

The paper notes that a common limitation of these existing methods was the use of relatively small datasets, as summarized in Table 1. This likely impacted their generalizability and overall performance, which the current paper addresses with a larger dataset and a more comprehensive feature set.
The following are the results from Table 1 of the original paper:
| Methods | Datasets | Benchmark AIPs | Benchmark Non-AIPs | Independent AIPs | Independent Non-AIPs |
|---|---|---|---|---|---|
| AntiInflam | Gupta2017 | 690 | 1009 | 173 | 253 |
| AIPpred | Manvalan2018 | 1258 | 1887 | 420 | 629 |
| PreAIP | Khatun2019 | 1258 | 1887 | 420 | 629 |
| AIEpred | Zhang2020 | 1258 | 1887 | 173 | 253 |
| iAIPs | Zhao2021 | 690 | 1009 | 420 | 629 |
3.3. Technological Evolution
The identification of anti-inflammatory peptides has evolved from expensive and time-consuming biological experiments to efficient computational methods. Early computational approaches were likely basic sequence comparisons or rule-based systems. With the rise of machine learning in bioinformatics, the field transitioned to using algorithms like SVM and Random Forest for classification. Initially, models might have relied on single or a few simple feature descriptors (like Amino Acid Composition or Dipeptide Composition).
This paper represents a step forward in this evolution by moving towards more sophisticated feature engineering (using a larger and more diverse set of feature encodings) and robust ensemble learning techniques (voting classifiers). The emphasis on multi-feature fusion and feature selection, coupled with larger and balanced datasets, reflects a trend towards building more comprehensive and higher-performing predictive models to overcome the limitations of earlier methods. The ultimate goal is to develop highly generalized models that can accurately identify novel AIPs in a high-throughput manner, bridging the gap between computational prediction and experimental validation.
3.4. Differentiation Analysis
The IF-AIP model differentiates itself from previous works through several key innovations:
- Comprehensive Feature Engineering: Unlike previous methods that used a limited number of feature encodings (e.g., AIPpred focused on DPC; iAIPs used three encodings), IF-AIP leverages a larger and more diverse set of eight feature descriptors (AAC, DPC, PAAC, APAAC, QSON, SOCN, CKSAAGP, GTPC). This multi-feature fusion strategy aims to capture a broader range of physicochemical properties and sequence-order information.
- Hybrid Feature Set and Optimal Feature Selection: The paper first concatenates all eight features into a hybrid feature set, then applies a feature selection algorithm to identify an optimal feature set (OFs) by removing less informative features, yielding the final IF-AIP model. This systematic approach to feature optimization is a significant refinement.
- Ensemble Learning with Voting Classifier: While some previous methods (e.g., PreAIP, AIEpred) used ensemble approaches, IF-AIP employs a voting classifier integrating five powerful, widely used machine learning algorithms (RF, LGBM, XGB, ETC, CatBoost). Such an ensemble is generally more robust than individual classifiers or simpler ensemble strategies.
- Larger and Better-Curated Dataset: The paper addresses the small-dataset limitation of previous works by compiling a larger benchmark dataset, applying CD-HIT for redundancy removal and SMOTE-Tomek for class balancing, which contributes to more reliable and generalizable models.
- Demonstrated Generalization Ability: Case studies with 24 novel peptide sequences, which were not part of training or independent testing, demonstrate IF-AIP's superior generalization ability compared to existing methods (e.g., PreAIP), indicating its practical utility for discovering new AIPs.

In essence, IF-AIP combines extensive feature engineering with a robust ensemble learning framework and rigorous data handling to achieve superior and more consistent predictive performance, addressing the limitations observed in prior AIP identification models.
4. Methodology
4.1. Principles
The core principle behind the IF-AIP method is to leverage a comprehensive suite of peptide feature encodings alongside powerful ensemble machine learning techniques to accurately classify anti-inflammatory peptides (AIPs). The method assumes that the diverse physicochemical properties and sequence-order information embedded in peptide sequences, when properly extracted and combined, can provide sufficient discriminative power for AIP identification. By fusing multiple feature descriptors into a hybrid feature set and then refining it through feature selection, the model aims to capture both global and local sequence characteristics relevant to anti-inflammatory activity. The use of a voting classifier further enhances robustness by aggregating the predictions of several heterogeneous base learners, capitalizing on their individual strengths and mitigating their weaknesses, thereby leading to a more generalized and reliable prediction model.
4.2. Core Methodology In-depth (Layer by Layer)
The IF-AIP model construction involves several key steps, from data preparation to model deployment, as depicted in Figure 1.
The following figure (Fig. 1 from the original paper) shows the proposed architecture of the model IF-AIP:
The figure is a schematic of the IF-AIP model construction, covering dataset construction, feature extraction, model training, and performance evaluation. It summarizes the datasets used and the evaluation of the multiple classifiers, whose outputs are fused via an averaged score, $f(x) = \frac{1}{5}\sum_{i=1}^{5} f_i(x)$, to improve the accuracy of AIP identification.
4.2.1. Data Curation
A high-quality dataset is fundamental for building effective machine learning models.
- Collection: The authors collected peptide sequences from two existing papers: iAIPs [15] and AntiInflam [11].
- Initial Dataset: The initial dataset comprised 1962 positive samples (AIPs) and 2896 negative samples (non-AIPs).
- Redundancy Removal: To eliminate redundant or highly similar sequences, the CD-HIT tool [16] was applied with a sequence identity threshold of 0.9. This step keeps the training data diverse and helps prevent overfitting.
- Final Training (Benchmark) Set: After redundancy removal, the benchmark training set consisted of 1451 positive samples and 2339 negative samples.
- Independent Test Datasets: The model's performance was evaluated on two independent datasets not used in training:
  - Independent Dataset 1 (from iAIPs [15]): 420 positive samples and 629 negative samples.
  - Independent Dataset 2 (from AntiInflam [11]): 173 positive samples and 253 negative samples.
  These datasets serve to assess the model's generalization ability.
4.2.2. Feature Representation
Peptide sequences are variable in length and composed of categorical amino acid letters. To be processed by machine learning algorithms, they must be converted into fixed-length numerical feature vectors. A peptide sequence is generally represented as:
$
S = [S_1, S_2, \ldots, S_L]
$
where $S_1$ is the first amino acid in the peptide sequence and $L$ denotes its total length. The paper uses eight different feature encoding techniques:
(i) Amino Acid Composition (AAC):
AAC [17] describes the frequency of each of the 20 standard amino acids in a peptide sequence, giving a 20-dimensional feature vector.
$
x(m) = \frac{L_m}{L}, \quad m \in \{A, C, D, \ldots, Y\}
$
- $x(m)$: The frequency of amino acid type $m$.
- $L_m$: The number of occurrences of amino acid type $m$ in the sequence.
- $L$: The total length of the peptide sequence.
- $m$: One of the 20 standard amino acids.
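The paper extracts all encodings with the iLearn package; purely as an illustrative sketch (function name and example peptide chosen here, not from the paper), AAC can be computed as:

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def aac(sequence: str) -> list[float]:
    """Amino Acid Composition: frequency x(m) = L_m / L for each residue type m."""
    counts = Counter(sequence)
    return [counts.get(aa, 0) / len(sequence) for aa in AMINO_ACIDS]

# A peptide of any length maps to a fixed 20-dimensional vector summing to 1.
print(aac("FLSLIPKIAGGIASLVKDL"))  # sequence taken from the paper's case study
```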
(ii) Dipeptide Composition (DPC):
DPC [17] represents the frequency of all possible pairs of adjacent amino acids (dipeptides) in a sequence. With 20 amino acids there are $20 \times 20 = 400$ possible dipeptides, giving a 400-dimensional feature vector.
$
x(m, n) = \frac{L_{mn}}{L - 1}, \quad m, n \in \{A, C, D, \ldots, Y\}
$
- $x(m, n)$: The frequency of the dipeptide formed by amino acid type $m$ followed by amino acid type $n$.
- $L_{mn}$: The number of occurrences of the dipeptide $mn$ in the sequence.
- $L$: The total length of the peptide sequence; $L - 1$ is the total number of adjacent dipeptides in the sequence.
- $m, n$: Each one of the 20 standard amino acids.
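Analogously, a minimal DPC sketch (again illustrative, not the iLearn implementation) counts adjacent residue pairs and normalizes by $L - 1$:

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def dpc(sequence: str) -> list[float]:
    """Dipeptide Composition: x(m, n) = L_mn / (L - 1) over all 400 ordered pairs."""
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    for i in range(len(sequence) - 1):
        counts[sequence[i:i + 2]] += 1
    total = len(sequence) - 1  # number of adjacent dipeptides in the sequence
    return [counts[p] / total for p in pairs]

print(len(dpc("FLSLIPKIAGGIASLVKDL")))  # 400
```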
(iii) Pseudo Amino Acid Composition (PAAC):
PAAC [18] extends AAC by incorporating physicochemical properties and sequence-order information of the amino acids. The dimension of the PAAC feature vector used here is 22D (20 amino acid frequencies + 2 sequence-correlated factors):
$
S = [S_1, S_2, \ldots, S_{20}, S_{20+1}, \ldots, S_{20+\lambda}]
$
with
$
S_z = \frac{x_z}{\sum_{j=1}^{20} x_j + w \sum_{k=1}^{\lambda} \theta_k}, \quad (1 \le z \le 20)
$
$
S_z = \frac{w \, \theta_{z-20}}{\sum_{j=1}^{20} x_j + w \sum_{k=1}^{\lambda} \theta_k}, \quad (21 \le z \le 20 + \lambda)
$
$
\theta_\lambda = \frac{1}{L - \lambda} \sum_{m=1}^{L - \lambda} \Theta\big(S(R_m), S(R_{m+\lambda})\big), \quad \lambda < L
$
- $S_z$: The components of the PAAC vector.
- $x_z$: The normalized frequency of the $z$-th amino acid (for $1 \le z \le 20$).
- $w$: The weighting factor, set to 0.05.
- $\lambda$: An integer representing the rank of correlation, set to 2.
- $\theta_k$: The $k$-th sequence-correlated factor, calculated as the average of correlation functions for amino acid pairs separated by $k$ positions.
- $L$: The total length of the peptide sequence.
- $\Theta(S(R_m), S(R_{m+\lambda}))$: A correlation function that quantifies the correlation between the physicochemical properties of the amino acids at positions $m$ and $m + \lambda$.
(iv) Amphiphilic Pseudo Amino Acid Composition (APAAC):
APAAC [19] is a variant of PAAC that specifically incorporates the amphiphilic nature (hydrophobic and hydrophilic properties) of amino acids. The total dimension of the APAAC feature vector used is 24D (20 amino acid frequencies + $2\lambda$ sequence-order factors, with $\lambda$ set to 2):
$
S = [S_1, S_2, \ldots, S_{20}, S_{20+1}, \ldots, S_{20+\lambda}, \ldots, S_{20+2\lambda}]
$
with
$
S_z = \frac{x_z}{\sum_{j=1}^{20} x_j + w \sum_{k=1}^{2\lambda} \tau_k}, \quad (1 \le z \le 20)
$
$
S_z = \frac{w \, \tau_z}{\sum_{j=1}^{20} x_j + w \sum_{k=1}^{2\lambda} \tau_k}, \quad (21 \le z \le 20 + 2\lambda)
$
- $S_z$: The components of the APAAC vector.
- $x_z$: The normalized frequency of amino acid $z$ (for $1 \le z \le 20$).
- $w$: The weighting factor, set to 0.05.
- $\lambda$: An integer, set to 2.
- $\tau_k$: The $k$-th sequence-order factor, capturing amphiphilic information. The sequence-order factors are defined as:
$
\tau_{2\lambda} = \frac{1}{L - \lambda} \sum_{k=1}^{L - \lambda} H^2_{k, k+\lambda}, \qquad \tau_{2\lambda - 1} = \frac{1}{L - \lambda} \sum_{k=1}^{L - \lambda} H^1_{k, k+\lambda}
$
- $L$: The total length of the sequence.
- $H^1_{k, k+\lambda}$ and $H^2_{k, k+\lambda}$: Correlation functions based on specific physicochemical properties (e.g., hydrophobicity and hydrophilicity) between the amino acids at positions $k$ and $k + \lambda$.
(v) Quasi Sequence Order Number (QSON):
QSON [20] establishes a quantitative relationship between a sequence and its properties by encoding sequence-order information and physicochemical properties, producing a 130D vector.
$
Q_z = \frac{x_z}{\sum_{m=1}^{20} x_m + w \sum_{t=1}^{nlag} \tau_t}, \quad z = 1, 2, \ldots, 20
$
- $Q_z$: The $z$-th component of the QSON vector, related to the normalized frequency of amino acid $z$.
- $x_z$: The normalized frequency of amino acid $z$.
- $w$: The weighting factor, set to 0.1.
- $nlag$: The maximum lag value for sequence-order correlation.
- $\tau_t$: The $t$-th sequence-order factor, similar to PAAC but able to incorporate different correlation functions.
(vi) Composition of k-spaced Amino Acid Pairs with Gap (CKSAAGP):
CKSAAGP [21] calculates the frequencies of k-spaced amino acid group pairs, where the gap size $k$ typically ranges from 0 to 5. The total dimension of this feature vector is 100D, consistent with the $5 \times 5 = 25$ group pairs being counted at gap values $k = 0$ through $3$. For example, $k = 0$ counts adjacent pairs, $k = 1$ counts pairs separated by one amino acid, and so on.
(vii) Sequence-Order Coupling Number (SOCN):
SOCN [20] quantifies the dissimilarity between amino acid components using dissimilarity matrices, such as the Schneider-Wrede physicochemical and Grantham chemical distance matrices, generating 90 descriptors.
$
x_m = \sum_{k=1}^{L - m} \left( d_{k, k+m} \right)^2, \quad m = 1, 2, \ldots, nlag
$
- $x_m$: The sequence-order coupling number for a lag of $m$.
- $L$: The total length of the sequence.
- $m$: The separation (lag) between two amino acid positions.
- $d_{k, k+m}$: The dissimilarity between the amino acids at positions $k$ and $k + m$ in the sequence.
- $nlag$: The maximum lag value considered.
(viii) Grouped Tri-peptide Composition (GTPC):
GTPC [21] is a variation of tri-peptide composition in which the 20 amino acids are categorized into five groups based on their physicochemical properties:
- $g_1$: aliphatic
- $g_2$: aromatic
- $g_3$: positive charge
- $g_4$: negative charge
- $g_5$: uncharged

This method results in 125 descriptors ($5^3 = 125$ possible tri-peptides of grouped amino acids):
$
t(x, y, z) = \frac{N_{xyz}}{L - 1}, \quad x, y, z \in \{g_1, g_2, g_3, g_4, g_5\}
$
- $t(x, y, z)$: The frequency of the tri-peptide formed by grouped amino acid types $x$, $y$, $z$.
- $N_{xyz}$: The number of occurrences of the tri-peptide $xyz$ in the sequence.
- $L - 1$: This appears to be a slight error in the formula as given in the paper, since tri-peptide composition is usually normalized by $L - 2$; adhering strictly to the paper's formula, $L - 1$ is used as the denominator.
- $g_1, \ldots, g_5$: The five amino acid groups.

All of these feature encodings were extracted using the iLearn standalone Python package [21].
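As a rough illustration of GTPC, the sketch below uses the five-group partition commonly adopted by iLearn-style tools; the exact residue-to-group assignments are an assumption here (they are not stated in the paper), and the paper's $L - 1$ denominator is kept as written:

```python
from itertools import product

# Assumed group memberships (typical iLearn/iFeature convention; not given in the paper).
GROUPS = {
    "g1": set("GAVLMI"),   # aliphatic
    "g2": set("FYW"),      # aromatic
    "g3": set("KRH"),      # positive charge
    "g4": set("DE"),       # negative charge
    "g5": set("STCPNQ"),   # uncharged
}

def gtpc(sequence: str) -> list[float]:
    """Grouped Tri-peptide Composition: 5^3 = 125 grouped tri-peptide frequencies."""
    to_group = {aa: g for g, members in GROUPS.items() for aa in members}
    keys = ["".join(k) for k in product(GROUPS, repeat=3)]
    counts = dict.fromkeys(keys, 0)
    for i in range(len(sequence) - 2):
        counts["".join(to_group[aa] for aa in sequence[i:i + 3])] += 1
    total = len(sequence) - 1  # denominator as written in the paper (L - 2 is more usual)
    return [counts[k] / total for k in keys]

print(len(gtpc("FLSLIPKIAGGIASLVKDL")))  # 125
```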
4.2.3. SMOTE-Tomek for Dataset Balancing
The benchmark training dataset was imbalanced, consisting of 1451 positive samples and 2339 negative samples. To address this, the SMOTE-Tomek hybrid sampling technique [22] was applied.
- SMOTE (Synthetic Minority Over-sampling Technique): Oversamples the minority class (AIPs) by creating synthetic examples, increasing its representation.
- Tomek Links: Undersamples the majority class (non-AIPs) by removing Tomek links, i.e., pairs of instances from different classes that are each other's nearest neighbors. Removing the majority instance of such a pair helps clarify the decision boundary between classes.

This hybrid approach was applied only to the benchmark training set, so the model learns from a balanced distribution, while the two independent datasets remained unaltered for an unbiased evaluation of generalization (a minimal code sketch of this step follows below).
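A minimal sketch of the balancing step using the imbalanced-learn library; the random arrays merely stand in for the encoded benchmark features and labels:

```python
from collections import Counter
import numpy as np
from imblearn.combine import SMOTETomek

rng = np.random.default_rng(0)
X_train = rng.normal(size=(380, 20))        # placeholder encoded peptides
y_train = np.array([1] * 145 + [0] * 235)   # imbalanced labels (1 = AIP)

# SMOTE oversamples the minority class, then Tomek-linked majority samples are removed.
X_bal, y_bal = SMOTETomek(random_state=42).fit_resample(X_train, y_train)
print("before:", Counter(y_train), "after:", Counter(y_bal))
```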
4.2.4. Baseline Classifiers
Five conventional machine learning classifiers were chosen as base learners for their widespread use and effectiveness in bioinformatics [27-29]:
- Random Forest (RF): An ensemble learning method that constructs many decision trees during training and outputs the class that is the mode of the individual trees' predictions (for classification) or their mean prediction (for regression).
- Light Gradient Boosting Machine (LGBM): A gradient boosting framework using tree-based learning algorithms, designed to be distributed and efficient, often outperforming other gradient boosting algorithms in speed and accuracy.
- Extreme Gradient Boosting (XGB): Another gradient boosting framework known for high performance and flexibility; it implements a parallelized tree boosting algorithm and includes techniques to prevent overfitting.
- Extra Tree Classifier (ETC): An ensemble learning method similar to Random Forest but with an additional layer of randomness: split points for features are chosen at random, which can reduce variance and computation time.
- CatBoost Classifier: A gradient boosting library developed by Yandex, particularly effective with categorical features, which it handles automatically, often yielding good results with default parameters.

To optimize the performance of each baseline model, Optuna [30], a hyperparameter optimization framework, was used. The models were tuned using repeated stratified 5-fold cross-validation on the benchmark dataset for every encoding technique; a sketch of such a tuning loop is shown below.
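A minimal sketch of Optuna tuning for one baseline model; the search space, trial count, and data are illustrative assumptions, not the paper's settings:

```python
import numpy as np
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 20)), rng.integers(0, 2, size=200)  # placeholder data

def objective(trial: optuna.Trial) -> float:
    # Hypothetical search space; the paper does not list the exact ranges.
    model = RandomForestClassifier(
        n_estimators=trial.suggest_int("n_estimators", 100, 1000),
        max_depth=trial.suggest_int("max_depth", 3, 20),
        random_state=42,
    )
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
    return cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print(study.best_params)
```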
4.2.5. Construction of Model IF-AIP
The IF-AIP model is constructed through a five-step process:
- Step 1: Feature Extraction: Eight feature encodings (AAC, DPC, PAAC, APAAC, QSON, SOCN, CKSAAGP, and GTPC) are extracted from the peptide sequences in the dataset.
- Step 2: Hybrid Feature Set Creation: All eight extracted feature vectors are concatenated into a single, comprehensive hybrid feature set, capturing a wide range of peptide characteristics.
- Step 3: Baseline Model Training: The five machine learning classifiers (RF, LGBM, XGB, ETC, and CatBoost) are trained on the eight individual feature descriptors as well as on the hybrid feature set, allowing the performance of individual features and the combined set to be evaluated.
- Step 4: HB-AIP (Hybrid Baseline AIP) Model: The five classifiers trained on the hybrid feature set are integrated using a voting classifier. This initial ensemble model is named HB-AIP; the voting classifier combines the predictions of these base learners to make a final decision, typically by majority vote or by averaging probabilities.
- Step 5: IF-AIP Model: A feature selection algorithm is applied to the hybrid feature set to identify an optimal feature set, removing less informative or redundant features that might reduce model performance. The voting classifier from Step 4 (HB-AIP) is then retrained on this optimal feature set; the refined model is termed IF-AIP and serves as the final predictive model.

The IF-AIP model can be conceptually represented as an aggregation of the base classifiers' predictions:
$
IF\text{-}AIP \approx RF \lor ETC \lor XGB \lor LGBM \lor CatBoost
$
- $IF\text{-}AIP$: The final prediction made by the voting classification model.
- RF, ETC, XGB, LGBM, CatBoost: The predictions made by the individual base classifiers after training on the optimal feature set.
- $\lor$: The fusing operator used to combine the predictions of the individual classifiers (e.g., majority voting or weighted averaging of probabilities). A code sketch of this fusion follows.
4.3. Evaluation Metrics
The performance of the predictive models is evaluated using several widely accepted metrics [31-36]. These metrics provide a comprehensive view of the model's effectiveness, especially in binary classification tasks.
Let's define the components:
- $T_P$ (True Positive): The number of AIPs correctly identified as AIPs.
- $T_N$ (True Negative): The number of non-AIPs correctly identified as non-AIPs.
- $F_P$ (False Positive): The number of non-AIPs incorrectly identified as AIPs.
- $F_N$ (False Negative): The number of AIPs incorrectly identified as non-AIPs.

The metrics are defined as follows:
- Accuracy (Acc):
  - Conceptual Definition: Accuracy measures the overall proportion of correctly classified instances (both positive and negative) out of the total number of instances, giving a general sense of model performance.
  - Mathematical Formula:
$
Acc = \frac{T_P + T_N}{T_P + F_N + T_N + F_P}
$
- Sensitivity (Sn) / Recall:
  - Conceptual Definition: Sensitivity (also known as Recall or True Positive Rate) measures the proportion of actual positive cases (AIPs) that were correctly identified, indicating the model's ability to detect AIPs.
  - Mathematical Formula:
$
Sn = \frac{T_P}{T_P + F_N}
$
- Specificity (Sp):
  - Conceptual Definition: Specificity (also known as True Negative Rate) measures the proportion of actual negative cases (non-AIPs) that were correctly identified.
  - Mathematical Formula:
$
Sp = \frac{T_N}{T_N + F_P}
$
- Matthews Correlation Coefficient (MCC):
  - Conceptual Definition: MCC is a single-value metric that uses all four components of the confusion matrix (TP, TN, FP, FN). It is a balanced measure usable even when the classes differ greatly in size: +1 represents a perfect prediction, 0 a random prediction, and -1 a perfectly inverse prediction.
  - Mathematical Formula:
$
MCC = \frac{(T_P \cdot T_N) - (F_P \cdot F_N)}{\sqrt{(T_P + F_P)(T_P + F_N)(T_N + F_P)(T_N + F_N)}}
$
- Area Under the Curve (AUC):
  - Conceptual Definition: AUC is the area under the Receiver Operating Characteristic (ROC) curve, which plots the True Positive Rate (Sensitivity) against the False Positive Rate (1 - Specificity) at various threshold settings. AUC measures a classifier's overall ability to distinguish positive from negative classes; higher values indicate better discriminative power.
  - Mathematical Note: AUC has no simple closed-form expression like the other metrics; it is obtained by integrating the ROC curve. Conceptually, it is the probability that the classifier ranks a randomly chosen positive instance higher than a randomly chosen negative one. It is a scalar typically ranging from 0 to 1.
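For reference, all five metrics can be computed from labels and predicted probabilities with scikit-learn; the 0.5 decision threshold below is an assumption for illustration:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             matthews_corrcoef, roc_auc_score)

def evaluate(y_true, y_prob, threshold=0.5):
    """Return Acc, Sn, Sp, MCC, and AUC for a binary classifier."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "Acc": accuracy_score(y_true, y_pred),
        "Sn": tp / (tp + fn),   # sensitivity / recall
        "Sp": tn / (tn + fp),   # specificity
        "MCC": matthews_corrcoef(y_true, y_pred),
        "AUC": roc_auc_score(y_true, y_prob),
    }

print(evaluate([1, 1, 0, 0, 1, 0], [0.9, 0.4, 0.2, 0.6, 0.8, 0.1]))
```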
5. Experimental Setup
5.1. Datasets
The study utilized both a benchmark training dataset and two independent test datasets to evaluate the model's performance rigorously.
- Benchmark Training Set:
  - Source: Collected from the iAIPs [15] and AntiInflam [11] papers.
  - Initial Scale: 1962 positive samples (AIPs) and 2896 negative samples (non-AIPs).
  - Processing: Redundancies were removed using CD-HIT with a sequence identity threshold of 0.9.
  - Final Scale: 1451 positive samples and 2339 negative samples.
  - Purpose: Used for training and cross-validation of the machine learning models. The SMOTE-Tomek technique was applied to this set to handle class imbalance.
- Independent Dataset 1:
  - Source: From the iAIPs paper [15].
  - Scale: 420 positive samples and 629 negative samples.
  - Purpose: Used to test the model's generalization ability on unseen data from a previously established dataset.
- Independent Dataset 2:
  - Source: From the AntiInflam paper [11].
  - Scale: 173 positive samples and 253 negative samples.
  - Purpose: Used to further validate the model's generalization ability on another distinct set of unseen data.

These datasets were chosen because they are standard benchmarks in AIP identification research, allowing direct comparison with existing methods. The use of two distinct independent datasets provides a robust assessment of the model's ability to perform well beyond its training data.
5.1.1. Compositional and Positional Analysis
The paper also performed a compositional and positional analysis of the benchmark training dataset to identify characteristic patterns in AIPs and non-AIPs.
The following figure (Fig. 2 from the original paper) shows the compositional and positional analysis of the training dataset:
The figure contains a bar chart and a sequence logo: the bar chart (panel a) shows the average amino acid composition percentages of AIPs versus non-AIPs, while the two-sample logo (panel b) visualizes enriched and depleted amino acids, showing the positional distribution of different residues.
- Compositional Analysis (Fig. 2a): This analysis compares the average composition percentages of amino acids in AIPs versus non-AIPs.
  - AIPs overrepresent Ile (I), Lys (K), Leu (L), Arg (R), and Ser (S).
  - Non-AIPs are dominant in Ala (A), Asp (D), Gly (G), Pro (P), Thr (T), and Val (V).
- Positional Preference Analysis (Fig. 2b): This analysis, generated using a two-sample-logo server [37], shows amino acid enrichment or depletion at specific positions within the peptide sequences, with logo heights scaled according to a t-test.
  - In AIPs, Ser (S) is dominant at positions 2 and 12, while Leu (L) is dominant at positions 5, 6, 7, 10, 11, and 15.
  - In non-AIPs, Thr (T) is dominant at positions 3, 7, and 14, and Asp (D) is dominant at positions 4, 5, 10, 13, and 15.

These findings reveal significant differences in amino acid preferences and positional distribution between AIPs and non-AIPs, discriminative characteristics that machine learning models can learn in order to distinguish the two classes.
5.2. Evaluation Metrics
As described in the Methodology section, the following metrics were used to evaluate the model's performance:
- Accuracy (Acc)
- Sensitivity (Sn)
- Specificity (Sp)
- Matthews Correlation Coefficient (MCC)
- Area Under the Curve (AUC)

These metrics provide a robust and comprehensive assessment of the model's predictive capabilities, covering overall correctness, the true positive rate, the true negative rate, and balanced performance, which is particularly important for imbalanced datasets.
5.3. Baselines
The IF-AIP model was compared against several types of baselines:
- Individual Baseline Classifiers: The five machine learning classifiers (RF, LGBM, XGB, ETC, CatBoost) trained on single feature encodings or the hybrid feature set serve as internal baselines, demonstrating the benefits of feature fusion and ensemble learning.
- HB-AIP Model: The HB-AIP model, i.e., the voting classifier trained on the hybrid feature set before feature selection, acts as a direct baseline showing the improvement gained by the optimal feature selection step in forming IF-AIP.
- Existing State-of-the-Art Methods: For external validation and benchmarking against the current literature, IF-AIP was compared with the published methods AIPpred [12], PreAIP [13], AIEPred [14], iAIPs [15], and AntiInflam [11]. These are recent and relevant computational models for AIP identification, allowing the authors to demonstrate IF-AIP's competitive advantage.
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Performance of Baseline Models on Raw Features
The study first evaluated the performance of 45 baseline models (5 classifiers x 8 individual feature encodings + 5 classifiers x 1 hybrid feature set).
The following figure (Fig. 3 from the original paper) shows the performance of the top 10 baseline models:
The figure presents bar charts comparing models on cross-validation performance: panel (a) shows cross-validation accuracy and panel (b) shows cross-validation MCC scores for the different models.
- Cross-validation on Benchmark Dataset (Supplementary Table S2, Fig. 3a, 3b):
  - The LGBM classifier generally showed the best discriminative capability.
  - The hybrid feature set consistently yielded the best performance for all baseline classifiers, achieving cross-validation accuracies in the range of 78.1%–80.1% and MCC scores up to 0.605 (for LGBM).
  - Among individual feature encodings, AAC and CKSAAGP were the next best performers, with cross-validation accuracies of 76.1%–78.6% for AAC and 75.2%–78.3% for CKSAAGP.
  - This indicates that combining diverse features is more effective than using individual features, and that AAC and CKSAAGP capture particularly relevant information.
- Independent Datasets (Supplementary Tables S3 & S4, Fig. 3c-3f): The trend observed on the benchmark dataset generally held on the independent datasets; the hybrid feature set and individual features such as AAC and CKSAAGP continued to perform well.

The following are the results from Table 2 of the original paper (the first five metric columns refer to independent dataset 1, the last five to independent dataset 2):

| Descriptors | Acc | Sn | Sp | MCC | AUC | Acc | Sn | Sp | MCC | AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| AAC | 76.4 | 58.0 | 83.7 | 0.428 | 80.0 | 74.8 | 69.9 | 76.5 | 0.462 | 82.9 |
| DPC | 77.5 | 61.4 | 86.7 | 0.497 | 81.6 | 78.9 | 78.0 | 82.9 | 0.577 | 85.2 |
| PAAC | 73.5 | 53.8 | 82.9 | 0.379 | 75.8 | 76.5 | 65.3 | 84.1 | 0.506 | 83.2 |
| APAAC | 75.8 | 63.0 | 81.9 | 0.449 | 77.6 | 77.6 | 65.3 | 86.1 | 0.530 | 83.8 |
| SOCN | 75.9 | 51.9 | 85.9 | 0.401 | 76.7 | 77.6 | 84.3 | 73.0 | 0.564 | 85.4 |
| QSON | 77.8 | 75.7 | 83.8 | 0.524 | 82.4 | 79.0 | 83.8 | 74.7 | 0.585 | 87.2 |
| CKSAAGP | 77.7 | 78.9 | 78.9 | 0.512 | 82.8 | 77.6 | 78.0 | 77.3 | 0.547 | 86.6 |
| GTPC | 77.3 | 64.1 | 83.0 | 0.455 | 79.9 | 75.7 | 75.1 | 76.1 | 0.507 | 83.7 |
| Hybrid | 78.7 | 69.5 | 81.6 | 0.501 | 83.9 | 76.3 | 76.9 | 74.2 | 0.528 | 84.4 |
6.1.2. Performance of the Voting Classifiers and HB-AIP Method
The HB-AIP model, which is the voting classifier integrating the five baseline models trained on the hybrid feature set, showed improved performance over individual models.
- Benchmark Dataset (Supplementary Table S5): HB-AIP achieved the highest cross-validation accuracy of 80.2% and an MCC score of 0.606, superior to voting-classifier models trained on single descriptors, which ranged from 72.7%–78.5% in accuracy and 0.458–0.571 in MCC.
- Independent Dataset 1 (Table 2): HB-AIP achieved an accuracy of 78.7% and an MCC score of 0.501.
- Independent Dataset 2 (Table 2): HB-AIP achieved an accuracy of 76.3% and an MCC score of 0.528.

These results confirm the benefit of ensemble learning and the multi-feature fusion approach in HB-AIP.
6.1.3. Effect of the Optimal Feature Selection on the Performance of HB-AIP Method
To further enhance performance, a feature selection algorithm was applied. The baseline models trained on PAAC, APAAC, and SOCN encodings showed relatively poor performance compared to others. These three feature types were excluded, and the remaining five feature encodings (AAC, DPC, QSON, CKSAAGP, and GTPC) were concatenated to form the optimal feature set (OFs). The HB-AIP voting classifier was then retrained on this OFs, resulting in the final IF-AIP model.
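The dimensional bookkeeping of this step can be sketched as follows, with random matrices standing in for the real encodings (the matrix sizes follow the per-encoding dimensions given earlier in the paper):

```python
import numpy as np

# Per-encoding dimensions as used in the paper; matrices are placeholders.
dims = {"AAC": 20, "DPC": 400, "PAAC": 22, "APAAC": 24, "QSON": 130,
        "SOCN": 90, "CKSAAGP": 100, "GTPC": 125}
feats = {name: np.random.rand(50, d) for name, d in dims.items()}

hybrid = np.hstack([feats[k] for k in dims])        # 911-D hybrid set (HB-AIP)
keep = ["AAC", "DPC", "QSON", "CKSAAGP", "GTPC"]    # PAAC, APAAC, SOCN dropped
optimal = np.hstack([feats[k] for k in keep])       # 775-D optimal set (IF-AIP)
print(hybrid.shape, optimal.shape)                  # (50, 911) (50, 775)
```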
The following are the results from Table 3 of the original paper:
| Dataset | Method | Number of features | Acc | Sn | Sp | MCC | AUC |
|---|---|---|---|---|---|---|---|
| Benchmark dataset | HB-AIP | 911 | 80.2 | 80.1 | 80.5 | 0.606 | 88.6 |
| Benchmark dataset | IF-AIP | 775 | 81.0 | 79.9 | 82.1 | 0.621 | 89.2 |
| Independent dataset 1 | HB-AIP | 911 | 78.7 | 69.5 | 81.6 | 0.501 | 83.9 |
| Independent dataset 1 | IF-AIP | 775 | 80.0 | 69.0 | 87.4 | 0.579 | 87.3 |
| Independent dataset 2 | HB-AIP | 911 | 76.3 | 76.9 | 74.2 | 0.528 | 84.4 |
| Independent dataset 2 | IF-AIP | 775 | 77.7 | 80.3 | 74.2 | 0.536 | 87.1 |
- Benchmark Dataset: IF-AIP (775 features) achieved an accuracy of 81.0% and an MCC score of 0.621, an improvement of 0.8% in accuracy and 1.5% in MCC over HB-AIP (911 features).
- Independent Dataset 1: IF-AIP showed notable improvements, with an accuracy of 80.0% and an MCC score of 0.579, corresponding to a 1.3% increase in accuracy and a substantial 7.8% increase in MCC compared to HB-AIP.
- Independent Dataset 2: IF-AIP also slightly outperformed HB-AIP, with a 1.4% increase in accuracy (77.7%) and a 0.8% increase in MCC (0.536).

This demonstrates that feature selection effectively pruned less relevant features, leading to a more streamlined and accurate model.
6.1.4. Performance Comparison of IF-AIP Model with the Existing Methods
The IF-AIP model was rigorously compared against existing state-of-the-art methods using the two independent datasets.
The following are the results from Table 4 of the original paper:
| Dataset | Method | Work | Acc | Sn | Sp | MCC | AUC |
|---|---|---|---|---|---|---|---|
| Independent test 1 | AIPpred | Manvalan2018 | 74.4 | 74.1 | 74.6 | 0.479 | 81.4 |
| Independent test 1 | PreAIP | Khatun2019 | 77.0 | 61.8 | 87.1 | 0.512 | 84.0 |
| Independent test 1 | AIEPred | Zhang2020 | 76.2 | 55.5 | 89.9 | 0.497 | 76.7 |
| Independent test 1 | iAIPs | Zhao2021 | 75.1 | 56.7 | 87.4 | 0.471 | 82.2 |
| Independent test 1 | HB-AIP | Our work | 78.7 | 69.5 | 81.6 | 0.501 | 83.9 |
| Independent test 1 | IF-AIP | Our work | 80.0 | 69.0 | 87.4 | 0.579 | 87.3 |
| Independent test 2 | AntiInflam | Gupta2017 | 72.0 | 78.6 | 67.4 | 0.450 | – |
| Independent test 2 | AIEPred | Zhang2020 | 74.8 | 52.3 | 88.3 | 0.453 | – |
| Independent test 2 | HB-AIP | Our work | 76.3 | 76.9 | 74.2 | 0.528 | 84.4 |
| Independent test 2 | IF-AIP | Our work | 77.7 | 80.3 | 74.2 | 0.536 | 87.1 |
- Independent Dataset 1:
  - IF-AIP achieved an accuracy of 80.0%, an MCC score of 0.579, and an AUC of 87.3%.
  - Compared to existing methods (AIPpred, PreAIP, AIEPred, and iAIPs), IF-AIP showed a significant improvement of 3%–5.6% in accuracy and 6.7%–10.8% in MCC; its AUC score was also 3.3%–10.6% higher.
- Independent Dataset 2:
  - IF-AIP achieved an accuracy of 77.7%, an MCC score of 0.536, and an AUC of 87.1%.
  - Compared to AntiInflam and AIEPred, IF-AIP demonstrated 2.9%–5.7% higher accuracy and an 8.3%–8.6% higher MCC score. (AUC values for AntiInflam and AIEPred were not reported in their respective papers.)

These comparisons establish IF-AIP as a superior method, offering better and more consistent predictive performance across different independent datasets.
6.1.5. Case Studies
To further validate the efficacy, robustness, and generalization ability of IF-AIP, the model was tested on 24 experimentally validated anti-inflammatory peptide sequences downloaded from Peplab [38] and UniProt [39]. These sequences were strictly novel and not used in the training or testing phases; redundancy with the benchmark training set was checked using CD-HIT.
The IF-AIP model was compared against PreAIP [13], as online servers for the other existing methods were unavailable or not working.
The following are the results from Table 5 of the original paper:
| Sequences | IF-AIP Score | IF-AIP Prediction | PreAIP Score | PreAIP Prediction |
|---|---|---|---|---|
| ELRLPEIARPVPEVLPARLPLPALPRNKMAKNQ | 0.875 | AIP | 0.625 | AIP |
| MAPRGFSCLLLLTSEIDLPVKRRA | 0.828 | AIP | 0.585 | AIP |
| FLSLIPHIATGIAALAKHL | 0.826 | AIP | 0.592 | AIP |
| DTEAR | 0.826 | AIP | 0.283 | Non-AIP |
| FLSLIPKIAGGIASLVKDL | 0.821 | AIP | 0.588 | AIP |
| FLSLIPKIAGGIASLVKNL | 0.819 | AIP | 0.615 | AIP |
| FFSMIPKIATGIASLVKDL | 0.810 | AIP | 0.552 | AIP |
| FFSMIPKIATGIASLVKNL | 0.800 | AIP | 0.577 | AIP |
| LLGMIPVAITAISALSKL | 0.774 | AIP | 0.593 | AIP |
| KGHYAERVG | 0.759 | AIP | 0.417 | Non-AIP |
| NSPGPHDVALDQ | 0.758 | AIP | 0.400 | Non-AIP |
| FIGMIPGLIGGLISAIK | 0.754 | AIP | 0.626 | AIP |
| GLVNGLLSSVLGGQGGGGLLGGIL | 0.748 | AIP | 0.527 | AIP |
| HDMNKVLDL | 0.744 | AIP | 0.457 | Non-AIP |
| RMVLPEYELLYE | 0.736 | AIP | 0.513 | AIP |
| MRWQEMGYIFYPRKLR | 0.723 | AIP | 0.525 | AIP |
| KPVAAP | 0.696 | AIP | 0.298 | Non-AIP |
| FDLIYSV | 0.687 | AIP | 0.463 | Non-AIP |
| GLVSGLLNSVTGLLGNLAGGGL | 0.673 | AIP | 0.569 | AIP |
| AAFAATY | 0.653 | AIP | 0.298 | Non-AIP |
| GPETAFLR | 0.634 | AIP | 0.481 | Non-AIP |
| GKWMSLLKHILK | 0.553 | AIP | 0.636 | AIP |
| KIPYIL | 0.546 | AIP | 0.343 | Non-AIP |
| APTLW | 0.511 | AIP | 0.328 | Non-AIP |
- IF-AIP Performance: IF-AIP correctly identified all 24 novel peptides as AIPs.
- PreAIP Performance: PreAIP correctly identified only 14 of the 24 peptides, misclassifying 10 as non-AIPs.

This result is a strong indicator of IF-AIP's excellent generalization ability and robustness, demonstrating its capability to accurately predict novel AIPs even when the sequences are completely new to the model.
6.2. Ablation Studies / Parameter Analysis
The transition from HB-AIP to IF-AIP constitutes a form of ablation study focused on feature selection.
- HB-AIP utilized a hybrid feature set comprising all 8 feature encodings, 911 dimensions in total (20 AAC + 400 DPC + 22 PAAC + 24 APAAC + 130 QSON + 90 SOCN + 100 CKSAAGP + 125 GTPC).
- IF-AIP was constructed by removing the PAAC, APAAC, and SOCN features, whose performance with the baseline classifiers was relatively poorer. The remaining 5 feature encodings (AAC, DPC, QSON, CKSAAGP, and GTPC) form the 775-dimensional optimal feature set (20 + 400 + 130 + 100 + 125 = 775).

As shown in Table 3, this feature selection step led to consistent improvements:
- On the benchmark dataset, IF-AIP improved accuracy by 0.8% and MCC by 1.5% over HB-AIP.
- On independent dataset 1, accuracy increased by 1.3% and MCC by 7.8%.
- On independent dataset 2, accuracy increased by 1.4% and MCC by 0.8%.

This analysis confirms that careful feature selection (i.e., ablating less informative features) contributes significantly to the predictive performance and efficiency of the IF-AIP model, making it more compact and accurate. The use of Optuna for hyperparameter optimization also played a role in tuning each baseline model for optimal performance.
7. Conclusion & Reflections
7.1. Conclusion Summary
This study successfully developed IF-AIP, a novel machine learning method for the accurate identification of anti-inflammatory peptides from their sequences. The model's strength lies in its multi-feature fusion strategy, which combines eight diverse peptide descriptors, followed by a rigorous feature selection process to derive an optimal feature set. This refined feature set is then fed into an ensemble voting classifier that integrates five different machine learning algorithms.
IF-AIP demonstrated superior performance, achieving a cross-validation accuracy of 81.0% and an MCC score of 0.621 on the benchmark dataset. Crucially, it outperformed existing methods on two independent datasets, with improvements of 2.9%–5.7% in accuracy and 6.7%–10.8% in MCC. Its exceptional generalization ability was further confirmed in a case study where it correctly identified all 24 novel anti-inflammatory peptides, whereas the comparative model (PreAIP) misclassified 10 of them. The source code and datasets are publicly available, promoting reproducibility and further research.
7.2. Limitations & Future Work
The authors acknowledge several areas for future improvement:
- Deep Learning: Deep learning approaches could further enhance the performance of AIP predictive models, implying that the current classical machine-learning ensemble may still have headroom through the hierarchical feature learning capabilities of neural networks.
- Data Curation (Positive Samples): A significant limitation is the relatively low number of positive AIP samples currently available online. Data curation remains crucial, as increasing the quantity and quality of AIP data could substantially improve model training and generalization.
- New Feature Representations: Exploring new feature representations beyond the eight used in this study could play a vital role in further enhancing AIP identification, suggesting there may be undiscovered physicochemical properties or sequence-order patterns that, when encoded, would add discriminative power.
7.3. Personal Insights & Critique
The IF-AIP paper presents a robust and well-executed machine learning approach to an important bioinformatics problem. The comprehensive feature engineering using eight different descriptors is a strong point, as it captures a wide range of peptide characteristics. The deliberate use of a hybrid feature set followed by feature selection is particularly effective, demonstrating a systematic approach to optimizing input features rather than simply throwing all features at the model. The ensemble voting classifier further enhances reliability, which is a common best practice in machine learning for improving robustness and reducing variance.
The rigorous evaluation on two separate independent datasets and a challenging case study with novel peptides is commendable and lends strong credibility to the model's generalization ability. The clear performance gains over existing methods underscore the value of the proposed multi-feature fusion and ensemble strategy.
Potential Issues/Areas for Improvement (beyond what authors noted):
- Interpretability: While the model performs well, its multi-feature fusion and ensemble nature make it a "black box." Understanding why specific peptides are classified as AIPs (e.g., which features are most influential, or which amino acids or patterns contribute most) would be valuable for drug design; model interpretability tools such as SHAP or LIME could be integrated.
- Computational Cost: A model built on eight feature encodings and five ensemble classifiers, even after feature selection, may be more computationally intensive to train and run than simpler models. Though not explicitly discussed, this could matter in very high-throughput or resource-constrained settings.
- Specificity of Feature Selection: The paper describes deriving the optimal feature set by removing encodings whose baseline models performed below average accuracy, but it does not name the feature selection algorithm beyond this description. A more detailed account (e.g., Recursive Feature Elimination, L1 regularization) would give deeper insight into the selection process.
- Dataset Diversity: Although a larger dataset was used, its sources are still prior computational studies. As the authors note, more diverse, experimentally validated AIPs are needed; the current datasets may carry biases from their original collection methods.

The methods and conclusions of this paper could potentially transfer to other peptide classification tasks, such as identifying antimicrobial, cell-penetrating, or antioxidant peptides. The general framework of multi-feature fusion with ensemble learning and feature selection is a powerful paradigm applicable across bioinformatics prediction problems where sequences must be characterized numerically. The success of IF-AIP reinforces the idea that comprehensive feature engineering coupled with robust ensemble methods can yield significant advances in bioinformatics.