Development of a machine learning-based predictor for identifying and discovering antioxidant peptides based on a new strategy

Published:07/21/2021

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

A machine learning model using pseudo-amino acid composition and motifs predicted antioxidant peptides with high accuracy (AUC 0.939). Experimental validation confirmed the QCQ peptide's strong antioxidant activity, offering an effective tool to identify functional antioxidant peptides.

Abstract

Food Control 131 (2022) 108439. Available online 22 July 2021. 0956-7135/© 2021 Elsevier Ltd. All rights reserved.

Development of a machine learning-based predictor for identifying and discovering antioxidant peptides based on a new strategy

Yong Shen, Chunmei Liu, Kunmei Chi, Qian Gao, Xue Bai, Ying Xu, Na Guo*

College of Food Science and Engineering, Jilin University, Changchun, 130062, China

Keywords: Machine learning; SVM; Hybrid model; Multifunctional peptide; Antioxidant peptide

Abstract: It is necessary to solve the problem of food corruption and oxidation to improve food quality. Peptides are a good candidate to solve the above problems. In this paper, a machine learning method was used to construct an antioxidant peptide classification model based on the pseudo-amino acid composition and motifs of peptides as input features. The AUC of PseAAC-dipeptide-motif hybrid model is 0.939 and the average precision score is 0.947, which is the best among all models in this paper. Besides, the classification threshold has been increased to make the model precision above 0.95. Then, the model was used as predictor to discover potential antioxidant pe

1. Bibliographic Information

1.1. Title

Development of a machine learning-based predictor for identifying and discovering antioxidant peptides based on a new strategy

1.2. Authors

Yong Shen, Chunmei Liu, Kunmei Chi, Qian Gao, Xue Bai, Ying Xu, Na Guo

1.3. Journal/Conference

The paper was published in Food Control, a reputable journal focusing on food safety and quality, which aligns well with the paper's application in preventing food spoilage and oxidation. The journal is known for publishing research on food microbiology, chemical contaminants, analytical methods, and quality assurance.

1.4. Publication Year

The paper became available online on 22 July 2021 (DOI 10.1016/j.foodcont.2021.108439) and is assigned to Food Control volume 131 (2022).

1.5. Abstract

This paper addresses the issue of food spoilage due to oxidation by proposing a machine learning-based method to identify and discover antioxidant peptides. The researchers constructed an antioxidant peptide classification model using pseudo-amino acid composition (PseAAC) and peptide motifs as input features. The PseAAC-dipeptide-motif hybrid model achieved an AUC of 0.939 and an average precision score of 0.947, outperforming other models presented in the study. To enhance reliability, the classification threshold was adjusted to ensure a precision above 0.95. This optimized model was then used to screen a random peptide dataset, leading to the identification of 5 potential antioxidant peptides (PSGK, LKPQ, GRP, QCQ, QGM). Experimental validation confirmed the strong antioxidant properties of QCQ, with a total antioxidant capacity (T-AOC) of $9.59 \text{ U/mg prot}$ and a DPPH radical-scavenging activity of $95.52\%$ at $125 \mu\text{g/mL}$ . The predictor can also be applied to discover multifunctional peptides with antioxidant capabilities. Overall, the developed predictor is presented as an effective tool for discovering peptides with antioxidant functions.

1.6. Original Source Link

/files/papers/6910a0b25d12d02a6339cf92/paper.pdf This link points to a PDF copy of the paper, which is formally published in Food Control.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is food spoilage and oxidation, which lead to significant economic losses and potential foodborne illnesses. Traditional chemical additives used to prevent oxidation and microbial growth can raise health concerns or contribute to bacterial resistance. Peptides, particularly those with antioxidant properties, are presented as promising natural alternatives to address these issues.

The importance of this problem stems from the global demand for food quality and safety. Manual, laboratory-based identification and characterization of functional peptides from the vast number of discovered peptides are time-consuming and costly. This creates a critical gap where a rapid, efficient screening method is needed to leverage the potential of peptides.

The paper's entry point is the application of machine learning (ML) to overcome the limitations of traditional methods. By building a predictive model, the researchers aim to quickly screen for antioxidant peptides and, innovatively, to discover multifunctional peptides that can tackle both oxidation and microbial contamination, thereby ensuring comprehensive food safety.

2.2. Main Contributions / Findings

The paper makes several significant contributions and findings:

Development of an effective ML-based Predictor: The study successfully constructed a machine learning model for classifying antioxidant peptides. By integrating pseudo-amino acid composition (PseAAC) and motif features, the PseAAC-dipeptide-motif hybrid model demonstrated superior performance with an AUC of 0.939 and an average precision score of 0.947. This model provides a powerful tool for in silico screening.
Precision-Oriented Model Optimization: Recognizing the practical need for highly reliable predictions in discovery, the authors optimized the model by adjusting its classification threshold to achieve a precision greater than 0.95. This ensures that a high proportion of predicted antioxidant peptides are indeed active, reducing false positives in downstream experimental validation.
Discovery and Experimental Validation of Novel Antioxidant Peptides: The developed predictor was applied to a random peptide dataset, leading to the identification of 254 potential antioxidant peptides. Five of these (PSGK, LKPQ, GRP, QCQ, QGM) were synthesized and experimentally validated. Notably, QCQ exhibited strong antioxidant activity, with a T-AOC value of $9.59 \text{ U/mg prot}$ and a DPPH scavenging activity of $95.52\%$ at $125 \mu\text{g/mL}$ , confirming the predictor's efficacy in discovering novel bioactive peptides.
New Strategy for Multifunctional Peptide Discovery: The paper introduces a novel strategy to identify peptides with multiple beneficial functions. By using the antioxidant peptide predictor on existing databases of peptides known for other functions (e.g., antibacterial, antifungal), the study successfully identified several multifunctional peptides (e.g., peptides with both antibacterial and antioxidant properties). This addresses a critical need in food preservation, where integrated solutions are often required.
Demonstrated Applicability in Food Industry: The research provides a practical, efficient, and cost-effective approach for discovering peptides relevant to the food industry, offering a means to improve food quality and safety by targeting both oxidation and microbial spoilage.

3.1. Foundational Concepts

To fully understand this paper, a reader should be familiar with the following concepts:

Antioxidant Peptides: These are short sequences of amino acids that can neutralize free radicals and prevent oxidative damage. They are of interest in food preservation and health due to their natural origin and potential for various biological activities. Their antioxidant properties often stem from specific amino acid residues (like His, Tyr, Trp) or their overall sequence structure.
Machine Learning (ML): A branch of artificial intelligence that enables systems to learn from data without being explicitly programmed. In bioinformatics, ML is used to build predictive models for various biological phenomena, such as peptide function, protein structure prediction, and disease diagnosis.
Classification Model: A type of machine learning model that predicts a categorical output. In this paper, the model classifies peptides into two categories: "antioxidant" or "non-antioxidant."
Support Vector Machines (SVM): A supervised machine learning algorithm used for classification and regression tasks. SVM works by finding an optimal hyperplane that best separates data points of different classes in a high-dimensional space.
- Kernel Functions: SVMs can handle non-linear decision boundaries by mapping data into a higher-dimensional space using kernel functions. Common kernels include:
  - Linear Kernel: For linearly separable data.
  - Polynomial Kernel: For non-linear data, using polynomial features.
  - Sigmoid Kernel: Based on the hyperbolic tangent function.
  - Radial Basis Function (RBF) Kernel (Gaussian Kernel): A popular choice for non-linear data, it measures the similarity between data points. Its formula is $K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma ||\mathbf{x}_i - \mathbf{x}_j||^2)$ , where $\gamma$ is the gamma parameter that defines the influence of a single training example (small $\gamma$ means large influence, large $\gamma$ means small influence), and $||\mathbf{x}_i - \mathbf{x}_j||^2$ is the squared Euclidean distance between two data points.
- Hyperparameters C and Gamma:
  - C (Regularization Parameter): Controls the trade-off between achieving a low training error and a low testing error (i.e., between maximizing the margin and minimizing misclassification). A large C value means a smaller margin and fewer misclassifications (potential overfitting), while a small C value means a larger margin and more misclassifications (potential underfitting).
  - Gamma ( $\gamma$ ): A parameter for non-linear kernels like RBF, it defines how far the influence of a single training example reaches. Small $\gamma$ means a large influence radius, leading to smoother decision boundaries (potential underfitting). Large $\gamma$ means a small influence radius, leading to more complex decision boundaries (potential overfitting).
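To make the role of $\gamma$ concrete, the kernel formula above can be evaluated directly. This is a minimal sketch in plain Python; the two example vectors and the $\gamma$ values are illustrative and not taken from the paper.

```python
import math

def rbf_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

x = [0.0, 0.0]
y = [3.0, 4.0]  # squared Euclidean distance to x is 25

# Small gamma: distant points still look similar (wide influence radius).
k_small_gamma = rbf_kernel(x, y, gamma=0.001)
# Large gamma: similarity collapses towards 0 (narrow influence radius).
k_large_gamma = rbf_kernel(x, y, gamma=1.0)
```

With the small $\gamma$ the two points remain nearly as similar as identical points, while with the large $\gamma$ they are treated as essentially unrelated, which is why large $\gamma$ produces more intricate decision boundaries.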
Pseudo-Amino Acid Composition (PseAAC): A feature extraction method proposed by K.C. Chou for representing peptide or protein sequences of varying lengths as fixed-length numerical vectors. This is crucial for machine learning algorithms, which typically require fixed-size inputs. PseAAC not only considers the simple amino acid composition but also incorporates sequence-order information and physicochemical properties of amino acids (e.g., hydrophobicity, hydrophilicity, mass). The paper uses three types:
- PseAAC-type 1 (Parallel-correlation type): Combines amino acid composition with a correlation factor that captures the physicochemical properties of adjacent amino acids. It generates $20 + \lambda$ discrete numbers, where 20 represents the frequencies of the 20 standard amino acids, and $\lambda$ represents the correlation factors based on selected physicochemical properties.
- PseAAC-type 2 (Series-correlation type): Similar to Type 1 but incorporates more complex sequence-order correlation modes, generating $20 + i\lambda$ discrete numbers, where $i$ is the number of amino acid attributes considered.
- PseAAC-dipeptide: Represents the composition of all possible dipeptides (pairs of adjacent amino acids) in a sequence. Since there are 20 standard amino acids, there are $20 \times 20 = 400$ possible dipeptides. This method typically results in a 400-dimensional vector (or 420-D in some implementations that might include additional features or normalization, as stated in the paper).
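As an illustration of the dipeptide idea, the sketch below computes a plain 400-dimensional dipeptide frequency vector. It is not the PseAAC web server's implementation (which returns 420 values); the function name and the fixed amino acid ordering are assumptions made for this example, and sequences are assumed to contain only the 20 standard residues.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 ordered pairs, in a fixed order.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]

def dipeptide_composition(seq):
    """Return a 400-dimensional vector of adjacent-pair frequencies."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    total = len(pairs)
    counts = {dp: 0 for dp in DIPEPTIDES}
    for pair in pairs:
        counts[pair] += 1
    return [counts[dp] / total if total else 0.0 for dp in DIPEPTIDES]

vec = dipeptide_composition("QCQ")  # adjacent pairs: QC and CQ
```

Whatever the peptide length, the output has the same dimensionality, which is the property that makes such encodings usable as classifier input.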
Sequence Motifs: Short, recurring patterns in peptide or protein sequences that are often associated with specific biological functions or structural characteristics. Identifying motifs can provide functional insights and serve as discriminative features for classification.
10-fold Cross-validation: A resampling procedure used to evaluate machine learning models on a limited data sample. The dataset is divided into 10 "folds." The model is trained on 9 folds and validated on the remaining fold. This process is repeated 10 times, with each fold serving as the validation set exactly once. The results are then averaged to provide a more robust estimate of model performance and reduce bias from a single train/test split.
DPPH Radical-Scavenging Activity: A common in vitro assay used to assess the antioxidant capacity of a substance. DPPH (2,2-diphenyl-1-picrylhydrazyl) is a stable free radical that, when exposed to an antioxidant, changes color from purple to yellow, and this change can be measured spectrophotometrically. A higher scavenging activity indicates stronger antioxidant properties.
Total Antioxidant Capacity (T-AOC): A broader measure of antioxidant activity that assesses the overall ability of a substance to neutralize various types of free radicals or reactive oxygen species, often determined using commercial kits that quantify different antioxidant mechanisms.

3.2. Previous Works

The paper contextualizes its work by citing various applications of machine learning in peptide function prediction:

Anticancer Peptides: Manavalan et al. (2017) used SVM and Random Forest. Grisoni et al. (2018) employed recurrent neural networks.
Anti-inflammatory Peptides: Gupta et al. (2017) used ML to classify anti-inflammatory epitopes.
Antimicrobial Peptides: Liu et al. (2018), Meher et al. (2017), and others developed prediction methods.
Antifungal Peptides: Agrawal et al. (2018) contributed to this area.
Antibiofilm Peptides: Haney et al. (2018) and Sharma et al. (2016) explored this.

Specific models for antioxidant peptide prediction are also mentioned:
AOPs-SVM (Meng et al., 2019): This model used sequence features and SVM. The paper notes that AOPs-SVM had higher specificity and accuracy but lower sensitivity, MCC, and AUC than the predictor developed in this study.
Butt et al. (2019) model: Incorporated statistical moments into Chou's PseAAC. This model also had higher specificity, accuracy, and MCC but lower sensitivity and precision than the current paper's predictor.
AnOxPePred (Olsen et al., 2020): A deep learning-based tool for antioxidant peptides. Its MCC and AUC values were reported as lower than the predictor developed in this study.

Differentiation Analysis: Compared to existing methods, this paper's core innovations and differences are:

Hybrid Feature Strategy: The paper combines PseAAC-dipeptide with motif features, which it demonstrates leads to a more robust and accurate model than using either feature type alone or other PseAAC variations. This hybrid approach aims to capture both general physicochemical properties and specific functional patterns.
Precision-Centric Optimization: Instead of solely maximizing overall performance metrics like AUC, the paper explicitly prioritizes achieving a high precision (above 0.95) by adjusting the classification threshold. This pragmatic approach is highly relevant for discovery tasks where minimizing false positives in experimental validation is crucial for cost and time efficiency.
Multifunctional Peptide Discovery Strategy: The most significant innovation is the "new strategy" for discovering multifunctional peptides. This involves using the trained antioxidant predictor to screen existing peptide databases (like APD3) that contain peptides with other known functions (e.g., antibacterial, antiviral). This allows for the in silico identification of peptides that possess multiple desirable properties, directly addressing complex real-world problems like ensuring comprehensive food safety (simultaneously tackling oxidation and microbial spoilage). This goes beyond merely predicting a single function for unknown peptides.
Experimental Validation: The paper goes beyond in silico prediction by synthesizing and experimentally validating the antioxidant activity of selected peptides, providing concrete evidence of the predictor's practical utility.

3.3. Technological Evolution

The field of peptide discovery has evolved from laborious wet-lab experiments to increasingly sophisticated computational approaches. Initially, peptides were isolated and characterized one by one. With the advent of high-throughput sequencing and large peptide databases, the volume of peptide data exploded, making manual analysis impractical. This spurred the development of computational tools.

The evolution saw:

Rule-based and Physiochemical Property-based Prediction: Early methods often relied on predefined rules or simple aggregations of amino acid properties.
Statistical Methods: Introduction of statistical models to identify patterns.
Classical Machine Learning: Algorithms like SVM, Logistic Regression, and K-Nearest Neighbors became prominent for their ability to learn complex relationships from structured feature data (e.g., PseAAC, amino acid composition). This paper fits squarely within this stage, demonstrating the power of carefully selected features and algorithms.
Deep Learning: More recently, deep learning models (e.g., recurrent neural networks, convolutional neural networks) have emerged, capable of learning features directly from raw sequence data, potentially capturing more intricate patterns. Models like AnOxPePred (Olsen et al., 2020) represent this trend.

This paper's work fits within the classical machine learning paradigm, emphasizing feature engineering (PseAAC, motifs) and robust model selection (SVM). While deep learning offers new avenues, this study shows that well-designed classical ML approaches, particularly with effective feature engineering and strategic model application (like the multifunctional peptide discovery), remain highly competitive and valuable, especially when experimental validation is integral.

4. Methodology

The paper outlines a comprehensive methodology for developing and applying a machine learning-based predictor for antioxidant peptides, culminating in a strategy for discovering multifunctional peptides. The process involves dataset collection, feature extraction, model building and optimization, performance evaluation, experimental validation, and a novel application strategy.

4.1. Principles

The core idea behind the method is to leverage machine learning to identify patterns within known antioxidant and non-antioxidant peptide sequences. By transforming variable-length peptide sequences into fixed-length numerical feature vectors that capture both their amino acid composition and physicochemical properties (PseAAC) as well as specific functional patterns (motifs), a classification model can be trained. This model can then predict the antioxidant potential of unseen peptides. The theoretical basis lies in the assumption that the biological activity of peptides is encoded in their sequence and structural characteristics, which can be approximated by these features.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Datasets Construction

Collecting reliable data is paramount for building an effective model.

Data Sources: All peptides were downloaded from public databases:
- Swiss-Prot: A comprehensive, high-quality, and freely accessible protein sequence knowledgebase (Consortium, 2018).
- APD3 (Antimicrobial Peptide Database): A specialized database for antimicrobial peptides (Wang, Li, & Wang, 2016).
- BIOPEP-UWM: A database of bioactive peptides, including antioxidant peptides (Minkiewicz, Iwaniak, & Darewicz, 2019).
Filtering: Peptides containing non-standard amino acids were removed.
Positive Dataset (Antioxidant Peptides):
- Collected from APD3 and BIOPEP-UWM.
- Filtered for peptide chain lengths between 2 and 31 amino acids.
- Duplicate peptides were removed.
- Final size: 669 antioxidant peptides.
- Analysis of this dataset showed that over $80\%$ of these peptides are $\le 10$ amino acids long (Figure 1). Amino acid composition analysis (Figure 2) revealed a higher proportion of Leucine (L), Proline (P), Tyrosine (Y), Histidine (H), and Tryptophan (W) in the positive set compared to the negative set, which are known to be related to antioxidant activity.
  
  The following figure (Figure 1 from the original paper) shows the length distribution of antioxidant peptides in the positive dataset:
  
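  The chart shows the length distribution (in amino acid residues) of antioxidant peptides in the positive dataset; peptides of 1-5 residues account for the largest share, more than 50%.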

Fig. 1. The length distribution of antioxidant peptides in the positive dataset.

The following figure (Figure 2 from the original paper) presents the amino acid compositional analysis for positive and negative datasets:

The image is a bar chart showing how the proportions of the 20 amino acids differ between the positive samples (antioxidant peptides, AOP) and the negative samples (non-antioxidant peptides, NAOP).

Fig. 2. Amino acids compositional analysis of the peptides in the positive dataset and negative dataset.

Negative Dataset (Non-Antioxidant Peptides):
- Since no dedicated negative dataset exists, it was constructed from the Swiss-Prot database, a common approach in literature.
- Selection Criteria:
  1. Reviewed peptides with lengths between 2 and 31 amino acids were selected from Swiss-Prot.
  2. Peptides screened using "antioxidant" or "antimicrobial" as keywords were removed to ensure they were unlikely to have the target function.
  3. Peptides already present in the positive dataset and prediction dataset were excluded.
- From the remaining 3718 peptides, 669 were randomly selected using Python to match the size of the positive dataset.
Training, Validation, and Test Sets:
- The combined positive and negative datasets were randomly split.
- Ratio: $80\%$ for the training set, $20\%$ for the independent test set.
- Cross-validation: During model training, the $80\%$ training set was further subjected to 10-fold cross-validation. This means it was divided into 10 parts, with 9 parts used for training and 1 part for validation in each iteration.
- A random seed was used for all splits to ensure reproducibility.
Random Peptide Dataset:
- 2007 peptides were synthetically generated using Python.
- Composed of one or more of the 20 standard amino acids, with lengths ranging from 2 to 31 residues.
- This dataset was used to discover potential antioxidant peptides with unknown functions.
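The sampling steps described above (random peptide generation, balancing the negative set against the positive set, and a seeded 80/20 split) could be sketched as follows. This uses only the standard library; the function names, seeds, and toy sizes are assumptions for illustration, and the paper's actual scripts are not reproduced here.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def random_peptides(n, min_len=2, max_len=31, seed=0):
    """Generate n distinct random peptides over the 20 standard residues."""
    rng = random.Random(seed)
    peptides = set()
    while len(peptides) < n:
        length = rng.randint(min_len, max_len)
        peptides.add("".join(rng.choice(AMINO_ACIDS) for _ in range(length)))
    return sorted(peptides)

def balanced_split(positives, negative_pool, test_fraction=0.2, seed=0):
    """Sample as many negatives as positives, then split into train/test."""
    rng = random.Random(seed)
    negatives = rng.sample(negative_pool, len(positives))
    labelled = [(p, 1) for p in positives] + [(n, 0) for n in negatives]
    rng.shuffle(labelled)
    n_test = round(len(labelled) * test_fraction)
    return labelled[n_test:], labelled[:n_test]

# Toy stand-ins: 10 "positives" and a pool of 40 candidate negatives.
pool = random_peptides(50, seed=1)
train, test = balanced_split(pool[:10], pool[10:], seed=1)
```

Fixing the seed makes both the negative sample and the split repeatable, which matches the reproducibility point made above.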

4.2.2. Feature Extraction

Feature extraction is critical for transforming raw peptide sequences into a format suitable for machine learning. The paper employed two main types of features: Pseudo-Amino Acid Composition (PseAAC) and motif features.

4.2.2.1. PseAAC Feature

PseAAC, proposed by K.C. Chou, generates fixed-length numerical vectors from variable-length peptide sequences. It considers both amino acid composition and physicochemical characteristics, as well as sequence-order information. The PseAAC web server (http://www.csbio.sjtu.edu.cn/bioinf/PseAAC) was used to generate these features.

PseAAC-type 1 (Parallel-correlation type):
- Generates $20 + \lambda$ discrete numbers to represent a peptide.
- $\lambda$ (lambda) is a non-negative integer parameter, set to 1 in this study. It determines the number of correlation factors considered.
- Selected amino acid attributes for correlation: hydrophobicity, hydrophilicity, mass, pK1 (alpha-COOH), pK2 (NH3), and pI (isoelectric point at $25^{\circ}\text{C}$ ).
- Weight factor: Default value was used.
- Result: A 21-dimensional vector for each peptide.
PseAAC-type 2 (Series-correlation type):
- Generates $20 + i\lambda$ discrete numbers to represent a peptide.
- $\lambda$ was set to 1 (same as Type 1).
- $i$ is the number of amino acid attributes selected, which was 6 in this study (hydrophobicity, hydrophilicity, mass, pK1, pK2, pI).
- Result: A 26-dimensional vector for each peptide.
PseAAC-dipeptide (Dipeptide-composition):
- This mode represents the frequencies of all possible dipeptides (pairs of adjacent amino acids).
- No parameters need to be selected for this mode.
- Result: A 420-dimensional vector for each peptide (since there are 20 standard amino acids, $20 \times 20 = 400$ possible dipeptides, plus additional features in this specific PseAAC web server implementation).
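For reference, the parallel-correlation (Type 1) encoding is usually written as below in the PseAAC literature. The analysis above does not restate these equations, so the symbols are assumptions: $N$ is the peptide length, $R_i$ the residue at position $i$, $H_m$ the $m$-th normalized physicochemical attribute ($n = 6$ attributes here), $f_u$ the frequency of amino acid $u$, and $w$ the weight factor.

$ \theta_k = \frac{1}{N-k} \sum_{i=1}^{N-k} \Theta(R_i, R_{i+k}), \quad k = 1, \dots, \lambda, \quad \lambda < N $

$ \Theta(R_i, R_j) = \frac{1}{n} \sum_{m=1}^{n} \left[ H_m(R_i) - H_m(R_j) \right]^2 $

$ x_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{20} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & 1 \le u \le 20 \\ \dfrac{w\,\theta_{u-20}}{\sum_{i=1}^{20} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & 20 < u \le 20 + \lambda \end{cases} $

With $\lambda = 1$ this yields the 21 components noted above, and the constraint $\lambda < N$ is still satisfied by the shortest peptides in the dataset (length 2).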

4.2.2.2. Motif Feature

Sequence motifs are recurring patterns that often indicate important functional characteristics.

Tool: The MERCI program (Vens et al., 2011) was used to identify motifs.
Input: Requires peptide sequences from both positive and negative sets in FASTA format.
Output: Identifies top $k$ motifs that are present in the positive set but not in the negative set, making them discriminative for the positive class (antioxidant peptides).
Parameter: $k$ was set to its default value of 10, but 11 motifs were obtained due to some having the same frequency. These 11 motifs are listed in S2 File.

4.2.3. Machine Learning Model Building

4.2.3.1. Initial Algorithm Selection

To find the most suitable classification algorithm, four common machine learning algorithms were initially tested with their default parameters:

Logistic Regression (LR): A linear model for binary classification, good for estimating probabilities.
Linear Discriminant Analysis (LDA): A linear classification method that projects data onto a lower-dimensional space while maximizing class separability.
Support Vector Machines (SVM): A powerful algorithm for classification, especially effective in high-dimensional spaces.
k-Nearest Neighbors (KNN): A non-parametric, instance-based learning algorithm that classifies a data point based on the majority class of its $k$ nearest neighbors.

The model with the best Area Under the Curve (AUC) was chosen for further optimization.

4.2.3.2. PseAAC Models

Models were built using each of the three PseAAC feature types independently.

PseAAC-type 1 Model:
- Input: 21-dimensional vectors.
- Algorithms: LR, LDA, SVM, KNN were compared. SVM showed the highest AUC (Figure 3).
- Optimization: GridSearchCV with 10-fold cross-validation was used to optimize SVM parameters, specifically the kernel function and hyperparameters C and gamma.
  - Kernel functions tested: linear, polynomial, sigmoid, and Radial Basis Function (RBF). RBF was selected as the optimal kernel.
  - Hyperparameters: C and gamma were optimized.
    - C (penalty parameter): Controls the trade-off between misclassification and margin size. A value of C=1 was found to be optimal.
    - Gamma ( $\gamma$ ): Defines the influence of individual training examples for the RBF kernel. A value of $\gamma$ =0.01 was optimal.
- Evaluation: Performance metrics (Sensitivity, Specificity, Accuracy, MCC, AUC, average precision score) were calculated on the independent test set (Table 1).
PseAAC-type 2 Model:
- Input: 26-dimensional vectors.
- Algorithms: LR, LDA, SVM, KNN were compared. SVM again showed the highest AUC (Figure 4).
- Optimization: Similar GridSearchCV process as Type 1. Optimal parameters: RBF kernel, C=1, $\gamma$ =0.001.
- Evaluation: Metrics calculated on the independent test set (Table 1).
PseAAC-dipeptide Model:
- Input: 420-dimensional vectors.
- Algorithms: LR, LDA, SVM, KNN were compared. SVM again showed the highest AUC (Figure 5). For LR and LDA, max_iter was increased to 3000 to ensure convergence with the higher dimensionality.
- Optimization: Similar GridSearchCV process. The optimal parameters are reported as the same as in the previous modes: RBF kernel and C=1. The exact $\gamma$ value is not restated, and because Type 1 and Type 2 used different values (0.01 and 0.001), the $\gamma$ for this model cannot be determined from the text here.
- Evaluation: Metrics calculated on the independent test set (Table 1).
  
  The following figure (Figure 3 from the original paper) shows the AUC comparison for PseAAC-type 1 models:
  
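  The image is a box plot of the AUC values of four machine learning models built on the PseAAC-type 1 encoding. The type1_SVM model has the highest AUC, outperforming the type1_LR, type1_LDA, and type1_KNN models, which indicates the best predictive performance on the antioxidant peptide classification task.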

Fig. 3. AUC of four models based on the PseAAC-type 1.

The following figure (Figure 4 from the original paper) shows the AUC comparison for PseAAC-type 2 models:

The image is a box plot of the AUC values of four machine learning models built on PseAAC-type 2 (type2_LR, type2_LDA, type2_SVM, and type2_KNN), showing the range and median of each model's AUC.

Fig. 4. AUC of four models based on the PseAAC-type 2.

The following figure (Figure 5 from the original paper) shows the AUC comparison for PseAAC-dipeptide models:

The image is a box plot of the AUC distribution of four models (LR, LDA, SVM, KNN) built on the PseAAC-dipeptide features. The dipep_SVM model has the highest AUC and the smallest spread, giving the best performance.

Fig. 5. AUC of four models based on the PseAAC-dipeptide.
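The SVM tuning step described in this subsection can be sketched with scikit-learn's `GridSearchCV`. The data here are synthetic stand-ins for the 21-dimensional Type 1 vectors, and the grid values, the stratified splitting, and the random seeds are assumptions; the paper's actual search grid is not given in this analysis.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for 21-dimensional PseAAC-type 1 vectors (not the paper's data).
X, y = make_classification(n_samples=200, n_features=21, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Assumed grid: the four kernels named above plus a few C and gamma values.
param_grid = {
    "kernel": ["linear", "poly", "sigmoid", "rbf"],
    "C": [0.1, 1, 10],
    "gamma": [0.001, 0.01, 0.1],
}
search = GridSearchCV(
    SVC(),
    param_grid,
    scoring="roc_auc",  # AUC drives model selection, as described above
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
)
search.fit(X_train, y_train)

best_params = search.best_params_
# Continuous scores on the held-out 20%, usable for AUC or threshold tuning.
test_scores = search.decision_function(X_test)
```

The held-out test set is touched only after the search has finished, so the reported test metrics are not influenced by the parameter selection.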

4.2.4. Hybrid Model Construction

To further improve performance, motif features were combined with PseAAC features.

Method:
- First, a base PseAAC model (Type 1, Type 2, or Dipeptide) predicted a score for a peptide.
- Then, if the peptide sequence contained any of the 11 identified motifs, 0.5 points were added to this predicted score. This effectively boosts the score of peptides containing known antioxidant motifs.
- This adjusted score was then used for subsequent model evaluation and prediction.
Hybrid Models: Three hybrid models were constructed:
1. PseAAC-type 1-motif hybrid model
2. PseAAC-type 2-motif hybrid model
3. PseAAC-dipeptide-motif hybrid model
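The score adjustment can be sketched as a small function. The motif strings below are hypothetical placeholders rather than the 11 motifs from the S2 File, and plain substring matching is an assumption, since MERCI motifs may use a richer pattern syntax.

```python
MOTIF_BONUS = 0.5  # bonus stated in the description above

def hybrid_score(sequence, base_score, motifs, bonus=MOTIF_BONUS):
    """Add a fixed bonus once if the sequence contains any listed motif."""
    has_motif = any(motif in sequence for motif in motifs)
    return base_score + bonus if has_motif else base_score

example_motifs = ["YL", "HH"]  # hypothetical placeholders, not the S2 File motifs
boosted = hybrid_score("AYLK", 0.40, example_motifs)    # contains "YL"
unchanged = hybrid_score("AGGK", 0.40, example_motifs)  # no motif present
```

Because the bonus is added to the base model's output, the hybrid score can exceed the base model's usual range, so any threshold has to be chosen on the adjusted scores.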

4.2.5. Cross-validation

The 10-fold cross-validation method was used throughout the modeling process to obtain robust performance estimates and to guard against overfitting during model selection.

The entire dataset was first split into $80\%$ training and $20\%$ independent test sets.
The $80\%$ training dataset was then divided into 10 equal parts. In each fold, 9 parts were used for training the model, and the remaining 1 part was used as a validation set.
This process was repeated 10 times, ensuring each part served as the validation set once. The average of the evaluation indices from these 10 runs was reported. A random seed was specified to maintain consistency across runs.
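The fold mechanics can be written out by hand in a few lines. This is a standard-library sketch with an assumed function name and a toy sample count; in practice a library routine would normally be used.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, val_idx) pairs so each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k near-equal parts
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

splits = list(k_fold_indices(n_samples=50, k=10))
```

Averaging a metric over the 10 validation folds gives the cross-validated estimate described above.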

4.2.6. Performance Measure and Model Selection

Metrics: Both threshold-dependent (Sensitivity, Specificity, Accuracy, MCC) and threshold-independent (AUC, average precision score) parameters were calculated.
Selection Criteria:
1. Initially, AUC was the primary indicator for selecting the best model and optimizing parameters.
2. After initial modeling, the best model was selected based on both AUC and average precision score.
3. Crucially, for the final predictor, the classification threshold of the selected model was adjusted to ensure the precision was greater than 0.95. This prioritization of precision is important for downstream experimental validation to minimize false positives.
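The threshold-dependent metrics and the precision-driven threshold choice can be sketched as follows. The labels and scores are toy values, the helper names are assumptions, and the rule "lowest threshold whose precision exceeds the target" is one reasonable reading of the procedure rather than the paper's documented method.

```python
import math

def confusion(y_true, scores, threshold):
    """Count TP, FP, FN, TN when scores >= threshold are called positive."""
    tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
    fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
    fn = sum(1 for t, s in zip(y_true, scores) if s < threshold and t == 1)
    tn = sum(1 for t, s in zip(y_true, scores) if s < threshold and t == 0)
    return tp, fp, fn, tn

def metrics(tp, fp, fn, tn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }

def lowest_threshold_for_precision(y_true, scores, target=0.95):
    """Scan candidate thresholds upward; return the first with precision > target."""
    for threshold in sorted(set(scores)):
        tp, fp, fn, tn = confusion(y_true, scores, threshold)
        if tp + fp and tp / (tp + fp) > target:
            return threshold
    return None

# Toy labels and scores, for illustration only.
y_true = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.95, 0.9, 0.8, 0.7, 0.65, 0.6, 0.4, 0.3, 0.2, 0.1]
threshold = lowest_threshold_for_precision(y_true, scores)
report = metrics(*confusion(y_true, scores, threshold))
```

In this toy case raising the threshold to reach the precision target costs one true positive, which illustrates the sensitivity that is traded away for fewer false positives.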

4.2.7. Chemical Synthesis of Peptides

Based on predictions from the final model, potential antioxidant peptides were selected.
These peptides were chemically synthesized by Sangon Biotech (Shanghai, China) Co., Ltd.
Purity: Synthesized peptides were analyzed by HPLC (High-Performance Liquid Chromatography) and MS (Mass Spectrometry), confirming purity higher than $98\%$ (S3 File).

4.2.8. Determination of Antioxidant Activity

The antioxidant activity of the synthesized peptides was experimentally determined using two common assays:

Total Antioxidant Capacity (T-AOC):
- Determined using T-AOC kits according to the manufacturer's instructions.
DPPH Radical-Scavenging Activity:
- Method adapted from Liu et al. (2020).
- Reagents: $2 \text{ mg}$ of DPPH dissolved in $50 \text{ mL}$ absolute ethanol. Synthetic peptides were dissolved in purified water.
- Procedure: Carried out in a 96-well microplate.
  - Total volume per well: $200 \mu\text{L}$ .
  - Sample group (S): $100 \mu\text{L}$ DPPH solution + $100 \mu\text{L}$ peptide solution.
  - Control group (C): $100 \mu\text{L}$ DPPH solution + $100 \mu\text{L}$ water. (Represents maximum DPPH absorbance without peptide.)
  - Blank Sample group (BS): $100 \mu\text{L}$ water + $100 \mu\text{L}$ peptide solution. (Measures absorbance of peptide solution itself, to subtract any intrinsic absorbance from peptide.)
  - Blank Control group (BC): $100 \mu\text{L}$ ethanol + $100 \mu\text{L}$ water. (Measures absorbance of solvent blank.)
- Incubation: 96-well plate incubated in the dark for $30 \text{ min}$ .
- Measurement: Absorbance measured at $517 \text{ nm}$ .
- Calculation: The scavenging ability was calculated using the following formula: $ \text{DPPH scavenging activity } (\%) = \left[ 1 - \left( A_S - A_{BS} \right) / \left( A_C - A_{BC} \right) \right] \times 100\% $ Where:
  - $A_S$ : Absorbance of the sample group (peptide + DPPH).
  - $A_{BS}$ : Absorbance of the blank sample group (peptide + water).
  - $A_C$ : Absorbance of the control group (water + DPPH).
  - $A_{BC}$ : Absorbance of the blank control group (ethanol + water).
- Replicates: All experiments were conducted in triplicate.
- Statistical Analysis: Performed using R 4.0.3.
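The scavenging formula above translates directly into a short function. The sketch below is illustrative only: the function name `dpph_scavenging` and all absorbance values are made up for the example and are not data from the paper.

```python
from statistics import mean


def dpph_scavenging(a_s, a_bs, a_c, a_bc):
    """DPPH scavenging activity (%) from the four absorbance readings at 517 nm."""
    denominator = a_c - a_bc
    if denominator == 0:
        raise ValueError("control and blank-control absorbances must differ")
    return (1 - (a_s - a_bs) / denominator) * 100


# Illustrative absorbances only (not data from the paper); triplicate sample wells.
sample_wells = [0.12, 0.13, 0.11]
activity = mean(dpph_scavenging(a, a_bs=0.05, a_c=0.85, a_bc=0.05) for a in sample_wells)
```

Subtracting $A_{BS}$ removes any absorbance contributed by the peptide itself, and subtracting $A_{BC}$ removes the solvent background, so the ratio compares DPPH signal with and without peptide.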

4.2.9. Discovery of Multifunctional Peptides based on a New Strategy

This novel strategy combines the developed ML predictor with existing peptide databases to find peptides with multiple functions.

Prediction Dataset Construction: Peptides with known functions (antibacterial, antifungal, anti-MRSA, anti-toxin, antiviral) and residue lengths from 2 to 31 were collected from the APD3 database (S1 File). Note that repeated peptides (those having multiple functions) were not removed, as the goal was to classify by function.
Prediction: The final PseAAC-dipeptide-motif hybrid model (the predictor) was used to predict whether these known functional peptides also possessed antioxidant activity.
Identification: Peptides from the prediction dataset that were predicted to have antioxidant function by the model were then identified as multifunctional peptides.
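The three steps above amount to a filter over peptides grouped by known function. The sketch below assumes a scoring callable that stands in for the trained hybrid model; `find_multifunctional`, `predict_score`, and the toy data are hypothetical names and values used only for illustration.

```python
def find_multifunctional(peptides_by_function, predict_score, threshold):
    """Map each known function to the peptides the predictor also calls antioxidant.

    peptides_by_function: dict of function name -> list of (peptide_id, sequence).
    predict_score: callable returning an antioxidant score for a sequence
                   (a stand-in for the trained hybrid model).
    """
    hits = {}
    for function, peptides in peptides_by_function.items():
        hits[function] = [
            (pid, seq) for pid, seq in peptides
            if 2 <= len(seq) <= 31 and predict_score(seq) >= threshold
        ]
    return hits
```

Because peptides are kept per function group rather than de-duplicated, a peptide with several known functions can appear as a hit under each of them, which matches the note that repeated peptides were not removed.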

5. Experimental Setup

5.1. Datasets

The study utilized several datasets for different stages of the research:

Training and Test Datasets (for model building):
- Source: Swiss-Prot, APD3, and BIOPEP-UWM databases.
- Positive Dataset (Antioxidant Peptides): 669 peptides, length 2-31 amino acids, from APD3 and BIOPEP-UWM.
- Negative Dataset (Non-Antioxidant Peptides): 669 peptides, length 2-31 amino acids, randomly selected from Swiss-Prot after filtering out known functional peptides.
- Characteristics: These datasets consist of peptide sequences, which are strings of amino acid characters (e.g., "PSGK", "QCQ"). The data is curated to ensure standard amino acids and appropriate lengths. The amino acid composition analysis (Figure 2) showed differences between positive and negative sets, suggesting the features used are relevant.
- Purpose: These datasets are crucial for training and independently evaluating the machine learning models. The balanced nature (equal number of positive and negative samples) helps prevent bias during training.
Random Peptide Dataset (for de novo discovery):
- Source: Synthetically generated using Python.
- Scale: 2007 peptides.
- Characteristics: Composed of one or more of the 20 typical amino acids, with lengths ranging from 2 to 31 amino acid residues.
- Purpose: To demonstrate the predictor's ability to discover potential antioxidant peptides from an unknown pool of sequences, mimicking a drug discovery scenario.
Prediction Dataset (for multifunctional peptide discovery):
- Source: Peptides with known functions (antibacterial, antifungal, anti-MRSA, anti-toxin, antiviral) collected from the APD3 database.
- Characteristics: Peptide sequences with residue lengths from 2 to 31. This dataset contains peptides already categorized by one or more functions.
- Purpose: To apply the trained antioxidant predictor to identify peptides that possess both a known function and antioxidant activity, thus discovering multifunctional peptides.
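For the random peptide dataset described above, a generator along the following lines would produce sequences with the stated properties. This is a sketch, assuming lengths are drawn uniformly and duplicates are discarded; the paper does not describe its sampling scheme, and the function name and seed are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def random_peptides(n, min_len=2, max_len=31, seed=0):
    """Generate n distinct random peptide sequences within the length range."""
    rng = random.Random(seed)
    peptides = set()
    while len(peptides) < n:
        length = rng.randint(min_len, max_len)  # inclusive on both ends
        peptides.add("".join(rng.choice(AMINO_ACIDS) for _ in range(length)))
    return sorted(peptides)
```

Calling `random_peptides(2007)` yields a pool of the same size as the one used in the study, though not the same sequences.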

5.2. Evaluation Metrics

For every evaluation metric mentioned in the paper, here is a detailed explanation:

True Positives (TP): The number of antioxidant peptides correctly predicted as antioxidant.
True Negatives (TN): The number of non-antioxidant peptides correctly predicted as non-antioxidant.
False Positives (FP): The number of non-antioxidant peptides incorrectly predicted as antioxidant. (Also known as Type I error)
False Negatives (FN): The number of antioxidant peptides incorrectly predicted as non-antioxidant. (Also known as Type II error)

Area Under the Receiver Operating Characteristic (ROC) Curve (AUC)
- Conceptual Definition: AUC quantifies the overall performance of a binary classification model across all possible classification thresholds. The ROC curve plots the True Positive Rate (Sensitivity) against the False Positive Rate (1 - Specificity) at various threshold settings. AUC represents the probability that the model will rank a randomly chosen positive instance higher than a randomly chosen negative instance. A higher AUC indicates a better ability to distinguish between positive and negative classes.
- Mathematical Formula: There is no single, simple formula for AUC itself, as it is the area under a curve. However, the curve is generated from Sensitivity and Specificity. $ \text{AUC} = \int_{0}^{1} \text{Sensitivity}(t) \, d(\text{1 - Specificity}(t)) $ Where:
  - $t$ : Classification threshold.
  - $\text{Sensitivity}(t)$ : True Positive Rate at threshold $t$ .
  - $\text{1 - Specificity}(t)$ : False Positive Rate at threshold $t$ .
- Symbol Explanation: Not applicable for AUC directly as it's an integral of rates. The rates themselves are defined below.
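The ranking interpretation of AUC given above can be computed directly by comparing every positive score with every negative score. The following is a minimal sketch assuming 0/1 labels; the function name `roc_auc` is chosen for this example, and a library routine would normally be used on real data.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For labels `[1, 0, 1, 0]` with scores `[0.9, 0.8, 0.3, 0.1]`, three of the four positive-negative pairs are ordered correctly, so the function returns 0.75.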
Average Precision Score
- Conceptual Definition: The average precision score summarizes a Precision-Recall (PR) curve. A PR curve plots Precision against Recall at various classification thresholds. The average precision score is the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight. It is particularly useful for imbalanced datasets and tasks where positive class prediction is more critical (e.g., drug discovery, where identifying true positives is paramount). A higher average precision score indicates better performance in retrieving positive samples without many false positives.
- Mathematical Formula: $ \text{Average Precision} = \sum_{n} (R_n - R_{n-1}) P_n $ Where:
  - $P_n$ : Precision at the $n$ -th threshold.
  - $R_n$ : Recall at the $n$ -th threshold.
  - $R_0 = 0$ , $P_0$ is usually taken as 1.
- Symbol Explanation: Not applicable for average precision directly as it's a sum of products of rates. The rates themselves are defined below.
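The summation above can be evaluated by scanning thresholds from the highest score downward. This sketch assumes 0/1 labels and treats each distinct score as one threshold; the function name `average_precision` is chosen for this example and is not the paper's code.

```python
def average_precision(y_true, scores):
    """AP = sum over thresholds of (R_n - R_{n-1}) * P_n, scanning scores high to low."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("need at least one positive sample")
    ap, prev_recall = 0.0, 0.0
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        predicted = sum(1 for s in scores if s >= t)
        precision, recall = tp / predicted, tp / n_pos
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap
```

Only thresholds at which recall increases contribute to the sum, so negatives ranked above a positive lower the score by reducing the precision at the next recall step.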
Sensitivity (Recall or True Positive Rate)
- Conceptual Definition: Sensitivity measures the proportion of actual positive cases (antioxidant peptides) that were correctly identified by the model. It indicates the model's ability to avoid false negatives.
- Mathematical Formula: $ \text{Sensitivity} = \frac{TP}{TP + FN} $
- Symbol Explanation:
  - TP: True Positives.
  - FN: False Negatives.
Specificity (True Negative Rate)
- Conceptual Definition: Specificity measures the proportion of actual negative cases (non-antioxidant peptides) that were correctly identified by the model. It indicates the model's ability to avoid false positives.
- Mathematical Formula: $ \text{Specificity} = \frac{TN}{TN + FP} $
- Symbol Explanation:
  - TN: True Negatives.
  - FP: False Positives.
Accuracy
- Conceptual Definition: Accuracy is the most intuitive performance measure and is simply the proportion of total predictions that were correct. It measures the overall correctness of the model.
- Mathematical Formula: $ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} $
- Symbol Explanation:
  - TP: True Positives.
  - TN: True Negatives.
  - FP: False Positives.
  - FN: False Negatives.
Precision (Positive Predictive Value)
- Conceptual Definition: Precision measures the proportion of positive predictions made by the model that were actually correct. It indicates how reliable positive predictions are, i.e., how many of the peptides predicted as antioxidant are truly antioxidant. This metric is crucial when the cost of false positives is high.
- Mathematical Formula: $ \text{Precision} = \frac{TP}{TP + FP} $
- Symbol Explanation:
  - TP: True Positives.
  - FP: False Positives.
Matthews Correlation Coefficient (MCC)
- Conceptual Definition: MCC is a robust and reliable statistical measure of the quality of binary classifications. It takes into account all four values of the confusion matrix (TP, TN, FP, FN) and is generally regarded as a balanced measure even if the classes are of very different sizes. A value of +1 represents a perfect prediction, 0 an average random prediction, and -1 an inverse prediction.
- Mathematical Formula: $ \text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} $
- Symbol Explanation:
  - TP: True Positives.
  - TN: True Negatives.
  - FP: False Positives.
  - FN: False Negatives.
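The threshold-dependent formulas above all derive from the same four counts, which the following sketch makes explicit. The function name `confusion_metrics` and the example counts are illustrative and are not taken from the paper; the code assumes each class has at least one sample and at least one positive prediction is made.

```python
from math import sqrt


def confusion_metrics(tp, tn, fp, fn):
    """Threshold-dependent metrics computed from the four confusion-matrix counts."""
    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        # MCC is conventionally set to 0 when any marginal total is zero.
        "mcc": (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0,
    }


# Illustrative counts only (not from the paper).
example = confusion_metrics(tp=90, tn=80, fp=20, fn=10)
```

With these example counts, sensitivity is 0.90, specificity 0.80, and accuracy 0.85.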

5.3. Baselines

The paper primarily uses internal comparisons to establish the best model.

Algorithm Baselines: For each PseAAC feature type (Type 1, Type 2, Dipeptide), the performance of SVM was compared against three other common machine learning algorithms: Logistic Regression (LR), Linear Discriminant Analysis (LDA), and k-Nearest Neighbors (KNN). These served as algorithmic baselines to justify the selection of SVM.
Feature Baselines: The different PseAAC types (Type 1, Type 2, Dipeptide) themselves act as baselines for each other. The performance of pure PseAAC models is also a baseline for the hybrid models that incorporate motif features.
External Model Comparisons (Discussion): While not direct baselines in the experimental results tables, the paper discusses its predictor's performance in comparison to previously reported models like AOPs-SVM (Meng et al., 2019), a model by Butt et al. (2019), and the deep learning tool AnOxPePred (Olsen et al., 2020). This provides context on how the developed predictor stands against existing solutions in the field, primarily emphasizing its higher sensitivity, precision, MCC, and AUC in certain aspects.

6. Results & Analysis

6.1. Core Results Analysis

The study meticulously built and evaluated several machine learning models to identify antioxidant peptides, culminating in a highly precise predictor and its application for discovering multifunctional peptides.

6.1.1. Prediction Model using PseAAC as Input Features

The first step involved building models using different PseAAC feature types. The general strategy was to compare four algorithms (LR, LDA, SVM, KNN) using default parameters, select the best one based on AUC, and then optimize its parameters using GridSearchCV with 10-fold cross-validation.

PseAAC-type 1 Model:
- Comparison (Figure 3): SVM (Support Vector Machine) model showed a significantly higher AUC compared to LR, LDA, and KNN, indicating its superior ability to discriminate between antioxidant and non-antioxidant peptides with Type 1 PseAAC features.
- Optimization: The optimal SVM parameters for Type 1 were an RBF kernel, C=1, and gamma=0.01.
- Performance on Test Set (Table 1): Achieved an AUC of 0.934 and an average precision score of 0.937.
PseAAC-type 2 Model:
- Comparison (Figure 4): Again, SVM emerged as the best performer in terms of AUC among the four algorithms.
- Optimization: Optimal SVM parameters for Type 2 were an RBF kernel, C=1, and gamma=0.001. The paper noted that Type 2 yielded slightly better results than Type 1 with these parameters.
- Performance on Test Set (Table 1): Achieved an AUC of 0.921 and an average precision score of 0.931.
PseAAC-dipeptide Model:
- Comparison (Figure 5): SVM demonstrated the best performance, as observed with the other PseAAC types. The authors adjusted the max_iter parameter for LR and LDA to 3000 due to the higher dimensionality (420-D) of dipeptide features, ensuring convergence.
- Optimization: Optimal SVM parameters were found to be the same as for the previous models (RBF kernel, C=1, and similar gamma values). The paper states that the AUC was "greatly improved" with this model.
- Performance on Test Set (Table 1): Achieved an AUC of 0.939 and an average precision score of 0.946, making it the best performing pure PseAAC model.

6.1.2. Hybrid Model Performance

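The PseAAC-dipeptide-motif hybrid model (which applies a 0.5 score boost if a peptide contains an identified motif) improved performance only marginally over the PseAAC-dipeptide model: in Table 1 the AUC is unchanged at 0.939 and the average precision score rises from 0.946 to 0.947.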
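The motif boost can be pictured as a small post-processing step on the base classifier's score. The sketch below is an assumption-laden illustration: it treats motifs as literal substrings, does not clip the boosted score, and uses a made-up motif; how the paper matches motifs and combines scores in detail is not specified here.

```python
def hybrid_score(sequence, base_score, motifs, boost=0.5):
    """Add a fixed boost to the base classifier score when any motif occurs in the sequence."""
    has_motif = any(motif in sequence for motif in motifs)
    return base_score + boost if has_motif else base_score


# "GHH" is a placeholder motif for illustration, not one reported in the paper.
boosted = hybrid_score("AGHHK", base_score=0.4, motifs=["GHH"])
unchanged = hybrid_score("AAAA", base_score=0.4, motifs=["GHH"])
```

Because the boost only raises scores of motif-containing peptides, it can change their ranking and their side of the threshold, but it leaves peptides without a motif untouched.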

The following are the results from Table 1 of the original paper:

Test set
Model	Sensitivity	Specificity	Accuracy	MCC	AUC	average precision score
PseAAC-type 1 model	0.965	0.712	0.847	0.708	0.934	0.937
PseAAC-type 2 model	0.832	0.888	0.858	0.719	0.921	0.931
PseAAC-dipeptide model	0.916	0.832	0.877	0.753	0.939	0.946
PseAAC-type 1-motif hybrid model	0.965	0.712	0.847	0.708	0.935	0.937
PseAAC-type 2-motif hybrid model	0.832	0.888	0.858	0.719	0.921	0.932
PseAAC-dipeptide-motif hybrid model	0.916	0.832	0.877	0.753	0.939	0.947

Best Model: As shown in Table 1, the PseAAC-dipeptide-motif hybrid model achieved the highest AUC (0.939) and average precision score (0.947) among all models tested. This indicates that combining the detailed dipeptide composition information with specific antioxidant motifs provides the most discriminative features for this task.
Precision Prioritization: The paper highlights that for discovery, high precision is crucial to ensure reliability. While the average precision score for the best model was 0.947, the researchers analyzed the Precision-Recall (PR) curve and adjusted the classification threshold to 0.668. This adjustment was made to ensure the model's precision was above 0.95, making the predicted antioxidant peptides more trustworthy for experimental follow-up, even if it might slightly trade off recall.

6.1.3. Predictions of Antioxidant Peptides

The PseAAC-dipeptide-motif hybrid model (with the adjusted threshold) was then applied as a predictor to the 2007 peptides in the random peptide dataset.

Identification: 254 peptides were predicted as potential antioxidant peptides.
Experimental Validation Selection: The top 5 peptides by predicted score were chosen for synthesis and experimental validation: PSGK (P1), LKPQ (P2), GRP (P3), QCQ (P4), and QGM (P5).

6.1.4. Antioxidant Activity

The selected 5 peptides were synthesized and their antioxidant activities were measured in vitro.

Total Antioxidant Capacity (T-AOC):
- Results (Figure 6): QCQ (P4) exhibited significantly higher T-AOC compared to the other four peptides (p < 0.05).
- Value: QCQ's T-AOC was $9.59 \text{ U/mg prot}$ .
  
  The following figure (Figure 6 from the original paper) shows the T-AOC of the 5 synthetic peptides:
  
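  The image is a chart showing the total antioxidant capacity (T-AOC, in U/mg protein) of the 5 synthetic peptides (PSGK, LKPQ, GRP, QCQ, QGM). The T-AOC of peptide QCQ is significantly higher than that of the other peptides, indicating strong antioxidant activity.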

Fig. 6. T-AOC of the 5 synthetic peptides (PSGK, P1; LKPQ, P2; GRP, P3; QCQ, P4; QGM, P5).

DPPH Radical-Scavenging Activity:
- Results (Figure 7): All 5 peptides showed some DPPH radical-scavenging activity, but QCQ (P4) demonstrated the strongest activity.
- Concentration-dependent Activity (Figure 8): Further testing of QCQ at different concentrations showed that its DPPH scavenging activity was $95.52\%$ at $125 \mu\text{g/mL}$ and $88.45\%$ at $62.5 \mu\text{g/mL}$ , indicating a robust and concentration-dependent antioxidant effect.
  
  The following figure (Figure 7 from the original paper) shows the DPPH radical scavenging activity of the 5 synthetic peptides:
  
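  The image is Figure 7, a bar chart of the DPPH radical-scavenging activity of the 5 synthetic peptides, showing each peptide's scavenging rate at 125 μg/mL with error bars; peptide P4 shows the highest radical-scavenging activity.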

Fig. 7. DPPH radical scavenging activity of the 5 synthetic peptides.

The following figure (Figure 8 from the original paper) shows the DPPH radical scavenging activity of different concentrations of the QCQ peptide (P4):

The image is a chart showing the DPPH radical-scavenging rate of the QCQ peptide at different concentrations. The scavenging rate rises gradually with concentration and approaches or reaches its maximum at 125 μg/mL and above, demonstrating that QCQ has marked antioxidant activity.

Fig. 8. DPPH radical scavenging activity of different concentrations of the QCQ peptide (P4).

The experimental results strongly validate that the predictor can effectively identify peptides with strong antioxidant properties, with QCQ being a prime example.

6.1.5. Discovery of Multifunctional Peptides

The final and novel application was to use the PseAAC-dipeptide-motif hybrid model to predict antioxidant function within peptides already known to possess other biological activities (from APD3). This strategy aimed to identify multifunctional peptides.

The following are the results from Table 2 of the original paper:

	ID	sequences
ABPa	AP00142	GLKKLLGKLLKKLGKLLLK
	AP00143	KKLLKWLKKLL
	AP00334	IIGGR
	AP00511	GYGGHGGHGGHGGHGGHGGHGHGGGGHG
	AP00528	DDDDDDD
	AP00551	FRWWHR
	AP01357	FFHLHFHY
	AP01406	ACSAG
	AP01518	AMVSS
	AP01899	FLKPLFNAALKLLP
	AP02204	KTKKKLLKKT
	AP02243	VKLFPVKLFP
	AP02261	PLGG
	AP02418	QWGGG
	AP02461	FLPGLIKAAVGVGSTILCKITKKC
	AP02670	DEDDD
	AP02681	YL
	AP02803	DEDLDE
	AP02856	WWWLRKIW
	AP02874	YSYYTIV
	AP02884	GDDDDDD
	AP02885	GADDDDD
	AP02984	YPVEPF
	AP03230	CVWLVVV
	AP03236	RRRWWWWV
AFPb	AP00511	GYGGHGGHGGHGGHGGHGGHGHGGGGHG
	AP00889	APPGARPPPGPPPPGPPPPGP
	AP01494	GHHPHGHHPHGHHPHGHHHPH
	AP02243	VKLFPVKLFP
	AP02261	PLGG
	AP02381	EL
	AP02382	ELLL
	AP02383	ELLL
	AP02461	FLPGLIKAAVGVGSTILCKITKKC
	AP02681	YL
	AP02856	WWWLRKIW
	AP02874	YSYYTIV
AMPc	AP02856	WWWLRKIW
	AP02874	YSYYTIV
	AP03236	RRRWWWWV
AVPd	AP01406	ACSAG

a : ABP is Antibacterial peptides from APD3. b : AFP is Antifungal peptides from APD3. c : AMP is Anti-MRSA peptides from APD3. d : AVP is Antiviral peptides from APD3.

Predictions:
- 25 peptides from the antibacterial group were predicted to have antioxidant function.
- 12 peptides from the antifungal group were predicted to have antioxidant function.
- 3 peptides from the anti-MRSA group were predicted to have antioxidant function.
- 1 peptide from the antiviral group was predicted to have antioxidant function.
- 0 peptides from the anti-toxin group were predicted to have antioxidant function.
Validation of known multifunctional peptides:
- Among the antibacterial peptides, AP02261 and AP02461 were already marked as antioxidant in APD3, and AP01899 was marked in the BIOPEP-UWM database. This confirms the predictor's ability to re-identify known multifunctional peptides.
- Several peptides appear in both the antibacterial and antifungal groups of Table 2 (including AP02261 and AP02681) and were predicted to be antioxidant by the model, suggesting triple functionality.
- The antifungal peptide EL (AP02381) was predicted as antioxidant by the model, and it is indeed listed with antioxidant function in BIOPEP-UWM. Similarly, the peptide YL (AP02681) was experimentally validated by other researchers (Yang et al., 2018) to have antioxidant function, and the model also predicted it.
Discovery of New Multifunctional Peptides: Crucially, many peptides (e.g., 22 out of 25 antibacterial peptides in Table 2) were not previously marked as having antioxidant function in the databases but were predicted to be antioxidant by the model. These represent novel discoveries of multifunctional peptides. This demonstrates the power of the predictor in mining new functions for existing peptides.
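The group counts and the overlaps between function groups can be cross-checked directly from the IDs in Table 2. The snippet below transcribes those IDs into sets; the variable names are chosen for this example, and the anti-toxin group is omitted because it has no predicted hits.

```python
from collections import Counter

# Peptide IDs transcribed from Table 2, grouped by their known function.
TABLE2 = {
    "antibacterial": {
        "AP00142", "AP00143", "AP00334", "AP00511", "AP00528", "AP00551",
        "AP01357", "AP01406", "AP01518", "AP01899", "AP02204", "AP02243",
        "AP02261", "AP02418", "AP02461", "AP02670", "AP02681", "AP02803",
        "AP02856", "AP02874", "AP02884", "AP02885", "AP02984", "AP03230",
        "AP03236",
    },
    "antifungal": {
        "AP00511", "AP00889", "AP01494", "AP02243", "AP02261", "AP02381",
        "AP02382", "AP02383", "AP02461", "AP02681", "AP02856", "AP02874",
    },
    "anti-MRSA": {"AP02856", "AP02874", "AP03236"},
    "antiviral": {"AP01406"},
}

group_sizes = {name: len(ids) for name, ids in TABLE2.items()}

# Count how many function groups each ID appears in.
membership = Counter(pid for ids in TABLE2.values() for pid in ids)
multi_group = sorted(pid for pid, n in membership.items() if n > 1)

antibacterial_and_antifungal = sorted(TABLE2["antibacterial"] & TABLE2["antifungal"])
```

Running it lists every ID that appears under more than one function heading, which is a quick way to see which predicted antioxidant peptides already carry two or more known activities.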

6.2. Data Presentation (Tables)

(Transcribed and presented above within the results analysis.)

6.3. Ablation Studies / Parameter Analysis

Feature Ablation (Implicit): The comparison of PseAAC-type 1, PseAAC-type 2, and PseAAC-dipeptide models, and then of their respective hybrid models with motif features, acts as an implicit ablation study. On the test set (Table 1), the PseAAC-dipeptide features give the highest AUC and average precision score of the three PseAAC types, while adding motif features changes each model only marginally: AUC or average precision score rises by at most 0.001, and the threshold-dependent metrics are unchanged.
Parameter Analysis: The use of GridSearchCV for optimizing SVM's kernel function (linear, polynomial, sigmoid, RBF) and hyperparameters (C and gamma) is a form of parameter analysis. This systematically explores the parameter space to find the optimal configuration for the SVM models, demonstrating that RBF kernel with specific C and gamma values consistently yielded the best results. The discussion on the effects of C and gamma (e.g., C controlling overfitting/underfitting, gamma affecting model complexity) further illustrates this analysis.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper successfully developed and validated a machine learning-based predictor for identifying and discovering antioxidant peptides. By constructing a PseAAC-dipeptide-motif hybrid model, the researchers achieved high predictive performance (AUC 0.939, average precision score 0.947). A key innovation was the strategic adjustment of the classification threshold to ensure a precision greater than 0.95, enhancing the reliability of predictions for practical discovery. The predictor's efficacy was confirmed through the experimental validation of newly identified antioxidant peptides, notably QCQ, which demonstrated strong T-AOC and DPPH scavenging activity. Furthermore, the study introduced a novel strategy to discover multifunctional peptides by applying the antioxidant predictor to existing databases of peptides known for other functions, thereby identifying peptides with combined properties crucial for applications like food safety. The work provides an effective computational tool and a new methodology for the efficient discovery of functional peptides.

7.2. Limitations & Future Work

The authors implicitly highlight several areas for future consideration and improvement:

Experimental Validation Scope: While 5 peptides were synthesized and tested, the paper identifies 254 potential antioxidant peptides and numerous predicted multifunctional peptides. A clear limitation is the practical bottleneck of experimentally validating all these predictions. Future work would ideally involve more extensive in vitro and potentially in vivo validation to fully characterize the newly predicted peptides.
Antioxidant Capacity vs. Predicted Score: The authors noted that "the degree of antioxidant capacity is not consistent with the predicted score." This suggests that while the model is effective as a classification tool (identifying if a peptide is antioxidant or not), it might not accurately predict the strength of antioxidant activity. Future work could focus on developing regression models to predict quantitative antioxidant capacity.
Multifunctionality Complexity: The paper focused on identifying antioxidant function in peptides already known for other functions. The complexity of these multifunctional peptides, their potential synergistic or antagonistic effects, and their overall safety and efficacy in food systems (e.g., stability, bioavailability, potential off-flavors) are areas that warrant further investigation.
Dataset Diversity: While large public databases were used, the representativeness of the negative dataset (generated from Swiss-Prot by exclusion) could always be improved with more rigorously defined non-antioxidant peptides.

7.3. Personal Insights & Critique

This paper offers a robust and practical approach to a significant problem in food science. The rigorous application of machine learning, from feature engineering (PseAAC and motifs) to model selection and threshold optimization, is commendable. The explicit focus on achieving high precision for discovery tasks is a critical design choice that often differentiates academic models from truly applicable tools, as it directly reduces the cost and effort of downstream experimental validation.

The "new strategy" for discovering multifunctional peptides is particularly insightful. It moves beyond predicting a single function for novel sequences and instead leverages existing knowledge to identify synergistic properties in known functional peptides. This approach has broad applicability. For instance, similar strategies could be applied to:

Drug Discovery: Identifying antimicrobial compounds that also exhibit anti-inflammatory effects.
Biomaterial Design: Discovering peptides for coatings that are both antimicrobial and biocompatible.
Cosmetics: Finding peptides with anti-aging properties that also provide UV protection.

One potential area for improvement or future exploration could be the use of more advanced deep learning architectures (e.g., recurrent neural networks or convolutional neural networks) that can learn features directly from raw peptide sequences, potentially capturing more subtle sequence-order dependencies without explicit feature engineering like PseAAC. However, this paper demonstrates that well-engineered classical ML methods can still yield state-of-the-art results when features are carefully chosen and validation is thorough.

The paper's clear experimental validation of the predicted peptides (especially QCQ) strengthens its claims significantly. The detailed reporting of both T-AOC and DPPH scavenging activity provides a comprehensive assessment of antioxidant potential. The findings regarding amino acid composition in positive vs. negative sets (Figure 2) also offer valuable biological insights for peptide design. Overall, this work provides a valuable framework for peptide discovery in various fields, showcasing the power of integrating computational prediction with experimental verification.
