
Bitter-RF: A random forest machine model for recognizing bitter peptides

Published: 01/26/2023

TL;DR Summary

Bitter-RF, a random forest model integrating 10 peptide sequence features, achieves high accuracy (AUROC=0.98) in bitter peptide recognition, pioneering RF use in this domain and enhancing protein classification methods.

Abstract

TYPE: Original Research. PUBLISHED: 26 January 2023. DOI: 10.3389/fmed.2023.1052923 (open access). EDITED BY: C. George Priya Doss, VIT University, India. REVIEWED BY: HaiHui Huang, Shaoguan University, China; Dragos Horvath, UMR 7140 Chimie de la Matière Complexe, France; Zhibin Lv, Sichuan University, China. CORRESPONDENCE: Hui Ding (hding@uestc.edu.cn), Yang Zhang (yangzhang@cdutcm.edu.cn), Ke-Jun Deng (dengkj@uestc.edu.cn). † These authors have contributed equally to this work. SPECIALTY SECTION: This article was submitted to Precision Medicine, a section of the journal Frontiers in Medicine. RECEIVED: 24 September 2022; ACCEPTED: 05 January 2023; PUBLISHED: 26 January 2023. CITATION: Zhang Y-F, Wang Y-H, Gu Z-F, Pan X-R, Li J, Ding H, Zhang Y and Deng K-J (2023) Bitter-RF: A random forest machine model for recognizing bitter peptides. Front. Med. 10:1052923. doi: 10.3389/fmed.2023.1052923. COPYRIGHT © 2023 Zhang, Wang, Gu, Pan, Li, Ding, Zhang and Deng. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY).


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Bitter-RF: A random forest machine model for recognizing bitter peptides

1.2. Authors

Yu-Fei Zhang, Yu-Hao Wang, Zhi-Feng Gu, Xian-Run Pan, Jian Li, Hui Ding, Yang Zhang, and Ke-Jun Deng. The authors are primarily affiliated with the School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China; Innovative Institute of Chinese Medicine and Pharmacy, Academy for Interdiscipline, Chengdu University of Traditional Chinese Medicine, Chengdu, China; and School of Basic Medical Sciences, Chengdu University, Chengdu, China. The corresponding authors are Hui Ding, Yang Zhang, and Ke-Jun Deng. Their research backgrounds appear to be in bioinformatics, computational biology, and possibly traditional Chinese medicine, focusing on machine learning applications in biological sequence analysis.

1.3. Journal/Conference

Published in Frontiers in Medicine, Volume 10, Article 1052923. Frontiers in Medicine is an open-access peer-reviewed journal publishing across various fields of medical research. It is generally considered a reputable journal within the "Frontiers" publishing family, contributing to the dissemination of medical and biomedical research.

1.4. Publication Year

2023

1.5. Abstract

This paper introduces Bitter-RF, a Random Forest (RF)-based machine learning model designed for recognizing bitter peptides using their sequence information. Bitter peptides are short peptides with significant potential medical applications that remain largely unexplored. To facilitate their practical utilization, an accurate classification method is crucial. Bitter-RF integrates 10 distinct features extracted from peptide sequences, aiming for a more comprehensive representation of peptide information. The model demonstrates superior performance compared to existing state-of-the-art models on an independent validation set, achieving an Area Under the Receiver Operating Characteristic curve (AUROC) of 0.98. This research not only enhances the accuracy of bitter peptide classification but also expands the application of the RF method in protein classification tasks, a domain where it had not been previously used for bitter peptide prediction. The authors hope Bitter-RF will serve as a valuable tool for researchers in bitter peptide studies.


2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the accurate and efficient identification of bitter peptides. Bitter peptides are short amino acid sequences that elicit a bitter taste. While often associated with spoilage or toxins, some bitter peptides possess significant potential medical applications, such as regulating blood glucose (e.g., peptides from Momordica charantia). However, their full therapeutic value remains largely untapped.

The problem is important because traditional experimental methods for identifying bitter peptides are complex, time-consuming, expensive, and often inaccurate. These biological methods typically involve laborious steps like gel separation, multiple rounds of liquid chromatography, purification, and identification using specialized instruments like Fourier transform infrared spectroscopy (FTIR), which are not universally accessible. Human sensory evaluations, while sometimes used, are subjective and can lead to inconsistent results. There is a clear need for a more efficient and accurate classification method to unlock the practical value of bitter peptides.

The paper's entry point, or innovative idea, is to develop a machine learning (ML) model, specifically using the Random Forest (RF) algorithm, that leverages comprehensive sequence information (features) to predict bitter peptides. Previous computational methods existed, including quantitative structure-bitterness relationship (QSBR) models and earlier generations of sequence-based models, but they either focused on structural properties rather than the sequence directly, or suffered from issues such as information redundancy, overfitting, or suboptimal feature representation. This study seeks to improve on these by integrating a wider array of sequence-derived features and by applying an RF model, noted for its robustness and adaptability to high-dimensional data, to this classification task for the first time.

2.2. Main Contributions / Findings

The primary contributions and key findings of the paper are:

  • Development of Bitter-RF Model: The authors developed a novel Random Forest (RF)-based machine learning model named Bitter-RF for the accurate recognition of bitter peptides. This is highlighted as the first application of the RF method to build a predictive model specifically for bitter peptides.
  • Comprehensive Feature Integration: Bitter-RF integrates a more comprehensive and extensive set of 10 different sequence-derived features, covering various aspects of peptide composition, physicochemical properties, and sequence order. This multi-perspective feature set (initially 1,337 dimensions, reduced to 1,206 after removing zero columns) provides richer information for classification.
  • Superior Performance: The model demonstrates significantly improved prediction accuracy, especially on an independent validation set. It achieved an AUROC (Area Under the Receiver Operating Characteristic curve) of 0.98 on the independent test set, which is comparable to or better than the latest generation of existing models. Key metrics like Sensitivity (Sn), Specificity (Sp), Accuracy (Acc), and Matthew's Correlation Coefficient (MCC) also showed strong performance (Sn=0.94, Sp=0.94, Acc=0.94, MCC=0.88 on independent set).
  • Validation of Feature Fusion and RF Method: The study systematically showed that fusing multiple features leads to better predictive performance compared to using single features. Furthermore, the Random Forest algorithm outperformed other traditional machine learning methods (SVM, LightGBM, Decision Trees, Logistic Regression) on the fused feature set for this specific task.
  • Enrichment of Protein Classification Applications: The research enriches the practical application of the RF method in protein classification, providing a robust model for bitter peptide identification that can guide further research and potential medical applications.
  • Open-Source Tool: A free and easy-to-use Python package for Bitter-RF has been made available on GitHub, providing a practical tool for scholars in bitter peptide research.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a reader should be familiar with basic concepts in molecular biology, machine learning, and statistical evaluation.

  • Peptides and Amino Acids:

    • Amino Acids: The basic building blocks of proteins and peptides. There are 20 common types, each with unique side chains that confer different physicochemical properties (e.g., hydrophobicity, charge, size).
    • Peptides: Short chains of amino acids linked by peptide bonds. Bitter peptides are a specific class of peptides that elicit a bitter taste perception, often due to their hydrophobic amino acid content or sequence arrangement.
    • Peptide Sequence Information: The linear order of amino acids in a peptide chain. This sequence dictates the peptide's properties and potential function.
  • Machine Learning (ML): A field of artificial intelligence that enables systems to learn from data without being explicitly programmed.

    • Classification Task: A type of supervised learning where an algorithm learns to assign input data into predefined categories (e.g., "bitter peptide" or "non-bitter peptide").
    • Features: Measurable properties or attributes of the data that the ML model uses to learn and make predictions. In this paper, features are derived from peptide sequences.
    • Model Training: The process where an ML algorithm learns patterns from a training dataset to build a predictive model.
    • Model Validation/Testing: Evaluating the performance of a trained model on unseen data (validation set or independent set) to assess its generalization ability.
    • Supervised Learning: A type of machine learning where the algorithm learns from labeled data (i.e., data where the correct output/category is already known).
  • Specific Machine Learning Algorithms:

    • Random Forest (RF): An ensemble learning method that constructs a multitude of decision trees at training time and outputs the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. It reduces overfitting and improves accuracy. Each tree in the forest is built using a random subset of the training data and a random subset of features.
    • Support Vector Machine (SVM): A supervised learning model used for classification and regression tasks. It works by finding an optimal hyperplane that best separates data points of different classes in a high-dimensional space.
    • Light Gradient Boosting Machine (LightGBM): A gradient boosting framework that uses tree-based learning algorithms. It is designed to be highly efficient and scalable, particularly for large datasets, by using techniques like Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB).
    • Decision Tree (DT): A non-parametric supervised learning method used for classification and regression. It partitions the data into subsets based on feature values, creating a tree-like model of decisions and their possible consequences.
    • Logistic Regression (LR): A statistical model used for binary classification. It models the probability of a binary outcome (e.g., bitter or non-bitter) using a logistic function to estimate probabilities, which are then mapped to two discrete classes.
  • Cross-validation (e.g., 10-fold cross-validation): A technique to assess how the results of a statistical analysis will generalize to an independent dataset. In k-fold cross-validation, the dataset is divided into $k$ equally sized folds. The model is trained on $k-1$ folds and tested on the remaining fold. This process is repeated $k$ times, with each fold used exactly once as the test set. 10-fold cross-validation means $k = 10$.
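
For illustration, here is a minimal sketch of 10-fold cross-validation with scikit-learn; the data and estimator are placeholders, not the paper's actual pipeline:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Placeholder data: 512 samples with 1,206 features, binary labels (bitter / non-bitter)
rng = np.random.default_rng(0)
X = rng.random((512, 1206))
y = np.repeat([1, 0], 256)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
fold_acc = []
for train_idx, test_idx in skf.split(X, y):
    clf = RandomForestClassifier(random_state=42)
    clf.fit(X[train_idx], y[train_idx])  # train on 9 folds
    # test on the single held-out fold
    fold_acc.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))

print(f"mean 10-fold accuracy: {np.mean(fold_acc):.3f}")
```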

3.2. Previous Works

The paper frames its work in the context of previous efforts to predict bitter peptides, broadly categorizing them into experimental and computational methods, and further detailing four generations of sequence-based computational models.

3.2.1. Experimental Methods

  • Process: Involve extracting bitter peptides from raw materials, gel separation, multiple rounds of liquid chromatography, purification, and then identification using techniques like Fourier Transform Infrared Spectroscopy (FTIR).
  • Limitations: Complex, time-consuming, instrument-dependent (not universal), and potentially inaccurate due to human sensory evaluation involvement.

3.2.2. Computational Methods (QSBR Models)

  • Quantitative Structure-Activity Relationship (QSAR/QSBR): Models that attempt to find a correlation between the structural properties of molecules and their biological activity (e.g., bitterness).
  • Techniques used: Multiple linear regression, Support Vector Machine (SVM), Artificial Neural Network (ANN).
  • Example cited: A model based on 229 experimental bitterness values, extracting 1292 descriptors using Dragon 5.4 software, reducing them to 244, and then selecting six best-scoring descriptors (SPAN, Mean Square Distance (MSD), E3s, G3p, Hats8U, and 3D-MoRSE) using GAPLS (Genetic Algorithm Partial Least Squares) for QSAR model construction. These descriptors represent molecular dimensions, atom counts, electrical topological states, WHIM indices, spatial autocorrelation, and molecular size/mass/volume.

3.2.3. Sequence-based Computational Models (Four Generations)

The paper highlights an evolution of sequence-based models, providing context for its own Bitter-RF model.

  • First-generation model (iBitter-SCM):

    • Method: Used dipeptide propensity scores to predict bitter peptides. Dipeptide propensity refers to the likelihood of specific pairs of amino acids appearing together in bitter peptides versus non-bitter peptides.
    • Limitation: Extracted only a few characteristics, potentially limiting information capture.
    • Reference: Charoenkwan et al., 2020 (22).
  • Second-generation model (BERT4Bitter):

    • Method: Utilized deep learning research methods, specifically Bidirectional Encoder Representations from Transformers (BERT). BERT is a powerful neural network model pre-trained on large text corpora, adapted here to learn contextual representations of peptide sequences.
    • Potential Problems: The authors suggest potential issues with information redundancy and overfitting, common challenges in deep learning models, especially with limited data.
    • Reference: Charoenkwan et al., 2021 (23).
  • Third-generation model (iBitter-Fuse):

    • Method: Integrated five peptide features to characterize bitter peptides and built a prediction model, likely using SVM as mentioned in its reference. The five features are not explicitly listed in the current paper's "Previous Works" section, but the reference (Charoenkwan et al., 2021 (24)) indicates it combines "multi-view features."
    • Limitation: The representativeness of the features might need further optimization.
    • Reference: Charoenkwan et al., 2021 (24).
  • Fourth-generation model (iBitter-DRLF):

    • Method: Extracted features through deep learning pre-training and then built a prediction model based on Light Gradient Boosting Machine (LGBM). This combines the representation learning power of deep learning with the efficiency and performance of LGBM for classification.
    • Reference: Zhang et al., 2022 (26).

3.3. Technological Evolution

The evolution in bitter peptide identification has moved from:

  1. Laborious, expensive, and subjective experimental methods (e.g., FTIR, human sensory evaluation).

  2. To computational methods based on quantitative structure-activity relationships (QSBR), which correlate molecular structure with bitterness using various ML algorithms (SVM, ANN).

  3. Then, to sequence-based computational models, which are more practical as they only require the amino acid sequence. This started with simpler models using dipeptide propensity scores, progressed to sophisticated deep learning approaches (BERT4Bitter), then to feature fusion with traditional ML (iBitter-Fuse), and finally to hybrid approaches combining deep learning for feature extraction with gradient boosting machines for prediction (iBitter-DRLF).

    This paper's Bitter-RF fits into this timeline by building upon the concept of feature fusion (like iBitter-Fuse) but expanding the number and types of features significantly (10 features) and employing a Random Forest algorithm, which is shown to be highly effective for this problem, contrasting with previous models that might have used SVM or LightGBM for the final classification step.

3.4. Differentiation Analysis

Compared to the main methods in related work, Bitter-RF differentiates itself through several core innovations:

  • Expanded Feature Set: While previous feature fusion models (e.g., iBitter-Fuse) used a limited number of features (e.g., five), Bitter-RF integrates a more comprehensive set of 10 different sequence-derived features. This broader scope aims to capture more diverse and extensive information about bitter peptides, including amino acid composition, pseudo-amino acid composition, dipeptide composition, and sequence-order-coupling numbers, which are crucial for physicochemical properties and sequential relationships.
  • Novel Application of Random Forest (RF): The paper explicitly states that the Random Forest method "has not been used to build a prediction model for bitter peptides." Bitter-RF introduces RF as a robust and effective classifier for this specific task, demonstrating its superior performance compared to other traditional ML methods (SVM, LightGBM, DT, LR) and competitive performance against state-of-the-art deep learning-based models.
  • Improved Accuracy on Independent Set: Bitter-RF achieves a high AUROC of 0.98 on the independent validation set, outperforming several previous generations of models and matching the performance of iBitter-DRLF, which often relies on computationally intensive deep learning pre-training.
  • Computational Efficiency: By utilizing a traditional machine learning method like Random Forest, Bitter-RF offers a strong prediction performance while consuming fewer computing resources compared to complex deep learning models, making it more accessible and practical for general research use.

4. Methodology

4.1. Principles

The core idea behind Bitter-RF is to accurately classify bitter peptides by leveraging a rich set of information derived from their amino acid sequences using a robust machine learning algorithm. The theoretical basis is that the bitter taste property of peptides is encoded within their sequence composition and arrangement, reflecting underlying physicochemical properties (e.g., hydrophobicity, hydrophilicity) and sequential patterns. By extracting diverse features that capture these characteristics and combining them, a machine learning model can learn to distinguish bitter peptides from non-bitter ones. The Random Forest algorithm is chosen for its ability to handle high-dimensional data, reduce overfitting, and maintain strong predictive power.

4.2. Core Methodology In-depth (Layer by Layer)

The construction of the Bitter-RF model involves several key steps: dataset preparation, comprehensive feature extraction, feature fusion, model training using Random Forest, and performance evaluation.

4.2.1. Dataset Source

The foundation of Bitter-RF is a high-quality benchmark dataset. The study utilizes the same dataset as previous generations of bitter peptide prediction models (references 22-24) to ensure a fair comparison. This dataset, accessible from http://pmlab.pythonanywhere.com/BERT4Bitter, was originally compiled by manually collecting experimentally validated bitter peptides from various scientific literature.

The dataset characteristics are as follows:

  • Total Records: 640

  • Bitter Peptides: 320 (experimentally validated)

  • Non-Bitter Peptides: 320 (randomly generated from BIOPEP database)

    To objectively evaluate the model's performance, the dataset was rigorously split into:

  • Training Set: Used to train the machine learning model. It contains 512 records (80% of total), specifically 256 bitter peptides and 256 non-bitter peptides.

  • Independent Set: Used to validate the model's generalization ability on unseen data. It contains 128 records (20% of total), specifically 64 bitter peptides and 64 non-bitter peptides.

4.2.2. Feature Extraction

Feature extraction is a critical step in machine learning models based on biological sequence data, as it aims to encode sequences in a way that reveals as much relevant information as possible. The authors used iLearnPlus (reference 37), a platform for sequence analysis, to extract 10 types of features from the bitter peptide sequences.

4.2.2.1. Amino Acid Composition (AAC)

The AAC encoding calculates the fractional frequencies of each of the 20 standard amino acids within a peptide sequence. This feature provides a basic compositional overview of the peptide.

The equation for AAC is: $ f(t) = \frac{N(t)}{N}, \quad t \in \{A, C, \dots, Y\} $ Where:

  • $f(t)$ represents the frequency of amino acid type $t$.
  • $N(t)$ denotes the number of occurrences of amino acid type $t$ in the peptide sequence.
  • $N$ is the total length (number of amino acids) of the peptide sequence.
  • $t \in \{A, C, \dots, Y\}$ indicates that $t$ can be any of the 20 standard amino acids.
  • Dimension: 20 (one for each amino acid type).
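
As an illustration, a minimal Python sketch of the AAC calculation (an assumption-level re-implementation, not the iLearnPlus code used by the authors):

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def aac(sequence: str) -> list[float]:
    """Return the 20-dimensional amino acid composition f(t) = N(t) / N."""
    counts = Counter(sequence)
    n = len(sequence)
    return [counts.get(aa, 0) / n for aa in AMINO_ACIDS]

# Example: a short, hypothetical peptide sequence
print(aac("GPFPIIV"))
```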

4.2.2.2. Traditional Pseudo-Amino Acid Composition (TPAAC)

TPAAC, also known as type 1 pseudo-amino acid composition, extends AAC by incorporating sequence-order information and physicochemical properties. It considers three specific amino acid properties: hydrophobicity, hydrophilicity, and side-chain mass.

First, the original values for hydrophobicity ($H_1^o(i)$), hydrophilicity ($H_2^o(i)$), and side chain mass ($M^o(i)$) for each of the 20 amino acids $i$ are normalized using a standard normal distribution transformation: $ H_1(i) = \frac{H_1^o(i) - \frac{1}{20}\sum_{i=1}^{20} H_1^o(i)}{\sqrt{\frac{\sum_{i=1}^{20}\left[H_1^o(i) - \frac{1}{20}\sum_{i=1}^{20} H_1^o(i)\right]^2}{20}}} $ $ H_2(i) = \frac{H_2^o(i) - \frac{1}{20}\sum_{i=1}^{20} H_2^o(i)}{\sqrt{\frac{\sum_{i=1}^{20}\left[H_2^o(i) - \frac{1}{20}\sum_{i=1}^{20} H_2^o(i)\right]^2}{20}}} $ $ M(i) = \frac{M^o(i) - \frac{1}{20}\sum_{i=1}^{20} M^o(i)}{\sqrt{\frac{\sum_{i=1}^{20}\left[M^o(i) - \frac{1}{20}\sum_{i=1}^{20} M^o(i)\right]^2}{20}}} $ Where:

  • $H_1(i)$, $H_2(i)$, and $M(i)$ are the normalized hydrophobicity, hydrophilicity, and side chain mass values for amino acid $i$.

  • $H_1^o(i)$, $H_2^o(i)$, and $M^o(i)$ are the original values for amino acid $i$.

  • The terms $\frac{1}{20}\sum_{i=1}^{20} H_1^o(i)$ (and the analogous terms for $H_2^o$ and $M^o$) represent the mean of the respective property over all 20 amino acids.

  • The denominator represents the standard deviation of the respective property over all 20 amino acids.

    Next, a correlation function $\Theta(R_i, R_j)$ between two amino acids $R_i$ and $R_j$ (at positions $i$ and $j$ in the sequence) is defined: $ \Theta(R_i, R_j) = \frac{1}{3}\left\{\left[H_1(R_i) - H_1(R_j)\right]^2 + \left[H_2(R_i) - H_2(R_j)\right]^2 + \left[M(R_i) - M(R_j)\right]^2\right\} $ This function measures the squared difference in hydrophobicity, hydrophilicity, and mass between the two amino acids, averaged over the three properties. The correlation function can also be defined for a single amino acid property or a set of properties: $ \Theta(R_i, R_j) = \left[H_1(R_i) - H_1(R_j)\right]^2 $ $ \Theta(R_i, R_j) = \frac{1}{n}\sum_{k=1}^{n}\left[H_k(R_i) - H_k(R_j)\right]^2 $ Where:

  • $H(R_i)$ is the standardized amino acid property of amino acid $R_i$.

  • $H_k(R_i)$ is the $k$-th attribute in the amino acid attribute set for amino acid $R_i$.

    Sequence order-correlated factors ($\Theta_1, \Theta_2, \dots, \Theta_\lambda$) are then computed: $ \Theta_1 = \frac{1}{N-1}\sum_{i=1}^{N-1}\Theta(R_i, R_{i+1}) $ $ \Theta_2 = \frac{1}{N-2}\sum_{i=1}^{N-2}\Theta(R_i, R_{i+2}) $ $ \dots $ $ \Theta_\lambda = \frac{1}{N-\lambda}\sum_{i=1}^{N-\lambda}\Theta(R_i, R_{i+\lambda}) $ Where:

  • $\lambda$ is a correlation parameter, indicating the maximum sequence separation (lag) considered. It must be less than $N$ (the peptide length). The paper states $\lambda = 1$ for this study.

  • $\Theta_j$ represents the $j$-th sequence order correlation factor, calculated by averaging the correlation function $\Theta(R_i, R_{i+j})$ over all residue pairs separated by $j$ positions (i.e., with $j-1$ intervening residues).

    Finally, the TPAAC descriptor for a protein sequence is defined by combining the amino acid frequencies with these sequence order-correlated factors: $ X_c = \frac{f_c}{\sum_{r=1}^{20} f_r + \omega\sum_{j=1}^{\lambda}\theta_j}, \quad (1 \le c \le 20) $ $ X_c = \frac{\omega\,\theta_{c-20}}{\sum_{r=1}^{20} f_r + \omega\sum_{j=1}^{\lambda}\theta_j}, \quad (21 \le c \le 20+\lambda) $ Where:

  • $f_c$ is the frequency of the $c$-th amino acid.

  • $\omega$ is a weighting factor, set to 0.05 in this study.

  • $\sum_{r=1}^{20} f_r$ is the sum of the frequencies of all 20 amino acids (typically 1).

  • $\sum_{j=1}^{\lambda}\theta_j$ is the sum of the $\lambda$ sequence-order correlation factors.

  • For $1 \le c \le 20$, $X_c$ represents the modified frequency of the $c$-th amino acid.

  • For $21 \le c \le 20+\lambda$, $X_c$ represents the sequence-order correlation factors themselves, scaled by $\omega$.

  • Dimension: $20 + \lambda$. With $\lambda = 1$, the dimension is 21.
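
A compact sketch of the TPAAC computation under the definitions above; the property tables here are placeholder values (the published hydrophobicity, hydrophilicity, and side-chain mass tables are not reproduced), so the output is illustrative only:

```python
import math

AA = "ACDEFGHIKLMNPQRSTVWY"
# Placeholder property tables (NOT the published values); illustrative numbers only.
H1_RAW = {aa: i * 0.1 for i, aa in enumerate(AA)}
H2_RAW = {aa: (i % 5) * 0.2 for i, aa in enumerate(AA)}
M_RAW = {aa: 57.0 + i for i, aa in enumerate(AA)}

def normalize(prop: dict) -> dict:
    """Zero-mean, unit-variance normalization over the 20 amino acids (as in the H1/H2/M equations)."""
    mean = sum(prop.values()) / 20
    std = math.sqrt(sum((v - mean) ** 2 for v in prop.values()) / 20)
    return {aa: (v - mean) / std for aa, v in prop.items()}

H1, H2, M = normalize(H1_RAW), normalize(H2_RAW), normalize(M_RAW)

def theta(a: str, b: str) -> float:
    """Correlation function Theta(Ri, Rj): mean squared difference over the three properties."""
    return ((H1[a] - H1[b]) ** 2 + (H2[a] - H2[b]) ** 2 + (M[a] - M[b]) ** 2) / 3

def tpaac(seq: str, lam: int = 1, w: float = 0.05) -> list[float]:
    """(20 + lambda)-dimensional type-1 pseudo amino acid composition."""
    n = len(seq)
    freqs = [seq.count(aa) / n for aa in AA]
    thetas = [sum(theta(seq[i], seq[i + j]) for i in range(n - j)) / (n - j)
              for j in range(1, lam + 1)]
    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]

print(len(tpaac("GPFPIIV")))  # 21 when lambda = 1
```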

4.2.2.3. Amphiphilic Pseudo-Amino Acid Composition (APAAC)

APAAC is another type of PseAAC that focuses on the distribution patterns of hydrophobicity and hydrophilicity along the peptide chain. It comprises $20 + 2\lambda$ discrete numbers.

It starts by using the normalized hydrophobicity $H_1(i)$ and hydrophilicity $H_2(i)$ values (from Equations 2 and 3 in TPAAC) to define hydrophobicity and hydrophilicity correlation functions: $ H_{i,j}^1 = H_1(i)\,H_1(j) $ $ H_{i,j}^2 = H_2(i)\,H_2(j) $ Where:

  • $H_{i,j}^1$ and $H_{i,j}^2$ are the correlation values for hydrophobicity and hydrophilicity, respectively, between amino acids at positions $i$ and $j$.

    Next, sequence order factors are formulated as: $ \tau_1 = \frac{1}{N-1}\sum_{i=1}^{N-1} H_{i,i+1}^1 $ $ \tau_2 = \frac{1}{N-1}\sum_{i=1}^{N-1} H_{i,i+1}^2 $ $ \tau_3 = \frac{1}{N-2}\sum_{i=1}^{N-2} H_{i,i+2}^1 $ $ \tau_4 = \frac{1}{N-2}\sum_{i=1}^{N-2} H_{i,i+2}^2 $ These continue up to $2\lambda$ factors: $ \tau_{2\alpha-1} = \frac{1}{N-\alpha}\sum_{i=1}^{N-\alpha} H_{i,i+\alpha}^1 $ $ \tau_{2\alpha} = \frac{1}{N-\alpha}\sum_{i=1}^{N-\alpha} H_{i,i+\alpha}^2 $ Where:

  • $\tau_j$ are the sequence order factors.

  • $\alpha$ ranges from 1 to $\lambda$.

  • $N$ is the peptide length.

  • The paper states $\lambda = 1$ for this study, meaning $\alpha$ only takes the value 1, resulting in $\tau_1$ and $\tau_2$.

    Finally, the APAAC descriptor is defined as: $ P_c = \frac{f_c}{\sum_{r=1}^{20} f_r + w\sum_{j=1}^{2\lambda}\tau_j}, \quad (1 \le c \le 20) $ $ P_u = \frac{w\,\tau_{u-20}}{\sum_{r=1}^{20} f_r + w\sum_{j=1}^{2\lambda}\tau_j}, \quad (21 \le u \le 20 + 2\lambda) $ Where:

  • $f_c$ is the frequency of the $c$-th amino acid.

  • $w$ is the weighting factor, set to 0.5 in this study.

  • $\lambda$ is the correlation parameter, set to 1 in this study.

  • For $1 \le c \le 20$, $P_c$ represents the modified frequency of the $c$-th amino acid.

  • For $21 \le u \le 20 + 2\lambda$, $P_u$ represents the sequence-order correlation factors themselves, scaled by $w$.

  • Dimension: $20 + 2\lambda$. With $\lambda = 1$, the dimension is 22.

4.2.2.4. Adaptive Skip Dinucleotide Composition (ASDC)

ASDC is a modified dipeptide composition that considers the relationships between non-adjacent residues, accounting for intervening peptides.

The feature vector for ASDC is defined as: $ \mathrm{ASDC} = (f_{\nu 1}, f_{\nu 2}, \dots, f_{\nu 400}) $ $ f_{\nu i} = \frac{\sum_{g=1}^{L-1} O_i^g}{\sum_{i=1}^{400}\sum_{g=1}^{L-1} O_i^g} $ Where:

  • $f_{\nu i}$ represents the occurrence frequency of the $i$-th possible dipeptide, considering all possible skips.
  • $O_i^g$ is the number of occurrences of the $i$-th dipeptide with $g-1$ intervening amino acids (i.e., at a distance of $g$).
  • $L$ is the length of the peptide sequence.
  • The sum $\sum_{g=1}^{L-1} O_i^g$ counts all occurrences of the $i$-th dipeptide type, irrespective of the skip distance.
  • The denominator normalizes these counts by the total number of all possible dipeptides at all possible skip distances.
  • Dimension: 400 (since there are $20 \times 20 = 400$ possible dipeptides).
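
A minimal sketch of ASDC following the definition above (an illustrative re-implementation, not the iLearnPlus code):

```python
from itertools import product

AA = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AA, repeat=2)]  # 400 ordered pairs

def asdc(seq: str) -> list[float]:
    """Adaptive skip dipeptide composition: count every ordered residue pair (i, j)
    with j > i, i.e. all skip distances g = 1 .. L-1, then normalize by the total."""
    counts = {dp: 0 for dp in DIPEPTIDES}
    L = len(seq)
    for i in range(L - 1):
        for j in range(i + 1, L):
            counts[seq[i] + seq[j]] += 1
    total = sum(counts.values())  # = L * (L - 1) / 2 pairs in total
    return [counts[dp] / total for dp in DIPEPTIDES]

print(sum(asdc("GPFPIIV")))  # the 400 frequencies sum to 1.0
```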

4.2.2.5. Di-peptide Composition (DPC)

DPC describes the frequencies of all 400 possible dipeptide combinations (e.g., AA, AC, ..., YY) in a peptide sequence. It captures information about adjacent amino acid pairs.

The calculation method for DPC is: $ D(r,s) = \frac{N_{rs}}{N-1}, \quad r, s \in \{A, C, D, \dots, Y\} $ Where:

  • $D(r,s)$ is the frequency of the dipeptide formed by amino acid type $r$ followed by amino acid type $s$.
  • $N_{rs}$ is the number of times the dipeptide combination $r$-$s$ appears in the peptide sequence.
  • $N$ is the total length of the peptide sequence.
  • $N-1$ is the total number of adjacent dipeptides in a sequence of length $N$.
  • Dimension: 400.

4.2.2.6. Dipeptide Deviation from Expected Mean (DDE)

DDE is a feature that quantifies how much the observed frequency of a dipeptide deviates from its theoretically expected frequency. It uses three parameters: the observed dipeptide composition ($D_c$), the theoretical mean ($T_m$), and the theoretical variance ($T_\nu$).

  • $D_c$ is the observed dipeptide composition, calculated in the same way as DPC above.

    The theoretical mean and variance are calculated based on codon usage frequencies: $ T_m(r,s) = \frac{C_r}{C_N} \times \frac{C_s}{C_N} $ $ T_\nu(r,s) = \frac{T_m(r,s)\left(1 - T_m(r,s)\right)}{N-1} $ Where:

  • $T_m(r,s)$ is the theoretical mean frequency for the dipeptide $r$-$s$.

  • $T_\nu(r,s)$ is the theoretical variance for the dipeptide $r$-$s$.

  • $C_r$ is the number of codons that encode amino acid type $r$.

  • $C_s$ is the number of codons that encode amino acid type $s$.

  • $C_N$ is the total number of possible codons (excluding stop codons).

  • $N-1$ is the total number of possible dipeptide positions.

    Finally, DDE for a dipeptide $r$-$s$ is calculated as: $ DDE(r,s) = \frac{D_c(r,s) - T_m(r,s)}{T_\nu(r,s)} $ Where:

  • $D_c(r,s)$ is the observed frequency of the dipeptide $r$-$s$.

  • Dimension: 400.
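
A sketch of DDE following the formulas as written above; the codon counts below are the standard genetic-code values (61 sense codons), which is an assumption about the counts used in practice:

```python
from itertools import product

AA = "ACDEFGHIKLMNPQRSTVWY"
# Number of sense codons per amino acid in the standard genetic code (61 sense codons total).
CODONS = {"A": 4, "C": 2, "D": 2, "E": 2, "F": 2, "G": 4, "H": 2, "I": 3, "K": 2, "L": 6,
          "M": 1, "N": 2, "P": 4, "Q": 2, "R": 6, "S": 6, "T": 4, "V": 4, "W": 1, "Y": 2}
CN = sum(CODONS.values())  # 61

def dde(seq: str) -> list[float]:
    """Dipeptide deviation from expected mean, following the Dc/Tm/Tv definitions above."""
    n_pairs = len(seq) - 1
    out = []
    for r, s in product(AA, repeat=2):
        dc = sum(seq[i:i + 2] == r + s for i in range(n_pairs)) / n_pairs  # observed DPC
        tm = (CODONS[r] / CN) * (CODONS[s] / CN)                           # theoretical mean
        tv = tm * (1 - tm) / n_pairs                                       # theoretical variance
        out.append((dc - tm) / tv)
    return out  # 400-dimensional

print(len(dde("GPFPIIV")))
```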

4.2.2.7. Grouped Amino Acid Composition (GAAC)

GAAC reduces the dimensionality of AAC by grouping the 20 amino acids into 5 categories based on their shared physicochemical properties. It then calculates the frequencies of these groups.

The five groups are:

  • Aliphatic group ($g_1$): G, A, V, L, M, I

  • Aromatic group ($g_2$): F, Y, W

  • Positively charged group ($g_3$): K, R, H

  • Negatively charged group ($g_4$): D, E

  • Uncharged group ($g_5$): S, T, C, P, N, Q

    The frequency calculation is: $ f(g) = \frac{N(g)}{N}, \quad g \in \{g_1, g_2, g_3, g_4, g_5\} $ Where:

  • $f(g)$ is the frequency of amino acids belonging to group $g$.

  • $N(g)$ is the number of amino acids in the sequence that belong to group $g$.

  • $N$ is the total length of the peptide sequence.

  • Dimension: 5.
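
A minimal sketch of GAAC using the five groups listed above:

```python
GROUPS = {
    "g1": set("GAVLMI"),   # aliphatic
    "g2": set("FYW"),      # aromatic
    "g3": set("KRH"),      # positively charged
    "g4": set("DE"),       # negatively charged
    "g5": set("STCPNQ"),   # uncharged
}

def gaac(seq: str) -> list[float]:
    """Grouped amino acid composition: frequency of each of the 5 groups, f(g) = N(g) / N."""
    n = len(seq)
    return [sum(aa in members for aa in seq) / n for members in GROUPS.values()]

print(gaac("GPFPIIV"))  # 5-dimensional vector summing to 1.0
```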

4.2.2.8. Grouped Dipeptide Composition (GDPC)

GDPC is a variant of DPC that uses the same 5 amino acid groups defined in GAAC. It calculates the frequencies of dipeptides formed by these groups.

The feature consists of 25 descriptors (5 groups $\times$ 5 groups), calculated as: $ f(r,s) = \frac{N_{rs}}{N-1}, \quad r, s \in \{g_1, g_2, g_3, g_4, g_5\} $ Where:

  • $f(r,s)$ is the frequency of a dipeptide whose first amino acid belongs to group $r$ and whose second belongs to group $s$.
  • $N_{rs}$ is the number of occurrences of dipeptides whose residues belong to groups $r$ and $s$, respectively.
  • $N$ is the total length of the peptide sequence.
  • Dimension: 25.

4.2.2.9. Sequence-order-coupling number (SOCNumber)

SOCNumber captures sequence-order information by calculating the sum of squared distances between amino acids at specific separations (lags).

The $d$-th rank sequence-order-coupling number ($\tau_d$) is calculated as: $ \tau_d = \sum_{i=1}^{N-d}\left(d_{i,i+d}\right)^2, \quad d = 1, 2, \dots, nlag $ Where:

  • $\tau_d$ is the $d$-th rank sequence-order-coupling number.
  • $d$ is the lag, representing the distance between two amino acids.
  • $N$ is the length of the peptide sequence.
  • $d_{i,i+d}$ describes the "distance" between the amino acid at position $i$ and the amino acid at position $i+d$. This distance is derived from a pre-defined amino acid distance matrix; the paper mentions using the Schneider-Wrede (physicochemical) and Grantham (chemical) distance matrices for this purpose.
  • $nlag$ denotes the maximum value of the lag, with a default value of 30.
  • Dimension: $nlag$. With $nlag = 30$, the dimension would be 30. However, Table 1 states a dimension of 2, which might imply a specific configuration in iLearnPlus or that only two specific lags (e.g., $d = 1, 2$) were used for this feature. The discrepancy between the formula (implying $nlag$ dimensions) and the table (showing 2 dimensions) is noted; following the table, the dimension is taken as 2.
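
A sketch of the sequence-order-coupling numbers; the distance matrix below is a uniform placeholder rather than the Schneider-Wrede or Grantham matrix, and $nlag$ is set to 2 to match the dimension reported in Table 1:

```python
AA = "ACDEFGHIKLMNPQRSTVWY"

def soc_numbers(seq: str, dist: dict, nlag: int = 2) -> list[float]:
    """Sequence-order-coupling numbers tau_d = sum_i d(i, i+d)^2 for d = 1..nlag.
    `dist` maps an amino-acid pair (a, b) to a distance taken from a matrix such as
    Schneider-Wrede or Grantham (placeholder values are used below)."""
    taus = []
    for d in range(1, nlag + 1):
        taus.append(sum(dist[(seq[i], seq[i + d])] ** 2 for i in range(len(seq) - d)))
    return taus

# Placeholder distance matrix: every pair gets distance 1.0 (illustrative only).
toy_dist = {(a, b): 1.0 for a in AA for b in AA}
print(soc_numbers("GPFPIIV", toy_dist, nlag=2))  # 2 values, matching the dimension in Table 1
```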

4.2.2.10. Quasi-sequence-order (QsOrder)

QsOrder is another feature that combines amino acid composition with sequence-order information, utilizing the SOCNumber concept. It produces a set of descriptors reflecting both composition and sequence patterns.

For each amino acid (20 types), the QsOrder is defined as: $ X_r = \frac{f_r}{\sum_{r=1}^{20} f_r + w\sum_{d=1}^{nlag}\tau_d}, \quad r = 1, 2, 3, \dots, 20 $ Where:

  • $X_r$ is the QsOrder descriptor for the $r$-th amino acid type.

  • $f_r$ represents the normalized occurrence frequency of the $r$-th amino acid type.

  • $w$ is the weighting factor, defined as 0.1.

  • $nlag$ denotes the maximum value of the lag (default: 30).

  • $\tau_d$ is the $d$-th rank sequence-order-coupling number, as defined in SOCNumber.

    For the other $nlag$ quasi-sequence-order descriptors (30 in total, implying $nlag = 30$), QsOrder is defined as: $ X_d = \frac{w\,\tau_{d-20}}{\sum_{r=1}^{20} f_r + w\sum_{d=1}^{nlag}\tau_d}, \quad d = 21, 22, \dots, 20 + nlag $ Where:

  • $X_d$ represents the additional QsOrder descriptors, which are based primarily on the sequence-order-coupling numbers.

  • Dimension: $20 + nlag$. With $nlag = 30$, the dimension would be 50, whereas Table 1 states a dimension of 42. This discrepancy is noted, but the analysis follows the dimension stated in the table for consistency with the results section.

4.2.3. Feature Fusion Processing

After extracting the 10 types of features, they are concatenated to form a single, high-dimensional feature vector for each peptide.

  • Initial Fusion: The concatenation of all 10 features results in a feature vector with 1,337 dimensions.
  • De-zeroing (Feature Reduction): A practical step where any feature column (dimension) that contains only zero values across all samples is removed. Such columns provide no discriminative information and can be safely eliminated. After this de-zero operation, the total number of features used for model learning is reduced to 1,206.
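
A minimal sketch of the fusion and de-zero steps with NumPy (illustrative; the authors' exact preprocessing code is not reproduced here):

```python
import numpy as np

def fuse_and_dezero(feature_blocks: list) -> np.ndarray:
    """Concatenate per-peptide feature blocks column-wise and drop all-zero columns."""
    fused = np.hstack(feature_blocks)      # e.g. 640 x 1,337 when all 10 blocks are stacked
    keep = ~np.all(fused == 0, axis=0)     # keep columns with at least one non-zero value
    return fused[:, keep]                  # e.g. 640 x 1,206 after the de-zero operation

# Toy example with three small blocks for 4 peptides
blocks = [np.ones((4, 3)), np.zeros((4, 2)), np.arange(8).reshape(4, 2)]
print(fuse_and_dezero(blocks).shape)  # (4, 5): the two all-zero columns are removed
```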

4.2.4. Random Forest (RF)

The Random Forest algorithm is employed as the primary machine learning classifier for Bitter-RF.

  • Ensemble Method: RF is an ensemble learning method that builds multiple decision trees during training.

  • Randomness: Each tree is constructed using a random subset of the training data (bootstrap aggregating or bagging) and, at each split in a tree, a random subset of features is considered. This inherent randomness helps to reduce correlation among trees and prevents overfitting.

  • Prediction: For classification, the final prediction is made by taking a majority vote from the predictions of all individual trees in the forest.

  • Advantages: RF is known for its high accuracy, robustness to noise, and strong adaptability to high-dimensional data, which makes it suitable for the comprehensive feature set developed in this study. The paper explicitly mentions that RF can "reduce the possibility of overfitting, improve the ability to resist noise, and has strong adaptability to high-dimensional data."

    The schematic framework of Bitter-RF for bitter peptide prediction is visually represented in Figure 1. It outlines the process from data collection to feature extraction, feature fusion, model training with RF, and ultimately prediction.

The following figure (Figure 1 from the original paper) illustrates the construction workflow of the Bitter-RF model:

FIGURE 1: Construction workflow of the Bitter-RF model, comprising four steps: data collection, feature fusion, machine learning method selection, and model evaluation, illustrating the overall study design.
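
A minimal end-to-end sketch of the Bitter-RF training setup with scikit-learn; the feature matrix, split seed, and RF hyperparameters below are placeholders rather than the authors' exact settings:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import roc_auc_score

# Placeholder fused feature matrix (640 peptides x 1,206 features) and labels.
rng = np.random.default_rng(0)
X = rng.random((640, 1206))
y = np.repeat([1, 0], 320)  # 320 bitter (1), 320 non-bitter (0)

# 8:2 stratified split into training and independent sets, as described in Section 4.2.1.
X_tr, X_ind, y_tr, y_ind = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

clf = RandomForestClassifier(n_estimators=500, random_state=42)  # hyperparameters are illustrative
print("10-fold CV AUROC:", cross_val_score(clf, X_tr, y_tr, cv=10, scoring="roc_auc").mean())

clf.fit(X_tr, y_tr)
print("independent-set AUROC:", roc_auc_score(y_ind, clf.predict_proba(X_ind)[:, 1]))
```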

5. Experimental Setup

5.1. Datasets

The study utilized a publicly available benchmark dataset, the same one used by previous bitter peptide prediction models (iBitter-SCM, BERT4Bitter, iBitter-Fuse, iBitter-DRLF), to ensure comparability and reliability of results.

  • Source: The dataset was originally collected by manually curating experimentally validated bitter peptides from various scientific literature and can be accessed from http://pmlab.pythonanywhere.com/BERT4Bitter. Non-bitter peptides were randomly generated from the BIOPEP database.
  • Characteristics:
    • Total Samples: 640 records.
    • Classes: 320 experimentally validated bitter peptides (positive samples) and 320 non-bitter peptides (negative samples).
  • Data Split: The dataset was divided using an 8:2 ratio to create training and independent validation sets:
    • Training Set: 512 records (256 bitter peptides, 256 non-bitter peptides). This set is used for model learning.
    • Independent Set: 128 records (64 bitter peptides, 64 non-bitter peptides). This set is crucial for evaluating the model's generalization performance on entirely unseen data.
  • Rationale for Choice: Using the same dataset as previous models allows for a direct and fair comparison of the Bitter-RF model's performance against existing state-of-the-art methods.

5.2. Evaluation Metrics

To assess the training effect and predictive ability of the model, the authors used several standard classification metrics. Bitter peptides were defined as positive samples, and non-bitter peptides as negative samples.

For context, the fundamental counts for these metrics are:

  • TP (True Positives): Number of bitter peptides correctly predicted as bitter.

  • FN (False Negatives): Number of bitter peptides incorrectly predicted as non-bitter.

  • TN (True Negatives): Number of non-bitter peptides correctly predicted as non-bitter.

  • FP (False Positives): Number of non-bitter peptides incorrectly predicted as bitter.

    Here are the evaluation metrics used, with their conceptual definitions, mathematical formulas, and symbol explanations:

5.2.1. Sensitivity (Sn)

  • Conceptual Definition: Sensitivity, also known as Recall or True Positive Rate, measures the proportion of actual positive cases (bitter peptides) that are correctly identified by the model. A high sensitivity indicates that the model is good at catching positive instances.
  • Mathematical Formula: $ S n = \frac { T P } { ( T P + F N ) } $
  • Symbol Explanation:
    • Sn: Sensitivity
    • TP: True Positives
    • FN: False Negatives

5.2.2. Specificity (Sp)

  • Conceptual Definition: Specificity, also known as True Negative Rate, measures the proportion of actual negative cases (non-bitter peptides) that are correctly identified by the model. A high specificity indicates that the model is good at correctly identifying negative instances and avoiding false alarms.
  • Mathematical Formula: $ S p = { \frac { T N } { ( T N + F P ) } } $
  • Symbol Explanation:
    • Sp: Specificity
    • TN: True Negatives
    • FP: False Positives

5.2.3. Accuracy (ACC)

  • Conceptual Definition: Accuracy measures the proportion of total predictions that were correct. It is a straightforward metric but can be misleading in imbalanced datasets.
  • Mathematical Formula: $ A C C = { \frac { ( T P + T N ) } { ( T P + T N + F P + F N ) } } $
  • Symbol Explanation:
    • ACC: Accuracy
    • TP: True Positives
    • TN: True Negatives
    • FP: False Positives
    • FN: False Negatives

5.2.4. Matthew's Correlation Coefficient (MCC)

  • Conceptual Definition: MCC is a comprehensive and robust metric for binary classification that is considered a balanced measure even if the classes are of very different sizes. It takes into account true and false positives and negatives and returns a value between -1 and +1. A value of +1 indicates a perfect prediction, 0 indicates a random prediction, and -1 indicates a completely opposite prediction.
  • Mathematical Formula: $ M C C = { \frac { \left( T N \times \ T P - F N \times \ F P \right) } { \sqrt { \left( T P + F P \right) \left( T P + F N \right) \left( T N + F P \right) \left( T N + F N \right) } } } $
  • Symbol Explanation:
    • MCC: Matthew's Correlation Coefficient
    • TP: True Positives
    • TN: True Negatives
    • FP: False Positives
    • FN: False Negatives

5.2.5. Area Under the Receiver Operating Characteristic curve (AUROC)

  • Conceptual Definition: AUROC is a performance metric for binary classifiers. The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate (Sensitivity) against the False Positive Rate (1 - Specificity) at various threshold settings. The AUROC value represents the area under this curve. A higher AUROC (closer to 1) indicates a better ability of the model to distinguish between classes. An AUROC of 0.5 suggests no discrimination (like random guessing), while an AUROC of 1.0 indicates perfect discrimination. The paper states that AUROC can be used as a standard for evaluating the quality of the binary classification model.
  • Mathematical Formula: The AUROC itself does not have a single simple formula based on TP, TN, FP, FN directly, as it is derived from integrating the ROC curve. However, the ROC curve is generated by varying the classification threshold and calculating the TPR and FPR at each threshold.
    • $TPR = Sensitivity = \frac{TP}{TP + FN}$
    • $FPR = 1 - Specificity = \frac{FP}{TN + FP}$
    • The AUROC is then the integral of the TPR with respect to the FPR over the interval [0, 1].
  • Symbol Explanation:
    • AUROC: Area Under the Receiver Operating Characteristic curve.
    • TPR: True Positive Rate (Sensitivity).
    • FPR: False Positive Rate (1 - Specificity).
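
A minimal sketch computing all five evaluation metrics from predictions with scikit-learn (illustrative; the counts follow the definitions above and the example labels are hypothetical):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef, roc_auc_score

def report(y_true, y_pred, y_score):
    """Compute Sn, Sp, ACC, MCC and AUROC from labels, hard predictions and scores."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sn = tp / (tp + fn)                      # sensitivity / recall
    sp = tn / (tn + fp)                      # specificity
    acc = (tp + tn) / (tp + tn + fp + fn)    # accuracy
    mcc = matthews_corrcoef(y_true, y_pred)  # Matthew's correlation coefficient
    auroc = roc_auc_score(y_true, y_score)   # area under the ROC curve
    return {"Sn": sn, "Sp": sp, "ACC": acc, "MCC": mcc, "AUROC": auroc}

# Toy example with hypothetical predictions
y_true = np.array([1, 1, 1, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 1])
y_score = np.array([0.9, 0.8, 0.4, 0.2, 0.1, 0.6])
print(report(y_true, y_pred, y_score))
```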

5.3. Baselines

The paper compared Bitter-RF against two categories of baseline models:

  1. Other Traditional Machine Learning Methods: To validate the choice of Random Forest, the fused features were also tested with:

    • Support Vector Machine (SVM)
    • Light Gradient Boosting Machine (LightGBM)
    • Decision Trees (DT)
    • Logistic Regression (LR) These methods are representative of commonly used and powerful classification algorithms in bioinformatics.
  2. Existing State-of-the-Art Bitter Peptide Prediction Models: To demonstrate the superior performance of Bitter-RF in the specific domain of bitter peptide prediction, it was compared with four previously published sequence-based models, which the authors refer to as "generations":

    • iBitter-SCM: The first-generation model, based on dipeptide propensity scores (Charoenkwan et al., 2020).
    • BERT4Bitter: The second-generation model, utilizing deep learning (BERT) (Charoenkwan et al., 2021).
    • iBitter-Fuse: The third-generation model, combining five peptide features with SVM (Charoenkwan et al., 2021).
    • iBitter-DRLF: The fourth-generation model, which uses deep learning for feature extraction and LightGBM for prediction (Zhang et al., 2022). These models represent the progression of computational approaches for bitter peptide identification and serve as direct benchmarks for Bitter-RF.

6. Results & Analysis

6.1. Single-feature-based results

The initial phase of evaluation involved training Random Forest models using each of the 10 extracted features individually. This step helps to understand the discriminative power of each feature type. The performance was assessed using both 10-fold cross-validation on the training set and an independent validation set.

The following are the results from Table 1 of the original paper:

| Cross-validation | Feature | Dimension | AUROC | Sn | Sp | Acc | MCC |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 10-fold cross-validation | AAC | 20 | 0.91 | 0.85 | 0.84 | 0.85 | 0.69 |
| | TPAAC | 21 | 0.90 | 0.83 | 0.78 | 0.80 | 0.61 |
| | APAAC | 22 | 0.89 | 0.83 | 0.81 | 0.82 | 0.64 |
| | ASDC | 400 | 0.88 | 0.89 | 0.68 | 0.79 | 0.59 |
| | DPC | 400 | 0.86 | 0.87 | 0.64 | 0.76 | 0.53 |
| | DDE | 400 | 0.83 | 0.84 | 0.73 | 0.78 | 0.57 |
| | GAAC | 5 | 0.75 | 0.72 | 0.66 | 0.69 | 0.39 |
| | GDPC | 25 | 0.78 | 0.75 | 0.71 | 0.73 | 0.46 |
| | SOCNumber | 2 | 0.70 | 0.66 | 0.62 | 0.64 | 0.28 |
| | QSOrder | 42 | 0.89 | 0.82 | 0.82 | 0.82 | 0.64 |
| Independent set validation | AAC | 20 | 0.96 | 0.91 | 0.89 | 0.90 | 0.80 |
| | TPAAC | 21 | 0.94 | 0.83 | 0.86 | 0.84 | 0.69 |
| | APAAC | 22 | 0.97 | 0.89 | 0.91 | 0.90 | 0.80 |
| | ASDC | 400 | 0.92 | 0.89 | 0.75 | 0.82 | 0.65 |
| | CKSAAGP | 100 | 0.87 | 0.77 | 0.81 | 0.79 | 0.58 |
| | DPC | 400 | 0.89 | 0.88 | 0.70 | 0.79 | 0.59 |
| | DDE | 400 | 0.90 | 0.89 | 0.84 | 0.87 | 0.74 |
| | GAAC | 5 | 0.76 | 0.83 | 0.64 | 0.73 | 0.48 |
| | GDPC | 25 | 0.80 | 0.73 | 0.72 | 0.73 | 0.45 |
| | SOCNumber | 2 | 0.73 | 0.59 | 0.69 | 0.64 | 0.28 |
| | QSOrder | 42 | 0.95 | 0.92 | 0.84 | 0.88 | 0.77 |

Analysis of Single-Feature Results:

  • Best Performers: On the 10-fold cross-validation, AAC shows the highest AUROC (0.91), ACC (0.85), and MCC (0.69). On the independent set, APAAC performs best with an AUROC of 0.97, followed closely by AAC (0.96) and QSOrder (0.95). These features, particularly AAC and APAAC, seem to capture highly discriminative information.

  • Worst Performers: SOCNumber consistently performs the worst in both cross-validation (AUROC 0.70) and independent validation (AUROC 0.73). This is attributed to its low dimensionality (2 features), indicating it might not provide sufficient information on its own. GAAC and GDPC also show relatively lower performance.

  • Dimension vs. Performance: Interestingly, features with higher dimensions like ASDC, DPC, and DDE (all 400 dimensions) do not necessarily outperform lower-dimensional features like AAC (20 dimensions) or APAAC (22 dimensions). This suggests that the quality and relevance of the information encoded by the feature are more important than merely the number of dimensions.

  • Importance of Physicochemical Properties: The introduction mentions that hydrophobic amino acids and their positions are crucial for bitter taste. Features like TPAAC and APAAC, which explicitly incorporate hydrophobicity and hydrophilicity, show good performance, supporting this premise.

  • Independent Set Insights: The independent set validation generally shows higher AUROC values for several features (e.g., AAC from 0.91 to 0.96, APAAC from 0.89 to 0.97), suggesting that the models are generalizing well, and some features might be even more robust on unseen data.

    This analysis suggests that while some single features are strong predictors, there is potential for improvement by combining their complementary information. The paper notes that "some single features with poor performance have rich information that AAC does not have and can improve prediction performance," motivating the next step of feature fusion.

6.2. Fusion Feature Processing

Prior to training the final model, the 10 individual features were combined into a single feature vector. This process also involved a de-zero operation to remove redundant or non-informative dimensions.

The following are the results from Table 2 of the original paper:

| Feature | Dimension | Dimension after de-zero operation |
| --- | --- | --- |
| AAC | 20 | 20 |
| TPAAC | 21 | 21 |
| APAAC | 22 | 22 |
| ASDC | 400 | 366 |
| DPC | 400 | 303 |
| DDE | 400 | 400 |
| GAAC | 5 | 5 |
| GDPC | 25 | 25 |
| SOCNumber | 2 | 2 |
| QSOrder | 42 | 42 |
| Total of features | 1,337 | 1,206 |

Analysis of Feature Fusion:

  • Total Initial Dimensions: Summing the individual dimensions of the 10 features gives 20 + 21 + 22 + 400 + 400 + 400 + 5 + 25 + 2 + 42 = 1,337.
  • De-zeroing Effect: The de-zero operation, which removes columns containing only zero values, reduced the total feature dimension from 1,337 to 1,206. This means 131 features (1337 - 1206) were found to be all zeros and were removed. This reduction is beneficial as it removes non-informative features, potentially reducing noise and computational load without losing discriminative power.
  • Specific Reductions: ASDC saw a reduction from 400 to 366, and DPC from 400 to 303. This indicates that some specific dipeptide combinations or skip dipeptide combinations were absent in the dataset, leading to zero counts for those features. Other features maintained their original dimensions, implying all their components had non-zero values across the dataset.

6.3. Fusion-feature-based results

The fused feature set (1,206 dimensions) was then used to train an RF model, and its performance was compared against the three best-performing single features (AAC, APAAC, QSOrder) identified in the single-feature analysis.

The following are the results from Table 3 of the original paper:

| ML method | Cross-validation | Feature | Dimension | AUROC | Sn | Sp | Acc | MCC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Random Forest | 10-fold cross-validation | AAC | 20 | 0.91 | 0.85 | 0.84 | 0.85 | 0.69 |
| | | APAAC | 22 | 0.89 | 0.83 | 0.81 | 0.82 | 0.64 |
| | | QSOrder | 42 | 0.89 | 0.82 | 0.82 | 0.82 | 0.64 |
| | | Fusion | 1,206 | 0.93 | 0.86 | 0.84 | 0.85 | 0.70 |
| | Independent set validation | AAC | 20 | 0.96 | 0.91 | 0.89 | 0.90 | 0.80 |
| | | APAAC | 22 | 0.97 | 0.89 | 0.91 | 0.90 | 0.80 |
| | | QSOrder | 42 | 0.95 | 0.92 | 0.84 | 0.88 | 0.77 |
| | | Fusion | 1,206 | 0.98 | 0.94 | 0.94 | 0.94 | 0.88 |

Analysis of Fusion-Feature Results:

  • 10-fold Cross-Validation:
    • The Fusion feature set achieved an AUROC of 0.93, Sn of 0.86, Sp of 0.84, Acc of 0.85, and MCC of 0.70.
    • This is an improvement over AAC (AUROC 0.91, MCC 0.69), APAAC (AUROC 0.89, MCC 0.64), and QSOrder (AUROC 0.89, MCC 0.64). The MCC for the fusion set (0.70) is notably higher, indicating a better overall balance in prediction performance.
  • Independent Set Validation:
    • The Fusion feature set demonstrated excellent performance with an AUROC of 0.98, Sn of 0.94, Sp of 0.94, Acc of 0.94, and MCC of 0.88.

    • This is a significant improvement over all single features: AAC (AUROC 0.96, MCC 0.80), APAAC (AUROC 0.97, MCC 0.80), and QSOrder (AUROC 0.95, MCC 0.77). The AUROC of 0.98 for the Fusion set is the highest achieved, and the MCC of 0.88 is substantially better than any single feature, signifying a highly robust and accurate model.

      The results strongly indicate that fusing diverse features provides a more comprehensive representation of bitter peptides, leading to enhanced predictive capabilities. The statement "the prediction performance of fusion features was improved or remained unchanged compared with single feature prediction" is validated, with clear improvements observed, especially on the independent set.

The following figure (Figure 2 from the original paper) shows the performance of the Bitter-RF model in bitter peptide classification. Panel A presents ROC curves from 10-fold cross-validation and an independent validation set, panel B compares ROC curves of different feature sets, and panels C and D compare average and specific performance metrics (AUC, Sn, Sp, Acc, Mcc) for different feature combinations.

FIGURE 2

Figure 2 Interpretation:

  • Panel A (ROC curves for Fusion feature): Shows high AUROC values for both 10-fold cross-validation (0.93) and independent validation (0.98), visually confirming the model's strong discriminatory power. The curve for the independent set is closer to the top-left corner, indicating better performance.
  • Panel B (Comparison of ROC curves for different features on independent set): Visually confirms that the Fusion feature set has the highest AUROC (0.98), surpassing APAAC (0.97), AAC (0.96), and QSOrder (0.95). This plot clearly supports the benefit of feature fusion.
  • Panel C (Performance comparison on 10-fold cross-validation): A bar chart comparing AUROC, Sn, Sp, Acc, and Mcc for Fusion, AAC, APAAC, and QSOrder. The Fusion bar is consistently higher or comparable, particularly for AUROC and Mcc.
  • Panel D (Performance comparison on independent set): Similar to Panel C, but for the independent set. Here, the Fusion feature shows a clear advantage across all metrics, especially AUROC and Mcc, further solidifying its superior performance.

6.4. Comparison with other machine learning methods on fusion features

To further validate the choice of Random Forest, the Fusion feature set was used to train and test other traditional machine learning algorithms: SVM, LightGBM, Decision Trees (DT), and Logistic Regression (LR).

The following are the results from Table 4 of the original paper:

| Cross-validation | Feature | ML method | AUROC | Sn | Sp | Acc | MCC |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 10-fold cross-validation | Fusion | SVM | 0.67 | 0.51 | 0.80 | 0.66 | 0.34 |
| | Fusion | LightGBM | 0.92 | 0.85 | 0.85 | 0.85 | 0.70 |
| | Fusion | DT | 0.80 | 0.83 | 0.77 | 0.80 | 0.60 |
| | Fusion | LR | 0.82 | 0.74 | 0.77 | 0.76 | 0.52 |
| | Fusion | RF | 0.93 | 0.86 | 0.84 | 0.85 | 0.70 |
| Independent set validation | Fusion | SVM | 0.74 | 0.61 | 0.78 | 0.70 | 0.40 |
| | Fusion | LightGBM | 0.97 | 0.92 | 0.91 | 0.91 | 0.83 |
| | Fusion | DT | 0.94 | 0.94 | 0.84 | 0.89 | 0.78 |
| | Fusion | LR | 0.89 | 0.80 | 0.84 | 0.82 | 0.64 |
| | Fusion | RF | 0.98 | 0.94 | 0.94 | 0.94 | 0.88 |

The following figure (Figure 3 from the original paper) shows the performance comparison of different machine learning models across various metrics (AUC, Sn, Sp, Acc, Mcc) in two parts, A and B. The Random Forest (RF) model stands out with superior performance on most metrics compared to other models.

FIGURE 3

Analysis of ML Method Comparison:

  • 10-fold Cross-Validation:
    • RF (AUROC 0.93, MCC 0.70) and LightGBM (AUROC 0.92, MCC 0.70) showed the best performance, with RF slightly edging out LightGBM in AUROC.
    • SVM performed significantly worse (AUROC 0.67, MCC 0.34), suggesting it might not be well-suited for this dataset or feature space, or its hyperparameters were not optimally tuned.
    • DT and LR showed intermediate performance.
  • Independent Set Validation:
    • RF demonstrated superior performance with an AUROC of 0.98 and MCC of 0.88.

    • LightGBM was a close second (AUROC 0.97, MCC 0.83).

    • DT also performed well (AUROC 0.94, MCC 0.78), indicating some ability to generalize, but its Sp (0.84) was clearly lower than RF's (0.94).

    • SVM again showed the lowest performance (AUROC 0.74, MCC 0.40).

      The results confirm that the RF method, when applied to the fused features, is either superior or equal to other tested machine learning methods across various indicators, especially on the critical independent validation set. This validates the choice of Random Forest as the classifier for Bitter-RF.

6.5. Comparison with existing models

To ultimately demonstrate the effectiveness of Bitter-RF, its performance was compared with four existing state-of-the-art bitter peptide prediction models. The performance indicators for these models were obtained from relevant literature.

The following are the results from Table 5 of the original paper:

| Cross-validation | Classifier | AUROC | Sn | Sp | Acc | Mcc |
|---|---|---|---|---|---|---|
| 10-fold cross-validation | iBitter-SCM | 0.90 | 0.91 | 0.83 | 0.87 | 0.75 |
| 10-fold cross-validation | BERT4Bitter | 0.92 | 0.87 | 0.85 | 0.86 | 0.73 |
| 10-fold cross-validation | iBitter-Fuse | 0.94 | 0.92 | 0.92 | 0.92 | 0.84 |
| 10-fold cross-validation | iBitter-DRLF | 0.95 | 0.89 | 0.89 | 0.89 | 0.78 |
| 10-fold cross-validation | Bitter-RF | 0.93 | 0.86 | 0.84 | 0.85 | 0.70 |
| Independent set validation | iBitter-SCM | 0.90 | 0.84 | 0.84 | 0.84 | 0.69 |
| Independent set validation | BERT4Bitter | 0.96 | 0.94 | 0.91 | 0.92 | 0.84 |
| Independent set validation | iBitter-Fuse | 0.93 | 0.94 | 0.92 | 0.93 | 0.86 |
| Independent set validation | iBitter-DRLF | 0.98 | 0.92 | 0.98 | 0.94 | 0.89 |
| Independent set validation | Bitter-RF | 0.98 | 0.94 | 0.94 | 0.94 | 0.88 |

The following figure (Figure 4 from the original paper) shows radar charts of the overall and detailed performance of Bitter-RF and other models across various metrics (AUC, Sn, Sp, Acc, Mcc), clearly comparing the strengths and weaknesses of each model.

Analysis of Comparison with Existing Models:

  • 10-fold Cross-Validation:
    • Bitter-RF (AUROC 0.93, MCC 0.70) performed similarly to BERT4Bitter (AUROC 0.92, MCC 0.73) but slightly lower than iBitter-Fuse (AUROC 0.94, MCC 0.84) and iBitter-DRLF (AUROC 0.95, MCC 0.78). This indicates that on internal cross-validation, some other models, especially those using more complex feature engineering or deep learning, might have a slight edge.
  • Independent Set Validation: This is the most crucial comparison for generalization ability.
    • Bitter-RF achieved an outstanding AUROC of 0.98, matching the best performance of iBitter-DRLF.
    • Bitter-RF also achieved high Sn (0.94), Sp (0.94), Acc (0.94), and MCC (0.88).
    • Compared to iBitter-SCM (AUROC 0.90, MCC 0.69) and iBitter-Fuse (AUROC 0.93, MCC 0.86), Bitter-RF shows clear improvements in AUROC and competitive MCC.
    • Against BERT4Bitter (AUROC 0.96, MCC 0.84), Bitter-RF shows higher AUROC and comparable MCC.
    • While iBitter-DRLF has a slightly higher MCC (0.89) and Sp (0.98), Bitter-RF matches its AUROC and has a higher Sn (0.94 vs 0.92). This suggests Bitter-RF is very competitive.

Key Takeaways:

  • Bitter-RF demonstrates strong generalization ability on unseen data, often outperforming or matching the latest deep learning-based models in terms of AUROC.
  • The Sn, Sp, Acc, and MCC of Bitter-RF generally match or exceed those of the three earlier predictors (iBitter-SCM, BERT4Bitter, and iBitter-Fuse), especially on the independent set.
  • A significant advantage of Bitter-RF is that it achieves this high performance using a traditional machine learning method (Random Forest), which typically consumes fewer computational resources compared to deep learning approaches (like BERT4Bitter or deep learning pre-training in iBitter-DRLF). This makes Bitter-RF more accessible and practical.
  • The authors acknowledge that iBitter-DRLF has a slightly better MCC, but Bitter-RF is very competitive while using a less computationally intensive method.

6.6. Ablation Studies / Parameter Analysis

The paper does not present an ablation study in the traditional sense (removing components of the Bitter-RF model to measure their individual impact). However, the comparison of the Fusion feature set with single features (Table 3 and Figure 2) serves as an indirect form of ablation, demonstrating the synergistic effect of combining different feature types: the fused set of all 10 features outperforms any single feature alone. The de-zero operation on features (Table 2) is likewise a form of preprocessing that improved efficiency, and potentially robustness, by removing non-informative dimensions. The paper mentions optimizing the parameters of the feature encodings, but no detailed tuning results or sensitivity analyses for the Random Forest hyperparameters (e.g., number of trees, maximum depth) are provided.
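As a rough illustration of the de-zero idea (dropping feature dimensions that are zero for every training peptide), here is a minimal sketch; the authors only summarize this preprocessing, so the details below are assumptions:

```python
import numpy as np

def drop_all_zero_columns(X):
    """Remove feature dimensions that are zero across all samples.

    Such columns carry no discriminative information and only add noise
    and computational cost to the classifier.
    """
    keep = ~np.all(X == 0, axis=0)   # True for columns with at least one non-zero value
    return X[:, keep], keep          # return the mask so the same columns can be dropped later
```

The returned mask would also have to be applied to the features of any new peptide at prediction time so that the dimensions stay aligned with the training matrix.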

7. Conclusion & Reflections

7.1. Conclusion Summary

This study successfully developed Bitter-RF, a novel Random Forest-based machine learning model for accurately identifying bitter peptides from their sequence information. By integrating a comprehensive set of 10 diverse sequence-derived features, Bitter-RF achieved superior predictive performance, particularly on an independent validation set, where it recorded an impressive AUROC of 0.98. The research demonstrated that the fusion of multiple features significantly enhances prediction accuracy compared to individual features, and that the Random Forest algorithm is highly effective for this protein classification task, outperforming other traditional machine learning methods. Bitter-RF stands as a competitive model against existing state-of-the-art predictors, offering comparable or better accuracy with potentially lower computational demands. This work not only provides a valuable tool for bitter peptide research but also extends the application of Random Forest in bioinformatics.

7.2. Limitations & Future Work

The authors acknowledge one primary limitation and propose future work:

  • Limitation: The intrinsic robustness of the bitter/non-bitter classification model may be affected by the inherent bias of the training/test set data. This suggests that while the model performs well on the current dataset, its performance on other, potentially more diverse or differently distributed, bitter peptide datasets is yet to be fully explored. The paper explicitly states, "it cannot be excluded that the model may be affected by the inherent bias of training/test set data."
  • Future Work: The authors plan to apply feature selection techniques (e.g., the methods cited as references 83-86) to identify the most informative features within the current comprehensive set. This optimization aims to further improve performance by focusing on the most discriminative features and potentially reducing dimensionality.
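One concrete way such a selection step could look, using importance-based selection in scikit-learn (the techniques actually cited in references 83-86 may differ), is sketched below:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# X_fusion, y are the fused features and bitter/non-bitter labels (placeholders).
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=500, random_state=0),
    threshold="median",   # keep the more informative half of the fused dimensions
)
# X_selected = selector.fit_transform(X_fusion, y)
# support_mask = selector.get_support()   # which fused dimensions were retained
```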

7.3. Personal Insights & Critique

This paper presents a solid and practical approach to a significant problem in peptide research. The comprehensive feature engineering is a major strength, as is the systematic comparison of single vs. fused features and various ML algorithms. The choice of Random Forest is well-justified by its performance and computational efficiency, especially when compared to complex deep learning models for practical applications.

Strengths:

  • Thorough Feature Engineering: The use of 10 diverse features, covering various aspects of peptide sequence and physicochemical properties, is a key differentiator and likely contributes significantly to the model's high performance.
  • Rigorous Evaluation: The use of both 10-fold cross-validation and an independent validation set, along with a comprehensive set of metrics (AUROC, Sn, Sp, Acc, MCC), provides a robust assessment of the model's performance and generalization ability.
  • Comparative Analysis: The systematic comparison against multiple traditional ML methods and four generations of existing bitter peptide predictors firmly establishes Bitter-RF's competitive edge.
  • Practicality: The development of a Python package (GitHub link provided) makes the model directly usable by other researchers, fostering scientific progress. The emphasis on lower computational resources compared to deep learning models is also a practical advantage.

Potential Issues/Areas for Improvement:

  • Parameter Tuning Details: Although Random Forest is used, the paper does not report the specific hyperparameters (e.g., number of trees, maximum depth, max_features) or whether a hyperparameter optimization strategy was employed. Such details would enhance reproducibility and build confidence that the configuration is near-optimal (a hedged tuning sketch is given after this list).
  • Feature Importance Analysis: Given the fusion of 10 different features, an analysis of feature importance (e.g., from the Random Forest model) could provide deeper biological insights into which sequence characteristics are most predictive of bitterness. This could also guide future biological experiments.
  • Dataset Bias: The authors acknowledge potential dataset bias. Expanding the dataset with more diverse bitter and non-bitter peptides from various sources and ensuring balanced representation of peptide lengths and compositions could further enhance the model's generalizability.
  • Dimensions of SOCNumber and QSOrder: There is a slight discrepancy between the theoretical dimensions of SOCNumber (nlag, default 30) and QSOrder (20 + nlag, default 50) and the dimensions reported in Table 1 (2 and 42, respectively). Clarifying this, or stating the specific iLearnPlus configuration used, would be helpful for beginners.
  • Lack of Ablation Study: While the fusion vs. single feature comparison is helpful, a more detailed ablation study (e.g., removing specific feature types from the fusion set) could quantitatively demonstrate the contribution of each feature category to the overall performance.
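Regarding the hyperparameter point above, here is a minimal sketch of how such a search could be documented, assuming scikit-learn's GridSearchCV; the grid values are illustrative, not the authors' settings:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 10, 20],
    "max_features": ["sqrt", "log2"],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
)
# search.fit(X_fusion, y)
# print(search.best_params_, search.best_score_)
# search.best_estimator_.feature_importances_  # would also support the feature-importance analysis suggested above
```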

Transferability and Future Value: The methods, particularly the comprehensive feature engineering and the demonstration of Random Forest's effectiveness, could be transferred to other peptide classification tasks (e.g., predicting antimicrobial peptides, anticancer peptides, or allergenic peptides). The emphasis on capturing diverse sequence information and physicochemical properties is a generalizable principle in bioinformatics. Bitter-RF provides a solid foundation, and its open-source availability ensures its immediate utility and potential for further development by the research community. The identified avenues for future work, especially feature selection, promise further refinements to the model.
