Bitter-RF: A random forest machine model for recognizing bitter peptides
TL;DR Summary
Bitter-RF, a random forest model integrating 10 peptide sequence features, achieves high accuracy (AUROC=0.98) in bitter peptide recognition, pioneering RF use in this domain and enhancing protein classification methods.
In-depth Reading
1. Bibliographic Information
1.1. Title
Bitter-RF: A random forest machine model for recognizing bitter peptides
1.2. Authors
Yu-Fei Zhang, Yu-Hao Wang, Zhi-Feng Gu, Xian-Run Pan, Jian Li, Hui Ding, Yang Zhang, and Ke-Jun Deng. The authors are primarily affiliated with the School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China; Innovative Institute of Chinese Medicine and Pharmacy, Academy for Interdiscipline, Chengdu University of Traditional Chinese Medicine, Chengdu, China; and School of Basic Medical Sciences, Chengdu University, Chengdu, China. The corresponding authors are Hui Ding, Yang Zhang, and Ke-Jun Deng. Their research backgrounds appear to be in bioinformatics, computational biology, and possibly traditional Chinese medicine, focusing on machine learning applications in biological sequence analysis.
1.3. Journal/Conference
Published in Frontiers in Medicine, Volume 10, Article 1052923. Frontiers in Medicine is an open-access peer-reviewed journal publishing across various fields of medical research. It is generally considered a reputable journal within the "Frontiers" publishing family, contributing to the dissemination of medical and biomedical research.
1.4. Publication Year
2023
1.5. Abstract
This paper introduces Bitter-RF, a Random Forest (RF)-based machine learning model designed for recognizing bitter peptides using their sequence information. Bitter peptides are short peptides with significant potential medical applications that remain largely unexplored. To facilitate their practical utilization, an accurate classification method is crucial. Bitter-RF integrates 10 distinct features extracted from peptide sequences, aiming for a more comprehensive representation of peptide information. The model demonstrates superior performance compared to existing state-of-the-art models on an independent validation set, achieving an Area Under the Receiver Operating Characteristic curve (AUROC) of 0.98. This research not only enhances the accuracy of bitter peptide classification but also expands the application of the RF method in protein classification tasks, a domain where it had not been previously used for bitter peptide prediction. The authors hope Bitter-RF will serve as a valuable tool for researchers in bitter peptide studies.
1.6. Original Source Link
/files/papers/690dd8947a8fb0eb524e6853/paper.pdf (This link points to a local file path provided by the user, implying it is the PDF of the paper).
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the accurate and efficient identification of bitter peptides. Bitter peptides are short amino acid sequences that elicit a bitter taste. While often associated with spoilage or toxins, some bitter peptides possess significant potential medical applications, such as regulating blood glucose (e.g., peptides from Momordica charantia). However, their full therapeutic value remains largely untapped.
The problem is important because traditional experimental methods for identifying bitter peptides are complex, time-consuming, expensive, and often inaccurate. These biological methods typically involve laborious steps like gel separation, multiple rounds of liquid chromatography, purification, and identification using specialized instruments like Fourier transform infrared spectroscopy (FTIR), which are not universally accessible. Human sensory evaluations, while sometimes used, are subjective and can lead to inconsistent results. There is a clear need for a more efficient and accurate classification method to unlock the practical value of bitter peptides.
The paper's entry point or innovative idea is to develop a machine learning (ML) model, specifically using the Random Forest (RF) algorithm, that leverages comprehensive sequence information (features) to predict bitter peptides. Previous computational methods existed, including Quantitative Structure-Activity Relationship (QSBR) models and earlier generations of sequence-based models, but they either focused on structural properties rather than sequence directly, or suffered from issues like information redundancy, overfitting, or suboptimal feature representation. This study seeks to improve upon these by integrating a wider array of sequence-derived features and applying an RF model, which is noted for its robustness and adaptability to high-dimensional data, to this specific classification task for the first time.
2.2. Main Contributions / Findings
The primary contributions and key findings of the paper are:
- Development of Bitter-RF Model: The authors developed a novel Random Forest (RF)-based machine learning model named Bitter-RF for the accurate recognition of bitter peptides. This is highlighted as the first application of the RF method to build a predictive model specifically for bitter peptides.
- Comprehensive Feature Integration: Bitter-RF integrates a comprehensive set of 10 different sequence-derived features, covering peptide composition, physicochemical properties, and sequence order. This multi-perspective feature set (initially 1,337 dimensions, reduced to 1,206 after removing all-zero columns) provides richer information for classification.
- Superior Performance: The model demonstrates significantly improved prediction accuracy, especially on the independent validation set, where it achieved an AUROC (Area Under the Receiver Operating Characteristic curve) of 0.98, comparable to or better than the latest generation of existing models. The other key metrics were also strong on the independent set (Sn = 0.94, Sp = 0.94, Acc = 0.94, MCC = 0.88).
- Validation of Feature Fusion and the RF Method: The study systematically showed that fusing multiple features yields better predictive performance than single features, and that the Random Forest algorithm outperformed other traditional machine learning methods (SVM, LightGBM, Decision Trees, Logistic Regression) on the fused feature set for this task.
- Enrichment of Protein Classification Applications: The research extends the practical application of the RF method in protein classification, providing a robust model for bitter peptide identification that can guide further research and potential medical applications.
- Open-Source Tool: A free and easy-to-use Python package for Bitter-RF has been made available on GitHub, providing a practical tool for scholars in bitter peptide research.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a reader should be familiar with basic concepts in molecular biology, machine learning, and statistical evaluation.
- Peptides and Amino Acids:
  - Amino Acids: The basic building blocks of proteins and peptides. There are 20 common types, each with a unique side chain that confers different physicochemical properties (e.g., hydrophobicity, charge, size).
  - Peptides: Short chains of amino acids linked by peptide bonds. Bitter peptides are a specific class of peptides that elicit a bitter taste perception, often due to their hydrophobic amino acid content or sequence arrangement.
  - Peptide Sequence Information: The linear order of amino acids in a peptide chain. This sequence dictates the peptide's properties and potential function.
- Machine Learning (ML): A field of artificial intelligence that enables systems to learn from data without being explicitly programmed.
  - Classification Task: A type of supervised learning where an algorithm learns to assign input data into predefined categories (e.g., "bitter peptide" or "non-bitter peptide").
  - Features: Measurable properties or attributes of the data that the ML model uses to learn and make predictions. In this paper, features are derived from peptide sequences.
  - Model Training: The process where an ML algorithm learns patterns from a training dataset to build a predictive model.
  - Model Validation/Testing: Evaluating the performance of a trained model on unseen data (a validation set or independent set) to assess its generalization ability.
  - Supervised Learning: A type of machine learning where the algorithm learns from labeled data (i.e., data where the correct output/category is already known).
- Specific Machine Learning Algorithms:
  - Random Forest (RF): An ensemble learning method that constructs many decision trees at training time and outputs the class that is the mode of the individual trees' classes (classification) or their mean prediction (regression). Each tree is built from a random subset of the training data and a random subset of features, which reduces overfitting and improves accuracy.
  - Support Vector Machine (SVM): A supervised learning model used for classification and regression. It finds an optimal hyperplane that best separates data points of different classes in a high-dimensional space.
  - Light Gradient Boosting Machine (LightGBM): A gradient boosting framework that uses tree-based learning algorithms. It is designed to be highly efficient and scalable, particularly for large datasets, through techniques such as Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB).
  - Decision Tree (DT): A non-parametric supervised learning method used for classification and regression. It partitions the data into subsets based on feature values, creating a tree-like model of decisions and their possible consequences.
  - Logistic Regression (LR): A statistical model for binary classification. It uses a logistic function to estimate the probability of a binary outcome (e.g., bitter or non-bitter), which is then mapped to one of two discrete classes.
- Cross-validation (e.g., 10-fold cross-validation): A technique to assess how the results of a statistical analysis will generalize to an independent dataset. In k-fold cross-validation, the dataset is divided into k equally sized folds; the model is trained on k - 1 folds and tested on the remaining fold, and this process is repeated k times so that each fold is used exactly once as the test set. 10-fold cross-validation means k = 10.
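The sketch below illustrates 10-fold cross-validation with scikit-learn; it is only a minimal example (the random feature matrix and labels are placeholders, and the use of scikit-learn is an assumption for illustration, not something this analysis takes from the paper).

```python
# Minimal 10-fold cross-validation sketch (placeholder data, scikit-learn assumed).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X = np.random.rand(512, 1206)            # placeholder feature matrix (training-set sized)
y = np.random.randint(0, 2, size=512)    # placeholder labels: 1 = bitter, 0 = non-bitter

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y,
                         cv=cv, scoring="roc_auc")
print(f"Mean AUROC over the 10 folds: {scores.mean():.3f}")
```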
3.2. Previous Works
The paper frames its work in the context of previous efforts to predict bitter peptides, broadly categorizing them into experimental and computational methods, and further detailing four generations of sequence-based computational models.
3.2.1. Experimental Methods
- Process: Involves extracting bitter peptides from raw materials, gel separation, multiple rounds of liquid chromatography, purification, and then identification using techniques like Fourier Transform Infrared Spectroscopy (FTIR).
- Limitations: Complex, time-consuming, instrument-dependent (not universal), and potentially inaccurate due to human sensory evaluation involvement.
3.2.2. Computational Methods (QSBR Models)
- Quantitative Structure-Activity Relationship (QSAR/QSBR): Models that attempt to find a correlation between the structural properties of molecules and their biological activity (e.g., bitterness).
- Techniques used: Multiple linear regression, Support Vector Machine (SVM), Artificial Neural Network (ANN).
- Example cited: A model based on 229 experimental bitterness values, extracting 1292 descriptors with the Dragon 5.4 software, reducing them to 244, and then selecting the six best-scoring descriptors (SPAN, Mean Square Distance (MSD), E3s, G3p, Hats8U, and 3D-MoRSE) with GAPLS (Genetic Algorithm Partial Least Squares) for QSAR model construction. These descriptors represent molecular dimensions, atom counts, electrical topological states, WHIM indices, spatial autocorrelation, and molecular size/mass/volume.
3.2.3. Sequence-based Computational Models (Four Generations)
The paper highlights an evolution of sequence-based models, providing context for its own Bitter-RF model.
- First-generation model (iBitter-SCM):
  - Method: Used dipeptide propensity scores to predict bitter peptides. Dipeptide propensity refers to the likelihood of specific pairs of amino acids appearing together in bitter peptides versus non-bitter peptides.
  - Limitation: Extracted only a few characteristics, potentially limiting information capture.
  - Reference: Charoenkwan et al., 2020 (22).
- Second-generation model (BERT4Bitter):
  - Method: Utilized deep learning, specifically Bidirectional Encoder Representations from Transformers (BERT). BERT is a powerful neural network model pre-trained on large text corpora, adapted here to learn contextual representations of peptide sequences.
  - Potential problems: The authors suggest possible information redundancy and overfitting, common challenges for deep learning models, especially with limited data.
  - Reference: Charoenkwan et al., 2021 (23).
- Third-generation model (iBitter-Fuse):
  - Method: Integrated five peptide features to characterize bitter peptides and built a prediction model, likely using SVM as indicated in its reference. The five features are not explicitly listed in this paper's discussion of previous works, but the reference (Charoenkwan et al., 2021 (24)) indicates it combines "multi-view features."
  - Limitation: The representativeness of the features might need further optimization.
  - Reference: Charoenkwan et al., 2021 (24).
- Fourth-generation model (iBitter-DRLF):
  - Method: Extracted features through deep learning pre-training and then built a prediction model based on Light Gradient Boosting Machine (LGBM), combining the representation learning power of deep learning with the efficiency of LGBM for classification.
  - Reference: Zhang et al., 2022 (26).
3.3. Technological Evolution
The evolution of bitter peptide identification has moved:

1. From laborious, expensive, and subjective experimental methods (e.g., FTIR-based identification, human sensory evaluation).
2. To computational methods based on quantitative structure-activity/bitterness relationships (QSBR), which correlate molecular structure with bitterness using various ML algorithms (SVM, ANN).
3. Then to sequence-based computational models, which are more practical because they require only the amino acid sequence. These began with simpler models using dipeptide propensity scores (iBitter-SCM), progressed to deep learning approaches (BERT4Bitter), then to feature fusion with traditional ML (iBitter-Fuse), and finally to hybrid approaches combining deep learning for feature extraction with gradient boosting machines for prediction (iBitter-DRLF).

This paper's Bitter-RF fits into this timeline by building on the concept of feature fusion (as in iBitter-Fuse) while expanding the number and types of features considerably (10 features) and employing a Random Forest classifier, which proves highly effective for this problem, in contrast to previous models that used SVM or LightGBM for the final classification step.
3.4. Differentiation Analysis
Compared to the main methods in related work, Bitter-RF differentiates itself through several core innovations:
- Expanded Feature Set: While previous feature fusion models (e.g., iBitter-Fuse) used a limited number of features (e.g., five), Bitter-RF integrates a more comprehensive set of 10 sequence-derived features. This broader scope aims to capture more diverse and extensive information about bitter peptides, including amino acid composition, pseudo-amino acid composition, dipeptide composition, and sequence-order-coupling numbers, which reflect physicochemical properties and sequential relationships.
- Novel Application of Random Forest (RF): The paper explicitly states that the Random Forest method "has not been used to build a prediction model for bitter peptides." Bitter-RF introduces RF as a robust and effective classifier for this task, demonstrating superior performance over other traditional ML methods (SVM, LightGBM, DT, LR) and competitive performance against state-of-the-art deep learning-based models.
- Improved Accuracy on the Independent Set: Bitter-RF achieves a high AUROC of 0.98 on the independent validation set, outperforming several previous generations of models and matching the performance of iBitter-DRLF, which relies on computationally intensive deep learning pre-training.
- Computational Efficiency: By using a traditional machine learning method, Bitter-RF delivers strong prediction performance while consuming fewer computing resources than complex deep learning models, making it more accessible and practical for general research use.
4. Methodology
4.1. Principles
The core idea behind Bitter-RF is to accurately classify bitter peptides by leveraging a rich set of information derived from their amino acid sequences using a robust machine learning algorithm. The theoretical basis is that the bitter taste property of peptides is encoded within their sequence composition and arrangement, reflecting underlying physicochemical properties (e.g., hydrophobicity, hydrophilicity) and sequential patterns. By extracting diverse features that capture these characteristics and combining them, a machine learning model can learn to distinguish bitter peptides from non-bitter ones. The Random Forest algorithm is chosen for its ability to handle high-dimensional data, reduce overfitting, and maintain strong predictive power.
4.2. Core Methodology In-depth (Layer by Layer)
The construction of the Bitter-RF model involves several key steps: dataset preparation, comprehensive feature extraction, feature fusion, model training using Random Forest, and performance evaluation.
4.2.1. Dataset Source
The foundation of Bitter-RF is a high-quality benchmark dataset. The study utilizes the same dataset as previous generations of bitter peptide prediction models (references 22-24) to ensure a fair comparison. This dataset, accessible from http://pmlab.pythonanywhere.com/BERT4Bitter, was originally compiled by manually collecting experimentally validated bitter peptides from various scientific literature.
The dataset characteristics are as follows:
- Total Records: 640
- Bitter Peptides: 320 (experimentally validated)
- Non-Bitter Peptides: 320 (randomly generated from the BIOPEP database)

To objectively evaluate the model's performance, the dataset was rigorously split into:

- Training Set: Used to train the machine learning model. It contains 512 records (80% of the total), specifically 256 bitter peptides and 256 non-bitter peptides.
- Independent Set: Used to validate the model's generalization ability on unseen data. It contains 128 records (20% of the total), specifically 64 bitter peptides and 64 non-bitter peptides.
4.2.2. Feature Extraction
Feature extraction is a critical step in machine learning models based on biological sequence data, as it aims to encode sequences in a way that reveals as much relevant information as possible. The authors used iLearnPlus (reference 37), a platform for sequence analysis, to extract 10 types of features from the bitter peptide sequences.
4.2.2.1. Amino Acid Composition (AAC)
The AAC encoding calculates the fractional frequencies of each of the 20 standard amino acids within a peptide sequence. This feature provides a basic compositional overview of the peptide.
The equation for AAC is:

$ f(t) = \frac{N(t)}{N}, \quad t \in \{A, C, \dots, Y\} $

Where:
- $f(t)$ is the frequency of amino acid type $t$.
- $N(t)$ is the number of occurrences of amino acid type $t$ in the peptide sequence.
- $N$ is the total length (number of amino acids) of the peptide sequence.
- $t \in \{A, C, \dots, Y\}$ indicates that $t$ can be any of the 20 standard amino acids.
- Dimension: 20 (one for each amino acid type).
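As a concrete illustration of the formula above, the following minimal Python sketch computes the 20-dimensional AAC vector for a peptide string; the function name and example sequence are illustrative only, not part of the paper's code.

```python
# Sketch of the AAC encoding: fraction of each of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(sequence: str) -> list[float]:
    """Return f(t) = N(t) / N for every standard amino acid t."""
    n = len(sequence)
    return [sequence.count(aa) / n for aa in AMINO_ACIDS]

print(aac("GPFPIIV"))  # illustrative 7-residue peptide
```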
4.2.2.2. Traditional Pseudo-Amino Acid Composition (TPAAC)
TPAAC, also known as type 1 pseudo-amino acid composition, extends AAC by incorporating sequence-order information and physicochemical properties. It considers three specific amino acid properties: hydrophobicity, hydrophilicity, and side-chain mass.
First, the original hydrophobicity values $H_1^o(i)$, hydrophilicity values $H_2^o(i)$, and side-chain masses $M^o(i)$ of the 20 amino acids are standardized with a standard-normal transformation:

$ H_1(i) = \frac{H_1^o(i) - \frac{1}{20}\sum_{i=1}^{20} H_1^o(i)}{\sqrt{\frac{\sum_{i=1}^{20}\left[H_1^o(i) - \frac{1}{20}\sum_{i=1}^{20} H_1^o(i)\right]^2}{20}}} $

$ H_2(i) = \frac{H_2^o(i) - \frac{1}{20}\sum_{i=1}^{20} H_2^o(i)}{\sqrt{\frac{\sum_{i=1}^{20}\left[H_2^o(i) - \frac{1}{20}\sum_{i=1}^{20} H_2^o(i)\right]^2}{20}}} $

$ M(i) = \frac{M^o(i) - \frac{1}{20}\sum_{i=1}^{20} M^o(i)}{\sqrt{\frac{\sum_{i=1}^{20}\left[M^o(i) - \frac{1}{20}\sum_{i=1}^{20} M^o(i)\right]^2}{20}}} $

Where:
- $H_1(i)$, $H_2(i)$, and $M(i)$ are the normalized hydrophobicity, hydrophilicity, and side-chain mass values for amino acid $i$.
- $H_1^o(i)$, $H_2^o(i)$, and $M^o(i)$ are the original values for amino acid $i$.
- The term $\frac{1}{20}\sum_{i=1}^{20} H_1^o(i)$ (and its analogs for $H_2^o$ and $M^o$) is the mean of the respective property over the 20 amino acids.
- The denominator is the standard deviation of the respective property over the 20 amino acids.

Next, a correlation function between the amino acids $R_i$ and $R_j$ (at positions $i$ and $j$ in the sequence) is defined:

$ \Theta(R_i, R_j) = \frac{1}{3}\left\{\left[H_1(R_i) - H_1(R_j)\right]^2 + \left[H_2(R_i) - H_2(R_j)\right]^2 + \left[M(R_i) - M(R_j)\right]^2\right\} $

This function measures the squared differences in hydrophobicity, hydrophilicity, and mass between the two amino acids, averaged over the three properties. The correlation function can also be defined for a single amino acid property or a general set of $n$ properties:

$ \Theta(R_i, R_j) = \left[H_1(R_i) - H_1(R_j)\right]^2 $

$ \Theta(R_i, R_j) = \frac{1}{n}\sum_{k=1}^{n}\left[H_k(R_i) - H_k(R_j)\right]^2 $

Where:
- $H_1(R_i)$ is the standardized amino acid property value of the amino acid at position $i$.
- $H_k(R_i)$ is the $k$-th attribute in the amino acid attribute set for the amino acid at position $i$.

Sequence-order-correlated factors $\theta_j$ are then computed:

$ \theta_1 = \frac{1}{N-1}\sum_{i=1}^{N-1}\Theta(R_i, R_{i+1}) $

$ \theta_2 = \frac{1}{N-2}\sum_{i=1}^{N-2}\Theta(R_i, R_{i+2}) $

$ \dots $

$ \theta_\lambda = \frac{1}{N-\lambda}\sum_{i=1}^{N-\lambda}\Theta(R_i, R_{i+\lambda}) $

Where:
- $\lambda$ is a correlation parameter indicating the maximum sequence separation (lag) considered; it must be smaller than the peptide length $N$. The paper sets $\lambda = 1$ for this study.
- $\theta_j$ is the $j$-th sequence-order correlation factor, obtained by averaging the correlation function over all amino acid pairs separated by $j$ positions (i.e., with $j - 1$ intervening residues).

Finally, the TPAAC descriptor for a peptide sequence combines the amino acid frequencies with these sequence-order-correlated factors:

$ X_c = \frac{f_c}{\sum_{r=1}^{20} f_r + \omega\sum_{j=1}^{\lambda}\theta_j}, \quad (1 \le c \le 20) $

$ X_c = \frac{\omega\,\theta_{c-20}}{\sum_{r=1}^{20} f_r + \omega\sum_{j=1}^{\lambda}\theta_j}, \quad (21 \le c \le 20+\lambda) $

Where:
- $f_c$ is the frequency of the $c$-th amino acid.
- $\omega$ is a weighting factor, set to 0.05 in this study.
- $\sum_{r=1}^{20} f_r$ is the sum of the frequencies of all 20 amino acids (typically 1).
- $\sum_{j=1}^{\lambda}\theta_j$ is the sum of the sequence-order correlation factors.
- For $1 \le c \le 20$, $X_c$ is the modified frequency of the $c$-th amino acid.
- For $21 \le c \le 20+\lambda$, $X_c$ is the corresponding sequence-order correlation factor scaled by $\omega$.
- Dimension: $20 + \lambda$. With $\lambda = 1$, the dimension is 21.
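A compact sketch of this computation is shown below, assuming the three property scales have already been normalized as in the equations above; the property dictionaries h1, h2, and m are hypothetical inputs (one value per amino acid letter), not the values used in the paper.

```python
# Sketch of TPAAC with lambda = 1 and w = 0.05, given pre-normalized property maps.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def theta(a: str, b: str, h1: dict, h2: dict, m: dict) -> float:
    """Correlation function Theta(Ri, Rj): mean squared property difference."""
    return ((h1[a] - h1[b]) ** 2 + (h2[a] - h2[b]) ** 2 + (m[a] - m[b]) ** 2) / 3.0

def tpaac(seq: str, h1: dict, h2: dict, m: dict,
          lam: int = 1, w: float = 0.05) -> list[float]:
    """Return the (20 + lam)-dimensional type-1 pseudo amino acid composition."""
    freqs = [seq.count(a) / len(seq) for a in AMINO_ACIDS]
    thetas = [sum(theta(seq[i], seq[i + j], h1, h2, m)
                  for i in range(len(seq) - j)) / (len(seq) - j)
              for j in range(1, lam + 1)]
    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]
```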
4.2.2.3. Amphiphilic Pseudo-Amino Acid Composition (APAAC)
APAAC is another type of PseAAC that focuses on the distribution patterns of hydrophobicity and hydrophilicity along the peptide chain. It comprises $20 + 2\lambda$ discrete numbers.

It starts from the normalized hydrophobicity and hydrophilicity values (Equations 2 and 3 in TPAAC) to define hydrophobicity and hydrophilicity correlation functions:

$ H_{i,j}^{1} = H_1(i)\,H_1(j) $

$ H_{i,j}^{2} = H_2(i)\,H_2(j) $

Where:
- $H_{i,j}^{1}$ and $H_{i,j}^{2}$ are the correlation values for hydrophobicity and hydrophilicity, respectively, between the amino acids at positions $i$ and $j$.

Next, sequence-order factors are formulated as:

$ \tau_1 = \frac{1}{N-1}\sum_{i=1}^{N-1} H_{i,i+1}^{1} $

$ \tau_2 = \frac{1}{N-1}\sum_{i=1}^{N-1} H_{i,i+1}^{2} $

$ \tau_3 = \frac{1}{N-2}\sum_{i=1}^{N-2} H_{i,i+2}^{1} $

$ \tau_4 = \frac{1}{N-2}\sum_{i=1}^{N-2} H_{i,i+2}^{2} $

continuing up to $2\lambda$ factors:

$ \tau_{2\lambda-1} = \frac{1}{N-\lambda}\sum_{i=1}^{N-\lambda} H_{i,i+\lambda}^{1} $

$ \tau_{2\lambda} = \frac{1}{N-\lambda}\sum_{i=1}^{N-\lambda} H_{i,i+\lambda}^{2} $

Where:
- $\tau_j$ ($j = 1, \dots, 2\lambda$) are the sequence-order factors.
- The lag ranges from 1 to $\lambda$, and $N$ is the peptide length.
- The paper sets $\lambda = 1$ for this study, so the lag only takes the value 1, yielding $\tau_1$ and $\tau_2$.

Finally, the APAAC descriptor is defined as:

$ P_c = \frac{f_c}{\sum_{r=1}^{20} f_r + w\sum_{j=1}^{2\lambda}\tau_j}, \quad (1 \le c \le 20) $

$ P_u = \frac{w\,\tau_{u-20}}{\sum_{r=1}^{20} f_r + w\sum_{j=1}^{2\lambda}\tau_j}, \quad (21 \le u \le 20+2\lambda) $

Where:
- $f_c$ is the frequency of the $c$-th amino acid.
- $w$ is the weighting factor, set to 0.5 in this study.
- $\lambda$ is the correlation parameter, set to 1 in this study.
- For $1 \le c \le 20$, $P_c$ is the modified frequency of the $c$-th amino acid.
- For $21 \le u \le 20+2\lambda$, $P_u$ is the corresponding sequence-order factor scaled by $w$.
- Dimension: $20 + 2\lambda$. With $\lambda = 1$, the dimension is 22.
4.2.2.4. Adaptive Skip Dinucleotide Composition (ASDC)
ASDC is a modified dipeptide composition that also counts pairs of non-adjacent residues, thereby taking the residues intervening between the two positions into account.

The feature vector for ASDC is defined as:

$ \mathrm{ASDC} = (f_{v_1}, f_{v_2}, \dots, f_{v_{400}}) $

$ f_{v_i} = \frac{\sum_{g=1}^{L-1} O_i^g}{\sum_{i=1}^{400}\sum_{g=1}^{L-1} O_i^g} $

Where:
- $f_{v_i}$ is the normalized occurrence frequency of the $i$-th possible dipeptide, considering all possible skips.
- $O_i^g$ is the number of occurrences of the $i$-th dipeptide with $g - 1$ intervening amino acids (i.e., the two residues are at distance $g$).
- $L$ is the length of the peptide sequence.
- The numerator $\sum_{g=1}^{L-1} O_i^g$ counts all occurrences of the $i$-th dipeptide type, irrespective of the skip distance.
- The denominator normalizes these counts by the total number of dipeptide occurrences over all types and all skip distances.
- Dimension: 400 (since there are $20 \times 20$ possible dipeptides).
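The following sketch mirrors the ASDC definition: every ordered residue pair is counted at every gap g, and the 400 counts are normalized to frequencies. The function name is illustrative, not the paper's implementation.

```python
# Sketch of the ASDC encoding: dipeptide counts accumulated over every gap g.
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]

def asdc(seq: str) -> list[float]:
    counts = {dp: 0 for dp in DIPEPTIDES}
    for g in range(1, len(seq)):              # gap between the two residues
        for i in range(len(seq) - g):
            counts[seq[i] + seq[i + g]] += 1
    total = sum(counts.values())
    return [counts[dp] / total for dp in DIPEPTIDES]
```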
4.2.2.5. Di-peptide Composition (DPC)
DPC describes the frequencies of all 400 possible dipeptide combinations (e.g., AA, AC, ..., YY) in a peptide sequence. It captures information about adjacent amino acid pairs.
The calculation method for DPC is:

$ D(r, s) = \frac{N_{rs}}{N - 1}, \quad r, s \in \{A, C, D, \dots, Y\} $

Where:
- $D(r, s)$ is the frequency of the dipeptide formed by amino acid type $r$ followed by amino acid type $s$.
- $N_{rs}$ is the number of times the dipeptide $rs$ appears in the peptide sequence.
- $N$ is the total length of the peptide sequence, so $N - 1$ is the number of adjacent dipeptide positions.
- Dimension: 400.
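A minimal sketch of the DPC encoding (adjacent pairs only) follows; it illustrates the formula rather than the paper's implementation, which relied on iLearnPlus.

```python
# Sketch of the DPC encoding: frequency of each adjacent residue pair.
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def dpc(seq: str) -> list[float]:
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    return [pairs.count(a + b) / (len(seq) - 1)
            for a, b in product(AMINO_ACIDS, repeat=2)]
```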
4.2.2.6. Dipeptide Deviation from Expected Mean (DDE)
DDE is a feature that quantifies how much the observed frequency of a dipeptide deviates from its theoretically expected frequency. It uses three parameters: the observed dipeptide composition $D_c$, the theoretical mean $T_m$, and the theoretical variance $T_v$.

- $D_c(r, s)$ is computed in the same way as DPC.

The theoretical mean and variance are calculated from codon usage:

$ T_m(r, s) = \frac{C_r}{C_N} \times \frac{C_s}{C_N} $

$ T_v(r, s) = \frac{T_m(r, s)\left(1 - T_m(r, s)\right)}{N - 1} $

Where:
- $T_m(r, s)$ is the theoretical mean frequency of the dipeptide $rs$.
- $T_v(r, s)$ is the theoretical variance of the dipeptide $rs$.
- $C_r$ and $C_s$ are the numbers of codons encoding amino acid types $r$ and $s$, respectively.
- $C_N$ is the total number of possible codons (excluding stop codons).
- $N - 1$ is the number of dipeptide positions in a sequence of length $N$.

Finally, DDE for a dipeptide $rs$ is calculated as:

$ DDE(r, s) = \frac{D_c(r, s) - T_m(r, s)}{T_v(r, s)} $

Where:
- $D_c(r, s)$ is the observed frequency of the dipeptide $rs$.
- Dimension: 400.
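The sketch below follows the DDE formulas as written above; the codon counts come from the standard genetic code (61 sense codons), and the normalization by T_v rather than its square root mirrors the text (some implementations divide by the square root of T_v instead).

```python
# Sketch of the DDE encoding based on standard-genetic-code codon counts.
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CODON_COUNTS = {"A": 4, "C": 2, "D": 2, "E": 2, "F": 2, "G": 4, "H": 2, "I": 3,
                "K": 2, "L": 6, "M": 1, "N": 2, "P": 4, "Q": 2, "R": 6, "S": 6,
                "T": 4, "V": 4, "W": 1, "Y": 2}
C_N = sum(CODON_COUNTS.values())  # 61 sense codons

def dde(seq: str) -> list[float]:
    n_pairs = len(seq) - 1
    values = []
    for r, s in product(AMINO_ACIDS, repeat=2):
        dc = sum(seq[i:i + 2] == r + s for i in range(n_pairs)) / n_pairs
        tm = (CODON_COUNTS[r] / C_N) * (CODON_COUNTS[s] / C_N)
        tv = tm * (1.0 - tm) / n_pairs
        values.append((dc - tm) / tv)   # deviation as written in the text above
    return values
```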
4.2.2.7. Grouped Amino Acid Composition (GAAC)
GAAC reduces the dimensionality of AAC by grouping the 20 amino acids into 5 categories based on their shared physicochemical properties. It then calculates the frequencies of these groups.
The five groups are:

- Aliphatic group (g1): G, A, V, L, M, I
- Aromatic group (g2): F, Y, W
- Positively charged group (g3): K, R, H
- Negatively charged group (g4): D, E
- Uncharged group (g5): S, T, C, P, N, Q

The group frequency is calculated as:

$ f(g) = \frac{N(g)}{N}, \quad g \in \{g1, g2, g3, g4, g5\} $

Where:
- $f(g)$ is the frequency of amino acids belonging to group $g$.
- $N(g)$ is the number of amino acids in the sequence that belong to group $g$.
- $N$ is the total length of the peptide sequence.
- Dimension: 5.
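A short sketch of GAAC using the five groups listed above (the group labels are names of convenience):

```python
# Sketch of the GAAC encoding: frequency of each of the five amino acid groups.
GROUPS = {
    "aliphatic": "GAVLMI",
    "aromatic": "FYW",
    "positive": "KRH",
    "negative": "DE",
    "uncharged": "STCPNQ",
}

def gaac(seq: str) -> list[float]:
    return [sum(seq.count(aa) for aa in members) / len(seq)
            for members in GROUPS.values()]
```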
4.2.2.8. Grouped Dipeptide Composition (GDPC)
GDPC is a variant of DPC that uses the same 5 amino acid groups defined in GAAC. It calculates the frequencies of dipeptides formed by these groups.
The feature consists of 25 descriptors (5 groups × 5 groups), calculated as:

$ f(r, s) = \frac{N_{rs}}{N - 1}, \quad r, s \in \{g1, g2, g3, g4, g5\} $

Where:
- $f(r, s)$ is the frequency of dipeptides whose first residue belongs to group $r$ and whose second residue belongs to group $s$.
- $N_{rs}$ is the number of occurrences of dipeptides represented by the group pair $(r, s)$.
- $N$ is the total length of the peptide sequence.
- Dimension: 25.
4.2.2.9. Sequence-order-coupling number (SOCNumber)
SOCNumber captures sequence-order information by calculating the sum of squared distances between amino acids at specific separations (lags).
The $d$-th rank sequence-order-coupling number $\tau_d$ is calculated as:

$ \tau_d = \sum_{i=1}^{N-d}\left(d_{i,i+d}\right)^2, \quad d = 1, 2, \dots, nlag $

Where:
- $\tau_d$ is the $d$-th rank sequence-order-coupling number.
- $d$ is the lag, i.e., the sequence separation between the two amino acids.
- $N$ is the length of the peptide sequence.
- $d_{i,i+d}$ is the "distance" between the amino acid at position $i$ and the amino acid at position $i + d$, taken from a pre-defined amino acid distance matrix; the paper uses the Schneider-Wrede (physicochemical) and Grantham (chemical) distance matrices for this purpose.
- $nlag$ denotes the maximum value of the lag, with a default value of 30.
- Dimension: $nlag$ per the formula, so $nlag = 30$ would give 30 dimensions. However, Table 1 states a dimension of 2, which might reflect a specific configuration in iLearnPlus or the use of only two specific lags after internal processing. The discrepancy between the formula and the table is noted; following the table, the dimension is taken as 2.
4.2.2.10. Quasi-sequence-order (QsOrder)
QsOrder is another feature that combines amino acid composition with sequence-order information, utilizing the SOCNumber concept. It produces a set of descriptors reflecting both composition and sequence patterns.
For each of the 20 amino acid types, the QsOrder descriptor is defined as:

$ X_r = \frac{f_r}{\sum_{r=1}^{20} f_r + w\sum_{d=1}^{nlag}\tau_d}, \quad r = 1, 2, \dots, 20 $

Where:
- $X_r$ is the QsOrder descriptor for the $r$-th amino acid type.
- $f_r$ is the normalized occurrence frequency of the $r$-th amino acid type.
- $w$ is the weighting factor, defined as 0.1.
- $nlag$ denotes the maximum value of the lag (default: 30).
- $\tau_d$ is the $d$-th rank sequence-order-coupling number, as defined for SOCNumber.

The remaining $nlag$ quasi-sequence-order descriptors are defined as:

$ X_d = \frac{w\,\tau_{d-20}}{\sum_{r=1}^{20} f_r + w\sum_{d=1}^{nlag}\tau_d}, \quad d = 21, 22, \dots, 20 + nlag $

Where:
- $X_d$ are the additional QsOrder descriptors, based primarily on the sequence-order-coupling numbers.
- Dimension: $20 + nlag$. With $nlag = 30$, this would be 50, whereas Table 1 states a dimension of 42. This discrepancy is noted, but the analysis follows the dimension stated in the table for consistency with the results section.
4.2.3. Feature Fusion Processing
After extracting the 10 types of features, they are concatenated to form a single, high-dimensional feature vector for each peptide.
- Initial Fusion: The concatenation of all 10 features results in a feature vector with 1,337 dimensions.
- De-zeroing (Feature Reduction): A practical step in which any feature column (dimension) that contains only zero values across all samples is removed. Such columns provide no discriminative information and can be safely eliminated. After this de-zero operation, the total number of features used for model learning is reduced to 1,206.
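A minimal sketch of this fusion plus de-zero step, assuming each encoder is a function returning a fixed-length list (as in the earlier sketches); the helper name is illustrative.

```python
# Sketch of feature fusion followed by removal of all-zero columns.
import numpy as np

def fuse_features(peptides, encoders):
    """Concatenate every encoder's output per peptide, then drop all-zero columns."""
    X = np.array([np.concatenate([enc(p) for enc in encoders]) for p in peptides])
    keep = ~np.all(X == 0, axis=0)       # columns with at least one non-zero value
    return X[:, keep]                    # e.g., 1,337 -> 1,206 columns in this study
```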
4.2.4. Random Forest (RF)
The Random Forest algorithm is employed as the primary machine learning classifier for Bitter-RF.

- Ensemble method: RF is an ensemble learning method that builds multiple decision trees during training.
- Randomness: Each tree is constructed using a random subset of the training data (bootstrap aggregating, or bagging) and, at each split in a tree, a random subset of features is considered. This inherent randomness reduces the correlation among trees and helps prevent overfitting.
- Prediction: For classification, the final prediction is made by a majority vote over the predictions of all individual trees in the forest.
- Advantages: RF is known for its high accuracy, robustness to noise, and strong adaptability to high-dimensional data, which makes it suitable for the comprehensive feature set developed in this study. The paper explicitly mentions that RF can "reduce the possibility of overfitting, improve the ability to resist noise, and has strong adaptability to high-dimensional data."

The schematic framework of Bitter-RF for bitter peptide prediction is shown in Figure 1. It outlines the process from data collection to feature extraction, feature fusion, model training with RF, and finally prediction.
The following figure (Figure 1 from the original paper) illustrates the construction workflow of the Bitter-RF model:
This image is Figure 1 of the paper: a flowchart of the Bitter-RF construction pipeline, covering four steps (data collection, feature fusion, machine learning method selection, and model evaluation) that lay out the study design.
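To make the workflow concrete, here is a minimal, self-contained training sketch with scikit-learn. The random matrix stands in for the 1,206-dimensional fused features of the 640 peptides, and the hyperparameters (e.g., n_estimators=500) are assumptions for illustration, not values reported in the paper.

```python
# Minimal sketch of a Bitter-RF style pipeline: 80/20 split + Random Forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X = np.random.rand(640, 1206)               # placeholder for the fused features
y = np.array([1] * 320 + [0] * 320)         # 320 bitter (1) and 320 non-bitter (0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)   # 512 / 128 split

clf = RandomForestClassifier(n_estimators=500, random_state=42)
clf.fit(X_train, y_train)

scores = clf.predict_proba(X_test)[:, 1]
print(f"Independent-set AUROC: {roc_auc_score(y_test, scores):.2f}")
```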
5. Experimental Setup
5.1. Datasets
The study utilized a publicly available benchmark dataset, the same one used by previous bitter peptide prediction models (iBitter-SCM, BERT4Bitter, iBitter-Fuse, iBitter-DRLF), to ensure comparability and reliability of results.
- Source: The dataset was originally compiled by manually curating experimentally validated bitter peptides from the scientific literature and can be accessed from http://pmlab.pythonanywhere.com/BERT4Bitter. Non-bitter peptides were randomly generated from the BIOPEP database.
- Characteristics:
- Total Samples: 640 records.
- Classes: 320 experimentally validated bitter peptides (positive samples) and 320 non-bitter peptides (negative samples).
- Data Split: The dataset was divided using an 8:2 ratio to create training and independent validation sets:
- Training Set: 512 records (256 bitter peptides, 256 non-bitter peptides). This set is used for model learning.
- Independent Set: 128 records (64 bitter peptides, 64 non-bitter peptides). This set is crucial for evaluating the model's generalization performance on entirely unseen data.
- Rationale for Choice: Using the same dataset as previous models allows a direct and fair comparison of the Bitter-RF model's performance against existing state-of-the-art methods.
5.2. Evaluation Metrics
To assess the training effect and predictive ability of the model, the authors used several standard classification metrics. Bitter peptides were defined as positive samples, and non-bitter peptides as negative samples.
For context, the fundamental counts for these metrics are:

- TP (True Positives): Number of bitter peptides correctly predicted as bitter.
- FN (False Negatives): Number of bitter peptides incorrectly predicted as non-bitter.
- TN (True Negatives): Number of non-bitter peptides correctly predicted as non-bitter.
- FP (False Positives): Number of non-bitter peptides incorrectly predicted as bitter.

Here are the evaluation metrics used, with their conceptual definitions, mathematical formulas, and symbol explanations:
5.2.1. Sensitivity (Sn)
- Conceptual Definition: Sensitivity, also known as Recall or True Positive Rate, measures the proportion of actual positive cases (bitter peptides) that are correctly identified by the model. A high sensitivity indicates that the model is good at catching positive instances.
- Mathematical Formula: $ Sn = \frac{TP}{TP + FN} $
- Symbol Explanation:
  - Sn: Sensitivity
  - TP: True Positives
  - FN: False Negatives
5.2.2. Specificity (Sp)
- Conceptual Definition: Specificity, also known as True Negative Rate, measures the proportion of actual negative cases (non-bitter peptides) that are correctly identified by the model. A high specificity indicates that the model is good at correctly identifying negative instances and avoiding false alarms.
- Mathematical Formula: $ Sp = \frac{TN}{TN + FP} $
- Symbol Explanation:
  - Sp: Specificity
  - TN: True Negatives
  - FP: False Positives
5.2.3. Accuracy (ACC)
- Conceptual Definition: Accuracy measures the proportion of all predictions that are correct. It is a straightforward metric but can be misleading on imbalanced datasets.
- Mathematical Formula: $ ACC = \frac{TP + TN}{TP + TN + FP + FN} $
- Symbol Explanation:
  - ACC: Accuracy
  - TP: True Positives; TN: True Negatives; FP: False Positives; FN: False Negatives
5.2.4. Matthew's Correlation Coefficient (MCC)
- Conceptual Definition: MCC is a comprehensive and robust metric for binary classification that remains balanced even when the classes have very different sizes. It takes into account true and false positives and negatives and returns a value between -1 and +1: +1 indicates a perfect prediction, 0 a random prediction, and -1 a completely opposite prediction.
- Mathematical Formula: $ MCC = \frac{TN \times TP - FN \times FP}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} $
- Symbol Explanation:
  - MCC: Matthew's Correlation Coefficient
  - TP: True Positives; TN: True Negatives; FP: False Positives; FN: False Negatives
5.2.5. Area Under the Receiver Operating Characteristic curve (AUROC)
- Conceptual Definition: AUROC is a performance metric for binary classifiers. The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate (Sensitivity) against the False Positive Rate (1 - Specificity) at various threshold settings, and the AUROC is the area under this curve. A higher AUROC (closer to 1) indicates a better ability to distinguish between the classes; 0.5 corresponds to random guessing and 1.0 to perfect discrimination. The paper uses AUROC as a standard for evaluating the quality of the binary classification model.
- Mathematical Formula: AUROC has no single simple formula in terms of TP, TN, FP, and FN; it is obtained by varying the classification threshold, computing the TPR and FPR at each threshold to trace the ROC curve, and integrating the TPR with respect to the FPR over the interval [0, 1].
- Symbol Explanation:
  - AUROC: Area Under the Receiver Operating Characteristic curve
  - TPR: True Positive Rate (Sensitivity)
  - FPR: False Positive Rate (1 - Specificity)
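The metrics above can be computed from a confusion matrix in a few lines; the sketch below uses scikit-learn and assumes binary labels 0/1 plus a score (probability) per sample for the AUROC.

```python
# Sketch of the evaluation metrics: Sn, Sp, Acc, MCC, and AUROC.
from sklearn.metrics import confusion_matrix, matthews_corrcoef, roc_auc_score

def evaluate(y_true, y_pred, y_score):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "Sn": tp / (tp + fn),
        "Sp": tn / (tn + fp),
        "Acc": (tp + tn) / (tp + tn + fp + fn),
        "MCC": matthews_corrcoef(y_true, y_pred),
        "AUROC": roc_auc_score(y_true, y_score),   # needs scores, not hard labels
    }
```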
5.3. Baselines
The paper compared Bitter-RF against two categories of baseline models:
- Other Traditional Machine Learning Methods: To validate the choice of Random Forest, the fused features were also tested with:
- Support Vector Machine (SVM)
- Light Gradient Boosting Machine (LightGBM)
- Decision Trees (DT)
- Logistic Regression (LR)

These methods are representative of commonly used and powerful classification algorithms in bioinformatics; a minimal comparison sketch is given at the end of this subsection.
- Existing State-of-the-Art Bitter Peptide Prediction Models: To demonstrate the superior performance of Bitter-RF in the specific domain of bitter peptide prediction, it was compared with four previously published sequence-based models, which the authors refer to as "generations":
- iBitter-SCM: The first-generation model, based on dipeptide propensity scores (Charoenkwan et al., 2020).
- BERT4Bitter: The second-generation model, utilizing deep learning (BERT) (Charoenkwan et al., 2021).
- iBitter-Fuse: The third-generation model, combining five peptide features with SVM (Charoenkwan et al., 2021).
- iBitter-DRLF: The fourth-generation model, which uses deep learning for feature extraction and LightGBM for prediction (Zhang et al., 2022).
These models represent the progression of computational approaches for bitter peptide identification and serve as direct benchmarks for
Bitter-RF.
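As referenced above, a minimal sketch for benchmarking the traditional baselines on the fused features might look as follows; the data are placeholders, the hyperparameters are assumptions (the paper does not report them here), and LightGBM requires the separate lightgbm package.

```python
# Illustrative 10-fold comparison of the baseline classifiers on fused features.
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X = np.random.rand(512, 1206)             # placeholder fused training features
y = np.random.randint(0, 2, size=512)     # placeholder labels

models = {
    "SVM": SVC(probability=True),
    "LightGBM": LGBMClassifier(),
    "DT": DecisionTreeClassifier(),
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=500),
}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    print(f"{name}: mean AUROC = {auc:.2f}")
```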
6. Results & Analysis
6.1. Single-feature-based results
The initial phase of evaluation involved training Random Forest models using each of the 10 extracted features individually. This step helps to understand the discriminative power of each feature type. The performance was assessed using both 10-fold cross-validation on the training set and an independent validation set.
The following are the results from Table 1 of the original paper:
| Cross-validation | Feature | Dimension | AUROC | Sn | Sp | Acc | Mcc |
|---|---|---|---|---|---|---|---|
| 10-fold cross-validation | AAC | 20 | 0.91 | 0.85 | 0.84 | 0.85 | 0.69 |
| | TPAAC | 21 | 0.90 | 0.83 | 0.78 | 0.80 | 0.61 |
| | APAAC | 22 | 0.89 | 0.83 | 0.81 | 0.82 | 0.64 |
| | ASDC | 400 | 0.88 | 0.89 | 0.68 | 0.79 | 0.59 |
| | DPC | 400 | 0.86 | 0.87 | 0.64 | 0.76 | 0.53 |
| | DDE | 400 | 0.83 | 0.84 | 0.73 | 0.78 | 0.57 |
| | GAAC | 5 | 0.75 | 0.72 | 0.66 | 0.69 | 0.39 |
| | GDPC | 25 | 0.78 | 0.75 | 0.71 | 0.73 | 0.46 |
| | SOCNumber | 2 | 0.70 | 0.66 | 0.62 | 0.64 | 0.28 |
| | QSOrder | 42 | 0.89 | 0.82 | 0.82 | 0.82 | 0.64 |
| Independent set validation | AAC | 20 | 0.96 | 0.91 | 0.89 | 0.90 | 0.80 |
| | TPAAC | 21 | 0.94 | 0.83 | 0.86 | 0.84 | 0.69 |
| | APAAC | 22 | 0.97 | 0.89 | 0.91 | 0.90 | 0.80 |
| | ASDC | 400 | 0.92 | 0.89 | 0.75 | 0.82 | 0.65 |
| | CKSAAGP | 100 | 0.87 | 0.77 | 0.81 | 0.79 | 0.58 |
| | DPC | 400 | 0.89 | 0.88 | 0.70 | 0.79 | 0.59 |
| | DDE | 400 | 0.90 | 0.89 | 0.84 | 0.87 | 0.74 |
| | GAAC | 5 | 0.76 | 0.83 | 0.64 | 0.73 | 0.48 |
| | GDPC | 25 | 0.80 | 0.73 | 0.72 | 0.73 | 0.45 |
| | SOCNumber | 2 | 0.73 | 0.59 | 0.69 | 0.64 | 0.28 |
| | QSOrder | 42 | 0.95 | 0.92 | 0.84 | 0.88 | 0.77 |
Analysis of Single-Feature Results:

- Best Performers: On 10-fold cross-validation, AAC shows the highest AUROC (0.91), ACC (0.85), and MCC (0.69). On the independent set, APAAC performs best with an AUROC of 0.97, followed closely by AAC (0.96) and QSOrder (0.95). These features, particularly AAC and APAAC, appear to capture highly discriminative information.
- Worst Performers: SOCNumber consistently performs worst in both cross-validation (AUROC 0.70) and independent validation (AUROC 0.73), which is attributed to its low dimensionality (2 features): it does not provide sufficient information on its own. GAAC and GDPC also show relatively low performance.
- Dimension vs. Performance: Interestingly, features with higher dimensions such as ASDC, DPC, and DDE (all 400 dimensions) do not necessarily outperform lower-dimensional features such as AAC (20 dimensions) or APAAC (22 dimensions). This suggests that the quality and relevance of the encoded information matter more than the number of dimensions.
- Importance of Physicochemical Properties: The introduction notes that hydrophobic amino acids and their positions are crucial for bitter taste. Features such as TPAAC and APAAC, which explicitly incorporate hydrophobicity and hydrophilicity, perform well, supporting this premise.
- Independent Set Insights: The independent set validation generally shows higher AUROC values for several features (e.g., AAC rises from 0.91 to 0.96, APAAC from 0.89 to 0.97), suggesting that the models generalize well and that some features are even more robust on unseen data.

This analysis suggests that while some single features are strong predictors, there is potential for improvement by combining their complementary information. The paper notes that "some single features with poor performance have rich information that AAC does not have and can improve prediction performance," motivating the next step of feature fusion.
6.2. Fusion Feature Processing
Prior to training the final model, the 10 individual features were combined into a single feature vector. This process also involved a de-zero operation to remove redundant or non-informative dimensions.
The following are the results from Table 2 of the original paper:
| Feature | Dimension | Dimension after de-zero operation |
|---|---|---|
| AAC | 20 | 20 |
| TPAAC | 21 | 21 |
| APAAC | 22 | 22 |
| ASDC | 400 | 366 |
| DPC | 400 | 303 |
| DDE | 400 | 400 |
| GAAC | 5 | 5 |
| GDPC | 25 | 25 |
| SOCNumber | 2 | 2 |
| QSOrder | 42 | 42 |
| Total of features | 1,337 | 1,206 |
Analysis of Feature Fusion:
- Total Initial Dimensions: Summing the individual dimensions of the 10 features (20 + 21 + 22 + 400 + 400 + 400 + 5 + 25 + 2 + 42) gives 1,337.
- De-zeroing Effect: The de-zero operation, which removes columns containing only zero values, reduced the total feature dimension from 1,337 to 1,206, meaning 131 features (1,337 - 1,206) were all zeros and were removed. This is beneficial because it eliminates non-informative features, potentially reducing noise and computational load without losing discriminative power.
- Specific Reductions: ASDC was reduced from 400 to 366 dimensions and DPC from 400 to 303, indicating that some dipeptide or skip-dipeptide combinations never occur in the dataset and therefore have zero counts. The other features kept their original dimensions, implying all of their components had non-zero values somewhere in the dataset.
6.3. Fusion-feature-based results
The fused feature set (1,206 dimensions) was then used to train an RF model, and its performance was compared against the three best-performing single features (AAC, APAAC, QSOrder) identified in the single-feature analysis.
The following are the results from Table 3 of the original paper:
| ML method | Cross-validation | Feature | Dimension | AUROC | Sn | Sp | Acc | Mcc |
|---|---|---|---|---|---|---|---|---|
| Random Forest | 10-fold cross-validation | AAC | 20 | 0.91 | 0.85 | 0.84 | 0.85 | 0.69 |
| | | APAAC | 22 | 0.89 | 0.83 | 0.81 | 0.82 | 0.64 |
| | | QSOrder | 42 | 0.89 | 0.82 | 0.82 | 0.82 | 0.64 |
| | | Fusion | 1206 | 0.93 | 0.86 | 0.84 | 0.85 | 0.70 |
| | Independent set validation | AAC | 20 | 0.96 | 0.91 | 0.89 | 0.90 | 0.80 |
| | | APAAC | 22 | 0.97 | 0.89 | 0.91 | 0.90 | 0.80 |
| | | QSOrder | 42 | 0.95 | 0.92 | 0.84 | 0.88 | 0.77 |
| | | Fusion | 1206 | 0.98 | 0.94 | 0.94 | 0.94 | 0.88 |
Analysis of Fusion-Feature Results:

- 10-fold Cross-Validation:
  - The Fusion feature set achieved an AUROC of 0.93, Sn of 0.86, Sp of 0.84, Acc of 0.85, and MCC of 0.70.
  - This is an improvement over AAC (AUROC 0.91, MCC 0.69), APAAC (AUROC 0.89, MCC 0.64), and QSOrder (AUROC 0.89, MCC 0.64). The MCC of the fusion set (0.70) is notably higher, indicating a better overall balance in prediction performance.
- Independent Set Validation:
  - The Fusion feature set demonstrated excellent performance, with an AUROC of 0.98, Sn of 0.94, Sp of 0.94, Acc of 0.94, and MCC of 0.88.
  - This is a clear improvement over all single features: AAC (AUROC 0.96, MCC 0.80), APAAC (AUROC 0.97, MCC 0.80), and QSOrder (AUROC 0.95, MCC 0.77). The AUROC of 0.98 for the Fusion set is the highest achieved, and the MCC of 0.88 is substantially better than any single feature, signifying a highly robust and accurate model.

The results strongly indicate that fusing diverse features provides a more comprehensive representation of bitter peptides, leading to enhanced predictive capability. The statement that "the prediction performance of fusion features was improved or remained unchanged compared with single feature prediction" is borne out, with clear improvements observed, especially on the independent set.
The following figure (Figure 2 from the original paper) shows the performance of the Bitter-RF model in bitter peptide classification. Panel A presents ROC curves from 10-fold cross-validation and an independent validation set, panel B compares ROC curves of different feature sets, and panels C and D compare average and specific performance metrics (AUC, Sn, Sp, Acc, Mcc) for different feature combinations.

Figure 2 Interpretation:

- Panel A (ROC curves for the Fusion feature): Shows high AUROC values for both 10-fold cross-validation (0.93) and independent validation (0.98), visually confirming the model's strong discriminatory power. The curve for the independent set lies closer to the top-left corner, indicating better performance.
- Panel B (ROC curves for different features on the independent set): Visually confirms that the Fusion feature set has the highest AUROC (0.98), surpassing APAAC (0.97), AAC (0.96), and QSOrder (0.95), clearly supporting the benefit of feature fusion.
- Panel C (Performance comparison under 10-fold cross-validation): A bar chart comparing AUROC, Sn, Sp, Acc, and MCC for Fusion, AAC, APAAC, and QSOrder. The Fusion bars are consistently higher or comparable, particularly for AUROC and MCC.
- Panel D (Performance comparison on the independent set): As in Panel C but for the independent set; here the Fusion features show a clear advantage across all metrics, especially AUROC and MCC, further confirming their superior performance.
6.4. Comparison with other machine learning methods on fusion features
To further validate the choice of Random Forest, the Fusion feature set was used to train and test other traditional machine learning algorithms: SVM, LightGBM, Decision Trees (DT), and Logistic Regression (LR).
The following are the results from Table 4 of the original paper:
| Cross-validation | Feature | ML method | AUROC | Sn | Sp | Acc | Mcc |
|---|---|---|---|---|---|---|---|
| 10-fold cross-validation | Fusion | SVM | 0.67 | 0.51 | 0.80 | 0.66 | 0.34 |
| | Fusion | LightGBM | 0.92 | 0.85 | 0.85 | 0.85 | 0.70 |
| | Fusion | DT | 0.80 | 0.83 | 0.77 | 0.80 | 0.60 |
| | Fusion | LR | 0.82 | 0.74 | 0.77 | 0.76 | 0.52 |
| | Fusion | RF | 0.93 | 0.86 | 0.84 | 0.85 | 0.70 |
| Independent set validation | Fusion | SVM | 0.74 | 0.61 | 0.78 | 0.70 | 0.40 |
| | Fusion | LightGBM | 0.97 | 0.92 | 0.91 | 0.91 | 0.83 |
| | Fusion | DT | 0.94 | 0.94 | 0.84 | 0.89 | 0.78 |
| | Fusion | LR | 0.89 | 0.80 | 0.84 | 0.82 | 0.64 |
| | Fusion | RF | 0.98 | 0.94 | 0.94 | 0.94 | 0.88 |
The following figure (Figure 3 from the original paper) shows the performance comparison of different machine learning models across various metrics (AUC, Sn, Sp, Acc, Mcc) in two parts, A and B. The Random Forest (RF) model stands out with superior performance on most metrics compared to other models.

Analysis of ML Method Comparison:

- 10-fold Cross-Validation: RF (AUROC 0.93, MCC 0.70) and LightGBM (AUROC 0.92, MCC 0.70) showed the best performance, with RF slightly edging out LightGBM on AUROC. SVM performed markedly worse (AUROC 0.67, MCC 0.34), suggesting it is not well suited to this dataset or feature space, or that its hyperparameters were not optimally tuned. DT and LR showed intermediate performance.
- Independent Set Validation:
  - RF demonstrated superior performance, with an AUROC of 0.98 and MCC of 0.88.
  - LightGBM was a close second (AUROC 0.97, MCC 0.83).
  - DT also performed well (AUROC 0.94, MCC 0.78), indicating good generalization, although its Sp (0.84) was lower than RF's.
  - SVM again showed the lowest performance (AUROC 0.74, MCC 0.40).

The results confirm that the RF method, applied to the fused features, is superior or equal to the other tested machine learning methods across the various indicators, especially on the critical independent validation set. This validates the choice of Random Forest as the classifier for Bitter-RF.
6.5. Comparison with existed models
To ultimately demonstrate the effectiveness of Bitter-RF, its performance was compared with four existing state-of-the-art bitter peptide prediction models. The performance indicators for these models were obtained from relevant literature.
The following are the results from Table 5 of the original paper:
| Cross-validation | Classifier | AUROC | Sn | Sp | Acc | Mcc |
|---|---|---|---|---|---|---|
| 10-fold cross-validation | iBitter-SCM | 0.90 | 0.91 | 0.83 | 0.87 | 0.75 |
| | BERT4Bitter | 0.92 | 0.87 | 0.85 | 0.86 | 0.73 |
| | iBitter-Fuse | 0.94 | 0.92 | 0.92 | 0.92 | 0.84 |
| | iBitter-DRLF | 0.95 | 0.89 | 0.89 | 0.89 | 0.78 |
| | Bitter-RF | 0.93 | 0.86 | 0.84 | 0.85 | 0.70 |
| Independent set validation | iBitter-SCM | 0.90 | 0.84 | 0.84 | 0.84 | 0.69 |
| | BERT4Bitter | 0.96 | 0.94 | 0.91 | 0.92 | 0.84 |
| | iBitter-Fuse | 0.93 | 0.94 | 0.92 | 0.93 | 0.86 |
| | iBitter-DRLF | 0.98 | 0.92 | 0.98 | 0.94 | 0.89 |
| | Bitter-RF | 0.98 | 0.94 | 0.94 | 0.94 | 0.88 |
The following figure (Figure 4 from the original paper) shows radar charts of the overall and detailed performance of Bitter-RF and other models across various metrics (AUC, Sn, Sp, Acc, Mcc), clearly comparing the strengths and weaknesses of each model.

Analysis of Comparison with Existing Models:
- 10-fold Cross-Validation: Bitter-RF (AUROC 0.93, MCC 0.70) performed similarly to BERT4Bitter (AUROC 0.92, MCC 0.73) but slightly below iBitter-Fuse (AUROC 0.94, MCC 0.84) and iBitter-DRLF (AUROC 0.95, MCC 0.78). This indicates that on internal cross-validation, some other models, especially those using more complex feature engineering or deep learning, might have a slight edge.
- Independent Set Validation: This is the most crucial comparison for generalization ability.
  - Bitter-RF achieved an outstanding AUROC of 0.98, matching the best performance of iBitter-DRLF, along with high Sn (0.94), Sp (0.94), Acc (0.94), and MCC (0.88).
  - Compared to iBitter-SCM (AUROC 0.90, MCC 0.69) and iBitter-Fuse (AUROC 0.93, MCC 0.86), Bitter-RF shows clear improvements in AUROC and a competitive MCC.
  - Against BERT4Bitter (AUROC 0.96, MCC 0.84), Bitter-RF shows a higher AUROC and a comparable MCC.
  - While iBitter-DRLF has a slightly higher MCC (0.89) and Sp (0.98), Bitter-RF matches its AUROC and has a higher Sn (0.94 vs. 0.92). This suggests Bitter-RF is very competitive.

Key Takeaways:
- Bitter-RF demonstrates strong generalization ability on unseen data, often outperforming or matching the latest deep learning-based models in terms of AUROC.
- The Sn, Sp, Acc, and MCC of Bitter-RF are generally better than those of the first three generations of models, especially on the independent set.
- A significant advantage of Bitter-RF is that it achieves this high performance using a traditional machine learning method (Random Forest), which typically consumes fewer computational resources than deep learning approaches (such as BERT4Bitter or the deep-learning pre-training in iBitter-DRLF). This makes Bitter-RF more accessible and practical.
- The authors acknowledge that iBitter-DRLF has a slightly better MCC, but Bitter-RF is very competitive while using a less computationally intensive method.
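For reference, the five metrics used throughout this comparison can be computed for any fitted binary classifier as in the sketch below. This is a generic illustration of the standard definitions, not the authors' evaluation script; the classifier `clf` and the independent-set arrays are assumed to exist.

```python
# Generic evaluation sketch: compute the five reported metrics on an
# independent test set for a fitted binary classifier `clf`.
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             matthews_corrcoef, roc_auc_score)

def evaluate(clf, X_test, y_test, threshold=0.5):
    prob = clf.predict_proba(X_test)[:, 1]   # probability of the bitter class
    pred = (prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
    return {
        "AUROC": roc_auc_score(y_test, prob),
        "Sn":    tp / (tp + fn),              # sensitivity
        "Sp":    tn / (tn + fp),              # specificity
        "Acc":   accuracy_score(y_test, pred),
        "Mcc":   matthews_corrcoef(y_test, pred),
    }

# Example usage (assuming `rf` is a fitted RandomForestClassifier):
# print(evaluate(rf, X_test, y_test))
```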
6.6. Ablation Studies / Parameter Analysis
The paper does not explicitly present an ablation study in the traditional sense (removing components of the Bitter-RF model to measure their individual impact). However, the comparison of the Fusion features with single features (Table 3 and Figure 2) serves as an indirect form of ablation, demonstrating the synergistic effect of combining different feature types: the fused feature set (all 10 features combined) performs better than any single feature alone. The de-zero operation on features (Table 2) also represents a form of feature preprocessing that improved efficiency, and potentially robustness, by removing non-informative dimensions. The paper mentions optimizing feature parameters, but no detailed tuning results or sensitivity analyses for the Random Forest hyperparameters (e.g., number of trees, maximum depth) are provided.
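If one wanted to close this gap, a straightforward grid search over the usual Random Forest hyperparameters could look like the sketch below; the parameter ranges and placeholder data are assumptions for illustration, not values reported in the paper.

```python
# Illustrative hyperparameter search for a Random Forest classifier; the grid
# below is an assumption for demonstration, not the configuration of Bitter-RF.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

rng = np.random.default_rng(0)
X_train = rng.normal(size=(512, 500))     # placeholder fused training features
y_train = rng.integers(0, 2, size=512)    # placeholder labels

param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 10, 20],
    "max_features": ["sqrt", "log2"],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))
```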
7. Conclusion & Reflections
7.1. Conclusion Summary
This study successfully developed Bitter-RF, a novel Random Forest-based machine learning model for accurately identifying bitter peptides from their sequence information. By integrating a comprehensive set of 10 diverse sequence-derived features, Bitter-RF achieved superior predictive performance, particularly on an independent validation set, where it recorded an impressive AUROC of 0.98. The research demonstrated that the fusion of multiple features significantly enhances prediction accuracy compared to individual features, and that the Random Forest algorithm is highly effective for this protein classification task, outperforming other traditional machine learning methods. Bitter-RF stands as a competitive model against existing state-of-the-art predictors, offering comparable or better accuracy with potentially lower computational demands. This work not only provides a valuable tool for bitter peptide research but also extends the application of Random Forest in bioinformatics.
7.2. Limitations & Future Work
The authors acknowledge one primary limitation and propose future work:
- Limitation: The intrinsic robustness of the bitter/non-bitter classification model may be affected by the inherent bias of the training/test set data. This suggests that while the model performs well on the current dataset, its performance on other, potentially more diverse or differently distributed, bitter peptide datasets is yet to be fully explored. The paper explicitly states, "it cannot be excluded that the model may be affected by the inherent bias of training/test set data."
- Future Work: The authors plan to utilize various feature selection techniques (e.g., the methods cited as references 83-86) to identify the optimal subset of the current comprehensive feature set (one possible realization is sketched below). This optimization aims to further improve the model's performance by focusing on the most discriminative features and potentially reducing dimensionality.
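Purely as an illustration, since the paper does not specify which of the cited methods will be used, one simple realization of this step is a univariate filter that keeps the top-scoring fused-descriptor dimensions:

```python
# Illustrative feature-selection sketch: keep the k fused-descriptor dimensions
# with the highest ANOVA F-scores. This is one option, not the authors' plan.
from sklearn.feature_selection import SelectKBest, f_classif

def select_top_k(X_train, y_train, X_test, k=200):
    """Fit a univariate filter on the training split and reduce both splits."""
    selector = SelectKBest(score_func=f_classif, k=k)
    X_train_sel = selector.fit_transform(X_train, y_train)
    X_test_sel = selector.transform(X_test)
    return X_train_sel, X_test_sel, selector.get_support(indices=True)
```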
7.3. Personal Insights & Critique
This paper presents a solid and practical approach to a significant problem in peptide research. The comprehensive feature engineering is a major strength, as is the systematic comparison of single vs. fused features and various ML algorithms. The choice of Random Forest is well-justified by its performance and computational efficiency, especially when compared to complex deep learning models for practical applications.
Strengths:
- Thorough Feature Engineering: The use of 10 diverse features, covering various aspects of peptide sequence and physicochemical properties, is a key differentiator and likely contributes significantly to the model's high performance.
- Rigorous Evaluation: The use of both 10-fold cross-validation and an independent validation set, along with a comprehensive set of metrics (AUROC, Sn, Sp, Acc, MCC), provides a robust assessment of the model's performance and generalization ability.
- Comparative Analysis: The systematic comparison against multiple traditional ML methods and four generations of existing bitter peptide predictors firmly establishes Bitter-RF's competitive edge.
- Practicality: The development of a Python package (GitHub link provided) makes the model directly usable by other researchers, fostering scientific progress. The emphasis on lower computational resources compared to deep learning models is also a practical advantage.
Potential Issues/Areas for Improvement:
- Parameter Tuning Details: While Random Forest is used, the paper does not elaborate on the specific hyperparameters used (e.g., number of trees, maximum depth, max_features) or whether a hyperparameter optimization strategy was employed. Such details would enhance reproducibility and build confidence in the optimal configuration.
- Feature Importance Analysis: Given the fusion of 10 different features, an analysis of feature importance (e.g., from the Random Forest model) could provide deeper biological insights into which sequence characteristics are most predictive of bitterness, and could also guide future biological experiments (a minimal sketch of this idea follows this list).
- Dataset Bias: The authors acknowledge potential dataset bias. Expanding the dataset with more diverse bitter and non-bitter peptides from various sources, and ensuring balanced representation of peptide lengths and compositions, could further enhance the model's generalizability.
- Interpretability of SOCNumber and QSOrder dimensions: There is a slight discrepancy between the theoretical dimensions of SOCNumber (nlag, default 30) and QSOrder (20 + nlag, default 50) and the dimensions reported in Table 1 (2 and 42, respectively). Clarifying this, or explaining the specific configuration used in iLearnPlus, would be beneficial for beginners.
- Lack of Ablation Study: While the fusion vs. single-feature comparison is helpful, a more detailed ablation study (e.g., removing specific feature types from the fusion set) could quantitatively demonstrate the contribution of each feature category to the overall performance.
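As a concrete illustration of the feature-importance point above, the Random Forest impurity importances could be aggregated per descriptor group. The column ranges in the commented example are hypothetical placeholders; the actual fused-feature layout follows the paper's Table 1.

```python
# Illustrative sketch: sum Random Forest impurity importances per descriptor
# group. The `groups` slices are hypothetical placeholders, not the real layout.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def importance_by_group(rf: RandomForestClassifier, groups: dict) -> dict:
    """Aggregate per-dimension importances into per-descriptor totals."""
    imp = rf.feature_importances_
    return {name: float(imp[cols].sum()) for name, cols in groups.items()}

# Hypothetical example layout (replace with the actual column ranges):
# groups = {"AAC": np.arange(0, 20), "APAAC": np.arange(20, 50), ...}
# print(importance_by_group(fitted_rf, groups))
```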
Transferability and Future Value:
The methods, particularly the comprehensive feature engineering and the demonstration of Random Forest's effectiveness, could be transferred to other peptide classification tasks (e.g., predicting antimicrobial peptides, anticancer peptides, or allergenic peptides). The emphasis on capturing diverse sequence information and physicochemical properties is a generalizable principle in bioinformatics. Bitter-RF provides a solid foundation, and its open-source availability ensures its immediate utility and potential for further development by the research community. The identified avenues for future work, especially feature selection, promise further refinements to the model.