Identify Bitter Peptides by Using Deep Representation Learning Features
TL;DR Summary
This study introduces iBitter-DRLF, a method that uses deep representation learning features to improve the identification of bitter peptides from sequence data alone, supporting efforts to improve the palatability of peptide-based products.
Abstract
A bitter taste often identifies hazardous compounds and it is generally avoided by most animals and humans. Bitterness of hydrolyzed proteins is caused by the presence of bitter peptides. To improve palatability, bitter peptides need to be identified experimentally in a time-consuming and expensive process, before they can be removed or degraded. Here, we report the development of a machine learning prediction method, iBitter-DRLF, which is based on a deep learning pre-trained neural network feature extraction method. It uses three sequence embedding techniques, soft symmetric alignment (SSA), unified representation (UniRep), and bidirectional long short-term memory (BiLSTM). These were initially combined into various machine learning algorithms to build several models. After optimization, the combined features of UniRep and BiLSTM were finally selected, and the model was built in combination with a light gradient boosting machine (LGBM). The results showed that the use of deep representation learning greatly improves the ability of the model to identify bitter peptides, achieving accurate prediction based on peptide sequence data alone. By helping to identify bitter peptides, iBitter-DRLF can help research into improving the palatability of peptide therapeutics and dietary supplements in the future. A webserver is available, too.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Identify Bitter Peptides by Using Deep Representation Learning Features
1.2. Authors
Jici Jiang, Xinxu Lin, Yueqi Jiang, Liangzhen Jiang, and Zhibin Lv.
- Jici Jiang, Xinxu Lin, Yueqi Jiang, and Zhibin Lv are affiliated with Sichuan University, China, across different colleges (Biomedical Engineering, Software Engineering, West China School of Medicine).
- Liangzhen Jiang is affiliated with Chengdu University, China, in the Key Laboratory of Coarse Cereal Processing, Ministry of Agriculture and Rural Affairs, College of Food and Biological Engineering.
- Zhibin Lv is the corresponding author, indicating a leading role in the research.
1.3. Journal/Conference
The paper was published in Int. J. Mol. Sci. (International Journal of Molecular Sciences), a peer-reviewed, open-access journal covering a broad range of topics in molecular sciences that is generally considered reputable in its field.
1.4. Publication Year
2022
1.5. Abstract
This paper introduces a novel machine learning method, iBitter-DRLF, for identifying bitter peptides. The motivation stems from the fact that bitterness in hydrolyzed proteins, often caused by bitter peptides, necessitates time-consuming and expensive experimental identification to improve palatability. iBitter-DRLF leverages deep learning pre-trained neural network feature extraction, specifically integrating three sequence embedding techniques: soft symmetric alignment (SSA), unified representation (UniRep), and bidirectional long short-term memory (BiLSTM). These features were initially combined with various machine learning algorithms. Through optimization, the combined features of UniRep and BiLSTM were selected and integrated with a light gradient boosting machine (LGBM) classifier. The results demonstrate that deep representation learning significantly enhances the model's ability to accurately predict bitter peptides based solely on peptide sequence data. The authors suggest iBitter-DRLF can support future research in improving the palatability of peptide therapeutics and dietary supplements. A webserver for the method is also available.
1.6. Original Source Link
Official Source Link: /files/papers/6917522c110b75dcc59ae06e/paper.pdf
Publication Status: Officially published in Int. J. Mol. Sci.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper addresses is the time-consuming and expensive experimental identification of bitter peptides. Peptides, especially hydrolyzed proteins, are increasingly used in therapeutics and dietary supplements due to their beneficial biological activities and good nutritional properties. However, these hydrolyzed proteins often contain peptides that impart a bitter taste, which is generally avoided by humans and animals as an instinctual warning sign for toxic compounds. This bitterness significantly impacts palatability and patient adherence to peptide-based products.
Prior research has made progress in developing computational models for bitter peptide prediction using quantitative structure-activity relationship (QSAR) modeling and traditional machine learning (ML) with sequence features. However, there remains significant room for improvement in accuracy and efficiency, particularly in identifying novel bitter peptides solely from sequence data. The main challenge lies in effectively representing peptide sequences in a way that captures their underlying characteristics relevant to bitterness.
The paper's entry point is the recognition that deep representation learning, inspired by its success in natural language processing (NLP), can provide more meaningful and accurate feature descriptors for peptide sequences compared to traditional feature engineering methods. By transforming raw protein sequences into representations that ML models can effectively utilize, deep learning can potentially overcome the limitations of previous approaches.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- Development of iBitter-DRLF: A novel machine learning prediction method specifically designed for the accurate identification of bitter peptides.
- Leveraging Deep Representation Learning: The method uniquely integrates three advanced sequence embedding techniques, soft symmetric alignment (SSA), unified representation (UniRep), and bidirectional long short-term memory (BiLSTM), to extract rich features from peptide sequences.
- Comprehensive Feature Fusion and Selection: The paper systematically evaluates various combinations of these deep representation learning features (feature fusion) and employs LGBM for feature selection to optimize the feature space, reducing redundancy and improving model performance.
- Optimal Model Configuration: Through extensive experimentation and optimization, the combined UniRep and BiLSTM features (specifically a 106-dimensional subset after feature selection) coupled with an LGBM classifier emerged as the most effective configuration for iBitter-DRLF.
- Superior Predictive Performance: iBitter-DRLF demonstrated significantly higher accuracy than existing state-of-the-art methods in independent tests, achieving an ACC of 0.944, MCC of 0.889, Sp of 0.977, and auROC of 0.977, indicating its reliability and stability.
- User-Friendly Webserver: The authors provide an accessible webserver for iBitter-DRLF, enabling wider use of their algorithm by other researchers.

The key finding is that the application of deep representation learning features, particularly the fusion of UniRep and BiLSTM followed by rigorous feature selection, substantially improves the ability to predict bitter peptides accurately based solely on their sequence data. This advancement has practical implications for enhancing the palatability of peptide-based products and facilitating drug development.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully grasp the methodology and contributions of this paper, a foundational understanding of several key concepts across biology, machine learning, and deep learning is essential:
- Peptides and Amino Acids: Peptides are short chains of amino acids linked by peptide bonds. Amino acids are the fundamental building blocks of proteins and peptides, each with unique chemical properties that influence the overall structure and function of the peptide. The sequence of amino acids in a peptide (the peptide sequence) is crucial for its biological activity, including taste.
- Protein Hydrolysates: Mixtures of peptides produced by breaking down proteins (e.g., through enzymatic digestion). While often having beneficial nutritional properties, protein hydrolysates can contain bitter peptides, which negatively affect their taste.
- Machine Learning (ML): A field of artificial intelligence that enables systems to learn from data without explicit programming. In classification tasks, ML models learn to categorize data points into predefined classes (e.g., bitter vs. non-bitter).
- Deep Learning (DL): A subfield of machine learning that uses neural networks with multiple layers (hence "deep") to learn complex patterns from data. Unlike traditional ML, deep learning models can often learn meaningful feature representations directly from raw input data, reducing the need for manual feature engineering.
- Sequence Embedding/Representation Learning: The process of converting discrete sequential data (like peptide sequences made of amino acids) into continuous numerical vectors (embeddings). These vectors capture semantic and structural information, allowing machine learning models to process the sequences effectively. Representation learning refers to the broader concept of automatically learning useful features from raw data.
- Soft Symmetric Alignment (SSA): A sequence embedding technique that captures similarity between sequences of varying lengths by aligning them in a "soft", probabilistic manner. It embeds sequences into vectors and computes a similarity score from these embeddings, typically using a BiLSTM encoder.
- Unified Representation (UniRep): A deep representation learning model trained on a vast dataset of protein sequences (UniRef50). It uses a multiplicative Long Short-Term Memory (mLSTM) network to generate fixed-length vector representations for protein/peptide sequences. The model learns to predict the next amino acid in a sequence, thereby learning rich internal representations that capture biological information.
- Bidirectional Long Short-Term Memory (BiLSTM): A type of recurrent neural network (RNN) that processes sequence data in both forward and backward directions.
  - Long Short-Term Memory (LSTM): An advanced type of RNN designed to overcome the vanishing gradient problem of standard RNNs, allowing it to learn long-term dependencies. An LSTM unit has several components:
    - Cell State (C_t): The memory of the LSTM unit, carrying information through the sequence.
    - Hidden State (h_t): The output of the LSTM unit, carrying information to the next time step.
    - Forget Gate (f_t): Controls what information from the previous cell state should be discarded.
    - Input Gate (i_t): Controls what new information should be stored in the cell state.
    - Output Gate (o_t): Controls what part of the cell state should be output as the hidden state.
  - A BiLSTM combines a forward LSTM (processing the sequence from start to end) and a backward LSTM (processing from end to start). This allows it to capture context from both past and future elements in a sequence, providing a richer representation.
- Feature Engineering vs. Feature Learning: Feature engineering is the manual process of creating relevant input features for a machine learning model from raw data, often requiring expert domain knowledge. Feature learning (or representation learning) is the process where a model automatically discovers and learns useful features from raw data, a key advantage of deep learning.
- Feature Fusion: The process of combining multiple sets of features (e.g., from different embedding models) into a single, comprehensive feature vector, aiming to leverage diverse information sources for improved model performance.
- Feature Selection: The process of selecting a subset of relevant features for use in model construction. This helps reduce dimensionality, combat overfitting, and improve model interpretability and training efficiency.
- Support Vector Machine (SVM): A powerful supervised machine learning algorithm for classification and regression. It finds the optimal hyperplane that best separates data points of different classes in a high-dimensional feature space.
- Random Forest (RF): An ensemble learning method for classification and regression that constructs a multitude of decision trees at training time and outputs the mode of their class predictions (classification) or the mean prediction (regression). It reduces overfitting and improves accuracy.
- Light Gradient Boosting Machine (LGBM): A gradient boosting framework that uses tree-based learning algorithms. It is known for its efficiency and speed, particularly with large datasets. LGBM uses a leaf-wise tree growth strategy, which can be faster than level-wise strategies but is more prone to overfitting on small datasets.
- Evaluation Metrics: Standard measures used to quantify the performance of classification models, including Accuracy (ACC), Matthews Correlation Coefficient (MCC), Sensitivity (Sn), Specificity (Sp), F1-score (F1), Area Under the Precision-Recall Curve (auPRC), and Area Under the Receiver Operating Characteristic Curve (auROC). These are explained in detail in Section 5.
- K-fold Cross-Validation: A robust technique for evaluating machine learning models by dividing the dataset into K subsets (folds). The model is trained on K-1 folds and validated on the remaining fold; the process is repeated K times, with each fold used exactly once for validation, and the results are averaged. The paper uses 10-fold cross-validation.
- Independent Test Set: A dataset completely separate from the training and validation sets, used to provide an unbiased evaluation of the final model's performance on unseen data.
- Hyperparameter Optimization: The process of finding the best set of hyperparameters for a machine learning model (e.g., C and gamma for SVM, n_estimators and max_depth for tree-based models). Techniques like GridSearchCV systematically search through a predefined grid of hyperparameter values.
- Uniform Manifold Approximation and Projection (UMAP): A dimensionality reduction technique used for visualizing high-dimensional data, similar to t-SNE. It aims to preserve the global and local structure of the data in a lower-dimensional space, making it useful for feature visualization.
3.2. Previous Works
The paper contextualizes its work by referencing a lineage of previous computational methods for bitter peptide identification, which primarily relied on traditional quantitative structure-activity relationship (QSAR) modeling and machine learning approaches using handcrafted sequence features.
- QSAR and Traditional ML Methods: Several early methods, such as those by Gramatica et al. [9], Chen et al. [10], and the authors of [13], focused on predicting bitterness using physicochemical properties and structural descriptors of peptides. These methods often involve meticulous feature engineering to extract meaningful characteristics from the peptide sequences.
- BitterX [15], BitterPredict [16], iBitter-SCM [17], and iBitter-Fuse [18]: These represent a progression in machine learning-based bitter peptide predictors. They generally utilize traditional sequence features (e.g., amino acid composition, dipeptide composition, hydrophobicity scales) to build predictive models. iBitter-SCM and iBitter-Fuse are specifically noted for their increasing performance, achieved through traditional sequence features and multi-view feature fusion, respectively; iBitter-Fuse, for instance, combined features from different perspectives to enhance prediction.
- BERT4Bitter [19]: This model marked a shift by employing natural language processing (NLP) heuristic signature coding methods to represent peptide sequences as feature descriptors. This move towards more sophisticated, learned representations, rather than purely handcrafted features, showed better accuracy. BERT (Bidirectional Encoder Representations from Transformers) models, common in NLP, learn contextual representations of words (or, in this case, amino acids) by considering their surrounding text.
- MIMML (Mutual Information-based Meta Learning) [20]: Proposed in 2022, MIMML aimed to discover the best feature combination for bitter peptides, achieving an independent test accuracy of 93.8%. This highlights the importance of effective feature selection and combination for optimal performance.

These prior works collectively demonstrate an evolution from simple physicochemical features to more complex, NLP-inspired heuristic coding and meta-learning for feature optimization. The current paper builds upon this by exploring even more advanced deep representation learning features (SSA, UniRep, BiLSTM), which learn hierarchical representations directly from vast amounts of sequence data, moving beyond heuristics or purely amino acid-level properties.
3.3. Technological Evolution
The technological evolution in bitter peptide identification mirrors broader trends in bioinformatics and machine learning.
- Early QSAR and Physicochemical Descriptors: Initial approaches focused on quantitative structure-activity relationships (QSAR), where physicochemical properties (e.g., hydrophobicity, charge, molecular weight) and simple amino acid composition were manually extracted as features. These methods required significant domain expertise for feature engineering.
- Traditional Machine Learning with Handcrafted Features: As ML algorithms became more accessible, models like SVM, Random Forest, and Gradient Boosting were applied, still largely relying on expertly handcrafted features (e.g., dipeptide composition, pseudo amino acid composition, various hydrophobicity scales). The challenge was that prediction quality depended heavily on the quality and comprehensiveness of the engineered features.
- NLP-Inspired Heuristic Features: A more recent development, exemplified by BERT4Bitter, involved adapting natural language processing (NLP) techniques. Peptide sequences were treated as "sentences" of amino acid "words", and heuristic coding methods were used to derive features that capture contextual relationships between amino acids. This demonstrated an improvement over purely physicochemical features.
- Deep Representation Learning: The current paper's work represents the next major step in this evolution. It moves beyond heuristics to directly leverage deep representation learning from pre-trained neural networks. Models like UniRep, SSA, and BiLSTM are trained on massive datasets of protein sequences to learn general-purpose, high-dimensional embeddings. These embeddings are not handcrafted features but learned representations that capture complex, hierarchical patterns within the sequences, often without explicit prior knowledge of what constitutes "bitter" features. This paradigm minimizes the need for manual feature engineering, allowing the model to discover optimal representations autonomously.

The paper's work fits within this timeline by pushing the frontier of feature extraction for peptide sequences from handcrafted or heuristic methods to learned, deep representations, aiming for superior predictive power and generalizability.
3.4. Differentiation Analysis
Compared to the main methods in related work, iBitter-DRLF presents several core differences and innovations:
- Shift from Feature Engineering to Deep Representation Learning:
  - Previous methods (BitterX, BitterPredict, iBitter-SCM, iBitter-Fuse) largely relied on traditional sequence features and feature engineering, which often required domain expertise and could be limited by the expressiveness of the handcrafted features.
  - iBitter-DRLF moves beyond this by utilizing deep representation learning features (SSA, UniRep, BiLSTM). These models are pre-trained on vast protein/peptide databases, allowing them to learn generic, high-level, and context-rich representations of peptide sequences automatically. This approach reduces manual effort and can uncover subtle patterns that handcrafted features might miss.
- Integration of Multiple Deep Embedding Techniques:
  - While BERT4Bitter used NLP heuristic signature coding (a form of learned representation), iBitter-DRLF integrates three distinct deep learning embedding techniques (SSA, UniRep, BiLSTM). This multi-perspective approach to feature extraction potentially provides a more comprehensive and robust representation of peptide characteristics. UniRep uses an mLSTM trained on UniRef50, offering a very general protein representation; BiLSTM directly captures sequence dependencies; and SSA focuses on sequence similarity.
- Systematic Feature Fusion and Selection for Optimal Performance:
  - The paper does not use just one type of deep feature but systematically explores fusion features (combinations of SSA, UniRep, and BiLSTM).
  - Crucially, it employs LGBM for feature selection on these high-dimensional fused features. This step is vital for removing redundancy, preventing overfitting, and identifying the most discriminative subset of features. This rigorous optimization process, from raw embeddings to a refined feature set, distinguishes it from methods that use fixed feature sets.
- Superior Performance Metrics:
  - The results show iBitter-DRLF (ACC 0.944, MCC 0.889, Sp 0.977, auROC 0.977) outperforming even recent advanced methods such as MIMML (ACC 0.938, MCC 0.875) and BERT4Bitter (ACC 0.922, MCC 0.844) in independent tests. This demonstrates a clear quantitative advantage in accuracy, particularly in specificity and auROC, indicating better discrimination between bitter and non-bitter peptides.

In essence, iBitter-DRLF innovates by embracing the power of deep representation learning to extract richer, more abstract features, combining diverse embedding techniques, and then meticulously refining these features through fusion and selection to build a highly accurate and stable predictive model.
4. Methodology
4.1. Principles
The core principle of iBitter-DRLF is to leverage the power of deep representation learning to automatically extract meaningful and high-dimensional features from raw peptide sequences. Instead of relying on traditional, manually engineered features, the method uses pre-trained neural networks (like those underlying SSA, UniRep, and BiLSTM) that have learned rich contextual and structural information about peptide sequences from vast biological datasets. These deep features are then fused to create comprehensive representations. To handle the high dimensionality and potential redundancy of these fused features, a feature selection step using LGBM is applied to identify the most discriminative subset. Finally, these optimized features are fed into a machine learning classifier (LGBM) to predict whether a peptide is bitter or not. The overall aim is to achieve superior prediction accuracy by learning better representations of peptides.
4.2. Core Methodology In-depth (Layer by Layer)
The development of iBitter-DRLF involves several key stages: dataset preparation, feature extraction using deep learning models, feature fusion, feature selection, model training and optimization with machine learning algorithms, and final evaluation. A high-level overview of the model development process is depicted in Figure 1.
Figure 1. Schematic overview of the iBitter-DRLF development workflow: dataset collection, feature extraction (SSA, UniRep, BiLSTM), feature fusion, machine learning algorithms, and evaluation of model performance (e.g., ACC and MCC).
4.2.1. Benchmark Dataset
The iBitter-DRLF model was developed and evaluated using the BTP640 benchmark dataset, which was updated from the iBitter-SCM study [17]. This dataset is designed for binary classification of peptides into bitter or non-bitter categories.
- Composition: The BTP640 dataset contains 320 bitter peptides and 320 non-bitter peptides, totaling 640 unique peptide sequences. The non-bitter peptides were constructed using the BIOPEP database [29], while the bitter peptides were experimentally confirmed.
- Splitting: To prevent overfitting and ensure robust evaluation, the dataset was randomly split into two subsets at a 4:1 ratio:
  - Training and Cross-Validation Set (BTP-CV): Used for model training and k-fold cross-validation. It contains 256 bitter peptides and 256 non-bitter peptides.
  - Independent Test Set (BTP-TS): Used for independent testing of the final model's performance on unseen data. It contains 64 bitter peptides and 64 non-bitter peptides.
4.2.2. Feature Extraction
The paper explores three different deep representation learning methods to extract features from peptide sequences. These methods convert variable-length peptide sequences into fixed-dimensional numerical vectors (eigenvectors) that can be used by machine learning models.
4.2.2.1. Pre-Trained SSA Embedding Model
The soft symmetric alignment (SSA) model [30] is designed to measure similarity between sequences of arbitrary lengths by embedding them into vectors.
- Input: A peptide sequence is fed into the pre-trained model.
- Encoding: The peptide sequence is encoded by a three-tier stacked BiLSTM encoder.
- Output Embedding: The final embedding for a peptide sequence of length $L$ is a matrix $Z \in \mathbb{R}^{L \times 121}$, i.e., a 121-dimensional vector for each amino acid residue.
- Similarity Calculation: To calculate the similarity between two peptide sequences of lengths $n$ and $m$, represented by the embedded matrices $Z^{1} = [z^{1}_{1}, \ldots, z^{1}_{n}]$ and $Z^{2} = [z^{2}_{1}, \ldots, z^{2}_{m}]$ (each $z$ a 121-dimensional vector), the similarity is calculated using the following formula:
  $$\hat{s} = -\frac{1}{A}\sum_{i=1}^{n}\sum_{j=1}^{m}\alpha_{ij}\,\big\| z^{1}_{i} - z^{2}_{j}\big\|_{1}$$
  where:
  - $\hat{s}$ is the soft symmetric alignment similarity score.
  - $A = \sum_{i=1}^{n}\sum_{j=1}^{m}\alpha_{ij}$ is a normalization factor.
  - $n$ and $m$ are the lengths of the two peptide sequences.
  - $z^{1}_{i}$ and $z^{2}_{j}$ are the 121-dimensional embedded vectors for the $i$-th and $j$-th amino acids of sequences 1 and 2, respectively.
  - $\|\cdot\|_{1}$ denotes the L1 norm (Manhattan distance) between the two vectors.
  - $\alpha_{ij}$ is a weighting coefficient determined by the following formulas:
  $$a_{ij} = \frac{\exp\!\big(-\| z^{1}_{i}-z^{2}_{j}\|_{1}\big)}{\sum_{k=1}^{m}\exp\!\big(-\| z^{1}_{i}-z^{2}_{k}\|_{1}\big)},\qquad b_{ij} = \frac{\exp\!\big(-\| z^{1}_{i}-z^{2}_{j}\|_{1}\big)}{\sum_{k=1}^{n}\exp\!\big(-\| z^{1}_{k}-z^{2}_{j}\|_{1}\big)},\qquad \alpha_{ij} = a_{ij} + b_{ij} - a_{ij}b_{ij}$$

  Here, $a_{ij}$ and $b_{ij}$ are softmax-like attention weights indicating the similarity of amino acid $i$ to amino acid $j$ considering all amino acids in the respective sequences, and $\alpha_{ij}$ combines them into a symmetric alignment weight; $A$ is the normalization factor, the sum of all $\alpha_{ij}$. These parameters are backfitted together with the parameters of the sequence encoder because SSA is fully differentiable. The trained model converts a peptide sequence into an embedding matrix $Z$; for classification, an aggregation (e.g., mean or max pooling) over the length dimension is typically performed to obtain a fixed-length 121D vector.
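To make the scoring above concrete, here is a minimal NumPy sketch of the soft symmetric alignment computation. The 121-dimensional per-residue embeddings are random placeholders; in the actual method they would come from the pre-trained three-tier stacked BiLSTM encoder.

```python
import numpy as np

def ssa_similarity(z1: np.ndarray, z2: np.ndarray) -> float:
    """Soft symmetric alignment score for two per-residue embedding matrices.

    z1: (n, d) embedding of peptide 1; z2: (m, d) embedding of peptide 2.
    """
    # Pairwise L1 distances between residues of the two peptides, shape (n, m).
    dist = np.abs(z1[:, None, :] - z2[None, :, :]).sum(axis=-1)

    def softmax_neg(d, axis):
        # Numerically stable softmax of -d along the given axis.
        x = -d - (-d).max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    a = softmax_neg(dist, axis=1)   # attention of each residue in peptide 1 over peptide 2
    b = softmax_neg(dist, axis=0)   # attention of each residue in peptide 2 over peptide 1
    alpha = a + b - a * b           # symmetric alignment weights
    return float(-(alpha * dist).sum() / alpha.sum())

# Toy usage: random 121-D embeddings stand in for the pre-trained SSA encoder output.
rng = np.random.default_rng(0)
score = ssa_similarity(rng.normal(scale=0.05, size=(6, 121)),
                       rng.normal(scale=0.05, size=(8, 121)))
print(f"SSA similarity: {score:.3f}")
```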
4.2.2.2. Pre-Trained UniRep Embedding Model
The UniRep model [25] is a deep representation learning model trained on 24 million UniRef50 primary amino acid sequences (a large database of protein sequences).
- Training Objective: The model performs next amino acid prediction by minimizing cross-entropy losses. Through this task, it learns to represent proteins internally.
- Architecture: It uses an mLSTM (multiplicative Long Short-Term Memory) network.
- Feature Generation:
  1. A peptide sequence with $L$ amino acid residues is initially embedded using one-hot encoding, resulting in an $L \times 10$ matrix (assuming 10 features per amino acid, or more commonly 20 for the standard amino acids).
  2. This matrix is then fed into the mLSTM encoder.
  3. The mLSTM outputs a hidden state at each position; after this operation, a fixed-length UniRep feature vector of 1900D (1900 dimensions) is derived, representing the entire peptide sequence.

The calculation of the mLSTM encoder involves the following equations:
$$m_{t} = (X_{t} W_{xm}) \odot (h_{t-1} W_{hm})$$
$$\hat{h}_{t} = X_{t} W_{xh} + m_{t} W_{mh}$$
$$f_{t} = \sigma\!\left(X_{t} W_{xf} + m_{t} W_{mf}\right)$$
$$i_{t} = \sigma\!\left(X_{t} W_{xi} + m_{t} W_{mi}\right)$$
$$o_{t} = \sigma\!\left(X_{t} W_{xo} + m_{t} W_{mo}\right)$$
$$c_{t} = f_{t} \odot c_{t-1} + i_{t} \odot \tanh\!\big(\hat{h}_{t}\big)$$
$$h_{t} = o_{t} \odot \tanh(c_{t})$$
Where:
- $\odot$ represents element-by-element multiplication.
- $X_{t}$ is the current input (e.g., the one-hot encoded representation of the current amino acid).
- $h_{t-1}$ represents the previous hidden state.
- $m_{t}$ is the current intermediate multiplicative state, a key distinguishing feature of mLSTM.
- $\hat{h}_{t}$ represents the input before the hidden state activation.
- $f_{t}$ is the forget gate (controls which information to discard).
- $i_{t}$ is the input gate (controls which new information to store).
- $o_{t}$ is the output gate (controls which information to output from the cell state).
- $c_{t-1}$ is the previous cell state.
- $c_{t}$ is the current cell state.
- $h_{t}$ is the output hidden state.
- $\sigma$ is the sigmoid function, which squashes values between 0 and 1.
- $\tanh$ is the hyperbolic tangent function, which squashes values between -1 and 1.
- $W_{xm}$, $W_{hm}$, $W_{xh}$, $W_{mh}$, $W_{xf}$, $W_{mf}$, $W_{xi}$, $W_{mi}$, $W_{xo}$, $W_{mo}$ are weight matrices learned during training.
4.2.2.3. Pre-Trained BiLSTM Embedding Model
A bidirectional Long Short-Term Memory (BiLSTM) model [31] is used to extract sequence features.
- Architecture: BiLSTM consists of a forward LSTM and a backward LSTM. The forward LSTM processes the sequence from beginning to end, while the backward LSTM processes it from end to beginning. The outputs from both directions are concatenated or combined into a single representation, allowing the model to capture bidirectional context for each position in the sequence.
- LSTM Mechanism: An LSTM unit uses gates (forget, input, output) to regulate the flow of information into and out of its cell state and hidden state, enabling it to learn long-range dependencies.

The calculation of BiLSTM involves the following formulas for a single LSTM unit (these are applied in both the forward and backward directions and then combined):
$$f_{t} = \sigma\!\left(W_{f}\cdot[h_{t-1}, x_{t}] + b_{f}\right)$$
$$i_{t} = \sigma\!\left(W_{i}\cdot[h_{t-1}, x_{t}] + b_{i}\right)$$
$$\tilde{C}_{t} = \tanh\!\left(W_{C}\cdot[h_{t-1}, x_{t}] + b_{C}\right)$$
$$C_{t} = f_{t} \odot C_{t-1} + i_{t} \odot \tilde{C}_{t}$$
$$o_{t} = \sigma\!\left(W_{o}\cdot[h_{t-1}, x_{t}] + b_{o}\right)$$
$$h_{t} = o_{t} \odot \tanh(C_{t})$$
Where:
- $x_{t}$ is the current input at time step $t$.
- $h_{t-1}$ represents the hidden state from the previous time step.
- $W_{f}$, $W_{i}$, $W_{C}$, $W_{o}$ are weight matrices for the forget, input, candidate cell state, and output gates, respectively.
- $b_{f}$, $b_{i}$, $b_{C}$, $b_{o}$ are bias vectors for the respective gates.
- $[h_{t-1}, x_{t}]$ denotes the concatenation of the previous hidden state and the current input.
- $f_{t}$ is the forget gate vector, determining what information from $C_{t-1}$ to forget.
- $i_{t}$ is the input gate vector, determining what new information to store in $C_{t}$.
- $\tilde{C}_{t}$ is the candidate cell state, representing new potential information.
- $o_{t}$ is the output gate vector, determining what part of $C_{t}$ to output as $h_{t}$.
- $C_{t-1}$ is the previous cell state.
- $C_{t}$ is the current cell state, updated by forgetting part of $C_{t-1}$ and adding part of $\tilde{C}_{t}$.
- $h_{t}$ is the output hidden state at time step $t$.
- $\sigma$ is the sigmoid function.
- $\tanh$ is the hyperbolic tangent function.

For BiLSTM, the final representation for a peptide sequence is typically derived by concatenating the last hidden states of the forward and backward LSTMs, or by applying a pooling operation over the entire sequence of hidden states. The model used here produces a 3605D (3605-dimensional) eigenvector for each peptide.
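As a toy illustration of the bidirectional encoding and pooling described above, the following PyTorch sketch encodes one peptide with a small bidirectional LSTM and concatenates the final forward and backward hidden states into a fixed-length vector. All sizes are arbitrary assumptions; the pre-trained model behind the 3605D vectors is much larger.

```python
import torch
import torch.nn as nn

# Toy BiLSTM encoder: 20-D one-hot amino-acid input, 8 hidden units per direction.
torch.manual_seed(0)
bilstm = nn.LSTM(input_size=20, hidden_size=8, batch_first=True, bidirectional=True)

seq = torch.eye(20)[[0, 4, 9, 2, 7]].unsqueeze(0)   # one peptide of length 5, one-hot encoded
outputs, (h_n, _) = bilstm(seq)                     # outputs: (1, 5, 16), forward+backward per position

# Fixed-length peptide representation: concatenate the final forward and backward hidden states.
peptide_vec = torch.cat([h_n[0, 0], h_n[1, 0]], dim=-1)
print(peptide_vec.shape)   # torch.Size([16])
```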
4.2.3. Feature Fusion
To combine the complementary information from different deep representation learning features, the paper implements feature fusion. This process concatenates the individual feature vectors to create higher-dimensional, composite feature vectors.
- SSA + UniRep: The 121D SSA eigenvector is combined with the 1900D UniRep eigenvector to form a 2021D (2021-dimensional) fusion feature vector.
- SSA + BiLSTM: The 121D SSA eigenvector is combined with the 3605D BiLSTM eigenvector to form a 3726D (3726-dimensional) fusion feature vector.
- UniRep + BiLSTM: The 1900D UniRep eigenvector is combined with the 3605D BiLSTM eigenvector to form a 5505D (5505-dimensional) fusion feature vector.
- SSA + UniRep + BiLSTM: All three eigenvectors (121D SSA, 1900D UniRep, 3605D BiLSTM) are combined to obtain a 5626D (5626-dimensional) fusion feature vector.
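Assuming fusion is plain horizontal concatenation of the per-peptide vectors, as the dimension arithmetic above implies (121 + 1900 = 2021, and so on), a minimal sketch with random placeholder embeddings:

```python
import numpy as np

# Toy per-peptide embeddings with the dimensionalities reported in the paper.
n_peptides = 4
ssa    = np.random.rand(n_peptides, 121)
unirep = np.random.rand(n_peptides, 1900)
bilstm = np.random.rand(n_peptides, 3605)

# Feature fusion as column-wise concatenation of the per-peptide vectors.
fused_unirep_bilstm = np.hstack([unirep, bilstm])        # (4, 5505)
fused_all           = np.hstack([ssa, unirep, bilstm])   # (4, 5626)
print(fused_unirep_bilstm.shape, fused_all.shape)
```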
4.2.4. Feature Selection Method
High-dimensional feature vectors, especially from feature fusion, can introduce redundancy and increase the risk of overfitting. To address this, LGBM is utilized for feature selection.
- Process:
  1. Data (feature vectors) and labels (bitter/non-bitter) are input into an LGBM model.
  2. The LGBM model is trained, and its built-in functions are used to obtain an importance value for each feature. LGBM inherently provides feature importance based on how much each feature contributes to the splitting decisions in the ensemble of decision trees.
  3. Features are ranked from highest to lowest importance.
  4. A subset of features is selected by keeping those with an importance value greater than the critical value (defined as the average feature importance value).
- Optimization Strategy: An incremental feature strategy is used, where features are added incrementally and model performance is monitored. A hyperparametric mesh search method (specifically the scikit-learn GridSearchCV module) is used in conjunction to optimize hyperparameters for each model after feature selection. A sketch of the selection step is shown below.
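A minimal sketch of this selection step using the lightgbm scikit-learn interface; the feature matrix and labels are random placeholders standing in for the fused embeddings and the bitter/non-bitter labels.

```python
import numpy as np
from lightgbm import LGBMClassifier

# Toy stand-in for a fused feature matrix (the real UniRep+BiLSTM fusion is 512 x 5505).
rng = np.random.default_rng(0)
X = rng.normal(size=(512, 500))
y = rng.integers(0, 2, size=512)

# Train LGBM, rank features by importance, and keep those above the average importance,
# mirroring the "importance greater than the critical (mean) value" criterion described above.
model = LGBMClassifier(n_estimators=200).fit(X, y)
importance = model.feature_importances_
ranked = np.argsort(importance)[::-1]                      # highest to lowest
selected = ranked[importance[ranked] > importance.mean()]  # above-average features, still ranked
X_selected = X[:, selected]
print(f"kept {X_selected.shape[1]} of {X.shape[1]} features")
```

The incremental feature strategy would then add the ranked features step by step, re-running cross-validation (with GridSearchCV for hyperparameters) at each step to find the best-performing subset, which for UniRep + BiLSTM ended up at 106 dimensions.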
4.2.5. Machine Learning Methods
Three widely used machine learning models are employed as classifiers to build predictive models, testing various feature sets and fusion combinations:
- Support Vector Machine (SVM) [32,33]:
  - Purpose: A binary classifier that finds an optimal hyperplane to separate data points into classes.
  - Hyperparameters: The paper optimized gamma (the kernel coefficient for the rbf kernel) and C (the regularization parameter) over a predefined search range. The default kernel was rbf (Radial Basis Function).
- Random Forest (RF) [34,35]:
  - Purpose: An ensemble learning algorithm based on bagging that constructs multiple decision trees and outputs the mode of their predictions. It uses both random sampling of data and random selection of features during tree construction.
  - Hyperparameters: The paper optimized n_estimators (the number of trees in the forest) in the range (25, 550) and Nleaf (the minimum number of samples required at a leaf node) in the range (2, 12).
- Light Gradient Boosting Machine (LGBM) [23,36]:
  - Purpose: A gradient boosting framework that uses tree-based learning algorithms. It is known for its speed and efficiency, using a leaf-wise tree growth strategy (splitting the leaf that yields the largest reduction in loss) rather than a level-wise strategy.
  - Hyperparameters: The paper optimized n_estimators (the number of boosting rounds or trees) in the range (25, 750) and max_depth (the maximum depth of the individual decision trees) in the range (1, 12).
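A sketch of a grid search over roughly these ranges with scikit-learn's GridSearchCV. The SVM C/gamma bounds are not recoverable from the text, so a generic logarithmic grid is assumed, and random data stand in for the selected 106-D features.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier

rng = np.random.default_rng(0)
X, y = rng.normal(size=(512, 106)), rng.integers(0, 2, size=512)  # placeholder features/labels

# Hyperparameter grids loosely following the ranges quoted above.
searches = {
    "SVM":  GridSearchCV(SVC(kernel="rbf"),
                         {"C": np.logspace(-3, 3, 7), "gamma": np.logspace(-3, 3, 7)}, cv=10),
    "RF":   GridSearchCV(RandomForestClassifier(),
                         {"n_estimators": range(25, 551, 75), "min_samples_leaf": range(2, 13, 2)}, cv=10),
    "LGBM": GridSearchCV(LGBMClassifier(),
                         {"n_estimators": range(25, 751, 100), "max_depth": range(1, 13, 2)}, cv=10),
}
for name, search in searches.items():
    search.fit(X, y)
    print(name, search.best_params_, round(search.best_score_, 3))
```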
5. Experimental Setup
5.1. Datasets
The study utilized the BTP640 benchmark dataset, which is an updated version from iBitter-SCM [17].
- Source: The dataset is available online at http://public.aibiochem.net/peptides/BitterP/ or https://github.com/Shoombuatong2527/Benchmark-datasets.
- Composition: It comprises 640 peptide sequences in total, evenly split into:
  - 320 bitter peptides (experimentally confirmed).
  - 320 non-bitter peptides (constructed using the BIOPEP database [29]).
- Splitting: To ensure robust model evaluation and prevent overfitting, the BTP640 dataset was randomly divided into two subsets:
  - BTP-CV (Training and Cross-Validation Set): Contains 256 bitter peptides and 256 non-bitter peptides (512 peptides in total). This set was used for 10-fold cross-validation.
  - BTP-TS (Independent Test Set): Contains 64 bitter peptides and 64 non-bitter peptides (128 peptides in total). This set was used for independent testing on unseen data.
- Characteristics: The dataset focuses on relatively short peptides, as bitter peptides are typically composed of no more than eight amino acids, though some can be longer. The application domain is peptide therapeutics and dietary supplements, where bitterness is a critical palatability issue.
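For illustration, a stratified 4:1 split that reproduces the BTP-CV/BTP-TS sizes can be sketched with scikit-learn. The sequences and labels below are placeholders, and the random seed is an arbitrary choice; the paper's own random split may differ.

```python
from sklearn.model_selection import train_test_split

# Placeholder stand-ins for the 640 BTP640 sequences and labels (1 = bitter, 0 = non-bitter).
peptides = [f"PEPTIDE_{i}" for i in range(640)]
labels = [1] * 320 + [0] * 320

# A stratified 4:1 split yields the 512-peptide BTP-CV and 128-peptide BTP-TS subsets.
cv_seqs, ts_seqs, cv_y, ts_y = train_test_split(
    peptides, labels, test_size=0.2, stratify=labels, random_state=42)
print(len(cv_seqs), len(ts_seqs))  # 512 128
```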
5.2. Evaluation Metrics
The performance of the models was evaluated using five widely accepted classification metrics, along with auPRC and auROC. These metrics are calculated based on the counts of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).
- TP (True Positive): Number of bitter peptides correctly predicted as bitter.
- TN (True Negative): Number of non-bitter peptides correctly predicted as non-bitter.
- FP (False Positive): Number of non-bitter peptides incorrectly predicted as bitter.
- FN (False Negative): Number of bitter peptides incorrectly predicted as non-bitter.

Here are the detailed explanations and formulas for each metric:
Accuracy (ACC)
- Conceptual Definition: Accuracy measures the overall correctness of the model's predictions. It represents the proportion of correctly classified instances (both true positives and true negatives) out of the total number of instances. It is a good general indicator of performance but can be misleading on imbalanced datasets.
- Mathematical Formula:
  $$ACC = \frac{TP + TN}{TP + TN + FP + FN}$$
- Symbol Explanation:
  - $TP$: True Positives.
  - $TN$: True Negatives.
  - $FP$: False Positives.
  - $FN$: False Negatives.
Matthews Correlation Coefficient (MCC)
- Conceptual Definition: MCC is a more robust metric than accuracy, especially for imbalanced datasets. It considers all four quadrants of the confusion matrix (TP, TN, FP, FN) and produces a value between -1 and +1: +1 indicates a perfect prediction, 0 a random prediction, and -1 complete disagreement between prediction and observation. It is essentially a correlation coefficient between the observed and predicted binary classifications.
- Mathematical Formula:
  $$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
- Symbol Explanation:
  - $TP$: True Positives.
  - $TN$: True Negatives.
  - $FP$: False Positives.
  - $FN$: False Negatives.
Sensitivity (Sn), also known as Recall or True Positive Rate
- Conceptual Definition: Sensitivity measures the proportion of actual positive instances (bitter peptides) correctly identified by the model; in other words, it quantifies the model's ability to avoid false negatives. High sensitivity is crucial when the cost of missing a positive is high.
- Mathematical Formula:
  $$Sn = \frac{TP}{TP + FN}$$
- Symbol Explanation:
  - $TP$: True Positives.
  - $FN$: False Negatives.
Specificity (Sp), also known as True Negative Rate
- Conceptual Definition: Specificity measures the proportion of actual negative instances (non-bitter peptides) correctly identified by the model; it quantifies the model's ability to avoid false positives. High specificity is important when the cost of incorrectly labeling a negative as positive is high.
- Mathematical Formula:
  $$Sp = \frac{TN}{TN + FP}$$
- Symbol Explanation:
  - $TN$: True Negatives.
  - $FP$: False Positives.
F1-score (F1)
- Conceptual Definition: The F1-score is the harmonic mean of precision and recall (sensitivity). It provides a single score that balances precision (the proportion of positive identifications that are actually correct) and recall, and is particularly useful when the class distribution is imbalanced.
- Mathematical Formula:
  $$F1 = \frac{2\,TP}{2\,TP + FP + FN}$$
- Symbol Explanation:
  - $TP$: True Positives.
  - $FN$: False Negatives.
  - $FP$: False Positives.
Area Under the Precision-Recall Curve (auPRC)
- Conceptual Definition: The precision-recall curve plots precision against recall at various threshold settings; auPRC is the area under this curve. It is particularly informative for tasks with highly imbalanced datasets, where the positive class is rare, because it focuses on performance on the positive class. A higher auPRC indicates better performance.
- Mathematical Formula: The paper refers to auPRC as the area enclosed by the precision-recall curve and the x-axis, without providing a specific formula. Conceptually, it is the integral of precision with respect to recall.
Area Under the Receiver Operating Characteristic Curve (auROC)
- Conceptual Definition: The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate (Sensitivity) against the False Positive Rate (1 - Specificity) at various threshold settings; auROC is the area under this curve. It measures the ability of a classifier to distinguish between classes: 0.5 indicates performance no better than random chance, while 1.0 represents a perfect classifier. auROC is useful for evaluating models across all possible classification thresholds.
- Mathematical Formula: The paper refers to auROC as the area under the ROC curve, without providing a specific formula. Conceptually, it is the integral of Sensitivity with respect to (1 - Specificity).
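A compact sketch of how all seven metrics can be computed with scikit-learn. The labels and scores below are synthetic placeholders for a balanced 128-peptide test set, not the paper's actual predictions.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, matthews_corrcoef, recall_score,
                             f1_score, average_precision_score, roc_auc_score)

# Synthetic labels/scores for a balanced independent test set (64 bitter, 64 non-bitter).
rng = np.random.default_rng(0)
y_true  = np.array([1] * 64 + [0] * 64)
y_score = np.clip(y_true * 0.6 + rng.normal(0.2, 0.25, size=128), 0, 1)  # fake probabilities
y_pred  = (y_score >= 0.5).astype(int)

tn = np.sum((y_true == 0) & (y_pred == 0))
fp = np.sum((y_true == 0) & (y_pred == 1))
sp = tn / (tn + fp)                                    # specificity = TN / (TN + FP)

print(f"ACC   {accuracy_score(y_true, y_pred):.3f}")
print(f"MCC   {matthews_corrcoef(y_true, y_pred):.3f}")
print(f"Sn    {recall_score(y_true, y_pred):.3f}")     # sensitivity = TP / (TP + FN)
print(f"Sp    {sp:.3f}")
print(f"F1    {f1_score(y_true, y_pred):.3f}")
print(f"auPRC {average_precision_score(y_true, y_score):.3f}")
print(f"auROC {roc_auc_score(y_true, y_score):.3f}")
```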
5.3. Baselines
The iBitter-DRLF model's performance was compared against several existing state-of-the-art methods for bitter peptide prediction:
- iBitter-Fuse [18]: An earlier method that used multi-view features and fusion techniques for bitter peptide prediction.
- MIMML [20]: A recent method (2022) that employed mutual information-based meta-learning to find optimal feature combinations, known for achieving high accuracy.
- iBitter-SCM [17]: A method that likely used traditional sequence features and machine learning techniques. The benchmark dataset used in the current study is an updated version from iBitter-SCM.
- BERT4Bitter [19]: A method that utilized natural language processing (NLP) heuristic signature coding methods (BERT-based) to represent peptide sequences.

These baselines represent a range of approaches, from traditional feature-based ML (iBitter-SCM, iBitter-Fuse) to more advanced NLP-inspired (BERT4Bitter) and meta-learning (MIMML) techniques, providing a comprehensive comparison for iBitter-DRLF.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate a clear progression in model performance, highlighting the effectiveness of deep representation learning, feature fusion, feature selection, and hyperparameter optimization.
6.1.1. Results of Preliminary Optimization
The initial step involved evaluating the performance of individual deep representation learning features (SSA, UniRep, BiLSTM) when combined with three different machine learning algorithms (SVM, LGBM, RF). The results, presented in Table 1, show the 10-fold cross-validation and independent test metrics after initial parameter optimization.
The following are the results from Table 1 of the original paper:
| Feature | Model | Dim | 10-Fold Cross-Validation | Independent Test | ||||||||||||
| ACC | MCC | Sn | Sp | F1 | auPRC | auROC | ACC | MCC | Sn | Sp | F1 | auPRC | auROC | |||
| SSA b | SVM | 0.826 | 0.652 | 0.836 | 0.816 | 0.828 | 0.890 | 0.898 | 0.883 a | 0.766 | 0.891 | 0.875 | 0.884 | 0.951 | 0.944 | |
| LGBM | 121 | 0.787 | 0.575 | 0.816 | 0.758 | 0.793 | 0.874 | 0.886 | 0.859 | 0.722 | 0.906 | 0.812 | 0.866 | 0.949 | 0.941 | |
| RFc | 0.791 | 0.584 | 0.828 | 0.754 | 0.798 | 0.848 | 0.865 | 0.820 | 0.644 | 0.875 | 0.766 | 0.830 | 0.934 | 0.922 | ||
| SVM c | 0.865 | 0.730 | 0.867 | 0.863 | 0.865 | 0.937 | 0.931 | 0.867 | 0.735 | 0.844 | 0.891 | 0.864 | 0.952 | 0.948 | ||
| UniRep b | LGBM C | 1900 | 0.840 | 0.680 | 0.828 | 0.852 | 0.838 | 0.939 | 0.930 | 0.867 | 0.735 | 0.844 | 0.891 | 0.864 | 0.953 | 0.952 |
| RFc | 0.842 | 0.684 | 0.836 | 0.848 | 0.841 | 0.927 | 0.920 | 0.844 | 0.688 | 0.828 | 0.859 | 0.841 | 0.946 | 0.943 | ||
| SVM | 0.818 | 0.637 | 0.820 | 0.816 | 0.819 | 0.910 | 0.912 | 0.883 | 0.766 | 0.906 | 0.859 | 0.885 | 0.956 | 0.951 | ||
| LGBM | 0.855 | 0.711 | 0.863 | 0.848 | 0.857 | 0.924 | 0.926 | 0.836 | 0.673 | 0.812 | 0.859 | 0.832 | 0.950 | 0.950 | ||
| BiLSTM b | RFc | 3605 | 0.818 | 0.637 | 0.828 | 0.809 | 0.820 | 0.900 | 0.908 | 0.844 | 0.688 | 0.844 | 0.844 | 0.844 | 0.954 | 0.949 |
**Key Observations:**
- **UniRep/SVM (10-fold CV)**: The UniRep feature vector combined with SVM achieved the best 10-fold cross-validation results, with ACC = 0.865, MCC = 0.730, Sn = 0.867, Sp = 0.863, F1 = 0.865, auPRC = 0.937, and auROC = 0.931. This indicates that the 1900-dimensional UniRep features are highly effective at capturing information relevant to bitter peptide identification.
- **SSA/SVM (Independent Test)**: In independent tests, the SSA feature combined with SVM achieved the highest ACC of 0.883 and MCC of 0.766.
- **Comparison of Features**: The paper concludes that UniRep features were generally superior to BiLSTM features, showing higher ACC and MCC in 10-fold cross-validation, while SSA showed strong performance in independent tests. This preliminary analysis suggests that although individual deep features are effective, there may be benefits in combining them.
6.1.2. The Effects of Feature Fusion on the Automatic Identification of Bitter Peptides
To further enhance predictive performance, the study explored feature fusion by combining pairs and all three of the deep features. These fusion features were then evaluated with SVM, LGBM, and RF algorithms. Figure 2 visually summarizes the accuracy of independent tests for individual and fused features.
Figure 2. Independent-test accuracy of the individual and fused feature combinations (SSA, UniRep, BiLSTM, and their fusions, distinguished by color) across the machine learning models; accuracies range from about 0.78 to 0.90, with fusions such as SSA + UniRep + BiLSTM performing best, supporting the effectiveness of deep representation learning for bitter peptide identification.
The following are the results from Table 2 of the original paper:
| Feature | Model | Dim | 10-Fold Cross-Validation | Independent Test | ||||||||||||
| ACC | MCC | Sn | Sp | F1 | auPRC | auROC | ACC | MCC | Sn | Sp | F1 | auPRC | auROC | |||
| SSA + UniRep b | SVM | 0.861 | 0.723 | 0.875 a | 0.848 | 0.863 | 0.929 | 0.927 | 0.867 | 0.734 | 0.859 | 0.875 | 0.866 | 0.954 | 0.952 | |
| LGBM | 2021 | 0.840 | 0.680 | 0.848 | 0.832 | 0.841 | 0.933 | 0.924 | 0.859 | 0.719 | 0.859 | 0.859 | 0.859 | 0.960 | 0.958 | |
| RFc | 0.838 | 0.676 | 0.840 | 0.836 | 0.838 | 0.923 | 0.917 | 0.867 | 0.735 | 0.844 | 0.891 | 0.864 | 0.955 | 0.954 | ||
| SSA + BiLSTM b | SVMc | 0.836 | 0.672 | 0.848 | 0.824 | 0.838 | 0.915 | 0.917 | 0.883 | 0.766 | 0.859 | 0.906 | 0.880 | 0.943 | 0.947 | |
| LGBM | 3726 | 0.848 | 0.696 | 0.859 | 0.836 | 0.849 | 0.927 | 0.927 | 0.875 | 0.751 | 0.906 | 0.844 | 0.879 | 0.961 | 0.957 | |
| RFc | 0.824 | 0.649 | 0.832 | 0.816 | 0.826 | 0.906 | 0.911 | 0.898 | 0.797 | 0.891 | 0.906 | 0.898 | 0.959 | 0.951 | ||
| UniRep + BiLSTM b | SVM | 0.844 | 0.688 | 0.859 | 0.828 | 0.846 | 0.921 | 0.926 | 0.891 | 0.783 | 0.922 | 0.859 | 0.894 | 0.966 | 0.962 | |
| LGBM | 5505 | 0.863 | 0.727 | 0.871 | 0.855 | 0.864 | 0.932 | 0.935 | 0.870 | 0.737 | 0.859 | 0.886 | 0.887 | 0.972 | 0.958 | |
| RFc | 0.832 | 0.664 | 0.844 | 0.820 | 0.834 | 0.932 | 0.930 | 0.875 | 0.750 | 0.859 | 0.891 | 0.873 | 0.963 | 0.960 | ||
| SSA +UniRep + BiLSTM b | SVM | 0.871 | 0.742 | 0.863 | 0.879 | 0.870 | 0.943 | 0.941 | 0.891 | 0.783 | 0.922 | 0.859 | 0.894 | 0.940 | 0.943 | |
| LGBM | 5626 | 0.855 | 0.711 | 0.844 | 0.867 | 0.854 | 0.945 | 0.942 | 0.898 | 0.797 | 0.891 | 0.906 | 0.898 | 0.971 | 0.971 | |
| RFc | 0.840 | 0.680 | 0.848 | 0.832 | 0.841 | 0.926 | 0.925 | 0.898 | 0.799 | 0.859 | 0.937 | 0.894 | 0.963 | 0.957 | ||
**Key Observations:**
- **Fusion Superiority**: The optimal values from fusion features consistently outperformed the best values obtained from non-combined features (compare Table 2 to Table 1). For example, the SSA + BiLSTM fusion with RF achieved an ACC of 0.898 in independent tests, a 9.51% improvement over the SSA feature alone with RF (ACC of 0.820).
- **Triple Fusion Performance**: The SSA + UniRep + BiLSTM fusion achieved ACC = 0.898 and MCC = 0.797 with LGBM in independent tests, showing robust performance; with RF it achieved ACC = 0.898 and MCC = 0.799.
- **Best Pairwise Fusion**: The UniRep + BiLSTM fusion achieved a notable ACC of 0.891 and MCC of 0.783 with SVM in independent tests. Overall, combining features generally led to better accuracy, demonstrating the benefit of feature fusion in capturing diverse information.
6.1.3. The Effect of Feature Selection on the Automatic Identification of Bitter Peptides
Given the improved performance of fused features, the next step was to address the potential issues of high dimensionality and redundancy. Feature selection using LGBM was applied, along with incremental feature strategy and hyperparameter mesh search. The performance metrics for individual and fused features after feature selection are summarized in Table 3. Figure 3 visually presents these results.
Figure 3. Performance of the different machine learning algorithms (SVM, LGBM, RF) on the fused features after feature selection. Panels (A, C, E) show 10-fold cross-validation results and panels (B, D, F) show independent test results, confirming the effectiveness of the deep representation learning approach for bitter peptide identification.
The following are the results from Table 3 of the original paper:
| Feature | Model | Dim | 10-Fold Cross-Validation | Independent Test | ||||||||||||
| ACC | MCC | Sn | Sp | F1 | auPRC | auROC | ACC | MCC | Sn | Sp | F1 | auPRC | auROC | |||
| SSA b | SVM | 53 | 0.820 | 0.641 | 0.840 | 0.801 | 0.824 | 0.910 | 0.909 | 0.914 | 0.829 | 0.937 | 0.891 | 0.916 | 0.948 | 0.941 |
| LGBM | 77 | 0.816 | 0.634 | 0.848 | 0.785 | 0.822 | 0.877 | 0.892 | 0.883 | 0.768 | 0.922 | 0.844 | 0.887 | 0.947 | 0.940 | |
| RFc | 16 | 0.805 | 0.610 | 0.820 | 0.789 | 0.808 | 0.860 | 0.881 | 0.867 | 0.734 | 0.875 | 0.859 | 0.868 | 0.888 | 0.894 | |
| UniRep b | SVM | 65 | 0.875 | 0.750 | 0.875 | 0.875 | 0.875 | 0.946 | 0.943 | 0.906 | 0.813 | 0.891 | 0.922 | 0.905 | 0.952 | 0.952 |
| LGBM | 313 | 0.854 | 0.707 | 0.855 | 0.852 | 0.854 | 0.946 | 0.938 | 0.914 | 0.829 | 0.891 | 0.937 | 0.912 | 0.954 | 0.948 | |
| RFc | 329 | 0.836 | 0.672 | 0.824 | 0.848 | 0.834 | 0.918 | 0.908 | 0.891 | 0.785 | 0.844 | 0.937 | 0.885 | 0.958 | 0.957 | |
| BiLSTM b | SVM | 344 | 0.820 | 0.641 | 0.824 | 0.816 | 0.821 | 0.913 | 0.915 | 0.922 | 0.844 | 0.937 | 0.906 | 0.923 | 0.955 | 0.956 |
| LGBM | 339 | 0.871 | 0.742 | 0.883 | 0.859 | 0.873 | 0.925 | 0.929 | 0.906 | 0.813 | 0.906 | 0.906 | 0.906 | 0.969 | 0.966 | |
| RFc | 434 | 0.830 | 0.660 | 0.836 | 0.824 | 0.831 | 0.906 | 0.914 | 0.898 | 0.797 | 0.906 | 0.891 | 0.899 | 0.957 | 0.950 | |
| SSA +UniRep b | SVM | 62 | 0.865 | 0.730 | 0.863 | 0.867 | 0.865 | 0.944 | 0.942 | 0.914 | 0.828 | 0.906 | 0.922 | 0.913 | 0.958 | 0.957 |
| LGBMC | 106 | 0.881 | 0.762 | 0.887 | 0.875 | 0.882 | 0.961 | 0.957 | 0.891 | 0.783 | 0.859 | 0.922 | 0.887 | 0.952 | 0.947 | |
| RFc | 47 | 0.838 | 0.676 | 0.859 | 0.816 | 0.841 | 0.937 | 0.931 | 0.906 | 0.816 | 0.859 | 0.953 | 0.902 | 0.956 | 0.947 | |
| SSA + BiLSTM b | SVMc | 267 | 0.836 | 0.672 | 0.836 | 0.836 | 0.836 | 0.910 | 0.911 | 0.914 | 0.828 | 0.906 | 0.922 | 0.913 | 0.956 | 0.952 |
| LGBM | 317 | 0.861 | 0.723 | 0.875 | 0.848 | 0.863 | 0.924 | 0.929 | 0.906 | 0.813 | 0.906 | 0.906 | 0.906 | 0.962 | 0.958 | |
| RFc | 176 | 0.832 | 0.664 | 0.848 | 0.816 | 0.835 | 0.922 | 0.925 | 0.906 | 0.813 | 0.906 | 0.906 | 0.906 | 0.959 | 0.952 | |
| UniRep + BiLSTM bb | SVM | 186 | 0.873 | 0.746 | 0.887 | 0.859 | 0.875 | 0.932 | 0.934 | 0.914 | 0.829 | 0.937 | 0.891 | 0.916 | 0.961 | 0.965 |
| LGBM | 106 | 0.889 a | 0.777 | 0.891 | 0.887 | 0.889 | 0.947 | 0.952 | 0.944 | 0.889 | 0.922 | 0.977 | 0.952 | 0.984 | 0.977 | |
| RFc | 45 | 0.871 | 0.742 | 0.871 | 0.871 | 0.871 | 0.937 | 0.941 | 0.938 | 0.875 | 0.938 | 0.938 | 0.938 | 0.976 | 0.971 | |
| SSA +UniRep + BiLSTM b | SVMc | 336 | 0.881 | 0.762 | 0.883 | 0.879 | 0.881 | 0.940 | 0.942 | 0.922 | 0.845 | 0.953 | 0.891 | 0.924 | 0.942 | 0.946 |
| LGBM | 285 | 0.881 | 0.762 | 0.891 | 0.871 | 0.882 | 0.951 | 0.947 | 0.938 | 0.875 | 0.922 | 0.953 | 0.937 | 0.969 | 0.969 | |
| RFc | 192 | 0.863 | 0.727 | 0.859 | 0.867 | 0.863 | 0.932 | 0.932 | 0.922 | 0.844 | 0.906 | 0.937 | 0.921 | 0.970 | 0.967 | |
**Key Observations:**
- **Significant Improvement with Feature Selection**: Feature selection dramatically improved performance on most metrics, and the dimensionality of the feature vectors was substantially reduced (e.g., from 5505D to 106D), indicating successful removal of redundant or less important features.
- **UniRep + BiLSTM (106D) is Optimal**: The UniRep + BiLSTM fusion feature, reduced to 106 dimensions by LGBM feature selection, consistently outperformed all other options.
  - 10-fold Cross-Validation: ACC = 0.889, MCC = 0.777, Sn = 0.891, Sp = 0.887, F1 = 0.889, auPRC = 0.947, auROC = 0.952, an improvement across all metrics compared to the pre-selection results.
  - Independent Test: ACC = 0.944, MCC = 0.889, Sn = 0.922, Sp = 0.977, F1 = 0.952, auPRC = 0.984, auROC = 0.977, the highest recorded values, indicating excellent generalization to unseen data.
- **Reduced Dimensionality, Enhanced Performance**: This finding validates that feature selection is an effective strategy for resolving the information redundancy and overfitting issues inherent in high-dimensional fused features, leading to a more robust and accurate model.
6.1.4. The Effect of Machine Learning Model Parameter Optimization on the Automated Identification of Bitter Peptides
After identifying the 106-dimensional UniRep + BiLSTM feature set as superior, the authors performed further hyperparameter optimization using scikit-learn GridSearchCV for the SVM, RF, and LGBM models. Figure 4 illustrates the performance metrics obtained with default parameters versus optimized hyperparameters.

Key Observations:
- Hyperparameter Tuning Benefits: For all machine learning models (SVM, LGBM, RF) trained on the 106D UniRep + BiLSTM feature set, optimized hyperparameters consistently matched or outperformed the same models with default parameters across all metrics, confirming the importance of fine-tuning model parameters.
- LGBM Superiority: The LGBM model with optimized n_estimators and max_depth showed clearly superior performance compared to RF and SVM in both independent testing and 10-fold cross-validation. Although the RF model had a slightly better Sn (0.938 vs. 0.922 for LGBM in independent tests), LGBM excelled in all other crucial metrics: ACC, MCC, Sp, F1, auPRC, and auROC.
- Final Model Selection: Based on this comprehensive analysis, the 106D UniRep + BiLSTM feature set combined with the optimized LGBM model was selected as the final iBitter-DRLF predictor.
  - Final 10-fold CV results: ACC = 0.889, MCC = 0.777, Sn = 0.891, Sp = 0.887, F1 = 0.889, auPRC = 0.947, auROC = 0.952.
  - Final Independent Test results: ACC = 0.944, MCC = 0.889, Sn = 0.922, Sp = 0.977, F1 = 0.952, auPRC = 0.984, auROC = 0.977.
6.1.5. Comparison with Existing Methods
The ultimate validation for iBitter-DRLF involved comparing its independent test performance against existing state-of-the-art methods, as shown in Table 4.
The following are the results from Table 4 of the original paper:
| Classifier | ACC | MCC | Sn | Sp | auROC |
| iBitter-DRLF | 0.944 a | 0.889 | 0.922 | 0.977 | 0.977 |
| iBitter-Fuse | 0.930 | 0.859 | 0.938 | 0.922 | 0.933 |
| BERT4Bitter | 0.922 | 0.844 | 0.938 | 0.906 | 0.964 |
| iBitter-SCM | 0.844 | 0.688 | 0.844 | 0.844 | 0.904 |
| MIMML | 0.938 | 0.875 | 0.938 | 0.938 | 0.955 |
**Key Observations:**
- **Superior Performance**: iBitter-DRLF consistently outperformed all compared methods across several critical metrics in independent tests:
  - ACC: 0.944 (vs. MIMML 0.938, iBitter-Fuse 0.930, BERT4Bitter 0.922).
  - MCC: 0.889 (vs. MIMML 0.875, iBitter-Fuse 0.859, BERT4Bitter 0.844).
  - Sp: 0.977 (vs. MIMML 0.938, iBitter-Fuse 0.922, BERT4Bitter 0.906), indicating that iBitter-DRLF is very good at correctly identifying non-bitter peptides and minimizing false positives.
  - auROC: 0.977 (vs. BERT4Bitter 0.964, MIMML 0.955, iBitter-Fuse 0.933); a high auROC indicates excellent discriminative ability across thresholds.
- **Reliability and Stability**: The significant improvements in ACC, MCC, Sp, and auROC show that iBitter-DRLF is more reliable and stable in predicting peptide bitterness than existing algorithms. While some methods have a slightly higher Sn (iBitter-Fuse, BERT4Bitter, and MIMML all at 0.938 vs. iBitter-DRLF's 0.922), iBitter-DRLF's balance of Sn with exceptionally high Sp and auROC makes it the stronger predictor overall.
6.1.6. Feature Visualization of the Bitter Peptide Automatic Recognition Effect
UMAP (Uniform Manifold Approximation and Projection) was used for dimensionality reduction and visualization of the extracted features, providing insight into how well different feature sets separate bitter from non-bitter peptides. Figure 5 shows the UMAP visualizations.

Key Observations:
- Improved Discrimination with Fusion and Selection:
  - Individual Features (Figure 5A,B): UniRep and BiLSTM features individually (before fusion and selection) show some separation between bitter peptides (red) and non-bitter peptides (blue), but there is still significant overlap.
  - Fused Features (Figure 5C): The fusion shows better clustering and separation than the individual features, indicating that combining information improves discriminability.
  - Selected Fused Features (Figure 5D): The UMAP visualization of the 106 selected features from the UniRep + BiLSTM fusion demonstrates the clearest discrimination: the bitter and non-bitter peptides form distinct, well-separated clusters with minimal overlap. This visually confirms that feature selection extracts the most relevant information, producing highly separable clusters for classification and providing strong visual evidence for the improvements observed in the quantitative metrics. A sketch of how such a projection is produced is given below.
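A minimal sketch of producing such a 2-D projection with the umap-learn package; random features stand in for the real 106-D selected features, so the plot will not reproduce the clusters in Figure 5.

```python
import numpy as np
import umap
import matplotlib.pyplot as plt

# Placeholder 106-D "selected features" for 640 peptides (real values come from the pipeline).
rng = np.random.default_rng(0)
X = rng.normal(size=(640, 106))
labels = np.array([1] * 320 + [0] * 320)   # 1 = bitter, 0 = non-bitter

# Project to 2-D and color by class, mirroring the style of Figure 5D.
embedding = umap.UMAP(n_components=2, random_state=42).fit_transform(X)
plt.scatter(embedding[labels == 1, 0], embedding[labels == 1, 1], s=8, c="red", label="bitter")
plt.scatter(embedding[labels == 0, 0], embedding[labels == 0, 1], s=8, c="blue", label="non-bitter")
plt.legend()
plt.savefig("umap_selected_features.png", dpi=150)
```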
6.2. Data Presentation (Tables)
All relevant tables (Tables 1-4) have been transcribed and presented in the subsections above; cells that are merged in the original paper (feature names and dimensions spanning several rows) appear once per row group in the transcriptions.
6.3. Ablation Studies / Parameter Analysis
- Ablation Study (Implicit): The progression of results from individual features (Table 1) to fused features (Table 2) and then to selected fused features (Table 3) serves as an implicit ablation study. It demonstrates the incremental benefits of feature fusion and subsequent feature selection: each stage shows that adding or refining a component contributes positively to the model's ability to discriminate bitter peptides.
- Parameter Analysis: The study explicitly details hyperparameter optimization using scikit-learn GridSearchCV for the SVM, RF, and LGBM models (as shown in Figure 4 and discussed in Section 6.1.4). It shows that optimized hyperparameters consistently led to better performance than default parameters. For the final iBitter-DRLF LGBM model, specific optimized values of n_estimators and max_depth were identified. This parameter analysis is crucial for ensuring the model's robustness and optimal performance.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper successfully developed iBitter-DRLF, a novel computational model for accurately identifying bitter peptides solely based on their sequence data. The core innovation lies in its application of deep representation learning by leveraging pre-trained neural networks to extract highly informative features. Through a comprehensive process of evaluating individual features (SSA, UniRep, BiLSTM), performing feature fusion (combining UniRep and BiLSTM), and rigorously applying feature selection (LGBM) to distill the most discriminative 106-dimensional feature set, the model achieved superior performance. Coupled with an optimized LGBM classifier, iBitter-DRLF demonstrated impressive results in both 10-fold cross-validation and independent tests, significantly outperforming existing state-of-the-art predictors across key metrics such as accuracy, MCC, specificity, and auROC. The availability of a user-friendly webserver further enhances its utility, aiming to facilitate research in improving the palatability of peptide therapeutics and dietary supplements.
7.2. Limitations & Future Work
The authors acknowledge a key limitation: "Although the exact physicochemical relevance of these features is unclear, this does not prevent the successful use of this method for computational predictions in peptide and protein sequence analysis." This highlights the black-box nature inherent in many deep learning models; while they provide highly effective features, the direct biological or chemical interpretation of these high-dimensional, learned representations is not immediately obvious.
While the paper doesn't explicitly outline a "Future Work" section, the concluding remarks implicitly suggest directions:
- Improved Palatability: The direct application of iBitter-DRLF is to assist in identifying bitter peptides for removal or modification, thereby improving the palatability of nutritional supplements and peptide therapeutics.
- Advancing Drug Development and Nutrition Research: The method can serve as a tool to accelerate drug development by screening for bitter properties early, and to enhance nutrition research by guiding the design of non-bitter protein hydrolysates.
7.3. Personal Insights & Critique
- Innovation: The paper's strength lies in its systematic and rigorous approach to integrating deep representation learning into bitter peptide prediction. Moving beyond handcrafted features to learned embeddings from pre-trained models (UniRep, BiLSTM) is a significant step forward. The methodical evaluation of individual embeddings, their fusions, and the subsequent feature selection demonstrates thoroughness in model development. The use of UMAP for visualization effectively communicates the discriminative power gained through feature optimization.
- Potential Issues/Critique:
- Interpretability of Deep Features: As noted by the authors, the lack of clear
physicochemical relevancefor thedeep featuresis a limitation. While effective, it provides limited mechanistic insight into why certain peptides are bitter. Future work could involve integratingexplainable AI (XAI)techniques to interpret the learned features and connect them back to knownphysicochemical propertiesofamino acidsandpeptides. This could potentially lead to the design of novel, non-bitter peptides. - Dataset Size for Deep Learning: The
BTP640dataset, though a benchmark, contains only 640 peptides. While the use ofpre-trained models(UniRep,BiLSTM) mitigates the need for a very large dataset for training the embeddings, the finalLGBMclassifier training still relies on this relatively small set. For deep learning models that are fine-tuned or trained from scratch, larger datasets are typically preferred to fully exploit their capacity and preventoverfitting. Future research could explore collecting more extensive and diversebitter peptide datasets. - Generalizability to Novel Peptides: While the independent test set provides confidence, the true test of
generalizabilitylies in predicting bitterness for peptides with sequences significantly different from those in the training data. The robustness ofdeep representation learningshould help here, but this is always a challenge inbioinformatics.
- Interpretability of Deep Features: As noted by the authors, the lack of clear
- Transferability: The methodology, particularly the strategy of using diverse
pre-trained deep sequence embeddingsfollowed byfusionandfeature selectionwithgradient boosting models, is highly transferable. This framework could be applied to predict otherpeptide properties(e.g.,antimicrobial activity,bioavailability,allergenicity) or evenprotein functionsfrom sequence data. Thedeep representation learningmodels essentially act as powerfulfeature extractors, which can then be combined with variousdownstream machine learning tasks.