A Machine Learning Method to Identify Umami Peptide Sequences by Using Multiplicative LSTM Embedded Features
TL;DR Summary
iUmami-DRLF employs multiplicative LSTM-based deep features with logistic regression to accurately identify umami peptides, offering a robust, efficient alternative to traditional costly testing and advancing umami flavor research.
Abstract
Citation: Jiang, J.; Li, J.; Li, J.; Pei, H.; Li, M.; Zou, Q.; Lv, Z. A Machine Learning Method to Identify Umami Peptide Sequences by Using Multiplicative LSTM Embedded Features. Foods 2023, 12, 1498. https://doi.org/10.3390/foods12071498. Academic Editor: Christophe Flahaut. Received: 26 February 2023; Revised: 24 March 2023; Accepted: 30 March 2023; Published: 2 April 2023. Open access article distributed under the Creative Commons Attribution (CC BY) license.
1. Bibliographic Information
1.1. Title
A Machine Learning Method to Identify Umami Peptide Sequences by Using Multiplicative LSTM Embedded Features
1.2. Authors
Jici Jiang, Jiayu Li, Junxian Li, Hongdi Pei, Mingxin Li, Quan Zou, and Zhibin Lv. Affiliations:
- College of Biomedical Engineering, Sichuan University, Chengdu 610065, China
- College of Life Science, Sichuan University, Chengdu 610065, China
- Wu Yuzhang Honors College, Sichuan University, Chengdu 610065, China
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China
1.3. Journal/Conference
Published in Foods. Foods is a peer-reviewed open-access journal that covers a wide range of topics related to food science, technology, and nutrition. It is a reputable journal in the field, indicating that the research has undergone a review process by experts.
1.4. Publication Year
2023
1.5. Abstract
The paper introduces iUmami-DRLF, a machine learning method for identifying umami peptide sequences. This method exclusively uses logistic regression (LR) based on features extracted by a deep learning pre-trained neural network called unified representation (UniRep), which is built on multiplicative LSTM. The research demonstrates that this deep learning representation learning significantly enhances the model's ability to identify umami peptides and improves predictive precision using only peptide sequence information. iUmami-DRLF was tested against newly validated taste sequences and other predictors, showing superior robustness and accuracy, maintaining validity even at high probability thresholds. The proposed method aims to facilitate further studies on enhancing food's umami flavor for dietary needs.
1.6. Original Source Link
Official PDF link: /files/papers/6908b45ae81fdddf1c48bfa8/paper.pdf
This is the officially published version of the paper.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the time-consuming and expensive nature of wet testing (experimental laboratory procedures) for identifying umami peptides. Umami, recognized as the fifth basic taste, is crucial for enhancing food flavor, promoting healthy eating, and has numerous potential applications due to the distinctive flavor and properties of umami peptides (peptides that contribute to the umami taste).
In the postgenomic era, there has been a proliferation of peptide sequence databases, which has opened opportunities for automated mathematical methods to discover novel umami peptides. Prior research has developed machine learning (ML) models like Umami-SCM, UMPred-FRL, and iUP-BERT for umami peptide prediction. However, despite these advancements, existing ML-based algorithms relying solely on sequence data still lack sufficient accuracy and robustness, particularly in independent testing. The authors specifically note that iUP-BERT, while an improvement, was "not as robust as expected." This indicates a gap in the robustness and generalization performance of current models, which the paper seeks to address.
The paper's entry point and innovative idea revolve around leveraging deep representation learning to automatically extract highly informative features from raw peptide sequence data, eliminating the need for manual feature engineering. Specifically, they use a multiplicative LSTM-based unified representation (UniRep) model for feature extraction, combined with logistic regression for classification, aiming to achieve superior performance and robustness.
2.2. Main Contributions / Findings
The primary contributions and key findings of this paper are:
- Novel Model iUmami-DRLF: The development of iUmami-DRLF, an advanced machine learning model for identifying umami peptide sequences. This model uniquely relies solely on deep representation learning features extracted by a pre-trained UniRep neural network (specifically, multiplicative LSTM embeddings) and employs logistic regression for classification.
- Enhanced Predictive Performance: The study demonstrates that deep representation learning significantly boosts the capability and predictive precision of models in identifying umami peptides. iUmami-DRLF achieved superior results in both 10-fold cross-validation and independent tests compared to existing state-of-the-art methods. For instance, its independent test accuracy (ACC) was 2.45% higher than that of iUP-BERT.
- Superior Robustness and Accuracy: Through rigorous validation on a dataset of 91 wet-experiment verified umami peptide sequences (UMP-VERIFIED), iUmami-DRLF proved more robust and accurate than UMPred-FRL and iUP-BERT. Critically, iUmami-DRLF maintained significant prediction accuracy even at very high probability thresholds (e.g., 40.7% at the 99% threshold), where the other methods failed completely (0% accuracy). This robustness is attributed to its minimized cross-entropy loss.
- Feature Optimization: The research highlights the effectiveness of SMOTE for balancing imbalanced datasets and the crucial role of feature selection methods (particularly LGBM) in optimizing the high-dimensional feature vectors derived from deep learning embeddings.
- User-Friendly Web Server: A publicly accessible web server for iUmami-DRLF was developed, providing a practical tool for researchers to predict umami peptides.

These findings address the insufficient accuracy and robustness of existing ML-based umami peptide prediction tools, offering a more reliable and efficient alternative to costly and time-consuming wet laboratory experiments.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a novice reader should be familiar with several fundamental concepts in biology, machine learning, and deep learning:
- Umami Taste and Peptides:
- Umami: Recognized as the fifth basic taste (alongside sweet, sour, salty, and bitter), often described as savory or meaty. It's perceived through specific taste receptors.
- Peptides: Short chains of amino acids linked by peptide bonds. They are smaller than proteins. Umami peptides are specific peptides that elicit or enhance the umami taste. The paper notes they often contain aspartic acid, glutamic acid, asparagine, or glutamine residues.
- Machine Learning (ML):
- Supervised Learning: A type of machine learning where an algorithm learns from labeled data (i.e., data points where the correct output is already known). The goal is to learn a mapping from input features to output labels. In this paper, the model learns to classify peptides as "umami" or "non-umami" from sequences with known labels.
- Classification: A supervised learning task where the model predicts a categorical label (e.g., "umami" or "non-umami") for a given input.
- Deep Learning (DL):
- Neural Networks: Computational models inspired by the structure and function of biological neural networks. They consist of layers of interconnected "neurons" that process data.
- Representation Learning: A set of machine learning techniques that allow a system to automatically discover the representations needed for feature detection or classification from raw data. Instead of hand-crafting features, the model learns optimal feature representations. This is a core concept in the paper.
- Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM):
- RNNs: Neural networks designed to process sequential data (like peptide sequences). They have internal memory that allows them to use information from previous steps in the sequence.
- LSTM: A special type of RNN capable of learning long-term dependencies. It overcomes the vanishing gradient problem common in vanilla RNNs through gates (input, forget, and output gates) that regulate the flow of information.
- Multiplicative LSTM (mLSTM): A variant of LSTM in which the recurrent connections are modified with multiplicative interactions, potentially enhancing its ability to capture complex dependencies and improve representation learning. UniRep is based on this architecture.
- Logistic Regression (LR):
- A statistical model used for binary classification. Despite its name, it is a classification algorithm that models the probability of a binary outcome. It uses a sigmoid function to map predictions to probabilities between 0 and 1.
- Data Balancing Strategy:
- Synthetic Minority Over-sampling Technique (SMOTE): An oversampling technique used to address imbalanced datasets (where one class has significantly fewer samples than others). SMOTE creates synthetic samples for the minority class by interpolating between existing minority samples and their neighbors, thus increasing the number of minority class samples and balancing the dataset. This helps prevent classifiers from being biased towards the majority class.
- Feature Selection:
- Feature Selection: The process of selecting a subset of relevant features for use in model construction. This reduces dimensionality, removes noisy data, and can improve model performance and interpretability.
- Analysis of Variance (ANOVA): A statistical test used to analyze differences among group means. In feature selection, it can identify features where the means of different classes are significantly different, suggesting their importance for classification.
- Light Gradient Boosting Machine (LGBM) for Feature Importance: LGBM is an ensemble learning method based on decision trees. It can provide a score for each feature indicating how useful it was in the construction of the boosted decision trees; features with higher scores are considered more important.
- Mutual Information (MI): A measure from information theory that quantifies the amount of information obtained about one random variable by observing another. In feature selection, it measures the dependency between a feature and the target variable; higher MI indicates a stronger relationship.
- Evaluation Metrics: Explained in detail in Section 5.2.
3.2. Previous Works
The paper contextualizes its contribution by referencing several previous machine learning models designed for umami peptide prediction:
- Umami-SCM (2020) [9]: This model combines the Scoring Card Method (SCM) with propensity scores of amino acids and dipeptides to identify umami peptides. The SCM assigns scores based on the frequency or likelihood of amino acids/dipeptides appearing in umami peptides. It reported an independent test accuracy of 0.865.
- UMPred-FRL (2021) [11]: Developed by Charoenkwan et al., this method integrates seven different traditional feature encodings (e.g., amino acid composition, dipeptide composition, physicochemical properties) to construct its umami peptide classifier.
- iUP-BERT (2022) [12]: Proposed by Jiang et al., this model was a significant advancement, utilizing a single deep representation learning feature encoding based on BERT (Bidirectional Encoder Representations from Transformers), a powerful pre-trained language model that learns contextual representations of sequences. iUP-BERT showed superior performance compared to Umami-SCM and UMPred-FRL in both independent testing and cross-validation.
3.3. Technological Evolution
The field of umami peptide prediction has evolved from methods relying on traditional feature engineering to those leveraging deep representation learning.
- Early/Traditional Methods: Initially, researchers manually designed or selected features (numerical descriptors) from peptide sequences. These features could include amino acid composition (the percentage of each amino acid), dipeptide composition (the percentage of each two-amino-acid combination), physicochemical properties (e.g., hydrophobicity, charge), or propensity scores. Models like Umami-SCM and UMPred-FRL represent this phase, in which expert knowledge was critical for feature selection. While effective, these methods may struggle to capture complex, non-linear patterns and are limited by the completeness of human-designed features.
- Deep Learning for Feature Extraction: The advent of deep learning brought representation learning, in which neural networks automatically learn abstract, high-level feature representations directly from raw data (e.g., peptide sequences), eliminating laborious manual feature engineering. iUP-BERT marked a shift in this direction by using the BERT model, a transformer-based architecture known for learning rich contextual embeddings from sequences.
- Specialized Deep Learning Architectures (e.g., Multiplicative LSTM): The current paper builds upon this by employing UniRep, which is based on multiplicative LSTM. LSTMs are particularly well suited to sequential data like peptides, and the "multiplicative" aspect enhances their capacity to model complex relationships. UniRep is explicitly designed to learn a "unified representation" for proteins/peptides, making it a strong candidate for general sequence embedding.

This paper's work fits within this technological timeline as a further refinement of deep representation learning, specifically using mLSTM-based UniRep features and aiming for even greater robustness and accuracy than previous BERT-based or traditional feature-based approaches.
3.4. Differentiation Analysis
Compared to the main methods in related work, iUmami-DRLF presents several core differences and innovations:
- Sole Reliance on UniRep (mLSTM) Features: Unlike UMPred-FRL, which integrates various traditional feature encodings, or iUP-BERT, which uses BERT features, iUmami-DRLF solely employs features extracted from the UniRep model, which is built upon multiplicative LSTM. This emphasizes the power and effectiveness of this specific deep learning architecture for peptide representation; the paper argues that UniRep provides a highly informative "unified representation".
- Focus on Robustness at High Probability Thresholds: A key differentiator is iUmami-DRLF's demonstrated superior robustness and accuracy at high prediction probability thresholds (e.g., 95% or 99%). The paper explicitly shows that while iUP-BERT and UMPred-FRL fail (0% accuracy) at these high thresholds, iUmami-DRLF maintains significant predictive power. This is crucial for practical applications where high-confidence predictions are required.
- Optimized Logistic Regression Classifier: While deep learning is used for feature extraction, the final classification is performed by logistic regression (after feature selection and data balancing). This contrasts with models that use more complex classifiers or end-to-end deep learning. The choice of LR suggests a focus on interpretability and potentially faster inference once features are extracted.
- Comprehensive Feature Optimization Pipeline: The methodology includes a robust pipeline involving SMOTE for data balancing and multiple feature selection techniques (ANOVA, LGBM, MI), which are systematically evaluated to refine the feature set. This rigorous approach to feature engineering (even after deep learning embedding) contributes to the model's performance.
- Improved Independent Test Performance: The paper meticulously reports and compares independent test results, showing that iUmami-DRLF consistently outperforms previous state-of-the-art models across various metrics, indicating better generalization to unseen data.

In essence, iUmami-DRLF differentiates itself through a highly optimized deep representation learning approach tailored around UniRep/mLSTM embeddings, combined with careful data balancing and feature selection, leading to a significantly more robust and accurate predictor, especially under stringent confidence requirements.
4. Methodology
4.1. Principles
The core principle behind iUmami-DRLF is to leverage the power of deep representation learning to transform raw peptide sequences into meaningful, fixed-length numerical feature vectors. These vectors, known as embeddings, are designed to capture the underlying biochemical and structural properties relevant to a peptide's umami taste. By learning these representations automatically, the model avoids the limitations of manual feature engineering. Subsequently, a standard machine learning classifier (Logistic Regression) is trained on these high-quality features to predict whether a peptide is umami or not. The theoretical basis is that deep learning models, particularly recurrent architectures like LSTM, excel at processing sequential data and learning hierarchical features, making them suitable for peptide sequences. The multiplicative interactions in mLSTM further enhance this capability. The overall intuition is that if a machine can learn a robust numerical "fingerprint" for each peptide that accurately reflects its umami potential, then a simpler classifier can effectively distinguish between umami and non-umami peptides.
4.2. Core Methodology In-depth (Layer by Layer)
The development of iUmami-DRLF follows a structured pipeline as depicted in Figure 1, encompassing dataset preparation, feature extraction, data balancing, feature selection, model training, and evaluation.
4.2.1. Benchmark Dataset
The model was developed using the UMP442 benchmark dataset, which was an updated version from iUmami-SCM.
- Positive dataset: Comprises umami peptides from the BIOPEP-UWM database and other experimentally verified umami peptides.
- Negative dataset: Comprises bitter non-umami peptides.
- Total samples: 444 peptides (304 non-umami, 140 umami) after data cleaning.
- Dataset split: To prevent overfitting, the dataset was arbitrarily split (see the sketch after this list) into:
  - Training subset (UMP-TR): 112 umami peptides and 241 non-umami peptides.
  - Independent test subset (UMP-IND): 28 umami peptides and 61 non-umami peptides.
- External Validation Dataset (UMP-VERIFIED): An additional 91 wet-experiment verified umami peptide sequences were collected from the latest research to rigorously validate the accuracy and robustness of the model against state-of-the-art methods.
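To make the data preparation concrete, the following is a minimal Python sketch of a stratified split in the spirit of the UMP-TR/UMP-IND division. The peptide strings, the 80/20 ratio, and the variable names are illustrative assumptions, not the authors' actual data or split.

```python
# Minimal sketch of a stratified train/test split analogous to UMP-TR vs. UMP-IND.
# The peptide sequences below are hypothetical placeholders, not samples from UMP442.
from sklearn.model_selection import train_test_split

umami = ["EAGIQ", "DES", "EGS", "KGDEESLA", "GPA"]         # hypothetical positive peptides
non_umami = ["FFL", "LLW", "PPG", "WWP", "YPF", "LFF"]     # hypothetical negative peptides

sequences = umami + non_umami
labels = [1] * len(umami) + [0] * len(non_umami)           # 1 = umami, 0 = non-umami

train_seqs, test_seqs, y_train, y_test = train_test_split(
    sequences, labels, test_size=0.2, stratify=labels, random_state=42
)
print(len(train_seqs), "training peptides,", len(test_seqs), "test peptides")
```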
4.2.2. Feature Extraction using UniRep (Multiplicative LSTM)
This is the cornerstone of the iUmami-DRLF method. The paper leverages UniRep (Unified Representation), a deep learning model pre-trained on a vast corpus of amino acid sequences, to convert peptide sequences into fixed-length numerical vectors.
- UniRep Model Training: The UniRep model was pre-trained on 24 million core amino acid sequences from UniRef50. Its training objective was to predict the subsequent amino acid while minimizing cross-entropy loss, which forces the model to learn meaningful representations of protein/peptide sequences.
- Embedding Process:
  - Input Sequence Encoding: A peptide sequence of $S$ amino acid residues is first represented as an $S \times 20$ matrix using one-hot encoding (referred to in the paper as "single thermal code"): each residue becomes a 20-dimensional vector with a '1' at the position of that amino acid and '0's elsewhere.
  - mLSTM Encoder: This encoded matrix is then fed into the multiplicative Long Short-Term Memory (mLSTM) encoder, which processes the sequence step by step and produces a hidden state at each position.
  - 1900-D Feature Vector: The output of the mLSTM encoder is an $S \times 1900$ embedding matrix. To obtain a single fixed-length vector for the entire peptide, average pooling is applied across the sequence length $S$, yielding a 1900-dimensional (1900-D) UniRep feature vector per peptide.
- mLSTM Encoder Equations: The mLSTM encoder updates its internal states from the current input $x_t$ and the previous hidden and cell states ($h_{t-1}$, $c_{t-1}$) as follows.

  Equation (1): $m_t = (W_{mx} x_t) \odot (W_{mh} h_{t-1})$

  Where:
  - $m_t$: the current intermediate multiplicative state.
  - $x_t$: the current input at time step $t$.
  - $W_{mx}$: weight matrix for the input in the multiplicative state calculation.
  - $W_{mh}$: weight matrix for the previous hidden state in the multiplicative state calculation.
  - $h_{t-1}$: the hidden state from the previous time step ($t-1$).
  - $\odot$: element-by-element multiplication (Hadamard product).

  Equation (2): $\hat{h}_t = \tanh(W_{hm} m_t + W_{hx} x_t)$

  Where:
  - $\hat{h}_t$: the candidate hidden state before the gated update.
  - $W_{hm}$: weight matrix for the intermediate multiplicative state $m_t$.
  - $W_{hx}$: weight matrix for the current input $x_t$.
  - $\tanh$: the hyperbolic tangent activation function, which squashes values between -1 and 1.

  Equation (3): $f_t = \sigma(W_{fx} x_t + W_{fm} m_t)$

  Where:
  - $f_t$: the forget gate output at time step $t$; it determines what information from the previous cell state is discarded.
  - $\sigma$: the sigmoid activation function, which squashes values between 0 and 1.
  - $W_{fx}$, $W_{fm}$: weight matrices for the input and the multiplicative state in the forget gate.

  Equation (4): $i_t = \sigma(W_{ix} x_t + W_{im} m_t)$

  Where:
  - $i_t$: the input gate output at time step $t$; it determines which new information from the candidate hidden state is stored in the cell state.
  - $W_{ix}$, $W_{im}$: weight matrices for the input and the multiplicative state in the input gate.

  Equation (5): $o_t = \sigma(W_{ox} x_t + W_{om} m_t)$

  Where:
  - $o_t$: the output gate output at time step $t$; it determines which parts of the cell state are exposed as the hidden state $h_t$.
  - $W_{ox}$, $W_{om}$: weight matrices for the input and the multiplicative state in the output gate.

  Equation (6): $c_t = f_t \odot c_{t-1} + i_t \odot \hat{h}_t$

  Where:
  - $c_t$: the current cell state (the "memory" of the mLSTM) at time step $t$.
  - $c_{t-1}$: the cell state from the previous time step ($t-1$).
  - The first term retains the information from the previous cell state that the forget gate keeps; the second term adds the new information selected by the input gate from the candidate hidden state.

  Equation (7): $h_t = o_t \odot \tanh(c_t)$

  Where:
  - $h_t$: the current hidden state at time step $t$; it is the output of the mLSTM cell for this step and is passed to the next step.
  - The $\tanh$ of the cell state is modulated element-wise by the output gate $o_t$.

  These equations collectively describe how the mLSTM encoder processes sequential input data, maintains a memory state, and generates hidden representations that capture complex dependencies within the peptide sequence. A small numerical sketch of this recurrence follows.
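To ground Equations (1)-(7), here is a minimal NumPy sketch of the mLSTM recurrence with average pooling. The dimensions, randomly initialized weights, and toy input are illustrative stand-ins for the pre-trained UniRep parameters (the real encoder uses a 1900-dimensional hidden state).

```python
# Minimal NumPy sketch of the mLSTM recurrence (Equations 1-7) and average pooling.
# Weights are random stand-ins; UniRep uses pre-trained weights and d_h = 1900.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 20, 64                      # 20 amino acids; UniRep uses d_h = 1900

W_mx, W_mh = rng.normal(0, 0.1, (d_h, d_in)), rng.normal(0, 0.1, (d_h, d_h))
W_hm, W_hx = rng.normal(0, 0.1, (d_h, d_h)), rng.normal(0, 0.1, (d_h, d_in))
W_fx, W_fm = rng.normal(0, 0.1, (d_h, d_in)), rng.normal(0, 0.1, (d_h, d_h))
W_ix, W_im = rng.normal(0, 0.1, (d_h, d_in)), rng.normal(0, 0.1, (d_h, d_h))
W_ox, W_om = rng.normal(0, 0.1, (d_h, d_in)), rng.normal(0, 0.1, (d_h, d_h))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlstm_embed(one_hot_seq):
    """Run the mLSTM over an (S, 20) one-hot peptide and average-pool the hidden states."""
    h, c = np.zeros(d_h), np.zeros(d_h)
    hidden_states = []
    for x in one_hot_seq:
        m = (W_mx @ x) * (W_mh @ h)              # Eq. (1): multiplicative state
        h_hat = np.tanh(W_hm @ m + W_hx @ x)     # Eq. (2): candidate hidden state
        f = sigmoid(W_fx @ x + W_fm @ m)         # Eq. (3): forget gate
        i = sigmoid(W_ix @ x + W_im @ m)         # Eq. (4): input gate
        o = sigmoid(W_ox @ x + W_om @ m)         # Eq. (5): output gate
        c = f * c + i * h_hat                    # Eq. (6): cell state update
        h = o * np.tanh(c)                       # Eq. (7): hidden state
        hidden_states.append(h)
    return np.mean(hidden_states, axis=0)        # average pooling over the S positions

peptide = np.eye(d_in)[rng.integers(0, d_in, size=8)]   # toy one-hot peptide of length 8
print(mlstm_embed(peptide).shape)                        # (64,) here; (1900,) for UniRep
```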
4.2.3. Balancing Strategy (SMOTE)
Since the initial UMP-TR dataset was imbalanced (112 umami vs. 241 non-umami peptides), the Synthetic Minority Over-sampling Technique (SMOTE) was applied to balance the classes.
- Mechanism:
  - For each sample in the minority class (umami peptides), its k-nearest neighbors (KNN) are identified. Synthetic samples are then created by linearly interpolating between a minority class sample and one of its randomly chosen neighbors, i.e., a new sample is placed along the line segment connecting them.
- Purpose: SMOTE not only increases the sample size of the minority class but also improves sample quality by creating diverse, yet realistic, synthetic samples. This helps classifiers learn more distinct features and avoids bias towards the majority class, thereby improving overall performance (see the sketch after this list).
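As an illustration of this balancing step, here is a minimal sketch using the SMOTE implementation from the imbalanced-learn package; the random feature matrix stands in for the actual 1900-D UniRep vectors of UMP-TR.

```python
# Minimal sketch of SMOTE balancing with imbalanced-learn.
# X_train stands in for the 1900-D UniRep vectors of the UMP-TR peptides.
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X_train = rng.normal(size=(353, 1900))                  # 112 umami + 241 non-umami vectors
y_train = np.array([1] * 112 + [0] * 241)               # 1 = umami, 0 = non-umami

smote = SMOTE(random_state=42)                          # interpolates between minority neighbours
X_bal, y_bal = smote.fit_resample(X_train, y_train)

print(np.bincount(y_train), "->", np.bincount(y_bal))   # [241 112] -> [241 241]
```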
4.2.4. Feature Selection Strategy
After UniRep feature extraction, each peptide was represented by a 1900-dimensional vector. To address potential issues of over-fitting and feature redundancy associated with high-dimensional data, three feature selection techniques were employed: Analysis of Variance (ANOVA), Light Gradient Boosting Machine (LGBM), and Mutual Information (MI).
- Process: Features were ranked by their importance values (calculated by each method). Only features with importance values greater than a threshold (typically the average feature importance) were selected. An incremental feature strategy and hyperparameter grid search (using GridSearchCV in scikit-learn) were used for optimization, as sketched below.
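The following sketch illustrates one way such an incremental feature strategy with GridSearchCV could look. The scoring array, the step size, and the hyperparameter grid are assumptions for illustration rather than the authors' exact settings; `X_bal`/`y_bal` are assumed from the SMOTE sketch above.

```python
# Minimal sketch of the incremental feature strategy with hyperparameter search.
# `scores` holds one importance value per UniRep dimension (from ANOVA, LGBM, or MI).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

def incremental_selection(scores, X, y, step=10):
    """Keep features above the mean importance, then grow the set in ranked order."""
    ranked = np.argsort(scores)[::-1]                    # best features first
    ranked = ranked[scores[ranked] > scores.mean()]      # drop below-average features
    best = (0.0, None, None)
    for k in range(step, len(ranked) + 1, step):
        idx = ranked[:k]
        grid = GridSearchCV(
            LogisticRegression(max_iter=2000),
            param_grid={"C": [0.01, 0.1, 1, 10]},        # illustrative grid
            cv=10, scoring="accuracy",
        )
        grid.fit(X[:, idx], y)
        if grid.best_score_ > best[0]:
            best = (grid.best_score_, idx, grid.best_params_)
    return best                                          # (score, selected indices, params)
```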
4.2.4.1. Analysis of Variance (ANOVA)
ANOVA is used here to score features based on their ability to differentiate between classes.
Equation (8):

$$S(t) = \frac{s_b^2(t)}{s_w^2(t)}$$

Where:
- $S(t)$: the ANOVA score for feature $t$. A higher score indicates greater importance.
- $s_b^2(t)$: the variance between groups for feature $t$, which measures how much the class means differ for that feature.
- $s_w^2(t)$: the variance within groups for feature $t$, which measures the variability of that feature inside each class.

Equation (9) for calculating the variance between groups:

$$s_b^2(t) = \frac{1}{K-1} \sum_{i=1}^{K} n_i \left( \frac{1}{n_i} \sum_{j=1}^{n_i} f_i^j(t) - \frac{1}{N} \sum_{i=1}^{K} \sum_{j=1}^{n_i} f_i^j(t) \right)^2$$

Where:
- $K$: the number of groups (classes, e.g., umami and non-umami).
- $n_i$: the number of samples in group $i$.
- $f_i^j(t)$: the value of feature $t$ for the $j$-th sample in the $i$-th group.
- The first fraction inside the parentheses is the mean of feature $t$ for group $i$; the second is the overall mean of feature $t$ across all samples and groups.

Equation (10) for calculating the variance within groups:

$$s_w^2(t) = \frac{1}{N-K} \sum_{i=1}^{K} \sum_{j=1}^{n_i} \left( f_i^j(t) - \frac{1}{n_i} \sum_{j=1}^{n_i} f_i^j(t) \right)^2$$

Where:
- $N$: the total number of instances (samples).
- The term inside the parentheses is the difference between a sample's feature value and the mean of that feature within its own group.
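A minimal NumPy rendering of Equations (8)-(10) for a single feature is shown below; scikit-learn's f_classif computes the equivalent F-statistic for all features at once, which is how this scoring is typically applied in practice. The toy data are illustrative.

```python
# Minimal NumPy sketch of the ANOVA score S(t) = s_b^2(t) / s_w^2(t) for one feature.
# `x` holds that feature's values and `y` the binary class labels (0/1).
import numpy as np

def anova_score(x, y):
    groups = [x[y == g] for g in np.unique(y)]
    k, n = len(groups), len(x)
    grand_mean = x.mean()
    s_b2 = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups) / (k - 1)  # Eq. (9)
    s_w2 = sum(((g - g.mean()) ** 2).sum() for g in groups) / (n - k)            # Eq. (10)
    return s_b2 / s_w2                                                           # Eq. (8)

rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)
x = rng.normal(loc=y * 0.8, scale=1.0)          # a feature that weakly separates the classes
print(round(anova_score(x, y), 3))
```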
4.2.4.2. Light Gradient Boosting Machine (LGBM)
LGBM is a decision tree-based gradient boosting framework. For feature selection, it inherently provides feature importances.
Equation (11):

$$h_c = \underset{h \in H}{\arg\min} \sum_{i=1}^{N} L\big(y_i, F_{c-1}(x_i) + h(x_i)\big)$$

Where:
- $h_c$: the new base learner (decision tree) being sought at iteration $c$.
- $H$: the space of possible base learners.
- $\sum_i L\big(y_i, F_{c-1}(x_i) + h(x_i)\big)$: the sum of the loss function over all training samples, where $y_i$ is the true label, $F_{c-1}(x_i)$ is the model's prediction after the previous $c-1$ iterations, and $h(x_i)$ is the current base learner's prediction. The goal is to find the $h$ that minimizes this loss.

Equation (12):

$$r_{i,c} = -\left[ \frac{\partial L\big(y_i, F_{c-1}(x_i)\big)}{\partial F_{c-1}(x_i)} \right]$$

Where:
- $r_{i,c}$: the pseudo-residual (negative gradient) for sample $i$ at iteration $c$; it represents the error that the current base learner is fitted to correct.
- The bracketed term is the partial derivative of the loss function with respect to the model's prediction from the previous iteration.

Equation (13):

$$F_c(x) = F_{c-1}(x) + h_c(x)$$

This equation describes the iterative update of the overall model: the prediction at iteration $c$ is the previous model $F_{c-1}(x)$ plus the newly fitted weak learner $h_c(x)$, so the ensemble improves incrementally by adding decision trees. The importance of each feature in LGBM is derived from how often and how effectively it is used to split nodes in these trees across all iterations.
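Continuing from the SMOTE sketch above (which defined `X_bal` and `y_bal`), the following shows how LGBM-derived feature importances can drive the above-average selection rule; the estimator settings are illustrative, not the paper's tuned values.

```python
# Minimal sketch of LGBM-based feature ranking and above-average selection.
import numpy as np
from lightgbm import LGBMClassifier

lgbm = LGBMClassifier(n_estimators=200, random_state=42)
lgbm.fit(X_bal, y_bal)                                   # SMOTE-balanced UniRep features

importances = lgbm.feature_importances_                  # split-based importance per feature
selected = np.where(importances > importances.mean())[0] # keep above-average features
X_selected = X_bal[:, selected]
print(len(selected), "features retained")
```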
4.2.4.3. Mutual Information (MI)
Mutual Information quantifies the dependency between two variables, or how much knowing one variable reduces uncertainty about the other.
Equation (14) for entropy:

$$H(S) = -\sum_{x \in A} p(x) \log p(x)$$

Where:
- $H(S)$: the entropy of the peptide sequence $S$. Entropy measures the average amount of information, or uncertainty, in a random variable.
- $A$: the alphabet of amino acid residues (e.g., the 20 standard amino acids).
- $p(x)$: the marginal probability of a specific amino acid residue $x$ occurring in the sequence.

Equation (15) for mutual information (MI):

$$MI = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$$

Where:
- $MI$: the mutual information between two variables. As written, the formula can describe the MI between two residues $x$ and $y$ within a sequence, or between a feature value and the class label; for feature selection, MI is computed between each feature and the target variable (umami/non-umami).
- $p(x, y)$: the joint probability of observing $x$ and $y$ (e.g., a feature value together with a class label).
- $p(x)$ and $p(y)$: the marginal probabilities of observing $x$ and $y$, respectively.
- A higher MI value indicates a stronger statistical dependency between the feature and the class label, making the feature more valuable for classification.
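For completeness, a matching sketch with scikit-learn's mutual-information scorer, again assuming the `X_bal`/`y_bal` arrays from the SMOTE sketch:

```python
# Minimal sketch of mutual-information scoring and above-average selection.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

mi_scores = mutual_info_classif(X_bal, y_bal, random_state=42)  # one score per feature
selected = np.where(mi_scores > mi_scores.mean())[0]            # keep above-average features
print(len(selected), "features retained by MI")
```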
4.2.5. Machine Learning Methods
Five common and high-performance ML methods were chosen to evaluate the UniRep features and identify the optimal classifier:
- K-Nearest Neighbors (KNN): A non-parametric, instance-based learning algorithm. It classifies a new data point by finding the closest data points in the training set and assigning the class that is most common among them. Its simplicity makes it a good baseline.
- Logistic Regression (LR): A linear model for binary classification. It estimates the probability of an instance belonging to a particular class using a sigmoid function. It's known for its simplicity, interpretability, and parallelizability.
- Support Vector Machine (SVM): A powerful discriminative classifier that finds an optimal hyperplane to separate data points into different classes, maximizing the margin between the classes. It's often used for binary classification in bioinformatics.
- Random Forest (RF): An ensemble learning method that builds multiple decision trees during training and outputs the mode of the classes (for classification) or the mean prediction (for regression) of the individual trees. It uses bagging (bootstrap aggregating) and random feature selection for improved robustness and reduced overfitting.
- Light Gradient Boosting Machine (LGBM): An efficient, distributed gradient boosting framework based on decision trees. It builds trees sequentially, with each new tree correcting the errors of the previous ones. It is known for its speed and high performance.
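A minimal sketch of instantiating these five candidate classifiers in scikit-learn/LightGBM is shown below; the hyperparameters are placeholder defaults, since the paper tuned each model with GridSearchCV.

```python
# Minimal sketch of the five candidate classifiers compared in the paper.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier

candidate_models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "LR": LogisticRegression(max_iter=2000),
    "SVM": SVC(probability=True),   # probability=True enables auROC/threshold analysis
    "RF": RandomForestClassifier(n_estimators=500, random_state=42),
    "LGBM": LGBMClassifier(n_estimators=200, random_state=42),
}
```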
4.2.6. Evaluation Metrics and Methods
The models were evaluated using K-fold cross-validation and independent testing.
- K-fold Cross-Validation: The training data (UMP-TR) is divided into K equal subsets (folds). The model is trained K times; in each iteration, one fold is used as the validation set and the remaining K-1 folds are used for training. The results from all iterations are averaged to provide a more robust estimate of model performance. The paper used 10-fold cross-validation (K = 10).
- Independent Testing: After the model is trained and optimized on the UMP-TR dataset, its final performance is evaluated on a completely unseen independent test set (UMP-IND) and on the UMP-VERIFIED dataset. This provides an unbiased measure of the model's generalization ability.
- Evaluation Metrics: The following widely used metrics were calculated (a sketch of computing them under 10-fold cross-validation is given after this list):
  - True Positives (TP): The number of umami peptides correctly identified as umami.
  - True Negatives (TN): The number of non-umami peptides correctly identified as non-umami.
  - False Positives (FP): The number of non-umami peptides incorrectly identified as umami.
  - False Negatives (FN): The number of umami peptides incorrectly identified as non-umami.
  - Equation (16) for Accuracy (ACC), the proportion of total predictions that were correct:

    $$ACC = \frac{TP + TN}{TP + TN + FP + FN}$$

  - Equation (17) for the Matthews Correlation Coefficient (MCC), a correlation coefficient between the observed and predicted binary classifications. It is considered a more reliable statistical measure than accuracy for imbalanced datasets because it takes all four confusion matrix values (TP, TN, FP, FN) into account; its value ranges from -1 (perfect disagreement) to +1 (perfect agreement), with 0 indicating random prediction:

    $$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$

  - Equation (18) for Sensitivity (Sn), or Recall, the proportion of actual positive cases (umami peptides) that were correctly identified:

    $$Sn = \frac{TP}{TP + FN}$$

  - Equation (19) for Specificity (Sp), the proportion of actual negative cases (non-umami peptides) that were correctly identified:

    $$Sp = \frac{TN}{TN + FP}$$

  - Equation (20) for Balanced Accuracy (BACC), the average of sensitivity and specificity. It is particularly useful for imbalanced datasets because it gives equal weight to both classes, preventing a high accuracy score driven only by the majority class; on a perfectly balanced dataset, ACC and BACC are equal:

    $$BACC = \frac{Sn + Sp}{2}$$

  - Area Under the Receiver Operating Characteristic Curve (auROC): The ROC curve plots the True Positive Rate (TPR, i.e., sensitivity) against the False Positive Rate (FPR, i.e., 1 - specificity) at various classification thresholds. The auROC quantifies the overall ability of a binary classifier to distinguish between classes across all thresholds; 0.5 indicates random prediction and 1.0 a perfect classifier.
  - Cross-Entropy Loss: For binary classification, the cross-entropy loss (also known as binary cross-entropy or log loss) measures the performance of a model whose output is a probability between 0 and 1. Equation (21):

    $$Loss = -\big[\, y \log(p) + (1 - y) \log(1 - p) \,\big]$$

    Where $y$ is the true label of the sample ($y = 1$ for a positive case, i.e., an umami peptide, and $y = 0$ for a negative case) and $p$ is the predicted probability that the sample is positive. The goal of training is to minimize this loss; a lower cross-entropy loss indicates a more accurate classifier.
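The sketch below shows how the 10-fold cross-validation and the metrics of Equations (16)-(21) could be computed with scikit-learn. `X_bal` and `y_bal` are assumed from the earlier sketches; the classifier and settings are illustrative, not the authors' exact evaluation script.

```python
# Minimal sketch: 10-fold cross-validation and the metrics of Equations (16)-(21).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import (accuracy_score, matthews_corrcoef, roc_auc_score,
                             log_loss, confusion_matrix)

model = LogisticRegression(max_iter=2000)                 # stand-in for the tuned LR classifier
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)   # K = 10
proba = cross_val_predict(model, X_bal, y_bal, cv=cv, method="predict_proba")[:, 1]
pred = (proba >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_bal, pred).ravel()
acc = accuracy_score(y_bal, pred)            # Eq. (16)
mcc = matthews_corrcoef(y_bal, pred)         # Eq. (17)
sn = tp / (tp + fn)                          # Eq. (18), sensitivity / recall
sp = tn / (tn + fp)                          # Eq. (19), specificity
bacc = (sn + sp) / 2                         # Eq. (20)
auroc = roc_auc_score(y_bal, proba)          # area under the ROC curve
loss = log_loss(y_bal, proba)                # Eq. (21), cross-entropy loss
print(dict(ACC=acc, MCC=mcc, Sn=sn, Sp=sp, BACC=bacc, auROC=auroc, Loss=loss))
```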
4.2.7. Web Server Development
A user-friendly web server was developed and made freely accessible at https://www.aibiochem.net/servers/iUmami-DRLF/. Users can input peptide sequences and receive predictions (whether it's an umami peptide and its confidence level).
The following figure (Figure 1 from the original paper) provides an overview of the model development:
The figure is a schematic of the iUmami-DRLF model development workflow described in the paper, showing dataset collection, feature extraction, data balancing, feature selection, model training, and the overall flow of performance evaluation and the web server.
5. Experimental Setup
5.1. Datasets
The experiments utilized several datasets derived from existing resources and newly collected data:
- UMP442 Benchmark Dataset: This is the primary dataset used for model training and initial testing.
  - Source: Updated from iUmami-SCM [9], originally comprising peptides from the BIOPEP-UWM database [4] and other experimentally verified umami peptides.
  - Scale: Contains a total of 444 peptide sequences after data cleaning.
  - Characteristics:
    - Positive samples: 140 umami peptides.
    - Negative samples: 304 non-umami (bitter) peptides. This indicates an imbalanced dataset, which was addressed using SMOTE.
  - Split: Arbitrarily divided into:
    - UMP-TR (training set): 112 umami peptides and 241 non-umami peptides.
    - UMP-IND (independent test set): 28 umami peptides and 61 non-umami peptides.
  - Domain: Peptide sequences, with classification based on their umami taste property.
  - Data Sample Example: The paper does not provide a concrete example of a peptide sequence from the dataset; however, typical peptide sequences are strings of amino acid abbreviations (e.g., "Gly-Pro-Leu" or "GPL").
- UMP-VERIFIED Dataset: This dataset was specifically collected for external validation to test model robustness and compare against state-of-the-art methods.
  - Source: 91 wet-experiment verified umami peptide sequences reported in the latest literature [56-70].
  - Scale: 91 umami peptide sequences.
  - Characteristics: All samples in this dataset are positive (umami) peptides, validated through laboratory experiments.
  - Why chosen: This dataset provides an unbiased, real-world validation of the models' ability to identify true umami peptides, especially newly discovered ones.

These datasets were chosen to ensure a robust training process, fair independent evaluation, and a rigorous comparison with existing methods using recently validated experimental data. The use of an imbalanced initial dataset and its subsequent balancing with SMOTE is a critical aspect of ensuring the model's effectiveness across both classes.
5.2. Evaluation Metrics
For every evaluation metric mentioned in the paper, here is a complete explanation:
-
True Positives (TP):
- Conceptual Definition: Instances that are actually positive (e.g., umami peptides) and were correctly predicted as positive by the model.
- Mathematical Formula: N/A (fundamental count)
- Symbol Explanation:
TP denotes the count of umami peptides successfully identified as umami.
-
True Negatives (TN):
- Conceptual Definition: Instances that are actually negative (e.g., non-umami peptides) and were correctly predicted as negative by the model.
- Mathematical Formula: N/A (fundamental count)
- Symbol Explanation:
TN denotes the count of non-umami peptides successfully identified as non-umami.
-
False Positives (FP):
- Conceptual Definition: Instances that are actually negative (e.g., non-umami peptides) but were incorrectly predicted as positive by the model. This is also known as a Type I error.
- Mathematical Formula: N/A (fundamental count)
- Symbol Explanation:
FP denotes the count of non-umami peptides falsely identified as umami.
-
False Negatives (FN):
- Conceptual Definition: Instances that are actually positive (e.g., umami peptides) but were incorrectly predicted as negative by the model. This is also known as a Type II error.
- Mathematical Formula: N/A (fundamental count)
- Symbol Explanation:
FN denotes the count of umami peptides incorrectly identified as non-umami.
-
Accuracy (ACC):
- Conceptual Definition: The proportion of total predictions that the model made correctly. It measures the overall correctness of the model.
- Mathematical Formula: $ACC = \frac{TP + TN}{TP + TN + FP + FN}$
- Symbol Explanation:
TP: True Positives.TN: True Negatives.FP: False Positives.FN: False Negatives.
-
Matthews Correlation Coefficient (MCC):
- Conceptual Definition: A robust measure of the quality of binary classifications, especially useful for imbalanced datasets. It considers all four values of the confusion matrix and produces a high score only if the model performs well on all four aspects (TP, TN, FP, FN). It ranges from -1 (perfect inverse correlation) to +1 (perfect prediction), with 0 indicating a random prediction.
- Mathematical Formula: $MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$
- Symbol Explanation:
TP: True Positives.TN: True Negatives.FP: False Positives.FN: False Negatives.
-
Sensitivity (Sn) / Recall:
- Conceptual Definition: The proportion of actual positive cases that were correctly identified by the model. It measures the model's ability to find all the positive samples.
- Mathematical Formula: $Sn = \frac{TP}{TP + FN}$
- Symbol Explanation:
TP: True Positives.FN: False Negatives.
-
Specificity (Sp):
- Conceptual Definition: The proportion of actual negative cases that were correctly identified by the model. It measures the model's ability to correctly identify non-positive samples.
- Mathematical Formula: $Sp = \frac{TN}{TN + FP}$
- Symbol Explanation:
TN: True Negatives.FP: False Positives.
-
Balanced Accuracy (BACC):
- Conceptual Definition: The average of Sensitivity (True Positive Rate) and Specificity (True Negative Rate). This metric is particularly useful for imbalanced datasets because it gives equal weight to the performance on both positive and negative classes, preventing a misleadingly high accuracy score driven by the majority class. If the dataset is perfectly balanced, ACC and BACC values will be equal.
- Mathematical Formula: $BACC = \frac{Sn + Sp}{2}$
- Symbol Explanation: Sn: Sensitivity; Sp: Specificity.
-
Area Under the Receiver Operating Characteristic Curve (auROC):
- Conceptual Definition: The ROC curve plots the True Positive Rate (TPR, sensitivity) against the False Positive Rate (FPR, 1 - specificity) at various classification thresholds. The auROC value represents the overall ability of a classifier to distinguish between positive and negative classes; a higher auROC indicates a better model. Values range from 0.5 (random classifier) to 1.0 (perfect classifier).
- Mathematical Formula: No single closed-form expression; it is the area under the curve generated by varying the decision threshold.
- Symbol Explanation: N/A (derived from TPR and FPR).
-
Cross-Entropy Loss:
- Conceptual Definition: For binary classification problems where the output is a probability between 0 and 1, cross-entropy loss quantifies the difference between the predicted probability distribution and the true distribution. It penalizes confident but wrong predictions more heavily. A lower loss value indicates a better fit between the model's predictions and the true labels.
- Mathematical Formula: $Loss = -\left[\, y \log(p) + (1 - y) \log(1 - p) \,\right]$
- Symbol Explanation:
  - Loss: the cross-entropy loss value.
  - $y$: the true binary label of the sample (1 for positive, 0 for negative).
  - $p$: the model's predicted probability that the sample is positive.
5.3. Baselines
The iUmami-DRLF method was compared against the following existing state-of-the-art models for umami peptide prediction:
- iUmami-SCM [9]: This model combines the scoring card method (SCM) with propensity scores of amino acids and dipeptides. It represents a traditional feature engineering approach.
- UMPred-FRL [11]: This model integrates seven different traditional feature encodings for constructing the umami peptide classifier. It also relies primarily on traditional feature engineering, but with a more comprehensive set.
- iUP-BERT [12]: This model is based on deep representation learning, using BERT (Bidirectional Encoder Representations from Transformers) for feature encoding. It is a more recent deep learning-based approach and the direct predecessor in terms of advanced feature extraction.

These baselines are representative because they cover the evolution of umami peptide prediction, from traditional statistical methods (iUmami-SCM, UMPred-FRL) to more recent deep learning approaches (iUP-BERT). Comparing iUmami-DRLF against these diverse baselines allows for a thorough assessment of its advancements in both methodology and performance.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate the effectiveness of iUmami-DRLF, especially its robustness and accuracy, by systematically evaluating the impact of data balancing, different ML models, feature selection techniques, and direct comparison with existing methods.
6.1.1. Effect of SMOTE
The initial analysis focused on the impact of SMOTE (Synthetic Minority Over-sampling Technique) on model performance. The UniRep feature vectors (1900-dimensional) were extracted, and five different ML models (KNN, LR, SVM, LGBM, RF) were trained both with and without SMOTE balancing.
The following figure (Figure 2 from the original paper) illustrates the results of 10-fold cross-validation and independent testing for models with and without SMOTE:
The figure is a chart comparing multiple performance metrics of the five machine learning models with and without SMOTE balancing, in 10-fold cross-validation (A) and independent testing (B), showing that SMOTE optimization markedly improves model performance.
As shown in Figure 2 and detailed in Supplementary Table S1 (not provided in the main text but referenced), SMOTE significantly improved the performance of the models. For example, the LR-SMOTE model either outperformed or equaled the LR model without SMOTE in 66.7% of the metrics in both cross-validation and independent tests. Similarly, SVM-SMOTE outperformed its non-SMOTE counterpart in 83.3% of indicators. The paper highlights that Sp (Specificity) values were often high without SMOTE, but Sn (Sensitivity) and other indicators were poor, indicating a bias towards the negative class due to the imbalanced dataset. This underscores the necessity of SMOTE to enable the models to effectively recognize positive cases (umami peptides).
Visual analyses using UMAP (Uniform Manifold Approximation and Projection) dimensionality reduction also supported SMOTE's benefit. Figure 3A (UniRep features without SMOTE) showed less distinct clustering between umami and non-umami peptides compared to Figure 3B (UniRep features with SMOTE), where the clusters were more separable, indicating improved data representation for classification.
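A UMAP projection of this kind can be reproduced with a few lines using the umap-learn package. The sketch below assumes the feature matrix and labels from the earlier sketches and is only meant to illustrate how panels like those in Figure 3 are generated, not to reproduce the authors' exact plots.

```python
# Minimal sketch of a UMAP visualisation of the (balanced) UniRep feature vectors.
import matplotlib.pyplot as plt
import umap

reducer = umap.UMAP(n_components=2, random_state=42)
coords = reducer.fit_transform(X_bal)                    # 2-D projection of the feature vectors

plt.scatter(coords[:, 0], coords[:, 1], c=y_bal, cmap="coolwarm", s=10)
plt.xlabel("UMAP-1")
plt.ylabel("UMAP-2")
plt.title("UniRep features after SMOTE")
plt.show()
```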
6.1.2. Effects of Different ML Models
After confirming the benefits of SMOTE, the study proceeded to compare the performance of the five ML algorithms using the SMOTE-balanced UniRep features.
The following are the results from Table 1 of the original paper:
| Model | 10-Fold Cross-Validation | | | | | | Independent Test | | | | | |
| | ACC | MCC | Sn | Sp | auROC | BACC | ACC | MCC | Sn | Sp | auROC | BACC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LRc | 0.921 a | 0.847 | 0.954 | 0.888 | 0.956 | 0.921 | 0.853 | 0.653 | 0.721 | 0.913 | 0.928 | 0.817 |
| KNN | 0.861 b | 0.727 | 0.917 | 0.805 | 0.924 | 0.861 | 0.807 | 0.589 | 0.818 | 0.802 | 0.875 | 0.810 |
| SVMc | 0.865 | 0.756 | 0.738 | 0.992 | 0.981 | 0.865 | 0.716 | 0.258 | 0.100 | 0.998 | 0.789 | 0.549 |
| RFc | 0.917 | 0.837 | 0.942 | 0.892 | 0.967 | 0.917 | 0.836 | 0.617 | 0.725 | 0.887 | 0.893 | 0.806 |
| LGBM | 0.919 | 0.841 | 0.946 | 0.892 | 0.972 | 0.919 | 0.845 | 0.636 | 0.729 | 0.898 | 0.907 | 0.813 |
Table 1 indicates that the Logistic Regression (LR) model generally outperformed other ML models. In 10-fold cross-validation, LR (iUmami-DRLF) exceeded other models in four metrics (ACC, BACC, MCC, Sn). For instance, its ACC and BACC were 0.22-6.97% higher, and MCC and Sn increased by 0.71-16.51% and 0.85-29.27% respectively. In independent tests, LR also outscored others in ACC, MCC, auROC, and BACC (e.g., ACC 0.95-19.13% higher, MCC 2.67-153.10% higher). Although SVM showed high Sp in independent tests (0.998), its extremely low Sn (0.100) and MCC (0.258) indicated a severe imbalance in its predictive capability, likely favoring the majority class, despite SMOTE. This led to the selection of LR as the base classifier for the final iUmami-DRLF model. The equality of ACC and BACC values in 10-fold cross-validation further confirmed the effectiveness of SMOTE in balancing the dataset.
6.1.3. Effects of Different Feature Selection Methods
With SMOTE applied and LR identified as the superior classifier, the next step involved optimizing the high-dimensional 1900D UniRep feature vector using feature selection. Three methods were compared: ANOVA, LGBM, and MI.
The following are the results from Table 2 of the original paper:
| Model | Feature Selection Method | Dim | 10-Fold Cross-Validation | | | | | | Independent Test | | | | | |
| | | | ACC | MCC | Sn | Sp | auROC | BACC | ACC | MCC | Sn | Sp | auROC | BACC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LRc | LGBM d | 177 | 0.925 b | 0.853 | 0.959 | 0.892 | 0.957 | 0.925 | 0.921 a | 0.815 | 0.821 | 0.967 | 0.956 | 0.894 |
| | ANOVA d | 102 | 0.882 | 0.764 | 0.896 | 0.867 | 0.863 | 0.882 | 0.899 | 0.768 | 0.857 | 0.918 | 0.930 | 0.888 |
| | MI d | 136 | 0.888 | 0.777 | 0.913 | | 0.942 | 0.888 | 0.888 | 0.733 | 0.750 | 0.951 | 0.864 | 0.850 |
| KNN | LGBM d | 33 | 0.892 | 0.788 | 0.938 | 0.846 | 0.955 | 0.892 | 0.899 | 0.782 | 0.929 | 0.885 | 0.911 | 0.907 |
| | ANOVA d | 15 | 0.873 | 0.748 | 0.896 | 0.851 | 0.934 | 0.873 | 0.865 | 0.703 | 0.857 | 0.869 | 0.907 | 0.863 |
| | MI d | 58 | 0.888 | 0.783 | 0.954 | 0.822 | 0.927 | 0.888 | 0.888 | 0.773 | 0.964 | 0.852 | 0.931 | 0.908 |
| SVMc | LGBM d | 121 | 0.944 | 0.889 | 0.971 | 0.917 | 0.980 | 0.944 | 0.888 | 0.739 | 0.821 | 0.918 | 0.913 | 0.870 |
| | ANOVA d | 48 | 0.925 | 0.854 | 0.967 | 0.884 | 0.977 | 0.925 | 0.865 | 0.678 | 0.679 | 0.951 | 0.906 | 0.815 |
| | MI d | 16 | 0.919 | 0.841 | 0.959 | 0.80 | 0.968 | 0.919 | 0.88 | 0.735 | 0.786 | 0.934 | 0.921 | 0.860 |
| RFc | LGBM d | 88 | 0.934 | 0.896 | 0.971 | 0.884 | 0.975 | 0.915 | 0.876 | 0.716 | 0.821 | 0.902 | 0.920 | 0.862 |
| | ANOVA d | 118 | 0.898 | 0.797 | 0.913 | 0.884 | 0.961 | 0.898 | 0.865 | 0.694 | 0.821 | 0.885 | 0.911 | 0.853 |
| | MI d | 8 | 0.902 | 0.806 | 0.921 | 0.884 | 0.952 | 0.902 | 0.888 | 0.753 | 0.893 | 0.885 | 0.923 | 0.889 |
| LGBMc | LGBM d | 35 | 0.938 | 0.877 | 0.971 | 0.905 | 0.988 | 0.938 | 0.876 | 0.706 | 0.714 | 0.951 | 0.929 | 0.870 |
| | ANOVA d | 19 | 0.902 | 0.807 | 0.942 | 0.863 | 0.945 | 0.902 | 0.888 | 0.739 | 0.821 | 0.918 | 0.912 | 0.833 |
| | MI d | 18 | 0.888 | 0.777 | 0.917 | 0.859 | 0.953 | 0.888 | 0.865 | 0.682 | 0.750 | 0.918 | 0.916 | 0.834 |
The following figure (Figure 4 from the original paper) provides a comparison of the results of independent testing of the models with selected features and the models without selected features:
The figure is a chart comparing multiple performance metrics (ACC, MCC, Sn, Sp, auROC, BACC) of the different machine learning models (KNN, LR, SVM, RF, LGBM) with different feature sets on the independent test set.
Figure 4 and Table 2 clearly demonstrate that feature selection significantly improved model performance. The Sp of the 1900D models without feature selection was lower than most models with feature selection, indicating that feature selection helps resolve information redundancy and optimize predictive performance. Among the three methods, LGBM feature selection yielded the best overall performance, particularly for the LR model. For the LR model, LGBM feature selection led to a 4.17-4.88% improvement in ACC and BACC in 10-fold cross-validation, and 2.45-3.72% improvement in ACC in independent tests, compared to ANOVA and MI. The specific configuration of LR with the top 177D features selected by LGBM was chosen as the optimal iUmami-DRLF predictor. This was further supported by UMAP visualization (Figure 3C), which showed better separation of clusters with the 177D features compared to the full 1900D SMOTE-optimized features (Figure 3B).
6.1.4. Comparison with Existing Methods
The final iUmami-DRLF model (LR classifier with 177D LGBM-selected UniRep features) was rigorously compared against iUmami-SCM, UMPred-FRL, and iUP-BERT.
The following are the results from Table 3 of the original paper:
| Classifier | 10-Fold Cross-Validation | | | | | | Independent Test | | | | | |
| | ACC | MCC | Sn | Sp | auROC | BACC | ACC | MCC | Sn | Sp | auROC | BACC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| iUmami-DRLF(LR) | 0.925 b | 0.853 | 0.959 | 0.892 | 0.957 | 0.925 | 0.921 a | 0.815 | 0.821 | 0.967 | 0.956 | 0.894 |
| iUmami-DRLF(SVM) | 0.944 | 0.889 | 0.971 | 0.917 | 0.980 | 0.944 | 0.888 | 0.739 | 0.821 | 0.918 | 0.913 | 0.870 |
| iUP-BERT | 0.940 | 0.881 | 0.963 | 0.917 | 0.971 | 0.940 | 0.899 | 0.770 | 0.893 | 0.902 | 0.933 | 0.897 |
| UMPred-FRL | 0.921 | 0.81 | 0.847 | 0.955 | 0.930 | 0.901 | 0.888 | 0.735 | 0.860 | 0.934 | 0.919 | 0.860 |
| iUmami-SCM | 0.935 | 0.864 | 0.947 | 0.930 | 0.945 | 0.939 | 0.865 | 0.679 | 0.714 | 0.934 | 0.898 | 0.824 |
Table 3 highlights iUmami-DRLF(LR)'s superior performance in independent testing. While iUmami-DRLF(SVM) showed slightly better results in 10-fold cross-validation across some metrics, iUmami-DRLF(LR) demonstrated stronger generalization ability in independent tests.
- In independent tests,
iUmami-DRLF(LR)achieved an ACC of 0.921, MCC of 0.815, Sn of 0.821, Sp of 0.967, auROC of 0.956, and BACC of 0.894. These values are notably higher than all other compared methods. For example, its ACC was 3.76-6.51% higher, MCC 10.86-20.00% higher, and auROC 4.04-6.47% higher than other predictors. - The comparison between
iUmami-DRLF(LR)andiUmami-DRLF(SVM)specifically pointed out thatLRhad better generalization (higher independent test scores) despite slightly lower cross-validation scores, solidifying its choice for the final predictor.
6.1.5. Methods' Robustness
The robustness of iUmami-DRLF was further validated using the UMP-VERIFIED dataset of 91 wet-experiment verified umami peptides, focusing on performance at different prediction probability thresholds.
The following figure (Figure 5 from the original paper) shows the prediction results under varying probability thresholds:
The figure is a chart showing the prediction performance of the iUmami-DRLF, UMPred-FRL, and iUP-BERT models on the UMP-VERIFIED dataset at different probability thresholds. (A) shows prediction accuracy versus probability threshold. (B) shows cross-entropy loss versus probability threshold; a smaller cross-entropy loss indicates better model robustness and accuracy.
Figure 5A shows the relationship between prediction accuracy and probability threshold. iUmami-DRLF consistently showed the best accuracy across all probability thresholds. Crucially, at a 95% threshold, iUP-BERT's accuracy dropped to 0%, indicating failure, while UMPred-FRL was 8.8%, and iUmami-DRLF maintained 52.7%. At a stringent 99% threshold, both iUP-BERT and UMPred-FRL yielded 0% accuracy, whereas iUmami-DRLF still achieved 40.7% accuracy. This vividly demonstrates iUmami-DRLF's superior robustness and generalization, especially when high confidence predictions are required.
Figure 5B, showing cross-entropy loss against probability thresholds, further supports this. iUmami-DRLF exhibited the minimum cross-entropy loss at 50%, 70%, and 85% thresholds. At 95%, its loss was significantly smaller than UMPred-FRL. For 95% and 99% thresholds, the cross-entropy losses for UMPred-FRL and iUP-BERT became meaningless as their accuracy dropped to zero. This confirms that iUmami-DRLF is an optimized model with minimum cross-entropy loss, leading to better accuracy and reliability.
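The threshold analysis itself is straightforward to reproduce. The sketch below assumes a hypothetical array `proba_verified` holding each model's predicted umami probability for the 91 UMP-VERIFIED peptides (all true positives); accuracy at a threshold is the fraction of peptides whose probability reaches that threshold, and the loss computation shown is one plausible reading of the figure, not the authors' exact script.

```python
# Minimal sketch of the Figure 5-style robustness analysis on all-positive verified peptides.
import numpy as np

def threshold_analysis(proba_verified, thresholds=(0.5, 0.7, 0.85, 0.95, 0.99)):
    results = {}
    for t in thresholds:
        hits = proba_verified >= t
        acc = hits.mean()                                # fraction predicted umami at threshold t
        # Cross-entropy loss over peptides still predicted positive (true label y = 1 for all)
        loss = -np.log(proba_verified[hits]).mean() if hits.any() else float("nan")
        results[t] = (acc, loss)
    return results

# Example with hypothetical predicted probabilities for 91 verified umami peptides
rng = np.random.default_rng(0)
proba_verified = rng.beta(5, 1, size=91)                 # skewed towards high probabilities
for t, (acc, loss) in threshold_analysis(proba_verified).items():
    print(f"threshold {t:.2f}: accuracy {acc:.3f}, cross-entropy {loss:.3f}")
```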
6.2. Data Presentation (Tables)
The following are the results from Table 1 of the original paper:
| Model | 10-Fold Cross-Validation | Independent Test | ||||||||||
| ACC | MCC | Sn | Sp | auROC | BACC | ACC | MCC | Sn | Sp | auROC | BACC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LRc | 0.921 a | 0.847 | 0.954 | 0.888 | 0.956 | 0.921 | 0.853 | 0.653 | 0.721 | 0.913 | 0.928 | 0.817 |
| KNN | 0.861 b | 0.727 | 0.917 | 0.805 | 0.924 | 0.861 | 0.807 | 0.589 | 0.818 | 0.802 | 0.875 | 0.810 |
| SVMc | 0.865 | 0.756 | 0.738 | 0.992 | 0.981 | 0.865 | 0.716 | 0.258 | 0.100 | 0.998 | 0.789 | 0.549 |
| RFc | 0.917 | 0.837 | 0.942 | 0.892 | 0.967 | 0.917 | 0.836 | 0.617 | 0.725 | 0.887 | 0.893 | 0.806 |
| LGBM | 0.919 | 0.841 | 0.946 | 0.892 | 0.972 | 0.919 | 0.845 | 0.636 | 0.729 | 0.898 | 0.907 | 0.813 |
a The best performance values are indicated in bold and underlined. Blue indicates equal values of ACC and BACC. c LR: logistic regression; KNN: k-nearest neighbors; SVM: support vector machine; LGBM: light gradient boosting machine; RF: random forest.
The following are the results from Table 2 of the original paper:
| Model | Feature Selection Method | Dim | 10-Fold Cross-Validation | | | | | | Independent Test | | | | | |
| | | | ACC | MCC | Sn | Sp | auROC | BACC | ACC | MCC | Sn | Sp | auROC | BACC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LRc | LGBM d | 177 | 0.925 b | 0.853 | 0.959 | 0.892 | 0.957 | 0.925 | 0.921 a | 0.815 | 0.821 | 0.967 | 0.956 | 0.894 |
| | ANOVA d | 102 | 0.882 | 0.764 | 0.896 | 0.867 | 0.863 | 0.882 | 0.899 | 0.768 | 0.857 | 0.918 | 0.930 | 0.888 |
| | MI d | 136 | 0.888 | 0.777 | 0.913 | | 0.942 | 0.888 | 0.888 | 0.733 | 0.750 | 0.951 | 0.864 | 0.850 |
| KNN | LGBM d | 33 | 0.892 | 0.788 | 0.938 | 0.846 | 0.955 | 0.892 | 0.899 | 0.782 | 0.929 | 0.885 | 0.911 | 0.907 |
| | ANOVA d | 15 | 0.873 | 0.748 | 0.896 | 0.851 | 0.934 | 0.873 | 0.865 | 0.703 | 0.857 | 0.869 | 0.907 | 0.863 |
| | MI d | 58 | 0.888 | 0.783 | 0.954 | 0.822 | 0.927 | 0.888 | 0.888 | 0.773 | 0.964 | 0.852 | 0.931 | 0.908 |
| SVMc | LGBM d | 121 | 0.944 | 0.889 | 0.971 | 0.917 | 0.980 | 0.944 | 0.888 | 0.739 | 0.821 | 0.918 | 0.913 | 0.870 |
| | ANOVA d | 48 | 0.925 | 0.854 | 0.967 | 0.884 | 0.977 | 0.925 | 0.865 | 0.678 | 0.679 | 0.951 | 0.906 | 0.815 |
| | MI d | 16 | 0.919 | 0.841 | 0.959 | 0.80 | 0.968 | 0.919 | 0.88 | 0.735 | 0.786 | 0.934 | 0.921 | 0.860 |
| RFc | LGBM d | 88 | 0.934 | 0.896 | 0.971 | 0.884 | 0.975 | 0.915 | 0.876 | 0.716 | 0.821 | 0.902 | 0.920 | 0.862 |
| | ANOVA d | 118 | 0.898 | 0.797 | 0.913 | 0.884 | 0.961 | 0.898 | 0.865 | 0.694 | 0.821 | 0.885 | 0.911 | 0.853 |
| | MI d | 8 | 0.902 | 0.806 | 0.921 | 0.884 | 0.952 | 0.902 | 0.888 | 0.753 | 0.893 | 0.885 | 0.923 | 0.889 |
| LGBMc | LGBM d | 35 | 0.938 | 0.877 | 0.971 | 0.905 | 0.988 | 0.938 | 0.876 | 0.706 | 0.714 | 0.951 | 0.929 | 0.870 |
| | ANOVA d | 19 | 0.902 | 0.807 | 0.942 | 0.863 | 0.945 | 0.902 | 0.888 | 0.739 | 0.821 | 0.918 | 0.912 | 0.833 |
| | MI d | 18 | 0.888 | 0.777 | 0.917 | 0.859 | 0.953 | 0.888 | 0.865 | 0.682 | 0.750 | 0.918 | 0.916 | 0.834 |
a The best performance values are indicated in bold and underlined. Blue indicates equal values of ACC and BACC. c LR: logistic regression; KNN: k-nearest neighbors; SVM: support vector machine; LGBM: light gradient boosting machine; RF: random forest. d LGBM: light gradient boosting machine; ANOVA: analysis of variance; MI: mutual information.
The following are the results from Table 3 of the original paper:
| Classifier | ACC (CV) | MCC (CV) | Sn (CV) | Sp (CV) | auROC (CV) | BACC (CV) | ACC (Test) | MCC (Test) | Sn (Test) | Sp (Test) | auROC (Test) | BACC (Test) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| iUmami-DRLF (LR) | 0.925 b | 0.853 | 0.959 | 0.892 | 0.957 | 0.925 | 0.921 a | 0.815 | 0.821 | 0.967 | 0.956 | 0.894 |
| iUmami-DRLF (SVM) | 0.944 | 0.889 | 0.971 | 0.917 | 0.980 | 0.944 | 0.888 | 0.739 | 0.821 | 0.918 | 0.913 | 0.870 |
| iUP-BERT | 0.940 | 0.881 | 0.963 | 0.917 | 0.971 | 0.940 | 0.899 | 0.770 | 0.893 | 0.902 | 0.933 | 0.897 |
| UMPred-FRL | 0.921 | 0.810 | 0.847 | 0.955 | 0.930 | 0.901 | 0.888 | 0.735 | 0.860 | 0.934 | 0.919 | 0.860 |
| iUmami-SCM | 0.935 | 0.864 | 0.947 | 0.930 | 0.945 | 0.939 | 0.865 | 0.679 | 0.714 | 0.934 | 0.898 | 0.824 |

Here, (CV) denotes the 10-fold cross-validation results and (Test) the independent test results. a The best performance values are indicated in bold and underlined in the original table; blue in the original indicates equal values of ACC and BACC.
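For readers who want to recompute the columns above, the following minimal sketch derives the six reported metrics with scikit-learn, assuming the standard definitions (Sn and Sp as the recall of the umami and non-umami classes, and BACC as their mean); it is not the authors' evaluation code.

```python
# Minimal metric helper matching the columns of Tables 1-3 (standard definitions assumed).
import numpy as np
from sklearn.metrics import accuracy_score, matthews_corrcoef, confusion_matrix, roc_auc_score

def umami_metrics(y_true, y_prob, cutoff=0.5):
    """y_true: 0/1 labels (1 = umami); y_prob: predicted probability of the umami class."""
    y_pred = (np.asarray(y_prob) >= cutoff).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sn = tp / (tp + fn)            # sensitivity (recall of umami peptides)
    sp = tn / (tn + fp)            # specificity (recall of non-umami peptides)
    return {
        "ACC": accuracy_score(y_true, y_pred),
        "MCC": matthews_corrcoef(y_true, y_pred),
        "Sn": sn,
        "Sp": sp,
        "auROC": roc_auc_score(y_true, y_prob),
        "BACC": (sn + sp) / 2,
    }
```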
6.3. Ablation Studies / Parameter Analysis
While the paper does not explicitly label sections as "ablation studies," the structured comparison of different components functions as such:
- SMOTE vs. No SMOTE: This comparison (Figure 2) effectively acts as an ablation study for the data balancing strategy. It clearly shows that including SMOTE is critical for improving Sn and overall balanced performance, validating its inclusion in the methodology.
- Different ML Models: Testing five classifiers (KNN, LR, SVM, RF, LGBM) on the same SMOTE-balanced UniRep features (Table 1) allows selection of the optimal base classifier. It demonstrates that LR is the most suitable choice for generalization performance, even though other models sometimes score higher on specific metrics or in cross-validation.
- Feature Selection Methods (ANOVA, LGBM, MI) and Dimensionality: This part of the experiment (Table 2, Figure 4) is a comprehensive analysis of the feature space. By comparing models trained on features selected by different methods and at different dimensionalities (e.g., 177D, 102D, 136D), the authors determine that LGBM feature selection yielding 177 dimensions produces the best-performing LR model. This validates the importance of dimensionality reduction and the choice of LGBM for feature selection.
- Impact of UniRep Features: The entire premise of iUmami-DRLF is built on UniRep features. The comparison with iUmami-SCM and UMPred-FRL (which use traditional features) and iUP-BERT (which uses BERT features) implicitly validates the superior representational power of UniRep/mLSTM embeddings for this task.

The GridSearchCV module was used to search hyperparameters for each model during the optimization process, which is standard practice for finding the best model configuration; a condensed, illustrative sketch of this overall pipeline is given below.
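The sketch below condenses the pipeline discussed above (SMOTE balancing, LGBM-based feature ranking, and a GridSearchCV-tuned logistic regression) into one function. It assumes the 1900-dimensional UniRep embedding matrix X has already been computed, relies on the imbalanced-learn, lightgbm, and scikit-learn packages, and uses an illustrative hyperparameter grid; it is a simplified reconstruction, not the authors' implementation.

```python
# Simplified, illustrative reconstruction of an iUmami-DRLF-style pipeline.
# X: (n_samples, 1900) UniRep embeddings; y: 0/1 labels (1 = umami). The grid is an example.
import numpy as np
from imblearn.over_sampling import SMOTE
from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

def train_umami_classifier(X, y, n_features=177, seed=42):
    # 1) Balance the classes by oversampling the minority (umami) peptides.
    #    (In a rigorous setup, SMOTE should be fit on training folds only to avoid leakage.)
    X_bal, y_bal = SMOTE(random_state=seed).fit_resample(X, y)

    # 2) Rank UniRep dimensions by LGBM feature importance and keep the top n_features.
    ranker = LGBMClassifier(random_state=seed).fit(X_bal, y_bal)
    top_idx = np.argsort(ranker.feature_importances_)[::-1][:n_features]

    # 3) Tune a logistic regression on the selected features with 10-fold cross-validation.
    param_grid = {"C": [0.01, 0.1, 1.0, 10.0, 100.0]}
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    search = GridSearchCV(LogisticRegression(max_iter=5000), param_grid,
                          cv=cv, scoring="accuracy")
    search.fit(X_bal[:, top_idx], y_bal)
    return search.best_estimator_, top_idx
```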
7. Conclusion & Reflections
7.1. Conclusion Summary
This research successfully introduced iUmami-DRLF, a novel predictor designed for the accurate identification of umami peptides solely based on their sequence information. The model's strength lies in its innovative integration of a multiplicative LSTM-based UniRep deep representation learning for feature extraction, coupled with SMOTE for handling imbalanced datasets and LGBM for optimal feature selection. The final iUmami-DRLF model, employing Logistic Regression with the top 177 UniRep features, demonstrated superior performance. It markedly outperformed existing state-of-the-art methods in independent testing across key metrics (ACC=0.921, MCC=0.815, Sn=0.821, Sp=0.967, auROC=0.956, BACC=0.894). Furthermore, iUmami-DRLF exhibited exceptional robustness and accuracy when validated against newly discovered umami peptide sequences, maintaining significant predictive power even at high probability thresholds where other predictors failed. A user-friendly web server was also developed, making this robust tool accessible to the research community.
7.2. Limitations & Future Work
The authors acknowledged several limitations and suggested directions for future work:
- Computational Cost of Feature Extraction: The current UniRep feature extraction model requires significant computation, especially without a GPU, leading to long processing times for large numbers of sequences. This is a practical bottleneck for high-throughput applications (one possible way to perform this extraction is sketched after this list).
- Data Updates: The model's performance could be further improved by incorporating the most recent empirical (wet-lab) data during training. As more umami peptides are discovered, continuously updating the training dataset can enhance the model's learning.
- Model Simplification via Distillation: A promising future direction is model distillation, in which a smaller, simpler "student" model is trained to replicate the behavior of the larger, more complex UniRep "teacher" model. This could significantly reduce the computational cost of the feature extraction step without a substantial loss in performance, addressing the first limitation.
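As an illustration of the feature extraction step mentioned in the first limitation, the open-source jax-unirep package is one way to obtain 1900-dimensional mLSTM (UniRep) embeddings for peptide sequences; this is an assumption about tooling, and the example peptides are arbitrary, so it should not be read as the authors' exact pipeline.

```python
# Illustrative UniRep embedding extraction with the jax-unirep package
# (an assumption about tooling, not necessarily what the authors used).
from jax_unirep import get_reps

peptides = ["DEEEN", "EEGSN", "GPAGPAG"]      # arbitrary example peptide sequences
h_avg, h_final, c_final = get_reps(peptides)  # h_avg: (3, 1900) average hidden states
print(h_avg.shape)
```

The average hidden state (h_avg) is the representation commonly used as the per-sequence UniRep feature vector.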
7.3. Personal Insights & Critique
This paper presents a rigorous and well-executed study that makes a significant contribution to the field of bioinformatics and food science.
- Innovation: The core innovation lies in the exclusive and highly optimized use of multiplicative LSTM-based UniRep features. While BERT has gained popularity, demonstrating the superior performance and robustness of mLSTM embeddings (especially for peptide sequences, which are shorter and have a simpler "alphabet" than natural language) is valuable. The systematic approach to data balancing (SMOTE) and feature selection (LGBM) further refines these high-quality embeddings, leading to tangible performance gains.
- Robustness Evaluation: The evaluation of model performance across varying probability thresholds using the UMP-VERIFIED dataset is particularly insightful and highly practical. In real-world applications, a predictor that maintains reliability even at high confidence levels is invaluable. The stark contrast between iUmami-DRLF and other models at the 95% and 99% thresholds is a powerful testament to its practical utility. This type of evaluation often gets overlooked but is critical for deployment.
- Generalizability: The strong performance in independent tests and against a wet-experiment-verified dataset suggests good generalizability, which is a common challenge for machine learning models in biological domains.
- Applicability and Impact: The development of a user-friendly web server significantly enhances the paper's impact by making the tool readily available to other researchers and potentially to the food industry. This directly addresses the initial motivation of providing an efficient alternative to costly wet testing. The research can directly aid in designing healthier, more flavorful food products.
Potential Issues/Areas for Improvement:
- Reliance on Pre-trained UniRep: The iUmami-DRLF model heavily relies on the pre-trained UniRep model. The quality and potential biases of the UniRef50 dataset used to train UniRep inherently affect iUmami-DRLF's performance. While robust, its ultimate performance ceiling might be limited by the foundational UniRep model's understanding of peptide space.
- "Small" Verified Dataset: While the UMP-VERIFIED dataset is crucial for robustness testing, its size (91 peptides) is relatively small. Expanding this dataset with more diverse, newly validated umami peptides could further strengthen future validation efforts and potentially allow for even more fine-tuned model training.
- Black-box Nature of Deep Features: While UniRep features are powerful, they are black-box representations. Understanding why the specific 177 dimensions are important, or what biological properties they encode, could provide deeper insight into umami peptide characteristics, moving beyond pure prediction to scientific discovery. The feature selection step helps, but the underlying interpretation of the UniRep dimensions remains complex.

Overall, this paper provides a robust and practically valuable machine learning solution for identifying umami peptides, pushing the boundaries of deep representation learning in this specific domain. Its emphasis on robustness and generalization is a critical aspect that makes it stand out.