A Machine Learning Method to Identify Umami Peptide Sequences by Using Multiplicative LSTM Embedded Features

Published: 2 April 2023

TL;DR Summary

iUmami-DRLF employs multiplicative LSTM-based deep features with logistic regression to accurately identify umami peptides, offering a robust, efficient alternative to traditional costly testing and advancing umami flavor research.

Abstract

Citation: Jiang, J.; Li, J.; Li, J.; Pei, H.; Li, M.; Zou, Q.; Lv, Z. A Machine Learning Method to Identify Umami Peptide Sequences by Using Multiplicative LSTM Embedded Features. Foods 2023, 12, 1498. https://doi.org/10.3390/foods12071498. Academic Editor: Christophe Flahaut. Received: 26 February 2023; revised: 24 March 2023; accepted: 30 March 2023; published: 2 April 2023. © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

A Machine Learning Method to Identify Umami Peptide Sequences by Using Multiplicative LSTM Embedded Features

1.2. Authors

Jici Jiang, Jiayu Li, Junxian Li, Hongdi Pei, Mingxin Li, Quan Zou, and Zhibin Lv. Affiliations:

  • College of Biomedical Engineering, Sichuan University, Chengdu 610065, China
  • College of Life Science, Sichuan University, Chengdu 610065, China
  • Wu Yuzhang Honors College, Sichuan University, Chengdu 610065, China
  • Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
  • Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China

1.3. Journal/Conference

Published in Foods. Foods is a peer-reviewed open-access journal that covers a wide range of topics related to food science, technology, and nutrition. It is a reputable journal in the field, indicating that the research has undergone a review process by experts.

1.4. Publication Year

2023

1.5. Abstract

The paper introduces iUmami-DRLF, a machine learning method for identifying umami peptide sequences. This method exclusively uses logistic regression (LR) based on features extracted by a deep learning pre-trained neural network called unified representation (UniRep), which is built on multiplicative LSTM. The research demonstrates that this deep learning representation learning significantly enhances the model's ability to identify umami peptides and improves predictive precision using only peptide sequence information. iUmami-DRLF was tested against newly validated taste sequences and other predictors, showing superior robustness and accuracy, maintaining validity even at high probability thresholds. The proposed method aims to facilitate further studies on enhancing food's umami flavor for dietary needs.

Official PDF link: /files/papers/6908b45ae81fdddf1c48bfa8/paper.pdf This is the officially published version of the paper.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the time-consuming and expensive nature of wet testing (experimental laboratory procedures) for identifying umami peptides. Umami, recognized as the fifth basic taste, is crucial for enhancing food flavor, promoting healthy eating, and has numerous potential applications due to the distinctive flavor and properties of umami peptides (peptides that contribute to the umami taste).

In the postgenomic era, there has been a proliferation of peptide sequence databases, which has opened opportunities for automated mathematical methods to discover novel umami peptides. Prior research has developed machine learning (ML) models such as iUmami-SCM, UMPred-FRL, and iUP-BERT for umami peptide prediction. However, despite these advancements, existing ML-based algorithms relying solely on sequence data still lack sufficient accuracy and robustness, particularly in independent testing. The authors specifically note that iUP-BERT, while an improvement, was "not as robust as expected." This indicates a gap in the robustness and generalization performance of current models, which the paper seeks to address.

The paper's entry point and innovative idea revolve around leveraging deep representation learning to automatically extract highly informative features from raw peptide sequence data, eliminating the need for manual feature engineering. Specifically, they use a multiplicative LSTM-based unified representation (UniRep) model for feature extraction, combined with logistic regression for classification, aiming to achieve superior performance and robustness.

2.2. Main Contributions / Findings

The primary contributions and key findings of this paper are:

  • Novel Model iUmami-DRLF: The development of iUmami-DRLF, an advanced machine learning model for identifying umami peptide sequences. This model uniquely relies solely on deep representation learning features extracted by a pre-trained UniRep neural network (specifically, multiplicative LSTM embeddings) and employs logistic regression for classification.

  • Enhanced Predictive Performance: The study demonstrates that deep learning representation learning significantly boosts the capability and predictive precision of models in identifying umami peptides. iUmami-DRLF achieved superior results in both 10-fold cross-validation and independent tests compared to existing state-of-the-art methods. For instance, its independent test accuracy (ACC) was improved by 2.45% over iUP-BERT.

  • Superior Robustness and Accuracy: Through rigorous validation using a dataset of 91 wet-experiment verified umami peptide sequences (UMP-VERIFIED), iUmami-DRLF proved to be more robust and accurate than UMPred-FRL and iUP-BERT. Critically, iUmami-DRLF maintained significant prediction accuracy even at very high probability thresholds (e.g., 40.7% at 99% threshold), where other methods completely failed (0% accuracy). This robustness is attributed to its optimized cross-entropy loss.

  • Feature Optimization: The research highlights the effectiveness of SMOTE for balancing imbalanced datasets and the crucial role of feature selection methods (particularly LGBM) in optimizing high-dimensional feature vectors derived from deep learning embeddings.

  • User-Friendly Web Server: A publicly accessible web server for iUmami-DRLF was developed, providing a practical tool for researchers to predict umami peptides.

    These findings solve the problem of insufficient accuracy and robustness in existing ML-based umami peptide prediction tools, offering a more reliable and efficient alternative to costly and time-consuming wet laboratory experiments.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a novice reader should be familiar with several fundamental concepts in biology, machine learning, and deep learning:

  • Umami Taste and Peptides:
    • Umami: Recognized as the fifth basic taste (alongside sweet, sour, salty, and bitter), often described as savory or meaty. It's perceived through specific taste receptors.
    • Peptides: Short chains of amino acids linked by peptide bonds. They are smaller than proteins. Umami peptides are specific peptides that elicit or enhance the umami taste. The paper notes they often contain aspartic acid, glutamic acid, asparagine, or glutamine residues.
  • Machine Learning (ML):
    • Supervised Learning: A type of machine learning where an algorithm learns from labeled data (i.e., data points where the correct output is already known). The goal is to learn a mapping from input features to output labels. In this paper, the model learns to classify peptides as "umami" or "non-umami" from sequences with known labels.
    • Classification: A supervised learning task where the model predicts a categorical label (e.g., "umami" or "non-umami") for a given input.
  • Deep Learning (DL):
    • Neural Networks: Computational models inspired by the structure and function of biological neural networks. They consist of layers of interconnected "neurons" that process data.
    • Representation Learning: A set of machine learning techniques that allow a system to automatically discover the representations needed for feature detection or classification from raw data. Instead of hand-crafting features, the model learns optimal feature representations. This is a core concept in the paper.
  • Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM):
    • RNNs: Neural networks designed to process sequential data (like peptide sequences). They have internal memory that allows them to use information from previous steps in the sequence.
    • LSTM: A special type of RNN capable of learning long-term dependencies. It overcomes the vanishing gradient problem common in vanilla RNNs through gates (input, forget, output gates) that regulate the flow of information.
    • Multiplicative LSTM (mLSTM): A variant of LSTM where the recurrent connections are modified with multiplicative interactions, potentially enhancing its ability to capture complex dependencies and improve representation learning. UniRep is based on this.
  • Logistic Regression (LR):
    • A statistical model used for binary classification. Despite its name, it's a classification algorithm that models the probability of a binary outcome. It uses a sigmoid function to map predictions to probabilities between 0 and 1.
  • Data Balancing Strategy:
    • Synthetic Minority Over-sampling Technique (SMOTE): An oversampling technique used to address imbalanced datasets (where one class has significantly fewer samples than others). SMOTE creates synthetic samples for the minority class by interpolating between existing minority samples and their neighbors, thus increasing the number of minority class samples and balancing the dataset. This helps prevent classifiers from being biased towards the majority class.
  • Feature Selection:
    • Feature Selection: The process of selecting a subset of relevant features for use in model construction. This reduces dimensionality, removes noisy data, and can improve model performance and interpretability.
    • Analysis of Variance (ANOVA): A statistical test used to analyze differences among group means. In feature selection, it can identify features where the means of different classes are significantly different, suggesting their importance for classification.
    • Light Gradient Boosting Machine (LGBM) for Feature Importance: LGBM is an ensemble learning method based on decision trees. It can provide a score for each feature indicating how useful or important it was in the construction of the boosted decision trees. Features with higher scores are considered more important.
    • Mutual Information (MI): A measure from information theory that quantifies the amount of information obtained about one random variable by observing another random variable. In feature selection, it measures the dependency between a feature and the target variable; higher MI indicates a stronger relationship.
  • Evaluation Metrics: Explained in detail in Section 5.2.

3.2. Previous Works

The paper contextualizes its contribution by referencing several previous machine learning models designed for umami peptide prediction:

  • iUmami-SCM (2020) [9]: This model combines the Scoring Card Method (SCM) with propensity scores of amino acids and dipeptides to identify umami peptides. The SCM assigns scores based on the frequency or likelihood of amino acids/dipeptides appearing in umami peptides. It reported an independent test accuracy of 0.865.
  • UMPred-FRL (2021) [11]: Developed by Charoenkwan et al., this method integrates seven different traditional feature encodings (e.g., amino acid composition, dipeptide composition, physicochemical properties) to construct its umami peptide classifier.
  • iUP-BERT (2022) [12]: Proposed by Jiang et al., this model was a significant advancement, utilizing a single deep representational learning feature encoding method based on BERT (Bidirectional Encoder Representations from Transformers). BERT is a powerful pre-trained language model that can learn contextual representations of sequences. iUP-BERT showed superior performance compared to iUmami-SCM and UMPred-FRL in both independent testing and cross-validation.

3.3. Technological Evolution

The field of umami peptide prediction has evolved from methods relying on traditional feature engineering to those leveraging deep representation learning.

  • Early/Traditional Methods: Initially, researchers would manually design or select features (numerical descriptors) from peptide sequences. These features could include amino acid composition (percentage of each amino acid), dipeptide composition (percentage of each two-amino acid combination), physicochemical properties (e.g., hydrophobicity, charge), or propensity scores. Models like Umami-SCM and UMPred-FRL represent this phase, where expert knowledge was critical for feature selection. While effective, these methods might struggle with capturing complex, non-linear patterns and can be limited by the completeness of human-designed features.

  • Deep Learning for Feature Extraction: The advent of deep learning brought about representation learning, where neural networks automatically learn abstract and high-level feature representations directly from raw data (e.g., peptide sequences). This eliminates the laborious process of manual feature engineering. iUP-BERT marked a shift in this direction by using the BERT model, a transformer-based architecture known for its ability to learn rich contextual embeddings from sequences.

  • Specialized Deep Learning Architectures (e.g., Multiplicative LSTM): The current paper builds upon this by employing UniRep, which is based on multiplicative LSTM. LSTMs are particularly well-suited for sequential data like peptides, and the "multiplicative" aspect enhances their capacity to model complex relationships. UniRep is explicitly designed to learn a "unified representation" for proteins/peptides, making it a strong candidate for general sequence embedding.

    This paper's work fits within the technological timeline as a further refinement and exploration of deep representation learning, specifically using mLSTM-based UniRep features, aiming for even greater robustness and accuracy than previous BERT-based or traditional feature-based approaches.

3.4. Differentiation Analysis

Compared to the main methods in related work, iUmami-DRLF presents several core differences and innovations:

  • Sole Reliance on UniRep (mLSTM) Features: Unlike UMPred-FRL which integrates various traditional feature codes, or iUP-BERT which uses BERT features, iUmami-DRLF solely employs features extracted from the UniRep model, which is built upon multiplicative LSTM. This emphasizes the power and effectiveness of this specific deep learning architecture for peptide representation. The paper argues that UniRep provides a "unified representation" that is highly informative.

  • Focus on Robustness at High Probability Thresholds: A key differentiator is iUmami-DRLF's demonstrated superior robustness and accuracy, particularly at high prediction probability thresholds (e.g., 95% or 99%). The paper explicitly shows that while iUP-BERT and UMPred-FRL fail (0% accuracy) at these high thresholds, iUmami-DRLF maintains significant predictive power. This is crucial for practical applications where high confidence in predictions is required.

  • Optimized Logistic Regression Classifier: While deep learning is used for feature extraction, the final classification is performed by Logistic Regression (after feature selection and data balancing). This contrasts with models that might use more complex classifiers or end-to-end deep learning models. The choice of LR suggests a focus on interpretability and potentially faster inference times once features are extracted.

  • Comprehensive Feature Optimization Pipeline: The methodology includes a robust pipeline involving SMOTE for data balancing and multiple feature selection techniques (ANOVA, LGBM, MI), which are systematically evaluated to refine the feature set. This rigorous approach to feature engineering (even post-deep learning embedding) contributes to the model's performance.

  • Improved Independent Test Performance: The paper meticulously reports and compares independent test results, showing iUmami-DRLF consistently outperforms previous state-of-the-art models across various metrics, indicating better generalization to unseen data.

    In essence, iUmami-DRLF differentiates itself by a highly optimized deep representation learning approach specifically tailored with UniRep/mLSTM embeddings, combined with careful data balancing and feature selection, leading to a significantly more robust and accurate predictor, especially under stringent confidence requirements.

4. Methodology

4.1. Principles

The core principle behind iUmami-DRLF is to leverage the power of deep representation learning to transform raw peptide sequences into meaningful, fixed-length numerical feature vectors. These vectors, known as embeddings, are designed to capture the underlying biochemical and structural properties relevant to a peptide's umami taste. By learning these representations automatically, the model avoids the limitations of manual feature engineering. Subsequently, a standard machine learning classifier (Logistic Regression) is trained on these high-quality features to predict whether a peptide is umami or not. The theoretical basis is that deep learning models, particularly recurrent architectures like LSTM, excel at processing sequential data and learning hierarchical features, making them suitable for peptide sequences. The multiplicative interactions in mLSTM further enhance this capability. The overall intuition is that if a machine can learn a robust numerical "fingerprint" for each peptide that accurately reflects its umami potential, then a simpler classifier can effectively distinguish between umami and non-umami peptides.

4.2. Core Methodology In-depth (Layer by Layer)

The development of iUmami-DRLF follows a structured pipeline as depicted in Figure 1, encompassing dataset preparation, feature extraction, data balancing, feature selection, model training, and evaluation.

4.2.1. Benchmark Dataset

The model was developed using the UMP442 benchmark dataset, which was an updated version from iUmami-SCM.

  • Positive dataset: Comprises umami peptides from the BIOPEP-UWM database and other experimentally verified umami peptides.
  • Negative dataset: Comprises bitter non-umami peptides.
  • Total samples: 442 peptides (302 non-umami, 140 umami) after data cleaning.
  • Dataset split: To prevent overfitting, the dataset was arbitrarily split into:
    • Training subset (UMP-TR): 112 umami peptides and 241 non-umami peptides.
    • Independent test subset (UMP-IND): 28 umami peptides and 61 non-umami peptides.
  • External Validation Dataset (UMP-VERIFIED): An additional 91 wet-experiment verified umami peptide sequences were collected from the latest research to rigorously validate the accuracy and robustness of the model against state-of-the-art methods.

4.2.2. Feature Extraction using UniRep (Multiplicative LSTM)

This is the cornerstone of the iUmami-DRLF method. The paper leverages UniRep (Unified Representation), a deep learning model pre-trained on a vast corpus of amino acid sequences, to convert peptide sequences into fixed-length numerical vectors.

  • UniRep Model Training: The UniRep model was pre-trained on approximately 24 million amino acid sequences from UniRef50, with the training objective of predicting the next amino acid in a sequence by minimizing cross-entropy loss. This objective forces the model to learn meaningful representations of protein/peptide sequences.

  • Embedding Process:

    1. Input Sequence Encoding: A peptide sequence of S amino acid residues is first represented as a matrix using one-hot encoding (rendered in the paper as "single thermal code", a literal translation of one-hot encoding). With 20 standard amino acids, each residue is represented as a 20-dimensional vector with a '1' at the position corresponding to that amino acid and '0's elsewhere, so the sequence becomes a matrix in R^{S \times 20} (more generally R^{S \times A}, where A is the alphabet size).
    2. mLSTM Encoder: This encoded matrix is then fed into the multiplicative Long Short-Term Memory (mLSTM) encoder. The mLSTM processes the sequence step-by-step, generating hidden states.
    3. 1900-D Feature Vector: The output of the mLSTM encoder is an embedding matrix in R^{1900 \times S}. To obtain a single fixed-length vector representing the entire peptide, average pooling is applied across the sequence length S, yielding a 1900-dimensional (1900-D) UniRep feature vector for each peptide.
  • mLSTM Encoder Equations: The mLSTM encoder updates its internal states from the current input X_t and the previous hidden and cell states h_{t-1} and C_{t-1} using the following equations. Equation (1): m_t = (X_t W_{xm}) \otimes (W_{hm} h_{t-1}) Where:

    • m_t: The current intermediate multiplication state.

    • X_t: The current input at time step t.

    • W_{xm}: Weight matrix applied to the input X_t in the multiplicative state calculation.

    • W_{hm}: Weight matrix applied to the previous hidden state h_{t-1} in the multiplicative state calculation.

    • h_{t-1}: The hidden state from the previous time step (t-1).

    • \otimes: Element-wise multiplication (Hadamard product).

      Equation (2): \hat{h}_t = \tanh(W_{mh} m_t + W_{xh} X_t) Where:

    • \hat{h}_t: The candidate hidden state, i.e., the input to the hidden-state update.

    • W_{mh}: Weight matrix applied to the intermediate multiplication state m_t.

    • W_{xh}: Weight matrix applied to the current input X_t.

    • tanh: The hyperbolic tangent activation function, which squashes values into the range (-1, 1).

      Equation (3): f_t = \sigma(X_t W_{xf} + m_t W_{mf}) Where:

    • f_t: The forget gate output at time step t. It determines what information from the previous cell state C_{t-1} should be discarded.

    • \sigma: The sigmoid activation function, which squashes values into the range (0, 1).

    • W_{xf}: Weight matrix applied to the input X_t in the forget gate.

    • W_{mf}: Weight matrix applied to the intermediate multiplication state m_t in the forget gate.

      Equation (4): i_t = \sigma(X_t W_{xi} + m_t W_{mi}) Where:

    • i_t: The input gate output at time step t. It determines which new information from the candidate hidden state \hat{h}_t should be stored in the cell state.

    • W_{xi}: Weight matrix applied to the input X_t in the input gate.

    • W_{mi}: Weight matrix applied to the intermediate multiplication state m_t in the input gate.

      Equation (5): o_t = \sigma(X_t W_{xo} + m_t W_{mo}) Where:

    • o_t: The output gate output at time step t. It determines which parts of the cell state C_t should be exposed as the hidden state h_t.

    • W_{xo}: Weight matrix applied to the input X_t in the output gate.

    • W_{mo}: Weight matrix applied to the intermediate multiplication state m_t in the output gate.

      Equation (6): C_t = f_t \otimes C_{t-1} + i_t \otimes \hat{h}_t Where:

    • C_t: The current cell state at time step t; this is the "memory" of the mLSTM.

    • C_{t-1}: The cell state from the previous time step (t-1).

    • The first term, f_t \otimes C_{t-1}, is the information retained from the previous cell state.

    • The second term, i_t \otimes \hat{h}_t, is the new information added from the candidate hidden state.

      Equation (7): h_t = o_t \otimes \tanh(C_t) Where:

    • h_t: The current hidden state at time step t. It is the output of the mLSTM cell at this step and is passed on to the next step.

    • The tanh of the cell state is modulated element-wise by the output gate o_t.

      These equations collectively describe how the mLSTM encoder processes sequential input data, maintains a memory state, and generates hidden representations that capture complex dependencies within the peptide sequence.
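
As a concrete illustration of Equations (1)-(7), the following is a minimal NumPy sketch of one mLSTM pass over a one-hot-encoded peptide, followed by the average pooling that produces a fixed-length embedding. The toy hidden size, random weights, and random input are illustrative assumptions; UniRep itself uses pre-trained weights and a 1900-unit hidden state.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlstm_embed(onehot_seq, W):
    """Toy mLSTM pass over a one-hot peptide matrix (S x 20), following Eqs. (1)-(7).
    Returns the average-pooled hidden states as a fixed-length embedding."""
    S, _ = onehot_seq.shape
    H = W["hm"].shape[0]
    h = np.zeros(H)          # h_{t-1}
    C = np.zeros(H)          # C_{t-1}
    hidden_states = []
    for t in range(S):
        x = onehot_seq[t]                                   # X_t
        m = (x @ W["xm"]) * (W["hm"] @ h)                   # Eq. (1), element-wise product
        h_hat = np.tanh(W["mh"] @ m + x @ W["xh"])          # Eq. (2), candidate hidden state
        f = sigmoid(x @ W["xf"] + m @ W["mf"])              # Eq. (3), forget gate
        i = sigmoid(x @ W["xi"] + m @ W["mi"])              # Eq. (4), input gate
        o = sigmoid(x @ W["xo"] + m @ W["mo"])              # Eq. (5), output gate
        C = f * C + i * h_hat                               # Eq. (6), cell state update
        h = o * np.tanh(C)                                  # Eq. (7), hidden state
        hidden_states.append(h)
    return np.mean(hidden_states, axis=0)                   # average pooling over positions

# Toy usage: random weights and a random 10-residue one-hot "peptide".
rng = np.random.default_rng(0)
H, A = 8, 20
W = {k: rng.normal(scale=0.1, size=(A, H)) for k in ("xm", "xh", "xf", "xi", "xo")}
W.update({k: rng.normal(scale=0.1, size=(H, H)) for k in ("hm", "mh", "mf", "mi", "mo")})
onehot = np.eye(A)[rng.integers(0, A, size=10)]
print(mlstm_embed(onehot, W).shape)  # (8,) -- UniRep analogously yields a 1900-D vector
```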

4.2.3. Balancing Strategy (SMOTE)

Since the initial UMP-TR dataset was imbalanced (112 umami vs. 241 non-umami peptides), the Synthetic Minority Over-sampling Technique (SMOTE) was applied to balance the classes.

  • Mechanism:
    1. For each sample in the minority class (umami peptides), k-nearest neighbors (KNN) are identified.
    2. Synthetic samples are then created by linearly interpolating between a minority class sample and one of its randomly chosen neighbors. This means taking a sample, picking a neighbor, and creating a new sample along the line segment connecting them.
  • Purpose: SMOTE not only increases the sample size of the minority class but also improves sample quality by creating diverse, yet realistic, synthetic samples. This helps classifiers learn more distinct features and avoids bias towards the majority class, thereby improving overall performance.
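
For illustration, the snippet below applies SMOTE with the imbalanced-learn package to a feature matrix shaped like the UMP-TR split (241 negative vs. 112 positive samples of 1900-D UniRep features); the random data and the k_neighbors setting are placeholders, not the paper's exact configuration.

```python
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(42)
# Placeholder stand-ins for 1900-D UniRep embeddings of the UMP-TR peptides.
X = rng.normal(size=(353, 1900))                  # 241 non-umami + 112 umami
y = np.array([0] * 241 + [1] * 112)               # 0 = non-umami, 1 = umami

# SMOTE interpolates between each minority sample and one of its k nearest
# minority-class neighbors to synthesize new positive samples.
X_bal, y_bal = SMOTE(k_neighbors=5, random_state=1).fit_resample(X, y)
print(np.bincount(y), "->", np.bincount(y_bal))   # [241 112] -> [241 241]
```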

4.2.4. Feature Selection Strategy

After UniRep feature extraction, each peptide was represented by a 1900-dimensional vector. To address potential issues of over-fitting and feature redundancy associated with high-dimensional data, three feature selection techniques were employed: Analysis of Variance (ANOVA), Light Gradient Boosting Machine (LGBM), and Mutual Information (MI).

  • Process: Features were ranked by the importance values calculated by each method, and only features whose importance exceeded a threshold (the average feature importance value) were retained. An incremental feature strategy combined with a hyperparameter grid search (GridSearchCV in scikit-learn) was then used for optimization; a minimal sketch of this selection loop is shown below.
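
The sketch below is a minimal version of such an importance-ranked, incremental selection with a grid-searched logistic regression; the candidate dimensionalities and the LR hyperparameter grid are illustrative assumptions rather than the paper's exact search space.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

def incremental_selection(X, y, importances, dims=(50, 100, 150, 200)):
    """Keep the top-d features by a precomputed importance ranking (ANOVA, LGBM,
    or MI) and grid-search an LR classifier for each candidate dimensionality."""
    order = np.argsort(importances)[::-1]
    best = {"dim": None, "cv_acc": -np.inf, "model": None}
    for d in dims:                                          # incremental feature strategy
        cols = order[:d]
        grid = GridSearchCV(
            LogisticRegression(max_iter=2000),
            {"C": [0.01, 0.1, 1, 10]},                      # illustrative LR grid
            cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=1),
            scoring="accuracy",
        ).fit(X[:, cols], y)
        if grid.best_score_ > best["cv_acc"]:
            best = {"dim": d, "cv_acc": grid.best_score_, "model": grid.best_estimator_}
    return best
```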

4.2.4.1. Analysis of Variance (ANOVA)

ANOVA is used here to score features by their ability to differentiate between classes. Equation (8): S(t) = \frac{S_{\theta}^2(t)}{S_{\omega}^2(t)} Where:

  • S(t): The ANOVA score for feature t. A higher score indicates greater importance.

  • S_{\theta}^2(t): The variance between groups for feature t. This measures how much the class means differ for that feature.

  • S_{\omega}^2(t): The variance within groups for feature t. This measures the variability of that feature's values inside each class.

    Equation (9) for the variance between groups: S_{\theta}^2(t) = \frac{1}{K-1} \sum_{i=1}^{K} m_i \left( \frac{\sum_{j=1}^{m_i} f_t(i,j)}{m_i} - \frac{\sum_{i=1}^{K} \sum_{j=1}^{m_i} f_t(i,j)}{\sum_{i=1}^{K} m_i} \right)^2 Where:

  • K: The number of groups (classes, e.g., umami and non-umami).

  • m_i: The number of samples in group i.

  • f_t(i, j): The value of feature t for the j-th sample in the i-th group.

  • The first fraction inside the parentheses is the mean of feature t in group i.

  • The second fraction is the overall mean of feature t across all samples and groups.

    Equation (10) for the variance within groups: S_{\omega}^2(t) = \frac{1}{N-K} \sum_{i=1}^{K} \sum_{j=1}^{m_i} \left( f_t(i,j) - \frac{\sum_{j=1}^{m_i} f_t(i,j)}{m_i} \right)^2 Where:

  • N: The total number of instances (samples).

  • The term inside the parentheses is the difference between a sample's feature value and the mean of that feature within its group.
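
As a sketch of Equations (8)-(10), the function below computes the ANOVA score for a single feature column from its values and class labels; scikit-learn's f_classif returns the same quantity (the one-way ANOVA F statistic) for all columns at once.

```python
import numpy as np
from sklearn.feature_selection import f_classif

def anova_score(feature, labels):
    """ANOVA score S(t) = between-group variance / within-group variance, Eqs. (8)-(10)."""
    groups = [feature[labels == c] for c in np.unique(labels)]
    K, N = len(groups), len(feature)
    grand_mean = feature.mean()
    s_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups) / (K - 1)   # Eq. (9)
    s_within = sum(((g - g.mean()) ** 2).sum() for g in groups) / (N - K)              # Eq. (10)
    return s_between / s_within                                                        # Eq. (8)

# Toy check against scikit-learn's implementation.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 60), rng.normal(1, 1, 40)])
y = np.array([0] * 60 + [1] * 40)
print(anova_score(x, y))
print(f_classif(x.reshape(-1, 1), y)[0][0])   # same value as above
```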

4.2.4.2. Light Gradient Boosting Machine (LGBM)

LGBM is a decision-tree-based gradient boosting framework. For feature selection, it directly provides feature importances. Equation (11): h_c(x) = \mathrm{argmin}_{h \in H} \sum L(y, F_{c-1}(x) + h(x)) Where:

  • h_c(x): The new base learner (decision tree) being sought at iteration c.

  • H: The space of possible base learners.

  • \sum L(y, F_{c-1}(x) + h(x)): The loss function L summed over all training samples, where y is the true label, F_{c-1}(x) is the model's prediction after the previous c-1 iterations, and h(x) is the candidate base learner's prediction. The goal is to find the h_c(x) that minimizes this loss.

    Equation (12): r_{ti} = -\frac{\partial L(y, F_{t-1}(x_i))}{\partial F_{t-1}(x_i)} Where:

  • r_{ti}: The pseudo-residual (negative gradient) for sample i at iteration t. This is the error that the new base learner is fitted to predict and correct.

  • \frac{\partial L(y, F_{t-1}(x_i))}{\partial F_{t-1}(x_i)}: The partial derivative of the loss function L with respect to the model's prediction F_{t-1}(x_i) from the previous iteration.

    Equation (13):

F_c(x) = F_{c-1}(x) + h_c(x)

This equation describes the iterative update of the overall model: the prediction at iteration c is the prediction from the previous iteration plus the newly fitted base learner, so the model improves incrementally by adding weak learners (decision trees). The importance of each feature in LGBM is derived from how often, and how effectively, it is used to split nodes in these decision trees across all iterations.
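
In practice, the per-feature importances come directly from a fitted LightGBM model. The sketch below keeps the UniRep dimensions whose importance exceeds the average importance, mirroring the thresholding described in Section 4.2.4; the hyperparameters are illustrative assumptions.

```python
import numpy as np
from lightgbm import LGBMClassifier

def lgbm_selected_columns(X, y):
    """Return indices of features whose LGBM importance is above the mean importance."""
    model = LGBMClassifier(n_estimators=200, learning_rate=0.1, random_state=1)
    model.fit(X, y)
    importances = model.feature_importances_   # split-count importance per feature
    return np.where(importances > importances.mean())[0], importances
```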

4.2.4.3. Mutual Information (MI)

Mutual Information quantifies the dependency between two variables, i.e., how much knowing one variable reduces uncertainty about the other. Equation (14) for entropy: H(S) = -\sum_{\varepsilon_i \in \Sigma_U} P(\varepsilon_i) \log P(\varepsilon_i) Where:

  • H(S): The entropy of the peptide sequence S. Entropy measures the average amount of information, or uncertainty, in a random variable.

  • \Sigma_U: The alphabet of amino acid residues (e.g., the 20 standard amino acids).

  • P(\varepsilon_i): The marginal probability of a specific amino acid residue \varepsilon_i occurring in the sequence.

    Equation (15) for mutual information: MI = \sum_{\varepsilon_i \in \Sigma_U} \sum_{\varepsilon_j \in \Sigma_U} P(\varepsilon_i, \varepsilon_j) \log \frac{P(\varepsilon_i, \varepsilon_j)}{P(\varepsilon_i) P(\varepsilon_j)} Where:

  • MI: The mutual information between two variables. As written, the formula measures the MI between two amino acid residues \varepsilon_i and \varepsilon_j; for feature selection, MI is generally computed between each feature and the target variable (umami/non-umami).

  • P(\varepsilon_i, \varepsilon_j): The joint probability of observing residue \varepsilon_i and residue \varepsilon_j (or a feature value together with a class label).

  • P(\varepsilon_i) and P(\varepsilon_j): The marginal probabilities of observing \varepsilon_i and \varepsilon_j, respectively.

  • A higher MI value indicates a stronger statistical dependency between the feature and the class label, making the feature more valuable for classification.
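
For feature selection, the dependency is therefore measured between each UniRep dimension and the umami/non-umami label. A minimal sketch using scikit-learn's estimator is shown below, again keeping features whose MI exceeds the average; the estimator's default settings are assumptions rather than the paper's stated configuration.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def mi_selected_columns(X, y, random_state=1):
    """Rank features by mutual information with the class label and keep
    those above the average MI, mirroring the thresholding used for ANOVA/LGBM."""
    mi = mutual_info_classif(X, y, random_state=random_state)
    return np.where(mi > mi.mean())[0], mi
```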

4.2.5. Machine Learning Methods

Five common and high-performance ML methods were chosen to evaluate the UniRep features and identify the optimal classifier:

  • K-Nearest Neighbors (KNN): A non-parametric, instance-based learning algorithm. It classifies a new data point by finding the KK closest data points in the training set and assigning the class that is most common among them. Its simplicity makes it a good baseline.
  • Logistic Regression (LR): A linear model for binary classification. It estimates the probability of an instance belonging to a particular class using a sigmoid function. It's known for its simplicity, interpretability, and parallelizability.
  • Support Vector Machine (SVM): A powerful discriminative classifier that finds an optimal hyperplane to separate data points into different classes, maximizing the margin between the classes. It's often used for binary classification in bioinformatics.
  • Random Forest (RF): An ensemble learning method that builds multiple decision trees during training and outputs the mode of the classes (for classification) or mean prediction (for regression) of the individual trees. It uses bagging (bootstrap aggregating) and random feature selection for improved robustness and reduced overfitting.
  • Light Gradient Boosting Machine (LGBM): An efficient, distributed gradient boosting framework based on decision trees. It builds trees sequentially, with each new tree correcting the errors of the previous ones. It is known for its speed and high performance.
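
A compact way to reproduce this kind of comparison is to score each candidate classifier under the same cross-validation protocol; the sketch below uses scikit-learn and LightGBM with largely default hyperparameters, which are assumptions rather than the paper's tuned settings.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

candidates = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "LR": LogisticRegression(max_iter=2000),
    "SVM": SVC(),
    "RF": RandomForestClassifier(n_estimators=300, random_state=1),
    "LGBM": LGBMClassifier(n_estimators=300, random_state=1),
}

def compare_models(X, y):
    """Mean 10-fold balanced accuracy for each classifier on SMOTE-balanced features."""
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
    return {name: cross_val_score(clf, X, y, cv=cv, scoring="balanced_accuracy").mean()
            for name, clf in candidates.items()}
```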

4.2.6. Evaluation Metrics and Methods

The models were evaluated using K-fold cross-validation and independent testing.

  • K-fold Cross-Validation: The training data (UMP-TR) is divided into K equal subsets (folds). The model is trained K times; in each iteration, one fold is used as the validation set and the remaining K-1 folds are used for training. The results from all K iterations are averaged to provide a more robust estimate of model performance. The paper used 10-fold cross-validation (K = 10).

  • Independent Testing: After the model is trained and optimized using the UMP-TR dataset, its final performance is evaluated on a completely unseen independent test set (UMP-IND) and the UMP-VERIFIED dataset. This provides an unbiased measure of the model's generalization ability.

  • Evaluation Metrics: The following widely used metrics were calculated:

    • True Positives (TP): The number of umami peptides correctly identified as umami.

    • True Negatives (TN): The number of non-umami peptides correctly identified as non-umami.

    • False Positives (FP): The number of non-umami peptides incorrectly identified as umami.

    • False Negatives (FN): The number of umami peptides incorrectly identified as non-umami.

      Equation (16) for Accuracy (ACC): \mathrm{ACC} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{FP} + \mathrm{FN} + \mathrm{TP} + \mathrm{TN}} Where:

    • ACC: The proportion of total predictions that were correct.

      Equation (17) for Matthews Correlation Coefficient (MCC): \mathrm{MCC} = \frac{\mathrm{TP} \times \mathrm{TN} - \mathrm{FP} \times \mathrm{FN}}{\sqrt{(\mathrm{TP}+\mathrm{FP})(\mathrm{TP}+\mathrm{FN})(\mathrm{TN}+\mathrm{FP})(\mathrm{TN}+\mathrm{FN})}} Where:

    • MCC: A correlation coefficient between the observed and predicted binary classifications. It is a more reliable statistic than accuracy for imbalanced datasets because it accounts for all four confusion matrix values (TP, TN, FP, FN). Its value ranges from -1 (perfect disagreement) to +1 (perfect agreement), with 0 indicating random prediction.

      Equation (18) for Sensitivity (Sn) or Recall: \mathrm{Sn} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} Where:

    • Sn: The proportion of actual positive cases (umami peptides) that were correctly identified.

      Equation (19) for Specificity (Sp): \mathrm{Sp} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}} Where:

    • Sp: The proportion of actual negative cases (non-umami peptides) that were correctly identified.

      Equation (20) for Balanced Accuracy (BACC): \mathrm{BACC} = \frac{\mathrm{Sn} + \mathrm{Sp}}{2} Where:

    • BACC: The average of sensitivity and specificity. It is particularly useful for imbalanced datasets because it gives equal weight to both classes, preventing a high score driven solely by the majority class. If the dataset is perfectly balanced, ACC and BACC are equal.

    • Area Under the Receiver Operating Characteristic Curve (auROC): The ROC curve plots the True Positive Rate (TPR, i.e., sensitivity) against the False Positive Rate (FPR, i.e., 1 - specificity) at various classification thresholds. The auROC quantifies the overall ability of a binary classifier to distinguish between classes across all possible thresholds. A value of 0.5 indicates random prediction, while 1.0 indicates a perfect classifier.

    • Cross-Entropy Loss: For binary classification, the cross-entropy loss (also known as binary cross-entropy loss or log loss) measures the performance of a classification model whose output is a probability value between 0 and 1. Equation (21) for Loss: \mathrm{Loss} = -\left( y \cdot \log(\hat{y}) + (1 - y) \cdot \log(1 - \hat{y}) \right) Where:

      • Loss: The cross-entropy loss.
      • y: The true label of the sample; y = 1 for a positive case (umami peptide) and y = 0 for a negative case (non-umami peptide).
      • \hat{y}: The predicted probability that the sample is a positive case (umami peptide), ranging from 0 to 1.
      • Training minimizes this loss; a lower cross-entropy loss indicates a more accurate classification.
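
The metric definitions in Equations (16)-(21) can be computed directly from a model's predictions, as in the minimal sketch below; scikit-learn's matthews_corrcoef, balanced_accuracy_score, roc_auc_score, and log_loss provide the same quantities.

```python
import numpy as np

def binary_metrics(y_true, y_pred, y_prob, eps=1e-12):
    """ACC, MCC, Sn, Sp, BACC and cross-entropy loss from Eqs. (16)-(21)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    acc = (tp + tn) / (tp + tn + fp + fn)                                        # Eq. (16)
    mcc = ((tp * tn - fp * fn) /
           np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) + eps))  # Eq. (17)
    sn = tp / (tp + fn + eps)                                                    # Eq. (18)
    sp = tn / (tn + fp + eps)                                                    # Eq. (19)
    bacc = (sn + sp) / 2                                                         # Eq. (20)
    p = np.clip(np.asarray(y_prob), eps, 1 - eps)
    loss = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))           # Eq. (21), averaged
    return {"ACC": acc, "MCC": mcc, "Sn": sn, "Sp": sp, "BACC": bacc, "CrossEntropy": loss}
```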

4.2.7. Web Server Development

A user-friendly web server was developed and made freely accessible at https://www.aibiochem.net/servers/iUmami-DRLF/. Users can input peptide sequences and receive predictions (whether it's an umami peptide and its confidence level).

The following figure (Figure 1 from the original paper) provides an overview of the model development:

Figure 1. Overview of model development. The pre-trained UniRep sequence embedding model was used to embed the peptide sequences into 1900-dimensional eigenvectors. The figure depicts the overall pipeline: dataset collection, feature extraction, data balancing, feature selection, model training, and performance evaluation, followed by deployment of the web server.

5. Experimental Setup

5.1. Datasets

The experiments utilized several datasets derived from existing resources and newly collected data:

  • UMP442 Benchmark Dataset: This is the primary dataset used for model training and initial testing.

    • Source: Updated from iUmami-SCM [9], originally comprising peptides from the BIOPEP-UWM database [4] and other experimentally verified umami peptides.
    • Scale: Contains a total of 442 peptide sequences after data cleaning.
    • Characteristics:
      • Positive samples: 140 umami peptides.
      • Negative samples: 302 non-umami (bitter) peptides. This indicates an imbalanced dataset, which was addressed using SMOTE.
    • Split: Arbitrarily divided into:
      • UMP-TR (Training set): 112 umami peptides and 241 non-umami peptides.
      • UMP-IND (Independent Test set): 28 umami peptides and 61 non-umami peptides.
    • Domain: Peptide sequences, with classification based on their umami taste property.
    • Data Sample Example: The paper does not provide a concrete example of a peptide sequence from the dataset; however, typical peptide sequences are strings of amino acid abbreviations (e.g., "Gly-Pro-Leu" or "GPL").
  • UMP-VERIFIED Dataset: This dataset was specifically collected for external validation to test model robustness and compare against state-of-the-art methods.

    • Source: 91 wet-experiment verified umami peptide sequences reported in the latest literature [56-70].

    • Scale: 91 umami peptide sequences.

    • Characteristics: All samples in this dataset are positive (umami) peptides, validated through laboratory experiments.

    • Why chosen: This dataset provides an unbiased, real-world validation of the models' ability to identify true umami peptides, especially for newly discovered ones.

      These datasets were chosen to ensure a robust training process, fair independent evaluation, and a rigorous comparison with existing methods using recently validated experimental data. The use of an imbalanced initial dataset and its subsequent balancing with SMOTE is a critical aspect of ensuring the model's effectiveness across both classes.

5.2. Evaluation Metrics

For every evaluation metric mentioned in the paper, here is a complete explanation:

  • True Positives (TP):

    • Conceptual Definition: Instances that are actually positive (e.g., umami peptides) and were correctly predicted as positive by the model.
    • Mathematical Formula: N/A (fundamental count)
    • Symbol Explanation: TP denotes the count of umami peptides successfully identified as umami.
  • True Negatives (TN):

    • Conceptual Definition: Instances that are actually negative (e.g., non-umami peptides) and were correctly predicted as negative by the model.
    • Mathematical Formula: N/A (fundamental count)
    • Symbol Explanation: TN denotes the count of non-umami peptides successfully identified as non-umami.
  • False Positives (FP):

    • Conceptual Definition: Instances that are actually negative (e.g., non-umami peptides) but were incorrectly predicted as positive by the model. This is also known as a Type I error.
    • Mathematical Formula: N/A (fundamental count)
    • Symbol Explanation: FP denotes the count of non-umami peptides falsely identified as umami.
  • False Negatives (FN):

    • Conceptual Definition: Instances that are actually positive (e.g., umami peptides) but were incorrectly predicted as negative by the model. This is also known as a Type II error.
    • Mathematical Formula: N/A (fundamental count)
    • Symbol Explanation: FN denotes the count of umami peptides incorrectly identified as non-umami.
  • Accuracy (ACC):

    • Conceptual Definition: The proportion of total predictions that the model made correctly. It measures the overall correctness of the model.
    • Mathematical Formula: \mathrm{ACC} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{FP} + \mathrm{FN} + \mathrm{TP} + \mathrm{TN}}
    • Symbol Explanation:
      • TP: True Positives.
      • TN: True Negatives.
      • FP: False Positives.
      • FN: False Negatives.
  • Matthews Correlation Coefficient (MCC):

    • Conceptual Definition: A robust measure of the quality of binary classifications, especially useful for imbalanced datasets. It considers all four values of the confusion matrix and produces a high score only if the model performs well on all four aspects (TP, TN, FP, FN). It ranges from -1 (perfect inverse correlation) to +1 (perfect prediction), with 0 indicating a random prediction.
    • Mathematical Formula: \mathrm{MCC} = \frac{\mathrm{TP} \times \mathrm{TN} - \mathrm{FP} \times \mathrm{FN}}{\sqrt{(\mathrm{TP}+\mathrm{FP})(\mathrm{TP}+\mathrm{FN})(\mathrm{TN}+\mathrm{FP})(\mathrm{TN}+\mathrm{FN})}}
    • Symbol Explanation:
      • TP: True Positives.
      • TN: True Negatives.
      • FP: False Positives.
      • FN: False Negatives.
  • Sensitivity (Sn) / Recall:

    • Conceptual Definition: The proportion of actual positive cases that were correctly identified by the model. It measures the model's ability to find all the positive samples.
    • Mathematical Formula: \mathrm{Sn} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}
    • Symbol Explanation:
      • TP: True Positives.
      • FN: False Negatives.
  • Specificity (Sp):

    • Conceptual Definition: The proportion of actual negative cases that were correctly identified by the model. It measures the model's ability to correctly identify non-positive samples.
    • Mathematical Formula: \mathrm{Sp} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}}
    • Symbol Explanation:
      • TN: True Negatives.
      • FP: False Positives.
  • Balanced Accuracy (BACC):

    • Conceptual Definition: The average of Sensitivity (True Positive Rate) and Specificity (True Negative Rate). This metric is particularly useful for imbalanced datasets because it gives equal weight to the performance on both positive and negative classes, preventing a misleadingly high accuracy score driven by the majority class. If the dataset is perfectly balanced, ACC and BACC values will be equal.
    • Mathematical Formula: \mathrm{BACC} = \frac{\mathrm{Sn} + \mathrm{Sp}}{2}
    • Symbol Explanation:
      • Sn: Sensitivity.
      • Sp: Specificity.
  • Area Under the Receiver Operating Characteristic Curve (auROC):

    • Conceptual Definition: The ROC curve plots the True Positive Rate (TPR) (Sensitivity) against the False Positive Rate (FPR) (1 - Specificity) at various classification thresholds. The auROC value represents the overall ability of a classifier to distinguish between positive and negative classes. A higher auROC indicates a better model. Values range from 0.5 (random classifier) to 1.0 (perfect classifier).
    • Mathematical Formula: No single formula as it's the area under a curve generated by varying a threshold.
    • Symbol Explanation: N/A (derived from TPR and FPR).
  • Cross-Entropy Loss:

    • Conceptual Definition: For binary classification problems where the output is a probability between 0 and 1, cross-entropy loss quantifies the difference between the predicted probability distribution and the true distribution. It penalizes predictions that are confident and wrong more heavily. A lower loss value indicates a better fit between the model's predictions and the true labels.
    • Mathematical Formula: \mathrm{Loss} = -\left( y \cdot \log(\hat{y}) + (1 - y) \cdot \log(1 - \hat{y}) \right)
    • Symbol Explanation:
      • Loss: The cross-entropy loss value.
      • y: The true binary label of the sample (1 for positive, 0 for negative).
      • \hat{y}: The model's predicted probability that the sample is positive.

5.3. Baselines

The iUmami-DRLF method was compared against the following existing state-of-the-art models for umami peptide prediction:

  • iUmami-SCM [9]: This model combines the scoring card method (SCM) with propensity scores of amino acids and dipeptides. It represents a traditional feature engineering approach.

  • UMPred-FRL [11]: This model integrates seven different traditional feature codes for constructing the umami peptide classifier. It also primarily relies on traditional feature engineering but with a more comprehensive set.

  • iUP-BERT [12]: This model is based on deep representational learning, using BERT (Bidirectional Encoder Representations from Transformers) for feature encoding. It is a more recent deep learning-based approach and represents the direct predecessor in terms of advanced feature extraction.

    These baselines are representative because they cover the evolution of umami peptide prediction, from traditional statistical methods (iUmami-SCM, UMPred-FRL) to more recent deep learning approaches (iUP-BERT). Comparing iUmami-DRLF against these diverse baselines allows for a thorough assessment of its advancements in both methodology and performance.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate the effectiveness of iUmami-DRLF, especially its robustness and accuracy, by systematically evaluating the impact of data balancing, different ML models, feature selection techniques, and direct comparison with existing methods.

6.1.1. Effect of SMOTE

The initial analysis focused on the impact of SMOTE (Synthetic Minority Over-sampling Technique) on model performance. The UniRep feature vectors (1900-dimensional) were extracted, and five different ML models (KNN, LR, SVM, LGBM, RF) were trained both with and without SMOTE balancing.

The following figure (Figure 2 from the original paper) illustrates the results of 10-fold cross-validation and independent testing for models with and without SMOTE:

Figure 2. Results of 10-fold cross-validation (A) and independent testing (B) for the five ML models trained with SMOTE balancing and without it. The figure compares multiple performance metrics and shows that SMOTE balancing markedly improves model performance.

As shown in Figure 2 and detailed in Supplementary Table S1 (not provided in the main text but referenced), SMOTE significantly improved the performance of the models. For example, the LR-SMOTE model either outperformed or equaled the LR model without SMOTE in 66.7% of the metrics in both cross-validation and independent tests. Similarly, SVM-SMOTE outperformed its non-SMOTE counterpart in 83.3% of indicators. The paper highlights that Sp (Specificity) values were often high without SMOTE, but Sn (Sensitivity) and other indicators were poor, indicating a bias towards the negative class due to the imbalanced dataset. This underscores the necessity of SMOTE to enable the models to effectively recognize positive cases (umami peptides).

Visual analyses using UMAP (Uniform Manifold Approximation and Projection) dimensionality reduction also supported SMOTE's benefit. Figure 3A (UniRep features without SMOTE) showed less distinct clustering between umami and non-umami peptides compared to Figure 3B (UniRep features with SMOTE), where the clusters were more separable, indicating improved data representation for classification.
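
This kind of 2-D visualization can be reproduced with the umap-learn package; the snippet below is a generic sketch of projecting the UniRep feature vectors and coloring points by class, not the exact settings used to produce Figure 3.

```python
import numpy as np
import umap
import matplotlib.pyplot as plt

def plot_umap(X, y, title="UniRep features (UMAP projection)"):
    """Project high-dimensional feature vectors to 2-D and color points by class."""
    y = np.asarray(y)
    emb = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=1).fit_transform(X)
    plt.scatter(emb[y == 0, 0], emb[y == 0, 1], s=10, label="non-umami")
    plt.scatter(emb[y == 1, 0], emb[y == 1, 1], s=10, label="umami")
    plt.legend(); plt.title(title); plt.show()
```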

6.1.2. Effects of Different ML Models

After confirming the benefits of SMOTE, the study proceeded to compare the performance of the five ML algorithms using the SMOTE-balanced UniRep features.

The following are the results from Table 1 of the original paper:

10-Fold Cross-Validation

| Model | ACC | MCC | Sn | Sp | auROC | BACC |
|-------|-------|-------|-------|-------|-------|-------|
| LR | 0.921 | 0.847 | 0.954 | 0.888 | 0.956 | 0.921 |
| KNN | 0.861 | 0.727 | 0.917 | 0.805 | 0.924 | 0.861 |
| SVM | 0.865 | 0.756 | 0.738 | 0.992 | 0.981 | 0.865 |
| RF | 0.917 | 0.837 | 0.942 | 0.892 | 0.967 | 0.917 |
| LGBM | 0.919 | 0.841 | 0.946 | 0.892 | 0.972 | 0.919 |

Independent Test

| Model | ACC | MCC | Sn | Sp | auROC | BACC |
|-------|-------|-------|-------|-------|-------|-------|
| LR | 0.853 | 0.653 | 0.721 | 0.913 | 0.928 | 0.817 |
| KNN | 0.807 | 0.589 | 0.818 | 0.802 | 0.875 | 0.810 |
| SVM | 0.716 | 0.258 | 0.100 | 0.998 | 0.789 | 0.549 |
| RF | 0.836 | 0.617 | 0.725 | 0.887 | 0.893 | 0.806 |
| LGBM | 0.845 | 0.636 | 0.729 | 0.898 | 0.907 | 0.813 |

Table 1 indicates that the Logistic Regression (LR) model generally outperformed other ML models. In 10-fold cross-validation, LR (iUmami-DRLF) exceeded other models in four metrics (ACC, BACC, MCC, Sn). For instance, its ACC and BACC were 0.22-6.97% higher, and MCC and Sn increased by 0.71-16.51% and 0.85-29.27% respectively. In independent tests, LR also outscored others in ACC, MCC, auROC, and BACC (e.g., ACC 0.95-19.13% higher, MCC 2.67-153.10% higher). Although SVM showed high Sp in independent tests (0.998), its extremely low Sn (0.100) and MCC (0.258) indicated a severe imbalance in its predictive capability, likely favoring the majority class, despite SMOTE. This led to the selection of LR as the base classifier for the final iUmami-DRLF model. The equality of ACC and BACC values in 10-fold cross-validation further confirmed the effectiveness of SMOTE in balancing the dataset.

6.1.3. Effects of Different Feature Selection Methods

With SMOTE applied and LR identified as the superior classifier, the next step involved optimizing the high-dimensional 1900D UniRep feature vector using feature selection. Three methods were compared: ANOVA, LGBM, and MI.

The following are the results from Table 2 of the original paper:

10-Fold Cross-Validation

| Model | Feature Selection | Dim | ACC | MCC | Sn | Sp | auROC | BACC |
|-------|-------------------|-----|-------|-------|-------|-------|-------|-------|
| LR | LGBM | 177 | 0.925 | 0.853 | 0.959 | 0.892 | 0.957 | 0.925 |
| LR | ANOVA | 102 | 0.882 | 0.764 | 0.896 | 0.867 | 0.863 | 0.882 |
| LR | MI | 136 | 0.888 | 0.777 | 0.913 | – | 0.942 | 0.888 |
| KNN | LGBM | 33 | 0.892 | 0.788 | 0.938 | 0.846 | 0.955 | 0.892 |
| KNN | ANOVA | 15 | 0.873 | 0.748 | 0.896 | 0.851 | 0.934 | 0.873 |
| KNN | MI | 58 | 0.888 | 0.783 | 0.954 | 0.822 | 0.927 | 0.888 |
| SVM | LGBM | 121 | 0.944 | 0.889 | 0.971 | 0.917 | 0.980 | 0.944 |
| SVM | ANOVA | 48 | 0.925 | 0.854 | 0.967 | 0.884 | 0.977 | 0.925 |
| SVM | MI | 16 | 0.919 | 0.841 | 0.959 | 0.880 | 0.968 | 0.919 |
| RF | LGBM | 88 | 0.915 | 0.830 | 0.913 | – | 0.975 | 0.915 |
| RF | ANOVA | 118 | 0.898 | 0.797 | 0.884 | – | 0.961 | 0.898 |
| RF | MI | 8 | 0.902 | 0.806 | 0.921 | 0.884 | 0.952 | 0.902 |
| LGBM | LGBM | 35 | 0.938 | 0.877 | 0.942 | – | – | 0.938 |
| LGBM | ANOVA | 19 | 0.902 | 0.807 | 0.905 | – | – | 0.902 |
| LGBM | MI | 18 | 0.888 | 0.777 | 0.917 | 0.859 | 0.953 | 0.888 |

Independent Test

| Model | Feature Selection | Dim | ACC | MCC | Sn | Sp | auROC | BACC |
|-------|-------------------|-----|-------|-------|-------|-------|-------|-------|
| LR | LGBM | 177 | 0.921 | 0.815 | 0.821 | 0.967 | 0.956 | 0.894 |
| LR | ANOVA | 102 | 0.899 | 0.768 | 0.857 | 0.918 | 0.930 | 0.888 |
| LR | MI | 136 | 0.888 | 0.733 | 0.750 | 0.951 | 0.864 | 0.850 |
| KNN | LGBM | 33 | 0.899 | 0.782 | 0.929 | 0.885 | 0.911 | 0.907 |
| KNN | ANOVA | 15 | 0.865 | 0.703 | 0.857 | 0.869 | 0.907 | 0.863 |
| KNN | MI | 58 | 0.888 | 0.773 | 0.964 | 0.852 | 0.931 | 0.908 |
| SVM | LGBM | 121 | 0.888 | 0.739 | 0.821 | 0.918 | 0.913 | 0.870 |
| SVM | ANOVA | 48 | 0.865 | 0.678 | 0.679 | 0.951 | 0.906 | 0.815 |
| SVM | MI | 16 | 0.888 | 0.735 | 0.786 | 0.934 | 0.921 | 0.860 |
| RF | LGBM | 88 | – | – | – | – | – | – |
| RF | ANOVA | 118 | – | – | – | – | – | – |
| RF | MI | 8 | 0.888 | 0.753 | 0.893 | 0.885 | 0.923 | 0.889 |
| LGBM | LGBM | 35 | – | – | – | – | – | – |
| LGBM | ANOVA | 19 | – | – | – | – | – | – |
| LGBM | MI | 18 | 0.865 | 0.682 | 0.750 | 0.918 | 0.916 | 0.834 |

The following figure (Figure 4 from the original paper) provides a comparison of the results of independent testing of the models with selected features and the models without selected features:

Figure 4. Comparison of independent test results for models trained with selected features and models trained without feature selection. The figure compares the performance metrics (ACC, MCC, Sn, Sp, auROC, BACC) of the different ML models (KNN, LR, SVM, RF, LGBM) across feature sets on the independent test set.

Figure 4 and Table 2 clearly demonstrate that feature selection significantly improved model performance. The Sp of the 1900D models without feature selection was lower than most models with feature selection, indicating that feature selection helps resolve information redundancy and optimize predictive performance. Among the three methods, LGBM feature selection yielded the best overall performance, particularly for the LR model. For the LR model, LGBM feature selection led to a 4.17-4.88% improvement in ACC and BACC in 10-fold cross-validation, and 2.45-3.72% improvement in ACC in independent tests, compared to ANOVA and MI. The specific configuration of LR with the top 177D features selected by LGBM was chosen as the optimal iUmami-DRLF predictor. This was further supported by UMAP visualization (Figure 3C), which showed better separation of clusters with the 177D features compared to the full 1900D SMOTE-optimized features (Figure 3B).

6.1.4. Comparison with Existing Methods

The final iUmami-DRLF model (LR classifier with 177D LGBM-selected UniRep features) was rigorously compared against iUmami-SCM, UMPred-FRL, and iUP-BERT.

The following are the results from Table 3 of the original paper:

10-Fold Cross-Validation

| Classifier | ACC | MCC | Sn | Sp | auROC | BACC |
|------------|-------|-------|-------|-------|-------|-------|
| iUmami-DRLF (LR) | 0.925 | 0.853 | 0.959 | 0.892 | 0.957 | 0.925 |
| iUmami-DRLF (SVM) | 0.944 | 0.889 | 0.971 | 0.917 | 0.980 | 0.944 |
| iUP-BERT | 0.940 | 0.881 | 0.963 | 0.917 | 0.971 | 0.940 |
| UMPred-FRL | 0.921 | 0.810 | 0.847 | 0.955 | 0.930 | 0.901 |
| iUmami-SCM | 0.935 | 0.864 | 0.947 | 0.930 | 0.945 | 0.939 |

Independent Test

| Classifier | ACC | MCC | Sn | Sp | auROC | BACC |
|------------|-------|-------|-------|-------|-------|-------|
| iUmami-DRLF (LR) | 0.921 | 0.815 | 0.821 | 0.967 | 0.956 | 0.894 |
| iUmami-DRLF (SVM) | 0.888 | 0.739 | 0.821 | 0.918 | 0.913 | 0.870 |
| iUP-BERT | 0.899 | – | 0.893 | 0.902 | 0.933 | 0.897 |
| UMPred-FRL | 0.888 | 0.735 | 0.786 | 0.934 | 0.919 | 0.860 |
| iUmami-SCM | 0.865 | 0.679 | 0.714 | 0.934 | 0.898 | 0.824 |

Table 3 highlights iUmami-DRLF(LR)'s superior performance in independent testing. While iUmami-DRLF(SVM) showed slightly better results in 10-fold cross-validation across some metrics, iUmami-DRLF(LR) demonstrated stronger generalization ability in independent tests.

  • In independent tests, iUmami-DRLF(LR) achieved an ACC of 0.921, MCC of 0.815, Sn of 0.821, Sp of 0.967, auROC of 0.956, and BACC of 0.894. These values are notably higher than those of all other compared methods; for example, its ACC was 3.76-6.51% higher, MCC 10.86-20.00% higher, and auROC 4.04-6.47% higher than the other predictors (a sketch of how these metrics are computed appears after this list).
  • The comparison between iUmami-DRLF(LR) and iUmami-DRLF(SVM) specifically pointed out that LR had better generalization (higher independent test scores) despite slightly lower cross-validation scores, solidifying its choice for the final predictor.
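For reference, the metrics quoted throughout this comparison (ACC, MCC, Sn, Sp, auROC, BACC) can be computed from a confusion matrix and predicted probabilities as in the short scikit-learn sketch below; the toy labels and scores are made up purely for illustration.

```python
# Sketch of the evaluation metrics used above (ACC, MCC, Sn, Sp, auROC, BACC),
# computed with scikit-learn from hypothetical test labels and predictions.
from sklearn.metrics import (accuracy_score, matthews_corrcoef,
                             confusion_matrix, roc_auc_score,
                             balanced_accuracy_score)

def report(y_true, y_pred, y_prob):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "ACC":   accuracy_score(y_true, y_pred),
        "MCC":   matthews_corrcoef(y_true, y_pred),
        "Sn":    tp / (tp + fn),           # sensitivity: recall on positives
        "Sp":    tn / (tn + fp),           # specificity: recall on negatives
        "auROC": roc_auc_score(y_true, y_prob),
        "BACC":  balanced_accuracy_score(y_true, y_pred),  # (Sn + Sp) / 2
    }

# toy usage with made-up labels and umami probabilities
y_true = [1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.4, 0.2, 0.1, 0.8, 0.6]
y_pred = [int(p >= 0.5) for p in y_prob]
print(report(y_true, y_pred, y_prob))
```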

6.1.5. Methods' Robustness

The robustness of iUmami-DRLF was further validated using the UMP-VERIFIED dataset of 91 wet-experiment verified umami peptides, focusing on performance at different prediction probability thresholds.

The following figure (Figure 5 from the original paper) shows the prediction results under varying probability thresholds:

Figure 5. Under varying probability thresholds, the prediction results of iUmami-DRLF (this work), UMPred-FRL, and iUP-BERT on the UMP-VERIFIED dataset. (A) The relationship between prediction accuracy and the probability threshold. (B) The relationship between cross-entropy loss and the probability threshold; a smaller cross-entropy loss indicates a more robust and accurate model.

Figure 5A shows the relationship between prediction accuracy and the probability threshold. iUmami-DRLF consistently achieved the best accuracy across all probability thresholds. Crucially, at a 95% threshold, iUP-BERT's accuracy dropped to 0%, indicating failure, while UMPred-FRL reached only 8.8% and iUmami-DRLF maintained 52.7%. At a stringent 99% threshold, both iUP-BERT and UMPred-FRL yielded 0% accuracy, whereas iUmami-DRLF still achieved 40.7%. This vividly demonstrates iUmami-DRLF's superior robustness and generalization, especially when high-confidence predictions are required.

Figure 5B, which plots cross-entropy loss against the probability threshold, further supports this. iUmami-DRLF exhibited the minimum cross-entropy loss at the 50%, 70%, and 85% thresholds, and at 95% its loss was significantly smaller than that of UMPred-FRL. At the 95% and 99% thresholds, the cross-entropy losses for UMPred-FRL and iUP-BERT became meaningless as their accuracy dropped to zero. This confirms that iUmami-DRLF is an optimized model with minimal cross-entropy loss, leading to better accuracy and reliability.
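The threshold analysis can be reproduced in a few lines once per-peptide umami probabilities are available. The sketch below assumes one plausible reading of the analysis: because every UMP-VERIFIED peptide is a confirmed positive, "accuracy at threshold t" is the fraction of peptides whose predicted probability reaches t, and the cross-entropy loss is the mean negative log-probability of the positive label. The paper may define the per-threshold loss slightly differently, and the probabilities below are placeholders rather than model output.

```python
# Sketch of the probability-threshold analysis on an all-positive verified set
# such as UMP-VERIFIED. The scores in p_umami are made-up placeholders.
import numpy as np

def threshold_accuracy(p_umami, t):
    """Fraction of verified positives still predicted umami at threshold t."""
    return float(np.mean(p_umami >= t))

def cross_entropy(p_umami, eps=1e-12):
    """Mean negative log-likelihood of the (all-positive) verified labels."""
    p = np.clip(p_umami, eps, 1.0)
    return float(-np.mean(np.log(p)))

p_umami = np.array([0.99, 0.97, 0.90, 0.75, 0.55, 0.30])   # placeholder scores
for t in (0.50, 0.70, 0.85, 0.95, 0.99):
    print(f"threshold {t:.2f}: accuracy = {threshold_accuracy(p_umami, t):.3f}")
print(f"cross-entropy loss = {cross_entropy(p_umami):.3f}")
```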

6.2. Data Presentation (Tables)

The following are the results from Table 1 of the original paper:

Model 10-Fold Cross-Validation Independent Test
ACC MCC Sn Sp auROC BACC ACC MCC Sn Sp auROC BACC
LRc 0.921 a 0.847 0.954 0.888 0.956 0.921 0.853 0.653 0.721 0.913 0.928 0.817
KNN 0.861 b 0.727 0.917 0.805 0.924 0.861 0.807 0.589 0.818 0.802 0.875 0.810
SVMc 0.865 0.756 0.738 0.992 0.981 0.865 0.716 0.258 0.100 0.998 0.789 0.549
RFc 0.917 0.837 0.942 0.892 0.967 0.917 0.836 0.617 0.725 0.887 0.893 0.806
LGBM 0.919 0.841 0.946 0.892 0.972 0.919 0.845 0.636 0.729 0.898 0.907 0.813

a The best performance values are indicated in bold and underlined. b Blue indicates equal values of ACC and BACC. c LR: logistic regression; KNN: k-nearest neighbors; SVM: support vector machine; LGBM: light gradient boosting machine; RF: random forest.

The following are the results from Table 2 of the original paper:

Model Feature Selection Method Dim 10-Fold Cross-Validation Independent Test
ACC MCC Sn Sp auROC BACC ACC MCC Sn Sp auROC BACC
LRc LGBM d 177 0.925 b 0.853 0.959 0.892 0.957 0.925 0.921 a 0.815 0.821 0.967 0.956 0.894
ANOVA d 102 0.882 0.764 0.896 0.867 0.863 0.882 0.899 0.768 0.857 0.918 0.930 0.888
MI d 136 0.888 0.777 0.913 0.942 0.888 0.888 0.733 0.750 0.951 0.864 0.850
KNN LGBM d 33 0.892 0.788 0.938 0.846 0.955 0.892 0.899 0.782 0.929 0.885 0.911 0.907
ANOVA d 15 0.873 0.748 0.896 0.851 0.934 0.873 0.865 0.703 0.857 0.869 0.907 0.863
MI d 58 0.888 0.783 0.954 0.822 0.927 0.888 0.888 0.773 0.964 0.852 0.931 0.908
SVM LGBM d 121 0.944 0.889 0.971 0.917 0.980 0.944 0.888 0.739 0.821 0.918 0.913 0.870
ANOVA d 48 0.925 0.854 0.967 0.884 0.977 0.925 0.865 0.678 0.679 0.951 0.906 0.815
MI d 16 0.919 0.841 0.959 0.800 0.968 0.919 0.880 0.735 0.786 0.934 0.921 0.860
RFc LGBM d 88 0.915 0.830 0.934 0.896 0.975 0.915 0.876 0.716 0.821 0.902 0.920 0.862
ANOVA d 118 0.898 0.797 0.913 0.884 0.961 0.898 0.865 0.694 0.821 0.885 0.911 0.853
MI d 8 0.902 0.806 0.921 0.884 0.952 0.902 0.888 0.753 0.893 0.885 0.923 0.889
LGBMc LGBM d 35 0.938 0.877 0.971 0.905 0.988 0.938 0.876 0.706 0.714 0.951 0.929 0.870
ANOVA d 19 0.902 0.807 0.942 0.863 0.945 0.902 0.888 0.739 0.821 0.918 0.912 0.833
MI d 18 0.888 0.777 0.917 0.859 0.953 0.888 0.865 0.682 0.750 0.918 0.916 0.834

a The best performance values are indicated in bold and underlined. b Blue indicates equal values of ACC and BACC. c LR: logistic regression; KNN: k-nearest neighbors; SVM: support vector machine; LGBM: light gradient boosting machine; RF: random forest. d LGBM: light gradient boosting machine; ANOVA: analysis of variance; MI: mutual information.

The following are the results from Table 3 of the original paper:

Classifier 10-Fold Cross-Validation Independent Test
ACC MCC Sn Sp auROC BACC ACC MCC Sn Sp auROC BACC
iUmami-DRLF(LR) 0.925 b 0.853 0.959 0.892 0.957 0.925 0.921 a 0.815 0.821 0.967 0.956 0.894
iUmami-DRLF(SVM) 0.944 0.889 0.971 0.917 0.980 0.944 0.888 0.739 0.821 0.918 0.913 0.870
iUP-BERT 0.940 0.881 0.963 0.917 0.971 0.940 0.899 0.770 0.893 0.902 0.933 0.897
UMPred-FRL 0.921 0.810 0.847 0.955 0.930 0.901 0.888 0.735 0.786 0.934 0.919 0.860
iUmami-SCM 0.935 0.864 0.947 0.930 0.945 0.939 0.865 0.679 0.714 0.934 0.898 0.824

a The best performance values are indicated in bold and underlined. b Blue indicates equal values of ACC and BACC.

6.3. Ablation Studies / Parameter Analysis

While the paper does not explicitly label sections as "ablation studies," the structured comparison of different components functions as such:

  • SMOTE vs. No SMOTE: This comparison (Figure 2) effectively acts as an ablation study for the data balancing strategy. It clearly shows that including SMOTE is critical for improving Sn and overall balanced performance, validating its inclusion in the methodology (SMOTE also appears in the pipeline sketch at the end of this section).

  • Different ML Models: Testing five different classifiers (KNN, LR, SVM, RF, LGBM) on the same SMOTE-balanced UniRep features (Table 1) allows for selection of the optimal base classifier. This demonstrates that LR is the most suitable choice for generalization performance despite other models sometimes performing better in specific metrics or cross-validation.

  • Feature Selection Methods (ANOVA, LGBM, MI) and Dimensionality: This part of the experiment (Table 2, Figure 4) is a comprehensive analysis of the feature space. By comparing models trained with features selected by different methods and exploring different dimensionalities (e.g., 177D, 102D, 136D), the authors determine that LGBM feature selection yielding 177 dimensions results in the best performing LR model. This validates the importance of dimensionality reduction and the choice of LGBM for feature selection.

  • Impact of UniRep Features: The entire premise of iUmami-DRLF is built on UniRep features. The comparison with iUmami-SCM and UMPred-FRL (which use traditional features) and iUP-BERT (which uses BERT features) implicitly validates the superior representational power of UniRep/mLSTM embeddings for this task.

    The GridSearchCV module was used to search the hyperparameters of each model during optimization, which is standard practice for finding the best model configuration.
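As a rough illustration of this workflow, the sketch below chains SMOTE, a logistic-regression classifier, and GridSearchCV in an imbalanced-learn pipeline, so that oversampling is applied only inside each training fold. The feature matrix, class ratio, and hyperparameter grid are placeholders and not the authors' settings.

```python
# Minimal sketch of a SMOTE + classifier + GridSearchCV workflow.
# Using an imblearn Pipeline keeps SMOTE inside each training fold, so the
# validation folds are never oversampled.
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

rng = np.random.default_rng(42)
X = rng.random((300, 177))                               # placeholder 177-D features
y = rng.choice([0, 1], size=300, p=[0.7, 0.3])           # placeholder imbalanced labels

pipe = Pipeline([
    ("smote", SMOTE(random_state=42)),                   # balance the minority class
    ("clf", LogisticRegression(max_iter=5000)),
])

param_grid = {"clf__C": [0.01, 0.1, 1, 10, 100]}         # example LR regularization grid
search = GridSearchCV(
    pipe, param_grid,
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=42),
    scoring="balanced_accuracy", n_jobs=-1,
)
search.fit(X, y)
print("best params:", search.best_params_)
print("best 10-fold BACC:", round(search.best_score_, 3))
```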

7. Conclusion & Reflections

7.1. Conclusion Summary

This research successfully introduced iUmami-DRLF, a novel predictor designed for the accurate identification of umami peptides solely based on their sequence information. The model's strength lies in its innovative integration of a multiplicative LSTM-based UniRep deep representation learning for feature extraction, coupled with SMOTE for handling imbalanced datasets and LGBM for optimal feature selection. The final iUmami-DRLF model, employing Logistic Regression with the top 177 UniRep features, demonstrated superior performance. It markedly outperformed existing state-of-the-art methods in independent testing across key metrics (ACC=0.921, MCC=0.815, Sn=0.821, Sp=0.967, auROC=0.956, BACC=0.894). Furthermore, iUmami-DRLF exhibited exceptional robustness and accuracy when validated against newly discovered umami peptide sequences, maintaining significant predictive power even at high probability thresholds where other predictors failed. A user-friendly web server was also developed, making this robust tool accessible to the research community.

7.2. Limitations & Future Work

The authors acknowledged several limitations and suggested directions for future work:

  • Computational Cost of Feature Extraction: The current UniRep feature extraction model requires significant computation, especially without a GPU configuration, leading to long processing times for large numbers of sequences. This suggests a practical bottleneck for high-throughput applications.
  • Data Updates: The model's performance could potentially be further improved by incorporating the most recent empirical (wet-lab) data during training. As more umami peptides are discovered, continuously updating the training dataset can enhance the model's learning.
  • Model Simplification via Distillation: A promising future direction is to use model distillation. This technique involves training a smaller, simpler "student" model to replicate the behavior of the larger, more complex UniRep "teacher" model. This could significantly reduce the computational complexity of the feature extraction step without a substantial loss in performance, addressing the first limitation.
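Since the paper only proposes distillation as future work, the following is a minimal sketch of what such a student could look like: a small regressor trained to reproduce precomputed 1900-D UniRep "teacher" embeddings from a cheap amino-acid-composition encoding. The encoding, model, and data are illustrative assumptions, not an implementation from the paper.

```python
# Minimal sketch of the distillation idea: a small "student" regressor learns to
# mimic precomputed 1900-D teacher embeddings from a cheap 20-D input encoding.
import numpy as np
from sklearn.neural_network import MLPRegressor

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(seq):
    """20-D amino-acid composition vector as a cheap student input encoding."""
    seq = seq.upper()
    return np.array([seq.count(a) / max(len(seq), 1) for a in AMINO_ACIDS])

# placeholder data: random peptides and random "teacher" vectors standing in for
# real UniRep embeddings that would be precomputed once on the training set
rng = np.random.default_rng(0)
peptides = ["".join(rng.choice(list(AMINO_ACIDS), size=10)) for _ in range(200)]
X_student = np.vstack([aa_composition(p) for p in peptides])
teacher_embeddings = rng.normal(size=(200, 1900))        # stand-in for UniRep output

student = MLPRegressor(hidden_layer_sizes=(256,), max_iter=500, random_state=0)
student.fit(X_student, teacher_embeddings)               # MSE regression onto the teacher
mse = np.mean((student.predict(X_student) - teacher_embeddings) ** 2)
print("distillation MSE on the toy data:", round(float(mse), 3))
```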

7.3. Personal Insights & Critique

This paper presents a rigorous and well-executed study that makes a significant contribution to the field of bioinformatics and food science.

  • Innovation: The core innovation lies in the exclusive and highly optimized use of multiplicative LSTM-based UniRep features. While BERT has gained popularity, demonstrating the superior performance and robustness of mLSTM embeddings (especially for peptide sequences, which are shorter and have a simpler 'alphabet' than natural language) is valuable. The systematic approach to data balancing (SMOTE) and feature selection (LGBM) further refines these high-quality embeddings, leading to tangible performance gains.
  • Robustness Evaluation: The evaluation of model performance across varying probability thresholds using the UMP-VERIFIED dataset is particularly insightful and highly practical. In real-world applications, a predictor that maintains reliability even at high confidence levels is invaluable. The stark contrast between iUmami-DRLF and other models at 95% and 99% thresholds is a powerful testament to its practical utility. This type of evaluation often gets overlooked but is critical for deployment.
  • Generalizability: The strong performance in independent tests and against a wet-experiment verified dataset suggests good generalizability, which is a common challenge for machine learning models in biological domains.
  • Applicability and Impact: The development of a user-friendly web server significantly enhances the paper's impact by making the tool readily available to other researchers and potentially to the food industry. This directly addresses the initial motivation of providing an efficient alternative to costly wet testing. The research can directly aid in designing healthier, more flavorful food products.

Potential Issues/Areas for Improvement:

  • Reliance on Pre-trained UniRep: The iUmami-DRLF model heavily relies on the pre-trained UniRep model. The quality and potential biases of the UniRef50 dataset used to train UniRep would inherently affect iUmami-DRLF's performance. While robust, its ultimate performance ceiling might be limited by the foundational UniRep model's understanding of peptide space.

  • "Small" Verified Dataset: While the UMP-VERIFIED dataset is crucial for robustness testing, its size (91 peptides) is relatively small. Expanding this dataset with more diverse, newly validated umami peptides could further strengthen future validation efforts and potentially allow for even more fine-tuned model training.

  • Black-box Nature of Deep Features: While UniRep features are powerful, they are black-box representations. Understanding why specific 177 dimensions are important or what biological properties they encode could provide deeper insights into umami peptide characteristics, moving beyond pure prediction to scientific discovery. The feature selection step helps, but the underlying interpretation of the UniRep dimensions remains complex.

    Overall, this paper provides a robust and practically valuable machine learning solution for identifying umami peptides, pushing the boundaries of deep representation learning in this specific domain. Its emphasis on robustness and generalization is a critical aspect that makes it stand out.
