IUP-BERT: Identification of Umami Peptides Based on BERT Features
TL;DR Summary
The study presents iUP-BERT, a novel umami peptide predictor using BERT for feature extraction. Combined with SMOTE and SVM, it significantly improves the efficiency and accuracy of umami peptide identification, outperforming existing methods, and provides an open-access web server.
Abstract
Umami is an important widely-used taste component of food seasoning. Umami peptides are specific structural peptides endowing foods with a favorable umami taste. Laboratory approaches used to identify umami peptides are time-consuming and labor-intensive, which are not feasible for rapid screening. Here, we developed a novel peptide sequence-based umami peptide predictor, namely iUP-BERT, which was based on the deep learning pretrained neural network feature extraction method. After optimization, a single deep representation learning feature encoding method (BERT: bidirectional encoder representations from transformer) in conjugation with the synthetic minority over-sampling technique (SMOTE) and support vector machine (SVM) methods was adopted for model creation to generate predicted probabilistic scores of potential umami peptides. Further extensive empirical experiments on cross-validation and an independent test showed that iUP-BERT outperformed the existing methods with improvements, highlighting its effectiveness and robustness. Finally, an open-access iUP-BERT web server was built. To our knowledge, this is the first efficient sequence-based umami predictor created based on a single deep-learning pretrained neural network feature extraction method. By predicting umami peptides, iUP-BERT can help in further research to improve the palatability of dietary supplements in the future.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of this paper is the identification of umami peptides using machine learning, specifically leveraging features extracted by the BERT deep learning model. The model is named iUP-BERT.
1.2. Authors
The authors are: Liangzhen Jiang, Jici Jiang, Xiao Wang, Yin Zhang, Bowen Zheng, Shuqi Liu, Yiting Zhang, Changying Liu, Yan Wan, Dabing Xiang, and Zhibin Lv. Their affiliations span several institutions, primarily in China:
- College of Food and Biological Engineering, Chengdu University, Chengdu, China (Liangzhen Jiang, Xiao Wang, Shuqi Liu, Changying Liu, Yan Wan, Dabing Xiang)
- Department of Computer Science, Peking University, Beijing, China (Jici Jiang, Bowen Zheng, Zhibin Lv)
- Key Laboratory of Comprehensive Utilization of Crops, Ministry of Agriculture and Rural Affairs, Chengdu, China (Liangzhen Jiang, Xiao Wang, Shuqi Liu, Changying Liu, Yan Wan, Dabing Xiang)
- College of Biology, Southwest Jiaotong University, Chengdu, China (Yiting Zhang)
- College of Biology, Georgia State University, Atlanta, GA, USA (Yiting Zhang)

The corresponding authors are Dabing Xiang and Zhibin Lv, with Zhibin Lv associated with Peking University. Their research backgrounds lie at the intersection of food science/biological engineering and computer science/bioinformatics, indicating an interdisciplinary approach to the problem.
1.3. Journal/Conference
The paper was published in Foods, a peer-reviewed open-access journal published by MDPI. Foods covers a wide range of topics related to food science, technology, and nutrition. Its reputation and influence are recognized in the food science and technology domain, making it a relevant venue for this research.
1.4. Publication Year
The paper was published on November 21, 2022.
1.5. Abstract
The paper addresses the challenge of identifying umami peptides, which are crucial for food seasoning, given that traditional laboratory methods are time-consuming and labor-intensive. To overcome this, the authors developed a novel peptide sequence-based predictor called iUP-BERT. This predictor is based on a deep learning pre-trained neural network feature extraction method, specifically Bidirectional Encoder Representations from Transformer (BERT). After optimization, the model utilized BERT features in conjunction with the Synthetic Minority Oversampling Technique (SMOTE) to handle data imbalance and Support Vector Machine (SVM) for classification. Extensive experiments, including cross-validation and an independent test, demonstrated that iUP-BERT significantly outperformed existing methods in effectiveness and robustness. The authors also built an open-access web server for iUP-BERT. They claim this is the first efficient sequence-based umami predictor created using a single deep-learning pretrained neural network feature extraction method. The ultimate goal is to use iUP-BERT to improve the palatability of dietary supplements.
1.6. Original Source Link
The original source link is /files/papers/6919ed1f110b75dcc59ae33c/paper.pdf. This link points to the PDF file of the paper, indicating it is an officially published work.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the inefficient and labor-intensive identification of umami peptides. Umami taste, known as the "fifth taste," is a highly desirable characteristic in food, contributing to deliciousness. Umami peptides are specific short linear peptides that impart this taste and also offer health benefits like reducing salt content and antioxidant activity.
Current laboratory approaches to identify and characterize umami peptides (e.g., RP-HPLC, MALDI-TOF-MS, LC-Q-TOF-MS) are:
- Time-consuming: they require significant time for experimental procedures.
- Labor-intensive: they demand substantial manual effort and skilled personnel.
- Not feasible for rapid screening: their high-throughput capability is limited, restricting the discovery of new umami peptides.
These limitations highlight a critical gap: the lack of accurate and efficient computational methods for rapid screening of umami peptides. Prior computational methods, while a step forward, also presented challenges:
- iUmami-SCM: relied on artificial feature extraction and only a single type of feature (propensity scores of amino acids and dipeptides), leading to insufficient sequence feature information and unsatisfactory performance.
- UMPred-FRL: improved upon iUmami-SCM by using multiple feature encodings (amino acid composition, dipeptide composition, etc.) and various machine learning algorithms. However, its overall prediction performance was still considered "not efficient enough", likely due to the continued reliance on manual feature extraction.

The paper's entry point and innovative idea is to leverage the power of deep learning, specifically the BERT model, for automatic and efficient feature extraction from peptide sequences. BERT, originally designed for natural language processing, has shown great success in capturing contextual information and has been successfully applied to other biological sequence prediction tasks. The authors hypothesize that BERT's ability to learn complex representations directly from raw sequences, without manual feature engineering, can overcome the limitations of previous methods and lead to more robust and accurate umami peptide prediction.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- Development of iUP-BERT: a novel machine learning-based predictor for umami peptides that utilizes a deep learning pretrained neural network (BERT) for feature extraction. This marks a significant shift from traditional manual feature engineering in this domain.
- First application of a single deep-learning pretrained neural network for umami peptide prediction: the authors highlight that iUP-BERT is the first efficient sequence-based umami predictor built upon a single deep representation learning feature (BERT), eliminating the need for complex, hand-crafted feature combinations.
- Demonstrated superior performance: through extensive empirical experiments (10-fold cross-validation and independent testing), iUP-BERT consistently and significantly outperformed existing state-of-the-art methods such as iUmami-SCM and UMPred-FRL across multiple evaluation metrics (Accuracy, MCC, Sensitivity, auROC, and Balanced Accuracy).
- Incorporation of SMOTE and LGBM feature selection: the study rigorously optimized the model by first applying SMOTE to address data imbalance, which proved critical for performance improvement. Subsequently, LGBM was used for feature selection, identifying an optimal feature space (139 dimensions) that further enhanced model robustness and accuracy.
- Deployment of an open-access web server: an iUP-BERT web server was built and made publicly available, facilitating rapid and high-throughput screening of umami peptides for researchers and the food industry.
- Practical implications: the findings suggest that iUP-BERT can be a powerful tool for exploring new umami peptides, contributing to the development of improved dietary supplements and advancing the food seasoning industry.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand the iUP-BERT paper, a grasp of several fundamental concepts in machine learning, deep learning, and bioinformatics is essential.
- Umami Peptides:
  - Conceptual Definition: Umami is recognized as the fifth basic taste, characterized by a savory, meaty, or broth-like flavor. Umami peptides are specific short linear amino acid sequences (peptides) that bind to taste receptors (primarily the T1R1/T1R3 receptor) on the tongue, eliciting the umami sensation. They typically have a low molecular weight (less than 5000 Da), with dipeptides and tripeptides being common. Beyond taste, they also possess various health benefits.
- Machine Learning (ML):
  - Conceptual Definition: Machine learning is a subset of artificial intelligence that enables systems to learn from data, identify patterns, and make decisions or predictions with minimal human intervention. Instead of being explicitly programmed, ML algorithms "learn" a model from training data, which can then be used to analyze new, unseen data.
- Deep Learning:
  - Conceptual Definition: Deep learning is a specialized branch of machine learning that uses artificial neural networks with multiple layers (hence "deep") to learn complex patterns from large datasets. Unlike traditional ML, deep learning models can automatically learn feature representations from raw input data, eliminating the need for manual feature engineering. This ability to "learn how to learn" is a key advantage.
- BERT (Bidirectional Encoder Representations from Transformer):
  - Conceptual Definition: BERT is a powerful pretrained deep learning model developed by Google, primarily for natural language processing (NLP) tasks. Its core innovation lies in its bidirectional learning of language context: unlike previous models that processed text directionally (left-to-right or right-to-left), BERT processes each word in relation to all other words in the sentence simultaneously, so the full context of a word is captured from both its left and right neighbors. It is pretrained on massive amounts of unlabeled text using two main tasks: the Masked Language Model (MLM) (predicting masked words) and Next Sentence Prediction (NSP). After pretraining, it can be fine-tuned for various downstream tasks with minimal architectural changes. In this paper, peptide sequences are treated as "language" for BERT to process.
  - Transformer Architecture: BERT's foundation is the Transformer architecture, specifically its encoder component. The Transformer model, introduced in "Attention Is All You Need", revolutionized sequence processing by relying entirely on self-attention mechanisms rather than recurrent or convolutional layers.
    - Self-Attention: self-attention allows the model to weigh the importance of different words (or amino acids, in the peptide context) in an input sequence when encoding a particular token. It computes a contextualized representation for each token by attending to all other tokens in the sequence. Query (Q), key (K), and value (V) matrices are computed from the input embeddings; the attention output is obtained by taking the dot product of Q and K, scaling it, applying a softmax function, and finally multiplying by V (a small numerical sketch is given after this concept list).
    - Multi-head Self-Attention: this mechanism performs the self-attention process multiple times in parallel with different linear projections of Q, K, and V. The outputs from these attention heads are then concatenated and linearly transformed, allowing the model to focus on different aspects of the input sequence simultaneously.
- SSA (Soft Symmetric Alignment):
  - Conceptual Definition: SSA is a method designed to compare arbitrary-length sequences within vector space. A peptide sequence is first encoded by an initial pretrained model (here, a three-tier stacked BiLSTM encoder) into an embedding matrix. BiLSTM (Bidirectional Long Short-Term Memory) is a type of recurrent neural network that processes sequences in both forward and backward directions, capturing dependencies from both past and future contexts. SSA then uses a soft-alignment mechanism to calculate the similarity between the embedded sequences.
- SMOTE (Synthetic Minority Oversampling Technique):
  - Conceptual Definition: SMOTE is an oversampling algorithm used to address the class imbalance problem in datasets. Class imbalance occurs when one class (the minority class) has significantly fewer samples than another (the majority class), which can bias machine learning models towards the majority class and degrade performance on the minority class. Rather than simply duplicating existing minority samples, SMOTE synthesizes new ones: it takes a minority sample, finds its k-nearest neighbors, and creates new samples along the line segments connecting the original sample to its neighbors.
- Machine Learning Algorithms (Classifiers):
  - Support Vector Machine (SVM):
    - Conceptual Definition: SVM is a powerful supervised learning model used for classification and regression tasks. For binary classification, its goal is to find an optimal hyperplane (decision boundary) that best separates data points of different classes in a high-dimensional space. The optimal hyperplane is the one with the largest margin (the distance between the hyperplane and the nearest data points of each class); maximizing this margin improves the generalization ability of the model.
  - K-Nearest Neighbor (KNN):
    - Conceptual Definition: KNN is a non-parametric, instance-based learning algorithm used for classification and regression. In classification, a new data point is assigned the majority class among its nearest neighbors in the training data, where "nearest" is typically determined by a distance metric such as Euclidean distance. It is a "lazy learner" because it builds no model during training and simply memorizes the training data.
  - Logistic Regression (LR):
    - Conceptual Definition: LR is a statistical model primarily used for binary classification. Despite its name, it is a classification algorithm rather than a regression method in the traditional sense: it models the probability that a given input belongs to a particular class by applying a sigmoid (logistic) function to the output of a linear equation, squashing the output to the range 0 to 1, which can be interpreted as a probability.
  - Random Forest (RF):
    - Conceptual Definition: RF is an ensemble learning method that combines predictions from multiple models to improve accuracy and robustness. It constructs many decision trees during training; for classification, it outputs the class that is the mode (majority vote) of the classes predicted by the individual trees. Each tree is trained on a random bootstrap sample of the data and considers only a random subset of features at each split, which helps reduce overfitting.
  - Light Gradient Boosting Machine (LGBM):
    - Conceptual Definition: LGBM is a gradient boosting framework that uses tree-based learning algorithms and is known for its high speed and efficiency. Unlike boosting algorithms that grow trees level-wise, LGBM grows trees leaf-wise, choosing the leaf with the largest loss reduction to split, which can lead to faster training and better accuracy. It also employs a histogram-based algorithm to discretize continuous features, further speeding up training. In this paper, LGBM is also used for feature selection.
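To make the scaled dot-product self-attention described above concrete, here is a minimal single-head NumPy sketch; the toy dimensions and random weight matrices are illustrative assumptions, not values from BERT or the paper.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X: np.ndarray, Wq: np.ndarray, Wk: np.ndarray, Wv: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product self-attention over a token sequence X of shape (L, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (L, L) attention weights over all positions
    return weights @ V                         # contextualized representation of each token

rng = np.random.default_rng(0)
L, d_model, d_k = 6, 16, 8                     # e.g., a 6-residue peptide with toy embedding sizes
X = rng.normal(size=(L, d_model))              # token + positional embeddings (placeholder)
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)     # (6, 8)
```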
3.2. Previous Works
The paper contextualizes iUP-BERT by comparing it against two notable prior computational methods for umami peptide prediction:
- iUmami-SCM (Scoring Card Method):
  - Summary: this was the first sequence-based umami peptide predictor. It analyzes and predicts umami sensory peptides solely from the information in the primary peptide sequence, without requiring advanced structural data, by conjugating estimated propensity scores of amino acids and dipeptides with the scoring card method (SCM).
  - Limitations:
    - Artificial feature extraction: it relied on manually designed, artificial feature extraction methods, which might not fully capture the complex patterns within peptide sequences.
    - Single feature type: only a single type of feature was used as input for the machine learning models, limiting the amount of sequence feature information the model could learn from.
    - Performance: achieved a sensitivity (Sn) of 0.714, balanced accuracy (BACC) of 0.824, and Matthew's correlation coefficient (MCC) of 0.679. These metrics were a starting point but were considered "not very satisfactory".
- UMPred-FRL (Feature Representation Learning):
  - Summary: UMPred-FRL was a later meta-predictor that aimed to improve upon iUmami-SCM. It was based on a feature representation learning approach (though still with primarily manual/engineered feature types), combining seven different feature encodings (including amino acid composition, dipeptide composition, composition transition-distribution, amphiphilic pseudo-amino acid composition, and pseudo-amino acid composition) with six well-known ML algorithms (KNN, Extremely Randomized Trees, Partial Least Squares, Random Forest, Logistic Regression, and SVM).
  - Limitations:
    - Inefficient manual feature extraction: despite using multiple feature types, the underlying feature extraction remained largely manual or engineered; the paper argues this is why its overall prediction performance was still "not efficient enough".
    - Performance: achieved an accuracy (ACC) of 0.888, MCC of 0.735, Sn of 0.786, and BACC of 0.860 on its benchmark dataset. While better than iUmami-SCM, the authors sought further improvements.
3.3. Technological Evolution
The field of bioinformatics, particularly peptide function prediction, has seen a significant evolution in methodology:
- Early Experimental Methods: initially, identifying umami peptides relied solely on laborious and costly laboratory techniques such as RP-HPLC and MS-based analyses. These methods are definitive but severely limit throughput.
- Rule-Based/Propensity-Based Methods: the first generation of computational tools, such as iUmami-SCM, used statistical properties or "propensity scores" of amino acids and dipeptides to infer umami taste. These were essentially rule-based, feature-engineering-driven approaches without complex learning algorithms.
- Traditional Machine Learning with Hand-crafted Features: the next step applied more sophisticated ML algorithms (SVM, RF, KNN, etc.) to a broader set of hand-crafted or physicochemical features extracted from peptide sequences. UMPred-FRL represents this era, combining various engineered features with multiple ML models to improve prediction accuracy. While more powerful, this approach still required domain experts to design effective features, which can be a bottleneck and may not capture all relevant sequence information.
- Deep Learning for Automatic Feature Extraction: the current paper, iUP-BERT, represents the latest evolution by moving towards deep learning for automatic feature extraction. By adopting BERT, a model pretrained on vast amounts of sequential data (language in its original context, now applied to peptides), the reliance on manual feature engineering is drastically reduced or eliminated. Deep learning models can learn intricate, high-level representations directly from raw peptide sequences, potentially capturing subtle and complex patterns that influence umami taste. This shifts the focus from "what features to extract" to "how to train an effective deep learning model". iUP-BERT fits into this timeline as a cutting-edge application of deep learning, demonstrating its potential to surpass traditional ML methods that rely on pre-defined features.
3.4. Differentiation Analysis
Compared to the main methods in related work, iUP-BERT presents several core differences and innovations:
- Automatic Deep Representation Learning Feature Extraction:
  - Differentiation: the most significant innovation is the use of BERT, a single deep learning pretrained neural network, for automatic feature extraction. This contrasts sharply with iUmami-SCM (which used artificial, single-type features based on propensity scores) and UMPred-FRL (which relied on a combination of seven manually engineered feature encodings such as amino acid composition and dipeptide composition).
  - Innovation: BERT can learn deep bidirectional language representations directly from raw peptide sequences, implicitly capturing complex contextual relationships between amino acids that manual feature engineering struggles to achieve. The paper emphasizes that it is the first to use a single deep-learning pretrained neural network for this task, simplifying the feature engineering pipeline.
- Leveraging Contextual Information:
  - Differentiation: traditional methods often treat amino acids or short k-mers in isolation or with limited context. BERT's Transformer architecture and multi-head self-attention mechanism give it a global receptive field, so it can effectively capture global context information across the entire peptide sequence.
  - Innovation: this comprehensive understanding of context allows iUP-BERT to generate more meaningful and informative feature descriptors than models limited to local windows or predefined feature types.
- Robustness through Data Balancing and Feature Selection:
  - Differentiation: while prior methods focused on feature diversity or specific ML algorithms, iUP-BERT explicitly integrates SMOTE to address the common problem of data imbalance in biological datasets and LGBM for feature selection.
  - Innovation: these steps are crucial for model robustness and generalization. SMOTE prevents bias towards the majority class, and LGBM feature selection removes redundant information, mitigating overfitting and improving efficiency; these aspects were not explicitly highlighted as core strengths in previous works.
- Overall Performance Improvement:
  - Differentiation: iUP-BERT consistently and markedly outperforms iUmami-SCM and UMPred-FRL across multiple key metrics in both cross-validation and independent tests.
  - Innovation: this empirical superiority validates the effectiveness of the deep learning-based feature extraction approach for umami peptide prediction.

In essence, iUP-BERT differentiates itself by replacing laborious and potentially suboptimal manual feature engineering with an intelligent, data-driven, and context-aware feature extraction process powered by BERT, further refined by robust data balancing (SMOTE) and feature selection (LGBM) steps.
4. Methodology
4.1. Principles
The core idea of the iUP-BERT method is to leverage the powerful feature extraction capabilities of a deep learning pre-trained model, BERT, to automatically derive rich, contextual representations from raw peptide sequences. These representations are then used as input for traditional machine learning (ML) classifiers to predict whether a peptide is umami. The theoretical basis is that BERT, originally designed for natural language, can effectively learn the "language" of peptides (amino acid sequences) and capture subtle patterns critical for umami taste, without requiring manual feature engineering. The intuition is that complex biological sequences, much like human language, contain deep semantic and contextual information that can be uncovered by advanced neural networks.
The overall framework of iUP-BERT involves a systematic pipeline to optimize prediction performance:
- Deep Representation Learning for Feature Extraction: using BERT (and SSA for comparison) to transform peptide sequences into high-dimensional feature vectors.
- Addressing Data Imbalance: employing SMOTE to synthesize minority-class samples and achieve a balanced training dataset.
- Feature Space Optimization: applying LGBM feature selection to reduce dimensionality and remove redundant features, enhancing model generalization.
- Machine Learning Classification: training and evaluating various ML algorithms (KNN, LR, SVM, RF, LGBM) on the processed features.
- Model Optimization and Selection: identifying the best combination of feature extraction, data balancing, feature selection, and classification algorithm based on performance metrics.
4.2. Core Methodology In-depth (Layer by Layer)
The development of iUP-BERT follows a six-step process, as depicted in Figure 1.
The following figure (Figure 1 from the original paper) illustrates the overall framework of iUP-BERT development:
The figure is a schematic diagram showing the six main steps of iUP-BERT model development: textual input of peptide sequences, feature-vector generation with the BERT model and the SSA method (and their fusion), handling of data imbalance with SMOTE, feature selection, combination with multiple machine learning algorithms, and establishment of the final optimized model. The relationships among the steps are presented as a flowchart.
4.2.1. Step 1: Peptide Sequence Input and Feature Extraction
Upon receiving a peptide sequence as input (textual amino acid string), two deep representation learning feature extraction methods are primarily used: the pretrained SSA sequence embedding model and the pretrained BERT sequence embedding model. These models transform the raw peptide sequence into numerical feature vectors.
4.2.1.1. Pretrained SSA Embedding Model
- Principle: SSA (Soft Symmetric Alignment) defines a novel way to compare arbitrary-length sequences by first converting them into vector representations. It leverages a BiLSTM encoder to capture sequential information.
- Process: an initial pretrained model encodes each peptide sequence. This encoder is a three-tier stacked BiLSTM (Bidirectional Long Short-Term Memory) network, which processes the sequence in both the forward and backward directions to capture long-range dependencies. The output of the BiLSTM encoder is then passed through a linear layer, which transforms the internal representations into a final embedding matrix.
- Output: each peptide sequence yields a final embedding matrix $\mathrm{P} \in \mathbb{R}^{L \times 121}$, where $L$ is the length of the peptide (number of amino acids) and 121 is the dimension of the vector representation at each amino acid position.
- Similarity Calculation (SSA Mechanism): the SSA mechanism is used to calculate the similarity between two amino acid sequences based on their embedded vectors.
  - Consider two embedded matrices (vector representations of sequences):
$
\mathrm{P}_1 = [\alpha_1, \alpha_2, \cdots, \alpha_{L_1}], \quad \mathrm{P}_2 = [\beta_1, \beta_2, \cdots, \beta_{L_2}]
$
    Where:
    - $\mathrm{P}_1$ and $\mathrm{P}_2$ are the embedding matrices of the two distinct peptide sequences.
    - $L_1$ and $L_2$ are the lengths of the respective peptide sequences.
    - $\alpha_i$ and $\beta_j$ are the 121-dimensional vector embeddings of the $i$-th amino acid of $\mathrm{P}_1$ and the $j$-th amino acid of $\mathrm{P}_2$, respectively.
  - The similarity between the two sequences is calculated as:
$
\hat{\omega} = -\frac{1}{W} \sum_{i=1}^{L_1} \sum_{j=1}^{L_2} \tau_{ij} \, |\alpha_i - \beta_j|_1
$
    Where:
    - $\hat{\omega}$ is the calculated similarity score.
    - $W$ is a normalization factor, defined below.
    - $\tau_{ij}$ is a weighting coefficient reflecting the alignment contribution between $\alpha_i$ and $\beta_j$.
    - $|\alpha_i - \beta_j|_1$ is the L1-norm (Manhattan distance) between the vectors $\alpha_i$ and $\beta_j$, measuring their dissimilarity.
  - The weighting coefficient $\tau_{ij}$ is calculated using the following formulas (Equations 4-7 from the paper):
$
\rho_{ij} = \frac{\exp\left(-|\alpha_i - \beta_j|_1\right)}{\sum_{k=1}^{L_1} \exp\left(-|\alpha_k - \beta_j|_1\right)}
$
$
\sigma_{ij} = \frac{\exp\left(-|\alpha_i - \beta_j|_1\right)}{\sum_{k=1}^{L_2} \exp\left(-|\alpha_i - \beta_k|_1\right)}
$
$
\tau_{ij} = \rho_{ij} + \sigma_{ij} - \rho_{ij}\,\sigma_{ij}
$
$
W = \sum_{i=1}^{L_1} \sum_{j=1}^{L_2} \tau_{ij}
$
    Note: the symbols in the paper's rendering of the last two equations are garbled; the forms above follow the standard soft symmetric alignment formulation, in which the two softmax-normalized alignment weights are combined.
    - $\rho_{ij}$: the normalized similarity (soft alignment probability) of $\beta_j$ with respect to all $\alpha_k$ in $\mathrm{P}_1$; it measures how well $\beta_j$ aligns to $\alpha_i$ when all possible alignments from $\mathrm{P}_1$ are considered.
    - $\sigma_{ij}$: the normalized similarity of $\alpha_i$ with respect to all $\beta_k$ in $\mathrm{P}_2$; it measures how well $\alpha_i$ aligns to $\beta_j$ when all possible alignments from $\mathrm{P}_2$ are considered.
    - $\tau_{ij}$: a fuzzy-OR combination of the two alignment probabilities that emphasizes strong alignments from either direction.
    - $W$: the sum of all $\tau_{ij}$ values, serving as the normalization constant for the overall similarity $\hat{\omega}$.
- Final Embedding: after these calculations, an average pooling procedure is applied over the sequence dimension of the embedding matrix to obtain a fixed-size 121-dimensional (121D) feature vector for each peptide, regardless of its length. This fixed-size vector is suitable as input for traditional machine learning models.
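To make the reconstructed equations concrete, here is a minimal NumPy sketch of the SSA similarity and the average-pooled 121D feature vector; the array shapes, variable names, and random inputs are illustrative assumptions, not the authors' code.

```python
import numpy as np

def ssa_similarity(P1: np.ndarray, P2: np.ndarray) -> float:
    """Soft symmetric alignment similarity between two embedded peptides.

    P1: (L1, d) and P2: (L2, d) embedding matrices (d = 121 in this paper),
    following the equations reconstructed above.
    """
    # Pairwise L1 distances |alpha_i - beta_j|_1, shape (L1, L2)
    dist = np.abs(P1[:, None, :] - P2[None, :, :]).sum(axis=-1)
    sim = np.exp(-dist)
    rho = sim / sim.sum(axis=0, keepdims=True)    # normalized over positions of P1
    sigma = sim / sim.sum(axis=1, keepdims=True)  # normalized over positions of P2
    tau = rho + sigma - rho * sigma               # fuzzy-OR combination of alignment weights
    W = tau.sum()
    return float(-(tau * dist).sum() / W)

def pooled_feature(P: np.ndarray) -> np.ndarray:
    """Average pooling over sequence length -> fixed-size 121D feature vector."""
    return P.mean(axis=0)

rng = np.random.default_rng(0)
emb_a = rng.normal(size=(8, 121))    # placeholder embedding of an 8-residue peptide
emb_b = rng.normal(size=(11, 121))   # placeholder embedding of an 11-residue peptide
print(ssa_similarity(emb_a, emb_b), pooled_feature(emb_a).shape)
```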
4.2.1.2. Pretrained BERT Embedding Model
- Principle: BERT excels at generating deep bidirectional language representations by understanding the full context of tokens (amino acids) in a sequence, eliminating the need for manual feature engineering.
- Process:
  - Tokenization: peptide sequences are input directly into the BERT model. They are first converted into token representations, often using k-mers (contiguous subsequences of k amino acids) as tokens, similar to how words or subword units are handled in natural language.
  - Positional Embedding: to incorporate information about the order of amino acids, positional embeddings are added to the token representations, allowing BERT to distinguish amino acids at different positions in the sequence.
  - Transformer Encoder Layers: the combined token and positional embeddings are then passed through the core of BERT, a stack of Transformer encoder layers; the model used here has 12 such layers.
    - Each Transformer encoder layer contains a multi-head self-attention mechanism, which lets each amino acid token attend to all other tokens in the sequence, capturing semantic and contextual relationships across the entire peptide.
    - After multi-head self-attention, the output passes through feed-forward neural networks and linear transformations.
  - Pretraining Tasks: the BERT model is initially pretrained on a large, unlabeled dataset using tasks such as the masked language model (predicting masked amino acids) to learn robust sequence representations; the cross-entropy loss function is used for backpropagation during pretraining.
- Output: the BERT-trained model produces a 768-dimensional (768D) feature vector for each peptide sequence, which encapsulates its contextual information.
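As a small illustration of the tokenization idea (not the paper's exact tokenizer, whose k value and special tokens are not specified in this section), a peptide can be split into overlapping k-mer tokens before embedding:

```python
def peptide_to_kmers(sequence: str, k: int = 3) -> list[str]:
    """Split a peptide into overlapping k-mer tokens (k = 3 is an assumed value)."""
    sequence = sequence.upper()
    if len(sequence) < k:
        return [sequence]
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# "[CLS]" and "[SEP]" are BERT-style special tokens, added here purely for illustration.
tokens = ["[CLS]"] + peptide_to_kmers("EAGIQ") + ["[SEP]"]
print(tokens)  # ['[CLS]', 'EAG', 'AGI', 'GIQ', '[SEP]']
```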
4.2.2. Step 2: Feature Fusion
To explore whether combining different types of deep features could yield better performance, the 121D SSA eigenvector was concatenated (combined) with the 768D BERT eigenvector. This resulted in an 889-dimensional (889D) SSA + BERT fusion feature vector. This fusion aims to combine the local and global contextual information captured by both methods. For comparison, individual feature vectors (SSA alone and BERT alone) were also evaluated.
4.2.3. Step 3: Synthetic Minority Oversampling Technique (SMOTE)
- Principle: datasets often suffer from class imbalance, where the number of samples in one class (e.g., umami peptides) is much smaller than in the other (e.g., non-umami peptides). This can bias ML models towards the majority class. SMOTE is used to mitigate this.
- Process: SMOTE analyzes the minority-class samples. For each minority sample, it identifies its k-nearest neighbors, randomly selects samples from these neighbors, and generates new synthetic minority samples by random linear interpolation between the original sample and its chosen neighbors. These artificially simulated samples are added to the dataset until the class distribution meets the specified balance requirement, effectively increasing the representation of the minority class without simply copying existing samples and thus reducing the risk of overfitting.
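A minimal sketch of this balancing step, assuming the imbalanced-learn implementation of SMOTE (the paper does not name a specific library); the feature matrix below is a random placeholder standing in for the peptide feature vectors.

```python
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(42)
X = rng.normal(size=(353, 768))                 # e.g., 768D BERT features for the training set
y = np.array([1] * 112 + [0] * 241)             # 112 umami vs. 241 non-umami peptides

# Oversample the minority (umami) class by interpolating between nearest neighbors.
smote = SMOTE(k_neighbors=5, random_state=42)   # k_neighbors=5 is the library default, assumed here
X_res, y_res = smote.fit_resample(X, y)
print(X_res.shape, np.bincount(y_res))          # both classes now have 241 samples
```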
4.2.4. Step 4: Feature Space Optimization (LGBM Feature Selection)
- Principle: high-dimensional feature vectors (such as the 768D BERT features or the 889D fusion features) can contain redundant or less informative features. This information redundancy can lead to model overfitting and increased computational cost. Feature selection aims to identify and retain only the most discriminative features.
- Process: the LGBM (Light Gradient Boosting Machine) feature selection method was employed. Because it is tree-based, LGBM naturally provides feature importance scores during training, indicating how much each feature contributes to the model's predictive power. By ranking features by importance and selecting a subset, the dimensionality of the feature space can be reduced while preserving or even enhancing predictive performance. This step helps identify the best feature combinations and creates an optimized feature space.
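A sketch of importance-based selection with LightGBM via scikit-learn's SelectFromModel; the "mean importance" threshold and hyperparameters are assumptions for illustration, since this section does not state the exact selection criterion.

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.feature_selection import SelectFromModel

# X_res, y_res: SMOTE-balanced training features/labels from the previous step (placeholders here).
rng = np.random.default_rng(0)
X_res = rng.normal(size=(482, 768))
y_res = rng.integers(0, 2, size=482)

# Rank features by LGBM importance and keep those above the mean importance (assumed criterion).
selector = SelectFromModel(
    LGBMClassifier(n_estimators=200, random_state=0),
    threshold="mean",
)
X_sel = selector.fit_transform(X_res, y_res)
print(X_sel.shape)  # reduced feature space; the number of kept columns depends on the importances
```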
4.2.5. Step 5: Machine Learning Methods
Five different ML algorithms were combined with the extracted and processed features to build classification models:
- K-Nearest Neighbor (KNN):
  - Mechanism: for a new, unseen peptide, KNN identifies the training samples (umami or non-umami peptides) that are closest to it in the feature space. The new peptide is then assigned the class label most common among its nearest neighbors.
- Logistic Regression (LR):
  - Mechanism: LR models the probability of a peptide belonging to the umami class. It applies a sigmoid function to a linear combination of the input features to map the output to a probability between 0 and 1; if this probability exceeds a threshold (e.g., 0.5), the peptide is classified as umami, otherwise as non-umami.
- Support Vector Machine (SVM):
  - Mechanism: SVM aims to find the optimal hyperplane that maximally separates the umami and non-umami peptide samples in the high-dimensional feature space. It focuses on the support vectors (the data points closest to the hyperplane) to define this boundary, which helps produce robust classification.
- Random Forest (RF):
  - Mechanism: RF is an ensemble of decision trees, each trained on a bootstrap sample of the data and a random subset of features. For a new peptide, each tree makes a prediction, and the final classification is determined by a majority vote over all individual trees.
- Light Gradient Boosting Machine (LGBM):
  - Mechanism: LGBM builds an ensemble of decision trees sequentially, with each new tree correcting the errors of the previous ones. It uses a leaf-wise tree growth strategy and a histogram-based algorithm for efficiency; the final prediction is a weighted combination of the predictions from all individual trees.
4.2.6. Step 6: Final iUP-BERT Predictor Establishment
After extensive experimentation and optimization across different combinations of feature extraction methods (SSA, BERT, fusion), data balancing (SMOTE), feature selection (LGBM), and machine learning algorithms (KNN, LR, SVM, RF, LGBM), the BERT feature extraction method combined with the SVM classification model and SMOTE (and after LGBM feature selection) was selected as the optimal combination. This optimized configuration forms the final iUP-BERT predictor.
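Putting the steps together, a minimal sketch of the final configuration (features, then SMOTE, then LGBM-based selection, then an SVM with probabilistic output) using scikit-learn and imbalanced-learn; the hyperparameters and the placeholder feature arrays are assumptions, not the authors' tuned settings.

```python
import numpy as np
from imblearn.pipeline import Pipeline          # pipeline that applies SMOTE only during fitting
from imblearn.over_sampling import SMOTE
from lightgbm import LGBMClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X_train, y_train = rng.normal(size=(353, 768)), np.array([1] * 112 + [0] * 241)
X_test = rng.normal(size=(89, 768))             # independent-test features (placeholder)

model = Pipeline([
    ("smote", SMOTE(random_state=1)),
    ("select", SelectFromModel(LGBMClassifier(n_estimators=200, random_state=1), threshold="mean")),
    ("svm", SVC(kernel="rbf", probability=True, random_state=1)),  # probabilistic umami scores
])
model.fit(X_train, y_train)
umami_scores = model.predict_proba(X_test)[:, 1]
print(umami_scores[:5])
```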
5. Experimental Setup
5.1. Datasets
For a fair comparison with previous umami peptide ML models, the paper used the same peptide datasets as in UMPred-FRL [24]. These datasets are provided in Supplementary File S1.
- Positive Samples (Umami Peptides):
- Number: 140 unique peptides.
- Source: experimentally validated umami peptides collected from various studies [10,15,16,20] and from the BIOPEP-UWM database [40].
- Negative Samples (Non-Umami Peptides):
- Number: 302 unique peptides.
- Source: Identified bitter peptides [41,42], which are distinct from umami peptides.
- Dataset Split: the complete dataset was divided into a training set and an independent test set.
  - Training dataset: 112 umami peptides (positive) and 241 non-umami peptides (negative).
  - Independent test dataset: 28 umami peptides (positive) and 61 non-umami peptides (negative).

The choice of these datasets ensures comparability with existing methods and allows for robust validation, using peptides identified through experimental means or reputable databases. The use of bitter peptides as negative samples is a common strategy in taste peptide prediction, as they represent a distinct taste category.
5.2. Evaluation Metrics
The performance of the models was evaluated using six widely accepted binary classification metrics. To provide a comprehensive understanding, these metrics are defined below, along with their mathematical formulas and explanations of symbols.
- Key Terminology:
  - TP (True Positive): the number of umami peptides (positive samples) correctly identified as umami.
  - TN (True Negative): the number of non-umami peptides (negative samples) correctly identified as non-umami.
  - FP (False Positive): the number of non-umami peptides incorrectly identified as umami.
  - FN (False Negative): the number of umami peptides incorrectly identified as non-umami.
- Accuracy (ACC)
  - Conceptual Definition: ACC measures the overall proportion of correctly classified instances (both true positives and true negatives) among all instances in the dataset. It provides a general sense of how well the model performs.
  - Mathematical Formula: $ \mathrm{ACC} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}} $
  - Symbol Explanation: $\mathrm{TP}$, $\mathrm{TN}$, $\mathrm{FP}$, and $\mathrm{FN}$ are the true positives, true negatives, false positives, and false negatives defined above.
- Matthew's Correlation Coefficient (MCC)
  - Conceptual Definition: MCC is a robust, balanced metric that is particularly useful for evaluating binary classification on imbalanced datasets. It considers all four quadrants of the confusion matrix (TP, TN, FP, FN) and yields a value between -1 and +1: +1 indicates a perfect prediction, 0 a random prediction, and -1 a perfectly inverse prediction. It is often preferred over accuracy for imbalanced data because it accounts for true and false positives and negatives proportionally.
  - Mathematical Formula: $ \mathrm{MCC} = \frac{\mathrm{TP} \times \mathrm{TN} - \mathrm{FP} \times \mathrm{FN}}{\sqrt{(\mathrm{TP}+\mathrm{FP})(\mathrm{TP}+\mathrm{FN})(\mathrm{TN}+\mathrm{FP})(\mathrm{TN}+\mathrm{FN})}} $
  - Symbol Explanation: $\mathrm{TP}$, $\mathrm{TN}$, $\mathrm{FP}$, and $\mathrm{FN}$ as defined above.
- Sensitivity (Sn), also known as Recall or True Positive Rate (TPR)
  - Conceptual Definition: Sn measures the proportion of actual positive instances (umami peptides) that were correctly identified by the model. It indicates the model's ability to avoid false negatives.
  - Mathematical Formula: $ \mathrm{Sn} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} $
  - Symbol Explanation: $\mathrm{TP}$: true positives; $\mathrm{FN}$: false negatives.
- Specificity (Sp), also known as True Negative Rate (TNR)
  - Conceptual Definition: Sp measures the proportion of actual negative instances (non-umami peptides) that were correctly identified by the model. It indicates the model's ability to avoid false positives.
  - Mathematical Formula: $ \mathrm{Sp} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}} $
  - Symbol Explanation: $\mathrm{TN}$: true negatives; $\mathrm{FP}$: false positives.
- Balanced Accuracy (BACC)
  - Conceptual Definition: BACC is the average of sensitivity (true positive rate) and specificity (true negative rate). It is particularly useful for imbalanced datasets because it provides a more balanced assessment than raw accuracy, which can be misleading when one class significantly outweighs the other.
  - Mathematical Formula: $ \mathrm{BACC} = \frac{\mathrm{Sn} + \mathrm{Sp}}{2} $
  - Symbol Explanation: $\mathrm{Sn}$: sensitivity; $\mathrm{Sp}$: specificity.
- Area Under the Receiver Operating Characteristic Curve (auROC)
  - Conceptual Definition: the ROC curve is a graphical plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied; it plots the True Positive Rate (Sensitivity) against the False Positive Rate (1 - Specificity). The auROC quantifies the entire 2D area under the ROC curve. A higher auROC (closer to 1) indicates better overall ability to distinguish positive from negative classes across all thresholds, while 0.5 corresponds to a random classifier.
  - Mathematical Formula (the paper does not give an explicit formula; the standard definition is): $ \mathrm{auROC} = \int_{0}^{1} \mathrm{TPR}(\mathrm{FPR}) \, d\mathrm{FPR} $
  - Symbol Explanation: $\mathrm{TPR}$: true positive rate (sensitivity); $\mathrm{FPR}$: false positive rate, $\mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}}$ (equal to 1 - specificity).
- Model Evaluation Methods:
  - K-fold Cross-Validation: the 10-fold cross-validation method was used for model training and validation on the training set. The training set is randomly divided into 10 equal folds; in each iteration, 9 folds are used for training and the remaining fold for validation. This process is repeated 10 times, with each fold serving as the validation set exactly once, and the 10 validation scores are averaged, providing a more robust estimate of model performance than a single train/test split.
  - Independent Testing: the trained model is evaluated on a completely separate dataset that was not used during any part of the training or cross-validation process. This provides an unbiased assessment of the model's generalization ability to new, unseen data. A good model must perform well on both cross-validation and independent testing. (A short sketch of computing the metrics under cross-validation follows below.)
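To make the metric definitions and the 10-fold protocol concrete, a minimal scikit-learn sketch that computes ACC, MCC, Sn, Sp, BACC, and auROC under stratified 10-fold cross-validation; the classifier and the random placeholder data are assumptions for illustration, not the paper's model.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, matthews_corrcoef, recall_score,
                             balanced_accuracy_score, roc_auc_score, confusion_matrix)
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

rng = np.random.default_rng(7)
X, y = rng.normal(size=(353, 50)), rng.integers(0, 2, size=353)  # placeholder features/labels

scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=10, shuffle=True, random_state=7).split(X, y):
    clf = SVC(probability=True).fit(X[train_idx], y[train_idx])
    y_pred = clf.predict(X[val_idx])
    y_prob = clf.predict_proba(X[val_idx])[:, 1]
    tn, fp, fn, tp = confusion_matrix(y[val_idx], y_pred).ravel()
    scores.append({
        "ACC": accuracy_score(y[val_idx], y_pred),
        "MCC": matthews_corrcoef(y[val_idx], y_pred),
        "Sn": recall_score(y[val_idx], y_pred),          # TP / (TP + FN)
        "Sp": tn / (tn + fp),                            # TN / (TN + FP)
        "BACC": balanced_accuracy_score(y[val_idx], y_pred),
        "auROC": roc_auc_score(y[val_idx], y_prob),
    })
print({k: np.mean([s[k] for s in scores]) for k in scores[0]})   # averaged 10-fold metrics
```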
5.3. Baselines
The iUP-BERT model's performance was compared against two existing methods for umami peptide prediction:
- iUmami-SCM:
  - Description: this was the first sequence-based umami peptide predictor, developed by Ha et al. in 2020. It utilized the scoring card method (SCM) in conjunction with estimated propensity scores of amino acids and dipeptides.
  - Representativeness: it serves as a foundational baseline, representing early computational efforts based on artificial feature extraction.
- UMPred-FRL:
  - Description: a more recent ML-based meta-predictor for umami peptides, created by Charoenkwan et al. in 2021. It employed a feature representation learning approach (though still based on engineered features) and combined seven different feature encodings with six well-known ML algorithms.
  - Representativeness: it represents the state of the art among methods that combine diverse, manually engineered features with traditional ML algorithms, prior to the use of deep learning for this specific task.

These baselines are representative because they showcase the progression of computational approaches to umami peptide prediction, from simpler feature engineering to more complex combinations of engineered features and ML algorithms. By outperforming them, iUP-BERT demonstrates the advance brought by deep learning feature extraction.
6. Results & Analysis
6.1. Preliminary Performance of Models Trained with or without SMOTE
The first step in the experimental analysis was to evaluate the impact of the Synthetic Minority Oversampling Technique (SMOTE) on model performance, especially given the imbalanced nature of the dataset (112 umami vs. 241 non-umami peptides in the training set). This was done by comparing models built with and without SMOTE using two deep representation learning features (SSA and BERT) and five ML algorithms (KNN, LR, SVM, RF, and LGBM). The evaluation was performed using repeated stratified 10-fold cross-validation tests (10 times).
The following figure (Figure 2 from the original paper) shows the performance metrics of SSA and BERT features using different algorithms, both pretrained with and without SMOTE, in 10-fold cross-validation.

Analysis of SMOTE's Effect (Figure 2 and Table 1):
- Overall Improvement: in the 10-fold cross-validation results, all five ML algorithms showed improved ACC, MCC, Sn, auROC, and BACC when SMOTE was applied, for both SSA and BERT features, highlighting SMOTE's effectiveness in addressing data imbalance.
- Specificity (Sp) Exception: Sp was the only metric that sometimes decreased slightly with SMOTE. For instance, the best Sp for SSA with SMOTE was 0.913 (using SVM), lower than 0.938 (using RF without SMOTE). However, the overall best Sp (0.959) was still obtained from the BERT feature optimized with SMOTE (using LR). This suggests that SMOTE may slightly increase false positives in exchange for higher sensitivity, while the overall balance improves.
- Quantified Improvement (Cross-Validation):
  - For SSA features, the ACC of KNN, LR, SVM, RF, and LGBM improved by 1.08% to 10.88% with SMOTE.
  - Similar improvements were observed with BERT features.
- Balanced Accuracy (BACC) Behavior: notably, BACC scores became identical to ACC scores when SMOTE was used in cross-validation. Because SMOTE effectively balances the dataset, the true positive rate (Sn) and true negative rate (Sp) become more comparable, so their average (BACC) converges with overall accuracy. This indicates that the data became balanced after applying SMOTE.
- Independent Test Performance: the positive impact of SMOTE was consistent in the independent test results as well; the best scores across five metrics for both SSA and BERT features were achieved when SMOTE was used. For example, for SSA with SMOTE, ACC was 0.866, MCC 0.683, Sn 0.814, auROC 0.916, and BACC 0.825.

The following are the results from Table 1 of the original paper:
| Feature | Model | SMOTE | Dim | ACC (CV) | MCC (CV) | Sn (CV) | Sp (CV) | auROC (CV) | BACC (CV) | ACC (Test) | MCC (Test) | Sn (Test) | Sp (Test) | auROC (Test) | BACC (Test) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SSA | KNN | − | 121 | 0.833 | 0.607 | 0.663 | 0.913 | 0.849 | 0.788 | 0.825 | 0.575 | 0.596 | 0.930 | 0.876 | 0.763 |
| SSA | LR | − | 121 | 0.776 | 0.485 | 0.634 | 0.842 | 0.814 | 0.738 | 0.780 | 0.498 | 0.679 | 0.826 | 0.839 | 0.752 |
| SSA | SVM | − | 121 | 0.827 | 0.588 | 0.613 | 0.925 | 0.909 | 0.769 | 0.857 | 0.658 | 0.682 | 0.944 | 0.907 | 0.806 |
| SSA | RF | − | 121 | 0.836 | 0.609 | 0.618 | 0.938 | 0.902 | 0.778 | 0.826 | 0.578 | 0.557 | 0.949 | 0.879 | 0.753 |
| SSA | LGBM | − | 121 | 0.852 | 0.664 | 0.721 | 0.913 | 0.896 | 0.817 | 0.827 | 0.583 | 0.621 | 0.921 | 0.880 | 0.771 |
| SSA | KNN | + | 121 | 0.842 | 0.709 | 0.962 | 0.721 | 0.930 | 0.841 | 0.787 | 0.555 | 0.814 | 0.774 | 0.885 | 0.794 |
| SSA | LR | + | 121 | 0.857 | 0.722 | 0.904 | 0.809 | 0.902 | 0.856 | 0.843 | 0.640 | 0.682 | 0.813 | 0.916 | 0.748 |
| SSA | SVM | + | 121 | 0.917 | 0.835 | 0.921 | 0.913 | 0.967 | 0.917 | 0.866 | 0.675 | 0.696 | 0.941 | 0.936 | 0.819 |
| SSA | RF | + | 121 | 0.915 | 0.833 | 0.921 | 0.908 | 0.967 | 0.915 | 0.866 | 0.683 | 0.714 | 0.936 | 0.895 | 0.825 |
| SSA | LGBM | + | 121 | 0.917 | 0.835 | 0.929 | 0.904 | 0.964 | 0.917 | 0.827 | 0.585 | 0.643 | 0.911 | 0.887 | 0.777 |
| BERT | KNN | − | 768 | 0.836 | 0.610 | 0.679 | 0.908 | 0.879 | 0.794 | 0.807 | 0.537 | 0.618 | 0.893 | 0.872 | 0.756 |
| BERT | LR | − | 768 | 0.836 | 0.649 | 0.820 | 0.842 | 0.888 | 0.833 | 0.850 | 0.660 | 0.743 | 0.907 | 0.912 | 0.825 |
| BERT | SVM | − | 768 | 0.830 | 0.613 | 0.727 | 0.880 | 0.910 | 0.803 | 0.820 | 0.599 | 0.770 | 0.841 | 0.875 | 0.806 |
| BERT | RF | − | 768 | 0.859 | 0.667 | 0.714 | 0.925 | 0.925 | 0.820 | 0.819 | 0.567 | 0.643 | 0.900 | 0.900 | 0.771 |
| BERT | LGBM | − | 768 | 0.830 | 0.609 | 0.705 | 0.890 | 0.898 | 0.797 | 0.830 | 0.596 | 0.668 | 0.905 | 0.915 | 0.786 |
| BERT | KNN | + | 768 | 0.884 | 0.775 | 0.954 | 0.813 | 0.928 | 0.884 | 0.820 | 0.625 | 0.857 | 0.803 | 0.881 | 0.830 |
| BERT | LR | + | 768 | 0.911 | 0.825 | 0.959 | 0.863 | 0.952 | 0.911 | 0.843 | 0.635 | 0.750 | 0.885 | 0.905 | 0.818 |
| BERT | SVM | + | 768 | 0.923 | 0.849 | 0.888 | 0.959 | 0.984 | 0.923 | 0.876 | 0.706 | 0.714 | 0.951 | 0.926 | 0.832 |
| BERT | RF | + | 768 | 0.898 | 0.797 | 0.909 | 0.887 | 0.967 | 0.898 | 0.896 | 0.793 | 0.905 | 0.887 | 0.971 | 0.897 |
| BERT | LGBM | + | 768 | 0.896 | 0.793 | 0.905 | 0.888 | 0.971 | 0.896 | 0.843 | 0.635 | 0.750 | 0.852 | 0.920 | 0.818 |

(CV: 10-fold cross-validation; Test: independent test.)
Best performance values are bold and underlined. SSA: Soft Symmetric Alignment; BERT: Bidirectional Encoder Representations from Transformer. KNN: k-nearest neighbor; LR: logistic regression; SVM: support vector machine; RF: random forest; LGBM: light gradient boosting machine. "-" indicates without the SMOTE method; "+" indicates with the SMOTE method.
6.2. The Effect of Different Feature Types
This section focuses on comparing the effectiveness of SSA versus BERT features, primarily in combination with SMOTE and different ML algorithms.
Analysis of Feature Type Effect (Figure 2 and Table 1):
- BERT's Superiority in Cross-Validation: the BERT feature vector, combined with the SVM algorithm and SMOTE, consistently showed the best performance across most metrics (ACC, MCC, Sp, auROC, and BACC) in the 10-fold cross-validation:
  - ACC: 0.923
  - MCC: 0.849
  - Sp: 0.959
  - auROC: 0.984
  - BACC: 0.923
  These values were significantly higher than those of the other combinations.
- SSA's Specific Strength: the SSA feature vector, conjugated with KNN and SMOTE, achieved the highest Sn (0.962) in cross-validation, outperforming all BERT combinations. This suggests SSA may be particularly good at identifying positive samples, even if its other metrics are somewhat lower.
- Independent Test Nuances: in the independent test, BERT-SVM-SMOTE still performed very well, but some other BERT combinations, specifically BERT-RF-SMOTE, showed slightly higher scores on some metrics:
  - BERT-RF-SMOTE reached an ACC of 0.896, MCC of 0.793, Sn of 0.905, auROC of 0.971, and BACC of 0.897, marginally higher than BERT-SVM-SMOTE on these specific metrics in the independent test.
  - However, BERT-SVM-SMOTE had a higher Sp (0.951) than BERT-RF-SMOTE (0.887).
- Conclusion on Best Model: despite the slight variations in the independent test, the paper concludes that BERT-SVM-SMOTE was the best model of all the combinations, owing to its consistently strong performance across both cross-validation and independent tests, particularly its high ACC, MCC, and auROC in cross-validation and robust Sp in the independent test. This underscores BERT's capability to extract highly effective features.
6.3. The Effect of Feature Fusion
To investigate if combining features from both SSA and BERT could further enhance performance, a fusion feature was created by concatenating the 121D SSA eigenvector and the 768D BERT eigenvector, resulting in an 889D vector. This fusion feature was then tested with the five ML algorithms, with and without SMOTE.
The following are the results from Table 2 of the original paper:
| Feature | Model | SMOTE | Dim | ACC (CV) | MCC (CV) | Sn (CV) | Sp (CV) | auROC (CV) | BACC (CV) | ACC (Test) | MCC (Test) | Sn (Test) | Sp (Test) | auROC (Test) | BACC (Test) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SSA + BERT | KNN | − | 889 | 0.836 | 0.610 | 0.679 | 0.909 | 0.908 | 0.794 | 0.820 | 0.576 | 0.679 | 0.885 | 0.900 | 0.782 |
| SSA + BERT | LR | − | 889 | 0.844 | 0.640 | 0.750 | 0.887 | 0.900 | 0.819 | 0.876 | 0.716 | 0.821 | 0.902 | 0.910 | 0.862 |
| SSA + BERT | SVM | − | 889 | 0.858 | 0.667 | 0.732 | 0.917 | 0.921 | 0.825 | 0.854 | 0.658 | 0.750 | 0.902 | 0.906 | 0.826 |
| SSA + BERT | RF | − | 889 | 0.841 | 0.620 | 0.643 | 0.934 | 0.906 | 0.788 | 0.831 | 0.599 | 0.679 | 0.902 | 0.906 | 0.790 |
| SSA + BERT | LGBM | − | 889 | 0.813 | 0.553 | 0.625 | 0.900 | 0.892 | 0.763 | 0.831 | 0.606 | 0.714 | 0.895 | 0.921 | 0.800 |
| SSA + BERT | KNN | + | 889 | 0.888 | 0.787 | 0.971 | 0.805 | 0.932 | 0.888 | 0.831 | 0.643 | 0.820 | 0.883 | 0.898 | 0.838 |
| SSA + BERT | LR | + | 889 | 0.917 | 0.836 | 0.954 | 0.880 | 0.951 | 0.917 | 0.876 | 0.724 | 0.857 | 0.906 | 0.906 | 0.871 |
| SSA + BERT | SVM | + | 889 | 0.934 | 0.867 | 0.938 | 0.929 | 0.980 | 0.934 | 0.820 | 0.563 | 0.571 | 0.934 | 0.916 | 0.733 |
| SSA + BERT | RF | + | 889 | 0.915 | 0.830 | 0.929 | 0.900 | 0.968 | 0.915 | 0.820 | 0.592 | 0.750 | 0.852 | 0.919 | 0.801 |
| SSA + BERT | LGBM | + | 889 | 0.919 | 0.840 | 0.950 | 0.888 | 0.963 | 0.919 | 0.843 | 0.643 | 0.786 | 0.869 | 0.919 | 0.827 |
Best performance values are bold and underlined. SSA: Soft Symmetric Alignment; BERT: Bidirectional Encoder Representations from Transformer. KNN: k-nearest neighbor; LR: logistic regression; SVM: support vector machine; RF: random forest; LGBM: light gradient boosting machine. "-" indicates without the SMOTE method; "+" indicates with the SMOTE method.
The following figure (Figure 3 from the original paper) displays the performance metrics of individual and fused features with SMOTE, according to the machine learning methods used. (A) Ten-fold cross-validation results. (B) Independent test results.

Analysis of Feature Fusion Effect (Table 2 and Figure 3):
- Cross-Validation: Consistent with previous findings, SMOTE significantly improved performance for fusion features. The best performance for fusion features in 10-fold cross-validation was achieved with SVM (ACC 0.934, MCC 0.867, Sn 0.938, Sp 0.929, auROC 0.980, BACC 0.934). These scores were slightly superior to those of the BERT feature alone (compare ACC 0.934 vs. 0.923, MCC 0.867 vs. 0.849, Sn 0.971 vs. 0.959, BACC 0.934 vs. 0.923), which initially suggested a benefit from combining features.
- Independent Test: However, the independent test results revealed a different picture. The best performance of the fusion feature (achieved by LR with SMOTE: ACC 0.876, MCC 0.724, Sn 0.857, Sp 0.902, auROC 0.910, BACC 0.871) was lower than the corresponding scores obtained from the BERT feature alone with SMOTE (e.g., BERT-RF-SMOTE had ACC 0.896, MCC 0.793, Sn 0.905, Sp 0.887, auROC 0.971, BACC 0.897).
- Conclusion: The authors concluded that fusing the SSA and BERT features was not a beneficial choice for model optimization in umami peptide prediction. While the fusion showed slight improvements in cross-validation (which can be optimistic), it failed to generalize better to unseen data in the independent test, performing worse than BERT features alone. This suggests that the 121D SSA features may introduce redundancy or noise that hinders the generalization of the powerful 768D BERT features.
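For reference, the metrics tabulated above can be computed from predicted labels and scores as in this brief sketch; it uses standard scikit-learn definitions and assumes labels are encoded as 1 = umami, 0 = non-umami (an assumption about the encoding, not stated in the tables):

```python
from sklearn.metrics import (accuracy_score, matthews_corrcoef, recall_score,
                             roc_auc_score, balanced_accuracy_score)

def evaluate(y_true, y_pred, y_score):
    """Metrics used throughout Tables 2-4: Sn is recall on the positive (umami) class,
    Sp is recall on the negative (non-umami) class."""
    return {
        "ACC":   accuracy_score(y_true, y_pred),
        "MCC":   matthews_corrcoef(y_true, y_pred),
        "Sn":    recall_score(y_true, y_pred, pos_label=1),
        "Sp":    recall_score(y_true, y_pred, pos_label=0),
        "auROC": roc_auc_score(y_true, y_score),   # y_score = predicted probability of umami
        "BACC":  balanced_accuracy_score(y_true, y_pred),
    }
```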
6.4. The Effect of Feature Selection
Given that feature fusion did not yield improvements and higher dimensionality carries a risk of overfitting, feature selection was applied using the LGBM method. This aimed to remove redundant and indistinguishable features to find an optimized feature space for umami peptide prediction. The optimized models (after SMOTE and feature selection) were then evaluated.
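A hedged sketch of importance-based selection with LightGBM follows; the exact selection protocol (for example, how the retained dimensionality such as 139 was searched) is an assumption, and `lgbm_select` is a hypothetical helper, not the authors' code:

```python
import numpy as np
from lightgbm import LGBMClassifier

def lgbm_select(X, y, k):
    """Rank features by LightGBM importance and keep the top-k columns."""
    ranker = LGBMClassifier(n_estimators=200, random_state=0).fit(X, y)
    order = np.argsort(ranker.feature_importances_)[::-1]  # most important first
    keep = order[:k]
    return X[:, keep], keep

# Toy demonstration with placeholder data shaped like the 768-D BERT descriptor.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(353, 768))
y_demo = rng.integers(0, 2, size=353)
X_sel, idx = lgbm_select(X_demo, y_demo, k=139)
print(X_sel.shape)  # (353, 139)
```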
The following are the results from Table 3 of the original paper:
| Feature | Model | SMOTE | Dim | CV ACC | CV MCC | CV Sn | CV Sp | CV auROC | CV BACC | Test ACC | Test MCC | Test Sn | Test Sp | Test auROC | Test BACC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SSA | KNN | + | 43 | 0.892 | 0.788 | 0.942 | 0.842 | 0.938 | 0.892 | 0.921 | 0.825 | 0.929 | 0.918 | 0.914 | 0.923 |
| SSA | LR | + | 29 | 0.884 | 0.768 | 0.900 | 0.867 | 0.938 | 0.884 | 0.887 | 0.745 | 0.857 | 0.902 | 0.919 | 0.879 |
| SSA | SVM | + | 29 | 0.909 | 0.820 | 0.946 | 0.871 | 0.962 | 0.909 | 0.899 | 0.761 | 0.786 | 0.951 | 0.913 | 0.868 |
| SSA | RF | + | 39 | 0.892 | 0.784 | 0.892 | 0.892 | 0.957 | 0.892 | 0.887 | 0.735 | 0.786 | 0.934 | 0.914 | 0.860 |
| SSA | LGBM | + | 39 | 0.902 | 0.805 | 0.905 | 0.900 | 0.958 | 0.902 | 0.899 | 0.763 | 0.821 | 0.934 | 0.919 | 0.878 |
| BERT | KNN | + | 163 | 0.888 | 0.786 | 0.967 | 0.809 | 0.950 | 0.888 | 0.865 | 0.723 | 0.929 | 0.836 | 0.909 | 0.882 |
| BERT | LR | + | 29 | 0.876 | 0.751 | 0.884 | 0.867 | 0.937 | 0.876 | 0.887 | 0.739 | 0.821 | 0.918 | 0.913 | 0.870 |
| BERT | SVM | + | 139 | 0.940 | 0.881 | 0.963 | 0.917 | 0.971 | 0.940 | 0.899 | 0.774 | 0.893 | 0.902 | 0.933 | 0.897 |
| BERT | RF | + | 77 | 0.921 | 0.843 | 0.938 | 0.905 | 0.973 | 0.921 | 0.865 | 0.711 | 0.821 | 0.895 | 0.923 | 0.853 |
| BERT | LGBM | + | 174 | 0.917 | 0.834 | 0.929 | 0.905 | 0.973 | 0.917 | 0.876 | 0.694 | 0.786 | 0.918 | 0.916 | 0.852 |
| SSA + BERT | KNN | + | 65 | 0.900 | 0.806 | 0.954 | 0.846 | 0.942 | 0.900 | 0.876 | 0.742 | 0.929 | 0.852 | 0.898 | 0.891 |
| SSA + BERT | LR | + | 79 | 0.915 | 0.832 | 0.950 | 0.880 | 0.941 | 0.915 | 0.887 | 0.745 | 0.857 | 0.902 | 0.909 | 0.879 |
| SSA + BERT | SVM | + | 99 | 0.932 | 0.864 | 0.950 | 0.913 | 0.981 | 0.932 | 0.887 | 0.745 | 0.857 | 0.902 | 0.909 | 0.879 |
| SSA + BERT | RF | + | 168 | 0.909 | 0.818 | 0.925 | 0.892 | 0.974 | 0.909 | 0.876 | 0.716 | 0.821 | 0.902 | 0.917 | 0.862 |
| SSA + BERT | LGBM | + | 114 | 0.919 | 0.839 | 0.942 | 0.896 | 0.979 | 0.919 | 0.876 | 0.724 | 0.857 | 0.885 | 0.920 | 0.871 |
Best performance values are bold and underlined. SSA: Soft Symmetric Alignment; BERT: Bidirectional Encoder Representations from Transformer. KNN: k-nearest neighbor; LR: logistic regression; SVM: support vector machine; RF: random forest; LGBM: light gradient boosting machine. "+" indicates with the SMOTE method.
The following figure (Figure 4 from the original paper) presents the performance metrics of individual and fusion features using selected features and different algorithms. (A) Ten-fold cross-validation results. (B) Independent test results.

Analysis of Feature Selection Effect (Table 3 and Figure 4):
- Cross-Validation Improvement: Feature selection significantly improved model performance. The BERT feature encoding, combined with the SVM algorithm (with SMOTE) and 139 dimensions retained after LGBM feature selection, achieved the best 10-fold cross-validation performance across ACC, MCC, Sp, and BACC: ACC 0.940 (an improvement of 0.86% to 7.31% over the other options), MCC 0.881 (1.97% to 17.31%), Sp 0.917 (0.44% to 13.35%), and BACC 0.940 (identical to ACC, indicating balanced data). These results underscore the value of selecting an optimal feature descriptor.
- Independent Test Performance:
  - Some SSA combinations achieved the highest scores for ACC (0.921), MCC (0.825), Sn (0.929), and BACC (0.923), all with SSA-KNN at 43 dimensions, while SSA-SVM at 29 dimensions achieved the highest Sp (0.951).
  - However, the BERT feature (139 dimensions) with SVM still yielded the highest auROC (0.933) in the independent test, a strong indicator of overall classifier quality across thresholds. Its ACC (0.899), MCC (0.774), Sn (0.893), and BACC (0.897) were the second best among all models, demonstrating consistently strong performance.
- Conclusion on Optimal Model: Considering both cross-validation and independent testing, the BERT feature with the SVM algorithm (with SMOTE and 139 selected dimensions) was deemed the best option for umami peptide prediction, balancing high performance across multiple metrics with robust generalization; a minimal sketch of this configuration is given below.
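As referenced above, here is a minimal sketch of the final configuration (SMOTE balancing of the training data followed by an SVM on the selected 139-D BERT features). The data and hyperparameters are placeholders, not the authors' tuned values:

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.svm import SVC

# Placeholder data standing in for the 139-D selected BERT features and labels.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 139))
y_train = rng.integers(0, 2, size=300)

# SMOTE is applied to the training split only, then an RBF SVM is fitted.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_train, y_train)
clf = SVC(kernel="rbf", probability=True, random_state=0).fit(X_bal, y_bal)

# Probabilistic umami scores for new (placeholder) peptides.
X_new = rng.normal(size=(5, 139))
print(clf.predict_proba(X_new)[:, 1])
```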
6.5. Comparison of iUP-BERT with Existing Models
The efficacy and robustness of the final iUP-BERT model (specifically, the BERT-SVM-SMOTE combination with 139 selected dimensions) were then rigorously compared against the previously discussed existing methods: iUmami-SCM and UMPred-FRL.
The following are the results from Table 4 of the original paper:
| Classifier | CV ACC | CV MCC | CV Sn | CV Sp | CV auROC | CV BACC | Test ACC | Test MCC | Test Sn | Test Sp | Test auROC | Test BACC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| iUP-BERT | 0.940 | 0.881 | 0.963 | 0.917 | 0.971 | 0.940 | 0.899 | 0.774 | 0.893 | 0.902 | 0.933 | 0.897 |
| iUmami-SCM | 0.935 | 0.864 | 0.947 | 0.930 | 0.939 | 0.939 | 0.865 | 0.679 | 0.714 | 0.934 | 0.898 | 0.824 |
| UMPred-FRL | 0.921 | 0.814 | 0.847 | 0.955 | 0.938 | 0.901 | 0.888 | 0.735 | 0.786 | 0.934 | 0.919 | 0.860 |
Best performance values are in bold and are underlined.
Analysis of Comparison (Table 4):
- 10-Fold Cross-Validation: iUP-BERT (ACC 0.940, MCC 0.881, Sn 0.963, auROC 0.971, BACC 0.940) clearly outperformed both iUmami-SCM (ACC 0.935, MCC 0.864, Sn 0.947, auROC 0.939, BACC 0.939) and UMPred-FRL (ACC 0.921, MCC 0.814, Sn 0.847, auROC 0.938, BACC 0.901) on ACC, MCC, Sn, auROC, and BACC. iUmami-SCM had a slightly higher Sp (0.930) than iUP-BERT (0.917), but iUP-BERT's overall performance was superior.
- Independent Test: This is the most critical evaluation of generalization ability. iUP-BERT delivered markedly better results on all five primary metrics (ACC, MCC, Sn, auROC, BACC) than both baselines:
  - ACC: iUP-BERT (0.899) was higher by 1.23% (vs. UMPred-FRL, 0.888) to 3.93% (vs. iUmami-SCM, 0.865).
  - MCC: iUP-BERT (0.774) was higher by 5.31% (vs. UMPred-FRL, 0.735) to 13.99% (vs. iUmami-SCM, 0.679).
  - Sn: iUP-BERT (0.893) was substantially higher, by 13.6% (vs. UMPred-FRL, 0.786) to 25.07% (vs. iUmami-SCM, 0.714), indicating that iUP-BERT is much better at recognizing actual umami peptides.
  - auROC: iUP-BERT (0.933) was higher by 1.52% (vs. UMPred-FRL, 0.919) to 3.90% (vs. iUmami-SCM, 0.898).
  - BACC: iUP-BERT (0.897) was higher by 4.30% (vs. UMPred-FRL, 0.860) to 8.86% (vs. iUmami-SCM, 0.824).
  - Sp: iUP-BERT's Sp (0.902) was slightly lower than that of both baselines (0.934 each), a trade-off accompanying its much higher sensitivity, while still keeping specificity at a high level.
- Conclusion: The comparisons confirm that iUP-BERT is more effective, reliable, and stable than the existing methods for umami peptide prediction, particularly because of its superior generalization on unseen data. This validates the effectiveness of using BERT for deep representation learning of peptide features.
6.6. Feature Analysis Using Feature Projection and Decision Function
To provide a visual explanation for iUP-BERT's excellent performance, the 139-dimensional BERT feature space (optimized by feature selection) was reduced to a 2-dimensional plane using two dimensionality reduction techniques: Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP). This allows for a visual inspection of how well umami and non-umami peptides are separated in the learned feature space. Additionally, the decision function boundary of the SVM model was plotted.
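A small sketch of this kind of visualization follows, assuming the selected 139-D features are at hand. PCA is shown (umap-learn's UMAP could be swapped in for the second panel); the synthetic arrays and the refitting of an SVM in the 2-D plane, so that its decision regions can be drawn directly, are illustrative choices rather than the authors' exact procedure:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Placeholder 139-D features and labels; replace with the selected BERT features.
rng = np.random.default_rng(0)
X = rng.normal(size=(353, 139))
y = rng.integers(0, 2, size=353)

# Project to 2-D for visualisation.
X2 = PCA(n_components=2).fit_transform(X)

# Fit an SVM in the 2-D plane and evaluate its decision function on a grid.
clf = SVC(kernel="rbf").fit(X2, y)
xx, yy = np.meshgrid(np.linspace(X2[:, 0].min(), X2[:, 0].max(), 200),
                     np.linspace(X2[:, 1].min(), X2[:, 1].max(), 200))
zz = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, zz > 0, alpha=0.3)      # positive vs. negative regions
plt.scatter(X2[:, 0], X2[:, 1], c=y, s=10)   # umami vs. non-umami points
plt.show()
```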
The following figure (Figure 5 from the original paper) shows the dimension reduction visualization of umami peptide BERT features and decision function boundary analysis of the SVM model.

Analysis of Visualization (Figure 5):
- Separation of Classes: In both the PCA (Figure 5A) and UMAP (Figure 5B) visualizations, the red dots (umami peptides) and blue dots (non-umami peptides) show a relatively concentrated distribution in two distinct areas. The yellow section indicates the positive (umami) sample area, and the purple section indicates the negative (non-umami) sample area.
- Effectiveness of BERT Features: The clear separation between the red and blue clusters demonstrates that the 139-dimensional BERT features, even after being reduced to 2D, are highly discriminative. The BERT model successfully learned features that effectively differentiate umami peptides from non-umami peptides.
- SVM Decision Boundary: The drawn decision function boundary (a line in 2D) visually confirms that the SVM model can distinguish most positive and negative samples. The boundary largely separates the red and blue clusters, in line with the high classification performance observed in the quantitative metrics.
- Misclassified Samples: Despite the good separation, the visualization also shows some misclassified samples: red dots appearing in the purple area or blue dots in the yellow area, i.e., instances where the SVM model made incorrect predictions.
- Future Improvement Implications: The presence of misclassified samples suggests that, while the BERT features are powerful, there is still room for improvement. The authors note that "better feature extraction methods or more suitable machine learning methods were needed for modeling, to better identify umami peptide sequences from non-umami peptide sequences in the future." This implies that more advanced deep learning architectures, larger datasets, or fine-tuning approaches could achieve even cleaner separation and fewer misclassifications.
6.7. Construction of the Web Server of iUP-BERT
To maximize the utility and accessibility of the iUP-BERT predictor for the research community and industry, an open-access web server was developed and made available.
- Access: The web server can be accessed at https://www.aibiochem.net/servers/iUP-BERT/ (accessed on 23 September 2022).
- Purpose: The server allows users to rapidly and efficiently screen potential umami peptides by simply inputting their peptide sequences. It turns the computational model into a practical tool for high-throughput prediction, without requiring users to set up the complex deep learning and machine learning environment locally.
- Impact: This contribution significantly enhances the usability of the research, making iUP-BERT a valuable resource for exploring new umami peptides and promoting innovation in the food seasoning industry.
7. Conclusion & Reflections
7.1. Conclusion Summary
This study successfully developed iUP-BERT, a novel and highly accurate machine learning prediction model for identifying umami peptides solely based on their amino acid sequences. The model's core innovation lies in its utilization of BERT, a single deep representation learning pretrained neural network, for automatic and highly effective feature extraction. The methodology systematically optimized performance by incorporating SMOTE to address data imbalance and applying LGBM for feature selection, ultimately identifying the BERT-SVM-SMOTE model with 139 selected dimensions as the most robust and efficient configuration. This work represents the first application of BERT for computational identification of umami peptides.
Extensive validation through 10-fold cross-validation and independent testing unequivocally demonstrated iUP-BERT's superior efficacy and robustness. Compared to existing methods like iUmami-SCM and UMPred-FRL, iUP-BERT achieved significant improvements across critical metrics in the independent test, including ACC (1.23–3.93% higher), MCC (5.31–13.99% higher), Sn (13.6–25.07% higher), auROC (1.52–3.90% higher), and BACC (4.30–8.86% higher). Finally, an open-access web server for iUP-BERT was built, transforming this research into a practical tool to accelerate the discovery of new umami peptides and advance the food seasoning industry.
7.2. Limitations & Future Work
The authors acknowledge several limitations and suggest future research directions:
- Limited Training Sample Size: The current dataset used for training was relatively small (112 positive and 241 negative samples). The authors note that larger training sample sizes generally improve the prediction performance of deep learning models.
- Further BERT Optimization: While BERT was used for feature extraction, the authors propose that fine-tuning the BERT model specifically for the umami peptide prediction task (rather than using only its off-the-shelf embeddings) could yield an even more accurate model. Fine-tuning means continuing the training of the pre-trained BERT model on the umami peptide dataset itself, allowing its internal representations to adapt more precisely to this domain (see the sketch after this list).
- Expansion of Datasets: Future efforts should focus on constructing an optimized, larger dataset containing more experimentally identified umami and non-umami peptides. This would provide more robust data for training and validation, potentially leading to better model performance.
- Broader Applications: The overall goal is to use iUP-BERT as a powerful tool for exploring new umami peptides, helping to improve the palatability of dietary supplements and to promote the umami seasoning industry.
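To illustrate the fine-tuning idea raised above, here is a hedged sketch using the Hugging Face `transformers` API; the checkpoint, residue-level tokenization, and hyperparameters are placeholders, not the authors' setup:

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

CHECKPOINT = "bert-base-uncased"  # placeholder; a peptide/protein-pretrained BERT would be preferable
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)

def encode(sequences):
    # Treat each residue as a token by inserting spaces between letters.
    return tokenizer([" ".join(s) for s in sequences], padding=True, truncation=True)

# train_dataset / eval_dataset would be built from encode(...) plus umami / non-umami labels.
args = TrainingArguments(output_dir="iup_bert_ft", num_train_epochs=5,
                         per_device_train_batch_size=16, learning_rate=2e-5)
# trainer = Trainer(model=model, args=args, train_dataset=train_dataset, eval_dataset=eval_dataset)
# trainer.train()
```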
7.3. Personal Insights & Critique
This paper presents a solid application of cutting-edge deep learning techniques to a practical problem in food science. My insights and critique are as follows:
- Strength of Deep Learning for Feature Extraction: The paper strongly demonstrates the power of deep representation learning, specifically BERT, in automatically extracting meaningful features from biological sequences. This is a significant leap from manual feature engineering, which requires extensive domain knowledge and can miss subtle patterns. The consistent outperformance of iUP-BERT over previous methods is a testament to this paradigm shift, and the success of BERT in a non-NLP domain like peptide prediction highlights its versatility and the underlying similarity in sequential data processing.
- Importance of Ancillary Techniques: The rigorous inclusion of SMOTE for data balancing and LGBM for feature selection is commendable. These steps, often overlooked or only minimally explored in deep learning papers, were crucial for optimizing the model and ensuring its robustness and generalization ability. The observation that SMOTE made BACC redundant in cross-validation is a clear indicator of its effectiveness in achieving class balance.
- Transparency and Reproducibility: Providing an open-access web server significantly enhances the impact and utility of this research. It makes the developed predictor readily available to a wider audience, facilitating its adoption and further research, which is a great practice for computational biology tools.
- Limitations and Areas for Improvement:
  - Dataset Size: As acknowledged by the authors, the dataset (112 positive, 241 negative samples) is relatively small for deep learning models. While BERT is pre-trained, its fine-tuning and the overall classifier could benefit substantially from a much larger, more diverse, and rigorously validated dataset. The "gold standard" for umami peptides is still evolving, which may contribute to this limitation.
  - "Black Box" Nature of BERT Features: While BERT provides powerful features, their exact biochemical interpretation remains somewhat opaque. Further work could apply interpretable AI techniques to understand which specific patterns in the peptide sequence BERT identifies as indicative of umami taste, potentially yielding new biochemical insights.
  - Further Deep Learning Exploration: The paper primarily uses BERT for feature extraction and then feeds these embeddings into traditional ML classifiers. Future work could explore end-to-end deep learning architectures in which the BERT embeddings are integrated directly into a neural network classifier, potentially allowing more complex, hierarchical learning. Other advanced Transformer-based architectures or graph neural networks (if peptide structure information were incorporated) could also be investigated.
  - Domain-Specific Pre-training: Instead of using a BERT model pre-trained on natural language, developing a BERT-like model pre-trained on a massive corpus of diverse peptide sequences (e.g., from protein databases) would likely yield even more domain-specific and effective feature embeddings.
- Transferability: The methodology of using a pre-trained Transformer-based model for sequence feature extraction, combined with data balancing and feature selection, is highly transferable. It could be applied to predict other types of bioactive peptides (e.g., antioxidant, antihypertensive, antimicrobial peptides) or even other biological sequences such as DNA or RNA motifs, provided suitable pre-training data and downstream tasks exist.

Overall, iUP-BERT is a valuable contribution, marking an important step forward in the computational identification of umami peptides by effectively harnessing the power of deep learning. The robust experimental design and practical deployment further solidify its significance.