
IUP-BERT: Identification of Umami Peptides Based on BERT Features

Published: 11/21/2022

TL;DR Summary

The study presents iUP-BERT, a novel umami peptide predictor using BERT for feature extraction. Combined with SMOTE and SVM, it significantly improves the efficiency and accuracy of umami peptide identification, outperforming existing methods, and provides an open-access web server.

Abstract

Umami is an important widely-used taste component of food seasoning. Umami peptides are specific structural peptides endowing foods with a favorable umami taste. Laboratory approaches used to identify umami peptides are time-consuming and labor-intensive, which are not feasible for rapid screening. Here, we developed a novel peptide sequence-based umami peptide predictor, namely iUP-BERT, which was based on the deep learning pretrained neural network feature extraction method. After optimization, a single deep representation learning feature encoding method (BERT: bidirectional encoder representations from transformer) in conjugation with the synthetic minority over-sampling technique (SMOTE) and support vector machine (SVM) methods was adopted for model creation to generate predicted probabilistic scores of potential umami peptides. Further extensive empirical experiments on cross-validation and an independent test showed that iUP-BERT outperformed the existing methods with improvements, highlighting its effectiveness and robustness. Finally, an open-access iUP-BERT web server was built. To our knowledge, this is the first efficient sequence-based umami predictor created based on a single deep-learning pretrained neural network feature extraction method. By predicting umami peptides, iUP-BERT can help in further research to improve the palatability of dietary supplements in the future.


1. Bibliographic Information

1.1. Title

The central topic of this paper is the identification of umami peptides using machine learning, specifically leveraging features extracted by the BERT deep learning model. The model is named iUP-BERT.

1.2. Authors

The authors are: Liangzhen Jiang, Jici Jiang, Xiao Wang, Yin Zhang, Bowen Zheng, Shuqi Liu, Yiting Zhang, Changying Liu, Yan Wan, Dabing Xiang, and Zhibin Lv. Their affiliations span several institutions, primarily in China:

  • College of Food and Biological Engineering, Chengdu University, Chengdu, China (Liangzhen Jiang, Xiao Wang, Shuqi Liu, Changying Liu, Yan Wan, Dabing Xiang)
  • Department of Computer Science, Peking University, Beijing, China (Jici Jiang, Bowen Zheng, Zhibin Lv)
  • Key Laboratory of Comprehensive Utilization of Crops, Ministry of Agriculture and Rural Affairs, Chengdu, China (Liangzhen Jiang, Xiao Wang, Shuqi Liu, Changying Liu, Yan Wan, Dabing Xiang)
  • College of Biology, Southwest Jiaotong University, Chengdu, China (Yiting Zhang)
  • College of Biology, Georgia State University, Atlanta, GA, USA (Yiting Zhang)

The corresponding authors are Dabing Xiang and Zhibin Lv, with Zhibin Lv associated with Peking University. Their research backgrounds appear to be at the intersection of food science/biological engineering and computer science/bioinformatics, indicating an interdisciplinary approach to the problem.

1.3. Journal/Conference

The paper was published in Foods, a peer-reviewed open-access journal published by MDPI. Foods covers a wide range of topics related to food science, technology, and nutrition. Its reputation and influence are recognized in the food science and technology domain, making it a relevant venue for this research.

1.4. Publication Year

The paper was published on November 21, 2022.

1.5. Abstract

The paper addresses the challenge of identifying umami peptides, which are crucial for food seasoning, given that traditional laboratory methods are time-consuming and labor-intensive. To overcome this, the authors developed a novel peptide sequence-based predictor called iUP-BERT. This predictor is based on a deep learning pre-trained neural network feature extraction method, specifically Bidirectional Encoder Representations from Transformer (BERT). After optimization, the model utilized BERT features in conjunction with the Synthetic Minority Oversampling Technique (SMOTE) to handle data imbalance and Support Vector Machine (SVM) for classification. Extensive experiments, including cross-validation and an independent test, demonstrated that iUP-BERT significantly outperformed existing methods in effectiveness and robustness. The authors also built an open-access web server for iUP-BERT. They claim this is the first efficient sequence-based umami predictor created using a single deep-learning pretrained neural network feature extraction method. The ultimate goal is to use iUP-BERT to improve the palatability of dietary supplements.


2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the inefficient and labor-intensive identification of umami peptides. Umami taste, known as the "fifth taste," is a highly desirable characteristic in food, contributing to deliciousness. Umami peptides are specific short linear peptides that impart this taste and also offer health benefits like reducing salt content and antioxidant activity.

Current laboratory approaches to identify and characterize umami peptides (e.g., RP-HPLC, MALDI-TOF-MS, LC-Q-TOF-MS) are:

  • Time-consuming: They require significant time for experimental procedures.

  • Labor-intensive: They demand substantial manual effort and skilled personnel.

  • Not feasible for rapid screening: Their high-throughput capability is limited, restricting the discovery of new umami peptides.

    These limitations highlight a critical gap: the lack of accurate and efficient computational methods for rapid screening of umami peptides. Prior computational methods, while a step forward, also presented challenges:

  • iUmami-SCM: Relied on artificial feature extraction and only a single type of feature (propensity scores of amino acids and dipeptides), leading to insufficient sequence feature information and unsatisfactory performance.

  • UMPred-FRL: Improved upon iUmami-SCM by using multiple feature encodings (amino acid composition, dipeptide composition, etc.) and various machine learning algorithms. However, its overall prediction performance was still considered "not efficient enough," likely due to the continued reliance on manual feature extraction.

    The paper's entry point and innovative idea is to leverage the power of deep learning, specifically the BERT model, for automatic and efficient feature extraction from peptide sequences. BERT, originally designed for natural language processing, has shown great success in capturing contextual information and has been successfully applied to other biological sequence prediction tasks. The authors hypothesize that BERT's ability to learn complex representations directly from raw sequences, without manual feature engineering, can overcome the limitations of previous methods and lead to more robust and accurate umami peptide prediction.

2.2. Main Contributions / Findings

The paper's primary contributions are:

  • Development of iUP-BERT: A novel machine learning-based predictor for umami peptides that utilizes a deep learning pretrained neural network (BERT) for feature extraction. This marks a significant shift from traditional manual feature engineering in this domain.
  • First application of a single deep-learning pretrained neural network for umami peptide prediction: The authors highlight that iUP-BERT is the first efficient sequence-based umami predictor built upon a single deep representation learning feature (BERT), eliminating the need for complex, hand-crafted feature combinations.
  • Demonstrated superior performance: Through extensive empirical experiments (10-fold cross-validation and independent testing), iUP-BERT consistently and significantly outperformed existing state-of-the-art methods like iUmami-SCM and UMPred-FRL across multiple evaluation metrics (Accuracy, MCC, Sensitivity, auROC, and Balanced Accuracy).
  • Incorporation of SMOTE and LGBM feature selection: The study rigorously optimized the model by first applying SMOTE to address data imbalance, which proved critical for performance improvement. Subsequently, LGBM was used for feature selection, identifying an optimal feature space (139 dimensions) that further enhanced model robustness and accuracy.
  • Deployment of an open-access web server: An iUP-BERT web server was built and made publicly available, facilitating rapid and high-throughput screening of umami peptides for researchers and the food industry.
  • Practical implications: The findings suggest that iUP-BERT can be a powerful tool for exploring new umami peptides, contributing to the development of improved dietary supplements and advancing the food seasoning industry.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand the iUP-BERT paper, a grasp of several fundamental concepts in machine learning, deep learning, and bioinformatics is essential.

  • Umami Peptides:
    • Conceptual Definition: Umami is recognized as the fifth basic taste, characterized by a savory, meaty, or broth-like flavor. Umami peptides are specific short linear amino acid sequences (peptides) that bind to taste receptors (primarily the T1R1/T1R3 receptor) on the tongue, eliciting the umami sensation. They typically have a low molecular weight (less than 5000 Da), with dipeptides and tripeptides being common. Beyond taste, they also possess various health benefits.
  • Machine Learning (ML):
    • Conceptual Definition: Machine Learning is a subset of artificial intelligence that enables systems to learn from data, identify patterns, and make decisions or predictions with minimal human intervention. Instead of being explicitly programmed, ML algorithms "learn" a model from training data, which can then be used to analyze new, unseen data.
  • Deep Learning:
    • Conceptual Definition: Deep learning is a specialized branch of machine learning that uses artificial neural networks with multiple layers (hence "deep") to learn complex patterns from large datasets. Unlike traditional ML, deep learning models can automatically learn feature representations from raw input data, eliminating the need for manual feature engineering. This ability to "learn how to learn" is a key advantage.
  • BERT (Bidirectional Encoder Representations from Transformer):
    • Conceptual Definition: BERT is a powerful, pre-trained deep learning model developed by Google, primarily for natural language processing (NLP) tasks. Its core innovation lies in its bidirectional learning of language context. Unlike previous models that processed text directionally (left-to-right or right-to-left), BERT processes words in relation to all other words in a sentence simultaneously. This allows it to understand the full context of a word by looking at both its left and right neighbors. It's pre-trained on massive amounts of unlabeled text data using two main tasks: Masked Language Model (MLM) (predicting masked words) and Next Sentence Prediction (NSP). After pre-training, it can be fine-tuned for various downstream tasks with minimal architectural changes. In this paper, peptide sequences are treated as "language" for BERT to process.
    • Transformer Architecture: BERT's foundation is the Transformer architecture, specifically its encoder component. The Transformer model, introduced in "Attention Is All You Need," revolutionized sequence processing by relying entirely on self-attention mechanisms rather than recurrent or convolutional layers.
      • Self-Attention: Self-attention allows the model to weigh the importance of different words (or amino acids, in the peptide context) in an input sequence when encoding a particular word. It computes a "contextualized" representation for each token by attending to all other tokens in the sequence. The core idea is to calculate query (Q), key (K), and value (V) matrices from the input embeddings. The attention score is then computed by taking the dot product of Q and K, scaling it by the square root of the key dimension, applying a softmax function, and finally multiplying by V (a minimal numerical sketch of this computation is given after this concept list).
      • Multi-head Self-Attention: This mechanism performs the self-attention process multiple times in parallel with different linear projections of Q, K, and V. The outputs from these "attention heads" are then concatenated and linearly transformed, allowing the model to focus on different aspects of the input sequence simultaneously.
  • SSA (Soft Symmetric Alignment):
    • Conceptual Definition: SSA is a method designed to compare arbitrary-length sequences within vectors. It involves encoding a peptide sequence using an initial pre-trained model (in this case, a three-tier stacked BiLSTM encoder) to create an embedding matrix. BiLSTM (Bidirectional Long Short-Term Memory) is a type of recurrent neural network that can process sequences in both forward and backward directions, capturing dependencies from both past and future contexts. SSA then uses a specific mechanism to calculate the similarity between these embedded sequences.
  • SMOTE (Synthetic Minority Oversampling Technique):
    • Conceptual Definition: SMOTE is an oversampling algorithm used to address the class imbalance problem in datasets. Class imbalance occurs when one class (the minority class) has significantly fewer samples than another (the majority class). This can lead to machine learning models being biased towards the majority class and performing poorly on the minority class. SMOTE works by synthesizing new minority class samples that are "similar" to existing minority samples, rather than simply duplicating them. It achieves this by taking a minority sample, finding its k-nearest neighbors, and then creating new samples along the line segments connecting the original sample to its neighbors.
  • Machine Learning Algorithms (Classifiers):
    • Support Vector Machine (SVM):
      • Conceptual Definition: SVM is a powerful supervised learning model used for classification and regression tasks. For binary classification, its goal is to find an optimal hyperplane (a decision boundary) that best separates data points of different classes in a high-dimensional space. The "optimal" hyperplane is the one that has the largest margin (the distance between the hyperplane and the nearest data points of each class). Maximizing this margin helps to improve the generalization ability of the model.
    • K-Nearest Neighbor (KNN):
      • Conceptual Definition: KNN is a non-parametric, instance-based learning algorithm used for classification and regression. In classification, a new data point is classified by a majority vote of its k nearest neighbors among the training data. "Nearest" is typically determined by distance metrics like Euclidean distance. It's a "lazy learner" because it doesn't build a model during training but memorizes the training data.
    • Logistic Regression (LR):
      • Conceptual Definition: LR is a statistical model primarily used for binary classification. Despite its name, it's a classification algorithm, not a regression algorithm in the traditional sense. It models the probability that a given input belongs to a particular class. It does this by applying a sigmoid function (also known as the logistic function) to the output of a linear equation, squashing the output between 0 and 1, which can then be interpreted as a probability.
    • Random Forest (RF):
      • Conceptual Definition: RF is an ensemble learning method, meaning it combines predictions from multiple models to improve accuracy and robustness. It constructs a multitude of decision trees during training. For classification, it outputs the class that is the mode (majority vote) of the classes predicted by individual trees. Each tree is trained on a random subset of the data (bootstrapping) and considers only a random subset of features at each split, which helps reduce overfitting.
    • Light Gradient Boosting Machine (LGBM):
      • Conceptual Definition: LGBM is a gradient boosting framework that uses tree-based learning algorithms. It's known for its high speed and efficiency. Unlike other boosting algorithms that grow trees level-wise, LGBM grows trees leaf-wise, meaning it chooses the leaf with the largest loss to grow, which can lead to faster training and better accuracy. It also employs a histogram-based algorithm to discretize continuous features, further speeding up the training process. In this paper, LGBM is also used for feature selection.
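
To make the self-attention computation described above concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention. The projection matrices, toy sequence length, and embedding size are illustrative assumptions, not values taken from the paper or from BERT itself.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over a token embedding matrix X of shape (L, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # project inputs to queries, keys, values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # pairwise attention scores, shape (L, L)
    weights = softmax(scores, axis=-1)         # each token attends over all tokens
    return weights @ V                         # contextualized token representations

# Toy example: a "peptide" of 5 tokens with 8-dimensional embeddings (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(scaled_dot_product_attention(X, Wq, Wk, Wv).shape)  # (5, 8)
```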

3.2. Previous Works

The paper contextualizes iUP-BERT by comparing it against two notable prior computational methods for umami peptide prediction:

  • iUmami-SCM (Scoring Card Method):

    • Summary: This was the first sequence-based umami peptide predictor. It analyzes and predicts umami sensory peptides solely based on the information of the primary peptide sequence, without requiring advanced structural data. It conjugated estimated propensity scores of amino acids and dipeptides with the scoring card method (SCM).
    • Limitations:
      • Artificial Feature Extraction: It relied on manually designed and artificial feature extraction methods, which might not fully capture the complex patterns within peptide sequences.
      • Single Feature Type: Only a single type of feature was used as input for the machine learning models. This limited the amount of sequence feature information the model could learn from.
      • Performance: Achieved a sensitivity (Sn) of 0.714, balanced accuracy (BACC) of 0.824, and Matthew's correlation coefficient (MCC) of 0.679. These metrics, while a starting point, were considered "not very satisfactory."
  • UMPred-FRL (Feature Representation Learning):

    • Summary: UMPred-FRL was a later meta-predictor that aimed to improve upon iUmami-SCM. It was based on a feature representation learning approach (though still primarily manual/engineered feature types). It combined seven different feature encodings (including amino acid composition, dipeptide composition, composition transition-distribution, amphiphilic pseudo-amino acid composition, and pseudo-amino acid composition) with six well-known ML algorithms (KNN, Extremely Randomized Trees, Partial Least Squares, Random Forest, Logistic Regression, and SVM).
    • Limitations:
      • Inefficient Manual Feature Extraction: Despite using multiple feature types, the underlying feature extraction remained largely manual or engineered. The paper argues that this led to its overall prediction performance still "not being efficient enough."
      • Performance: Achieved an accuracy (ACC) of 0.888, MCC of 0.735, Sn of 0.786, and BACC of 0.860 on its benchmark dataset. While better than iUmami-SCM, the authors sought further improvements.

3.3. Technological Evolution

The field of bioinformatics, particularly in peptide function prediction, has seen a significant evolution in methodology:

  1. Early Experimental Methods: Initially, identifying umami peptides was solely reliant on laborious and costly laboratory techniques like RP-HPLC and MS-based analyses. These methods are definitive but severely limit throughput.

  2. Rule-Based/Propensity-Based Methods: The first generation of computational tools, like iUmami-SCM, emerged by using statistical properties or "propensity scores" of amino acids and dipeptides to infer umami taste. These were essentially rule-based or feature-engineering driven, without complex learning algorithms.

  3. Traditional Machine Learning with Hand-crafted Features: The next step involved applying more sophisticated ML algorithms (SVM, RF, KNN, etc.) to a broader set of hand-crafted or physicochemical features extracted from peptide sequences. UMPred-FRL represents this era, combining various engineered features with multiple ML models to improve prediction accuracy. While more powerful, this approach still required domain experts to design effective features, which can be a bottleneck and may not capture all relevant sequence information.

  4. Deep Learning for Automatic Feature Extraction: The current paper, iUP-BERT, represents the latest evolution by moving towards deep learning for automatic feature extraction. By adopting BERT, a model pre-trained on vast amounts of sequential data (language in its original context, now applied to peptides), the reliance on manual feature engineering is drastically reduced or eliminated. Deep learning models can learn intricate, high-level representations directly from raw peptide sequences, potentially capturing more subtle and complex patterns that influence umami taste. This shifts the focus from "what features to extract" to "how to train an effective deep learning model."

    iUP-BERT fits into this timeline as a cutting-edge application of deep learning, demonstrating its potential to surpass traditional ML methods that rely on pre-defined features.

3.4. Differentiation Analysis

Compared to the main methods in related work, iUP-BERT presents several core differences and innovations:

  • Automatic Deep Representation Learning Feature Extraction:

    • Differentiation: The most significant innovation is the use of BERT, a single deep learning pretrained neural network, for automatic feature extraction. This contrasts sharply with iUmami-SCM (which used artificial, single-type features based on propensity scores) and UMPred-FRL (which relied on a combination of seven different manually engineered feature encodings like amino acid composition, dipeptide composition, etc.).
    • Innovation: BERT can learn deep bidirectional language representations directly from raw peptide sequences. This means it implicitly captures complex contextual relationships between amino acids, which manual feature engineering struggles to achieve. The paper emphasizes it's the first to use a single deep-learning pretrained neural network for this task, simplifying the feature engineering pipeline.
  • Leveraging Contextual Information:

    • Differentiation: Traditional methods often treat amino acids or short k-mers in isolation or with limited context. BERT's Transformer architecture and multi-head self-attention mechanism enable it to have a global receptive field, meaning it can effectively capture global context information across the entire peptide sequence.
    • Innovation: This comprehensive understanding of context allows iUP-BERT to generate more meaningful and informative feature descriptors compared to models limited by local windows or predefined feature types.
  • Robustness through Data Balancing and Feature Selection:

    • Differentiation: While prior methods focused on feature diversity or specific ML algorithms, iUP-BERT explicitly integrates SMOTE to address the common problem of data imbalance in biological datasets and LGBM for feature selection.
    • Innovation: These steps are crucial for model robustness and generalization. SMOTE prevents bias towards the majority class, and LGBM feature selection helps remove redundant information, mitigating overfitting and improving efficiency, which was not explicitly highlighted as a core strength in previous works.
  • Overall Performance Improvement:

    • Differentiation: iUP-BERT consistently and remarkably outperforms iUmami-SCM and UMPred-FRL across multiple key metrics in both cross-validation and independent tests.

    • Innovation: This empirical superiority validates the effectiveness of the deep learning-based feature extraction approach for umami peptide prediction.

      In essence, iUP-BERT differentiates itself by replacing laborious and potentially suboptimal manual feature engineering with an intelligent, data-driven, and context-aware feature extraction process powered by BERT, further refined by robust pre-processing (SMOTE) and post-processing (LGBM feature selection) steps.

4. Methodology

4.1. Principles

The core idea of the iUP-BERT method is to leverage the powerful feature extraction capabilities of a deep learning pre-trained model, BERT, to automatically derive rich, contextual representations from raw peptide sequences. These representations are then used as input for traditional machine learning (ML) classifiers to predict whether a peptide is umami. The theoretical basis is that BERT, originally designed for natural language, can effectively learn the "language" of peptides (amino acid sequences) and capture subtle patterns critical for umami taste, without requiring manual feature engineering. The intuition is that complex biological sequences, much like human language, contain deep semantic and contextual information that can be uncovered by advanced neural networks.

The overall framework of iUP-BERT involves a systematic pipeline to optimize prediction performance:

  1. Deep Representation Learning for Feature Extraction: Using BERT (and SSA for comparison) to transform peptide sequences into high-dimensional feature vectors.
  2. Addressing Data Imbalance: Employing SMOTE to synthesize minority class samples and achieve a balanced training dataset.
  3. Feature Space Optimization: Applying LGBM for feature selection to reduce dimensionality and remove redundant features, enhancing model generalization.
  4. Machine Learning Classification: Training and evaluating various ML algorithms (KNN, LR, SVM, RF, LGBM) on the processed features.
  5. Model Optimization and Selection: Identifying the best combination of feature extraction, data balancing, feature selection, and classification algorithm based on performance metrics.

4.2. Core Methodology In-depth (Layer by Layer)

The development of iUP-BERT follows a six-step process, as depicted in Figure 1.

The following figure (Figure 1 from the original paper) illustrates the overall framework of iUP-BERT development:

Figure 1. Overview of iUP-BERT development. The illustration depicts the six main steps of model development: (1) the peptide sequence is taken as text and feature-extracted by the BERT model and the SSA method; (2) the resulting feature vectors are fused; (3) data imbalance is handled with SMOTE; (4) feature selection is applied; (5) the features are combined with multiple machine learning algorithms; and (6) the final optimized model is established. The flowchart shows the relationships and operations among these steps.

4.2.1. Step 1: Peptide Sequence Input and Feature Extraction

Upon receiving a peptide sequence as input (textual amino acid string), two deep representation learning feature extraction methods are primarily used: the pretrained SSA sequence embedding model and the pretrained BERT sequence embedding model. These models transform the raw peptide sequence into numerical feature vectors.

4.2.1.1. Pretrained SSA Embedding Model

  • Principle: SSA (Soft Symmetric Alignment) defines a novel way to compare arbitrary-length sequences by first converting them into vector representations. It leverages a BiLSTM encoder to capture sequential information.
  • Process: An initial pretrained model encodes each peptide sequence. This encoder is a three-tier stacked BiLSTM (Bidirectional Long Short-Term Memory) network, which processes the sequence from both forward and backward directions to capture long-range dependencies. The output of the BiLSTM encoder is then passed through a linear layer, which transforms the internal representations into a final embedding matrix.
  • Output: Each peptide sequence creates a final embedding matrix, denoted as $\mathbb{R}^{\mathrm{L} \times 121}$, where $L$ represents the length of the peptide (number of amino acids) and 121 is the dimension of the vector representation for each amino acid position.
  • Similarity Calculation (SSA Mechanism): The SSA mechanism is used to calculate the similarity between two amino acid sequences based on their embedded vectors.
    • Consider two embedded matrices (vector representations of sequences): $\mathrm{P}_1 = [\alpha_1, \alpha_2, \cdots, \alpha_{L_1}]$ and $\mathrm{P}_2 = [\beta_1, \beta_2, \cdots, \beta_{L_2}]$, where:
      • $\mathrm{P}_1$ and $\mathrm{P}_2$ are the embedding matrices of two distinct peptide sequences.
      • $L_1$ and $L_2$ are the lengths of the respective peptide sequences.
      • $\alpha_i$ and $\beta_j$ are the 121-dimensional vector embeddings of the $i$-th amino acid in $\mathrm{P}_1$ and the $j$-th amino acid in $\mathrm{P}_2$, respectively.
    • The similarity $\hat{\omega}$ between the two sequences is calculated as $\hat{\omega} = -\frac{1}{W} \sum_{i=1}^{L_1} \sum_{j=1}^{L_2} \tau_{ij} \left\| \alpha_i - \beta_j \right\|_1$, where:
      • $\hat{\omega}$ is the calculated similarity score.
      • $W$ is a normalization factor, defined below.
      • $\tau_{ij}$ is a weighting coefficient that reflects the alignment contribution between $\alpha_i$ and $\beta_j$.
      • $\|\alpha_i - \beta_j\|_1$ is the L1 norm (Manhattan distance) between $\alpha_i$ and $\beta_j$, measuring their dissimilarity.
    • The weighting coefficient $\tau_{ij}$ and the normalization factor $W$ are calculated as follows (Equations 4-7 in the paper; the printed symbols there are inconsistent, so the standard soft symmetric alignment notation is used here): $\varrho_{ij} = \frac{\exp\left(-\|\alpha_i - \beta_j\|_1\right)}{\sum_{k=1}^{L_1} \exp\left(-\|\alpha_k - \beta_j\|_1\right)}$, $\quad \sigma_{ij} = \frac{\exp\left(-\|\alpha_i - \beta_j\|_1\right)}{\sum_{k=1}^{L_2} \exp\left(-\|\alpha_i - \beta_k\|_1\right)}$, $\quad \tau_{ij} = \varrho_{ij} + \sigma_{ij} - \varrho_{ij}\sigma_{ij}$, $\quad W = \sum_{i=1}^{L_1} \sum_{j=1}^{L_2} \tau_{ij}$.
      • $\varrho_{ij}$: the soft alignment weight of the pair $(\alpha_i, \beta_j)$ normalized over all positions $k$ of $\mathrm{P}_1$, i.e., how strongly $\beta_j$ aligns to $\alpha_i$ relative to the rest of $\mathrm{P}_1$.
      • $\sigma_{ij}$: the soft alignment weight of the pair $(\alpha_i, \beta_j)$ normalized over all positions $k$ of $\mathrm{P}_2$, i.e., how strongly $\alpha_i$ aligns to $\beta_j$ relative to the rest of $\mathrm{P}_2$.
      • $\tau_{ij}$: a fuzzy-OR combination of the two directional alignment weights, emphasizing pairs that align strongly in either direction.
      • $W$: the sum of all $\tau_{ij}$ values, which normalizes the overall similarity $\hat{\omega}$.
  • Final Embedding: After these calculations, an average pooling procedure is applied to the $\mathbb{R}^{\mathrm{L} \times 121}$ embedding matrix to obtain a fixed-size 121-dimensional (121D) feature vector for each peptide, regardless of its length. This fixed-size vector is suitable as input for traditional machine learning models.
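
As a concrete illustration of the SSA equations above, the following NumPy sketch computes the soft symmetric alignment similarity between two embedding matrices. The random embeddings stand in for the trained BiLSTM encoder outputs and are purely illustrative.

```python
import numpy as np

def ssa_similarity(P1, P2):
    """Soft symmetric alignment similarity between embedding matrices
    P1 (L1 x d) and P2 (L2 x d), following the equations described above."""
    # Pairwise L1 distances ||alpha_i - beta_j||_1, shape (L1, L2)
    D = np.abs(P1[:, None, :] - P2[None, :, :]).sum(axis=-1)
    E = np.exp(-D)
    rho = E / E.sum(axis=0, keepdims=True)     # normalize over positions of P1
    sigma = E / E.sum(axis=1, keepdims=True)   # normalize over positions of P2
    tau = rho + sigma - rho * sigma            # fuzzy-OR combination of alignments
    W = tau.sum()                              # normalization factor
    return -(tau * D).sum() / W                # higher (less negative) = more similar

# Toy example with random 121-dimensional embeddings for two short peptides.
rng = np.random.default_rng(1)
print(ssa_similarity(rng.normal(size=(6, 121)), rng.normal(size=(9, 121))))
```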

4.2.1.2. Pretrained BERT Embedding Model

  • Principle: BERT excels at generating deep bidirectional language representations by understanding the full context of tokens (amino acids) in a sequence. It eliminates the need for manual feature engineering.
  • Process:
    1. Tokenization: Peptide sequences are directly input into the BERT model. First, they are converted into token representations, often using k-mers (contiguous subsequences of k amino acids) as tokens, similar to how words or subword units are handled in natural language.
    2. Positional Embedding: To incorporate information about the order of amino acids, positional embeddings are added to the token representations. This allows BERT to distinguish between amino acids at different positions in the sequence.
    3. Transformer Encoder Layers: The combined token and positional embeddings are then passed through the core of BERT: a stack of Transformer encoder layers. The model used here has 12 such layers.
      • Each Transformer encoder layer contains a multi-head self-attention mechanism. This mechanism allows each amino acid token to attend to all other amino acid tokens in the sequence, capturing semantic and contextual relationships across the entire peptide.
      • After the multi-head self-attention, the output passes through feed-forward neural networks and linear transformations.
    4. Pretraining Tasks: The BERT model is initially pre-trained on a large, unlabeled dataset using tasks such as the masked language model (predicting masked amino acids) to learn robust sequence representations. The cross-entropy loss function is used for backpropagation during pretraining.
  • Output: The BERT-trained model produces a 768-dimensional (768D) feature vector for each peptide sequence, which encapsulates its contextual information.
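
For readers who want to reproduce this kind of embedding, the sketch below shows one common way to obtain a fixed-length vector from a BERT-style encoder via the Hugging Face transformers library. The checkpoint name is a placeholder (the paper's exact pretrained peptide BERT and pooling strategy are not specified here), mean pooling is a simple assumption, and the output dimensionality depends on the chosen model (768 for a BERT-base encoder, as reported in the paper).

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder checkpoint: substitute the peptide/protein BERT model of your choice.
MODEL_NAME = "Rostlab/prot_bert"  # assumption; any BERT-style encoder works similarly

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed_peptide(sequence: str) -> torch.Tensor:
    """Return a fixed-size embedding by mean-pooling the last hidden states."""
    # Protein BERT tokenizers usually expect space-separated amino acids.
    tokens = tokenizer(" ".join(sequence), return_tensors="pt")
    with torch.no_grad():
        hidden = model(**tokens).last_hidden_state   # (1, seq_len, hidden_dim)
    return hidden.mean(dim=1).squeeze(0)             # (hidden_dim,)

print(embed_peptide("DEEQGFFD").shape)
```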

4.2.2. Step 2: Feature Fusion

To explore whether combining different types of deep features could yield better performance, the 121D SSA eigenvector was concatenated (combined) with the 768D BERT eigenvector. This resulted in an 889-dimensional (889D) SSA + BERT fusion feature vector. This fusion aims to combine the local and global contextual information captured by both methods. For comparison, individual feature vectors (SSA alone and BERT alone) were also evaluated.
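
A minimal sketch of this fusion step, assuming ssa_features and bert_features are NumPy arrays holding the per-peptide 121D and 768D embeddings (the array names and toy sizes are placeholders):

```python
import numpy as np

# Assumed shapes: (n_peptides, 121) SSA embeddings and (n_peptides, 768) BERT embeddings.
rng = np.random.default_rng(0)
ssa_features = rng.random((353, 121))
bert_features = rng.random((353, 768))

# Column-wise concatenation yields the 889D fusion feature described above.
fusion_features = np.hstack([ssa_features, bert_features])
print(fusion_features.shape)  # (353, 889)
```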

4.2.3. Step 3: Synthetic Minority Oversampling Technique (SMOTE)

  • Principle: Datasets often suffer from class imbalance, where the number of samples in one class (e.g., umami peptides) is much smaller than in another (e.g., non-umami peptides). This can bias ML models towards the majority class. SMOTE is used to mitigate this.
  • Process: SMOTE analyzes the minority class samples. For each minority sample, it identifies its k-nearest neighbors. Then, it randomly selects N samples from these k neighbors and generates new, synthetic minority samples by performing random linear interpolation between the original sample and its chosen neighbors. These artificially simulated new samples are added to the dataset, balancing the class distribution. This process continues until the data imbalance meets specified requirements, effectively increasing the representation of the minority class without simply copying existing samples, thus reducing the risk of overfitting.
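
For concreteness, here is a minimal sketch of the balancing step using the SMOTE implementation from the imbalanced-learn library; the toy feature matrix, class sizes, and hyperparameters are assumptions, not the authors' exact configuration.

```python
import numpy as np
from imblearn.over_sampling import SMOTE

# Toy stand-ins for the training features and labels (1 = umami, 0 = non-umami);
# in the paper the features would be the 768D BERT embeddings of the training peptides.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(353, 768))
y_train = np.array([1] * 112 + [0] * 241)

# Interpolate between each minority sample and its nearest minority neighbors.
smote = SMOTE(k_neighbors=5, random_state=42)
X_bal, y_bal = smote.fit_resample(X_train, y_train)
print(np.bincount(y_bal))  # classes are now balanced: [241 241]
```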

4.2.4. Step 4: Feature Space Optimization (LGBM Feature Selection)

  • Principle: High-dimensional feature vectors (like the 768D BERT features or 889D fusion features) can contain redundant or less informative features. This information redundancy can lead to model overfitting and increased computational cost. Feature selection aims to identify and retain only the most discriminative features.
  • Process: The LGBM (Light Gradient Boosting Machine) feature selection method was employed. LGBM naturally provides feature importance scores during its training process (due to its tree-based nature). These scores indicate how much each feature contributes to the model's predictive power. By ranking features based on their importance and selecting a subset, the method can reduce the dimensionality of the feature space while preserving or even enhancing predictive performance. This step helps in attaining the best feature combinations and creating an optimized feature space.
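
A hedged sketch of importance-based selection with LightGBM follows; the toy data, estimator settings, and the choice to keep 139 features (the optimal BERT dimensionality reported in the paper) are illustrative rather than the authors' exact procedure.

```python
import numpy as np
from lightgbm import LGBMClassifier

# Toy balanced data standing in for the SMOTE-resampled BERT features.
rng = np.random.default_rng(0)
X_bal = rng.normal(size=(482, 768))
y_bal = np.array([1] * 241 + [0] * 241)

def select_top_features(X, y, n_keep=139):
    """Rank features by LightGBM importance and keep the top n_keep columns."""
    model = LGBMClassifier(n_estimators=200, random_state=42)
    model.fit(X, y)
    order = np.argsort(model.feature_importances_)[::-1]  # most important features first
    keep = order[:n_keep]
    return X[:, keep], keep

X_selected, selected_idx = select_top_features(X_bal, y_bal)
print(X_selected.shape)  # (482, 139)
```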

4.2.5. Step 5: Machine Learning Methods

Five different ML algorithms were combined with the extracted and processed features to build classification models:

  • K-Nearest Neighbor (KNN):
    • Mechanism: For a new, unseen peptide, KNN identifies the K training samples (umami or non-umami peptides) that are "closest" to it in the feature space. The new peptide is then assigned the class label that is most common among its K nearest neighbors.
  • Logistic Regression (LR):
    • Mechanism: LR models the probability of a peptide belonging to the umami class. It uses a sigmoid function to map a linear combination of input features to a probability score between 0 and 1. If this probability exceeds a certain threshold (e.g., 0.5), the peptide is classified as umami; otherwise, it's non-umami.
  • Support Vector Machine (SVM):
    • Mechanism: SVM aims to find the optimal hyperplane that maximally separates the umami and non-umami peptide samples in the high-dimensional feature space. It focuses on the "support vectors" (data points closest to the hyperplane) to define this boundary, which helps in robust classification.
  • Random Forest (RF):
    • Mechanism: RF is an ensemble method consisting of multiple decision trees. Each tree is trained on a bootstrap sample of the data and a random subset of features. For a new peptide, each decision tree makes a prediction. The final classification for the peptide is determined by a majority vote among all the individual decision tree predictions.
  • Light Gradient Boosting Machine (LGBM):
    • Mechanism: LGBM builds an ensemble of decision trees sequentially, where each new tree corrects the errors of the previous ones. It uses a leaf-wise tree growth strategy and a histogram-based algorithm for efficiency. The final prediction is a weighted sum of the predictions from all individual trees.

4.2.6. Step 6: Final iUP-BERT Predictor Establishment

After extensive experimentation and optimization across different combinations of feature extraction methods (SSA, BERT, fusion), data balancing (SMOTE), feature selection (LGBM), and machine learning algorithms (KNN, LR, SVM, RF, LGBM), the BERT feature extraction method combined with the SVM classification model and SMOTE (and after LGBM feature selection) was selected as the optimal combination. This optimized configuration forms the final iUP-BERT predictor.
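
The following sketch mirrors this selected configuration (BERT features, SMOTE balancing, SVM with probabilistic output) as a compact scikit-learn style workflow; the toy embeddings, kernel, and hyperparameters are assumptions rather than the published settings, and the LGBM feature-selection step is omitted for brevity.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.svm import SVC

# Toy stand-ins for 768D BERT embeddings of the training peptides (1 = umami, 0 = non-umami).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(353, 768))
y_train = np.array([1] * 112 + [0] * 241)

# 1) Balance the training data with SMOTE, 2) fit an SVM with probability estimates
#    so the model can output umami probability scores for new peptides.
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)
clf = SVC(kernel="rbf", probability=True, random_state=42).fit(X_bal, y_bal)

X_new = rng.normal(size=(3, 768))        # embeddings of query peptides
print(clf.predict_proba(X_new)[:, 1])    # predicted umami probabilities
```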

5. Experimental Setup

5.1. Datasets

For a fair comparison with previous umami peptide ML models, the paper used the same peptide datasets as in UMPred-FRL [24]. These datasets are provided in Supplementary File S1.

  • Positive Samples (Umami Peptides):
    • Number: 140 unique peptides.
    • Source: Experimentally validated umami peptides collected from various studies [10,15,16,20] and from the BIOPEP-UWM databases [40].
  • Negative Samples (Non-Umami Peptides):
    • Number: 302 unique peptides.
    • Source: Identified bitter peptides [41,42], which are distinct from umami peptides.
  • Dataset Split: The complete dataset was divided into a training set and an independent test set.
    • Training Dataset: 112 umami peptides (positive) and 241 non-umami peptides (negative).

    • Independent Test Dataset: 28 umami peptides (positive) and 61 non-umami peptides (negative).

      The choice of these datasets ensures comparability with existing methods and allows for robust validation, using peptides identified through experimental means or reputable databases. The use of bitter peptides as negative samples is a common strategy in taste peptide prediction, as they represent a distinct taste category.

5.2. Evaluation Metrics

The performance of the models was evaluated using six widely accepted binary classification metrics. To provide a comprehensive understanding, these metrics are defined below, along with their mathematical formulas and explanations of symbols.

  • Key Terminology:
    • TP (True Positive): The number of umami peptides (positive samples) correctly identified as umami.
    • TN (True Negative): The number of non-umami peptides (negative samples) correctly identified as non-umami.
    • FP (False Positive): The number of non-umami peptides (negative samples) incorrectly identified as umami.
    • FN (False Negative): The number of umami peptides (positive samples) incorrectly identified as non-umami.
  1. Accuracy (ACC)

    • Conceptual Definition: ACC measures the overall proportion of correctly classified instances (both true positives and true negatives) out of all instances in the dataset. It provides a general sense of how well the model performs.
    • Mathematical Formula: $ \mathrm { ACC } = { \frac { \mathrm { T P } + \mathrm { T N } } { \mathrm { T P } + \mathrm { T N } + \mathrm { F P } + \mathrm { F N } } } $
    • Symbol Explanation:
      • TP: True Positives.
      • TN: True Negatives.
      • FP: False Positives.
      • FN: False Negatives.
  2. Matthew's Correlation Coefficient (MCC)

    • Conceptual Definition: MCC is a robust and balanced metric that is particularly useful for evaluating binary classifications, especially on imbalanced datasets. It considers all four quadrants of the confusion matrix (TP, TN, FP, FN) and produces a value between -1 and +1. A value of +1 indicates a perfect prediction, 0 indicates a random prediction, and -1 indicates a perfectly inverse prediction. It's often preferred over accuracy for imbalanced data because it accounts for true and false positives and negatives proportionally.
    • Mathematical Formula: $ \mathrm{MCC} = \frac{\mathrm{TP} \times \mathrm{TN} - \mathrm{FP} \times \mathrm{FN}}{\sqrt{(\mathrm{TP}+\mathrm{FP})(\mathrm{TP}+\mathrm{FN})(\mathrm{TN}+\mathrm{FP})(\mathrm{TN}+\mathrm{FN})}} $
    • Symbol Explanation:
      • TP: True Positives.
      • TN: True Negatives.
      • FP: False Positives.
      • FN: False Negatives.
  3. Sensitivity (Sn), also known as Recall or True Positive Rate (TPR)

    • Conceptual Definition: Sn measures the proportion of actual positive instances (umami peptides) that were correctly identified by the model. It indicates the model's ability to avoid false negatives.
    • Mathematical Formula: $ \mathrm { S n } = { \frac { \mathrm { T P } } { \mathrm { T P } + \mathrm { F N } } } $
    • Symbol Explanation:
      • TP: True Positives.
      • FN: False Negatives.
  4. Specificity (Sp), also known as True Negative Rate (TNR)

    • Conceptual Definition: Sp measures the proportion of actual negative instances (non-umami peptides) that were correctly identified by the model. It indicates the model's ability to avoid false positives.
    • Mathematical Formula: $ \mathsf { S p } = \frac { \mathrm { T N } } { \mathrm { T N } + \mathrm { F P } } $
    • Symbol Explanation:
      • TN: True Negatives.
      • FP: False Positives.
  5. Balanced Accuracy (BACC)

    • Conceptual Definition: BACC is the average of sensitivity (true positive rate) and specificity (true negative rate). It is particularly useful for imbalanced datasets because it provides a more balanced assessment of performance compared to raw accuracy, which can be misleading if one class significantly outweighs the other.
    • Mathematical Formula: $ \mathsf { B A C C } = { \frac { \mathsf { S n } + \mathsf { S p } } { 2 } } $
    • Symbol Explanation:
      • Sn: Sensitivity.
      • Sp: Specificity.
  6. Area Under the Receiver Operating Characteristic Curve (auROC)

    • Conceptual Definition: The ROC curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. It plots the True Positive Rate (Sensitivity) against the False Positive Rate (1 - Specificity). The auROC quantifies the entire 2D area underneath the ROC curve. A higher auROC value (closer to 1) indicates a better overall model performance across all possible classification thresholds, meaning the model can better distinguish between positive and negative classes. An auROC of 0.5 suggests a random classifier.
    • Mathematical Formula: (The paper does not provide an explicit formula for auROC, but its definition is standard.) $ \mathrm{auROC} = \int_{0}^{1} \mathrm{TPR}(\mathrm{FPR}) \, d\mathrm{FPR} $
    • Symbol Explanation:
      • TPR: True Positive Rate (Sensitivity).
      • FPR: False Positive Rate (1 - Specificity), computed as $\mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{TN} + \mathrm{FP}}$.
  • Model Evaluation Methods:
    • K-fold Cross-Validation: The 10-fold cross-validation method was used for model training and validation evaluation on the training set. The training set is randomly divided into 10 equal parts (folds). In each iteration, 9 folds are used for training and the remaining 1 fold is used for validation. This process is repeated 10 times, with each fold serving as the validation set exactly once. The performance of the model is then evaluated by averaging the 10 validation scores, providing a more robust estimate of model performance than a single train/test split.
    • Independent Testing: This method evaluates the trained model on a completely separate dataset that was not used during any part of the training or cross-validation process. This provides an unbiased assessment of the model's generalization ability to new, unseen data. A good model must perform well on both cross-validation and independent testing.
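
For concreteness, here is a minimal scikit-learn sketch that computes all six metrics inside a stratified 10-fold cross-validation; the placeholder features, labels, and classifier settings are assumptions used only to illustrate the evaluation protocol.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import (accuracy_score, matthews_corrcoef, recall_score,
                             balanced_accuracy_score, roc_auc_score, confusion_matrix)

rng = np.random.default_rng(0)
X = rng.normal(size=(353, 768))          # placeholder features (e.g., BERT embeddings)
y = np.array([1] * 112 + [0] * 241)      # 1 = umami, 0 = non-umami

scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=10, shuffle=True, random_state=42).split(X, y):
    clf = SVC(probability=True).fit(X[train_idx], y[train_idx])
    y_pred = clf.predict(X[val_idx])
    y_prob = clf.predict_proba(X[val_idx])[:, 1]
    tn, fp, fn, tp = confusion_matrix(y[val_idx], y_pred).ravel()
    scores.append({
        "ACC": accuracy_score(y[val_idx], y_pred),
        "MCC": matthews_corrcoef(y[val_idx], y_pred),
        "Sn": recall_score(y[val_idx], y_pred),           # sensitivity / TPR
        "Sp": tn / (tn + fp),                             # specificity / TNR
        "BACC": balanced_accuracy_score(y[val_idx], y_pred),
        "auROC": roc_auc_score(y[val_idx], y_prob),
    })

mean_scores = {metric: np.mean([s[metric] for s in scores]) for metric in scores[0]}
print(mean_scores)
```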

5.3. Baselines

The iUP-BERT model's performance was compared against two existing methods for umami peptide prediction:

  1. iUmami-SCM:

    • Description: This was the first sequence-based umami peptide predictor, developed by Ha et al. in 2020. It utilized the scoring card method (SCM) in conjunction with estimated propensity scores of amino acids and dipeptides.
    • Representativeness: It serves as a foundational baseline, representing early computational efforts using artificial feature extraction.
  2. UMPred-FRL:

    • Description: A more recent ML-based meta-predictor for umami peptides, created by Charkwan et al. in 2021. It employed a feature representation learning approach (though still based on engineered features) and combined seven different feature encodings with six well-known ML algorithms.

    • Representativeness: It represents the state-of-the-art among methods relying on diverse, manually engineered features combined with traditional ML algorithms prior to the deep learning era for this specific task.

      These baselines are representative because they showcase the progression of computational approaches to umami peptide prediction, from simpler feature engineering to more complex combinations of engineered features and ML algorithms. By outperforming these, iUP-BERT demonstrates the advancement brought by deep learning feature extraction.

6. Results & Analysis

6.1. Preliminary Performance of Models Trained with or without SMOTE

The first step in the experimental analysis was to evaluate the impact of the Synthetic Minority Oversampling Technique (SMOTE) on model performance, especially given the imbalanced nature of the dataset (112 umami vs. 241 non-umami peptides in the training set). This was done by comparing models built with and without SMOTE using two deep representation learning features (SSA and BERT) and five ML algorithms (KNN, LR, SVM, RF, and LGBM). The evaluation was performed using repeated stratified 10-fold cross-validation tests (10 times).

The following figure (Figure 2 from the original paper) shows the performance metrics of SSA and BERT features using different algorithms, both pretrained with and without SMOTE, in 10-fold cross-validation.

Figure 2. The performance of 10-fold cross-validation metrics of SSA and BERT features using different algorithms pretrained with or without SMOTE. (A) KNN; (B) LR; (C) SVM; (D) RF; (E) LGBM.

Analysis of SMOTE's Effect (Figure 2 and Table 1):

  • Overall Improvement: For 10-fold cross-validation results, all five ML algorithms showed improved performance across ACC, MCC, Sn, auROC, and BACC when SMOTE was applied, for both SSA and BERT features. This highlights SMOTE's effectiveness in addressing data imbalance.

  • Specificity (Sp) Exception: The Sp metric was the only exception, sometimes showing a slight decrease with SMOTE. For instance, the best Sp for SSA with SMOTE was 0.913 (using SVM), which was lower than 0.938 (using RF without SMOTE). However, the overall best Sp (0.959) was still obtained from the BERT feature optimized with SMOTE (using SVM, per Table 1). This suggests that SMOTE might slightly increase false positives to improve sensitivity, but the overall balance is better.

  • Quantified Improvement (Cross-Validation):

    • For SSA features, ACC for KNN, LR, SVM, RF, and LGBM improved by 1.08% to 10.88% with SMOTE.
    • Similar improvements were observed with BERT features.
  • Balanced Accuracy (BACC) Behavior: A noteworthy observation was that BACC scores became identical to ACC scores when SMOTE was used in cross-validation. This is because SMOTE effectively balances the dataset, making the true positive rate (Sn) and true negative rate (Sp) more comparable, hence their average (BACC) converges with overall accuracy. This indicates that the data became balanced after SMOTE application.

  • Independent Test Performance: The positive impact of SMOTE was consistent in the independent test results as well. The best scores across five metrics for both SSA and BERT features were achieved when SMOTE was used. For example, for SSA with SMOTE, ACC was 0.866, MCC 0.683, Sn 0.814, auROC 0.916, and BACC 0.825.

    The following are the results from Table 1 of the original paper:

    Feature Model SMOTE Dim | 10-Fold Cross-Validation: ACC MCC Sn Sp auROC BACC | Independent Test: ACC MCC Sn Sp auROC BACC
    SSA KNN - 121 0.833 0.607 0.663 0.913 0.849 0.788 0.825 0.575 0.596 0.930 0.876 0.763
    LR - 121 0.776 0.485 0.634 0.842 0.814 0.738 0.780 0.498 0.679 0.826 0.839 0.752
    SVM - 121 0.827 0.588 0.613 0.925 0.909 0.769 0.857 0.658 0.682 0.944 0.907 0.806
    RF - 121 0.836 0.609 0.618 0.938 0.902 0.778 0.826 0.578 0.557 0.949 0.879 0.753
    LGBM - 121 0.852 0.664 0.721 0.913 0.896 0.817 0.827 0.583 0.621 0.921 0.880 0.771
    KNN + 121 0.842 0.709 0.962 0.721 0.930 0.841 0.787 0.555 0.814 0.774 0.885 0.794
    LR + 121 0.857 0.722 0.904 0.809 0.902 0.856 0.843 0.640 0.682 0.813 0.916 0.748
    SVM + 121 0.917 0.835 0.921 0.913 0.967 0.917 0.866 0.675 0.696 0.941 0.936 0.819
    RF + 121 0.915 0.833 0.921 0.908 0.967 0.915 0.866 0.683 0.714 0.936 0.895 0.825
    LGBM + 121 0.917 0.835 0.929 0.904 0.964 0.917 0.827 0.585 0.643 0.911 0.887 0.777
    BERT KNN - 768 0.836 0.610 0.679 0.908 0.879 0.794 0.807 0.537 0.618 0.893 0.872 0.756
    LR - 768 0.836 0.649 0.820 0.842 0.888 0.833 0.850 0.660 0.743 0.907 0.912 0.825
    SVM - 768 0.830 0.613 0.727 0.880 0.910 0.803 0.820 0.599 0.770 0.841 0.875 0.806
    RF - 768 0.859 0.667 0.714 0.925 0.925 0.820 0.819 0.567 0.643 0.900 0.900 0.771
    LGBM - 768 0.830 0.609 0.705 0.890 0.898 0.797 0.830 0.596 0.668 0.905 0.915 0.786
    KNN + 768 0.884 0.775 0.954 0.813 0.928 0.884 0.820 0.625 0.857 0.803 0.881 0.830
    LR + 768 0.911 0.825 0.959 0.863 0.952 0.911 0.843 0.635 0.750 0.885 0.905 0.818
    SVM + 768 0.923 0.849 0.888 0.959 0.984 0.923 0.876 0.706 0.714 0.951 0.926 0.832
    RF + 768 0.898 0.797 0.909 0.887 0.967 0.898 0.896 0.793 0.905 0.887 0.971 0.897
    LGBM + 768 0.896 0.793 0.905 0.888 0.971 0.896 0.843 0.635 0.750 0.852 0.920 0.818

Best performance values are bold and underlined. SSA: Soft Symmetric Alignment; BERT: Bidirectional Encoder Representations from Transformer. KNN: k-nearest neighbor; LR: logistic regression; SVM: support vector machine; RF: random forest; LGBM: light gradient boosting machine. "-" indicates without the SMOTE method; "+" indicates with the SMOTE method.

6.2. The Effect of Different Feature Types

This section focuses on comparing the effectiveness of SSA versus BERT features, primarily in combination with SMOTE and different ML algorithms.

Analysis of Feature Type Effect (Figure 2 and Table 1):

  • BERT's Superiority in Cross-Validation: The BERT feature vector, when combined with the SVM algorithm and SMOTE, consistently showed the best performance across most metrics (ACC, MCC, Sp, auROC, and BACC) in the 10-fold cross-validation.
    • ACC: 0.923
    • MCC: 0.849
    • Sp: 0.959
    • auROC: 0.984
    • BACC: 0.923 These values were significantly higher compared to other combinations.
  • SSA's Specific Strength: The SSA feature vector, when conjugated with KNN and SMOTE, achieved the highest Sn (0.962) in cross-validation, outperforming all BERT combinations. This suggests SSA might be particularly good at identifying positive samples, even if other metrics are slightly lower.
  • Independent Test Nuances: In the independent test, while BERT-SVM-SMOTE still performed very well, some other BERT combinations, specifically BERT-RF-SMOTE, showed slightly higher scores for some metrics:
    • BERT-RF-SMOTE had an ACC of 0.896, MCC of 0.793, Sn of 0.905, auROC of 0.971, and BACC of 0.897. These were marginally higher than BERT-SVM-SMOTE for these specific metrics in the independent test.
    • However, BERT-SVM-SMOTE had a higher Sp (0.951) compared to BERT-RF-SMOTE (0.887).
  • Conclusion on Best Model: Despite the slight variations in independent test results, the paper concludes that the BERT-SVM-SMOTE combination was still considered the "best model out of all the combinations" due to its consistently strong performance across both cross-validation and independent tests, particularly its high ACC, MCC, and auROC in cross-validation, and robust Sp in independent test. This emphasizes BERT's capability to extract highly effective features.

6.3. The Effect of Feature Fusion

To investigate if combining features from both SSA and BERT could further enhance performance, a fusion feature was created by concatenating the 121D SSA eigenvector and the 768D BERT eigenvector, resulting in an 889D vector. This fusion feature was then tested with the five ML algorithms, with and without SMOTE.

The following are the results from Table 2 of the original paper:

Feature Model SMOTE Dim | 10-Fold Cross-Validation: ACC MCC Sn Sp auROC BACC | Independent Test: ACC MCC Sn Sp auROC BACC
SSA + BERT KNN - 889 0.836 0.610 0.679 0.909 0.908 0.794 0.820 0.576 0.679 0.885 0.900 0.782
LR - 889 0.844 0.640 0.750 0.887 0.900 0.819 0.876 0.716 0.821 0.902 0.910 0.862
SVM - 889 0.858 0.667 0.732 0.917 0.921 0.825 0.854 0.658 0.750 0.902 0.906 0.826
RF - 889 0.841 0.620 0.643 0.934 0.906 0.788 0.831 0.599 0.679 0.902 0.906 0.790
LGBM - 889 0.813 0.553 0.625 0.900 0.892 0.763 0.831 0.606 0.714 0.895 0.921 0.800
KNN + 889 0.888 0.787 0.971 0.805 0.932 0.888 0.831 0.643 0.820 0.883 0.898 0.838
LR + 889 0.917 0.836 0.954 0.880 0.951 0.917 0.876 0.724 0.857 0.906 0.906 0.871
SVM + 889 0.934 0.867 0.938 0.929 0.980 0.934 0.820 0.563 0.571 0.934 0.916 0.733
RF + 889 0.915 0.830 0.929 0.900 0.968 0.915 0.820 0.592 0.750 0.852 0.919 0.801
LGBM + 889 0.919 0.840 0.950 0.888 0.963 0.919 0.843 0.643 0.786 0.869 0.919 0.827

Best performance values are bold and underlined. SSA: Soft Symmetric Alignment; BERT: Bidirectional Encoder Representations from Transformer. KNN: k-nearest neighbor; LR: logistic regression; SVM: support vector machine; RF: random forest; LGBM: light gradient boosting machine. "-" indicates without the SMOTE method; "+" indicates with the SMOTE method.

The following figure (Figure 3 from the original paper) displays the performance metrics of individual and fused features with SMOTE, according to the machine learning methods used. (A) Ten-fold cross-validation results. (B) Independent test results.

Figure 3. The performance metrics of individual and fused features with SMOTE, according to the machine learning methods used. (A) Ten-fold cross-validation results. (B) Independent test results.

Analysis of Feature Fusion Effect (Table 2 and Figure 3):

  • Cross-Validation: Consistent with previous findings, SMOTE significantly improved performance for fusion features. The best performance for fusion features in 10-fold cross-validation was achieved with SVM (ACC 0.934, MCC 0.867, Sn 0.938, Sp 0.929, auROC 0.980, BACC 0.934). These scores were slightly superior to the BERT feature alone (compare ACC 0.934 vs. 0.923, MCC 0.867 vs. 0.849, Sn 0.971 vs. 0.959, BACC 0.934 vs. 0.923). This initially suggested a benefit from combining features.
  • Independent Test: However, the independent test results revealed a different picture. The best performance of the fusion feature (achieved by LR with SMOTE: ACC 0.876, MCC 0.724, Sn 0.857, Sp 0.902, auROC 0.910, BACC 0.871) was lower than the corresponding scores obtained from the BERT feature alone with SMOTE (e.g., BERT-RF-SMOTE had ACC 0.896, MCC 0.793, Sn 0.905, Sp 0.887, auROC 0.971, BACC 0.897).
  • Conclusion: The authors concluded that the feature fusion of SSA and BERT was not a beneficial choice for model optimization in umami peptide prediction. While it showed slight improvements in cross-validation (which can sometimes be optimistic), it failed to generalize better to unseen data in the independent test, performing worse than BERT features alone. This suggests that the 121D SSA features might introduce redundancy or noise that hinders the generalization of the powerful 768D BERT features.

6.4. The Effect of Feature Selection

Given that feature fusion did not yield improvements and higher dimensionality carries a risk of overfitting, feature selection was applied using the LGBM method. This aimed to remove redundant and indistinguishable features to find an optimized feature space for umami peptide prediction. The optimized models (after SMOTE and feature selection) were then evaluated.
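
The sketch below illustrates one way such importance-based selection can be implemented with LightGBM; the paper does not detail the exact selection procedure or how the retained dimensionality was chosen, so the helper function and its parameters are assumptions.

```python
import numpy as np
from lightgbm import LGBMClassifier

def select_top_features(X, y, k, random_state=0):
    """Rank features by LightGBM importance and keep the k most important columns."""
    booster = LGBMClassifier(n_estimators=200, random_state=random_state)
    booster.fit(X, y)
    order = np.argsort(booster.feature_importances_)[::-1]  # most important first
    kept = order[:k]
    return X[:, kept], kept

# Example (k is a tunable hyperparameter; the paper reports 139 retained dimensions
# for the BERT-SVM model, but the selection shown here is only an assumed procedure):
# X_sel, kept_idx = select_top_features(bert_features, labels, k=139)
```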

The following are the results from Table 3 of the original paper:

(CV = 10-fold cross-validation; Test = independent test.)

| Feature | Model | SMOTE | Dim | CV ACC | CV MCC | CV Sn | CV Sp | CV auROC | CV BACC | Test ACC | Test MCC | Test Sn | Test Sp | Test auROC | Test BACC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SSA | KNN | + | 43 | 0.892 | 0.788 | 0.942 | 0.842 | 0.938 | 0.892 | 0.921 | 0.825 | 0.929 | 0.918 | 0.914 | 0.923 |
| SSA | LR | + | 29 | 0.884 | 0.768 | 0.900 | 0.867 | 0.938 | 0.884 | 0.887 | 0.745 | 0.857 | 0.902 | 0.919 | 0.879 |
| SSA | SVM | + | 29 | 0.909 | 0.820 | 0.946 | 0.871 | 0.962 | 0.909 | 0.899 | 0.761 | 0.786 | 0.951 | 0.913 | 0.868 |
| SSA | RF | + | 39 | 0.892 | 0.784 | 0.892 | 0.892 | 0.957 | 0.892 | 0.887 | 0.735 | 0.786 | 0.934 | 0.914 | 0.860 |
| SSA | LGBM | + | 39 | 0.902 | 0.805 | 0.905 | 0.900 | 0.958 | 0.902 | 0.899 | 0.763 | 0.821 | 0.934 | 0.919 | 0.878 |
| BERT | KNN | + | 163 | 0.888 | 0.786 | 0.967 | 0.809 | 0.950 | 0.888 | 0.865 | 0.723 | 0.929 | 0.836 | 0.909 | 0.882 |
| BERT | LR | + | 29 | 0.876 | 0.751 | 0.884 | 0.867 | 0.937 | 0.876 | 0.887 | 0.739 | 0.821 | 0.918 | 0.913 | 0.870 |
| BERT | SVM | + | 139 | 0.940 | 0.881 | 0.963 | 0.917 | 0.971 | 0.940 | 0.899 | 0.774 | 0.893 | 0.902 | 0.933 | 0.897 |
| BERT | RF | + | 77 | 0.921 | 0.843 | 0.938 | 0.905 | 0.973 | 0.921 | 0.865 | 0.711 | 0.821 | 0.895 | 0.923 | 0.853 |
| BERT | LGBM | + | 174 | 0.917 | 0.834 | 0.929 | 0.905 | 0.973 | 0.917 | 0.876 | 0.694 | 0.786 | 0.918 | 0.916 | 0.852 |
| SSA + BERT | KNN | + | 65 | 0.900 | 0.806 | 0.954 | 0.846 | 0.942 | 0.900 | 0.876 | 0.742 | 0.929 | 0.852 | 0.898 | 0.891 |
| SSA + BERT | LR | + | 79 | 0.915 | 0.832 | 0.950 | 0.880 | 0.941 | 0.915 | 0.887 | 0.745 | 0.857 | 0.902 | 0.909 | 0.879 |
| SSA + BERT | SVM | + | 99 | 0.932 | 0.864 | 0.950 | 0.913 | 0.981 | 0.932 | 0.887 | 0.745 | 0.857 | 0.902 | 0.909 | 0.879 |
| SSA + BERT | RF | + | 168 | 0.909 | 0.818 | 0.925 | 0.892 | 0.974 | 0.909 | 0.876 | 0.716 | 0.821 | 0.902 | 0.917 | 0.862 |
| SSA + BERT | LGBM | + | 114 | 0.919 | 0.839 | 0.942 | 0.896 | 0.979 | 0.919 | 0.876 | 0.724 | 0.857 | 0.885 | 0.920 | 0.871 |

Best performance values are bold and underlined. SSA: Soft Symmetric Alignment; BERT: Bidirectional Encoder Representations from Transformer. KNN: k-nearest neighbor; LR: logistic regression; SVM: support vector machine; RF: random forest; LGBM: light gradient boosting machine. "+" indicates with the SMOTE method.

The following figure (Figure 4 from the original paper) presents the performance metrics of individual and fusion features using selected features and different algorithms. (A) Ten-fold cross-validation results. (B) Independent test results.

Figure 4. The performance metrics of individual and fusion features using selected features and different algorithms. (A) Ten-fold cross-validation results. (B) Independent test results.

Analysis of Feature Selection Effect (Table 3 and Figure 4):

  • Cross-Validation Improvement: Feature selection significantly improved the models' performance. The BERT feature encoding, combined with the SVM algorithm (with SMOTE) and 139 dimensions after LGBM feature selection, achieved the best performance in 10-fold cross-validation across ACC, MCC, Sp, and BACC.
    • ACC: 0.940 (an improvement of 0.86% to 7.31% over other options).
    • MCC: 0.881 (an improvement of 1.97% to 17.31%).
    • Sp: 0.917 (an improvement of 0.44% to 13.35%).
    • BACC: 0.940 (equal to ACC, consistent with the SMOTE-balanced data).
  These results underscore the value of selecting an optimized feature descriptor.
  • Independent Test Performance:
    • In the independent test, the SSA-KNN combination (43D) achieved the highest ACC (0.921), MCC (0.825), Sn (0.929), and BACC (0.923), while SSA-SVM (29D) achieved the highest Sp (0.951).
    • However, the BERT feature (139D) with SVM still yielded the highest auROC (0.933) in the independent test, which is a strong indicator of overall classifier quality across thresholds. Furthermore, its scores for ACC (0.899), MCC (0.774), Sn (0.893), and BACC (0.897) were the second best among all models, demonstrating consistent strong performance.
  • Conclusion on Optimal Model: Considering both cross-validation and independent testing results, the BERT feature based on the SVM algorithm (with SMOTE and 139 selected dimensions) was deemed the best option for umami peptide prediction. This configuration balances high performance across multiple metrics and robust generalization.
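
For illustration, a minimal end-to-end sketch of this configuration (selected BERT features, SMOTE oversampling applied only inside the cross-validation training folds, and an SVM classifier) is given below. All hyperparameters, the scaling step, and the placeholder data are assumptions rather than the authors' reported settings.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder data standing in for the 139 selected BERT dimensions and the labels
# (the real dataset has 112 positive and 241 negative peptides).
rng = np.random.default_rng(0)
X_sel = rng.normal(size=(353, 139))
y = rng.integers(0, 2, size=353)

# SMOTE sits inside an imblearn Pipeline so oversampling is applied only to the
# training folds during cross-validation, never to the held-out fold.
model = Pipeline(steps=[
    ("scale", StandardScaler()),        # feature scaling is an assumption, common for SVMs
    ("smote", SMOTE(random_state=0)),   # oversample the minority (umami) class
    ("svm", SVC(kernel="rbf", probability=True, random_state=0)),
])

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_validate(
    model, X_sel, y, cv=cv,
    scoring=["accuracy", "matthews_corrcoef", "roc_auc", "balanced_accuracy"],
)
print({k: v.mean() for k, v in scores.items() if k.startswith("test_")})
```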

6.5. Comparison of iUP-BERT with Existing Models

The efficacy and robustness of the final iUP-BERT model (specifically, the BERT-SVM-SMOTE combination with 139 selected dimensions) were then rigorously compared against the previously discussed existing methods: iUmami-SCM and UMPred-FRL.

The following are the results from Table 4 of the original paper:

(CV = 10-fold cross-validation; Test = independent test.)

| Classifier | CV ACC | CV MCC | CV Sn | CV Sp | CV auROC | CV BACC | Test ACC | Test MCC | Test Sn | Test Sp | Test auROC | Test BACC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| iUP-BERT | 0.940 | 0.881 | 0.963 | 0.917 | 0.971 | 0.940 | 0.899 | 0.774 | 0.893 | 0.902 | 0.933 | 0.897 |
| iUmami-SCM | 0.935 | 0.864 | 0.947 | 0.930 | 0.939 | 0.939 | 0.865 | 0.679 | 0.714 | 0.934 | 0.898 | 0.824 |
| UMPred-FRL | 0.921 | 0.814 | 0.847 | 0.955 | 0.938 | 0.901 | 0.888 | 0.735 | 0.786 | 0.934 | 0.919 | 0.860 |

Best performance values are in bold and are underlined.
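
For readers reproducing these comparisons, the six reported metrics can be computed from a classifier's predictions with scikit-learn. The sketch below is illustrative only and assumes binary labels encoded as 1 (umami) and 0 (non-umami), with `y_score` being the predicted positive-class probability.

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             matthews_corrcoef, recall_score, roc_auc_score)

def evaluate(y_true, y_pred, y_score):
    """Metrics used in the comparison tables; y_score is the positive-class probability."""
    return {
        "ACC":   accuracy_score(y_true, y_pred),
        "MCC":   matthews_corrcoef(y_true, y_pred),
        "Sn":    recall_score(y_true, y_pred, pos_label=1),  # sensitivity (recall on umami)
        "Sp":    recall_score(y_true, y_pred, pos_label=0),  # specificity (recall on non-umami)
        "auROC": roc_auc_score(y_true, y_score),
        "BACC":  balanced_accuracy_score(y_true, y_pred),
    }
```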

Analysis of Comparison (Table 4):

  • 10-Fold Cross-Validation:
    • iUP-BERT (ACC 0.940, MCC 0.881, Sn 0.963, auROC 0.971, BACC 0.940) outperformed both iUmami-SCM (ACC 0.935, MCC 0.864, Sn 0.947, auROC 0.939, BACC 0.939) and UMPred-FRL (ACC 0.921, MCC 0.814, Sn 0.847, auROC 0.938, BACC 0.901) across ACC, MCC, Sn, auROC, and BACC. Both baselines had a higher Sp in cross-validation (iUmami-SCM 0.930, UMPred-FRL 0.955) than iUP-BERT (0.917), but iUP-BERT's overall performance was superior.
  • Independent Test: This is the most critical evaluation for generalization ability. iUP-BERT demonstrated remarkably better results in all five primary metrics (ACC, MCC, Sn, auROC, BACC) compared to both baselines:
    • ACC: iUP-BERT (0.899) was higher by 1.23% (vs. UMPred-FRL 0.888) to 3.93% (vs. iUmami-SCM 0.865).

    • MCC: iUP-BERT (0.774) was higher by 5.31% (vs. UMPred-FRL 0.735) to 13.99% (vs. iUmami-SCM 0.679).

    • Sn: iUP-BERT (0.893) was significantly higher by 13.6% (vs. UMPred-FRL 0.786) to 25.07% (vs. iUmami-SCM 0.714). This indicates iUP-BERT is much better at identifying actual umami peptides.

    • auROC: iUP-BERT (0.933) was higher by 1.52% (vs. UMPred-FRL 0.919) to 3.90% (vs. iUmami-SCM 0.898).

    • BACC: iUP-BERT (0.897) was higher by 4.30% (vs. UMPred-FRL 0.860) to 8.86% (vs. iUmami-SCM 0.824).

    • Sp: The one exception was Sp, where both baselines (0.934) scored higher than iUP-BERT (0.902) in the independent test. iUP-BERT trades a modest amount of specificity for its markedly higher sensitivity, while still correctly identifying most non-umami peptides.

      Conclusion: The comparisons strongly confirm that iUP-BERT is more effective, reliable, and stable than existing methods for umami peptide prediction, particularly due to its superior generalization capabilities on unseen data. This validates the effectiveness of using BERT for deep representation learning of peptide features.

6.6. Feature Analysis Using Feature Projection and Decision Function

To provide a visual explanation for iUP-BERT's excellent performance, the 139-dimensional BERT feature space (optimized by feature selection) was reduced to a 2-dimensional plane using two dimensionality reduction techniques: Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP). This allows for a visual inspection of how well umami and non-umami peptides are separated in the learned feature space. Additionally, the decision function boundary of the SVM model was plotted.
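
A minimal sketch of this kind of visualization is given below. Note that it refits an SVM in the projected 2D space purely to draw a boundary, which is an assumption made for illustration and may differ from how the original figure was produced; `X_sel` and `y` denote the 139D selected BERT features and the 0/1 labels.

```python
import matplotlib.pyplot as plt
import numpy as np
import umap  # provided by the umap-learn package
from sklearn.decomposition import PCA
from sklearn.svm import SVC

def plot_projection(X_sel, y, reducer, title):
    """Project features to 2D, refit an SVM there, and shade its decision regions."""
    X2 = reducer.fit_transform(X_sel)
    clf = SVC(kernel="rbf").fit(X2, y)  # 2D surrogate model, used only for the boundary plot
    xx, yy = np.meshgrid(np.linspace(X2[:, 0].min(), X2[:, 0].max(), 300),
                         np.linspace(X2[:, 1].min(), X2[:, 1].max(), 300))
    zz = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.contourf(xx, yy, zz > 0, alpha=0.3)  # positive vs. negative decision regions
    plt.scatter(X2[y == 1, 0], X2[y == 1, 1], c="red", s=12, label="umami")
    plt.scatter(X2[y == 0, 0], X2[y == 0, 1], c="blue", s=12, label="non-umami")
    plt.title(title)
    plt.legend()
    plt.show()

# plot_projection(X_sel, y, PCA(n_components=2), "PCA projection")
# plot_projection(X_sel, y, umap.UMAP(n_components=2, random_state=0), "UMAP projection")
```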

The following figure (Figure 5 from the original paper) shows the dimension reduction visualization of umami peptide BERT features and decision function boundary analysis of the SVM model.

Figure 5. Dimension reduction visualization of umami peptide BERT features and decision function boundary analysis of the SVM model. The red dots are umami peptides and the blue dots are non-umami peptides.

Analysis of Visualization (Figure 5):

  • Separation of Classes: In both PCA (Figure 5A) and UMAP (Figure 5B) visualizations, the red dots (representing umami peptides) and blue dots (representing non-umami peptides) show a relatively concentrated distribution in two distinct areas. The yellow section indicates the positive (umami) sample area, and the purple section indicates the negative (non-umami) sample area.
  • Effectiveness of BERT Features: The clear separation between the red and blue clusters demonstrates that the 139-dimensional BERT features, even after being reduced to 2D, are highly discriminative. The BERT model successfully learned features that effectively differentiate umami peptides from non-umami peptides.
  • SVM Decision Boundary: The drawn decision function boundary (a line in 2D) visually confirms that the SVM model can distinguish most positive and negative samples. The boundary largely separates the red and blue clusters, aligning with the high classification performance observed in the quantitative metrics.
  • Misclassified Samples: Despite the good separation, the visualization also shows some misclassified samples. These are the red dots appearing in the purple area or blue dots in the yellow area, indicating instances where the SVM model made incorrect predictions.
  • Future Improvement Implications: The presence of misclassified samples suggests that while the BERT features are powerful, there is still room for improvement. The authors note that "better feature extraction methods or more suitable machine learning methods were needed for modeling, to better identify umami peptide sequences from non-umami peptide sequences in the future." This implies that even more advanced deep learning architectures, larger datasets, or fine-tuning approaches could potentially achieve even cleaner separation and fewer misclassifications.

6.7. Construction of the Web Server of iUP-BERT

To maximize the utility and accessibility of the iUP-BERT predictor for the research community and industry, an open-access web server was developed and made available.

  • Access: The web server can be accessed at https://www.aibiochem.net/servers/iUP-BERT/ (accessed on 23 September 2022).
  • Purpose: This server allows users to rapidly and efficiently screen potential umami peptides by simply inputting their peptide sequences. It transforms the computational model into a practical tool, facilitating high-throughput prediction without requiring users to set up the complex deep learning and machine learning environment locally.
  • Impact: This contribution significantly enhances the usability of the research, making iUP-BERT a valuable resource for exploring new umami peptides and promoting innovation in the food seasoning industry.

7. Conclusion & Reflections

7.1. Conclusion Summary

This study successfully developed iUP-BERT, a novel and highly accurate machine learning prediction model for identifying umami peptides solely based on their amino acid sequences. The model's core innovation lies in its utilization of BERT, a single deep representation learning pretrained neural network, for automatic and highly effective feature extraction. The methodology systematically optimized performance by incorporating SMOTE to address data imbalance and applying LGBM for feature selection, ultimately identifying the BERT-SVM-SMOTE model with 139 selected dimensions as the most robust and efficient configuration. This work represents the first application of BERT for computational identification of umami peptides.

Extensive validation through 10-fold cross-validation and independent testing unequivocally demonstrated iUP-BERT's superior efficacy and robustness. Compared to existing methods like iUmami-SCM and UMPred-FRL, iUP-BERT achieved significant improvements across critical metrics in the independent test, including ACC (1.23–3.93% higher), MCC (5.31–13.99% higher), Sn (13.6–25.07% higher), auROC (1.52–3.90% higher), and BACC (4.30–8.86% higher). Finally, an open-access web server for iUP-BERT was built, transforming this research into a practical tool to accelerate the discovery of new umami peptides and advance the food seasoning industry.

7.2. Limitations & Future Work

The authors acknowledge several limitations and suggest future research directions:

  • Limited Training Sample Size: The current dataset size for training was relatively low (112 positive and 241 negative samples). The authors suggest that larger training sample sizes generally improve the prediction performance of deep learning models.
  • Further BERT Optimization: While BERT was used for feature extraction, the authors propose that fine-tuning the BERT model specifically for the umami peptide prediction task (rather than just using its off-the-shelf embeddings) could lead to an even more accurate model. Fine-tuning involves continuing the training of the pre-trained BERT model on the specific umami peptide dataset, allowing it to adapt its internal representations more precisely to this domain (a hypothetical sketch of such fine-tuning is given after this list).
  • Expansion of Datasets: Future efforts should focus on constructing an optimized, larger-sized dataset with a higher number of experimentally identified umami and non-umami peptides. This would provide more robust data for training and validation, potentially leading to better model performance.
  • Broader Applications: The overall goal is to use iUP-BERT as a powerful tool for exploring new umami peptides, which can help in improving the palatability of dietary supplements and promoting the umami seasoning industry.
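
As referenced above, the following is a hypothetical sketch of what such task-specific fine-tuning could look like with the Hugging Face transformers library. The model name (Rostlab/prot_bert), the spacing convention, and all hyperparameters are assumptions for illustration, not the authors' setup.

```python
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "Rostlab/prot_bert"  # assumed protein BERT checkpoint, not necessarily the paper's
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, do_lower_case=False)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

class PeptideDataset(Dataset):
    """Wraps peptide strings (e.g., 'EAGIQ') and 0/1 umami labels for the Trainer."""
    def __init__(self, sequences, labels):
        spaced = [" ".join(seq) for seq in sequences]  # ProtBert expects space-separated residues
        self.encodings = tokenizer(spaced, truncation=True, padding=True, max_length=64)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

# Hypothetical usage with placeholder training data:
# train_ds = PeptideDataset(train_sequences, train_labels)
# trainer = Trainer(
#     model=model,
#     args=TrainingArguments(output_dir="iup-bert-finetune", num_train_epochs=5,
#                            per_device_train_batch_size=16, learning_rate=2e-5),
#     train_dataset=train_ds,
# )
# trainer.train()
```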

7.3. Personal Insights & Critique

This paper presents a solid application of cutting-edge deep learning techniques to a practical problem in food science. My insights and critique are as follows:

  • Strength of Deep Learning for Feature Extraction: The paper strongly demonstrates the power of deep representation learning, specifically BERT, in automatically extracting meaningful features from biological sequences. This is a significant leap from manual feature engineering, which often requires extensive domain knowledge and can be prone to missing subtle patterns. The consistent outperformance of iUP-BERT over previous methods is a testament to this paradigm shift. The success of BERT in a non-NLP domain like peptide prediction highlights its versatility and the underlying similarity in sequential data processing.

  • Importance of Ancillary Techniques: The rigorous inclusion of SMOTE for data balancing and LGBM for feature selection is commendable. These steps, often overlooked or minimally explored in some deep learning papers, were crucial for optimizing the model and ensuring its robustness and generalization ability. The observation that SMOTE made BACC redundant in cross-validation is a clear indicator of its effectiveness in achieving class balance.

  • Transparency and Reproducibility: Providing an open-access web server significantly enhances the impact and utility of this research. It makes the developed predictor readily available to a wider audience, facilitating its adoption and further research, which is a great practice for computational biology tools.

  • Limitations and Areas for Improvement:

    • Dataset Size: As acknowledged by the authors, the dataset size (112 positive, 241 negative) is relatively small for deep learning models. While BERT is pre-trained, its fine-tuning or the overall performance of the classifier could benefit substantially from a much larger, more diverse, and rigorously validated dataset. The "gold standard" for umami peptides is still evolving, which might contribute to this limitation.
    • "Black Box" Nature of BERT Features: While BERT provides powerful features, their exact biochemical interpretation remains somewhat opaque. Further work could involve interpretable AI techniques to understand what specific patterns in the peptide sequence BERT is identifying as indicative of umami taste. This could lead to new biochemical insights.
    • Further Deep Learning Exploration: The paper primarily uses BERT for feature extraction and then feeds these into traditional ML classifiers. Future work could explore end-to-end deep learning architectures, where the BERT embeddings are integrated directly into a neural network classifier, potentially allowing for more complex, hierarchical learning. Other advanced Transformer-based architectures or graph neural networks (if peptide structure information were to be incorporated) could also be investigated.
    • Domain Specific Pre-training: Instead of using a BERT model pre-trained on natural language, developing a BERT-like model specifically pre-trained on a massive corpus of diverse peptide sequences (e.g., from protein databases) would likely yield even more domain-specific and effective feature embeddings.
  • Transferability: The methodology of using a pre-trained Transformer-based model for sequence feature extraction, combined with data balancing and feature selection, is highly transferable. This approach could be applied to predict other types of bioactive peptides (e.g., antioxidant, antihypertensive, antimicrobial peptides) or even other biological sequences like DNA or RNA motifs, provided suitable pre-training data and downstream tasks.

    Overall, iUP-BERT is a valuable contribution, marking an important step forward in the computational identification of umami peptides by effectively harnessing the power of deep learning. The robust experimental design and practical deployment further solidify its significance.
