Paper status: completed

From prediction to design: Revealing the mechanisms of umami peptides using interpretable deep learning, quantum chemical simulations, and module substitution


TL;DR Summary

This study uses interpretable deep learning and module substitution to efficiently screen and design umami peptides, achieving 0.94 accuracy. It identifies various umami peptides, explores module substitution mechanisms, and highlights essential amino acids for taste enhancement.

Abstract

This study screened and designed umami peptides using a deep learning model and module substitution strategies. The predictive model, which integrates pre-training, enhanced feature, and contrastive learning modules, achieved an accuracy of 0.94, outperforming other models by 2–9%. Umami peptides were identified through virtual hydrolysis, model predictions, and sensory evaluation. Peptides EN, ETR, GK4, RK5, ER6, EF7, IL8, VR9, DL10, and PK14 demonstrated umami taste and exhibited umami-enhancing effects with MSG. The module substitution strategy, in which highly contributive modules from umami peptides replace corresponding modules in bitter peptides, facilitates peptide design and modification. The mechanisms underlying module substitution and taste presentation were elucidated via molecular docking and active site analysis, revealing that substituted peptides form more hydrogen bonds and hydrophobic interactions with T1R1/T1R3. Amino acids D, E, Q, K, and R were critical for umami taste. This study provides an efficient tool for rapid umami peptide screening and expands the umami peptide repository.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

From prediction to design: Revealing the mechanisms of umami peptides using interpretable deep learning, quantum chemical simulations, and module substitution

1.2. Authors

Lijun Su, Zhenren Ma, Huizhuo Ji, Jianlei Kong, Wenjing Yan, Qingchuan Zhang, Jian Li, Min Zuo. The affiliations indicate that the research was conducted at the School of Food and Health, Beijing Technology and Business University, and the School of Information, Beijing Wuzi University, suggesting a multidisciplinary approach combining food science and information technology.

1.3. Journal/Conference

The paper is published in Food Chemistry. This is a highly reputable journal in the field of food science, known for publishing high-impact research on the chemical and biochemical aspects of food. Its influence is significant for studies related to food quality, safety, and functional ingredients. The specific reference in the acknowledgements points to https://doi.org/10.1016/j.foodchem.2025.144301, indicating it is either published or accepted for publication in 2025.

1.4. Publication Year

2025 (as indicated by the DOI in the supplementary data section: 10.1016/j.foodchem.2025.144301).

1.5. Abstract

This study focused on screening and designing umami peptides using a novel deep learning model and module substitution strategies. The predictive model, which integrates pre-training, enhanced feature, and contrastive learning modules, achieved an accuracy of 0.94, outperforming existing models by 2–9%. Through virtual hydrolysis, model predictions, and sensory evaluation, several umami peptides (EN, ETR, GK4, RK5, ER6, EF7, IL8, VR9, DL10, and PK14) were identified that exhibited umami taste and enhanced umami effects synergistically with monosodium glutamate (MSG). A module substitution strategy, replacing high-contributory bitter peptide modules with high-contributory umami peptide modules, facilitated peptide design. The underlying mechanisms of module substitution and taste presentation were elucidated using molecular docking and active site analysis, revealing that substituted peptides form more hydrogen bonds and hydrophobic interactions with the T1R1/T1R3 taste receptor. Amino acids D, E, Q, K, and R were identified as critical for umami taste. This research provides an efficient tool for rapid umami peptide screening and expands the repository of known umami peptides.

Publication Status: The presence of a DOI (10.1016/j.foodchem.2025.144301) indicates that the paper has been accepted for publication and will appear in Food Chemistry in 2025; the version analyzed here is likely a pre-publication copy.

2. Executive Summary

2.1. Background & Motivation

The core problem this paper aims to solve is the inefficient and resource-intensive traditional methods for screening and designing umami peptides. Umami peptides are a type of flavor enhancer that can reduce the need for traditional sodium-rich enhancers like MSG (monosodium glutamate), aligning with growing consumer demand for healthy, natural, and nutritious food. Traditional screening involves complex multi-step chromatographic separation, chemical synthesis, and sensory evaluation, which are time-consuming and costly, severely limiting high-throughput screening and industrial application.

This problem is important because umami peptides offer a healthier alternative for flavor enhancement, potentially mitigating risks associated with high sodium intake (e.g., hypertension) and stimulating appetite. The existing methods pose a significant bottleneck for their widespread adoption and further research.

The paper's entry point is the development of an accurate and efficient in silico (computational) method to rapidly screen and design umami peptides. It leverages interpretable deep learning and a module substitution strategy to overcome the limitations of previous machine learning models, which often struggled with effective peptide feature representation and reliance on manually extracted features. The innovative idea is to combine advanced deep learning architectures with a biologically informed module substitution strategy to not only predict but also intelligently design peptides, while simultaneously elucidating the underlying molecular mechanisms.

2.2. Main Contributions / Findings

The paper makes several primary contributions:

  • Development of an Advanced Deep Learning Model: It proposes a novel predictive model for umami peptides that integrates a pre-training module (using BERT for feature encoding), an enhanced feature module (incorporating various physicochemical and structural properties), and a contrastive learning module. This model achieved state-of-the-art performance with an accuracy of 0.94, outperforming other models by 2–9 percentage points.

  • Identification of Novel Umami Peptides: Using the developed model, virtual hydrolysis of Tenebrio molitor protein, and subsequent sensory evaluation, ten previously unreported umami peptides (EN, ETR, GK4, RK5, ER6, EF7, IL8, VR9, DL10, and PK14) were identified. These peptides exhibited strong umami taste and synergistic umami-enhancing effects with MSG, with detection thresholds lower than MSG.

  • Introduction of a Module Substitution Strategy: The study proposed and demonstrated a novel module substitution strategy, where highly contributive dipeptide fragments from umami peptides replaced corresponding highly contributive fragments in bitter peptides. This strategy was shown to successfully convert non-umami peptides into umami peptides, facilitating precise peptide design and modification.

  • Elucidation of Umami Taste Mechanisms: Through interpretable deep learning (attention value analysis), quantum chemical simulations (HOMO/LUMO analysis), and molecular docking experiments, the study identified critical amino acid residues (D, E, Q, K, and R) for umami taste and elucidated the molecular mechanisms of taste presentation and how module substitution alters taste characteristics (e.g., increased hydrogen bonds and hydrophobic interactions with the T1R1/T1R3 receptor).

    The key conclusions and findings are:

  • The integrated deep learning model is a highly accurate and robust tool for predicting umami peptides and their thresholds.

  • Tenebrio molitor protein is a promising source for novel umami peptides.

  • Specific amino acids (D, E, Q, K, R) and dipeptide fragments (EE, DE, EK, EL, EA) play crucial roles in umami taste.

  • Module substitution is a viable and effective strategy for rational peptide design to engineer taste properties.

  • The mechanism of umami taste involves specific interactions (hydrogen bonds, hydrophobic interactions) between peptides and the T1R1/T1R3 receptor, which can be modulated by peptide sequence changes.

    These findings address the challenge of rapid and precise umami peptide discovery and design, offering a powerful computational framework that reduces reliance on laborious experimental methods and provides mechanistic insights.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully grasp this paper, a novice reader should understand the following fundamental concepts:

  • Peptides: Peptides are short chains of amino acids linked by peptide bonds. They are smaller than proteins and play various biological roles, including acting as hormones, enzymes, or, in this context, flavor compounds. Their sequence of amino acids dictates their structure and function.
  • Umami Taste: Umami (often translated as "savory") is one of the five basic tastes, alongside sweet, sour, bitter, and salty. It's often described as a pleasant, savory, or meaty taste, typically associated with glutamate, aspartate, and certain nucleotides.
  • Umami Peptides: These are specific peptides that elicit or enhance the umami taste. They are a focus of research for their potential as natural flavor enhancers and sodium reduction agents.
  • Deep Learning (DL): A subfield of machine learning that uses artificial neural networks with multiple layers (hence "deep") to learn complex patterns from data. Unlike traditional machine learning, deep learning models can automatically extract features from raw data, reducing the need for manual feature engineering.
  • Neural Networks: Computational models inspired by the structure and function of the human brain. They consist of interconnected nodes (neurons) organized in layers. Each connection has a weight, which is adjusted during training to minimize the difference between predicted and actual outputs.
  • BERT (Bidirectional Encoder Representations from Transformers): A powerful pre-trained deep learning model for natural language processing (NLP). BERT is designed to understand the context of words in a sentence by looking at words that come before and after it simultaneously. In this paper, peptide sequences are treated like "sentences" of amino acids, and BERT is used to learn rich, contextual feature representations of these sequences.
  • Pre-training: The process of training a large model (like BERT) on a massive, general dataset (e.g., text from the internet, or diverse peptide sequences in this case) to learn general representations or knowledge. This pre-trained model can then be fine-tuned on smaller, specific datasets for downstream tasks (e.g., predicting umami peptides), which is a form of transfer learning.
  • Contrastive Learning: A self-supervised learning technique where a model learns representations by comparing samples. The goal is to make similar samples (positive pairs) have similar representations in a learned latent space while pushing dissimilar samples (negative pairs) apart. This helps the model learn to distinguish subtle differences and similarities in data. The InfoNCE loss function is a common objective function used in contrastive learning.
  • Molecular Docking: A computational simulation technique that predicts the preferred orientation of one molecule (e.g., a peptide) to another (e.g., a taste receptor) when bound together to form a stable complex. It helps understand ligand-receptor interactions at an atomic level, including hydrogen bonds and hydrophobic interactions.
  • Quantum Chemical Simulations: Computational methods rooted in quantum mechanics to study the electronic structure and properties of molecules. In this paper, HOMO (Highest Occupied Molecular Orbital) and LUMO (Lowest Unoccupied Molecular Orbital) analyses are performed. These orbitals describe where electrons are most likely to be found and where they are likely to accept electrons, respectively, providing insights into a molecule's reactivity and active sites.
  • T1R1/T1R3 Receptor: This is a specific G protein-coupled receptor (GPCR) complex found on taste bud cells, primarily responsible for detecting umami taste. Peptides (or MSG) bind to this receptor to trigger the sensation of umami.
  • Module Substitution: A strategy for designing or modifying peptides by replacing specific fragments (modules) of their amino acid sequence with other fragments, aiming to alter or enhance a desired property (e.g., taste, bioactivity).

3.2. Previous Works

The paper builds upon a foundation of previous research in umami peptide prediction and design:

  • Traditional Machine Learning (ML) Models:

    • iUmami-SCM (Charoenkwan et al., 2020): This model used machine learning algorithms combined with a scoring card method (SCM) based on dipeptide propensity scores to predict umami peptides, achieving an accuracy of 0.824. This highlights early attempts to use computational methods but also the limitations in accuracy.
    • Gradient boosting decision tree (Cui et al., 2023): Another ML model that predicted umami peptides based on molecular descriptors. While performing well, ML models often rely on manually extracted features, which can introduce noise and redundancy.
  • Early Deep Learning (DL) Models:

    • Umami-MRNN (Qi et al., 2023): Combined multi-layer perceptron (MLP) and recurrent neural network (RNN) using six feature vectors, achieving 90.5% accuracy on the UMP499 dataset. This showed the potential of DL.
    • Two-stage training strategy (Zhang et al., 2023): Utilized bi-directional encoder representations from transformers (BERT) with an inception network for umami peptide prediction, achieving 93.23% accuracy on a balanced dataset. This paper was a direct predecessor in applying BERT.
    • UMPred-FRL (Charoenkwan et al., 2021), Umami-YYDS (Cui et al., 2023), Jiang's method (Jiang et al., 2023), and IUP-BERT (Jiang et al., 2022) are other state-of-the-art models mentioned for comparison, indicating a continuous effort in the field to improve prediction accuracy using various ML and DL techniques.
  • Peptide Feature Representation: Previous research often relied on limited sets of feature descriptors, sometimes neglecting crucial physicochemical properties and structural information of peptides, which can restrict model accuracy and generalization. The use of pre-training and diverse learning strategies to extract significant features (Lv et al., 2021) was identified as an effective approach.

  • Peptide Modification Strategies:

    • Fragment substitution (Meng et al., 2024): Used to enhance the activity of angiotensin-converting enzyme inhibitory peptides.
    • Single amino acid substitutions (Jia et al., 2024): Studies showed that removing bitter amino acids from umami peptides could increase umami intensity. However, these were limited to single amino acids, which are rarely considered functional modules.

3.3. Technological Evolution

The field has evolved from laborious wet-lab experimental screening methods to in silico approaches. Initially, traditional machine learning models were employed, but they faced challenges in effectively representing peptide sequence features and often relied on manually extracted features prone to noise. The advent of deep learning brought capabilities like automatic data processing and hidden feature discovery, significantly accelerating screening. More recently, pre-trained models like BERT, inspired by natural language processing, have shown promise due to the analogy between peptide sequences and text sequences. This paper represents a further evolution by integrating pre-training, enhanced feature modules, and contrastive learning to not only improve prediction but also incorporate interpretability and a novel module substitution strategy for rational design, moving beyond mere prediction to active design. Quantum chemical simulations and molecular docking represent the integration of computational chemistry to understand the underlying molecular mechanisms, further advancing the design aspect.

3.4. Differentiation Analysis

Compared to the main methods in related work, this paper's core innovations are:

  • Integrated Model Architecture: While previous works used BERT (e.g., Zhang et al., IUP-BERT) or combined different neural networks (Umami-MRNN), this paper combines a BERT-based pre-training module, an enhanced feature module (integrating four different sequence representation methods: DistancePair, CKSAAGP, QSOrder, DDE), and a contrastive learning module. This multi-faceted feature extraction and learning strategy leads to significantly improved predictive performance (outperforming others by 2-9%).

  • Contrastive Learning for Feature Enhancement: The explicit incorporation of contrastive learning is a key differentiator. By actively minimizing distances between positive pairs and maximizing between negative pairs, the model learns more robust and discriminative latent space representations, which is crucial for distinguishing between umami, bitter, and non-taste peptides.

  • Beyond Prediction to Design (Module Substitution): Unlike most previous models that primarily focus on prediction, this study introduces a module substitution strategy. This strategy enables precise design and modification of peptides by replacing high-contributory bitter modules with high-contributory umami modules identified through model interpretability. This is a significant step towards rational peptide engineering.

  • Mechanistic Elucidation through Interpretability: The paper uses attention value analysis from BERT to understand which amino acids and dipeptides are most important for umami taste, and combines this with quantum chemical simulations (HOMO/LUMO) and molecular docking to explain the molecular basis of umami perception and the effect of module substitution on receptor binding. This interpretability and mechanistic understanding is often lacking in black-box deep learning models.

  • Focus on Tenebrio molitor: The application of the model to virtually hydrolyze Tenebrio molitor protein and validate novel umami peptides from a specific, protein-rich source adds practical value and expands the umami peptide repository.

    In essence, this paper moves beyond simply predicting umami peptides by offering a comprehensive framework that includes enhanced prediction, interpretable insights into taste mechanisms, and a practical strategy for rational peptide design.

4. Methodology

4.1. Principles

The core idea of this method is to leverage the power of advanced deep learning to accurately predict umami peptides and their thresholds, and then to use the insights gained from the interpretable nature of the deep learning model to inform a module substitution strategy for rational peptide design. This approach aims to accelerate the discovery and optimization of umami peptides while also elucidating the molecular mechanisms of umami taste perception. The theoretical basis lies in treating peptide sequences as analogous to natural language, allowing Transformer-based models (like BERT) to learn contextual features, augmenting these with established physicochemical properties, and refining representations using contrastive learning. The design principle relies on identifying highly contributive peptide modules for umami and bitter tastes and rationally swapping them to alter taste profiles. Finally, quantum chemical simulations and molecular docking provide a theoretical foundation for understanding molecular interactions with taste receptors.

4.2. Core Methodology In-depth (Layer by Layer)

The overall framework of the umami peptide prediction model consists of four main modules: (i) a BERT-based pre-training module; (ii) a features-enhanced module; (iii) a contrastive learning module; and (iv) a prediction module.

The following figure (Figure 2 from the original paper) shows the framework of the umami peptide prediction model:

Fig. 2. The framework of the umami peptide prediction model. (Caption, translated: the schematic shows the structure of the pre-training, feature-enhancement, and contrastive learning modules; BERT performs feature encoding, and feature fusion produces the final prediction identifying umami, bitter, and other taste peptides.)

4.2.1. Benchmark Dataset

High-quality datasets are crucial for building robust predictive models. The study uses a multi-stage data collection and preparation process.

  • Pre-training Stage: A large collection of biopeptide sequences was gathered from various public datasets to enable the BERT model to learn general sequence characteristics. These datasets included:

    • 1850 anticancer peptides from the UCI Machine Learning Repository.
    • 847 neuropeptides from NeuroPedia.
    • 1010 antituberculosis peptides from AntiTbPdb.
    • 2325 fermentation-derived peptides from FermFooDb.
    • 6289 food-derived bioactive peptides from DFBP.
    • 20,027 naturally occurring signal peptides from PeptideDB. Sequences were filtered to lengths between 2 and 50 amino acids, and duplicates were removed.
  • Retraining Stage (UMP1080 Dataset): For umami peptide prediction, a specific dataset, UMP1080, was constructed:

    • Umami Peptides: 360 experimentally verified umami peptides collected from Web of Science (up to May 2024), TastepeptidesDB, and BIOPEP-UWM database.
    • Bitter Peptides: 360 bitter peptides collected from TastepeptidesDB, BIOPEP-UWM, and other research studies.
    • Neither Umami nor Bitter Peptides: 360 peptides randomly selected from the pre-trained dataset that did not exhibit either umami or bitter taste. This balanced UMP1080 dataset totals 1080 peptides (360 for each class).
  • Dataset Split: The UMP1080 dataset was randomly divided into:

    • Training Set: 300 umami, 300 bitter, 300 neither umami nor bitter peptides (total 900).
    • Test Set: 60 umami, 60 bitter, 60 neither umami nor bitter peptides (total 180).

4.2.2. Pre-trained BERT Model for Feature Encoding

BERT (Bidirectional Encoder Representations from Transformers) is employed to convert peptide sequences into high-dimensional feature sets.

  • Mechanism: BERT uses a Transformer architecture with self-attention mechanisms to process input sequences. It considers all surrounding "words" (amino acids in this context) to generate a contextual representation for each amino acid. This helps capture complex dependencies and overcome the redundancy of manually crafted features.
  • Model Architecture: A BERT model with 12 Transformer encoders was constructed, each containing 12 multi-head attention mechanisms.
  • Input/Output: Peptide sequences are directly input. The model accepts sequences with a maximum length of 512 amino acids and outputs feature vectors with a dimension of 768. These vectors serve as input for downstream tasks.
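To make the encoding step concrete, below is a minimal PyTorch sketch of a BERT-style peptide encoder with the dimensions described above (12 Transformer layers, 12 attention heads, 768-dimensional output, maximum length 512). The vocabulary, tokenizer, and feed-forward width are illustrative stand-ins, not the authors' pre-trained model.

```python
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Vocabulary: 20 amino acids plus BERT-style special tokens (illustrative).
VOCAB = {tok: i for i, tok in enumerate(["[PAD]", "[CLS]", "[SEP]"] + list(AMINO_ACIDS))}

class PeptideEncoder(nn.Module):
    """BERT-like encoder: 12 Transformer layers, 12 heads, 768-dim embeddings."""
    def __init__(self, d_model=768, n_layers=12, n_heads=12, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(len(VOCAB), d_model, padding_idx=0)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=3072, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, token_ids):                      # (batch, seq_len)
        pos = torch.arange(token_ids.size(1), device=token_ids.device)
        h = self.tok_emb(token_ids) + self.pos_emb(pos)
        h = self.encoder(h, src_key_padding_mask=(token_ids == 0))
        return h[:, 0]                                 # [CLS] vector, 768-dim

def tokenize(peptide, max_len=16):
    ids = [VOCAB["[CLS]"]] + [VOCAB[aa] for aa in peptide] + [VOCAB["[SEP]"]]
    return torch.tensor([ids + [0] * (max_len - len(ids))])

model = PeptideEncoder()
features = model(tokenize("EAGIQ"))   # -> torch.Size([1, 768])
```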

4.2.3. Enhanced Feature Construction

To enrich the feature representation beyond what BERT alone provides, four additional sequence representation methods are utilized, focusing on physicochemical properties and structural information. These are referred to as the enhanced feature module.

  • DistancePair: Integrates Pseudo Amino Acid Composition (PseAAC) with distance-pair information. PseAAC extends traditional amino acid composition by adding position-specific and sequence-specific information. DistancePair calculates distances between amino acid pairs and uses a reduced alphabet to capture structural and functional characteristics.

  • CKSAAGP (Composition of k-spaced amino acid group pairs): Considers the composition of amino acid group pairs separated by k positions in the sequence. It calculates the frequency of all k-spaced amino acid group pairs, capturing long-distance interaction information.

  • QSOrder (Quasi-Sequence Order): Characterizes sequences by analyzing order relationships between amino acids, incorporating composition and order information to reflect structural characteristics. It captures both local and global structural information.

  • DDE (Dipeptide Deviation from the Expected Mean): Represents sequences by calculating the frequency of all dipeptide combinations and their deviation from the expected frequency.

  • Output: Integrating these methods, each sequence is transformed into a 562-dimensional feature vector. This enhanced feature set is then combined with the BERT-generated features.
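As a sketch of the fusion step, the snippet below simply concatenates the 768-dimensional BERT embedding with the 562-dimensional enhanced feature vector. The per-descriptor dimensions are placeholders (the paper reports only the combined 562-dimensional size; in practice the descriptors come from packages such as iFeatureOmega, mentioned later).

```python
import numpy as np

def fuse_features(bert_vec, descriptors):
    """Concatenate BERT features with the enhanced (physicochemical/structural)
    feature set built from DistancePair, CKSAAGP, QSOrder and DDE."""
    enhanced = np.concatenate(descriptors)
    assert bert_vec.shape == (768,), "BERT output dimension"
    assert enhanced.shape == (562,), "combined enhanced-feature dimension"
    return np.concatenate([bert_vec, enhanced])        # -> (1330,)

# Toy usage; the 162/400 split is arbitrary and stands in for real descriptors.
rng = np.random.default_rng(0)
fused = fuse_features(rng.normal(size=768),
                      [rng.normal(size=162), rng.normal(size=400)])
```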

4.2.4. Contrastive Learning Strategy

Contrastive learning is applied to refine the feature representations by focusing on similarities and dissimilarities between samples. This is part of the contrastive learning module.

  • Purpose: To learn robust feature representations where positive sample pairs are brought closer together in the latent space, and negative sample pairs are pushed further apart.

  • Data Augmentation: Noise is added to enhance the uniformity of the combined features (from BERT and the enhanced feature module).

  • Loss Function: The InfoNCE (Info Noise-Contrastive Estimation) contrastive loss function is used during training.

    The InfoNCE loss is defined as:

    $$\mathrm{L_{CL}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\left(\mathrm{sim}(z_i, z_{i+})/\tau\right)}{\exp\left(\mathrm{sim}(z_i, z_{i+})/\tau\right) + \sum_{j=1}^{K}\exp\left(\mathrm{sim}(z_i, z_{i-j})/\tau\right)}$$

    Where:

  • N: The total number of samples in the batch.

  • L_CL: The contrastive loss for the batch.

  • z_i: The feature representation (embedding) of the i-th sample.

  • z_{i+}: The feature representation of a positive sample corresponding to z_i (e.g., an augmented version of z_i or another sample from the same class).

  • z_{i-j}: The feature representation of the j-th negative sample corresponding to z_i (e.g., samples from different classes).

  • K: The number of negative samples.

  • sim(x, y): A similarity function (e.g., cosine similarity) that measures the similarity between two feature vectors x and y.

  • τ: A temperature parameter that controls the sharpness of the probability distribution and the shape of the loss function. A smaller τ makes the model more sensitive to small differences in similarity.
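A compact PyTorch implementation of this InfoNCE objective might look as follows; this is a sketch, and the temperature value and choice of cosine similarity are assumptions consistent with common practice rather than details reported in the paper.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z, z_pos, z_negs, tau=0.1):
    """InfoNCE contrastive loss as defined above.

    z      : (N, d)    anchor embeddings
    z_pos  : (N, d)    one positive per anchor (e.g. a noise-augmented view)
    z_negs : (N, K, d) K negatives per anchor (samples from other classes)
    """
    sim_pos = F.cosine_similarity(z, z_pos, dim=-1) / tau                # (N,)
    sim_neg = F.cosine_similarity(z.unsqueeze(1), z_negs, dim=-1) / tau  # (N, K)
    # -log[ exp(s+) / (exp(s+) + sum_j exp(s-_j)) ], positive at index 0.
    logits = torch.cat([sim_pos.unsqueeze(1), sim_neg], dim=1)           # (N, 1+K)
    return F.cross_entropy(logits, torch.zeros(z.size(0), dtype=torch.long))
```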

4.2.5. Prediction Module

This module takes the refined features from contrastive learning and performs either classification or regression.

4.2.5.1. Classifier of umami peptides

  • Input: Noise-free features extracted by the contrastive learning module.

  • Architecture: A two-layer fully connected neural network.

  • Task: Three-class classification to identify umami, bitter, or neither taste attributes.

  • Loss Function: The Softmax cross-entropy loss is used for convergence training.

    The Softmax cross-entropy loss is formulated as:

    $$\mathrm{Loss} = -\sum_{c=1}^{M} y_c \log(p_c)$$

    Where:

  • Loss: The cross-entropy loss.

  • M: The number of classes (in this case, 3: umami, bitter, neither).

  • y_c: The true label for class c, which is 1 if the sample belongs to class c and 0 otherwise.

  • p_c: The predicted probability that the sample belongs to class c, output by the Softmax function.
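A plausible reading of this prediction head is sketched below. The hidden width, dropout rate, and input dimension (assumed here to be the 1330-dim fused features) are assumptions; the paper specifies only a two-layer fully connected network trained with softmax cross-entropy over three classes.

```python
import torch.nn as nn

# Minimal sketch of the two-layer classifier head (hyperparameters assumed).
classifier = nn.Sequential(
    nn.Linear(768 + 562, 256),   # fused BERT + enhanced features (assumed dim)
    nn.ReLU(),
    nn.Dropout(0.5),             # dropout regularization, rate assumed
    nn.Linear(256, 3),           # umami / bitter / neither
)
criterion = nn.CrossEntropyLoss()  # softmax cross-entropy over 3 classes
```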

4.2.5.2. Regressor of umami peptides

This part predicts the umami threshold values.

  • Outlier Handling: Outliers in the data are identified using the interquartile range (IQR) method. Values outside the range $[Q_1 - 1.5 \times \mathrm{IQR},\ Q_3 + 1.5 \times \mathrm{IQR}]$ are replaced with the mean of the data to mitigate their impact.

  • Clustering: The K-nearest neighbors (KNN) algorithm is applied to cluster the data into three categories based on a defined threshold.

  • Regression Model: The AdaBoost regressor is employed for predictions.

    • A classification model (based on a fully connected neural network) is first trained using labels derived from the clustering results.

    • For each class identified by the classifier, a separate AdaBoost regressor is trained.

    • During prediction, the data is first classified, and then the corresponding AdaBoost regressor is selected for regression analysis.

      The formula for AdaBoost regression is:

      $$F(x) = \sum_{m=1}^{M} \alpha_m h_m(x)$$

      The squared loss for AdaBoost is defined as:

      $$L(y, F(x)) = (y - F(x))^2$$

      Where:

  • F(x): The final regression prediction for input x.

  • M: The number of weak regressors (individual models) in the ensemble.

  • α_m: The weight of the m-th weak regressor, representing its importance in the ensemble.

  • h_m(x): The prediction of the m-th weak regressor for the input x.

  • y: The true value (actual umami threshold).

  • L(y, F(x)): The loss function, specifically squared loss, used to evaluate the error between the true and predicted values.

  • Training Parameters:

    • Epochs: 100.
    • Initial Learning Rate: 0.0001.
    • Learning Rate Scheduler: Reduces learning rate by 0.8 times if accuracy does not improve for 5 consecutive epochs.
    • Cross-validation: 5-fold cross-validation for parameter optimization.
    • Optimizer: Adam optimizer.
    • Regularization: Early stopping strategy to reduce overfitting and dropout mechanism to enhance generalization ability.
    • Software: Experiments conducted using PyTorch and CUDA framework with Python. iFeatureOmega package used to obtain sequence representations (DistancePair, CKSAAGP, QSOrder).
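Putting the regression pipeline together, a runnable sketch is shown below. The paper describes a KNN-based clustering step; KMeans is used here simply as a stand-in clusterer, the synthetic data are placeholders, and replacing outliers with the overall mean follows the IQR rule stated above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import AdaBoostRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))                 # stand-in peptide features
y = np.abs(rng.normal(0.1, 0.05, size=200))    # stand-in thresholds (mg/mL)

def replace_outliers_iqr(y):
    """IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] -> data mean."""
    q1, q3 = np.percentile(y, [25, 75])
    iqr = q3 - q1
    out = (y < q1 - 1.5 * iqr) | (y > q3 + 1.5 * iqr)
    y = y.copy()
    y[out] = y.mean()
    return y

y = replace_outliers_iqr(y)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(y.reshape(-1, 1))

# One AdaBoost regressor per threshold cluster (squared loss, as in the paper);
# at inference time a classifier first picks the cluster, then its regressor.
regressors = {
    c: AdaBoostRegressor(loss="square", random_state=0).fit(X[labels == c], y[labels == c])
    for c in np.unique(labels)
}
```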

4.2.6. Attention Value Analysis

The self-attention mechanism within BERT is leveraged to understand the importance of individual amino acids and dipeptides.

4.2.6.1. Amino acid importance analysis

  • Mechanism: BERT's self-attention allows it to capture complex dependencies among input features by calculating attention scores between different parts of the input.

  • Attention Formula (Scaled Dot-Product Attention):

    $$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{\mathrm{T}}}{\sqrt{d_k}}\right)V$$

    Where:

  • Q: The Query matrix, derived from the input sequence.

  • K: The Key matrix, derived from the input sequence.

  • V: The Value matrix, derived from the input sequence; it is weighted by the softmax scores to produce the final output.

  • QK^T: The dot product of the Query and Key matrices, indicating the similarity between each query and key.

  • d_k: The dimensionality of the key vectors, used to scale the dot product and prevent large values from pushing the softmax function into regions with tiny gradients.

  • Softmax(·): A function that normalizes the scores into a probability distribution, indicating how much attention each value should receive.

  • Amino Acid Attention Value Calculation: The attention values for individual amino acids within umami peptides are extracted by considering attention scores in four configurations: [CLS] → AA, AA → [CLS], [SEP] → AA, and AA → [SEP]. [CLS] (classifier token) and [SEP] (separator token) are special tokens used in BERT to mark the beginning of a sequence and the separation of segments, respectively. The attention value of an amino acid is defined as:

    $$\mathrm{Attention}_{AA} = \frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M}\left[\alpha_{ij}([\mathrm{CLS}]\to\mathrm{AA}) + \alpha_{ij}(\mathrm{AA}\to[\mathrm{CLS}]) + \alpha_{ij}([\mathrm{SEP}]\to\mathrm{AA}) + \alpha_{ij}(\mathrm{AA}\to[\mathrm{SEP}])\right]$$

    Where:

  • N: The number of Transformer layers (12 in this BERT model).

  • M: The number of attention heads per layer (12 in the multi-head attention).

  • α_ij: The attention score between token i and token j, derived from the Attention(Q, K, V) calculation.

  • [CLS] → AA: Attention from the [CLS] token to an amino acid (AA); the remaining three terms are the corresponding reverse and [SEP] configurations.

    Summing and averaging these scores across all layers and heads provides a comprehensive measure of how much "attention" the model pays to a specific amino acid in relation to the special tokens, indicating its importance for classification.
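The following sketch shows how such amino-acid attention values could be aggregated from per-layer attention maps, assuming tensors of shape (heads, seq, seq) are available from the encoder (e.g., via PyTorch hooks or a model that returns attentions).

```python
import torch

def amino_acid_attention(attentions, tokens, cls_tok="[CLS]", sep_tok="[SEP]"):
    """Average, over all layers and heads, the attention exchanged between each
    amino-acid token and the [CLS]/[SEP] special tokens (the Attention_AA score).

    attentions: list of per-layer tensors, each shaped (heads, seq, seq).
    tokens:     token strings for one peptide, e.g. ["[CLS]", "E", "N", "[SEP]"].
    """
    att = torch.stack(attentions)                  # (N layers, M heads, seq, seq)
    n_layers, m_heads = att.shape[:2]
    cls_pos, sep_pos = tokens.index(cls_tok), tokens.index(sep_tok)
    scores = {}
    for pos, tok in enumerate(tokens):
        if pos in (cls_pos, sep_pos):
            continue
        s = (att[:, :, cls_pos, pos] + att[:, :, pos, cls_pos] +
             att[:, :, sep_pos, pos] + att[:, :, pos, sep_pos])
        scores[(pos, tok)] = (s.sum() / (n_layers * m_heads)).item()
    return scores
```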

4.2.6.2. Score of amino acid pair analysis

  • Calculation: The pairwise amino acid score is defined as the product of the average attention value of each dipeptide (two-amino acid fragment) and its frequency of occurrence in the dataset.
  • Data Source: Average attention values for dipeptides are obtained from the BERT pre-trained model. Frequencies are calculated by counting occurrences in umami and bitter peptide datasets.
  • Purpose: To identify dipeptide segments with high contribution or significant characteristics for umami and bitter tastes.
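A minimal sketch of this pairwise score: dipeptide frequencies counted over a peptide set, multiplied by the dipeptide's average attention value. Normalizing frequency by the total dipeptide count is an assumption; the paper does not state the normalization.

```python
from collections import Counter

def dipeptide_scores(peptides, avg_attention):
    """Pairwise amino acid score = average attention value of a dipeptide
    multiplied by its frequency in the peptide set. `avg_attention` maps
    dipeptides (e.g. "EE") to mean attention values from the BERT model."""
    counts = Counter(p[i:i + 2] for p in peptides for i in range(len(p) - 1))
    total = sum(counts.values())
    return {dp: avg_attention.get(dp, 0.0) * n / total for dp, n in counts.items()}
```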

4.2.7. Preparation and Taste Characteristics Prediction of Peptides

This section describes the process for obtaining potential umami peptides from a biological source.

  • Protein Source: Tenebrio molitor (yellow mealworm) protein (GenBank accession number: KAH0814361.1) was chosen due to its high protein content and richness in umami amino acids.
  • Virtual Hydrolysis: In silico (computational) hydrolysis of the Tenebrio molitor protein sequence was performed using the PeptideCutter online program (http://web.expasy.org/peptide_cutter).
    • Enzymes: Two specific enzymes, pepsin (pH 2) (EC: 3.4.23.1) and trypsin (EC: 3.4.21.4), were selected to simulate enzymatic digestion.
  • Preliminary Screening: The resulting peptides were screened for water solubility and toxicity.
  • Model Prediction: Peptides with good solubility and non-toxicity were then subjected to prediction by the developed deep learning model to determine their umami characteristics and thresholds.

4.2.8. Sensory Evaluation

To validate the model's predictions, sensory evaluation was conducted.

  • Ethics Approval: Approved by the Ethics Committee of Beijing Technology and Business University (No.2023050).
  • Assessors: 15 trained assessors (7 male, 8 female, aged 23-31) from Beijing Technology and Business University, healthy, non-smokers, no taste/olfactory disorders, with prior sensory assessment training.
  • Method: Standardized sip-and-spit method.
  • Taste Characteristics: Descriptive analysis of peptides at a concentration of 1 mg/mL, using criteria from Song et al. (2023) and Gu et al. (2024).
  • Detection Threshold: Determined using the three-alternative forced-choice (3-AFC) method.
    • Test samples (1 mg/mL) serially diluted at 1:1 (V/V).
    • Sigma curve analysis used for probability detection results.
    • Threshold defined as the concentration corresponding to a 50% detection probability (according to ASTM E1432 and Tempere, 2011).
  • Interaction with MSG: Sigmoid curve analysis was used to elucidate synergistic, additive, or masking effects with monosodium glutamate (MSG) solutions. The R-value is defined as the ratio of the experimentally determined threshold to the theoretically predicted threshold.
    • R < 0.5: Synergistic effect.
    • 0.5 ≤ R < 1: Additive effect.
    • R = 1: No interaction.
    • R > 1: Masking effect.
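These R-value rules translate directly into a small helper (a sketch; both thresholds must be in the same units):

```python
def msg_interaction(measured_threshold, predicted_threshold):
    """Classify the peptide-MSG interaction from the R-value, i.e. the ratio
    of the experimentally determined to the theoretically predicted threshold."""
    r = measured_threshold / predicted_threshold
    if r < 0.5:
        return "synergistic"
    if r < 1:
        return "additive"
    if r == 1:
        return "no interaction"
    return "masking"
```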

4.2.9. Active Site Analysis of Peptides Based on Quantum Chemical Computing

To understand the molecular-level mechanisms of taste.

  • Software: Gaussian 16 and GaussView 6.0 programs.
  • 3D Structure Construction: GaussView 6.0 for building umami peptide structures.
  • Geometric Optimization: Density functional theory (DFT) with B3LYP/6-311G(d,p) basis set in Gaussian 16. This method calculates the minimum energy structure.
  • Vibrational Frequency Calculations: Performed to confirm the minimum energy structure.
  • Frontier Molecular Orbitals (FMOs): HOMO (Highest Occupied Molecular Orbital) and LUMO (Lowest Unoccupied Molecular Orbital) are calculated using the Molekel program.
    • HOMO: Indicates regions most likely to donate electrons.
    • LUMO: Indicates regions most likely to accept electrons.
    • HOMO-LUMO energy gap: Reflects molecular reactivity; a smaller gap suggests higher reactivity and a greater propensity to interact with other molecules (e.g., taste receptors).
  • Purpose: Identifying active sites in umami peptide molecules to elucidate their taste-presenting mechanisms.

4.2.10. Precise Design and Modification of Peptides

This section outlines the module substitution strategy for rational peptide design.

  • Strategy: Replacing highly contributive dipeptide fragments from bitter peptides with highly contributive fragments from umami peptides.
  • Module Selection: EE (glutamic acid-glutamic acid), identified as a high umami activity module through interpretability analysis of the deep learning model, was used to replace high-contribution bitterness modules such as PF, FP, GP, PP, and PG.
  • Prediction: The taste characteristics and thresholds of the substituted peptides were predicted using the developed deep learning model.
  • Purpose: To enhance the functional properties and taste characteristics of peptides, moving from prediction to active peptide design and modification.

4.2.11. Molecular Docking of Peptides and Taste Receptor T1R1/T1R3

To understand how module substitution affects taste characteristics at a molecular level.

  • Method: Semiflexible molecular docking.
  • Target Receptor: T1R1/T1R3 (the umami taste receptor).
  • Receptor Structure: The 3D structure of T1R1/T1R3 was built by homology modeling, using the fish taste receptor T1R2a-T1R3 (PDB ID: 5X2M) as the template.
  • Docking Software: AutoDock Vina.
  • Docking Box Parameters:
    • Center coordinates: x = 46.595, y = 35.837, z = 23.666.
    • Box dimensions: x = 68, y = 86, z = 88.
  • Analysis: LigPlot+ software was used to analyze the optimal docking poses, identifying key amino acid residues and interaction forces (e.g., hydrogen bonds, hydrophobic interactions) involved in peptide binding with T1R1/T1R3.
  • Purpose: To elucidate the molecular mechanism by which module substitution alters the taste characteristics of peptides.
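With the reported box parameters, such a docking run could be invoked roughly as below. This is a sketch assuming an AutoDock Vina installation and pre-prepared PDBQT files; the file names are hypothetical.

```python
import subprocess

# Hypothetical file names; receptor/ligand must be converted to PDBQT first.
subprocess.run([
    "vina",
    "--receptor", "T1R1_T1R3.pdbqt",   # homology model (template PDB 5X2M)
    "--ligand", "peptide.pdbqt",
    "--center_x", "46.595", "--center_y", "35.837", "--center_z", "23.666",
    "--size_x", "68", "--size_y", "86", "--size_z", "88",
    "--out", "docked_poses.pdbqt",
], check=True)
```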

4.2.12. Statistical Analysis

  • Software: Microsoft Office Excel 2019 and Origin 2024.
  • Significance Testing: Independent samples t-test.
  • Significance Level: P < 0.05 was considered statistically significant.

5. Experimental Setup

5.1. Datasets

The study utilized a comprehensive set of datasets for pre-training and fine-tuning the deep learning model.

  • Pre-training Dataset:

    • Purpose: To enable the BERT model to learn general contextual features from diverse biopeptide sequences.
    • Sources:
      • 1850 anticancer peptides (UCI Machine Learning Repository).
      • 847 neuropeptides (NeuroPedia).
      • 1010 antituberculosis peptides (AntiTbPdb).
      • 2325 fermentation-derived peptides (FermFooDb).
      • 6289 food-derived bioactive peptides (DFBP).
      • 20,027 naturally occurring signal peptides (PeptideDB).
    • Characteristics: Included sequences with lengths between 2 and 50 amino acids, with duplicate sequences removed. This large, diverse dataset allows the BERT model to develop a robust understanding of peptide sequence patterns before being applied to the specific task of umami peptide prediction.
  • UMP1080 Benchmark Dataset (for Fine-tuning and Evaluation):

    • Purpose: To train and test the umami peptide prediction model specifically for classifying umami, bitter, and neither tastes, and for umami threshold regression.
    • Sources: Experimentally verified umami peptides from Web of Science, TastepeptidesDB, and BIOPEP-UWM; bitter peptides from TastepeptidesDB, BIOPEP-UWM, and prior research; neither umami nor bitter peptides randomly selected from the pre-trained dataset.
    • Scale: A balanced dataset comprising 360 umami peptides, 360 bitter peptides, and 360 peptides that are neither umami nor bitter, totaling 1080 peptides.
    • Split:
      • Training Set: 900 peptides (300 umami, 300 bitter, 300 neither).
      • Test Set: 180 peptides (60 umami, 60 bitter, 60 neither).
    • Domain: These datasets are specific to peptide taste characteristics, providing concrete examples of peptides associated with umami and bitter tastes, which are essential for validating the model's performance in this domain.

5.2. Evaluation Metrics

The model's performance was evaluated using standard classification and regression metrics.

5.2.1. Classification Metrics

  • Accuracy (ACC):

    • Conceptual Definition: Measures the proportion of correctly predicted instances (both true positives and true negatives) out of the total number of instances. It indicates the overall correctness of the model's predictions.
    • Mathematical Formula: $$\mathrm{ACC} = \frac{TP + TN}{TP + FN + TN + FP}$$
    • Symbol Explanation:
      • TP: True Positives, instances correctly predicted as positive.
      • TN: True Negatives, instances correctly predicted as negative.
      • FP: False Positives, instances incorrectly predicted as positive (Type I error).
      • FN: False Negatives, instances incorrectly predicted as negative (Type II error).
  • Precision:

    • Conceptual Definition: Measures the proportion of correctly predicted positive instances out of all instances predicted as positive. It indicates the model's ability to avoid false positives.
    • Mathematical Formula: $$\mathrm{Precision} = \frac{TP}{TP + FP}$$
    • Symbol Explanation:
      • TP: True Positives.
      • FP: False Positives.
  • Recall:

    • Conceptual Definition: Measures the proportion of correctly predicted positive instances out of all actual positive instances. It indicates the model's ability to find all positive instances (sensitivity).
    • Mathematical Formula: $$\mathrm{Recall} = \frac{TP}{TP + FN}$$
    • Symbol Explanation:
      • TP: True Positives.
      • FN: False Negatives.
  • F1 score:

    • Conceptual Definition: The harmonic mean of Precision and Recall. It provides a single score that balances both Precision and Recall, particularly useful when there is an uneven class distribution.
    • Mathematical Formula: $$\mathrm{F1\ score} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
    • Symbol Explanation:
      • Precision: The Precision value.
      • Recall: The Recall value.
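For reference, these four metrics can be computed with scikit-learn as below; macro averaging is an assumption for the balanced three-class task, as the paper does not state its averaging scheme.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy labels for the three classes: 0 = umami, 1 = bitter, 2 = neither.
y_true = [0, 0, 1, 2, 1, 2]
y_pred = [0, 1, 1, 2, 1, 2]

acc  = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, average="macro")
rec  = recall_score(y_true, y_pred, average="macro")
f1   = f1_score(y_true, y_pred, average="macro")
```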

5.2.2. Regression Metrics

  • R-squared (R²):

    • Conceptual Definition: Measures the proportion of the variance in the dependent variable that is predictable from the independent variables. It indicates how well the model's predictions fit the observed data. A value closer to 1 indicates a better fit.
    • Mathematical Formula: $$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - f_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$
    • Symbol Explanation:
      • n: The total number of samples.
      • y_i: The true value of the dependent variable for the i-th sample.
      • f_i: The predicted value of the dependent variable for the i-th sample.
      • ȳ: The mean of the true dependent variable values.
  • Mean Absolute Error (MAE):

    • Conceptual Definition: Measures the average magnitude of the errors in a set of predictions, without considering their direction. It is the average of the absolute differences between prediction and actual observation.
    • Mathematical Formula: $$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - f_i\right|$$
    • Symbol Explanation:
      • n: The number of samples.
      • y_i: The true value for the i-th sample.
      • f_i: The predicted value for the i-th sample.
      • |·|: The absolute value operator.
  • Mean Squared Error (MSE):

    • Conceptual Definition: Measures the average of the squares of the errors. It gives a relatively high weight to large errors, meaning it is most useful when large errors are particularly undesirable.
    • Mathematical Formula: $$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - f_i)^2$$
    • Symbol Explanation:
      • n: The number of samples.
      • y_i: The true value for the i-th sample.
      • f_i: The predicted value for the i-th sample.
  • Root Mean Squared Error (RMSE):

    • Conceptual Definition: The square root of the MSE. It has the advantage of being in the same units as the target variable, making it easier to interpret than MSE.
    • Mathematical Formula: $$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - f_i)^2}$$
    • Symbol Explanation:
      • n: The number of samples.
      • y_i: The true value for the i-th sample.
      • f_i: The predicted value for the i-th sample.
      • √·: The square root operator.
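All four regression metrics follow directly from their definitions; a small NumPy helper (a sketch) is shown below.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """R², MAE, MSE and RMSE exactly as defined above."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    return {
        "R2": 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2),
        "MAE": np.mean(np.abs(err)),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
    }
```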

5.3. Baselines

The paper compared its proposed model against several state-of-the-art models in umami peptide prediction, representing different algorithmic approaches and feature engineering strategies:

  • UMPred-FRL (Charoenkwan et al., 2021): A model using feature representation learning.

  • Umami-YYDS (Cui et al., 2023): A model based on gradient boosting decision trees and molecular descriptors.

  • Umami-MRNN (Qi et al., 2023): A model combining multi-layer perceptron and recurrent neural network.

  • LSTM (Long Short-Term Memory): A type of recurrent neural network commonly used for sequence data.

  • IUP-BERT (Jiang et al., 2022): A BERT-based model specifically for umami peptide prediction.

  • Jiang's method (Jiang et al., 2023): A machine learning method using multiplicative LSTM embedded features.

    These baselines are representative because they cover a range of machine learning and deep learning techniques applied to the problem of umami peptide prediction, including traditional ML, RNNs, and earlier Transformer-based approaches. This allows for a comprehensive comparison to demonstrate the advancements made by the proposed integrated model.

6. Results & Analysis

6.1. Amino Acid Composition and Distribution Analysis

The study began by analyzing the amino acid composition and distribution patterns in umami and bitter peptide datasets, as taste characteristics are closely related to these features.

The following figure (Figure 1 from the original paper) shows the amino acid composition and distribution patterns of umami and bitter peptides:

Fig. 1 (caption, translated): A visualization of taste peptides and their amino acid features. Polar-coordinate plots show amino acid combination characteristics and interactions across different peptide segments, and bubble plots compare the effects and features of different peptides, highlighting the amino acids important for umami taste.

  • Peptide Length Distribution (Fig. 1A): Both umami and bitter peptides primarily had lengths below 10 amino acids, with 86.7% of umami peptides and 85.8% of bitter peptides falling within this range. This indicates that shorter peptides are highly relevant for taste perception in both categories.
  • Amino Acid Frequency in Umami Peptides (Fig. 1B): Glutamic acid (E), aspartic acid (D), leucine (L), alanine (A), glycine (G), and lysine (K) were significantly more frequent in umami peptides. This aligns with previous findings that E and D are key umami amino acids, and A and G can enhance umami taste synergistically with D or E.
  • Amino Acid Frequency in Bitter Peptides (Fig. 1C): Hydrophobic amino acids, particularly proline (P) and phenylalanine (F), were highly prevalent in bitter peptides. This is consistent with the general understanding that hydrophobicity is often associated with bitter taste.
  • N- and C-terminal Frequencies (Fig. 1D):
    • Umami Peptides: D and E were enriched at the N-terminal, while K and R were relatively high at the C-terminal. K residues at the C-terminal have been linked to enhanced umami expression.
    • Bitter Peptides: R, G, and V were primary at the N-terminal, and F and P dominated at the C-terminal.
  • Terminal Frequencies by Peptide Length:
    • Short Umami Peptides (2-3 amino acids, Fig. 1E): D and E were significantly more frequent at both N- and C-terminals, with higher occurrence at the N-terminal. This reinforces their role in umami perception. For 4-7 amino acid peptides, L, E, and D were higher at the N-terminal, while K and R were more frequent at the C-terminal. For 8-10 amino acid peptides, A and G were higher at the N-terminal, with K and R remaining predominant at the C-terminal.
    • Long Umami Peptides (>10 residues): The amino acid distribution was less clear, but R and K were high-frequency residues, particularly at the C-terminus (57.3% combined). No C or P was found at the termini of these longer umami peptides. This suggests that for longer peptides, taste might be influenced by complex spatial conformations rather than individual amino acids.
    • Short Bitter Peptides (<10 amino acids, Fig. 1F): Hydrophobic residues such as G, L, and P were more frequent at the N-terminal, and R also appeared. At the C-terminal, P and F predominated. This supports the observation that peptides containing combinations of P, L, I, F, H, K, or R are often bitter. D and C were absent at the termini of bitter peptides.
  • Overall Amino Acid Composition by Length:
    • Short Umami Peptides (<10 amino acids, Fig. 1G): D and E were identified as key components influencing umami characteristics.

    • Long Umami Peptides (>10 amino acids, Fig. 1G): No clear pattern emerged, supporting the idea of complex spatial conformations for longer peptides.

    • Short Bitter Peptides (<10 amino acids, Fig. 1H): Hydrophobic residues G, F, and P were predominant.

    • Long Bitter Peptides (>10 amino acids, Fig. 1H): The frequency of P was significantly higher than that of other amino acids.

      In summary, this analysis provides a foundation for understanding the molecular characteristics of taste peptides, highlighting the importance of specific amino acids and their positions for umami and bitter tastes, and setting the stage for deep learning model development and interpretability.

6.2. Performance Comparisons of Different Sequence Encoding Methods

To determine the most effective way to represent peptide sequences for the model, a comparison of various sequence encoding methods was conducted using 5-fold cross-validation. The performance was assessed based on Accuracy (ACC), Precision, Recall, and F1 score.

The paper refers to supplementary data for a detailed table of these comparisons (Table S2). Although Table S2 is not provided in the main text, the key finding is stated:

  • The combination model of BERT and the four sequence encodings (DistancePair, CKSAAGP, QSOrder, DDE), termed feature fusion, achieved significant improvements.
  • Its performance metrics were: ACC = 0.93981, Precision = 0.94366, Recall = 0.93056, and F1 score = 0.93706. These results underscore the superiority of integrating BERT's contextual embeddings with physicochemical and structural features.

6.3. Performance Comparisons with State-of-the-Art Models

The proposed model's performance was rigorously evaluated against existing state-of-the-art models.

The following are the results from Table 1 of the original paper:

| Algorithms | Samples | ACC | Precision | Recall | F1 score |
|---|---|---|---|---|---|
| UMPred-FRL | 140 umami and 340 bitter | 0.860 | 0.786 | – | – |
| Umami-YYDS | 198 umami and 215 bitter | 0.896 | 0.913 | 0.875 | 0.894 |
| Umami-MRNN | 212 umami and 287 bitter | 0.915 | 0.879 | – | – |
| LSTM | 140 umami and 304 bitter | 0.921 | 0.821 | – | – |
| IUP-BERT | 140 umami and 302 bitter | 0.923 | 0.888 | – | – |
| Ours | 360 umami, 360 bitter, and 360 others | 0.93981 | 0.94366 | 0.93056 | 0.93706 |
  • Classification Performance (Table 1): The proposed model achieved the highest ACC of 0.93981, outperforming other models by 2 to 9 percentage points. It also demonstrated superior Precision (0.94366), Recall (0.93056), and F1 score (0.93706). This indicates the model's strong ability to correctly identify umami peptides and distinguish them from non-umami peptides. The comparison with IUP-BERT (0.923 ACC) highlights the incremental benefit of combining BERT with enhanced features and contrastive learning.

    The following figure (Figure 3 from the original paper) shows the characteristics and relations of bitter and umami peptides:

    Fig. 3 (caption, translated): A composite of several plots showing the characteristics and relations of bitter and umami peptides. Panel A is a 3D scatter plot distinguishing bitter (red) and umami (green) peptides; panel B shows predicted versus actual values; panels C and D are a 3D surface plot and violin plots showing the effects of different amino acids; panels E and F are heatmaps of amino acid interactions, with numbers indicating correlation strength.

  • Feature Space Visualization (Fig. 3A): A Uniform Manifold Approximation and Projection (UMAP) method was used to visualize the feature space. The distinct separation between umami (green) and bitter (red) peptides in the 3D scatter plot (Fig. 3A) visually confirms the model's high accuracy in classification and its ability to learn discriminative features.

  • Umami Threshold Prediction (Fig. 3B): For regression tasks (predicting umami thresholds), the model showed excellent performance. The plot of actual vs. predicted values for umami peptide thresholds exhibited a strong correlation with an R² value of 0.98.

  • Regression Error Metrics: The MSE, RMSE, and MAE were calculated as 0.0013, 0.036, and 0.031, respectively. These values are lower than those reported in comparable studies (e.g., Guo et al., 2023, with R² = 0.883, MSE = 0.103, RMSE = 0.321, and MAE = 0.235 for astringency threshold prediction), indicating superior predictive performance and robustness.

  • Umami Threshold Feature Space (Fig. 3C): UMAP was again used to visualize the feature vectors, clustering umami threshold data into three categories. The strong correlation between these features and the umami thresholds in 3D space further validates the model's predictive capability.

    The superior performance is attributed to:

  1. Pre-trained BERT: Effectively captures rich contextual feature representations from large-scale bioactive peptide datasets, facilitating transfer learning.
  2. Multi-feature Fusion: Integrates peptide sequence information, amino acid composition, physicochemical properties, structural features, and evolutionary information, providing multi-dimensional input.
  3. Contrastive Learning: Enables the model to learn subtle yet critical differences by comparing similar and dissimilar peptide instances, enhancing discriminative power.
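This summary does not spell out the contrastive objective, so the following is only a minimal PyTorch sketch of a common NT-Xent/InfoNCE-style loss that captures the "compare similar and dissimilar instances" idea; the authors' actual formulation may differ.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """NT-Xent-style loss: z1[i] and z2[i] are two embeddings of the same peptide."""
    z1 = F.normalize(z1, dim=1)  # L2-normalise so dot products are cosine similarities
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                    # (N, N) pairwise similarities
    targets = torch.arange(z1.size(0), device=z1.device)  # positives lie on the diagonal
    # Symmetric cross-entropy: pull matched pairs together, push mismatched pairs apart.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```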

6.4. Model Interpretation

Understanding how the model makes predictions (interpretability) is crucial for rational design.

  • Amino Acid Importance (Fig. 3D): Analysis of attention values (how much the model "focuses" on each amino acid) revealed that the residues D (aspartic acid) and E (glutamic acid) had higher attention values, indicating their significant role in the model's accurate prediction of umami peptides. This reinforces their known importance from amino acid frequency analysis. Interestingly, Q (glutamine), M (methionine), S (serine), P (proline), and H (histidine) also showed high attention values, suggesting that their position within the peptide chain or their interaction context may matter even when their overall frequency is not the highest.
  • Dipeptide Pair Scores (Figs. 3E and 3F): Amino acid pair scores (the product of average attention value and frequency) were calculated to identify high-contribution dipeptide fragments (see the sketch after this list).
    • Umami Peptides (Fig. 3E): EE (glutamic acid-glutamic acid), DE (aspartic acid-glutamic acid), EK (glutamic acid-lysine), EL (glutamic acid-leucine), and EA (glutamic acid-alanine) exhibited high scores (1.496, 1.042, 0.892, 0.845, and 0.797, respectively). These are identified as potential key determinants of umami characteristics.

    • Bitter Peptides (Fig. 3F): PF (proline-phenylalanine), FP (phenylalanine-proline), GP (glycine-proline), PP (proline-proline), and PG (proline-glycine) showed high scores (3.603, 2.370, 2.152, 2.146, and 1.570, respectively). These are indicative of bitterness-contributing modules.

      This interpretability provides direct guidance for the module substitution strategy.
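One concrete reading of the "product of average attention value and frequency" scoring is sketched below; the peptide list and attention mapping are hypothetical inputs, and the paper's exact normalization may differ.

```python
from collections import defaultdict

def dipeptide_pair_scores(peptides, attention):
    """Score each dipeptide as (mean attention over its occurrences) * (relative frequency).

    peptides  -- list of sequences, e.g. ["EEL", "DEK"]
    attention -- dict mapping (peptide_index, residue_position) -> attention value
    """
    counts = defaultdict(int)
    att_sums = defaultdict(float)
    total_pairs = 0
    for i, seq in enumerate(peptides):
        for j in range(len(seq) - 1):
            pair = seq[j:j + 2]
            counts[pair] += 1
            att_sums[pair] += (attention[(i, j)] + attention[(i, j + 1)]) / 2
            total_pairs += 1
    return {p: (att_sums[p] / counts[p]) * (counts[p] / total_pairs) for p in counts}
```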

6.5. Identification of Umami Peptides

The developed deep learning model was applied to screen peptides derived from Tenebrio molitor protein.

  • Virtual Hydrolysis: In silico hydrolysis of Tenebrio molitor protein (using pepsin and trypsin) generated 1469 peptides (a simplified digestion sketch follows this list).
  • Pre-screening:
    • All 1469 peptides were predicted to be non-toxic.
    • 1316 peptides demonstrated good water solubility.
  • Taste Prediction: The model predicted the taste characteristics and thresholds for these 1469 peptides:
    • 1237 were predicted as umami peptides.
    • 202 were predicted as bitter peptides.
    • 30 were predicted as other types of peptides. These findings confirm Tenebrio molitor as a rich source for umami peptide discovery.
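For context, in silico hydrolysis applies protease cleavage rules to the protein sequence. A minimal sketch for trypsin alone (cleave after K or R, except before P) follows; pepsin's rules are more involved and omitted here, and the example protein string is hypothetical.

```python
import re

def trypsin_digest(protein: str) -> list:
    """Simplified in-silico trypsin digest: cut after K or R unless followed by P."""
    # Zero-width split: lookbehind for K/R, negative lookahead for P (Python 3.7+).
    return [frag for frag in re.split(r"(?<=[KR])(?!P)", protein) if frag]

print(trypsin_digest("MKWVTFISLLLLFSSAYSRGVFRR"))
# -> ['MK', 'WVTFISLLLLFSSAYSR', 'GVFR', 'R']
```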

6.6. Taste Characteristics of Synthetic Peptides

Ten previously unreported peptides, selected on the basis of model predictions and ranging from dipeptides up to 14 residues, were synthesized and subjected to sensory evaluation.

The paper refers to supplementary data for a detailed table of taste properties (Table S3). Although Table S3 is not provided in the main text, the key findings are stated:

  • Validation: The actual taste perception of the synthesized peptides (EN, ETR, GK4, RK5, ER6, EF7, IL8, VR9, DL10, and PK14) was highly consistent with the model's predictions. All primarily exhibited umami taste.

  • Detection Thresholds: The detection thresholds ranged from 0.02446 to 0.13464 mg/mL. These values are significantly lower than the threshold of MSG (0.3 mg/mL), highlighting their potent umami characteristics.

    • The threshold for ECQVEGF was not measured because the sulfur-containing amino acid cysteine (C) produces a pungent odor that interferes with threshold determination.

      The following figure (Figure 4 from the original paper) shows the probability of correct selection versus log-concentration for various amino acid peptides:

      Fig. 4. Probability of correct selection versus log-concentration for each peptide (e.g., EN, ETR, GVVK); each sub-plot includes the peptide's structure and a fitted curve, showing its threshold and correlation coefficient.

  • Figure 4 visually presents the detection thresholds for the individual synthesized peptides. The sigmoid curves show the probability of correct selection by assessors as a function of log-concentration; the point at which the probability reaches 50% (the threshold) is clearly identifiable for each peptide (a curve-fitting sketch appears after the MSG interaction list below).

  • Other Basic Tastes: Besides umami, some peptides also exhibited other basic tastes like sweetness, sourness, and astringency. The synergistic interaction between sweetness and umami is noted as a potential enhancer. Sourness and astringency might be artifacts of solvent residues from synthesis.

  • Amino Acid Composition Consistency: These validated umami peptides frequently contained D, E, G, and A, and often had K or R at the C-terminus, consistent with the interpretability results of the deep learning model.

    The following figure (Figure 5 from the original paper) shows a series of curve plots for the probability of correct selection for different peptides in combination with MSG:

    Fig. 5. Probability-of-correct-judgment curves for each peptide combined with MSG, showing the theoretical fit and experimental data as a function of MSG concentration, with each peptide's threshold and R² value, revealing its effect on umami.

  • Interaction with MSG (Fig. 5): The interaction between nine of the synthesized peptides and MSG was investigated using the R-value.

    • The R values for EN, ETR, GK4, RK5, ER6, IL8, VR9, DL10, and PK14 were 0.69, 0.84, 0.91, 0.80, 0.72, 0.77, 0.83, 0.74, and 0.72, respectively.
    • Since all R values are between 0.5 and 1, this indicates an additive effect when combined with MSG. This implies that these peptides can enhance umami and potentially reduce the reliance on MSG, contributing to sodium reduction strategies.
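The sigmoid threshold fits in Figs. 4 and 5 can be reproduced conceptually as below. This is a minimal sketch with invented data points and a plain logistic curve whose midpoint sits at 50%; a real triangle test would additionally correct for the 1/3 chance level.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(logc, c50, slope):
    """Probability of correct selection as a function of log10 concentration."""
    return 1.0 / (1.0 + np.exp(-slope * (logc - c50)))

# Hypothetical sensory data: log10(mg/mL) vs. fraction of correct selections.
logc = np.log10([0.01, 0.02, 0.05, 0.10, 0.20])
p_correct = np.array([0.15, 0.35, 0.60, 0.85, 0.95])

(c50, slope), _ = curve_fit(logistic, logc, p_correct, p0=[-1.3, 2.0])
print(f"estimated detection threshold ~ {10 ** c50:.4f} mg/mL")
```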

6.7. Active Site Analysis of Umami Peptides

Quantum chemical calculations, specifically frontier molecular orbital (FMO) analysis, were performed to understand the active sites and taste mechanisms of the umami peptides.

The paper refers to supplementary data for a table of HOMO-LUMO energy gaps (Table S4). Although Table S4 is not provided in the main text, the key finding is stated:

  • HOMO-LUMO Energy Gap: The energy gap between HOMO and LUMO reflects a molecule's chemical reactivity. A smaller gap generally indicates higher reactivity and a greater propensity to interact with taste receptors.
    • RPIEK exhibited the lowest HOMO-LUMO energy gap (−3.16 eV), which correlates with its lower umami threshold (higher potency).
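      For reference, the gap is simply the orbital-energy difference; the second relation (chemical hardness) is a standard conceptual-DFT companion descriptor, not a quantity taken from the paper:

```latex
\Delta E_{\mathrm{gap}} = E_{\mathrm{LUMO}} - E_{\mathrm{HOMO}},
\qquad
\eta = \tfrac{1}{2}\left(E_{\mathrm{LUMO}} - E_{\mathrm{HOMO}}\right)
```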

      The following figure (Figure 6 from the original paper) shows the LUMO and HOMO states of different peptides (EN, ETR, GVVK, RPIEK, EDAQDR):

      Fig. 6. Molecular orbital diagrams showing the LUMO and HOMO states of the peptides EN, ETR, GVVK, RPIEK, and EDAQDR; the electron distributions, rendered as colored spheres, illustrate the electronic properties relevant to taste potency.

The following figure (Figure 7 from the original paper) shows the relationship between different peptide chains and their corresponding LUMO and HOMO orbitals (ECQVEGF, IKPTVVEL, VLGHELPER, DDDGQPIPEL, PEIEAQPIEEQK):

Fig. 7. Molecular structures of the peptide chains ECQVEGF, IKPTVVEL, VLGHELPER, DDDGQPIPEL, and PEIEAQPIEEQK with their corresponding LUMO and HOMO orbitals and electron-cloud distributions, allowing their electronic characteristics to be compared.

  • HOMO/LUMO Orbital Analysis (Figs. 6 and 7): The active sites of umami peptides were primarily distributed on the residues D, E, Q, K, and R.
    • Notably, when R or K was the C-terminal residue, its occurrence as an active site was higher. This supports previous research linking C-terminal K or R to enhanced umami taste expression, and confirms D and E as key umami amino acids.
  • Consistency with Deep Learning Interpretability: The results from quantum chemical simulations align with the interpretability outcomes of the deep learning model, providing a strong validation of the model's insights into the taste-presenting mechanisms.

6.8. Umami Evaluation of Design and Modification of Peptides

The module substitution strategy was employed to demonstrate its effectiveness in precise peptide design and modification.

  • Strategy: The EE dipeptide module, identified by the model as highly umami-active, was used to replace high-contribution bitterness modules (PF, FP, GP, PP, PG) in bitter peptides (see the sketch after this list).
  • Application: Among the peptides derived from Tenebrio molitor protein, 27 non-umami peptides contained these bitter modules; 20 of them had good water solubility, and all were non-toxic.
  • Results of Substitution: After replacing the bitter modules with EE, all resulting peptides were predicted to be converted into umami peptides, maintaining good water solubility and non-toxicity.
  • Previous Studies Alignment: This finding is consistent with prior research on fragment substitution to enhance peptide activities (e.g., XOD inhibitory peptide modifications, ACE inhibitory peptide modifications by Mirzaei et al., 2019; Zhao et al., 2023; Meng et al., 2024). Specifically, Meng et al. (2024) showed that replacing low-contribution GP with high-contribution KE or KN enhanced XOD inhibitory activity.
  • Significance: This demonstrates that modular substitution is a feasible and effective strategy for improving peptide flavor, enabling precise peptide design and modification. The modified umami peptide sequences can then inform the selection of enzymes and hydrolysis conditions for targeted preparation, or guide the choice of protein sources.
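The substitution step itself is mechanical once the modules are known. A minimal sketch, assuming a first-occurrence replacement policy (this summary does not state the authors' exact replacement rule) and a hypothetical input peptide:

```python
BITTER_MODULES = ("PF", "FP", "GP", "PP", "PG")  # high-contribution bitter dipeptides
UMAMI_MODULE = "EE"                              # high-contribution umami dipeptide

def substitute_module(peptide: str) -> str:
    """Replace the first bitter module found with the umami module EE."""
    for module in BITTER_MODULES:
        if module in peptide:
            return peptide.replace(module, UMAMI_MODULE, 1)
    return peptide  # no bitter module present; peptide unchanged

print(substitute_module("VKPPGL"))  # hypothetical bitter peptide -> 'VKEEGL'
```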

6.9. Mechanism of Module Substitution Altering Peptide Taste Characteristics

To elucidate the molecular mechanism behind the module substitution strategy, molecular docking experiments were conducted. Due to computational cost, peptides shorter than 10 amino acids were chosen for this analysis.

The following figure (Figure 8 from the original paper) shows the interaction between peptides and the umami receptors T1R1/T1R3 before and after module substitution:

Fig. 8. Peptide sequences and their molecular structures, annotated with the key hydrogen bonds and hydrophobic interactions each taste peptide forms with T1R1/T1R3.

The following figure (labeled "Fig. 7 (continued)" in the original paper) shows the molecular structures of three peptide chains:

Fig. 7 (continued). Molecular structure diagrams of the peptide chains TPPSEEIN, DQTPGIPQR, and DQTEEIQR, showing their hydrogen bonds and hydrophobic interactions; each amino acid is marked with distinct colors and symbols, emphasizing the residues key to taste presentation.

  • Molecular Docking Results (Fig. 8 and Fig. S1 (not provided)): The results revealed that the peptides after module substitution formed more hydrogen bonds and hydrophobic interactions with the umami receptors T1R1/T1R3 compared to their unmodified counterparts.
  • Interaction Sites:
    • Modified Peptides: Primarily interacted with Arg151, Asp147, Arg277, His71, Ser146, and Ala302 on T1R1/T1R3.
    • Unmodified Peptides: Mainly interacted with Asp147, Ala302, and His71.
  • Key Interaction Forces and Residues: This finding is consistent with previous studies demonstrating that hydrogen bonds and hydrophobic interactions are crucial for the binding of umami peptides to T1R1/T1R3. The residues Arg151, Asp147, Gln52, Glu277, Arg277, His71, Ser146, and Ala302 have been identified as critical for these interactions.
  • Conclusion: The increased number and strength of interactions, particularly with key residues in the T1R1/T1R3 binding pocket, explain how module substitution enables modified peptides to effectively bind to the receptor and elicit an umami taste. This confirms that altering peptide sequence composition through module substitution directly influences taste characteristics by modulating receptor binding.

7. Conclusion & Reflections

7.1. Conclusion Summary

This study successfully developed and validated a powerful computational framework for the rapid screening and rational design of umami peptides. The proposed deep learning model, integrating pre-training, enhanced features, and contrastive learning, achieved an impressive accuracy of 0.93981, significantly outperforming existing models. Through virtual hydrolysis of Tenebrio molitor protein and sensory evaluation, ten novel umami peptides were identified, demonstrating potent umami taste and additive umami-enhancing effects with MSG, with detection thresholds lower than MSG. Crucially, the research introduced a module substitution strategy that enabled the successful conversion of bitter peptides into umami peptides. Interpretability analyses (attention values, HOMO/LUMO) identified key amino acid residues (D, E, Q, K, R) for umami taste, and molecular docking elucidated the mechanism: module substitution enhances hydrogen bonding and hydrophobic interactions with the T1R1/T1R3 receptor. This comprehensive approach provides an efficient tool for discovery and design, significantly expanding the umami peptide repository and deepening the understanding of umami taste presentation mechanisms.

7.2. Limitations & Future Work

The authors implicitly highlight some limitations and suggest future directions:

  • Peptide Length and Conformational Complexity: The amino acid composition analysis showed that for longer peptides (>10 amino acids), taste characteristics are not primarily determined by individual amino acids but by complex spatial conformations. This suggests that the current model, primarily sequence-based, might have limitations in fully capturing the taste of very long peptides. Future work could involve incorporating more sophisticated structural prediction or molecular dynamics simulations for longer sequences.
  • Other Taste Attributes: While the model successfully classifies umami, bitter, and neither, the sensory evaluation noted that umami peptides can also exhibit sweetness, sourness, or astringency. The current model focuses solely on umami and bitter. Future work could expand the model to predict a wider range of taste attributes and their interactions.
  • Synthesis and Preparation Costs: While module substitution is a powerful design tool, the paper acknowledges that peptide synthesis can still be costly. Future work could focus on using bioinformatics analysis to map designed peptides back to protein sequences, allowing for the selection of suitable enzymes and optimized hydrolysis conditions for cost-effective preparation from natural protein sources. This could also guide the selection of appropriate protein sources for targeted peptide production.
  • Unverified Assumptions: The module substitution strategy is validated using predicted taste profiles. While sensory evaluation confirmed the initial umami peptides, comprehensive experimental validation for all module-substituted peptides is still needed.

7.3. Personal Insights & Critique

This paper presents a highly innovative and comprehensive approach to umami peptide discovery and design. The integration of advanced deep learning (BERT, enhanced features, contrastive learning) with interpretable analysis and a practical module substitution strategy is particularly compelling. It represents a significant step forward from purely predictive models to truly rational peptide engineering.

Key strengths:

  • Holistic Approach: It covers prediction, design, and mechanistic elucidation, providing a full pipeline from in silico screening to understanding molecular interactions.
  • State-of-the-Art Performance: The model's accuracy and low error rates are impressive, demonstrating the power of the combined feature engineering and learning strategies.
  • Interpretability: The use of attention value analysis and quantum chemical simulations to identify critical amino acids and active sites adds significant scientific value, moving beyond black-box models. This interpretability is crucial for building trust in AI-driven design.
  • Practical Application: The module substitution strategy is a tangible method for modifying peptides to achieve desired taste profiles, which has direct applications in food science and product development. The identification of umami peptides from Tenebrio molitor is also a valuable practical outcome given the interest in alternative protein sources.

Potential areas for improvement or further research:

  • Generalizability of Module Substitution: While the EE module successfully converted bitter to umami, further research could explore a wider library of umami modules and their effectiveness in different peptide contexts. The specific "bitter modules" targeted might also vary depending on the bitter peptide's structure.

  • Beyond Dipeptides: The module substitution focused on dipeptide fragments. Investigating larger functional modules (e.g., tripeptides or longer motifs) could yield even more precise and potent modifications, though this would increase complexity.

  • Computational Cost: Quantum chemical simulations and molecular docking for mechanistic studies can be computationally intensive, especially for longer peptides. Developing more efficient computational methods or AI-accelerated simulations could enhance the practicality of these analyses for high-throughput design.

  • Multi-Taste Design: While the model can distinguish umami and bitter, designing peptides with a desired combination of tastes (e.g., umami with a hint of sweetness, or masking unwanted bitterness while retaining umami) remains a complex challenge. Future models could aim for multi-label prediction across all basic tastes.

  • Experimental Validation Scope: While 10 peptides were validated, confirming the taste of all module-substituted peptides experimentally would be the ultimate proof of concept for the design strategy.

    The methods and conclusions can certainly be transferred to other domains of bioactive peptide research. For instance, similar frameworks could be used to design antihypertensive peptides, antioxidant peptides, or antimicrobial peptides by identifying key functional modules and substituting them into inactive sequences. The interpretability framework is particularly valuable for accelerating understanding in these areas where experimental characterization is slow and expensive. This paper provides a strong blueprint for AI-driven rational design in peptide science.
