
A TastePeptides-Meta system including an umami/bitter classification model Umami_YYDS, a TastePeptidesDB database and an open-source package Auto_Taste_ML


TL;DR Summary

The study developed TastePeptides-Meta, featuring a database, an 89.6%-accurate umami/bitter classification model Umami_YYDS, sensory validation, a prediction website, and an open-source ML package to enable rapid taste peptide screening.

Abstract

Food Chemistry 405 (2023) 134812. Available online 9 November 2022. 0308-8146/© 2022 Published by Elsevier Ltd.

A TastePeptides-Meta system including an umami/bitter classification model Umami_YYDS, a TastePeptidesDB database and an open-source package Auto_Taste_ML

Zhiyong Cui (a), Zhiwei Zhang (a), Tianxing Zhou (b), Xueke Zhou (a), Yin Zhang (c), Hengli Meng (a), Wenli Wang (a, *), Yuan Liu (a, *)

(a) Department of Food Science & Technology, School of Agriculture & Biology, Shanghai Jiao Tong University, Shanghai 200240, China
(b) Department of Bioinformatics, Faculty of Science, The University of Melbourne, Victoria 3010, Australia
(c) Key Laboratory of Meat Processing of Sichuan, Chengdu University, Chengdu 610106, China

Keywords: Peptides; Umami prediction; TastePeptidesDB; Machine learning

Abstract: Taste peptides with umami/bitterness play a role in food attributes. However, the taste mechanisms of peptides are not fully understood, and the identification of these peptides is time-consuming. Here, we created a taste peptide database by collecting the reported taste peptide information. Eight key molecular descriptors from […]


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of the paper is the development of a comprehensive system for taste peptide analysis, specifically focusing on umami and bitter tastes. The system is named TastePeptides-Meta and includes a classification model Umami_YYDS, a database TastePeptidesDB, and an open-source machine learning package Auto_Taste_ML.

1.2. Authors

The authors of the paper are:

  • Zhiyong Cui

  • Zhiwei Zhang

  • Tianxing Zhou

  • Xueke Zhou

  • Yin Zhang

  • Hengli Meng

  • Wenli Wang

  • Yuan Liu

    Their affiliations are: the Department of Food Science & Technology, School of Agriculture & Biology, Shanghai Jiao Tong University, Shanghai 200240, China (Zhiyong Cui, Zhiwei Zhang, Xueke Zhou, Hengli Meng, Wenli Wang, Yuan Liu); the Department of Bioinformatics, Faculty of Science, The University of Melbourne, Victoria 3010, Australia (Tianxing Zhou); and the Key Laboratory of Meat Processing of Sichuan, Chengdu University, Chengdu 610106, China (Yin Zhang). Wenli Wang and Yuan Liu are the corresponding authors.

1.3. Journal/Conference

The paper was published in Food Chemistry (volume 405, article 134812), as stated in its front matter and reflected in the DOI (https://doi.org/10.1016/j.foodchem.2022.134812). Food Chemistry is a highly reputable journal in the field of food science, known for publishing high-quality research, which indicates the work has undergone rigorous peer review.

1.4. Publication Year

The article was made available online on 9 November 2022; the journal issue in which it appears (Food Chemistry 405) is dated 2023.

1.5. Abstract

The paper addresses the challenge of understanding the taste mechanisms of peptides and the time-consuming nature of identifying taste peptides, particularly those with umami and bitterness. To tackle this, the authors developed a system called TastePeptides-Meta. This system comprises three main components:

  1. TastePeptidesDB: A database compiled from reported taste peptide information.

  2. Umami_YYDS: A gradient boosting decision tree model for classifying umami/bitter peptides. This model achieved 89.6% accuracy and was built using data enhancement, comparative algorithms, and optimization techniques. It selected eight key molecular descriptors from di/tripeptides through a modeling screening process. The model's predictive performance was validated against other models and confirmed by sensory experiments.

  3. Auto_Taste_ML: An open-source machine learning package designed to assist in taste peptide modeling.

    The paper highlights that Umami_YYDS showed superior prediction performance and was verified by sensory experiments. To facilitate access, a prediction website based on Umami_YYDS was deployed, and the Auto_Taste_ML package was uploaded. The TastePeptides-Meta system aims to provide a convenient approach for the rapid screening of umami peptides.

The original source link provided is /files/papers/6908b7cae81fdddf1c48bfdb/paper.pdf. Given the abstract's mention of a DOI (https://doi.org/10.1016/j.foodchem.2022.134812), the paper is officially published.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the time-consuming and laborious identification of taste peptides, particularly those exhibiting umami or bitter flavors. Taste peptides are important contributors to food attributes, with umami contributing to pleasant sensations and bitterness often signaling undesirable consumption. Traditional methods for identifying these peptides involve complex experimental processes such as pretreatment, separation, purification, synthesis, characterization, and sensory evaluation, making them laborious, expensive, and time-consuming.

This problem is important because a better understanding and efficient identification of taste peptides can significantly impact food processing, production, trade, and nutrition.

Prior research has attempted to address this with quantitative structure-activity relationships (QSAR) models and chemoinformatics (CI), but these efforts were often limited by:

  • Insufficient data size

  • Simplistic models (e.g., Scoring Card Method like iBitter-SCM and iUmami-SCM)

  • Models achieving only single taste judgment

  • Black box algorithms (e.g., XGBoost, BERT4Bitter) that perform well but lack interpretability, making them difficult to debug and maintain.

  • Lack of code encapsulation for developed methods, hindering their practical application.

    The paper's entry point is to overcome these limitations by building a QSAR model with excellent performance and model interpretability, supported by a comprehensive taste peptide information summary platform and open-source tools.

2.2. Main Contributions / Findings

The paper makes several primary contributions by establishing the TastePeptides-Meta system:

  1. Creation of TastePeptidesDB: The largest and most comprehensive database of reported taste peptides, addressing the insufficient data size issue in previous studies. This platform summarizes and provides accessible information on taste peptides.

  2. Development of Umami_YYDS Model: A novel umami/bitter classification model based on gradient boosting decision tree (GBDT).

    • It demonstrates excellent performance with 89.6% accuracy in calibration and 0.98 AUC.
    • It specifically addresses the interpretability challenge by using SHAP values to reveal the key molecular descriptors (MolLogP, SMR_VSA1, VSA_EState6, BCUT2D_MWLOW) influencing taste, categorizing them into solubility, charge & van der Waals radius, and molecular weight. This moves beyond "black box" algorithms.
    • The model's unbiased judgment for both umami and bitter peptides is highlighted, contrasting with other models that might overemphasize bitterness.
    • Its outstanding ability was verified by sensory experiments on novel peptides, showing 80% accuracy in prediction.
  3. Release of Auto_Taste_ML: The first open-source machine learning package in the field of taste, encapsulating the modeling process and facilitating data processing, feature construction, model selection, and visualization. This directly tackles the lack of code encapsulation problem.

  4. Deployment of a Web Server: A user-friendly web server based on Umami_YYDS (tastepeptides-meta.com) for convenient taste peptide prediction.

    Key Findings:

  • The Umami_YYDS model effectively predicts umami/bitter tastes with high accuracy and robustness, particularly for peptides of ≥ 4 amino acids in length.

  • Water solubility (MolLogP, SMR_VSA1), polarizability, charge properties (VSA_EState descriptors), van der Waals radius (MinEStateIndex, PEOE_VSA14), and molecular weight (BCUT2D_MWLOW) are identified as the main factors affecting the taste characteristics of short peptides.

  • The study confirmed that high water solubility generally correlates with a higher possibility of being umami peptides.

  • The TastePeptidesDB reveals that most reported taste peptides focus on umami and bitter tastes (79.4%). Dipeptides and tripeptides constitute nearly half of the taste peptide entries, with a decreasing number as peptide length increases.

    These contributions and findings are helpful for the rapid screening of umami peptides and provide computational support for future high-throughput analysis.

3. Prerequisite Knowledge & Related Work

This section aims to provide readers with the prerequisite knowledge needed to understand the paper.

3.1. Foundational Concepts

To fully grasp the methodology and contributions of this paper, understanding several fundamental concepts is crucial:

  • Taste Peptides: These are short chains of amino acids (typically 2-20 amino acids long) that can elicit specific taste perceptions, such as umami, bitter, sweet, sour, or salty. They are often produced during protein hydrolysis in food processing.
  • Umami (旨味): Recognized as the fifth basic taste alongside sweet, sour, salty, and bitter. It is often described as savory, meaty, or broth-like, indicating the presence of proteins and amino acids. It is generally associated with a pleasant eating experience.
  • Bitterness (苦味): A basic taste often perceived as unpleasant or even toxic. In the context of food, bitter peptides can negatively impact palatability.
  • Quantitative Structure-Activity Relationships (QSAR): A computational modeling approach used in chemistry and biology to predict the activity of compounds based on their molecular structure. QSAR models establish a mathematical relationship between the chemical structure of a compound (represented by molecular descriptors) and its biological activity (e.g., taste, toxicity). The core idea is that similar structures should have similar activities.
  • Molecular Descriptors: Numerical values that describe the chemical and physical properties of a molecule's structure. These can include properties like molecular weight, hydrophobicity (LogP), surface area, charge distributions, number of specific atoms/bonds, etc. They are the input features for QSAR and machine learning models.
  • Machine Learning (ML): A field of artificial intelligence that uses statistical techniques to enable computer systems to "learn" from data without being explicitly programmed. In this paper, ML algorithms are used to build predictive models for taste.
  • Gradient Boosting Decision Tree (GBDT): A powerful ensemble machine learning technique that builds a predictive model in a stage-wise fashion, where each new decision tree corrects the errors of the previous ones. It combines many weak prediction models (decision trees) to create a single strong predictor. It is known for its high accuracy and robustness.
  • Chemoinformatics (CI): An interdisciplinary field that combines chemistry, computer science, and information science. It uses computational and informational techniques to solve problems in chemistry, such as molecular design, property prediction, and drug discovery. In this paper, it's applied to the analysis and prediction of taste peptides.
  • Data Enhancement / Data Augmentation: Techniques used to increase the amount of data by adding slightly modified copies of existing data or newly created synthetic data from existing data. This is particularly useful when the original dataset is small or imbalanced.
  • SMOTE (Synthetic Minority Over-sampling Technique): A specific data enhancement technique used for imbalanced datasets, where the number of samples in one class (minority class) is significantly smaller than in another (majority class). SMOTE creates synthetic samples of the minority class by interpolating between existing minority class samples, thereby balancing the dataset and improving model performance.
  • Cross-validation: A statistical method used to estimate the performance of a machine learning model on an independent dataset. It involves partitioning the dataset into multiple subsets (or "folds"). The model is trained on a subset of the folds and validated on the remaining fold. This process is repeated multiple times, and the results are averaged to get a more robust estimate of model performance. 5-fold cross-validation means the data is divided into 5 folds, and the process is repeated 5 times, with each fold used as the validation set once.
  • Accuracy (ACC): A common evaluation metric in classification, defined as the proportion of correctly classified instances (both true positives and true negatives) out of the total number of instances.
  • Area Under the Receiver Operating Characteristic Curve (AUC-ROC): A performance metric for binary classifiers. The ROC curve plots the True Positive Rate (Recall) against the False Positive Rate at various threshold settings. The AUC represents the degree or measure of separability between classes. A higher AUC (closer to 1) indicates better model performance in distinguishing between positive and negative classes.
  • F1-score: The harmonic mean of Precision and Recall. It is a good metric to use when there is an uneven class distribution, as it balances both Precision and Recall.
  • Precision: In binary classification, Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. It answers: "Of all instances predicted as positive, how many were actually positive?"
  • Recall (Sensitivity): In binary classification, Recall is the ratio of correctly predicted positive observations to all observations in the actual class. It answers: "Of all actual positive instances, how many did we correctly predict as positive?"
  • Matthews Correlation Coefficient (MCC): A correlation coefficient used as a measure of the quality of binary classifications. It takes into account true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). MCC is generally regarded as a balanced measure, even if the classes are of very different sizes. Its value ranges from -1 (inverse prediction) to +1 (perfect prediction), with 0 indicating random prediction.
  • SHapley Additive exPlanations (SHAP): A game theory-based approach used to explain the output of any machine learning model. It assigns an importance value (SHAP value) to each feature for a particular prediction, showing how much each feature contributes to pushing the prediction from the baseline (average) prediction. This helps in model interpretability, especially for complex "black box" models.
  • Principal Component Analysis (PCA): A statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. PCA is commonly used for dimensionality reduction and data visualization, allowing high-dimensional data to be represented in fewer dimensions (e.g., 2D or 3D) while retaining most of its variance.

3.2. Previous Works

The paper contextualizes its work by discussing limitations in prior taste peptide prediction research:

  • Limited Data and Simplistic Models: Earlier studies were often "restricted by insufficient data size" and used "simplistic models." Examples include Scoring Card Method (SCM) models like iBitter-SCM and iUmami-SCM (Phasit Charoenkwan et al., 2020). These models could only achieve "a single taste judgment" (predicting only bitter or only umami, not both in a comparative context) and often had "accuracy and generalization performance...not ideal."
  • "Black Box" Algorithms and Lack of Interpretability: More recent models, such as BERT4Bitter (Phasit Charoenkwan et al., 2021), while aiming for "great performance," often ignored interpretability. They frequently used "black box" algorithms like XGBoost (Bai et al., 2021) which, despite their predictive power, make it difficult to "understand the decision-making process" and hinder debugging and maintenance. This is a significant concern for scientific understanding beyond mere prediction.
  • Lack of Code Encapsulation: A notable gap identified is that "most of the modeling research is still in the stage of developing method, and none of them have finished the encapsulation of codes." This means that while methods might be proposed, their practical implementation as reusable software packages is rare, limiting broader adoption and reproducibility.
  • Existing Databases and Web Services: The paper acknowledges existing efforts in database construction (e.g., Toxindb (D. Zhang et al., 2021), ChemTastesDB (Rojas et al., 2022)) and web prediction services (e.g., VirtualTaste (Fritz et al., 2021)). These serve as "exemplary roles" but do not offer the comprehensive system (database + prediction model + open-source package) that TastePeptides-Meta aims to provide.

3.3. Technological Evolution

The evolution of taste peptide research has moved from labor-intensive traditional experimental methods to increasingly sophisticated computational approaches:

  1. Traditional Experimental Identification: Initially, identifying taste peptides involved laborious and expensive wet-lab processes of pretreatment, separation, purification, synthesis, characterization, and sensory evaluation. This bottleneck limited the discovery rate and understanding of taste mechanisms.

  2. Emergence of QSAR Models: With advancements in computer performance and chemoinformatics (CI), quantitative structure-activity relationships (QSAR) models emerged. These models began to leverage molecular structures to predict activities, including those of biological peptides (Mahmoodi-Reihani et al., 2020) and ADMET properties (Oussama et al., 2022). This marked a shift towards in silico prediction, reducing the reliance on purely experimental methods.

  3. Early Machine Learning Applications: As machine learning became more accessible, it was applied to taste prediction. However, early attempts were often characterized by "insufficient data size" and "simplistic models" (e.g., Scoring Card Method based models like iBitter-SCM and iUmami-SCM), leading to suboptimal accuracy and generalization performance. Many of these models focused on single taste predictions.

  4. Rise of Complex ML Models: More powerful ML algorithms, including ensemble methods (like XGBoost) and deep learning (like BERT in BERT4Bitter), started being employed. While these often achieved "great performance," they frequently operated as "black boxes," sacrificing interpretability—a critical aspect for understanding underlying mechanisms in scientific research.

  5. Development of Databases and Web Services: Alongside modeling efforts, there has been a parallel development of databases (e.g., Toxindb, ChemTastesDB) to centralize chemical information and web prediction services (e.g., VirtualTaste) to make predictions accessible. However, these components often existed in isolation.

    This paper's work (TastePeptides-Meta) fits within the latest stage of this evolution by attempting to integrate and advance these disparate components: combining a large, curated database, an interpretable and high-performing ML model, and an open-source package into a systematic universe. This represents a move towards more comprehensive, transparent, and user-friendly in silico tools for taste peptide research.

3.4. Differentiation Analysis

Compared to the main methods in related work, the TastePeptides-Meta system, including its core components (TastePeptidesDB, Umami_YYDS, and Auto_Taste_ML), offers several key innovations and differentiators:

  1. Systematic and Integrated Approach:

    • Prior Work: Often focused on individual components—either building a database, developing a specific prediction model, or, less frequently, encapsulating code. These efforts were typically fragmented.
    • TastePeptides-Meta: Proposes a "systematic taste peptides universe" that integrates all three crucial aspects: a database for information summary, a prediction model for identification, and an open-source package for auxiliary modeling. This comprehensive approach is highlighted as unique ("no similar platform published like the TastePeptides-Meta in this field").
  2. Comprehensive Data Foundation (TastePeptidesDB):

    • Prior Work: Many models were "restricted by insufficient data size," limiting their accuracy and generalization performance.
    • TastePeptides-Meta: Addresses this by creating TastePeptidesDB, which is claimed to be "the largest taste peptide database with the most information." A larger and well-curated dataset is fundamental for building robust machine learning models.
  3. Interpretable Model (Umami_YYDS):

    • Prior Work: Increasingly used "black box" algorithms (e.g., BERT4Bitter based on BERT, XGBoost), which, despite high performance, "focus too much on achieving great performance while ignoring the interpretability of the models." This makes debugging and rule mining challenging.
    • TastePeptides-Meta: Emphasizes building a QSAR model with "excellent performance and model interpretability." By using SHAP values with a Gradient Boosting Decision Tree model, Umami_YYDS provides insights into the "decision-making process" and identifies "key molecular descriptors" (solubility, charge & van der Waals radius, molecular weight) that determine taste. This transparency is critical for scientific understanding and future hypothesis generation.
  4. Open-Source and Reproducible (Auto_Taste_ML):

    • Prior Work: "None of them have finished the encapsulation of codes," making it difficult for other researchers to reproduce methods or build upon them.
    • TastePeptides-Meta: Releases Auto_Taste_ML as "the first open-source machine learning package in the field of taste." This package encapsulates the entire modeling process (data processing, feature construction, model selection, visualization), promoting reproducibility, transparency, and reducing workloads for other researchers.
  5. Umami/Bitter Classification with Balanced Judgment:

    • Prior Work: Models often focused on single taste judgment (e.g., iBitter-SCM, iUmami-SCM).

    • TastePeptides-Meta: Umami_YYDS is designed for umami/bitter classification. Crucially, it demonstrates an "unbiased" judgment, maintaining high accuracy for both umami and bitter predictions, unlike some previous models that might "overemphasize the judgment of bitter" or make more misjudgments for umami.

      In essence, while previous works contributed individual pieces of the puzzle, TastePeptides-Meta aims to provide a cohesive, transparent, and accessible ecosystem for taste peptide research.

4. Methodology

4.1. Principles

The core principle of the methodology is to leverage machine learning (ML) and chemoinformatics (CI) to establish quantitative structure-activity relationships (QSAR) for taste peptides. The central idea is that the taste attributes (umami or bitter) of peptides can be predicted by analyzing their molecular structures and properties, which are quantitatively represented by molecular descriptors.

The theoretical basis or intuition behind this approach is that specific structural features and physicochemical properties of peptides interact with taste receptors in a predictable manner. By identifying and quantifying these key molecular characteristics (the molecular descriptors), a computational model can learn the complex patterns that differentiate umami from bitter peptides.

The workflow involves:

  1. Data Collection: Gathering known taste peptides with their associated taste attributes.

  2. Feature Engineering: Calculating a comprehensive set of molecular descriptors from the peptide sequences/structures.

  3. Feature Selection: Identifying the most relevant and discriminative subset of these descriptors that are highly correlated with taste. This step is crucial for building efficient and interpretable models and avoiding overfitting.

  4. Model Training: Applying ML algorithms to learn the relationship between the selected features and the taste attributes, optimizing the model's parameters to maximize prediction performance.

  5. Model Evaluation: Rigorously assessing the trained model's performance using various metrics and validation strategies (e.g., cross-validation, generalization test set, sensory experiments).

  6. Interpretability Analysis: Understanding why the model makes certain predictions by identifying the most influential features, which provides scientific insights into taste mechanisms.

  7. System Development: Encapsulating the data, model, and tools into a user-friendly system for broader application.

    This QSAR approach transforms the laborious experimental process of taste peptide identification into a rapid in silico screening method, offering a predictive and insightful tool for food science.

4.2. Core Methodology In-depth

The methodology for building the TastePeptides-Meta system, particularly the Umami_YYDS model, follows a systematic approach encompassing data collection, feature engineering, model selection, optimization, and validation.

4.2.1. Benchmark Data Sets

The foundation of the Umami_YYDS model is a curated dataset of peptides with known taste attributes.

  • Initial Collection: A total of 203 reported umami/bitter peptides were initially collected specifically for model construction. This set included 99 dipeptides (31 umami and 68 bitter) and 104 tripeptides (53 umami and 61 bitter).
  • Labeling Strategy: Given the inhibitory effect of umami substances on bitterness (Kim et al., 2015), umami peptides were labeled as positive and bitter peptides as negative for the binary classification task.
  • Broader Database (TastePeptidesDB): For the TastePeptidesDB database, a more extensive collection was performed by searching Web of Science using keywords like "Tastes", "Sour", "Sweet", "Bitter", "Salty", "Umami", "Kokumi", "Astringent", and "Peptides". This yielded 483 peptides (collected by Dec 3rd, 2021), which are displayed on the TastePeptidesDB website (http://www.tastepeptides-meta.com/database/son/1). This larger dataset forms the basis of the comprehensive TastePeptidesDB database, while the more focused 203 peptides were used for model training.
  • Dataset for Model Training (ATPD): All collected umami and bitter peptides (presumably the 203 peptides, possibly expanded by the SMOTE process) were constructed into a dataset referred to as ATPD (all taste peptides dataset).
  • Generalization Test Set (GTS): For independent testing and verification, 410 peptides were specifically used to constitute a generalization test set (GTS) to better detect the model's generalization performance.
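To make the labeling convention concrete, here is a minimal pandas sketch (not the authors' code) of how such a benchmark table could be assembled. The example sequences are hypothetical placeholders; the umami = positive / bitter = negative convention and the 203/99/104 counts come from the paper.

```python
# Minimal sketch of assembling the benchmark set (toy entries; labels follow the paper:
# umami = 1 (positive), bitter = 0 (negative)).
import pandas as pd

records = [
    {"sequence": "EE",  "taste": "umami"},   # hypothetical dipeptide
    {"sequence": "LF",  "taste": "bitter"},  # hypothetical dipeptide
    {"sequence": "EDE", "taste": "umami"},   # hypothetical tripeptide
    {"sequence": "LPF", "taste": "bitter"},  # hypothetical tripeptide
]
df = pd.DataFrame(records)
df["label"] = (df["taste"] == "umami").astype(int)
df["length"] = df["sequence"].str.len()

# The full benchmark would contain 203 peptides: 99 dipeptides (31 umami / 68 bitter)
# and 104 tripeptides (53 umami / 61 bitter).
print(df)
print(df.groupby(["length", "taste"]).size())
```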

4.2.2. Feature Structure

The process of extracting and selecting relevant molecular descriptors (features) from the peptides was critical. This involved a 4-step feature selection process as illustrated in Figure S1A (not provided, but described in text):

  • Step 1: Descriptor Calculation:

    • For each peptide, 208 molecular descriptors were initially calculated using the chemometrics special toolkit RDKit 2020.9.1 (Landrum, 2006). These descriptors are designed to capture various properties such as water solubility, electrostatic properties, and atomic properties of peptides (Marcou et al., 2012).
    • To provide a more comprehensive description, an additional 69 descriptors were added. These included planar properties, cyclic properties (Frecer, 2006), aromatic properties (Adamczak et al., 2020), and properties related to the first and last amino acids (e.g., presence of C-terminal hydrophobic amino acids) (Phasit Charoenkwan, Yana, Schaduangrat, et al., 2020).
    • In total, 278 descriptor features were initially considered for each peptide (shown in Fig. S2, not provided).
  • Step 2: Variance Checking:

    • Features with a variance of 0 were discarded using the variance checking algorithm from scikit-learn 0.24.2 (Buitinck et al., 2013). This step removes features that have the same value across all samples, as they provide no discriminative information. This left 207 features.
  • Step 3: Statistical Screening:

    • The Kolmogorov-Smirnov test and t-test (with a p-value ≤ 0.0001) were employed to perform feature screening from a statistical perspective. These tests identify features whose distributions differ significantly between the umami and bitter peptide classes, indicating their potential relevance for classification (shown in Fig. S3, not provided).
  • Step 4: Recursive Feature Elimination with Cross-Validation (RFE-CV):

    • Recursive Feature Elimination with Cross-Validation (RFE-CV) was implemented to select the optimal number of features. This method iteratively trains a model (in this case, a Random Forest Model) and eliminates the least important features until the optimal subset is found.

    • 51 features were used as input parameters. In each iteration, features with the largest effect were retained.

    • The cross-validation score of the models reached a top score when the number of features was 8 (Fig. S1B, not provided).

    • Considering computational cost and overfitting probabilities, these 8 features were chosen as the final set for model training. These 8 key molecular descriptors are: BCUT2D_MWLOW, PEOE_VSA14, SMR_VSA1, MinEStateIndex, VSA_EState5, VSA_EState6, VSA_EState7, and MolLogP. Their source and calculation modules are detailed in Table 1.

      The selected features, their RDKit calculation modules, and explanations are listed below; a minimal code sketch of this screening pipeline follows the list:

      • BCUT2D_MWLOW (rdMolDescriptors.BCUT2D): lowest and highest eigenvalues of the original Burden matrix and the three variants introduced by Pearlman and Smith (Beno & Mason, 2001)
      • SMR_VSA1 (MolSurf module): polarizability
      • MinEStateIndex (EState.EState module): MOE-type descriptors using EState indices and surface area contributions (developed at RD, not described in the CCG paper) (Hall, Mohney, & Kier, 1991)
      • VSA_EState5, VSA_EState6, VSA_EState7 (EState.EState_VSA module): same EState and surface-area descriptor family as MinEStateIndex
      • PEOE_VSA14 (Chem.MolSurf module): MOE-like approximate molecular surface area descriptors (Labute, 2000)
      • MolLogP (Crippen module): indicators for describing ligands based on atomic contributions (Wildman & Crippen, 1999)
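As a concrete illustration of the four screening steps, here is a minimal sketch using RDKit and scikit-learn. It is not the authors' code: the peptide SMILES, labels, cross-validation folds, and the relaxed significance threshold are toy assumptions chosen only so the snippet runs end to end (the paper uses 5-fold RFE-CV and p ≤ 0.0001 on the real data).

```python
# Minimal sketch of the 4-step descriptor screening (toy data, not the authors' code).
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors
from scipy.stats import ks_2samp, ttest_ind
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV, VarianceThreshold

# Hypothetical di/tripeptide SMILES with umami (1) / bitter (0) labels.
smiles = [
    "NCC(=O)NCC(=O)O",              # Gly-Gly
    "CC(N)C(=O)NCC(=O)O",           # Ala-Gly
    "NC(CC(=O)O)C(=O)NCC(=O)O",     # Asp-Gly
    "NC(CCC(=O)O)C(=O)NCC(=O)O",    # Glu-Gly
    "CC(C)CC(N)C(=O)NCC(=O)O",      # Leu-Gly
    "CC(C)C(N)C(=O)NCC(=O)O",       # Val-Gly
    "NC(Cc1ccccc1)C(=O)NCC(=O)O",   # Phe-Gly
    "OCC(N)C(=O)NCC(=O)O",          # Ser-Gly
]
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# Step 1: compute the ~208 RDKit 2D descriptors per peptide.
names = [name for name, _ in Descriptors.descList]
X = np.nan_to_num(np.array(
    [[fn(Chem.MolFromSmiles(s)) for _, fn in Descriptors.descList] for s in smiles]
))

# Step 2: discard zero-variance descriptors.
vt = VarianceThreshold(threshold=0.0)
X = vt.fit_transform(X)
names = [n for n, keep in zip(names, vt.get_support()) if keep]

# Step 3: statistical screening with KS test and t-test (paper: p <= 0.0001;
# threshold relaxed here so the tiny toy set keeps some columns).
def passes(col, alpha=0.3):
    u, b = col[labels == 1], col[labels == 0]
    return ks_2samp(u, b).pvalue <= alpha and ttest_ind(u, b).pvalue <= alpha

mask = np.array([passes(X[:, j]) for j in range(X.shape[1])])
if not mask.any():          # guard for the toy data
    mask[:] = True
X, names = X[:, mask], [n for n, m in zip(names, mask) if m]

# Step 4: recursive feature elimination with cross-validation around a Random Forest.
rfecv = RFECV(RandomForestClassifier(n_estimators=50, random_state=0),
              step=5, cv=2, scoring="accuracy")
rfecv.fit(X, labels)
print("optimal number of features:", rfecv.n_features_)
print("selected:", [n for n, keep in zip(names, rfecv.support_) if keep])
```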

4.2.3. Data Enhancement

To address the imbalance of the data (referring to the unequal number of umami and bitter peptides), the imblearn0.8.1 package (Lemaitre et al., 2016) was used to oversample the umami peptide data (the minority class).

  • Algorithm Selection: Several SMOTE variants were compared: KMeans-SMOTE, SMOTE, and SVM-SMOTE. The plain SMOTE algorithm showed the best overall effect on the data, performing excellently on accuracy and recall-related indicators even though its precision was slightly lower (Fig. S1C, not provided). A minimal sketch of this oversampling step is given after this list.

  • Visualization of Enhanced Data: After data enhancement, the 8 selected feature values were scaled to a range of 0-10 for better visualization (Fig. S1D, not provided). Additionally, Principal Component Analysis (PCA) was used to reduce the 8-dimensional data to 2-dimensional data, allowing for visual inspection of the separation between umami and bitter peptides (Fig. S1E, not provided). This visualization confirmed a clear distinction between the data classes, indicating the effectiveness of SMOTE in improving generalization performance.

  • Training and Validation Set Construction: For model development, the enhanced data was split into a training set and a validation set using a 4:1 ratio via stratified sampling. Stratified sampling ensures that the proportion of umami and bitter peptides is maintained in both sets, which is crucial for imbalanced datasets.
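Below is a minimal sketch of this enhancement-and-split step on toy descriptor data with the paper's 31:68 umami/bitter imbalance. The 0-10 scaling, PCA projection, and 4:1 stratified split follow the text above; the data values themselves are placeholders.

```python
# Minimal sketch of the class-balancing, visualization, and split step (toy data).
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# Toy stand-in for the 8 selected descriptors: 68 bitter (0) vs 31 umami (1) samples.
X = np.vstack([rng.normal(0.0, 1.0, (68, 8)), rng.normal(1.0, 1.0, (31, 8))])
y = np.array([0] * 68 + [1] * 31)

# Oversample the minority (umami) class with plain SMOTE, as selected in the paper.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)

# Scale features to 0-10 for plotting, then project to 2-D with PCA for visual inspection.
X_scaled = MinMaxScaler(feature_range=(0, 10)).fit_transform(X_bal)
X_2d = PCA(n_components=2).fit_transform(X_scaled)

# 4:1 training/validation split with stratified sampling to preserve class ratios.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_bal, y_bal, test_size=0.2, stratify=y_bal, random_state=0
)
print(X_bal.shape, np.bincount(y_bal), X_tr.shape, X_val.shape)
```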

4.2.4. Model Selection and Optimization

The process involved selecting the most suitable machine learning algorithm and then fine-tuning its hyperparameters.

  • Algorithm Comparison: 19 popular and widely recognized binary classification algorithms were evaluated to identify the best one for discerning internal data patterns.

  • Evaluation Metrics for Selection: Accuracy (ACC) and Area Under the ROC Curve (AUC) were used as primary evaluation indices during this selection phase, assessed via 5-fold cross-validation. The results (Fig. S1F, not provided) showed that ensembled models (such as Bagging, GradientBoosting, and RandomForest) generally had higher median ACC and AUC values and more convergent box distributions, indicating stronger robustness.

  • Algorithm Choice: Among the top-performing algorithms, GradientBoosting (GTB) was ultimately selected due to its higher upper limit in ROC (0.934) (Fig. S4, not provided), suggesting its strong potential for classification.

  • Hyperparameter Optimization:

    • To thoroughly explore the Gradient Boosting algorithm's capabilities, a vast number of hyperparameter combinations (551,840 combinations) were tested.
    • Accuracy was used as the grid search evaluation index, with each combination evaluated by 5-fold cross-validation.
    • Initial results (Fig. S1G, not provided) indicated that n_estimators was the most influential factor, followed by max_depth and min_samples_split. min_samples_leaf showed less statistical significance.
    • To ensure generalization performance and avoid overfitting, combinations where n_estimator was greater than the number of samples were discarded.
    • The final optimized hyperparameters for the Umami_YYDS model (a GradientBoostingClassifier from scikit-learn) were set as follows (a minimal configuration sketch is given after this list):
      • criterion = friedman_mse
      • loss = deviance
      • max_depth = 17
      • min_samples_leaf = 3
      • min_samples_split = 10
      • n_estimators = 211
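The sketch below shows, on toy data, how the reported configuration could be instantiated and evaluated with scikit-learn. The loss argument is left at its default (binomial deviance, named "deviance" in scikit-learn 0.24.x and "log_loss" in newer releases), and the reduced grid only hints at the 551,840-combination search described above.

```python
# Minimal sketch of the final classifier configuration and its 5-fold evaluation (toy data).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(136, 8))                  # stand-in for the 8 selected descriptors
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # stand-in umami (1) / bitter (0) labels

# Hyperparameters as reported for Umami_YYDS; loss is left at the default binomial deviance.
umami_yyds = GradientBoostingClassifier(
    criterion="friedman_mse",
    max_depth=17,
    min_samples_leaf=3,
    min_samples_split=10,
    n_estimators=211,
    random_state=0,
)
print("5-fold CV accuracy:",
      cross_val_score(umami_yyds, X, y, cv=5, scoring="accuracy").mean())

# A (much reduced) grid search of the kind used to explore hyperparameter combinations,
# with accuracy as the evaluation index and 5-fold cross-validation per combination.
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"n_estimators": [100, 211], "max_depth": [3, 17], "min_samples_split": [2, 10]},
    scoring="accuracy",
    cv=5,
)
grid.fit(X, y)
print("best params:", grid.best_params_)
```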

4.2.5. Performance Evaluation

To ensure a fair, objective, and quantitative assessment of the binary classifier model performance, five widely used evaluation indicators were introduced and calculated based on scikit-learn 0.24.2 (Buitinck et al., 2013). These metrics quantify different aspects of a classifier's effectiveness:

  1. F1-score: The formula for F1-score is: $ F1 = \frac{2 \times TP}{2TP + FN + FP} $ Where:

    • TP (True Positives): The number of umami peptides correctly predicted as umami.
    • FN (False Negatives): The number of umami peptides incorrectly predicted as bitter.
    • FP (False Positives): The number of bitter peptides incorrectly predicted as umami. The F1-score is the harmonic mean of Precision and Recall, providing a balance between them.
  2. Accuracy (ACC): The formula for Accuracy is: $ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} $ Where:

    • TP (True Positives): Same as above.
    • TN (True Negatives): The number of bitter peptides correctly predicted as bitter.
    • FP (False Positives): Same as above.
    • FN (False Negatives): Same as above. Accuracy measures the proportion of total predictions that were correct.
  3. Precision: The formula for Precision is: $ Precision = \frac{TP}{TP + FP} $ Where:

    • TP (True Positives): Same as above.
    • FP (False Positives): Same as above. Precision answers: "Of all instances predicted as umami, how many were actually umami?"
  4. Recall (Sensitivity): The formula for Recall is: $ Recall = \frac{TP}{TP + FN} $ Where:

    • TP (True Positives): Same as above.
    • FN (False Negatives): Same as above. Recall answers: "Of all actual umami instances, how many did we correctly predict as umami?"
  5. Matthews Correlation Coefficient (MCC): The formula for MCC is: $ MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} $ Where:

    • TP, TN, FP, FN are as defined above. The MCC is a measure of the quality of binary and multiclass classifications. It takes into account all four values in the confusion matrix and is considered a balanced measure that can be used even with imbalanced classes. Its value ranges from +1 (perfect prediction) to -1 (inverse prediction), with 0 representing an average random prediction.

Additionally, the Area Under the ROC Curve (AUC) was used, where AUC values closer to 1 indicate a better comprehensive classification effect, while 0.5 signifies no difference from a random classifier.
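These indicators map directly onto scikit-learn functions; below is a minimal sketch with placeholder labels and predictions (not the paper's data).

```python
# Minimal sketch of the five evaluation indicators plus AUC with scikit-learn (toy values).
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             precision_score, recall_score, roc_auc_score)

y_true  = [1, 1, 1, 0, 0, 0, 1, 0]                    # 1 = umami, 0 = bitter
y_pred  = [1, 1, 0, 0, 0, 1, 1, 0]                    # hard class predictions
y_score = [0.9, 0.8, 0.4, 0.2, 0.1, 0.6, 0.7, 0.3]    # predicted umami probabilities

print("ACC      :", accuracy_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("MCC      :", matthews_corrcoef(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_score))   # computed from scores, not hard labels
```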

4.2.6. Sensory Evaluation

To validate the Umami_YYDS model's predictions with human perception, a sensory evaluation experiment was conducted.

  • Panelists: Fifteen panelists (6 males, 9 females, aged 22-29 years) with over 6 months of experience in sensory evaluation of umami peptides were recruited (Liu et al., 2020).
  • Environment: The experiment took place in an air-conditioned sensory panel room at 23 ± 2 °C and 60% humidity, ensuring controlled conditions.
  • Sample Preparation: Six unreported peptides (ATQ, LPG, ECH, RVF, RGG, NQS), predicted by Umami_YYDS and consisting of 2-3 amino acids, were randomly selected and synthesized by Geer Group Chemical Reagent Co., Ltd. (analytically pure standard). Solutions of these peptides were prepared at concentrations of 0.05, 0.1, 0.15, 0.2, 0.4, 0.6, and 0.8 mg/ml.
  • Evaluation Procedure:
    • Each 5 ml diluted sample was placed in a plastic cup with a three-digit random code.
    • Panelists evaluated samples using the triangle test (ISO 4120:2004) and the method of investigating sensitivity of taste (ISO 3972:2011).
    • For each sample, panelists were asked to rotate 5 ml in their mouth for 15 seconds before spitting it out.
    • They described the taste attributes (bitter, sour, sweet, salty, and umami) for each sample.
    • To prevent taste fatigue, panelists relaxed for 5 minutes and rinsed their mouths thoroughly with ultrapure water (produced by NW10VF water purifier system) at least twice between tests.
  • Purpose: This sensory evaluation served as an independent, real-world validation of the Umami_YYDS model's predictive accuracy for novel peptides.

4.2.7. Software Implementation

The TastePeptides-Meta system is developed as a comprehensive taste peptide universe integrating taste peptide query, taste peptide prediction, and Python language-assisted modeling.

  • Frontend: The user interfaces for TastePeptidesDB (database) and Umami_YYDS (prediction website) were built using HTML and the BootStrap4 framework.
  • Backend and Web Server:
    • Nginx was adopted for dynamic load balancing, distributing incoming requests efficiently.
    • Uwsgi, built with Django3.2 (a Python web framework), was employed to handle back-end modeling and Umami-SQL database query requests.
  • Performance: The web server was tested on Google Chrome and Apple Safari for 3 months, demonstrating good performance.
  • Auxiliary Modeling Package (Auto_Taste_ML): An open-source third-party package named Auto_Taste_ML was developed. It is written in Python and adheres to the BSD protocol. Its purpose is to facilitate the entire taste data processing and model building workflow, including feature construction, model selection, and visualization for taste peptide data. It is released on the Python Package Index (PyPI) at https://pypi.org/project/Auto-TasteML/, and its code is available on GitHub (https://github.com/SynchronyML/Auto_Taste_ML/).
  • Umami_YYDS Web Server: The prediction model is publicly accessible via a web server at http://tastepeptides-meta.com/cal.

4.2.8. Statistical Analysis

Various Python libraries were utilized for statistical analysis, numerical computation, and visualization:

  • Programming Language: Python3.8.10.
  • Statistical Tests: T-test and Kolmogorov-Smirnov test were used for evaluating significant differences between feature distributions, with a significance level of P ≤ 0.0001.
  • Numerical Computation: Pandas 1.3.3 (Mckinney, 2010) for data structures and analysis, and Numpy 1.2.0 (Harris et al., 2020) for numerical operations and array manipulation.
  • Visualization: Matplotlib 3.4.2 and plotlyExpress 0.4.1 for generating plots and figures.

5. Experimental Setup

5.1. Datasets

The study utilized several datasets throughout its various phases:

  • Benchmark Data for Model Construction:

    • Source: Collected from reported umami/bitter peptides.
    • Scale: 203 peptides in total. Specifically, 99 dipeptides (31 umami, 68 bitter) and 104 tripeptides (53 umami, 61 bitter).
    • Characteristics: These peptides are short (di- or tri-peptides) and are explicitly labeled as either umami (positive class) or bitter (negative class). The labeling considered the inhibitory effect of umami on bitterness. This dataset formed the basis for training and validating the Umami_YYDS model.
    • Data Split: This dataset was split into a training set and validation set using a 4:1 ratio via stratified sampling to ensure proportional representation of classes.
  • TastePeptidesDB Database:

    • Source: A broader collection of peptides gathered from Web of Science using keywords like "Tastes", "Sour", "Sweet", "Bitter", "Salty", "Umami", "Kokumi", "Astringent", and "Peptides".
    • Scale: 483 peptides (as of Dec 3rd, 2021).
    • Characteristics: This database includes taste peptides with various taste attributes beyond just umami and bitter. Each entry contains name (FASTA format), taste, verification status (Vitro_verit), SMILES, literature reference, author, and update time. It is described as the "largest taste peptide database that have been published" in terms of information.
    • Domain: Food-derived peptides.
  • All Taste Peptides Dataset (ATPD):

    • Source: Composed of "All the umami and bitter peptides." This likely refers to the 203 peptides used for model construction, potentially augmented by SMOTE.
    • Purpose: Used for testing the Umami_YYDS model's taste recognition effect on a comprehensive set of umami and bitter peptides.
  • Generalization Test Set (GTS):

    • Source: Not explicitly stated if it's a subset of ATPD or a completely separate collection, but it's used for independent testing.
    • Scale: 410 peptides.
    • Purpose: Specifically designed to assess the generalization performance of the model, allowing for a robust evaluation of how well the model performs on unseen data.
  • Sensory Evaluation Peptides:

    • Source: 6 unreported peptides (NQS, ATQ, ECH, RVF, RGG, LPG) that were food-derived and consisted of 2-3 amino acids. These were chosen from the predicted results of Umami_YYDS and then synthesized.

    • Purpose: Used for human sensory validation to directly confirm the model's predictive accuracy in a real-world setting.

      The choice of these datasets allows for both rigorous internal model validation on a controlled benchmark and external validation on a larger, more diverse set and with human sensory panels, ensuring the method's effectiveness and applicability.

5.2. Evaluation Metrics

The paper uses several standard classification evaluation metrics to quantify the performance of Umami_YYDS and compare it with other models. For each metric, a conceptual definition, its mathematical formula, and an explanation of its symbols are provided below.

  1. Accuracy (ACC)

    • Conceptual Definition: Accuracy measures the overall correctness of a classification model. It represents the proportion of total predictions that were correct (both true positives and true negatives).
    • Mathematical Formula: $ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} $
    • Symbol Explanation:
      • TP (True Positives): Number of instances correctly predicted as positive (umami).
      • TN (True Negatives): Number of instances correctly predicted as negative (bitter).
      • FP (False Positives): Number of instances incorrectly predicted as positive (bitter predicted as umami).
      • FN (False Negatives): Number of instances incorrectly predicted as negative (umami predicted as bitter).
  2. F1-score

    • Conceptual Definition: The F1-score is the harmonic mean of Precision and Recall. It is particularly useful when dealing with imbalanced datasets, as it balances both false positives and false negatives, providing a more robust measure of a model's performance than simple accuracy.
    • Mathematical Formula: $ F1 = \frac{2 \times TP}{2TP + FN + FP} $
    • Symbol Explanation:
      • TP: True Positives.
      • FN: False Negatives.
      • FP: False Positives.
  3. Precision

    • Conceptual Definition: Precision measures the proportion of positive identifications that were actually correct. It quantifies how many of the predicted umami peptides were truly umami.
    • Mathematical Formula: $ Precision = \frac{TP}{TP + FP} $
    • Symbol Explanation:
      • TP: True Positives.
      • FP: False Positives.
  4. Recall (Sensitivity)

    • Conceptual Definition: Recall measures the proportion of actual positives that were correctly identified. It quantifies how many of the actual umami peptides were correctly predicted by the model.
    • Mathematical Formula: $ Recall = \frac{TP}{TP + FN} $
    • Symbol Explanation:
      • TP: True Positives.
      • FN: False Negatives.
  5. Area Under the ROC Curve (AUC)

    • Conceptual Definition: AUC represents the degree or measure of separability between classes. It indicates how well the model can distinguish between positive (umami) and negative (bitter) classes. A higher AUC means the model is better at predicting 0s as 0s and 1s as 1s.
    • Mathematical Formula: While AUC itself does not have a single simple formula that can be written out directly from TP/TN/FP/FN, it is calculated by integrating the Receiver Operating Characteristic (ROC) curve. The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings.
      • $TPR = Recall = \frac{TP}{TP + FN}$
      • $FPR = \frac{FP}{FP + TN}$
    • Symbol Explanation:
      • TP, TN, FP, FN: As defined above.
      • AUC values range from 0 to 1, where 1 indicates a perfect classifier, and 0.5 indicates a random classifier.
  6. Matthews Correlation Coefficient (MCC)

    • Conceptual Definition: MCC is a balanced measure for binary classification, which considers all four types of predictions (TP, TN, FP, FN). It's generally considered a reliable metric, especially for imbalanced datasets, as it produces a high score only if the classifier performs well on both positive and negative classes.
    • Mathematical Formula: $ MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} $
    • Symbol Explanation:
      • TP, TN, FP, FN: As defined above.
      • MCC values range from -1 to +1: +1 represents a perfect prediction, 0 represents an average random prediction, and -1 represents an inverse prediction.

5.3. Baselines

The Umami_YYDS model's performance was compared against several existing and well-known taste classifier models to demonstrate its superiority:

  1. iUmami-SCM: A Scoring Card Method based model specifically designed for predicting umami peptides (Phasit Charoenkwan, Yana, Nantasenamat, et al., 2020). This represents a more simplistic, rule-based approach.

  2. Q model (Ney, 1979): A model (likely referring to a general statistical or QSAR model) used for predicting bitterness of peptides based on amino acid composition and chain length.

  3. BERT_bitter (BERT4Bitter): A deep learning based model that uses Bidirectional Encoder Representations from Transformers (BERT) for improving the prediction of bitter peptides (Phasit Charoenkwan, Nantasenamat, Hasan, Manavalan, et al., 2021). This represents a state-of-the-art, "black box" approach.

  4. Other unspecified models: The paper mentions "other models may overemphasize the judgment of bitter" without naming them explicitly in the main comparison. The initial model selection phase also compared against 19 popular binary classification algorithms (e.g., Bagging, RandomForest), but the direct performance comparison focuses on iUmami-SCM, Q model, and BERT_bitter.

    These baselines were chosen to represent different approaches to taste prediction (simplistic rule-based, traditional QSAR, and modern deep learning), allowing for a comprehensive evaluation of Umami_YYDS's strengths.

6. Results & Analysis

6.1. Core Results Analysis

6.1.1. Umami_YYDS Reveal Known and New Umami/Bitter-Presented Determinants

The paper emphasizes the interpretability of Umami_YYDS, contrasting it with "black box" models. SHAP (SHapley Additive explanation) algorithm, based on game theory, was employed to understand the model's decision-making process and feature importance.
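As a minimal sketch of this kind of analysis (a freshly trained gradient-boosting model on toy data, not the published Umami_YYDS), SHAP values for a tree model can be obtained with the shap package's TreeExplainer. The eight feature names follow the paper; everything else below is an assumption.

```python
# Minimal sketch of SHAP-based interpretation of a gradient-boosting classifier (toy data).
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingClassifier

features = ["BCUT2D_MWLOW", "PEOE_VSA14", "SMR_VSA1", "MinEStateIndex",
            "VSA_EState5", "VSA_EState6", "VSA_EState7", "MolLogP"]

rng = np.random.default_rng(0)
X = rng.normal(size=(136, len(features)))   # stand-in descriptor matrix
y = (X[:, 7] < 0).astype(int)               # toy rule: low "MolLogP" -> umami

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# TreeExplainer assigns each feature a per-sample contribution (SHAP value).
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Global importance: mean absolute SHAP value per feature.
importance = np.abs(shap_values).mean(axis=0)
for name, score in sorted(zip(features, importance), key=lambda p: -p[1]):
    print(f"{name:15s} {score:.3f}")

# shap.summary_plot(shap_values, X, feature_names=features)  # beeswarm plot as in Fig. 2B
```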

  • Identified Key Features: Through feature-screening, eight key molecular descriptors were identified. These are:

    • BCUT2D_MWLOW (BM)

    • PEOE_VSA14 (PV14)

    • SMR_VSA1 (SV1)

    • MinEStateIndex (ME)

    • VSA_EState5 (VS5)

    • VSA_EState6 (VS6)

    • VSA_EState7 (VS7)

    • MolLogP

      The MolLogP and SMR_VSA (SMR_VSA1) features are highlighted as partially overlapping with descriptors used in other models (e.g., MLP for bitter/sweet molecules), confirming their relevance.

  • Feature Importance and Interpretability: The 8 features were sorted based on Permutation Importance (Fig. 2B, from original paper, not provided here). MolLogP, VSA_EState6, and BCUT2D_MWLOW were found to be positively correlated with SHAP values, indicating their significant contribution to the model's output.

    The features were grouped into three categories, providing insights into taste mechanisms:

    1. Solubility (MolLogP, SMR_VSA1):

      • Solubility was identified as the most important indicator for distinguishing umami from bitter peptides.
      • MolLogP (LogP) describes hydrophobicity/lipophilicity, inversely related to water solubility. A lower LogP (more hydrophilic) generally implies higher water solubility.
      • SMR_VSA1 represents polarizability.
      • Insights: High water solubility is usually associated with umami peptides. Partial Dependence Plot (PDP) analysis (Fig. S6A, not provided) showed that when LogP < -0.83, there was a > 61.5% probability of peptides being umami. Specific LogP intervals such as [-3.84, -2.51] showed a 92.6% success rate for umami judgment. Similarly, SMR_VSA1 values > 24.6 indicated a higher probability of umami (Fig. S6B, not provided). A minimal partial-dependence sketch is given after Figure 2 below.
      • Figure 2A (below) illustrates the discriminating effect of SMR_VSA1 and MolLogP, showing clear separation between umami and bitter peptides.
    2. Charge Properties and Van der Waals Surface (VSA_EState5/6/7, MinEStateIndex, SMR_VSA1, PEOE_VSA14):

      • These are described as "non-intuitive indicators" derived from complex matrices representing charge characteristics and Van der Waals space volume.
      • Insights: VSA_EState6 was a key indicator. When VSA_EState6 < -1.84, there was a > 74.1% probability of the peptide being umami (Fig. S6C, not provided). Specific intervals such as [-5.21, -3.61] showed a 100% success rate for umami prediction. This highlights the importance of charge distribution and molecular surface properties.
    3. Molecular Weight (BCUT2D_MWLOW):

      • BCUT2D_MWLOW is part of the BCUT descriptor which encodes atomic properties related to intermolecular interactions and is derived from a Burden matrix representation of the molecular connection table.

      • Insights: While BCUT2D_MWLOW's regional judgment was fluctuating and difficult to determine by a single index (Fig. S6D, S6E, not provided), its significance is consistent with findings from iUmami-SCM, suggesting peptide's molecular weight contributes to umami taste. Generally, smaller molecular weights (<0.5 kDa and 0.5-3 kDa) are often associated with umami-flavor peptides, while large molecular weights tend to be tasteless or bitter.

        The following figure (Figure 2 from the original paper) illustrates the contribution analysis of eight key molecular descriptors in the Umami_YYDS model prediction.

        Figure 2: (A) A two-dimensional plot showing the distribution and separation of umami (green) and bitter (orange) peptides based on SMR_VSA1 (x-axis) and MolLogP (y-axis); bubble size indicates frequency and color intensity indicates taste prevalence. (B) A SHAP summary plot (features ordered by permutation importance) illustrating the impact and direction of each feature on the model output. (C) A force plot showing the judgment path for a specific taste peptide.
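The interval-based statements above (e.g., LogP < -0.83, VSA_EState6 < -1.84) come from partial dependence analysis. As referenced in the solubility insights, here is a minimal sketch of a manual partial-dependence computation for a single descriptor on toy data; sklearn.inspection.partial_dependence implements the same averaging, and none of the numbers below reproduce the paper's Fig. S6.

```python
# Minimal sketch of a partial-dependence check on one descriptor (toy data).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(136, 8))            # columns stand in for the 8 selected descriptors
y = (X[:, 7] < -0.8).astype(int)         # toy rule: low "MolLogP" (column 7) -> umami
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Partial dependence: vary one descriptor over a grid while keeping the observed values
# of the remaining descriptors, then average the predicted umami probability.
grid = np.linspace(X[:, 7].min(), X[:, 7].max(), 6)
for g in grid:
    X_mod = X.copy()
    X_mod[:, 7] = g
    p_umami = model.predict_proba(X_mod)[:, 1].mean()
    print(f"MolLogP ~ {g:6.2f} -> mean P(umami) = {p_umami:.2f}")
```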

6.1.2. Comparison of Umami_YYDS with Well-Known Taste Classifier

The Umami_YYDS model's performance was evaluated against other models, showing strong capabilities.

  • Calibration Set Performance:

    • Umami_YYDS achieved 89.6% accuracy and 98% AUC on the calibration set (Fig. 3A & B below). This indicates excellent performance under controlled conditions.
  • ATPD Performance and Bias Analysis:

    • On the ATPD (all taste peptides dataset), Umami_YYDS showed a good taste recognition effect.
    • The confusion matrix (Fig. S5, not provided) indicated that Umami_YYDS had an accuracy rate of 73%, which is comparable to iUmami-SCM (73.8%).
    • Crucially, Umami_YYDS's judgment ratio of umami to bitterness (46:63) was the closest to the actual 198:215 ratio in ATPD. This suggests that Umami_YYDS equally learned the umami and bitter attribute characteristics, avoiding the bias seen in other models that "may overemphasize the judgment of bitter," leading to more misjudgments for umami.
  • Comparison of Metrics (ACC, MCC, Precision, Recall, F1):

    • Umami_YYDS achieved very similar Accuracy and MCC values to iUmami-SCM (ACC: Umami_YYDS = 0.735, iUmami-SCM = 0.738; MCC: Umami_YYDS = 0.474, iUmami-SCM = 0.485) (Fig. 3C below).
    • Its precision was at a medium level, explained by its "unbiased" judgment, meaning it wasn't overly conservative towards bitter predictions like some other models.
    • Umami_YYDS had a significant lead in Recall, indicating fewer misjudgments for correct identification.
    • The F1 score for Umami_YYDS was the highest, signifying a good harmonic mean of recall and precision, and thus, an ideal and unbiased judgment.
  • Generalization Performance on GTS (by peptide length):

    • The GTS (generalization test set) was used to assess performance on novel peptides of varying lengths.

    • As shown in Fig. 3D (below):

      • ACC & F1: Umami_YYDS showed a linear increasing tendency from the beginning and took the leading position from hexapeptides onwards.
      • Precision: Umami_YYDS's precision gradually improved with peptide length, which is consistent with its "unbiased" nature.
      • Recall: Although showing a slight downward trend and meeting the Q model line at 10 peptides, it generally remained in a leading position.
      • MCC: Umami_YYDS was competitive with the best model (BERT_bitter) and gradually overtook it in the mid-to-long peptide range.
    • Conclusion: The model's judgment for peptides of length ≥ 4 was found to be reliable and extremely competitive.

      The following figure (Figure 3 from the original paper) presents the performance validation of the Umami_YYDS model.

      Figure 3: (A) Confusion matrix for Umami_YYDS showing true positives, true negatives, false positives, and false negatives. (B) ROC curve of Umami_YYDS with an AUC of 0.98, illustrating its classification performance. (C) Bar charts comparing Accuracy (ACC), Precision, Recall, F1-score, and MCC of Umami_YYDS against iUmami-SCM. (D) Line graphs showing the trend of ACC, Precision, Recall, F1-score, and MCC for Umami_YYDS and other models (Q model, BERT_bitter) as peptide length increases.

6.1.3. Identify Novel Umami Peptides

To further validate Umami_YYDS, sensory evaluations were performed on 6 randomly selected, unreported peptides (NQS, ATQ, ECH, RVF, RGG, LPG) that were food-derived and 2-3 amino acids long, chosen from the model's predictions.

  • High Consistency with Predictions:

    • The actual taste perception of ATQ, ECH, RVF, and NQS was highly consistent with the predicted results (Table S5, not provided, and Fig. 4A below).
    • ATQ, ECH, and NQS showed strong umami taste, with recognition thresholds of 0.164, 0.184, and 0.148 mg/mL, respectively. They also exhibited sweetness (thresholds of 0.134, 0.181, and 0.137 mg/mL, respectively), consistent with the synergistic effect between sweet and umami tastes.
    • RVF showed a strong bitter taste with a recognition threshold of 0.150 mg/mL.
    • Compared to similar models, Umami_YYDS achieved the best ACC of 80% in this sensory validation (Table S6, not provided).
  • Analysis of Misjudgments (RGG and LPG):

    • RGG and LPG were predicted to be bitter but showed a dominant umami perception in sensory evaluation, despite some bitterness. This was identified as a misjudgment.

    • Reason for Misjudgment: An analysis of their characteristic attribute values (Fig. 4B below) revealed that most of their attributes were similar to the mean values of umami peptides, except for SV1 (SMR_VSA1) and VS7 (VSA_EState7).

      • The mean SV1 of umami peptides was 30.647, but RGG and LPG had SV1 values of 19.40, which are very close to the bitter peptides' mean SV1 of 18.567.
      • The mean VS7 of umami peptides was -0.366, but RGG and LPG had VS7 values of 0.905 and 1.843, respectively, which are close to the bitter peptides' mean VS7 of 1.213.
    • Conclusion: These two parameters (SV1 and VS7) were identified as the likely cause of the model's misjudgment, providing a clear direction for future model upgrades and improvements (a descriptor-computation sketch follows Figure 4 below).

      The following figure (Figure 4 from the original paper) shows the sensory evaluation results.

      Figure 4: (A) Radar chart illustrating the taste profiles (sour, bitter, salty, umami, sweet) of six synthesized peptides (NQS, ATQ, ECH, RVF, RGG, LPG) based on sensory evaluation. (B) Bar chart comparing the values of selected molecular features (SV1, MolLogP, VS7, VS6, VS5, MinEStateIndex, PEOE_VSA14, BCUT2D_MWLOW) for RGG, LPG, and the average values for umami and bitter peptides, highlighting differences in SV1 and VS7 for the misjudged peptides.
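
To show how the descriptor values discussed above can be reproduced, the sketch below (not the authors' code) builds the RGG and LPG tripeptides from their one-letter sequences with RDKit and computes SMR_VSA1 (SV1) and VSA_EState7 (VS7); exact values may differ slightly between RDKit versions.

```python
# Hedged sketch: computing the two descriptors implicated in the RGG/LPG
# misjudgments with RDKit (values may vary slightly across RDKit versions).
from rdkit import Chem
from rdkit.Chem import Descriptors

for seq in ["RGG", "LPG"]:
    mol = Chem.MolFromSequence(seq)       # build the peptide from its one-letter sequence
    sv1 = Descriptors.SMR_VSA1(mol)       # "SV1" in the paper's shorthand
    vs7 = Descriptors.VSA_EState7(mol)    # "VS7" in the paper's shorthand
    print(f"{seq}: SMR_VSA1 = {sv1:.2f}, VSA_EState7 = {vs7:.2f}")
```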

6.1.4. TastePeptides-Meta System

The TastePeptides-Meta system integrates three main parts: TastePeptidesDB (database), Auto_Taste_ML (ML package), and Umami_YYDS (web server).

6.1.4.1. TastePeptidesDB database

  • Purpose: A database for storing and displaying taste peptide information.

  • Scale: Currently contains 483 taste peptides, which the authors describe as the largest taste peptide database published to date.

  • Entry Information: Each peptide entry includes: sequence name (FASTA format), taste attribute, in vitro verification status (Vitro_verit), Canonical SMILES (simplified molecular-input line-entry system) string, source literature, contributing author (Contributor), update time, etc.

  • Query Functions: The query page offers four basic functions: precise search, taste-based screening, submission of new discoveries, and cross-page navigation links (Fig. 5A below).

  • Submission Workflow: Users can submit new discoveries by providing required information (Fig. 5B below), following a detailed workflow provided in the supplementary material.

  • Taste Distribution Analysis:

    • Peptides are sorted by taste attributes: umami, bitter, sweet, sour, kokumi, astringent, and salty (Fig. 5C below).
    • Most reported studies (79.4%) focus on umami and bitter peptides. This suggests that peptide structures are highly susceptible to activating umami receptors (T1R1-T1R3) and bitter receptors (GABA or T2Rs). Sweet taste receptors (T1R2-T1R3) are less easily activated by peptides.
    • Single-taste bitter or umami peptides are the most abundant, followed by sweet/umami and bitter/umami peptides (Fig. 5D below). This indicates the existence of peptides that can activate multiple taste receptors simultaneously, prompting further research into their key conformations.
  • Peptide Length Distribution: Dipeptides and tripeptides account for almost half of the database, with the number of taste peptides gradually decreasing as length increases (Fig. 5E below; a small tallying sketch follows Figure 5).

    The following figure (Figure 5 from the original paper) shows the interface of the TastePeptidesDB database and statistics of taste distribution.

    Figure 5: (A) Screenshot of the TastePeptidesDB search interface, showing options for precise search, taste screening, and submission. (B) Example form for submitting new peptide information, detailing required fields. (C) Pie chart showing the distribution of reported taste peptides by taste attribute (umami, bitter, sweet, sour, kokumi, astringent, salty). (D) Bar chart illustrating the number of peptides exhibiting single or multiple taste attributes (e.g., bitter, umami, sweet/umami, bitter/umami). (E) Donut chart displaying the distribution of taste peptides based on their length (dipeptides, tripeptides, tetrapeptides, etc.).
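
As a small illustration of the statistics summarized in Fig. 5C-E, the sketch below tallies taste attributes and peptide lengths with pandas; the tabular structure with `sequence` and `taste` columns is an assumption made for illustration, not the database's actual export format.

```python
# Hedged sketch: tallying taste attributes (cf. Fig. 5C) and peptide lengths
# (cf. Fig. 5E) from a hypothetical tabular export of TastePeptidesDB.
import pandas as pd

df = pd.DataFrame({
    "sequence": ["EE", "RGG", "LPG", "DEE", "FF"],               # placeholder entries
    "taste":    ["umami", "bitter", "bitter", "umami", "bitter"],
})

print(df["taste"].value_counts())                                # taste distribution
print(df["sequence"].str.len().value_counts().sort_index())      # length distribution
```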

6.1.4.2. Auto_Taste_ML: A data package for taste models

  • Purpose: Provides a standard workflow for taste data processing and analysis, feature construction, model selection, and data visualization. It aims to reduce the workloads of researchers.
  • Technical Details: Written in Python and released under a BSD license.
  • Functionality: Encapsulates and documents the entire TastePeptidesDB data-processing and Umami_YYDS model-building workflow.
  • Availability: Released on PyPI (The Python Package Index) at https://pypi.org/project/Auto-TasteML/. Code and instructions are on GitHub (https://github.com/SynchronyML/Auto_Taste_ML/).
  • Efficiency: The authors state that the corresponding functions can be executed "within 1 min."

6.1.4.3. Umami_YYDS web server

  • Purpose: To provide a direct connection between academia and industry and facilitate the rapid identification of taste peptides.
  • Accessibility: The modeling results are deployed on the Umami_YYDS server at http://tastepeptides-meta.com/cal.
  • User-friendliness: Developed as a user-friendly web server.
  • Performance: Tested on Google Chrome and Apple Safari for 3 months with good performance.

6.2. Data Presentation (Tables)

The paper refers to several supplementary tables (S1, S2, S3, S4, S5, S6) that are not provided in the main text. However, the content of Table 1, which lists the Feature source and calculation module for the 8 selected molecular descriptors, is provided within the main text. This table has been transcribed in Section 4.2.2.

6.3. Ablation Studies / Parameter Analysis

The paper discusses aspects of parameter analysis during the model selection and optimization phase, which functions somewhat like a sensitivity analysis for hyperparameters:

  • Hyperparameter Influence: During the grid search for the Gradient Boosting algorithm, 551,840 combinations were explored. The results indicated that n_estimators was the "main factor that affects the results," followed by max_depth and min_samples_split, while min_samples_leaf showed "little influence" and "did not show statistical significance." This analysis helped determine the optimal hyperparameter settings for Umami_YYDS (n_estimators = 211, max_depth = 17, min_samples_split = 10, min_samples_leaf = 3).

    While not a formal ablation study in the sense of removing specific components of the final model, the systematic exploration of feature selection steps (from 278 to 8 features) and hyperparameter optimization demonstrates a rigorous process to identify the most effective and efficient configuration for the model. The analysis of RGG and LPG misjudgments also serves as a post-hoc analysis of feature impact, highlighting that SV1 and VS7 were key factors leading to misclassification, which can guide future model improvements.
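
As a hedged illustration of the hyperparameter exploration described above (this is not the authors' pipeline; the feature matrix and all grid values other than the reported optima are placeholders), a scikit-learn grid search over a gradient boosting classifier could look like this:

```python
# Hedged sketch: grid search over the four hyperparameters discussed above.
# X and y stand in for the 8-descriptor feature matrix and umami/bitter labels.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                  # placeholder: 8 molecular descriptors
y = rng.integers(0, 2, size=200)               # placeholder: 1 = umami, 0 = bitter

param_grid = {
    "n_estimators":      [100, 211, 300],      # reported optimum: 211
    "max_depth":         [5, 17, 25],          # reported optimum: 17
    "min_samples_split": [2, 10, 20],          # reported optimum: 10
    "min_samples_leaf":  [1, 3, 5],             # reported optimum: 3
}

search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, scoring="accuracy", cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```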

7. Conclusion & Reflections

7.1. Conclusion Summary

This study successfully developed TastePeptides-Meta, a comprehensive system designed to facilitate the rapid screening and analysis of taste peptides, particularly focusing on umami and bitter tastes. The system comprises three main components:

  1. TastePeptidesDB: A newly compiled and extensive database of reported taste peptides, addressing the prior limitation of insufficient data.

  2. Umami_YYDS: An umami/bitter classification model built using Gradient Boosting Decision Tree on eight key molecular descriptors. The model achieved high performance (89.6% accuracy, 0.98 AUC) and, critically, offers interpretability through SHAP analysis (a minimal SHAP sketch follows this summary). This analysis identified water solubility (MolLogP, SMR_VSA1), polarization rate, charge properties (VSA_EState descriptors), and molecular weight (BCUT2D_MWLOW) as primary factors influencing short peptide taste. The model's predictions were rigorously validated by sensory experiments, demonstrating an 80% accuracy on novel peptides.

  3. Auto_Taste_ML: An open-source machine learning package that encapsulates the entire modeling process, promoting reproducibility and easing data processing and model building for researchers.

    The integration of these components into the TastePeptides-Meta universe provides a systematic, accurate, interpretable, and user-friendly platform for taste peptide research, moving beyond the limitations of previous fragmented or "black box" approaches.
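
For the SHAP-based interpretability mentioned in point 2, a minimal sketch (placeholder data and model, not the authors' code) of computing and plotting SHAP values for a tree-based umami/bitter classifier might look as follows; the feature names are the eight descriptors reported in the paper, with the VS abbreviations expanded as an assumption.

```python
# Hedged sketch: SHAP explanation of a tree-based taste classifier on
# placeholder data, using the eight descriptor names reported in the paper.
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingClassifier

feature_names = ["SMR_VSA1", "MolLogP", "VSA_EState7", "VSA_EState6",
                 "VSA_EState5", "MinEStateIndex", "PEOE_VSA14", "BCUT2D_MWLOW"]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                  # placeholder descriptor matrix
y = rng.integers(0, 2, size=200)               # placeholder umami/bitter labels

model = GradientBoostingClassifier(random_state=0).fit(X, y)
explainer = shap.TreeExplainer(model)          # exact SHAP values for tree ensembles
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X, feature_names=feature_names)  # global feature impact
```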

7.2. Limitations & Future Work

The authors acknowledge several areas for future improvement and identified limitations:

  • Model Fusion for Enhanced Performance: While Umami_YYDS achieved excellent performance, the authors suggest constructing a "fusion model method with better recognition performance." This implies combining multiple models (e.g., an ensemble of diverse models or a stacking approach) could potentially yield even higher accuracy and robustness.
  • Synergistic Effects of Multiple Tastes: The paper notes that "additional data were also collected to study their synergistic effects" due to the "multiple taste characterization of umami peptide." This indicates a limitation in the current model's ability to fully account for complex taste interactions (e.g., how umami can inhibit bitterness or how sweet and umami can synergize), suggesting a need for models that can predict or incorporate these multi-taste interactions.
  • Incorporating Ligand Interaction Information: In terms of feature construction, the authors propose adding "the information of ligand interaction based on molecular docking." This would provide insights into how peptides physically bind to and activate taste receptors, potentially leading to more biologically relevant and accurate predictions, and enabling a "consensus judgment."

7.3. Personal Insights & Critique

7.3.1. Personal Insights

This paper presents a highly valuable and practical contribution to the field of food science and chemoinformatics. The TastePeptides-Meta system is a commendable effort to create a holistic ecosystem for taste peptide research, moving beyond standalone models or databases.

  • Emphasis on Interpretability: The deliberate choice of a Gradient Boosting Decision Tree and the subsequent SHAP analysis is a significant strength. In scientific domains like food chemistry, understanding why a model predicts something is often as crucial as the prediction itself. Identifying key molecular descriptors related to solubility, charge, and molecular weight provides actionable insights for rational design of taste peptides or for understanding natural taste profiles in food. This makes the model not just a predictive tool but also a scientific instrument for discovery.
  • Open-Source Contribution: The release of Auto_Taste_ML is a critical step towards fostering reproducibility and democratizing computational taste research. By providing encapsulated code, the authors empower other researchers, particularly those who might be new to machine learning, to apply and build upon their methodologies, accelerating progress in the field.
  • Rigorous Validation: The comprehensive validation strategy, including comparison against multiple baselines, evaluation on a dedicated generalization test set, and crucially, human sensory validation of novel peptides, lends significant credibility to the Umami_YYDS model's performance. The detailed analysis of misjudgments in RGG and LPG is also insightful, demonstrating a scientific approach to model improvement.
  • Addressing Data Scarcity: The creation of TastePeptidesDB directly addresses one of the most common bottlenecks in machine learning for specialized domains: data scarcity. A large, curated database is an invaluable resource for future research.

7.3.2. Critique and Potential Improvements

While the paper is strong, some areas could be further elaborated or improved:

  • Dataset Details: While the paper mentions the number of peptides and their categorization, more detailed statistics on the dataset (e.g., distribution of amino acid types, common motifs, molecular weight range for umami vs. bitter) within the 203 benchmark peptides could enhance understanding of the training data's characteristics. A clearer distinction between the 203 benchmark peptides and the 483 peptides in TastePeptidesDB regarding their usage in model training versus database population could also be beneficial.

  • Black-Box Baselines Comparison: When comparing Umami_YYDS to BERT_bitter, the paper rightly highlights the interpretability advantage. However, a deeper dive into why Umami_YYDS (a non-deep-learning model) manages to compete with or even surpass BERT_bitter for longer peptides (length ≥ 4) in terms of MCC could be explored. Is it due to the strength of the selected molecular descriptors, the specific GBDT tuning, or limitations of BERT-like models on small peptide sequences?

  • Synergistic Effects: The paper acknowledges the limitation regarding synergistic effects. This is a complex but vital aspect of taste perception. Future work could investigate multi-label classification or regression models that predict not just the primary taste but also the intensity or interaction profiles. Incorporating molecular docking information, as suggested by the authors, would be a strong step in this direction.

  • Scalability to Longer Peptides: The paper notes that di- and tripeptides dominate the database. While Umami_YYDS performs well on peptides of length ≥ 4 in the GTS, the current selection of 8 molecular descriptors might be more optimized for shorter peptides. Exploring additional descriptors or different feature extraction methods (e.g., sequence-based embeddings beyond di/tripeptides) might improve performance for much longer peptides.

  • Web Server Features: While a web server is deployed, details on its interactive features beyond basic prediction (e.g., batch prediction, visualization of SHAP values for user-submitted peptides, direct database query integration on the prediction page) could be further specified to showcase its user-friendliness.

    Overall, the paper represents a significant step forward in the computational prediction of taste peptides, providing not just a high-performing model but a foundational system that encourages transparency and collaborative research in the field.
