A TastePeptides-Meta system including an umami/bitter classification model Umami_YYDS, a TastePeptidesDB database and an open-source package Auto_Taste_ML
TL;DR Summary
The study developed TastePeptides-Meta, featuring a database, an 89.6%-accurate umami/bitter classification model Umami_YYDS, sensory validation, a prediction website, and an open-source ML package to enable rapid taste peptide screening.
Abstract
Food Chemistry 405 (2023) 134812. Available online 9 November 2022. 0308-8146/© 2022 Published by Elsevier Ltd.

A TastePeptides-Meta system including an umami/bitter classification model Umami_YYDS, a TastePeptidesDB database and an open-source package Auto_Taste_ML

Authors: Zhiyong Cui (a), Zhiwei Zhang (a), Tianxing Zhou (b), Xueke Zhou (a), Yin Zhang (c), Hengli Meng (a), Wenli Wang (a, *), Yuan Liu (a, *)

Affiliations: (a) Department of Food Science & Technology, School of Agriculture & Biology, Shanghai Jiao Tong University, Shanghai 200240, China; (b) Department of Bioinformatics, Faculty of Science, The University of Melbourne, Victoria 3010, Australia; (c) Key Laboratory of Meat Processing of Sichuan, Chengdu University, Chengdu 610106, China

Keywords: Peptides; Umami prediction; TastePeptidesDB; Machine learning

Abstract: Taste peptides with umami/bitterness play a role in food attributes. However, the taste mechanisms of peptides are not fully understood, and the identification of these peptides is time-consuming. Here, we created a taste peptide database by collecting the reported taste peptide information. Eight key molecular descriptors from
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is the development of a comprehensive system for taste peptide analysis, specifically focusing on umami and bitter tastes. The system is named TastePeptides-Meta and includes a classification model Umami_YYDS, a database TastePeptidesDB, and an open-source machine learning package Auto_Taste_ML.
1.2. Authors
The authors of the paper are:
- Zhiyong Cui
- Zhiwei Zhang
- Tianxing Zhou
- Xueke Zhou
- Yin Zhang
- Hengli Meng
- Wenli Wang
- Yuan Liu
Their affiliations are:
- Department of Food Science & Technology, School of Agriculture & Biology, Shanghai Jiao Tong University, Shanghai 200240, China (Zhiyong Cui, Zhiwei Zhang, Xueke Zhou, Hengli Meng, Wenli Wang, Yuan Liu)
- Department of Bioinformatics, Faculty of Science, The University of Melbourne, Victoria 3010, Australia (Tianxing Zhou)
- Key Laboratory of Meat Processing of Sichuan, Chengdu University, Chengdu 610106, China (Yin Zhang)
1.3. Journal/Conference
The paper was published in Food Chemistry (volume 405, 2023, article 134812), as indicated by the journal header and the DOI (https://doi.org/10.1016/j.foodchem.2022.134812). Food Chemistry is a highly reputable journal in the field of food science, known for publishing high-quality research, which indicates the work has undergone rigorous peer review.
1.4. Publication Year
The paper was made available online on 9 November 2022; the journal issue is dated 2023.
1.5. Abstract
The paper addresses the challenge of understanding the taste mechanisms of peptides and the time-consuming nature of identifying taste peptides, particularly those with umami and bitterness. To tackle this, the authors developed a system called TastePeptides-Meta. This system comprises three main components:

- TastePeptidesDB: A database compiled from reported taste peptide information.
- Umami_YYDS: A gradient boosting decision tree model for classifying umami/bitter peptides. This model achieved 89.6% accuracy and was built using data enhancement, comparative algorithms, and optimization techniques. It selected eight key molecular descriptors from di/tripeptides through a modeling screening process. The model's predictive performance was validated against other models and confirmed by sensory experiments.
- Auto_Taste_ML: An open-source machine learning package designed to assist in taste peptide modeling.

The paper highlights that Umami_YYDS showed superior prediction performance and was verified by sensory experiments. To facilitate access, a prediction website based on Umami_YYDS was deployed, and the Auto_Taste_ML package was uploaded. The TastePeptides-Meta system aims to provide a convenient approach for the rapid screening of umami peptides.
1.6. Original Source Link
The original source link provided is /files/papers/6908b7cae81fdddf1c48bfdb/paper.pdf.
Given the abstract's mention of a DOI (https://doi.org/10.1016/j.foodchem.2022.134812), the paper is officially published.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the time-consuming and laborious identification of taste peptides, particularly those exhibiting umami or bitter flavors. Taste peptides are important contributors to food attributes: umami contributes to pleasant sensations, while bitterness often signals undesirable consumption. Traditional methods for identifying these peptides involve complex experimental processes (pretreatment, separation, purification, synthesis, characterization, and sensory evaluation), making them laborious, expensive, and time-consuming.
This problem is important because a better understanding and efficient identification of taste peptides can significantly impact food processing, production, trade, and nutrition.
Prior research has attempted to address this with quantitative structure-activity relationship (QSAR) models and chemoinformatics (CI), but these efforts were often limited by:

- Insufficient data size
- Simplistic models (e.g., Scoring Card Method models such as iBitter-SCM and iUmami-SCM)
- Models achieving only single taste judgment
- "Black box" algorithms (e.g., XGBoost, BERT4Bitter) that perform well but lack interpretability, making them difficult to debug and maintain
- Lack of code encapsulation for developed methods, hindering their practical application

The paper's entry point is to overcome these limitations by building a QSAR model with excellent performance and model interpretability, supported by a comprehensive taste peptide information summary platform and open-source tools.
2.2. Main Contributions / Findings
The paper makes several primary contributions by establishing the TastePeptides-Meta system:
- Creation of TastePeptidesDB: The largest and most comprehensive database of reported taste peptides, addressing the insufficient data size issue in previous studies. This platform summarizes and provides accessible information on taste peptides.
- Development of the Umami_YYDS model: A novel umami/bitter classification model based on a gradient boosting decision tree (GBDT).
  - It demonstrates excellent performance with 89.6% accuracy in calibration and 0.98 AUC.
  - It specifically addresses the interpretability challenge by using SHAP values to reveal the key molecular descriptors (MolLogP, SMR_VSA1, VSA_EState6, BCUT2D_MWLOW) influencing taste, categorizing them into solubility, charge & van der Waals radius, and molecular weight. This moves beyond "black box" algorithms.
  - The model's unbiased judgment for both umami and bitter peptides is highlighted, contrasting with other models that might overemphasize bitterness.
  - Its outstanding ability was verified by sensory experiments on novel peptides, showing 80% accuracy in prediction.
- Release of Auto_Taste_ML: The first open-source machine learning package in the field of taste, encapsulating the modeling process and facilitating data processing, feature construction, model selection, and visualization. This directly tackles the lack of code encapsulation problem.
- Deployment of a web server: A user-friendly web server based on Umami_YYDS (tastepeptides-meta.com) for convenient taste peptide prediction.

Key findings:

- The Umami_YYDS model effectively predicts umami/bitter tastes with high accuracy and robustness, particularly for short peptides.
- Water solubility (MolLogP, SMR_VSA1), polarization rate, charge properties (VSA_EState descriptors), and van der Waals radius (MinEStateIndex, PEOE_VSA14), along with molecular weight (BCUT2D_MWLOW), are identified as the main factors affecting the taste characteristics of short peptides.
- The study confirmed that high water solubility generally correlates with a higher possibility of being an umami peptide.
- The TastePeptidesDB reveals that most reported taste peptides focus on umami and bitter tastes (79.4%). Dipeptides and tripeptides constitute nearly half of the taste peptide entries, with the number decreasing as peptide length increases.

These contributions and findings support the rapid screening of umami peptides and provide computational support for future high-throughput analysis.
3. Prerequisite Knowledge & Related Work
This section aims to provide readers with the prerequisite knowledge needed to understand the paper.
3.1. Foundational Concepts
To fully grasp the methodology and contributions of this paper, understanding several fundamental concepts is crucial:
- Taste Peptides: Short chains of amino acids (typically 2-20 amino acids long) that can elicit specific taste perceptions, such as umami, bitter, sweet, sour, or salty. They are often produced during protein hydrolysis in food processing.
- Umami: Recognized as the fifth basic taste alongside sweet, sour, salty, and bitter. It is often described as savory, meaty, or broth-like, indicating the presence of proteins and amino acids, and is generally associated with a pleasant eating experience.
- Bitterness: A basic taste often perceived as unpleasant or even toxic. In the context of food, bitter peptides can negatively impact palatability.
- Quantitative Structure-Activity Relationships (QSAR): A computational modeling approach used in chemistry and biology to predict the activity of compounds based on their molecular structure. QSAR models establish a mathematical relationship between the chemical structure of a compound (represented by molecular descriptors) and its biological activity (e.g., taste, toxicity). The core idea is that similar structures should have similar activities.
- Molecular Descriptors: Numerical values that describe the chemical and physical properties of a molecule's structure, such as molecular weight, hydrophobicity (LogP), surface area, charge distributions, and counts of specific atoms/bonds. They are the input features for QSAR and machine learning models.
- Machine Learning (ML): A field of artificial intelligence that uses statistical techniques to enable computer systems to "learn" from data without being explicitly programmed. In this paper, ML algorithms are used to build predictive models for taste.
- Gradient Boosting Decision Tree (GBDT): A powerful ensemble machine learning technique that builds a predictive model in a stage-wise fashion, where each new decision tree corrects the errors of the previous ones. It combines many weak prediction models (decision trees) into a single strong predictor and is known for its high accuracy and robustness.
- Chemoinformatics (CI): An interdisciplinary field that combines chemistry, computer science, and information science. It uses computational and informational techniques to solve problems in chemistry, such as molecular design, property prediction, and drug discovery. In this paper, it is applied to the analysis and prediction of taste peptides.
- Data Enhancement / Data Augmentation: Techniques used to increase the amount of data by adding slightly modified copies of existing data or newly created synthetic data. This is particularly useful when the original dataset is small or imbalanced.
- SMOTE (Synthetic Minority Over-sampling Technique): A specific data enhancement technique for imbalanced datasets, where one class (the minority class) has significantly fewer samples than the other (the majority class). SMOTE creates synthetic minority-class samples by interpolating between existing minority-class samples, thereby balancing the dataset and improving model performance.
- Cross-validation: A statistical method used to estimate the performance of a machine learning model on independent data. The dataset is partitioned into multiple subsets ("folds"); the model is trained on some folds and validated on the remaining fold, and the process is repeated so that each fold serves as the validation set once, with results averaged. 5-fold cross-validation uses 5 folds and 5 repetitions.
- Accuracy (ACC): A common classification evaluation metric, defined as the proportion of correctly classified instances (both true positives and true negatives) out of the total number of instances.
- Area Under the Receiver Operating Characteristic Curve (AUC-ROC): A performance metric for binary classifiers. The ROC curve plots the True Positive Rate (Recall) against the False Positive Rate at various threshold settings. The AUC measures the separability between classes; a higher AUC (closer to 1) indicates better discrimination between positive and negative classes.
- F1-score: The harmonic mean of Precision and Recall. It is a good metric when the class distribution is uneven, as it balances both Precision and Recall.
- Precision: In binary classification, the ratio of correctly predicted positive observations to the total predicted positive observations. It answers: "Of all instances predicted as positive, how many were actually positive?"
- Recall (Sensitivity): In binary classification, the ratio of correctly predicted positive observations to all actual positive observations. It answers: "Of all actual positive instances, how many did we correctly predict as positive?"
- Matthews Correlation Coefficient (MCC): A correlation coefficient used as a measure of the quality of binary classifications. It takes into account true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). MCC is generally regarded as a balanced measure, even if the classes are of very different sizes. Its value ranges from -1 (inverse prediction) to +1 (perfect prediction), with 0 indicating random prediction.
- SHapley Additive exPlanations (SHAP): A game theory-based approach used to explain the output of any machine learning model. It assigns an importance value (SHAP value) to each feature for a particular prediction, showing how much each feature contributes to pushing the prediction away from the baseline (average) prediction. This supports model interpretability, especially for complex "black box" models.
- Principal Component Analysis (PCA): A statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into linearly uncorrelated variables called principal components. PCA is commonly used for dimensionality reduction and data visualization, allowing high-dimensional data to be represented in fewer dimensions (e.g., 2D or 3D) while retaining most of its variance.
3.2. Previous Works
The paper contextualizes its work by discussing limitations in prior taste peptide prediction research:
- Limited Data and Simplistic Models: Earlier studies were often "restricted by insufficient data size" and used "simplistic models." Examples include Scoring Card Method (SCM) models such as iBitter-SCM and iUmami-SCM (Phasit Charoenkwan et al., 2020). These models could only achieve "a single taste judgment" (predicting only bitter or only umami, not both in a comparative context), and their accuracy and generalization performance were often "not ideal."
- "Black Box" Algorithms and Lack of Interpretability: More recent models, such as BERT4Bitter (Phasit Charoenkwan et al., 2021), while aiming for "great performance," often ignored interpretability. They frequently used "black box" algorithms like XGBoost (Bai et al., 2021) which, despite their predictive power, make it difficult to "understand the decision-making process" and hinder debugging and maintenance. This is a significant concern for scientific understanding beyond mere prediction.
- Lack of Code Encapsulation: A notable gap identified is that "most of the modeling research is still in the stage of developing method, and none of them have finished the encapsulation of codes." While methods might be proposed, their practical implementation as reusable software packages is rare, limiting broader adoption and reproducibility.
- Existing Databases and Web Services: The paper acknowledges existing efforts in database construction (e.g., Toxindb (D. Zhang et al., 2021), ChemTastesDB (Rojas et al., 2022)) and web prediction services (e.g., VirtualTaste (Fritz et al., 2021)). These serve as "exemplary roles" but do not offer the comprehensive system (database + prediction model + open-source package) that TastePeptides-Meta aims to provide.
3.3. Technological Evolution
The evolution of taste peptide research has moved from labor-intensive traditional experimental methods to increasingly sophisticated computational approaches:
- Traditional Experimental Identification: Initially, identifying taste peptides involved laborious and expensive wet-lab processes of pretreatment, separation, purification, synthesis, characterization, and sensory evaluation. This bottleneck limited the discovery rate and the understanding of taste mechanisms.
- Emergence of QSAR Models: With advancements in computer performance and chemoinformatics (CI), quantitative structure-activity relationship (QSAR) models emerged. These models began to leverage molecular structures to predict activities, including those of biological peptides (Mahmoodi-Reihani et al., 2020) and ADMET properties (Oussama et al., 2022). This marked a shift towards in silico prediction, reducing the reliance on purely experimental methods.
- Early Machine Learning Applications: As machine learning became more accessible, it was applied to taste prediction. However, early attempts were often characterized by "insufficient data size" and "simplistic models" (e.g., Scoring Card Method based models like iBitter-SCM and iUmami-SCM), leading to suboptimal accuracy and generalization performance. Many of these models focused on single-taste predictions.
- Rise of Complex ML Models: More powerful ML algorithms, including ensemble methods (like XGBoost) and deep learning (like BERT in BERT4Bitter), started being employed. While these often achieved "great performance," they frequently operated as "black boxes," sacrificing interpretability, a critical aspect for understanding underlying mechanisms in scientific research.
- Development of Databases and Web Services: Alongside modeling efforts, there has been a parallel development of databases (e.g., Toxindb, ChemTastesDB) to centralize chemical information and web prediction services (e.g., VirtualTaste) to make predictions accessible. However, these components often existed in isolation.

This paper's work (TastePeptides-Meta) fits within the latest stage of this evolution by integrating and advancing these disparate components: a large, curated database, an interpretable and high-performing ML model, and an open-source package, combined into a systematic universe. This represents a move towards more comprehensive, transparent, and user-friendly in silico tools for taste peptide research.
3.4. Differentiation Analysis
Compared to the main methods in related work, the TastePeptides-Meta system, including its core components (TastePeptidesDB, Umami_YYDS, and Auto_Taste_ML), offers several key innovations and differentiators:
- Systematic and Integrated Approach:
  - Prior work: Often focused on individual components, either building a database, developing a specific prediction model, or, less frequently, encapsulating code. These efforts were typically fragmented.
  - TastePeptides-Meta: Proposes a "systematic taste peptides universe" that integrates all three crucial aspects: a database for information summary, a prediction model for identification, and an open-source package for auxiliary modeling. This comprehensive approach is highlighted as unique ("no similar platform published like the TastePeptides-Meta in this field").
- Comprehensive Data Foundation (TastePeptidesDB):
  - Prior work: Many models were "restricted by insufficient data size," limiting their accuracy and generalization performance.
  - TastePeptides-Meta: Addresses this by creating TastePeptidesDB, claimed to be "the largest taste peptide database with the most information." A larger, well-curated dataset is fundamental for building robust machine learning models.
- Interpretable Model (Umami_YYDS):
  - Prior work: Increasingly used "black box" algorithms (e.g., BERT4Bitter based on BERT, XGBoost), which, despite high performance, "focus too much on achieving great performance while ignoring the interpretability of the models." This makes debugging and rule mining challenging.
  - TastePeptides-Meta: Emphasizes building a QSAR model with "excellent performance and model interpretability." By using SHAP values with a gradient boosting decision tree model, Umami_YYDS provides insights into the "decision-making process" and identifies "key molecular descriptors" (solubility, charge & van der Waals radius, molecular weight) that determine taste. This transparency is critical for scientific understanding and future hypothesis generation.
- Open-Source and Reproducible (Auto_Taste_ML):
  - Prior work: "None of them have finished the encapsulation of codes," making it difficult for other researchers to reproduce methods or build upon them.
  - TastePeptides-Meta: Releases Auto_Taste_ML as "the first open-source machine learning package in the field of taste." This package encapsulates the entire modeling process (data processing, feature construction, model selection, visualization), promoting reproducibility and transparency and reducing workloads for other researchers.
- Umami/Bitter Classification with Balanced Judgment:
  - Prior work: Models often focused on single taste judgment (e.g., iBitter-SCM, iUmami-SCM).
  - TastePeptides-Meta: Umami_YYDS is designed for umami/bitter classification. Crucially, it demonstrates an "unbiased" judgment, maintaining high accuracy for both umami and bitter predictions, unlike some previous models that might "overemphasize the judgment of bitter" or make more misjudgments for umami.

In essence, while previous works contributed individual pieces of the puzzle, TastePeptides-Meta aims to provide a cohesive, transparent, and accessible ecosystem for taste peptide research.
4. Methodology
4.1. Principles
The core principle of the methodology is to leverage machine learning (ML) and chemoinformatics (CI) to establish quantitative structure-activity relationships (QSAR) for taste peptides. The central idea is that the taste attributes (umami or bitter) of peptides can be predicted by analyzing their molecular structures and properties, which are quantitatively represented by molecular descriptors.
The theoretical basis or intuition behind this approach is that specific structural features and physicochemical properties of peptides interact with taste receptors in a predictable manner. By identifying and quantifying these key molecular characteristics (the molecular descriptors), a computational model can learn the complex patterns that differentiate umami from bitter peptides.
The workflow involves:
- Data Collection: Gathering known taste peptides with their associated taste attributes.
- Feature Engineering: Calculating a comprehensive set of molecular descriptors from the peptide sequences/structures.
- Feature Selection: Identifying the most relevant and discriminative subset of these descriptors that are highly correlated with taste. This step is crucial for building efficient and interpretable models and avoiding overfitting.
- Model Training: Applying ML algorithms to learn the relationship between the selected features and the taste attributes, optimizing the model's parameters to maximize prediction performance.
- Model Evaluation: Rigorously assessing the trained model's performance using various metrics and validation strategies (e.g., cross-validation, a generalization test set, sensory experiments).
- Interpretability Analysis: Understanding why the model makes certain predictions by identifying the most influential features, which provides scientific insights into taste mechanisms.
- System Development: Encapsulating the data, model, and tools into a user-friendly system for broader application.

This QSAR approach transforms the laborious experimental process of taste peptide identification into a rapid in silico screening method, offering a predictive and insightful tool for food science.
4.2. Core Methodology In-depth
The methodology for building the TastePeptides-Meta system, particularly the Umami_YYDS model, follows a systematic approach encompassing data collection, feature engineering, model selection, optimization, and validation.
4.2.1. Benchmark Data Sets
The foundation of the Umami_YYDS model is a curated dataset of peptides with known taste attributes.
- Initial Collection: A total of 203 reported umami/bitter peptides were initially collected specifically for model construction. This set included 99 dipeptides (31 umami and 68 bitter) and 104 tripeptides (53 umami and 61 bitter).
- Labeling Strategy: Given the inhibitory effect of umami substances on bitterness (Kim et al., 2015), umami peptides were labeled as positive and bitter peptides as negative for the binary classification task.
- Broader Database (TastePeptidesDB): For the TastePeptidesDB database, a more extensive collection was performed by searching Web of Science using keywords such as "Tastes", "Sour", "Sweet", "Bitter", "Salty", "Umami", "Kokumi", "Astringent", and "Peptides". This yielded 483 peptides (collected by Dec 3rd, 2021), which are displayed on the TastePeptidesDB website (http://www.tastepeptides-meta.com/database/son/1). This larger dataset forms the basis of the comprehensive TastePeptidesDB database, while the more focused set of 203 peptides was used for model training.
- Dataset for Model Training (ATPD): All collected umami and bitter peptides (presumably the 203 peptides, possibly expanded by the SMOTE process) were constructed into a dataset referred to as ATPD (all taste peptides dataset).
- Generalization Test Set (GTS): For independent testing and verification, 410 peptides were used to constitute a generalization test set (GTS) to better assess the model's generalization performance.
4.2.2. Feature Structure
The process of extracting and selecting relevant molecular descriptors (features) from the peptides was critical. This involved a 4-step feature selection process, as illustrated in Figure S1A (not provided, but described in the text); a descriptor-calculation sketch follows Table 1 below.

- Step 1: Descriptor Calculation:
  - For each peptide, 208 molecular descriptors were initially calculated using the chemometrics toolkit RDKit 2020.9.1 (Landrum, 2006). These descriptors capture properties such as water solubility, electrostatic properties, and atomic properties of peptides (Marcou et al., 2012).
  - To provide a more comprehensive description, an additional 69 descriptors were added. These included planar properties, cyclic properties (Frecer, 2006), aromatic properties (Adamczak et al., 2020), and properties related to the first and last amino acids (e.g., presence of C-terminal hydrophobic amino acids) (Phasit Charoenkwan, Yana, Schaduangrat, et al., 2020).
  - In total, 278 descriptor features were initially considered for each peptide (shown in Fig. S2, not provided).
- Step 2: Variance Checking:
  - Features with a variance of 0 were discarded using the variance checking algorithm from scikit-learn 0.24.2 (Buitinck et al., 2013). This step removes features that have the same value across all samples, as they provide no discriminative information. This left 207 features.
- Step 3: Statistical Screening:
  - The Kolmogorov-Smirnov test and t-test were employed to perform feature screening from a statistical perspective. These tests identify features whose distributions differ significantly between the umami and bitter peptide classes, indicating their potential relevance for classification (shown in Fig. S3, not provided).
- Step 4: Recursive Feature Elimination with Cross-Validation (RFE-CV):
  - Recursive Feature Elimination with Cross-Validation (RFE-CV) was implemented to select the optimal number of features. This method iteratively trains a model (in this case, a Random Forest model) and eliminates the least important features until the optimal subset is found.
  - 51 features were used as input parameters. In each iteration, the features with the largest effect were retained.
  - The cross-validation score of the models reached a top score when the number of features was 8 (Fig. S1B, not provided).
  - Considering computational cost and overfitting probabilities, these 8 features were chosen as the final set for model training. The 8 key molecular descriptors are: BCUT2D_MWLOW, PEOE_VSA14, SMR_VSA1, MinEStateIndex, VSA_EState5, VSA_EState6, VSA_EState7, and MolLogP. Their source and calculation modules are detailed in Table 1.

The following table presents the selected features, their RDKit modules, and explanations:

| RDKit module (rdkit.Chem.) | Selected feature(s) | Explanation |
| --- | --- | --- |
| rdMolDescriptors.BCUT2D | BCUT2D_MWLOW | Calculates the lowest and highest eigenvalues of the original Burden matrix and the three variants introduced by Pearlman and Smith (Beno & Mason, 2001) |
| MolSurf module | SMR_VSA1 | Polarizability |
| EState.EState module | MinEStateIndex | MOE-type descriptors using EState indices and surface area contributions (developed at RD, not described in the CCG paper) (Hall, Mohney, & Kier, 1991) |
| EState.EState_VSA module | VSA_EState5, VSA_EState6, VSA_EState7 | MOE-type descriptors using EState indices and surface area contributions (Hall, Mohney, & Kier, 1991) |
| Chem.MolSurf module | PEOE_VSA14 | Exposes functionality for MOE-like approximate molecular surface area descriptors (Labute, 2000) |
| Crippen module | MolLogP | Indicators for describing ligands based on atomic contribution (Wildman & Crippen, 1999) |
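To make Step 1 concrete, the following minimal sketch computes RDKit 2D descriptors for a placeholder dipeptide and extracts the eight descriptors listed in Table 1. It is an illustrative sketch rather than the authors' actual pipeline; the example sequence is arbitrary, and it assumes a reasonably recent RDKit release (the paper used RDKit 2020.9.1).

```python
# Minimal sketch: compute RDKit 2D descriptors for a peptide sequence and
# extract the eight descriptors retained after feature selection in the paper.
from rdkit import Chem
from rdkit.Chem import Descriptors

mol = Chem.MolFromSequence("DE")  # placeholder dipeptide (Asp-Glu) in one-letter code

# Descriptors.descList is RDKit's list of (name, function) pairs for 2D descriptors
all_descriptors = {name: fn(mol) for name, fn in Descriptors.descList}

selected = ["BCUT2D_MWLOW", "PEOE_VSA14", "SMR_VSA1", "MinEStateIndex",
            "VSA_EState5", "VSA_EState6", "VSA_EState7", "MolLogP"]
print({name: round(all_descriptors[name], 3) for name in selected})
```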
4.2.3. Data Enhancement
To address the imbalance of the data (the unequal number of umami and bitter peptides), the imblearn 0.8.1 package (Lemaitre et al., 2016) was used to oversample the umami peptide data (the minority class).
- Algorithm Selection: Several SMOTE variants were compared: KMeans-SMOTE, SMOTE, and SVM-SMOTE. The plain SMOTE algorithm (generating synthetic samples distinct from the original 203 peptides) showed the best effect on the data, even though its precision performance was slightly low (Fig. S1C, not provided). It performed excellently on accuracy and recall-related indicators.
- Visualization of Enhanced Data: After data enhancement, the 8 selected feature values were scaled to a range of 0-10 for better visualization (Fig. S1D, not provided). Additionally, Principal Component Analysis (PCA) was used to reduce the 8-dimensional data to 2 dimensions, allowing visual inspection of the separation between umami and bitter peptides (Fig. S1E, not provided). This visualization confirmed a clear distinction between the classes, indicating the effectiveness of SMOTE in improving generalization performance.
- Training and Validation Set Construction: For model development, the enhanced data was split into a training set and a validation set at a 4:1 ratio via stratified sampling. Stratified sampling ensures that the proportion of umami and bitter peptides is maintained in both sets, which is crucial for imbalanced datasets.
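The following is a minimal sketch of the data-enhancement steps described above (SMOTE oversampling, 0-10 scaling, PCA projection to 2D, and a stratified 4:1 split). The descriptor matrix and labels are synthetic placeholders standing in for the real data; this is not the authors' code.

```python
# Minimal sketch: SMOTE oversampling, 0-10 scaling, 2D PCA view, stratified split.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(203, 8))            # placeholder 8-descriptor matrix
y = np.array([1] * 84 + [0] * 119)       # placeholder labels: 84 umami (1), 119 bitter (0)

X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)              # balance minority class
X_scaled = MinMaxScaler(feature_range=(0, 10)).fit_transform(X_res)   # scale features to 0-10
X_2d = PCA(n_components=2).fit_transform(X_scaled)                    # 2D projection for plotting
print("2D projection shape:", X_2d.shape)

X_train, X_val, y_train, y_val = train_test_split(
    X_scaled, y_res, test_size=0.2, stratify=y_res, random_state=42)  # 4:1 stratified split
```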
4.2.4. Model Selection and Optimization
The process involved selecting the most suitable machine learning algorithm and then fine-tuning its hyperparameters.
- Algorithm Comparison: 19 popular and widely recognized binary classification algorithms were evaluated to identify the one best able to discern the internal data patterns.
- Evaluation Metrics for Selection: Accuracy (ACC) and Area Under the ROC Curve (AUC) were used as the primary evaluation indices during this selection phase, assessed via 5-fold cross-validation. The results (Fig. S1F, not provided) showed that ensembled models (such as Bagging, GradientBoosting, and RandomForest) generally had higher median ACC and AUC values and more convergent box distributions, indicating stronger robustness.
- Algorithm Choice: Among the top-performing algorithms, GradientBoosting (GTB) was ultimately selected due to its higher upper limit in ROC (0.934) (Fig. S4, not provided), suggesting strong potential for classification.
- Hyperparameter Optimization:
  - To thoroughly explore the Gradient Boosting algorithm's capabilities, a vast number of hyperparameter combinations (551,840 combinations) were tested. Accuracy was used as the grid search evaluation index, with each combination evaluated by 5-fold cross-validation.
  - Initial results (Fig. S1G, not provided) indicated that n_estimators was the most influential factor, followed by max_depth and min_samples_split; min_samples_leaf showed less statistical significance.
  - To ensure generalization performance and avoid overfitting, combinations where n_estimators was greater than the number of samples were discarded.
  - The final optimized hyperparameters for the Umami_YYDS model (a GradientBoostingClassifier from scikit-learn) were set as follows (see the sketch after this list):
    - criterion = friedman_mse
    - max_depth = 17
    - min_samples_leaf = 3
    - min_samples_split = 10
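A minimal sketch of this final configuration is shown below, scored with 5-fold cross-validation on placeholder data (the real 8-descriptor matrix and labels would be substituted). The n_estimators value is not stated in the text, so the scikit-learn default is left in place.

```python
# Minimal sketch: the reported GradientBoostingClassifier settings under 5-fold CV.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(238, 8))     # placeholder: SMOTE-balanced 8-descriptor matrix
y = np.array([1, 0] * 119)        # placeholder balanced umami/bitter labels

clf = GradientBoostingClassifier(
    criterion="friedman_mse",     # reported settings; n_estimators not given in the text
    max_depth=17,
    min_samples_leaf=3,
    min_samples_split=10,
    random_state=42,
)
print("5-fold CV accuracy:", cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean())
print("5-fold CV AUC     :", cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean())
```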
4.2.5. Performance Evaluation
To ensure a fair, objective, and quantitative assessment of the binary classifier model performance, five widely used evaluation indicators were introduced and calculated based on scikit-learn 0.24.2 (Buitinck et al., 2013). These metrics quantify different aspects of a classifier's effectiveness:
- F1-score: $ F1 = \frac{2 \times TP}{2TP + FN + FP} $
  Where:
  - TP (True Positives): The number of umami peptides correctly predicted as umami.
  - FN (False Negatives): The number of umami peptides incorrectly predicted as bitter.
  - FP (False Positives): The number of bitter peptides incorrectly predicted as umami.
  The F1-score is the harmonic mean of Precision and Recall, providing a balance between them.
- Accuracy (ACC): $ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} $
  Where TN (True Negatives) is the number of bitter peptides correctly predicted as bitter, and TP, FP, FN are as above. Accuracy measures the proportion of total predictions that were correct.
- Precision: $ Precision = \frac{TP}{TP + FP} $
  Precision answers: "Of all instances predicted as umami, how many were actually umami?"
- Recall (Sensitivity): $ Recall = \frac{TP}{TP + FN} $
  Recall answers: "Of all actual umami instances, how many did we correctly predict as umami?"
- Matthews Correlation Coefficient (MCC): $ MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} $
  The MCC measures the quality of binary classifications, taking into account all four values of the confusion matrix, and is considered a balanced measure that can be used even with imbalanced classes. Its value ranges from +1 (perfect prediction) to -1 (inverse prediction), with 0 representing an average random prediction.
Additionally, the Area Under the ROC Curve (AUC) was used, where AUC values closer to 1 indicate a better comprehensive classification effect, while 0.5 signifies no difference from a random classifier.
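A minimal sketch of how these indicators can be computed with scikit-learn is shown below, using small placeholder prediction vectors rather than the paper's actual model outputs.

```python
# Minimal sketch: the five evaluation indicators plus AUC via scikit-learn.
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, matthews_corrcoef, roc_auc_score)

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])            # placeholder labels (1 = umami)
y_pred = np.array([1, 1, 0, 0, 0, 1, 1, 0])            # placeholder class predictions
y_prob = np.array([0.9, 0.8, 0.4, 0.2, 0.1, 0.6, 0.7, 0.3])  # placeholder probabilities

print("ACC :", accuracy_score(y_true, y_pred))
print("F1  :", f1_score(y_true, y_pred))
print("Prec:", precision_score(y_true, y_pred))
print("Rec :", recall_score(y_true, y_pred))
print("MCC :", matthews_corrcoef(y_true, y_pred))
print("AUC :", roc_auc_score(y_true, y_prob))
```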
4.2.6. Sensory Evaluation
To validate the Umami_YYDS model's predictions with human perception, a sensory evaluation experiment was conducted.
- Panelists: Fifteen panelists (6 males, 9 females, aged 22-29 years) with over 6 months of experience in sensory evaluation of umami peptides were recruited (Liu et al., 2020).
- Environment: The experiment took place in an air-conditioned sensory panel room at controlled temperature and humidity, ensuring consistent conditions.
- Sample Preparation: Six unreported peptides (ATQ, LPG, ECH, RVF, RGG, NQS), predicted by Umami_YYDS and consisting of 2-3 amino acids, were randomly selected and synthesized by Geer Group Chemical Reagent Co., Ltd. (analytically pure standard). Solutions of these peptides were prepared at a series of concentrations.
- Evaluation Procedure:
  - Each 5 ml diluted sample was placed in a plastic cup with a three-digit random code.
  - Panelists evaluated samples using the triangle test (ISO 4120:2004) and the method of investigating sensitivity of taste (ISO 3972:2011).
  - For each sample, panelists rotated 5 ml in their mouth for 15 seconds before spitting it out.
  - They described the taste attributes (bitter, sour, sweet, salty, and umami) of each sample.
  - To prevent taste fatigue, panelists relaxed for 5 minutes and rinsed their mouths thoroughly with ultrapure water (produced by an NW10VF water purifier system) at least twice between tests.
- Purpose: This sensory evaluation served as an independent, real-world validation of the Umami_YYDS model's predictive accuracy for novel peptides.
4.2.7. Software Implementation
The TastePeptides-Meta system is developed as a comprehensive taste peptide universe integrating taste peptide query, taste peptide prediction, and Python language-assisted modeling.
- Frontend: The user interfaces for TastePeptidesDB (database) and Umami_YYDS (prediction website) were built using HTML and the Bootstrap 4 framework.
- Backend and Web Server: Nginx was adopted for dynamic load balancing, distributing incoming requests efficiently. uWSGI, together with Django 3.2 (a Python web framework), was employed to handle back-end modeling and Umami-SQL database query requests.
- Performance: The web server was tested on Google Chrome and Apple Safari for 3 months, demonstrating good performance.
- Auxiliary Modeling Package (Auto_Taste_ML): An open-source third-party package named Auto_Taste_ML was developed. It is written in Python and released under the BSD license. Its purpose is to facilitate the entire taste data processing and model building workflow, including feature construction, model selection, and visualization for taste peptide data. It is released on the Python Package Index (PyPI) at https://pypi.org/project/Auto-TasteML/, and its code is available on GitHub (https://github.com/SynchronyML/Auto_Taste_ML/).
- Umami_YYDS Web Server: The prediction model is publicly accessible via a web server at http://tastepeptides-meta.com/cal.
4.2.8. Statistical Analysis
Various Python libraries were utilized for statistical analysis, numerical computation, and visualization:
- Programming Language: Python 3.8.10.
- Statistical Tests: The t-test and Kolmogorov-Smirnov test were used to evaluate significant differences between feature distributions.
- Numerical Computation: Pandas 1.3.3 (McKinney, 2010) for data structures and analysis, and NumPy 1.2.0 (Harris et al., 2020) for numerical operations and array manipulation.
- Visualization: Matplotlib 3.4.2 and Plotly Express 0.4.1 for generating plots and figures.
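As an illustration of this statistical testing step, the sketch below runs a Welch t-test and a two-sample Kolmogorov-Smirnov test on synthetic values standing in for one descriptor column of the umami and bitter classes; it is not the authors' code.

```python
# Minimal sketch: compare one descriptor's distribution between umami and bitter peptides.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
umami_vals = rng.normal(loc=2.0, scale=1.0, size=84)     # placeholder descriptor values, umami
bitter_vals = rng.normal(loc=3.5, scale=1.2, size=119)   # placeholder descriptor values, bitter

t_stat, t_p = stats.ttest_ind(umami_vals, bitter_vals, equal_var=False)  # Welch t-test
ks_stat, ks_p = stats.ks_2samp(umami_vals, bitter_vals)                  # two-sample KS test
print(f"t-test p = {t_p:.3g}, KS-test p = {ks_p:.3g}")
```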
5. Experimental Setup
5.1. Datasets
The study utilized several datasets throughout its various phases:
- Benchmark Data for Model Construction:
  - Source: Collected from reported umami/bitter peptides.
  - Scale: 203 peptides in total; specifically, 99 dipeptides (31 umami, 68 bitter) and 104 tripeptides (53 umami, 61 bitter).
  - Characteristics: These peptides are short (di- or tripeptides) and are explicitly labeled as either umami (positive class) or bitter (negative class). The labeling considered the inhibitory effect of umami on bitterness. This dataset formed the basis for training and validating the Umami_YYDS model.
  - Data Split: This dataset was split into a training set and a validation set at a 4:1 ratio via stratified sampling to ensure proportional representation of classes.
- TastePeptidesDB Database:
  - Source: A broader collection of peptides gathered from Web of Science using keywords such as "Tastes", "Sour", "Sweet", "Bitter", "Salty", "Umami", "Kokumi", "Astringent", and "Peptides".
  - Scale: 483 peptides (as of Dec 3rd, 2021).
  - Characteristics: This database includes taste peptides with various taste attributes beyond just umami and bitter. Each entry contains name (FASTA format), taste, verification status (Vitro_verit), SMILES, literature reference, author, and update time. It is described as the "largest taste peptide database that have been published" in terms of information.
  - Domain: Food-derived peptides.
- All Taste Peptides Dataset (ATPD):
  - Source: Composed of "All the umami and bitter peptides." This likely refers to the 203 peptides used for model construction, potentially augmented by SMOTE.
  - Purpose: Used for testing the Umami_YYDS model's taste recognition effect on a comprehensive set of umami and bitter peptides.
- Generalization Test Set (GTS):
  - Source: Not explicitly stated whether it is a subset of ATPD or a completely separate collection, but it is used for independent testing.
  - Scale: 410 peptides.
  - Purpose: Specifically designed to assess the generalization performance of the model, allowing a robust evaluation of how well the model performs on unseen data.
- Sensory Evaluation Peptides:
  - Source: 6 unreported peptides (NQS, ATQ, ECH, RVF, RGG, LPG) that are food-derived and consist of 2-3 amino acids. These were chosen from the predicted results of Umami_YYDS and then synthesized.
  - Purpose: Used for human sensory validation to directly confirm the model's predictive accuracy in a real-world setting.

The choice of these datasets allows for both rigorous internal model validation on a controlled benchmark and external validation on a larger, more diverse set and with human sensory panels, ensuring the method's effectiveness and applicability.
5.2. Evaluation Metrics
The paper uses several standard classification evaluation metrics to quantify the performance of Umami_YYDS and compare it with other models. For each metric, a conceptual definition, its mathematical formula, and an explanation of its symbols are provided below.
- Accuracy (ACC)
  - Conceptual Definition: Accuracy measures the overall correctness of a classification model. It represents the proportion of total predictions that were correct (both true positives and true negatives).
  - Mathematical Formula: $ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} $
  - Symbol Explanation:
    - TP (True Positives): Number of instances correctly predicted as positive (umami).
    - TN (True Negatives): Number of instances correctly predicted as negative (bitter).
    - FP (False Positives): Number of instances incorrectly predicted as positive (bitter predicted as umami).
    - FN (False Negatives): Number of instances incorrectly predicted as negative (umami predicted as bitter).
- F1-score
  - Conceptual Definition: The F1-score is the harmonic mean of Precision and Recall. It is particularly useful for imbalanced datasets, as it balances both false positives and false negatives, providing a more robust measure of a model's performance than simple accuracy.
  - Mathematical Formula: $ F1 = \frac{2 \times TP}{2TP + FN + FP} $
  - Symbol Explanation: TP, FN, and FP as defined above.
- Precision
  - Conceptual Definition: Precision measures the proportion of positive identifications that were actually correct. It quantifies how many of the predicted umami peptides were truly umami.
  - Mathematical Formula: $ Precision = \frac{TP}{TP + FP} $
  - Symbol Explanation: TP and FP as defined above.
- Recall (Sensitivity)
  - Conceptual Definition: Recall measures the proportion of actual positives that were correctly identified. It quantifies how many of the actual umami peptides were correctly predicted by the model.
  - Mathematical Formula: $ Recall = \frac{TP}{TP + FN} $
  - Symbol Explanation: TP and FN as defined above.
- Area Under the ROC Curve (AUC)
  - Conceptual Definition: AUC represents the degree of separability between classes. It indicates how well the model can distinguish between positive (umami) and negative (bitter) classes; a higher AUC means better discrimination.
  - Mathematical Formula: AUC has no single closed-form expression in terms of TP/TN/FP/FN; it is obtained by integrating the Receiver Operating Characteristic (ROC) curve, which plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings.
  - Symbol Explanation: AUC values range from 0 to 1, where 1 indicates a perfect classifier and 0.5 indicates a random classifier.
- Matthews Correlation Coefficient (MCC)
  - Conceptual Definition: MCC is a balanced measure for binary classification that considers all four types of predictions (TP, TN, FP, FN). It is generally considered a reliable metric, especially for imbalanced datasets, as it produces a high score only if the classifier performs well on both positive and negative classes.
  - Mathematical Formula: $ MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} $
  - Symbol Explanation: MCC values range from -1 to +1: +1 represents a perfect prediction, 0 an average random prediction, and -1 an inverse prediction.
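To make these formulas concrete, here is a worked example on a hypothetical confusion matrix (the counts are illustrative only, not the paper's results): TP = 40, TN = 45, FP = 5, FN = 10.

$ Accuracy = \frac{40 + 45}{100} = 0.85, \quad Precision = \frac{40}{45} \approx 0.889, \quad Recall = \frac{40}{50} = 0.80 $

$ F1 = \frac{2 \times 40}{2 \times 40 + 10 + 5} \approx 0.842, \quad MCC = \frac{40 \times 45 - 5 \times 10}{\sqrt{45 \times 50 \times 50 \times 55}} = \frac{1750}{\sqrt{6187500}} \approx 0.70 $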
5.3. Baselines
The Umami_YYDS model's performance was compared against several existing and well-known taste classifier models to demonstrate its superiority:
- iUmami-SCM: A Scoring Card Method based model specifically designed for predicting umami peptides (Phasit Charoenkwan, Yana, Nantasenamat, et al., 2020). This represents a simpler, rule-based approach.
- Q model (Ney, 1979): A model (likely a general statistical or QSAR model) used for predicting the bitterness of peptides based on amino acid composition and chain length.
- BERT_bitter (BERT4Bitter): A deep learning based model that uses Bidirectional Encoder Representations from Transformers (BERT) to improve the prediction of bitter peptides (Phasit Charoenkwan, Nantasenamat, Hasan, Manavalan, et al., 2021). This represents a state-of-the-art "black box" approach.
- Other unspecified models: The paper mentions that "other models may overemphasize the judgment of bitter" without naming them explicitly in the main comparison. The initial model selection phase also compared 19 popular binary classification algorithms (e.g., Bagging, RandomForest), but the direct performance comparison focuses on iUmami-SCM, the Q model, and BERT_bitter.

These baselines were chosen to represent different approaches to taste prediction (simple rule-based, traditional QSAR, and modern deep learning), allowing for a comprehensive evaluation of Umami_YYDS's strengths.
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Umami_YYDS Reveal Known and New Umami/Bitter-Presented Determinants
The paper emphasizes the interpretability of Umami_YYDS, contrasting it with "black box" models. The SHAP (SHapley Additive exPlanations) algorithm, based on game theory, was employed to understand the model's decision-making process and feature importance.
- Identified Key Features: Through feature screening, eight key molecular descriptors were identified:
  - BCUT2D_MWLOW (BM)
  - PEOE_VSA14 (PV14)
  - SMR_VSA1 (SV1)
  - MinEStateIndex (ME)
  - VSA_EState5 (VS5)
  - VSA_EState6 (VS6)
  - VSA_EState7 (VS7)
  - MolLogP

  The MolLogP and SMR_VSA1 features are highlighted as partially overlapping with descriptors used in other models (e.g., an MLP model for bitter/sweet molecules), confirming their relevance.

- Feature Importance and Interpretability: The 8 features were sorted by permutation importance (Fig. 2B, from the original paper, not provided here). MolLogP, VSA_EState6, and BCUT2D_MWLOW were found to be positively correlated with SHAP values, indicating their significant contribution to the model's output (a SHAP analysis sketch is given after Figure 2 below). The features were grouped into three categories, providing insights into taste mechanisms:
  - Solubility (MolLogP, SMR_VSA1):
    - Solubility was identified as the most important indicator for distinguishing umami from bitter peptides. MolLogP (LogP) describes hydrophobicity/lipophilicity, which is inversely related to water solubility: a lower LogP (more hydrophilic) generally implies higher water solubility. SMR_VSA1 represents polarizability.
    - Insights: High water solubility is usually associated with umami peptides. Partial Dependence Plot (PDP) analysis (Fig. S6A, not provided) showed that below a certain MolLogP threshold the probability of a peptide being umami was markedly higher, and specific LogP intervals showed a 92.6% success rate for umami judgment. Similarly, certain SMR_VSA1 value ranges indicated a higher probability of umami (Fig. S6B, not provided).
    - Figure 2A (below) illustrates the discriminating effect of SMR_VSA1 and MolLogP, showing clear separation between umami and bitter peptides.
  - Charge Properties and Van der Waals Surface (VSA_EState5/6/7, MinEStateIndex, SMR_VSA1, PEOE_VSA14):
    - These are described as "non-intuitive indicators" derived from complex matrices representing charge characteristics and van der Waals space volume.
    - Insights: VSA_EState6 was a key indicator. When VSA_EState6 < -1.84, the probability of the peptide being umami was high (Fig. S6C, not provided), and specific intervals showed a 100% success rate for umami prediction. This highlights the importance of charge distribution and molecular surface properties.
  - Molecular Weight (BCUT2D_MWLOW):
    - BCUT2D_MWLOW is part of the BCUT descriptor family, which encodes atomic properties related to intermolecular interactions and is derived from a Burden matrix representation of the molecular connection table.
    - Insights: While BCUT2D_MWLOW's regional judgment was fluctuating and difficult to determine by a single index (Fig. S6D, S6E, not provided), its significance is consistent with findings from iUmami-SCM, suggesting that a peptide's molecular weight contributes to umami taste. Generally, smaller molecular weights (<0.5 kDa and 0.5-3 kDa) are often associated with umami-flavor peptides, while large molecular weights tend to be tasteless or bitter.

The following figure (Figure 2 from the original paper) illustrates the contribution analysis of the eight key molecular descriptors in the Umami_YYDS model prediction.
This figure shows the contribution analysis of the eight key molecular descriptors to the Umami_YYDS model prediction. Part A is a two-dimensional plot of SMR_VSA1 versus MolLogP, with bubble size and color representing frequency and intensity. Part B shows the SHAP value distribution, illustrating the impact of feature values on the model output. Part C shows the model output path for each feature, reflecting how feature value changes relate to the model prediction.
Figure 2: (A) A two-dimensional plot showing the distribution and separation of umami (green) and bitter (orange) peptides based on SMR_VSA1 (x-axis) and MolLogP (y-axis). Bubble size indicates frequency, and color intensity indicates taste prevalence. (B) A SHAP summary plot (not provided in text version, but described as permutation importance) illustrating the impact and direction of features on model output. (C) A force plot showing the judgment path for a specific taste peptide (not provided in text version).
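The sketch below illustrates the kind of SHAP analysis described in this subsection, applied to a gradient-boosting classifier trained on synthetic placeholder data with default hyperparameters; it is an assumption-laden illustration, not the paper's actual analysis.

```python
# Minimal sketch: SHAP feature-contribution analysis for a tree-based classifier.
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X_train, X_val = rng.normal(size=(200, 8)), rng.normal(size=(40, 8))  # placeholder descriptors
y_train = rng.integers(0, 2, size=200)                                 # placeholder labels
clf = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)

features = ["BCUT2D_MWLOW", "PEOE_VSA14", "SMR_VSA1", "MinEStateIndex",
            "VSA_EState5", "VSA_EState6", "VSA_EState7", "MolLogP"]
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X_val)                       # per-sample feature contributions
shap.summary_plot(shap_values, X_val, feature_names=features)    # global importance overview
```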
6.1.2. Comparison of Umami_YYDS with Well-Known Taste Classifier
The Umami_YYDS model's performance was evaluated against other models, showing strong capabilities.
- Calibration Set Performance: Umami_YYDS achieved 89.6% accuracy and 0.98 AUC on the calibration set (Fig. 3A & B below), indicating excellent performance under controlled conditions.
- ATPD Performance and Bias Analysis:
  - On the ATPD (all taste peptides dataset), Umami_YYDS showed a good taste recognition effect.
  - The confusion matrix (Fig. S5, not provided) indicated that Umami_YYDS had an accuracy of 73%, comparable to iUmami-SCM (73.8%).
  - Crucially, Umami_YYDS's judgment ratio of umami to bitterness (46:63) was the closest to the actual 198:215 ratio in ATPD. This suggests that Umami_YYDS learned the umami and bitter attribute characteristics equally, avoiding the bias seen in other models that "may overemphasize the judgment of bitter," leading to more misjudgments for umami.
- Comparison of Metrics (ACC, MCC, Precision, Recall, F1):
  - Umami_YYDS achieved very similar Accuracy and MCC values to iUmami-SCM (ACC: Umami_YYDS = 0.735, iUmami-SCM = 0.738; MCC: Umami_YYDS = 0.474, iUmami-SCM = 0.485) (Fig. 3C below).
  - Its precision was at a medium level, explained by its "unbiased" judgment, meaning it was not overly conservative towards bitter predictions like some other models.
  - Umami_YYDS had a significant lead in Recall, indicating fewer misjudgments of truly umami peptides.
  - The F1 score for Umami_YYDS was the highest, signifying a good harmonic mean of recall and precision, and thus an ideal and unbiased judgment.
- Generalization Performance on GTS (by peptide length):
  - The GTS (generalization test set) was used to assess performance on novel peptides of varying lengths.
  - As shown in Fig. 3D (below):
    - ACC & F1: Umami_YYDS showed a linearly increasing tendency from the beginning and took the leading position from hexapeptides onwards.
    - Precision: Umami_YYDS's precision gradually improved with peptide length, consistent with its "unbiased" nature.
    - Recall: Although showing a slight downward trend and meeting the Q model line at a length of 10 residues, it generally remained in a leading position.
    - MCC: Umami_YYDS was competitive with the best model (BERT_bitter) and gradually overtook it in the mid-to-long peptide range.
  - Conclusion: The model's judgment was found to be reliable and extremely competitive.

The following figure (Figure 3 from the original paper) presents the performance validation of the Umami_YYDS model.
This figure shows the performance validation of the Umami_YYDS model, including (A) the confusion matrix, (B) the ROC curve with AUC = 0.98, (C) comparisons with other models on multiple metrics, and (D) changes in these metrics at different training set sizes.
Figure 3: (A) Confusion matrix for Umami_YYDS showing true positives, true negatives, false positives, and false negatives. (B) ROC curve of Umami_YYDS with an AUC of 0.98, illustrating its classification performance. (C) Bar charts comparing Accuracy (ACC), Precision, Recall, F1-score, and MCC of Umami_YYDS against iUmami-SCM. (D) Line graphs showing the trend of ACC, Precision, Recall, F1-score, and MCC for Umami_YYDS and other models (Q model, BERT_bitter) as peptide length increases.
6.1.3. Identify Novel Umami Peptides
To further validate Umami_YYDS, sensory evaluations were performed on 6 randomly selected, unreported peptides (NQS, ATQ, ECH, RVF, RGG, LPG) that were food-derived and 2-3 amino acids long, chosen from the model's predictions.
- High Consistency with Predictions:
  - The actual taste perception of ATQ, ECH, RVF, and NQS was highly consistent with the predicted results (Table S5, not provided, and Fig. 4A below).
  - ATQ, ECH, and NQS showed a strong umami taste, with recognition thresholds of 0.164 and 0.184 for ATQ and ECH, respectively. They also exhibited sweetness (thresholds of 0.134 and 0.181, respectively), consistent with the synergistic effect between sweet and umami.
  - RVF showed a strong bitter taste.
  - Compared to similar models, Umami_YYDS achieved the best accuracy of 80% in this sensory validation (Table S6, not provided).
- Analysis of Misjudgments (RGG and LPG):
  - RGG and LPG were predicted to be bitter but showed a dominant umami perception in sensory evaluation, despite some bitterness. This was identified as a misjudgment.
  - Reason for Misjudgment: An analysis of their characteristic attribute values (Fig. 4B below) revealed that most of their attributes were similar to the mean values of umami peptides, except for SV1 (SMR_VSA1) and VS7 (VSA_EState7).
    - The mean SV1 of umami peptides was 30.647, but RGG and LPG had SV1 values of 19.40, very close to the bitter peptides' mean SV1 of 18.567.
    - The mean VS7 of umami peptides was -0.366, but RGG and LPG had VS7 values of 0.905 and 1.843, respectively, close to the bitter peptides' mean VS7 of 1.213.
  - Conclusion: These two parameters (SV1 and VS7) were identified as the likely cause of the model's misjudgment, providing a clear direction for future model upgrades and improvements.

The following figure (Figure 4 from the original paper) shows the sensory evaluation results.
This figure from the paper has two parts. Part A is a radar chart showing the taste profiles of the six peptides across five tastes (sour, bitter, salty, umami, sweet); Part B is a bar chart comparing the values of multiple molecular features across the peptide samples, reflecting their taste attributes.
Figure 4: (A) Radar chart illustrating the taste profiles (sour, bitter, salty, umami, sweet) of six synthesized peptides (NQS, ATQ, ECH, RVF, RGG, LPG) based on sensory evaluation. (B) Bar chart comparing the values of selected molecular features (SV1, MolLogP, VS7, VS6, VS5, MinEStateIndex, PEOE_VSA14, BCUT2D_MWLOW) for RGG, LPG, and the average values for umami and bitter peptides, highlighting differences in SV1 and VS7 for misjudged peptides.
6.1.4. TastePeptides-Meta System
The TastePeptides-Meta system integrates three main parts: TastePeptidesDB (database), Auto_Taste_ML (ML package), and Umami_YYDS (web server).
6.1.4.1. TastePeptidesDB database
- Purpose: A database for storing and displaying taste peptide information.
- Scale: Currently contains 483 taste peptides, making it the largest taste peptide database published to date.
- Entry Information: Each peptide entry includes: name (FASTA format), taste, verified (Vitro_verit), simplified molecular-input line-entry system (Canonical SMILES), literature, paper author (Contributor), update time, etc.
- Query Functions: The query page offers 4 basic functions: precise search, taste screening, submission of new discoveries, and a cross-page jump link (Fig. 5A below).
- Submission Workflow: Users can submit new discoveries by providing the required information (Fig. 5B below), following a detailed workflow provided in the supplementary material.
- Taste Distribution Analysis:
  - Peptides are sorted by taste attributes: umami, bitter, sweet, sour, kokumi, astringent, and salty (Fig. 5C below).
  - Most reported studies (79.4%) focus on umami and bitter peptides. This suggests that peptide structures are highly susceptible to activating umami receptors (T1R1-T1R3) and bitter receptors (GABA or T2Rs), whereas sweet taste receptors (T1R2-T1R3) are less easily activated by peptides.
  - Single-taste bitter or umami peptides are the most abundant, followed by sweet/umami and bitter/umami peptides (Fig. 5D below). This indicates the existence of peptides that can activate multiple taste receptors simultaneously, prompting further research into their key conformations.
- Peptide Length Distribution: Dipeptides and tripeptides account for almost half of the database capacity, with the number of taste peptides gradually decreasing as length increases (Fig. 5E below).

The following figure (Figure 5 from the original paper) shows the interface of the TastePeptidesDB database and statistics of the taste distribution.
Figure 5: (A) Screenshot of the TastePeptidesDB search interface, showing options for precise search, taste screening, and submission. (B) Example form for submitting new peptide information, detailing required fields. (C) Pie chart showing the distribution of reported taste peptides by taste attribute (umami, bitter, sweet, sour, kokumi, astringent, salty). (D) Bar chart illustrating the number of peptides exhibiting single or multiple taste attributes (e.g., bitter, umami, sweet/umami, bitter/umami). (E) Donut chart displaying the distribution of taste peptides based on their length (dipeptides, tripeptides, tetrapeptides, etc.).
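If the database is exported to a flat table, the taste and length distributions above are easy to re-derive. The sketch below assumes a hypothetical CSV export with "name" and "taste" columns mirroring the entry fields listed earlier; neither the file name nor the column names are part of the published interface.

```python
# Minimal sketch over a hypothetical TastePeptidesDB export (the file name and
# column names are assumptions, not an official interface).
import pandas as pd

df = pd.read_csv("tastepeptidesdb_export.csv")   # hypothetical flat export

# Taste-attribute distribution (cf. Fig. 5C/D); multi-taste entries such as
# "sweet/umami" are counted as their own category here.
print(df["taste"].value_counts())

# Peptide-length distribution (cf. Fig. 5E), assuming one-letter FASTA names.
print(df["name"].str.len().value_counts().sort_index())
```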
6.1.4.2. Auto_Taste_ML: A data package for taste modeling
- Purpose: Provides a standard workflow for taste data processing and analysis, feature construction, model selection, and data visualization, aiming to reduce the workload of researchers.
- Technical Details: Written in Python and released under the BSD license.
- Functionality: Designed to reveal the entire TastePeptidesDB data-processing and Umami_YYDS model-building process.
- Availability: Released on PyPI (the Python Package Index) at https://pypi.org/project/Auto-TasteML/. Code and instructions are on GitHub (https://github.com/SynchronyML/Auto_Taste_ML/).
- Efficiency: The authors claim the corresponding functions can be run "within 1 min."
6.1.4.3. Umami_YYDS web server
- Purpose: To provide a direct connection between academia and industry and facilitate the rapid identification of taste peptides.
- Accessibility: The modeling results are deployed on the Umami_YYDS server at http://tastepeptides-meta.com/cal.
- User-friendliness: Developed as a user-friendly web server.
- Performance: Tested on Google Chrome and Apple Safari for 3 months with good performance.
6.2. Data Presentation (Tables)
The paper refers to several supplementary tables (S1, S2, S3, S4, S5, S6) that are not provided in the main text. However, the content of Table 1, which lists the Feature source and calculation module for the 8 selected molecular descriptors, is provided within the main text. This table has been transcribed in Section 4.2.2.
6.3. Ablation Studies / Parameter Analysis
The paper discusses aspects of parameter analysis during the model selection and optimization phase, which functions somewhat like a sensitivity analysis for hyperparameters:
- Hyperparameter Influence: During the grid search for the Gradient Boosting algorithm, 551,840 combinations were explored. The results indicated that n_estimators was the "main factor that affects the results," followed by max_depth and min_samples_split; min_samples_leaf showed "little influence" and "did not show statistical significance." This analysis helped determine the optimal hyperparameter settings for Umami_YYDS (n_estimators = …, max_depth = 17, min_samples_split = 10, min_samples_leaf = 3).

While not a formal ablation study in the sense of removing specific components of the final model, the systematic exploration of feature-selection steps (from 278 to 8 features) and hyperparameter optimization demonstrates a rigorous process for identifying the most effective and efficient model configuration. The analysis of the RGG and LPG misjudgments also serves as a post-hoc analysis of feature impact, highlighting that SV1 and VS7 were key factors leading to misclassification, which can guide future model improvements.
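A minimal sketch of this kind of search, assuming a scikit-learn GradientBoostingClassifier (the excerpt does not name the library) and random data standing in for the eight descriptors; the toy grid below is illustrative and far smaller than the 551,840 combinations reported.

```python
# Minimal sketch of a grid search over a gradient-boosting classifier,
# assuming scikit-learn. The data and grid are placeholders for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

param_grid = {
    "n_estimators": [100, 300, 500],   # reported as the dominant factor
    "max_depth": [5, 11, 17],          # 17 was the selected value
    "min_samples_split": [2, 10],      # 10 was the selected value
    "min_samples_leaf": [1, 3],        # 3 was the selected value
}

search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, scoring="accuracy", cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Inspecting `search.cv_results_` after such a run is one way to see which parameter dominates the score, mirroring the per-parameter analysis described above.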
7. Conclusion & Reflections
7.1. Conclusion Summary
This study successfully developed TastePeptides-Meta, a comprehensive system designed to facilitate the rapid screening and analysis of taste peptides, particularly focusing on umami and bitter tastes. The system comprises three main components:
- TastePeptidesDB: A newly compiled and extensive database of reported taste peptides, addressing the prior limitation of insufficient data.
- Umami_YYDS: An umami/bitter classification model built using a Gradient Boosting Decision Tree on eight key molecular descriptors. The model achieved high performance (89.6% accuracy, 0.98 AUC) and, critically, offers interpretability through SHAP analysis. This analysis identified water solubility (MolLogP, SMR_VSA1), polarization rate, charge properties (VSA_EState descriptors), and molecular weight (BCUT2D_MWLOW) as primary factors influencing short peptide taste. The model's predictions were rigorously validated by sensory experiments, demonstrating an 80% accuracy on novel peptides.
- Auto_Taste_ML: An open-source machine learning package that encapsulates the entire modeling process, promoting reproducibility and easing data processing and model building for researchers.

The integration of these components into the TastePeptides-Meta universe provides a systematic, accurate, interpretable, and user-friendly platform for taste peptide research, moving beyond the limitations of previous fragmented or "black box" approaches.
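Because the excerpt does not include the authors' SHAP code, the following is only a minimal sketch of how such an analysis is typically run on a gradient-boosting model; the feature names follow Figure 4B, while the data and model are random placeholders rather than the released Umami_YYDS artifacts.

```python
# Minimal sketch of a SHAP analysis on a gradient-boosting model, assuming the
# shap package. Data and model are placeholders, not the paper's artifacts.
import pandas as pd
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

FEATURES = ["SV1", "MolLogP", "VS7", "VS6", "VS5",
            "MinEStateIndex", "PEOE_VSA14", "BCUT2D_MWLOW"]

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X = pd.DataFrame(X, columns=FEATURES)

model = GradientBoostingClassifier(random_state=0).fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)   # per-sample, per-descriptor attributions
shap.summary_plot(shap_values, X)        # global ranking of descriptor impact
```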
7.2. Limitations & Future Work
The authors acknowledge several areas for future improvement and identified limitations:
- Model Fusion for Enhanced Performance: While Umami_YYDS achieved excellent performance, the authors suggest constructing a "fusion model method with better recognition performance." This implies combining multiple models (e.g., an ensemble of diverse models or a stacking approach) could potentially yield even higher accuracy and robustness.
- Synergistic Effects of Multiple Tastes: The paper notes that "additional data were also collected to study their synergistic effects" due to the "multiple taste characterization of umami peptide." This indicates a limitation in the current model's ability to fully account for complex taste interactions (e.g., how umami can inhibit bitterness or how sweet and umami can synergize), suggesting a need for models that can predict or incorporate these multi-taste interactions.
- Incorporating Ligand Interaction Information: In terms of feature construction, the authors propose adding "the information of ligand interaction based on molecular docking." This would provide insights into how peptides physically bind to and activate taste receptors, potentially leading to more biologically relevant and accurate predictions, and enabling a "consensus judgment."
7.3. Personal Insights & Critique
7.3.1. Personal Insights
This paper presents a highly valuable and practical contribution to the field of food science and chemoinformatics. The TastePeptides-Meta system is a commendable effort to create a holistic ecosystem for taste peptide research, moving beyond standalone models or databases.
- Emphasis on Interpretability: The deliberate choice of a Gradient Boosting Decision Tree and the subsequent SHAP analysis is a significant strength. In scientific domains like food chemistry, understanding why a model predicts something is often as crucial as the prediction itself. Identifying key molecular descriptors related to solubility, charge, and molecular weight provides actionable insights for rational design of taste peptides or for understanding natural taste profiles in food. This makes the model not just a predictive tool but also a scientific instrument for discovery.
- Open-Source Contribution: The release of Auto_Taste_ML is a critical step towards fostering reproducibility and democratizing computational taste research. By providing encapsulated code, the authors empower other researchers, particularly those who might be new to machine learning, to apply and build upon their methodologies, accelerating progress in the field.
- Rigorous Validation: The comprehensive validation strategy, including comparison against multiple baselines, evaluation on a dedicated generalization test set, and, crucially, human sensory validation of novel peptides, lends significant credibility to the Umami_YYDS model's performance. The detailed analysis of misjudgments in RGG and LPG is also insightful, demonstrating a scientific approach to model improvement.
- Addressing Data Scarcity: The creation of TastePeptidesDB directly addresses one of the most common bottlenecks in machine learning for specialized domains: data scarcity. A large, curated database is an invaluable resource for future research.
7.3.2. Critique and Potential Improvements
While the paper is strong, some areas could be further elaborated or improved:
- Dataset Details: While the paper mentions the number of peptides and their categorization, more detailed statistics on the dataset (e.g., distribution of amino acid types, common motifs, molecular weight range for umami vs. bitter) within the 203 benchmark peptides could enhance understanding of the training data's characteristics. A clearer distinction between the 203 benchmark peptides and the 483 peptides in TastePeptidesDB, regarding their usage in model training versus database population, could also be beneficial.
- Black-Box Baselines Comparison: When comparing Umami_YYDS to BERT_bitter, the paper rightly highlights the interpretability advantage. However, a deeper dive into why Umami_YYDS (a non-deep-learning model) manages to compete with or even surpass BERT_bitter for longer peptides in terms of MCC could be explored. Is it due to the strength of the selected molecular descriptors, the specific GBDT tuning, or limitations of BERT-like models on small peptide sequences?
- Synergistic Effects: The paper acknowledges the limitation regarding synergistic effects. This is a complex but vital aspect of taste perception. Future work could investigate multi-label classification or regression models that predict not just the primary taste but also the intensity or interaction profiles. Incorporating molecular docking information, as suggested by the authors, would be a strong step in this direction.
- Scalability to Longer Peptides: The paper notes that di- and tripeptides dominate the database. While Umami_YYDS performs well on peptides in the GTS, the current selection of 8 molecular descriptors might be more optimized for shorter peptides. Exploring additional descriptors or different feature extraction methods (e.g., sequence-based embeddings beyond di/tripeptides) might improve performance for much longer peptides.
- Web Server Features: While a web server is deployed, details on its interactive features beyond basic prediction (e.g., batch prediction, visualization of SHAP values for user-submitted peptides, direct database query integration on the prediction page) could be further specified to showcase its user-friendliness.
Overall, the paper represents a significant step forward in the computational prediction of taste peptides, providing not just a high-performing model but a foundational system that encourages transparency and collaborative research in the field.