Identification and prediction of milk-derived bitter taste peptides based on peptidomics technology and machine learning method
TL;DR Summary
This study developed a screening workflow for bitter taste peptides that combines peptidomics technology and machine learning, achieving 90.3% accuracy with the novel CPM-BP model. Among 724 significantly different peptides between spoiled and fresh UHT milk, 180 potential bitter peptides were identified, and three were confirmed to activate the human bitter taste receptor T2R4, validating the model's predictions.
Abstract
Bitter taste peptides (BPs) are vital for drug and nutrition research, but large-scale screening of them is still time-consuming and costly. This study developed a complete workflow for screening BPs based on peptidomics technology and a machine learning method. Using an expanded dataset and a new combination of BPs' characteristic factors, a novel classification prediction model (CPM-BP) based on the Light Gradient Boosting Machine algorithm was constructed, achieving an accuracy of 90.3% for predicting BPs. Among 724 significantly different peptides between spoiled and fresh UHT milk, 180 potential BPs were predicted using CPM-BP, eleven of which had been previously reported. One known BP (FALPQYLK) and three predicted potential BPs (FALPQYL, FFVAPFPEVFGKE, EMPFPKYP) were verified by determination of calcium mobilization in HEK293T cells expressing the human bitter taste receptor T2R4. The three potential BPs activated hT2R4 and were thus demonstrated to be BPs, proving the effectiveness of CPM-BP.
In-depth Reading
1. Bibliographic Information
1.1. Title
Identification and prediction of milk-derived bitter taste peptides based on peptidomics technology and machine learning method
1.2. Authors
Yang Yu, Shengchi Liu, Xinchen Zhang, Wenhao Yu, Xiaoyan Pei, Li Liu, Yan Jin
Affiliations:
- Yang Yu, Wenhao Yu, Xiaoyan Pei, Yan Jin: China Agricultural University, Beijing, China.
- Shengchi Liu, Xinchen Zhang, Li Liu, Yan Jin: Inner Mongolia Yili Industrial Group Co., Ltd., Hohhot, Inner Mongolia, China.
- Yan Jin is affiliated with both China Agricultural University and Inner Mongolia Yili Industrial Group Co., Ltd.
The authors represent a collaboration between academia (China Agricultural University) and industry (Inner Mongolia Yili Industrial Group Co., Ltd.), suggesting a research focus that combines scientific rigor with practical application, particularly in the dairy industry.
1.3. Journal/Conference
The paper was published in Food Chemistry, a highly reputable and influential journal in the field of food science, analytical chemistry, and food technology. Its focus on original research that advances the understanding of food composition, processing, and quality makes it a significant venue for this type of study, especially given the paper's application to milk quality.
1.4. Publication Year
Published on September 1, 2023. A corrigendum was published on March 1, 2024, correcting an affiliation without affecting the scientific content.
1.5. Abstract
This study addresses the challenge of large-scale screening of bitter taste peptides (BPs), which is typically time-consuming and costly, despite their importance in drug and nutrition research. The authors developed a comprehensive workflow combining peptidomics technology and machine learning. They constructed a novel classification prediction model (CPM-BP) using an expanded dataset, new characteristic factors for BPs, and the Light Gradient Boosting Machine (LightGBM) algorithm, achieving an accuracy of 90.3% in predicting BPs. Applying CPM-BP to 724 significantly different peptides found between spoiled and fresh ultra-high temperature (UHT) milk, 180 potential BPs were predicted, 11 of which were previously known. For experimental validation, one known BP (FALPQYLK) and three predicted potential BPs (FALPQYL, FFVAPFPEVFGKE, EMPFPKYP) were tested using HEK293T cells expressing human bitter taste receptor T2R4 (hT2R4) to measure calcium mobilization. Three of these potential BPs successfully activated hT2R4, confirming their bitter taste and validating the effectiveness of CPM-BP.
1.6. Original Source Link
/files/papers/691751fe110b75dcc59ae05c/paper.pdf (This link points to the PDF hosted by the system.)
2. Executive Summary
2.1. Background & Motivation
The core problem this paper aims to solve is the inefficient and costly large-scale screening of bitter taste peptides (BPs). BPs are peptides that elicit a bitter taste sensation. They are vital for various fields, including drug development, nutritional research, and food quality control. For instance, in food science, the presence of BPs can significantly impact the taste and consumer acceptance of products like milk, especially UHT milk which can develop a bitter taste during its shelf life due to protein hydrolysis.
Existing methods for identifying BPs, particularly conventional laboratory approaches, are often time-consuming and expensive. While machine learning (ML) has shown promise in biological data analysis and BP prediction, current ML models still suffer from certain shortcomings:
- Incomplete discriminative characteristics: they may not fully capture the features that distinguish BPs from non-bitter taste peptides (NBPs).
- Information redundancy and overfitting: non-important features can lead to models that perform well on training data but poorly on new data.
- Insufficient benchmark datasets: the quantity and quality of the data used to train these models can limit their predictive power.
These limitations highlight a significant gap: there's a need for a more efficient, accurate, and robust method for predicting BPs, especially in complex food matrices like milk, to better understand and control product quality. The paper's entry point is to leverage the power of peptidomics technology (for peptide identification) and advanced machine learning (for prediction) to overcome these challenges.
2.2. Main Contributions / Findings
The paper makes several primary contributions:
- Developed a complete workflow for BP screening: It integrates peptidomics technology (for peptide identification in complex samples like milk) with a machine learning approach (for predicting bitterness).
- Expanded Benchmark Dataset: Constructed an improved and expanded benchmark dataset named BTP720 (containing 360 BPs and 360 NBPs) by aggregating data from existing datasets (like BTP640) and various databases/literature (Biopep, Flavor Database, BitterDB). This addresses the "insufficiency of the current benchmark dataset" limitation of previous models.
- Novel Combination of Characteristic Factors: Proposed and screened a new set of fourteen potential characteristic factors related to hydrophobicity, amino acid composition, and terminal characteristics of peptides, which are crucial for distinguishing BPs from NBPs. Through optimization, an optimal combination of ten factors was identified.
- Constructed a Novel Classification Prediction Model (CPM-BP): Based on the Light Gradient Boosting Machine (LightGBM) algorithm, CPM-BP achieved an accuracy of 90.3% for predicting BPs on an independent testing dataset and demonstrated superior performance (ACC, PRE, F1, MCC) compared to existing models such as iBitter-SCM and iBitter-Fuse.
- Applied to Real-World Samples: Successfully applied CPM-BP to identify 180 potential BPs among 724 significantly different peptides between spoiled and fresh UHT milk, showing the model's practical utility.
- Experimental Verification: Experimentally verified the bitterness of three predicted potential BPs (FALPQYL, FFVAPFPEVFGKE, EMPFPKYP) and one known BP (FALPQYLK) by observing their ability to activate the human bitter taste receptor T2R4 (hT2R4) via calcium mobilization assays in HEK293T cells, demonstrating the effectiveness and reliability of CPM-BP's predictions.

In summary, the paper provides a validated, high-throughput method for identifying BPs, which can significantly accelerate research in food science, drug discovery, and nutrition by reducing the time and cost associated with traditional screening.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a novice reader should be familiar with the following core concepts:
- Peptides: Short chains of amino acids linked by peptide bonds. They are smaller than proteins.
- Bitter Taste Peptides (BPs): Specific peptides that bind to bitter taste receptors in the mouth, eliciting a bitter taste sensation. Their presence can be desirable (e.g., in some fermented foods) or undesirable (e.g., in spoiled milk).
- Peptidomics Technology: A branch of proteomics that focuses on the comprehensive analysis of peptides within a biological sample. It involves techniques to identify, characterize, and quantify peptides.
- Mass Spectrometry (MS): An analytical technique that measures the mass-to-charge ratio (m/z) of ions. In peptidomics, it's used to identify peptides by fragmenting them (tandem MS, MS/MS) and matching the fragment patterns to known peptide sequences or protein databases.
- Liquid Chromatography (LC): A separation technique used to separate components in a mixture based on their differential partitioning between a stationary phase and a mobile phase. Often coupled with MS (LC-MS/MS) to separate peptides before they enter the mass spectrometer, enhancing detection and identification.
- Ultra-High Temperature (UHT) Milk: Milk that has been heat-treated at very high temperatures (typically 135-150°C for a few seconds) to sterilize it. This process allows for a long shelf life without refrigeration until opened. However, enzymatic activity or other factors can still lead to spoilage and bitterness.
- Machine Learning (ML): A field of artificial intelligence that enables systems to learn from data, identify patterns, and make decisions with minimal human intervention. In this context, ML is used to build predictive models that can classify peptides as bitter or non-bitter.
- Light Gradient Boosting Machine (LightGBM): An open-source, distributed, high-performance gradient boosting framework based on decision tree algorithms. It's known for its speed, efficiency, and accuracy, especially with large datasets. It works by building an ensemble of weak prediction models (decision trees) sequentially, where each new model corrects the errors of the previous ones.
- Classification Prediction Model (CPM-BP): A specific machine learning model developed in this study, designed to classify (predict) whether a given peptide sequence is a bitter taste peptide (BP) or a non-bitter taste peptide (NBP).
- Benchmark Dataset: A standardized dataset used to train and evaluate machine learning models, allowing for fair comparison between different algorithms.
- Characteristic Factors (Features): Measurable properties or attributes of a peptide (e.g., hydrophobicity, amino acid composition, terminal residues) that are used as input to a machine learning model to make predictions. These are also known as "features" in ML.
- Human Bitter Taste Receptors (hT2Rs): A family of G protein-coupled receptors (GPCRs) found on taste cells in the tongue. When BPs bind to these receptors, they initiate a signaling cascade that leads to the perception of bitterness. The paper specifically mentions
hT2R4, which is known to bind a wide variety of bitter compounds.
- HEK293T Cells: Human embryonic kidney 293T cells, a commonly used cell line in biological research for transient expression of recombinant proteins (like hT2Rs) and studying cellular signaling pathways.
- Calcium Mobilization Assay: A laboratory technique used to measure changes in intracellular calcium ion (Ca²⁺) concentration. When bitter taste receptors are activated, they trigger a signaling pathway that leads to the release of Ca²⁺ from intracellular stores. Measuring this increase (often with fluorescent dyes like Fluo-4 AM) is a common way to assess receptor activation.
- Euclidean Distance: A measure of the true straight-line distance between two points in Euclidean space. In this context, it's used to assess the similarity or difference between peptide characteristic factors. A shorter distance implies higher similarity.
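For two peptides represented by the same set of n characteristic factors, the Euclidean distance takes the standard form (a general textbook definition, not a formula quoted from the paper): $ d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2} $, where $x_i$ and $y_i$ are the i-th factor values of the two peptides.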
3.2. Previous Works
The paper discusses several computational methods based on quantitative structure-activity relationship (QSAR) modeling and machine learning that have been developed to predict BPs. These previous works, while innovative, serve as a backdrop against which the current study demonstrates its improvements.
- iBitter-SCM (Charoenkwan et al., 2020): This model used traditional sequence features to identify BPs. The name SCM likely refers to a Scoring Card Method, which often uses propensity scores of dipeptides or amino acids.
- iBitter-Fuse (Charoenkwan et al., 2021): This model aimed to improve performance by fusing multi-view features, suggesting it combined different types of sequence information or feature representations.
- BERT4Bitter (Charoenkwan et al., 2021): This approach leveraged natural language processing (NLP) techniques. BERT (Bidirectional Encoder Representations from Transformers) is a powerful neural network architecture commonly used in NLP tasks. BERT4Bitter represented peptide sequences as feature descriptors, treating them similarly to natural language sentences, to predict BPs.
- iBitter-DRLF (Jiang et al., 2022): This model used a deep learning pre-trained neural network feature extraction method. DRLF likely stands for Deep Representation Learning Features, indicating that deep learning was used to automatically learn relevant features from peptide sequences.

The authors note that these previous models still have certain shortcomings, including "incompletely discriminative characteristics between BPs and non-bitter taste peptides (NBPs), information redundancy and over-fitting caused by the embodiment of non-important features, and the insufficiency of the current benchmark dataset." These limitations directly motivated the current study's approach to expand the dataset and refine the characteristic factors.
3.3. Technological Evolution
The identification and prediction of BPs have evolved significantly. Initially, it relied on laborious sensory evaluation (human taste panels) and physicochemical analysis (chromatography, spectroscopy) to isolate and identify bitter compounds. This was often slow and resource-intensive.
The advent of peptidomics technology, particularly liquid chromatography-tandem mass spectrometry (LC-MS/MS), revolutionized the ability to rapidly identify thousands of peptides in complex biological samples. This high-throughput identification created a data abundance that traditional manual analysis couldn't match.
Simultaneously, the rise of machine learning provided powerful tools to analyze these large datasets. Early ML approaches for BPs often involved Quantitative Structure-Activity Relationship (QSAR) modeling, linking chemical structure to biological activity, and models based on traditional sequence features (e.g., amino acid composition, hydrophobicity). More recently, the field has seen the integration of deep learning and natural language processing (NLP) techniques, treating peptide sequences as "biological language" to extract more abstract and powerful features.
This paper fits into this evolution by:
- Improving Data Foundation: Addressing the "insufficiency of current benchmark dataset" by creating
BTP720, a larger and more curated dataset.
- Refining Feature Engineering: Moving beyond standard sequence features by proposing a novel combination of characteristic factors that are rigorously selected and optimized.
- Adopting Advanced ML Algorithms: Utilizing LightGBM, a state-of-the-art gradient boosting framework known for its efficiency and accuracy, which can handle complex feature interactions better than simpler models.
- Integrating Experimental Validation: Coupling computational prediction with biological verification using hT2R4 calcium mobilization assays, thus providing a robust end-to-end workflow from discovery to validation.
3.4. Differentiation Analysis
Compared to the main methods in related work (iBitter-SCM, iBitter-Fuse, BERT4Bitter, iBitter-DRLF), the core differences and innovations of this paper's approach (CPM-BP) are:
- Expanded and Curated Benchmark Dataset (
BTP720): Unlike previous models that might have relied on smaller or less curated datasets (e.g., BTP640), CPM-BP benefits from BTP720, which is significantly expanded by collecting BPs from multiple databases (Biopep, Flavor Database, BitterDB) and extensive literature. Crucially, BTP720 focuses only on peptides that uniquely present a bitter taste, ensuring a clearer distinction between BPs and NBPs and thereby improving data quality for training.
- Novel Combination of Characteristic Factors: Instead of relying solely on traditional sequence features or deep-learning-derived features, this study proposes and optimizes a new combination of fourteen potential characteristic factors. These factors are specifically chosen to capture various aspects of peptide hydrophobicity, amino acid composition (including specific bitter-contributing amino acids), and positional information (N-terminal, C-terminal). This targeted feature engineering, combined with separability verification and LightGBM's feature selection capability, aims to create a more discriminative and less redundant feature set.
- Application of the LightGBM Algorithm: The choice of LightGBM is a key differentiator. While other models use different ML (e.g., SCM, feature fusion) or deep learning (BERT, DRLF) approaches, LightGBM offers faster training speed, lower memory consumption, better accuracy, and efficient handling of massive data, potentially leading to a more robust and scalable model.
- Demonstrated Superior Performance: CPM-BP significantly outperforms iBitter-SCM and iBitter-Fuse in key metrics such as ACC (accuracy), PRE (precision), F1 (F1 score), and MCC (Matthews correlation coefficient) on the expanded independent testing dataset from BTP720. This empirically validates the advantages of its improved dataset and characteristic factors.
- Integrated Workflow with Experimental Validation: The paper does not just propose a predictive model; it demonstrates a complete workflow from peptidomics identification in real-world samples (spoiled milk) to in vitro biological validation using HEK293T cells expressing hT2R4. This end-to-end approach, culminating in the verification of predicted BPs, adds a strong layer of credibility and practical applicability that goes beyond purely computational studies. The experimental verification of FALPQYLK (a known BP) and three novel potential BPs directly proves the effectiveness of the CPM-BP model.
4. Methodology
The methodology section outlines a comprehensive workflow for identifying and predicting bitter taste peptides (BPs), integrating peptidomics with machine learning and biological validation.
4.1. Principles
The core idea of the method is to build a highly accurate machine learning model that can predict bitter taste peptides (BPs) from their amino acid sequences. This model is then used to screen peptides identified in real-world samples (like milk). The principles are based on the understanding that BPs possess specific physicochemical characteristics (e.g., hydrophobicity, amino acid composition, terminal residues) that differentiate them from non-bitter taste peptides (NBPs). By quantifying these characteristics and feeding them to a powerful classification algorithm like LightGBM, a predictive model can be trained. The predictions are then validated biologically by testing the peptides' ability to activate human bitter taste receptors.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Benchmark Dataset Construction
The foundation of the machine learning model is a robust and representative dataset of known BPs and NBPs.
- Expansion of Existing Data: The study started with
BTP640, a dataset often used in previous reports, which contains 320 BPs and 320 NBPs.
- Comprehensive Data Collection: This dataset was significantly expanded by collecting additional BPs from several authoritative sources:
  - Biopep (https://biochemia.uwm.edu.pl/biopep-uwm/)
  - Flavor Database (https://mffi.sjtu.edu.cn/database/)
  - BitterDB (https://bitterdb.agri.huji.ac.il/bitterdb/)
  - Various previous scientific literature.
- Strict Selection Criteria: To ensure high data quality and clear discriminative characteristics, only peptides confirmed to present a bitter taste uniquely were included as BPs. Peptides with multiple tastes were excluded.
- Final Dataset (BTP720): The resulting benchmark dataset, named BTP720, comprises 720 peptide sequences, equally balanced with 360 BPs (Table S1 in supplementary materials) and 360 NBPs (Table S2).
- Data Splitting: For fair evaluation, BTP720 was randomly divided into the following subsets (a minimal split sketch follows this list):
  - Training Dataset: 80% of the data (288 BPs and 288 NBPs), used to train the machine learning model.
  - Independent Testing Dataset: 20% of the data (72 BPs and 72 NBPs), used to evaluate the model's performance on unseen data, simulating real-world prediction.
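As an illustration of the 8:2 split described above, here is a minimal Python sketch (not the authors' code); the DataFrame layout and toy sequences are hypothetical stand-ins for BTP720.

```python
# Minimal sketch of an 80/20 stratified split of a BTP720-style dataset.
# The column names and toy rows are hypothetical placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split

btp720 = pd.DataFrame({
    "sequence": ["FALPQYLK", "YLEQLLR", "LHLPLPLL", "PFPGPIPNS", "EMPFPKYP",
                 "PEPA", "PEPB", "PEPC", "PEPD", "PEPE"],  # toy entries only
    "label":    [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],            # 1 = BP, 0 = NBP
})

train_df, test_df = train_test_split(
    btp720,
    test_size=0.2,              # 80% training / 20% independent testing
    stratify=btp720["label"],   # keep the BP/NBP balance in both subsets
    random_state=42,            # reproducible split
)
print(len(train_df), len(test_df))  # -> 8 2
```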
4.2.2. Selection of the Characteristic Factors of BPs
A crucial step for machine learning is feature engineering – defining characteristics that differentiate BPs from NBPs.
- Initial Proposal: Based on published literature and databases, fourteen potential characteristic factors of BPs were initially proposed (Table 1). These factors aim to capture various aspects of peptide structure and physicochemical properties relevant to bitterness.
- Quantification: Each peptide in the BTP720 dataset (both BPs and NBPs) was quantified according to these fourteen characteristic factors, transforming peptide sequences into numerical data suitable for machine learning.
- Hydrophobicity Focus: Hydrophobicity is identified as a particularly important property of BPs. Several factors were designed to capture it:
  - Q (Average hydrophobicity of peptides): An important index to evaluate the bitter degree of peptides, calculated using the amino acid Q values listed in Table S3.
  - Q1 (Percentage of amino acids with Q value < 0 in peptides): Proportion of amino acids with very low Q values.
  - Q2 (Percentage of amino acids with Q value in the range 0-1000 in peptides): Proportion of amino acids with moderate Q values.
  - Q3 (Percentage of amino acids with Q value in the range 1000-2000 in peptides): Proportion of amino acids with higher Q values.
  - Q4 (Percentage of amino acids with Q value in the range 2000-3000 in peptides): Proportion of amino acids with very high Q values.
  - AH (Average hydrophobicity of peptides): Another measure of average hydrophobicity, derived from the COWR900101 descriptor in the AAindex database (Table S4).
  - N (Hydrophobicity of the amino acids located in the N-terminal of peptides): Focuses on the hydrophobicity at the beginning of the peptide.
  - C (Hydrophobicity of the amino acids located in the C-terminal of peptides): Focuses on the hydrophobicity at the end of the peptide.
  - Percentage-HAA (Percentage of bitter-contributing amino acids): Proportion of specific amino acids (Ala, Phe, Gly, Ile, Leu, Met, Pro, Val, Tyr, Trp) known to contribute to bitterness.
  - Percentage-FWY (Percentage of three kinds of bitter-contributing amino acids): Proportion of Phenylalanine (Phe), Tryptophan (Trp), and Tyrosine (Tyr), which are highly hydrophobic and frequently associated with bitterness.
- Other Positional and Compositional Factors:
  - N-basic AA (Whether the amino acid at the N-terminal of the peptide is a basic amino acid): Binary factor indicating the presence of a basic amino acid at the N-terminal.
  - LFIYWV-C (Whether the amino acid at the C-terminal of the peptide is one of six bitter-contributing amino acids, Leu, Phe, Ile, Tyr, Trp, or Val): Binary factor for specific hydrophobic amino acids at the C-terminal.
  - P-X-C (Whether amino acid P is located in the second place from the C-terminal of the peptide): Binary factor for Proline (P) at the penultimate position from the C-terminal.
  - RP (Whether adjacent RP occurs in the peptide): Binary factor for the presence of an 'RP' sequence motif.
- Separability Verification: To identify the most discriminative factors, Euclidean distance was used. This method quantifies how distinct the distributions of a factor are between BPs and NBPs: a shorter Euclidean distance between data points implies higher similarity and therefore a poorly discriminative factor. This step helps pre-select features for the model (a short code sketch of computing several of these factors appears after Table 1).

The following are the results from Table 1 of the original paper:
| Codes | Interpretations | Contribution degrees |
|---|---|---|
| Q | Average hydrophobicity of peptides | 56 |
| Q1* | Percentage of the amino acids with Q value < 0 in peptides | \ |
| Q2 | Percentage of the amino acids with Q value in range 0-1000 in peptides | 21 |
| Q3* | Percentage of the amino acids with Q value in range 1000-2000 in peptides | \ |
| Q4 | Percentage of the amino acids with Q value in range 2000-3000 in peptides | 17 |
| AH | Average hydrophobicity of peptides | 52 |
| N | The hydrophobicity of amino acids located in the N-terminal of peptides | 26 |
| C | The hydrophobicity of amino acids located in the C-terminal of peptides | 41 |
| Percentage-HAA | Percentage of bitter-contributing amino acids (Ala, Phe, Gly, Ile, Leu, Met, Pro, Val, Tyr, and Trp) in peptides | 64 |
| N-basic AA | The amino acids located in N-terminal of peptides were basic amino acids or not | 5 |
| LFIYWV-C** | The amino acids located in C-terminal of peptides were six kinds of bitter-contributing amino acids (Leu, Phe, Ile, Tyr, Trp, and Val) or not | \ |
| Percentage-FWY | Percentage of three kinds of bitter-contributing amino acids (Phe, Trp, and Tyr) in peptides | 11 |
| P-X-C | Amino acid P located in the second place from C-terminal of peptides or not | 7 |
| RP* | Adjacent RP in peptides or not | \ |
Notes from paper: * These characteristic factors were eliminated artificially in follow-up steps. ** This characteristic factor was eliminated automatically by the LightGBM algorithm in follow-up steps. \ These characteristic factors were eliminated from the optimal combination and have no contribution degrees.
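To make the composition- and terminal-based factors concrete, the following Python sketch computes a few of them for a single peptide. It is illustrative only: single-letter amino acid codes are assumed, the basic-residue set is an assumption, and the hydrophobicity factors (Q, Q1-Q4, AH, N, C) are omitted because they require the per-residue values from Tables S3/S4.

```python
# Illustrative computation of several scale-free characteristic factors.
HAA = set("AFGILMPVYW")   # bitter-contributing amino acids (Ala, Phe, Gly, Ile, Leu, Met, Pro, Val, Tyr, Trp)
FWY = set("FWY")          # Phe, Trp, Tyr
BASIC = set("KRH")        # basic amino acids (assumed set: Lys, Arg, His)
LFIYWV = set("LFIYWV")    # C-terminal bitter-contributing residues

def characteristic_factors(peptide: str) -> dict:
    """Return a few of the Table 1 factors for one peptide sequence."""
    n = len(peptide)
    return {
        "Percentage-HAA": sum(aa in HAA for aa in peptide) / n,
        "Percentage-FWY": sum(aa in FWY for aa in peptide) / n,
        "N-basic AA":     int(peptide[0] in BASIC),            # basic residue at N-terminal?
        "LFIYWV-C":       int(peptide[-1] in LFIYWV),           # L/F/I/Y/W/V at C-terminal?
        "P-X-C":          int(n >= 2 and peptide[-2] == "P"),   # Pro second from C-terminal?
        "RP":             int("RP" in peptide),                 # adjacent R-P motif?
    }

print(characteristic_factors("FALPQYLK"))
```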
4.2.3. LightGBM Model Construction
The CPM-BP model is built using the LightGBM algorithm, which is a powerful ensemble machine learning technique.
- Algorithm Choice:
LightGBM (Light Gradient Boosting Machine) was chosen for its advantages: faster training speed, lower memory consumption, better accuracy, and support for parallel and distributed computing. It is an optimized implementation of the Gradient Boosted Decision Tree (GBDT) algorithm.
- GBDT Principle: LightGBM builds an ensemble of weak prediction models (decision trees) sequentially. Each new tree attempts to correct the errors made by the previously built trees, focusing on the "residuals" or mistakes of the prior models.
- LightGBM Optimizations:
  - It uses the first- and second-order negative gradients of the loss function simultaneously to calculate the residual of the current tree, improving accuracy.
  - It employs a histogram-based decision tree algorithm, which buckets continuous feature values into discrete bins, speeding up training and reducing memory usage.
  - Gradient-based One-Side Sampling (GOSS): down-samples instances with small gradients (well-trained instances) and keeps instances with large gradients (poorly trained instances), focusing the training effort on the more challenging examples.
  - Exclusive Feature Bundling (EFB): bundles mutually exclusive features (features that rarely take non-zero values simultaneously) into a single feature, further reducing the number of features and speeding up training.
- Feature Selection and Optimization: The initial fourteen characteristic factors were used. The LightGBM algorithm itself has inherent feature selection capabilities (e.g., based on feature importance). The training datasets, quantified by various subsets of these characteristic factors, were fed into the LightGBM classifier.
- Hyperparameter Tuning: To find the optimal model, LightGBM hyperparameters were optimized using the following (a minimal tuning sketch follows this list):
  - 10-fold cross-validation: The training dataset is split into 10 equal parts; the model is trained on 9 parts and validated on the remaining part, and this is repeated 10 times with each part serving as the validation set once. This helps assess performance robustness and reduces overfitting.
  - Grid search: A systematic method for hyperparameter tuning in which a model is trained and evaluated for every combination of specified hyperparameter values; the best-performing combination is selected.
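A minimal sketch of this construction step, assuming the feature matrix produced in the previous section; the hyperparameter grid shown here is hypothetical, since the paper does not list its exact search space.

```python
# Sketch: LightGBM classifier tuned by grid search with 10-fold cross-validation.
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.random((576, 10))            # stand-in for 576 training peptides x 10 factors
y = rng.integers(0, 2, size=576)     # stand-in labels (1 = BP, 0 = NBP)

param_grid = {                       # hypothetical grid, not the paper's
    "num_leaves": [15, 31],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [100, 300],
}
search = GridSearchCV(
    LGBMClassifier(random_state=0),
    param_grid,
    cv=10,                           # 10-fold cross-validation
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```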
4.2.4. Performance Evaluation
To assess the predictive capability of the CPM-BP model, six widely used metrics for binary classification problems were employed. For these metrics, it's essential to understand the basic confusion matrix terms:
- True Positive (TP): The number of BPs that were correctly predicted as BPs.
- True Negative (TN): The number of NBPs that were correctly predicted as NBPs.
- False Positive (FP): The number of NBPs that were incorrectly predicted as BPs.
- False Negative (FN): The number of BPs that were incorrectly predicted as NBPs.
The formulas for the evaluation metrics are as follows:
- Accuracy (ACC): Measures the proportion of correctly predicted instances out of the total instances. $ \mathrm{ACC} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}} $
- Precision (PRE): Measures the proportion of correctly predicted positive instances out of all instances predicted as positive. It indicates the model's ability to avoid false positives. $ \mathrm{PRE} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} $
- Sensitivity (SN) (also known as Recall or True Positive Rate): Measures the proportion of correctly predicted positive instances out of all actual positive instances. It indicates the model's ability to find all positive samples. $ \mathrm{SN} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} $
- F1 Score (F1): The harmonic mean of Precision and Sensitivity. It provides a single score that balances both, which are often conflicting measures. $ \mathrm{F1} = \frac{2 \times \mathrm{TP}}{2 \times \mathrm{TP} + \mathrm{FP} + \mathrm{FN}} $
- Matthews Correlation Coefficient (MCC): A measure of the quality of binary classifications, generally regarded as balanced even when the classes are of very different sizes. A value of +1 represents a perfect prediction, 0 an average random prediction, and -1 an inverse prediction. $ \mathrm{MCC} = \frac{\mathrm{TP} \cdot \mathrm{TN} - \mathrm{FP} \cdot \mathrm{FN}}{\sqrt{(\mathrm{TP} + \mathrm{FP})(\mathrm{TP} + \mathrm{FN})(\mathrm{TN} + \mathrm{FP})(\mathrm{TN} + \mathrm{FN})}} $

In all formulas, TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.
- Receiver Operating Characteristic (ROC) Curve and Area Under the Curve (AUC):
- ROC Curve: A graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. It plots the True Positive Rate (Sensitivity) against the False Positive Rate (1-Specificity) at various threshold settings.
- AUC: The area under the ROC curve. It provides a single scalar value that summarizes the model's performance across all possible classification thresholds. An AUC of 0.5 indicates a random model, while an AUC of 1 indicates a perfect model.
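The same metrics can be computed directly with scikit-learn, which implements the formulas above; a short sketch with toy labels:

```python
# Sketch: computing the six evaluation metrics from predicted and true labels.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef, roc_auc_score)

y_true  = [1, 1, 1, 0, 0, 0, 1, 0]                   # 1 = BP, 0 = NBP (toy labels)
y_pred  = [1, 1, 0, 0, 0, 1, 1, 0]                   # hard predictions
y_score = [0.9, 0.8, 0.4, 0.2, 0.1, 0.6, 0.7, 0.3]   # predicted BP probabilities

print("ACC", accuracy_score(y_true, y_pred))
print("PRE", precision_score(y_true, y_pred))
print("SN ", recall_score(y_true, y_pred))           # sensitivity = recall
print("F1 ", f1_score(y_true, y_pred))
print("MCC", matthews_corrcoef(y_true, y_pred))
print("AUC", roc_auc_score(y_true, y_score))
```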
4.2.5. Peptides Extraction and Identification
This section details the experimental procedure to obtain and identify peptides from milk samples.
- 2.3.1. Sample Preparation:
- Source: Fresh UHT milk (FM) and spoiled UHT milk (SM) samples were obtained from a local market.
- Pre-treatment: Samples were stored at for 1 hour.
- Centrifugation: Centrifuged at and for 20 minutes to remove fat and precipitates. The supernatant contains the peptides.
- Ultrafiltration: The supernatants were then ultrafiltered using a 10-kDa (kiloDalton) ultrafiltration centrifugal tube (Millipore Corp., MA, USA) at and for 20 minutes. This step separates peptides (which are smaller than 10 kDa) from larger proteins and other molecules.
- Lyophilization: The extracted peptides were lyophilized (freeze-dried) to remove water, concentrating them and preparing them for subsequent steps.
- 2.3.2. Sample Desalting: This step removes salts and other non-peptide components that can interfere with mass spectrometry analysis.
- C18 SPE Column: A C18 Solid Phase Extraction (SPE) column (, Waters, Milford, MA, USA) was used. C18 refers to a hydrophobic stationary phase that retains peptides based on their hydrophobicity.
- Activation: The column was activated with of methanol.
- Equilibration: Equilibrated with Trifluoroacetic acid (TFA) (v/v) in water. TFA is a common ion-pairing agent in peptide chromatography.
- Loading: of lyophilized peptides was redissolved in TFA (v/v) and loaded onto the column.
- Washing (Desalting): The column was washed twice with of TFA (v/v) to remove hydrophilic impurities (salts).
- Elution: Peptides were eluted from the column using of TFA (v/v) in acetonitrile (ACN) (v/v). Acetonitrile is an organic solvent that disrupts the hydrophobic interactions between peptides and the C18 phase, allowing peptides to elute.
- Storage: The purified peptides were collected, lyophilized, and stored at .
- 2.3.3. HPLC-MS/MS Identification: High-performance liquid chromatography coupled with tandem mass spectrometry was used for peptide separation and identification.
- Instrument: A Dionex UltiMate 3000 RSLCnano system (Thermo Scientific, USA) for HPLC, coupled to an LTQ-Orbitrap Elite mass spectrometer (Thermo Scientific, USA) for MS.
- Sample Preparation: Lyophilized peptides were dissolved in Formic Acid (FA). of each sample was analyzed.
- Chromatography (HPLC):
- Trap Column: Peptides were enriched on a C18 trap column ( i.d.).
- Analytical Column: Separated on a reverse-phase (RP) C18 analytical column ( i.d.) packed with C18 AQ beads (, pore size, Sunchrom).
- Flow Rate: .
- Gradient Elution: A binary gradient system was used:
- 2-8% buffer B ( ACN / FA) for 2 minutes.
- 8-45% buffer B for 100 minutes.
- 45-95% buffer B for 3 minutes.
- Buffer A is typically an aqueous solution with a small percentage of FA. The increasing concentration of buffer B (ACN) gradually elutes more hydrophobic peptides.
- Replicates: Each sample was analyzed in triplicate to ensure reproducibility.
- Mass Spectrometry (MS/MS):
- Mode: Operated in positive ion data-dependent acquisition (DDA) mode. In DDA, a full MS scan is performed first, and then the most abundant precursor ions are selected for fragmentation (MS/MS).
- Settings:
- Ion transfer capillary temperature: .
- Spray voltage: .
- Resolution (full MS): 120,000 (high resolution for accurate mass measurement of intact peptides).
- Scan range:
m/z350-2000.
- Fragmentation (CID): Collision-induced dissociation (CID) was used to fragment the 20 most abundant precursor ions.
- Minimum intensity: 500.
- Isolation width: 2.
- Normalized collision energy: 35.
- Dynamic exclusion: Enabled (repeat count 1, repeat duration 30, exclusion duration 40). This prevents re-analysis of already fragmented ions, allowing for a broader sampling of peptides.
- 2.3.4. Data Analysis:
- Software: Peptides were identified and quantified using MaxQuantTM (v1.5.3.30), an open-source software.
- Database Search: Raw data files were searched against a bovine protein database downloaded from UniProt (https://www.UniProt.org/).
- Parameters:
- Precursor-ion mass tolerance: (parts per million).
- Fragment-ion mass tolerance: .
- Enzyme: No enzyme specified (meaning the peptides were generated by unspecific proteolysis, typical for spoiled milk).
- Missed cleavage: Not allowed.
- Fixed modification: None.
- Variable modification: Methionine oxidation (M, Da).
- PSM False Discovery Rates (FDRs): Set to 0.01 (1%) for peptide-spectrum matches, ensuring high confidence in identifications.
- Other settings: Conventional search parameters were used.
- Validation: Valid peptides were required to be identified in at least three out of five parallel runs for each milk sample to ensure robustness.
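As a small illustration of that validity criterion, the following pandas sketch keeps only peptides detected in at least three runs; the column names and 0/1 identification flags are hypothetical, not MaxQuant's actual output format.

```python
# Sketch: retain peptides identified in >= 3 of the parallel runs of a sample.
import pandas as pd

runs = pd.DataFrame({
    "sequence": ["FALPQYLK", "YLEQLLR", "EMPFPKYP"],
    "run1": [1, 1, 0], "run2": [1, 0, 1], "run3": [1, 1, 0],
    "run4": [0, 1, 0], "run5": [1, 0, 1],
})
run_cols = ["run1", "run2", "run3", "run4", "run5"]
valid = runs[runs[run_cols].sum(axis=1) >= 3]
print(valid["sequence"].tolist())   # peptides passing the >= 3-run criterion
```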
4.2.6. Determination of Calcium Mobilization
This biological assay was used to experimentally verify the bitterness of predicted peptides.
- Cell Line: Human embryonic kidney 293 (HEK293T) cells were used.
- Seeding: Cells were seeded at cells per well on poly-L-lysine (PLL)-coated 96-well plates. PLL coating enhances cell adhesion.
- Incubation: Cells were incubated in DMEM medium with Fetal Bovine Serum (FBS) at and in a humidified atmosphere for 24 hours.
- Transient Transfection: This is a critical step to make cells express the bitter taste receptor.
- Experimental Groups: Cells were transiently transfected with both FLAG-TAS2R4 (the gene for the human T2R4 receptor) and Gα16/44-FLAG.
  - FLAG-TAS2R4: The human T2R4 gene was tagged with an N-terminal FLAG epitope and codon-optimized for mammalian cell expression, then inserted into the pcDNA3.4 expression vector.
  - Gα16/44-FLAG: A chimeric G-alpha16 protein with 44 gustducin-specific sequences at the C-terminal. Gustducin is a G protein specifically involved in taste signaling; this chimeric protein enhances the coupling of the T2R4 receptor to the intracellular calcium signaling pathway in HEK293T cells, which normally lack taste-specific G proteins.
- Control Groups: HEK293T cells transfected with Gα16/44-FLAG only. This control ensures that any observed calcium mobilization is due to T2R4 activation by BPs rather than non-specific G protein activation.
- Transfection Method: Lipofectamine 2000, a common reagent for lipid-mediated DNA delivery into cells, was used.
- Dye Loading: Cells were incubated with Fluo-4 acetoxymethyl ester (Fluo-4 AM) dye containing probenecid and Pluronic F-127 (w/v) for 30 minutes, followed by 30 minutes at room temperature.
  - Fluo-4 AM: A cell-permeable fluorescent dye that enters cells and is cleaved by intracellular esterases, trapping the fluorescent molecule (Fluo-4) inside. Fluo-4 binds calcium ions and fluoresces brightly when bound, allowing real-time monitoring of intracellular Ca²⁺ levels.
  - Probenecid: An inhibitor of organic anion transporters, used to prevent leakage of the fluorescent dye out of the cells.
  - Pluronic F-127: A non-ionic surfactant used to help solubilize Fluo-4 AM in aqueous solutions and facilitate its entry into cells.
- Peptide Treatment: Cells were treated with the potential BPs at three different concentrations.
- Measurement: Calcium levels were measured at (emission) when excited at (excitation) using a microplate reader (Biotek Synergy H1).
- Data Analysis:
  - Δ fluorescence intensities were calculated by subtracting the blank responses (from the control groups) from the maximum responses of the experimental groups, isolating the specific T2R4-mediated calcium changes.
  - Data were collected from at least three independent experiments.
The following figure (Figure 1 from the original paper) shows the overall workflow:
Figure 1: Technology roadmap for the identification and prediction of milk-derived bitter taste peptides.
4.2.7. Statistical Analysis
- Software: Data was analyzed and plots visualized using:
- EVenn (https://www.ehbio.com/test/venn/) for Venn diagrams.
- SIMCA 14.1 software (UMETRICS, Malmo, Sweden) for Principal Component Analysis (PCA).
- Graph-Pad Prism 9.0 (GraphPad Software, San Diego, CA, USA) for other statistical analyses and graphing.
5. Experimental Setup
5.1. Datasets
- Benchmark Dataset (BTP720):
  - Source: Constructed by expanding the existing BTP640 dataset with BPs collected from Biopep, Flavor Database (Shanghai Jiaotong University), BitterDB, and various scientific literature.
  - Scale: Contains a total of 720 peptide sequences, evenly balanced with 360 Bitter Taste Peptides (BPs) and 360 Non-Bitter Taste Peptides (NBPs).
  - Characteristics: Peptides were carefully selected to ensure that BPs uniquely exhibited a bitter taste, excluding those with multiple taste sensations. This curation enhances the specificity of the dataset for bitterness prediction.
  - Splitting: Randomly divided into a training dataset (288 BPs, 288 NBPs) and an independent testing dataset (72 BPs, 72 NBPs) with an 8:2 ratio.
  - Domain: Peptide sequences, primarily from food or biological sources, with known taste properties.
- Real-World Milk Samples:
  - Source: Fresh UHT milk (FM) and spoiled UHT milk (SM) samples were purchased from a local market.
  - Purpose: These samples were used for peptidomics analysis to identify actual peptides present in milk that might cause bitterness, and then to apply the CPM-BP model for prediction.
  - Characteristics: UHT milk is a common dairy product, and spoilage often involves protein hydrolysis, leading to the formation of peptides. Comparing FM and SM helps identify peptides associated with spoilage and potential bitterness.
5.2. Evaluation Metrics
The evaluation metrics used are described in detail in Section 4.2.4. Performance Evaluation of the Methodology. They include:
- Accuracy (ACC): Proportion of correct predictions. $ \mathrm{ACC} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}} $
- Precision (PRE): Proportion of true positives among all positive predictions. $ \mathrm{PRE} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} $
- Sensitivity (SN): Proportion of true positives among all actual positives (also known as Recall). $ \mathrm{SN} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} $
- F1 Score (F1): Harmonic mean of Precision and Sensitivity, balancing both. $ \mathrm{F1} = \frac{2 \times \mathrm{TP}}{2 \times \mathrm{TP} + \mathrm{FP} + \mathrm{FN}} $
- Matthews Correlation Coefficient (MCC): A balanced measure for binary classification, robust to imbalanced datasets. $ \mathrm{MCC} = \frac{\mathrm{TP} \cdot \mathrm{TN} - \mathrm{FP} \cdot \mathrm{FN}}{\sqrt{(\mathrm{TP} + \mathrm{FP})(\mathrm{TP} + \mathrm{FN})(\mathrm{TN} + \mathrm{FP})(\mathrm{TN} + \mathrm{FN})}} $
- Area Under the ROC Curve (AUC): Summarizes model performance across all classification thresholds, indicating overall discriminative ability. Values range from 0.5 (random) to 1 (perfect).

Here TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively (see Section 4.2.4).
These metrics are standard for binary classification tasks and collectively provide a comprehensive view of a model's performance, covering aspects like overall correctness, false positive control, true positive recall, and balance.
5.3. Baselines
The paper compared its proposed CPM-BP model against two well-known existing BPs prediction models:
- iBitter-SCM (Charoenkwan et al., 2020): This model relies on a scoring card method using propensity scores of dipeptides for BP identification.
- iBitter-Fuse (Charoenkwan et al., 2021): This model aims to improve prediction by fusing multi-view features, suggesting it integrates different types of information or representations of peptide sequences.

These baselines are representative because they are recent and relevant computational methods for BP prediction, allowing the authors to demonstrate CPM-BP's advancements against the current state of the art in the field.
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Determination of Optimal LightGBM Model
The optimization process involved systematically eliminating characteristic factors based on separability verification and evaluating performance using 10-fold cross-validation and an independent test dataset. The initial fourteen factors were progressively reduced.
- Elimination of factors: RP, Q1, Q3, N-basic AA, and P-X-C were considered for elimination. RP, Q1, and Q3 were artificially eliminated first because they showed lower statistical significance (e.g., only 25 BPs and 5 NBPs possessed adjacent RP). LFIYWV-C was automatically eliminated by the LightGBM algorithm itself, indicating its low contribution.
- Optimal Combination: The optimal CPM-BP model was constructed using ten characteristic factors: Q, Q2, Q4, AH, N, C, Percentage-HAA, N-basic AA, Percentage-FWY, and P-X-C.
- Contribution Degrees: Percentage-HAA (64), Q (56), AH (52), and C (41) were found to have the highest contribution degrees (Table 1), consistent with previous literature highlighting the importance of overall hydrophobicity and the presence of hydrophobic amino acids in BPs.

The following are the results from Table 2 of the original paper:
10-fold cross-validation:

| Step | Eliminated factors | ACC | PRE | SN | F1 | MCC | AUC |
|---|---|---|---|---|---|---|---|
| Initial | none | 83.7 % | 84.3 % | 83.6 % | 83.6 % | 67.4 % | 86.3 % |
| Step 1 | RP + Q1 | 84.4 % | 85.1 % | 84.3 % | 84.3 % | 68.8 % | 86.6 % |
| Step 2 | RP + Q1 + Q3 | 84.9 % | 85.4 % | 84.9 % | 84.8 % | 69.8 % | 86.8 % |
| Step 3 | RP + Q1 + Q3 + N-basic AA | 84.0 % | 84.5 % | 84.0 % | 84.0 % | 68.0 % | 86.4 % |
| Step 4 | RP + Q1 + Q3 + N-basic AA + P-X-C | 83.5 % | 84.0 % | 83.5 % | 83.4 % | 67.0 % | 86.3 % |

Independent test:

| Step | Eliminated factors | ACC | PRE | SN | F1 | MCC | AUC |
|---|---|---|---|---|---|---|---|
| Initial | none | 87.5 % | 92.2 % | 81.9 % | 86.8 % | 75.5 % | 88.2 % |
| Step 1 | RP + Q1 | 88.9 % | 95.2 % | 81.9 % | 88.0 % | 78.5 % | 89.3 % |
| Step 2 | RP + Q1 + Q3 | 90.3 % | 98.3 % | 81.9 % | 89.3 % | 81.6 % | 90.5 % |
| Step 3 | RP + Q1 + Q3 + N-basic AA | 88.2 % | 95.1 % | 80.6 % | 87.2 % | 77.3 % | 88.7 % |
| Step 4 | RP + Q1 + Q3 + N-basic AA + P-X-C | 88.9 % | 95.2 % | 81.9 % | 88.0 % | 78.5 % | 89.3 % |
Notes from paper: 1 ACC, accuracy. 2 PRE, precision. 3 SN, sensitivity. 4 F1, F1 score. 5 MCC, Matthews correlation coefficient. 6 AUC, area under the ROC curve. The highest value of each indicator was presented in bold in the original table.
The table shows that the model achieved its best performance (bolded values) after eliminating RP, Q1, and Q3 for both 10-fold cross-validation and the independent test. For the independent test, CPM-BP achieved 90.3% ACC, 98.3% PRE, 81.9% SN, 89.3% F1, 81.6% MCC, and 90.5% AUC. This indicates that a carefully selected subset of features significantly improves model performance.
6.1.2. Comparison of CPM-BP with Other Prediction Models
CPM-BP was compared against iBitter-SCM and iBitter-Fuse using the expanded independent testing dataset from BTP720.
- CPM-BP significantly outperformed existing models in four of five key performance metrics:
  - ACC: CPM-BP (90.3%) vs. iBitter-SCM (16.7% lower); iBitter-Fuse performed similarly to iBitter-SCM.
  - PRE: CPM-BP (98.3%) vs. iBitter-SCM (29.0% lower). This high precision implies that when CPM-BP predicts a peptide is bitter, it is highly likely to be truly bitter.
  - F1: CPM-BP (89.3%) vs. iBitter-SCM (13.1% lower).
  - MCC: CPM-BP (81.6%) vs. iBitter-SCM (33.2% lower).
- Sensitivity (SN): iBitter-SCM and iBitter-Fuse showed slightly better SN (2.8% and 7.0% higher, respectively) than CPM-BP. This suggests CPM-BP might miss some BPs (more false negatives) but is very accurate when it does predict a BP. Given the study's aim of accurate BP prediction, high precision is highly desirable.
- The overwhelming advantages of CPM-BP were attributed to the larger and more accurate BTP720 benchmark dataset and the optimized combination of characteristic factors. The paper specifically highlighted that FALPQYLK, a known BP, was correctly predicted by CPM-BP but not by iBitter-SCM or iBitter-Fuse, even though it was in both benchmark datasets, reinforcing CPM-BP's superior accuracy.
6.1.3. Peptidomics Analysis of FM and SM
- PCA: Principal Component Analysis showed a clear distinction between fresh UHT milk (FM) and spoiled UHT milk (SM) (Fig. 2A), indicating significant differences in their peptide profiles.
- Peptide Identification: 1280 unique peptides from 57 proteins were identified in FM, and 1072 unique peptides from 27 proteins were identified in SM.
- Abundance Differences: Violin plots (Fig. 2C) showed a significant difference () in the abundance distribution of identified peptides between FM and SM. Volcano plots (Fig. 2D) highlighted peptides with significant changes in intensity.
- Length Distribution: SM contained more short peptides (7-25 amino acids) than FM (Fig. S3), suggesting increased protein hydrolysis by proteases during spoilage.
- Different Peptides: 724 peptides were identified as "different" between FM and SM based on two criteria (see the sketch below):
  - Uniquely identified in SM (639 peptides).
  - Identified in both, but with 5-fold higher intensity in SM (85 peptides).
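A minimal pandas sketch of these two filters, assuming per-peptide mean intensities for FM and SM in a DataFrame (column names, values, and the inclusive 5-fold threshold are illustrative assumptions):

```python
# Sketch: flag peptides unique to SM or with >= 5-fold higher intensity in SM.
import pandas as pd

df = pd.DataFrame({
    "sequence":     ["PEPTIDE_A", "PEPTIDE_B", "PEPTIDE_C"],
    "intensity_FM": [0.0,          1.0e6,       2.0e6],
    "intensity_SM": [5.0e5,        6.0e6,       2.5e6],
})

unique_to_sm = (df["intensity_FM"] == 0) & (df["intensity_SM"] > 0)
five_fold_up = (df["intensity_FM"] > 0) & (df["intensity_SM"] / df["intensity_FM"] >= 5)
different = df[unique_to_sm | five_fold_up]
print(different["sequence"].tolist())   # -> ['PEPTIDE_A', 'PEPTIDE_B']
```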
The following figure (Figure 2 from the original paper) summarizes the peptidomics analysis:
Figure 2: Overview of FM and SM and their peptidomics analysis results. (A) PCA score plots between FM and SM, (B) Venn diagram showing the number of unique proteins between FM and SM, (C) Violin plots showing the distribution of LFQ intensity of peptides identified in FM and SM, (D) Volcano plot of the peptides identified in FM and SM.
6.1.4. Prediction of Potential BPs in SM
- CPM-BP Application: The CPM-BP model predicted 180 potential BPs among the 724 significantly different peptides found in SM.
- Origin of BPs: Most (164/180) of the predicted potential BPs were casein-derived, particularly from β-casein (7 BPs), followed by αS1-casein (2 BPs), αS2-casein (1 BP), and κ-casein (1 BP). This aligns with the known role of casein hydrolysis in milk bitterness.
- Source in SM: A majority (150/180) of the potential BPs were uniquely found in SM, while 30/180 showed a fold-change greater than 5. This strongly suggests that these peptides are generated during spoilage by non-inactivated proteases, contributing to the bitter taste.
- Known BPs: 11 of the 180 predicted potential BPs had already been reported as bitter in the literature (Table 3). CPM-BP correctly identified all 11, whereas iBitter-SCM and iBitter-Fuse identified only 6. Notably, FALPQYLK was correctly predicted as bitter by CPM-BP but not by the other two models, highlighting CPM-BP's improved accuracy.

The following are the results from Table 3 of the original paper:
| Category | Sequence | CPM-BP | iBitter-SCM | iBitter-Fuse | References |
|---|---|---|---|---|---|
| Known BPs | YLEQLLR | Bitter | Bitter | Bitter | Lemieux & Simard, 1992 |
| Known BPs | FALPQYLK | Bitter | Non-Bitter | Non-Bitter | Lemieux & Simard, 1992 |
| Known BPs | LHLPLPLL | Bitter | Non-Bitter | Non-Bitter | Sebald et al., 2020 |
| Known BPs | LPLPLLQSW | Bitter | Non-Bitter | Non-Bitter | Sebald et al., 2020 |
| Known BPs | PFPGPIPNS | Bitter | Bitter | Bitter | Belitz & Wieser, 1985 |
| Known BPs | VYPFPGPIPN | Bitter | Bitter | Bitter | Toelstede & Hofmann, 2008; Zhao et al., 2016 |
| Known BPs | YLGYLEQLLR | Bitter | Bitter | Bitter | Belitz & Wieser, 1985 |
| Known BPs | VENLHLPLPLL | Bitter | Non-Bitter | Non-Bitter | Sebald et al., 2020 |
| Known BPs | MPFPKYPVEPF | Bitter | Bitter | Bitter | Karametsi et al., 2014 |
| Known BPs | AIPPKKNQDKTEIPTIN | Bitter | Non-Bitter | Non-Bitter | Sebald et al., 2020 |
| Known BPs | APKHKEMPFPKYPVEPF | Bitter | Bitter | Bitter | Karametsi et al., 2014 |
| Potential BPs | FALPQYL | Bitter | Non-Bitter | Non-Bitter | This study |
| Potential BPs | FFVAPFPEVFGKE | Bitter | Bitter | Bitter | This study |
| Potential BPs | EMPFPKYP | Bitter | Bitter | Bitter | This study |
6.1.5. Effect of Potential BPs on Calcium Release in HEK293T Cells Expressing hT2R4
To experimentally validate the predictions, calcium mobilization assays were performed on HEK293T cells expressing hT2R4.
- Selected Peptides: One known BP (FALPQYLK) and three predicted potential BPs (FALPQYL, FFVAPFPEVFGKE, EMPFPKYP) were chosen. AGDDAPRAVF was used as a negative control.
- Negative Control: AGDDAPRAVF (from beef hydrolysates) did not activate hT2R4 (Fig. S5), confirming its non-bitter nature and the specificity of the assay.
- Activation by BPs:
  - FALPQYLK (the known BP) showed the most significant calcium signal change, confirming its bitterness.
  - FFVAPFPEVFGKE, FALPQYL, and EMPFPKYP also activated hT2R4, demonstrating their bitter taste activity.
  - Dose-dependent effects: FALPQYLK, FALPQYL, and FFVAPFPEVFGKE showed clear dose-dependent activation across the three tested concentrations. EMPFPKYP showed dose-dependent activation at the two lower concentrations but an unusual, sustained increase at the highest concentration, suggesting potential complex or non-specific effects at high concentrations.
- Structure-Activity Relationship: The comparison between FALPQYLK and FALPQYL revealed that the presence of Lysine (K) at the C-terminal significantly enhanced bitterness, as FALPQYLK exhibited nearly ten times higher fluorescence intensity than FFVAPFPEVFGKE and FALPQYL.
- Validation of CPM-BP: The successful experimental activation of hT2R4 by the predicted potential BPs (FALPQYL, FFVAPFPEVFGKE, EMPFPKYP) proved the effectiveness of the CPM-BP model and the overall workflow.

The following figure (Figure 3 from the original paper) illustrates the calcium mobilization results:

Figure 3: Calcium mobilization responses of HEK293T cells expressing hT2R4 to four potential BPs (FALPQYLK, FALPQYL, FFVAPFPEVFGKE, and EMPFPKYP). The results are shown as Δ fluorescence intensities, calculated by subtracting the blank responses from the maximum responses of the experimental groups.
6.2. Data Presentation (Tables)
All relevant tables from the paper have been transcribed and presented in the respective sections above (Table 1 in Methodology 4.2.2, Table 2 in Results 6.1.1, and Table 3 in Results 6.1.4).
6.3. Ablation Studies / Parameter Analysis
The paper implicitly performs an ablation study by systematically eliminating characteristic factors and evaluating the model's performance at each step (Table 2).
- Factor Elimination: The process began with 14 factors. Removing RP, Q1, and Q3 (the first ablation steps) improved performance on the independent test from an initial ACC of 87.5% to 90.3%.
- Impact of N-basic AA and P-X-C: Further removal of N-basic AA and then P-X-C (after RP, Q1, and Q3) decreased performance (ACC dropped to 88.2% and 88.9%, respectively), indicating that these factors, despite their lower individual contribution degrees, were beneficial to the overall model. This confirms that the optimal combination of 10 factors (after removing RP, Q1, Q3, and LFIYWV-C, the last eliminated automatically by LightGBM) provides the best predictive power.
- Conclusion: This systematic evaluation of characteristic-factor subsets acts as an ablation study, demonstrating that not all features are equally useful and that an optimized subset yields superior model performance, supporting the authors' selection of the final 10 characteristic factors. The individual contribution degrees of the remaining factors (Table 1) further indicate their relative importance to the model. (A code sketch of this stepwise evaluation follows.)
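A sketch of this stepwise evaluation loop, under stated assumptions (random stand-in data; the real workflow would use the quantified BTP720 training features and also evaluate the independent test set):

```python
# Sketch: re-evaluate 10-fold CV accuracy after each cumulative factor elimination.
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score

factor_names = ["Q", "Q1", "Q2", "Q3", "Q4", "AH", "N", "C",
                "Percentage-HAA", "N-basic AA", "LFIYWV-C",
                "Percentage-FWY", "P-X-C", "RP"]
rng = np.random.default_rng(0)
X = rng.random((576, len(factor_names)))   # stand-in feature matrix
y = rng.integers(0, 2, size=576)           # stand-in labels

steps = {                                  # cumulative eliminations, as in Table 2
    "Initial": [],
    "Step 1": ["RP", "Q1"],
    "Step 2": ["RP", "Q1", "Q3"],
    "Step 3": ["RP", "Q1", "Q3", "N-basic AA"],
    "Step 4": ["RP", "Q1", "Q3", "N-basic AA", "P-X-C"],
}
for name, dropped in steps.items():
    keep = [i for i, f in enumerate(factor_names) if f not in dropped]
    acc = cross_val_score(LGBMClassifier(random_state=0), X[:, keep], y, cv=10).mean()
    print(name, round(acc, 3))
```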
7. Conclusion & Reflections
7.1. Conclusion Summary
This study successfully established a novel, high-throughput workflow for the identification and prediction of bitter taste peptides (BPs). The workflow effectively integrates peptidomics technology, for identifying peptides in real-world samples like UHT milk, with an advanced machine learning approach. A key innovation was the construction of an expanded and carefully curated benchmark dataset, BTP720, which informed the development of a novel classification prediction model (CPM-BP). This model, built using the LightGBM algorithm and an optimized set of characteristic factors, demonstrated superior predictive accuracy (90.3%) and precision compared to existing models. The practical utility of CPM-BP was showcased by its ability to predict 180 potential BPs in spoiled UHT milk. Critically, the bitterness of three novel predicted BPs was experimentally verified through calcium mobilization assays in HEK293T cells expressing human bitter taste receptor T2R4, thereby validating the model's effectiveness.
7.2. Limitations & Future Work
The authors explicitly acknowledge certain limitations and suggest future directions:
- Benchmark Dataset Size: The current benchmark dataset (BTP720, with 720 items) is considered "small" in the context of machine learning, especially for deep learning approaches. This small size limits the completeness and potentially the distribution of characteristics in the training and testing sets.
- Generalizability of hT2R4: While hT2R4 is a broad-spectrum bitter receptor, it is only one of many (25 in humans). Not all bitter compounds activate hT2R4, and a peptide might be bitter through other receptors. The study's verification is specific to hT2R4.
- Flavor Complexity: The paper focuses solely on bitter taste. However, in real food systems, flavor is a complex interplay of multiple tastes and aromas, where BPs might interact with other compounds to modulate overall perception.
- Mechanism of Bitterness: While predicting bitterness, the model does not inherently explain the precise molecular mechanisms or binding sites leading to hT2R4 activation.

Based on these limitations, the authors suggest that this approach would be further developed and enhanced with:

- Ongoing Growth of Benchmark Datasets: Continuously expanding the dataset of known BPs and NBPs will improve the model's robustness and generalizability.
- Usage of Deep Learning Algorithms: With larger datasets, deep learning models could potentially learn more abstract and powerful features from peptide sequences, leading to even higher accuracy and predictive capabilities.
7.3. Personal Insights & Critique
This paper presents a robust and well-validated approach to a significant problem in food science and potentially pharmacology. The integration of peptidomics, advanced machine learning (LightGBM), and direct biological verification provides a strong framework.
Inspirations and Applications:
- Food Industry: The immediate application is quality control in dairy products, helping to identify and mitigate the causes of bitterness in UHT milk or other fermented foods. It could also aid in developing new food products with desired bitter profiles.
- Drug Discovery: Peptides are increasingly explored as therapeutic agents. This workflow could be adapted to predict whether potential drug candidates might elicit an undesirable bitter off-taste, accelerating drug development.
- Personalized Nutrition: Understanding individual differences in bitter taste perception and metabolism could lead to personalized dietary recommendations, and identifying BPs is a foundational step.
- Flavor Chemistry: The study's insights into characteristic factors (e.g., hydrophobicity, C-terminal Lysine) contribute to a deeper understanding of the structural determinants of bitterness in peptides, which can inform the rational design of flavor modulators.
Potential Issues, Unverified Assumptions, or Areas for Improvement:
- "Small" Dataset Challenge: While BTP720 is an improvement, 720 samples is still relatively small for complex ML tasks, especially if aiming for deep learning. The robustness of the feature importances (Table 1) and model generalizability could be further tested with a dataset an order of magnitude larger. The authors acknowledge this as a limitation.
- Representativeness of Characteristic Factors: While the chosen factors are well justified, they are still hand-engineered. Deep learning models, given enough data, might discover more subtle and powerful features that are not immediately intuitive to human experts.
- Generalizability Beyond hT2R4: The experimental validation focuses solely on hT2R4. While it is a broad-spectrum receptor, some peptides might be bitter via other T2Rs or even non-receptor mechanisms. A more comprehensive validation would involve a panel of bitter taste receptors or in vivo taste perception studies, although this is significantly more complex.
- Specificity of "Bitter": The definition of "bitter" was strict ("unique bitter taste"). While this improves dataset quality, some peptides might contribute to a mixed bitter-salty or bitter-umami taste in a way that is relevant to food perception but excluded here. This is a trade-off for model clarity.
- Interpretation of "Contribution Degrees": While helpful, the "contribution degrees" from LightGBM (Table 1) are relative to the model. They indicate how much a feature impacts the decision-making of this specific ensemble of trees, not necessarily an absolute biological importance in all contexts.
- Dynamic Nature of Peptidomes: The peptidome of milk can be highly dynamic, influenced by raw milk quality, processing, storage conditions, and microbial activity. The identified 724 "different" peptides in SM represent a snapshot. The model's predictions are based on these peptides, but the complexity of the full dynamic peptidome could be an area for more extensive longitudinal studies.

Overall, this paper provides a valuable contribution to the field, showcasing how an intelligent combination of experimental and computational techniques can effectively address a complex biological and industrial problem. The rigorous validation adds significant credibility to their machine learning approach.