
Identify Bitter Peptides by Using Deep Representation Learning Features

Published: 07/17/2022
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This study introduces iBitter-DRLF, a method that uses deep representation learning features to improve the identification of bitter peptides from sequence data alone, supporting efforts to improve the palatability of peptide-based products.

Abstract

A bitter taste often identifies hazardous compounds and it is generally avoided by most animals and humans. Bitterness of hydrolyzed proteins is caused by the presence of bitter peptides. To improve palatability, bitter peptides need to be identified experimentally in a time-consuming and expensive process, before they can be removed or degraded. Here, we report the development of a machine learning prediction method, iBitter-DRLF, which is based on a deep learning pre-trained neural network feature extraction method. It uses three sequence embedding techniques, soft symmetric alignment (SSA), unified representation (UniRep), and bidirectional long short-term memory (BiLSTM). These were initially combined into various machine learning algorithms to build several models. After optimization, the combined features of UniRep and BiLSTM were finally selected, and the model was built in combination with a light gradient boosting machine (LGBM). The results showed that the use of deep representation learning greatly improves the ability of the model to identify bitter peptides, achieving accurate prediction based on peptide sequence data alone. By helping to identify bitter peptides, iBitter-DRLF can help research into improving the palatability of peptide therapeutics and dietary supplements in the future. A webserver is available, too.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Identify Bitter Peptides by Using Deep Representation Learning Features

1.2. Authors

Jici Jiang, Xinxu Lin, Yueqi Jiang, Liangzhen Jiang, and Zhibin Lv.

  • Jici Jiang, Xinxu Lin, Yueqi Jiang, and Zhibin Lv are affiliated with Sichuan University, China, across different colleges (Biomedical Engineering, Software Engineering, West China School of Medicine).
  • Liangzhen Jiang is affiliated with Chengdu University, China, in the Key Laboratory of Coarse Cereal Processing, Ministry of Agriculture and Rural Affairs, College of Food and Biological Engineering.
  • Zhibin Lv is the corresponding author, indicating a leading role in the research.

1.3. Journal/Conference

The paper was published in Int. J. Mol. Sci. (International Journal of Molecular Sciences), a peer-reviewed, open-access journal covering a broad range of topics in the molecular sciences and generally considered reputable in its field.

1.4. Publication Year

2022

1.5. Abstract

This paper introduces a novel machine learning method, iBitter-DRLF, for identifying bitter peptides. The motivation stems from the fact that bitterness in hydrolyzed proteins, often caused by bitter peptides, necessitates time-consuming and expensive experimental identification to improve palatability. iBitter-DRLF leverages deep learning pre-trained neural network feature extraction, specifically integrating three sequence embedding techniques: soft symmetric alignment (SSA), unified representation (UniRep), and bidirectional long short-term memory (BiLSTM). These features were initially combined with various machine learning algorithms. Through optimization, the combined features of UniRep and BiLSTM were selected and integrated with a light gradient boosting machine (LGBM) classifier. The results demonstrate that deep representation learning significantly enhances the model's ability to accurately predict bitter peptides based solely on peptide sequence data. The authors suggest iBitter-DRLF can support future research in improving the palatability of peptide therapeutics and dietary supplements. A webserver for the method is also available.

Official Source Link: /files/papers/6917522c110b75dcc59ae06e/paper.pdf
Publication Status: Officially published in Int. J. Mol. Sci.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper addresses is the time-consuming and expensive experimental identification of bitter peptides. Peptides, especially hydrolyzed proteins, are increasingly used in therapeutics and dietary supplements due to their beneficial biological activities and good nutritional properties. However, these hydrolyzed proteins often contain peptides that impart a bitter taste, which is generally avoided by humans and animals as an instinctual warning sign for toxic compounds. This bitterness significantly impacts palatability and patient adherence to peptide-based products.

Prior research has made progress in developing computational models for bitter peptide prediction using quantitative structure-activity relationship (QSAR) modeling and traditional machine learning (ML) with sequence features. However, there remains significant room for improvement in accuracy and efficiency, particularly in identifying novel bitter peptides solely from sequence data. The main challenge lies in effectively representing peptide sequences in a way that captures their underlying characteristics relevant to bitterness.

The paper's entry point is the recognition that deep representation learning, inspired by its success in natural language processing (NLP), can provide more meaningful and accurate feature descriptors for peptide sequences compared to traditional feature engineering methods. By transforming raw protein sequences into representations that ML models can effectively utilize, deep learning can potentially overcome the limitations of previous approaches.

2.2. Main Contributions / Findings

The paper's primary contributions are:

  1. Development of iBitter-DRLF: A novel machine learning prediction method specifically designed for the accurate identification of bitter peptides.

  2. Leveraging Deep Representation Learning: The method uniquely integrates three advanced sequence embedding techniques—soft symmetric alignment (SSA), unified representation (UniRep), and bidirectional long short-term memory (BiLSTM)—to extract rich features from peptide sequences.

  3. Comprehensive Feature Fusion and Selection: The paper systematically evaluates various combinations of these deep representation learning features (feature fusion) and employs LGBM for feature selection to optimize the feature space, reducing redundancy and improving model performance.

  4. Optimal Model Configuration: Through extensive experimentation and optimization, the combined UniRep and BiLSTM features (specifically a 106-dimensional subset after feature selection) coupled with an LGBM classifier emerged as the most effective configuration for iBitter-DRLF.

  5. Superior Predictive Performance: iBitter-DRLF demonstrated significantly higher accuracy compared to existing state-of-the-art methods in independent tests, achieving ACC of 0.944, MCC of 0.889, Sp of 0.977, and auROC of 0.977. This indicates its reliability and stability.

  6. User-Friendly Webserver: The authors provide an accessible webserver for iBitter-DRLF, enabling wider use of their algorithm by other researchers.

    The key finding is that the application of deep representation learning features, particularly the fusion of UniRep and BiLSTM followed by rigorous feature selection, substantially improves the ability to predict bitter peptides accurately based solely on their sequence data. This advancement has practical implications for enhancing the palatability of peptide-based products and facilitating drug development.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully grasp the methodology and contributions of this paper, a foundational understanding of several key concepts across biology, machine learning, and deep learning is essential:

  • Peptides and Amino Acids: Peptides are short chains of amino acids, typically linked by peptide bonds. Amino acids are the fundamental building blocks of proteins and peptides, each with unique chemical properties that influence the overall structure and function of the peptide. The sequence of amino acids in a peptide (peptide sequence) is crucial for its biological activity, including taste.
  • Protein Hydrolysates: These are mixtures of peptides produced by breaking down proteins (e.g., through enzymatic digestion). While often having beneficial nutritional properties, protein hydrolysates can contain bitter peptides, which negatively affect their taste.
  • Machine Learning (ML): A field of artificial intelligence that enables systems to learn from data without explicit programming. In classification tasks, ML models learn to categorize data points into predefined classes (e.g., bitter vs. non-bitter).
  • Deep Learning (DL): A subfield of machine learning that uses neural networks with multiple layers (hence "deep") to learn complex patterns from data. Unlike traditional ML, deep learning models can often learn meaningful feature representations directly from raw input data, reducing the need for manual feature engineering.
  • Sequence Embedding/Representation Learning: The process of converting discrete sequential data (like peptide sequences made of amino acids) into continuous numerical vectors (embeddings). These vectors capture semantic and structural information, allowing machine learning models to process the sequences effectively. Representation learning refers to the broader concept of automatically learning useful features from raw data.
  • Soft Symmetric Alignment (SSA): A sequence embedding technique that aims to capture similarity between sequences of varying lengths by aligning them in a "soft" or probabilistic manner. It involves embedding sequences into vectors and then computing a similarity score based on these embeddings, often using a BiLSTM encoder.
  • Unified Representation (UniRep): A deep representation learning model specifically trained on a vast dataset of protein sequences (UniRef50). It uses a multiplicative Long Short-Term Memory (mLSTM) network to generate fixed-length vector representations for protein/peptide sequences. The model learns to predict the next amino acid in a sequence, thereby learning rich internal representations that capture biological information.
  • Bidirectional Long Short-Term Memory (BiLSTM): A type of recurrent neural network (RNN) that processes sequence data in both forward and backward directions.
    • Long Short-Term Memory (LSTM): An advanced type of RNN designed to overcome the vanishing gradient problem in standard RNNs, allowing it to learn long-term dependencies. An LSTM unit has several components:
      • Cell State ($C_t$): The memory of the LSTM unit, carrying information through the sequence.
      • Hidden State ($h_t$): The output of the LSTM unit, carrying information to the next time step.
      • Forget Gate ($f_t$): Controls what information from the previous cell state should be discarded.
      • Input Gate ($i_t$): Controls what new information should be stored in the cell state.
      • Output Gate ($o_t$): Controls what part of the cell state should be outputted as the hidden state.
    • BiLSTM combines a forward LSTM (processing sequence from start to end) and a backward LSTM (processing from end to start). This allows it to capture context from both past and future elements in a sequence, providing a richer representation.
  • Feature Engineering vs. Feature Learning:
    • Feature Engineering: The manual process of creating relevant input features for a machine learning model from raw data, often requiring expert domain knowledge.
    • Feature Learning (or Representation Learning): The process where a model automatically discovers and learns useful features from raw data, a key advantage of deep learning.
  • Feature Fusion: The process of combining multiple sets of features (e.g., from different embedding models) into a single, comprehensive feature vector. This often aims to leverage diverse information sources for improved model performance.
  • Feature Selection: The process of selecting a subset of relevant features for use in model construction. This helps reduce dimensionality, combat overfitting, and improve model interpretability and training efficiency.
  • Support Vector Machine (SVM): A powerful supervised machine learning algorithm used for classification and regression. It works by finding the optimal hyperplane that best separates data points of different classes in a high-dimensional feature space.
  • Random Forest (RF): An ensemble learning method for classification and regression that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. It reduces overfitting and improves accuracy.
  • Light Gradient Boosting Machine (LGBM): A gradient boosting framework that uses tree-based learning algorithms. It's known for its efficiency and speed, particularly with large datasets. LGBM uses a leaf-wise tree growth strategy, which can be faster than level-wise strategies but is more prone to overfitting for small datasets.
  • Evaluation Metrics: Standard measures used to quantify the performance of classification models, including Accuracy (ACC), Matthews Correlation Coefficient (MCC), Sensitivity (Sn), Specificity (Sp), F1-score (F1), Area Under the Precision-Recall Curve (auPRC), and Area Under the Receiver Operating Characteristic Curve (auROC). These will be explained in detail in Section 5.
  • K-fold Cross-Validation: A robust technique for evaluating machine learning models by dividing the dataset into K subsets (folds). The model is trained on K-1 folds and validated on the remaining fold, and this process is repeated K times, with each fold used exactly once as the validation data. The results are then averaged. The paper uses 10-fold cross-validation.
  • Independent Test Set: A dataset completely separate from the training and validation sets, used to provide an unbiased evaluation of the final model's performance on unseen data.
  • Hyperparameter Optimization: The process of finding the best set of hyperparameters for a machine learning model (e.g., $C$ and gamma for SVM, n_estimators and max_depth for tree-based models). Techniques like GridSearchCV systematically search through a predefined grid of hyperparameter values.
  • Uniform Manifold Approximation and Projection (UMAP): A dimensionality reduction technique used for visualizing high-dimensional data, similar to t-SNE. It aims to preserve the global and local structure of the data in a lower-dimensional space, making it useful for feature visualization.
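
As a small illustration of the UMAP visualization step mentioned above, the following minimal sketch (assuming the umap-learn package; the feature matrix and labels are random stand-ins for real peptide embeddings, not data from the paper) projects high-dimensional features to two dimensions:

```python
# Minimal sketch: project high-dimensional peptide embeddings to 2-D with UMAP.
# Assumes the umap-learn package; X and y are random stand-ins for real
# peptide feature vectors and bitter/non-bitter labels.
import numpy as np
import umap

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 1900))   # e.g., 512 peptides x 1900-D UniRep-like features
y = rng.integers(0, 2, size=512)   # 1 = bitter, 0 = non-bitter (dummy labels)

reducer = umap.UMAP(n_components=2, random_state=0)
X_2d = reducer.fit_transform(X)    # (512, 2) coordinates, ready for a scatter plot
print(X_2d.shape)
```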

3.2. Previous Works

The paper contextualizes its work by referencing a lineage of previous computational methods for bitter peptide identification, which primarily relied on traditional quantitative structure-activity relationship (QSAR) modeling and machine learning approaches using handcrafted sequence features.

  • QSAR and Traditional ML methods: Several early methods, such as those by Gramatica et al. [9], Chen et al. [10], and Zh o et al. [13], focused on predicting bitterness using physicochemical properties and structural descriptors of peptides. These methods often involve meticulous feature engineering to extract meaningful characteristics from the peptide sequences.

  • BitterX [15], BitterPredict [16], iBitter-SCM [17], and iBitter-Fuse [18]: These represent a progression in machine learning-based bitter peptide predictors. They generally utilize traditional sequence features (e.g., amino acid composition, dipeptide composition, hydrophobicity scales) to build predictive models.

    • iBitter-SCM and iBitter-Fuse are specifically mentioned as having increasing performance through using traditional sequence features and multi-view feature fusion, respectively. iBitter-Fuse, for instance, combined features from different perspectives to enhance prediction.
  • BERT4Bitter [19]: This model marked a shift by employing natural language processing (NLP) heuristic signature coding methods to represent peptide sequences as feature descriptors. This move towards more sophisticated, learned representations, rather than purely handcrafted features, showed better accuracy. BERT (Bidirectional Encoder Representations from Transformers) models, common in NLP, learn contextual representations of words (or in this case, amino acids) by considering their surrounding text.

  • MIMML (Mutual Information-based Meta Learning) [20]: Proposed in 2022, MIMML aimed to discover the best feature combination for bitter peptides, achieving an independent accuracy of 93.8%. This highlights the importance of effective feature selection and combination for optimal performance.

    These prior works collectively demonstrate an evolution from simple physicochemical features to more complex, NLP-inspired heuristic coding and meta-learning for feature optimization. The current paper builds upon this by exploring the even more advanced deep representation learning features (SSA, UniRep, BiLSTM), which learn hierarchical representations directly from vast amounts of sequence data, moving beyond heuristics or purely amino acid-level properties.

3.3. Technological Evolution

The technological evolution in bitter peptide identification mirrors broader trends in bioinformatics and machine learning.

  1. Early QSAR and Physicochemical Descriptors: Initial approaches focused on quantitative structure-activity relationships (QSAR), where physicochemical properties (e.g., hydrophobicity, charge, molecular weight) and simple amino acid composition were manually extracted as features. These methods required significant domain expertise for feature engineering.

  2. Traditional Machine Learning with Handcrafted Features: As ML algorithms became more accessible, models like SVM, Random Forest, and Gradient Boosting were applied, still largely relying on expertly handcrafted features (e.g., dipeptide composition, pseudo amino acid composition, various hydrophobicity scales). The challenge here was that the quality of prediction heavily depended on the quality and comprehensiveness of the engineered features.

  3. NLP-inspired Heuristic Features: A more recent development, exemplified by BERT4Bitter, involved adapting natural language processing (NLP) techniques. This treated peptide sequences as "sentences" of amino acid "words," using heuristic coding methods to derive features that capture contextual relationships between amino acids. This demonstrated an improvement over purely physicochemical features.

  4. Deep Representation Learning: The current paper's work represents the next major step in this evolution. It moves beyond heuristics to directly leverage deep representation learning from pre-trained neural networks. Models like UniRep, SSA, and BiLSTM are trained on massive datasets of protein sequences to learn general-purpose, high-dimensional embeddings. These embeddings are not handcrafted features but rather learned representations that capture complex, hierarchical patterns within the sequences, often without explicit prior knowledge of what constitutes "bitter" features. This paradigm minimizes the need for manual feature engineering, allowing the model to discover optimal representations autonomously.

    The paper's work fits within this technological timeline by pushing the frontier of feature extraction for peptide sequences from handcrafted or heuristic methods to learned, deep representations, aiming for superior predictive power and generalizability.

3.4. Differentiation Analysis

Compared to the main methods in related work, iBitter-DRLF presents several core differences and innovations:

  • Shift from Feature Engineering to Deep Representation Learning:

    • Previous methods (BitterX, BitterPredict, iBitter-SCM, iBitter-Fuse) largely relied on traditional sequence features and feature engineering, which often required domain expertise and could be limited by the expressiveness of the handcrafted features.
    • iBitter-DRLF moves beyond this by utilizing deep representation learning features (SSA, UniRep, BiLSTM). These models are pre-trained on vast protein/peptide databases, allowing them to learn generic, high-level, and context-rich representations of peptide sequences automatically. This approach reduces manual effort and can uncover subtle patterns that handcrafted features might miss.
  • Integration of Multiple Deep Embedding Techniques:

    • While BERT4Bitter used NLP heuristic signature coding (a form of learned representation), iBitter-DRLF integrates three distinct deep learning embedding techniques (SSA, UniRep, BiLSTM). This multi-perspective approach to feature extraction potentially provides a more comprehensive and robust representation of peptide characteristics. UniRep specifically uses an mLSTM trained on UniRef50, offering a very general protein representation. BiLSTM directly captures sequence dependencies, and SSA focuses on sequence similarity.
  • Systematic Feature Fusion and Selection for Optimal Performance:

    • The paper doesn't just use one type of deep feature but systematically explores fusion features (combinations of SSA, UniRep, BiLSTM).
    • Crucially, it employs LGBM for feature selection on these high-dimensional fused features. This step is vital for removing redundancy, preventing overfitting, and identifying the most discriminative subset of features. This rigorous optimization process, from raw embeddings to a refined feature set, distinguishes it from methods that might use fixed feature sets.
  • Superior Performance Metrics:

    • The results show iBitter-DRLF (ACC 0.944, MCC 0.889, Sp 0.977, auROC 0.977) outperforming even recent advanced methods like MIMML (ACC 0.938, MCC 0.875) and BERT4Bitter (ACC 0.922, MCC 0.844) in independent tests. This demonstrates a clear quantitative advantage in accuracy, particularly in specificity and auROC, indicating better discrimination between bitter and non-bitter peptides.

      In essence, iBitter-DRLF innovates by embracing the power of deep representation learning to extract richer, more abstract features, combining diverse embedding techniques, and then meticulously refining these features through fusion and selection to build a highly accurate and stable predictive model.

4. Methodology

4.1. Principles

The core principle of iBitter-DRLF is to leverage the power of deep representation learning to automatically extract meaningful and high-dimensional features from raw peptide sequences. Instead of relying on traditional, manually engineered features, the method uses pre-trained neural networks (like those underlying SSA, UniRep, and BiLSTM) that have learned rich contextual and structural information about peptide sequences from vast biological datasets. These deep features are then fused to create comprehensive representations. To handle the high dimensionality and potential redundancy of these fused features, a feature selection step using LGBM is applied to identify the most discriminative subset. Finally, these optimized features are fed into a machine learning classifier (LGBM) to predict whether a peptide is bitter or not. The overall aim is to achieve superior prediction accuracy by learning better representations of peptides.

4.2. Core Methodology In-depth (Layer by Layer)

The development of iBitter-DRLF involves several key stages: dataset preparation, feature extraction using deep learning models, feature fusion, feature selection, model training and optimization with machine learning algorithms, and final evaluation. A high-level overview of the model development process is depicted in Figure 1.

Figure 1. Overview of model development. The pre-trained SSA sequence embedding model, UniRep sequence embedding model, and BiLSTM sequence embedding model were used to embed peptide sequences into eigenvectors. The figure is a schematic of the iBitter-DRLF development workflow, covering dataset collection, feature extraction (SSA, UniRep, BiLSTM), feature fusion and machine learning algorithms, and model performance evaluation (e.g., ACC and MCC).

4.2.1. Benchmark Dataset

The iBitter-DRLF model was developed and evaluated using the BTP640 benchmark dataset, which was updated from the iBitter-SCM study [17]. This dataset is designed for binary classification of peptides into bitter or non-bitter categories.

  • Composition: The BTP640 dataset contains 320 bitter peptides and 320 non-bitter peptides, totaling 640 unique peptide sequences. The non-bitter peptides were constructed using the BIOPEP database [29], while bitter peptides were experimentally confirmed.
  • Splitting: To prevent overfitting and ensure robust evaluation, the dataset was randomly split into two subsets:
    • Training and Cross-Validation Set (BTP-CV): This subset is used for model training and k-fold cross-validation. It contains 256 bitter peptides and 256 non-bitter peptides (a 4:1 ratio of the total dataset).
    • Independent Test Set (BTP-TS): This subset is used for independent testing of the final model's performance on unseen data. It contains 64 bitter peptides and 64 non-bitter peptides.
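
As a minimal sketch of this 4:1 split, the code below uses scikit-learn's train_test_split; the peptide strings and labels are placeholders rather than the actual BTP640 data:

```python
# Minimal sketch of the 4:1 split of BTP640 into BTP-CV (512) and BTP-TS (128).
# The `sequences` and `labels` lists are dummy placeholders for the real dataset.
from sklearn.model_selection import train_test_split

sequences = ["GPFPIIV", "AGDDAPR"] * 320          # 640 dummy peptide strings
labels = [1, 0] * 320                             # 1 = bitter, 0 = non-bitter

seq_cv, seq_ts, y_cv, y_ts = train_test_split(
    sequences, labels, test_size=0.2, stratify=labels, random_state=42
)
print(len(seq_cv), len(seq_ts))                   # 512 training/CV, 128 independent test
```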

4.2.2. Feature Extraction

The paper explores three different deep representation learning methods to extract features from peptide sequences. These methods convert variable-length peptide sequences into fixed-dimensional numerical vectors (eigenvectors) that can be used by machine learning models.

4.2.2.1. Pre-Trained SSA Embedding Model

The soft symmetric alignment (SSA) model [30] is designed to measure similarity between sequences of arbitrary lengths by embedding them into vectors.

  • Input: A peptide sequence is fed into a pre-trained model.
  • Encoding: The peptide sequence is encoded through a three-tier stacked BiLSTM encoder.
  • Output Embedding: The final embedding of a peptide sequence is a matrix $\mathbf{R}^{L \times 121}$, where $L$ is the length of the peptide and each amino acid residue is represented by a 121-dimensional (121D) vector.
  • Similarity Calculation: To calculate the similarity between two peptide sequences, $F_1$ and $F_2$, represented by the embedded matrices $\mathbf{R}^{L_1 \times 121}$ and $\mathbf{R}^{L_2 \times 121}$, respectively:
    • $F_1 = [x_1, x_2, \cdots, x_{L_1}]$, where each $x_i$ is a 121-dimensional vector.
    • $F_2 = [y_1, y_2, \cdots, y_{L_2}]$, where each $y_j$ is also a 121-dimensional vector. The similarity $\hat{s}$ is calculated using the following formula:
      $$\hat{s} = -\frac{1}{A} \sum_{i=1}^{L_1} \sum_{j=1}^{L_2} a_{ij} \, \| x_i - y_j \|_1$$
      where:
    • $\hat{s}$ is the soft symmetric alignment similarity score.
    • $A$ is a normalization factor.
    • $L_1$ and $L_2$ are the lengths of the two peptide sequences.
    • $x_i$ and $y_j$ are the 121-dimensional embedded vectors for the $i$-th and $j$-th amino acids in sequences 1 and 2, respectively.
    • $\| x_i - y_j \|_1$ denotes the L1 norm (Manhattan distance) between vectors $x_i$ and $y_j$.
    • $a_{ij}$ is a weighting coefficient determined by the following formulas:
      $$\varphi_{ij} = \frac{\exp(-\| x_i - y_j \|_1)}{\sum_{k=1}^{L_1} \exp(-\| x_k - y_j \|_1)}$$
      $$\omega_{ij} = \frac{\exp(-\| x_i - y_j \|_1)}{\sum_{k=1}^{L_2} \exp(-\| x_i - y_k \|_1)}$$
      $$a_{ij} = \varphi_{ij} + \omega_{ij} - \varphi_{ij}\,\omega_{ij}$$
      $$A = \sum_{i=1}^{L_1} \sum_{j=1}^{L_2} a_{ij}$$
      Here, $\varphi_{ij}$ and $\omega_{ij}$ are softmax-like attention weights indicating the similarity of amino acid $x_i$ to $y_j$ relative to all amino acids in the respective sequences, and $a_{ij}$ combines them into a symmetric alignment weight; $A$ is the normalization factor, the sum of all $a_{ij}$. These parameters are fitted jointly with the parameters of the sequence encoder because the SSA is fully differentiable. The trained model converts a peptide sequence into an embedding matrix $\mathbf{R}^{L \times 121}$; for classification, an aggregation (e.g., mean or max pooling) over the length dimension $L$ is typically performed to obtain a fixed-length vector.
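
To make the soft symmetric alignment concrete, here is a minimal NumPy sketch of the similarity computation defined above. The embedding matrices F1 and F2 are random stand-ins for the pre-trained 121D per-residue SSA embeddings, so the code illustrates the formulas rather than the trained model:

```python
# Minimal NumPy sketch of soft symmetric alignment between two embedded peptides.
# F1 (L1 x 121) and F2 (L2 x 121) stand in for pre-trained SSA residue embeddings.
import numpy as np

rng = np.random.default_rng(0)
F1 = rng.normal(size=(7, 121))     # peptide 1, length 7
F2 = rng.normal(size=(5, 121))     # peptide 2, length 5

# Pairwise L1 distances d[i, j] = ||x_i - y_j||_1
d = np.abs(F1[:, None, :] - F2[None, :, :]).sum(axis=2)

# Softmax-style attention weights, normalized over each sequence
phi = np.exp(-d) / np.exp(-d).sum(axis=0, keepdims=True)    # normalized over i (sequence 1)
omega = np.exp(-d) / np.exp(-d).sum(axis=1, keepdims=True)  # normalized over j (sequence 2)

a = phi + omega - phi * omega      # symmetric alignment weights a_ij
A = a.sum()                        # normalization factor
s_hat = -(a * d).sum() / A         # SSA similarity score (closer to 0 = more similar)
print(s_hat)
```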

4.2.2.2. Pre-Trained UniRep Embedding Model

The UniRep model [25] is a deep representation learning model trained on 24 million UniRef50 primary amino acid sequences (a large database of protein sequences).

  • Training Objective: The model performs next amino acid prediction by minimizing cross-entropy losses. Through this task, it learns to represent proteins internally.
  • Architecture: It uses an mLSTM (multiplicative Long Short-Term Memory) network.
  • Feature Generation:
    1. A peptide sequence with $L$ amino acid residues is initially embedded into a matrix using one-hot-style encoding, resulting in an $\mathbf{R}^{L \times 10}$ matrix (10 features per amino acid; 20 is more common for the standard amino acids).

    2. This matrix is then fed into the mLSTM encoder.

    3. The mLSTM outputs a hidden state, and after this operation, a fixed-length UniRep feature vector of 1900D (1900 dimensions) is derived, representing the entire peptide sequence.

      The calculation of the mLSTM encoder involves the following equations:
      $$m_t = (X_t W_{xm}) \otimes (h_{t-1} W_{hm})$$
      $$\hat{h}_t = \tanh(X_t W_{xh} + m_t W_{mh})$$
      $$f_t = \sigma(X_t W_{xf} + m_t W_{mf})$$
      $$i_t = \sigma(X_t W_{xi} + m_t W_{mi})$$
      $$o_t = \sigma(X_t W_{xo} + m_t W_{mo})$$
      $$C_t = f_t \otimes C_{t-1} + i_t \otimes \hat{h}_t$$
      $$h_t = o_t \otimes \tanh(C_t)$$
      Where:

  • $\otimes$ represents element-by-element multiplication.
  • $X_t$ is the current input (e.g., the encoded representation of the current amino acid).
  • $h_{t-1}$ represents the previous hidden state.
  • $m_t$ is the current intermediate multiplicative state, a key distinguishing feature of the mLSTM.
  • $\hat{h}_t$ represents the input before the hidden-state activation.
  • $f_t$ is the forget gate (controls which information to discard).
  • $i_t$ is the input gate (controls which new information to store).
  • $o_t$ is the output gate (controls which information to output from the cell state).
  • $C_{t-1}$ is the previous cell state.
  • $C_t$ is the current cell state.
  • $h_t$ is the output hidden state.
  • $\sigma$ is the sigmoid function, which squashes values between 0 and 1.
  • $\tanh$ is the hyperbolic tangent function, which squashes values between -1 and 1.
  • $W_{xm}$, $W_{hm}$, $W_{xh}$, $W_{mh}$, $W_{xf}$, $W_{mf}$, $W_{xi}$, $W_{mi}$, $W_{xo}$, $W_{mo}$ are weight matrices learned during training.
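
As an illustration of the mLSTM equations above, the following minimal NumPy sketch performs a single mLSTM step. All weight matrices, the hidden size, and the input embedding are random placeholders, not the trained UniRep parameters (which use a 1900-unit hidden state):

```python
# Minimal NumPy sketch of one mLSTM step from the UniRep equations above.
# Weights are random placeholders, not the trained UniRep parameters.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d_in, d_hid = 10, 64                         # 10-D residue embedding, small hidden size
W_xm, W_hm = rng.normal(size=(d_in, d_hid)), rng.normal(size=(d_hid, d_hid))
W_xh, W_mh = rng.normal(size=(d_in, d_hid)), rng.normal(size=(d_hid, d_hid))
W_xf, W_mf = rng.normal(size=(d_in, d_hid)), rng.normal(size=(d_hid, d_hid))
W_xi, W_mi = rng.normal(size=(d_in, d_hid)), rng.normal(size=(d_hid, d_hid))
W_xo, W_mo = rng.normal(size=(d_in, d_hid)), rng.normal(size=(d_hid, d_hid))

x_t = rng.normal(size=(d_in,))               # current residue embedding
h_prev = np.zeros(d_hid)                     # previous hidden state
C_prev = np.zeros(d_hid)                     # previous cell state

m_t = (x_t @ W_xm) * (h_prev @ W_hm)         # multiplicative intermediate state
h_hat = np.tanh(x_t @ W_xh + m_t @ W_mh)
f_t = sigmoid(x_t @ W_xf + m_t @ W_mf)       # forget gate
i_t = sigmoid(x_t @ W_xi + m_t @ W_mi)       # input gate
o_t = sigmoid(x_t @ W_xo + m_t @ W_mo)       # output gate
C_t = f_t * C_prev + i_t * h_hat             # updated cell state
h_t = o_t * np.tanh(C_t)                     # hidden state passed to the next residue
print(h_t.shape)
```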

4.2.2.3. Pre-Trained BiLSTM Embedding Model

A bidirectional Long Short-Term Memory (BiLSTM) model [31] is used to extract sequence features.

  • Architecture: BiLSTM consists of a forward LSTM and a backward LSTM. The forward LSTM processes the sequence from the beginning to the end, while the backward LSTM processes it from the end to the beginning. The outputs from both directions are concatenated or combined to form a single representation. This allows the model to capture bidirectional context for each position in the sequence.

  • LSTM Mechanism: An LSTM unit utilizes gates (forget, input, output) to regulate the flow of information into and out of its cell state and hidden state, enabling it to learn long-range dependencies.

    The calculation of BiLSTM involves the following formulas for a single LSTM unit (these are applied in both the forward and backward directions and then combined):
    $$f_t = \sigma(W_f \cdot [h_{t-1}, X_t] + b_f)$$
    $$i_t = \sigma(W_i \cdot [h_{t-1}, X_t] + b_i)$$
    $$\widetilde{C}_t = \tanh(W_C \cdot [h_{t-1}, X_t] + b_C)$$
    $$o_t = \sigma(W_o \cdot [h_{t-1}, X_t] + b_o)$$
    $$C_t = f_t * C_{t-1} + i_t * \widetilde{C}_t$$
    $$h_t = o_t * \tanh(C_t)$$
    Where:

  • $X_t$ is the current input at time step $t$.

  • $h_{t-1}$ represents the hidden state from the previous time step.

  • $W_f$, $W_i$, $W_C$, $W_o$ are weight matrices for the forget gate, input gate, candidate cell state, and output gate, respectively.

  • $b_f$, $b_i$, $b_C$, $b_o$ are bias vectors for the respective gates.

  • $[h_{t-1}, X_t]$ denotes the concatenation of the previous hidden state and the current input.

  • $f_t$ is the forget gate vector, determining what information from $C_{t-1}$ to forget.

  • $i_t$ is the input gate vector, determining what new information to store in $C_t$.

  • $\widetilde{C}_t$ is the candidate cell state, representing new potential information.

  • $o_t$ is the output gate vector, determining what part of $C_t$ to output as $h_t$.

  • $C_{t-1}$ is the previous cell state.

  • $C_t$ is the current cell state, updated by forgetting part of $C_{t-1}$ and adding part of $\widetilde{C}_t$.

  • $h_t$ is the output hidden state at time step $t$.

  • $\sigma$ is the sigmoid function.

  • $\tanh$ is the hyperbolic tangent function. For BiLSTM, the final representation of a peptide sequence is typically derived by concatenating the last hidden states of the forward and backward LSTMs, or by applying a pooling operation over the entire sequence of hidden states. The model used here produces a 3605D (3605-dimensional) eigenvector for each peptide.
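
A minimal PyTorch sketch of this idea is shown below; the layer sizes and the mean-pooling step are illustrative assumptions, not the exact architecture behind the 3605D embedding used in the paper:

```python
# Minimal PyTorch sketch: embed a one-hot encoded peptide with a bidirectional LSTM
# and pool over the length dimension. Sizes are illustrative only; the real 3605-D
# feature comes from the pre-trained embedder cited in the paper.
import torch
import torch.nn as nn

L, n_aa, d_hid = 8, 20, 128                 # peptide length, 20 amino acids, hidden size
x = torch.zeros(1, L, n_aa)                 # (batch, length, features) one-hot input
x[0, torch.arange(L), torch.randint(0, n_aa, (L,))] = 1.0

bilstm = nn.LSTM(input_size=n_aa, hidden_size=d_hid,
                 batch_first=True, bidirectional=True)
out, _ = bilstm(x)                          # (1, L, 2 * d_hid): forward + backward states
peptide_vec = out.mean(dim=1)               # mean-pool over residues -> fixed-length vector
print(peptide_vec.shape)                    # torch.Size([1, 256])
```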

4.2.3. Feature Fusion

To combine the complementary information from different deep representation learning features, the paper implements feature fusion. This process concatenates the individual feature vectors to create higher-dimensional, composite feature vectors.

  • SSA + UniRep: The 121D SSA eigenvector is combined with the 1900D UniRep eigenvector to form a 2021D SSA+UniRep fusion feature vector.
  • SSA + BiLSTM: The 121D SSA eigenvector is combined with the 3605D BiLSTM eigenvector to form a 3726D SSA+BiLSTM fusion feature vector.
  • UniRep + BiLSTM: The 1900D UniRep eigenvector is combined with the 3605D BiLSTM eigenvector to form a 5505D UniRep+BiLSTM fusion feature vector.
  • SSA + UniRep + BiLSTM: All three eigenvectors (121D SSA, 1900D UniRep, 3605D BiLSTM) are combined to obtain a 5626D SSA+UniRep+BiLSTM fusion feature vector.
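
Feature fusion here is plain vector concatenation. The sketch below (NumPy; the feature matrices are random stand-ins for the per-peptide SSA, UniRep, and BiLSTM vectors) reproduces the fused dimensionalities listed above:

```python
# Minimal sketch of feature fusion by concatenation. The arrays are random stand-ins
# for per-peptide SSA (121-D), UniRep (1900-D), and BiLSTM (3605-D) feature vectors.
import numpy as np

rng = np.random.default_rng(0)
n = 512                                        # number of peptides in BTP-CV
ssa = rng.normal(size=(n, 121))
unirep = rng.normal(size=(n, 1900))
bilstm = rng.normal(size=(n, 3605))

unirep_bilstm = np.hstack([unirep, bilstm])    # (512, 5505) UniRep + BiLSTM fusion
all_three = np.hstack([ssa, unirep, bilstm])   # (512, 5626) SSA + UniRep + BiLSTM fusion
print(unirep_bilstm.shape, all_three.shape)
```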

4.2.4. Feature Selection Method

High-dimensional feature vectors, especially from feature fusion, can introduce redundancy and increase the risk of overfitting. To address this, LGBM is utilized for feature selection.

  • Process:
    1. Data (feature vectors) and labels (bitter/non-bitter) are input into an LGBM model.
    2. The LGBM model is trained, and its built-in functions are used to obtain an importance value for each feature. LGBM inherently provides feature importance based on how much each feature contributes to the splitting decisions in the ensemble of decision trees.
    3. Features are ranked from highest to lowest based on their importance values.
    4. A subset of features is selected by choosing those with an importance value greater than the critical value (defined as the average feature importance value).
  • Optimization Strategy: An incremental feature strategy is used, in which features are added incrementally and model performance is monitored. A hyperparameter grid search (the scikit-learn GridSearchCV module) is used in conjunction to optimize the hyperparameters of each model after feature selection.
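
A minimal sketch of this importance-based selection is given below, assuming the lightgbm scikit-learn API and a random stand-in for the fused feature matrix; the exact LGBM settings used by the authors are not specified here and are assumptions:

```python
# Minimal sketch of LGBM-based feature selection: rank features by importance and keep
# those above the mean importance. X and y are random stand-ins for the 5505-D
# UniRep+BiLSTM fusion features and the bitter/non-bitter labels.
import numpy as np
from lightgbm import LGBMClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 5505))
y = rng.integers(0, 2, size=512)

selector = LGBMClassifier(n_estimators=200, random_state=0)
selector.fit(X, y)

importances = selector.feature_importances_
keep = importances > importances.mean()        # threshold at the average importance
X_selected = X[:, keep]
print(X_selected.shape)                        # reduced feature subset
```

In the paper, an incremental strategy over the ranked features, combined with a grid search of the classifier hyperparameters, determines the final dimensionality (e.g., 106D for the UniRep + BiLSTM fusion).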

4.2.5. Machine Learning Methods

Three widely used machine learning models are employed as classifiers to build predictive models, testing various feature sets and fusion combinations:

  • Support Vector Machine (SVM) [32,33]:
    • Purpose: A binary classifier that finds an optimal hyperplane to separate data points into classes.
    • Hyperparameters: The paper optimized gamma (kernel coefficient for the rbf kernel) and $C$ (regularization parameter) within the range $10^{-4}$ to $10^{4}$. The default kernel was rbf (Radial Basis Function).
  • Random Forest (RF) [34,35]:
    • Purpose: An ensemble learning algorithm based on bagging that constructs multiple decision trees and outputs the mode of their predictions. It uses both random sampling of data and random selection of features during tree construction.
    • Hyperparameters: The paper optimized n_estimators (number of trees in the forest) in the range of (25, 550) and Nleaf (minimum number of samples required to be at a leaf node) in the range of (2, 12).
  • Light Gradient Boosting Machine (LGBM) [23,36]:
    • Purpose: A gradient boosting framework that uses tree-based learning algorithms. It's known for its speed and efficiency by using a leaf-wise tree growth strategy (splitting the leaf that yields the largest reduction in loss) rather than a level-wise strategy.
    • Hyperparameters: The paper optimized n_estimators (number of boosting rounds or trees) in the range of (25, 750) and max_depth (maximum depth of the individual decision trees) in the range of (1, 12).
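
A hedged sketch of the hyperparameter search described above is shown below, using scikit-learn GridSearchCV with 10-fold cross-validation. The grid values follow the ranges quoted in the text, but the exact grids, step sizes, and scoring choice are assumptions, and the feature matrix is a random placeholder:

```python
# Minimal sketch of hyperparameter tuning for the three classifiers with GridSearchCV.
# Grid values follow the ranges quoted in the text; exact grids and steps are assumptions.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 106))                  # stand-in for the selected 106-D features
y = rng.integers(0, 2, size=512)

searches = {
    "SVM": GridSearchCV(SVC(kernel="rbf"),
                        {"C": np.logspace(-4, 4, 9), "gamma": np.logspace(-4, 4, 9)},
                        cv=10, scoring="accuracy"),
    "RF": GridSearchCV(RandomForestClassifier(random_state=0),
                       {"n_estimators": [25, 100, 300, 550], "min_samples_leaf": [2, 6, 12]},
                       cv=10, scoring="accuracy"),
    "LGBM": GridSearchCV(LGBMClassifier(random_state=0),
                         {"n_estimators": [25, 250, 500, 750], "max_depth": [1, 4, 8, 12]},
                         cv=10, scoring="accuracy"),
}
for name, gs in searches.items():
    gs.fit(X, y)
    print(name, gs.best_params_, round(gs.best_score_, 3))
```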

5. Experimental Setup

5.1. Datasets

The study utilized the BTP640 benchmark dataset, which is an updated version from iBitter-SCM [17].

  • Source: The dataset is available online at http://public.aibiochem.net/peptides/BitterP/ or https://github.com/Shoombuatong2527/Benchmark-datasets.
  • Composition: It comprises 640 peptide sequences in total, evenly split into:
    • 320 bitter peptides (experimentally confirmed).
    • 320 non-bitter peptides (constructed using the BIOPEP database [29]).
  • Splitting: To ensure robust model evaluation and prevent overfitting, the BTP640 dataset was randomly divided into two subsets:
    • BTP-CV (Training and Cross-Validation Set): Contains 256 bitter peptides and 256 non-bitter peptides (total 512 peptides). This set was used for 10-fold cross-validation.
    • BTP-TS (Independent Test Set): Contains 64 bitter peptides and 64 non-bitter peptides (total 128 peptides). This set was used for independent testing on unseen data.
  • Characteristics: The dataset focuses on relatively short peptides, as bitter peptides are typically composed of no more than eight amino acids, though some can be longer. The domain is peptide therapeutics and dietary supplements, where bitterness is a critical palatability issue.

5.2. Evaluation Metrics

The performance of the models was evaluated using five widely accepted classification metrics, along with auPRC and auROC. These metrics are calculated based on the counts of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).

  • TP (True Positive): Number of bitter peptides correctly predicted as bitter.

  • TN (True Negative): Number of non-bitter peptides correctly predicted as non-bitter.

  • FP (False Positive): Number of non-bitter peptides incorrectly predicted as bitter.

  • FN (False Negative): Number of bitter peptides incorrectly predicted as non-bitter.

    Here are the detailed explanations and formulas for each metric:

  1. Accuracy (ACC)

    • Conceptual Definition: Accuracy measures the overall correctness of the model's predictions. It represents the proportion of correctly classified instances (both true positives and true negatives) out of the total number of instances. It is a good general indicator of performance, but can be misleading in imbalanced datasets.
    • Mathematical Formula: ACC=TP+TN(TP+TN+FP+FN) \mathrm { A C C } = { \frac { \mathrm { T P } + \mathrm { T N } } { ( \mathrm { T P } + \mathrm { T N } + \mathrm { F P } + \mathrm { F N } ) } }
    • Symbol Explanation:
      • TP\mathrm{TP}: True Positives.
      • TN\mathrm{TN}: True Negatives.
      • FP\mathrm{FP}: False Positives.
      • FN\mathrm{FN}: False Negatives.
  2. Matthews Correlation Coefficient (MCC)

    • Conceptual Definition: MCC is a more robust metric than accuracy, especially for imbalanced datasets. It considers all four quadrants of the confusion matrix (TP, TN, FP, FN) and produces a value between -1 and +1. A value of +1 indicates a perfect prediction, 0 indicates a random prediction, and -1 indicates a complete disagreement between prediction and observation. It is essentially a correlation coefficient between the observed and predicted binary classifications.
    • Mathematical Formula: MCC=TP×TNFP×FN(TP+FP)(TP+FN)(TN+FP)(TN+FN) \mathrm { M C C } = { \frac { \mathrm { T P } \times \mathrm { T N } - \mathrm { F P } \times \mathrm { F N } } { \sqrt { ( \mathrm { T P } + \mathrm { F P } ) ( \mathrm { T P } + \mathrm { F N } ) ( \mathrm { T N } + \mathrm { F P } ) ( \mathrm { T N } + \mathrm { F N } ) } } }
    • Symbol Explanation:
      • TP\mathrm{TP}: True Positives.
      • TN\mathrm{TN}: True Negatives.
      • FP\mathrm{FP}: False Positives.
      • FN\mathrm{FN}: False Negatives.
  3. Sensitivity (Sn), also known as Recall or True Positive Rate

    • Conceptual Definition: Sensitivity measures the proportion of actual positive instances (bitter peptides) that were correctly identified by the model. In other words, it quantifies the model's ability to avoid false negatives. A high sensitivity is crucial when the cost of missing a positive is high.
    • Mathematical Formula: Sn=TP(TP+FN) \mathrm { S n } = { \frac { \mathrm { T P } } { \left( \mathrm { T P } + \mathrm { F N } \right) } }
    • Symbol Explanation:
      • TP\mathrm{TP}: True Positives.
      • FN\mathrm{FN}: False Negatives.
  4. Specificity (Sp), also known as True Negative Rate

    • Conceptual Definition: Specificity measures the proportion of actual negative instances (non-bitter peptides) that were correctly identified by the model. It quantifies the model's ability to avoid false positives. A high specificity is important when the cost of incorrectly identifying a negative as positive is high.
    • Mathematical Formula: Sp=TN(TN+FP) \mathrm { S p } = { \frac { \mathrm { T N } } { \left( \mathrm { T N } + \mathrm { F P } \right) } }
    • Symbol Explanation:
      • TN\mathrm{TN}: True Negatives.
      • FP\mathrm{FP}: False Positives.
  5. F1-score (F1)

    • Conceptual Definition: The F1-score is the harmonic mean of precision and recall (sensitivity). It provides a single score that balances both precision (the proportion of positive identifications that were actually correct) and recall. It is particularly useful when class distribution is imbalanced.
    • Mathematical Formula: F1=2×TP(2×TP+FN+FP) \mathrm { F } 1 = { \frac { 2 \times \mathrm { T P } } { ( 2 \times \mathrm { T P } + \mathrm { F N } + \mathrm { F P } ) } }
    • Symbol Explanation:
      • TP\mathrm{TP}: True Positives.
      • FN\mathrm{FN}: False Negatives.
      • FP\mathrm{FP}: False Positives.
  6. Area Under the Precision-Recall Curve (auPRC)

    • Conceptual Definition: The Precision-Recall Curve plots precision against recall at various threshold settings. auPRC calculates the area under this curve. It is particularly informative for tasks with highly imbalanced datasets, where the positive class is rare, as it focuses on the performance of the model on the positive class. A higher auPRC indicates better performance.
    • Mathematical Formula: The paper refers to auPRC as the area enclosed by the precision-recall curve and the x-axis, without providing a specific formula. Conceptually, it is the integral of precision with respect to recall.
  7. Area Under the Receiver Operating Characteristic Curve (auROC)

    • Conceptual Definition: The Receiver Operating Characteristic (ROC) curve plots True Positive Rate (Sensitivity) against False Positive Rate (1 - Specificity) at various threshold settings. auROC calculates the area under this curve. It measures the ability of a classifier to distinguish between classes. A value of 0.5 indicates performance no better than random chance, while 1.0 represents a perfect classifier. auROC is useful for evaluating models across all possible classification thresholds.
    • Mathematical Formula: The paper refers to auROC as the area under the ROC curve, without providing a specific formula. Conceptually, it is the integral of Sensitivity with respect to 1-Specificity.
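
These metrics can be computed directly with scikit-learn. The sketch below uses dummy labels and scores (not iBitter-DRLF outputs) to show how each reported value maps to a library call or to the TP/TN/FP/FN counts:

```python
# Minimal sketch: compute the paper's evaluation metrics with scikit-learn on dummy
# predictions. y_true and y_score are placeholders, not outputs of iBitter-DRLF.
import numpy as np
from sklearn.metrics import (accuracy_score, matthews_corrcoef, recall_score,
                             f1_score, average_precision_score, roc_auc_score,
                             confusion_matrix)

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=128)                            # 1 = bitter, 0 = non-bitter
y_score = np.clip(y_true * 0.7 + rng.normal(0.3, 0.2, size=128), 0, 1)
y_pred = (y_score >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("ACC  ", accuracy_score(y_true, y_pred))
print("MCC  ", matthews_corrcoef(y_true, y_pred))
print("Sn   ", recall_score(y_true, y_pred))                     # sensitivity = TP / (TP + FN)
print("Sp   ", tn / (tn + fp))                                   # specificity = TN / (TN + FP)
print("F1   ", f1_score(y_true, y_pred))
print("auPRC", average_precision_score(y_true, y_score))
print("auROC", roc_auc_score(y_true, y_score))
```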

5.3. Baselines

The iBitter-DRLF model's performance was compared against several existing state-of-the-art methods for bitter peptide prediction:

  • iBitter-Fuse [18]: An earlier method that used multi-view features and fusion techniques for bitter peptide prediction.

  • MIMML [20]: A recent method (2022) that employed mutual information-based meta-learning to find optimal feature combinations, known for achieving high accuracy.

  • iBitter-SCM [17]: A method that likely used traditional sequence features and machine learning techniques. The benchmark dataset used in the current study is an updated version from iBitter-SCM.

  • BERT4Bitter [19]: A method that utilized natural language processing (NLP) heuristic signature coding methods (BERT-based) to represent peptide sequences.

    These baselines represent a range of approaches, from traditional feature-based ML (iBitter-SCM, iBitter-Fuse) to more advanced NLP-inspired (BERT4Bitter) and meta-learning (MIMML) techniques, providing a comprehensive comparison for iBitter-DRLF.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate a clear progression in model performance, highlighting the effectiveness of deep representation learning, feature fusion, feature selection, and hyperparameter optimization.

6.1.1. Results of Preliminary Optimization

The initial step involved evaluating the performance of individual deep representation learning features (SSA, UniRep, BiLSTM) when combined with three different machine learning algorithms (SVM, LGBM, RF). The results, presented in Table 1, show the 10-fold cross-validation and independent test metrics after initial parameter optimization.

The following are the results from Table 1 of the original paper:

(CV = 10-fold cross-validation; Test = independent test.)

| Feature | Model | Dim | CV ACC | CV MCC | CV Sn | CV Sp | CV F1 | CV auPRC | CV auROC | Test ACC | Test MCC | Test Sn | Test Sp | Test F1 | Test auPRC | Test auROC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SSA | SVM | 121 | 0.826 | 0.652 | 0.836 | 0.816 | 0.828 | 0.890 | 0.898 | 0.883 | 0.766 | 0.891 | 0.875 | 0.884 | 0.951 | 0.944 |
| SSA | LGBM | 121 | 0.787 | 0.575 | 0.816 | 0.758 | 0.793 | 0.874 | 0.886 | 0.859 | 0.722 | 0.906 | 0.812 | 0.866 | 0.949 | 0.941 |
| SSA | RF | 121 | 0.791 | 0.584 | 0.828 | 0.754 | 0.798 | 0.848 | 0.865 | 0.820 | 0.644 | 0.875 | 0.766 | 0.830 | 0.934 | 0.922 |
| UniRep | SVM | 1900 | 0.865 | 0.730 | 0.867 | 0.863 | 0.865 | 0.937 | 0.931 | 0.867 | 0.735 | 0.844 | 0.891 | 0.864 | 0.952 | 0.948 |
| UniRep | LGBM | 1900 | 0.840 | 0.680 | 0.828 | 0.852 | 0.838 | 0.939 | 0.930 | 0.867 | 0.735 | 0.844 | 0.891 | 0.864 | 0.953 | 0.952 |
| UniRep | RF | 1900 | 0.842 | 0.684 | 0.836 | 0.848 | 0.841 | 0.927 | 0.920 | 0.844 | 0.688 | 0.828 | 0.859 | 0.841 | 0.946 | 0.943 |
| BiLSTM | SVM | 3605 | 0.818 | 0.637 | 0.820 | 0.816 | 0.819 | 0.910 | 0.912 | 0.883 | 0.766 | 0.906 | 0.859 | 0.885 | 0.956 | 0.951 |
| BiLSTM | LGBM | 3605 | 0.855 | 0.711 | 0.863 | 0.848 | 0.857 | 0.924 | 0.926 | 0.836 | 0.673 | 0.812 | 0.859 | 0.832 | 0.950 | 0.950 |
| BiLSTM | RF | 3605 | 0.818 | 0.637 | 0.828 | 0.809 | 0.820 | 0.900 | 0.908 | 0.844 | 0.688 | 0.844 | 0.844 | 0.844 | 0.954 | 0.949 |

**Key Observations:**

  • **UniRep/SVM (10-fold CV)**: The UniRep feature vector combined with SVM achieved the best 10-fold cross-validation results, with ACC = 0.865, MCC = 0.730, Sn = 0.867, Sp = 0.863, F1 = 0.865, auPRC = 0.937, and auROC = 0.931. This indicates that the 1900-dimensional UniRep features are highly effective at capturing information relevant to bitter peptide identification.
  • **SSA/SVM (Independent Test)**: In independent tests, the SSA feature combined with SVM achieved the highest ACC of 0.883 and MCC of 0.766.
  • **Comparison of Features**: The paper concludes that UniRep features were generally superior to BiLSTM features, showing higher ACC and MCC in 10-fold cross-validation, while SSA showed strong performance in independent tests. This preliminary analysis suggests that while individual deep features are effective, there may be benefits in combining them.

6.1.2. The Effects of Feature Fusion on the Automatic Identification of Bitter Peptides

To further enhance predictive performance, the study explored feature fusion by combining pairs and all three of the deep features. These fusion features were then evaluated with SVM, LGBM, and RF algorithms. Figure 2 visually summarizes the accuracy of independent tests for individual and fused features.

Figure 2. Independent-test accuracy of individual and fused feature sets with different machine learning models. The feature combinations (SSA, UniRep, BiLSTM, and their fusions) are shown in different colors, with prediction accuracies ranging from roughly 0.78 to 0.90; some combinations (e.g., SSA+UniRep+BiLSTM) perform best, supporting the effectiveness of deep representation learning for bitter peptide identification.

The following are the results from Table 2 of the original paper:

(CV = 10-fold cross-validation; Test = independent test.)

| Feature | Model | Dim | CV ACC | CV MCC | CV Sn | CV Sp | CV F1 | CV auPRC | CV auROC | Test ACC | Test MCC | Test Sn | Test Sp | Test F1 | Test auPRC | Test auROC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SSA + UniRep | SVM | 2021 | 0.861 | 0.723 | 0.875 | 0.848 | 0.863 | 0.929 | 0.927 | 0.867 | 0.734 | 0.859 | 0.875 | 0.866 | 0.954 | 0.952 |
| SSA + UniRep | LGBM | 2021 | 0.840 | 0.680 | 0.848 | 0.832 | 0.841 | 0.933 | 0.924 | 0.859 | 0.719 | 0.859 | 0.859 | 0.859 | 0.960 | 0.958 |
| SSA + UniRep | RF | 2021 | 0.838 | 0.676 | 0.840 | 0.836 | 0.838 | 0.923 | 0.917 | 0.867 | 0.735 | 0.844 | 0.891 | 0.864 | 0.955 | 0.954 |
| SSA + BiLSTM | SVM | 3726 | 0.836 | 0.672 | 0.848 | 0.824 | 0.838 | 0.915 | 0.917 | 0.883 | 0.766 | 0.859 | 0.906 | 0.880 | 0.943 | 0.947 |
| SSA + BiLSTM | LGBM | 3726 | 0.848 | 0.696 | 0.859 | 0.836 | 0.849 | 0.927 | 0.927 | 0.875 | 0.751 | 0.906 | 0.844 | 0.879 | 0.961 | 0.957 |
| SSA + BiLSTM | RF | 3726 | 0.824 | 0.649 | 0.832 | 0.816 | 0.826 | 0.906 | 0.911 | 0.898 | 0.797 | 0.891 | 0.906 | 0.898 | 0.959 | 0.951 |
| UniRep + BiLSTM | SVM | 5505 | 0.844 | 0.688 | 0.859 | 0.828 | 0.846 | 0.921 | 0.926 | 0.891 | 0.783 | 0.922 | 0.859 | 0.894 | 0.966 | 0.962 |
| UniRep + BiLSTM | LGBM | 5505 | 0.863 | 0.727 | 0.871 | 0.855 | 0.864 | 0.932 | 0.935 | 0.870 | 0.737 | 0.859 | 0.886 | 0.887 | 0.972 | 0.958 |
| UniRep + BiLSTM | RF | 5505 | 0.832 | 0.664 | 0.844 | 0.820 | 0.834 | 0.932 | 0.930 | 0.875 | 0.750 | 0.859 | 0.891 | 0.873 | 0.963 | 0.960 |
| SSA + UniRep + BiLSTM | SVM | 5626 | 0.871 | 0.742 | 0.863 | 0.879 | 0.870 | 0.943 | 0.941 | 0.891 | 0.783 | 0.922 | 0.859 | 0.894 | 0.940 | 0.943 |
| SSA + UniRep + BiLSTM | LGBM | 5626 | 0.855 | 0.711 | 0.844 | 0.867 | 0.854 | 0.945 | 0.942 | 0.898 | 0.797 | 0.891 | 0.906 | 0.898 | 0.971 | 0.971 |
| SSA + UniRep + BiLSTM | RF | 5626 | 0.840 | 0.680 | 0.848 | 0.832 | 0.841 | 0.926 | 0.925 | 0.898 | 0.799 | 0.859 | 0.937 | 0.894 | 0.963 | 0.957 |

**Key Observations:**

  • **Fusion Superiority**: The optimal performance values from fusion features consistently outperformed the best values obtained from non-combined features (compare Table 2 to Table 1). For example, the SSA+BiLSTM/RF model achieved an ACC of 0.898 in independent tests, a 9.51% improvement over the SSA feature alone (ACC of 0.820).
  • **Triple Fusion Performance**: The SSA+UniRep+BiLSTM/LGBM model achieved ACC = 0.898 and MCC = 0.797 in independent tests, showing robust performance. The SSA+UniRep+BiLSTM/RF model also achieved ACC = 0.898 and MCC = 0.799.
  • **Best Fusion**: The UniRep+BiLSTM/SVM model achieved notable auPRC = 0.966 and auROC = 0.962 in independent tests. Overall, combining features generally led to better accuracy, demonstrating the benefit of feature fusion in capturing diverse information.

6.1.3. The Effect of Feature Selection on the Automatic Identification of Bitter Peptides

Given the improved performance of fused features, the next step was to address the potential issues of high dimensionality and redundancy. Feature selection using LGBM was applied, along with an incremental feature strategy and a hyperparameter grid search. The performance metrics for individual and fused features after feature selection are summarized in Table 3. Figure 3 visually presents these results.

Figure 3. Performance metrics of fusion features using a range of selected features and different algorithms (SVM, LGBM, and RF). Panels (A,C,E) show 10-fold cross-validation results, and panels (B,D,F) show independent test results.

The following are the results from Table 3 of the original paper:

(CV = 10-fold cross-validation; Test = independent test; Dim = number of features retained after selection.)

| Feature | Model | Dim | CV ACC | CV MCC | CV Sn | CV Sp | CV F1 | CV auPRC | CV auROC | Test ACC | Test MCC | Test Sn | Test Sp | Test F1 | Test auPRC | Test auROC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SSA | SVM | 53 | 0.820 | 0.641 | 0.840 | 0.801 | 0.824 | 0.910 | 0.909 | 0.914 | 0.829 | 0.937 | 0.891 | 0.916 | 0.948 | 0.941 |
| SSA | LGBM | 77 | 0.816 | 0.634 | 0.848 | 0.785 | 0.822 | 0.877 | 0.892 | 0.883 | 0.768 | 0.922 | 0.844 | 0.887 | 0.947 | 0.940 |
| SSA | RF | 16 | 0.805 | 0.610 | 0.820 | 0.789 | 0.808 | 0.860 | 0.881 | 0.867 | 0.734 | 0.875 | 0.859 | 0.868 | 0.888 | 0.894 |
| UniRep | SVM | 65 | 0.875 | 0.750 | 0.875 | 0.875 | 0.875 | 0.946 | 0.943 | 0.906 | 0.813 | 0.891 | 0.922 | 0.905 | 0.952 | 0.952 |
| UniRep | LGBM | 313 | 0.854 | 0.707 | 0.855 | 0.852 | 0.854 | 0.946 | 0.938 | 0.914 | 0.829 | 0.891 | 0.937 | 0.912 | 0.954 | 0.948 |
| UniRep | RF | 329 | 0.836 | 0.672 | 0.824 | 0.848 | 0.834 | 0.918 | 0.908 | 0.891 | 0.785 | 0.844 | 0.937 | 0.885 | 0.958 | 0.957 |
| BiLSTM | SVM | 344 | 0.820 | 0.641 | 0.824 | 0.816 | 0.821 | 0.913 | 0.915 | 0.922 | 0.844 | 0.937 | 0.906 | 0.923 | 0.955 | 0.956 |
| BiLSTM | LGBM | 339 | 0.871 | 0.742 | 0.883 | 0.859 | 0.873 | 0.925 | 0.929 | 0.906 | 0.813 | 0.906 | 0.906 | 0.906 | 0.969 | 0.966 |
| BiLSTM | RF | 434 | 0.830 | 0.660 | 0.836 | 0.824 | 0.831 | 0.906 | 0.914 | 0.898 | 0.797 | 0.906 | 0.891 | 0.899 | 0.957 | 0.950 |
| SSA + UniRep | SVM | 62 | 0.865 | 0.730 | 0.863 | 0.867 | 0.865 | 0.944 | 0.942 | 0.914 | 0.828 | 0.906 | 0.922 | 0.913 | 0.958 | 0.957 |
| SSA + UniRep | LGBM | 106 | 0.881 | 0.762 | 0.887 | 0.875 | 0.882 | 0.961 | 0.957 | 0.891 | 0.783 | 0.859 | 0.922 | 0.887 | 0.952 | 0.947 |
| SSA + UniRep | RF | 47 | 0.838 | 0.676 | 0.859 | 0.816 | 0.841 | 0.937 | 0.931 | 0.906 | 0.816 | 0.859 | 0.953 | 0.902 | 0.956 | 0.947 |
| SSA + BiLSTM | SVM | 267 | 0.836 | 0.672 | 0.836 | 0.836 | 0.836 | 0.910 | 0.911 | 0.914 | 0.828 | 0.906 | 0.922 | 0.913 | 0.956 | 0.952 |
| SSA + BiLSTM | LGBM | 317 | 0.861 | 0.723 | 0.875 | 0.848 | 0.863 | 0.924 | 0.929 | 0.906 | 0.813 | 0.906 | 0.906 | 0.906 | 0.962 | 0.958 |
| SSA + BiLSTM | RF | 176 | 0.832 | 0.664 | 0.848 | 0.816 | 0.835 | 0.922 | 0.925 | 0.906 | 0.813 | 0.906 | 0.906 | 0.906 | 0.959 | 0.952 |
| UniRep + BiLSTM | SVM | 186 | 0.873 | 0.746 | 0.887 | 0.859 | 0.875 | 0.932 | 0.934 | 0.914 | 0.829 | 0.937 | 0.891 | 0.916 | 0.961 | 0.965 |
| UniRep + BiLSTM | LGBM | 106 | 0.889 | 0.777 | 0.891 | 0.887 | 0.889 | 0.947 | 0.952 | 0.944 | 0.889 | 0.922 | 0.977 | 0.952 | 0.984 | 0.977 |
| UniRep + BiLSTM | RF | 45 | 0.871 | 0.742 | 0.871 | 0.871 | 0.871 | 0.937 | 0.941 | 0.938 | 0.875 | 0.938 | 0.938 | 0.938 | 0.976 | 0.971 |
| SSA + UniRep + BiLSTM | SVM | 336 | 0.881 | 0.762 | 0.883 | 0.879 | 0.881 | 0.940 | 0.942 | 0.922 | 0.845 | 0.953 | 0.891 | 0.924 | 0.942 | 0.946 |
| SSA + UniRep + BiLSTM | LGBM | 285 | 0.881 | 0.762 | 0.891 | 0.871 | 0.882 | 0.951 | 0.947 | 0.938 | 0.875 | 0.922 | 0.953 | 0.937 | 0.969 | 0.969 |
| SSA + UniRep + BiLSTM | RF | 192 | 0.863 | 0.727 | 0.859 | 0.867 | 0.863 | 0.932 | 0.932 | 0.922 | 0.844 | 0.906 | 0.937 | 0.921 | 0.970 | 0.967 |

**Key Observations:**

  • Significant improvement with feature selection: Feature selection improved performance on most metrics while substantially reducing the dimensionality of the feature vectors (e.g., UniRep + BiLSTM from 5505D to 106D), indicating successful removal of redundant or less informative features.
  • UniRep + BiLSTM (106D) is optimal: After LGBM feature selection down to 106 dimensions, the UniRep + BiLSTM fusion feature consistently outperformed all other options.
    • 10-fold cross-validation: ACC = 0.889, MCC = 0.777, Sn = 0.891, Sp = 0.887, F1 = 0.889, auPRC = 0.947, auROC = 0.952, improvements across all metrics compared with the pre-selection results.
    • Independent test: ACC = 0.944, MCC = 0.889, Sn = 0.922, Sp = 0.977, F1 = 0.952, auPRC = 0.984, auROC = 0.977, the highest values recorded, indicating excellent generalization to unseen data.
  • Reduced dimensionality, enhanced performance: These findings confirm that feature selection effectively mitigates the information redundancy and overfitting risks inherent in high-dimensional fused features, yielding a more robust and accurate model.

6.1.4. The Effect of Machine Learning Model Parameter Optimization on the Automated Identification of Bitter Peptides

After identifying the 106-dimensional UniRep + BiLSTM feature set (UniRep+BiLSTM_106) as superior, the authors performed further hyperparameter optimization using scikit-learn GridSearchCV for the SVM, RF, and LGBM models. Figure 4 illustrates the performance metrics with default parameters versus optimized hyperparameters.
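A minimal GridSearchCV sketch for the LGBM model is shown below. The grid values and placeholder data are assumptions; the paper reports only the chosen values (depth = 3, n_estimators = 75), which presumably correspond to LightGBM's max_depth and n_estimators parameters.

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder training data standing in for the 106-D selected features and labels.
X_sel = np.random.rand(512, 106)
y_train = np.random.randint(0, 2, 512)

# Hypothetical search grid around the reported optimum (max_depth = 3, n_estimators = 75).
param_grid = {
    "max_depth": [3, 5, 7, -1],
    "n_estimators": [25, 50, 75, 100, 200],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(LGBMClassifier(random_state=42), param_grid,
                      cv=10, scoring="accuracy", n_jobs=-1)  # 10-fold CV, as in the paper
search.fit(X_sel, y_train)
print(search.best_params_, search.best_score_)
```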

Figure 4. Performance metrics of the UniRep + BiLSTM features analyzed by different models using default parameters (light bars) or hyperparameters (dark bars). Results using selected hyperparame…

Key Observations:

  • Hyperparameter Tuning Benefits: For all machine learning models (SVM, LGBM, RF) using the UniRep + BiLSTM feature set, optimized hyperparameters consistently matched or outperformed the default parameters across all metrics, confirming the importance of fine-tuning.
  • LGBM Superiority: With optimized parameters (depth = 3, n_estimators = 75), the LGBM model clearly outperformed RF and SVM in both independent testing and 10-fold cross-validation. Although the RF model had a slightly better Sn (0.938 vs. 0.922 for LGBM in independent tests), LGBM excelled in all other key metrics (ACC, MCC, Sp, F1, auPRC, and auROC).
  • Final Model Selection: Based on this comprehensive analysis, the 106-dimensional UniRep + BiLSTM feature set combined with the LGBM model (depth = 3, n_estimators = 75) was selected as the final iBitter-DRLF predictor; a minimal evaluation sketch follows this list.
    • Final 10-fold CV results: ACC = 0.889, MCC = 0.777, Sn = 0.891, Sp = 0.887, F1 = 0.889, auPRC = 0.947, auROC = 0.952.
    • Final independent test results: ACC = 0.944, MCC = 0.889, Sn = 0.922, Sp = 0.977, F1 = 0.952, auPRC = 0.984, auROC = 0.977.
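To show how these reported metrics map onto standard library calls, here is a minimal evaluation sketch using LightGBM and scikit-learn. The random placeholder data, the train/test split, and the 0.5 decision threshold are assumptions; substituting the real 106-D feature matrices and BTP640 labels would be required to reproduce the paper's numbers.

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.metrics import (accuracy_score, average_precision_score, confusion_matrix,
                             f1_score, matthews_corrcoef, roc_auc_score)
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the 106-D selected UniRep + BiLSTM features
# and the 0/1 bitterness labels of the 640-peptide benchmark.
X = np.random.rand(640, 106)
y = np.random.randint(0, 2, 640)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Final reported configuration: depth = 3, n_estimators = 75 (assumed to map to max_depth).
model = LGBMClassifier(max_depth=3, n_estimators=75, random_state=42)
model.fit(X_train, y_train)

y_prob = model.predict_proba(X_test)[:, 1]
y_pred = (y_prob >= 0.5).astype(int)        # 0.5 decision threshold (assumption)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
metrics = {
    "ACC": accuracy_score(y_test, y_pred),
    "MCC": matthews_corrcoef(y_test, y_pred),
    "Sn": tp / (tp + fn),                    # sensitivity (recall on bitter peptides)
    "Sp": tn / (tn + fp),                    # specificity (recall on non-bitter peptides)
    "F1": f1_score(y_test, y_pred),
    "auPRC": average_precision_score(y_test, y_prob),
    "auROC": roc_auc_score(y_test, y_prob),
}
print(metrics)
```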

6.1.5. Comparison with Existing Methods

The ultimate validation for iBitter-DRLF involved comparing its independent test performance against existing state-of-the-art methods, as shown in Table 4.

The following are the results from Table 4 of the original paper:

| Classifier | ACC | MCC | Sn | Sp | auROC |
| --- | --- | --- | --- | --- | --- |
| iBitter-DRLF | 0.944 | 0.889 | 0.922 | 0.977 | 0.977 |
| iBitter-Fuse | 0.930 | 0.859 | 0.938 | 0.922 | 0.933 |
| BERT4Bitter | 0.922 | 0.844 | 0.938 | 0.906 | 0.964 |
| iBitter-SCM | 0.844 | 0.688 | 0.844 | 0.844 | 0.904 |
| MIMML | 0.938 | 0.875 | 0.938 | 0.938 | 0.955 |

**Key Observations:**

  • Superior performance: iBitter-DRLF outperformed all compared methods across several critical metrics in the independent test.
    • ACC: 0.944 (vs. 0.938 for MIMML, 0.930 for iBitter-Fuse, 0.922 for BERT4Bitter).
    • MCC: 0.889 (vs. 0.875 for MIMML, 0.859 for iBitter-Fuse, 0.844 for BERT4Bitter).
    • Sp: 0.977 (vs. 0.938 for MIMML, 0.922 for iBitter-Fuse, 0.906 for BERT4Bitter), indicating that iBitter-DRLF is very good at correctly identifying non-bitter peptides and minimizing false positives.
    • auROC: 0.977 (vs. 0.964 for BERT4Bitter, 0.955 for MIMML, 0.933 for iBitter-Fuse); a high auROC indicates excellent discriminative ability across thresholds.
  • Reliability and stability: The gains in ACC, MCC, Sp, and auROC show that iBitter-DRLF predicts peptide bitterness more reliably and stably than the existing algorithms. Although some methods report a slightly higher Sn (iBitter-Fuse, BERT4Bitter, and MIMML all reach 0.938 vs. 0.922 for iBitter-DRLF), its balance of Sn with exceptionally high Sp and auROC makes iBitter-DRLF the stronger predictor overall.

6.1.6. Feature Visualization of the Bitter Peptide Automatic Recognition Effect

UMAP (Uniform Manifold Approximation and Projection) was used for dimensionality reduction and visualization of the extracted features, providing insight into how well different feature sets separate bitter from non-bitter peptides. Figure 5 shows the UMAP visualizations.
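A minimal sketch of such a UMAP projection follows; it assumes the umap-learn and matplotlib packages, and the random placeholder arrays stand in for the 106-D selected feature matrix and the 0/1 bitterness labels.

```python
import numpy as np
import umap                      # from the umap-learn package
import matplotlib.pyplot as plt

# Placeholder data standing in for the 106-D selected UniRep + BiLSTM features
# and the 0/1 bitterness labels.
X_sel = np.random.rand(640, 106)
y = np.random.randint(0, 2, 640)

# Reduce the selected features to two dimensions for visualization.
reducer = umap.UMAP(n_components=2, random_state=42)
embedding = reducer.fit_transform(X_sel)       # shape (n_peptides, 2)

plt.scatter(embedding[y == 1, 0], embedding[y == 1, 1], c="red", s=10, label="bitter")
plt.scatter(embedding[y == 0, 0], embedding[y == 0, 1], c="blue", s=10, label="non-bitter")
plt.legend()
plt.title("UMAP projection of the selected fused features")
plt.show()
```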

Figure 5. UMAP was used to visualize the dimension-reduced features: (A) is the UniRep feature, (B) is the BiLSTM feature, (C) is the UniRep + BiLSTM fusion, and (D) represents the 106 selected features of the UniRep + BiLSTM fusion.

Key Observations:

  • Improved Discrimination with Fusion and Selection:
    • Individual Features (Figure 5A, B): UniRep and BiLSTM features individually (before fusion and selection) show some separation between bitter peptides (red) and non-bitter peptides (blue), but there is still significant overlap.
    • Fused Features (Figure 5C): The UniRep + BiLSTM fusion shows better clustering and separation compared to individual features, indicating that combining information improves discriminability.
    • Selected Fused Features (Figure 5D): The UMAP visualization of the 106 selected features from the UniRep + BiLSTM fusion demonstrates the clearest discrimination. The bitter and non-bitter peptides form distinct, well-separated clusters with minimal overlap. This visually confirms that feature selection extracts the most relevant information, leading to highly separable clusters for classification, and provides strong visual evidence for the improvements observed in the quantitative metrics.

6.2. Data Presentation (Tables)

All relevant tables (Table 1, Table 2, Table 3, and Table 4) have been transcribed and presented in the subsections above; merged cells from the originals (e.g., the shared Dim values in Table 2) have been expanded so that each row is self-contained.

6.3. Ablation Studies / Parameter Analysis

  • Ablation Study (Implicit): The progression of results from individual features (Table 1) to fused features (Table 2) and then to selected fused features (Table 3) serves as an implicit ablation study. It effectively demonstrates the incremental benefits of feature fusion and subsequent feature selection in improving model performance. Each stage shows that adding or refining components (feature fusion and feature selection) contributes positively to the model's ability to discriminate bitter peptides.
  • Parameter Analysis: The study explicitly details hyperparameter optimization using scikit-learn GridSearchCV for the SVM, RF, and LGBM models (as shown in Figure 4 and discussed in Section 6.1.4). It highlights that optimized hyperparameters consistently led to better performance than default parameters. For the final iBitter-DRLF LGBM model, the optimized parameters depth = 3 and n_estimators = 75 were identified. This parameter analysis is crucial for ensuring the model's robustness and optimal performance.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper successfully developed iBitter-DRLF, a novel computational model for accurately identifying bitter peptides solely based on their sequence data. The core innovation lies in its application of deep representation learning by leveraging pre-trained neural networks to extract highly informative features. Through a comprehensive process of evaluating individual features (SSA, UniRep, BiLSTM), performing feature fusion (combining UniRep and BiLSTM), and rigorously applying feature selection (LGBM) to distill the most discriminative 106-dimensional feature set, the model achieved superior performance. Coupled with an optimized LGBM classifier, iBitter-DRLF demonstrated impressive results in both 10-fold cross-validation and independent tests, significantly outperforming existing state-of-the-art predictors across key metrics such as accuracy, MCC, specificity, and auROC. The availability of a user-friendly webserver further enhances its utility, aiming to facilitate research in improving the palatability of peptide therapeutics and dietary supplements.

7.2. Limitations & Future Work

The authors acknowledge a key limitation: "Although the exact physicochemical relevance of these features is unclear, this does not prevent the successful use of this method for computational predictions in peptide and protein sequence analysis." This highlights the black-box nature inherent in many deep learning models; while they provide highly effective features, the direct biological or chemical interpretation of these high-dimensional, learned representations is not immediately obvious.

While the paper doesn't explicitly outline a "Future Work" section, the concluding remarks implicitly suggest directions:

  • Improved Palatability: The direct application of iBitter-DRLF is to assist in identifying bitter peptides for removal or modification, thereby improving the palatability of nutritional supplements and peptide therapeutics.
  • Advancing Drug Development and Nutrition Research: The method can serve as a tool to accelerate drug development by screening for bitter properties early and to enhance nutrition research by guiding the design of non-bitter protein hydrolysates.

7.3. Personal Insights & Critique

  • Innovation: The paper's strength lies in its systematic and rigorous approach to integrating deep representation learning into bitter peptide prediction. Moving beyond handcrafted features to learned embeddings from pre-trained models (UniRep, BiLSTM) is a significant step forward. The methodical evaluation of individual embeddings, their fusions, and subsequent feature selection demonstrates thoroughness in model development. The use of UMAP for visualization effectively communicates the discriminative power gained through feature optimization.
  • Applicability: The development of a webserver is highly valuable, making the research accessible to a broader scientific community for practical application in food science, pharmaceutical development, and nutrition. The problem it addresses is very practical: enhancing the acceptance of beneficial but bitter peptide products.
  • Potential Issues/Critique:
    1. Interpretability of Deep Features: As noted by the authors, the lack of clear physicochemical relevance for the deep features is a limitation. While effective, it provides limited mechanistic insight into why certain peptides are bitter. Future work could involve integrating explainable AI (XAI) techniques to interpret the learned features and connect them back to known physicochemical properties of amino acids and peptides. This could potentially lead to the design of novel, non-bitter peptides.
    2. Dataset Size for Deep Learning: The BTP640 dataset, though a benchmark, contains only 640 peptides. While the use of pre-trained models (UniRep, BiLSTM) mitigates the need for a very large dataset for training the embeddings, the final LGBM classifier training still relies on this relatively small set. For deep learning models that are fine-tuned or trained from scratch, larger datasets are typically preferred to fully exploit their capacity and prevent overfitting. Future research could explore collecting more extensive and diverse bitter peptide datasets.
    3. Generalizability to Novel Peptides: While the independent test set provides confidence, the true test of generalizability lies in predicting bitterness for peptides with sequences significantly different from those in the training data. The robustness of deep representation learning should help here, but this is always a challenge in bioinformatics.
  • Transferability: The methodology, particularly the strategy of using diverse pre-trained deep sequence embeddings followed by fusion and feature selection with gradient boosting models, is highly transferable. This framework could be applied to predict other peptide properties (e.g., antimicrobial activity, bioavailability, allergenicity) or even protein functions from sequence data. The deep representation learning models essentially act as powerful feature extractors, which can then be combined with various downstream machine learning tasks.
