iBitter-Stack: A Multi-Representation Ensemble Learning Model for Accurate Bitter Peptide Identification

Published: 09/19/2025

TL;DR Summary

The iBitter-Stack framework enhances bitter peptide identification accuracy by integrating Protein Language Model embeddings and handcrafted physicochemical features, utilizing various machine learning classifiers, achieving 96.09% accuracy in independent tests.

In-depth Reading


1. Bibliographic Information

1.1. Title

The central topic of this paper is the development of a multi-representation ensemble learning model, named iBitter-Stack, for the accurate identification of bitter peptides.

1.2. Authors

  • Sarfraz Ahmad (National University of Sciences and Technology (NUST), H-12, Islamabad, Pakistan)
  • Momina Ahsan (National University of Sciences and Technology (NUST), H-12, Islamabad, Pakistan)
  • Muhammad Nabeel Asim (German Research Center for Artificial Intelligence (DFKI), Kaiserslautern, 67663, Germany)
  • Andreas Dengel (German Research Center for Artificial Intelligence (DFKI), Kaiserslautern, 67663, Germany)
  • Muhammad Imran Malik (National University of Sciences and Technology (NUST), H-12, Islamabad, Pakistan)

1.3. Journal/Conference

This paper is set to appear in the Journal of Molecular Biology. This journal is a highly reputable and influential peer-reviewed scientific journal in the fields of molecular biology, biochemistry, and structural biology. Its focus on fundamental molecular mechanisms and structures makes it a significant venue for research in bioinformatics and computational biology, particularly for studies related to protein and peptide functions.

1.4. Publication Year

The paper was published (or is scheduled to be published) in 2025. The listed publication date (UTC) is 19 September 2025, and the accepted date stated in the paper is 15 September 2025.

1.5. Abstract

The identification of bitter peptides is critical across various fields, including food science, drug discovery, and biochemical research, due to their impact on taste and their roles in physiological and pharmacological processes. However, traditional experimental methods are costly and time-consuming, highlighting the need for efficient computational approaches. This study introduces iBitter-Stack, a novel stacking-based ensemble learning framework designed to improve the accuracy and reliability of bitter peptide classification. The model integrates diverse sequence-based feature representations and utilizes a broad array of machine learning classifiers. It features a two-layer stacking architecture: the first layer consists of multiple base classifiers, each trained on distinct feature encoding schemes, while the second layer uses logistic regression to refine predictions from an 8-dimensional probability vector. Evaluated on a meticulously curated dataset, iBitter-Stack significantly outperforms existing methods, achieving an accuracy of 96.09% and a Matthews Correlation Coefficient (MCC) of 0.9220 on an independent test set. To enhance accessibility, a user-friendly web server for iBitter-Stack has been developed and is freely available, enabling real-time screening of peptide sequences for bitterness.

The paper is available as a PDF; given its "Journal Pre-proofs" status and the designation "To appear in: Journal of Molecular Biology," it is in the final stages of publication.

2. Executive Summary

2.1. Background & Motivation

The perception of bitter taste serves as a fundamental biological defense mechanism, alerting organisms to potentially harmful substances. However, many naturally occurring bitter compounds, including peptides, also possess significant value in nutrition and medicine. Bitter peptides, often formed during protein hydrolysis, are particularly relevant in food science (contributing to undesirable taste) and pharmaceutical development.

The core problem the paper addresses is the challenging and resource-intensive nature of identifying bitter peptides using traditional experimental techniques. Methods like biochemical assays, human sensory evaluation, and chromatography are labor-intensive, time-consuming, costly, and can suffer from subjectivity and inter-individual variability (in human sensory testing).

With the exponential growth of peptide sequence data in the post-genomic era, there is a critical need for rapid, accurate, and cost-effective computational approaches, specifically machine learning (ML)-based methods, to distinguish bitter from non-bitter peptides based on their sequence and structural properties. Prior computational methods have shown promise, ranging from Quantitative Structure-Activity Relationship (QSAR) models to Deep Learning (DL)-based approaches utilizing Natural Language Processing (NLP) techniques. However, existing models often face limitations such as reliance on single-type feature representations (restricting generalizability), lack of integration with physicochemical properties crucial for biochemical understanding, or fixed ensemble configurations that limit optimization. The paper identifies a specific gap in existing stacking ensemble models, such as iBitter-GRE, which use a fixed set of base classifiers and an early fusion of features, potentially limiting flexibility, introducing redundancy, and omitting informative sequence-level representations.

The paper's entry point is to overcome these limitations by proposing a novel stacking-based ensemble learning framework that systematically combines diverse peptide representations and a wide array of machine learning classifiers into a unified meta-learning pipeline.

2.2. Main Contributions / Findings

The primary contributions and key findings of this paper are:

  • Novel Stacking Ensemble Framework (iBitter-Stack): The paper proposes a sophisticated two-layer stacking ensemble model that integrates heterogeneous feature representations and multiple machine learning classifiers. This framework systematically constructs a diverse pool of 56 base learners from seven different encoding schemes and eight distinct classifiers.

  • Multi-Representation Feature Integration: iBitter-Stack leverages a comprehensive multi-view feature strategy. It combines Protein Language Model (PLM)-derived embeddings (specifically ESM-2) with various handcrafted physicochemical and compositional descriptors (Dipeptide Composition, Amino Acid Entropy, Amino Acid Index, Grouped Tripeptide Composition, Composition-Transition-Distribution, and Binary Profile-based N- and C-terminal encoding). This approach captures both contextual nuances and domain-specific biochemical characteristics of peptides.

  • Systematic Base Learner Selection: Unlike previous models that rely on fixed base classifier configurations, iBitter-Stack employs a rigorous performance-based filtering strategy. Only base learners achieving an MCC greater than 0.8 and an Accuracy above 90% are selected to form the meta-dataset, enhancing robustness and adaptability.

  • Superior Performance: The model significantly outperforms existing state-of-the-art bitter peptide prediction methods. On an independent test set, iBitter-Stack achieves an accuracy of 96.1%, a Matthews Correlation Coefficient (MCC) of 0.922, and an Area Under the Receiver Operating Characteristic curve (AUROC) of 0.981. This demonstrates strong discriminative ability and generalization capability.

  • Robustness to Sequence Similarity: An additional experiment with an 80% sequence identity threshold for filtering between training and testing sets confirmed the model's robustness, maintaining strong performance (95.3% accuracy, 0.91 MCC), indicating genuine learning of discriminative sequence patterns rather than reliance on data redundancy.

  • User-Friendly Web Server: To facilitate practical application and broader accessibility, the authors developed and made freely available a web server (ibitter-stack-webserver.streamlit.app) that allows researchers and practitioners to screen peptide sequences for bitterness in real-time.

    These findings collectively solve the problem of accurately and efficiently identifying bitter peptides, offering a robust, reliable, and accessible computational tool that advances the state of the art in this critical domain.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully understand the iBitter-Stack paper, a reader should be familiar with several foundational concepts from biology, chemistry, and machine learning.

  • Peptides and Amino Acids:

    • Amino Acids: The basic building blocks of proteins and peptides. There are 20 standard amino acids, each with a unique side chain determining its chemical properties (e.g., hydrophobic, hydrophilic, charged).
    • Peptides: Short chains of amino acids linked together by peptide bonds. They are smaller than proteins and often exhibit various biological activities, including bitterness. The sequence of amino acids (e.g., Ala-Val-Gly) determines a peptide's structure and function.
    • N-terminus (NT5) and C-terminus (CT5): The beginning (N-terminal) and end (C-terminal) of a peptide chain. The N-terminus has a free amino group, and the C-terminus has a free carboxyl group. The paper specifically extracts features from the first five (NT5) and last five (CT5) residues, as these regions often play critical roles in peptide bioactivity.
  • Bitter Taste Perception:

    • Bitter taste is one of the five basic tastes, serving as a defense mechanism to detect potential toxins. The perception is mediated by specific taste receptors on the tongue. Certain chemical properties of peptides (e.g., presence of hydrophobic amino acids, especially at the C-terminal) are strongly associated with bitterness.
  • Machine Learning (ML):

    • A field of artificial intelligence that enables systems to learn from data without being explicitly programmed. In this context, ML models are trained on known bitter and non-bitter peptides to predict the bitterness of new, unseen peptides.
    • Classification: A type of supervised learning task where the model learns to assign input data points to one of several predefined categories (e.g., "bitter" or "non-bitter").
    • Features/Feature Engineering: Numerical representations of raw data (peptide sequences in this case) that a machine learning model can understand. Feature engineering is the process of selecting and transforming raw data into features that are most informative for the model.
    • Supervised Learning: A machine learning paradigm where the model learns from a labeled dataset (pairs of input data and their corresponding correct output labels).
  • Ensemble Learning:

    • A technique that combines multiple machine learning models (called base learners or weak learners) to achieve better predictive performance than any single model could achieve alone. The idea is that the aggregated "wisdom of the crowd" is often more accurate and robust.
    • Stacking (Stacked Generalization): A specific type of ensemble learning where the predictions of multiple base models (first-level learners) are used as input features for a higher-level model (a meta-learner or second-level learner). The meta-learner learns how to optimally combine the predictions of the base models. This typically involves feeding soft predictions (probabilities) from base models to the meta-learner, rather than hard class labels.
  • Protein Language Models (PLMs) and Evolutionary Scale Modeling (ESM):

    • Language Models: In Natural Language Processing (NLP), language models learn statistical relationships between words in a language. PLMs adapt this concept to protein sequences, treating amino acids as words and protein sequences as sentences. They learn complex patterns and contextual embeddings by being trained on vast amounts of protein sequence data.
    • Embeddings: Numerical vector representations of objects (e.g., amino acids, peptides) that capture their semantic or functional meaning in a high-dimensional space. Words with similar meanings are close together in the embedding space. PLM embeddings capture evolutionary and structural information about proteins.
    • ESM (Evolutionary Scale Modeling): A specific family of Protein Language Models developed by FAIR (Facebook AI Research). ESM-2 is a powerful variant trained on massive datasets of protein sequences (like UniProt), allowing it to learn general principles of protein structure, function, and evolution. It generates sequence embeddings that are highly informative for various bioinformatics tasks.
  • Dimensionality Reduction:

    • Techniques used to reduce the number of features or dimensions in a dataset while retaining as much meaningful information as possible. This is useful for visualization and can improve model performance by removing noise.
    • t-Distributed Stochastic Neighbor Embedding (t-SNE): A nonlinear dimensionality reduction technique particularly well-suited for visualizing high-dimensional datasets in 2D or 3D, aiming to preserve local neighborhoods (i.e., points that are close in high-dimensional space remain close in low-dimensional space).
    • Uniform Manifold Approximation and Projection (UMAP): Another nonlinear dimensionality reduction algorithm, often faster than t-SNE and capable of preserving more of the global data structure in addition to local relationships.
  • Cross-Validation:

    • A technique used to assess how the results of a statistical analysis (e.g., a machine learning model's performance) generalize to an independent dataset.
    • 10-fold Cross-Validation: The dataset is divided into 10 equal parts. The model is trained 10 times; in each iteration, 9 parts are used for training, and the remaining 1 part is used for testing. The results from all 10 iterations are then averaged. This helps to reduce overfitting (where a model performs well on training data but poorly on unseen data) and provides a more robust estimate of the model's performance.
  • Logistic Regression (LR):

    • A linear model used for binary classification. Despite its name, it's a classification algorithm, not a regression algorithm. It models the probability of a binary outcome (e.g., bitter or non-bitter) using a logistic function. It is often chosen for its interpretability and computational efficiency. In stacking, it's frequently used as a meta-learner due to its ability to combine probabilities from base models effectively.
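
To make the cross-validation and soft-prediction machinery above concrete, the following minimal sketch (generic scikit-learn code, not the authors' implementation) trains a Logistic Regression under 10-fold stratified cross-validation and collects out-of-fold class probabilities, the kind of "soft predictions" a stacking meta-learner consumes; the synthetic feature matrix is a placeholder.

```python
# Minimal sketch: 10-fold stratified cross-validation producing out-of-fold
# probability ("soft") predictions from a Logistic Regression. Data are synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import accuracy_score, matthews_corrcoef

X, y = make_classification(n_samples=512, n_features=320, random_state=0)  # stand-in data

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
clf = LogisticRegression(max_iter=1500)

# Out-of-fold probabilities: each sample is predicted by a model that never saw it in training.
oof_proba = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")[:, 1]
oof_label = (oof_proba >= 0.5).astype(int)

print("CV accuracy:", accuracy_score(y, oof_label))
print("CV MCC:     ", matthews_corrcoef(y, oof_label))
```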

3.2. Previous Works

The paper thoroughly reviews previous efforts in bitter peptide identification, categorizing them from traditional experimental methods to advanced machine learning models.

  • Traditional Experimental Techniques:

    • Biochemical assays, human sensory evaluation, and chromatography-based separation have been the gold standard.
    • Limitations: Labor-intensive, time-consuming, costly, and subjective (human sensory).
  • Quantitative Structure-Activity Relationship (QSAR) Modeling:

    • One of the earliest computational approaches. QSAR models establish mathematical relationships between peptide descriptors (numerical properties) and their biological activity (e.g., bitterness).
    • Algorithms: Support Vector Machines (SVM), Artificial Neural Networks (ANN), Multiple Linear Regression (MLR).
    • Examples:
      • Yin et al. [15]: Developed 28 QSAR models using Support Vector Regression (SVR) to estimate peptide bitterness.
      • Soltani et al. [20]: Analyzed bitterness thresholds for 229+ peptides using three ML methods.
      • BitterX [21] and BitterPredict [22]: Open-access tools employing ML classification for high-accuracy bitter compound identification.
    • Limitations: While effective, QSAR typically relies on handcrafted molecular descriptors, which might not fully capture complex sequence information.
  • Sequence-Based ML Predictors:

    • iBitterSCM [23]: One of the earliest sequence-based predictors.
      • Algorithm: Scoring Card Method (SCM).
      • Feature Type: Dipeptide propensity scores.
      • Performance (Independent Test Set): Accuracy 84.0%, MCC 0.69.
      • Limitations: Relied on a single-type feature representation, limiting its generalizability.
    • BERT4Bitter [24]: Introduced Deep Learning (DL) and NLP techniques.
      • Algorithm: BERT (Bidirectional Encoder Representations from Transformers) + Bi-LSTM (Bidirectional Long Short-Term Memory).
      • Feature Type: BERT embeddings (extracted directly from raw peptide sequences).
      • Performance (Independent Test Set): Accuracy 92.2%, MCC 0.84.
      • Limitations: Lacked integration with physicochemical properties, which are crucial for understanding biochemical mechanisms.
    • iBitter-Fuse [31]: Explored multi-representation learning to overcome BERT4Bitter's limitations.
      • Algorithm: SVM (Support Vector Machine).
      • Feature Type: Integrates multiple encoding schemes including Dipeptide Composition (DPC), Amino Acid Composition (AAC), Pseudo Amino Acid Composition (PAAC), and physicochemical properties. Used Genetic Algorithm with Self-Assessment-Report (GA-SAR) for feature selection.
      • Performance (Independent Test Set): Accuracy 93.0%, MCC 0.86.
      • Limitations: MCC still lower than more recent approaches; still relied solely on handcrafted features, potentially missing NLP-based pre-trained embeddings.
    • iBitterDRLF [32]: Incorporated deep representation learning.
      • Algorithm: LightGBM.
      • Feature Type: Leveraged deep representation learning features extracted from peptide sequences, namely SSA, UniRep, and BiLSTM embeddings. Used UMAP for dimensionality reduction.
      • Performance (Independent Test Set): Accuracy 94.0%, MCC 0.89.
      • Limitations: Relied on limited types of deep representations and lacked ensemble learning to fully capture complementary information.
    • UniDL4BioPep [33]: Universal deep learning architecture for bioactive peptide classification.
      • Algorithm: CNN (Convolutional Neural Networks) with ESM-2 embeddings.
      • Feature Type: ESM-2 embeddings (320-dim).
      • Performance (Independent Test Set): Accuracy 93.8%, MCC 0.87.
      • Limitations: Similar to NLP-based approaches, omitted physicochemical properties and compositional features, potentially limiting comprehensive biochemical understanding.
    • Bitter-RF [34]: A Random Forest model.
      • Algorithm: Random Forest.
      • Feature Type: Physicochemical sequence features.
      • Performance (Independent Test Set): Accuracy 94.0%, MCC 0.88.
      • Limitations: While good, it uses a single classifier and relies on one type of feature.
    • iBitter-GRE [35]: Stacking ensemble model using ESM-2 and biochemical descriptors.
      • Algorithm: Stacking Ensemble (Gradient Boosting, Random Forest, Extra Trees as base classifiers; Logistic Regression as meta-classifier).
      • Feature Type: ESM-2 embeddings (6-layer, 8M parameter version) combined with seven manually engineered features (molecular weight, hydrophobicity, polarity, isoelectric point, amino acid composition, transition frequency, amino acid distribution). Used RFECV for dimensionality reduction.
      • Performance (Independent Test Set): Accuracy 96.1%, MCC 0.92.
      • Limitations:
        1. Fixed set of base classifiers (not exploring a broader space).

        2. Early fusion of ESM and physicochemical descriptors doesn't account for distinct predictive contributions, potentially introducing redundancy.

        3. Omitted several informative sequence-level representations (e.g., AAE, GTPC, CTD).

          The following are the results from Table 1 of the original paper:

          | Predictor | Algorithm | Feature/Embedding Type | Accuracy (%) | Sensitivity (%) | Specificity (%) | MCC |
          |---|---|---|---|---|---|---|
          | iBitter-SCM [23] | Scoring Card Method (SCM) | Propensity scores of amino acids and dipeptides | 84.0 | 84.0 | 84.0 | 0.69 |
          | BERT4Bitter [24] | BERT + Bi-LSTM | BERT embeddings | 92.0 | 94.0 | 91.0 | 0.84 |
          | iBitter-Fuse [31] | SVM | Composition + physicochemical properties | 93.0 | 94.0 | 92.0 | 0.86 |
          | iBitter-DRLF [32] | LightGBM | SSA, UniRep, and BiLSTM embeddings | 94.0 | 92.0 | 98.0 | 0.89 |
          | UniDL4BioPep [33] | CNN (shallow, 8-layer) | ESM-2 embeddings (320-dim) | 93.8 | 92.4 | 95.2 | 0.87 |
          | Bitter-RF [34] | Random Forest | Physicochemical sequence features | 94.0 | 94.0 | 94.0 | 0.88 |
          | iBitter-GRE [35] | Stacking Ensemble | ESM-2 embeddings + biochemical descriptors | 96.1 | 98.4 | 93.8 | 0.92 |

3.3. Technological Evolution

The field of bitter peptide identification has evolved significantly:

  1. Early 2000s: Experimental Methods Dominance: The primary approach relied on labor-intensive and costly experimental techniques.
  2. Mid-2000s: Rise of QSAR: Computational methods began with QSAR models, correlating peptide descriptors with bitterness using traditional ML algorithms like SVM and ANN. This marked the shift towards data-driven prediction.
  3. Late 2010s: Sequence-Based ML and Deep Learning: With larger datasets, models started to leverage peptide sequences directly. iBitterSCM used propensity scores. The advent of Deep Learning and Natural Language Processing (NLP) techniques, particularly BERT (e.g., BERT4Bitter), allowed for direct feature extraction from raw sequences, capturing contextual information.
  4. Early 2020s: Multi-Representation and Ensemble Learning: Recognizing the limitations of single-feature or single-model approaches, research moved towards integrating diverse features (e.g., iBitter-Fuse combined composition and physicochemical properties) and deep representation learning (iBitterDRLF). The integration of Protein Language Models (PLMs) like ESM-2 (e.g., UniDL4BioPep, iBitter-GRE) became a significant advancement, capturing rich evolutionary information.
  5. Current State (iBitter-Stack): This paper builds on the PLM and ensemble learning trend, further refining the stacking ensemble approach by:
    • Systematically exploring a much wider range of base learner combinations (feature encoding schemes with various classifiers).
    • Employing a rigorous selection process for base learners.
    • Leveraging soft probability outputs from base learners as input for the meta-learner, allowing for more nuanced decision-making.
    • Explicitly combining deep PLM embeddings with a broader set of handcrafted physicochemical and compositional features, ensuring comprehensive representation.

3.4. Differentiation Analysis

Compared to the main methods in related work, iBitter-Stack introduces several core differences and innovations:

  • Comprehensive Feature Diversity and Fusion Strategy:

    • Differentiation: Unlike BERT4Bitter or UniDL4BioPep which primarily rely on NLP embeddings, or iBitter-Fuse which uses handcrafted features, iBitter-Stack explicitly combines both deep PLM embeddings (ESM-2) and a broad set of handcrafted physicochemical and compositional descriptors (DPC, AAE, AAI, GTPC, CTD, BPNC). This multi-representation strategy provides a more comprehensive understanding of peptide characteristics.
    • Innovation: This extensive feature set ensures that both high-level contextual information from PLMs and low-level biochemical properties are captured, addressing limitations of models that focus on only one type of feature.
  • Systematic and Flexible Base Learner Configuration:

    • Differentiation: In contrast to iBitter-GRE which uses a fixed set of three base classifiers, iBitter-Stack systematically constructs a large pool of 56 base learners by combining seven different feature encoding schemes with eight diverse machine learning classifiers.
    • Innovation: This broad exploration allows for a more optimal selection of base learners, enhancing the flexibility and potential optimization of the ensemble. It avoids the a priori commitment to specific classifiers, allowing the data to guide the selection.
  • Refined Meta-Learning Pipeline with Soft Probability Fusion:

    • Differentiation: While iBitter-GRE uses an early fusion of ESM embeddings and physicochemical descriptors before base classification, iBitter-Stack's meta-learning layer receives soft probability outputs (confidence scores) from the selected base learners.
    • Innovation: This late fusion approach, using an 8-dimensional probability vector as input to the meta-learner, reduces redundancy and encourages smoother decision boundaries. It allows the meta-learner (Logistic Regression) to learn the optimal way to weight and combine the nuanced predictions of diverse models, leveraging their complementary strengths rather than being diluted by early feature concatenation.
  • Rigorous Base Learner Selection:

    • Differentiation: iBitter-Stack applies a strict filtering criterion (MCC > 0.8 and Accuracy > 90%) to select only the top-performing base learners.
    • Innovation: This performance-based selection ensures that only reliable and effective models contribute to the final ensemble, improving overall robustness and reducing the risk of incorporating underperforming components.
  • Demonstrated Superior and Consistent Performance:

    • Differentiation: While iBitter-GRE achieved competitive results on the independent test set, iBitter-Stack shows superior consistency in 10-fold cross-validation and maintains a better-balanced sensitivity-specificity trade-off on the independent test set, indicating stronger generalization across varying data splits and more reliable prediction in real-world scenarios.

    • Innovation: This consistent top performance across different evaluation settings, validated by high MCC and AUROC scores, positions iBitter-Stack as a more reliable and generalizable tool.

      In summary, iBitter-Stack differentiates itself by its holistic approach to feature representation, systematic ensemble construction, and refined meta-learning strategy, leading to a more robust, accurate, and generalizable model for bitter peptide identification.

4. Methodology

4.1. Principles

The core idea behind iBitter-Stack is to build a highly accurate and robust predictor for bitter peptides by leveraging the strengths of multiple machine learning models and diverse data representations through a stacking ensemble framework. The theoretical basis rests on the principle that combining different models, each trained on distinct views of the data (heterogeneous features) and employing varied learning algorithms, can capture more complex patterns and achieve better generalization than any single model. By using a meta-learner to intelligently combine the soft predictions (probabilities) of these base learners, the system can effectively integrate complementary information and mitigate individual model biases or weaknesses. This approach aims to create a decision space that is more abstract and highly discriminative, as illustrated by the t-SNE visualizations in the results.

4.2. Core Methodology In-depth (Layer by Layer)

The iBitter-Stack framework follows a multi-stage pipeline, from data preparation and feature engineering to model training and ensemble construction. The overall workflow is illustrated in Fig. 2.

The following figure (Figure 2 from the original paper) presents the workflow of a multi-representation ensemble learning model for bitter peptide identification.

The figure is a schematic of the model-construction workflow, covering dataset preparation, feature representation techniques, base learner selection, and meta-learner optimization.

4.2.1. Dataset

A robust benchmark dataset is crucial for reliable model development.

  • Source: The BTP640 dataset was adopted, a widely accepted benchmark in previous research.
  • Composition: It comprises 320 experimentally validated bitter peptides and 320 non-bitter peptides, making it a balanced dataset suitable for binary classification tasks. Bitter peptides were collected from multiple peer-reviewed studies, ensuring strong experimental validation.
  • Curation:
    • Peptides containing ambiguous amino acid residues (X, B, U, Z) were excluded.
    • Duplicate sequences were removed to prevent data redundancy and overfitting.
    • Non-bitter peptides were randomly selected from the BIOPEP database [42], a comprehensive source of peptide sequences, to address the scarcity of experimentally validated non-bitter peptides.
  • Splitting: The dataset was randomly divided into training and independent test subsets using an 8:2 ratio, a standard convention in ML-based peptide classification.
    • Training Set (BTP-CV): 256 bitter peptides and 256 non-bitter peptides. This set is used for model training and 10-fold cross-validation.
    • Independent Test Set (BTP-TS): 64 bitter peptides and 64 non-bitter peptides. This set is used for unbiased evaluation of the final model's generalization capability.
    • Stratified sampling was used to preserve class balance in both subsets.
  • Public Availability: The dataset and source code are accessible at https://github.com/Shoombuatong/Dataset-Code/tree/master/iBitter and http://pmlab.pythonanywhere.com/BERT4Bitter.
  • Sequence Similarity Filtering (Additional Experiment in Appendix A): To further mitigate potential information leakage and ensure fairer evaluation, an additional experiment was conducted. Peptides with greater than or equal to 80% sequence identity were removed, both within and across the train-test boundary. This resulted in a filtered dataset of 428 training and 86 testing sequences (with a slight class imbalance), confirming the model's robustness under stricter similarity constraints.
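
As an illustration of the stratified 8:2 split described above, the following sketch uses scikit-learn's train_test_split; the peptide strings and labels are stand-ins, not the actual BTP640 data.

```python
# Illustrative sketch of an 8:2 stratified split (not the authors' exact procedure).
from sklearn.model_selection import train_test_split

sequences = ["IVY", "GPFPVI", "AGDDAPR", "RGPFPIIV"] * 160  # stand-in peptide strings
labels = [1, 1, 0, 0] * 160                                 # 1 = bitter, 0 = non-bitter

train_seqs, test_seqs, y_train, y_test = train_test_split(
    sequences, labels,
    test_size=0.2,        # 8:2 train/test ratio
    stratify=labels,      # preserve the bitter/non-bitter balance in both subsets
    random_state=42,
)
print(len(train_seqs), "training peptides;", len(test_seqs), "test peptides")
```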

4.2.2. Feature Representation

Given a peptide sequence $P$, it can be represented as $ P = p_1 p_2 p_3 \ldots p_N $, where $p_i$ denotes the $i$-th residue in the sequence $P$ and $N$ is the total length of the peptide. Each residue $p_i$ is selected from the standard set of 20 natural amino acids.

The study employed a range of feature encoding schemes to construct a comprehensive representation of peptide sequences, capturing diverse attributes:

4.2.2.1. Evolutionary Scale Modeling (ESM) Embeddings

ESM is a type of Protein Language Model (PLM) designed to learn rich evolutionary and contextual information from protein sequences.

  • Model Used: ESM-2 (esm2_t6_8M_UR50D variant). This is a smaller variant with 6 layers and 8 million parameters, chosen to manage dimensionality for the given dataset size.

  • Output Dimensions: Generates a 320-dimensional vector for each peptide.

  • Extraction: Embeddings are extracted from the last layer (layer 6) of the pretrained ESM-2 model, as this layer provides the most relevant sequence information for bioactivity recognition.

  • Normalization: Min-max normalization is applied to scale features within the range of [0, 1] based on the training dataset. The test dataset is normalized using the min/max values from the training set.

  • Visualization: UMAP and t-SNE were used to visualize the high-dimensional ESM embeddings in 2D space, demonstrating their effectiveness in capturing relevant features for peptide bioactivity.

    The following figure (Figure 1 from the original paper) shows the architecture of the ESM model used for generating peptide embeddings.

    The figure is a schematic of the ESM embedding architecture used in iBitter-Stack: an input sequence of N residues is tokenized and passed through a modified six-layer BERT-style model, yielding per-residue sequence embeddings and a final hidden-state output of size n × 320.
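
The sketch below shows how 320-dimensional ESM-2 embeddings of the kind described above can be extracted with the open-source fair-esm package (checkpoint esm2_t6_8M_UR50D). Mean-pooling the per-residue layer-6 representations into one vector per peptide is an assumption about a detail the text does not spell out, and the min-max scaling is shown on the same toy batch only for brevity.

```python
# Sketch: extracting 320-dim ESM-2 embeddings with the open-source fair-esm package
# (checkpoint esm2_t6_8M_UR50D). Mean-pooling per-residue layer-6 representations into one
# vector per peptide is an assumption; the paper only states that layer-6 embeddings are used.
import torch
import esm  # pip install fair-esm

model, alphabet = esm.pretrained.esm2_t6_8M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

peptides = [("pep1", "IVY"), ("pep2", "GPFPVI")]              # stand-in sequences
_, _, tokens = batch_converter(peptides)

with torch.no_grad():
    out = model(tokens, repr_layers=[6])
residue_reps = out["representations"][6]                       # (batch, seq_len + 2, 320)

embeddings = []
for i, (_, seq) in enumerate(peptides):
    # Drop the BOS/EOS tokens and average over residues -> one 320-dim vector per peptide.
    embeddings.append(residue_reps[i, 1:len(seq) + 1].mean(dim=0))
embeddings = torch.stack(embeddings)

# Min-max scaling to [0, 1]; in practice the min/max values are fitted on the training set only.
mins, maxs = embeddings.min(dim=0).values, embeddings.max(dim=0).values
scaled = (embeddings - mins) / (maxs - mins + 1e-8)
print(scaled.shape)  # torch.Size([2, 320])
```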

4.2.2.2. Dipeptide Composition (DPC)

DPC captures the local relationship between adjacent amino acid residues.

  • Representation: A 400-dimensional vector, where each dimension corresponds to the normalized frequency of one of the 20 × 20 = 400 possible dipeptide combinations.
  • Calculation: The method is defined by the formula: $ D(r, s) = \frac{N_{rs}}{N - 1} $ where:
    • $D(r, s)$ is the normalized frequency of the dipeptide formed by amino acid types $r$ and $s$.
    • $N_{rs}$ denotes the number of occurrences of the dipeptide rs (e.g., 'AR', 'GG') in the peptide sequence.
    • $N$ is the total length of the peptide.
    • The denominator $N - 1$ is the total number of adjacent amino acid pairs in a sequence of length $N$.
  • Normalization: Counts are normalized to relative frequencies, making the feature vector robust to variations in sequence length. DPC is effective for capturing local sequential patterns crucial for functional properties.
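
A minimal implementation of the DPC formula above might look as follows; it assumes sequences contain only the 20 standard residues, consistent with the dataset curation described later.

```python
# Minimal sketch of Dipeptide Composition (DPC), following D(r, s) = N_rs / (N - 1).
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 combinations

def dpc(sequence: str) -> list[float]:
    """Return the 400-dimensional normalized dipeptide frequency vector."""
    counts = {dp: 0 for dp in DIPEPTIDES}
    for i in range(len(sequence) - 1):
        counts[sequence[i:i + 2]] += 1
    total_pairs = len(sequence) - 1  # number of adjacent residue pairs
    return [counts[dp] / total_pairs for dp in DIPEPTIDES]

vec = dpc("IVYPNG")
print(len(vec), sum(vec))  # 400 1.0
```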

4.2.2.3. Amino Acid Entropy (AAE)

Amino Acid Entropy (AAE) is a position-based feature that quantifies the non-random distribution of each amino acid, reflecting variability and disorder along the peptide chain.

  • Calculation: The entropy value for each amino acid $A$ in a peptide sequence $P$ of length $p$ is given by: $ AAE_A = \sum_{i=1}^{n} \left( \frac{s_i - s_{i-1}}{p} \right) \log_2 \left( \frac{s_i - s_{i-1}}{p} \right) $ where:
    • $AAE_A$ is the amino acid entropy for amino acid $A$.
    • $p$ represents the length of the peptide sequence $P$.
    • $n$ denotes the number of occurrences of amino acid $A$ in the peptide.
    • $s_1, s_2, \ldots, s_n$ are the positions of amino acid $A$ within the peptide.
    • The position indices are defined such that $s_0 = 0$ and $s_{n+1} = n + 1$, marking the boundaries of the peptide sequence.
  • Regions: AAE is calculated for the full peptide sequence, its N-terminal (NT5, first five residues), and C-terminal (CT5, last five residues) subsequences.
  • Representation: The resulting AAE values are combined into a 60-dimensional feature vector (20 amino acids × 3 regions).
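
The following sketch transcribes the printed AAE formula directly; the boundary handling (only $s_0 = 0$ enters the sum as written) may differ from the authors' actual implementation, so treat it as illustrative.

```python
# Sketch of Amino Acid Entropy (AAE), transcribing the printed formula
# AAE_A = sum_{i=1..n} ((s_i - s_{i-1}) / p) * log2((s_i - s_{i-1}) / p), with s_0 = 0.
# Boundary handling may differ in the authors' implementation.
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aae_vector(sequence: str) -> list[float]:
    """Return the 20-dimensional entropy vector for one region (full, NT5 or CT5)."""
    p = len(sequence)
    values = []
    for aa in AMINO_ACIDS:
        positions = [i + 1 for i, res in enumerate(sequence) if res == aa]  # 1-based
        entropy, prev = 0.0, 0  # s_0 = 0
        for s in positions:
            gap = (s - prev) / p
            entropy += gap * math.log2(gap)
            prev = s
        values.append(entropy)
    return values

peptide = "IVYPNGIVY"
feature = aae_vector(peptide) + aae_vector(peptide[:5]) + aae_vector(peptide[-5:])
print(len(feature))  # 60-dimensional AAE feature (full + NT5 + CT5)
```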

4.2.2.4. Binary Profile-based Encoding for N and C-terminal residues (BPNC)

BPNC represents each amino acid as a binary vector to capture positional specificity.

  • Encoding Scheme: Each of the 20 standard amino acids is represented by a 20-dimensional binary vector. For example, Alanine (A) is encoded as (1, 0, ..., 0) and Cysteine (C) as (0, 1, ..., 0), where a 1 indicates the presence of that specific amino acid at a given position.

  • Application: Applied specifically to the first five N-terminal (NT5) and last five C-terminal (CT5) residues of each peptide.

  • Representation: This results in a 200-dimensional vector for each peptide (100 dimensions for NT5 and 100 dimensions for CT5, as each of 5 amino acids in each terminal region is represented by a 20-dim vector).

  • Purpose: Emphasizes the critical role of terminal residues in peptide function and bioactivity.

    The following are the results from Table 2 of the original paper:

    | Amino Acid | 20-Dimensional Binary Vector |
    |---|---|
    | A | (1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) |
    | C | (0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) |
    | Y | (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1) |
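
A compact sketch of the BPNC encoding is given below; zero-padding peptides shorter than five residues is an assumption, as the text does not state how such cases are handled.

```python
# Sketch of Binary Profile-based N/C-terminal encoding (BPNC): one-hot encode the first
# five (NT5) and last five (CT5) residues, giving a 200-dimensional vector.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(residue: str) -> list[int]:
    return [1 if residue == aa else 0 for aa in AMINO_ACIDS]

def bpnc(sequence: str) -> list[int]:
    nt5, ct5 = sequence[:5], sequence[-5:]
    vector = []
    for region in (nt5, ct5):
        for pos in range(5):
            # Zero-pad positions beyond the peptide length (assumption for short peptides).
            vector += one_hot(region[pos]) if pos < len(region) else [0] * 20
    return vector  # length 200

print(len(bpnc("IVYPNGKL")))  # 200
```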

4.2.2.5. Physicochemical Property-Based Features

These features capture chemical characteristics and structural properties.

  • AAI (Amino Acid Index):
    • Consists of 12 properties from the AAindex database (e.g., hydrophobicity, steric parameters, solvation).
    • For some properties (hydrophobicity, etc.), the average AAindex values for all amino acids in the full, NT5, and CT5 sequences are used.
    • For other properties (hydrogen bonding, net charge, molecular weight), the sum of AAindex values for all amino acids is used.
    • Representation: A 36-dimensional vector.
  • GTPC (Grouped Tripeptide Composition):
    • Categorizes amino acids into five groups based on physicochemical properties: aliphatic, aromatic, positive charge, negative charge, and uncharged.
    • Calculates the frequency of tri-peptides formed by combinations of these groups in the full, NT5, and CT5 sequences.
    • Representation: A 125-dimensional vector.
  • CTD (Composition-Transition-Distribution):
    • Captures distribution patterns of amino acids based on specific physicochemical properties.
    • Representation: A 147-dimensional vector, comprising:
      • 21 dimensions for Composition (C): Frequency of amino acids with certain properties.
      • 21 dimensions for Transition (T): Frequency of transitions between amino acids with different properties.
      • 105 dimensions for Distribution (D): Distribution patterns of amino acids with specific properties along the sequence.
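
The sketch below illustrates the GTPC idea for a single region (the full sequence); the exact assignment of residues to the five physicochemical groups is an assumption, since the text does not list the groupings explicitly.

```python
# Sketch of Grouped Tripeptide Composition (GTPC): residues are mapped to five
# physicochemical groups and tripeptide frequencies over the groups are computed,
# giving 5^3 = 125 dimensions. The group assignments below are assumptions.
from itertools import product

GROUPS = {
    "aliphatic": "AGILMV",
    "aromatic": "FWY",
    "positive": "HKR",
    "negative": "DE",
    "uncharged": "CNPQST",
}
AA_TO_GROUP = {aa: g for g, aas in GROUPS.items() for aa in aas}
TRIPLES = ["-".join(t) for t in product(GROUPS, repeat=3)]  # 125 grouped tripeptides

def gtpc(sequence: str) -> list[float]:
    counts = {t: 0 for t in TRIPLES}
    for i in range(len(sequence) - 2):
        key = "-".join(AA_TO_GROUP[res] for res in sequence[i:i + 3])
        counts[key] += 1
    total = max(len(sequence) - 2, 1)
    return [counts[t] / total for t in TRIPLES]

print(len(gtpc("IVYPNGKLF")))  # 125
```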

4.2.3. Base Learners and Meta Learners

The iBitter-Stack model is built upon a two-layer stacking ensemble architecture:

4.2.3.1. Base Learners (First Layer)

  • Construction: A total of 56 base learners were constructed by combining:
    • 7 diverse embeddings/feature types: ESM, BPNC, DPC, AAE, AAI, GTPC, and CTD.
    • 8 distinct classifiers: Support Vector Machine (SVM), Decision Tree (DT), Naive Bayes (NB), K-Nearest Neighbors (KNN), Logistic Regression (LR), Random Forest (RF), Adaptive Boosting (AdaBoost), and Multilayer Perceptron (MLP).
    • Each unique combination (e.g., ESM with RF, CTD with MLP) formed an individual base learner.
  • Training: All 56 base learners were trained using 10-fold cross-validation on the training set (BTP-CV) to optimize their hyperparameters.
  • Selection: After training, a rigorous selection criterion was applied to identify top-performing models for the meta-learning phase:
    • Matthews Correlation Coefficient (MCC) greater than 0.8.
    • Accuracy higher than 90%.
  • Output: The top eight models that met these criteria were chosen. For each peptide sample in the training set, these selected models output a class probability (a soft output between 0 and 1) indicating the likelihood of the sample being bitter or non-bitter.

4.2.3.2. Meta Learner (Second Layer)

  • Meta-Dataset Construction: The soft probability outputs from the eight selected base learners for each peptide sample are concatenated. This forms an 8-dimensional probability vector for every peptide. This collection of probability vectors across all samples constitutes the meta-dataset.

  • Meta-Learner Model: A Logistic Regression (LR) model was chosen as the meta-learner. LR is robust, computationally efficient, and effective in combining predictions from base learners.

  • Training: The LR meta-learner is trained on this meta-dataset. Its role is to learn the optimal way to combine the probabilities from the base learners by assigning appropriate weights to each input probability.

  • Final Prediction: The final classification of a peptide as bitter or non-bitter is derived from the output of this LR meta-learner, leveraging the collective judgment of the most reliable base models.

  • Hyperparameter Optimization: Hyperparameters for the LR meta-learner were optimized via grid search, with the best configuration found to be penalty = 'l2' (L2 regularization to prevent overfitting) and max_iter = 1500 (maximum iterations for convergence).

    The architectural design allows the system to capture complex and heterogeneous patterns within peptide sequences, making the framework highly effective for distinguishing between bitter and non-bitter peptides.
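
The following sketch outlines the two-layer pipeline just described: 7 encodings × 8 classifiers of base learners evaluated with 10-fold cross-validation, performance-based filtering (MCC > 0.8, accuracy > 90%), a meta-dataset of soft probabilities, and a grid-searched Logistic Regression meta-learner. It is illustrative scikit-learn code with random stand-in feature matrices, not the authors' implementation; with random features, no learner will actually pass the filter.

```python
# Illustrative scikit-learn sketch of the two-layer stacking pipeline (not the authors' code).
# `features` maps each encoding name to a training feature matrix; the matrices here are
# random stand-ins, so with this toy data no base learner will pass the filter.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, matthews_corrcoef
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
y = np.array([0, 1] * 256)                                    # 512 labels (stand-in)
features = {name: rng.random((512, dim)) for name, dim in
            [("ESM", 320), ("DPC", 400), ("AAE", 60), ("AAI", 36),
             ("GTPC", 125), ("CTD", 147), ("BPNC", 200)]}

classifiers = {
    "SVM": SVC(probability=True), "DT": DecisionTreeClassifier(), "NB": GaussianNB(),
    "KNN": KNeighborsClassifier(), "LR": LogisticRegression(max_iter=1500),
    "RF": RandomForestClassifier(), "ADA": AdaBoostClassifier(),
    "MLP": MLPClassifier(max_iter=500),
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
selected, meta_columns = [], []
for feat_name, X in features.items():                         # 7 encodings x 8 classifiers = 56
    for clf_name, clf in classifiers.items():
        proba = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")[:, 1]
        pred = (proba >= 0.5).astype(int)
        if matthews_corrcoef(y, pred) > 0.8 and accuracy_score(y, pred) > 0.90:
            selected.append(f"{feat_name}_{clf_name}")         # performance-based filtering
            meta_columns.append(proba)

# Meta-dataset: one soft-probability column per selected base learner (8 in the paper).
if meta_columns:
    meta_X = np.column_stack(meta_columns)
    # Meta-learner: LR tuned by grid search (best reported: penalty='l2', max_iter=1500).
    grid = GridSearchCV(LogisticRegression(), {"penalty": ["l2"], "C": [0.1, 1, 10],
                                               "max_iter": [1500]}, cv=cv)
    grid.fit(meta_X, y)
    print(selected, grid.best_params_)
```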

5. Experimental Setup

5.1. Datasets

The study utilized a carefully curated benchmark dataset to ensure robustness, reproducibility, and fair evaluation.

  • Primary Dataset: BTP640 dataset.
    • Source: Widely accepted in prior research and collected from multiple peer-reviewed studies [1, 13-18, 36].
    • Composition: Comprises 320 experimentally validated bitter peptides and 320 non-bitter peptides, resulting in a perfectly balanced dataset of 640 total peptides.
    • Curation:
      • Peptides containing ambiguous amino acid residues (X, B, U, Z) were excluded.
      • Duplicate sequences were removed to prevent data redundancy and overfitting.
      • Non-bitter peptides were randomly selected from the BIOPEP database [42] to address the scarcity of experimentally validated negative samples.
  • Dataset Split: An 8:2 ratio was used to divide the BTP640 dataset into training and independent test sets, a common practice in ML-based peptide classification.
    • Training Set (BTP-CV): 512 peptides (256 bitter, 256 non-bitter). Used for 10-fold cross-validation and training base learners and the meta-learner.
    • Independent Test Set (BTP-TS): 128 peptides (64 bitter, 64 non-bitter). Used for final, unbiased evaluation of the iBitter-Stack model.
    • Stratified sampling ensured class balance in both subsets.
  • Data Example: A peptide sequence is a string of characters representing amino acids, e.g., "IVY". These sequences are then converted into numerical features.
  • Justification for Choice: The BTP640 dataset is a standardized benchmark, promoting transparency and direct comparison with existing state-of-the-art methods, including iBitter-SCM and BERT4Bitter.
  • Availability: The dataset and source code are publicly available at https://github.com/Shoombuatong/Dataset-Code/tree/master/iBitter and http://pmlab.pythonanywhere.com/BERT4Bitter.
  • Additional Experiment: Sequence Similarity Filtering (Appendix A):
    • To address concerns about sequence redundancy, an additional experiment was conducted.
    • Procedure: Pairwise global alignment was used to filter out peptides with ≥ 80% sequence identity within and across the train-test boundary.
    • Resulting Dataset Size: Reduced from 640 to 514 peptides.
      • Training set: 428 peptides (219 bitter, 209 non-bitter)
      • Test set: 86 peptides (44 bitter, 42 non-bitter)
    • Rationale: The 80% threshold balanced redundancy reduction with preserving sufficient dataset size, as more aggressive thresholds drastically reduced data, undermining statistical robustness.
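
A possible form of the similarity filter is sketched below using Biopython's global pairwise aligner; the exact identity definition (identical aligned positions divided by the shorter peptide's length) is an assumption, as the paper only states that pairwise global alignment with an 80% threshold was used.

```python
# Sketch of an 80% sequence-identity filter with Biopython's global pairwise aligner.
# The identity definition used here is an assumption, not the authors' stated formula.
from Bio import Align

aligner = Align.PairwiseAligner()
aligner.mode = "global"

def identity(a: str, b: str) -> float:
    alignment = aligner.align(a, b)[0]
    matches = sum(x == y for x, y in zip(alignment[0], alignment[1])
                  if x != "-" and y != "-")
    return matches / min(len(a), len(b))

def filter_redundant(train, test, threshold=0.80):
    """Drop test peptides sharing >= threshold identity with any training peptide."""
    return [t for t in test if all(identity(t, tr) < threshold for tr in train)]

print(filter_redundant(["IVYPNG"], ["IVYPNG", "AGDDAPR"]))  # ['AGDDAPR']
```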

5.2. Evaluation Metrics

The performance of the model was evaluated using several standard metrics commonly used in peptide classification tasks. Let TP be True Positives, TN be True Negatives, FP be False Positives, and FN be False Negatives.

  1. Accuracy (ACC):

    • Conceptual Definition: Measures the overall proportion of correctly classified instances (both bitter and non-bitter peptides) out of all instances. It indicates the general correctness of the model's predictions.
    • Mathematical Formula: $ \mathrm { A C C } = { \frac { \mathrm { T P } + \mathrm { T N } } { \mathrm { T P } + \mathrm { T N } + \mathrm { F P } + \mathrm { F N } } } $
    • Symbol Explanation:
      • TP: True Positives (correctly identified bitter peptides).
      • TN: True Negatives (correctly identified non-bitter peptides).
      • FP: False Positives (non-bitter peptides incorrectly identified as bitter).
      • FN: False Negatives (bitter peptides incorrectly identified as non-bitter).
  2. Sensitivity (Sn) (also known as Recall or True Positive Rate):

    • Conceptual Definition: Measures the proportion of actual bitter peptides that are correctly identified by the model. It quantifies the model's ability to avoid false negatives.
    • Mathematical Formula: $ \mathrm { S n } = { \frac { \mathrm { T P } } { \mathrm { T P } + \mathrm { F N } } } $
    • Symbol Explanation:
      • TP: True Positives.
      • FN: False Negatives.
  3. Specificity (Sp) (also known as True Negative Rate):

    • Conceptual Definition: Measures the proportion of actual non-bitter peptides that are correctly identified by the model. It quantifies the model's ability to avoid false positives.
    • Mathematical Formula: $ \mathrm { S p } = { \frac { \mathrm { T N } } { \mathrm { T N } + \mathrm { F P } } } $
    • Symbol Explanation:
      • TN: True Negatives.
      • FP: False Positives.
  4. Matthews Correlation Coefficient (MCC):

    • Conceptual Definition: A robust and reliable metric for binary classification, especially valuable for imbalanced datasets, but also informative for balanced ones. It considers all four confusion matrix categories (TP, TN, FP, FN) and produces a value between -1 (perfect inverse prediction) and +1 (perfect prediction), with 0 indicating random prediction. It's considered a balanced measure that can be used even if the classes are of very different sizes.
    • Mathematical Formula: $ \mathrm { M C C } = { \frac { \mathrm { T P } \times \mathrm { T N } - \mathrm { F P } \times \mathrm { F N } } { \sqrt { ( \mathrm { T P } + \mathrm { F P } ) ( \mathrm { T P } + \mathrm { F N } ) ( \mathrm { T N } + \mathrm { F P } ) ( \mathrm { T N } + \mathrm { F N } ) } } } $
    • Symbol Explanation:
      • TP: True Positives.
      • TN: True Negatives.
      • FP: False Positives.
      • FN: False Negatives.
  5. Area Under the Receiver Operating Characteristic (AUROC):

    • Conceptual Definition: A threshold-independent metric that quantifies the model's ability to distinguish between positive and negative classes across all possible classification thresholds. The ROC curve plots the True Positive Rate (Sensitivity) against the False Positive Rate (1 - Specificity) at various threshold settings. An AUROC of 1.0 indicates a perfect classifier, while 0.5 suggests performance no better than random chance.
    • Mathematical Formula: The AUROC is the area under the ROC curve. There is no single closed-form expression in terms of TP, TN, FP, and FN alone; it is obtained by integrating the ROC curve. For a finite set of predictions, it equals the fraction of positive-negative pairs in which the positive sample receives the higher score: $ \mathrm{AUROC} = \frac{\sum_{j=1}^{N_1} \sum_{i=1}^{N_0} \mathbf{1}(P_j > P_i)}{N_0 \cdot N_1} $ where:
      • $N_0$ is the number of negative samples and $N_1$ is the number of positive samples.
      • $P_i$ is the predicted probability assigned to negative sample $i$, and $P_j$ is the predicted probability assigned to positive sample $j$.
      • $\mathbf{1}(P_j > P_i)$ is an indicator function that equals 1 when the positive sample is ranked above the negative sample, and 0 otherwise. This counts how often a randomly chosen positive example is ranked higher than a randomly chosen negative example.
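
The metrics above can be computed directly from predictions with scikit-learn, as in the short sketch below (the labels and probabilities are stand-ins).

```python
# Sketch: computing the five reported metrics from predicted probabilities with scikit-learn.
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, matthews_corrcoef,
                             roc_auc_score)

y_true = np.array([1, 1, 0, 0, 1, 0, 1, 0])
y_prob = np.array([0.92, 0.78, 0.30, 0.12, 0.55, 0.65, 0.81, 0.08])  # stand-in probabilities
y_pred = (y_prob >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
acc = accuracy_score(y_true, y_pred)           # (TP + TN) / total
sn = tp / (tp + fn)                            # sensitivity / recall
sp = tn / (tn + fp)                            # specificity
mcc = matthews_corrcoef(y_true, y_pred)
auroc = roc_auc_score(y_true, y_prob)          # threshold-independent ranking quality

print(f"ACC={acc:.3f}  Sn={sn:.3f}  Sp={sp:.3f}  MCC={mcc:.3f}  AUROC={auroc:.3f}")
```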

5.3. Baselines

The proposed iBitter-Stack model was compared against several state-of-the-art models for bitter peptide identification, representing different evolutionary stages and methodological approaches in the field. These baselines are chosen for their established performance and to demonstrate the advancements made by iBitter-Stack. The models used for comparison, as listed in Tables 1, 5, and 6 of the paper, include:

  • iBitter-SCM [23]: An early sequence-based predictor using a Scoring Card Method and dipeptide propensity scores.

  • BERT4Bitter [24]: A deep learning model leveraging BERT embeddings and a Bi-LSTM for NLP-inspired sequence analysis.

  • iBitter-Fuse [31]: An ML pipeline that integrates multi-view features (compositional and physicochemical properties) with an SVM classifier.

  • iBitter-DRLF [32]: Incorporates deep representation learning features with LightGBM.

  • UniDL4BioPep [33]: A universal deep learning architecture using ESM-2 embeddings with CNNs.

  • Bitter-RF [34]: A Random Forest model based on physicochemical sequence features.

  • iBitter-GRE [35]: A stacking ensemble model that combines ESM-2 embeddings and biochemical descriptors with Gradient Boosting, Random Forest, and Extra Trees as base classifiers, and Logistic Regression as the meta-classifier. This is a particularly relevant baseline as it also uses ESM-2 and an ensemble approach.

    These baselines collectively represent a spectrum of computational methodologies, from traditional ML with handcrafted features to deep learning and ensemble approaches, providing a comprehensive context for evaluating iBitter-Stack's performance.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate the superior performance and robustness of the iBitter-Stack model across various evaluation metrics and comparison scenarios. The analysis focuses on the performance of individual base learners, the selection of optimal base learners, and the overall effectiveness of the stacked meta-learner against state-of-the-art models.

6.1.1. Performance Evaluation of Base Learners

To understand the contribution of individual feature-classifier combinations, 56 base learners were evaluated using MCC and Accuracy.

The following figure (Figure 3 from the original paper) is a heatmap showing the MCC and accuracy metrics for different models.

The figure is a heatmap of MCC and accuracy for the different feature-classifier models; performance varies widely, with ESM_RF reaching the highest MCC (0.854) at an accuracy of 0.920.

As observed in Fig. 3, models leveraging ESM embeddings in combination with ensemble classifiers such as Random Forest (RF), Support Vector Machine (SVM), and Multilayer Perceptron (MLP) consistently achieved high performance. For instance, the ESM_RF model recorded the highest MCC of 0.85 and an accuracy of 92%. In contrast, models built upon AAI and GTPC features generally exhibited lower performance, indicating that these features alone are less discriminative for bitter peptide identification.

The following figure (Figure 4 from the original paper) is a box plot showing the distribution of MCC and Accuracy values across all base learners.

Figure 4: Box plot showing the distribution of MCC and Accuracy values across all base learners. MCC values cluster lower and spread more widely, while accuracy values are comparatively high, highlighting performance differences among the models.

Fig. 4, a box plot of MCC and Accuracy values, further illustrates this. While the median accuracy across models was relatively high with a narrow inter-quartile range, MCC showed a broader spread, with some models significantly underperforming. This variability in MCC (despite a balanced dataset) highlights its sensitivity to false positives and false negatives, offering a more nuanced view of reliability than accuracy alone. This initial evaluation guided the stringent selection of base learners for the meta-stacking phase.

6.1.2. Identification of Optimal Base Learners for Meta-Modeling

Based on the selection criteria (MCC > 0.8 and Accuracy > 90%), eight base learners were chosen for the meta-learning phase:

  • ESM_RF

  • ESM_SVM

  • ESM_MLP

  • ESM_LR

  • ESM_ADA (AdaBoost)

  • CTD_MLP

  • CTD_SVM

  • AAI_RF

    This selection indicates that ESM-derived models are dominant but the inclusion of CTD and AAI-based models confirms the value of feature diversity and complementary information from alternative descriptors. The paper also mentions a restricted meta-learner ESM_Stack (using only the five ESM-based models), which performed competitively but consistently lower than the full iBitter-Stack, underscoring the benefit of diverse feature inclusion. The selected classifiers (RF, AdaBoost, MLP) are known for handling nonlinear patterns.

A qualitative analysis showed that while ESM-based models often agreed, CTD and AAI-based learners provided valuable complementary signals, especially for ambiguous cases, reinforcing the importance of integrating orthogonal features. Each selected model outputs a probability score (soft output); these are concatenated to form an 8-dimensional vector for each peptide, creating the meta-dataset. A Logistic Regression (LR) model was selected as the meta-learner due to its superior performance in combining soft predictions. Its hyperparameters were optimized to penalty = 'l2' and max_iter = 1500.

6.1.3. Performance Evaluation of the Stacked Meta-Learner

6.1.3.1. Performance Comparison with Base Learners (10-Fold Cross-Validation)

The following are the results from Table 3 of the original paper:

| Model | Acc (%) | Sn (%) | Sp (%) | MCC | AUROC |
|---|---|---|---|---|---|
| ESM_SVM | 85.5 | 85.9 | 85.1 | 0.71 | 0.85 |
| ESM_RF | 83.4 | 82.8 | 84.0 | 0.67 | 0.83 |
| ESM_MLP | 83.6 | 85.1 | 82.1 | 0.67 | 0.83 |
| ESM_LR | 83.6 | 83.9 | 83.2 | 0.67 | 0.83 |
| CTD_MLP | 81.1 | 80.4 | 81.7 | 0.62 | 0.81 |
| ESM_ADA | 83.0 | 80.4 | 85.6 | 0.66 | 0.83 |
| CTD_SVM | 83.2 | 83.2 | 83.3 | 0.66 | 0.83 |
| AAI_RF | 78.5 | 79.7 | 77.3 | 0.57 | 0.78 |
| iBitter-Stack | 99.8 | 100.0 | 99.6 | 0.99 | 0.99 |

Table 3 shows a comparison of iBitter-Stack with the individual base learners during 10-fold cross-validation. The meta-learner (iBitter-Stack) achieved near-perfect results with an Accuracy of 99.8%, Sensitivity of 100.0%, Specificity of 99.6%, MCC of 0.99, and AUROC of 0.99. This significantly surpasses the performance of any single base learner, whose MCC values ranged from 0.57 to 0.71. This dramatic improvement underscores the effectiveness of the stacked ensemble approach in integrating diverse decision boundaries and generalizing patterns learned by individual models.

6.1.3.2. Performance Comparison with Base Learners (Independent Test Set)

The following are the results from Table 4 of the original paper:

| Model | Acc (%) | Sn (%) | Sp (%) | MCC | AUROC |
|---|---|---|---|---|---|
| ESM_SVM | 92.2 | 92.2 | 92.2 | 0.84 | 0.92 |
| ESM_RF | 92.2 | 89.1 | 95.3 | 0.84 | 0.92 |
| ESM_MLP | 91.4 | 85.9 | 96.9 | 0.83 | 0.91 |
| ESM_LR | 91.4 | 90.6 | 92.2 | 0.82 | 0.91 |
| CTD_MLP | 89.8 | 87.5 | 92.2 | 0.79 | 0.89 |
| ESM_ADA | 89.1 | 90.6 | 87.5 | 0.78 | 0.89 |
| CTD_SVM | 89.1 | 85.9 | 92.2 | 0.78 | 0.89 |
| AAI_RF | 89.8 | 90.6 | 89.1 | 0.79 | 0.89 |
| ESM_Stack | 92.9 | 91.0 | 95.1 | 0.86 | 0.98 |
| iBitter-Stack | 96.1 | 95.4 | 97.2 | 0.92 | 0.98 |

Table 4 presents the independent test set results. iBitter-Stack maintained high performance with an Accuracy of 96.1%, Sensitivity of 95.4%, Specificity of 97.2%, MCC of 0.92, and AUROC of 0.98. While some base learners like ESM_SVM and ESM_RF also showed strong performance on the independent test set (e.g., MCC of 0.84), iBitter-Stack still outperformed them. The ESM_Stack (ensemble of only ESM-based models) also performed well (MCC 0.86), but iBitter-Stack's inclusion of CTD and AAI-based models further improved overall performance, reaching the highest MCC. The improved performance of individual base learners on the independent test set compared to 10-fold cross-validation suggests potential limitations in capturing broader generalization across diverse data splits in the CV setting.

The following figure (Figure 5 from the original paper) shows the classification results of different models for bitter peptides, featuring eight subplots: ESM, AAE, AAI, BPNC, CTD, DPC, GTPC, and Meta-Dataset.

The figure shows t-SNE projections for eight representations (ESM, AAE, AAI, BPNC, CTD, DPC, GTPC, and the Meta-Dataset); in each subplot, blue points denote non-bitter peptides and orange points denote bitter peptides.

To visualize the discriminative power, t-SNE analysis was performed (Fig. 5). It shows that individual features like AAE, DPC, and GTPC result in high overlap between bitter (orange) and non-bitter (blue) peptide classes, indicating limited separability. In stark contrast, the final 8-dimensional meta-dataset (generated from the soft probabilities of selected base learners) achieved the most distinct clustering, with clear margins and tight groupings. This visual evidence supports that the stacked representation effectively captures a more abstract and highly discriminative decision space, explaining the meta-learner's superior performance.
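
A t-SNE projection of the kind shown in Fig. 5 can be produced as in the sketch below; the random matrix stands in for the 8-dimensional meta-dataset, so the real plot's class separation will of course not be reproduced.

```python
# Sketch of a t-SNE projection like Fig. 5: reduce a feature matrix (e.g. the 8-dim
# meta-dataset) to 2-D and colour points by class. The data here are stand-ins.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
meta_X = rng.random((512, 8))          # stand-in for the 8-dim soft-probability vectors
y = np.array([0, 1] * 256)             # 0 = non-bitter, 1 = bitter

coords = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(meta_X)

plt.scatter(coords[y == 0, 0], coords[y == 0, 1], s=10, label="non-bitter")
plt.scatter(coords[y == 1, 0], coords[y == 1, 1], s=10, label="bitter")
plt.legend()
plt.title("t-SNE projection of the meta-dataset (illustrative)")
plt.show()
```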

6.1.4. Comparison with State-of-the-Art Models

6.1.4.1. 10-Fold Cross-Validation Comparison

The following are the results from Table 5 of the original paper:

| Model | Acc (%) | Sn (%) | Sp (%) | MCC | AUROC |
|---|---|---|---|---|---|
| iBitter-SCM [23] | 87.0 | 91.0 | 83.0 | 0.75 | 0.90 |
| BERT4Bitter [24] | 86.0 | 87.0 | 85.0 | 0.73 | 0.92 |
| iBitter-Fuse [31] | 92.0 | 92.0 | 92.0 | 0.84 | 0.94 |
| iBitter-DRLF [32] | 89.0 | 89.0 | 89.0 | 0.78 | 0.95 |
| Bitter-RF [34] | 85.0 | 86.0 | 84.0 | 0.70 | 0.93 |
| iBitter-GRE [35] | 86.3 | 85.5 | 87.1 | 0.73 | 0.92 |
| iBitter-Stack | 99.8 | 100.0 | 99.6 | 0.99 | 0.99 |

In the 10-fold cross-validation setting (Table 5), iBitter-Stack achieved near-perfect scores (Accuracy 99.8%, MCC 0.99, AUROC 0.99), significantly outperforming all prior state-of-the-art models. Traditional models like iBitter-SCM and BERT4Bitter had MCCs under 0.75, while even more advanced models like iBitter-Fuse (MCC 0.84) and iBitter-DRLF (MCC 0.78) were substantially lower. Notably, iBitter-GRE, a stacking ensemble model, achieved an MCC of 0.73, highlighting iBitter-Stack's superior generalization ability across cross-validation folds.

6.1.4.2. Independent Test Set Comparison

The following are the results from Table 6 of the original paper:

| Model | Acc (%) | Sn (%) | Sp (%) | MCC | AUROC |
|---|---|---|---|---|---|
| iBitter-SCM [23] | 84.0 | 84.0 | 84.0 | 0.69 | 0.90 |
| BERT4Bitter [24] | 92.2 | 93.8 | 90.6 | 0.84 | 0.96 |
| iBitter-Fuse [31] | 93.0 | 94.0 | 92.0 | 0.86 | 0.93 |
| iBitter-DRLF [32] | 94.0 | 92.0 | 96.9 | 0.89 | 0.97 |
| UniDL4BioPep [33] | 93.8 | 92.4 | 95.2 | 0.87 | 0.98 |
| Bitter-RF [34] | 94.0 | 94.0 | 94.0 | 0.88 | 0.98 |
| iBitter-GRE [35] | 96.1 | 98.4 | 93.8 | 0.92 | 0.97 |
| iBitter-Stack (proposed) | 96.1 | 95.4 | 97.2 | 0.92 | 0.98 |

On the independent test set (Table 6), iBitter-Stack achieved an Accuracy of 96.1%, MCC of 0.92, and AUROC of 0.98. It matched the highest MCC score with iBitter-GRE but demonstrated a more balanced sensitivity-specificity trade-off (95.4% Sensitivity, 97.2% Specificity) compared to iBitter-GRE (98.4% Sensitivity, 93.8% Specificity). This balance suggests better control over both false positives and false negatives, which is critical for reliability in real-world applications. The architectural innovations of iBitter-Stack—including its diverse base learner pool, selective ensemble strategy, and meta-level fusion of soft probability vectors—contribute to its competitive predictive performance and enhanced modularity and extensibility.

The following figure (Figure 6 from the original paper) shows the Receiver Operating Characteristic (ROC) Curve for the Proposed Model on the Independent Test Set.

Figure 6: Receiver Operating Characteristic (ROC) Curve for the Proposed Model on the Independent Test Set (AUROC = 0.981), indicating strong classification ability.

The ROC curve (Fig. 6) for iBitter-Stack on the independent test set shows an AUROC of 0.981, demonstrating exceptional discriminatory ability. This high AUROC confirms the model's capacity to maintain a strong true positive rate while minimizing false positives, which is essential for peptide screening. The consistently high MCC across all evaluations reflects the model's robustness and generalizability.
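
A minimal sketch of how the ROC curve and the accompanying metrics can be reproduced from a fitted meta-learner's probabilities, reusing the hypothetical `meta_learner`, `meta_test`, and `y_te` objects from the stacking sketch earlier in this analysis:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import matthews_corrcoef, roc_auc_score, roc_curve

# Probability of the positive (bitter) class on the independent test set.
scores = meta_learner.predict_proba(meta_test)[:, 1]

fpr, tpr, _ = roc_curve(y_te, scores)
auroc = roc_auc_score(y_te, scores)
mcc = matthews_corrcoef(y_te, meta_learner.predict(meta_test))
print(f"AUROC = {auroc:.3f}, MCC = {mcc:.3f}")

plt.plot(fpr, tpr, label=f"stacked model (AUROC = {auroc:.3f})")
plt.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance diagonal
plt.xlabel("False positive rate")
plt.ylabel("True positive rate (sensitivity)")
plt.legend(frameon=False)
plt.show()
```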

6.1.5. Performance After Sequence Similarity Filtering (Appendix A)

The following are the results from Table A.7 of the original paper:

| Model | Acc (%) | Sn (%) | Sp (%) | MCC | AUROC |
| --- | --- | --- | --- | --- | --- |
| Proposed (Unfiltered) | 96.1 | 95.4 | 97.2 | 0.92 | 0.98 |
| Proposed (Filtered, 80%) | 95.3 | 95.3 | 95.3 | 0.91 | 0.98 |

Table A.7 presents the performance of iBitter-Stack on the independent test set after applying an 80% sequence identity filter. Even with a reduced and slightly imbalanced dataset, the model maintained strong performance: Accuracy of 95.3%, MCC of 0.91, and AUROC of 0.98. These results are only marginally lower than the unfiltered case, confirming the model's robustness and that its high performance reflects genuine learning of discriminative sequence patterns rather than potential overlap effects between training and testing sets. Interestingly, the selection of base learners changed slightly for the filtered dataset, with BPNC_RF and ESM_KNN replacing some original top performers, showcasing the adaptability of the selection process.
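
The filtering tool used by the authors is not named in this section; the snippet below only illustrates the idea of dropping test peptides that exceed a similarity threshold against any training peptide, using difflib's ratio as a crude stand-in for alignment-based sequence identity (in practice a dedicated tool such as CD-HIT-2D would normally be used).

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude similarity proxy in [0, 1]; a real pipeline would compute
    alignment-based sequence identity instead."""
    return SequenceMatcher(None, a, b).ratio()

def filter_test_set(train_seqs, test_seqs, threshold=0.80):
    """Keep only test peptides below the threshold against every training peptide."""
    return [
        t for t in test_seqs
        if all(similarity(t, s) < threshold for s in train_seqs)
    ]

# Hypothetical toy peptides: the exact duplicate of a training sequence is removed.
train = ["GPFPIIV", "RGPFPIIV", "AGFK"]
test = ["GPFPIIV", "LLEYSL", "KAGF"]
print(filter_test_set(train, test))  # ['LLEYSL', 'KAGF']
```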

6.2. Data Presentation (Tables)

All relevant tables from the paper have been transcribed and presented in the subsections above, ensuring no data summarization or cherry-picking. This includes Table 1 (Performance of Existing Bitter Peptide Prediction Methods), Table 2 (Binary Profile of Amino Acids in BPNC Representation), Table 3 (10-Fold Cross-Validation Results: Meta-Learner vs. Base Learners), Table 4 (Independent Test Set Results: Meta-Learner vs. Base Learners), Table 5 (Comparison with Prior State-of-the-Art Models (10-Fold Cross-Validation)), Table 6 (Comparison with Prior State-of-the-Art Models (Independent Test Set)), and Table A.7 (Performance of Proposed Model Before and After Sequence Similarity Filtering).

6.3. Ablation Studies / Parameter Analysis

While the paper does not present a formal ablation study in the sense of systematically removing each component and re-evaluating the model, it implicitly assesses the contribution of different components through several comparisons:

  • Comparison of iBitter-Stack with individual base learners (Tables 3 and 4): This demonstrates the significant performance gain achieved by the stacking ensemble over any single feature-classifier combination. It shows that the ensemble effect is crucial.

  • Comparison of iBitter-Stack with ESM_Stack (Table 4): ESM_Stack is a restricted meta-learner built only from ESM-based base learners. Its performance (MCC 0.86) is competitive but lower than the full iBitter-Stack (MCC 0.92). This implicitly acts as an ablation by showing the added value of including CTD- and AAI-based base learners alongside ESM models, validating the importance of feature diversity beyond just PLM embeddings.

  • Visualization with t-SNE (Fig. 5): This visual analysis effectively serves as a qualitative ablation of feature types. It demonstrates that the 8-dimensional meta-dataset (the output of the base learners before the final meta-learner) creates a much clearer separation between bitter and non-bitter peptides compared to any individual feature type (AAE, DPC, GTPC). This highlights how the ensemble's combined output is more discriminative than individual raw features.

  • Hyperparameter Optimization for Meta-Learner: The paper states that hyperparameters for the Logistic Regression meta-learner were optimized via grid search, with penalty = 'l2' and max_iter = 1500 identified as the best configuration. This indicates a parameter analysis was performed for the meta-learner itself, ensuring its optimal performance within the ensemble; a minimal sketch of such a grid search is shown after this list.

These comparisons and analyses collectively demonstrate the effectiveness of the proposed multi-representation and ensemble learning strategy, even without a conventional, explicit ablation study section.
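
A minimal sketch of such a grid search, assuming scikit-learn; apart from penalty = 'l2' and max_iter = 1500, the grid values are assumptions chosen for illustration, not the paper's actual search space.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import GridSearchCV

# Hypothetical grid around the configuration reported in the paper.
param_grid = {
    "penalty": ["l2"],
    "C": [0.01, 0.1, 1.0, 10.0],   # assumed regularization strengths
    "max_iter": [500, 1000, 1500],
}
search = GridSearchCV(
    LogisticRegression(solver="lbfgs"),
    param_grid,
    cv=10,
    scoring=make_scorer(matthews_corrcoef),
)
# `meta_train` and `y_tr` are the 8-dimensional probability vectors and labels
# from the stacking sketch earlier in this analysis.
# search.fit(meta_train, y_tr)
# print(search.best_params_)
```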

7. Conclusion & Reflections

7.1. Conclusion Summary

This study successfully proposed iBitter-Stack, a novel stacking-based ensemble learning framework for the accurate identification of bitter peptides. The model's strength lies in its comprehensive integration of seven heterogeneous feature representations, including advanced contextual embeddings from Protein Language Models (ESM) and various handcrafted physicochemical descriptors. These features were combined with eight diverse machine learning classifiers to construct a pool of 56 base learners. A rigorous performance-based filtering strategy (MCC > 0.8 and Accuracy > 90%) then selected the most effective base learners, whose soft probability outputs formed an 8-dimensional meta-dataset. This meta-dataset was then fed into a Logistic Regression meta-learner to produce the final, refined predictions.
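
The performance-based filtering step can be written in a few lines; the per-learner metrics below are hypothetical values used only to illustrate applying the reported thresholds.

```python
# Hypothetical cross-validation metrics for a handful of the 56 base learners.
cv_metrics = {
    "ESM_SVM": {"acc": 93.0, "mcc": 0.86},
    "ESM_RF":  {"acc": 92.5, "mcc": 0.85},
    "AAE_KNN": {"acc": 84.0, "mcc": 0.68},
    "DPC_DT":  {"acc": 81.5, "mcc": 0.63},
}

# Keep only learners passing both thresholds; their soft probabilities become
# the columns of the 8-dimensional meta-dataset.
selected = [
    name for name, m in cv_metrics.items()
    if m["mcc"] > 0.8 and m["acc"] > 90.0
]
print(selected)  # ['ESM_SVM', 'ESM_RF']
```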

Extensive evaluations using 10-fold cross-validation and an independent test set demonstrated that iBitter-Stack consistently and significantly outperformed individual models and existing state-of-the-art predictors. Specifically, on the independent test set, it achieved an accuracy of 96.1%, an MCC of 0.922, and an AUROC of 0.981, showcasing its strong discriminative ability and generalization capabilities. The model's robustness was further validated by maintaining high performance even after stringent sequence similarity filtering (at an 80% identity threshold) between training and testing sets. The availability of a user-friendly web server (ibitter-stack-webserver.streamlit.app) makes this powerful tool accessible for real-time peptide screening.

7.2. Limitations & Future Work

The authors acknowledge several limitations and suggest future research directions:

  • Data Scarcity: A key challenge remains the limited availability of experimentally validated bitter and non-bitter peptides. This scarcity motivated the use of an ensemble of lightweight predictors to mitigate overfitting on small datasets.
  • Stricter Similarity Thresholds: While 80% sequence identity filtering confirmed robustness, future studies could explore even stricter thresholds (e.g., 70%) to address residual overlap, especially for short peptides, while balancing the trade-off between data quality and quantity.
  • Future Applications of Modular Framework: The modular framework of iBitter-Stack can be extended to related bioactivity prediction tasks, such as:
    • Bitterness intensity prediction (a regression task instead of classification).
    • Peptide solubility classification.
    • Functional motif detection.
  • Integration with Deep End-to-End Learning: Future work might incorporate deep end-to-end learning for further automation and scalability, potentially moving beyond handcrafted features in some aspects while still retaining the benefits of comprehensive representation.
  • Broader Peptide Annotation Efforts: The authors emphasize the need for broader peptide annotation efforts to support future advancements in the field, addressing the fundamental data limitation.

7.3. Personal Insights & Critique

This paper presents a highly robust and well-designed machine learning solution for a challenging bioinformatics problem. Several aspects are particularly insightful:

  • Power of Multi-Representation: The explicit emphasis and systematic integration of diverse feature representations (PLM embeddings and handcrafted physicochemical descriptors) is a critical strength. It highlights that even with powerful deep learning features like ESM, domain-specific handcrafted features still provide unique and complementary information, especially in fields like bioactivity prediction where specific biochemical properties are crucial. This challenges the notion that end-to-end deep learning always obviates the need for feature engineering.
  • Systematic Ensemble Construction: The rigorous process of generating 56 base learners and then selectively choosing the top performers based on strict metrics (MCC > 0.8, Accuracy > 90%) is a significant methodological improvement over ad-hoc ensemble designs. This systematic approach enhances the reliability and interpretability of the ensemble.
  • Soft Probability Fusion: Using soft probability outputs from base learners as input to the meta-learner is more sophisticated than simply concatenating raw features. It allows the meta-learner to weigh the confidence of each base learner, leading to more nuanced and potentially more accurate final predictions, as vividly demonstrated by the t-SNE visualization of the meta-dataset.
  • Practical Utility: The development of a freely accessible web server is commendable. It transforms the research output into a practical tool, immediately benefiting researchers in food science, drug discovery, and biochemistry by enabling real-time screening. This direct application of research is a strong indicator of its potential impact.
  • Addressing Data Redundancy: The additional experiment with 80% sequence identity filtering in the appendix adds significant methodological rigor. It proactively addresses a common critique in bioinformatics ML—that high performance might be due to information leakage from highly similar sequences in training and test sets. The sustained high performance post-filtering strengthens confidence in the model's generalizability.

Potential Issues/Areas for Improvement:

  • Computational Cost: While the paper mentions using a smaller ESM-2 variant, training 56 base learners (even with 10-fold cross-validation for hyperparameter tuning) can still be computationally intensive. For wider adoption, the efficiency of training and inference, especially for very large peptide libraries, might become a factor.

  • Interpretability of Meta-Learner: While Logistic Regression is relatively interpretable, the 8-dimensional probability vector input, derived from complex interactions, still makes the exact "reason" for a prediction opaque. Further work on explainable AI (XAI) techniques could provide insights into which base learners or feature types contribute most to specific predictions, enhancing trust and understanding.

  • Generalizability to Other Bioactivities: While the modular framework is suggested for other bioactivities, the current selection criteria for base learners and the specific meta-learner might need re-tuning for different tasks. The "universality" would require validation across a broader range of peptide functions.

  • Handling Imbalance in Other Datasets: Although the BTP640 dataset is balanced, the 80% filtered dataset introduced a slight imbalance. While MCC handles imbalance well, future applications on inherently imbalanced datasets (common in drug discovery) might require more explicit imbalance-handling techniques at the base learner or meta-learner level (e.g., oversampling, undersampling, cost-sensitive learning).

Overall, iBitter-Stack represents a significant advancement in bitter peptide identification, offering a well-justified, highly effective, and practically deployable solution. Its methodology provides valuable lessons for designing robust ML models in bioinformatics, especially when combining deep learning with domain-specific knowledge.
