
Explainable Machine Learning and Deep Learning Models for Predicting TAS2R-Bitter Molecule Interactions

Published: 2025-10-09
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This study developed explainable machine learning and deep learning models to predict interactions between bitter molecules and TAS2R receptors, enhancing ligand selection and understanding of receptor functions, with significant implications for drug design and disease research.

Abstract

This work aims to develop explainable models to predict the interactions between bitter molecules and TAS2Rs via traditional machine-learning and deep-learning methods starting from experimentally validated data. Bitterness is one of the five basic taste modalities that can be perceived by humans and other mammals. It is mediated by a family of G protein-coupled receptors (GPCRs), namely taste receptor type 2 (TAS2R) or bitter taste receptors. Furthermore, TAS2Rs participate in numerous functions beyond the gustatory system and have implications for various diseases due to their expression in various extra-oral tissues. For this reason, predicting the specific ligand-TAS2Rs interactions can be useful not only in the field of taste perception but also in the broader context of drug design. Considering that in-vitro screening of potential TAS2R ligands is expensive and time-consuming, machine learning (ML) and deep learning (DL) emerged as powerful tools to assist in the selection of ligands and targets for experimental studies and enhance our understanding of bitter receptor roles. In this context, ML and DL models developed in this work are both characterized by high performance and easy applicability. Furthermore, they can be synergistically integrated to enhance model explainability and facilitate the interpretation of results. Hence, the presented models promote a comprehensive understanding of the molecular characteristics of bitter compounds and the design of novel bitterants tailored to target specific TAS2Rs of interest.


In-depth Reading


1. Bibliographic Information

1.1. Title

Explainable Machine Learning and Deep Learning Models for Predicting TAS2R-Bitter Molecule Interactions

1.2. Authors

  • Francesco Ferri

  • Marco Cannariato

  • Lorenzo Pallante

  • Eric A. Zizzi

  • Marcello Miceli

  • Giacomo di Benedetto

  • Marco A. Deriu

    Affiliations include the Leibniz Institute for Food Systems Biology at the Technical University of Munich, Politecnico di Torino (Turin, Italy), and 7hc srl (Rome, Italy).

1.3. Journal/Conference

The paper does not explicitly state a journal or conference. However, given the nature of the research and the mention of arXiv in the references, it is likely a preprint or submitted work related to computational chemistry, bioinformatics, or machine learning in life sciences.

1.4. Publication Year

2025 (Published at UTC: 2025-10-09T00:00:00.000Z).

1.5. Abstract

This work focuses on developing explainable models to predict the interactions between bitter molecules and Taste Receptor Type 2 (TAS2R) proteins. These models utilize both traditional machine learning (TML) and deep learning (DL) methods, trained on experimentally validated data. Bitterness, a fundamental taste modality, is mediated by G protein-coupled receptors (GPCRs) known as TAS2Rs, which also play diverse roles in extra-oral tissues and disease. Given the expense and time involved in in-vitro screening of TAS2R ligands, ML and DL offer powerful computational alternatives for ligand/target selection and understanding receptor function. The developed ML and DL models boast high performance and easy applicability. Crucially, they are designed to be synergistically integrated to enhance model explainability, facilitating the interpretation of results. Ultimately, these models aim to deepen the understanding of bitter compound characteristics and aid in the design of novel bitterants tailored to specific TAS2Rs.


2. Executive Summary

2.1. Background & Motivation

The core problem the paper addresses is the challenge of predicting interactions between bitter molecules and TAS2R proteins. TAS2Rs (Taste Receptor Type 2) are a family of G protein-coupled receptors (GPCRs) responsible for sensing bitterness. Understanding these interactions is critical not only for taste perception but also for broader applications like drug design, as TAS2Rs are expressed in various extra-oral tissues and are implicated in numerous physiological functions and diseases (e.g., inflammatory response, respiratory immunity, obesity, diabetes, asthma, cancer).

The importance of this problem stems from the current methods for identifying TAS2R targets for compounds, which are laborious and costly in-vitro assays. This makes in-vitro screening expensive and time-consuming, leading to a limited amount of available data on TAS2R-ligand interactions, despite efforts to centralize data in databases like BitterDB.

The paper's entry point or innovative idea lies in leveraging machine learning (ML) and deep learning (DL) as powerful, cost-effective, and scalable tools to overcome these limitations. Specifically, the authors focus on developing models that are not just performant but also explainable. This addresses a key challenge in ML/DL—their "black-box" nature—which often makes it difficult to understand why a model makes a particular prediction. By providing interpretability, the models can offer valuable insights into the molecular features governing TAS2R-ligand interactions, thereby assisting in the rational design of new compounds.

2.2. Main Contributions / Findings

The paper makes several primary contributions:

  • Development of Two Complementary Models: The authors developed two distinct yet complementary models for predicting TAS2R-bitter molecule interactions: one using a Traditional Machine Learning (TML) approach (specifically, Gradient Boosting on Decision Trees with CatBoost) and another using Graph Convolutional Neural Networks (GCNs).

  • Emphasis on Explainability: A significant contribution is the integration of explainability methods for both model types. For TML, this includes CatBoost's intrinsic feature importance and SHAP (SHapley Additive exPlanations). For GCNs, GNNExplainer and Grad-CAM (specifically UGrad-CAM) are employed to provide visual and structural insights into predictions.

  • High Performance and Applicability: Both models demonstrate high predictive performance on a challenging, imbalanced dataset of experimentally validated TAS2R-ligand interactions. They are designed for easy applicability, allowing prediction for new molecules based on their SMILES representation within the model's applicability domain.

  • Comprehensive Understanding of Molecular Characteristics: The explainability features of the models facilitate a deeper understanding of the molecular characteristics that drive bitter taste perception and TAS2R activation. For instance, GCN explainability can highlight specific atoms or bonds crucial for interaction.

  • Guidance for Novel Bitterant Design: By elucidating the molecular features underlying interactions, the models promote the rational design of novel bitterants (compounds that produce a bitter taste) or bitter taste modulators tailored to target specific TAS2Rs of interest.

    Key findings include:

  • The TML model (Gradient Boosting on Decision Trees) achieved strong performance (ROC AUC 0.92, PR AUC 0.75) and demonstrated higher precision for the positive class compared to the GCN.

  • The GCN model also performed well (ROC AUC 0.88, PR AUC 0.67) and offered more visually impactful, direct explanations at the molecular structure level.

  • Explainability methods revealed that promiscuous TAS2Rs significantly influence predictions towards positive associations in the TML model. For the GCN, specific structural motifs (e.g., tertiary amines, partial charges) were identified as key drivers of interaction predictions, aligning with experimental evidence.

  • The TML and GCN models, while having slightly different strengths (TML for overall metrics, GCN for visual interpretability), are considered complementary tools.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully understand this paper, a reader should be familiar with several key concepts from biology, chemistry, and machine learning:

  • Taste Receptor Type 2 (TAS2R): These are a family of specialized G protein-coupled receptors (GPCRs) primarily responsible for detecting bitter taste. Humans have 25 different TAS2R subtypes, each responding to a range of bitter compounds. Beyond the tongue, TAS2Rs are also found in various extra-oral tissues (e.g., respiratory tract, gut, brain), where they mediate diverse physiological functions unrelated to taste, such as immune response modulation or hormone secretion.
  • G protein-coupled receptors (GPCRs): These are the largest and most diverse group of membrane receptors in eukaryotes. They act as an "inbox" for messages in the form of light energy, peptides, lipids, sugars, and proteins. When a ligand (e.g., a bitter molecule) binds to a GPCR, it causes a conformational change that activates a G protein inside the cell, initiating a cascade of intracellular signaling pathways. This process allows cells to respond to a wide variety of extracellular signals.
  • Ligand: In biochemistry, a ligand is a molecule that binds to another molecule, typically a larger one, to form a complex. In this context, bitter molecules are ligands that bind to TAS2R receptor proteins.
  • In-vitro screening: This refers to experimental procedures conducted in a controlled environment outside of a living organism, typically using cells or biological components in test tubes or petri dishes. For TAS2R ligands, in-vitro assays involve exposing receptors (often expressed in cell lines) to candidate molecules to see if they elicit a response. This process is often costly and time-consuming, motivating computational approaches.
  • Machine Learning (ML): A subfield of artificial intelligence that enables systems to learn from data without explicit programming. ML algorithms build a model from example data (training data) to make predictions or decisions without being explicitly programmed to perform the task. In this paper, ML is used for binary classification, predicting whether a molecule will interact (class 1) or not (class 0) with a TAS2R.
  • Deep Learning (DL): A subset of ML that uses artificial neural networks with multiple layers (hence "deep"). DL models are particularly effective at learning complex patterns from large datasets and have shown superior performance in tasks like image recognition, natural language processing, and, as in this paper, molecular interaction prediction.
  • Explainable AI (XAI): A field of artificial intelligence focused on making AI models understandable to humans. Many ML and DL models, especially deep neural networks, are considered "black-boxes" because their decision-making processes are opaque. XAI aims to develop methods that allow users to comprehend the rationale behind an AI system's output, identify its strengths and weaknesses, and build trust. This paper explicitly targets explainability for its models.
  • Canonical SMILES (Simplified Molecular-Input Line-Entry System): A line notation that allows a user to represent a chemical structure using a short ASCII string. It's a standard way to encode molecular structures in a computer-readable format. "Canonical" means there's a unique SMILES string for each molecule, regardless of how it's drawn, ensuring consistency.
  • Molecular Fingerprints: These are compact, numerical representations of chemical structures, often as binary vectors (0s and 1s). Each bit in the vector typically corresponds to the presence or absence of a specific substructural feature (e.g., a particular atom type, bond arrangement, or functional group) within the molecule. Morgan fingerprints are a type of circular fingerprint that encodes structural information around each atom up to a certain radius.
  • Molecular Descriptors: These are numerical values that quantify various physicochemical properties or structural characteristics of a molecule (e.g., molecular weight, logP for lipophilicity, topological indices, electronic properties). Mordred is a Python library used to calculate a vast array of such descriptors.
  • Graph Neural Networks (GNNs): A class of deep learning methods designed to operate on data structured as graphs. In chemistry, GNNs are particularly powerful for molecular data because molecules can be naturally represented as graphs, where atoms are nodes and chemical bonds are edges. GNNs learn node embeddings (numerical representations of atoms) by iteratively aggregating information from neighboring nodes and edges, effectively capturing both local and global structural information.
  • One-hot encoding: A common technique to convert categorical (non-numerical) data into a numerical format that ML algorithms can process. For a categorical feature with $k$ unique values, one-hot encoding creates $k$ new binary features (0 or 1). For example, with 22 TAS2Rs, each receptor is represented by a vector of length 22 that is all zeros except for a single 1 at the position corresponding to that specific receptor.
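The one-hot encoding described in the last bullet can be sketched in a few lines of numpy (the receptor identifiers below are placeholders, not the paper's exact set of 22 TAS2Rs):

```python
import numpy as np

def one_hot_receptor(receptor: str, receptors: list[str]) -> np.ndarray:
    """Binary vector of length len(receptors) with a single 1 at the receptor's index."""
    vec = np.zeros(len(receptors), dtype=int)
    vec[receptors.index(receptor)] = 1
    return vec

# Placeholder identifiers; the paper uses the 22 human TAS2Rs with known agonists.
receptors = [f"TAS2R{i}" for i in range(1, 23)]
vec = one_hot_receptor("TAS2R5", receptors)
```

Each bitterant-receptor pair can then carry this vector alongside the molecule's features, letting a single model handle all receptors at once.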

3.2. Previous Works

The paper contextualizes its research by reviewing existing computational approaches for bitter taste prediction and TAS2R interaction.

  • Bitter/Non-bitter Classification: Earlier models focused on classifying compounds as generally bitter or non-bitter:

    • Zheng et al., 2018: One of the foundational works in using ML for bitter taste prediction.
    • BitterIntense (Margulis et al., 2021): An ML model for predicting the intensity of bitterness.
    • BitterCNN (Bo et al., 2022): Utilizes Convolutional Neural Networks (CNNs) for bitterant prediction, often showing improved performance over traditional ML.
    • VirtuousSweetBitter (Maroni et al., 2022): An explainable ML model for classifying sweeteners/bitterants, highlighting the growing trend towards interpretability.
  • Specific TAS2R Target Prediction: More relevant to the current paper, these models aim to predict which TAS2R a bitter compound interacts with:

    • BitterX (Huang et al., 2016): Uses a Support Vector Machine (SVM) model trained on a reduced and balanced dataset. SVMs are powerful ML algorithms for classification, finding a hyperplane that best separates data points into classes.
    • BitterSweet (Tuwani et al., 2019): A web server that offers predictions on bitterant-TAS2R associations. However, the paper notes a lack of detailed information on its specific predictive model development for associations in its original publication.
    • BitterMatch (Margulis et al., 2022): This is highlighted as the most recent and comparable work. It employs Gradient Boosting (GB) on Decision Trees (DTs) (specifically XGBoost) trained on data from BitterDB. Gradient Boosting is an ensemble ML technique where multiple weak prediction models (like decision trees) are combined to form a stronger model. Each new model in the ensemble tries to correct the errors of the previous ones.
  • Explainability Methods for ML/DL:

    • SHAP (SHapley Additive exPlanations) (Lundberg & Lee, 2017): A widely used model-agnostic XAI method. It assigns an importance value (SHAP value) to each feature for a particular prediction, based on game theory concepts (Shapley values). It explains how much each feature contributes to pushing the prediction from the base value to the model's output.
    • GNNExplainer (Ying et al., 2019): A model-agnostic method designed specifically for Graph Neural Networks to generate explanations for their predictions by identifying crucial nodes and edges.
    • Grad-CAM (Selvaraju et al., 2020): Originally for CNNs in image classification, it generates visual explanations by using the gradients of the target concept flowing into the final convolutional layer to produce a localization map highlighting important regions in the input. UGrad-CAM is a generalization for graphs.
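To make the Gradient Boosting idea concrete (each new weak learner fits the residual errors of the current ensemble), here is a minimal toy regression sketch with decision stumps in plain numpy; it is illustrative only and is not the XGBoost/CatBoost implementation used by BitterMatch or this paper:

```python
import numpy as np

def fit_stump(x, residual):
    """Best single-threshold split on a 1-D feature, minimizing squared error."""
    best = None
    for t in np.unique(x):
        left, right = residual[x <= t], residual[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lv, rv = best
    return lambda q: np.where(q <= t, lv, rv)

def gradient_boost(x, y, n_rounds=20, lr=0.3):
    """Each round fits a stump to the current residuals (the negative gradient
    of squared loss) and adds it to the ensemble, damped by a learning rate."""
    pred = np.full_like(y, y.mean(), dtype=float)
    stumps = []
    for _ in range(n_rounds):
        stump = fit_stump(x, y - pred)
        pred += lr * stump(x)
        stumps.append(stump)
    return lambda q: y.mean() + lr * sum(s(q) for s in stumps)

x = np.array([0., 1., 2., 3., 4., 5.])
y = np.array([0., 0., 0., 1., 1., 1.])
model = gradient_boost(x, y)
```

After a few rounds the ensemble reproduces the step function almost exactly, because each stump corrects a fraction of the remaining error.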

3.3. Technological Evolution

The evolution of technology in this field can be seen as progressing through several stages:

  1. Early efforts: Focused on simply classifying molecules as bitter or non-bitter, often using traditional physicochemical descriptors and simpler ML algorithms.
  2. Specificity: Moving beyond general bitterness to predicting interactions with specific TAS2R subtypes, recognizing the diverse roles and ligand specificities of individual receptors. This often involved more sophisticated ML techniques like SVMs and Gradient Boosting.
  3. Deep Learning Integration: Adoption of Deep Learning models, particularly Graph Neural Networks (GNNs), which are naturally suited for representing molecular structures as graphs. This often leads to improved performance by automatically learning complex features from raw molecular graphs.
  4. Explainability and Interpretability: The most recent trend, and a key focus of this paper, is to move beyond "black-box" predictions towards explainable AI. This is crucial for gaining scientific insights, building trust in AI models, and enabling rational design rather than just prediction. This paper places itself firmly in this fourth stage.

3.4. Differentiation Analysis

Compared to the main methods in related work, the core differences and innovations of this paper's approach are:

  • Dual-Model Approach with Complementary Strengths: Unlike most previous works that focus on a single ML paradigm, this paper develops two distinct models (TML and GCN), acknowledging their different strengths (e.g., TML for overall statistical performance, GCN for direct visual molecular interpretability). This offers a more robust and comprehensive toolkit.
  • Integrated Explainability: While some previous works (e.g., VirtuousSweetBitter) incorporate explainability, this paper rigorously applies XAI methods (SHAP, CatBoost importance, GNNExplainer, UGrad-CAM) to both TML and GCN models, demonstrating how they can be synergistically integrated to enhance understanding. This goes beyond mere prediction to provide actionable insights into molecular features.
  • Direct Molecular-Level Explainability for GCNs: The GCN model's ability to directly highlight important atoms and bonds (UGrad-CAM and GNNExplainer) provides a visually impactful and chemically intuitive explanation that is often harder to achieve with descriptor-based TML models, where feature importance might be attributed to abstract numerical descriptors.
  • Easy Applicability: The models are designed for straightforward application to new molecules using only their SMILES representation, within an applicability domain framework. This makes them user-friendly and practical for researchers.
  • Enhanced Dataset: The dataset is expanded beyond previous works like BitterMatch by incorporating newer literature, aiming for a more comprehensive training set.
  • Performance: While competitive with state-of-the-art models like BitterMatch, the paper emphasizes the added value of explainability and complementary approaches rather than solely focusing on marginal performance gains.

4. Methodology

4.1. Principles

The core principle of this work is to develop highly performant and easily applicable predictive models for TAS2R-bitter molecule interactions, while simultaneously ensuring that these models are explainable. This dual objective allows researchers not only to predict whether an interaction will occur but also to understand why it occurs, by identifying the key molecular features and structural motifs driving the interaction. The problem is framed as a binary classification task, where the input is a bitterant-TAS2R pair, and the output is a label indicating a positive (binding, class 1) or negative (non-binding, class 0) association. Two distinct machine learning paradigms—Traditional Machine Learning (TML) and Graph Convolutional Neural Networks (GCNs)—are employed to leverage their respective strengths and provide complementary insights.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Dataset Acquisition and Preprocessing

The foundation of the models is a comprehensive dataset of TAS2R-bitter molecule interactions.

  • Data Sources: The primary source is the BitterMatch dataset (Margulis et al., 2022), which itself is derived from the BitterDB dataset (Dagan-Wiener et al., 2019). This base dataset provided 301 molecules and 3204 known associations. To enrich the dataset, an additional 37 molecules (760 known associations) were gathered from recent scientific literature.
  • Dataset Composition: The final dataset comprises 338 unique bitter molecules and their known interactions with 22 out of the 25 human TAS2R receptors. (The remaining three, TAS2R45, TAS2R48, and TAS2R60, are orphan receptors with no known agonists).
  • Interaction Labeling: Positive associations (molecule-receptor pairs known to interact) are labeled as class 1, while negative interactions (molecules known not to bind a specific receptor) are labeled as class 0. Only uniquely known and in-vitro verified interactions are included.
  • Total Associations: This results in a total of 3964 paired associations (bitterant-TAS2R pairs).
  • Data Imbalance: A significant characteristic of the dataset is its imbalance, with the number of class 1 (binding) instances being approximately five times greater than class 0 (non-binding) instances. This imbalance is acknowledged as a challenge for model training and evaluation.
  • Molecular Encoding:
    • Molecules: Represented as Canonical SMILES strings, obtained from BitterDB or PubChem. SMILES (Simplified Molecular-Input Line-Entry System) is a textual notation for describing chemical structures.
    • Receptors: Represented using one-hot encoding. For 22 TAS2R receptors, each receptor is converted into a binary vector of length 22, where only one element is 1 and the rest are 0, uniquely identifying that receptor.
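A minimal sketch of how a labeled bitterant-TAS2R pair might be represented in code (the class and field names are hypothetical, and the example labels are illustrative, not taken from the dataset):

```python
from dataclasses import dataclass

@dataclass
class Association:
    smiles: str    # canonical SMILES of the bitter molecule
    receptor: str  # e.g. "TAS2R46"
    label: int     # 1 = known binder, 0 = in-vitro verified non-binder

# Illustrative entries only; real labels come from BitterDB and recent literature.
pairs = [
    Association("CN1CCC[C@H]1c1cccnc1", "TAS2R46", 1),  # nicotine (label made up here)
    Association("CN1CCC[C@H]1c1cccnc1", "TAS2R1", 0),
]
positives = sum(p.label for p in pairs)
```

In the actual dataset such pairs number 3964, with roughly five positives for every negative, which is why class imbalance must be handled during training and evaluation.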

4.2.2. Interaction Prediction using a Traditional Machine-Learning (TML) Approach

The TML workflow (Figure 1 in the original paper) involves several steps:

Figure 1. Traditional Machine-Learning (TML) workflow. The pipeline starts from the expanded dataset, processes ligands and receptors (Morgan fingerprints, sorted descriptors, correlation filtering), and ends with model evaluation and explanation.

  • Molecular Standardization: SMILES strings of molecules are standardized using the ChEMBL structure pipeline. This process ensures consistency in molecular representation by normalizing chemical structures (e.g., canonicalizing tautomers, neutralizing charges, removing salts).
  • Feature Engineering:
    • Ligand Features: Two types of features are extracted for bitter molecules:
      • Morgan fingerprints: These are computed using the RDKit Python package. They are a type of circular fingerprint that describes the presence of specific structural patterns around each atom in a molecule. The parameters used are 1024 bits (determining the length of the binary vector) and radius 2 (determining how far from each atom the algorithm looks for structural information).
      • Molecular descriptors: A wide array of physicochemical and structural descriptors are calculated using the Mordred Python library. These include properties like molecular weight, logP, topological indices, electronic properties, etc.
    • Receptor Features: The one-hot encoded representation of the TAS2R receptor is directly used as a feature.
  • Feature Preprocessing:
    • Correlation Filtering: To reduce redundancy and multicollinearity, all descriptors with more than 90% correlation with other descriptors are removed.
    • Normalization: All non-binary data (i.e., continuous molecular descriptors) is scaled using Min-Max normalization. This transforms the values into a predefined range, typically between 0 and 1, which helps ML algorithms converge faster and prevents features with larger numerical ranges from dominating the learning process. The formula for Min-Max normalization is: $A' = \frac{A - \min(A)}{\max(A) - \min(A)} \times (D - C) + C$ Where:
      • $A'$ is the normalized value of the data point.
      • $A$ is the original value of the data point.
      • $\min(A)$ and $\max(A)$ are the smallest and largest values of feature $A$ in the original dataset.
      • $C$ is the lower bound of the desired normalized range (here, $C = 0$).
      • $D$ is the upper bound of the desired normalized range (here, $D = 1$).
  • Model Selection: Several traditional machine learning algorithms were compared: Gaussian Naive Bayes, Logistic Regression, K-Neighbors, Support Vector Machines (SVM), Random Forest, and Gradient Boosting on Decision Trees (GB on DTs). GB on DTs was selected due to its superior performance (Figure S1).
  • Algorithm Implementation: CatBoost (Dorogush et al., 2018), an open-source library for Gradient Boosting on Decision Trees, was specifically employed. CatBoost is known for its ability to handle categorical features directly, resilience to overfitting, and high performance.
  • Data Splitting: To ensure a representative distribution of chemical space in both training and test sets, a clustering approach is used before the split:
    • Agglomerative clustering (using the complete linkage algorithm) is applied to group the data into $n$ clusters.
    • The Tanimoto distance (Rogers & Tanimoto, 1960), computed from Morgan fingerprints, serves as the distance metric for clustering.
    • The optimal number of clusters ($n$) is determined using Silhouette score analysis.
    • Once clustered, data entries from each cluster are split into training (80%) and test (20%) sets, with stratification over the class labels (ensuring similar proportions of positive/negative samples in each split).
  • Feature Selection: With an initial 2824 ligand-based features, dimensionality reduction is crucial. Two methods were compared:
    • "Noisy" Feature Selection: An iterative technique where a column of pseudo-random numbers ("noisy" feature) is added to the dataset. After training a tree-based classifier, features with Gini importance lower than the "noisy" feature are systematically removed until only more informative features remain.
    • Sequential Feature Selection (SFS): A greedy algorithm. Backward-SFS was used, starting with an initial set of features (the 150 most important features according to CatBoostClassifier's tree-based importance metric) and iteratively removing the least impactful feature. The average precision (AP) using 5-fold cross-validation (CV) served as the criterion for removal. The final number of features was determined a posteriori. Scikit-learn was used for this.
  • Training and Evaluation: The CatBoostClassifier is trained on the training set using 10-fold CV (cross-validation) for robust parameter tuning and performance estimation. The final model's performance is then evaluated on the independent test set.
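The preprocessing steps above — Tanimoto distance on fingerprints, correlation filtering at the 90% threshold, and Min-Max normalization — can each be sketched generically in numpy (the paper itself uses RDKit, scikit-learn, and CatBoost; these re-implementations are for illustration only):

```python
import numpy as np

def tanimoto_distance(fp_a: np.ndarray, fp_b: np.ndarray) -> float:
    """1 - Tanimoto similarity between two binary fingerprint vectors."""
    both = np.logical_and(fp_a, fp_b).sum()
    either = np.logical_or(fp_a, fp_b).sum()
    return 1.0 - both / either if either else 0.0

def correlation_filter(X: np.ndarray, threshold: float = 0.9) -> np.ndarray:
    """Indices of descriptor columns to keep: drop any column whose absolute
    correlation with an already-kept column exceeds the threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return np.array(keep)

def min_max_scale(X: np.ndarray, c: float = 0.0, d: float = 1.0) -> np.ndarray:
    """Column-wise A' = (A - min) / (max - min) * (d - c) + c."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0) * (d - c) + c
```

The Tanimoto distance here would feed the agglomerative clustering used for the 80:20 split, while the other two helpers mirror the descriptor filtering and scaling applied before training.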

4.2.3. Interaction Prediction using Graph Convolutional Neural Networks (GCN)

The GCN framework workflow (Figure 2 in the original paper) also involves distinct steps:

Figure 6. (A) Number of agonists per TAS2R and (B) relative SHAP values for receptor associations. (C, D) SHAP waterfall for strychnine-TAS2R46 (positive association) (C) and strychnine-TAS2R1 (negat… Panel A shows the number of agonists per TAS2R; panel B shows the corresponding SHAP values, with blue bars indicating negative contributions and red bars positive ones. Panels C and D are SHAP waterfall plots for strychnine, showing the features driving the positive and negative associations; E[f(X)] is the model's expected output.

  • Molecular Graph Representation:

    • Molecules, represented by standardized SMILES, are converted into undirected molecular graphs using the NetworkX Python library.
    • In these graphs, atoms represent nodes ($v_i$), and chemical bonds represent edges ($e_{ij}$).
    • Node Features: Each node (atom) is described by a $d$-dimensional feature vector ($x_v$). The specific node features selected (Table S1 in Supplementary Information) include:
      • Mass: Normalized mass of the atom.
      • logP: Atom contribution to the molecule's logP (lipophilicity).
      • MR: Atom contribution to the molecule's Molar Refractivity.
      • EState: Atom contribution to the EState (Electrotopological State) of the molecule, which reflects its electronic environment.
      • ASA: Atom contribution to the Accessible Solvent Area of the molecule.
      • TPSA: Atom contribution to the Topological Polar Surface Area of the molecule, related to drug absorption.
      • Partial Charge: Atom partial charge (e.g., Gasteiger charge).
      • Degree: Number of directly bonded neighbors to the atom.
      • Implicit Valence: Number of implicit hydrogens on the atom.
      • nH: Number of total hydrogens on the atom. (Note: Features marked with * are normalized, ° are Boolean, ^ are one-hot encoded).
    • Edge Features: Each edge (bond) is described by a $c$-dimensional feature vector ($x^e_{v,u}$). Edge features (Table S1) include:
      • Single bond
      • Double bond
      • Triple bond
      • Aromatic bond
  • Data Splitting: Similar to the TML approach, the dataset is clustered based on Tanimoto similarity (from Morgan fingerprints) to ensure chemical space representation, and then each cluster is split into 80:20 training and test sets, stratified by class labels.

  • GCN Model Architecture: The model is built using PyTorch and PyTorch Geometric libraries (Figure 3 in original paper).

    Figure 7. (A) GCN's ROC curves for the validation (green; mean and standard deviation during cross-validation, AUC 0.87 ± 0.03) and test (red; AUC 0.88) sets. (B) PR curve on the test set (AUC 0.67).

    • Input: A batch of graphs (molecular representations), each with node features, edge features, and the one-hot encoded receptor associated with that molecule-receptor pair.
    • Graph Convolutional Layers: Two layers employing the GATv2Conv module. GATv2Conv is a variant of the Graph Attention Network (GAT) that uses a self-attention mechanism to compute node embeddings. The attention mechanism allows the network to differentially weigh the importance of neighboring nodes when aggregating information. These layers have 32 and 8 output channels, respectively.
    • Batch Normalization Layers: Applied after each convolutional layer. Batch normalization helps stabilize and accelerate the training of deep neural networks by normalizing the inputs to each layer.
    • Global Mean Pooling: After the convolutional layers, global mean pooling is applied to the node embeddings to create a single graph embedding (a fixed-size vector representation of the entire molecule).
    • Concatenation with Receptor Features: The graph embedding is then concatenated (joined) with the one-hot encoded receptor features. This combined vector forms the input to the subsequent fully connected layers.
    • Fully Connected (FC) Layers: Four FC layers (32, 16, 8, and 4 output units, respectively) map the combined graph-receptor embedding to a lower-dimensional space.
    • Dropout Layers: Two dropout layers are used to prevent overfitting: one with probability 0.1 applied to the input of the FC layers, and another with probability 0.2 applied to the output of the last FC layer. Dropout randomly sets a fraction of input units to zero at each update during training, which forces the network to learn more robust features.
    • Activation Functions:
      • ReLU (Rectified Linear Unit): Used for the hidden units in the FC layers (ReLU(x)=max(0,x)ReLU(x) = \max(0, x)).
      • Sigmoid: Used for the node embeddings.
    • Output Layer: The final FC layer is followed by a linear transformation producing two outputs, which are interpreted as the probabilities of belonging to each class (class 0 or class 1).
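The post-convolution pipeline described above (global mean pooling, concatenation with the receptor one-hot vector, FC layers with ReLU) can be sketched in plain Python to make the vector shapes concrete. All sizes and values below are illustrative, not the authors' implementation (which uses GATv2Conv layers in a GNN framework):

```python
# Plain-Python sketch (illustrative sizes and values, not the authors' code)
# of the post-convolution pipeline: global mean pooling over node embeddings,
# concatenation with the one-hot receptor vector, then a ReLU FC layer.

def global_mean_pool(node_embeddings):
    """Average node embeddings into one fixed-size graph embedding."""
    dim, n = len(node_embeddings[0]), len(node_embeddings)
    return [sum(v[i] for v in node_embeddings) / n for i in range(dim)]

def fc_layer(inputs, weights, biases):
    """One fully connected layer with ReLU activation (weights: out x in)."""
    return [max(0.0, sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

# 3 atoms with 8-dim embeddings (the output size of the second conv layer).
nodes = [[0.1] * 8, [0.3] * 8, [0.5] * 8]
graph_emb = global_mean_pool(nodes)      # 8-dim graph embedding

receptor_onehot = [0] * 22               # 22 human TAS2Rs
receptor_onehot[13] = 1                  # position of the paired receptor (illustrative)

combined = graph_emb + receptor_onehot   # 8 + 22 = 30-dim FC input

# Toy FC layer producing 2 outputs (all-zero weights, biases 0.5 and -0.5).
hidden = fc_layer(combined, [[0.0] * 30 for _ in range(2)], [0.5, -0.5])
print(len(combined), hidden)             # 30 [0.5, 0.0]
```

In the real model this 30-dim vector would pass through the four FC layers (32, 16, 8, 4 units) before the final two-output layer.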

4.2.4. Explainability Methods

The paper integrates explainability into both TML and GCN models:

  • TML Explainability:

    • CatBoost Feature Importance: CatBoost inherently provides individual importance values for each input feature. These values quantify the average change in the model's prediction that results from modifying the value of a specific feature. This provides a direct measure of a feature's relevance to the model's overall decision-making.
    • SHAP (SHapley Additive exPlanations):
      • SHAP is a model-agnostic method based on game theory that assigns a SHAP value to each feature for a particular prediction. A SHAP value represents the contribution of a feature to the prediction, compared to the average prediction for the dataset, by considering all possible combinations of features.
      • It satisfies desirable properties such as consistency: if a model changes so that a feature's contribution increases or stays the same, that feature's SHAP value does not decrease.
      • For tree ensembles such as CatBoost, the SHAP library provides a tree-specific explainer (Lundberg et al., 2019), allowing efficient calculation of optimal local explanations for tree-based models.
      • SHAP allows for both global explanations (e.g., average SHAP values across the dataset, like in Figure 6B) and local explanations (explaining a single prediction, like in Figure 6C, D, shown as SHAP waterfall plots).
  • GCN Explainability:

    • GNNExplainer (Ying et al., 2019):
      • A model-agnostic method specifically designed to generate interpretable explanations for GNN predictions on graph-based machine learning tasks.
      • It provides single-instance explanations, meaning it can explain a prediction for a single molecule (graph) by identifying the most influential subset of nodes and edges within that graph that contribute to the prediction.
      • It can identify both node feature importances (Figure 8A) and edge importances (Figure 8B).
      • Graph Explanation Faithfulness (GEF) Score: The faithfulness of GNNExplainer's explanations is evaluated using the GEF score. This metric quantifies how well the explanation (e.g., the masked subgraph identified as important) preserves the original prediction of the GNN. $ GEF(y, \hat{y}) = 1 - e^{-KL(y || \hat{y})} $ Where:
        • yy is the output probability vector obtained from the original graph.
        • y^\hat{y} is the output probability vector obtained from the masked subgraph (the part identified as important by the explainer).
        • KL is the Kullback-Leibler divergence score, a measure of how one probability distribution (yy) diverges from a second, expected probability distribution (y^\hat{y}). A lower KL divergence means the distributions are more similar.
        • The GEF score ranges from 0 to 1. Values near 0 indicate excellent prediction faithfulness (the explanation accurately reflects the original prediction), while values near 1 indicate very poor faithfulness (the explanation is untrustworthy). The paper notes that scores higher than 0.5 are typically considered untrustworthy.
    • Grad-CAM (Selvaraju et al., 2020) and UGrad-CAM:
      • Grad-CAM was originally developed for image classification to identify salient regions (e.g., pixels) in an image that are most important for a given prediction. It works by computing the gradients of the prediction score with respect to the feature maps of the last convolutional layer.
      • In this work, a generalization to graphs called Unsigned Grad-CAM (UGrad-CAM) (Pope et al., 2019) is employed.
      • UGrad-CAM generates heatmaps on the molecular graph (Figure 8C, D) to visualize the contribution of each node (atom) to the prediction. Red nodes typically indicate a strong contribution towards the predicted class (class 1), while blue nodes indicate a strong contribution towards the opposite class (class 0). This provides visual and chemically intuitive insights into which parts of the molecule are driving the model's decision.
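As a worked example of the GEF metric defined above, the following pure-Python sketch computes it from two output probability vectors (the vectors themselves are hypothetical):

```python
import math

def kl_divergence(y, y_hat, eps=1e-12):
    """KL(y || y_hat) between two discrete probability vectors."""
    return sum(p * math.log((p + eps) / (q + eps)) for p, q in zip(y, y_hat))

def gef_score(y, y_hat):
    """Graph Explanation Faithfulness: 1 - exp(-KL(y || y_hat)).
    Near 0 = faithful explanation; above ~0.5 = untrustworthy."""
    return 1.0 - math.exp(-kl_divergence(y, y_hat))

# Masked subgraph reproduces the original prediction -> GEF = 0 (faithful).
print(gef_score([0.2, 0.8], [0.2, 0.8]))          # 0.0
# Masked subgraph flips the prediction -> GEF well above 0.5 (untrustworthy).
print(round(gef_score([0.2, 0.8], [0.8, 0.2]), 3))
```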

5. Experimental Setup

5.1. Datasets

The study utilized a dataset of bitter molecule-TAS2R receptor interactions.

  • Source: The primary source was BitterMatch's dataset (Margulis et al., 2022), itself derived from BitterDB (Dagan-Wiener et al., 2019). This provided 301 molecules and 3204 associations. An additional 37 molecules (760 associations) were extracted from recent scientific literature (Behrens et al., 2018; Cui et al., 2021; Delompré et al., 2022; Jaggupili et al., 2019; Karolkowski et al., 2023; Lang et al., 2020; Morini et al., 2021; Nouri et al., 2019; Soares et al., 2018).
  • Scale and Characteristics:
    • Total 338 unique bitter molecules.
    • Interactions with 22 human TAS2R receptors (out of 25; TAS2R45, TAS2R48, and TAS2R60 are orphan receptors).
    • Total 3964 paired associations (molecule-receptor pairs).
    • Associations are labeled as class 1 for positive interactions (binding) and class 0 for negative interactions (non-binding), based on in-vitro verified data.
  • Domain: The dataset specifically focuses on bitter molecules and their interactions with TAS2R receptors, relevant to taste perception and broader physiological roles of TAS2Rs.
  • Imbalance: The dataset exhibits a significant imbalance, with approximately five times more negative (class 0) instances than positive (class 1) instances. This is a common challenge in biological datasets and impacts model training and evaluation, particularly for the minority class.
  • Data Sample Example: The paper does not provide a direct visual example of a data sample (e.g., a SMILES string, a one-hot encoded receptor vector, or an entry from the dataset matrix). However, the representation involves Canonical SMILES for molecules (e.g., CN1CC2CCC3C(C1C4=CC=CC=C4OC2)N3C=O) and one-hot encoded vectors for receptors (e.g., [0,0,1,0,...,0] for TAS2R3 if it is the 3rd receptor in the ordered list).
  • Choice of Datasets: These datasets were chosen because they represent the most comprehensive collection of experimentally validated TAS2R-ligand interaction data available, primarily from BitterDB, which is the leading database for taste ligands and receptors. This ensures that the models are trained on real-world, verified biological interactions, which is crucial for their validity and generalizability.
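A paired association as described above can be sketched as a (SMILES, receptor one-hot, label) tuple. The receptor ordering below is illustrative, not the paper's exact list:

```python
# Hypothetical sketch of one molecule-receptor pair: a Canonical SMILES
# string plus a one-hot vector over the 22 ordered human TAS2Rs and a
# binary interaction label. The receptor ordering here is illustrative.

RECEPTORS = ["TAS2R1", "TAS2R3", "TAS2R4", "TAS2R5", "TAS2R7", "TAS2R8",
             "TAS2R9", "TAS2R10", "TAS2R13", "TAS2R14", "TAS2R16",
             "TAS2R19", "TAS2R20", "TAS2R30", "TAS2R31", "TAS2R38",
             "TAS2R39", "TAS2R40", "TAS2R41", "TAS2R42", "TAS2R43",
             "TAS2R46"]

def one_hot_receptor(name):
    """One-hot encode a receptor by its position in the ordered list."""
    vec = [0] * len(RECEPTORS)
    vec[RECEPTORS.index(name)] = 1
    return vec

# One association: caffeine (a known bitterant) paired with TAS2R14, class 1.
pair = ("CN1C=NC2=C1C(=O)N(C(=O)N2C)C", one_hot_receptor("TAS2R14"), 1)

print(len(pair[1]), sum(pair[1]))   # 22 1
```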

5.2. Evaluation Metrics

The performance of the models was evaluated using several standard metrics for binary classification tasks, especially considering the imbalanced nature of the dataset.

First, let's define the fundamental components:

  • True Positive (TP): The number of actual positive cases that are correctly identified by the model as positive.

  • False Negative (FN): The number of actual positive cases that are incorrectly identified by the model as negative.

  • False Positive (FP): The number of actual negative cases that are incorrectly identified by the model as positive.

  • True Negative (TN): The number of actual negative cases that are correctly identified by the model as negative.

    The derived evaluation metrics are:

5.2.1. Precision

Conceptual Definition: Precision measures the accuracy of positive predictions. It answers the question: "Of all the instances the model predicted as positive, how many were actually positive?" It is particularly important when the cost of false positives is high.

Mathematical Formula: $ Precision = \frac{TP}{TP + FP} $

Symbol Explanation:

  • TP: True Positives
  • FP: False Positives

5.2.2. Recall (Sensitivity)

Conceptual Definition: Recall measures the ability of the model to find all the positive samples. It answers the question: "Of all the actual positive instances, how many did the model correctly identify?" It is crucial when the cost of false negatives is high.

Mathematical Formula: $ Recall = \frac{TP}{TP + FN} $

Symbol Explanation:

  • TP: True Positives
  • FN: False Negatives

5.2.3. Specificity

Conceptual Definition: Specificity measures the ability of the model to correctly identify negative samples. It answers the question: "Of all the actual negative instances, how many did the model correctly identify?"

Mathematical Formula: $ Specificity = \frac{TN}{TN + FP} $

Symbol Explanation:

  • TN: True Negatives
  • FP: False Positives

5.2.4. F-beta Score (FβF_{\beta})

Conceptual Definition: The FβF_{\beta} score is a weighted harmonic mean of Precision and Recall. It provides a single score that balances both metrics. The parameter β\beta determines the weight given to Recall relative to Precision.

  • If β=1\beta = 1, it's the F1 score, giving equal weight to Precision and Recall.
  • If β>1\beta > 1, it gives more weight to Recall (e.g., F2 weights Recall twice as much as Precision).
  • If β<1\beta < 1, it gives more weight to Precision (e.g., F0.5 weights Precision twice as much as Recall). The paper states: "A lower β\beta gives less weight to precision, while a higher β\beta gives more weight to it." This phrasing is unconventional: by standard convention, a higher β\beta weights Recall more heavily, while a lower β\beta (below 1) weights Precision more heavily. For example, F2F_2 emphasizes recall, and F0.5F_{0.5} emphasizes precision.

Mathematical Formula: $ F_{\beta} = \frac{(1 + \beta^2) \times Recall \times Precision}{(\beta^2 \times Precision) + Recall} $

Symbol Explanation:

  • β\beta: A non-negative real number that controls the weight of Recall in the score.
  • Recall: The Recall score.
  • Precision: The Precision score.
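The four metrics above (Precision, Recall, Specificity, F-beta) follow directly from confusion-matrix counts. A minimal sketch, using hypothetical counts for the positive class:

```python
# Minimal sketch of the metrics defined above, computed from hypothetical
# confusion-matrix counts for the positive class.

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def specificity(tn, fp):
    return tn / (tn + fp)

def f_beta(p, r, beta):
    """Weighted harmonic mean; beta > 1 emphasizes recall, beta < 1 precision."""
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

tp, fp, tn, fn = 60, 17, 280, 40   # illustrative counts

p, r = precision(tp, fp), recall(tp, fn)
print(round(p, 2), round(r, 2), round(specificity(tn, fp), 2))
print(round(f_beta(p, r, 1.0), 2))   # F1
print(round(f_beta(p, r, 2.0), 2))   # F2 (recall-weighted)
```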

5.2.5. Average Precision (AP)

Conceptual Definition: Average Precision summarizes the Precision-Recall curve into a single value. It is the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight. AP is particularly useful for imbalanced datasets and tasks where correctly identifying positive samples is crucial, as it focuses on the performance of the positive class.

Mathematical Formula: $ Average\ Precision = \sum_n (R_n - R_{n-1}) P_n $

Symbol Explanation:

  • nn: Index for the threshold.
  • RnR_n: Recall at the nn-th threshold.
  • Rn1R_{n-1}: Recall at the (n-1)-th threshold.
  • PnP_n: Precision at the nn-th threshold.
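The AP summation above can be implemented by sweeping thresholds over predictions sorted by decreasing score, weighting each recall increase by the precision at that threshold. A pure-Python sketch with hypothetical labels and scores:

```python
# Pure-Python sketch of the AP formula: AP = sum_n (R_n - R_{n-1}) * P_n,
# with one threshold per prediction in decreasing-score order.
# The labels and scores below are hypothetical.

def average_precision(labels, scores):
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_pos = sum(labels)
    ap, tp, prev_recall = 0.0, 0, 0.0
    for n, i in enumerate(order, start=1):
        tp += labels[i]
        p_n = tp / n                       # Precision at the n-th threshold
        r_n = tp / total_pos               # Recall at the n-th threshold
        ap += (r_n - prev_recall) * p_n    # (R_n - R_{n-1}) * P_n
        prev_recall = r_n
    return ap

labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
print(round(average_precision(labels, scores), 3))   # 0.806
```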

5.2.6. Receiver Operating Characteristic (ROC) AUC

Conceptual Definition: The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate (Recall) against the False Positive Rate (FPR = FP / (FP + TN)) at various threshold settings. The Area Under the ROC Curve (AUC) provides a single scalar value that summarizes the model's ability to discriminate between positive and negative classes across all possible classification thresholds. A higher AUC indicates better discrimination. However, ROC curves can be misleading on highly imbalanced datasets: because the FPR denominator is dominated by the large number of True Negatives, the curve can paint an overly optimistic picture of model performance.

5.2.7. Precision-Recall (PR) AUC

Conceptual Definition: The Precision-Recall (PR) curve plots Precision against Recall at various threshold settings. The Area Under the PR Curve (AUC) summarizes the model's performance on the positive class. PR curves are considered more informative than ROC curves for imbalanced datasets because they focus on the positive class and are sensitive to false positives and false negatives, directly reflecting the model's ability to identify true positive instances. A higher PR AUC indicates better performance for the minority class.

5.3. Baselines

The paper compared its models against several baselines at different stages:

  • For TML Model Selection:
    • Gaussian Naive Bayes (GaussianNB): A probabilistic classifier based on Bayes' theorem, assuming feature independence.
    • Logistic Regression (LR): A linear model used for binary classification, estimating probabilities.
    • K-Neighbors: A non-parametric method for classification based on the distance to kk nearest neighbors in the feature space.
    • Support Vector Machines (SVM): A powerful model that finds an optimal hyperplane to separate classes.
    • Random Forest (RF): An ensemble ML method that builds multiple decision trees and merges their predictions to improve accuracy and reduce overfitting.
    • Gradient Boosting on Decision Trees (GB on DTs): The chosen method, an ensemble technique that builds trees sequentially, with each new tree correcting errors made by previous ones.
  • For Overall Performance Comparison:
    • BitterMatch (Margulis et al., 2022): This is the most recent and relevant state-of-the-art model for TAS2R-ligand interaction prediction. To ensure a fair comparison, the authors retrained BitterMatch using only human TAS2R data (referred to as BM Human-Only), following the official GitHub repository's code. This re-training was done to make its dataset more comparable to the current study, which also exclusively focuses on human receptors.

6. Results & Analysis

6.1. Core Results Analysis

6.1.1. Traditional Machine-Learning (TML) Approach

  • Model Selection: Gradient Boosting on Decision Trees (GB on DTs) (CatBoost) was selected as the optimal TML model after comparing it against Gaussian Naive Bayes, Logistic Regression, K-Neighbors, Support Vector Machines (SVM), and Random Forest. GB on DTs achieved the best ROC AUC (Figure S1).

    Figure S1. ROC curves comparing the candidate TML models (Gaussian Naive Bayes, Logistic Regression, K-Neighbors, SVM, Random Forest, CatBoost) for predicting TAS2R-bitter molecule interactions; AUC values are reported in the legend.

  • CatBoostClassifier Hyperparameters: The tuned hyperparameters for the CatBoostClassifier are detailed in Table S2. The following are the results from Table S2 of the original paper:

    | Boosting Type | Depth | Iterations | Learning Rate | Leaf Estimation Iterations | L2 Leaf Reg | Subsample |
    | --- | --- | --- | --- | --- | --- | --- |
    | Plain | 6 | 1000 | 0.1 | 4 | 3 | 0.7 |
  • Feature Selection:

    • The "noisy" method selected 28 ligand features, while the Backward-Sequential Feature Selection (SFS) method selected 17.

    • Both methods showed similar performance in terms of ROC AUC and PR AUC on the test set (Figure S3).

    • SFS was preferred due to its higher reproducibility and selection of fewer features (17 Mordred descriptors and no ligand fingerprints). This indicates that chemically intuitive descriptors are more informative than generic fingerprints for this task.

      Figure S3. Performance comparison between the "noisy" feature selection and Backward-SFS methods: ROC curves (left, true positive rate vs. false positive rate) and PR curves (right, precision vs. recall). Both methods reach a ROC AUC of 0.92, showing comparable performance.

    • The SFS method's selection process for features is illustrated in Figure S2, showing that average precision peaked with 17 features.

      Figure S2. Average precision (avgP) as a function of the number of selected features; avgP peaks at 0.68 when 17 features are selected.

  • Performance: The final TML model, using features selected by SFS, achieved a ROC AUC of 0.92 and a PR AUC of 0.75 on the test set (Figure 4).

    Figure 8. GCN model explainability for the strychnine-TAS2R46 pair (positive association) using GNNExplainer (A, B) and UGrad-CAM (C, D): node feature importances for the 10 most important features (A), important edges on the molecular structure (B), and UGrad-CAM heatmaps (C, D), where red nodes contribute towards class 1 and blue nodes towards class 0.

  • Feature Importance (Tree-based):

    • Figure 5 shows the tree-based feature importance. The most important features were associations with TAS2R14 and TAS2R46, which are known as the two most promiscuous receptors (binding to many compounds). More selective receptors had lower importance values.

    • The 17 selected ligand descriptors were predominantly Mordred descriptors, with autocorrelation and topological descriptors having the highest occurrence (e.g., GATS1i, ATSC4d, Xpc-5dv, VR2_Dzm).


  • Feature Importance (SHAP):

    • SHAP values (Figure 6) confirmed that associations with promiscuous receptors (TAS2R14, TAS2R46, TAS2R39) biased predictions towards class 1 (positive association), while selective receptors biased towards class 0 (negative association).

    • SHAP waterfall plots for individual predictions (e.g., strychnine-TAS2R46 and strychnine-TAS2R1) revealed that while receptor association often dominates, ligand-based descriptors (like ATSC4d for strychnine-TAS2R1) can also significantly influence predictions.


6.1.2. Graph Convolutional Neural Network (GCN) Approach

  • Performance: The GCN model achieved a ROC AUC of 0.88 and a PR AUC of 0.67 on the test set (Figure 7).


  • Explainability (GNNExplainer & UGrad-CAM):

    • For the strychnine-TAS2R46 positive association, GNNExplainer (Figure 8A, B) highlighted the atomic partial charge and the atomic contribution to the partition coefficient (logP) as important node features, and the bonds around a tertiary amine as important edge features. This aligns with experimental findings that the TAS2R46 interaction involves π-interactions with a benzene ring and hydrogen bonds with a tertiary amine of strychnine.

    • UGrad-CAM heatmaps (Figure 8C) visually showed that the tertiary amine region contributed significantly towards class 1 (binding), while the aromatic ring contributed towards class 0 (non-binding or opposite).

    • Modifying the strychnine structure by removing two carbon atoms near the tertiary amine (Figure 8D) altered the UGrad-CAM pattern, reducing the contribution of the amine towards class 1 and decreasing the overall prediction probability, demonstrating the model's sensitivity to structural changes.


6.1.3. Comparison of TML and GCN Models

The following are the results from Table 1 of the original paper:

| Model | ROC AUC | PR AUC | Class | Precision | Recall | F1 | F2 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| TML | 0.92 | 0.75 | 0 | 0.93 | 0.97 | 0.95 | 0.96 |
| TML |  |  | 1 | 0.78 | 0.60 | 0.68 | 0.63 |
| GCN | 0.88 | 0.67 | 0 | 0.94 | 0.92 | 0.93 | 0.93 |
| GCN |  |  | 1 | 0.62 | 0.67 | 0.64 | 0.66 |
  • Overall Performance: TML generally showed higher performance metrics. It had a higher ROC AUC (0.92 vs 0.88) and PR AUC (0.75 vs 0.67).
  • Class-Specific Performance:
    • For class 0 (negative associations), both models performed comparably well (TML Precision 0.93, Recall 0.97; GCN Precision 0.94, Recall 0.92).
    • For class 1 (positive associations), TML achieved remarkably higher Precision (0.78 vs 0.62), while GCN showed slightly higher Recall (0.67 vs 0.60).
  • Trade-off: This suggests TML prioritized precision (fewer false positives) for the under-represented positive class, whereas GCN prioritized recall (identifying more true positives). The authors attribute this discrepancy to the dataset's imbalance, which GCN might be more sensitive to given its complex architecture.

6.1.4. Comparison with BitterMatch

The following are the results from Table S3 of the original paper:

| Class | Metric | TML | GCN | BM |
| --- | --- | --- | --- | --- |
| 0 | Precision | 0.93 | 0.94 | 0.88 |
| 0 | Recall | 0.97 | 0.92 | 0.96 |
| 0 | F1 | 0.95 | 0.93 | 0.92 |
| 0 | F2 | 0.96 | 0.93 | 0.95 |
| 1 | Precision | 0.78 | 0.62 | 0.75 |
| 1 | Recall | 0.60 | 0.67 | 0.44 |
| 1 | F1 | 0.68 | 0.64 | 0.55 |
| 1 | F2 | 0.63 | 0.66 | 0.48 |
  • A re-trained version of BitterMatch (BM Human-Only) on human data was used for comparison.

  • Overall: All three models (TML, GCN, BM) showed similar PR AUC scores.

  • Class 0 Performance: All models performed similarly well for class 0 (negative associations), with TML and GCN slightly outperforming BM in Precision and F1.

  • Class 1 Performance: For class 1 (positive associations), TML achieved the highest Precision (0.78), and GCN achieved the highest Recall (0.67). BitterMatch lagged behind both developed models in Recall (0.44), F1 (0.55), and F2 (0.48).

  • Advantage of Current Models: Unlike BitterMatch, the presented models can predict for any query molecule within their applicability domain using only its SMILES representation, enhancing their broader utility.

    Precision-recall curves comparing the BM (blue), TML (red), and GCN (green) models, with the AUC of each curve reflecting its predictive ability.

6.2. Data Presentation (Tables)

The following are the results from Table S1 of the original paper:

| # | Node features | Edge features |
| --- | --- | --- |
| 1 | Mass* = normalized mass (on iodine mass) | Single bond° |
| 2 | logP* = atom contribution to logP of the molecule | Double bond |
| 3 | MR* = atom contribution to Molar Refractivity of the molecule | Triple bond |
| 4 | EState* = atom contribution to EState of the molecule | Aromatic bond |
| 5 | ASA* = atom contribution to the Accessible Solvent Area of the molecule |  |
| 6 | TPSA* = atom contribution to the Topological Polar Surface Area of the molecule |  |
| 7 | Partial Charge* = atom partial charge |  |
| 8 | Degree^ = number of directly bonded neighbours of the atom |  |
| 9 | Implicit Valence^ = number of implicit hydrogens on the atom |  |
| 10 | nH^ = number of total hydrogens on the atom |  |

Legend for Table S1: * = normalized with Min-Max normalization to [0, 1]; ° = Boolean feature (0 or 1); ^ = one-hot encoded.
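The two encodings flagged in the legend can be sketched in plain Python: min-max normalization to [0, 1] for continuous atom features (*) and one-hot encoding for integer atom features (^). The atomic masses below are standard values; the degree categories are illustrative:

```python
# Sketch of the Table S1 encodings: min-max normalization for continuous
# atom features (*) and one-hot encoding for integer features (^).

def min_max(values):
    """Scale a list of values linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(value, categories):
    """One-hot encode an integer feature over its possible categories."""
    return [1 if value == c else 0 for c in categories]

# Mass*: normalizing on iodine's mass maps iodine to 1.0 and hydrogen to 0.0.
masses = [12.011, 1.008, 14.007, 126.904]        # C, H, N, I
print([round(m, 3) for m in min_max(masses)])

# Degree^: one-hot over the possible neighbour counts (illustrative range).
print(one_hot(3, [0, 1, 2, 3, 4]))               # [0, 0, 0, 1, 0]
```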

6.3. Ablation Studies / Parameter Analysis

While not explicitly termed "ablation studies," the paper presents several analyses that serve a similar purpose by evaluating the impact of different components or choices:

  • Comparison of Traditional ML Algorithms (Figure S1): This analysis compares the performance of different TML algorithms (GaussianNB, LR, K-Neighbors, SVM, RF, CatBoost) to justify the selection of CatBoost (GB on DTs). This shows the relative effectiveness of different underlying ML paradigms for the task.
  • Comparison of Feature Selection Methods (Figure S3): The paper compares the "noisy" feature selection method with Backward-SFS. The results demonstrate that SFS yields similar performance with a smaller, more reproducible set of features, justifying its choice for the final TML model. This implicitly shows the value of effective feature selection.
  • Hyperparameter Tuning (Table S2): The CatBoostClassifier hyperparameters were tuned, indicating an optimization process to find the best configuration for the chosen TML model.
  • Impact of Structural Alterations (Figure 8D): The GCN explainability section effectively demonstrates a form of sensitivity analysis by showing how removing two carbon atoms from strychnine (a structural alteration) significantly changes the UGrad-CAM explanation and reduces the prediction probability. This indirectly validates that the model's predictions are sensitive to chemically relevant structural changes, confirming the importance of specific molecular motifs for interaction.
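The tuned configuration referenced above (Table S2) can be written out as keyword arguments for CatBoost. A sketch, assuming the parameter names of the public catboost API (the package itself is not imported here):

```python
# Tuned CatBoostClassifier hyperparameters from Table S2, expressed as the
# keyword arguments one would pass to catboost.CatBoostClassifier.
# Parameter names follow the CatBoost API; treat this as a sketch, not the
# authors' code.

catboost_params = {
    "boosting_type": "Plain",
    "depth": 6,
    "iterations": 1000,
    "learning_rate": 0.1,
    "leaf_estimation_iterations": 4,
    "l2_leaf_reg": 3,
    "subsample": 0.7,
}

# model = CatBoostClassifier(**catboost_params)  # requires the catboost package
print(sorted(catboost_params))
```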

6.4. Applicability Domain (AD)

The discussion section describes how the Applicability Domain (AD) of the models is evaluated, ensuring reliability of predictions.

  • Method: An average-similarity approach is used.

    1. Morgan Fingerprints (1024 bits, radius 2) are calculated for all compounds in the training set.
    2. Jaccard similarity index (from RDKit) is computed between each molecule in the test/query set and the training set.
    3. The average similarity score is then calculated by averaging the similarity scores of the 5 most similar compound pairs.
  • Threshold: The distribution of these average similarity scores for the training and test sets is used to define a similarity threshold (Figure S4). Compounds falling outside this threshold are flagged as being outside the model's AD, meaning their predictions might be less reliable.

    Figure S4. Density histogram of Jaccard similarities for train-train pairs (red) and test-train pairs (blue); the dashed line marks the similarity threshold used to define the AD.

  • This AD check is performed before any prediction to ensure the reliability of the model's output for a given query molecule. This is a crucial step for practical application of predictive models.
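The average-similarity AD check described above can be sketched in pure Python. Morgan fingerprints are represented here as sets of "on" bit indices (the bit sets below are hypothetical; in practice they would come from RDKit):

```python
# Pure-Python sketch of the average-similarity AD check: score each query
# by the mean Jaccard similarity to its 5 most similar training compounds,
# then compare against a precomputed threshold. Fingerprint bits are
# hypothetical stand-ins for 1024-bit Morgan fingerprints.

def jaccard(a, b):
    """Jaccard similarity between two fingerprint bit sets."""
    return len(a & b) / len(a | b)

def avg_top5_similarity(query_fp, train_fps):
    """Mean Jaccard similarity to the 5 most similar training compounds."""
    sims = sorted((jaccard(query_fp, fp) for fp in train_fps), reverse=True)
    return sum(sims[:5]) / min(5, len(sims))

def in_applicability_domain(query_fp, train_fps, threshold):
    return avg_top5_similarity(query_fp, train_fps) >= threshold

train = [{1, 4, 7, 9}, {1, 4, 8}, {2, 4, 7}, {1, 7, 9}, {3, 5, 6}, {1, 4, 7}]
query = {1, 4, 7}

score = avg_top5_similarity(query, train)
print(round(score, 2), in_applicability_domain(query, train, threshold=0.3))
```

A query scoring below the threshold would be flagged as outside the AD, signalling that its prediction may be unreliable.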

7. Conclusion & Reflections

7.1. Conclusion Summary

This work successfully introduces a novel approach for predicting interactions between bitter taste receptors (TAS2Rs) and their ligands, utilizing both Traditional Machine Learning (TML) and Graph Convolutional Neural Networks (GCNs). Both model types were specifically designed with explainability in mind, a critical feature often lacking in complex ML/DL models. The TML model, based on Gradient Boosting on Decision Trees (CatBoost), achieved strong predictive performance (ROC AUC 0.92, PR AUC 0.75) and demonstrated high precision for the positive class. The GCN model, while having slightly lower overall performance (ROC AUC 0.88, PR AUC 0.67), offered visually rich and chemically intuitive explanations directly on the molecular structures, highlighting key atoms and bonds. The authors emphasize the complementary nature of these two approaches, providing robust predictions alongside valuable insights into the molecular basis of TAS2R-ligand associations. The models are easy to use, applicable to new molecules within their defined applicability domain, and competitive with state-of-the-art methods like BitterMatch. Ultimately, this research provides powerful tools for in silico identification of promising compounds, with significant potential applications in the food industry (bitter modulators), pharmaceutical sector (masking drug bitterness), and understanding TAS2R functions in extra-oral tissues related to various diseases.

7.2. Limitations & Future Work

The authors candidly acknowledge several limitations of their work and propose future research directions:

  • Dataset Limitations:
    • Paucity, diversity, and imbalance of data: The available data on TAS2R-ligand interactions is scarce, heterogeneous, and heavily skewed towards negative instances. This imbalance poses challenges for training robust models, especially for the minority class, and likely contributes to the observed trade-offs between precision and recall.
    • Limited to bitter molecules: The current models are trained only on bitter compounds, limiting their applicability domain to this specific taste modality.
  • Future Experimental Studies: More experimental studies are needed to elucidate interactions between other bitter compounds or non-bitter chemicals with TAS2Rs. Such data would significantly enhance model performance and broaden their chemical applicability.
  • Lack of 3D Receptor Information: A crucial limitation is the absence of features related to the three-dimensional (3D) structure of the TAS2R receptors. GPCRs are known to have complex binding pockets where ligands interact, and features like binding pocket volume, Solvent Accessible Surface Area (SASA), and radius of gyration could significantly improve predictive accuracy. However, accurate experimental or in silico determination of GPCR structures remains a complex and ongoing challenge.
  • Interpretability of TML Descriptors: While feature importance was provided for the TML model, interpreting the chemical and physical meaning of the 17 selected Mordred descriptors (many of which are autocorrelation or topological indices) can still be challenging for a non-expert.
  • Future Directions:
    • Simpler Descriptors/Methodologies: Develop or utilize simpler molecular descriptors, or specific methodologies, to intuitively relate these descriptors to relative structural features or functional groups, thereby enhancing the TML model's explainability in a more chemically intuitive way.
    • Integrating 3D Receptor Data: Incorporate 3D structural features of TAS2Rs into the models once more accurate structural data becomes available, to better capture the intricacies of ligand binding and recognition.

7.3. Personal Insights & Critique

  • Innovation in Explainability: The paper's strongest point is its rigorous commitment to explainable AI. By applying both model-agnostic (SHAP, GNNExplainer) and model-specific (CatBoost importance, UGrad-CAM) methods, and demonstrating their complementary nature, the authors provide a holistic view of model decisions. This is crucial for gaining scientific trust and guiding rational drug/bitterant design, moving beyond mere prediction to actionable insights. The visual explanations offered by UGrad-CAM on molecular graphs are particularly insightful for chemists and biologists.
  • Complementary Model Approach: The decision to develop and present two distinct ML paradigms (TML and GCN) is a strength. It acknowledges that different ML approaches might excel in different aspects (e.g., TML for overall statistical robustness, GCN for direct structural interpretability), offering a more versatile toolkit for researchers.
  • Robust Methodology: The attention to detail in the TML workflow, such as SMILES standardization, sophisticated feature engineering, clustering before splitting for chemical space representation, and careful feature selection (Backward-SFS), contributes to the robustness and reliability of the TML model.
  • Addressing Imbalance: The explicit recognition and discussion of the dataset imbalance (5x more negative instances) and its impact on performance metrics (e.g., TML favoring precision, GCN favoring recall for the minority class) is a testament to the rigor of the analysis. It highlights a common challenge in real-world biological data.
  • Applicability Domain (AD): The inclusion of an AD check is vital for practical deployment. It ensures that users are aware of the reliability of predictions for novel compounds, preventing extrapolation beyond the model's learned chemical space.
  • Minor Critique on F-beta explanation: The paper's explanation of the F-beta score in the Supplementary Information ("A lower β\beta gives less weight to precision, while a higher β\beta gives more weight to it.") is a bit misleading compared to common conventions, where a higher β\beta actually weights recall more heavily. While a minor point, clarifying this could improve beginner understanding.
  • Future Value and Transferability: The methodologies presented in this paper are highly transferable. The explainable ML/DL framework for ligand-receptor interactions could be readily applied to other GPCRs or other protein families, different taste modalities (sweet, umami, sour, salty), or even drug-target interaction prediction in general. The focus on SMILES input makes it very practical for high-throughput screening and virtual library design. The potential impact on precision nutrition (tailoring food to individual taste perceptions) and nutraceutical development (designing health-promoting compounds) is substantial and exciting. The in silico approach inherently overcomes limitations of cost, time, and ethical concerns associated with traditional in-vitro/in-vivo methods.
