Paper status: completed

Enzyme specificity prediction using cross-attention graph neural networks

Published: 10/08/2025

TL;DR Summary

The paper presents `EZSpecificity`, a cross-attention graph neural network for predicting enzyme substrate specificity. It achieves 91.7% accuracy in identifying reactive substrates, significantly outperforming existing models and enhancing applications in biocatalysis and drug discovery.

Abstract

Enzymes are the molecular machines of life, and a key property that governs their function is substrate specificity—the ability of an enzyme to recognize and selectively act on particular substrates. Here we developed a cross-attention-empowered SE(3)-equivariant graph neural network architecture named EZSpecificity for predicting enzyme substrate specificity, trained on a comprehensive database of enzyme–substrate interactions. Experimental validation showed that EZSpecificity achieved a 91.7% accuracy in identifying the single potential reactive substrate, significantly outperforming existing models.

In-depth Reading


1. Bibliographic Information

1.1. Title

Enzyme specificity prediction using cross-attention graph neural networks

1.2. Authors

Haiyang Cui, Yufeng Su, Tanner J. Dean, Tianhao Yu, Zhengyi Zhang, Jian Peng, Diwakar Shukla, Huimin Zhao

1.3. Journal/Conference

Published in Nature. Nature is one of the most prestigious and influential scientific journals globally, publishing original research across a wide range of scientific disciplines. Its reputation for high-impact, groundbreaking discoveries means that papers published here are considered to be at the forefront of their respective fields and undergo rigorous peer review.

1.4. Publication Year

2025

1.5. Abstract

Enzymes are essential molecular machines in life, and their function is fundamentally governed by substrate specificity—their ability to recognize and selectively act on specific substrates. This paper introduces EZSpecificity, a novel deep learning architecture based on a cross-attention-empowered SE(3)-equivariant graph neural network, designed for predicting enzyme substrate specificity. The model was trained on ESIBank, a newly compiled, comprehensive database of enzyme-substrate interactions that incorporates sequence and structural information. Experimental validation demonstrated that EZSpecificity achieved a 91.7% accuracy in identifying the single potential reactive substrate, significantly surpassing the performance of existing models (e.g., the state-of-the-art model achieved 58.3%).

The official publication source is https://doi.org/10.1038/s41586-025-09697-2.

2. Executive Summary

2.1. Background & Motivation

The core problem addressed by this paper is the accurate prediction of enzyme substrate specificity. Enzymes are critical biological catalysts, but a vast number of known enzymes lack reliable information about their substrate specificities, which significantly impedes their practical application in areas like biocatalysis, drug discovery, and a comprehensive understanding of natural biochemical diversity.

Existing machine learning (ML) tools for enzyme specificity prediction have met with limited success. Specific challenges and gaps in prior research include:

  • Limited Scope: Many existing tools are specific to particular protein families and lack general applicability.

  • Discrimination Issues: Popular enzyme function prediction tools (e.g., CLEAN, ProteInfer, DeepECTransformer) struggle to distinguish enzyme reactivity and substrate specificity within the same Enzyme Commission (EC) numbers, a critical challenge for fine-grained biocatalytic understanding.

  • Data Limitations: Models like ESP (Enzyme Substrate Prediction) are hampered by a limited number of collected substrates (e.g., ~1.3k).

  • Incomplete Feature Utilization: Previous methods often rely on one-dimensional protein sequences or two-dimensional molecular graphs (fingerprint/sequence-based embeddings), failing to fully capture the crucial three-dimensional (3D) nature of substrate binding, the specific micro-environment of active sites, and complex enzyme-substrate atomic interactions. They often reduce enzyme and substrate to separate embeddings before concatenation, which limits their ability to model direct interactions.

    The paper's innovative idea or entry point is to leverage structural information and explicitly model enzyme-substrate interactions in 3D space. This is achieved through a novel deep learning architecture that integrates sequence data, 3D enzyme-substrate complex structures, and active site environment, empowered by an SE(3)-equivariant Graph Neural Network (GNN) and cross-attention mechanisms.

2.2. Main Contributions / Findings

The paper's primary contributions are:

  • Novel Model Architecture (EZSpecificity): Development of a cross-attention-empowered SE(3)-equivariant GNN architecture named EZSpecificity. This model innovatively integrates full-length amino acid representations (from ESM-2), 3D binding pocket environments (modeled by SE(3)-equivariant GNN), and explicit enzyme-substrate atomic interactions (through cross-attention layers).

  • Comprehensive Database (ESIBank): Construction of a high-quality, curated enzyme-substrate interaction dataset (ESIBank). This database is significantly larger (323,783 pairs, 34,417 substrates, 8,124 enzymes) and more diverse than previous datasets, encompassing natural and non-natural substrates, variant enzymes, and structural information.

  • Superior Predictive Performance: Demonstrated that EZSpecificity consistently and significantly outperforms existing state-of-the-art machine learning models (e.g., ESP, CPI) in various in silico benchmarking experiments across different dataset splits (random, unknown substrate, unknown enzyme, unknown enzyme and substrate).

  • Strong Experimental Validation: Achieved a remarkable 91.7% accuracy in identifying the single potential reactive substrate during in vitro experimental validation with eight halogenases and 78 diverse substrates, substantially outperforming the state-of-the-art model ESP (58.3%).

  • Generalizability and Applicability: Showed high transferability across six diverse enzyme families and practical utility in metabolite-enzyme pair prediction within E. coli and biosynthetic gene cluster (BGC) analysis.

    The key conclusions reached are that EZSpecificity represents a general, robust, and highly accurate machine learning model for predicting substrate specificity for a wide range of enzymes. These findings solve the problem of limited accuracy and generalizability in previous models by effectively integrating diverse data types and explicitly modeling crucial 3D interactions.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a novice reader should be familiar with the following fundamental concepts:

  • Enzyme Substrate Specificity: This refers to the ability of an enzyme to bind to and catalyze a reaction on only a limited number of specific substrates. Enzymes are highly selective, meaning their active sites are precisely shaped and chemically configured to interact with particular molecules (substrates) and often to catalyze only one or a few types of reactions. This specificity is crucial for the precise regulation of metabolic pathways in living organisms.
  • Enzyme Commission (EC) Numbers: The EC number is a numerical classification scheme for enzymes, based on the chemical reactions they catalyze. It is a four-level hierarchical classification:
    • EC 1.x.x.x: Oxidoreductases (catalyze oxidation/reduction reactions)
    • EC 2.x.x.x: Transferases (transfer functional groups)
    • EC 3.x.x.x: Hydrolases (catalyze hydrolysis reactions)
    • EC 4.x.x.x: Lyases (cleave bonds by elimination, forming double bonds)
    • EC 5.x.x.x: Isomerases (catalyze isomerization reactions)
    • EC 6.x.x.x: Ligases (catalyze the joining of molecules using ATP)
    Each subsequent digit in the EC number provides more specific information about the type of reaction. For example, EC 1.1.x.x refers to enzymes acting on the CH-OH group of donors, with $\mathrm{NAD}^+$ or $\mathrm{NADP}^+$ as the acceptor.
  • Graph Neural Networks (GNNs): GNNs are a class of neural networks designed to process data represented as graphs. In a graph, data points are nodes (or vertices), and relationships between them are edges. GNNs work by iteratively aggregating information from a node's neighbors (a process called message passing) to update the node's representation (embedding). This allows them to capture structural information and dependencies within the graph.
  • SE(3)-Equivariant GNNs: In molecular modeling, it's crucial that the model's predictions are independent of how the molecule is oriented or positioned in 3D space. This property is called equivariance. An SE(3)-equivariant GNN is a specialized type of GNN where the output of the network transforms in a predictable way (e.g., rotates or translates) if the input graph (representing a 3D molecule) is rotated or translated. This ensures that the model learns intrinsic molecular properties rather than arbitrary spatial orientations, making it robust for tasks involving 3D structures like protein-ligand interactions. SE(3) refers to the special Euclidean group in 3 dimensions, which encompasses rotations and translations.
  • Cross-Attention: Originating from the Transformer architecture, attention mechanisms allow a neural network to focus on the parts of its input that are most relevant for a given task. Cross-attention is a variant used when there are two different sets of inputs (e.g., enzyme features and substrate features). It allows elements from one input (e.g., enzyme amino acids) to query and attend to elements in the other input (e.g., substrate atoms), and vice versa. This generates a blended representation in which each input's elements are "aware" of the most relevant parts of the other, effectively modeling their interaction (a minimal code sketch of this computation appears after this list). The fundamental attention mechanism is calculated as: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:
    • $Q$ (Query), $K$ (Key), $V$ (Value) are matrices derived from the input embeddings. In cross-attention, $Q$ typically comes from one input (e.g., the enzyme) and $K$, $V$ from the other (e.g., the substrate).
    • $QK^T$ calculates the similarity scores between queries and keys.
    • $\sqrt{d_k}$ is a scaling factor that prevents the dot products from becoming too large when the key dimension $d_k$ is high.
    • $\mathrm{softmax}$ normalizes these scores into a probability distribution, indicating how much attention each query should pay to each key.
    • The result is a weighted sum of the Value vectors, where the weights are determined by the attention scores.
  • Multi-layer Perceptron (MLP): An MLP, also known as a feedforward neural network, is a basic type of artificial neural network composed of multiple layers of interconnected nodes (neurons). Each node applies a non-linear activation function to a weighted sum of its inputs, passing the result to the next layer. MLPs are commonly used for tasks like classification and regression.
  • SMILES (Simplified Molecular Input Line Entry System): A line notation that allows a user to represent a chemical structure using short ASCII strings. It's a common way to input and represent chemical molecules in computational chemistry. For example, CCO represents ethanol.
  • AlphaFold/AlphaFill: AlphaFold is an AI system developed by DeepMind that predicts the 3D structure of proteins from their amino acid sequences with high accuracy. AlphaFill is a complementary tool that enriches AlphaFold models by identifying and placing ligands, cofactors, and ions into predicted protein structures based on structural similarity to experimentally determined structures.
  • AutoDock-GPU: A hardware-accelerated version of AutoDock, a suite of molecular docking programs. Molecular docking is a computational method that predicts the preferred orientation of one molecule (e.g., a substrate) to a second (e.g., an enzyme) when bound to form a stable complex. It's used to predict binding poses and affinities.
  • ESM-2 (Evolutionary Scale Modeling - 2): A large, pre-trained transformer-based protein language model developed by Meta AI. It learns representations of protein sequences by being trained on vast amounts of protein sequence data, allowing it to capture evolutionary and structural information about proteins. It's akin to language models like BERT or GPT but for protein sequences.
  • Area Under the Receiver Operating Characteristic Curve (AUROC): AUROC is a performance metric for binary classifiers. The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various classification thresholds. The AUROC value (ranging from 0 to 1) represents the probability that the model ranks a randomly chosen positive instance higher than a randomly chosen negative instance. A higher AUROC indicates better discriminative power.
  • Area Under the Precision-Recall Curve (AUPR): AUPR is another performance metric, particularly useful for imbalanced datasets (where one class is much rarer than the other). It plots Precision ($\frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}$) against Recall ($\frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$) at various thresholds. A higher AUPR value indicates that the model is better at identifying positive instances without generating too many false positives, which is crucial in tasks like enzyme specificity prediction, where positive enzyme-substrate pairs may be rare.
  • Accuracy: A common metric for classification tasks, representing the proportion of total predictions that were correct. It is calculated as $\frac{\mathrm{Number\ of\ correct\ predictions}}{\mathrm{Total\ number\ of\ predictions}}$.
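
To make the attention computation described in the list above concrete, here is a minimal NumPy sketch of scaled dot-product cross-attention. The shapes and the enzyme/substrate framing are illustrative assumptions; this is not the paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_q, d_k) queries, e.g. one row per enzyme amino acid.
    K: (n_k, d_k) keys,    e.g. one row per substrate atom.
    V: (n_k, d_v) values paired with the keys.
    Returns (n_q, d_v): each query's weighted sum of the values.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_q, n_k) similarity scores
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V

# Toy example: 5 enzyme residues attending over 3 substrate atoms.
rng = np.random.default_rng(0)
E = rng.normal(size=(5, 16))            # query side
S = rng.normal(size=(3, 16))            # key/value side
print(cross_attention(E, S, S).shape)   # (5, 16)
```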

3.2. Previous Works

The paper discusses several prior studies and existing tools, highlighting their limitations:

  • Family-Specific Tools (e.g., for esterases, glycotransferases, nitrilases, phosphatases, thiolases, sesquiterpene synthases, proteases): Many early machine learning tools were developed for specific protein families (e.g., Robinson et al., 2020; Yang et al., 2018; Mou et al., 2021; Durairaj et al., 2021; Goldman et al., 2022). While achieving some success, their lack of generalizability across diverse enzyme families is a significant limitation.

  • General Enzyme Function Prediction Tools:

    • CLEAN (Contrastive Learning-Enabled Enzyme Annotation, Yu et al., 2023)
    • ProteInfer (Sanderson et al., 2023)
    • DeepECTransformer (Wang et al., 2023)
    These tools, while powerful for general enzyme function prediction, struggle to distinguish fine-grained enzyme reactivity and substrate specificity within the same EC numbers. This means they can identify what type of reaction an enzyme performs, but not necessarily which specific substrate it acts upon among many structurally similar options.
  • ESP (Enzyme Substrate Prediction, Kroll et al., 2023): This model used Graph Neural Networks (GNNs) to encode metabolites for various proteins. A key limitation of ESP was its reliance on a comparatively small dataset, comprising only about 1,300 unique substrates. This restricted its ability to cover all native and non-native enzyme-substrate pairs on a genome-scale. The paper states that EZSpecificity's database contains 25 times more substrates.

  • CPI (Compound-Protein Interaction Prediction, Du et al., 2022): CPI methods are crucial for drug discovery by screening candidate compounds. The paper identifies that the architecture of CPI is effectively equivalent to EZSpecificity without its graph, cross-attention, and structural embeddings (termed EZSpecificity-w/oGCS). This implies that CPI typically uses sequence-based or fingerprint-based embeddings and relies on simpler interaction models, often concatenating separate enzyme and compound embeddings before a final prediction layer, thereby failing to capture explicit 3D atomic interactions.

  • ALDELE (All-Purpose Deep-Learning-Based Multiple Toolkit, Wang et al., 2024): This framework highlighted the potential for broader applicability but also faced the challenge of fully integrating information from sequences, structures, and interactions simultaneously.

    These previous works commonly suffered from:

  • Limited datasets, especially concerning diverse substrates and structural information.

  • Inability to effectively capture the 3D nature of enzyme-substrate binding and intricate atomic-level interactions.

  • Difficulty in distinguishing specificities for enzymes within the same EC number or for homologous enzymes.

3.3. Technological Evolution

The field of enzyme specificity prediction has evolved from:

  1. Rule-based systems and expert curation: Early methods relied on biochemical knowledge and manual annotation (e.g., BRENDA, UniProt).
  2. Fingerprint and sequence-based embeddings: As machine learning emerged, approaches using molecular fingerprints (e.g., ECFP, RDKit descriptors) for substrates and sequence embeddings (e.g., one-hot encoding, BLOSUM matrices) for enzymes became prevalent. These were often combined with traditional ML models like SVMs or random forests, or simpler neural networks. These struggled to capture spatial arrangement and long-range couplings.
  3. Graph Neural Networks (GNNs): The advent of GNNs allowed for more sophisticated representation of molecules as graphs (atoms as nodes, bonds as edges), capturing local structural information. ESP is an example of this generation.
  4. Deep Learning with Pre-trained Language Models and 3D Structural Information: More recently, advanced deep learning models, often incorporating protein language models like ESM-2 for rich sequence embeddings, and attempting to integrate 3D structural data, have emerged. However, prior efforts often still lacked explicit modeling of precise 3D enzyme-substrate interactions.
  5. EZSpecificity's Position: This paper represents a significant step forward by combining the strengths of advanced sequence representation (ESM-2), explicit 3D binding pocket modeling using SE(3)-equivariant GNNs, and crucial cross-attention mechanisms to capture atomic-level enzyme-substrate interactions. It also addresses the data bottleneck by constructing a much larger, structurally informed ESIBank dataset.

3.4. Differentiation Analysis

Compared to the main methods in related work, EZSpecificity offers several core differences and innovations:

  • Comprehensive Data Integration: Unlike models that focus on 1D sequences or 2D graphs, EZSpecificity explicitly integrates sequence data (via ESM-2), 3D enzyme-substrate complex structures, and the active site environment. This multi-modal approach provides a richer and more complete representation of the enzyme-substrate system.
  • SE(3)-Equivariant Graph Neural Network for Active Site: A key innovation is the use of an SE(3)-equivariant GNN to model the binding pocket. This ensures that the encoding process is invariant to 3D transformations (rotations and translations), allowing the model to learn intrinsic properties of the active site and substrate, which is crucial for chemical and biological systems. Previous GNNs often lacked this 3D equivariance.
  • Cross-Attention for Direct Interactions: EZSpecificity uses two cross-attention layers to directly model the interactions between enzyme amino acids and substrate atoms. This is a significant departure from previous approaches (e.g., CPI, ESP) that typically generate separate embeddings for the enzyme and substrate and then concatenate them, which may limit their ability to capture complex, atom-level interactions. Cross-attention allows the model to "emphasize the amino acids and atoms crucial for enzyme-substrate specificity," reducing noise and enhancing focus.
  • Enhanced Database (ESIBank): The paper introduces ESIBank, a high-quality, comprehensive database that is 25 times larger in terms of unique substrates compared to datasets used by models like ESP. This extensive and structurally rich dataset is crucial for training a robust and generalizable deep learning model.
  • Generalizability Across Families and EC Numbers: EZSpecificity is designed as a general model, demonstrating strong performance across diverse enzyme families and showing improved ability to distinguish specificity even at the lowest (fourth) level of EC number resolution, a known challenge for earlier models.

4. Methodology

4.1. Principles

The core idea behind EZSpecificity is to predict enzyme substrate specificity by comprehensively modeling the enzyme-substrate interaction at multiple levels: sequence, 3D structure, and direct atomic-level interactions. The theoretical basis is that enzyme specificity is fundamentally determined by the 3D geometry and chemical properties of the enzyme's active site and its precise fit and interaction with the substrate. By integrating advanced deep learning techniques—specifically, pre-trained protein language models for sequence features, SE(3)-equivariant Graph Neural Networks for 3D structural context of the active site, and cross-attention mechanisms for explicit enzyme-substrate interaction modeling—the model aims to capture these critical determinants more effectively than previous approaches. The intuition is that a model that "sees" and "understands" the enzyme and substrate in their full 3D context, and can explicitly learn which parts of each molecule interact most strongly, will be better at predicting specificity.

4.2. Core Methodology In-depth (Layer by Layer)

The EZSpecificity architecture is a multi-component deep learning model, as illustrated in Fig. 1. The overall workflow involves: (1) preparing a high-quality enzyme-substrate interaction database (ESIBank), (2) generating 3D enzyme-substrate complexes, (3) encoding enzyme sequences, (4) encoding the 3D micro-environment of the catalytic pocket and substrate, (5) modeling enzyme-substrate interactions using cross-attention, and (6) predicting specificity using a multi-layer perceptron.

The following figure (Fig. 1 from the original paper) shows the machine learning architecture of EZSpecificity:


Fig. 1 | The machine learning architecture of EZSpecificity. ESM-2 and an active site environment encoding method empowered by an SE(3)-equivariant GNN were applied to extract the sequence and structural embeddings of the enzyme-substrate complex. Message passing neural networks (MPNN) were used to obtain the embedding of the internal substrate and the embedding of the enzyme-substrate interaction. The cross-attention layer contained independent heads, each a learnable attention function that captures the interaction between the amino acids of enzymes and the atoms of substrates. The attention function was composed of several trainable linear layers. The weighted 3D catalytic core embedding and weighted sequence embedding were used as input for training the multi-layer perceptron model. Any fingerprint at the atom, residue and overall levels can be flexibly integrated into the embedding process. Schematic was created using BioRender (https://biorender.com).

4.2.1. ESIBank Database Preparation and 3D Complex Construction

A high-quality dataset is crucial for training deep learning models. EZSpecificity relies on ESIBank, a comprehensive enzyme-substrate interaction database.

Dataset Sources and Construction:

  1. Initial Data Collection: ESIBank was built by integrating sequence-based databases like BRENDA and UniProt with structural information.

    • Substrates: Native and non-native substrates were collected from BRENDA in SMILES format.
    • Enzymes: Since BRENDA often lacks direct enzyme sequences, an enzyme from the listed EC number and organism name category was randomly selected from UniProt to match substrates. This yielded approximately 180,000 positive enzyme-substrate pairs.
    • Negative Samples: For each positive pair, five negative enzymes and five negative substrates were generated, resulting in 10 negative enzyme-substrate pairs per positive pair. These negatives were designed to vary in similarity to the actual enzyme/substrate, guided by EC number digits (from no common digits to all four identical), creating five levels of difference; a minimal sketch of this sampling scheme appears after this list. Approximately 1,100,000 negative pairs were generated this way.
  2. Data Filtering: Low-quality or abnormal data points were removed:

    • Substrates with more than 280 atoms.
    • Enzymes with more than 1,000 amino acids.
    • Enzymes lacking active site information in UniProt.
  3. Enrichment with Specific Enzyme Families: Data from six representative enzyme families (esterases, glycotransferases, nitrilases, phosphatases, thiolases, and domain of unknown function proteins (DUFs)) were added.

  4. Semi-Automatic Data Extraction: For enzymes like halogenases, where data might not be fully indexed, a semi-automatic process was used. This four-step process includes: identification, extraction, translation (e.g., using OSRA to convert chemical structures from images into SMILES), and connection. A HaloS dataset (~3,300 pairs) was established using this method.

  5. Total Size: In total, ESIBank comprises 323,783 high-quality enzyme-substrate pairs, involving 34,417 unique substrates and 8,124 enzymes.
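
The graded negative-sampling scheme is described in prose only; the sketch below illustrates the idea under stated assumptions. The `enzyme_pool` structure, its dictionary keys, and drawing one negative per similarity level are hypothetical simplifications (ESIBank draws five negative enzymes and five negative substrates per positive pair).

```python
import random

def shared_ec_digits(ec_a: str, ec_b: str) -> int:
    """Count the leading EC levels two annotations share, from 0 to 4
    (e.g. '1.1.1.1' vs. '1.1.3.5' -> 2)."""
    count = 0
    for x, y in zip(ec_a.split("."), ec_b.split(".")):
        if x != y:
            break
        count += 1
    return count

def sample_negative_enzymes(positive, enzyme_pool, seed=0):
    """Draw one negative enzyme per similarity level (0-4 shared EC digits)
    for a single positive pair, mirroring the graded scheme described above.
    Level 4 allows a different enzyme that carries the identical EC number."""
    rng = random.Random(seed)
    negatives = []
    for level in range(5):
        candidates = [e for e in enzyme_pool
                      if e["id"] != positive["id"]
                      and shared_ec_digits(e["ec"], positive["ec"]) == level]
        if candidates:                       # skip levels with no candidate
            negatives.append(rng.choice(candidates))
    return negatives

pool = [{"id": i, "ec": ec} for i, ec in enumerate(
    ["1.1.1.1", "1.1.1.1", "1.1.1.2", "1.1.2.4", "1.14.19.9", "2.3.1.5"])]
print(sample_negative_enzymes(pool[0], pool))
```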

    The following figure (Fig. 2 from the original paper) shows the construction process of the ESIBank database:


Fig. 2 | Construction of the comprehensive enzyme-substrate interaction (ESIBank) database. a, The process of constructing the ESIBank database. The middle panel describes the whole process of obtaining various natural and non-natural substrates and wild-type and variant enzymes from published online servers and literature. The semi-automatic data extraction and AutoDock-GPU docking processes are presented in the top and bottom panels, respectively. Semi-automatic data extraction includes four steps: identification, extraction, translation and connection. OSRA (Optical Structure Recognition Application) converts the chemical structure information of chemical compounds from images into SMILES strings.

Construction of 3D Enzyme-Substrate Complexes: To provide crucial structural context, 3D enzyme-substrate complexes were generated for each pair in ESIBank:

  1. Substrate Preparation: Substrate structures were generated from SMILES strings using Open Babel and RDKit.
  2. Enzyme Structure Preparation: 3D structures of enzymes and required cofactors were obtained from the AlphaFill database. Apo enzyme models (protein structures without bound ligands) were sourced from the AlphaFold Protein Structure Database. AlphaFill helped integrate cofactors, which are often crucial for enzyme function.
  3. Docking Process: AutoDock-GPU was used for molecular docking:
    • Input Conversion: Enzyme structures were converted to map files, and substrate files to pdbqt inputs.
    • Active Site Definition: For enzymes with known active sites (from UniProt), a 20 Å cube centered around the active site amino acids was used as the docking box. For enzymes without known active sites, the enzyme's center was used.
    • Docking Runs: 100 runs were performed per enzyme-substrate pair, with each run allowing up to 2,500,000 score evaluations for a unique binding position, totaling a maximum of 250 million pose evaluations per pair.
    • Pose Selection: The highest-scoring binding pose was selected as the output, generating structural files of enzymes with substrates bound.
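
As a small illustration of the docking-box logic described above (a sketch, not the authors' pipeline), the helper below centres a 20 Å cube on the annotated active-site atoms and falls back to the enzyme's geometric centre when no annotation exists; the function name and inputs are assumptions.

```python
from typing import Optional
import numpy as np

def docking_box(atom_coords: np.ndarray,
                active_site_coords: Optional[np.ndarray] = None,
                edge: float = 20.0):
    """Return the (center, size) of a cubic docking box.

    atom_coords:        (N, 3) coordinates of all enzyme atoms.
    active_site_coords: (M, 3) coordinates of annotated active-site residues,
                        or None when UniProt carries no annotation.
    edge:               cube edge length in angstroms (20 A in the paper).
    """
    if active_site_coords is not None and len(active_site_coords) > 0:
        center = active_site_coords.mean(axis=0)   # centre on the active site
    else:
        center = atom_coords.mean(axis=0)          # fall back to the enzyme centre
    return center, np.array([edge, edge, edge])

coords = np.random.rand(500, 3) * 50.0             # toy enzyme coordinates
center, size = docking_box(coords)                 # no active-site annotation
print(center, size)
```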

4.2.2. Pretrained Protein Language Model for Enzyme Representation

The full-length amino acid sequences of enzymes are represented using ESM-2, a powerful self-supervised protein language model.

  1. ESM-2 Embedding: ESM-2 (15 billion parameters) generates a 1,280-dimensional vector representation for each amino acid in the enzyme sequence. This embedding captures rich evolutionary and contextual information about the protein. The output from the penultimate layer of ESM-2 is used for this purpose.
  2. Dimension Reprojection: A linear layer reprojects the 1,280-dimensional vector for each amino acid into a 128-dimensional embedding. This is referred to as the amino acid representation of the enzyme.
  3. Enzyme Representation: The overall representation of the enzyme ($h_{\mathrm{enzyme}}$) is obtained by averaging these 128-dimensional amino acid embeddings across the entire sequence.
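
A minimal PyTorch sketch of this sequence branch follows, with a random tensor standing in for a real ESM-2 output (loading the 15-billion-parameter model is omitted); the class name is hypothetical.

```python
import torch
import torch.nn as nn

ESM_DIM, HIDDEN = 1280, 128

class SequenceBranch(nn.Module):
    """Reproject per-residue language-model embeddings and mean-pool them."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(ESM_DIM, HIDDEN)   # 1,280 -> 128 per amino acid

    def forward(self, residue_emb: torch.Tensor):
        # residue_emb: (seq_len, 1280), e.g. ESM-2's penultimate-layer output
        aa_repr = self.proj(residue_emb)         # (seq_len, 128) amino acid representation
        h_enzyme = aa_repr.mean(dim=0)           # (128,) averaged enzyme embedding
        return aa_repr, h_enzyme

# Stand-in tensor for an ESM-2 embedding of a 350-residue enzyme.
fake_esm_output = torch.randn(350, ESM_DIM)
aa_repr, h_enzyme = SequenceBranch()(fake_esm_output)
print(aa_repr.shape, h_enzyme.shape)  # torch.Size([350, 128]) torch.Size([128])
```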

4.2.3. Capturing Catalytic Pocket Environment by SE(3)-Equivariant GNNs

To encode the 3D structural micro-environment of each atom in the catalytic active site (including both enzyme and substrate atoms), an SE(3)-equivariant GNN is employed.

  1. Graph Representation: The binding pocket is modeled as a graph $G = (V, E)$.
    • Nodes ($V$): Each node represents an atom, either from the substrate or the enzyme.
    • Edges ($E$): Edges connect each node to its $k$-nearest atomic neighbors in the 3D structure. In practice, $k$ is set to 32.
  2. Atom Features:
    • Enzyme Atoms: Features include chemical elements, amino acid types, and a binary indicator of whether the atom belongs to the protein backbone.
    • Substrate Atoms: Features are multi-hot vectors, including chemical element types and aromatic properties.
  3. Edge Features ($e_{ij}$):
    • Distance Embeddings: Radial basis functions are used, positioned at 32 centers within the range of 0-10 Å, to represent the distance between connected atoms.
    • Bond Types: A four-dimensional one-hot vector indicates if a bond is single, double, triple, or a virtual bond.
    • Inter-molecular Indicator: A two-dimensional one-hot vector specifies whether the edge connects atoms between the substrate and the enzyme.
  4. SE(3)-Equivariant GNN Update Rule: The GNN updates the hidden embedding of each atom ($h_i$) iteratively, ensuring the encoding process is equivariant to 3D transformations.
    • Initial Hidden Embedding ($h^0$): Obtained by two distinct linear layers, one for enzyme atom features and one for substrate atom features, reprojecting them independently.
    • Message Passing and Aggregation: At the $l$-th layer, the atom hidden embedding $h_i^l$ is updated as follows: $ h_i^{l+1} = h_i^l + \varphi_h\left( \sum_{j \in \mathcal{N}(i)} \varphi_e\left(e_{ij}, h_i^l, h_j^l\right), h_i^l \right) $ Where:
      • $h_i^l$: The hidden embedding of atom $i$ at layer $l$.
      • $h_i^{l+1}$: The updated hidden embedding of atom $i$ for the next layer.
      • $\mathcal{N}(i)$: The set of neighbors of atom $i$ in the graph.
      • $e_{ij}$: The features of the edge connecting atom $i$ and atom $j$.
      • $\varphi_e$: A two-layer perceptron (MLP) that models the message-passing function, taking the edge features and the current embeddings of atoms $i$ and $j$ to generate a message.
      • $\sum_{j \in \mathcal{N}(i)}$: Aggregates the messages from all neighbors $j$ of atom $i$.
      • $\varphi_h$: A two-layer perceptron that models the message aggregation and update function, combining the aggregated messages with the atom's current embedding $h_i^l$ to produce the new embedding.
    • Output Embeddings: The final atom hidden embedding $h^t$ is referred to as the micro-environment embedding. Specifically, the final hidden embeddings of substrate atoms are called the substrate atom embedding, and their average is the substrate embedding ($h_{\mathrm{substrate}}$). The SE(3) neural network consists of three GNN layers.
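
The PyTorch sketch below implements the update rule above using invariant edge features only (distance RBFs, bond type, inter-molecular flag, giving the 38-d edge dimension implied by the list above: 32 + 4 + 2); the full SE(3)-equivariant handling of coordinates used in the paper is omitted, and the class name, activation, and layer sizes beyond the 128-d convention are assumptions.

```python
import torch
import torch.nn as nn

class MPNNLayer(nn.Module):
    """One message-passing layer following the update rule above:
    h_i <- h_i + phi_h( sum_j phi_e(e_ij, h_i, h_j), h_i ),
    with phi_e and phi_h as two-layer perceptrons."""
    def __init__(self, node_dim=128, edge_dim=38, hidden=128):
        super().__init__()
        self.phi_e = nn.Sequential(
            nn.Linear(edge_dim + 2 * node_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden))
        self.phi_h = nn.Sequential(
            nn.Linear(hidden + node_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, node_dim))

    def forward(self, h, edge_index, edge_attr):
        # h: (N, node_dim) atom embeddings; edge_index: (2, E) source/target
        # atom indices; edge_attr: (E, edge_dim) distance RBFs, bond type and
        # inter-molecular indicator, as listed above.
        src, dst = edge_index
        msg = self.phi_e(torch.cat([edge_attr, h[dst], h[src]], dim=-1))
        agg = torch.zeros(h.size(0), msg.size(-1), device=h.device)
        agg.index_add_(0, dst, msg)            # sum messages per receiving atom
        return h + self.phi_h(torch.cat([agg, h], dim=-1))  # residual update

# Toy pocket graph: 10 atoms, 30 directed edges, 38-d edge features
# (32 RBF distance centers + 4 bond types + 2 inter-molecular flags).
h = torch.randn(10, 128)
edge_index = torch.randint(0, 10, (2, 30))
edge_attr = torch.randn(30, 38)
layers = nn.ModuleList(MPNNLayer() for _ in range(3))  # three GNN layers
for layer in layers:
    h = layer(h, edge_index, edge_attr)
print(h.shape)  # torch.Size([10, 128])
```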

4.2.4. Modelling the Interactions between Substrates and Enzymes using Cross-Attention Layers

To explicitly capture the complex enzyme-substrate interaction within the active site, two cross-attention layers are employed. This differs from previous methods that simply concatenate separate embeddings, allowing EZSpecificity to emphasize crucial amino acids and atoms.

  1. Enzyme-Aware Substrate Atom Embedding: The first cross-attention layer generates an enzyme-aware representation for each substrate atom. This means each substrate atom's embedding is modulated by how strongly it interacts with different amino acids of the enzyme.

    • The attention mechanism is defined as: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $
    • The cross-attention for the enzyme-aware substrate embedding is: $ \mathrm{Cross\text{-}attention}(E, S) = \varphi_0(\mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)) $ where $\mathrm{head}_i = \mathrm{Attention}(\varphi_{q_i}(E), \varphi_{k_i}(S), \varphi_{V_i}(S))$.
      • $E$: The enzyme amino acid embedding (from ESM-2 and reprojection).
      • $S$: The substrate atom embedding (from the SE(3)-equivariant GNN).
      • $\varphi_{q_i}$, $\varphi_{k_i}$, $\varphi_{V_i}$: Linear layers that project $E$ to the Query ($Q$) and $S$ to the Key ($K$) and Value ($V$) for the $i$-th attention head.
      • $d_k$: The dimension of the Key vectors.
      • $h$: The number of independent attention heads.
      • $\mathrm{Concat}$: Concatenates the outputs of all attention heads.
      • $\varphi_0$: A final linear layer that combines the concatenated heads.
    • The average of these enzyme-aware substrate atom embeddings forms the enzyme-aware substrate embedding ($\bar{h}_{\mathrm{substrate}}$).
  2. Substrate-Aware Enzyme Amino Acids Embedding: Similarly, a second cross-attention layer generates a substrate-aware representation for each enzyme amino acid. This allows each enzyme amino acid's embedding to be modulated by how strongly it interacts with different atoms of the substrate.

    • The cross-attention for the substrate-aware enzyme embedding is: $ \mathrm{Cross\text{-}attention}(S, E) = \varphi_0(\mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)) $ where $\mathrm{head}_i = \mathrm{Attention}(\varphi_{q_i}(S), \varphi_{k_i}(E), \varphi_{V_i}(E))$.
      • Here, $S$ provides the Query, and $E$ provides the Key and Value.
    • The average of these substrate-aware enzyme amino acid embeddings forms the substrate-aware enzyme embedding ($\bar{h}_{\mathrm{enzyme}}$).
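
A hedged sketch of the two cross-attention directions using PyTorch's built-in multi-head attention: the 128-d width and the query/key/value assignments follow the equations above, while the head count and class name are assumptions.

```python
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    """Two multi-head cross-attention layers, as described above."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.q_from_enzyme = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.q_from_substrate = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, E, S):
        # E: (1, n_residues, 128) amino acid embeddings from the sequence branch
        # S: (1, n_atoms, 128) substrate atom embeddings from the GNN branch
        # Cross-attention(E, S): queries from E, keys/values from S.
        enz_aware, _ = self.q_from_enzyme(E, S, S)
        # Cross-attention(S, E): queries from S, keys/values from E.
        sub_aware, _ = self.q_from_substrate(S, E, E)
        # Averaging yields the two interaction-aware embeddings fed to the MLP.
        return enz_aware.mean(dim=1), sub_aware.mean(dim=1)

E = torch.randn(1, 350, 128)   # a 350-residue enzyme
S = torch.randn(1, 24, 128)    # a 24-atom substrate
h_bar_substrate, h_bar_enzyme = BidirectionalCrossAttention()(E, S)
print(h_bar_substrate.shape, h_bar_enzyme.shape)  # (1, 128) each
```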

4.2.5. Multilayer Perceptron as the Base Predictor

Finally, a multi-layer perceptron (MLP) acts as the base predictor to determine enzyme specificity based on the comprehensive enzyme-substrate pair representation.

  1. Input Concatenation: The MLP takes a concatenated vector as input, combining four types of embeddings:
    • The original substrate embedding ($h_{\mathrm{substrate}}$).
    • The original enzyme embedding ($h_{\mathrm{enzyme}}$).
    • The enzyme-aware substrate embedding ($\bar{h}_{\mathrm{substrate}}$).
    • The substrate-aware enzyme embedding ($\bar{h}_{\mathrm{enzyme}}$).
    • The concatenation operation is: $\mathrm{Concat}\big(\bar{h}_{\mathrm{substrate}}, \bar{h}_{\mathrm{enzyme}}, h_{\mathrm{substrate}}, h_{\mathrm{enzyme}}\big)$
  2. Specificity Prediction: The MLP processes this concatenated representation to output a prediction score ($y$) that defines the enzyme specificity: $ y = \mathrm{MLP}\Big(\mathrm{Concat}\big(\bar{h}_{\mathrm{substrate}}, \bar{h}_{\mathrm{enzyme}}, h_{\mathrm{substrate}}, h_{\mathrm{enzyme}}\big)\Big) $ The MLP consists of three feedforward layers. Unless otherwise specified, all hidden embedding dimensions are set to 128.
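
A minimal sketch of this prediction head, assuming ReLU activations and stand-in tensors for the four embeddings; the paper specifies three feedforward layers and 128-d hidden embeddings but not the remaining details.

```python
import torch
import torch.nn as nn

# Three-layer MLP over the concatenated 4 x 128-d input; intermediate widths
# and the ReLU activation are assumptions.
predictor = nn.Sequential(
    nn.Linear(4 * 128, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 2),               # two logits: reactive vs. non-reactive
)

# Stand-ins for the four embeddings named above.
h_bar_substrate = torch.randn(1, 128)   # enzyme-aware substrate embedding
h_bar_enzyme    = torch.randn(1, 128)   # substrate-aware enzyme embedding
h_substrate     = torch.randn(1, 128)   # averaged substrate atom embedding
h_enzyme        = torch.randn(1, 128)   # averaged amino acid embedding

x = torch.cat([h_bar_substrate, h_bar_enzyme, h_substrate, h_enzyme], dim=-1)
logits = predictor(x)                   # y = MLP(Concat(...)); train with cross-entropy
print(logits.shape)                     # torch.Size([1, 2])
```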

4.2.6. Training Process of EZSpecificity

The task of predicting enzyme specificity is formulated as a binary classification problem: classifying pairs of substrates and enzymes into two categories (e.g., reactive/non-reactive).

  1. Loss Function: A cross-entropy loss function is used, which is standard for classification tasks, measuring the difference between the predicted probabilities and the true labels.
  2. Optimizer: The AdamW optimizer is employed, using default parameters. AdamW is an extension of the Adam optimizer that decouples weight decay from the optimization step, which often leads to better generalization.
  3. Learning Rate Schedule:
    • Initial Learning Rate: Set at 0.0003.
    • Warm-up: In the initial few epochs, the learning rate is linearly increased from 0.000006 to 0.0003. This warm-up phase helps stabilize training at the beginning.
    • Learning Rate Reduction: If the model's performance does not improve for 10 consecutive epochs, the learning rate is halved. This reduce-on-plateau strategy helps the model converge as it nears an optimum.
    • Early Stopping: Training concludes when the learning rate drops below 0.000006, preventing overfitting and unnecessary computation.
  4. Batch Size: The batch size for training is set to 32.
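
The schedule above can be sketched with standard PyTorch utilities as follows; the model, the training pass, and the warm-up length are placeholders, while the learning rates, halving factor, patience, and stopping criterion follow the values given above.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import ReduceLROnPlateau

model = torch.nn.Linear(512, 2)                   # stand-in for EZSpecificity
optimizer = AdamW(model.parameters(), lr=3e-4)    # otherwise default AdamW parameters
scheduler = ReduceLROnPlateau(optimizer, mode="max", factor=0.5, patience=10)
loss_fn = torch.nn.CrossEntropyLoss()             # binary task as two-class cross-entropy

WARMUP_EPOCHS, BASE_LR, MIN_LR = 5, 3e-4, 6e-6    # warm-up length is an assumption

for epoch in range(1000):
    if epoch < WARMUP_EPOCHS:                     # linear warm-up from 6e-6 to 3e-4
        warm = MIN_LR + (BASE_LR - MIN_LR) * epoch / WARMUP_EPOCHS
        for group in optimizer.param_groups:
            group["lr"] = warm
    # ... one pass over the training set (batch size 32), omitted ...
    val_score = 0.5                               # placeholder validation AUROC/AUPR
    scheduler.step(val_score)                     # halve the LR after 10 stagnant epochs
    if optimizer.param_groups[0]["lr"] < MIN_LR:  # stop once the LR drops below 6e-6
        break
```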

5. Experimental Setup

5.1. Datasets

The experiments primarily utilized two main datasets: ESIBank for in silico training and evaluation, and a specialized HaloS dataset for in vitro validation.

  • ESIBank Database:

    • Source: Curated from sequence-based databases (BRENDA, UniProt), enriched with structural information generated by AlphaFold, AlphaFill, and AutoDock/Vina-GPU, and additional data from published literature and online servers.
    • Scale: Comprises 323,783 high-quality enzyme-substrate pairs. This includes 34,417 unique substrates and 8,124 unique enzymes.
    • Characteristics: It covers a wide range of natural and non-natural substrates, as well as wild-type and variant enzymes. The dataset explicitly incorporates 3D structural context derived from docking. It's noted to have 25 times more substrates than the dataset used by ESP.
    • Data Sample (Implicit): A typical data sample would consist of:
      • An enzyme sequence (e.g., a string of amino acid letters).
      • A substrate SMILES string (e.g., CC(=O)Oc1ccccc1C(=O)O for acetylsalicylic acid).
      • The 3D coordinates of the enzyme-substrate complex generated through docking, including active site information.
      • A binary label indicating whether the enzyme-substrate pair is reactive (positive) or non-reactive (negative).
    • Choice: This dataset was specifically built by the authors to provide a comprehensive and structurally rich foundation for training EZSpecificity, directly addressing the limitations of smaller, less structurally informed datasets in prior work.
  • Six Representative Enzyme Families Dataset:

    • Source: Additional data points for these families were included in ESIBank to enrich its diversity.
    • Families: Esterases, glycotransferases, nitrilases, phosphatases, thiolases, and domain of unknown function proteins (DUFs).
    • Choice: These families were chosen to demonstrate the model's transferability and generalizability across different enzymatic functions.
  • Halogenase-Substrate (HaloS) Dataset:

    • Source: An in-house collected dataset of halogenases, established using a semi-automatic data extraction process from literature.
    • Scale: Contains around 3,300 enzyme-substrate pairs.
    • Characteristics: Focuses on an understudied enzyme family (halogenases) with distinct regioselectivity, chosen for proof-of-concept experimental validation. The 78 substrates selected for experimental validation had an average molecular similarity of only about 9% to the 449 substrates collected from literature, highlighting their diversity.
    • Choice: This dataset was crucial for in vitro experimental validation, providing a realistic test case for EZSpecificity's predictive accuracy on new and diverse substrates/enzymes.

5.2. Evaluation Metrics

The performance of EZSpecificity and baseline models was evaluated using several standard metrics for binary classification.

  • Area Under the Receiver Operating Characteristic Curve (AUROC):

    • Conceptual Definition: The AUROC quantifies the ability of a classifier to discriminate between positive and negative classes across all possible classification thresholds. It is computed as the area under the ROC curve, which plots the True Positive Rate (TPR) against the False Positive Rate (FPR). A higher AUROC score (closer to 1) indicates better overall model discrimination, meaning it can correctly distinguish positive instances from negative instances more effectively.
    • Mathematical Formula: $ \mathrm{AUROC} = \int_0^1 \mathrm{TPR}(t) \, \mathrm{d}(\mathrm{FPR}(t)) $
    • Symbol Explanation:
      • $\mathrm{TPR}(t)$ (True Positive Rate, also known as Recall or Sensitivity): $\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$. It represents the proportion of actual positive cases that are correctly identified.
      • $\mathrm{FPR}(t)$ (False Positive Rate): $\mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}}$. It represents the proportion of actual negative cases that are incorrectly identified as positive.
      • $\mathrm{TP}$ (True Positives): Instances correctly predicted as positive.
      • $\mathrm{FN}$ (False Negatives): Instances incorrectly predicted as negative when they are actually positive.
      • $\mathrm{FP}$ (False Positives): Instances incorrectly predicted as positive when they are actually negative.
      • $\mathrm{TN}$ (True Negatives): Instances correctly predicted as negative.
      • $t$: The classification threshold, which varies from 0 to 1 to generate the ROC curve.
  • Area Under the Precision-Recall Curve (AUPR):

    • Conceptual Definition: The AUPR measures the average precision achieved across all possible recall levels. It plots Precision against Recall. This metric is particularly informative and relevant when dealing with imbalanced datasets (where the number of negative instances significantly outweighs positive instances), as it focuses on the performance of the positive class. A high AUPR indicates that the model has both high precision (few false positives) and high recall (identifies most true positives).
    • Mathematical Formula: $ \mathrm{AUPR} = \int_0^1 \mathrm{Precision}(r) \, \mathrm{d}r $
    • Symbol Explanation:
      • $\mathrm{Precision}$: $\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}$. It is the proportion of positive identifications that were actually correct.
      • $\mathrm{Recall}$ (same as TPR): $\mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$. It is the proportion of actual positives that were identified correctly.
      • $r$: The recall value, which varies from 0 to 1 as the classification threshold changes.
  • Accuracy:

    • Conceptual Definition: Accuracy is the most straightforward classification metric, representing the proportion of the total predictions that were correct. It assesses the overall correctness of the model's predictions across all classes.
    • Mathematical Formula: $ \mathrm{Accuracy} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}} $
    • Symbol Explanation:
      • $\mathrm{TP}$ (True Positives): Correctly predicted positive instances.
      • $\mathrm{TN}$ (True Negatives): Correctly predicted negative instances.
      • $\mathrm{FP}$ (False Positives): Incorrectly predicted positive instances.
      • $\mathrm{FN}$ (False Negatives): Incorrectly predicted negative instances.
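
For reference, all three metrics can be computed with scikit-learn; the labels and scores below are toy values, and average precision is used as the usual step-wise estimate of AUPR.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, accuracy_score

# Toy labels and predicted scores; in practice these come from the test split.
y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.2, 0.7, 0.4, 0.3, 0.6, 0.8, 0.1])

auroc = roc_auc_score(y_true, y_score)               # threshold-free ranking quality
aupr  = average_precision_score(y_true, y_score)     # step-wise estimate of AUPR
acc   = accuracy_score(y_true, (y_score >= 0.5).astype(int))  # fixed 0.5 threshold

print(f"AUROC={auroc:.3f}  AUPR={aupr:.3f}  Accuracy={acc:.3f}")
```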

5.3. Baselines

The EZSpecificity model was rigorously compared against several existing and ablated baseline models to demonstrate its effectiveness.

  • ESP (Enzyme Substrate Prediction):

    • Description: This is described as the "state-of-the-art general machine learning model for predicting enzyme substrate specificity." It uses Graph Neural Networks (GNNs) to encode metabolites for various proteins.
    • Representativeness: It serves as the primary benchmark for general enzyme specificity prediction. The paper notes that due to limited accessibility, the final ESP model was used directly, and it was trained on a smaller ESP-database (approx. 18k data points, 1.3k substrates, 12k enzymes).
  • EZSpecificity-w/oGCS (EZSpecificity without Graph, Cross-attention, and Structural Embeddings):

    • Description: This is an ablated version of EZSpecificity designed to mimic the architecture of CPI (Compound-Protein Interaction) models. It lacks the SE(3)-equivariant GNN for structural embeddings and the cross-attention layers.
    • Representativeness: This serves two purposes:
      1. To provide a fair comparison with CPI by retraining it on the ESIBank dataset, showing the impact of ESIBank itself versus architectural innovations.
      2. To act as an ablation control, demonstrating the performance contribution of the graph, cross-attention, and structural components within the EZSpecificity architecture.
  • CPI (Compound-Protein Interaction):

    • Description: The paper explicitly states that the architecture of CPI is "equivalent to EZSpecificity without graph, cross-attention and structural embeddings."
    • Representativeness: CPI models are important for drug discovery and assessing molecular interactions. By comparing EZSpecificity (full architecture) against CPI (represented by EZSpecificity-w/oGCS retrained on ESIBank), the paper highlights the advantages of incorporating 3D structure and explicit interaction modeling.
  • Docking Score (AutoDock Vina):

    • Description: In the "Representative Applications" section for metabolite-enzyme pair prediction, AutoDock Vina (a physical model for docking scores) was used as a baseline.
    • Representativeness: This provides a comparison against a traditional physics-based computational method for predicting molecular interactions, showing whether the deep learning approach offers superior performance.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results consistently demonstrate EZSpecificity's superior performance, both in silico and in vitro, across various challenging scenarios.

6.1.1. In silico Evaluation on ESIBank Dataset

EZSpecificity was benchmarked against ESP (the state-of-the-art general model) and EZSpecificity-w/oGCS across four different data splitting scenarios to simulate real-world difficulties: random, unknown substrate, unknown enzyme, and unknown enzyme and substrate.

The following figure (Fig. 3 from the original paper) shows the AUROC performance of EZSpecificity and the general enzyme specificity model ESP:


Fig. 3 | Evaluation of EZSpecificity on the ESIBank dataset. a, The AUROC performance of EZSpecificity and the general enzyme specificity model ESP, evaluated on four dataset splits (random, unknown substrate, unknown enzyme, and unknown enzyme and substrate). b, Ablation experiments of EZSpecificity on the unknown enzyme and substrate dataset, evaluated by AUROC score. c,d, The prediction resolution of EZSpecificity towards four digits of EC number, evaluated by AUROC (c) and AUPR (d). The resolution was defined as the AUROC or AUPR score of models in scenarios distinguished by the number of shared EC number digits between the actual enzyme-substrate pair and its corresponding negative samples. Owing to the absence of EC number annotations in data from specific enzyme families, these are grouped into one category. e, Average resolution score on all four dataset splits. EZSpecificity-w/oGCS and ESP were also investigated for comparison. EZSpecificity without graph, cross-attention and structural embeddings is abbreviated as EZSpecificity-w/oGCS. AUPR and AUROC are better suited when the focus is on identifying positive instances in an imbalanced dataset, whereas MCC provides a balanced view across all classes. For tasks such as enzyme specificity prediction, in which the identification of positive hits is often a primary goal, AUPR and AUROC can be more informative and relevant.

Comparison with ESP (Figure 3a):

  • Random Split: EZSpecificity (AUROC = 0.8988) significantly outperformed ESP (AUROC = 0.6572).
  • Unknown Substrate: EZSpecificity (AUROC = 0.7712) outperformed ESP (AUROC = 0.6481).
  • Unknown Enzyme: EZSpecificity (AUROC = 0.7725) outperformed ESP (AUROC = 0.6548).
  • Unknown Enzyme and Substrate (Most Challenging): EZSpecificity (AUROC = 0.7198) showed a substantial lead over ESP (AUROC = 0.6523). The paper also reports an 8.72% improvement in AUPR on this split, which is crucial for real-world reliability.

Impact of ESIBank Dataset (EZSpecificity-w/oGCS vs. ESP): The comparison between EZSpecificity-w/oGCS (AUROC = 0.8822 on random split) and ESP (AUROC = 0.6572 on random split) highlights the significant contribution of the ESIBank dataset's size and quality. Despite having almost the same architecture as ESP, EZSpecificity-w/oGCS performed much better when trained on ESIBank. This indicates that a comprehensive, high-quality training dataset is fundamental for strong model performance.

6.1.2. Ablation Studies (Figure 3b)

Ablation experiments were conducted on the challenging "unknown enzyme and substrate" dataset to quantify the contribution of key components of EZSpecificity.

  • Loss of Active Structure: Removing the explicit modeling of atomic interactions within the binding pocket decreased the AUROC score from 0.7198 to 0.7036. This demonstrates the benefit of the SE(3)-equivariant GNN in capturing the 3D active site environment.
  • Loss of Cross-Attention Layers: Removing the cross-attention layers decreased the AUROC score from 0.7198 to 0.7021. This confirms the critical role of cross-attention in explicitly modeling enzyme-substrate interactions and enhancing performance by focusing on crucial atoms/amino acids. Although these numerical changes may seem small, the paper emphasizes that even marginal improvements are challenging in deep learning and reflect meaningful enhancements.

6.1.3. EC Number Resolution (Figure 3c-e)

EZSpecificity was evaluated for its ability to predict specificity at different levels of EC number resolution (up to four digits).

  • EZSpecificity demonstrated superior performance (higher AUROC and AUPR) across all levels of EC number resolution compared to ESP and EZSpecificity-w/oGCS.
  • The model slightly outperformed ESP even at the most granular EC.x.x.x.x level, indicating its improved capacity to distinguish among homologous enzymes or enzyme variants. This is a critical advantage for fine-grained biocatalysis applications.

6.1.4. Generalizability Across Enzyme Families (Figure 4)

The transferability of EZSpecificity was assessed on six diverse enzyme families.

The following figure (Fig. 4 from the original paper) shows the in silico evaluation of EZSpecificity using six representative enzyme families:


Fig. 4 | In silico evaluation of EZSpecificity using six representative enzyme families. a, The enzymatic functions of six enzyme families. The structure of a representative enzyme from each family is shown. b, The averaged AUPR performance of EZSpecificity with and without fine-tuning on six enzyme families. c, The average AUPR performance of the EZSpecificity-individual model across six enzyme families for three types of dataset splits. d, The AUPR performance of EZSpecificity on specific protein families under the unknown enzyme and substrate data split setting. Nitrilases were excluded from this evaluation because of the extremely low number of data points, which rendered the results highly unreliable. For instance, no positive pairs may have occurred during the four-fold validation.

  • Overall Performance (Figure 4b): EZSpecificity performed well across all six families (esterases, glycotransferases, nitrilases, phosphatases, thiolases, DUFs) for random, unknown substrate, and unknown enzyme splits, achieving an average AUPR value of up to 0.6835.
  • Fine-tuning Strategies:
    • EZSpecificity-fine-tune (Figure 4b): Updating the weights of the pre-trained EZSpecificity model with data from a specific target family further improved performance. For instance, AUPR on the unknown substrate split increased by about 7% after fine-tuning. This confirms the benefit of adapting a large model to specific downstream tasks.
    • EZSpecificity-individual (Figure 4c): Training a new model from scratch using only data from the target enzyme family. This model surpassed CPI-individual by 4.2-8.3% AUPR and ESP by 27.4-54.5% AUPR. This suggests EZSpecificity's architecture can effectively manage limited data points, likely due to the structural context integration.
  • Performance on Understudied Enzymes/Substrates (Figure 4d): EZSpecificity outperformed CPI and ESP across all selected enzyme families under the challenging "unknown enzyme and substrate" split. The optimal strategy (direct application, fine-tuning, or individual training) varied by enzyme family, emphasizing the need for initial testing in practical applications.

6.1.5. Experimental Validation of Halogenases (In vitro)

Eight flavin-dependent halogenases and 78 diverse substrates were selected for in vitro experimental validation using a high-throughput screening platform.

The following figure (Fig. 5 from the original paper) illustrates the in silico and in vitro experimental validation of EZSpecificity on the in-house HaloS dataset:


Fig. 5 | In silico and in vitro experimental validation of EZSpecificity on the in-house HaloS dataset collected by a semi-automatic data extraction approach. a, The halogenation function of four types of halogenases. The typical structure of each halogenase is shown. X stands for a halogen atom (e.g., Cl⁻, I⁻ and F⁻). b, The structure of representative collected substrates in the HaloS dataset. c, TMAPs of halogenation reactions in the HaloS dataset, colour-coded by the four types of halogenases. TMAP is a data visualization method capable of representing datasets of up to millions of data points.

  • In silico performance on Halogenases (Figure 5d, f): EZSpecificity (without fine-tuning) achieved strong AUROC (0.7720-0.9447) and AUPR (0.5430-0.8506) values on the HaloS dataset. Fine-tuning further enhanced performance, with EZSpecificity-fine-tune achieving AUROC of 0.8008-0.9600 and AUPR of 0.5698-0.8823.
  • In vitro Prediction Accuracy (Figure 5e): For 12 new substrates not seen in the training database, EZSpecificity-fine-tune (fine-tuned on the HaloS dataset) achieved a remarkable 91.7% accuracy for its top-1 recommendation in identifying the single potential reactive substrate. This significantly outperformed:
    • EZSpecificity-w/oGCS: 41.7%
    • ESP: 58.3%
    • EZSpecificity-individual: 66.7%
    • EZSpecificity-ensemble (ensemble of the individual and fine-tuned models): 75.0%

This strong in vitro validation confirms the practical utility and high accuracy of EZSpecificity, especially when fine-tuned on relevant data. A minimal sketch of the top-k and AUROC/AUPR evaluation behind these comparisons follows.
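As a concrete illustration of how such numbers are computed, the sketch below evaluates top-k accuracy on toy scores (k=1 for the comparison above; k=3 reproduces the top-three criterion used in the BGC analysis later) alongside AUROC and AUPR via scikit-learn. All data here are invented for demonstration.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def topk_accuracy(score_matrix: np.ndarray, true_idx, k: int = 1) -> float:
    """Fraction of enzymes whose true reactive substrate appears among the
    top-k ranked candidates. score_matrix: (n_enzymes, n_substrates)."""
    ranked = np.argsort(-score_matrix, axis=1)[:, :k]
    hits = [true in row for true, row in zip(true_idx, ranked)]
    return float(np.mean(hits))

# Toy example: 2 enzymes, 3 candidate substrates each.
scores = np.array([[0.9, 0.2, 0.4],
                   [0.1, 0.3, 0.8]])
print(topk_accuracy(scores, true_idx=[0, 2], k=1))  # -> 1.0

# AUROC / AUPR over all enzyme-substrate pairs, as in the in silico tests.
y_true = np.array([1, 0, 0, 0, 0, 1])               # flattened pair labels
print(roc_auc_score(y_true, scores.ravel()),
      average_precision_score(y_true, scores.ravel()))
```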

6.1.6. Representative Applications

EZSpecificity's ability to predict a score for enzyme-substrate pairs enables various applications.

Metabolite-Enzyme Pair Prediction in E. coli:

  • Task: Identify the enzyme(s) acting on 34 target metabolites from E. coli among 860 enzymes.

  • Results (Extended Data Fig. 1a): EZSpecificity successfully matched 10 metabolites (29.4%) with their corresponding enzymes within the top 5% of ranked predictions. This significantly surpassed using docking scores alone, which achieved 20.4%. When expanding the ranking threshold to the top 20%, the success rate increased to 50% (17 metabolites).

  • Generalization (Extended Data Fig. 1b): A 2D kernel density estimation plot showed that EZSpecificity had the highest confidence (peak density) in selecting the correct reactive enzymes when they were ranked in the top 10% of candidates. Importantly, reasonable predictive accuracy was maintained even for enzymes with low sequence similarity to those in the training set, demonstrating its generalization capacity. A minimal sketch of this percentile-rank evaluation follows Extended Data Fig. 1.

    The following figure (Extended Data Fig. 1 from the original paper) illustrates the EZSpecificity examination in the metabolite-enzyme pair prediction task:


Extended Data Fig. 1 | EZSpecificity examination in the metabolite-enzyme pair prediction task. (a) The performance of EZSpecificity vs. docking score in identifying the reactive enzyme(s) for 34 target metabolites. AutoDock Vina was used as a representative physical model for docking scores. Only the one reactive enzyme with the best-ranked score was considered for the calculation. EZSpecificity predicts 29% of the known positive enzymes within the top 5% of predictions, compared to 20% from the physical score alone. (b) 2D kernel density estimation plot of the percentile rank of top predicted enzymes vs. the sequence similarity to any enzyme in our training set for EZSpecificity in the metabolite-enzyme pair prediction task. Kernel density estimation addresses a fundamental data-smoothing problem in which inferences about the population are made from a finite data sample. 1D kernel density estimates for each axis are also plotted at the top and right of the figure. The data were obtained from the 34 metabolites within E. coli used for the calculation. Density is indicated by colour intensity, where darker regions correspond to greater density.
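The percentile-rank evaluation underlying panel (a) can be sketched as follows: for each metabolite, rank all 860 candidate enzymes by predicted score and record where the best known reactive enzyme lands. The function name and toy data below are illustrative, not from the paper.

```python
import numpy as np

def percentile_rank_of_hit(scores: np.ndarray, reactive_idx) -> float:
    """Percentile rank (0 = best) of the best-ranked known reactive enzyme
    among all candidate enzymes for one metabolite."""
    order = np.argsort(-scores)             # candidates, best score first
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(scores))   # rank position of each enzyme
    best = min(ranks[i] for i in reactive_idx)
    return 100.0 * best / len(scores)

# Toy run: 860 candidate enzymes, one known reactive enzyme at index 7.
rng = np.random.default_rng(0)
scores = rng.random(860)
scores[7] = scores.max() + 0.1              # model ranks the true enzyme first
pr = percentile_rank_of_hit(scores, reactive_idx=[7])
print(pr, pr <= 5.0)                        # 0.0 True -> a top-5% success
```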

Biosynthetic Gene Cluster (BGC) Analysis:

  • Task: Linking BGC genes to their corresponding biosynthetic intermediates in pathways like clavulanic acid and albonoursin biosynthesis.
  • Results: EZSpecificity achieved up to 66.7% accuracy in identifying the correct target enzyme among the top three ranked candidates for each step in the biosynthetic pathway. This highlights its potential for deciphering complex natural product biosynthesis pathways.

6.2. Data Presentation (Tables)

The paper presents its quantitative results mainly in the text or graphically in figures; the main body contains no large, complex tables with merged cells to transcribe. The reporting summary, however, includes some data:

The following are the results from the "Reporting Summary" of the original paper regarding study design:

Life sciences study design
All studies must disclose on these points even when the disclosure is negative.
| Design element | Disclosure |
| --- | --- |
| Sample size | 323,783 |
| Data exclusions | No |
| Replication | No |
| Randomization | No |
| Blinding | No |

It is important to note the "no replication," "no randomization," and "no blinding" for the life sciences study design. While unusual for traditional wet-lab experiments, this context likely refers to the computational model's evaluation setup on its primary dataset, where the focus is on fixed dataset splits and model performance, rather than an experimental design for biological data collection itself. The in vitro validation for halogenases (Fig. 5e) does describe controls and high-throughput screening, implying appropriate biological experimental rigor.

6.3. Ablation Studies / Parameter Analysis

As discussed in section 6.1.2, ablation studies were crucial for validating the architectural choices of EZSpecificity.

  • The removal of the active structure (that is, the SE(3)-equivariant GNN encoding the binding pocket) decreased AUROC from 0.7198 to 0.7036.
  • The removal of the cross-attention layers decreased AUROC from 0.7198 to 0.7021.

These results indicate that both components contribute positively to the model's predictive performance, especially in the challenging setting with unknown enzymes and substrates. The cross-attention layers in particular allow the model to focus on critical interaction points, reducing noise from atoms and amino acids not directly involved in binding. A toy sketch of how such ablations can be expressed as architectural toggles follows.
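A toy version of such an ablation can be written as constructor flags that disable one branch at a time; the module names and dimensions below are illustrative stand-ins, not the paper's code.

```python
import torch
import torch.nn as nn

class AblatableModel(nn.Module):
    """Toy model with toggles mirroring the two ablations reported above."""
    def __init__(self, dim: int = 64, use_structure: bool = True,
                 use_cross_attention: bool = True):
        super().__init__()
        self.use_structure = use_structure
        self.use_cross_attention = use_cross_attention
        self.seq_encoder = nn.Linear(dim, dim)     # stands in for ESM-2 features
        self.struct_encoder = nn.Linear(dim, dim)  # stands in for the SE(3) GNN
        if use_cross_attention:
            self.cross_attn = nn.MultiheadAttention(dim, num_heads=4,
                                                    batch_first=True)
        self.head = nn.Linear(dim, 1)

    def forward(self, seq_feats, struct_feats, substrate_feats):
        h = self.seq_encoder(seq_feats)
        if self.use_structure:                     # binding-pocket branch
            h = h + self.struct_encoder(struct_feats)
        if self.use_cross_attention:               # enzyme tokens attend to atoms
            h, _ = self.cross_attn(h, substrate_feats, substrate_feats)
        return self.head(h.mean(dim=1)).squeeze(-1)

# The three variants compared in the ablation study:
full   = AblatableModel()
no_gcs = AblatableModel(use_structure=False)        # w/o binding-pocket structure
no_ca  = AblatableModel(use_cross_attention=False)  # w/o cross-attention layers
```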

7. Conclusion & Reflections

7.1. Conclusion Summary

In conclusion, the paper successfully developed EZSpecificity, a novel and general deep learning model for accurately predicting enzyme substrate specificity. The model's key innovations lie in its architecture, which integrates sequence information from ESM-2, 3D structural context of binding complexes using an SE(3)-equivariant GNN, and explicit enzyme-substrate interaction modeling through cross-attention layers. These components collectively address limitations of prior work, which often overlooked the 3D binding process and atomic-level interactions. Furthermore, the creation of ESIBank, a comprehensive and structurally rich enzyme-substrate interaction database, was instrumental in training this robust model. EZSpecificity demonstrated significantly superior performance in both in silico evaluations (up to 48.1% higher AUROC than ESP) and in vitro experimental validation (91.7% accuracy compared to 58.3% for ESP on halogenases and new substrates). This establishes EZSpecificity as a broadly applicable tool for various fields requiring enzyme functional characterization.

7.2. Limitations & Future Work

The authors acknowledged several limitations and suggested future research directions:

  • Stereoselectivity, Chemo-, and Regioselectivity: While EZSpecificity predicts substrate specificity, it "does not support reliable prediction of chemo-, regio- or stereoselectivity." This limitation arises because the current GNN encoding method "treats different stereoselectivity at the same atom equally," and because of general limitations in molecular representation and encoding resolution (a small demonstration of this stereochemistry blindness follows this list).
  • Active Site Annotation Dependency: The performance of ESIBank and consequently EZSpecificity "may be affected for pairs lacking active site annotations." This highlights a dependency on the quality and completeness of active site data.
  • BGC Data Organization: For applications in Biosynthetic Gene Cluster (BGC) analysis, the authors note that EZSpecificity's performance "could be further improved through training on additional BGC-relevant data, although these datasets remain poorly organized at present."
  • Integration of Dynamic Binding Information: The paper suggests that "future integration of dynamic binding information" (e.g., from molecular dynamics simulations) is "poised to further enhance predictive power," implying that the current model primarily considers static 3D structures.
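To see why a stereochemistry-agnostic graph encoding cannot separate enantiomers, consider this RDKit snippet (not from the paper): once chiral tags are dropped from the atom features, the two mirror-image forms of alanine yield identical graphs.

```python
from rdkit import Chem

# (R)- and (S)-alanine differ only in the chiral tag on the alpha carbon.
r_ala = Chem.MolFromSmiles("N[C@@H](C)C(=O)O")
s_ala = Chem.MolFromSmiles("N[C@H](C)C(=O)O")

def achiral_graph(mol):
    """Element + bond-order graph with no chirality information, mimicking an
    encoder that treats different stereochemistry at the same atom equally."""
    atoms = [a.GetSymbol() for a in mol.GetAtoms()]
    bonds = sorted((b.GetBeginAtomIdx(), b.GetEndAtomIdx(),
                    b.GetBondTypeAsDouble()) for b in mol.GetBonds())
    return atoms, bonds

print(achiral_graph(r_ala) == achiral_graph(s_ala))  # True: enantiomers collapse
```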

7.3. Personal Insights & Critique

This paper presents a rigorous and impactful advance in enzyme specificity prediction. The EZSpecificity model's architecture is a well-reasoned integration of cutting-edge deep learning techniques, and the development of the ESIBank dataset is a significant contribution to the community.

Inspirations and Applications:

  • Multi-modal Integration: The success of EZSpecificity underscores the power of integrating diverse data modalities (sequence, 3D structure, interaction) for complex biological problems. This approach could be highly transferable to other interaction prediction tasks, such as protein-protein interactions, drug-target interactions, or even predicting reaction outcomes in organic chemistry, where multiple types of information are critical.
  • SE(3)-Equivariance: The explicit use of SE(3)-equivariant GNNs is a crucial choice that ensures physical realism and robustness, which is paramount in molecular modeling. This principle is increasingly important in any domain dealing with 3D spatial data.
  • Cross-Attention for Interaction: The application of cross-attention to explicitly model enzyme-substrate interactions is an elegant solution to a common limitation of previous models, which simply concatenated embeddings. The attention weights also aid interpretability by highlighting crucial interaction points (see the sketch after this list).
  • Biocatalysis and Drug Discovery: The high accuracy achieved by EZSpecificity, particularly its in vitro validation on new substrates, makes it an invaluable tool for accelerating enzyme engineering, identifying novel biocatalysts for industrial applications, and guiding lead optimization in drug discovery by predicting off-target effects or substrate promiscuity.
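A minimal sketch of this idea, assuming PyTorch's built-in multi-head attention as a stand-in for the paper's cross-attention layers: enzyme residue embeddings act as queries and substrate atom embeddings as keys/values, and the returned attention map points to candidate residue-atom interaction hot spots. Shapes and names are illustrative.

```python
import torch
import torch.nn as nn

dim, n_res, n_atoms = 32, 50, 12
cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

residues = torch.randn(1, n_res, dim)   # e.g., pocket residue embeddings
atoms = torch.randn(1, n_atoms, dim)    # e.g., substrate atom embeddings

# Residues (queries) attend to substrate atoms (keys/values).
fused, weights = cross_attn(residues, atoms, atoms,
                            need_weights=True, average_attn_weights=True)
print(fused.shape)    # torch.Size([1, 50, 32])  updated residue features
print(weights.shape)  # torch.Size([1, 50, 12])  residue-to-atom attention map

# The highest-weight atom per residue is a candidate interaction partner.
print(weights.argmax(dim=-1)[0, :5])
```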

Potential Issues, Unverified Assumptions, or Areas for Improvement:

  • Computational Cost: The generation of ESIBank involves extensive molecular docking (AutoDock-GPU), and the SE(3)-equivariant GNN and ESM-2 embeddings are computationally intensive. While performance is excellent, the computational resources required for training and inference, especially for very large-scale screening beyond E. coli metabolomes, could be substantial.
  • Active Site Definition Accuracy: The reliance on UniProt for active-site annotations and a fixed 20 Å cutoff for docking could introduce biases or inaccuracies if the active-site definitions are incomplete, or if the true binding pocket extends beyond this radius for some enzymes. While AlphaFill helps, the quality of the initial structural predictions (for example, from AlphaFold for apo enzymes) also influences the docking outcome. A sketch of such a fixed-radius pocket definition follows this list.
  • Static Representation Limitations: The model currently relies on static 3D enzyme-substrate complexes. Enzymes are dynamic, and substrate binding often involves conformational changes. As the authors themselves note, incorporating "dynamic binding information" could be a significant future improvement. This could involve integrating features from molecular dynamics simulations.
  • Lack of Mechanistic Interpretability: While cross-attention highlights important amino acids and atoms, EZSpecificity primarily provides a prediction score. Deeper mechanistic insights into why a particular enzyme-substrate pair is specific (e.g., precise catalytic mechanisms, transition state stabilization) might require further development or integration with quantum chemistry methods.
  • Data Reporting (Reporting Summary): The "no replication," "no randomization," and "no blinding" statements in the "Life sciences study design" section of the Reporting Summary are somewhat jarring for a Nature paper, even if they refer specifically to the computational evaluation. They most likely describe the model's fixed training/testing splits rather than the biological data collection: dataset splits are deliberately non-random when the goal is to evaluate generalizability to held-out "unknown" subsets. The in vitro validation details (controls, high-throughput screening) do suggest experimental rigor, but the in vitro section reports prediction accuracy rather than statistical significance derived from multiple biological replicates, which is common practice when benchmarking ML models on experimental data. Readers unfamiliar with this convention may nonetheless expect blinding (where applicable) and biological replicates for robust conclusions.
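For reference, a fixed-radius pocket definition of the kind discussed in the Active Site Definition Accuracy bullet can be sketched with Biopython; the file path, chain ID, and residue numbering are hypothetical.

```python
import numpy as np
from Bio.PDB import PDBParser

def pocket_residues(pdb_path: str, chain_id: str,
                    active_site_resnum: int, cutoff: float = 20.0):
    """Residue numbers with any atom within `cutoff` angstroms of the
    annotated active-site residue (a fixed-radius pocket definition)."""
    structure = PDBParser(QUIET=True).get_structure("enzyme", pdb_path)
    chain = structure[0][chain_id]
    center = np.array([a.coord for a in chain[active_site_resnum]])
    selected = []
    for res in chain:
        coords = np.array([a.coord for a in res])
        # minimum pairwise distance between this residue and the active site
        dists = np.linalg.norm(coords[:, None, :] - center[None, :, :], axis=-1)
        if dists.min() <= cutoff:
            selected.append(res.get_id()[1])
    return selected

# Hypothetical usage: pocket_residues("enzyme.pdb", "A", 105, cutoff=20.0)
```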
