
SSEmb: A joint embedding of protein sequence and structure enables robust variant effect predictions

Published: 07 November 2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The study introduces SSEmb, a method integrating protein sequence and structure for robust variant effect predictions, particularly effective with limited sequence data, and applicable in tasks like predicting protein-protein binding sites.

Abstract

The ability to predict how amino acid changes affect proteins has a wide range of applications including in disease variant classification and protein engineering. Here, we present SSEmb (Sequence Structure Embedding), a method that integrates sequence and structure information in a single model. By combining a graph representation of protein structure with a transformer model for processing multiple sequence alignments, we demonstrate that SSEmb provides robust variant effect predictions, especially in cases where sequence information is limited, and is additionally useful for other tasks such as predicting protein-protein binding sites.

In-depth Reading


1. Bibliographic Information

1.1. Title

SSEmb: A joint embedding of protein sequence and structure enables robust variant effect predictions

1.2. Authors

Lasse M. Blaabjerg, Nicolas Jonsson, Wouter Boomsma, Amelie Stein, & Kresten Lindorff-Larsen

1.3. Journal/Conference

Published online in Nature Communications. Nature Communications is a highly reputable, peer-reviewed open access scientific journal published by Springer Nature. It covers all areas of the natural sciences, including biology, chemistry, physics, and earth sciences, and is known for publishing high-quality research across diverse fields.

1.4. Publication Year

Published online: 07 November 2024 (Received: 21 January 2024)

1.5. Abstract

The paper presents SSEmb (Sequence Structure Embedding), a novel method designed to predict how amino acid changes affect proteins. This has wide applications in disease variant classification and protein engineering. SSEmb uniquely integrates protein sequence information, specifically from multiple sequence alignments (MSAs), with three-dimensional protein structure information into a single model. It achieves this by combining a graph representation of protein structure with a Transformer model for processing MSAs. The authors demonstrate that SSEmb provides robust variant effect predictions, particularly in scenarios where sequence information (i.e., MSA depth) is limited. Furthermore, the learned embeddings from SSEmb are shown to be useful for other downstream tasks, such as predicting protein-protein binding sites, with performance comparable to specialized state-of-the-art methods. The paper concludes that SSEmb is valuable for variant effect predictions and as a general representation for learning protein properties dependent on both sequence and structure.

Official Source Link: /files/papers/6915b3cc4d6b2ff314a02eab/paper.pdf
Publication Status: Officially published online in Nature Communications.

2. Executive Summary

2.1. Background & Motivation

The ability to predict the functional consequences of amino acid changes (also known as variant effects) in proteins is a fundamental challenge with profound implications for understanding molecular mechanisms of evolution, human diseases (e.g., classifying disease-causing variants), and advancing protein engineering (e.g., optimizing protein function). Traditional methods often rely on either protein sequence information (e.g., evolutionary conservation from Multiple Sequence Alignments or MSAs) or protein structure information (e.g., stability calculations).

However, each approach has limitations:

  • Sequence-based methods: While powerful, their accuracy can be highly sensitive to the depth or quality of the input MSA. For proteins with few known homologous sequences, MSAs can be shallow, leading to unreliable predictions.

  • Structure-based methods: These require a known or predicted three-dimensional protein structure, which might not always be available or accurate, especially for intrinsically disordered proteins or large complexes.

    The core problem the paper aims to solve is to overcome these individual limitations by synergistically combining both types of information. Previous attempts to combine sequence and structure often involve ensemble predictions or combining results at a later stage. The challenge is to integrate this information within a single model in an end-to-end manner to learn a richer, more robust representation.

2.2. Main Contributions / Findings

The paper's primary contributions revolve around the introduction and validation of SSEmb:

  • Novel End-to-End Integration Model (SSEmb): The paper proposes SSEmb, a novel method that integrates protein sequence (via MSA) and structure (via graph representation) information into a single, self-supervised model. This is achieved by combining a structure-constrained MSA Transformer with a Graph Neural Network (GNN).

  • Robust Variant Effect Predictions for Shallow MSAs: SSEmb demonstrates improved robustness and accuracy in predicting variant effects, particularly for proteins where the input MSA is shallow (i.e., has limited evolutionary information). This addresses a critical gap where many sequence-only methods struggle. The model achieves competitive performance on the ProteinGym benchmark, especially in low MSA depth scenarios.

  • Information-Rich Embeddings for Downstream Tasks: The embeddings learned by SSEmb are shown to be information-rich and highly useful for various downstream tasks beyond variant effect prediction. The paper exemplifies this by successfully using SSEmb embeddings to predict protein-protein binding sites, achieving results comparable to specialized state-of-the-art methods. This highlights SSEmb's potential as a general-purpose protein representation learning tool.

  • Self-Supervised Training Paradigm: SSEmb is trained in a self-supervised manner using a masked amino acid prediction task, leveraging large amounts of unlabeled sequence and structure data, which is a powerful approach for learning generalizable protein representations.

    These findings suggest that SSEmb offers a more comprehensive and robust approach to understanding protein function and variant effects, especially in data-scarce scenarios, and provides a versatile tool for various protein research applications.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully grasp the methodology and contributions of SSEmb, a foundational understanding of several key concepts is essential:

  • Amino Acid Changes and Variant Effects: Proteins are chains of amino acids that fold into specific three-dimensional structures, which dictate their function. An amino acid change (also called a variant or mutation) is a substitution of one amino acid for another at a specific position in the protein sequence. These changes can have various effects on the protein, such as altering its stability, activity, binding affinity, or cellular abundance. Predicting these variant effects is crucial for understanding disease and for protein engineering.

  • Multiple Sequence Alignment (MSA): An MSA is a sequence alignment of three or more biological sequences (e.g., protein or DNA sequences) that are evolutionarily related. It arranges these sequences to align homologous (similar by descent) positions, allowing for the inference of evolutionary conservation. Positions that are highly conserved across many species often indicate critical functional or structural roles.

    • MSA Depth: Refers to the number of sequences in an MSA. A deep MSA has many homologous sequences, providing rich evolutionary information, while a shallow MSA has few, making evolutionary inference difficult.
    • MSA Query Sequence: In the context of variant effect prediction, this is the specific protein sequence for which variant effects are being predicted. It's usually the wild-type (most common natural) sequence.
  • Protein Structure: The three-dimensional (3D) structure of a protein is critical for its function. This structure is determined by the amino acid sequence and interactions between amino acids, including covalent bonds and non-covalent interactions (e.g., hydrogen bonds, hydrophobic interactions). The protein structure provides spatial information about amino acid proximities and solvent exposure, which is not directly evident from the linear sequence.

  • Machine Learning (ML): A field of artificial intelligence that enables systems to learn from data.

    • Supervised Learning: An ML paradigm where a model learns from labeled data (input-output pairs). For example, training a model to predict variant effects requires examples of variants and their experimentally measured effects.
    • Self-Supervised Learning: A variant of unsupervised learning where the model learns representations from unlabeled data by generating supervision signals from the data itself. For instance, masked language modeling (predicting masked words in a sentence) is a common self-supervised task. In protein science, predicting masked amino acids in a sequence or MSA allows models to learn intrinsic properties of proteins without explicit functional labels. This is a key paradigm for protein language models.
  • Transformer Model: An attention-based neural network architecture introduced in 2017, primarily for natural language processing. It revolutionized sequence modeling by replacing recurrent layers with self-attention mechanisms, allowing it to process entire sequences in parallel and capture long-range dependencies efficiently.

    • Self-Attention: A mechanism that weighs the importance of different parts of the input sequence when processing a specific part. In a Transformer, each token (e.g., an amino acid in a protein sequence) computes its representation by attending to all other tokens in the sequence, determining their relevance; a minimal numerical sketch follows this list. The attention score between a query $Q$ and a key $K$ is often computed as: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ where $Q$ represents the query vectors, $K$ the key vectors, $V$ the value vectors, and $d_k$ is the dimension of the key vectors, used for scaling. The softmax function normalizes the scores to produce a probability distribution.
  • Graph Neural Network (GNN): A class of neural networks designed to operate on graph-structured data. Proteins can naturally be represented as graphs, where amino acid residues are nodes and interactions (e.g., spatial proximity) between them are edges. GNNs learn by message passing, where nodes aggregate information from their neighbors and update their representations. This allows GNNs to capture relational information inherent in protein structures.

    • Geometric Vector Perceptron (GVP): A specific type of GNN designed to handle 3D geometric information (like protein structures) by representing features as both scalars and vectors, maintaining rotational equivariance. Rotational equivariance means that if the input protein structure is rotated, the GNN's output embedding for that structure will also rotate in a predictable way, preserving the underlying geometric relationships.
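To make the attention formula above concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention; the shapes and random inputs are purely illustrative and not taken from the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # (L_q, L_k) attention logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V                                  # (L_q, d_v) attended values

# Toy example: 5 residue positions, 8-dimensional queries/keys/values.
rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (5, 8)
```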

3.2. Previous Works

The paper builds upon and compares itself to several categories of prior research:

  • Multiplexed Assays of Variant Effects (MAVEs) / Deep Mutational Scanning (DMS): These high-throughput experimental techniques generate large datasets mapping protein sequence to function by simultaneously assaying thousands of protein variants. Examples include VAMP-seq (Variant Abundance by Massively Parallel Sequencing) for quantifying variant effects on cellular protein abundance. MAVEs provide crucial experimental data for training and benchmarking computational predictors. The ProteinGym benchmark is a large collection of such data.

  • Sequence-Based Predictors:

    • Evolutionary Models: These methods leverage MSAs to infer evolutionary conservation and predict variant effects. Highly conserved positions are often functionally critical.
      • GEMME (Global Epistatic Model Predicting Mutational Effects): A state-of-the-art MSA-based model that uses a relatively simple evolutionary model to predict variant effects, often outperforming more complex machine learning methods for protein activity prediction. It quantifies the effect of mutations by comparing the likelihood of a variant sequence in an MSA to the wild-type.
    • Protein Language Models (PLMs): Self-supervised models trained on vast datasets of protein sequences to learn rich protein representations (embeddings).
      • MSA Transformer: A Transformer model specifically designed to process MSAs. It learns representations by attending to both sequences (rows) and positions (columns) within the MSA. It can be used for variant effect prediction by predicting masked amino acids. Its performance is known to be sensitive to MSA depth.
      • ESM-1b / ESM2 (Evolutionary Scale Modeling): Other prominent protein language models that learn from MSAs or large collections of single protein sequences.
      • TranceptEVE, Tranception L, EVE, VESPA: Other PLMs or ensemble methods benchmarked on ProteinGym for variant effect prediction.
  • Structure-Based Predictors:

    • Rosetta: A widely used protein modeling software suite that includes protocols for predicting protein stability changes ($\Delta\Delta G$) upon mutation. Rosetta uses physics-based force fields and sampling algorithms to estimate the energetic impact of amino acid substitutions. It's commonly used in protein engineering and for predicting stability and abundance.
  • Combined Sequence and Structure Methods:

    • Early methods often combined predictions from separate sequence-based and structure-based models (e.g., ensemble methods).
    • More recent approaches aim for deeper integration:
      • Some methods combine contextualized language models and graph neural networks (e.g., ELASPIC, AlphaMissense).
      • Others explore using geometric deep learning with pre-trained protein language models (e.g., LM-GVP).
      • The paper mentions other recent methods such as ScanNet (specifically for protein-protein binding site prediction) and xgboost (a machine learning algorithm) with handcrafted features as baselines for protein-protein binding site prediction.

3.3. Technological Evolution

The field of variant effect prediction has evolved significantly:

  1. Early Statistical and Rule-Based Methods: Initially, predictions relied on simple rules (e.g., conservation scores like SIFT or PolyPhen-2) or statistical analyses of MSAs.

  2. Physics-Based Modeling: Rosetta and similar tools brought physics-based simulations to predict stability changes, offering mechanistic insights.

  3. Supervised Machine Learning: As MAVE data became available, supervised ML models (e.g., random forests, support vector machines) were trained to predict variant effects from sequence and structure features.

  4. Self-Supervised Protein Language Models (PLMs): The advent of Transformer architectures and vast sequence databases led to PLMs like MSA Transformer and ESM, which learn protein representations in a self-supervised manner, enabling zero-shot prediction (predictions without task-specific training data).

  5. Integration of Sequence and Structure: Recognizing the complementary nature of sequence and structure, the current frontier involves combining these modalities. Initial efforts involved combining outputs, but the trend is towards end-to-end models that learn joint representations.

    SSEmb fits into this evolution at stage 5, representing a significant step towards end-to-end integration within a self-supervised framework, specifically addressing the robustness issue with shallow MSAs.

3.4. Differentiation Analysis

Compared to the main methods in related work, SSEmb offers several core differences and innovations:

  • End-to-End Joint Embedding: Unlike many methods that combine sequence and structure predictions at a later stage or use one to inform the other sequentially, SSEmb integrates them end-to-end. It uses structure to constrain the MSA Transformer's attention and feeds the MSA Transformer's embeddings directly into a GNN that processes the protein structure. This allows for a deeper, more synergistic learning of joint representations.

  • Structure-Constrained Attention: A key innovation is using protein structure to mask the row attention in the MSA Transformer. This means the Transformer only considers interactions between amino acids that are spatially close in the 3D structure, introducing structural context directly into the sequence model's attention mechanism. This is distinct from simply concatenating sequence and structure features post-hoc.

  • Robustness to Shallow MSAs: By embedding structural information directly into the MSA Transformer and GNN, SSEmb becomes less dependent on the availability of deep MSAs. This makes it particularly robust and accurate for proteins with limited evolutionary information, an area where pure MSA-based methods often struggle.

  • Versatile Embeddings: The paper demonstrates that the dual embeddings learned by SSEmb are highly information-rich and generalizable, proving useful for diverse downstream tasks like protein-protein binding site prediction, not just variant effect prediction. This suggests SSEmb learns a more holistic representation of protein characteristics.

  • Self-Supervised Training Philosophy: While other PLMs are self-supervised, SSEmb extends this by incorporating structural context into the self-supervised masking task, making the learned representations more aware of physical constraints and relationships.

    In essence, SSEmb moves beyond simply having two types of information towards a model where sequence and structure are deeply intertwined to create a more comprehensive and resilient understanding of protein function.

4. Methodology

4.1. Principles

The core idea behind SSEmb is to create a robust and comprehensive protein representation by jointly embedding information from both the protein's amino acid sequence (specifically, its evolutionary context derived from a Multiple Sequence Alignment (MSA)) and its three-dimensional (3D) structure. The method posits that by integrating these two complementary data types within a single end-to-end self-supervised learning framework, the resulting embeddings will be more informative and less sensitive to limitations in either input modality (e.g., shallow MSAs).

The theoretical basis draws from the success of Transformer models in capturing long-range dependencies in sequences and Graph Neural Networks (GNNs) in modeling relational data like protein structures. SSEmb combines these by:

  1. Constraining sequence attention with structure: Using spatial proximity from the 3D structure to guide the attention mechanism within the MSA Transformer, ensuring that the model implicitly considers structural interactions when processing sequence information.

  2. Fusing information via a GNN: Concatenating the sequence embeddings from the MSA Transformer with the structural graph features and processing them through a GNN to learn a joint representation.

  3. Self-supervised learning: Training the entire model to predict masked amino acids, a task that encourages the model to learn contextually rich representations without explicit functional labels.

    This integrated approach aims to create embeddings that capture both evolutionary conservation and structural context, leading to more accurate and robust predictions of variant effects and broader utility for other protein-related tasks.

4.2. Core Methodology In-depth (Layer by Layer)

The SSEmb model is trained in a self-supervised manner using a combination of Multiple Sequence Alignments (MSAs) and protein structures. The overall architecture involves a structure-constrained MSA Transformer whose output embeddings are then fed into a Graph Neural Network (GNN) module. The model's training objective is to predict masked amino acids.

4.2.1. Input Data Generation

  • Protein Structures: The protein structures used for training were sourced from the CATH 4.2 data set, which contains 18,204 non-redundant training proteins (at 40% sequence identity) partitioned by CATH class. For benchmarking and clinical tasks, AlphaFold-predicted structures were used.
    • Pre-processing: For testing, OpenMM PDBFixer package was used for pre-processing. Training set structures were used without modifications.
  • Multiple Sequence Alignments (MSAs): For each protein structure, an MSA was generated using MMSeqs2 in conjunction with ColabFold's filtering protocol. This protocol aims to maximize sequence diversity in the final alignment.
    • MMSeqs2 parameters: -diff=512, -filter-min-enable=64, -max-seq-id=0.90, and an additional -cov=0.75 for each sequence identity bucket to ensure high-coverage sequences.
  • MSA Subsampling: Due to GPU memory constraints, the full MSA is randomly subsampled before being fed into SSEmb. During training, 16 MSA sequences are subsampled. For inference, an ensemble of 5 predictions is generated by subsampling 16 sequences 5 times with replacement, and the mean of these predictions is taken.
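As a rough illustration of the subsample-and-ensemble strategy just described, the sketch below assumes a hypothetical `score_variants` callable that runs the model on one MSA subsample; the function name, the choice to always retain the query sequence, and the exact sampling scheme are assumptions, not the authors' implementation.

```python
import numpy as np

def ensemble_variant_scores(msa, score_variants, n_seqs=16, n_draws=5, seed=0):
    """Average variant scores over several random MSA subsamples, mirroring the
    ensembling described above. The query is assumed to be msa[0] and is kept
    in every draw; this and the sampling details are assumptions."""
    rng = np.random.default_rng(seed)
    query = msa[0]
    scores = []
    for _ in range(n_draws):
        # Draw (n_seqs - 1) additional rows; independent draws play the role of
        # "subsampling with replacement" across the ensemble.
        idx = rng.choice(np.arange(1, len(msa)), size=n_seqs - 1, replace=False)
        subset = [query] + [msa[i] for i in idx]
        scores.append(score_variants(subset))
    return np.mean(scores, axis=0)  # final score: mean over the ensemble of draws
```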

4.2.2. Structure-Constrained MSA Transformer

The MSA Transformer component of SSEmb is based on the original MSA Transformer architecture.

  • Initialization: The model is initialized using pre-trained weights from the original MSA Transformer (available at https://github.com/facebookresearch/esm/tree/main/esm).
  • Structure Constraint: A key modification is the application of a binary contact mask to the attention maps that operate across MSA columns, i.e., the row attention (attention across positions within each sequence).
    • The contact mask is derived from the 20 nearest neighbor graph structures generated for the protein structure (the same graph used in the GNN module).
    • This mask ensures that row attention values are only propagated between MSA positions that are spatially proximal in the 3D protein structure. This effectively injects structural information directly into the MSA Transformer's attention mechanism.
  • Training: During training, only the row attention layers of the structure-constrained MSA Transformer are fine-tuned. This strategy aims to preserve the phylogenetic information already encoded in the column attention layers (attention across sequences at the same position), which are kept frozen.
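A hedged sketch of the structure-constraint idea: build a binary contact mask from a 20-nearest-neighbour C-alpha graph and suppress row-attention logits between positions that are not spatial neighbours. The distance criterion and masking mechanics are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def knn_contact_mask(ca_coords, k=20):
    """Boolean LxL mask marking, for each residue, its k nearest neighbours
    (plus itself) by C-alpha distance."""
    ca_coords = np.asarray(ca_coords)
    L = len(ca_coords)
    dists = np.linalg.norm(ca_coords[:, None, :] - ca_coords[None, :, :], axis=-1)
    neighbours = np.argsort(dists, axis=-1)[:, : k + 1]  # self is included (distance 0)
    mask = np.zeros((L, L), dtype=bool)
    rows = np.repeat(np.arange(L), k + 1)
    mask[rows, neighbours.ravel()] = True
    return mask

def mask_row_attention_logits(logits, contact_mask):
    """Suppress row-attention logits between positions that are not spatial
    neighbours, so attention only propagates along structural contacts."""
    return np.where(contact_mask, logits, -np.inf)

# Usage sketch: random coordinates standing in for a 50-residue chain.
coords = np.random.default_rng(0).normal(size=(50, 3))
mask = knn_contact_mask(coords, k=20)
logits = np.random.default_rng(1).normal(size=(50, 50))
masked_logits = mask_row_attention_logits(logits, mask)
```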

4.2.3. GNN Module

The GNN module in SSEmb processes the protein structure and incorporates the sequence information from the MSA Transformer. It largely follows the architecture of the Geometric Vector Perceptron (GVP) model but with specific adjustments.

  • Graph Definition: Graph edges are defined for the 20 closest node neighbors based on spatial proximity in the protein structure, rather than the 30 used in the original GVP implementation.

  • Node Embeddings: Node embedding dimensions (representing each amino acid residue) are increased to 256 for scalar features and 64 for vector channels. The GVP model represents features as both scalars (magnitude-only values) and vectors (values with magnitude and direction), which is crucial for maintaining rotational equivariance with respect to the 3D protein structure.

  • Edge Embeddings: Edge embedding dimensions are kept at 32 (scalar) and 1 (vector), as in the original GVP model.

  • Architecture: The number of encoder and decoder layers is increased to four.

  • Vector Gating: The GNN uses vector gating, a mechanism often employed in GNNs to control the flow of information based on vector features, enhancing the model's ability to process geometric information.

  • Information Fusion: The MSA query sequence embeddings (from the last layer of the structure-constrained MSA Transformer) are concatenated to the node embeddings of the GNN decoder. These combined features then pass through a dense layer to reduce dimensionality before being processed by the GNN.

  • Prediction Task: The GNN's prediction task is modified from auto-regressive sequence prediction (as in original GVP applications) to a masked token prediction task (consistent with the MSA Transformer's objective).

    The resulting architecture, visualized in Figure 1, demonstrates this integration:


Fig. 1 | Overview of the SSEmb model and how it is trained. The model takes as input a subsampled MSA with a partially masked query sequence and a complete protein structure. The protein structure graph is used to mask (constrain) the row attention (i.e., attention across MSA columns) in the MSA Transformer. The MSA query sequence embeddings from the structure-constrained MSA Transformer are concatenated to the protein graph nodes. During training, SSEmb tries to predict the amino acid type at the masked positions. The model is optimized using the cross-entropy loss between the predicted and the true amino acid tokens at the masked positions. Variant effect prediction is made from these predictions as described in Methods.
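The information-fusion step (concatenating the MSA query embeddings with the GNN node scalars and reducing dimensionality with a dense layer) could look roughly like the PyTorch sketch below. The 768-dimensional MSA embedding follows the standard MSA Transformer hidden size and the 256-dimensional node scalars follow the value quoted above, but the module itself is an illustrative stand-in, not the authors' code.

```python
import torch
import torch.nn as nn

class SeqStructFusion(nn.Module):
    """Concatenate per-residue MSA-Transformer query embeddings with GNN node
    scalar features and project back down with a dense layer. Dimensions are
    illustrative assumptions."""

    def __init__(self, msa_dim=768, node_dim=256):
        super().__init__()
        self.project = nn.Linear(msa_dim + node_dim, node_dim)

    def forward(self, msa_query_emb, node_scalars):
        # msa_query_emb: (L, msa_dim), node_scalars: (L, node_dim)
        fused = torch.cat([msa_query_emb, node_scalars], dim=-1)
        return self.project(fused)          # (L, node_dim) fused node features

# Example with a 120-residue protein.
fusion = SeqStructFusion()
out = fusion(torch.randn(120, 768), torch.randn(120, 256))
print(out.shape)  # torch.Size([120, 256])
```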

4.2.4. Model Training

SSEmb is trained in a self-supervised manner using a modified BERT masking scheme.

  • Masking Strategy: Before each forward pass, 15% of all wild-type sequence residues are randomly selected for optimization. Within this 15% subset:
    • 60% are completely masked (replaced with a special [MASK] token).
    • 20% are masked, and their corresponding MSA columns (across all sequences in the subsampled MSA) are also masked. This helps reduce reliance on conservation signals.
    • 10% are replaced by a random amino acid type.
    • 10% are left unchanged.
  • Prediction Task: The SSEmb model is tasked with predicting the original amino acid types of these masked residues, given the protein structure and the subsampled MSA input.
  • Loss Function: The masked prediction task is optimized using the cross-entropy loss between the predicted amino acid types and the true amino acid types for the 15% selected residues.
  • Gradual Unfreezing: The training process employs a two-step gradual unfreezing method:
    1. Step 1: The GNN module is trained until it approaches convergence, while the parameters of the structure-constrained MSA Transformer are kept frozen.
    2. Step 2: The row attention parameters within the structure-constrained MSA Transformer are unfrozen, and both the GNN module and these MSA Transformer parameters are fine-tuned together. Early stopping is used, assessed by the mean correlation performance on the MAVE validation set.
  • Optimizer: The Adam optimizer is used with different learning rates: $10^{-3}$ for the GNN module and $10^{-6}$ for the structure-constrained MSA Transformer.
  • Batch Sizes: Batch sizes are fixed at 128 proteins for the first training stage and 2048 proteins for the second stage.
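A minimal sketch of the modified BERT masking scheme described above, assuming the column-masking decision is simply recorded as a per-position flag; token handling and implementation details are assumptions, not the authors' code.

```python
import numpy as np

MASK_TOKEN = "<mask>"
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")

def apply_ssemb_masking(query_seq, rng):
    """Select 15% of positions; of these, 60% are masked, 20% are masked together
    with their MSA column (signalled here by a flag), 10% are replaced by a random
    amino acid, and 10% are left unchanged."""
    L = len(query_seq)
    n_select = max(1, int(round(0.15 * L)))
    selected = rng.choice(L, size=n_select, replace=False)
    masked_seq = list(query_seq)
    mask_column = np.zeros(L, dtype=bool)   # positions whose MSA column is also masked
    for pos in selected:
        r = rng.random()
        if r < 0.60:
            masked_seq[pos] = MASK_TOKEN
        elif r < 0.80:
            masked_seq[pos] = MASK_TOKEN
            mask_column[pos] = True
        elif r < 0.90:
            masked_seq[pos] = rng.choice(AMINO_ACIDS)
        # else: leave the position unchanged
    return masked_seq, selected, mask_column

masked, targets, col_flags = apply_ssemb_masking("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", np.random.default_rng(0))
```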

4.2.5. Variant Effect Prediction

At inference time, SSEmb predicts variant effects based on the masked marginal method.

  • Ensembling: To generate robust predictions, 16 sequences are randomly subsampled from the full MSA five times with replacement, creating an ensemble of model predictions. The final SSEmb score is the mean of the scores from this ensemble.

  • Score Calculation: Protein variant scores are computed using the masked marginal method as follows:

    $ \sum_{i \in M} \log p(x_i = x_i^{\mathrm{var}} \mid x_{-M}) - \log p(x_i = x_i^{\mathrm{wt}} \mid x_{-M}) $

    • $x_i^{\mathrm{var}}$: The amino acid type of the variant (mutant) sequence at position $i$.
    • $x_i^{\mathrm{wt}}$: The amino acid type of the wild-type (original) sequence at position $i$.
    • $M$: The set of substituted (mutated) positions. For single amino acid variants, $M$ contains just one position.
    • $x_{-M}$: The sequence with the positions in $M$ replaced by mask tokens, so the model predicts the likelihood of the variant or wild-type amino acid at position $i$ given the rest of the sequence, with the mutation sites masked.
    • $\log p(\cdot \mid \cdot)$: The log-probability predicted by the model for a specific amino acid type at a given position, conditioned on the masked sequence.
    • The formula computes the log-likelihood ratio of the variant versus the wild-type amino acid at each mutated position, summed over all mutated positions (for single variants, just one position). A more negative score indicates a stronger predicted deleterious effect. Because contributions from different positions are summed, this is an additive variant effect model.
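Given per-position log-probabilities computed with the mutated positions masked, the masked marginal score can be assembled as in the sketch below; the `log_probs` array, `aa_index` mapping, and variant tuple format are illustrative assumptions.

```python
import numpy as np

def masked_marginal_score(log_probs, wt_seq, variants, aa_index):
    """Score a (possibly multi-substitution) variant with the masked marginal method.

    log_probs: (L, 20) array of model log-probabilities at each position,
               computed with the mutated positions masked in the input.
    wt_seq:    wild-type sequence as a string.
    variants:  list of (position, wt_aa, var_aa) tuples (0-based positions).
    aa_index:  dict mapping amino acid letters to columns of log_probs.
    """
    score = 0.0
    for pos, wt_aa, var_aa in variants:
        assert wt_seq[pos] == wt_aa, "wild-type mismatch at mutated position"
        score += log_probs[pos, aa_index[var_aa]] - log_probs[pos, aa_index[wt_aa]]
    return score  # more negative = predicted more deleterious
```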

5. Experimental Setup

5.1. Datasets

The authors utilized a variety of datasets for training, validation, and benchmarking SSEmb across different tasks.

  • Training Dataset:

    • Source: CATH 4.2 data set (for protein structures) and MSAs generated for these structures.
    • Characteristics: Contains 18,204 training proteins with 40% non-redundancy, partitioned by CATH class. Proteins present in the MAVE validation set or ProteinGym test set were removed using a 95% sequence identity cut-off to prevent data leakage.
    • Purpose: To train the SSEmb model in a self-supervised manner.
  • MAVE Validation Set:

    • Source: A selection of 10 Multiplexed Assays of Variant Effects (MAVEs). Most data was from ProteinGym, with exceptions: LDLRAP1 from Jiang and Roth (47, 48), and MAPK from Brenan et al. (49, 50).
    • Characteristics: This set included a mix of assays probing protein activity and abundance, with a majority focusing on activity. It also contained assays considered difficult or easier to predict by structure- or sequence-based methods.
      • Example data: Variant effect scores (e.g., changes in abundance, competitive growth, E1 reactivity, two-hybrid assay scores) for individual amino acid substitutions in proteins like NUD15, TPMT, CP2C9, P53, PABP, SUMO1, RL401, PTEN, MAPK, LDLRAP1.
    • Purpose: Used for hyperparameter selection and early stopping during model development, providing informative feedback on SSEmb's ability to capture different mechanistic aspects.
  • ProteinGym Substitution Benchmark:

    • Source: A large collection of MAVE data originally collected in reference 6.
    • Characteristics: Comprises 87 datasets on 72 different proteins. 76 datasets contain single substitution effects, while 11 include multiple substitutions. Assays part of the SSEmb validation set were excluded. When multiple assays were present for a single UniProt ID, the mean correlation over assays was reported. The benchmark is segmented by MSA depth:
      • Low MSA depth: $N_{\mathrm{eff}}/L < 1$ (where $N_{\mathrm{eff}}$ is the effective number of sequences and $L$ is the sequence length).
      • Medium MSA depth: $1 \leq N_{\mathrm{eff}}/L < 100$.
      • High MSA depth: $N_{\mathrm{eff}}/L > 100$.
    • Purpose: To extensively benchmark SSEmb's prediction accuracy against other variant effect prediction models.
    • Predicted Structures: For this benchmark, AlphaFold-predicted structures were used as input to SSEmb.
  • Mega-scale Protein Stability Dataset:

    • Source: Dataset 3 from (59), consisting of experimentally well-defined $\Delta\Delta G$ measurements.
    • Characteristics: Contains 607,839 protein sequences. Filtering excluded synonymous, insertion, deletion mutations, and domains without corresponding AlphaFold models.
      • Example data: $\Delta\Delta G$ values (change in Gibbs free energy of unfolding) indicating changes in protein stability upon mutation.
    • Purpose: To test the zero-shot performance of SSEmb as a predictor of protein stability.
  • ProteinGym Clinical Substitution Benchmark:

    • Source: A large set of variants with clinical annotations (67).
    • Characteristics: Contains clinically annotated missense variants in humans.
    • Purpose: To evaluate SSEmb's zero-shot performance in classifying disease-causing variants (variant pathogenicity).
  • Protein-Protein Binding Site (PPBS) Dataset:

    • Source: PPBS data set (74).
    • Characteristics: Originally contained 20,025 protein chains with binary residue-level binding site labels. Filtered to exclude obsolete entries, missing binding site labels, sequence mismatches, and chains longer than 1024 amino acids (due to MSA Transformer limit). The modified dataset contained 19,264 protein chains.
      • Example data: Each residue in a protein chain is labeled as either belonging to a binding site (1) or not belonging to a binding site (0).
    • Purpose: To evaluate the utility of SSEmb embeddings for a downstream task of protein-protein binding site prediction using a small supervised model.

5.2. Evaluation Metrics

The paper uses several standard evaluation metrics depending on the specific task:

  • Spearman's Rank Correlation Coefficient ($\rho_s$):

    • Conceptual Definition: Spearman's rank correlation coefficient assesses the monotonic relationship between two ranked variables. It measures how well the relationship between two variables can be described using a monotonic function. Unlike Pearson's correlation coefficient, it does not assume that the relationship between the variables is linear. It is suitable for variant effect prediction tasks where the absolute magnitude might be hard to predict, but the ranking of effects is important. A short usage sketch covering the metrics in this section appears after this list.
    • Mathematical Formula: $ \rho_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} $
    • Symbol Explanation:
      • $d_i$: The difference between the ranks of the $i$-th observation for the two variables (e.g., predicted rank vs. experimental rank).
      • $n$: The number of observations (e.g., number of variants).
  • Area Under the Receiver Operating Characteristic Curve (AUC or AUROC):

    • Conceptual Definition: AUC is a performance metric for binary classification problems. It represents the probability that the model ranks a randomly chosen positive instance higher than a randomly chosen negative instance. An AUC of 1.0 indicates a perfect classifier, while 0.5 indicates a classifier no better than random guessing. It's robust to imbalanced datasets.
    • Mathematical Formula: AUC is the integral of the ROC curve. The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings. $ \text{TPR (Sensitivity)} = \frac{\text{TP}}{\text{TP} + \text{FN}} $ $ \text{FPR (1 - Specificity)} = \frac{\text{FP}}{\text{FP} + \text{TN}} $ where TP are True Positives, FN are False Negatives, FP are False Positives, and TN are True Negatives. AUC itself does not have a simple closed-form formula but is computed as the area under the TPR vs FPR plot.
    • Symbol Explanation:
      • TP: Number of true positives (correctly predicted positive instances).
      • FN: Number of false negatives (positive instances incorrectly predicted as negative).
      • FP: Number of false positives (negative instances incorrectly predicted as positive).
      • TN: Number of true negatives (correctly predicted negative instances).
  • Area Under the Precision-Recall Curve (PR-AUC):

    • Conceptual Definition: PR-AUC is another metric for binary classification, particularly useful for imbalanced datasets where the positive class is rare. It plots Precision against Recall at various threshold settings. A higher PR-AUC indicates better performance. It is often preferred over AUC when the focus is on the performance on the positive class, especially when false positives are costly.
    • Mathematical Formula: PR-AUC is the integral of the Precision vs Recall curve. $ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} $ $ \text{Recall (Sensitivity)} = \frac{\text{TP}}{\text{TP} + \text{FN}} $ PR-AUC is computed as the area under the Precision vs Recall plot.
    • Symbol Explanation:
      • TP: Number of true positives.
      • FP: Number of false positives.
      • FN: Number of false negatives.
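As referenced above, here is a short usage sketch of the three metrics with SciPy and scikit-learn; the toy numbers are illustrative only, and `average_precision_score` is used as the usual proxy for PR-AUC.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import roc_auc_score, average_precision_score

# Toy data: predicted variant scores vs. experimental measurements,
# and predicted probabilities vs. binary labels (e.g., binding-site residues).
pred_scores = np.array([-3.1, -0.2, -1.5, 0.1, -2.4])
exp_scores  = np.array([0.10, 0.95, 0.40, 1.00, 0.20])
probs  = np.array([0.9, 0.2, 0.7, 0.1, 0.8])
labels = np.array([1, 0, 1, 0, 1])

rho, _ = spearmanr(pred_scores, exp_scores)       # Spearman rank correlation
auroc  = roc_auc_score(labels, probs)             # area under the ROC curve
pr_auc = average_precision_score(labels, probs)   # average precision ~ PR-AUC
print(f"Spearman rho={rho:.2f}  AUROC={auroc:.2f}  PR-AUC={pr_auc:.2f}")
```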

5.3. Baselines

The SSEmb model was benchmarked against a range of established and state-of-the-art variant effect prediction and protein property prediction methods:

  • For MAVE Validation and ProteinGym Benchmark:

    • GEMME: An MSA-based model (Global Epistatic Model Predicting Mutational Effects) that leverages evolutionary information. For comparison to the ProteinGym benchmark, data directly from ProteinGym was used. For validation, MSAs were generated using HHblits (version 2.0.15) to search the UniRef30 database with specific parameters (-e 1e-10 -p 40 -b 1 -B 20000) and additional filtering (remove columns not in the query, remove rows with >50% gaps).
    • Rosetta: A structure-based method that predicts changes in protein stability ($\Delta\Delta G$) using the Cartesian $\Delta\Delta G$ protocol (51). Thermodynamic stability changes (in Rosetta Energy Units) were converted to kcal/mol by dividing by 2.9.
    • MSA Transformer (original): The foundational Transformer model for MSAs that SSEmb builds upon.
    • Other high-accuracy methods from ProteinGym:
      • TranceptEVE L
      • Tranception L
      • EVE (ensemble) and EVE (single)
      • VESPA
      • ESM2 (15B): Another large protein language model.
      • ProteinMPNN: A model primarily for protein design/inverse folding, included for comparison.
  • For Protein-Protein Binding Site Prediction:

    • ScanNet: A specialized, state-of-the-art geometric deep learning model specifically developed for structure-based protein binding site prediction (74).

    • Handcrafted features baseline: An xgboost model (a gradient boosting algorithm) trained with handcrafted structure- and sequence-based features (74). This represents a traditional feature engineering approach.

      These baselines were chosen to cover different approaches (sequence-only, structure-only, various PLMs, and specialized downstream models) and to provide a comprehensive comparison of SSEmb's performance across diverse tasks and MSA depths.

6. Results & Analysis

6.1. Core Results Analysis

The SSEmb model was rigorously evaluated across several tasks, demonstrating its effectiveness, particularly in integrating sequence and structure information for robust predictions.

6.1.1. Validation using Multiplexed Assays of Variant Effects (MAVEs)

During model development, SSEmb was validated on 10 MAVEs from the validation set.

The following are the results from Table 1 of the original paper:

Spearman $|\rho_s|$ (↑) for SSEmb, GEMME, and Rosetta on each validation MAVE:

| Protein | MAVE reference | MAVE type | SSEmb | GEMME | Rosetta |
|---|---|---|---|---|---|
| NUD15 | Suiter et al. 2020 | Abundance | 0.584 | 0.543 | 0.437 |
| TPMT | Matreyek et al. 2018 | Abundance | 0.523 | 0.529 | 0.489 |
| CP2C9 | Amorosi et al. 2021 | Abundance | 0.609 | 0.423 | 0.519 |
| P53 | Kotler et al. 2018 | Competitive growth | 0.577 | 0.655 | 0.488 |
| PABP | Melamed et al. 2013 | Competitive growth | 0.595 | 0.569 | 0.384 |
| SUMO1 | Weile et al. 2017 | Competitive growth | 0.481 | 0.406 | 0.433 |
| RL401 | Roscoe & Bolon 2014 | E1 reactivity | 0.438 | 0.390 | 0.366 |
| PTEN | Mighell et al. 2018 | Competitive growth | 0.422 | 0.532 | 0.423 |
| MAPK | Brenan et al. 2016 | Competitive growth | 0.395 | 0.445 | 0.307 |
| LDLRAP1 | Jiang et al. 2019 | Two-hybrid assay | 0.411 | 0.348 | 0.377 |
| Mean | – | – | 0.503 | 0.484 | 0.422 |
  • Key Finding: SSEmb achieves a higher mean Spearman correlation ($\rho_s$) of 0.503 on the MAVE validation set compared to GEMME (0.484) and Rosetta (0.422).
  • Insights: SSEmb performs particularly well on abundance assays, where it often outperforms GEMME (e.g., NUD15, CP2C9). This suggests that the added structural information in SSEmb is beneficial for predicting effects related to protein abundance. While GEMME sometimes outperforms SSEmb on activity-based MAVEs (e.g., P53, PTEN, MAPK), SSEmb generally maintains strong performance across both types of assays. This indicates that SSEmb successfully integrates information to make accurate predictions for both activity and abundance.

6.1.2. Testing SSEmb on ProteinGym Benchmark

The model was tested on the ProteinGym substitution benchmark, excluding the validation set assays. Results are segmented by MSA depth.

The following are the results from Table 2 of the original paper:

Spearman $\rho_s$ by MSA depth (↑):

| Model | Low | Medium | High | All |
|---|---|---|---|---|
| TranceptEVE L | 0.451 | 0.462 | 0.502 | 0.468 |
| GEMME | 0.429 | 0.448 | 0.495 | 0.453 |
| SSEmb (ours) | 0.449 | 0.439 | 0.501 | 0.453 |
| Tranception L | 0.438 | 0.438 | 0.467 | 0.444 |
| EVE (ensemble) | 0.412 | 0.438 | 0.493 | 0.443 |
| VESPA | 0.411 | 0.422 | 0.514 | 0.438 |
| EVE (single) | 0.405 | 0.431 | 0.488 | 0.437 |
| MSA Transformer (ensemble) | 0.385 | 0.426 | 0.470 | 0.426 |
| ESM2 (15B) | 0.342 | 0.368 | 0.433 | 0.375 |
| ProteinMPNN | 0.189 | 0.151 | 0.237 | 0.175 |
  • Key Finding: SSEmb generally compares favorably to other variant effect prediction methods on the ProteinGym benchmark. Its overall Spearman correlation is 0.453, matching GEMME and being very close to the top-performing TranceptEVE L (0.468).

  • Robustness for Low MSA Depth: Crucially, SSEmb achieves 0.449 Spearman correlation for low MSA depth proteins, significantly outperforming the original MSA Transformer ensemble (0.385) and many other models, and nearly matching the best TranceptEVE L (0.451). This directly supports the design goal of being robust when sequence information is scarce.

    This robustness for low MSA depth is further highlighted in Figure 2:


Fig. 2 | Overview of SSEmb results on the ProteinGym low-MSA ($N_{\mathrm{eff}}/L < 1$) substitution benchmark subset grouped by UniProt ID. Spearman correlations are plotted for both SSEmb (blue) and the MSA Transformer ensemble (orange). The mean and standard error of the mean of the set of all ProteinGym Spearman correlations are presented in the legend. Assays from the SSEmb validation set have been excluded from the original data set. Source data are provided as a Source Data file.

  • Graphical Evidence: Figure 2 visually confirms SSEmb's superior performance for proteins with low MSA depth ($N_{\mathrm{eff}}/L < 1$). The blue bars (SSEmb) are generally higher than the orange bars (MSA Transformer ensemble) across UniProt IDs. The mean Spearman correlation for SSEmb is 0.45 $\pm$ 0.05, while for the MSA Transformer ensemble it is 0.39 $\pm$ 0.05. This difference underscores the benefit of integrating structural information in sequence-sparse contexts.
  • Robustness to Structure Quality: Supplementary Fig. 4 (not provided here, but mentioned in the text) indicates that SSEmb is robust to the quality of the input protein structure (experimental vs. AlphaFold predictions), showing only a weak correlation between performance and TM-scores. This is attributed to the backbone-based representation of the structure and complementary MSA information.

6.1.3. Prediction of Protein Stability

  • Key Finding: SSEmb achieves an absolute Spearman correlation coefficient of 0.61 (Supplementary Fig. 2, not provided here) on mega-scale measurements of protein stability.
  • Insights: This performance is comparable to dedicated methods for protein stability predictions, demonstrating SSEmb's ability to act as a zero-shot predictor for protein stability, even without specific training on stability data. This suggests that SSEmb learns generalizable features relevant to protein biophysics.

6.1.4. Classification of Disease-Causing Variants

SSEmb was evaluated on a large set of clinically annotated variants.

The following are the results from Table 3 of the original paper:

| Model | Avg. AUC (↑) |
|---|---|
| TranceptEVE L | 0.920 |
| GEMME | 0.919 |
| EVE | 0.917 |
| SSEmb | 0.893 |
| ESM-1b | 0.892 |
  • Key Finding: SSEmb performs relatively well in classifying disease-causing variants, achieving an average AUC of 0.893.
  • Insights: While TranceptEVE L, GEMME, and EVE achieve slightly higher AUC values, SSEmb's performance is commendable given it is a zero-shot predictor for this task. The subtle differences in ranking between MAVE and pathogenicity evaluations suggest that these two tasks capture different aspects of variant effects, and a model optimized for one might not be perfectly optimized for the other.

6.1.5. Prediction of Protein-Protein Binding Sites using Embeddings

The information richness of SSEmb's embeddings was explored for protein-protein binding site prediction.

The following are the results from Table 4 of the original paper:

PR-AUC (↑) on the PPBS test splits:

| Model | Test set (70%) | Test set (homology) | Test set (topology) | Test set (none) | Test set (all) |
|---|---|---|---|---|---|
| SSEmb downstream | 0.684 | 0.651 | 0.672 | 0.571 | 0.642 |
| Handcrafted features baseline | 0.596 | 0.567 | 0.568 | 0.432 | 0.537 |
| ScanNet | 0.732 | 0.712 | 0.735 | 0.605 | 0.694 |
  • Key Finding: A small supervised downstream model trained on SSEmb embeddings achieves a PR-AUC of 0.642 (mean across all test sets) for protein-protein binding site prediction.
  • Insights: This performance places SSEmb's downstream model significantly above the handcrafted features baseline (0.537) but below the specialized ScanNet model (0.694). This serves as a strong "proof of principle" that SSEmb embeddings contain a rich blend of structural and sequence-based information useful for downstream tasks. The fact that a simple downstream model can achieve such results without specialized feature engineering highlights the quality and generalizability of the SSEmb representations. Performance degrades gracefully as test-set similarity to the training data decreases, as expected (the Test set (none) split is the hardest).
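As a proof-of-principle illustration of training a small supervised head on frozen per-residue embeddings (not the authors' downstream architecture), here is a minimal PyTorch sketch; the embedding dimension, hidden size, and training loop are assumptions.

```python
import torch
import torch.nn as nn

class BindingSiteHead(nn.Module):
    """Tiny per-residue classifier on top of frozen per-residue embeddings."""

    def __init__(self, emb_dim=256, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, residue_embeddings):               # (L, emb_dim)
        return self.net(residue_embeddings).squeeze(-1)  # (L,) binding-site logits

# Single training step with binary residue labels (1 = binding-site residue).
head = BindingSiteHead()
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

embeddings = torch.randn(150, 256)             # stand-in for frozen SSEmb residue embeddings
labels = torch.randint(0, 2, (150,)).float()
optimizer.zero_grad()
loss = loss_fn(head(embeddings), labels)
loss.backward()
optimizer.step()
```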

6.2. Ablation Study

An ablation study (Supplementary Table 1, not provided here but discussed in the text) was performed to understand the contribution of different components of SSEmb to its performance, especially regarding MSA depth. The study focused on three components used to inject structure information into the MSA Transformer:

  1. The GVP-GNN module after the MSA Transformer.
  2. Structure-based row attention masking in the MSA Transformer.
  3. Fine-tuning of the MSA Transformer with column masking (reducing reliance on conservation signals).
  • Key Findings:
    • Structure-based masking of the MSA Transformer and fine-tuning with column masking are crucial for decreasing the sensitivity of the model to MSA depth. These components help maintain accuracy even with shallow MSAs.

    • An ablated SSEmb model (even without all components) still outperforms the original MSA Transformer from the ProteinGym benchmark. This is attributed to the optimized MSA-generation protocol used in SSEmb and the ensembling over MSA subsamples during inference.

    • The fine-tuned MSA Transformer without a structural component performs best overall (across all MSA depths), but at the cost of accuracy for low-MSA-depth proteins. This highlights that while structural components might slightly reduce overall average performance (potentially due to introducing noise or complexity), they are vital for robustness in challenging, data-scarce scenarios.

      This study confirms that the structural integration and specific training strategies are instrumental in SSEmb's ability to provide robust predictions when MSA information is limited.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully introduces SSEmb (Sequence Structure Embedding), a novel computational model that comprehensively integrates protein sequence and structure information. By combining a structure-constrained MSA Transformer with a Graph Neural Network (GNN), SSEmb learns rich, joint embeddings through a self-supervised masked amino acid prediction task.

The key findings demonstrate:

  1. Robust Variant Effect Prediction: SSEmb provides robust and accurate variant effect predictions, performing on par with leading sequence-based methods overall and clearly outperforming the original MSA Transformer, especially when Multiple Sequence Alignments (MSAs) are shallow.

  2. Generalizable Embeddings: The learned SSEmb embeddings are highly informative and prove valuable for various downstream tasks, exemplified by their effective use in predicting protein-protein binding sites with performance comparable to specialized methods.

  3. Insights into Variant Mechanisms: The model's ability to correlate with both protein activity and abundance suggests it captures diverse mechanistic aspects of variant effects.

    SSEmb represents a significant step towards developing more holistic and resilient protein representation learning models that can function effectively even in data-limited scenarios, making it a valuable tool for disease variant classification and protein engineering.

7.2. Limitations & Future Work

The authors acknowledge several limitations and suggest avenues for future research:

  • Modest Absolute Correlations: While SSEmb performs well relative to other methods, the absolute Spearman correlation values (e.g., mean 0.503 on validation MAVEs) are still relatively modest. Future work could explore better integration with supervised methods to potentially improve absolute accuracy.

  • Reliance on Input Data Availability: The model requires both a (subsampled) MSA and a protein structure. This limits its applicability in cases where these inputs are unreliable or unavailable, such as for intrinsically disordered proteins (IDPs) or protein complexes lacking experimentally resolved structures.

  • Training Data Coverage: Despite being trained on a relatively large dataset, this represents only a small fraction of the vast sequence-structure space. SSEmb is expected to suffer from degrading performance when making predictions for proteins highly dissimilar to those in its training data (e.g., de-novo-designed proteins).

  • Not Always State-of-the-Art for Specialized Tasks: Although SSEmb embeddings are useful for downstream tasks like binding site prediction, a generic model like SSEmb may not always achieve state-of-the-art accuracy compared to models specifically developed and optimized for individual purposes (e.g., ScanNet for binding sites).

    Future work could focus on:

  • Improving absolute prediction accuracy through hybrid self-supervised/supervised learning approaches.

  • Extending the model to better handle IDPs or predicting effects in protein complexes where structural information is challenging.

  • Expanding training data diversity to improve generalization to novel protein folds and sequences.

  • Further exploring how the integration of sequence and structure can disentangle mechanistic aspects of variant effects.

7.3. Personal Insights & Critique

This paper presents a compelling and logically sound approach to tackling the persistent challenge of variant effect prediction. The end-to-end integration of MSA Transformer and GNN via structure-constrained attention is a particularly elegant solution to infuse spatial information directly into the sequence context, rather than merely concatenating features.

Strengths:

  • Conceptual Clarity: The idea of using structure to "constrain" MSA attention is intuitive and effective, especially for overcoming the limitations of shallow MSAs. This makes SSEmb highly practical for many real-world proteins that lack deep evolutionary histories.
  • Rigorous Benchmarking: The use of MAVE validation sets, the comprehensive ProteinGym benchmark (segmented by MSA depth), and testing on stability and clinical variant tasks demonstrates a thorough evaluation of the model's capabilities.
  • Generalizable Embeddings: The "proof of principle" that SSEmb embeddings are useful for protein-protein binding site prediction is significant. It suggests that SSEmb is not just a variant effect predictor but a powerful protein representation learner, with potential applications across various biophysical and functional prediction tasks.

Potential Areas for Improvement/Critique:

  • Complexity vs. Interpretability: While the integration is powerful, Transformer and GNN models can be complex. Further work on interpreting why SSEmb makes certain predictions, especially regarding the interplay between sequence conservation and structural context, would be valuable for biological insights.

  • Dynamic Structures: The current model relies on a single static protein structure. IDPs or proteins undergoing significant conformational changes are challenging. Exploring methods to incorporate structural dynamics or ensemble information could extend its applicability.

  • Beyond Single Point Mutations: While the paper focuses on single substitutions, many disease variants involve indels (insertions/deletions) or multi-point mutations. Adapting SSEmb to handle these more complex variants would be a natural next step.

  • Computational Cost: Training and inference with large Transformer and GNN models can be computationally intensive. While MSA subsampling helps, exploring more efficient architectures or approximation methods could broaden accessibility.

    The core idea of SSEmb—integrating complementary protein data types at a fundamental model level—is broadly applicable. This approach could inspire future models in other domains where multiple, interdependent data modalities exist (e.g., combining genomics, transcriptomics, and proteomics for disease prediction), or in other areas of structural biology (e.g., protein-ligand binding, protein folding from sequence alone but with implicit structural priors). SSEmb is a strong testament to the power of holistic data integration in deep learning for scientific discovery.

Similar papers

Recommended via semantic vector search.

No similar papers found yet.