SSEmb: A joint embedding of protein sequence and structure enables robust variant effect predictions
TL;DR Summary
The study introduces SSEmb, a method integrating protein sequence and structure for robust variant effect predictions, particularly effective with limited sequence data, and applicable in tasks like predicting protein-protein binding sites.
Abstract
The ability to predict how amino acid changes affect proteins has a wide range of applications including in disease variant classification and protein engineering. Here, we present SSEmb (Sequence Structure Embedding), a method that integrates sequence and structure information in a single model. By combining a graph representation of protein structure with a transformer model for processing multiple sequence alignments, we demonstrate that SSEmb provides robust variant effect predictions, especially in cases where sequence information is limited, and is additionally useful for other tasks such as predicting protein-protein binding sites.
In-depth Reading
1. Bibliographic Information
1.1. Title
SSEmb: A joint embedding of protein sequence and structure enables robust variant effect predictions
1.2. Authors
Lasse M. Blaabjerg, Nicolas Jonsson, Wouter Boomsma, Amelie Stein, & Kresten Lindorff-Larsen
1.3. Journal/Conference
Published online in Nature Communications. Nature Communications is a highly reputable, peer-reviewed open access scientific journal published by Springer Nature. It covers all areas of the natural sciences, including biology, chemistry, physics, and earth sciences, and is known for publishing high-quality research across diverse fields.
1.4. Publication Year
Published online: 07 November 2024 (Received: 21 January 2024)
1.5. Abstract
The paper presents SSEmb (Sequence Structure Embedding), a novel method designed to predict how amino acid changes affect proteins. This has wide applications in disease variant classification and protein engineering. SSEmb uniquely integrates protein sequence information, specifically from multiple sequence alignments (MSAs), with three-dimensional protein structure information into a single model. It achieves this by combining a graph representation of protein structure with a Transformer model for processing MSAs. The authors demonstrate that SSEmb provides robust variant effect predictions, particularly in scenarios where sequence information (i.e., MSA depth) is limited. Furthermore, the learned embeddings from SSEmb are shown to be useful for other downstream tasks, such as predicting protein-protein binding sites, with performance comparable to specialized state-of-the-art methods. The paper concludes that SSEmb is valuable for variant effect predictions and as a general representation for learning protein properties dependent on both sequence and structure.
1.6. Original Source Link
Official Source Link: /files/papers/6915b3cc4d6b2ff314a02eab/paper.pdf
Publication Status: Officially published online at Nature Communications.
2. Executive Summary
2.1. Background & Motivation
The ability to predict the functional consequences of amino acid changes (also known as variant effects) in proteins is a fundamental challenge with profound implications for understanding molecular mechanisms of evolution, human diseases (e.g., classifying disease-causing variants), and advancing protein engineering (e.g., optimizing protein function). Traditional methods often rely on either protein sequence information (e.g., evolutionary conservation from Multiple Sequence Alignments or MSAs) or protein structure information (e.g., stability calculations).
However, each approach has limitations:
- Sequence-based methods: While powerful, their accuracy can be highly sensitive to the depth or quality of the input MSA. For proteins with few known homologous sequences, MSAs can be shallow, leading to unreliable predictions.
- Structure-based methods: These require a known or predicted three-dimensional protein structure, which might not always be available or accurate, especially for intrinsically disordered proteins or large complexes.

The core problem the paper aims to solve is to overcome these individual limitations by synergistically combining both types of information. Previous attempts to combine sequence and structure often involve ensemble predictions or combining results at a later stage. The challenge is to integrate this information within a single model in an end-to-end manner to learn a richer, more robust representation.
2.2. Main Contributions / Findings
The paper's primary contributions revolve around the introduction and validation of SSEmb:
- Novel End-to-End Integration Model (SSEmb): The paper proposes SSEmb, a novel method that integrates protein sequence (via MSA) and structure (via a graph representation) information into a single, self-supervised model. This is achieved by combining a structure-constrained MSA Transformer with a Graph Neural Network (GNN).
- Robust Variant Effect Predictions for Shallow MSAs: SSEmb demonstrates improved robustness and accuracy in predicting variant effects, particularly for proteins where the input MSA is shallow (i.e., has limited evolutionary information). This addresses a critical gap where many sequence-only methods struggle. The model achieves competitive performance on the ProteinGym benchmark, especially in low MSA depth scenarios.
- Information-Rich Embeddings for Downstream Tasks: The embeddings learned by SSEmb are shown to be information-rich and highly useful for downstream tasks beyond variant effect prediction. The paper exemplifies this by successfully using SSEmb embeddings to predict protein-protein binding sites, achieving results comparable to specialized state-of-the-art methods. This highlights SSEmb's potential as a general-purpose protein representation learning tool.
- Self-Supervised Training Paradigm: SSEmb is trained in a self-supervised manner using a masked amino acid prediction task, leveraging large amounts of unlabeled sequence and structure data, which is a powerful approach for learning generalizable protein representations.

These findings suggest that SSEmb offers a more comprehensive and robust approach to understanding protein function and variant effects, especially in data-scarce scenarios, and provides a versatile tool for various protein research applications.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully grasp the methodology and contributions of SSEmb, a foundational understanding of several key concepts is essential:
- Amino Acid Changes and Variant Effects: Proteins are chains of amino acids that fold into specific three-dimensional structures, which dictate their function. An amino acid change (also called a variant or mutation) is a substitution of one amino acid for another at a specific position in the protein sequence. These changes can have various effects on the protein, such as altering its stability, activity, binding affinity, or cellular abundance. Predicting these variant effects is crucial for understanding disease and for protein engineering.
- Multiple Sequence Alignment (MSA): An MSA is a sequence alignment of three or more biological sequences (e.g., protein or DNA sequences) that are evolutionarily related. It arranges these sequences to align homologous (similar by descent) positions, allowing for the inference of evolutionary conservation. Positions that are highly conserved across many species often indicate critical functional or structural roles.
  - MSA Depth: Refers to the number of sequences in an MSA. A deep MSA has many homologous sequences, providing rich evolutionary information, while a shallow MSA has few, making evolutionary inference difficult.
  - MSA Query Sequence: In the context of variant effect prediction, this is the specific protein sequence for which variant effects are being predicted. It is usually the wild-type (most common natural) sequence.
- Protein Structure: The three-dimensional (3D) structure of a protein is critical for its function. This structure is determined by the amino acid sequence and interactions between amino acids, including covalent bonds and non-covalent interactions (e.g., hydrogen bonds, hydrophobic interactions). The protein structure provides spatial information about amino acid proximities and solvent exposure, which is not directly evident from the linear sequence.
- Machine Learning (ML): A field of artificial intelligence that enables systems to learn from data.
  - Supervised Learning: An ML paradigm where a model learns from labeled data (input-output pairs). For example, training a model to predict variant effects requires examples of variants and their experimentally measured effects.
  - Self-Supervised Learning: A variant of unsupervised learning where the model learns representations from unlabeled data by generating supervision signals from the data itself. For instance, masked language modeling (predicting masked words in a sentence) is a common self-supervised task. In protein science, predicting masked amino acids in a sequence or MSA allows models to learn intrinsic properties of proteins without explicit functional labels. This is a key paradigm for protein language models.
- Transformer Model: An attention-based neural network architecture introduced in 2017, primarily for natural language processing. It revolutionized sequence modeling by replacing recurrent layers with self-attention mechanisms, allowing it to process entire sequences in parallel and capture long-range dependencies efficiently.
  - Self-Attention: A mechanism that weighs the importance of different parts of the input sequence when processing a specific part. In a Transformer, each token (e.g., an amino acid in a protein sequence) computes its representation by attending to all other tokens in the sequence, determining their relevance. The attention output for queries and keys is computed as: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ where $Q$ is the matrix of query vectors, $K$ the key vectors, $V$ the value vectors, and $d_k$ is the dimension of the key vectors, used for scaling. The softmax function normalizes the scores to produce a probability distribution. (A minimal sketch of this computation appears after this list.)
- Graph Neural Network (GNN): A class of neural networks designed to operate on graph-structured data. Proteins can naturally be represented as graphs, where amino acid residues are nodes and interactions (e.g., spatial proximity) between them are edges. GNNs learn by message passing, where nodes aggregate information from their neighbors and update their representations. This allows GNNs to capture relational information inherent in protein structures.
  - Geometric Vector Perceptron (GVP): A specific type of GNN designed to handle 3D geometric information (like protein structures) by representing features as both scalars and vectors, maintaining rotational equivariance. Rotational equivariance means that if the input protein structure is rotated, the GNN's output embedding for that structure will also rotate in a predictable way, preserving the underlying geometric relationships.
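To make the attention computation above concrete, here is a minimal NumPy sketch of scaled dot-product attention for a single head; the array shapes, names, and toy inputs are illustrative and not tied to any particular implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single attention head.

    Q: (L_q, d_k) query vectors, K: (L_k, d_k) key vectors, V: (L_k, d_v) value vectors.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (L_q, L_k) raw attention scores
    scores -= scores.max(axis=-1, keepdims=True)      # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # (L_q, d_v) attended values

# Toy example: 5 residues with 8-dimensional queries, keys, and values
rng = np.random.default_rng(0)
L, d = 5, 8
Q, K, V = (rng.normal(size=(L, d)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)    # (5, 8)
```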
3.2. Previous Works
The paper builds upon and compares itself to several categories of prior research:
- Multiplexed Assays of Variant Effects (MAVEs) / Deep Mutational Scanning (DMS): These high-throughput experimental techniques generate large datasets mapping protein sequence to function by simultaneously assaying thousands of protein variants. Examples include VAMP-seq (Variant Abundance by Massively Parallel Sequencing) for quantifying variant effects on cellular protein abundance. MAVEs provide crucial experimental data for training and benchmarking computational predictors. The ProteinGym benchmark is a large collection of such data.
- Sequence-Based Predictors:
  - Evolutionary Models: These methods leverage MSAs to infer evolutionary conservation and predict variant effects. Highly conserved positions are often functionally critical.
    - GEMME (Global Epistatic Model Predicting Mutational Effects): A state-of-the-art MSA-based model that uses a relatively simple evolutionary model to predict variant effects, often outperforming more complex machine learning methods for protein activity prediction. It quantifies the effect of mutations by comparing the likelihood of a variant sequence in an MSA to the wild-type.
  - Protein Language Models (PLMs): Self-supervised models trained on vast datasets of protein sequences to learn rich protein representations (embeddings).
    - MSA Transformer: A Transformer model specifically designed to process MSAs. It learns representations by attending to both sequences (rows) and positions (columns) within the MSA. It can be used for variant effect prediction by predicting masked amino acids. Its performance is known to be sensitive to MSA depth.
    - ESM-1b / ESM2 (Evolutionary Scale Modeling): Other prominent protein language models that learn from MSAs or large collections of single protein sequences.
    - TranceptEVE, Tranception L, EVE, VESPA: Other PLMs or ensemble methods benchmarked on ProteinGym for variant effect prediction.
- Structure-Based Predictors:
  - Rosetta: A widely used protein modeling software suite that includes protocols for predicting protein stability changes (ΔΔG) upon mutation. Rosetta uses physics-based force fields and sampling algorithms to estimate the energetic impact of amino acid substitutions. It is commonly used in protein engineering and for predicting stability and abundance.
- Combined Sequence and Structure Methods:
  - Early methods often combined predictions from separate sequence-based and structure-based models (e.g., ensemble methods).
  - More recent approaches aim for deeper integration:
    - Some methods combine contextualized language models and graph neural networks (e.g., ELASPIC, AlphaMissense).
    - Others explore using geometric deep learning with pre-trained protein language models (e.g., LM-GVP).
    - The paper also mentions ScanNet (specifically for protein-protein binding site prediction) and an xgboost model (a machine learning algorithm) with handcrafted features as baselines for protein-protein binding site prediction.
3.3. Technological Evolution
The field of variant effect prediction has evolved significantly:
1. Early Statistical and Rule-Based Methods: Initially, predictions relied on simple rules (e.g., conservation scores like SIFT or PolyPhen-2) or statistical analyses of MSAs.
2. Physics-Based Modeling: Rosetta and similar tools brought physics-based simulations to predict stability changes, offering mechanistic insights.
3. Supervised Machine Learning: As MAVE data became available, supervised ML models (e.g., random forests, support vector machines) were trained to predict variant effects from sequence and structure features.
4. Self-Supervised Protein Language Models (PLMs): The advent of Transformer architectures and vast sequence databases led to PLMs like MSA Transformer and ESM, which learn protein representations in a self-supervised manner, enabling zero-shot prediction (predictions without task-specific training data).
5. Integration of Sequence and Structure: Recognizing the complementary nature of sequence and structure, the current frontier involves combining these modalities. Initial efforts involved combining outputs, but the trend is towards end-to-end models that learn joint representations.

SSEmb fits into this evolution at stage 5, representing a significant step towards end-to-end integration within a self-supervised framework, specifically addressing the robustness issue with shallow MSAs.
3.4. Differentiation Analysis
Compared to the main methods in related work, SSEmb offers several core differences and innovations:
- End-to-End Joint Embedding: Unlike many methods that combine sequence and structure predictions at a later stage or use one to inform the other sequentially, SSEmb integrates them end-to-end. It uses structure to constrain the MSA Transformer's attention and feeds the MSA Transformer's embeddings directly into a GNN that processes the protein structure. This allows for a deeper, more synergistic learning of joint representations.
- Structure-Constrained Attention: A key innovation is using the protein structure to mask the row attention in the MSA Transformer. This means the Transformer only considers interactions between amino acids that are spatially close in the 3D structure, introducing structural context directly into the sequence model's attention mechanism. This is distinct from simply concatenating sequence and structure features post-hoc.
- Robustness to Shallow MSAs: By embedding structural information directly into the MSA Transformer and GNN, SSEmb becomes less dependent on the availability of deep MSAs. This makes it particularly robust and accurate for proteins with limited evolutionary information, an area where pure MSA-based methods often struggle.
- Versatile Embeddings: The paper demonstrates that the dual embeddings learned by SSEmb are highly information-rich and generalizable, proving useful for diverse downstream tasks like protein-protein binding site prediction, not just variant effect prediction. This suggests SSEmb learns a more holistic representation of protein characteristics.
- Self-Supervised Training Philosophy: While other PLMs are self-supervised, SSEmb extends this by incorporating structural context into the self-supervised masking task, making the learned representations more aware of physical constraints and relationships.

In essence, SSEmb moves beyond simply having two types of information towards a model where sequence and structure are deeply intertwined to create a more comprehensive and resilient understanding of protein function.
4. Methodology
4.1. Principles
The core idea behind SSEmb is to create a robust and comprehensive protein representation by jointly embedding information from both the protein's amino acid sequence (specifically, its evolutionary context derived from a Multiple Sequence Alignment (MSA)) and its three-dimensional (3D) structure. The method posits that by integrating these two complementary data types within a single end-to-end self-supervised learning framework, the resulting embeddings will be more informative and less sensitive to limitations in either input modality (e.g., shallow MSAs).
The theoretical basis draws from the success of Transformer models in capturing long-range dependencies in sequences and Graph Neural Networks (GNNs) in modeling relational data like protein structures. SSEmb combines these by:
- Constraining sequence attention with structure: Using spatial proximity from the 3D structure to guide the attention mechanism within the MSA Transformer, ensuring that the model implicitly considers structural interactions when processing sequence information.
- Fusing information via a GNN: Concatenating the sequence embeddings from the MSA Transformer with the structural graph features and processing them through a GNN to learn a joint representation (a minimal fusion sketch follows this list).
- Self-supervised learning: Training the entire model to predict masked amino acids, a task that encourages the model to learn contextually rich representations without explicit functional labels.

This integrated approach aims to create embeddings that capture both evolutionary conservation and structural context, leading to more accurate and robust predictions of variant effects and broader utility for other protein-related tasks.
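As referenced in the list above, the sketch below illustrates only the fusion step: per-residue sequence embeddings concatenated with scalar node features, reduced by a dense layer, and mapped to 20 amino acid logits. It is a hypothetical PyTorch fragment under assumed dimensions (768 for sequence embeddings, 256 for node scalars) and deliberately omits message passing, vector channels, and everything else in the real GNN module.

```python
import torch
import torch.nn as nn

class JointEmbeddingHead(nn.Module):
    """Illustrative fusion of per-residue sequence embeddings and graph node features."""

    def __init__(self, d_seq=768, d_node=256, d_hidden=256, n_tokens=20):
        super().__init__()
        self.reduce = nn.Linear(d_seq + d_node, d_hidden)  # dense layer after concatenation
        self.out = nn.Linear(d_hidden, n_tokens)           # masked amino acid prediction head

    def forward(self, seq_emb, node_feats):
        # seq_emb: (L, d_seq) MSA query-sequence embeddings; node_feats: (L, d_node) scalar node features
        h = torch.cat([seq_emb, node_feats], dim=-1)
        h = torch.relu(self.reduce(h))
        return self.out(h)                                 # (L, 20) logits over amino acid types

L = 120
model = JointEmbeddingHead()
logits = model(torch.randn(L, 768), torch.randn(L, 256))
print(logits.shape)  # torch.Size([120, 20])
```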
4.2. Core Methodology In-depth (Layer by Layer)
The SSEmb model is trained in a self-supervised manner using a combination of Multiple Sequence Alignments (MSAs) and protein structures. The overall architecture involves a structure-constrained MSA Transformer whose output embeddings are then fed into a Graph Neural Network (GNN) module. The model's training objective is to predict masked amino acids.
4.2.1. Input Data Generation
- Protein Structures: The protein structures used for training were sourced from the CATH 4.2 data set, which contains 18,204 non-redundant training proteins (at 40% sequence identity) partitioned by CATH class. For benchmarking and clinical tasks, AlphaFold-predicted structures were used.
  - Pre-processing: For testing, the OpenMM PDBFixer package was used for pre-processing. Training set structures were used without modifications.
- Multiple Sequence Alignments (MSAs): For each protein structure, an MSA was generated using MMseqs2 in conjunction with ColabFold's filtering protocol. This protocol aims to maximize sequence diversity in the final alignment.
  - MMseqs2 parameters: a set of filtering thresholds, plus an additional filter for each sequence identity bucket to ensure high-coverage sequences.
  - MSA Subsampling: Due to GPU memory constraints, the full MSA is randomly subsampled before being fed into SSEmb. During training, 16 MSA sequences are subsampled. For inference, an ensemble of 5 predictions is generated by subsampling 16 sequences 5 times with replacement, and the mean of these predictions is taken (see the sketch after this list).
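A small sketch of the subsampling-for-ensembling idea referenced above, assuming the MSA is held as a list of aligned sequence strings with the query first; whether the query is always retained and how "with replacement" is applied are assumptions of this sketch, not details taken from the paper.

```python
import random

def subsample_msa(msa, n_seqs=16, n_ensemble=5, keep_query=True, seed=0):
    """Draw n_ensemble random subsamples of n_seqs sequences each from an aligned MSA.

    msa: list of aligned sequence strings, with the query assumed to be msa[0].
    Returns a list of subsampled MSAs (each itself a list of sequences).
    """
    rng = random.Random(seed)
    subsamples = []
    for _ in range(n_ensemble):
        pool = msa[1:] if keep_query else msa
        # Draw sequences with replacement to fill the remaining slots
        picked = [rng.choice(pool) for _ in range(n_seqs - int(keep_query))]
        subsamples.append(([msa[0]] if keep_query else []) + picked)
    return subsamples

msa = ["MKTAYIAK", "MKSAYIAK", "MKTAFIAK", "MRTAYLAK", "MKTAYVAK"]
for sub in subsample_msa(msa, n_seqs=3, n_ensemble=2):
    print(sub)
```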
4.2.2. Structure-Constrained MSA Transformer
The MSA Transformer component of SSEmb is based on the original MSA Transformer architecture.
- Initialization: The model is initialized using pre-trained weights from the original MSA Transformer (available at https://github.com/facebookresearch/esm/tree/main/esm).
- Structure Constraint: A key modification is the application of a binary contact mask to the attention maps that operate across MSA columns, i.e., the row attention (attention between positions within the same sequence).
  - The contact mask is derived from the 20-nearest-neighbor graph generated for the protein structure (the same graph used in the GNN module).
  - This mask ensures that row attention values are only propagated between MSA positions that are spatially proximal in the 3D protein structure, effectively injecting structural information directly into the MSA Transformer's attention mechanism (see the masking sketch after this list).
- Training: During training, only the row attention layers of the structure-constrained MSA Transformer are fine-tuned. This strategy aims to preserve the phylogenetic information already encoded in the column attention layers (attention across the different sequences at each position), which are kept frozen.
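The masking sketch referenced above: one way to realize a structure-constrained row attention is to push the attention logits of non-contacting position pairs to a large negative value before the softmax, so their weights become effectively zero. The function below illustrates that idea in NumPy; it is not the MSA Transformer's actual code.

```python
import numpy as np

def mask_row_attention(logits, contact_mask):
    """Suppress attention between residue pairs that are not spatial neighbors.

    logits: (L, L) raw row-attention scores between MSA columns (residue positions).
    contact_mask: (L, L) boolean matrix; True where two positions are among each
        other's nearest neighbors in the protein structure graph.
    """
    masked = np.where(contact_mask, logits, -1e9)          # non-contacts get ~zero weight
    masked -= masked.max(axis=-1, keepdims=True)           # numerical stability
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax

L = 6
logits = np.random.default_rng(1).normal(size=(L, L))
# Toy contact mask: each position contacts itself and its sequence neighbors
contacts = np.eye(L, dtype=bool) | np.eye(L, k=1, dtype=bool) | np.eye(L, k=-1, dtype=bool)
print(mask_row_attention(logits, contacts).round(2))
```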
4.2.3. GNN Module
The GNN module in SSEmb processes the protein structure and incorporates the sequence information from the MSA Transformer. It largely follows the architecture of the Geometric Vector Perceptron (GVP) model but with specific adjustments.
- Graph Definition: Graph edges are defined for the 20 closest node neighbors based on spatial proximity in the protein structure, rather than the 30 used in the original GVP implementation (an illustrative construction appears after Figure 1).
- Node Embeddings: Node embedding dimensions (representing each amino acid residue) are increased to 256 for scalar features and 64 for vector channels. The GVP model represents features as both scalars (magnitude-only values) and vectors (values with magnitude and direction), which is crucial for maintaining rotational equivariance with respect to the 3D protein structure.
- Edge Embeddings: Edge embedding dimensions are kept at 32 (scalar) and 1 (vector), as in the original GVP model.
- Architecture: The number of encoder and decoder layers is increased to four.
- Vector Gating: The GNN uses vector gating, a mechanism often employed in GNNs to control the flow of information based on vector features, enhancing the model's ability to process geometric information.
- Information Fusion: The MSA query sequence embeddings (from the last layer of the structure-constrained MSA Transformer) are concatenated to the node embeddings of the GNN decoder. These combined features then pass through a dense layer to reduce dimensionality before being processed by the GNN.
- Prediction Task: The GNN's prediction task is modified from auto-regressive sequence prediction (as in the original GVP applications) to a masked token prediction task (consistent with the MSA Transformer's objective).

The resulting architecture, visualized in Figure 1, demonstrates this integration:

[Figure 1 schematic: the model receives a subsampled multiple sequence alignment (MSA) and the full protein structure; information is processed via graph featurization and a structure-constrained MSA Transformer, producing the encoder and decoder outputs of the graph neural network (GNN).]
Fig. 1 | Overview of the SSEmb model and how it is trained. The model takes as input a subsampled MSA with a partially masked query sequence and a complete protein structure. The protein structure graph is used to mask (constrain) the row attention (i.e., attention across MSA columns) in the MSA Transformer. The MSA query sequence embeddings from the structure-constrained MSA Transformer are concatenated to the protein graph nodes. During training, SSEmb tries to predict the amino acid type at the masked positions. The model is optimized using the cross-entropy loss between the predicted and the true amino acid tokens at the masked positions. Variant effect prediction is made from these predictions as described in Methods.
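For illustration, a 20-nearest-neighbor residue graph of the kind described above can be built from Cα coordinates with a k-d tree; the use of Cα atoms and the directed edge-list output are assumptions of this sketch rather than details of the paper's featurization code.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_residue_graph(ca_coords, k=20):
    """Return a directed edge list (i, j) connecting each residue i to its k nearest residues j.

    ca_coords: (L, 3) array of C-alpha coordinates.
    """
    L = ca_coords.shape[0]
    k_eff = min(k, L - 1)
    tree = cKDTree(ca_coords)
    # Query k_eff + 1 neighbors because the nearest neighbor of each point is itself.
    _, idx = tree.query(ca_coords, k=k_eff + 1)
    return [(i, int(j)) for i in range(L) for j in idx[i, 1:]]

coords = np.random.default_rng(2).normal(scale=10.0, size=(50, 3))
edges = knn_residue_graph(coords, k=20)
print(len(edges))  # 50 residues * 20 neighbors = 1000 directed edges
```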
4.2.4. Model Training
SSEmb is trained in a self-supervised manner using a modified BERT masking scheme.
- Masking Strategy: Before each forward pass, 15% of all wild-type sequence residues are randomly selected for optimization (a sketch of this scheme follows this list). Within this 15% subset:
  - 60% are completely masked (replaced with a special [MASK] token).
  - 20% are masked, and their corresponding MSA columns (across all sequences in the subsampled MSA) are also masked. This helps reduce reliance on conservation signals.
  - 10% are replaced by a random amino acid type.
  - 10% are left unchanged.
- Prediction Task: The SSEmb model is tasked with predicting the original amino acid types of these masked residues, given the protein structure and the subsampled MSA input.
- Loss Function: The masked prediction task is optimized using the cross-entropy loss between the predicted amino acid types and the true amino acid types for the 15% selected residues.
- Gradual Unfreezing: The training process employs a two-step gradual unfreezing method:
  - Step 1: The GNN module is trained until it approaches convergence, while the parameters of the structure-constrained MSA Transformer are kept frozen.
  - Step 2: The row attention parameters within the structure-constrained MSA Transformer are unfrozen, and both the GNN module and these MSA Transformer parameters are fine-tuned together. Early stopping is used, assessed by the mean correlation performance on the MAVE validation set.
- Optimizer: The Adam optimizer is used, with separate learning rates for the GNN module and for the structure-constrained MSA Transformer.
- Batch Sizes: Batch sizes are fixed at 128 proteins for the first training stage and 2048 proteins for the second stage.
4.2.5. Variant Effect Prediction
At inference time, SSEmb predicts variant effects based on the masked marginal method.
- Ensembling: To generate robust predictions, 16 sequences are randomly subsampled from the full MSA five times with replacement, creating an ensemble of model predictions. The final SSEmb score is the mean of the scores from this ensemble.
- Score Calculation: Protein variant scores are computed using the masked marginal method as follows:

  $ \sum_{i \in M} \log p(x_i = x_i^{\mathrm{var}} \mid x_{-M}) - \log p(x_i = x_i^{\mathrm{wt}} \mid x_{-M}) $

  - $x_i^{\mathrm{var}}$: the amino acid type of the variant (mutant) sequence at position $i$.
  - $x_i^{\mathrm{wt}}$: the amino acid type of the wild-type (original) sequence at position $i$.
  - $M$: the set of substituted (mutated) positions. For single amino acid variants, $M$ contains just one position.
  - $x_{-M}$: the sequence with the positions in $M$ replaced by mask tokens, so the model predicts the likelihood of the variant or wild-type amino acid at position $i$ given the rest of the sequence, with the mutation site masked.
  - $p(\cdot)$: the probability predicted by the model for a specific amino acid type at a given position, conditioned on the masked sequence.

  The formula calculates the log-likelihood ratio of the variant amino acid versus the wild-type amino acid at each mutated position, summed over all mutated positions (for single variants, just one position). A more negative score indicates a higher predicted deleterious effect of the mutation. This corresponds to an additive variant effect model (a scoring sketch follows this list).
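A sketch of the masked-marginal score for a single substitution, assuming the model exposes a probability vector over the 20 amino acid types at the masked position; the amino acid ordering and helper names are hypothetical.

```python
import numpy as np

AA_TO_IDX = {aa: i for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}

def masked_marginal_score(probs_at_masked, wt_aa, var_aa):
    """log p(x_i = variant | masked context) - log p(x_i = wild type | masked context).

    probs_at_masked: length-20 probability vector predicted at the masked position i.
    """
    p = np.asarray(probs_at_masked, dtype=float)
    return float(np.log(p[AA_TO_IDX[var_aa]]) - np.log(p[AA_TO_IDX[wt_aa]]))

# Toy example: a model that strongly prefers the wild-type residue at this position
probs = np.full(20, 0.01)
probs[AA_TO_IDX["L"]] = 0.81            # wild-type L dominates the distribution
print(masked_marginal_score(probs, wt_aa="L", var_aa="P"))  # negative => predicted deleterious
```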
5. Experimental Setup
5.1. Datasets
The authors utilized a variety of datasets for training, validation, and benchmarking SSEmb across different tasks.
- Training Dataset:
  - Source: CATH 4.2 data set (for protein structures) and MSAs generated for these structures.
  - Characteristics: Contains 18,204 training proteins with 40% non-redundancy, partitioned by CATH class. Proteins present in the MAVE validation set or ProteinGym test set were removed using a 95% sequence identity cut-off to prevent data leakage.
  - Purpose: To train the SSEmb model in a self-supervised manner.
- MAVE Validation Set:
  - Source: A selection of 10 Multiplexed Assays of Variant Effects (MAVEs). Most data was from ProteinGym, with exceptions: LDLRAP1 from Jiang and Roth (47, 48), and MAPK from Brenan et al. (49, 50).
  - Characteristics: This set included a mix of assays probing protein activity and abundance, with a majority focusing on activity. It also contained assays considered difficult or easier to predict by structure- or sequence-based methods.
    - Example data: Variant effect scores (e.g., changes in abundance, competitive growth, E1 reactivity, two-hybrid assay scores) for individual amino acid substitutions in proteins such as NUD15, TPMT, CP2C9, P53, PABP, SUMO1, RL401, PTEN, MAPK, and LDLRAP1.
  - Purpose: Used for hyperparameter selection and early stopping during model development, providing informative feedback on SSEmb's ability to capture different mechanistic aspects.
- ProteinGym Substitution Benchmark:
  - Source: A large collection of MAVE data originally collected in reference 6.
  - Characteristics: Comprises 87 datasets on 72 different proteins; 76 datasets contain single substitution effects, while 11 include multiple substitutions. Assays that are part of the SSEmb validation set were excluded. When multiple assays were present for a single UniProt ID, the mean correlation over assays was reported. The benchmark is segmented into low, medium, and high MSA depth, defined by thresholds on the ratio N_eff/L, where N_eff is the effective number of sequences in the MSA and L is the sequence length.
  - Purpose: To extensively benchmark SSEmb's prediction accuracy against other variant effect prediction models.
  - Predicted Structures: For this benchmark, AlphaFold-predicted structures were used as input to SSEmb.
- Mega-scale Protein Stability Dataset:
  - Source: Dataset 3 from (59), consisting of experimentally well-defined measurements.
  - Characteristics: Contains 607,839 protein sequences. Filtering excluded synonymous, insertion, and deletion mutations, as well as domains without corresponding AlphaFold models.
    - Example data: ΔΔG values (change in Gibbs free energy of unfolding) indicating changes in protein stability upon mutation.
  - Purpose: To test the zero-shot performance of SSEmb as a predictor of protein stability.
- ProteinGym Clinical Substitution Benchmark:
  - Source: A large set of variants with clinical annotations (67).
  - Characteristics: Contains clinically annotated missense variants in humans.
  - Purpose: To evaluate SSEmb's zero-shot performance in classifying disease-causing variants (variant pathogenicity).
- Protein-Protein Binding Site (PPBS) Dataset:
  - Source: PPBS data set (74).
  - Characteristics: Originally contained 20,025 protein chains with binary residue-level binding site labels. Filtered to exclude obsolete entries, missing binding site labels, sequence mismatches, and chains longer than 1024 amino acids (due to the MSA Transformer length limit). The modified dataset contained 19,264 protein chains.
    - Example data: Each residue in a protein chain is labeled as either belonging to a binding site (1) or not belonging to a binding site (0).
  - Purpose: To evaluate the utility of SSEmb embeddings for a downstream task of protein-protein binding site prediction using a small supervised model.
5.2. Evaluation Metrics
The paper uses several standard evaluation metrics depending on the specific task:
- Spearman's Rank Correlation Coefficient (ρs):
  - Conceptual Definition: Spearman's rank correlation coefficient assesses the monotonic relationship between two ranked variables. It measures how well the relationship between two variables can be described using a monotonic function. Unlike Pearson's correlation coefficient, it does not assume that the relationship between the variables is linear. It is suitable for variant effect prediction tasks where the absolute magnitude might be hard to predict, but the ranking of effects is important.
  - Mathematical Formula: $ \rho_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} $
  - Symbol Explanation:
    - $d_i$: The difference between the ranks of the $i$-th observation for the two variables (e.g., predicted rank vs. experimental rank).
    - $n$: The number of observations (e.g., number of variants).
- Area Under the Receiver Operating Characteristic Curve (AUC or AUROC):
  - Conceptual Definition: AUC is a performance metric for binary classification problems. It represents the probability that the model ranks a randomly chosen positive instance higher than a randomly chosen negative instance. An AUC of 1.0 indicates a perfect classifier, while 0.5 indicates a classifier no better than random guessing. It is robust to imbalanced datasets.
  - Mathematical Formula: AUC is the area under the ROC curve, which plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings:
    $ \text{TPR (Sensitivity)} = \frac{\text{TP}}{\text{TP} + \text{FN}} \qquad \text{FPR (1 - Specificity)} = \frac{\text{FP}}{\text{FP} + \text{TN}} $
    AUC itself has no simple closed-form expression; it is computed as the area under the TPR-vs-FPR plot.
  - Symbol Explanation:
    - TP: Number of true positives (correctly predicted positive instances).
    - FN: Number of false negatives (positive instances incorrectly predicted as negative).
    - FP: Number of false positives (negative instances incorrectly predicted as positive).
    - TN: Number of true negatives (correctly predicted negative instances).
- Area Under the Precision-Recall Curve (PR-AUC):
  - Conceptual Definition: PR-AUC is another metric for binary classification, particularly useful for imbalanced datasets where the positive class is rare. It summarizes the curve of Precision against Recall at various threshold settings. A higher PR-AUC indicates better performance. It is often preferred over AUC when the focus is on performance on the positive class, especially when false positives are costly.
  - Mathematical Formula: PR-AUC is the area under the Precision-vs-Recall curve:
    $ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} \qquad \text{Recall (Sensitivity)} = \frac{\text{TP}}{\text{TP} + \text{FN}} $
  - Symbol Explanation:
    - TP: Number of true positives. FP: Number of false positives. FN: Number of false negatives.

A short example showing how these metrics are computed in practice follows.
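The short example referenced above, computing the three metrics on toy data with SciPy and scikit-learn; `average_precision_score` is used here as the usual practical stand-in for PR-AUC, and the sign flip assumes that more negative predictor scores indicate more deleterious (positive-class) variants.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import roc_auc_score, average_precision_score

# Toy predictions vs. experimental data, purely illustrative
pred_scores = np.array([-3.1, -0.2, -1.8, -4.0, -0.5, -2.2])   # model scores (more negative = more deleterious)
exp_scores  = np.array([0.10, 0.95, 0.40, 0.05, 0.80, 0.35])    # e.g., MAVE fitness scores
labels      = np.array([1, 0, 1, 1, 0, 1])                      # e.g., pathogenic (1) vs. benign (0)

rho, _ = spearmanr(pred_scores, exp_scores)              # rank correlation, as in the MAVE benchmarks
auroc = roc_auc_score(labels, -pred_scores)              # flip sign so higher score = more likely positive
pr_auc = average_precision_score(labels, -pred_scores)   # average precision approximates PR-AUC
print(f"Spearman rho={rho:.2f}, AUROC={auroc:.2f}, PR-AUC={pr_auc:.2f}")
```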
5.3. Baselines
The SSEmb model was benchmarked against a range of established and state-of-the-art variant effect prediction and protein property prediction methods:
- For MAVE Validation and ProteinGym Benchmark:
  - GEMME: An MSA-based model (Global Epistatic Model Predicting Mutational Effects) that leverages evolutionary information. For comparison on the ProteinGym benchmark, data directly from ProteinGym was used. For validation, MSAs were generated using HHblits (version 2.0.15) to search the UniRef30 database with specific parameters and additional filtering (remove columns not in the query, remove rows with >50% gaps).
  - Rosetta: A structure-based method that predicts changes in protein stability (ΔΔG) using the Cartesian ΔΔG protocol (51). Thermodynamic stability changes (in Rosetta Energy Units) were converted to kcal/mol by dividing by 2.9.
  - MSA Transformer (original): The foundational Transformer model for MSAs that SSEmb builds upon.
  - Other high-accuracy methods from ProteinGym:
    - TranceptEVE L
    - Tranception L
    - EVE (ensemble) and EVE (single)
    - VESPA
    - ESM2 (15B): another large protein language model.
    - ProteinMPNN: a model primarily for protein design/inverse folding, included for comparison.
- For Protein-Protein Binding Site Prediction:
  - ScanNet: A specialized, state-of-the-art geometric deep learning model specifically developed for structure-based protein binding site prediction (74).
  - Handcrafted features baseline: An xgboost model (a gradient boosting algorithm) trained with handcrafted structure- and sequence-based features (74). This represents a traditional feature engineering approach.

These baselines were chosen to cover different approaches (sequence-only, structure-only, various PLMs, and specialized downstream models) and to provide a comprehensive comparison of SSEmb's performance across diverse tasks and MSA depths.
6. Results & Analysis
6.1. Core Results Analysis
The SSEmb model was rigorously evaluated across several tasks, demonstrating its effectiveness, particularly in integrating sequence and structure information for robust predictions.
6.1.1. Validation using Multiplexed Assays of Variant Effects (MAVEs)
During model development, SSEmb was validated on 10 MAVEs from the validation set.
The following are the results from Table 1 of the original paper:
Absolute Spearman correlations |ρs| (higher is better):

| Protein | MAVE reference | MAVE type | SSEmb | GEMME | Rosetta |
| --- | --- | --- | --- | --- | --- |
| NUD15 | Suiter et al. 2020 | Abundance | 0.584 | 0.543 | 0.437 |
| TPMT | Matreyek et al. 2018 | Abundance | 0.523 | 0.529 | 0.489 |
| CP2C9 | Amorosi et al. 2021 | Abundance | 0.609 | 0.423 | 0.519 |
| P53 | Kotler et al. 2018 | Competitive growth | 0.577 | 0.655 | 0.488 |
| PABP | Melamed et al. 2013 | Competitive growth | 0.595 | 0.569 | 0.384 |
| SUMO1 | Weile et al. 2017 | Competitive growth | 0.481 | 0.406 | 0.433 |
| RL401 | Roscoe & Bolon 2014 | E1 reactivity | 0.438 | 0.390 | 0.366 |
| PTEN | Mighell et al. 2018 | Competitive growth | 0.422 | 0.532 | 0.423 |
| MAPK | Brenan et al. 2016 | Competitive growth | 0.395 | 0.445 | 0.307 |
| LDLRAP1 | Jiang et al. 2019 | Two-hybrid assay | 0.411 | 0.348 | 0.377 |
| Mean | - | - | 0.503 | 0.484 | 0.422 |
- Key Finding: SSEmb achieves a higher mean Spearman correlation (0.503) on the MAVE validation set compared to GEMME (0.484) and Rosetta (0.422).
- Insights: SSEmb performs particularly well on abundance assays, where it often outperforms GEMME (e.g., NUD15, CP2C9). This suggests that the added structural information in SSEmb is beneficial for predicting effects related to protein abundance. While GEMME sometimes outperforms SSEmb on activity-based MAVEs (e.g., P53, PTEN, MAPK), SSEmb generally maintains strong performance across both types of assays. This indicates that SSEmb successfully integrates information to make accurate predictions for both activity and abundance.
6.1.2. Testing SSEmb on ProteinGym Benchmark
The model was tested on the ProteinGym substitution benchmark, excluding the validation set assays. Results are segmented by MSA depth.
The following are the results from Table 2 of the original paper:
Spearman ρs by MSA depth (↑, higher is better):

| Model | Low | Medium | High | All |
| --- | --- | --- | --- | --- |
| TranceptEVE L | 0.451 | 0.462 | 0.502 | 0.468 |
| GEMME | 0.429 | 0.448 | 0.495 | 0.453 |
| SSEmb (ours) | 0.449 | 0.439 | 0.501 | 0.453 |
| Tranception L | 0.438 | 0.438 | 0.467 | 0.444 |
| EVE (ensemble) | 0.412 | 0.438 | 0.493 | 0.443 |
| VESPA | 0.411 | 0.422 | 0.514 | 0.438 |
| EVE (single) | 0.405 | 0.431 | 0.488 | 0.437 |
| MSA Transformer (ensemble) | 0.385 | 0.426 | 0.470 | 0.426 |
| ESM2 (15B) | 0.342 | 0.368 | 0.433 | 0.375 |
| ProteinMPNN | 0.189 | 0.151 | 0.237 | 0.175 |
- Key Finding: SSEmb generally compares favorably to other variant effect prediction methods on the ProteinGym benchmark. Its overall Spearman correlation is 0.453, matching GEMME and coming very close to the top-performing TranceptEVE L (0.468).
- Robustness for Low MSA Depth: Crucially, SSEmb achieves a Spearman correlation of 0.449 for low MSA depth proteins, significantly outperforming the original MSA Transformer ensemble (0.385) and many other models, and nearly matching the best-performing TranceptEVE L (0.451). This directly supports the design goal of being robust when sequence information is scarce.

This robustness for low MSA depth is further highlighted in Figure 2:

[Figure 2: bar chart comparing Spearman correlations on the ProteinGym low-MSA substitution benchmark subset for SSEmb (blue, ρs ≈ 0.45 ± 0.05) and the MSA Transformer ensemble (orange, ρs ≈ 0.39 ± 0.05).]

Fig. 2 | Overview of SSEmb results on the ProteinGym low-MSA substitution benchmark subset grouped by UniProt ID. Spearman correlations are plotted for both SSEmb (blue) and the MSA Transformer ensemble (orange). The mean and standard error of the mean of the set of all ProteinGym Spearman correlations are presented in the legend. Assays from the SSEmb validation set have been excluded from the original data set. Source data are provided as a Source Data file.

- Graphical Evidence: Figure 2 visually confirms SSEmb's superior performance for proteins with low MSA depth. The blue bars (SSEmb) are consistently higher than the orange bars (MSA Transformer ensemble) across UniProt IDs. The mean Spearman correlation for SSEmb is 0.45 ± 0.05, while for the MSA Transformer ensemble it is 0.39 ± 0.05. This clear difference underscores the benefit of integrating structural information in sequence-sparse contexts.
- Robustness to Structure Quality: Supplementary Fig. 4 (not reproduced here, but discussed in the text) indicates that SSEmb is robust to the quality of the input protein structure (experimental vs. AlphaFold predictions), showing only a weak correlation between performance and TM-scores. This is attributed to the backbone-based representation of the structure and the complementary MSA information.
6.1.3. Prediction of Protein Stability
- Key Finding: SSEmb achieves an absolute Spearman correlation coefficient of 0.61 (Supplementary Fig. 2, not reproduced here) on mega-scale measurements of protein stability.
- Insights: This performance is comparable to dedicated methods for protein stability prediction, demonstrating SSEmb's ability to act as a zero-shot predictor of protein stability, even without specific training on stability data. This suggests that SSEmb learns generalizable features relevant to protein biophysics.
6.1.4. Classification of Disease-Causing Variants
SSEmb was evaluated on a large set of clinically annotated variants.
The following are the results from Table 3 of the original paper:
| Model | Avg. AUC (↑) |
| --- | --- |
| TranceptEVE L | 0.920 |
| GEMME | 0.919 |
| EVE | 0.917 |
| SSEmb | 0.893 |
| ESM-1b | 0.892 |
- Key Finding: SSEmb performs relatively well in classifying disease-causing variants, achieving an average AUC of 0.893.
- Insights: While TranceptEVE L, GEMME, and EVE achieve slightly higher AUC values, SSEmb's performance is commendable given that it is a zero-shot predictor for this task. The subtle differences in ranking between the MAVE and pathogenicity evaluations suggest that these two tasks capture different aspects of variant effects, and a model optimized for one might not be perfectly optimized for the other.
6.1.5. Prediction of Protein-Protein Binding Sites using Embeddings
The information richness of SSEmb's embeddings was explored for protein-protein binding site prediction.
The following are the results from Table 4 of the original paper:
PR-AUC (↑, higher is better) on each test split:

| Model | Test set (70%) | Test set (homology) | Test set (topology) | Test set (none) | Test set (all) |
| --- | --- | --- | --- | --- | --- |
| SSEmb downstream | 0.684 | 0.651 | 0.672 | 0.571 | 0.642 |
| Handcrafted features baseline | 0.596 | 0.567 | 0.568 | 0.432 | 0.537 |
| ScanNet | 0.732 | 0.712 | 0.735 | 0.605 | 0.694 |
- Key Finding: A small supervised downstream model trained on SSEmb embeddings achieves a PR-AUC of 0.642 (mean across all test sets) for protein-protein binding site prediction.
- Insights: This performance places SSEmb's downstream model well above the handcrafted features baseline (0.537) but below the specialized ScanNet model (0.694). This serves as a strong proof of principle that SSEmb embeddings contain a rich blend of structural and sequence-based information useful for downstream tasks. The fact that a simple downstream model can achieve such results without specialized feature engineering highlights the quality and generalizability of the SSEmb representations. Performance gracefully degrades as the test set similarity to the training data decreases, as expected (Test set (none) is the hardest). A hypothetical sketch of this downstream setup follows.
6.2. Ablation Study
An ablation study (Supplementary Table 1, not reproduced here but discussed in the text) was performed to understand the contribution of different components of SSEmb to its performance, especially regarding MSA depth. The study focused on three components used to inject structure information into the MSA Transformer:

1. The GVP-GNN module after the MSA Transformer.
2. Structure-based row attention masking in the MSA Transformer.
3. Fine-tuning of the MSA Transformer with column masking (reducing reliance on conservation signals).

Key Findings:

- Structure-based masking of the MSA Transformer and fine-tuning with column masking are crucial for decreasing the sensitivity of the model to MSA depth. These components help maintain accuracy even with shallow MSAs.
- An ablated SSEmb model (even without all components) still outperforms the original MSA Transformer from the ProteinGym benchmark. This is attributed to the optimized MSA-generation protocol used in SSEmb and the ensembling over MSA subsamples during inference.
- The fine-tuned MSA Transformer without a structural component performs best overall (across all MSA depths), but at the cost of accuracy for low-MSA-depth proteins. This highlights that while structural components might slightly reduce overall average performance (potentially by introducing noise or complexity), they are vital for robustness in challenging, data-scarce scenarios.

This study confirms that the structural integration and specific training strategies are instrumental in SSEmb's ability to provide robust predictions when MSA information is limited.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduces SSEmb (Sequence Structure Embedding), a novel computational model that comprehensively integrates protein sequence and structure information. By combining a structure-constrained MSA Transformer with a Graph Neural Network (GNN), SSEmb learns rich, joint embeddings through a self-supervised masked amino acid prediction task.
The key findings demonstrate:
- Robust Variant Effect Prediction: SSEmb provides robust and accurate variant effect predictions, outperforming baseline MSA-based and structure-based methods, especially when Multiple Sequence Alignments (MSAs) are shallow.
- Generalizable Embeddings: The learned SSEmb embeddings are highly informative and prove valuable for various downstream tasks, exemplified by their effective use in predicting protein-protein binding sites with performance comparable to specialized methods.
- Insights into Variant Mechanisms: The model's ability to correlate with both protein activity and abundance suggests that it captures diverse mechanistic aspects of variant effects.

SSEmb represents a significant step towards holistic and resilient protein representation learning models that function effectively even in data-limited scenarios, making it a valuable tool for disease variant classification and protein engineering.
7.2. Limitations & Future Work
The authors acknowledge several limitations and suggest avenues for future research:
- Modest Absolute Correlations: While SSEmb performs well relative to other methods, the absolute Spearman correlation values (e.g., mean 0.503 on the validation MAVEs) are still relatively modest. Future work could explore better integration with supervised methods to potentially improve absolute accuracy.
- Reliance on Input Data Availability: The model requires both a (subsampled) MSA and a protein structure. This limits its applicability in cases where these inputs are unreliable or unavailable, such as for intrinsically disordered proteins (IDPs) or protein complexes lacking experimentally resolved structures.
- Training Data Coverage: Despite being trained on a relatively large dataset, this represents only a small fraction of the vast sequence-structure space. SSEmb is expected to suffer from degrading performance when making predictions for proteins highly dissimilar to those in its training data (e.g., de-novo-designed proteins).
- Not Always State-of-the-Art for Specialized Tasks: Although SSEmb embeddings are useful for downstream tasks like binding site prediction, a generic model like SSEmb may not always achieve state-of-the-art accuracy compared to models specifically developed and optimized for individual purposes (e.g., ScanNet for binding sites).

Future work could focus on:

- Improving absolute prediction accuracy through hybrid self-supervised/supervised learning approaches.
- Extending the model to better handle IDPs or to predict effects in protein complexes where structural information is challenging.
- Expanding training data diversity to improve generalization to novel protein folds and sequences.
- Further exploring how the integration of sequence and structure can disentangle mechanistic aspects of variant effects.
7.3. Personal Insights & Critique
This paper presents a compelling and logically sound approach to tackling the persistent challenge of variant effect prediction. The end-to-end integration of MSA Transformer and GNN via structure-constrained attention is a particularly elegant solution to infuse spatial information directly into the sequence context, rather than merely concatenating features.
Strengths:
- Conceptual Clarity: The idea of using structure to "constrain" MSA attention is intuitive and effective, especially for overcoming the limitations of shallow MSAs. This makes SSEmb highly practical for many real-world proteins that lack deep evolutionary histories.
- Rigorous Benchmarking: The use of MAVE validation sets, the comprehensive ProteinGym benchmark (segmented by MSA depth), and testing on stability and clinical variant tasks demonstrate a thorough evaluation of the model's capabilities.
- Generalizable Embeddings: The proof of principle that SSEmb embeddings are useful for protein-protein binding site prediction is significant. It suggests that SSEmb is not just a variant effect predictor but a powerful protein representation learner, with potential applications across various biophysical and functional prediction tasks.
Potential Areas for Improvement/Critique:
- Complexity vs. Interpretability: While the integration is powerful, Transformer and GNN models can be complex. Further work on interpreting why SSEmb makes certain predictions, especially regarding the interplay between sequence conservation and structural context, would be valuable for biological insights.
- Dynamic Structures: The current model relies on a single static protein structure. IDPs or proteins undergoing significant conformational changes remain challenging. Exploring methods to incorporate structural dynamics or ensemble information could extend its applicability.
- Beyond Single Point Mutations: While the paper focuses on single substitutions, many disease variants involve indels (insertions/deletions) or multi-point mutations. Adapting SSEmb to handle these more complex variants would be a natural next step.
- Computational Cost: Training and inference with large Transformer and GNN models can be computationally intensive. While MSA subsampling helps, exploring more efficient architectures or approximation methods could broaden accessibility.

The core idea of SSEmb, integrating complementary protein data types at a fundamental model level, is broadly applicable. This approach could inspire future models in other domains where multiple, interdependent data modalities exist (e.g., combining genomics, transcriptomics, and proteomics for disease prediction), or in other areas of structural biology (e.g., protein-ligand binding, or protein folding from sequence alone with implicit structural priors). SSEmb is a strong testament to the power of holistic data integration in deep learning for scientific discovery.