InterPLM: Discovering Interpretable Features in Protein Language Models via Sparse Autoencoders
TL;DR Summary
This paper introduces a method using sparse autoencoders to extract interpretable features from protein language models, revealing up to 2,548 interpretable features per layer that correlate with up to 143 biological concepts, and demonstrating applications in database annotation and targeted protein generation.
Abstract
Protein language models (PLMs) have demonstrated remarkable success in protein modeling and design, yet their internal mechanisms for predicting structure and function remain poorly understood. Here we present a systematic approach to extract and analyze interpretable features from PLMs using sparse autoencoders (SAEs). By training SAEs on embeddings from the PLM ESM-2, we identify up to 2,548 human-interpretable latent features per layer that strongly correlate with up to 143 known biological concepts such as binding sites, structural motifs, and functional domains. In contrast, examining individual neurons in ESM-2 reveals up to 46 neurons per layer with clear conceptual alignment across 15 known concepts, suggesting that PLMs represent most concepts in superposition. Beyond capturing known annotations, we show that ESM-2 learns coherent concepts that do not map onto existing annotations and propose a pipeline using language models to automatically interpret novel latent features learned by the SAEs. As practical applications, we demonstrate how these latent features can fill in missing annotations in protein databases and enable targeted steering of protein sequence generation. Our results demonstrate that PLMs encode rich, interpretable representations of protein biology and we propose a systematic framework to extract and analyze these latent features. In the process, we recover both known biology and potentially new protein motifs. As community resources, we introduce InterPLM (interPLM.ai), an interactive visualization platform for exploring and analyzing learned PLM features, and release code for training and analysis.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
InterPLM: Discovering Interpretable Features in Protein Language Models via Sparse Autoencoders
1.2. Authors
- Elana Simon: Stanford University, epsimon@stanford.edu
- James Zou: Stanford University, jamesz@stanford.edu
1.3. Journal/Conference
Published as a preprint on arXiv. Venue Reputation: arXiv is a highly influential open-access repository for preprints in various scientific fields, including computer science, mathematics, physics, and biology. Papers published on arXiv are not peer-reviewed prior to posting but often undergo peer review for subsequent publication in journals or conferences. Its reputation lies in its role for rapid dissemination of research findings and fostering early feedback within the scientific community.
1.4. Publication Year
2024
1.5. Abstract
Protein language models (PLMs) have achieved significant success in protein modeling and design, but their internal mechanisms for predicting structure and function are not well understood. This paper introduces a systematic approach, InterPLM, to extract and analyze interpretable features from PLMs using sparse autoencoders (SAEs). By training SAEs on embeddings from ESM-2, the authors identified up to 2,548 human-interpretable latent features per layer that strongly correlate with up to 143 known biological concepts (e.g., binding sites, structural motifs, functional domains). In contrast, examining individual neurons in ESM-2 revealed fewer interpretable concepts (up to 46 neurons per layer across 15 concepts), suggesting that PLMs represent most concepts in superposition. Beyond known annotations, ESM-2 learns coherent concepts that do not map onto existing annotations. The paper proposes a pipeline using large language models (LLMs) to automatically interpret these novel latent features. Practical applications include filling missing annotations in protein databases and targeted steering of protein sequence generation. The results demonstrate that PLMs encode rich, interpretable representations of protein biology, and InterPLM provides a systematic framework to extract and analyze these latent features, recovering both known biology and potentially new protein motifs. InterPLM is also an interactive visualization platform, and the code for training and analysis is openly released.
1.6. Original Source Link
Official Source: https://arxiv.org/abs/2412.12101 PDF Link: https://arxiv.org/pdf/2412.12101.pdf Publication Status: This paper is currently a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The remarkable success of Protein Language Models (PLMs) in tasks like protein structure prediction and design has opened new avenues in computational biology. However, these powerful models often operate as "black boxes," meaning their internal reasoning and the specific biological concepts they learn remain largely opaque. Understanding these internal mechanisms is crucial for several reasons:
- Model Development: Insights into PLM representations can guide the design of more effective and robust PLMs.
- Biological Discovery: Uncovering the biological principles encoded within PLMs can lead to new biological hypotheses and discoveries.
- Trust and Reliability: A better understanding of how PLMs work can increase trust in their predictions for critical applications like drug design.

A significant challenge in PLM interpretability is the phenomenon of superposition, where a single neuron in a neural network might represent multiple unrelated concepts, or a single concept might be distributed across many neurons. This polysemanticity makes direct interpretation of individual neurons difficult and limits the ability to extract clear, human-interpretable biological features.
The paper's entry point is to address this black box problem and the superposition challenge by adapting sparse autoencoders (SAEs), a technique previously successful in Large Language Model (LLM) interpretability, to the domain of PLMs.
2.2. Main Contributions / Findings
The paper makes several significant contributions to the field of PLM interpretability and biological discovery:
- Systematic Framework for Interpretable Feature Extraction: Introduction of InterPLM, a systematic methodology to extract sparse, human-interpretable latent features from PLMs (specifically ESM-2) using sparse autoencoders (SAEs).
- Enhanced Interpretability Compared to Neurons: SAEs successfully disentangle superposition in PLMs, identifying up to 2,548 interpretable latent features per layer that correlate with up to 143 known biological concepts (e.g., binding sites, structural motifs). In contrast, individual ESM-2 neurons showed far fewer clear conceptual alignments (up to 46 neurons per layer across 15 concepts), highlighting the SAE approach's superiority in revealing interpretable representations.
- Discovery of Novel Biological Concepts: ESM-2 learns coherent biological concepts that do not directly map onto existing Swiss-Prot annotations. The SAE framework helps identify these potentially novel protein motifs.
- LLM-Powered Interpretation Pipeline: Development of an automated pipeline using large language models (Claude-3.5 Sonnet) to generate meaningful descriptions for novel SAE latent features that lack existing biological annotations, enabling interpretation beyond current databases.
- Practical Applications:
  - Filling Missing Annotations: Demonstration of how SAE latent features can be used to identify missing or incomplete annotations in protein databases.
  - Targeted Sequence Generation: Showing that interpretable features can be used to steer protein sequence generation in a targeted manner, opening avenues for controllable protein design.
- Community Resources: Release of InterPLM.ai (an interactive visualization platform) and open-sourcing of the code for training and analysis, fostering community engagement and further research.

These findings collectively demonstrate that PLMs encode a rich, interpretable understanding of protein biology, and InterPLM provides a powerful framework to unlock these representations for both scientific discovery and model improvement.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully understand this paper, a foundational understanding of several key concepts is essential:
- Protein Language Models (PLMs):
  - Concept: PLMs are deep learning models, often based on the transformer architecture, that are trained on vast datasets of protein sequences (like UniRef or AlphaFold DB). They treat amino acid sequences as a "language," where individual amino acids are "tokens."
  - Mechanism: Through self-supervised learning objectives (e.g., masked language modeling), PLMs learn contextual representations (embeddings) for each amino acid in a sequence. These embeddings capture complex evolutionary, structural, and functional information about proteins.
  - Masked Language Modeling (MLM): A common training objective where a certain percentage of amino acids in a sequence are masked (replaced with a special token or a random amino acid), and the model is trained to predict the original masked amino acids based on their context. This forces the model to learn deep contextual dependencies.
  - Transformer Architecture: A neural network architecture introduced by Vaswani et al. (2017) that relies heavily on attention mechanisms to process sequential data. It consists of multiple identical layers, each typically containing multi-head self-attention and feed-forward neural networks.
  - Attention Mechanism: A core component of transformers that allows the model to weigh the importance of different parts of the input sequence when processing a specific token. For a given token, attention calculates a score for every other token in the sequence, indicating how relevant it is. The output is a weighted sum of the values from all other tokens. The common formula for Scaled Dot-Product Attention is:
    $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
    Where:
    - $Q$ (Query), $K$ (Key), and $V$ (Value) are matrices derived from the input embeddings, representing different linear transformations of the input.
    - $d_k$ is the dimension of the key vectors, used for scaling to prevent very large dot products that push the softmax function into regions with extremely small gradients.
    - $QK^T$ computes the dot product similarity between queries and keys; softmax normalizes these scores into a probability distribution.
    - The result is a weighted sum of the Value vectors, where the weights are determined by the attention scores.
  - ESM-2: A specific state-of-the-art PLM developed by Meta AI, based on the transformer architecture. It is known for learning powerful protein representations that are useful for various downstream tasks, including structure prediction (e.g., ESMFold).
- Sparse Autoencoders (SAEs):
  - Concept: An autoencoder is a type of neural network trained to reconstruct its input. It consists of an encoder that maps the input to a latent (hidden) representation and a decoder that reconstructs the input from this latent representation.
  - Sparsity: In SAEs, an additional constraint (often an L1 regularization term on the latent activations) is added during training to encourage most latent units to be inactive (zero or near-zero) for any given input. This forces the autoencoder to learn a more disentangled representation, where each latent feature ideally corresponds to a specific, independent concept.
  - Polysemanticity: The phenomenon where individual neurons in a neural network activate for multiple, often unrelated, concepts. This makes direct interpretation of neurons difficult. SAEs are designed to address polysemanticity by finding a sparse basis where each latent feature is monosemantic (represents a single concept).
  - Dictionary Learning: The process by which SAEs learn a set of dictionary vectors (the decoder weights) that represent the fundamental "concepts" or "features" present in the data. The activations (latent features) then indicate the presence and strength of these concepts for a given input.
- Swiss-Prot:
  - Concept: A high-quality, manually annotated, and reviewed protein sequence database, part of UniProtKB.
  - Annotations: It contains a wealth of information about proteins, including functional descriptions, post-translational modifications, domains, structural features, and disease associations, all supported by experimental evidence or computational analysis. This makes it an invaluable resource for evaluating the biological relevance of features learned by PLMs.
3.2. Previous Works
The paper builds upon a lineage of work in PLM interpretability and mechanistic interpretability using SAEs.
- PLM Interpretability:
  - Early work on PLM interpretability often focused on analyzing attention patterns within transformer layers to understand how models weigh different amino acids or regions. For example, some studies analyzed attention heads for patterns related to contact maps or secondary structure.
  - Other approaches involved probing individual neurons for associations with known biological properties (e.g., specific amino acid types, structural motifs). However, these studies often encountered polysemanticity, where a single neuron might respond to multiple, seemingly unrelated biological concepts, making interpretation challenging. The paper explicitly contrasts its findings with this neuron-level analysis, showing that SAEs reveal significantly more interpretable concepts.
  - Work by Rives et al. (2021) on ESM-1b and ESM-2 showed that PLMs learn deep evolutionary and structural information, paving the way for further mechanistic understanding.
- SAE Applications in Mechanistic Interpretability:
  - SAEs have gained traction in mechanistic interpretability for large language models (LLMs) and vision models. Work by teams such as Anthropic and Redwood Research has demonstrated that SAEs can decompose complex neuron activations into more interpretable features, helping to uncover the underlying "circuits" or "concepts" learned by these models.
  - These SAEs have been used to identify interpretable latent features corresponding to specific linguistic phenomena (e.g., negation, specific facts) or visual concepts. This paper adapts this proven methodology to the protein domain.
3.3. Technological Evolution
The field has evolved from:
- Sequence-based bioinformatics: Traditional methods relying on sequence alignments, motifs, and statistical models.
- Early machine learning in biology: Applying simpler models to protein data.
- Deep learning on protein sequences: Emergence of PLMs (like ESM and AlphaFold models) that leverage massive datasets and transformer architectures to learn powerful, general-purpose protein representations. This shift allowed PLMs to capture co-evolutionary patterns and complex physicochemical principles implicitly.
- Interpretability of Deep Learning Models: As PLMs became more powerful, the need to understand their internal workings grew. Initial interpretability efforts (attention maps, neuron probing) provided some insights but were often limited by the models' polysemanticity.
- Mechanistic Interpretability with SAEs: This paper represents a crucial step in bringing mechanistic interpretability techniques, particularly SAEs successful in other domains, to PLMs. This allows for a more granular and monosemantic understanding of the biological concepts encoded within these models.

This work fits within the technological timeline as a leading-edge effort to bridge the gap between the predictive power of PLMs and their biological interpretability, moving beyond black-box predictions to uncover the explicit biological knowledge learned.
3.4. Differentiation Analysis
Compared to prior approaches, the core differences and innovations of InterPLM are:
- Addressing Superposition Directly: While previous PLM interpretability often struggled with polysemanticity at the neuron level, InterPLM directly tackles this by using SAEs to decompose neuron activations into sparse, monosemantic latent features. This is a significant improvement, enabling the identification of thousands of clear biological concepts where neuron analysis found only dozens.
- Quantitative Biological Validation at Scale: The paper establishes a rigorous quantitative evaluation framework using Swiss-Prot annotations. This allows for objective assessment of the biological interpretability of the SAE features, demonstrating their strong correlation with a wide range of known biological concepts.
- Automated Interpretation of Novel Concepts: Introduction of an LLM-powered pipeline for automatically interpreting SAE features that do not map to existing Swiss-Prot annotations. This is crucial for discovering new biological insights that PLMs might have learned but are not yet cataloged in databases.
- Practical Downstream Applications: Beyond mere interpretation, the paper demonstrates tangible applications: identifying missing protein annotations and enabling targeted protein sequence generation. This shows the utility of interpretable features for both data curation and protein engineering.
- Open-Source and Interactive Platform: The release of the InterPLM.ai platform and code makes the methodology and results accessible to the broader research community, fostering collaborative discovery.

In essence, InterPLM provides a comprehensive and systematic framework that moves beyond qualitative observation or limited neuron analysis to quantitatively extract, interpret, and apply a vast array of interpretable biological features from PLMs.
4. Methodology
4.1. Principles
The core principle of the InterPLM methodology is to leverage sparse autoencoders (SAEs) to transform the dense, often polysemantic (representing multiple concepts simultaneously) representations within Protein Language Models (PLMs) into sparse, monosemantic (representing a single, clear concept) latent features. This approach aims to address the superposition problem prevalent in neural networks, where individual neurons might encode a mix of different concepts, making direct interpretation difficult. By training SAEs on the embeddings from PLM layers, the method seeks to learn a "dictionary" of fundamental biological concepts. Each latent feature in this dictionary is designed to activate for a specific biological pattern or property, thereby enabling human-interpretable analysis. The methodology then integrates quantitative evaluation against known biological databases (Swiss-Prot) and an automated Large Language Model (LLM)-based interpretation pipeline to provide comprehensive understanding and practical applications of these extracted features.
4.2. Core Methodology In-depth (Layer by Layer)
The InterPLM methodology involves several key stages: Sparse Autoencoder Training, Swiss-Prot Concept Evaluation Pipeline, LLM Feature Annotation Pipeline, Feature Analysis and Visualization, and Steering Experiments.
4.2.1. Sparse Autoencoder Training
This stage focuses on training SAEs for each layer of the Protein Language Model (PLM) to extract interpretable features.
4.2.1.1. Dataset Preparation
The SAEs are trained on embeddings extracted from a pre-trained PLM, specifically ESM-2-8M-UR50D.
- Source: Proteins are selected from UniRef50, a clustered protein sequence database, to ensure a diverse training dataset.
- Embedding Extraction: Hidden representations are extracted from ESM-2 after each of its six transformer block layers (layers 1 through 6). These embeddings are the contextualized numerical representations of each amino acid token.
- Token Exclusion: The special classifier (CLS) and end-of-sequence (EOS) tokens are excluded from the embeddings used for SAE training, focusing only on the amino acid representations.
- Sampling: Proteins are randomly sampled during training to ensure a diverse and representative input for the SAEs.
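To make the extraction step concrete, here is a minimal sketch of pulling per-layer, per-residue ESM-2 embeddings with Hugging Face transformers and dropping the special tokens. The model identifier and the choice of the transformers API (rather than the paper's released code) are assumptions.

```python
# A minimal sketch (not the paper's exact pipeline) of extracting per-layer
# ESM-2 embeddings; the model ID and API choice are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "facebook/esm2_t6_8M_UR50D"  # ESM-2 8M: 6 layers, 320-dim embeddings

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID, output_hidden_states=True).eval()

def layer_embeddings(sequence: str, layer: int) -> torch.Tensor:
    """Return per-residue embeddings after a given transformer block (1-6),
    excluding the special CLS and EOS tokens."""
    inputs = tokenizer(sequence, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states[0] is the input embedding layer; hidden_states[layer] follows block `layer`
    hidden = out.hidden_states[layer][0]   # shape: (seq_len + 2, 320)
    return hidden[1:-1]                    # drop CLS (first) and EOS (last)

acts = layer_embeddings("MGLSDGEWQLVLNVWGKVEAD", layer=4)
print(acts.shape)  # (num_residues, 320)
```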
4.2.1.2. Architecture and Training Parameters
The SAE architecture is designed to expand the PLM's embedding space into a much larger dictionary space, promoting sparsity.
- Expansion Factor: Each of ESM-2's six layers has 320 neurons (the embedding dimension). The SAEs expand this into a feature dictionary of 10,240 latent features. This 32x expansion (10,240 / 320) is a common heuristic in SAE training to allow for sparse decomposition.
- Training Protocol:
  - For each ESM-2 layer, 20 SAEs are trained independently.
  - Each SAE is trained for 500,000 steps.
  - A batch size of 2,048 is used during training.
  - Learning Rates: Learning rates are sampled in increments of 10x.
  - L1 Penalty: An L1 regularization penalty is applied, with values ranging from 0.07 to 0.1. The L1 penalty is crucial for encouraging sparsity in the latent activations, pushing many feature activations towards zero. This forces the model to represent inputs using only a few active features, leading to more disentangled and interpretable representations.
  - Warm-up Phase: L1 penalty values are decreased during an initial warm-up phase, with the learning rate reaching its maximum within the first 5% of training steps.

The reconstruction of an input activation vector $\mathbf{x}$ by an SAE can be expressed as:

$\hat{\mathbf{x}} = \mathbf{b}_d + \sum_{i=1}^{d_{\mathrm{dict}}} f_i(\mathbf{x})\,\mathbf{d}_i$

Where:

- $\mathbf{x}$: The original embedding vector from the PLM layer (input to the SAE).
- $\mathbf{b}_d$: A bias term (the decoder bias) that is added to the reconstruction.
- $d_{\mathrm{dict}}$: The size of the learned dictionary (number of latent features, e.g., 10,240).
- $f_i(\mathbf{x})$: The activation value of the $i$-th latent feature for the input $\mathbf{x}$. This is the output of the encoder.
- $\mathbf{d}_i$: The dictionary vector (or decoder weight vector) for the $i$-th latent feature. These vectors are rows of the decoder matrix and represent the learned basis elements.

This equation shows that the SAE reconstructs the input embedding as a linear combination of its dictionary vectors, weighted by their activation values. The sparsity constraint ensures that for any given input $\mathbf{x}$, only a few $f_i(\mathbf{x})$ values are non-zero, making the contribution of each $\mathbf{d}_i$ explicit.
The activation function for the latent features is typically a Rectified Linear Unit (ReLU), ensuring non-negativity and contributing to sparsity. The relationship is depicted in the following figure:
Figure 8: Overview of SAE decomposition and training. (a) Decomposition of an embedding vector into a weighted sum of dictionary elements. (b) Architecture of the SAE.
Specifically, the latent feature activations are calculated as:

$f(\mathbf{x}) = \mathrm{ReLU}(W_e(\mathbf{x} - \mathbf{b}_d) + \mathbf{b}_e)$

Where:

- $\mathbf{x}$: The input embedding vector.
- $W_e$: The encoder weight matrix.
- $\mathbf{b}_d$: The decoder bias term, subtracted from the input before encoding.
- $\mathbf{b}_e$: The encoder bias term.
- $\mathrm{ReLU}$: The Rectified Linear Unit activation function, defined as $\mathrm{ReLU}(z) = \max(0, z)$. This enforces non-negativity and promotes sparsity. The $i$-th component $f_i(\mathbf{x})$ of this vector is the activation of the $i$-th latent feature.
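Putting the two equations together, a minimal PyTorch sketch of the SAE described above could look as follows. The layer sizes (320-dimensional ESM-2 embeddings, 10,240 dictionary features) come from the text; the initialization, optimizer, and exact L1 coefficient schedule are assumptions.

```python
# Minimal sketch of the SAE architecture described above; training details are assumed.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 320, d_dict: int = 10_240):
        super().__init__()
        self.W_e = nn.Linear(d_model, d_dict)   # encoder weights + encoder bias b_e
        self.W_d = nn.Linear(d_dict, d_model)   # decoder (dictionary) + decoder bias b_d

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # f(x) = ReLU(W_e (x - b_d) + b_e)
        return torch.relu(self.W_e(x - self.W_d.bias))

    def forward(self, x: torch.Tensor):
        f = self.encode(x)
        x_hat = self.W_d(f)                     # x_hat = b_d + sum_i f_i(x) d_i
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 0.08):
    """Reconstruction MSE plus L1 sparsity penalty on the feature activations."""
    mse = (x - x_hat).pow(2).mean()
    l1 = f.abs().sum(dim=-1).mean()
    return mse + l1_coeff * l1
```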
4.2.1.3. Feature Normalization
To standardize feature comparisons and ensure consistency across different SAEs and layers, activation values are normalized.
- Method: Activation values are scaled (e.g., using min-max scaling or z-score normalization) across a subset of 5,000 proteins randomly sampled from Swiss-Prot. This ensures that activation values are comparable regardless of the specific SAE or protein context.
4.2.1.4. SAE Metrics (Appendix B.1.1)
The performance of SAEs is evaluated using several metrics beyond standard reconstruction loss, focusing on sparsity and interpretability.
- L1 Norm of Activations ($L_1$): Measures the sum of absolute values of feature activations, used as a regularization term to encourage sparsity.

  $L_1(f(x)) = \sum_{i=1}^{d_{\mathrm{dict}}} |f_i(x)|$

  Where:
  - $f_i(x)$: The activation value of the $i$-th latent feature for input $x$.
  - $d_{\mathrm{dict}}$: The total number of latent features in the dictionary.

- Mean Squared Error ($\mathrm{MSE}(x, x')$): Quantifies the difference between the original input embedding and its reconstruction, indicating the quality of the SAE's reconstruction.

  $\mathrm{MSE}(x, x') = \frac{1}{d} \sum_{i=1}^{d} (x_i - x'_i)^2$

  Where:
  - $x_i$: The $i$-th component of the original embedding vector.
  - $x'_i$: The $i$-th component of the reconstructed embedding vector.
  - $d$: The dimension of the embedding vector.

- L0 Norm of Activations ($L_0$): Counts the number of non-zero latent feature activations for a given input, directly measuring sparsity.

  $L_0(f(x)) = \sum_{i=1}^{d_{\mathrm{dict}}} \mathbf{1}(f_i(x) > 0)$

  Where:
  - $\mathbf{1}(\cdot)$: The indicator function, which equals 1 if the condition is true and 0 otherwise.
  - $f_i(x)$: The activation value of the $i$-th latent feature for input $x$.
  - $d_{\mathrm{dict}}$: The total number of latent features.

- % Loss Recovered: This metric assesses how much of the original PLM's predictive performance (measured by cross-entropy) can be retained when its internal embedding layer is replaced by the SAE's reconstruction of that embedding. It quantifies how well the SAE preserves the information necessary for the PLM's downstream task.

  $\%\,\mathrm{Loss\ Recovered} = \left(1 - \frac{CE_{\mathrm{Reconstruction}} - CE_{\mathrm{Original}}}{CE_{\mathrm{Zero}} - CE_{\mathrm{Original}}}\right) \times 100$

  Where:
  - $CE_{\mathrm{Reconstruction}}$: The cross-entropy of the PLM when the specified embedding layer is replaced by the SAE's reconstruction of the embedding.
  - $CE_{\mathrm{Original}}$: The cross-entropy of the PLM using its original embedding layer.
  - $CE_{\mathrm{Zero}}$: The cross-entropy of the PLM when the specified embedding layer is replaced with all zeros (a baseline representing complete information loss).
  - The numerator measures the increase in loss due to the reconstruction; the denominator represents the total potential loss if the embedding provided no information. A higher % Loss Recovered indicates that the SAE reconstruction effectively preserves the information content of the original embedding.
4.2.2. Swiss-Prot Concept Evaluation Pipeline
This pipeline quantitatively evaluates the biological interpretability of SAE features by associating them with known biological concepts from Swiss-Prot.
4.2.2.1. Dataset Construction
- Source: A random sample of 50,000 proteins is drawn from the reviewed subset of UniProtKB (Swiss-Prot).
- Filtering: Proteins with lengths under 1,024 amino acids are selected for efficiency. This dataset is partitioned into validation and held-out sets.
- Concept Coverage: The Swiss-Prot annotations cover a wide range of biological concepts, including structural features (active sites, binding sites, disulfide bonds), modifications (carbohydrate, lipid), targeting (signal peptide, transit peptide), and functional domains (protein domains, motifs). A comprehensive list of categories and specific concepts used is provided in Appendix C.2.
4.2.2.2. Feature-Concept Association Analysis
For each SAE feature and each Swiss-Prot concept, the association is quantified using modified precision, recall, and F1 score metrics. The goal is to identify how well a feature's activation pattern aligns with the presence of a known biological concept.
- Activation Thresholds: Feature activations are binarized at three thresholds (0.5, 0.6, and 0.8) to determine "active" positions.
- Metrics:
  - Precision: Measures the proportion of feature-activated positions that are true positives for a given concept.

    $\mathrm{Precision} = \frac{\mathrm{TruePositives}}{\mathrm{TruePositives} + \mathrm{FalsePositives}}$

    Where:
    - TruePositives: Number of positions where the feature is active AND the concept is present.
    - FalsePositives: Number of positions where the feature is active BUT the concept is NOT present.
  - Recall: Measures the proportion of positions where a concept is present that are correctly identified by the feature. This metric is adapted to protein domains.

    $\mathrm{Recall} = \frac{\mathrm{DomainsWithTruePositive}}{\mathrm{TotalDomains}}$

    Where:
    - DomainsWithTruePositive: Number of protein domains (or regions) where the concept is present AND at least one position within that domain is activated by the feature.
    - TotalDomains: Total number of protein domains (or regions) where the concept is present.
  - F1 Score: The harmonic mean of precision and recall, providing a balanced measure of a feature's association with a concept.

    $F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$

- Threshold Selection: For each feature-concept pair, the activation threshold (0.5, 0.6, or 0.8) that yields the highest F1 score is selected for final evaluation.
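A sketch of this domain-aware scoring for one feature-concept pair is shown below. The input structures (a per-residue activation array, a boolean concept mask, and a list of annotated domain ranges) are hypothetical representations chosen for illustration.

```python
# Sketch of the feature-concept F1 described above: precision is computed per
# position, recall per annotated domain. Input structures are hypothetical.
import numpy as np

def feature_concept_f1(acts: np.ndarray, concept_mask: np.ndarray,
                       domains: list[tuple[int, int]], threshold: float = 0.5) -> float:
    """acts: normalized per-residue activations; concept_mask: boolean per-residue labels;
    domains: (start, end) index ranges of annotated concept instances."""
    active = acts >= threshold
    tp = np.sum(active & concept_mask)
    fp = np.sum(active & ~concept_mask)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0

    hit_domains = sum(1 for s, e in domains if active[s:e].any())
    recall = hit_domains / len(domains) if domains else 0.0

    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# In the paper, the threshold in {0.5, 0.6, 0.8} giving the highest F1 is kept per pair.
```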
4.2.2.3. Model Selection and Evaluation
- Initial Evaluation: Initial evaluations are conducted on 20% of the 135 concepts that have more than 10 domains or 1,500 amino acids.
- SAE Selection: For each ESM-2 layer, the SAE (from the 20 trained SAEs) with the highest average F1 score across the validation set is chosen for subsequent analyses and inclusion in the InterPLM dashboard.
- Test Set Evaluation: For the selected SAEs, feature-concept pairs exceeding an F1-score threshold in the validation set are identified, and their F1 scores are then calculated on an independent test set. The number of pairs retaining a high F1 score in the test set is reported.
4.2.2.4. Baselines
To demonstrate the effectiveness of SAEs, comparisons are made against two baselines:
- Individual Neurons of ESM-2: Directly evaluating the interpretability of ESM-2's individual neurons using the same Swiss-Prot association pipeline. This serves as a direct comparison showing how SAEs disentangle polysemanticity.
- SAEs Trained on Randomized ESM-2 Weights: Training SAEs on embeddings from an ESM-2 model with shuffled weights. This control ensures that interpretability is due to the learned representations of the PLM, not just inherent properties of the SAE or the protein data distribution. Randomized models are expected to show little to no association with biological concepts.
4.2.3. LLM Feature Annotation Pipeline
This pipeline automatically generates descriptions for SAE features, especially those without existing Swiss-Prot annotations.
4.2.3.1. Example Selection
- Feature Subset: Analysis is performed on a random selection of 1,200 (10%) SAE features.
- Representative Proteins: For each selected feature, proteins that maximally activate that feature are chosen. Specifically, 10 Swiss-Prot proteins are selected based on their activation levels.
- Activation Bins: Activation levels are divided into 10 bins (0-0.1, 0.1-0.2, ..., 0.9-1.0), and for each feature, one protein is selected per bin. If a feature has fewer than 10 maximally activating proteins in the highest bin (0.9-1.0), additional examples are sampled from lower bins to reach a total of 10 proteins, ensuring a diverse range of activations is shown to the LLM.
4.2.3.2. Description Generation and Validation
- LLM Used: Claude-3.5 Sonnet (a large language model from Anthropic).
- Prompting: The LLM is provided with:
  - Protein metadata from Swiss-Prot (e.g., protein name, gene names, organism, functional annotations).
  - Quantitative activation values for each amino acid in the selected proteins.
  - Amino acid identities at these activated positions.

  The LLM is instructed to generate a description of the feature's activation patterns and a concise summary, focusing on biological properties associated with high activation and overlapping functional/structural annotations.
- Validation: To validate the quality of the LLM-generated descriptions, the LLM is then prompted to predict the maximum activation value for new, unseen proteins based solely on its own generated description and the protein's metadata. The LLM's predicted activations are compared to the true measured activations using Pearson correlation. A high Pearson correlation indicates that the LLM's description accurately captures the essence of the feature's activation.
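The description-generation step amounts to a structured prompt over the selected examples. The sketch below uses the Anthropic Python SDK; the model identifier, prompt wording, and input formatting are assumptions rather than the paper's exact prompt.

```python
# Hedged sketch of the description-generation call; prompt and model name are assumptions.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def describe_feature(examples: list[dict]) -> str:
    """examples: one dict per protein, e.g. {"name": ..., "function": ..., "activations": [...]}."""
    prompt = (
        "You are analyzing one latent feature of a protein language model.\n"
        "For each protein below, you are given Swiss-Prot metadata and per-residue "
        "activation values of this feature. Describe the activation pattern and give "
        "a concise summary of the biological property the feature appears to track.\n\n"
        f"{examples}"
    )
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",   # assumed model identifier
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
```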
4.2.4. Feature Analysis and Visualization
This stage involves techniques for exploring and clustering the extracted SAE features.
4.2.4.1. UMAP Embedding and Clustering
- Dimensionality Reduction: Uniform Manifold Approximation and Projection (UMAP) is applied to the normalized SAE decoder weights (the dictionary vectors). UMAP is a non-linear dimensionality reduction technique that preserves both local and global structure of the high-dimensional data, allowing feature relationships to be visualized in 2D or 3D.
  - Parameters: min_dist=0.1.
- Clustering: Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) is used to identify natural clusters among the UMAP-embedded features. HDBSCAN is robust to noise and can find clusters of varying densities and shapes.
  - Parameters: min_cluster_size=5, min_samples=3.
- Visualization: The results are visualized in the InterPLM interface, where features are plotted in the UMAP space and colored according to their clusters or Swiss-Prot associations.
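A minimal sketch of this step with the umap-learn and hdbscan libraries follows; only the parameters reported above are set, everything else is left at library defaults, and the weights file path is hypothetical.

```python
# Sketch of UMAP embedding + HDBSCAN clustering of SAE decoder weights.
import numpy as np
import umap
import hdbscan

decoder_weights = np.load("sae_decoder_weights.npy")   # hypothetical path, shape (d_dict, d_model)
normalized = decoder_weights / np.linalg.norm(decoder_weights, axis=1, keepdims=True)

embedding_2d = umap.UMAP(min_dist=0.1).fit_transform(normalized)
cluster_labels = hdbscan.HDBSCAN(min_cluster_size=5, min_samples=3).fit_predict(embedding_2d)

n_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)  # -1 marks noise
print(f"{n_clusters} clusters found")
```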
4.2.4.4. Sequential and Structural Feature Analysis
To understand if SAE features represent local sequence motifs or 3D structural elements, a detailed analysis is performed.
- Procedure:
  - High-Activation Regions: For proteins with available AlphaFold structures, regions with high feature activation are identified.
  - Clustering Metrics: For each protein's highest-activation residue, two metrics are calculated:
    - Sequential Clustering: Mean feature activation within a window of positions in the sequence around the highest-activation residue.
    - Structural Clustering: Mean feature activation of residues within a fixed radius (in Angstroms) in 3D space around the highest-activation residue.
  - Null Distributions: Null distributions for these clustering metrics are generated by averaging 5 random permutations of activation values per protein. This helps determine whether observed clustering is statistically significant compared to random chance.
  - Significance Assessment: Paired t-tests and Cohen's d effect sizes are used to assess the significance of sequential and structural clustering across 100 proteins per feature (or features with at least 25 examples meeting a high activation threshold).
  - Visualization: Features are colored based on the ratio of structural to sequential effect sizes, indicating whether they primarily capture 3D structural proximity or linear sequence patterns. Only features with significant Bonferroni-corrected structural p-values are considered.
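The two clustering metrics can be computed directly from per-residue activations and C-alpha coordinates, as in the sketch below. The window size and radius here are placeholders, since the exact values are not recoverable from this summary.

```python
# Sketch of the sequential vs. structural clustering metrics; window/radius are placeholders.
import numpy as np

def clustering_metrics(acts: np.ndarray, ca_coords: np.ndarray,
                       window: int = 5, radius: float = 8.0) -> tuple[float, float]:
    """acts: per-residue feature activations; ca_coords: (n_residues, 3) C-alpha coordinates."""
    center = int(np.argmax(acts))

    # Sequential clustering: mean activation in a +/- `window` neighborhood along the chain
    lo, hi = max(0, center - window), min(len(acts), center + window + 1)
    sequential = float(acts[lo:hi].mean())

    # Structural clustering: mean activation of residues within `radius` Angstroms in 3D
    dists = np.linalg.norm(ca_coords - ca_coords[center], axis=1)
    structural = float(acts[dists <= radius].mean())
    return sequential, structural

# Null distributions are obtained by recomputing these metrics on shuffled activations.
```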
4.2.5. Steering Experiments
This stage demonstrates a practical application: using interpretable SAE features to guide protein sequence generation.
- Approach: Following a method described in prior work [20], ESM embeddings are decomposed into SAE reconstruction predictions and error terms.
- Sequence Steering Steps (a minimal sketch of the steering arithmetic follows below):
  1. Extract embeddings at the specified PLM layer.
  2. Calculate SAE reconstructions and error terms for these embeddings.
  3. Modify the reconstruction by amplifying or suppressing the activation of a desired feature.
  4. Combine the modified reconstructions with the error terms.
  5. Allow normal PLM processing to continue with these modified embeddings.
  6. Extract probabilities for masked tokens (e.g., the probability of a specific amino acid at a masked position) using tools like NNsight.
- Example: The paper demonstrates this by steering specific glycine-related features to influence glycine predictions in periodic patterns (like GxGX repeats in collagen). This shows how fine-grained control over PLM outputs can be achieved by manipulating interpretable latent features.
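The sketch below covers only the steering arithmetic (steps 2-4); re-injecting the modified embeddings into the PLM (e.g., via NNsight hooks) is omitted. The `sae` object is the hypothetical SparseAutoencoder from the earlier sketch.

```python
# Sketch of feature steering: modify one feature's activation, keep the SAE error term.
import torch

def steer_embeddings(sae, x: torch.Tensor, feature_idx: int, steer_value: float) -> torch.Tensor:
    """x: (num_residues, d_model) embeddings from the chosen PLM layer."""
    x_hat, f = sae(x)
    error = x - x_hat                         # the part of x the SAE does not explain

    f_steered = f.clone()
    f_steered[:, feature_idx] = steer_value   # amplify (or set to 0 to suppress) one feature

    x_hat_steered = sae.W_d(f_steered)        # decode the modified activations
    return x_hat_steered + error              # keep the error term so only the feature changes
```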
5. Experimental Setup
5.1. Datasets
The study utilizes two primary datasets: UniRef50 for SAE training and Swiss-Prot for feature evaluation and LLM interpretation.
- UniRef50:
  - Source: UniRef50 (UniProt Reference Clusters at 50% sequence identity) is a comprehensive, non-redundant database of protein sequences.
  - Purpose: Used as the training dataset for the SAEs, providing a broad and diverse set of protein sequences so the SAEs learn general protein biology, similar to how ESM-2 itself was trained.
  - Characteristics: Proteins are clustered such that sequences within a cluster share at least 50% sequence identity, reducing redundancy while retaining diversity. The paper does not specify the exact number of proteins or their typical lengths beyond "proteins from UniRef50."
  - Data Sample: A protein sequence from UniRef50 might look like: MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISEAIIHVLHSRHPGNFGADAQGAMNKALELFRKDIAAKYKELGYQG (an example of a globin sequence).
- Swiss-Prot (Reviewed UniProtKB):
  - Source: The reviewed section of the UniProt Knowledgebase (UniProtKB), known for its high-quality, manually curated, and extensively annotated protein entries.
  - Purpose: Primarily used for quantitative evaluation of SAE features against known biological concepts, for selecting optimal SAE models, and for providing metadata to the LLM during feature annotation.
  - Characteristics:
    - Size: A random sample of 50,000 proteins was drawn for evaluation.
    - Length: Proteins with lengths under 1,024 amino acids were selected. The criteria for the validation and held-out sets ensured a reasonable number of annotations per concept.
    - Annotations: Rich annotations cover various aspects of protein biology, categorized into:
      - Basic Identification Fields (e.g., accession, protein_name, sequence, organism_name, length).
      - Structural Features (e.g., active sites, binding sites, disulfide bonds, helical regions, beta strands, coiled coil regions, transmembrane regions).
      - Modifications and Chemical Features (e.g., carbohydrate modifications, lipid modifications, modified residues, cofactor information).
      - Targeting and Localization (e.g., signal peptide, transit peptide).
      - Functional Domains and Regions (e.g., protein domains, short motifs, zinc finger regions).
      - Functional Annotations (e.g., catalytic activity, enzyme commission number, function, protein families, Gene Ontology function).
  - Data Sample (Annotations): For a protein like Q46638 (glycogen synthase from Escherichia coli), Swiss-Prot might have an annotation for a binding site at positions 210-217 for ATP, or a domain called Glycosyltransferase spanning 15-400. The actual sequence might begin MKIISTL...
- Rationale for Dataset Choice:
  - UniRef50 provides the necessary scale and diversity for effectively training deep learning models like SAEs.
  - Swiss-Prot is the gold standard for high-quality, verified biological annotations. Its detailed and diverse annotations make it an ideal choice for rigorously evaluating whether SAE features correspond to known biological concepts, effectively validating the method's ability to uncover biologically meaningful representations.
5.2. Evaluation Metrics
The paper uses several metrics to evaluate the performance of SAE features in capturing biological concepts and the quality of LLM-generated descriptions.
5.2.1. Precision
Conceptual Definition: Precision (also known as Positive Predictive Value) measures the accuracy of the feature's positive predictions. It answers the question: "Of all the positions (or domains) that the feature activated for, how many actually correspond to the target biological concept?" A high precision indicates that when the feature activates, it is very likely to be signaling the presence of the concept.
Mathematical Formula:

$\mathrm{Precision} = \frac{TP}{TP + FP}$

Symbol Explanation:

- $TP$ (True Positives): The number of positions (or domains) where the SAE feature activated above a certain threshold and the target biological concept was actually present according to Swiss-Prot annotations.
- $FP$ (False Positives): The number of positions (or domains) where the SAE feature activated above a certain threshold, but the target biological concept was not present.
5.2.2. Recall
Conceptual Definition: Recall (also known as Sensitivity or True Positive Rate) measures the ability of the feature to find all relevant instances of a biological concept. It answers the question: "Of all the positions (or domains) where the target biological concept was actually present, how many did the feature successfully activate for?" A high recall indicates that the feature is good at detecting most occurrences of the concept. This metric is specifically adapted for protein domains in this paper.
Mathematical Formula:

$\mathrm{Recall} = \frac{\mathrm{DomainsWithTruePositive}}{\mathrm{TotalDomains}}$

Symbol Explanation:

- $\mathrm{DomainsWithTruePositive}$: The number of protein domains (or regions) where the target biological concept was present AND at least one position within that domain (or region) was activated by the SAE feature above the threshold.
- $\mathrm{TotalDomains}$: The total number of protein domains (or regions) where the target biological concept was present according to Swiss-Prot annotations.
5.2.3. F1 Score
Conceptual Definition: The F1 score is the harmonic mean of precision and recall. It is a useful metric when you need to balance precision and recall, especially in cases of imbalanced class distribution (e.g., when biological concepts are rare). A high F1 score indicates that the feature has both high precision (few false positives) and high recall (few false negatives).
Mathematical Formula:

$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$

Symbol Explanation:

- $\mathrm{Precision}$: The precision value calculated as described above.
- $\mathrm{Recall}$: The recall value calculated as described above.
5.2.4. Pearson Correlation
Conceptual Definition: The Pearson correlation coefficient ($r$) measures the linear relationship between two sets of data. In this paper, it is used to validate the LLM-generated descriptions. It quantifies how well the LLM's predicted activation values for unseen proteins align with the actual measured activation values of the SAE feature. A coefficient of 1 indicates a perfect positive linear relationship, -1 a perfect negative linear relationship, and 0 no linear relationship.
Mathematical Formula:
The Pearson correlation coefficient for two variables $X$ and $Y$ is:

$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$

Symbol Explanation:

- $n$: The number of data points (in this context, the number of proteins for which activation values are compared).
- $x_i$: The $i$-th LLM-predicted maximum activation value for a feature.
- $\bar{x}$: The mean of all LLM-predicted maximum activation values.
- $y_i$: The $i$-th actual measured maximum activation value for the SAE feature.
- $\bar{y}$: The mean of all actual measured maximum activation values.
5.3. Baselines
The effectiveness of the InterPLM approach (using SAE features) is benchmarked against two crucial baselines to highlight its advantages:
- Individual Neurons in ESM-2:
  - Description: This baseline involves directly analyzing the activations of the individual neurons within the ESM-2 transformer layers using the same Swiss-Prot concept evaluation pipeline.
  - Purpose: This comparison directly addresses the superposition problem. By showing that SAE features correlate with significantly more biological concepts than individual neurons, the authors demonstrate that SAEs successfully disentangle the mixed representations found in neurons, leading to clearer, monosemantic concepts.
  - Representativeness: This is a standard and intuitive baseline for interpretability studies, as neurons are the fundamental computational units of a neural network.
- SAEs Trained on Randomized ESM-2 Weights:
  - Description: This control experiment involves training SAEs on embeddings extracted from an ESM-2 model whose weights have been randomized (shuffled).
  - Purpose: This baseline ensures that the observed interpretability of SAE features is genuinely due to the learned biological representations within the PLM and not merely an artifact of the SAE training process itself or generic statistical patterns in protein sequences. A randomized model should not encode meaningful biological information, so SAEs trained on its embeddings should not yield interpretable biological features.
  - Representativeness: This is a robust control for attributing interpretability to the PLM's learned knowledge rather than inherent biases in the SAE method or the data distribution.
6. Results & Analysis
6.1. Core Results Analysis
The paper presents compelling evidence that sparse autoencoders (SAEs) can extract rich, interpretable biological features from Protein Language Models (PLMs), far surpassing the interpretability of individual neurons.
1. SAEs Find Interpretable Concepts in Protein Language Models (Section 3.1 & Figure 3, Figure 9):
The core finding is that SAEs successfully extract human-interpretable latent features from ESM-2.
- Quantitative Advantage: SAEs identify up to 2,548 interpretable latent features per layer that correlate strongly with up to 143 known Swiss-Prot biological concepts (e.g., binding sites, structural motifs, functional domains).
- Contrast with Neurons: In stark contrast, individual neurons in ESM-2 revealed clear conceptual alignment for only up to 46 neurons per layer across 15 concepts. This dramatic difference (e.g., 2,548 features vs. 46 neurons) strongly supports the hypothesis that PLMs represent most concepts in superposition, and that SAEs are effective at disentangling these concepts.
- Control Experiment Validation: The effectiveness of SAE features in capturing biological concepts disappears when SAEs are trained on randomized ESM-2 model weights (Figure 3, green bars). This confirms that the interpretability is derived from the meaningful biological information learned by the PLM, not just an artifact of the SAE method.
- Robustness Across Layers: The superiority of SAE features over neurons persists across all ESM-2 layers (Figure 9b). Early layers tend to capture more basic properties (like amino acid types in randomized models), while deeper layers capture more complex biological concepts.

The following are results from [Figure 3b] of the original paper:
Figure 3b: Number of Features with Concept F1 Scores
This bar chart shows that SAE features (pink bars) consistently identify a substantially higher number of Swiss-Prot concepts above the F1-score threshold across all ESM-2 layers compared to ESM neurons (blue bars) and SAEs trained on shuffled weights (green bars). For SAE features, the count rises from a few hundred in layer 1 to over 2,500 in layer 6.
2. Features Form Clusters Based on Shared Functional and Structural Roles (Section 3.4 & Figure 4, Figure 6, Figure 7):
SAE features not only capture specific concepts but also exhibit meaningful clustering patterns.
- UMAP Visualization: UMAP embeddings of SAE decoder weights reveal clusters of features that share functional and structural roles (Figure 4a). For example, a "kinase cluster" was identified in which features specialized in different sub-regions of the kinase domain (e.g., catalytic loop, beta sheet).
- Specificity vs. Generality: Some features are highly specific (e.g., a TBDR detector), while others are more general (beta barrel features). This demonstrates the SAE's ability to learn a spectrum of conceptual granularity.
- Sequential vs. Structural Activations: Analysis of activation patterns in 3D protein structures shows that SAE features can represent both sequential motifs and 3D structural elements (Figure 2a). Features that activate on spatially proximal residues, even if far apart in sequence, indicate learning of 3D structural concepts.
3. Large Language Models Can Generate Meaningful Feature Descriptions (Section 3.5 & Figure 5):
The paper demonstrates a novel pipeline for automatically interpreting SAE features using LLMs, extending interpretability beyond predefined annotations.
- Automated Interpretation: Claude-3.5 Sonnet successfully generated feature descriptions by analyzing protein metadata and feature activation patterns. These LLM-generated descriptions were highly predictive of feature activation on new proteins (median Pearson correlation r = 0.72).
- Discovering Unannotated Concepts: This LLM-powered pipeline is particularly valuable for features that do not map to existing Swiss-Prot annotations (which labeled less than 20% of features across layers). For instance, a hexapeptide beta-helix feature was accurately described by the LLM despite lacking Swiss-Prot annotations. This highlights PLMs' ability to learn coherent concepts not yet documented in databases.

The following are results from [Figure 5] of the original paper:
Figure 5: Language models can generate automatic feature descriptions for SAE features. (a) Workflow for generating and validating descriptions with Claude-3.5 Sonnet. (b) Comparison of measured maximum activations with activations predicted from the generated feature descriptions, visualized via kernel density estimation; Claude's summary description of each feature is annotated next to the corresponding structures.
This figure illustrates the workflow for LLM-based interpretation (a) and shows examples of how LLM-generated descriptions correlate well with the actual feature activations in proteins (b). For instance, the LLM correctly describes one feature as activating on a hexapeptide beta-helix, which aligns with the structural motif observed in the maximally activating protein.
4. Feature Activations Identify Missing and Novel Protein Annotations (Section 3.6 & Figure 6):
A practical application of interpretable SAE features is to fill gaps in existing biological databases.
- Nudix Box Motif: One feature consistently activates on specific amino acids within a conserved Nudix box motif, even in proteins (e.g., B2GFH1) where this motif is not yet annotated in Swiss-Prot but is confirmed by other external databases.
- Peptidase Domain: Similarly, another feature identifies peptidase domains across multiple amino acids, revealing missing annotations for this domain in certain proteins.
- UDP-GlcNAc and Mg2+ Binding Sites: A third feature suggests previously unannotated binding sites for UDP-N-acetyl-alpha-D-glucosamine (UDP-GlcNAc) and Mg2+ within bacterial glycosyltransferases.

These examples demonstrate that PLMs learn beyond curated annotations and that SAE features can serve as powerful tools for biological discovery and database curation.

The following are results from [Figure 6] of the original paper:
Figure 6: Feature activation patterns can be used to identify missing and new protein annotations. (a) f/939 identifies a missing motif annotation for the Nudix box: it activates on single amino acids in conserved positions, and the left protein (B2GFH1), which has no Nudix motif annotation in Swiss-Prot, still shows the highlighted activation, with the presence of the motif confirmed by an external database. (b) A feature identifies a missing domain annotation for a peptidase: the left protein, which has no peptidase domain annotation in Swiss-Prot, shows the activation highlighted in pink. (c) f/9046 suggests missing binding site annotations for UDP-N-acetyl-alpha-D-glucosamine (UDP-GlcNAc) and Mg2+ within bacterial glycosyltransferases; higher activation is indicated with darker pink, and implied binding site annotations are labeled in green.
This figure visually presents three examples of how feature activations highlight missing annotations: a Nudix box motif activated in protein B2GFH1, which lacks the Nudix annotation in Swiss-Prot; an unannotated peptidase domain; and implied binding sites for UDP-GlcNAc and Mg2+ in bacterial glycosyltransferases that are not explicitly labeled in Swiss-Prot.
5. Protein Sequence Generation Can be Steered by Activating Interpretable Features (Section 3.7 & Figure 7, Figure 10, Figure 11):
The interpretable SAE features enable targeted control over PLM-based sequence generation.
- Targeted Steering: By selectively amplifying or suppressing specific SAE features during sequence generation, the authors demonstrate that PLMs can be steered to favor or disfavor particular amino acids at masked positions.
- Glycine Repeats Example: The paper illustrates this with glycine-related features that activate on periodic glycine repeats (e.g., GxGX) common in collagen-like regions. Steering these features influences the probability of generating glycine at masked positions within these repeats. For instance, steering a periodic glycine feature can increase the probability of glycine at subsequent positions, demonstrating a propagating effect (Figure 7a, Figure 10).
- This application opens up exciting possibilities for controllable protein design, allowing researchers to guide PLMs towards generating sequences with desired biological properties.

The following are results from [Figure 10] of the original paper:
Figure 7: Steering feature activation on a single amino acid additionally influences protein generation for nearby glycines in periodic GxGX repeats within collagen-like regions. Only one unmasked glycine is steered, and the impact is measured on the predicted probability of glycine at masked positions; the steering amount corresponds to multiples of the feature's maximum observed activation value. Three features that activate on periodic glycine repeats were tested (f/6581, f/781, f/5381). Because many biological concepts lack clear sequence-based validation, the authors focused on this simple, measurable example: steering one glycine increases the predicted probability of glycine at subsequent periodic positions, with the effect propagating with diminishing intensity.
This figure demonstrates the effect of steering specific glycine features on the probability of glycine at subsequent masked positions. The top panel (a) shows that steering a periodic glycine feature at a specific glycine in a GxGX pattern significantly increases the probability of glycine at subsequent masked positions within the periodic repeat. The bottom panel (b) shows that steering a non-periodic glycine feature has a less specific or even negative impact on other glycine positions. This highlights the fine-grained control offered by interpretable features.
6.2. Data Presentation (Tables)
The following are the results from [Table 1] of the original paper:
| Layer | Learning Rate | L1 | % Loss Recovered |
| L1 | 1.0e-7 | 0.1 | 99.60921 |
| L2 | 1.0e-7 | 0.08 | 99.31827 |
| L3 | 1.0e-7 | 0.1 | 99.02078 |
| L4 | 1.0e-7 | 0.1 | 98.39785 |
| L5 | 1.0e-7 | 0.1 | 99.32478 |
| L6 | 1.0e-7 | 0.09 | 100 |
Table 1: Layer-wise learning parameters and SAE performance metrics
This table shows the hyperparameters (Learning Rate, L1 penalty) chosen for the best-performing SAE at each ESM-2 layer, along with their reconstruction quality measured by % Loss Recovered. The high % Loss Recovered values (all above 98%) indicate that the SAEs are highly effective at reconstructing the original ESM-2 embeddings with minimal information loss, suggesting that the latent features preserve the critical information content of the PLM.
The following are the results from [Table C.2.1] of the original paper:
| Field Name | Full Name | Description | Quant. | LLM |
| accession | Accession Number | Unique identifier for the protein entry in UniProt | N | Y |
| id | UniProt ID | Short mnemonic name for the protein | N | Y |
| protein_name | Protein Name | Full recommended name of the protein | N | Y |
| gene_names | Gene Names | Names of the genes encoding the protein | N | Y |
| sequence | Protein Sequence | Complete amino acid sequence of the protein | N | Y |
| organism_name | Organism | Scientific name of the organism the protein is from | N | Y |
| length | Sequence Length | Total number of amino acids in the protein | N | Y |
Table C.2.1: Swiss-Prot Metadata Categories - Basic Identification Fields
This table lists basic identification fields from Swiss-Prot and indicates whether they were used for quantitative evaluation (Quant.) or provided to the LLM for feature description generation (LLM). For these fields, only the LLM used them, as they are not directly position-specific concepts for quantitative F1 score calculation.
The following are the results from [Table C.2.2] of the original paper:
| Field Name | Full Name | Description | Quant. | LLM |
| ft_act_site | Active Sites | Specific amino acids directly involved in the protein's chemical reaction | Y | Y |
| ft_binding | Binding Sites | Regions where the protein interacts with other molecules | Y | Y |
| ft_disulfid | Disulfide Bonds | Covalent bonds between sulfur atoms that stabilize protein structure | Y | Y |
| ft_helix | Helical Regions | Areas where protein forms alpha-helical structures | Y | Y |
| ft_turn | Turns | Regions where protein chain changes direction | Y | Y |
| ft_strand | Beta Strands | Regions forming sheet-like structural elements | Y | Y |
| ft_coiled | Coiled Coil Regions | Areas where multiple helices intertwine | Y | Y |
| ft_non_std | Non-standard Residues | Non-standard amino acids in the protein | N | |
| ft_transmem | Transmembrane Regions | Regions that span cellular membranes | N | Y |
| ft_intramem | Intramembrane Regions | Regions located within membranes | N | Y |
Table C.2.2: Swiss-Prot Metadata Categories - Structural Features
This table lists various structural features from Swiss-Prot. Most of these (active sites, binding sites, disulfide bonds, helical regions, turns, beta strands, coiled coil regions) were used for quantitative evaluation (Quant.: Y) and provided to the LLM (LLM: Y). Some, like transmembrane and intramembrane regions, were used by the LLM but not directly in the quantitative F1 score calculation (Quant.: N).
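To illustrate how a position-level concept such as ft_helix can be scored quantitatively against a feature, here is a rough sketch of a per-residue F1 computation. The thresholding scheme and per-protein aggregation are assumptions; the paper's exact procedure may differ.

```python
# Hedged sketch: score one SAE feature against one position-level Swiss-Prot
# concept by thresholding its per-residue activations.
import numpy as np

def feature_concept_f1(activations, concept_mask, threshold):
    """activations: per-residue feature activations (1D array).
    concept_mask: 1 where the residue carries the annotation, else 0."""
    pred = activations > threshold
    tp = np.sum(pred & (concept_mask == 1))
    fp = np.sum(pred & (concept_mask == 0))
    fn = np.sum(~pred & (concept_mask == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# In practice one would sweep thresholds (e.g. fractions of the feature's
# maximum activation) and keep the best F1 for each (feature, concept) pair.
```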
The following are the results from [Table C.2.3] of the original paper:
| Field Name | Full Name | Description | Quant. | LLM |
| ft_carbohyd | Carbohydrate Modifications | Locations where sugar groups are attached to the protein | Y | Y |
| ft_lipid | Lipid Modifications | Sites where lipid molecules are attached to the protein | Y | Y |
| ft_mod_res | Modified Residues | Amino acids that undergo post-translational modifications | Y | Y |
| cc_cofactor | Cofactor Information | Non-protein molecules required for protein function | N | Y |
Table C.2.3: Swiss-Prot Metadata Categories - Modifications and Chemical Features
This table details modifications and chemical features. Carbohydrate, lipid, and modified residues were used for both quantitative evaluation and LLM input. Cofactor Information was used by the LLM but not for direct quantitative comparison.
The following are the results from [Table C.2.4] of the original paper:
| Field Name | Full Name | Description | Quant. | LLM |
| ft_signal | Signal Peptide | Sequence that directs protein trafficking in the cell | Y | Y |
| ft_transit | Transit Peptide | Sequence guiding proteins to specific cellular compartments | Y | Y |
Table C.2.4: Swiss-Prot Metadata Categories - Targeting and Localization
This table lists targeting and localization signals. Both signal peptide and transit peptide were used for quantitative evaluation and LLM input, as they are sequence-based concepts.
The following are the results from [Table C.2.5] of the original paper:
| Field Name | Full Name | Description | Quant. | LLM |
| ft_compbias | Compositionally Biased Regions | Sequences with unusual amino acid distributions | Y | Y |
| ft_domain | Protein Domains | Distinct functional or structural protein units | Y | Y |
| ft_motif | Short Motifs | Small functionally important amino acid patterns | Y | Y |
| ft_region | Regions of Interest | Areas with specific biological significance | Y | Y |
| ft_zn_fing | Zinc Finger Regions | DNA-binding structural motifs containing zinc | Y | Y |
| ft_dna_bind | DNA Binding Regions | Regions that interact with DNA | N | Y |
| ft_repeat | Repeated Regions | Repeated sequence motifs within the protein | N | Y |
| cc_domain | Domain Commentary | General information about functional protein units | N | Y |
Table C.2.5: Swiss-Prot Metadata Categories - Functional Domains and Regions
This table covers functional domains and regions. Most were used for both quantitative evaluation and LLM input. DNA Binding Regions, Repeated Regions, and Domain Commentary were used by the LLM but not directly in the quantitative F1 score calculation.
The following are the results from [Table C.2.6] of the original paper:
| Field Name | Full Name | Description | Quant. | LLM |
| cc_catalytic_activity | Catalytic Activity | Description of the chemical reaction(s) performed by the protein | N | Y |
| ec | Enzyme Commission Number | Enzyme Commission number for categorizing enzyme-catalyzed reactions | N | Y |
| cc_activity_regulation | Activity Regulation | Information about how the protein's activity is controlled | N | Y |
| cc_function | Function | General description of the protein's biological role | N | Y |
| protein_families | Protein Families | Classification of the protein into functional/evolutionary groups | N | Y |
| go_f | Gene Ontology Function | Gene Ontology terms describing molecular functions | N | Y |
Table C.2.6: Swiss-Prot Metadata Categories - Functional Annotations
This table presents functional annotations. All listed categories (catalytic activity, EC number, activity regulation, function, protein families, Gene Ontology function) were used by the LLM for feature description generation but not directly for quantitative evaluation as position-specific concepts.
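As a hedged illustration of how these LLM-only fields might feed into feature description generation, the sketch below assembles a prompt from the metadata of a feature's maximally activating proteins. The field names follow the tables above, but the prompt structure and the `examples` format are assumptions rather than the paper's actual implementation.

```python
# Hypothetical prompt assembly for LLM-based feature interpretation.
def build_feature_prompt(feature_idx, examples):
    """examples: list of dicts with Swiss-Prot fields plus per-residue info,
    e.g. {"id": ..., "protein_name": ..., "cc_function": ...,
          "activated_residues": [...]}  (hypothetical format)."""
    lines = [f"Feature {feature_idx} activates strongly on the following proteins:"]
    for ex in examples:
        lines.append(
            f"- {ex['id']} ({ex['protein_name']}); function: {ex['cc_function']}; "
            f"highly activated residues: {ex['activated_residues']}"
        )
    lines.append("In one sentence, describe what this feature appears to detect.")
    return "\n".join(lines)
```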
The following are the results from [Table D.2] of the original paper:
| Feature | Pearson r | Feature Summary |
| 4360 | 0.75 | The feature activates on interchain disulfide bonds and surrounding hydrophobic residues in serine proteases, particularly those involved in venom and blood coagulation pathways. |
| 9390 | 0.98 | The feature activates on the conserved Nudix box motif of Nudix hydrolase enzymes, particularly detecting the metal ion binding residues that are essential for their nucleotide pyrophosphatase activity. |
| 3147 | 0.70 | The feature activates on conserved leucine and cysteine residues that occur in leucine-rich repeat domains and metal-binding structural motifs, particularly those involved in protein-protein interactions and signaling. |
| 4616 | 0.76 | The feature activates on conserved glycine residues in structured regions, with highest sensitivity to the characteristic glycine-containing repeats of collagens and GTP-binding motifs. |
| 8704 | 0.75 | The feature activates on conserved catalytic motifs in protein kinase active sites, particularly detecting the proton acceptor residues and surrounding amino acids involved in phosphotransfer reactions. |
| 9047 | 0.80 | The feature activates on conserved glycine/alanine/proline residues within the nucleotide-sugar binding domains of glycosyltransferases, particularly at positions known to interact with the sugar-nucleotide donor substrate. |
| 10091 | 0.83 | The feature activates on conserved hydrophobic residues (particularly V/I/L) within the catalytic regions of N-acetyltransferase domains, likely detecting a key structural or functional motif involved in substrate binding or catalysis. |
| 1503 | 0.73 | The feature activates on extracellular substrate binding loops of TonB-dependent outer membrane transporters, particularly those involved in nutrient uptake. |
| 2469 | 0.85 | The feature activates on conserved structural and sequence elements in bacterial outer membrane beta-barrel proteins, particularly around substrate binding and ion coordination sites in porins and TonB-dependent receptors. |
Table D.2: Example feature summaries and the Pearson correlation coefficients obtained when more verbose descriptions are used to predict maximum activation levels.
This table provides examples of LLM-generated feature summaries along with their Pearson correlation coefficients. The high Pearson r values (ranging from 0.70 to 0.98) indicate that the LLM's descriptions are remarkably accurate at predicting how the SAE features will activate on new proteins. This validates the LLM-based interpretation pipeline and its utility in providing meaningful, human-readable explanations for complex latent features. The summaries cover a diverse range of biological concepts, from disulfide bonds and Nudix box motifs to glycosyltransferase binding domains and beta-barrel proteins.
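The exact protocol behind these correlations is not shown in this excerpt. Assuming the LLM is asked to predict a maximum activation level for each held-out protein from the verbose feature description, the Pearson r could be computed as in the sketch below (names and scaling are illustrative).

```python
# Hedged sketch: correlate LLM-predicted maximum activation levels with the
# observed per-protein maxima for one feature on held-out proteins.
from scipy.stats import pearsonr

def description_fidelity(predicted_max_levels, observed_max_activations):
    """Both arguments: equal-length sequences, one value per held-out protein."""
    r, p_value = pearsonr(predicted_max_levels, observed_max_activations)
    return r

# Example with made-up numbers:
# description_fidelity([0.0, 0.2, 0.9, 0.8], [0.01, 0.15, 0.95, 0.70])
```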
6.3. Ablation Studies / Parameter Analysis
The paper implicitly conducts an ablation study by comparing SAE features derived from the trained ESM-2 model against SAE features derived from an ESM-2 model with randomized weights.
- Comparison with Randomized Weights: As shown in Figure 3b and discussed in Section 3.3, SAEs trained on embeddings from a randomized ESM-2 model (green bars) extract significantly fewer biologically interpretable features than those trained on the properly trained ESM-2 (pink bars). While the randomized models still extract features associated with individual amino acid types (Appendix Figure 9b), these largely fail to correspond to complex biological concepts like binding sites or domains (a minimal sketch of this control appears after this list).
- Implication: This comparison serves as a crucial control, demonstrating that the interpretability observed in InterPLM is not an artifact of the SAE method itself or of general protein sequence statistics, but rather a direct consequence of the meaningful biological knowledge encoded within the pre-trained ESM-2 PLM. It verifies that the SAEs are indeed recovering features learned by the PLM, rather than just finding patterns in raw sequences.
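For reference, one way to construct such a randomized-weights control is sketched below. The normal re-initialization scheme is an assumption; the downstream embedding extraction and SAE training would reuse the same pipeline as for the trained model.

```python
# Hedged sketch: build a randomized-weights copy of the PLM for the control
# comparison. The normal(0, 0.02) re-initialization is an assumption; any
# scheme that discards the learned weights while keeping the architecture works.
import copy
import torch

def randomize_weights(model: torch.nn.Module, std: float = 0.02) -> torch.nn.Module:
    shuffled = copy.deepcopy(model)
    with torch.no_grad():
        for p in shuffled.parameters():
            torch.nn.init.normal_(p, mean=0.0, std=std)
    return shuffled

# Embeddings from randomize_weights(esm_model) are then fed through the same
# SAE training and Swiss-Prot evaluation pipeline as the trained model's.
```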
7. Conclusion & Reflections
7.1. Conclusion Summary
The InterPLM framework represents a significant advancement in Protein Language Model (PLM) interpretability. By employing sparse autoencoders (SAEs), the authors successfully disentangled the polysemantic representations within ESM-2, revealing a rich lexicon of up to 2,548 human-interpretable latent features per layer. These features strongly correlate with a wide array of known biological concepts (binding sites, structural motifs, functional domains) and significantly outperform individual neurons in conceptual clarity and quantity (2,548 features vs. 46 neurons per layer). Beyond known annotations, InterPLM facilitated the discovery of novel, coherent biological concepts learned by ESM-2, which were then automatically interpreted using a Large Language Model (LLM) pipeline. Practical applications include identifying missing protein annotations in databases and enabling targeted steering of protein sequence generation, showcasing the utility of these interpretable features for both scientific discovery and protein engineering. InterPLM provides a systematic, quantitative, and extensible framework for unlocking the biological knowledge embedded in PLMs, complemented by open-source tools and an interactive visualization platform.
7.2. Limitations & Future Work
The authors acknowledge several limitations and propose future research directions:
- Scaling to Structure Prediction Models: A crucial next step is to apply this SAE approach to structure-prediction models like ESMFold or AlphaFold. Understanding how latent features evolve with increasing model size and structural capability could enable more granular control over generated protein conformations through targeted steering.
- Robust Steering and Validation: While sequence steering was demonstrated, applying it to more complex biological patterns and ensuring robust, validated outcomes remains challenging. More sophisticated methods and clearer success metrics are needed.
- Interpretation of Masked Token Embeddings and the CLS Token: The current analysis focuses on embeddings of unmasked amino acids. Further investigation is needed to understand the information encoded in masked token embeddings and in the CLS token, which is often used as a protein-level representation.
- Combining Features: The paper's framework focuses on individual latent features. Future work could explore how these features combine into higher-level biological circuits or computational mechanisms within the PLM, which would require more advanced activation analysis and external validation.
- SAE Training Methods: Continued advancements in SAE training methods could further improve feature interpretability.
- Additional Annotations: Incorporating more diverse biological annotations and structural data could enhance the evaluation and interpretation pipeline.
7.3. Personal Insights & Critique
This paper offers a highly impactful contribution to the growing field of mechanistic interpretability for biological AI.
Innovations and Strengths:
- Solving Polysemanticity: The most significant innovation is the successful application of SAEs to effectively de-superpose PLM representations. This moves beyond the limitations of neuron-level analysis and provides a much finer-grained and more accurate view of what PLMs learn. The quantitative evidence (thousands of features vs. dozens of neurons) is highly compelling.
- Automated Discovery of Novel Biology: The LLM-powered interpretation pipeline is a powerful and elegant solution for interpreting features that do not map to existing databases. This not only enhances interpretability but also turns the PLM into a potential biological discovery engine that can suggest new motifs or functions, a critical step toward AI-driven hypothesis generation.
- Bridging Interpretability and Application: The demonstration of missing-annotation filling and sequence steering directly links interpretability to practical, high-value applications. This shows that understanding PLM internals is not just an academic exercise but can directly accelerate biological research and protein engineering.
- Community Resources: The release of InterPLM.ai and the code is commendable. It lowers the barrier for other researchers to explore and build upon this work, which is crucial for advancing the field.
Potential Issues and Areas for Improvement:
- Complexity of Interpretation for Novices: While SAEs make features more interpretable than neurons, understanding what a dictionary vector or an activation pattern truly signifies still requires significant domain expertise. The LLM pipeline helps, but the full depth of a feature may still be challenging for a beginner, and since the LLM is itself a black box, its generated descriptions must also be interpreted with care.
- Computational Cost: Training SAEs (20 per layer) and running the LLM interpretation pipeline can be computationally intensive, especially for larger PLMs and deeper layers. Scaling this to AlphaFold-class models, as suggested for future work, will be a substantial challenge.
- Validation of Novel Concepts: While LLMs can describe novel features, experimentally validating these new biological motifs or missing annotations is the ultimate bottleneck. The paper provides examples but highlights the need for further experimental effort. This is a general challenge for AI-driven discovery, where AI generates hypotheses faster than experiments can confirm them.
- Feature Redundancy/Overlap: Despite the sparsity constraint, some SAE features may still capture highly similar or overlapping biological concepts, especially given the hierarchical nature of protein biology. A more granular analysis of feature relationships within clusters could reveal this.
- Causality in Steering: While the steering experiments demonstrate control, the precise causal mechanism (i.e., why amplifying a feature leads to a specific sequence change) remains implicit. Further work could probe the causal circuits within the PLM that link a steered feature to the final output.
Transferability and Future Value:
The methods developed in InterPLM are highly transferable. The SAE approach could be applied to other biologically relevant deep learning models (e.g., those for RNA, DNA, or even cellular imaging) to extract interpretable latent features. The framework for quantitative biological validation and LLM-powered interpretation is also broadly applicable across scientific AI domains where models learn complex, unannotated patterns. This paper lays critical groundwork for mechanistic interpretability to become a standard tool in AI-driven biological discovery, moving us closer to truly understanding, rather than just using, our powerful AI models.