Knowledge Circuits in Pretrained Transformers
TL;DR Summary
This paper explores knowledge encoding in large language models, introducing the concept of 'Knowledge Circuits' to reveal critical subgraphs in their computation graphs. Experiments with GPT-2 and TinyLLAMA showcase how information heads, relation heads, and MLPs collaboratively encode knowledge, and the circuits are further used to analyze knowledge editing, hallucinations, and in-context learning.
Abstract
The remarkable capabilities of modern large language models are rooted in their vast repositories of knowledge encoded within their parameters, enabling them to perceive the world and engage in reasoning. The inner workings of how these models store knowledge have long been a subject of intense interest and investigation among researchers. To date, most studies have concentrated on isolated components within these models, such as the Multilayer Perceptrons and attention head. In this paper, we delve into the computation graph of the language model to uncover the knowledge circuits that are instrumental in articulating specific knowledge. The experiments, conducted with GPT2 and TinyLLAMA, have allowed us to observe how certain information heads, relation heads, and Multilayer Perceptrons collaboratively encode knowledge within the model. Moreover, we evaluate the impact of current knowledge editing techniques on these knowledge circuits, providing deeper insights into the functioning and constraints of these editing methodologies. Finally, we utilize knowledge circuits to analyze and interpret language model behaviors such as hallucinations and in-context learning. We believe the knowledge circuits hold potential for advancing our understanding of Transformers and guiding the improved design of knowledge editing. Code and data are available in https://github.com/zjunlp/KnowledgeCircuits.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Knowledge Circuits in Pretrained Transformers
1.2. Authors
Yunzhi Yao, Ningyu Zhang, Zekun Xi, Mengru Wang, Ziwen Xu, Shumin Deng, Huajun Chen. Affiliations: Zhejiang University (Yunzhi Yao, Ningyu Zhang, Zekun Xi, Mengru Wang, Ziwen Xu, Huajun Chen), National University of Singapore, NUS-NCS Joint Lab, Singapore (Shumin Deng).
1.3. Journal/Conference
Published on arXiv, a preprint server, under the identifier abs/2405.17969v4. arXiv is a reputable platform for sharing preprints in physics, mathematics, computer science, and other fields, allowing for rapid dissemination of research before formal peer review and publication in journals or conferences.
1.4. Publication Year
2024
1.5. Abstract
This paper investigates how modern large language models (LLMs) store and utilize knowledge, moving beyond studies focusing on isolated components like Multilayer Perceptrons (MLPs) and attention heads. The authors introduce the concept of "Knowledge Circuits" – critical subgraphs within the language model's computation graph that articulate specific knowledge. Through experiments on GPT-2 and TinyLLAMA, they observe how information heads, relation heads, and MLPs collaboratively encode knowledge. Furthermore, the paper evaluates the impact of existing knowledge editing techniques on these circuits, offering insights into their functionality and limitations. Finally, knowledge circuits are employed to analyze LLM behaviors such as hallucinations and in-context learning. The authors believe that understanding knowledge circuits can advance the comprehension of Transformers and guide improved knowledge editing designs.
1.6. Original Source Link
The paper is available as a preprint on arXiv: https://arxiv.org/abs/2405.17969v4. The PDF link is https://arxiv.org/pdf/2405.17969v4.pdf.
2. Executive Summary
2.1. Background & Motivation
Modern large language models (LLMs) possess remarkable capabilities, largely attributed to the vast knowledge encoded within their parameters. However, the precise mechanisms by which LLMs store and utilize this knowledge remain enigmatic, leading to issues like hallucinations, biases, and unsafe behaviors. Previous research has primarily focused on isolated components, such as Multilayer Perceptrons (MLPs) being identified as 'key-value memories' or 'knowledge neurons', and attention heads. While these studies have advanced understanding, they often treat knowledge blocks as isolated and have shown limitations in generalizing editing techniques or fully explaining complex LLM behaviors. There's a gap in understanding how different components cooperate to articulate knowledge.
The core problem the paper aims to solve is to move beyond isolated component analysis and explore the collaborative information flow within the Transformer's computation graph to understand how knowledge is stored and expressed. This is important because a deeper understanding of these internal mechanisms can lead to more robust, reliable, and interpretable LLMs, and better-designed knowledge editing techniques.
The paper's innovative idea is to introduce "Knowledge Circuits"—human-interpretable subgraphs within the LLM's computation graph that are specifically responsible for articulating particular pieces of knowledge. Unlike previous work that focused on identifying where knowledge is stored, this paper investigates how it flows and is processed by the cooperation of various components (attention heads, MLPs, embeddings) to answer questions or make predictions.
2.2. Main Contributions / Findings
The paper makes several primary contributions:
- Introduction of Knowledge Circuits: Proposes and constructs knowledge circuits as critical subgraphs in the Transformer's computation graph to analyze the cooperative mechanisms of different components (information heads, relation heads, MLPs) in encoding and expressing knowledge. This offers a holistic view beyond isolated components.
- Unveiling Implicit Neural Knowledge Representations: Demonstrates that these discovered knowledge circuits can independently recall related knowledge with a significant portion of the full model's performance, indicating their effectiveness as neural representations. It shows that knowledge tends to aggregate in earlier to middle layers and is enhanced in later layers. It identifies special components like mover heads and relation heads crucial for information transfer and relational context capture.
- Elucidating Internal Mechanisms for Knowledge Editing: Evaluates the impact of knowledge editing methods (ROME, FT-M) on knowledge circuits. It reveals that ROME tends to incorporate edited information at the edited layer, which is then transported by mover heads. FT-M, conversely, directly integrates edited knowledge, dominating subsequent predictions but risking overfitting and influencing unrelated knowledge. This provides a mechanistic explanation for editing successes and failures.
- Facilitating Interpretation of LLM Behaviors: Utilizes knowledge circuits to interpret complex LLM behaviors:
  - Hallucinations: Observes that hallucinations occur when the model fails to correctly transfer knowledge to the final token, often due to an ineffective mover head or selection of incorrect information in earlier layers.
  - In-Context Learning (ICL): Shows that ICL involves the emergence of new attention heads in the knowledge circuit that focus on demonstration context, acting as induction heads to leverage examples for correct responses.
- Advancing Understanding and Guiding Design: The authors believe that understanding knowledge circuits holds potential for advancing the mechanistic interpretability of Transformers and guiding the improved design of knowledge editing techniques to enhance factuality and alleviate hallucinations.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
- Transformer Architecture: The core neural network architecture used in modern LLMs. It consists of an encoder and a decoder (or decoder-only for generative LLMs). Key components include:
  - Token Embeddings: Numerical representations of input words or subwords.
  - Positional Encodings: Added to embeddings to provide information about the token's position in the sequence.
  - Multi-Head Attention: Allows the model to weigh the importance of different parts of the input sequence when processing each token. It computes Query (Q), Key (K), and Value (V) vectors for each token: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Here, Q, K, and V are matrices of query, key, and value vectors respectively, and $d_k$ is the dimension of the key vectors, used for scaling. The softmax function normalizes the attention scores. Multi-head attention performs this operation multiple times in parallel with different linear projections (a minimal sketch follows this list).
  - Feed-Forward Networks (FFNs) / Multilayer Perceptrons (MLPs): Position-wise fully connected networks applied to each position independently. They typically consist of two linear transformations with a ReLU activation in between. Previous work has identified MLPs as potential 'key-value memories' for factual knowledge.
  - Residual Connections: Add the input of a sub-layer to its output, helping to mitigate the vanishing gradient problem and allowing deeper networks.
  - Layer Normalization: Normalizes the activations across the features for each sample independently, improving training stability and performance.
  - Residual Stream: The cumulative output of all previous attention heads, MLPs, and input embeddings, through which information flows in a Transformer. It is considered a valuable tool for mechanistic interpretability.
- Mechanistic Interpretability: A field of research focused on understanding the internal mechanisms of neural networks. It aims to reverse-engineer trained models to figure out how they perform specific tasks, often by identifying and understanding circuits.
- Circuit Theory: Within mechanistic interpretability, a circuit is defined as a human-interpretable subgraph of a neural network's computation graph that is responsible for a specific behavior or functionality. The nodes in the graph represent components like attention heads, MLPs, and embeddings, and edges represent their interactions.
- Ablation Studies: A technique used in interpretability to understand the importance of specific components or connections by systematically removing them (e.g., setting their output to zero or a mean value) and observing the impact on the model's performance. Zero ablation (setting to zero) and mean ablation (setting to mean activation) are common types.
- Logit Lens: An interpretability technique that involves taking the intermediate activations from various layers of a Transformer and mapping them directly to the vocabulary space using the unembedding matrix. This allows researchers to see what tokens are "predicted" at each intermediate layer, providing insight into the information flow.
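To make the attention formula above concrete, here is a minimal single-head sketch in PyTorch (an illustrative example added for this analysis, not code from the paper):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V for a single head."""
    d_k = K.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # [seq_q, seq_k] similarity scores
    weights = F.softmax(scores, dim=-1)             # normalize over key positions
    return weights @ V                              # weighted sum of value vectors

# Toy example: a sequence of 4 tokens with an 8-dimensional head.
Q, K, V = (torch.randn(4, 8) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # torch.Size([4, 8])
```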
3.2. Previous Works
The paper builds upon a rich history of research into Transformer interpretability, particularly focusing on how knowledge is stored and processed:
- Knowledge in MLPs:
  - Geva et al. [12, 14, 15] proposed that Transformer Feed-Forward Layers (MLPs) act as key-value memories, storing factual knowledge. Knowledge neurons (KNs) within MLPs are hypothesized to encode specific facts.
  - Meng et al. [18] leveraged this understanding to propose ROME (Rank-One Model Editing), a method to modify specific factual associations by directly editing MLP weights.
  - Niu et al. [23] and Zhang et al. [24] challenged the knowledge neuron thesis as an oversimplification, suggesting that knowledge might be distributed more broadly or that different types of knowledge might reside in the same areas.
- Knowledge in Attention Heads:
  - Li et al. [58] and Wu et al. [81] highlighted the role of attention heads in retrieving information. Wu et al. identified retrieval heads for long-context factuality.
  - Jin et al. [82] distinguished between memory heads (internal memory) and context heads (external context).
  - Todd et al. [51, 83] introduced function vectors in attention heads, responsible for in-context learning.
- Knowledge in Hybrid Components / Information Flow:
  - Geva et al. [13] described factual recall as a three-step process: subject enrichment (in MLPs), relation propagation (to the END token), and attribute extraction (by attention heads in later layers).
  - Lv et al. [31] suggested that task-specific attention heads move topic entities to the final position in the residual stream, while MLPs perform relation functions.
- Circuit Discovery:
  - Olah et al. [25, 30] pioneered the concept of circuits as human-interpretable subgraphs.
  - Conmy et al. [32] developed Automated Circuit Discovery (ACDC) as a method to automatically identify critical circuits.
  - Wang et al. [26] and Merullo et al. [27] identified circuits for specific tasks like the Indirect Object Identification (IOI) and Color Object (CO) tasks, noting component reuse across similar tasks.
  - Dutta et al. [84] constructed circuits for chain-of-thought reasoning, observing distinct roles for early-layer attention heads (moving information along ontological relations) and later-layer components (writing answers).
3.3. Technological Evolution
The field of Transformer interpretability has evolved from initial black-box analyses to component-level investigations (e.g., individual neurons, attention heads, or MLPs), and now towards understanding the interactions and flows between these components. Early works focused on identifying "where" knowledge is, while more recent work, including this paper, emphasizes "how" knowledge is processed and articulated through collaborative circuits. This paper's contribution lies in extending circuit theory from simple task-specific circuits (like IOI) to more complex knowledge recall tasks, integrating both MLP and attention components across layers, and using these circuits to explain higher-level LLM behaviors (editing, hallucination, ICL). The move towards sparse auto-encoders [36, 37] also represents an evolution in circuit discovery methods, aiming for more interpretable feature representations.
3.4. Differentiation Analysis
Compared to previous work, this paper differentiates itself in several key ways:
- Holistic View of Knowledge Storage: Instead of treating MLPs or attention heads as isolated knowledge repositories (as in many knowledge neuron studies), this paper explicitly focuses on knowledge circuits that involve the cooperation of information heads, relation heads, and MLPs across multiple layers. This provides a more integrated understanding of how knowledge is processed and articulated.
- Focus on Factual Recall Tasks: While circuit theory has been applied to tasks like Indirect Object Identification or Color Object identification, this paper extends it to factual recall tasks, which require the model to retrieve and utilize stored knowledge rather than simply copying tokens from the context. This necessitates a different kind of circuit discovery that taps into the model's parametric knowledge.
- Application to LLM Behaviors: The paper moves beyond just identifying circuits to utilize them for interpreting complex LLM behaviors. Specifically, it uses knowledge circuits to analyze the mechanisms behind knowledge editing, hallucinations, and in-context learning, offering mechanistic explanations for why these phenomena occur or why editing methods succeed or fail.
- Empirical Validation across Models: Experiments are conducted on both GPT-2 (medium and large) and TinyLLAMA, allowing for observations across different model sizes and architectures (e.g., TinyLLAMA's Grouped Query Attention Mechanism), providing broader validation than studies focused on a single model.
- Detailed Analysis of Information Flow: The paper provides a detailed, layer-by-layer analysis of how target entities' ranks and probabilities change in the residual stream, identifying phenomena like early decoding and the specific roles of mover heads and relation heads in this process.
4. Methodology
4.1. Principles
The core idea behind discovering knowledge circuits is to identify a minimal but critical subgraph of the Transformer's computation graph that is essential for articulating a specific piece of knowledge. This is based on the principle of causal mediation: if removing a component or connection significantly degrades the model's ability to perform a task, that component/connection is considered critical to the circuit. Unlike previous work that focused on isolated components, this paper views the Transformer as a graph where various nodes (input embedding, attention heads, MLPs, output) interact via edges, and aims to find the collaborative pathways. The theoretical basis is that specific knowledge recall in LLMs is not a monolithic operation but rather emerges from a structured interplay of multiple internal components.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Transformer as a Computation Graph
The Transformer decoder architecture is conceptualized as a directed acyclic graph (DAG) $\mathcal{G} = (N, E)$.
- Nodes ($N$): Represent individual computational components:
  - Input embedding ($I$)
  - Attention heads ($A_{l,j}$, where $l$ is the layer and $j$ is the head index)
  - MLPs ($M_l$, for layer $l$)
  - Output node ($O$)
  - So, $N = \{I, A_{l,j}, M_l, O\}$.
- Edges ($E$): Represent the data flow and dependencies between these nodes, such as residual connections, attention mechanisms, and projections.

The residual stream ($R_l$) at layer $l$ is defined by the following equations, showing how the input embedding and the outputs of attention heads and MLPs from previous layers cumulatively contribute: $ R_0 = I, \quad R_l^{\mathrm{attn}} = R_{l-1} = I + \sum_{l' < l}\Big(M_{l'} + \sum_{j} A_{l',j}\Big), \quad R_l^{\mathrm{mlp}} = R_l^{\mathrm{attn}} + \sum_{j} A_{l,j}, \quad R_l = R_l^{\mathrm{mlp}} + M_l $ Where:
- $R_l$: The residual stream at layer $l$.
- $R_0 = I$: The initial residual stream is the input embedding.
- $A_{l,j}$: The output of the $j$-th attention head in layer $l$.
- $M_l$: The output of the MLP in layer $l$.
- $R_l^{\mathrm{attn}}$: The input to the attention layer in layer $l$. It is the sum of the input embedding and all MLP outputs ($M_{l'}$) and attention head outputs ($A_{l',j}$) from layers before $l$.
- $R_l^{\mathrm{mlp}}$: The input to the MLP layer in layer $l$. It is the sum of $R_l^{\mathrm{attn}}$ and the attention head outputs ($A_{l,j}$) of layer $l$ itself.

A circuit is then a subgraph of $\mathcal{G}$, defined as $\mathcal{C} = (N_{\mathcal{C}}, E_{\mathcal{C}})$, containing a selection of nodes and edges critical for specific behaviors.
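To make this residual-stream bookkeeping concrete, the sketch below (my illustration using the TransformerLens library mentioned in the experimental setup; the hook names are TransformerLens conventions, not the paper's notation) checks that each layer's output stream equals its input stream plus the attention and MLP contributions in GPT-2:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
_, cache = model.run_with_cache("The official language of France is")

layer = 5
resid_pre  = cache["resid_pre", layer]    # stream entering layer l
attn_out   = cache["attn_out", layer]     # sum of all head outputs A_{l,j}
mlp_out    = cache["mlp_out", layer]      # MLP output M_l
resid_post = cache["resid_post", layer]   # stream leaving layer l

# The residual stream accumulates the layer's attention and MLP outputs.
print(torch.allclose(resid_pre + attn_out + mlp_out, resid_post, atol=1e-4))  # True
```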
4.2.2. Knowledge Circuits Construction
The goal is to find the circuit critical for predicting a target entity $o$ given a subject-relation pair (s, r), typically presented as a natural language prompt (e.g., "The official language of France is ___").
- Causal Mediation and Ablation: The method leverages causal mediation by systematically ablating (removing) edges in the computation graph and measuring the impact on the model's performance.
- Edge Ablation: For each edge $e = (n_i, n_j)$ connecting node $n_i$ to $n_j$, its contribution is removed by replacing it with a corrupted value. The paper states zero ablation is used, meaning the output of $n_i$ as input to $n_j$ is set to zero.
- Performance Metric: The impact of ablating an edge on the model's ability to predict the target entity is measured using the MatchNLL loss, which is essentially the negative log-likelihood of the target token; a higher loss means worse performance. The change in this loss after ablation is calculated as $ S_e = \log P_{\mathcal{G}}(o \mid s, r) - \log P_{\mathcal{G} \setminus e}(o \mid s, r) $ Where:
  - $P_{\mathcal{G}}(o \mid s, r)$: The probability (log-probability once the logarithm is taken) of the target entity $o$ given the subject-relation pair (s, r) under the original full model $\mathcal{G}$.
  - $P_{\mathcal{G} \setminus e}(o \mid s, r)$: The probability of the target entity $o$ given (s, r) under the model with edge $e$ ablated.
  - $S_e$: Represents the decrease in log-likelihood (or increase in negative log-likelihood) when $e$ is ablated. A positive $S_e$ means ablating the edge hurt performance, indicating the edge was critical.
- Thresholding: If the score $S_e$ is less than a predefined threshold $\tau$, the edge is considered non-critical and removed from the temporary circuit. This process is typically performed on a topologically sorted graph, traversing edges in a causal order.
- Final Circuit: By iteratively removing non-critical edges, a knowledge circuit is derived, containing only the nodes and edges essential for predicting the target entity for a specific knowledge triplet (s, r, o).
4.2.3. Knowledge Circuits Information Analysis
Once a knowledge circuit is identified, the behavior and contribution of each node within it are analyzed:
- Mapping to Embedding Space: The output of each node $n_i$ is first subjected to layer normalization (LN) and then mapped into the vocabulary's embedding space by multiplying it with the language model's unembedding matrix ($\mathbf{W}_U$): $ \mathbf{W}_U \mathrm{LN}(n_i) $. This allows researchers to inspect what tokens each component is "writing" or promoting into the residual stream (see the sketch after this list).
- Identifying Special Components: Through this analysis, specific types of attention heads are identified based on their behaviors:
  - Mover Head: Focuses on the last token of the context and attends to the subject token, effectively transferring information from the subject to the final token position.
  - Relation Head: Attends to the relation token in the context and produces tokens related to the relation, guiding subsequent computations.
  - Mix Head: Focuses on both relation and subject tokens, often working similarly to a mover head.
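As a concrete illustration of this mapping, the sketch below (assuming TransformerLens and GPT-2 medium; the choice of head L14H7 is only an example and this is not the paper's code) projects one attention head's output at the final position through the final LayerNorm and the unembedding matrix to see which tokens it promotes:

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-medium")
model.cfg.use_attn_result = True          # expose per-head outputs in the cache
_, cache = model.run_with_cache("The official language of France is")

layer, head = 14, 7                       # e.g., a candidate mover head
head_out = cache["result", layer][0, -1, head]        # [d_model], last position
# W_U · LN(n_i): map the node output into vocabulary space.
vocab_logits = model.unembed(model.ln_final(head_out[None, None, :]))[0, 0]
top_ids = vocab_logits.topk(5).indices.tolist()
print([model.tokenizer.decode([i]) for i in top_ids])  # tokens this head promotes
```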
4.2.4. Analyzing Knowledge Editing and LLM Behaviors
- Knowledge Editing: The paper investigates how knowledge editing methods (e.g., ROME, FT-M) alter the knowledge circuits. It observes how edited information is incorporated (e.g., at specific layers, by specific heads) and how this changes the information flow compared to the original circuit.
- Hallucinations: By analyzing knowledge circuits when the model hallucinates, the paper seeks to identify breakdowns in the information flow, such as ineffective mover heads or selection of incorrect information.
- In-Context Learning (ICL): For ICL, knowledge circuits are constructed with and without demonstrations. The appearance of new attention heads (e.g., induction heads [50]) focusing on demonstration contexts is observed to explain how ICL works mechanistically.

The process of circuit discovery involves several steps of running the model and intervening in its computational graph (a minimal sketch follows the steps below):
- Forward Pass: Run a forward pass to compute the baseline performance (log-likelihood of the target token).
- Iterative Ablation: For each potential edge in the graph (or a subset of interest):
  - Ablate the edge (e.g., set its output to zero).
  - Perform another forward pass with the ablated graph.
  - Calculate the performance change $S_e$.
  - If $S_e < \tau$, consider the edge non-critical and effectively remove it from the circuit.
- Circuit Definition: The remaining edges and their connected nodes form the knowledge circuit for that specific piece of knowledge.
- Analysis: Analyze the identified circuit components by mapping their outputs to the vocabulary space to understand their role and contribution.

This methodology provides a rigorous framework for decomposing the complex internal operations of LLMs into interpretable subgraphs responsible for specific knowledge articulation.
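A minimal sketch of that loop is shown below. The helpers `topological_edge_order`, `zero_ablate_edge`, and `target_log_prob` are hypothetical stand-ins for the ACDC machinery the paper actually uses; this illustrates the procedure rather than the authors' implementation:

```python
def discover_knowledge_circuit(model, graph, prompt, target, tau):
    """Keep only edges whose removal changes the target log-probability by at least tau."""
    circuit = graph.copy()
    # Visit edges in reverse topological (causal) order, as described above.
    for edge in reversed(list(topological_edge_order(graph))):
        full_lp = target_log_prob(model, circuit, prompt, target)      # log P_G(o | s, r)
        ablated = zero_ablate_edge(circuit, edge)                      # corrupt this edge only
        ablated_lp = target_log_prob(model, ablated, prompt, target)   # log P_{G\e}(o | s, r)
        score = full_lp - ablated_lp                                   # MatchNLL change S_e
        if score < tau:                                                # not critical:
            circuit = ablated                                          # leave the edge removed
    return circuit
```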
5. Experimental Setup
5.1. Datasets
The study utilizes the dataset provided by Hernandez et al. [42]. This dataset comprises various types of knowledge, which are categorized into:
- Factual Knowledge: Relates to concrete facts (e.g., "person mother," "landmark in country," "company hq").
- Commonsense Knowledge: General understanding of the world (e.g., "object superclass," "fruit inside color," "work location").
- Linguistic Knowledge: Pertains to language rules and relationships (e.g., "adjective antonym," "word first letter," "verb past tense").
- Bias Knowledge: Captures societal biases often present in training data (e.g., "occupation age," "name religion," "occupation gender").

The original dataset from Hernandez et al. [42] was used for few-shot settings. However, this paper focuses on zero-shot knowledge storage, meaning the model's inherent knowledge without in-context examples. Therefore, the data is sampled based on whether the model correctly answers the given prompt in a zero-shot setting using the Hit@10 metric. The test set is sampled in a 1:1 ratio with the validation set to ensure balanced evaluation.
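A small sketch of the Hit@10 check used for this sampling (my own illustration in PyTorch; `logits` and `target_id` are assumed to be the model's next-token logits and the gold answer's first token id, which the paper does not spell out):

```python
import torch

def hit_at_k(logits: torch.Tensor, target_id: int, k: int = 10) -> bool:
    """True if the target token is among the top-k next-token predictions."""
    return target_id in logits.topk(k).indices.tolist()

def hit_at_10_rate(all_logits: list, target_ids: list) -> float:
    """Fraction of examples whose target token ranks in the top 10."""
    hits = [hit_at_k(l, t) for l, t in zip(all_logits, target_ids)]
    return sum(hits) / len(hits)
```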
The detailed statistics of the dataset are presented in the following table (Table 3 from the original paper):
| Category | Relation | # Examples | # Correct in Hit@10 (GPT2-Medium) | # Correct in Hit@10 (GPT2-large) | # Correct in Hit@10 (TinyLLaMA) |
| factual | person mother | 994 | 83 | 144 | 361 |
| person father | 991 | 359 | 385 | 474 | |
| person sport position | 952 | 400 | 489 | 596 | |
| landmark on continent | 947 | 835 | 543 | 694 | |
| person native language | 919 | 310 | 220 | 260 | |
| landmark in country | 836 | 600 | 489 | 709 | |
| person occupation | 821 | 57 | 76 | 149 | |
| company hq | 674 | 308 | 312 | 470 | |
| product by company | 522 | 422 | 432 | 460 | |
| person plays instrument | 513 | 510 | 505 | 498 | |
| star constellation name | 362 | 266 | 148 | 297 | |
| plays pro sport | 318 | 317 | 316 | 315 | |
| company CEO | 298 | 20 | 52 | 125 | |
| superhero person | 100 | 28 | 35 | 50 | |
| superhero archnemesis | 96 | 6 | 6 | 23 | |
| person university | 91 | 14 | 37 | 35 | |
| pokemon evolution | 44 | 11 | 13 | 16 | |
| country currency | 30 | 25 | 25 | 30 | |
| food from country city in country | 30 | 23 20 | 25 | 29 | |
| country capital city | 27 24 | 23 | 27 | ||
| country language | 24 24 | 24 | 24 | ||
| country largest city | 24 24 | 24 | 24 24 | 24 24 | |
| person lead singer of band | 21 | 7 | 16 | ||
| president birth year | 11 | 12 | 21 | ||
| president election year | 19 19 | 17 | 18 | - - | |
| commonsense | object superclass | 76 | 62 | 64 | 72 |
| word sentiment | 60 | 14 | 34 | ||
| task done by tool | 52 | 44 | 9 45 | 45 | |
| substance phase of matter | 50 | 12 | 16 | 48 | |
| work location | 38 | 28 | 24 | 27 | |
| fruit inside color | 36 | 35 | 36 | ||
| task person type | 36 32 | 28 | 27 | ||
| fruit outside color | 30 | 16 | 20 | 26 21 | |
| linguistic | word first letter | 241 | 236 | 235 | 241 |
| word last letter | 241 | 135 | 73 | 114 | |
| adjective antonym | 100 | 80 | 81 | 84 | |
| adjective superlative | 80 | 24 | 19 | 63 | |
| verb past tense | 76 | 1 | 15 | 76 | |
| adjective comparative | 68 | 7 | 15 | 63 | |
| bias | occupation age | 45 | 18 | 20 | 18 |
| univ degree gender | 38 | 14 | 35 | 38 | |
| name birthplace | 31 | 29 | 30 | 31 | |
| name religion | 31 | 24 | 31 | 31 | |
| characteristic gender | 30 | 26 | 30 | 30 | |
| name gender | 19 | 19 | 19 | 19 | |
| occupation gender | 19 | 19 | 19 | 19 | |
5.2. Evaluation Metrics
- MatchNLL (Negative Log-Likelihood): This metric is used during the circuit discovery phase to quantify the impact of ablating an edge. It measures how well the model predicts the target entity.
  - Conceptual Definition: In the context of next-token prediction, Negative Log-Likelihood (NLL) measures the discrepancy between the model's predicted probability distribution over the vocabulary and the actual target token. A lower NLL means the model assigns a higher probability to the correct token, indicating better prediction. When an edge is ablated, the change in NLL (or rather, the log-likelihood as formulated in the paper) indicates how critical that edge is for the correct prediction.
  - Mathematical Formula (from paper): $ S_e = \log P_{\mathcal{G}}(o \mid s, r) - \log P_{\mathcal{G} \setminus e}(o \mid s, r) $
  - Symbol Explanation:
    - $S_e$: The score quantifying the impact of ablating edge $e$. A higher positive score means ablating $e$ significantly decreased the log-probability of the target, thus $e$ is critical.
    - $\log P_{\mathcal{G}}(o \mid s, r)$: The natural logarithm of the probability assigned by the original full model $\mathcal{G}$ to the target entity $o$ given the subject $s$ and relation $r$.
    - $\log P_{\mathcal{G} \setminus e}(o \mid s, r)$: The natural logarithm of the probability assigned by the model with edge $e$ ablated to the target entity $o$ given $s$ and $r$.
- Hit@10: This metric is used to evaluate the completeness of a discovered knowledge circuit and the overall performance of the models in recalling knowledge.
  - Conceptual Definition: Hit@k measures how often the true target token appears within the top k predicted tokens from the model's vocabulary. Hit@10 specifically checks whether the target token is among the top 10 most probable tokens predicted. It is a common metric for assessing the accuracy of knowledge retrieval or generation, especially in open-ended prediction tasks where exact matches might be less important than finding the correct answer within a reasonable set of candidates.
  - Mathematical Formula: A standard formulation is $ \mathrm{Hit@10} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}(\mathrm{rank}_i \leq 10) $ where $N$ is the number of test examples. (Note: the paper's own formula appears to contain a typo; it is labeled Hit@16 while thresholding the rank at 10, and it describes the normalizer as the vocabulary size rather than the number of test examples. The metric is referred to as Hit@10 here, as the paper itself does throughout its text.)
  - Symbol Explanation:
    - Hit@10: The percentage of cases where the target entity's rank is within the top 10.
    - $N$: The number of test examples (the paper's text calls this the vocabulary size; see the note above).
    - $\mathbb{1}(\cdot)$: The indicator function, which evaluates to 1 if the condition inside the parentheses is true, and 0 otherwise.
    - $\mathrm{rank}_i$: The rank of the target entity for example $i$ in the model's list of predicted tokens (sorted by probability, highest probability = rank 1).
    - 10: The rank threshold; the target token must be among the top 10 predictions to be counted as a "hit".
5.3. Baselines
The paper's primary contribution is a novel interpretability framework, not a new predictive model, so it doesn't compare against traditional predictive baselines in the same way. Instead, it implicitly compares against:
- The Full Original Model ($\mathcal{G}$): The performance of the discovered knowledge circuits ($\mathcal{C}$) is compared against the full model to assess the circuit's completeness (how well it reproduces the full model's behavior).
- Random Circuits: To validate the significance of the discovered circuits, their performance is compared against random circuits of the same size. These random circuits are generated by randomly removing edges while ensuring connectivity, serving as a null hypothesis to show that the identified circuits are not merely effective due to their size.
5.4. Implementations
- Models Used: GPT-2 medium, GPT-2 large, and TinyLLAMA [29]. TinyLLAMA is specifically chosen to validate effectiveness across different architectures, particularly noting its use of the Grouped Query Attention Mechanism [67].
- Tools: The Automated Circuit Discovery (ACDC) [32] toolkit is used to construct the circuits, and TransformerLens [41] is used for further analysis of the results.
- Ablation Method: Zero ablation is employed to remove specific computational nodes.
- Hyperparameter: The threshold $\tau$ for MatchNLL is a critical hyperparameter; several values were tested to determine an appropriate circuit size.
- Computational Resources: NVIDIA-A800 (40GB) GPUs. Computing a circuit for a knowledge type in GPT-2 medium took approximately 1-2 days.
- TinyLLAMA Specifics: For TinyLLAMA, which uses Grouped Query Attention, the key and value pairs are interleaved and repeated to analyze specific attention head behaviors.
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Knowledge Circuit Evaluation and Completeness
The study demonstrates that the discovered knowledge circuits are effective in retaining significant portions of the original model's performance with a substantially reduced subgraph.
The following are the results from Table 1 of the original paper:
| Type | Knowledge | #Edge | Dval Original(G) | Dval Circuit(C) | Dtest Original(G) | Dtest Random | Dtest Circuit(C) |
| Linguistic | Adj Antonym | 573 | 0.80 | 1.00 ↑ | 0.00 | 0.00 | 0.40 ↑ |
| word first letter | 432 | 1.00 | 0.88 | 0.36 | 0.00 | 0.16 | |
| word last letter | 230 | 1.00 | 0.72 | 0.76 | 0.00 | 0.76 | |
| Commonsense | object superclass | 102 | 1.00 | 0.68 | 0.64 | 0.00 | 0.52 |
| fruit inside color | 433 | 1.00 | 0.20 | 0.93 | 0.00 | 0.13 | |
| work location | 422 | 1.00 | 0.70 | 0.10 | 0.00 | 0.10 | |
| Factual | Capital City | 451 | 1.00 | 1.00 | 0.00 | 0.00 | 0.00 |
| Landmark country | 278 | 1.00 | 0.60 | 0.16 | 0.00 | 0.36 ↑ | |
| Country Language | 329 | 1.00 | 1.00 | 0.16 | 0.00 | 0.75 ↑ | |
| Person Native Language | 92 | 1.00 | 0.76 | 0.50 | 0.00 | 0.76↑ | |
| Bias | name religion | 423 | 1.00 | 0.50 | 0.42 | 0.00 | 0.42 |
| occupation age | 413 | 1.00 | 1.00 | 1.00 | 0.00 | 1.00 | |
| occupation gender | 226 | 1.00 | 0.66 | 1.00 | 0.00 | 0.66 | |
| name birthplace | 276 | 1.00 | 0.57 | 0.07 | 0.00 | 0.57 ↑ | |
| Avg | 0.98 | 0.73 | 0.44 | 0.00 | 0.47↑ | ||
- Circuit Size and Performance: The discovered circuits, despite using less than 10% of the original model's subgraph (the #Edge column gives the number of edges in each circuit), can maintain over 70% of the original model's performance on the validation set Dval (the average Circuit(C) score on Dval is 0.73). This demonstrates the completeness and efficacy of these circuits in capturing relevant knowledge.
- Comparison to Random Circuits: Random circuits of the same size perform significantly worse (the average Random score on Dtest is 0.00), highlighting that the identified circuits are not merely artifacts of size but truly capture the knowledge-articulating pathways.
- Performance Improvement: Intriguingly, some tasks, like "Landmark country" and "Country Language," show improved performance on the test set (Circuit(C) > Original(G)) when using only the circuit. This suggests that the full model might contain noise from other components that hinders performance on these specific tasks, and the isolated circuit effectively filters this noise.
6.1.2. Knowledge Aggregation Across Layers
Analysis of activated circuit component distributions (Figure 2) shows that attention and MLPs are more active in the lower layers, where the model extracts general information.
The following figure (Figure 2 from the original paper) illustrates the activated circuit component distributions:
Figure 2: A bar chart of activated circuit components across the layers of GPT2-Medium, with separate panels for linguistic, commonsense, bias, and factual knowledge; each panel shows how frequently components at each layer are activated, reflecting how active the circuit is.
Further, by computing the average rank of the target token across layers (Figure 7), the paper observes early decoding—the target entity becomes present in the residual stream by middle to later layers, and subsequent layers primarily serve to increase its probability.
The following figure (Figure 7 from the original paper) shows the knowledge encoding performance of different models across various layers:
Figure 7: Knowledge encoding across layers for GPT2-Medium (top), GPT2-Large (middle), and TinyLLAMA (bottom). The y-axis is the average rank of the target entity and the x-axis is the layer index; across models, knowledge is captured roughly in the middle-to-later layers.
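The per-layer rank curve behind this observation can be reproduced with a logit-lens style probe; the sketch below is an illustration under the assumptions that TransformerLens is used, the model is GPT-2 medium, and the target is a single token (it is not the paper's analysis code):

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-medium")
prompt, target = "The official language of France is", " French"
target_id = model.to_tokens(target, prepend_bos=False)[0, 0].item()

_, cache = model.run_with_cache(prompt)
for layer in range(model.cfg.n_layers):
    resid = cache["resid_post", layer][0, -1]                        # last position
    logits = model.unembed(model.ln_final(resid[None, None, :]))[0, 0]
    rank = int((logits > logits[target_id]).sum().item()) + 1        # 1 = top prediction
    print(f"layer {layer:2d}: rank of target = {rank}")
```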
6.1.3. Special Components in Knowledge Circuits
The circuits reveal specialized roles for different attention heads:
- Mover Heads: (e.g., L14H7 in the "France is French" example) transfer information from the subject position (e.g., "France") to the last token position. The paper reinterprets these not just as argument parsers but as extract heads that extract related information from the subject's position (see the sketch at the end of this subsection).
- Relation Heads: (e.g., L14H13) focus on the relation token (e.g., "language") and generate relation-related tokens.
- Mixture Heads: Attend to both relation and subject tokens.

A running example for "The official language of France is French" (Figure 1, Figure 3) demonstrates that after MLP17, the target "French" emerges as the top token, and its probability further increases. This MLP integrates information from relation heads (e.g., L14H13, focusing on "language") and mover heads (e.g., L14H7, extracting information related to "France"). The model accumulates knowledge from early-to-middle layers, with MLPs combining information and prioritizing the target token, and later attention heads (e.g., L18H14, L20H6) enhancing the final prediction.

The following figure (Figure 1 from the original paper) illustrates the knowledge circuit for "The official language of France is French":
Figure 1: The knowledge circuit extracted for "The official language of France is French", including the simplified circuit and the outputs of several special components. The left side shows the simplified circuit structure; the right side shows the output logits and attention patterns of different attention heads.

The following figure (Figure 3 from the original paper) shows the rank and probability of the target entity for the fact "The official language of France is French":
Figure 3: The rank and probability of the target entity across layers at the last-token and subject positions for the fact "The official language of France is French". Blue lines show the target entity's rank at the last and subject positions; purple dashed lines show the probabilities of the target and subject entities. The y-axis shows rank on a log scale and the x-axis is the layer index, illustrating how the knowledge is distributed across the model's layers.
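A sketch of how a candidate mover head's behavior can be checked, i.e. whether the final position attends to the subject token (assuming TransformerLens and GPT-2 medium; the head L14H7 and the prompt follow the running example, but the code itself is illustrative and not the paper's):

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-medium")
prompt = "The official language of France is"
tokens = model.to_tokens(prompt)
_, cache = model.run_with_cache(tokens)

layer, head = 14, 7
pattern = cache["pattern", layer][0, head]         # [query pos, key pos] attention weights
last_pos = tokens.shape[1] - 1
subject_pos = model.to_str_tokens(prompt).index(" France")
# A mover head should put substantial weight on the subject from the final position.
print(f"attention(last -> ' France') = {pattern[last_pos, subject_pos].item():.3f}")
```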
6.1.4. Knowledge Editing Impact
The study examines how ROME [18] and FT-M [24] affect knowledge circuits.
- ROME: Tends to incorporate edited information primarily at the edited layer. Subsequent mover heads then transport this new information to the residual stream of the last token. For example, for the fact "Platform Controller Hub was created by Intel", if L15H3 was a mover head that previously moved "Controller", after ROME editing its output shifts to "Intel".
- FT-M: Shows a more direct integration, where the edited token's logits are boosted significantly (by more than 10 for "Intel" in MLP-0) immediately at the edited component, dominating subsequent predictions. This suggests FT-M writes knowledge directly into specific components.
- Side Effects: Both methods, especially FT-M, risk overfitting or influencing unrelated knowledge. For instance, after editing to make "Platform Controller Hub was created by Intel", querying "Windows server is created by?" also yields "Intel", demonstrating a generalization issue. This supports the idea that edits add signals into circuits rather than fundamentally altering storage.

The following figure (Figure 4 from the original paper) illustrates the different behaviors when editing the language model:
Figure 4: A comparison of head L15H3's attention pattern and output logits under different edited models. Top: the original model, moving information about the original token "Controller". Middle: ROME, which successfully adds the "Intel" information. Bottom: fine-tuning at layer 0 (FT-M), which writes the edited knowledge directly into the component; the figure also shows how these methods affect the unrelated query "Windows server is created by?".

The following figure (Figure 11 from the original paper) shows the knowledge circuit obtained from the edited model:
Figure 11: How the knowledge circuit for "Platform Controller Hub was created by" changes when different layers (0, 6, 12, and 23) are edited. The edited MLP allows the model to provide the edited information directly, revealing how the knowledge flows and is stored.
6.1.5. Interpretation of LLM Behaviors
- Factual Hallucinations: When hallucinations occur (Figure 5, left), the model fails to correctly transfer knowledge to the final token in earlier layers. This is often due to the knowledge circuit lacking an effective mover head or the mover head selecting incorrect information. For example, for "The official currency of Malaysia is called", both "Ringgit" (correct) and "Malaysian" (incorrect) are accumulated early, but a mover head at layer 16 extracts the erroneous "Malaysian", leading to hallucination.
- In-Context Learning (ICL): When demonstrations are provided for ICL (Figure 5, right), new attention heads emerge in the knowledge circuit. These new heads primarily focus on the demonstration's context, acting as induction heads [50] that look for previous instances of the current token and the token that followed it. Ablation studies on these newly appeared heads show a significant drop in prediction probability, confirming their importance in ICL.

The following figure (Figure 5 from the original paper) illustrates fact hallucination and in-context learning cases:
Figure 5: Two cases. Left: a factual hallucination about Malaysia's official currency, in which a mover head at layer 15 selects the wrong information. Right: an in-context learning case, in which several new attention heads appear in the knowledge circuit.
The following are the results from Table 2 of the original paper:
| Type | Knowledge | Origin Model | Ablating extra head | Ablating random head |
| Linguistic | adj_comparative | 62.24 | 32.55 | 58.18 |
| Commonsense | word_sentiment | 89.02 | 55.50 | 88.61 |
| Commonsense | substance_phase | 78.74 | 52.85 | 71.24 |
| Bias | occupation_gender | 86.97 | 59.54 | 86.54 |
| Factual | person_occupation | 35.17 | 23.27 | 31.60 |
The table shows that ablating extra heads (new attention heads appearing in the ICL circuit) causes a much larger drop in performance compared to ablating random heads, underscoring their critical role in in-context learning.
6.1.6. Multi-hop Factual Knowledge Editing
The paper briefly touches on multi-hop knowledge editing, noting that current methods struggle. They observe that the mover head in the original multi-hop circuit extracts the second-hop answer. After editing the first hop, the mover head extracts the new (edited) information, dominantly saturating and influencing the circuit, leading to incorrect subsequent reasoning. An intriguing finding is that even in the original model, if the first-hop context is removed, the model can still directly provide the multi-hop answer, suggesting reliance on relational and subject-related information rather than strict grammatical adherence.
The following figure (Figure 10 from the original paper) illustrates a specific case in Multi-hop reasoning:
Figure 10: An illustration of knowledge types in multi-hop reasoning, showing three forms of knowledge (factual, multi-hop, and corrupted) each pointing to their related answers, and highlighting that even after the first-hop context is removed, the model can still directly produce part of the answer.
6.2. Data Presentation (Tables)
Table 1: Hit@10 of the Original and Circuit Standalone performance of the knowledge circuit in GPT2-Medium. (Already transcribed in Section 6.1.1)
Table 2: Performance change via ablating the newly appeared attention heads in the ICL circuit and random heads. (Already transcribed in Section 6.1.5)
Table 3: Information about the dataset. (Already transcribed in Section 5.1)
Table 4: Special component behaviour in circuits as task-specific head. The following are the results from Table 4 of the original paper:
| Model | Type | Fact | Critical Component in Circuit |
| GPT2-Medium | Linguistic | Antonym | L17H2, L18H1, L13H12, L13H8 |
| GPT2-Medium | Factual | city country | L21H12, L16H2 |
| GPT2-Medium | Commonsense | work location | L19H15, L14H4, L13H3 |
| GPT2-Medium | Bias | name country | L16H6, L21H12 |
| GPT2-Large | Linguistic | Antonym | L25H5, L24H16, L19H13, L18H8 |
| GPT2-Large | Factual | company hq | L30H6, L25H13 |
| GPT2-Large | Commonsense | work location | L18H13, L28H18, L30H5 |
| GPT2-Large | Bias | name country | L21H19, L29H2 |
| TinyLLAMA | Linguistic | Verb past tense | L17H0, MLP20 |
| TinyLLAMA | Factual | Landmark country | L15H11, L17H19, MLP18 |
| TinyLLAMA | Commonsense | Fruit Inside Color | L18H25, MLP18 |
| TinyLLAMA | Bias | name country | L15H11, MLP17 |
This table lists specific attention heads and MLPs (e.g., L17H2, MLP20) identified as critical components within the knowledge circuits for various types of knowledge and relations across different models (GPT2-Medium, GPT2-Large, TinyLLAMA). This provides concrete examples of the components that constitute these circuits.
Table 5: Hit for different hops. The following are the results from Table 5 of the original paper:
| | First-hop | Second-hop | Integrate |
| node | 83.33 | 70.27 | 71.42 |
| edge | 63.20 | 45.27 | 49.42 |
This table shows the overlap (or Hit) of nodes and edges when comparing knowledge circuits for first-hop, second-hop, and integrated multi-hop knowledge. High overlap for nodes (e.g., 83.33% for first-hop) indicates component reuse in multi-hop reasoning.
6.3. Ablation Studies / Parameter Analysis
6.3.1. Threshold for Circuit Discovery
The choice of the threshold $\tau$ is crucial. A high $\tau$ leads to an incomplete circuit, while a low $\tau$ results in too many unnecessary nodes. The paper implies that an appropriate $\tau$ was chosen to yield circuits that are small yet retain high performance.
6.3.2. Ablation of New Heads in ICL
As shown in Table 2 (transcribed in Section 6.1.5), ablating the newly appeared attention heads in the ICL circuit significantly drops performance (e.g., adj_comparative from 62.24 to 32.55). In contrast, ablating random heads has a much smaller impact (e.g., adj_comparative from 62.24 to 58.18). This confirms the causal role of these specific heads in facilitating in-context learning.
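A sketch of this kind of head ablation with a forward hook (assuming TransformerLens; the choice of layer 20, head 6 is illustrative rather than one of the heads reported in the paper's ICL experiment):

```python
import torch
from transformer_lens import HookedTransformer
from transformer_lens.utils import get_act_name

model = HookedTransformer.from_pretrained("gpt2-medium")
prompt, target = "The official language of France is", " French"
target_id = model.to_tokens(target, prepend_bos=False)[0, 0].item()
layer, head = 20, 6

def zero_head(value, hook):
    value[:, :, head, :] = 0.0                     # hook_z: [batch, pos, head, d_head]
    return value

clean = model(prompt)[0, -1]
ablated = model.run_with_hooks(prompt, fwd_hooks=[(get_act_name("z", layer), zero_head)])[0, -1]
print("p(target) clean  :", torch.softmax(clean, -1)[target_id].item())
print("p(target) ablated:", torch.softmax(ablated, -1)[target_id].item())
```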
6.3.3. Ablation of Mover and Relation Heads
Figure 6 (Appendix B.2) shows that ablating mover heads increases the probability of other non-subject-related tokens (e.g., "Italian", "English", "Spanish" instead of the target "French"), while ablating relation heads leads to meaningless words (e.g., "a", "that"). This directly illustrates their specific functions in transferring subject information and guiding relation-based output.
The following figure (Figure 6 from the original paper) shows the output of the model after ablating mover and relation heads:
Figure 6: The weights of French-related words generated by the GPT model after different interventions, in three parts: the original output, the output after ablating the mover heads, and the output after ablating the relation heads. Words and their weights are shown in varying shades, reflecting how the different knowledge circuit components shape the output.
6.4. Component Reuse
The paper observes component reuse across related tasks. For example, L21H12 is critical for city_in_country, name_birth_place, and country_language relations in GPT2-Medium. Similarly, L7H14 appears in circuits for both official_language and official_currency. This suggests that certain components act as topic heads rather than being strictly task-specific, potentially handling general semantic categories.
6.5. Model Differences
- Knowledge Aggregation: In GPT-2 models, the rank of the target entity drops gradually across middle-to-later layers, indicating gradual knowledge aggregation. In contrast, TinyLLAMA shows a sharper decline in rank around specific layers, suggesting more concentrated knowledge processing. This difference is hypothesized to relate to model knowledge capacity [44].
- Special Components Concentration: TinyLLAMA's special components (e.g., mover heads, relation heads) are more concentrated in later layers compared to GPT-2.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper successfully introduces and validates the concept of knowledge circuits as interpretable subgraphs within Transformer language models. By moving beyond isolated component analysis, the authors demonstrate how attention heads, MLPs, and embeddings collaboratively encode and articulate specific knowledge. Key findings include the high completeness of these circuits in reproducing model behavior, the identification of specialized mover and relation heads for information flow, and a mechanistic explanation for knowledge editing effects (ROME adds signals at edited layers, FT-M directly writes knowledge). Furthermore, knowledge circuits provide concrete insights into hallucinations (failure of mover heads) and in-context learning (emergence of new induction heads). The research highlights that knowledge is aggregated from early to middle layers and refined in later layers.
7.2. Limitations & Future Work
The authors acknowledge several limitations:
- Granularity of Circuits: The current approach operates at a relatively coarse granularity. Neurons within an MLP, for instance, might require finer-grained analysis to fully capture their behavior.
- Efficiency of Discovery: The causal mediation method used for circuit discovery is time-intensive. More efficient methodologies like mask training [63, 64] or sparse auto-encoders [65, 36] could be explored.
- Opacity of Activation: While the paper identifies which components work together, the precise reasons why they are activated remain somewhat opaque.
- Logit Lens Discrepancies: The logit lens method may face discrepancies between middle layers and the output unembedding matrix, hindering comprehensive analysis of early-layer component behavior. More robust techniques like Attention Lens [86] could be explored, despite being resource-intensive.
- Mover Head Activation Mechanisms: The mechanisms by which mover heads are activated and the conditions under which they operate (especially given their reuse across different knowledge types) require further exploration to understand monosemantic vs. polysemantic behaviors [90].
- Multi-hop Reasoning in Aligned Models: The intriguing phenomenon of models directly answering multi-hop questions even after removing first-hop context, and how this is alleviated by aligned models, requires further investigation.
Future work directions suggested include:
- Advancing knowledge circuit discovery methodologies for greater efficiency and finer granularity.
- Further exploring the activation mechanisms of specialized heads, particularly in cases of component reuse.
- Developing more robust techniques for analyzing intermediate layer outputs, potentially leveraging Attention Lens or similar methods.
- Investigating the observed differences in knowledge aggregation and component concentration between models like GPT-2 and TinyLLAMA.
- Using knowledge circuits to guide the improved design of knowledge editing and to enhance factuality and alleviate hallucinations in LLMs.
- Contributing to the development of trustworthy AI by ensuring safety and privacy of information through circuit breakers [66].
7.3. Personal Insights & Critique
This paper offers a significant step forward in mechanistic interpretability by providing a holistic framework for understanding knowledge representation and utilization in LLMs. The shift from isolated component analysis to knowledge circuits is crucial, as real-world cognitive processes are rarely isolated. The finding that knowledge aggregates in early-to-middle layers and is refined later, combined with the identification of specific head types, paints a more detailed picture of the internal "reasoning" process.
The mechanistic explanation for hallucinations and in-context learning is particularly valuable. Understanding that hallucinations can stem from a mover head misdirection provides concrete avenues for mitigation. Similarly, observing the emergence of induction heads during ICL reinforces the idea that LLMs are not just pattern matchers but develop specialized modules for specific learning paradigms.
A potential area for deeper exploration might be the interplay between the identified knowledge circuits and the model's emergent world models. If these circuits are indeed fundamental, how do they contribute to or conflict with higher-level semantic understanding? Also, while the paper uses zero ablation, investigating the impact of mean ablation could provide complementary insights, especially regarding default behaviors when a component is "silenced." The observed performance improvement in circuits over the full model for some tasks is fascinating and warrants further study into how "noise" manifests in LLMs and how circuits might inherently filter it.
The work's focus on bridging the gap between interpretability and practical application (like knowledge editing) is commendable. By providing a "roadmap" of knowledge flow, it can guide more targeted and effective editing strategies, moving beyond trial-and-error. The observed component reuse phenomenon, where certain heads act as topic heads, suggests an underlying organizational principle within LLMs that could be exploited for more efficient and robust knowledge management.