
Attention is not Explanation

Published: 02/27/2019
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This study empirically shows that attention weights in NLP models poorly explain predictions, as diverse attention patterns yield similar outcomes, indicating attention alone is insufficient for reliable model interpretability.

Abstract

Attention mechanisms have seen wide adoption in neural NLP models. In addition to improving predictive performance, these are often touted as affording transparency: models equipped with attention provide a distribution over attended-to input units, and this is often presented (at least implicitly) as communicating the relative importance of inputs. However, it is unclear what relationship exists between attention weights and model outputs. In this work, we perform extensive experiments across a variety of NLP tasks that aim to assess the degree to which attention weights provide meaningful `explanations' for predictions. We find that they largely do not. For example, learned attention weights are frequently uncorrelated with gradient-based measures of feature importance, and one can identify very different attention distributions that nonetheless yield equivalent predictions. Our findings show that standard attention modules do not provide meaningful explanations and should not be treated as though they do. Code for all experiments is available at https://github.com/successar/AttentionExplanation.

In-depth Reading

1. Bibliographic Information

1.1. Title

Attention is not Explanation

1.2. Authors

  • Sarthak Jain: At the time of publication, a student at Northeastern University.

  • Byron C. Wallace: An Associate Professor at Northeastern University, with a research focus on machine learning, natural language processing (NLP), and their applications in health informatics. His work often involves model interpretability and reliability.

    Their affiliations are with Northeastern University.

1.3. Journal/Conference

The paper was published as a preprint on arXiv. It was presented at the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). NAACL is a top-tier conference in the field of computational linguistics and natural language processing, known for its high standards and significant impact on the field.

1.4. Publication Year

The paper was submitted to arXiv and published in 2019. The specific version referenced was published on February 26, 2019.

1.5. Abstract

The abstract introduces the central conflict the paper addresses: attention mechanisms in neural NLP models are widely used not only for their performance benefits but also for their supposed ability to provide transparency into model decisions. The authors question this assumption by empirically investigating the relationship between attention weights and model predictions. They conduct extensive experiments across various NLP tasks and find that attention weights do not provide faithful explanations. Their key findings are that (1) learned attention weights are often uncorrelated with other feature importance measures (like gradients), and (2) it is possible to find substantially different attention distributions that produce the same model prediction. The authors conclude that standard attention should not be treated as a reliable explanation tool.

  • Original Source Link: https://arxiv.org/abs/1902.10186
  • PDF Link: https://arxiv.org/pdf/1902.10186v3.pdf
  • Publication Status: The paper is a preprint available on arXiv and was also published in the proceedings of NAACL-HLT 2019.

2. Executive Summary

2.1. Background & Motivation

  • Core Problem: In Natural Language Processing (NLP), attention mechanisms became a standard component of many state-of-the-art neural network architectures. Researchers and practitioners frequently claimed that attention provides a form of built-in interpretability. By visualizing the attention weights as a heatmap over the input text, one could supposedly see which words the model "focused on" to make its prediction. This was often presented as an explanation for the model's behavior.

  • Importance and Gaps: While this narrative of "attention as explanation" was widely accepted and cited in numerous papers, it was largely an intuitive assumption rather than a rigorously tested hypothesis. There was a lack of formal investigation into whether the input units with high attention weights were truly the most influential factors driving the model's output. This created a potential gap between what researchers claimed about model transparency and what was actually happening. The authors identify this gap as their primary motivation: to empirically test the "faithful explanation" property of attention.

  • Innovative Idea: The paper's innovative idea is to systematically challenge the assumption of attention as explanation through two main lines of empirical inquiry:

    1. Correlation Analysis: If attention weights truly reflect feature importance, they should correlate with other established measures of feature importance, such as input gradients or the impact of removing a feature (leave-one-out).
    2. Counterfactual Analysis: If a specific attention distribution is the "reason" for a prediction, then a significantly different attention distribution should lead to a different prediction. The authors test this by generating alternative, "adversarial" attention distributions that are maximally different from the original but are constrained to produce the same prediction.

2.2. Main Contributions / Findings

  • Primary Contributions:

    1. A Systematic Empirical Investigation: The paper provides the first large-scale, multi-task empirical study questioning the validity of using attention as a faithful explanation for model predictions in NLP.
    2. A Novel "Adversarial Attention" Method: The authors propose a method to find counterfactual attention distributions that are maximally different from the learned one but yield the same prediction. The existence of such distributions serves as strong evidence against the uniqueness and faithfulness of the original attention-based explanation.
  • Key Findings:

    1. Attention weights show weak and inconsistent correlation with feature importance measures. Across a wide range of tasks and datasets, the authors found that the Kendall's Tau correlation between attention weights and both gradient-based and leave-one-out feature importance scores was generally low, especially for complex recurrent encoders like BiLSTMs.
    2. Alternative attention distributions can produce the same prediction. The experiments demonstrated that it is often possible to find adversarial attention distributions that are very dissimilar to the one learned by the model but result in a nearly identical output. This undermines the claim that the original attention pattern is the unique or necessary cause for the prediction. Even randomly permuting the attention weights often had a minimal effect on the output.
    3. The authors' central conclusion is a strong cautionary note to the NLP community: "Attention is not Explanation." They argue that the common practice of presenting attention heatmaps as a direct explanation of a model's reasoning is misleading and should be abandoned.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully grasp the paper, one must understand the following concepts:

  • Attention Mechanisms: An attention mechanism allows a neural network to dynamically focus on different parts of an input sequence when producing an output. In NLP, this means assigning a weight to each input word. The final representation used for prediction is a weighted average of the word representations, where the weights are determined by the attention mechanism.

    • Intuition: Imagine translating a long sentence. Instead of trying to compress the entire source sentence into a single fixed-size vector, you would naturally focus on different parts of the source sentence as you generate each word of the translation. Attention mimics this process.
    • Components:
      • Query (Q): A representation of what the model is currently trying to do (e.g., the state of a decoder generating a word).
      • Keys (K): Representations of the input items (e.g., encoded words) that "compete" for attention.
      • Values (V): The actual representations of the input items that will be aggregated. In many NLP models, Keys and Values are derived from the same source (e.g., the hidden states of an RNN).
    • Process:
      1. Scoring: A similarity function computes a score between the Query and each Key.
      2. Weighting: These scores are passed through a softmax function to create a probability distribution, known as the attention weights ($\alpha$).
      3. Aggregation: A context vector is created by taking the weighted sum of the Values. This vector is then used by the rest of the model for its final prediction. (A minimal code sketch of these three steps appears after this list.)
  • Recurrent Neural Networks (RNNs) and BiLSTMs:

    • RNNs: A class of neural networks designed to work with sequential data (like text). They process a sequence step-by-step, maintaining an internal hidden state that summarizes the information seen so far. The hidden state at time step $t$ is a function of the input at $t$ and the hidden state at $t-1$.
    • Long Short-Term Memory (LSTM): A sophisticated type of RNN unit that uses a gating mechanism (input, forget, and output gates) to better control the flow of information. This helps LSTMs overcome the "vanishing gradient" problem and learn long-range dependencies in data.
    • Bidirectional LSTM (BiLSTM): A BiLSTM consists of two LSTMs. One processes the input sequence from start to end (forward), and the other processes it from end to start (backward). At each time step, the outputs of the two LSTMs are concatenated. This allows the representation of each word to capture context from both its left and its right, making it a powerful encoder for many NLP tasks. A key point relevant to this paper is that the hidden state $h_t$ of a BiLSTM at position $t$ is influenced by all other words in the sequence, not just the word at position $t$.
  • Feature Importance Measures: Methods used to determine how much each input feature contributes to a model's prediction.

    • Gradient-based methods: Calculate the gradient of the output with respect to the input features. A larger gradient magnitude implies that a small change in that feature will cause a large change in the output, indicating higher importance.
    • Leave-One-Out (LOO) / Feature Erasure: Measures importance by removing one feature at a time from the input and observing the change in the model's output. A larger change implies the feature was more important.
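
To make the scoring, weighting, and aggregation steps above concrete, the following is a minimal sketch of additive attention over a sequence of encoder hidden states. The class name, dimensions, and interface are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Minimal additive attention: score -> softmax -> weighted sum."""
    def __init__(self, hidden_dim: int, query_dim: int, attn_dim: int = 64):
        super().__init__()
        self.W1 = nn.Linear(hidden_dim, attn_dim, bias=False)  # projects hidden states (keys)
        self.W2 = nn.Linear(query_dim, attn_dim, bias=False)   # projects the query
        self.v = nn.Linear(attn_dim, 1, bias=False)            # scoring vector

    def forward(self, h, q):
        # h: (batch, T, hidden_dim) encoder hidden states; q: (batch, query_dim) query
        scores = self.v(torch.tanh(self.W1(h) + self.W2(q).unsqueeze(1))).squeeze(-1)  # (batch, T)
        alpha = torch.softmax(scores, dim=-1)                  # attention weights over tokens
        context = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)  # weighted sum of hidden states
        return context, alpha
```

The returned `alpha` is exactly the distribution that is typically visualized as a heatmap over the input tokens.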

3.2. Previous Works

The authors situate their work in contrast to a prevailing trend in the NLP community.

  • Claims of Attention as Interpretability: The paper cites a series of influential works that either explicitly state or implicitly suggest that attention mechanisms provide model interpretability.

    • Bahdanau et al. (2014), "Neural Machine Translation by Jointly Learning to Align and Translate": This is the seminal paper that introduced the attention mechanism to NLP. While primarily focused on improving translation quality, its visualizations of word alignments were highly suggestive of interpretability and became a standard way to "explain" model behavior. The core additive attention formula from this paper is: $ \alpha_t = \frac{\exp(e_t)}{\sum_{k=1}^T \exp(e_k)} \quad \text{where} \quad e_t = \mathbf{v}^T \tanh(\mathbf{W}_1 \mathbf{h}_t + \mathbf{W}_2 \mathbf{Q}) $ Here, $\mathbf{h}_t$ is the hidden state of the input sequence at position $t$, $\mathbf{Q}$ is the query (e.g., decoder state), $\mathbf{W}_1$, $\mathbf{W}_2$, and $\mathbf{v}$ are learned parameters, and $\alpha_t$ is the final attention weight for input $t$.
    • Xu et al. (2015), "Show, Attend and Tell": This work applied attention to image captioning, showing the model focusing on different parts of an image as it generated corresponding words in the caption. These visualizations were compelling and further cemented the "attention as explanation" narrative.
    • Other cited works like (Choi et al., 2016; Mullenbach et al., 2018) applied attention in healthcare and medical coding, where interpretability is a critical requirement, thus reinforcing the high stakes of this assumption.
  • Alternative Interpretability Methods: The authors also reference other lines of work on model explanation, which they use as a basis for their comparison.

    • Li et al. (2016), "Understanding Neural Networks through Representation Erasure": This work proposed measuring feature importance by erasing parts of the input representation and observing the effect on the prediction, similar to the LOO method used in this paper.
    • Ross et al. (2017), "Right for the Right Reasons": This paper introduced the concept of "faithful explanation" and used input gradients as a measure of feature importance, which directly informs the gradient-based comparisons in "Attention is not Explanation."

3.3. Technological Evolution

The use of attention mechanisms in NLP followed a rapid evolutionary path:

  1. Pre-Attention Era: Models like standard RNNs/LSTMs used a "sequence-to-vector" approach, encoding an entire input sequence into a single fixed-length vector, which became a bottleneck for long sequences.
  2. Introduction of Attention (Bahdanau et al., 2014): Attention was introduced to solve this bottleneck, allowing the model to look back at the entire input sequence at each step of the output generation. This dramatically improved performance in tasks like machine translation.
  3. Widespread Adoption: Following its success, attention was quickly adopted across a vast range of NLP tasks, including text classification, summarization, question answering, and NLI. It became a near-ubiquitous component.
  4. The "Attention as Explanation" Narrative: The intuitive appeal of attention visualizations led to a widespread, often uncritical, acceptance of attention weights as a tool for explaining model behavior.
  5. Critical Re-evaluation (This Paper): Jain and Wallace's work represents a critical turning point. It systematically challenges the prevailing narrative, forcing the community to reconsider the relationship between attention and explanation. This paper helped spawn a new wave of research focused on "faithful" interpretability and designing new model architectures that are truly transparent by construction.

3.4. Differentiation Analysis

This paper's approach differs from prior work in several key ways:

  • Skepticism vs. Advocacy: While most previous papers either advocated for or assumed attention's explanatory power, this work takes an explicitly skeptical and adversarial stance.
  • Empirical Rigor vs. Intuition: Instead of relying on qualitative analysis of attention heatmaps, the authors use quantitative metrics (Kendall's Tau, JSD) and controlled experiments to test specific, falsifiable hypotheses about attention.
  • Focus on Faithfulness: The paper distinguishes between plausible-looking explanations and "faithful" explanations (i.e., those that accurately reflect the model's internal reasoning). It argues that attention often provides the former but not the latter.
  • Counterfactual Reasoning: The introduction of "adversarial attention" is a significant innovation. While other methods tested importance by removing features, this approach shows that the model's output can be invariant even when the "explanation" (the attention map) is drastically altered, which is a more profound challenge to its explanatory power.

4. Methodology

4.1. Principles

The core principle of the paper is to empirically test whether attention weights provide a faithful explanation for a model's prediction. A faithful explanation should accurately reflect the reasoning process of the model. The authors formalize this by investigating two properties that a faithful attention-based explanation should possess:

  1. Correlation: The attention weights should strongly correlate with other independent measures of feature importance. If a word is truly important for a prediction, both attention and other metrics should identify it as such.

  2. Causality/Uniqueness: The specific attention distribution learned by the model should be a critical factor in producing the output. If the model were to attend to a very different set of words, the prediction should change accordingly. If it doesn't, the original attention pattern cannot be claimed as the "reason" for the output.

    The methodology is designed to test these two properties across multiple NLP tasks, datasets, and model variations.

4.2. Core Methodology In-depth (Layer by Layer)

The authors' methodology is divided into two main experimental pillars.

4.2.1. Pillar 1: Correlation Between Attention and Feature Importance Measures

This set of experiments aims to answer the question: To what extent do induced attention weights correlate with measures of feature importance? The authors use Algorithm 1 to compute and compare attention weights against two standard feature importance scores: gradient-based importance and leave-one-out (LOO) importance.

The process for a single input instance $\mathbf{x}$ is as follows:

  1. Forward Pass and Attention Calculation: First, the model performs a standard forward pass.

    • The input $\mathbf{x}$ is encoded into hidden states: $\mathbf{h} = \text{Enc}(\mathbf{x})$.
    • Attention scores are computed over the hidden states and normalized via softmax to get the attention weights: $\alpha = \text{softmax}(\phi(\mathbf{h}, \mathbf{Q}))$.
    • A context vector is formed: $h_\alpha = \sum_t \alpha_t h_t$.
    • A final prediction is made: $\hat{y} = \text{Dec}(h_\alpha)$.
  2. Compute Gradient-based Feature Importance: The importance of each input word is measured by the magnitude of the gradient of the output with respect to that word's embedding.

    • The formula given in the paper is: $ g_t = \left| \sum_{w=1}^{|V|} \mathbb{1}[x_{tw}=1] \frac{\partial \hat{y}}{\partial x_{tw}} \right|, \quad \forall t \in [1, T] $
    • Explanation:
      • $t$ is the position of a word in the input sequence of length $T$.
      • $x_{tw}$ is an indicator variable that is 1 if the word at position $t$ is the $w$-th word in the vocabulary $V$, and 0 otherwise (one-hot encoding).
      • $\frac{\partial \hat{y}}{\partial x_{tw}}$ is the gradient of the final prediction $\hat{y}$ with respect to the input feature for word $w$ at position $t$.
      • The summation and indicator function simply select the gradient corresponding to the specific word present at position $t$.
      • The absolute value $|\cdot|$ is taken because we care about the magnitude of the influence, not its direction (positive or negative).
    • An important detail is that the authors disconnect the computation graph at the attention module for this gradient calculation. This means the gradient reflects how changing an input word affects the prediction, assuming the attention distribution remains fixed. This isolates the influence of the input words via the encoder from their influence on the attention weights themselves.
  3. Compute Leave-One-Out (LOO) Feature Importance: The importance of each word is measured by how much the model's prediction changes when that word is removed from the input.

    • For each position $t$ from 1 to $T$, a new input $\mathbf{x}_{-t}$ is created by removing the word at that position.
    • The model makes a new prediction $\hat{y}(\mathbf{x}_{-t})$ on this modified input.
    • The change in prediction is measured using Total Variation Distance (TVD), a metric for comparing probability distributions. The formula for TVD between two predictions $\hat{y}_1$ and $\hat{y}_2$ is: $ \mathrm{TVD}(\hat{y}_1, \hat{y}_2) = \frac{1}{2} \sum_{i=1}^{|\mathcal{V}|} |\hat{y}_{1i} - \hat{y}_{2i}| $ where $|\mathcal{V}|$ is the number of output classes.
    • The LOO importance score for the word at position $t$ is: $ \Delta \hat{y}_t = \mathrm{TVD}(\hat{y}(\mathbf{x}_{-t}), \hat{y}(\mathbf{x})) $
  4. Calculate Correlation: Finally, the authors compute the correlation between the vector of attention weights $\alpha$ and the vectors of feature importance scores ($g$ and $\Delta \hat{y}$). They use Kendall's Tau ($\tau$), a non-parametric statistic that measures the ordinal association between two ranked quantities. A value of +1 indicates perfect agreement of rankings, -1 indicates perfect disagreement, and 0 indicates no association. They compute $\tau_g = \text{Kendall-}\tau(\alpha, g)$ and $\tau_{loo} = \text{Kendall-}\tau(\alpha, \Delta \hat{y})$.
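
As a rough illustration of this pipeline (not the authors' released code), the sketch below computes leave-one-out importance scores and their Kendall's Tau correlation with attention weights for a single instance. The `model.predict(tokens)` interface, returning class probabilities and attention weights, is an assumed placeholder.

```python
import numpy as np
from scipy.stats import kendalltau

def tvd(p, q):
    """Total variation distance between two probability vectors."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

def loo_importance(model, tokens):
    """Leave-one-out importance: TVD between the original prediction and the
    prediction obtained after deleting the token at each position t."""
    y_hat, alpha = model.predict(tokens)              # assumed interface
    deltas = []
    for t in range(len(tokens)):
        y_t, _ = model.predict(tokens[:t] + tokens[t + 1:])
        deltas.append(tvd(y_t, y_hat))
    return np.array(deltas), np.array(alpha)

def attention_loo_correlation(model, tokens):
    """Kendall's tau between attention weights and LOO importance scores."""
    deltas, alpha = loo_importance(model, tokens)
    tau, p_value = kendalltau(alpha, deltas)
    return tau, p_value
```

The gradient-based variant is analogous: replace the LOO scores with the magnitudes of the gradients of the output with respect to each input token, then compute the same rank correlation.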

4.2.2. Pillar 2: Counterfactual Attention Weights

This set of experiments addresses the question: Would alternative attention weights necessarily yield different predictions? The authors explore this through two methods.

4.2.2.1. Attention Permutation

This is a simple but effective test. As described in Algorithm 2, the process is:

  1. For a given input, obtain the model's hidden states $\mathbf{h}$ and the original attention weights $\alpha$.
  2. Repeat for a number of trials (e.g., 100 times):
    • Create a new attention distribution $\alpha_p$ by randomly shuffling (permuting) the original weights $\alpha$.
    • Compute a new prediction $\hat{y}_p = \text{Dec}(\sum_t \alpha_{pt} h_t)$. Note that the hidden states $\mathbf{h}$ are held constant.
    • Measure the change in output using TVD: $\Delta \hat{y}_p = \mathrm{TVD}(\hat{y}_p, \hat{y})$.
  3. The median change in output across all permutations, $\Delta \hat{y}^{med}$, is reported. If this value is small, it suggests the specific alignment of attention weights to inputs is not critical for the prediction.
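
A minimal sketch of this permutation test, assuming a `decode(h, alpha)` helper that re-runs only the decoder head on fixed hidden states (an assumed interface, not the paper's API):

```python
import numpy as np

def tvd(p, q):
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

def permutation_test(decode, h, alpha, y_hat, n_trials=100, seed=0):
    """Shuffle attention weights while keeping hidden states fixed and record
    the change (TVD) in the output distribution; return the median change."""
    rng = np.random.default_rng(seed)
    changes = []
    for _ in range(n_trials):
        alpha_p = rng.permutation(alpha)   # randomly permuted attention weights
        y_p = decode(h, alpha_p)           # hidden states h are held constant
        changes.append(tvd(y_p, y_hat))
    return np.median(changes)
```

If the returned median is near zero for an instance, the prediction is effectively insensitive to which tokens the attention weights were assigned to.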

4.2.2.2. Adversarial Attention

This is the most novel part of the methodology. The goal is to actively search for an attention distribution that is as different as possible from the original one, yet produces a nearly identical prediction.

  1. Problem Formulation: The authors frame this as a constrained optimization problem. They aim to find a set of $k$ new attention distributions $\{\alpha^{(1)}, ..., \alpha^{(k)}\}$ that maximize a function $f$ measuring their distance from the original attention $\hat{\alpha}$ and from each other.

    • The objective function to maximize is: $ f(\{\alpha^{(i)}\}_{i=1}^k) = \sum_{i=1}^{k} \mathbf{JSD}[\boldsymbol{\alpha}^{(i)}, \hat{\boldsymbol{\alpha}}] + \frac{1}{k(k-1)} \sum_{i<j} \mathbf{JSD}[\boldsymbol{\alpha}^{(i)}, \boldsymbol{\alpha}^{(j)}] $ This function has two parts:

      • The first term encourages each new distribution $\alpha^{(i)}$ to be far from the original distribution $\hat{\alpha}$.
      • The second term encourages the new distributions to be far from each other, promoting diversity in the counterfactual explanations found.
      • Distance is measured using Jensen-Shannon Divergence (JSD), a symmetrized and smoothed version of Kullback-Leibler (KL) divergence. The formula is: $ \mathrm{JSD}(\alpha_1, \alpha_2) = \frac{1}{2} \mathrm{KL}\left[\alpha_1 \middle\| \frac{\alpha_1 + \alpha_2}{2}\right] + \frac{1}{2} \mathrm{KL}\left[\alpha_2 \middle\| \frac{\alpha_1 + \alpha_2}{2}\right] $ JSD is bounded between 0 and $\log(2)$ (approx. 0.693 for the natural log), making it a convenient metric.
    • The optimization is subject to the constraint that the prediction change must be small: $ \forall i, \quad \mathrm{TVD}[\hat{y}(\mathbf{x}, \alpha^{(i)}), \hat{y}(\mathbf{x}, \hat{\alpha})] \leq \epsilon $ where $\epsilon$ is a small tolerance (e.g., 0.01).

  2. Optimization: In practice, this constrained problem is solved using a relaxed objective via stochastic gradient descent (the Adam optimizer). The relaxed objective combines the original objective with a term tied to violation of the constraint: $ f(\{\alpha^{(i)}\}_{i=1}^k) + \frac{\lambda}{k} \sum_{i=1}^k \max(0, \mathrm{TVD}[\hat{y}(\mathbf{x}, \alpha^{(i)}), \hat{y}(\mathbf{x}, \hat{\alpha})] - \epsilon) $ Here, $\lambda$ is a hyperparameter that balances maximizing JSD against satisfying the prediction constraint. The optimizer adjusts the adversarial attention weights $\alpha^{(i)}$ accordingly. (A simplified code sketch of this optimization appears at the end of this subsection.)

  3. Analysis: The primary result from this experiment is the maximum JSD found for an adversarial distribution that satisfies the $\epsilon$ constraint. If this value is high (close to the theoretical maximum of ~0.69), it means a very different, counterfactual attention distribution exists that serves as an equally plausible "explanation" for the same output, thereby undermining the faithfulness of the original one. The paper provides a compelling visual example in Figure 1.

    The following figure (Figure 1 from the original paper) illustrates this with a concrete example:

    Figure 1: Heatmap of attention weights induced over a negative movie review, showing observed model attention (left) and an adversarially constructed set of attention weights (right). Despite being quite dissimilar, the two distributions yield effectively the same prediction, $f(x|\alpha,\theta) = 0.01$.

This figure powerfully illustrates the core finding. The heatmap on the left shows the model's original attention, focusing heavily on the word "waste". The heatmap on the right shows an adversarially generated attention distribution that focuses on completely different words like "was" and "of". Despite the visual dissimilarity, both lead to the same negative prediction (a score of 0.01), questioning whether "waste" was truly the sole reason for the prediction.
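
The constrained search described above can be approximated with a few steps of gradient-based optimization. The sketch below is a simplified, single-adversary (k = 1) version under assumed interfaces (`decode(h, alpha)` must be differentiable with respect to `alpha`); it is not the authors' implementation.

```python
import torch

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two probability vectors (natural log)."""
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * (a.clamp_min(eps) / b.clamp_min(eps)).log()).sum(-1)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def adversarial_attention(decode, h, alpha_hat, y_hat, epsilon=0.01,
                          lam=500.0, steps=500, lr=0.01):
    """Search for attention weights far (in JSD) from alpha_hat whose prediction
    stays within epsilon TVD of the original prediction y_hat."""
    y_hat = y_hat.detach()
    logits = torch.log(alpha_hat.detach().clamp_min(1e-12)).requires_grad_(True)
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        alpha = torch.softmax(logits, dim=-1)
        tvd = 0.5 * (decode(h, alpha) - y_hat).abs().sum()
        # maximize JSD while penalizing constraint violations -> minimize the negative
        loss = -js_divergence(alpha, alpha_hat) + lam * torch.clamp(tvd - epsilon, min=0.0)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.softmax(logits, dim=-1).detach()
```

Reporting the JSD between the returned distribution and `alpha_hat` (for instances where the TVD constraint is met) reproduces, in spirit, the "max JSD" quantity analyzed in the results.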

5. Experimental Setup

5.1. Datasets

The authors selected a diverse set of datasets spanning three major NLP tasks to ensure their findings were not specific to a single domain or problem type.

  • Binary Text Classification:

    • Stanford Sentiment Treebank (SST): Movie review sentences classified as positive or negative.
    • IMDB Large Movie Reviews Corpus: Longer movie reviews, also for binary sentiment classification.
    • Twitter Adverse Drug Reaction (ADR): Tweets annotated for mentioning adverse drug reactions.
    • 20 Newsgroups (Hockey vs Baseball): A topic classification task discriminating between newsgroup posts about hockey and baseball.
    • AG News (Business vs World): A topic classification task on news articles.
    • MIMIC ICD9 (Diabetes & Anemia): Clinical notes (discharge summaries) from the MIMIC-III dataset, classified for the presence of specific medical codes (e.g., diabetes) or distinguishing between disease types (acute vs. chronic anemia). These texts are notably long.
  • Question Answering (QA):

    • CNN News Articles: Cloze-style (fill-in-the-blank) questions where the answer is an entity from the provided news article.
    • bAbI: A set of synthetic QA tasks designed to test different reasoning capabilities (e.g., using one, two, or three supporting facts).
  • Natural Language Inference (NLI):

    • SNLI (Stanford Natural Language Inference): Given a premise sentence and a hypothesis sentence, the task is to classify the relationship as entailment, contradiction, or neutral. Attention is generated over the premise words.

      The following are the results from Table 1 of the original paper:

      | Dataset | Vocab. size | Avg. length | Train size | Test size | Test performance (LSTM) |
      |---|---|---|---|---|---|
      | SST | 16175 | 19 | 3034 / 3321 | 863 / 862 | 0.81 |
      | IMDB | 13916 | 179 | 12500 / 12500 | 2184 / 2172 | 0.88 |
      | ADR Tweets | 8686 | 20 | 14446 / 1939 | 3636 / 487 | 0.61 |
      | 20 Newsgroups | 8853 | 115 | 716 / 710 | 151 / 183 | 0.94 |
      | AG News | 14752 | 36 | 30000 / 30000 | 1900 / 1900 | 0.96 |
      | Diabetes (MIMIC) | 22316 | 1858 | 6381 / 1353 | 1295 / 319 | 0.79 |
      | Anemia (MIMIC) | 19743 | 2188 | 1847 / 3251 | 460 / 802 | 0.92 |
      | CNN | 74790 | 761 | 380298 | 3198 | 0.64 |
      | bAbI (Task 1 / 2 / 3) | 40 | 8 / 67 / 421 | 10000 | 1000 | 1.0 / 0.48 / 0.62 |
      | SNLI | 20982 | 14 | 182764 / 183187 / 183416 | 3219 / 3237 / 3368 | 0.78 |

      (Train and test sizes are reported per class where applicable.)

This choice of datasets is effective because it covers a wide range of text lengths, domain vocabularies, and task complexities, making the study's conclusions more generalizable.

5.2. Evaluation Metrics

The paper uses several key metrics to quantify its findings.

  • Kendall's Tau ($\tau$)

    1. Conceptual Definition: Kendall's Tau is a non-parametric statistical measure used to quantify the ordinal association between two ranked lists. It assesses the similarity of the orderings of the data when ranked by each of the quantities. A high $\tau$ means that items ranked highly by one measure are also ranked highly by the other. It is robust to outliers and does not assume linearity, making it suitable for comparing feature importance rankings, which can be highly skewed.
    2. Mathematical Formula: $ \tau = \frac{(\text{number of concordant pairs}) - (\text{number of discordant pairs})}{\frac{1}{2} n (n-1)} $
    3. Symbol Explanation:
      • $n$: The number of items (in this case, input tokens).
      • Concordant pair: A pair of items $(i, j)$ that are in the same order in both rankings. If item $i$ is ranked higher than item $j$ in list 1, it is also ranked higher in list 2.
      • Discordant pair: A pair of items that are in the opposite order in the two rankings.
  • Total Variation Distance (TVD)

    1. Conceptual Definition: TVD measures the total difference between two probability distributions. In this paper, it is used to quantify how much a model's prediction (which is a probability distribution over output classes) changes. A TVD of 0 means the distributions are identical, while a TVD of 1 means they are entirely non-overlapping.
    2. Mathematical Formula: $ \mathrm{TVD}(\hat{y}_1, \hat{y}_2) = \frac{1}{2} \sum_{i=1}^{|\mathcal{V}|} |\hat{y}_{1i} - \hat{y}_{2i}| $
    3. Symbol Explanation:
      • $\hat{y}_1, \hat{y}_2$: Two probability distributions over the output classes.
      • $|\mathcal{V}|$: The number of classes in the output space.
      • $\hat{y}_{1i}, \hat{y}_{2i}$: The probability assigned to the $i$-th class by each distribution.
  • Jensen-Shannon Divergence (JSD)

    1. Conceptual Definition: JSD is a method of measuring the similarity between two probability distributions. It is a symmetrized and smoothed version of the more common Kullback-Leibler (KL) divergence. Unlike KL divergence, JSD is symmetric (i.e., $\mathrm{JSD}(P, Q) = \mathrm{JSD}(Q, P)$) and is always finite. It is used here to measure the dissimilarity between the original attention distribution and a counterfactual one.
    2. Mathematical Formula: $ \mathrm{JSD}(\alpha_1, \alpha_2) = \frac{1}{2} D_{KL}\left(\alpha_1 \middle\| M\right) + \frac{1}{2} D_{KL}\left(\alpha_2 \middle\| M\right) \quad \text{where} \quad M = \frac{1}{2}(\alpha_1 + \alpha_2) $
    3. Symbol Explanation:
      • $\alpha_1, \alpha_2$: The two attention distributions being compared.
      • $M$: The average of the two distributions.
      • $D_{KL}(P\|Q)$: The Kullback-Leibler divergence, which measures how one probability distribution $P$ diverges from a second, expected probability distribution $Q$. Its formula is $D_{KL}(P\|Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$.
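
For reference, all three metrics can be computed directly from probability or weight vectors. A minimal NumPy/SciPy sketch (illustrative, not taken from the paper's codebase):

```python
import numpy as np
from scipy.stats import kendalltau
from scipy.special import rel_entr  # elementwise p * log(p / q), with safe handling of zeros

def total_variation_distance(p, q):
    return 0.5 * np.abs(np.asarray(p, float) - np.asarray(q, float)).sum()

def js_divergence(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    return 0.5 * rel_entr(p, m).sum() + 0.5 * rel_entr(q, m).sum()  # bounded by log(2)

# Example: two attention distributions over four tokens
alpha_1 = [0.7, 0.1, 0.1, 0.1]
alpha_2 = [0.1, 0.1, 0.1, 0.7]
print(total_variation_distance(alpha_1, alpha_2))  # 0.6
print(js_divergence(alpha_1, alpha_2))             # below log(2) ~= 0.693
tau, _ = kendalltau([0.7, 0.2, 0.1], [0.6, 0.3, 0.1])
print(tau)                                         # 1.0: identical rankings
```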

5.3. Baselines

The paper does not use "baselines" in the traditional sense of comparing a proposed model against existing ones for performance. Instead, the experimental design is based on internal comparisons and model variants:

  • Feature Importance Measures as Baselines: The gradient-based and LOO measures serve as reference points or "baselines" for what a feature importance score should look like. The attention weights are then compared against these baselines.
  • Encoder Architectures as a Variable: The authors test three different encoder types to see how model complexity affects the findings:
    1. BiLSTM: A powerful recurrent encoder where hidden states are complex, entangled representations of the entire input.

    2. Average: A simple, non-recurrent feed-forward model where the representation of each token is just its embedding passed through a linear layer. The hidden states are not context-dependent.

    3. CNN: A convolutional encoder, which is somewhere between the two in terms of how it incorporates local context.

      The primary comparison is between the behavior of attention in the complex BiLSTM model versus the simple Average model. The hypothesis is that the entanglement of inputs in the BiLSTM's hidden states is what makes the attention weights unreliable as explanations for the original inputs.
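
As an illustration of the contrast (a sketch with assumed dimensions, not the authors' exact models), the two extreme encoder variants can be written as follows. Each BiLSTM state mixes information from the whole sequence, whereas each "Average"-encoder state depends only on its own token.

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Recurrent encoder: every hidden state h_t is a function of the entire input."""
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, x):              # x: (batch, T) token ids
        h, _ = self.lstm(self.emb(x))  # (batch, T, 2 * hidden_dim)
        return h

class AverageEncoder(nn.Module):
    """'Average' variant: per-token states are context-independent projections."""
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.proj = nn.Linear(emb_dim, hidden_dim)

    def forward(self, x):
        return self.proj(self.emb(x))  # (batch, T, hidden_dim): h_t depends only on word t
```

Attention applied on top of `AverageEncoder` states weights representations that each stand for a single token, which is why its weights track gradient and LOO importance much more closely.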

6. Results & Analysis

6.1. Core Results Analysis

The paper's results consistently and strongly support its central thesis across two main experimental thrusts.

6.1.1. Correlation Results

The first set of experiments measured the correlation between attention weights and feature importance scores (gradients and LOO).

The following are the results from Table 2 of the original paper:

| Dataset | Class | Gradient (BiLSTM) $\tau_g$, Mean ± Std. | Sig. Frac. | Gradient (Average) $\tau_g$, Mean ± Std. | Sig. Frac. | Leave-One-Out (BiLSTM) $\tau_{loo}$, Mean ± Std. | Sig. Frac. |
|---|---|---|---|---|---|---|---|
| SST | 0 | 0.40 ± 0.21 | 0.59 | 0.69 ± 0.15 | 0.93 | 0.34 ± 0.20 | 0.47 |
| SST | 1 | 0.38 ± 0.19 | 0.58 | 0.69 ± 0.14 | 0.94 | 0.33 ± 0.19 | 0.47 |
| IMDB | 0 | 0.37 ± 0.07 | 1.00 | 0.65 ± 0.05 | 1.00 | 0.30 ± 0.07 | 0.99 |
| IMDB | 1 | 0.37 ± 0.08 | 0.99 | 0.66 ± 0.05 | 1.00 | 0.31 ± 0.07 | 0.98 |
| ADR Tweets | 0 | 0.45 ± 0.17 | 0.74 | 0.71 ± 0.13 | 0.97 | 0.29 ± 0.19 | 0.44 |
| ADR Tweets | 1 | 0.45 ± 0.16 | 0.77 | 0.71 ± 0.13 | 0.97 | 0.40 ± 0.17 | 0.69 |
| 20News | 0 | 0.08 ± 0.15 | 0.31 | 0.65 ± 0.09 | 0.99 | 0.05 ± 0.15 | 0.28 |
| 20News | 1 | 0.13 ± 0.16 | 0.48 | 0.66 ± 0.09 | 1.00 | 0.14 ± 0.14 | 0.51 |
| AG News | 0 | 0.42 ± 0.11 | 0.93 | 0.77 ± 0.08 | 1.00 | 0.35 ± 0.13 | 0.80 |
| AG News | 1 | 0.35 ± 0.13 | 0.81 | 0.75 ± 0.07 | 1.00 | 0.32 ± 0.13 | 0.73 |
| Diabetes | 0 | 0.47 ± 0.06 | 1.00 | 0.68 ± 0.02 | 1.00 | 0.44 ± 0.07 | 1.00 |
| Diabetes | 1 | 0.38 ± 0.08 | 1.00 | 0.68 ± 0.02 | 1.00 | 0.38 ± 0.08 | 1.00 |
| Anemia | 0 | 0.42 ± 0.05 | 1.00 | 0.81 ± 0.01 | 1.00 | 0.42 ± 0.05 | 1.00 |
| Anemia | 1 | 0.43 ± 0.06 | 1.00 | 0.81 ± 0.01 | 1.00 | 0.44 ± 0.06 | 1.00 |
| CNN | Overall | 0.20 ± 0.06 | 0.99 | 0.48 ± 0.11 | 1.00 | 0.16 ± 0.07 | 0.95 |
| bAbI 1 | Overall | 0.23 ± 0.19 | 0.46 | 0.66 ± 0.17 | 0.97 | 0.23 ± 0.18 | 0.45 |
| bAbI 2 | Overall | 0.17 ± 0.12 | 0.57 | 0.84 ± 0.09 | 1.00 | 0.11 ± 0.13 | 0.40 |
| bAbI 3 | Overall | 0.30 ± 0.11 | 0.93 | 0.76 ± 0.12 | 1.00 | 0.31 ± 0.11 | 0.94 |
| SNLI | 0 | 0.36 ± 0.22 | 0.46 | 0.54 ± 0.20 | 0.76 | 0.44 ± 0.18 | 0.60 |
| SNLI | 1 | 0.42 ± 0.19 | 0.57 | 0.59 ± 0.18 | 0.84 | 0.43 ± 0.17 | 0.59 |
| SNLI | 2 | 0.40 ± 0.20 | 0.52 | 0.53 ± 0.19 | 0.75 | 0.44 ± 0.17 | 0.61 |

  • Analysis of Table 2:
    • BiLSTM correlations are low: For the BiLSTM encoder, the mean Kendall's Tau values ($\tau_g$ and $\tau_{loo}$) are generally modest, hovering between 0.1 and 0.4. Even the upper end of that range indicates only weak rank agreement, and for some datasets, such as 20News, the correlation is near zero. This indicates that ranking words by attention weight produces a very different ordering than ranking them by gradient or LOO importance.

    • Average encoder correlations are much higher: In stark contrast, the Average encoder shows much higher correlations with gradients ($\tau_g$), often in the 0.6-0.8 range. This is expected, as this simple model lacks the complex interactions of a BiLSTM. Attention in this context operates on less-entangled representations, making it a more direct indicator of importance. This comparison is a key piece of evidence: the lack of correlation is tied to the complexity of the encoder.

    • "Sig. Frac." is misleading: The "Significant Fraction" column shows the proportion of instances where the correlation was statistically significant. For long documents (e.g., IMDB, MIMIC), this fraction is nearly 1.0. However, this is a classic "p-hacking" scenario where large sample sizes (many tokens) can make even a tiny, meaningless correlation statistically significant. The authors rightly focus on the weak magnitude of the correlation (the Mean), not its statistical significance.

      The following figure (Figure 2 from the original paper) shows histograms of these correlation values, providing a more detailed view than the summary statistics in the table.

      Figure 2: Histogram of Kendall $\tau$ between attention and gradients. Encoder variants are denoted parenthetically; colors indicate predicted classes.

This figure visualizes the distributions of the Kendall's Tau values. For the BiLSTM models (top row), the histograms are centered around low values (e.g., 0.3-0.4), confirming the weak correlation. For the simpler Average model (bottom row), the distributions are shifted significantly to the right, toward higher correlation values.

6.1.2. Counterfactual Results

The second set of experiments investigated whether changing the attention distribution would change the model's output.

  • Attention Permutation: The following figure (Figure 6 from the original paper) plots the maximum attention weight in an instance against the median change in prediction when the attention weights are randomly scrambled.

    Figure 6: Median change in output $\Delta \hat{y}^{med}$ (x-axis) obtained by randomly permuting attention weights, plotted against the maximum attention weight $\max \hat{\alpha}$ (y-axis), across datasets and encoder variants.

    Analysis of Figure 6: A key observation is the large number of instances in which the model placed a very high attention weight on a single token (high $\max \hat{\alpha}$), yet scrambling the attention weights entirely resulted in a very small change to the prediction (low $\Delta \hat{y}^{med}$). This directly contradicts the idea that the highly attended token was singularly responsible for the output. If it were, moving its attention weight to a random, unimportant word should have drastically altered the prediction.

  • Adversarial Attention: This experiment sought to find maximally different attention distributions (high JSD) that produced an almost identical prediction (low TVD). The following figure (Figure 7 from the original paper) shows histograms of the maximum JSD found for adversarial attention distributions.

    Figure 7: Histogram of maximum adversarial JS divergence ($\epsilon$-max JSD) between original and adversarial attention distributions, subject to the constraint that the resulting predictions differ by at most $\epsilon$.

    Analysis of Figure 7: The histograms show that for most datasets and models, the distribution of max JSD is heavily skewed towards the theoretical maximum of ~0.69. This means that for a large fraction of instances, the authors' optimization was able to find an adversarial attention distribution that was almost completely different from the original one but yielded the same output. This is the paper's most damning evidence against the "attention as explanation" hypothesis. It suggests that the attention heatmaps often shown in papers are not unique and that alternative, equally valid (from the model's perspective) "explanations" exist.

    The following figure (Figure 8 from the original paper) plots the relationship between the max attention weight and the max JSD found.

    Figure 8: Densities of maximum JS divergence ($\epsilon$-max JSD) as a function of the maximum attention weight, shown as violin plots for BiLSTM and CNN models across tasks (e.g., SST, Diabetes).

    Analysis of Figure 8: One might hope that if attention is very "peaky" (high $\max \hat{\alpha}$), it would be harder to find a dissimilar adversarial distribution. This figure shows that this is not strongly the case. While there is a slight negative trend, there are many instances with high $\max \hat{\alpha}$ that also have a high max JSD. This means that even when the model seems to be "confidently" focusing on one word, there often exists a completely different configuration of attention that would have worked just as well.

An exception noted in the results is the Diabetes task, where for positive-class instances, perturbing attention does change the output, and finding high-JSD adversaries is harder. The authors hypothesize this is because the task relies on a few high-precision keywords (like "diabetes"), and attending to these is crucial. However, they stress this is the exception, not the rule.

6.2. Ablation Studies / Parameter Analysis

The paper's comparison across different encoder types (BiLSTM, Average, CNN) acts as a form of ablation study on model complexity. The key finding here is that the problems identified—low correlation and the existence of adversarial distributions—are most pronounced in the most complex and commonly used model, the BiLSTM. For the simpler Average model, attention aligns much better with other importance measures. This supports the authors' hypothesis that it is the entanglement of information in the recurrent encoder's hidden states that decouples the final attention weights from the importance of the original input tokens.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper presents strong and extensive empirical evidence that challenges the widely held belief that attention mechanisms provide faithful, transparent explanations for the predictions of neural NLP models. The authors demonstrate two key findings:

  1. For recurrent models like BiLSTMs, learned attention weights have weak and inconsistent correlations with gradient-based and leave-one-out measures of feature importance.

  2. It is often possible to find counterfactual attention distributions that are maximally different from the one learned by the model but still produce a nearly identical prediction.

    Based on these findings, the authors issue a strong conclusion: the common practice of presenting attention heatmaps as a direct explanation of "why" a model made a certain prediction is misleading and should be approached with extreme caution, if not abandoned. The story told by an attention map is not necessarily the only, or even the true, story.

7.2. Limitations & Future Work

The authors are clear about the limitations of their work and suggest directions for future research.

  • Limitations:

    • Scope of Models: The analysis primarily focuses on BiLSTM encoders with standard additive and dot-product attention. The conclusions may not generalize to all attention variants or other architectures like the Transformer (though subsequent work has shown similar issues exist there).
    • Scope of Tasks: The study is limited to classification, QA, and NLI tasks with unstructured outputs. It does not cover sequence-to-sequence tasks like machine translation or summarization.
    • "Ground Truth" for Explanations: The authors use gradients and LOO as proxies for feature importance but acknowledge that these measures are not perfect and have their own interpretation challenges. There is no absolute "ground truth" for what a correct explanation should be.
    • Correlation Metric: The use of Kendall's Tau might be sensitive to noise from many irrelevant features. However, the authors argue this is mitigated by their comparative analysis (e.g., BiLSTM vs. Average encoder).
  • Future Work: The authors hope their work will motivate the development of more principled attention mechanisms that are designed explicitly for interpretability. They point towards promising directions such as:

    • Sparse and Structured Attention: Models that impose hard, sparse constraints, forcing the model to select a small subset of inputs that are, by construction, responsible for the output (e.g., Lei et al., 2016).
    • Tying Attention to Human Rationales: Training models to produce attention patterns that align with human-provided explanations or rationales (e.g., Bao et al., 2018).

7.3. Personal Insights & Critique

This paper is a landmark contribution to the field of NLP interpretability. Its impact cannot be overstated.

  • Inspirations and Impact:

    • Shifting the Paradigm: This paper was a much-needed reality check for the NLP community. It single-handedly shifted the conversation around attention from one of naive acceptance to one of healthy skepticism. It forced researchers to be more precise about what they mean by "interpretability" and to distinguish between plausible narratives (plausibility) and mechanically accurate descriptions of the model's process (faithfulness).
    • Spurring New Research: "Attention is not Explanation" directly catalyzed a new wave of research into "faithful XAI" (Explainable AI). It inspired work on developing new evaluation metrics for explanations, designing new interpretable architectures, and post-hoc methods to "debug" existing models like those using attention.
  • Critique and Nuances:

    • Is Attention Useless for Interpretation? The paper's title is a powerful, memorable, and provocative statement. However, a more nuanced reading suggests that attention is not a faithful explanation, which is not the same as being completely useless. Attention can still be a useful debugging tool to see if a model is attending to "obviously" wrong parts of an input (e.g., punctuation, stop words). The authors' findings show it's not a sufficient explanation, but it might sometimes be a necessary first check.
    • The "Sufficient Explanation" Argument: The authors briefly touch upon a counterargument: perhaps attention provides a sufficient but not necessary explanation. The existence of an adversarial attention map (a different sufficient explanation) does not invalidate the original one. For example, in the sentence "This movie was great and fantastic," attending to "great" is a valid reason for a positive prediction, and so is attending to "fantastic." The model finding one of them is reasonable. However, this argument weakens when the adversarial attention highlights nonsensical words, as shown in Figure 1 ("was," "of"). The paper's strength is in showing that such nonsensical alternatives often exist.
    • The Role of Encoder Complexity: The paper's most critical insight, in my view, is highlighting the role of the encoder. The problem isn't attention per se, but attention on top of entangled representations. A BiLSTM hidden state $h_t$ is already a function of the entire sentence. When the model attends to $h_t$, it's not just attending to word $t$; it's attending to a complex summary of the sequence centered at $t$. Visualizing this as a weight on word $t$ alone is a fundamental misrepresentation. This explains why the simpler Average encoder, with its non-entangled representations, fares much better. This insight has profound implications for how we design and interpret complex deep learning models in general.
