H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
TL;DR Summary
The paper introduces H$_2$O, a novel KV cache eviction strategy for Large Language Models that reduces memory usage by 5-10 times while improving inference throughput by up to 29x, based on identifying and retaining high-contribution tokens known as Heavy Hitters.
Abstract
Large Language Models (LLMs), despite their recent impressive accomplishments, are notably cost-prohibitive to deploy, particularly for applications involving long-content generation, such as dialogue systems and story writing. Often, a large amount of transient state information, referred to as the KV cache, is stored in GPU memory in addition to model parameters, scaling linearly with the sequence length and batch size. In this paper, we introduce a novel approach for implementing the KV cache which significantly reduces its memory footprint. Our approach is based on the noteworthy observation that a small portion of tokens contributes most of the value when computing attention scores. We call these tokens Heavy Hitters (H$_2$). Through a comprehensive investigation, we find that (i) the emergence of H$_2$ is natural and strongly correlates with the frequent co-occurrence of tokens in the text, and (ii) removing them results in significant performance degradation. Based on these insights, we propose Heavy Hitter Oracle (H$_2$O), a KV cache eviction policy that dynamically retains a balance of recent and H$_2$ tokens. We formulate the KV cache eviction as a dynamic submodular problem and prove (under mild assumptions) a theoretical guarantee for our novel eviction algorithm which could help guide future work. We validate the accuracy of our algorithm with OPT, LLaMA, and GPT-NeoX across a wide range of tasks. Our implementation of H$_2$O with 20% heavy hitters improves the throughput over three leading inference systems, DeepSpeed Zero-Inference, Hugging Face Accelerate, and FlexGen, by up to 29×, 29×, and 3× on OPT-6.7B and OPT-30B. With the same batch size, H$_2$O can reduce the latency by up to 1.9×. The code is available at https://github.com/FMInference/H2O.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
1.2. Authors
Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, Beidi Chen
Affiliations:
- University of Texas at Austin
- Stanford University
- University of California, San Diego
- University of California, Berkeley
- Adobe Research
- Meta AI (FAIR)
- Carnegie Mellon University
1.3. Journal/Conference
Published at: NeurIPS 2023 (Thirty-seventh Conference on Neural Information Processing Systems). Note: The provided metadata indicates a publication date of June 2023 (likely the arXiv preprint release), but this paper is widely recognized as a NeurIPS 2023 conference paper, a top-tier venue in Artificial Intelligence and Machine Learning.
1.4. Publication Year
2023
1.5. Abstract
This paper addresses the deployment cost bottleneck of Large Language Models (LLMs), specifically the high memory consumption of the KV cache during long-content generation. The authors observe that a small subset of tokens, termed "Heavy Hitters" (H$_2$), contributes the vast majority of value to attention scores. Based on this, they propose H$_2$O, a novel Key-Value (KV) cache eviction policy. H$_2$O dynamically retains a mix of the most recent tokens and these heavy hitters, allowing for a significant reduction in cache size (up to 5-10×) without compromising generation quality. The method increases throughput by up to 29× compared to leading systems like DeepSpeed and Accelerate.
1.6. Original Source Link
https://arxiv.org/abs/2306.14048 (Preprint/Official Version)
2. Executive Summary
2.1. Background & Motivation
Core Problem: The deployment of Large Language Models (LLMs) is extremely expensive, primarily due to memory bottlenecks. While model parameters are static, the KV Cache (Key-Value Cache)—which stores intermediate states to prevent re-computation during text generation—grows linearly with sequence length and batch size. For a 30B parameter model with a batch size of 128 and sequence length of 1024, the KV cache alone can consume 180GB of GPU memory. This makes long-context generation and large-batch inference prohibitive.
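As a rough sanity check on that 180GB figure, here is a back-of-the-envelope sketch; the model configuration (48 layers and hidden size 7168 for OPT-30B) and fp16 storage are my assumptions, not stated in this section:

```python
# Back-of-the-envelope KV cache size for the quoted setting.
# Assumed model configuration (OPT-30B): 48 layers, hidden size 7168; fp16 storage.
num_layers = 48
hidden_size = 7168        # per layer, K and V each have this width (num_heads * head_dim)
batch_size = 128
seq_len = 1024
bytes_per_elem = 2        # fp16

# Two tensors (K and V) per layer, each of shape [batch_size, seq_len, hidden_size].
kv_cache_bytes = 2 * num_layers * batch_size * seq_len * hidden_size * bytes_per_elem
print(f"KV cache ≈ {kv_cache_bytes / 1e9:.0f} GB")   # ≈ 180 GB
```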
Importance & Gaps:
- Memory Wall: Standard hardware cannot fit the KV cache for long sequences, forcing systems to offload data to slower CPU memory or disk, which drastically degrades performance.
- Existing Solutions' Limitations:
  - Sparse Attention (e.g., Reformer): Often requires training from scratch or fine-tuning and doesn't always reduce inference memory effectively.
  - Approximations (e.g., Sparse Transformer): When applied directly to pre-trained LLMs without fine-tuning, they suffer from high "miss rates" (dropping important information), leading to severe accuracy degradation.
  - Heuristic Eviction: Simple policies like "keep only the most recent tokens" fail to capture long-range dependencies crucial for tasks like summarization or QA.

Innovative Idea: The paper identifies that not all tokens are created equal. Empirical analysis reveals that attention matrices are over 95% sparse, and a small set of influential tokens ("Heavy Hitters") consistently receives high attention scores throughout the generation process.
2.2. Main Contributions / Findings
- Discovery of Heavy Hitters (H$_2$): The authors demonstrate that tokens critical for generation (Heavy Hitters) emerge naturally and that their accumulated attention scores follow a power-law distribution. These tokens often correlate with frequent co-occurrence in the text.
- H$_2$O Framework: A novel, training-free KV cache eviction policy that greedily retains tokens with the highest accumulated attention scores alongside a small window of recent tokens.
- Theoretical Guarantee: The authors formulate the KV cache eviction problem as a Dynamic Submodular Maximization problem and provide a theoretical proof that their greedy algorithm delivers near-optimal performance under mild assumptions.
- System Performance: Implementing H$_2$O on the FlexGen inference engine resulted in:
  - Throughput: Up to 29× improvement over DeepSpeed Zero-Inference and Hugging Face Accelerate, and 3× over standard FlexGen on OPT-6.7B/30B models.
  - Memory: Reduced KV cache memory footprint by 5-10× (e.g., using only a 20% budget) with negligible accuracy loss across tasks from HELM and lm-eval-harness.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand H$_2$O, one must grasp the basics of Transformer inference and caching.

- KV Cache (Key-Value Cache): In Transformer-based LLMs (like GPT), generating text is an auto-regressive process (token by token). To generate the $t$-th token, the model needs to attend to all previous tokens $1$ to $t-1$.
  - Instead of recomputing the Key ($K$) and Value ($V$) vectors for all previous tokens at every step, the system caches them in GPU memory.
  - Problem: This cache grows by two vectors (Key and Value) for every layer and every head at every step. For long sequences, this "transient state" becomes larger than the model weights themselves.
- Attention Mechanism: The core operation in Transformers. For a query $Q$, keys $K$, and values $V$: $ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right)V $
  - $Q$: Represents the current token being generated.
  - $K$, $V$: Represent the "features" of previous tokens.
  - $QK^\top$: Computes a "similarity" or "importance" score (attention score) for each previous token relative to the current one.
  - Heavy Hitter: A token in the history whose Key ($K$) consistently yields a high product with the Query ($Q$), meaning it is "attended to" frequently and strongly (see the sketch after this list).
- Submodularity: A property of set functions where adding an element to a smaller set gives a larger gain than adding it to a larger set (diminishing returns).
  - Formally, $f(S \cup \{x\}) - f(S) \geq f(T \cup \{x\}) - f(T)$ for $S \subseteq T$.
  - This property is crucial because greedy algorithms (picking the single best item at each step) are known to provide mathematically provable near-optimal solutions for submodular maximization problems.
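As a purely illustrative sketch of these pieces, the following NumPy snippet computes softmax attention weights for a stream of toy queries and accumulates per-token attention mass, the statistic H$_2$O later uses to rank heavy hitters; the shapes, names, and random data are assumptions, not the paper's code:

```python
import numpy as np

def attention_weights(q, K):
    """Softmax attention weights of one query q (shape [d]) over cached keys K (shape [t, d])."""
    logits = K @ q / np.sqrt(q.shape[-1])
    weights = np.exp(logits - logits.max())   # subtract max for numerical stability
    return weights / weights.sum()

rng = np.random.default_rng(0)
d, steps = 16, 32
K_cache = np.zeros((0, d))                    # keys seen so far
acc_score = np.zeros(0)                       # accumulated attention mass per cached token

for _ in range(steps):
    k_new = rng.standard_normal(d)            # key of the newly generated token
    K_cache = np.vstack([K_cache, k_new])
    acc_score = np.append(acc_score, 0.0)
    q = rng.standard_normal(d)                # query of the current decoding step
    acc_score += attention_weights(q, K_cache)

# Tokens with the largest accumulated scores are the heavy-hitter candidates.
print("top-5 heavy hitters (token indices):", np.argsort(acc_score)[-5:][::-1])
```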
3.2. Previous Works
- Sparse Attention (Training-time): Methods like Reformer and Sparse Transformer design fixed patterns (e.g., local windows, strided patterns) to reduce attention complexity from $O(n^2)$ to $O(n \log n)$ or $O(n\sqrt{n})$.
  - Limitation: These usually require training the model with these constraints. H$_2$O targets pre-trained models at inference time.
- System Optimizations:
  - FlexGen: An inference engine optimizing offloading strategies (moving data between GPU, CPU, and disk). H$_2$O is implemented on top of FlexGen to reduce the amount of data that needs to be stored and moved.
  - FlashAttention: Optimizes the computation of attention to be memory-efficient and fast but does not inherently reduce the storage requirement of the KV cache across decoding steps.
- Token Pruning/Compression: Methods like SpAtten prune tokens based on importance but often require custom hardware or retraining/fine-tuning to be fully effective without accuracy loss.
3.3. Technological Evolution & Differentiation
- Evolution: Early optimizations focused on model compression (quantization, pruning weights). As sequence lengths grew (from 512 to 32k+), the focus shifted to the KV cache.
- Differentiation:
  - Unlike static policies (e.g., keeping only the last 2048 tokens), H$_2$O is dynamic and content-aware: it decides which tokens to keep based on their accumulated importance during generation itself.
  - Unlike weight pruning, H$_2$O prunes the runtime state (the KV cache).
  - Unlike methods requiring fine-tuning, H$_2$O works out-of-the-box with pre-trained models (OPT, LLaMA, GPT-NeoX).

The following figure (Figure 1 from the original paper) illustrates the difference between H$_2$O and other policies. While standard sparse policies (e.g., strided) miss important tokens (red dots), H$_2$O retains them dynamically.
The figure is a schematic of the H$_2$O framework, comparing sparsity strategies: dynamic sparsity, static sparsity (strided and local), and H$_2$O-based sparsity. The formula beneath the diagram describes the H$_2$O operation, and the chart on the right shows accuracy versus memory reduction for the different sparsity strategies.
4. Methodology
4.1. Principles
The H$_2$O methodology is built on two key empirical observations regarding pre-trained LLMs:

- Attention Sparsity: Even in dense models, the attention matrix is highly sparse (over 95% of entries are near zero) during inference. This means that, for any given token generation, only a tiny fraction of previous tokens contributes significantly to the attention output.
- Heavy Hitters (H$_2$): The importance of tokens is not uniform or random. A small set of tokens ("Heavy Hitters") frequently receives high attention scores across many decoding steps, and their accumulated attention scores follow a power-law distribution.

Core Intuition: If a token was important in the past (high accumulated attention), it is likely to be important in the future. Therefore, instead of keeping all history, we can safely evict tokens with low accumulated scores. (A small measurement sketch follows Figure 2 below.)
The following figure (Figure 2 from the original paper) visualizes this sparsity and the "Heavy Hitter" phenomenon (red dots showing accumulated scores vs. word index).
The figure contains four panels: (a) attention sparsity of different models across layers; (b) the relationship between word index and co-occurrence counts, which follows a smooth overall curve; (c) accuracy of the baseline versus the model with heavy-hitter tokens removed; and (d) a radar chart of accuracy across multiple tasks for different scoring statistics. These panels support the effectiveness of the proposed H$_2$O method.
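A hedged sketch of how one might check both observations on an arbitrary causal attention map is shown below; the 0.01 threshold and the synthetic data are illustrative assumptions, not the paper's measurement protocol:

```python
import numpy as np

def sparsity_and_accumulated_scores(attn, threshold=0.01):
    """attn: [seq_len, seq_len] causal (lower-triangular) softmax attention matrix.
    Returns the fraction of negligible weights and the sorted per-token accumulated scores."""
    causal = np.tril(np.ones(attn.shape, dtype=bool))      # positions a causal model can attend to
    sparsity = np.mean(attn[causal] < threshold)           # share of near-zero attention weights
    acc_scores = attn.sum(axis=0)                          # column sums = accumulated attention per token
    return sparsity, np.sort(acc_scores)[::-1]             # sorted scores expose the heavy tail

# Toy, synthetic attention map; in practice one would pass attention maps from a real model.
rng = np.random.default_rng(0)
raw = np.tril(np.exp(3.0 * rng.standard_normal((128, 128))))
attn = raw / raw.sum(axis=1, keepdims=True)
s, sorted_scores = sparsity_and_accumulated_scores(attn)
print(f"sparsity: {s:.1%}; top-5 accumulated scores: {np.round(sorted_scores[:5], 2)}")
```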
4.2. Core Methodology In-depth (Layer by Layer)
The authors formalize the generative process with a limited KV cache budget and propose a greedy eviction algorithm.
4.2.1. Problem Formulation: Generative Process with Eviction
Let $Q \in \mathbb{R}^{n \times d}$ be the query matrix and $K \in \mathbb{R}^{n \times d}$ be the key matrix for a sequence of length $n$. Let $k$ be the KV cache budget (the maximum number of tokens stored), where $k < n$.
The generative process at step $i$ (predicting the next token) involves:

- Current Query: $Q_{i,*}$ (the query vector for the current step).
- Cache State: $S_i \subseteq [i]$, the set of indices of tokens retained in the cache. The constraint is $|S_i| \leq k$.
- Eviction Policy: A function mapping $S_{i-1}$ to $S_i$ that updates the cache. The constraint is usually $|S_i \setminus S_{i-1}| \leq 1$ (we add the new token and potentially evict one old token).

The attention output (simplified here to the normalized probability distribution term) is calculated as:
$ o_i := D_i^{-1} \cdot \left(\exp\left(Q_{i, *}(K_{S_i, *})^\top\right) - \mathbf{1}_{[i] \setminus S_i}\right) $
Where $D_i$ is the normalization factor (the denominator of the softmax), computed only over the retained tokens $S_i$:
$ D_i := \left(\exp\left(Q_{i, *}(K_{S_i, *})^\top\right) - \mathbf{1}_{[i] \setminus S_i}\right) \cdot \mathbf{1}_i $
- Explanation:
  - $K_{S_i, *}$: The keys corresponding to the indices in $S_i$ (rows outside $S_i$ are masked).
  - The term $-\mathbf{1}_{[i] \setminus S_i}$ implies that tokens not in $S_i$ effectively contribute 0 (i.e., are masked out) to the normalization, simulating their eviction. (A small code sketch follows.)
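A minimal sketch of this masked normalization, where only the retained indices in $S_i$ enter the softmax; the variable names are mine, not the paper's implementation:

```python
import numpy as np

def attention_over_retained(q_i, K, retained):
    """Softmax attention of query q_i computed only over the retained key rows K[S_i].
    Evicted tokens contribute nothing to the normalizer D_i."""
    idx = sorted(retained)                      # indices in S_i
    logits = K[idx] @ q_i                       # Q_{i,*} (K_{S_i,*})^T
    exp_scores = np.exp(logits - logits.max())  # stabilized exponentials
    D_i = exp_scores.sum()                      # normalizer computed over S_i only
    return exp_scores / D_i                     # o_i restricted to retained positions

rng = np.random.default_rng(1)
K = rng.standard_normal((6, 8))                 # 6 past tokens, head dimension 8
q = rng.standard_normal(8)
print(attention_over_retained(q, K, retained={0, 2, 4, 5}))
```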
4.2.2. H$_2$O Eviction Algorithm
The goal is to select $S_i$ so as to maximize the retention of important information. The authors propose a greedy algorithm based on accumulated attention scores.
The Score Function: The importance of a token is measured by summing its attention contributions over time. Let $F_{\text{score}}$ denote the scoring function. The algorithm maintains the accumulated attention score of every token currently in the cache.
Algorithm Flow (Step-by-Step):
- Initialization: Start with an empty cache, $S_0 \leftarrow \emptyset$.
- Generation Loop (for $i = 1$ to $n$):
  - Step 2a: Cache Filling (if not full): If $i \leq k$ (the cache is not yet full), simply add the current token to the cache: $S_i \leftarrow S_{i-1} \cup \{i\}$.
  - Step 2b: Greedy Eviction (if cache is full): If $i > k$, we must evict one token to make room for the new token $i$.
    - Calculate Attention: Compute the attention scores of the current query $Q_{i,*}$ against all keys currently in the cache $S_{i-1}$.
    - Accumulate Scores: Update the historical importance score of each cached token by adding its contribution at the current step.
    - Identify Victim: Find the token $u$ in the set $S_{i-1} \cup \{i\}$ whose removal hurts the objective the least.

The formal greedy selection is:
$
u \leftarrow \arg \max_{v \in (S_{i-1} \cup \{i\})} F_{\text{score}}\left( (S_{i-1} \cup \{i\}) \setminus \{v\} \right)
$
- Clarification: We want the remaining set to have the maximum total score. Therefore, we remove the token $u$ whose removal reduces the score the least (i.e., the token with the smallest accumulated score).
- Update Cache: $S_i \leftarrow (S_{i-1} \cup \{i\}) \setminus \{u\}$. (A simplified code sketch follows Figure 3 below.)
The following figure (Figure 3 from the original paper) illustrates this process. At step 5, the cache is full (budget=3). The algorithm compares the accumulated scores of tokens in the cache plus the new token, and evicts the one with the lowest score.
The figure is a schematic of Algorithm 1 over two consecutive decoding steps. At decoding step 4, the key-value pairs for "Children laughed and played in the sunny park" are computed and weighted. At decoding step 5, the handling of the new key and the update of the global (accumulated) statistics are shown, with the corresponding computed values on the right. The figure also illustrates why an alternative eviction strategy would be infeasible.
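Below is a simplified, hedged sketch of this greedy loop for a single head, tracking only key vectors and accumulated scores; it follows the spirit of Algorithm 1 but is not the authors' implementation:

```python
import numpy as np

def greedy_evict_step(q, K_cache, acc_scores, k_new, budget):
    """One decoding step: append the new key, attend over the cache (including the
    new token), accumulate attention mass, then evict the lowest-scoring token
    if the budget is exceeded."""
    K_cache = np.vstack([K_cache, k_new[None, :]])
    acc_scores = np.append(acc_scores, 0.0)
    logits = K_cache @ q / np.sqrt(q.shape[-1])
    weights = np.exp(logits - logits.max())
    acc_scores = acc_scores + weights / weights.sum()   # F_score update: accumulated attention
    if len(K_cache) > budget:
        victim = int(np.argmin(acc_scores))             # token whose removal costs the least score
        K_cache = np.delete(K_cache, victim, axis=0)
        acc_scores = np.delete(acc_scores, victim)
    return K_cache, acc_scores

rng = np.random.default_rng(0)
K_cache, acc_scores = np.zeros((0, 16)), np.zeros(0)
for _ in range(10):
    q, k_new = rng.standard_normal(16), rng.standard_normal(16)
    K_cache, acc_scores = greedy_evict_step(q, K_cache, acc_scores, k_new, budget=3)
print(K_cache.shape)   # (3, 16): the cache never exceeds the budget
```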
4.2.3. Practical Implementation: H$_2$O Policy
While the greedy algorithm is theoretically sound, a pure greedy approach might evict the most recent tokens if they have not yet accumulated enough score, breaking local context coherence.
Therefore, the practical H$_2$O policy used in the experiments is a hybrid:
- Recent Tokens (Local Window): Always keep a fixed window of the most recent tokens. This guarantees that the immediate local context is preserved (crucial for grammar and immediate flow).
- Heavy Hitters (H$_2$): For the remaining budget, keep the tokens with the highest accumulated attention scores from the past.

The eviction logic effectively becomes:

- When a new token arrives, once it falls out of the "Recent Window", it becomes a candidate for the "Heavy Hitter" set.
- If the "Heavy Hitter" set is full, the token with the lowest accumulated score in that set is evicted (see the sketch below).
5. Experimental Setup
5.1. Datasets
The paper validates H_2O on a wide range of tasks involving text generation, reasoning, and summarization.
- Evaluation Frameworks:
- HELM (Holistic Evaluation of Language Models): Used for tasks like XSUM (Extreme Summarization) and CNN/Daily Mail (News Summarization).
- lm-eval-harness: Used for tasks like COPA, MathQA, OpenBookQA, PiQA, RTE, and Winogrande.
- Characteristics: These datasets cover zero-shot, one-shot, and few-shot (5-shot) settings. Summarization tasks (XSUM, CNN/DM) are particularly important as they require long-context understanding, making them sensitive to cache eviction.
5.2. Evaluation Metrics
- Accuracy / Exact Match / ROUGE:
- Standard NLP metrics to measure the correctness of the output.
- ROUGE-2: Measures the overlap of bigrams between the generated summary and the reference.
$
\text{ROUGE-2} = \frac{\sum_{S \in \{\text{Ref}\}} \sum_{\text{bigram} \in S} \text{Count}_{\text{match}}(\text{bigram})}{\sum_{S \in \{\text{Ref}\}} \sum_{\text{bigram} \in S} \text{Count}(\text{bigram})}
$
- $\text{Count}_{\text{match}}(\text{bigram})$: Number of bigrams co-occurring in the candidate and the reference. (A toy computation of these metrics follows this list.)
- Throughput (token/s):
- The number of tokens generated per second. Higher is better.
- $ \text{Throughput} = \frac{\text{Number of generated tokens}}{\text{Total Generation Time (s)}} $
- Latency (s/ms):
- The time taken to generate a response or process a prompt. Lower is better.
- Memory Footprint (GB):
- The amount of GPU memory occupied by the KV cache.
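For concreteness, here is a toy computation of the quality and throughput metrics; the strings and timing numbers are invented, not taken from the paper:

```python
from collections import Counter

def rouge2_recall(candidate, reference):
    """Bigram overlap between candidate and reference, normalized by reference bigrams."""
    bigrams = lambda text: Counter(zip(text.split(), text.split()[1:]))
    cand, ref = bigrams(candidate), bigrams(reference)
    matched = sum((cand & ref).values())            # clipped count of shared bigrams
    return matched / max(sum(ref.values()), 1)

print(rouge2_recall("the cat sat on the mat", "the cat lay on the mat"))   # 0.6

generated_tokens, total_seconds = 512 * 64, 1740.0   # hypothetical batch run
print(f"throughput ≈ {generated_tokens / total_seconds:.1f} tokens/s")
```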
5.3. Baselines
The paper compares H_2O against several strong baselines:
- Full Cache: The standard implementation keeping all KV pairs (Upper bound on accuracy, lower bound on efficiency).
- Local (Window): Keeping only the most recent tokens. This is the simplest eviction policy.
- Sparse Transformer (Strided/Fixed): Static sparse patterns where tokens are kept at fixed intervals or positions.
- Leading Inference Systems:
  - DeepSpeed Zero-Inference: Optimizes memory via partitioning and offloading but can be slow due to communication overhead.
  - Hugging Face Accelerate: Standard offloading library.
  - FlexGen: State-of-the-art high-throughput generation engine (offloading-based).
6. Results & Analysis
6.1. Core Results Analysis
The experiments demonstrate that H_2O achieves a superior trade-off between memory efficiency and model accuracy compared to all baselines.
1. Accuracy vs. Cache Budget:
As shown in the following figure (Figure 4 from the original paper), H$_2$O (blue line) consistently maintains high accuracy (ROUGE-2) even when the KV cache budget is reduced to 20% (approximately 5× compression). In contrast, the "Local" policy (green line) and Sparse Transformer suffer catastrophic performance drops once the budget is reduced, failing to capture the long-range dependencies required for summarization.

2. Throughput Improvement:
H_2O significantly outperforms existing systems. By reducing the memory footprint, it allows for larger batch sizes and reduces the need for expensive I/O offloading (swapping data between GPU and CPU).
The following are the results from Table 3 of the original paper, showing generation throughput on a T4 GPU:
| System (tokens/s) | 512+32, 6.7B | 512+32, 30B | 512+512, 6.7B | 512+512, 30B | 512+1024, 6.7B | 512+1024, 30B |
|---|---|---|---|---|---|---|
| Accelerate | 20.4 (2, G) | 0.6 (8, C) | 15.5 (1, G) | 0.6 (8, C) | 5.6 (16, C) | 0.6 (8, C) |
| DeepSpeed | 10.2 (16, C) | 0.6 (4, C) | 9.6 (16, C) | 0.6 (4, C) | 10.1 (16, C) | 0.6 (4, C) |
| FlexGen | 20.2 (2, G) | 8.1 (144, C) | 16.8 (1, G) | 8.5 (80, C) | 16.9 (1, G) | 7.1 (48, C) |
| H2O (20%) | **35.1** (4, G) | **12.7** (728, C) | **51.7** (4, G) | **18.83** (416, C) | **52.1** (4, G) | **13.82** (264, C) |
- Key Insight: For OPT-30B (512+512), H$_2$O achieves 18.83 tokens/s, which is more than 2× faster than FlexGen (8.5) and roughly 30× faster than DeepSpeed/Accelerate (0.6).
- Note: The values in parentheses indicate (Batch Size, Offload Target: G=GPU, C=CPU). H$_2$O enables much larger batch sizes (e.g., 416 vs. 80 for FlexGen), which drives the throughput gain.
6.2. Ablation Studies / Parameter Analysis
1. Importance of Heavy Hitters vs. Local Tokens: The authors conducted an ablation study to verify if both components (Heavy Hitters + Recent Tokens) are necessary. The following are the results from Table 9 (Appendix C.6) of the original paper:
| Tasks | Models | Full | w. Local | w. H2 | w. Local + H2 |
|---|---|---|---|---|---|
| PiQA | OPT-13B | 77.37 | 54.62 | 76.12 | **77.26** |
| PiQA | OPT-30B | 78.51 | 55.82 | 67.25 | **78.45** |
| OpenBookQA | OPT-13B | 41.40 | 25.60 | 30.40 | **41.20** |
| OpenBookQA | OPT-30B | 43.20 | 25.20 | 26.60 | **43.00** |
- Analysis: Neither "Local" alone nor "H2" alone matches the Full Cache performance. "Local" suffers severe degradation (e.g., 55.82 vs. 78.51 on PiQA with OPT-30B). "H2" alone is better but still lags. The combination (H$_2$O) recovers almost all of the performance, indicating that recent tokens provide syntactic coherence while Heavy Hitters provide semantic anchoring.
2. Compatibility with Quantization:
The paper also explores combining H_2O with 4-bit weight quantization. Surprisingly, the combination sometimes yields better accuracy than quantization alone (Table 6 in the paper), suggesting a regularization effect. It further improves throughput by allowing even larger batch sizes.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces H$_2$O, a rigorously grounded yet simple oracle for managing the KV cache in Large Language Models. By discovering the "Heavy Hitter" phenomenon, in which a small subset of tokens dominates attention scores, the authors designed a greedy eviction policy that retains only the most critical information (Heavy Hitters plus recent tokens). H$_2$O reduces memory usage by up to 10× and boosts inference throughput by up to 29× without retraining the model or significantly sacrificing generation quality. The theoretical formulation connects this eviction policy to dynamic submodular maximization, providing a solid mathematical foundation.
7.2. Limitations & Future Work
- Limitations:
- Bias towards historical tokens: The current accumulated score metric might favor very old tokens that have had "more time" to accumulate scores, potentially making it harder for new, important concepts to establish themselves as heavy hitters quickly. (Though the "Recent" window mitigates this).
- Potential for adversarial inputs: The paper does not extensively explore if specific inputs could break the heavy hitter assumption.
- Future Work:
- Offloading Policy: The authors suggest extending the Heavy Hitter concept to MLP blocks (which also show sparsity) to design better offloading policies for model weights, not just KV cache.
- Long-Context Generation: Further validating H$_2$O on extremely long sequences (beyond 10k tokens) to see whether the heavy hitter set stabilizes or drifts.
7.3. Personal Insights & Critique
- Inspiration: The connection between "Heavy Hitters" in attention and "Co-occurrence" in text (Figure 2b) is fascinating. It suggests that the attention mechanism naturally learns to focus on "hub" words that bind the context together. This provides a tangible, interpretable view of what the "Value" vectors in the cache actually represent.
- Applicability: H$_2$O is highly practical because it requires no retraining. It can be plugged into existing inference pipelines (such as vLLM or Hugging Face) to save memory.
- Critique: While the "Heavy Hitter" theory is strong, the reliance on a fixed "Recent Window" in the implementation suggests that the pure greedy algorithm based on accumulated scores is not sufficient on its own. The "Recent" window acts as a necessary heuristic for the immediate local structure of language, which may not yet have accumulated a high score but is critical for next-token prediction. A more unified metric that naturally weights recency and long-term importance, without a hard split, would be an interesting theoretical improvement.