H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
TL;DR Summary
The paper introduces H$_2$O, a novel KV cache eviction strategy for Large Language Models that reduces memory usage by 5-10 times while improving inference throughput by up to 29x, based on identifying and retaining high-contribution tokens known as Heavy Hitters.
Abstract
Large Language Models (LLMs), despite their recent impressive accomplishments, are notably cost-prohibitive to deploy, particularly for applications involving long-content generation, such as dialogue systems and story writing. Often, a large amount of transient state information, referred to as the KV cache, is stored in GPU memory in addition to model parameters, scaling linearly with the sequence length and batch size. In this paper, we introduce a novel approach for implementing the KV cache which significantly reduces its memory footprint. Our approach is based on the noteworthy observation that a small portion of tokens contributes most of the value when computing attention scores. We call these tokens Heavy Hitters (H$_2$). Through a comprehensive investigation, we find that (i) the emergence of H$_2$ is natural and strongly correlates with the frequent co-occurrence of tokens in the text, and (ii) removing them results in significant performance degradation. Based on these insights, we propose Heavy Hitter Oracle (H$_2$O), a KV cache eviction policy that dynamically retains a balance of recent and H$_2$ tokens. We formulate the KV cache eviction as a dynamic submodular problem and prove (under mild assumptions) a theoretical guarantee for our novel eviction algorithm which could help guide future work. We validate the accuracy of our algorithm with OPT, LLaMA, and GPT-NeoX across a wide range of tasks. Our implementation of H$_2$O with 20% heavy hitters improves the throughput over three leading inference systems, DeepSpeed Zero-Inference, Hugging Face Accelerate, and FlexGen, by up to 29×, 29×, and 3× on OPT-6.7B and OPT-30B. With the same batch size, H$_2$O can reduce the latency by up to 1.9×. The code is available at https://github.com/FMInference/H2O.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
1.2. Authors
Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, Beidi Chen
Affiliations:
- University of Texas at Austin
- Stanford University
- University of California, San Diego
- University of California, Berkeley
- Adobe Research
- Meta AI (FAIR)
- Carnegie Mellon University
1.3. Journal/Conference
Published at: NeurIPS 2023 (Thirty-seventh Conference on Neural Information Processing Systems). Note: The provided metadata indicates a publication date of June 2023 (likely the arXiv preprint release), but this paper is widely recognized as a NeurIPS 2023 conference paper, a top-tier venue in Artificial Intelligence and Machine Learning.
1.4. Publication Year
2023
1.5. Abstract
This paper addresses the deployment cost bottleneck of Large Language Models (LLMs), specifically the high memory consumption of the KV cache during long-content generation. The authors observe that a small subset of tokens, termed "Heavy Hitters" (H$_2$), contributes the vast majority of value to attention scores. Based on this, they propose H$_2$O, a novel Key-Value (KV) cache eviction policy. H$_2$O dynamically retains a mix of the most recent tokens and these heavy hitters, allowing for a significant reduction in cache size (up to 5-10×) without compromising generation quality. The method increases throughput by up to 29× compared to leading systems like DeepSpeed and Accelerate.
1.6. Original Source Link
https://arxiv.org/abs/2306.14048 (Preprint/Official Version)
2. Executive Summary
2.1. Background & Motivation
Core Problem: The deployment of Large Language Models (LLMs) is extremely expensive, primarily due to memory bottlenecks. While model parameters are static, the KV Cache (Key-Value Cache)—which stores intermediate states to prevent re-computation during text generation—grows linearly with sequence length and batch size. For a 30B parameter model with a batch size of 128 and sequence length of 1024, the KV cache alone can consume 180GB of GPU memory. This makes long-context generation and large-batch inference prohibitive.
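As a rough sanity check on that 180GB figure, here is a back-of-the-envelope sketch; the model configuration (48 layers and hidden size 7168 for OPT-30B) and fp16 storage are my assumptions, not stated in this section:

```python
# Back-of-the-envelope KV cache size for the quoted setting.
# Assumed model configuration (OPT-30B): 48 layers, hidden size 7168; fp16 storage.
num_layers = 48
hidden_size = 7168        # per layer, K and V each have this width (num_heads * head_dim)
batch_size = 128
seq_len = 1024
bytes_per_elem = 2        # fp16

# Two tensors (K and V) per layer, each of shape [batch_size, seq_len, hidden_size].
kv_cache_bytes = 2 * num_layers * batch_size * seq_len * hidden_size * bytes_per_elem
print(f"KV cache ≈ {kv_cache_bytes / 1e9:.0f} GB")   # ≈ 180 GB
```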
Importance & Gaps:
- Memory Wall: Standard hardware cannot fit the KV cache for long sequences, forcing systems to offload data to slower CPU memory or disk, which drastically degrades performance.
- Existing Solutions' Limitations:
  - Sparse Attention (e.g., Reformer): Often requires training from scratch or fine-tuning and doesn't always reduce inference memory effectively.
  - Approximations (e.g., Sparse Transformer): When applied directly to pre-trained LLMs without fine-tuning, they suffer from high "miss rates" (dropping important information), leading to severe accuracy degradation.
  - Heuristic Eviction: Simple policies like "keep only the most recent tokens" fail to capture long-range dependencies crucial for tasks like summarization or QA.

Innovative Idea: The paper identifies that not all tokens are created equal. Empirical analysis reveals that attention matrices are over 95% sparse, and a small set of influential tokens ("Heavy Hitters") consistently receives high attention scores throughout the generation process.
2.2. Main Contributions / Findings
- Discovery of Heavy Hitters (H$_2$): The authors demonstrate that tokens critical for generation (Heavy Hitters) emerge naturally and that their accumulated attention scores follow a power-law distribution. These tokens often correlate with frequent co-occurrence in the text.
- H$_2$O Framework: A novel, training-free KV cache eviction policy that greedily retains tokens with the highest accumulated attention scores alongside a small window of recent tokens.
- Theoretical Guarantee: The authors formulate the KV cache eviction problem as a Dynamic Submodular Maximization problem and provide a theoretical proof that their greedy algorithm delivers near-optimal performance under mild assumptions.
- System Performance: Implementing H$_2$O on the FlexGen inference engine resulted in:
  - Throughput: Up to 29× improvement over DeepSpeed Zero-Inference and Hugging Face Accelerate, and 3× over standard FlexGen on OPT-6.7B/30B models.
  - Memory: Reduced KV cache memory footprint by 5-10× (e.g., using only a 20% budget) with negligible accuracy loss across tasks from HELM and lm-eval-harness.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand H$_2$O, one must grasp the basics of Transformer inference and caching.

- KV Cache (Key-Value Cache): In Transformer-based LLMs (like GPT), generating text is an auto-regressive process (token by token). To generate the $t$-th token, the model needs to attend to all previous tokens $1$ to $t-1$.
  - Instead of recomputing the Key ($K$) and Value ($V$) vectors for all previous tokens at every step, the system caches them in GPU memory.
  - Problem: This cache grows by two vectors (Key and Value) for every layer and every head at every step. For long sequences, this "transient state" becomes larger than the model weights themselves.
- Attention Mechanism: The core operation in Transformers. For a query $Q$, keys $K$, and values $V$: $ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right)V $
  - $Q$: Represents the current token being generated.
  - $K$, $V$: Represent the "features" of previous tokens.
  - $QK^\top$: Computes a "similarity" or "importance" score (attention score) for each previous token relative to the current one.
  - Heavy Hitter: A token in the history whose Key ($K$) consistently yields a high product with the Query ($Q$), meaning it is "attended to" frequently and strongly (see the sketch after this list).
- Submodularity: A property of set functions where adding an element to a smaller set gives a larger gain than adding it to a larger set (diminishing returns).
  - Formally, $f(S \cup \{x\}) - f(S) \geq f(T \cup \{x\}) - f(T)$ for $S \subseteq T$.
  - This property is crucial because greedy algorithms (picking the single best item at each step) are known to provide mathematically provable near-optimal solutions for submodular maximization problems.
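As a purely illustrative sketch of these pieces, the following NumPy snippet computes softmax attention weights for a stream of toy queries and accumulates per-token attention mass, the statistic H$_2$O later uses to rank heavy hitters; the shapes, names, and random data are assumptions, not the paper's code:

```python
import numpy as np

def attention_weights(q, K):
    """Softmax attention weights of one query q (shape [d]) over cached keys K (shape [t, d])."""
    logits = K @ q / np.sqrt(q.shape[-1])
    weights = np.exp(logits - logits.max())   # subtract max for numerical stability
    return weights / weights.sum()

rng = np.random.default_rng(0)
d, steps = 16, 32
K_cache = np.zeros((0, d))                    # keys seen so far
acc_score = np.zeros(0)                       # accumulated attention mass per cached token

for _ in range(steps):
    k_new = rng.standard_normal(d)            # key of the newly generated token
    K_cache = np.vstack([K_cache, k_new])
    acc_score = np.append(acc_score, 0.0)
    q = rng.standard_normal(d)                # query of the current decoding step
    acc_score += attention_weights(q, K_cache)

# Tokens with the largest accumulated scores are the heavy-hitter candidates.
print("top-5 heavy hitters (token indices):", np.argsort(acc_score)[-5:][::-1])
```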
3.2. Previous Works
- Sparse Attention (Training-time): Methods like Reformer and Sparse Transformer design fixed patterns (e.g., local windows, strided patterns) to reduce attention complexity from $O(n^2)$ to $O(n \log n)$ or $O(n\sqrt{n})$.
  - Limitation: These usually require training the model with these constraints. H$_2$O targets pre-trained models at inference time.
- System Optimizations:
  - FlexGen: An inference engine optimizing offloading strategies (moving data between GPU, CPU, and disk). H$_2$O is implemented on top of FlexGen to reduce the amount of data that needs to be stored and moved.
  - FlashAttention: Optimizes the computation of attention to be memory-efficient and fast but does not inherently reduce the storage requirement of the KV cache across decoding steps.
- Token Pruning/Compression: Methods like SpAtten prune tokens based on importance but often require custom hardware or retraining/fine-tuning to be fully effective without accuracy loss.
3.3. Technological Evolution & Differentiation
- Evolution: Early optimizations focused on model compression (quantization, pruning weights). As sequence lengths grew (from 512 to 32k+), the focus shifted to the KV cache.
- Differentiation:
  - Unlike static policies (e.g., keeping only the last 2048 tokens), H$_2$O is dynamic and content-aware: it decides which tokens to keep based on their accumulated importance during generation itself.
  - Unlike weight pruning, H$_2$O prunes the runtime state (the KV cache).
  - Unlike methods requiring fine-tuning, H$_2$O works out-of-the-box with pre-trained models (OPT, LLaMA, GPT-NeoX).

The following figure (Figure 1 from the original paper) illustrates the difference between H$_2$O and other policies. While standard sparse policies (e.g., strided) miss important tokens (red dots), H$_2$O retains them dynamically.
The figure is a schematic of the H$_2$O framework, comparing sparsity strategies: dynamic sparsity, static sparsity (strided and local), and H$_2$O-based sparsity. The formula beneath the diagram describes the H$_2$O operation, and the chart on the right shows accuracy versus memory reduction for the different sparsity strategies.
4. Methodology
4.1. Principles
The H$_2$O methodology is built on two key empirical observations regarding pre-trained LLMs:

- Attention Sparsity: Even in dense models, the attention matrix is highly sparse (over 95% of entries are near zero) during inference. This means that, for any given token generation, only a tiny fraction of previous tokens contributes significantly to the attention output.
- Heavy Hitters (H$_2$): The importance of tokens is not uniform or random. A small set of tokens ("Heavy Hitters") frequently receives high attention scores across many decoding steps, and their accumulated attention scores follow a power-law distribution.

Core Intuition: If a token was important in the past (high accumulated attention), it is likely to be important in the future. Therefore, instead of keeping all history, we can safely evict tokens with low accumulated scores. (A small measurement sketch follows Figure 2 below.)
The following figure (Figure 2 from the original paper) visualizes this sparsity and the "Heavy Hitter" phenomenon (red dots showing accumulated scores vs. word index).
The figure contains four panels: (a) attention sparsity of different models across layers; (b) the relationship between word index and co-occurrence counts, which follows a smooth overall curve; (c) accuracy of the baseline versus the model with heavy-hitter tokens removed; and (d) a radar chart of accuracy across multiple tasks for different scoring statistics. These panels support the effectiveness of the proposed H$_2$O method.
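A hedged sketch of how one might check both observations on an arbitrary causal attention map is shown below; the 0.01 threshold and the synthetic data are illustrative assumptions, not the paper's measurement protocol:

```python
import numpy as np

def sparsity_and_accumulated_scores(attn, threshold=0.01):
    """attn: [seq_len, seq_len] causal (lower-triangular) softmax attention matrix.
    Returns the fraction of negligible weights and the sorted per-token accumulated scores."""
    causal = np.tril(np.ones(attn.shape, dtype=bool))      # positions a causal model can attend to
    sparsity = np.mean(attn[causal] < threshold)           # share of near-zero attention weights
    acc_scores = attn.sum(axis=0)                          # column sums = accumulated attention per token
    return sparsity, np.sort(acc_scores)[::-1]             # sorted scores expose the heavy tail

# Toy, synthetic attention map; in practice one would pass attention maps from a real model.
rng = np.random.default_rng(0)
raw = np.tril(np.exp(3.0 * rng.standard_normal((128, 128))))
attn = raw / raw.sum(axis=1, keepdims=True)
s, sorted_scores = sparsity_and_accumulated_scores(attn)
print(f"sparsity: {s:.1%}; top-5 accumulated scores: {np.round(sorted_scores[:5], 2)}")
```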
4.2. Core Methodology In-depth (Layer by Layer)
The authors formalize the generative process with a limited KV cache budget and propose a greedy eviction algorithm.
4.2.1. Problem Formulation: Generative Process with Eviction
Let $Q \in \mathbb{R}^{n \times d}$ be the query matrix and $K \in \mathbb{R}^{n \times d}$ be the key matrix for a sequence of length $n$. Let $k$ be the KV cache budget (the maximum number of tokens stored), where $k < n$.
The generative process at step $i$ (predicting the next token) involves:

- Current Query: $Q_{i,*}$ (the query vector for the current step).
- Cache State: $S_i \subseteq [i]$, the set of indices of tokens retained in the cache. The constraint is $|S_i| \leq k$.
- Eviction Policy: A function mapping $S_{i-1}$ to $S_i$ that updates the cache. The constraint is usually $|S_i \setminus S_{i-1}| \leq 1$ (we add the new token and potentially evict one old token).

The attention output (simplified here to the normalized probability distribution term) is calculated as:
$ o_i := D_i^{-1} \cdot \left(\exp\left(Q_{i, *}(K_{S_i, *})^\top\right) - \mathbf{1}_{[i] \setminus S_i}\right) $
Where $D_i$ is the normalization factor (the denominator of the softmax), computed only over the retained tokens $S_i$:
$ D_i := \left(\exp\left(Q_{i, *}(K_{S_i, *})^\top\right) - \mathbf{1}_{[i] \setminus S_i}\right) \cdot \mathbf{1}_i $
- Explanation:
  - $K_{S_i, *}$: The keys corresponding to the indices in $S_i$ (rows outside $S_i$ are masked).
  - The term $-\mathbf{1}_{[i] \setminus S_i}$ implies that tokens not in $S_i$ effectively contribute 0 (i.e., are masked out) to the normalization, simulating their eviction. (A small code sketch follows.)
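A minimal sketch of this masked normalization, where only the retained indices in $S_i$ enter the softmax; the variable names are mine, not the paper's implementation:

```python
import numpy as np

def attention_over_retained(q_i, K, retained):
    """Softmax attention of query q_i computed only over the retained key rows K[S_i].
    Evicted tokens contribute nothing to the normalizer D_i."""
    idx = sorted(retained)                      # indices in S_i
    logits = K[idx] @ q_i                       # Q_{i,*} (K_{S_i,*})^T
    exp_scores = np.exp(logits - logits.max())  # stabilized exponentials
    D_i = exp_scores.sum()                      # normalizer computed over S_i only
    return exp_scores / D_i                     # o_i restricted to retained positions

rng = np.random.default_rng(1)
K = rng.standard_normal((6, 8))                 # 6 past tokens, head dimension 8
q = rng.standard_normal(8)
print(attention_over_retained(q, K, retained={0, 2, 4, 5}))
```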
4.2.2. H$_2$O Eviction Algorithm
The goal is to select $S_i$ so as to maximize the retention of important information. The authors propose a greedy algorithm based on accumulated attention scores.
The Score Function: The importance of a token is measured by summing its attention contributions over time. Let $F_{\text{score}}$ denote the scoring function. The algorithm maintains the accumulated attention score of every token currently in the cache.
Algorithm Flow (Step-by-Step):
- Initialization: Start with an empty cache, $S_0 \leftarrow \emptyset$.
- Generation Loop (for $i = 1$ to $n$):
  - Step 2a: Cache Filling (if not full): If $i \leq k$ (the cache is not yet full), simply add the current token to the cache: $S_i \leftarrow S_{i-1} \cup \{i\}$.
  - Step 2b: Greedy Eviction (if cache is full): If $i > k$, we must evict one token to make room for the new token $i$.
    - Calculate Attention: Compute the attention scores of the current query $Q_{i,*}$ against all keys currently in the cache $S_{i-1}$.
    - Accumulate Scores: Update the historical importance score of each cached token by adding its contribution at the current step.
    - Identify Victim: Find the token $u$ in the set $S_{i-1} \cup \{i\}$ whose removal hurts the objective the least.

The formal greedy selection is:
$
u \leftarrow \arg \max_{v \in (S_{i-1} \cup \{i\})} F_{\text{score}}\left( (S_{i-1} \cup \{i\}) \setminus \{v\} \right)
$
- Clarification: We want the remaining set to have the maximum total score. Therefore, we remove the token $u$ whose removal reduces the score the least (i.e., the token with the smallest accumulated score).
- Update Cache: $S_i \leftarrow (S_{i-1} \cup \{i\}) \setminus \{u\}$. (A simplified code sketch follows Figure 3 below.)
The following figure (Figure 3 from the original paper) illustrates this process. At step 5, the cache is full (budget=3). The algorithm compares the accumulated scores of tokens in the cache plus the new token, and evicts the one with the lowest score.
The figure is a schematic of Algorithm 1 over two consecutive decoding steps. At decoding step 4, the key-value pairs for "Children laughed and played in the sunny park" are computed and weighted. At decoding step 5, the handling of the new key and the update of the global (accumulated) statistics are shown, with the corresponding computed values on the right. The figure also illustrates why an alternative eviction strategy would be infeasible.
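Below is a simplified, hedged sketch of this greedy loop for a single head, tracking only key vectors and accumulated scores; it follows the spirit of Algorithm 1 but is not the authors' implementation:

```python
import numpy as np

def greedy_evict_step(q, K_cache, acc_scores, k_new, budget):
    """One decoding step: append the new key, attend over the cache (including the
    new token), accumulate attention mass, then evict the lowest-scoring token
    if the budget is exceeded."""
    K_cache = np.vstack([K_cache, k_new[None, :]])
    acc_scores = np.append(acc_scores, 0.0)
    logits = K_cache @ q / np.sqrt(q.shape[-1])
    weights = np.exp(logits - logits.max())
    acc_scores = acc_scores + weights / weights.sum()   # F_score update: accumulated attention
    if len(K_cache) > budget:
        victim = int(np.argmin(acc_scores))             # token whose removal costs the least score
        K_cache = np.delete(K_cache, victim, axis=0)
        acc_scores = np.delete(acc_scores, victim)
    return K_cache, acc_scores

rng = np.random.default_rng(0)
K_cache, acc_scores = np.zeros((0, 16)), np.zeros(0)
for _ in range(10):
    q, k_new = rng.standard_normal(16), rng.standard_normal(16)
    K_cache, acc_scores = greedy_evict_step(q, K_cache, acc_scores, k_new, budget=3)
print(K_cache.shape)   # (3, 16): the cache never exceeds the budget
```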
4.2.3. Practical Implementation: H$_2$O Policy
While the greedy algorithm is theoretically sound, a pure greedy approach might evict the most recent tokens if they have not yet accumulated enough score, breaking local context coherence.
Therefore, the practical H$_2$O policy used in the experiments is a hybrid:
- Recent Tokens (Local Window): Always keep a fixed window of the most recent tokens. This guarantees that the immediate local context is preserved (crucial for grammar and immediate flow).
- Heavy Hitters (H$_2$): For the remaining budget, keep the tokens with the highest accumulated attention scores from the past.

The eviction logic effectively becomes:

- When a new token arrives, once it falls out of the "Recent Window", it becomes a candidate for the "Heavy Hitter" set.
- If the "Heavy Hitter" set is full, the token with the lowest accumulated score in that set is evicted (see the sketch below).
5. Experimental Setup
5.1. Datasets
The paper validates H_2O on a wide range of tasks involving text generation, reasoning, and summarization.
- Evaluation Frameworks:
- HELM (Holistic Evaluation of Language Models): Used for tasks like XSUM (Extreme Summarization) and CNN/Daily Mail (News Summarization).
- lm-eval-harness: Used for tasks like COPA, MathQA, OpenBookQA, PiQA, RTE, and Winogrande.
- Characteristics: These datasets cover zero-shot, one-shot, and few-shot (5-shot) settings. Summarization tasks (XSUM, CNN/DM) are particularly important as they require long-context understanding, making them sensitive to cache eviction.
5.2. Evaluation Metrics
- Accuracy / Exact Match / ROUGE:
- Standard NLP metrics to measure the correctness of the output.
- ROUGE-2: Measures the overlap of bigrams between the generated summary and the reference.
$
\text{ROUGE-2} = \frac{\sum_{S \in \{\text{Ref}\}} \sum_{\text{bigram} \in S} \text{Count}_{\text{match}}(\text{bigram})}{\sum_{S \in \{\text{Ref}\}} \sum_{\text{bigram} \in S} \text{Count}(\text{bigram})}
$
- $\text{Count}_{\text{match}}(\text{bigram})$: Number of bigrams co-occurring in the candidate and the reference. (A toy computation of these metrics follows this list.)
- Throughput (token/s):
- The number of tokens generated per second. Higher is better.
- $ \text{Throughput} = \frac{\text{Number of generated tokens}}{\text{Total Generation Time (s)}} $
- Latency (s/ms):
- The time taken to generate a response or process a prompt. Lower is better.
- Memory Footprint (GB):
- The amount of GPU memory occupied by the KV cache.
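For concreteness, here is a toy computation of the quality and throughput metrics; the strings and timing numbers are invented, not taken from the paper:

```python
from collections import Counter

def rouge2_recall(candidate, reference):
    """Bigram overlap between candidate and reference, normalized by reference bigrams."""
    bigrams = lambda text: Counter(zip(text.split(), text.split()[1:]))
    cand, ref = bigrams(candidate), bigrams(reference)
    matched = sum((cand & ref).values())            # clipped count of shared bigrams
    return matched / max(sum(ref.values()), 1)

print(rouge2_recall("the cat sat on the mat", "the cat lay on the mat"))   # 0.6

generated_tokens, total_seconds = 512 * 64, 1740.0   # hypothetical batch run
print(f"throughput ≈ {generated_tokens / total_seconds:.1f} tokens/s")
```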
5.3. Baselines
The paper compares H_2O against several strong baselines:
- Full Cache: The standard implementation keeping all KV pairs (Upper bound on accuracy, lower bound on efficiency).
- Local (Window): Keeping only the most recent tokens. This is the simplest eviction policy.
- Sparse Transformer (Strided/Fixed): Static sparse patterns where tokens are kept at fixed intervals or positions.
- Leading Inference Systems:
  - DeepSpeed Zero-Inference: Optimizes memory via partitioning and offloading but can be slow due to communication overhead.
  - Hugging Face Accelerate: Standard offloading library.
  - FlexGen: State-of-the-art high-throughput generation engine (offloading-based).
6. Results & Analysis
6.1. Core Results Analysis
The experiments demonstrate that H_2O achieves a superior trade-off between memory efficiency and model accuracy compared to all baselines.
1. Accuracy vs. Cache Budget:
As shown in the following figure (Figure 4 from the original paper), H$_2$O (blue line) consistently maintains high accuracy (ROUGE-2) even when the KV cache budget is reduced to 20% (approximately 5× compression). In contrast, the "Local" policy (green line) and Sparse Transformer suffer catastrophic performance drops once the budget is reduced, failing to capture the long-range dependencies required for summarization.

2. Throughput Improvement:
H_2O significantly outperforms existing systems. By reducing the memory footprint, it allows for larger batch sizes and reduces the need for expensive I/O offloading (swapping data between GPU and CPU).
The following are the results from Table 3 of the original paper, showing generation throughput on a T4 GPU:
| System (tokens/s) | 512+32, 6.7B | 512+32, 30B | 512+512, 6.7B | 512+512, 30B | 512+1024, 6.7B | 512+1024, 30B |
|---|---|---|---|---|---|---|
| Accelerate | 20.4 (2, G) | 0.6 (8, C) | 15.5 (1, G) | 0.6 (8, C) | 5.6 (16, C) | 0.6 (8, C) |
| DeepSpeed | 10.2 (16, C) | 0.6 (4, C) | 9.6 (16, C) | 0.6 (4, C) | 10.1 (16, C) | 0.6 (4, C) |
| FlexGen | 20.2 (2, G) | 8.1 (144, C) | 16.8 (1, G) | 8.5 (80, C) | 16.9 (1, G) | 7.1 (48, C) |
| H2O (20%) | **35.1** (4, G) | **12.7** (728, C) | **51.7** (4, G) | **18.83** (416, C) | **52.1** (4, G) | **13.82** (264, C) |
- Key Insight: For OPT-30B (512+512), H$_2$O achieves 18.83 tokens/s, which is more than 2× faster than FlexGen (8.5) and roughly 30× faster than DeepSpeed/Accelerate (0.6).
- Note: The values in parentheses indicate (Batch Size, Offload Target: G=GPU, C=CPU). H$_2$O enables much larger batch sizes (e.g., 416 vs. 80 for FlexGen), which drives the throughput gain.
6.2. Ablation Studies / Parameter Analysis
1. Importance of Heavy Hitters vs. Local Tokens: The authors conducted an ablation study to verify if both components (Heavy Hitters + Recent Tokens) are necessary. The following are the results from Table 9 (Appendix C.6) of the original paper:
| Tasks | Models | Full | w. Local | w. H2 | w. Local + H2 |
|---|---|---|---|---|---|
| PiQA | OPT-13B | 77.37 | 54.62 | 76.12 | **77.26** |
| PiQA | OPT-30B | 78.51 | 55.82 | 67.25 | **78.45** |
| OpenBookQA | OPT-13B | 41.40 | 25.60 | 30.40 | **41.20** |
| OpenBookQA | OPT-30B | 43.20 | 25.20 | 26.60 | **43.00** |
- Analysis: Neither "Local" alone nor "H2" alone matches the Full Cache performance. "Local" suffers severe degradation (e.g., 55.82 vs. 78.51 on PiQA with OPT-30B). "H2" alone is better but still lags. The combination (H$_2$O) recovers almost all of the performance, indicating that recent tokens provide syntactic coherence while Heavy Hitters provide semantic anchoring.
2. Compatibility with Quantization:
The paper also explores combining H_2O with 4-bit weight quantization. Surprisingly, the combination sometimes yields better accuracy than quantization alone (Table 6 in the paper), suggesting a regularization effect. It further improves throughput by allowing even larger batch sizes.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces H$_2$O, a rigorously grounded yet simple oracle for managing the KV cache in Large Language Models. By discovering the "Heavy Hitter" phenomenon, in which a small subset of tokens dominates attention scores, the authors designed a greedy eviction policy that retains only the most critical information (Heavy Hitters plus recent tokens). H$_2$O reduces memory usage by up to 10× and boosts inference throughput by up to 29× without retraining the model or significantly sacrificing generation quality. The theoretical formulation connects this eviction policy to dynamic submodular maximization, providing a solid mathematical foundation.
7.2. Limitations & Future Work
- Limitations:
- Bias towards historical tokens: The current accumulated score metric might favor very old tokens that have had "more time" to accumulate scores, potentially making it harder for new, important concepts to establish themselves as heavy hitters quickly. (Though the "Recent" window mitigates this).
- Potential for adversarial inputs: The paper does not extensively explore if specific inputs could break the heavy hitter assumption.
- Future Work:
- Offloading Policy: The authors suggest extending the Heavy Hitter concept to MLP blocks (which also show sparsity) to design better offloading policies for model weights, not just KV cache.
- Long-Context Generation: Further validating H$_2$O on extremely long sequences (beyond 10k tokens) to see whether the heavy hitter set stabilizes or drifts.
7.3. Personal Insights & Critique
- Inspiration: The connection between "Heavy Hitters" in attention and "Co-occurrence" in text (Figure 2b) is fascinating. It suggests that the attention mechanism naturally learns to focus on "hub" words that bind the context together. This provides a tangible, interpretable view of what the "Value" vectors in the cache actually represent.
- Applicability: H$_2$O is highly practical because it requires no retraining. It can be plugged into existing inference pipelines (such as vLLM or Hugging Face) to save memory.
- Critique: While the "Heavy Hitter" theory is strong, the reliance on a fixed "Recent Window" in the implementation suggests that the pure greedy algorithm based on accumulated scores is not sufficient on its own. The "Recent" window acts as a necessary heuristic for the immediate local structure of language, which may not yet have accumulated a high score but is critical for next-token prediction. A more unified metric that naturally weights recency and long-term importance, without a hard split, would be an interesting theoretical improvement.