
Efficient Memory Management for Large Language Model Serving with PagedAttention

Published: 09/12/2023
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

PagedAttention, inspired by virtual memory, enables near-zero KV cache waste and flexible sharing in vLLM, boosting large language model serving throughput by 2-4×, especially with longer sequences, larger models, and complex decoding.

Abstract

High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically. When managed inefficiently, this memory can be significantly wasted by fragmentation and redundant duplication, limiting the batch size. To address this problem, we propose PagedAttention, an attention algorithm inspired by the classical virtual memory and paging techniques in operating systems. On top of it, we build vLLM, an LLM serving system that achieves (1) near-zero waste in KV cache memory and (2) flexible sharing of KV cache within and across requests to further reduce memory usage. Our evaluations show that vLLM improves the throughput of popular LLMs by 2-4× with the same level of latency compared to the state-of-the-art systems, such as FasterTransformer and Orca. The improvement is more pronounced with longer sequences, larger models, and more complex decoding algorithms. vLLM's source code is publicly available at https://github.com/vllm-project/vllm.

In-depth Reading

English Analysis

1. Bibliographic Information

  • Title: Efficient Memory Management for Large Language Model Serving with PagedAttention
  • Authors: Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica.
  • Affiliations: The authors are affiliated with UC Berkeley, Stanford University, UC San Diego, and an Independent Researcher. The primary authors are from UC Berkeley.
  • Journal/Conference: The paper was posted on arXiv, a preprint server. Preprints are drafts of research papers that have not yet undergone formal peer review. However, given the authors' affiliations and the subsequent widespread adoption of the vLLM library, the work is considered highly influential and credible within the machine learning systems community.
  • Publication Year: 2023
  • Abstract: The paper addresses a critical bottleneck in serving Large Language Models (LLMs): inefficient memory management for the key-value cache (KV cache). This cache is large, dynamic, and essential for autoregressive generation. Existing systems waste significant memory through fragmentation and an inability to share memory, which limits the number of requests that can be processed in a batch, thereby reducing throughput. The authors propose PagedAttention, a novel attention algorithm inspired by virtual memory and paging in operating systems. Built upon this, they introduce vLLM, an LLM serving system that nearly eliminates memory waste and allows flexible sharing of the KV cache. Evaluations show that vLLM increases the throughput of popular LLMs by 2-4x compared to state-of-the-art systems like FasterTransformer and Orca, with more significant gains for longer sequences and complex decoding methods.
  • Original Source Link: https://arxiv.org/abs/2309.06180

2. Executive Summary

  • Background & Motivation (Why):

    • Core Problem: Serving LLMs efficiently is expensive. To improve throughput and reduce cost, systems need to batch many user requests together. However, LLM inference requires a large, dynamic memory space called the KV cache to store the context of each sequence.
    • Existing Gaps: Prior systems manage this KV cache by allocating a single, contiguous block of memory for each request. This is highly inefficient because:
      1. The output length is unknown, so systems must pre-allocate memory for the maximum possible length, leading to massive waste (internal fragmentation).
      2. Allocating and deallocating these variable-sized blocks creates unusable memory gaps between them (external fragmentation).
      3. It prevents memory from being shared between different generation sequences (e.g., in parallel sampling or beam search), leading to redundant data storage.
    • Innovation: The paper introduces a fundamentally new way to manage the KV cache, drawing inspiration from a classic computer science concept: virtual memory and paging from operating systems. This allows the KV cache to be stored in non-contiguous, fixed-size blocks, solving the fragmentation and sharing problems in one elegant stroke.
  • Main Contributions / Findings (What):

    • PagedAttention Algorithm: The core technical innovation is an attention algorithm that can operate on a KV cache stored in non-contiguous memory blocks (pages). This decouples the logical layout of the cache from its physical storage.
    • vLLM Serving System: An end-to-end LLM serving engine built around PagedAttention. vLLM implements a sophisticated memory manager that treats KV cache blocks like memory pages, allocating them on demand and enabling advanced sharing strategies.
    • Near-Zero Memory Waste: vLLM's paged memory management reduces KV cache memory waste to less than 4%, compared to 60-80% in other systems. This drastically increases the number of requests that can be batched on a single GPU.
    • State-of-the-Art Throughput: As a direct result of better memory efficiency, vLLM achieves a 2-4x increase in serving throughput over highly optimized systems like FasterTransformer and Orca, without sacrificing latency.

3. Prerequisite Knowledge & Related Work

  • Foundational Concepts:

    • Large Language Models (LLMs): These are massive neural networks trained on vast amounts of text data to understand and generate human-like language. They operate autoregressively, meaning they generate text one token (word or sub-word) at a time, with each new token depending on all the tokens that came before it.
    • Transformer Architecture & Self-Attention: The backbone of modern LLMs. Its key component is the self-attention mechanism. For each token in a sequence, attention calculates how relevant every other token is to it. This is done by computing three vectors for each token: a Query (Q), a Key (K), and a Value (V). The attention score is derived from the interaction between a token's Query and all other tokens' Keys, and the output is a weighted sum of the Value vectors.
    • KV Cache: During autoregressive generation, the Key (K) and Value (V) vectors for every token in the sequence are computed. Since the K and V vectors of past tokens are needed to generate every future token, they are stored (cached) in GPU memory to avoid expensive recomputation. This stored data is the KV cache. It grows with each new token generated and can consume gigabytes of memory per request.
    • Autoregressive Generation Phases:
      1. Prompt Phase (Prefill): The initial user prompt is processed all at once. The KV cache for all prompt tokens is computed in a highly parallel step.
      2. Generation Phase (Decoding): Output tokens are generated one by one. In each step, the model uses the entire KV cache (from the prompt and previously generated tokens) to produce the next token. This phase is sequential and often memory-bound.
    • Virtual Memory and Paging: A memory management technique used by operating systems. It gives a program the illusion of having a large, contiguous memory space (virtual memory), while the actual data is stored in fixed-size blocks called pages that can be scattered throughout physical memory (like RAM). A page table maps the program's logical addresses to their physical locations. This system avoids fragmentation and allows for flexible memory management. PagedAttention directly applies this concept to the KV cache.
    • Memory Fragmentation:
      • Internal Fragmentation: Waste inside an allocated memory block. For example, if you allocate a 2048-token chunk for a sequence that only ends up being 100 tokens long, the remaining 1948-token space is wasted.
      • External Fragmentation: Waste outside allocated blocks. When memory is a patchwork of allocated and freed chunks of different sizes, there may be enough total free memory for a new request, but no single contiguous chunk is large enough.
  • Previous Works:

    • FasterTransformer: A highly optimized inference library from NVIDIA. It achieves low latency but uses a static, contiguous memory allocation scheme for the KV cache, making it inefficient for serving dynamic workloads with high throughput demands.
    • Orca: A state-of-the-art serving system designed for high throughput. It improves upon static batching by using iteration-level scheduling (also called continuous or dynamic batching), where new requests can join a batch and finished requests can leave at every generation step. However, it still relies on a contiguous memory manager (like a buddy allocator), which suffers from the fragmentation issues described above.
  • Differentiation: The key difference between vLLM and prior work is the memory allocation strategy for the KV cache. While systems like FasterTransformer and Orca are constrained by the requirement that a tensor must live in a contiguous memory block, vLLM breaks this constraint. By introducing PagedAttention, it can store the KV cache in non-contiguous "pages," fundamentally solving the memory fragmentation problem and unlocking powerful sharing capabilities that were previously impossible.
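To make the scale of the KV cache described above concrete, here is a back-of-the-envelope sizing sketch in Python. The model dimensions (40 transformer layers, hidden size 5120, FP16 storage) are assumptions based on the public OPT-13B configuration, not figures quoted in this analysis.

```python
# Rough KV cache sizing for an OPT-13B-like model (assumed configuration:
# 40 transformer layers, hidden size 5120, FP16 storage).
NUM_LAYERS = 40
HIDDEN_SIZE = 5120
BYTES_PER_ELEMENT = 2  # FP16

# Each token stores one key vector and one value vector per layer.
bytes_per_token = 2 * HIDDEN_SIZE * NUM_LAYERS * BYTES_PER_ELEMENT
print(f"KV cache per token: ~{bytes_per_token / 1024:.0f} KB")  # ~800 KB

# A request reserved at a 2048-token maximum length therefore ties up roughly:
print(f"KV cache per 2048-token reservation: ~{bytes_per_token * 2048 / 2**30:.1f} GB")  # ~1.6 GB
```

At this scale, reserving the 2048-token maximum for a request that ends after 100 tokens wastes well over a gigabyte, which is why the fragmentation described above limits batch size so severely.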

4. Methodology (Core Technology & Implementation)

The core of the paper is the PagedAttention algorithm and the vLLM system built around it.

Figure 4. vLLM system overview. The diagram shows a centralized scheduler with its KV cache manager and CPU/GPU block allocators coordinating multiple workers, each running a cache engine and a model shard.

As shown in Figure 4, the vLLM system consists of a central scheduler that communicates with distributed GPU workers. The key components are the KV Cache Manager and the PagedAttention kernel, which work together to manage GPU memory efficiently.

4.1 PagedAttention

Principle: The central idea is to stop treating the KV cache for a sequence as a single, monolithic tensor. Instead, it is partitioned into smaller, fixed-size blocks. Each block contains the K and V vectors for a fixed number of tokens. These blocks can be stored anywhere in physical GPU memory.

Figure 5. Illustration of the PagedAttention algorithm, where the attention key and value vectors are stored as non-contiguous blocks in memory.

As illustrated in Figure 5, the KV cache for a sequence (e.g., "Four score and seven years ago...") is split into blocks. Block 0, Block 1, and Block 2 are not stored next to each other in memory. The PagedAttention algorithm is designed to compute attention by fetching these blocks individually. For a query token (e.g., "forth"), the kernel fetches each required KV block, computes attention scores against the keys in that block, and aggregates the results using the values from that block.

The attention formula is adapted for block-wise computation. The standard attention output for query token $i$ is

$$o_i = \sum_{j=1}^{i} a_{ij} v_j, \quad \text{where} \quad a_{ij} = \frac{\exp(q_i^\top k_j / \sqrt{d})}{\sum_{t=1}^{i} \exp(q_i^\top k_t / \sqrt{d})}.$$

With PagedAttention, this is conceptually transformed into a block-wise operation:

$$o_i = \sum_{j=1}^{\lceil i/B \rceil} V_j A_{ij}^\top,$$

where $B$ is the block size, $K_j$ and $V_j$ are the $j$-th blocks of keys and values, and $A_{ij}$ is the row vector of attention scores of query $q_i$ against the keys in block $K_j$. The custom CUDA kernel handles fetching these non-contiguous blocks on the fly.
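The following NumPy sketch illustrates the block-wise computation for a single head and a single query token. It is a simplified illustration of the idea, not vLLM's fused CUDA kernel; the array names and layout (key_pool, value_pool, block_table) are assumptions chosen for the example.

```python
import numpy as np

def paged_attention(query, key_pool, value_pool, block_table, context_len, block_size):
    """Attention for one query over a KV cache stored in non-contiguous
    fixed-size blocks (single head, no batching).

    key_pool / value_pool: [num_physical_blocks, block_size, head_dim]
    block_table: logical block index -> physical block index for this sequence
    context_len: number of valid tokens in the sequence
    """
    d = query.shape[-1]
    num_logical_blocks = -(-context_len // block_size)  # ceil division

    # Pass 1: attention scores, fetched block by block.
    scores = []
    for logical_idx in range(num_logical_blocks):
        phys = block_table[logical_idx]
        keys = key_pool[phys]                      # [block_size, head_dim]
        scores.append(keys @ query / np.sqrt(d))   # scores against this block's keys
    scores = np.concatenate(scores)[:context_len]  # drop padding slots in the last block
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()

    # Pass 2: weighted sum of values, again block by block.
    out = np.zeros(d)
    for logical_idx in range(num_logical_blocks):
        phys = block_table[logical_idx]
        start = logical_idx * block_size
        p = probs[start:start + block_size]        # weights for this block's tokens
        out += p @ value_pool[phys][:len(p)]
    return out
```

For instance, with a block size of 16, a 100-token context touches ⌈100/16⌉ = 7 physical blocks, which may sit anywhere in the block pool.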

4.2 KV Cache Manager

Analogy to Virtual Memory: vLLM's memory manager works just like an OS virtual memory system:

  • Logical Blocks: Each sequence has its own view of its KV cache as a contiguous sequence of logical blocks.

  • Physical Blocks: The GPU memory is partitioned into a global pool of fixed-size physical blocks.

  • Block Table: For each sequence, a block table (analogous to a page table) maps its logical blocks to physical blocks in GPU memory.

    Figure 6. Block table translation in vLLM: each request's logical KV blocks are mapped through its block table to physical KV blocks in GPU DRAM.

Figure 6 shows this process:

  1. A request arrives with a 7-token prompt. The block size is 4. vLLM allocates two logical blocks (0 and 1). The manager maps these to two available physical blocks (e.g., 7 and 1). The KV cache for the prompt is stored in these physical blocks.

  2. In the first decoding step, a new token is generated. Its KV cache is stored in the first available slot in the last logical block (block 1), which is physically located at block 1.

  3. In the second decoding step, the last logical block becomes full. The manager allocates a new physical block (e.g., block 3), maps logical block 2 to it, and stores the new KV cache there.

    This on-demand allocation eliminates internal fragmentation, as memory is only allocated when needed. Since all blocks are the same size, there is no external fragmentation.
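A minimal sketch of this bookkeeping is shown below, under assumed names (BlockManager, a simple free list). It models only the logical-to-physical mapping and on-demand growth; a real manager also owns the KV tensors and per-block reference counts.

```python
class BlockManager:
    """Toy logical-to-physical block mapping in the spirit of Figure 6."""

    def __init__(self, num_physical_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_physical_blocks))  # global pool of physical blocks
        self.block_tables = {}   # seq_id -> list of physical block ids (the "block table")
        self.num_tokens = {}     # seq_id -> number of KV slots filled so far

    def allocate_prompt(self, seq_id: int, prompt_len: int) -> None:
        """Map ceil(prompt_len / block_size) logical blocks to free physical blocks."""
        n_blocks = -(-prompt_len // self.block_size)
        self.block_tables[seq_id] = [self.free_blocks.pop() for _ in range(n_blocks)]
        self.num_tokens[seq_id] = prompt_len

    def append_token(self, seq_id: int) -> None:
        """Reserve one slot for a newly generated token, growing the table on demand."""
        if self.num_tokens[seq_id] % self.block_size == 0:
            # The last logical block is full: map a new logical block to a fresh physical one.
            self.block_tables[seq_id].append(self.free_blocks.pop())
        self.num_tokens[seq_id] += 1

    def free(self, seq_id: int) -> None:
        """Return all physical blocks of a finished (or preempted) sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id))
        self.num_tokens.pop(seq_id)
```

With block_size = 4, allocate_prompt(seq_id, 7) maps two logical blocks; the ninth token is the first one that triggers allocation of a new physical block, matching the walkthrough above.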

4.3 Application to Other Decoding Scenarios

vLLM's paged memory management naturally enables efficient memory sharing, which is difficult or impossible in contiguous systems.

  • Parallel Sampling: When generating multiple output sequences from the same prompt, the prompt's KV cache can be shared.

    As shown in Figure 8, two sequences (Sample A1, Sample A2) share the same prompt. Their logical blocks for the prompt map to the same physical blocks (7 and 1). vLLM uses reference counting to track how many logical blocks point to a physical block. When a sequence needs to write to a shared block (e.g., the last block of the prompt), a copy-on-write mechanism is triggered. A new physical block is allocated, the original content is copied, and the sequence then writes to its new private block. This saves significant memory, especially for long prompts.

  • Beam Search: In beam search, different candidate sequences (beams) often share long common prefixes.

    vLLM's block sharing elegantly handles this. As shown in Figure 9, different candidates can share arbitrary prefixes. When a beam is discarded, its logical blocks are freed, and the reference counts of the corresponding physical blocks are decremented. Physical blocks with a reference count of zero are returned to the free pool. This avoids the massive and frequent memory copies required in traditional beam search implementations.

  • Shared Prefix: For applications where many requests share a common system prompt or prefix (e.g., instructions for a chatbot), vLLM can pre-calculate the KV cache for this prefix and store it in a set of physical blocks. New requests can simply map their initial logical blocks to these cached physical blocks, saving both memory and computation.
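A toy sketch of the reference counting and copy-on-write logic behind all three sharing scenarios is given below. The class and method names are hypothetical, and the KV data itself is omitted; only the block bookkeeping is modeled.

```python
class SharedBlockManager:
    """Toy reference-counted block table with copy-on-write (bookkeeping only)."""

    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))
        self.ref_count = {}      # physical block id -> number of logical blocks mapped to it
        self.block_tables = {}   # seq_id -> list of physical block ids

    def allocate(self, seq_id: int, n_blocks: int) -> None:
        """Give a new sequence n_blocks private physical blocks (e.g., for its prompt)."""
        blocks = [self.free_blocks.pop() for _ in range(n_blocks)]
        for phys in blocks:
            self.ref_count[phys] = 1
        self.block_tables[seq_id] = blocks

    def fork(self, parent_id: int, child_id: int) -> None:
        """Parallel sample / new beam: the child shares all of the parent's physical blocks."""
        self.block_tables[child_id] = list(self.block_tables[parent_id])
        for phys in self.block_tables[child_id]:
            self.ref_count[phys] += 1

    def write_last_block(self, seq_id: int) -> int:
        """Copy-on-write: before writing, give the sequence a private copy of a shared block."""
        phys = self.block_tables[seq_id][-1]
        if self.ref_count[phys] > 1:
            new_phys = self.free_blocks.pop()
            # (a real system also copies the KV data from `phys` into `new_phys` here)
            self.ref_count[phys] -= 1
            self.ref_count[new_phys] = 1
            self.block_tables[seq_id][-1] = new_phys
        return self.block_tables[seq_id][-1]

    def free_sequence(self, seq_id: int) -> None:
        """Discard a beam or finished sample; blocks reaching zero references are reclaimed."""
        for phys in self.block_tables.pop(seq_id):
            self.ref_count[phys] -= 1
            if self.ref_count[phys] == 0:
                self.free_blocks.append(phys)
                del self.ref_count[phys]
```

Forking a sample or a beam is just a copy of the block table plus reference-count increments; only the first write to a still-shared block costs a single block-sized copy.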

4.4 Scheduling and Preemption

When the system is overloaded and runs out of free physical blocks, it must preempt some requests. vLLM uses a First-Come-First-Serve (FCFS) policy and preempts the most recently started requests. Two strategies are considered for handling preempted requests:

  1. Swapping: The KV cache blocks of a preempted sequence are copied from GPU memory to CPU RAM. They are swapped back in when space becomes available.
  2. Recomputation: The KV cache is simply discarded. When the sequence is rescheduled, its KV cache is recomputed from its full prefix (original prompt + already generated tokens). This is done in a single, fast prompt phase operation. The paper's experiments show recomputation is generally faster than swapping.
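A minimal sketch of the recomputation path, reusing the toy BlockManager from the Section 4.2 sketch (the names are hypothetical, not vLLM's actual API):

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Sequence:
    seq_id: int
    prompt: list                                   # prompt token ids
    generated: list = field(default_factory=list)  # tokens produced so far

def preempt_by_recomputation(waiting: deque, block_manager, seq: Sequence) -> None:
    """Evict a running sequence: drop its KV blocks now and rebuild them later
    from the full prefix (prompt + generated tokens) in one prompt-phase pass."""
    block_manager.free(seq.seq_id)            # all physical blocks return to the pool
    seq.prompt = seq.prompt + seq.generated   # next schedule treats the whole prefix as a prompt
    seq.generated = []
    waiting.appendleft(seq)                   # FCFS: the preempted request keeps its place ahead of newer arrivals
```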

4.5 Distributed Execution

vLLM supports tensor model parallelism (e.g., Megatron-LM style), where a large model is sharded across multiple GPUs. A single, centralized scheduler manages the block tables for all workers. Each worker stores only its shard of the KV cache (e.g., for its subset of attention heads) in its local physical blocks, but all workers use the same logical-to-physical block mapping provided by the scheduler.
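A rough sketch of this control flow, with assumed names rather than the actual vLLM interfaces: the scheduler sends every worker the same token ids and block tables, and each worker indexes its local KV shard with those shared physical block ids.

```python
def scheduler_step(workers, token_ids, block_tables):
    """One decoding iteration with tensor-parallel workers (sketch).

    All workers receive identical control inputs; each stores only the KV cache
    shard for its own attention heads, but uses the same logical-to-physical
    block mapping chosen by the centralized scheduler.
    """
    results = [w.execute(token_ids, block_tables) for w in workers]
    return results[0]  # after the tensor-parallel all-reduce, every worker holds the same outputs
```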

5. Experimental Setup

  • Datasets:

    • ShareGPT: A dataset of real user conversations with ChatGPT. It features long prompts and long outputs, representing a challenging, realistic workload.

    • Alpaca: An instruction-following dataset. It has shorter prompts and outputs compared to ShareGPT.

  • Evaluation Metrics:

    • Normalized Latency: The main performance metric.
      1. Conceptual Definition: It measures the average time taken to generate a single output token for a request, calculated as the total end-to-end latency divided by the number of generated tokens. A serving system's goal is to keep this value low even as the request rate increases. The point where this latency "explodes" indicates the system's maximum throughput.
      2. Mathematical Formula: $\text{Normalized Latency} = \dfrac{\text{End-to-end Request Latency}}{\text{Number of Generated Tokens}}$
      3. Symbol Explanation:
        • End-to-end Request Latency: The total time from when a request arrives at the system until the last token is generated.
        • Number of Generated Tokens: The total number of tokens in the generated output sequence.
  • Baselines:

    • FasterTransformer (FT): A high-performance inference engine from NVIDIA, adapted by the authors with a dynamic batching scheduler.
    • Orca: A state-of-the-art serving system that uses iteration-level scheduling but with contiguous memory allocation. Since it's not open-source, the authors implemented their own version with three configurations:
      • Orca (Oracle): An unachievable upper bound that knows the exact output length in advance, resulting in no internal fragmentation.
      • Orca (Pow2): A realistic strategy that reserves memory in sizes that are powers of two, leading to some internal fragmentation.
      • Orca (Max): A pessimistic but common strategy that reserves memory for the model's maximum possible sequence length (e.g., 2048 tokens), causing severe internal fragmentation.
  • Models and Hardware:

    • Models: OPT (13B, 66B, 175B) and LLaMA (13B).

    • Hardware: NVIDIA A100 GPUs.

    • The following table, transcribed from Table 1 in the paper, shows the server configurations.

      | Model size | GPUs | Total GPU memory | Parameter size | Memory for KV cache | Max. # KV cache slots |
      |---|---|---|---|---|---|
      | 13B | A100 | 40 GB | 26 GB | 12 GB | 15.7K |
      | 66B | 4×A100 | 160 GB | 132 GB | 21 GB | 9.7K |
      | 175B | 8×A100-80GB | 640 GB | 346 GB | 264 GB | 60.1K |
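As a small worked example of the normalized-latency metric defined under Evaluation Metrics above (the helper and numbers are illustrative, not taken from the paper):

```python
def mean_normalized_latency(requests):
    """Average of end-to-end latency divided by generated-token count, in seconds per token.

    `requests` is a list of (latency_seconds, num_generated_tokens) pairs.
    """
    return sum(latency / tokens for latency, tokens in requests) / len(requests)

# Three hypothetical completed requests:
print(mean_normalized_latency([(12.0, 240), (3.5, 70), (8.0, 100)]))  # 0.06 s/token
```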

6. Results & Analysis

The evaluation convincingly demonstrates the superiority of vLLM's paged memory management.

Core Results

Figure 12. Single sequence generation with OPT models on the ShareGPT and Alpaca datasets. The six subplots show normalized latency versus request rate for OPT-13B, 66B, and 175B on each dataset, comparing vLLM with FasterTransformer and the Orca variants.

Figure 12 shows the core results for basic sampling. The x-axis is the request rate (requests/second), and the y-axis is normalized latency.

  • Observation: For all models and datasets, vLLM's curve stays flat for a much longer range of request rates before shooting up. This indicates that vLLM can sustain a significantly higher throughput than both FasterTransformer and all variants of Orca.

  • Interpretation: vLLM achieves a 2.7x higher throughput on the ShareGPT dataset (with long sequences) and a 1.7x higher throughput on the Alpaca dataset (with shorter sequences) for the OPT-13B model. The advantage is more pronounced for workloads with longer, more variable sequence lengths, where fragmentation in other systems is most severe.

The reason for this performance gain is explained by Figure 2, which shows the percentage of memory wasted.

  • Observation: vLLM has only 3.7% memory waste. In contrast, Orca (Max) wastes over 80% of its allocated KV cache memory, and even the ideal Orca (Oracle) wastes 20% due to external fragmentation.

  • Interpretation: By nearly eliminating memory waste, vLLM can fit more requests into a batch.

Figure 13 confirms this, showing that vLLM consistently batches more requests than other systems, directly leading to higher GPU utilization and throughput.

Ablations / Parameter Sensitivity

  • Performance with Advanced Decoding:

    Figure 14 shows that vLLM's advantage grows with more complex decoding. As the number of parallel samples or the beam width increases, the memory savings from sharing in vLLM become more substantial, while Orca's performance degrades due to memory pressure from storing redundant KV caches.

  • Effect of Block Size:

    The paper studies the trade-off of block size. A smaller block size reduces internal fragmentation (waste within the last block of a sequence), but a larger block size allows the PagedAttention CUDA kernel to be more efficient by processing more data in parallel. Figure 18b shows that a block size of 16 offers the best balance, achieving low latency across both datasets.

  • Swapping vs. Recomputation:

    Figure 19a shows the overhead of preemption strategies. Recomputation is significantly faster than swapping KV blocks to CPU RAM across all block sizes. This is because modern GPUs have immense computational power, making it faster to re-run the prompt phase than to wait for a slow PCIe data transfer. Figure 19b confirms that using recomputation as the preemption strategy yields better end-to-end performance.

7. Conclusion & Reflections

  • Conclusion Summary: The paper convincingly identifies and solves a fundamental bottleneck in LLM serving: inefficient memory management of the KV cache. The proposed PagedAttention algorithm, inspired by classical OS concepts, is an elegant and powerful solution. By enabling non-contiguous memory storage for the KV cache, the vLLM system built upon it nearly eliminates memory fragmentation, enables flexible and deep memory sharing across requests, and consequently achieves state-of-the-art throughput, outperforming existing systems by a factor of 2-4x.

  • Limitations & Future Work:

    • The paper itself does not explicitly state limitations. However, one could consider the overhead of the centralized scheduler as a potential bottleneck in extremely large, distributed deployments, although the design aims to minimize communication.
    • The performance of swapping versus recomputation is hardware-dependent. On systems with faster GPU-CPU interconnects (e.g., NVLink between CPU and GPU), swapping might become more competitive.
    • The authors state that their fork, append, and free primitives can support future decoding algorithms, which points toward the framework's extensibility.
  • Personal Insights & Critique:

    • Impact and Elegance: The paper is a landmark in ML systems research. The core idea is a brilliant application of a well-established concept from a different field (Operating Systems) to solve a modern, critical problem. The solution is not a complex, heuristic-based patch but a fundamental redesign that is both simple in concept and profound in impact.
    • Practical Significance: The problem of high serving costs is a major barrier to the widespread adoption of LLMs. By dramatically improving throughput, vLLM directly translates to lower operational costs for anyone deploying LLM services. The fact that vLLM has become a widely used, popular open-source library is a testament to the real-world value and correctness of the paper's contributions.
    • Open Questions: While PagedAttention solves memory management for the KV cache, other bottlenecks may emerge as models grow even larger. For instance, loading the model weights themselves (for model switching) or the pure computational cost of extremely long contexts could become the next major challenges. However, for the problem it set out to solve, this paper provides a definitive and highly effective solution.
