Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving

Published: 06/24/2024

TL;DR Summary

Mooncake features a KVCache-centric disaggregated architecture that significantly enhances effective throughput for LLM serving. By separating the prefill and decoding stages and utilizing idle GPU cluster resources, it achieves up to a 525% increase in throughput in long-context scenarios.

Abstract

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. It features a KVCache-centric disaggregated architecture that separates the prefill and decoding clusters. It also leverages the underutilized CPU, DRAM, and SSD resources of the GPU cluster to implement a disaggregated cache of KVCache. The core of Mooncake is its KVCache-centric scheduler, which balances maximizing overall effective throughput while meeting latency-related Service Level Objectives (SLOs). Unlike traditional studies that assume all requests will be processed, Mooncake faces challenges due to highly overloaded scenarios. To mitigate these, we developed a prediction-based early rejection policy. Experiments show that Mooncake excels in long-context scenarios. Compared to the baseline method, Mooncake can achieve up to a 525% increase in throughput in certain simulated scenarios while adhering to SLOs. Under real workloads, Mooncake's innovative architecture enables Kimi to handle 75% more requests.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of the paper is "Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving". It focuses on designing an efficient and scalable system for serving large language models (LLMs), particularly emphasizing the optimization of the Key-Value Cache (KVCache) within a disaggregated architecture.

1.2. Authors

The authors are Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. Their affiliations include Moonshot AI and Tsinghua University, indicating a collaboration between industry and academia in the field of LLM serving systems.

1.3. Journal/Conference

The paper is published as a preprint on arXiv (arXiv:2407.00079), a widely recognized open-access repository for scientific preprints. While arXiv itself is not a peer-reviewed journal or conference, it is a common platform for researchers to share their latest work before or during the peer-review process, allowing for broad dissemination and early feedback within the research community. The reputation of arXiv as a venue for cutting-edge research in computer science and AI is very high.

1.4. Publication Year

The paper was first published on arXiv on June 24, 2024 (2024-06-24).

1.5. Abstract

Mooncake is presented as the serving platform for Kimi, a prominent LLM service by Moonshot AI. Its core innovation is a KVCache-centric disaggregated architecture that distinctly separates the prefill and decoding stages of LLM inference into different clusters. Furthermore, it efficiently utilizes underutilized resources like CPU, DRAM, and SSD within the GPU cluster to establish a disaggregated KVCache. The platform's central component is its KVCache-centric scheduler, which is meticulously designed to optimize overall effective throughput while rigorously adhering to latency-related Service Level Objectives (SLOs).

Unlike conventional research that often assumes all requests are processed, Mooncake confronts significant challenges posed by highly overloaded scenarios. To address this, the authors developed a prediction-based early rejection policy. Experimental findings demonstrate Mooncake's exceptional performance in long-context scenarios. Specifically, it achieves up to a 525% increase in throughput compared to baseline methods in certain simulated environments, all while meeting SLOs. Under real-world workloads, Mooncake's innovative design allows Kimi to process 75% more requests effectively.

2. Executive Summary

2.1. Background & Motivation

The rapid adoption of Large Language Models (LLMs) has led to diversified workloads with varying input/output lengths, arrival patterns, and crucial Service Level Objectives (SLOs) for latency, primarily Time To First Token (TTFT) and Time Between Tokens (TBT). As a Model as a Service (MaaS) provider, Kimi (by Moonshot AI) faces the challenge of maximizing overall effective throughput (which directly impacts revenue) while satisfying these complex SLO constraints.

The core problem the paper addresses is the inefficient serving of LLMs, especially in:

  1. Resource Utilization: GPU servers are often highly integrated, but the prefill and decoding stages of LLM inference have very different computational characteristics. Traditional monolithic serving architectures struggle to optimize resources for both, leading to underutilization.

  2. KVCache Management: The KVCache is central to LLM serving, but optimizing its reuse (to reduce computation) and batching (to improve Model FLOPs Utilization - MFU) often conflict with latency SLOs (e.g., reusing KVCache from remote locations can increase TTFT; large batch sizes can increase TBT).

  3. Long-Context Scenarios: Modern LLMs handle increasingly long contexts, making the prefill stage computationally intensive and demanding efficient TTFT optimization.

  4. Overloaded Scenarios: MaaS providers frequently face severe overload problems due to limited GPU supply and rapidly growing user requests, especially during peak times. Existing LLM serving research often assumes sufficient resources, leaving a gap in strategies for managing overload effectively and deciding which requests to reject to avoid wasting computational resources.

    The paper's entry point is the observation that KVCache scheduling is central to LLM serving efficiency. Its innovative idea is to propose a KVCache-centric disaggregated architecture that separates prefill and decoding processes and leverages underutilized CPU, DRAM, and SSD resources for a disaggregated KVCache. This allows for specialized optimization of each stage and intelligent KVCache management, coupled with overload-oriented scheduling policies.

2.2. Main Contributions / Findings

The paper makes several significant contributions to the field of LLM serving:

  1. KVCache-centric Disaggregated Architecture: Introduction of Mooncake, a novel architecture that separates prefill and decoding clusters and implements a disaggregated KVCache using CPU, DRAM, and SSD resources. This design allows for independent optimization of each stage and efficient KVCache management.

  2. Chunked Pipeline Parallelism (CPP) for Prefill: Proposes and implements CPP for long-context prefill to reduce TTFT, offering benefits over traditional Tensor Parallelism (TP) or Sequence Parallelism (SP) by reducing network consumption and simplifying elastic scaling.

  3. Layer-wise KVCache Transfer: Implements layer-wise prefill with stream transferring of KVCache to overlap latency, effectively reducing VRAM occupation during prefill and optimizing KVCache transfer.

  4. KVCache-centric Scheduling Algorithm: Develops a sophisticated global scheduler (Conductor) that considers KVCache reuse, instance loads, and SLOs (TTFT, TBT). This includes a heuristic-based automated hot-spot migration scheme for KVCache blocks to balance loads and reduce TTFT.

  5. Overload-Oriented Scheduling with Prediction-Based Early Rejection: Addresses the practical challenge of overload scenarios by introducing an early rejection policy that predicts future load (especially decoding load) to prevent wasted computation and mitigate load fluctuations often caused by naive early rejection.

  6. Empirical Validation and Open-Source Trace: Demonstrates the effectiveness of Mooncake through extensive experiments on public datasets, simulated data, and real-world Kimi traces. The results show significant throughput improvements (up to 525% in simulated scenarios, 75% more requests under real workloads) while adhering to SLOs. An anonymized real-world request trace is open-sourced to facilitate further research.

    The key conclusions are that a KVCache-centric disaggregated architecture with intelligent scheduling and overload management is highly effective for LLM serving, especially for long-context requests and under overloaded conditions. The prediction-based early rejection policy is crucial for maintaining resource utilization and stability in such dynamic environments. These findings collectively solve the problem of efficiently serving LLMs at scale while maintaining quality of service and maximizing resource efficiency.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully understand the Mooncake paper, a foundational understanding of several concepts related to Large Language Models (LLMs) and their serving infrastructure is essential.

  • Large Language Models (LLMs): These are advanced artificial intelligence models, typically based on the Transformer architecture, trained on vast amounts of text data. They can understand, generate, and process human language for various tasks like text summarization, question answering, and content creation. Examples include GPT and LLaMA.
  • Transformer Architecture: The core neural network architecture for most modern LLMs. It relies heavily on self-attention mechanisms to weigh the importance of different parts of the input sequence, and feed-forward neural networks. It processes input sequences in parallel and generates output sequences autoregressively.
    • Self-Attention: A mechanism that allows the model to weigh the importance of different words in the input sequence when processing a specific word. It computes query ($Q$), key ($K$), and value ($V$) vectors from the input embeddings. The attention scores are obtained from the dot product of $Q$ and $K$, scaled by $\sqrt{d_k}$ (where $d_k$ is the dimension of the key vectors), followed by a softmax; the resulting weights are multiplied by $V$ to produce the output. $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:
      • $Q$ (Query), $K$ (Key), and $V$ (Value) are matrices derived from the input embeddings.
      • $QK^T$ is the dot product of the Query and Key matrices, measuring similarity.
      • $\sqrt{d_k}$ is a scaling factor that prevents large dot products from pushing the softmax into regions with tiny gradients.
      • $\mathrm{softmax}$ normalizes the scores into a probability distribution.
  • LLM Inference Stages: When an LLM generates a response, it typically involves two distinct stages:
    • Prefill Stage (or Prompt Processing): This is the initial stage where the entire input prompt (context) is processed in parallel to generate the first output token. During this stage, intermediate activations, specifically the key and value vectors from the self-attention layers, are computed and stored. This stage is usually computation-intensive, especially for long contexts.
    • Decoding Stage (or Token Generation): After the prefill stage, the model generates subsequent tokens one by one, autoregressively. For each new token, the model reuses the key and value vectors (the KVCache) from previous tokens and computes new key and value vectors for the current token. This stage is typically memory-constrained and involves sequential computation.
  • KVCache (Key-Value Cache): The intermediate key and value activations computed during the prefill and decoding stages. Storing these allows the model to avoid recomputing them for each new token generated, significantly speeding up autoregressive decoding (see the sketch after this list). The size of the KVCache grows with the length of the input plus the generated tokens, making its management crucial for memory efficiency.
  • Service Level Objectives (SLOs): Performance targets or guarantees for a service. In LLM serving, key SLOs include:
    • Time To First Token (TTFT): The latency from when a request arrives until the first output token is generated. This is mainly influenced by the prefill stage.
    • Time Between Tokens (TBT): The average latency between the generation of consecutive output tokens. This is mainly influenced by the decoding stage.
  • Continuous Batching: An optimization technique in LLM serving where requests are dynamically batched together. Instead of processing requests sequentially, a scheduler continuously adds new requests to the batch and removes completed ones, maximizing GPU utilization. vLLM is a prominent open-source system that leverages this.
  • PagedAttention: An advanced memory management technique, introduced by vLLM, that uses a paging mechanism similar to operating systems to manage KVCache memory. It allows for flexible sharing of KVCache among different requests and prevents memory fragmentation, leading to higher throughput.
  • Disaggregated Architecture: A system design where different components or functions are separated into independent, specialized services or clusters. In LLM serving, this often means separating prefill and decoding computations onto different hardware pools, or disaggregating memory resources.
  • Parallelism Techniques in LLMs: Strategies to distribute computations across multiple GPUs or nodes.
    • Tensor Parallelism (TP): Divides individual tensors (like weights or activations) across multiple devices, typically within a single node. Requires frequent all-reduce operations, which can be costly across nodes.
    • Sequence Parallelism (SP): Partitions the input sequence across different devices. Each device processes a segment of the sequence. Requires communication per layer but can be more efficient for long sequences than TP across nodes.
    • Pipeline Parallelism (PP): Divides the model layers across different devices, forming a pipeline. Each device processes a stage of the model, and data flows sequentially through the pipeline. Reduces communication overhead compared to TP for large models.
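
To make the prefill/decoding split and the role of the KVCache concrete, here is a minimal NumPy sketch (an illustration, not taken from the paper): the prefill step processes the whole prompt in parallel and builds the cache of keys and values, and each decode step appends one new key/value row and attends over the cache instead of recomputing it. The model here is a single attention head with random weights; causal masking, layers, and sampling are omitted for brevity.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # (num_q, num_kv)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V                                  # (num_q, d_v)

rng = np.random.default_rng(0)
d = 16
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# Prefill: process all prompt tokens in parallel and build the KVCache.
prompt = rng.normal(size=(8, d))                        # 8 prompt-token embeddings
K_cache, V_cache = prompt @ Wk, prompt @ Wv
x = attention(prompt @ Wq, K_cache, V_cache)[-1]        # context for the first output token
# (a real model would run many layers and a sampler here to emit the token)

# Decoding: one token at a time, reusing cached K/V rather than recomputing them.
for _ in range(4):
    K_cache = np.vstack([K_cache, x @ Wk])              # append the new token's key
    V_cache = np.vstack([V_cache, x @ Wv])              # append the new token's value
    x = attention((x @ Wq)[None, :], K_cache, V_cache)[0]
```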

3.2. Previous Works

The Mooncake paper builds upon and differentiates itself from several key advancements in LLM serving.

  • Production-grade Systems: FasterTransformer [28], TensorRT-LLM [29], and DeepSpeed Inference [30] are industry solutions focused on optimizing LLM inference throughput through low-level optimizations. Mooncake operates at a higher architectural level, complementing these by focusing on disaggregation and scheduling.
  • Scheduling and Memory Management:
    • Orca [12] introduced iteration-level scheduling for concurrent processing, enhancing GPU utilization.
    • vLLM [13] (which Mooncake uses as a baseline) revolutionized LLM serving with continuous batching and PagedAttention for efficient KVCache memory management. Mooncake's design acknowledges vLLM's strengths but aims to overcome its limitations in handling distinct prefill and decoding characteristics, especially for long contexts.
    • FlexGen [31], SARATHI [15], and FastServe [32] explore various scheduling and swapping strategies to manage workloads on limited hardware. Mooncake's approach to KVCache handling and disaggregation aims to further optimize these aspects.
  • Disaggregated Architectures: Recent research, concurrent with or preceding Mooncake, has also identified the benefits of separating prefill and decoding:
    • Splitwise [7]: An early work proposing phase splitting for LLM inference. Mooncake was motivated by this direction.
    • DistServe [8]: Optimizes resource allocation and parallel strategies for each stage in a disaggregated setup to maximize GPU goodput.
    • TetriInfer [9]: Incorporates chunked prefill and two-stage disaggregation with a predictive two-stage scheduling algorithm. Mooncake shares these high-level ideas but focuses on KVCache-centricity and detailed overload management.
  • Prefix Caching and KVCache Reuse:
    • Prompt Cache [33]: Precomputes and stores frequently used KVCache to reduce inference latency.
    • SGLang [34]: Leverages RadixAttention with LRU cache in a radix tree for efficient KVCache sharing.
    • AttentionStore [35]: A concurrent work that proposes a hierarchical KVCache system using cost-effective memory. Mooncake shares design choices with AttentionStore but emphasizes KVCache-centric global scheduling for extremely large KVCaches in long-context inference.
    • Preble [36]: Explores KVCache-centric scheduling. Mooncake corroborates many findings in this area, particularly the focus on KVCache as a central scheduling primitive.

3.3. Technological Evolution

The evolution of LLM serving has moved from initial naive deployments to highly optimized systems. Early approaches often treated LLMs as monolithic black boxes, running prefill and decoding sequentially on the same hardware.

  1. Basic Serving: Initial LLM deployments often used simple batching or served requests one by one, leading to low GPU utilization.

  2. Continuous Batching & PagedAttention: Pioneered by vLLM, these techniques significantly improved throughput by dynamically grouping requests and efficiently managing KVCache memory. This marked a shift towards memory-aware optimization.

  3. Disaggregation: The realization that prefill and decoding have fundamentally different compute and memory characteristics led to the idea of separating these stages. This allows for specialized hardware and software optimizations for each, leading to better resource utilization and SLO adherence. Splitwise, DistServe, TetriInfer, and Mooncake are part of this trend.

  4. KVCache-Centric Optimization: As LLMs grew larger and contexts longer, the KVCache became a dominant factor in memory consumption and latency. Recent efforts, including Mooncake, AttentionStore, Prompt Cache, and SGLang, focus on optimizing KVCache storage, transfer, and reuse as a central element of serving efficiency.

  5. Overload Management: With the commercialization of LLMs and limited GPU availability, managing overload scenarios and ensuring SLO compliance during peak usage has become critical. This includes early rejection policies and load prediction, a key focus of Mooncake.

    Mooncake fits within this timeline by pushing the boundaries of disaggregation and KVCache-centricity, specifically addressing the practical challenges of long-context LLMs and overload conditions faced by MaaS providers.

3.4. Differentiation Analysis

Compared to the main methods in related work, Mooncake's core differences and innovations lie in:

  • KVCache-Centricity as a First-Class Citizen: While others explore KVCache management, Mooncake explicitly positions KVCache as the central primitive for its global scheduling decisions. This goes beyond just memory efficiency to encompass TTFT optimization, load balancing, and hot-spot migration.
  • Comprehensive Disaggregation: Extends beyond just prefill/decoding separation to include a truly disaggregated KVCache utilizing CPU, DRAM, and SSD resources. This leverages underutilized hardware for cost-effective capacity and bandwidth.
  • Optimized Long-Context Prefill: Introduces Chunked Pipeline Parallelism (CPP) and layer-wise prefill specifically tailored for long contexts. CPP offers better MFU and less network contention compared to Sequence Parallelism (SP) for cross-node acceleration, and layer-wise prefill effectively overlaps KVCache transfer, reducing VRAM occupation.
  • Overload-Oriented Scheduling: Unlike most research assuming sufficient resources, Mooncake explicitly tackles overload scenarios with a prediction-based early rejection policy. This aims to maximize goodput (successfully completed requests within SLOs) by preventing wasted computation and mitigating load fluctuations inherent in disaggregated systems. This is a practical innovation for MaaS providers.
  • Holistic Optimization for SLOs: Its scheduler (Conductor) balances cache reuse, instance load, and SLO adherence (TTFT, TBT) as primary objectives, rather than solely focusing on throughput maximization.
  • Real-World Validation: The system is deployed for Kimi (Moonshot AI's LLM service), handling exponential workload growth, and validated with real-world traces, providing practical evidence of its effectiveness under production constraints.

4. Methodology

4.1. Principles

The core principle of Mooncake is to maximize the overall effective throughput of an LLM serving system while strictly adhering to latency-related Service Level Objectives (SLOs), specifically Time To First Token (TTFT) and Time Between Tokens (TBT). This is achieved through a KVCache-centric disaggregated architecture that recognizes the distinct characteristics of the prefill and decoding stages, and proactively manages overload scenarios through intelligent scheduling and request rejection policies.

The theoretical basis and intuition behind this approach are:

  1. Disaggregation for Specialization: Prefill (computation-intensive, long-context) and decoding (memory-bound, autoregressive) stages have fundamentally different resource demands. Separating them into dedicated clusters allows for specialized optimization of hardware and software configurations for each, leading to higher resource utilization and better SLO compliance than a monolithic design.
  2. KVCache as the Central Bottleneck/Opportunity: The KVCache is a critical resource, both in terms of memory consumption and its potential for reuse. By making KVCache management and distribution central to scheduling decisions, Mooncake aims to reduce redundant computation, minimize data transfer overheads, and optimize VRAM usage.
  3. Proactive Overload Management: In real-world MaaS environments, overload is inevitable. Simply processing requests until SLOs are violated wastes computational resources. A proactive early rejection policy, especially one that anticipates future load, can save GPU cycles and ensure goodput by only accepting requests that are likely to complete within SLOs.
  4. Leveraging Underutilized Resources: Modern GPU servers often have substantial CPU, DRAM, and SSD resources that are underutilized during LLM inference. By using these for a disaggregated KVCache, Mooncake provides large, cost-effective storage for KVCache blocks, enabling greater reuse and reducing pressure on expensive GPU VRAM.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Mooncake Architecture Overview

The Mooncake architecture (Figure 1) is a KVCache-centric disaggregated system for LLM serving. It consists of:

  • Prefill Instances (Prefill Cluster): A pool of GPU nodes optimized for the prefill stage. Their goal is to maximize KVCache reuse, meet TTFT SLOs, and ensure a minimum MFU. They are constrained by DRAM availability.

  • Decoding Instances (Decoding Cluster): A pool of GPU nodes optimized for the decoding stage. Their goal is to maximize throughput (by maximizing batch size) while meeting TBT SLOs. They are constrained by VRAM capacity for the aggregated KVCache.

  • Disaggregated KVCache Pool: Leverages CPU, DRAM, and SSD across the GPU cluster to store KVCache blocks, enabling efficient near-GPU prefix caching without additional dedicated hardware costs.

  • Conductor (Global Scheduler): The central orchestrator responsible for dispatching requests, selecting appropriate prefill and decoding instances, and managing KVCache blocks (replicating hot blocks, swapping cold blocks).

    The following figure (Figure 1 from the original paper) shows the overall Mooncake architecture:

    Figure 1: Mooncake Architecture. The figure is a schematic of the Mooncake architecture, showing the components of the KVCache-centric scheduler, including the prefill instances, the KVCache pool, and the decoding pool. It depicts the roles of the different schedulers, such as the cache-aware prefill scheduler, the KVCache-balancing scheduler, and the load-balancing decoding scheduler, and highlights the resource allocation and interactions among these components.

4.2.2. KVCache Storage and Management

The KVCache is stored in CPU memory as paged blocks. Each block has a hash value for deduplication. This hash is derived from both the block's content and its prefix, allowing for precise identification of reusable prefixes. This structure enables efficient cache eviction algorithms (like LRU, LFU, or LengthAwareCache) to manage the KVCache pool based on request patterns. The Messenger component, a separate RDMA-based service in each node, handles high-speed, cross-machine transfer of these KVCache blocks.
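
As an illustration of this chained hashing scheme (a sketch, not Mooncake's actual implementation), each block's key can be computed from its own tokens combined with the key of the preceding block, so two prompts that share a prefix produce identical leading block keys:

```python
import hashlib

BLOCK_SIZE = 4  # illustrative; the paper's open-sourced trace uses 512-token blocks

def prefix_block_keys(tokens, block_size=BLOCK_SIZE):
    """Hash each full token block together with the hash of its prefix."""
    keys, prev_hash = [], ""
    num_full = len(tokens) // block_size * block_size   # only complete blocks are cached
    for i in range(0, num_full, block_size):
        block = ",".join(map(str, tokens[i:i + block_size]))
        prev_hash = hashlib.sha256((prev_hash + "|" + block).encode()).hexdigest()
        keys.append(prev_hash)
    return keys

a = prefix_block_keys([1, 2, 3, 4, 5, 6, 7, 8, 9])
b = prefix_block_keys([1, 2, 3, 4, 5, 6, 99, 8, 9])
shared = sum(1 for x, y in zip(a, b) if x == y)
print(shared)  # 1: only the first block (tokens 1-4) can be reused between the two prompts
```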

The following figure (Figure 3 from the original paper) illustrates the KVCache pool in CPU memory:

Figure 3: The KVCache pool in CPU memory. Each block is attached with a hash value determined by both its own hash and its prefix for deduplication. The schematic distinguishes prefix cache blocks, incremental cache blocks, and unallocated cache blocks; the design supports KVCache storing, reading, and loading, improving the efficiency of Mooncake's cache management.

4.2.3. Request Workflow

A typical request workflow in Mooncake (Figure 4) involves four main steps orchestrated by the Conductor:

  1. KVCache Reuse:

    • Upon receiving a request, the Conductor selects a prefill node (or group).
    • This selection balances three objectives: maximizing KVCache reuse (by identifying existing prefix KVCache blocks), balancing workloads across prefill nodes, and ensuring TTFT SLO adherence.
    • The selected prefill node loads reusable prefix KVCache blocks from remote CPU memory into GPU memory. This step is skipped if no reusable KVCache exists.
  2. Incremental Prefill:

    • The prefill node completes the prefill stage using the loaded prefix cache.
    • Newly generated KVCache (incremental KVCache) is stored back into CPU memory.
    • If the number of uncached input tokens exceeds a threshold (prefill_chunk), the prefill stage is split into multiple chunks and executed in a pipeline manner using Chunked Pipeline Parallelism (CPP). This chunking allows for efficient utilization of GPU computational power.
  3. KVCache Transfer:

    • The Messenger service, deployed in each node, manages high-speed, cross-machine KVCache transfer.
    • This transfer is executed asynchronously and overlapped with the incremental prefill step.
    • KVCache generated by each model layer is streamed to the destination decoding node's CPU memory, reducing waiting time. This is part of the Layer-wise Prefill strategy.
  4. Decoding:

    • Once all KVCache for a request is received in the decoding node's CPU DRAM, the request joins the next batch for continuous batching.

    • The Conductor pre-selects the decoding node based on its current load to prevent TBT SLO violations.

    • A local scheduler at the decoding node double-checks the anticipated load. If SLOs cannot be met, the request might still be rejected, leading to wasted prefill costs.

      The following figure (Figure 4 from the original paper) depicts the workflow of inference instances:

      Figure 4: Workflow of inference instances. (*) For prefill instances, the load and store operations of the KVCache layer are performed layer-by-layer and in parallel with the prefill computation to mitigate transmission overhead (see §5.2). (†) For decoding instances, asynchronous loading is performed concurrently with GPU decoding to prevent GPU idle time.

4.2.4. Implementation of the Prefill Pool

Mooncake maintains a separate prefill node pool due to the distinct characteristics of prefill and decoding, which require different cross-node parallelism settings and offer unique opportunities for VRAM saving.

4.2.4.1. Multi-node Prefill with Chunked Pipeline Parallelism (CPP)

For long-context requests (e.g., 10x-100x longer than output tokens), TTFT optimization is crucial. While Tensor Parallelism (TP) and Sequence Parallelism (SP) are options, they often involve significant RDMA-based all-reduce operations or frequent cross-node communication, reducing MFU and competing for network resources.

Mooncake uses Chunked Pipeline Parallelism (CPP):

  • It groups every $X$ nodes in the prefill cluster into a pipelined prefill node group.

  • Input tokens for a request are partitioned into chunks, each no larger than prefill_chunk.

  • Different chunks of the same request can be processed simultaneously by different nodes within the pipeline group, accelerating prefill and reducing TTFT.

    Benefits of CPP:

  • Reduced Communication: Similar to pipeline parallelism in training, CPP only requires cross-node communication at the boundaries of each pipeline stage, which can be easily overlapped with computation. This leads to better MFU and less network resource contention with KVCache transfers.

  • Adaptability: Naturally fits both short and long contexts without significant overhead for short contexts, avoiding the need for frequent dynamic adjustment of node partitioning.
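
A minimal sketch of how chunked pipeline parallelism can be organized (hypothetical scheduling code; `PREFILL_CHUNK` and the group size are assumed parameters, not values from the paper): the input is split into chunks of at most `prefill_chunk` tokens, and successive chunks enter the pipeline one stage apart, so at steady state every node in the group is busy with a different chunk of the same request.

```python
PREFILL_CHUNK = 4096   # assumed chunk size in tokens (the paper calls this prefill_chunk)
PIPELINE_NODES = 4     # assumed number of nodes (pipeline stages) per prefill group

def cpp_schedule(num_tokens, chunk=PREFILL_CHUNK, stages=PIPELINE_NODES):
    """Return, per pipeline tick, the (chunk, stage) pairs that run in parallel."""
    num_chunks = (num_tokens + chunk - 1) // chunk
    ticks = []
    for t in range(num_chunks + stages - 1):
        # Chunk c enters the pipeline at tick c and sits on stage (t - c) at tick t.
        ticks.append([(c, t - c) for c in range(num_chunks) if 0 <= t - c < stages])
    return ticks

# Example: a 16k-token prompt on a 4-node group; at steady state all four nodes
# are busy with different chunks of the same request, shortening TTFT.
for t, active in enumerate(cpp_schedule(16384)):
    print(f"tick {t}: " + ", ".join(f"chunk{c}@node{s}" for c, s in active))
```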

4.2.4.2. Layer-wise Prefill

To minimize VRAM occupation by KVCache, Mooncake implements layer-wise prefill. Since prefill is computation-bound and processed layer-by-layer, KVCache transfer and dumping can be overlapped with computation.

  • KVCache loading and storing are executed asynchronously using launch and wait operations.

  • Before a layer's attention computation begins, the model waits for that layer's KVCache to load and then triggers the asynchronous loading of the next layer's KVCache.

  • After attention calculation is complete, asynchronous storage of that layer's KVCache is launched.

  • Once all layers are computed, the process waits for all asynchronous storage operations to complete.

    This overlapping ensures that the prefill instance's execution time is roughly equivalent to either the KVCache loading time or the standard prefilling time. The main advantage is that it allows prefill scheduling to largely disregard VRAM size, as long as it can hold a single request. This frees up VRAM for other uses.
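
The launch/wait pattern described above can be sketched with ordinary Python threads standing in for the asynchronous RDMA/GPU streams the system actually uses; `load_kvcache_layer`, `store_kvcache_layer`, and `compute_layer` are hypothetical placeholders:

```python
from concurrent.futures import ThreadPoolExecutor

NUM_LAYERS = 4
io = ThreadPoolExecutor(max_workers=2)      # stands in for async load/store streams

def load_kvcache_layer(layer):
    return f"prefix-kv[{layer}]"            # remote CPU pool -> GPU

def store_kvcache_layer(layer):
    return f"stored-kv[{layer}]"            # GPU -> CPU pool

def compute_layer(layer, prefix_kv):
    pass                                    # attention + FFN for one layer

def layerwise_prefill(num_layers=NUM_LAYERS):
    stores = []
    load = io.submit(load_kvcache_layer, 0)        # launch: load layer 0's prefix KVCache
    for layer in range(num_layers):
        prefix_kv = load.result()                  # wait: this layer's prefix KV is ready
        if layer + 1 < num_layers:
            load = io.submit(load_kvcache_layer, layer + 1)   # launch next layer's load early
        compute_layer(layer, prefix_kv)            # compute overlaps with the next load
        stores.append(io.submit(store_kvcache_layer, layer))  # launch async store of this layer
    for s in stores:                               # wait for all stores before finishing
        s.result()

layerwise_prefill()
```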

The following figure (Figure 7 from the original paper) shows the latency of storing KVCache of different request lengths, highlighting the efficiency of layer-wise prefill:

Figure 7: Latency of storing KVCache of different request lengths (layer-wise latency refers to the difference in latency between Layer-wise Prefill and Prefill without storing KVCache). The chart compares serialized latency (blue bars) with layer-wise latency (yellow bars); latency rises markedly as sequence length increases, peaking at 128,000 tokens.

4.2.5. KVCache-centric Scheduling

Conductor is responsible for KVCache-centric scheduling, balancing instance loads and user experience (measured by TTFT and TBT SLOs).

4.2.5.1. Prefill Global Scheduling

Unlike traditional load-balancing based on request counts, Mooncake's prefill instance selection considers prefix cache hit length and KVCache block distribution. The goal is to route requests to instances with longer prefix cache lengths to reduce computation, but also to balance overall system load.

Algorithm 1 details the cache-aware prefill scheduling:

  • For a new request $R$:

    1. block_keys are generated by PrefixHash(R.prompt_tokens, B), where $B$ is the cache block size. This involves hashing token blocks together with their prefixes.
    2. Initialize TTFT to infinity and $p$ (the prefill instance) to null.
    3. FindBestPrefixMatch(P, block_keys) identifies the best_prefix_len (longest matching prefix) and the best_matched_instance (the instance holding this prefix) across all prefill instances $P$.
    4. Iterate through each instance in $P$:
      • Get instance.prefix_len (the local prefix match length).
      • Estimate the queue time for the instance: $T_{queue} \gets \text{EstimatePrefillQueueTime(instance)}$.
      • Cache-aware prefill scheduling (local match): If best_prefix_len is not significantly better than instance.prefix_len (controlled by kvcache_balancing_threshold), meaning the local instance has a good enough match or the remote prefix is too short to be worth transferring:
        • Estimate the prefill time: $T_{prefill} \gets \text{EstimatePrefillExecutionTime(len(R.prompt\_tokens), instance.prefix\_len)}$.
        • If $TTFT > T_{queue} + T_{prefill}$, update $TTFT \gets T_{queue} + T_{prefill}$ and set $p \gets \text{instance}$.
      • Cache-aware and -balancing prefill scheduling (remote match with transfer): Otherwise (the remote prefix is significantly longer and potentially worth transferring):
        • Calculate the number of tokens to transfer: transfer_len = best_prefix_len - prefix_len.
        • Estimate the transfer time: $T_{transfer} \gets \text{EstimateKVCacheTransferTime(instance, best\_matched\_instance, transfer\_len)}$.
        • Estimate the prefill time: $T_{prefill} \gets \text{EstimatePrefillExecutionTime(len(R.prompt\_tokens), best\_prefix\_len)}$.
        • If $TTFT > T_{transfer} + T_{queue} + T_{prefill}$, update $TTFT \gets T_{transfer} + T_{queue} + T_{prefill}$ and set $p \gets \text{instance}$.
    5. Select a decoding instance for load-balancing decoding scheduling: $d \gets \text{SelectDecodingInstance}(D)$.
    6. Check SLOs: If $TTFT > TTFT\_SLO$ or $TBT > TBT\_SLO$, reject $R$ and return.
    7. KVCache hot-spot migration: If best_prefix_len exceeds the kvcache_balancing_threshold, call TransferKVCache(best_matched_instance, p). In other words, if a request relies on a "hot" remote KVCache block, that block is proactively transferred to the selected prefill instance $p$ for future reuse and load balancing.
    8. Return the selected pair (p, d).
  • Engineering details:

    • EstimatePrefillExecutionTime: Uses a predictive model based on request length and prefix cache hit length.

    • EstimatePrefillQueueTime: Aggregates prefill times of currently queued requests.

    • EstimateKVCacheTransferTime: More complex to predict due to network status; this necessitates hot KVCache block replication.

      The following is Algorithm 1 from the original paper, detailing the KVCache-centric scheduling:

      Input: prefill instance pool $P$, decoding instance pool $D$, request $R$, cache block size $B$.
      Output: the prefill and decoding instances (p, d) to process $R$.
       1: block_keys ← PrefixHash(R.prompt_tokens, B)
       2: TTFT ← inf
       3: p ← ∅
       4: best_prefix_len, best_matched_instance ← FindBestPrefixMatch(P, block_keys)
       5: for instance ∈ P do
       6:   prefix_len ← instance.prefix_len
       7:   T_queue ← EstimatePrefillQueueTime(instance)
       8:   if best_prefix_len / prefix_len < kvcache_balancing_threshold then    ▷ Cache-aware prefill scheduling
       9:     T_prefill ← EstimatePrefillExecutionTime(len(R.prompt_tokens), prefix_len)
      10:     if TTFT > T_queue + T_prefill then
      11:       TTFT ← T_queue + T_prefill
      12:       p ← instance
      13:     end if
      14:   else                                                                  ▷ Cache-aware and -balancing prefill scheduling
      15:     transfer_len ← best_prefix_len - prefix_len
      16:     T_transfer ← EstimateKVCacheTransferTime(instance, best_matched_instance, transfer_len)
      17:     T_prefill ← EstimatePrefillExecutionTime(len(R.prompt_tokens), best_prefix_len)
      18:     if TTFT > T_transfer + T_queue + T_prefill then
      19:       TTFT ← T_transfer + T_queue + T_prefill
      20:       p ← instance
      21:     end if
      22:   end if
      23: end for
      24: d ← SelectDecodingInstance(D)                                           ▷ Load-balancing decoding scheduling
      25: if TTFT > TTFT_SLO or TBT > TBT_SLO then
      26:   reject R; return
      27: end if
      28: if best_prefix_len > kvcache_balancing_threshold then
      29:   TransferKVCache(best_matched_instance, p)                             ▷ KVCache hot-spot migration
      30: end if
      31: return (p, d)
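
A condensed, executable sketch of the selection loop in Algorithm 1 follows; the estimator callbacks, the instance dictionaries, and the threshold value are simplified placeholders rather than Mooncake's real implementation:

```python
import math

KVCACHE_BALANCING_THRESHOLD = 2.0  # assumed value; the paper does not report the actual setting

def select_prefill_instance(instances, prompt_len, best_prefix_len, best_holder,
                            est_queue, est_prefill, est_transfer):
    """Pick the prefill instance that minimizes estimated TTFT (Algorithm 1, lines 5-23)."""
    best_ttft, chosen = math.inf, None
    for inst in instances:
        t_queue = est_queue(inst)
        if best_prefix_len / max(inst["prefix_len"], 1) < KVCACHE_BALANCING_THRESHOLD:
            # Local prefix is good enough: prefill on top of the locally cached prefix.
            ttft = t_queue + est_prefill(prompt_len, inst["prefix_len"])
        else:
            # Remote match is much longer: pay a KVCache transfer to reuse it here.
            transfer_len = best_prefix_len - inst["prefix_len"]
            ttft = (t_queue
                    + est_transfer(inst, best_holder, transfer_len)
                    + est_prefill(prompt_len, best_prefix_len))
        if ttft < best_ttft:
            best_ttft, chosen = ttft, inst
    # Mirrors Algorithm 1, lines 28-30: replicate a hot remote prefix onto the chosen instance.
    migrate = best_prefix_len > KVCACHE_BALANCING_THRESHOLD
    return chosen, best_ttft, migrate
```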

4.2.5.2. Cache Load Balancing

To prevent network congestion and optimize reuse, Mooncake employs a heuristic-based automated hot-spot migration scheme:

  • When Conductor routes a request to an alternative prefill instance (not the one with the longest prefix match) due to load, if the estimated additional prefill time is shorter than the transfer time, the alternative instance proactively retrieves the necessary KVCache from the holder.
  • Additionally, if the best remote prefix match length is not significantly greater than the current local reusable prefix (i.e., best_prefix_len / local_prefix_len < threshold), the system prefers to compute the input tokens locally instead of transferring. These strategies facilitate the automatic replication of hot-spot caches across multiple machines, distributing the load and reducing TTFT.

The following figure (Figure 8 from the original paper) shows the results of a prefill scheduling experiment comparing different strategies:

Figure 8: The prefill scheduling experiment in the Mooncake cluster. The figure is a box plot of TTFT under different scheduling strategies (KVCache-centric, cache-aware, load-balancing, and random), with latency in seconds on the y-axis and the SLO marked; the KVCache-centric strategy shows the lowest latency, at 14.36 seconds.

4.2.6. Overload-Oriented Scheduling

In overload scenarios, Mooncake determines whether to accept or reject incoming requests based on system load.

4.2.6.1. Defining System Load and Early Rejection

  • Load Measurement: In Mooncake's disaggregated architecture, SLO satisfaction is used as the direct load measurement.
    • $l_{ttft}$: The TTFT SLO constraint.
    • $l_{tbt}$: The TBT SLO constraint.
    • The load for prefill and decoding instances is determined by comparing the predicted maximum TTFT and TBT on an instance against these constraints.
  • Early Rejection: To prevent wasted computation when a request is rejected by the decoding instance after prefill (due to high load), Mooncake assesses the decoding load before the prefill stage begins. Conductor accepts a request only if both prefill and decoding pools are predicted to meet SLOs.
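
The admission decision itself reduces to a simple check (a sketch with hypothetical predictor outputs; the real predictors are internal to Conductor): a request is accepted only when both the predicted prefill-side TTFT and the predicted decoding-side TBT stay within their SLO constraints $l_{ttft}$ and $l_{tbt}$.

```python
def accept_request(pred_ttft, pred_tbt, l_ttft, l_tbt):
    """Early rejection: admit a request only if both pools are predicted to meet their SLOs."""
    return pred_ttft <= l_ttft and pred_tbt <= l_tbt

# Example: the prefill pool predicts an 8 s TTFT (within a 30 s SLO), but the decoding
# pool predicts 0.15 s/token against a 0.1 s TBT SLO, so the request is rejected before
# any prefill computation is spent on it.
print(accept_request(pred_ttft=8.0, pred_tbt=0.15, l_ttft=30.0, l_tbt=0.1))   # False
```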

4.2.6.2. Load Fluctuation Caused by Early Rejection

A naive Early Rejection policy can cause significant anti-phase fluctuations between prefill and decoding machine loads (Figure 9). This is due to the time lag between predicting the decoding load and its actual execution.

The following figure (Figure 9 from the original paper) shows observed load fluctuations:

Figure 9: The load of prefill and decoding instances over 20 minutes, before using the prediction-based early rejection policy.

The fluctuation mechanism (Figure 10a) can be described in four stages:

  1. Stage 1 (Low Load): Both prefill and decoding loads are low. Conductor accepts many requests, saturating prefill instances.

  2. Stage 2 (Decoding High, Prefill Low): Requests from Stage 1 prefill move to decoding, causing high decoding load. Conductor rejects new incoming requests, leading to lower prefill load.

  3. Stage 3 (Decoding Decreases, Prefill Increases): No new requests enter decoding, so its load decreases. Conductor starts accepting requests again, increasing prefill load.

  4. Stage 4 (Decoding Increases, Prefill Decreases): As prefill requests complete and move to decoding, decoding load increases again. Conductor rejects new requests, lowering prefill load.

    This cycle leads to poor resource utilization.

The following figure (Figure 10 from the original paper) illustrates instance load when applying Early Rejection and Early Rejection Based on Prediction:

Figure 10: Instance load when applying Early Rejection and Early Rejection Based on Prediction. The four stages show how the loads of prefill and decoding instances change over time, with the accept and reject decisions marked.

4.2.6.3. Early Rejection Based on Prediction

To mitigate load fluctuation, Mooncake uses Early Rejection Based on Prediction. This framework predicts the decoding load after the prefill stage of incoming requests and uses this prediction to decide whether to accept them.

  • System-level Prediction (Current approach): Instead of predicting individual output lengths (which is hard), Mooncake estimates the overall batch count or TBT status for decoding instances after a specified time.

    • It assumes a uniform decoding time $t_d$ for each request's decoding stage.
    • At a given moment $t$, it identifies the requests that will have completed prefill by $t$ and adds them, uniformly distributed, to the decoding instances.
    • It removes requests whose decoding execution time will already have exceeded $t_d$ before $t$ (i.e., requests predicted to have finished decoding).
    • The average ratio of the predicted TBT across all decoding instances to $l_{tbt}$ is then used as the predicted load.
  • Request-level Prediction (Future work): Predicting the specific output length of each request could enable more accurate TTFT and TBT predictions, thus a more precise load assessment. However, this is currently challenging due to high cost or low accuracy, especially under overload.

    This prediction-based approach, as illustrated in Figure 10b, aims to stabilize load and improve resource utilization.
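
A simplified sketch of the system-level prediction (assuming, as the paper does, a uniform decoding time $t_d$ per request; the request records, the per-request TBT cost, and the linear scaling of TBT with batch size are illustrative assumptions): it estimates how many requests will occupy the decoding pool at a future time $t$ and compares the implied TBT against $l_{tbt}$.

```python
def predict_decoding_load(requests, t, t_d, num_decode_instances,
                          tbt_per_concurrent_request, l_tbt):
    """Estimate the decoding-pool load at a future time t, assuming each request
    decodes for a uniform duration t_d after its prefill finishes."""
    active = sum(1 for r in requests
                 if r["prefill_end"] <= t < r["prefill_end"] + t_d)
    per_instance = active / num_decode_instances
    predicted_tbt = per_instance * tbt_per_concurrent_request   # assumed linear scaling
    return predicted_tbt / l_tbt    # load ratio; > 1.0 means the TBT SLO would be violated

# Example: 120 requests finish prefill at t=55 s, each decoding for t_d=20 s,
# spread over 8 decoding instances with an assumed 4 ms TBT cost per concurrent request.
reqs = [{"prefill_end": 55.0}] * 120
print(predict_decoding_load(reqs, t=60.0, t_d=20.0, num_decode_instances=8,
                            tbt_per_concurrent_request=0.004, l_tbt=0.1))   # 0.6
```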

5. Experimental Setup

5.1. Datasets

The experiments in Mooncake utilize a mix of public datasets, simulated data, and real-world traces to evaluate performance across various scenarios. All experiments use a dummy model with the same architecture as LLaMA2-70B to protect proprietary information and ensure reproducibility.

The following are the results from Table 2 of the original paper:

| Dataset | Avg Input Length | Avg Output Length | Cache Ratio | Arrival Pattern |
| --- | --- | --- | --- | --- |
| ArXiv Summarization [26] | 8088 | 229 | ~0% | Poisson Process |
| L-Eval [27] | 19019 | 72 | >80% | Poisson Process |
| Simulated Data | 16k, 32k, 64k, 128k | 512 | 50% | Poisson Process |
| Real Data | 7955 | 194 | ~50% | Timestamp-based |

Detailed description of datasets:

  • ArXiv Summarization [26]:
    • Avg Input Length: 8088 tokens.
    • Avg Output Length: 229 tokens.
    • Cache Ratio: Approximately 0%. This dataset likely consists of unique, non-repeating prompts, making KVCache reuse minimal. It's suitable for evaluating raw prefill and decoding performance without the benefit of caching.
    • Arrival Pattern: Poisson Process, which simulates random arrivals typical of many real-world systems.
  • L-Eval [27]:
    • Avg Input Length: 19019 tokens.
    • Avg Output Length: 72 tokens.
    • Cache Ratio: Greater than 80%. This dataset represents scenarios with high KVCache reuse, making it ideal for evaluating the effectiveness of Mooncake's KVCache-centric scheduling and prefix caching. Its long input length also stresses prefill capabilities.
    • Arrival Pattern: Poisson Process.
  • Simulated Data:
    • Avg Input Length: Varied at 16k, 32k, 64k, 128k tokens. This is crucial for evaluating Mooncake's performance under extreme long-context conditions, where prefill can significantly impact TTFT.
    • Avg Output Length: 512 tokens.
    • Cache Ratio: 50%. A balanced scenario for KVCache reuse.
    • Arrival Pattern: Poisson Process.
  • Real Data:
    • Source: A sampled subset of online request data from Kimi (Moonshot AI) over a 1-hour period. Contains 23,608 entries.
    • Avg Input Length: 7955 tokens.
    • Avg Output Length: 194 tokens.
    • Cache Ratio: Approximately 50%.
    • Arrival Pattern: Timestamp-based, representing actual arrival times from a real-world workload, making it highly realistic for evaluating production performance and overload scenarios.
    • Privacy Protection: The trace is anonymized, removing user content but preserving timestamp, input_length, output_length, and hash_ids. The hash_ids field describes prefix caching relationships by hashing token blocks and their prefixes. For example:
      { "timestamp": 27482, "input_length": 6955, "output_length": 52, "hash_ids": [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 2353, 2354]   
      { "timestamp": 30535, "input_length": 6472, "output_length": 26, "hash_ids": [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 2366]   
      }
      
      In this sample, the identical first 12 hash_ids (46 to 57) indicate that the first $12 \times 512 = 6144$ tokens can share prefix caching. This open-sourced trace is a unique resource for analyzing real-world KVCache reuse.
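
The shared-prefix arithmetic in the example above can be reproduced directly from the trace's hash_ids (a short sketch; the 512-token block size is the one stated for the open-sourced trace):

```python
BLOCK_TOKENS = 512   # each hash_id in the trace covers one 512-token block

def shared_prefix_tokens(hash_ids_a, hash_ids_b, block_tokens=BLOCK_TOKENS):
    """Number of prompt tokens two requests can share via prefix caching."""
    shared_blocks = 0
    for a, b in zip(hash_ids_a, hash_ids_b):
        if a != b:
            break                      # prefix sharing stops at the first differing block
        shared_blocks += 1
    return shared_blocks * block_tokens

r1 = [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 2353, 2354]
r2 = [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 2366]
print(shared_prefix_tokens(r1, r2))    # 12 * 512 = 6144 tokens of shared prefix
```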

These datasets were chosen to cover a spectrum of LLM serving challenges: minimal cache reuse, high cache reuse, extremely long contexts, and realistic production workloads, allowing for comprehensive validation of Mooncake's design.

5.2. Evaluation Metrics

The experiments primarily focus on the throughput performance of systems under defined SLOs.

  • Throughput (Requests Per Second - RPS):
    • Conceptual Definition: Measures the number of requests a system can successfully process and complete within a given time frame. A higher RPS indicates better throughput. In the context of LLM serving, throughput is crucial for handling large volumes of user queries efficiently.
    • Mathematical Formula: Not explicitly provided, but generally calculated as: $ \text{Throughput (RPS)} = \frac{\text{Total Number of Completed Requests}}{\text{Total Time Elapsed (seconds)}} $
      • Symbol Explanation:
        • $\text{Total Number of Completed Requests}$: The total count of requests that have successfully finished their entire inference process.
        • $\text{Total Time Elapsed (seconds)}$: The duration over which the requests were processed, measured in seconds.
  • Time To First Token (TTFT) P90:
    • Conceptual Definition: TTFT measures the latency from the moment a request arrives until the very first output token is generated. It's a critical metric for user experience as it determines the initial responsiveness of the LLM. The P90 (90th percentile) value means that 90% of requests have a TTFT less than or equal to this value, reflecting the performance for the vast majority of users, not just the average.
    • Mathematical Formula: Not explicitly provided, but typically: $ \text{TTFT} = \text{Time of First Token Generation} - \text{Time of Request Arrival} $ The P90 TTFT is the value $X$ such that $90\%$ of all observed TTFT values are $\le X$.
      • Symbol Explanation:
        • $\text{Time of First Token Generation}$: The timestamp when the first token of the response is produced.
        • $\text{Time of Request Arrival}$: The timestamp when the user request was received by the system.
    • SLO Threshold: The paper sets TTFT thresholds relative to a baseline. For the end-to-end experiments, the P90 TTFT threshold is $10 \times \text{baseline\_TTFT}$; exceeding this threshold indicates an SLO violation.
  • Time Between Tokens (TBT) P90:
    • Conceptual Definition: TBT measures the average latency between the generation of successive output tokens for the same request. This metric reflects the smoothness and speed of ongoing text generation, impacting the perceived fluency of the LLM's response. The P90 TBT similarly indicates that 90% of token generation intervals are below this value.
    • Mathematical Formula: Not explicitly provided, but typically: $ \text{TBT}_i = \text{Time of Token}_i - \text{Time of Token}_{i-1} $ The P90 TBT is the value $X$ such that $90\%$ of all observed TBT values are $\le X$.
      • Symbol Explanation:
        • $\text{Time of Token}_i$: The timestamp when the $i$-th token is produced.
        • $\text{Time of Token}_{i-1}$: The timestamp when the $(i-1)$-th token was produced.
    • SLO Threshold: The paper sets the P90 TBT threshold as $5 \times \text{baseline\_TBT}$; exceeding this threshold indicates an SLO violation.
  • SLO Attainment Rate / Goodput:
    • Conceptual Definition: The primary objective is to maximize overall effective throughput while adhering to SLOs. The goodput concept implies that only requests that fully complete their execution within their respective SLOs are counted towards the throughput. If an SLO is violated, the resources consumed by that request are considered wasted, and the request does not contribute to goodput.
    • Mathematical Formula: Implicitly, goodput is the throughput calculated only from SLO-compliant requests. $ \text{Goodput} = \frac{\text{Number of Completed Requests within SLO}}{\text{Total Time Elapsed (seconds)}} $
      • Symbol Explanation:
        • $\text{Number of Completed Requests within SLO}$: The count of requests that finished their full execution and met both their TTFT and TBT thresholds.

          All TTFT and TBT values are normalized against their respective upper limits (SLO thresholds) for easier comparison, establishing a baseline of 1.0.
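
The metrics above can be computed from per-request timestamps as in the following sketch (field names are illustrative, the percentile uses NumPy's default interpolation, and checking every inter-token gap against the TBT limit is one reasonable interpretation rather than the paper's exact definition):

```python
import numpy as np

def p90(values):
    """90th-percentile of a list of latencies."""
    return float(np.percentile(values, 90))

def goodput(requests, ttft_slo, tbt_slo, window_seconds):
    """Requests per second that completed with TTFT and all inter-token gaps within SLO."""
    ok = 0
    for r in requests:                               # r: arrival time + output-token timestamps
        ttft = r["token_times"][0] - r["arrival"]
        tbts = np.diff(r["token_times"])
        if ttft <= ttft_slo and (tbts.size == 0 or tbts.max() <= tbt_slo):
            ok += 1
    return ok / window_seconds

reqs = [{"arrival": 0.0, "token_times": [2.0, 2.05, 2.11, 2.16]},
        {"arrival": 1.0, "token_times": [40.0, 40.1]}]      # violates a 30 s TTFT SLO
print(p90([r["token_times"][0] - r["arrival"] for r in reqs]))           # P90 TTFT
print(goodput(reqs, ttft_slo=30.0, tbt_slo=0.1, window_seconds=60.0))    # only 1 request counts
```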

5.3. Baselines

The primary baseline model used for comparison is vLLM.

  • vLLM:
    • Description: vLLM is described as one of the state-of-the-art open-source LLM serving systems. It is highly regarded for its efficiency in LLM inference.
    • Key Technologies: vLLM incorporates continuous batching and PagedAttention technologies.
      • Continuous batching allows dynamic grouping of requests, maximizing GPU utilization.
      • PagedAttention offers efficient KVCache memory management by breaking KVCache into fixed-size blocks, similar to virtual memory paging, preventing fragmentation and enabling flexible sharing.
    • Why it's a good baseline: Its adoption of continuous batching and PagedAttention makes it a strong contender in terms of inference throughput and memory efficiency, representing a high bar for LLM serving performance.
    • Limitations (as highlighted by Mooncake): The paper notes that vLLM's design couples the prefill and decoding stages of inference requests. This tight coupling can cause disruptions during decoding in scenarios involving long contexts. Specifically, a long prefill request might block the decoding of other requests, leading to TBT SLO violations. To counteract this in long-context scenarios, vLLM might resort to processing requests individually rather than in batches, which can reduce throughput. Mooncake's disaggregated approach aims to address this fundamental architectural limitation.

5.4. Testbed

The experiments were conducted on a high-performance computing node cluster.

  • Hardware Configuration (per node):
    • GPUs: 8 NVIDIA-A800-SXM4-80GB GPUs. Each GPU has 80GB of HBM (High Bandwidth Memory).
    • Interconnect: GPUs are connected by NVLINK.
    • Network: Equipped with RDMA network cards supporting up to 800 Gbps of interconnect bandwidth between nodes. RDMA (Remote Direct Memory Access) is crucial for high-speed, low-latency data transfer directly between memory of different machines, bypassing the CPU, which is essential for KVCache migration in Mooncake.
  • Deployment: Each node in the cluster is configured to deploy either a prefill instance or a decoding instance based on startup parameters, reflecting Mooncake's disaggregated design. This flexible configuration allows for testing different ratios of prefill to decoding resources.

6. Results & Analysis

6.1. Core Results Analysis

6.1.1. Public Datasets

The performance of Mooncake and vLLM was evaluated on ArXiv Summarization and L-Eval datasets, focusing on throughput (achieved RPS) and SLO adherence (P90 TTFT and P90 TBT).

  • Baseline Configuration: vLLM was configured with four instances, denoted as vLLM-[4M].

  • Mooncake Configurations: Two setups were used:

    • Mooncake-[3P+1D]: Three prefill instances and one decoding instance.

    • Mooncake-[2P+2D]: Two prefill instances and two decoding instances.

      The following figure (Figure 11 from the original paper) shows the end-to-end experimental results on the ArXiv Summarization and L-Eval datasets:

      Figure 11: End-to-end experiments of Mooncake and vLLM on the ArXiv Summarization and L-Eval datasets

      Analysis:

  • ArXiv Summarization Dataset:

    • Mooncake-[3P+1D] achieved a 20% throughput improvement over vLLM-[4M] while satisfying SLOs. This suggests that for workloads with minimal KVCache reuse (like ArXiv, ~0% cache ratio), Mooncake's disaggregated architecture and optimized prefill handling still provide benefits.
  • L-Eval Dataset:

    • Mooncake-[3P+1D] showed a 40% throughput improvement. This is particularly significant because L-Eval has a high cache ratio (>80%). Mooncake's ability to leverage prefix caching efficiently, combined with its KVCache-centric scheduling, further boosted performance.
  • Mooncake-[2P+2D] Performance:

    • Although Mooncake-[2P+2D] generally exhibited lower TBT latency (due to having more dedicated decoding instances), its TTFT performance was not as good as Mooncake-[3P+1D] or vLLM-[4M].
    • Reason for Discrepancy: This is attributed to an imbalance in the load between prefill and decoding instances. Having fewer prefill instances (2P vs 3P) might have created a bottleneck in the prefill stage, leading to higher TTFT, despite good TBT. This highlights the importance of correctly proportioning prefill and decoding resources based on workload characteristics. The paper suggests that in real clusters, this proportion can be pre-set, and future research will explore more flexible dynamic adjustments.

6.1.2. Simulated Data

This section evaluates Mooncake's performance with simulated data featuring various long-context input lengths (16k, 32k, 64k, 128k tokens) and a 50% prefix cache ratio. The cluster configurations (Mooncake-[3P+1D], Mooncake-[2P+2D], vLLM-[4M]) remained the same.

The following figure (Figure 12 from the original paper) presents the end-to-end experimental results on simulated data:

Figure 12: End-to-end experiments of Mooncake and vLLM on simulated data.

Analysis:

  • Impact of Long Contexts on vLLM: The paper notes that long-context requests significantly disrupt vLLM's decoding stage. To prevent TBT SLO violations, vLLM might be forced to process these requests individually rather than in batches, which severely limits its throughput. This is a critical point of differentiation.
  • Mooncake's Superiority in Long Contexts: Mooncake demonstrates significantly higher throughput enhancements, ranging from 50% to 525% compared to vLLM, while still adhering to both TTFT and TBT SLOs.
    • This is a direct validation of Mooncake's two-stage disaggregation design, particularly its Chunked Pipeline Parallelism (CPP) and layer-wise prefill optimizations for prefill. By decoupling prefill from decoding, Mooncake effectively minimizes the impact of the prefill stage on the decoding stage, preventing TBT SLO breaches.
    • The impressive 525% throughput increase in certain scenarios underscores Mooncake's strength in handling computationally intensive long-context prefill without sacrificing decoding stability or SLO compliance.

6.1.3. Real Workload

For real-world validation, Mooncake was tested against vLLM using replayed traces from 23,000 actual requests.

  • Configurations:

    • Mooncake-[10P+10D]: Ten prefill instances and ten decoding instances.
    • vLLM-[20M]: Twenty vLLM instances. (The total amount of hardware is identical in both configurations, 20 instances each, ensuring a fair comparison.)
  • SLO Thresholds: TTFT upper limit set at 30 seconds; TBT threshold capped at 0.1 seconds per token.

    The following figure (Figure 13 from the original paper) presents the CDF (Cumulative Distribution Function) plots for TTFT and TBT for the two systems under real workloads:

    Figure 13: Request TTFT and TBT distributions of Mooncake and vLLM under real workloads

    Analysis:

  • TTFT Distribution: The TTFT distributions for both Mooncake-[10P+10D] and vLLM-[20M] are nearly identical, with almost 100% of requests meeting the TTFT SLO. This indicates that both systems are capable of providing a responsive first token experience under real conditions.

  • TBT Distribution (Key Differentiator): This is where Mooncake demonstrates a significant advantage.

    • Approximately 100% of Mooncake-[10P+10D] requests satisfy the TBT SLO.
    • In contrast, only 57% of vLLM-[20M] requests meet the TBT criterion, with some requests exhibiting extremely high TBTs (indicating significant stuttering or delays in token generation).
  • Overall Capacity: In this real-world experiment, Mooncake was able to process approximately 75% more requests while adhering to both SLOs. This finding is a strong validation of Mooncake's practical effectiveness in handling production-scale workloads and maintaining quality of service, particularly its ability to prevent TBT degradation that often plagues vLLM in long-context or high-load situations. The disaggregated architecture prevents prefill from negatively impacting decoding stability.

6.2. Data Presentation (Tables)

6.2.1. Cache Hit Rates Under Different Cache Policies and Capacities

The paper analyzed cache hit rates using its real-world trace to understand KVCache reuse patterns.

The following are the results from Table 1 of the original paper:

| Block capacity | Inf | 100000 | 50000 | 30000 | 10000 | 1000 |
| --- | --- | --- | --- | --- | --- | --- |
| LRUCache | 0.51 | 0.51 | 0.50 | 0.48 | 0.40 | 0.30 |
| LFUCache | 0.51 | 0.51 | 0.49 | 0.43 | 0.35 | 0.30 |
| LengthAwareCache | 0.51 | 0.50 | 0.48 | 0.42 | 0.35 | 0.30 |

Analysis:

  • Overall Cache Hit Ratio: Even with infinite capacity, the maximum cache hit ratio observed is 0.51 (51%). This indicates that for the specific sampled trace, only about half of the KVCache blocks could be reused. This highlights that while KVCache reuse is beneficial, it's not a silver bullet for all workloads.
  • Impact of Capacity: Increasing cache capacity from 1,000 blocks to 50,000 blocks significantly boosts the hit ratio from 30% to 50%. However, beyond 50,000 blocks, further increases yield minimal improvement. This suggests a diminishing return on capacity for this particular trace.
  • Cache Policy Comparison: LRUCache (Least Recently Used) performed marginally better than, or on par with, LFUCache (Least Frequently Used) and LengthAwareCache (which prioritizes blocks occurring later in requests). LRU's slight edge suggests that temporal locality, where recently used blocks are likely to be reused soon, dominates the sampled workload's KVCache access patterns. (A minimal simulation sketch of replaying such a trace against an LRU cache follows this list.)
  • Implications for Mooncake: This analysis informs Mooncake's KVCache management. The observation that over 50% of blocks remain unused while some are accessed thousands of times (Figure 6) underscores the need for hot block replication to avoid transfer congestion, a mechanism implemented in Mooncake's KVCache-centric scheduler and cache load balancing.
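
To make the cache-policy comparison reproducible on other traces, the sketch below replays a sequence of prefix-block IDs against an LRU cache of a given capacity and reports the hit ratio. The trace format and the synthetic Zipf-like example are assumptions; the released Mooncake trace would be substituted in practice.

```python
# Minimal sketch of estimating LRU hit ratios over a block-ID trace (illustrative).
import random
from collections import OrderedDict

def lru_hit_ratio(block_trace, capacity):
    cache, hits = OrderedDict(), 0
    for block_id in block_trace:
        if block_id in cache:
            hits += 1
            cache.move_to_end(block_id)        # refresh recency on a hit
        else:
            cache[block_id] = True
            if len(cache) > capacity:
                cache.popitem(last=False)      # evict the least recently used block
    return hits / len(block_trace)

# Synthetic skewed trace: a few hot blocks are reused far more often than the tail.
random.seed(0)
population = list(range(20_000))
weights = [1.0 / (rank + 1) for rank in range(20_000)]
trace = random.choices(population, weights=weights, k=50_000)
for capacity in (1_000, 5_000, 20_000):
    print(capacity, round(lru_hit_ratio(trace, capacity), 2))
```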

6.2.2. Number of Requests Rejected by the System Under Overloaded-Scenario Experiment

This experiment evaluates the effectiveness of Mooncake's overload-oriented scheduling strategies. A cluster with 8 prefill instances and 8 decoding instances was tested using real traces, with the replay speed increased to 2x to simulate overload.

The following are the results from Table 3 of the original paper:

|  | Baseline | Early Rejection | Early Rejection based on Prediction |
| --- | --- | --- | --- |
| Number of rejected requests | 4183 | 3771 | 3589 |

Analysis:

  • Baseline (naive rejection based on the load observed at arrival): Rejected 4,183 requests. This approach often wastes resources because prefill computation may already have been performed for requests that the decoding stage later rejects.

  • Early Rejection: Reduced rejected requests to 3,771. This strategy assesses the decoding load before prefill begins, preventing ineffective computations for requests that would eventually be rejected. This is a clear improvement in resource utilization.

  • Early Rejection based on Prediction: Further reduced rejected requests to 3,589. By predicting the decoding load in the near future, this strategy proactively addresses the load-fluctuation problem (Figures 9 and 10a). With the load stabilized, the system can accept more requests overall, improving request-handling capacity and yielding fewer rejections than the simpler Early Rejection policy.

    These results strongly validate the benefits of Mooncake's overload-oriented scheduling and prediction-based early rejection, demonstrating improved effective resource utilization and request-handling capacity in overloaded production environments. A simplified sketch of such a prediction-based rejection check follows.
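
The sketch below is a simplified illustration of the prediction idea: instead of reacting to the instantaneous decoding load, the scheduler estimates how many requests will still occupy decoding slots when the new request's prefill would finish, and rejects up front if that predicted load exceeds capacity. The constants, the uniform decoding speed, and the fixed assumed output length are illustrative simplifications, not Mooncake's exact estimator.

```python
# Minimal sketch of a prediction-based early-rejection check (illustrative).
import time

DECODE_TOKENS_PER_SEC = 40    # assumed uniform decoding speed per request
ASSUMED_OUTPUT_TOKENS = 512   # assumed average output length
DECODE_CAPACITY = 256         # assumed concurrent decoding slots per instance

def predicted_decode_load(in_flight, now, prefill_eta_s):
    """Predict how many requests still occupy decoding slots at now + prefill_eta_s."""
    t_arrive = now + prefill_eta_s
    still_running = 0
    for req in in_flight:     # req: dict with 'decode_start' (s) and 'tokens_done'
        remaining_s = (ASSUMED_OUTPUT_TOKENS - req["tokens_done"]) / DECODE_TOKENS_PER_SEC
        if req["decode_start"] + remaining_s > t_arrive:
            still_running += 1
    return still_running

def should_reject(in_flight, now, prefill_eta_s):
    # Reject before prefill starts if the *future* decoding load is predicted to
    # exceed capacity, rather than reacting to the current load.
    return predicted_decode_load(in_flight, now, prefill_eta_s) >= DECODE_CAPACITY

in_flight = [{"decode_start": time.time() - 5.0, "tokens_done": 100}]
print(should_reject(in_flight, now=time.time(), prefill_eta_s=8.0))   # -> False
```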

6.3. Ablation Studies / Parameter Analysis

6.3.1. Prefill Scheduling Experiment

An experiment was conducted to evaluate the impact of Mooncake's prefill scheduling strategies on TTFT and SLO attainment rate.

The following figure (Figure 8 from the original paper) shows the total latency (TTFT) under different scheduling strategies in the Mooncake cluster:

Figure 8: The prefill scheduling experiment in the Mooncake cluster.

Analysis:

  • Random Scheduling: A prefill instance is selected arbitrarily. This resulted in the highest TTFT (around 30 seconds on average) and likely the lowest SLO attainment rate (not explicitly shown in the box plot, but implied by poor performance). It serves as a weak baseline, showing the necessity of any scheduling.

  • Load-Balancing Scheduling: The instance with the lightest current load is chosen. This improves TTFT considerably compared to random (around 25 seconds), as it prevents single instances from becoming bottlenecks.

  • Cache-Aware Scheduling: This strategy, described in the scheduling section above, considers both instance load and prefix cache hit length. It further reduces TTFT (around 20 seconds) by prioritizing KVCache reuse, which cuts redundant prefill computation.

  • KVCache-centric Scheduling: Mooncake's full scheduling algorithm, which combines cache-aware scheduling with cache load balancing (including hot-spot migration). It achieves the lowest TTFT (approximately 14.36 seconds) and the best SLO attainment rate, indicated by the lower and tighter box-and-whisker plot.

    Conclusion: The results clearly demonstrate that Mooncake's KVCache-centric scheduling algorithm significantly outperforms random, load-balancing, and even cache-aware strategies by jointly exploiting KVCache reuse, instance load balancing, and proactive KVCache migration to optimize TTFT and SLO adherence. The TTFT SLO threshold is marked in the figure, and the KVCache-centric strategy stays well below it, confirming SLO compliance. A minimal sketch of cache-aware instance selection follows.
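
The sketch below illustrates the core of cache-aware instance selection under simple assumptions: each prefill instance's estimated TTFT is its queued work plus the cost of fetching any remote prefix cache and computing the uncached tokens, and the instance with the minimum estimate is chosen. The cost constants and data layout are hypothetical; the full algorithm additionally handles hot-spot replication and SLO checks.

```python
# Minimal sketch of cache-aware prefill instance selection (illustrative only).
PREFILL_SEC_PER_TOKEN = 0.0005    # assumed prefill compute cost per uncached token
TRANSFER_SEC_PER_TOKEN = 0.00005  # assumed cost of fetching remote KVCache per token

def estimate_ttft(instance, prompt_tokens, remote_hit):
    """Queued work + optional remote-cache transfer + prefill of uncached tokens."""
    local_hit = instance["local_hit"]
    best_hit = max(local_hit, remote_hit)
    transfer = 0.0 if best_hit == local_hit else remote_hit * TRANSFER_SEC_PER_TOKEN
    compute = (prompt_tokens - best_hit) * PREFILL_SEC_PER_TOKEN
    return instance["queue_s"] + transfer + compute

def pick_prefill_instance(instances, prompt_tokens, remote_hit):
    # instances: dicts with 'queue_s' (queued work, s) and 'local_hit' (prefix match, tokens)
    return min(instances,
               key=lambda ins: estimate_ttft(ins, prompt_tokens, remote_hit))

cluster = [{"name": "p0", "queue_s": 6.0, "local_hit": 12_000},
           {"name": "p1", "queue_s": 1.0, "local_hit": 2_000}]
# With a 12k-token prefix available remotely, the lightly loaded p1 wins even though
# p0 holds the longer local prefix.
print(pick_prefill_instance(cluster, prompt_tokens=32_000, remote_hit=12_000)["name"])
```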

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper introduces Mooncake, a novel KVCache-centric disaggregated architecture for Large Language Model (LLM) serving, developed by Moonshot AI for its Kimi service. Mooncake addresses critical challenges in LLM serving, particularly for long-context scenarios and overloaded production environments. Its core contributions include:

  1. Disaggregated Architecture: Separating prefill and decoding stages into distinct clusters, along with a disaggregated KVCache utilizing CPU, DRAM, and SSD resources, enables specialized optimization and efficient resource utilization.

  2. Advanced Prefill Optimization: Implementation of Chunked Pipeline Parallelism (CPP) and layer-wise KVCache transfer significantly reduces Time To First Token (TTFT) for long-context requests, mitigating network overhead and VRAM pressure.

  3. KVCache-Centric Scheduling: The Conductor (global scheduler) intelligently balances KVCache reuse, instance load, and the latency SLOs for TTFT and Time Between Tokens (TBT), incorporating heuristic-based hot-spot migration of KVCache blocks.

  4. Overload Management: A prediction-based early rejection policy is introduced to prevent wasted computation and mitigate load fluctuations in overloaded scenarios, ensuring higher goodput (successfully completed requests within SLOs).

    Experimental results demonstrate Mooncake's superiority: up to a 525% throughput increase in simulated long-context scenarios compared to baseline vLLM, and the ability to handle 75% more requests under real-world Kimi workloads, all while consistently meeting SLOs. The KVCache-centric approach, combined with overload-aware scheduling, proves highly effective for scalable and efficient LLM serving.

7.2. Limitations & Future Work

The authors acknowledge several limitations and propose promising directions for future research:

  • Heterogeneous Accelerators: Current flagship accelerators are versatile but not optimal in every metric (e.g., bandwidth per dollar/watt). Future work will explore leveraging heterogeneous accelerators (computation-oriented vs. bandwidth-oriented) and process-in-memory or hybrid bonding technologies to reduce the cost of memory-bound operations in the decoding phase.
  • Advanced Disaggregation: Further disaggregation architectures could separate the attention operator from other linear operators during decoding, especially since attention can be memory-bound. Preliminary simulated results show potential for increased throughput. The MLA operator by DeepSeek-v2 is noted as a promising alternative.
  • KVCache Reduction Algorithms: Continuously reducing KVCache size is crucial for increasing batch size and improving KVCache hit ratios. Future work includes exploring KVCache compression (e.g., [48-53]), important token selection (e.g., [54-60]), KVCache sharing across layers (e.g., [61-63]), and hybrid architectures that don't rely solely on KVCache (e.g., Mamba [65], RWKV [66]).
  • Enhanced Scheduling Policies: Developing more advanced policies to account for varying request priorities and scenarios with different TTFT/TBT SLOs.
  • Dynamic KVCache Management: Improving KVCache management, including replication, migration, and specialized eviction policies for partial hits and expiration scenarios.
  • Dynamic Resource Balancing: Strategies for dynamically balancing prefill and decoding instances and utilizing idle resources through batch-oriented offloading tasks to maximize resource utilization during fluctuating workloads.

7.3. Personal Insights & Critique

Mooncake presents a compelling and practically relevant solution for the increasingly complex challenge of LLM serving. Its KVCache-centric disaggregated architecture is a robust response to the distinct demands of prefill and decoding stages, a recognition that is gaining traction across the research community.

Strengths and Innovations:

  • Practicality for MaaS Providers: The explicit focus on overload scenarios and SLO adherence is a standout feature. Most academic papers optimize for peak throughput under ideal conditions, but Mooncake directly addresses the messy reality of production environments with limited resources and fluctuating demand. The prediction-based early rejection is a highly valuable, albeit complex, mechanism for maximizing goodput.
  • Holistic Optimization: Mooncake doesn't just disaggregate; it ties the entire system together with a sophisticated KVCache-centric scheduler. This unified approach, considering KVCache reuse, load balancing, and latency SLOs simultaneously, is more powerful than isolated optimizations.
  • Leveraging Existing Resources: The idea of using underutilized CPU, DRAM, and SSD for KVCache is a smart, cost-effective way to scale capacity without relying solely on expensive GPU VRAM.
  • Open-Sourced Trace: The provision of a real-world, anonymized trace is a significant contribution to the research community, enabling more realistic benchmarking and future studies on KVCache reuse patterns.

Potential Issues/Areas for Improvement:

  • Complexity of Prediction: While system-level prediction for early rejection is practical, request-level output length prediction (left for future work) is notoriously hard. The accuracy and robustness of any prediction model will heavily influence the effectiveness of early rejection, especially under rapidly changing workloads. The sensitivity of the system to prediction errors could be a challenge.
  • RDMA Overhead: While RDMA provides high bandwidth and low latency, cross-node KVCache transfer still incurs overhead. The paper mentions hot-spot migration to avoid congestion, but the specific mechanisms and costs associated with dynamic KVCache replication and transfer are not detailed extensively. The efficiency of the Messenger component is critical here.
  • Static Instance Ratios: The observed load imbalance in Mooncake-[2P+2D] on public datasets suggests that pre-setting prefill and decoding instance ratios might not always be optimal. The proposed future work on dynamic balancing is crucial for truly elastic and efficient resource allocation.
  • Cost Model: While the paper mentions cost-effectiveness through leveraging existing resources, a more explicit cost model comparing Mooncake to alternatives in terms of TCO (Total Cost of Ownership) or cost per inference would strengthen its economic argument.

Transferability and Future Value: The principles of KVCache-centric scheduling, disaggregated architectures, and overload-oriented policies are highly transferable beyond Mooncake or Kimi. Any LLM serving platform or MaaS provider facing similar challenges of scale, long contexts, and SLO adherence under resource constraints could benefit from adopting or adapting these concepts. The insights into load fluctuation and prediction-based mitigation are particularly valuable for general distributed system design. The emphasis on practical deployment and real-world workloads gives Mooncake a strong foundation for influencing future LLM serving system designs.
