Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
TL;DR Summary
Mooncake features a KVCache-centric disaggregated architecture that significantly enhances effective throughput for LLM serving. By separating the prefill and decoding stages and utilizing idle GPU cluster resources, it achieves up to a 525% increase in throughput in long-context scenarios.
Abstract
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. It features a KVCache-centric disaggregated architecture that separates the prefill and decoding clusters. It also leverages the underutilized CPU, DRAM, and SSD resources of the GPU cluster to implement a disaggregated cache of KVCache. The core of Mooncake is its KVCache-centric scheduler, which balances maximizing overall effective throughput while meeting latency-related Service Level Objectives (SLOs). Unlike traditional studies that assume all requests will be processed, Mooncake faces challenges due to highly overloaded scenarios. To mitigate these, we developed a prediction-based early rejection policy. Experiments show that Mooncake excels in long-context scenarios. Compared to the baseline method, Mooncake can achieve up to a 525% increase in throughput in certain simulated scenarios while adhering to SLOs. Under real workloads, Mooncake's innovative architecture enables Kimi to handle 75% more requests.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is "Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving". It focuses on designing an efficient and scalable system for serving large language models (LLMs), particularly emphasizing the optimization of the Key-Value Cache (KVCache) within a disaggregated architecture.
1.2. Authors
The authors are Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. Their affiliations include Moonshot AI and Tsinghua University, indicating a collaboration between industry and academia in the field of LLM serving systems.
1.3. Journal/Conference
The paper is published as a preprint on arXiv (arXiv:2407.00079), a widely recognized open-access repository for scientific preprints. While arXiv itself is not a peer-reviewed journal or conference, it is a common platform for researchers to share their latest work before or during the peer-review process, allowing for broad dissemination and early feedback within the research community. The reputation of arXiv as a venue for cutting-edge research in computer science and AI is very high.
1.4. Publication Year
The paper was first published on arXiv on 2024-06-24 (UTC).
1.5. Abstract
Mooncake is presented as the serving platform for Kimi, a prominent LLM service by Moonshot AI. Its core innovation is a KVCache-centric disaggregated architecture that distinctly separates the prefill and decoding stages of LLM inference into different clusters. Furthermore, it efficiently utilizes underutilized resources like CPU, DRAM, and SSD within the GPU cluster to establish a disaggregated KVCache. The platform's central component is its KVCache-centric scheduler, which is meticulously designed to optimize overall effective throughput while rigorously adhering to latency-related Service Level Objectives (SLOs).
Unlike conventional research that often assumes all requests are processed, Mooncake confronts significant challenges posed by highly overloaded scenarios. To address this, the authors developed a prediction-based early rejection policy. Experimental findings demonstrate Mooncake's exceptional performance in long-context scenarios. Specifically, it achieves up to a 525% increase in throughput compared to baseline methods in certain simulated environments, all while meeting SLOs. Under real-world workloads, Mooncake's innovative design allows Kimi to process 75% more requests effectively.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2407.00079
- PDF Link: https://arxiv.org/pdf/2407.00079v4.pdf
- Publication Status: The paper is a preprint published on arXiv.
2. Executive Summary
2.1. Background & Motivation
The rapid adoption of Large Language Models (LLMs) has led to diversified workloads with varying input/output lengths, arrival patterns, and crucial Service Level Objectives (SLOs) for latency, primarily Time To First Token (TTFT) and Time Between Tokens (TBT). As a Model as a Service (MaaS) provider, Kimi (by Moonshot AI) faces the challenge of maximizing overall effective throughput (which directly impacts revenue) while satisfying these complex SLO constraints.
The core problem the paper addresses is the inefficient serving of LLMs, especially in:
- Resource Utilization: GPU servers are often highly integrated, but the prefill and decoding stages of LLM inference have very different computational characteristics. Traditional monolithic serving architectures struggle to optimize resources for both, leading to underutilization.
- KVCache Management: The KVCache is central to LLM serving, but optimizing its reuse (to reduce computation) and batching (to improve Model FLOPs Utilization, MFU) often conflict with latency SLOs (e.g., reusing KVCache from remote locations can increase TTFT; large batch sizes can increase TBT).
- Long-Context Scenarios: Modern LLMs handle increasingly long contexts, making the prefill stage computationally intensive and demanding efficient TTFT optimization.
- Overloaded Scenarios: MaaS providers frequently face severe overload problems due to limited GPU supply and rapidly growing user requests, especially during peak times. Existing LLM serving research often assumes sufficient resources, leaving a gap in strategies for managing overload effectively and deciding which requests to reject to avoid wasting computational resources.

The paper's entry point is the observation that KVCache scheduling is central to LLM serving efficiency. Its innovative idea is to propose a KVCache-centric disaggregated architecture that separates the prefill and decoding processes and leverages underutilized CPU, DRAM, and SSD resources for a disaggregated KVCache. This allows for specialized optimization of each stage and intelligent KVCache management, coupled with overload-oriented scheduling policies.
2.2. Main Contributions / Findings
The paper makes several significant contributions to the field of LLM serving:
- KVCache-centric Disaggregated Architecture: Introduction of Mooncake, a novel architecture that separates prefill and decoding clusters and implements a disaggregated KVCache using CPU, DRAM, and SSD resources. This design allows for independent optimization of each stage and efficient KVCache management.
- Chunked Pipeline Parallelism (CPP) for Prefill: Proposes and implements CPP for long-context prefill to reduce TTFT, offering benefits over traditional Tensor Parallelism (TP) or Sequence Parallelism (SP) by reducing network consumption and simplifying elastic scaling.
- Layer-wise KVCache Transfer: Implements layer-wise prefill with stream transferring of KVCache to overlap latency, effectively reducing VRAM occupation during prefill and optimizing KVCache transfer.
- KVCache-centric Scheduling Algorithm: Develops a sophisticated global scheduler (Conductor) that considers KVCache reuse, instance loads, and SLOs (TTFT, TBT). This includes a heuristic-based automated hot-spot migration scheme for KVCache blocks to balance loads and reduce TTFT.
- Overload-Oriented Scheduling with Prediction-Based Early Rejection: Addresses the practical challenge of overload scenarios by introducing an early rejection policy that predicts future load (especially decoding load) to prevent wasted computation and mitigate load fluctuations often caused by naive early rejection.
- Empirical Validation and Open-Source Trace: Demonstrates the effectiveness of Mooncake through extensive experiments on public datasets, simulated data, and real-world Kimi traces. The results show significant throughput improvements (up to 525% in simulated scenarios, 75% more requests under real workloads) while adhering to SLOs. An anonymized real-world request trace is open-sourced to facilitate further research.

The key conclusions are that a KVCache-centric disaggregated architecture with intelligent scheduling and overload management is highly effective for LLM serving, especially for long-context requests and under overloaded conditions. The prediction-based early rejection policy is crucial for maintaining resource utilization and stability in such dynamic environments. These findings collectively solve the problem of efficiently serving LLMs at scale while maintaining quality of service and maximizing resource efficiency.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully understand the Mooncake paper, a foundational understanding of several concepts related to Large Language Models (LLMs) and their serving infrastructure is essential.
- Large Language Models (LLMs): These are advanced artificial intelligence models, typically based on the Transformer architecture, trained on vast amounts of text data. They can understand, generate, and process human language for various tasks like text summarization, question answering, and content creation. Examples include GPT and LLaMA.
- Transformer Architecture: The core neural network architecture for most modern LLMs. It relies heavily on self-attention mechanisms to weigh the importance of different parts of the input sequence, and on feed-forward neural networks. It processes input sequences in parallel and generates output sequences autoregressively.
  - Self-Attention: A mechanism that allows the model to weigh the importance of different words in the input sequence when processing a specific word. It calculates query ($Q$), key ($K$), and value ($V$) vectors from the input embeddings. The attention score is computed by taking the dot product of the $Q$ and $K$ vectors, scaled by $\sqrt{d_k}$ (where $d_k$ is the dimension of the key vectors), and then applying a softmax function. This score is then multiplied by $V$ to get the output.
    $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $
    Where:
    - $Q$ (Query), $K$ (Key), and $V$ (Value) are matrices derived from the input embeddings.
    - $QK^T$ is the dot product of the Query and Key matrices, measuring similarity.
    - $\sqrt{d_k}$ is a scaling factor to prevent large dot products from pushing the softmax into regions with tiny gradients.
    - softmax normalizes the scores to create a probability distribution.
- LLM Inference Stages: When an LLM generates a response, it typically involves two distinct stages:
  - Prefill Stage (or Prompt Processing): This is the initial stage where the entire input prompt (context) is processed in parallel to generate the first output token. During this stage, intermediate activations, specifically the key and value vectors from the self-attention layers, are computed and stored. This stage is usually computation-intensive, especially for long contexts.
  - Decoding Stage (or Token Generation): After the prefill stage, the model generates subsequent tokens one by one, autoregressively. For each new token, the model reuses the key and value vectors (the KVCache) from previous tokens and computes new key and value vectors for the current token. This stage is typically memory-constrained and involves sequential computation.
- KVCache (Key-Value Cache): The intermediate key and value activations computed during the prefill and decoding stages. Storing these allows the model to avoid recomputing them for each new token generated, significantly speeding up autoregressive decoding. The size of the KVCache grows with the length of the input plus generated tokens, making its management crucial for memory efficiency (a rough sizing sketch follows this list).
- Service Level Objectives (SLOs): Performance targets or guarantees for a service. In LLM serving, key SLOs include:
  - Time To First Token (TTFT): The latency from when a request arrives until the first output token is generated. This is mainly influenced by the prefill stage.
  - Time Between Tokens (TBT): The average latency between the generation of consecutive output tokens. This is mainly influenced by the decoding stage.
- Continuous Batching: An optimization technique in LLM serving where requests are dynamically batched together. Instead of processing requests sequentially, a scheduler continuously adds new requests to the batch and removes completed ones, maximizing GPU utilization. vLLM is a prominent open-source system that leverages this.
- PagedAttention: An advanced memory management technique, introduced by vLLM, that uses a paging mechanism similar to operating systems to manage KVCache memory. It allows for flexible sharing of KVCache among different requests and prevents memory fragmentation, leading to higher throughput.
- Disaggregated Architecture: A system design where different components or functions are separated into independent, specialized services or clusters. In LLM serving, this often means separating prefill and decoding computations onto different hardware pools, or disaggregating memory resources.
- Parallelism Techniques in LLMs: Strategies to distribute computations across multiple GPUs or nodes.
  - Tensor Parallelism (TP): Divides individual tensors (like weights or activations) across multiple devices, typically within a single node. Requires frequent all-reduce operations, which can be costly across nodes.
  - Sequence Parallelism (SP): Partitions the input sequence across different devices. Each device processes a segment of the sequence. Requires communication per layer but can be more efficient for long sequences than TP across nodes.
  - Pipeline Parallelism (PP): Divides the model layers across different devices, forming a pipeline. Each device processes a stage of the model, and data flows sequentially through the pipeline. Reduces communication overhead compared to TP for large models.
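To make the KVCache growth concrete, here is a back-of-the-envelope sizing sketch in Python. The layer count, KV-head count, head dimension, and fp16 storage are assumptions chosen to resemble a LLaMA2-70B-class model; they are not values taken from the paper.

```python
# Rough KVCache sizing for a LLaMA2-70B-like model (all dimensions are assumptions).

def kvcache_bytes(seq_len: int,
                  num_layers: int = 80,
                  num_kv_heads: int = 8,      # grouped-query attention heads for K/V
                  head_dim: int = 128,
                  bytes_per_value: int = 2):  # fp16/bf16 storage
    """Bytes of key+value activations cached for one sequence of seq_len tokens."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value  # 2 = key + value
    return seq_len * per_token

if __name__ == "__main__":
    for seq_len in (8_000, 32_000, 128_000):
        print(f"{seq_len:>7} tokens -> ~{kvcache_bytes(seq_len) / 2**30:.1f} GiB of KVCache")
```

Under these assumptions, a single 128k-token context already occupies tens of GiB, which is why spilling KVCache to CPU DRAM and SSD, as Mooncake does, matters so much.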
3.2. Previous Works
The Mooncake paper builds upon and differentiates itself from several key advancements in LLM serving.
- Production-grade Systems: FasterTransformer [28], TensorRT-LLM [29], and DeepSpeed Inference [30] are industry solutions focused on optimizing LLM inference throughput through low-level optimizations. Mooncake operates at a higher architectural level, complementing these by focusing on disaggregation and scheduling.
- Scheduling and Memory Management: Orca [12] introduced iteration-level scheduling for concurrent processing, enhancing GPU utilization. vLLM [13] (which Mooncake uses as a baseline) revolutionized LLM serving with continuous batching and PagedAttention for efficient KVCache memory management. Mooncake's design acknowledges vLLM's strengths but aims to overcome its limitations in handling the distinct prefill and decoding characteristics, especially for long contexts. FlexGen [31], SARATHI [15], and FastServe [32] explore various scheduling and swapping strategies to manage workloads on limited hardware. Mooncake's approach to KVCache handling and disaggregation aims to further optimize these aspects.
- Disaggregated Architectures: Recent research, concurrent with or preceding Mooncake, has also identified the benefits of separating prefill and decoding:
  - Splitwise [7]: An early work proposing phase splitting for LLM inference. Mooncake was motivated by this direction.
  - DistServe [8]: Optimizes resource allocation and parallel strategies for each stage in a disaggregated setup to maximize GPU goodput.
  - TetriInfer [9]: Incorporates chunked prefill and two-stage disaggregation with a predictive two-stage scheduling algorithm. Mooncake shares these high-level ideas but focuses on KVCache-centricity and detailed overload management.
- Prefix Caching and KVCache Reuse:
  - Prompt Cache [33]: Precomputes and stores frequently used KVCache to reduce inference latency.
  - SGLang [34]: Leverages RadixAttention with an LRU cache in a radix tree for efficient KVCache sharing.
  - AttentionStore [35]: A concurrent work that proposes a hierarchical KVCache system using cost-effective memory. Mooncake shares design choices with AttentionStore but emphasizes KVCache-centric global scheduling for extremely large KVCaches in long-context inference.
  - Preble [36]: Explores KVCache-centric scheduling. Mooncake corroborates many findings in this area, particularly the focus on KVCache as a central scheduling primitive.
3.3. Technological Evolution
The evolution of LLM serving has moved from initial naive deployments to highly optimized systems. Early approaches often treated LLMs as monolithic black boxes, running prefill and decoding sequentially on the same hardware.
- Basic Serving: Initial LLM deployments often used simple batching or served requests one by one, leading to low GPU utilization.
- Continuous Batching & PagedAttention: Pioneered by vLLM, these techniques significantly improved throughput by dynamically grouping requests and efficiently managing KVCache memory. This marked a shift towards memory-aware optimization.
- Disaggregation: The realization that prefill and decoding have fundamentally different compute and memory characteristics led to the idea of separating these stages. This allows for specialized hardware and software optimizations for each, leading to better resource utilization and SLO adherence. Splitwise, DistServe, TetriInfer, and Mooncake are part of this trend.
- KVCache-Centric Optimization: As LLMs grew larger and contexts longer, the KVCache became a dominant factor in memory consumption and latency. Recent efforts, including Mooncake, AttentionStore, Prompt Cache, and SGLang, focus on optimizing KVCache storage, transfer, and reuse as a central element of serving efficiency.
- Overload Management: With the commercialization of LLMs and limited GPU availability, managing overload scenarios and ensuring SLO compliance during peak usage has become critical. This includes early rejection policies and load prediction, a key focus of Mooncake.

Mooncake fits within this timeline by pushing the boundaries of disaggregation and KVCache-centricity, specifically addressing the practical challenges of long-context LLMs and overload conditions faced by MaaS providers.
3.4. Differentiation Analysis
Compared to the main methods in related work, Mooncake's core differences and innovations lie in:
- KVCache-Centricity as a First-Class Citizen: While others explore KVCache management, Mooncake explicitly positions the KVCache as the central primitive for its global scheduling decisions. This goes beyond just memory efficiency to encompass TTFT optimization, load balancing, and hot-spot migration.
- Comprehensive Disaggregation: Extends beyond just prefill/decoding separation to include a truly disaggregated KVCache utilizing CPU, DRAM, and SSD resources. This leverages underutilized hardware for cost-effective capacity and bandwidth.
- Optimized Long-Context Prefill: Introduces Chunked Pipeline Parallelism (CPP) and layer-wise prefill specifically tailored for long contexts. CPP offers better MFU and less network contention compared to Sequence Parallelism (SP) for cross-node acceleration, and layer-wise prefill effectively overlaps KVCache transfer, reducing VRAM occupation.
- Overload-Oriented Scheduling: Unlike most research assuming sufficient resources, Mooncake explicitly tackles overload scenarios with a prediction-based early rejection policy. This aims to maximize goodput (successfully completed requests within SLOs) by preventing wasted computation and mitigating load fluctuations inherent in disaggregated systems. This is a practical innovation for MaaS providers.
- Holistic Optimization for SLOs: Its scheduler (Conductor) balances cache reuse, instance load, and SLO adherence (TTFT, TBT) as primary objectives, rather than solely focusing on throughput maximization.
- Real-World Validation: The system is deployed for Kimi (Moonshot AI's LLM service), handling exponential workload growth, and validated with real-world traces, providing practical evidence of its effectiveness under production constraints.
4. Methodology
4.1. Principles
The core principle of Mooncake is to maximize the overall effective throughput of an LLM serving system while strictly adhering to latency-related Service Level Objectives (SLOs), specifically Time To First Token (TTFT) and Time Between Tokens (TBT). This is achieved through a KVCache-centric disaggregated architecture that recognizes the distinct characteristics of the prefill and decoding stages, and proactively manages overload scenarios through intelligent scheduling and request rejection policies.
The theoretical basis and intuition behind this approach are:
- Disaggregation for Specialization: Prefill (computation-intensive, long-context) and decoding (memory-bound, autoregressive) stages have fundamentally different resource demands. Separating them into dedicated clusters allows for specialized optimization of hardware and software configurations for each, leading to higher resource utilization and better SLO compliance than a monolithic design.
- KVCache as the Central Bottleneck/Opportunity: The KVCache is a critical resource, both in terms of memory consumption and its potential for reuse. By making KVCache management and distribution central to scheduling decisions, Mooncake aims to reduce redundant computation, minimize data transfer overheads, and optimize VRAM usage.
- Proactive Overload Management: In real-world MaaS environments, overload is inevitable. Simply processing requests until SLOs are violated wastes computational resources. A proactive early rejection policy, especially one that anticipates future load, can save GPU cycles and ensure goodput by only accepting requests that are likely to complete within SLOs.
- Leveraging Underutilized Resources: Modern GPU servers often have substantial CPU, DRAM, and SSD resources that are underutilized during LLM inference. By using these for a disaggregated KVCache, Mooncake provides large, cost-effective storage for KVCache blocks, enabling greater reuse and reducing pressure on expensive GPU VRAM.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Mooncake Architecture Overview
The Mooncake architecture (Figure 1) is a KVCache-centric disaggregated system for LLM serving. It consists of:
- Prefill Instances (Prefill Cluster): A pool of GPU nodes optimized for the prefill stage. Their goal is to maximize KVCache reuse, meet TTFT SLOs, and ensure a minimum MFU. They are constrained by DRAM availability.
- Decoding Instances (Decoding Cluster): A pool of GPU nodes optimized for the decoding stage. Their goal is to maximize throughput (by maximizing batch size) while meeting TBT SLOs. They are constrained by VRAM capacity for the aggregated KVCache.
- Disaggregated KVCache Pool: Leverages CPU, DRAM, and SSD across the GPU cluster to store KVCache blocks, enabling efficient near-GPU prefix caching without additional dedicated hardware costs.
- Conductor (Global Scheduler): The central orchestrator responsible for dispatching requests, selecting appropriate prefill and decoding instances, and managing KVCache blocks (replicating hot blocks, swapping cold blocks).

The following figure (Figure 1 from the original paper) shows the overall Mooncake architecture:
This figure is a schematic of the Mooncake architecture, showing the components around the KVCache-centric scheduler, including the prefill instances, the KVCache pool, and the decoding pool. It illustrates the functions of the different schedulers, such as the cache-aware prefill scheduler, the KVCache-balancing scheduler, and the load-balancing decoding scheduler, and emphasizes the resource allocation and interactions among the components.
4.2.2. KVCache Storage and Management
The KVCache is stored in CPU memory as paged blocks. Each block has a hash value for deduplication. This hash is derived from both the block's content and its prefix, allowing for precise identification of reusable prefixes. This structure enables efficient cache eviction algorithms (like LRU, LFU, or LengthAwareCache) to manage the KVCache pool based on request patterns. The Messenger component, a separate RDMA-based service in each node, handles high-speed, cross-machine transfer of these KVCache blocks.
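The block keys described above can be sketched with a simple chained hash: each key commits to both the block's tokens and everything before it, so two requests share a key exactly when they share the full prefix up to that block. The block size and hashing details below are illustrative assumptions, not Mooncake's actual implementation.

```python
import hashlib
from typing import List

BLOCK_SIZE = 512  # assumed cache block size in tokens; a deployment parameter in practice

def block_hashes(tokens: List[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Hash each full token block chained with the hash of its prefix, so equal keys
    imply equal block content *and* equal preceding context (safe for prefix reuse)."""
    keys, prefix_hash = [], ""
    full_blocks = len(tokens) // block_size
    for i in range(full_blocks):
        block = tokens[i * block_size:(i + 1) * block_size]
        payload = (prefix_hash + ",".join(map(str, block))).encode()
        prefix_hash = hashlib.sha256(payload).hexdigest()
        keys.append(prefix_hash)
    return keys

# Two prompts that share only their first block reuse exactly one cached block.
a = block_hashes(list(range(1200)))
b = block_hashes(list(range(512)) + list(range(9000, 9700)))
print(a[0] == b[0], a[1] == b[1])  # -> True False
```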
The following figure (Figure 3 from the original paper) illustrates the KVCache pool in CPU memory:
This figure is a schematic showing how the KVCache pool is organized in CPU memory. It depicts multiple token blocks and their corresponding hash values, distinguishing prefix cache blocks, incremental cache blocks, and unallocated cache blocks. Each cache block carries a hash determined by its own content and its prefix for deduplication. The design supports storing, reading, and loading KVCache, aiming to optimize Mooncake's cache management efficiency.
4.2.3. Request Workflow
A typical request workflow in Mooncake (Figure 4) involves four main steps orchestrated by the Conductor:
- KVCache Reuse:
  - Upon receiving a request, the Conductor selects a prefill node (or group).
  - This selection balances three objectives: maximizing KVCache reuse (by identifying existing prefix KVCache blocks), balancing workloads across prefill nodes, and ensuring TTFT SLO adherence.
  - The selected prefill node loads reusable prefix KVCache blocks from remote CPU memory into GPU memory. This step is skipped if no reusable KVCache exists.
- Incremental Prefill:
  - The prefill node completes the prefill stage using the loaded prefix cache.
  - Newly generated KVCache (incremental KVCache) is stored back into CPU memory.
  - If the number of uncached input tokens exceeds a threshold (prefill_chunk), the prefill stage is split into multiple chunks and executed in a pipelined manner using Chunked Pipeline Parallelism (CPP). This chunking allows for efficient utilization of GPU computational power.
- KVCache Transfer:
  - The Messenger service, deployed in each node, manages high-speed, cross-machine KVCache transfer.
  - This transfer is executed asynchronously and overlapped with the incremental prefill step. The KVCache generated by each model layer is streamed to the destination decoding node's CPU memory, reducing waiting time. This is part of the Layer-wise Prefill strategy.
- Decoding:
  - Once all KVCache for a request is received in the decoding node's CPU DRAM, the request joins the next batch for continuous batching.
  - The Conductor pre-selects the decoding node based on its current load to prevent TBT SLO violations.
  - A local scheduler at the decoding node double-checks the anticipated load. If SLOs cannot be met, the request might still be rejected, leading to wasted prefill costs.

The following figure (Figure 4 from the original paper) depicts the workflow of inference instances. As its caption (translated) explains: for prefill instances, KVCache load and store operations proceed layer by layer and in parallel with the prefill computation to mitigate transfer overhead; for decoding instances, asynchronous loading runs concurrently with GPU decoding to prevent GPU idle time.
4.2.4. Implementation of the Prefill Pool
Mooncake maintains a separate prefill node pool due to the distinct characteristics of prefill and decoding, which require different cross-node parallelism settings and offer unique opportunities for VRAM saving.
4.2.4.1. Multi-node Prefill with Chunked Pipeline Parallelism (CPP)
For long-context requests (whose inputs can be 10x-100x longer than their outputs), TTFT optimization is crucial. While Tensor Parallelism (TP) and Sequence Parallelism (SP) are options, they often involve significant RDMA-based all-reduce operations or frequent cross-node communication, reducing MFU and competing for network resources.
Mooncake uses Chunked Pipeline Parallelism (CPP):

- It groups every $X$ nodes in the prefill cluster into a pipelined prefill node group.
- Input tokens for a request are partitioned into chunks, each no larger than prefill_chunk.
- Different chunks of the same request can be processed simultaneously by different nodes within the pipeline group, accelerating prefill and reducing TTFT (a schedule sketch follows this list).

Benefits of CPP:

- Reduced Communication: Similar to pipeline parallelism in training, CPP only requires cross-node communication at the boundaries of each pipeline stage, which can be easily overlapped with computation. This leads to better MFU and less network resource contention with KVCache transfers.
- Adaptability: Naturally fits both short and long contexts without significant overhead for short contexts, avoiding the need for frequent dynamic adjustment of node partitioning.
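The schedule below is a minimal Python sketch of the idea under simple assumptions: the model's layers are split across the nodes of one prefill group (the pipeline stages), each chunk flows through every stage, and different chunks occupy different stages at the same time. The chunk size and group size are illustrative, not the paper's settings.

```python
# Chunked-pipeline schedule sketch: which chunk each pipeline stage works on per step.

def make_chunks(num_tokens: int, prefill_chunk: int = 8192):
    """Cut the prompt into (start, end) token ranges no larger than prefill_chunk."""
    return [(s, min(s + prefill_chunk, num_tokens)) for s in range(0, num_tokens, prefill_chunk)]

def pipeline_schedule(num_chunks: int, num_stages: int = 4):
    """Per time step, the chunk index handled by each stage (None = fill/drain bubble)."""
    return [[t - s if 0 <= t - s < num_chunks else None for s in range(num_stages)]
            for t in range(num_chunks + num_stages - 1)]

if __name__ == "__main__":
    chunks = make_chunks(40_000)
    for t, work in enumerate(pipeline_schedule(len(chunks))):
        row = "  ".join(f"s{s}:c{c}" if c is not None else f"s{s}:-" for s, c in enumerate(work))
        print(f"t={t}  {row}")
```

Because only stage boundaries require cross-node traffic, the per-step transfers stay small compared with the all-reduce traffic that cross-node TP would incur.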
4.2.4.2. Layer-wise Prefill
To minimize VRAM occupation by KVCache, Mooncake implements layer-wise prefill. Since prefill is computation-bound and processed layer-by-layer, KVCache transfer and dumping can be overlapped with computation.
- KVCache loading and storing are executed asynchronously using launch and wait operations.
- Before a layer's attention computation begins, the model waits for that layer's KVCache to load and then triggers the asynchronous loading of the next layer's KVCache.
- After the attention calculation is complete, asynchronous storage of that layer's KVCache is launched.
- Once all layers are computed, the process waits for all asynchronous storage operations to complete.

This overlapping ensures that the prefill instance's execution time is roughly equivalent to either the KVCache loading time or the standard prefilling time. The main advantage is that it allows prefill scheduling to largely disregard VRAM size, as long as it can hold a single request. This frees up VRAM for other uses (a minimal sketch of the overlap loop follows Figure 7 below).
The following figure (Figure 7 from the original paper) shows the latency of storing KVCache of different request lengths, highlighting the efficiency of layer-wise prefill:
This figure is a chart showing the latency of storing the KVCache for different request lengths. Blue bars denote the serialized latency and yellow bars the layer-wise latency. Latency rises significantly as the sequence length increases, peaking at 128,000 tokens.
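The launch/wait pattern can be sketched with asyncio, assuming loads, stores, and attention are independent awaitable operations; the durations and function names are placeholders, not Mooncake's implementation.

```python
import asyncio

NUM_LAYERS = 4  # tiny model for illustration

async def load_kvcache(layer):      # stand-in for an async read from CPU DRAM / RDMA
    await asyncio.sleep(0.01)

async def store_kvcache(layer):     # stand-in for an async dump back to CPU memory
    await asyncio.sleep(0.01)

async def compute_attention(layer): # prefill is compute-bound, so this dominates
    await asyncio.sleep(0.02)

async def layer_wise_prefill():
    loads = [asyncio.create_task(load_kvcache(0))]   # launch the first layer's load
    stores = []
    for layer in range(NUM_LAYERS):
        await loads[layer]                           # wait: this layer's cache is ready
        if layer + 1 < NUM_LAYERS:
            loads.append(asyncio.create_task(load_kvcache(layer + 1)))  # prefetch next layer
        await compute_attention(layer)
        stores.append(asyncio.create_task(store_kvcache(layer)))        # async store
    await asyncio.gather(*stores)                    # drain all pending stores at the end

asyncio.run(layer_wise_prefill())
```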
4.2.5. KVCache-centric Scheduling
Conductor is responsible for KVCache-centric scheduling, balancing instance loads and user experience (measured by TTFT and TBT SLOs).
4.2.5.1. Prefill Global Scheduling
Unlike traditional load-balancing based on request counts, Mooncake's prefill instance selection considers prefix cache hit length and KVCache block distribution. The goal is to route requests to instances with longer prefix cache lengths to reduce computation, but also to balance overall system load.
Algorithm 1 details the cache-aware prefill scheduling:
For a new request $R$:

- block_keys are generated by PrefixHash(R.prompt_tokens, B), where $B$ is the cache block size. This involves hashing token blocks together with their prefixes.
- Initialize $TTFT$ to infinity and $p$ (the prefill instance) to null. FindBestPrefixMatch(P, block_keys) identifies best_prefix_len (the longest matching prefix) and best_matched_instance (the instance holding this prefix) across all prefill instances.
- Iterate through each instance in $P$:
  - Get instance.prefix_len (the local prefix match length).
  - Estimate the queue time: $T_{queue} = \text{EstimatePrefillQueueTime}(\text{instance})$.
  - Cache-aware prefill scheduling (local match): If best_prefix_len is not significantly better than instance.prefix_len (controlled by kvcache_balancing_threshold), meaning the local instance has a good enough match or best_prefix_len is too short to be worth transferring:
    - Estimate the prefill time: $T_{prefill} = \text{EstimatePrefillExecutionTime}(\text{len}(R.\text{prompt\_tokens}), \text{prefix\_len})$.
    - If $TTFT > T_{queue} + T_{prefill}$, update $TTFT$ and set $p$ to this instance.
  - Cache-aware and -balancing prefill scheduling (remote match with transfer): Otherwise (best_prefix_len is significantly better and potentially worth transferring):
    - Calculate transfer_len (tokens to transfer): best_prefix_len − prefix_len.
    - Estimate the transfer time: $T_{transfer} = \text{EstimateKVCacheTransferTime}(\text{instance}, \text{best\_matched\_instance}, \text{transfer\_len})$.
    - Estimate the prefill time: $T_{prefill} = \text{EstimatePrefillExecutionTime}(\text{len}(R.\text{prompt\_tokens}), \text{best\_prefix\_len})$.
    - If $TTFT > T_{transfer} + T_{queue} + T_{prefill}$, update $TTFT$ and set $p$ to this instance.
- Select the decoding instance: $d = \text{SelectDecodingInstance}(D)$ for load-balancing decoding scheduling.
- Check SLOs: If $TTFT > TTFT\_SLO$ or $TBT > TBT\_SLO$, reject $R$ and return.
- KVCache hot-spot migration: If best_prefix_len is above the kvcache_balancing_threshold, call TransferKVCache(best_matched_instance, p). This means that if a request relies on a "hot" remote KVCache block, that block is proactively transferred to the selected prefill instance for future reuse and load balancing.
- Return the selected pair (p, d).

Engineering details:

- EstimatePrefillExecutionTime: Uses a predictive model based on request length and prefix cache hit length.
- EstimatePrefillQueueTime: Aggregates the prefill times of currently queued requests.
- EstimateKVCacheTransferTime: More complex to predict due to network status; this necessitates hot KVCache block replication.

The following is Algorithm 1 from the original paper, detailing the KVCache-centric scheduling:

    Input: prefill instance pool P, decoding instance pool D, request R, cache block size B.
    Output: the prefill and decoding instances (p, d) to process R.
     1: block_keys ← PrefixHash(R.prompt_tokens, B)
     2: TTFT ← inf
     3: p ← ∅
     4: best_prefix_len, best_matched_instance ← FindBestPrefixMatch(P, block_keys)
     5: for instance ∈ P do
     6:     prefix_len ← instance.prefix_len
     7:     T_queue ← EstimatePrefillQueueTime(instance)
     8:     if best_prefix_len / prefix_len < kvcache_balancing_threshold then   ▷ Cache-aware prefill scheduling
     9:         T_prefill ← EstimatePrefillExecutionTime(len(R.prompt_tokens), prefix_len)
    10:         if TTFT > T_queue + T_prefill then
    11:             TTFT ← T_queue + T_prefill
    12:             p ← instance
    13:         end if
    14:     else                                                                 ▷ Cache-aware and -balancing prefill scheduling
    15:         transfer_len ← best_prefix_len − prefix_len
    16:         T_transfer ← EstimateKVCacheTransferTime(instance, best_matched_instance, transfer_len)
    17:         T_prefill ← EstimatePrefillExecutionTime(len(R.prompt_tokens), best_prefix_len)
    18:         if TTFT > T_transfer + T_queue + T_prefill then
    19:             TTFT ← T_transfer + T_queue + T_prefill
    20:             p ← instance
    21:         end if
    22:     end if
    23: end for
    24: d ← SelectDecodingInstance(D)                                            ▷ Load-balancing decoding scheduling
    25: if TTFT > TTFT_SLO or TBT > TBT_SLO then
    26:     reject R; return
    27: end if
    28: if best_prefix_len > kvcache_balancing_threshold then
    29:     TransferKVCache(best_matched_instance, p)                            ▷ KVCache hot-spot migration
    30: end if
    31: return (p, d)
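For readers who prefer executable form, here is a compact Python rendition of the selection loop above. The estimator, selector, and transfer functions are hypothetical stubs passed in as parameters, and the threshold value and instance/request attributes (prefix_len, num_tokens) are illustrative; only the control flow mirrors Algorithm 1.

```python
import math

KVCACHE_BALANCING_THRESHOLD = 4.0   # illustrative value, not the paper's setting

def select_instances(prefill_pool, decoding_pool, request, ttft_slo, tbt_slo,
                     find_best_prefix_match, estimate_queue, estimate_prefill,
                     estimate_transfer, select_decoding, estimate_tbt, transfer_kvcache):
    """Return (prefill_instance, decoding_instance), or None if the request is rejected."""
    best_len, best_inst = find_best_prefix_match(prefill_pool, request)
    ttft, chosen = math.inf, None
    for inst in prefill_pool:
        prefix_len = inst.prefix_len
        t_queue = estimate_queue(inst)
        if best_len / max(prefix_len, 1) < KVCACHE_BALANCING_THRESHOLD:
            # Cache-aware branch: the local prefix match is good enough.
            candidate = t_queue + estimate_prefill(request.num_tokens, prefix_len)
        else:
            # Cache-aware and -balancing branch: pull the longer remote prefix first.
            t_transfer = estimate_transfer(inst, best_inst, best_len - prefix_len)
            candidate = t_transfer + t_queue + estimate_prefill(request.num_tokens, best_len)
        if candidate < ttft:
            ttft, chosen = candidate, inst
    decoding = select_decoding(decoding_pool)                # load-balancing decoding choice
    if ttft > ttft_slo or estimate_tbt(decoding) > tbt_slo:
        return None                                          # reject: an SLO would be violated
    if best_len > KVCACHE_BALANCING_THRESHOLD:
        transfer_kvcache(best_inst, chosen)                  # hot-spot migration, as in the listing
    return chosen, decoding
```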
4.2.5.2. Cache Load Balancing
To prevent network congestion and optimize reuse, Mooncake employs a heuristic-based automated hot-spot migration scheme:
- When the Conductor routes a request to an alternative prefill instance (not the one with the longest prefix match) due to load, if the estimated additional prefill time is shorter than the transfer time, the alternative instance proactively retrieves the necessary KVCache from the holder.
- Additionally, if the best remote prefix match length is not significantly greater than the current local reusable prefix (i.e., best_prefix_len / local_prefix_len < threshold), the system prefers to compute the input tokens locally instead of transferring.

These strategies facilitate the automatic replication of hot-spot caches across multiple machines, distributing the load and reducing TTFT.
The following figure (Figure 8 from the original paper) shows the results of a prefill scheduling experiment comparing different strategies:
This figure is a box plot showing the total latency (TTFT) under different scheduling strategies. The x-axis lists the strategies (KVCache-centric, cache-aware, load-balancing, and random) and the y-axis shows latency in seconds. The SLO level is marked, and the KVCache-centric strategy shows the lowest latency at 14.36 seconds.
4.2.6. Overload-Oriented Scheduling
In overload scenarios, Mooncake determines whether to accept or reject incoming requests based on system load.
4.2.6.1. Defining System Load and Early Rejection
- Load Measurement: In Mooncake's disaggregated architecture, SLO satisfaction is used as the direct load measurement. The two constraints are the TTFT SLO and the TBT SLO, and the load of prefill and decoding instances is determined by comparing the predicted maximum TTFT and TBT on an instance against these constraints.
- Early Rejection: To prevent wasted computation when a request is rejected by the decoding instance after prefill (due to high load), Mooncake assesses the decoding load before the prefill stage begins. The Conductor accepts a request only if both the prefill and decoding pools are predicted to meet their SLOs (a minimal sketch of this check follows the list).
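Using the SLO values reported later for the real-workload experiment (a 30-second TTFT limit and a 0.1 s/token TBT limit), the acceptance test can be sketched as below; the predictor callables are assumptions, not Mooncake's interfaces.

```python
def accept_request(request, predict_prefill_ttft, predict_decoding_tbt,
                   ttft_slo: float = 30.0, tbt_slo: float = 0.1) -> bool:
    """Admit a request only if both pools are predicted to stay within their SLOs."""
    if predict_prefill_ttft(request) > ttft_slo:
        return False    # prefill pool is already too loaded
    if predict_decoding_tbt(request) > tbt_slo:
        return False    # decoding pool would later violate TBT, wasting the prefill work
    return True
```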
4.2.6.2. Load Fluctuation Caused by Early Rejection
A naive Early Rejection policy can cause significant anti-phase fluctuations between prefill and decoding machine loads (Figure 9). This is due to the time lag between predicting the decoding load and its actual execution.
The following figure (Figure 9 from the original paper) shows observed load fluctuations:
This figure shows the load of the prefill and decoding instances over a 20-minute window, illustrating the performance fluctuation before the prediction-based early rejection policy is applied.
The fluctuation mechanism (Figure 10a) can be described in four stages:
- Stage 1 (Low Load): Both prefill and decoding loads are low. The Conductor accepts many requests, saturating the prefill instances.
- Stage 2 (Decoding High, Prefill Low): Requests from Stage 1 prefill move to decoding, causing high decoding load. The Conductor rejects new incoming requests, leading to lower prefill load.
- Stage 3 (Decoding Decreases, Prefill Increases): No new requests enter decoding, so its load decreases. The Conductor starts accepting requests again, increasing prefill load.
- Stage 4 (Decoding Increases, Prefill Decreases): As prefill requests complete and move to decoding, decoding load increases again. The Conductor rejects new requests, lowering prefill load.

This cycle leads to poor resource utilization.
The following figure (Figure 10 from the original paper) illustrates instance load when applying Early Rejection and Early Rejection Based on Prediction:
This figure shows the instance load when applying Early Rejection versus Early Rejection Based on Prediction. The four stages show how the load of prefill and decoding requests changes over time, with stars and arrows indicating accept and reject decisions.
4.2.6.3. Early Rejection Based on Prediction
To mitigate load fluctuation, Mooncake uses Early Rejection Based on Prediction. This framework predicts the decoding load after the prefill stage of incoming requests and uses this prediction to decide whether to accept them.
- System-level Prediction (current approach): Instead of predicting individual output lengths (which is hard), Mooncake estimates the overall batch count or TBT status of the decoding instances after a specified time (see the sketch after this list).
  - It assumes a uniform decoding time for each request's decoding stage.
  - At a given moment, it identifies the requests that will have completed prefill by the prediction time and adds them to the decoding instances under this uniform-decoding assumption.
  - It removes the requests whose decoding is predicted to finish before the prediction time.
  - The average ratio of the predicted TBT of all decoding instances to the TBT SLO is calculated as the predicted load.
- Request-level Prediction (future work): Predicting the specific output length of each request could enable more accurate TTFT and TBT predictions, and thus a more precise load assessment. However, this is currently challenging due to high cost or low accuracy, especially under overload.

This prediction-based approach, as illustrated in Figure 10b, aims to stabilize the load and improve resource utilization.
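The system-level estimate can be sketched as follows. The dataclass fields, the per-request TBT cost, and the prediction horizon are illustrative assumptions; the paper's estimator is more involved.

```python
from dataclasses import dataclass

@dataclass
class Req:
    prefill_finish: float   # predicted time this request finishes prefill
    decode_start: float     # time it entered decoding (only meaningful for in-flight requests)

def predicted_decoding_load(now, horizon, decoding_reqs, prefilling_reqs,
                            uniform_decode_time, num_decode_instances,
                            tbt_per_request, tbt_slo):
    """Predicted ratio of average TBT to the TBT SLO at time now + horizon (>1 = overload)."""
    t_pred = now + horizon
    # Requests still decoding at t_pred under the uniform decoding-time assumption ...
    active = [r for r in decoding_reqs if r.decode_start + uniform_decode_time > t_pred]
    # ... plus requests that will have finished prefill (and joined decoding) by then.
    active += [r for r in prefilling_reqs if r.prefill_finish <= t_pred]
    batch_per_instance = len(active) / num_decode_instances
    predicted_tbt = batch_per_instance * tbt_per_request
    return predicted_tbt / tbt_slo

# Example: 3 decoding instances, 0.005 s of TBT cost per batched request (assumed).
load = predicted_decoding_load(now=100.0, horizon=10.0,
                               decoding_reqs=[Req(0, 95.0), Req(0, 60.0)],
                               prefilling_reqs=[Req(105.0, 0), Req(130.0, 0)],
                               uniform_decode_time=40.0, num_decode_instances=3,
                               tbt_per_request=0.005, tbt_slo=0.1)
print(f"predicted load ratio: {load:.2f}")
```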
5. Experimental Setup
5.1. Datasets
The experiments in Mooncake utilize a mix of public datasets, simulated data, and real-world traces to evaluate performance across various scenarios. All experiments use a dummy model with the same architecture as LLaMA2-70B to protect proprietary information and ensure reproducibility.
The following are the results from Table 2 of the original paper:
| Dataset | Avg Input Length | Avg Output Length | Cache Ratio | Arrival Pattern |
|---|---|---|---|---|
| ArXiv Summarization [26] | 8088 | 229 | ~0% | Poisson Process |
| L-Eval [27] | 19019 | 72 | >80% | Poisson Process |
| Simulated Data | 16k, 32k, 64k, 128k | 512 | 50% | Poisson Process |
| Real Data | 7955 | 194 | ~50% | Timestamp-based |
Detailed description of datasets:
- ArXiv Summarization [26]:
- Avg Input Length: 8088 tokens.
- Avg Output Length: 229 tokens.
- Cache Ratio: Approximately 0%. This dataset likely consists of unique, non-repeating prompts, making KVCache reuse minimal. It is suitable for evaluating raw prefill and decoding performance without the benefit of caching.
- Arrival Pattern: Poisson Process, which simulates random arrivals typical of many real-world systems.
- L-Eval [27]:
- Avg Input Length: 19019 tokens.
- Avg Output Length: 72 tokens.
- Cache Ratio: Greater than 80%. This dataset represents scenarios with high KVCache reuse, making it ideal for evaluating the effectiveness of Mooncake's KVCache-centric scheduling and prefix caching. Its long input length also stresses prefill capabilities.
- Arrival Pattern: Poisson Process.
- Simulated Data:
- Avg Input Length: Varied at 16k, 32k, 64k, and 128k tokens. This is crucial for evaluating Mooncake's performance under extreme long-context conditions, where prefill can significantly impact TTFT.
- Avg Output Length: 512 tokens.
- Cache Ratio: 50%, a balanced scenario for KVCache reuse.
- Arrival Pattern: Poisson Process.
- Real Data:
- Source: A sampled subset of online request data from Kimi (Moonshot AI) over a 1-hour period. Contains 23,608 entries.
- Avg Input Length: 7955 tokens.
- Avg Output Length: 194 tokens.
- Cache Ratio: Approximately 50%.
- Arrival Pattern: Timestamp-based, representing actual arrival times from a real-world workload, making it highly realistic for evaluating production performance and overload scenarios.
- Privacy Protection: The trace is anonymized, removing user content but preserving timestamp, input_length, output_length, and hash_ids. The hash_ids field describes prefix caching relationships by hashing token blocks together with their prefixes. For example:

      { "timestamp": 27482, "input_length": 6955, "output_length": 52, "hash_ids": [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 2353, 2354] }
      { "timestamp": 30535, "input_length": 6472, "output_length": 26, "hash_ids": [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 2366] }

  In this sample, the identical first 12 hash_ids (46 to 57) indicate that the corresponding leading token blocks can share prefix caching (a small parsing sketch follows at the end of this subsection). This open-sourced trace is unique for real-world KVCache reuse analysis.
These datasets were chosen to cover a spectrum of LLM serving challenges: minimal cache reuse, high cache reuse, extremely long contexts, and realistic production workloads, allowing for comprehensive validation of Mooncake's design.
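Since the trace is distributed as one JSON object per request, a few lines of Python suffice to replay it and measure how much prefix sharing the hash_ids expose. The file name below is hypothetical, and comparing only consecutive requests is a simplification for illustration.

```python
import json

def shared_prefix_blocks(prev_ids, cur_ids):
    """Number of leading hash_ids two requests have in common (shareable cache blocks)."""
    n = 0
    for a, b in zip(prev_ids, cur_ids):
        if a != b:
            break
        n += 1
    return n

def scan_trace(path="mooncake_trace.jsonl"):
    prev = None
    with open(path) as f:
        for line in f:
            req = json.loads(line)
            if prev is not None:
                n = shared_prefix_blocks(prev["hash_ids"], req["hash_ids"])
                print(f"t={req['timestamp']}: shares {n} leading blocks with the previous request")
            prev = req

if __name__ == "__main__":
    scan_trace()
```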
5.2. Evaluation Metrics
The experiments primarily focus on the throughput performance of systems under defined SLOs.
- Throughput (Requests Per Second - RPS):
- Conceptual Definition: Measures the number of requests a system can successfully process and complete within a given time frame. A higher RPS indicates better throughput. In the context of LLM serving, throughput is crucial for handling large volumes of user queries efficiently.
- Mathematical Formula: Not explicitly provided, but generally calculated as:
  $ \text{Throughput (RPS)} = \frac{\text{Total Number of Completed Requests}}{\text{Total Time Elapsed (seconds)}} $
- Symbol Explanation:
  - Total Number of Completed Requests: the total count of requests that have successfully finished their entire inference process.
  - Total Time Elapsed: the duration over which the requests were processed, measured in seconds.
- Time To First Token (TTFT) P90:
- Conceptual Definition: TTFT measures the latency from the moment a request arrives until the very first output token is generated. It is a critical metric for user experience, as it determines the initial responsiveness of the LLM. The P90 (90th percentile) value means that 90% of requests have a TTFT less than or equal to this value, reflecting the performance for the vast majority of users rather than just the average.
- Mathematical Formula: Not explicitly provided, but typically:
  $ \text{TTFT} = \text{Time of First Token Generation} - \text{Time of Request Arrival} $
  The P90 TTFT is the value such that 90% of all observed TTFT values are less than or equal to it.
- Symbol Explanation:
  - Time of First Token Generation: the timestamp when the first token of the response is produced.
  - Time of Request Arrival: the timestamp when the user request was received by the system.
- SLO Threshold: The paper sets TTFT thresholds relative to a baseline for the end-to-end experiments. Exceeding this threshold indicates an SLO violation.
- Time Between Tokens (TBT) P90:
- Conceptual Definition: TBT measures the average latency between the generation of successive output tokens for the same request. This metric reflects the smoothness and speed of ongoing text generation, impacting the perceived fluency of the LLM's response. The P90 TBT similarly indicates that 90% of token generation intervals are below this value.
- Mathematical Formula: Not explicitly provided, but typically, for the $i$-th generated token:
  $ \text{TBT}_i = \text{Time of Token}_i - \text{Time of Token}_{i-1} $
  The P90 TBT is the value such that 90% of all observed TBT values are less than or equal to it.
- Symbol Explanation:
  - $\text{Time of Token}_i$: the timestamp when the $i$-th token is produced.
  - $\text{Time of Token}_{i-1}$: the timestamp when the (i-1)-th token was produced.
- SLO Threshold: The paper sets TBT thresholds analogously; exceeding the threshold indicates an SLO violation.
- SLO Attainment Rate / Goodput:
- Conceptual Definition: The primary objective is to maximize overall effective throughput while adhering to SLOs. The goodput concept implies that only requests that fully complete their execution within their respective SLOs are counted towards the throughput. If an SLO is violated, the resources consumed by that request are considered wasted, and the request does not contribute to goodput.
- Mathematical Formula: Implicitly, goodput is the throughput calculated only from SLO-compliant requests:
  $ \text{Goodput} = \frac{\text{Number of Completed Requests within SLO}}{\text{Total Time Elapsed (seconds)}} $
- Symbol Explanation:
  - Number of Completed Requests within SLO: the count of requests that finished their full execution and met both their TTFT and TBT thresholds.

All TTFT and TBT values are normalized against their respective upper limits (SLO thresholds) for easier comparison, establishing a baseline of 1.0. A small computation sketch of these metrics follows.
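As a worked example of the definitions above, the snippet below computes a P90 value (nearest-rank method) and goodput from per-request records; the record layout is an assumption for illustration.

```python
import math

def p90(values):
    """90th percentile via the nearest-rank method."""
    s = sorted(values)
    return s[math.ceil(0.9 * len(s)) - 1] if s else float("nan")

def goodput(requests, ttft_slo, tbt_slo, elapsed_seconds):
    """requests: dicts with 'ttft' and 'tbt' in seconds ('tbt' = the request's P90 inter-token gap)."""
    ok = [r for r in requests if r["ttft"] <= ttft_slo and r["tbt"] <= tbt_slo]
    return len(ok) / elapsed_seconds

if __name__ == "__main__":
    reqs = [{"ttft": 3.1, "tbt": 0.04}, {"ttft": 28.0, "tbt": 0.09}, {"ttft": 35.0, "tbt": 0.08}]
    print("P90 TTFT:", p90([r["ttft"] for r in reqs]))                      # -> 35.0
    print("Goodput (req/s):", goodput(reqs, ttft_slo=30.0, tbt_slo=0.1,
                                      elapsed_seconds=60))                  # -> 2 requests / 60 s
```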
5.3. Baselines
The primary baseline model used for comparison is vLLM.
- vLLM:
- Description: vLLM is described as one of the state-of-the-art open-source LLM serving systems. It is highly regarded for its efficiency in LLM inference.
- Key Technologies: vLLM incorporates continuous batching and PagedAttention. Continuous batching allows dynamic grouping of requests, maximizing GPU utilization. PagedAttention offers efficient KVCache memory management by breaking the KVCache into fixed-size blocks, similar to virtual memory paging, preventing fragmentation and enabling flexible sharing.
- Why it's a good baseline: Its adoption of continuous batching and PagedAttention makes it a strong contender in terms of inference throughput and memory efficiency, representing a high bar for LLM serving performance.
- Limitations (as highlighted by Mooncake): The paper notes that vLLM's design couples the prefill and decoding stages of inference requests. This tight coupling can cause disruptions during decoding in scenarios involving long contexts: a long prefill request might block the decoding of other requests, leading to TBT SLO violations. To counteract this in long-context scenarios, vLLM might resort to processing requests individually rather than in batches, which can reduce throughput. Mooncake's disaggregated approach aims to address this fundamental architectural limitation.
5.4. Testbed
The experiments were conducted on a high-performance computing node cluster.
- Hardware Configuration (per node):
  - GPUs: 8 NVIDIA A800-SXM4-80GB GPUs, each with 80 GB of HBM (High Bandwidth Memory).
  - Interconnect: GPUs are connected by NVLINK.
  - Network: Equipped with RDMA network cards supporting up to 800 Gbps of interconnect bandwidth between nodes. RDMA (Remote Direct Memory Access) is crucial for high-speed, low-latency data transfer directly between the memory of different machines, bypassing the CPU, which is essential for KVCache migration in Mooncake.
- Deployment: Each node in the cluster is configured to deploy either a prefill instance or a decoding instance based on startup parameters, reflecting Mooncake's disaggregated design. This flexible configuration allows for testing different ratios of prefill to decoding resources.
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Public Datasets
The performance of Mooncake and vLLM was evaluated on ArXiv Summarization and L-Eval datasets, focusing on throughput (achieved RPS) and SLO adherence (P90 TTFT and P90 TBT).
- Baseline Configuration: vLLM was configured with four instances, denoted as vLLM-[4M].
- Mooncake Configurations: Two setups were used:
  - Mooncake-[3P+1D]: three prefill instances and one decoding instance.
  - Mooncake-[2P+2D]: two prefill instances and two decoding instances.

The following figure (Figure 11 from the original paper) shows the end-to-end experimental results on the ArXiv Summarization and L-Eval datasets:

Analysis:

- ArXiv Summarization Dataset: Mooncake-[3P+1D] achieved a 20% throughput improvement over vLLM-[4M] while satisfying SLOs. This suggests that for workloads with minimal KVCache reuse (like ArXiv, ~0% cache ratio), Mooncake's disaggregated architecture and optimized prefill handling still provide benefits.
- L-Eval Dataset: Mooncake-[3P+1D] showed a 40% throughput improvement. This is particularly significant because L-Eval has a high cache ratio (>80%). Mooncake's ability to leverage prefix caching efficiently, combined with its KVCache-centric scheduling, further boosted performance.
- Mooncake-[2P+2D] Performance:
  - Although Mooncake-[2P+2D] generally exhibited lower TBT latency (due to having more dedicated decoding instances), its TTFT performance was not as good as Mooncake-[3P+1D] or vLLM-[4M].
  - Reason for Discrepancy: This is attributed to an imbalance in the load between prefill and decoding instances. Having fewer prefill instances (2P vs. 3P) might have created a bottleneck in the prefill stage, leading to higher TTFT despite good TBT. This highlights the importance of correctly proportioning prefill and decoding resources based on workload characteristics. The paper suggests that in real clusters this proportion can be pre-set, and future research will explore more flexible dynamic adjustments.
6.1.2. Simulated Data
This section evaluates Mooncake's performance with simulated data featuring various long-context input lengths (16k, 32k, 64k, 128k tokens) and a 50% prefix cache ratio. The cluster configurations (Mooncake-[3P+1D], Mooncake-[2P+2D], vLLM-[4M]) remained the same.
The following figure (Figure 12 from the original paper) presents the end-to-end experimental results on simulated data:

Analysis:
- Impact of Long Contexts on vLLM: The paper notes that long-context requests significantly disrupt vLLM's decoding stage. To prevent TBT SLO violations, vLLM might be forced to process these requests individually rather than in batches, which severely limits its throughput. This is a critical point of differentiation.
- Mooncake's Superiority in Long Contexts: Mooncake demonstrates significantly higher throughput enhancements, ranging from 50% to 525% compared to vLLM, while still adhering to both TTFT and TBT SLOs.
  - This is a direct validation of Mooncake's two-stage disaggregation design, particularly its Chunked Pipeline Parallelism (CPP) and layer-wise prefill optimizations. By decoupling prefill from decoding, Mooncake effectively minimizes the impact of the prefill stage on the decoding stage, preventing TBT SLO breaches.
  - The impressive 525% throughput increase in certain scenarios underscores Mooncake's strength in handling computationally intensive long-context prefill without sacrificing decoding stability or SLO compliance.
6.1.3. Real Workload
For real-world validation, Mooncake was tested against vLLM using replayed traces from 23,000 actual requests.
- Configurations:
  - Mooncake-[10P+10D]: ten prefill instances and ten decoding instances.
  - vLLM-[20M]: twenty vLLM instances. (The total GPU count is identical across the two systems for a fair comparison.)
- SLO Thresholds: the TTFT upper limit is set at 30 seconds; the TBT threshold is capped at 0.1 seconds per token.

The following figure (Figure 13 from the original paper) presents the CDF (Cumulative Distribution Function) plots of TTFT and TBT for the two systems under real workloads:

Analysis:

- TTFT Distribution: The TTFT distributions for both Mooncake-[10P+10D] and vLLM-[20M] are nearly identical, with almost 100% of requests meeting the TTFT SLO. This indicates that both systems can provide a responsive first-token experience under real conditions.
- TBT Distribution (Key Differentiator): This is where Mooncake demonstrates a significant advantage.
  - Approximately 100% of Mooncake-[10P+10D] requests satisfy the TBT SLO.
  - In contrast, only 57% of vLLM-[20M] requests meet the TBT criterion, with some requests exhibiting extremely high TBTs (indicating significant stuttering or delays in token generation).
- Overall Capacity: In this real-world experiment, Mooncake was able to process approximately 75% more requests while adhering to both SLOs. This finding is a strong validation of Mooncake's practical effectiveness in handling production-scale workloads and maintaining quality of service, particularly its ability to prevent the TBT degradation that often plagues vLLM in long-context or high-load situations. The disaggregated architecture prevents prefill from negatively impacting decoding stability.
6.2. Data Presentation (Tables)
6.2.1. Cache Hit Rates Under Different Cache Policies and Capacities
The paper analyzed cache hit rates using its real-world trace to understand KVCache reuse patterns.
The following are the results from Table 1 of the original paper:
| Block capacity | Inf | 100000 | 50000 | 30000 | 10000 | 1000 |
|---|---|---|---|---|---|---|
| LRUCache | 0.51 | 0.51 | 0.50 | 0.48 | 0.40 | 0.30 |
| LFUCache | 0.51 | 0.51 | 0.49 | 0.43 | 0.35 | 0.30 |
| LengthAwareCache | 0.51 | 0.50 | 0.48 | 0.42 | 0.35 | 0.30 |
Analysis:
- Overall Cache Hit Ratio: Even with infinite capacity, the maximum cache hit ratio observed is 0.51 (51%). This indicates that for the specific sampled trace, only about half of the KVCache blocks could be reused. This highlights that while KVCache reuse is beneficial, it is not a silver bullet for all workloads.
- Impact of Capacity: Increasing cache capacity from 1,000 blocks to 50,000 blocks significantly boosts the hit ratio from 30% to 50%. However, beyond 50,000 blocks, further increases yield minimal improvement. This suggests a diminishing return on capacity for this particular trace.
- Cache Policy Comparison: LRUCache (Least Recently Used) performed marginally best or on par with LFUCache (Least Frequently Used) and LengthAwareCache (prioritizing blocks occurring later in requests). The slight edge for LRU suggests that temporal locality (recently used items are likely to be used again soon) is a strong factor in the sampled workload's KVCache access patterns (a small replay sketch follows this list).
- Implications for Mooncake: This analysis informs Mooncake's KVCache management. The observation that over 50% of blocks remain unused while some are accessed thousands of times (Figure 6) underscores the need for hot block replication to avoid transfer congestion, a mechanism implemented in Mooncake's KVCache-centric scheduler and cache load balancing.
6.2.2. Number of Requests Rejected by the System Under Overloaded-Scenario Experiment
This experiment evaluates the effectiveness of Mooncake's overload-oriented scheduling strategies. A cluster with 8 prefill instances and 8 decoding instances was tested using real traces, with the replay speed increased to 2x to simulate overload.
The following are the results from Table 3 of the original paper:
| | Baseline | Early Rejection | Early Rejection based on Prediction |
|---|---|---|---|
| Number of rejected requests | 4183 | 3771 | 3589 |
Analysis:
- Baseline (naively rejecting based on initial load): Rejected 4,183 requests. This approach often leads to resource wastage because prefill computations might already have been performed for requests that are later rejected by the decoding stage.
- Early Rejection: Reduced rejected requests to 3,771. This strategy assesses the decoding load before prefill begins, preventing ineffective computations for requests that would eventually be rejected. This is a clear improvement in resource utilization.
- Early Rejection based on Prediction: Further reduced rejected requests to 3,589. By predicting the decoding load into the near future, this strategy proactively addresses the load fluctuation problem (Figures 9 and 10a). By stabilizing the load, the system can accept more requests overall, improving request handling capacity and leading to fewer rejections compared to the simpler Early Rejection policy.

These results strongly validate the benefits of Mooncake's overload-oriented scheduling and prediction-based early rejection, demonstrating its ability to improve effective resource utilization and request handling capacity in overloaded production environments.
6.3. Ablation Studies / Parameter Analysis
6.3.1. Prefill Scheduling Experiment
An experiment was conducted to evaluate the impact of Mooncake's prefill scheduling strategies on TTFT and SLO attainment rate.
The following figure (Figure 8 from the original paper) shows the total latency (TTFT) under different scheduling strategies in the Mooncake cluster:

Analysis:
- Random Scheduling: A prefill instance is selected arbitrarily. This resulted in the highest TTFT (around 30 seconds on average) and likely the lowest SLO attainment rate (not shown explicitly in the box plot, but implied by the poor latency). It serves as a weak baseline, showing the necessity of any scheduling.
- Load-Balancing Scheduling: The instance with the lightest current load is chosen. This improves TTFT considerably compared to random scheduling (around 25 seconds), as it prevents single instances from becoming bottlenecks.
- Cache-Aware Scheduling: This strategy, described in §6.1, considers both instance load and prefix cache hit length. It further reduces TTFT (around 20 seconds) by prioritizing KVCache reuse, which reduces computation; a minimal instance-selection sketch follows the conclusion paragraph below.
- KVCache-centric Scheduling: This is Mooncake's full scheduling algorithm, which combines cache-aware scheduling with cache load balancing (including hot-spot migration). It achieves the lowest TTFT (around 14.36 seconds) and the best SLO attainment rate, as indicated by the lower and tighter box-and-whisker distribution.

Conclusion: The results clearly demonstrate that Mooncake's KVCache-centric scheduling algorithm significantly outperforms the simpler random and load-balancing approaches, and even the cache-aware strategy, by intelligently combining KVCache reuse, instance load balancing, and proactive KVCache migration to optimize TTFT and SLO adherence. The SLO threshold is marked in the figure, and the KVCache-centric strategy sits well below it, indicating successful SLO compliance.
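To illustrate the intuition behind cache-aware instance selection (not the Conductor's actual algorithm), the sketch below scores each prefill instance by an estimated TTFT that combines its queued work with how much of the prompt's prefix it already caches. The linear latency model, the field names, and all constants are placeholder assumptions.

```python
def estimate_ttft_ms(queued_tokens, prompt_len, prefix_hit_len,
                     per_token_prefill_ms=0.5):
    """Crude TTFT estimate: time to drain the queue plus time to prefill
    only the tokens not covered by the cached prefix."""
    uncached = prompt_len - prefix_hit_len
    return (queued_tokens + uncached) * per_token_prefill_ms

def pick_prefill_instance(instances, prompt_len):
    """instances: list of dicts with 'queued_tokens' and 'prefix_hit_len'
    (the longest cached prefix each instance already holds for this prompt).
    Returns the index of the instance with the lowest estimated TTFT."""
    return min(
        range(len(instances)),
        key=lambda i: estimate_ttft_ms(instances[i]["queued_tokens"],
                                       prompt_len,
                                       instances[i]["prefix_hit_len"]),
    )

# Example: instance 1 has a longer queue but holds a long cached prefix,
# so its estimated TTFT is lower and it is selected.
instances = [
    {"queued_tokens": 2_000, "prefix_hit_len": 0},
    {"queued_tokens": 6_000, "prefix_hit_len": 7_000},
]
print(pick_prefill_instance(instances, prompt_len=8_000))  # -> 1
```

Pure load balancing corresponds to ignoring `prefix_hit_len`; the full KVCache-centric scheduler additionally replicates or migrates hot blocks so that a high-hit instance does not become a hotspot.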
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper introduces Mooncake, a novel KVCache-centric disaggregated architecture for Large Language Model (LLM) serving, developed by Moonshot AI for its Kimi service. Mooncake addresses critical challenges in LLM serving, particularly for long-context scenarios and overloaded production environments. Its core contributions include:
- Disaggregated Architecture: Separating the prefill and decoding stages into distinct clusters, along with a disaggregated KVCache that utilizes CPU, DRAM, and SSD resources, enables specialized optimization and efficient resource utilization.
- Advanced Prefill Optimization: Chunked Pipeline Parallelism (CPP) and layer-wise KVCache transfer significantly reduce Time To First Token (TTFT) for long-context requests, mitigating network overhead and VRAM pressure.
- KVCache-Centric Scheduling: The Conductor (global scheduler) intelligently balances KVCache reuse, instance load, and Service Level Objectives (SLOs) for TTFT and Time Between Tokens (TBT), incorporating heuristic-based hot-spot migration for KVCache blocks.
- Overload Management: A prediction-based early rejection policy prevents wasted computation and mitigates load fluctuations in overloaded scenarios, ensuring higher goodput (requests completed within their SLOs).

Experimental results demonstrate Mooncake's superiority: up to a 525% throughput increase in simulated long-context scenarios compared to the vLLM baseline, and the ability to handle 75% more requests under real-world Kimi workloads, all while consistently meeting SLOs. The KVCache-centric approach, combined with overload-aware scheduling, proves highly effective for scalable and efficient LLM serving.
7.2. Limitations & Future Work
The authors acknowledge several limitations and propose promising directions for future research:
- Heterogeneous Accelerators: Current flagship accelerators are versatile but not optimal in every metric (e.g., bandwidth per dollar or per watt). Future work will explore leveraging heterogeneous accelerators (computation-oriented vs. bandwidth-oriented) and process-in-memory or hybrid bonding technologies to reduce the cost of memory-bound operations in the decoding phase.
- Advanced Disaggregation: Further disaggregation could separate the attention operator from the other linear operators during decoding, especially since attention can be memory-bound. Preliminary simulated results show potential for increased throughput. The MLA operator from DeepSeek-v2 is noted as a promising alternative.
- KVCache Reduction Algorithms: Continuously reducing KVCache size is crucial for increasing batch size and improving KVCache hit ratios. Future work includes exploring KVCache compression (e.g., [48-53]), important token selection (e.g., [54-60]), KVCache sharing across layers (e.g., [61-63]), and hybrid architectures that do not rely solely on KVCache (e.g., Mamba [65], RWKV [66]).
- Enhanced Scheduling Policies: Developing more advanced policies that account for varying request priorities and scenarios with different TTFT/TBT SLOs.
- Dynamic KVCache Management: Improving KVCache management, including replication, migration, and specialized eviction policies for partial hits and expiration scenarios.
- Dynamic Resource Balancing: Strategies for dynamically balancing prefill and decoding instances and for utilizing idle resources through batch-oriented offloading tasks, maximizing resource utilization during fluctuating workloads.
7.3. Personal Insights & Critique
Mooncake presents a compelling and practically relevant solution for the increasingly complex challenge of LLM serving. Its KVCache-centric disaggregated architecture is a robust response to the distinct demands of prefill and decoding stages, a recognition that is gaining traction across the research community.
Strengths and Innovations:
- Practicality for MaaS Providers: The explicit focus on overload scenarios and SLO adherence is a standout feature. Most academic papers optimize for peak throughput under ideal conditions, but Mooncake directly addresses the messy reality of production environments with limited resources and fluctuating demand. The prediction-based early rejection is a highly valuable, albeit complex, mechanism for maximizing goodput.
- Holistic Optimization: Mooncake doesn't just disaggregate; it ties the entire system together with a sophisticated KVCache-centric scheduler. This unified approach, which considers KVCache reuse, load balancing, and latency SLOs simultaneously, is more powerful than isolated optimizations.
- Leveraging Existing Resources: Using underutilized CPU, DRAM, and SSD for KVCache is a smart, cost-effective way to scale capacity without relying solely on expensive GPU VRAM.
- Open-Sourced Trace: The provision of a real-world, anonymized trace is a significant contribution to the research community, enabling more realistic benchmarking and future studies on KVCache reuse patterns.
Potential Issues/Areas for Improvement:
- Complexity of Prediction: While system-level prediction for early rejection is practical, request-level output length prediction (left for future work) is notoriously hard. The accuracy and robustness of any prediction model will heavily influence the effectiveness of early rejection, especially under rapidly changing workloads; sensitivity to prediction errors could be a challenge.
- RDMA Overhead: While RDMA provides high bandwidth and low latency, cross-node KVCache transfer still incurs overhead. The paper mentions hot-spot migration to avoid congestion, but the specific mechanisms and costs of dynamic KVCache replication and transfer are not detailed extensively. The efficiency of the Messenger component is critical here.
- Static Instance Ratios: The observed load imbalance in Mooncake-[2P+2D] on public datasets suggests that pre-setting the prefill-to-decoding instance ratio might not always be optimal. The proposed future work on dynamic balancing is crucial for truly elastic and efficient resource allocation.
- Cost Model: While the paper claims cost-effectiveness through leveraging existing resources, a more explicit cost model comparing Mooncake to alternatives in terms of TCO (Total Cost of Ownership) or cost per inference would strengthen its economic argument.
Transferability and Future Value:
The principles of KVCache-centric scheduling, disaggregated architectures, and overload-oriented policies are highly transferable beyond Mooncake or Kimi. Any LLM serving platform or MaaS provider facing similar challenges of scale, long contexts, and SLO adherence under resource constraints could benefit from adopting or adapting these concepts. The insights into load fluctuation and prediction-based mitigation are particularly valuable for general distributed system design. The emphasis on practical deployment and real-world workloads gives Mooncake a strong foundation for influencing future LLM serving system designs.