EcoServe: Enabling Cost-effective LLM Serving with Proactive Intra- and Inter-Instance Orchestration
TL;DR Summary
EcoServe introduces partial disaggregation with temporal decoupling and rolling activation, proactively orchestrating instances to reduce interference, improve throughput and latency, and enable cost-effective LLM serving on commodity clusters, with superior performance over existing non-disaggregated and fully disaggregated systems.
Abstract
Existing LLM serving strategies can be categorized based on whether prefill and decode phases are disaggregated: non-disaggregated (NoDG) or fully disaggregated (FuDG). However, the NoDG strategy leads to strong prefill-decode interference and the FuDG strategy highly relies on high-performance interconnects, making them less cost-effective. We introduce EcoServe, a system that enables cost-effective LLM serving on clusters with commodity interconnects. EcoServe is built on the partially disaggregated (PaDG) strategy, applying temporal disaggregation and rolling activation for proactive intra- and inter-instance scheduling. It first disaggregates the prefill and decode phases along the time dimension within a single instance to mitigate inter-phase interference and enhance throughput. Next, it coordinates multiple instances and cyclically activates them to ensure the continuous availability of prefill processing, thereby improving latency. Thus, EcoServe's basic serving unit is the macro instance, within which multiple instances collaborate. It further integrates an adaptive scheduling algorithm to route requests in a macro instance and a mitosis scaling approach to enable fine-grained capacity scaling. Beyond delivering high goodput, EcoServe excels in load balancing, hardware cost, parallelism compatibility, and even engineering simplicity compared to existing solutions. When serving 30B- and 70B-scale models on a production-level cluster with 32 NVIDIA L20 GPUs using commodity Ethernet, EcoServe averagely improves goodput by 82.49%, 86.17%, 122.76%, and 126.96% over four representative NoDG and FuDG systems.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
EcoServe: Enabling Cost-effective LLM Serving with Proactive Intra- and Inter-Instance Orchestration
1.2. Authors
- Jiangsu Du (Sun Yat-sen University, Guangzhou, China)
- Hongbin Zhang (Sun Yat-sen University, Guangzhou, China)
- Taosheng Wei (Sun Yat-sen University, Guangzhou, China)
- Zhenyi Zheng (Sun Yat-sen University, Guangzhou, China)
- Kaiyi Wu (Sun Yat-sen University, Guangzhou, China)
- Zhiguang Chen (Sun Yat-sen University, Guangzhou, China)
- Yutong Lu (Sun Yat-sen University, Guangzhou, China)
1.3. Journal/Conference
The paper was published on arXiv, a preprint server for research papers. While arXiv is not a peer-reviewed journal or conference in itself, it is a highly influential platform for disseminating new research rapidly in fields like computer science, physics, and mathematics. Papers on arXiv often undergo peer review and are subsequently published in reputable conferences or journals.
1.4. Publication Year
2025 (Based on the UTC timestamp 2025-04-25T08:06:22.000Z)
1.5. Abstract
The paper introduces EcoServe, a novel system designed to enable cost-effective serving of Large Language Models (LLMs) on clusters equipped with commodity interconnects. It addresses limitations of existing LLM serving strategies: non-disaggregated (NoDG) strategies suffer from severe prefill-decode interference, while fully disaggregated (FuDG) strategies demand expensive, high-performance interconnects for KV cache transfer, making both less cost-effective.
EcoServe proposes a partially disaggregated (PaDG) strategy. This strategy employs temporal disaggregation to mitigate inter-phase interference and enhance throughput within a single instance by periodically switching between prefill and decode phases. To address the resulting increase in Time to First Token (TTFT), EcoServe further implements rolling activation, coordinating multiple instances cyclically to ensure continuous availability of prefill processing and thus improving latency. The fundamental serving unit in EcoServe is termed a macro instance, comprising several collaborating instances. The system also integrates an adaptive scheduling algorithm for request routing within a macro instance and a mitosis scaling approach for fine-grained capacity adjustment.
Beyond high goodput, EcoServe demonstrates advantages in load balancing, hardware cost, parallelism compatibility, and engineering simplicity. Experimental evaluations on a production-level cluster with 32 NVIDIA L20 GPUs and commodity Ethernet show that EcoServe significantly improves goodput by an average of 82.49% to 126.96% compared to four representative NoDG and FuDG systems when serving 30B- and 70B-scale models.
1.6. Original Source Link
https://arxiv.org/abs/2504.18154v1 (Preprint on arXiv)
1.7. PDF Link
https://arxiv.org/pdf/2504.18154v1.pdf (Preprint PDF on arXiv)
2. Executive Summary
2.1. Background & Motivation
The proliferation of Large Language Models (LLMs) across various applications (e.g., Github Copilot, Character.ai) has led to a surge in demand for efficient and cost-effective LLM inference. A primary objective in LLM serving is to optimize the cost per request while simultaneously adhering to Service Level Objectives (SLOs) for response times. LLM inference inherently involves two distinct computational phases: the prefill phase and the decode phase. The prefill phase processes the input prompt to generate the first token and the initial Key-Value (KV) cache, while the decode phase iteratively generates subsequent tokens using the accumulated KV cache. These two phases have different performance metrics: Time to First Token (TTFT) for prefill and Time Per Output Token (TPOT) for decode. Improving one often comes at the expense of the others, forming an inherent performance trade-off triangle with throughput.
Existing cluster-level LLM serving solutions generally fall into two categories:
- Non-Disaggregated (NoDG) Strategy: In this approach, both the prefill and decode phases are handled by a single instance, often running on a single GPU or a set of GPUs cooperating via parallelism. While conceptually simple, this colocation leads to strong prefill-decode interference. For example, if prefills are prioritized for low TTFT, ongoing decodes suffer from high TPOT; conversely, prioritizing decodes can delay prefills. This interference not only impacts latency SLOs but also hinders throughput by preventing the decode phase from accumulating sufficiently large batches to saturate GPU resources. It also struggles with pipeline parallelism due to imbalanced workloads and dependencies.
- Fully Disaggregated (FuDG) Strategy: To eliminate prefill-decode interference, this strategy assigns prefill and decode phases to separate instances, potentially on different devices or even nodes. While effective in mitigating interference, FuDG introduces a significant challenge: it requires transferring massive amounts of KV cache data between prefill and decode instances. This necessitates high-performance interconnects (e.g., NVLink, InfiniBand), which are exceptionally expensive and power-intensive, thus making FuDG less cost-effective for broader deployment on clusters with commodity hardware (e.g., standard Ethernet). Additionally, FuDG faces load imbalance issues, as adjusting the ratio of prefill to decode instances is complex, and memory utilization can be imbalanced (decode instances store large KV caches, while prefill instances store less).

The paper's entry point is the critical observation that both existing strategies have fundamental limitations preventing cost-effective LLM serving on commodity hardware. The innovative idea is that intra-instance scheduling (when to execute prefills/decodes) must be meticulously coordinated with inter-instance scheduling (where/when to route requests) to optimally utilize resources and improve the trade-off between TTFT, TPOT, and throughput.
2.2. Main Contributions / Findings
The paper introduces EcoServe, a novel LLM serving system, and makes the following primary contributions:
- Introduction of EcoServe, a Cost-effective LLM Serving System: EcoServe is specifically designed to enable cost-effective LLM inference on clusters utilizing commodity interconnects, addressing a critical need in production environments.
- The Partially Disaggregated (PaDG) Strategy: This is the core innovation, combining temporal disaggregation and rolling activation.
  - Temporal Disaggregation: Within a single instance, prefill and decode phases are disaggregated along the time dimension. The instance switches between processing only prefills and only decodes, reducing inter-phase interference and boosting throughput without KV cache transfer.
  - Rolling Activation: To counteract the TTFT increase caused by temporal disaggregation, EcoServe coordinates multiple instances in a cyclic pattern. This ensures that at any given moment, some instances are activated for prefill processing, maintaining low TTFT for new requests.
- Macro Instance Abstraction: EcoServe introduces the macro instance as its basic serving unit, where multiple instances collaborate under the PaDG strategy. This abstraction simplifies scheduling and resource management.
- Adaptive Scheduling Algorithm: An intelligent algorithm is integrated to route requests within a macro instance. It prioritizes TPOT maintenance, identifies optimal instances for new requests, and determines the maximum prefill tokens that can be inserted, balancing TTFT and TPOT constraints.
- Mitosis Scaling Approach: EcoServe incorporates a novel mitosis scaling approach for fine-grained capacity scaling. This allows elastic adjustment of instance counts within a macro instance, and triggers split or merge operations of macro instances when thresholds are met. A serializable proxy object facilitates transparent instance migration without re-initialization or interruption.
- Hierarchical Architecture: EcoServe is implemented with a hierarchical architecture comprising an overall scheduler, macro-instance schedulers, and instance schedulers, enabling coordinated decision-making at multiple levels.
- Comprehensive Evaluation and Superior Performance: EcoServe was evaluated on a production-level cluster with 32 NVIDIA L20 GPUs and commodity Ethernet, serving 30B- and 70B-scale models.
  - It demonstrated an average goodput improvement of 82.49% to 126.96% over four representative NoDG (vLLM, Sarathi) and FuDG (DistServe, MoonCake) systems.
  - It showed superior performance in load balancing, hardware cost (lower), parallelism compatibility, and engineering simplicity compared to existing solutions.
  - EcoServe exhibits higher tolerance to tighter SLOs and offers superlinear scaling of throughput with increased resources.

These findings address the problem of achieving high-performance and cost-effective LLM serving on commodity hardware, offering a practical alternative to solutions that either suffer from severe interference or demand prohibitively expensive infrastructure.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand EcoServe, a basic grasp of Large Language Models (LLMs) and their inference serving challenges is essential.
3.1.1. Large Language Models (LLMs)
Large Language Models (LLMs) are a type of artificial intelligence model, typically based on the Transformer architecture, that are trained on vast amounts of text data. They are capable of understanding, generating, and processing human language for various tasks like text generation, summarization, translation, and question answering. Examples include Llama, GPT, and Qwen.
3.1.2. LLM Inference Phases: Prefill and Decode
LLM inference, especially for generative tasks, is divided into two distinct phases:
- Prefill Phase: Also known as the prompt processing phase. When a user inputs a prompt (e.g., a question or a starting sentence), the LLM processes this entire sequence of input tokens to generate the initial internal representations, including the first output token. This phase is typically compute-bound, meaning its performance is limited by computational power (e.g., Floating Point Operations Per Second - FLOPS). It involves large matrix multiplications.
- Decode Phase: Also known as the token generation phase. After the first token is generated, the LLM generates subsequent tokens one by one, in an autoregressive manner. Each new token is generated based on the input prompt and all previously generated tokens. This phase is typically memory-bound, meaning its performance is limited by memory access speed, particularly for loading the KV cache.
3.1.3. KV Cache
The Key-Value (KV) cache is a critical optimization technique in Transformer-based LLMs. During the self-attention mechanism, Query (Q), Key (K), and Value (V) vectors are computed for each token. For subsequent tokens in the decode phase, the K and V vectors for previous tokens in the sequence remain the same. Instead of recomputing these K and V vectors for every previous token at each step, they are stored in memory (the KV cache). This significantly reduces redundant computation, especially in the decode phase. The size of the KV cache grows with the sequence length and the number of parallel requests.
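To make the mechanism concrete, here is a toy sketch (illustrative only, with made-up dimensions and random weights) showing how cached K and V entries let each decode step project only the newest token:

```python
import numpy as np

d_model = 8
W_k = np.random.randn(d_model, d_model)   # toy key projection weights
W_v = np.random.randn(d_model, d_model)   # toy value projection weights

k_cache, v_cache = [], []                 # the KV cache: one entry per generated token

def decode_step(new_token_embedding):
    """Project only the newest token and append its K/V to the cache."""
    k_cache.append(new_token_embedding @ W_k)
    v_cache.append(new_token_embedding @ W_v)
    # Attention for the new token reuses all cached K/V entries,
    # so earlier tokens are never re-projected.
    return np.stack(k_cache), np.stack(v_cache)

for _ in range(4):                        # four autoregressive decode steps
    K, V = decode_step(np.random.randn(d_model))
print(K.shape, V.shape)                   # (4, 8) (4, 8): the cache grows with sequence length
```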
3.1.4. Performance Metrics: TTFT and TPOT
- Time to First Token (TTFT): This metric measures the latency from when a user request is submitted until the first output token is generated and returned. It is primarily influenced by the prefill phase efficiency.
- Time Per Output Token (TPOT): This metric measures the average time taken to generate each subsequent output token after the first one. It is primarily influenced by the decode phase efficiency.
3.1.5. Throughput and Goodput
- Throughput: This refers to the total number of requests processed or tokens generated per unit of time (e.g., requests/second, tokens/second). Higher throughput means the system can handle more workload.
- Goodput: This is a more nuanced metric than raw throughput. It measures the throughput of requests that successfully meet predefined Service Level Objectives (SLOs). For example, if a system processes 100 requests/second but only 80 of them meet their latency targets, the goodput would be 80 requests/second. EcoServe aims for high goodput.
3.1.6. Service Level Objectives (SLOs)
Service Level Objectives (SLOs) are target values or ranges for system performance metrics, often defined in service level agreements (SLAs). For LLM serving, common SLOs include target TTFT and TPOT values (e.g., TTFT < 1 second, TPOT < 100 milliseconds). Systems strive to meet these targets to ensure a good user experience.
3.1.7. LLM Batching Techniques
To efficiently utilize GPU resources, which are optimized for parallel processing, multiple requests are often processed together in a batch.
- Continuous Batching: This is a standard technique where requests can dynamically enter or exit a batch at each iteration. This maximizes GPU utilization by always keeping the batch full, unlike static batching, which waits for a fixed number of requests.
- Separate Batching: Prefill requests are batched and processed separately from decode requests, because their computational characteristics are very different (prefill is compute-bound, decode is memory-bound).
- Hybrid Batching: Prefill and decode requests are combined into a single batch and processed together. This can potentially improve throughput but exacerbates prefill-decode interference.
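As a rough illustration of continuous batching (not vLLM's actual API; `step`, `waiting`, and the request dicts are hypothetical), the scheduler admits queued requests and evicts finished ones on every iteration so the batch stays full:

```python
from collections import deque

def continuous_batching(waiting, step, max_batch=8):
    """Toy scheduler loop: requests join and leave the running batch every iteration."""
    running = []
    while running or waiting:
        while waiting and len(running) < max_batch:      # admit queued requests until the batch is full
            running.append(waiting.popleft())
        step(running)                                    # one model iteration over the whole batch
        running = [r for r in running if not r["done"]]  # finished requests exit immediately

# Usage with a dummy step function that "generates" one token per request per iteration.
reqs = deque({"id": i, "remaining": 3 + i, "done": False} for i in range(5))

def dummy_step(batch):
    for r in batch:
        r["remaining"] -= 1
        r["done"] = r["remaining"] <= 0

continuous_batching(reqs, dummy_step)
print("all requests finished")
```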
3.1.8. Parallelism Strategies
For large LLMs that cannot fit on a single GPU, or to accelerate inference, various parallelism strategies are used:
- Tensor Parallelism (TP): Also known as model parallelism (within a layer). In TP, the model's weights and activations within a single layer (e.g., the large weight matrices in QKV projection or Feed-Forward Networks) are partitioned across multiple GPUs. Each GPU computes a portion of the matrix multiplication, and the results are then aggregated (e.g., via all-reduce operations). TP is generally used to fit very large models into memory or to accelerate individual layer computations. It requires frequent inter-device communication. Figure 3(a) from the paper illustrates tensor parallelism.
- Pipeline Parallelism (PP): Also known as layer parallelism. In PP, the layers of an LLM are partitioned across multiple GPUs. Each GPU is responsible for computing a subset of the model's layers. Data flows sequentially through the GPUs, forming a "pipeline." This reduces memory requirements per GPU and can be efficient if the pipeline stages are balanced. However, it can suffer from pipeline bubbles (idle time) if stages are not perfectly balanced or if there are dependencies between iterations (as in autoregressive decoding). Figure 3(b) from the paper illustrates pipeline parallelism.

Figure 3. Tensor parallelism and pipeline parallelism.

Figure 4 from the paper illustrates pipeline bubbles.

Figure 4. Pipeline bubbles.
3.1.9. Interconnects
Interconnects refer to the hardware connections that allow different components (e.g., CPUs, GPUs, memory, nodes) within a computer system or cluster to communicate.
- Commodity Interconnects: These are standard, widely available, and relatively inexpensive networking technologies, such as Ethernet. They typically offer lower bandwidth and higher latency compared to specialized interconnects. EcoServe aims to perform well on these.
- High-Performance Interconnects: These are specialized, expensive, high-bandwidth, low-latency networking technologies designed for demanding computational tasks. Examples include:
  - NVLink: A high-speed interconnect developed by NVIDIA for direct GPU-to-GPU communication within a node.
  - InfiniBand: A high-throughput, low-latency communication technology used in high-performance computing (HPC) and data centers, often connecting nodes in a cluster.
  - RoCE (RDMA over Converged Ethernet): Allows Remote Direct Memory Access (RDMA) functionality over standard Ethernet networks, offering lower latency and higher throughput than traditional Ethernet but not always as high-performing as dedicated InfiniBand.
3.2. Previous Works
The paper categorizes previous LLM serving solutions based on prefill-decode disaggregation:
3.2.1. Non-Disaggregated (NoDG) Strategy
- Description: In NoDG systems (e.g., vLLM [5], Sarathi [9]), a single inference instance handles the entire lifecycle of a request, processing both its prefill and decode phases. When scaling is needed, this instance is replicated.
- Key Issues:
  - Strong Prefill-Decode Interference: As both phases share resources, processing one often delays the other, impacting TTFT and TPOT SLOs.
  - Low Throughput: The interference makes it hard for the decode phase to accumulate large enough batches to saturate GPUs, leading to underutilization.
  - Inefficient Pipeline Parallelism: Imbalanced workloads and tight dependencies in autoregressive decoding lead to pipeline bubbles, reducing PP efficiency.
- Mitigation Attempts: Chunked prefill [9] (used by Sarathi) attempts to break long prefills into smaller chunks to reduce interference, but it incurs overhead and its effectiveness varies.
3.2.2. Fully Disaggregated (FuDG) Strategy
- Description: FuDG systems (e.g., DistServe [50], MoonCake [35]) completely separate the prefill and decode phases into different instances. A request first goes to a prefill instance, which generates the KV cache and first token; the KV cache is then transferred to a decode instance for subsequent token generation.
- Key Issues:
  - High Interconnect Requirements: Transferring massive KV cache data between instances necessitates high-performance interconnects (e.g., NVLink, InfiniBand), which are very expensive and power-intensive. The paper provides Table 3 to illustrate this requirement. (The paper has two tables labeled "Table 3"; the first repeats the per-operation FLOPS/memory-access breakdown reproduced later as Table 2 in Section 4.3, and the second, which relates to KV cache bandwidth, is reproduced below.)
  - Load Imbalance: FuDG makes load balancing between prefill and decode instances challenging due to their asymmetric durations. This can lead to underutilized resources.
  - Memory Imbalance: Decode instances store large KV caches, while prefill instances store less, leading to inefficient memory use across the cluster.
- Variants: DistServe (intra-node FuDG) relies on intra-node high-speed links (e.g., NVLink), while MoonCake (inter-node FuDG) uses a centralized KV cache pool and InfiniBand for inter-node communication.

The following are the results from the second Table 3 of the original paper (required KV cache transfer bandwidth):

| Model | Device | Tokens/s | Theoretical Bandwidth |
| :--- | :--- | :--- | :--- |
| Llama-30B | L20 | 6584.6 | 9.796 GB/s |
| Llama-30B | A800 | 26189.2 | 38.96 GB/s |
| CodeLlama-34B | L20 | 6838.92 | 1.25 GB/s |
| CodeLlama-34B | A800 | 25978.88 | 4.76 GB/s |
3.2.3. KV Cache Optimizations
- GQA (Grouped Query Attention) [10]: An optimization that reduces KV cache size by sharing key and value projections across multiple query heads, alleviating transmission overhead. Used in CodeLlama2-34B and Qwen2-72B.
- PagedAttention [24]: A technique to manage KV cache memory more efficiently by organizing it into fixed-size blocks, reducing fragmentation.
- Other works like H2O [49] and Keyformer [8] explore KV cache compression or redundancy removal.
3.2.4. Other Related Works
The paper also briefly mentions other areas of LLM inference optimization:
- Memory-limited inference: Flexgen [39], FastDecode [19], and Specinfer [31] use offloading.
- Long-context inference: Loongserve [45] and Infinitellm [28] optimize for long sequences.
- Mixture-of-Experts (MoE) models: Moe-lightning [12], Pre-gated MoE [21], and Lina [25] optimize resource utilization for MoE architectures. MegaScale-Infer [52] disaggregates attention and FFN modules for ultra-large MoE models.
- Kernel scheduling: Liger [15] and NanoFlow [51] schedule and overlap GPU kernels from different requests.
3.3. Technological Evolution
The field of LLM serving has evolved significantly, driven by the increasing size and computational demands of LLMs.
- Early LLM Inference (Single Model, Single Request): Initially, LLMs were served one request at a time on dedicated hardware. This was inefficient for high-throughput scenarios.
- Batching for GPU Utilization: The introduction of batching techniques (static, then continuous batching) dramatically improved GPU utilization by processing multiple requests concurrently.
- Basic Parallelism: Tensor Parallelism and Pipeline Parallelism emerged to handle models that exceeded single-device memory or to reduce inference latency by distributing computation.
- Cluster-Level Serving Strategies: As demand scaled, systems moved to multi-node clusters. This led to the NoDG strategy, replicating instances, and then FuDG to address prefill-decode interference.
- Addressing Interconnect Bottlenecks and Cost: The high cost of FuDG due to high-performance interconnects became a new bottleneck. This is where EcoServe's PaDG strategy fits, attempting to achieve FuDG-like benefits on commodity interconnects.
- Advanced Optimizations: Ongoing research focuses on KV cache management (PagedAttention, GQA), specialized parallelism for specific architectures (MoE), and dynamic scaling.

EcoServe's work builds upon the understanding of prefill-decode dynamics and KV cache management from previous studies, but strategically places itself as a middle-ground solution (PaDG) to overcome the cost and interference limitations of NoDG and FuDG on commodity hardware.
3.4. Differentiation Analysis
Compared to the NoDG and FuDG strategies, EcoServe's Partially Disaggregated (PaDG) approach offers distinct innovations:
- Interference Mitigation vs. Hardware Cost:
  - NoDG: Suffers from severe prefill-decode interference because both phases run on the same instance simultaneously. This harms goodput and SLOs.
  - FuDG: Eliminates interference by physically separating prefill and decode instances. However, this comes at the cost of needing high-performance interconnects (e.g., NVLink, InfiniBand) to transfer the large KV cache between instances, making it very expensive.
  - PaDG (EcoServe): Mitigates interference by temporal disaggregation within a single instance. It processes only one phase at a time (prefill or decode) for an extended duration. This avoids the KV cache transfer overhead of FuDG and thus works efficiently with commodity interconnects, making it highly cost-effective.
- Latency Management (TTFT):
  - NoDG: Struggles to balance TTFT and TPOT due to interference.
  - FuDG: Can achieve good TTFT due to dedicated prefill instances but might still suffer if KV cache transfer is a bottleneck.
  - PaDG (EcoServe): Temporal disaggregation alone would increase TTFT (as new requests might wait for an instance to switch to prefill). EcoServe addresses this with rolling activation, where multiple instances are coordinated to ensure that at least one instance is always ready for prefill, thereby maintaining low TTFT while still avoiding KV cache transfer.
- Resource Utilization and Load Balancing:
  - NoDG: Can lead to underutilized GPUs in the decode phase due to insufficient batch sizes caused by interference.
  - FuDG: Faces load imbalance issues when determining the optimal ratio of prefill to decode instances, and memory can be imbalanced (decode instances holding large KV caches while prefill instances have idle memory).
  - PaDG (EcoServe): Improves GPU saturation by processing phases for longer durations, reducing switching overhead. The adaptive scheduling algorithm and mitosis scaling approach provide fine-grained control and dynamic load balancing within macro instances, leading to better resource efficiency.
- Parallelism Compatibility:
  - NoDG: Pipeline Parallelism is difficult due to pipeline bubbles caused by imbalanced workloads and dependencies between prefill and decode. Tensor Parallelism can be inefficient due to communication overhead on PCIe-only systems.
  - FuDG: PP can be more effective as phases are separated. TP can still face PCIe contention if not using GPU-direct interconnects.
  - PaDG (EcoServe): Minimizes prefill-decode switches, making it highly compatible with pipeline parallelism (fewer pipeline bubbles). Its minimal data movement and reduced PCIe contention make it more suitable for tensor parallelism on systems with commodity interconnects.
- Engineering Complexity:
  - NoDG: Relatively low complexity due to its single-instance focus.
  - FuDG: High complexity due to distributed management of prefill/decode instances, KV cache transfer mechanisms, and load balancing across two distinct resource types.
  - PaDG (EcoServe): Aims for lower complexity, similar to NoDG, by keeping phases within a single instance (temporally disaggregated) and simplifying scaling with macro instances and mitosis scaling.

In essence, EcoServe's PaDG strategy is an innovative middle ground that aims to achieve the benefits of FuDG (reduced interference, better goodput) without incurring its high hardware cost and engineering complexity, making it a more cost-effective solution for LLM serving on prevalent commodity-interconnect clusters.
4. Methodology
4.1. Principles
The core idea behind EcoServe is to achieve cost-effective LLM serving by intelligently orchestrating the prefill and decode phases. Its foundational principle is the Partially Disaggregated (PaDG) strategy, which is built on two key proactive scheduling mechanisms:
- Temporal Disaggregation (Intra-Instance Scheduling): Within a single inference instance, the prefill and decode phases are separated along the time dimension. This means an instance dedicates itself to one phase for an extended period before switching to the other. This mitigates inter-phase interference and enhances throughput by allowing each phase to run more efficiently without being constantly interrupted or contending for resources. The intuition is that by focusing on one task for longer, the GPU can achieve better saturation for that task.
- Rolling Activation (Inter-Instance Scheduling): While temporal disaggregation improves throughput within an instance, it could lead to high Time to First Token (TTFT) if a new request arrives at an instance currently engaged in the decode phase. To counter this, EcoServe proactively coordinates multiple instances within a macro instance (a group of cooperating instances). These instances cyclically activate their prefill phases in a staggered manner, ensuring that at any given point there is always at least one instance available and ready to immediately process new prefills, thereby maintaining low TTFT and meeting SLOs.

By combining these two principles, EcoServe aims to optimize the performance trade-off triangle (TTFT, TPOT, throughput) on commodity hardware, achieving high goodput without the expensive interconnects required by FuDG strategies.
4.2. Core Methodology In-depth (Layer by Layer)
EcoServe is structured hierarchically, employing a three-level scheduling architecture to implement its PaDG strategy effectively.
The following figure (Figure 5 from the original paper) shows the EcoServe architecture overview:
Figure 5. EcoServe Architecture Overview.
4.2.1. Hierarchical Architecture Overview
As shown in Figure 5, EcoServe comprises:
- Overall Scheduler: The highest-level scheduler. Its responsibilities include dispatching new requests to appropriate macro instances based on their capabilities and managing capacity scaling across different macro instances, such as transferring instance handlers between them.
- Macro-Instance Scheduler: Coordinates multiple instances within a single macro instance. It aggregates execution states from individual instances and dispatches requests to the most suitable instance based on profiling results and SLOs. The macro instance is EcoServe's unique abstraction, representing the smallest unit of scheduling at the cluster level.
- Instance Scheduler: The lowest-level scheduler, responsible for managing execution within a single instance. It coordinates the prefill and decode phases (as per temporal disaggregation), orchestrates multiple devices if parallelism is used, and executes directives received from higher-level schedulers.

The focus of the paper is primarily on the internal architecture and scheduling within a macro instance.
4.2.2. Partially Disaggregated (PaDG) Strategy
The PaDG strategy is the cornerstone of EcoServe, implemented through proactive intra- and inter-instance scheduling.
4.2.2.1. Temporal Disaggregation (Proactive Intra-Instance Scheduling)
- Mechanism (Figure 5, Temporal Disaggregation): Within each inference instance, EcoServe proactively disaggregates the prefill and decode phases along the time dimension. The instance does not interleave prefill and decode tasks in quick succession; instead, it commits to processing only prefill tasks for a certain period, and then switches to processing only decode tasks for another period.
- Benefits:
  - Mitigates Prefill-Decode Interference: By dedicating an instance to one phase at a time, resource contention between the distinct computational patterns of prefill (compute-bound) and decode (memory-bound) is significantly reduced.
  - Enhances Throughput: Each phase can run more efficiently, leading to better GPU saturation and higher overall throughput.
  - Eliminates KV Cache Transmission: Unlike FuDG, both phases occur within the same instance (just at different times), so there is no need to transfer the KV cache between instances. This makes EcoServe compatible with commodity interconnects.
- Challenge: This approach can lead to an unacceptable increase in TTFT if a new request arrives when an instance is dedicated to the decode phase, as it would have to wait for the instance to switch. This challenge is addressed by rolling activation.
- Saved TPOT: The paper notes that if decode execution is faster than the TPOT SLO, the instance can accumulate "spare time" (saved TPOT). This saved time can be used to absorb interruptions (like waiting for a phase switch) without violating the TPOT SLO.
4.2.2.2. Rolling Activation (Proactive Inter-Instance Scheduling)
- Mechanism (Figure 5, Rolling Activation): To ensure low TTFT despite temporal disaggregation, EcoServe employs rolling activation. Multiple instances within a macro instance are coordinated in a cyclic pattern: at any given moment, different instances are in different stages of their prefill/decode cycle, so some instances are always activated and ready to process new prefills.
- Benefits:
  - Rescues TTFT: New requests are routed to instances currently in their prefill phase, allowing for immediate processing and meeting TTFT SLOs.
  - Continuous Prefill Availability: The staggered activation ensures that the system as a whole maintains a continuous capacity for processing new prefills.
  - Higher Overall Throughput: By combining temporal disaggregation and rolling activation, EcoServe achieves both efficient phase execution and low latency.
- Coordination: Instances continuously update their status (e.g., decode progress, memory usage) to the macro-instance scheduler for coordination.
4.2.3. Adaptive Scheduling Algorithm
The adaptive scheduling algorithm (Figure 5, Adaptive Scheduling Algorithm) is crucial for operationalizing the PaDG strategy within and across instances. It works in a master-slave manner between the macro-instance scheduler (master) and instance schedulers (slaves).
4.2.3.1. Inter-Instance Scheduling Algorithm (Macro-Instance Scheduler's Perspective)
The macro-instance scheduler receives status updates from all instances and schedules them to achieve rolling activation. When a new request arrives, it attempts to route it to the most suitable instance.
The following is the InterSchedule algorithm from Algorithm 1 of the original paper:
Data: current request: req; instance list: instances;
1 Function InterSchedule(req):
2 prev_idx ← last request's routed instance;
3 last_instance ← instances[prev_idx];
4 if CheckConstraints (last_instance,req) then
5 route req to instance[prev_idx];
6 else
7 next_idx ← (prev_idx + 1)%len(instances) ;
8 route req to instance[next_idx];
- Function InterSchedule(req): Takes a new request (req) as input and determines which instance it should be routed to.
  - req: The current incoming LLM inference request.
  - instances: A list containing all available inference instances within the current macro instance.
  - prev_idx: The index of the instance that the previous request was routed to, so the algorithm first attempts to route to the last-used instance for locality and continuity.
  - last_instance: The instance object corresponding to prev_idx.
  - CheckConstraints(last_instance, req): A call to the Constraint Checking Algorithm (described below). It verifies whether routing the current request (req) to last_instance would violate any SLOs or memory constraints.
  - If CheckConstraints(last_instance, req) holds, the request is routed to instance[prev_idx], i.e., the last-used instance absorbs another prefill.
  - Otherwise, next_idx ← (prev_idx + 1) % len(instances) and the request is routed to instance[next_idx], the next instance in a cyclic pattern (the modulo operator wraps around the list). This realizes rolling activation by always moving on to a capable instance.
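Algorithm 1 translates almost line-for-line into Python. The sketch below is a paraphrase of the pseudocode, not EcoServe's actual code; `check_constraints` stands in for Algorithm 2 and `state` simply remembers the previously used instance index:

```python
def inter_schedule(req, instances, state, check_constraints):
    """Route req to the last-used instance if it still satisfies all constraints;
    otherwise advance cyclically to the next instance (rolling activation)."""
    prev_idx = state["prev_idx"]                      # instance used for the previous request
    if check_constraints(instances[prev_idx], req):
        target = prev_idx                             # keep filling the same prefill window
    else:
        target = (prev_idx + 1) % len(instances)      # move on to the next instance in the cycle
    state["prev_idx"] = target
    return target
```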
4.2.3.2. Constraint Checking Algorithm (Instance Scheduler's Perspective)
The Constraint Checking Algorithm is invoked by the Inter-Instance Scheduling Algorithm to determine if an instance can accept a new request without violating SLOs or memory capacity.
The following is the CheckConstraints algorithm from Algorithm 2 of the original paper:
Data: System constraints: SLO_TTFT, SLO_TPOT;
1 Function CheckConstraints(instance, req):
2 Constraint 1: TTFT
3 t_switch ← phase switching timestamp;
4 pending_prefills ← {r ∈ instance.reqs | r.arrival_time > t_switch} ∪ {req};
5 prefill_times ← predict pending_prefills durations;
6 t_total ← sum(prefill_times);
7 if t_total > SLO_TTFT then
8 return NotSatisfied;
9 Constraint 2: TPOT
10 existed_decodes ← {r ∈ instance.reqs | r.arrival_time < t_switch};
11 saved_tpots ← [ ];
12 current_time ← current timestamp;
13 foreach r ∈ existed_decodes do
14 L ← r.output_length;
15 saved_tpot ← L × SLO_TPOT − (current_time − r.first_token_time);
   saved_tpots.append(saved_tpot);
16 mean_saved_tpot ← mean(saved_tpots);
17 if mean_saved_tpot < t_total then
18 return NotSatisfied;
19 Constraint 3: KV Cache capacity
20 if req_kvcache_size > remain_memsize then
21 return NotSatisfied;
22 return Satisfied
- Function CheckConstraints(instance, req): Checks whether a given instance can accept a new request (req) while adhering to system SLOs and resource limits.
- SLO_TTFT, SLO_TPOT: The system's target Time to First Token and Time Per Output Token Service Level Objectives.
- Constraint 1: TTFT. This block checks whether admitting req would violate the TTFT SLO.
  - t_switch: The timestamp at which the instance switches (or last switched) from decode to prefill; a central quantity for temporal disaggregation.
  - pending_prefills: All requests already waiting to be prefilled on this instance (arrived after t_switch) plus the new request req itself; this is the batch of prefills that will be processed consecutively.
  - prefill_times: Predicted durations for each pending prefill, obtained from profiling data (e.g., prefill duration for various input lengths).
  - t_total: The sum of the predicted prefill durations, i.e., the total time the instance will spend in the prefill phase for this batch.
  - If t_total > SLO_TTFT, the constraint is violated and the request cannot be admitted. This ensures that even with temporal disaggregation, first-token latency remains within bounds.
- Constraint 2: TPOT. This block checks whether admitting req would violate the TPOT SLO of currently ongoing decode requests.
  - existed_decodes: Requests currently undergoing decode on this instance (arrived before t_switch); their decodes will be paused or delayed by the upcoming prefill phase.
  - saved_tpots: A list that collects the saved TPOT of each ongoing decode; current_time is the current timestamp.
  - For each decode request r: L is its output length so far, and saved_tpot = L × SLO_TPOT − (current_time − r.first_token_time), i.e., the slack between the maximum allowed decode time and the time already spent generating tokens. A positive value means there is slack.
  - mean_saved_tpot: The average saved TPOT across all ongoing decodes, representing the collective slack available. If mean_saved_tpot < t_total, processing the new prefill batch would likely cause ongoing decodes to violate their TPOT SLO, so the request is not admitted. This ensures that optimizing TTFT does not compromise TPOT for existing requests.
- Constraint 3: KV cache capacity. req_kvcache_size is the estimated KV cache size required by the new request, and remain_memsize is the instance's remaining memory available for KV cache. If the request's KV cache would not fit, the request is rejected to prevent out-of-memory (OOM) errors.
- return Satisfied: If all three constraints are met, the instance can admit the request.
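A minimal Python rendering of Algorithm 2, assuming hypothetical per-request fields (`arrival_time`, `prompt_len`, `output_length`, `first_token_time`, `kv_cache_size`) and a profiling-based `predict_prefill_time` function; this is a sketch of the logic, not EcoServe's implementation:

```python
import time
from statistics import mean

def check_constraints(instance, req, slo_ttft, slo_tpot, predict_prefill_time):
    now = time.time()
    t_switch = instance.phase_switch_ts                      # last decode -> prefill switch

    # Constraint 1: the pending prefill batch (including req) must fit within SLO_TTFT.
    pending = [r for r in instance.reqs if r.arrival_time > t_switch] + [req]
    t_total = sum(predict_prefill_time(r.prompt_len) for r in pending)
    if t_total > slo_ttft:
        return False

    # Constraint 2: ongoing decodes must have enough accumulated slack ("saved TPOT").
    decodes = [r for r in instance.reqs if r.arrival_time < t_switch]
    if decodes:
        saved = [r.output_length * slo_tpot - (now - r.first_token_time) for r in decodes]
        if mean(saved) < t_total:
            return False

    # Constraint 3: the new request's KV cache must fit in remaining memory.
    if req.kv_cache_size > instance.remaining_mem:
        return False
    return True
```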
4.2.4. Mitosis Scaling Approach
The mitosis scaling approach (Figure 5, Mitosis Scaling Approach) provides elastic and fine-grained capacity scaling for EcoServe, adapting to fluctuating LLM inference workloads over time.
4.2.4.1. Expansion and Contraction (Figure 7)
This approach dynamically adjusts the number of instances within macro instances.
- Hyperparameters:
  - A lower bound on the number of instances in a macro instance (3 in the paper's running example).
  - An upper bound on the number of instances in a macro instance (6 in the paper's running example).
- Scaling Triggers: Scaling can be triggered when the system fails to meet SLOs (demanding more capacity) or when there is sustained resource underutilization (indicating a need to contract capacity).

The following figure (Figure 7 from the original paper) illustrates the expansion and contraction processes:

Figure 7. The illustration of the expansion and contraction processes (with a lower bound of 3 and an upper bound of 6 instances per macro instance).

- Expansion Process (Steps 1-4 in Figure 7):
  - New instances are incrementally added to an existing macro instance as demand increases.
  - If the number of instances in a macro instance exceeds the upper bound (e.g., 6 in the figure), a new macro instance is split off from the original. This new macro instance starts with the lower-bound number of instances (e.g., 3 instances are moved from the original to form the new macro instance).
  - If additional instances are still required, they are first added back to the original macro instance until it again reaches the upper bound.
  - Subsequent instance additions then go to the new macro instance.
- Contraction Process (Steps 5-8 in Figure 7):
  - When capacity is excessive, instances are first removed from the smallest macro instance until its instance count reaches the lower bound.
  - Then, instances start to be removed from a full macro instance.
  - If, after one additional instance is removed, the total number of instances across two macro instances falls to the upper bound, they are merged into a single macro instance. This consolidation improves resource packing.
- Post-Scaling: After any expansion or contraction, each macro instance continues scheduling requests independently using the adaptive scheduling algorithm, requiring no additional specialized logic. The system therefore typically maintains several full macro instances and one or two partially filled ones.
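A loose counting-only sketch of the expansion/contraction rules, using the example bounds of 3 and 6 from Figure 7; it tracks only per-macro-instance instance counts and omits handler migration and some details of the removal order:

```python
LOWER, UPPER = 3, 6   # example bounds taken from Figure 7

def expand(macro_counts):
    """Add one instance. Fill the fullest not-yet-full macro instance first; when every
    macro instance is at UPPER, the addition overflows one of them and LOWER instances
    are split off to seed a new macro instance."""
    open_idx = [i for i, c in enumerate(macro_counts) if c < UPPER]
    if open_idx:
        i = max(open_idx, key=lambda k: macro_counts[k])
        macro_counts[i] += 1
    else:
        macro_counts[0] += 1 - LOWER        # original keeps (UPPER + 1 - LOWER) instances
        macro_counts.append(LOWER)          # the split-off macro instance starts with LOWER
    return macro_counts

def contract(macro_counts):
    """Remove one instance from the smallest macro instance; merge two macro
    instances once they together fit within UPPER again."""
    i = min(range(len(macro_counts)), key=lambda k: macro_counts[k])
    macro_counts[i] -= 1
    if len(macro_counts) > 1:
        a, b = sorted(range(len(macro_counts)), key=lambda k: macro_counts[k])[:2]
        if macro_counts[a] + macro_counts[b] <= UPPER:
            macro_counts[b] += macro_counts[a]
            del macro_counts[a]
    return macro_counts

print(expand([6, 6]))     # [4, 6, 3]: a new macro instance is split off
print(contract([4, 6, 3]))  # [6, 6]: the two small macro instances merge back
```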
4.2.4.2. Flexible Instance Migration
To enable dynamic splitting or merging of macro instances without interrupting or reinitializing individual instances, EcoServe uses a serializable proxy object.
- InstanceHandler Metadata: At the core is the InstanceHandler metadata, which encapsulates all necessary information about an instance: its actor ID, worker address, function calls, and other attributes.
- Serialization and Transfer: When an instance needs to be logically migrated between macro-instance schedulers (which may be different processes), its InstanceHandler is serialized (e.g., using Python's pickle library). This serialized data is then sent to the target macro-instance scheduler, coordinated by the overall scheduler.
- Deserialization and Reconstruction: The receiving process deserializes the InstanceHandler, reconstructing a fully functional proxy object. This proxy can then issue function calls to the original instance via an RPC-like system.
- Benefits: This design allows logical migration without interrupting the instance's execution or requiring a costly re-initialization (which can take minutes for large LLMs), thereby supporting highly flexible and low-overhead scaling.
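The idea of a serializable proxy can be illustrated with Python's pickle; `InstanceHandler` and its fields below are illustrative stand-ins, and the RPC call is mocked rather than using Ray's real API:

```python
import pickle
from dataclasses import dataclass

@dataclass
class InstanceHandler:
    """Lightweight, serializable proxy metadata for a running instance."""
    actor_id: str
    worker_address: str

    def call(self, method, *args):
        # In the real system this would issue an RPC to the running instance;
        # here we only show that the reconstructed proxy is immediately usable.
        return f"rpc {self.actor_id}@{self.worker_address}: {method}{args}"

# The source macro-instance scheduler serializes the handler...
payload = pickle.dumps(InstanceHandler("instance-3", "10.0.0.7:6000"))

# ...and the target scheduler process deserializes it and keeps controlling the
# same instance, without restarting or re-initializing the model.
handler = pickle.loads(payload)
print(handler.call("pause_prefill"))
```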
4.3. LLM Computational Characteristics
The paper provides additional background on LLM computations, particularly how arithmetic intensity (AI) differs between prefill and decode phases.
The following are the results from Table 1 of the original paper:
| Variable | Description | Notation |
| :--- | :--- | :--- |
| prompt_len | The length of the prompt | S |
| generation_len | The length of generated tokens | G |
| batch_size | The number of batched requests | B |
| layer_num | The number of model layers | L |
| hidden_size | Input dimension of the hidden layer | H |
| heads | The number of attention heads | M |
| size_per_head | The hidden state per head | D |
The following are the results from Table 2 of the original paper:
| Operation | P/D | FLOPS | Memory Access | Approximate AI |
| :--- | :--- | :--- | :--- | :--- |
| QKV Projection | Prefill | 6BSH² | 6BSH + 3H² | BS |
| | Decode | 6BH² | 6BH + 3H² | B |
| Attention | Prefill | 2BS²H | 2BSH + BS²M | S |
| | Decode | 2BSH | 2BSM + BH(S + 1) | 1 |
| Output Projection | Prefill | 2BSH² | 2BSH + H² | BS |
| | Decode | 2BH² | 2BH + H² | B |
| Dim Expansion | Prefill | 8BSH² | 2BSH + 4H² | BS |
| | Decode | 8BH² | 2BH + 4H² | B |
| Dim Reduction | Prefill | 8BSH² | 2BSH + 4H² | BS |
| | Decode | 8BH² | 2BH + 4H² | B |
- Arithmetic Intensity (AI): This metric is computed by dividing the total number of floating point operations (the FLOPS column in Table 2) by the total amount of memory access. It indicates how much computation is performed per unit of data transferred from memory. A higher AI typically means the operation is compute-bound, while a lower AI means it is memory-bound.
- Prefill Phase AI: As shown in Table 2, the prefill phase has an arithmetic intensity that depends on both sequence length (S) and batch size (B). Since S can be large, the prefill phase exhibits significantly higher AI, making it compute-bound.
- Decode Phase AI: The decode phase AI primarily depends on batch size (B) and is much lower. It also requires loading the KV cache, which increases memory access. Consequently, the decode phase is typically memory-bound.

This fundamental difference in arithmetic intensity and computational bottlenecks between the two phases is a key motivation for EcoServe's temporal disaggregation, as it allows GPUs to be optimally utilized for either compute-bound or memory-bound tasks without interference.
4.3.1. QKV Projection (Equation 1)
This step projects input tokens into Query (Q), Key (K), and Value (V) embeddings.
$
\mathbf{Q} = W_q \mathbf{X}, \quad \mathbf{K} = W_k \mathbf{X}, \quad \mathbf{V} = W_v \mathbf{X}.
$
Where:
- $\mathbf{Q}$, $\mathbf{K}$, $\mathbf{V}$: The Query, Key, and Value matrices, respectively.
- $W_q$, $W_k$, $W_v$: The weight matrices for projecting the input into Q, K, and V. These are learnable parameters of the model.
- $\mathbf{X}$: The input token embedding matrix.
4.3.2. Attention (Equation 2)
This is the core self-attention mechanism. It calculates how each token attends to other tokens in the sequence.
$
Attention(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = softmax\left(\frac{\mathbf{Q}\mathbf{K}^{T}}{\sqrt{d_k}}\right)\mathbf{V}
$
Where:
- $Attention(\mathbf{Q}, \mathbf{K}, \mathbf{V})$: The output of the attention mechanism.
- $\mathbf{Q}$: The Query matrix.
- $\mathbf{K}$: The Key matrix.
- $\mathbf{K}^{T}$: The transpose of the Key matrix.
- $d_k$: The dimension of the Key vectors. Dividing by $\sqrt{d_k}$ is a scaling factor to prevent large dot-product values from pushing the softmax function into regions with tiny gradients.
- $softmax$: The softmax function, which normalizes the scores into a probability distribution.
- $\mathbf{V}$: The Value matrix. The softmax output (attention weights) is applied to the Value vectors to obtain the contextualized representation.
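Equation 2 in code form, as a toy single-head NumPy implementation (not an optimized attention kernel):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)     # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for one head: softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

S, d_k = 5, 16                                  # toy sequence length and head dimension
Q, K, V = (np.random.randn(S, d_k) for _ in range(3))
print(attention(Q, K, V).shape)                 # (5, 16)
```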
4.3.3. Feed-Forward Network (FFN) / Output Projection (Equation 3)
The Output Projection (or Feed-Forward Network) further transforms each token's representation using a position-wise, two-layer neural network with a non-linear activation.
$
FFN(x) = Act(xW_1 + b_1)W_2 + b_2
$
Where:
- FFN(x): The output of the Feed-Forward Network for input $x$.
- $x$: The input to the FFN (typically the output of the attention mechanism).
- $W_1$, $W_2$: Weight matrices of the two linear transformations.
- $b_1$, $b_2$: Bias vectors of the two linear transformations.
- $Act$: A non-linear activation function (e.g., ReLU, GeLU).
4.4. Runtime and Frontend Timing
The paper also clarifies how SLOs are measured and perceived, distinguishing between runtime metrics and user experience.
The following figure (Figure 6 from the original paper) illustrates the timing between runtime and frontend:
Figure 6. Runtime and frontend Timing.
- Runtime Metrics: Classically, Time to First Token (TTFT) and Time Per Output Token (TPOT) are used.
- Phase-Switching Waiting Time: The paper highlights that for all strategies (NoDG, PaDG, FuDG), there is an implicit waiting time before a request enters its decode phase. For NoDG/PaDG, this is due to other prefills; for FuDG, it is KV cache transmission. This matters because it is often misrepresented.
- EcoServe's TTFT Definition: To maintain consistency with prior work while reflecting reality, EcoServe's reported TTFT includes both the true TTFT and this phase-switching waiting time, which makes it a stricter SLO.
- EcoServe's TPOT Measurement: TPOT measurement begins after the phase-switching delay, focusing purely on the token generation rate.
5. Experimental Setup
EcoServe is built upon vLLM [5] as the single-device runtime, leveraging Ray for multi-device orchestration within an instance (RPC-like control), and ZeroMQ for inter-instance synchronization at the macro-instance scheduler level.
5.1. Datasets
The evaluation uses three application datasets with diverse input and output length distributions, following prior research by truncating inputs to a maximum length of 4096 tokens.
The following are the results from Table 4 of the original paper:
| DataSet | In Avg (tokens) | In Med (tokens) | Out Avg (tokens) | Out Med (tokens) | SLO TTFT | SLO TPOT |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| Alpaca-gpt4 | 20.63 | 17.00 | 163.80 | 119.00 | 1 s | 100 ms |
| ShareGPT | 343.76 | 148.00 | 237.20 | 152 | 5 s | 100 ms |
| LongBench | 2686.89 | 2736.50 | 101.78 | 19 | 15 s | 100 ms |
- Alpaca-gpt4:
  - Description: Used for human instruction applications.
  - Characteristics: Very short input sequences (average 20.63 tokens, median 17.00) and relatively long outputs (average 163.80 tokens, median 119.00); the average output length is roughly 10 times the input length.
  - SLOs: TTFT = 1 s, TPOT = 100 ms.
- ShareGPT:
  - Description: Represents chatbot applications.
  - Characteristics: Relatively balanced input and output lengths (e.g., average input 343.76 tokens, average output 237.20 tokens).
  - SLOs: TTFT = 5 s, TPOT = 100 ms.
- LongBench:
  - Description: Used for summarization applications, where the goal is to generate a concise summary from a long article.
  - Characteristics: Long input sequences (average 2686.89 tokens, median 2736.50) and short outputs (average 101.78 tokens, median 19).
  - SLOs: TTFT = 15 s, TPOT = 100 ms.

These diverse datasets allow for a comprehensive evaluation of EcoServe's performance across different LLM workload characteristics. The SLOs are set based on application needs, independent of model size, and are often stricter than those in prior works.
5.2. Evaluation Metrics
The evaluation focuses on goodput under different SLO attainment levels.
5.2.1. Goodput
- Conceptual Definition: Goodput measures the effective throughput of a system, counting only those requests that successfully meet their defined Service Level Objectives (SLOs). It reflects both raw processing capacity and the ability to deliver quality of service.
- Mathematical Formula: The paper defines goodput as throughput under different levels of SLO attainment. While an explicit formula is not provided in the paper, it is generally understood as:
$ \text{Goodput} = \frac{\text{Number of requests meeting SLOs}}{\text{Total time}} $
- Symbol Explanation:
  - Number of requests meeting SLOs: The count of completed requests whose TTFT and TPOT (or other relevant metrics) fall within their specified SLO limits.
  - Total time: The duration over which the goodput is measured.
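Given per-request TTFT/TPOT measurements, goodput and SLO attainment follow directly from these definitions; the record format below is hypothetical:

```python
def goodput(records, slo_ttft, slo_tpot, total_time):
    """records: list of dicts with measured 'ttft' and 'tpot' (seconds) per request."""
    ok = [r for r in records if r["ttft"] <= slo_ttft and r["tpot"] <= slo_tpot]
    attainment = 100.0 * len(ok) / len(records)      # % of requests meeting both SLOs
    return len(ok) / total_time, attainment          # goodput (req/s), SLO attainment (%)

records = [{"ttft": 0.8, "tpot": 0.09},
           {"ttft": 1.4, "tpot": 0.08},
           {"ttft": 0.5, "tpot": 0.12}]
print(goodput(records, slo_ttft=1.0, slo_tpot=0.1, total_time=10.0))   # (0.1, 33.3...)
```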
5.2.2. SLO Attainment (P50, P90, P99)
- Conceptual Definition: SLO attainment refers to the percentage of requests that successfully meet their Service Level Objectives (e.g., TTFT < 1 s, TPOT < 100 ms). The P50, P90, and P99 percentiles are commonly used to assess the latency distribution.
  - P50 (50th percentile): 50% of requests meet or exceed this performance level; it represents the median performance.
  - P90 (90th percentile): 90% of requests meet or exceed this performance level; it reflects the performance for the majority of users.
  - P99 (99th percentile): 99% of requests meet or exceed this performance level; this stringent metric focuses on tail latency and the experience of nearly all users.
- Mathematical Formula: No explicit formula is given for SLO attainment itself, but it is typically calculated as:
$ \text{SLO Attainment} = \frac{\text{Number of requests satisfying SLO}}{\text{Total number of requests}} \times 100\% $
- Symbol Explanation:
  - Number of requests satisfying SLO: The count of requests whose TTFT and TPOT values are within the defined SLO limits.
  - Total number of requests: The total number of requests processed.
5.2.3. Time to First Token (TTFT)
- Conceptual Definition: TTFT measures the latency from the moment a user's request is submitted to the system until the very first token of the LLM's response is generated and available. It is a critical metric for perceived responsiveness. In this paper, the reported TTFT implicitly includes the phase-switching waiting time (the time a new request might wait for an instance to switch to its prefill phase), making it a stricter SLO.
- Mathematical Formula: No explicit formula is provided, but conceptually it is:
$ \text{TTFT} = \text{Time}_{\text{first token generated}} - \text{Time}_{\text{request submission}} $
- Symbol Explanation:
  - $\text{Time}_{\text{first token generated}}$: The timestamp when the first output token for a request is computed.
  - $\text{Time}_{\text{request submission}}$: The timestamp when the request was initially received by the system.
5.2.4. Time Per Output Token (TPOT)
- Conceptual Definition: TPOT measures the average time taken by the LLM to generate each subsequent token after the first token has been produced. It reflects the efficiency of the decode phase. In this paper, the measurement of TPOT begins after the phase-switching delay, so it focuses on the actual token generation rate without including initial waiting times.
- Mathematical Formula: No explicit formula is provided, but conceptually it is:
$ \text{TPOT} = \frac{\text{Time}_{\text{last token generated}} - \text{Time}_{\text{first token generated}}}{\text{Number of output tokens} - 1} $
- Symbol Explanation:
  - $\text{Time}_{\text{last token generated}}$: The timestamp when the final output token for a request is computed.
  - $\text{Time}_{\text{first token generated}}$: The timestamp when the first output token for a request is computed.
  - Number of output tokens: The total count of tokens generated for a request (excluding the prompt tokens).
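Per-request TTFT and TPOT can be computed directly from the timestamps in the formulas above; the field names are illustrative:

```python
def ttft(req):
    return req["first_token_time"] - req["submit_time"]

def tpot(req):
    n = req["num_output_tokens"]
    return (req["last_token_time"] - req["first_token_time"]) / max(n - 1, 1)

req = {"submit_time": 0.0, "first_token_time": 0.9,
       "last_token_time": 10.9, "num_output_tokens": 101}
print(ttft(req), tpot(req))   # 0.9 s TTFT, 0.1 s (100 ms) per output token
```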
5.3. Baselines
EcoServe is compared against four representative LLM serving systems spanning NoDG and FuDG strategies. All baselines are built on vLLM [5] as the underlying runtime, ensuring a fair comparison.
- vLLM [5]:
  - Strategy: Non-Disaggregated (NoDG).
  - Techniques: Uses separate batching (prefill and decode batches are processed separately) and prefill-priority scheduling (new prefills are prioritized over ongoing decodes).
  - Role: Represents a standard, high-performance NoDG system.
- Sarathi [9]:
  - Strategy: Non-Disaggregated (NoDG).
  - Techniques: Employs hybrid batching (combines prefill and decode requests into one batch), decode-priority scheduling (prioritizes ongoing decodes), and the chunked prefill technique (breaks long prefills into smaller chunks to reduce interference).
  - Role: Represents an NoDG system that explicitly tries to mitigate prefill-decode interference through advanced scheduling and batching.
- DistServe [50]:
  - Strategy: Intra-node Fully Disaggregated (FuDG).
  - Techniques: Prefill and decode instances are colocated within a single node. KV cache is transferred between them over intra-node high-speed links (e.g., NVLink, if available). The paper notes that its strategy of distributing instances across nodes with pipeline parallelism was not compatible with the SLOs in this setting.
  - Role: Represents a FuDG system that leverages fast intra-node communication to reduce KV cache transfer overhead.
- MoonCake [35]:
  - Strategy: Inter-node Fully Disaggregated (FuDG).
  - Techniques: Allows prefill and decode instances to be assigned to different nodes. It introduces a centralized KV cache pool that acts as a buffer for KV cache transmission, typically relying on InfiniBand for inter-node connectivity. Even if prefill and decode instances are on the same node, the KV cache passes through this pool. To address load imbalance, the optimal P/D (prefill/decode) ratio is selected.
  - Role: Represents a FuDG system designed for large-scale, multi-node deployments with specialized high-performance networking.
5.4. Cluster Testbed
Experiments were conducted on two different cluster setups to evaluate performance across varying hardware capabilities.
- Primary Testbed (L20 Cluster):
  - Configuration: A production-level cluster with 8 nodes, totaling 64 GPUs (8 NVIDIA L20-48GB GPUs per node).
  - Interconnects: GPUs within a node are connected via PCIe only. Nodes are interconnected via standard 10Gbps Ethernet (commodity interconnects).
  - Significance: Represents a typical, cost-effective infrastructure setting in modern data centers, where EcoServe aims to excel.
- Second Testbed (A800 Cluster):
  - Configuration: Consists of 2 nodes, totaling 16 GPUs (8 NVIDIA A800-80GB GPUs per node).
  - Interconnects: GPUs within a node are connected via PCIe only. Nodes are interconnected via 25Gbps RoCE (a higher-bandwidth Ethernet-based interconnect than the 10Gbps Ethernet in the L20 cluster).
  - Significance: Allows evaluation under a higher-bandwidth yet still Ethernet-based interconnect, providing a comparison point for FuDG systems that theoretically benefit from more bandwidth.
5.5. Model Setup
Three representative LLM models were used, chosen for their varying sizes and attention mechanisms; all experiments use the same precision setting.
- Llama-30B [43]:
  - Attention Mechanism: Standard multi-head attention (MHA), which results in a larger KV cache size compared to GQA.
  - TP Configuration:
    - L20 Cluster (32 GPUs, 8 nodes): the model is partitioned across 4 GPUs.
    - A800 Cluster (16 GPUs): the model is partitioned across 2 GPUs.
- CodeLlama2-34B [36]:
  - Attention Mechanism: Employs grouped-query attention (GQA) [10], which significantly compresses the KV cache size, reducing memory bandwidth and transmission overhead.
  - TP Configuration:
    - L20 Cluster (32 GPUs, 8 nodes): .
    - A800 Cluster (16 GPUs): .
- Qwen2-72B [46]:
  - Attention Mechanism: Also uses grouped-query attention (GQA). Despite being a larger model (72B vs. 34B), its GQA still makes its KV cache more compact relative to MHA models of similar scale.
  - TP Configuration:
    - L20 Cluster (32 GPUs, 8 nodes): .
    - A800 Cluster (16 GPUs): .
Workloads are generated by pairing each model with each dataset. Request arrivals are simulated using a Poisson distribution at a fixed rate to introduce realistic fluctuations.
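As a small illustration of this workload generation step, the sketch below draws Poisson arrivals by sampling exponential inter-arrival gaps; the rate, duration, and function name are illustrative assumptions.

```python
import random

def poisson_arrivals(rate_per_s: float, duration_s: float, seed: int = 0) -> list:
    """Arrival timestamps (seconds) of a Poisson process with the given rate."""
    random.seed(seed)
    t, arrivals = 0.0, []
    while True:
        t += random.expovariate(rate_per_s)  # exponential inter-arrival gap
        if t > duration_s:
            return arrivals
        arrivals.append(t)

timestamps = poisson_arrivals(rate_per_s=20.0, duration_s=60.0)
print(len(timestamps))  # roughly 1200 requests over 60 s at 20 req/s
```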
6. Results & Analysis
The evaluation primarily compares EcoServe's goodput against baseline systems under various SLO attainment levels, across different models, clusters, and applications. The goodput is measured by incrementally increasing the request rate until the system fails to meet the specified SLO attainment (P50, P90, P99).
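The measurement procedure can be pictured as a simple search loop. The sketch below is a minimal illustration, with `run_benchmark` standing in for an actual load test that returns the measured SLO attainment at a given request rate; the step size and stopping rule are assumptions.

```python
def measure_goodput(run_benchmark, target_attainment: float,
                    start_rate: float = 1.0, step: float = 1.0) -> float:
    """Highest request rate whose measured SLO attainment still meets the target."""
    rate, best = start_rate, 0.0
    while True:
        attainment = run_benchmark(rate)   # % of requests meeting both TTFT and TPOT SLOs
        if attainment < target_attainment:
            return best                    # the previous rate is reported as goodput
        best = rate
        rate += step

# Usage (hypothetical): measure_goodput(run_benchmark=my_load_test, target_attainment=90.0)
```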
6.1. End-to-end Performance Evaluation
The following are the results from Figure 8 of the original paper:
Figure 8. Overall goodput performance comparison.
Figure 8 presents a comprehensive comparison of EcoServe against vLLM, Sarathi, DistServe, and MoonCake across various scenarios.
6.1.1. Overall Comparison with Baselines
- EcoServe's Dominance: EcoServe generally outperforms all baselines in most cases, especially under stricter SLOs.
- Vs. NoDG Systems (vLLM, Sarathi): EcoServe achieves an average P90 goodput improvement of 83.76% over vLLM and 71.97% over Sarathi. This is attributed to EcoServe's PaDG strategy, which mitigates prefill-decode interference through temporal disaggregation and rolling activation, creating more headroom to balance TTFT and TPOT via cross-instance cooperation.
  - Exception (Alpaca dataset): For the Alpaca dataset (short inputs, long outputs), NoDG systems can sometimes achieve comparable or slightly better performance. Alpaca's short inputs lead to less prefill-decode interference, and its SLOs may be loose enough that the additional trade-off space provided by PaDG becomes less critical.
- Vs. FuDG Systems (DistServe, MoonCake): While FuDG systems can sometimes match NoDG for models with reduced KV cache (e.g., GQA models) and long outputs, they fall significantly behind EcoServe. EcoServe achieves an average P90 goodput improvement of 192.41% over DistServe and 218.22% over MoonCake. This highlights the severe bottleneck FuDG systems face due to KV cache transmission over commodity interconnects, which EcoServe avoids.
6.1.2. Comparison Across SLO Attainment Levels
- Throughput Decline with Stricter SLOs: All systems experience a decrease in throughput as the SLO attainment level increases from P50 to P99 (i.e., as latency requirements become stricter).
- EcoServe's Tolerance to Tight SLOs: EcoServe demonstrates significantly higher tolerance to tighter SLOs:
  - P50 SLO attainment: EcoServe shows improvements of 36.49%, 19.82%, 180.73%, and 194.62% over the four baselines.
  - P90 SLO attainment: these improvements increase substantially to 83.76%, 71.97%, 192.41%, and 218.22%.
  - P99 SLO attainment: the gap widens further, with some baseline systems unable to meet P99 SLO attainment at all.
- Validation: This trend validates that PaDG, through its inter-instance cooperation, provides a much larger performance envelope for balancing TTFT and TPOT under stringent latency constraints.
6.1.3. Comparison Across Models
EcoServe shows consistent performance gains across different model architectures:
- Vs. NoDG Systems (P90 SLO):
- Llama-30B: 65.00% improvement.
- CodeLlama2-34B: 83.30% improvement.
- Qwen2-72B: 85.30% improvement.
- Vs. FuDG Systems (P90 SLO): The advantage varies significantly:
  - Llama-30B: 507.67% improvement. This large gain arises because Llama-30B uses MHA (standard multi-head attention), resulting in a much larger KV cache that severely degrades FuDG systems due to KV cache transmission overhead.
  - CodeLlama2-34B: 125.45% improvement.
  - Qwen2-72B: 83.61% improvement. CodeLlama2-34B and Qwen2-72B use GQA (grouped-query attention), which significantly reduces KV cache size and thereby alleviates KV cache transmission overhead for FuDG systems. Qwen2-72B, despite being larger, has a relatively small KV cache compared to its computational cost, which benefits FuDG more than Llama-30B does.
6.1.4. Comparison Across Clusters
- A800 Cluster (P90 SLO): EcoServe achieves an average throughput improvement of 71.41% over NoDG systems and 285.78% over FuDG systems.
- L20 Cluster (P90 SLO): EcoServe achieves an average throughput improvement of 84.33% over NoDG systems and 124.86% over FuDG systems.
- A800 vs. L20 with FuDG: While the A800 cluster has a higher-bandwidth interconnect (25Gbps RoCE vs. 10Gbps Ethernet), it appears less favorable for FuDG systems in terms of relative improvement. This counter-intuitive result is explained by the paper: while bandwidth increases by 2.5x, the processing capability of the A800 GPUs improves by over 4x. The inter-node network therefore becomes an even more significant bottleneck for FuDG on A800s, because the GPUs can generate KV cache data much faster than the network can transfer it (a rough calculation follows below).
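A rough back-of-the-envelope check makes this concrete (assuming the KV cache produced per unit time scales with GPU processing capability): $ \frac{(T_{\text{transfer}}/T_{\text{compute}})_{\text{A800}}}{(T_{\text{transfer}}/T_{\text{compute}})_{\text{L20}}} \approx \frac{4}{2.5} = 1.6 $. In other words, relative to computation, KV cache transmission takes roughly 1.6x more time on the A800 cluster than on the L20 cluster, so FuDG is penalized more despite the faster network.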
6.1.5. Comparison Across Applications
- Vs. NoDG Systems (P90 SLO):
- Alpaca: 10.44% improvement.
- ShareGPT: 20.60% improvement.
- LongBench: 202.57% improvement.
- Analysis: Shorter input lengths (Alpaca) reduce prefill-decode interference and chunked prefill overhead, making NoDG relatively better. EcoServe shines with LongBench (long inputs, short outputs), where prefill is dominant and interference is severe for NoDG.
- Vs. FuDG Systems (P90 SLO):
- Alpaca: 74.80% improvement.
- ShareGPT: 363.10% improvement.
- LongBench: 164.42% improvement. (Excluding Llama-30B due to execution failures, which would make this even higher).
- Analysis: Datasets with longer inputs and shorter outputs (LongBench) demand more prefill instances to generate KV cache. This significantly increases network transmission pressure for FuDG systems, leading to worse performance compared to EcoServe, which avoids this transmission entirely.
6.2. Scaling Capability
6.2.1. Static Coarse-grained Scaling
This section evaluates how EcoServe's goodput scales when the available resources (number of instances) are doubled.
- Setup: CodeLlama2-34B and Qwen2-72B models on the L20 cluster (each with its respective TP configuration), using the ShareGPT dataset.
- Results:
The following are the results from Figure 9 of the original paper:
Figure 9. Static coarse-grained scaling.
* As shown in Figure 9, both models achieve superlinear throughput improvement at P90 SLO attainment. For example, CodeLlama2-34B serving scales from 1 instance (4 GPUs) to 4 instances (16 GPUs) and achieves 5.6x throughput (a quick calculation follows at the end of this subsection).
- Analysis for Superlinear Scaling:
  - Minimal Management Overhead: EcoServe incurs minimal overhead in managing more instances within a macro instance, especially when nodes are symmetrical.
  - Mitigation of Inter-phase Interference: Crucially, adding more instances provides more room for mitigating inter-phase interference. Instances can spend longer periods dedicated to one phase without causing SLO violations, which allows higher arithmetic intensity and better GPU saturation.
  - Degradation to NoDG: If a macro instance contains only a single instance, the PaDG strategy effectively degrades to NoDG, suffering from frequent phase switches and severe interference.
- Plateau Effect: The paper notes that this superlinear scaling effect will eventually plateau once a sufficient number of instances is reached, as the benefits of reducing interference diminish.
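To make the superlinearity concrete: going from 1 instance to 4 instances while reaching $ 5.6\times $ the throughput implies roughly $ 5.6 / 4 = 1.4\times $ the per-instance throughput of the single-instance baseline; the extra 40% comes from reduced phase switching and interference rather than from the added hardware alone.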
6.2.2. Dynamic Fine-grained Scaling
This evaluates EcoServe's ability to adapt to dynamically changing request rates by incrementally adding instances.
- Setup: CodeLlama2-34B on the L20 cluster, using the ShareGPT dataset. The request rate is increased every 2 minutes (from 20 to 50 requests/second), and SLO attainment is collected every 30 seconds.
- Hyperparameters: fixed thresholds are used for mitosis scaling. The system starts with 8 instances and eventually uses up all GPUs.
- Results:
The following are the results from Figure 10 of the original paper:
Figure 10. Request rate (green) and SLO attainment (blue) as the request rate increases.
* Figure 10 shows that as the request rate increases, SLO attainment initially drops but is then restored by the addition of a new instance. The blue dots (SLO attainment) remain high despite the increasing green line (request rate).
- Analysis:
  - Adaptive Scheduling: The adaptive scheduling algorithm immediately routes new requests to newly added instances, freeing up time for existing instances to process decodes.
  - Instance Migration Overhead: The serializable proxy object used for instance migration (part of mitosis scaling) introduces less than 100 ms of overhead, which can be hidden by triggering migration during the decode phase. This contrasts sharply with the 3-minute (or longer) overhead of re-initializing an instance from scratch.
- Conclusion: The mitosis scaling approach effectively provides flexible and fine-grained scaling, allowing EcoServe to adapt to dynamic workloads (a simplified sketch of such a scaling loop follows below).
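The scaling behavior described above can be pictured as a simple monitoring loop. The sketch below is an illustrative approximation, with `get_slo_attainment` and `add_instance` standing in for EcoServe's actual monitoring and mitosis-scaling hooks; the 90% threshold is an arbitrary assumption.

```python
import time

def autoscale(get_slo_attainment, add_instance, max_instances: int,
              start_instances: int = 8, threshold: float = 90.0,
              interval_s: float = 30.0) -> None:
    """Add one instance at a time whenever measured SLO attainment dips."""
    active = start_instances                 # the experiment starts with 8 instances
    while active < max_instances:
        time.sleep(interval_s)               # attainment is sampled every 30 s
        if get_slo_attainment() < threshold:
            add_instance()                   # mitosis scaling: <100 ms migration overhead
            active += 1
```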
6.3. Parallelism Compatibility
This section validates EcoServe's compatibility with pipeline parallelism (PP).
- Setup: CodeLlama2-34B, ShareGPT dataset, L20 cluster. The TPOT SLO is varied from 100ms to 500ms (relaxed, as PP does not improve single-batch latency).
- Configurations:
  - EcoServe with both TP and PP.
  - EcoServe with TP only (no PP).
  - vLLM with the corresponding parallel configuration.
- Results:
The following are the results from Figure 11 of the original paper:
Figure 11. Pipeline parallel compatibility.
* As shown in Figure 11, EcoServe utilizing PP achieves better performance than its TP-only counterpart at lower TPOT SLOs.
* EcoServe (PP) also outperforms vLLM across the board, and the throughput plateau it reaches with PP is much higher than that of vLLM.
- Analysis:
  - Lower Frequency of Prefill-Decode Switching: EcoServe's PaDG strategy inherently involves less frequent prefill-decode switching within an instance compared to NoDG. This reduces pipeline bubbles and improves pipeline parallelism efficiency.
  - Minimal Data Movement and PCIe Contention: Because the PaDG strategy avoids KV cache transfer between instances, it also incurs less data movement and reduced PCIe contention, making it more suitable for tensor parallelism on PCIe-only systems (like the L20 cluster).
- Conclusion: EcoServe is highly compatible with pipeline parallelism, achieving superior throughput compared to vLLM and even to its own TP-only configuration under certain TPOT SLOs.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces EcoServe, a novel system designed for cost-effective Large Language Model (LLM) serving on clusters equipped with commodity interconnects. Its core innovation is the Partially Disaggregated (PaDG) Strategy, which addresses the limitations of existing Non-Disaggregated (NoDG) and Fully Disaggregated (FuDG) approaches.
The PaDG strategy leverages two key mechanisms:
- Temporal Disaggregation: It disaggregates the prefill and decode phases along the time dimension within a single instance. This significantly mitigates inter-phase interference and enhances throughput by allowing each phase to run more efficiently.
- Rolling Activation: To ensure low Time to First Token (TTFT) despite temporal disaggregation, EcoServe coordinates multiple instances within a macro instance in a cyclic pattern, guaranteeing continuous availability of prefill processing (a conceptual sketch of this cyclic pattern appears below).

EcoServe further integrates an adaptive scheduling algorithm for intelligent request routing and a mitosis scaling approach for fine-grained, elastic capacity adjustment and seamless instance migration.
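To visualize the cyclic pattern behind rolling activation, the following conceptual sketch rotates the prefill role across the instances of a macro instance; the generator interface and the single prefill slot per round are illustrative assumptions, not EcoServe's scheduler.

```python
from itertools import cycle

def rolling_activation(instances, prefill_slots: int = 1):
    """Yield, per scheduling round, (prefill instances, decode instances)."""
    order = cycle(range(len(instances)))
    while True:
        prefill_ids = {next(order) for _ in range(prefill_slots)}
        prefill = [instances[i] for i in sorted(prefill_ids)]
        decode = [inst for i, inst in enumerate(instances) if i not in prefill_ids]
        yield prefill, decode

rounds = rolling_activation(["inst0", "inst1", "inst2", "inst3"])
for _ in range(4):
    prefill, decode = next(rounds)
    print("prefill:", prefill, "| decode:", decode)
```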
Experimental results on production-level clusters demonstrate that EcoServe achieves an average goodput improvement of 82.49% to 126.96% over representative NoDG and FuDG systems. Beyond raw performance, EcoServe excels in load balancing, hardware cost, parallelism compatibility, and engineering simplicity, making it a superior solution for practical, cost-sensitive LLM deployments.
7.2. Limitations & Future Work
The paper's discussion section (Section 6) provides a comparative analysis (Table 5) that implicitly outlines the advantageous scenarios for each strategy rather than explicit limitations of EcoServe.
The following are the results from Table 5 of the original paper:
| Strategy | Goodput | Cost Effective | Load Balance | Hardware Cost | Parallelism Compatibility | Engineering Complexity |
| --- | --- | --- | --- | --- | --- | --- |
| NoDG | ✓ | Good | Easy | Low | Low | Low |
| FuDG | ✓✓ | Poor | Hard | High | High | High |
| PaDG | ✓✓ | Excellent | Easy | Low | High | Low |
Based on this, and the context of the paper, we can infer the following:
Implicit Limitations of PaDG (EcoServe):
- Applicability for Small Models: While PaDG is excellent for 30B-, 70B-, and 130B-scale models, for very small models (e.g., 7B, 13B) with lower computational demands and easier-to-satisfy SLOs, the prefill-decode interference in NoDG might be negligible. In such cases, the engineering simplicity and low overhead of NoDG could still make it the preferable choice; PaDG introduces some coordination complexity that might not be justified for very small-scale workloads.
- Extreme Scenarios/Stringent SLOs: For ultra-large models or extremely stringent SLOs where even minor interference is intolerable, FuDG with advanced high-performance hardware (e.g., InfiniBand) might still be essential, despite its cost. PaDG, while mitigating interference, still runs both phases on the same physical instance, just temporally separated, which might not suffice for the most extreme cases.
- Dependency on Prediction Accuracy: The adaptive scheduling algorithm relies on predicting prefill durations and saved TPOT. Inaccurate predictions could lead to SLO violations or suboptimal scheduling (a toy profiling-based predictor is sketched below).
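For intuition about what such a predictor might look like, the sketch below fits a simple profiling-based curve of prefill latency versus prompt length; the quadratic form and the sample profiling points are assumptions for illustration only, not EcoServe's actual model.

```python
import numpy as np

def fit_prefill_model(profile):
    """Least-squares fit of prefill latency vs. prompt length from profiling pairs."""
    lens = np.array([n for n, _ in profile], dtype=float)
    lats = np.array([t for _, t in profile], dtype=float)
    coeffs = np.polyfit(lens, lats, deg=2)    # quadratic fit; highest order first
    return lambda n: float(np.polyval(coeffs, n))

# Hypothetical profiling points: (prompt length, measured prefill latency in seconds).
predict = fit_prefill_model([(128, 0.05), (512, 0.18), (1024, 0.40), (2048, 0.95)])
print(round(predict(1536), 3))  # estimated prefill latency for a 1536-token prompt
```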
Suggested Future Work (by the authors and implied):
- More Aggressive Disaggregation: The authors mention MegaScale-Infer [52], which disaggregates attention and FFN modules into different instances for ultra-large MoE models. This suggests a potential future direction for PaDG: exploring even finer-grained, module-level disaggregation beyond prefill/decode for larger or specialized models.
- Optimizing Intra-Instance Phase Switching: While temporal disaggregation aims to reduce switching frequency, further research could optimize the timing and conditions for switching between prefill and decode phases to minimize context-switching overhead and maximize GPU utilization.
- Dynamic Workload Prediction: Improving the accuracy and adaptiveness of workload prediction (prefill_times, output_length for saved TPOT) could further enhance the adaptive scheduling algorithm.
- Heterogeneous Hardware: Exploring PaDG strategies on heterogeneous clusters (e.g., mixing different GPU types, or GPUs with varying interconnects) could extend its applicability.
7.3. Personal Insights & Critique
EcoServe presents a highly practical and well-reasoned solution to a pressing problem in LLM serving: achieving high performance on commodity hardware. The paper clearly articulates the trade-offs involved in LLM system design, illustrating that "the art of trade-offs" is paramount.
Key Strengths:
- Addressing the "Missing Middle": The PaDG strategy effectively carves out a sweet spot between the NoDG and FuDG extremes. It provides the benefits of interference mitigation without the prohibitive cost of specialized interconnects, making it particularly relevant for enterprises and cloud providers with existing commodity infrastructure.
- Proactive Scheduling: The combination of temporal disaggregation (intra-instance) and rolling activation (inter-instance) is elegant. It solves the inherent TTFT problem introduced by temporal separation within an instance, demonstrating a holistic system design approach.
- Practical Scaling: The mitosis scaling approach with its serializable proxy object is a robust solution for fine-grained, elastic scaling with minimal overhead. This is crucial for dynamic cloud environments.
- Comprehensive Evaluation: The experiments are conducted on production-level clusters with diverse models and datasets, using rigorous metrics (goodput at various SLO attainment levels), which lends strong credibility to the findings. The analysis across different clusters (L20 vs. A800) provides insightful observations about how interconnect bandwidth can become an even greater bottleneck relative to GPU processing power on newer hardware.
- Engineering Simplicity: The argument for lower engineering complexity compared to FuDG is compelling, as it avoids complex KV cache transfer mechanisms and simplifies load balancing.
Potential Areas for Deeper Exploration / Critique:
- Phase-Switching Overhead: While the paper states that temporal disaggregation makes each phase last longer to reduce switching overhead, a more detailed quantification of this overhead (e.g., context-switching time, cache invalidation) and how it is minimized would be beneficial.
- Prediction Accuracy and Robustness: The adaptive scheduling algorithm relies on predicting prefill durations. The paper mentions profiling, but the robustness of these predictions under highly variable or adversarial workloads (e.g., very diverse prompt lengths, sudden spikes in traffic) could be explored further. What happens if predictions are consistently off?
- Interplay with Advanced KV Cache Optimizations: The paper notes that GQA benefits FuDG by reducing KV cache size. How does EcoServe interact with other advanced KV cache optimizations (such as PagedAttention or compression techniques)? Do they offer additional synergies or create new trade-offs within the PaDG framework?
- Generalizability to Other Parallelism: While TP and PP are covered, exploring compatibility with other parallelism strategies (e.g., expert parallelism for MoE models) could be valuable, especially given the mention of MegaScale-Infer.
- Cost Model: While "cost-effective" is a central claim, a more explicit cost model (e.g., comparing GPU-hours per goodput-request-unit across solutions, or the total cost of ownership including networking) could further solidify the argument for PaDG over FuDG.

Overall, EcoServe offers a significant step forward for practical LLM deployment. Its PaDG strategy is a clever synthesis of existing ideas, thoughtfully applied to address a critical industry pain point. The paper is well-structured and provides a strong foundation for future research in optimizing LLM serving architectures.