
EcoServe: Enabling Cost-effective LLM Serving with Proactive Intra- and Inter-Instance Orchestration

Published: 04/25/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

EcoServe introduces partial disaggregation with temporal decoupling and rolling activation, proactively orchestrating instances to reduce interference, enhance throughput and latency, and enable cost-effective LLM serving on commodity clusters with superior performance to existing systems.

Abstract

Existing LLM serving strategies can be categorized based on whether prefill and decode phases are disaggregated: non-disaggregated (NoDG) or fully disaggregated (FuDG). However, the NoDG strategy leads to strong prefill-decode interference and the FuDG strategy highly relies on high-performance interconnects, making them less cost-effective. We introduce EcoServe, a system that enables cost-effective LLM serving on clusters with commodity interconnects. EcoServe is built on the partially disaggregated (PaDG) strategy, applying temporal disaggregation and rolling activation for proactive intra- and inter-instance scheduling. It first disaggregates the prefill and decode phases along the time dimension within a single instance to mitigate inter-phase interference and enhance throughput. Next, it coordinates multiple instances and cyclically activates them to ensure the continuous availability of prefill processing, thereby improving latency. Thus, EcoServe's basic serving unit is the macro instance, within which multiple instances collaborate. It further integrates an adaptive scheduling algorithm to route requests in a macro instance and a mitosis scaling approach to enable fine-grained capacity scaling. Beyond delivering high goodput, EcoServe excels in load balancing, hardware cost, parallelism compatibility, and even engineering simplicity compared to existing solutions. When serving 30B- and 70B-scale models on a production-level cluster with 32 NVIDIA L20 GPUs using commodity Ethernet, EcoServe averagely improves goodput by 82.49%, 86.17%, 122.76%, and 126.96% over four representative NoDG and FuDG systems.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

EcoServe: Enabling Cost-effective LLM Serving with Proactive Intra- and Inter-Instance Orchestration

1.2. Authors

  • Jiangsu Du (Sun Yat-sen University, Guangzhou, China)
  • Hongbin Zhang (Sun Yat-sen University, Guangzhou, China)
  • Taosheng Wei (Sun Yat-sen University, Guangzhou, China)
  • Zhenyi Zheng (Sun Yat-sen University, Guangzhou, China)
  • Kaiyi Wu (Sun Yat-sen University, Guangzhou, China)
  • Zhiguang Chen (Sun Yat-sen University, Guangzhou, China)
  • Yutong Lu (Sun Yat-sen University, Guangzhou, China)

1.3. Journal/Conference

The paper was published on arXiv, a preprint server for research papers. While arXiv is not a peer-reviewed journal or conference in itself, it is a highly influential platform for disseminating new research rapidly in fields like computer science, physics, and mathematics. Papers on arXiv often undergo peer review and are subsequently published in reputable conferences or journals.

1.4. Publication Year

2025 (Based on the UTC timestamp 2025-04-25T08:06:22.000Z)

1.5. Abstract

The paper introduces EcoServe, a novel system designed to enable cost-effective serving of Large Language Models (LLMs) on clusters equipped with commodity interconnects. It addresses limitations of existing LLM serving strategies: non-disaggregated (NoDG) strategies suffer from severe prefill-decode interference, while fully disaggregated (FuDG) strategies demand expensive, high-performance interconnects for KV cache transfer, making both less cost-effective.

EcoServe proposes a partially disaggregated (PaDG) strategy. This strategy employs temporal disaggregation to mitigate inter-phase interference and enhance throughput within a single instance by periodically switching between prefill and decode phases. To address the resulting increase in Time to First Token (TTFT), EcoServe further implements rolling activation, coordinating multiple instances cyclically to ensure continuous availability of prefill processing and thus improving latency. The fundamental serving unit in EcoServe is termed a macro instance, comprising several collaborating instances. The system also integrates an adaptive scheduling algorithm for request routing within a macro instance and a mitosis scaling approach for fine-grained capacity adjustment.

Beyond high goodput, EcoServe demonstrates advantages in load balancing, hardware cost, parallelism compatibility, and engineering simplicity. Experimental evaluations on a production-level cluster with 32 NVIDIA L20 GPUs and commodity Ethernet show that EcoServe significantly improves goodput by an average of 82.49% to 126.96% compared to four representative NoDG and FuDG systems when serving 30B- and 70B-scale models.

https://arxiv.org/abs/2504.18154v1 (Preprint on arXiv)

https://arxiv.org/pdf/2504.18154v1.pdf (Preprint PDF on arXiv)

2. Executive Summary

2.1. Background & Motivation

The proliferation of Large Language Models (LLMs) across various applications (e.g., Github Copilot, Character.ai) has led to a surge in demand for efficient and cost-effective LLM inference. A primary objective in LLM serving is to optimize the cost per request while simultaneously adhering to Service Level Objectives (SLOs) for response times. LLM inference inherently involves two distinct computational phases: the prefill phase and the decode phase. The prefill phase processes the input prompt to generate the first token and the initial Key-Value (KV) cache, while the decode phase iteratively generates subsequent tokens using the accumulated KV cache. These two phases have different performance metrics: Time to First Token (TTFT) for prefill and Time Per Output Token (TPOT) for decode. Improving one often comes at the expense of the others, forming an inherent performance trade-off triangle with throughput.

Existing cluster-level LLM serving solutions generally fall into two categories:

  1. Non-Disaggregated (NoDG) Strategy: In this approach, both the prefill and decode phases are handled by a single instance, often running on a single GPU or a set of GPUs cooperating via parallelism. While conceptually simple, this colocation leads to strong prefill-decode interference. For example, if prefills are prioritized for low TTFT, ongoing decodes suffer from high TPOT; conversely, prioritizing decodes can delay prefills. This interference not only impacts latency SLOs but also hinders throughput by preventing the decode phase from accumulating sufficiently large batches to saturate GPU resources. It also struggles with pipeline parallelism due to imbalanced workloads and dependencies.

  2. Fully Disaggregated (FuDG) Strategy: To eliminate prefill-decode interference, this strategy assigns prefill and decode phases to separate instances, potentially on different devices or even nodes. While effective in mitigating interference, FuDG introduces a significant challenge: it requires transferring massive amounts of KV cache data between prefill and decode instances. This necessitates high-performance interconnects (e.g., NVLink, InfiniBand), which are exceptionally expensive and power-intensive, thus making FuDG less cost-effective for broader deployment on clusters with commodity hardware (e.g., standard Ethernet). Additionally, FuDG faces load imbalance issues, as adjusting the ratio of prefill to decode instances is complex, and memory utilization can be imbalanced (decode instances store large KV caches, while prefill instances store less).

    The paper's entry point is the critical observation that both existing strategies have fundamental limitations preventing cost-effective LLM serving on commodity hardware. The innovative idea is that intra-instance scheduling (when to execute prefills/decodes) must be meticulously coordinated with inter-instance scheduling (where/when to route requests) to optimally utilize resources and improve the trade-off between TTFT, TPOT, and throughput.

2.2. Main Contributions / Findings

The paper introduces EcoServe, a novel LLM serving system, and makes the following primary contributions:

  • Introduction of EcoServe, a Cost-effective LLM Serving System: EcoServe is specifically designed to enable cost-effective LLM inference on clusters utilizing commodity interconnects, addressing a critical need in production environments.
  • The Partially Disaggregated (PaDG) Strategy: This is the core innovation, combining temporal disaggregation and rolling activation.
    • Temporal Disaggregation: Within a single instance, prefill and decode phases are disaggregated along the time dimension. This means an instance switches between processing only prefills and only decodes, reducing inter-phase interference and boosting throughput without KV cache transfer.
    • Rolling Activation: To counteract the TTFT increase caused by temporal disaggregation, EcoServe coordinates multiple instances in a cyclic pattern. This ensures that at any given moment, some instances are activated for prefill processing, maintaining low TTFT for new requests.
  • Macro Instance Abstraction: EcoServe introduces the macro instance as its basic serving unit, where multiple instances collaborate under the PaDG strategy. This abstraction simplifies scheduling and resource management.
  • Adaptive Scheduling Algorithm: An intelligent algorithm is integrated to route requests within a macro instance. It prioritizes TPOT maintenance, identifies optimal instances for new requests, and determines the maximum prefill tokens that can be inserted, balancing TTFT and TPOT constraints.
  • Mitosis Scaling Approach: EcoServe incorporates a novel mitosis scaling approach for fine-grained capacity scaling. This allows elastic adjustment of instance counts within a macro instance, and triggers split or merge operations of macro instances when thresholds are met. A serializable proxy object facilitates transparent instance migration without re-initialization or interruption.
  • Hierarchical Architecture: EcoServe is implemented with a hierarchical architecture comprising an overall scheduler, macro-instance schedulers, and instance schedulers, enabling coordinated decision-making at multiple levels.
  • Comprehensive Evaluation and Superior Performance: EcoServe was evaluated on a production-level cluster with 32 NVIDIA L20 GPUs and commodity Ethernet, serving 30B- and 70B-scale models.
    • It demonstrated an average goodput improvement of 82.49% to 126.96% over four representative NoDG (vLLM, Sarathi) and FuDG (DistServe, MoonCake) systems.

    • It showed superior performance in load balancing, hardware cost (lower), parallelism compatibility, and engineering simplicity compared to existing solutions.

    • EcoServe exhibits higher tolerance to tighter SLOs and offers superlinear scaling of throughput with increased resources.

      These findings solve the problem of achieving high-performance and cost-effective LLM serving on commodity hardware, offering a practical alternative to solutions that either suffer from severe interference or demand prohibitively expensive infrastructure.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand EcoServe, a basic grasp of Large Language Models (LLMs) and their inference serving challenges is essential.

3.1.1. Large Language Models (LLMs)

Large Language Models (LLMs) are a type of artificial intelligence model, typically based on the Transformer architecture, that are trained on vast amounts of text data. They are capable of understanding, generating, and processing human language for various tasks like text generation, summarization, translation, and question answering. Examples include Llama, GPT, and Qwen.

3.1.2. LLM Inference Phases: Prefill and Decode

LLM inference, especially for generative tasks, is divided into two distinct phases:

  • Prefill Phase: Also known as the prompt processing phase. When a user inputs a prompt (e.g., a question or a starting sentence), the LLM processes this entire sequence of input tokens to generate the initial internal representations, including the first output token. This phase is typically compute-bound, meaning its performance is limited by computational power (e.g., Floating Point Operations Per Second - FLOPS). It involves large matrix multiplications.
  • Decode Phase: Also known as the token generation phase. After the first token is generated, the LLM generates subsequent tokens one by one, in an autoregressive manner. Each new token is generated based on the input prompt and all previously generated tokens. This phase is typically memory-bound, meaning its performance is limited by memory access speed, particularly for loading the KV cache.

3.1.3. KV Cache

The Key-Value (KV) cache is a critical optimization technique in Transformer-based LLMs. During the self-attention mechanism, Query (Q), Key (K), and Value (V) vectors are computed for each token. For subsequent tokens in the decode phase, the K and V vectors for previous tokens in the sequence remain the same. Instead of recomputing these K and V vectors for every previous token at each step, they are stored in memory (the KV cache). This significantly reduces redundant computation, especially in the decode phase. The size of the KV cache grows with the sequence length and the number of parallel requests.
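
To make this growth concrete, the following Python sketch estimates the KV cache footprint from the model dimensions; FP16 storage and the specific layer/hidden sizes are assumptions for illustration, not figures from the paper:

```python
def kv_cache_bytes(batch_size: int, seq_len: int, num_layers: int,
                   hidden_size: int, bytes_per_value: int = 2) -> int:
    """Rough KV cache footprint: one Key and one Value vector per layer per token."""
    # Factor 2 accounts for storing both K and V; bytes_per_value=2 assumes FP16.
    return 2 * num_layers * batch_size * seq_len * hidden_size * bytes_per_value

# Hypothetical 30B-scale configuration: 60 layers, hidden size 6656.
print(kv_cache_bytes(batch_size=8, seq_len=2048, num_layers=60, hidden_size=6656) / 1e9, "GB")
```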

3.1.4. Performance Metrics: TTFT and TPOT

  • Time to First Token (TTFT): This metric measures the latency from when a user request is submitted until the first output token is generated and returned. It is primarily influenced by the prefill phase efficiency.
  • Time Per Output Token (TPOT): This metric measures the average time taken to generate each subsequent output token after the first one. It is primarily influenced by the decode phase efficiency.

3.1.5. Throughput and Goodput

  • Throughput: This refers to the total number of requests processed or tokens generated per unit of time (e.g., requests/second, tokens/second). Higher throughput means the system can handle more workload.
  • Goodput: This is a more nuanced metric than raw throughput. It measures the throughput of requests that successfully meet predefined Service Level Objectives (SLOs). For example, if a system processes 100 requests/second but only 80 of them meet their latency targets, the goodput would be 80 requests/second. EcoServe aims for high goodput.

3.1.6. Service Level Objectives (SLOs)

Service Level Objectives (SLOs) are target values or ranges for system performance metrics, often defined in service level agreements (SLAs). For LLM serving, common SLOs include target TTFT and TPOT values (e.g., TTFT < 1 second, TPOT < 100 milliseconds). Systems strive to meet these targets to ensure a good user experience.

3.1.7. LLM Batching Techniques

To efficiently utilize GPU resources, which are optimized for parallel processing, multiple requests are often processed together in a batch.

  • Continuous Batching: This is a standard technique where requests can dynamically enter or exit a batch at each iteration. This maximizes GPU utilization by always keeping the batch full, unlike static batching which waits for a fixed number of requests.
  • Separate Batching: Prefill requests are batched and processed separately from decode requests. This is because their computational characteristics are very different (prefill is compute-bound, decode is memory-bound).
  • Hybrid Batching: Prefill and decode requests are combined into a single batch and processed together. This can potentially improve throughput but exacerbates prefill-decode interference.
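
To make the contrast concrete, below is a minimal, hypothetical sketch of a continuous-batching loop; the `run_step` callback and request objects are placeholders, not part of any real serving framework's API:

```python
from collections import deque

def continuous_batching_loop(waiting: deque, max_batch: int, run_step):
    """Toy continuous-batching loop: the running batch is refilled every iteration.

    `run_step(batch)` is assumed to advance each request in `batch` by one token
    and return the subset of requests that finished during this iteration.
    """
    running = []
    while running or waiting:
        # Admit waiting requests whenever there is room, keeping the batch full.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        finished = run_step(running)   # one model iteration over the whole batch
        running = [r for r in running if r not in finished]
```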

3.1.8. Parallelism Strategies

For large LLMs that cannot fit on a single GPU or to accelerate inference, various parallelism strategies are used:

  • Tensor Parallelism (TP): Also known as model parallelism (within a layer). In TP, the model's weights and activations within a single layer (e.g., the large weight matrices in QKV projection or Feed-Forward Networks) are partitioned across multiple GPUs. Each GPU computes a portion of the matrix multiplication, and the results are then aggregated (e.g., via all-reduce operations). TP is generally used to fit very large models into memory or to accelerate individual layer computations. It requires frequent inter-device communication.

    Figure 3(a) from the paper illustrates tensor parallelism.

    Figure 3. Tensor parallelism and pipeline parallelism.

  • Pipeline Parallelism (PP): Also known as layer parallelism. In PP, the layers of an LLM are partitioned across multiple GPUs. Each GPU is responsible for computing a subset of the model's layers. Data flows sequentially through the GPUs, forming a "pipeline." This reduces memory requirements per GPU and can be efficient if the pipeline stages are balanced. However, it can suffer from pipeline bubbles (idle time) if stages are not perfectly balanced or if there are dependencies between iterations (as in autoregressive decoding).

    Figure 3(b) from the paper illustrates pipeline parallelism.

    Figure 3. Tensor parallelism and pipeline parallelism.

    Figure 4 from the paper illustrates pipeline bubbles.

    Figure 4. Pipeline bubbles.

3.1.9. Interconnects

Interconnects refer to the hardware connections that allow different components (e.g., CPUs, GPUs, memory, nodes) within a computer system or cluster to communicate.

  • Commodity Interconnects: These are standard, widely available, and relatively inexpensive networking technologies, such as Ethernet. They typically offer lower bandwidth and higher latency compared to specialized interconnects. EcoServe aims to perform well on these.
  • High-Performance Interconnects: These are specialized, expensive, and high-bandwidth, low-latency networking technologies designed for demanding computational tasks. Examples include:
    • NVLink: A high-speed interconnect developed by NVIDIA for direct GPU-to-GPU communication within a node.
    • InfiniBand: A high-throughput, low-latency communication technology used in high-performance computing (HPC) and data centers, often connecting nodes in a cluster.
    • RoCE (RDMA over Converged Ethernet): Allows Remote Direct Memory Access (RDMA) functionality over standard Ethernet networks, offering lower latency and higher throughput than traditional Ethernet but not always as high-performing as dedicated InfiniBand.

3.2. Previous Works

The paper categorizes previous LLM serving solutions based on prefill-decode disaggregation:

3.2.1. Non-Disaggregated (NoDG) Strategy

  • Description: In NoDG systems (e.g., vLLM [5], Sarathi [9]), a single inference instance handles the entire lifecycle of a request, processing both its prefill and decode phases. When scaling is needed, this instance is replicated.
  • Key Issues:
    • Strong Prefill-Decode Interference: As both phases share resources, processing one often delays the other, impacting TTFT and TPOT SLOs.
    • Low Throughput: The interference makes it hard for the decode phase to accumulate large enough batches to saturate GPUs, leading to underutilization.
    • Inefficient Pipeline Parallelism: Imbalanced workloads and tight dependencies in autoregressive decoding lead to pipeline bubbles, reducing PP efficiency.
  • Mitigation Attempts: Chunked prefill [9] (used by Sarathi) attempts to break long prefills into smaller chunks to reduce interference, but it incurs overhead and its effectiveness varies.

3.2.2. Fully Disaggregated (FuDG) Strategy

  • Description: FuDG systems (e.g., DistServe [50], MoonCake [35]) completely separate the prefill and decode phases into different instances. A request first goes to a prefill instance, generates the KV cache and first token, and then the KV cache is transferred to a decode instance for subsequent token generation.
  • Key Issues:
    • High Interconnect Requirements: Transferring massive KV cache data between instances necessitates high-performance interconnects (e.g., NVLink, InfiniBand), which are very expensive and power-intensive. The paper provides Table 3 to illustrate this requirement.

      The following are the results from Table 3 of the original paper, which relates each model's prefill token throughput to the theoretical KV cache transfer bandwidth it implies:

      Model Device Tokens/s Theoretical Bandwidth
      Llama-30B L20 6584.6 9.796 GB/s
      Llama-30B A800 26189.2 38.96 GB/s
      CodeLlama-34B L20 6838.92 1.25 GB/s
      CodeLlama-34B A800 25978.88 4.76 GB/s
    • Load Imbalance: FuDG makes load balancing between prefill and decode instances challenging due to their asymmetric durations. This can lead to underutilized resources.

    • Memory Imbalance: Decode instances store large KV caches, while prefill instances store less, leading to inefficient memory use across the cluster.

  • Variants: DistServe (intra-node FuDG) relies on intra-node high-speed links (e.g., NVLink), while MoonCake (inter-node FuDG) uses a centralized KV cache pool and InfiniBand for inter-node communication.

3.2.3. KV Cache Optimizations

  • GQA (Grouped Query Attention) [10]: An optimization that reduces KV cache size by sharing key and value projections across multiple query heads, alleviating transmission overhead. Used in CodeLlama2-34B and Qwen2-72B.
  • PagedAttention [24]: A technique to manage KV cache memory more efficiently by organizing it into fixed-size blocks, reducing fragmentation.
  • Other works like H2O [49] and Keyformer [8] explore KV cache compression or redundancy removal.

The paper also briefly mentions other areas of LLM inference optimization:

  • Memory-limited inference: Flexgen [39], FastDecode [19], Specinfer [31] use offloading.
  • Long-context inference: Loongserve [45], Infinitellm [28] optimize for long sequences.
  • Mixture-of-Experts (MoE) models: Moe-lightning [12], Pre-gated MoE [21], Lina [25] optimize resource utilization for MoE architectures. MegaScale-Infer [52] disaggregates attention and FFN modules for ultra-large MoE models.
  • Kernel scheduling: Liger [15], NanoFlow [51] schedule and overlap GPU kernels from different requests.

3.3. Technological Evolution

The field of LLM serving has evolved significantly, driven by the increasing size and computational demands of LLMs.

  1. Early LLM Inference (Single Model, Single Request): Initially, LLMs were served one request at a time on dedicated hardware. This was inefficient for high-throughput scenarios.

  2. Batching for GPU Utilization: The introduction of batching techniques (static, then continuous batching) dramatically improved GPU utilization by processing multiple requests concurrently.

  3. Basic Parallelism: Tensor Parallelism and Pipeline Parallelism emerged to handle models that exceeded single-device memory or to reduce inference latency by distributing computation.

  4. Cluster-Level Serving Strategies: As demand scaled, systems moved to multi-node clusters. This led to the NoDG strategy, replicating instances, and then FuDG to address prefill-decode interference.

  5. Addressing Interconnect Bottlenecks and Cost: The high cost of FuDG due to high-performance interconnects became a new bottleneck. This is where EcoServe's PaDG strategy fits, attempting to achieve FuDG-like benefits on commodity interconnects.

  6. Advanced Optimizations: Ongoing research focuses on KV cache management (PagedAttention, GQA), specialized parallelism for specific architectures (MoE), and dynamic scaling.

    EcoServe's work builds upon the understanding of prefill-decode dynamics and KV cache management from previous studies, but strategically places itself as a middle-ground solution (PaDG) to overcome the cost and interference limitations of NoDG and FuDG on commodity hardware.

3.4. Differentiation Analysis

Compared to the NoDG and FuDG strategies, EcoServe's Partially Disaggregated (PaDG) approach offers distinct innovations:

  • Interference Mitigation vs. Hardware Cost:

    • NoDG: Suffers from severe prefill-decode interference because both phases run on the same instance simultaneously. This harms goodput and SLOs.
    • FuDG: Eliminates interference by physically separating prefill and decode instances. However, this comes at the cost of needing high-performance interconnects (e.g., NVLink, InfiniBand) to transfer the large KV cache between instances, making it very expensive.
    • PaDG (EcoServe): Mitigates interference by temporal disaggregation within a single instance. It processes only one phase at a time (prefill or decode) for an extended duration. This avoids the KV cache transfer overhead of FuDG and thus works efficiently with commodity interconnects, making it highly cost-effective.
  • Latency Management (TTFT):

    • NoDG: Struggles to balance TTFT and TPOT due to interference.
    • FuDG: Can achieve good TTFT due to dedicated prefill instances but might still suffer if KV cache transfer is a bottleneck.
    • PaDG (EcoServe): Temporal disaggregation alone would increase TTFT (as new requests might wait for an instance to switch to prefill). EcoServe addresses this with rolling activation, where multiple instances are coordinated to ensure that at least one instance is always ready for prefill, thereby maintaining low TTFT while still avoiding KV cache transfer.
  • Resource Utilization and Load Balancing:

    • NoDG: Can lead to underutilized GPUs in the decode phase due to insufficient batch sizes caused by interference.
    • FuDG: Faces load imbalance issues when determining the optimal ratio of prefill to decode instances, and memory can be imbalanced (decode instances holding large KV caches while prefill instances have idle memory).
    • PaDG (EcoServe): Improves GPU saturation by processing phases for longer durations, reducing switching overhead. The adaptive scheduling algorithm and mitosis scaling approach provide fine-grained control and dynamic load balancing within macro instances, leading to better resource efficiency.
  • Parallelism Compatibility:

    • NoDG: Pipeline Parallelism is difficult due to pipeline bubbles caused by imbalanced workloads and dependencies between prefill and decode. Tensor Parallelism can be inefficient due to communication overhead on PCIe-only systems.
    • FuDG: PP can be more effective as phases are separated. TP can still face PCIe contention if not using GPU-direct interconnects.
    • PaDG (EcoServe): Minimizes prefill-decode switches, making it highly compatible with pipeline parallelism (less pipeline bubbles). Its minimal data movement and reduced PCIe contention make it more suitable for tensor parallelism on systems with commodity interconnects.
  • Engineering Complexity:

    • NoDG: Relatively low complexity due to single-instance focus.

    • FuDG: High complexity due to distributed management of prefill/decode instances, KV cache transfer mechanisms, and load balancing across two distinct resource types.

    • PaDG (EcoServe): Aims for lower complexity similar to NoDG by keeping phases within a single instance (temporally disaggregated) and simplifying scaling with macro instances and mitosis scaling.

      In essence, EcoServe's PaDG strategy is an innovative middle ground that aims to achieve the benefits of FuDG (reduced interference, better goodput) without incurring its high hardware cost and engineering complexity, making it a more cost-effective solution for LLM serving on prevalent commodity interconnect clusters.

4. Methodology

4.1. Principles

The core idea behind EcoServe is to achieve cost-effective LLM serving by intelligently orchestrating the prefill and decode phases. Its foundational principle is the Partially Disaggregated (PaDG) strategy. This strategy is built on two key proactive scheduling mechanisms:

  1. Temporal Disaggregation (Intra-Instance Scheduling): Within a single inference instance, the prefill and decode phases are separated along the time dimension. This means an instance dedicates itself to one phase for an extended period before switching to the other. This mitigates inter-phase interference and enhances throughput by allowing each phase to run more efficiently without being constantly interrupted or contending for resources. The intuition is that by focusing on one task for longer, the GPU can achieve better saturation for that task.

  2. Rolling Activation (Inter-Instance Scheduling): While temporal disaggregation improves throughput within an instance, it could lead to high Time to First Token (TTFT) if a new request arrives at an instance currently engaged in the decode phase. To counter this, EcoServe proactively coordinates multiple instances within a macro instance (a group of cooperating instances). These instances cyclically activate their prefill phases in a staggered manner, ensuring that at any given point, there is always at least one instance available and ready to immediately process new prefills, thereby maintaining low TTFT and meeting SLOs.

    By combining these two principles, EcoServe aims to optimize the performance trade-off triangle (TTFT, TPOT, throughput) on commodity hardware, achieving high goodput without the expensive interconnects required by FuDG strategies.

4.2. Core Methodology In-depth (Layer by Layer)

EcoServe is structured hierarchically, employing a three-level scheduling architecture to implement its PaDG strategy effectively.

The following figure (Figure 5 from the original paper) shows the EcoServe architecture overview:

Figure 5. EcoServe Architecture Overview.

4.2.1. Hierarchical Architecture Overview

As shown in Figure 5, EcoServe comprises:

  • Overall Scheduler (Figure 5, Overall Scheduler): This is the highest level scheduler. Its responsibilities include dispatching new requests to appropriate macro instances based on their capabilities and managing capacity scaling across different macro instances, such as transferring instance handlers between them.

  • Macro-Instance Scheduler (Figure 5, Macro-Instance Scheduler): This scheduler coordinates multiple instances within a single macro instance. It aggregates execution states from individual instances and dispatches requests to the most suitable instance based on profiling results and SLOs. The macro instance is EcoServe's unique abstraction, representing the smallest unit of scheduling at the cluster level.

  • Instance Scheduler (Figure 5, Instance Scheduler): This is the lowest-level scheduler, responsible for managing execution within a single instance. It coordinates the prefill and decode phases (as per temporal disaggregation), orchestrates multiple devices if parallelism is used, and executes directives received from higher-level schedulers.

    The focus of the paper is primarily on the internal architecture and scheduling within a macro instance.

4.2.2. Partially Disaggregated (PaDG) Strategy

The PaDG strategy is the cornerstone of EcoServe, implemented through proactive intra- and inter-instance scheduling.

4.2.2.1. Temporal Disaggregation (Proactive Intra-Instance Scheduling)

  • Mechanism (Figure 5, Temporal Disaggregation ④): Within each inference instance, EcoServe proactively disaggregates the prefill and decode phases along the time dimension. This means an instance does not interleave prefill and decode tasks in quick succession. Instead, it commits to processing only prefill tasks for a certain period, and then switches to processing only decode tasks for another period.
  • Benefits:
    • Mitigates Prefill-Decode Interference: By dedicating an instance to one phase at a time, resource contention between the distinct computational patterns of prefill (compute-bound) and decode (memory-bound) is significantly reduced.
    • Enhances Throughput: Each phase can run more efficiently, leading to better GPU saturation and higher overall throughput.
    • Eliminates KV Cache Transmission: Unlike FuDG, both phases occur within the same instance (just at different times), so there's no need to transfer the KV cache between instances. This makes EcoServe compatible with commodity interconnects.
  • Challenge: This approach can lead to an unacceptable increase in TTFT if a new request arrives when an instance is dedicated to the decode phase, as it would have to wait for the instance to switch. This challenge is addressed by rolling activation.
  • Saved TPOT: The paper notes that if decode execution is faster than the TPOT SLO, the instance can accumulate "spare time" (saved TPOT). This saved time can be used to absorb interruptions (like waiting for a phase switch) without violating the TPOT SLO.
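
As a worked example of saved TPOT (numbers chosen for illustration only): a request that has already produced $L = 50$ output tokens under $\text{SLO}_{\text{TPOT}} = 100\,\text{ms}$ and has spent 4 s decoding since its first token has $50 \times 0.1\,\text{s} - 4\,\text{s} = 1\,\text{s}$ of slack, i.e., it can tolerate up to one second of prefill interruption without violating its TPOT SLO.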

4.2.2.2. Rolling Activation (Proactive Inter-Instance Scheduling)

  • Mechanism (Figure 5, Rolling Activation ⑤): To ensure low TTFT despite temporal disaggregation, EcoServe employs rolling activation. This involves coordinating multiple instances within a macro instance in a cyclic pattern. At any given moment, different instances are in different stages of their prefill/decode cycle. The goal is to always have some instances specifically activated and ready to process new prefills.
  • Benefits:
    • Rescues TTFT: New requests are routed to instances currently in their prefill phase, allowing for immediate processing and meeting TTFT SLOs.
    • Continuous Prefill Availability: The staggered activation ensures that the system as a whole maintains a continuous capacity for processing new prefills.
    • Higher Overall Throughput: By combining temporal disaggregation and rolling activation, EcoServe achieves both efficient phase execution and low latency.
  • Coordination: Instances continuously update their status (e.g., decode progress, memory usage) to the macro-instance scheduler for coordination.

4.2.3. Adaptive Scheduling Algorithm

The adaptive scheduling algorithm (Figure 5, ⑦) is crucial for operationalizing the PaDG strategy within and across instances. It works in a master-slave manner between the macro-instance scheduler (master) and instance schedulers (slaves).

4.2.3.1. Inter-Instance Scheduling Algorithm (Macro-Instance Scheduler's Perspective)

The macro-instance scheduler receives status updates from all instances and schedules them to achieve rolling activation. When a new request arrives, it attempts to route it to the most suitable instance.

The following is the InterSchedule algorithm from Algorithm 1 of the original paper:

Data: current request: req; instance list: instances
Function InterSchedule(req):
    prev_idx ← instance index the last request was routed to
    last_instance ← instances[prev_idx]
    if CheckConstraints(last_instance, req) then
        route req to instance[prev_idx]
    else
        next_idx ← (prev_idx + 1) % len(instances)
        route req to instance[next_idx]
  • Function InterSchedule(req): This function takes a new request (req) as input and determines which instance it should be routed to.
  • req: Represents the current incoming LLM inference request.
  • instances: A list or array containing all available inference instances within the current macro instance.
  • prev_idx: Stores the index of the instance that the previous request was routed to. This implies a strategy that first attempts to route to the last used instance for locality or continuity.
  • last_instance: The actual instance object corresponding to prev_idx.
  • CheckConstraints(last_instance, req): This is a call to the Constraint Checking Algorithm (described below). It verifies if routing the current request (req) to last_instance would violate any SLOs or memory constraints.
  • if CheckConstraints (last_instance,req) then route req to instance[prev_idx];: If the last_instance can satisfy all constraints for the new request, the request is routed to this instance.
  • else next_idx ← (prev_idx + 1)%len(instances) ; route req to instance[next_idx];: If the last_instance cannot satisfy the constraints, the algorithm moves to the next_idx (the next instance in a cyclic pattern, using the modulo operator % to wrap around the list of instances). The request is then routed to this next_instance. This ensures rolling activation by always trying to find a capable instance.
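
A minimal Python rendering of this routing logic might look as follows; the `state` dictionary, `enqueue` method, and `check_constraints` callback (standing in for Algorithm 2) are illustrative assumptions:

```python
def inter_schedule(req, instances, state, check_constraints):
    """Route `req` to the instance used by the previous request if it still
    satisfies the constraints; otherwise advance cyclically to the next one."""
    prev_idx = state["prev_idx"]                    # last request's routed instance
    if check_constraints(instances[prev_idx], req):
        target = prev_idx
    else:
        target = (prev_idx + 1) % len(instances)    # rolling activation step
    instances[target].enqueue(req)                  # hypothetical enqueue API
    state["prev_idx"] = target
    return target
```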

4.2.3.2. Constraint Checking Algorithm (Instance Scheduler's Perspective)

The Constraint Checking Algorithm is invoked by the Inter-Instance Scheduling Algorithm to determine if an instance can accept a new request without violating SLOs or memory capacity.

The following is the CheckConstraints algorithm from Algorithm 2 of the original paper:

Data: system constraints: SLO_TTFT, SLO_TPOT
Function CheckConstraints(instance, req):
    // Constraint 1: TTFT
    t_switch ← phase switching timestamp
    pending_prefills ← {r ∈ instance.reqs | r.arrival_time ≥ t_switch} ∪ {req}
    prefill_times ← predict pending_prefills durations
    t_total ← Σ prefill_times
    if t_total > SLO_TTFT then
        return NotSatisfied
    // Constraint 2: TPOT
    existed_decodes ← {r ∈ instance.reqs | r.arrival_time < t_switch}
    saved_tpots ← [ ]
    current_time ← current timestamp
    foreach r ∈ existed_decodes do
        L ← r.output_length
        saved_tpot ← L × SLO_TPOT − (current_time − r.first_token_time)
        saved_tpots.append(saved_tpot)
    mean_saved_tpot ← mean(saved_tpots)
    if mean_saved_tpot < t_total then
        return NotSatisfied
    // Constraint 3: KV cache capacity
    if req_kvcache_size > remain_memsize then
        return NotSatisfied
    return Satisfied
  • Function CheckConstraints(instance, req): This function checks if a given instance can accept a new request (req) while adhering to system SLOs and resource limits.

  • SLO_TTFT: The system's target Time to First Token (TTFT) Service Level Objective.

  • SLO_TPOT: The system's target Time Per Output Token (TPOT) Service Level Objective.

  • Constraint 1: TTFT: This block checks if admitting req will violate the TTFT SLO.

    • t_switch: Represents the timestamp when the instance is expected to phase switch from decode to prefill (or when it last switched and is currently in prefill). This is a crucial concept for temporal disaggregation.
    • pending_prefills: A set of all requests that are either already waiting to be prefilled in this instance (arrived after t_switch) OR the new request (req) itself. This represents the batch of prefills that will be processed consecutively.
    • prefill_times: A prediction of the duration required to process each request in the pending_prefills set. This prediction can be obtained from profiling data (e.g., prefill duration for various input lengths).
    • t_total ← Σ prefill_times: The sum of the predicted durations for all pending_prefills. This represents the total time the instance will spend in the prefill phase for this batch.
    • if t_total > SLO_TTFT then return NotSatisfied: If the estimated total prefill time exceeds SLO_TTFT, the constraint is violated and the request cannot be admitted. This ensures that, even with temporal disaggregation, the first-token latency remains within bounds.
  • Constraint 2: TPOT: This block checks if admitting req will violate the TPOT SLO for currently ongoing decode requests.

    • existed_decodes: A set of requests that are currently undergoing decode in this instance and arrived before t_switch. These are the decode tasks that will be paused or whose execution might be affected by the upcoming prefill phase.
    • saved_tpots: An empty list to store the saved TPOT for each existed_decode.
    • current_time: The current timestamp.
    • foreach r in existed_decodes do: Loop through each decode request already in progress.
      • L: The output_length of the current decode request r.
      • saved_tpot: The saved TPOT for request r, i.e., the maximum allowed time for r's decode (L × SLO_TPOT) minus the time already spent generating tokens (current_time - r.first_token_time). A positive value means there is slack.
      • saved_tpots.append(saved_tpot): Add the calculated saved TPOT to the list.
    • mean_saved_tpot: The average saved TPOT across all existed_decodes. This represents the collective slack available from ongoing decodes.
    • if mean_saved_tpot < t_total then return NotSatisfied;: If the average saved TPOT is less than the total prefill time (t_total) for the new batch, it means processing the new prefill batch would likely cause the ongoing decodes to violate their TPOT SLO. In this case, the request is not admitted. This ensures that optimizing TTFT does not compromise TPOT for existing requests.
  • Constraint 3: KV Cache capacity: This block checks if admitting req will exceed the instance's available GPU memory for KV cache.

    • req_kvcache_size: The estimated KV cache size required for the new request (req).
    • remain_memsize: The remaining available memory in the instance (specifically for KV cache).
    • if req_kvcache_size > remain_memsize then return NotSatisfied;: If the KV cache for the new request exceeds the available memory, the request cannot be admitted, preventing out-of-memory (OOM) errors.
  • return Satisfied: If all three constraints are met, the instance can admit the request.
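
The following Python sketch mirrors the three checks; the request/instance attributes and the `predict_prefill_time` profiling callback are assumptions standing in for the paper's profiling data:

```python
import time
from statistics import mean

def check_constraints(instance, req, slo_ttft, slo_tpot, predict_prefill_time):
    """Return True only if admitting `req` keeps TTFT, TPOT, and memory within bounds."""
    t_switch = instance.phase_switch_timestamp

    # Constraint 1: the pending prefill batch (including req) must fit the TTFT SLO.
    pending = [r for r in instance.reqs if r.arrival_time >= t_switch] + [req]
    t_total = sum(predict_prefill_time(r.prompt_len) for r in pending)
    if t_total > slo_ttft:
        return False

    # Constraint 2: ongoing decodes need enough "saved TPOT" slack to absorb
    # the upcoming prefill burst without violating their TPOT SLO.
    existed_decodes = [r for r in instance.reqs if r.arrival_time < t_switch]
    now = time.time()
    if existed_decodes:
        saved = [r.output_length * slo_tpot - (now - r.first_token_time)
                 for r in existed_decodes]
        if mean(saved) < t_total:
            return False

    # Constraint 3: the new request's KV cache must fit in the remaining memory.
    if req.kv_cache_size > instance.remaining_mem:
        return False
    return True
```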

4.2.4. Mitosis Scaling Approach

The mitosis scaling approach (Figure 5, ⑧) provides elastic and fine-grained capacity scaling for EcoServe, adapting to fluctuating LLM inference workloads over time.

4.2.4.1. Expansion and Contraction (Figure 7)

This approach dynamically adjusts the number of instances within macro instances.

  • Hyperparameters:

    • N_l: Lower bound on the number of instances in a macro instance.
    • N_u: Upper bound on the number of instances in a macro instance.
  • Scaling Triggers: Scaling can be triggered when the system fails to meet SLOs (demanding more capacity) or when there is sustained resource underutilization (indicating a need to contract capacity).

    The following figure (Figure 7 from the original paper) illustrates the expansion and contraction processes:

Figure 7. The illustration of the expansion and contraction processes. Here N_l = 3 and N_u = 6.

  • Expansion Process (Steps 1-4 in Figure 7):
    1. New instances are incrementally added to an existing macro instance as demand increases.
    2. If the number of instances in a macro instance exceeds N_u (e.g., 6 in the figure), a new macro instance is split off from the original. This new macro instance starts with N_l instances (e.g., 3 instances are moved from the original to form a new macro instance).
    3. If additional instances are still required, they are first added back to the original macro instance until it again reaches N_u.
    4. Subsequent instance additions then go to the new macro instance.
  • Contraction Process (Steps 5-8 in Figure 7):
    1. When capacity is excessive, instances are first removed from the smallest macro instance until its instance count reaches N_l.
    2. Then, instances start to be removed from a full macro instance.
    3. If the total number of instances across two macro instances reaches N_u, they will be merged into a single macro instance after one additional instance is removed. This consolidation improves resource packing.
  • Post-Scaling: After any expansion or contraction, each macro instance continues scheduling requests independently using the adaptive scheduling algorithm, requiring no additional specialized logic. This means the system typically maintains several full macro instances and one or two partially filled ones.
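
A hedged sketch of the expansion side of this policy, treating each macro instance as a simple list of instances and using N_l and N_u as the only knobs (one possible interpretation of the description above, not the paper's exact logic):

```python
def add_instance(macro_instances, inst, n_l, n_u):
    """Expansion sketch: grow an existing macro instance if any has room;
    otherwise split a full one by moving n_l of its instances into a new
    macro instance, then place the new instance back in the original."""
    for macro in macro_instances:
        if len(macro) < n_u:          # fill partially filled macro instances first
            macro.append(inst)
            return macro_instances
    donor = macro_instances[0]        # every macro instance is full: split one
    macro_instances.append([donor.pop() for _ in range(n_l)])
    donor.append(inst)                # refill the original toward n_u first
    return macro_instances
```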

4.2.4.2. Flexible Instance Migration

To enable dynamic splitting or merging of macro instances without interrupting or reinitializing individual instances, EcoServe uses a serializable proxy object.

  • InstanceHandler Metadata: At the core is the InstanceHandler metadata, which encapsulates all necessary information about an instance: its actor ID, worker address, function calls, and other attributes.
  • Serialization and Transfer: When an instance needs to be logically migrated between macro-instance schedulers (which might be different processes), its InstanceHandler is serialized (e.g., using Python's pickle library). This serialized data is then sent to the target macro-instance scheduler, coordinated by the overall scheduler.
  • Deserialization and Reconstruction: The receiving process deserializes the InstanceHandler, reconstructing a fully functional proxy object. This proxy can then issue function calls to the original instance via an RPC-like system.
  • Benefits: This design allows logical migration without interrupting the instance's execution or requiring a costly re-initialization (which can take minutes for large LLMs), thereby supporting highly flexible and low-overhead scaling.
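
As an illustration of the serializable-proxy idea, a pickle round-trip between two scheduler processes might look like the sketch below; the InstanceHandler fields and the `call` placeholder are assumptions, since the paper only states that the handler carries the actor ID, worker address, and callable attributes:

```python
import pickle
from dataclasses import dataclass, field

@dataclass
class InstanceHandler:
    """Lightweight, serializable proxy for a running inference instance."""
    actor_id: str
    worker_address: str                       # e.g. "tcp://10.0.0.5:5555"
    attributes: dict = field(default_factory=dict)

    def call(self, method: str, *args):
        # In EcoServe this would issue an RPC-like call to the live instance
        # (e.g., via Ray); here it is only a placeholder print.
        print(f"RPC {method}{args} -> {self.worker_address}")

# The source macro-instance scheduler serializes the handler...
payload = pickle.dumps(InstanceHandler("inst-07", "tcp://10.0.0.5:5555"))

# ...and the target scheduler reconstructs a fully functional proxy, so the
# underlying instance keeps running and never needs re-initialization.
migrated = pickle.loads(payload)
migrated.call("get_status")
```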

4.3. LLM Computational Characteristics

The paper provides additional background on LLM computations, particularly how arithmetic intensity (AI) differs between prefill and decode phases.

The following are the results from Table 1 of the original paper:

Variable Description Notation
prompt_len The length of prompt S
generation_len The length of generated tokens G
batch_size The number of batched requests B
layer_num The number of model layers L
hidden_size Input dimension of the hidden layer H
heads The number of attention heads M
size_per_head The hidden state per head D

The following are the results from Table 2 of the original paper:

Operation P/D FLOPs Memory Access Approximate AI
QKV Projection Prefill 6BSH^2 6BSH + 3H^2 BS
QKV Projection Decode 6BH^2 6BH + 3H^2 B
Attention Prefill 2BS^2H 2BSH + BS^2M S
Attention Decode 2BSH 2BSM + BH(S + 1) 1
Output Projection Prefill 2BSH^2 2BSH + H^2 BS
Output Projection Decode 2BH^2 2BH + H^2 B
Dim Expansion Prefill 8BSH^2 2BSH + 4H^2 BS
Dim Expansion Decode 8BH^2 2BH + 4H^2 B
Dim Reduction Prefill 8BSH^2 2BSH + 4H^2 BS
Dim Reduction Decode 8BH^2 2BH + 4H^2 B
  • Arithmetic Intensity (AI): This metric is computed by dividing the total number of floating point operations (FLOPs) by the total amount of memory access. It indicates how much computation is performed per unit of data transferred from memory. A higher AI typically means the operation is compute-bound, while a lower AI means it is memory-bound.

  • Prefill Phase AI: As shown in Table 2, the prefill phase has an Arithmetic Intensity that depends on both sequence length (S) and batch size (B). Since S can be large, the prefill phase exhibits significantly higher AI, making it compute-bound.

  • Decode Phase AI: The decode phase AI primarily depends on batch size (B) and is much lower. It also requires loading the KV cache, which increases memory access. Consequently, the decode phase is typically memory-bound.

    This fundamental difference in arithmetic intensity and computational bottlenecks between the two phases is a key motivation for EcoServe's temporal disaggregation, as it allows GPUs to be optimally utilized for either compute-bound or memory-bound tasks without interference.
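
Using the QKV Projection row of Table 2, a quick calculation shows why the two phases land on opposite sides of the compute/memory boundary (the dimensions below are illustrative):

```python
def qkv_arithmetic_intensity(B, S, H, prefill: bool) -> float:
    """Approximate AI (FLOPs per unit of memory access) of the QKV projection,
    following the Table 2 expressions."""
    if prefill:
        flops, mem = 6 * B * S * H**2, 6 * B * S * H + 3 * H**2
    else:  # decode processes a single new token per request
        flops, mem = 6 * B * H**2, 6 * B * H + 3 * H**2
    return flops / mem

H = 6656  # hypothetical hidden size for a 30B-scale model
print(qkv_arithmetic_intensity(B=8, S=1024, H=H, prefill=True))   # on the order of B*S
print(qkv_arithmetic_intensity(B=8, S=1024, H=H, prefill=False))  # on the order of B
```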

4.3.1. QKV Projection (Equation 1)

This step projects input tokens into Query (Q), Key (K), and Value (V) embeddings. $\mathbf{Q} = W_q \mathbf{X}, \quad \mathbf{K} = W_k \mathbf{X}, \quad \mathbf{V} = W_v \mathbf{X}.$ Where:

  • $\mathbf{Q}$, $\mathbf{K}$, $\mathbf{V}$: Represent the Query, Key, and Value matrices, respectively.
  • $W_q$, $W_k$, $W_v$: Represent the weight matrices for projecting the input into Q, K, and V. These are learnable parameters of the model.
  • $\mathbf{X}$: Represents the input token embedding matrix.

4.3.2. Attention (Equation 2)

This is the core self-attention mechanism. It calculates how each token attends to other tokens in the sequence. $\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{T}}{\sqrt{d_k}}\right)\mathbf{V}$ Where:

  • $\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V})$: The output of the attention mechanism.
  • $\mathbf{Q}$: The Query matrix.
  • $\mathbf{K}$: The Key matrix.
  • $\mathbf{K}^T$: The transpose of the Key matrix.
  • $d_k$: The dimension of the Key vectors. Dividing by $\sqrt{d_k}$ is a scaling factor that prevents large dot-product values from pushing the softmax function into regions with tiny gradients.
  • $\mathrm{softmax}(\cdot)$: The softmax function, which normalizes scores into a probability distribution.
  • $\mathbf{V}$: The Value matrix. The softmax output (attention weights) is applied to the Value vectors to obtain the contextualized representation.
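
For reference, a minimal NumPy implementation of this scaled dot-product attention (single head, no masking, purely illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V  (single head, no mask)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq_q, seq_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (seq_q, d_v)

Q, K, V = (np.random.randn(4, 64) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)          # shape (4, 64)
```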

4.3.3. Feed-Forward Network (FFN) / Output Projection (Equation 3)

The Output Projection (or Feed-Forward Network) further transforms each token's representation using a position-wise, two-layer neural network with a non-linear activation. $\mathrm{FFN}(x) = \mathrm{Act}(xW_1 + b_1)W_2 + b_2$ Where:

  • $\mathrm{FFN}(x)$: The output of the Feed-Forward Network for input $x$.
  • $x$: The input to the FFN (typically the output of the attention mechanism).
  • $W_1$, $W_2$: Weight matrices of the two linear transformations.
  • $b_1$, $b_2$: Bias vectors of the two linear transformations.
  • $\mathrm{Act}(\cdot)$: A non-linear activation function (e.g., ReLU, GeLU).

4.4. Runtime and Frontend Timing

The paper also clarifies how SLOs are measured and perceived, distinguishing between runtime metrics and user experience.

The following figure (Figure 6 from the original paper) illustrates the timing between runtime and frontend:

Figure 6. Runtime and frontend Timing.

  • Runtime Metrics: Classically, Time to First Token (TTFT) and Time Per Output Token (TPOT) are used.
  • Phase-Switching Waiting Time: The paper highlights that for all strategies (NoDG, PaDG, FuDG), there's an implicit waiting time before a request enters its decode phase. For NoDG/PaDG, this is due to other prefills; for FuDG, it's KV cache transmission. This is crucial as it's often misrepresented.
  • EcoServe's TTFT Definition: To maintain consistency with prior work but also reflect reality, EcoServe's reported TTFT actually includes both the true TTFT and this phase-switching waiting time. This makes it a stricter SLO.
  • EcoServe's TPOT Measurement: TPOT measurement begins after the phase-switching delay, focusing purely on the token generation rate.

5. Experimental Setup

EcoServe is built upon vLLM [5] as the single-device runtime, leveraging Ray for multi-device orchestration within an instance (RPC-like control), and ZeroMQ for inter-instance synchronization at the macro-instance scheduler level.

5.1. Datasets

The evaluation uses three application datasets with diverse input and output length distributions, following prior research by truncating inputs to a maximum length of 4096 tokens.

The following are the results from Table 4 of the original paper:

DataSet InAvg InMed OutAvg OutMed SLO_TTFT SLO_TPOT
Alpaca-gpt4 20.63 17.00 163.80 119.00 1s 100ms
ShareGPT 343.76 148.00 237.20 152 5s 100ms
LongBench 2686.89 2736.50 101.78 19 15s 100ms
  • Alpaca-gpt4:
    • Description: Used for human instruction applications.
    • Characteristics: Characterized by very short input sequences (average InAvg = 20.63, median InMed = 17.00) and relatively long outputs (average OutAvg = 163.80, median OutMed = 119.00). The average output length is approximately 10 times the input length.
    • SLOs: TTFT = 1s, TPOT = 100ms.
  • ShareGPT:
    • Description: Represents chatbot applications.
    • Characteristics: Features relatively balanced input and output lengths (e.g., InAvg = 343.76, OutAvg = 237.20).
    • SLOs: TTFT = 5s, TPOT = 100ms.
  • LongBench:
    • Description: Used for summarization applications, where the goal is to generate a concise summary from a long article.

    • Characteristics: Characterized by long input sequences (average InAvg = 2686.89, median InMed = 2736.50) and short outputs (average OutAvg = 101.78, median OutMed = 19).

    • SLOs: TTFT = 15s, TPOT = 100ms.

      These diverse datasets allow for a comprehensive evaluation of EcoServe's performance across different LLM workload characteristics. The SLOs are set based on application needs, independent of model size, and are often stricter than those in prior works.

5.2. Evaluation Metrics

The evaluation focuses on goodput under different SLO attainment levels.

5.2.1. Goodput

  • Conceptual Definition: Goodput measures the effective throughput of a system, specifically counting only those requests that successfully meet their defined Service Level Objectives (SLOs). It reflects both the raw processing capacity and the ability to deliver quality of service.
  • Mathematical Formula: The paper defines goodput as throughput under different levels of SLO attainment. While an explicit formula for goodput is not provided in the paper, it is generally understood as: $ \text{Goodput} = \frac{\text{Number of requests meeting SLOs}}{\text{Total time}} $
  • Symbol Explanation:
    • Number of requests meeting SLOs: The count of completed requests where both TTFT and TPOT (or other relevant metrics) fall within their specified SLO limits.
    • Total time: The duration over which the goodput is measured.

5.2.2. SLO Attainment (P50, P90, P99)

  • Conceptual Definition: SLO attainment refers to the percentage of requests that successfully meet their Service Level Objectives (e.g., TTFT < 1s, TPOT < 100ms). The P50, P90, and P99 percentiles are commonly used to assess the distribution of latency.
    • P50 (50th percentile): 50% of requests meet or exceed this performance level. It represents the median performance.
    • P90 (90th percentile): 90% of requests meet or exceed this performance level. It reflects the performance for the majority of users.
    • P99 (99th percentile): 99% of requests meet or exceed this performance level. This is a very stringent metric, focusing on the tail latency and user experience for almost all users.
  • Mathematical Formula: No explicit formula is given for SLO attainment itself, but it's typically calculated as: $ \text{SLO Attainment} = \frac{\text{Number of requests satisfying SLO}}{\text{Total number of requests}} \times 100\% $
  • Symbol Explanation:
    • Number of requests satisfying SLO: The count of requests whose TTFT and TPOT values are within the defined SLO limits.
    • Total number of requests: The total number of requests processed.

5.2.3. Time to First Token (TTFT)

  • Conceptual Definition: TTFT measures the latency from the moment a user's request is submitted to the system until the very first token of the LLM's response is generated and available. It is a critical metric for user-perceived responsiveness. In this paper, the reported TTFT implicitly includes the phase-switching waiting time (the time a new request might wait for an instance to switch to its prefill phase), making it a stricter SLO.
  • Mathematical Formula: No explicit formula is provided, but conceptually it is: $ \text{TTFT} = \text{Time}_{\text{first\_token\_generated}} - \text{Time}_{\text{request\_submission}} $
  • Symbol Explanation:
    • $\text{Time}_{\text{first\_token\_generated}}$: The timestamp when the first output token for a request is computed.
    • $\text{Time}_{\text{request\_submission}}$: The timestamp when the request was initially received by the system.

5.2.4. Time Per Output Token (TPOT)

  • Conceptual Definition: TPOT measures the average time taken by the LLM to generate each subsequent token after the first token has been produced. It reflects the efficiency of the decode phase. In this paper, the measurement of TPOT begins after the phase-switching delay, ensuring it focuses on the actual token generation rate without including initial waiting times.
  • Mathematical Formula: No explicit formula is provided, but conceptually it is: $ \text{TPOT} = \frac{\text{Time}_{\text{last\_token\_generated}} - \text{Time}_{\text{first\_token\_generated}}}{\text{Number of output tokens} - 1} $ (a small code sketch computing both TTFT and TPOT from timestamps follows this list).
  • Symbol Explanation:
    • $\text{Time}_{\text{last\_token\_generated}}$: The timestamp when the final output token for a request is computed.
    • $\text{Time}_{\text{first\_token\_generated}}$: The timestamp when the first output token for a request is computed.
    • Number of output tokens: The total count of tokens generated for a request (excluding the prompt tokens).
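
As a concrete companion to the TTFT and TPOT formulas above, the sketch below derives both values from a request's submission time and per-token timestamps; the trace values are hypothetical, not output from the paper's system.

```python
def compute_ttft_tpot(submission_time, token_times):
    """Compute TTFT and TPOT from a request's submission time and the
    timestamps at which each output token was produced (in order)."""
    if not token_times:
        raise ValueError("request produced no output tokens")
    ttft = token_times[0] - submission_time
    if len(token_times) == 1:
        return ttft, 0.0  # no inter-token gaps to average
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot

# Hypothetical trace: request submitted at t=0.0s, four tokens produced.
ttft, tpot = compute_ttft_tpot(0.0, [0.9, 1.0, 1.1, 1.2])
print(ttft, tpot)  # 0.9s TTFT, 0.1s (100 ms) TPOT
```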

5.3. Baselines

EcoServe is compared against four representative LLM serving systems spanning NoDG and FuDG strategies. All baselines are built on vLLM [5] as the underlying runtime, ensuring a fair comparison.

  1. vLLM [5]:
    • Strategy: Non-Disaggregated (NoDG).
    • Techniques: Uses separate batching (prefill and decode batches are processed separately) and prefill-priority scheduling (new prefills are prioritized over ongoing decodes).
    • Role: Represents a standard, high-performance NoDG system.
  2. Sarathi [9]:
    • Strategy: Non-Disaggregated (NoDG).
    • Techniques: Employs hybrid batching (combines prefill and decode requests into one batch), decode-priority scheduling (prioritizes ongoing decodes), and the chunked prefill technique (breaks long prefills into smaller chunks to reduce interference).
    • Role: Represents an NoDG system that explicitly tries to mitigate prefill-decode interference through advanced scheduling and batching.
  3. DistServe [50]:
    • Strategy: Intra-node Fully Disaggregated (FuDG).
    • Techniques: Prefill and decode instances are colocated within a single node. KV cache is transferred between them over intra-node high-speed links (e.g., NVLink, if available). The paper notes that its strategy of distributing instances across nodes with pipeline parallelism was not compatible with the SLOs in their setting.
    • Role: Represents a FuDG system that leverages fast intra-node communication to reduce KV cache transfer overhead.
  4. MoonCake [35]:
    • Strategy: Inter-node Fully Disaggregated (FuDG).
    • Techniques: Allows prefill and decode instances to be assigned to different nodes. It introduces a centralized KV cache pool that acts as a buffer for KV cache transmission, typically relying on InfiniBand for inter-node connectivity. Even if prefill and decode instances are on the same node, KV cache passes through this pool. To address load imbalance, the optimal P/D (prefill/decode) ratio is selected.
    • Role: Represents a FuDG system designed for large-scale, multi-node deployments with specialized high-performance networking.

5.4. Cluster Testbed

Experiments were conducted on two different cluster setups to evaluate performance across varying hardware capabilities.

  1. Primary Testbed (L20 Cluster):

    • Configuration: A production-level cluster with 8 nodes, totaling 64 GPUs (8 NVIDIA L20-48GB GPUs per node).
    • Interconnects: GPUs within a node are connected via PCIe only. Nodes are interconnected via standard 10Gbps Ethernet (commodity interconnects).
    • Significance: Represents a typical, cost-effective infrastructure setting in modern data centers, where EcoServe aims to excel.
  2. Second Testbed (A800 Cluster):

    • Configuration: Consists of 2 nodes, totaling 16 GPUs (8 NVIDIA A800-80GB GPUs per node).
    • Interconnects: GPUs within a node are connected via PCIe only. Nodes are interconnected via 25Gbps RoCE (a higher-bandwidth Ethernet-based interconnect than the 10Gbps in the L20 cluster).
    • Significance: Allows evaluation under higher bandwidth interconnects while still being Ethernet-based, providing a comparison point for FuDG systems that theoretically benefit from more bandwidth.

5.5. Model Setup

Three representative LLM models were used, chosen for their varying sizes and attention mechanisms. All experiments use BF16 (bfloat16) precision.

  • Llama-30B [43]:
    • Attention Mechanism: Standard multi-head attention (MHA). This mechanism results in a larger KV cache size compared to GQA.
    • TP Configuration:
      • L20 Cluster (32 GPUs, 8 nodes): TP=4 (model partitioned across 4 GPUs).
      • A800 Cluster (16 GPUs): TP=2 (model partitioned across 2 GPUs).
  • CodeLlama2-34B [36]:
    • Attention Mechanism: Employs grouped-query attention (GQA) [10]. GQA significantly compresses KV cache size, reducing memory bandwidth and transmission overhead.
    • TP Configuration:
      • L20 Cluster (32 GPUs, 8 nodes): TP=4.
      • A800 Cluster (16 GPUs): TP=2.
  • Qwen2-72B [46]:
    • Attention Mechanism: Also uses grouped-query attention (GQA). Despite being a larger model (72B vs. 34B), its GQA still makes its KV cache more compact relative to MHA models of similar scale.
    • TP Configuration:
      • L20 Cluster (32 GPUs, 8 nodes): TP=8.
      • A800 Cluster (16 GPUs): TP=4.

        Workloads are generated by pairing each model with each dataset. Request arrivals are simulated using a Poisson distribution at a fixed rate to introduce realistic fluctuations.
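
The Poisson arrival process can be simulated by drawing exponentially distributed inter-arrival gaps. The sketch below is a generic illustration of that technique under an assumed rate, not the paper's actual workload generator.

```python
import random

def poisson_arrival_times(rate_rps, duration_s, seed=0):
    """Generate request arrival timestamps for a Poisson process:
    inter-arrival gaps are exponentially distributed with mean 1/rate."""
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(rate_rps)
        if t >= duration_s:
            return arrivals
        arrivals.append(t)

# e.g. an assumed 20 requests/second over a 2-minute trace segment
arrivals = poisson_arrival_times(rate_rps=20, duration_s=120)
print(len(arrivals))  # roughly 2400 requests on average
```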

6. Results & Analysis

The evaluation primarily compares EcoServe's goodput against baseline systems under various SLO attainment levels, across different models, clusters, and applications. The goodput is measured by incrementally increasing the request rate until the system fails to meet the specified SLO attainment (P50, P90, P99).

6.1. End-to-end Performance Evaluation

The following are the results from Figure 8 of the original paper:


Figure 8. Overall goodput performance comparison.

Figure 8 presents a comprehensive comparison of EcoServe against vLLM, Sarathi, DistServe, and MoonCake across various scenarios.

6.1.1. Overall Comparison with Baselines

  • EcoServe's Dominance: EcoServe generally outperforms all baselines in most cases, especially under stricter SLOs.
  • Vs. NoDG Systems (vLLM, Sarathi): EcoServe achieves an average P90 goodput improvement of 83.76% over vLLM and 71.97% over Sarathi. This is attributed to EcoServe's PaDG strategy, which, by mitigating prefill-decode interference through temporal disaggregation and rolling activation, creates more headroom to balance TTFT and TPOT via cross-instance cooperation.
    • Exception (Alpaca dataset): For the Alpaca dataset (short inputs, long outputs), NoDG systems can sometimes achieve comparable or slightly better performance. This is because Alpaca's short inputs lead to less prefill-decode interference, and its SLOs might be loose enough that the additional trade-off space provided by PaDG becomes less critical.
  • Vs. FuDG Systems (DistServe, MoonCake): While FuDG systems can sometimes match NoDG for models with reduced KV cache (e.g., GQA models) and long outputs, they fall significantly behind EcoServe. EcoServe achieves an average P90 goodput improvement of 192.41% over DistServe and 218.22% over MoonCake. This highlights the severe bottleneck FuDG systems face due to KV cache transmission over commodity interconnects, which EcoServe avoids.

6.1.2. Comparison Across SLO Attainment Levels

  • Throughput Decline with Stricter SLOs: All systems experience a decrease in throughput as the SLO attainment level increases from P50 to P99 (meaning stricter latency requirements).
  • EcoServe's Tolerance to Tight SLOs: EcoServe demonstrates significantly higher tolerance to tighter SLOs:
    • P50 SLO Attainment: EcoServe shows improvements of 36.49%, 19.82%, 180.73%, and 194.62% over baselines.
    • P90 SLO Attainment: These improvements substantially increase to 83.76%, 71.97%, 192.41%, and 218.22%.
    • P99 SLO Attainment: The gap further widens, with some baseline systems being unable to meet P99 SLO attainment.
  • Validation: This trend validates that PaDG, through its inter-instance cooperation, provides a much larger performance envelope for balancing TTFT and TPOT under stringent latency constraints.

6.1.3. Comparison Across Models

EcoServe shows consistent performance gains across different model architectures:

  • Vs. NoDG Systems (P90 SLO):
    • Llama-30B: 65.00% improvement.
    • CodeLlama2-34B: 83.30% improvement.
    • Qwen2-72B: 85.30% improvement.
  • Vs. FuDG Systems (P90 SLO): The advantage varies significantly:
    • Llama-30B: 507.67% improvement. This massive gain is because Llama-30B uses MHA (standard multi-head attention), resulting in a much larger KV cache that severely degrades FuDG systems due to KV cache transmission overhead.
    • CodeLlama2-34B: 125.45% improvement.
    • Qwen2-72B: 83.61% improvement. CodeLlama2-34B and Qwen2-72B use GQA (Grouped Query Attention) which significantly reduces KV cache size, thereby alleviating KV cache transmission overhead for FuDG systems. Qwen2-72B, despite being larger, has a relatively smaller KV cache compared to its computational cost, which benefits FuDG more than Llama-30B.

6.1.4. Comparison Across Clusters

  • A800 Cluster (P90 SLO): EcoServe achieves an average throughput improvement of 71.41% over NoDG systems and 285.78% over FuDG systems.
  • L20 Cluster (P90 SLO): EcoServe achieves an average throughput improvement of 84.33% over NoDG systems and 124.86% over FuDG systems.
  • A800 vs. L20 with FuDG: While the A800 cluster has a higher bandwidth interconnect (25Gbps RoCE vs. 10Gbps Ethernet), it appears less favorable for FuDG systems in terms of relative improvement. This counter-intuitive result is explained by the paper: while bandwidth increases by 2.5x, the processing capability (especially for A800 GPUs) improves by over 4x. This means the inter-node network becomes an even more significant bottleneck for FuDG on A800s because the GPUs can generate KV cache data much faster than the network can transfer it.
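
A back-of-envelope comparison helps illustrate this point. The sketch below contrasts KV cache transfer time with prefill compute time on the two clusters; the KV cache size and prefill times are assumed values for illustration only, not measurements from the paper.

```python
def kv_transfer_time_s(kv_cache_gb, link_gbps):
    """Time to move a request's KV cache over the network (GB -> Gb)."""
    return kv_cache_gb * 8 / link_gbps

# Hypothetical request whose prefill produces a 2 GB KV cache.
kv_gb = 2.0
for name, link_gbps, prefill_s in [
    ("L20 + 10Gbps Ethernet", 10, 2.0),   # assumed prefill time on L20
    ("A800 + 25Gbps RoCE",    25, 0.5),   # assumed ~4x faster prefill on A800
]:
    transfer = kv_transfer_time_s(kv_gb, link_gbps)
    print(f"{name}: transfer {transfer:.2f}s vs prefill {prefill_s:.2f}s "
          f"(ratio {transfer / prefill_s:.1f}x)")
```

Under these assumed numbers, the absolute transfer time shrinks on the A800 cluster, but it grows relative to the prefill compute time, which is exactly the bottleneck shift described above for FuDG systems.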

6.1.5. Comparison Across Applications

  • Vs. NoDG Systems (P90 SLO):
    • Alpaca: 10.44% improvement.
    • ShareGPT: 20.60% improvement.
    • LongBench: 202.57% improvement.
    • Analysis: Shorter input lengths (Alpaca) reduce prefill-decode interference and chunked prefill overhead, making NoDG relatively better. EcoServe shines with LongBench (long inputs, short outputs) where prefill is dominant and interference is severe for NoDG.
  • Vs. FuDG Systems (P90 SLO):
    • Alpaca: 74.80% improvement.
    • ShareGPT: 363.10% improvement.
    • LongBench: 164.42% improvement. (Excluding Llama-30B due to execution failures, which would make this even higher).
    • Analysis: Datasets with longer inputs and shorter outputs (LongBench) demand more prefill instances to generate KV cache. This significantly increases network transmission pressure for FuDG systems, leading to worse performance compared to EcoServe, which avoids this transmission.

6.2. Scaling Capability

6.2.1. Static Coarse-grained Scaling

This section evaluates how EcoServe's goodput scales when the available resources (number of instances) are doubled.

  • Setup: CodeLlama2-34B and Qwen2-72B models on the L20 cluster, using TP=4 for CodeLlama2-34B and TP=2 for Qwen2-72B, with the ShareGPT dataset.

  • Results:

    The following are the results from Figure 9 of the original paper:


Figure 9. Static coarse-grained scaling.

*   As shown in Figure 9, both models achieve `superlinear improvement` in `P90 SLO attainment`. For example, CodeLlama2-34B serving scales from 1 instance (4 GPUs) to 4 instances (16 GPUs) and achieves `5.6x throughput`.
  • Analysis for Superlinear Scaling:
    • Minimal Management Overhead: EcoServe incurs minimal overhead in managing more instances within a macro instance, especially when nodes are symmetrical.
    • Mitigation of Inter-phase Interference: Crucially, adding more instances provides more space for mitigating inter-phase interference. This allows for higher arithmetic intensity and better GPU saturation because instances can spend longer periods dedicated to one phase without causing SLO violations.
    • Degradation to NoDG: If a macro instance contains only a single instance, the PaDG strategy effectively degrades to NoDG, suffering from frequent phase switches and severe interference.
  • Plateau Effect: The paper notes that this superlinear scaling effect will eventually plateau once a sufficient number of instances are reached, as the benefits of reducing interference diminish.

6.2.2. Dynamic Fine-grained Scaling

This evaluates EcoServe's ability to adapt to dynamically changing request rates by incrementally adding instances.

  • Setup: CodeLlama2-34B on the L20 cluster with TP=4, ShareGPT dataset. Request rate increased every 2 minutes (from 20 to 50 requests/second). SLO attainments collected every 30 seconds.

  • Hyperparameters: $N_l = 4$ and $N_u = 16$ for mitosis scaling. The system starts with 8 instances and uses up all GPUs eventually.

  • Results:

    The following are the results from Figure 10 of the original paper:


Figure 10. Request rate (green) and SLO attainment (blue) as the request rate increases. Here $N_l = 4$ and $N_u = 16$.

*   Figure 10 shows that as the `request rate increases`, `SLO attainment` initially drops, but is then `restored by the addition of a new instance`. The blue dots (SLO attainment) remain high despite the increasing green line (request rate).
  • Analysis:
    • Adaptive Scheduling: The adaptive scheduling algorithm immediately routes new requests to newly added instances, freeing up time for existing instances to process decodes.
    • Instance Migration Overhead: The serializable proxy object for instance migration (part of mitosis scaling) introduces less than 100 ms of overhead, which can be hidden by triggering migration during the decode phase. This contrasts sharply with the 3-minute (or longer) overhead of re-initializing an instance from scratch.
  • Conclusion: The mitosis scaling approach effectively provides flexible and fine-grained scaling, allowing EcoServe to adapt to dynamic workloads.
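
The scaling behavior described in this subsection can be pictured as a simple control loop that monitors windowed SLO attainment and grows the macro instance when attainment dips, bounded by $N_l$ and $N_u$. The sketch below is a hypothetical illustration of such a policy; the `AttainmentMonitor` and `InstanceScaler` interfaces are assumptions, not EcoServe's actual mitosis-scaling implementation.

```python
import time

class AttainmentMonitor:
    """Hypothetical monitor; a real system would aggregate per-request
    TTFT/TPOT results over the last measurement window."""
    def slo_attainment_last_window(self) -> float:
        return 0.95  # placeholder value

class InstanceScaler:
    """Hypothetical scaler; a real system would hand a prepared instance's
    serialized proxy object to the macro instance."""
    def add_instance(self) -> None:
        print("instance added to macro instance")

def scaling_loop(monitor: AttainmentMonitor, scaler: InstanceScaler,
                 n_lower: int = 4, n_upper: int = 16,
                 target_attainment: float = 0.90,
                 check_interval_s: float = 30.0) -> None:
    """Keep the macro instance between n_lower and n_upper instances,
    adding one whenever windowed SLO attainment drops below the target."""
    num_instances = n_lower
    while num_instances < n_upper:
        if monitor.slo_attainment_last_window() < target_attainment:
            scaler.add_instance()
            num_instances += 1
        time.sleep(check_interval_s)
```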

6.3. Parallelism Compatibility

This section validates EcoServe's compatibility with pipeline parallelism (PP).

  • Setup: CodeLlama2-34B, ShareGPT dataset, L20 cluster. TPOT SLO is varied from 100ms to 500ms (relaxed, as PP doesn't improve single-batch latency).

  • Configurations:

    • EcoServe with TP=2, PP=2.
    • EcoServe with TP=4, PP=1 (effectively just TP).
    • vLLM with TP=4, PP=1.
  • Results:

    The following are the results from Figure 11 of the original paper:


Figure 11. Pipeline parallel compatibility.

*   As shown in Figure 11, EcoServe utilizing `PP` (TP=2, PP=2) achieves better performance than its `TP` counterpart (TP=4, PP=1) at lower `TPOT SLOs`.
*   EcoServe (PP) also `outperforms vLLM` across the board, and its `throughput plateau` achieved with `PP` is much higher than that of `vLLM`.
  • Analysis:
    • Lower Frequency of Prefill-Decode Switching: EcoServe's PaDG strategy inherently involves less frequent prefill-decode switching within an instance compared to NoDG. This reduces pipeline bubbles and improves pipeline parallelism efficiency.
    • Minimal Data Movement and PCIe Contention: The PaDG strategy's design, which avoids KV cache transfer between instances, also means less data movement and reduced PCIe contention, making it more suitable for tensor parallelism on PCIe-only systems (like the L20 cluster).
  • Conclusion: EcoServe is highly compatible with pipeline parallelism, achieving superior throughput compared to vLLM and even its own TP-only configuration under certain TPOT SLOs.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces EcoServe, a novel system designed for cost-effective Large Language Model (LLM) serving on clusters equipped with commodity interconnects. Its core innovation is the Partially Disaggregated (PaDG) Strategy, which addresses the limitations of existing Non-Disaggregated (NoDG) and Fully Disaggregated (FuDG) approaches.

The PaDG strategy leverages two key mechanisms:

  1. Temporal Disaggregation: It disaggregates the prefill and decode phases along the time dimension within a single instance. This significantly mitigates inter-phase interference and enhances throughput by allowing each phase to run more efficiently.

  2. Rolling Activation: To ensure low Time to First Token (TTFT) despite temporal disaggregation, EcoServe coordinates multiple instances within a macro instance in a cyclic pattern, guaranteeing continuous availability of prefill processing.

    EcoServe further integrates an adaptive scheduling algorithm for intelligent request routing and a mitosis scaling approach for fine-grained, elastic capacity adjustment and seamless instance migration.

Experimental results on production-level clusters demonstrate that EcoServe achieves an average goodput improvement of 82.49% to 126.96% over representative NoDG and FuDG systems. Beyond raw performance, EcoServe excels in load balancing, hardware cost, parallelism compatibility, and engineering simplicity, making it a superior solution for practical, cost-sensitive LLM deployments.

7.2. Limitations & Future Work

The paper's discussion section (Section 6) provides a comparative analysis (Table 5) that implicitly outlines the advantageous scenarios for each strategy rather than explicit limitations of EcoServe.

The following are the results from Table 5 of the original paper:

| Strategy | Goodput | Cost Effective | Load Balance | Hardware Cost | Parallelism Compatibility | Engineering Complexity |
| --- | --- | --- | --- | --- | --- | --- |
| NoDG |  | Good | Easy | Low | Low | Low |
| FuDG | √√ | Poor | Hard | High | High | High |
| PaDG | √√ | Excellent | Easy | Low | High | Low |

Based on this, and the context of the paper, we can infer the following:

Implicit Limitations of PaDG (EcoServe):

  • Applicability for Small Models: While PaDG is excellent for 30B, 70B, and 130B models, for very small models (e.g., 7B, 13B) with lower computational demands and easier-to-satisfy SLOs, the prefill-decode interference in NoDG might be negligible. In such cases, the engineering simplicity and low overhead of NoDG could still make it a preferable choice. PaDG introduces some coordination complexity that might not be justified for very small-scale workloads.
  • Extreme Scenarios/Stringent SLOs: For ultra-large models or extremely stringent SLOs where even minor interferences are intolerable, FuDG with advanced high-performance hardware (e.g., InfiniBand) might still be essential, despite its cost. PaDG, while mitigating interference, still has both phases run on the same physical instance, just temporally separated, which might not be enough for the most extreme cases.
  • Dependency on Prediction Accuracy: The adaptive scheduling algorithm relies on predicting prefill durations and saved TPOT. Inaccurate predictions could lead to SLO violations or suboptimal scheduling.

Suggested Future Work (by the authors and implied):

  • More Aggressive Disaggregation: The authors mention MegaScale-Infer [52] which disaggregates attention and FFN modules into different instances for ultra-large MoE models. This suggests a potential future direction for PaDG to explore even finer-grained module-level disaggregation, beyond just prefill/decode, for even larger or specialized models.
  • Optimizing Intra-Instance Phase Switching: While temporal disaggregation aims to reduce switching frequency, further research could optimize the timing and conditions for switching between prefill and decode phases to minimize context switching overheads and maximize GPU utilization.
  • Dynamic Workload Prediction: Improving the accuracy and adaptiveness of workload prediction (prefill_times, output_length for saved TPOT) could further enhance the adaptive scheduling algorithm.
  • Heterogeneous Hardware: Exploring PaDG strategies on heterogeneous clusters (e.g., mixing different GPU types, or GPUs with varying interconnects) could extend its applicability.

7.3. Personal Insights & Critique

EcoServe presents a highly practical and well-reasoned solution to a pressing problem in LLM serving: achieving high performance on commodity hardware. The paper clearly articulates the trade-offs involved in LLM system design, illustrating that "the art of trade-offs" is paramount.

Key Strengths:

  • Addressing the "Missing Middle": The PaDG strategy effectively carves out a sweet spot between the NoDG and FuDG extremes. It provides the benefits of interference mitigation without the prohibitive cost of specialized interconnects, making it particularly relevant for enterprises and cloud providers with existing commodity infrastructure.
  • Proactive Scheduling: The combination of temporal disaggregation (intra-instance) and rolling activation (inter-instance) is elegant. It solves the inherent TTFT problem introduced by temporal separation within an instance, demonstrating a holistic system design approach.
  • Practical Scaling: The mitosis scaling approach with its serializable proxy object is a robust solution for fine-grained, elastic scaling with minimal overhead. This is crucial for dynamic cloud environments.
  • Comprehensive Evaluation: The experiments are conducted on production-level clusters with diverse models and datasets, using rigorous metrics (goodput at various SLO attainments), which lends strong credibility to the findings. The analysis across different clusters (L20 vs. A800) provides insightful observations about how interconnect bandwidth can become an even greater bottleneck relative to GPU processing power with newer hardware.
  • Engineering Simplicity: The argument for lower engineering complexity compared to FuDG is compelling, as it avoids complex KV cache transfer mechanisms and simplifies load balancing.

Potential Areas for Deeper Exploration / Critique:

  • Phase-Switching Overhead: While the paper states that temporal disaggregation lasts "longer to reduce switching overhead," a more detailed quantification of this overhead (e.g., context switching time, cache invalidation) and how it's minimized would be beneficial.

  • Prediction Accuracy and Robustness: The adaptive scheduling algorithm relies on predicting prefill durations. The paper mentions profiling, but the robustness of these predictions under highly variable or adversarial workloads (e.g., very diverse prompt lengths, sudden spikes in traffic) could be explored further. What if predictions are consistently off?

  • Interplay with Advanced KV Cache Optimizations: The paper mentions GQA benefits FuDG by reducing KV cache size. How does EcoServe interact with other advanced KV cache optimizations (like PagedAttention or compression techniques)? Do they offer additional synergies or create new trade-offs within the PaDG framework?

  • Generalizability to Other Parallelism: While TP and PP are covered, exploring compatibility with other parallelism strategies (e.g., expert parallelism for MoE models) could be valuable, especially given the mention of MegaScale-Infer.

  • Cost Model: While "cost-effective" is a central claim, a more explicit cost model (e.g., comparing GPU-hours per goodput-request-unit across solutions, or the total cost of ownership including networking) could further solidify the argument for PaDG over FuDG.

    Overall, EcoServe offers a significant step forward for practical LLM deployment. Its PaDG strategy is a clever synthesis of existing ideas, thoughtfully applied to address a critical industry pain point. The paper is well-structured and provides a strong foundation for future research in optimizing LLM serving architectures.
