MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving
TL;DR Summary
MuxServe is introduced as a flexible spatial-temporal multiplexing system for efficiently serving multiple LLMs by colocating them based on popularity, optimizing resource usage and achieving up to 1.8x throughput increase or processing 2.9x more requests.
Abstract
Large language models (LLMs) have demonstrated remarkable performance, and organizations are racing to serve LLMs of varying sizes as endpoints for use-cases like chat, programming and search. However, efficiently serving multiple LLMs poses significant challenges for existing approaches due to the varying popularity of LLMs. In this paper, we present MuxServe, a flexible spatial-temporal multiplexing system for efficient multiple LLM serving. The key insight is to colocate LLMs considering their popularity to multiplex memory resources, and to leverage the characteristics of the prefill and decoding phases to separate and flexibly colocate them to multiplex computation resources. MuxServe formally formulates the multiplexing problem, and proposes a novel placement algorithm and adaptive batch scheduling strategy to identify optimal colocations and maximize utilization. MuxServe designs a unified resource manager to enable flexible and efficient multiplexing. Evaluation results show that MuxServe achieves up to 1.8x higher throughput or processes 2.9x more requests within 99% SLO attainment. The code is available at: \url{https://github.com/hao-ai-lab/MuxServe}.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is "MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving." It focuses on improving the efficiency of serving multiple Large Language Models (LLMs) by intelligently sharing computational and memory resources.
1.2. Authors
The authors are Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, and Hao Zhang. Their affiliations span various research institutions, including Shanghai AI Laboratory, CUHK Interdisciplinary AI Research Institute, and the Centre for Perceptual and Interactive Intelligence (CPII) Ltd, indicating a strong academic and research background in AI and systems.
1.3. Journal/Conference
The paper was published on arXiv, a preprint server for scientific papers, on April 2, 2024. arXiv is widely recognized in the scientific community for disseminating research rapidly, though papers posted there are typically not peer-reviewed until later submission to a conference or journal.
1.4. Publication Year
2024
1.5. Abstract
The paper addresses the challenge of efficiently serving multiple Large Language Models (LLMs) of varying sizes and popularities. Existing methods struggle with significant resource underutilization. The authors introduce MuxServe, a system that employs flexible spatial-temporal multiplexing. Its core insights involve colocating LLMs based on their popularity to multiplex memory resources, and separating and flexibly colocating prefill and decoding phases to multiplex computation resources. MuxServe formally defines the multiplexing problem and proposes a novel placement algorithm along with an adaptive batch scheduling strategy to optimize resource utilization. A unified resource manager is designed to enable this flexible and efficient multiplexing. Evaluation results demonstrate that MuxServe achieves up to 1.8x higher throughput or processes 2.9x more requests while maintaining a 99% Service Level Objective (SLO) attainment.
1.6. Original Source Link
The paper is available at: \url{https://arxiv.org/abs/2404.02015}. Its publication status is a preprint on arXiv. PDF Link: \url{https://arxiv.org/pdf/2404.02015v2.pdf}
2. Executive Summary
2.1. Background & Motivation
The proliferation of Large Language Models (LLMs) for diverse applications like chat, programming, and search has led organizations to serve multiple LLMs as endpoints. However, LLMs, especially large ones, demand significant computational resources; for instance, a single 175B LLM requires eight A100 (80GB) GPUs. Efficiently serving multiple LLMs simultaneously presents significant challenges due to their varying sizes, resource requirements, and dynamic popularity among users. This leads to substantial inference costs and underutilization of expensive GPU resources.
Existing approaches fall short:
- Spatial Partitioning: This method dedicates separate groups of GPUs to each LLM. While simple, it often results in severe underutilization, particularly for less popular LLMs with sparse request patterns, leaving GPUs idle (as depicted in Figure 1a). Conversely, popular LLMs can become performance bottlenecks.
- Temporal Multiplexing: This approach shares a group of GPUs among multiple models by interleaving their execution. While it can reduce latency for bursty workloads, it typically allocates the entire computation resource to one model at a time. It overlooks the distinct characteristics of the prefill and decoding phases of LLM inference. The decoding phase, which dominates inference time, often underutilizes GPUs when run in isolation, leading to wave-shaped utilization with frequent troughs (as depicted in Figure 1b and Figure 3).

The core problem is to efficiently serve a dynamic mix of LLMs with varying popularity and resource demands on a shared cluster, maximizing GPU utilization and throughput while meeting Service Level Objectives (SLOs).
The paper's innovative idea stems from two key insights:
- The prefill and decoding phases of LLM inference have distinct computational characteristics. Separating and flexibly colocating jobs from these phases (even from different LLMs) can multiplex computation resources.
- LLMs exhibit varying popularity. Colocating LLMs considering their popularity can effectively multiplex memory resources. For example, a popular LLM's decoding phase, which might not fully utilize GPUs, can run concurrently with a less popular LLM's prefill phase.

This flexible spatial-temporal multiplexing (Figure 1c) allows for higher GPU utilization and faster request completion. The paper aims to leverage these insights to design a system that dynamically adapts to workload changes, leading to significant cost reductions for LLM endpoint providers.
2.2. Main Contributions / Findings
The paper makes the following primary contributions:
- First to Explore and Formulate Spatial-Temporal Multiplexing for LLM Serving: MuxServe introduces and formally formulates the problem of flexible spatial-temporal multiplexing tailored for multiple LLM serving, considering the distinct characteristics of prefill and decoding phases and varying LLM popularity.
- Novel Placement Algorithm and Adaptive Batch Scheduling Strategy: MuxServe proposes an enumeration-based greedy placement algorithm to identify optimal colocations of LLMs across GPU resources. It also introduces an adaptive batch scheduling (ADBS) strategy that maximizes intra-unit throughput by prioritizing prefill jobs and dynamically reallocating KV cache blocks, ensuring fair resource sharing.
- Viable System Design with Unified Resource Manager: The paper details the design and implementation of a unified resource manager. This manager leverages NVIDIA MPS for dynamic SM partitioning and introduces a head-wise cache to enable flexible and efficient sharing of KV cache memory across different LLMs.

The key findings from the evaluation are:
- MuxServe significantly outperforms prior state-of-the-art systems, achieving up to 1.8x higher throughput.
- It can process up to 2.9x more requests while maintaining a 99% SLO attainment, demonstrating both efficiency and reliability.
- Ablation studies confirm the effectiveness of each proposed component, including the placement algorithm, adaptive batch scheduling, and unified resource manager.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand MuxServe, a beginner should be familiar with the following concepts:
Large Language Models (LLMs)
LLMs are deep learning models, typically based on the Transformer architecture, that are trained on vast amounts of text data to understand, generate, and process human language. They can perform various tasks like question answering, text summarization, translation, and code generation.
- Transformer Architecture: Introduced by Vaswani et al. (2017), Transformers are neural network architectures that rely heavily on the attention mechanism to weigh the importance of different parts of the input sequence. They consist of encoder and decoder blocks; LLMs primarily use the decoder-only architecture for text generation.
- Autoregressive Generation: LLMs generate text token by token. Each new token is predicted based on the input prompt and all previously generated tokens. This sequential generation process is fundamental to their operation.
LLM Inference Phases
LLM inference, especially for autoregressive generation, typically involves two distinct phases:
- Prefill Phase: This is the initial phase where the entire input prompt (which can be hundreds or thousands of tokens long) is processed in parallel. The model computes the representations for all prompt tokens simultaneously and generates the first output token. This phase is typically computation-intensive as it involves processing a long sequence.
- Decoding Phase (Incremental Decoding): After the first token is generated, the model iteratively generates subsequent tokens one at a time. For each new token, the model takes the previously generated tokens (including the input prompt) as context. This phase is typically memory-intensive due to the growing KV cache and less computation-intensive per token compared to prefill, as only one new token is processed at each step. However, it dominates the total inference time for long outputs.
Key-Value Cache (KV Cache)
During the decoding phase, the Transformer's attention mechanism needs access to the "keys" and "values" computed for all previously processed tokens (both from the prompt and already generated output). To avoid recomputing these for every new token, they are stored in a memory structure called the KV cache. As more tokens are generated, the KV cache grows dynamically, consuming significant GPU memory. The size of the KV cache depends on the sequence length, batch size, number of layers, number of attention heads, and hidden dimension.
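To make this memory pressure concrete, the following sketch (not from the paper; the dimensions are roughly LLaMA-13B-like and purely illustrative) estimates KV cache size from the quantities listed above:

```python
def kv_cache_bytes(layers, heads, head_dim, seq_len, batch, dtype_bytes=2):
    # 2x accounts for keys and values, stored per layer, per head, per token (fp16 here).
    return 2 * layers * heads * head_dim * seq_len * batch * dtype_bytes

# Roughly LLaMA-13B-like dimensions: 40 layers, 40 attention heads of size 128.
size_gib = kv_cache_bytes(40, 40, 128, seq_len=2048, batch=8) / 1024**3
print(f"KV cache for 8 requests at 2048 tokens: {size_gib:.1f} GiB")  # ~12.5 GiB
```

The cache grows linearly with both sequence length and batch size, which is why it quickly becomes the dominant memory consumer during decoding.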
Distributed LLM Inference
Due to their massive size, many LLMs cannot fit into a single GPU's memory or require faster inference than a single GPU can provide. Distributed inference techniques are used to spread the model across multiple GPUs or even multiple machines.
- Intra-operator Parallelism (Tensor Parallelism): This technique splits individual LLM layers (e.g., attention mechanism, feed-forward networks) across multiple GPUs. For example, the weight matrices of a linear layer might be partitioned, and different GPUs compute parts of the output. This significantly reduces the memory footprint on each GPU and can reduce inference latency. However, it requires frequent communication between GPUs to exchange intermediate results.
- Inter-operator Parallelism (Pipeline Parallelism): This technique partitions the LLM into multiple stages (groups of layers), with each stage placed on a different GPU. Data flows in a pipeline fashion, where one GPU completes its stage and passes the output to the next GPU. This introduces minimal communication overhead but does not reduce the latency for a single request as much as tensor parallelism; rather, it improves overall throughput by allowing multiple requests to be processed concurrently in different pipeline stages.
GPU Multiplexing
Multiplexing refers to techniques for sharing GPU resources among multiple tasks or models.
- Spatial Multiplexing: Divides a GPU's physical resources (like memory or Streaming Multiprocessors - SMs) into separate partitions, each allocated to a different task, allowing them to run concurrently.
- NVIDIA Multi-Instance GPU (MIG): A hardware feature on some NVIDIA GPUs (like A100) that allows physically partitioning a single GPU into multiple smaller, isolated GPU instances. Each MIG instance has its own dedicated memory, compute cores, and caches, providing strong isolation.
- NVIDIA CUDA Multi-Process Service (MPS): A software layer that allows multiple CUDA applications to share a single GPU by providing concurrent access to its resources. MPS creates a daemon process that arbitrates access to the GPU, effectively partitioning its compute units (SMs) among different processes. It provides finer-grained resource sharing than MIG but with less strict isolation.
- Temporal Multiplexing: Tasks share the entire GPU (or a group of GPUs) but take turns using it over time. Each task occupies the full computational resource for a specific time interval, and scheduling determines the order and duration of these intervals.
Throughput, Latency, and Service Level Objective (SLO)
These are standard metrics for evaluating system performance:
- Throughput: The rate at which a system processes requests, typically measured in requests per second (req/s) or tokens per second (tokens/s). Higher throughput indicates better efficiency.
- Latency: The time taken to complete a single request. Key latency metrics include:
- Time to First Token (TTFT): Time from request submission to the generation of the first output token.
- Time per Output Token (TPOT): Average time taken to generate each subsequent output token.
- End-to-End Latency: Total time from request submission to the completion of all output tokens.
- Service Level Objective (SLO): A target performance level that a system aims to meet, often expressed as a percentile of latency (e.g., 99% of requests must complete within 500ms).
SLO attainment measures the percentage of requests that successfully meet this target.
3.2. Previous Works
The paper contextualizes MuxServe by discussing prior research and existing systems in deep learning (DL) serving and LLM serving, highlighting their limitations for the multi-LLM scenario.
DL Serving Systems (General)
Earlier DL serving systems primarily focused on smaller Deep Neural Network (DNN) models and often employed temporal multiplexing with advanced batching and scheduling strategies to improve GPU utilization and meet SLOs (e.g., Olston et al., 2017; Crankshaw et al., 2017; Shen et al., 2019; Gujarati et al., 2020; Zhang et al., 2023; Romero et al., 2021).
- Clipper (Crankshaw et al., 2017): An early low-latency online prediction serving system focusing on efficient request batching and scheduling for DNNs.
- Nexus (Shen et al., 2019): A GPU cluster engine for accelerating DNN-based video analysis, also focusing on scheduling and resource management.
- AlpaServe (Li et al., 2023): This is the most closely related work, as it explores model parallelism and temporal multiplexing for serving multiple large DNN models. It distributes models across multiple GPUs and schedules requests in an interleaved manner.
  - Differentiation: MuxServe argues that AlpaServe is not specifically designed for LLMs and misses the characteristics of LLM inference phases (prefill vs. decoding). As Figure 3 illustrates, reducing the computation resources given to the dominant decoding phase does not significantly hurt latency, implying that pure temporal multiplexing can still leave GPUs underutilized for LLMs. MuxServe's key differentiator is exploiting these phase-specific characteristics.
LLM Serving Systems (Single LLM Focus)
More recent efforts have focused on optimizing the inference for single LLMs, often incorporating advanced techniques:
- Custom GPU Kernels: Optimizing Transformer model inference at a low level, e.g., TensorRT-LLM (NVIDIA, 2023b) and LightSeq (Wang et al., 2021b).
- Distributed Inference for LLMs: Incorporating intra- and inter-operator parallelism to accelerate inference on multiple GPUs, e.g., FasterTransformer (NVIDIA, 2021), DeepSpeed-Inference (Aminabadi et al., 2022), vLLM (Kwon et al., 2023), and Huggingface's Text Generation Inference (TGI, 2023).
- vLLM (Kwon et al., 2023): A state-of-the-art system known for its PagedAttention mechanism, which significantly improves memory utilization for the KV cache by managing it in pages, similar to virtual memory in operating systems.
  - Differentiation: MuxServe states it builds atop vLLM for single-LLM optimization, but vLLM's PagedAttention primarily addresses memory utilization for a single LLM. MuxServe's unified KV cache addresses the more complex scenario of multiple LLMs of varying sizes and popularities sharing the cache.
- Other Optimizations: Memory management (Kwon et al., 2023), offloading (Sheng et al., 2023), iteration-level batching (Yu et al., 2022), speculative decoding (Miao et al., 2023), and using cheap instances (Miao et al., 2024).
  - Differentiation: MuxServe emphasizes that these works focus on optimizing single-LLM inference and are largely orthogonal to its contribution of efficient multi-LLM serving. MuxServe aims to complement them by providing a system-level solution for multiplexing.
GPU Sharing (General)
Research on GPU sharing broadly categorizes into temporal and spatial methods:
- Temporal Sharing:
- Salus (Yu & Chowdhury, 2019): Proposes fast job switching and memory management for temporal GPU sharing.
- Wavelet (Wang et al., 2021a), Zico (Lim et al., 2021), AntMan (Xiao et al., 2020): Other systems exploring efficient temporal sharing for DNN training or general GPU workloads.
- Spatial Sharing:
- NVIDIA MIG (NVIDIA, 2022a) and MPS (NVIDIA, 2022b): Native NVIDIA tools for spatial partitioning. MuxServe uses MPS as a fundamental building block.
- GSLICE (Dhakal et al., 2020): A dynamic GPU resource allocation and management framework.
- Muxflow (Zhao et al., 2023): Efficient and safe GPU sharing in production DL clusters.
- Gpulet (Choi et al., 2022): Introduces mixed spatial-temporal sharing for multiplexing general jobs.
- Differentiation: These general GPU sharing works typically target multiplexing smaller DNN jobs or general GPU tasks. MuxServe focuses specifically on the challenges posed by LLMs (huge memory, distinct prefill/decoding phases, autoregressive nature) and their serving application.
3.3. Technological Evolution
The evolution of GPU serving can be traced from:
- Early DNN Serving (Small Models): Initially, systems focused on serving individual, relatively small DNNs. Techniques like micro-batching and round-robin scheduling were common.
- Large DNN Serving: With larger models, distributed inference (model parallelism) became necessary. Systems like AlpaServe emerged to manage multiple large DNNs through temporal multiplexing.
- LLM Serving (Single LLM Optimization): The advent of LLMs introduced new challenges, primarily their massive size and the KV cache memory bottleneck. This led to specialized systems like vLLM, FasterTransformer, and TGI, which optimize single-LLM inference with techniques like PagedAttention, custom kernels, and distributed parallelism.
- Multi-LLM Serving (MuxServe's Domain): The current frontier, where MuxServe fits, addresses the complex scenario of simultaneously serving a diverse portfolio of LLMs with varying popularity and resource demands. It builds upon previous distributed inference and single-LLM optimizations, but innovates by introducing flexible spatial-temporal multiplexing specifically tailored to the unique characteristics of LLM inference phases and the dynamic nature of multi-LLM workloads. MuxServe attempts to unify the benefits of spatial and temporal sharing, guided by LLM-specific insights.
3.4. Differentiation Analysis
MuxServe's core differentiation from the main methods in related work lies in its flexible spatial-temporal multiplexing specifically designed for multiple LLM serving, informed by two key insights:
- Exploiting Prefill/Decoding Phase Characteristics: Unlike general temporal multiplexing systems (e.g., AlpaServe) that treat model inference as a monolithic task, MuxServe explicitly separates prefill and decoding jobs. It recognizes that decoding phases often underutilize GPUs (Figure 3) and can be efficiently colocated with other jobs (e.g., the prefill of another LLM), maximizing computation resource utilization. This fine-grained phase awareness is novel for multi-LLM serving.
- Popularity-Aware LLM Colocation for Memory Multiplexing: Traditional spatial partitioning ignores LLM popularity, leading to idle resources for unpopular models. MuxServe proactively considers LLM popularity in its placement algorithm, strategically colocating LLMs with high and low request rates to effectively multiplex memory resources (especially the KV cache). This contrasts with systems like vLLM, which optimize the KV cache for a single LLM, or general GPU sharing mechanisms that don't account for model-specific popularity or dynamic KV cache growth across multiple large models.
- Unified Resource Management for LLMs: While NVIDIA MPS allows SM partitioning and vLLM offers PagedAttention, MuxServe combines and extends these. Its unified resource manager dynamically assigns SMs to prefill and decoding jobs from different colocated LLMs and introduces a novel head-wise KV cache to flexibly share memory across heterogeneous LLMs, which PagedAttention in vLLM is not designed for. This allows for dynamic KV cache reallocation during runtime, adapting to bursty and changing workloads in a multi-LLM context.

In essence, MuxServe integrates insights from distributed inference, single-LLM optimization, and general GPU sharing, but uniquely synthesizes them into a holistic system that addresses the specific and complex challenges of serving a diverse set of LLMs concurrently and efficiently.
4. Methodology
4.1. Principles
The core idea behind MuxServe is to achieve flexible spatial-temporal multiplexing for multiple Large Language Model (LLM) serving. This is driven by two key insights:
- Multiplexing Memory Resources by Colocating LLMs Based on Popularity: LLMs exhibit varying levels of popularity and request rates (Figure 2). MuxServe recognizes that colocating LLMs with different popularity profiles allows for more efficient sharing of memory resources. For instance, a less popular LLM might occupy memory for its model weights and some KV cache, but its requests are sparse. By colocating it with a more popular LLM, the memory can be multiplexed, preventing idle memory. The goal is to group LLMs into "LLM units" such that resources are balanced and highly utilized across these units.
- Multiplexing Computation Resources by Separating and Flexibly Colocating Prefill and Decoding Phases: The prefill and decoding phases of LLM inference have distinct computational characteristics. The prefill phase is typically compute-intensive, while the decoding phase, especially for a single LLM, often underutilizes the GPU's computational units (SMs) (Figure 3). MuxServe treats these as separate jobs, which allows for flexible colocation: for example, a prefill job from one LLM (compute-heavy) can run concurrently with a decoding job from another LLM (less compute-heavy), filling the "troughs" of utilization seen in pure temporal multiplexing (Figure 1b vs. 1c).

MuxServe combines these principles to formulate an optimization problem, solve it with a placement algorithm and an adaptive batch scheduling strategy, and implement it with a unified resource manager.
4.2. Core Methodology In-depth
4.2.1. Problem Formulation
MuxServe first formally defines the multiplexing problem to find optimal colocations and scheduling strategies.
Consider a cluster $C$ and a set of LLMs $M$ with their respective workloads $W$. The primary objective is to group the LLMs into LLM units (denoted $B$) such that GPU utilization (i.e., throughput) is maximized. An LLM unit refers to a group of LLMs that will be colocated on a shared set of GPUs.

The overall problem of finding the best group of LLM units is formulated as:
$ B^{*} = \arg\max_{B \in \mathcal{B}} \sum_{b \in B} F(b, W_b) $
Here:
- $B^{*}$ represents the optimal group of LLM units.
- $\mathcal{B}$ denotes the set of all possible groupings of LLM units.
- $b$ is an individual LLM unit, which comprises a group of colocated LLMs.
- $W_b$ represents the aggregated workload for the LLM unit $b$.
- $F(b, W_b)$ is a function that estimates the throughput for a given unit $b$ under its workload $W_b$. This function quantifies how efficiently a specific LLM unit configuration can process its assigned workload.

Within an LLM unit, which contains a set of colocated LLMs $b_{llm}$, MuxServe needs to determine an optimal batching and scheduling strategy $S$. This strategy aims to maximize the throughput of the entire unit while ensuring fair resource sharing among the individual LLMs within that unit. This sub-problem, estimating $F(b, W_b)$, is formulated as:
$ F(b, W_b) = \max_{S} \sum_{m \in b_{llm}} \mathrm{tpt}_S(m, b, W_b) \quad \text{s.t.} \quad \left| R(m_i, W_{m_i}) - R(m_j, W_{m_j}) \right| \le \epsilon, \ \forall m_i, m_j \in b_{llm} $
Here:
- $F(b, W_b)$ is the throughput of the LLM unit $b$ with workload $W_b$.
- $S$ represents a specific batching and scheduling strategy.
- $\mathrm{tpt}_S(m, b, W_b)$ estimates the throughput of an individual LLM $m$ when operating within unit $b$ using strategy $S$ under workload $W_b$. The summation implies that the total throughput of the unit is the sum of the throughputs of its constituent LLMs.
- $R(m, W_m)$ estimates the normalized resource consumption (e.g., computation or memory) of an LLM $m$ with its workload $W_m$. Normalization is crucial to account for varying LLM scales and popularity, as large and popular LLMs naturally require more resources.
- The constraint ensures fairness. It stipulates that the difference in normalized resource consumption between any two LLMs $m_i$, $m_j$ within the same unit must be at most a small fairness threshold $\epsilon$. This prevents one LLM from monopolizing resources at the expense of others; a small worked example of this check is sketched below.
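As a worked illustration of the fairness constraint, the following Python sketch (with made-up resource numbers, not values from the paper) checks whether a candidate unit satisfies $|R(m_i, W_{m_i}) - R(m_j, W_{m_j})| \le \epsilon$ for every pair of colocated LLMs:

```python
def satisfies_fairness(normalized_usage, eps=0.1):
    # normalized_usage: {llm_name: R(m, W_m)}, e.g. token-block usage per unit request rate.
    values = list(normalized_usage.values())
    return all(abs(a - b) <= eps for i, a in enumerate(values) for b in values[i + 1:])

print(satisfies_fairness({"llama-13b": 0.42, "llama-7b": 0.38}))           # True: within eps
print(satisfies_fairness({"llama-13b": 0.70, "llama-7b": 0.30}, eps=0.1))  # False: one LLM dominates
```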
4.2.2. Placement Algorithm
Solving the combinatorial optimization problem of Equation (1) efficiently is challenging due to the exponential growth of possible LLM unit combinations. MuxServe proposes an enumeration-based greedy algorithm to find a good solution efficiently by prioritizing LLMs with large computation requirements (model scale and popularity).
The Enumeration-based Greedy LLM Placement algorithm is detailed below:
Explanation of Algorithm 1:
- Line 1 (M ← llm_parallel_candidate(M, W)): First, MuxServe calls Algorithm 2 (described below) to generate parallel candidates for each LLM in the input list M. A parallel candidate represents an optimal configuration (number of SMs, parallelism degree) for a single LLM to meet its workload requirements with minimal SM usage. This step essentially preprocesses each LLM to determine its potential configurations for deployment.
- Line 2 (D ← get_potential_device_mesh_groups(C, M)): The algorithm then identifies all potential device mesh groups within the cluster C. A device mesh group is a collection of GPUs that will be used to serve a set of LLMs concurrently. Heuristics are applied here to prune the search space, considering that intra-operator parallelism is usually kept within a node and that the workload constrains the mesh size.
- Line 3 (best_tpt, best_units ← 0, None): Initializes variables to track the best overall throughput and the corresponding optimal group of LLM units found so far.
- Line 4: The algorithm iterates through each potential device mesh group in D.
- Line 5: Within each mesh group, the LLMs are sorted in descending order based on their computation requirements. This is a crucial greedy heuristic: LLMs requiring more computation (due to size and/or popularity) are prioritized for placement. This helps ensure these demanding LLMs get adequate resources first, which is critical for overall utilization.
- Line 6 (for m ∈ M' do): For each LLM m (sorted by computation) in the current mesh group:
  - Line 7 (best_mesh, max_delta ← None, -1): Initializes variables to find the best mesh for the current LLM m.
  - Line 8: It iterates through each individual mesh d within the current mesh group.
  - Line 9 (u = make_unit(m, d)): A hypothetical unit u is formed by placing LLM m onto mesh d.
  - Line 10: This line calculates the approximate increase in throughput if LLM m were placed on mesh d: the estimated throughput of the new unit u minus the current throughput of the mesh's existing unit d.u (which might already host other LLMs). The difference, delta, represents the marginal gain.
  - Lines 11-13: If the calculated delta is greater than the max_delta found so far, then d becomes the best_mesh and max_delta is updated.
  - Line 15 (best_mesh.u = make_unit(m, best_mesh)): After evaluating all meshes, LLM m is formally placed on the best_mesh that yielded the maximum throughput increase.
- Line 17: After all LLMs have been greedily placed onto meshes within the current group, the total throughput tpt for this entire mesh group is calculated.
- Lines 18-20: If this tpt is higher than the best_tpt found across all previous mesh groups, best_tpt and best_units are updated.
- Line 22 (return best_units): The algorithm returns the group of LLM units that resulted in the highest overall throughput.

The time complexity of Algorithm 1 grows with the number of LLMs, the number of devices (GPUs), and the number of potential device mesh groups.
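To make the control flow concrete, here is a small self-contained Python sketch of the same enumerate-then-greedily-place pattern. The throughput model, mesh groups, and LLM demand numbers are toy stand-ins, not MuxServe's actual estimator or candidate generator:

```python
from dataclasses import dataclass, field

@dataclass
class Mesh:
    gpus: int
    unit: list = field(default_factory=list)   # LLMs currently placed on this mesh

@dataclass
class LLM:
    name: str
    demand: float                               # proxy for model scale x popularity

def estimate_tpt(unit, gpus):
    # Toy stand-in for F(b, W_b): gains diminish as more LLMs share a mesh.
    return gpus * sum(m.demand for m in unit) / (1 + len(unit))

def greedy_placement(mesh_groups, llms):
    best_tpt, best_group = 0.0, None
    for group in mesh_groups:                                        # enumerate mesh groups
        meshes = [Mesh(g) for g in group]
        for m in sorted(llms, key=lambda x: x.demand, reverse=True):  # demanding LLMs first
            best = max(meshes, key=lambda d: estimate_tpt(d.unit + [m], d.gpus)
                                             - estimate_tpt(d.unit, d.gpus))
            best.unit.append(m)                                       # commit best marginal gain
        tpt = sum(estimate_tpt(d.unit, d.gpus) for d in meshes)
        if tpt > best_tpt:
            best_tpt, best_group = tpt, meshes
    return best_tpt, best_group

llms = [LLM("llama-65b", 9.0), LLM("llama-13b", 4.0), LLM("llama-7b", 1.0)]
tpt, placement = greedy_placement([[8], [4, 4], [4, 2, 2]], llms)
print(tpt, [(d.gpus, [m.name for m in d.unit]) for d in placement])
```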
The LLM Parallel Candidate Generation algorithm (Algorithm 2) is used as a sub-routine:
Algorithm 2 LLM Parallel Candidate Generation
Input: LLM list M, workload W
Output: The parallel candidate M_hat
1: M_hat ← []
2: for m ∈ M do
3: sm_list ← get_potential_sm_list(m)
4: tp_list ← get_potential_tp_degree(m)
5: for p ∈ tp_list do
6: for num_sm ∈ sorted(sm_list) do
7: tpt, bs ← estimate_throughput(m, num_sm, p)
8: // Store this configuration if it's a "parallel candidate"
9: // (e.g., meets workload with minimal SMs or optimal performance)
10: // The paper implies it finds "a set parallel candidates", but the exact
11: // criteria for selecting candidates (e.g. one for each TP degree)
12: // are not fully detailed in the snippet.
13: end for
14: end for
15: // Append the selected parallel candidate(s) for LLM m to M_hat
16: end for
17: return M_hat
Explanation of Algorithm 2:
- Line 2 (for m ∈ M do): Iterates through each LLM m in the input list.
- Line 3 (sm_list ← get_potential_sm_list(m)): Determines the possible numbers of Streaming Multiprocessors (SMs) that LLM m could be assigned.
- Line 4 (tp_list ← get_potential_tp_degree(m)): Determines the potential tensor parallelism (TP) degrees for LLM m.
- Lines 5-7 (for p ∈ tp_list do ... for num_sm ∈ sorted(sm_list) do ... tpt, bs ← estimate_throughput(m, num_sm, p)): For every combination of tensor parallelism degree p and number of SMs num_sm, the algorithm estimates the throughput (tpt) and batch size (bs) achievable for LLM m.
- Line 9 (// Store this configuration if it's a "parallel candidate"): The paper states that a parallel candidate is "a configuration that meets the workload requirements while utilizing the fewest number of SMs." The snippet implies that for each LLM this loop explores configurations, and M_hat then stores the resulting parallel candidate(s). The specific selection logic (e.g., choosing one minimal-SM configuration per TP degree, or finding the single best configuration) is implicit but aims to find efficient configurations.
4.2.3. Maximize Intra-unit Throughput
Once LLMs are placed into units, MuxServe needs an optimal strategy for batching and scheduling requests within each unit to maximize throughput (Equation 2) while ensuring fairness.
Resource Definition for Fairness:
MuxServe defines the normalized resource consumption $R(m, W_m)$ as the token block usage (Sheng et al., 2024). This is motivated by the fact that KV cache size is a significant performance bottleneck for LLM serving. Using token blocks provides a more granular and intuitive way to measure resource consumption, as tokens from different LLMs can consume varying amounts of KV cache. To account for varying workload characteristics, the token block usage is normalized by request rates. MuxServe assigns a token block quota to each LLM to enforce fair sharing.
Prefill-Decoding and Decoding-Decoding Collocation:
MuxServe maximizes intra-unit throughput by prioritizing prefill jobs and using remaining resources for decoding jobs. This is based on the observation that decoding jobs of a single LLM typically require fewer computation resources and can be batched together. Prioritizing prefill jobs maximizes the opportunity for colocation and batching with decoding jobs from other LLMs.
The Adaptive Batch Scheduling (ADBS) algorithm is described below:
Algorithm 3 Adaptive Batch Scheduling (ADBS)
Input: LLM list M (LLMs within a unit)
1: prefill_waiting ← false
2: quota ← init_token_block_quota(M)
3: while True do
4: if no prefill jobs in execution then
5: prefill_waiting ← True
6: m ← round-robin a prefill job from M
7: if resource_enough(m, quota) then
8: execute_and_update(m, quota)
9: prefill_waiting ← False
10: end if
11: end if
12: if not prefill_waiting then
13: m ← round-robin a decoding job from M
14: while resource_enough(m, quota) do
15: execute_and_update(m, quota)
16: m ← round-robin a decoding job from M
17: end while
18: end if
19: end while
Explanation of Algorithm 3:
- Line 1 (prefill_waiting ← false): A flag indicating whether the system is currently waiting to schedule a prefill job due to resource constraints.
- Line 2 (quota ← init_token_block_quota(M)): Initializes the token block quota for each LLM within the unit. This quota limits the amount of KV cache memory an LLM can use, ensuring fair sharing based on the defined resource measure R.
- Line 3 (while True do): The main scheduling loop, continuously running.
- Line 4 (if no prefill jobs in execution then): MuxServe prioritizes prefill jobs. If no prefill jobs are currently active, it attempts to schedule one.
  - Line 5 (prefill_waiting ← True): Sets the flag, indicating a potential wait for prefill resources.
  - Line 6 (m ← round-robin a prefill job from M): Selects an LLM from which to schedule a prefill job using a round-robin policy, ensuring fairness across LLMs.
  - Lines 7-10: Checks resource_enough(m, quota) (e.g., sufficient SMs and token blocks within m's quota). If yes, the job is executed, resources are updated, and prefill_waiting is reset. If not, the system waits for resources (implicitly, by not scheduling anything else while prefill_waiting remains true).
- Line 12 (if not prefill_waiting then): If a prefill job was successfully scheduled (or none was pending), the scheduler proceeds to schedule decoding jobs.
  - Line 13 (m ← round-robin a decoding job from M): Selects an LLM for a decoding job using round-robin.
  - Lines 14-16 (while resource_enough(m, quota) do ...): Continuously schedules decoding jobs as long as resources are available and each LLM stays within its token block quota. This maximizes colocation of decoding jobs to fill the remaining GPU capacity.
Adaptive Quota Adjustment:
To further improve utilization, ADBS periodically adapts the token block quota for each LLM. During runtime, MuxServe monitors KV cache utilization. It identifies LLMs with low utilization and proactively transfers KV cache blocks from them to LLMs with high utilization. This dynamic reallocation ensures optimal memory utilization and efficient sharing within the unit, adapting to changes in workload.
Throughput Estimator:
The exact throughput cannot be directly observed without profiling. MuxServe uses a simple yet effective analytical estimator. As shown in Figure 12 in the appendix, the execution timeline reveals that prefill phases of different LLMs are generally executed sequentially, while decoding phases can run concurrently. Different phases are interleaved.
The throughput estimator for LLM $m$ is given by:
$ \mathrm{tpt}_S(m, b, W_b) = \min \left\{ \frac{b^m}{\sum_{i \in b} t_p^i + t_d^m \cdot l_o^m},\ W_b \right\} $
Here:
- $\mathrm{tpt}_S(m, b, W_b)$ is the estimated throughput for LLM $m$ within unit $b$ under workload $W_b$ using strategy $S$.
- $b^m$ represents the batch size of requests for LLM $m$.
- $\sum_{i \in b} t_p^i$ is the sum of the prefill latencies for all LLMs in the batch within unit $b$. This reflects the sequential nature of prefill execution.
- $t_d^m$ is the average decoding latency (time per output token) for LLM $m$.
- $l_o^m$ is the average generation length (number of output tokens) for LLM $m$.
- The term $t_d^m \cdot l_o^m$ represents the total decoding latency for a request of LLM $m$.
- The denominator approximates the total latency for a batch of requests.
- The fraction estimates the throughput based on the batch size and this latency.
- Taking the minimum ensures that the estimated throughput does not exceed the actual request rate in the workload.

The prefill and decoding latencies ($t_p$, $t_d$) can be profiled in advance, and the average generation length ($l_o$) can be estimated from historical request data or datasets like ShareGPT. Given the request arrival rates, a binary search can find the batch size that satisfies the traffic demand, as sketched below.
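The estimator and the batch-size search can be expressed in a few lines; the latencies and generation length below are placeholders that would come from profiling and historical traces, not values reported in the paper:

```python
def estimate_tpt(batch_m, prefill_latencies, t_d, l_o, rate_m):
    # tpt = min( b^m / (sum_i t_p^i + t_d^m * l_o^m), arrival rate of m in the workload )
    return min(batch_m / (sum(prefill_latencies) + t_d * l_o), rate_m)

def min_batch_for_rate(rate_m, prefill_latencies, t_d, l_o, max_bs=256):
    lo, hi = 1, max_bs                       # binary search the smallest sufficient batch size
    while lo < hi:
        mid = (lo + hi) // 2
        if estimate_tpt(mid, prefill_latencies, t_d, l_o, float("inf")) >= rate_m:
            hi = mid
        else:
            lo = mid + 1
    return lo

# Placeholder numbers: 3 colocated LLMs' prefill latencies (s), 30 ms/token decode, 338 output tokens.
print(min_batch_for_rate(2.0, prefill_latencies=[0.12, 0.08, 0.20], t_d=0.03, l_o=338))
```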
4.2.4. Resource Manager
After determining optimal LLM units and scheduling strategies, MuxServe requires a specialized resource management mechanism for flexible and efficient spatial-temporal multiplexing. The challenges include dynamic sharing of computation resources among different prefill and decoding jobs and efficient sharing of memory (model weights and KV cache) to reduce waste and fragmentation. MuxServe addresses this with a unified resource manager.
The overview of GPU resource management in an LLM unit is illustrated in Figure 4.
Figure 4: GPU resource management in an LLM unit. The memory space is partitioned into a unified KV cache, model weights, and activations, with the parallel runtime shown at the bottom. The figure also marks the dynamic allocation of SM resources, with Step 1 and Step 2 showing different allocation states.
Explanation of Figure 4: The figure shows the GPU divided into three memory partitions (left) and how Streaming Multiprocessors (SMs) are dynamically allocated (right).
Computation Resource Management:
- Mechanism: MuxServe's parallel runtime manages computation resources (GPU SMs) at the granularity of individual SMs, based on NVIDIA MPS (Multi-Process Service).
- Dynamic Assignment: Instead of statically allocating SMs, MuxServe dynamically assigns SMs to prefill and decoding jobs from colocated LLMs as scheduled by the ADBS algorithm.
- Flexibility: This dynamic assignment enables flexible multiplexing. As shown in the right part of Figure 4:
  - Step 1: All SMs might be allocated to a single job (e.g., a compute-intensive prefill job) due to its large computational intensity.
  - Step 2: After that job completes or progresses, MuxServe can schedule multiple, potentially less compute-intensive jobs (e.g., several decoding jobs, or a decoding job and a prefill job) to run concurrently, sharing the SM resources and maximizing utilization.

Memory Resource Management:
The main challenge is efficiently sharing LLM weights and the dynamically growing KV cache, which consume huge amounts of memory. MuxServe divides memory into three partitions:
- Unified KV Cache Space (Head-wise Cache):
  - Observation: The size of each attention head (a component within the Transformer's attention mechanism) is often consistent or has limited variation across different LLMs (e.g., LLaMAs and GPT-3 often use 128 dimensions per head).
  - Mechanism: MuxServe divides the global KV cache table into small, fixed-size blocks. Each block is designed to hold the KV cache (keys and values) of one attention head for a fixed number of tokens.
  - Benefit: This head-wise granularity allows MuxServe to accommodate the KV cache of different LLMs (which may have different total numbers of heads or hidden layers) within a unified, shared memory space. This reduces memory waste and fragmentation compared to reserving separate KV cache regions for each LLM, and enables dynamic adjustment of cache allocation during runtime.
  - Distinction from vLLM's PagedAttention: While vLLM's PagedAttention improves memory utilization for a single LLM by managing the KV cache in pages, MuxServe's unified KV cache specifically addresses the more complex scenario of multiple heterogeneous LLMs sharing the cache. It allows flexible allocation across LLMs, adapting to bursty workloads with minimal overhead.
- Model Weights Partition:
  - Mechanism: This partition stores a single replica of the model weights for each LLM.
  - Benefit: These weights can be shared among all prefill and decoding jobs of that specific LLM within the unit, reducing memory redundancy.
- Activation Partition:
  - Mechanism: This partition reserves space for intermediate activations (outputs of layers) that are generated and used during inference. These are temporary and typically discarded after use.

Implementation Details:
- MuxServe is built on top of vLLM (Kwon et al., 2023), leveraging its efficient single-LLM serving capabilities.
- It uses NVIDIA MPS (NVIDIA, 2022b) for partitioning SM resources.
- A global scheduler runs on each LLM unit. It manages a request queue and implements the ADBS algorithm (Algorithm 3) to schedule requests.
- MuxServe disaggregates prefill and decoding phases by launching separate vLLM processes for each. These processes are configured with different numbers of SM resources via NVIDIA MPS.
- The global scheduler communicates with these runtime engine processes and sends them prefill or decoding jobs using Python multiprocessing shared memory.
- A dedicated memory manager process (implemented in C++ for efficiency) manages the unified memory space, including the head-wise KV cache. Runtime engines request new cache blocks from this manager when needed and fill the KV cache into the assigned blocks. They access this shared memory space via CUDA IPC handles (Inter-Process Communication). KV cache blocks allocated for prefill jobs are maintained by the memory manager for the subsequent decoding jobs and are only released when the entire request is finished.
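For context on the MPS mechanism described above, the sketch below launches two worker processes with different SM budgets using the standard `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` environment variable (an MPS control daemon must already be running). The worker script names and percentages are illustrative, not MuxServe's actual launcher:

```python
import os
import subprocess

def launch_worker(script, sm_percentage):
    """Start a CUDA process restricted to a fraction of the GPU's SMs via MPS."""
    env = os.environ.copy()
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(sm_percentage)
    return subprocess.Popen(["python", script], env=env)

# Hypothetical split: a compute-heavy prefill engine gets 70% of the SMs,
# a decoding engine gets 30%; both clients attach to the same MPS daemon.
prefill_proc = launch_worker("prefill_engine.py", 70)
decode_proc = launch_worker("decode_engine.py", 30)
```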
5. Experimental Setup
5.1. Datasets
MuxServe evaluates its performance using both synthetic and real-world workloads.
Synthetic Workloads
- LLM Models: The experiments use LLaMA models, a popular open-source LLM architecture, categorized into four size buckets as shown in the following table. In total, 19 LLaMA models of varying scales are served.

| Scale | 4B-8B | 8B-21B | 21B-41B | 41B-70B |
|---|---|---|---|---|
| #LLMs | 12 | 4 | 2 | 1 |

- Request Rates:
  - Request rates for each LLM are generated using a power-law distribution with an exponent α. This distribution models how popularity varies among LLMs, where a few LLMs are highly popular and many are less popular.
  - A larger α indicates a more skewed popularity distribution, meaning fewer LLMs receive a disproportionately higher share of total requests; for example, at the smaller evaluated α values 20% of the LLMs receive about 50% of the requests, while at the larger α values 20% of the LLMs receive about 90% of the requests (Figure 6).
  - The maximal request rate for each LLM is initially set to 20 req/s, and then scaled up to evaluate diverse workloads.
- Arrival Time: Requests are sampled using Poisson processes, which simulate random and independent arrival events, typical of real-world server traffic.
- Request Content: The requests (prompts and desired output lengths) are sampled from ShareGPT (ShareGPT-Team, 2023), a dataset of real-world conversations with LLMs. This ensures realistic prompt lengths and output-generation characteristics; the paper notes an average prompt length of 161 tokens and an average output length of 338 tokens in ShareGPT.
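A possible way to generate such a synthetic workload (power-law request rates plus Poisson arrivals) is sketched below; the sampling code is an assumption for illustration, not the authors' script:

```python
import numpy as np

def synthetic_workload(num_llms=19, alpha=1.0, max_rate=20.0, duration_s=60, seed=0):
    rng = np.random.default_rng(seed)
    # Power-law popularity: the i-th most popular LLM's rate decays as (i)^-alpha.
    rates = max_rate * (np.arange(1, num_llms + 1) ** -alpha)
    workload = {}
    for i, rate in enumerate(rates):
        # Poisson process: exponential inter-arrival times with mean 1/rate.
        gaps = rng.exponential(1.0 / rate, size=int(rate * duration_s * 2))
        arrivals = np.cumsum(gaps)
        workload[f"llm_{i}"] = arrivals[arrivals < duration_s]
    return rates, workload

rates, workload = synthetic_workload()
# Rates of the three most popular LLMs, and the request count of the most popular one.
print(rates[:3], len(workload["llm_0"]))
```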
Real Workloads
- LLM Models & Workloads: MuxServe samples LLMs and workloads from the ChatLMSYS trace. ChatLMSYS is a web application that serves multiple LLMs of different scales, providing a representative real-world traffic pattern.
- Configuration: For the real workload evaluation, 16 LLMs are served on a 32-GPU cluster. The popularity distribution reflects a realistic scenario where 20% of the popular LLMs get 50% of the request traffic.
- Scaling: The rates from the ChatLMSYS trace are rescaled to evaluate MuxServe under different load conditions.
5.2. Evaluation Metrics
The effectiveness of MuxServe is evaluated using the following metrics:
Aggregated Throughput
- Conceptual Definition: Throughput measures the rate at which a system can process requests. When serving multiple LLMs with different arrival rates, aggregated throughput represents the total effective processing capacity across all LLMs in the system. It is often calculated as a weighted average to reflect the actual demand for each LLM.
- Mathematical Formula: While the paper does not provide a specific formula for "aggregated throughput," in a multi-model serving context with varying arrival rates it typically refers to the sum of requests processed per second across all LLMs. If weights $w_m$ are applied based on arrival rates, it could be: $ \text{Aggregated Throughput} = \sum_{m \in M} w_m \cdot \text{Throughput}_m $
- Symbol Explanation:
  - $\text{Aggregated Throughput}$: The total effective request processing rate across all served LLMs.
  - $M$: The set of all LLMs being served.
  - $w_m$: A weight associated with LLM $m$, often proportional to its request arrival rate or importance.
  - $\text{Throughput}_m$: The number of requests processed per second for LLM $m$.
SLO Attainment
- Conceptual Definition: Service Level Objective (SLO) attainment measures the percentage of requests that successfully complete within a predefined latency target. It reflects the quality of service provided by the system.
- Mathematical Formula: $ \text{SLO Attainment} = \frac{\text{Number of requests completed within SLO latency}}{\text{Total number of requests}} \times 100\% $
- Symbol Explanation:
  - $\text{SLO Attainment}$: The percentage of requests that meet the specified latency objective.
  - Number of requests completed within SLO latency: The count of requests whose end-to-end latency is less than or equal to the defined SLO latency target.
  - Total number of requests: The total number of requests processed during the evaluation period.
- SLO Scale: In this paper, the latency target is scaled to different multiples of the single-device execution latency (e.g., SLO scale 8 means the target latency is 8 times the single-device latency).
P99 Latency
- Conceptual Definition: P99 latency (99th percentile latency) is the latency value below which 99% of all observed request latencies fall. It is a critical metric for understanding the worst-case user experience, as it captures the latency faced by almost all users, excluding only the very slowest 1%.
- Mathematical Formula: P99 latency is typically determined by sorting all observed latencies in ascending order and picking the value at the 99th-percentile position. There isn't a simple algebraic formula, but rather a statistical calculation. If $L$ is the list of all observed latencies sorted in ascending order, then: $ \text{P99 Latency} = L_{\lceil 0.99 \times N \rceil} $
- Symbol Explanation:
  - $\text{P99 Latency}$: The latency value at the 99th percentile.
  - $L$: The sorted list of observed latencies.
  - $N$: The total number of observed latencies.
  - $\lceil \cdot \rceil$: The ceiling function, which rounds a number up to the nearest integer.
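Both latency-centric metrics are straightforward to compute from a list of observed request latencies; a short sketch with made-up numbers:

```python
import numpy as np

latencies = np.array([0.8, 1.2, 0.9, 3.5, 1.1, 0.7, 2.8, 1.0])   # seconds, illustrative only
slo_target = 2.0                                                   # e.g. some multiple of single-device latency

slo_attainment = 100.0 * np.mean(latencies <= slo_target)          # % of requests within the SLO
p99_latency = np.percentile(latencies, 99)                         # 99th-percentile latency

print(f"SLO attainment: {slo_attainment:.1f}%, P99 latency: {p99_latency:.2f}s")
```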
5.3. Baselines
MuxServe's performance is compared against two primary baselines:
- Spatial Partitioning:
  - Description: This is the most common approach for serving multiple large models. It involves statically allocating separate, dedicated groups of GPUs for each LLM. Each LLM runs in isolation on its assigned resources.
  - Implementation: Each LLM is served using vLLM (Kwon et al., 2023), a state-of-the-art single-LLM serving framework known for its efficiency, particularly due to PagedAttention for KV cache management. This ensures a strong baseline for individual LLM performance.
  - Why it's representative: It reflects the industry-standard practice of providing dedicated resources, which often leads to underutilization under varying workloads.
- Temporal Multiplexing (AlpaServe-like):
  - Description: This baseline attempts to share GPU resources over time, similar to AlpaServe (Li et al., 2023), but adapted for multiple LLMs. It focuses on partitioning models onto shared GPUs and scheduling requests in an interleaved manner.
  - Implementation: Since AlpaServe is not designed for multiplexing multiple LLMs (it focuses on large DNNs), MuxServe implements this baseline as follows:
    - Colocation: LLMs are colocated on shared GPUs using the same placement optimization as MuxServe's placement algorithm. This ensures a fair comparison regarding model grouping.
    - Scheduling: Requests are scheduled using a first-come-first-serve (FCFS) policy in a temporal manner, meaning the entire GPU resource is dedicated to one LLM's batch at a time.
    - Batching: Employs continuous batching for each LLM's requests, a technique used in vLLM to dynamically add new requests to a batch even if previous requests in the batch are still generating tokens.
    - KV Cache: Utilizes MuxServe's unified KV cache mechanism, ensuring that memory management is not a bottleneck in the comparison and keeping the focus on the scheduling differences.
  - Why it's representative: It represents a strong temporal sharing approach that shares resources more dynamically than spatial partitioning, but without MuxServe's fine-grained awareness of LLM inference phases.
5.4. Cluster
The experiments are conducted on a cluster with the following specifications:
- Nodes: 4 computing nodes.
- GPUs per node: Each node is equipped with 8 NVIDIA A100 GPUs (80GB VRAM each). This totals 32 A100 GPUs in the cluster.
- Intra-node connection: 600GB/s NVLink, providing high-bandwidth communication between GPUs within the same node.
- Inter-node connection: 200Gbps InfiniBand (IB), providing high-speed communication between different nodes in the cluster.
6. Results & Analysis
6.1. Core Results Analysis
End-to-End Results for Synthetic Workloads
The synthetic workload experiments (Figure 5) demonstrate MuxServe's superior performance in terms of both throughput and SLO attainment across varying popularity distributions (different α values) and average request rates.
Figure 5: Throughput and SLO attainment of MuxServe, spatial partitioning, and temporal multiplexing under different α values. As the rate scale increases, MuxServe outperforms the other methods in throughput and also achieves strong SLO attainment.
- Throughput (Left Chart): MuxServe consistently achieves higher throughput than both Spatial Partitioning and Temporal Multiplexing in all scenarios.
  - It achieves up to a 1.8x improvement in throughput.
  - When α is small (more even request rates among LLMs), MuxServe efficiently colocates prefill-decoding and decoding-decoding jobs, indicating its ability to fill GPU compute cycles more effectively.
  - When α is large (skewed popularity, few popular LLMs), MuxServe's strategy of colocating popular LLMs with unpopular ones to multiplex memory resources (using more SMs and larger KV caches for popular LLMs) yields higher throughput.
- SLO Attainment (Right Chart): MuxServe demonstrates its ability to process more requests within the 99% SLO target.
  - It processes up to 2.9x more requests while maintaining 99% SLO attainment.
  - For small SLO scales (tight latency targets) and small α, MuxServe shows slightly lower SLO attainment. This is attributed to the inherent interference introduced by colocation, which can cause minor latency variations.
  - For larger α, MuxServe achieves higher SLO attainment for popular LLMs. This is because the efficient multiplexing allows popular LLMs to receive more resources, reducing their individual latencies and improving their success rate within the SLO.
  - The results highlight that MuxServe's improvement is more significant when LLM popularity differences are greater (larger α), confirming the benefit of its popularity-aware resource multiplexing.
End-to-End Results for Real Workloads
Evaluation on real workloads sampled from ChatLMSYS trace (Figure 7) further validates MuxServe's effectiveness under realistic conditions.
Figure 7: Throughput (req/s, left) and SLO attainment (%, right) of Spatial, Temporal, and MuxServe under different average request rates; MuxServe performs well across all rates.
- Throughput (Left Chart): MuxServe consistently achieves higher throughput than both Spatial Partitioning and Temporal Multiplexing across varying average rates.
- SLO Attainment (Right Chart): MuxServe maintains a higher SLO attainment at different average rates.
  - Notably, at an average rate of 4.8 req/s, Temporal Multiplexing performs poorly. This is explained by several medium-rate LLMs being colocated on a large mesh: Temporal Multiplexing struggles to efficiently multiplex these LLMs, leading to lower utilization and worse SLO attainment. MuxServe, with its phase-aware multiplexing and adaptive scheduling, handles this scenario much better.
P99 Latency Comparison
The P99 latency comparison (Figure 11, Appendix A.1) provides insights into the tail latency performance.
Figure 11: Comparison of average latency, TPOT, and TTFT across rate scales. MuxServe (green) shows clear advantages over spatial partitioning (blue) and temporal multiplexing (orange), especially on average latency and TPOT.
- The P99 Time to First Token (TTFT) latency for MuxServe is generally competitive.
- When α is small (even popularity), MuxServe's P99 TTFT latency can be slightly higher than Temporal Multiplexing's, likely due to the increased interference from concurrent prefill and decoding jobs, even with efficient multiplexing. However, it is still significantly lower than Spatial Partitioning's.
- When α is large (skewed popularity), the P99 TTFT latency is dominated by the most popular model. In this case, Temporal Multiplexing can sometimes reduce queuing time for these popular models, leading to a lower P99 TTFT than Spatial Partitioning. MuxServe, by giving more resources to popular LLMs, also performs well, often comparable to or better than temporal multiplexing for the dominant models, contributing to better overall SLO attainment.
6.2. Data Presentation (Tables)
The following are the results from Table 1 of the original paper:
| Scale | 4B-8B | 8B-21B | 21B-41B | 41B-70B |
|---|---|---|---|---|
| #LLMs | 12 | 4 | 2 | 1 |
This table shows the distribution of LLMs of different scales used in the synthetic workload experiments, with 12 smaller LLMs, 4 medium LLMs, 2 larger LLMs, and 1 very large LLM.
6.3. Ablation Studies
Effectiveness of Placement Algorithm
An ablation study compares MuxServe's enumeration-based greedy placement algorithm against a simpler greedy algorithm. The simpler greedy algorithm prioritizes LLMs with high arrival rates and assigns them to meshes with the largest available free memory.
Chart: throughput comparison between MuxServe's placement algorithm and the greedy baseline under different GPU counts. Left: 8 GPUs and 4 LLMs; right: 16 GPUs and 7 LLMs. The proposed algorithm achieves higher throughput.
- Results: MuxServe's placement algorithm achieves noticeably higher throughput than the simpler greedy algorithm, with the largest gap appearing in the right subfigure (16 GPUs, 7 LLMs).
- Insight Verification: This confirms MuxServe's insight to prioritize placement selection for LLMs with large computation requirements (considering both model scale and popularity).
- Example (8 GPUs, 4 LLMs): MuxServe colocates two popular small LLMs with one less popular small LLM on 4 GPUs, while allocating 4 GPUs for a less popular large LLM.
- Example (16 GPUs, 7 LLMs): MuxServe colocates two large LLMs with low arrival rates on 4 GPUs, and partitions the remaining GPUs into groups of 2 or 4. It then places the other 5 popular LLMs based on their computational demands.
- Contrast with Memory-Prioritized Placement: If placement prioritized large memory requirements (to balance memory consumption), large LLMs might be placed on large GPU groups without considering their low popularity. This could leave popular smaller LLMs struggling to fit into remaining memory or requiring massive GPUs, which is inefficient. MuxServe's computation-centric prioritization is shown to be more effective.
Effectiveness of ADBS (Adaptive Batch Scheduling)
The Adaptive Batch Scheduling (ADBS) algorithm is compared against First Come First Serve (FCFS) and Round-Robin scheduling in two different serving settings (Figure 9).
Figure 9: KV cache (token block) usage on four GPUs under FCFS, Round-Robin, and ADBS scheduling, for settings serving three and two LLMs, annotated with the relative proportion of token block usage over time for each method.
- Results:
  - In scenario (a) (LLaMA-30B, 13B, and 7B served at different arrival rates), ADBS achieves a throughput of 6.2 req/s, outperforming Round-Robin (4.1 req/s) and FCFS (3.8 req/s), roughly a 1.5x improvement over Round-Robin and 1.6x over FCFS.
  - In scenario (b) (LLaMA-65B and 30B served at different arrival rates), ADBS achieves 6.6 req/s, outperforming Round-Robin (4.9 req/s) and FCFS (3.2 req/s), roughly a 1.35x improvement over Round-Robin and 2.1x over FCFS.
- Fairness and Utilization:
  - ADBS's token block usage is more closely aligned with the distribution of arrival rates, indicating fairer memory resource sharing. The relative-proportion annotations in the figure illustrate this alignment.
  - The superior performance of ADBS stems from its ability to adapt token block quotas and prioritize prefill jobs while filling the remaining capacity with decoding jobs (see the scheduling sketch below). This prevents unfair sharing of KV cache blocks, which would significantly hurt the performance of popular LLMs.
  - FCFS is shown to be inefficient in multiplexing different LLMs due to its lack of resource awareness.
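The following is a minimal, hedged sketch of the ADBS scheduling idea: enforce per-LLM token block quotas, admit prefill jobs first, then pack decoding jobs into the remaining capacity. The queue structures, quota bookkeeping, and function name are illustrative assumptions rather than MuxServe's actual implementation.

```python
# Hedged sketch of ADBS-style scheduling: enforce per-LLM token block quotas,
# admit prefill jobs first, then pack decoding jobs into the remaining capacity.
from collections import deque

def schedule_step(prefill_queues, decode_queues, quotas, used_blocks, batch_capacity):
    """Select jobs for one scheduling step.

    prefill_queues / decode_queues: dict[llm, deque[int]] of pending jobs,
        each entry being the number of KV cache blocks the job needs.
    quotas: dict[llm, int] -- max KV cache blocks each LLM may hold; in ADBS
        these would be periodically re-proportioned to observed arrival rates.
    used_blocks: dict[llm, int] -- blocks currently held by each LLM.
    batch_capacity: maximum number of jobs executed in this step.
    """
    batch = []

    def admit(queues):
        for llm, queue in queues.items():
            while queue and len(batch) < batch_capacity:
                blocks_needed = queue[0]
                if used_blocks[llm] + blocks_needed > quotas[llm]:
                    break  # quota exhausted: this LLM's remaining jobs wait
                queue.popleft()
                used_blocks[llm] += blocks_needed
                batch.append((llm, blocks_needed))

    admit(prefill_queues)   # prefill jobs take priority
    admit(decode_queues)    # decoding jobs fill the remaining capacity
    return batch

# Toy usage: two LLMs, quotas skewed toward the more popular one.
prefill = {"llm-a": deque([4, 4]), "llm-b": deque([4])}
decode = {"llm-a": deque([1, 1, 1]), "llm-b": deque([1])}
print(schedule_step(prefill, decode, {"llm-a": 12, "llm-b": 6},
                    {"llm-a": 0, "llm-b": 0}, batch_capacity=5))
```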
Effectiveness of Resource Manager
The ablation study for the unified resource manager (Figure 10) evaluates the impact of its computation and memory management components.
This figure shows throughput (left) and SLO attainment (right) for different methods under varying values of the workload parameter. MuxServe achieves the highest throughput in the left subfigure and maintains a high SLO attainment in the right subfigure, demonstrating its advantage in multi-model serving.
- Setup: 4 LLMs served on 4 GPUs, with arrival rates following a power-law distribution.
- Results:
  - Computation Management (Separating Prefill/Decoding): By separating the prefill and decoding phases and managing computation resources dynamically, MuxServe achieves a clear throughput improvement. This highlights the gains from efficiently scheduling these distinct job types (a hedged sketch of SM partitioning via NVIDIA MPS appears after this list).
  - Unified Memory Manager: Further adding the unified memory manager (including the head-wise cache and dynamic KV cache reallocation) provides an additional throughput gain and a substantial improvement in SLO attainment. This demonstrates the critical role of efficient memory sharing and dynamic adaptation to KV cache demands in multi-LLM serving.
- Conclusion: The results clearly show that both dynamic computation resource management and intelligent unified memory management are crucial for MuxServe's overall performance benefits.
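As a concrete illustration of the computation-management side, the sketch below launches worker processes with different SM shares using NVIDIA MPS's `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` environment variable. The worker script name `worker.py` and the specific percentages are placeholders for illustration; MuxServe's actual orchestration is more involved.

```python
# Hedged sketch: give prefill and decoding workers different SM shares under
# NVIDIA MPS by setting CUDA_MPS_ACTIVE_THREAD_PERCENTAGE per client process.
# "worker.py" and the percentages below are placeholders, not MuxServe's code.
import os
import subprocess

def launch_worker(role: str, sm_percentage: int) -> subprocess.Popen:
    env = os.environ.copy()
    # MPS limits this client to roughly the given fraction of the GPU's SMs.
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(sm_percentage)
    return subprocess.Popen(["python", "worker.py", "--role", role], env=env)

if __name__ == "__main__":
    # Example split: compute-heavy prefill gets the larger share this interval;
    # a scheduler could adjust these shares as the workload mix changes.
    prefill = launch_worker("prefill", sm_percentage=70)
    decode = launch_worker("decode", sm_percentage=30)
    prefill.wait()
    decode.wait()
```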
7. Conclusion & Reflections
7.1. Conclusion Summary
MuxServe introduces a novel and highly effective system for serving multiple Large Language Models (LLMs) concurrently through flexible spatial-temporal multiplexing. The system's core innovation lies in its ability to simultaneously leverage two key characteristics: the varying popularity of LLMs for efficient memory resource multiplexing, and the distinct computational demands of LLM prefill and decoding phases for flexible computation resource multiplexing.
The paper formally formulates this multiplexing problem and provides practical solutions through:
- An enumeration-based greedy placement algorithm that optimizes the grouping of LLMs into LLM units on GPU meshes, prioritizing LLMs with high computational requirements.
- An adaptive batch scheduling (ADBS) strategy that maximizes intra-unit throughput by intelligently scheduling prefill and decoding jobs, while ensuring fairness through token block quotas and dynamically reallocating KV cache blocks.
- A unified resource manager that uses NVIDIA MPS for dynamic Streaming Multiprocessor (SM) allocation and a novel head-wise cache for efficient, flexible, and unified KV cache memory sharing across heterogeneous LLMs.

Evaluation results, on both synthetic and real-world workloads, robustly demonstrate MuxServe's superior performance: up to 1.8x higher throughput, or 2.9x more requests processed within 99% SLO attainment, compared to state-of-the-art spatial partitioning and temporal multiplexing baselines.
7.2. Limitations & Future Work
The paper's "Impact Statement" suggests no specific highlight-worthy societal consequences, and the conclusion does not explicitly list limitations or future work. However, based on the problem formulation and methodology, potential limitations and implicit future work directions can be inferred:
- Combinatorial Complexity of Placement: While the enumeration-based greedy placement algorithm uses heuristics to prune the search space (e.g., within-node parallelism, workload constraints), the underlying problem (Equation 1) is a combinatorial optimization problem. For extremely large clusters or a very high number of diverse LLMs, the search for "optimal" mesh groups (get_potential_device_mesh_groups) could still be computationally intensive or rely heavily on the quality of the pruning heuristics. Future work might explore more sophisticated optimization techniques or machine learning-based approaches for dynamic placement.
- Analytical Throughput Estimator Accuracy: The throughput estimator (Equation 3) is described as "simple yet effective." Its accuracy depends on pre-profiled prefill and decoding latencies and estimated average generation lengths (an illustrative sketch of such an estimator follows this list). In highly dynamic real-world scenarios with unpredictable prompt/output lengths or rapidly changing LLM characteristics, the estimator's accuracy might fluctuate, potentially impacting the effectiveness of placement and scheduling decisions. Dynamic, real-time profiling or more adaptive estimation models could be explored.
- Fairness Metric Nuances: The fairness constraint (Equation 2) uses a threshold on the difference in normalized resource consumption measured by token block usage. While token block usage is a good proxy for KV cache memory, fairness is a multi-dimensional concept (e.g., equal latency, equal throughput share, proportional resource usage). The choice of threshold and the specific normalization method might require tuning for different organizational priorities. Future work could investigate more sophisticated, user-defined fairness objectives.
- Overhead of Resource Management: While MuxServe is designed for efficiency, the overheads associated with NVIDIA MPS for dynamic SM partitioning, Python multiprocessing shared memory, and the CUDA IPC handler for inter-process communication (especially for the C++ memory manager) are inherent to its disaggregated architecture. Although stated to be minimal, these overheads could become a factor in extremely low-latency scenarios or with very frequent job switching.
- Generalizability to Diverse LLM Architectures: The evaluation primarily uses LLaMA models. While the principles are broadly applicable to autoregressive Transformers, the specific head-wise cache design leverages the commonality of attention head sizes. Future work could examine generalizability and the adaptations needed for drastically different LLM architectures or non-autoregressive models.
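As referenced in the estimator bullet above, here is a minimal sketch of what a profiling-based throughput estimate could look like: it combines a pre-profiled prefill latency, a per-token decoding latency, and an assumed average generation length. The functional form, parameter names, and numbers are illustrative assumptions, not the paper's Equation 3.

```python
# Hedged sketch of an analytical throughput estimate from profiled latencies.
# Not the paper's Equation 3; just the general shape such an estimator may take.
def estimated_throughput(batch_size: int,
                         prefill_latency_s: float,
                         decode_latency_per_token_s: float,
                         avg_generation_len: int) -> float:
    """Return estimated requests/second for one batched LLM on its mesh."""
    # Time to finish one batch: one prefill pass plus avg_generation_len decode steps.
    batch_time = prefill_latency_s + decode_latency_per_token_s * avg_generation_len
    return batch_size / batch_time

# Example with made-up profiled numbers: batch of 8, 120 ms prefill,
# 25 ms per decode step, 200 generated tokens on average.
print(estimated_throughput(8, 0.120, 0.025, 200))  # ~1.56 req/s
```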
7.3. Personal Insights & Critique
MuxServe represents a significant step forward in the practical deployment of LLMs, especially in multi-tenant or multi-model serving environments.
Inspirations and Transferability:
- Phase-Aware Optimization: The most inspiring aspect is the explicit recognition and exploitation of the distinct prefill and decoding phases. This granular understanding of workload characteristics, rather than treating inference as a monolithic task, is a powerful paradigm that could transfer to other complex, multi-stage computational workloads beyond LLMs. For instance, in scientific simulations or data processing pipelines, identifying compute-bound vs. memory-bound stages could enable similar spatial-temporal multiplexing.
- Popularity-Driven Resource Allocation: Dynamically allocating resources based on the real-time popularity of models is crucial for cost-effectiveness, and is directly applicable to any service where demand for different components varies significantly.
- Unified Resource Management (Head-wise Cache): The head-wise KV cache is a clever solution to a highly specific LLM memory problem. The general principle of "identifying a common, fine-grained unit of resource across heterogeneous workloads and then building a unified, dynamic allocation scheme" could inspire memory management solutions for other diverse, memory-intensive applications (a toy head-granularity allocator is sketched after this list).
- System Integration: Building on vLLM and NVIDIA MPS demonstrates a pragmatic approach to system design, leveraging existing robust components while innovating at the orchestration and management layers.
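To illustrate the head-granularity idea mentioned above, the toy allocator below hands out fixed-size cache blocks sized per attention head, so LLMs with different head counts and layer counts can share one pool. The class name, block sizing, and bookkeeping are illustrative assumptions and do not reflect MuxServe's actual memory manager.

```python
# Toy sketch of a head-wise KV cache pool: the allocation unit is a cache block
# for a single attention head, so heterogeneous LLMs can share one free list.
class HeadwiseCachePool:
    def __init__(self, total_blocks: int):
        self.free_blocks = list(range(total_blocks))   # block ids in a shared pool
        self.owner = {}                                 # block id -> llm name

    def allocate(self, llm: str, num_heads: int, num_layers: int):
        """Reserve one block per (layer, head) for a new sequence; None if full."""
        needed = num_heads * num_layers
        if needed > len(self.free_blocks):
            return None  # caller may evict, wait, or shrink another LLM's quota
        blocks = [self.free_blocks.pop() for _ in range(needed)]
        for b in blocks:
            self.owner[b] = llm
        return blocks

    def release(self, blocks) -> None:
        for b in blocks:
            self.owner.pop(b, None)
            self.free_blocks.append(b)

# Two LLMs with different shapes drawing from the same pool.
pool = HeadwiseCachePool(total_blocks=8192)
small = pool.allocate("llama-7b", num_heads=32, num_layers=32)
large = pool.allocate("llama-30b", num_heads=52, num_layers=60)
```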
Potential Issues, Unverified Assumptions, or Areas for Improvement:
- "Goodness" of the Greedy Algorithm: While the greedy placement algorithm is shown to outperform a simpler greedy baseline, there is no guarantee that it finds the globally optimal solution given the combinatorial nature of the problem. The quality of the solution may depend heavily on the sorting key and the effectiveness of the throughput estimator. The paper acknowledges this by stating that it finds a "good solution efficiently" rather than an optimal one.
- Dynamic Workload Prediction: The system relies on workload characteristics (request arrival rates and generation lengths) for placement and adaptive scheduling. While real-time monitoring and adaptive quota adjustment help, accurately predicting future request rates and generation lengths remains a challenge in dynamic environments. How MuxServe handles sudden, unforeseen shifts in LLM popularity or request patterns could be a point of further investigation.
- Overhead of Inter-Process Communication: The disaggregation into vLLM processes for prefill and decoding, managed by a global scheduler and memory manager via shared memory and CUDA IPC, implies a certain level of IPC overhead. While shared memory and CUDA IPC are efficient, they are not zero-cost. For extremely low-latency applications, the balance between multiplexing gains and IPC overhead needs careful consideration.
- Complexity for System Administrators: While MuxServe simplifies efficient serving, the underlying system (NVIDIA MPS, multiple vLLM processes, a C++ memory manager, Python shared memory) is inherently complex. Deploying, monitoring, and debugging such a system would require significant operational expertise.
- Energy Efficiency: While throughput and SLO attainment are the primary metrics, energy consumption is increasingly important. It would be interesting to see how MuxServe's multiplexing strategies affect overall energy efficiency compared to the baselines, especially when GPUs are kept at higher utilization levels.
Overall, MuxServe presents a well-reasoned and empirically validated solution to a pressing problem in LLM infrastructure. Its core insights are robust and offer valuable lessons for designing efficient systems for complex, heterogeneous AI workloads.