
MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving

Published: 04/02/2024

TL;DR Summary

MuxServe is introduced as a flexible spatial-temporal multiplexing system for efficiently serving multiple LLMs by colocating them based on popularity, optimizing resource usage and achieving up to 1.8x throughput increase or processing 2.9x more requests.

Abstract

Large language models (LLMs) have demonstrated remarkable performance, and organizations are racing to serve LLMs of varying sizes as endpoints for use-cases like chat, programming and search. However, efficiently serving multiple LLMs poses significant challenges for existing approaches due to the varying popularity of LLMs. In the paper, we present MuxServe, a flexible spatial-temporal multiplexing system for efficient multiple LLM serving. The key insight behind it is to colocate LLMs considering their popularity to multiplex memory resources, and to leverage the characteristics of prefill and decoding phases to separate and flexibly colocate them to multiplex computation resources. MuxServe formally formulates the multiplexing problem, and proposes a novel placement algorithm and adaptive batch scheduling strategy to identify optimal colocations and maximize utilization. MuxServe designs a unified resource manager to enable flexible and efficient multiplexing. Evaluation results show that MuxServe can achieve up to 1.8× higher throughput or process 2.9× more requests within 99% SLO attainment. The code is available at: https://github.com/hao-ai-lab/MuxServe.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of the paper is "MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving." It focuses on improving the efficiency of serving multiple Large Language Models (LLMs) by intelligently sharing computational and memory resources.

1.2. Authors

The authors are Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, and Hao Zhang. Their affiliations span various research institutions, including Shanghai AI Laboratory, CUHK Interdisciplinary AI Research Institute, and the Centre for Perceptual and Interactive Intelligence (CPII) Ltd, indicating a strong academic and research background in AI and systems.

1.3. Journal/Conference

The paper was published on arXiv, a preprint server for scientific papers, on 2024-04-02. arXiv is widely recognized in the scientific community for disseminating research rapidly, though papers posted there are typically not peer-reviewed until later submission to a conference or journal.

1.4. Publication Year

2024

1.5. Abstract

The paper addresses the challenge of efficiently serving multiple Large Language Models (LLMs) of varying sizes and popularities. Existing methods struggle with significant resource underutilization. The authors introduce MuxServe, a system that employs flexible spatial-temporal multiplexing. Its core insights involve colocating LLMs based on their popularity to multiplex memory resources, and separating/flexibly colocating prefill and decoding phases to multiplex computation resources. MuxServe formally defines the multiplexing problem and proposes a novel placement algorithm along with an adaptive batch scheduling strategy to optimize resource utilization. A unified resource manager is designed to enable this flexible and efficient multiplexing. Evaluation results demonstrate that MuxServe achieves up to 1.8× higher throughput or processes 2.9× more requests while maintaining a 99% Service Level Objective (SLO) attainment.

The paper is available at https://arxiv.org/abs/2404.02015 as a preprint on arXiv. PDF link: https://arxiv.org/pdf/2404.02015v2.pdf

2. Executive Summary

2.1. Background & Motivation

The proliferation of Large Language Models (LLMs) for diverse applications like chat, programming, and search has led organizations to serve multiple LLMs as endpoints. However, LLMs, especially large ones, demand significant computational resources; for instance, a single 175B LLM requires eight A100 (80GB) GPUs. Efficiently serving multiple LLMs simultaneously presents significant challenges due to their varying sizes, resource requirements, and dynamic popularity among users. This leads to substantial inference costs and underutilization of expensive GPU resources.

Existing approaches fall short:

  • Spatial Partitioning: This method dedicates separate groups of GPUs to each LLM. While simple, it often results in severe underutilization, particularly for less popular LLMs with sparse request patterns, leaving GPUs idle (as depicted in Figure 1a). Conversely, popular LLMs can become performance bottlenecks.

  • Temporal Multiplexing: This approach shares a group of GPUs among multiple models by interleaving their execution. While it can reduce latency for bursty workloads, it typically allocates the entire computation resource to one model at a time. It overlooks the distinct characteristics of the prefill and decoding phases of LLM inference. The decoding phase, which dominates inference time, often underutilizes GPUs when run in isolation, leading to wave-shaped utilization with frequent troughs (as depicted in Figure 1b and Figure 3).

    The core problem is to efficiently serve a dynamic mix of LLMs with varying popularity and resource demands on a shared cluster, maximizing GPU utilization and throughput while meeting Service Level Objectives (SLOs).

The paper's innovative idea stems from two key insights:

  1. The prefill and decoding phases of LLM inference have distinct computational characteristics. Separating and flexibly colocating jobs from these phases (even from different LLMs) can multiplex computation resources.

  2. LLMs exhibit varying popularity. Colocating LLMs considering their popularity can effectively multiplex memory resources. For example, a popular LLM's decoding phase, which might not fully utilize GPUs, can run concurrently with a less popular LLM's prefill phase. This flexible spatial-temporal multiplexing (Figure 1c) allows for higher GPU utilization and faster request completion.

    The paper aims to leverage these insights to design a system that dynamically adapts to workload changes, leading to significant cost reductions for LLM endpoint providers.

2.2. Main Contributions / Findings

The paper makes the following primary contributions:

  1. First to Explore and Formulate Spatial-Temporal Multiplexing for LLM Serving: MuxServe introduces and formally formulates the problem of flexible spatial-temporal multiplexing tailored for multiple LLM serving, considering the distinct characteristics of prefill and decoding phases and varying LLM popularity.

  2. Novel Placement Algorithm and Adaptive Batch Scheduling Strategy: MuxServe proposes an enumeration-based greedy placement algorithm to identify optimal colocations of LLMs across GPU resources. It also introduces an adaptive batch scheduling (ADBS) strategy that maximizes intra-unit throughput by prioritizing prefill jobs and dynamically reallocating KV cache blocks, ensuring fair resource sharing.

  3. Viable System Design with Unified Resource Manager: The paper details the design and implementation of a unified resource manager. This manager leverages NVIDIA MPS for dynamic SM partitioning and introduces a head-wise cache to enable flexible and efficient sharing of KV cache memory across different LLMs.

    The key findings from the evaluation are:

  • MuxServe significantly outperforms prior state-of-the-art systems, achieving up to 1.8× higher throughput.
  • It can process up to 2.9× more requests while maintaining a 99% SLO attainment, demonstrating both efficiency and reliability.
  • Ablation studies confirm the effectiveness of each proposed component, including the placement algorithm, adaptive batch scheduling, and unified resource manager.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand MuxServe, a beginner should be familiar with the following concepts:

Large Language Models (LLMs)

LLMs are deep learning models, typically based on the Transformer architecture, that are trained on vast amounts of text data to understand, generate, and process human language. They can perform various tasks like question answering, text summarization, translation, and code generation.

  • Transformer Architecture: Introduced by Vaswani et al. (2017), Transformers are neural network architectures that rely heavily on the attention mechanism to weigh the importance of different parts of the input sequence. They consist of encoder and decoder blocks. LLMs primarily use the decoder-only architecture for text generation.
  • Autoregressive Generation: LLMs generate text token by token. Each new token is predicted based on the input prompt and all previously generated tokens. This sequential generation process is fundamental to their operation.

LLM Inference Phases

LLM inference, especially for autoregressive generation, typically involves two distinct phases:

  • Prefill Phase: This is the initial phase where the entire input prompt (which can be hundreds or thousands of tokens long) is processed in parallel. The model computes the representations for all prompt tokens simultaneously and generates the first output token. This phase is typically computation-intensive as it involves processing a long sequence.
  • Decoding Phase (Incremental Decoding): After the first token is generated, the model iteratively generates subsequent tokens one at a time. For each new token, the model takes the previously generated tokens (including the input prompt) as context. This phase is typically memory-intensive due to the growing KV cache and less computation-intensive per token compared to prefill, as only one new token is processed at each step. However, it dominates the total inference time for long outputs.

Key-Value Cache (KV Cache)

During the decoding phase, the Transformer's attention mechanism needs access to the "keys" and "values" computed for all previously processed tokens (both from the prompt and already generated output). To avoid recomputing these for every new token, they are stored in a memory structure called the KV cache. As more tokens are generated, the KV cache grows dynamically, consuming significant GPU memory. The size of the KV cache depends on the sequence length, batch size, number of layers, number of attention heads, and hidden dimension.
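
To build intuition for how quickly the KV cache grows, here is a back-of-the-envelope sizing calculation in Python; the model configuration (32 layers, 32 heads of dimension 128, fp16) and batch/sequence settings are illustrative assumptions, not figures from the paper.

```python
# Back-of-the-envelope KV cache sizing. All model numbers below are
# illustrative assumptions (a hypothetical 7B-class model in fp16),
# not figures taken from the paper.
def kv_cache_bytes(batch_size, seq_len, num_layers, num_heads, head_dim,
                   bytes_per_elem=2):  # fp16 -> 2 bytes per element
    # Factor of 2 accounts for storing both keys and values,
    # per layer, per head, per token.
    return 2 * batch_size * seq_len * num_layers * num_heads * head_dim * bytes_per_elem

size = kv_cache_bytes(batch_size=16, seq_len=2048,
                      num_layers=32, num_heads=32, head_dim=128)
print(f"KV cache ≈ {size / 2**30:.0f} GiB")  # ≈ 16 GiB for this configuration
```

Even this modest configuration consumes a double-digit share of an 80GB GPU, which is why KV cache management is central to LLM serving systems.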

Distributed LLM Inference

Due to their massive size, many LLMs cannot fit into a single GPU's memory or require faster inference than a single GPU can provide. Distributed inference techniques are used to spread the model across multiple GPUs or even multiple machines.

  • Intra-operator Parallelism (Tensor Parallelism): This technique splits individual LLM layers (e.g., attention mechanism, feed-forward networks) across multiple GPUs. For example, the weight matrices of a linear layer might be partitioned, and different GPUs compute parts of the output. This significantly reduces the memory footprint on each GPU and can reduce inference latency. However, it requires frequent communication between GPUs to exchange intermediate results.
  • Inter-operator Parallelism (Pipeline Parallelism): This technique partitions the LLM into multiple stages (groups of layers), with each stage placed on a different GPU. Data flows in a pipeline fashion, where one GPU completes its stage and passes the output to the next GPU. This introduces minimal communication overhead but does not reduce the latency for a single request as much as tensor parallelism; rather, it improves overall throughput by allowing multiple requests to be processed concurrently in different pipeline stages.

GPU Multiplexing

Multiplexing refers to techniques for sharing GPU resources among multiple tasks or models.

  • Spatial Multiplexing: Divides a GPU's physical resources (like memory or Streaming Multiprocessors - SMs) into separate partitions, each allocated to a different task, allowing them to run concurrently.
    • NVIDIA Multi-Instance GPU (MIG): A hardware feature on some NVIDIA GPUs (like A100) that allows physically partitioning a single GPU into multiple smaller, isolated GPU instances. Each MIG instance has its own dedicated memory, compute cores, and caches, providing strong isolation.
    • NVIDIA CUDA Multi-Process Service (MPS): A software layer that allows multiple CUDA applications to share a single GPU by providing concurrent access to its resources. MPS creates a daemon process that arbitrates access to the GPU, effectively partitioning its compute units (SMs) among different processes. It provides finer-grained resource sharing than MIG but with less strict isolation.
  • Temporal Multiplexing: Tasks share the entire GPU (or a group of GPUs) but take turns using it over time. Each task occupies the full computational resource for a specific time interval, and scheduling determines the order and duration of these intervals.

Throughput, Latency, and Service Level Objective (SLO)

These are standard metrics for evaluating system performance:

  • Throughput: The rate at which a system processes requests, typically measured in requests per second (req/s) or tokens per second (tokens/s). Higher throughput indicates better efficiency.
  • Latency: The time taken to complete a single request. Key latency metrics include:
    • Time to First Token (TTFT): Time from request submission to the generation of the first output token.
    • Time per Output Token (TPOT): Average time taken to generate each subsequent output token.
    • End-to-End Latency: Total time from request submission to the completion of all output tokens.
  • Service Level Objective (SLO): A target performance level that a system aims to meet, often expressed as a percentile of latency (e.g., 99% of requests must complete within 500ms). SLO attainment measures the percentage of requests that successfully meet this target.

3.2. Previous Works

The paper contextualizes MuxServe by discussing prior research and existing systems in deep learning (DL) serving and LLM serving, highlighting their limitations for the multi-LLM scenario.

DL Serving Systems (General)

Earlier DL serving systems primarily focused on smaller Deep Neural Network (DNN) models and often employed temporal multiplexing with advanced batching and scheduling strategies to improve GPU utilization and meet SLOs (e.g., Olston et al., 2017; Crankshaw et al., 2017; Shen et al., 2019; Gujarati et al., 2020; Zhang et al., 2023; Romero et al., 2021).

  • Clipper (Crankshaw et al., 2017): An early low-latency online prediction serving system focusing on efficient request batching and scheduling for DNNs.
  • Nexus (Shen et al., 2019): A GPU cluster engine for accelerating DNN-based video analysis, also focusing on scheduling and resource management.
  • AlpaServe (Li et al., 2023): This is a more closely related work, as it explores model parallelism and temporal multiplexing for serving multiple large DNN models. It distributes models across multiple GPUs and schedules requests in an interleaved manner.
    • Differentiation: MuxServe argues that AlpaServe is not specifically designed for LLMs and misses the characteristics of LLM inference phases (prefill vs. decoding). As Figure 3 illustrates, simply reducing computation resources to the dominant decoding phase doesn't significantly hurt latency, implying temporal multiplexing can still lead to underutilization for LLMs. MuxServe’s key differentiator is exploiting these phase-specific characteristics.

LLM Serving Systems (Single LLM Focus)

More recent efforts have focused on optimizing the inference for single LLMs, often incorporating advanced techniques:

  • Custom GPU Kernels: Optimizing Transformer model inference at a low level, e.g., TensorRT-LLM (NVIDIA, 2023b) and LightSeq (Wang et al., 2021b).
  • Distributed Inference for LLMs: Incorporating intra- and inter-operator parallelism to accelerate inference on multiple GPUs, e.g., FasterTransformer (NVIDIA, 2021), DeepSpeed-Inference (Aminabadi et al., 2022), vLLM (Kwon et al., 2023), and Huggingface's Text Generation Inference (TGI, 2023).
    • vLLM (Kwon et al., 2023): A state-of-the-art system known for its PagedAttention mechanism, which significantly improves memory utilization for KV cache by managing it in pages, similar to virtual memory in operating systems.
      • Differentiation: MuxServe states it builds atop vLLM for single LLM optimization, but vLLM's PagedAttention primarily addresses memory utilization for a single LLM. MuxServe's unified KV cache addresses the more complex scenario of multiple LLMs of varying sizes and popularities sharing the cache.
  • Other Optimizations: Memory management (Kwon et al., 2023), offloading (Sheng et al., 2023), iteration-level batching (Yu et al., 2022), speculative decoding (Miao et al., 2023), and using cheap instances (Miao et al., 2024).
    • Differentiation: MuxServe emphasizes that these works focus on optimizing single LLM inference and are largely orthogonal to its contribution of efficient multi-LLM serving. MuxServe aims to complement these by providing a system-level solution for multiplexing.

GPU Sharing (General)

Research on GPU sharing broadly categorizes into temporal and spatial methods:

  • Temporal Sharing:
    • Salus (Yu & Chowdhury, 2019): Proposes fast job switching and memory management for temporal GPU sharing.
    • Wavelet (Wang et al., 2021a), Zico (Lim et al., 2021), AntMan (Xiao et al., 2020): Other systems exploring efficient temporal sharing for DNN training or general GPU workloads.
  • Spatial Sharing:
    • NVIDIA MIG (NVIDIA, 2022a) and MPS (NVIDIA, 2022b): Native NVIDIA tools for spatial partitioning. MuxServe uses MPS as a fundamental building block.
    • GSLICE (Dhakal et al., 2020): A dynamic GPU resource allocation and management framework.
    • Muxflow (Zhao et al., 2023): Efficient and safe GPU sharing in production DL clusters.
    • Gpulet (Choi et al., 2022): Introduces mixed spatial-temporal sharing for multiplexing general jobs.
  • Differentiation: These general GPU sharing works typically target multiplexing smaller DNN jobs or general GPU tasks. MuxServe focuses specifically on the challenges posed by LLMs (huge memory, distinct prefill/decoding phases, autoregressive nature) and their serving application.

3.3. Technological Evolution

The evolution of GPU serving can be traced from:

  1. Early DNN Serving (Small Models): Initially, systems focused on serving individual, relatively small DNNs. Techniques like micro-batching and round-robin scheduling were common.
  2. Large DNN Serving: With larger models, distributed inference (model parallelism) became necessary. Systems like AlpaServe emerged to manage multiple large DNNs through temporal multiplexing.
  3. LLM Serving (Single LLM Optimization): The advent of LLMs introduced new challenges, primarily their massive size and the KV cache memory bottleneck. This led to specialized systems like vLLM, FasterTransformer, and TGI, which optimize single LLM inference with techniques like PagedAttention, custom kernels, and distributed parallelism.
  4. Multi-LLM Serving (MuxServe's Domain): The current frontier, where MuxServe fits, addresses the complex scenario of simultaneously serving a diverse portfolio of LLMs with varying popularity and resource demands. It builds upon previous distributed inference and single-LLM optimizations, but innovates by introducing flexible spatial-temporal multiplexing specifically tailored to the unique characteristics of LLM inference phases and the dynamic nature of multi-LLM workloads. MuxServe attempts to unify the benefits of spatial and temporal sharing, guided by LLM-specific insights.

3.4. Differentiation Analysis

MuxServe's core differentiation from the main methods in related work lies in its flexible spatial-temporal multiplexing specifically designed for multiple LLM serving, informed by two key insights:

  1. Exploiting Prefill/Decoding Phase Characteristics: Unlike general temporal multiplexing systems (e.g., AlpaServe) that treat model inference as a monolithic task, MuxServe explicitly separates prefill and decoding jobs. It recognizes that decoding phases often underutilize GPUs (Figure 3) and can be efficiently colocated with other jobs (e.g., prefill of another LLM), maximizing computation resource utilization. This fine-grained phase awareness is novel for multi-LLM serving.

  2. Popularity-Aware LLM Colocation for Memory Multiplexing: Traditional spatial partitioning ignores LLM popularity, leading to idle resources for unpopular models. MuxServe proactively considers LLM popularity in its placement algorithm, strategically colocating LLMs with high and low request rates to effectively multiplex memory resources (especially the KV cache). This contrasts with systems like vLLM, which optimize KV cache for a single LLM, or general GPU sharing mechanisms that don't account for model-specific popularity or dynamic KV cache growth across multiple large models.

  3. Unified Resource Management for LLMs: While NVIDIA MPS allows SM partitioning and vLLM offers PagedAttention, MuxServe combines and extends these. Its unified resource manager dynamically assigns SMs to prefill and decoding jobs from different colocated LLMs and introduces a novel head-wise KV cache to flexibly share memory across heterogeneous LLMs, which PagedAttention in vLLM is not designed for. This allows for dynamic KV cache reallocation during runtime, adapting to bursty and changing workloads in a multi-LLM context.

    In essence, MuxServe integrates insights from distributed inference, single-LLM optimization, and general GPU sharing, but uniquely synthesizes them into a holistic system that addresses the specific and complex challenges of serving a diverse set of LLMs concurrently and efficiently.

4. Methodology

4.1. Principles

The core idea behind MuxServe is to achieve flexible spatial-temporal multiplexing for multiple Large Language Model (LLM) serving. This is driven by two key insights:

  1. Multiplexing Memory Resources by Colocating LLMs based on Popularity: LLMs exhibit varying levels of popularity and request rates (Figure 2). MuxServe recognizes that colocating LLMs with different popularity profiles allows for more efficient sharing of memory resources. For instance, a less popular LLM might occupy memory for its model weights and some KV cache, but its requests are sparse. By colocating it with a more popular LLM, the memory can be multiplexed, preventing idle memory. The goal is to group LLMs into "LLM units" such that resources are balanced and highly utilized across these units.

  2. Multiplexing Computation Resources by Separating and Flexibly Colocating Prefill and Decoding Phases: The prefill and decoding phases of LLM inference have distinct computational characteristics. The prefill phase is typically compute-intensive, while the decoding phase, especially for a single LLM, often underutilizes the GPU's computational units (SMs) (Figure 3). MuxServe treats these as separate jobs. This allows for flexible colocation: for example, a prefill job from one LLM (which is compute-heavy) can run concurrently with a decoding job from another LLM (which is less compute-heavy), filling the "troughs" of utilization seen in pure temporal multiplexing (Figure 1b vs. 1c).

    MuxServe combines these principles to formulate an optimization problem, solve it with a placement algorithm and an adaptive batch scheduling strategy, and implement it with a unified resource manager.

4.2. Core Methodology In-depth

4.2.1. Problem Formulation

MuxServe first formally defines the multiplexing problem to find optimal colocations and scheduling strategies.

Consider a cluster $C$ and a set of LLMs $M$ with their respective workloads $W$. The primary objective is to group LLMs into LLM units (denoted $B$) such that GPU utilization (i.e., throughput) is maximized. An LLM unit refers to a group of LLMs that will be colocated on a shared set of GPUs.

The overall problem of finding the best group of LLM units $B^*$ is formulated as: $ B^* = \arg\max_{B \in \mathcal{B}} \sum_{b \in B} F(b, W_b) $ Here:

  • $B^*$ represents the optimal group of LLM units.

  • $\mathcal{B}$ denotes the set of all possible groupings of LLM units.

  • $b$ is an individual LLM unit, which comprises a group of colocated LLMs.

  • $W_b$ represents the aggregated workload for the LLM unit $b$.

  • $F(b, W_b)$ is a function that estimates the throughput of a given unit $b$ under its workload $W_b$. This function quantifies how efficiently a specific LLM unit configuration can process its assigned workload.

    Within an LLM unit $b$, which contains a set of colocated LLMs $b_{llm}$, MuxServe needs to determine an optimal batching and scheduling strategy $S$. This strategy aims to maximize the throughput of the entire unit while ensuring fair resource sharing among the individual LLMs within that unit. This sub-problem, estimating $F(b, W_b)$, is formulated as: $ F(b, W_b) = \max_{S} \sum_{m \in b_{llm}} \mathrm{tpt}_S(m, b, W_b) \quad \text{s.t.} \quad |\mathrm{R}(m_i, W_{m_i}) - \mathrm{R}(m_j, W_{m_j})| \le \epsilon, \ \forall m_i, m_j \in b_{llm} $ Here:

  • $F(b, W_b)$ is the throughput of the LLM unit $b$ with workload $W_b$.

  • $S$ represents a specific batching and scheduling strategy.

  • $\mathrm{tpt}_S(m, b, W_b)$ estimates the throughput of an individual LLM $m$ when operating within unit $b$ using strategy $S$ under workload $W_b$. The summation implies that the total throughput of the unit is the sum of the throughputs of its constituent LLMs.

  • $\mathrm{R}(m_i, W_{m_i})$ estimates the normalized resource consumption (e.g., computation or memory) of an LLM $m_i$ with its workload $W_{m_i}$. Normalization is crucial to account for varying LLM scales and popularity, as large and popular LLMs naturally require more resources.

  • The constraint $|\mathrm{R}(m_i, W_{m_i}) - \mathrm{R}(m_j, W_{m_j})| \le \epsilon$ ensures fairness. It stipulates that the difference in normalized resource consumption between any two LLMs $m_i$, $m_j$ within the same unit $b_{llm}$ must be at most a small fairness threshold $\epsilon$. This prevents one LLM from monopolizing resources at the expense of others.
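
To make the objective and fairness constraint concrete, the following minimal Python sketch evaluates a candidate grouping against Equations (1)-(2). The throughput estimator `estimate_unit_throughput` (standing in for F) and `normalized_usage` (standing in for R) are placeholder callables, and the fairness threshold is an assumed value.

```python
# Minimal sketch of the objective in Eq. (1) and the fairness constraint in
# Eq. (2). `estimate_unit_throughput` (F) and `normalized_usage` (R) are
# placeholder callables, and EPSILON is an assumed fairness threshold.
from itertools import combinations

EPSILON = 0.1

def unit_is_fair(unit_llms, workloads, normalized_usage):
    """Check |R(m_i, W_i) - R(m_j, W_j)| <= epsilon for every LLM pair in a unit."""
    usages = [normalized_usage(m, workloads[m]) for m in unit_llms]
    return all(abs(a - b) <= EPSILON for a, b in combinations(usages, 2))

def grouping_objective(units, workloads, estimate_unit_throughput, normalized_usage):
    """Sum of per-unit throughputs F(b, W_b); unfair units contribute nothing."""
    total = 0.0
    for unit_llms in units:
        if unit_is_fair(unit_llms, workloads, normalized_usage):
            total += estimate_unit_throughput(unit_llms, workloads)
    return total
```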

4.2.2. Placement Algorithm

Solving the combinatorial optimization problem of Equation (1) efficiently is challenging due to the exponential growth of possible LLM unit combinations. MuxServe proposes an enumeration-based greedy algorithm to find a good solution efficiently by prioritizing LLMs with large computation requirements (model scale and popularity).

The Enumeration-based Greedy LLM Placement algorithm is detailed below:

Algorithm 1 Enumeration-based Greedy LLM Placement

Input: LLM list M, cluster C, workload W
Output: The optimal group of LLM units best_units

1: M ← llm_parallel_candidate(M, W)  // Algorithm 2
2: D ← get_potential_device_mesh_groups(C, M)
3: best_tpt, best_units ← 0, None
4: for D_i ∈ D do  // Greedily place LLMs on mesh group D_i
5:   M' = sort(M, key=computation, descend=True)
6:   for m ∈ M' do
7:     best_mesh, max_delta ← None, -1
8:     for d ∈ D_i do
9:       u = make_unit(m, d)
10:      delta = F(u, W_u) - F(d.u, W_d.u)
11:      if delta > max_delta then
12:        best_mesh, max_delta = d, delta
13:      end if
14:    end for
15:    best_mesh.u = make_unit(m, best_mesh)
16:  end for
17:  tpt = sum(F(d.u, W_d.u) for d ∈ D_i)
18:  if tpt > best_tpt then
19:    best_tpt, best_units = tpt, D_i
20:  end if
21: end for
22: return best_units

Explanation of Algorithm 1:

  • Line 1 (M ← llm_parallel_candidate(M, W)): First, MuxServe calls Algorithm 2 (described below) to generate parallel candidates for each LLM in the input list M. A parallel candidate represents an optimal configuration (number of SMs, parallelism degree) for a single LLM to meet its workload requirements with minimal SM usage. This step essentially preprocesses each LLM to determine its potential configurations for deployment.

  • Line 2 (D ← get_potential_device_mesh_groups(C, M)): The algorithm then identifies all potential device mesh groups within the cluster C. A device mesh group is a collection of GPUs that will be used to serve a set of LLMs concurrently. Heuristics are applied here to prune the search space, considering that intra-operator parallelism is usually kept within a node and that the workload constrains the mesh size.

  • Line 3 (best_tpt, best_units ← 0, None): Initializes variables to track the best overall throughput and the corresponding optimal group of LLM units found so far.

  • Line 4 (for D_i ∈ D do): The algorithm iterates through each potential device mesh group D_i.

  • Line 5 (M' = sort(M, key=computation, descend=True)): Within each mesh group, the LLMs in M are sorted in descending order based on their computation requirements. This is a crucial greedy heuristic: LLMs requiring more computation (due to size and/or popularity) are prioritized for placement. This helps ensure these demanding LLMs get adequate resources first, which is critical for overall utilization.

  • Line 6 (for m ∈ M' do): For each LLM m (sorted by computation) in the current iteration of the mesh group:

    • Line 7 (best_mesh, max_delta ← None, -1): Initializes variables to find the best mesh within D_i for the current LLM m.
    • Line 8 (for d ∈ D_i do): It iterates through each individual mesh d within the current mesh group D_i.
    • Line 9 (u = make_unit(m, d)): A hypothetical unit u is formed by placing LLM m onto mesh d.
    • Line 10 (delta = F(u, W_u) - F(d.u, W_d.u)): This line calculates the approximate increase in throughput if LLM m were placed on mesh d. F(u, W_u) estimates the throughput of the new unit u, and F(d.u, W_d.u) represents the current throughput of the mesh d (which might already host other LLMs, d.u). The difference, delta, represents the marginal gain.
    • Lines 11-13: If the calculated delta is greater than the max_delta found so far, then d becomes the best_mesh and max_delta is updated.
    • Line 15 (best_mesh.u = make_unit(m, best_mesh)): After evaluating all meshes in D_i, LLM m is formally placed on the best_mesh that yielded the maximum throughput increase.
  • Line 17 (tpt = sum(F(d.u, W_d.u) for d ∈ D_i)): After all LLMs have been greedily placed onto meshes within the current group D_i, the total throughput tpt for this entire mesh group is calculated.

  • Lines 18-20: If this tpt is higher than best_tpt found across all previous mesh groups, best_tpt and best_units are updated.

  • Line 22 (return best_units): The algorithm returns the group of LLM units that resulted in the highest overall throughput.

    The time complexity of Algorithm 1 is $O(M \cdot C \cdot D)$, where $M$ is the number of LLMs, $C$ is the number of devices (GPUs), and $D$ is the number of potential device mesh groups.
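
For readability, the same greedy procedure can be paraphrased in a few lines of Python. The mesh and LLM objects and the throughput estimator are assumed stand-ins (e.g., each LLM is assumed to expose a `compute_demand` attribute); this is a sketch of Algorithm 1, not the authors' implementation.

```python
# A compact Python paraphrase of Algorithm 1. The mesh/LLM objects and the
# throughput estimator are assumed stand-ins; not the authors' implementation.
def greedy_placement(llms, mesh_groups, workloads, estimate_unit_throughput):
    best_tpt, best_units = 0.0, None
    for meshes in mesh_groups:                          # line 4: try each mesh group
        placement = {mesh: [] for mesh in meshes}
        # line 5: place computation-heavy (large and popular) LLMs first
        for llm in sorted(llms, key=lambda m: m.compute_demand, reverse=True):
            best_mesh, max_delta = None, float("-inf")
            for mesh in meshes:                         # lines 8-14: pick the mesh
                before = estimate_unit_throughput(placement[mesh], workloads)
                after = estimate_unit_throughput(placement[mesh] + [llm], workloads)
                if after - before > max_delta:          # with the largest gain
                    best_mesh, max_delta = mesh, after - before
            placement[best_mesh].append(llm)            # line 15: commit placement
        tpt = sum(estimate_unit_throughput(placement[m], workloads) for m in meshes)
        if tpt > best_tpt:                              # lines 18-20: keep the best
            best_tpt, best_units = tpt, placement
    return best_units
```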

The LLM Parallel Candidate Generation algorithm (Algorithm 2) is used as a sub-routine:

Algorithm 2 LLM Parallel Candidate Generation

Input: LLM list M, workload W
Output: The parallel candidate M_hat

1: M_hat ← []
2: for m ∈ M do
3:   sm_list ← get_potential_sm_list(m)
4:   tp_list ← get_potential_tp_degree(m)
5:   for p ∈ tp_list do
6:     for num_sm ∈ sorted(sm_list) do
7:       tpt, bs ← estimate_throughput(m, num_sm, p)
8:       // Store this configuration if it's a "parallel candidate"
9:       // (e.g., meets workload with minimal SMs or optimal performance)
10:      // The paper implies it finds "a set of parallel candidates", but the exact
11:      // selection criteria (e.g., one candidate per TP degree)
12:      // are not fully detailed in the snippet.
13:    end for
14:  end for
15:  // Append the selected parallel candidate(s) for LLM m to M_hat
16: end for
17: return M_hat

Explanation of Algorithm 2:

  • Line 2 (for m ∈ M do): Iterates through each LLM m in the input list.
  • Line 3 (sm_list ← get_potential_sm_list(m)): Determines the possible numbers of Streaming Multiprocessors (SMs) that LLM m could be assigned.
  • Line 4 (tp_list ← get_potential_tp_degree(m)): Determines the potential tensor parallelism (TP) degrees for LLM m.
  • Lines 5-7 (for p ∈ tp_list do ... for num_sm ∈ sorted(sm_list) do ... tpt, bs ← estimate_throughput(m, num_sm, p)): For every combination of tensor parallelism degree (p) and number of SMs (num_sm), the algorithm estimates the throughput (tpt) and batch size (bs) achievable for LLM m.
  • Lines 8-12 (// Store this configuration ...): The paper states that a parallel candidate is "a configuration that meets the workload requirements while utilizing the fewest number of SMs." The listing implies that for each LLM, this loop explores configurations and M_hat stores the resulting parallel candidate(s). The specific selection logic (e.g., choosing one minimal-SM configuration per TP degree, or finding the single best configuration) is implicit but aims to find efficient configurations; one possible reading is sketched below.
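
The sketch below illustrates one plausible selection rule consistent with that description: for each tensor-parallel degree, keep the configuration with the fewest SMs whose estimated throughput meets the LLM's request rate. The rule and the helper signatures are assumptions, not MuxServe's confirmed behavior.

```python
# One plausible reading of the candidate selection in Algorithm 2: for each
# tensor-parallel degree, keep the configuration with the fewest SMs whose
# estimated throughput still meets the LLM's request rate.
def parallel_candidates(llm, workload_rate, sm_options, tp_options, estimate_throughput):
    candidates = []
    for tp in tp_options:
        for num_sm in sorted(sm_options):          # try the smallest SM budget first
            tpt, batch_size = estimate_throughput(llm, num_sm, tp)
            if tpt >= workload_rate:               # first config that meets demand
                candidates.append({"tp": tp, "num_sm": num_sm, "batch_size": batch_size})
                break                              # fewest SMs for this TP degree
    return candidates
```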

4.2.3. Maximize Intra-unit Throughput

Once LLMs are placed into units, MuxServe needs an optimal strategy for batching and scheduling requests within each unit to maximize throughput (Equation 2) while ensuring fairness.

Resource Definition for Fairness: MuxServe defines $\mathrm{R}(\cdot, \cdot)$ (normalized resource consumption) as the token block usage (Sheng et al., 2024). This is motivated by the fact that KV cache size is a significant performance bottleneck for LLM serving. Using token blocks provides a more granular and intuitive way to measure resource consumption, as tokens from different LLMs can consume varying amounts of KV cache. To account for varying workload characteristics, $\mathrm{R}(\cdot, \cdot)$ is normalized by request rates. MuxServe assigns a token block quota to each LLM to enforce fair sharing.

Prefill-Decoding and Decoding-Decoding Collocation: MuxServe maximizes intra-unit throughput by prioritizing prefill jobs and using remaining resources for decoding jobs. This is based on the observation that decoding jobs of a single LLM typically require fewer computation resources and can be batched together. Prioritizing prefill jobs maximizes the opportunity for colocation and batching with decoding jobs from other LLMs.

The Adaptive Batch Scheduling (ADBS) algorithm is described below:

Algorithm 3 Adaptive Batch Scheduling (ADBS)

Input: LLM list M (LLMs within a unit)

1: prefill_waiting ← false
2: quota ← init_token_block_quota(M)
3: while True do
4:   if no prefill jobs in execution then
5:     prefill_waiting ← True
6:     m ← round-robin a prefill job from M
7:     if resource_enough(m, quota) then
8:       execute_and_update(m, quota)
9:       prefill_waiting ← False
10:    end if
11:  end if
12:  if not prefill_waiting then
13:    m ← round-robin a decoding job from M
14:    while resource_enough(m, quota) do
15:      execute_and_update(m, quota)
16:      m ← round-robin a decoding job from M
17:    end while
18:  end if
19: end while

Explanation of Algorithm 3:

  • Line 1 (prefill_waiting ← false): A flag indicating if the system is currently waiting to schedule a prefill job due to resource constraints.
  • Line 2 (quota ← init_token_block_quota(M)): Initializes the token block quota for each LLM within the unit. This quota limits the amount of KV cache memory an LLM can use, ensuring fair sharing based on the defined $\mathrm{R}(\cdot, \cdot)$.
  • Line 3 (while True do): The main scheduling loop, continuously running.
  • Line 4 (if no prefill jobs in execution then): MuxServe prioritizes prefill jobs. If no prefill jobs are currently active, it attempts to schedule one.
    • Line 5 (prefill_waiting ← True): Sets the flag, indicating a potential wait for prefill resources.
    • Line 6 (m ← round-robin a prefill job from M): Selects an LLM m from which to schedule a prefill job, using a round-robin policy to ensure fairness across LLMs.
    • Lines 7-10: Checks if resource_enough(m, quota) (e.g., sufficient SMs and token blocks within m's quota). If yes, the job is executed, resources are updated, and prefill_waiting is reset. If not, the system waits for resources (implicitly, by not scheduling anything else while prefill_waiting remains true).
  • Line 12 (if not prefill_waiting then): If a prefill job was successfully scheduled (or none was pending), the scheduler proceeds to schedule decoding jobs.
    • Line 13 (m ← round-robin a decoding job from M): Selects an LLM m for a decoding job using round-robin.
    • Lines 14-16 (while resource_enough(m, quota) do ...): Continuously schedules decoding jobs as long as resources are available and m stays within its token block quota. This maximizes colocation of decoding jobs to fill the remaining GPU capacity.

Adaptive Quota Adjustment: To further improve utilization, ADBS periodically adapts the token block quota for each LLM. During runtime, MuxServe monitors KV cache utilization. It identifies LLMs with low utilization and proactively transfers KV cache blocks from them to LLMs with high utilization. This dynamic reallocation ensures optimal memory utilization and efficient sharing within the unit, adapting to changes in workload.
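
A minimal sketch of such a quota adjustment is shown below; the utilization thresholds and transfer step are illustrative assumptions, and the real system operates on the unified KV cache manager rather than plain Python dictionaries.

```python
# Illustrative sketch of a periodic quota adjustment: shift spare KV cache
# blocks from LLMs with low cache utilization to LLMs running near their quota.
def rebalance_quota(quota, used_blocks, low=0.5, high=0.9, step=16):
    """quota / used_blocks map each LLM name to its token-block quota / usage."""
    donors = [m for m in quota if used_blocks[m] < low * quota[m]]
    receivers = [m for m in quota if used_blocks[m] > high * quota[m]]
    for donor, receiver in zip(donors, receivers):
        transfer = min(step, quota[donor] - used_blocks[donor])
        quota[donor] -= transfer      # shrink the under-utilized LLM's quota
        quota[receiver] += transfer   # grow the quota of the LLM under pressure
    return quota
```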

Throughput Estimator: The exact throughput $\mathrm{tpt}_S(\cdot, \cdot, \cdot)$ cannot be directly observed without profiling. MuxServe uses a simple yet effective analytical estimator. As shown in Figure 12 in the appendix, the execution timeline reveals that prefill phases of different LLMs are generally executed sequentially, while decoding phases can run concurrently. Different phases are interleaved.

The throughput estimator for LLM $m$ is given by: $ \mathrm{tpt}_S(m, b, W_b) = \min\left\{ \frac{b^m}{\sum_{i \in b} t_p^i + t_d^m \cdot l_o^m},\ W_b \right\} $ Here:

  • $\mathrm{tpt}_S(m, b, W_b)$ is the estimated throughput for LLM $m$ within unit $b$ under workload $W_b$ using strategy $S$.

  • $b^m$ represents the batch size of requests for LLM $m$.

  • $\sum_{i \in b} t_p^i$ is the sum of the prefill latencies of all LLMs $i$ in the batch within unit $b$. This reflects the sequential nature of prefill execution.

  • $t_d^m$ is the average decoding latency (time per output token) for LLM $m$.

  • $l_o^m$ is the average generation length (number of output tokens) for LLM $m$.

  • The term $t_d^m \cdot l_o^m$ represents the total decoding latency for a request of LLM $m$.

  • The denominator $\sum_{i \in b} t_p^i + t_d^m \cdot l_o^m$ approximates the total latency for a batch of requests.

  • The fraction $\frac{b^m}{\text{latency}}$ estimates the throughput based on the batch size and latency.

  • $\min\{\cdot, W_b\}$ ensures that the estimated throughput does not exceed the actual workload request rate $W_b$.

    The prefill and decoding latencies ($t_p^m$, $t_d^m$) can be profiled in advance, and the average generation length ($l_o^m$) can be estimated from historical request data or datasets like ShareGPT. Given request arrival rates, a binary search can find the batch size $b^m$ that satisfies the traffic demand.
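
The estimator and the batch-size search translate directly into a few lines of Python. This sketch assumes per-LLM prefill/decoding latencies and average output length obtained from profiling, as described above; function and parameter names are illustrative.

```python
# Sketch of the analytical estimator above plus the binary search for the
# smallest batch size that satisfies a target rate.
def estimate_tpt(batch_size, prefill_latencies, t_decode, avg_out_len, rate):
    """tpt_S(m, b, W_b) = min(b^m / (sum_i t_p^i + t_d^m * l_o^m), W_b)."""
    latency = sum(prefill_latencies) + t_decode * avg_out_len
    return min(batch_size / latency, rate)

def min_batch_for_rate(prefill_latencies, t_decode, avg_out_len, rate, max_bs=512):
    """Smallest batch size whose estimated throughput reaches `rate` (capped at max_bs)."""
    lo, hi = 1, max_bs
    while lo < hi:
        mid = (lo + hi) // 2
        if estimate_tpt(mid, prefill_latencies, t_decode, avg_out_len, rate) >= rate:
            hi = mid
        else:
            lo = mid + 1
    return lo
```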

4.2.4. Resource Manager

After determining optimal LLM units and scheduling strategies, MuxServe requires a specialized resource management mechanism for flexible and efficient spatial-temporal multiplexing. The challenges include dynamic sharing of computation resources among different prefill and decoding jobs and efficient sharing of memory (model weights and KV cache) to reduce waste and fragmentation. MuxServe addresses this with a unified resource manager.

The overview of GPU resource management in an LLM unit is illustrated in Figure 4.

Figure 4. Overview of GPU resource management in an LLM unit. The memory is divided into 3 partitions to store KV cache, weights and activations, respectively. The parallel runtime partitions SMs dynamically across different jobs.

Explanation of Figure 4: The figure shows the GPU divided into three memory partitions (left) and how Streaming Multiprocessors (SMs) are dynamically allocated (right).

Computation Resource Management:

  • Mechanism: MuxServe's parallel runtime manages computation resources (GPU SMs) at the granularity of individual SMs, based on NVIDIA MPS (Multi-Process Service).
  • Dynamic Assignment: Instead of statically allocating SMs, MuxServe dynamically assigns SMs to prefill and decoding jobs from colocated LLMs as scheduled by the ADBS algorithm.
  • Flexibility: This dynamic assignment enables flexible multiplexing. As shown in the right part of Figure 4:
    • Step 1: All SMs might be allocated to a single job (e.g., a compute-intensive prefill job) due to its large computational intensity.
    • Step 2: After that job completes or progresses, MuxServe can schedule multiple, potentially less compute-intensive, jobs (e.g., several decoding jobs or a decoding job and a prefill job) to run concurrently, sharing the SM resources and maximizing utilization.
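
As a rough illustration of this kind of SM partitioning with NVIDIA MPS, separate worker processes can be launched with different values of CUDA_MPS_ACTIVE_THREAD_PERCENTAGE (an MPS environment variable limiting the fraction of SMs a CUDA client may use). The worker scripts and percentages below are hypothetical, not MuxServe's actual launch logic, and the MPS control daemon is assumed to already be running.

```python
# Rough illustration of SM partitioning with NVIDIA MPS: each worker process is
# launched with CUDA_MPS_ACTIVE_THREAD_PERCENTAGE set to its SM share.
import os
import subprocess

def launch_worker(script, sm_percentage, extra_args=()):
    env = os.environ.copy()
    # Limit the fraction of SMs (thread percentage) this CUDA client may use.
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(sm_percentage)
    return subprocess.Popen(["python", script, *extra_args], env=env)

# e.g. one compute-heavy prefill worker alongside two lighter decoding workers
procs = [
    launch_worker("prefill_worker.py", sm_percentage=60),
    launch_worker("decode_worker.py", sm_percentage=20),
    launch_worker("decode_worker.py", sm_percentage=20),
]
for p in procs:
    p.wait()
```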

Memory Resource Management: The main challenge is efficiently sharing LLM weights and the dynamically growing KV cache, which consume huge amounts of memory. MuxServe divides memory into three partitions:

  1. Unified KV Cache Space (Head-wise Cache):

    • Observation: The size of each attention head (a component within the Transformer's attention mechanism) is often consistent or has limited variations across different LLMs (e.g., LLaMAs and GPT-3 often use 128 dimensions per head).
    • Mechanism: MuxServe divides the global KV cache table into small, fixed-size blocks. Each block is designed to hold the KV cache (keys and values) for one attention head for a fixed number of tokens.
    • Benefit: This head-wise granularity allows MuxServe to accommodate the KV cache from different LLMs (which may have different total numbers of heads or hidden layers) within a unified, shared memory space. This reduces memory waste and fragmentation compared to reserving separate KV cache regions for each LLM, and enables dynamic adjustment of cache allocation during runtime.
    • Distinction from vLLM's Paged Attention: While vLLM's PagedAttention improves memory utilization for a single LLM by managing KV cache in pages, MuxServe's unified KV cache specifically addresses the more complex scenario of multiple heterogeneous LLMs sharing the cache. It allows flexible allocation across LLMs, adapting to bursty workloads with minimal overhead.
  2. Model Weights Partition:

    • Mechanism: This partition stores a single replica of the model weights for each LLM.
    • Benefit: These weights can be shared among all prefill and decoding jobs of that specific LLM within the unit, reducing memory redundancy.
  3. Activation Partition:

    • Mechanism: This partition reserves space for intermediate activations (outputs of layers) that are generated and utilized during the inference runtime. These are temporary and typically discarded after use.

Implementation Details:

  • MuxServe is built on top of vLLM (Kwon et al., 2023), leveraging its efficient single-LLM serving capabilities.
  • It uses NVIDIA MPS (NVIDIA, 2022b) for partitioning SM resources.
  • A global scheduler runs on each LLM unit. It manages a request queue and implements the ADBS algorithm (Algorithm 3) to schedule requests.
  • MuxServe disaggregates prefill and decoding phases by launching separate vLLM processes for each. These processes are configured with different numbers of SM resources via NVIDIA MPS.
  • The global scheduler communicates and sends prefill or decoding jobs to these runtime engine processes using Python multiprocessing shared memory.
  • A dedicated memory manager process (implemented in C++ for efficiency) manages the unified memory space, including the head-wise KV cache. Runtime engines request new cache blocks from this manager when needed, and fill the KV cache into assigned blocks. They access this shared memory space via CUDA IPC handler (Inter-Process Communication).
  • KV cache blocks allocated for prefill jobs are maintained by the memory manager for subsequent decoding jobs and are only released when the entire request is finished.
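
The following is a simplified, Python-only sketch of per-LLM block bookkeeping in a head-wise KV cache with quotas, in the spirit of the unified cache described above. MuxServe's real manager is a C++ process accessed over CUDA IPC, so this should be read purely as an illustration of the bookkeeping, not the actual implementation.

```python
# Simplified sketch of per-LLM block bookkeeping in a head-wise KV cache.
class HeadWiseKVCache:
    def __init__(self, total_blocks, quotas):
        self.free_blocks = list(range(total_blocks))   # indices into the global table
        self.quotas = dict(quotas)                     # llm_name -> max blocks
        self.owned = {name: [] for name in quotas}     # llm_name -> allocated blocks

    def allocate(self, llm, num_blocks):
        """Grant `llm` blocks, each holding one head's KV for a fixed token chunk."""
        if len(self.owned[llm]) + num_blocks > self.quotas[llm]:
            return None                                # over quota: caller must wait
        if num_blocks > len(self.free_blocks):
            return None                                # unified cache is full
        blocks = [self.free_blocks.pop() for _ in range(num_blocks)]
        self.owned[llm].extend(blocks)
        return blocks

    def release_request(self, llm, blocks):
        """Return blocks once the whole request (prefill + decoding) has finished."""
        for b in blocks:
            self.owned[llm].remove(b)
            self.free_blocks.append(b)
```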

5. Experimental Setup

5.1. Datasets

MuxServe evaluates its performance using both synthetic and real-world workloads.

Synthetic Workloads

  • LLM Models: The experiments use LLaMA models, a popular open-source LLM architecture. The LLMs are categorized into four size buckets as shown in the following table:

    Scale    4B-8B    8B-21B    21B-41B    41B-70B
    #LLMs    12       4         2          1

    The experiments involve serving a total of 19 LLaMA models of varying scales.

  • Request Rates:

    • Request rates for each LLM are generated using a power-law distribution with an exponent α. This distribution models how popularity varies among LLMs, where a few LLMs are highly popular and many are less popular (a sketch of this rate generation and arrival sampling follows this list).
    • A larger α indicates a more skewed popularity distribution, meaning fewer LLMs receive a disproportionately higher share of total requests. For example, α = 0.9 means 20% of LLMs receive 50% of requests, while α = 2.1 means 20% of LLMs receive 90% of requests (Figure 6).
    • The maximal request rate for each LLM is initially set to 20 req/s, and then scaled up to evaluate diverse workloads.
  • Arrival Time: Requests are sampled using Poisson processes, which simulate random and independent arrival events, typical of real-world server traffic.

  • Request Content: The requests (prompts and desired output lengths) are sampled from ShareGPT (ShareGPT-Team, 2023), a dataset of real-world conversations with LLMs. This ensures realistic prompt lengths and output generation characteristics. For example, the paper notes an average prompt length of 161 tokens and an average output length of 338 tokens in ShareGPT.
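
A sketch of this workload generation, under the assumption that per-LLM rates decay as a power law over popularity rank and arrivals follow a Poisson process, might look like the following; the exact parameterization used in the paper may differ.

```python
# Illustrative synthetic workload generation: power-law rates + Poisson arrivals.
import numpy as np

def power_law_rates(num_llms, alpha, max_rate=20.0):
    """Rate of the k-th most popular LLM decays as k^(-alpha), scaled to max_rate."""
    ranks = np.arange(1, num_llms + 1, dtype=float)
    weights = ranks ** (-alpha)
    return max_rate * weights / weights[0]

def poisson_arrivals(rate, duration_s, rng):
    """Request arrival timestamps for one LLM over `duration_s` seconds."""
    gaps = rng.exponential(1.0 / rate, size=max(1, int(rate * duration_s * 2)))
    times = np.cumsum(gaps)
    return times[times < duration_s]

rng = np.random.default_rng(0)
rates = power_law_rates(num_llms=19, alpha=2.1)
arrivals = {i: poisson_arrivals(r, duration_s=600, rng=rng) for i, r in enumerate(rates)}
```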

Real Workloads

  • LLM Models & Workloads: MuxServe samples LLMs and workloads from the ChatLMSYS trace. ChatLMSYS is a web application that serves multiple LLMs of different scales, providing a representative real-world traffic pattern.
  • Configuration: For real workload evaluation, 16 LLMs are served on a 32-GPU cluster. The popularity distribution reflects a realistic scenario where 20% popular LLMs get 50% of the request traffic.
  • Scaling: The rates from the ChatLMSYS trace are rescaled to evaluate MuxServe under different load conditions.

5.2. Evaluation Metrics

The effectiveness of MuxServe is evaluated using the following metrics:

Aggregated Throughput

  • Conceptual Definition: Throughput measures the rate at which a system can process requests. When serving multiple LLMs with different arrival rates, aggregated throughput represents the total effective processing capacity across all LLMs in the system. It is often calculated as a weighted average to reflect the actual demand for each LLM.
  • Mathematical Formula: While the paper does not provide a specific formula for "aggregated throughput," in a multi-model serving context with varying arrival rates it typically refers to the sum of requests processed per second across all LLMs. If weights ($w_m$) are applied based on arrival rates, it could be: $ \text{Aggregated Throughput} = \sum_{m \in M} w_m \cdot \text{Throughput}_m $
  • Symbol Explanation:
    • $\text{Aggregated Throughput}$: The total effective request processing rate across all served LLMs.
    • $M$: The set of all LLMs being served.
    • $w_m$: A weight associated with LLM $m$, often proportional to its request arrival rate or importance.
    • $\text{Throughput}_m$: The number of requests processed per second for LLM $m$.

SLO Attainment

  • Conceptual Definition: Service Level Objective (SLO) attainment measures the percentage of requests that successfully complete within a predefined latency target. It reflects the quality of service provided by the system.
  • Mathematical Formula: $ \text{SLO Attainment} = \frac{\text{Number of requests completed within SLO latency}}{\text{Total number of requests}} \times 100\% $
  • Symbol Explanation:
    • $\text{SLO Attainment}$: The percentage of requests that meet the specified latency objective.
    • Number of requests completed within SLO latency: The count of requests whose end-to-end latency is less than or equal to the defined SLO latency target.
    • Total number of requests: The total number of requests processed during the evaluation period.
  • SLO Scale: In this paper, the latency target is scaled to different multiples of the single device execution latency (e.g., SLO scale 8 means the target latency is 8 times the single device latency).

P99 Latency

  • Conceptual Definition: P99 latency (99th percentile latency) is the latency value below which 99% of all observed request latencies fall. It is a critical metric for understanding the worst-case user experience, as it captures the latency faced by almost all users, excluding only the very slowest 1%.
  • Mathematical Formula: P99 latency is typically determined by sorting all observed latencies in ascending order and picking the value at the 99th percentile position. There is no simple algebraic formula, but rather a statistical calculation. If $L = \{l_1, l_2, \dots, l_N\}$ is the set of all observed latencies sorted in ascending order, then: $ \text{P99 Latency} = L_{\lceil 0.99 \times N \rceil} $
  • Symbol Explanation:
    • $\text{P99 Latency}$: The latency value at the 99th percentile.
    • $L$: The sorted list of observed latencies.
    • $N$: The total number of observed latencies.
    • $\lceil \cdot \rceil$: The ceiling function, which rounds a number up to the nearest integer.
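
For concreteness, the three metrics above can be computed from raw measurements as follows; the example latency values and SLO target are made up for illustration.

```python
# Computing the evaluation metrics from raw measurements.
import math

def slo_attainment(latencies, slo_target):
    """Percentage of requests whose end-to-end latency meets the SLO target."""
    within = sum(1 for l in latencies if l <= slo_target)
    return 100.0 * within / len(latencies)

def p99_latency(latencies):
    """Value at the 99th percentile: L_{ceil(0.99 * N)} with 1-based indexing."""
    ordered = sorted(latencies)
    return ordered[math.ceil(0.99 * len(ordered)) - 1]

def aggregated_throughput(completed_per_llm, duration_s):
    """Total requests completed per second, summed over all served LLMs."""
    return sum(completed_per_llm.values()) / duration_s

latencies = [0.42, 0.51, 0.38, 1.20, 0.47]
print(slo_attainment(latencies, slo_target=1.0), p99_latency(latencies))
```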

5.3. Baselines

MuxServe's performance is compared against two primary baselines:

  1. Spatial Partitioning:

    • Description: This is the most common approach for serving multiple large models. It involves statically allocating separate, dedicated groups of GPUs for each LLM. Each LLM runs in isolation on its assigned resources.
    • Implementation: Each LLM is served using vLLM (Kwon et al., 2023), a state-of-the-art single LLM serving framework known for its efficiency, particularly due to PagedAttention for KV cache management. This ensures a strong baseline for individual LLM performance.
    • Why it's representative: It reflects the industry standard practice of providing dedicated resources, which often leads to underutilization for varying workloads.
  2. Temporal Multiplexing (AlpaServe-like):

    • Description: This baseline attempts to share GPU resources over time, similar to AlpaServe (Li et al., 2023), but adapted for multiple LLMs. It focuses on partitioning models onto shared GPUs and scheduling requests in an interleaved manner.
    • Implementation: Since AlpaServe is not designed for multiplexing multiple LLMs (it focuses on large DNNs), MuxServe implements this baseline:
      • Colocation: LLMs are colocated on shared GPUs using the same placement optimization as MuxServe's placement algorithm. This ensures a fair comparison regarding model grouping.
      • Scheduling: Requests are scheduled using a first-come-first-serve (FCFS) policy in a temporal manner, meaning the entire GPU resources are dedicated to one LLM's batch at a time.
      • Batching: Employs continuous batching for each LLM's requests, a technique used in vLLM to dynamically add new requests to a batch even if previous requests in the batch are still generating tokens.
      • KV Cache: Utilizes MuxServe's unified KV cache mechanism, ensuring that memory management is not a bottleneck in the comparison and focusing on the scheduling differences.
    • Why it's representative: It represents a strong temporal sharing approach that attempts to share resources more dynamically than spatial partitioning, but without MuxServe's fine-grained awareness of LLM inference phases.

5.4. Cluster

The experiments are conducted on a cluster with the following specifications:

  • Nodes: 4 computing nodes.
  • GPUs per node: Each node is equipped with 8 NVIDIA A100 GPUs (80GB VRAM each). This totals 32 A100 GPUs in the cluster.
  • Intra-node connection: 600GB/s NVLink, providing high-bandwidth communication between GPUs within the same node.
  • Inter-node connection: 200Gbps InfiniBand (IB), providing high-speed communication between different nodes in the cluster.

6. Results & Analysis

6.1. Core Results Analysis

End-to-End Results for Synthetic Workloads

The synthetic workload experiments (Figure 5) demonstrate MuxServe's superior performance in terms of both throughput and SLO attainment across varying popularity distributions (α) and average request rates.

Figure 5. Throughput and SLO attainment on synthetic workloads for MuxServe, spatial partitioning, and temporal multiplexing under different α values and rate scales.

  • Throughput (Left Chart): MuxServe consistently achieves higher throughput than both Spatial Partitioning and Temporal Multiplexing in all scenarios.
    • It achieves up to 1.8× improvement in throughput.
    • When α is small (more even request rates among LLMs), MuxServe efficiently colocates prefill-decoding and decoding-decoding jobs. This indicates its ability to fill GPU compute cycles more effectively.
    • When α is large (skewed popularity, few popular LLMs), MuxServe's strategy of colocating popular LLMs with unpopular ones to multiplex memory resources (using more SMs and larger KV caches for popular LLMs) yields higher throughput.
  • SLO Attainment (Right Chart): MuxServe demonstrates its ability to process more requests within the 99% SLO target.
    • It processes up to 2.9× more requests while maintaining 99% SLO attainment.
    • For small SLO scales (tight latency targets) and small α, MuxServe shows slightly lower SLO attainment. This is attributed to the inherent interference introduced by colocation, which can cause minor latency variations.
    • For larger α, MuxServe achieves higher SLO attainment for popular LLMs. This is because the efficient multiplexing allows popular LLMs to receive more resources, reducing their individual latencies and improving their success rate within the SLO.
    • The results highlight that MuxServe's improvement is more significant when LLM popularity differences are greater (larger α), confirming the benefit of its popularity-aware resource multiplexing.

End-to-End Results for Real Workloads

Evaluation on real workloads sampled from ChatLMSYS trace (Figure 7) further validates MuxServe's effectiveness under realistic conditions.

Figure 7. Throughput and SLO attainment as the request rates are varied on real workloads (Spatial Partitioning, Temporal Multiplexing, and MuxServe).

  • Throughput (Left Chart): MuxServe consistently achieves higher throughput compared to baselines across varying average rates.
    • It achieves up to 1.38× higher throughput compared to Spatial Partitioning.
    • It achieves up to 1.46× higher throughput compared to Temporal Multiplexing.
  • SLO Attainment (Right Chart): MuxServe maintains a higher SLO attainment at different average rates.
    • Notably, at an average rate of 4.8 req/s, Temporal Multiplexing performs poorly. This is explained by several medium-rate LLMs being colocated on a large mesh. Temporal Multiplexing struggles to efficiently multiplex these LLMs, leading to lower utilization and worse SLO. MuxServe, with its phase-aware multiplexing and adaptive scheduling, handles this scenario much better.

P99 Latency Comparison

The P99 latency comparison (Figure 11, Appendix A.1) provides insights into the tail latency performance.

Figure 11 (Appendix A.1). Average latency, TPOT, and TTFT across rate scales; MuxServe (green) shows a clear advantage over spatial partitioning (blue) and temporal multiplexing (orange), especially on average latency and TPOT.

  • The P99 Time to First Token (TTFT) latency for MuxServe is generally competitive.
  • When α is small (even popularity), MuxServe's P99 TTFT latency can be slightly higher than Temporal Multiplexing. This is likely due to the increased interference from concurrent prefill and decoding jobs, even with efficient multiplexing. However, it is still significantly lower than Spatial Partitioning.
  • When α is large (skewed popularity), the P99 TTFT latency is dominated by the most popular model. In this case, Temporal Multiplexing can sometimes reduce queuing time for these popular models, leading to a lower P99 TTFT compared to Spatial Partitioning. MuxServe, by giving more resources to popular LLMs, also performs well, often comparable to or better than Temporal Multiplexing for the dominant models, contributing to better overall SLO (a short sketch of how P99 TTFT is computed follows this list).
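
For clarity on the metric itself, here is a minimal sketch of computing P99 TTFT from per-request timestamps. The record layout and function name are illustrative assumptions, not the paper's measurement code.

```python
# Minimal sketch: P99 time-to-first-token (TTFT) from request timestamps.
import numpy as np

def p99_ttft(arrival_times: np.ndarray, first_token_times: np.ndarray) -> float:
    ttft = first_token_times - arrival_times   # per-request time to first token
    return float(np.percentile(ttft, 99))      # 99th-percentile tail latency
```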

6.2. Data Presentation (Tables)

The following are the results from Table 1 of the original paper:

| Scale | 4B-8B | 8B-21B | 21B-41B | 41B-70B |
| ----- | ----- | ------ | ------- | ------- |
| #LLMs | 12    | 4      | 2       | 1       |

This table shows the distribution of LLM scales used in the synthetic workload experiments: 12 LLMs in the 4B-8B range, 4 in the 8B-21B range, 2 in the 21B-41B range, and 1 in the 41B-70B range, for a total of 19 LLMs.

6.3. Ablation Studies

Effectiveness of Placement Algorithm

An ablation study compares MuxServe's enumeration-based greedy placement algorithm against a simpler greedy algorithm. The simpler greedy algorithm prioritizes LLMs with high arrival rates and assigns them to meshes with the largest available free memory.

Figure 8. Ablation study of the placement algorithm. The charts compare the throughput of MuxServe's placement against the greedy baseline under varying GPU counts: 8 GPUs with 4 LLMs (left) and 16 GPUs with 7 LLMs (right); MuxServe's placement achieves higher throughput.

  • Results: MuxServe's placement algorithm achieves up to 1.3× higher throughput compared to the simpler greedy algorithm (right subfigure).
  • Insight Verification: This confirms MuxServe's insight to prioritize placement selection for LLMs with large computation requirements, considering both model scale and popularity (a minimal sketch of this prioritization is shown after this list).
    • Example (8 GPUs, 4 LLMs): MuxServe colocates two popular small LLMs with one less popular small LLM on 4 GPUs, while allocating 4 GPUs for a less popular large LLM.
    • Example (16 GPUs, 7 LLMs): MuxServe colocates two large LLMs with low arrival rates on 4 GPUs, and partitions the remaining GPUs into groups of 2 or 4. It then places the other 5 popular LLMs based on their computational demands.
  • Contrast with Memory-Prioritized Placement: If placement prioritized large memory requirements (to balance memory consumption), large LLMs might be placed on large GPU groups without considering their low popularity. This could leave popular smaller LLMs struggling to fit into remaining memory or requiring massive GPUs, which is inefficient. MuxServe's computation-centric prioritization is shown to be more effective.
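
The following is a minimal sketch of computation-prioritized greedy placement, assuming each LLM's computation demand is approximated by model size times arrival rate and that a hypothetical estimate_throughput(llm, mesh) callback is available; the paper's actual algorithm additionally enumerates candidate mesh groups and parallelism configurations.

```python
# Illustrative sketch of computation-prioritized greedy placement.
# `llms` carry (name, size, rate); `meshes` are candidate GPU groups; the
# throughput estimator is a user-supplied callback (an assumption here).
from dataclasses import dataclass, field

@dataclass
class LLM:
    name: str
    size: float   # parameter count, in billions
    rate: float   # request arrival rate (req/s)

@dataclass
class Mesh:
    gpus: int
    colocated: list = field(default_factory=list)

def greedy_place(llms, meshes, estimate_throughput):
    # Place the most computation-hungry LLMs first (scale x popularity).
    ordered = sorted(llms, key=lambda m: m.size * m.rate, reverse=True)
    placement = {}
    for llm in ordered:
        best = max(meshes, key=lambda mesh: estimate_throughput(llm, mesh))
        placement[llm.name] = best
        best.colocated.append(llm)   # colocated LLMs multiplex this mesh
    return placement
```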

Effectiveness of ADBS (Adaptive Batch Scheduling)

The Adaptive Batch Scheduling (ADBS) algorithm is compared against First Come First Serve (FCFS) and Round-Robin scheduling in two different serving settings (Figure 9).

Figure 9. Comparison of cache usage of different schedule approaches on 4 GPUs. The relative proportion of token block usage is annotated in the figure. FCFS: First Come First Serve, ADBS: ADaptive Batch Size. (a) Request rates 2:8:8 req/s; throughput (req/s): FCFS 3.8, Round-Robin 4.1, ADBS 6.2. (b) Request rates 1:8 req/s; throughput (req/s): FCFS 3.2, Round-Robin 4.9, ADBS 6.6. The charts plot token block usage over time for FCFS, Round-Robin, and ADBS when serving three LLMs (a) and two LLMs (b).

  • Results:
    • In scenario (a) (2:8:8 req/s for LLaMA-30B, 13B, and 7B respectively), ADBS achieves a throughput of 6.2 req/s, outperforming Round-Robin (4.1 req/s) and FCFS (3.8 req/s), i.e. roughly 1.51× over Round-Robin and 1.63× over FCFS.
    • In scenario (b) (1:8 req/s for LLaMA-65B and 30B respectively), ADBS achieves 6.6 req/s, outperforming Round-Robin (4.9 req/s) and FCFS (3.2 req/s), i.e. roughly 1.35× over Round-Robin and 2.06× over FCFS.
  • Fairness and Utilization:
    • ADBS shows token block usage more closely aligned with the distribution of arrival rates, indicating fairer memory resource sharing. The token block usage proportions annotated in the figure illustrate this alignment.
    • The superior performance of ADBS stems from its ability to adapt token block quotas and to prioritize prefill jobs while filling remaining capacity with decoding jobs (a scheduling-loop sketch is shown after this list). This prevents unfair sharing of KV cache blocks, which can significantly hurt the performance of popular LLMs.
    • FCFS is shown to be inefficient in multiplexing different LLMs due to its lack of resource awareness.
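
A minimal sketch of a single ADBS-style scheduling step is shown below, assuming per-LLM token-block quotas, simple FIFO queues, and a rough per-job SM cost. The names and the SM-budget accounting are illustrative; the paper's algorithm additionally adapts quotas from monitored usage at runtime.

```python
# Minimal sketch of one ADBS-style scheduling step: prefill jobs are admitted
# first, within each LLM's KV-cache token-block quota, and leftover compute is
# filled with decoding jobs from all colocated LLMs.
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Job:
    blocks_needed: int   # KV cache blocks this job would occupy
    sm_cost: int         # rough compute cost in SM-percentage units

@dataclass
class LLMState:
    block_quota: int
    block_usage: int = 0
    prefill_queue: deque = field(default_factory=deque)
    decode_queue: deque = field(default_factory=deque)

def adbs_step(llms: list, sm_budget: int = 100) -> list:
    scheduled, used_sm = [], 0
    # 1) Prefill jobs first, but only within each LLM's token-block quota.
    for llm in llms:
        if llm.prefill_queue:
            job = llm.prefill_queue[0]
            fits_quota = llm.block_usage + job.blocks_needed <= llm.block_quota
            if fits_quota and used_sm + job.sm_cost <= sm_budget:
                llm.prefill_queue.popleft()
                llm.block_usage += job.blocks_needed
                scheduled.append(job)
                used_sm += job.sm_cost
    # 2) Fill leftover compute with decoding jobs from all colocated LLMs.
    for llm in llms:
        while llm.decode_queue and used_sm + llm.decode_queue[0].sm_cost <= sm_budget:
            job = llm.decode_queue.popleft()
            scheduled.append(job)
            used_sm += job.sm_cost
    return scheduled
```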

Effectiveness of Resource Manager

The ablation study for the unified resource manager (Figure 10) evaluates the impact of its computation and memory management components.

Figure 10. Ablation study of the unified resource manager. The charts show throughput (left) and SLO attainment (right); MuxServe reaches the highest throughput while keeping SLO attainment at a high level, illustrating its advantage in multi-model serving.

  • Setup: 4 LLMs served on 4 GPUs, with arrival rates following a power-law distribution.
  • Results:
    • Computation Management (Separating Prefill/Decoding): By enabling the separation of prefill and decoding phases and managing computation resources dynamically, MuxServe achieves a 1.7× improvement in throughput. This highlights the significant gains from efficiently scheduling these distinct job types (a sketch of capping per-process SM shares via MPS is shown after this list).
    • Unified Memory Manager: Further adding the unified memory manager (including the head-wise cache and dynamic KV cache reallocation) provides an additional 1.2× higher throughput and a substantial 3.6× improvement in SLO attainment. This demonstrates the critical role of efficient memory sharing and dynamic adaptation to KV cache demands in multi-LLM serving.
  • Conclusion: The results clearly show that both dynamic computation resource management and intelligent unified memory management are crucial for MuxServe's overall performance benefits.
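
For intuition on the computation-management side, here is a small sketch of capping a worker process's SM share through NVIDIA MPS's CUDA_MPS_ACTIVE_THREAD_PERCENTAGE environment variable, assuming the MPS control daemon is already running. The worker script names and percentages are made-up examples, not MuxServe's actual policy or code.

```python
# Illustrative sketch: limit each MPS client's SM share via
# CUDA_MPS_ACTIVE_THREAD_PERCENTAGE before launching the worker process.
import os
import subprocess

def launch_worker(cmd: list, sm_percentage: int) -> subprocess.Popen:
    env = dict(os.environ, CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=str(sm_percentage))
    return subprocess.Popen(cmd, env=env)

# Hypothetical example: give the prefill worker a larger SM share than the
# colocated decoding worker.
prefill = launch_worker(["python", "prefill_worker.py"], sm_percentage=70)
decode = launch_worker(["python", "decode_worker.py"], sm_percentage=30)
```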

7. Conclusion & Reflections

7.1. Conclusion Summary

MuxServe introduces a novel and highly effective system for serving multiple Large Language Models (LLMs) concurrently through flexible spatial-temporal multiplexing. The system's core innovation lies in its ability to simultaneously leverage two key characteristics: the varying popularity of LLMs for efficient memory resource multiplexing, and the distinct computational demands of LLM prefill and decoding phases for flexible computation resource multiplexing.

The paper formally formulates this multiplexing problem and provides practical solutions through:

  1. An enumeration-based greedy placement algorithm that optimizes the grouping of LLMs into LLM units on GPU meshes, prioritizing LLMs with high computational requirements.

  2. An adaptive batch scheduling (ADBS) strategy that maximizes intra-unit throughput by intelligently scheduling prefill and decoding jobs, while ensuring fairness through token block quotas and dynamically reallocating KV cache blocks.

  3. A unified resource manager that uses NVIDIA MPS for dynamic Streaming Multiprocessor (SM) allocation and a novel head-wise cache for efficient, flexible, and unified KV cache memory sharing across heterogeneous LLMs (a minimal allocator sketch is shown after this list).

    Evaluation results, both on synthetic and real-world workloads, robustly demonstrate MuxServe's superior performance, achieving up to 1.8× higher throughput and processing 2.9× more requests within 99% SLO attainment compared to state-of-the-art spatial partitioning and temporal multiplexing baselines.
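
To make the head-wise cache idea concrete, the following is a minimal sketch of a shared KV cache pool whose allocation unit is the KV cache of a single attention head, under the assumption that colocated LLMs share a common head dimension (e.g., 128). The class, its fields, and the exhaustion handling are illustrative only and not the paper's implementation.

```python
# Minimal sketch of a head-wise KV cache pool shared by colocated LLMs.
# Each block holds the KV cache of one attention head for a fixed number of
# tokens, so heterogeneous LLMs can draw from the same pool.
class HeadWiseCachePool:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))   # indices into one shared tensor
        self.owner = {}                              # block id -> (llm, layer, head)

    def allocate(self, llm: str, layer: int, head: int) -> int:
        if not self.free_blocks:
            # In practice this would trigger quota rebalancing or preemption.
            raise MemoryError("head-wise KV cache pool exhausted")
        block = self.free_blocks.pop()
        self.owner[block] = (llm, layer, head)
        return block

    def release(self, block: int) -> None:
        self.owner.pop(block, None)
        self.free_blocks.append(block)
```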

7.2. Limitations & Future Work

The paper's "Impact Statement" suggests no specific highlight-worthy societal consequences, and the conclusion does not explicitly list limitations or future work. However, based on the problem formulation and methodology, potential limitations and implicit future work directions can be inferred:

  • Combinatorial Complexity of Placement: While the enumeration-based greedy placement algorithm uses heuristics to prune the search space (e.g., within-node parallelism, workload constraints), the initial problem (Equation 1) is a combinatorial optimization problem. For extremely large clusters or a very high number of diverse LLMs, the search for "optimal" mesh groups (get_potential_device_mesh_groups) could still be computationally intensive or rely heavily on the quality of pruning heuristics. Future work might explore more sophisticated optimization techniques or machine learning-based approaches for dynamic placement.
  • Analytical Throughput Estimator Accuracy: The throughput estimator (Equation 3) is described as "simple yet effective." Its accuracy depends on pre-profiled prefill and decoding latencies and estimated average generation lengths. In highly dynamic real-world scenarios with unpredictable prompt/output lengths or rapidly changing LLM characteristics, the estimator's accuracy might fluctuate, potentially impacting the effectiveness of the placement and scheduling decisions. Dynamic, real-time profiling or more adaptive estimation models could be explored.
  • Fairness Metric Nuances: The fairness constraint (Equation 2) uses a threshold ϵ on the difference in normalized resource consumption based on token block usage. While token block usage is a good proxy for KV cache memory, fairness can be a complex multi-dimensional concept (e.g., equal latency, equal throughput share, proportional resource usage). The choice of ϵ and the specific normalization method might require tuning for different organizational priorities. Future work could investigate more sophisticated, user-defined fairness objectives.
  • Overhead of Resource Management: While MuxServe is designed for efficiency, the overheads associated with NVIDIA MPS for dynamic SM partitioning, Python multiprocessing shared memory, and CUDA IPC handles for inter-process communication (especially for the C++ memory manager) are inherent to its disaggregated architecture. While stated to be minimal, these overheads could become a factor in extremely low-latency scenarios or with very frequent job switching.
  • Generalizability to Diverse LLM Architectures: The evaluation primarily uses LLaMA models. While the principles are broadly applicable to autoregressive Transformers, the specific head-wise cache design leverages the commonality of attention head sizes. Future work could examine generalizability and potential adaptations needed for drastically different LLM architectures or non-autoregressive models.

7.3. Personal Insights & Critique

MuxServe represents a significant step forward in the practical deployment of LLMs, especially in multi-tenant or multi-model serving environments.

Inspirations and Transferability:

  • Phase-Aware Optimization: The most inspiring aspect is the explicit recognition and exploitation of the distinct prefill and decoding phases. This granular understanding of workload characteristics, rather than treating inference as a monolithic task, is a powerful paradigm that could be transferred to other complex, multi-stage computational workloads beyond LLMs. For instance, in scientific simulations or data processing pipelines, identifying compute-bound vs. memory-bound stages could enable similar spatial-temporal multiplexing.
  • Popularity-Driven Resource Allocation: The idea of dynamically allocating resources based on the real-time popularity of models is crucial for cost-effectiveness. This is directly applicable to any service where demand for different components varies significantly.
  • Unified Resource Management (Head-wise Cache): The head-wise KV cache is a clever solution to a highly specific LLM memory problem. The general principle of "identifying a common, fine-grained unit of resource across heterogeneous workloads and then building a unified, dynamic allocation scheme" could inspire memory management solutions for other diverse, memory-intensive applications.
  • System Integration: Building on vLLM and NVIDIA MPS demonstrates a pragmatic approach to system design, leveraging existing robust components while innovating at the orchestration and management layers.

Potential Issues, Unverified Assumptions, or Areas for Improvement:

  • "Goodness" of Greedy Algorithm: While the greedy placement algorithm is shown to outperform a simpler greedy baseline, there's no guarantee it finds the globally optimal solution given the combinatorial nature. The quality of the solution might heavily depend on the sorting key (key=computationkey=computation) and the effectiveness of the throughput estimator. The paper acknowledges this by stating it finds a "good solution efficiently" rather than optimal.

  • Dynamic Workload Prediction: The system relies on workload characteristics (W_b, l_o^m) for placement and adaptive scheduling. While real-time monitoring and adaptive quota adjustment help, accurate prediction of future request rates and generation lengths is always a challenge in dynamic environments. How MuxServe handles sudden, unforeseen shifts in LLM popularity or request patterns could be a point of further investigation.

  • Overhead of Inter-Process Communication: The disaggregation into vLLM processes for prefill and decoding, managed by a global scheduler and memory manager via shared memory and CUDA IPC, implies a certain level of IPC overhead. While shared memory and CUDA IPC are efficient, they are not zero-cost. For extremely low-latency applications, the balance between multiplexing gains and IPC overhead needs careful consideration.

  • Complexity for System Administrators: While MuxServe simplifies efficient serving, the underlying system (NVIDIA MPS, multiple vLLM processes, C++ memory manager, Python shared memory) is inherently complex. Deploying, monitoring, and debugging such a system would require significant operational expertise.

  • Energy Efficiency: While throughput and SLO are primary metrics, energy consumption is increasingly important. It would be interesting to see how MuxServe's multiplexing strategies impact overall energy efficiency compared to baselines, especially when GPUs are kept at higher utilization levels.

    Overall, MuxServe presents a well-reasoned and empirically validated solution to a pressing problem in LLM infrastructure. Its core insights are robust and offer valuable lessons for designing efficient systems for complex, heterogeneous AI workloads.
