
NanoFlow: Towards Optimal Large Language Model Serving Throughput

Published: 08/23/2024

TL;DR Summary

The paper introduces NanoFlow, a novel framework for optimizing Large Language Model (LLM) serving throughput by leveraging intra-device parallelism. It significantly improves throughput, achieving a 1.91x increase over existing systems, by splitting inputs into smaller nano-batches and overlapping compute-, memory-, and network-bound operations within a single device.

Abstract

Large Language Models (LLMs) have resulted in a surging demand for planet-scale serving systems, where tens of thousands of GPUs continuously serve hundreds of millions of users. Consequently, throughput has emerged as a key metric that determines serving systems' performance. Due to large model sizes and memory-intensive self-attention, LLM serving has been commonly assumed to be memory-bound. Through a detailed analysis, we show that despite having memory-intensive components, end-to-end LLM serving is compute bound for most common workloads and LLMs. Alas, most existing serving engines fall short from optimal compute utilization, because the heterogeneous operations that comprise LLM serving--compute, memory, networking--are executed sequentially within a device. We propose NanoFlow, a novel serving framework that exploits intra-device parallelism, which overlaps the usage of heterogeneous resources within a single device. NanoFlow splits inputs into smaller nano-batches and duplicates operations to operate on each portion independently, enabling overlapping. NanoFlow automatically identifies the number, size, ordering, and GPU resource allocation of nano-batches to minimize the execution time, while considering the interference of concurrent operations. We evaluate NanoFlow's end-to-end serving throughput on several popular models such as LLaMA-2-70B, Mixtral 8x7B, LLaMA-3-8B, etc. With practical workloads, NanoFlow provides 1.91x throughput boost compared to state-of-the-art serving systems achieving 50% to 72% of optimal throughput across popular models.


1. Bibliographic Information

1.1. Title

NanoFlow: Towards Optimal Large Language Model Serving Throughput

1.2. Authors

The paper lists a comprehensive author team primarily from the University of Washington, with contributions from Tsinghua University, UC Berkeley, and the University of Michigan. The lead author appears to be Kan Zhu from the University of Washington. Their research backgrounds generally align with systems, machine learning, and high-performance computing, focusing on optimizing large-scale AI infrastructure.

1.3. Journal/Conference

This paper was published on arXiv, a preprint server, on August 22, 2024. arXiv is a well-regarded repository for preprints of scientific papers in fields like physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. While not a peer-reviewed journal or conference itself, papers published on arXiv are often submitted to and eventually published in reputable venues. Its influence lies in the rapid dissemination of research findings.

1.4. Publication Year

2024

1.5. Abstract

Large Language Models (LLMs) have created immense demand for planet-scale serving systems, where throughput is a critical performance metric. Contrary to the common assumption that LLM serving is memory-bound due to large model sizes and memory-intensive self-attention, this paper demonstrates through detailed analysis that end-to-end LLM serving is primarily compute-bound for most common workloads and LLMs. However, existing serving engines fail to achieve optimal compute utilization because heterogeneous operations (compute, memory, networking) are executed sequentially within a single device.

To address this, the paper proposes NanoFlow, a novel serving framework that leverages intra-device parallelism by overlapping the usage of these heterogeneous resources. NanoFlow splits input batches into smaller nano-batches and duplicates operations to process each portion independently, enabling concurrent execution. It automatically determines the optimal number, size, ordering, and GPU resource allocation of nano-batches to minimize execution time, while accounting for interference between concurrent operations.

Evaluations using popular models like LLaMA-2-70B, Mixtral 8x7B, and LLaMA-3-8B with practical workloads show that NanoFlow achieves a 1.91x throughput boost compared to state-of-the-art serving systems, reaching 50% to 72% of the theoretically optimal throughput across various models.

https://arxiv.org/abs/2408.12757 (Preprint) PDF Link: https://arxiv.org/pdf/2408.12757v2.pdf

2. Executive Summary

2.1. Background & Motivation

The proliferation of Large Language Models (LLMs) like GPT-4 and LLaMA has led to an explosion in demand for inference serving systems. These systems need to support hundreds of millions of users on tens of thousands of GPUs globally. In this context, throughput (tokens per device per second) has become the paramount metric, directly impacting the operational cost and scalability of LLM services, especially given the scarcity of high-end GPUs.

A common assumption in the LLM serving community has been that LLM inference is fundamentally memory-bound. This assumption stems from several key characteristics:

  1. Large Model Sizes: LLMs often have billions of parameters, requiring significant memory to store model weights (e.g., GPT-3 175B needs multiple A100 80GB GPUs).

  2. Memory-Intensive Self-Attention: The self-attention mechanism is memory-intensive during the decode phase, since each new token must attend over the entire context (input plus previously generated tokens); the total attention cost therefore grows quadratically with context length.

  3. KV-cache: Per-request state (KV-cache) can grow very large, sometimes exceeding model weight size, and needs to be frequently accessed.

  4. Single Token Output: Each LLM iteration typically produces only one output token per sequence while loading entire model weights and a unique KV-cache, further exacerbating memory pressure.

    However, the authors identify a critical gap: despite these memory-intensive components, a detailed analysis reveals that for most common workloads and LLMs, the end-to-end serving process is actually compute-bound. The challenge then becomes that existing serving engines, while potentially optimizing individual operations, execute the heterogeneous operations (compute, memory, networking) sequentially within a device. This sequential execution leads to significant underutilization of the most constrained resource—compute—resulting in suboptimal overall throughput. The problem is thus to bridge this gap and maximize compute utilization by efficiently managing heterogeneous resource usage.

2.2. Main Contributions / Findings

The paper makes several significant contributions to the field of LLM serving:

  1. Reclassification of LLM Serving Workloads: Through a detailed analytical cost model and empirical validation, the authors demonstrate that contrary to common belief, modern LLM serving, especially with optimizations like Grouped Query Attention (GQA) and large batch sizes, is predominantly compute-bound rather than memory-bound or network-bound for typical workloads and LLMs. This re-evaluation provides a new perspective on where optimization efforts should be focused.
  2. Introduction of NanoFlow Framework: The paper proposes NanoFlow, a novel end-to-end serving framework designed to maximize compute utilization by leveraging intra-device parallelism.
    • Intra-device Parallelism via Nano-batching: NanoFlow splits input batches into smaller nano-batches and duplicates operations. These nano-operations can then execute concurrently on different nano-batches without data dependencies, allowing heterogeneous resources (compute, memory, network) within a single GPU to be utilized simultaneously, overcoming the limitations of sequential execution.
    • Automated Pipeline Search Engine: NanoFlow includes an auto-search engine that automatically constructs an optimized pipeline for nano-batches. This engine identifies the optimal number, size, ordering, and GPU resource allocation for nano-operations. It employs a two-stage approach: first, it determines an initial pipeline assuming no interference, and then refines it by profiling and modeling actual kernel interference between concurrent operations.
    • Efficient Runtime System: NanoFlow provides a runtime system for executing these optimized pipelines. This includes mechanisms for efficient batch formation (prioritizing decode, chunking prefill, memory prediction), asynchronous scheduling (hiding CPU-side overhead), and advanced KV-cache management (simultaneous host/SSD offloading, LRU policy, optimized loading/scattering).
  3. Significant Throughput Improvements:
    • Evaluations on popular LLMs (LLaMA-2-70B, Mixtral 8x7B, LLaMA-3-8B, etc.) with practical workloads show that NanoFlow achieves an average throughput boost of 1.91x compared to state-of-the-art serving systems like vLLM, DeepSpeed-FastGen, and TensorRT-LLM.

    • NanoFlow reaches 50% to 72% of the theoretically optimal throughput across various models, significantly closing the gap to hardware capabilities.

    • It also demonstrates the ability to sustain higher request rates while meeting Service Level Objective (SLO) constraints for latency.

      These findings fundamentally shift the understanding of LLM serving bottlenecks and provide a practical, high-performance solution for maximizing GPU utilization and reducing serving costs.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully grasp the mechanics and contributions of NanoFlow, understanding several core concepts related to Large Language Models (LLMs), GPU computing, and serving systems is essential.

Large Language Models (LLMs) and Transformers

LLMs are powerful artificial intelligence models, typically based on the transformer architecture, that are trained on vast amounts of text data to understand, generate, and process human language.

  • Transformer Architecture: Introduced by Vaswani et al. (2017), the transformer is a neural network architecture that relies heavily on the self-attention mechanism to process sequential data. Unlike recurrent neural networks, transformers process input sequences in parallel, making them highly efficient for modern accelerators.
  • Decoder-Only Transformers: Many modern LLMs (e.g., GPT-4, LLaMA, Mistral) use a decoder-only transformer architecture. This means they are primarily designed for generative tasks, taking an input sequence and auto-regressively generating an output sequence token by token.
    • Token: A token is the fundamental unit of text processed by an LLM. It can be a word, part of a word, a punctuation mark, or even a single character.

LLM Inference Workflow

The process of generating output from an LLM given an input is called inference. It typically involves two phases:

  • Prefill Phase: This is the initial phase where the entire input prompt (e.g., "Write a poem about a cat") is processed by the model all at once. The main goal here is to establish the initial context and populate the KV-cache.
  • Decode Phase: After the prefill phase, the model generates output tokens one at a time, auto-regressively. For each new token generated, the model processes the previously generated tokens along with the initial prompt, extending the sequence.

Self-Attention and KV-Cache

The self-attention mechanism is central to transformers, allowing the model to weigh the importance of different tokens in the input sequence when processing each token.

  • Self-Attention Mechanism: For each token in a sequence, self-attention computes three vectors: Query (Q), Key (K), and Value (V). These are derived by multiplying the token's embedding with learned weight matrices ($W_Q$, $W_K$, $W_V$). The attention score of a Query token against all Key tokens is calculated (e.g., via dot product), normalized, and then used to form a weighted sum of Value tokens, providing contextual information. The formula for Scaled Dot-Product Attention is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:
    • $Q$ is the matrix of Query vectors.
    • $K$ is the matrix of Key vectors.
    • $V$ is the matrix of Value vectors.
    • $QK^T$ is the dot product of the Query and Key matrices, which measures the similarity between each Query and each Key.
    • $\sqrt{d_k}$ is a scaling factor, where $d_k$ is the dimension of the Key vectors. This scaling prevents the dot products from growing too large, which could push the softmax function into regions with very small gradients.
    • $\mathrm{softmax}$ is an activation function that normalizes the scores, turning them into probabilities.
    • The result is a weighted sum of the Value vectors, where the weights are determined by the attention scores.
  • KV-cache (Key-Value Cache): During the decode phase, tokens are generated one by one. Recomputing the Key and Value vectors for all previously generated tokens at each step would be highly inefficient. The KV-cache stores these Key and Value vectors from previous tokens for each request. This allows the model to compute $Q$, $K$, and $V$ only for the new token and append the new $K$ and $V$ to the cache, significantly speeding up the decode phase (see the sketch after this list). The size of the KV-cache can grow substantially with longer sequences and larger batch sizes.
  • Grouped Query Attention (GQA): A memory optimization where multiple attention heads share the same Key and Value projections. This reduces the memory footprint of the KV-cache, allowing for larger batch sizes.
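
To make the decode phase and KV-cache concrete, here is a minimal single-head PyTorch sketch of one decode step; the head dimension, random weights, and function name are illustrative assumptions, not the paper's implementation.

```python
# Minimal single-head decode step with a KV-cache (illustrative, not NanoFlow code).
import torch

d_k = 64                                          # head dimension (assumed)
W_q, W_k, W_v = (torch.randn(d_k, d_k) for _ in range(3))

def decode_step(x, k_cache, v_cache):
    """Process one new token embedding x (shape: (d_k,)) against the cached context."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    # Append this token's K and V instead of recomputing them for the whole history.
    k_cache = torch.cat([k_cache, k[None, :]])
    v_cache = torch.cat([v_cache, v[None, :]])
    # Scaled dot-product attention of the new query against all cached keys.
    scores = (k_cache @ q) / d_k ** 0.5            # (seq_len,)
    weights = torch.softmax(scores, dim=0)
    out = weights @ v_cache                        # weighted sum of cached values
    return out, k_cache, v_cache

k_cache = v_cache = torch.empty(0, d_k)
for _ in range(4):                                 # auto-regressive decoding, one token at a time
    out, k_cache, v_cache = decode_step(torch.randn(d_k), k_cache, v_cache)
```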

GPU Operations

GPUs (Graphics Processing Units) are specialized electronic circuits designed to rapidly manipulate and alter memory to accelerate the creation of images. In the context of AI, they are highly efficient for parallel processing of mathematical operations.

  • General Matrix Multiplication (GEMM): A fundamental operation in deep learning, representing the multiplication of two matrices. Many dense operations (e.g., linear layers, feed-forward networks) in LLMs are dominated by GEMMs. These are typically compute-bound, meaning their performance is limited by the raw computational power (FLOPs) of the GPU.
  • General Matrix-Vector Multiplication (GEMV): A special case of GEMM where one of the matrices is a vector. GEMVs are often found in memory-bound operations.
  • Compute-Bound: An operation is compute-bound if its execution time is limited by the rate at which the processor can perform arithmetic operations (e.g., floating-point operations per second, FLOPs).
  • Memory-Bound: An operation is memory-bound if its execution time is limited by the rate at which data can be transferred to and from memory (i.e., memory bandwidth).
  • Network-Bound: An operation is network-bound if its execution time is limited by the speed of data transfer over network interconnects between different devices (e.g., NVLink, Infinity Fabric).

Parallelism Strategies

For very large models that don't fit into a single GPU's memory or require distributed computation, various parallelism strategies are used:

  • Tensor Parallelism: This technique splits the weight matrices of a model across multiple GPUs (within or across nodes). During computation, each GPU performs a part of the matrix multiplication, and then collective communication operations are used to aggregate results. This avoids duplicating model weights on each GPU.
  • Pipeline Parallelism: This involves splitting the model into stages (layers) and assigning different stages to different GPUs. Each GPU processes a different micro-batch of data, creating a pipeline where data flows through the stages.
  • Collective Communication Operations: These are specialized operations for exchanging data between multiple GPUs in a distributed system.
    • AllGather (AG): Each GPU has a piece of data, and AllGather collects all pieces from all GPUs into each GPU, so every GPU ends up with the complete concatenated data.
    • AllReduce (AR): Each GPU has a piece of data, AllReduce performs an operation (e.g., sum, average) on all pieces, and the final result is available on all GPUs. These operations are network-bound.

Performance Metrics

  • Throughput: A measure of how many units of work a system can process per unit of time. In LLM serving, it's often measured in tokens per second (tokens/s) or tokens per second per GPU (tokens/s/GPU).
  • Latency: The time taken for a single request to be processed from start to finish. In LLM serving, it can be measured as the time from receiving a prompt to generating the complete response, or normalized latency (latency divided by output length).
  • Service Level Objective (SLO): A target value or range for a performance metric (e.g., latency) that a service aims to meet. For LLMs, an SLO might define the maximum acceptable latency for generating responses.

Data Types and Units

  • FP16 (Floating Point 16-bit): A reduced-precision floating-point format that uses 16 bits instead of the standard 32 bits (FP32). FP16 reduces memory footprint and can significantly speed up computation on modern GPUs that have specialized Tensor Cores for FP16 arithmetic, often with minimal loss in model accuracy. This is standard for data center-scale inference.
  • GFLOP/s (Giga Floating Point Operations per second): A measure of computational performance, indicating billions of floating-point operations per second.
  • GB/s (Gigabytes per second): A measure of data transfer rate, indicating billions of bytes per second.

3.2. Previous Works

The paper contextualizes NanoFlow by contrasting it with existing LLM serving optimization techniques, broadly categorizing them by the granularity at which they operate:

  • Request-Level Optimizations:
    • Continuous Batching (e.g., Orca [53]): This technique dynamically refills the batch of requests being processed by the GPU. Instead of waiting for all requests in a batch to complete before starting a new one, continuous batching allows new requests to be added to the batch as soon as GPU resources become available (e.g., when a request finishes). This maximizes GPU utilization by keeping the batch full.
    • PagedAttention (e.g., vLLM [17]): PagedAttention is a memory management technique that addresses the fragmentation and waste of GPU memory caused by variable-length KV-cache sizes. Inspired by virtual memory and paging in operating systems, it stores KV-cache entries in fixed-size "pages" and manages them using a page table. This allows for more efficient memory utilization and enables non-contiguous memory allocation for the KV-cache, similar to how an OS manages virtual memory (a minimal sketch appears at the end of this subsection).
  • Phase-Level Scheduling:
    • Disaggregating Prefill and Decode (e.g., DistServe [59], Splitwise [32]): These approaches separate the prefill and decode phases of LLM inference, potentially assigning them to different clusters or specialized hardware. This can improve efficiency by allowing each phase to be optimized independently and handled by resources best suited for its characteristics.
  • Batch-Level Optimizations:
    • Chunked Prefill (e.g., DeepSpeed-FastGen [13], Sarathi-Serve [2]): Instead of processing an entire long prefill prompt at once, chunked prefill splits it into smaller chunks. These chunks can then be batched together with decode requests, allowing for a more consistent and higher utilization of the GPU by amortizing prefill costs and keeping the batch full of active operations.

    • Dynamic Batching: Automatically adjusts the batch size of incoming requests to maximize throughput or minimize latency, often by ensuring the GPU is always busy with a sufficiently large batch.

      The paper highlights that while these prior works have significantly improved throughput by optimizing at the request, phase, or batch level, they generally do not perform scheduling or resource management at the granularity of individual operations within a device.
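
Returning to PagedAttention, the sketch below illustrates the paging idea with a toy block table that maps each request's KV-cache to non-contiguous physical pages; the page size, class, and method names are assumptions, not vLLM's actual API.

```python
# Toy paged KV-cache block table (illustrative of the PagedAttention idea, not vLLM's API).
PAGE_SIZE = 16  # tokens stored per physical KV-cache page (assumed)

class PagedKVCache:
    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))   # pool of physical page ids
        self.block_tables = {}                     # request id -> list of physical page ids
        self.lengths = {}                          # request id -> number of tokens stored

    def append_token(self, req):
        """Reserve KV-cache space for one new token, allocating a page only when needed."""
        n = self.lengths.get(req, 0)
        table = self.block_tables.setdefault(req, [])
        if n % PAGE_SIZE == 0:                     # current page full (or first token)
            table.append(self.free_pages.pop())    # any free page works: no contiguity needed
        self.lengths[req] = n + 1
        return table[n // PAGE_SIZE], n % PAGE_SIZE  # (physical page, slot) for this token's K/V

    def release(self, req):
        """Return a finished request's pages to the pool, avoiding fragmentation waste."""
        self.free_pages.extend(self.block_tables.pop(req, []))
        self.lengths.pop(req, None)

cache = PagedKVCache(num_pages=1024)
locations = [cache.append_token("req-0") for _ in range(40)]  # 40 tokens -> 3 pages
cache.release("req-0")
```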

3.3. Technological Evolution

The evolution of LLM serving systems has been driven by the increasing scale and complexity of LLMs, coupled with the rising demand for efficient inference:

  1. Early Stages (Basic Inference): Initially, LLM inference involved sequential processing on a single GPU or distributed inference using basic model parallelism, often inefficiently.

  2. Memory Optimization: The advent of KV-cache was a crucial step, addressing the quadratic memory scaling of self-attention during decode. Techniques like PagedAttention (vLLM) further optimized KV-cache memory usage, enabling larger batch sizes.

  3. Batching and Scheduling: Continuous batching (Orca) and dynamic batching became standard to maximize GPU utilization by keeping the batch full and varying its size based on available resources. Chunked prefill and phase-level scheduling emerged to better manage the distinct characteristics of prefill and decode workloads.

  4. Distributed Inference: As models grew beyond single-GPU capacity, tensor parallelism, pipeline parallelism, and hybrid approaches became essential for distributing model weights and computation across multiple GPUs and nodes.

  5. Focus on Throughput vs. Latency: Initial efforts often balanced throughput and latency. However, with "planet-scale" demand, throughput (and thus cost-efficiency) has emerged as a primary concern for cloud providers.

    NanoFlow fits into this timeline by pushing the boundaries of optimization to an even finer granularity: the intra-device operation level. While previous works focused on what to batch or how to schedule requests, NanoFlow looks at how individual operations within a single GPU can be overlapped to maximize hardware utilization, especially for the actual bottleneck (compute), which it re-identifies.

3.4. Differentiation Analysis

Compared to the main methods in related work, NanoFlow introduces a fundamental shift in its approach to LLM serving optimization:

  • Granularity of Optimization:
    • Prior Works: Primarily focus on request-level, phase-level, or batch-level scheduling. For example, vLLM and DeepSpeed-FastGen optimize how requests are batched and processed at an iteration level, DistServe and Splitwise disaggregate phases, and Sarathi-Serve uses chunked prefill. These methods manage the flow of data into or between GPUs.
    • NanoFlow: Operates at a much finer granularity: intra-device operation-level parallelism. It focuses on how different operations (e.g., GEMM, KV-cache access, network communication) within a single GPU can be executed concurrently. This is a novel approach that complements existing batching and scheduling techniques.
  • Bottleneck Identification:
    • Prior Works (Implicit Assumption): Often operated under the common assumption that LLM serving is memory-bound, leading to optimizations targeting memory efficiency (e.g., PagedAttention, GQA).
    • NanoFlow: Challenges this assumption with a rigorous analysis, demonstrating that for many modern LLMs and workloads, the end-to-end serving process is actually compute-bound. This re-identifies the true bottleneck, allowing NanoFlow to target compute utilization directly.
  • Mechanism for Parallelism:
    • Prior Works: Achieve parallelism through macro-level batching, pipeline stages, or distributing tasks across devices.
    • NanoFlow: Achieves parallelism by nano-batching inputs and duplicating operations into nano-operations. These nano-operations can then execute concurrently within the same device, effectively overlapping the execution of operations with heterogeneous resource demands (compute-bound, memory-bound, network-bound). This is a fine-grained pipelining within a single GPU's resources.
  • Adaptive Optimization:
    • Prior Works: While some have dynamic scheduling or batching, the internal execution logic for operations is typically static.
    • NanoFlow: Incorporates an auto-search engine that automatically constructs and refines optimized pipelines. This adaptive approach accounts for specific model architectures, hardware characteristics, and even kernel interference, which is a significant advancement in handling the complexity of heterogeneous GPU workloads.
  • Memory Overhead vs. Compute Utilization:
    • Prior Works: Aim to minimize memory movement and overhead.

    • NanoFlow: Acknowledges that nano-batching might increase memory I/O due to repeated weight loading. However, it strategically embraces this trade-off, arguing that for compute-bound workloads, this increased memory I/O can be hidden through pipelining, leading to overall higher compute utilization and throughput.

      In essence, NanoFlow innovates by shifting the focus from simply managing requests and data flow to actively orchestrating the concurrent execution of diverse operations within the GPU itself, driven by a re-evaluated understanding of the actual bottleneck.

4. Methodology

4.1. Principles

The core idea behind NanoFlow stems from a critical re-evaluation of the bottlenecks in Large Language Model (LLM) serving. The prevailing assumption has been that LLM serving is memory-bound due to massive model sizes, KV-cache growth, and the memory-intensive nature of self-attention. However, through detailed analysis and empirical validation, NanoFlow demonstrates that for most common workloads and modern LLMs (especially those using optimizations like Grouped Query Attention), the overall serving process is predominantly compute-bound.

Despite this compute-bound nature, existing LLM serving engines suffer from suboptimal compute utilization. This is because the diverse operations that constitute LLM serving—ranging from compute-bound General Matrix Multiplications (GEMMs) to memory-bound KV-cache accesses and network-bound collective communications in distributed settings—are executed sequentially within a single device. This sequential execution leads to pipeline bubbles, periods where the GPU's compute units are idle while waiting for memory or network operations to complete, or vice-versa.

NanoFlow's principle is to address this by leveraging intra-device parallelism. Instead of processing a single large batch sequentially, NanoFlow breaks down the input into smaller, independent units called nano-batches. Each operation is then duplicated to form nano-operations that process these nano-batches. Since nano-operations on different nano-batches are independent, NanoFlow can overlap their execution within the same GPU. This fine-grained pipelining allows for concurrent utilization of heterogeneous resources (compute, memory, network) on the device, maximizing the utilization of the most constrained resource (compute) and thereby improving overall throughput. While this approach might increase memory I/O for weight loading, NanoFlow posits that for compute-bound workloads, this additional I/O can be hidden through effective pipelining.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Analysis of LLM Serving Workloads and Bottlenecks

NanoFlow begins with a fundamental analysis to accurately identify the bottlenecks in LLM serving.

Key Factors of Serving Throughput (Section 3.1)

The paper identifies key factors that determine LLM serving throughput, defining throughput as the total number of tokens processed per second (including both prefill and decode phases).

  • Hardware Specification:
    • $N_{GPU}$: Number of GPUs.
    • MemBW (GB/s): Aggregate GPU memory bandwidth.
    • MemSize (GB): Aggregate GPU memory capacity.
    • Compute (GFLOP/s): Aggregate GPU compute capacity.
    • NetBW (GB/s): Aggregate GPU interconnect bandwidth.
  • Model Configuration:
    • $D_{model}$: Hidden dimension size (e.g., the size of the embedding vectors).
    • $L$: Number of layers in the transformer model.
    • $P_{Model}$: Total number of parameters in the model.
    • $R_{GQA}$: Group size of Grouped Query Attention (how many query heads share a KV-cache).
    • $S_{type}$ (Bytes): Size in bytes of the data type for model parameters (e.g., 2 bytes for FP16).
  • User Query Statistics:
    • $p$: Average number of tokens in prompts to be prefilled.
    • $d$: Average number of tokens in output to be decoded. Thus, a request involves $p$ prefill tokens and $d$ decode tokens.
  • Batch Size: For optimal throughput, the system should operate at the largest possible batch size that fits the model weights and all KV-caches within available memory. This maximum batch size amortizes overheads and increases utilization for compute-bound, memory-bound, and network-bound operations.

Cost Model of LLM Serving (Section 3.2)

Under the assumption of using the maximum possible batch size, NanoFlow models the latency of a single LLM serving iteration from the perspectives of memory, compute, and network resources. An "iteration" refers to processing a batch of user requests through all transformer layers.

  1. Memory Latency ($T_{mem}$): This represents the time required to load all necessary data from memory, assuming the entire device memory content needs to be loaded into GPU caches and registers once per iteration (e.g., for model weights and KV-cache). $ T_{mem} = \frac{MemSize}{MemBW} $ Where:

    • MemSize: The total memory content that needs to be loaded in bytes (e.g., model weights + KV-cache).
    • MemBW: The available memory bandwidth of the GPU in bytes/second.
  2. Compute Latency ($T_{Compute}$): This primarily accounts for the time spent on dense operations (mainly GEMMs), which constitute the vast majority of computations in LLMs. Each GEMM in the dense operations requires $2 B_{Dense} N_w K_w$ floating-point operations, where $N_w$ and $K_w$ are the dimensions of the weight matrix. Summing over all layers, the total compute is $2 B_{Dense} \sum N_w K_w$, and $\sum N_w K_w$ can be approximated by the total number of model parameters, $P_{Model}$. $ T_{Compute} \approx \frac{2 B_{Dense} \cdot P_{Model}}{Compute} $ Where:

    • $B_{Dense}$: The batch size for dense operations (including decode tokens from many requests and prefill tokens).
    • $P_{Model}$: Total number of parameters in the model (e.g., 70 billion for LLaMA-2 70B).
    • Compute: The aggregate GPU compute capacity in FLOPs/second.
  3. Network Latency ($T_{net}$): This applies specifically to tensor parallelism setups where collective communication (e.g., AllGather, AllReduce) is needed to synchronize results across multiple GPUs after operations. Tensor parallelism typically requires two AllGathers and one AllReduce (or two AllReduces) per layer. An AllReduce transfers activations twice, while an AllGather transfers them once. Thus, the total data movement for one GPU in bytes is estimated as $4 \cdot B_{Dense} D_{model} S_{type} \cdot L$. $ T_{net} \approx 4 \cdot \frac{N_{GPU} B_{Dense} D_{model} S_{type} L}{NetBW} $ Where:

    • $N_{GPU}$: Number of GPUs involved in tensor parallelism.
    • $B_{Dense}$: Batch size for dense operations.
    • $D_{model}$: Hidden dimension size.
    • $S_{type}$: Size of the data type (e.g., 2 bytes for FP16).
    • $L$: Number of layers.
    • NetBW: Aggregate GPU interconnect bandwidth.
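
To make the cost model concrete, the sketch below plugs rough published specs for $8 \times$ A100 80GB and LLaMA-2-70B ($D_{model} = 8192$, $L = 80$) into the three formulas at $B_{Dense} = 2048$; all constants are approximations for illustration, not the paper's measured numbers.

```python
# Back-of-the-envelope cost model for one serving iteration (illustrative constants).
N_GPU    = 8
mem_size = N_GPU * 80e9       # aggregate HBM capacity in bytes (8 x 80 GB)
mem_bw   = N_GPU * 2.0e12     # aggregate memory bandwidth in B/s (8 x 2 TB/s)
compute  = N_GPU * 312e12     # aggregate FP16 compute in FLOP/s (8 x 312 TFLOP/s)
net_bw   = N_GPU * 600e9      # aggregate interconnect bandwidth in B/s (nominal NVLink;
                              # effective bandwidth for collectives is lower in practice)

P_model  = 70e9               # LLaMA-2-70B parameter count
D_model  = 8192               # hidden dimension
L        = 80                 # transformer layers
S_type   = 2                  # bytes per FP16 value
B_dense  = 2048               # dense batch size (tokens per iteration)

T_mem     = mem_size / mem_bw                                    # load all device memory once
T_compute = 2 * B_dense * P_model / compute                      # dense GEMM FLOPs / compute rate
T_net     = 4 * N_GPU * B_dense * D_model * S_type * L / net_bw  # tensor-parallel collectives

for name, t in [("T_mem", T_mem), ("T_compute", T_compute), ("T_net", T_net)]:
    print(f"{name:10s} ~ {t * 1e3:6.1f} ms")       # compute dominates for this configuration
```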

Classification of LLM Serving Workloads (Section 3.3)

By comparing these latency components, NanoFlow classifies the workload characteristics.

  • Network vs. Compute: The ratio $T_{Net} / T_{Compute}$ is analyzed. For large models ($P_{Model} \approx 12 D_{model}^2 L$) and modern data center GPUs with high-bandwidth interconnects (like NVLink), this ratio is typically less than 1 (Figure 2), indicating that the network is generally not the bottleneck compared to compute.

    The following are the results from [Figure 2] of the original paper:

    [Figure 2 of the original paper: a heatmap of compute versus network resource limits for models such as LLaMA-3 8B and Mixtral 8x7B under different GPU configurations; the color indicates the degree to which each configuration is compute-bound or network-bound.]

  • Memory vs. Compute: The ratio $T_R = T_{Mem} / T_{Compute}$ is used to determine whether the workload is memory-bound or compute-bound: $ T_R = \frac{T_{Mem}}{T_{Compute}} \approx \frac{Compute}{MemBW} \frac{MemSize}{P_{Model}} \frac{1}{2 B_{Dense}} $ Where:

    • Compute: Aggregate GPU compute capacity.

    • MemBW: Aggregate GPU memory bandwidth.

    • MemSize: Aggregate GPU memory capacity.

    • $P_{Model}$: Total number of model parameters.

    • $B_{Dense}$: Batch size for dense operations.

      The paper observes that modern models widely adopt Grouped Query Attention (GQA), which allows for significantly larger $B_{Dense}$ values (e.g., 1024-2048 for LLaMA-2 70B). Combined with the increasing size of $P_{Model}$, these factors cause $T_R$ to drop below 1, implying that the workload becomes compute-bound. Figure 3 empirically validates this, showing that many common workloads are compute-bound (closer to yellow) rather than memory-bound (closer to green).
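
Plugging rough $8 \times$ A100 80GB numbers into this ratio for LLaMA-2-70B at $B_{Dense} = 2048$ (an illustrative calculation, not a figure from the paper) shows why large dense batches push serving into the compute-bound regime: $ T_R \approx \frac{312\,\mathrm{TFLOP/s}}{2\,\mathrm{TB/s}} \cdot \frac{640\,\mathrm{GB}}{70 \times 10^9} \cdot \frac{1}{2 \times 2048} \approx \frac{156 \times 9.1}{4096} \approx 0.35 < 1 $.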

The following are the results from [Figure 3] of the original paper:

[Figure 3 of the original paper: a heatmap classifying common workloads and models as memory-bound or compute-bound; most common configurations fall in the compute-bound region.]

The following are the results from [Table 1] of the original paper:

| Vendor | Model | Release Year | MemSize (GB) | MemBW (GB/s) | NetBW (GB/s) | Compute (FP16 GFLOP/s) | MemSize/MemBW | Compute/MemBW | NetBW/MemBW |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NVIDIA | V100 | 2017 | 16 | 900 | 300 | 125,000 | 0.018 | 139 | 0.33 |
| NVIDIA | A100 | 2020 | 40 | 1,555 | 600 | 312,000 | 0.026 | 200 | 0.39 |
| NVIDIA | A100 | 2021 | 80 | 2,000 | 600 | 312,000 | 0.040 | 156 | 0.30 |
| NVIDIA | H100 | 2023 | 80 | 3,352 | 900 | 989,000 | 0.024 | 295 | 0.268 |
| NVIDIA | H200 | 2024 | 96 | 4,800 | 900 | 989,000 | 0.020 | 206 | 0.19 |
| NVIDIA | B100 | 2024 | 120 | 8,000 | 1,800 | 1,800,000 | 0.015 | 225 | 0.23 |
| NVIDIA | B200 | 2024 | 120 | 8,000 | 1,800 | 2,250,000 | 0.015 | 281 | 0.23 |
| AMD | MI250 | 2021 | 128 | 3,352 | 800 | 362,000 | 0.038 | 107 | 0.24 |
| AMD | MI300 | 2023 | 192 | 5,300 | 1,024 | 1,307,000 | 0.036 | 246 | 0.19 |
| AMD | MI325X | 2024 | 256 | 6,000 | 1,024 | 1,307,000 | 0.043 | 218 | 0.17 |
| Intel | Gaudi 2 | 2022 | 96 | 2,400 | 600 | 1,000,000 | 0.040 | 417 | 0.25 |
| Intel | Gaudi 3 | 2024 | 128 | 3,700 | 1,200 | 1,800,000 | 0.035 | 486 | 0.32 |
| NVIDIA | Ada 6000 | 2022 | 48 | 960 | 64 | 182,000 | 0.050 | 190 | 0.067 |

Table 1 (above) presents accelerator characteristics, showing that the MemSize/MemBW, Compute/MemBW, and NetBW/MemBW ratios are relatively stable across vendors and generations, so the compute-bound conclusion is not tied to a single hardware generation.

Validation of the Cost Model (Section 3.4)

The cost model is validated on LLaMA-2 70B with 8 NVIDIA A100 GPUs and a dense batch size of 2048. GFLOPs, memory movement, and network traffic are computed for each operation (Table 2), and then $T_{compute}$, $T_{mem}$, and $T_{net}$ are estimated. The largest of the three, $T_{op} = \max(T_{compute}, T_{mem}, T_{net})$, indicates the most constrained resource for that operation. The sums of these values over all operations confirm that compute is the most constrained resource, aligning with the model's finding.

The following are the results from [Table 2] of the original paper:

| Operation | Compute (GFLOP) | Mem Load (GB) | Net Usage (GB) | Est. Tcomp (ms) | Est. Tmem (ms) | Est. Tnet (ms) | Real Time (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| KQV | 27487.8 | 19.5 | 0 | 11.01 | 1.22 | 0 | 16.08 |
| O | 21990.2 | 16.1 | 0 | 8.81 | 1.01 | 0 | 16.01 |
| UG | 153931.6 | 96.6 | 0 | 61.67 | 6.04 | 0 | 69.92 |
| D | 76965.8 | 49.7 | 0 | 30.84 | 3.11 | 0 | 34.96 |
| DecAttn | 3665.9 | 462.2 | 0 | 1.47 | 28.89 | 0 | 35.60 |
| PfAttn | 916.3 | 2.1 | 0 | 0.37 | 0.13 | 0 | 4.56 |
| Net | 18.8 | 75.2 | 75.2 | 0.01 | 4.70 | 31.33 | 47.92 |
| Total | | | | 114.17 | 45.09 | 31.33 | |

Table 2 (above) compares cost model estimations with real-world measurements for different LLM operations. The Est. Tcomp column, particularly for UG (Up Gate) and D (Down projection), shows the highest values, confirming compute as the dominant factor for the entire process.

Optimal Serving Throughput (Section 3.5)

Given that LLM serving is compute-bound, optimal throughput is achieved when the compute resource is fully utilized. $ \mathrm {Throughput_{optimal}} = \frac{B_{Dense}}{T_{Compute}} = \frac{Compute}{2 P_{Model}} $ Where:

  • $\mathrm{Throughput_{optimal}}$: The maximum theoretical throughput in tokens per second per GPU.

  • Compute: Aggregate GPU compute capacity.

  • $P_{Model}$: Total number of model parameters.

    This equation indicates that optimal throughput is solely dependent on the GPU's computational capacity and the model's parameter count, largely independent of memory or network characteristics. For example, for LLaMA-2 70B on $8 \times$ A100 GPUs, with 280 TFLOPS peak FP16 compute and a $P_{Model}$ of 70B, the optimal throughput is calculated as 1857 tokens/s/GPU.

4.2.2. NanoFlow Design: Intra-device Parallelism (Section 3.7 & 4)

Motivated by the compute-bound nature and the gap between current system throughput and optimal throughput (due to sequential execution), NanoFlow introduces intra-device parallelism through nano-batching.

Intra-device Parallelism Concept

Instead of processing a single large batch, NanoFlow splits each input batch into nano-batches. For example, an Up projection operation on a batch of 2048 could be split into two nano-operations, UP1 and UP2, processing nano-batches of 768 and 1280 tokens, respectively. These nano-operations are independent and can be executed concurrently. This allows heterogeneous operations (e.g., a compute-bound GEMM and a memory-bound KV-cache access) to overlap their execution within the same GPU, maximizing resource utilization.
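
As a rough illustration of this idea (not NanoFlow's kernels or scheduler), the PyTorch sketch below splits a 2048-token batch into the 768/1280 nano-batches mentioned above and issues a compute-bound GEMM for one nano-batch and a memory-bound KV-cache gather for the other on separate CUDA streams, so the GPU can overlap them; all shapes and the gather stand-in are assumptions.

```python
# Illustrative intra-device overlap of two independent nano-operations (requires a CUDA GPU).
import torch

dev = "cuda"
x = torch.randn(2048, 8192, device=dev, dtype=torch.float16)        # full dense batch
w_up = torch.randn(8192, 28672, device=dev, dtype=torch.float16)    # Up-projection weight
kv_cache = torch.randn(500_000, 1024, device=dev, dtype=torch.float16)

nano1, nano2 = x[:768], x[768:]                  # two nano-batches (768 + 1280 tokens)
idx = torch.randint(0, kv_cache.shape[0], (200_000,), device=dev)   # pages read for nano-batch 2
torch.cuda.synchronize()                         # make sure inputs are materialized first

s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
with torch.cuda.stream(s1):                      # compute-bound nano-operation: GEMM on nano-batch 1
    up1 = nano1 @ w_up
with torch.cuda.stream(s2):                      # memory-bound nano-operation: KV-cache reads for nano-batch 2
    gathered = kv_cache.index_select(0, idx)

torch.cuda.synchronize()                         # no data dependency between the two nano-operations,
                                                 # so their execution can overlap on the same device
```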

Automated Pipeline Search (Section 4.1)

The complexity of determining the optimal number, size, ordering, and resource allocation of nano-batches across diverse models and hardware is immense. NanoFlow addresses this with an auto-search engine using mixed integer linear programming (MILP) and a two-stage approximation.

4.2.2.1. Kernel Profiling and Interference Modeling (Section 4.1.1)

Understanding individual kernel performance and their interactions is crucial.

  1. Maximum Dense Batch Size: For a given model and hardware, NanoFlow first determines the largest dense batch size that can fit into GPU memory. This sets the upper bound for profiling.
  2. Profiling Interference-Free Kernels: NanoFlow profiles kernels (GEMM, GEMV, network kernels) individually, running them exclusively on the GPU. It explores various input batch sizes (e.g., multiples of 128) and kernel implementations (varying thread blocks, warps, tile sizes) to find the fastest configuration for each (kernel, batch size) pair.
  3. Profiling Kernels with Interference:
    • Kernel Interference: When multiple kernels run in parallel on a GPU, they compete for shared resources (execution units, caches, memory bandwidth), leading to slowdown. This is kernel interference.

    • Resource Allocation Proxy ($R$): NVIDIA GPUs do not offer explicit control over resource allocation. NanoFlow uses GEMM performance as a proxy for resource allocation $R$. If a compute kernel A is overlapped with kernel B, and kernel A achieves 40% of its individual peak performance, then $R_A = 0.4$. The remaining resources are assumed to go to kernel B, so $R_B = 1 - R_A = 0.6$.

    • Performance ($P$): For the non-compute kernel B, its normalized performance $P_B$ captures its memory or network utilization when $R_A$ resources are assigned to kernel A. This establishes an "exchange rate" between compute utilization ($R$) and memory/network utilization ($P$).

    • Reduced Profiling Space: To make profiling feasible given the vast number of kernel combinations, NanoFlow:

      • Limits thread block numbers for GEMV and network kernels.
      • Excludes inefficient GEMM kernels.
      • Focuses on pairwise interference (compute-memory, compute-network), assuming these RR to PP mappings hold for three-kernel overlaps.
    • Example (Figure 5): The paper shows profiling results for GEMM-GEMV pairs, identifying trade-offs. The goal is to find the best-performing kernel combinations at various trade-off points.

      The following are the results from [Figure 5] of the original paper:

      [Figure 5 of the original paper: interference characteristics between GEMM and GEMV kernels; points on the x-axis correspond to unique GEMM-GEMV implementation pairs, and the y-axis shows each kernel's normalized performance $P$.]

The following are the results from [Table 3] of the original paper:

| Operation | R = 0 | R = 0.1 | R = 0.2 | … | R = 0.8 | R = 0.9 | R = 1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GEMM (by definition) | 0 | 0.1 | 0.2 | … | 0.8 | 0.9 | 1 |
| GEMV | 0 | 0.2 | 0.3 | … | 0.85 | 0.95 | 1 |
| Network | 0 | 0.3 | 0.5 | … | 0.9 | 1 | 1 |

Table 3 (above) is an example resource mapping table generated from interference profiles, quantifying GEMV and network kernel performance ($P$) as functions of resource utilization ($R$).
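
A mapping like Table 3 could be approximated with a simple profiling loop; the sketch below times a GEMM and a memory-bound kernel alone and then overlapped on two CUDA streams, using the GEMM slowdown as the resource proxy $R$. The kernels, sizes, and timing method are simplifying assumptions, not NanoFlow's profiler.

```python
# Rough estimate of one (R, P) point for a GEMM / memory-bound kernel pair (illustrative).
import torch

def time_ms(fn, iters=20):
    """Time a GPU closure with CUDA events and return the average milliseconds."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize(); start.record()
    for _ in range(iters):
        fn()
    end.record(); torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

a = torch.randn(2048, 8192, device="cuda", dtype=torch.float16)
w = torch.randn(8192, 8192, device="cuda", dtype=torch.float16)
buf = torch.randn(200_000_000, device="cuda", dtype=torch.float16)

gemm = lambda: a @ w                     # compute-bound kernel
memop = lambda: buf.add_(1.0)            # memory-bandwidth-bound kernel (stand-in for GEMV)

t_gemm, t_mem = time_ms(gemm), time_ms(memop)   # interference-free baselines

s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
def overlapped():
    with torch.cuda.stream(s1): gemm()
    with torch.cuda.stream(s2): memop()
    torch.cuda.current_stream().wait_stream(s1)  # timing on the default stream must wait
    torch.cuda.current_stream().wait_stream(s2)  # for both overlapped kernels to finish

t_both = time_ms(overlapped)
# Crude approximation: assumes both kernels span the whole overlapped window.
R = t_gemm / t_both                      # GEMM's relative performance = resource proxy R
P = t_mem / t_both                       # memory kernel's normalized performance at 1 - R
print(f"R ~ {R:.2f}, P ~ {P:.2f}")
```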

4.2.2.2. Auto-search Stage I: Pipeline Structure Search (Section 4.1.2)

This stage uses MILP to determine the initial pipeline structure, assuming no kernel interference.

  • Input: Dense batch size, operation dependencies (from model architecture), interference-free kernel profiles.
  • Output: Number, batch size, and order of each nano-operation.
  • Optimization Objective: Minimize execution time by removing pipeline bubbles (idle periods) for compute operations.
  • Constraints:
    • Number of Nano-operations: Start with splitting into two nano-operations. If compute bubbles persist, increase the number for operations near the bubble until no further improvement is made.
    • Batch Sizes and Execution Times: Batch sizes are chosen from discrete values (e.g., multiples of 128) up to the dense batch size. Interference-free execution times from profiling are used.
    • Dependencies: Nano-operations are dependent if their parent operations are dependent AND their input nano-batches intersect.
    • Overlapping: Only operations constrained by different resources (e.g., compute and memory) are allowed to overlap, as overlapping same-resource operations is counterproductive.
    • Operation Transformations: Explores alternative implementations for network nano-operations (e.g., AllGather to AllReduce conversions) with different performance characteristics.
  • Search Time: While finding a globally optimal solution can be very time-consuming, NanoFlow prioritizes finding a feasible and practical solution within a reasonable time (e.g., ~10 minutes).

4.2.2.3. Auto-search Stage II: Refining the Pipeline (Section 4.1.3)

This stage refines the pipeline from Stage I by incorporating kernel interference.

  • Input: Pipeline structure (number, batch sizes, ordering) from Stage I, and the $R$ to $P$ mapping (Table 3).
  • Optimization Objective: Minimize pipeline execution time, considering kernel slowdown due to interference.
  • Constraints:
    • GPU Resource Utilization: The sum of $R$ for concurrently executing kernels at any given time must be less than or equal to 1.0 (representing the total GPU resources).
    • Execution Times: The execution time of a nano-operation is calculated as $D_{best} / P$, where $D_{best}$ is its best interference-free execution time and $P$ is its normalized performance derived from Table 3 based on its allocated $R$ (a toy sketch follows this list).
  • Trigger: This auto-search is performed only when the model architecture or workload characteristics change significantly.
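
The toy sketch below applies this refinement rule to a single overlapped GEMM/GEMV pair, using only the $R \to P$ values printed in Table 3 and interference-free times loosely based on Table 2; it illustrates the $D_{best}/P$ bookkeeping, not NanoFlow's MILP-based solver.

```python
# Toy Stage-II refinement for one overlapped GEMM/GEMV pair (illustrative only).
R_TO_P_GEMV = {0.1: 0.2, 0.2: 0.3, 0.8: 0.85, 0.9: 0.95}   # R -> P points shown in Table 3

d_best_gemm = 8.8    # ms, best interference-free GEMM time (assumed, loosely Table 2's O projection)
d_best_gemv = 28.9   # ms, best interference-free decode-attention time (assumed, loosely Table 2)

best = None
for r_gemm in (0.1, 0.2, 0.8, 0.9):          # GEMM receives r, GEMV receives the remaining 1 - r
    p_gemm = r_gemm                          # for GEMM, performance P equals its allocation R
    p_gemv = R_TO_P_GEMV[round(1.0 - r_gemm, 1)]
    t = max(d_best_gemm / p_gemm, d_best_gemv / p_gemv)   # the pair finishes when the slower one does
    if best is None or t < best[0]:
        best = (t, r_gemm)

print(f"allocate R = {best[1]} to the GEMM -> estimated overlapped time ~ {best[0]:.1f} ms")
```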

Example Pipelines (Section 4.1.4)

  • 70B Pipeline (e.g., LLaMA-2 70B): For models like LLaMA-2 70B, LLaMA-3 70B, Qwen2.5-72B, and Deepseek-67B, auto-search generates similar schedules because their performance characteristics are largely alike. For example, in LLaMA-2 70B, NanoFlow may use 4 nano-operations for KQV generation (where compute, memory, and network resources overlap), allowing decode attention to operate at 80% of its peak performance with a 0.4 GEMM performance sacrifice. For the rest of the pipeline, GEMM operations are prioritized with two nano-operations. This is illustrated in Figure 6.

    The following are the results from [Figure 6] of the original paper:

    [Figure 6 of the original paper: the auto-generated pipeline for 70B-class models, illustrating how nano-operations over compute, memory, and network resources are overlapped within an iteration.]

  • 8B Pipeline (e.g., LLaMA-3 8B): These models fit on a single GPU, so network operations are not relevant. Auto-search splits operations into two nano-operations, typically overlapping decode attention with Up Gate Down projection.

  • MoE Pipeline (e.g., Mixtral 8x7B): Mixture-of-Experts (MoE) models have different hidden dimensions and layers. Due to expert imbalance, NanoFlow uses tensor parallelism for FFN (Feed-Forward Network) layers, which are implemented using grouped-GEMM and include an additional gate routing operation. Auto-search adapts to these specific characteristics to generate an efficient pipeline.

4.2.3. NanoFlow Runtime (Section 4.2)

The NanoFlow runtime executes the auto-generated pipelines efficiently.

4.2.3.1. Request Scheduling (Section 4.2.1)

NanoFlow manages batching and scheduling to maintain high GPU utilization.

  • Batch Formation:
    • NanoFlow assumes external auto-scaling, workload balancing, and priority-aware routing.
    • It operates with the assumption of abundant requests and equal priority. If requests are scarce, the control plane should reduce NanoFlow instances to ensure a sufficiently large per-instance batch size.
    • It prioritizes unfinished decode requests and chunks prefill requests (following SarathiServe [3]) to precisely fill the remaining capacity of a pre-selected best-performing dense batch. This keeps dense operations running with consistent batch sizes, reducing tail latency.
    • The global batch initially has prefill requests. As they complete, they become decode requests, and new prefill requests are introduced to maintain a fixed token batch size. The ratio of prefill to decode tokens stabilizes over time.
    • To prevent out-of-memory errors, NanoFlow predicts future memory usage based on request status (e.g., decoded tokens, estimated completion time) and only prefills new requests if memory limits are maintained. If OOM occurs, requests can be offloaded to the CPU and reloaded later.
  • Asynchronous Scheduling:
    • In traditional systems, CPU-side tasks like batch formation, EOS token detection, and request management happen sequentially after each GPU iteration, leading to GPU idleness (pipeline bubbles).
    • NanoFlow asynchronously schedules batch formation in parallel with GPU execution. For iteration $i$, the batch for iteration $i+1$ is formed before iteration $i$ ends. After launching iteration $i+1$, the batch for $i+2$ is formed, EOS tokens from iteration $i$ are detected, and finished requests are removed (a minimal sketch follows this list).
    • This can delay EOS detection slightly (one extra decode token), but for typical workloads with average decode lengths over 100, this overhead is negligible ($<1\%$) compared to the benefit of hiding batch formation overhead.
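
A minimal sketch of this asynchronous loop, with batch formation (decode first, then chunked prefill up to a fixed token budget) running on a CPU thread one iteration ahead of a placeholder GPU step; the names, the 2048-token budget, and the queue hand-off are illustrative assumptions.

```python
# Sketch: form the batch for iteration i+1 on the CPU while iteration i runs (illustrative).
import threading, queue

TOKEN_BUDGET = 2048                                  # pre-selected dense batch size

def form_batch(decode_reqs, prefill_reqs):
    """Prioritize decode requests, then chunk prefill requests to fill the remaining budget."""
    batch = [(req, 1) for req in decode_reqs]        # one token per in-flight decode request
    remaining = TOKEN_BUDGET - len(batch)
    for req in prefill_reqs:
        if remaining <= 0:
            break
        chunk = min(req["tokens_left"], remaining)   # chunked prefill fills the gap exactly
        batch.append((req, chunk))
        remaining -= chunk
    return batch

def run_iteration(batch):
    pass                                             # placeholder for executing one pipeline iteration

decode_reqs = [{"id": i} for i in range(1500)]
prefill_reqs = [{"id": 10_000 + i, "tokens_left": 400} for i in range(8)]

batches = queue.Queue(maxsize=1)                     # holds the batch for the *next* iteration
producer = threading.Thread(
    target=lambda: [batches.put(form_batch(decode_reqs, prefill_reqs)) for _ in range(4)])
producer.start()
for _ in range(4):
    run_iteration(batches.get())                     # batch i+1 was formed while iteration i ran
producer.join()
```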

4.2.3.2. KV-cache Management (Section 4.2.2)

To support multi-round conversations efficiently, NanoFlow implements advanced KV-cache offloading.

  • Simultaneous Offloading:
    • Instead of waiting for requests to complete, NanoFlow offloads the KV-cache of tokens directly after KQV generation in each transformer layer, before they are appended to the main KV-cache.
    • This ensures KV vectors are contiguous and offloading data size is balanced across iterations.
    • Device-host copies (GPU-initiated) are performed during compute-bound FFN operations to minimize overhead.
    • NUMA-aware thread-binding further reduces offloading time.
  • Host KV-cache Management:
    • NanoFlow uses an LRU (Least Recently Used) policy to manage a hierarchical cache across CPU memory and SSDs.
    • KV-cache entries are evicted to SSD when CPU memory limits are hit and retrieved from either CPU or SSD when a request's next round arrives.
  • KV-cache Loading and Scattering:
    • Utilizing PagedAttention, KV-cache pages can be fragmented in GPU memory.

    • To avoid slow copies to fragmented destinations, NanoFlow first copies the KV-cache data to a contiguous space on the GPU.

    • Then, it scatters these contiguous pages to their fragmented destinations in GPU memory, achieving 7-10x higher host-to-device copy bandwidth (a minimal sketch appears at the end of this subsection).

      The implementation details involve approximately 10K lines of CUDA and 6K lines of Python code, launching nano-operations based on auto-search results and managing dependencies with CUDA events.
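
A minimal PyTorch sketch of this two-step loading path: one large contiguous host-to-device copy into a staging buffer, followed by a scatter into the fragmented pages of the paged KV-cache. The page size, shapes, and use of index_copy_ are illustrative assumptions, not NanoFlow's kernels.

```python
# Sketch: reload offloaded KV-cache pages into fragmented GPU pages (illustrative, needs a CUDA GPU).
import torch

PAGE_TOKENS, KV_DIM, NUM_GPU_PAGES = 16, 1024, 4096
gpu_kv = torch.zeros(NUM_GPU_PAGES, PAGE_TOKENS, KV_DIM, device="cuda", dtype=torch.float16)

host_pages = torch.randn(64, PAGE_TOKENS, KV_DIM, dtype=torch.float16).pin_memory()
dest_pages = torch.randperm(NUM_GPU_PAGES, device="cuda")[:64]   # fragmented destination page ids

# Step 1: a single large, contiguous host-to-device copy into a staging buffer.
staging = host_pages.to("cuda", non_blocking=True)

# Step 2: scatter the contiguous staging buffer into the fragmented paged KV-cache.
gpu_kv.index_copy_(0, dest_pages, staging)
torch.cuda.synchronize()
```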

5. Experimental Setup

5.1. Datasets

The evaluation uses three practical, real-world conversation datasets and also includes experiments with constant input/output lengths.

  • Splitwise [32]:
    • Source: A conversation trace collected from a real production environment at Microsoft.
    • Scale: Approximately 20,000 requests.
    • Characteristics: Represents typical interactions in a production setting.
    • Average Input (Std): 1155 (1109) tokens
    • Average Output (Std): 211 (163) tokens
    • Why Chosen: Provides a realistic workload profile from a large-scale application.
  • LMSYS-Chat-1M [56]:
    • Source: A large-scale dataset featuring 1 million real-world conversations collected from 25 different LLMs.
    • Scale: 50,000 requests randomly sampled for evaluation.
    • Characteristics: Diverse conversations reflecting a broad range of user interactions with various LLMs.
    • Average Input (Std): 102 (169) tokens
    • Average Output (Std): 222 (210) tokens
    • Why Chosen: Offers a comprehensive and varied representation of real-world LLM usage.
  • ShareGPT [1]:
    • Source: A dataset of conversations collected from the ShareGPT API.

    • Scale: 50,000 requests randomly sampled for evaluation.

    • Characteristics: Contains interactions with various LLMs, often featuring longer and more complex prompts and responses.

    • Average Input (Std): 246 (547) tokens

    • Average Output (Std): 322 (244) tokens

    • Why Chosen: Provides insights into LLM performance under more extensive and user-generated dialogue.

      The following are the results from [Table 4] of the original paper:

      | Dataset | Avg. Input (Std) | Avg. Output (Std) |
      | --- | --- | --- |
      | Splitwise [32] | 1155 (1109) | 211 (163) |
      | LMSYS-Chat [56] | 102 (169) | 222 (210) |
      | ShareGPT [1] | 246 (547) | 322 (244) |

Table 4 (above) provides the average and standard deviation of input and output lengths for the sampled datasets. These statistics are crucial for understanding the workload profiles used in the evaluation. The datasets are effective for validating the method's performance as they represent diverse and realistic LLM workloads, covering varying input/output lengths and conversational patterns.

5.2. Evaluation Metrics

The paper evaluates NanoFlow using standard performance metrics for LLM serving:

  • Throughput (tokens/s/GPU):
    • Conceptual Definition: Measures the total number of tokens (including both input prompt tokens and generated output tokens) processed by the system per second, normalized per GPU. This metric is crucial for assessing the cost-efficiency and overall capacity of an LLM serving system, as higher throughput means more work done with the same hardware.
    • Mathematical Formula: Not explicitly provided as a standard formula for this context, but typically calculated as: $ \text{Throughput} = \frac{\text{Total Tokens Processed}}{\text{Total Execution Time (seconds)}} $ When normalized per GPU: $ \text{Throughput/GPU} = \frac{\text{Total Tokens Processed}}{\text{Total Execution Time (seconds)} \times N_{GPU}} $
    • Symbol Explanation:
      • Total Tokens Processed: The sum of all input tokens (from prefill phases) and all output tokens (from decode phases) across all requests.
      • Total Execution Time (seconds): The total duration for which the system was processing these requests.
      • $N_{GPU}$: The number of GPUs used in the serving system.
  • Optimal Throughput (tokens/s/GPU):
    • Conceptual Definition: The theoretical maximum throughput achievable by the system, assuming full utilization of the most constrained resource (which NanoFlow identifies as compute). It serves as an upper bound for performance, against which actual system performance can be benchmarked.
    • Mathematical Formula: $ \mathrm {Throughput_{optimal}} = \frac{Compute}{2 P_{Model}} $
    • Symbol Explanation:
      • Compute: The aggregate GPU compute capacity in GFLOP/s (Giga Floating Point Operations per second) for the specific data type (e.g., FP16).
      • $P_{Model}$: The total number of parameters in the LLM.
  • Latency (Normalized Latency, 99th-percentile Latency):
    • Conceptual Definition:
      • Normalized Latency: The end-to-end time taken for a request to be processed, divided by the output length in tokens. This metric helps to compare latency across requests with varying output lengths. A lower value indicates faster processing per output token.
      • 99th-percentile Latency: The latency value below which 99% of all requests fall. This is a critical metric for Service Level Objectives (SLOs) as it captures the experience of most users, including those who encounter slower responses (the "tail" of the latency distribution).
    • Mathematical Formula:
      • $\text{Normalized Latency} = \frac{\text{End-to-End Latency (ms)}}{\text{Output Length (tokens)}}$
      • The 99th-percentile is a statistical measure and does not have a simple algebraic formula; it's computed from the sorted list of all individual request latencies.
    • Symbol Explanation:
      • End-to-End Latency (ms): The time from a request's arrival to its full completion.
      • Output Length (tokens): The number of tokens generated as output for that specific request.
  • Resource Utilization (Compute, Memory, Network):
    • Conceptual Definition: Measures the percentage of time or capacity that a specific hardware resource (e.g., GPU compute units, memory bandwidth, network bandwidth) is actively being used. High utilization indicates efficient use of hardware. NanoFlow specifically aims to maximize compute utilization.
    • Mathematical Formula: No single standard formula; typically represented as a percentage or fraction: $ \text{Utilization} = \frac{\text{Actual Usage}}{\text{Maximum Capacity}} \times 100\% $
    • Symbol Explanation:
      • Actual Usage: The amount of a resource actively being used (e.g., FLOPs performed, bytes transferred).
      • Maximum Capacity: The theoretical maximum capability of that resource (e.g., peak FLOPs/s, peak GB/s).
  • Speedup (Throughput Boost):
    • Conceptual Definition: The factor by which the performance (e.g., throughput) of a new method improves over a baseline method. It quantifies the effectiveness of an optimization.
    • Mathematical Formula: $ \text{Speedup} = \frac{\text{Throughput}_{\text{NanoFlow}}}{\text{Throughput}_{\text{Baseline}}} $
    • Symbol Explanation:
      • $\text{Throughput}_{\text{NanoFlow}}$: The throughput achieved by NanoFlow.
      • $\text{Throughput}_{\text{Baseline}}$: The throughput achieved by a comparative baseline system.
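
To tie these definitions together, the sketch below computes tokens/s/GPU and the 99th-percentile normalized latency from per-request records; the record schema and values are assumptions for illustration.

```python
# Compute throughput per GPU and p99 normalized latency from request records (illustrative).
import numpy as np

N_GPU = 8
# Assumed schema: (arrival_s, completion_s, input_tokens, output_tokens) per request.
records = [(0.0, 12.4, 1155, 211), (0.3, 9.8, 102, 222), (1.1, 20.5, 246, 322)]

total_tokens = sum(inp + out for _, _, inp, out in records)            # prefill + decode tokens
wall_clock = max(done for _, done, _, _ in records) - min(arr for arr, _, _, _ in records)

throughput_per_gpu = total_tokens / (wall_clock * N_GPU)               # tokens/s/GPU
norm_latency_ms = [1e3 * (done - arr) / out for arr, done, _, out in records]
p99 = np.percentile(norm_latency_ms, 99)                               # ms per output token

print(f"{throughput_per_gpu:.1f} tokens/s/GPU, p99 normalized latency {p99:.1f} ms/token")
```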

5.3. Baselines

NanoFlow is compared against three state-of-the-art and widely-used LLM serving frameworks, representing the current best practices in the field:

  • vLLM [17]:
    • Description: A high-throughput LLM serving system known for its efficient memory management and scheduling.
    • Key Features: Implements PagedAttention for maximizing KV-cache memory utilization and continuous batching (optionally combined with chunked prefill) to maintain high GPU utilization. It focuses on effective GPU memory usage and dynamic batching.
    • Why Representative: A widely adopted and recognized leader in LLM serving throughput optimization, making it a strong benchmark for comparing memory-efficient approaches.
  • DeepSpeed-FastGen [13]:
    • Description: A serving framework developed by Microsoft, part of the broader DeepSpeed ecosystem for large-scale model training and inference.
    • Key Features: Dynamically composes prefill and decode requests to ensure the engine operates in a high-throughput regime. It also uses chunked prefill and advanced scheduling techniques.
    • Why Representative: Represents industrial-grade optimization efforts from a major technology company, focusing on dynamic request management for high throughput.
  • TensorRT-LLM [26]:
    • Description: A high-performance LLM inference engine built upon NVIDIA's TensorRT SDK.

    • Key Features: Leverages NVIDIA's highly optimized kernel libraries and compilation tools (TensorRT) to achieve maximum performance on NVIDIA GPUs. It includes optimizations like paged KV-cache and dynamic batching.

    • Why Representative: Represents the pinnacle of hardware-specific (NVIDIA) optimization, providing an excellent benchmark for raw performance potential on the target hardware.

      These baselines were chosen because they are widely recognized, actively maintained, and implement state-of-the-art techniques for LLM serving, making them robust comparisons for NanoFlow's performance. The experimental setup tunes parameters for each baseline (e.g., max-ragged-batch-size for DeepSpeed-FastGen, max-num-tokens for TensorRT-LLM) to ensure they are operating at their best possible throughput.

5.4. Hardware

The experiments were conducted on the following hardware:

  • GPUs: $8 \times$ NVIDIA A100 80GB SXM GPUs.

  • Interconnect: The GPUs are interconnected via NVLink, providing high-bandwidth, low-latency communication crucial for tensor parallelism.

    This setup is representative of high-end data center inference nodes. For models that fit on a single GPU (e.g., LLaMA-3-8B), a single A100 80GB SXM GPU was used.

5.5. Models

The evaluation covers a range of popular and representative LLMs to demonstrate NanoFlow's general applicability:

  • LLaMA-2-70B [49]:
    • Role: The primary model used for detailed evaluation and ablation studies due to its widespread adoption as a large, open-source LLM.
    • Characteristics: 70 billion parameters.
  • LLaMA-3-70B [22]:
    • Role: Used to show NanoFlow's performance on newer LLaMA generations.
    • Characteristics: 70 billion parameters, features a larger vocabulary size (128K) compared to LLaMA-2.
  • LLaMA-3-8B [22]:
    • Role: Evaluates NanoFlow's performance on smaller models that fit within a single GPU.
    • Characteristics: 8 billion parameters.
  • Qwen2-72B [9]:
    • Role: Demonstrates performance on models with slightly different architectural specifics.
    • Characteristics: 72 billion parameters; adds bias terms to the QKV (query, key, value) projections.
  • Deepseek-67B:
    • Role: Another large model with architectural variations.
    • Characteristics: 67 billion parameters, with a different number of layers and hidden dimension ($D_{model}$).
  • Mixtral 8x7B [16]:
    • Role: Crucial for demonstrating NanoFlow's effectiveness on Mixture-of-Experts (MoE) architectures.

    • Characteristics: An MoE model with 8 experts, each being a 7B parameter model, enabling sparse activation during inference.

      All models used FP16 (16-bit floating-point) weights and activations, which is standard practice for high-performance, data center-scale LLM inference to optimize memory and computation.

6. Results & Analysis

6.1. Core Results Analysis

Throughput Comparison

The paper first evaluates throughput in an "offline" setting, simulating scenarios like benchmarking, information extraction, or data processing where the system is continuously supplied with requests. Requests sample input and output lengths from the real-world datasets (Splitwise, LMSYS-Chat-1M, ShareGPT) or use constant lengths. The NanoFlow instance, configured with a dense batch size of 2048 for LLaMA-2-70B (where it performs best), is compared against baselines and the theoretical optimal throughput.

The theoretical optimal throughput for LLaMA-2-70B on $8 \times$ A100 GPUs is derived as 1857 tokens/s/GPU (as calculated in Section 3.5).
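
As a rough sanity check on the order of magnitude, one can divide aggregate FP16 compute by the GEMM FLOPs needed per generated token (about 2 FLOPs per parameter). The sketch below uses the publicly known A100 dense FP16 peak; it ignores attention FLOPs, which the paper's derivation does account for, so it yields a looser (higher) bound than the 1857 tokens/s/GPU figure.

```python
# Back-of-the-envelope upper bound on serving throughput (GEMM FLOPs only).
# Assumptions: A100 SXM dense FP16 peak ~312 TFLOP/s; ~2 FLOPs per parameter per token.
num_gpus = 8
peak_flops_per_gpu = 312e12        # FLOP/s, dense FP16 (no structured sparsity)
model_params = 70e9                # LLaMA-2-70B

flops_per_token = 2 * model_params                      # GEMM work only; attention excluded
total_tokens_per_s = num_gpus * peak_flops_per_gpu / flops_per_token
tokens_per_s_per_gpu = total_tokens_per_s / num_gpus
print(round(tokens_per_s_per_gpu))                      # ~2229; the paper's tighter bound
                                                        # (1857) also charges attention FLOPs
```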

The following are the results from [Figure 7] of the original paper:


Figure 7 illustrates the throughput comparison. NanoFlow consistently achieves the highest throughput across all tested settings. In its best case, NanoFlow reaches 68.5% of the theoretical optimal throughput.

  • Constant Length Workloads:
    • NanoFlow achieves an average of 2.62x higher offline throughput than vLLM.
    • NanoFlow achieves an average of 2.78x higher offline throughput than DeepSpeed-FastGen.
    • NanoFlow achieves an average of 1.73x higher offline throughput than TensorRT-LLM.
  • Dataset-Driven Workloads (variable lengths):
    • NanoFlow achieves an average of 4.18x higher throughput than vLLM.

    • NanoFlow achieves an average of 3.45x higher throughput than DeepSpeed-FastGen.

    • NanoFlow achieves an average of 1.91x higher throughput than TensorRT-LLM.

      The significant gains, especially on dataset-driven workloads, demonstrate NanoFlow's ability to handle realistic, variable-length requests much more efficiently than state-of-the-art systems. The 1.91x average boost over TensorRT-LLM (which leverages NVIDIA's highly optimized kernels) is particularly impressive, highlighting the effectiveness of intra-device parallelism in unlocking latent GPU potential.

Latency Comparison

The paper evaluates latency by modeling request arrival intervals with an exponential distribution, generating request traces at various rates over 5 minutes. Normalized latency (end-to-end latency divided by output length) is the primary metric, with an SLO of 200 ms per output token (roughly human reading speed).
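
Such traces are easy to reproduce: exponentially distributed inter-arrival times correspond to a Poisson arrival process. The sketch below generates a 5-minute trace for a given request rate; function and variable names are illustrative, not taken from the paper's artifact.

```python
import numpy as np

def generate_arrival_trace(request_rate_per_s: float, duration_s: float = 300.0,
                           seed: int = 0) -> np.ndarray:
    """Return arrival timestamps (seconds) for a Poisson arrival process:
    inter-arrival gaps are exponential with mean 1/rate, and arrivals are
    kept until the 5-minute horizon is exhausted."""
    rng = np.random.default_rng(seed)
    # Draw a generous number of gaps, then cut the cumulative sum at the horizon.
    n_draws = int(request_rate_per_s * duration_s * 1.5) + 10
    gaps = rng.exponential(scale=1.0 / request_rate_per_s, size=n_draws)
    arrivals = np.cumsum(gaps)
    return arrivals[arrivals < duration_s]

trace = generate_arrival_trace(request_rate_per_s=4.0)   # ~1200 requests over 5 minutes
print(len(trace), trace[:3])
```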

The following are the results from [Figure 8] of the original paper:


Figure 8 presents the normalized latency of NanoFlow versus baselines at different request rates.

  • Low Request Rates: At lower request rates, NanoFlow exhibits comparable but slightly higher latency than the best baseline (TensorRT-LLM). This is attributed to NanoFlow's focus on throughput-oriented scenarios, where it uses a large dense batch size which might introduce a small amount of overhead for individual requests at very low concurrency.
  • High Request Rates & SLO: As the request rate increases, NanoFlow demonstrates superior performance. It is capable of sustaining a significantly higher request rate while staying within the 200ms latency SLO compared to all baselines across all datasets. For instance, on the LMSys-Chat-1M dataset, NanoFlow handles 1.64x higher request rates than TensorRT-LLM within the 200ms normalized latency constraint.
  • Tail Latency: NanoFlow maintains good tail latency performance. Its 99th-percentile latency is only 1.07x of the average latency at near-maximum throughput. This is a crucial advantage for user experience, as it implies a consistent performance even for a small fraction of slower requests, largely due to its use of a constant dense batch size which avoids performance cliffs.

6.2. Data Presentation (Tables)

The tables from the paper were used in the methodology section to illustrate the foundational analysis for NanoFlow.

  • Table 1: Characteristics of accelerator models: Presented in Section 4.2.1.
  • Table 2: Comparison of operation runtimes between cost model estimation and real-world measurements: Presented in Section 4.2.1.
  • Table 3: Performance $P$ of GEMV and network kernels with different resource utilization $R$: Presented in Section 4.2.2.1.
  • Table 4: The average and standard deviation of input and output lengths in the sampled datasets: Presented in Section 5.1.

6.3. Ablation Studies / Parameter Analysis

To isolate the contributions of NanoFlow's key techniques, an ablation study was conducted, comparing NanoFlow with two baselines that share its asynchronous request scheduling and kernel libraries:

  1. Non-overlapping Baseline: Processes inputs sequentially without nano-batches. This represents the traditional approach.

  2. Nano-batch-only Baseline: Splits requests into nano-batches but executes them sequentially, allowing assessment of the overhead of nano-batching alone.

    The following are the results from [Figure 9] of the original paper:


Figure 9 illustrates the results of the ablation study.

  • Overhead of Nano-batching: Splitting into nano-batches alone (Nano-batch-only vs. Non-overlapping) actually reduces performance by 13.2%. This confirms that nano-batching introduces some overhead (e.g., increased kernel launch overhead, repeated weight loading), which needs to be offset by overlapping.
  • Benefit of Overlapping Network-bound Kernels:
    • For prefill-only workloads (Input 512, Output 0), which are typically compute-bound with some network-bound collective communications, NanoFlow achieves a 1.07x speedup compared to the non-overlapping baseline. This demonstrates the benefit of overlapping network-bound and compute-bound kernels.
  • Benefit of Overlapping Network- and Memory-bound Kernels:
    • For decode-heavy workloads (Input 512, Output 1024), where decode attention introduces memory-bound characteristics, NanoFlow achieves a 1.17x speedup compared to the non-overlapping baseline. This indicates the effectiveness of overlapping all three types of heterogeneous operations.
  • KV-cache Offloading Overhead: Enabling KV-cache offloading introduces a 3.0% performance degradation due to kernel interference from KV-cache movement. However, this is a necessary trade-off: offloading can reduce the compute required for multi-round LMSYS-Chat workloads by 3.02x, effectively saving resources for longer-term or memory-constrained scenarios despite the minor immediate throughput hit.

Resource Usage

NanoFlow's effectiveness in resource utilization is demonstrated by comparing it with a non-overlapping baseline.

The following are the results from [Figure 10] of the original paper:


Figure 10 visually represents the resource usage pattern. The non-overlapping baseline (left) shows sequential execution, where typically only one resource (compute, memory, or network) is heavily utilized at any given time, leading to pipeline bubbles and underutilization of the others. In contrast, NanoFlow (right) concurrently utilizes multiple resources, producing a much denser and more efficient usage pattern. This enables NanoFlow to achieve an average of 68.5% compute utilization across the whole pipeline. While kernel interference prevents 100% of optimal compute usage (a natural consequence of concurrent operations), the significant improvement over the baseline validates the intra-device parallelism approach.
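
To make the overlapping idea concrete, here is a deliberately simplified PyTorch sketch that splits a batch into nano-batches and issues a compute-bound GEMM and a memory-bound device-to-host copy on separate CUDA streams so they can run concurrently. This is not NanoFlow's implementation (which schedules real attention, GEMM, and collective kernels through an auto-searched pipeline); it is only a toy illustration of intra-device overlapping, assuming a CUDA-capable GPU.

```python
import torch

def overlap_nano_batches(x: torch.Tensor, weight: torch.Tensor, n_nano: int = 4):
    """Toy intra-device overlap: for each nano-batch, enqueue a GEMM on one
    CUDA stream and a host copy on another, so compute and memory traffic
    from different nano-batches can overlap in time."""
    assert x.is_cuda and weight.is_cuda
    compute_stream = torch.cuda.Stream()
    copy_stream = torch.cuda.Stream()
    # Both side streams wait for any pending work that produced x / weight.
    compute_stream.wait_stream(torch.cuda.current_stream())
    copy_stream.wait_stream(torch.cuda.current_stream())

    outputs, host_copies = [], []
    for nano in x.chunk(n_nano, dim=0):
        with torch.cuda.stream(compute_stream):
            outputs.append(nano @ weight)                          # compute-bound
        with torch.cuda.stream(copy_stream):
            host_copies.append(nano.to("cpu", non_blocking=True))  # memory-bound
    torch.cuda.synchronize()                                       # join both streams
    return torch.cat(outputs, dim=0), host_copies

if torch.cuda.is_available():
    x = torch.randn(2048, 8192, device="cuda", dtype=torch.float16)
    w = torch.randn(8192, 8192, device="cuda", dtype=torch.float16)
    y, _ = overlap_nano_batches(x, w)
    print(y.shape)
```

In NanoFlow itself, the analogous decisions (how many nano-batches, which operations overlap, and how much GPU capacity each receives) are made by the auto-search engine rather than hand-coded as above.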

6.4. Performance on Other LLMs

To showcase NanoFlow's generalizability, its throughput was evaluated on several other popular LLMs, using a constant input length of 1024 and output length of 512. All tests were on $8 \times$ A100 80GB SXM GPUs, except for LLaMA-3-8B, which ran on a single A100.

The following are the results from [Figure 11] of the original paper:


Figure 11 summarizes the performance for these models.

  • NanoFlow consistently improves throughput for all tested models.

  • It achieves between 50% and 72% of the optimal throughput for these diverse architectures, demonstrating robust performance across different model scales and types (including MoE models).

  • On average, NanoFlow achieves a 2.66x throughput gain compared to vLLM across these models, further cementing its advantage over state-of-the-art systems.

  • For example, LLaMA-3-8B reaches 78.5% of optimal throughput, highlighting efficient scaling down to single-GPU deployments.

    These results confirm that NanoFlow's auto-search engine is capable of automatically generating efficient pipelines for a wide range of LLMs and hardware configurations, effectively adapting to their specific characteristics.

7. Conclusion & Reflections

7.1. Conclusion Summary

This work presents a compelling analysis that re-evaluates the fundamental bottleneck in Large Language Model (LLM) serving. Contrary to the widely held assumption of memory-bound operations, the authors rigorously demonstrate that end-to-end LLM serving, for most modern models and common workloads, is predominantly compute-bound. The core problem identified is the sequential execution of heterogeneous operations (compute, memory, network) within a single GPU, leading to significant underutilization of the compute resources.

To address this, the paper proposes NanoFlow, a novel and comprehensive LLM serving framework. NanoFlow's key innovation is the exploitation of intra-device parallelism through nano-batching. By splitting inputs into smaller nano-batches and duplicating operations into nano-operations, NanoFlow enables the concurrent execution of heterogeneous operations, effectively overlapping compute-bound, memory-bound, and network-bound tasks within the same GPU. The framework features an intelligent auto-search engine that automatically designs optimal execution pipelines, considering kernel interference and resource allocation, and an efficient runtime system that manages asynchronous scheduling and KV-cache offloading.

The experimental results are robust and significant. NanoFlow achieves an impressive 1.91x throughput improvement over state-of-the-art serving systems like vLLM, DeepSpeed-FastGen, and TensorRT-LLM on practical workloads. Furthermore, it reaches between 50% and 72% of the theoretically optimal throughput across various popular LLMs (e.g., LLaMA-2-70B, Mixtral 8x7B), significantly closing the gap to hardware capabilities. NanoFlow also demonstrates superior latency performance at high request rates while adhering to SLO constraints.

7.2. Limitations & Future Work

While the paper doesn't have a dedicated "Limitations" section, some inherent aspects and areas for potential future work can be inferred:

  • Complexity of Interference Modeling: The kernel interference modeling relies on pairwise profiling and on the assumption that the resulting mappings ($R$ to $P$) still hold when three kernels overlap. While practical, this is an approximation, and the non-linear, unpredictable nature of GPU kernel interference means there is always a gap between modeled and true performance, which keeps NanoFlow from reaching 100% of optimal throughput.
  • Search Space for Auto-search: While NanoFlow reduces the auto-search space, finding a "practical" pipeline in ~10 minutes, there's a trade-off with absolute optimality. For highly dynamic or niche workloads, the current approximation might not always yield the absolute best solution. The MILP formulation itself implies a certain level of abstraction and simplification of real-world GPU complexities.
  • Fixed Dense Batch Size Assumption: NanoFlow operates under the assumption of a stable dense batch size to simplify calculations and ensure consistent performance. In extremely sparse or rapidly changing real-world traffic, maintaining this optimal dense batch size could be challenging, relying heavily on the external control plane for auto-scaling and load balancing.
  • Overhead of Nano-batching: The ablation study showed that nano-batching alone introduces a 13.2% performance reduction. While overlapping significantly offsets this, further reducing this inherent overhead could yield even greater gains.
  • Hardware Specificity: The evaluation is primarily focused on NVIDIA GPUs (A100). While Table 1 suggests similar compute/memory/network ratios across vendors, the specific kernel profiling and interference modeling are hardware-dependent. Adapting NanoFlow to other accelerators (AMD, Intel, custom AI chips) would require re-profiling and potentially re-tuning the auto-search's assumptions.
  • Extending to Other Model Architectures: While NanoFlow demonstrates effectiveness on MoE models, the rapid evolution of LLM architectures (e.g., more complex attention variants, novel layers) would require continuous updates to the auto-search's understanding of operation dependencies and characteristics.

7.3. Personal Insights & Critique

NanoFlow presents a highly insightful and practical advancement in LLM serving. Its core contribution lies in challenging a long-standing assumption and offering a sophisticated, automated solution to the newly identified bottleneck.

  • Innovation of Bottleneck Re-evaluation: The re-classification of LLM serving as compute-bound is a pivotal insight. It redirects optimization efforts from solely memory-centric approaches to maximizing compute utilization, which is a more accurate target for current hardware and model trends. This demonstrates critical thinking about system performance beyond superficial observations.
  • Elegance of Intra-device Parallelism: The concept of intra-device parallelism via nano-batching is elegant. Instead of just packing more requests into a GPU, NanoFlow intelligently overlaps the execution phases within a single GPU, turning idle times into productive ones. This is a fine-grained form of pipelining that is highly effective for heterogeneous workloads.
  • Sophistication of Auto-search: The auto-search engine is a standout feature. Manually optimizing complex pipelines involving multiple nano-batches and accounting for non-linear kernel interference is an intractable problem. Automating this process with MILP and a two-stage approximation makes the approach adaptable and deployable across diverse models and hardware, significantly reducing engineering effort. The $R$ to $P$ mapping is a clever way to quantify and manage interference.
  • Holistic System Design: NanoFlow isn't just an algorithm; it's a complete system, addressing batch formation, asynchronous scheduling, and KV-cache management. This holistic approach ensures that the gains from intra-device parallelism are not negated by other system bottlenecks.
  • Applicability Beyond LLMs: The principles of intra-device parallelism and auto-search for optimizing heterogeneous workloads could potentially be applied to other complex deep learning models or even general-purpose GPU computing where different kernel types (compute, memory, I/O) are executed sequentially.

Critique:

  • Approximation in Interference Modeling: While necessary, the approximation in kernel interference modeling (pairwise profiling, the $R$ to $P$ mapping) is a potential source of suboptimality. Real-world interference patterns are often more complex and context-dependent. Future work could explore more dynamic or machine-learning-based interference prediction models, perhaps with hardware performance counters, to refine this.

  • Search Time vs. Optimality Trade-off: The "10 minutes" search time for a "practical" pipeline is reasonable for deployment but implies that it's not exhaustively exploring the entire space. For extremely critical or static deployments, a longer, more exhaustive search might yield marginal but still valuable improvements.

  • Real-time Adaptability: While auto-search runs when architectures change, dynamic changes in workload characteristics (e.g., sudden shifts in input/output length distributions) might require rapid pipeline re-optimization. The current auto-search, taking minutes, might not be suitable for real-time adaptation. A lightweight, online adaptation mechanism could be a valuable future direction.

    Overall, NanoFlow is a robust and impactful piece of research. It provides a deeper understanding of LLM serving bottlenecks and delivers a practical, high-performance solution that significantly advances the state of the art in efficient LLM inference.
