NanoFlow: Towards Optimal Large Language Model Serving Throughput
TL;DR Summary
The paper introduces NanoFlow, a novel framework for optimizing Large Language Model (LLM) serving throughput by leveraging intra-device parallelism. It splits inputs into smaller nano-batches so that compute-, memory-, and network-bound operations can overlap within a single GPU, achieving a 1.91x throughput increase over existing serving systems.
Abstract
Large Language Models (LLMs) have resulted in a surging demand for planet-scale serving systems, where tens of thousands of GPUs continuously serve hundreds of millions of users. Consequently, throughput has emerged as a key metric that determines serving systems' performance. Due to large model sizes and memory-intensive self-attention, LLM serving has been commonly assumed to be memory-bound. Through a detailed analysis, we show that despite having memory-intensive components, end-to-end LLM serving is compute bound for most common workloads and LLMs. Alas, most existing serving engines fall short from optimal compute utilization, because the heterogeneous operations that comprise LLM serving--compute, memory, networking--are executed sequentially within a device. We propose NanoFlow, a novel serving framework that exploits intra-device parallelism, which overlaps the usage of heterogeneous resources within a single device. NanoFlow splits inputs into smaller nano-batches and duplicates operations to operate on each portion independently, enabling overlapping. NanoFlow automatically identifies the number, size, ordering, and GPU resource allocation of nano-batches to minimize the execution time, while considering the interference of concurrent operations. We evaluate NanoFlow's end-to-end serving throughput on several popular models such as LLaMA-2-70B, Mixtral 8x7B, LLaMA-3-8B, etc. With practical workloads, NanoFlow provides 1.91x throughput boost compared to state-of-the-art serving systems achieving 50% to 72% of optimal throughput across popular models.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
NanoFlow: Towards Optimal Large Language Model Serving Throughput
1.2. Authors
The paper lists a comprehensive author team primarily from the University of Washington, with contributions from Tsinghua University, UC Berkeley, and the University of Michigan. The lead author appears to be Kan Zhu from the University of Washington. Their research backgrounds generally align with systems, machine learning, and high-performance computing, focusing on optimizing large-scale AI infrastructure.
1.3. Journal/Conference
This paper was published on arXiv, a preprint server, on August 22, 2024. arXiv is a well-regarded repository for preprints of scientific papers in fields like physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. While not a peer-reviewed journal or conference itself, papers published on arXiv are often submitted to and eventually published in reputable venues. Its influence lies in the rapid dissemination of research findings.
1.4. Publication Year
2024
1.5. Abstract
Large Language Models (LLMs) have created immense demand for planet-scale serving systems, where throughput is a critical performance metric. Contrary to the common assumption that LLM serving is memory-bound due to large model sizes and memory-intensive self-attention, this paper demonstrates through detailed analysis that end-to-end LLM serving is primarily compute-bound for most common workloads and LLMs. However, existing serving engines fail to achieve optimal compute utilization because heterogeneous operations (compute, memory, networking) are executed sequentially within a single device.
To address this, the paper proposes NanoFlow, a novel serving framework that leverages intra-device parallelism by overlapping the usage of these heterogeneous resources. NanoFlow splits input batches into smaller nano-batches and duplicates operations to process each portion independently, enabling concurrent execution. It automatically determines the optimal number, size, ordering, and GPU resource allocation of nano-batches to minimize execution time, while accounting for interference between concurrent operations.
Evaluations using popular models like LLaMA-2-70B, Mixtral 8x7B, and LLaMA-3-8B with practical workloads show that NanoFlow achieves a 1.91x throughput boost compared to state-of-the-art serving systems, reaching 50% to 72% of the theoretically optimal throughput across various models.
1.6. Original Source Link
https://arxiv.org/abs/2408.12757 (Preprint) PDF Link: https://arxiv.org/pdf/2408.12757v2.pdf
2. Executive Summary
2.1. Background & Motivation
The proliferation of Large Language Models (LLMs) like GPT-4 and LLaMA has led to an explosion in demand for inference serving systems. These systems need to support hundreds of millions of users on tens of thousands of GPUs globally. In this context, throughput (tokens per device per second) has become the paramount metric, directly impacting the operational cost and scalability of LLM services, especially given the scarcity of high-end GPUs.
A common assumption in the LLM serving community has been that LLM inference is fundamentally memory-bound. This assumption stems from several key characteristics:
- Large Model Sizes: LLMs often have billions of parameters, requiring significant memory to store model weights (e.g., GPT-3 175B needs multiple A100 80GB GPUs).
- Memory-Intensive Self-Attention: The self-attention mechanism, particularly during the decode phase, scales quadratically with context length (input + output sequence length).
- KV-cache: Per-request state (the KV-cache) can grow very large, sometimes exceeding the model weight size, and needs to be frequently accessed.
- Single Token Output: Each LLM iteration typically produces only one output token per sequence while loading the entire model weights and a unique KV-cache, further exacerbating memory pressure.

However, the authors identify a critical gap: despite these memory-intensive components, a detailed analysis reveals that for most common workloads and LLMs, the end-to-end serving process is actually compute-bound. The challenge is that existing serving engines, while potentially optimizing individual operations, execute the heterogeneous operations (compute, memory, networking) sequentially within a device. This sequential execution leads to significant underutilization of the most constrained resource (compute), resulting in suboptimal overall throughput. The problem is thus to bridge this gap and maximize compute utilization by efficiently managing heterogeneous resource usage.
2.2. Main Contributions / Findings
The paper makes several significant contributions to the field of LLM serving:
- Reclassification of LLM Serving Workloads: Through a detailed analytical cost model and empirical validation, the authors demonstrate that, contrary to common belief, modern LLM serving, especially with optimizations like Grouped Query Attention (GQA) and large batch sizes, is predominantly compute-bound rather than memory-bound or network-bound for typical workloads and LLMs. This re-evaluation provides a new perspective on where optimization efforts should be focused.
- Introduction of NanoFlow Framework: The paper proposes NanoFlow, a novel end-to-end serving framework designed to maximize compute utilization by leveraging intra-device parallelism.
  - Intra-device Parallelism via Nano-batching: NanoFlow splits input batches into smaller nano-batches and duplicates operations. These nano-operations can then execute concurrently on different nano-batches without data dependencies, allowing heterogeneous resources (compute, memory, network) within a single GPU to be utilized simultaneously, overcoming the limitations of sequential execution.
  - Automated Pipeline Search Engine: NanoFlow includes an auto-search engine that automatically constructs an optimized pipeline for nano-batches. This engine identifies the optimal number, size, ordering, and GPU resource allocation for nano-operations. It employs a two-stage approach: first, it determines an initial pipeline assuming no interference, then refines it by profiling and modeling actual kernel interference between concurrent operations.
  - Efficient Runtime System: NanoFlow provides a runtime system for executing these optimized pipelines, including mechanisms for efficient batch formation (prioritizing decode, chunking prefill, memory prediction), asynchronous scheduling (hiding CPU-side overhead), and advanced KV-cache management (simultaneous host/SSD offloading, an LRU policy, and optimized loading/scattering).
- Significant Throughput Improvements:
  - Evaluations on popular LLMs (LLaMA-2-70B, Mixtral 8x7B, LLaMA-3-8B, etc.) with practical workloads show that NanoFlow achieves an average throughput boost of 1.91x compared to state-of-the-art serving systems such as vLLM, DeepSpeed-FastGen, and TensorRT-LLM.
  - NanoFlow reaches 50% to 72% of the theoretically optimal throughput across various models, significantly closing the gap to hardware capabilities.
  - It also sustains higher request rates while meeting Service Level Objective (SLO) constraints for latency.

These findings fundamentally shift the understanding of LLM serving bottlenecks and provide a practical, high-performance solution for maximizing GPU utilization and reducing serving costs.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully grasp the mechanics and contributions of NanoFlow, understanding several core concepts related to Large Language Models (LLMs), GPU computing, and serving systems is essential.
Large Language Models (LLMs) and Transformers
LLMs are powerful artificial intelligence models, typically based on the transformer architecture, that are trained on vast amounts of text data to understand, generate, and process human language.
- Transformer Architecture: Introduced by Vaswani et al. (2017), the transformer is a neural network architecture that relies heavily on the self-attention mechanism to process sequential data. Unlike recurrent neural networks, transformers process input sequences in parallel, making them highly efficient for modern accelerators.
- Decoder-Only Transformers: Many modern LLMs (e.g., GPT-4, LLaMA, Mistral) use a decoder-only transformer architecture. This means they are primarily designed for generative tasks, taking an input sequence and auto-regressively generating an output sequence token by token.
  - Token: A token is the fundamental unit of text processed by an LLM. It can be a word, part of a word, a punctuation mark, or even a single character.
LLM Inference Workflow
The process of generating output from an LLM given an input is called inference. It typically involves two phases:
- Prefill Phase: This is the initial phase where the entire input prompt (e.g., "Write a poem about a cat") is processed by the model all at once. The main goal here is to establish the initial context and populate the KV-cache.
- Decode Phase: After the prefill phase, the model generates output tokens one at a time, auto-regressively. For each new token generated, the model processes the previously generated tokens along with the initial prompt, extending the sequence.
Self-Attention and KV-Cache
The self-attention mechanism is central to transformers, allowing the model to weigh the importance of different tokens in the input sequence when processing each token.
- Self-Attention Mechanism: For each token in a sequence, self-attention computes three vectors: Query (Q), Key (K), and Value (V). These are derived by multiplying the token's embedding with learned weight matrices ($W_Q$, $W_K$, $W_V$). The attention score of a Query token against all Key tokens is calculated (e.g., via dot product), normalized, and then used to create a weighted sum of Value tokens, providing contextual information. The mathematical formula for Attention (specifically, Scaled Dot-Product Attention) is:
  $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $
  Where:
  - $Q$ is the matrix of Query vectors.
  - $K$ is the matrix of Key vectors.
  - $V$ is the matrix of Value vectors.
  - $QK^T$ is the dot product of the Query and Key matrices, which measures the similarity between each Query and each Key.
  - $\sqrt{d_k}$ is a scaling factor, where $d_k$ is the dimension of the Key vectors. This scaling prevents the dot products from growing too large, which could push the softmax function into regions with very small gradients.
  - softmax is an activation function that normalizes the scores, turning them into probabilities.
  - The result is a weighted sum of the Value vectors, where the weights are determined by the attention scores.
- KV-cache (Key-Value Cache): During the decode phase, tokens are generated one by one. Recomputing the Key and Value vectors for all previously generated tokens at each step would be highly inefficient. The KV-cache stores these Key and Value vectors from previous tokens for each request. This allows the model to compute $Q$, $K$, and $V$ only for the new token and append them to the cache, significantly speeding up the decode phase. The size of the KV-cache can grow substantially with longer sequences and larger batch sizes.
- Grouped Query Attention (GQA): A memory optimization where multiple attention heads share the same Key and Value projections. This reduces the memory footprint of the KV-cache, allowing for larger batch sizes. (A minimal code sketch of scaled dot-product attention with a KV-cache follows below.)
GPU Operations
GPUs (Graphics Processing Units) are specialized electronic circuits designed to rapidly manipulate and alter memory to accelerate the creation of images. In the context of AI, they are highly efficient for parallel processing of mathematical operations.
- General Matrix Multiplication (GEMM): A fundamental operation in deep learning, representing the multiplication of two matrices. Many dense operations (e.g., linear layers, feed-forward networks) in LLMs are dominated by GEMMs. These are typically compute-bound, meaning their performance is limited by the raw computational power (FLOPs) of the GPU.
- General Matrix-Vector Multiplication (GEMV): A special case of GEMM where one of the matrices is a vector. GEMVs are often found in memory-bound operations.
- Compute-Bound: An operation is compute-bound if its execution time is limited by the rate at which the processor can perform arithmetic operations (e.g., floating-point operations per second, FLOPs).
- Memory-Bound: An operation is memory-bound if its execution time is limited by the rate at which data can be transferred to and from memory (i.e., memory bandwidth).
- Network-Bound: An operation is network-bound if its execution time is limited by the speed of data transfer over the interconnects between devices (e.g., NVLink, Infinity Fabric).
Parallelism Strategies
For very large models that don't fit into a single GPU's memory or require distributed computation, various parallelism strategies are used:
- Tensor Parallelism: This technique splits the weight matrices of a model across multiple GPUs (within or across nodes). During computation, each GPU performs a part of the matrix multiplication, and collective communication operations are then used to aggregate results. This avoids duplicating model weights on each GPU.
- Pipeline Parallelism: This involves splitting the model into stages (layers) and assigning different stages to different GPUs. Each GPU processes a different micro-batch of data, creating a pipeline where data flows through the stages.
- Collective Communication Operations: These are specialized operations for exchanging data between multiple GPUs in a distributed system (see the sketch below).
  - AllGather (AG): Each GPU holds a piece of data, and AllGather collects all pieces from all GPUs so that every GPU ends up with the complete concatenated data.
  - AllReduce (AR): Each GPU holds a piece of data, AllReduce applies an operation (e.g., sum, average) across all pieces, and the final result is available on all GPUs. These operations are network-bound.
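The semantics of these collectives can be illustrated without any GPUs by simulating each device's shard as a NumPy array. This is purely a conceptual sketch of what the operations compute; real serving systems use libraries such as NCCL over NVLink.

```python
import numpy as np

def all_gather(shards):
    # Every simulated "GPU" ends up with the concatenation of all shards.
    full = np.concatenate(shards, axis=0)
    return [full.copy() for _ in shards]

def all_reduce(shards, op=np.add):
    # Every simulated "GPU" ends up with the element-wise reduction of all shards.
    reduced = shards[0].copy()
    for s in shards[1:]:
        reduced = op(reduced, s)
    return [reduced.copy() for _ in shards]

# Four simulated GPUs, each holding a different piece of an activation.
shards = [np.full((2,), fill_value=i, dtype=np.float32) for i in range(4)]
print(all_gather(shards)[0])   # [0. 0. 1. 1. 2. 2. 3. 3.] on every GPU
print(all_reduce(shards)[0])   # [6. 6.] (sum) on every GPU
```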
Performance Metrics
- Throughput: A measure of how many units of work a system can process per unit of time. In LLM serving, it is often measured in tokens per second (tokens/s) or tokens per second per GPU (tokens/s/GPU).
- Latency: The time taken for a single request to be processed from start to finish. In LLM serving, it can be measured as the time from receiving a prompt to generating the complete response, or as normalized latency (latency divided by output length).
- Service Level Objective (SLO): A target value or range for a performance metric (e.g., latency) that a service aims to meet. For LLMs, an SLO might define the maximum acceptable latency for generating responses.
Data Types and Units
- FP16 (Floating Point 16-bit): A reduced-precision floating-point format that uses 16 bits instead of the standard 32 bits (FP32). FP16 reduces memory footprint and can significantly speed up computation on modern GPUs that have specialized Tensor Cores for FP16 arithmetic, often with minimal loss in model accuracy. It is standard for data center-scale inference.
- GFLOP/s (Giga Floating Point Operations per second): A measure of computational performance, indicating billions of floating-point operations per second.
- GB/s (Gigabytes per second): A measure of data transfer rate, indicating billions of bytes per second.
3.2. Previous Works
The paper contextualizes NanoFlow by contrasting it with existing LLM serving optimization techniques, broadly categorizing them by the granularity at which they operate:
- Request-Level Optimizations:
  - Continuous Batching (e.g., Orca [53]): This technique dynamically refills the batch of requests being processed by the GPU. Instead of waiting for all requests in a batch to complete before starting a new one, continuous batching allows new requests to be added to the batch as soon as GPU resources become available (e.g., when a request finishes). This maximizes GPU utilization by keeping the batch full.
  - PagedAttention (e.g., vLLM [17]): PagedAttention is a memory management technique that addresses the fragmentation and waste of GPU memory caused by variable-length KV-cache sizes. Inspired by virtual memory and paging in operating systems, it stores KV-cache entries in fixed-size "pages" and manages them using a page table. This allows for more efficient memory utilization and enables non-contiguous memory allocation for the KV-cache, similar to how an OS manages memory.
- Phase-Level Scheduling:
  - Disaggregating Prefill and Decode (e.g., DistServe [59], Splitwise [32]): These approaches separate the prefill and decode phases of LLM inference, potentially assigning them to different clusters or specialized hardware. This can improve efficiency by allowing each phase to be optimized independently and handled by resources best suited to its characteristics.
- Batch-Level Optimizations:
  - Chunked Prefill (e.g., DeepSpeed-FastGen [13], Sarathi-Serve [2]): Instead of processing an entire long prefill prompt at once, chunked prefill splits it into smaller chunks. These chunks can then be batched together with decode requests, allowing for more consistent and higher GPU utilization by amortizing prefill costs and keeping the batch full of active operations.
  - Dynamic Batching: Automatically adjusts the batch size of incoming requests to maximize throughput or minimize latency, often by ensuring the GPU is always busy with a sufficiently large batch.

The paper highlights that while these prior works have significantly improved throughput by optimizing at the request, phase, or batch level, they generally do not perform scheduling or resource management at the granularity of individual operations within a device.
3.3. Technological Evolution
The evolution of LLM serving systems has been driven by the increasing scale and complexity of LLMs, coupled with the rising demand for efficient inference:
- Early Stages (Basic Inference): Initially, LLM inference involved sequential processing on a single GPU or distributed inference using basic model parallelism, often inefficiently.
- Memory Optimization: The advent of the KV-cache was a crucial step, avoiding redundant recomputation in self-attention during decode. Techniques like PagedAttention (vLLM) further optimized KV-cache memory usage, enabling larger batch sizes.
- Batching and Scheduling: Continuous batching (Orca) and dynamic batching became standard to maximize GPU utilization by keeping the batch full and varying its size based on available resources. Chunked prefill and phase-level scheduling emerged to better manage the distinct characteristics of prefill and decode workloads.
- Distributed Inference: As models grew beyond single-GPU capacity, tensor parallelism, pipeline parallelism, and hybrid approaches became essential for distributing model weights and computation across multiple GPUs and nodes.
- Focus on Throughput vs. Latency: Initial efforts often balanced throughput and latency. However, with "planet-scale" demand, throughput (and thus cost-efficiency) has emerged as a primary concern for cloud providers.

NanoFlow fits into this timeline by pushing optimization to an even finer granularity: the intra-device operation level. While previous works focused on what to batch or how to schedule requests, NanoFlow looks at how individual operations within a single GPU can be overlapped to maximize hardware utilization, especially for the actual bottleneck (compute), which it re-identifies.
3.4. Differentiation Analysis
Compared to the main methods in related work, NanoFlow introduces a fundamental shift in its approach to LLM serving optimization:
- Granularity of Optimization:
  - Prior Works: Primarily focus on request-level, phase-level, or batch-level scheduling. For example, vLLM and DeepSpeed-FastGen optimize how requests are batched and processed at the iteration level, DistServe and Splitwise disaggregate phases, and Sarathi-Serve uses chunked prefill. These methods manage the flow of data into or between GPUs.
  - NanoFlow: Operates at a much finer granularity: intra-device operation-level parallelism. It focuses on how different operations (e.g., GEMM, KV-cache access, network communication) within a single GPU can be executed concurrently. This is a novel approach that complements existing batching and scheduling techniques.
- Bottleneck Identification:
  - Prior Works (Implicit Assumption): Often operated under the common assumption that LLM serving is memory-bound, leading to optimizations targeting memory efficiency (e.g., PagedAttention, GQA).
  - NanoFlow: Challenges this assumption with a rigorous analysis, demonstrating that for many modern LLMs and workloads, the end-to-end serving process is actually compute-bound. This re-identifies the true bottleneck, allowing NanoFlow to target compute utilization directly.
- Mechanism for Parallelism:
  - Prior Works: Achieve parallelism through macro-level batching, pipeline stages, or distributing tasks across devices.
  - NanoFlow: Achieves parallelism by nano-batching inputs and duplicating operations into nano-operations. These nano-operations can then execute concurrently within the same device, effectively overlapping the execution of operations with heterogeneous resource demands (compute-bound, memory-bound, network-bound). This is fine-grained pipelining within a single GPU's resources.
- Adaptive Optimization:
  - Prior Works: While some have dynamic scheduling or batching, the internal execution logic for operations is typically static.
  - NanoFlow: Incorporates an auto-search engine that automatically constructs and refines optimized pipelines. This adaptive approach accounts for specific model architectures, hardware characteristics, and even kernel interference, which is a significant advancement in handling the complexity of heterogeneous GPU workloads.
- Memory Overhead vs. Compute Utilization:
  - Prior Works: Aim to minimize memory movement and overhead.
  - NanoFlow: Acknowledges that nano-batching might increase memory I/O due to repeated weight loading. However, it strategically embraces this trade-off, arguing that for compute-bound workloads, this increased memory I/O can be hidden through pipelining, leading to overall higher compute utilization and throughput.

In essence, NanoFlow innovates by shifting the focus from simply managing requests and data flow to actively orchestrating the concurrent execution of diverse operations within the GPU itself, driven by a re-evaluated understanding of the actual bottleneck.
4. Methodology
4.1. Principles
The core idea behind NanoFlow stems from a critical re-evaluation of the bottlenecks in Large Language Model (LLM) serving. The prevailing assumption has been that LLM serving is memory-bound due to massive model sizes, KV-cache growth, and the memory-intensive nature of self-attention. However, through detailed analysis and empirical validation, NanoFlow demonstrates that for most common workloads and modern LLMs (especially those using optimizations like Grouped Query Attention), the overall serving process is predominantly compute-bound.
Despite this compute-bound nature, existing LLM serving engines suffer from suboptimal compute utilization. This is because the diverse operations that constitute LLM serving—ranging from compute-bound General Matrix Multiplications (GEMMs) to memory-bound KV-cache accesses and network-bound collective communications in distributed settings—are executed sequentially within a single device. This sequential execution leads to pipeline bubbles, periods where the GPU's compute units are idle while waiting for memory or network operations to complete, or vice-versa.
NanoFlow's principle is to address this by leveraging intra-device parallelism. Instead of processing a single large batch sequentially, NanoFlow breaks down the input into smaller, independent units called nano-batches. Each operation is then duplicated to form nano-operations that process these nano-batches. Since nano-operations on different nano-batches are independent, NanoFlow can overlap their execution within the same GPU. This fine-grained pipelining allows for concurrent utilization of heterogeneous resources (compute, memory, network) on the device, maximizing the utilization of the most constrained resource (compute) and thereby improving overall throughput. While this approach might increase memory I/O for weight loading, NanoFlow posits that for compute-bound workloads, this additional I/O can be hidden through effective pipelining.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Analysis of LLM Serving Workloads and Bottlenecks
NanoFlow begins with a fundamental analysis to accurately identify the bottlenecks in LLM serving.
Key Factors of Serving Throughput (Section 3.1)
The paper identifies key factors that determine LLM serving throughput, defining throughput as the total number of tokens processed per second (including both prefill and decode phases).
- Hardware Specification:
  - $N_{GPU}$: Number of GPUs.
  - MemBW (GB/s): Aggregate GPU memory bandwidth.
  - MemSize (GB): Aggregate GPU memory capacity.
  - Compute (GFLOP/s): Aggregate GPU compute capacity.
  - NetBW (GB/s): Aggregate GPU interconnect bandwidth.
- Model Configuration:
  - $D_{model}$: Hidden dimension size (e.g., the size of the embedding vectors).
  - $L$: Number of layers in the transformer model.
  - $P_{Model}$: Total number of parameters in the model.
  - GQA group size: How many query heads share one KV-cache head under Grouped Query Attention.
  - $S_{type}$ (Bytes): Size in bytes of the data type for model parameters (e.g., 2 bytes for FP16).
- User Query Statistics:
  - $p$: Average number of tokens in prompts to be prefilled.
  - $d$: Average number of tokens in output to be decoded.
  - Thus, a request involves $p$ prefill tokens and $d$ decode tokens.
- Batch Size: For optimal throughput, the system should operate at the largest possible batch size that fits the model weights and all KV-caches within available memory. This maximum batch size amortizes overheads and increases utilization for compute-bound, memory-bound, and network-bound operations.
Cost Model of LLM Serving (Section 3.2)
Under the assumption of using the maximum possible batch size, NanoFlow models the latency of a single LLM serving iteration from the perspectives of memory, compute, and network resources. An "iteration" refers to processing a batch of user requests through all transformer layers.
- Memory Latency ($T_{mem}$): This represents the time required to load all necessary data from memory, assuming the entire device memory content needs to be loaded into GPU caches and registers once per iteration (e.g., for model weights and KV-cache).
  $ T_{mem} = \frac{MemSize}{MemBW} $
  Where:
  - MemSize: The total memory content that needs to be loaded, in bytes (e.g., model weights + KV-cache).
  - MemBW: The available memory bandwidth of the GPU in bytes/second.
- Compute Latency ($T_{Compute}$): This primarily accounts for the time spent on dense operations (mainly GEMMs), which constitute the vast majority of computations in LLMs. For each GEMM in the dense operations, $2 \cdot B_{Dense} \cdot K \cdot N$ computations are needed, where $K$ and $N$ are the dimensions of the weight matrix. Summing over all layers, the total compute is $2 \cdot B_{Dense} \cdot \sum K N$, and the term $\sum K N$ can be approximated by the total number of model parameters, $P_{Model}$.
  $ T_{Compute} \approx \frac{2 B_{Dense} \cdot P_{Model}}{Compute} $
  Where:
  - $B_{Dense}$: The batch size for dense operations (including decode tokens from many requests and prefill tokens).
  - $P_{Model}$: Total number of parameters in the model (e.g., 70 billion for LLaMA-2 70B).
  - Compute: The aggregate GPU compute capacity in FLOPs/second.
- Network Latency ($T_{net}$): This applies specifically to tensor parallelism setups where collective communication (e.g., AllGather, AllReduce) is needed to synchronize results across multiple GPUs after operations. Tensor parallelism typically requires two AllGathers and one AllReduce (or two AllReduces) per layer. An AllReduce transfers activations twice, while an AllGather transfers them once. Thus, the total data movement for one GPU is estimated as $4 \cdot B_{Dense} D_{model} S_{type} L$ bytes.
  $ T_{net} \approx 4 \cdot \frac{N_{GPU} B_{Dense} D_{model} S_{type} L}{NetBW} $
  Where:
  - $N_{GPU}$: Number of GPUs involved in tensor parallelism.
  - $B_{Dense}$: Batch size for dense operations.
  - $D_{model}$: Hidden dimension size.
  - $S_{type}$: Size of the data type (e.g., 2 bytes for FP16).
  - $L$: Number of layers.
  - NetBW: Aggregate GPU interconnect bandwidth.

A short worked example of these three terms follows.
Classification of LLM Serving Workloads (Section 3.3)
By comparing these latency components, NanoFlow classifies the workload characteristics.
- Network vs. Compute: The ratio $T_{net}/T_{Compute}$ is analyzed. For large models and modern data center GPUs with high-bandwidth interconnects (like NVLink), this ratio is typically less than 1 (Figure 2), indicating that the network is generally not the bottleneck compared to compute.
  The following are the results from [Figure 2] of the original paper:
  Figure 2 (heatmap): compute versus network resource utilization for different models (e.g., LLaMA-3 8B, Mistral 8x7B) under different GPU configurations; color intensity indicates the degree to which each configuration is compute-limited or network-limited.
- Memory vs. Compute: The ratio $T_R = T_{mem}/T_{Compute}$ is used to determine whether the workload is memory-bound or compute-bound:
  $ T_R = \frac{T_{mem}}{T_{Compute}} \approx \frac{Compute}{MemBW} \cdot \frac{MemSize}{P_{Model}} \cdot \frac{1}{2 B_{Dense}} $
  Where:
  - Compute: Aggregate GPU compute capacity.
  - MemBW: Aggregate GPU memory bandwidth.
  - MemSize: Aggregate GPU memory capacity.
  - $P_{Model}$: Total number of model parameters.
  - $B_{Dense}$: Batch size for dense operations.
  The paper observes that modern models widely adopt Grouped Query Attention (GQA), which allows for significantly larger $B_{Dense}$ values (e.g., 1024-2048 for LLaMA-2 70B). Combined with the growing parameter counts of modern models, these factors cause $T_R$ to drop below 1, implying that the workload becomes compute-bound. Figure 3 empirically validates this, showing that many common workloads are compute-bound (closer to yellow) rather than memory-bound (closer to green). A small numeric sweep of $T_R$ over batch sizes is given below, after Figure 3.

The following are the results from [Figure 3] of the original paper:
Figure 3 (performance comparison): normalized performance of different matrix-multiplication kernels; the blue line shows GEMM performance, the orange line GEMV performance, and the gray line non-optimal GEMV performance. The effects of prioritizing GEMM versus GEMV are also marked; in some cases prioritizing GEMM outperforms the non-optimal GEMV.
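To see how $T_R$ flips from memory-bound to compute-bound as GQA enables larger dense batches, the ratio can be swept over $B_{Dense}$. The hardware figures are the same assumed 8×A100 / LLaMA-2-70B approximations as in the previous sketch, used purely for illustration.

```python
# T_R = (Compute / MemBW) * (MemSize / P_Model) * 1 / (2 * B_Dense)  -- per the formula above
compute  = 8 * 312e12       # aggregate peak FP16 FLOP/s (8x A100, illustrative)
mem_bw   = 8 * 2.0e12       # aggregate memory bandwidth (bytes/s)
mem_size = 8 * 80e9         # aggregate memory capacity (bytes)
p_model  = 70e9             # parameters

for b_dense in (128, 256, 512, 1024, 2048):
    t_r = (compute / mem_bw) * (mem_size / p_model) / (2 * b_dense)
    regime = "memory-bound (T_R > 1)" if t_r > 1 else "compute-bound (T_R < 1)"
    print(f"B_Dense={b_dense:5d}  T_R={t_r:5.2f}  -> {regime}")
```

With these numbers the crossover falls between batch sizes of 512 and 1024, consistent with the claim that GQA-enabled batches of 1024-2048 push the workload into the compute-bound regime.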
The following are the results from [Table 1] of the original paper:
| Vendor | Model | Release Year | MemSize (GB) | MemBW (GB/s) | NetBW (GB/s) | Compute (FP16 GFLOP/s) | MemSize/MemBW (s) | Compute/MemBW (FLOP/B) | NetBW/MemBW |
|---|---|---|---|---|---|---|---|---|---|
| NVIDIA | V100 | 2017 | 16 | 900 | 300 | 125,000 | 0.018 | 139 | 0.33 |
| NVIDIA | A100 | 2020 | 40 | 1,555 | 600 | 312,000 | 0.026 | 200 | 0.39 |
| NVIDIA | A100 | 2021 | 80 | 2,000 | 600 | 312,000 | 0.040 | 156 | 0.30 |
| NVIDIA | H100 | 2023 | 80 | 3,352 | 900 | 989,000 | 0.024 | 295 | 0.268 |
| NVIDIA | H200 | 2024 | 96 | 4,800 | 900 | 989,000 | 0.020 | 206 | 0.19 |
| NVIDIA | B100 | 2024 | 120 | 8,000 | 1,800 | 1,800,000 | 0.015 | 225 | 0.23 |
| NVIDIA | B200 | 2024 | 120 | 8,000 | 1,800 | 2,250,000 | 0.015 | 281 | 0.23 |
| AMD | MI250 | 2021 | 128 | 3,352 | 800 | 362,000 | 0.038 | 107 | 0.24 |
| AMD | MI300 | 2023 | 192 | 5,300 | 1,024 | 1,307,000 | 0.036 | 246 | 0.19 |
| AMD | MI325X | 2024 | 256 | 6,000 | 1,024 | 1,307,000 | 0.043 | 218 | 0.17 |
| Intel | Gaudi 2 | 2022 | 96 | 2,400 | 600 | 1,000,000 | 0.040 | 417 | 0.25 |
| Intel | Gaudi 3 | 2024 | 128 | 3,700 | 1,200 | 1,800,000 | 0.035 | 486 | 0.32 |
| NVIDIA | Ada 6000 | 2022 | 48 | 960 | 64 | 182,000 | 0.050 | 190 | 0.067 |
Table 1 (above) presents accelerator characteristics, showing that the Compute/MemBW and NetBW/MemBW ratios are relatively stable across vendors and generations, reinforcing the compute-bound conclusion.
Validation of the Cost Model (Section 3.4)
The cost model is validated on LLaMA-2 70B with 8 NVIDIA A100 GPUs and a dense batch size of 2048. GFLOPs, memory movement, and network traffic are computed for each operation (Table 2), and then $T_{Compute}$, $T_{mem}$, and $T_{net}$ are estimated. For each operation, the maximum of these estimates, $\max(T_{Compute}, T_{mem}, T_{net})$, indicates its most constrained resource. The sums of these values over all operations confirm that compute is the most constrained resource overall, aligning with the cost model's finding.
The following are the results from [Table 2] of the original paper:
| Operation | Compute (GFLOP) | Mem Load (GB) | Net Usage (GB) | Est. Tcomp (ms) | Est. Tmem (ms) | Est. Tnet (ms) | Real Time (ms) |
|---|---|---|---|---|---|---|---|
| KQV | 27487.8 | 19.5 | 0 | 11.01 | 1.22 | 0 | 16.08 |
| O | 21990.2 | 16.1 | 0 | 8.81 | 1.01 | 0 | 16.01 |
| UG | 153931.6 | 96.6 | 0 | 61.67 | 6.04 | 0 | 69.92 |
| D | 76965.8 | 49.7 | 0 | 30.84 | 3.11 | 0 | 34.96 |
| DecAttn | 3665.9 | 462.2 | 0 | 1.47 | 28.89 | 0 | 35.60 |
| PfAttn | 916.3 | 2.1 | 0 | 0.37 | 0.13 | 0 | 4.56 |
| Net | 18.8 | 75.2 | 75.2 | 0.01 | 4.70 | 31.33 | 47.92 |
| Total | | | | 114.17 | 45.09 | 31.33 | |
Table 2 (above) compares cost model estimations with real-world measurements for different LLM operations. The Est. Tcomp column, particularly for UG (Up-Gate projection) and D (Down projection), shows the highest values, confirming compute as the dominant factor for the process as a whole.
Optimal Serving Throughput (Section 3.5)
Given that LLM serving is compute-bound, optimal throughput is achieved when the compute resource is fully utilized.
$ \mathrm{Throughput_{optimal}} = \frac{B_{Dense}}{T_{Compute}} = \frac{Compute}{2 P_{Model}} $
Where:
- $\mathrm{Throughput_{optimal}}$: The maximum theoretical throughput in tokens per second per GPU.
- Compute: Aggregate GPU compute capacity.
- $P_{Model}$: Total number of model parameters.

This equation indicates that optimal throughput depends solely on the GPU's computational capacity and the model's parameter count, largely independent of memory or network characteristics. For example, for LLaMA-2 70B on 8×A100 GPUs, with 280 TFLOPS peak FP16 Compute and a $P_{Model}$ of 70B, the optimal throughput is calculated as 1857 tokens/s/GPU.
4.2.2. NanoFlow Design: Intra-device Parallelism (Section 3.7 & 4)
Motivated by the compute-bound nature and the gap between current system throughput and optimal throughput (due to sequential execution), NanoFlow introduces intra-device parallelism through nano-batching.
Intra-device Parallelism Concept
Instead of processing a single large batch, NanoFlow splits each input batch into nano-batches. For example, an Up projection operation on a batch of 2048 could be split into two nano-operations, UP1 and UP2, each processing a nano-batch of 768 and 1280 tokens respectively. These nano-operations are independent and can be executed concurrently. This allows heterogeneous operations (e.g., a compute-bound GEMM and a memory-bound KV-cache access) to overlap their execution within the same GPU, maximizing resource utilization.
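A minimal PyTorch sketch of the idea (not NanoFlow's implementation): the 2048-token batch is split into two nano-batches, and a compute-heavy GEMM on one nano-batch is issued on a separate CUDA stream so that it can overlap with a memory-bound operation on the other. The shapes, split sizes, and choice of operations are illustrative only.

```python
import torch

assert torch.cuda.is_available(), "this illustration requires a CUDA GPU"
dev = torch.device("cuda")

hidden = 8192
weight = torch.randn(hidden, 4 * hidden, device=dev, dtype=torch.float16)
batch  = torch.randn(2048, hidden, device=dev, dtype=torch.float16)

# Split the dense batch into two independent nano-batches (as in the UP1/UP2 example).
nano1, nano2 = batch.split([768, 1280], dim=0)

s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()

with torch.cuda.stream(s1):
    out1 = nano1 @ weight              # compute-bound GEMM on nano-batch 1

with torch.cuda.stream(s2):
    out2 = nano2.mul(1.0001)           # memory-bound stand-in (akin to KV-cache traffic)

torch.cuda.synchronize()               # both nano-operations have completed
print(out1.shape, out2.shape)          # torch.Size([768, 32768]) torch.Size([1280, 8192])
```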
Automated Pipeline Search (Section 4.1)
The complexity of determining the optimal number, size, ordering, and resource allocation of nano-batches across diverse models and hardware is immense. NanoFlow addresses this with an auto-search engine using mixed integer linear programming (MILP) and a two-stage approximation.
4.2.2.1. Kernel Profiling and Interference Modeling (Section 4.1.1)
Understanding individual kernel performance and their interactions is crucial.
- Maximum Dense Batch Size: For a given model and hardware, NanoFlow first determines the largest dense batch size that can fit into GPU memory. This sets the upper bound for profiling.
- Profiling Interference-Free Kernels: NanoFlow profiles kernels (GEMM, GEMV, network kernels) individually, running them exclusively on the GPU. It explores various input batch sizes (e.g., multiples of 128) and kernel implementations (varying thread blocks, warps, tile sizes) to find the fastest configuration for each (kernel, batch size) pair.
- Profiling Kernels with Interference:
  - Kernel Interference: When multiple kernels run in parallel on a GPU, they compete for shared resources (execution units, caches, memory bandwidth), leading to slowdown. This is kernel interference.
  - Resource Allocation Proxy ($R$): NVIDIA GPUs do not offer explicit control over resource allocation, so NanoFlow uses GEMM performance as a proxy for resource allocation $R$. If a compute kernel A is overlapped with a kernel B, and kernel A achieves a fraction $P_A$ of its interference-free peak performance, then $R_A = P_A$. The remaining resources are assumed to go to kernel B, so $R_B = 1 - R_A$.
  - Performance ($P$): For the non-compute kernel B, its normalized performance $P_B$ captures its memory or network utilization when $R_A$ of the resources are assigned to kernel A. This establishes an "exchange rate" between compute utilization ($R$) and memory/network utilization ($P$).
  - Reduced Profiling Space: To make profiling feasible given the vast number of kernel combinations, NanoFlow:
    - Limits thread block numbers for GEMV and network kernels.
    - Excludes inefficient GEMM kernels.
    - Focuses on pairwise interference (compute-memory, compute-network), assuming the resulting $R$-to-$P$ mappings also hold for three-kernel overlaps.
  - Example (Figure 5): The paper shows profiling results for GEMM-GEMV pairs, identifying trade-offs. The goal is to find the best-performing kernel combinations at various trade-off points.

The following are the results from [Figure 5] of the original paper:
Figure 5: per-GPU token throughput for different models at fixed input and output lengths, comparing vLLM, DeepSpeed-FastGen, TensorRT-LLM, and NanoFlow; the highest throughput reaches 1286 tokens/s/GPU against an optimum of 1857, across two experimental scenarios.
The following are the results from [Table 3] of the original paper:
| Operations (performance P) | R = 0 | R = 0.1 | R = 0.2 | ... | R = 0.8 | R = 0.9 | R = 1 |
|---|---|---|---|---|---|---|---|
| GEMM (by definition) | 0 | 0.1 | 0.2 | ... | 0.8 | 0.9 | 1 |
| GEMV | 0 | 0.2 | 0.3 | ... | 0.85 | 0.95 | 1 |
| Network | 0 | 0.3 | 0.5 | ... | 0.9 | 1 | 1 |
Table 3 (above) is an example resource mapping table generated from interference profiles, quantifying GEMV and network kernel performance ($P$) as a function of the resource utilization ($R$) granted to the co-running GEMM; the short sketch below shows how such a mapping is applied.
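The mapping is essentially a lookup table from the compute share $R$ kept by the overlapped GEMM to the normalized performance $P$ retained by a co-running kernel. The sketch below (assumed code, not NanoFlow's) copies the illustrative table values, interpolates linearly for intermediate $R$, and turns interference-free kernel times into overlapped estimates via $T = T_{free} / P(R)$. The example times echo the UG and DecAttn entries from Table 2 and are used only for illustration.

```python
import numpy as np

# R -> P samples from the example table (the GEMM row is the identity by definition).
R_GRID = np.array([0.0, 0.1, 0.2, 0.8, 0.9, 1.0])
P_TABLE = {
    "gemm":    np.array([0.0, 0.1, 0.2, 0.80, 0.90, 1.0]),
    "gemv":    np.array([0.0, 0.2, 0.3, 0.85, 0.95, 1.0]),
    "network": np.array([0.0, 0.3, 0.5, 0.90, 1.00, 1.0]),
}

def perf(kind: str, r: float) -> float:
    """Normalized performance P of a kernel when granted resource share r."""
    return float(np.interp(r, R_GRID, P_TABLE[kind]))

def overlapped_time(t_free_ms: float, kind: str, r: float) -> float:
    """Estimated execution time under interference: T = T_free / P(R)."""
    return t_free_ms / max(perf(kind, r), 1e-6)

# Example: a GEMM keeps R = 0.6 of the GPU while a decode-attention GEMV
# runs alongside with the remaining R = 0.4.
print(round(overlapped_time(61.7, "gemm", 0.6), 1), "ms (GEMM, 61.7 ms alone)")
print(round(overlapped_time(28.9, "gemv", 0.4), 1), "ms (GEMV, 28.9 ms alone)")
```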
4.2.2.2. Auto-search Stage I: Pipeline Structure Search (Section 4.1.2)
This stage uses MILP to determine the initial pipeline structure, assuming no kernel interference.
- Input: Dense batch size, operation dependencies (from the model architecture), and interference-free kernel profiles.
- Output: The number, batch size, and order of each nano-operation.
- Optimization Objective: Minimize execution time by removing pipeline bubbles (idle periods) for compute operations.
- Constraints:
  - Number of Nano-operations: Start by splitting into two nano-operations. If compute bubbles persist, increase the number for operations near the bubble until no further improvement is made.
  - Batch Sizes and Execution Times: Batch sizes are chosen from discrete values (e.g., multiples of 128) up to the dense batch size. Interference-free execution times from profiling are used.
  - Dependencies: Nano-operations are dependent if their parent operations are dependent AND their input nano-batches intersect.
  - Overlapping: Only operations constrained by different resources (e.g., compute and memory) are allowed to overlap, as overlapping same-resource operations is counterproductive.
  - Operation Transformations: Explores alternative implementations for network nano-operations (e.g., AllGather-to-AllReduce conversions) with different performance characteristics.
- Search Time: While finding a globally optimal solution can be very time-consuming, NanoFlow prioritizes finding a feasible and practical solution within a reasonable time (e.g., ~10 minutes). A drastically simplified illustration of this search appears below.
4.2.2.3. Auto-search Stage II: Refining the Pipeline (Section 4.1.3)
This stage refines the pipeline from Stage I by incorporating kernel interference.
- Input: The pipeline structure (number, batch sizes, ordering) from Stage I, and the $R$-to-$P$ mapping (Table 3).
- Optimization Objective: Minimize pipeline execution time, considering kernel slowdown due to interference.
- Constraints:
  - GPU Resource Utilization: The sum of $R$ over concurrently executing kernels at any given time must be less than or equal to 1.0 (representing the total GPU resources).
  - Execution Times: The execution time of a nano-operation is calculated as $T_{free}/P$, where $T_{free}$ is its best interference-free execution time and $P$ is its normalized performance derived from Table 3 based on its allocated $R$.
- Trigger: This auto-search is performed only when the model architecture or workload characteristics change significantly.
Example Pipelines (Section 4.1.4)
- 70B Pipeline (e.g., LLaMA-2 70B): For models like LLaMA-2 70B, LLaMA-3 70B, Qwen2.5-72B, and Deepseek-67B, auto-search generates similar schedules because their performance characteristics are largely alike. For example, in LLaMA-2 70B, NanoFlow may use 4 nano-operations for KQV generation (where compute, memory, and network resources overlap), allowing decode attention to operate at 80% of its peak performance at the cost of a 0.4 sacrifice in normalized GEMM performance. For the rest of the pipeline, GEMM operations are prioritized with two nano-operations. This is illustrated in Figure 6.
  The following are the results from [Figure 6] of the original paper:
  Figure 6: normalized latency of each system under different request rates; NanoFlow (red) outperforms vLLM, DeepSpeed-FastGen, and TensorRT-LLM, with sub-plots (a), (b), and (c) corresponding to different test scenarios.
- 8B Pipeline (e.g., LLaMA-3 8B): These models fit on a single GPU, so network operations are not relevant. Auto-search splits operations into two nano-operations, typically overlapping decode attention with the Up-Gate-Down projections.
- MoE Pipeline (e.g., Mixtral 8x7B): Mixture-of-Experts (MoE) models have different hidden dimensions and layer structures. Due to expert imbalance, NanoFlow uses tensor parallelism for the FFN (Feed-Forward Network) layers, which are implemented using grouped-GEMM and include an additional gate routing operation. Auto-search adapts to these specific characteristics to generate an efficient pipeline.
4.2.3. NanoFlow Runtime (Section 4.2)
The NanoFlow runtime executes the auto-generated pipelines efficiently.
4.2.3.1. Request Scheduling (Section 4.2.1)
NanoFlow manages batching and scheduling to maintain high GPU utilization.
- Batch Formation:
  - NanoFlow assumes external auto-scaling, workload balancing, and priority-aware routing. It operates with the assumption of abundant requests and equal priority. If requests are scarce, the control plane should reduce the number of NanoFlow instances to ensure a sufficiently large per-instance batch size.
  - It prioritizes unfinished decode requests and chunks prefill requests (following Sarathi-Serve [3]) to precisely fill the remaining capacity of a pre-selected best-performing dense batch. This keeps dense operations running with consistent batch sizes, reducing tail latency.
  - The global batch initially contains prefill requests. As they complete, they become decode requests, and new prefill requests are introduced to maintain a fixed token batch size. The ratio of prefill to decode tokens stabilizes over time.
  - To prevent out-of-memory errors, NanoFlow predicts future memory usage based on request status (e.g., decoded tokens, estimated completion time) and only prefills new requests if memory limits are maintained. If OOM occurs, requests can be offloaded to the CPU and reloaded later.
- Asynchronous Scheduling:
  - In traditional systems, CPU-side tasks like batch formation, EOS token detection, and request management happen sequentially after each GPU iteration, leading to GPU idleness (pipeline bubbles).
  - NanoFlow asynchronously schedules batch formation in parallel with GPU execution: the batch for iteration i+1 is formed before iteration i ends. After launching iteration i+1, the batch for iteration i+2 is formed, EOS tokens from iteration i are detected, and finished requests are removed (see the sketch below).
  - This may mean a slight delay in detecting EOS (one extra decode token), but for typical workloads with average decode lengths over 100 tokens, this overhead is negligible compared to the benefit of hiding batch-formation overhead.
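A minimal sketch of the overlap using Python threads (conceptual only; NanoFlow's runtime implements this in C++/CUDA with its own scheduler): batch formation for iteration i+1 runs on the CPU while iteration i is "executing" on the GPU.

```python
import threading, time

def gpu_iteration(i: int) -> None:
    time.sleep(0.05)                      # stand-in for launching/waiting on GPU kernels
    print(f"GPU finished iteration {i}")

def form_batch(i: int):
    time.sleep(0.02)                      # stand-in for CPU-side batch formation
    print(f"CPU formed batch for iteration {i}")
    return list(range(8))                 # dummy batch

next_batch = form_batch(0)                # bootstrap: batch for the first iteration
for i in range(3):
    gpu = threading.Thread(target=gpu_iteration, args=(i,))
    gpu.start()                           # iteration i runs "on the GPU"
    next_batch = form_batch(i + 1)        # meanwhile, the CPU prepares iteration i+1
    gpu.join()                            # EOS detection/retirement for i would happen here
```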
4.2.3.2. KV-cache Management (Section 4.2.2)
To support multi-round conversations efficiently, NanoFlow implements advanced KV-cache offloading.
- Simultaneous Offloading:
  - Instead of waiting for requests to complete, NanoFlow offloads the KV-cache of tokens directly after KQV generation in each transformer layer, before they are appended to the main KV-cache.
  - This ensures the KV vectors are contiguous and the offloaded data size is balanced across iterations. Device-to-host copies (GPU-initiated) are performed during compute-bound FFN operations to minimize overhead. NUMA-aware thread binding further reduces offloading time.
- Host KV-cache Management:
  - NanoFlow uses an LRU (Least Recently Used) policy to manage a hierarchical cache across CPU memory and SSDs. KV-cache entries are evicted to SSD when CPU memory limits are hit and retrieved from either CPU or SSD when a request's next round arrives (a minimal sketch of such an LRU hierarchy appears at the end of this subsection).
- KV-cache Loading and Scattering:
  - Because of PagedAttention, KV-cache pages can be fragmented in GPU memory.
  - To avoid slow copies to fragmented destinations, NanoFlow first copies the KV-cache data to a contiguous space on the GPU.
  - It then scatters these contiguous pages to their fragmented destinations in GPU memory, achieving 7-10x higher bandwidth for the host-to-device copy.

The implementation involves approximately 10K lines of CUDA and 6K lines of Python code, launching nano-operations based on auto-search results and managing dependencies with CUDA events.
5. Experimental Setup
5.1. Datasets
The evaluation uses three practical, real-world conversation datasets and also includes experiments with constant input/output lengths.
- Splitwise [32]:
- Source: A conversation trace collected from a real production environment at Microsoft.
- Scale: Approximately 20,000 requests.
- Characteristics: Represents typical interactions in a production setting.
- Average Input (Std): 1155 (1109) tokens
- Average Output (Std): 211 (163) tokens
- Why Chosen: Provides a realistic workload profile from a large-scale application.
- LMSYS-Chat-1M [56]:
- Source: A large-scale dataset featuring 1 million real-world conversations collected from 25 different LLMs.
- Scale: 50,000 requests randomly sampled for evaluation.
- Characteristics: Diverse conversations reflecting a broad range of user interactions with various LLMs.
- Average Input (Std): 102 (169) tokens
- Average Output (Std): 222 (210) tokens
- Why Chosen: Offers a comprehensive and varied representation of real-world LLM usage.
- ShareGPT [1]:
  - Source: A dataset of conversations collected from the ShareGPT API.
  - Scale: 50,000 requests randomly sampled for evaluation.
  - Characteristics: Contains interactions with various LLMs, often featuring longer and more complex prompts and responses.
  - Average Input (Std): 246 (547) tokens
  - Average Output (Std): 322 (244) tokens
  - Why Chosen: Provides insights into LLM performance under more extensive and user-generated dialogue.

The following are the results from [Table 4] of the original paper:

| Dataset | Avg. Input (Std) | Avg. Output (Std) |
|---|---|---|
| Splitwise [32] | 1155 (1109) | 211 (163) |
| LMSYS-Chat [56] | 102 (169) | 222 (210) |
| ShareGPT [1] | 246 (547) | 322 (244) |
Table 4 (above) provides the average and standard deviation of input and output lengths for the sampled datasets. These statistics are crucial for understanding the workload profiles used in the evaluation. The datasets are effective for validating the method's performance as they represent diverse and realistic LLM workloads, covering varying input/output lengths and conversational patterns.
5.2. Evaluation Metrics
The paper evaluates NanoFlow using standard performance metrics for LLM serving:
- Throughput (tokens/s/GPU):
  - Conceptual Definition: Measures the total number of tokens (including both input prompt tokens and generated output tokens) processed by the system per second, normalized per GPU. This metric is crucial for assessing the cost-efficiency and overall capacity of an LLM serving system, as higher throughput means more work done with the same hardware.
  - Mathematical Formula: Not explicitly provided as a standard formula for this context, but typically calculated as: $ \text{Throughput} = \frac{\text{Total Tokens Processed}}{\text{Total Execution Time (seconds)}} $ When normalized per GPU: $ \text{Throughput/GPU} = \frac{\text{Total Tokens Processed}}{\text{Total Execution Time (seconds)} \times N_{GPU}} $
  - Symbol Explanation:
    - Total Tokens Processed: The sum of all input tokens (from prefill phases) and all output tokens (from decode phases) across all requests.
    - Total Execution Time: The total duration for which the system was processing these requests.
    - $N_{GPU}$: The number of GPUs used in the serving system.
- Optimal Throughput (tokens/s/GPU):
  - Conceptual Definition: The theoretical maximum throughput achievable by the system, assuming full utilization of the most constrained resource (which NanoFlow identifies as compute). It serves as an upper bound for performance, against which actual system performance can be benchmarked.
  - Mathematical Formula: $ \mathrm{Throughput_{optimal}} = \frac{Compute}{2 P_{Model}} $
  - Symbol Explanation:
    - Compute: The aggregate GPU compute capacity in GFLOP/s (Giga Floating Point Operations per second) for the specific data type (e.g., FP16).
    - $P_{Model}$: The total number of parameters in the LLM.
- Latency (Normalized Latency, 99th-percentile Latency):
  - Conceptual Definition:
    - Normalized Latency: The end-to-end time taken for a request to be processed, divided by the output length in tokens. This metric helps compare latency across requests with varying output lengths; a lower value indicates faster processing per output token.
    - 99th-percentile Latency: The latency value below which 99% of all requests fall. This is a critical metric for Service Level Objectives (SLOs), as it captures the experience of most users, including those who encounter slower responses (the "tail" of the latency distribution).
  - Mathematical Formula: $ \text{Normalized Latency} = \frac{\text{End-to-End Latency}}{\text{Output Length}} $ The 99th percentile is a statistical measure without a simple algebraic formula; it is computed from the sorted list of all individual request latencies (see the short example at the end of this subsection).
  - Symbol Explanation:
    - End-to-End Latency: The time from a request's arrival to its full completion.
    - Output Length: The number of tokens generated as output for that specific request.
- Resource Utilization (Compute, Memory, Network):
  - Conceptual Definition: Measures the percentage of time or capacity for which a specific hardware resource (e.g., GPU compute units, memory bandwidth, network bandwidth) is actively being used. High utilization indicates efficient use of hardware. NanoFlow specifically aims to maximize compute utilization.
  - Mathematical Formula: No single standard formula; typically represented as a percentage or fraction: $ \text{Utilization} = \frac{\text{Actual Usage}}{\text{Maximum Capacity}} \times 100\% $
  - Symbol Explanation:
    - Actual Usage: The amount of a resource actively being used (e.g., FLOPs performed, bytes transferred).
    - Maximum Capacity: The theoretical maximum capability of that resource (e.g., peak FLOP/s, peak GB/s).
- Speedup (Throughput Boost):
  - Conceptual Definition: The factor by which the performance (e.g., throughput) of a new method improves over a baseline method. It quantifies the effectiveness of an optimization.
  - Mathematical Formula: $ \text{Speedup} = \frac{\text{Throughput}_{\text{NanoFlow}}}{\text{Throughput}_{\text{Baseline}}} $
  - Symbol Explanation:
    - $\text{Throughput}_{\text{NanoFlow}}$: The throughput achieved by NanoFlow.
    - $\text{Throughput}_{\text{Baseline}}$: The throughput achieved by a comparative baseline system.
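For concreteness, a few lines of Python computing normalized latency and its 99th percentile from per-request records (the field values here are made up for the example):

```python
import numpy as np

# (end_to_end_latency_seconds, output_tokens) per request -- dummy values.
requests = [(12.4, 180), (3.1, 25), (45.0, 300), (7.9, 64), (22.5, 210)]

normalized = [lat / out_tokens for lat, out_tokens in requests]   # seconds per output token
p99 = float(np.percentile(normalized, 99))                        # 99th-percentile normalized latency

print([round(x, 3) for x in normalized])
print(f"p99 normalized latency: {p99 * 1e3:.0f} ms/token")
```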
5.3. Baselines
NanoFlow is compared against three state-of-the-art and widely-used LLM serving frameworks, representing the current best practices in the field:
- vLLM [17]:
  - Description: A high-throughput LLM serving system known for its efficient memory management and scheduling.
  - Key Features: Implements PagedAttention for maximizing KV-cache memory utilization and continuous batching (referred to in some contexts, together with chunked prefill, as vLLM's batching strategy) to maintain high GPU utilization. It focuses on effective GPU memory usage and dynamic batching.
  - Why Representative: A widely adopted and recognized leader in LLM serving throughput optimization, making it a strong benchmark for comparing memory-efficient approaches.
- DeepSpeed-FastGen [13]:
  - Description: A serving framework developed by Microsoft, part of the broader DeepSpeed ecosystem for large-scale model training and inference.
  - Key Features: Dynamically composes prefill and decode requests to ensure the engine operates in a high-throughput regime. It also uses chunked prefill and advanced scheduling techniques.
  - Why Representative: Represents industrial-grade optimization efforts from a major technology company, focusing on dynamic request management for high throughput.
- TensorRT-LLM [26]:
  - Description: A high-performance LLM inference engine built upon NVIDIA's TensorRT SDK.
  - Key Features: Leverages NVIDIA's highly optimized kernel libraries and compilation tools (TensorRT) to achieve maximum performance on NVIDIA GPUs. It includes optimizations like a paged KV-cache and dynamic batching.
  - Why Representative: Represents the pinnacle of hardware-specific (NVIDIA) optimization, providing an excellent benchmark for raw performance potential on the target hardware.

These baselines were chosen because they are widely recognized, actively maintained, and implement state-of-the-art techniques for LLM serving, making them robust comparisons for NanoFlow's performance. The experimental setup tunes parameters for each baseline (e.g., max-ragged-batch-size for DeepSpeed-FastGen, max-num-tokens for TensorRT-LLM) to ensure each operates at its best possible throughput.
5.4. Hardware
The experiments were conducted on the following hardware:
- GPUs: NVIDIA A100 80GB SXM GPUs.
- Interconnect: The GPUs are interconnected via NVLink, providing high-bandwidth, low-latency communication crucial for tensor parallelism.

This setup is representative of high-end data center inference nodes. For models that fit on a single GPU (e.g., LLaMA-3-8B), a single A100 80GB SXM GPU was used.
5.5. Models
The evaluation covers a range of popular and representative LLMs to demonstrate NanoFlow's general applicability:
- LLaMA-2-70B [49]:
- Role: The primary model used for detailed evaluation and ablation studies due to its widespread adoption as a large, open-source LLM.
- Characteristics: 70 billion parameters.
- LLaMA-3-70B [22]:
- Role: Used to show NanoFlow's performance on a newer LLaMA generation.
- Characteristics: 70 billion parameters, with a larger vocabulary (128K) than LLaMA-2.
- LLaMA-3-8B [22]:
- Role: Evaluates NanoFlow's performance on smaller models that fit within a single GPU.
- Characteristics: 8 billion parameters.
- Qwen2-72B [9]:
- Role: Demonstrates performance on models with slightly different architectural specifics.
- Characteristics: 72 billion parameters; adds bias terms to the QKV projections.
- Deepseek-67B:
- Role: Another large model with architectural variations.
- Characteristics: 67 billion parameters, with a different number of layers and a different hidden dimension from the LLaMA models.
- Mixtral 8x7B [16]:
- Role: Crucial for demonstrating NanoFlow's effectiveness on Mixture-of-Experts (MoE) architectures.
- Characteristics: An MoE model with 8 experts of roughly 7B parameters each, enabling sparse activation during inference.

All models used FP16 (16-bit floating-point) weights and activations, which is standard practice for high-performance, data-center-scale LLM inference to balance memory use and computation; a rough weight-memory estimate follows below.
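As a back-of-envelope illustration of why the 8B model fits on a single 80GB GPU while the 70B-class models require a multi-GPU node, the sketch below estimates FP16 weight memory only (KV-cache, activations, and framework overhead add more). The numbers are rough parameter counts, not figures from the paper.

```python
# Back-of-envelope FP16 weight footprint (2 bytes per parameter).
# Ignores KV-cache, activations, and framework overhead, so real
# deployments need additional headroom beyond these estimates.
BYTES_PER_PARAM_FP16 = 2

models = {
    "LLaMA-3-8B": 8e9,
    "Deepseek-67B": 67e9,
    "LLaMA-2-70B": 70e9,
    "Qwen2-72B": 72e9,
}

for name, params in models.items():
    gb = params * BYTES_PER_PARAM_FP16 / 1e9
    verdict = "fits on one 80GB A100" if gb < 80 else "needs multiple GPUs"
    print(f"{name:>13}: ~{gb:4.0f} GB of weights ({verdict})")
```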
6. Results & Analysis
6.1. Core Results Analysis
Throughput Comparison
The paper first evaluates throughput in an "offline" setting, simulating scenarios like benchmarking, information extraction, or data processing where the system is continuously supplied with requests. Input and output lengths are either sampled from real-world datasets (Splitwise, LMSYS-Chat-1M, ShareGPT) or held constant. The NanoFlow instance, configured with a dense batch size of 2048 for LLaMA-2-70B (where it performs best), is compared against the baselines and the theoretical optimal throughput.
The theoretical optimal throughput for LLaMA-2-70B on this A100 setup is derived as 1857 tokens/s/GPU (as calculated in Section 3.5).
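The 1857 tokens/s/GPU figure comes from the paper's own cost model in Section 3.5. The sketch below is only a simplified upper-bound estimate that counts the dense GEMM FLOPs (roughly 2 FLOPs per parameter per token) against A100 peak FP16 compute and ignores attention FLOPs over the KV history, which is one reason it lands above the paper's number.

```python
# Simplified compute-bound upper bound for LLaMA-2-70B on A100 GPUs.
# Counts only dense GEMM work (~2 FLOPs per weight per generated token);
# attention over the growing KV history adds further per-token FLOPs,
# so the paper's more detailed model arrives at a lower optimum (1857).
PEAK_FP16_FLOPS = 312e12   # A100 peak FP16 tensor-core throughput (FLOPs/s)
PARAMS = 70e9              # LLaMA-2-70B parameter count

flops_per_token = 2 * PARAMS                  # ~140 GFLOPs/token (GEMMs only)
upper_bound = PEAK_FP16_FLOPS / flops_per_token
print(f"GEMM-only upper bound: ~{upper_bound:.0f} tokens/s/GPU")  # ~2229
```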
The following are the results from [Figure 7] of the original paper:
[Figure description: a bar chart of per-GPU throughput (tokens/s) for NanoFlow, the nano-batch-only variant, and the non-overlapping variant under different input/output settings; NanoFlow delivers the best performance in most cases.]
Figure 7 illustrates the throughput comparison. NanoFlow consistently achieves the highest throughput across all tested settings. In its best case, NanoFlow reaches 68.5% of the theoretical optimal throughput.
- Constant Length Workloads:
  - NanoFlow achieves an average of 2.62x higher offline throughput than vLLM.
  - NanoFlow achieves an average of 2.78x higher offline throughput than DeepSpeed-FastGen.
  - NanoFlow achieves an average of 1.73x higher offline throughput than TensorRT-LLM.
- Dataset-Driven Workloads (variable lengths):
  - NanoFlow achieves an average of 4.18x higher throughput than vLLM.
  - NanoFlow achieves an average of 3.45x higher throughput than DeepSpeed-FastGen.
  - NanoFlow achieves an average of 1.91x higher throughput than TensorRT-LLM.

The significant gains, especially on dataset-driven workloads, demonstrate NanoFlow's ability to handle realistic, variable-length requests far more efficiently than state-of-the-art systems. The 1.91x average boost over TensorRT-LLM (which leverages NVIDIA's highly optimized kernels) is particularly notable, highlighting the effectiveness of intra-device parallelism in unlocking latent GPU potential.
Latency Comparison
The paper evaluates latency by modeling request arrival intervals with an exponential distribution, generating request traces at various rates over 5 minutes. Normalized latency (end-to-end latency divided by output length) is the primary metric, with an SLO of 200 ms per output token (roughly human reading speed).
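As a sketch of this methodology (not NanoFlow's actual benchmarking harness), request arrivals at a given rate can be drawn from an exponential inter-arrival distribution, and each finished request's normalized latency is its end-to-end latency divided by its output length:

```python
import random

def generate_arrivals(rate_per_s: float, duration_s: float, seed: int = 0):
    """Poisson arrival process: exponential inter-arrival times at rate_per_s."""
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(rate_per_s)
        if t >= duration_s:
            return arrivals
        arrivals.append(t)

def normalized_latency(end_to_end_latency_s: float, output_tokens: int) -> float:
    """Normalized latency in seconds per output token (the paper's SLO: 0.2 s/token)."""
    return end_to_end_latency_s / output_tokens

# Example: a 5-minute trace at 10 req/s, and one request's normalized latency.
trace = generate_arrivals(rate_per_s=10.0, duration_s=300.0)
print(len(trace), "requests generated")
print(f"{normalized_latency(25.0, 160):.3f} s/token")  # 0.156 s/token, within the 200 ms SLO
```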
The following are the results from [Figure 8] of the original paper:
[Figure description: a comparison of resource usage over time for the non-overlapping pipeline (left) and NanoFlow (right), showing the percentage of compute, memory, and network resources occupied; NanoFlow markedly improves resource utilization.]
Figure 8 presents the normalized latency of NanoFlow versus baselines at different request rates.
- Low Request Rates: At lower request rates, NanoFlow exhibits comparable but slightly higher latency than the best baseline (TensorRT-LLM). This is attributed to NanoFlow's focus on throughput-oriented scenarios: its large dense batch size can add a small amount of overhead for individual requests at very low concurrency.
- High Request Rates & SLO: As the request rate increases, NanoFlow demonstrates superior performance. It sustains a significantly higher request rate while staying within the 200 ms latency SLO compared to all baselines across all datasets. For instance, on the LMSYS-Chat-1M dataset, NanoFlow handles 1.64x higher request rates than TensorRT-LLM within the 200 ms normalized latency constraint.
- Tail Latency: NanoFlow maintains good tail latency; its 99th-percentile latency is only 1.07x the average latency at near-maximum throughput. This consistency, largely due to the constant dense batch size that avoids performance cliffs, is a crucial advantage for user experience even for the small fraction of slower requests.
6.2. Data Presentation (Tables)
The tables from the paper were used in the methodology section to illustrate the foundational analysis for NanoFlow.
- Table 1: Characteristics of accelerator models: Presented in Section 4.2.1.
- Table 2: Comparison of operation runtimes between cost model estimation and real-world measurements: Presented in Section 4.2.1.
- Table 3: Performance of GEMV and network kernels under different resource utilization: Presented in Section 4.2.2.1.
- Table 4: The average and standard deviation of input and output lengths in the sampled datasets: Presented in Section 5.1.
6.3. Ablation Studies / Parameter Analysis
To isolate the contributions of NanoFlow's key techniques, an ablation study was conducted, comparing NanoFlow with two baselines that share its asynchronous request scheduling and kernel libraries:
- Non-overlapping Baseline: Processes inputs sequentially without nano-batches, representing the traditional approach.
- Nano-batch-only Baseline: Splits requests into nano-batches but executes them sequentially, isolating the overhead of nano-batching alone.

The following are the results from [Figure 9] of the original paper:
[Figure description: a bar chart of normalized per-GPU throughput across different LLMs, showing NanoFlow's significant gains over vLLM on multiple models; for example, LLaMA-3-8B reaches 78.5% of optimal throughput.]
Figure 9 illustrates the results of the ablation study.
- Overhead of Nano-batching: Splitting into nano-batches alone (Nano-batch-only vs. Non-overlapping) actually reduces performance by 13.2%. This confirms that nano-batching introduces overhead (e.g., extra kernel launches, repeated weight loading) that must be offset by overlapping; the arithmetic sketch after this list makes the trade-off explicit.
- Benefit of Overlapping Network-bound Kernels: For prefill-only workloads (Input 512, Output 0), which are typically compute-bound with some network-bound collective communication, NanoFlow achieves a 1.07x speedup over the non-overlapping baseline, demonstrating the benefit of overlapping network-bound and compute-bound kernels.
- Benefit of Overlapping Network- and Memory-bound Kernels: For decode-heavy workloads (Input 512, Output 1024), where decode attention adds memory-bound characteristics, NanoFlow achieves a 1.17x speedup over the non-overlapping baseline, indicating the effectiveness of overlapping all three types of heterogeneous operations.
- KV-cache Offloading Overhead: Enabling KV-cache offloading introduces a 3.0% performance degradation due to kernel interference from KV-cache movement. However, this is a worthwhile trade-off: offloading reduces the compute required for multi-round LMSYS-Chat workloads by 3.02x, saving resources in longer-term or memory-constrained scenarios despite the minor immediate throughput hit.
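To make the decomposition explicit, a small arithmetic sketch using the figures quoted above (this is an interpretation of the reported numbers, not a calculation from the paper): if nano-batching alone costs 13.2%, the overlapping itself must contribute roughly 1.35x on the decode-heavy workload to reach the reported 1.17x net gain.

```python
# Decomposing the ablation numbers quoted above (an interpretation of the
# reported results, not a calculation reproduced from the paper).
non_overlapping = 1.00                       # baseline throughput (normalized)
nano_batch_only = non_overlapping * (1 - 0.132)   # nano-batching alone loses 13.2%
nanoflow_decode_heavy = 1.17                 # full NanoFlow, decode-heavy workload

overlap_gain = nanoflow_decode_heavy / nano_batch_only
print(f"Overlapping must recover ~{overlap_gain:.2f}x over nano-batch-only")  # ~1.35x
```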
Resource Usage
NanoFlow's effectiveness in resource utilization is demonstrated by comparing it with a non-overlapping baseline.
The following are the results from [Figure 10] of the original paper:
[Figure description: a heatmap of compute and network resource utilization for different models (e.g., LLaMA-3 8B, Mixtral 8x7B) under different GPU configurations; color intensity indicates the degree of compute- or network-boundedness, and the numbers show system performance.]
Figure 10 visually represents the resource usage pattern. The non-overlapping baseline (left) shows sequential execution, where typically only one resource (compute, memory, or network) is heavily utilized at any given time, leaving pipeline bubbles and underutilizing the others. In contrast, NanoFlow (right) utilizes multiple resources concurrently, producing a much denser and more efficient usage pattern and enabling an average of 68.5% compute utilization across the whole pipeline. While kernel interference prevents 100% of optimal compute usage (a natural consequence of concurrent operations), the significant improvement over the baselines validates the intra-device parallelism approach.
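To illustrate the general idea of overlapping heterogeneous work within one GPU (only a minimal PyTorch sketch of the principle, not NanoFlow's custom kernels or its resource-aware scheduler), two nano-batch-like chunks can be issued on separate CUDA streams so that a compute-heavy GEMM and a memory-bound elementwise pass proceed concurrently:

```python
import torch

# Minimal sketch of intra-device overlap via CUDA streams (assumes a CUDA GPU).
# NanoFlow itself uses custom kernels and explicit GPU-resource partitioning;
# this only demonstrates the concurrency principle.
assert torch.cuda.is_available()
device = torch.device("cuda")

compute_stream = torch.cuda.Stream()
memory_stream = torch.cuda.Stream()

a = torch.randn(4096, 4096, device=device, dtype=torch.float16)
b = torch.randn(4096, 4096, device=device, dtype=torch.float16)
big = torch.randn(1 << 26, device=device, dtype=torch.float16)  # memory-bound workload

torch.cuda.synchronize()
with torch.cuda.stream(compute_stream):
    c = a @ b            # compute-bound "nano-operation"
with torch.cuda.stream(memory_stream):
    big.mul_(1.0001)     # memory-bound "nano-operation", overlapped with the GEMM
torch.cuda.synchronize()
print(c.shape, big[0].item())
```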
6.4. Performance on Other LLMs
To showcase NanoFlow's generalizability, its throughput was evaluated on several other popular LLMs, using a constant input length of 1024 and output length of 512. All tests ran on A100 80GB SXM GPUs, with LLaMA-3-8B on a single A100 and the larger models on multi-GPU configurations.
The following are the results from [Figure 11] of the original paper:
[Figure description: a chart of compute-bound and memory-bound performance metrics for different models under various GPU configurations (from left to right: LLaMA-3 8B on 1 GPU, Mixtral 8x7B on 8 GPUs, etc.), with color intensity indicating the value and results shown for the LMSYS-Chat, Splitwise, and ShareGPT workloads.]
Figure 11 summarizes the performance for these models.
- NanoFlow consistently improves throughput for all tested models.
- It achieves between 50% and 72% of the optimal throughput for these diverse architectures, demonstrating robust performance across different model scales and types (including MoE models).
- On average, NanoFlow achieves a 2.66x throughput gain over vLLM across these models, further cementing its advantage over state-of-the-art systems.
- For example, LLaMA-3-8B reaches 78.5% of optimal throughput, highlighting efficient scaling down to single-GPU deployments.

These results confirm that NanoFlow's auto-search engine can automatically generate efficient pipelines for a wide range of LLMs and hardware configurations, adapting to their specific characteristics.
7. Conclusion & Reflections
7.1. Conclusion Summary
This work presents a compelling analysis that re-evaluates the fundamental bottleneck in Large Language Model (LLM) serving. Contrary to the widely held assumption of memory-bound operations, the authors rigorously demonstrate that end-to-end LLM serving, for most modern models and common workloads, is predominantly compute-bound. The core problem identified is the sequential execution of heterogeneous operations (compute, memory, network) within a single GPU, leading to significant underutilization of the compute resources.
To address this, the paper proposes NanoFlow, a novel and comprehensive LLM serving framework. NanoFlow's key innovation is the exploitation of intra-device parallelism through nano-batching. By splitting inputs into smaller nano-batches and duplicating operations into nano-operations, NanoFlow enables the concurrent execution of heterogeneous operations, effectively overlapping compute-bound, memory-bound, and network-bound tasks within the same GPU. The framework features an intelligent auto-search engine that automatically designs optimal execution pipelines, considering kernel interference and resource allocation, and an efficient runtime system that manages asynchronous scheduling and KV-cache offloading.
The experimental results are robust and significant. NanoFlow achieves an impressive 1.91x throughput improvement over state-of-the-art serving systems like vLLM, DeepSpeed-FastGen, and TensorRT-LLM on practical workloads. Furthermore, it reaches between 50% and 72% of the theoretically optimal throughput across various popular LLMs (e.g., LLaMA-2-70B, Mixtral 8x7B), significantly closing the gap to hardware capabilities. NanoFlow also demonstrates superior latency performance at high request rates while adhering to SLO constraints.
7.2. Limitations & Future Work
While the paper doesn't have a dedicated "Limitations" section, some inherent aspects and areas for potential future work can be inferred:
- Complexity of Interference Modeling: The kernel interference modeling relies on pairwise profiling and on the assumption that these pairwise mappings still hold when three kernels overlap. While practical, this is an approximation, and the non-linear, unpredictable nature of GPU kernel interference means there is always a gap between modeled and true performance, keeping NanoFlow short of 100% of optimal throughput.
- Search Space for Auto-search: While NanoFlow prunes the auto-search space and finds a "practical" pipeline in about 10 minutes, there is a trade-off with absolute optimality. For highly dynamic or niche workloads, the current approximation might not always yield the best possible solution, and the MILP formulation itself abstracts and simplifies real-world GPU behavior.
- Fixed Dense Batch Size Assumption: NanoFlow assumes a stable dense batch size to simplify calculations and ensure consistent performance. Under extremely sparse or rapidly changing real-world traffic, maintaining this optimal dense batch size could be challenging, relying heavily on the external control plane for auto-scaling and load balancing.
- Overhead of Nano-batching: The ablation study showed that nano-batching alone introduces a 13.2% performance reduction. While overlapping largely offsets this, further reducing this inherent overhead could yield even greater gains.
- Hardware Specificity: The evaluation focuses on NVIDIA GPUs (A100). While Table 1 suggests similar resource ratios across vendors, the specific kernel profiling and interference modeling are hardware-dependent; adapting NanoFlow to other accelerators (AMD, Intel, custom AI chips) would require re-profiling and potentially re-tuning the auto-search's assumptions.
- Extending to Other Model Architectures: While NanoFlow demonstrates effectiveness on MoE models, the rapid evolution of LLM architectures (e.g., more complex attention variants, novel layers) would require continuous updates to the auto-search's understanding of operation dependencies and characteristics.
7.3. Personal Insights & Critique
NanoFlow presents a highly insightful and practical advancement in LLM serving. Its core contribution lies in challenging a long-standing assumption and offering a sophisticated, automated solution to the newly identified bottleneck.
- Innovation of Bottleneck Re-evaluation: The re-classification of LLM serving as compute-bound is a pivotal insight. It redirects optimization efforts from purely memory-centric approaches toward maximizing compute utilization, a more accurate target for current hardware and model trends, and reflects critical thinking about system performance beyond superficial observations.
- Elegance of Intra-device Parallelism: The concept of intra-device parallelism via nano-batching is elegant. Instead of simply packing more requests into a GPU, NanoFlow overlaps execution phases within a single GPU, turning idle time into productive time; this is a fine-grained form of pipelining that is highly effective for heterogeneous workloads.
- Sophistication of Auto-search: The auto-search engine is a standout feature. Manually optimizing complex pipelines with multiple nano-batches while accounting for non-linear kernel interference is intractable; automating this with MILP and a two-stage approximation makes the approach adaptable across diverse models and hardware and significantly reduces engineering effort. The profiled mapping from resource allocation to effective kernel performance is a clever way to quantify and manage interference.
- Holistic System Design: NanoFlow isn't just an algorithm; it's a complete system, addressing batch formation, asynchronous scheduling, and KV-cache management. This holistic approach ensures that the gains from intra-device parallelism are not negated by other system bottlenecks.
- Applicability Beyond LLMs: The principles of intra-device parallelism and auto-search for optimizing heterogeneous workloads could apply to other complex deep learning models, or even to general-purpose GPU computing where different kernel types (compute, memory, I/O) are executed sequentially.
Critique:
- Approximation in Interference Modeling: While necessary, the approximations in the kernel interference model (pairwise profiling and the resulting interference mapping) are a potential source of suboptimality. Real-world interference patterns are often more complex and context-dependent. Future work could explore more dynamic or machine-learning-based interference prediction, perhaps informed by hardware performance counters.
Search Time vs. Optimality Trade-off: The "10 minutes" search time for a "practical" pipeline is reasonable for deployment but implies that it's not exhaustively exploring the entire space. For extremely critical or static deployments, a longer, more exhaustive search might yield marginal but still valuable improvements.
-
Real-time Adaptability: While
auto-searchruns when architectures change, dynamic changes in workload characteristics (e.g., sudden shifts in input/output length distributions) might require rapid pipeline re-optimization. The currentauto-search, taking minutes, might not be suitable for real-time adaptation. A lightweight, online adaptation mechanism could be a valuable future direction.Overall,
NanoFlowis a robust and impactful piece of research. It provides a deeper understanding of LLM serving bottlenecks and delivers a practical, high-performance solution that significantly advances the state of the art in efficient LLM inference.