
Handling Heavy-tailed Input of Transformer Inference on GPUs

Published: 06/16/2022

TL;DR Summary

The study proposes a unified solution to enhance Transformer inference efficiency on GPUs dealing with heavy-tailed input, reducing redundancy through fine-grained and word-accumulation strategies. Results show a 63.9% latency reduction in the self-attention module and 28.1% in the Bert-base model.

Abstract

Transformer-based models achieve superior accuracy in the field of natural language processing (NLP) and start to be widely deployed in production. As a popular deployment device, graphics processing units (GPUs) basically adopt the batch processing technique for inferring Transformer-based models and achieving high hardware performance. However, as the input sequence lengths of NLP tasks are generally variable and follow a heavy-tailed distribution, batch processing brings large amounts of redundant computation and hurts practical efficiency. In this paper, we propose a unified solution for eliminating most redundant computation and gaining performance profit in handling heavy-tailed input of Transformer-based model inference on GPUs. In detail, the unified solution includes three strategies for the self-attention module, the multilayer perceptron (MLP) module, and the entire Transformer-based model respectively. For the self-attention module, we design a fine-grained strategy, which orchestrates fine-grained parallelism in the self-attention module by indexing the valid block matrix multiplications. For the MLP module, we take the common word-accumulation strategy, which places all sequences in a batch densely. For the entire model, we design a block-organized strategy to link up the fine-grained strategy and the word-accumulation strategy by organizing the data layout of the self-attention module at the grain of blocks. Applying our solution to eight corpora of the GLUE benchmark yields an average of 63.9% latency reduction in the self-attention module and 28.1% latency reduction in the Bert-base model.


In-depth Reading


1. Bibliographic Information

1.1. Title

The central topic of the paper is "Handling Heavy-tailed Input of Transformer Inference on GPUs". It focuses on optimizing the performance of Transformer-based model inference when input sequence lengths exhibit a heavy-tailed distribution, particularly on Graphics Processing Units (GPUs).

1.2. Authors

The authors and their affiliations are:

  • Jiangsu Du: Sun Yat-sen University, Guangzhou, China
  • Jiazhi Jiang: Sun Yat-sen University, Guangzhou, China
  • Yang You: National University of Singapore, Singapore
  • Dan Huang: Sun Yat-sen University, Guangzhou, China
  • Yutong Lu: Sun Yat-sen University, Guangzhou, China

1.3. Journal/Conference

The paper was published at the 2022 International Conference on Supercomputing (ICS '22). ICS is a highly reputable and influential conference in the field of high-performance computing, focusing on advanced architectures, parallel algorithms, and software. Publication at ICS indicates a rigorous peer-review process and a significant contribution to the supercomputing community.

1.4. Publication Year

The paper was published in 2022; the specific publication date was June 16, 2022.

1.5. Abstract

The paper addresses the inefficiency of Transformer-based model inference on GPUs caused by heavy-tailed input sequence lengths in natural language processing (NLP) tasks. While GPUs use batch processing for high performance, variable and heavy-tailed input lengths lead to significant redundant computation due to padding.

To mitigate this, the authors propose a unified solution comprising three strategies:

  1. Fine-grained strategy for the self-attention module: This strategy orchestrates fine-grained parallelism by indexing valid block matrix multiplication to eliminate most redundant computation.

  2. Word-accumulation strategy for the multilayer perceptron (MLP) module: This common strategy densely places all sequences in a batch to remove redundancy.

  3. Block-organized strategy for the entire model: This strategy links the previous two by organizing the data layout of the self-attention module at the block level, reducing overhead from layout switch operations.

    Applying this solution to eight corpora of the GLUE benchmark, the authors report an average 63.9% latency reduction in the self-attention module and 28.1% latency reduction in the Bert-base model inference.

The original source link is /files/papers/6919d6ad110b75dcc59ae2b8/paper.pdf. This paper was officially published at ICS '22.

2. Executive Summary

2.1. Background & Motivation

The widespread adoption of Transformer-based models in Natural Language Processing (NLP) for tasks like sentiment analysis and question answering has made their efficient deployment in production critical. GPUs are the go-to hardware for this, primarily utilizing batch processing to maximize hardware performance through parallelism.

However, NLP tasks often involve input sequences of variable length, which typically follow a heavy-tailed distribution. This means most sequences are short, but a few are very long. To apply batch processing directly, a common workaround is the padding strategy: all sequences in a batch are padded to the length of the longest sequence or a preset maximum length (seq_len). While this enables batching and high hardware utilization, it introduces a significant problem: redundant computation. The padded (invalid) portions of shorter sequences are processed just like valid data, wasting computational resources and severely impacting practical efficiency. The paper highlights that this redundant computation can even exceed the valid computation, especially in scenarios with highly varied lengths.

Existing solutions, such as the EFF-rebuild solution adopted by frameworks like EffectiveTransformer, FasterTransformer, and TurboTransformers, address redundant computation in the MLP modules by removing padding and densely arranging words (word-accumulation strategy). However, these solutions have two key disadvantages:

  1. They require extra data movement to switch between padded and dense data layouts, incurring overhead.

  2. Crucially, they fail to eliminate redundant computation within the self-attention module. As MLP module redundancy is reduced, the self-attention module begins to dominate the total computation time, making its optimization imperative. This gap in existing solutions motivates the current paper.

    The paper's innovative idea is to propose a unified solution that specifically targets and eliminates this overlooked redundant computation in the self-attention module, while also improving the overall data flow for variable-length inputs.

2.2. Main Contributions / Findings

The primary contributions of this paper are:

  • Demonstrating the problem: The paper rigorously illustrates that NLP input often follows a heavy-tailed distribution and that redundant computation in self-attention modules constitutes a substantial portion of the total workload, making it a critical target for optimization.

  • Proposing a unified solution: The authors introduce a comprehensive approach for eliminating redundant computation in Transformer-based model inference, consisting of three interconnected strategies:

    1. Fine-grained strategy for the self-attention module: This novel strategy orchestrates fine-grained parallelism on GPUs by indexing valid block matrix multiplication within the self-attention mechanism. It significantly reduces redundant computations by only processing the necessary parts. This strategy incorporates three key techniques: mini-block index (MBIX), shared memory block transpose (SMBT), and efficient atomic operation (EAOP).
    2. Word-accumulation strategy for the MLP module: This strategy, while not new, is integrated into the unified solution by densely arranging sequences in a batch, effectively eliminating MLP redundancy.
    3. Block-organized strategy for the entire model: This strategy unifies the fine-grained and word-accumulation approaches by designing a block padding data layout for the self-attention module. This block-grained organization reduces the overhead associated with data layout switches between the MLP and self-attention modules.
  • Customized GPU kernels: The paper describes the implementation of customized GPU kernels for the $Q \times K^T$ and $QK^T \times V$ functions using the fine-grained strategy and its associated techniques.

  • Empirical Validation: Through extensive experiments on a Tesla V100 GPU using eight corpora from the GLUE benchmark, the unified solution achieves significant performance gains.

    The key findings and conclusions are:

  • The fine-grained strategy can achieve an average of 63.9% latency reduction in the self-attention module compared to existing cublas-based implementations.

  • The unified solution (integrating all three strategies) achieves an average of 28.1% latency reduction for the Bert-base model inference compared to the FasterTransformer framework, which uses the EFF-rebuild solution.

  • The effectiveness of the solution is more pronounced for corpora with larger seq_len and when more sequences concentrate in shorter length areas, where the self-attention module's quadratic complexity makes its optimization highly impactful.

  • The block-organized strategy successfully reduces layout switching overhead by introducing a more memory-efficient block padding layout compared to traditional batch padding.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a novice reader should be familiar with the following core concepts:

3.1.1. Deep Learning and Natural Language Processing (NLP)

Deep Learning is a subfield of machine learning that uses artificial neural networks with multiple layers to learn representations of data. NLP is a field of artificial intelligence that focuses on enabling computers to understand, interpret, and generate human language. Deep learning models have achieved state-of-the-art results across many NLP tasks (e.g., sentiment analysis, question answering, machine translation).

3.1.2. Transformer-based Models

Transformer-based models (e.g., BERT, GPT-2) are a type of neural network architecture that have revolutionized NLP. Unlike previous architectures like Recurrent Neural Networks (RNNs) that process sequences sequentially, Transformers process all parts of an input sequence simultaneously, which allows for greater parallelization and better handling of long-range dependencies in text.

  • Key components of a Transformer:
    • Word Embedding Module: Converts each word in an input sequence into a numerical vector representation (word embedding). This initial representation captures semantic meaning. The output is typically a matrix of size hidden_size × seq_len for a single sequence, where hidden_size is the dimensionality of each word vector and seq_len is the maximum sequence length.
    • Self-Attention Module: The core innovation of Transformers. It allows the model to weigh the importance of different words in the input sequence when processing a specific word. It computes a weighted sum of input representations, where weights are determined by how "relevant" other words are to the current word.
    • Multilayer Perceptron (MLP) Module: Also known as the feed-forward network. It consists of multiple linear functions (matrix multiplications followed by non-linear activations) applied independently to each position in the sequence. It processes the output of the self-attention module to further transform the representations.

3.1.3. Self-Attention Mechanism

The self-attention mechanism is crucial for Transformers. It maps a query and a set of key-value pairs to an output. For each word in a sequence, three vectors are generated:

  • Query (Q): Represents what the current word is looking for.

  • Key (K): Represents what information other words offer.

  • Value (V): Contains the actual information from other words.

    The attention output is calculated as follows: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:

  • $Q$ is the matrix of queries (shape: seq_len × $d_k$).

  • $K$ is the matrix of keys (shape: seq_len × $d_k$).

  • $V$ is the matrix of values (shape: seq_len × $d_v$).

  • $d_k$ is the dimensionality of the key vectors. The division by $\sqrt{d_k}$ is a scaling factor to prevent the dot products from becoming too large, which can push the softmax function into regions with very small gradients.

  • $QK^T$ computes the dot product similarity between each query and all keys, resulting in an attention score matrix.

  • $\mathrm{softmax}$ normalizes these scores, turning them into probabilities that sum to 1, indicating the weight of each word's value in the output.

  • Multiplying by $V$ produces the final attention output, which is a weighted sum of the value vectors.

    The paper focuses on two main computations within self-attention: $Q \times K^T$ and $QK^T \times V$.
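
To make the data flow concrete, the following is a minimal single-head reference implementation of the attention formula above in plain host code. It is illustrative only (the function name attention and the variables val_len, d_k, and d_v are not from the paper); it spells out the $Q \times K^T$, softmax, and $QK^T \times V$ steps that the paper later optimizes on the GPU.

```cuda
// Minimal single-head reference (host code, illustrative): computes
// softmax(Q * K^T / sqrt(d_k)) * V with plain loops. Q and K are
// val_len x d_k, V is val_len x d_v, all row-major.
#include <algorithm>
#include <cmath>
#include <vector>

std::vector<float> attention(const std::vector<float>& Q, const std::vector<float>& K,
                             const std::vector<float>& V, int val_len, int d_k, int d_v) {
    std::vector<float> out(val_len * d_v, 0.0f);
    std::vector<float> score(val_len);
    for (int i = 0; i < val_len; ++i) {
        float max_s = -1e30f, sum = 0.0f;
        for (int j = 0; j < val_len; ++j) {              // Q * K^T, scaled by 1/sqrt(d_k)
            float s = 0.0f;
            for (int k = 0; k < d_k; ++k) s += Q[i * d_k + k] * K[j * d_k + k];
            score[j] = s / std::sqrt(static_cast<float>(d_k));
            max_s = std::max(max_s, score[j]);
        }
        for (int j = 0; j < val_len; ++j) {              // numerically stable softmax
            score[j] = std::exp(score[j] - max_s);
            sum += score[j];
        }
        for (int j = 0; j < val_len; ++j)                // weighted sum of value vectors
            for (int v = 0; v < d_v; ++v)
                out[i * d_v + v] += (score[j] / sum) * V[j * d_v + v];
    }
    return out;
}
```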

3.1.4. Graphic Processing Units (GPUs) and Batch Processing

GPUs are specialized electronic circuits designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer for output to a display device. Their highly parallel architecture (thousands of small processing cores) makes them ideal for workloads involving large matrix operations, common in deep learning.

  • Batch Processing: To fully utilize GPUs, deep learning inference typically employs batch processing. Instead of processing one input sample at a time, multiple samples (a batch) are grouped and processed simultaneously. This exposes higher parallelism, amortizes kernel launch overhead (the time taken to set up and launch a computation on the GPU), and helps saturate the GPU's compute capacity, leading to higher throughput.
  • Padding Strategy: For variable-length inputs in NLP, batch processing isn't straightforward because matrix operations require inputs of consistent dimensions. The padding strategy involves extending shorter sequences with special padding tokens (e.g., zeros) until all sequences in a batch reach a uniform maximum length (seq_len). This creates rectangular matrices suitable for efficient batch matrix multiplication.
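
As a minimal illustration of the padding strategy (simplified to one value per word rather than a hidden_size vector; the function name pad_batch is hypothetical), shorter sequences are extended with zeros so that every sequence in the batch occupies exactly seq_len slots:

```cuda
// Host-side sketch of the padding strategy (illustrative, one float per word):
// every sequence is extended with zeros until it reaches seq_len, producing a
// regular batch_size x seq_len buffer that batched GEMM can consume, at the
// cost of also computing on the padded positions.
#include <vector>

std::vector<float> pad_batch(const std::vector<std::vector<float>>& batch, int seq_len) {
    std::vector<float> padded;
    padded.reserve(batch.size() * seq_len);
    for (const auto& seq : batch) {
        padded.insert(padded.end(), seq.begin(), seq.end());       // valid words
        padded.insert(padded.end(), seq_len - seq.size(), 0.0f);   // padding tokens
    }
    return padded;                                                 // row-major, batch_size x seq_len
}
```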

3.1.5. Heavy-tailed Distribution

In statistics, a heavy-tailed distribution is a probability distribution whose tails are not exponentially bounded. In the context of NLP sequence lengths, it means that while the majority of sequences are relatively short, there's a non-negligible number of exceptionally long sequences. When applying the padding strategy, this implies that the seq_len must accommodate these rare long sequences, leading to extensive padding for the more common short sequences.

3.1.6. General Matrix Multiplication (GEMM)

GEMM refers to a set of highly optimized routines for performing matrix-matrix multiplication. It is a fundamental operation in deep learning. cuBLAS is NVIDIA's GPU-accelerated basic linear algebra subroutine library that provides high-performance GEMM implementations. The MLP modules and parts of the self-attention can be reduced to GEMM operations.

3.1.7. CUDA and GPU Architecture (NVIDIA Terminology)

CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model developed by NVIDIA for its GPUs.

  • Streaming Multiprocessors (SMs): The core processing units of a GPU. A GPU contains an array of SMs.
  • Threads, Warps, and Blocks:
    • Thread: The smallest unit of execution.
    • Warp: A group of 32 threads that execute the same instruction in single-instruction, multiple-data (SIMD) fashion.
    • GPU Thread Block (Block): A group of warps. All threads within a block can communicate through fast shared memory.
  • Grid: The entire set of thread blocks launched for a kernel.
  • Kernel: A function that runs on the GPU.
  • Shared Memory: A small, very fast, programmer-managed memory region local to each SM. It's significantly faster than global memory but has limited capacity.
  • Global Memory: The main GPU memory, accessible by all SMs. It is large but has higher latency than shared memory.
  • Occupancy: The number of thread blocks that execute concurrently on an SM. Higher occupancy generally means better hardware performance as latency (e.g., memory access latency) can be hidden by switching to other active warps.
  • Bank Conflicts: Shared memory is divided into banks. If multiple threads in the same warp try to access different words in the same bank simultaneously, a bank conflict occurs, serializing the accesses and reducing efficiency.

3.2. Previous Works

The paper discusses existing frameworks that have noticed the redundant computation problem in Transformer inference:

  • EffectiveTransformer [10] (ByteDance): This framework is credited with first proposing the padding-rebuild method (which the paper refers to as the EFF-rebuild solution) to address the variable length problem in MLP modules. It aimed to reduce execution time and memory consumption for large batch size cases.
  • FasterTransformer [11] (NVIDIA): An active open-source project by NVIDIA that focuses on accelerating Transformer inference through manual kernel fusion and other optimizations. It adopts the EFF-rebuild solution.
  • TurboTransformer [12] (Tencent): This framework also addresses the variable length problem with a memory management strategy and presents designs for serving systems. It also adopts the EFF-rebuild solution.

3.2.1. The EFF-Rebuild Solution

This solution, implemented in the frameworks above, works as follows:

  1. Before MLP: It removes padding tokens from the input sequences within a batch and arranges the valid words densely in memory. This is called the word-accumulation strategy.
  2. MLP Computation: The densely-arranged data can then be processed efficiently by standard GEMM libraries (like cuBLAS) without redundant computations on padded values.
  3. After MLP, Before Self-Attention: It rebuilds the padding to revert the data to the padded layout required by the self-attention module. This involves data movement.
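
A minimal sketch of the word-accumulation (padding removal) step, under the same one-value-per-word simplification (remove_padding is a hypothetical name): only the valid prefix of each sequence is copied into a dense buffer, and the rebuild step is simply the inverse mapping.

```cuda
// Host-side sketch of the word-accumulation strategy (illustrative): valid
// words of all sequences are packed densely so the MLP GEMM never touches
// padding. Rebuilding the padded layout reverses this copy.
#include <vector>

std::vector<float> remove_padding(const std::vector<float>& padded,   // batch_size * seq_len entries
                                  const std::vector<int>& val_len,
                                  int seq_len) {
    std::vector<float> dense;
    for (size_t s = 0; s < val_len.size(); ++s)
        for (int w = 0; w < val_len[s]; ++w)          // copy only the valid prefix of sequence s
            dense.push_back(padded[s * seq_len + w]);
    return dense;
}
```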

3.2.2. Limitations of the EFF-Rebuild Solution

The paper points out two major disadvantages of the EFF-rebuild solution:

  1. Extra Data Movement: The process of removing and rebuilding padding requires frequent data movement between different memory layouts, introducing overhead.

  2. No Redundancy Elimination in Self-Attention: Crucially, this solution cannot eliminate redundant computation within the self-attention module itself. The self-attention module still operates on the fully padded sequences, leading to significant wasted computation, especially given its quadratic complexity with respect to sequence length. As MLP redundancy is reduced, self-attention's unoptimized portion becomes a larger bottleneck.

    Other related works mentioned:

  • E.T. [20]: Proposed a kernel fusion design for the attention module, but its performance was limited by shared memory capacity and degraded for longer sequence lengths.
  • Model Size Reduction: Works like E.T. [20] (hardware-aware pruning) and Synthesizer [22] (alternative self-attention structures) aim to accelerate Transformer inference by reducing the model size or computational complexity, often with trade-offs in accuracy or applicability.

3.3. Technological Evolution

The evolution of NLP models for inference has moved from:

  1. Early models (e.g., RNNs): Sequential processing, difficult to parallelize effectively on GPUs.
  2. Initial GPU deployments: Naive batching with extensive padding for models like RNNs and early Transformers, leading to redundant computation.
  3. Transformer revolution: Introduction of the self-attention mechanism allows for greater parallelization, making GPUs even more critical.
  4. Framework optimizations (e.g., FasterTransformer, TurboTransformer): Focused on kernel fusion and addressing variable length inputs in MLP modules using the EFF-rebuild solution, recognizing the padding overhead. These were a significant step, but still left self-attention unoptimized for redundancy.
  5. This paper's work: Represents the next evolutionary step, specifically targeting the self-attention module's redundant computation and proposing a unified data management strategy to further improve overall Transformer inference efficiency on GPUs.

3.4. Differentiation Analysis

Compared to the main methods in related work, specifically the EFF-rebuild solution adopted by EffectiveTransformer, FasterTransformer, and TurboTransformer, this paper's core innovations and differences are:

  • Targeting Self-Attention Redundancy: The most significant differentiation is the direct attack on redundant computation within the self-attention module. Previous solutions largely ignored this, processing the self-attention with full padding using cuBLAS routines, even after optimizing MLP modules. This paper explicitly designs fine-grained strategies to eliminate this waste.

  • Fine-grained Parallelism: Instead of relying on cuBLAS for self-attention, which operates on entire padded matrices, this paper proposes customized kernels that orchestrate fine-grained parallelism by only processing valid mini-blocks within the self-attention computations ($Q \times K^T$ and $QK^T \times V$).

  • Unified Data Layout Strategy: While previous methods switch between dense and padded layouts for MLP and self-attention respectively, incurring overhead, this paper introduces a block-organized strategy with a block padding layout. This new layout is designed to be more compatible with both the word-accumulation strategy (for MLP) and the fine-grained strategy (for self-attention), reducing data movement overhead during layout switches.

  • Low-level GPU Optimization: The fine-grained strategy incorporates specific GPU architecture optimizations like shared memory block transpose to avoid bank conflicts and efficient atomic operations to mitigate contention issues, which are not present in the EFF-rebuild solution's handling of self-attention.

    In essence, this paper goes beyond existing kernel fusion and MLP optimization efforts by providing a holistic approach that tackles the variable-length input problem across the entire Transformer model, with a particular focus on the previously unaddressed self-attention redundancy.

4. Methodology

The paper proposes a unified solution to eliminate most redundant computation when handling heavy-tailed input for Transformer-based model inference on GPUs. This solution integrates three strategies: a fine-grained strategy for the self-attention module, the word-accumulation strategy for the MLP module (a common technique adapted here), and a block-organized strategy for the entire model to seamlessly connect the other two.

4.1. Principles

The core idea behind the proposed methodology is to move away from the naive padding strategy that processes invalid (padded) data, and instead perform computations only on valid data. This requires fine-grained control over GPU operations and a unified approach to data layout management across different modules of the Transformer. The theoretical basis lies in decomposing large matrix operations into smaller, independent block matrix multiplications and carefully orchestrating their execution to avoid redundant work, while leveraging GPU architectural features like shared memory and parallel execution efficiently.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Computation Analysis and Problem Identification

The paper first analyzes where redundant computation occurs in Transformer-based models due to heavy-tailed input and the padding strategy. Table 1 describes the key hyperparameters for Transformer-based models:

  • seq_len: The longest sequence length preset for the model.

  • head_size: The vector size of Q, K, V.

  • hidden_size: The vector size of word embedding.

  • batch_size: The sample number in the same batch.

  • val_len: The actual sequence length of a sample.

    The following are the results from Table 1 of the original paper:

    Variable     Meaning
    seq_len      The longest sequence length preset for the model
    head_size    The vector size of Q, K, V
    hidden_size  The vector size of word embedding
    batch_size   The sample number in the same batch
    val_len      The actual sequence length of a sample

4.2.1.1. Heavy-tailed Input (3.1)

The paper statistically confirms that input sequence lengths in NLP tasks (using the GLUE benchmark as an example) follow a heavy-tailed distribution. Most sequences are much shorter than the maximum seq_len, meaning the padding strategy introduces substantial redundant computation.

As can be seen from Figure 2, the lengths of sequences in the same batch generally vary a lot, and they mainly concentrate in the short length area.

Figure 2: The length statistic of corpora in the GLUE Benchmark. (The chart plots, for each corpus, the percentage of sequences falling into each sequence-length interval.)

4.2.1.2. Linear Function (MLP Module) (3.2)

Linear functions form the core of MLP modules. With the padding strategy and batch processing, all input sequences are padded to seq_len. This means computation is performed on hidden_size × seq_len matrices, even if the actual val_len is much shorter. The practical computation amount for a linear function is linearly proportional to val_len.

The EFF-rebuild solution addresses this problem in MLP modules using the word-accumulation strategy.

  • Process: Before entering the MLP module, padding is removed, and valid words from all sequences in a batch are densely arranged. This densely-arranged matrix can then be multiplied with the parameter matrix using standard GEMM libraries (e.g., cublasLtMatmul, cublasGemmEx). After the MLP module, padding is rebuilt before the self-attention module.

  • Benefit: This eliminates redundant computation in MLP modules.

  • Drawback: It requires extra data movement for layout switches and does not address self-attention redundancy.

    The following are the results from Figure 3 of the original paper:

Figure 3: The illustration of the Linear Function. Figure 3(a) illustrates the batched linear function with padding: grey squares are valid, white squares are padded. batch_size is 4, the actual lengths are 1, 5, 2, 1, and all sequences are padded to seq_len = 6; the parameter matrix is fixed, and computation is performed on all padded elements. Figure 3(b) illustrates the switch from the padding data layout to the densely-arranged data layout.

4.2.1.3. $Q \times K^T$ Function (3.3)

This is a critical part of the self-attention module. It computes the correlation between any two words in a sequence.

  • Problem: Both the query (Q) and key (K) matrices are derived from the input sequence, so their dimensions are related to seq_len. The $Q \times K^T$ operation produces a seq_len × seq_len matrix. If sequences are padded, most of this resulting matrix will contain invalid values. The practical computation amount for this function has a quadratic relationship with val_len.
  • Existing Implementation: Uses cublasGemmStridedBatchedEx from the cuBLAS library. This function is highly optimized for batched GEMM but performs computations on all padded elements.
  • Redundancy Ratio: The ratio of the total computation amount (with padding) to the valid computation amount (without padding) is given by: $ \frac{head\_size \times seq\_len^2}{head\_size \times val\_len^2} = \left(\frac{seq\_len}{val\_len}\right)^2 $ For example, if seq_len is 128 and val_len is 16, the ratio is $(128/16)^2 = 8^2 = 64$. This means redundant computation can be 63 times more than valid computation.
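
To see how quickly this ratio adds up over a batch, the small host-side illustration below (not from the paper; the example lengths are made up) computes the fraction of the batched $Q \times K^T$ work that is redundant when every sequence is padded to seq_len:

```cuda
// Illustration of the redundancy over a batch: padded Q*K^T work is always
// seq_len^2 per sequence, while valid work is only val_len^2 (head_size
// cancels out of the ratio).
#include <cstdio>

int main() {
    const int seq_len = 128;
    const int val_len[] = {16, 24, 32, 128};          // example heavy-tailed batch
    long long valid = 0, total = 0;
    for (int l : val_len) {
        valid += 1LL * l * l;
        total += 1LL * seq_len * seq_len;
    }
    std::printf("redundant fraction = %.1f%%\n", 100.0 * (total - valid) / total);
    return 0;   // prints roughly 72% for this example batch
}
```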

The following are the results from Figure 4 of the original paper:


Figure 4: The illustration of the batched $Q \times K^T$ function in matrix form. Figure 4(a) shows the batched $Q \times K^T$ function: the input $Q$ and $K$ matrices for different sequences are stacked, and the output $QK^T$ matrix shows valid values in color and padded values in white; only the top-left val_len × val_len sub-matrix of each sequence is valid. Figure 4(b) illustrates the flattened batched $Q \times K^T$ function: red lines represent valid computations between vectors, while black lines represent redundant computations due to padding. The large number of black lines visually emphasizes the extent of redundancy.

4.2.1.4. $QK^T \times V$ Function (3.4)

This is the second major computation in the self-attention module, multiplying the attention scores ($QK^T$) with the value (V) matrix. Similar to $Q \times K^T$, both $QK^T$ and $V$ are related to seq_len, so this operation also has a quadratic relationship with val_len. The same redundancy ratio (Equation 1) applies. Existing implementations also use cublasGemmStridedBatchedEx.

The following are the results from Figure 5 of the original paper:


Figure 5: The illustration of the batched $QK^T \times V$ function in matrix form. Figure 5 presents the $QK^T \times V$ function, showing the $QK^T$ matrix (with padded regions) being multiplied by the $V$ matrix (also with padded regions). As in $Q \times K^T$, the output has a val_len × val_len valid region, with the rest being padded.

4.2.1.5. Computation Amount Comparison (3.5)

The paper profiles Bert-base inference on the SST-2 corpus to compare the computation time of MLP modules (GEMM) and self-attention modules (BatchedGEMM). The following are the results from Table 2 of the original paper:

             GEMM(3)   GEMM(2)   BatchedGEMM
Padding      20.4      50.16     15.75
  %          23.6%     58.1%     18.2%
No Padding   3.24      6.84      15.26
  %          12.8%     27.0%     60.2%
  • Padding Strategy: GEMM (MLP) accounts for the majority of computation time (23.6% + 58.1% = 81.7%), with BatchedGEMM (self-attention) taking a smaller proportion (18.2%). In this scenario, optimizing self-attention alone would yield limited overall benefit.
  • No Padding Strategy (with EFF-rebuild for MLP): After MLP redundant computation is largely reduced, the BatchedGEMM (self-attention) time becomes dominant (60.2%), highlighting that self-attention is now the primary bottleneck and its optimization is crucial.

4.2.2. Overview of the Unified Solution (4)

The proposed unified solution combines three strategies, as shown in Figure 6. The following are the results from Figure 6 of the original paper:

Figure 6: The Overview of Our Unified Solution. (The diagram shows the strategies for the self-attention module, the MLP module, and the entire Transformer-based model, including the generation of index arrays from sequence lengths, the fine-grained strategy for the $Q \times K^T$ and $QK^T \times V$ computations, and the word-accumulation strategy.)

  1. Fine-grained strategy (for self-attention): Orchestrates fine-grained parallelism for the $Q \times K^T$ and $QK^T \times V$ functions by indexing valid block matrix multiplication, eliminating most redundant computation within self-attention. This involves three techniques: Mini-block Index (MBIX), Shared Memory Block Transpose (SMBT), and Efficient Atomic Operation (EAOP).
  2. Word-accumulation strategy (for MLP): This is the common approach of densely arranging sequences in a batch, adopted to eliminate MLP redundancy.
  3. Block-organized strategy (for entire model): This strategy links the fine-grained and word-accumulation strategies by organizing the data layout of the self-attention module in a block-grained manner (block padding), which reduces overhead from layout switches.

4.2.3. Fine-grained Strategy (4.1)

The fine-grained strategy addresses the quadratic redundancy in the self-attention module by decomposing matrix multiplications into mini-block matrix multiplications and only computing the valid ones. This is inspired by the concept of block matrix multiplication: $ AB = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix} = \begin{bmatrix} A_{11}B_{11} + A_{12}B_{21} & A_{11}B_{12} + A_{12}B_{22} \\ A_{21}B_{11} + A_{22}B_{21} & A_{21}B_{12} + A_{22}B_{22} \end{bmatrix} $ This formula shows how two matrices $A$ and $B$, partitioned into blocks, can be multiplied by performing multiplications and additions on their respective blocks, where $A_{ij}$ and $B_{ij}$ are sub-matrices (blocks). The strategy applies this by decomposing the $Q$, $K$, and $V$ matrices into mini-blocks and only performing the computations for the valid mini-blocks.

4.2.3.1. Mini-block Index (MBIX, 4.1.1)

  • Purpose: To identify and orchestrate only the valid mini-block matrix multiplications.
  • Mechanism: MBIX builds index arrays (e.g., q_offset, k_offset, qk_offset for $Q \times K^T$) that record the execution order and target position for valid mini-blocks. These offsets are calculated based on the val_len of the input sequences; a host-side sketch of the offset generation is given at the end of this subsection.
  • Process:
    1. Based on val_len for each sequence, determine which mini-blocks are valid.

    2. Create q_offset array: Stores the starting memory offset for valid Q mini-blocks.

    3. Create k_offset array: Stores the starting memory offset for valid K mini-blocks.

    4. Create qk_offset (or result_offset) array: Stores the starting memory offset in the output matrix where the result of a mini-block multiplication should be written.

    5. The GPU then iterates through these offset arrays, retrieving and processing only the valid mini-block pairs.

      The following are the results from Figure 7 of the original paper:


Figure 7: 2-D partitioning for the $Q \times K^T$ function. Figure 7 illustrates 2-D partitioning for the $Q \times K^T$ function. It shows the query (left), key (middle), and output (right) matrices partitioned into 2 × 2 mini-blocks. The q_offset and k_offset arrays record the starting positions of valid mini-blocks in $Q$ and $K$ respectively. The qk_offset array records where the results of multiplying corresponding valid mini-blocks should be written in the output $QK^T$ matrix. For a three-word sequence with val_len = 3 (assuming a mini-block size of 2 × 2), q_offset could be [0,2,0,..,10], k_offset could be [0,2,8,..,10], and qk_offset could be [0,0,2,..,14]. This allows the GPU to only process mini-blocks that contain valid data (highlighted in red dashed boxes).

The following are the results from Figure 8 of the original paper:


Figure 8: 2-D partitioning for the $QK^T \times V$ function. Figure 8 applies the same MBIX concept to the $QK^T \times V$ function: the QK_offset, V_offset, and Res_offset arrays guide the processing of valid mini-blocks for this multiplication.

  • Trade-offs of mini-block size:
    • Too small: Large index arrays (high build time, memory footprint), poor spatial locality.
    • Too large: Less adaptation to irregular data, more redundant computation.
  • Empirical choice: a 32 × 32 block size for 2-D partitioning.
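
The following host-side sketch shows one way the MBIX offset arrays could be generated for the 2-D partitioning described above. It is an assumption-laden simplification (head_size is taken to fit in a single mini-block, and the storage layout and the names build_index and MiniBlockIndex are hypothetical), intended only to show that the arrays enumerate valid mini-block pairs and their target positions from val_len alone.

```cuda
// Host-side sketch of the MBIX idea (assumed layout): for every sequence,
// enumerate only the output mini-blocks of Q*K^T that contain valid words and
// record where the corresponding Q block, K block, and result block start.
// A kernel then loops over these arrays instead of over the padded matrices.
#include <vector>

struct MiniBlockIndex {
    std::vector<int> q_offset, k_offset, qk_offset;
};

MiniBlockIndex build_index(const std::vector<int>& val_len,
                           int seq_len, int head_size, int block = 32) {
    MiniBlockIndex idx;
    for (size_t s = 0; s < val_len.size(); ++s) {
        int q_base  = static_cast<int>(s) * seq_len * head_size;   // start of this sequence's Q (= K) matrix
        int qk_base = static_cast<int>(s) * seq_len * seq_len;     // start of this sequence's Q*K^T matrix
        int n_blocks = (val_len[s] + block - 1) / block;           // valid blocks along the sequence dimension
        for (int i = 0; i < n_blocks; ++i)                         // row block of the output
            for (int j = 0; j < n_blocks; ++j) {                   // column block of the output
                idx.q_offset.push_back(q_base + i * block * head_size);
                idx.k_offset.push_back(q_base + j * block * head_size);
                idx.qk_offset.push_back(qk_base + i * block * seq_len + j * block);
            }
    }
    return idx;
}
```

In this sketch, each triple of entries describes one mini-block multiplication; a matching kernel would pick its triple by blockIdx and perform a 32 × 32 block product.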

4.2.3.2. Shared Memory Block Transpose (SMBT, 4.1.2)

  • Problem: After fetching mini-block data from global memory to shared memory, direct copying can lead to bank conflicts. If multiple threads in a warp access different data elements that happen to reside in the same shared memory bank, these accesses are serialized, reducing memory bandwidth. This occurs because shared memory is typically organized into banks, and consecutive addresses often map to consecutive banks.

  • Solution: When moving mini-block data from global memory to shared memory, the data is transposed. This changes the memory access pattern such that threads within the same warp are more likely to access data from different shared memory banks, thereby eliminating or significantly reducing bank conflicts.

    The following are the results from Figure 9 of the original paper:


Figure 9: The Illustration of Shared Memory Block Transpose. The red line represents the arrangement direction of memory. The left side shows a mini-block copied directly to shared memory; subsequent accesses by threads in a warp can then fall into the same bank (indicated by multiple arrows pointing to the same bank), causing bank conflicts. The right side shows the mini-block being transposed as it is copied to shared memory, altering the memory layout so that later accesses by threads in a warp hit different banks.
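
A minimal CUDA sketch of the transposed copy (an assumed kernel, not the paper's implementation): a 32 × 32 mini-block is stored into shared memory with its coordinates swapped, so that the 32 accesses of a warp land in different shared memory banks.

```cuda
// Illustrative kernel: copy a 32x32 mini-block from global memory into shared
// memory while transposing it. With the swapped coordinates (plus a one-element
// pad), the 32 accesses of a warp fall into 32 different banks.
__global__ void copy_block_transposed(const float* __restrict__ src,  // 32x32 block, leading dimension ld
                                      float* dst,                     // 32x32 transposed output, row-major
                                      int ld) {
    __shared__ float tile[32][33];                 // 33 = 32 + 1 padding column
    int x = threadIdx.x;                           // column index, 0..31
    int y = threadIdx.y;                           // row index, 0..31 (launch with blockDim = (32, 32))
    tile[x][y] = src[y * ld + x];                  // coalesced load, transposed store
    __syncthreads();
    dst[y * 32 + x] = tile[y][x];                  // conflict-free read of the transposed tile
}
```

In a real kernel the shared-memory tile would be consumed by the mini-block multiplication instead of being written back to global memory.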

4.2.3.3. Efficient Atomic Operation (EAOP, 4.1.3)

  • Problem: In the 2-D partitioning for $Q \times K^T$ and $QK^T \times V$, multiple mini-block computations might write their results to the same location in the output matrix (e.g., overlapping result mini-blocks). This requires atomic operations to ensure correctness, which can introduce contention and degrade performance due to serialization.
  • Solution 1 (for $Q \times K^T$): Degenerate to 1-D partitioning.
    • If the query and key have the same shape, the MBIX for $Q \times K^T$ can be simplified to a 1-D partitioning.

    • In 1-D partitioning, each output mini-block corresponds to a unique pair of input mini-blocks, thus eliminating the need for atomic operations.

    • Empirical choice: a 64 × 32 block size for $Q \times K^T$ in this 1-D setup.

      The following are the results from Figure 10 of the original paper:

      Figure 10: 1-D partitioning for \(Q \\times K ^ { T }\) function. 该图像是一个示意图,展示了 QimesKTQ imes K^T 操作的 1-D 划分。左侧矩阵和中间绿色矩阵的乘法结果映射到右侧矩阵,同时下方给出了 Q、K 和 QK 的偏移量。

Figure 10: 1-D partitioning for Q×KTQ \times K ^ { T } function. Figure 10 shows 1-D partitioning for QxKTQ x K^T. The query (left) and key (middle) matrices are partitioned such that each row block of QQ multiplies with a column block of KTK^T. The output matrix (right) then has a unique target location for each block multiplication, avoiding write conflicts.

  • Solution 2 (for $QK^T \times V$): Reorganize offset generation.
    • For $QK^T \times V$, 2-D partitioning might still be necessary. To mitigate contention in atomic operations, the offset generation function is designed to enlarge the write interval between values that are destined for the same memory location.
    • By adjusting the order of offsets (e.g., [0,12,2,14,0,12,2,14] for q_offset and [0,0,8,8,2,2,10,10] for k_offset in an example), dependent writes are spaced out, reducing the likelihood of threads from the same warp or block contending for the same atomic lock simultaneously. This can be further improved by inserting loops in other dimensions, such as the batch dimension.
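
The need for atomics in the 2-D case can be sketched as follows (assumed kernel and parameter names, not the paper's code): each grid block adds its 32 × 32 partial product into the output tile recorded in the offset array, and spacing out offsets that target the same tile reduces how often these atomicAdd calls collide.

```cuda
// Illustrative accumulation step for 2-D partitioning: several mini-block
// products map to the same output tile, so partial results are combined with
// atomicAdd. Reordering the offset arrays spreads out writes to the same tile.
__global__ void accumulate_tile(const float* partial,    // one 32x32 partial product per grid block
                                float* out,              // output matrix in global memory
                                const int* res_offset,   // start of the target tile for each grid block
                                int ld) {                 // leading dimension of the output matrix
    const float* p = partial + blockIdx.x * 32 * 32;
    float* dst = out + res_offset[blockIdx.x];
    int r = threadIdx.y, c = threadIdx.x;                 // launch with blockDim = (32, 32)
    atomicAdd(&dst[r * ld + c], p[r * 32 + c]);
}
```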

4.2.4. Block-organized Strategy (4.2)

  • Problem: The word-accumulation strategy for MLP modules uses a dense arrangement (no padding between sequences), while the self-attention module (even with the fine-grained strategy) traditionally requires some form of padding for structured access, leading to incompatible data layouts and overhead from layout switches. The previous batch padding for self-attention is too inefficient as the fine-grained strategy makes much of that padding unnecessary.
  • Solution: Block Padding Layout: A novel data layout called block padding is introduced for the self-attention module.
    • Instead of padding all sequences to seq_len (batch padding), block padding only pads essential values required by the fine-grained strategy at the block grain.
    • If val_len cannot be perfectly divided by the mini-block size, only the blocks at the edge need padding. This significantly reduces the padded area compared to batch padding.
    • This block padding layout is still distinct from the densely-arranged layout for MLP, so data movement is still required, but the amount of data moved (especially padded data) is much less.
  • Customized Layout Switch Method:
    • The layout switch function handles conversion between dense arrangement (for MLP) and block padding layout (for self-attention).

    • It builds an index array to record the target offsets for block padding. Unlike FasterTransformer which builds indexes at the word grain, this method builds indexes at the block grain, making the index array much smaller and allowing each warp to move an entire block, potentially reducing overhead.

    • The reverse process (switching from block padding to dense) follows a similar block-grained indexing idea.

    • The offset generation function for block padding is straightforward, primarily involving accumulation.

      The following are the results from Figure 11 of the original paper:


Figure 11: The illustration of the data layout. Figure 11 contrasts different data layouts:

  • Dense Arrangement: All valid words from different sequences are packed contiguously, eliminating all padding between sequences. Ideal for MLP with the word-accumulation strategy.
  • Block Padding (Proposed): Only pads within blocks at the edges of valid sequence data to ensure block-grained alignment for the fine-grained strategy. It has significantly less padding than batch padding.
  • Batch Padding: All sequences in a batch are padded to the maximum seq_len, resulting in large amounts of redundant padded data.
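
A minimal host-side sketch of the block padding bookkeeping (assumed names; lengths are counted in words and the mini-block size defaults to 32): each sequence's length is rounded up to the mini-block size, so only its last block carries padding, and a simple accumulation yields the target offsets used by the block-grained layout switch.

```cuda
// Illustrative offset generation for the block padding layout: round each
// val_len up to a multiple of the mini-block size and accumulate. Batch
// padding would instead place sequence i at i * seq_len.
#include <vector>

std::vector<int> block_padding_offsets(const std::vector<int>& val_len, int block = 32) {
    std::vector<int> offset(val_len.size());
    int acc = 0;                                              // words written so far (valid + block padding)
    for (size_t i = 0; i < val_len.size(); ++i) {
        offset[i] = acc;                                      // where sequence i starts in the block-padded buffer
        acc += (val_len[i] + block - 1) / block * block;      // round up to the mini-block size
    }
    return offset;
}
```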

5. Experimental Setup

5.1. Datasets

The experiments utilized eight corpora from the GLUE (General Language Understanding Evaluation) benchmark [9], a widely used collection of datasets for evaluating NLP models. These datasets are chosen to represent diverse NLP tasks and varying sequence length characteristics.

The following are the results from Table 3 of the original paper:

Corpus   Iteration   Max      Mean    Seq_Len
CoLA     8544        231      40.7    256
SST-2    67328       268      53.5    512
WNLI     608         441      148.5   512
STS-B    5696        2576     116.2   1024
MNLI     391168      181871   174.1   1024
QQP      363840      1318     119.6   1536
QNLI     103136      9678     227.9   1536
RTE      2464        1418     320.0   1536
  • Corpus: The name of the dataset (e.g., CoLA, SST-2).

  • Iteration: The number of samples in the corpus.

  • Max: The maximum sequence length observed in the corpus (after removing outliers).

  • Mean: The average sequence length in the corpus.

  • Seq_Len: The preset maximum sequence length used for the Bert model, chosen to cover most sequences in the corpus.

    These datasets are effective for validating the method because they exhibit the heavy-tailed distribution of sequence lengths that the paper aims to optimize. They also cover a range of seq_len values and average lengths, allowing for evaluation across different levels of padding redundancy.

5.2. Evaluation Metrics

The primary evaluation metric used in this paper is latency reduction.

5.2.1. Latency

  • Conceptual Definition: In the context of computer systems and networks, latency refers to the time delay between the initiation of a process and the completion of that process. For deep learning inference, it quantifies the total time taken for a model to process a single input or a batch of inputs from start to finish. Lower latency is generally desirable, especially in real-time applications.
  • Mathematical Formula: $ \text{Latency} = T_{\text{end}} - T_{\text{start}} $
  • Symbol Explanation:
    • TendT_{\text{end}}: The timestamp when the inference process finishes.
    • TstartT_{\text{start}}: The timestamp when the inference process begins.

5.2.2. Latency Reduction

  • Conceptual Definition: Latency reduction measures the percentage decrease in inference time achieved by a new method compared to a baseline method. It directly quantifies the performance improvement.
  • Mathematical Formula: $ \text{Latency Reduction (\%)} = \left(1 - \frac{\text{Latency}_{\text{New Method}}}{\text{Latency}_{\text{Baseline}}}\right) \times 100\% $
  • Symbol Explanation:
    • LatencyNew Method\text{Latency}_{\text{New Method}}: The inference latency measured using the proposed unified solution.
    • LatencyBaseline\text{Latency}_{\text{Baseline}}: The inference latency measured using the baseline method (e.g., cublas for self-attention, FasterTransformer for overall model).

5.3. Baselines

The paper compares its proposed methods against the following baselines:

  • For Self-Attention Module (Fine-grained Strategy evaluation):
    • The cublas strategy: This refers to the standard implementation of the $Q \times K^T$ and $QK^T \times V$ functions using cublasGemmStridedBatchedEx from the cuBLAS library. This is the common approach adopted by existing popular frameworks like FasterTransformer and TurboTransformer for self-attention. Specific cuBLAS configurations (CUBLAS_GEMM_ALGO0 for $Q \times K^T$ and CUBLAS_GEMM_ALGO1 for $QK^T \times V$) are used, consistent with FasterTransformer.
  • For Data Layout Switching (Block-organized Strategy evaluation):
    • Fast: Represents the data layout switch implementation (between densely-arranged and batched-padding layouts) in FasterTransformer. This method builds a prefix sum array for valid words to reduce redundant data movement.
    • Turbo: Represents the data layout switch implementation in TurboTransformer. This method computes offsets within the kernel and includes conditional statements to check for redundancy.
  • For Entire Model Inference (Overall Unified Solution evaluation):
    • Fast (Eff-rebuild solution of FasterTransformer): This baseline represents the state-of-the-art inference framework that already eliminates redundant computation in MLP modules using the word-accumulation strategy and data layout switches. The paper's unified solution is applied by modifying this baseline (replacing cublasGemmStridedBatchedEx with fine-grained kernels, developing new offset generation functions, using block-organized padding, and a new data layout switch method).

      These baselines are representative because they reflect the widely used and highly optimized implementations in industrial-grade Transformer inference frameworks. Comparing against them effectively demonstrates the practical performance gains of the proposed unified solution.

5.4. Hardware and Software Setup

  • GPU: Tesla V100 (16GB memory).
  • CPU: Intel Xeon Silver 4208.
  • Operating System: Ubuntu 18.04.
  • Compiler: GCC 7.5.0.
  • CUDA Toolkit: CUDA 11.3.
  • Models: Bert-base and Bert-large configurations for the fine-grained strategy evaluation. Only Bert-base for other evaluations, as model scaling is linear.
  • Batch Sizes: Varied batch sizes (e.g., 4, 8, 16, 32) are tested.
  • Sequence Lengths (seq_len): Dynamically set based on the dataset's maximum length (after outlier removal), as shown in Table 3.
  • Reliability: Each case is repeated 3 times, and the average result is used.

6. Results & Analysis

6.1. Core Results Analysis

6.1.1. Evaluation of the Fine-grained Strategy (5.2)

This evaluation focuses on the latency of computing the $Q \times K^T$ and $QK^T \times V$ functions within the self-attention module. The fine-grained strategy is compared against the cublas implementation (using cublasGemmStridedBatchedEx). The index array building time for the fine-grained strategy is excluded from the reported kernel execution time, as it is a small overhead (approx. 5% of kernel time) and can be hidden in practical inference by being executed once per batch during the word embedding module stage.

The following are the results from Figure 12(a) of the original paper:


Figure 12(a): The average latency of computing the $Q \times K^T$ and $QK^T \times V$ functions in the Bert-base configuration.

The following are the results from Figure 12(b) of the original paper:

Figure 12: The Modular Evaluation of the fine-grained strategy. (Figure 12(b) shows bar charts for CoLA, SST-2, WNLI, and STS-B, comparing the latency of the cublas and FineGrained strategies in the Bert-large configuration at batch sizes 4, 8, 16, and 32.)

Figure 12(b): The average latency of computing the $Q \times K^T$ and $QK^T \times V$ functions in the Bert-large configuration.

  • Cublas Strategy:
    • Corpora with the same seq_len exhibit very similar latencies. This is expected because cublas processes all padded data, so performance is primarily dictated by the fixed seq_len rather than the actual val_len.
    • Latency grows severely as seq_len doubles, which is consistent with the quadratic relationship between seq_len and the computational amount of $Q \times K^T$ and $QK^T \times V$.
  • Fine-grained Strategy (Ours):
    • Latency primarily depends on the average length of the corpus, not the seq_len. This is a direct consequence of eliminating redundant computation on padded tokens.
    • Significant latency reduction is observed across most corpora. For example, it achieves up to 87.8% latency reduction on the QQP corpus and an average of 63.9% latency reduction across all 8 GLUE benchmark corpora for the self-attention module.
    • Negative Optimization: In some cases, such as WNLI, the fine-grained kernels can perform worse than cublas. This occurs when the average sequence length is relatively large, meaning less redundant computation to eliminate, and the overhead introduced by MBIX (index array building, irregular data access) outweighs the benefits of redundancy elimination.
    • Scaling: As shown in Figure 12(b) for Bert-large configuration, increasing the model size (Bert-large vs. Bert-base) increases latency for both strategies, but the ratio of latency reduction achieved by the fine-grained strategy remains stable. This indicates that the optimization scales well with model size. Increasing batch size generally improves performance for both, as it helps saturate hardware, but the gains gradually diminish.

6.1.2. Evaluation of the Block-organized Strategy (5.3)

This evaluation compares the overhead of data layout switching between densely-arranged and block padding (Ours) against FasterTransformer (Fast) and TurboTransformer (Turbo), both of which switch between densely-arranged and batched-padding layouts.

The following are the results from Figure 13 of the original paper:

Figure 13: The Modular Evaluation of the block-organized strategy. (Each chart compares the time, in milliseconds, required by Ours, Fast, and Turbo on CoLA, SST-2, WNLI, and STS-B at batch sizes 4, 8, 16, and 32.)

  • TurboTransformer (Turbo):
    • Generally shows the worst performance. This is attributed to its implementation computing offsets within the kernel and using conditional statements, which are less efficient.
    • Its switching time is similar for corpora with the same seq_len because its performance largely depends on the fixed seq_len (all sequences padded to seq_len), rather than the actual content.
    • Exception: For CoLA with batch size 4, Turbo performs best. This could be due to Turbo launching more GPU warps, allowing it to saturate the device with smaller workloads, and avoiding index building overhead present in Ours and Fast.
  • Ours (Block-organized) and FasterTransformer (Fast):
    • Their performance mainly depends on the average length of the corpus, not seq_len. This is because both methods aim to reduce redundant data movement by dealing with fewer invalid values.
    • Ours achieves slightly better performance in data layout switching. This is because FasterTransformer builds a more granular prefix sum index (at the word grain), leading to more serial computation and potentially less efficient warp utilization compared to Ours which uses block-grained indexing.
    • The switching time for Ours increases slowly with batch size for small workloads but doubles when batch size goes from 512 to 1024. This suggests that for smaller workloads, the GPU is not fully saturated, while for larger workloads, the GPU warps are sufficiently utilized, and the overhead scales more linearly.

6.1.3. Overall Evaluation of the Unified Solution (5.4)

This evaluation measures the total inference latency of the Bert-base model using the unified solution (Ours) compared to the Eff-rebuild solution implemented in FasterTransformer (Fast).

The following are the results from Figure 14 of the original paper:

Figure 14: The Overall Evaluation: the average latency of the Bert-base model inference. (The charts compare Fast and Ours on CoLA, SST-2, WNLI, and STS-B at batch sizes 4, 8, 16, and 32.)

  • Latency Improvement:
    • The unified solution (Ours) consistently shows latency improvement for most corpora compared to FasterTransformer (Fast).
    • The performance benefit is greater when more sequences in a corpus are concentrated in the short length area, maximizing the effect of redundant computation elimination.
    • Negative Optimization: Similar to the modular evaluation, WNLI shows negative optimization due to its relatively larger average sequence length, meaning less opportunity for redundancy elimination and potentially higher overhead from the fine-grained strategy.
  • Impact of seq_len: The unified solution performs better on corpora using larger seq_len. For example, QQP and QNLI (which have large seq_len values in Table 3) show over 70% latency reduction in the fine-grained strategy modular evaluation, leading to around 50% latency reduction in the overall evaluation. This is because for larger seq_len, the self-attention module's quadratic computation makes it a more dominant part of the total computation, thus its optimization has a greater impact on the overall model.
  • Average Reduction: On average, the unified solution achieves 28.1% latency reduction across the 8 GLUE benchmark corpora compared to FasterTransformer.
  • Hardware Efficiency Note: The paper notes that the fine-grained strategy itself is not as hardware-efficient as cublasGemmStridedBatchedEx in terms of raw GFLOPS, achieving only about 1/4 of its performance. The major performance gain comes from eliminating redundant computation, not from an intrinsic speedup of the valid computations. This suggests potential for further optimization in the fine-grained kernel implementation.

6.2. Ablation Studies / Parameter Analysis

The paper mentions specific empirical choices for parameters within the methodology section, which implicitly serve as a form of parameter analysis:

  • Mini-block size for 2-D partitioning: Empirically chosen as 32 × 32. This size is a trade-off between minimizing index array overhead and effectively adapting to irregular data structures to reduce redundant computation.

  • Mini-block size for 1-D partitioning ($Q \times K^T$): Empirically chosen as 64 × 32 to facilitate removing atomic operations.

    While a formal ablation study explicitly comparing different block sizes or detailing the impact of SMBT and EAOP in isolation isn't presented, the design choices reflect an understanding of parameter influence on performance and overhead. The negative optimization observed for WNLI also acts as an implicit ablation point, showing that when the benefit of redundancy elimination is low, the inherent overhead of the fine-grained approach can become dominant.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper provides a significant insight into the performance bottlenecks of Transformer-based model inference on GPUs stemming from heavy-tailed input sequence lengths. It rigorously demonstrates that redundant computation, especially within the self-attention module, is a major impediment to practical efficiency.

To address this, the authors propose a novel unified solution comprising three key strategies:

  1. A fine-grained strategy for the self-attention module that utilizes mini-block indexing, shared memory block transpose, and efficient atomic operations to eliminate a substantial amount of redundant computation by processing only valid data (a simplified sketch of the indexing idea follows this summary).

  2. The word-accumulation strategy for the MLP module to handle its redundancy (a common technique).

  3. A block-organized strategy for the entire model that introduces a block padding data layout for the self-attention module and a customized layout switch method. This strategy seamlessly integrates the first two strategies by minimizing data movement overhead between different data layouts.

    Through comprehensive experiments on the GLUE benchmark using a Tesla V100 GPU, the unified solution achieved an impressive average of 63.9% latency reduction in the self-attention module and an average of 28.1% latency reduction for the overall Bert-base model inference compared to the popular FasterTransformer framework. The solution proved particularly effective for corpora with larger seq_len and pronounced heavy-tailed distributions.
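
To illustrate the indexing idea behind the fine-grained strategy (a simplified host-side sketch, not the paper's actual kernel or data layout), the code below enumerates, for every sequence in a batch, the mini-block coordinates that overlap the val_len × val_len valid region of the attention score matrix; a GPU kernel would then launch work only for these indexed blocks instead of the full padded grid.

```python
# Hedged sketch of a mini-block index for valid-only block matrix multiplication.
# For each sequence, only mini-blocks that overlap the val_len x val_len region of
# the seq_len x seq_len attention score matrix are recorded; a kernel would loop
# over this index instead of the full padded grid. Names and layout are illustrative.
import math

def build_block_index(valid_lens, block=32):
    """Return a list of (sequence_id, block_row, block_col) for valid mini-blocks."""
    index = []
    for seq_id, val_len in enumerate(valid_lens):
        n = math.ceil(val_len / block)   # valid blocks per dimension for this sequence
        for br in range(n):
            for bc in range(n):
                index.append((seq_id, br, bc))
    return index

# Example: a heavy-tailed batch padded to seq_len = 128 (16 blocks per sequence per dim).
valid_lens = [24, 31, 18, 110]
index = build_block_index(valid_lens, block=32)
full_blocks = len(valid_lens) * (128 // 32) ** 2
print(f"valid blocks launched: {len(index)} of {full_blocks} in the padded grid")
```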

7.2. Limitations & Future Work

The authors themselves acknowledge a key limitation:

  • Hardware Efficiency of Fine-grained Strategy: The fine-grained strategy, while effective at eliminating redundancy, only achieves approximately 1/4 of the raw hardware performance (GFLOPS) compared to the highly optimized cublasGemmStridedBatchedEx for valid computations. This suggests that the custom kernels, while tailored for redundancy elimination, might not be as intrinsically efficient for the actual matrix multiplications as highly specialized vendor libraries.

    Potential future research directions implicitly suggested or arising from this work include:

  • Further Optimization of Fine-grained Kernels: Improving the hardware efficiency of the fine-grained kernels themselves, beyond just redundancy elimination, could yield even greater performance gains. This might involve more advanced GPU programming techniques or hardware-specific optimizations.

  • Dynamic Block Sizing: Investigating more dynamic or adaptive mini-block sizing strategies that can adjust based on runtime val_len characteristics to further optimize the trade-off between overhead and redundancy elimination.

  • Wider Model & Hardware Applicability: Extending the solution to other Transformer architectures (e.g., GPT-3, T5) and evaluating its performance on different GPU generations or other accelerators (e.g., TPUs, FPGAs).

  • Integration with Dynamic Batching: Combining the unified solution with dynamic batching techniques, where batch sizes are adapted based on incoming request patterns, could offer further throughput and latency improvements in real-world serving scenarios.

  • Memory Footprint Analysis: While block padding reduces peak memory demand compared to batch padding, a more detailed analysis of the memory footprint of index arrays and intermediate states for very large models or batches could be beneficial.

7.3. Personal Insights & Critique

This paper makes a crucial contribution by systematically addressing the redundant computation in Transformer inference that arises from heavy-tailed input, particularly in the self-attention module which was largely unoptimized by prior efforts. The clarity in identifying the quadratic redundancy and proposing a fine-grained block-level approach is commendable.

  • Inspiration: The work highlights the importance of deep, architecture-aware optimization. It's not enough to rely on general-purpose GEMM libraries; for specific bottlenecks like heavy-tailed inputs, custom kernels that intimately understand the data structure and computation flow are necessary. The integration of MBIX, SMBT, and EAOP showcases a sophisticated understanding of GPU architecture to translate theoretical efficiency gains into practical performance. The block-organized strategy is a clever way to harmonize disparate optimization techniques (word-accumulation and fine-grained) across the entire model, demonstrating a holistic system design approach.
  • Transferability: The core idea of fine-grained redundant computation elimination is highly transferable. Any deep learning model or computational graph with variable-sized inputs and operations sensitive to input dimensions (e.g., dynamic RNNs, graph neural networks with variable graph sizes) could potentially benefit from similar block-level indexing and execution strategies. The block padding concept could be generalized for other tensor operations that require structured access but deal with sparse or irregular valid data.
  • Potential Issues/Areas for Improvement:
    • Overhead Trade-off: The paper explicitly mentions that the raw hardware performance of the fine-grained kernels is lower than cuBLAS. This implies that for batches with very uniform lengths, or where the average length is close to seq_len (i.e., minimal redundancy), the fine-grained strategy may consistently perform worse. The negative optimization observed for WNLI underscores this. A dynamic selection mechanism that switches between the fine-grained path and cuBLAS based on runtime batch statistics could be beneficial (a hypothetical dispatcher is sketched after this list).
    • Generalization of Block Sizes: The empirical block size choices (32×32 and 64×32) are validated for the Bert-base/large configuration on a V100. These parameters might need careful tuning for different Transformer architectures, head sizes, hidden sizes, or GPU generations to maintain optimal performance. A more robust, potentially auto-tuning approach for block sizes could make the solution more general.
    • Complexity of Implementation: Developing and maintaining highly optimized custom CUDA kernels is complex and prone to errors. cuBLAS and TensorRT offer high reliability and ease of use. While the performance gains justify the effort, the increased complexity might be a barrier for widespread adoption or for researchers without deep GPU programming expertise. This might be addressed by integrating such optimizations directly into higher-level deep learning frameworks or compilers.
    • Real-world Serving Context: The paper focuses on inference latency. In a real-world serving system, other factors like throughput, memory consumption under high load, cold start latency, and dynamic batching strategies also play a role. While the solution improves latency, its interaction with these other aspects would be an interesting extension.
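
As one way to realize the dynamic selection suggested above, here is a minimal, hypothetical dispatcher sketch: it compares a crude cost estimate of the valid-only path (penalized by the roughly 1/4 relative efficiency reported for the fine-grained kernels) against the dense padded path and picks the cheaper one per batch. The cost model, the efficiency constant, and the function names are assumptions for illustration only, not part of the paper's solution.

```python
# Hypothetical dispatcher: choose between the fine-grained (valid-only) attention path
# and the dense batched-GEMM path based on how much of the padded work is redundant.
# The 0.25 relative efficiency of the fine-grained kernel (~1/4 of cuBLAS GFLOPS) and
# the simple cost model below are assumptions, not measurements from the paper.

FINE_GRAINED_RELATIVE_EFFICIENCY = 0.25  # assumed: fine-grained kernel reaches ~1/4 of cuBLAS

def choose_attention_path(valid_lens, seq_len):
    # Dense path cost: every sequence pays seq_len^2 score-matrix work.
    dense_cost = len(valid_lens) * seq_len ** 2
    # Fine-grained path cost: only val_len^2 work per sequence, but at lower efficiency.
    fine_cost = sum(v ** 2 for v in valid_lens) / FINE_GRAINED_RELATIVE_EFFICIENCY
    return "fine_grained" if fine_cost < dense_cost else "dense_cublas"

# Heavy-tailed batch: mostly short sequences -> valid-only path wins.
print(choose_attention_path([12, 20, 25, 18, 22, 15, 30, 120], seq_len=128))
# Uniformly long batch (little redundancy) -> dense cuBLAS path wins.
print(choose_attention_path([120, 124, 118, 126], seq_len=128))
```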
