PLAIN: Leveraging High Internal Bandwidth in PIM for Accelerating Large Language Model Inference via Mixed-Precision Quantization
TL;DR Summary
PLAIN is a novel software/hardware co-design framework for accelerating large language model inference through mixed-precision quantization. It optimizes parameter quantization and leverages PIM characteristics, achieving up to 5.03x and 1.69x performance improvements with negligible model quality loss.
Abstract
DRAM-based processing-in-memory (DRAM-PIM) has gained commercial prominence in recent years. However, its integration for deep learning acceleration, particularly for large language models (LLMs), poses inherent challenges. Existing DRAM-PIM systems are limited in computational capabilities, primarily supporting element-wise and general matrix-vector multiplication (GEMV) operations, which contribute only a small portion of the execution time in LLM workloads. As a result, current systems still require powerful host processors to manage compute-heavy operations. To address these challenges and expand the applicability of commodity DRAM-PIMs in accelerating LLMs, we introduce PLAIN, a novel software/hardware co-design framework for PIM-enabled systems. PLAIN leverages the distribution locality of parameters and the unique characteristics of PIM to achieve optimal trade-offs between inference cost and model quality. Our framework includes three key innovations: 1) firstly, we propose a novel quantization algorithm that determines the optimal precision of parameters within each layer, considering both algorithmic and hardware characteristics to optimize hardware mapping; 2) PLAIN strategically utilizes both GPUs and PIMs, leveraging the high internal memory bandwidth within HBM for attention layers and the powerful compute capability of conventional systems for fully connected (FC) layers; 3) PLAIN integrates a workload-aware dataflow scheduler that efficiently arranges complex computations and memory access for mixed-precision tensors, optimizing execution across different hardware components. Experiments show PLAIN outperforms the conventional GPU with the same memory parameters and the state-of-the-art PIM accelerator, achieving a 5.03× and 1.69× performance boost, with negligible model quality loss.
In-depth Reading
1. Bibliographic Information
1.1. Title
The central topic of the paper is "PLAIN: Leveraging High Internal Bandwidth in PIM for Accelerating Large Language Model Inference via Mixed-Precision Quantization." This title highlights a novel software/hardware co-design framework, PLAIN, which aims to improve the efficiency of Large Language Model (LLM) inference by utilizing the unique characteristics of Processing-in-Memory (PIM) architectures and applying mixed-precision quantization techniques.
1.2. Authors
The authors are Yiwei Hu, Fangxin Liu, Zongwu Wang, Yilong Zhao, Tao Yang, Li Jiang, and Haibing Guan. Their affiliations are not explicitly detailed in the provided text beyond an email address suffix suggesting "sjtu.edu.cn," which likely refers to Shanghai Jiao Tong University. Li Jiang is noted as a corresponding author. The research backgrounds appear to be in computer architecture, deep learning acceleration, and memory systems, focusing on optimizing hardware-software interactions for AI workloads.
1.3. Journal/Conference
The publication venue is not explicitly stated in the provided text. However, the nature of the research (software/hardware co-design, PIM, LLM acceleration) suggests it would typically be published in top-tier computer architecture conferences such as ASPLOS, ISCA, MICRO, or HPCA, or potentially a highly regarded AI systems conference.
1.4. Publication Year
The paper was published at (UTC): 2025-10-26T00:00:00.000Z. This indicates a future publication date, suggesting it might be an accepted paper for an upcoming conference or journal issue.
1.5. Abstract
The abstract introduces DRAM-based processing-in-memory (DRAM-PIM) as a commercially prominent technology facing challenges in accelerating Large Language Models (LLMs) due to limited computational capabilities (primarily supporting element-wise and GEMV operations). To address this, the paper proposes PLAIN, a novel software/hardware co-design framework. PLAIN aims to optimize LLM inference cost and model quality by leveraging parameter distribution locality and PIM's unique characteristics. Its three key innovations include: 1) a novel quantization algorithm that determines optimal precision per layer based on algorithmic and hardware characteristics; 2) strategic utilization of GPUs for compute-heavy Fully Connected (FC) layers and PIMs for bandwidth-intensive attention layers, leveraging HBM's high internal bandwidth; and 3) a workload-aware dataflow scheduler for efficient execution of mixed-precision tensors across heterogeneous hardware. Experimental results demonstrate that PLAIN achieves a 5.03× performance boost over a conventional GPU with similar memory parameters and a 1.69× boost over a state-of-the-art PIM accelerator, while maintaining negligible model quality loss.
1.6. Original Source Link
The original source link provided is /files/papers/69571ce38c5983e9f07b96e1/paper.pdf. This appears to be a local file path or an internal identifier within a larger system, rather than a publicly accessible URL. Its publication status is pending as the publication date is in the future.
2. Executive Summary
2.1. Background & Motivation
The paper addresses the critical challenges in deploying Large Language Models (LLMs), which are known for their massive parameter counts and immense resource demands.
- Core Problem: The primary hurdle in LLM deployment, especially during inference (token generation), is the "memory wall." This refers to the bottleneck caused by limited memory bandwidth between GPU compute units and DRAM, leading to low hardware utilization. For instance, GPT-3 requires 326 GB of FP16 memory, far exceeding the capacity of high-end GPUs like the A100 (80 GB). Even with significant HBM capacity and bandwidth, GPU compute unit utilization can fall below 1%. While the computational performance of accelerators has grown, memory capacity and bandwidth have not kept pace.
- Importance of the Problem: Efficient LLM inference is crucial for their widespread adoption and practical utility in various applications. The existing memory bottlenecks limit the size and complexity of deployable models, hindering advancements in AI capabilities and increasing the cost of running powerful LLMs.
- Challenges/Gaps in Prior Research:
  - Limited PIM Capabilities: Existing DRAM-PIM systems, while offering high internal memory bandwidth, are typically limited to simple operations like element-wise computations and General Matrix-Vector Multiplication (GEMV). These operations constitute only a small fraction of LLM workloads, leaving compute-heavy operations to powerful host processors. This limits the full potential of DRAM-PIM for LLM acceleration.
  - Quantization Limitations: While model quantization reduces memory footprint, current methods often face a trade-off between accuracy and computational resources. Mainstream hardware lacks native support for optimal bit-widths (e.g., 6-bit quantization), favoring less flexible 4-bit or 8-bit formats.
  - PIM-Quantization Integration: There is a lack of PIM-friendly frameworks that effectively balance hardware overhead with compression ratio, consider DRAM-PIM's unique architectural constraints in mixed-precision techniques, and translate theoretical quantization benefits into practical speedups (which is hampered by workload imbalances).
- Paper's Entry Point / Innovative Idea: The paper proposes PLAIN, a novel algorithm-architecture co-design framework. It aims to optimize performance by intelligently combining DRAM-PIM's high internal bandwidth with mixed-precision quantization. PLAIN leverages the distribution locality of parameters (intra-tensor value distributions and inter-tensor patterns) to adaptively quantize parts of the model, ensuring efficiency while preserving accuracy. It strategically offloads memory-bound operations to PIM and compute-bound operations to GPUs, creating a heterogeneous acceleration system.
2.2. Main Contributions / Findings
PLAIN makes several primary contributions to address the challenges of LLM inference:
- Novel Quantization Algorithm: PLAIN introduces a Locality-Aware Adaptive Quantization method. This algorithm determines the optimal precision (e.g., INT4 or INT8) for parameters within each layer, considering both algorithmic characteristics (like sensitivity to precision changes) and hardware characteristics to optimize mapping to the PIM architecture. This balances accuracy and compression without requiring expensive retraining.
- Heterogeneous Hardware Utilization: The framework strategically leverages GPUs and PIMs in a co-design approach. It exploits the high internal memory bandwidth of High-Bandwidth Memory (HBM) within PIM units for attention layers (which are often memory-bound in LLMs), while utilizing the powerful compute capabilities of conventional GPUs for Fully Connected (FC) layers (which are typically compute-bound). This maximizes the strengths of each component.
- Workload-Aware Dataflow Scheduler: PLAIN integrates a sophisticated workload-aware dataflow scheduler. This scheduler efficiently arranges complex computations and memory accesses for mixed-precision tensors across different hardware components. It utilizes bank-level parallelism and dynamic workload balancing (e.g., bit-wise splitting of INT8 tokens into INT4 for parallel processing) to ensure that the benefits of mixed-precision quantization are realized as practical speedups, minimizing stalls and maximizing resource utilization.
- Significant Performance and Energy Boost: Experimental results demonstrate substantial improvements. PLAIN achieves a speedup of up to 5.03× compared to conventional GPU-FP16 inference and a 1.69× performance boost over the state-of-the-art PIM accelerator, AttAcc. Furthermore, PLAIN significantly reduces energy consumption compared to both the GPU and AttAcc, indicating improved energy efficiency.
- Negligible Model Quality Loss: Despite aggressive quantization and performance optimizations, PLAIN maintains negligible model quality loss across various LLMs (GPT-2, OPT, LLaMA-2), demonstrating the effectiveness of its adaptive quantization strategy.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully grasp the innovations presented in PLAIN, a reader should understand several foundational concepts:
-
Large Language Models (LLMs):
- Conceptual Definition: LLMs are deep learning models, typically based on the Transformer architecture, that are trained on vast amounts of text data. They are designed to understand, generate, and process human language, performing tasks like text generation, translation, summarization, and question answering.
- Transformer Architecture: LLMs predominantly use the Transformer architecture, introduced by Vaswani et al. (2017). This architecture revolutionized sequence processing by relying entirely on attention mechanisms (particularly self-attention) to draw global dependencies between input and output, eschewing recurrent (RNN) or convolutional (CNN) layers. Many LLMs like GPT and LLaMA are built using stacked decoder blocks of the Transformer.
- Inference Stages: LLM inference, especially for generative tasks, typically involves two stages:
-
  - Prefill Stage: The initial input prompt (e.g., "I like playing") is processed. This stage often involves large matrix multiplications (GEMM) as the entire input sequence is processed to build a context.
  - Decoding Stage: The model generates tokens one by one autoregressively (e.g., "basketball", then "!"). In this stage, the query vector is typically small (roughly 1 × d per step), while the key and value matrices grow as more tokens are generated (n × d, where n is the sequence length). This stage is often characterized by General Matrix-Vector Multiplication (GEMV) operations and is heavily memory-bound due to frequent accesses to a growing KV cache.

The following figure (Figure 1 from the original paper) illustrates the computational workflow of decoder layers during inference: it shows the prefill and decoding stages of an LLM. In the prefill stage the input ['I', 'like', 'playing'] passes through self-attention and produces ['basketball']; the decoding stage then takes ['I', 'like', 'playing', 'basketball'] as input, again passes through self-attention and the feed-forward network, and finally outputs ['!'].
-
-
DRAM-based Processing-in-Memory (DRAM-PIM):
- Conceptual Definition: DRAM-PIM is a memory-centric computing paradigm where computational units are integrated directly within or very close to DRAM (Dynamic Random-Access Memory) memory banks. The core idea is to move computation closer to data, thereby reducing the need to transfer large amounts of data between the main processor (e.g., CPU or GPU) and external memory.
- Advantages:
  - Reduced Data Movement: The primary benefit is mitigating the "memory wall" bottleneck by performing operations where the data resides, significantly decreasing data transfer energy and latency.
  - Higher Internal Bandwidth: PIM architectures can leverage the extremely high internal bandwidth available within DRAM banks, which is often much greater than the external bandwidth between DRAM and the host processor.
- Limitations:
  - Limited Computational Capability: Commercial DRAM-PIM solutions (like Samsung HBM-PIM or SK Hynix GDDR6-AiM) often have relatively simple compute units, primarily supporting basic operations such as element-wise operations and GEMV. More complex operations (GEMM) or custom functions might still require the host processor.
  - Programming Complexity: Developing software that effectively utilizes PIM architectures can be challenging due to the need for explicit data placement and workload partitioning.
-
Quantization:
- Conceptual Definition: Quantization in Deep Neural Networks (DNNs) is a technique used to reduce the precision (bit-width) of weights and/or activations from high-precision floating-point formats (e.g., FP32, FP16) to lower-precision integer formats (e.g., INT8, INT4, INT1).
- Benefits:
  - Reduced Memory Footprint: Lower bit-widths mean less memory is required to store model parameters and intermediate activations, allowing larger models to fit into memory or enabling deployment on resource-constrained devices.
  - Improved Inference Speed: Operations on lower-precision integers can be significantly faster and more energy-efficient on specialized hardware (e.g., integer Tensor Cores on NVIDIA GPUs) or custom PIM units.
  - Reduced Data Movement: Smaller data sizes lead to less data transfer across memory hierarchies, further alleviating the memory wall.
- Trade-off: The main challenge is maintaining model accuracy. Aggressive quantization can lead to significant accuracy degradation if not carefully managed.
-
Mixed-Precision Quantization (MPQ):
- Conceptual Definition: MPQ is an advanced quantization technique where different layers, or even different parts within a tensor, are quantized to varying bit-widths (e.g., some layers to INT8, others to INT4).
- Purpose: It aims to find an optimal balance between accuracy and efficiency. Highly sensitive layers or values (e.g., outliers in activations) might retain higher precision to preserve accuracy, while less sensitive parts can be aggressively quantized to lower bit-widths for maximum compression and speedup.
-
Memory Wall:
- Conceptual Definition: The "memory wall" refers to the growing performance gap between processor speed and memory access speed. While processors have become exponentially faster, the rate at which data can be fetched from main memory has not kept pace. This bottleneck becomes particularly pronounced in data-intensive workloads like LLMs, where the processor frequently stalls waiting for data from memory, leading to low compute unit utilization.
-
Roofline Model:
-
- Conceptual Definition: The Roofline Model is a performance model that graphically illustrates the achievable performance of a computational kernel on a given hardware platform. It plots performance (typically in FLOPS) against arithmetic intensity (FLOPS per byte transferred). The "roofline" itself consists of two lines:
  - Memory Bandwidth Bound: A diagonal line representing the maximum performance limited by the memory bandwidth.
  - Compute Bound: A horizontal line representing the maximum performance limited by the processor's peak floating-point operation rate (FLOPS).
- Purpose: By plotting a workload on the roofline chart, one can easily identify whether the workload is memory-bound (falling under the bandwidth-limited diagonal) or compute-bound (hitting the FLOPS-limited horizontal roof) and understand which hardware resource is the bottleneck (a small numeric example follows below).

The following figure (Figure 2 from the original paper) provides a roofline model analysis for the attention layers in LLaMA-2-7b, highlighting how workloads transition from compute-bound to memory-bound as output length increases. The chart plots performance against arithmetic intensity for the LLaMA-2-7B attention layers on an Nvidia V100 GPU and on HBM-PIM, marking the memory-bound and compute-bound regimes; the peak performance shown is 4.8 TFLOPS for HBM-PIM and 14 TFLOPS for the V100 GPU.
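To make the roofline intuition concrete, here is a minimal Python sketch that classifies a decoding-stage GEMV against two rooflines. The peak-performance values follow the figure (about 14 TFLOPS for the V100 and 4.8 TFLOPS for HBM-PIM), while the bandwidth numbers and the GEMV shape are illustrative assumptions, not measurements from the paper.

```python
# Minimal roofline sketch: classify a kernel as memory- or compute-bound.
# Peak FLOPS follow Figure 2 (V100 ~14 TFLOPS, HBM-PIM ~4.8 TFLOPS);
# the bandwidth values are illustrative assumptions.
def roofline(flops, bytes_moved, peak_flops, peak_bw):
    """Return attainable performance (FLOP/s) and the limiting resource."""
    intensity = flops / bytes_moved                  # FLOP per byte
    attainable = min(peak_flops, intensity * peak_bw)
    bound = "compute-bound" if intensity * peak_bw >= peak_flops else "memory-bound"
    return attainable, bound

# GEMV in the decoding stage: y = W @ x with W of shape (d, d), FP16 weights.
d = 4096
flops = 2 * d * d                     # one multiply + one add per weight
bytes_moved = 2 * d * d               # each FP16 weight read once (2 bytes)

for name, peak_flops, peak_bw in [
    ("V100 GPU (external HBM)", 14e12, 0.9e12),     # ~0.9 TB/s external bandwidth (assumed)
    ("HBM-PIM (internal banks)", 4.8e12, 6.0e12),   # much higher internal bandwidth (assumed)
]:
    perf, bound = roofline(flops, bytes_moved, peak_flops, peak_bw)
    print(f"{name}: {perf / 1e12:.2f} TFLOP/s attainable, {bound}")
```

With these assumed numbers, the same GEMV sits under the bandwidth roof on the GPU but reaches the compute roof on the PIM side, which is the behavior the figure illustrates.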
-
3.2. Previous Works
The paper builds upon and distinguishes itself from several key areas of prior research:
-
Transformer Architecture (Vaswani et al. [1]):
- Background: The Transformer is the backbone of modern LLMs. It introduced the self-attention mechanism as a replacement for recurrent and convolutional layers, enabling parallel processing of sequences and capturing long-range dependencies efficiently.
- Core Formula for Attention: The fundamental attention mechanism is crucial. It calculates a weighted sum of value vectors, where the weights are determined by the similarity between query and key vectors (a minimal NumPy sketch follows this list).
$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$
  - $Q$: The Query matrix, representing the current token(s) for which context is being computed. Its shape is typically $n_q \times d_k$, where $n_q$ is the number of queries and $d_k$ is the dimension of the key vectors.
  - $K$: The Key matrix, representing all available contextual tokens. Its shape is typically $n_k \times d_k$, where $n_k$ is the number of keys.
  - $V$: The Value matrix, containing the actual information to be aggregated. Its shape is typically $n_k \times d_v$, where $d_v$ is the dimension of the value vectors.
  - $QK^T$: The dot product between Queries and Keys, measuring the compatibility (attention scores) between each query and all keys.
  - $\sqrt{d_k}$: A scaling factor to prevent the dot products from becoming too large, which can push the softmax function into regions with tiny gradients.
  - $\mathrm{softmax}$: A function that normalizes the attention scores into a probability distribution, ensuring the weights sum to 1.
  - Output: A matrix of shape $n_q \times d_v$, representing the contextually enriched queries.
- Role in PLAIN: PLAIN focuses on accelerating Transformer inference, specifically targeting the attention layers due to their memory-bound nature in the decoding stage.
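For readers who prefer code, the following is a minimal NumPy sketch of the scaled dot-product attention defined above, using decoding-stage shapes (a single query row against a cached K/V of n tokens). It is an illustration of the formula, not the paper's PIM kernel.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (n_q, n_k) attention scores
    return softmax(scores, axis=-1) @ V        # (n_q, d_v) context vectors

# Decoding-stage shapes: one query row against a growing KV cache.
rng = np.random.default_rng(0)
n_kv, d = 128, 64                              # cached tokens, head dimension
q = rng.standard_normal((1, d))                # 1 x d query
K = rng.standard_normal((n_kv, d))             # n x d keys
V = rng.standard_normal((n_kv, d))             # n x d values
print(attention(q, K, V).shape)                # (1, 64)
```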
-
Commercial DRAM-PIM Solutions (Samsung HBM-PIM [22], SK Hynix GDDR6-AiM [25]):
- Background: These are real-world implementations of the PIM concept, integrating basic compute capabilities into HBM or GDDR6 memory modules. They demonstrate the viability of PIM for accelerating memory-bound AI workloads.
- Limitations (addressed by PLAIN): As mentioned in the abstract, these systems primarily support element-wise and GEMV operations. While useful for some memory-intensive tasks, they are insufficient for the diverse and compute-heavy operations of LLMs, especially GEMM in the prefill stage or complex attention computations. They typically rely on host processors for the majority of the LLM workload.
-
Model Quantization Techniques:
- HAQ [27]: Uses reinforcement learning to search for optimal bit-widths per layer, considering hardware metrics like latency and energy. This is a complex, hardware-aware approach to MPQ.
- LLM.int8() [28]: A specific quantization method targeting LLMs. It addresses the issue of outliers (activations with very large magnitudes) by isolating outlier dimensions into FP16 while quantizing the majority of values to INT8 using a vector-wise approach.
- SmoothQuant [9]: A post-training quantization technique designed for LLMs. It tackles activation outliers by "smoothing" them. This is achieved by remapping activation outliers to weights through per-channel scaling, making both weights and activations more amenable to low-bit quantization without significant accuracy loss. PLAIN explicitly builds upon SmoothQuant for its Quantization Granularity strategy.
-
AttAcc [8]:
- Background: AttAcc is a state-of-the-art heterogeneous PIM system specifically designed for batched Transformer-based generative model inference. It offloads multi-head attention during the decoding phase to PIM units, while the GPU handles the prefill stage and QKV generation.
- Role as Baseline: PLAIN uses AttAcc as a key baseline to demonstrate its performance improvements. PLAIN aims to surpass AttAcc by offering more comprehensive PIM utilization, a more advanced quantization scheme, and better workload balancing. The paper implies that AttAcc might still face challenges in efficiently handling the full complexity of LLM attention operations, particularly with mixed precision.
3.3. Technological Evolution
The field of AI acceleration has evolved significantly, driven by the increasing computational demands of deep learning.
-
- Early Deep Learning (2010s): Initial acceleration focused on GPUs (e.g., NVIDIA CUDA) for their parallel processing capabilities, effectively handling the dense matrix multiplications fundamental to Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs).
- Transformer Era (2017 onwards): With the advent of the Transformer, LLMs grew exponentially in size. This exposed the memory wall problem more acutely. While GPUs continued to improve (e.g., NVIDIA A100 and H100 with Tensor Cores), memory capacity and bandwidth became persistent bottlenecks.
- Quantization for Efficiency (Late 2010s-Present): To combat memory and computational demands, quantization emerged. Initial efforts focused on FP16 or INT8 quantization for general DNNs. Later, mixed-precision and post-training quantization (like SmoothQuant and LLM.int8()) specifically for LLMs became crucial to maintain accuracy with very low bit-widths.
- Processing-in-Memory (PIM) Resurgence (Early 2020s): As the memory wall became more critical, PIM solutions, which had been a research topic for decades, saw commercial viability (e.g., Samsung HBM-PIM). These aimed to tackle memory-bound workloads by integrating computation directly into memory.
- Heterogeneous Systems and Co-design (Present): The current frontier involves integrating these technologies. Recognizing that no single architecture is optimal for all LLM operations, heterogeneous systems (e.g., GPU + PIM) and hardware-software co-design frameworks (like PLAIN) are emerging. These seek to intelligently partition workloads and optimize algorithms to fully exploit the unique strengths of different hardware components, addressing the limitations of individual technologies.

PLAIN fits into this evolution by pushing the boundaries of heterogeneous acceleration for LLMs. It specifically targets the integration of advanced mixed-precision quantization with DRAM-PIM, moving beyond simple GEMV offloading to a more comprehensive acceleration of memory-bound attention layers.
3.4. Differentiation Analysis
Compared to the main methods in related work, PLAIN offers several core differentiators and innovations:
-
- PIM Utilization Scope:
  - Conventional PIMs (e.g., Samsung HBM-PIM): These are limited to element-wise and GEMV operations, leaving the bulk of LLM computation to host GPUs.
  - AttAcc: Improves on conventional PIM by offloading multi-head attention during the decoding stage to PIM.
  - PLAIN's Innovation: PLAIN goes further by not only offloading attention layers but also by fully integrating mixed-precision quantization within the PIM architecture. It proposes specific Bank-PIM units for QKV projection, attention score, and attention context calculations, which involve a mix of GEMM and GEMV operations, and a Stack-PIM for quantization/dequantization and softmax. This broader and deeper integration allows PIM to handle more complex parts of the attention mechanism across both prefill and decoding stages.
-
- Quantization Strategy:
  - Traditional Quantization (e.g., W8A8): Often a fixed bit-width, which can lead to significant accuracy loss for LLMs with prominent activation outliers, or miss opportunities for higher compression.
  - SmoothQuant: Primarily focuses on INT8 quantization by smoothing outliers.
  - LLM.int8(): Uses vector-wise quantization with FP16 outliers.
  - PLAIN's Innovation: PLAIN proposes a novel Locality-Aware Adaptive Quantization algorithm (MixQ-PIM) that determines the optimal precision (INT4 or INT8) at a fine-grained token-wise (for activations) and channel-wise (for weights) granularity. This is driven by an entropy-based heuristic combined with a hardware-aware cost function. Unlike methods that rely purely on algorithmic metrics or reinforcement learning, PLAIN explicitly considers hardware mapping and overhead, leading to a better balance of accuracy and efficiency tailored for PIM. It also supports 6-bit quantization, which is often optimal for LLMs but lacks native hardware support, by effectively mapping it to 4-bit operations.
-
- Software/Hardware Co-design & Workload Management:
  - Existing Frameworks: Many quantization frameworks are not designed with DRAM-PIM constraints in mind, leading to suboptimal performance or an inability to fully leverage PIM's capabilities.
  - PLAIN's Innovation: PLAIN is a holistic software/hardware co-design. It introduces a sophisticated workload-aware dataflow scheduler and overlapping mechanisms (weight loading, quantization/communication, softmax/communication).
    - Bit-wise splitting: PLAIN's scheduler cleverly splits INT8 tokens into two INT4 segments for uniform processing by 4-bit multipliers across PIMs, addressing the challenge of mixed-precision load balancing without requiring variable-bit-width PIM units.
    - Schedule Table: It uses a Bank-PIM schedule table to manage bank-level parallelism and minimize conflicts.
    - Overlap: Its overlapping strategies hide the latency of communication, quantization, and softmax by parallelizing them with computation, which is crucial for achieving practical speedups in a heterogeneous system. This level of fine-grained, hardware-aware scheduling and overlapping is a key differentiator for PLAIN.

In essence, PLAIN differentiates itself by offering a more deeply integrated and optimized solution for LLM inference on PIM. It provides a specialized quantization algorithm that is hardware-aware, a more comprehensive partitioning of Transformer operations between GPU and PIM, and an intelligent scheduler that maximizes the utilization of PIM's internal bandwidth by unifying mixed-precision computations and overlapping various overheads.
4. Methodology
4.1. Principles
The core idea of PLAIN is an algorithm-architecture co-design that optimizes Large Language Model (LLM) inference by leveraging the unique characteristics of DRAM-based Processing-in-Memory (DRAM-PIM) and mixed-precision quantization. The theoretical basis and intuition behind PLAIN are rooted in two observations:
- Memory Wall Bottleneck: LLM inference, especially during the decoding phase, is often memory-bound, meaning performance is limited by memory bandwidth rather than computational throughput. DRAM-PIM architectures inherently offer significantly higher internal memory bandwidth compared to external memory interfaces.
- Quantization for Efficiency and Outliers: Quantization reduces memory footprint and computational cost by lowering data precision. However, LLMs are sensitive to quantization, particularly due to activation outliers. Mixed-precision quantization can mitigate accuracy loss by adaptively assigning different bit-widths, but it must be carefully designed to align with hardware capabilities for practical speedups.

PLAIN's approach is to:
- Locality-Aware Quantization: Exploit the observation that not all parts of an LLM are equally sensitive to precision reduction. By using a locality-aware adaptive quantization algorithm, PLAIN can assign lower precision to less critical components (based on entropy) while retaining higher precision for sensitive ones, without expensive retraining. This is hardware-aware to ensure efficient mapping.
- Heterogeneous Workload Partitioning: Identify that attention layers are primarily memory-bound (benefiting from PIM's high bandwidth), while Fully Connected (FC) layers are often compute-bound (benefiting from the GPU's powerful compute capabilities). PLAIN strategically offloads attention-layer computation to PIM and keeps FC-layer computation on the GPU, maximizing the strengths of each.
- Workload Balancing and Overlapping: Address the challenges introduced by mixed-precision computations and heterogeneous execution. A sophisticated workload-aware dataflow scheduler ensures balanced utilization of PIM units, even with varying bit-widths, and employs overlapping techniques to hide the latency of data movement, quantization, and softmax operations.

By combining these principles, PLAIN aims to achieve optimal trade-offs between inference cost and model quality, overcoming the limitations of both conventional GPU systems and existing DRAM-PIM solutions.
4.2. Core Methodology In-depth (Layer by Layer)
PLAIN's methodology comprises a MixQ-PIM algorithm for quantization and a dedicated hardware architecture (PLAIN Hardware Architecture) with a sophisticated Schedule Design and Overlapping mechanisms.
4.2.1. MixQ-PIM Algorithm
The MixQ-PIM algorithm is designed to enable DRAM-PIM-friendly quantization, balancing accuracy and hardware efficiency.
4.2.1.1. Quantization Granularity
To address the challenge of outliers in LLM activations, which typically hinder low-bit quantization, PLAIN employs SmoothQuant [9]. SmoothQuant works by remapping activation outliers to weights via per-channel scaling. This process results in more balanced and compressible distributions for both activations and weights.
The transformation achieved by SmoothQuant can be expressed as:
$
\mathbf{Y} = (\mathbf{X}\,\mathrm{diag}(\mathbf{s})^{-1}) \cdot (\mathrm{diag}(\mathbf{s})\,\mathbf{W}) = \hat{\mathbf{X}}\hat{\mathbf{W}}
$
- $\mathbf{Y}$: The output of the layer, typically a matrix.
- $\mathbf{X}$: The input activation tensor (before smoothing).
- $\mathbf{W}$: The weight tensor (before smoothing).
- $\mathbf{s}$: A per-channel scaling-factor vector. The $\mathrm{diag}(\mathbf{s})$ operation creates a diagonal matrix with the elements of $\mathbf{s}$ on the diagonal.
- $\mathrm{diag}(\mathbf{s})^{-1}$: The inverse of the scaling-factor diagonal matrix, applied to $\mathbf{X}$. This effectively "descales" (divides) the activation values.
- $\mathrm{diag}(\mathbf{s})$: The scaling-factor matrix applied to $\mathbf{W}$. This "scales" (multiplies) the weight values.
- $\hat{\mathbf{X}}$: The "smoothed" activation tensor after descaling by $\mathbf{s}$.
- $\hat{\mathbf{W}}$: The "smoothed" weight tensor after scaling by $\mathbf{s}$.

This equation shows that the scaling factors are moved from the activations to the weights. This operation makes the distributions of $\hat{\mathbf{X}}$ and $\hat{\mathbf{W}}$ more amenable to low-bit quantization by reducing the dynamic range of activations without altering the final output $\mathbf{Y}$. The paper states that this smoothing process reorganizes low-error elements into distinct channels, which makes mixed-precision quantization more hardware-efficient.
PLAIN then applies two specific granularity schemes for quantization that align with DRAM-PIM's dataflow:
-
Token-wise granularity for activations: Each input token's activations are quantized independently. This is suitable for the sequential nature of token generation in LLMs.
-
Channel-wise granularity for weights: Weights are quantized based on their output channels. This allows for fine-grained control over weight precision and is compatible with how weights are typically processed in matrix multiplications.
The following figure (Figure 3 from the original paper) illustrates the effect of SmoothQuant on quantization error:
This figure is a heat map of the INT8 quantization error of the LLaMA-2-7B weights before and after applying SmoothQuant. The left panel shows the quantization error without smoothing; the right panel shows it after smoothing. After smoothing, the quantization error becomes more concentrated, indicating that lower precision is acceptable in the regions with small error.
As seen in the heat map, SmoothQuant localizes quantization errors and redistributes the value range, allowing for more flexible precision assignment.
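The smoothing identity above can be illustrated with a few lines of NumPy. The per-channel scaling factor below uses a max-based rule with migration strength α = 0.5, which follows the general SmoothQuant recipe but is an assumption here; the point of the sketch is that the product is unchanged while the activation range shrinks.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 16))        # activations: (tokens, in_channels)
W = rng.standard_normal((16, 32))       # weights:     (in_channels, out_channels)
X[:, 3] *= 50.0                         # inject an activation outlier channel

# Per-channel smoothing factor (SmoothQuant-style, alpha = 0.5 assumed).
alpha = 0.5
s = (np.abs(X).max(axis=0) ** alpha) / (np.abs(W).max(axis=1) ** (1 - alpha))

X_hat = X / s                           # X diag(s)^-1 : outlier channel shrinks
W_hat = W * s[:, None]                  # diag(s) W    : scale folded into weights

# The product is mathematically unchanged, but X_hat has a much smaller
# dynamic range, which makes low-bit token-wise quantization easier.
assert np.allclose(X @ W, X_hat @ W_hat)
print(np.abs(X).max(), "->", np.abs(X_hat).max())
```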
4.2.1.2. Quantization Configuration Searching
To determine the optimal bit-width (either INT4 or INT8) for different parts of the model, PLAIN employs a lightweight, entropy-based heuristic rather than computationally expensive retraining or reinforcement learning. This heuristic considers both quantization error and hardware cost.
a) Weight Precision Entropy:
The quantization error for weights is quantified using KL-divergence (Kullback-Leibler divergence), which measures how one probability distribution diverges from a second, expected probability distribution. In this context, it measures the information loss when quantizing:
$
\mathrm{Entropy}_{i} = \mathcal{D}_{\mathrm{KL}}\left(\mathbf{W}_{i}^{\mathrm{FP}} \parallel \mathbf{W}_{i}^{\mathrm{INT}}\right)
$
- $\mathrm{Entropy}_{i}$: The KL-divergence for the $i$-th channel-wise block of weights.
- $\mathcal{D}_{\mathrm{KL}}$: The Kullback-Leibler divergence function.
- $\mathbf{W}_{i}^{\mathrm{FP}}$: The original, high-precision (e.g., FP16) distribution of the $i$-th weight block.
- $\mathbf{W}_{i}^{\mathrm{INT}}$: The quantized (e.g., INT4 or INT8) distribution of the $i$-th weight block.
- $i$: Denotes a specific channel-wise block of weights. Weights are partitioned into these blocks to provide sufficient statistical support for the entropy calculation.

This approach allows for fine-grained mixed precision for weights without retraining (a computational sketch follows).
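A possible way to compute this per-block entropy is sketched below: each channel-wise block is fake-quantized and the KL term is estimated from histograms of the original and quantized values. The symmetric quantizer and the histogram binning are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def quantize_sym(w, bits):
    """Symmetric uniform quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax + 1e-12
    return np.round(w / scale).clip(-qmax, qmax) * scale

def kl_divergence(p_samples, q_samples, bins=128):
    """KL(P || Q) between histogram densities of two sample sets."""
    lo = min(p_samples.min(), q_samples.min())
    hi = max(p_samples.max(), q_samples.max())
    p, _ = np.histogram(p_samples, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(q_samples, bins=bins, range=(lo, hi), density=True)
    p, q = p + 1e-10, q + 1e-10                  # avoid log(0)
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024)).astype(np.float32)

# Channel-wise blocks: score each block's information loss at INT4 and INT8.
for i, block in enumerate(np.array_split(W, 4, axis=0)):
    ent4 = kl_divergence(block.ravel(), quantize_sym(block, 4).ravel())
    ent8 = kl_divergence(block.ravel(), quantize_sym(block, 8).ravel())
    print(f"block {i}: Entropy_INT4={ent4:.4f}  Entropy_INT8={ent8:.4f}")
```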
b) Simplified Activation Entropy:
Runtime distribution fitting for activations is impractical. Therefore, a simplified proxy based on data-range scaling is introduced for activations:
$
\mathrm{SimpleEntropy}_{t} = \mathrm{sigmoid}\left(|\mathbf{A}^{\mathrm{FP}}|\right) \times S_{t}
$
$
S_{t} = \frac{\max(\mathbf{A}_{t}^{\mathrm{FP}}) - \min(\mathbf{A}_{t}^{\mathrm{FP}})}{\max(\mathbf{A}_{t}^{\mathrm{INT}}) - \min(\mathbf{A}_{t}^{\mathrm{INT}})}
$
- $\mathrm{SimpleEntropy}_{t}$: The simplified entropy proxy for the $t$-th token's activations.
- $\mathrm{sigmoid}$: The sigmoid activation function, which squashes values between 0 and 1.
- $|\mathbf{A}^{\mathrm{FP}}|$: The absolute average value of the FP16 activation tensor (before quantization).
- $S_{t}$: An affine factor that scales the data range.
- $\max(\mathbf{A}_{t}^{\mathrm{FP}})$ and $\min(\mathbf{A}_{t}^{\mathrm{FP}})$: The maximum and minimum values of the FP16 activation tensor for the $t$-th token.
- $\max(\mathbf{A}_{t}^{\mathrm{INT}})$ and $\min(\mathbf{A}_{t}^{\mathrm{INT}})$: The maximum and minimum values of the quantized activation tensor for the $t$-th token (e.g., INT4 or INT8).
- $t$: Denotes the token index, as activations use token-wise granularity.

This simplified proxy aims to capture the sensitivity of activations to quantization by considering their dynamic range. The max/min operation can be efficiently supported by hardware during result collection (a small sketch follows).
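A small sketch of this proxy is shown below. The fake quantization used to obtain the INT range, and the reading of the absolute average as the mean of absolute values, are simplifying assumptions; the paper only requires the max/min statistics of the FP and quantized tensors.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def simple_entropy(a_fp, bits):
    """Token-wise proxy: sigmoid(|A_FP|) * range ratio S_t (assumed reading)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(a_fp).max() / qmax + 1e-12
    a_int = np.round(a_fp / scale).clip(-qmax, qmax) * scale   # fake-quantized token
    s_t = (a_fp.max() - a_fp.min()) / (a_int.max() - a_int.min() + 1e-12)
    abs_avg = np.abs(a_fp).mean()          # |A_FP|: absolute average (one possible reading)
    return sigmoid(abs_avg) * s_t

rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 4096))
tokens[1] *= 20.0                          # one token with a wide dynamic range

for t, a in enumerate(tokens):
    print(f"token {t}: proxy@INT4={simple_entropy(a, 4):.4f} "
          f"proxy@INT8={simple_entropy(a, 8):.4f}")
```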
c) Hardware-aware Configuration Searching:
To incorporate the hardware cost of different bit-widths, a byte-level cost function is defined:
$
\mathcal{C} = \mathcal{C}_{\mathbf{W}} + \mathcal{C}_{\mathbf{A}} = \mathrm{N}_{\mathbf{W}} \times \mathbf{B}_{\mathbf{W}}^{\mathrm{INT}} + \mathrm{N}_{\mathbf{A}} \times \mathbf{B}_{\mathbf{A}}^{\mathrm{INT}}
$
- $\mathcal{C}$: The total byte-level cost.
- $\mathcal{C}_{\mathbf{W}}$: The cost associated with weights.
- $\mathcal{C}_{\mathbf{A}}$: The cost associated with activations.
- $\mathrm{N}_{\mathbf{W}}$: The size of the weight tensor (e.g., number of parameters).
- $\mathbf{B}_{\mathbf{W}}^{\mathrm{INT}}$: The chosen bit-width for weights after quantization (e.g., 4 bits or 8 bits).
- $\mathrm{N}_{\mathbf{A}}$: The size of the activation tensor.
- $\mathbf{B}_{\mathbf{A}}^{\mathrm{INT}}$: The chosen bit-width for activations after quantization.

The paper notes that weights can be loaded offline to PIMs, so their cost is primarily for calculation. Activations, however, often require movement between different dies during inter-HBM communication, incurring significant communication cost. To reflect this, a square term is added to the activation cost (though not explicitly shown in the provided snippet, it is mentioned in the text).
The final bit-width for each component (weights and activations) is selected by minimizing a unified loss function:
$
\mathcal{L}_{\mathrm{MixQ}}^{\mathrm{INT}} = \mathrm{Entropy} - \varsigma\,\mathcal{C}
$
- $\mathcal{L}_{\mathrm{MixQ}}^{\mathrm{INT}}$: The final loss function to be minimized for integer quantization.
- $\mathrm{Entropy}$: Refers to either $\mathrm{Entropy}_{i}$ for weights or $\mathrm{SimpleEntropy}_{t}$ for activations, depending on the tensor type.
- $\varsigma$: A scaling factor that adjusts the relative magnitude and importance between the entropy (accuracy loss) and the hardware cost.

The algorithm compares the loss for INT4 and INT8 (and potentially INT6, which is mapped to 4-bit operations) to choose the optimal precision for each component (a selection sketch follows).
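The selection step can be summarized in a few lines: evaluate the loss at each candidate bit-width and keep the minimum. The sign convention follows the formula as written above, and the entropy values and ς below are placeholders, not numbers from the paper.

```python
# Hedged sketch of the hardware-aware precision search: for each tensor we
# evaluate L = Entropy - zeta * C (as written in the text) at INT4 and INT8
# and keep the bit-width with the smaller loss.
def byte_cost(num_elems, bits):
    return num_elems * bits / 8            # C = N * B_INT, in bytes

def choose_bitwidth(entropy_by_bits, num_elems, zeta=1e-6):
    losses = {bits: ent - zeta * byte_cost(num_elems, bits)
              for bits, ent in entropy_by_bits.items()}
    return min(losses, key=losses.get), losses

# Example: a weight block whose INT4 error is much larger than its INT8 error.
bits, losses = choose_bitwidth({4: 0.31, 8: 0.02}, num_elems=4096 * 4096)
print(f"selected INT{bits}, losses={losses}")
```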
4.2.1.3. Inference Process in PLAIN
The inference process in PLAIN, as shown in Figure 4, integrates the MixQ-PIM algorithm with the specialized hardware. Let's trace the processing of two tokens (token0 and token1):
- Weight Loading: Each PIM unit (specifically, the Bank-PIM units described in the hardware section) first loads the weights of the current LLM layer. These weights are often pre-loaded or streamed efficiently.
- Token Quantization & Dispatch: Incoming tokens are quantized using mixed precision. For example, token0 might be quantized to INT4 (requiring less computation), while token1 is quantized to INT8 (requiring more computation due to higher precision). To balance the workload across the PIM units, token1 (the INT8 token) is split and dispatched to multiple PIM groups (e.g., three groups in the figure).
- Computation in Bank-PIMs:
  - The split tokens and weights are processed by the Bank-PIMs. These units contain 4-bit multipliers and adder trees.
  - The computations, primarily GEMM/GEMV operations, generate INT32 outputs. This INT32 precision is crucial for safely supporting accumulation and partial-sum fusion without overflow during intermediate calculations.
- QKV Projection: The query (Q), key (K), and value (V) matrices are then generated by multiplying the input tokens with their respective weight matrices. These matrices are quantized separately. The value matrix is quantized along a different dimension due to the GEMM layout, optimizing for subsequent operations.
- Attention Score Calculation: The Q and K matrices are loaded into PIM for activation computation (calculating $QK^T$). The results of this dot product form the attention scores.
- Softmax Operation: After the attention score computation, the softmax operation is performed on the Stack-PIM (described below). This normalizes the scores.
- Attention Context Calculation: The softmax results (normalized attention scores) are then multiplied with the V (value) matrix to compute the attention context.
- Dequantization: Finally, a dequantization step is performed to convert the accumulated INT32 results back to FP16 or a suitable higher precision for further processing or output.
- Iterative Process: Each attention layer typically undergoes three Quant-Dequant stages (QKV projection, attention score, attention context). PLAIN is designed to support these operations with minimal overhead (see the sketch below).

The following figure (Figure 4 from the original paper) depicts this execution dataflow: it illustrates the weight-activation computation (QKV projection) and the activation-activation computation (the QK matrix multiplication), showing multiple Bank-PIMs and their dataflow connections, including the quantization steps, tokens, buffers, and the structure of the composite computation.
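The following NumPy sketch walks through the three Quant-Dequant stages in order (QKV projection, attention score with softmax, attention context) using fake quantization. The quantization axes loosely mirror the token-wise/channel-wise granularity described above; all numerical details are illustrative assumptions rather than PLAIN's hardware behavior.

```python
import numpy as np

def fake_quant(x, bits, axis):
    """Symmetric fake quantization with per-row/column scaling factors."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=axis, keepdims=True) / qmax + 1e-12
    return np.round(x / scale).clip(-qmax, qmax) * scale

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d = 16, 64
tokens = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

# Stage 1: QKV projection (token x weight), token-wise activation quantization.
x_q = fake_quant(tokens, bits=8, axis=1)
Q, K, V = x_q @ fake_quant(Wq, 8, 0), x_q @ fake_quant(Wk, 8, 0), x_q @ fake_quant(Wv, 8, 0)

# Stage 2: attention score (query x key), then softmax on the Stack-PIM side.
scores = softmax(fake_quant(Q, 8, 1) @ fake_quant(K, 8, 1).T / np.sqrt(d))

# Stage 3: attention context (score x value); V is quantized along the other axis.
context = fake_quant(scores, 8, 1) @ fake_quant(V, 8, 0)
print(context.shape)   # (16, 64) attention output, dequantized back to FP
```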
4.2.2. PLAIN Hardware Architecture
PLAIN is a heterogeneous system integrated into the HBM (High-Bandwidth Memory) stack, co-existing with an XPU (e.g., GPU or NPU).
4.2.2.1. Architecture Overview
The PLAIN architecture consists of a host CPU, an XPU, multiple HBM stacks, and PLAIN-enabled memory.
- XPU & PIM Interaction: When the XPU processes non-attention layers (typically Fully Connected layers), PLAIN functions as conventional HBM memory, minimizing communication overhead. Intermediate results are written back to memory, allowing the GPU to overlap computation with memory operations and enhance request-level parallelism. For attention layers, PLAIN's in-memory compute capabilities are activated.
- HBM Stack Structure: Each HBM stack comprises eight 3D-stacked DRAM dies and a buffer die connected via Through-Silicon Vias (TSVs). PLAIN adds compute logic with minimal DRAM changes, supporting both standard DRAM and PIM modes.
- Two Types of PIM Units:
  - Bank-PIM Units:
    - Placement: Located between DRAM banks and connected through I/O boundaries within each DRAM die.
    - Components: Each Bank-PIM contains register files, 4-bit multipliers, an adder tree, and a 32-bit output register.
    - Data Flow: Data is read from odd/even bank row buffers via 512-bit buses. Computation results are sent to the buffer die through a dedicated result bus.
    - Operations: Bank-PIMs execute quantized GEMM/GEMV operations for the three phases of the attention layer: QKV projection (token × weight), attention score (query × key), and attention context (score × value).
    - Conflict Avoidance: Each Bank-PIM reads operands from separate banks to prevent access conflicts, and computation is scheduled based on the first operand in the matrix multiplication.
- Placement: Located between
  - Stack-PIM Unit:
    - Placement: Located on the buffer die (one per HBM stack).
    - Components: Includes buffers for scaling factors and scheduling tables, along with specialized softmax, quantization, and dequantization units.
    - Functionality: Manages quantization, dequantization, softmax, data accumulation, and scheduling. Dequantization units scale accumulated results and forward them to the softmax or quantization units. Quantization units apply the MixQ-PIM algorithm and dispatch outputs to Bank-PIMs or store scaling factors in buffer memory. Softmax is performed using the maximum value from the quantized inputs.
    - Control: Unlike Bank-PIMs, Stack-PIMs integrate control and computation. The host CPU, via the DRAM controller, coordinates execution using the scheduling table. Stack-PIM handles quantization before Bank-PIM computation and dequantization after result accumulation. The softmax step is performed after the attention score stage, completing one attention pass with minimal overhead.

The following figure (Figure 5 from the original paper) illustrates the PLAIN hardware architecture: it shows the components of the DRAM-PIM system, including the host CPU, the PIM modules, the memory units, and the scheduler, which together aim to optimize compute efficiency and memory bandwidth during deep learning inference.
4.2.2.2. Schedule Design
The scheduling algorithm for PLAIN units is crucial for maximizing compute resource utilization and system throughput. It leverages DRAM's bank-level parallelism to saturate internal bandwidth and minimize execution stalls. A key challenge is managing the uneven computational loads introduced by mixed-precision quantization (e.g., INT8 vs. INT4 tokens).
a) Bit-wise Splitting Token:
To unify the computation model across different precision levels and ensure load balance, PLAIN adopts a bit-wise splitting strategy.
- Mechanism: As illustrated in Figure 6, 8-bit activations are split into two 4-bit segments: a high 4-bit segment and a low 4-bit segment (see the numeric sketch below).
- Processing: Each 4-bit segment is then independently processed by GEMV operations using the 4-bit multipliers in the Bank-PIMs.
- Weight Support: While weights are quantized at channel granularity using mixed precisions, only minor hardware support (e.g., shifters) is needed to handle both 4-bit and 8-bit weight computations.
- Accumulation: All intermediate results are stored in 32-bit registers, allowing partial results to be shifted and accumulated on the buffer die (via Stack-PIM) without overflow.
- Benefits: This approach enables a uniform Bank-PIM architecture capable of handling different precision levels efficiently, maintaining load balance by distributing the workload of higher-precision tokens.

The following figure (Figure 6 from the original paper) illustrates the bit-wise splitting strategy: an 8-bit weight/activation vector is split into two 4-bit vectors that are distributed across multiple PIMs; after multiplication and shift operations, the results are accumulated into a 32-bit adder output.
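The arithmetic behind bit-wise splitting is easy to verify: splitting each 8-bit activation into a high and a low nibble, running two 4-bit dot products, and recombining the partial sums with a 4-bit shift reproduces the full 8-bit result. The sketch below uses unsigned activations and signed 4-bit weights as a simplifying assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
a8 = rng.integers(0, 256, size=64).astype(np.int32)   # unsigned INT8 activations
w4 = rng.integers(-8, 8, size=64).astype(np.int32)    # INT4 weights

# Split each 8-bit activation into a high and a low 4-bit segment.
a_hi, a_lo = a8 >> 4, a8 & 0xF

# Each segment is an independent 4-bit GEMV on a separate Bank-PIM;
# partial sums are shifted and accumulated in a 32-bit register.
partial_hi = int(np.dot(a_hi, w4))
partial_lo = int(np.dot(a_lo, w4))
result = (partial_hi << 4) + partial_lo

assert result == int(np.dot(a8, w4))       # matches the full INT8 dot product
print(result)
```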
b) Bank-PIM Schedule Table:
To maintain high throughput between Bank-PIMs and Stack-PIM and mitigate conflicts among banks, a schedule table is introduced.
- Location: Stored on the buffer die and integrated into the DRAM controller.
- Structure: The table is sorted by Bank ID, which indexes banks sequentially. This ordering is consistent with the accumulation and scaling-factor buffers, allowing dequantization units to access parameters with constant-time lookup.
- Contents: Each entry includes Bank ID, PIM ID (interleaved order of banks), Group ID (dynamically assigned at runtime), Token ID, Token Type (e.g., INT8-HIGH, INT4), and Scaling Factor (see the sketch after this list).
- Purpose: This table manages the assignment of computational tasks to specific Bank-PIMs and helps coordinate data flow, especially for mixed-precision operations.

c) Bank-PIM Schedule Algorithm:
The scheduling algorithm aims to balance workloads across Bank-PIMs and minimize communication with Stack-PIM.
- Output Row Assignment: Each Bank-PIM is assigned a full or partial output row to avoid extra accumulation steps in Stack-PIM. At least one token is processed per Bank-PIM based on GEMM/GEMV principles.
- Parallelism for INT8 Tokens: To increase parallelism, each INT8 token is split into two INT4 tokens and distributed to separate Bank-PIMs.
- Greedy Scheduling for Short Inputs: For short input lengths, particularly in the generation phases, token-level partitioning might be insufficient. PLAIN employs a greedy scheduling strategy that further partitions the computation matrix. Tokens are split, and Bank-PIMs are grouped based on the number of 4-bit tokens. Each group then jointly computes a token using channel-wise partitioning. This ensures balanced utilization and efficient mapping across PIMs.
- QKV Generation: Weights are preloaded into all Bank-PIMs during QKV generation. While distributing work increases activation communication, it significantly reduces compute latency.
- Score and Context Stages: In the score and context stages, communication (especially for the K and V matrices) can dominate. However, PLAIN benefits from reduced per-bank compute time and overlapping techniques. For LLM generation with a KV cache, K matrix redistribution is one-time, but the V matrix requires re-quantization and dynamic distribution. Overlapping schemes (discussed next) mitigate this cost.
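As a rough illustration of the schedule table described above, the sketch below models one entry per dispatched 4-bit token segment and a round-robin assignment over Bank-PIMs, with INT8 tokens occupying two entries. The field names mirror the description; the assignment policy and bank numbering are assumptions for illustration only.

```python
from dataclasses import dataclass
from itertools import cycle

@dataclass
class ScheduleEntry:
    bank_id: int          # sort key; matches accumulation/scaling-factor buffers
    pim_id: int           # interleaved order of banks
    group_id: int         # assigned dynamically at runtime
    token_id: int
    token_type: str       # e.g. "INT4", "INT8-HIGH", "INT8-LOW"
    scaling_factor: float

def build_schedule(token_types, scales, num_bank_pims=8):
    """Round-robin tokens over Bank-PIMs; INT8 tokens occupy two PIMs."""
    pims = cycle(range(num_bank_pims))
    table = []
    for tok, (ttype, scale) in enumerate(zip(token_types, scales)):
        parts = ["INT4"] if ttype == "INT4" else ["INT8-HIGH", "INT8-LOW"]
        for part in parts:
            pim = next(pims)
            table.append(ScheduleEntry(bank_id=2 * pim, pim_id=pim,
                                       group_id=tok % 2, token_id=tok,
                                       token_type=part, scaling_factor=scale))
    return sorted(table, key=lambda e: e.bank_id)   # table is sorted by Bank ID

for e in build_schedule(["INT4", "INT8", "INT4"], [0.11, 0.35, 0.08]):
    print(e)
```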
4.2.2.3. Overlapping And Parallelism
To overcome the potential bottleneck of sequential execution and fully exploit bank-level parallelism, PLAIN employs several overlapping mechanisms. These are possible due to:
- Data Independence: Different components process disjoint data subsets.
- Command Isolation: DRAM and PIM commands avoid conflicts on the C/A (Command/Address) bus.
- Exclusive Execution: Each bank operates exclusively in DRAM or PIM mode at any moment.

The following figure (Figure 8 from the original paper) illustrates the overlapping timeline of the different components in the PLAIN system, covering weight loading, the quantization/communication overlap, and the softmax/communication overlap, all aimed at improving LLM inference efficiency.
a) Weight Loading Overlap (Figure 8a):
- Problem: For QKV generation, weights are preloaded into Bank-PIMs. Large models can exceed a single bank's capacity, causing stalls during weight loading.
- Solution: An overlapping scheme is used. While one bank (bank 0) is actively computing, the other bank (bank 1) within the same Bank-PIM concurrently loads weights, and vice versa.
- Mechanism: GEMM operations are decomposed into GEMV to allow reuse of cached tokens in local register files, creating idle time during which weight transfers can occur. A Ping-Pong buffer scheme alternates bank roles: odd-numbered banks enter DRAM mode (for loading) while even-numbered banks enter PIM mode (for computing) between layers. For very large models, weights are distributed across multiple banks, and Bank-PIMs are grouped into virtual PIMs to scale this approach.

b) Quantization and Communication Overlap (Figure 8b):
- Problem: Before the attention score and context stages, activations are aggregated on the buffer die, where Stack-PIM performs de/quantization. These operations are compute-intensive.
- Solution: These operations are overlapped with communication and computation.
- Mechanism: Bank-PIMs send intermediate results via the result bus during computation. Stack-PIM collects these results at fixed intervals and immediately applies dequantization by accessing the corresponding scaling factor from its on-chip buffers (avoiding lookup overhead). Concurrently, it updates the quantization units (e.g., min/max values). Except for the final result transfer, all de/quantization steps are hidden within the ongoing computation, minimizing their impact on the critical path.

c) Softmax and Communication Overlap (Figure 8c):
- Problem: The softmax operation is performed in full precision on Stack-PIM and is a sequential step between the attention score and context stages.
- Solution: Softmax is overlapped with quantization and KV-cache communication.
- Mechanism: During the generation phase, the K and V matrices are cached in PIMs. K uses token-wise quantization and can be reused directly. However, the V matrix is quantized channel-wise and needs to be re-quantized and redistributed. Stack-PIM begins to quantize V concurrently while performing softmax. Once softmax is complete, the resulting attention scores are sent to Bank-PIMs. This overlap effectively hides the latency of both softmax and V quantization, further boosting end-to-end throughput.
5. Experimental Setup
5.1. Datasets
The experiments primarily utilize the WikiText-103 dataset.
- Source and Characteristics: WikiText-103 comprises over 100 million tokens extracted from verified Good and Featured Wikipedia articles. It is a widely used benchmark for language modeling tasks, known for its diverse vocabulary and long-term dependencies.
- Purpose: This dataset helps compute language perplexity (PPL) and evaluates the effect of the quantization algorithm on decoder-only model inference performance.
- Data Sample (for inference): To bridge the gap between PPL calculation (where the generation phase is absent) and actual inference, the dataset is split into conversation-length segments. These segments are then used as prompts in text-generation tasks, simulating real-world dialogue scenarios. For example, a segment might be "The quick brown fox jumps over the lazy dog." and the model is prompted to continue the sentence.
5.2. Evaluation Metrics
The paper uses several metrics to evaluate PLAIN's performance:
-
Perplexity (PPL):
- Conceptual Definition: Perplexity is a common metric used to evaluate the performance of language models. It quantifies how well a probability model predicts a sample. A lower perplexity score indicates that the model is better at predicting the next word in a sequence, suggesting higher accuracy and better generalization. In essence, it measures the model's "surprise" at the actual sequence of words; less surprise means better prediction.
- Mathematical Formula: The perplexity of a language model on a test set (or sequence) $W$ is calculated as (a short computational sketch follows this list):
$
\mathrm{PPL}(W) = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_1, \dots, w_{i-1})\right)
$
- Symbol Explanation:
  - $\mathrm{PPL}(W)$: The perplexity score for the sequence $W$.
  - $\exp$: The exponential function (base $e$).
  - $N$: The total number of tokens (words) in the sequence $W$.
  - $\sum_{i=1}^{N}$: Summation over all tokens from $i = 1$ to $N$.
  - $\log$: The natural logarithm.
  - $P(w_i \mid w_1, \dots, w_{i-1})$: The probability assigned by the language model to the $i$-th token $w_i$, given all the preceding tokens.
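A direct implementation of this formula from per-token probabilities is shown below; in practice the probabilities would come from the model's output distribution over the vocabulary.

```python
import numpy as np

def perplexity(token_probs):
    """PPL = exp( -1/N * sum(log P(w_i | w_<i)) )."""
    log_probs = np.log(np.asarray(token_probs))
    return float(np.exp(-log_probs.mean()))

# Toy example: probabilities the model assigned to the observed tokens.
probs = [0.25, 0.10, 0.50, 0.05, 0.30]
print(f"PPL = {perplexity(probs):.2f}")   # lower is better
```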
-
Speedup:
- Conceptual Definition: Speedup measures the performance improvement of a new system or method relative to a baseline. It quantifies how many times faster the new approach is compared to the reference.
- Mathematical Formula:
$
\mathrm{Speedup} = \frac{\mathrm{Execution\ Time}_{\mathrm{Baseline}}}{\mathrm{Execution\ Time}_{\mathrm{PLAIN}}}
$
- Symbol Explanation:
  - $\mathrm{Speedup}$: The performance gain.
  - $\mathrm{Execution\ Time}_{\mathrm{Baseline}}$: The time taken by the baseline system to complete a task.
  - $\mathrm{Execution\ Time}_{\mathrm{PLAIN}}$: The time taken by the PLAIN system to complete the same task.
-
Energy Consumption:
- Conceptual Definition: Energy consumption measures the total electrical energy used by the system to perform a given task. Lower energy consumption indicates higher energy efficiency, which is critical for deployment in data centers and edge devices due to operational costs and environmental impact.
- Mathematical Formula: While the paper does not provide an explicit formula for energy consumption, it is typically calculated as the integral of power over time for the various components:
$
\mathrm{Energy} = \sum_{j} \int_{t_0}^{t_f} P_j(t)\, dt
$
- Symbol Explanation:
  - $\mathrm{Energy}$: Total energy consumed.
  - $\sum_{j}$: Summation over all active components in the system (e.g., CPU, GPU, PIM, DRAM).
  - $\int_{t_0}^{t_f} P_j(t)\, dt$: The integral of the instantaneous power consumed by component $j$ over the execution time period $[t_0, t_f]$.
- The paper calculates energy and area using Verilog for the arithmetic units, Synopsys Design Compiler for synthesis, and Cacti 7.0 for the buffers, also factoring in HBM3 operations.
5.3. Baselines
To evaluate PLAIN, the authors compare it against three representative baseline systems:
-
GPU (FP16):
- Description: This baseline represents conventional LLM inference on a powerful, high-end GPU using FP16 (half-precision floating-point) arithmetic. It uses standard Huggingface and PyTorch implementations.
- Hardware: An Nvidia A100 GPU is used for measuring end-to-end latency.
- Representativeness: This is the de-facto standard for high-performance LLM inference and serves as a benchmark for raw computational power and unquantized accuracy.
-
SmoothQuant (W8A8):
- Description: This baseline implements SmoothQuant [9], a state-of-the-art post-training quantization technique for LLMs. It quantizes model weights and activations to INT8 (W8A8, meaning 8-bit weights and 8-bit activations).
- Hardware: It utilizes the Nvidia A100 GPU's tensor cores, which are specialized hardware units for low-precision matrix operations. The Cutlass library is used for compiling the self-attention layers to optimize INT8 inference.
- Representativeness: This baseline demonstrates the performance achievable with advanced INT8 quantization on modern GPUs, addressing the "accuracy vs. efficiency" trade-off within a traditional GPU architecture. It highlights the benefits of software optimizations for low-precision inference.
-
AttAcc [8]:
- Description: AttAcc is a heterogeneous system that leverages both the GPU and PIM. It uses the GPU for the prefill stage and QKV generation (which are often compute-heavy), and offloads multi-head attention during the decoding phase (which is typically memory-bound) to PIM units.
- Hardware Adjustment: Originally tested on a DGX system, its hardware configuration is adjusted in PLAIN's evaluation to match the same magnitude (e.g., memory capacity, timing parameters) as PLAIN for a fair comparison.
- Representativeness: This baseline represents the state-of-the-art in PIM-accelerated LLM inference, showcasing the benefits of specialized hardware for memory-bound operations. Comparing against AttAcc directly evaluates PLAIN's innovations in deeper PIM integration, mixed-precision handling, and overall system efficiency over existing PIM solutions.
5.4. Simulation & Hardware Configuration
- Simulation Environment: The authors developed an in-house simulator by modifying Ramulator2 [30], a modern, modular, and extensible DRAM simulator.
- Experimental Stages: Because integrating real LLM inference with a cycle-accurate simulator is complex, the experiments are split into two stages (sketched below):
  - Precision Trace Generation: The LLM inference workload is compiled on a GPU, and a precision trace of the activations (after quantization by the PLAIN algorithm) is generated.
  - Cycle-Accurate Simulation: This precision trace is then fed into the modified Ramulator2 simulator to model memory accesses, producing cycle-accurate inference times for the PLAIN system.
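As a rough illustration of this two-stage split, the pseudo-workflow below writes out a precision trace and then hands it to a trace-driven simulator binary. The trace format, file names, and command-line flags are hypothetical and do not reflect the authors' actual tooling or Ramulator2's real interface.

```python
# A minimal sketch of the two-stage evaluation flow; the trace format, file names, and
# simulator invocation below are hypothetical, not the authors' actual tooling.
import json
import subprocess

def generate_precision_trace(layer_precisions: dict, trace_path: str = "precision_trace.json") -> str:
    """Stage 1: record, per tensor, the bit-width chosen by the quantizer during GPU inference."""
    with open(trace_path, "w") as f:
        json.dump(layer_precisions, f, indent=2)
    return trace_path

def run_cycle_accurate_sim(trace_path: str, config_path: str = "hbm3_pim.yaml") -> None:
    """Stage 2: feed the trace into a modified Ramulator2-style simulator (CLI flags are placeholders)."""
    subprocess.run(["./ramulator2_pim", "--config", config_path, "--trace", trace_path], check=True)

trace = generate_precision_trace({"layer_0.qkv": "INT8", "layer_0.score": "INT4"})
# run_cycle_accurate_sim(trace)   # requires the simulator binary; shown for flow only
```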
- Hardware Specifications: PLAIN adds arithmetic units and buffers to a standard HBM3 memory stack. The HBM3 organization details and timing parameters are crucial for the simulation's accuracy.
- Energy and Area Calculation:
  - Verilog is used to design and analyze the arithmetic units.
  - Synopsys Design Compiler is used for synthesis.
  - Cacti 7.0 [31] is used to estimate the energy and area consumption of buffers in both the DRAM dies and the buffer die.
  - The power consumption of standard HBM3 operations (e.g., activation, reading) is also factored in [32, 33, 34].

The following are the results from Table I of the original paper:
| Parameter | Value |
| --- | --- |
| HBM Organization | 4 Banks per Bank-group, 4 BGs per Pseudo channel, 8 pCHs per die |
| HBM Timing Parameters | Frequency = 1 GHz, tRP = 19, tRCD = 19, tRAS = 45, tRRDL = 4, tWR = 8, tCCD_S = 2, tCCD_L = 4, tREFI = 5070, tFAW = 39 |

LLM configuration:

| Model | Layers | Hidden_size |
| --- | --- | --- |
| GPT2-large | 36 | 1280 |
| GPT2-xl | 48 | 1600 |
| OPT-6.7b | 32 | 4096 |
| OPT-13b | 40 | 5120 |
| LLaMA-2-7b | 32 | 4096 |
| LLaMA-2-13b | 40 | 5120 |
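For convenience, the same organization and timing parameters can be captured as a plain configuration object. The dictionary layout below is illustrative and is not Ramulator2's actual config schema; the values are taken directly from Table I.

```python
# HBM3 organization and timing parameters from Table I, expressed as a plain Python dict
# (cycle counts at 1 GHz). The structure is illustrative, not a real simulator schema.
HBM3_CONFIG = {
    "organization": {
        "banks_per_bank_group": 4,
        "bank_groups_per_pseudo_channel": 4,
        "pseudo_channels_per_die": 8,
    },
    "timing": {  # in cycles unless noted otherwise
        "frequency_GHz": 1,
        "tRP": 19, "tRCD": 19, "tRAS": 45, "tRRDL": 4,
        "tWR": 8, "tCCD_S": 2, "tCCD_L": 4,
        "tREFI": 5070, "tFAW": 39,
    },
}
```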
5.5. Models
The evaluation includes a range of Large Language Models (LLMs):
- GPT-2 [2]: GPT2-large and GPT2-xl.
- OPT [35]: OPT-6.7b and OPT-13b.
- LLaMA-2 [3]: LLaMA-2-7b and LLaMA-2-13b.

These models are all decoder-only generation models, which are common for text generation tasks and are the focus of PLAIN's acceleration efforts. Their sizes (number of layers and hidden size) are detailed in the LLM configuration section of Table I, indicating a comprehensive evaluation across different model scales.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results validate PLAIN's effectiveness in accelerating LLM inference with minimal quality loss and improved energy efficiency across various models.
6.1.1. Quantization Results
The paper evaluates the MixQ-PIM algorithm's impact on model quality using perplexity (PPL); a lower PPL indicates better model quality. The experiments use a strided sliding-window approach with a sequence length and stride of 1024 (a minimal sketch of this evaluation procedure is shown below).
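Below is a minimal sketch of such a non-overlapping (stride = window) perplexity evaluation using a HuggingFace-style causal LM API; the model name and text source are placeholders, and this is not the authors' evaluation script.

```python
# Sketch: sliding-window perplexity with window = stride = 1024 (non-overlapping chunks).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, text, window=1024, stride=1024, device="cuda"):
    ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    nlls, n_tokens = [], 0
    for start in range(0, ids.size(1) - 1, stride):
        chunk = ids[:, start : start + window]
        with torch.no_grad():
            out = model(chunk, labels=chunk)   # loss = mean NLL over predicted tokens
        n = chunk.size(1) - 1                  # number of predicted positions in this chunk
        nlls.append(out.loss * n)
        n_tokens += n
    return torch.exp(torch.stack(nlls).sum() / n_tokens).item()

model = AutoModelForCausalLM.from_pretrained("gpt2-large").to("cuda").eval()
tok = AutoTokenizer.from_pretrained("gpt2-large")
# print(perplexity(model, tok, open("wikitext_test.txt").read()))  # placeholder text file
```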
The following are the results from Table II of the original paper:
| Model | GPT2-large | GPT2-xl | OPT-6.7b | OPT-13b | LLAMA2-7b | LLAMA2-13b |
| --- | --- | --- | --- | --- | --- | --- |
| FP16 | 19.42 | 17.38 | 12.42 | 10.12 | 6.45 | 5.43 |
| W4A4 | 1164.16 | 983.13 | inf | inf | inf | 6388.79 |
| W8A8 | 19.71 | 18.29 | 30.27 | 4272.0 | 6.27 | 5.52 |
| SmoothQuant (W8A8) | - | - | 12.42 | 11.90 | 6.14 | 5.50 |
| SmoothQuant (W6A6) | - | - | 15.25 | 14.73 | 9.12 | 7.80 |
| PLAIN (W6A6) | 20.56 | 18.20 | 13.48 | 9.95 | 6.91 | 5.92 |
Analysis:

- FP16 Baseline: The FP16 perplexity values serve as the golden standard for model quality.
- W4A4 Quantization: W4A4 (4-bit weights, 4-bit activations) performs extremely poorly, leading to inf (infinite) perplexity for several models (OPT-6.7b, OPT-13b, LLAMA2-7b) and very high values for others (e.g., 1164.16 for GPT2-large). This highlights the severe accuracy degradation of aggressive low-bit quantization without proper techniques.
- W8A8 Quantization (Naive): For smaller models such as GPT2-large and GPT2-xl, W8A8 yields perplexity very close to FP16 (e.g., 19.71 vs. 19.42 for GPT2-large). However, as model size increases, naive W8A8 leads to severe accuracy loss (4272.0 for OPT-13b, compared to 10.12 for FP16), indicating the presence of activation outliers.
- SmoothQuant (W8A8): SmoothQuant (W8A8) significantly improves accuracy for larger models that suffer from activation outliers. For OPT-6.7b it achieves 12.42, matching FP16; for OPT-13b it reduces PPL from 4272.0 to 11.90, much closer to FP16. This confirms the effectiveness of SmoothQuant in handling outliers.
- SmoothQuant (W6A6): W6A6 (6-bit weights, 6-bit activations) is generally a good balance between accuracy and compression but lacks native hardware support. SmoothQuant (W6A6) shows higher PPL than its W8A8 counterpart, indicating that 6-bit quantization is more challenging without specific hardware optimizations.
- PLAIN (W6A6): PLAIN performs best in the W6A6 configuration. Despite using a lower bit-width (6 bits for both weights and activations, likely mapped onto the PIM's 4-bit arithmetic units), it achieves perplexity comparable to or even better than FP16 for some models (e.g., OPT-13b: 9.95 vs. 10.12 FP16; LLAMA2-7b: 6.91 vs. 6.45 FP16). For the smaller models where SmoothQuant results are not reported (GPT2-large, GPT2-xl), PLAIN's W6A6 perplexity remains very close to FP16 and W8A8. This indicates that PLAIN's locality-aware adaptive quantization effectively exploits the 6-bit sweet spot, balancing accuracy with high compression while mapping efficiently to the underlying hardware.

Conclusion: PLAIN's quantization algorithm effectively reduces bit-widths while maintaining negligible model quality loss, even for challenging 6-bit quantization, by intelligently managing activation outliers and optimizing precision at a fine granularity.
6.1.2. Latency (Speedup)
The speedup results demonstrate PLAIN's significant performance advantages over conventional GPU and PIM baselines. The configuration uses batch size 1, input token length 64, and output token length 64.
The following figure (Figure 9 from the original paper) shows the speedup comparison:

Analysis:
- Overall Speedup: PLAIN consistently outperforms all baselines across the evaluated LLMs, achieving up to 5.03x acceleration over GPU-FP16 inference, with the largest speedup observed on GPT2-xl.
- Comparison to GPU-SmoothQuant: PLAIN also shows a significant speedup over GPU-SmoothQuant (W8A8). This indicates that while SmoothQuant improves INT8 performance on GPUs, PLAIN's PIM-enabled mixed-precision approach provides further acceleration by leveraging internal memory bandwidth and dedicated in-memory computation.
- Comparison to AttAcc-PIM: Crucially, PLAIN achieves a 1.69x performance boost over AttAcc-PIM, a state-of-the-art PIM accelerator. This highlights PLAIN's superior PIM utilization, mixed-precision scheduling, and overlapping mechanisms, which extract more performance from the PIM architecture for LLMs.
- Reasons for Improvement: The performance boost stems from:
  - Reduced Data Bit-width: Quantization lowers both the computational load and the data movement.
  - Architectural Optimizations: PLAIN's hardware-software co-design minimizes weight movement, and its Stack-PIM and Bank-PIM units efficiently handle quantized operations.
  - Overlapping and Scheduling: The workload-aware dataflow scheduler and overlapping techniques hide the latencies of communication, quantization, and softmax operations, maximizing hardware utilization; a toy illustration of the overlapping idea is sketched below.

Conclusion: PLAIN significantly accelerates LLM inference, showcasing its effectiveness in translating mixed-precision quantization and PIM capabilities into real-world performance gains.
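As a toy illustration of the overlapping idea (not the paper's scheduler), the snippet below hides "transfer" latency behind "compute" by double-buffering tiles between a loader thread and a consumer; tile counts and sleep times are placeholders.

```python
# Toy double-buffered pipeline: while tile i is being "computed", tile i+1 is "loaded".
import threading, queue, time

def loader(tiles, q):
    for t in tiles:
        time.sleep(0.01)          # stands in for weight/KV transfer latency
        q.put(t)
    q.put(None)                   # sentinel: no more tiles

def compute(q):
    while (tile := q.get()) is not None:
        time.sleep(0.01)          # stands in for a GEMV on the tile
        print(f"processed tile {tile}")

q = queue.Queue(maxsize=2)        # bounded queue keeps the loader only one tile ahead
t = threading.Thread(target=loader, args=(range(8), q))
t.start(); compute(q); t.join()
```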
6.1.3. Energy Consumption
Energy efficiency is a critical factor for LLM deployment. The paper presents the normalized energy consumption of PLAIN compared to GPU and AttAcc.
The following figure (Figure 10 from the original paper) shows the normalized energy consumption results:

Analysis:
- Overall Reduction: PLAIN achieves a significantly larger reduction in energy consumption than both the GPU and AttAcc. For example, for GPT2-large, PLAIN consumes only about 20% of the energy of the GPU and substantially less than AttAcc.
- Comparison to AttAcc: While AttAcc also reduces energy consumption relative to the GPU (by offloading memory-bound tasks to PIM), PLAIN's reduction is substantially larger across all models.
- Reasons for Reduction:
  - Lower Bit-widths: PLAIN's use of lower bit-widths for weights and activations directly reduces the energy consumed per operation and the energy cost of data movement. The MixQ-PIM algorithm ensures this reduction does not come at a significant accuracy cost.
  - Efficient PIM Offloading: By offloading QKV generation and attention computations entirely to PIM, PLAIN reduces heavy weight movement and communication overhead between the GPU and main memory. Processing data closer to where it is stored is inherently more energy-efficient.
  - Optimized Dataflow and Overlapping: The dataflow scheduler and overlapping mechanisms keep PIM units highly utilized and minimize idle cycles in which components would consume static power without performing useful work.

Conclusion: PLAIN dramatically improves energy efficiency for LLM inference, making it a more sustainable and cost-effective solution for large-scale deployments.
6.1.4. Ablation Study
An ablation study helps understand the contribution of each component of PLAIN to its overall performance. The study uses the OPT-6.7B model with input/output token lengths of 64, compared to a GPU-FP16 baseline.
The following figure (Figure 11 from the original paper) shows PLAIN's ablation study results:

Analysis (reading the chart from right to left):

- PLAIN (Full System): The full PLAIN system achieves the highest speedup; the other bars show how much of that speedup is lost as individual optimizations are removed.
- Without Overlapping (kcache-balance): Removing the overlapping optimizations (quantization with communication, weight loading with computation, and softmax with value communication) significantly reduces the speedup. This highlights that overlapping is a critical contributor to the total speedup, as it effectively hides latency and keeps the pipeline full.
- Without Workload Balancing (kcache): Further removing workload balancing (the strategy of splitting INT8 tokens into two INT4 tokens for even distribution across PIMs; see the sketch after this list) causes another substantial drop in speedup. This confirms that unaddressed workload imbalance leaves INT4 PIMs idle and significantly hampers performance.
- Without Kcache (MixQ): Removing the Kcache optimization (preloading key vectors into Bank-PIMs during prefill) results in a further reduction of speedup. While less impactful than overlapping or balancing, caching key vectors still provides a noticeable speedup by reducing data movement.
- Remaining Acceleration: The remaining acceleration (labeled OPT-MixQ) comes from the inherent bank-level parallelism of the PIM architecture and the basic benefits of mixed-precision quantization itself.

Conclusion: The ablation study demonstrates that all three key innovations (Kcache, workload balancing, and overlapping) are essential for PLAIN's high performance. The overlapping techniques provide the most significant boost, followed by workload balancing.
6.1.5. Sensitivity Analysis
6.1.5.1. Speedup over Different Output Token Lengths
The paper analyzes how PLAIN's speedup varies with different output token lengths.
The following figure (Figure 12 from the original paper) shows PLAIN's speedup over different tokens:

Analysis:
- Short Outputs (16 tokens): For very short output lengths, the speedup is relatively limited. In this scenario, the prefill phase (processing the initial prompt) dominates the execution time. The prefill phase involves larger GEMM operations that are more compute-bound and thus less amenable to PIM's bandwidth advantage over the GPU's raw FLOPS.
- Increasing Output Length (64, 128 tokens): As the output length increases, the decoding phase becomes more significant. Decoding involves numerous GEMV operations in the attention layers, making it a memory-bound scenario with an arithmetic intensity of roughly 1 (low compute-to-memory-access ratio; see the back-of-the-envelope check after this list). PIM architectures, with their superior internal bandwidth, are exceptionally well-suited for such memory-bound tasks, so PLAIN's speedup grows as more tokens are generated.

Conclusion: PLAIN is most effective in memory-bound scenarios, particularly during the decoding phase of LLMs, where its PIM-centric design can fully leverage the high internal memory bandwidth.
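The back-of-the-envelope check below makes the arithmetic-intensity claim concrete for a square FP16 GEMV; the matrix size is arbitrary and chosen only for illustration.

```python
# Why decode-phase GEMV is memory-bound: for y = W @ x with an n x n FP16 weight matrix,
# compute is ~2*n^2 FLOPs while traffic is dominated by the n^2 weights (2 bytes each),
# giving roughly 1 FLOP per byte.
def gemv_arithmetic_intensity(n: int, bytes_per_elem: float = 2.0) -> float:
    flops = 2 * n * n                       # one multiply and one add per weight element
    bytes_moved = n * n * bytes_per_elem    # weight traffic; x and y are only O(n)
    return flops / bytes_moved

print(gemv_arithmetic_intensity(4096))      # -> 1.0 FLOP/byte at FP16
```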
6.1.5.2. Scalability with Increasing HBM Stacks
The scalability of PLAIN with an increasing number of HBM stacks is analyzed to understand its potential for larger systems.
The following figure (Figure 13 from the original paper) illustrates the speedup of PLAIN with increasing stacks:

Analysis:
- Linear Scalability: The graph shows that PLAIN achieves nearly linear performance improvements as the number of HBM stacks increases from 1 to 4; with 4 stacks, the speedup is close to four times that of a single stack.
- Workload Distribution: PLAIN efficiently distributes the attention-layer workload across multiple stacks without incurring significant inter-stack communication overhead. This is attributed to optimal activation data mapping, which minimizes data transfers between HBM stacks.
- Increased Bank-to-Die Communication: While adding Bank-PIMs (which comes with more stacks) increases bank-to-die communication overhead, this is more than offset by the substantial FLOPS boost provided by the additional PIM units.
- Flexible Hardware Configuration: The flexibility of PLAIN's hardware configuration allows capacity, bandwidth, and FLOPS to be adjusted by varying the number of DRAM and PIM dies within an HBM stack, making it adaptable to different performance requirements.

Conclusion: PLAIN exhibits excellent scalability, efficiently leveraging multiple HBM stacks to achieve nearly linear performance gains for LLM inference, making it suitable for high-performance deployments.
6.1.6. Area Overhead
The area overhead for integrating PLAIN within an HBM stack is quantified per DRAM die and per buffer die:

- GEMV Units: Each DRAM die includes 64 GEMV units, evaluated under a 1z-nm DRAM process [21].
- Buffer Die Components: The buffer die houses the quantization units, softmax units, accumulators, and a buffer.
- Scaling: All processing units on the buffer die are scaled to a 7nm process [36], reflecting state-of-the-art fabrication technology for logic components.

Conclusion: The area overhead introduced by PLAIN is relatively modest, particularly given the substantial performance and energy benefits it delivers. The design integrates compute logic into the HBM stack efficiently, minimizing changes to core DRAM structures.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces PLAIN, a novel software/hardware co-design framework specifically tailored for optimizing Large Language Model (LLM) inference by effectively integrating mixed-precision quantization with DRAM-based Processing-in-Memory (DRAM-PIM) technology. The core contributions and findings are:
- Hardware-Efficient Quantization: PLAIN proposes a locality-aware adaptive quantization algorithm (MixQ-PIM) that determines optimal bit-widths for weights and activations (e.g., INT4/INT8) based on entropy and hardware cost. This approach allows aggressive quantization (including effective mapping of 6-bit precision) with negligible model quality loss, even for models sensitive to outliers.
- Heterogeneous Workload Partitioning: The PLAIN architecture strategically partitions LLM inference tasks, leveraging the high internal memory bandwidth of PIM units for memory-bound attention layers and the powerful computational capabilities of GPUs for compute-bound fully connected layers.
- Optimized Dataflow and Overlapping: A workload-aware dataflow scheduler combined with overlapping mechanisms (for weight loading, quantization/communication, and softmax/communication) ensures balanced workload distribution across PIM units and effectively hides latencies. This includes a bit-wise splitting strategy to unify mixed-precision computations.
- Significant Performance and Energy Gains: Experimental evaluations demonstrate that PLAIN achieves substantial performance improvements, with speedups of up to 5.03x over conventional GPU-FP16 inference and a 1.69x boost over the state-of-the-art PIM accelerator, AttAcc. These gains are accompanied by a significant reduction in energy consumption across various LLMs.
- Scalability: PLAIN demonstrates near-linear scalability with an increasing number of HBM stacks, indicating its suitability for larger-scale deployments without significant inter-stack communication overhead.

In essence, PLAIN addresses the memory wall bottleneck in LLM inference by providing a holistic framework that co-designs software algorithms and hardware architecture to unlock the full potential of DRAM-PIM for complex deep learning workloads.
7.2. Limitations & Future Work
While the paper does not dedicate a specific section to "Limitations & Future Work," some aspects can be inferred:
- PIM Computational Generalization: The current Bank-PIM units are primarily designed for GEMM/GEMV operations and 4-bit multipliers. While effective for attention layers, future work might expand the computational capabilities of PIM units to support a wider range of operations or more complex arithmetic directly in memory, potentially offloading more parts of the LLM (e.g., certain activation functions, non-linearities in FC layers) from the GPU.
- Dynamic Workload Adaptation: The greedy scheduling strategy and static schedule table are designed for specific workloads. More dynamic or adaptive scheduling algorithms could react to real-time workload fluctuations, varying batch sizes, or model changes more flexibly, especially in multi-user or dynamic inference environments.
- Inter-HBM Communication Overhead: Although PLAIN minimizes inter-stack communication for attention layers, scaling to a larger number of HBM stacks or more complex models might expose new inter-PIM communication bottlenecks.
- Beyond Attention Layers: While PLAIN focuses on attention layers due to their memory-bound nature, fully connected layers still reside on the GPU. Future research could investigate extending PIM capabilities to accelerate parts of FC layers without compromising the GPU's compute power, perhaps through advanced partitioning or specialized PIM compute units.
- Software Stack Development: The paper relies on an in-house simulator. Developing a full-fledged software stack, including compilers and runtime systems, that seamlessly integrates mixed-precision quantization with PIM and GPU for real-world deployment remains a significant challenge and a direction for future work. This would involve managing data movement, synchronization, and task scheduling across the heterogeneous system with minimal programmer effort.
- Power Gating and Fine-grained Control: While energy efficiency is improved, further fine-grained power management, such as dynamic voltage and frequency scaling (DVFS) or power gating for inactive PIM units, could optimize energy consumption even further.
7.3. Personal Insights & Critique
PLAIN presents a compelling and well-engineered solution that directly tackles the memory wall problem in LLM inference. Its strength lies in the rigorous co-design of both the quantization algorithm and the underlying hardware architecture, rather than simply porting existing techniques.
- Transferability: The core principle of locality-aware mixed-precision quantization combined with heterogeneous compute offloading is highly transferable. This approach could be applied to other data-intensive deep learning models beyond LLMs, such as large vision transformers or graph neural networks, where similar memory bottlenecks and opportunities for low-precision computation exist. The concept of bit-wise splitting to unify execution across different precisions on fixed-bit-width hardware is also a clever technique that could find applications in other specialized accelerators.
- Novelty of Quantization: The entropy-based heuristic for quantization configuration, particularly the simplified activation entropy and hardware-aware cost function, is a practical and efficient alternative to complex reinforcement-learning approaches. This makes the MixQ-PIM algorithm more deployable across models without extensive retraining. The careful mapping of 6-bit quantization (often a sweet spot for LLMs) onto underlying 4-bit hardware operations is a testament to the practical hardware-aware design.
- System-Level Optimization: The emphasis on workload balancing and overlapping is crucial. It highlights that even with superior underlying hardware, a poorly managed dataflow can nullify theoretical gains. PLAIN's detailed scheduling and overlapping strategies are key to achieving practical speedups in a complex heterogeneous environment, and they demonstrate a deep understanding of system-level performance bottlenecks.
- Potential Issues/Critique:
  - Commercial Viability of PIM: While PLAIN shows significant benefits, widespread commercial adoption of DRAM-PIM solutions (beyond specialized cases like Samsung HBM-PIM or SK Hynix GDDR6-AiM) still faces challenges in manufacturing cost, standardization, and a mature software ecosystem. The success of PLAIN relies heavily on continued advancements and broader acceptance of PIM.
  - Generalization of MixQ-PIM: The entropy-based heuristic might need fine-tuning of its hyperparameters for different LLM families or tasks. While lightweight, its robustness across highly diverse models and training conditions, compared to more adaptive, learning-based quantization methods, could be further investigated.
  - Host-PIM Interface Overhead: While the paper notes that PLAIN functions as conventional HBM for non-attention layers, the transition overheads and the complexity of managing data movement and synchronization between the XPU and PIM (especially with dynamic output lengths and KV cache updates) are always a concern in heterogeneous systems. The paper largely mitigates this through overlapping, but it remains a critical aspect of overall system efficiency.
  - Specifics of the 6-bit Implementation: The paper mentions W6A6 performance and the use of 4-bit multipliers with shifters for 8-bit compute. A more detailed explanation of how 6-bit values are handled and mapped onto a 4-bit arithmetic unit (e.g., through partial sums, two separate low-bit operations, or custom 6-bit logic) would enhance clarity for a beginner. It is implied that 6-bit values are treated as two 3-bit operations or similar, which would fit into a 4-bit framework.

Overall, PLAIN represents a significant step forward in making large, memory-intensive LLMs more efficient and deployable on emerging hardware. Its holistic approach to co-design and meticulous attention to system-level details set a strong precedent for future research in AI accelerators.