
PLAIN: Leveraging High Internal Bandwidth in PIM for Accelerating Large Language Model Inference via Mixed-Precision Quantization

Published: 10/26/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

PLAIN is a novel software/hardware co-design framework for accelerating large language model inference through mixed-precision quantization. It optimizes parameter quantization and leverages PIM characteristics, achieving up to 5.03x and 1.69x performance improvements over a comparable GPU and a state-of-the-art PIM accelerator, respectively, with negligible model quality loss.

Abstract

DRAM-based processing-in-memory (DRAM-PIM) has gained commercial prominence in recent years. However, its integration for deep learning acceleration, particularly for large language models (LLMs), poses inherent challenges. Existing DRAM-PIM systems are limited in computational capabilities, primarily supporting element-wise and general matrix-vector multiplication (GEMV) operations, which contribute only a small portion of the execution time in LLM workloads. As a result, current systems still require powerful host processors to manage compute-heavy operations. To address these challenges and expand the applicability of commodity DRAM-PIMs in accelerating LLMs, we introduce PLAIN, a novel software/hardware co-design framework for PIM-enabled systems. PLAIN leverages the distribution locality of parameters and the unique characteristics of PIM to achieve optimal trade-offs between inference cost and model quality. Our framework includes three key innovations: 1) firstly, we propose a novel quantization algorithm that determines the optimal precision of parameters within each layer, considering both algorithmic and hardware characteristics to optimize hardware mapping; 2) PLAIN strategically utilizes both GPUs and PIMs, leveraging the high internal memory bandwidth within HBM for attention layers and the powerful compute capability of conventional systems for fully connected (FC) layers; 3) PLAIN integrates a workload-aware dataflow scheduler that efficiently arranges complex computations and memory access for mixed-precision tensors, optimizing execution across different hardware components. Experiments show PLAIN outperforms the conventional GPU with the same memory parameters and the state-of-the-art PIM accelerator, achieving a 5.03× and 1.69× performance boost, with negligible model quality loss.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of the paper is "PLAIN: Leveraging High Internal Bandwidth in PIM for Accelerating Large Language Model Inference via Mixed-Precision Quantization." This title highlights a novel software/hardware co-design framework, PLAIN, which aims to improve the efficiency of Large Language Model (LLM) inference by utilizing the unique characteristics of Processing-in-Memory (PIM) architectures and applying mixed-precision quantization techniques.

1.2. Authors

The authors are Yiwei Hu, Fangxin Liu, Zongwu Wang, Yilong Zhao, Tao Yang, Li Jiang, and Haibing Guan. Their affiliations are not explicitly detailed in the provided text beyond an email address suffix suggesting "sjtu.edu.cn," which likely refers to Shanghai Jiao Tong University. Li Jiang is noted as a corresponding author. The research backgrounds appear to be in computer architecture, deep learning acceleration, and memory systems, focusing on optimizing hardware-software interactions for AI workloads.

1.3. Journal/Conference

The publication venue is not explicitly stated in the provided text. However, the nature of the research (software/hardware co-design, PIM, LLM acceleration) suggests it would typically be published in top-tier computer architecture conferences such as ASPLOS, ISCA, MICRO, or HPCA, or potentially a highly regarded AI systems conference.

1.4. Publication Year

The paper's listed publication date is 2025-10-26 (UTC), suggesting it is an accepted paper for an upcoming conference or journal issue.

1.5. Abstract

The abstract introduces DRAM-based processing-in-memory (DRAM-PIM) as a commercially prominent technology facing challenges in accelerating Large Language Models (LLMs) due to limited computational capabilities (primarily supporting element-wise and GEMV operations). To address this, the paper proposes PLAIN, a novel software/hardware co-design framework. PLAIN aims to optimize LLM inference cost and model quality by leveraging parameter distribution locality and PIM's unique characteristics. Its three key innovations include: 1) a novel quantization algorithm that determines optimal precision per layer based on algorithmic and hardware characteristics; 2) strategic utilization of GPUs for compute-heavy Fully Connected (FC) layers and PIMs for bandwidth-intensive attention layers, leveraging HBM's high internal bandwidth; and 3) a workload-aware dataflow scheduler for efficient execution of mixed-precision tensors across heterogeneous hardware. Experimental results demonstrate that PLAIN achieves 5.03× and 1.69× performance boosts over conventional GPUs with similar memory parameters and state-of-the-art PIM accelerators, respectively, while maintaining negligible model quality loss.

The original source link provided is /files/papers/69571ce38c5983e9f07b96e1/paper.pdf. This appears to be a local file path or an internal identifier within a larger system, rather than a publicly accessible URL. Its publication status is pending as the publication date is in the future.

2. Executive Summary

2.1. Background & Motivation

The paper addresses the critical challenges in deploying Large Language Models (LLMs), which are known for their massive parameter counts and immense resource demands.

  • Core Problem: The primary hurdle in LLM deployment, especially during inference (token generation), is the "memory wall." This refers to the bottleneck caused by limited memory bandwidth between GPU compute units and DRAM, leading to low hardware utilization. For instance, GPT-3 requires 326 GB of FP16 memory, far exceeding the capacity of high-end GPUs like A100 (80 GB). Even with significant HBM capacity and bandwidth, GPU compute unit utilization can fall below 1%. While computational performance of accelerators has grown, memory capacity and bandwidth have not kept pace.

  • Importance of the Problem: Efficient LLM inference is crucial for their widespread adoption and practical utility in various applications. The existing memory bottlenecks limit the size and complexity of deployable models, hindering advancements in AI capabilities and increasing the cost of running powerful LLMs.

  • Challenges/Gaps in Prior Research:

    • Limited PIM Capabilities: Existing DRAM-PIM systems, while offering high internal memory bandwidth, are typically limited to simple operations like element-wise computations and General Matrix-Vector Multiplication (GEMV). These operations constitute only a small fraction of LLM workloads, leaving compute-heavy operations to powerful host processors. This limits the full potential of DRAM-PIM for LLM acceleration.
    • Quantization Limitations: While model quantization reduces memory footprint, current methods often face a trade-off between accuracy and computational resources. Mainstream hardware lacks native support for optimal bit-widths (e.g., 6-bit quantization), favoring less flexible 4-bit or 8-bit formats.
    • PIM-Quantization Integration: There is a lack of PIM-friendly frameworks that effectively balance hardware overhead with compression ratio, consider DRAM-PIM's unique architectural constraints in mixed-precision techniques, and translate theoretical quantization benefits into practical speedups due to workload imbalances.
  • Paper's Entry Point / Innovative Idea: The paper proposes PLAIN, a novel algorithm-architecture co-design framework. It aims to optimize performance by intelligently combining DRAM-PIM's high internal bandwidth with mixed-precision quantization. PLAIN leverages the distribution locality of parameters (intra-tensor value distributions and inter-tensor patterns) to adaptively quantize parts of the model, ensuring efficiency while preserving accuracy. It strategically offloads memory-bound operations to PIM and compute-bound operations to GPUs, creating a heterogeneous acceleration system.

2.2. Main Contributions / Findings

PLAIN makes several primary contributions to address the challenges of LLM inference:

  • Novel Quantization Algorithm: PLAIN introduces a Locality-Aware Adaptive Quantization method. This algorithm determines the optimal precision (e.g., INT4 or INT8) for parameters within each layer, considering both algorithmic characteristics (like sensitivity to precision changes) and hardware characteristics to optimize mapping to the PIM architecture. This balances accuracy and compression without requiring expensive retraining.
  • Heterogeneous Hardware Utilization: The framework strategically leverages GPUs and PIMs in a co-design approach. It exploits the high internal memory bandwidth of High-Bandwidth Memory (HBM) within PIM units for attention layers (which are often memory-bound in LLMs), while utilizing the powerful compute capabilities of conventional GPUs for Fully Connected (FC) layers (which are typically compute-bound). This maximizes the strengths of each component.
  • Workload-Aware Dataflow Scheduler: PLAIN integrates a sophisticated workload-aware dataflow scheduler. This scheduler efficiently arranges complex computations and memory accesses for mixed-precision tensors across different hardware components. It utilizes bank-level parallelism and dynamic workload balancing (e.g., bit-wise splitting of INT8 tokens into INT4 for parallel processing) to ensure that the benefits of mixed-precision quantization are realized as practical speedups, minimizing stalls and maximizing resource utilization.
  • Significant Performance and Energy Boost: Experimental results demonstrate substantial improvements. PLAIN achieves an average speedup of 4.41× and up to 5.03× compared to conventional GPU-FP16 inference. It also shows a 1.69× performance boost over the state-of-the-art PIM accelerator, AttAcc. Furthermore, PLAIN significantly reduces energy consumption compared to both GPU and AttAcc, indicating improved energy efficiency.
  • Negligible Model Quality Loss: Despite aggressive quantization and performance optimizations, PLAIN maintains negligible model quality loss across various LLMs (GPT-2, OPT, LLaMA-2), demonstrating the effectiveness of its adaptive quantization strategy.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully grasp the innovations presented in PLAIN, a reader should understand several foundational concepts:

  • Large Language Models (LLMs):

    • Conceptual Definition: LLMs are deep learning models, typically based on the Transformer architecture, that are trained on vast amounts of text data. They are designed to understand, generate, and process human language, performing tasks like text generation, translation, summarization, and question answering.
    • Transformer Architecture: LLMs predominantly use the Transformer architecture, introduced by Vaswani et al. (2017). This architecture revolutionized sequence processing by relying entirely on attention mechanisms (particularly self-attention) to draw global dependencies between input and output, eschewing recurrent (RNN) or convolutional (CNN) layers. Many LLMs like GPT and LLaMA are built using stacked decoder blocks of the Transformer.
    • Inference Stages: LLM inference, especially for generative tasks, typically involves two stages:
      • Prefill Stage: The initial input prompt (e.g., "I like playing") is processed. This stage often involves large matrix multiplications (GEMM) as the entire input sequence is processed to build a context.

      • Decoding Stage: The model generates tokens one by one autoregressively (e.g., "basketball", then "!"). In this stage, the query vector is typically small ($1 \times \mathrm{dim}$), and the key and value matrices grow as more tokens are generated ($N \times \mathrm{dim}$, where $N$ is the sequence length). This stage is often characterized by General Matrix-Vector Multiplication (GEMV) operations and is heavily memory-bound due to frequent accesses to a growing KV cache.

        The following figure (Figure 1 from the original paper) illustrates the computational workflow of decoder layers during inference:

        Fig. 1: Prefilling stage and decoding stage in LLMs. The schematic shows the structure of the prefill and decoding stages of an LLM: in the prefill stage, the input ['I', 'like', 'playing'] passes through self-attention and produces ['basketball']; in the decoding stage, the input ['I', 'like', 'playing', 'basketball'] again passes through self-attention and the feed-forward network, producing ['!'].

  • DRAM-based Processing-in-Memory (DRAM-PIM):

    • Conceptual Definition: DRAM-PIM is a memory-centric computing paradigm where computational units are integrated directly within or very close to DRAM (Dynamic Random-Access Memory) memory banks. The core idea is to move computation closer to data, thereby reducing the need to transfer large amounts of data between the main processor (e.g., CPU or GPU) and external memory.
    • Advantages:
      • Reduced Data Movement: The primary benefit is mitigating the "memory wall" bottleneck by performing operations where the data resides, significantly decreasing data transfer energy and latency.
      • Higher Internal Bandwidth: PIM architectures can leverage the extremely high internal bandwidth available within DRAM banks, which is often much greater than the external bandwidth between DRAM and the host processor.
    • Limitations:
      • Limited Computational Capability: Commercial DRAM-PIM solutions (like Samsung HBM-PIM or SK Hynix GDDR6-AiM) often have relatively simple compute units, primarily supporting basic operations such as element-wise operations and GEMV. More complex operations (GEMM) or custom functions might still require the host processor.
      • Programming Complexity: Developing software that effectively utilizes PIM architectures can be challenging due to the need for explicit data placement and workload partitioning.
  • Quantization:

    • Conceptual Definition: Quantization in Deep Neural Networks (DNNs) is a technique used to reduce the precision (bit-width) of weights and/or activations from high-precision floating-point formats (e.g., FP32, FP16) to lower-precision integer formats (e.g., INT8, INT4, INT1).
    • Benefits:
      • Reduced Memory Footprint: Lower bit-widths mean less memory is required to store model parameters and intermediate activations, allowing larger models to fit into memory or enabling deployment on resource-constrained devices.
      • Improved Inference Speed: Operations on lower-precision integers can be significantly faster and more energy-efficient on specialized hardware (e.g., integer Tensor Cores on NVIDIA GPUs) or custom PIM units.
      • Reduced Data Movement: Smaller data sizes lead to less data transfer across memory hierarchies, further alleviating the memory wall.
    • Trade-off: The main challenge is maintaining model accuracy. Aggressive quantization can lead to significant accuracy degradation if not carefully managed.
  • Mixed-Precision Quantization (MPQ):

    • Conceptual Definition: MPQ is an advanced quantization technique where different layers or even different parts within a tensor are quantized to varying bit-widths (e.g., some layers to INT8, others to INT4).
    • Purpose: It aims to find an optimal balance between accuracy and efficiency. Highly sensitive layers or values (e.g., outliers in activations) might retain higher precision to preserve accuracy, while less sensitive parts can be aggressively quantized to lower bit-widths for maximum compression and speedup.
  • Memory Wall:

    • Conceptual Definition: The "memory wall" refers to the growing performance gap between processor speed and memory access speed. While processors have become exponentially faster, the rate at which data can be fetched from main memory has not kept pace. This bottleneck becomes particularly pronounced in data-intensive workloads like LLMs, where the processor frequently stalls, waiting for data from memory, leading to low compute unit utilization.
  • Roofline Model:

    • Conceptual Definition: The Roofline Model is a performance model that graphically illustrates the achievable performance of a computational kernel on a given hardware platform. It plots performance (typically in FLOPS) against arithmetic intensity (FLOPS per byte transferred). The "roofline" itself consists of two lines:

      • Memory Bandwidth Bound: A diagonal line representing the maximum performance limited by the memory bandwidth.
      • Compute Bound: A horizontal line representing the maximum performance limited by the processor's peak floating-point operation rate (FLOPS).
    • Purpose: By plotting a workload on the roofline chart, one can easily identify whether the workload is memory-bound (falling under the bandwidth-limited diagonal) or compute-bound (hitting the FLOPS-limited horizontal roof) and understand which hardware resource is the bottleneck.

      The following figure (Figure 2 from the original paper) provides a roofline model analysis for the attention layers in LLaMA-2-7b, highlighting how workloads transition from compute-bound to memory-bound as output length increases:

      Fig. 2: Roofline model analysis for the attention layers in LLaMA-2-7b, using the Nvidia V100 GPU and Samsung HBM-PIM, which have similar memory capacity and timing parameters. The HBM-PIM configuration integrates four devices with a processor. The sequence length is 2048 and the batch size is 1, generating tokens of varying lengths. The chart plots performance against arithmetic intensity for the LLaMA-2-7B attention layers on the Nvidia V100 GPU and HBM-PIM, marking the memory-bound and compute-bound regimes; HBM-PIM peaks at 4.8 TFLOPS while the V100 GPU peaks at 14 TFLOPS.
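
To make the roofline analysis concrete, here is a minimal Python sketch of the model: attainable performance is the minimum of the peak compute rate and bandwidth times arithmetic intensity. The peak TFLOPS values are the ones quoted in the Figure 2 caption; the bandwidth numbers are illustrative assumptions, not figures from the paper.

```python
# Minimal roofline sketch (illustrative, not from the paper).
# attainable = min(peak_compute, bandwidth * arithmetic_intensity)

def attainable_tflops(intensity_flop_per_byte: float,
                      peak_tflops: float,
                      bandwidth_tbps: float) -> float:
    """Return the roofline-limited performance in TFLOPS."""
    return min(peak_tflops, bandwidth_tbps * intensity_flop_per_byte)

# Peak compute numbers quoted in the Figure 2 caption; the bandwidths are
# illustrative assumptions (external HBM2 vs. aggregate internal bank bandwidth).
V100 = dict(peak_tflops=14.0, bandwidth_tbps=0.9)     # ~0.9 TB/s external
HBM_PIM = dict(peak_tflops=4.8, bandwidth_tbps=4.0)   # assumed internal bandwidth

for name, hw in [("V100", V100), ("HBM-PIM", HBM_PIM)]:
    for intensity in (0.5, 2.0, 16.0):  # FLOP/byte: GEMV-like to GEMM-like
        perf = attainable_tflops(intensity, **hw)
        bound = "memory-bound" if perf < hw["peak_tflops"] else "compute-bound"
        print(f"{name:8s} intensity={intensity:5.1f}  {perf:5.2f} TFLOPS ({bound})")
```

With a GEMV-like intensity (well below 1 FLOP/byte), both platforms sit on the bandwidth-limited diagonal, which is exactly the regime where PIM's higher internal bandwidth pays off.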

3.2. Previous Works

The paper builds upon and distinguishes itself from several key areas of prior research:

  • Transformer Architecture (Vaswani et al. [1]):

    • Background: The Transformer is the backbone of modern LLMs. It introduced the self-attention mechanism as a replacement for recurrent and convolutional layers, enabling parallel processing of sequences and capturing long-range dependencies efficiently.
    • Core Formula for Attention: The fundamental attention mechanism is crucial. It calculates a weighted sum of value vectors, where the weights are determined by the similarity between query and key vectors. $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $
      • $Q$: The Query matrix, representing the current token(s) for which context is being computed. Its shape is typically $(N_q, d_k)$, where $N_q$ is the number of queries and $d_k$ is the dimension of the key vectors.
      • $K$: The Key matrix, representing all available contextual tokens. Its shape is typically $(N_k, d_k)$, where $N_k$ is the number of keys.
      • $V$: The Value matrix, containing the actual information to be aggregated. Its shape is typically $(N_k, d_v)$, where $d_v$ is the dimension of the value vectors.
      • $QK^T$: The dot product between Queries and Keys, measuring the compatibility (attention scores) between each query and all keys.
      • $\sqrt{d_k}$: A scaling factor that prevents the dot products from becoming too large, which can push the softmax function into regions with tiny gradients.
      • $\mathrm{softmax}$: A function that normalizes the attention scores into a probability distribution, ensuring the weights sum to 1.
      • Output: A matrix of shape $(N_q, d_v)$, representing the contextually enriched queries.
    • Role in PLAIN: PLAIN focuses on accelerating Transformer inference, specifically targeting the attention layers due to their memory-bound nature in the decoding stage. (A minimal NumPy sketch of the attention computation appears after this list.)
  • Commercial DRAM-PIM Solutions (Samsung HBM-PIM [22], SK Hynix GDDR6-AiM [25]):

    • Background: These are real-world implementations of the PIM concept, integrating basic compute capabilities into HBM or GDDR6 memory modules. They demonstrate the viability of PIM for accelerating memory-bound AI workloads.
    • Limitations (addressed by PLAIN): As mentioned in the abstract, these systems primarily support element-wise and GEMV operations. While useful for some memory-intensive tasks, they are insufficient for the diverse and compute-heavy operations of LLMs, especially GEMM in the prefill stage or complex attention computations. They typically rely on host processors for the majority of the LLM workload.
  • Model Quantization Techniques:

    • HAQ [27]: Uses reinforcement learning to search for optimal bit-widths per layer, considering hardware metrics like latency and energy. This is a complex, hardware-aware approach to MPQ.
    • LLM.int8() [28]: A specific quantization method targeting LLMs. It addresses the issue of outliers (activations with very large magnitudes) by isolating outlier dimensions into FP16 while quantizing the majority of values to INT8 using a vector-wise approach.
    • SmoothQuant [9]: A post-training quantization technique designed for LLMs. It tackles activation outliers by "smoothing" them. This is achieved by remapping activation outliers to weights through per-channel scaling, making both weights and activations more amenable to low-bit quantization without significant accuracy loss. PLAIN explicitly builds upon SmoothQuant for its Quantization Granularity strategy.
  • AttAcc [8]:

    • Background: AttAcc is a state-of-the-art heterogeneous PIM system specifically designed for batched Transformer-based generative model inference. It offloads multi-head attention during the decoding phase to PIM units, while the GPU handles the prefill stage and QKV generation.
    • Role as Baseline: PLAIN uses AttAcc as a key baseline to demonstrate its performance improvements. PLAIN aims to surpass AttAcc by offering more comprehensive PIM utilization, a more advanced quantization scheme, and better workload balancing. The paper implies that AttAcc might still face challenges in efficiently handling the full complexity of LLM attention operations, particularly with mixed precision.
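
As a companion to the attention formula discussed above, the following minimal NumPy sketch implements scaled dot-product attention with decoding-stage shapes (a single query against a cached key/value matrix). It is illustrative only; all tensor sizes are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (N_q, N_k) attention scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # (N_q, d_v) context

# Decoding-stage shapes: one new query token against a growing KV cache.
rng = np.random.default_rng(0)
N_k, d_k, d_v = 128, 64, 64
Q = rng.standard_normal((1, d_k))        # single new token
K = rng.standard_normal((N_k, d_k))      # cached keys
V = rng.standard_normal((N_k, d_v))      # cached values
print(attention(Q, K, V).shape)          # -> (1, 64)
```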

3.3. Technological Evolution

The field of AI acceleration has evolved significantly, driven by the increasing computational demands of deep learning.

  1. Early Deep Learning (2010s): Initial acceleration focused on GPUs (e.g., NVIDIA CUDA) for their parallel processing capabilities, effectively handling dense matrix multiplications fundamental to Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs).

  2. Transformer Era (2017 onwards): With the advent of the Transformer, LLMs grew exponentially in size. This exposed the memory wall problem more acutely. While GPUs continued to improve (e.g., NVIDIA A100, H100 with Tensor Cores), memory capacity and bandwidth became persistent bottlenecks.

  3. Quantization for Efficiency (Late 2010s-Present): To combat memory and computational demands, quantization emerged. Initial efforts focused on FP16 or INT8 quantization for general DNNs. Later, mixed-precision and post-training quantization (like SmoothQuant, LLM.int8()) specifically for LLMs became crucial to maintain accuracy with very low bit-widths.

  4. Processing-in-Memory (PIM) Resurgence (Early 2020s): As the memory wall became more critical, PIM solutions, which had been a research topic for decades, saw commercial viability (e.g., Samsung HBM-PIM). These aimed to tackle memory-bound workloads by integrating computation directly into memory.

  5. Heterogeneous Systems and Co-design (Present): The current frontier involves integrating these technologies. Recognizing that no single architecture is optimal for all LLM operations, heterogeneous systems (e.g., GPU + PIM) and hardware-software co-design frameworks (like PLAIN) are emerging. These seek to intelligently partition workloads and optimize algorithms to fully exploit the unique strengths of different hardware components, addressing the limitations of individual technologies.

    PLAIN fits into this evolution by pushing the boundaries of heterogeneous acceleration for LLMs. It specifically targets the integration of advanced mixed-precision quantization with DRAM-PIM, moving beyond simple GEMV offloading to a more comprehensive acceleration of memory-bound attention layers.

3.4. Differentiation Analysis

Compared to the main methods in related work, PLAIN offers several core differentiators and innovations:

  • PIM Utilization Scope:

    • Conventional PIMs (e.g., Samsung HBM-PIM): These are limited to element-wise and GEMV operations, leaving the bulk of LLM computation to host GPUs.
    • AttAcc: Improves on conventional PIM by offloading multi-head attention during the decoding stage to PIM.
    • PLAIN's Innovation: PLAIN goes further by not only offloading attention layers but also by fully integrating mixed-precision quantization within the PIM architecture. It proposes specific Bank-PIM units for QKV projection, attention score, and attention context calculations, which involve a mix of GEMM and GEMV operations, and a Stack-PIM for quantization/dequantization and softmax. This broader and deeper integration allows PIM to handle more complex parts of the attention mechanism across both prefill and decoding stages.
  • Quantization Strategy:

    • Traditional Quantization (e.g., W8A8): Often a fixed bit-width, which can lead to significant accuracy loss for LLMs with prominent activation outliers, or miss opportunities for higher compression.
    • SmoothQuant: Primarily focuses on INT8 quantization by smoothing outliers.
    • LLM.int8(): Uses vector-wise quantization with FP16 outliers.
    • PLAIN's Innovation: PLAIN proposes a novel Locality-Aware Adaptive Quantization algorithm (MixQ-PIM) that determines the optimal precision (INT4 or INT8) at a fine-grained token-wise (for activations) and channel-wise (for weights) granularity. This is driven by an entropy-based heuristic combined with a hardware-aware cost function. Unlike methods that rely purely on algorithmic metrics or reinforcement learning, PLAIN explicitly considers hardware mapping and overhead, leading to a better balance of accuracy and efficiency tailored for PIM. It also supports 6-bit quantization, which is often optimal for LLMs but lacks native hardware support, by effectively mapping it to 4-bit operations.
  • Software/Hardware Co-design & Workload Management:

    • Existing Frameworks: Many quantization frameworks are not designed with DRAM-PIM constraints in mind, leading to suboptimal performance or inability to fully leverage PIM's capabilities.
    • PLAIN's Innovation: PLAIN is a holistic software/hardware co-design. It introduces a sophisticated workload-aware dataflow scheduler and overlapping mechanisms (weight loading, quantization/communication, softmax/communication).
      • Bit-wise splitting: PLAIN's scheduler cleverly splits INT8 tokens into two INT4 segments for uniform processing by 4-bit multipliers across PIMs, addressing the challenge of mixed-precision load balancing without requiring variable-bit-width PIM units.

      • Schedule Table: It uses a Bank-PIM schedule table to manage bank-level parallelism and minimize conflicts.

      • Overlap: Its overlapping strategies hide the latency of communication, quantization, and softmax by parallelizing them with computation, which is crucial for achieving practical speedups in a heterogeneous system. This level of fine-grained, hardware-aware scheduling and overlapping is a key differentiator for PLAIN.

        In essence, PLAIN differentiates itself by offering a more deeply integrated and optimized solution for LLM inference on PIM. It provides a specialized quantization algorithm that is hardware-aware, a more comprehensive partitioning of Transformer operations between GPU and PIM, and an intelligent scheduler that maximizes the utilization of PIM's internal bandwidth by unifying mixed-precision computations and overlapping various overheads.

4. Methodology

4.1. Principles

The core idea of PLAIN is an algorithm-architecture co-design that optimizes Large Language Model (LLM) inference by leveraging the unique characteristics of DRAM-based Processing-in-Memory (DRAM-PIM) and mixed-precision quantization. The theoretical basis and intuition behind PLAIN are rooted in two observations:

  1. Memory Wall Bottleneck: LLM inference, especially during the decoding phase, is often memory-bound, meaning performance is limited by memory bandwidth rather than computational throughput. DRAM-PIM architectures inherently offer significantly higher internal memory bandwidth compared to external memory interfaces.

  2. Quantization for Efficiency and Outliers: Quantization reduces memory footprint and computational cost by lowering data precision. However, LLMs are sensitive to quantization, particularly due to activation outliers. Mixed-precision quantization can mitigate accuracy loss by adaptively assigning different bit-widths, but it must be carefully designed to align with hardware capabilities for practical speedups.

    PLAIN's approach is to:

  • Locality-Aware Quantization: Exploit the observation that not all parts of an LLM are equally sensitive to precision reduction. By using a locality-aware adaptive quantization algorithm, PLAIN can assign lower precision to less critical components (based on entropy) while retaining higher precision for sensitive ones, without expensive retraining. This is hardware-aware to ensure efficient mapping.

  • Heterogeneous Workload Partitioning: Identify that attention layers are primarily memory-bound (benefiting from PIM's high bandwidth), while Fully Connected (FC) layers are often compute-bound (benefiting from GPU's powerful compute capabilities). PLAIN strategically offloads attention layer computation to PIM and keeps FC layer computation on the GPU, maximizing the strengths of each.

  • Workload Balancing and Overlapping: Address the challenges introduced by mixed-precision computations and heterogeneous execution. A sophisticated workload-aware dataflow scheduler ensures balanced utilization of PIM units, even with varying bit-widths, and employs overlapping techniques to hide the latency of data movement, quantization, and softmax operations.

    By combining these principles, PLAIN aims to achieve optimal trade-offs between inference cost and model quality, overcoming the limitations of both conventional GPU systems and existing DRAM-PIM solutions.

4.2. Core Methodology In-depth (Layer by Layer)

PLAIN's methodology comprises a MixQ-PIM algorithm for quantization and a dedicated hardware architecture (PLAIN Hardware Architecture) with a sophisticated Schedule Design and Overlapping mechanisms.

4.2.1. MixQ-PIM Algorithm

The MixQ-PIM algorithm is designed to enable DRAM-PIM-friendly quantization, balancing accuracy and hardware efficiency.

4.2.1.1. Quantization Granularity

To address the challenge of outliers in LLM activations, which typically hinder low-bit quantization, PLAIN employs SmoothQuant [9]. SmoothQuant works by remapping activation outliers to weights via per-channel scaling. This process results in more balanced and compressible distributions for both activations and weights.

The transformation achieved by SmoothQuant can be expressed as: $ \mathbf{Y} = (\mathbf{X}\,\mathrm{diag}(\mathbf{s})^{-1}) \cdot (\mathrm{diag}(\mathbf{s})\,\mathbf{W}) = \hat{\mathbf{X}}\hat{\mathbf{W}} $

  • $\mathbf{Y}$: The output of the layer, typically a matrix.

  • $\mathbf{X}$: The input activation tensor (before smoothing).

  • $\mathbf{W}$: The weight tensor (before smoothing).

  • $\mathbf{s}$: A per-channel scaling factor vector; $\mathrm{diag}(\mathbf{s})$ is the diagonal matrix with the elements of $\mathbf{s}$ on its diagonal.

  • $\mathrm{diag}(\mathbf{s})^{-1}$: The inverse of the scaling matrix, applied to $\mathbf{X}$. This effectively "descales" (divides) the activation values.

  • $\mathrm{diag}(\mathbf{s})\mathbf{W}$: The scaling matrix applied to $\mathbf{W}$. This "scales" (multiplies) the weight values.

  • $\hat{\mathbf{X}}$: The "smoothed" activation tensor after descaling by $\mathbf{s}$.

  • $\hat{\mathbf{W}}$: The "smoothed" weight tensor after scaling by $\mathbf{s}$.

    This equation shows that the scaling factors are moved from the activations to the weights. The operation makes the distributions of $\hat{\mathbf{X}}$ and $\hat{\mathbf{W}}$ more amenable to low-bit quantization by reducing the dynamic range of activations without altering the final output $\mathbf{Y}$. The paper states that this smoothing process reorganizes low-error elements into distinct channels, which makes mixed-precision quantization more hardware-efficient.
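
The following minimal NumPy sketch illustrates the SmoothQuant identity above: moving a per-channel scale from the activations to the weights leaves the layer output unchanged. The tensors and scaling factors are random placeholders; deriving $\mathbf{s}$ from activation/weight statistics, as SmoothQuant actually does, is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))     # activations, shape (tokens, channels)
W = rng.standard_normal((8, 16))    # weights, shape (channels, out_features)

# Per-channel smoothing factors (in SmoothQuant these are derived from
# activation/weight magnitudes; here they are arbitrary positive values).
s = rng.uniform(0.5, 2.0, size=8)

X_hat = X / s                       # X · diag(s)^(-1): descale each channel
W_hat = s[:, None] * W              # diag(s) · W: scale the matching weight rows

# The layer output is mathematically unchanged by the re-scaling.
assert np.allclose(X @ W, X_hat @ W_hat)
print("max abs difference:", np.abs(X @ W - X_hat @ W_hat).max())
```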

PLAIN then applies two specific granularity schemes for quantization that align with DRAM-PIM's dataflow:

  • Token-wise granularity for activations: Each input token's activations are quantized independently. This is suitable for the sequential nature of token generation in LLMs.

  • Channel-wise granularity for weights: Weights are quantized based on their output channels. This allows for fine-grained control over weight precision and is compatible with how weights are typically processed in matrix multiplications.

    The following figure (Figure 3 from the original paper) illustrates the effect of SmoothQuant on quantization error:

    Fig. 3: Heat maps of the INT8 quantization error of LLaMA-2-7B weights before (left) and after (right) SmoothQuant. Quantization error is localized, and its distribution shifts and becomes more concentrated after smoothing, so regions with smaller errors can tolerate lower precision.

As seen in the heat map, SmoothQuant localizes quantization errors and redistributes the value range, allowing for more flexible precision assignment.

4.2.1.2. Quantization Configuration Searching

To determine the optimal bit-width (either INT4 or INT8) for different parts of the model, PLAIN employs a lightweight, entropy-based heuristic rather than computationally expensive retraining or reinforcement learning. This heuristic considers both quantization error and hardware cost.

a) Weight Precision Entropy: The quantization error for weights is quantified using KL-divergence (Kullback-Leibler divergence), which measures how one probability distribution diverges from a second, expected probability distribution. In this context, it measures the information loss when quantizing: $ \mathrm{Entropy}_i = \mathcal{D}_{\mathrm{KL}}(\mathbf{W}_i^{\mathrm{FP}} \parallel \mathbf{W}_i^{\mathrm{INT}}) $

  • $\mathrm{Entropy}_i$: The KL-divergence for the $i$-th channel-wise block of weights.

  • $\mathcal{D}_{\mathrm{KL}}(\cdot \parallel \cdot)$: The Kullback-Leibler divergence function.

  • $\mathbf{W}_i^{\mathrm{FP}}$: The original, high-precision (e.g., FP16) distribution of the $i$-th weight block.

  • $\mathbf{W}_i^{\mathrm{INT}}$: The quantized (e.g., INT4 or INT8) distribution of the $i$-th weight block.

  • $i$: Denotes a specific channel-wise block of weights. Weights are partitioned into these blocks to provide sufficient statistical support for the entropy calculation.

    This approach allows for fine-grained mixed precision for weights without retraining.
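
A minimal sketch of the weight-entropy computation is shown below: a channel block is quantized and dequantized, and the KL divergence between the histograms of the original and quantized values is reported. The symmetric round-to-nearest quantizer and the histogram binning are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def quantize_dequantize(w, bits):
    """Symmetric per-block quantization followed by dequantization (assumed quantizer)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale).clip(-qmax, qmax) * scale

def kl_divergence(p, q, eps=1e-8):
    # Normalize histogram counts into probability distributions.
    p = p + eps; q = q + eps
    p = p / p.sum(); q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def weight_entropy(w_fp, bits, n_bins=128):
    """D_KL between the FP and quantized value distributions of one channel block."""
    w_int = quantize_dequantize(w_fp, bits)
    lo, hi = w_fp.min(), w_fp.max()
    p, _ = np.histogram(w_fp, bins=n_bins, range=(lo, hi))
    q, _ = np.histogram(w_int, bins=n_bins, range=(lo, hi))
    return kl_divergence(p.astype(float), q.astype(float))

rng = np.random.default_rng(0)
block = rng.standard_normal(4096)           # one channel-wise weight block
print("INT8 entropy:", weight_entropy(block, 8))
print("INT4 entropy:", weight_entropy(block, 4))  # larger divergence expected
```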

b) Simplified Activation Entropy: Runtime distribution fitting for activations is impractical. Therefore, a simplified proxy based on data-range scaling is introduced for activations: $ \mathrm{SimpleEntropy}_t = \mathrm{sigmoid}(|\mathbf{A}^{\mathrm{FP}}|) \times S_t $, where $ S_t = \frac{\max(\mathbf{A}_t^{\mathrm{FP}}) - \min(\mathbf{A}_t^{\mathrm{FP}})}{\max(\mathbf{A}_t^{\mathrm{INT}}) - \min(\mathbf{A}_t^{\mathrm{INT}})} $

  • $\mathrm{SimpleEntropy}_t$: The simplified entropy proxy for the $t$-th token's activations.

  • $\mathrm{sigmoid}(\cdot)$: The sigmoid activation function, which squashes values between 0 and 1.

  • $|\mathbf{A}^{\mathrm{FP}}|$: The absolute average value of the FP16 activation tensor (before quantization).

  • $S_t$: An affine factor that scales the data range.

  • $\max(\mathbf{A}_t^{\mathrm{FP}})$ and $\min(\mathbf{A}_t^{\mathrm{FP}})$: The maximum and minimum values of the FP16 activation tensor for the $t$-th token.

  • $\max(\mathbf{A}_t^{\mathrm{INT}})$ and $\min(\mathbf{A}_t^{\mathrm{INT}})$: The maximum and minimum values of the quantized activation tensor for the $t$-th token (e.g., INT4 or INT8).

  • $t$: Indexes the tokens, as activations use token-wise granularity.

    This simplified proxy aims to capture the sensitivity of activations to quantization by considering their dynamic range. The operation can be efficiently supported by hardware during result collection.
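
The proxy can be sketched as follows, assuming a symmetric round-to-nearest quantizer (the paper does not specify the exact quantizer); the sigmoid of the mean absolute activation is combined with the FP-to-INT range ratio $S_t$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def simple_entropy(a_fp, bits):
    """Simplified activation-sensitivity proxy for one token (sketch of the formula above).

    The quantizer below is an assumed symmetric round-to-nearest scheme; the paper
    only requires the min/max of the FP and quantized tensors.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(a_fp).max() / qmax
    a_int = np.round(a_fp / scale).clip(-qmax, qmax)
    s_t = (a_fp.max() - a_fp.min()) / (a_int.max() - a_int.min())
    return sigmoid(np.abs(a_fp).mean()) * s_t

rng = np.random.default_rng(0)
token_act = rng.standard_normal(4096) * 3.0   # activations for one token
print("INT8 proxy:", simple_entropy(token_act, 8))
print("INT4 proxy:", simple_entropy(token_act, 4))
```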

c) Hardware-aware Configuration Searching: To incorporate the hardware cost of different bit-widths, a byte-level cost function $\mathcal{C}$ is defined: $ \mathcal{C} = \mathcal{C}_{\mathbf{W}} + \mathcal{C}_{\mathbf{A}} = \mathrm{N}_{\mathbf{W}} \times \mathbf{B}_{\mathbf{W}}^{\mathrm{INT}} + \mathrm{N}_{\mathbf{A}} \times \mathbf{B}_{\mathbf{A}}^{\mathrm{INT}} $

  • $\mathcal{C}$: The total byte-level cost.

  • $\mathcal{C}_{\mathbf{W}}$: The cost associated with weights.

  • $\mathcal{C}_{\mathbf{A}}$: The cost associated with activations.

  • $\mathrm{N}_{\mathbf{W}}$: The size of the weight tensor (e.g., number of parameters).

  • $\mathbf{B}_{\mathbf{W}}^{\mathrm{INT}}$: The chosen bit-width for weights after quantization (e.g., 4 bits or 8 bits).

  • $\mathrm{N}_{\mathbf{A}}$: The size of the activation tensor.

  • $\mathbf{B}_{\mathbf{A}}^{\mathrm{INT}}$: The chosen bit-width for activations after quantization.

    The paper notes that weights can be loaded offline to PIMs, so their cost is primarily for calculation. Activations, however, often require movement between different dies during inter-HBM communication, incurring significant communication cost. To reflect this, a square term is added to the formula (though not explicitly shown in the provided snippet, it is mentioned in the text).

The final bit-width for each component (weights and activations) is selected by minimizing a unified loss function: $ \mathcal{L}_{\mathrm{MixQ}}^{\mathrm{INT}} = \mathrm{Entropy} - \varsigma\,\mathcal{C} $

  • $\mathcal{L}_{\mathrm{MixQ}}^{\mathrm{INT}}$: The final loss function to be minimized for integer quantization.

  • Entropy: Refers to either $\mathrm{Entropy}_i$ for weights or $\mathrm{SimpleEntropy}_t$ for activations, depending on the tensor type.

  • $\varsigma$: A scaling factor that adjusts the relative magnitude and importance between the entropy (accuracy loss) and the hardware cost.

    The algorithm compares the loss for INT4 and INT8 (and potentially INT6, which is mapped to INT4/INT8 operations) to choose the optimal precision for each component.
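
A literal transcription of this selection rule, with illustrative entropy values and a hypothetical scaling factor $\varsigma$ (named `sigma` below), might look like the sketch that follows; none of the numeric values are taken from the paper.

```python
def byte_cost(n_weight: int, bits_w: int, n_act: int, bits_a: int) -> int:
    """Cost C = N_W * B_W + N_A * B_A (kept in bits here for simplicity)."""
    return n_weight * bits_w + n_act * bits_a

def mixq_loss(entropy: float, cost: float, sigma: float) -> float:
    """Unified loss L = Entropy - sigma * C, as written above."""
    return entropy - sigma * cost

# Illustrative numbers: entropies could come from the KL-divergence sketch above;
# sigma is a hypothetical value balancing accuracy loss against byte cost.
n_w, n_a, sigma = 4096 * 4096, 4096, 1e-10
for bits, entropy in [(4, 0.12), (8, 0.01)]:
    loss = mixq_loss(entropy, byte_cost(n_w, bits, n_act=n_a, bits_a=bits), sigma)
    print(f"INT{bits}: loss = {loss:.4f}")
# The precision with the smaller loss is selected for this block.
```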

4.2.1.3. Inference Process in PLAIN

The inference process in PLAIN, as shown in Figure 4, integrates the MixQ-PIM algorithm with the specialized hardware. Let's trace the processing of two tokens (token0 and token1):

  1. Weight Loading: Each PIM unit (specifically, Bank-PIM units, described in the hardware section) first loads the weights of the current LLM layer. These weights are often pre-loaded or streamed efficiently.

  2. Token Quantization & Dispatch: Incoming tokens are quantized using mixed precision. For example, token0 might be quantized to INT4 (requiring less computation), while token1 is quantized to INT8 (requiring more computation due to higher precision). To balance the workload across the PIM units, token1 (the INT8 token) is split and dispatched to multiple PIM groups (e.g., three groups in the figure).

  3. Computation in Bank-PIMs:

    • The split tokens and weights are processed by the Bank-PIMs. These units contain 4-bit multipliers and adder trees.
    • The computations, primarily GEMM/GEMV operations, generate INT32 outputs. This INT32 precision is crucial for safely supporting accumulation and partial sum fusion without overflow during intermediate calculations.
  4. QKV Projection: The query (Q), key (K), and value (V) matrices are then generated by multiplying the input tokens with their respective weight matrices ($W_Q$, $W_K$, $W_V$). These matrices are quantized separately. The value matrix is quantized along a different dimension due to the GEMM layout, optimizing for subsequent operations.

  5. Attention Score Calculation: The $Q$ and $K$ matrices are loaded into PIM for activation computation (calculating $QK^T$). The results of this dot product form the attention scores.

  6. Softmax Operation: After the attention score computation, the softmax operation is performed on the Stack-PIM (described below). This normalizes the scores.

  7. Attention Context Calculation: The softmax results (normalized attention scores) are then multiplied with the $V$ (value) matrix to compute the attention context.

  8. Dequantization: Finally, a dequantization step is performed to convert the accumulated INT32 results back to FP16 or a suitable higher precision for further processing or output.

  9. Iterative Process: Each attention layer typically undergoes three Quant-Dequant stages (QKV projection, attention score, attention context). PLAIN is designed to support these operations with minimal overhead.

    The following figure (Figure 4 from the original paper) depicts this execution dataflow:

    Fig. 4: The execution dataflow of PLAIN. The schematic shows the weight-activation computation (QKV projection) and the activation-activation computation (QK matrix multiplication), with multiple Bank-PIMs and their dataflow connections, covering quantization, tokens, buffers, and the more complex computations.
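
To illustrate one Quant-Dequant stage numerically, the following NumPy sketch quantizes tokens and weights, performs the integer GEMM with INT32 accumulation (the role of the Bank-PIM adder trees), and dequantizes with the stored scaling factors. The symmetric per-tensor quantizer is a simplification for the sketch; PLAIN uses token-wise and channel-wise granularity.

```python
import numpy as np

def quantize(x, bits):
    """Symmetric per-tensor quantization (assumed scheme): returns int values and scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.round(x / scale).clip(-qmax, qmax).astype(np.int32)
    return q, scale

# One Quant-Dequant stage, e.g. a QKV projection: token x weight.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((2, 64)).astype(np.float32)   # token0, token1
weight = rng.standard_normal((64, 64)).astype(np.float32)

q_act, s_act = quantize(tokens, bits=4)   # low-precision activations (per-tensor here)
q_w, s_w = quantize(weight, bits=8)       # higher-precision weights (per-tensor here)

# Integer GEMM with INT32 accumulation, followed by dequantization using the
# stored scaling factors (the step the Stack-PIM performs after result collection).
acc_int32 = q_act @ q_w
output_fp = acc_int32.astype(np.float32) * (s_act * s_w)

print("max abs error vs. FP reference:",
      np.abs(output_fp - tokens @ weight).max())
```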

4.2.2. PLAIN Hardware Architecture

PLAIN is a heterogeneous system integrated into the HBM (High-Bandwidth Memory) stack, co-existing with an XPU (e.g., GPU or NPU).

4.2.2.1. Architecture Overview

The PLAIN architecture consists of a host CPU, an XPU, multiple HBM stacks, and PLAIN-enabled memory.

  • XPU & PIM Interaction: When the XPU processes non-attention layers (typically Fully Connected layers), PLAIN functions as conventional HBM memory, minimizing communication overhead. Intermediate results are written back to memory, allowing the GPU to overlap computation with memory operations and enhance request-level parallelism. For attention layers, PLAIN's in-memory compute capabilities are activated.
  • HBM Stack Structure: Each HBM stack comprises eight 3D-stacked DRAM dies and a buffer die connected via Through-Silicon Vias (TSVs). PLAIN adds compute logic with minimal DRAM changes, supporting both standard DRAM and PIM modes.
  • Two Types of PIM Units:
    • Bank-PIM Units:
      • Placement: Located between DRAM banks and connected through I/O boundaries within each DRAM die.
      • Components: Each Bank-PIM contains 16 × 256-bit register files, 4-bit multipliers, an adder tree, and a 32-bit output register.
      • Data Flow: Data is read from odd/even bank row buffers via 512-bit buses. Computation results are sent to the buffer die through a dedicated result bus.
      • Operations: Bank-PIMs execute quantized GEMM/GEMV operations for the three phases of the attention layer:
        1. QKV projection (token × weight)
        2. Attention score (query × key)
        3. Attention context (score × value)
      • Conflict Avoidance: Each Bank-PIM reads operands from separate banks to prevent access conflicts, and computation is scheduled based on the first operand in the matrix multiplication.
    • Stack-PIM Unit:
      • Placement: Located on the buffer die (one per HBM stack).

      • Components: Includes buffers for scaling factors and scheduling tables, along with specialized softmax, quantization, and dequantization units.

      • Functionality: Manages quantization, dequantization, softmax, data accumulation, and scheduling. Dequantization units scale accumulated results and forward them to softmax or quantization units. Quantization units apply the MixQ-PIM algorithm and dispatch outputs to Bank-PIMs or store scaling factors in buffer memory. Softmax is performed using the maximum value from quantized inputs.

      • Control: Unlike Bank-PIMs, Stack-PIMs integrate control and computation. The host CPU, via the DRAM controller, coordinates execution using the scheduling table. Stack-PIM handles quantization before Bank-PIM computation and dequantization after result accumulation. The softmax step is performed after the attention score stage, completing one attention pass with minimal overhead.

        The following figure (Figure 5 from the original paper) illustrates the PLAIN hardware architecture:

        Fig. 5: PLAIN hardware architecture. The schematic shows the components of the DRAM-PIM system, including the host CPU, PIM modules, memory units, and the scheduler, which together aim to optimize compute efficiency and memory bandwidth during deep learning inference.

4.2.2.2. Schedule Design

The scheduling algorithm for PLAIN units is crucial for maximizing compute resource utilization and system throughput. It leverages DRAM's bank-level parallelism to saturate internal bandwidth and minimize execution stalls. A key challenge is managing the uneven computational loads introduced by mixed-precision quantization (e.g., INT8 vs. INT4 tokens).

a) Bit-wise Splitting Token: To unify the computation model across different precision levels and ensure load balance, PLAIN adopts a bit-wise splitting strategy.

  • Mechanism: As illustrated in Figure 6, 8-bit activations are split into two 4-bit segments: a high 4-bit segment and a low 4-bit segment.

  • Processing: Each 4-bit segment is then independently processed by GEMV operations using the 4-bit multipliers in the Bank-PIMs.

  • Weight Support: While weights are quantized at channel granularity using mixed precisions, only minor hardware support (e.g., shifters) is needed to handle both 4-bit and 8-bit weight computations.

  • Accumulation: All intermediate results are stored in 32-bit registers, allowing partial results to be shifted and accumulated on the buffer die (via Stack-PIM) without overflow.

  • Benefits: This approach enables a uniform Bank-PIM architecture capable of handling different precision levels efficiently, maintaining load balance by distributing the workload of higher-precision tokens.

    The following figure (Figure 6 from the original paper) illustrates the bit-wise splitting strategy:

    Fig. 6: Assuming weight and activation vectors are INT8, PLAIN performs a bit-wise split to convert the INT8 vector into two INT4 vectors, distributing them across multiple PIMs. After calculation, Stack-PIM shifts and accumulates the results in the buffer die, producing the 32-bit adder output.
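
The following sketch demonstrates the arithmetic behind bit-wise splitting: an INT8 activation vector is split into a signed high nibble and an unsigned low nibble, each nibble is reduced against the weights separately, and the partial sums are recombined by a 4-bit shift and accumulation. The specific nibble encoding is an assumption; the paper only states that INT8 values are split into two 4-bit segments.

```python
import numpy as np

def split_int8_to_int4(q):
    """Split signed INT8 values so that q = high * 16 + low,
    with low an unsigned nibble in [0, 15] and high a signed nibble (assumed encoding)."""
    low = q & 0xF              # low nibble, unsigned
    high = (q - low) // 16     # exact division keeps the sign in the high part
    return high, low

rng = np.random.default_rng(0)
act_int8 = rng.integers(-128, 128, size=256, dtype=np.int64)  # quantized activations
w_int4 = rng.integers(-8, 8, size=256, dtype=np.int64)        # 4-bit weights

high, low = split_int8_to_int4(act_int8)

# Each 4-bit segment is processed independently (e.g. on different Bank-PIMs);
# the partial sums are then recombined with a 4-bit shift (x16) and accumulated
# in a 32-bit register, as the Stack-PIM does on the buffer die.
partial_high = int(np.dot(high, w_int4))
partial_low = int(np.dot(low, w_int4))
recombined = partial_high * 16 + partial_low   # equivalent to a 4-bit left shift

assert recombined == int(np.dot(act_int8, w_int4))
print("dot products match:", recombined)
```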

b) Bank-PIM Schedule Table: To maintain high throughput between Bank-PIMs and Stack-PIM and mitigate conflicts among banks, a schedule table is introduced.

  • Location: Stored on the buffer die and integrated into the DRAM controller.

  • Structure: The table is sorted by Bank ID, which indexes banks sequentially. This ordering is consistent with the accumulation and scaling factor buffers, allowing dequantization units to access parameters with $O(1)$ (constant-time) lookup.

  • Contents: Each entry includes Bank ID, PIM ID (interleaved order of banks), Group ID (dynamically assigned at runtime), Token ID, Token Type (e.g., INT8-HIGH, INT4), and Scaling Factor.

  • Purpose: This table manages the assignment of computational tasks to specific Bank-PIMs and helps coordinate data flow, especially for mixed-precision operations.

    c) Bank-PIM Schedule Algorithm: The scheduling algorithm aims to balance workloads across Bank-PIMs and minimize communication with Stack-PIM.

  • Output Row Assignment: Each Bank-PIM is assigned a full or partial output row to avoid extra accumulation steps in Stack-PIM. At least one token is processed per Bank-PIM based on GEMM/GEMV principles.

  • Parallelism for INT8 Tokens: To increase parallelism, each INT8 token is split into two INT4 tokens and distributed to separate Bank-PIMs.

  • Greedy Scheduling for Short Inputs: For short input lengths, particularly in the generation phases, token-level partitioning might be insufficient. PLAIN employs a greedy scheduling strategy that further partitions the computation matrix. Tokens are split, and Bank-PIMs are grouped based on the number of 4-bit tokens. Each group then jointly computes a token using channel-wise partitioning. This ensures balanced utilization and efficient mapping across PIMs.

  • QKV Generation: Weights are preloaded into all Bank-PIMs during QKV generation. While distributing work increases activation communication, it significantly reduces compute latency.

  • Score and Context Stages: In the score and context stages, communication (especially for $K$ and $V$) can dominate. However, PLAIN benefits from reduced per-bank compute time and overlapping techniques. For LLM generation with a KV cache, redistribution of the $K$ matrix is a one-time cost, but the $V$ matrix requires re-quantization and dynamic distribution. Overlapping schemes (discussed next) mitigate this cost.
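
The scheduling idea can be sketched as a simple greedy load balancer: every INT8 token contributes two INT4 work units, and each unit is assigned to the currently least-loaded Bank-PIM. The entry fields mirror the schedule table described above; the uniform per-unit cost and the data structures are assumptions for illustration.

```python
import heapq

def schedule_tokens(token_types, n_pims):
    """Greedy load-balancing sketch: split every INT8 token into two INT4 work units
    and assign each unit to the currently least-loaded Bank-PIM.
    Returns schedule entries resembling the Bank-PIM schedule table fields."""
    units = []
    for tid, ttype in enumerate(token_types):
        if ttype == "INT8":
            units += [(tid, "INT8-HIGH"), (tid, "INT8-LOW")]
        else:
            units += [(tid, "INT4")]

    # Min-heap of (current_load, pim_id); every INT4 work unit costs one slot.
    heap = [(0, pim) for pim in range(n_pims)]
    heapq.heapify(heap)
    schedule = []
    for tid, ttype in units:
        load, pim = heapq.heappop(heap)
        schedule.append({"pim_id": pim, "token_id": tid, "token_type": ttype})
        heapq.heappush(heap, (load + 1, pim))
    return schedule

for entry in schedule_tokens(["INT4", "INT8", "INT4", "INT8"], n_pims=3):
    print(entry)
```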

4.2.2.3. Overlapping And Parallelism

To overcome the potential bottleneck of sequential execution and fully exploit bank-level parallelism, PLAIN employs several overlapping mechanisms. These are possible due to:

  1. Data Independence: Different components process disjoint data subsets.

  2. Command Isolation: DRAM and PIM commands avoid conflicts on the C/A (Command/Address) bus.

  3. Exclusive Execution: Each bank operates exclusively in DRAM or PIM mode at any moment.

    The following figure (Figure 8 from the original paper) illustrates the overlapping timeline:

    Fig. 8: Overlapping timeline of different components in the PLAIN system. B-PIM: Bank-PIM; S-PIM: Stack-PIM; Com: communication; Cal: calculation; Soft: Softmax. The chart shows the overlap of weight loading, quantization with communication, and softmax with communication, which together improve LLM inference efficiency.

a) Weight Loading Overlap (Figure 8a):

  • Problem: For QKV generation, weights are preloaded into Bank-PIMs. Large models can exceed a single bank's capacity, causing stalls during weight loading.

  • Solution: An overlapping scheme is used. While one bank (bank 0) is actively computing, the other bank (bank 1) within the same Bank-PIM concurrently loads weights, and vice versa.

  • Mechanism: GEMM operations are decomposed into GEMV to allow reuse of cached tokens in local register files, creating idle time during which weight transfers can occur. A Ping-Pong buffer scheme alternates bank roles: odd-numbered banks enter DRAM mode (for loading) while even-numbered banks enter PIM mode (for computing) between layers. For very large models, weights are distributed across multiple banks, and Bank-PIMs are grouped into virtual PIMs to scale this approach.

    b) Quantization and Communication Overlap (Figure 8b):

  • Problem: Before the attention score and context stages, activations are aggregated on the buffer die, where Stack-PIM performs de/quantization. These operations are compute-intensive.

  • Solution: These operations are overlapped with communication and computation.

  • Mechanism: Bank-PIMs send intermediate results via the result bus during computation. Stack-PIM collects these results at fixed intervals and immediately applies dequantization by accessing the corresponding scaling factor from its on-chip buffers (avoiding lookup overhead). Concurrently, it updates quantization units (e.g., min/max values). Except for the final result transfer, all de/quantization steps are hidden within the ongoing computation, minimizing their impact on the critical path.

    c) Softmax and Communication Overlap (Figure 8c):

  • Problem: The softmax operation is performed in full precision on Stack-PIM and is a sequential step between the attention score and context stages.

  • Solution: Softmax is overlapped with quantization and KV-cache communication.

  • Mechanism: During the generation phase, the $K$ and $V$ matrices are cached in PIMs. $K$ uses token-wise quantization and can be reused directly. However, the $V$ matrix is quantized channel-wise and must be re-quantized and redistributed. Stack-PIM begins to quantize $V$ concurrently while performing softmax. Once softmax is complete, the resulting attention scores are sent to the Bank-PIMs. This overlap effectively hides the latency of both softmax and $V$ quantization, further boosting end-to-end throughput.

5. Experimental Setup

5.1. Datasets

The experiments primarily utilize the WikiText-103 dataset.

  • Source and Characteristics: WikiText-103 comprises over 100 million tokens extracted from verified Good and Featured Wikipedia articles. It is a widely used benchmark for language modeling tasks, known for its diverse vocabulary and long-term dependencies.
  • Purpose: This dataset helps compute language perplexity (PPL) and evaluates the effect of the quantization algorithm on decoder-only model inference performance.
  • Data Sample (for inference): To bridge the gap between PPL calculation (where the generation phase is absent) and actual inference, the dataset is split into conversation-length segments. These segments are then used as prompts in text-generation tasks, simulating real-world dialogue scenarios. For example, a segment might be "The quick brown fox jumps over the lazy dog." and the model is prompted to continue the sentence.

5.2. Evaluation Metrics

The paper uses several metrics to evaluate PLAIN's performance:

  1. Perplexity (PPL):

    • Conceptual Definition: Perplexity is a common metric used to evaluate the performance of language models. It quantifies how well a probability model predicts a sample. A lower perplexity score indicates that the model is better at predicting the next word in a sequence, suggesting higher accuracy and better generalization. In essence, it measures the model's "surprise" by the actual sequence of words; less surprise means better prediction.
    • Mathematical Formula: The perplexity of a language model $P$ on a test set (or sequence) $W = (w_1, w_2, \dots, w_N)$ is calculated as: $ \mathrm{PPL}(W) = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i | w_1, \dots, w_{i-1})\right) $
    • Symbol Explanation:
      • $\mathrm{PPL}(W)$: The perplexity score for the sequence $W$.
      • $\exp(\cdot)$: The exponential function (base $e$).
      • $N$: The total number of tokens (words) in the sequence $W$.
      • $\sum_{i=1}^{N}$: Summation over all tokens from $i=1$ to $N$.
      • $\log$: The natural logarithm.
      • $P(w_i | w_1, \dots, w_{i-1})$: The probability assigned by the language model $P$ to the $i$-th token $w_i$, given all the preceding tokens $w_1, \dots, w_{i-1}$. (A minimal numeric sketch of the PPL computation appears after this metrics list.)
  2. Speedup:

    • Conceptual Definition: Speedup measures the performance improvement of a new system or method relative to a baseline. It quantifies how many times faster the new approach is compared to the reference.
    • Mathematical Formula: $ \mathrm{Speedup} = \frac{\mathrm{Execution\ Time}_{\mathrm{Baseline}}}{\mathrm{Execution\ Time}_{\mathrm{PLAIN}}} $
    • Symbol Explanation:
      • $\mathrm{Speedup}$: The performance gain.
      • $\mathrm{Execution\ Time}_{\mathrm{Baseline}}$: The time taken by the baseline system to complete a task.
      • $\mathrm{Execution\ Time}_{\mathrm{PLAIN}}$: The time taken by the PLAIN system to complete the same task.
  3. Energy Consumption:

    • Conceptual Definition: Energy consumption measures the total electrical energy used by the system to perform a given task. Lower energy consumption indicates higher energy efficiency, which is critical for deployment in data centers and edge devices due to operational costs and environmental impact.
    • Mathematical Formula: While the paper does not provide an explicit formula for energy consumption, it is typically calculated as the integral of power over time for various components: $ \mathrm{Energy} = \sum_{j} \int_{t_0}^{t_f} P_j(t) \, dt $
    • Symbol Explanation:
      • $\mathrm{Energy}$: Total energy consumed.
      • $\sum_{j}$: Summation over all active components $j$ in the system (e.g., CPU, GPU, PIM, DRAM).
      • $\int_{t_0}^{t_f} P_j(t) \, dt$: The integral of instantaneous power $P_j(t)$ consumed by component $j$ over the execution time period $[t_0, t_f]$.
      • The paper calculates energy and area using Verilog for arithmetic units, Synopsys Design Compiler for synthesis, and Cacti 7.0 for buffers, also factoring in HBM3 operations.
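To make the perplexity metric concrete, the following is a minimal sketch of how PPL can be computed for a causal language model with the Hugging Face transformers API. The model name and the non-overlapping 1024-token windows are illustrative assumptions (the window length matches the stride reported in Section 6.1.1); this is not the paper's exact evaluation harness.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; the paper evaluates GPT-2, OPT, and LLaMA-2 variants.
model_name = "gpt2-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def perplexity(text: str, window: int = 1024) -> float:
    """PPL(W) = exp(-1/N * sum_i log P(w_i | w_<i)), over non-overlapping windows.

    The first token of each window is used only as context, a standard
    approximation when the corpus is longer than the model's context length.
    """
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    total_nll, total_targets = 0.0, 0
    with torch.no_grad():
        for start in range(0, len(ids), window):
            chunk = ids[start : start + window].unsqueeze(0)
            if chunk.size(1) < 2:  # a single leftover token has no prediction target
                break
            # With labels, the model returns the mean token-level negative log-likelihood.
            out = model(chunk, labels=chunk)
            n_targets = chunk.size(1) - 1
            total_nll += out.loss.item() * n_targets
            total_targets += n_targets
    return math.exp(total_nll / total_targets)

print(perplexity("The quick brown fox jumps over the lazy dog. " * 100))
```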

5.3. Baselines

To evaluate PLAIN, the authors compare it against three representative baseline systems:

  1. GPU (FP16):

    • Description: This baseline represents conventional LLM inference on a powerful, high-end GPU using FP16 (half-precision floating-point) arithmetic. It uses standard Huggingface and PyTorch implementations.
    • Hardware: An Nvidia A100 GPU is used for measuring end-to-end latency (an illustrative measurement sketch follows this list).
    • Representativeness: This is the de-facto standard for high-performance LLM inference and serves as a benchmark for raw computational power and unquantized accuracy.
  2. SmoothQuant (W8A8):

    • Description: This baseline implements SmoothQuant [9], a state-of-the-art post-training quantization technique for LLMs. It quantizes model weights and activations to INT8 (W8A8, meaning 8-bit weights and 8-bit activations).
    • Hardware: It utilizes the Nvidia A100 GPU's tensor cores, which are specialized hardware units for low-precision matrix operations. The Cutlass library is used for compiling self-attention layers to optimize INT8 inference.
    • Representativeness: This baseline demonstrates the performance achievable with advanced INT8 quantization on modern GPUs, addressing the "accuracy vs. efficiency" trade-off within a traditional GPU architecture. It highlights the benefits of software optimizations for low-precision inference.
  3. AttAcc [8]:

    • Description: AttAcc is a heterogeneous system that leverages both GPU and PIM. It uses the GPU for the prefill stage and QKV generation (which are often compute-heavy), and offloads multi-head attention during the decoding phase (which is typically memory-bound) to PIM units.
    • Hardware Adjustment: Originally evaluated on a DGX system, its hardware configuration is adjusted in PLAIN's evaluation so that memory capacity and timing parameters are of the same magnitude as PLAIN's, ensuring a fair comparison.
    • Representativeness: This baseline represents the state-of-the-art in PIM-accelerated LLM inference, showcasing the benefits of specialized hardware for memory-bound operations. Comparing against AttAcc directly evaluates PLAIN's innovations in deeper PIM integration, mixed-precision handling, and overall system efficiency over existing PIM solutions.
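For reference, below is a minimal sketch of how end-to-end latency for the GPU-FP16 baseline might be measured with the Hugging Face generate API, using the batch-1, 64-input/64-output configuration from the speedup experiments. The prompt, model choice, and timing code are assumptions for illustration, not the authors' measurement harness.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-6.7b"  # illustrative; any evaluated model could be used
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model = model.to("cuda").eval()

# Batch size 1; in the paper, prompts are taken from WikiText-103 segments.
prompt = "Processing-in-memory moves computation closer to where data is stored. " * 8
input_ids = tokenizer(prompt, return_tensors="pt").input_ids[:, :64].to("cuda")  # 64 input tokens

torch.cuda.synchronize()
start = time.perf_counter()
with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=64, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

print(f"generated {output.shape[1] - input_ids.shape[1]} tokens in {elapsed:.3f} s end-to-end")
```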

5.4. Simulation & Hardware Configuration

  • Simulation Environment: The authors developed an in-house simulator by modifying Ramulator2 [30], a modern, modular, and extensible DRAM simulator.
  • Experimental Stages: Due to the complexity of integrating real LLM inference with a cycle-accurate simulator, experiments are split:
    1. Precision Trace Generation: The LLM inference workload is compiled on a GPU, and the precision trace of activations (after quantization by the PLAIN algorithm) is generated.
    2. Cycle-Accurate Simulation: This precision trace is then fed into the modified Ramulator2 simulator to simulate memory access, producing cycle-accurate inference times for the PLAIN system (a simplified sketch of this two-stage flow appears after Table I below).
  • Hardware Specifications: PLAIN adds arithmetic units and buffers to a standard HBM3 memory stack. The HBM3 organization details and timing parameters are crucial for the simulation's accuracy.
  • Energy and Area Calculation:
    • Verilog is used to design and analyze the arithmetic units.

    • Synopsys Design Compiler is used for synthesis.

    • Cacti 7.0 [31] is used to estimate the energy and area consumption of buffers in both the DRAM dies and the buffer die.

    • The power consumption of standard HBM3 operations (e.g., activation, reading) is also factored in [32, 33, 34].

      The following are the results from Table I of the original paper:

HBM Organization: 4 Banks per Bank-group, 4 BGs per Pseudo channel, 8 pCHs per die
HBM Timing Parameters: Frequency = 1 GHz, tRP = 19, tRCD = 19, tRAS = 45, tRRDL = 4, tWR = 8, tCCD_S = 2, tCCD_L = 4, tREFI = 5070, tFAW = 39

LLM configuration:

| Model | Layers | Hidden_size |
| --- | --- | --- |
| GPT2-large | 36 | 1280 |
| GPT2-xl | 48 | 1600 |
| OPT-6.7b | 32 | 4096 |
| OPT-13b | 40 | 5120 |
| LLaMA-2-7b | 32 | 4096 |
| LLaMA-2-13b | 40 | 5120 |
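To illustrate the two-stage, trace-driven methodology described above, here is a minimal, hypothetical sketch. The trace record layout, names, and simulator interface are assumptions (the paper does not specify PLAIN's trace format or the modified Ramulator2 configuration schema); only the organization and timing values are taken from Table I.

```python
from dataclasses import dataclass
from typing import List

# HBM3 parameters from Table I (timing values in cycles at 1 GHz).
HBM3_ORGANIZATION = {"banks_per_bankgroup": 4, "bankgroups_per_pch": 4, "pchs_per_die": 8}
HBM3_TIMING = {
    "frequency_GHz": 1, "tRP": 19, "tRCD": 19, "tRAS": 45, "tRRDL": 4,
    "tWR": 8, "tCCD_S": 2, "tCCD_L": 4, "tREFI": 5070, "tFAW": 39,
}

@dataclass
class PrecisionTraceEntry:
    """Hypothetical record produced in stage 1, when the LLM workload runs on the
    GPU and the PLAIN quantization algorithm assigns a bit-width per tensor."""
    layer: int    # decoder layer index
    step: int     # decoding step (token position)
    tensor: str   # e.g. "K", "V", "Q", "attention_score"
    bits: int     # 4 or 8, as chosen by the quantization algorithm

def simulate_inference(trace: List[PrecisionTraceEntry]) -> int:
    """Stage 2 stand-in: the real flow replays the trace in an in-house simulator
    built on modified Ramulator2, which returns cycle-accurate inference time."""
    raise NotImplementedError("placeholder for the modified Ramulator2 backend")
```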

5.5. Models

The evaluation includes a range of Large Language Models (LLMs):

  • GPT-2 [2]: Including GPT2-large and GPT2-xl.

  • OPT [35]: Including OPT-6.7b and OPT-13b.

  • LLaMA-2 [3]: Including LLaMA-2-7b and LLaMA-2-13b.

    These models are all decoder-only generation models, which are common for text generation tasks and are the focus of PLAIN's acceleration efforts. Their sizes and configurations (number of layers and hidden size) are detailed in the LLM configuration section of Table I, indicating a comprehensive evaluation across different model scales.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results validate PLAIN's effectiveness in accelerating LLM inference with minimal quality loss and improved energy efficiency across various models.

6.1.1. Quantization Results

The paper evaluates the MixQ-PIM algorithm's impact on model quality using perplexity (PPL). A lower PPL indicates better model quality. The experiments use a strided sliding-window approach with a sequence length and stride of 1024.

The following are the results from Table II of the original paper:

| Model | GPT2-large | GPT2-xl | OPT-6.7b | OPT-13b | LLAMA2-7b | LLAMA2-13b |
| --- | --- | --- | --- | --- | --- | --- |
| FP16 | 19.42 | 17.38 | 12.42 | 10.12 | 6.45 | 5.43 |
| W4A4 | 1164.16 | 983.13 | inf | inf | inf | 6388.79 |
| W8A8 | 19.71 | 18.29 | 30.27 | 4272.0 | 6.27 | 5.52 |
| SmoothQuant (W8A8) | - | - | 12.42 | 11.90 | 6.14 | 5.50 |
| SmoothQuant (W6A6) | - | - | 15.25 | 14.73 | 9.12 | 7.80 |
| PLAIN (W6A6) | 20.56 | 18.20 | 13.48 | 9.95 | 6.91 | 5.92 |

Analysis:

  • FP16 Baseline: The FP16 perplexity values serve as the golden standard for model quality.

  • W4A4 Quantization: W4A4 (4-bit weights, 4-bit activations) quantization performs extremely poorly, leading to inf (infinite) perplexity for many models (OPT-6.7b, OPT-13b, LLAMA2-7b) and very high values for others (e.g., 1164.16 for GPT2-large). This highlights the severe accuracy degradation of aggressive low-bit quantization without proper techniques.

  • W8A8 Quantization (Naive): For smaller models like GPT2-large and GPT2-xl, W8A8 shows perplexity values very close to FP16 (e.g., 19.71 vs. 19.42 for GPT2-large). However, for larger models such as OPT-6.7b and OPT-13b, W8A8 alone leads to significant accuracy loss (30.27 for OPT-6.7b and 4272.0 for OPT-13b, compared to 12.42 and 10.12 for FP16), indicating the presence of activation outliers.

  • SmoothQuant (W8A8): SmoothQuant (W8A8) significantly improves accuracy for larger models that suffer from activation outliers. For OPT-6.7b, it achieves 12.42, matching FP16. For OPT-13b, it reduces PPL from 4272.0 to 11.90, much closer to FP16. This confirms the effectiveness of SmoothQuant in handling outliers.

  • SmoothQuant (W6A6): W6A6 (6-bit weights, 6-bit activations) is generally a good balance but lacks native hardware support. SmoothQuant (W6A6) shows higher PPL than its W8A8 counterpart, indicating that 6-bit quantization is more challenging without specific hardware optimizations.

  • PLAIN (W6A6): PLAIN demonstrates superior performance in the W6A6 configuration. Despite using a lower bit-width (6-bit for both weights and activations, likely mapped to INT4/INT8 operations in hardware), it achieves perplexity values comparable to or even better than FP16 for some models (e.g., OPT-13b at 9.95 vs. 10.12 FP16, LLAMA2-7b at 6.91 vs. 6.45 FP16). For smaller models where SmoothQuant wasn't used (GPT2-large, GPT2-xl), PLAIN's W6A6 performance is still very close to FP16 and W8A8. This indicates that PLAIN's locality-aware adaptive quantization algorithm effectively leverages the 6-bit sweet spot, balancing accuracy with high compression, and mapping it efficiently to the underlying hardware.

    Conclusion: PLAIN's quantization algorithm effectively reduces bit-widths while maintaining negligible model quality loss, even for challenging 6-bit quantization, by intelligently managing activation outliers and optimizing precision at a fine granularity.

6.1.2. Latency (Speedup)

The speedup results demonstrate PLAIN's significant performance advantages over conventional GPU and PIM baselines. The configuration uses batch size 1, input token length 64, and output token length 64.

The following figure (Figure 9 from the original paper) shows the speedup comparison:

Fig. 9: Speedup comparison results of GPU-FP16, GPU-SmoothQuant, AttAcc-PIM, and PLAIN. We use batch size 1, input token length 64, and output token length 64.

Analysis:

  • Overall Speedup: PLAIN consistently outperforms all baselines across different LLM models. It achieves a substantial 4× to 5× acceleration over GPU-FP16 inference, with an average speedup of 4.41×. For GPT2-xl, it achieves the highest speedup of 5.03×.
  • Comparison to GPU-SmoothQuant: PLAIN also shows a significant speedup compared to GPU-SmoothQuant (W8A8). This indicates that while SmoothQuant improves INT8 performance on GPUs, PLAIN's PIM-enabled mixed-precision approach provides further acceleration by leveraging internal memory bandwidth and dedicated in-memory computation.
  • Comparison to AttAcc-PIM: Crucially, PLAIN achieves a 1.69× performance boost over AttAcc-PIM, which is a state-of-the-art PIM accelerator. This highlights PLAIN's superior PIM utilization, mixed-precision scheduling, and overlapping mechanisms that allow it to extract more performance from the PIM architecture for LLMs.
  • Reasons for Improvement: The performance boost stems from:
    1. Reduced Data Bit-width: Quantization lowers the computational load and data movement.

    2. Architectural Optimizations: PLAIN's hardware-software co-design minimizes weight movement, and its Stack-PIM and Bank-PIM units efficiently handle quantized operations.

    3. Overlapping and Scheduling: The intelligent workload-aware dataflow scheduler and overlapping techniques hide latencies associated with communication, quantization, and softmax operations, maximizing hardware utilization.

      Conclusion: PLAIN significantly accelerates LLM inference, showcasing its effectiveness in translating mixed-precision quantization and PIM capabilities into real-world performance gains.

6.1.3. Energy Consumption

Energy efficiency is a critical factor for LLM deployment. The paper presents the normalized energy consumption of PLAIN compared to GPU and AttAcc.

The following figure (Figure 10 from the original paper) shows the normalized energy consumption results:

Fig. 10: Normalized energy consumption results of GPU, AttAcc and PLAIN. Experiment configuration is the same as speedup results.

Analysis:

  • Overall Reduction: PLAIN achieves a significantly greater reduction in energy consumption compared to both the GPU and AttAcc. For example, for GPT2-large, PLAIN consumes only about 20% of the energy of the GPU and significantly less than AttAcc.
  • Comparison to AttAcc: While AttAcc also shows a decrease in energy consumption compared to the GPU (due to offloading memory-bound tasks to PIM), PLAIN's reduction is substantially larger across all models.
  • Reasons for Reduction:
    1. Lower Bit-widths: PLAIN's use of lower bit-widths for weights and activations directly reduces the energy consumed per operation and the energy cost of data movement. The MixQ-PIM algorithm ensures this reduction doesn't come at a significant accuracy cost.

    2. Efficient PIM Offloading: By offloading QKV generation and attention computations entirely to PIM, PLAIN reduces heavy weight movement and communication overhead between the GPU and main memory. Processing data closer to where it's stored is inherently more energy-efficient.

    3. Optimized Dataflow and Overlapping: The efficient dataflow scheduler and overlapping mechanisms ensure high utilization of PIM units and minimize idle cycles where components might consume static power without performing useful work.

      Conclusion: PLAIN dramatically improves energy efficiency for LLM inference, making it a more sustainable and cost-effective solution for large-scale deployments.

6.1.4. Ablation Study

An ablation study helps understand the contribution of each component of PLAIN to its overall performance. The study uses the OPT-6.7B model with input/output token lengths of 64, compared to a GPU-FP16 baseline.

The following figure (Figure 11 from the original paper) shows PLAIN's ablation study results:

Fig. 11: PLAIN's ablation study on the OPT-6.7B model. kcache: key-vector cache; balance: split an INT8 token into two INT4 tokens; overlap: overlapping quantization and softmax time with communication time.

Analysis (from right to left in the chart):

  • PLAIN (Full System): The full PLAIN system achieves a speedup of approximately 4.3× over the GPU-FP16 baseline (implied, as the other bars show reductions from this configuration).

  • Without Overlapping (kcache-balance): Removing the overlapping optimizations (quantization with communication, weight loading with computation, and softmax with value communication) significantly reduces the speedup from 4.3× to 1.62×. This highlights that overlapping is the dominant factor, contributing a factor of about 2.68× to the total speedup by hiding latency and keeping the pipeline full.

  • Without Workload Balancing (kcache): Further removing workload balancing (the strategy of splitting INT8 tokens into two INT4 tokens for even distribution across PIMs; see the bit-level sketch after this list) causes a further drop in speedup, from 1.62× to 1.55×. This confirms that workload imbalance, if not addressed, leaves INT4 PIMs idle and hampers performance.

  • Without Kcache (MixQ): Removing the Kcache optimization (preloading key vectors into Bank-PIMs during prefill) reduces the speedup from 1.55× to 1.35×. While less impactful than overlapping or balancing, caching key vectors still contributes a noticeable 0.2× of speedup by reducing data movement.

  • Remaining Acceleration: The remaining 1.35× acceleration (labeled as OPT-MixQ) comes from the inherent bank-level parallelism of the PIM architecture and the basic benefits of mixed-precision quantization itself.

    Conclusion: The ablation study clearly demonstrates that all three key innovations (Kcache, workload balancing, and overlapping) are essential for PLAIN to achieve its high performance. The overlapping techniques provide the most significant boost, followed by workload balancing.
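The workload-balancing ("balance") step relies on the fact that an INT8 dot product can be reconstructed exactly from two 4-bit-wide dot products. The snippet below is a minimal functional sketch of that identity, assuming signed INT8 activations and signed INT4 weights; it illustrates the arithmetic only, not PLAIN's actual PIM datapath (which the paper describes in terms of 4-bit multipliers and shifters).

```python
import numpy as np

rng = np.random.default_rng(0)

# A signed INT8 activation vector (one "token") and a signed INT4 weight column.
x_int8 = rng.integers(-128, 128, size=16, dtype=np.int32)
w_int4 = rng.integers(-8, 8, size=16, dtype=np.int32)

# Split each INT8 value into a signed high nibble (in [-8, 7]) and an
# unsigned low nibble (in [0, 15]): x = 16 * hi + lo.
hi = x_int8 >> 4          # arithmetic shift preserves the sign
lo = x_int8 & 0xF

# Two INT4-wide dot products, recombined with a 4-bit shift. On PLAIN-style
# hardware, the two halves can be dispatched to separate 4-bit GEMV units,
# which is the load-balancing effect the "balance" bar in the ablation measures.
partial_hi = int(np.dot(hi, w_int4))
partial_lo = int(np.dot(lo, w_int4))
recombined = (partial_hi << 4) + partial_lo

reference = int(np.dot(x_int8, w_int4))
assert recombined == reference, (recombined, reference)
print("INT8 dot product reproduced from two INT4-wide halves:", recombined)
```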

6.1.5. Sensitivity Analysis

6.1.5.1. Speedup over Different Output Token Lengths

The paper analyzes how PLAIN's speedup varies with different output token lengths.

The following figure (Figure 12 from the original paper) shows PLAIN's speedup over different tokens:

Fig. 12: PLAIN's speedup over different tokens.

Analysis:

  • Short Outputs (16 tokens): For very short output lengths, the speedup is relatively limited (around 2.5×). In this scenario, the prefill phase (processing the initial prompt) dominates the execution time. The prefill phase typically involves larger GEMM operations, which are more compute-bound and thus less amenable to PIM's bandwidth advantage over the GPU's raw FLOPS.

  • Increasing Output Length (64, 128 tokens): As the output length increases, the decoding phase becomes more significant. The decoding phase involves numerous GEMV operations in the attention layers, making it a memory-bound scenario with an arithmetic intensity typically around 1 (a low compute-to-memory-access ratio; a quick numerical check appears after this subsection's conclusion). PIM architectures, with their superior internal bandwidth, are exceptionally well-suited for such memory-bound tasks. Consequently, PLAIN's performance improves with more tokens, achieving speedups of approximately 4.11× for 64 tokens and 4.91× for 128 tokens.

    Conclusion: PLAIN demonstrates higher effectiveness in memory-bound scenarios, particularly during the decoding phase of LLMs, where its PIM-centric design can fully leverage the high internal memory bandwidth.
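As a back-of-the-envelope check of the arithmetic-intensity claim, the sketch below assumes an FP16 GEMV with hidden size 4096 (matching OPT-6.7b / LLaMA-2-7b in Table I); quantized operands change the byte count, but the operation remains memory-bound.

```python
# Decode-phase attention GEMV: one new token multiplied against an n x n weight/KV tile.
n = 4096                       # hidden size of OPT-6.7b / LLaMA-2-7b (Table I)
bytes_per_element = 2          # FP16 operands (assumed for this estimate)

flops = 2 * n * n              # one multiply and one add per matrix element
bytes_moved = n * n * bytes_per_element  # the matrix itself dominates the traffic

print(f"arithmetic intensity ~= {flops / bytes_moved:.2f} FLOP/byte")
# ~1 FLOP/byte is far below the compute/bandwidth balance point of a modern GPU,
# so decode-phase GEMVs are memory-bound and benefit from PIM's internal bandwidth.
```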

6.1.5.2. Scalability with Increasing HBM Stacks

The scalability of PLAIN with an increasing number of HBM stacks is analyzed to understand its potential for larger systems.

The following figure (Figure 13 from the original paper) illustrates the speedup of PLAIN with increasing stacks:

Fig. 13: The speedup of PLAIN with increasing stacks.

Analysis:

  • Linear Scalability: The graph shows that PLAIN achieves nearly linear performance improvements as the number of HBM stacks increases from 1 to 4. For instance, with 4 stacks, the speedup is roughly 4× that of a single stack.

  • Workload Distribution: PLAIN efficiently distributes the attention layer workload across multiple stacks without incurring significant inter-stack communication overhead. This is attributed to optimal activation data mapping, which minimizes the need for data transfers between different HBM stacks.

  • Increased Bank-to-Die Communication: While increasing Bank-PIMs (which occurs with more stacks) does lead to higher bank-to-die communication overhead, this is more than offset by the substantial FLOPS boost provided by the additional PIM units.

  • Flexible Hardware Configuration: The inherent flexibility of PLAIN's hardware configuration allows capacity, bandwidth, and FLOPS to be adjusted by varying the number of DRAM and PIM dies within an HBM stack, making it adaptable to different performance requirements.

    Conclusion: PLAIN exhibits excellent scalability, demonstrating that its architecture can efficiently leverage multiple HBM stacks to achieve nearly linear performance gains for LLM inference, making it suitable for high-performance deployments.

6.1.6. Area Overhead

The area overhead for integrating PLAIN within an HBM stack is quantified:

  • Per DRAM Die: 8.39 mm² per DRAM die.
  • Per Buffer Die: 1.56 mm² per buffer die.
  • Additional Area per DRAM Die: This translates to an additional 6.93% area per DRAM die.
  • GEMV Units: Each DRAM die includes 64 GEMV units, with each unit occupying 0.057 mm² based on a 1z-nm DRAM process [21].
  • Buffer Die Components: The buffer die houses:
    • Quantization units: 0.13 mm²

    • Softmax units: 1.38 mm²

    • Accumulators: 0.02 mm²

    • Buffer: 0.03 mm²

    • Scaling: All processing units on the buffer die are scaled to a 7nm process [36], reflecting state-of-the-art fabrication technology for logic components.

      Conclusion: The area overhead introduced by PLAIN is relatively modest, particularly given the substantial performance and energy benefits it delivers. The design integrates compute logic into the HBM stack efficiently, minimizing changes to core DRAM structures.
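As a quick consistency check (assuming the four listed components account for essentially all of the added buffer-die logic), their areas sum exactly to the reported per-buffer-die total: $ 0.13 + 1.38 + 0.02 + 0.03 = 1.56 \ \mathrm{mm^2} $.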

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces PLAIN, a novel software/hardware co-design framework specifically tailored for optimizing Large Language Model (LLM) inference by effectively integrating mixed-precision quantization with DRAM-based Processing-in-Memory (DRAM-PIM) technology. The core contributions and findings are:

  • Hardware-Efficient Quantization: PLAIN proposes a locality-aware adaptive quantization algorithm (MixQ-PIM) that determines optimal bit-widths for weights and activations (e.g., INT4/INT8) based on entropy and hardware cost. This approach allows for aggressive quantization (including effective mapping of 6-bit precision) with negligible model quality loss, even for models sensitive to outliers.

  • Heterogeneous Workload Partitioning: The PLAIN architecture strategically partitions LLM inference tasks, leveraging the high internal memory bandwidth of PIM units for memory-bound attention layers and the powerful computational capabilities of GPUs for compute-bound Fully Connected layers.

  • Optimized Dataflow and Overlapping: A sophisticated workload-aware dataflow scheduler combined with intelligent overlapping mechanisms (for weight loading, quantization/communication, and softmax/communication) ensures balanced workload distribution across PIM units and effectively hides latencies. This includes a novel bit-wise splitting strategy to unify mixed-precision computations.

  • Significant Performance and Energy Gains: Experimental evaluations demonstrate that PLAIN achieves substantial performance improvements, with average speedups of 4.41× (up to 5.03×) over conventional GPU-FP16 inference and a 1.69× boost over the state-of-the-art PIM accelerator, AttAcc. These gains are accompanied by a significant reduction in energy consumption across various LLMs.

  • Scalability: PLAIN demonstrates near-linear scalability with an increasing number of HBM stacks, indicating its suitability for larger-scale deployments without significant inter-stack communication overhead.

    In essence, PLAIN successfully addresses the memory wall bottleneck in LLM inference by providing a holistic framework that intelligently co-designs software algorithms and hardware architecture to unlock the full potential of DRAM-PIM for complex deep learning workloads.

7.2. Limitations & Future Work

While the paper does not dedicate a specific section to "Limitations & Future Work," some aspects can be inferred:

  • PIM Computational Generalization: The current Bank-PIM units are primarily designed for GEMM/GEMV operations and 4-bit multipliers. While effective for attention layers, future work might explore expanding the computational capabilities of PIM units to support a wider range of operations or more complex arithmetic types directly in memory, which could potentially offload more parts of the LLM (e.g., certain activation functions, non-linearities in FC layers) from the GPU.
  • Dynamic Workload Adaptation: The greedy scheduling strategy and static schedule table are designed for specific workloads. More dynamic or adaptive scheduling algorithms could be explored that can react to real-time workload fluctuations, varying batch sizes, or model changes more flexibly, especially in a multi-user or dynamic inference environment.
  • Inter-HBM Communication Overhead: Although PLAIN minimizes inter-stack communication for attention layers, scaling to an even larger number of HBM stacks or more complex models might expose new inter-PIM communication bottlenecks that need to be addressed.
  • Beyond Attention Layers: While PLAIN focuses on attention layers due to their memory-bound nature, Fully Connected layers still reside on the GPU. Future research could investigate how PIM capabilities could be extended to accelerate parts of FC layers without compromising the GPU's compute power, perhaps through advanced partitioning or specialized PIM compute units.
  • Software Stack Development: The paper relies on an in-house simulator. Developing a full-fledged software stack, including compilers and runtime systems, that can seamlessly integrate mixed-precision quantization with PIM and GPU for real-world deployment remains a significant challenge and a direction for future work. This would involve managing data movement, synchronization, and task scheduling across the heterogeneous system with minimal programmer effort.
  • Power Gating and Fine-grained Control: While energy efficiency is improved, further fine-grained power management, such as dynamic voltage and frequency scaling (DVFS) or power gating for inactive PIM units, could be explored to optimize energy consumption even further.

7.3. Personal Insights & Critique

PLAIN presents a compelling and well-engineered solution that directly tackles the memory wall problem in LLM inference. Its strength lies in the rigorous co-design of both the quantization algorithm and the underlying hardware architecture, rather than simply porting existing techniques.

  • Transferability: The core principle of locality-aware mixed-precision quantization combined with heterogeneous compute offloading is highly transferable. This approach could be applied to other data-intensive deep learning models beyond LLMs, such as large vision transformers or graph neural networks, where similar memory bottlenecks and opportunities for low-precision computation exist. The concept of bit-wise splitting to unify execution across different precisions on fixed-bit-width hardware is also a clever technique that could find applications in other specialized accelerators.
  • Novelty of Quantization: The entropy-based heuristic for quantization configuration, particularly the simplified activation entropy and hardware-aware cost function, is a practical and efficient alternative to complex reinforcement learning approaches. This makes the MixQ-PIM algorithm more deployable for various models without extensive retraining. The careful mapping of 6-bit quantization (often optimal for LLMs) to underlying 4-bit hardware operations is a testament to the practical hardware-aware design.
  • System-Level Optimization: The emphasis on workload balancing and overlapping is crucial. It highlights that even with superior underlying hardware, a poorly managed dataflow can nullify theoretical gains. PLAIN's detailed scheduling and overlapping strategies are key to achieving practical speedups in a complex heterogeneous environment. This demonstrates a deep understanding of system-level performance bottlenecks.
  • Potential Issues/Critique:
    • Commercial Viability of PIM: While PLAIN shows significant benefits, the widespread commercial adoption of DRAM-PIM solutions (beyond specialized cases like Samsung HBM-PIM or SK Hynix GDDR6-AiM) still faces challenges in manufacturing costs, standardization, and a mature software ecosystem. The success of PLAIN relies heavily on continued advancements and broader acceptance of PIM.

    • Generalization of MixQ-PIM: The entropy-based heuristic might need fine-tuning (e.g., the $\varsigma$ factor) for different LLM families or tasks. While lightweight, its robustness across highly diverse models and training conditions compared to more adaptive, learning-based quantization methods could be further investigated.

    • Host-PIM Interface Overhead: While the paper mentions that PLAIN functions as conventional HBM for non-attention layers, the transition overheads and the complexity of managing data movement and synchronization between the XPU and PIM (especially with dynamic output lengths and KV cache updates) are always a concern in heterogeneous systems. The paper largely mitigates this through overlapping, but it remains a critical aspect of overall system efficiency.

    • Specifics of 6-bit Implementation: The paper mentions W6A6 performance and the use of 4-bit multipliers with shifters for 8-bit compute. A more detailed explanation of how 6-bit values are precisely handled and mapped onto a 4-bit arithmetic unit (e.g., through partial sums, two separate 4-bit operations, or custom 6-bit logic) would enhance clarity for a beginner. It's implied that it's treated as two 3-bit operations or similar, which would fit into a 4-bit framework.

      Overall, PLAIN represents a significant step forward in making large, memory-intensive LLMs more efficient and deployable on emerging hardware. Its holistic approach to co-design and meticulous attention to system-level details set a strong precedent for future research in AI accelerators.
