SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
TL;DR Summary
SmoothQuant is a training-free post-training quantization method that transfers activation quantization challenges to weights, enabling efficient 8-bit quantization with minimal accuracy loss and up to 1.56× speedup and 2× memory savings for large language models.
Abstract
Large language models (LLMs) show excellent performance but are compute- and memory-intensive. Quantization can reduce memory and accelerate inference. However, existing methods cannot maintain accuracy and hardware efficiency at the same time. We propose SmoothQuant, a training-free, accuracy-preserving, and general-purpose post-training quantization (PTQ) solution to enable 8-bit weight, 8-bit activation (W8A8) quantization for LLMs. Based on the fact that weights are easy to quantize while activations are not, SmoothQuant smooths the activation outliers by offline migrating the quantization difficulty from activations to weights with a mathematically equivalent transformation. SmoothQuant enables an INT8 quantization of both weights and activations for all the matrix multiplications in LLMs, including OPT, BLOOM, GLM, MT-NLG, Llama-1/2, Falcon, Mistral, and Mixtral models. We demonstrate up to 1.56x speedup and 2x memory reduction for LLMs with negligible loss in accuracy. SmoothQuant enables serving 530B LLM within a single node. Our work offers a turn-key solution that reduces hardware costs and democratizes LLMs. Code is available at https://github.com/mit-han-lab/smoothquant.
In-depth Reading
1. Bibliographic Information
- Title: SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
- Authors: Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, Song Han.
- The authors are affiliated with the Massachusetts Institute of Technology (MIT) and NVIDIA. This combination of top-tier academic and industry research labs is highly relevant, as the work bridges theoretical innovation with practical hardware performance. Song Han's lab at MIT (Han Lab) is particularly well-known for its pioneering work in model compression and efficient deep learning.
- Journal/Conference: The paper is an arXiv preprint. The version analyzed (v7) was last updated in April 2024. While the version analyzed here is not presented as part of formal peer-reviewed conference proceedings, its widespread citation, adoption, and integration into major frameworks like NVIDIA's FasterTransformer signify its significant impact and validation by the community.
- Publication Year: First submitted in 2022.
- Abstract: The paper addresses a critical challenge in deploying Large Language Models (LLMs): their high computational and memory costs. While quantization is a known solution, existing methods struggle to balance accuracy and hardware efficiency. The authors propose SmoothQuant, a training-free, post-training quantization (PTQ) method that enables 8-bit weight and 8-bit activation (W8A8) quantization. The core idea is to "smooth" activation outliers, which are difficult to quantize, by mathematically migrating this difficulty to the weights, which are easier to quantize. This transformation is done offline and preserves the model's mathematical equivalence. SmoothQuant is demonstrated to work across a wide range of LLMs (including OPT, BLOOM, Llama, and Mixtral families) with negligible accuracy loss, achieving up to 1.56x speedup and 2x memory reduction. A key achievement highlighted is enabling a 530B parameter model to be served on a single server node.
- Original Source Link:
- arXiv: https://arxiv.org/abs/2211.10438v7
- PDF: https://arxiv.org/pdf/2211.10438v7.pdf
- Status: This is a preprint, meaning it has been made publicly available by the authors but may not have completed a formal peer-review process for a specific conference or journal at the time of this version's release.
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: LLMs have become extraordinarily powerful, but their immense size (billions of parameters) makes them prohibitively expensive to deploy. Running them requires massive amounts of high-end GPU memory and computational power, creating a significant barrier to access. As shown in Figure 1, the growth in model size is far outpacing the growth in GPU memory.
- Existing Gaps: Quantization, the process of converting floating-point numbers (like FP16) to low-bit integers (like INT8), is a promising solution to reduce memory and accelerate computation. However, for LLMs larger than ~6B parameters, a phenomenon of "activation outliers" emerges. These are a few activation values that are orders of magnitude larger than the rest. Standard quantization schemes, which scale all values based on the maximum, allocate too many quantization levels to these outliers, leaving very few levels for the majority of values and causing severe accuracy degradation.
- Prior Methods: Existing approaches either failed to maintain accuracy (W8A8 and ZeroQuant on large models) or sacrificed hardware efficiency to handle outliers (LLM.int8(), which uses a slow mixed-precision approach). A solution that is accurate, hardware-efficient, and training-free was missing.
- Main Contributions / Findings (What):
- A Novel Transformation for Quantization-Friendliness: The paper introduces SmoothQuant, a method that mathematically transforms the model to make it easier to quantize. It doesn't change the model's function, only its representation.
- Difficulty Migration: The key insight is to "migrate" the quantization difficulty from the hard-to-quantize activations to the easy-to-quantize weights. This is achieved by scaling down the activation channels with large outliers and scaling up the corresponding weight channels.
- A Tunable, Offline Process: This "smoothing" is controlled by a migration-strength hyperparameter, α, and is performed offline on a small calibration dataset. It requires no model retraining, making it a Post-Training Quantization (PTQ) method.
- Hardware-Efficient W8A8 Quantization: SmoothQuant enables full 8-bit weight, 8-bit activation (W8A8) quantization for all compute-intensive matrix multiplications in LLMs. This allows the use of highly optimized INT8 hardware units (such as NVIDIA's Tensor Cores), leading to significant speedups and memory savings.
- Broad Applicability and State-of-the-Art Results: The method is shown to work across a vast range of models (OPT, BLOOM, GLM, MT-NLG, Llama, Falcon, Mistral, Mixtral) and scales, preserving accuracy while achieving up to 1.56x speedup and 2x memory reduction. It enables serving a 530B model on a single 8-GPU node, a feat previously impractical.
Figure 1 (caption): A line chart comparing the growth of large language model sizes (in parameter count) with GPU accelerator memory capacity in recent years. Model size is growing far faster than memory, widening a supply-demand gap that quantization and model compression can effectively mitigate.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Large Language Models (LLMs): These are massive neural networks, typically based on the Transformer architecture, trained on vast amounts of text data. Their size gives them powerful language understanding and generation capabilities but also makes them resource-intensive.
- Quantization: The process of converting a continuous range of values (like 32-bit or 16-bit floating-point numbers) into a smaller, discrete set of values (like 8-bit integers). This reduces memory footprint and can accelerate computation on hardware with specialized integer arithmetic units.
- Post-Training Quantization (PTQ): A family of quantization techniques that are applied after a model has been fully trained. This is highly desirable as it avoids the massive cost of retraining large models.
- W8A8: A specific quantization scheme where both the Weights and Activations of a neural network layer are represented using 8-bit integers. This is the key to unlocking maximum performance from hardware integer accelerators.
- Activation Outliers: A phenomenon observed in LLMs where certain channels in the activation tensors (the outputs of neuron layers) consistently have values that are dramatically larger (e.g., 100x) than the values in other channels. These outliers make standard quantization very challenging.
- Quantization Granularity: This refers to how many quantization parameters (like the scaling factor) are used for a tensor (a small quantization sketch follows Figure 3 below).
  - Per-tensor: One scaling factor for the entire matrix. Most efficient.
  - Per-token / Per-channel: A separate scaling factor for each row (token) or column (channel). This offers more precision but can add overhead. The paper notes that per-channel quantization on activations is accurate but not hardware-friendly for matrix multiplication.
Figure 3 (caption): A diagram defining per-tensor, per-token, and per-channel quantization. The quantization scale factors can only be taken along the outer dimensions of the matrix multiplication (the token dimension and the output-channel dimension), not the inner dimension.
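To make the granularity trade-off concrete, here is a minimal PyTorch sketch (not from the paper's codebase; function names are illustrative) of symmetric INT8 quantization at per-tensor versus per-token granularity, applied to an activation matrix with a persistent outlier channel:

```python
import torch

def quantize_per_tensor(x: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    """Symmetric quantization with a single scale for the whole tensor."""
    q_max = 2 ** (n_bits - 1) - 1                    # 127 for INT8
    scale = x.abs().max().clamp(min=1e-8) / q_max
    q = torch.clamp(torch.round(x / scale), -q_max - 1, q_max)
    return q * scale                                 # return the dequantized tensor

def quantize_per_token(x: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    """Symmetric quantization with one scale per row (token)."""
    q_max = 2 ** (n_bits - 1) - 1
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / q_max
    q = torch.clamp(torch.round(x / scale), -q_max - 1, q_max)
    return q * scale

x = torch.randn(8, 16)
x[:, 3] *= 100.0   # a persistent outlier *channel*, as observed in LLM activations

# Neither granularity helps much here: every token's scale is dominated by the
# outlier channel. This is exactly the failure mode that motivates SmoothQuant.
print("per-tensor err:", (x - quantize_per_tensor(x)).abs().mean().item())
print("per-token  err:", (x - quantize_per_token(x)).abs().mean().item())
```

Note that per-channel activation scales would fix this, but as the paper observes they cannot be folded into a standard INT8 GEMM, which is why the outliers must be handled another way.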
- Previous Works:
  - LLM.int8(): This method acknowledged the activation outlier problem and proposed a mixed-precision solution. It identifies outliers and keeps them in high-precision FP16 format, while quantizing the rest of the values to INT8. While it preserves accuracy, performing matrix multiplication on this mixed-format data is complex and slow, often resulting in performance worse than the original FP16 model.
  - ZeroQuant: This approach uses group-wise quantization for weights and per-token dynamic quantization for activations. It showed good results on smaller models but, as demonstrated in this paper, fails to maintain accuracy on very large models like OPT-175B.
  - Outlier Suppression: This method attempts to mitigate outliers by modifying the model architecture (e.g., using a non-scaling LayerNorm) or clipping activation values. The paper shows this is insufficient for large-scale LLMs and results in significant accuracy loss.
  - GPTQ: A concurrent work that focuses on highly accurate weight-only quantization (e.g., to 4-bit) but does not quantize activations. This saves memory but does not accelerate computation as much as W8A8 methods.
- Differentiation: SmoothQuant's key innovation is that instead of trying to work around the outliers (like LLM.int8()) or simply suppressing them (like Outlier Suppression), it redistributes them. The transformation is mathematically lossless before quantization is applied. By making both weights and activations moderately "bumpy" rather than having one be extremely "spiky," it makes them both amenable to simple, hardware-friendly per-tensor quantization, thus achieving both accuracy and efficiency.
4. Methodology (Core Technology & Implementation)
The core idea of SmoothQuant is to make activations easier to quantize by migrating their dynamic range to the weights.
- Principles:
  - Observation 1: Weights are easy to quantize. Their distributions are typically uniform and well-behaved.
  - Observation 2: Activations in LLMs are hard to quantize. This is due to systematic outliers in specific channels that persist across different input tokens.
  - Intuition: If we could scale down the activation channels that have large values, they would become easier to quantize. To preserve the mathematical output of the layer, we must apply an inverse scaling to the corresponding channels in the weight matrix. This shifts the "difficulty" from activations to weights. Since weights are inherently easier to quantize, they can absorb this difficulty without significant quantization error. This is visualized in Figure 2.
Figure 2 (caption): An illustration of SmoothQuant's intuition. The top half shows the original activations containing outliers that make quantization difficult; the bottom half shows how migrating the difficulty yields a smoothed activation $\hat{X}$ and adjusted weights $\hat{W}$ that are both easy to quantize.
- Steps & Procedures: For a linear layer computation $Y = XW$, where $X$ is the activation tensor and $W$ is the weight tensor:
  - Introduce a Smoothing Factor: A per-channel scaling vector $s$ is introduced. The transformation is defined as:
    $$Y = \left(X \operatorname{diag}(s)^{-1}\right)\left(\operatorname{diag}(s)\, W\right) = \hat{X}\hat{W}$$
  - Define the Transformed Tensors:
    - The new, "smoothed" activation is $\hat{X} = X \operatorname{diag}(s)^{-1}$. This is equivalent to dividing each column (channel) of $X$ by the corresponding value in $s$.
    - The new, "adjusted" weight is $\hat{W} = \operatorname{diag}(s)\, W$. This is equivalent to multiplying each row of $W$ by the corresponding value in $s$. (A short numerical check of this equivalence appears after this list.)
  - Offline Calculation: The crucial point is that this transformation is done offline. The smoothing factor $s$ is calculated once using a small calibration dataset. The new weights $\hat{W}$ are pre-computed and stored. The scaling factor is fused into the preceding layer (e.g., another linear layer or a LayerNorm), so it adds no runtime overhead. At inference time, the model directly uses the quantization-friendly $\hat{X}$ and $\hat{W}$.
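To make the equivalence concrete, here is a minimal PyTorch sketch (my own illustration, not the authors' code) that applies an arbitrary positive per-channel scale to a toy activation/weight pair, checks that the product is unchanged before any quantization, and shows how the scale can be folded into a preceding LayerNorm:

```python
import torch

torch.manual_seed(0)
T, C_in, C_out = 4, 16, 32
X = torch.randn(T, C_in)
X[:, 3] *= 50.0                       # an outlier activation channel
W = torch.randn(C_in, C_out) * 0.05

s = torch.rand(C_in) + 0.5            # any positive per-channel scale
X_hat = X / s                         # X diag(s)^-1: divide each activation channel
W_hat = W * s.unsqueeze(1)            # diag(s) W: scale the matching weight rows

# The transformation is mathematically lossless before quantization.
assert torch.allclose(X @ W, X_hat @ W_hat, atol=1e-4)

# Offline fusion: if X comes from a LayerNorm with an elementwise affine,
# dividing its weight and bias by s makes it emit X_hat directly at inference,
# so no extra scaling op is needed at runtime.
ln = torch.nn.LayerNorm(C_in)
with torch.no_grad():
    ln.weight.div_(s)
    ln.bias.div_(s)
```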
- Mathematical Formulas & Key Details: The choice of the smoothing factor $s$ is critical. It needs to balance the quantization difficulty between $\hat{X}$ and $\hat{W}$. The authors propose a formula controlled by a migration strength hyperparameter, $\alpha$. For the $j$-th input channel:
  $$s_j = \frac{\max\left(|X_j|\right)^{\alpha}}{\max\left(|W_j|\right)^{1-\alpha}}$$
  - Symbol Explanation:
    - $s_j$: The smoothing factor for the $j$-th input channel.
    - $\max(|X_j|)$: The maximum absolute value observed in the $j$-th channel of the activation tensor across the calibration dataset.
    - $\max(|W_j|)$: The maximum absolute value in the $j$-th row of the weight matrix $W$.
    - $\alpha$: The migration strength.
  - If $\alpha = 1$, the difficulty is fully migrated to the weights ($s_j = \max(|X_j|)$).
  - If $\alpha = 0$, the difficulty is fully migrated to the activations ($s_j = 1/\max(|W_j|)$).
  - The paper finds $\alpha = 0.5$ works well for most models (OPT, BLOOM), effectively aiming to make the maximum values in the transformed activation and weight channels equal. This balances the quantization challenge. Figure 4 and Figure 5 illustrate this transformation (a small sketch of the scale computation follows the figure caption below).
Figures 4/5 (caption): 3D bar charts of the absolute-value distributions of activations and weights, showing how migrating the quantization difficulty smooths the activations (from "hard to quantize" to "easy to quantize") while the adjusted weights remain relatively "easy" to quantize.
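A minimal sketch of this scale computation, assuming the per-channel absolute maxima have already been collected from a calibration pass (variable and function names here are illustrative, not the repository's API):

```python
import torch

def smoothing_scales(act_absmax: torch.Tensor,
                     weight_absmax: torch.Tensor,
                     alpha: float = 0.5) -> torch.Tensor:
    """s_j = max|X_j|^alpha / max|W_j|^(1 - alpha) for each input channel j."""
    act_absmax = act_absmax.clamp(min=1e-5)       # avoid degenerate scales
    weight_absmax = weight_absmax.clamp(min=1e-5)
    return act_absmax.pow(alpha) / weight_absmax.pow(1.0 - alpha)

# Toy calibration statistics for one linear layer with C_in input channels.
C_in, C_out = 16, 32
X_calib = torch.randn(1024, C_in)
X_calib[:, 3] *= 80.0                             # persistent outlier channel
W = torch.randn(C_in, C_out) * 0.05

act_max = X_calib.abs().amax(dim=0)               # per-channel max over tokens
w_max = W.abs().amax(dim=1)                       # per-row (input-channel) max
s = smoothing_scales(act_max, w_max, alpha=0.5)

# With alpha = 0.5 the smoothed activation range and adjusted weight range
# coincide, and the activation outlier ratio shrinks dramatically.
print((act_max / s).max() / (act_max / s).median())
print(torch.allclose(act_max / s, w_max * s))
```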
- Implementation in a Transformer Block: SmoothQuant is applied to all compute-intensive matrix multiplications (GEMM in linear layers and BMM in attention). Figure 6 shows the data flow. Operations like LayerNorm, Softmax, and residual connections remain in FP16 to preserve accuracy, while the heavy lifting is done with efficient INT8 arithmetic. (A simulated W8A8 linear layer is sketched after Figure 6 below.)
Figure 6 (caption): A diagram of SmoothQuant's precision mapping within a Transformer block. All compute-intensive operations, such as linear layers and batched matrix multiplications (BMM), use INT8 arithmetic, with the FP16 and INT8 data paths distinguished by color.
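The sketch below illustrates the numerics of such a W8A8 linear layer using simulated ("fake") quantization. The real speedups come from fused INT8 GEMM kernels (e.g., in FasterTransformer), which this toy module does not use; the class name and structure are assumptions for illustration only.

```python
import torch

class W8A8LinearSim(torch.nn.Module):
    """Simulated (fake-quantized) W8A8 linear layer. Real deployments call fused
    INT8 GEMM kernels; this toy module only mimics the numerics."""

    def __init__(self, weight: torch.Tensor, act_scale: torch.Tensor):
        super().__init__()
        q_max = 127
        w_scale = weight.abs().max().clamp(min=1e-8) / q_max       # per-tensor weight scale
        w_int8 = torch.clamp(torch.round(weight / w_scale), -128, 127).to(torch.int8)
        self.register_buffer("w_int8", w_int8)
        self.register_buffer("w_scale", w_scale)
        self.register_buffer("a_scale", act_scale)                  # static, from calibration

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_int8 = torch.clamp(torch.round(x / self.a_scale), -128, 127)
        # On real hardware this is an INT8 x INT8 -> INT32 GEMM; emulated in float here.
        y = (x_int8 @ self.w_int8.float().t()) * (self.a_scale * self.w_scale)
        return y                                                    # rejoins the FP16 path

# Usage sketch: weight shaped (out_features, in_features); the static activation
# scale would come from the smoothed activations' calibrated maximum / 127.
w = torch.randn(32, 16) * 0.02
layer = W8A8LinearSim(w, act_scale=torch.tensor(3.0 / 127))
out = layer(torch.randn(4, 16))
```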
5. Experimental Setup
- Datasets:
  - Zero-shot Evaluation: A suite of common-sense reasoning and language understanding tasks was used, including LAMBADA, HellaSwag, PIQA, WinoGrande, OpenBookQA, RTE, COPA, MMLU, MNLI, and QNLI.
  - Language Modeling: WikiText and WikiText-2 were used to evaluate perplexity.
  - Calibration: The smoothing factors and static quantization scales were determined using 512 random sentences from the Pile dataset, a large and diverse text corpus.
- Evaluation Metrics:
- Accuracy:
- Conceptual Definition: For classification or multiple-choice tasks (like most of the zero-shot benchmarks), accuracy is the percentage of examples the model answers correctly. A higher value is better.
- Perplexity (PPL):
- Conceptual Definition: Perplexity is a measure of how well a probability model predicts a sample. In language modeling, it quantifies the model's "surprise" when encountering a sequence of words. A lower perplexity indicates the model is more confident and accurate in its predictions. (A short computation sketch follows this metrics list.)
- Mathematical Formula: For a test set with tokens $x_1, \dots, x_N$, the perplexity is the exponentiated average negative log-likelihood:
  $$\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p\left(x_i \mid x_{<i}\right)\right)$$
- Symbol Explanation:
- $N$: Total number of tokens in the test set.
- $p(x_i \mid x_{<i})$: The probability assigned by the model to token $x_i$ given the preceding context.
- Speedup: The ratio of the FP16 baseline's latency to the quantized model's latency. Higher is better.
- Memory Saving: The ratio of the FP16 baseline's peak memory usage to the quantized model's peak memory usage. Higher is better.
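As a concrete reference for the perplexity metric, a minimal sketch using the Hugging Face transformers API (the model choice and single-window evaluation are simplifying assumptions, not the paper's exact protocol):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative small model; the paper evaluates much larger OPT/BLOOM/Llama models.
name = "facebook/opt-125m"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

text = "Large language models show excellent performance but are compute-intensive. " * 10
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    # The returned loss is the mean negative log-likelihood of each next token,
    # so exponentiating it gives the perplexity defined above.
    nll = model(ids, labels=ids).loss
print("perplexity:", torch.exp(nll).item())
```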
- Baselines: The paper compares SmoothQuant against several key PTQ methods. Table 2 from the paper (transcribed below) details the quantization schemes.
  - W8A8: A naive baseline with per-tensor dynamic quantization.
  - ZeroQuant: Uses group-wise quantization for weights and per-token dynamic quantization for activations.
  - LLM.int8(): Uses per-channel weight quantization and a mixed-precision (INT8 + FP16) scheme for activations.
  - Outlier Suppression: A method that tries to clip outliers.
  - SmoothQuant-O1/O2/O3: The proposed method at three efficiency levels. O1 is the most precise (per-token dynamic activation quantization), while O3 is the most hardware-efficient (per-tensor static activation quantization).

Manual transcription of Table 2:

| Method | Weight | Activation |
|---|---|---|
| W8A8 | per-tensor | per-tensor dynamic |
| ZeroQuant | group-wise | per-token dynamic |
| LLM.int8() | per-channel | per-token dynamic + FP16 |
| Outlier Suppression | per-tensor | per-tensor static |
| SmoothQuant-O1 | per-tensor | per-token dynamic |
| SmoothQuant-O2 | per-tensor | per-tensor dynamic |
| SmoothQuant-O3 | per-tensor | per-tensor static |
6. Results & Analysis
- Core Results:
  - Accuracy on Large Models: Table 3 shows the performance on OPT-175B. SmoothQuant (all variants) successfully maintains the FP16 accuracy. In stark contrast, W8A8, ZeroQuant, and Outlier Suppression all cause a catastrophic accuracy drop, with performance becoming near-random. LLM.int8() also maintains accuracy, but as later results show, it comes at a high latency cost.

Manual transcription of Table 3:

| OPT-175B | LAMBADA | HellaSwag | PIQA | WinoGrande | OpenBookQA | RTE | COPA | Average↑ | WikiText↓ |
|---|---|---|---|---|---|---|---|---|---|
| FP16 | 74.7% | 59.3% | 79.7% | 72.6% | 34.0% | 59.9% | 88.0% | 66.9% | 10.99 |
| W8A8 | 0.0% | 25.6% | 53.4% | 50.3% | 14.0% | 49.5% | 56.0% | 35.5% | 93080 |
| ZeroQuant | 0.0%* | 26.0% | 51.7% | 49.3% | 17.0% | 50.9% | 55.0% | 35.8% | 84648 |
| LLM.int8() | 74.7% | 59.2% | 79.7% | 72.1% | 34.2% | 60.3% | 87.0% | 66.7% | 11.10 |
| Outlier Suppression | 0.00% | 25.8% | 52.5% | 48.6% | 16.6% | 53.4% | 55.0% | 36.0% | 96151 |
| SmoothQuant-O1 | 74.7% | 59.2% | 79.7% | 71.2% | 33.4% | 58.1% | 89.0% | 66.5% | 11.11 |
| SmoothQuant-O2 | 75.0% | 59.0% | 79.2% | 71.2% | 33.0% | 59.6% | 88.0% | 66.4% | 11.14 |
| SmoothQuant-O3 | 74.6% | 58.9% | 79.7% | 71.2% | 33.4% | 59.9% | 90.0% | 66.8% | 11.17 |
  - Generality Across Models and Scales: Table 4 demonstrates that SmoothQuant is not specific to one model architecture. It preserves accuracy for OPT-175B, BLOOM-176B, and GLM-130B. Figure 7 further confirms this, showing that SmoothQuant-O3 consistently matches FP16 accuracy across all scales of the OPT model family (from 125M to 175B parameters).

Manual transcription of Table 4:

| Method | OPT-175B | BLOOM-176B | GLM-130B* |
|---|---|---|---|
| FP16 | 71.6% | 68.2% | 73.8% |
| W8A8 | 32.3% | 64.2% | 26.9% |
| ZeroQuant | 31.7% | 67.4% | 26.7% |
| LLM.int8() | 71.4% | 68.0% | 73.8% |
| Outlier Suppression | 31.7% | 54.1% | 63.5% |
| SmoothQuant-O1 | 71.2% | 68.3% | 73.7% |
| SmoothQuant-O2 | 71.1% | 68.4% | 72.5% |
| SmoothQuant-O3 | 71.1% | 67.4% | 72.8% |
Figure 7 (caption): A chart comparing accuracy under INT8 quantization across OPT model scales. SmoothQuant-O3 maintains accuracy close to FP16 at every scale, while ZeroQuant and W8A8 degrade significantly; LLM.int8() requires mixed precision and slows inference down.
  - Applicability to Modern LLMs: Tables 5, 6, and 7 show that SmoothQuant also works effectively on instruction-tuned models (OPT-IML), as well as newer, powerful open-source models like LLaMA, Llama-2, Falcon, Mistral, and even the Mixture-of-Experts (MoE) model Mixtral, all with negligible performance degradation.

Manual transcription of Tables 5, 6, and 7:

Table 5:

| OPT-IML-30B | LAMBADA↑ | WikiText↓ |
|---|---|---|
| FP16 | 69.12% | 14.26 |
| W8A8 | 4.21% | 576.53 |
| ZeroQuant | 5.12% | 455.12 |
| LLM.int8() | 69.14% | 14.27 |
| Outlier Suppression | 0.00% | 9485.62 |
| SmoothQuant-O3 | 69.77% | 14.37 |

Table 6:

| Wiki PPL↓ | 7B | 13B | 30B | 65B |
|---|---|---|---|---|
| FP16 | 11.51 | 10.05 | 7.53 | 6.17 |
| W8A8 SmoothQuant | 11.56 | 10.08 | 7.56 | 6.20 |

Table 7:

| Model | Method | PPL | α |
|---|---|---|---|
| Llama-2-7B | FP16 | 5.474 | 0.85 |
| | W8A8 SQ | 5.515 | |
| Llama-2-13B | FP16 | 4.950 | 0.85 |
| | W8A8 SQ | 4.929 | |
| Llama-2-70B | FP16 | 3.320 | 0.9 |
| | W8A8 SQ | 3.359 | |
| Falcon-7B | FP16 | 6.590 | 0.6 |
| | W8A8 SQ | 6.629 | |
| Falcon-40B | FP16 | 5.228 | 0.7 |
| | W8A8 SQ | 5.255 | |
| Mistral-7B | FP16 | 5.253 | 0.8 |
| | W8A8 SQ | 5.277 | |
| Mixtral-8x7B | FP16 | 3.842 | 0.8 |
| | W8A8 SQ | 3.893 | |
- Speedup and Memory Saving:
  - PyTorch Implementation: Figure 8 shows that SmoothQuant-O3 achieves up to 1.51x speedup and 1.96x memory saving on a single GPU. In contrast, LLM.int8() is often slower than the FP16 baseline due to its mixed-precision overhead.
  - FasterTransformer Implementation: Figure 9 shows even more impressive results in a production-grade inference framework. SmoothQuant achieves up to 1.56x speedup on single-GPU setups. More importantly, it achieves similar latency to the FP16 version while using half the number of GPUs for large models (e.g., 4 GPUs vs. 8 for OPT-175B). Memory usage is nearly halved across the board. Table 8 confirms these benefits extend to the autoregressive decoding stage as well.
Figure 8 (caption): Charts of latency and memory usage for OPT-13B and OPT-30B under different methods (FP16, LLM.int8(), and SmoothQuant). SmoothQuant outperforms the others on both latency and memory, reflecting higher speedup and greater memory savings.
Manual transcription of Table 8:
| Model | BS | SeqLen | FP16 Latency (ms) | Ours Latency (ms) | Speedup↑ | FP16 Memory (GB) | Ours Memory (GB) | Saving↑ |
|---|---|---|---|---|---|---|---|---|
| OPT-30B (1 GPU) | 1 | 512 | 422 | 314 | 1.35× | 57 | 30 | 1.91× |
| | 1 | 1024 | 559 | 440 | 1.27× | 58 | 31 | 1.87× |
| | 16 | 512 | 2488 | 1753 | 1.42× | 69 | 44 | 1.59× |
| OPT-175B (8 GPUs) | 1 | 512 | 426 | 359 | 1.19× | 44 | 23 | 1.87× |
| | 1 | 1024 | 571 | 475 | 1.20× | 44 | 24 | 1.85× |
| | 16 | 512 | 2212 | 1628 | 1.36× | 50 | 30 | 1.67× |
| | 16 | 1024 | 4133 | 3231 | 1.28× | 56 | 37 | 1.52× |
- Scaling Up: 530B Model: Tables 9 and 10 provide a powerful demonstration of SmoothQuant's impact. It successfully quantizes the massive MT-NLG 530B model with no accuracy loss. This allows the model to be served on 8 A100 GPUs instead of 16, with comparable latency. This makes it possible to serve a >500B model on a single server node, drastically lowering the hardware barrier.

Manual transcription of Tables 9 and 10:

Table 9:
| | LAMBADA | HellaSwag | PIQA | WinoGrande | Average |
|---|---|---|---|---|---|
| FP16 | 76.6% | 62.1% | 81.0% | 72.9% | 73.1% |
| INT8 | 77.2% | 60.4% | 80.7% | 74.1% | 73.1% |

Table 10:

| SeqLen | Prec. | #GPUs | Latency | Memory |
|---|---|---|---|---|
| 128 | FP16 | 16 | 232ms | 1040GB |
| 128 | INT8 | 8 | 253ms | 527GB |
| 256 | FP16 | 16 | 451ms | 1054GB |
| 256 | INT8 | 8 | 434ms | 533GB |
| 512 | FP16 | 16 | 838ms | 1068GB |
| 512 | INT8 | 8 | 839ms | 545GB |
| 1024 | FP16 | 16 | 1707ms | 1095GB |
| 1024 | INT8 | 8 | 1689ms | 570GB |
- Ablations / Parameter Sensitivity:
  - Quantization Schemes: Table 11 shows a clear trade-off between quantization granularity and latency. The most efficient scheme, SmoothQuant-O3 (per-tensor, static), is the fastest, while O1 (per-token, dynamic) is slightly slower but can be more accurate in some cases. All SmoothQuant variants are faster than the FP16 and LLM.int8() baselines.
  - Migration Strength (α): Figure 10 clearly illustrates the importance of the α hyperparameter. If α is too small (e.g., < 0.4), not enough difficulty is migrated from the activations, and they remain hard to quantize. If α is too large (e.g., > 0.6), too much difficulty is pushed to the weights, making them hard to quantize. The "sweet spot" (around 0.4-0.6 for OPT-175B) balances the difficulty and achieves high accuracy. (A small grid-search sketch follows Figure 10 below.)
Figure 10 (caption): A chart showing the effect of the migration strength α on OPT-175B quantization accuracy. When α is too large or too small, weight or activation quantization becomes harder; the best quantized accuracy appears in the intermediate "sweet spot" region.
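In practice this sweet spot can be located with a simple grid search over a calibration split. The sketch below is a generic illustration of that idea, not the paper's tooling: the caller supplies a callable that applies SmoothQuant-style smoothing with a given α, fake-quantizes the model, and returns calibration perplexity.

```python
from typing import Callable, Iterable, Tuple

def search_alpha(
    eval_ppl_at_alpha: Callable[[float], float],
    grid: Iterable[float] = (0.4, 0.5, 0.6, 0.7, 0.8, 0.9),
) -> Tuple[float, float]:
    """Pick the migration strength with the lowest calibration perplexity.

    `eval_ppl_at_alpha` is a user-supplied (hypothetical) callable that smooths
    the model with the given alpha, applies simulated W8A8 quantization, and
    measures perplexity on a small held-out calibration split.
    """
    best_alpha, best_ppl = None, float("inf")
    for alpha in grid:
        ppl = eval_ppl_at_alpha(alpha)
        if ppl < best_ppl:
            best_alpha, best_ppl = alpha, ppl
    return best_alpha, best_ppl

# Toy stand-in for demonstration: a convex curve with its minimum near 0.5.
print(search_alpha(lambda a: (a - 0.5) ** 2 + 11.0))
```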
7. Conclusion & Reflections
- Conclusion Summary: SmoothQuant presents an elegant and effective solution to the problem of quantizing large language models. By identifying activation outliers as the primary bottleneck and proposing a mathematically equivalent transformation to migrate this difficulty to the more robust weights, it enables true W8A8 quantization. This approach maintains model accuracy while fully leveraging hardware acceleration for integer operations. The results are compelling: significant speedups, a near halving of memory usage, and the ability to deploy massive models on much less hardware. It stands out as a practical, training-free, and general-purpose "turn-key" solution that helps democratize access to LLMs.
- Limitations & Future Work: The authors do not explicitly list limitations in the analyzed text. However, some can be inferred:
- Hyperparameter Tuning: The migration strength α is a hyperparameter that needs to be tuned for different model families (e.g., 0.5 for OPT/BLOOM, 0.75 for GLM, 0.8+ for Llama-2). While a quick grid search is proposed, this still represents a manual step in the pipeline.
- Extreme Outliers: While the method works well, extremely pathological outliers could still pose a challenge, potentially requiring a very high α that makes the weights too difficult to quantize.
- Lower-bit Quantization: The paper focuses on INT8. Extending this balanced difficulty migration to more aggressive schemes like INT4/W4A8 would be a valuable next step, but may prove more challenging as the quantization error budget for both weights and activations becomes much smaller.
- Personal Insights & Critique:
  - SmoothQuant is a prime example of brilliant yet simple engineering. The core idea is intuitive and powerful. Instead of adding complexity to handle a problem (like LLM.int8()'s mixed precision), it reframes the problem by transforming the data itself. This is a much more hardware-friendly philosophy.
  - The paper's true strength lies in its practicality and extensive validation. The authors not only propose a method but also implement it in a production-grade framework (FasterTransformer) and test it on the largest publicly available models, demonstrating real-world impact. The MT-NLG 530B result is a landmark achievement in efficient LLM serving.
  - The concept of "difficulty migration" could be a general principle applicable beyond LLM quantization. Any deep learning model with heterogeneous quantization difficulty across different tensor types could potentially benefit from a similar balancing act.
  - An open question is how this technique interacts with other compression methods like pruning or knowledge distillation. A combination of these techniques could lead to even more compact and efficient models.
  - SmoothQuant has set a new standard for what is possible with post-training quantization in the era of giant models.