SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
TL;DR Summary
SmoothQuant is a training-free post-training quantization method that transfers activation quantization challenges to weights, enabling efficient 8-bit quantization with minimal accuracy loss and up to 1.56× speedup and 2× memory savings for large language models.
Abstract
Large language models (LLMs) show excellent performance but are compute- and memory-intensive. Quantization can reduce memory and accelerate inference. However, existing methods cannot maintain accuracy and hardware efficiency at the same time. We propose SmoothQuant, a training-free, accuracy-preserving, and general-purpose post-training quantization (PTQ) solution to enable 8-bit weight, 8-bit activation (W8A8) quantization for LLMs. Based on the fact that weights are easy to quantize while activations are not, SmoothQuant smooths the activation outliers by offline migrating the quantization difficulty from activations to weights with a mathematically equivalent transformation. SmoothQuant enables an INT8 quantization of both weights and activations for all the matrix multiplications in LLMs, including OPT, BLOOM, GLM, MT-NLG, Llama-1/2, Falcon, Mistral, and Mixtral models. We demonstrate up to 1.56x speedup and 2x memory reduction for LLMs with negligible loss in accuracy. SmoothQuant enables serving 530B LLM within a single node. Our work offers a turn-key solution that reduces hardware costs and democratizes LLMs. Code is available at https://github.com/mit-han-lab/smoothquant.
In-depth Reading
1. Bibliographic Information
- Title: SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
- Authors: Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, Song Han.
- The authors are affiliated with the Massachusetts Institute of Technology (MIT) and NVIDIA. This combination of top-tier academic and industry research labs is highly relevant, as the work bridges theoretical innovation with practical hardware performance. Song Han's lab at MIT (Han Lab) is particularly well-known for its pioneering work in model compression and efficient deep learning.
- Journal/Conference: The paper is an arXiv preprint. The version analyzed (v7) was last updated in April 2024. While the version analyzed here is not presented as part of formal peer-reviewed conference proceedings, its widespread citation, adoption, and integration into major frameworks like NVIDIA's FasterTransformer signify its significant impact and validation by the community.
- Publication Year: First submitted in 2022.
- Abstract: The paper addresses a critical challenge in deploying Large Language Models (LLMs): their high computational and memory costs. While quantization is a known solution, existing methods struggle to balance accuracy and hardware efficiency. The authors propose SmoothQuant, a training-free, post-training quantization (PTQ) method that enables 8-bit weight and 8-bit activation (W8A8) quantization. The core idea is to "smooth" activation outliers, which are difficult to quantize, by mathematically migrating this difficulty to the weights, which are easier to quantize. This transformation is done offline and preserves the model's mathematical equivalence. SmoothQuant is demonstrated to work across a wide range of LLMs (including OPT, BLOOM, Llama, and Mixtral families) with negligible accuracy loss, achieving up to 1.56x speedup and 2x memory reduction. A key achievement highlighted is enabling a 530B parameter model to be served on a single server node.
- Original Source Link:
- arXiv: https://arxiv.org/abs/2211.10438v7
- PDF: https://arxiv.org/pdf/2211.10438v7.pdf
- Status: This is a preprint, meaning it has been made publicly available by the authors but may not have completed a formal peer-review process for a specific conference or journal at the time of this version's release.
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: LLMs have become extraordinarily powerful, but their immense size (billions of parameters) makes them prohibitively expensive to deploy. Running them requires massive amounts of high-end GPU memory and computational power, creating a significant barrier to access. As shown in Figure 1, the growth in model size is far outpacing the growth in GPU memory.
- Existing Gaps: Quantization, the process of converting floating-point numbers (like FP16) to low-bit integers (like INT8), is a promising solution to reduce memory and accelerate computation. However, for LLMs larger than ~6B parameters, a phenomenon of "activation outliers" emerges. These are a few activation values that are orders of magnitude larger than the rest. Standard quantization schemes, which scale all values based on the maximum, allocate too many quantization levels to these outliers, leaving very few levels for the majority of values and causing severe accuracy degradation.
- Prior Methods: Existing approaches either failed to maintain accuracy (W8A8 and ZeroQuant on large models) or sacrificed hardware efficiency to handle outliers (LLM.int8(), which uses a slow mixed-precision approach). A solution that is accurate, hardware-efficient, and training-free was missing.
- Main Contributions / Findings (What):
- A Novel Transformation for Quantization-Friendliness: The paper introduces SmoothQuant, a method that mathematically transforms the model to make it easier to quantize. It doesn't change the model's function, only its representation.
- Difficulty Migration: The key insight is to "migrate" the quantization difficulty from the hard-to-quantize activations to the easy-to-quantize weights. This is achieved by scaling down the activation channels with large outliers and scaling up the corresponding weight channels.
- A Tunable, Offline Process: This "smoothing" is controlled by a migration-strength hyperparameter, α, and is performed offline on a small calibration dataset. It requires no model retraining, making it a Post-Training Quantization (PTQ) method.
- Hardware-Efficient W8A8 Quantization: SmoothQuant enables full 8-bit weight, 8-bit activation (W8A8) quantization for all compute-intensive matrix multiplications in LLMs. This allows the use of highly optimized INT8 hardware units (such as NVIDIA's Tensor Cores), leading to significant speedups and memory savings.
- Broad Applicability and State-of-the-Art Results: The method is shown to work across a vast range of models (OPT, BLOOM, GLM, MT-NLG, Llama, Falcon, Mistral, Mixtral) and scales, preserving accuracy while achieving up to 1.56x speedup and 2x memory reduction. It enables serving a 530B model on a single 8-GPU node, a feat previously impractical.
Figure 1 (caption): A line chart comparing the growth of large language model sizes (in parameter count) with GPU accelerator memory capacity in recent years. Model size is growing far faster than memory, widening a supply-demand gap that quantization and model compression can effectively mitigate.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Large Language Models (LLMs): These are massive neural networks, typically based on the Transformer architecture, trained on vast amounts of text data. Their size gives them powerful language understanding and generation capabilities but also makes them resource-intensive.
- Quantization: The process of converting a continuous range of values (like 32-bit or 16-bit floating-point numbers) into a smaller, discrete set of values (like 8-bit integers). This reduces memory footprint and can accelerate computation on hardware with specialized integer arithmetic units.
- Post-Training Quantization (PTQ): A family of quantization techniques that are applied after a model has been fully trained. This is highly desirable as it avoids the massive cost of retraining large models.
- W8A8: A specific quantization scheme where both the Weights and Activations of a neural network layer are represented using 8-bit integers. This is the key to unlocking maximum performance from hardware integer accelerators.
- Activation Outliers: A phenomenon observed in LLMs where certain channels in the activation tensors (the outputs of neuron layers) consistently have values that are dramatically larger (e.g., 100x) than the values in other channels. These outliers make standard quantization very challenging.
- Quantization Granularity: This refers to how many quantization parameters (like the scaling factor) are used for a tensor (a small quantization sketch follows Figure 3 below).
  - Per-tensor: One scaling factor for the entire matrix. Most efficient.
  - Per-token / Per-channel: A separate scaling factor for each row (token) or column (channel). This offers more precision but can add overhead. The paper notes that per-channel quantization on activations is accurate but not hardware-friendly for matrix multiplication.
Figure 3 (caption): A diagram defining per-tensor, per-token, and per-channel quantization. The quantization scale factors can only be taken along the outer dimensions of the matrix multiplication (the token dimension and the output-channel dimension), not the inner dimension.
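To make the granularity trade-off concrete, here is a minimal PyTorch sketch (not from the paper's codebase; function names are illustrative) of symmetric INT8 quantization at per-tensor versus per-token granularity, applied to an activation matrix with a persistent outlier channel:

```python
import torch

def quantize_per_tensor(x: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    """Symmetric quantization with a single scale for the whole tensor."""
    q_max = 2 ** (n_bits - 1) - 1                    # 127 for INT8
    scale = x.abs().max().clamp(min=1e-8) / q_max
    q = torch.clamp(torch.round(x / scale), -q_max - 1, q_max)
    return q * scale                                 # return the dequantized tensor

def quantize_per_token(x: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    """Symmetric quantization with one scale per row (token)."""
    q_max = 2 ** (n_bits - 1) - 1
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / q_max
    q = torch.clamp(torch.round(x / scale), -q_max - 1, q_max)
    return q * scale

x = torch.randn(8, 16)
x[:, 3] *= 100.0   # a persistent outlier *channel*, as observed in LLM activations

# Neither granularity helps much here: every token's scale is dominated by the
# outlier channel. This is exactly the failure mode that motivates SmoothQuant.
print("per-tensor err:", (x - quantize_per_tensor(x)).abs().mean().item())
print("per-token  err:", (x - quantize_per_token(x)).abs().mean().item())
```

Note that per-channel activation scales would fix this, but as the paper observes they cannot be folded into a standard INT8 GEMM, which is why the outliers must be handled another way.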
- Previous Works:
  - LLM.int8(): This method acknowledged the activation outlier problem and proposed a mixed-precision solution. It identifies outliers and keeps them in high-precision FP16 format, while quantizing the rest of the values to INT8. While it preserves accuracy, performing matrix multiplication on this mixed-format data is complex and slow, often resulting in performance worse than the original FP16 model.
  - ZeroQuant: This approach uses group-wise quantization for weights and per-token dynamic quantization for activations. It showed good results on smaller models but, as demonstrated in this paper, fails to maintain accuracy on very large models like OPT-175B.
  - Outlier Suppression: This method attempts to mitigate outliers by modifying the model architecture (e.g., using a non-scaling LayerNorm) or clipping activation values. The paper shows this is insufficient for large-scale LLMs and results in significant accuracy loss.
  - GPTQ: A concurrent work that focuses on highly accurate weight-only quantization (e.g., to 4-bit) but does not quantize activations. This saves memory but does not accelerate computation as much as W8A8 methods.
- Differentiation: SmoothQuant's key innovation is that instead of trying to work around the outliers (like LLM.int8()) or simply suppressing them (like Outlier Suppression), it redistributes them. The transformation is mathematically lossless before quantization is applied. By making both weights and activations moderately "bumpy" rather than having one be extremely "spiky," it makes them both amenable to simple, hardware-friendly per-tensor quantization, thus achieving both accuracy and efficiency.
4. Methodology (Core Technology & Implementation)
The core idea of SmoothQuant is to make activations easier to quantize by migrating their dynamic range to the weights.
- Principles:
  - Observation 1: Weights are easy to quantize. Their distributions are typically uniform and well-behaved.
  - Observation 2: Activations in LLMs are hard to quantize. This is due to systematic outliers in specific channels that persist across different input tokens.
  - Intuition: If we could scale down the activation channels that have large values, they would become easier to quantize. To preserve the mathematical output of the layer, we must apply an inverse scaling to the corresponding channels in the weight matrix. This shifts the "difficulty" from activations to weights. Since weights are inherently easier to quantize, they can absorb this difficulty without significant quantization error. This is visualized in Figure 2.
Figure 2 (caption): An illustration of SmoothQuant's intuition. The top half shows the original activations containing outliers that make quantization difficult; the bottom half shows how migrating the difficulty yields a smoothed activation $\hat{X}$ and adjusted weights $\hat{W}$ that are both easy to quantize.
- Steps & Procedures: For a linear layer computation $Y = XW$, where $X$ is the activation tensor and $W$ is the weight tensor:
  - Introduce a Smoothing Factor: A per-channel scaling vector $s$ is introduced. The transformation is defined as:
    $$Y = \left(X \operatorname{diag}(s)^{-1}\right)\left(\operatorname{diag}(s)\, W\right) = \hat{X}\hat{W}$$
  - Define the Transformed Tensors:
    - The new, "smoothed" activation is $\hat{X} = X \operatorname{diag}(s)^{-1}$. This is equivalent to dividing each column (channel) of $X$ by the corresponding value in $s$.
    - The new, "adjusted" weight is $\hat{W} = \operatorname{diag}(s)\, W$. This is equivalent to multiplying each row of $W$ by the corresponding value in $s$. (A short numerical check of this equivalence appears after this list.)
  - Offline Calculation: The crucial point is that this transformation is done offline. The smoothing factor $s$ is calculated once using a small calibration dataset. The new weights $\hat{W}$ are pre-computed and stored. The scaling factor is fused into the preceding layer (e.g., another linear layer or a LayerNorm), so it adds no runtime overhead. At inference time, the model directly uses the quantization-friendly $\hat{X}$ and $\hat{W}$.
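To make the equivalence concrete, here is a minimal PyTorch sketch (my own illustration, not the authors' code) that applies an arbitrary positive per-channel scale to a toy activation/weight pair, checks that the product is unchanged before any quantization, and shows how the scale can be folded into a preceding LayerNorm:

```python
import torch

torch.manual_seed(0)
T, C_in, C_out = 4, 16, 32
X = torch.randn(T, C_in)
X[:, 3] *= 50.0                       # an outlier activation channel
W = torch.randn(C_in, C_out) * 0.05

s = torch.rand(C_in) + 0.5            # any positive per-channel scale
X_hat = X / s                         # X diag(s)^-1: divide each activation channel
W_hat = W * s.unsqueeze(1)            # diag(s) W: scale the matching weight rows

# The transformation is mathematically lossless before quantization.
assert torch.allclose(X @ W, X_hat @ W_hat, atol=1e-4)

# Offline fusion: if X comes from a LayerNorm with an elementwise affine,
# dividing its weight and bias by s makes it emit X_hat directly at inference,
# so no extra scaling op is needed at runtime.
ln = torch.nn.LayerNorm(C_in)
with torch.no_grad():
    ln.weight.div_(s)
    ln.bias.div_(s)
```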
- Mathematical Formulas & Key Details: The choice of the smoothing factor $s$ is critical. It needs to balance the quantization difficulty between $\hat{X}$ and $\hat{W}$. The authors propose a formula controlled by a migration strength hyperparameter, $\alpha$. For the $j$-th input channel:
  $$s_j = \frac{\max\left(|X_j|\right)^{\alpha}}{\max\left(|W_j|\right)^{1-\alpha}}$$
  - Symbol Explanation:
    - $s_j$: The smoothing factor for the $j$-th input channel.
    - $\max(|X_j|)$: The maximum absolute value observed in the $j$-th channel of the activation tensor across the calibration dataset.
    - $\max(|W_j|)$: The maximum absolute value in the $j$-th row of the weight matrix $W$.
    - $\alpha$: The migration strength.
  - If $\alpha = 1$, the difficulty is fully migrated to the weights ($s_j = \max(|X_j|)$).
  - If $\alpha = 0$, the difficulty is fully migrated to the activations ($s_j = 1/\max(|W_j|)$).
  - The paper finds $\alpha = 0.5$ works well for most models (OPT, BLOOM), effectively aiming to make the maximum values in the transformed activation and weight channels equal. This balances the quantization challenge. Figure 4 and Figure 5 illustrate this transformation (a small sketch of the scale computation follows the figure caption below).
Figures 4/5 (caption): 3D bar charts of the absolute-value distributions of activations and weights, showing how migrating the quantization difficulty smooths the activations (from "hard to quantize" to "easy to quantize") while the adjusted weights remain relatively "easy" to quantize.
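A minimal sketch of this scale computation, assuming the per-channel absolute maxima have already been collected from a calibration pass (variable and function names here are illustrative, not the repository's API):

```python
import torch

def smoothing_scales(act_absmax: torch.Tensor,
                     weight_absmax: torch.Tensor,
                     alpha: float = 0.5) -> torch.Tensor:
    """s_j = max|X_j|^alpha / max|W_j|^(1 - alpha) for each input channel j."""
    act_absmax = act_absmax.clamp(min=1e-5)       # avoid degenerate scales
    weight_absmax = weight_absmax.clamp(min=1e-5)
    return act_absmax.pow(alpha) / weight_absmax.pow(1.0 - alpha)

# Toy calibration statistics for one linear layer with C_in input channels.
C_in, C_out = 16, 32
X_calib = torch.randn(1024, C_in)
X_calib[:, 3] *= 80.0                             # persistent outlier channel
W = torch.randn(C_in, C_out) * 0.05

act_max = X_calib.abs().amax(dim=0)               # per-channel max over tokens
w_max = W.abs().amax(dim=1)                       # per-row (input-channel) max
s = smoothing_scales(act_max, w_max, alpha=0.5)

# With alpha = 0.5 the smoothed activation range and adjusted weight range
# coincide, and the activation outlier ratio shrinks dramatically.
print((act_max / s).max() / (act_max / s).median())
print(torch.allclose(act_max / s, w_max * s))
```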
- Implementation in a Transformer Block: SmoothQuant is applied to all compute-intensive matrix multiplications (GEMM in linear layers and BMM in attention). Figure 6 shows the data flow. Operations like LayerNorm, Softmax, and residual connections remain in FP16 to preserve accuracy, while the heavy lifting is done with efficient INT8 arithmetic. (A simulated W8A8 linear layer is sketched after Figure 6 below.)
Figure 6 (caption): A diagram of SmoothQuant's precision mapping within a Transformer block. All compute-intensive operations, such as linear layers and batched matrix multiplications (BMM), use INT8 arithmetic, with the FP16 and INT8 data paths distinguished by color.
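The sketch below illustrates the numerics of such a W8A8 linear layer using simulated ("fake") quantization. The real speedups come from fused INT8 GEMM kernels (e.g., in FasterTransformer), which this toy module does not use; the class name and structure are assumptions for illustration only.

```python
import torch

class W8A8LinearSim(torch.nn.Module):
    """Simulated (fake-quantized) W8A8 linear layer. Real deployments call fused
    INT8 GEMM kernels; this toy module only mimics the numerics."""

    def __init__(self, weight: torch.Tensor, act_scale: torch.Tensor):
        super().__init__()
        q_max = 127
        w_scale = weight.abs().max().clamp(min=1e-8) / q_max       # per-tensor weight scale
        w_int8 = torch.clamp(torch.round(weight / w_scale), -128, 127).to(torch.int8)
        self.register_buffer("w_int8", w_int8)
        self.register_buffer("w_scale", w_scale)
        self.register_buffer("a_scale", act_scale)                  # static, from calibration

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_int8 = torch.clamp(torch.round(x / self.a_scale), -128, 127)
        # On real hardware this is an INT8 x INT8 -> INT32 GEMM; emulated in float here.
        y = (x_int8 @ self.w_int8.float().t()) * (self.a_scale * self.w_scale)
        return y                                                    # rejoins the FP16 path

# Usage sketch: weight shaped (out_features, in_features); the static activation
# scale would come from the smoothed activations' calibrated maximum / 127.
w = torch.randn(32, 16) * 0.02
layer = W8A8LinearSim(w, act_scale=torch.tensor(3.0 / 127))
out = layer(torch.randn(4, 16))
```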
5. Experimental Setup
- Datasets:
  - Zero-shot Evaluation: A suite of common-sense reasoning and language understanding tasks was used, including LAMBADA, HellaSwag, PIQA, WinoGrande, OpenBookQA, RTE, COPA, MMLU, MNLI, and QNLI.
  - Language Modeling: WikiText and WikiText-2 were used to evaluate perplexity.
  - Calibration: The smoothing factors and static quantization scales were determined using 512 random sentences from the Pile dataset, a large and diverse text corpus.
- Evaluation Metrics:
- Accuracy:
- Conceptual Definition: For classification or multiple-choice tasks (like most of the zero-shot benchmarks), accuracy is the percentage of examples the model answers correctly. A higher value is better.
- Perplexity (PPL):
- Conceptual Definition: Perplexity is a measure of how well a probability model predicts a sample. In language modeling, it quantifies the model's "surprise" when encountering a sequence of words. A lower perplexity indicates the model is more confident and accurate in its predictions. (A short computation sketch follows this metrics list.)
- Mathematical Formula: For a test set with tokens $x_1, \dots, x_N$, the perplexity is the exponentiated average negative log-likelihood:
  $$\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p\left(x_i \mid x_{<i}\right)\right)$$
- Symbol Explanation:
- $N$: Total number of tokens in the test set.
- $p(x_i \mid x_{<i})$: The probability assigned by the model to token $x_i$ given the preceding context.
- Speedup: The ratio of the FP16 baseline's latency to the quantized model's latency. Higher is better.
- Memory Saving: The ratio of the FP16 baseline's peak memory usage to the quantized model's peak memory usage. Higher is better.
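As a concrete reference for the perplexity metric, a minimal sketch using the Hugging Face transformers API (the model choice and single-window evaluation are simplifying assumptions, not the paper's exact protocol):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative small model; the paper evaluates much larger OPT/BLOOM/Llama models.
name = "facebook/opt-125m"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

text = "Large language models show excellent performance but are compute-intensive. " * 10
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    # The returned loss is the mean negative log-likelihood of each next token,
    # so exponentiating it gives the perplexity defined above.
    nll = model(ids, labels=ids).loss
print("perplexity:", torch.exp(nll).item())
```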
- Baselines: The paper compares SmoothQuant against several key PTQ methods. Table 2 from the paper (transcribed below) details the quantization schemes.
  - W8A8: A naive baseline with per-tensor dynamic quantization.
  - ZeroQuant: Uses group-wise quantization for weights and per-token dynamic quantization for activations.
  - LLM.int8(): Uses per-channel weight quantization and a mixed-precision (INT8 + FP16) scheme for activations.
  - Outlier Suppression: A method that tries to clip outliers.
  - SmoothQuant-O1/O2/O3: The proposed method at three efficiency levels. O1 is the most precise (per-token dynamic activation quantization), while O3 is the most hardware-efficient (per-tensor static activation quantization).

Manual transcription of Table 2:

| Method | Weight | Activation |
|---|---|---|
| W8A8 | per-tensor | per-tensor dynamic |
| ZeroQuant | group-wise | per-token dynamic |
| LLM.int8() | per-channel | per-token dynamic + FP16 |
| Outlier Suppression | per-tensor | per-tensor static |
| SmoothQuant-O1 | per-tensor | per-token dynamic |
| SmoothQuant-O2 | per-tensor | per-tensor dynamic |
| SmoothQuant-O3 | per-tensor | per-tensor static |
6. Results & Analysis
- Core Results:
  - Accuracy on Large Models: Table 3 shows the performance on OPT-175B. SmoothQuant (all variants) successfully maintains the FP16 accuracy. In stark contrast, W8A8, ZeroQuant, and Outlier Suppression all cause a catastrophic accuracy drop, with performance becoming near-random. LLM.int8() also maintains accuracy, but as later results show, it comes at a high latency cost.

Manual transcription of Table 3:

| OPT-175B | LAMBADA | HellaSwag | PIQA | WinoGrande | OpenBookQA | RTE | COPA | Average↑ | WikiText↓ |
|---|---|---|---|---|---|---|---|---|---|
| FP16 | 74.7% | 59.3% | 79.7% | 72.6% | 34.0% | 59.9% | 88.0% | 66.9% | 10.99 |
| W8A8 | 0.0% | 25.6% | 53.4% | 50.3% | 14.0% | 49.5% | 56.0% | 35.5% | 93080 |
| ZeroQuant | 0.0%* | 26.0% | 51.7% | 49.3% | 17.0% | 50.9% | 55.0% | 35.8% | 84648 |
| LLM.int8() | 74.7% | 59.2% | 79.7% | 72.1% | 34.2% | 60.3% | 87.0% | 66.7% | 11.10 |
| Outlier Suppression | 0.00% | 25.8% | 52.5% | 48.6% | 16.6% | 53.4% | 55.0% | 36.0% | 96151 |
| SmoothQuant-O1 | 74.7% | 59.2% | 79.7% | 71.2% | 33.4% | 58.1% | 89.0% | 66.5% | 11.11 |
| SmoothQuant-O2 | 75.0% | 59.0% | 79.2% | 71.2% | 33.0% | 59.6% | 88.0% | 66.4% | 11.14 |
| SmoothQuant-O3 | 74.6% | 58.9% | 79.7% | 71.2% | 33.4% | 59.9% | 90.0% | 66.8% | 11.17 |
  - Generality Across Models and Scales: Table 4 demonstrates that SmoothQuant is not specific to one model architecture. It preserves accuracy for OPT-175B, BLOOM-176B, and GLM-130B. Figure 7 further confirms this, showing that SmoothQuant-O3 consistently matches FP16 accuracy across all scales of the OPT model family (from 125M to 175B parameters).

Manual transcription of Table 4:

| Method | OPT-175B | BLOOM-176B | GLM-130B* |
|---|---|---|---|
| FP16 | 71.6% | 68.2% | 73.8% |
| W8A8 | 32.3% | 64.2% | 26.9% |
| ZeroQuant | 31.7% | 67.4% | 26.7% |
| LLM.int8() | 71.4% | 68.0% | 73.8% |
| Outlier Suppression | 31.7% | 54.1% | 63.5% |
| SmoothQuant-O1 | 71.2% | 68.3% | 73.7% |
| SmoothQuant-O2 | 71.1% | 68.4% | 72.5% |
| SmoothQuant-O3 | 71.1% | 67.4% | 72.8% |
Figure 7 (caption): A chart comparing accuracy under INT8 quantization across OPT model scales. SmoothQuant-O3 maintains accuracy close to FP16 at every scale, while ZeroQuant and W8A8 degrade significantly; LLM.int8() requires mixed precision and slows inference down.
  - Applicability to Modern LLMs: Tables 5, 6, and 7 show that SmoothQuant also works effectively on instruction-tuned models (OPT-IML), as well as newer, powerful open-source models like LLaMA, Llama-2, Falcon, Mistral, and even the Mixture-of-Experts (MoE) model Mixtral, all with negligible performance degradation.

Manual transcription of Tables 5, 6, and 7:

Table 5:

| OPT-IML-30B | LAMBADA↑ | WikiText↓ |
|---|---|---|
| FP16 | 69.12% | 14.26 |
| W8A8 | 4.21% | 576.53 |
| ZeroQuant | 5.12% | 455.12 |
| LLM.int8() | 69.14% | 14.27 |
| Outlier Suppression | 0.00% | 9485.62 |
| SmoothQuant-O3 | 69.77% | 14.37 |

Table 6:

| Wiki PPL↓ | 7B | 13B | 30B | 65B |
|---|---|---|---|---|
| FP16 | 11.51 | 10.05 | 7.53 | 6.17 |
| W8A8 SmoothQuant | 11.56 | 10.08 | 7.56 | 6.20 |

Table 7:

| Model | Method | PPL | α |
|---|---|---|---|
| Llama-2-7B | FP16 | 5.474 | 0.85 |
| | W8A8 SQ | 5.515 | |
| Llama-2-13B | FP16 | 4.950 | 0.85 |
| | W8A8 SQ | 4.929 | |
| Llama-2-70B | FP16 | 3.320 | 0.9 |
| | W8A8 SQ | 3.359 | |
| Falcon-7B | FP16 | 6.590 | 0.6 |
| | W8A8 SQ | 6.629 | |
| Falcon-40B | FP16 | 5.228 | 0.7 |
| | W8A8 SQ | 5.255 | |
| Mistral-7B | FP16 | 5.253 | 0.8 |
| | W8A8 SQ | 5.277 | |
| Mixtral-8x7B | FP16 | 3.842 | 0.8 |
| | W8A8 SQ | 3.893 | |
- Speedup and Memory Saving:
  - PyTorch Implementation: Figure 8 shows that SmoothQuant-O3 achieves up to 1.51x speedup and 1.96x memory saving on a single GPU. In contrast, LLM.int8() is often slower than the FP16 baseline due to its mixed-precision overhead.
  - FasterTransformer Implementation: Figure 9 shows even more impressive results in a production-grade inference framework. SmoothQuant achieves up to 1.56x speedup on single-GPU setups. More importantly, it achieves similar latency to the FP16 version while using half the number of GPUs for large models (e.g., 4 GPUs vs. 8 for OPT-175B). Memory usage is nearly halved across the board. Table 8 confirms these benefits extend to the autoregressive decoding stage as well.
Figure 8 (caption): Charts of latency and memory usage for OPT-13B and OPT-30B under different methods (FP16, LLM.int8(), and SmoothQuant). SmoothQuant outperforms the others on both latency and memory, reflecting higher speedup and greater memory savings.
Manual transcription of Table 8:
| Model | BS | SeqLen | FP16 Latency (ms) | Ours Latency (ms) | Speedup↑ | FP16 Memory (GB) | Ours Memory (GB) | Saving↑ |
|---|---|---|---|---|---|---|---|---|
| OPT-30B (1 GPU) | 1 | 512 | 422 | 314 | 1.35× | 57 | 30 | 1.91× |
| | 1 | 1024 | 559 | 440 | 1.27× | 58 | 31 | 1.87× |
| | 16 | 512 | 2488 | 1753 | 1.42× | 69 | 44 | 1.59× |
| OPT-175B (8 GPUs) | 1 | 512 | 426 | 359 | 1.19× | 44 | 23 | 1.87× |
| | 1 | 1024 | 571 | 475 | 1.20× | 44 | 24 | 1.85× |
| | 16 | 512 | 2212 | 1628 | 1.36× | 50 | 30 | 1.67× |
| | 16 | 1024 | 4133 | 3231 | 1.28× | 56 | 37 | 1.52× |
- Scaling Up: 530B Model: Tables 9 and 10 provide a powerful demonstration of SmoothQuant's impact. It successfully quantizes the massive MT-NLG 530B model with no accuracy loss. This allows the model to be served on 8 A100 GPUs instead of 16, with comparable latency. This makes it possible to serve a >500B model on a single server node, drastically lowering the hardware barrier.

Manual transcription of Tables 9 and 10:

Table 9:
| | LAMBADA | HellaSwag | PIQA | WinoGrande | Average |
|---|---|---|---|---|---|
| FP16 | 76.6% | 62.1% | 81.0% | 72.9% | 73.1% |
| INT8 | 77.2% | 60.4% | 80.7% | 74.1% | 73.1% |

Table 10:

| SeqLen | Prec. | #GPUs | Latency | Memory |
|---|---|---|---|---|
| 128 | FP16 | 16 | 232ms | 1040GB |
| 128 | INT8 | 8 | 253ms | 527GB |
| 256 | FP16 | 16 | 451ms | 1054GB |
| 256 | INT8 | 8 | 434ms | 533GB |
| 512 | FP16 | 16 | 838ms | 1068GB |
| 512 | INT8 | 8 | 839ms | 545GB |
| 1024 | FP16 | 16 | 1707ms | 1095GB |
| 1024 | INT8 | 8 | 1689ms | 570GB |
- Ablations / Parameter Sensitivity:
  - Quantization Schemes: Table 11 shows a clear trade-off between quantization granularity and latency. The most efficient scheme, SmoothQuant-O3 (per-tensor, static), is the fastest, while O1 (per-token, dynamic) is slightly slower but can be more accurate in some cases. All SmoothQuant variants are faster than the FP16 and LLM.int8() baselines.
  - Migration Strength (α): Figure 10 clearly illustrates the importance of the α hyperparameter. If α is too small (e.g., < 0.4), not enough difficulty is migrated from the activations, and they remain hard to quantize. If α is too large (e.g., > 0.6), too much difficulty is pushed to the weights, making them hard to quantize. The "sweet spot" (around 0.4-0.6 for OPT-175B) balances the difficulty and achieves high accuracy. (A small grid-search sketch follows Figure 10 below.)
Figure 10 (caption): A chart showing the effect of the migration strength α on OPT-175B quantization accuracy. When α is too large or too small, weight or activation quantization becomes harder; the best quantized accuracy appears in the intermediate "sweet spot" region.
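In practice this sweet spot can be located with a simple grid search over a calibration split. The sketch below is a generic illustration of that idea, not the paper's tooling: the caller supplies a callable that applies SmoothQuant-style smoothing with a given α, fake-quantizes the model, and returns calibration perplexity.

```python
from typing import Callable, Iterable, Tuple

def search_alpha(
    eval_ppl_at_alpha: Callable[[float], float],
    grid: Iterable[float] = (0.4, 0.5, 0.6, 0.7, 0.8, 0.9),
) -> Tuple[float, float]:
    """Pick the migration strength with the lowest calibration perplexity.

    `eval_ppl_at_alpha` is a user-supplied (hypothetical) callable that smooths
    the model with the given alpha, applies simulated W8A8 quantization, and
    measures perplexity on a small held-out calibration split.
    """
    best_alpha, best_ppl = None, float("inf")
    for alpha in grid:
        ppl = eval_ppl_at_alpha(alpha)
        if ppl < best_ppl:
            best_alpha, best_ppl = alpha, ppl
    return best_alpha, best_ppl

# Toy stand-in for demonstration: a convex curve with its minimum near 0.5.
print(search_alpha(lambda a: (a - 0.5) ** 2 + 11.0))
```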
7. Conclusion & Reflections
- Conclusion Summary: SmoothQuant presents an elegant and effective solution to the problem of quantizing large language models. By identifying activation outliers as the primary bottleneck and proposing a mathematically equivalent transformation to migrate this difficulty to the more robust weights, it enables true W8A8 quantization. This approach maintains model accuracy while fully leveraging hardware acceleration for integer operations. The results are compelling: significant speedups, a near halving of memory usage, and the ability to deploy massive models on much less hardware. It stands out as a practical, training-free, and general-purpose "turn-key" solution that helps democratize access to LLMs.
- Limitations & Future Work: The authors do not explicitly list limitations in the analyzed text. However, some can be inferred:
- Hyperparameter Tuning: The migration strength α is a hyperparameter that needs to be tuned for different model families (e.g., 0.5 for OPT/BLOOM, 0.75 for GLM, 0.8+ for Llama-2). While a quick grid search is proposed, this still represents a manual step in the pipeline.
- Extreme Outliers: While the method works well, extremely pathological outliers could still pose a challenge, potentially requiring a very high α that makes the weights too difficult to quantize.
- Lower-bit Quantization: The paper focuses on INT8. Extending this balanced difficulty migration to more aggressive schemes like INT4/W4A8 would be a valuable next step, but may prove more challenging as the quantization error budget for both weights and activations becomes much smaller.
- Personal Insights & Critique:
  - SmoothQuant is a prime example of brilliant yet simple engineering. The core idea is intuitive and powerful. Instead of adding complexity to handle a problem (like LLM.int8()'s mixed precision), it reframes the problem by transforming the data itself. This is a much more hardware-friendly philosophy.
  - The paper's true strength lies in its practicality and extensive validation. The authors not only propose a method but also implement it in a production-grade framework (FasterTransformer) and test it on the largest publicly available models, demonstrating real-world impact. The MT-NLG 530B result is a landmark achievement in efficient LLM serving.
  - The concept of "difficulty migration" could be a general principle applicable beyond LLM quantization. Any deep learning model with heterogeneous quantization difficulty across different tensor types could potentially benefit from a similar balancing act.
  - An open question is how this technique interacts with other compression methods like pruning or knowledge distillation. A combination of these techniques could lead to even more compact and efficient models.
  - SmoothQuant has set a new standard for what is possible with post-training quantization in the era of giant models.