OSTQuant: Refining Large Language Model Quantization with Orthogonal and Scaling Transformations for Better Distribution Fitting
TL;DR Summary
The OSTQuant method optimizes large language model quantization using orthogonal and scaling transformations, addressing uneven and heavy-tailed data distributions. The proposed Quantization Space Utilization Rate (QSUR) metric effectively assesses quantizability, while the KL-Top loss function mitigates optimization noise under the limited calibration data of PTQ, yielding superior performance across various LLMs and benchmarks.
Abstract
Post-training quantization (PTQ) has emerged as a widely adopted technique for compressing and accelerating Large Language Models (LLMs). The major challenge in LLM quantization is that uneven and heavy-tailed data distributions can expand the quantization range, thereby reducing bit precision for most values. Recent methods attempt to eliminate outliers and balance inter-channel differences by employing linear transformations; however, they remain heuristic and often overlook optimizing the data distribution across the entire quantization space. In this paper, we introduce Quantization Space Utilization Rate (QSUR), a novel metric that effectively assesses the quantizability of transformed data by measuring the space utilization of the data in the quantization space. We complement QSUR with mathematical derivations that examine the effects and limitations of various transformations, guiding our development of Orthogonal and Scaling Transformation-based Quantization (OSTQuant). OSTQuant employs a learnable equivalent transformation, consisting of an orthogonal transformation and a scaling transformation, to optimize the distributions of weights and activations across the entire quantization space. Furthermore, we propose the KL-Top loss function, designed to mitigate noise during optimization while retaining richer semantic information within the limited calibration data imposed by PTQ. OSTQuant outperforms existing work on various LLMs and benchmarks. In the W4-only setting, it retains 99.5% of the floating-point accuracy. In the more challenging W4A4KV4 configuration, OSTQuant reduces the performance gap by 32% on the LLaMA-3-8B model compared to state-of-the-art methods. Code: https://github.com/BrotherHappy/OSTQuant.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
OSTQuant: Refining Large Language Model Quantization with Orthogonal and Scaling Transformations for Better Distribution Fitting
1.2. Authors
Xing Hu, Yuan Cheng, Dawei Yang, Zukang Xu, Zhihang Yuan, Jiangyong Yu, Chen Xu, Zhe Jiang, Sifan Zhou. Affiliations include Houmo AI, Nanjing University, and Southeast University.
1.3. Journal/Conference
This paper is a preprint published on arXiv. It is currently under review or has not yet been officially published in a peer-reviewed journal or conference proceeding. arXiv is a reputable open-access archive for preprints of scientific papers, particularly in physics, mathematics, computer science, and related disciplines, widely used by the academic community to disseminate research findings prior to formal publication.
1.4. Publication Year
2025
1.5. Abstract
Post-training quantization (PTQ) is a critical technique for compressing and accelerating Large Language Models (LLMs). A significant challenge in LLM quantization arises from uneven and heavy-tailed data distributions, which expand the quantization range and consequently reduce bit precision for most values. Existing methods often employ heuristic linear transformations to address outliers and inter-channel differences but frequently overlook optimizing the data distribution across the entire quantization space.
This paper introduces Quantization Space Utilization Rate (QSUR), a novel metric designed to assess the quantizability of transformed data by measuring its space utilization within the quantization space. Complementing QSUR with mathematical derivations, the authors examine the effects and limitations of various transformations. This theoretical foundation guides the development of Orthogonal and Scaling Transformation-based Quantization (OSTQuant). OSTQuant employs learnable equivalent transformations, comprising both orthogonal and scaling components, to optimize weight and activation distributions across the entire quantization space. Additionally, the paper proposes the KL-Top loss function, which mitigates noise during optimization while preserving richer semantic information, particularly valuable given the limited calibration data typical in PTQ.
OSTQuant demonstrates superior performance over existing methods across various LLMs and benchmarks. In a W4-only setting, it retains 99.5% of the floating-point accuracy. For the more challenging W4A4KV4 configuration, OSTQuant reduces the performance gap by 32% on the LLaMA-3-8B model compared to state-of-the-art methods.
1.6. Original Source Link
https://arxiv.org/abs/2501.13987v1 (This is a preprint on arXiv.) PDF Link: https://arxiv.org/pdf/2501.13987v1.pdf
2. Executive Summary
2.1. Background & Motivation
The rapid advancement of Large Language Models (LLMs) has led to their widespread adoption, but their substantial memory and computational requirements pose significant deployment challenges. These challenges exist not only for resource-constrained edge devices but also for powerful cloud servers, limiting their practical applicability.
Post-training quantization (PTQ) is a widely adopted technique to compress and accelerate LLMs by reducing the bit-width of weights and activations after the model has been trained. However, a major hurdle in LLM quantization is the presence of uneven and heavy-tailed data distributions in weights and activations. These distributions mean that a few extreme values (outliers) dictate a large quantization range. Consequently, the limited number of available quantization levels must be stretched over this wide range, leading to reduced precision for the majority of values that fall within the narrower, central part of the distribution. This ultimately results in significant quantization errors and a drop in model performance.
Prior research has attempted to address these issues using linear transformations, such as smooth-based methods (e.g., SmoothQuant) that redistribute quantization difficulty between weights and activations, and rotation-based methods (e.g., QuaRot, SpinQuant) that aim to suppress outliers. While these methods have shown some effectiveness in specific regions of the quantization space, they are often heuristic and fail to holistically optimize the data distribution across the entire quantization space. This leads to sub-optimal utilization of the available quantization levels and leaves a notable gap in ensuring robust performance at very low bit-widths. The paper identifies this gap as the lack of a quantitative metric to assess quantizability and the absence of a method that globally optimizes data distribution for better fit within the quantization space.
2.2. Main Contributions / Findings
The paper introduces several primary contributions to address the challenges in LLM quantization:
- Quantization Space Utilization Rate (QSUR): The authors propose QSUR, a novel and effective metric to quantitatively assess the quantizability of transformed data. QSUR measures how effectively data utilizes the available quantization space. Mathematical derivations analyze the effects and limitations of various linear transformations on QSUR, establishing a theoretical foundation for designing more effective transformations. Experiments demonstrate a positive correlation between higher QSUR values and improved quantization accuracy.
- Orthogonal and Scaling Transformation-based Quantization (OSTQuant): Guided by the insights from QSUR, OSTQuant employs learnable equivalent transformation pairs, each consisting of an orthogonal transformation and a scaling transformation, assigned to each fully connected (FC) layer in LLMs. These transformations are optimized to globally reshape the distributions of weights and activations across the entire quantization space, making them more quantization-friendly. Critically, the transformation pairs and their inverses are fused into their respective FC layers after optimization, preserving the original computational graph and incurring no additional computational overhead during inference. The method also introduces Weight Outlier Minimization Initialization (WOMI) for effective initialization of the orthogonal matrices.
- KL-Top Loss Function: Recognizing that PTQ typically uses small calibration datasets (e.g., 1,000 samples), which can lead to overfitting with standard cross-entropy loss or to noise with full Kullback-Leibler divergence over large vocabularies, the KL-Top loss function is introduced. It computes KL divergence only over the top-k highest-probability logits of the full-precision model, capturing more nuanced semantic information and providing clearer gradients while mitigating long-tail noise and reducing computational cost.
- Robust Performance and Efficiency: OSTQuant consistently outperforms existing state-of-the-art PTQ methods across various LLMs (LLaMA-1, LLaMA-2, LLaMA-3) and benchmarks.
  - In the W4-only (4-bit weights, 16-bit activations/KV cache) setting, it retains over 99.5% of the floating-point accuracy.
  - In the more aggressive W4A4KV4 (4-bit weights, 4-bit activations, 4-bit KV cache) configuration, it reduces the performance gap by 32% on the LLaMA-3-8B model compared to state-of-the-art methods, maintaining at least 96% of the original model's performance.
  - It achieves significant inference speedups (with the largest gains on larger models such as LLaMA-30B) and substantial memory savings.
  - The optimization process is highly efficient, quantizing LLaMA-7B in about 20 minutes, substantially faster than block-reconstruction methods such as OmniQuant.

These contributions collectively provide a more theoretically grounded and empirically effective approach to LLM quantization, making these powerful models more accessible and efficient for diverse deployment scenarios.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand OSTQuant, a foundational grasp of Large Language Models (LLMs), quantization principles, and linear algebra concepts is essential.
- Large Language Models (LLMs): These are advanced artificial intelligence models, typically based on the transformer architecture, trained on vast amounts of text data to understand, generate, and process human language. They excel at tasks like translation, summarization, question answering, and creative writing. Examples include OpenAI's GPT series, Google's Bard/Gemini, and Meta's LLaMA series. Their immense size (billions of parameters) makes them computationally intensive and memory-hungry, necessitating compression techniques for practical deployment.
- Post-Training Quantization (PTQ): A model compression technique applied after a neural network has been fully trained. The goal is to reduce the numerical precision (bit-width) of the model's parameters (weights) and intermediate computations (activations) from high-precision floating-point numbers (e.g., FP32 or FP16) to lower-precision integers (e.g., INT8 or INT4).
  - Benefits: PTQ significantly reduces memory footprint (allowing larger models on smaller devices or more models on a single device) and accelerates inference, since low-bit integer operations are faster and more energy-efficient than floating-point operations.
  - Challenge: The main challenge is to perform this reduction without significant loss in model accuracy, which often occurs due to quantization errors.
- Quantization Range & Outliers: In quantization, a continuous range of floating-point values is mapped to a discrete set of integer values. This mapping typically involves defining a quantization range (e.g., the min to max value in a tensor) and dividing it into $2^n$ discrete steps, where $n$ is the bit-width. Outliers are a few data points that differ significantly from the majority of the data. In LLMs, weights and activations often exhibit heavy-tailed distributions, meaning there are a few extreme outlier values far from the mean.
  - If these outliers are included in the quantization range, they force the min and max values to be very far apart. This widens the overall quantization range, making each quantization step larger. A larger step size means less precision for the vast majority of values clustered near the center of the distribution, leading to substantial quantization error and accuracy degradation. For example, under INT4 (16 levels), a tensor whose values mostly lie in [-1, 1] but contains a single outlier at 16 has a step size of roughly 1.1, so nearly all non-outlier values collapse into only a couple of quantization bins.
- Data Distributions (Uneven, Heavy-tailed):
  - Uneven Distribution: Refers to data whose values are not uniformly spread or symmetrically centered. In multidimensional data (like activation tensors, where each channel may have its own distribution), unevenness can also refer to significant disparities in variance or mean across different channels.
  - Heavy-tailed Distribution: A probability distribution whose tails (the extreme ends) are "heavier" than those of a normal (Gaussian) distribution. This means outliers occur more frequently or with larger magnitudes than expected under a Gaussian. This characteristic is common in LLM activations and makes quantization difficult.
- Linear Transformations: In linear algebra, a linear transformation is a function that maps vectors from one vector space to another while preserving vector addition and scalar multiplication. In neural networks, this typically involves multiplying a data vector or matrix by another matrix. Linear transformations can rotate, scale, shear, or project data. The paper focuses on orthogonal transformations (rotations/reflections) and scaling transformations.
- Orthogonal Matrices & Stiefel Manifold:
  - An orthogonal matrix $Q$ is a square matrix whose columns and rows are orthonormal vectors, i.e., $Q^\top Q = Q Q^\top = I$, where $I$ is the identity matrix and $Q^\top$ is the transpose of $Q$. Orthogonal transformations preserve vector lengths and angles (they are norm-preserving); they primarily rotate or reflect data without distorting its shape or relative distances. This property is crucial for quantization because it allows reshaping distributions without introducing unwanted scaling biases.
  - The Stiefel manifold is the set of all orthonormal $k$-frames in $\mathbb{R}^n$. In the context of orthogonal matrices, the Stiefel manifold represents the space of all possible orthogonal matrices. Optimizing parameters that must remain orthogonal requires specialized Riemannian optimization techniques that respect the geometric constraints of this manifold.
- Diagonal Matrices & Scaling Transformations:
  - A diagonal matrix is a square matrix whose entries outside the main diagonal are all zero.
  - A scaling transformation applied by a diagonal matrix scales each dimension of the input data independently. If the diagonal entries are reciprocals, it acts as an inverse scaling. This is used to balance variances across different channels, making distributions more uniform.
- Kullback-Leibler (KL) Divergence: Also known as relative entropy, KL divergence is a non-symmetric measure of the difference between two probability distributions $P$ and $Q$. It quantifies how much information is lost when $Q$ is used to approximate $P$.
  - Formula: $D_{\mathrm{KL}}(P \| Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$
  - In quantization, KL divergence is often used as a loss function to ensure that the output distribution of the quantized model ($Q$) remains as close as possible to that of the full-precision model ($P$).
- Cross-Entropy Loss (CE Loss): A common loss function in classification tasks that measures the performance of a model whose output is a probability between 0 and 1. It increases as the predicted probability diverges from the actual label.
  - Formula for single-label classification: $\mathcal{L}_{CE} = -\sum_c y_c \log(p_c)$, where $y_c$ is 1 if class $c$ is the true class and 0 otherwise, and $p_c$ is the predicted probability for class $c$.
  - In LLMs, CE loss is typically used for next-token prediction. However, when training on small calibration datasets during PTQ, focusing solely on the single correct label may lead to overfitting and a loss of the model's broader semantic capabilities.
3.2. Previous Works
The paper contextualizes OSTQuant within the landscape of Post-Training Quantization (PTQ) for LLMs, broadly categorizing existing methods into weight-only and weight-activation quantization.
- Weight-Only Quantization: Aims to reduce memory by quantizing only the model weights, while activations remain in higher precision (e.g., FP16). This provides memory savings but less inference speedup compared to also quantizing activations.
  - GPTQ (Frantar et al., 2022): Utilizes Hessian-based error compensation to minimize quantization errors layer by layer, achieving high compression rates. It quantizes weights greedily, selecting optimal quantization values for each weight given the errors in previously quantized weights.
  - AWQ (Lin et al., 2023) and OWQ (Lee et al., 2023): These methods focus on the observation that activation outliers significantly impact the performance of weight quantization. They protect salient weights or channels from aggressive quantization error by scaling weights or identifying specific channels that are robust to quantization.
  - QuIP (Chee et al., 2023) and QuIP# (Tseng et al., 2024): These approaches use random Hadamard matrices for incoherent processing of weights (making them more uniformly distributed) and apply vector quantization to weights, leading to better performance at very low precisions. Hadamard transformations are a type of orthogonal transformation.
- Weight-Activation Quantization: Aims to quantize both weights and activations (including the Key-Value (KV) cache in transformers) to achieve maximum memory savings and inference speedup. This is more challenging due to the dynamic nature and wide distributions of activations.
  - ZeroQuant (Yao et al., 2022): Proposes a fine-grained, hardware-friendly quantization scheme for both weights and activations, often utilizing per-channel and per-token quantization strategies.
  - SmoothQuant (Xiao et al., 2022): A pioneering smooth-based transformation method. It addresses activation quantization difficulty by mathematically shifting some of the quantization challenge from activations to weights, which are more static and easier to quantize. This is achieved by introducing a diagonal scaling matrix that smooths out large inter-channel variances in activations.
  - OmniQuant (Shao et al., 2023): Further enhances performance by actively training both quantization parameters (scales, zero-points) and transformation coefficients through an optimization process, often using block-wise reconstruction to minimize quantization error.
  - I-LLM (Hu et al., 2024): Focuses on achieving integer-only quantization and inference, which is highly efficient for specialized hardware. It uses fully-smooth block reconstruction and fully integer operators.
  - QuaRot (Ashkboos et al., 2024): Utilizes random rotation matrices (a type of orthogonal transformation) to suppress outliers in both weights and activations, enabling 4-bit quantization. The rotations are fixed (randomly initialized Hadamard-like matrices).
  - SpinQuant (Liu et al., 2024): Extends QuaRot by learning these rotation matrices to further refine 4-bit weight-activation quantization, moving from fixed random rotations to optimized ones.
- Riemannian Optimization: Optimizing orthogonal matrices (which lie on a Stiefel manifold) requires specialized optimization techniques that respect their orthonormality constraints.
  - Cayley SGD (Li et al., 2020): Relies on iterative approximations of the Cayley Transform to optimize rotation matrices for arbitrary loss functions efficiently, primarily using matrix multiplication.
  - RAOM (Bécigneul & Ganea, 2018): Extends popular adaptive optimizers such as ADAM (Kingma, 2014) and ADAGRAD to the realm of Riemannian optimization, making them suitable for Stiefel manifolds.
  - Geoopt (Kochurov et al., 2020): A Python library providing Riemannian optimization algorithms for various manifolds, facilitating their integration into deep learning models.
3.3. Technological Evolution
The field of LLM quantization has evolved from basic uniform quantization to sophisticated methods that adaptively handle the complex distributions of LLMs. Initially, quantization focused on simple per-tensor or per-channel scaling (RTN, round-to-nearest). The discovery of activation outliers in LLMs highlighted the need for more advanced techniques. Early methods like GPTQ and AWQ addressed weight quantization by mitigating outlier impact. Later, SmoothQuant introduced the idea of shifting quantization difficulty by scaling activations and inverse-scaling weights. More recent work, including QuaRot and SpinQuant, introduced orthogonal transformations (rotations) to directly reduce outliers and balance distributions in both weights and activations. OmniQuant pushed towards learnable quantization parameters and transformations.
OSTQuant builds upon this evolution by:
- Introducing a principled metric (QSUR) to quantitatively guide the optimization of data distributions.
- Combining the strengths of both orthogonal and scaling transformations in a learnable, equivalent transformation pair for global optimization.
- Addressing the limited calibration data challenge with the KL-Top loss.
3.4. Differentiation Analysis
OSTQuant differentiates itself from previous methods primarily through its holistic approach to optimizing data distribution across the entire quantization space, guided by a novel quantitative metric, and its specific combination of learnable transformations.
- From Heuristic to Principled:
  - Prior Methods (e.g., SmoothQuant, QuaRot, SpinQuant): These methods often operate based on empirical observations and heuristics (e.g., "smooth outliers," "rotate outliers"). While effective, they lack a unified metric to quantify "quantizability" or to rigorously guide the optimization process.
  - OSTQuant: Introduces QSUR as a theoretically grounded metric that directly measures how effectively data utilizes the quantization space. This metric provides a clear optimization target and allows for a more principled design of transformations, moving beyond pure heuristics.
- Combined Orthogonal and Scaling Transformations for Global Optimization:
  - Smooth-based Methods (e.g., SmoothQuant): Primarily use diagonal scaling matrices to reduce inter-channel variance in activations by shifting it to weights. They are effective at balancing scale but are sensitive to outliers and uneven mean values. Their optimization is typically localized (per-layer or per-channel) and does not explicitly optimize the overall shape of the distribution in the quantization space.
  - Rotation-based Methods (e.g., QuaRot, SpinQuant): Focus on orthogonal transformations (rotations) to suppress outliers in both weights and activations by rotating the data. QuaRot uses fixed random rotations, while SpinQuant learns them. These are good at making distributions more ball-shaped but may not optimally balance inter-channel variances or ensure maximum utilization of the quantization range. Their scope is often limited to reducing outliers rather than globally fitting the entire distribution.
  - OSTQuant: Combines the strengths of both. It employs learnable equivalent transformation pairs consisting of an orthogonal transformation and a scaling transformation. This allows it to rotate the data to align its principal components with the quantization axes and reduce outlier impact (like rotation-based methods), and to scale dimensions to balance variances and mean values across channels (like smooth-based methods), yielding a flatter, more uniform distribution. This combined approach optimizes the data distribution across the entire quantization space, leading to higher QSUR and better precision.
- Equivalence Preservation and Fusion: OSTQuant explicitly designs equivalent transformation pairs that, in the absence of quantization, preserve the network's original output. This prevents overfitting to the calibration data and ensures better generalization. After optimization, these transformations are fused into the model weights, incurring no runtime overhead. While SmoothQuant also fuses scaling factors, OSTQuant extends this to orthogonal matrices as well.
- Addressing Limited Calibration Data:
  - Prior Methods: Often rely on standard cross-entropy loss or full KL divergence for optimization. Cross-entropy can lead to overfitting on small calibration sets, while full KL divergence can be noisy due to long-tail distributions over LLM vocabularies.
  - OSTQuant: Proposes KL-Top loss, which restricts the KL divergence computation to the top-k highest-probability logits, providing more stable and semantically rich gradients for optimization with limited data.

In essence, OSTQuant provides a more theoretically grounded, comprehensive, and adaptively optimized solution by integrating a novel metric, a powerful combination of learnable transformations, and a robust loss function specifically designed for PTQ challenges.
4. Methodology
4.1. Principles
The core principle behind OSTQuant is to enhance the quantizability of Large Language Models (LLMs) by actively reshaping the data distributions of weights and activations to better utilize the available quantization space. This is achieved through a novel metric, Quantization Space Utilization Rate (QSUR), which quantifies the effectiveness of this utilization, and a sophisticated learnable equivalent transformation mechanism.
The theoretical intuition is that quantization error is minimized when the data distribution is as compact and uniform as possible within the fixed quantization range. Uneven and heavy-tailed distributions lead to large quantization ranges and sparse quantization levels for most values. OSTQuant hypothesizes that a combination of orthogonal transformations (rotations) and scaling transformations can effectively transform these sub-optimal distributions into more quantization-friendly ones, characterized by higher QSUR.
Orthogonal transformations (rotations) can reduce the maximum extent of outliers by distributing their magnitude across multiple dimensions, making the data more "ball-shaped." Scaling transformations (diagonal matrices) can then balance variances across different channels, making the data distribution flatter and more uniform. By learning these transformations, OSTQuant aims to dynamically adapt to the specific distributions within each layer of an LLM, thereby minimizing quantization loss across the entire network.
The process is guided by minimizing the KL-Top loss, which ensures that the quantized model's output distribution remains semantically close to the full-precision model's output, even with limited calibration data, preventing overfitting and preserving performance.
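As a quick illustration of the intuition above, the following hypothetical NumPy sketch draws a heavy-tailed 2D sample, applies a Hadamard-style rotation, and compares the per-dimension ranges before and after; the data and shapes are illustrative only, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Heavy-tailed toy "activations": mostly small values, one huge outlier, uneven channels.
x = rng.normal(0.0, 1.0, size=(10000, 2))
x[:, 1] *= 0.05            # second channel has tiny variance
x[0, 0] = 50.0             # a single extreme outlier in channel 0

# 2x2 normalized Hadamard matrix: an orthogonal (rotation/reflection) transform.
H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)

x_rot = x @ H

for name, data in [("original", x), ("rotated", x_rot)]:
    per_dim_range = data.max(axis=0) - data.min(axis=0)
    print(name, "per-dimension ranges:", np.round(per_dim_range, 2))
# After rotation the outlier's magnitude is split across both dimensions and the
# channels become balanced, so no single dimension needs as wide a quantization range.
```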
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Quantization Notations
The paper first defines the standard uniform quantization and dequantization process. This process maps a floating-point number to a discrete integer representation and then approximates the original floating-point number from that integer.
Given a floating-point tensor $X$, the quantization function converts it into its quantized integer counterpart $X^I$. The dequantization process then approximates the original value as $X'$.
The quantization process is defined by:
$
X^I = \mathrm{clamp}\left(\left\lfloor \frac{X}{s} \right\rceil + zp^I,\ 0,\ 2^{n^I}-1\right)
$
where:
- $X$: The original floating-point tensor.
- $X^I$: The quantized integer tensor, with values clamped between $0$ and $2^{n^I}-1$.
- $n^I$: The number of bits used for quantization (e.g., 4 bits for INT4).
- $s$: The quantization step size (also known as the scale), which determines the width of each discrete quantization level.
- $zp^I$: The zero-point, an integer value that represents the floating-point value 0 in the quantized integer range.
- $\lfloor \cdot \rceil$: The rounding function (typically round-to-nearest).
- $\mathrm{clamp}(\cdot)$: A function that clips values to the specified min and max range.

The quantization step size $s$ and zero-point $zp^I$ are calculated from the minimum ($x_{\mathrm{min}}$) and maximum ($x_{\mathrm{max}}$) values observed in the floating-point tensor $X$:
$
s = \frac{x_{\mathrm{max}} - x_{\mathrm{min}}}{2^{n^I}-1}, \qquad zp^I = \left\lfloor \frac{-x_{\mathrm{min}}}{s} \right\rceil
$
After quantization, the dequantized floating-point approximation is computed as:
$
\boldsymbol{X}' = (\boldsymbol{X}^I - zp^I) \cdot s
$
This dequantization formula maps the integer back to a floating-point value within the original approximate range. The choice of $s$ (scale) and $zp^I$ (zero-point) is crucial for quantization accuracy. These can be determined statically from a few calibration samples (static quantization) or dynamically from runtime statistics (dynamic quantization). Quantization can also be applied per-channel or per-token, referring to the granularity at which $s$ and $zp^I$ are computed.
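A minimal NumPy sketch of this asymmetric uniform quantizer, assuming per-tensor statistics (variable names are illustrative, not from the paper's code):

```python
import numpy as np

def quantize(x: np.ndarray, n_bits: int = 4):
    """Asymmetric uniform quantization: X^I = clamp(round(X/s) + zp, 0, 2^n - 1)."""
    qmax = 2 ** n_bits - 1
    x_min, x_max = x.min(), x.max()
    s = (x_max - x_min) / qmax          # step size (scale)
    zp = int(round(-x_min / s))         # integer zero-point
    x_int = np.clip(np.round(x / s) + zp, 0, qmax).astype(np.int32)
    return x_int, s, zp

def dequantize(x_int: np.ndarray, s: float, zp: int) -> np.ndarray:
    """X' = (X^I - zp) * s"""
    return (x_int - zp) * s

x = np.random.randn(8).astype(np.float32)
x_int, s, zp = quantize(x, n_bits=4)
x_hat = dequantize(x_int, s, zp)
print("max abs reconstruction error:", np.abs(x - x_hat).max(), "| step size s:", s)
```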
4.2.2. Quantization Space Utilization Rate (QSUR)
To quantitatively evaluate the quantizability of data and the effectiveness of transformations, the paper introduces QSUR.
Definition 1: Given a set of $d$-dimensional data $X$, let $V_X$ denote the hypervolume occupied by $X$, and $V_{S_X}$ denote the hypervolume of the quantization space $S_X$ corresponding to $X$. The quantization space is modeled as a hypercube whose edge lengths are defined by the maximum quantization range across all dimensions of $X$.
The Quantization Space Utilization Rate of $X$ is then defined as:
$
\mathrm{QSUR}_X = \frac{V_X}{V_{S_X}} \quad (1)
$
A higher QSUR indicates that the data effectively fills the allocated quantization space, suggesting better quantizability.
For data typically following a Gaussian distribution ($X \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$), $V_X$ is calculated from the ellipsoid defined by the covariance matrix $\boldsymbol{\Sigma}$ and the mean vector $\boldsymbol{\mu}$. The covariance matrix can be diagonalized via eigenvalue decomposition $\boldsymbol{\Sigma} = \boldsymbol{Q} \boldsymbol{\Lambda} \boldsymbol{Q}^\top$, where $\boldsymbol{Q}$ is a unit orthogonal matrix of eigenvectors and $\boldsymbol{\Lambda}$ contains the eigenvalues (the variances along the principal axes) in descending order.
The hypervolume of this ellipsoid at a given confidence level $\alpha$ is:
$
V_X = \frac{\pi^{d/2}}{\Gamma(d/2+1)} \times \left(\chi_d^2(\alpha)\right)^{d/2} \times \sqrt{\operatorname*{det}(\boldsymbol{\Sigma})} \quad (2)
$
where:
- $\Gamma(\cdot)$: The Gamma function.
- $\chi_d^2(\alpha)$: The chi-squared quantile with $d$ degrees of freedom at confidence level $\alpha$; this factor scales the ellipsoid to encompass a given percentage of the data.
- $\operatorname*{det}(\boldsymbol{\Sigma})$: The determinant of the covariance matrix. Since $\boldsymbol{Q}$ is orthogonal, $\operatorname*{det}(\boldsymbol{\Sigma}) = \operatorname*{det}(\boldsymbol{\Lambda})$.

The volume of the quantization hypercube ($V_{S_X}$) is determined by the range of the distribution along each axis, i.e., by the maximum and minimum coordinate values attained by the ellipsoid. Let $\lambda_{\mathrm{max}}$ and $\lambda_{\mathrm{min}}$ be the largest and smallest eigenvalues, with corresponding eigenvectors $\pmb{q}_{\mathrm{max}}$ and $\pmb{q}_{\mathrm{min}}$ (the paper later simplifies this by considering only the largest eigenvalue). The extremal coordinates are:
$
v_{\mathrm{max}}^{\mathrm{org}} = \sqrt{\chi_d^2(\alpha) \cdot \lambda_{\mathrm{max}}} \cdot \pmb{q}_{\mathrm{max}} + \boldsymbol{\mu}, \qquad v_{\mathrm{min}}^{\mathrm{org}} = -\sqrt{\chi_d^2(\alpha) \cdot \lambda_{\mathrm{min}}} \cdot \pmb{q}_{\mathrm{min}} + \boldsymbol{\mu}
$
The volume of the quantization hypercube is then:
$
V_{S_X} = \left(\operatorname*{max}(v_{\mathrm{max}}^{\mathrm{org}}) - \operatorname*{min}(v_{\mathrm{min}}^{\mathrm{org}})\right)^d \quad (5)
$
Combining these, the QSUR expression becomes:
$
\mathrm{QSUR}_X = \frac{ \frac{\pi^{d/2}}{\Gamma(d/2+1)} \cdot \left(\chi_d^2(\alpha)\right)^{d/2} \cdot \sqrt{\operatorname*{det}(\boldsymbol{\Lambda})} }{ \left( \operatorname*{max}\left(\sqrt{\chi_d^2(\alpha) \cdot \lambda_{\mathrm{max}}} \cdot |\pmb{q}_{\mathrm{max}}| + \boldsymbol{\mu}\right) - \operatorname*{min}\left(-\sqrt{\chi_d^2(\alpha) \cdot \lambda_{\mathrm{min}}} \cdot |\pmb{q}_{\mathrm{min}}| + \boldsymbol{\mu}\right) \right)^d } \quad (6)
$
Simplified QSUR: The paper further simplifies this expression by neglecting the mean vector $\boldsymbol{\mu}$ (assuming its magnitude is small relative to the largest eigenvalue) and using the largest eigenvalue $\lambda_1$ for the extremal points. This leads to:
$
\mathrm{QSUR}_X = \frac{ \frac{\pi^{d/2}}{\Gamma(d/2+1)} \cdot \sqrt{\prod_{i=1}^{d} \lambda_i} }{ 2^d \left( \operatorname*{max}(\sqrt{\lambda_1} \cdot \pmb{q}_1) \right)^d } \quad (7)
$
From this simplified QSUR formula (Equation 7), two key observations are made:
- QSUR is proportional to the product of the ratios of each eigenvalue $\lambda_i$ to the largest eigenvalue $\lambda_1$. This implies that QSUR is maximized when all eigenvalues are similar, i.e., when the distribution is more ball-shaped.
- The maximum component of the eigenvector $\pmb{q}_1$ corresponding to $\lambda_1$ is inversely related to QSUR: QSUR is minimized when this component is large and maximized when it is small. The denominator in Equation 7 is minimized when the components of $\pmb{q}_1$ all have magnitude $1/\sqrt{d}$ (Appendix A.2.1), which corresponds to the most uniform spread of eigenvector components.

Influence of linear transformations on QSUR: Applying a linear transformation $\pmb{T}$ to $X$ results in a transformed distribution whose mean and covariance are transformed accordingly.
- Smoothing-based approaches (e.g., SmoothQuant) treat $\pmb{T}$ as a diagonal matrix that scales variances, reducing disparities among the eigenvalues $\lambda_i$. However, they are sensitive to outliers and uneven mean values.
- Rotation-based methods (e.g., QuaRot, SpinQuant) use orthogonal matrices to modify $\pmb{Q}$, reducing outliers and the hypercube volume and thereby increasing QSUR. This ability improves with dimensionality (Appendix A.2.2), and the best outlier reduction is achieved when the orthogonal matrix incorporates a normalized Hadamard matrix $\pmb{H}$.

The Best Transformation Matrix: Combining the insights from Equation 7 with the influence of transformations, the paper deduces that the maximum QSUR is achieved when the transformation matrix is
$
\pmb{T} = c \cdot \pmb{\Lambda}^{-\frac{1}{2}} \pmb{Q}^\top \quad (8)
$
where $c$ is an arbitrary scalar. This transformation effectively "spheres" the data, making its covariance matrix proportional to the identity (see Appendix A.2.3). At this point, the maximum QSUR is achieved:
$
\mathrm{QSUR}'' = \frac{ \frac{\pi^{d/2}}{\Gamma(d/2+1)} }{ 2^d } \quad (28)
$
This theoretical derivation highlights that an ideal transformation involves both scaling (by $\pmb{\Lambda}^{-\frac{1}{2}}$) to normalize variances and rotation (by $\pmb{Q}^\top$) to align with the principal axes, effectively making the data distribution isotropic (uniform in all directions).
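A rough, hypothetical sketch of how a QSUR-style utilization score could be estimated empirically (ellipsoid volume from the sample covariance, Eq. 2, over the hypercube from the per-dimension data range, Eq. 5); the helper, data, and confidence level are illustrative assumptions rather than the paper's code:

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import chi2

def qsur(x: np.ndarray, alpha: float = 0.99) -> float:
    """Empirical QSUR (Definition 1): Gaussian-ellipsoid volume over quantization-hypercube volume."""
    n, d = x.shape
    cov = np.cov(x, rowvar=False)
    _, logdet = np.linalg.slogdet(cov)
    # log V_X = log(pi^{d/2} / Gamma(d/2 + 1)) + (d/2) log chi2_d(alpha) + 0.5 log det(Sigma)
    log_vx = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1) \
             + (d / 2) * np.log(chi2.ppf(alpha, df=d)) + 0.5 * logdet
    # Quantization hypercube: edge = largest per-dimension range, volume = edge^d
    edge = np.max(x.max(axis=0) - x.min(axis=0))
    log_vs = d * np.log(edge)
    return float(np.exp(log_vx - log_vs))

rng = np.random.default_rng(0)
d = 8
elongated = rng.normal(size=(20000, d)) * np.array([10, 1, 1, 1, 1, 1, 1, 0.1])
# "Sphering" transform in the spirit of Eq. (8): rotate to the eigenbasis, then rescale.
lam, Q = np.linalg.eigh(np.cov(elongated, rowvar=False))
sphered = elongated @ Q @ np.diag(1.0 / np.sqrt(lam))
print("QSUR before:", qsur(elongated))
print("QSUR after :", qsur(sphered))   # far higher utilization after sphering
```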
4.2.3. Orthogonal and Scaling Transformation-based Quantization (OSTQuant)
OSTQuant leverages these theoretical insights to design a practical PTQ framework. It uses learnable equivalent transformation pairs within LLM layers to optimize weight and activation distributions for improved quantization performance.
Overall Flow (Figure 5):
OSTQuant applies multiple transformation pairs globally across and within LLM blocks. Four equivalent transformation pairs are learned per block, each comprising a learnable diagonal scaling matrix (denoted $\boldsymbol{\Lambda}$ in the formulas) and a learnable orthogonal matrix (denoted $O$, or $R$ for "rotation" in the figures).
Equivalent Transformation Pair:
A transformation pair is defined as $(O, \boldsymbol{\Lambda})$, where $\boldsymbol{\Lambda}$ is a diagonal scaling matrix and $O$ is a unit orthogonal matrix.
The forward inference process for a pair of consecutive matrix multiplications (e.g., $\pmb{y} = \pmb{x} W_1 W_2$) is reformulated as:
$
\pmb{y} = \mathcal{Q}(\pmb{x} W_1 O \boldsymbol{\Lambda})\, \mathcal{Q}(\boldsymbol{\Lambda}^{-1} O^\top W_2) \quad (9)
$
where:
- $\pmb{x}$: Input to the first linear layer.
- $W_1, W_2$: Weights of two consecutive linear layers.
- $O$: Learnable orthogonal matrix.
- $\boldsymbol{\Lambda}$: Learnable diagonal scaling matrix.
- $\mathcal{Q}(\cdot)$: The quantization operation.
The product $O\boldsymbol{\Lambda}$ transforms the output of $W_1$, and its inverse $\boldsymbol{\Lambda}^{-1} O^\top$ transforms the input side of $W_2$. This maintains mathematical equivalence in full precision while allowing the intermediate distributions to be optimized for quantization.
Advantages of the Equivalent Transformation Pair:
- Learnability and Computational Efficiency: Both $O$ and $\boldsymbol{\Lambda}$ are learnable parameters. The inverse of a diagonal matrix is trivial (the reciprocal of its diagonal elements). The orthogonal matrix is optimized using Riemannian optimizers such as RiemannAdam, leveraging first-order gradient information.
- Equivalence Preservation: Without quantization, the transformations cancel out ($O \boldsymbol{\Lambda} \boldsymbol{\Lambda}^{-1} O^\top = I$), so the network's output is identical to that of the original model. This reduces the risk of overfitting to the small calibration set.
- No Runtime Overhead: After optimization, $O$ and $\boldsymbol{\Lambda}$ are fused directly into the existing weight matrices, e.g., $W_1' = W_1 O \boldsymbol{\Lambda}$ and $W_2' = \boldsymbol{\Lambda}^{-1} O^\top W_2$. No additional computational cost or parameters are introduced during inference.

The overall optimization objective is to minimize the loss between the quantized network output $\hat{y}$ and the full-precision output $y$ by learning the optimal scaling ($A_i$) and orthogonal ($O_i$) matrices:
$
\arg \operatorname*{min}_{A_i, O_i} \mathcal{L}(\hat{y}, y; A_i, O_i, \theta) \quad (10)
$
where $\theta$ denotes the frozen full-precision network parameters.
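A small, hypothetical NumPy sketch of the equivalence behind Equation (9): fusing a random orthogonal matrix and a diagonal scaling into two consecutive linear layers leaves the full-precision output unchanged (quantization is omitted here so that only the fusion step is shown):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_mid, d_out = 16, 32, 8

x = rng.normal(size=(4, d_in))
W1 = rng.normal(size=(d_in, d_mid))
W2 = rng.normal(size=(d_mid, d_out))

# A learnable pair: orthogonal O (here a random one from QR) and diagonal scaling Lambda.
O, _ = np.linalg.qr(rng.normal(size=(d_mid, d_mid)))
lam = rng.uniform(0.5, 2.0, size=d_mid)

# Fuse the pair and its inverse into the two weight matrices.
W1_fused = W1 @ O @ np.diag(lam)
W2_fused = np.diag(1.0 / lam) @ O.T @ W2

y_ref = x @ W1 @ W2
y_fused = x @ W1_fused @ W2_fused
print("max deviation after fusion:", np.abs(y_ref - y_fused).max())  # ~1e-13
```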
Weight Outlier Minimization Initialization (WOMI):
Since weights often follow a Gaussian distribution with zero mean, the orthogonal matrix is initialized to optimally reduce weight outliers. For the global orthogonal matrix, the weights of all linear layers receiving residual inputs are concatenated, and eigenvalue decomposition is performed on their covariance matrix to obtain the eigenvector matrix $\boldsymbol{Q}$. The orthogonal matrix is then initialized as the product of $\boldsymbol{Q}$ and a normalized Hadamard matrix $\boldsymbol{H}$, mirroring the structure of Equation 27 in Appendix A.2.2. This initialization leverages the Hadamard matrix's ability to spread out values and the covariance eigenbasis's alignment with principal components, minimizing outliers and inter-channel disparities. The scaling matrices are initialized as identity matrices.
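A hypothetical sketch of this initialization idea; the exact composition is given by the paper's Equation 27, while here we simply assume the eigenvector matrix is multiplied by a normalized Hadamard matrix and use illustrative shapes:

```python
import numpy as np
from scipy.linalg import hadamard

def womi_init(concat_weights: np.ndarray) -> np.ndarray:
    """Initialize a global orthogonal matrix from concatenated weights of shape (n_rows, d)."""
    d = concat_weights.shape[1]                 # assumes d is a power of two here
    cov = np.cov(concat_weights, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)            # eigenvector matrix Q of the weight covariance
    H = hadamard(d) / np.sqrt(d)                # normalized Hadamard matrix
    return eigvecs @ H                          # rotate into the eigenbasis, then spread values

rng = np.random.default_rng(0)
W_concat = rng.normal(size=(4096, 64))          # toy stand-in for concatenated layer weights
O = womi_init(W_concat)
print("orthogonality error:", np.abs(O.T @ O - np.eye(64)).max())
```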
Inter-Block Learning (Global Transformations):
As shown in the upper part of Figure 5, a global orthogonal transformation is applied at the embedding layer and propagates through each residual path in the LLM, rotating activations across the entire residual path. Due to the norm-preserving property of unit orthogonal matrices, this transformation can bypass RMSNorm layers (which normalize based on the vector norm), with an inverse transformation applied to the inputs of the projection layers and the final output head to maintain equivalence.
Additionally, for the two normalization layers in each block and their respective projection layers, scaling matrices are applied. The global orthogonal matrix rotates activations along the residual path and correspondingly adjusts the weights of multiple projection layers, while the scaling matrices rescale the outputs of the RMSNorm layers and the corresponding projection layers. These transformations are absorbed into the corresponding weight matrices, compensating for distribution shifts and mitigating quantization errors.
Intra-Block Learning (Local Transformations):
The lower part of Figure 5 illustrates the local equivalent transformation pairs within each transformer block (a per-head fusion sketch follows this list):
- Multi-Head Self-Attention (MHSA) Layer: Two pairs are introduced.
  - Value (V) and Output (O) Projection Layers: For each attention head $h$, the data flow is:
    $
    \pmb{Y} = \pmb{P}^h \cdot \pmb{X}^h \cdot (\pmb{W}_v^h \pmb{R}_{ov}^h \pmb{S}_{ov}^h) \cdot (\pmb{S}_{ov}^{h^{-1}} \pmb{R}_{ov}^{h^\top} \pmb{W}_o^h) \quad (11)
    $
    Here:
    - $h$: Head index.
    - $\pmb{P}^h$: Attention matrix for head $h$.
    - $\pmb{X}^h$: Input to head $h$.
    - $\pmb{W}_v^h$: Weight of the value projection.
    - $\pmb{W}_o^h$: Weight of the output projection.
    - $\pmb{R}_{ov}^h$: Learnable orthogonal matrix for the V/O layers.
    - $\pmb{S}_{ov}^h$: Learnable scaling matrix for the V/O layers.
    These transformations improve QSUR for both the value cache and the output projection layer.
  - Query (Q) and Key (K) Projection Layers: After Rotary Positional Encoding (ROPE), the query and key outputs undergo an equivalent scaling transformation, which can be incorporated into the query and key projection weights. An additional Hadamard transformation is applied to the query and key outputs to enhance key-cache quantization.
- Feed-Forward Network (FFN) Layer: The diagonal scaling matrices in the up-projection and down-projection layers of the FFN are optimized. The inverse of the Hadamard transformation (if applied) is fused into the down-projection weight from the start.
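The following hypothetical sketch illustrates the per-head fusion implied by Equation (11): a head-specific rotation and scaling are folded into the value and output projections without changing the attention output (quantization omitted; shapes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d_model, d_head = 6, 32, 8

X = rng.normal(size=(seq, d_model))          # input to one attention head
P = rng.random(size=(seq, seq))              # toy attention matrix for this head
P /= P.sum(axis=-1, keepdims=True)
Wv = rng.normal(size=(d_model, d_head))      # value projection of the head
Wo = rng.normal(size=(d_head, d_model))      # per-head slice of the output projection

R, _ = np.linalg.qr(rng.normal(size=(d_head, d_head)))   # learnable orthogonal R_ov
s = rng.uniform(0.5, 2.0, size=d_head)                   # learnable diagonal scaling S_ov

Wv_fused = Wv @ R @ np.diag(s)
Wo_fused = np.diag(1.0 / s) @ R.T @ Wo

y_ref = P @ X @ Wv @ Wo
y_fused = P @ X @ Wv_fused @ Wo_fused
print("max deviation:", np.abs(y_ref - y_fused).max())   # ~1e-14: attention output unchanged
```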
4.2.4. KL-Top Loss
While LLMs are trained on massive datasets, PTQ optimization is typically performed on small calibration sets (e.g., 1,000 samples). In this limited-data regime:
- Standard cross-entropy (CE) loss can lead to overfitting, where the model performs well on the calibration set but degrades on broader tasks (as shown in Table 1). Small datasets do not fully exercise the LLM's capacity, so CE loss, which focuses on a single correct label, tends to overfit to narrow features.
- Full Kullback-Leibler (KL) divergence, comparing the complete prediction distributions, is also problematic. LLMs have very large vocabularies (e.g., LLaMA-3-8B has over 100,000 tokens), and their predictions typically follow a severe long-tail distribution (Figure 6): only a few tokens carry significant probability. Applying KL divergence over all classes therefore lets the loss be dominated by uninformative low-probability classes, adding noise to training and incurring high computational and memory costs.

To mitigate these issues, OSTQuant proposes the KL-Top loss function, which computes the KL divergence over only the top $k$ classes with the highest probabilities. This focuses optimization on the model's primary predictions, improving gradient quality by avoiding noise from low-probability values, and also reduces computational and memory costs for large vocabularies.
The KL-Top loss is calculated as follows:
$
idxs = \operatorname{topk}(z, k) \quad (12)
$
$
\mathcal{L} = \sum_{i \in idxs} z[i] \log \left( \frac{z[i]}{\hat{z}[i]} \right)
$
where:
- $z$: The prediction distribution (probabilities after softmax) of the full-precision model.
- $\hat{z}$: The prediction distribution of the quantized model.
- $\operatorname{topk}(z, k)$: A function returning the indices of the top-$k$ highest-probability entries of $z$.
- $\mathcal{L}$: The KL-Top loss, summed only over the selected top-$k$ indices.

Experiments (Table 6) show that $k = 1{,}000$ typically yields the best results, balancing semantic richness with optimization stability.
5. Experimental Setup
5.1. Datasets
The experimental validation of OSTQuant is conducted across a range of Large Language Models (LLMs) from the LLaMA family and evaluated on standard benchmarks.
- Models:
  - LLaMA-1 (7B, 13B, 30B) (Touvron et al., 2023a)
  - LLaMA-2 (7B, 13B, 70B) (Touvron et al., 2023b)
  - LLaMA-3-8B, LLaMA-3-70B
- Calibration Data: For the PTQ optimization phase, a small calibration set is used (see the loading sketch after this list).
  - Source: 1,000 samples from WikiText2 (Merity et al., 2016).
  - Characteristics: Each sample has a token length of 2,048. WikiText2 is a common dataset for language model evaluation, particularly for perplexity measurements, as it is derived from high-quality Wikipedia articles.
  - Why chosen: It is a standard and relatively clean dataset suitable for calibrating quantization parameters, with representative linguistic patterns, while being small enough to fit the PTQ paradigm of limited data.
- Evaluation Datasets (Zero-Shot Tasks): To assess the generalized performance and emergent capabilities of the quantized LLMs, the models are evaluated on a suite of nine zero-shot tasks using lm-evaluation-harness (version 0.4.4) (Gao et al., 2024). These tasks test various aspects of language understanding and reasoning.
  - BoolQ (Clark et al., 2019): Boolean question answering.
  - HellaSwag (Zellers et al., 2019): Commonsense reasoning for sentence completion.
  - LAMBADA (Radford et al., 2019): Language Modeling Broadened Across Diverse Activities, testing language understanding by predicting the last word of a passage.
  - OpenBookQA (OBQA) (Mihaylov et al., 2018): Open-book question answering requiring commonsense and factual knowledge.
  - PIQA (Bisk et al., 2020): Physical Interaction Question Answering, focusing on physical commonsense.
  - SIQA (Sap et al., 2019): Social IQa, evaluating commonsense reasoning about social interactions.
  - WinoGrande (Sakaguchi et al., 2021): An adversarial Winograd Schema Challenge to test commonsense reasoning.
  - ARC-Easy and ARC-Challenge (Boratko et al., 2018): AI2 Reasoning Challenge, testing scientific reasoning.
  - Why chosen: These datasets represent a diverse set of zero-shot NLP tasks and are standard benchmarks for evaluating the capabilities and robustness of LLMs after quantization. They matter because perplexity alone (from WikiText2) might not fully reflect the true functional performance of the model, especially after quantization.
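As a rough illustration of how such a calibration set might be assembled, assuming the Hugging Face `datasets` and `transformers` libraries (the dataset and tokenizer identifiers below are common choices, not necessarily those used by the authors):

```python
import torch
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

# Concatenate the corpus and slice it into calibration samples of 2,048 tokens each.
text = "\n\n".join(raw["text"])
ids = tokenizer(text, return_tensors="pt").input_ids[0]

seq_len, n_samples = 2048, 1000
chunks = [ids[i * seq_len:(i + 1) * seq_len] for i in range(n_samples)]
calib = torch.stack([c for c in chunks if len(c) == seq_len])
print(calib.shape)  # (n_samples, 2048), ready to feed through the frozen FP model
```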
5.2. Evaluation Metrics
For comprehensive evaluation, OSTQuant uses both perplexity and zero-shot accuracy (averaged across multiple tasks), along with inference speedup and memory savings.
- Perplexity (PPL)
  - Conceptual Definition: Perplexity is a common metric for evaluating language models. It measures how well a probability model predicts a sample: a lower perplexity indicates that the model predicts the test set more accurately and assigns higher probabilities to the observed sequences. Intuitively, it measures how "surprised" the model is by new data; less surprise means better performance. (A computation sketch follows this list.)
  - Mathematical Formula: For a sequence of words $W = (w_1, \dots, w_N)$, the perplexity is defined as $PPL(W) = \exp\left(-\frac{1}{N} \sum_{i=1}^N \log P(w_i \mid w_1, \dots, w_{i-1})\right)$.
  - Symbol Explanation:
    - $W$: The test corpus or sequence of words.
    - $N$: The total number of words in the test corpus.
    - $P(w_i \mid w_1, \dots, w_{i-1})$: The probability assigned by the language model to the $i$-th word $w_i$ given the preceding words, typically computed by the LLM during next-token prediction.
- Zero-Shot Accuracy (Avg. Zero-Shot Score)
  - Conceptual Definition: Zero-shot accuracy measures a model's ability to perform tasks it has not been explicitly trained on, reflecting its generalization capabilities and emergent properties. The average zero-shot score is the mean accuracy across a suite of predefined zero-shot tasks; for each task, accuracy is simply the percentage of correct predictions.
  - Mathematical Formula: For an individual task, $\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} \times 100\%$, and the average is $\text{Avg. Zero-Shot Score} = \frac{1}{M} \sum_{j=1}^M \text{Accuracy}_j$.
  - Symbol Explanation:
    - $\text{Accuracy}_j$: The accuracy achieved on the $j$-th zero-shot task.
    - $M$: The total number of zero-shot tasks evaluated (nine in this paper).
- Speedup
  - Conceptual Definition: Speedup quantifies how much faster the quantized model performs inference compared to its full-precision (e.g., FP16) counterpart; a speedup of 2x means the quantized model is twice as fast.
  - Mathematical Formula: $\text{Speedup} = \frac{\text{Inference Time (Full Precision)}}{\text{Inference Time (Quantized)}}$
  - Symbol Explanation:
    - Inference Time (Full Precision): The time taken for a given inference operation (e.g., prefill, decoding a token) using the full-precision model.
    - Inference Time (Quantized): The time taken for the same operation using the quantized model.
- Memory Saving Factor
  - Conceptual Definition: The memory saving factor indicates how much less memory the quantized model consumes compared to its full-precision version; a factor of 3x means the quantized model uses one-third of the memory.
  - Mathematical Formula: $\text{Memory Saving Factor} = \frac{\text{Memory Usage (Full Precision)}}{\text{Memory Usage (Quantized)}}$
  - Symbol Explanation:
    - Memory Usage (Full Precision): The memory consumed by the full-precision model (e.g., for storing weights or activations).
    - Memory Usage (Quantized): The memory consumed by the quantized model.
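A compact, hypothetical sketch of how WikiText2 perplexity is commonly computed with a Hugging Face causal LM, using non-overlapping windows (the model and tokenizer identifiers are placeholders, and this is not the authors' evaluation script):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@torch.no_grad()
def wikitext_ppl(model, tokenizer, text: str, seq_len: int = 2048) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    nll, n_tokens = 0.0, 0
    for start in range(0, ids.shape[1] - 1, seq_len):
        chunk = ids[:, start:start + seq_len]
        if chunk.shape[1] < 2:
            break
        out = model(chunk, labels=chunk)   # mean next-token negative log-likelihood
        n = chunk.shape[1] - 1
        nll += out.loss.item() * n
        n_tokens += n
    return float(torch.exp(torch.tensor(nll / n_tokens)))

# Usage with placeholder identifiers:
# model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
#                                              torch_dtype=torch.float16, device_map="auto")
# tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
# print(wikitext_ppl(model, tokenizer, "\n\n".join(test_split["text"])))
```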
5.3. Baselines
OSTQuant is benchmarked against several representative Post-Training Quantization (PTQ) methods, ranging from basic approaches to state-of-the-art techniques.
- RTN (Round-to-Nearest): A basic quantization method in which floating-point values are simply rounded to the nearest available integer quantization level. It serves as a baseline that highlights the benefit of more sophisticated methods over naive quantization.
- SmoothQuant (Xiao et al., 2022): A smooth-based transformation method that shifts quantization difficulty from activations to weights by scaling activations and inversely scaling weights, reducing inter-channel variance in activations.
- GPTQ (Frantar et al., 2022): An accurate weight-only PTQ method that uses Hessian-based error compensation to minimize quantization errors in generative pre-trained transformers.
- OmniQuant (Shao et al., 2023): A method that further enhances quantization performance by training quantization parameters (scales and zero-points) and transformation coefficients through an optimization process, often involving block-wise reconstruction.
- AWQ (Lin et al., 2023): Activation-aware weight quantization, which protects salient weights (those critical for model performance) from quantization errors by identifying activation outliers and applying appropriate scaling.
- QuaRot (Ashkboos et al., 2024): A rotation-based method that uses random rotation matrices to make activations and weights more uniform and outlier-free, enabling 4-bit quantization of both.
- SpinQuant (Liu et al., 2024): An advancement over QuaRot that learns the rotation matrices for 4-bit weight-activation quantization rather than relying on random ones, aiming for better optimization.
5.4. Implementation Details
The paper specifies key aspects of OSTQuant's implementation and optimization (a Riemannian optimization sketch follows this list):
- Quantization Scheme:
  - Activations: Per-token asymmetric quantization without any pruning operations. Per-token means scales and zero-points are computed for each token in a sequence, adapting to dynamic ranges; asymmetric means the quantization range is not necessarily centered around zero.
  - Weights: Per-channel symmetric quantization. Per-channel means scales are computed independently for each output channel of a weight tensor; symmetric means the quantization range is centered around zero (e.g., from -max to +max).
- Optimizer: RiemannAdam (Bécigneul & Ganea, 2018) is used to optimize all unit orthogonal matrices and scaling matrices. RiemannAdam is suited to Riemannian optimization on Stiefel manifolds, where orthogonal matrices reside.
- Calibration Data and Iterations:
  - Calibration samples: 1,000 samples from WikiText2.
  - Token length: 2,048 per sample.
  - Iterations: 150.
  - Batch size: 8.
- Learning Rate Schedule: Cosine learning rate decay is applied. Separate initial learning rates are used for the orthogonal-matrix parameters and the scaling parameters; the paper notes that setting the learning rate for the Stiefel manifold (orthogonal matrices) about 10 times larger than that for the scaling parameters typically leads to better results.
- Hardware: Quantization and optimization timing were measured on an A800 GPU, detailed prefill/memory speedup tests on an RTX 3090 GPU (Table 9), and decoding-speed tests on an A6000 GPU (Table 10).
- Kernel Implementation: The 4-bit matrix multiplication kernel was implemented with NVIDIA's CUTLASS, a high-performance CUDA library. The self-attention mechanism uses PyTorch's native SDPA (scaled dot-product attention) function.
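A hypothetical sketch of how such constrained optimization can be set up with the geoopt library, assuming its `Stiefel` manifold, `ManifoldParameter`, and `RiemannianAdam` APIs; the toy objective and the learning rates shown are illustrative stand-ins, not the paper's actual KL-Top training loop:

```python
import torch
import geoopt

d = 64
# Orthogonal matrix constrained to the Stiefel manifold, initialized at identity.
O = geoopt.ManifoldParameter(torch.eye(d), manifold=geoopt.Stiefel())
# Unconstrained log-scaling parameters (exponentiated to keep the diagonal positive).
log_s = torch.nn.Parameter(torch.zeros(d))

opt = geoopt.optim.RiemannianAdam(
    [{"params": [O], "lr": 1e-2},        # ~10x larger LR for the Stiefel parameters
     {"params": [log_s], "lr": 1e-3}],
)

x = torch.randn(256, d) * torch.linspace(0.1, 10.0, d)   # toy uneven activations

for _ in range(150):
    opt.zero_grad()
    y = x @ O @ torch.diag(torch.exp(log_s))
    # Toy objective: encourage uniform per-channel spread (a stand-in for the real loss).
    loss = y.abs().amax(dim=0).std() + (y.pow(2).mean(dim=0) - 1.0).pow(2).mean()
    loss.backward()
    opt.step()                            # the Riemannian step keeps O orthogonal

print("orthogonality error:", (O.T @ O - torch.eye(d)).abs().max().item())
```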
6. Results & Analysis
6.1. Core Results Analysis
OSTQuant consistently outperforms existing state-of-the-art methods across various LLMs and quantization configurations, demonstrating its effectiveness in preserving accuracy at low bit-widths and achieving significant efficiency gains.
Quantization Performance (Table 2):
- W4-only (4-16-16) Setting: OSTQuant achieves remarkable accuracy retention, maintaining at least 99.5% of the floating-point (FP) accuracy on zero-shot tasks. For the challenging LLaMA-3-8B model, OSTQuant shows only a 0.29-point drop (67.80 vs. FP 68.09), significantly outperforming other methods, which lose more than 1.55 points. This demonstrates its superior ability to quantize weights with minimal impact.
- W4A4KV16 (4-4-16) Setting: When activations are also quantized but the KV cache remains at FP16, OSTQuant shows substantial gains. For instance, on LLaMA-2-7B, OSTQuant (63.90) outperforms SpinQuant (57.37) by 6.53 points in zero-shot accuracy, highlighting its effectiveness in handling activation distributions.
- W4A4KV4 (4-4-4) Setting: In this most aggressive configuration, with 4-bit weights, activations, and KV cache, OSTQuant still maintains significant gains. It outperforms the previous SOTA method, SpinQuant, by approximately 1 point across multiple models (e.g., LLaMA-3-8B: 65.37 vs. 64.10; LLaMA-2-7B: 63.18 vs. 62.01), demonstrating robustness in extreme compression scenarios and making 4-bit inference truly feasible.
- Smooth-based vs. Rotation-based: The results clearly show that rotation-based methods (QuaRot, SpinQuant, OSTQuant) significantly outperform smooth-based methods (such as SmoothQuant) once activations are quantized, confirming the paper's argument that smooth-based methods struggle with the prominent outliers and uneven distributions in activations.

QSUR and Accuracy Correlation (Figure 3): Evaluating QSUR across different quantization methods on the LLaMA variants reveals a clear positive correlation with zero-shot accuracy. OSTQuant achieves the highest QSUR, which directly translates into improved model accuracy. This validates QSUR as an effective indicator of quantizability and confirms that OSTQuant succeeds in optimizing the data distributions.

Activation Distribution Uniformity (Figures 2, 11, 12): Visualizations (Figure 2) show that LLaMA-3-8B's activation distributions prior to OSTQuant exhibit substantial variation across channels and numerous outliers. After OSTQuant's transformations, the distributions become significantly more uniform across channels and the outliers are mitigated. This qualitative observation supports the quantitative QSUR improvements and performance gains.

Speedup and Memory Savings (Tables 3, 9, 10):
- Inference Speedup: OSTQuant enables 4-bit inference that delivers a substantial average speedup over FP16, with the largest acceleration on bigger models such as LLaMA-30B, attributable to reduced memory-access overhead and efficient low-precision compute units.
- Memory Savings: OSTQuant provides large memory savings, which is crucial for deploying LLMs on resource-constrained devices or running larger models on existing hardware.
- Decoding-Stage Efficiency (Table 10): In the decoding stage, which is often memory-bound, quantizing to 4 bits significantly reduces memory-access overhead. This allows running very large models such as LLaMA-3-70B on a single A6000 GPU at a respectable 15 tokens per second, whereas the FP16 model runs out of memory (OOM).
- Training Speed (Table 11): OSTQuant also offers substantial advantages in optimization speed compared to block-reconstruction methods such as OmniQuant. With only 150 iterations and a minimal number of learnable parameters, it quantizes smaller models (7B) in around 20 minutes and larger models (30B) in about 120 minutes, markedly faster than OmniQuant.
6.2. Data Presentation (Tables)
The following are the results from Table 1 of the original paper:
| Model | Loss Type | Wiki PPL | Arc-Easy Score | Arc-Challenge Score |
| --- | --- | --- | --- | --- |
| LLaMA-2-7B | Origin | 5.38 | 69.87 | 42.41 |
| LLaMA-2-7B | KL-Top | 5.94 | 72.69 | 44.62 |
| LLaMA-2-13B | Origin | 5.12 | 75.09 | 46.08 |
| LLaMA-2-13B | KL-Top | 5.25 | 75.29 | 47.10 |
| LLaMA-3-8B | Origin | 6.80 | 76.68 | 49.26 |
| LLaMA-3-8B | KL-Top | 7.29 | 76.73 | 49.32 |
The following are the results from Table 2 of the original paper:
| #Bits W-A-KV | Method | LLaMA-3 8B | | LLaMA-3 70B | | LLaMA-2 7B | | LLaMA-2 13B | | LLaMA-2 70B | | LLaMA 7B | | LLaMA 13B | | LLaMA 30B | |
| | | 0-shot Avg. (↑) | Wiki (↓) | 0-shot Avg. (↑) | Wiki (↓) | 0-shot Avg. (↑) | Wiki (↓) | 0-shot Avg. (↑) | Wiki (↓) | 0-shot Avg. (↑) | Wiki (↓) | 0-shot Avg. (↑) | Wiki (↓) | 0-shot Avg. (↑) | Wiki (↓) | 0-shot Avg. (↑) | Wiki (↓) |
| 16-16-16 | Floating Point | 68.09 | 6.14 | 73.81 | 2.86 | 65.21 | 5.47 | 67.61 | 4.88 | 71.59 | 3.32 | 64.48 | 5.68 | 66.67 | 5.09 | 70.00 | 4.10 |
| 4-16-16 | RTN | 63.70 | 8.13 | 31.15 | 1e5 | 61.27 | 7.02 | 60.24 | 6.39 | 69.62 | 3.87 | 62.67 | 7.94 | 63.45 | 8.60 | 65.69 | 6.13 |
| | SmoothQuant | 62.79 | 8.12 | 67.94 | 6.70 | 58.88 | 8.03 | 62.03 | 5.86 | 65.93 | 5.50 | 62.24 | 7.46 | 62.69 | 18.75 | 65.69 | 5.80 |
| | GPTQ | 61.03 | 7.43 | 31.45 | 9e3 | 60.86 | 9.84 | 64.71 | 5.79 | 70.96 | 3.94 | 60.15 | 7.93 | 64.36 | 6.58 | 66.95 | 5.26 |
| | Omniquant | 65.66 | 7.19 | - | - | 63.19 | 5.74 | 66.38 | 5.02 | 71.04 | 3.47 | 63.42 | 5.86 | 66.22 | 5.21 | 69.07 | 4.25 |
| | AWQ | 67.03 | 7.36 | 68.92 | 5.92 | 63.89 | 5.83 | 66.25 | 5.07 | 70.88 | 4.03 | 63.30 | 5.97 | 65.58 | 5.28 | 69.44 | 4.28 |
| | QuaRot | 67.27 | 6.53 | 72.93 | 3.53 | 64.30 | 5.62 | 66.95 | 5.00 | 71.21 | 3.41 | 63.40 | 5.83 | 65.91 | 5.20 | 69.73 | 4.27 |
| | SpinQuant | 66.54 | 6.49 | 72.90 | 3.49 | 63.59 | 5.58 | 67.14 | 5.00 | 71.12 | 3.43 | 63.94 | 5.76 | 66.32 | 5.16 | 69.62 | 4.21 |
| | OSTQuant | 67.80 | 6.53 | 73.69 | 3.19 | 64.37 | 5.64 | 67.31 | 4.94 | 71.48 | 3.41 | 64.13 | 5.81 | 66.62 | 5.21 | 69.84 | 4.19 |
| 4-4-16 | RTN | 33.42 | 6e2 | 31.21 | 8e3 | 32.44 | nan | 30.86 | δe3 | 30.90 | 7e4 | 32.51 | 7e3 | 31.63 | 3e4 | 31.57 | 2e3 |
| | SmoothQuant | 33.04 | 1e3 | 34.67 | 2e2 | 32.13 | nan | 34.26 | 1e3 | 35.86 | 3e2 | 34.42 | 3e2 | 33.29 | 6e2 | 34.64 | 1e3 |
| | GPTQ | 32.98 | 5e2 | 31.47 | 4e4 | 32.72 | nan | 30.11 | 4e3 | 30.86 | nan | 32.12 | 1e3 | 31.51 | 3e3 | 30.88 | 2e3 |
| | QuaRot | 61.69 | 8.02 | 65.56 | 6.35 | 61.87 | 6.05 | 65.13 | 5.35 | 69.96 | 3.78 | 61.76 | 6.22 | 64.46 | 5.50 | 68.14 | 4.57 |
| | SpinQuant | 64.11 | 7.28 | 66.99 | 6.10 | 57.37 | 6.78 | 63.23 | 5.24 | 70.58 | 3.68 | 61.82 | 6.08 | 64.59 | 5.36 | 68.08 | 4.53 |
| | OSTQuant | 65.14 | 7.24 | 72.21 | 3.97 | 63.90 | 5.60 | 66.24 | 5.14 | 70.92 | 3.57 | 62.72 | 6.04 | 65.80 | 5.40 | 68.52 | 4.43 |
| 4-4-4 | RTN | 33.18 | 7e2 | 30.82 | 8e3 | 32.67 | nan | 30.93 | 7e3 | 31.73 | 7e4 | 32.87 | 1e4 | 31.33 | 3e4 | 31.64 | 2e3 |
| | SmoothQuant | 32.96 | 1e3 | 33.76 | 3e2 | 32.12 | nan | 33.36 | 1e3 | 35.54 | 3e2 | 33.32 | 3e2 | 33.28 | 5e2 | 34.65 | 1e3 |
| | GPTQ | 33.71 | 6e2 | 31.20 | 4e4 | 33.52 | nan | 27.85 | 5e3 | 31.09 | nan | 31.80 | 2e3 | 30.63 | 3e3 | 31.07 | 2e3 |
| | Omniquant | 32.33 | 4e2 | - | - | 48.40 | 14.26 | 50.35 | 12.30 | - | - | 48.46 | 11.26 | 45.63 | 10.87 | 45.04 | 12.35 |
| | QuaRot | 61.38 | 8.18 | 65.33 | 6.60 | 61.48 | 6.11 | 65.16 | 5.39 | 70.30 | 3.80 | 61.22 | 6.26 | 64.59 | 5.53 | 68.08 | 4.60 |
| | SpinQuant | 64.10 | 7.35 | 66.31 | 6.24 | 62.01 | 5.96 | 64.13 | 5.74 | 70.57 | 3.61 | 61.32 | 6.12 | 64.95 | 5.39 | 68.14 | 4.55 |
| | OSTQuant | 65.37 | 7.29 | 71.69 | 4.01 | 63.18 | 5.91 | 65.41 | 5.25 | 70.84 | 3.59 | 62.55 | 6.07 | 65.43 | 5.40 | 68.20 | 4.42 |
The following are the results from Table 3 of the original paper:
| Model Size | Prefill Speedup (Seqlen) | Memory Saving Factor (Seqlen) | ||||||||||
| 256 | 512 | 1024 | 2048 | 4096 | 8192 | 256 | 512 | 1024 | 2048 | 4096 | 8192 | |
| 7B | 2.24x | 2.27x | 2.23x | 2.14x | 2.11x | 2.02x | 3.48x | 3.34x | 3.12x | 2.86x | 2.57x | 2.34x |
| 8B | 2.42x | 2.52x | 2.52x | 2.43x | 2.36x | 2.23x | 3.48x | 3.36x | 3.12x | 2.77x | 2.38x | 2.00x |
| 13B | 2.62x | 2.68x | 2.63x | 2.52x | 2.83x | 2.32x | 3.64x | 3.51x | 3.30x | 3.02x | 2.70x | 2.43x |
| 30B | 3.18x | 3.01x | 2.98x | 3.40x | 2.84x | 2.68x | 3.70x | 3.59x | 3.42x | 3.15x | 2.83x | 2.53x |
The following are the results from Table 4 of the original paper:
| Metric | Baseline | +Rres | +Sres | +Rdn | +Su|d | +Rqk | +Sqk | +Rov | +Sov |
| Wiki PPL | nan | 9.70 | 9.46 | 6.16 | 6.00 | 5.92 | 5.92 | 5.94 | 5.91 |
| Zero-shot9 | 33.51 | 54.33 | 53.74 | 61.75 | 61.79 | 62.35 | 62.56 | 63.11 | 63.18 |
The following are the results from Table 5 of the original paper:
| Model | Optimizer Type | Best Steps | Best LR1 | Best LR2 | Zero-Shot Score |
| LLaMA-2-7B | Cayley SGD | 150 | 1.50 | 0.20 | 63.11 |
| Riemann SGD | 500 | 0.10 | 0.02 | 63.09 | |
| Riemann Adam | 150 | 0.02 | 1e-3 | 63.18 | |
| LLaMA-2-13B | Cayley SGD | 200 | 1.50 | 0.2 | 64.77 |
| Riemann SGD | 500 | 0.1 | 0.02 | 65.19 | |
| Riemann Adam | 150 | 0.02 | 0.002 | 65.41 |
The following are the results from Table 6 of the original paper:
| Setting | Metric | k=5 | k=50 | k=100 | k=500 | k=1000 | k=5000 | k=10000 |
| W3 Only | Zero-Shot9 Score | 61.87 | 61.88 | 61.75 | 62.18 | 62.30 | 61.25 | 61.21 |
| Wiki PPL | 6.06 | 6.116 | 6.13 | 6.07 | 6.06 | 6.06 | 6.12 | |
| W4A4KV4 | Zero-Shot Score | 62.4 | 62.13 | 62.38 | 62.34 | 63.18 | 62.44 | 62.11 |
| Wiki PPL | 5.99 | 5.96 | 5.95 | 5.96 | 5.96 | 5.93 | 5.94 |
The following are the results from Table 7 of the original paper:
| Model | Quant Setting | Method | Zero-Shot9 | Wiki PPL |
| LLaMA-2-7B | Full-Precision | - | 65.21 | 5.47 |
| W4A16KV16 | Hadamard | 63.32 | 5.62 | |
| W4A16KV16 | WOMI | 63.45 | 5.59 | |
| W4A4KV4 | Hadamard | 61.47 | 6.11 | |
| W4A4KV4 | WOMI | 61.52 | 6.09 | |
| LLaMA-3-8B | Full-Precision | - | 68.09 | 6.14 |
| W4A16KV16 | Hadamard | 67.27 | 6.53 | |
| W4A16KV16 | WOMI | 67.41 | 6.48 | |
| W4A4KV4 | Hadamard | 61.38 | 8.18 | |
| W4A4KV4 | WOMI | 61.40 | 8.17 | |
The following are the results from Table 8 of the original paper:
| Model | Method | ARC-c | ARC-e | BoolQ | HellaS | Lam. | OBQA | PIQA | SIQA | WinoG. | Avg. | Wiki2 PPL |
| LLaMA3-8B | RTN | 23.72 | 30.56 | 46.18 | 29.83 | 2.70 | 28.60 | 52.45 | 34.39 | 50.20 | 33.18 | 704.34 |
| SpinQuant | 46.33 | 73.57 | 76.15 | 75.43 | 71.40 | 41.40 | 79.16 | 44.68 | 68.75 | 64.10 | 7.35 | |
| SpinQuant + KL-Top | 47.29 | 73.95 | 75.82 | 75.64 | 71.40 | 41.58 | 78.16 | 44.38 | 68.45 | 64.07 | 7.54 | |
| OSTQuant | 49.26 | 76.68 | 78.25 | 76.18 | 70.48 | 43.19 | 77.85 | 45.18 | 69.13 | 65.13 | 6.80 | |
| OSTQuant + KL-Top | 49.32 | 76.73 | 78.87 | 76.01 | 70.77 | 43.20 | 78.51 | 45.70 | 69.22 | 65.37 | 7.29 | |
| LLaMA2-7B | RTN | 27.22 | 27.06 | 50.83 | 27.34 | 0.93 | 25.80 | 49.51 | 34.85 | 50.51 | 32.67 | nan |
| SpinQuant | 40.44 | 71.08 | 74.40 | 73.51 | 70.66 | 41.80 | 76.88 | 43.50 | 65.82 | 62.01 | 5.96 | |
| SpinQuant + KL-Top | 40.76 | 71.29 | 74.61 | 73.08 | 70.19 | 40.94 | 76.32 | 43.85 | 67.78 | 62.09 | 6.16 | |
| OSTQuant | 42.41 | 69.87 | 75.07 | 72.90 | 70.21 | 40.87 | 78.16 | 44.16 | 68.40 | 62.45 | 5.38 | |
| OSTQuant + KL-Top | 44.62 | 72.69 | 75.10 | 73.27 | 70.21 | 41.00 | 78.13 | 44.42 | 68.27 | 63.11 | 5.94 |
The following are the results from Table 9 of the original paper:
| Model | Seqlen | Prefill Time | Prefill Speedup | Memory | Memory Saving | ||
| FP16 | INT4 | FP16 | INT4 | ||||
| LLaMA2-7B | 256 | 8.050ms | 3.597ms | 2.238x | 0.411GB | 0.118GB | 3.479x |
| 512 | 14.904ms | 6.579ms | 2.265x | 0.435GB | 0.130GB | 3.341x | |
| 1024 | 27.989ms | 12.582ms | 2.225x | 0.483GB | 0.155GB | 3.116x | |
| 2048 | 54.276ms | 25.312ms | 2.144x | 0.577GB | 0.202GB | 2.857x | |
| 4096 | 112.230ms | 53.145ms | 2.112x | 0.766GB | 0.299GB | 2.566x | |
| 8192 | 244.675ms | 121.339ms | 2.016x | 1.147GB | 0.491GB | 2.336x | |
| LLaMA3-8B | 256 | 8.035ms | 3.314ms | 2.424x | 0.430GB | 0.124GB | 3.478x |
| 512 | 15.545ms | 6.176ms | 2.517x | 0.442GB | 0.132GB | 3.356x | |
| 1024 | 29.169ms | 11.599ms | 2.515x | 0.466GB | 0.149GB | 3.116x | |
| 2048 | 57.470ms | 23.631ms | 2.432x | 0.513GB | 0.185GB | 2.774x | |
| 4096 | 117.523ms | 49.835ms | 2.358x | 0.608GB | 0.256GB | 2.378x | |
| 8192 | 256.394ms | 114.815ms | 2.233x | 0.795GB | 0.397GB | 2.003x | |
| LLaMA2-13B | 256 | 11.449ms | 4.370ms | 2.620x | 0.634GB | 0.174GB | 3.643x |
| 512 | 21.195ms | 7.924ms | 2.675x | 0.663GB | 0.189GB | 3.512x | |
| 1024 | 41.752ms | 15.867ms | 2.631x | 0.723GB | 0.219GB | 3.301x | |
| 2048 | 81.965ms | 32.553ms | 2.518x | 0.841GB | 0.279GB | 3.018x | |
| 4096 | 199.046ms | 70.442ms | 2.826x | 1.079GB | 0.399GB | 2.702x | |
| 8192 | 359.409ms | 154.640ms | 2.324x | 1.551GB | 0.639GB | 2.426x | |
| LLaMA-30B | 256 | 18.682ms | 5.883ms | 3.175x | 1.047GB | 0.283GB | 3.703x |
| 512 | 34.393ms | 11.445ms | 3.005x | 1.085GB | 0.302GB | 3.589x | |
| 1024 | 66.880ms | 22.464ms | 2.977x | 1.162GB | 0.340GB | 3.416x | |
| 2048 | 157.500ms | 46.317ms | 3.400x | 1.315GB | 0.418GB | 3.148x | |
| 4096 | 272.355ms | 96.052ms | 2.835x | 1.625GB | 0.575GB | 2.828x | |
| 8192 | 576.555ms | 215.27ms | 2.678x | 2.242GB | 0.887GB | 2.527x | |
The following are the results from Table 10 of the original paper:
| Model | Decoder Speed (tokens/sec) | Memory Use (GB) | ||||
| FP | Quantized | Speed up | FP | Quantized | Memory Saving | |
| LLaMA-2-7B | 47.32 | 89.4 | 1.89x | 13.94 | 4.32 | 3.23x |
| LLaMA-3-8B | 38.33 | 77.71 | 2.03x | 15.83 | 5.88 | 2.69x |
| LLaMA-2-13B | 23.7 | 55.35 | 2.34x | 23.7 | 8.5 | 2.79x |
| LLaMA-30B | OOM | 30.49 | - | OOM | 18.19 | - |
| LLaMA-3-70B | OOM | 14.68 | - | OOM | 38.41 | - |
The following are the results from Table 11 of the original paper:
| Method | 7B | 8B | 13B | 30B | 70B |
| Omniquant | 1.6h | 1.8h | 3.3h | 7.3h | 9.5h |
| OSTQuant | 0.3h | 0.4h | 0.8h | 2.2h | 5.5h |
| Speedup | 5.3x | 4.5x | 4.1x | 3.3x | 1.7x |
The following are the results from Table 12 of the original paper:
| Model | #BitsW-A-KV | Method | Zero-Shot Common Sense Reasoning Tasks (Avg. Accuracy %) | Wiki2 PPL (↓) | |||||||||
| ARC-c (↑) | ARC-e (↑) | BoolQ (↑) | HellaS. (↑) | Lam. (↑) | OBQA (↑) | PIQA (↑) | SIQA (↑) | WinoG. (↑) | |||||
| (Avg. Zero-Shot Score (↑)) | |||||||||||||
| 2-7B | 16-16-16 | Full Precision | 46.42 | 74.33 | 77.71 | 75.94 | 73.69 | 44.20 | 79.16 | 45.91 | 69.53 | 65.21 | 5.47 |
| 4-16-16 | RTN | 42.15 | 67.59 | 73.06 | 72.34 | 67.18 | 41.80 | 76.50 | 44.11 | 66.69 | 61.27 | 7.02 | |
| SmoothQuant | 39.59 | 65.19 | 69.82 | 73.83 | 67.61 | 42.40 | 77.64 | 44.52 | 68.43 | 60.86 | 8.30 | ||
| GPTQ | 42.49 | 71.00 | 74.70 | 73.50 | 70.87 | 42.40 | 78.40 | 44.93 | 68.82 | 63.19 | 5.74 | ||
| Omniquant | 44.11 | 70.50 | 78.07 | 74.98 | 70.68 | 43.80 | 78.30 | 45.14 | 69.38 | 63.89 | 5.83 | ||
| AWQ | 43.94 | 73.50 | 76.97 | 74.87 | 72.07 | 44.00 | 78.24 | 45.40 | 69.38 | 64.30 | 5.62 | ||
| QuaRot | 43.77 | 72.69 | 73.36 | 75.10 | 73.80 | 43.00 | 77.86 | 45.60 | 67.66 | 63.59 | 5.58 | ||
| SpinQuant | 44.54 | 73.31 | 75.57 | 75.04 | 73.67 | 44.20 | 78.89 | 45.50 | 68.59 | 64.37 | 5.64 | ||
| OSTQuant | 44.54 | 73.31 | 75.57 | 75.04 | 73.67 | 44.20 | 78.89 | 45.50 | 68.59 | 64.37 | 5.64 | ||
| 4-4-16 | RTN | 25.34 | 28.03 | 50.52 | 27.71 | 1.01 | 26.20 | 50.80 | 33.93 | 48.38 | 32.44 | nan | |
| SmoothQuant | 28.33 | 26.39 | 49.39 | 27.28 | 1.18 | 23.40 | 48.80 | 33.75 | 40.75 | 32.13 | nan | ||
| GPTQ | 24.40 | 28.70 | 51.62 | 28.66 | 1.36 | 24.60 | 51.14 | 34.49 | 49.49 | 32.72 | nan | ||
| QuaRot | 42.32 | 69.65 | 74.77 | 72.91 | 70.81 | 39.80 | 77.20 | 43.55 | 65.82 | 61.76 | 6.05 | ||
| SpinQuant | 37.54 | 62.58 | 71.16 | 70.48 | 67.16 | 34.80 | 76.20 | 39.96 | 60.62 | 57.37 | 6.78 | ||
| OSTQuant | 44.03 | 71.93 | 75.41 | 74.94 | 73.22 | 43.20 | 78.51 | 45.85 | 68.03 | 63.90 | 5.60 | ||
| 4-4-4 | RTN | 27.22 | 27.06 | 50.83 | 27.34 | 0.93 | 25.80 | 49.51 | 34.85 | 50.51 | 32.67 | nan | |
| SmoothQuant | 26.37 | 25.63 | 47.71 | 27.05 | 1.11 | 26.40 | 41.90 | 34.49 | 48.38 | 32.12 | nan | ||
| GPTQ | 26.96 | 27.65 | 52.84 | 28.83 | 1.63 | 29.00 | 49.62 | 35.11 | 49.80 | 33.52 | nan | ||
| Omniquant | 31.40 | 53.75 | 63.79 | 60.13 | 55.63 | 34.40 | 60.59 | 40.28 | 54.07 | 48.40 | 14.26 | ||
| QuaRot | 41.43 | 69.32 | 74.83 | 72.50 | 70.66 | 39.80 | 77.42 | 43.70 | 64.64 | 61.48 | 6.11 | ||
| SpinQuant | 40.44 | 71.08 | 74.40 | 73.51 | 70.66 | 41.80 | 76.88 | 43.50 | 65.82 | 62.01 | 5.96 | ||
| OSTQuant | 42.92 | 72.56 | 74.71 | 73.10 | 71.76 | 44.00 | 77.20 | 44.98 | 67.77 | 63.18 | 5.91 | ||
| 2-13B | 16-16-16 | Full Precision | 49.15 | 77.53 | 80.58 | 79.39 | 76.62 | 45.20 | 80.63 | 47.49 | 71.90 | 67.61 | 4.88 |
| 4-16-16 | RTN | 42.92 | 66.54 | 71.38 | 66.62 | 68.99 | 39.40 | 76.93 | 44.06 | 65.35 | 60.24 | 6.39 | |
| SmoothQuant | 46.25 | 70.45 | 75.92 | 69.16 | 70.49 | 39.80 | 77.86 | 45.14 | 64.17 | 62.03 | 5.86 | ||
| GPTQ | 46.63 | 73.95 | 74.83 | 73.77 | 73.00 | 42.40 | 78.51 | 45.50 | 70.64 | 64.71 | 5.79 | ||
| Omniquant | 48.29 | 75.42 | 77.92 | 77.80 | 75.59 | 45.20 | 80.41 | 46.62 | 70.17 | 66.38 | 5.02 | ||
| AWQ | 48.63 | 78.16 | 78.81 | 78.48 | 75.70 | 45.00 | 79.54 | 46.21 | 72.45 | 66.25 | 5.07 | ||
| QuaRot | 49.07 | 78.17 | 76.58 | 77.60 | 76.70 | 45.40 | 80.03 | 45.50 | 71.11 | 66.95 | 5.00 | ||
| SpinQuant | 49.15 | 77.48 | 79.27 | 78.60 | 77.10 | 44.60 | 80.03 | 46.47 | 71.67 | 67.14 | 5.00 | ||
| OSTQuant | 48.72 | 76.26 | 80.67 | 78.27 | 76.54 | 45.54 | 80.25 | 47.65 | 71.90 | 67.31 | 4.94 | ||
| 4-4-16 | RTN | 27.99 | 26.81 | 38.50 | 26.08 | 0.00 | 23.60 | 48.20 | 34.90 | 51.62 | 30.86 | nan | |
| SmoothQuant | 24.49 | 35.06 | 47.98 | 30.87 | 3.67 | 26.20 | 51.01 | 35.31 | 49.20 | 34.26 | nan | ||
| GPTQ | 27.82 | 26.77 | 37.92 | 25.67 | 0.00 | 21.80 | 47.77 | 33.51 | 48.15 | 30.11 | nan | ||
| QuaRot | 46.42 | 72.70 | 78.10 | 75.58 | 74.31 | 43.00 | 79.05 | 44.37 | 71.35 | 65.13 | 5.35 | ||
| SpinQuant | 43.77 | 69.99 | 76.57 | 74.63 | 72.81 | 41.60 | 77.72 | 44.27 | 68.19 | 63.23 | 5.24 | ||
| OSTQuant | 47.78 | 74.66 | 80.03 | 77.60 | 75.94 | 44.00 | 79.38 | 46.06 | 70.32 | 66.24 | 5.14 | ||
| 4-4-4 | RTN | 27.82 | 26.52 | 38.38 | 26.27 | 0.02 | 26.00 | 49.78 | 34.39 | 49.17 | 30.93 | nan | |
| SmoothQuant | 24.49 | 28.83 | 45.84 | 30.70 | 2.70 | 23.40 | 53.81 | 34.80 | 51.07 | 33.36 | nan | ||
| GPTQ | 27.90 | 26.39 | 37.95 | 26.16 | 0.00 | 20.00 | 48.26 | 34.39 | 50.43 | 27.85 | nan | ||
| Omniquant | 32.85 | 55.13 | 64.34 | 60.13 | 56.45 | 33.40 | 68.17 | 40.76 | 56.51 | 40.35 | 12.30 | ||
| QuaRot | 47.27 | 73.91 | 78.41 | 75.33 | 75.30 | 43.80 | 79.27 | 45.85 | 69.06 | 65.16 | 5.39 | ||
| SpinQuant | 46.67 | 74.49 | 76.06 | 75.22 | 72.19 | 42.40 | 78.20 | 43.45 | 67.20 | 64.13 | 5.74 | ||
| OSTQuant | 48.10 | 75.21 | 77.46 | 76.71 | 75.14 | 44.60 | 78.67 | 45.75 | 68.30 | 65.41 | 5.25 | ||
| 3-8B | 16-16-16 | Full Precision | 53.50 | 77.74 | 81.10 | 79.18 | 75.74 | 44.80 | 80.63 | 47.08 | 73.01 | 68.09 | 6.14 |
| 4-16-16 | RTN | 48.98 | 73.23 | 72.75 | 75.90 | 63.85 | 43.20 | 78.40 | 43.81 | 73.16 | 63.70 | 8.13 | |
| SmoothQuant | 47.22 | 72.35 | 72.11 | 74.92 | 62.41 | 43.00 | 77.80 | 43.91 | 71.27 | 62.79 | 8.12 | ||
| GPTQ | 49.74 | 72.50 | 71.28 | 68.34 | 46.69 | 43.60 | 78.50 | 46.47 | 71.82 | 61.03 | 7.43 | ||
| Omniquant | 50.09 | 74.54 | 79.51 | 76.92 | 70.31 | 43.00 | 79.54 | 44.52 | 71.74 | 65.66 | 7.19 | ||
| AWQ | 52.22 | 76.68 | 80.31 | 75.91 | 74.81 | 44.20 | 79.87 | 46.26 | 71.67 | 67.03 | 7.36 | ||
| QuaRot | 51.88 | 77.53 | 79.60 | 77.87 | 74.11 | 44.40 | 80.14 | 46.37 | 72.50 | 67.27 | 6.53 | ||
| SpinQuant | 52.13 | 77.28 | 79.99 | 78.40 | 73.60 | 44.80 | 79.98 | 45.22 | 72.77 | 66.54 | 6.49 | ||
| OSTQuant | 52.82 | 79.84 | 80.31 | 77.86 | 76.48 | 42.80 | 80.74 | 45.55 | 72.80 | 67.80 | 6.53 | ||
| 4-4-16 | RTN | 23.72 | 30.89 | 46.30 | 31.26 | 3.03 | 27.60 | 52.72 | 35.26 | 50.04 | 33.42 | nan | |
| SmoothQuant | 23.29 | 28.24 | 48.93 | 29.19 | 1.57 | 28.60 | 44.46 | 33.37 | 49.64 | 33.04 | nan | ||
| GPTQ | 23.46 | 32.07 | 43.79 | 30.10 | 2.40 | 28.00 | 53.97 | 34.14 | 48.60 | 32.98 | nan | ||
| QuaRot | 47.26 | 72.73 | 73.60 | 67.00 | 43.00 | 76.61 | 45.04 | 65.90 | 61.69 | 8.02 | nan | ||
| SpinQuant | 47.35 | 74.12 | 76.36 | 75.98 | 69.88 | 42.46 | 77.37 | 44.47 | 68.98 | 64.11 | 7.28 | ||
| OSTQuant | 48.81 | 73.48 | 79.82 | 75.97 | 72.20 | 42.70 | 78.18 | 45.75 | 69.22 | 65.14 | 7.24 | ||
| 4-4-4 | RTN | 23.72 | 30.56 | 46.18 | 29.83 | 2.70 | 28.60 | 52.45 | 34.39 | 50.20 | 33.18 | nan | |
| SmoothQuant | 23.55 | 28.96 | 44.84 | 28.90 | 1.44 | 29.40 | 51.09 | 34.14 | 50.36 | 32.96 | nan | ||
| GPTQ | 22.87 | 30.35 | 44.34 | 29.72 | 2.39 | 29.80 | 54.95 | 34.75 | 51.30 | 33.71 | nan | ||
| Omniquant | 29.10 | 33.79 | 41.53 | 31.11 | 1.86 | 25.80 | 53.37 | 34.08 | 50.43 | 32.33 | nan | ||
| QuaRot | 49.49 | 74.37 | 79.16 | 77.22 | 71.69 | 42.29 | 78.89 | 43.87 | 71.03 | 65.33 | 6.60 | ||
| SpinQuant | 46.33 | 73.57 | 76.15 | 75.43 | 71.40 | 41.40 | 79.16 | 44.68 | 68.75 | 64.10 | 7.35 | ||
| OSTQuant | 49.32 | 76.73 | 78.87 | 76.01 | 70.77 | 43.20 | 78.51 | 45.70 | 69.22 | 65.37 | 7.29 | ||
The following are the results from Table 13 of the original paper:
| Model | #Bits W-A-KV | Method | Zero-Shot Common Sense Reasoning Tasks (Avg. Accuracy %) | Wiki2 PPL (↓) | |||||||||
| ARC-c (↑) | ARC-e (↑) | BoolQ (↑) | HellaS. (↑) | LambA. (↑) | OBQA (↑) | PIQA (↑) | SIQA (↑) | WinoG. (↑) | |||||
| (Avg. Zero-Shot Score (↑)) | |||||||||||||
| 7B | [16-16-16] | Full Precision | 44.71 | 72.90 | 74.98 | 76.20 | 73.08 | 43.80 | 79.16 | 45.55 | 69.93 | 64.48 | 5.68 |
| 4-16-16 | RTN | 41.70 | 69.82 | 73.30 | 69.67 | 42.00 | 78.13 | 45.34 | 68.00 | 62.67 | 7.94 | nan | |
| SmoothQuant | 40.96 | 68.60 | 74.04 | 73.16 | 68.74 | 42.00 | 78.07 | 44.37 | 69.46 | 62.69 | 7.46 | ||
| GPTQ | 41.70 | 67.98 | 69.50 | 63.15 | 40.80 | 66.55 | 44.37 | 69.46 | 60.15 | 7.93 | nan | ||
| Omniquant | 42.49 | 71.38 | 74.62 | 74.71 | 71.98 | 42.00 | 79.05 | 45.09 | 69.14 | 63.30 | 5.86 | ||
| AWQ | 43.86 | 70.79 | 75.27 | 69.94 | 43.00 | 78.45 | 45.09 | 69.14 | 63.30 | 5.97 | nan | ||
| QuaRot | 42.75 | 69.99 | 73.30 | 72.71 | 73.55 | 42.80 | 78.35 | 45.14 | 69.61 | 63.40 | 5.83 | ||
| SpinQuant | 43.77 | 71.17 | 74.06 | 72.87 | 72.00 | 44.40 | 78.40 | 44.52 | 70.60 | 63.94 | 5.76 | ||
| OSTQuant | 44.20 | 72.56 | 73.73 | 75.05 | 73.45 | 44.60 | 78.73 | 45.45 | 69.38 | 64.13 | 5.81 | ||
| 4-4-16 | RTN | 23.46 | 29.34 | 45.05 | 29.02 | 1.24 | 26.00 | 52.07 | 35.11 | 51.30 | 32.51 | nan | |
| SmoothQuant | 25.40 | 31.40 | 45.68 | 29.73 | 5.43 | 28.20 | 49.68 | 34.44 | 49.09 | 34.42 | nan | ||
| GPTQ | 23.89 | 27.74 | 42.87 | 28.49 | 1.28 | 27.40 | 51.00 | 36.23 | 50.20 | 32.12 | nan | ||
| Omniquant | 40.36 | 67.26 | 73.15 | 72.89 | 70.81 | 42.00 | 77.90 | 44.27 | 67.17 | 61.76 | 6.22 | ||
| AWQ | 40.19 | 68.43 | 72.35 | 70.91 | 70.68 | 41.20 | 77.75 | 44.17 | 68.67 | 61.82 | 6.08 | ||
| QuaRot | 42.58 | 70.79 | 72.87 | 74.06 | 70.77 | 43.40 | 77.69 | 45.04 | 67.25 | 62.72 | 6.04 | ||
| SpinQuant | 40.19 | 68.43 | 72.35 | 70.91 | 70.68 | 41.20 | 77.75 | 44.17 | 68.67 | 61.82 | 6.08 | ||
| OSTQuant | 42.58 | 70.79 | 72.87 | 74.06 | 70.77 | 43.40 | 77.69 | 45.04 | 67.88 | 62.55 | 6.07 | ||
| 4-4-4 | RTN | 23.89 | 29.59 | 46.67 | 28.37 | 1.13 | 26.40 | 52.90 | 35.21 | 51.54 | 32.87 | nan | |
| SmoothQuant | 23.38 | 30.18 | 50.03 | 29.67 | 4.89 | 24.60 | 51.74 | 34.75 | 50.67 | 33.32 | nan | ||
| GPTQ | 23.89 | 27.90 | 43.88 | 27.86 | 1.05 | 26.20 | 51.85 | 34.75 | 49.49 | 31.80 | nan | ||
| Omniquant | 31.40 | 54.00 | 61.90 | 58.29 | 46.45 | 31.80 | 60.59 | 38.54 | 55.17 | 48.46 | 11.26 | ||
| QuaRot | 40.27 | 67.55 | 72.20 | 72.71 | 70.39 | 39.80 | 77.20 | 44.88 | 65.90 | 61.22 | 6.26 | ||
| SpinQuant | 40.30 | 68.18 | 73.60 | 72.87 | 70.46 | 40.60 | 77.42 | 42.68 | 67.56 | 61.32 | 6.12 | ||
| OSTQuant | 42.92 | 70.33 | 72.11 | 73.77 | 70.66 | 42.42 | 77.91 | 44.93 | 67.88 | 62.55 | 6.07 | ||
| 13B | 16-16-16 | Full Precision | 47.87 | 74.49 | 77.86 | 79.10 | 76.03 | 44.40 | 80.30 | 46.72 | 73.24 | 66.67 | 5.09 |
| 4-16-16 | RTN | 45.56 | 70.66 | 72.45 | 76.06 | 70.50 | 42.00 | 78.84 | 44.93 | 70.01 | 64.45 | 8.60 | |
| SmoothQuant | 43.86 | 71.21 | 75.19 | 74.19 | 69.34 | 40.00 | 77.80 | 45.45 | 70.72 | 62.69 | 8.75 | ||
| GPTQ | 45.99 | 72.85 | 75.27 | 75.31 | 70.10 | 44.60 | 79.87 | 46.16 | 71.11 | 64.36 | 6.58 | ||
| Omniquant | 47.01 | 73.86 | 77.22 | 77.95 | 75.59 | 45.00 | 79.87 | 46.88 | 72.61 | 66.22 | 5.21 | ||
| AWQ | 47.30 | 73.86 | 77.60 | 79.03 | 78.30 | 43.40 | 79.87 | 45.85 | 71.67 | 65.58 | 5.28 | ||
| QuaRot | 47.80 | 72.22 | 76.50 | 78.07 | 75.90 | 45.00 | 79.90 | 45.50 | 72.60 | 65.91 | 5.20 | ||
| SpinQuant | 47.44 | 74.83 | 77.37 | 78.30 | 75.50 | 45.60 | 79.90 | 46.01 | 72.06 | 66.32 | 5.16 | ||
| OSTQuant | 48.04 | 73.86 | 78.10 | 77.28 | 75.99 | 45.60 | 80.55 | 46.93 | 72.30 | 66.62 | 5.21 | ||
| 4-4-16 | RTN | 25.85 | 26.26 | 42.05 | 29.17 | 0.00 | 28.00 | 50.33 | 34.60 | 50.67 | 31.63 | nan | |
| SmoothQuant | 25.43 | 29.29 | 51.00 | 28.10 | 2.02 | 26.00 | 53.32 | 34.30 | 49.57 | 33.29 | nan | ||
| GPTQ | 24.66 | 27.80 | 40.80 | 25.30 | 0.70 | 24.20 | 51.31 | 33.65 | 51.70 | 31.51 | nan | ||
| QuaRot | 46.93 | 72.50 | 75.57 | 76.63 | 74.13 | 42.40 | 78.73 | 45.24 | 68.98 | 64.46 | 5.50 | ||
| SpinQuant | 45.30 | 72.56 | 75.38 | 76.86 | 73.28 | 43.60 | 78.90 | 44.63 | 70.40 | 64.59 | 5.36 | ||
| OSTQuant | 48.04 | 74.07 | 77.13 | 77.22 | 74.58 | 45.00 | 78.62 | 46.16 | 71.35 | 65.80 | 5.40 | ||
| 4-4-4 | RTN | 26.28 | 27.27 | 42.35 | 25.85 | 0.19 | 26.60 | 49.95 | 34.19 | 49.25 | 31.33 | nan | |
| SmoothQuant | 24.49 | 28.83 | 51.65 | 27.91 | 2.08 | 26.00 | 52.56 | 35.41 | 50.59 | 33.28 | nan | ||
| GPTQ | 23.63 | 27.31 | 39.50 | 26.17 | 0.56 | 26.00 | 50.59 | 35.00 | 49.57 | 30.63 | nan | ||
| Omniquant | 31.40 | 54.00 | 61.90 | 58.29 | 46.45 | 31.80 | 60.59 | 38.54 | 55.17 | 45.63 | 10.87 | ||
| QuaRot | 46.50 | 71.55 | 75.50 | 74.30 | 73.47 | 45.00 | 78.90 | 44.37 | 70.09 | 64.59 | 5.53 | ||
| SpinQuant | 45.90 | 70.71 | 76.51 | 77.00 | 73.63 | 45.60 | 79.00 | 45.65 | 70.32 | 64.95 | 5.39 | ||
| OSTQuant | 45.90 | 75.25 | 76.94 | 77.21 | 74.23 | 43.40 | 79.43 | 45.91 | 70.56 | 65.43 | 5.40 | ||
| 30B | 16-16-16 | Full Precision | 52.99 | 80.39 | 82.75 | 82.62 | 77.59 | 48.00 | 82.26 | 47.75 | 75.69 | 70.00 | 4.10 |
| 4-16-16 | RTN | 49.74 | 73.99 | 77.89 | 72.21 | 70.21 | 44.20 | 79.00 | 45.70 | 73.88 | 65.90 | 6.13 | |
| SmoothQuant | 48.98 | 72.90 | 80.00 | 79.00 | 71.49 | 44.80 | 78.13 | 45.96 | 73.70 | 65.69 | 5.80 | ||
| GPTQ | 52.22 | 77.62 | 81.10 | 81.94 | 76.80 | 47.00 | 81.07 | 47.50 | 74.43 | 69.07 | 4.25 | ||
| Omniquant | 53.20 | 77.77 | 81.68 | 82.29 | 76.79 | 48.20 | 81.20 | 48.16 | 75.37 | 69.44 | 4.28 | ||
| AWQ | 53.58 | 78.62 | 82.11 | 82.10 | 77.30 | 48.00 | 81.72 | 47.90 | 75.09 | 69.73 | 4.27 | ||
| QuaRot | 52.90 | 78.49 | 82.02 | 82.21 | 78.00 | 48.00 | 81.01 | 48.41 | 75.06 | 69.62 | 4.21 | ||
| SpinQuant | 53.07 | 79.12 | 82.09 | 82.04 | 78.58 | 48.60 | 81.18 | 48.06 | 74.80 | 69.84 | 4.19 | ||
| OSTQuant | 53.07 | 79.12 | 82.09 | 82.04 | 78.58 | 48.60 | 81.18 | 48.06 | 74.80 | 69.84 | 4.19 | ||
| 4-4-16 | RTN | 25.00 | 27.95 | 42.02 | 27.22 | 0.21 | 27.00 | 49.13 | 34.65 | 50.91 | 31.57 | nan | |
| SmoothQuant | 25.63 | 31.91 | 54.86 | 32.52 | 3.80 | 28.20 | 50.13 | 34.49 | 50.04 | 34.64 | nan | ||
| GPTQ | 27.30 | 27.19 | 38.69 | 26.75 | 0.70 | 25.80 | 49.40 | 35.21 | 47.75 | 30.88 | nan | ||
| Omniquant | 30.60 | 33.79 | 41.95 | 32.44 | 1.86 | 26.40 | 51.56 | 38.54 | 53.91 | 45.04 | 10.33 | ||
| QuaRot | 51.79 | 76.39 | 80.76 | 77.08 | 77.08 | 45.80 | 80.58 | 45.60 | 74.35 | 68.14 | 4.57 | ||
| SpinQuant | 52.06 | 77.06 | 81.38 | 80.62 | 76.90 | 46.00 | 80.14 | 46.26 | 74.27 | 68.08 | 4.53 | ||
| OSTQuant | 51.37 | 78.11 | 82.48 | 79.51 | 75.99 | 45.40 | 81.18 | 47.80 | 74.82 | 68.52 | 4.43 | ||
| 4-4-4 | RTN | 25.00 | 28.87 | 44.07 | 27.29 | 0.39 | 25.60 | 49.67 | 34.54 | 49.33 | 31.64 | nan | |
| SmoothQuant | 22.61 | 31.48 | 55.05 | 31.28 | 3.40 | 28.00 | 53.50 | 34.65 | 50.28 | 34.65 | nan | ||
| GPTQ | 22.87 | 27.70 | 39.36 | 27.00 | 0.33 | 24.00 | 50.71 | 34.30 | 47.91 | 31.07 | nan | ||
| Omniquant | 29.10 | 33.79 | 41.95 | 32.44 | 1.86 | 25.80 | 53.37 | 34.08 | 50.43 | 45.04 | 10.33 | ||
| QuaRot | 51.71 | 76.50 | 80.50 | 77.00 | 76.20 | 45.00 | 80.63 | 45.00 | 73.32 | 68.08 | 4.55 | ||
| SpinQuant | 51.60 | 76.98 | 81.07 | 80.57 | 77.00 | 46.00 | 79.92 | 46.26 | 74.90 | 68.14 | 4.55 | ||
| OSTQuant | 49.76 | 78.52 | 81.16 | 81.13 | 77.57 | 46.40 | 80.90 | 46.11 | 74.27 | 68.20 | 4.42 | ||
The following are the results from Table 14 of the original paper:
| Model | #BitsW-A-KV | Method | Zero-Shot Common Sense Reasoning Tasks (Avg. Accuracy %) | Wiki2 PPL (↓) | |||||||||
| ARC-c (↑) | ARC-e (↑) | BoolQ (↑) | HellaS. (↑) | LambA. (↑) | OBQA (↑) | PIQA (↑) | SIQA (↑) | WinoG. (↑) | |||||
| (Avg. Zero-Shot Score (↑)) | |||||||||||||
| 2-70B | 16-16-16 | Full Precision | 57.42 | 81.02 | 83.79 | 83.81 | 79.60 | 48.80 | 82.70 | 49.18 | 77.98 | 71.59 | 3.32 |
| 4-16-16 | RTN | 55.05 | 79.29 | 81.35 | 81.78 | 75.51 | 47.60 | 81.94 | 46.83 | 76.46 | 69.62 | 3.87 | |
| SmoothQuant | 50.26 | 76.56 | 81.53 | 67.81 | 73.63 | 44.40 | 81.34 | 44.17 | 73.64 | 65.93 | 5.50 | ||
| GPTQ | 56.91 | 80.81 | 83.24 | 82.47 | 79.06 | 47.80 | 82.50 | 48.06 | 77.51 | 70.96 | 3.94 | ||
| Omniquant | 57.08 | 80.81 | 82.69 | 83.07 | 79.18 | 47.40 | 83.08 | 48.87 | 77.19 | 71.04 | 3.47 | ||
| AWQ | 56.67 | 80.54 | 82.98 | 82.54 | 78.83 | 47.67 | 82.97 | 48.20 | 77.70 | 70.88 | 4.03 | ||
| QuaRot | 57.34 | 80.85 | 83.24 | 83.27 | 80.38 | 47.60 | 82.21 | 48.62 | 77.35 | 71.21 | 3.41 | ||
| SpinQuant | 56.91 | 80.60 | 83.18 | 83.06 | 79.16 | 49.00 | 82.54 | 48.31 | 77.11 | 71.12 | 3.43 | ||
| OSTQuant | 57.36 | 81.37 | 83.20 | 83.86 | 79.77 | 48.73 | 82.69 | 48.46 | 77.89 | 71.48 | 3.41 | ||
| 4-4-16 | RTN | 29.35 | 26.05 | 37.74 | 25.97 | 0.02 | 24.80 | 51.31 | 34.14 | 48.70 | 30.90 | nan | |
| SmoothQuant | 25.00 | 25.98 | 55.23 | 32.52 | 7.49 | 25.00 | 54.62 | 35.21 | 51.70 | 35.86 | nan | ||
| GPTQ | 27.82 | 25.80 | 37.95 | 25.82 | 0.00 | 27.00 | 49.67 | 33.98 | 49.00 | 30.86 | nan | ||
| QuaRot | 55.29 | 80.35 | 81.10 | 81.87 | 79.06 | 45.80 | 82.05 | 47.90 | 76.24 | 69.96 | 3.78 | ||
| SpinQuant | 55.38 | 78.96 | 83.36 | 82.54 | 79.00 | 47.20 | 82.10 | 48.67 | 77.43 | 70.58 | 3.68 | ||
| OSTQuant | 56.61 | 80.51 | 83.03 | 82.68 | 79.11 | 47.86 | 83.00 | 48.76 | 76.70 | 70.92 | 3.57 | ||
| 4-4-4 | RTN | 30.30 | 27.74 | 38.23 | 26.12 | 0.02 | 24.60 | 51.74 | 34.29 | 52.43 | 31.73 | nan | |
| SmoothQuant | 24.15 | 33.88 | 55.32 | 31.75 | 7.14 | 26.40 | 54.95 | 34.14 | 52.17 | 35.54 | nan | ||
| GPTQ | 28.75 | 26.39 | 37.86 | 25.96 | 0.00 | 26.40 | 50.00 | 34.44 | 50.04 | 31.09 | nan | ||
| QuaRot | 56.48 | 80.56 | 81.59 | 81.93 | 79.16 | 46.00 | 82.21 | 48.00 | 76.80 | 70.30 | 3.80 | ||
| SpinQuant | 56.31 | 80.64 | 83.55 | 82.36 | 79.41 | 47.20 | 82.21 | 47.29 | 77.05 | 70.57 | 3.61 | ||
| OSTQuant | 56.58 | 80.17 | 83.64 | 82.49 | 78.72 | 48.00 | 82.76 | 48.67 | 76.49 | 70.84 | 3.59 | ||
| 3-70B | 16-16-16 | Full Precision | 64.42 | 85.98 | 85.14 | 84.95 | 79.47 | 48.46 | 84.39 | 50.82 | 80.66 | 73.81 | 2.86 |
| 4-16-16 | RTN | 26.28 | 25.55 | 37.83 | 26.36 | 0.00 | 29.00 | 50.98 | 34.70 | 49.30 | 31.15 | nan | |
| SmoothQuant | 51.88 | 77.53 | 80.09 | 80.47 | 73.16 | 46.60 | 80.58 | 45.29 | 75.85 | 67.94 | 6.70 | ||
| GPTQ | 25.77 | 25.29 | 37.83 | 26.36 | 0.12 | 28.40 | 51.74 | 34.90 | 52.64 | 31.45 | nan | ||
| Omniquant | 48.29 | 75.42 | 77.92 | 77.80 | 75.59 | 45.20 | 80.41 | 46.62 | 70.17 | 66.38 | 5.02 | ||
| AWQ | 62.20 | 83.88 | 85.57 | 84.18 | 79.04 | 48.20 | 83.13 | 50.10 | 80.03 | 72.93 | 3.53 | ||
| QuaRot | 62.03 | 84.97 | 85.11 | 84.60 | 78.30 | 47.00 | 83.90 | 49.85 | 80.90 | 72.90 | 3.49 | ||
| SpinQuant | 63.76 | 85.82 | 84.99 | 85.16 | 79.53 | 48.45 | 84.26 | 51.01 | 80.22 | 73.69 | 3.19 | ||
| OSTQuant | 63.76 | 85.82 | 84.99 | 85.16 | 79.53 | 48.45 | 84.26 | 51.01 | 80.22 | 73.69 | 3.19 | ||
| 4-4-16 | RTN | 27.47 | 25.88 | 37.83 | 26.26 | 0.00 | 27.20 | 51.63 | 35.26 | 49.33 | 31.21 | nan | |
| SmoothQuant | 26.00 | 34.47 | 50.46 | 32.48 | 11.98 | 30.00 | 54.24 | 34.83 | 48.93 | 34.67 | nan | ||
| GPTQ | 25.77 | 25.80 | 43.64 | 26.42 | 0.00 | 27.40 | 52.01 | 32.55 | 49.33 | 31.47 | nan | ||
| QuaRot | 60.60 | 73.65 | 77.46 | 77.83 | 71.96 | 43.20 | 78.13 | 45.29 | 71.90 | 65.56 | 6.35 | ||
| SpinQuant | 53.84 | 77.69 | 80.24 | 78.19 | 77.06 | 45.00 | 78.67 | 43.24 | 73.01 | 66.99 | 6.10 | ||
| OSTQuant | 61.84 | 84.56 | 84.14 | 82.47 | 77.08 | 46.07 | 83.38 | 50.23 | 80.13 | 72.21 | 3.97 | ||
| 4-4-4 | RTN | 27.13 | 25.42 | 37.83 | 26.12 | 0.00 | 26.60 | 50.76 | 34.95 | 51.22 | 33.76 | nan | |
| SmoothQuant | 23.46 | 33.18 | 52.56 | 32.48 | 4.13 | 28.00 | 53.50 | 34.95 | 51.22 | 33.76 | nan | ||
| QuaRot | 49.49 | 74.37 | 79.16 | 77.22 | 71.69 | 42.29 | 78.89 | 43.87 | 71.03 | 65.33 | 6.60 | ||
| SpinQuant | 51.88 | 76.39 | 80.98 | 76.50 | 71.43 | 43.46 | 79.27 | 44.17 | 72.69 | 66.31 | 6.24 | ||
| OSTQuant | 61.29 | 82.39 | 83.43 | 83.25 | 75.90 | 48.93 | 81.73 | 51.24 | 77.01 | 71.69 | 4.01 | ||
6.3. Ablation Studies / Parameter Analysis
6.3.1. Effect of Different Transformation Matrices (Table 4)
The ablation study investigates the contribution of different learnable transformation matrices on LLaMA-2 7B under W4A4KV4 quantization.
- Baseline (Zero-Shot9 33.51): This represents the performance without any of the proposed transformations, likely using a basic RTN or SmoothQuant-like approach, resulting in poor accuracy.
- +Rres (Zero-Shot9 54.33, Wiki PPL 9.70): Adding the global orthogonal transformation Rres on the residual path brings the most significant improvement (over 20 points in Zero-Shot9 accuracy). This highlights the critical role of rotating activations globally across the residual paths to manage distributions.
- +Sres (Zero-Shot9 53.74, Wiki PPL 9.46): This refers to the global scaling transformation Sres. While Rres provides the largest single improvement, Sres also shows a positive effect, suggesting the importance of global scaling, though it is less impactful than global rotation; the specific combination of the two is crucial.
- +Rdn (Zero-Shot9 61.75, Wiki PPL 6.16): Adding the orthogonal transformation Rdn (related to the FFN down-projection, an intra-block rotation) provides the next largest jump in performance. This indicates that local rotations within critical layers are also highly effective.
- +Su|d (Zero-Shot9 61.79, Wiki PPL 6.00): This represents the scaling transformation for the FFN up/down-projections. It offers a slight improvement over Rdn alone, confirming that scaling helps balance variances across channels after rotation.
- +Rqk (Zero-Shot9 62.35, Wiki PPL 5.92): The Hadamard transformation Rqk for Query/Key rotation, applied after ROPE, further improves accuracy.
- +Sqk (Zero-Shot9 62.56, Wiki PPL 5.92): The scaling transformation Sqk for Query/Key also contributes.
- +Rov (Zero-Shot9 63.11, Wiki PPL 5.94): The orthogonal transformation Rov for the Value/Output projections in MHSA significantly boosts performance.
- +Sov (Zero-Shot9 63.18, Wiki PPL 5.91): The scaling transformation Sov for the Value/Output projections provides the final marginal improvement.

Conclusion: The results demonstrate that orthogonal transformations (especially the global Rres and the local Rdn, Rqk, and Rov) provide the most substantial gains by reshaping the distributions. Scaling transformations (Sres, Su|d, Sqk, Sov) then build on these by finely balancing variance across channels, minimizing residual quantization losses and maximizing QSUR. The cumulative effect of all these learnable transformations is critical for OSTQuant's strong performance. A minimal sketch of how such an equivalent transformation pair can be fused into adjacent weights follows below.
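To make the "equivalent transformation pair" mechanism behind these ablations concrete, the sketch below shows how an orthogonal matrix and a diagonal scale can be fused into two adjacent linear weights so that the end-to-end function is unchanged while the intermediate activations are reshaped. This is a generic PyTorch illustration of the idea, not the authors' implementation; the matrix names are ours.

```python
import torch

torch.manual_seed(0)
d = 8
x = torch.randn(4, d)                          # a toy batch of activations
W1, W2 = torch.randn(d, d), torch.randn(d, d)  # two adjacent linear layers

# A random orthogonal rotation R and positive per-channel scales s.
R, _ = torch.linalg.qr(torch.randn(d, d))
s = torch.rand(d) + 0.5
T = R * s                                      # T = R @ diag(s): rotation followed by scaling
T_inv = torch.linalg.inv(T)

y_ref = x @ W1 @ W2                            # original computation

# Fuse T into W1 and T^{-1} into W2: mathematically equivalent, but the
# intermediate activation (x @ W1 @ T) has a reshaped, more quantization-friendly
# distribution when T is chosen or learned well.
y_fused = x @ (W1 @ T) @ (T_inv @ W2)

print(torch.allclose(y_ref, y_fused, atol=1e-4))   # True up to floating-point error
```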
6.3.2. Different Manifold Optimizers (Table 5)
The paper compares different Riemannian optimizers for the unit orthogonal matrices (which reside on a Stiefel manifold) on LLaMA-2-7B and LLaMA-2-13B under W4A4KV4 configuration.
- Cayley SGD (Stochastic Gradient Descent): Requires a relatively high learning rate (LR1 1.50, LR2 0.20) and achieves good results (63.11 for 7B, 64.77 for 13B) within 150-200 steps.
- Riemann SGD: Needs more iterations (500 steps) and lower learning rates (LR1 0.10, LR2 0.02) to reach similar or slightly worse performance (63.09 for 7B, 65.19 for 13B).
- Riemann Adam: Delivers the best results (63.18 for 7B, 65.41 for 13B) with the fewest iterations (150 steps) and lower learning rates (LR1 0.02, LR2 1e-3 or 0.002). This highlights RiemannAdam's efficiency and effectiveness in optimizing on Stiefel manifolds.
- Learning Rate Ratio: A key finding is that the learning rate used for the orthogonal matrices on the Stiefel manifold should be roughly an order of magnitude larger than that used for the scaling transformation parameters; in Table 5, the larger LR1 appears to correspond to the orthogonal matrices and is consistently about 5-10x LR2. This suggests that orthogonal transformations might require larger steps or adapt faster during optimization. A generic orthogonality-preserving update is sketched below for intuition.
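For intuition about why manifold optimizers are needed here, the sketch below implements a generic Cayley-style update: the Euclidean gradient is projected to a skew-symmetric matrix, and the resulting Cayley transform rotates the parameter while keeping it exactly orthogonal. This is a textbook-style step for illustration, not the specific Cayley SGD or RiemannAdam implementation used in the paper.

```python
import numpy as np

def cayley_step(W: np.ndarray, grad: np.ndarray, lr: float) -> np.ndarray:
    """One orthogonality-preserving step on the orthogonal group.
    A = grad @ W.T - W @ grad.T is skew-symmetric, so the Cayley transform
    (I + lr/2 * A)^(-1) (I - lr/2 * A) is orthogonal and W stays on the manifold."""
    A = grad @ W.T - W @ grad.T
    I = np.eye(W.shape[0])
    Q = np.linalg.solve(I + 0.5 * lr * A, I - 0.5 * lr * A)
    return Q @ W

rng = np.random.default_rng(0)
W, _ = np.linalg.qr(rng.standard_normal((8, 8)))   # start from an orthogonal matrix
grad = rng.standard_normal((8, 8))                 # stand-in for dLoss/dW
W_new = cayley_step(W, grad, lr=0.1)
print(np.allclose(W_new @ W_new.T, np.eye(8)))     # True: still orthogonal after the step
```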
6.3.3. Influence of k in KL-Top Loss (Table 6)
The parameter k in the KL-Top loss (Equation 12) determines the number of top-probability classes considered in the KL divergence calculation. The ablation study explores the impact of different k values on LLaMA-2-7B's Zero-Shot9 score and Wiki PPL under W3-only and W4A4KV4 settings.
- Balanced k is Key: Both excessively small (k=5, k=50) and excessively large (k=5000, k=10000) values of k negatively impact optimization.
  - Small k: May discard too much semantic information, leading to sub-optimal optimization.
  - Large k: Reintroduces noise from long-tail distributions and increases computational cost, similar to the full KL divergence.
- Optimal Value: For both the W3-only and W4A4KV4 settings, k=1000 yields the best Zero-Shot9 scores (62.30 and 63.18, respectively). This value appears to strike the right balance between retaining semantic information and mitigating noise.
- PPL Stability: Wiki PPL shows less sensitivity to k than the Zero-Shot scores, indicating that Zero-Shot tasks are a more robust indicator of model performance after quantization (a minimal sketch of the loss follows this list).
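A minimal PyTorch sketch of a top-k KL objective consistent with the description above: the KL divergence is computed only over the k classes that the full-precision model ranks highest for each token. The exact masking and normalization in the paper's Equation 12 may differ; all names here are ours.

```python
import torch
import torch.nn.functional as F

def kl_top_loss(fp_logits: torch.Tensor, q_logits: torch.Tensor, k: int = 1000) -> torch.Tensor:
    """KL(p_fp || p_q) restricted to the top-k classes of the full-precision model.

    fp_logits, q_logits: (num_tokens, vocab_size) logits from the FP16 reference
    model and the quantized model on the same calibration tokens.
    """
    top_ids = fp_logits.topk(k, dim=-1).indices                  # (num_tokens, k)
    p = F.softmax(fp_logits.gather(-1, top_ids), dim=-1)         # renormalized reference dist.
    log_q = F.log_softmax(q_logits.gather(-1, top_ids), dim=-1)  # quantized model on same ids
    return F.kl_div(log_q, p, reduction="batchmean")

# Dummy usage with a LLaMA-sized vocabulary of 32000.
fp = torch.randn(8, 32000)
q = fp + 0.1 * torch.randn(8, 32000)   # stand-in for the quantized model's logits
print(kl_top_loss(fp, q, k=1000))
```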
6.3.4. Effect of Weight Outlier Minimization Initialization (WOMI) (Table 7, Figure 7, 13)
WOMI is designed to initialize trainable orthogonal matrices by leveraging eigenvalue decomposition of weights and Hadamard matrices.
- Performance Improvement (Table 7): WOMI consistently achieves lower perplexity and higher Zero-Shot9 accuracy compared to random Hadamard initialization for LLaMA-2-7B and LLaMA-3-8B under both W4A16KV16 and W4A4KV4 configurations. This confirms its effectiveness in providing a better starting point for optimization.
- Enhanced W4-only Performance: Notably, WOMI shows greater performance improvements in W4-only quantization settings compared to W4A4KV4. This suggests that WOMI is particularly effective at minimizing weight quantization errors, which is crucial when weights are the primary quantized component.
- Visualization (Figure 7, 13): Visualizations illustrate that original weight distributions often have significant variations across input/output channels. While Hadamard transformations reduce some of these differences, WOMI (by incorporating the covariance matrix) further smooths inter-channel disparities and minimizes outliers. This leads to a more compact distribution within the quantization space and reduced relative quantization error. One plausible construction of such an initialization is sketched after this list.
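One plausible reading of such an initialization is sketched below: build the rotation from the eigenbasis of the weight covariance, so that the principal directions of the weights are axis-aligned, and compose it with a normalized Hadamard matrix to spread energy evenly across channels. This is our illustrative interpretation under those assumptions, not the paper's exact WOMI procedure.

```python
import numpy as np
from scipy.linalg import hadamard

def womi_like_init(W: np.ndarray) -> np.ndarray:
    """Illustrative orthogonal init: eigenbasis of the column covariance of W
    composed with a normalized Hadamard matrix (both factors are orthogonal,
    so their product is orthogonal as well)."""
    d = W.shape[1]                     # must be a power of 2 for hadamard()
    cov = W.T @ W / W.shape[0]         # column covariance of the weights
    _, eigvecs = np.linalg.eigh(cov)   # orthonormal eigenbasis
    H = hadamard(d) / np.sqrt(d)       # orthogonal Hadamard matrix
    return eigvecs @ H

W = np.random.randn(4096, 128)         # a toy weight matrix
R0 = womi_like_init(W)
print(np.allclose(R0.T @ R0, np.eye(128)))   # True: a valid orthogonal starting point
```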
6.3.5. Effect of KL-Top Loss with SpinQuant (Table 8)
This ablation investigates the impact of KL-Top loss when applied to SpinQuant versus OSTQuant.
- SpinQuant with KL-Top: When the KL-Top loss is introduced to SpinQuant alone, it does not lead to significant performance improvements and may even cause slight degradation (Avg. Zero-Shot for LLaMA3-8B: 64.10 -> 64.07; LLaMA2-7B: 62.01 -> 62.09). This suggests that KL-Top loss is most effective when coupled with powerful distribution-reshaping transformations.
- OSTQuant with KL-Top: For OSTQuant, using Cross-Entropy (CE) loss (referred to as Origin in Table 1) results in overfitting on the calibration set, leading to higher Wiki PPL and lower Zero-Shot scores. The introduction of KL-Top loss (OSTQuant + KL-Top) alleviates this overfitting problem, leading to better generalization and improved Zero-Shot scores (e.g., LLaMA3-8B: 65.13 -> 65.37; LLaMA2-7B: 62.45 -> 63.11). This highlights that KL-Top loss is crucial for OSTQuant to leverage its transformation capabilities effectively on limited PTQ data.
6.4. QSUR and Evaluation Loss Trends (Figure 10)
Figure 10 illustrates the dynamic behavior of QSUR and evaluation loss during the training process for the LLaMA3-8B model.
- As the number of training iterations increases, the local and average QSURs (Quantization Space Utilization Rates) consistently increase and stabilize at a high level. This directly shows that OSTQuant's learnable transformations are effectively optimizing the data distributions to better utilize the quantization space.
- Concurrently, the evaluation loss gradually decreases, reflecting the model's improved performance during the quantization-aware optimization.
- The positive correlation between increasing QSUR and decreasing evaluation loss (or improving accuracy, as seen in Figure 3) empirically validates QSUR as an effective metric for guiding and assessing quantization quality. A simplified utilization proxy, to make the idea concrete, is sketched below.
6.5. Activation and Weight Distribution Visualizations (Figure 11, 12, 14)
These visualizations provide qualitative evidence of OSTQuant's impact on data distributions and quantization error.
- Activation Distributions (Figure 11, 12): These figures display the activation distributions of different layers in LLaMA-2-7B and LLaMA-3-8B before and after OSTQuant.
  - Before OSTQuant: The distributions show significant variations across different channels and often exhibit prominent outliers, leading to wide ranges and low QSUR.
  - After OSTQuant: The distributions become noticeably more uniform across channels, and outliers are substantially mitigated. This "flattening" and "centering" of distributions allows for a more efficient mapping to discrete quantization levels, reducing the relative quantization error.
- Weight Distributions and WOMI (Figure 13): This figure compares the weight distribution and relative L1 error for LLaMA-2-7B when quantized with original, Hadamard-transformed, and WOMI-transformed weights.
  - Original: Shows large variations and outliers.
  - Hadamard Transformed: Reduces some inter-channel differences, but some spikes or extreme values might remain.
  - WOMI Transformed: Further smooths these inter-channel differences and reduces outliers more effectively, leading to a smaller relative L1 error. This demonstrates WOMI's role in providing a better initial distribution for weight quantization.
- Comparison of Activation Quantization Errors (Figure 14): This figure compares the activation distribution and relative L1 error for QuaRot, SpinQuant, and OSTQuant on LLaMA-2-7B. QuaRot and SpinQuant show improvements over baselines, but OSTQuant achieves the most uniform activation distributions with the lowest relative L1 errors. This visually confirms that OSTQuant's combined orthogonal and scaling transformations are superior in making activations quantization-friendly by effectively managing outliers and variances, resulting in less information loss during per-token quantization. The effect can be reproduced in miniature with the sketch below.
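The qualitative effect in Figure 14 can be reproduced in miniature: per-token 4-bit quantization of a heavy-tailed activation matrix shows a much lower relative L1 error after a random orthogonal rotation, which stands in here for the learned orthogonal and scaling transforms. A small PyTorch sketch under those assumptions:

```python
import torch

def fake_quant(x: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Per-token symmetric uniform quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().amax(dim=-1, keepdim=True) / qmax
    return torch.round(x / scale).clamp(-qmax - 1, qmax) * scale

def relative_l1_error(x: torch.Tensor) -> float:
    return ((fake_quant(x) - x).abs().sum() / x.abs().sum()).item()

torch.manual_seed(0)
x = torch.randn(512, 1024)
x[:, :4] *= 50.0                                    # a few outlier channels, as in LLM activations

rot, _ = torch.linalg.qr(torch.randn(1024, 1024))   # random orthogonal rotation
print(relative_l1_error(x))         # large: outliers stretch the per-token scale
print(relative_l1_error(x @ rot))   # much smaller: the rotation spreads the outliers
```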
7. Conclusion & Reflections
7.1. Conclusion Summary
In this paper, the authors introduce OSTQuant, a novel post-training quantization (PTQ) method aimed at significantly improving the efficiency of Large Language Models (LLMs). The central innovation is the Quantization Space Utilization Rate (QSUR), a new metric that quantitatively assesses the quantizability of transformed data by measuring how effectively it occupies and utilizes the available quantization space. This metric, supported by rigorous mathematical derivations on the effects and limitations of various linear transformations, provides a principled guide for optimizing data distributions.
Leveraging these insights, OSTQuant employs learnable equivalent transformation pairs, consisting of both orthogonal and scaling transformations, to optimize the distributions of weights and activations across the entire quantization space. This combined approach effectively addresses outliers and inter-channel disparities, leading to more quantization-friendly distributions. Furthermore, to overcome the challenges of limited calibration data inherent in PTQ, the KL-Top loss function is proposed. This loss function mitigates optimization noise while retaining richer semantic information by focusing KL divergence on only the top highest-probability logits.
Extensive experiments across various LLMs (LLaMA-1, -2, -3) and benchmarks demonstrate OSTQuant's superior performance. It retains over 99.5% of the floating-point accuracy in W4-only settings and reduces the performance gap by 32% on LLaMA-3-8B in the more challenging W4A4KV4 configuration compared to state-of-the-art methods. The method also delivers significant inference speedups (over 2x on prefill) and memory savings (roughly 2x to 3.7x), while accelerating the quantization process itself by up to 5.3x relative to OmniQuant (Table 11). These results underscore the effectiveness of OSTQuant's principled approach to optimizing data distributions for practical and efficient LLM deployment.
7.2. Limitations & Future Work
The authors identify clear directions for future research, primarily focusing on extending OSTQuant to fully-quantized LLMs.
- Extension to Full Quantization: The current OSTQuant framework, while effective, is primarily presented for traditional PTQ, where some components (like the KV cache or certain activations) might still be kept in higher precision. The future work section (Appendix A.5.1) outlines an extension to full quantization, where all activations within each Transformer Block are quantized to low bits. This would further reduce memory transfer overhead and fully leverage low-precision computational units.
- New Transformation Pairs: This extension proposes introducing more equivalence transformations around the ROPE (Rotary Positional Encoding) and SiLU (Sigmoid Linear Unit) operations, which are critical components in modern LLMs.
  - ROPE Handling: Treating ROPE as a lightweight GEMM layer and constructing a weight matrix based on its principle allows for pre-ROPE and post-ROPE transformation pairs. This involves scaling and orthogonal transformations that optimize the query and key distributions before and after ROPE and the attention computation.
  - SiLU Smoothing: Decomposing $\mathrm{SiLU}(X) = X \cdot \sigma(X)$ allows for equivalence transformations that alleviate inter-channel discrepancies of activations before and after SiLU using scaling matrices. The authors plan to conduct experiments in the full-quantization domain to explore the full potential of OSTQuant. A toy numerical check of this decomposition-based equivalence is given below.
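The SiLU decomposition admits a quick numerical sanity check: a per-channel scale applied to the linear factor of SiLU(X) = X · σ(X) can be undone afterwards without changing the output, while the sigmoid factor must still see the unscaled input. This toy check is our own illustration of the kind of equivalence involved, not the paper's Equation 31:

```python
import torch

torch.manual_seed(0)
x = torch.randn(16, 64)                      # activations entering SiLU
s = torch.rand(64) + 0.5                     # per-channel smoothing scales

silu = x * torch.sigmoid(x)                  # SiLU(X) = X * sigmoid(X)
smoothed = ((x * s) * torch.sigmoid(x)) / s  # scale the linear factor, invert it afterwards

print(torch.allclose(silu, smoothed, atol=1e-6))   # True: an equivalence-preserving pair
```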
7.3. Personal Insights & Critique
This paper presents a rigorous and well-motivated approach to LLM quantization. Here are some personal insights and critiques:
- Strengths and Innovations:
  - Principled Metric (QSUR): The introduction of QSUR is a significant contribution. It moves beyond heuristic observation to provide a quantitative, theoretically-backed metric to guide quantization optimization. This allows for a more direct and transparent evaluation of transformation effectiveness. The correlation between QSUR and accuracy (Figure 3) strongly supports its utility.
  - Holistic Transformation Strategy: The combination of learnable orthogonal and scaling transformations is powerful. Orthogonal transformations excel at reshaping distributions to be more ball-shaped and outlier-resistant, while scaling transformations fine-tune inter-channel variances. This dual approach provides a more comprehensive solution than methods relying solely on one type of linear transformation.
  - Addressing PTQ Constraints: The KL-Top loss function is a clever solution to the problem of limited calibration data and large vocabulary sizes in LLMs. It effectively balances the need for semantic preservation with the practical challenges of PTQ optimization, preventing overfitting and noise.
  - Efficiency: The method not only achieves high accuracy but also demonstrates impressive speedups in inference and memory savings, making LLMs more deployable. The fast training time for quantization is also a major practical advantage.
  - Equivalence Preservation: The design of equivalent transformation pairs that are fused into weights ensures no additional runtime overhead, which is crucial for real-world deployment.
- Potential Issues, Unverified Assumptions, or Areas for Improvement:
  - QSUR Formula Simplifications: While the QSUR derivation provides a solid theoretical basis, some simplifications (e.g., neglecting the mean vector, among other distributional assumptions) are made to arrive at a tractable form (Equation 7). The practical impact of these simplifications on the optimality of the derived transformation (Equation 8) could be further explored. Do real LLM distributions always conform well enough to these assumptions for the derived optimum to hold globally?
  - Complexity of Riemannian Optimization: While RiemannAdam is effective, optimizing on Stiefel manifolds is inherently more complex than standard Euclidean optimization. This might pose a barrier for practitioners less familiar with Riemannian geometry or require specialized libraries. The ease of integration into common ML frameworks (like PyTorch) is good, but understanding the underlying mechanics remains a learning curve.
  - Sensitivity to k in KL-Top: While an optimal k is identified for the tested models, this value might be sensitive to specific LLM architectures, vocabulary sizes, or downstream tasks. A more adaptive or data-driven way to select k could enhance robustness.
  - Generalizability of WOMI: WOMI relies on the assumption that weights follow a Gaussian distribution with zero mean. While often true for typical weight initializations, deviations could affect its optimality.
  - Full Quantization Details: The future work section on full quantization is promising, but the mathematical details of how the ROPE and SiLU transformations are maintained as equivalent transformation pairs (similar to Equation 9) could be further elaborated in a subsequent paper. The decomposition of SiLU into a product and its transformation for inter-channel discrepancies (Equation 31) is an intriguing idea, and its empirical validation would be valuable.
- Transferability and Future Value:
  - The QSUR metric is highly transferable and could be adopted by other quantization methods as a standard evaluation and optimization target, fostering more principled research.
  - The combined orthogonal and scaling transformation strategy is broadly applicable to any neural network layer that performs matrix multiplication, not just LLMs.
  - The insights from KL-Top loss are relevant for any PTQ scenario involving large output spaces (e.g., in sparse expert models) and limited calibration data.
  - The paper's direction towards full quantization is critical, as this is the ultimate goal for maximizing hardware efficiency. OSTQuant seems well-positioned to contribute significantly to this challenging area.

Overall, OSTQuant represents a significant step forward in LLM quantization, offering a robust, theoretically grounded, and highly effective solution that addresses key challenges in LLM compression and acceleration.