
OSTQuant: Refining Large Language Model Quantization with Orthogonal and Scaling Transformations for Better Distribution Fitting

Published: 01/23/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The OSTQuant method optimizes large language model quantization using orthogonal and scaling transformations, addressing uneven and heavy-tailed data distributions. The proposed Quantization Space Utilization Rate (QSUR) metric effectively assesses quantizability, while the KL-Top loss function mitigates optimization noise and preserves semantic information under the limited calibration data of PTQ.

Abstract

Post-training quantization (PTQ) has emerged as a widely adopted technique for compressing and accelerating Large Language Models (LLMs). The major challenge in LLM quantization is that uneven and heavy-tailed data distributions can expand the quantization range, thereby reducing bit precision for most values. Recent methods attempt to eliminate outliers and balance inter-channel differences by employing linear transformations; however, they remain heuristic and often overlook optimizing the data distribution across the entire quantization space. In this paper, we introduce the Quantization Space Utilization Rate (QSUR), a novel metric that effectively assesses the quantizability of transformed data by measuring the space utilization of the data in the quantization space. We complement QSUR with mathematical derivations that examine the effects and limitations of various transformations, guiding our development of Orthogonal and Scaling Transformation-based Quantization (OSTQuant). OSTQuant employs a learnable equivalent transformation, consisting of an orthogonal transformation and a scaling transformation, to optimize the distributions of weights and activations across the entire quantization space. Furthermore, we propose the KL-Top loss function, designed to mitigate noise during optimization while retaining richer semantic information within the limited calibration data imposed by PTQ. OSTQuant outperforms existing work on various LLMs and benchmarks. In the W4-only setting, it retains 99.5% of the floating-point accuracy. In the more challenging W4A4KV4 configuration, OSTQuant reduces the performance gap by 32% on the LLaMA-3-8B model compared to state-of-the-art methods. Code: https://github.com/BrotherHappy/OSTQuant

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

OSTQuant: Refining Large Language Model Quantization with Orthogonal and Scaling Transformations for Better Distribution Fitting

1.2. Authors

Xing Hu, Yuan Cheng, Dawei Yang, Zukang Xu, Zhihang Yuan, Jiangyong Yu, Chen Xu, Zhe Jiang, Sifan Zhou. Affiliations include Houmo AI, Nanjing University, and Southeast University.

1.3. Journal/Conference

This paper is a preprint published on arXiv. It is currently under review or has not yet been officially published in a peer-reviewed journal or conference proceeding. arXiv is a reputable open-access archive for preprints of scientific papers, particularly in physics, mathematics, computer science, and related disciplines, widely used by the academic community to disseminate research findings prior to formal publication.

1.4. Publication Year

2025

1.5. Abstract

Post-training quantization (PTQ) is a critical technique for compressing and accelerating Large Language Models (LLMs). A significant challenge in LLM quantization arises from uneven and heavy-tailed data distributions, which expand the quantization range and consequently reduce bit precision for most values. Existing methods often employ heuristic linear transformations to address outliers and inter-channel differences but frequently overlook optimizing the data distribution across the entire quantization space.

This paper introduces Quantization Space Utilization Rate (QSUR), a novel metric designed to assess the quantizability of transformed data by measuring its space utilization within the quantization space. Complementing QSUR with mathematical derivations, the authors examine the effects and limitations of various transformations. This theoretical foundation guides the development of Orthogonal and Scaling Transformation-based Quantization (OSTQuant). OSTQuant employs learnable equivalent transformations, comprising both orthogonal and scaling components, to optimize weight and activation distributions across the entire quantization space. Additionally, the paper proposes KL-Top loss function, which mitigates noise during optimization while preserving richer semantic information, particularly valuable given the limited calibration data typical in PTQ.

OSTQuant demonstrates superior performance over existing methods across various LLMs and benchmarks. In a W4-only setting, it retains 99.5% of the floating-point accuracy. For the more challenging W4A4KV4 configuration, OSTQuant reduces the performance gap by 32% on the LLaMA-3-8B model compared to state-of-the-art methods.

https://arxiv.org/abs/2501.13987v1 (This is a preprint on arXiv.) PDF Link: https://arxiv.org/pdf/2501.13987v1.pdf

2. Executive Summary

2.1. Background & Motivation

The rapid advancement of Large Language Models (LLMs) has led to their widespread adoption, but their substantial memory and computational requirements pose significant deployment challenges. These challenges exist not only for resource-constrained edge devices but also for powerful cloud servers, limiting their practical applicability.

Post-training quantization (PTQ) is a widely adopted technique to compress and accelerate LLMs by reducing the bit-width of weights and activations after the model has been trained. However, a major hurdle in LLM quantization is the presence of uneven and heavy-tailed data distributions in weights and activations. These distributions mean that a few extreme values (outliers) dictate a large quantization range. Consequently, the limited number of available quantization levels must be stretched over this wide range, leading to reduced precision for the majority of values that fall within the narrower, central part of the distribution. This ultimately results in significant quantization errors and a drop in model performance.

Prior research has attempted to address these issues using linear transformations, such as smooth-based methods (e.g., SmoothQuant) that redistribute quantization difficulty between weights and activations, and rotation-based methods (e.g., QuaRot, SpinQuant) that aim to suppress outliers. While these methods have shown some effectiveness in specific regions of the quantization space, they are often heuristic and fail to holistically optimize the data distribution across the entire quantization space. This leads to sub-optimal utilization of the available quantization levels and leaves a notable gap in ensuring robust performance at very low bit-widths. The paper identifies this gap as the lack of a quantitative metric to assess quantizability and the absence of a method that globally optimizes data distribution for better fit within the quantization space.

2.2. Main Contributions / Findings

The paper introduces several primary contributions to address the challenges in LLM quantization:

  • Quantization Space Utilization Rate (QSUR): The authors propose QSUR, a novel and effective metric to quantitatively assess the quantizability of transformed data. QSUR measures how effectively data utilizes the available quantization space. Mathematical derivations are provided to analyze the effects and limitations of various linear transformations on QSUR, establishing a theoretical foundation for designing more effective transformations. Experiments demonstrate a positive correlation between higher QSUR values and improved quantization accuracy.
  • Orthogonal and Scaling Transformation-based Quantization (OSTQuant): Based on the insights from QSUR, OSTQuant is proposed. It employs learnable equivalent transformation pairs, each consisting of an orthogonal transformation and a scaling transformation, assigned to each fully connected (FC) layer in LLMs. These transformations are optimized to globally reshape the distributions of weights and activations across the entire quantization space, making them more quantization-friendly. Critically, these transformation pairs and their inverses are fused into their respective FC layers after optimization, preserving the original computational graph and incurring no additional computational overhead during inference. The method also introduces Weight Outlier Minimization Initialization (WOMI) for effective initialization of orthogonal matrices.
  • KL-Top Loss Function: Recognizing that PTQ typically uses small calibration datasets (e.g., 1000 samples), which can lead to overfitting with standard cross-entropy loss or noise with full Kullback-Leibler divergence on large vocabularies, the KL-Top loss function is introduced. This loss function computes KL divergence only over the top $k$ highest-probability logits from the full-precision model. This approach aims to capture more nuanced semantic information and provide clearer, more informative gradients while mitigating noise from long-tail distributions and reducing computational costs.
  • Robust Performance and Efficiency: OSTQuant consistently outperforms existing state-of-the-art PTQ methods across various LLMs (LLaMA-1, LLaMA-2, LLaMA-3) and benchmarks.
    • In the W4-only (4-bit weights, 16-bit activations/KV cache) setting, it retains over 99.5% of the floating-point accuracy.

    • In the more aggressive W4A4KV4 (4-bit weights, 4-bit activations, 4-bit KV cache) configuration, it reduces the performance gap by 32% on the LLaMA-3-8B model compared to state-of-the-art methods, maintaining at least 96% of the original model's performance.

    • It achieves significant inference speedups (over $2\times$ on average, up to $3\times$ for LLaMA-30B) and memory savings (exceeding $3.5\times$).

    • The optimization process is highly efficient, quantizing LLaMA-7B in about 20 minutes, up to $5.3\times$ faster than block-reconstruction methods like OmniQuant.

      These contributions collectively provide a more theoretically grounded and empirically effective approach to LLM quantization, making these powerful models more accessible and efficient for diverse deployment scenarios.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand OSTQuant, a foundational grasp of Large Language Models (LLMs), quantization principles, and linear algebra concepts is essential.

  • Large Language Models (LLMs): These are advanced artificial intelligence models, typically based on the transformer architecture, that are trained on vast amounts of text data to understand, generate, and process human language. They excel at tasks like translation, summarization, question-answering, and creative writing. Examples include OpenAI's GPT series, Google's Bard/Gemini, and Meta's LLaMA series. Their immense size (billions of parameters) makes them computationally intensive and memory-hungry, necessitating compression techniques for practical deployment.

  • Post-Training Quantization (PTQ): This is a model compression technique applied after a neural network has been fully trained (post-training). The goal is to reduce the numerical precision (bit-width) of the model's parameters (weights) and intermediate computations (activations) from high-precision floating-point numbers (e.g., FP32 or FP16) to lower-precision integers (e.g., INT8 or INT4).

    • Benefits: PTQ significantly reduces memory footprint (allowing larger models on smaller devices or more models on a single device) and accelerates inference speed (as low-bit integer operations are faster and more energy-efficient than floating-point operations).
    • Challenge: The main challenge is to perform this reduction without significant loss in model accuracy, which often happens due to quantization errors.
  • Quantization Range & Outliers: In quantization, a continuous range of floating-point values is mapped to a discrete set of integer values. This mapping typically involves defining a quantization range (e.g., min to max value in a tensor) and then dividing this range into $2^N$ discrete steps, where $N$ is the bit-width.

    • Outliers are a few data points that are significantly different from the majority of the data. In LLMs, weights and activations often exhibit heavy-tailed distributions, meaning there are a few extreme outlier values far from the mean.
    • If these outliers are included in the quantization range, they force the min and max values to be very far apart. This widens the overall quantization range, making each quantization step larger. A larger step size means less precision for the vast majority of values clustered near the center of the distribution, leading to substantial quantization error and accuracy degradation.
  • Data Distributions (Uneven, Heavy-tailed):

    • Uneven Distribution: Refers to data where values are not uniformly spread or symmetrically centered. In multidimensional data (like activation tensors, where each channel might have its own distribution), unevenness can also refer to significant disparities in variance or mean across different channels.
    • Heavy-tailed Distribution: A probability distribution where the tails (the extreme ends) are "heavier" than those of a normal (Gaussian) distribution. This means outliers occur more frequently or with larger magnitudes than expected in a Gaussian distribution. This characteristic is common in LLM activations, making quantization difficult.
  • Linear Transformations: In linear algebra, a linear transformation is a function that maps vectors from one vector space to another, preserving vector addition and scalar multiplication. In neural networks, this typically involves multiplying a data vector or matrix by another matrix.

    • Linear transformations can rotate, scale, shear, or project data. The paper focuses on orthogonal transformations (rotations/reflections) and scaling transformations.
  • Orthogonal Matrices & Stiefel Manifold:

    • An orthogonal matrix $\pmb{O}$ is a square matrix whose columns and rows are orthonormal vectors. This means $\pmb{O}^\top \pmb{O} = \pmb{O} \pmb{O}^\top = \pmb{I}$, where $\pmb{I}$ is the identity matrix and $\pmb{O}^\top$ is the transpose of $\pmb{O}$.
    • Orthogonal transformations preserve vector lengths and angles, meaning they are norm-preserving. They primarily rotate or reflect data without distorting its shape or relative distances. This property is crucial for quantization as it allows reshaping distributions without introducing unwanted scaling biases.
    • The Stiefel manifold $V_k(\mathbb{R}^n)$ is the set of all orthonormal $k$-frames in $\mathbb{R}^n$. In the context of orthogonal matrices, the Stiefel manifold represents the space of all possible orthogonal matrices. Optimizing parameters that must remain orthogonal requires specialized Riemannian optimization techniques that respect the geometric constraints of this manifold.
  • Diagonal Matrices & Scaling Transformations:

    • A diagonal matrix $\pmb{\Lambda}$ is a square matrix where all entries outside the main diagonal are zero.
    • A scaling transformation applied by a diagonal matrix scales each dimension of the input data independently. If the diagonal entries are reciprocals, it acts as an inverse scaling. This is used to balance variances across different channels, making distributions more uniform.
  • Kullback-Leibler (KL) Divergence: Also known as relative entropy, KL divergence is a non-symmetric measure of the difference between two probability distributions, $P$ and $Q$. It quantifies how much information is lost when $Q$ is used to approximate $P$.

    • Formula: $D_{KL}(P \| Q) = \sum_i P(i) \log \left(\frac{P(i)}{Q(i)}\right)$
    • In quantization, KL divergence is often used as a loss function to ensure that the output distribution of the quantized model ($\hat{y}$) remains as close as possible to that of the full-precision model ($y$).
  • Cross-Entropy Loss (CE Loss): A common loss function in classification tasks that measures the performance of a classification model whose output is a probability value between 0 and 1. It increases as the predicted probability diverges from the actual label.

    • Formula for single-class classification: $L_{CE} = -\sum_c y_c \log(\hat{y}_c)$, where $y_c$ is 1 if class $c$ is the true class and 0 otherwise, and $\hat{y}_c$ is the predicted probability for class $c$.
    • In LLMs, CE loss is typically used for next-token prediction. However, when training on small calibration datasets during PTQ, focusing solely on the single correct label might lead to overfitting and a loss of the model's broader semantic capabilities. A toy numeric example of both losses follows this list.
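To make the two losses concrete, here is a minimal NumPy sketch (toy logits, purely illustrative) that computes cross-entropy against a single label and KL divergence between a full-precision and a quantized model's next-token distributions:

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Toy next-token distributions over a 5-token vocabulary (illustrative values only).
p_full = softmax(np.array([2.0, 1.0, 0.2, -1.0, -2.0]))   # full-precision model
q_quant = softmax(np.array([1.6, 1.2, 0.1, -0.8, -2.1]))  # quantized model

# Cross-entropy against a single "correct" label (assume it is token 0).
ce = -np.log(q_quant[0])

# KL divergence D_KL(P || Q) between the two full distributions.
kl = np.sum(p_full * np.log(p_full / q_quant))

print(f"CE loss: {ce:.4f}, KL divergence: {kl:.4f}")
```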

3.2. Previous Works

The paper contextualizes OSTQuant within the landscape of Post-Training Quantization (PTQ) for LLMs, broadly categorizing existing methods into weight-only and weight-activation quantization.

  • Weight-Only Quantization: Aims to reduce memory by quantizing only the model weights, while activations remain in higher precision (e.g., FP16). This provides memory savings but less inference speedup compared to quantizing activations.

    • GPTQ (Frantar et al., 2022): Utilizes Hessian-based error compensation to minimize quantization errors layer-by-layer, achieving high compression rates. It specifically quantizes weights greedily by selecting optimal quantization values for each weight given the errors in previously quantized weights.
    • AWQ (Lin et al., 2023) and OWQ (Lee et al., 2023): These methods focus on the observation that activation outliers significantly impact the performance of weight quantization. They aim to protect salient weights or channels from aggressive quantization error by scaling weights or identifying specific channels that are robust to quantization.
    • QuIP (Chee et al., 2023) and QuIP# (Tseng et al., 2024): These approaches use random Hadamard matrices for incoherent processing of weights (making them more uniformly distributed) and apply vector quantization to weights, leading to better performance at very low precisions. Hadamard transformations are a type of orthogonal transformation.
  • Weight-Activation Quantization: Aims to quantize both weights and activations (including the Key-Value (KV) cache in transformers) to achieve maximum memory savings and inference speedup. This is more challenging due to the dynamic nature and wide distributions of activations.

    • ZeroQuant (Yao et al., 2022): Proposes a fine-grained, hardware-friendly quantization scheme for both weights and activations, often utilizing per-channel and per-token quantization strategies.
    • SmoothQuant (Xiao et al., 2022): A pioneering smooth-based transformation method. It addresses activation quantization difficulty by mathematically shifting some of the quantization challenge from activations to weights, which are more static and easier to quantize. This is achieved by introducing a diagonal scaling matrix that smooths out large inter-channel variances in activations.
    • OmniQuant (Shao et al., 2023): Further enhances performance by actively training both quantization parameters (scales, zero-points) and transformation coefficients through an optimization process, often using block-wise reconstruction to minimize quantization error.
    • I-LLM (Hu et al., 2024): Focuses on achieving integer-only quantization and inference, which is highly efficient for specialized hardware. It uses fully-smooth block reconstruction and fully integer operators.
    • QuaRot (Ashkboos et al., 2024): Utilizes random rotation matrices (a type of orthogonal transformation) to suppress outliers in both weights and activations, enabling 4-bit quantization. The rotations are fixed (randomly initialized Hadamard-like matrices).
    • SpinQuant (Liu et al., 2024): Extends QuaRot by learning these rotation matrices to further refine 4-bit weight-activation quantization. This moves from fixed random rotations to optimized ones.
  • Riemannian Optimization: Optimizing orthogonal matrices (which lie on a Stiefel manifold) requires specialized optimization techniques that respect their orthonormality constraints.

    • Cayley SGD (Li et al., 2020): Relies on iterative approximations of the Cayley Transform to optimize rotation matrices for arbitrary loss functions efficiently, primarily using matrix multiplication.
    • RAOM (Bécigneul & Ganea, 2018): Extends popular adaptive optimizers like ADAM (Kingma, 2014) and ADAGRAD to the realm of Riemannian optimization, making them suitable for Stiefel manifolds.
    • Geoopt (Kochurov et al., 2020): A Python library providing Riemannian optimization algorithms for various manifolds, facilitating their integration into deep learning models.

3.3. Technological Evolution

The field of LLM quantization has evolved from basic uniform quantization to sophisticated methods that adaptively handle the complex distributions of LLMs. Initially, quantization focused on simple per-tensor or per-channel scaling (RTN - Round-to-Nearest). The discovery of activation outliers in LLMs highlighted the need for more advanced techniques. Early methods like GPTQ and AWQ addressed weight quantization by mitigating outlier impact. Later, SmoothQuant introduced the idea of shifting quantization difficulty by scaling activations and inverse-scaling weights. More recent work, including QuaRot and SpinQuant, introduced orthogonal transformations (rotations) to directly reduce outliers and balance distributions in both weights and activations. OmniQuant pushed towards learnable quantization parameters and transformations.

OSTQuant builds upon this evolution by:

  1. Introducing a principled metric (QSUR) to quantitatively guide the optimization of data distributions.
  2. Combining the strengths of both orthogonal and scaling transformations in a learnable, equivalent transformation pair for global optimization.
  3. Addressing the limited calibration data challenge with KL-Top loss.

3.4. Differentiation Analysis

OSTQuant differentiates itself from previous methods primarily through its holistic approach to optimizing data distribution across the entire quantization space, guided by a novel quantitative metric, and its specific combination of learnable transformations.

  • From Heuristic to Principled:

    • Prior Methods (e.g., SmoothQuant, QuaRot, SpinQuant): These methods often operate based on empirical observations and heuristics (e.g., "smooth outliers," "rotate outliers"). While effective, they lack a unified metric to quantify "quantizability" or to rigorously guide the optimization process.
    • OSTQuant: Introduces QSUR as a theoretically grounded metric to directly measure how effectively data utilizes the quantization space. This metric provides a clear optimization target and allows for a more principled design of transformations, moving beyond pure heuristics.
  • Combined Orthogonal and Scaling Transformations for Global Optimization:

    • Smooth-based Methods (e.g., SmoothQuant): Primarily use diagonal scaling matrices to reduce inter-channel variance in activations by shifting it to weights. They are effective at balancing scale but are sensitive to outliers and uneven mean values. Their optimization is typically localized (per-layer or per-channel) and doesn't explicitly optimize the overall shape of the distribution in the quantization space.
    • Rotation-based Methods (e.g., QuaRot, SpinQuant): Focus on orthogonal transformations (rotations) to suppress outliers in both weights and activations by rotating the data. QuaRot uses fixed random rotations, while SpinQuant learns them. These are good at making distributions more ball-shaped but might not optimally balance inter-channel variances or ensure maximum utilization of the quantization range. Their scope is often limited to reducing outliers rather than globally fitting the entire distribution.
    • OSTQuant: Combines the strengths of both. It employs learnable equivalent transformation pairs consisting of both an orthogonal transformation and a scaling transformation. This allows it to:
      1. Rotate the data to align its principal components with the quantization axes and reduce outlier impact (like rotation-based methods).
      2. Scale dimensions to balance variances and mean values across channels (like smooth-based methods), ensuring a flatter, more uniform distribution. This combined approach allows for a more comprehensive optimization of the data distribution across the entire quantization space, leading to higher QSUR and better precision.
  • Equivalence Preservation and Fusion: OSTQuant explicitly designs equivalent transformation pairs that, in the absence of quantization, preserve the network's original output. This prevents overfitting to the calibration data and ensures better generalization. These transformations are then fused into the model weights during deployment, incurring no runtime overhead. While SmoothQuant also fuses scaling factors, OSTQuant extends this to orthogonal matrices as well.

  • Addressing Limited Calibration Data:

    • Prior Methods: Often rely on standard cross-entropy loss or full KL divergence for optimization. Cross-entropy can lead to overfitting on small calibration sets, while full KL divergence can be noisy due to long-tail distributions in LLM vocabularies.

    • OSTQuant: Proposes the KL-Top loss, which focuses the KL divergence computation on only the top $k$ highest-probability logits. This provides more stable and semantically rich gradients for optimization with limited data.

      In essence, OSTQuant provides a more theoretically grounded, comprehensive, and adaptively optimized solution by integrating a novel metric, a powerful combination of learnable transformations, and a robust loss function specifically designed for PTQ challenges.

4. Methodology

4.1. Principles

The core principle behind OSTQuant is to enhance the quantizability of Large Language Models (LLMs) by actively reshaping the data distributions of weights and activations to better utilize the available quantization space. This is achieved through a novel metric, Quantization Space Utilization Rate (QSUR), which quantifies the effectiveness of this utilization, and a sophisticated learnable equivalent transformation mechanism.

The theoretical intuition is that quantization error is minimized when the data distribution is as compact and uniform as possible within the fixed quantization range. Uneven and heavy-tailed distributions lead to large quantization ranges and sparse quantization levels for most values. OSTQuant hypothesizes that a combination of orthogonal transformations (rotations) and scaling transformations can effectively transform these sub-optimal distributions into more quantization-friendly ones, characterized by higher QSUR.

Orthogonal transformations (rotations) can reduce the maximum extent of outliers by distributing their magnitude across multiple dimensions, making the data more "ball-shaped." Scaling transformations (diagonal matrices) can then balance variances across different channels, making the data distribution flatter and more uniform. By learning these transformations, OSTQuant aims to dynamically adapt to the specific distributions within each layer of an LLM, thereby minimizing quantization loss across the entire network.

The process is guided by minimizing the KL-Top loss, which ensures that the quantized model's output distribution remains semantically close to the full-precision model's output, even with limited calibration data, preventing overfitting and preserving performance.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Quantization Notations

The paper first defines the standard uniform quantization and dequantization process. This process maps a floating-point number to a discrete integer representation and then approximates the original floating-point number from that integer.

Given a floating-point tensor $\boldsymbol{X}$, the quantization function $\mathcal{Q}$ converts it into its quantized integer counterpart $X^I$. The dequantization process then approximates the original value as $X'$. The quantization process is defined by: $ X^I = \mathrm{clamp}\left(\left\lfloor \frac{X}{s} \right\rceil + zp^I, 0, 2^{n^I}-1\right) $ where:

  • $X$: The original floating-point tensor.

  • $X^I$: The quantized integer tensor, with values clamped between 0 and $2^{n^I}-1$.

  • $n^I$: The number of bits for quantization (e.g., 4 bits for INT4).

  • $s$: The quantization step size (also known as scale), which determines the width of each discrete quantization level.

  • $zp^I$: The zero-point, an integer value that represents the floating-point value 0 in the quantized integer range.

  • $\lfloor \cdot \rceil$: The rounding function (typically round-to-nearest).

  • $\mathrm{clamp}(\cdot, \mathrm{min}, \mathrm{max})$: A function that clips values to be within the specified min and max range.

    The quantization step size $s$ and zero-point $zp^I$ are calculated from the minimum ($x_{\mathrm{min}}$) and maximum ($x_{\mathrm{max}}$) values observed in the floating-point tensor $\boldsymbol{X}$: $ s = \frac{x_{\mathrm{max}} - x_{\mathrm{min}}}{2^{n^I}-1} $ $ zp^I = \left\lfloor \frac{-x_{\mathrm{min}}}{s} \right\rceil $ After quantization, the dequantized floating-point approximation $\boldsymbol{X}'$ is computed as: $ \boldsymbol{X}' = (\boldsymbol{X}^I - zp^I) \cdot s $ This dequantization formula maps the integer back to a floating-point value within the original approximate range. The choice of $s$ (scale) and $zp^I$ (zero-point) is crucial for quantization accuracy. These can be determined statically from a few calibration samples (static quantization) or dynamically from runtime statistics (dynamic quantization). Quantization can also be applied per-channel or per-token, referring to the granularity at which $s$ and $zp^I$ are computed.
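To make the notation above concrete, the following is a minimal NumPy sketch of asymmetric uniform quantization and dequantization following the formulas above; the tensor values are illustrative and chosen to show how a single outlier inflates the step size:

```python
import numpy as np

def quantize(x: np.ndarray, n_bits: int = 4):
    """Asymmetric uniform quantization: returns integer tensor, scale, zero-point."""
    x_min, x_max = x.min(), x.max()
    qmax = 2 ** n_bits - 1
    s = (x_max - x_min) / qmax                 # step size (scale)
    zp = int(np.round(-x_min / s))             # zero-point
    x_int = np.clip(np.round(x / s) + zp, 0, qmax).astype(np.int32)
    return x_int, s, zp

def dequantize(x_int: np.ndarray, s: float, zp: int) -> np.ndarray:
    """Map integers back to approximate floating-point values."""
    return (x_int - zp) * s

# A heavy-tailed toy tensor: one outlier stretches the quantization range.
x = np.array([0.01, -0.02, 0.03, 0.05, -0.04, 8.0])
x_int, s, zp = quantize(x, n_bits=4)
x_rec = dequantize(x_int, s, zp)
print("step size:", s)                          # large step caused by the outlier
print("per-value error:", np.abs(x - x_rec))    # most small values lose precision
```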

4.2.2. Quantization Space Utilization Rate (QSUR)

To quantitatively evaluate the quantizability of data and the effectiveness of transformations, the paper introduces QSUR.

Definition 1: Given a set of $d$-dimensional data $\boldsymbol{X} \in \mathbb{R}^{n \times d}$, let $V_X$ denote the hypervolume occupied by $\boldsymbol{X}$, and $V_{S_X}$ denote the hypervolume of the quantization space $S_X$ corresponding to $\boldsymbol{X}$. The quantization space $S_X$ is modeled as a hypercube whose edge lengths are defined by the maximum quantization range across all dimensions of $\boldsymbol{X}$. The Quantization Space Utilization Rate of $\boldsymbol{X}$ is then defined as: $ \mathrm{QSUR}_X = \frac{V_X}{V_{S_X}} \quad (1) $ A higher QSUR indicates that the data effectively fills the allocated quantization space, suggesting better quantizability.

For data typically following a Gaussian distribution ($X \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$), $V_X$ is calculated based on the ellipsoid formed by the covariance matrix $\boldsymbol{\Sigma}$ and mean vector $\boldsymbol{\mu}$. The covariance matrix can be diagonalized via eigenvalue decomposition: $\boldsymbol{\Sigma} = \boldsymbol{Q} \boldsymbol{\Lambda} \boldsymbol{Q}^\top$, where $\boldsymbol{Q}$ is a unit orthogonal matrix of eigenvectors, and $\boldsymbol{\Lambda} = \mathrm{diag}(\lambda_1, \lambda_2, ..., \lambda_d)$ contains the eigenvalues (representing variance along the principal axes) in descending order.

The hypervolume of this ellipsoid at a given confidence level $\alpha$ (e.g., $\alpha = 0.99$) is given by: $ V_X = \frac{\pi^{d/2}}{\Gamma(d/2+1)} \times \left(\chi_d^2(\alpha)\right)^{d/2} \times \sqrt{\operatorname*{det}(\boldsymbol{\Sigma})} \quad (2) $ where:

  • $\Gamma$: The Gamma function.

  • $\chi_d^2(\alpha)$: The chi-squared quantile for $d$ degrees of freedom at confidence level $\alpha$. This factor scales the ellipsoid to encompass a certain percentage of the data.

  • $\operatorname*{det}(\boldsymbol{\Sigma})$: The determinant of the covariance matrix. Since $\boldsymbol{Q}$ is orthogonal, $\operatorname*{det}(\boldsymbol{\Sigma}) = \operatorname*{det}(\boldsymbol{Q} \boldsymbol{\Lambda} \boldsymbol{Q}^\top) = \operatorname*{det}(\boldsymbol{Q}) \operatorname*{det}(\boldsymbol{\Lambda}) \operatorname*{det}(\boldsymbol{Q}^\top) = \operatorname*{det}(\boldsymbol{\Lambda}) = \prod_{i=1}^{d} \lambda_i$. Thus, $\sqrt{\operatorname*{det}(\boldsymbol{\Sigma})} = \sqrt{\prod_{i=1}^{d} \lambda_i}$.

    The volume of the quantization hypercube ($V_{S_X}$) is determined by the range of the distribution along each principal axis, considering the maximum and minimum coordinate values. The extremal points of the ellipsoid correspond to these maximum and minimum values. Let $\lambda_{\mathrm{max}}$ and $\lambda_{\mathrm{min}}$ be eigenvalues, and $\pmb{q}_{\mathrm{max}}$ and $\pmb{q}_{\mathrm{min}}$ be the corresponding eigenvectors (the paper later simplifies this by considering only the largest eigenvalue). The maximum and minimum coordinate values $v_{\mathrm{max}}^{\mathrm{org}}$ and $v_{\mathrm{min}}^{\mathrm{org}}$ are given as: $ v_{\mathrm{max}}^{\mathrm{org}} = \sqrt{\chi_d^2(\alpha) \cdot \lambda_{\mathrm{max}}} \cdot \pmb{q}_{\mathrm{max}} + \boldsymbol{\mu} $ $ v_{\mathrm{min}}^{\mathrm{org}} = -\sqrt{\chi_d^2(\alpha) \cdot \lambda_{\mathrm{min}}} \cdot \pmb{q}_{\mathrm{min}} + \boldsymbol{\mu} $ The volume of the quantization hypercube is then: $ V_{S_X} = \left(\operatorname*{max}(v_{\mathrm{max}}^{\mathrm{org}}) - \operatorname*{min}(v_{\mathrm{min}}^{\mathrm{org}})\right)^d \quad (5) $ Combining these, the QSUR expression is: $ \mathrm{QSUR}_X = \frac{ \frac{\pi^{d/2}}{\Gamma(d/2+1)} \cdot \left(\chi_d^2(\alpha)\right)^{d/2} \cdot \sqrt{\operatorname*{det}(\boldsymbol{\Lambda})} }{ \left( \operatorname*{max}\left(\sqrt{\chi_d^2(\alpha) \cdot \lambda_{\mathrm{max}}} \cdot |\pmb{q}_{\mathrm{max}}| + \boldsymbol{\mu}\right) - \operatorname*{min}\left(-\sqrt{\chi_d^2(\alpha) \cdot \lambda_{\mathrm{min}}} \cdot |\pmb{q}_{\mathrm{min}}| + \boldsymbol{\mu}\right) \right)^d } \quad (6) $ Simplified QSUR: The paper further simplifies this by neglecting the mean vector $\boldsymbol{\mu}$ (assuming its magnitude is small relative to the largest eigenvalue) and assuming $\lambda_{\mathrm{max}} = \lambda_{\mathrm{min}} = \lambda_1$ (the largest eigenvalue) for the extremal points. This simplification leads to: $ \mathrm{QSUR}_X = \frac{ \frac{\pi^{d/2}}{\Gamma(d/2+1)} \cdot \sqrt{\prod_{i=1}^{d} \lambda_i} }{ 2^d \left( \operatorname*{max}(\sqrt{\lambda_1} \cdot \pmb{q}_1) \right)^d } \quad (7) $ From this simplified QSUR formula (Equation 7), two key observations are made:

  1. QSUR is proportional to the product of the ratios of each eigenvalue $\lambda_i$ to the largest eigenvalue $\lambda_1$. This implies that QSUR is maximized when all eigenvalues are similar (i.e., the distribution is more ball-shaped).

  2. The maximum component of the eigenvector $\pmb{q}_1$ corresponding to $\lambda_1$ is inversely proportional to QSUR. QSUR is minimized when this component is large, and maximized when it is small. The denominator in Equation 7 is minimized when the components of $\pmb{q}_1$ are $\pm d^{-1/2}$ (Appendix A.2.1), which corresponds to the most uniform spread of eigenvector components.

    Influence of linear transformation on QSUR: Applying a linear transformation $\pmb{T}$ to $X \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ results in a transformed distribution $\hat{X} \sim \mathcal{N}(\hat{\boldsymbol{\mu}}, \hat{\boldsymbol{\Sigma}})$, where $\hat{\boldsymbol{\mu}} = \pmb{T}\boldsymbol{\mu}$ and $\hat{\boldsymbol{\Sigma}} = \pmb{T} \boldsymbol{Q} \boldsymbol{\Lambda} \boldsymbol{Q}^\top \pmb{T}^\top$.

  • Smoothing-based approaches (e.g., SmoothQuant) treat $\pmb{T}$ as a diagonal matrix that scales variances, reducing disparities among the eigenvalues $\lambda_i$. However, they are sensitive to outliers and uneven mean values.

  • Rotation-based methods (e.g., QuaRot, SpinQuant) use orthogonal matrices to modify $\boldsymbol{Q}$, reducing outliers and the hypercube volume, thereby increasing QSUR. This ability improves with dimensionality (Appendix A.2.2). The best outlier reduction is achieved when the orthogonal matrix is $\pmb{T} = d^{-\frac{1}{2}} \pmb{H} \pmb{Q}^\top$, where $\pmb{H}$ is a Hadamard matrix.

    The Best Transformation Matrix: Combining the insights from Equation 7 and the influence of transformations, the paper deduces that the maximum QSUR is achieved when the transformation matrix is: $ \pmb{T} = c \cdot \pmb{\Lambda}^{-\frac{1}{2}} \pmb{Q}^\top \quad (8) $ where $c$ is an arbitrary scalar. This transformation effectively spheres the data, making its covariance matrix an identity matrix ($\hat{\boldsymbol{\Sigma}} = \boldsymbol{I}$, see Appendix A.2.3). At this point, the maximum QSUR is achieved: $ \mathrm{QSUR}'' = \frac{ \frac{\pi^{d/2}}{\Gamma(d/2+1)} }{ 2^d } \quad (28) $ This theoretical derivation highlights that an ideal transformation involves both scaling (by $\pmb{\Lambda}^{-\frac{1}{2}}$) to normalize variances and rotation (by $\pmb{Q}^\top$) to align with the principal axes, effectively making the data distribution isotropic (uniform in all directions).
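To build intuition for Equation (1) and the whitening transform of Equation (8), below is a small NumPy/SciPy sketch that estimates QSUR empirically for correlated Gaussian data as the ratio of the sample ellipsoid volume to the bounding hypercube volume, before and after applying $\pmb{T} = \pmb{\Lambda}^{-1/2}\pmb{Q}^\top$. This is an illustrative approximation, not the authors' implementation:

```python
import numpy as np
from scipy.stats import chi2
from scipy.special import gammaln

def qsur(x: np.ndarray, alpha: float = 0.99) -> float:
    """Empirical QSUR: ellipsoid volume (at confidence alpha) / bounding hypercube volume."""
    d = x.shape[1]
    cov = np.cov(x, rowvar=False)
    # log ellipsoid volume: pi^{d/2}/Gamma(d/2+1) * chi2_d(alpha)^{d/2} * sqrt(det(cov))
    _, logdet = np.linalg.slogdet(cov)
    log_vx = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1) \
             + (d / 2) * np.log(chi2.ppf(alpha, d)) + 0.5 * logdet
    # hypercube edge: the widest per-dimension range of the data
    edge = (x.max(axis=0) - x.min(axis=0)).max()
    return float(np.exp(log_vx - d * np.log(edge)))

rng = np.random.default_rng(0)
d = 8
A = rng.normal(size=(d, d))
x = rng.normal(size=(4096, d)) @ A            # correlated data with uneven variances

# Whitening transform T = Lambda^{-1/2} Q^T from the eigendecomposition of the covariance.
eigval, Q = np.linalg.eigh(np.cov(x, rowvar=False))
T = np.diag(eigval ** -0.5) @ Q.T
x_whitened = x @ T.T

print("QSUR before:", qsur(x))
print("QSUR after :", qsur(x_whitened))       # noticeably higher after whitening
```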

4.2.3. Orthogonal and Scaling Transformation-based Quantization (OSTQuant)

OSTQuant leverages these theoretical insights to design a practical PTQ framework. It uses learnable equivalent transformation pairs within LLM layers to optimize weight and activation distributions for improved quantization performance.

Overall Flow (Figure 5): OSTQuant applies multiple transformation pairs globally across and within LLM blocks. Four equivalent transformation pairs are learned per block, each comprising a learnable diagonal scaling matrix $\pmb{S}$ (denoted as $\pmb{\Lambda}$ in the formulas) and a learnable orthogonal matrix $\pmb{O}$ (denoted as $\pmb{R}$ in the figures, for rotation).

Equivalent Transformation Pair: A transformation pair is defined as $\pmb{T} = \pmb{\Lambda} \pmb{O}$, where $\pmb{\Lambda}$ is a diagonal scaling matrix and $\pmb{O}$ is a unit orthogonal matrix. The forward inference process for a pair of consecutive matrix multiplications (e.g., $W_1 W_2$) is reformulated as: $ \pmb{y} = \mathcal{Q}(\pmb{x} W_1 \pmb{O} \pmb{\Lambda}) \, \mathcal{Q}(\pmb{\Lambda}^{-1} \pmb{O}^\top W_2) \quad (9) $ where:

  • $\pmb{x}$: Input to the first linear layer.
  • $W_1, W_2$: Weights of two consecutive linear layers.
  • $\pmb{O}$: Learnable orthogonal matrix.
  • $\pmb{\Lambda}$: Learnable diagonal scaling matrix.
  • $\mathcal{Q}(\cdot)$: The quantization operation. The intermediate product $\pmb{O} \pmb{\Lambda}$ transforms the output of $W_1$, and its inverse $\pmb{\Lambda}^{-1} \pmb{O}^\top$ transforms the input of $W_2$. This maintains mathematical equivalence in full precision while allowing the intermediate distributions to be optimized for quantization.

Advantages of Equivalent Transformation Pair:

  1. Learnability and Computational Efficiency: Both $\pmb{O}$ and $\pmb{\Lambda}$ are learnable parameters. The inverse of a diagonal matrix $\pmb{\Lambda}$ is trivial (the reciprocal of its diagonal elements). The orthogonal matrix $\pmb{O}$ is optimized using Riemannian optimizers like RiemannAdam, leveraging first-order gradient information.

  2. Equivalence Preservation: Without quantization, the transformations cancel out ($\pmb{O} \pmb{\Lambda} \pmb{\Lambda}^{-1} \pmb{O}^\top = \pmb{I}$), ensuring the network's output is identical to that of the original model. This reduces the risk of overfitting to the small calibration set.

  3. No Runtime Overhead: After optimization, $\pmb{O}$ and $\pmb{\Lambda}$ are fused directly into the existing weight matrices, e.g., $W_1 \leftarrow W_1 \pmb{O} \pmb{\Lambda}$ and $W_2 \leftarrow \pmb{\Lambda}^{-1} \pmb{O}^\top W_2$. This means no additional computational cost or parameters are introduced during inference (see the sketch after this list).

    The overall optimization objective for the network is to minimize the loss between the quantized network output $\hat{y}$ and the full-precision output $y$, by learning the optimal scaling ($\pmb{\Lambda}_i$) and orthogonal ($\pmb{O}_i$) matrices: $ \arg \operatorname*{min}_{\pmb{\Lambda}_i, \pmb{O}_i} \mathcal{L}(\hat{y}, y; \pmb{\Lambda}_i, \pmb{O}_i, \theta) \quad (10) $ where $\theta$ represents the frozen full-precision network parameters.
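The following PyTorch sketch illustrates Equation (9) under toy shapes (hypothetical dimensions, not the released implementation): an orthogonal/scaling pair inserted between two consecutive linear layers cancels out in full precision and can be fused into the weights for deployment.

```python
import torch

torch.manual_seed(0)
d = 8
x = torch.randn(4, d)
W1, W2 = torch.randn(d, d), torch.randn(d, d)

# Learnable transformation pair: orthogonal O (here a random rotation) and diagonal scaling Lambda.
O, _ = torch.linalg.qr(torch.randn(d, d))       # orthogonal matrix
lam = torch.rand(d) + 0.5                       # positive diagonal scales
Lam, Lam_inv = torch.diag(lam), torch.diag(1.0 / lam)

# Equivalence in full precision: (x W1 O Lam)(Lam^-1 O^T W2) == x W1 W2
y_ref = x @ W1 @ W2
y_eq = (x @ W1 @ O @ Lam) @ (Lam_inv @ O.T @ W2)
assert torch.allclose(y_ref, y_eq, atol=1e-3)

# Fusion for deployment: absorb the pair into the weights, so no extra runtime ops remain.
W1_fused = W1 @ O @ Lam
W2_fused = Lam_inv @ O.T @ W2
assert torch.allclose(x @ W1_fused @ W2_fused, y_ref, atol=1e-3)
```

During PTQ the fake-quantization operator $\mathcal{Q}(\cdot)$ would be applied to the transformed tensors inside the parentheses, and only the quantized path differs from the original network.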

Weight Outlier Minimization Initialization (WOMI): Since weights often follow a Gaussian distribution with zero mean, the orthogonal matrix $\pmb{O}$ is initialized to optimally reduce weight outliers. For the global orthogonal matrix $R_{res}$, the weights of all linear layers receiving residual inputs are concatenated. Eigenvalue decomposition is performed on their covariance matrix to obtain the eigenmatrix $\boldsymbol{Q}_W$. Then, $\pmb{O}$ is initialized as $\pmb{O} = \pmb{H} \boldsymbol{Q}_W^\top$, where $\pmb{H}$ is a normalized Hadamard matrix (Equation 27 from Appendix A.2.2, $\pmb{T} = d^{-\frac{1}{2}} \pmb{H} \pmb{Q}^\top$, suggests a similar structure for the initialization). This initialization leverages the Hadamard matrix's ability to spread values evenly and the eigendecomposition's alignment with the principal components, minimizing outliers and inter-channel disparities. The scaling matrices are initialized as identity matrices.
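A minimal sketch of this initialization idea is shown below, assuming the relevant weights are concatenated along the output dimension and the hidden size is a power of two so a normalized Hadamard matrix exists; the function name and shapes are illustrative, not the authors' code:

```python
import torch
from scipy.linalg import hadamard

def womi_init(weights: list[torch.Tensor]) -> torch.Tensor:
    """Initialize an orthogonal matrix O = H_norm @ Q_W^T from concatenated weights.

    weights: list of (out_features, d) weight matrices sharing the residual input dim d.
    """
    W = torch.cat(weights, dim=0)                       # (sum_out, d)
    cov = W.T @ W / W.shape[0]                          # (d, d) covariance of weight rows (zero-mean assumption)
    _, Q_w = torch.linalg.eigh(cov)                     # eigenvectors as columns
    d = cov.shape[0]
    H_norm = torch.from_numpy(hadamard(d)).double() / d ** 0.5   # normalized Hadamard matrix
    O = H_norm @ Q_w.double().T
    # Sanity check: the product of two orthogonal matrices is orthogonal.
    assert torch.allclose(O @ O.T, torch.eye(d, dtype=torch.double), atol=1e-6)
    return O

# Toy example with hidden size 8 (power of two, so hadamard(8) exists).
ws = [torch.randn(16, 8).double(), torch.randn(32, 8).double()]
O_init = womi_init(ws)
```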

Inter-Block Learning (Global Transformations): As shown in the upper part of Figure 5, a global orthogonal transformation $R_{res}$ is applied at the embedding layer and propagates through each residual path in the LLM, rotating activations along the entire residual stream. Due to the norm-preserving property of unitary orthogonal matrices, this transformation can bypass the RMSNorm layers (which normalize based on the norm), with an inverse transformation applied to the inputs of the projection layers and the final output head to maintain equivalence. Additionally, scaling matrices $S_{attn}^i$ and $S_{ffn}^i$ are applied to the two normalization layers in each block and their respective projection layers: $R_{res}$ rotates activations along the residual paths and adjusts the weights of multiple projection layers, while $S_{attn}^i$ and $S_{ffn}^i$ scale the outputs of the RMSNorm layers and the corresponding projection layers. All of these transformations are absorbed into the corresponding weight matrices, compensating for distribution shifts and mitigating quantization errors.

Intra-Block Learning (Local Transformations): The lower part of Figure 5 illustrates local equivalent transformation pairs within each transformer block:

  1. Multi-Head Self-Attention (MHSA) Layer: Two pairs are introduced.

    • Value (V) and Output (O) Projection Layers: For each attention head $h$, the data flow is: $ \pmb{Y} = \pmb{P}^h \cdot \pmb{X}^h \cdot (\pmb{W}_v^h \pmb{R}_{ov}^h \pmb{S}_{ov}^h) \cdot (\pmb{S}_{ov}^{h^{-1}} \pmb{R}_{ov}^{h^\top} \pmb{W}_o^h) \quad (11) $ Here:
      • $h$: Head index.
      • $\pmb{P}^h$: Attention matrix for head $h$.
      • $\pmb{X}^h$: Input to head $h$.
      • $\pmb{W}_v^h$: Weight for the value projection.
      • $\pmb{W}_o^h$: Weight for the output projection.
      • $\pmb{R}_{ov}^h$: Learnable orthogonal matrix for the V/O layers.
      • $\pmb{S}_{ov}^h$: Learnable scaling matrix for the V/O layers. These transformations improve QSUR for both the value cache and the output projection layer.
    • Query (Q) and Key (K) Projection Layers: After the Rotary Positional Embedding (RoPE), the query and key outputs undergo an equivalent scaling transformation $S_{qk}$, which can be incorporated into the weights $W_q$ and $W_k$. An additional Hadamard transformation is applied to the $Q$ and $K$ outputs to enhance key-cache quantization (a small sketch of this invariance follows this list).
  2. Feed-Forward Network (FFN) Layer: The diagonal scaling matrices in the up-projection and down-projection layers of the FFN are optimized. The inverse of the Hadamard transformation (if applied) is fused into $W_{down}$ from the start.
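The online Hadamard transform on the query/key path relies on the fact that applying the same orthogonal matrix to both $Q$ and $K$ leaves the attention scores unchanged while flattening the key-cache distribution. A minimal sketch of that invariance, using toy shapes and a single head (not the authors' fused kernel):

```python
import torch
from scipy.linalg import hadamard

torch.manual_seed(0)
seq, d_head = 5, 8
Q = torch.randn(seq, d_head)
K = torch.randn(seq, d_head)

# Normalized Hadamard matrix: orthogonal, so H @ H.T = I.
H = torch.from_numpy(hadamard(d_head)).float() / d_head ** 0.5

Q_rot, K_rot = Q @ H, K @ H          # transform before writing K to the KV cache

# Attention scores are unchanged because H is orthogonal: (QH)(KH)^T = Q K^T.
scores_ref = Q @ K.T
scores_rot = Q_rot @ K_rot.T
assert torch.allclose(scores_ref, scores_rot, atol=1e-4)
```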

4.2.4. KL-Top Loss

While LLMs are trained on massive datasets, PTQ optimization is typically performed on small calibration sets (e.g., 1000 samples). In this limited data regime:

  • Standard cross-entropy (CE) loss can lead to overfitting, where the model performs well on the calibration set but degrades on broader tasks (as shown in Table 1). This is because small datasets may not fully utilize LLM capacity, causing CE loss (focusing on a single correct label) to overfit to narrow features.

  • Full Kullback-Leibler (KL) divergence (comparing the full prediction distributions) is also problematic. LLMs have very large vocabularies (e.g., LLaMA-3-8B has over 100,000 tokens), and their prediction results often follow a severe long-tail distribution (Figure 6), meaning only a few tokens have significant probabilities. Directly applying KL divergence over all classes means the loss can be dominated by uninformative low-probability classes, adding noise to training and incurring high computational/memory costs.

    To mitigate these issues, OSTQuant proposes the KL-Top loss function. This function computes KL divergence over only the top $k$ classes with the highest probabilities. This approach focuses optimization on the model's primary predictions, enhancing gradient quality by avoiding noise from low-probability values. It also reduces computational and memory costs for large vocabularies.

The KL-Top loss is calculated as follows: $ idxs = \operatorname{topk}(z) \quad (12) $ $ \mathcal{L} = \sum_{i \in idxs} z[i] \log \left( \frac{z[i]}{\hat{z}[i]} \right) $ where:

  • $z$: The prediction distribution (probabilities after softmax) from the full-precision model.

  • $\hat{z}$: The prediction distribution from the quantized model.

  • $\operatorname{topk}(z)$: A function that returns the indices of the top $k$ highest-probability entries of $z$.

  • $\mathcal{L}$: The KL-Top loss. The sum is taken only over the selected top-$k$ indices.

    Experiments (Table 6) show that a $k$ value of 1,000 typically yields the best results, balancing semantic richness with optimization stability.
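A PyTorch sketch of the KL-Top idea as described above, computed from the logits of the full-precision and quantized models; the function and variable names are illustrative and this is not the authors' exact implementation:

```python
import torch
import torch.nn.functional as F

def kl_top_loss(fp_logits: torch.Tensor, q_logits: torch.Tensor, k: int = 1000) -> torch.Tensor:
    """KL divergence restricted to the top-k most probable tokens of the full-precision model.

    fp_logits, q_logits: (batch, vocab) logits from the full-precision and quantized models.
    """
    p = F.softmax(fp_logits, dim=-1)                # full-precision distribution (target)
    q = F.softmax(q_logits, dim=-1)                 # quantized-model distribution
    top_p, idxs = p.topk(k, dim=-1)                 # indices of the k highest-probability tokens
    top_q = q.gather(-1, idxs)
    # Sum p * log(p / q) over the selected indices only, averaged over the batch.
    loss = (top_p * (top_p.log() - (top_q + 1e-12).log())).sum(dim=-1)
    return loss.mean()

# Toy usage with a 32,000-token vocabulary; perturbed logits stand in for the quantized model.
fp = torch.randn(4, 32000)
quant = fp + 0.1 * torch.randn_like(fp)
print(kl_top_loss(fp, quant, k=1000).item())
```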

5. Experimental Setup

5.1. Datasets

The experimental validation of OSTQuant is conducted across a range of Large Language Models (LLMs) from the LLaMA family and evaluated on standard benchmarks.

  • Models:

    • LLaMA-1 (7B, 13B, 30B) (Touvron et al., 2023a)
    • LLaMA-2 (7B, 13B, 70B) (Touvron et al., 2023b)
    • LLaMA-3-8B, LLaMA-3-70B
  • Calibration Data: For the PTQ optimization phase, a small calibration set is used.

    • Source: 1,000 samples from WikiText2 (Merity et al., 2016)
    • Characteristics: Each sample has a token length of 2,048. WikiText2 is a common dataset for language model evaluation, particularly for perplexity measurements, as it is derived from high-quality Wikipedia articles.
    • Why chosen: It's a standard and relatively clean dataset suitable for calibrating quantization parameters due to its representative linguistic patterns, while being small enough to fit the PTQ paradigm of limited data.
  • Evaluation Datasets (Zero-Shot Tasks): To assess the generalized performance and emergent capabilities of the quantized LLMs, the models are evaluated on a suite of nine zero-shot tasks using the lm-evaluation-harness (version 0.4.4) (Gao et al., 2024). These tasks are designed to test various aspects of language understanding and reasoning.

    • BoolQ (Clark et al., 2019): Boolean Question Answering.
    • HellaSwag (Zellers et al., 2019): Commonsense reasoning for sentence completion.
    • LAMBADA (Radford et al., 2019): A word-prediction benchmark that tests language understanding by predicting the last word of a passage.
    • OpenBookQA (OBQA) (Mihaylov et al., 2018): Open-book question answering requiring commonsense and factual knowledge.
    • PIQA (Bisk et al., 2020): Physical Interaction Question Answering, focusing on physical commonsense.
    • SIQA (Sap et al., 2019): Social IQa, evaluating commonsense reasoning about social interactions.
    • WinoGrande (Sakaguchi et al., 2021): An adversarial Winograd Schema Challenge to test commonsense reasoning.
    • ARC-Easy and ARC-Challenge (Boratko et al., 2018): AI2 Reasoning Challenge, testing scientific reasoning.
    • Why chosen: These datasets represent a diverse set of zero-shot NLP tasks and are standard benchmarks for evaluating the capabilities and robustness of LLMs after quantization. They are crucial because perplexity alone (from WikiText2) might not fully reflect the true functional performance of the model, especially after quantization.

5.2. Evaluation Metrics

For comprehensive evaluation, OSTQuant uses both perplexity and zero-shot accuracy (averaged across multiple tasks), along with inference speedup and memory savings.

  • Perplexity (PPL)

    • Conceptual Definition: Perplexity is a common metric for evaluating language models. It measures how well a probability distribution or language model predicts a sample. A lower perplexity score indicates that the model predicts the test set more accurately and assigns higher probabilities to the observed sequences, meaning it is a better language model. In simple terms, it's a measure of how "surprised" the model is by new data; less surprise means better performance.
    • Mathematical Formula: For a sequence of words $W = (w_1, w_2, \ldots, w_N)$, the perplexity is defined as: $ PPL(W) = \exp\left(-\frac{1}{N} \sum_{i=1}^N \log P(w_i | w_1, \dots, w_{i-1})\right) $ (a short numeric sketch follows the metric list below).
    • Symbol Explanation:
      • WW: The test corpus or sequence of words.
      • NN: The total number of words in the test corpus.
      • $P(w_i | w_1, \dots, w_{i-1})$: The probability assigned by the language model to the $i$-th word $w_i$, given the preceding words $w_1, \dots, w_{i-1}$. This probability is typically computed by the LLM during its next-token prediction task.
  • Zero-Shot Accuracy (Avg. Zero-Shot Score)

    • Conceptual Definition: Zero-shot accuracy measures a model's ability to perform tasks it has not been explicitly trained on, demonstrating its generalization capabilities and emergent properties. The average zero-shot score refers to the mean accuracy across a suite of predefined zero-shot tasks. For each task, accuracy is simply the percentage of correct predictions.
    • Mathematical Formula: Since this metric is an average across multiple tasks, there isn't a single universal formula beyond the definition of accuracy for each task. For an individual task, accuracy is: $ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} \times 100\% $ The Average Zero-Shot Score is then: $ \text{Avg. Zero-Shot Score} = \frac{1}{M} \sum_{j=1}^M \text{Accuracy}_j $
    • Symbol Explanation:
      • $\text{Accuracy}_j$: The accuracy achieved on the $j$-th zero-shot task.
      • $M$: The total number of zero-shot tasks evaluated (in this paper, $M = 9$).
  • Speedup

    • Conceptual Definition: Speedup quantifies how much faster the quantized model performs inference compared to its full-precision (e.g., FP16) counterpart. A speedup of $2\times$ means the quantized model is twice as fast.
    • Mathematical Formula: $ \text{Speedup} = \frac{\text{Inference Time (Full Precision)}}{\text{Inference Time (Quantized)}} $
    • Symbol Explanation:
      • $\text{Inference Time (Full Precision)}$: The time taken for a specific inference operation (e.g., prefill, decoding a token) using the full-precision model.
      • $\text{Inference Time (Quantized)}$: The time taken for the same inference operation using the quantized model.
  • Memory Saving Factor

    • Conceptual Definition: Memory saving factor indicates how much less memory the quantized model consumes compared to its full-precision version. A factor of $3\times$ means the quantized model uses one-third of the memory.
    • Mathematical Formula: $ \text{Memory Saving Factor} = \frac{\text{Memory Usage (Full Precision)}}{\text{Memory Usage (Quantized)}} $
    • Symbol Explanation:
      • $\text{Memory Usage (Full Precision)}$: The memory consumed by the full-precision model (e.g., for storing weights or activations).
      • $\text{Memory Usage (Quantized)}$: The memory consumed by the quantized model.
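As a quick illustration of the perplexity formula referenced above, the following sketch computes PPL from per-token log-probabilities; the log-probability values are made-up numbers for illustration, not model outputs:

```python
import math

# Hypothetical log-probabilities assigned by a model to each token of a short test sequence.
token_log_probs = [-2.1, -0.4, -3.3, -1.0, -0.7]

n = len(token_log_probs)
perplexity = math.exp(-sum(token_log_probs) / n)
print(f"PPL = {perplexity:.2f}")   # lower is better
```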

5.3. Baselines

OSTQuant is benchmarked against several representative Post-Training Quantization (PTQ) methods, ranging from basic approaches to state-of-the-art techniques.

  • RTN (Round-to-Nearest): A basic quantization method where floating-point values are simply rounded to the nearest available integer quantization level. It serves as a strong baseline to show the benefits of more sophisticated methods over naive quantization.
  • SmoothQuant (Xiao et al., 2022): A smooth-based transformation method that shifts quantization difficulty from activations to weights by scaling activations and inversely scaling weights. This aims to reduce inter-channel variance in activations.
  • GPTQ (Frantar et al., 2022): An accurate weight-only PTQ method that uses Hessian-based error compensation to minimize quantization errors in generative pre-trained transformers.
  • OmniQuant (Shao et al., 2023): A method that further enhances quantization performance by training quantization parameters (scales and zero-points) and transformation coefficients through an optimization process, often involving block-wise reconstruction.
  • AWQ (Lin et al., 2023): Activation-aware weight quantization, which focuses on protecting salient weights (those critical for model performance) from quantization errors by identifying activation outliers and applying appropriate scaling.
  • QuaRot (Ashkboos et al., 2024): A rotation-based method that uses random rotation matrices to make activations and weights more uniform and outlier-free, enabling 4-bit quantization for both.
  • SpinQuant (Liu et al., 2024): An advancement over QuaRot that learns the rotation matrices for 4-bit weight-activation quantization rather than relying on random ones, aiming for better optimization.

5.4. Implementation Details

The paper specifies key aspects of OSTQuant's implementation and optimization:

  • Quantization Scheme:
    • Activations: Per-token asymmetric quantization without any pruning operations. Per-token means scales and zero-points are calculated for each token in a sequence, allowing adaptation to dynamic ranges. Asymmetric means the quantization range does not necessarily center around zero.
    • Weights: Per-channel symmetric quantization. Per-channel means scales and zero-points are calculated independently for each output channel of a weight tensor. Symmetric means the quantization range is centered around zero (e.g., from -Max to +Max). A small sketch of both granularities follows this list.
  • Optimizer: RiemannAdam (Bécigneul & Ganea, 2018) is used to optimize all unit orthogonal matrices and scaling matrices. RiemannAdam is suitable for Riemannian optimization on Stiefel Manifolds, where orthogonal matrices reside.
  • Calibration Data and Iterations:
    • Calibration Samples: 1,000 samples from WikiText2.
    • Token Length: Each sample has a token length of 2,048.
    • Iterations: 150 iterations.
    • Batch Size: 8.
  • Learning Rate Schedule: Cosine learning rate decay is applied.
    • Initial Learning Rate for orthogonal matrix parameters: $2 \times 10^{-2}$.
    • Initial Learning Rate for scaling parameters: $3 \times 10^{-2}$.
    • The paper notes that using a learning rate for the Stiefel manifold (orthogonal matrices) 10 times larger than that for scaling parameters typically leads to better results.
  • Hardware: Quantization and inference speed tests were conducted on an A800 GPU for optimization timing, and on a 3090 GPU for detailed prefill/memory speedup tests (Table 9), and an A6000 GPU for decoding speed (Table 10).
  • Kernel Implementation: The 4-bit matrix multiplication kernel was implemented with NVIDIA's CUTLASS, a high-performance CUDA library. The self-attention mechanism uses PyTorch's native SDPA (scaled dot-product attention) function.
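To illustrate the quantization scheme listed above (per-token asymmetric for activations, per-channel symmetric for weights), here is a simplified NumPy fake-quantization sketch; it reflects only the granularities described, not the actual CUTLASS kernels:

```python
import numpy as np

def fake_quant_act_per_token(x: np.ndarray, n_bits: int = 4) -> np.ndarray:
    """Per-token asymmetric fake quantization: one (scale, zero-point) per row/token."""
    qmax = 2 ** n_bits - 1
    x_min = x.min(axis=-1, keepdims=True)
    x_max = x.max(axis=-1, keepdims=True)
    s = (x_max - x_min) / qmax
    zp = np.round(-x_min / s)
    x_int = np.clip(np.round(x / s) + zp, 0, qmax)
    return (x_int - zp) * s

def fake_quant_weight_per_channel(w: np.ndarray, n_bits: int = 4) -> np.ndarray:
    """Per-channel symmetric fake quantization: one scale per output channel (row), zero-point = 0."""
    qmax = 2 ** (n_bits - 1) - 1
    s = np.abs(w).max(axis=-1, keepdims=True) / qmax
    w_int = np.clip(np.round(w / s), -qmax - 1, qmax)
    return w_int * s

acts = np.random.randn(3, 16)        # (tokens, hidden)
weights = np.random.randn(8, 16)     # (out_channels, in_features)
print("max activation error:", np.abs(acts - fake_quant_act_per_token(acts)).max())
print("max weight error    :", np.abs(weights - fake_quant_weight_per_channel(weights)).max())
```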

6. Results & Analysis

6.1. Core Results Analysis

OSTQuant consistently outperforms existing state-of-the-art methods across various LLMs and quantization configurations, demonstrating its effectiveness in preserving accuracy at low bit-widths and achieving significant efficiency gains.

Quantization Performance (Table 2):

  • W4-only (4-16-16) Setting: OSTQuant achieves remarkable accuracy retention, maintaining at least 99.5% of the floating-point (FP) accuracy on zero-shot tasks. For the challenging LLaMA-3-8B model, OSTQuant drops only 0.29 points (67.80 vs. FP 68.09), while most competing methods in Table 2 lose a full point or more (e.g., SpinQuant drops 1.55 points). This demonstrates its superior ability to quantize weights with minimal impact.

  • W4A4KV16 Setting (4-4-16): When activations are also quantized but the KV cache remains at FP16, OSTQuant shows substantial performance boosts. For instance, on LLaMA-2-7B, OSTQuant (63.90) outperforms SpinQuant (57.37) by 6.53 points in zero-shot accuracy. This highlights its effectiveness in handling activation distributions.

  • W4A4KV4 Setting (4-4-4): In this most aggressive 4-bit quantization configuration for weights, activations, and KV cache, OSTQuant still maintains significant performance gains. It outperforms the previous SOTA method, SpinQuant, by approximately 1 point across multiple models (e.g., LLaMA-3-8B: 65.37 vs 64.10, LLaMA-2-7B: 63.18 vs 62.01). This demonstrates its robustness in extreme compression scenarios, making 4-bit inference truly feasible.

  • Comparison to Smooth-based vs. Rotation-based: The results clearly show that rotation-based methods (like QuaRot, SpinQuant, OSTQuant) significantly outperform smooth-based methods (like SmoothQuant) once activations are quantized. This confirms the paper's argument that smooth-based methods struggle more with the prominent outliers and uneven distributions in activations.

    QSUR and Accuracy Correlation (Figure 3): The evaluation of QSUR across different quantization methods for LLaMA variants reveals a clear positive correlation with model performance (Zero-Shot accuracy). OSTQuant achieves the highest QSUR, which directly translates to improved model accuracy. This validates the QSUR metric as an effective indicator of quantizability and the success of OSTQuant in optimizing data distributions.

Activation Distribution Uniformity (Figure 2, 11, 12): Visualizations (Figure 2) show that LLaMA-3-8B's activation distributions prior to OSTQuant exhibit substantial variation across channels and numerous outliers. After OSTQuant's transformations, the distributions become significantly more uniform across channels and outliers are mitigated. This qualitative observation supports the quantitative QSUR improvements and performance gains.

Speedup and Memory Savings (Table 3, 9, 10):

  • Inference Speedup: OSTQuant enables 4-bit inference that delivers an average inference speedup of over 2× compared to FP16. For larger models like LLaMA-30B, it achieves approximately 3× acceleration. This is attributed to reduced memory-access overhead and efficient low-precision computation units.
  • Memory Savings: OSTQuant provides memory savings exceeding 3.5×. This is crucial for deploying LLMs on resource-constrained devices or running larger models on existing hardware.
  • Decoding Stage Efficiency (Table 10): In the decoding stage (which is often memory-bound), quantizing to 4-bit significantly reduces memory access overhead. This allows running very large models like LLaMA-3-70B on a single A6000 GPU at a respectable speed of 15 tokens per second, which would be Out of Memory (OOM) for FP16.
  • Training Speed (Table 11): OSTQuant also offers substantial advantages in optimization speed over block-reconstruction-based methods such as OmniQuant. With only 150 iterations and few learnable parameters, it quantizes smaller models (7B) in roughly 20 minutes (a 5.3× speedup over OmniQuant) and the 30B model in about 2.2 hours (a 3.3× speedup).

6.2. Data Presentation (Tables)

The following are the results from Table 1 of the original paper:

Model Loss Type Wiki PPL Arc-Easy Score Arc-Challenge Score
LLaMA-2-7B Origin 5.38 69.87 42.41
KL-Top 5.94 72.69 44.62
LLaMA-2 13B Origin 5.12 75.09 46.08
KL-Top 5.25 75.29 47.10
LLaMA-3 8B Origin 6.80 76.68 49.26
KL-Top 7.29 76.73 49.32

The following are the results from Table 2 of the original paper:

#Bits Method LLaMA-3 8B LLaMA-3 70B LLaMA-2 7B LLaMA-2 13B LLaMA-2 70B LLaMA 7B LLaMA 13B LLaMA 30B
0-shot Wiki 0-shot Wiki 0-shot Wiki 0-shot Wiki 0-shot Wiki 0-shot Wiki 0-shot Wiki 0-shot Wiki
Avg.(↑) (↓) Avg.(↑) (↓) Avg.(↑) (↓) Avg.(↑) (↓) Avg.(↑) (↓) Avg.(↑) (↓) Avg.(↑) (↓) Avg.(↑) (↓)
W-A-KV FloatingPoint 68.09 - 6.14 - 73.81 - 2.86 - 65.21 - 5.47 - 67.61 - 4.88 - 71.59 - 3.32 - 64.48 - 5.68 - 66.67 - 5.09 - 70.00 - 4.10 -
4-16-16 RTN 63.70 8.13 - 31.15 - 1e5 - 61.27 7.02 - 60.24 -6.39 - 69.62 3.87 - 62.67 - 7.94 - 63.45 - 8.60 - 65.69 -6.13 -
SmoothQuant 62.79 8.12 - 67.94 6.70 - 58.88 8.03 - 62.03 5.86 - 65.93 5.50 - 62.24 - 7.46 - 62.69 - 18.75 - 65.69 5.80 -
GPTQ 61.03 7.43 - 31.45 - 9e3 - 60.86 9.84 - 64.71 5.79 - 70.96 3.94 - 60.15 - 7.93 - 64.36 6.58 - 66.95 5.26 -
Omniquant 65.66 7.19 - - - - 63.19 5.74 - 66.38 5.02 - 71.04 3.47 - 63.42 - 5.86 - 66.22 - 5.21 - 69.07 4.25 -
AWQ 67.03 7.36 - 68.92 5.92 - 63.89 5.83 - 66.25 5.07 - 70.88 4.03 - 63.30 5.97 - 65.58 5.28 - 69.44 4.28 -
QuaRot 67.27 6.53 - 72.93 3.53 - 64.30 5.62 - 66.95 5.00 - 71.21 3.41 - 63.40 5.83 - 65.91 5.20 - 69.73 4.27 -
SpinQuant 66.54 6.49 - 72.90 3.49 - 63.59 5.58 - 67.14 5.00 - 71.12 3.43 - 63.94 5.76 - 66.32 5.16 - 69.62 4.21 -
OSTQuant 67.80 6.53 - 73.69 3.19 - 64.37 5.64 - 67.31 4.94 - 71.48 3.41 - 64.13 5.81 - 66.62 5.21 - 69.84 4.19 -
RTN 33.42 -6e2 - 31.21 - 8e3 - 32.44 nan - 30.86 δe3 - 30.90 7e4 - 32.51 - 7e3 - 31.63 - 3e4 - 31.57 2e3 -
SmoothQuant 33.04 1e3 - 34.67 2e2 - 32.13 nan - 34.26 1e3 - 35.86 3e2 - 34.42 - 3e2 - 33.29 - 6e2 - 34.64 1e3 -
GPTQ 32.98 5e2 - 31.47 - 4e4 - 32.72 nan - 30.11 4e3 - 30.86 nan - 32.12 - 1e3 - 31.51 - 3e3 - 30.88 2e3 -
QuaRot 61.69 8.02 - 65.56 6.35 - 61.87 6.05 - 65.13 5.35 - 69.96 3.78 - 61.76 - 6.22 - 64.46 5.50 - 68.14 4.57 -
4-4-16 SpinQuant 64.11 7.28 - 66.99 6.10 - 57.37 6.78 - 63.23 5.24 - 70.58 3.68 - 61.82 - 6.08 - 64.59 5.36 - 68.08 4.53 -
OSTQuant 65.14 7.24 - 72.21 3.97 - 63.90 5.60 - 66.24 5.14 - 70.92 3.57 - 62.72 - 6.04 - 65.80 5.40 - 68.52 4.43 -
RTN 33.18 7e2 - 30.82 - 8e3 - 32.67 nan - 30.93 7e3 - 31.73 7e4 - 32.87 - 1e4 - 31.33 - 3e4 - 31.64 2e3 -
4-4-4 SmoothQuant 32.96 1e3 - 33.76 3e2 - 32.12 nan - 33.36 1e3 - 35.54 3e2 - 33.32 - 3e2 - 33.28 - 5e2 - 34.65 1e3 -
GPTQ 33.71 6e2 - 31.20 - 4e4 - 33.52 nan - 27.85 5e3 - 31.09 nan - 31.80 - 2e3 - 30.63 - 3e3 - 31.07 2e3 -
Omniquant 32.33 4e2 - - - - 48.40 14.26 - 50.35 12.30 - - - - 48.46 - 11.26 - 45.63 - 10.87 - 45.04 12.35 -
QuaRot 61.38 8.18 - 65.33 6.60 - 61.48 6.11 - 65.16 5.39 - 70.30 3.80 - 61.22 - 6.26 - 64.59 5.53 - 68.08 4.60 -
SpinQuant 64.10 7.35 - 66.31 6.24 - 62.01 5.96 - 64.13 5.74 - 70.57 3.61 - 61.32 - 6.12 - 64.95 5.39 - 68.14 4.55 -
OSTQuant 65.37 7.29 - 71.69 4.01 - 63.18 5.91 - 65.41 5.25 - 70.84 3.59 - 62.55 - 6.07 - 65.43 5.40 - 68.20 4.42 -

The following are the results from Table 3 of the original paper:

Model Size Prefill Speedup (Seqlen) Memory Saving Factor (Seqlen)
256 512 1024 2048 4096 8192 256 512 1024 2048 4096 8192
7B 2.24x 2.27x 2.23x 2.14x 2.11x 2.02x 3.48x 3.34x 3.12x 2.86x 2.57x 2.34x
8B 2.42x 2.52x 2.52x 2.43x 2.36x 2.23x 3.48x 3.36x 3.12x 2.77x 2.38x 2.00x
13B 2.62x 2.68x 2.63x 2.52x 2.83x 2.32x 3.64x 3.51x 3.30x 3.02x 2.70x 2.43x
30B 3.18x 3.01x 2.98x 3.40x 2.84x 2.68x 3.70x 3.59x 3.42x 3.15x 2.83x 2.53x

The following are the results from Table 4 of the original paper:

Metric Baseline +Rres +Sres +Rdn +Su|d +Rqk +Sqk +Rov +Sov
Wiki PPL nan 9.70 9.46 6.16 6.00 5.92 5.92 5.94 5.91
Zero-shot9 33.51 54.33 53.74 61.75 61.79 62.35 62.56 63.11 63.18

The following are the results from Table 5 of the original paper:

Model Optimizer Type Best Steps Best LR1 Best LR2 Zero-Shot Score
LLaMA-2-7B Cayley SGD 150 1.50 0.20 63.11
Riemann SGD 500 0.10 0.02 63.09
Riemann Adam 150 0.02 1e-3 63.18
LLaMA-2-13B Cayley SGD 200 1.50 0.2 64.77
Riemann SGD 500 0.1 0.02 65.19
Riemann Adam 150 0.02 0.002 65.41

The following are the results from Table 6 of the original paper:

Setting Metric k=5 k=50 k=100 k=500 k=1000 k=5000 k=10000
W3 Only Zero-Shot9 Score 61.87 61.88 61.75 62.18 62.30 61.25 61.21
Wiki PPL 6.06 6.116 6.13 6.07 6.06 6.06 6.12
W4A4KV4 Zero-Shot Score 62.4 62.13 62.38 62.34 63.18 62.44 62.11
Wiki PPL 5.99 5.96 5.95 5.96 5.96 5.93 5.94

The following are the results from Table 7 of the original paper:

Model Quant Setting Method Zero-Shot9 Wiki PPL
LLaMA-2-7B Full-Precision 65.21 5.47
W4A16KV16 Hadamard 63.32 5.62
W4A16KV16 WOMI 63.45 5.59
W4A4KV4 Hadamard 61.47 6.11
W4A4KV4 WOMI 61.52 6.09
LLaMA-3-8B Full-Precision - 68.09 6.14
W4A16KV16 Hadamard 67.27 6.53
W4A16KV16 WOMI 67.41 6.48
W4A4KV4 Hadamard 61.38 8.18
W4A4KV4 WOMI 61.40 8.17

The following are the results from Table 8 of the original paper:

Model Method ARC-c ARC-e BoolQ HellaS Lam. OBQA PIQA SIQA WinoG. Avg. Wiki2 PPL
LLaMA3-8B RTN 23.72 30.56 46.18 29.83 2.70 28.60 52.45 34.39 50.20 33.18 704.34
SpinQuant 46.33 73.57 76.15 75.43 71.40 41.40 79.16 44.68 68.75 64.10 7.35
SpinQuant + KL-Top 47.29 73.95 75.82 75.64 71.40 41.58 78.16 44.38 68.45 64.07 7.54
OSTQuant 49.26 76.68 78.25 76.18 70.48 43.19 77.85 45.18 69.13 65.13 6.80
OSTQuant + KL-Top 49.32 76.73 78.87 76.01 70.77 43.20 78.51 45.70 69.22 65.37 7.29
LLaMA2-7B RTN 27.22 27.06 50.83 27.34 0.93 25.80 49.51 34.85 50.51 32.67 nan
SpinQuant 40.44 71.08 74.40 73.51 70.66 41.80 76.88 43.50 65.82 62.01 5.96
SpinQuant + KL-Top 40.76 71.29 74.61 73.08 70.19 40.94 76.32 43.85 67.78 62.09 6.16
OSQuant 42.41 69.87 75.07 72.90 70.21 40.87 78.16 44.16 68.40 62.45 5.38
OSTQuant + KL-Top 44.62 72.69 75.10 73.27 70.21 41.00 78.13 44.42 68.27 63.11 5.94

The following are the results from Table 9 of the original paper:

Model Seqlen Prefill Time Prefill Speedup Memory Memory Saving
FP16 INT4 FP16 INT4
LLaMA2-7B 256 8.050ms 3.597ms 2.238x 0.411GB 0.118GB 3.479x
512 14.904ms 6.579ms 2.265x 0.435GB 0.130GB 3.341x
1024 27.989ms 12.582ms 2.225x 0.483GB 0.155GB 3.116x
2048 54.276ms 25.312ms 2.144x 0.577GB 0.202GB 2.857x
4096 112.230ms 53.145ms 2.112x 0.766GB 0.299GB 2.566x
8192 244.675ms 121.339ms 2.016x 1.147GB 0.491GB 2.336x
LLaMA3-8B 256 8.035ms 3.314ms 2.424x 0.430GB 0.124GB 3.478x
512 15.545ms 6.176ms 2.517x 0.442GB 0.132GB 3.356x
1024 29.169ms 11.599ms 2.515x 0.466GB 0.149GB 3.116x
2048 57.470ms 23.631ms 2.432x 0.513GB 0.185GB 2.774x
4096 117.523ms 49.835ms 2.358x 0.608GB 0.256GB 2.378x
8192 256.394ms 114.815ms 2.233x 0.795GB 0.397GB 2.003x
LLaMA2-13B 256 11.449ms 4.370ms 2.620x 0.634GB 0.174GB 3.643x
512 21.195ms 7.924ms 2.675x 0.663GB 0.189GB 3.512x
1024 41.752ms 15.867ms 2.631x 0.723GB 0.219GB 3.301x
2048 81.965ms 32.553ms 2.518x 0.841GB 0.279GB 3.018x
4096 199.046ms 70.442ms 2.826x 1.079GB 0.399GB 2.702x
8192 359.409ms 154.640ms 2.324x 1.551GB 0.639GB 2.426x
LLaMA-30B 256 18.682ms 5.883ms 3.175x 1.047GB 0.283GB 3.703x
512 34.393ms 11.445ms 3.005x 1.085GB 0.302GB 3.589x
1024 66.880ms 22.464ms 2.977x 1.162GB 0.340GB 3.416x
2048 157.500ms 46.317ms 3.400x 1.315GB 0.418GB 3.148x
4096 272.355ms 96.052ms 2.835x 1.625GB 0.575GB 2.828x
8192 576.555ms 215.27ms 2.678x 2.242GB 0.887GB 2.527x

The following are the results from Table 10 of the original paper:

Model Decoder Speed (tokens/sec) Memory Use (GB)
FP Quantized Speed up FP Quantized Memory Saving
LLaMA-2-7B 47.32 89.4 1.89x 13.94 4.32 3.23x
LLaMA-3-8B 38.33 77.71 2.03x 15.83 5.88 2.69x
LLaMA-2-13B 23.7 55.35 2.34x 23.7 8.5 2.79x
LLaMA-30B OOM 30.49 - OOM 18.19 -
LLaMA-3-70B OOM 14.68 - OOM 38.41 -

The following are the results from Table 11 of the original paper:

Method 7B 8B 13B 30B 70B
Omniquant 1.6h 1.8h 3.3h 7.3h 9.5h
OSTQuant 0.3h 0.4h 0.8h 2.2h 5.5h
Speedup 5.3x 4.5x 4.1x 3.3x 1.7x

The following are the results from Table 12 of the original paper:

Model #BitsW-A-KV Method Zero-Shot Common Sense Reasoning Tasks (Avg. Accuracy %) Wiki2 PPL (↓)
ARC-c (↑) ARC-e (↑) BoolQ (↑) HellaS. (↑) Lam. (↑) OBQA (↑) PIQA (↑) SIQA (↑) WinoG. (↑)
(Avg. Zero-Shot Score (↑))
2-7B 16-16-16 Full Precision 46.42 74.33 77.71 75.94 73.69 44.20 79.16 45.91 69.53 65.21 5.47
4-16-16 RTN 42.15 67.59 73.06 72.34 67.18 41.80 76.50 44.11 66.69 61.27 7.02
SmoothQuant 39.59 65.19 69.82 73.83 67.61 42.40 77.64 44.52 68.43 60.86 8.30
GPTQ 42.49 71.00 74.70 73.50 70.87 42.40 78.40 44.93 68.82 63.19 5.74
Omniquant 44.11 70.50 78.07 74.98 70.68 43.80 78.30 45.14 69.38 63.89 5.83
AWQ 43.94 73.50 76.97 74.87 72.07 44.00 78.24 45.40 69.38 64.30 5.62
QuaRot 43.77 72.69 73.36 75.10 73.80 43.00 77.86 45.60 67.66 63.59 5.58
SpinQuant 44.54 73.31 75.57 75.04 73.67 44.20 78.89 45.50 68.59 64.37 5.64
OSTQuant 44.54 73.31 75.57 75.04 73.67 44.20 78.89 45.50 68.59 64.37 5.64
4-4-16 RTN 25.34 28.03 50.52 27.71 1.01 26.20 50.80 33.93 48.38 32.44 nan
SmoothQuant 28.33 26.39 49.39 27.28 1.18 23.40 48.80 33.75 40.75 32.13 nan
GPTQ 24.40 28.70 51.62 28.66 1.36 24.60 51.14 34.49 49.49 32.72 nan
QuaRot 42.32 69.65 74.77 72.91 70.81 39.80 77.20 43.55 65.82 61.76 6.05
SpinQuant 37.54 62.58 71.16 70.48 67.16 34.80 76.20 39.96 60.62 57.37 6.78
OSTQuant 44.03 71.93 75.41 74.94 73.22 43.20 78.51 45.85 68.03 63.90 5.60
4-4-4 RTN 27.22 27.06 50.83 27.34 0.93 25.80 49.51 34.85 50.51 32.67 nan
SmoothQuant 26.37 25.63 47.71 27.05 1.11 26.40 41.90 34.49 48.38 32.12 nan
GPTQ 26.96 27.65 52.84 28.83 1.63 29.00 49.62 35.11 49.80 33.52 nan
Omniquant 31.40 53.75 63.79 60.13 55.63 34.40 60.59 40.28 54.07 48.40 14.26
QuaRot 41.43 69.32 74.83 72.50 70.66 39.80 77.42 43.70 64.64 61.48 6.11
SpinQuant 40.44 71.08 74.40 73.51 70.66 41.80 76.88 43.50 65.82 62.01 5.96
OSTQuant 42.92 72.56 74.71 73.10 71.76 44.00 77.20 44.98 67.77 63.18 5.91
2-13B 16-16-16 Full Precision 49.15 77.53 80.58 79.39 76.62 45.20 80.63 47.49 71.90 67.61 4.88
4-16-16 RTN 42.92 66.54 71.38 66.62 68.99 39.40 76.93 44.06 65.35 60.24 6.39
SmoothQuant 46.25 70.45 75.92 69.16 70.49 39.80 77.86 45.14 64.17 62.03 5.86
GPTQ 46.63 73.95 74.83 73.77 73.00 42.40 78.51 45.50 70.64 64.71 5.79
Omniquant 48.29 75.42 77.92 77.80 75.59 45.20 80.41 46.62 70.17 66.38 5.02
AWQ 48.63 78.16 78.81 78.48 75.70 45.00 79.54 46.21 72.45 66.25 5.07
QuaRot 49.07 78.17 76.58 77.60 76.70 45.40 80.03 45.50 71.11 66.95 5.00
SpinQuant 49.15 77.48 79.27 78.60 77.10 44.60 80.03 46.47 71.67 67.14 5.00
OSTQuant 48.72 76.26 80.67 78.27 76.54 45.54 80.25 47.65 71.90 67.31 4.94
4-4-16 RTN 27.99 26.81 38.50 26.08 0.00 23.60 48.20 34.90 51.62 30.86 nan
SmoothQuant 24.49 35.06 47.98 30.87 3.67 26.20 51.01 35.31 49.20 34.26 nan
GPTQ 27.82 26.77 37.92 25.67 0.00 21.80 47.77 33.51 48.15 30.11 nan
QuaRot 46.42 72.70 78.10 75.58 74.31 43.00 79.05 44.37 71.35 65.13 5.35
SpinQuant 43.77 69.99 76.57 74.63 72.81 41.60 77.72 44.27 68.19 63.23 5.24
OSTQuant 47.78 74.66 80.03 77.60 75.94 44.00 79.38 46.06 70.32 66.24 5.14
4-4-4 RTN 27.82 26.52 38.38 26.27 0.02 26.00 49.78 34.39 49.17 30.93 nan
SmoothQuant 24.49 28.83 45.84 30.70 2.70 23.40 53.81 34.80 51.07 33.36 nan
GPTQ 27.90 26.39 37.95 26.16 0.00 20.00 48.26 34.39 50.43 27.85 nan
Omniquant 32.85 55.13 64.34 60.13 56.45 33.40 68.17 40.76 56.51 40.35 12.30
QuaRot 47.27 73.91 78.41 75.33 75.30 43.80 79.27 45.85 69.06 65.16 5.39
SpinQuant 46.67 74.49 76.06 75.22 72.19 42.40 78.20 43.45 67.20 64.13 5.74
OSTQuant 48.10 75.21 77.46 76.71 75.14 44.60 78.67 45.75 68.30 65.41 5.25
3-8B 16-16-16 Full Precision 53.50 77.74 81.10 79.18 75.74 44.80 80.63 47.08 73.01 68.09 6.14
4-16-16 RTN 48.98 73.23 72.75 75.90 63.85 43.20 78.40 43.81 73.16 63.70 8.13
SmoothQuant 47.22 72.35 72.11 74.92 62.41 43.00 77.80 43.91 71.27 62.79 8.12
GPTQ 49.74 72.50 71.28 68.34 46.69 43.60 78.50 46.47 71.82 61.03 7.43
Omniquant 50.09 74.54 79.51 76.92 70.31 43.00 79.54 44.52 71.74 65.66 7.19
AWQ 52.22 76.68 80.31 75.91 74.81 44.20 79.87 46.26 71.67 67.03 7.36
QuaRot 51.88 77.53 79.60 77.87 74.11 44.40 80.14 46.37 72.50 67.27 6.53
SpinQuant 52.13 77.28 79.99 78.40 73.60 44.80 79.98 45.22 72.77 66.54 6.49
OSTQuant 52.82 79.84 80.31 77.86 76.48 42.80 80.74 45.55 72.80 67.80 6.53
4-4-16 RTN 23.72 30.89 46.30 31.26 3.03 27.60 52.72 35.26 50.04 33.42 nan
SmoothQuant 23.29 28.24 48.93 29.19 1.57 28.60 44.46 33.37 49.64 33.04 nan
GPTQ 23.46 32.07 43.79 30.10 2.40 28.00 53.97 34.14 48.60 32.98 nan
QuaRot 47.26 72.73 73.60 67.00 43.00 76.61 45.04 65.90 61.69 8.02 nan
SpinQuant 47.35 74.12 76.36 75.98 69.88 42.46 77.37 44.47 68.98 64.11 7.28
OSTQuant 48.81 73.48 79.82 75.97 72.20 42.70 78.18 45.75 69.22 65.14 7.24
4-4-4 RTN 23.72 30.56 46.18 29.83 2.70 28.60 52.45 34.39 50.20 33.18 nan
SmoothQuant 23.55 28.96 44.84 28.90 1.44 29.40 51.09 34.14 50.36 32.96 nan
GPTQ 22.87 30.35 44.34 29.72 2.39 29.80 54.95 34.75 51.30 33.71 nan
Omniquant 29.10 33.79 41.53 31.11 1.86 25.80 53.37 34.08 50.43 32.33 nan
QuaRot 49.49 74.37 79.16 77.22 71.69 42.29 78.89 43.87 71.03 65.33 6.60
SpinQuant 46.33 73.57 76.15 75.43 71.40 41.40 79.16 44.68 68.75 64.10 7.35
OSTQuant 49.32 76.73 78.87 76.01 70.77 43.20 78.51 45.70 69.22 65.37 7.29

The following are the results from Table 13 of the original paper:

Model #Bits-A-KV Method Zero-Shot Common Sense Reasoning Tasks (Avg. Accuracy %) Wiki2 PPL (↓)
ARC-c (↑) ARC-e (↑) BoolQ (↑) HellaS. (↑) LambA. (↑) OBQA (↑) PIQA (↑) SIQA (↑) WinoG. (↑)
(Avg. Zero-Shot Score (↑))
7B [16-16-16] Full Precision 44.71 72.90 74.98 76.20 73.08 43.80 79.16 45.55 69.93 64.48 5.68
4-16-16 RTN 41.70 69.82 73.30 69.67 42.00 78.13 45.34 68.00 62.67 7.94 nan
SmoothQuant 40.96 68.60 74.04 73.16 68.74 42.00 78.07 44.37 69.46 62.69 7.46
GPTQ 41.70 67.98 69.50 63.15 40.80 66.55 44.37 69.46 60.15 7.93 nan
Omniquant 42.49 71.38 74.62 74.71 71.98 42.00 79.05 45.09 69.14 63.30 5.86
AWQ 43.86 70.79 75.27 69.94 43.00 78.45 45.09 69.14 63.30 5.97 nan
QuaRot 42.75 69.99 73.30 72.71 73.55 42.80 78.35 45.14 69.61 63.40 5.83
SpinQuant 43.77 71.17 74.06 72.87 72.00 44.40 78.40 44.52 70.60 63.94 5.76
OSTQuant 44.20 72.56 73.73 75.05 73.45 44.60 78.73 45.45 69.38 64.13 5.81
4-4-16 RTN 23.46 29.34 45.05 29.02 1.24 26.00 52.07 35.11 51.30 32.51 nan
SmoothQuant 25.40 31.40 45.68 29.73 5.43 28.20 49.68 34.44 49.09 34.42 nan
GPTQ 23.89 27.74 42.87 28.49 1.28 27.40 51.00 36.23 50.20 32.12 nan
Omniquant 40.36 67.26 73.15 72.89 70.81 42.00 77.90 44.27 67.17 61.76 6.22
AWQ 40.19 68.43 72.35 70.91 70.68 41.20 77.75 44.17 68.67 61.82 6.08
QuaRot 42.58 70.79 72.87 74.06 70.77 43.40 77.69 45.04 67.25 62.72 6.04
SpinQuant 40.19 68.43 72.35 70.91 70.68 41.20 77.75 44.17 68.67 61.82 6.08
OSTQuant 42.58 70.79 72.87 74.06 70.77 43.40 77.69 45.04 67.88 62.55 6.07
4-4-4 RTN 23.89 29.59 46.67 28.37 1.13 26.40 52.90 35.21 51.54 32.87 nan
SmoothQuant 23.38 30.18 50.03 29.67 4.89 24.60 51.74 34.75 50.67 33.32 nan
GPTQ 23.89 27.90 43.88 27.86 1.05 26.20 51.85 34.75 49.49 31.80 nan
Omniquant 31.40 54.00 61.90 58.29 46.45 31.80 60.59 38.54 55.17 48.46 11.26
QuaRot 40.27 67.55 72.20 72.71 70.39 39.80 77.20 44.88 65.90 61.22 6.26
SpinQuant 40.30 68.18 73.60 72.87 70.46 40.60 77.42 42.68 67.56 61.32 6.12
OSTQuant 42.92 70.33 72.11 73.77 70.66 42.42 77.91 44.93 67.88 62.55 6.07
13B 16-16-16 Full Precision 47.87 74.49 77.86 79.10 76.03 44.40 80.30 46.72 73.24 66.67 5.09
4-16-16 RTN 45.56 70.66 72.45 76.06 70.50 42.00 78.84 44.93 70.01 64.45 8.60
SmoothQuant 43.86 71.21 75.19 74.19 69.34 40.00 77.80 45.45 70.72 62.69 8.75
GPTQ 45.99 72.85 75.27 75.31 70.10 44.60 79.87 46.16 71.11 64.36 6.58
Omniquant 47.01 73.86 77.22 77.95 75.59 45.00 79.87 46.88 72.61 66.22 5.21
AWQ 47.30 73.86 77.60 79.03 78.30 43.40 79.87 45.85 71.67 65.58 5.28
QuaRot 47.80 72.22 76.50 78.07 75.90 45.00 79.90 45.50 72.60 65.91 5.20
SpinQuant 47.44 74.83 77.37 78.30 75.50 45.60 79.90 46.01 72.06 66.32 5.16
OSTQuant 48.04 73.86 78.10 77.28 75.99 45.60 80.55 46.93 72.30 66.62 5.21
4-4-16 RTN 25.85 26.26 42.05 29.17 0.00 28.00 50.33 34.60 50.67 31.63 nan
SmoothQuant 25.43 29.29 51.00 28.10 2.02 26.00 53.32 34.30 49.57 33.29 nan
GPTQ 24.66 27.80 40.80 25.30 0.70 24.20 51.31 33.65 51.70 31.51 nan
QuaRot 46.93 72.50 75.57 76.63 74.13 42.40 78.73 45.24 68.98 64.46 5.50
SpinQuant 45.30 72.56 75.38 76.86 73.28 43.60 78.90 44.63 70.40 64.59 5.36
OSTQuant 48.04 74.07 77.13 77.22 74.58 45.00 78.62 46.16 71.35 65.80 5.40
4-4-4 RTN 26.28 27.27 42.35 25.85 0.19 26.60 49.95 34.19 49.25 31.33 nan
SmoothQuant 24.49 28.83 51.65 27.91 2.08 26.00 52.56 35.41 50.59 33.28 nan
GPTQ 23.63 27.31 39.50 26.17 0.56 26.00 50.59 35.00 49.57 30.63 nan
Omniquant 31.40 54.00 61.90 58.29 46.45 31.80 60.59 38.54 55.17 45.63 10.87
QuaRot 46.50 71.55 75.50 74.30 73.47 45.00 78.90 44.37 70.09 64.59 5.53
SpinQuant 45.90 70.71 76.51 77.00 73.63 45.60 79.00 45.65 70.32 64.95 5.39
OSTQuant 45.90 75.25 76.94 77.21 74.23 43.40 79.43 45.91 70.56 65.43 5.40
30B 16-16-16 Full Precision 52.99 80.39 82.75 82.62 77.59 48.00 82.26 47.75 75.69 70.00 4.10
4-16-16 RTN 49.74 73.99 77.89 72.21 70.21 44.20 79.00 45.70 73.88 65.90 6.13
SmoothQuant 48.98 72.90 80.00 79.00 71.49 44.80 78.13 45.96 73.70 65.69 5.80
GPTQ 52.22 77.62 81.10 81.94 76.80 47.00 81.07 47.50 74.43 69.07 4.25
Omniquant 53.20 77.77 81.68 82.29 76.79 48.20 81.20 48.16 75.37 69.44 4.28
AWQ 53.58 78.62 82.11 82.10 77.30 48.00 81.72 47.90 75.09 69.73 4.27
QuaRot 52.90 78.49 82.02 82.21 78.00 48.00 81.01 48.41 75.06 69.62 4.21
SpinQuant 53.07 79.12 82.09 82.04 78.58 48.60 81.18 48.06 74.80 69.84 4.19
OSTQuant 53.07 79.12 82.09 82.04 78.58 48.60 81.18 48.06 74.80 69.84 4.19
4-4-16 RTN 25.00 27.95 42.02 27.22 0.21 27.00 49.13 34.65 50.91 31.57 nan
SmoothQuant 25.63 31.91 54.86 32.52 3.80 28.20 50.13 34.49 50.04 34.64 nan
GPTQ 27.30 27.19 38.69 26.75 0.70 25.80 49.40 35.21 47.75 30.88 nan
Omniquant 30.60 33.79 41.95 32.44 1.86 26.40 51.56 38.54 53.91 45.04 10.33
QuaRot 51.79 76.39 80.76 77.08 77.08 45.80 80.58 45.60 74.35 68.14 4.57
SpinQuant 52.06 77.06 81.38 80.62 76.90 46.00 80.14 46.26 74.27 68.08 4.53
OSTQuant 51.37 78.11 82.48 79.51 75.99 45.40 81.18 47.80 74.82 68.52 4.43
4-4-4 RTN 25.00 28.87 44.07 27.29 0.39 25.60 49.67 34.54 49.33 31.64 nan
SmoothQuant 22.61 31.48 55.05 31.28 3.40 28.00 53.50 34.65 50.28 34.65 nan
GPTQ 22.87 27.70 39.36 27.00 0.33 24.00 50.71 34.30 47.91 31.07 nan
Omniquant 29.10 33.79 41.95 32.44 1.86 25.80 53.37 34.08 50.43 45.04 10.33
QuaRot 51.71 76.50 80.50 77.00 76.20 45.00 80.63 45.00 73.32 68.08 4.55
SpinQuant 51.60 76.98 81.07 80.57 77.00 46.00 79.92 46.26 74.90 68.14 4.55
OSTQuant 49.76 78.52 81.16 81.13 77.57 46.40 80.90 46.11 74.27 68.20 4.42

The following are the results from Table 14 of the original paper:

Model #BitsW-A-KV Method Zero-Shot Common Sense Reasoning Tasks (Avg. Accuracy %) Wiki2 PPL (↓)
ARC-c (↑) ARC-e (↑) BoolQ (↑) HellaS. (↑) LambA. (↑) OBQA (↑) PIQA (↑) SIQA (↑) WinoG. (↑)
(Avg. Zero-Shot Score (↑))
2-70B 16-16-16 Full Precision 57.42 81.02 83.79 83.81 79.60 48.80 82.70 49.18 77.98 71.59 3.32
4-16-16 RTN 55.05 79.29 81.35 81.78 75.51 47.60 81.94 46.83 76.46 69.62 3.87
SmoothQuant 50.26 76.56 81.53 67.81 73.63 44.40 81.34 44.17 73.64 65.93 5.50
GPTQ 56.91 80.81 83.24 82.47 79.06 47.80 82.50 48.06 77.51 70.96 3.94
Omniquant 57.08 80.81 82.69 83.07 79.18 47.40 83.08 48.87 77.19 71.04 3.47
AWQ 56.67 80.54 82.98 82.54 78.83 47.67 82.97 48.20 77.70 70.88 4.03
QuaRot 57.34 80.85 83.24 83.27 80.38 47.60 82.21 48.62 77.35 71.21 3.41
SpinQuant 56.91 80.60 83.18 83.06 79.16 49.00 82.54 48.31 77.11 71.12 3.43
OSTQuant 57.36 81.37 83.20 83.86 79.77 48.73 82.69 48.46 77.89 71.48 3.41
4-4-16 RTN 29.35 26.05 37.74 25.97 0.02 24.80 51.31 34.14 48.70 30.90 nan
SmoothQuant 25.00 25.98 55.23 32.52 7.49 25.00 54.62 35.21 51.70 35.86 nan
GPTQ 27.82 25.80 37.95 25.82 0.00 27.00 49.67 33.98 49.00 30.86 nan
QuaRot 55.29 80.35 81.10 81.87 79.06 45.80 82.05 47.90 76.24 69.96 3.78
SpinQuant 55.38 78.96 83.36 82.54 79.00 47.20 82.10 48.67 77.43 70.58 3.68
OSTQuant 56.61 80.51 83.03 82.68 79.11 47.86 83.00 48.76 76.70 70.92 3.57
4-4-4 RTN 30.30 27.74 38.23 26.12 0.02 24.60 51.74 34.29 52.43 31.73 nan
SmoothQuant 24.15 33.88 55.32 31.75 7.14 26.40 54.95 34.14 52.17 35.54 nan
GPTQ 28.75 26.39 37.86 25.96 0.00 26.40 50.00 34.44 50.04 31.09 nan
QuaRot 56.48 80.56 81.59 81.93 79.16 46.00 82.21 48.00 76.80 70.30 3.80
SpinQuant 56.31 80.64 83.55 82.36 79.41 47.20 82.21 47.29 77.05 70.57 3.61
OSTQuant 56.58 80.17 83.64 82.49 78.72 48.00 82.76 48.67 76.49 70.84 3.59
3-70B 16-16-16 Full Precision 64.42 85.98 85.14 84.95 79.47 48.46 84.39 50.82 80.66 73.81 2.86
4-16-16 RTN 26.28 25.55 37.83 26.36 0.00 29.00 50.98 34.70 49.30 31.15 nan
SmoothQuant 51.88 77.53 80.09 80.47 73.16 46.60 80.58 45.29 75.85 67.94 6.70
GPTQ 25.77 25.29 37.83 26.36 0.12 28.40 51.74 34.90 52.64 31.45 nan
Omniquant 48.29 75.42 77.92 77.80 75.59 45.20 80.41 46.62 70.17 66.38 5.02
AWQ 62.20 83.88 85.57 84.18 79.04 48.20 83.13 50.10 80.03 72.93 3.53
QuaRot 62.03 84.97 85.11 84.60 78.30 47.00 83.90 49.85 80.90 72.90 3.49
SpinQuant 63.76 85.82 84.99 85.16 79.53 48.45 84.26 51.01 80.22 73.69 3.19
OSTQuant 63.76 85.82 84.99 85.16 79.53 48.45 84.26 51.01 80.22 73.69 3.19
4-4-16 RTN 27.47 25.88 37.83 26.26 0.00 27.20 51.63 35.26 49.33 31.21 nan
SmoothQuant 26.00 34.47 50.46 32.48 11.98 30.00 54.24 34.83 48.93 34.67 nan
GPTQ 25.77 25.80 43.64 26.42 0.00 27.40 52.01 32.55 49.33 31.47 nan
QuaRot 60.60 73.65 77.46 77.83 71.96 43.20 78.13 45.29 71.90 65.56 6.35
SpinQuant 53.84 77.69 80.24 78.19 77.06 45.00 78.67 43.24 73.01 66.99 6.10
OSTQuant 61.84 84.56 84.14 82.47 77.08 46.07 83.38 50.23 80.13 72.21 3.97
4-4-4 RTN 27.13 25.42 37.83 26.12 0.00 26.60 50.76 34.95 51.22 33.76 nan
SmoothQuant 23.46 33.18 52.56 32.48 4.13 28.00 53.50 34.95 51.22 33.76 nan
QuaRot 49.49 74.37 79.16 77.22 71.69 42.29 78.89 43.87 71.03 65.33 6.60
SpinQuant 51.88 76.39 80.98 76.50 71.43 43.46 79.27 44.17 72.69 66.31 6.24
OSTQuant 61.29 82.39 83.43 83.25 75.90 48.93 81.73 51.24 77.01 71.69 4.01

6.3. Ablation Studies / Parameter Analysis

6.3.1. Effect of Different Transformation Matrices (Table 4)

The ablation study investigates the contribution of different learnable transformation matrices on LLaMA-2 7B under W4A4KV4 quantization.

  • Baseline (Zero-Shot9 33.51): This represents the performance without any of the proposed transformations, likely using a basic RTN or SmoothQuant-like approach, resulting in poor accuracy.

  • +$R_{res}$ (Zero-Shot9 54.33, Wiki PPL 9.70): Adding the global orthogonal transformation $R_{res}$ brings the most significant improvement (over 20 points in Zero-Shot9 accuracy). This highlights the critical role of rotating activations globally across the residual paths to manage distributions.

  • +$S_{res}$ (Zero-Shot9 53.74, Wiki PPL 9.46): Adding the global scaling transformation $S_{res}$ further reduces perplexity (9.70 → 9.46) even though the zero-shot average dips slightly, indicating that global scaling is beneficial but less impactful than the global rotation; the combination of the two is what matters.

  • +$R_{dn}$ (Zero-Shot9 61.75, Wiki PPL 6.16): Adding the orthogonal transformation $R_{dn}$ (likely the FFN down-projection rotation or a similar intra-block rotation) provides the next largest jump in performance. This indicates that local rotations within critical layers are also highly effective.

  • +$S_{u|d}$ (Zero-Shot9 61.79, Wiki PPL 6.00): This represents the scaling transformations for the FFN up/down-projections. It offers a slight improvement over +$R_{dn}$, confirming that scaling helps balance variances across channels after rotation.

  • +$R_{qk}$ (Zero-Shot9 62.35, Wiki PPL 5.92): The Hadamard transformation ($R_{qk}$, the Query/Key rotation) applied after ROPE further improves accuracy.

  • +$S_{qk}$ (Zero-Shot9 62.56, Wiki PPL 5.92): The scaling transformation $S_{qk}$ for Query/Key also contributes.

  • +$R_{ov}$ (Zero-Shot9 63.11, Wiki PPL 5.94): The orthogonal transformation for the Value/Output projections in MHSA gives a further boost (62.56 → 63.11).

  • +$S_{ov}$ (Zero-Shot9 63.18, Wiki PPL 5.91): The scaling transformation for the Value/Output projections provides the final marginal improvement.

    Conclusion: The results demonstrate that orthogonal transformations (especially the global $R_{res}$ and the local $R_{dn}$ and $R_{ov}$) provide the most substantial gains by reshaping the distributions. Scaling transformations ($S_{res}$, $S_{u|d}$, $S_{qk}$, $S_{ov}$) then build on these by finely balancing variance across channels, minimizing residual quantization losses and maximizing QSUR. The cumulative effect of all these learnable transformations is critical for OSTQuant's strong performance. (A toy numerical check of the underlying weight-fusion equivalence follows below.)
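To make the equivalence-pair idea concrete, here is a small self-contained check (my own toy example, not code from the paper): an orthogonal rotation combined with a positive per-channel scaling applied to activations can be compensated by folding the inverse transform into the following weight, so the layer output is numerically unchanged while the intermediate distribution that gets quantized is reshaped.

```python
import torch

torch.manual_seed(0)
d_in, d_out = 64, 32
X = torch.randn(8, d_in)                           # activations entering a linear layer
W = torch.randn(d_in, d_out)                       # that layer's weight

R, _ = torch.linalg.qr(torch.randn(d_in, d_in))    # random orthogonal matrix
s = torch.rand(d_in) + 0.5                         # positive per-channel scales

X_t = (X @ R) * s                                  # transformed activations (quantization target)
W_t = torch.diag(1.0 / s) @ R.T @ W                # compensated ("fused") weight

print(torch.allclose(X @ W, X_t @ W_t, atol=1e-4)) # True: the layer output is preserved
```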

6.3.2. Different Manifold Optimizers (Table 5)

The paper compares different Riemannian optimizers for the unit orthogonal matrices (which reside on a Stiefel manifold) on LLaMA-2-7B and LLaMA-2-13B under W4A4KV4 configuration.

  • Cayley SGD (Stochastic Gradient Descent): Requires a relatively high learning rate (LR1 1.50, LR2 0.20) and achieves good results (63.11 for 7B, 64.77 for 13B) within 150-200 steps.

  • Riemann SGD: Needs more iterations (500 steps) and lower learning rates (LR1 0.10, LR2 0.02) to reach comparable performance (63.09 for 7B, 65.19 for 13B).

  • Riemann Adam: Delivers the best results (63.18 for 7B, 65.41 for 13B) with the fewest iterations (150 steps) and lower learning rates (LR1 0.02, LR2 1e-3 or 0.002). This highlights RiemannAdam's efficiency and effectiveness in optimizing on Stiefel manifolds.

    Learning Rate Ratio: A key finding is that setting the learning rate for the Stiefel-manifold parameters (the orthogonal matrices, LR1 in Table 5) roughly 10 times higher than that for the scaling parameters (LR2) leads to better results, consistent with the best values reported in Table 5. This suggests that the orthogonal transformations benefit from larger optimization steps, as sketched below.
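As an illustration of this setup, the sketch below uses the open-source geoopt library (my choice for the example; the paper only names RiemannAdam from Bécigneul & Ganea, 2018) to jointly optimize an orthogonal matrix on the Stiefel manifold and a Euclidean scaling vector, with the roughly 10× learning-rate gap discussed above. The hidden size, data, and loss are toy placeholders, not the paper's objective.

```python
import torch
import geoopt

d = 128                                                          # hypothetical hidden size
manifold = geoopt.Stiefel()                                      # square orthogonal matrices
R = geoopt.ManifoldParameter(torch.eye(d), manifold=manifold)    # learnable rotation
log_s = torch.nn.Parameter(torch.zeros(d))                       # learnable per-channel scale (log-domain)

opt = geoopt.optim.RiemannianAdam([
    {"params": [R], "lr": 2e-2},                                 # Stiefel-manifold parameters
    {"params": [log_s], "lr": 2e-3},                             # Euclidean scaling parameters (~10x smaller)
])

x = torch.randn(8, d)                                            # toy calibration activations
for _ in range(150):                                             # 150 iterations, as in the paper
    opt.zero_grad()
    y = (x @ R) * torch.exp(log_s)                               # transformed activations
    loss = y.abs().amax(dim=-1).mean()                           # toy surrogate for quantization range
    loss.backward()
    opt.step()                                                   # retraction keeps R orthogonal
```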

6.3.3. Influence of $k$ in KL-Top Loss (Table 6)

The parameter $k$ in the KL-Top loss (Equation 12) determines the number of top-probability classes considered in the KL-divergence calculation. The ablation study explores the impact of different $k$ values on LLaMA-2 7B's Zero-Shot9 score and Wiki PPL under the W3-only and W4A4KV4 settings; a hedged sketch of this loss appears after the list below.

  • Balanced $k$ is Key: Both excessively small ($k=5$, $k=50$) and excessively large ($k=5000$, $k=10000$) values of $k$ hurt optimization.
    • Small $k$: May discard too much semantic information, leading to sub-optimal optimization.
    • Large $k$: Reintroduces noise from long-tail distributions and increases computational cost, approaching full KL divergence.
  • Optimal $k$ Value: For both the W3-only and W4A4KV4 settings, $k=1000$ consistently yields the best Zero-Shot9 scores (62.30 and 63.18, respectively), striking the right balance between retaining semantic information and suppressing noise.
  • PPL Stability: Wiki PPL is less sensitive to $k$ than the Zero-Shot scores, suggesting that zero-shot accuracy is the more discriminative indicator of quantization quality here.
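A hedged sketch of a KL-Top style loss, based only on the description above: the KL divergence is restricted to the $k$ classes that receive the highest probability under the full-precision (teacher) model, with both distributions renormalized over that subset. The renormalization and the teacher-side top-$k$ selection are my reading, not necessarily the paper's exact Equation 12.

```python
import torch
import torch.nn.functional as F

def kl_top_loss(student_logits, teacher_logits, k=1000):
    # student_logits, teacher_logits: (batch, vocab) for the positions being supervised.
    topk_idx = teacher_logits.topk(k, dim=-1).indices        # top-k classes per position
    t_sel = torch.gather(teacher_logits, -1, topk_idx)       # teacher logits on those classes
    s_sel = torch.gather(student_logits, -1, topk_idx)       # student logits on the same classes
    t_prob = F.softmax(t_sel, dim=-1)                        # renormalized teacher distribution
    s_logprob = F.log_softmax(s_sel, dim=-1)                 # renormalized student distribution
    # KL(teacher || student) over the restricted support, averaged over the batch
    return F.kl_div(s_logprob, t_prob, reduction="batchmean")
```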

6.3.4. Effect of Weight Outlier Minimization Initialization (WOMI) (Table 7, Figure 7, 13)

WOMI is designed to initialize trainable orthogonal matrices by leveraging eigenvalue decomposition of weights and Hadamard matrices.

  • Performance Improvement (Table 7): WOMI consistently achieves lower perplexity and higher zero-shot accuracy than random Hadamard initialization for LLaMA-2-7B and LLaMA-3-8B under both W4A16KV16 and W4A4KV4 configurations, confirming that it provides a better starting point for optimization.
  • Enhanced W4-only Performance: Notably, WOMI shows greater performance improvements in W4-only quantization settings compared to W4A4KV4. This suggests that WOMI is particularly effective at minimizing weight quantization errors, which is crucial when weights are the primary quantized component.
  • Visualization (Figure 7, 13): The visualizations show that original weight distributions often vary substantially across input/output channels. While Hadamard transformations reduce some of these differences, WOMI (by incorporating the covariance matrix) further smooths inter-channel disparities and suppresses outliers, yielding a more compact distribution within the quantization space and a lower relative quantization error. (A speculative sketch of such an initialization follows below.)
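Purely as a speculative illustration of the description above (the paper's exact WOMI construction may differ), one way to combine an eigen-decomposition of the weight covariance with a Hadamard rotation into an orthogonal initialization is sketched below; the covariance proxy, eigenvector ordering, and normalization are all assumptions.

```python
import torch
from scipy.linalg import hadamard

def womi_like_init(weight):
    # weight: (out_features, in_features); returns an orthogonal (in, in) matrix.
    d = weight.shape[1]                                   # must be a power of 2 for hadamard()
    cov = weight.T @ weight / weight.shape[0]             # input-channel covariance proxy
    _, eigvecs = torch.linalg.eigh(cov)                   # orthonormal eigenvector basis
    H = torch.tensor(hadamard(d), dtype=weight.dtype) / d ** 0.5   # normalized Hadamard matrix
    return eigvecs @ H                                    # product of two orthogonal matrices

R0 = womi_like_init(torch.randn(256, 128))
print(torch.allclose(R0 @ R0.T, torch.eye(128), atol=1e-4))   # orthogonality check
```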

6.3.5. Effect of KL-Top Loss with SpinQuant (Table 8)

This ablation investigates the impact of KL-Top loss when applied to SpinQuant versus OSTQuant.

  • SpinQuant with KL-Top: When KL-Top loss is introduced to SpinQuant alone, it does not lead to significant performance improvements and may even cause slight degradation (Avg. Zero-Shot for LLaMA3-8B 64.10 -> 64.07, LLaMA2-7B 62.01 -> 62.09). This suggests that KL-Top loss is most effective when coupled with powerful distribution-reshaping transformations.
  • OSTQuant with KL-Top: For OSTQuant, using Cross-Entropy (CE) loss (referred to as Origin in Table 1) results in overfitting on the calibration set, leading to higher Wiki PPL and lower Zero-Shot scores. The introduction of KL-Top loss (OSTQuant + KL-Top) alleviates this overfitting problem, leading to better generalization and improved Zero-Shot scores (e.g., LLaMA3-8B 65.13 -> 65.37, LLaMA2-7B 62.45 -> 63.11). This highlights that KL-Top loss is crucial for OSTQuant to leverage its transformation capabilities effectively on limited PTQ data.

6.4. QSUR and Evaluation Loss Dynamics During Training (Figure 10)

Figure 10 illustrates the dynamic behavior of QSUR and the evaluation loss during the training process for the LLaMA-3-8B model.

  • As the number of training iterations increases, the local and average QSURs (Quantization Space Utilization Rates) consistently increase and stabilize at a high level. This directly shows that OSTQuant's learnable transformations are effectively optimizing the data distributions to better utilize the quantization space.
  • Concurrently, the evaluation loss gradually decreases, reflecting the model's improved performance during the quantization-aware optimization.
  • The positive correlation between increasing QSUR and decreasing evaluation loss (or improving accuracy, as seen in Figure 3) empirically validates QSUR as an effective metric for guiding and assessing quantization quality. (An illustrative proxy for computing QSUR is sketched below.)
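For intuition only, here is an illustrative proxy for QSUR based on its qualitative definition (how much of the min-max quantization hypercube the data's covariance ellipsoid occupies); the paper's exact Equation 7 is not reproduced in this analysis, so the 1-sigma ellipsoid radius and constants below are assumptions.

```python
import math
import torch

def qsur_proxy(x):
    # x: (n_samples, n_channels); returns an illustrative utilization ratio.
    n, d = x.shape
    cov = torch.cov(x.T)                                     # (d, d) sample covariance
    eigvals = torch.linalg.eigvalsh(cov).clamp(min=1e-12)
    # log-volume of the 1-sigma covariance ellipsoid: pi^(d/2) / Gamma(d/2 + 1) * prod(sqrt(eigvals))
    log_ellipsoid = (0.5 * torch.log(eigvals).sum()
                     + 0.5 * d * math.log(math.pi)
                     - torch.lgamma(torch.tensor(d / 2 + 1)))
    # log-volume of the axis-aligned min-max hypercube seen by the quantizer
    edges = (x.amax(dim=0) - x.amin(dim=0)).clamp(min=1e-12)
    log_cube = torch.log(edges).sum()
    # exponentiate for modest d; for large d, compare the log ratio directly to avoid underflow
    return torch.exp(log_ellipsoid - log_cube)
```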

6.5. Activation and Weight Distribution Visualizations (Figure 11, 12, 14)

These visualizations provide qualitative evidence of OSTQuant's impact on data distributions and quantization error.

  • Activation Distributions (Figure 11, 12): These figures display the activation distributions of different layers in LLaMA-2-7B and LLaMA-3-8B before and after OSTQuant.
    • Before OSTQuant: The distributions show significant variations across different channels and often exhibit prominent outliers, leading to wide ranges and low QSUR.
    • After OSTQuant: The distributions become noticeably more uniform across channels, and outliers are substantially mitigated. This "flattening" and "centering" of distributions allows for a more efficient mapping to discrete quantization levels, reducing the relative quantization error.
  • Weight Distributions and WOMI (Figure 13): This figure compares the weight distribution and relative L1 error for LLaMA-2-7B when quantized with original, Hadamard transformed, and WOMI transformed weights.
    • Original: Shows large variations and outliers.
    • Hadamard Transformed: Reduces some inter-channel differences, but some spikes or extreme values might remain.
    • WOMI Transformed: Further smooths these inter-channel differences and reduces outliers more effectively, leading to a smaller relative L1 error. This demonstrates WOMI's role in providing a better initial distribution for weight quantization.
  • Comparison of Activation Quantization Errors (Figure 14): This figure compares the activation distribution and relative L1 error for QuaRot, SpinQuant, and OSTQuant on LLaMA-2-7B.
    • QuaRot and SpinQuant show improvements over baselines, but OSTQuant achieves the most uniform activation distributions with the lowest relative L1 errors. This visually confirms that OSTQuant's combined orthogonal and scaling transformations are superior in making activations quantization-friendly by effectively managing outliers and variances, resulting in less information loss during per-token quantization.

7. Conclusion & Reflections

7.1. Conclusion Summary

In this paper, the authors introduce OSTQuant, a novel post-training quantization (PTQ) method aimed at significantly improving the efficiency of Large Language Models (LLMs). The central innovation is the Quantization Space Utilization Rate (QSUR), a new metric that quantitatively assesses the quantizability of transformed data by measuring how effectively it occupies and utilizes the available quantization space. This metric, supported by rigorous mathematical derivations on the effects and limitations of various linear transformations, provides a principled guide for optimizing data distributions.

Leveraging these insights, OSTQuant employs learnable equivalent transformation pairs, consisting of both orthogonal and scaling transformations, to optimize the distributions of weights and activations across the entire quantization space. This combined approach effectively addresses outliers and inter-channel disparities, leading to more quantization-friendly distributions. Furthermore, to overcome the challenges of limited calibration data inherent in PTQ, the KL-Top loss function is proposed. This loss function mitigates optimization noise while retaining richer semantic information by focusing the KL divergence on only the top-$k$ highest-probability logits.

Extensive experiments across various LLMs (LLaMA-1, -2, -3) and benchmarks demonstrate OSTQuant's superior performance. It retains over 99.5% of the floating-point accuracy in W4-only settings and reduces the performance gap by 32% on LLaMA-3-8B in the more challenging W4A4KV4 configuration compared to state-of-the-art methods. The method also delivers significant inference speedups (over 2×) and memory savings (over 3.5×), while accelerating the quantization process itself by up to 5.3×. These results underscore the effectiveness of OSTQuant's principled approach to optimizing data distributions for practical and efficient LLM deployment.

7.2. Limitations & Future Work

The authors identify clear directions for future research, primarily focusing on extending OSTQuant to fully-quantized LLMs.

  • Extension to Full Quantization: The current OSTQuant framework, while effective, is primarily presented for traditional PTQ where some components (like KV cache or certain activations) might still be in higher precision. The future work section (Appendix A.5.1) outlines an extension for full quantization, where all activations within each Transformer Block are quantized to low bits. This would further reduce memory transfer overhead and fully leverage low-precision computational units.
    • New Transformation Pairs: This extension proposes introducing more equivalence transformations around ROPE (Rotary Positional Encoding) and SiLU (Sigmoid Linear Unit) operations, which are critical components in modern LLMs.
      • ROPE Handling: Treating ROPE as a lightweight GEMM layer and constructing a weight matrix based on its principle allows for pre-ROPE and post-ROPE transformation pairs. This involves scaling transformations ($S_q^i$, $S_k^i$, $S_{post}^i$) and orthogonal transformations ($R_{pre}^i$, $R_{post}^i$) to optimize query and key distributions before and after ROPE and the attention computation.
      • SiLU Smoothing: Decomposing $\mathrm{SiLU}(X) = X \cdot \sigma(X)$ allows for equivalence transformations, e.g., $X_1 \cdot X_2 \cdot \sigma(X_1) = (X_1 \cdot S) \cdot (X_2 \cdot \frac{1}{S}) \cdot \sigma\big((X_1 \cdot S) \cdot \frac{1}{S}\big)$, to alleviate inter-channel discrepancies of activations before and after SiLU using scaling matrices such as $S_{u|g}$. The authors plan to conduct experiments in the full-quantization domain to explore the full potential of OSTQuant. (A small numerical check of this identity is sketched below.)
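A quick numerical check of the stated SiLU identity (my own toy example with hypothetical shapes): fusing a positive per-channel scale $S$ into the gate branch and $1/S$ into the up branch leaves the SwiGLU product unchanged.

```python
import torch

torch.manual_seed(0)
x1 = torch.randn(4, 16)            # gate-branch activations
x2 = torch.randn(4, 16)            # up-branch activations
s = torch.rand(16) + 0.5           # positive per-channel scales

lhs = x1 * x2 * torch.sigmoid(x1)                                # original SwiGLU product
rhs = (x1 * s) * (x2 / s) * torch.sigmoid((x1 * s) * (1.0 / s))  # scaled-and-compensated form
print(torch.allclose(lhs, rhs, atol=1e-6))                       # True up to floating-point error
```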

7.3. Personal Insights & Critique

This paper presents a rigorous and well-motivated approach to LLM quantization. Here are some personal insights and critiques:

  • Strengths and Innovations:

    • Principled Metric (QSUR): The introduction of QSUR is a significant contribution. It moves beyond heuristic observation to provide a quantitative, theoretically-backed metric to guide quantization optimization. This allows for a more direct and transparent evaluation of transformation effectiveness. The correlation between QSUR and accuracy (Figure 3) strongly supports its utility.
    • Holistic Transformation Strategy: The combination of learnable orthogonal and scaling transformations is powerful. Orthogonal transformations excel at reshaping distributions to be more ball-shaped and outlier-resistant, while scaling transformations fine-tune inter-channel variances. This dual approach provides a more comprehensive solution than methods relying solely on one type of linear transformation.
    • Addressing PTQ Constraints: The KL-Top loss function is a clever solution to the problem of limited calibration data and large vocabulary sizes in LLMs. It effectively balances the need for semantic preservation with the practical challenges of PTQ optimization, preventing overfitting and noise.
    • Efficiency: The method not only achieves high accuracy but also demonstrates impressive speedups in inference and memory savings, making LLMs more deployable. The fast training time for quantization is also a major practical advantage.
    • Equivalence Preservation: The design of equivalent transformation pairs that are fused into weights ensures no additional runtime overhead, which is crucial for real-world deployment.
  • Potential Issues, Unverified Assumptions, or Areas for Improvement:

    • QSUR Formula Simplifications: While the QSUR derivation provides a solid theoretical basis, some simplifications (e.g., neglecting the mean vector $\boldsymbol{\mu}$, assuming $\lambda_{\mathrm{max}} = \lambda_{\mathrm{min}} = \lambda_1$) are made to arrive at a tractable form (Equation 7). The practical impact of these simplifications on the optimality of the derived transformation (Equation 8) could be further explored. Do real LLM distributions always conform well enough to these assumptions for the derived optimum to hold globally?
    • Complexity of Riemannian Optimization: While RiemannAdam is effective, optimizing on Stiefel manifolds is inherently more complex than standard Euclidean optimization. This might pose a barrier for practitioners less familiar with Riemannian geometry or require specialized libraries. The ease of integration into common ML frameworks (like PyTorch) is good, but understanding the underlying mechanics remains a learning curve.
    • Sensitivity to $k$ in KL-Top: While an optimal $k=1000$ is identified for the tested models, this value might be sensitive to specific LLM architectures, vocabulary sizes, or downstream tasks. A more adaptive or data-driven way to select $k$ could enhance robustness.
    • Generalizability of WOMI: WOMI relies on the assumption that weights follow a Gaussian distribution with zero mean. While often true for typical weight initializations, deviations could affect its optimality.
    • Full Quantization Details: The future work section on full quantization is promising, but the mathematical details of how the ROPE and SiLU transformations are maintained as equivalent transformation pairs (similar to Equation 9) could be further elaborated in a subsequent paper. The decomposition of SiLU into a product and its transformation for inter-channel discrepancies (Equation 31) is an intriguing idea, and its empirical validation would be valuable.
  • Transferability and Future Value:

    • The QSUR metric is highly transferable and could be adopted by other quantization methods as a standard evaluation and optimization target, fostering more principled research.

    • The combined orthogonal and scaling transformation strategy is broadly applicable to any neural network layer that performs matrix multiplication, not just LLMs.

    • The insights from KL-Top loss are relevant for any PTQ scenario involving large output spaces (e.g., in sparse expert models) and limited calibration data.

    • The paper's direction towards full quantization is critical, as this is the ultimate goal for maximizing hardware efficiency. OSTQuant seems well-positioned to contribute significantly to this challenging area.

      Overall, OSTQuant represents a significant step forward in LLM quantization, offering a robust, theoretically grounded, and highly effective solution that addresses key challenges in LLM compression and acceleration.
