
OSTQuant: Refining Large Language Model Quantization with Orthogonal and Scaling Transformations for Better Distribution Fitting

Published: 01/23/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The OSTQuant method optimizes large language model quantization using orthogonal and scaling transformations, addressing uneven and heavy-tailed data distributions. The proposed Quantization Space Utilization Rate (QSUR) metric effectively assesses quantizability, while the KL-Top loss function mitigates optimization noise and preserves semantic information under the limited calibration data of PTQ.

Abstract

Post-training quantization (PTQ) has emerged as a widely adopted technique for compressing and accelerating Large Language Models (LLMs). The major challenge in LLM quantization is that uneven and heavy-tailed data distributions can expand the quantization range, thereby reducing bit precision for most values. Recent methods attempt to eliminate outliers and balance inter-channel differences by employing linear transformations; however, they remain heuristic and often overlook optimizing the data distribution across the entire quantization space. In this paper, we introduce the Quantization Space Utilization Rate (QSUR), a novel metric that effectively assesses the quantizability of transformed data by measuring the space utilization of the data in the quantization space. We complement QSUR with mathematical derivations that examine the effects and limitations of various transformations, guiding our development of Orthogonal and Scaling Transformation-based Quantization (OSTQuant). OSTQuant employs a learnable equivalent transformation, consisting of an orthogonal transformation and a scaling transformation, to optimize the distributions of weights and activations across the entire quantization space. Furthermore, we propose the KL-Top loss function, designed to mitigate noise during optimization while retaining richer semantic information within the limited calibration data imposed by PTQ. OSTQuant outperforms existing work on various LLMs and benchmarks. In the W4-only setting, it retains 99.5% of the floating-point accuracy. In the more challenging W4A4KV4 configuration, OSTQuant reduces the performance gap by 32% on the LLaMA-3-8B model compared to state-of-the-art methods. Code: https://github.com/BrotherHappy/OSTQuant

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

OSTQuant: Refining Large Language Model Quantization with Orthogonal and Scaling Transformations for Better Distribution Fitting

1.2. Authors

Xing Hu, Yuan Cheng, Dawei Yang, Zukang Xu, Zhihang Yuan, Jiangyong Yu, Chen Xu, Zhe Jiang, Sifan Zhou. Affiliations include Houmo AI, Nanjing University, and Southeast University.

1.3. Journal/Conference

This paper is a preprint published on arXiv. It is currently under review or has not yet been officially published in a peer-reviewed journal or conference proceeding. arXiv is a reputable open-access archive for preprints of scientific papers, particularly in physics, mathematics, computer science, and related disciplines, widely used by the academic community to disseminate research findings prior to formal publication.

1.4. Publication Year

2025

1.5. Abstract

Post-training quantization (PTQ) is a critical technique for compressing and accelerating Large Language Models (LLMs). A significant challenge in LLM quantization arises from uneven and heavy-tailed data distributions, which expand the quantization range and consequently reduce bit precision for most values. Existing methods often employ heuristic linear transformations to address outliers and inter-channel differences but frequently overlook optimizing the data distribution across the entire quantization space.

This paper introduces Quantization Space Utilization Rate (QSUR), a novel metric designed to assess the quantizability of transformed data by measuring its space utilization within the quantization space. Complementing QSUR with mathematical derivations, the authors examine the effects and limitations of various transformations. This theoretical foundation guides the development of Orthogonal and Scaling Transformation-based Quantization (OSTQuant). OSTQuant employs learnable equivalent transformations, comprising both orthogonal and scaling components, to optimize weight and activation distributions across the entire quantization space. Additionally, the paper proposes KL-Top loss function, which mitigates noise during optimization while preserving richer semantic information, particularly valuable given the limited calibration data typical in PTQ.

OSTQuant demonstrates superior performance over existing methods across various LLMs and benchmarks. In a W4-only setting, it retains 99.5% of the floating-point accuracy. For the more challenging W4A4KV4 configuration, OSTQuant reduces the performance gap by 32% on the LLaMA-3-8B model compared to state-of-the-art methods.

https://arxiv.org/abs/2501.13987v1 (This is a preprint on arXiv.) PDF Link: https://arxiv.org/pdf/2501.13987v1.pdf

2. Executive Summary

2.1. Background & Motivation

The rapid advancement of Large Language Models (LLMs) has led to their widespread adoption, but their substantial memory and computational requirements pose significant deployment challenges. These challenges exist not only for resource-constrained edge devices but also for powerful cloud servers, limiting their practical applicability.

Post-training quantization (PTQ) is a widely adopted technique to compress and accelerate LLMs by reducing the bit-width of weights and activations after the model has been trained. However, a major hurdle in LLM quantization is the presence of uneven and heavy-tailed data distributions in weights and activations. These distributions mean that a few extreme values (outliers) dictate a large quantization range. Consequently, the limited number of available quantization levels must be stretched over this wide range, leading to reduced precision for the majority of values that fall within the narrower, central part of the distribution. This ultimately results in significant quantization errors and a drop in model performance.

Prior research has attempted to address these issues using linear transformations, such as smooth-based methods (e.g., SmoothQuant) that redistribute quantization difficulty between weights and activations, and rotation-based methods (e.g., QuaRot, SpinQuant) that aim to suppress outliers. While these methods have shown some effectiveness in specific regions of the quantization space, they are often heuristic and fail to holistically optimize the data distribution across the entire quantization space. This leads to sub-optimal utilization of the available quantization levels and leaves a notable gap in ensuring robust performance at very low bit-widths. The paper identifies this gap as the lack of a quantitative metric to assess quantizability and the absence of a method that globally optimizes data distribution for better fit within the quantization space.

2.2. Main Contributions / Findings

The paper introduces several primary contributions to address the challenges in LLM quantization:

  • Quantization Space Utilization Rate (QSUR): The authors propose QSUR, a novel and effective metric to quantitatively assess the quantizability of transformed data. QSUR measures how effectively data utilizes the available quantization space. Mathematical derivations are provided to analyze the effects and limitations of various linear transformations on QSUR, establishing a theoretical foundation for designing more effective transformations. Experiments demonstrate a positive correlation between higher QSUR values and improved quantization accuracy.
  • Orthogonal and Scaling Transformation-based Quantization (OSTQuant): Based on the insights from QSUR, OSTQuant is proposed. It employs learnable equivalent transformation pairs, each consisting of an orthogonal transformation and a scaling transformation, assigned to each fully connected (FC) layer in LLMs. These transformations are optimized to globally reshape the distributions of weights and activations across the entire quantization space, making them more quantization-friendly. Critically, these transformation pairs and their inverses are fused into their respective FC layers after optimization, preserving the original computational graph and incurring no additional computational overhead during inference. The method also introduces Weight Outlier Minimization Initialization (WOMI) for effective initialization of orthogonal matrices.
  • KL-Top Loss Function: Recognizing that PTQ typically uses small calibration datasets (e.g., 1000 samples), which can lead to overfitting with standard cross-entropy loss or noise with full Kullback-Leibler divergence on large vocabularies, the KL-Top loss function is introduced. This loss function computes KL divergence only over the top $k$ highest-probability logits from the full-precision model. This approach aims to capture more nuanced semantic information and provide clearer, more informative gradients while mitigating noise from long-tail distributions and reducing computational costs.
  • Robust Performance and Efficiency: OSTQuant consistently outperforms existing state-of-the-art PTQ methods across various LLMs (LLaMA-1, LLaMA-2, LLaMA-3) and benchmarks.
    • In the W4-only (4-bit weights, 16-bit activations/KV cache) setting, it retains over 99.5% of the floating-point accuracy.

    • In the more aggressive W4A4KV4 (4-bit weights, 4-bit activations, 4-bit KV cache) configuration, it reduces the performance gap by 32% on the LLaMA-3-8B model compared to state-of-the-art methods, maintaining at least 96% of the original model's performance.

    • It achieves significant inference speedups (over $2\times$ on average, up to $3\times$ for LLaMA-30B) and memory savings (exceeding $3.5\times$).

    • The optimization process is highly efficient, quantizing LLaMA-7B in about 20 minutes, up to $5.3\times$ faster than block-reconstruction methods like OmniQuant.

      These contributions collectively provide a more theoretically grounded and empirically effective approach to LLM quantization, making these powerful models more accessible and efficient for diverse deployment scenarios.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand OSTQuant, a foundational grasp of Large Language Models (LLMs), quantization principles, and linear algebra concepts is essential.

  • Large Language Models (LLMs): These are advanced artificial intelligence models, typically based on the transformer architecture, that are trained on vast amounts of text data to understand, generate, and process human language. They excel at tasks like translation, summarization, question-answering, and creative writing. Examples include OpenAI's GPT series, Google's Bard/Gemini, and Meta's LLaMA series. Their immense size (billions of parameters) makes them computationally intensive and memory-hungry, necessitating compression techniques for practical deployment.

  • Post-Training Quantization (PTQ): This is a model compression technique applied after a neural network has been fully trained (post-training). The goal is to reduce the numerical precision (bit-width) of the model's parameters (weights) and intermediate computations (activations) from high-precision floating-point numbers (e.g., FP32 or FP16) to lower-precision integers (e.g., INT8 or INT4).

    • Benefits: PTQ significantly reduces memory footprint (allowing larger models on smaller devices or more models on a single device) and accelerates inference speed (as low-bit integer operations are faster and more energy-efficient than floating-point operations).
    • Challenge: The main challenge is to perform this reduction without significant loss in model accuracy, which often happens due to quantization errors.
  • Quantization Range & Outliers: In quantization, a continuous range of floating-point values is mapped to a discrete set of integer values. This mapping typically involves defining a quantization range (e.g., min to max value in a tensor) and then dividing this range into $2^N$ discrete steps, where $N$ is the bit-width.

    • Outliers are a few data points that are significantly different from the majority of the data. In LLMs, weights and activations often exhibit heavy-tailed distributions, meaning there are a few extreme outlier values far from the mean.
    • If these outliers are included in the quantization range, they force the min and max values to be very far apart. This widens the overall quantization range, making each quantization step larger. A larger step size means less precision for the vast majority of values clustered near the center of the distribution, leading to substantial quantization error and accuracy degradation.
  • Data Distributions (Uneven, Heavy-tailed):

    • Uneven Distribution: Refers to data where values are not uniformly spread or symmetrically centered. In multidimensional data (like activation tensors, where each channel might have its own distribution), unevenness can also refer to significant disparities in variance or mean across different channels.
    • Heavy-tailed Distribution: A probability distribution where the tails (the extreme ends) are "heavier" than those of a normal (Gaussian) distribution. This means outliers occur more frequently or with larger magnitudes than expected in a Gaussian distribution. This characteristic is common in LLM activations, making quantization difficult.
  • Linear Transformations: In linear algebra, a linear transformation is a function that maps vectors from one vector space to another, preserving vector addition and scalar multiplication. In neural networks, this typically involves multiplying a data vector or matrix by another matrix.

    • Linear transformations can rotate, scale, shear, or project data. The paper focuses on orthogonal transformations (rotations/reflections) and scaling transformations.
  • Orthogonal Matrices & Stiefel Manifold:

    • An orthogonal matrix $\pmb{O}$ is a square matrix whose columns and rows are orthonormal vectors. This means $\pmb{O}^\top \pmb{O} = \pmb{O} \pmb{O}^\top = \pmb{I}$, where $\pmb{I}$ is the identity matrix and $\pmb{O}^\top$ is the transpose of $\pmb{O}$.
    • Orthogonal transformations preserve vector lengths and angles, meaning they are norm-preserving. They primarily rotate or reflect data without distorting its shape or relative distances. This property is crucial for quantization as it allows reshaping distributions without introducing unwanted scaling biases.
    • The Stiefel manifold $V_k(\mathbb{R}^n)$ is the set of all orthonormal $k$-frames in $\mathbb{R}^n$. In the context of orthogonal matrices, the Stiefel manifold represents the space of all possible orthogonal matrices. Optimizing parameters that must remain orthogonal requires specialized Riemannian optimization techniques that respect the geometric constraints of this manifold.
  • Diagonal Matrices & Scaling Transformations:

    • A diagonal matrix $\pmb{\Lambda}$ is a square matrix where all entries outside the main diagonal are zero.
    • A scaling transformation applied by a diagonal matrix scales each dimension of the input data independently. If the diagonal entries are reciprocals, it acts as an inverse scaling. This is used to balance variances across different channels, making distributions more uniform.
  • Kullback-Leibler (KL) Divergence: Also known as relative entropy, KL divergence is a non-symmetric measure of the difference between two probability distributions, $P$ and $Q$. It quantifies how much information is lost when $Q$ is used to approximate $P$.

    • Formula: $D_{KL}(P \| Q) = \sum_i P(i) \log \left(\frac{P(i)}{Q(i)}\right)$
    • In quantization, KL divergence is often used as a loss function to ensure that the output distribution of the quantized model ($\hat{y}$) remains as close as possible to that of the full-precision model ($y$).
  • Cross-Entropy Loss (CE Loss): A common loss function in classification tasks that measures the performance of a classification model whose output is a probability value between 0 and 1. It increases as the predicted probability diverges from the actual label.

    • Formula for single-class classification: $L_{CE} = -\sum_c y_c \log(\hat{y}_c)$, where $y_c$ is 1 if class $c$ is the true class and 0 otherwise, and $\hat{y}_c$ is the predicted probability for class $c$.
    • In LLMs, CE loss is typically used for next-token prediction. However, when training on small calibration datasets during PTQ, focusing solely on the single correct label might lead to overfitting and a loss of the model's broader semantic capabilities. A toy numeric example of both losses follows this list.
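To make the two losses concrete, here is a minimal NumPy sketch (toy logits, purely illustrative) that computes cross-entropy against a single label and KL divergence between a full-precision and a quantized model's next-token distributions:

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Toy next-token distributions over a 5-token vocabulary (illustrative values only).
p_full = softmax(np.array([2.0, 1.0, 0.2, -1.0, -2.0]))   # full-precision model
q_quant = softmax(np.array([1.6, 1.2, 0.1, -0.8, -2.1]))  # quantized model

# Cross-entropy against a single "correct" label (assume it is token 0).
ce = -np.log(q_quant[0])

# KL divergence D_KL(P || Q) between the two full distributions.
kl = np.sum(p_full * np.log(p_full / q_quant))

print(f"CE loss: {ce:.4f}, KL divergence: {kl:.4f}")
```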

3.2. Previous Works

The paper contextualizes OSTQuant within the landscape of Post-Training Quantization (PTQ) for LLMs, broadly categorizing existing methods into weight-only and weight-activation quantization.

  • Weight-Only Quantization: Aims to reduce memory by quantizing only the model weights, while activations remain in higher precision (e.g., FP16). This provides memory savings but less inference speedup compared to quantizing activations.

    • GPTQ (Frantar et al., 2022): Utilizes Hessian-based error compensation to minimize quantization errors layer-by-layer, achieving high compression rates. It specifically quantizes weights greedily by selecting optimal quantization values for each weight given the errors in previously quantized weights.
    • AWQ (Lin et al., 2023) and OWQ (Lee et al., 2023): These methods focus on the observation that activation outliers significantly impact the performance of weight quantization. They aim to protect salient weights or channels from aggressive quantization error by scaling weights or identifying specific channels that are robust to quantization.
    • QuIP (Chee et al., 2023) and QuIP# (Tseng et al., 2024): These approaches use random Hadamard matrices for incoherent processing of weights (making them more uniformly distributed) and apply vector quantization to weights, leading to better performance at very low precisions. Hadamard transformations are a type of orthogonal transformation.
  • Weight-Activation Quantization: Aims to quantize both weights and activations (including the Key-Value (KV) cache in transformers) to achieve maximum memory savings and inference speedup. This is more challenging due to the dynamic nature and wide distributions of activations.

    • ZeroQuant (Yao et al., 2022): Proposes a fine-grained, hardware-friendly quantization scheme for both weights and activations, often utilizing per-channel and per-token quantization strategies.
    • SmoothQuant (Xiao et al., 2022): A pioneering smooth-based transformation method. It addresses activation quantization difficulty by mathematically shifting some of the quantization challenge from activations to weights, which are more static and easier to quantize. This is achieved by introducing a diagonal scaling matrix that smooths out large inter-channel variances in activations.
    • OmniQuant (Shao et al., 2023): Further enhances performance by actively training both quantization parameters (scales, zero-points) and transformation coefficients through an optimization process, often using block-wise reconstruction to minimize quantization error.
    • I-LLM (Hu et al., 2024): Focuses on achieving integer-only quantization and inference, which is highly efficient for specialized hardware. It uses fully-smooth block reconstruction and fully integer operators.
    • QuaRot (Ashkboos et al., 2024): Utilizes random rotation matrices (a type of orthogonal transformation) to suppress outliers in both weights and activations, enabling 4-bit quantization. The rotations are fixed (randomly initialized Hadamard-like matrices).
    • SpinQuant (Liu et al., 2024): Extends QuaRot by learning these rotation matrices to further refine 4-bit weight-activation quantization. This moves from fixed random rotations to optimized ones.
  • Riemannian Optimization: Optimizing orthogonal matrices (which lie on a Stiefel manifold) requires specialized optimization techniques that respect their orthonormality constraints.

    • Cayley SGD (Li et al., 2020): Relies on iterative approximations of the Cayley Transform to optimize rotation matrices for arbitrary loss functions efficiently, primarily using matrix multiplication.
    • RAOM (Bécigneul & Ganea, 2018): Extends popular adaptive optimizers like ADAM (Kingma, 2014) and ADAGRAD to the realm of Riemannian optimization, making them suitable for Stiefel manifolds.
    • Geoopt (Kochurov et al., 2020): A Python library providing Riemannian optimization algorithms for various manifolds, facilitating their integration into deep learning models.

3.3. Technological Evolution

The field of LLM quantization has evolved from basic uniform quantization to sophisticated methods that adaptively handle the complex distributions of LLMs. Initially, quantization focused on simple per-tensor or per-channel scaling (RTN - Round-to-Nearest). The discovery of activation outliers in LLMs highlighted the need for more advanced techniques. Early methods like GPTQ and AWQ addressed weight quantization by mitigating outlier impact. Later, SmoothQuant introduced the idea of shifting quantization difficulty by scaling activations and inverse-scaling weights. More recent work, including QuaRot and SpinQuant, introduced orthogonal transformations (rotations) to directly reduce outliers and balance distributions in both weights and activations. OmniQuant pushed towards learnable quantization parameters and transformations.

OSTQuant builds upon this evolution by:

  1. Introducing a principled metric (QSUR) to quantitatively guide the optimization of data distributions.
  2. Combining the strengths of both orthogonal and scaling transformations in a learnable, equivalent transformation pair for global optimization.
  3. Addressing the limited calibration data challenge with KL-Top loss.

3.4. Differentiation Analysis

OSTQuant differentiates itself from previous methods primarily through its holistic approach to optimizing data distribution across the entire quantization space, guided by a novel quantitative metric, and its specific combination of learnable transformations.

  • From Heuristic to Principled:

    • Prior Methods (e.g., SmoothQuant, QuaRot, SpinQuant): These methods often operate based on empirical observations and heuristics (e.g., "smooth outliers," "rotate outliers"). While effective, they lack a unified metric to quantify "quantizability" or to rigorously guide the optimization process.
    • OSTQuant: Introduces QSUR as a theoretically grounded metric to directly measure how effectively data utilizes the quantization space. This metric provides a clear optimization target and allows for a more principled design of transformations, moving beyond pure heuristics.
  • Combined Orthogonal and Scaling Transformations for Global Optimization:

    • Smooth-based Methods (e.g., SmoothQuant): Primarily use diagonal scaling matrices to reduce inter-channel variance in activations by shifting it to weights. They are effective at balancing scale but are sensitive to outliers and uneven mean values. Their optimization is typically localized (per-layer or per-channel) and doesn't explicitly optimize the overall shape of the distribution in the quantization space.
    • Rotation-based Methods (e.g., QuaRot, SpinQuant): Focus on orthogonal transformations (rotations) to suppress outliers in both weights and activations by rotating the data. QuaRot uses fixed random rotations, while SpinQuant learns them. These are good at making distributions more ball-shaped but might not optimally balance inter-channel variances or ensure maximum utilization of the quantization range. Their scope is often limited to reducing outliers rather than globally fitting the entire distribution.
    • OSTQuant: Combines the strengths of both. It employs learnable equivalent transformation pairs consisting of both an orthogonal transformation and a scaling transformation. This allows it to:
      1. Rotate the data to align its principal components with the quantization axes and reduce outlier impact (like rotation-based methods).
      2. Scale dimensions to balance variances and mean values across channels (like smooth-based methods), ensuring a flatter, more uniform distribution. This combined approach allows for a more comprehensive optimization of the data distribution across the entire quantization space, leading to higher QSUR and better precision.
  • Equivalence Preservation and Fusion: OSTQuant explicitly designs equivalent transformation pairs that, in the absence of quantization, preserve the network's original output. This prevents overfitting to the calibration data and ensures better generalization. These transformations are then fused into the model weights during deployment, incurring no runtime overhead. While SmoothQuant also fuses scaling factors, OSTQuant extends this to orthogonal matrices as well.

  • Addressing Limited Calibration Data:

    • Prior Methods: Often rely on standard cross-entropy loss or full KL divergence for optimization. Cross-entropy can lead to overfitting on small calibration sets, while full KL divergence can be noisy due to long-tail distributions in LLM vocabularies.

    • OSTQuant: Proposes the KL-Top loss, which focuses the KL divergence computation on only the top $k$ highest-probability logits. This provides more stable and semantically rich gradients for optimization with limited data.

      In essence, OSTQuant provides a more theoretically grounded, comprehensive, and adaptively optimized solution by integrating a novel metric, a powerful combination of learnable transformations, and a robust loss function specifically designed for PTQ challenges.

4. Methodology

4.1. Principles

The core principle behind OSTQuant is to enhance the quantizability of Large Language Models (LLMs) by actively reshaping the data distributions of weights and activations to better utilize the available quantization space. This is achieved through a novel metric, Quantization Space Utilization Rate (QSUR), which quantifies the effectiveness of this utilization, and a sophisticated learnable equivalent transformation mechanism.

The theoretical intuition is that quantization error is minimized when the data distribution is as compact and uniform as possible within the fixed quantization range. Uneven and heavy-tailed distributions lead to large quantization ranges and sparse quantization levels for most values. OSTQuant hypothesizes that a combination of orthogonal transformations (rotations) and scaling transformations can effectively transform these sub-optimal distributions into more quantization-friendly ones, characterized by higher QSUR.

Orthogonal transformations (rotations) can reduce the maximum extent of outliers by distributing their magnitude across multiple dimensions, making the data more "ball-shaped." Scaling transformations (diagonal matrices) can then balance variances across different channels, making the data distribution flatter and more uniform. By learning these transformations, OSTQuant aims to dynamically adapt to the specific distributions within each layer of an LLM, thereby minimizing quantization loss across the entire network.

The process is guided by minimizing the KL-Top loss, which ensures that the quantized model's output distribution remains semantically close to the full-precision model's output, even with limited calibration data, preventing overfitting and preserving performance.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Quantization Notations

The paper first defines the standard uniform quantization and dequantization process. This process maps a floating-point number to a discrete integer representation and then approximates the original floating-point number from that integer.

Given a floating-point tensor $\boldsymbol{X}$, the quantization function $\mathcal{Q}$ converts it into its quantized integer counterpart $X^I$. The dequantization process then approximates the original value as $X'$. The quantization process is defined by: $ X^I = \mathrm{clamp}\left(\left\lfloor \frac{X}{s} \right\rceil + zp^I, 0, 2^{n^I}-1\right) $ where:

  • $X$: The original floating-point tensor.

  • $X^I$: The quantized integer tensor, with values clamped between 0 and $2^{n^I}-1$.

  • $n^I$: The number of bits for quantization (e.g., 4 bits for INT4).

  • $s$: The quantization step size (also known as scale), which determines the width of each discrete quantization level.

  • $zp^I$: The zero-point, an integer value that represents the floating-point value 0 in the quantized integer range.

  • $\lfloor \cdot \rceil$: The rounding function (typically round-to-nearest).

  • $\mathrm{clamp}(\cdot, \mathrm{min}, \mathrm{max})$: A function that clips values to be within the specified min and max range.

    The quantization step size $s$ and zero-point $zp^I$ are calculated from the minimum ($x_{\mathrm{min}}$) and maximum ($x_{\mathrm{max}}$) values observed in the floating-point tensor $\boldsymbol{X}$: $ s = \frac{x_{\mathrm{max}} - x_{\mathrm{min}}}{2^{n^I}-1} $ $ zp^I = \left\lfloor \frac{-x_{\mathrm{min}}}{s} \right\rceil $ After quantization, the dequantized floating-point approximation $\boldsymbol{X}'$ is computed as: $ \boldsymbol{X}' = (\boldsymbol{X}^I - zp^I) \cdot s $ This dequantization formula maps the integer back to a floating-point value within the original approximate range. The choice of $s$ (scale) and $zp^I$ (zero-point) is crucial for quantization accuracy. These can be determined statically from a few calibration samples (static quantization) or dynamically from runtime statistics (dynamic quantization). Quantization can also be applied per-channel or per-token, referring to the granularity at which $s$ and $zp^I$ are computed.
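To make the notation above concrete, the following is a minimal NumPy sketch of asymmetric uniform quantization and dequantization following the formulas above; the tensor values are illustrative and chosen to show how a single outlier inflates the step size:

```python
import numpy as np

def quantize(x: np.ndarray, n_bits: int = 4):
    """Asymmetric uniform quantization: returns integer tensor, scale, zero-point."""
    x_min, x_max = x.min(), x.max()
    qmax = 2 ** n_bits - 1
    s = (x_max - x_min) / qmax                 # step size (scale)
    zp = int(np.round(-x_min / s))             # zero-point
    x_int = np.clip(np.round(x / s) + zp, 0, qmax).astype(np.int32)
    return x_int, s, zp

def dequantize(x_int: np.ndarray, s: float, zp: int) -> np.ndarray:
    """Map integers back to approximate floating-point values."""
    return (x_int - zp) * s

# A heavy-tailed toy tensor: one outlier stretches the quantization range.
x = np.array([0.01, -0.02, 0.03, 0.05, -0.04, 8.0])
x_int, s, zp = quantize(x, n_bits=4)
x_rec = dequantize(x_int, s, zp)
print("step size:", s)                          # large step caused by the outlier
print("per-value error:", np.abs(x - x_rec))    # most small values lose precision
```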

4.2.2. Quantization Space Utilization Rate (QSUR)

To quantitatively evaluate the quantizability of data and the effectiveness of transformations, the paper introduces QSUR.

Definition 1: Given a set of $d$-dimensional data $\boldsymbol{X} \in \mathbb{R}^{n \times d}$, let $V_X$ denote the hypervolume occupied by $\boldsymbol{X}$, and $V_{S_X}$ denote the hypervolume of the quantization space $S_X$ corresponding to $\boldsymbol{X}$. The quantization space $S_X$ is modeled as a hypercube whose edge lengths are defined by the maximum quantization range across all dimensions of $\boldsymbol{X}$. The Quantization Space Utilization Rate of $\boldsymbol{X}$ is then defined as: $ \mathrm{QSUR}_X = \frac{V_X}{V_{S_X}} \quad (1) $ A higher QSUR indicates that the data effectively fills the allocated quantization space, suggesting better quantizability.

For data typically following a Gaussian distribution ($X \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$), $V_X$ is calculated based on the ellipsoid formed by the covariance matrix $\boldsymbol{\Sigma}$ and mean vector $\boldsymbol{\mu}$. The covariance matrix can be diagonalized via eigenvalue decomposition: $\boldsymbol{\Sigma} = \boldsymbol{Q} \boldsymbol{\Lambda} \boldsymbol{Q}^\top$, where $\boldsymbol{Q}$ is a unit orthogonal matrix of eigenvectors, and $\boldsymbol{\Lambda} = \mathrm{diag}(\lambda_1, \lambda_2, ..., \lambda_d)$ contains the eigenvalues (representing variance along the principal axes) in descending order.

The hypervolume of this ellipsoid at a given confidence level $\alpha$ (e.g., $\alpha = 0.99$) is given by: $ V_X = \frac{\pi^{d/2}}{\Gamma(d/2+1)} \times \left(\chi_d^2(\alpha)\right)^{d/2} \times \sqrt{\operatorname*{det}(\boldsymbol{\Sigma})} \quad (2) $ where:

  • $\Gamma$: The Gamma function.

  • $\chi_d^2(\alpha)$: The chi-squared quantile for $d$ degrees of freedom at confidence level $\alpha$. This factor scales the ellipsoid to encompass a certain percentage of the data.

  • $\operatorname*{det}(\boldsymbol{\Sigma})$: The determinant of the covariance matrix. Since $\boldsymbol{Q}$ is orthogonal, $\operatorname*{det}(\boldsymbol{\Sigma}) = \operatorname*{det}(\boldsymbol{Q} \boldsymbol{\Lambda} \boldsymbol{Q}^\top) = \operatorname*{det}(\boldsymbol{Q}) \operatorname*{det}(\boldsymbol{\Lambda}) \operatorname*{det}(\boldsymbol{Q}^\top) = \operatorname*{det}(\boldsymbol{\Lambda}) = \prod_{i=1}^{d} \lambda_i$. Thus, $\sqrt{\operatorname*{det}(\boldsymbol{\Sigma})} = \sqrt{\prod_{i=1}^{d} \lambda_i}$.

    The volume of the quantization hypercube ($V_{S_X}$) is determined by the range of the distribution along each principal axis, considering the maximum and minimum coordinate values. The extremal points of the ellipsoid correspond to these maximum and minimum values. Let $\lambda_{\mathrm{max}}$ and $\lambda_{\mathrm{min}}$ be eigenvalues, and $\pmb{q}_{\mathrm{max}}$ and $\pmb{q}_{\mathrm{min}}$ be the corresponding eigenvectors (the paper later simplifies this by considering only the largest eigenvalue). The maximum and minimum coordinate values $v_{\mathrm{max}}^{\mathrm{org}}$ and $v_{\mathrm{min}}^{\mathrm{org}}$ are given as: $ v_{\mathrm{max}}^{\mathrm{org}} = \sqrt{\chi_d^2(\alpha) \cdot \lambda_{\mathrm{max}}} \cdot \pmb{q}_{\mathrm{max}} + \boldsymbol{\mu} $ $ v_{\mathrm{min}}^{\mathrm{org}} = -\sqrt{\chi_d^2(\alpha) \cdot \lambda_{\mathrm{min}}} \cdot \pmb{q}_{\mathrm{min}} + \boldsymbol{\mu} $ The volume of the quantization hypercube is then: $ V_{S_X} = \left(\operatorname*{max}(v_{\mathrm{max}}^{\mathrm{org}}) - \operatorname*{min}(v_{\mathrm{min}}^{\mathrm{org}})\right)^d \quad (5) $ Combining these, the QSUR expression is: $ \mathrm{QSUR}_X = \frac{ \frac{\pi^{d/2}}{\Gamma(d/2+1)} \cdot \left(\chi_d^2(\alpha)\right)^{d/2} \cdot \sqrt{\operatorname*{det}(\boldsymbol{\Lambda})} }{ \left( \operatorname*{max}\left(\sqrt{\chi_d^2(\alpha) \cdot \lambda_{\mathrm{max}}} \cdot |\pmb{q}_{\mathrm{max}}| + \boldsymbol{\mu}\right) - \operatorname*{min}\left(-\sqrt{\chi_d^2(\alpha) \cdot \lambda_{\mathrm{min}}} \cdot |\pmb{q}_{\mathrm{min}}| + \boldsymbol{\mu}\right) \right)^d } \quad (6) $ Simplified QSUR: The paper further simplifies this by neglecting the mean vector $\boldsymbol{\mu}$ (assuming its magnitude is small relative to the largest eigenvalue) and assuming $\lambda_{\mathrm{max}} = \lambda_{\mathrm{min}} = \lambda_1$ (the largest eigenvalue) for the extremal points. This simplification leads to: $ \mathrm{QSUR}_X = \frac{ \frac{\pi^{d/2}}{\Gamma(d/2+1)} \cdot \sqrt{\prod_{i=1}^{d} \lambda_i} }{ 2^d \left( \operatorname*{max}(\sqrt{\lambda_1} \cdot \pmb{q}_1) \right)^d } \quad (7) $ From this simplified QSUR formula (Equation 7), two key observations are made:

  1. QSUR is proportional to the product of the ratios of each eigenvalue $\lambda_i$ to the largest eigenvalue $\lambda_1$. This implies that QSUR is maximized when all eigenvalues are similar (i.e., the distribution is more ball-shaped).

  2. The maximum component of the eigenvector $\pmb{q}_1$ corresponding to $\lambda_1$ is inversely proportional to QSUR. QSUR is minimized when this component is large, and maximized when it is small. The denominator in Equation 7 is minimized when the components of $\pmb{q}_1$ are $\pm d^{-1/2}$ (Appendix A.2.1), which corresponds to the most uniform spread of eigenvector components.

    Influence of linear transformation on QSUR: Applying a linear transformation $\pmb{T}$ to $X \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ results in a transformed distribution $\hat{X} \sim \mathcal{N}(\hat{\boldsymbol{\mu}}, \hat{\boldsymbol{\Sigma}})$, where $\hat{\boldsymbol{\mu}} = \pmb{T}\boldsymbol{\mu}$ and $\hat{\boldsymbol{\Sigma}} = \pmb{T} \boldsymbol{Q} \boldsymbol{\Lambda} \boldsymbol{Q}^\top \pmb{T}^\top$.

  • Smoothing-based approaches (e.g., SmoothQuant) treat $\pmb{T}$ as a diagonal matrix that scales variances, reducing disparities among the eigenvalues $\lambda_i$. However, they are sensitive to outliers and uneven mean values.

  • Rotation-based methods (e.g., QuaRot, SpinQuant) use orthogonal matrices to modify $\boldsymbol{Q}$, reducing outliers and the hypercube volume, thereby increasing QSUR. This ability improves with dimensionality (Appendix A.2.2). The best outlier reduction is achieved when the orthogonal matrix is $\pmb{T} = d^{-\frac{1}{2}} \pmb{H} \pmb{Q}^\top$, where $\pmb{H}$ is a Hadamard matrix.

    The Best Transformation Matrix: Combining the insights from Equation 7 and the influence of transformations, the paper deduces that the maximum QSUR is achieved when the transformation matrix is: $ \pmb{T} = c \cdot \pmb{\Lambda}^{-\frac{1}{2}} \pmb{Q}^\top \quad (8) $ where $c$ is an arbitrary scalar. This transformation effectively spheres the data, making its covariance matrix an identity matrix ($\hat{\boldsymbol{\Sigma}} = \boldsymbol{I}$, see Appendix A.2.3). At this point, the maximum QSUR is achieved: $ \mathrm{QSUR}'' = \frac{ \frac{\pi^{d/2}}{\Gamma(d/2+1)} }{ 2^d } \quad (28) $ This theoretical derivation highlights that an ideal transformation involves both scaling (by $\pmb{\Lambda}^{-\frac{1}{2}}$) to normalize variances and rotation (by $\pmb{Q}^\top$) to align with the principal axes, effectively making the data distribution isotropic (uniform in all directions).
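To build intuition for Equation (1) and the whitening transform of Equation (8), below is a small NumPy/SciPy sketch that estimates QSUR empirically for correlated Gaussian data as the ratio of the sample ellipsoid volume to the bounding hypercube volume, before and after applying $\pmb{T} = \pmb{\Lambda}^{-1/2}\pmb{Q}^\top$. This is an illustrative approximation, not the authors' implementation:

```python
import numpy as np
from scipy.stats import chi2
from scipy.special import gammaln

def qsur(x: np.ndarray, alpha: float = 0.99) -> float:
    """Empirical QSUR: ellipsoid volume (at confidence alpha) / bounding hypercube volume."""
    d = x.shape[1]
    cov = np.cov(x, rowvar=False)
    # log ellipsoid volume: pi^{d/2}/Gamma(d/2+1) * chi2_d(alpha)^{d/2} * sqrt(det(cov))
    _, logdet = np.linalg.slogdet(cov)
    log_vx = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1) \
             + (d / 2) * np.log(chi2.ppf(alpha, d)) + 0.5 * logdet
    # hypercube edge: the widest per-dimension range of the data
    edge = (x.max(axis=0) - x.min(axis=0)).max()
    return float(np.exp(log_vx - d * np.log(edge)))

rng = np.random.default_rng(0)
d = 8
A = rng.normal(size=(d, d))
x = rng.normal(size=(4096, d)) @ A            # correlated data with uneven variances

# Whitening transform T = Lambda^{-1/2} Q^T from the eigendecomposition of the covariance.
eigval, Q = np.linalg.eigh(np.cov(x, rowvar=False))
T = np.diag(eigval ** -0.5) @ Q.T
x_whitened = x @ T.T

print("QSUR before:", qsur(x))
print("QSUR after :", qsur(x_whitened))       # noticeably higher after whitening
```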

4.2.3. Orthogonal and Scaling Transformation-based Quantization (OSTQuant)

OSTQuant leverages these theoretical insights to design a practical PTQ framework. It uses learnable equivalent transformation pairs within LLM layers to optimize weight and activation distributions for improved quantization performance.

Overall Flow (Figure 5): OSTQuant applies multiple transformation pairs globally across and within LLM blocks. Four equivalent transformation pairs are learned per block, each comprising a learnable diagonal scaling matrix $\pmb{S}$ (denoted as $\pmb{\Lambda}$ in the formulas) and a learnable orthogonal matrix $\pmb{O}$ (denoted as $\pmb{R}$ in the figures, for rotation).

Equivalent Transformation Pair: A transformation pair is defined as $\pmb{T} = \pmb{\Lambda} \pmb{O}$, where $\pmb{\Lambda}$ is a diagonal scaling matrix and $\pmb{O}$ is a unit orthogonal matrix. The forward inference process for a pair of consecutive matrix multiplications (e.g., $W_1 W_2$) is reformulated as: $ \pmb{y} = \mathcal{Q}(\pmb{x} W_1 \pmb{O} \pmb{\Lambda}) \, \mathcal{Q}(\pmb{\Lambda}^{-1} \pmb{O}^\top W_2) \quad (9) $ where:

  • $\pmb{x}$: Input to the first linear layer.
  • $W_1, W_2$: Weights of two consecutive linear layers.
  • $\pmb{O}$: Learnable orthogonal matrix.
  • $\pmb{\Lambda}$: Learnable diagonal scaling matrix.
  • $\mathcal{Q}(\cdot)$: The quantization operation. The intermediate product $\pmb{O} \pmb{\Lambda}$ transforms the output of $W_1$, and its inverse $\pmb{\Lambda}^{-1} \pmb{O}^\top$ transforms the input of $W_2$. This maintains mathematical equivalence in full precision while allowing the intermediate distributions to be optimized for quantization.

Advantages of Equivalent Transformation Pair:

  1. Learnability and Computational Efficiency: Both $\pmb{O}$ and $\pmb{\Lambda}$ are learnable parameters. The inverse of a diagonal matrix $\pmb{\Lambda}$ is trivial (the reciprocal of its diagonal elements). The orthogonal matrix $\pmb{O}$ is optimized using Riemannian optimizers like RiemannAdam, leveraging first-order gradient information.

  2. Equivalence Preservation: Without quantization, the transformations cancel out ($\pmb{O} \pmb{\Lambda} \pmb{\Lambda}^{-1} \pmb{O}^\top = \pmb{I}$), ensuring the network's output is identical to that of the original model. This reduces the risk of overfitting to the small calibration set.

  3. No Runtime Overhead: After optimization, $\pmb{O}$ and $\pmb{\Lambda}$ are fused directly into the existing weight matrices, e.g., $W_1 \leftarrow W_1 \pmb{O} \pmb{\Lambda}$ and $W_2 \leftarrow \pmb{\Lambda}^{-1} \pmb{O}^\top W_2$. This means no additional computational cost or parameters are introduced during inference (see the sketch after this list).

    The overall optimization objective for the network is to minimize the loss between the quantized network output $\hat{y}$ and the full-precision output $y$, by learning the optimal scaling ($\pmb{\Lambda}_i$) and orthogonal ($\pmb{O}_i$) matrices: $ \arg \operatorname*{min}_{\pmb{\Lambda}_i, \pmb{O}_i} \mathcal{L}(\hat{y}, y; \pmb{\Lambda}_i, \pmb{O}_i, \theta) \quad (10) $ where $\theta$ represents the frozen full-precision network parameters.
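The following PyTorch sketch illustrates Equation (9) under toy shapes (hypothetical dimensions, not the released implementation): an orthogonal/scaling pair inserted between two consecutive linear layers cancels out in full precision and can be fused into the weights for deployment.

```python
import torch

torch.manual_seed(0)
d = 8
x = torch.randn(4, d)
W1, W2 = torch.randn(d, d), torch.randn(d, d)

# Learnable transformation pair: orthogonal O (here a random rotation) and diagonal scaling Lambda.
O, _ = torch.linalg.qr(torch.randn(d, d))       # orthogonal matrix
lam = torch.rand(d) + 0.5                       # positive diagonal scales
Lam, Lam_inv = torch.diag(lam), torch.diag(1.0 / lam)

# Equivalence in full precision: (x W1 O Lam)(Lam^-1 O^T W2) == x W1 W2
y_ref = x @ W1 @ W2
y_eq = (x @ W1 @ O @ Lam) @ (Lam_inv @ O.T @ W2)
assert torch.allclose(y_ref, y_eq, atol=1e-3)

# Fusion for deployment: absorb the pair into the weights, so no extra runtime ops remain.
W1_fused = W1 @ O @ Lam
W2_fused = Lam_inv @ O.T @ W2
assert torch.allclose(x @ W1_fused @ W2_fused, y_ref, atol=1e-3)
```

During PTQ the fake-quantization operator $\mathcal{Q}(\cdot)$ would be applied to the transformed tensors inside the parentheses, and only the quantized path differs from the original network.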

Weight Outlier Minimization Initialization (WOMI): Since weights often follow a Gaussian distribution with zero mean, the orthogonal matrix $\pmb{O}$ is initialized to optimally reduce weight outliers. For the global orthogonal matrix $R_{res}$, the weights of all linear layers receiving residual inputs are concatenated. Eigenvalue decomposition is performed on their covariance matrix to obtain the eigenmatrix $\boldsymbol{Q}_W$. Then, $\pmb{O}$ is initialized as $\pmb{O} = \pmb{H} \boldsymbol{Q}_W^\top$, where $\pmb{H}$ is a normalized Hadamard matrix (Equation 27 from Appendix A.2.2, $\pmb{T} = d^{-\frac{1}{2}} \pmb{H} \pmb{Q}^\top$, suggests a similar structure for the initialization). This initialization leverages the Hadamard matrix's ability to spread values evenly and the eigendecomposition's alignment with the principal components, minimizing outliers and inter-channel disparities. The scaling matrices are initialized as identity matrices.
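A minimal sketch of this initialization idea is shown below, assuming the relevant weights are concatenated along the output dimension and the hidden size is a power of two so a normalized Hadamard matrix exists; the function name and shapes are illustrative, not the authors' code:

```python
import torch
from scipy.linalg import hadamard

def womi_init(weights: list[torch.Tensor]) -> torch.Tensor:
    """Initialize an orthogonal matrix O = H_norm @ Q_W^T from concatenated weights.

    weights: list of (out_features, d) weight matrices sharing the residual input dim d.
    """
    W = torch.cat(weights, dim=0)                       # (sum_out, d)
    cov = W.T @ W / W.shape[0]                          # (d, d) covariance of weight rows (zero-mean assumption)
    _, Q_w = torch.linalg.eigh(cov)                     # eigenvectors as columns
    d = cov.shape[0]
    H_norm = torch.from_numpy(hadamard(d)).double() / d ** 0.5   # normalized Hadamard matrix
    O = H_norm @ Q_w.double().T
    # Sanity check: the product of two orthogonal matrices is orthogonal.
    assert torch.allclose(O @ O.T, torch.eye(d, dtype=torch.double), atol=1e-6)
    return O

# Toy example with hidden size 8 (power of two, so hadamard(8) exists).
ws = [torch.randn(16, 8).double(), torch.randn(32, 8).double()]
O_init = womi_init(ws)
```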

Inter-Block Learning (Global Transformations): As shown in the upper part of Figure 5, a global orthogonal transformation $R_{res}$ is applied at the embedding layer and propagates through each residual path in the LLM, rotating activations along the entire residual stream. Due to the norm-preserving property of unitary orthogonal matrices, this transformation can bypass the RMSNorm layers (which normalize based on the norm), with an inverse transformation applied to the inputs of the projection layers and the final output head to maintain equivalence. Additionally, scaling matrices $S_{attn}^i$ and $S_{ffn}^i$ are applied to the two normalization layers in each block and their respective projection layers: $R_{res}$ rotates activations along the residual paths and adjusts the weights of multiple projection layers, while $S_{attn}^i$ and $S_{ffn}^i$ scale the outputs of the RMSNorm layers and the corresponding projection layers. All of these transformations are absorbed into the corresponding weight matrices, compensating for distribution shifts and mitigating quantization errors.

Intra-Block Learning (Local Transformations): The lower part of Figure 5 illustrates local equivalent transformation pairs within each transformer block:

  1. Multi-Head Self-Attention (MHSA) Layer: Two pairs are introduced.

    • Value (V) and Output (O) Projection Layers: For each attention head $h$, the data flow is: $ \pmb{Y} = \pmb{P}^h \cdot \pmb{X}^h \cdot (\pmb{W}_v^h \pmb{R}_{ov}^h \pmb{S}_{ov}^h) \cdot (\pmb{S}_{ov}^{h^{-1}} \pmb{R}_{ov}^{h^\top} \pmb{W}_o^h) \quad (11) $ Here:
      • $h$: Head index.
      • $\pmb{P}^h$: Attention matrix for head $h$.
      • $\pmb{X}^h$: Input to head $h$.
      • $\pmb{W}_v^h$: Weight for the value projection.
      • $\pmb{W}_o^h$: Weight for the output projection.
      • $\pmb{R}_{ov}^h$: Learnable orthogonal matrix for the V/O layers.
      • $\pmb{S}_{ov}^h$: Learnable scaling matrix for the V/O layers. These transformations improve QSUR for both the value cache and the output projection layer.
    • Query (Q) and Key (K) Projection Layers: After the Rotary Positional Embedding (RoPE), the query and key outputs undergo an equivalent scaling transformation $S_{qk}$, which can be incorporated into the weights $W_q$ and $W_k$. An additional Hadamard transformation is applied to the $Q$ and $K$ outputs to enhance key-cache quantization (a small sketch of this invariance follows this list).
  2. Feed-Forward Network (FFN) Layer: The diagonal scaling matrices in the up-projection and down-projection layers of the FFN are optimized. The inverse of the Hadamard transformation (if applied) is fused into $W_{down}$ from the start.
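The online Hadamard transform on the query/key path relies on the fact that applying the same orthogonal matrix to both $Q$ and $K$ leaves the attention scores unchanged while flattening the key-cache distribution. A minimal sketch of that invariance, using toy shapes and a single head (not the authors' fused kernel):

```python
import torch
from scipy.linalg import hadamard

torch.manual_seed(0)
seq, d_head = 5, 8
Q = torch.randn(seq, d_head)
K = torch.randn(seq, d_head)

# Normalized Hadamard matrix: orthogonal, so H @ H.T = I.
H = torch.from_numpy(hadamard(d_head)).float() / d_head ** 0.5

Q_rot, K_rot = Q @ H, K @ H          # transform before writing K to the KV cache

# Attention scores are unchanged because H is orthogonal: (QH)(KH)^T = Q K^T.
scores_ref = Q @ K.T
scores_rot = Q_rot @ K_rot.T
assert torch.allclose(scores_ref, scores_rot, atol=1e-4)
```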

4.2.4. KL-Top Loss

While LLMs are trained on massive datasets, PTQ optimization is typically performed on small calibration sets (e.g., 1000 samples). In this limited data regime:

  • Standard cross-entropy (CE) loss can lead to overfitting, where the model performs well on the calibration set but degrades on broader tasks (as shown in Table 1). This is because small datasets may not fully utilize LLM capacity, causing CE loss (focusing on a single correct label) to overfit to narrow features.

  • Full Kullback-Leibler (KL) divergence (comparing the full prediction distributions) is also problematic. LLMs have very large vocabularies (e.g., LLaMA-3-8B has over 100,000 tokens), and their prediction results often follow a severe long-tail distribution (Figure 6), meaning only a few tokens have significant probabilities. Directly applying KL divergence over all classes means the loss can be dominated by uninformative low-probability classes, adding noise to training and incurring high computational/memory costs.

    To mitigate these issues, OSTQuant proposes the KL-Top loss function. This function computes KL divergence over only the top $k$ classes with the highest probabilities. This approach focuses optimization on the model's primary predictions, enhancing gradient quality by avoiding noise from low-probability values. It also reduces computational and memory costs for large vocabularies.

The KL-Top loss is calculated as follows: $ idxs = \operatorname{topk}(z) \quad (12) $ $ \mathcal{L} = \sum_{i \in idxs} z[i] \log \left( \frac{z[i]}{\hat{z}[i]} \right) $ where:

  • $z$: The prediction distribution (probabilities after softmax) from the full-precision model.

  • $\hat{z}$: The prediction distribution from the quantized model.

  • $\operatorname{topk}(z)$: A function that returns the indices of the top $k$ highest-probability entries of $z$.

  • $\mathcal{L}$: The KL-Top loss. The sum is taken only over the selected top-$k$ indices.

    Experiments (Table 6) show that a $k$ value of 1,000 typically yields the best results, balancing semantic richness with optimization stability.
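A PyTorch sketch of the KL-Top idea as described above, computed from the logits of the full-precision and quantized models; the function and variable names are illustrative and this is not the authors' exact implementation:

```python
import torch
import torch.nn.functional as F

def kl_top_loss(fp_logits: torch.Tensor, q_logits: torch.Tensor, k: int = 1000) -> torch.Tensor:
    """KL divergence restricted to the top-k most probable tokens of the full-precision model.

    fp_logits, q_logits: (batch, vocab) logits from the full-precision and quantized models.
    """
    p = F.softmax(fp_logits, dim=-1)                # full-precision distribution (target)
    q = F.softmax(q_logits, dim=-1)                 # quantized-model distribution
    top_p, idxs = p.topk(k, dim=-1)                 # indices of the k highest-probability tokens
    top_q = q.gather(-1, idxs)
    # Sum p * log(p / q) over the selected indices only, averaged over the batch.
    loss = (top_p * (top_p.log() - (top_q + 1e-12).log())).sum(dim=-1)
    return loss.mean()

# Toy usage with a 32,000-token vocabulary; perturbed logits stand in for the quantized model.
fp = torch.randn(4, 32000)
quant = fp + 0.1 * torch.randn_like(fp)
print(kl_top_loss(fp, quant, k=1000).item())
```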

5. Experimental Setup

5.1. Datasets

The experimental validation of OSTQuant is conducted across a range of Large Language Models (LLMs) from the LLaMA family and evaluated on standard benchmarks.

  • Models:

    • LLaMA-1 (7B, 13B, 30B) (Touvron et al., 2023a)
    • LLaMA-2 (7B, 13B, 70B) (Touvron et al., 2023b)
    • LLaMA-3-8B, LLaMA-3-70B
  • Calibration Data: For the PTQ optimization phase, a small calibration set is used.

    • Source: 1,000 samples from WikiText2 (Merity et al., 2016)
    • Characteristics: Each sample has a token length of 2,048. WikiText2 is a common dataset for language model evaluation, particularly for perplexity measurements, as it is derived from high-quality Wikipedia articles.
    • Why chosen: It's a standard and relatively clean dataset suitable for calibrating quantization parameters due to its representative linguistic patterns, while being small enough to fit the PTQ paradigm of limited data.
  • Evaluation Datasets (Zero-Shot Tasks): To assess the generalized performance and emergent capabilities of the quantized LLMs, the models are evaluated on a suite of nine zero-shot tasks using the lm-evaluation-harness (version 0.4.4) (Gao et al., 2024). These tasks are designed to test various aspects of language understanding and reasoning.

    • BoolQ (Clark et al., 2019): Boolean Question Answering.
    • HellaSwag (Zellers et al., 2019): Commonsense reasoning for sentence completion.
    • LAMBADA (Radford et al., 2019): A word-prediction benchmark that tests language understanding by predicting the last word of a passage.
    • OpenBookQA (OBQA) (Mihaylov et al., 2018): Open-book question answering requiring commonsense and factual knowledge.
    • PIQA (Bisk et al., 2020): Physical Interaction Question Answering, focusing on physical commonsense.
    • SIQA (Sap et al., 2019): Social IQa, evaluating commonsense reasoning about social interactions.
    • WinoGrande (Sakaguchi et al., 2021): An adversarial Winograd Schema Challenge to test commonsense reasoning.
    • ARC-Easy and ARC-Challenge (Boratko et al., 2018): AI2 Reasoning Challenge, testing scientific reasoning.
    • Why chosen: These datasets represent a diverse set of zero-shot NLP tasks and are standard benchmarks for evaluating the capabilities and robustness of LLMs after quantization. They are crucial because perplexity alone (from WikiText2) might not fully reflect the true functional performance of the model, especially after quantization.

5.2. Evaluation Metrics

For comprehensive evaluation, OSTQuant uses both perplexity and zero-shot accuracy (averaged across multiple tasks), along with inference speedup and memory savings.

  • Perplexity (PPL)

    • Conceptual Definition: Perplexity is a common metric for evaluating language models. It measures how well a probability distribution or language model predicts a sample. A lower perplexity score indicates that the model predicts the test set more accurately and assigns higher probabilities to the observed sequences, meaning it is a better language model. In simple terms, it's a measure of how "surprised" the model is by new data; less surprise means better performance.
    • Mathematical Formula: For a sequence of words $W = (w_1, w_2, \ldots, w_N)$, the perplexity is defined as: $ PPL(W) = \exp\left(-\frac{1}{N} \sum_{i=1}^N \log P(w_i | w_1, \dots, w_{i-1})\right) $ (a short numeric sketch follows the metric list below).
    • Symbol Explanation:
      • WW: The test corpus or sequence of words.
      • NN: The total number of words in the test corpus.
      • $P(w_i | w_1, \dots, w_{i-1})$: The probability assigned by the language model to the $i$-th word $w_i$, given the preceding words $w_1, \dots, w_{i-1}$. This probability is typically computed by the LLM during its next-token prediction task.
  • Zero-Shot Accuracy (Avg. Zero-Shot Score)

    • Conceptual Definition: Zero-shot accuracy measures a model's ability to perform tasks it has not been explicitly trained on, demonstrating its generalization capabilities and emergent properties. The average zero-shot score refers to the mean accuracy across a suite of predefined zero-shot tasks. For each task, accuracy is simply the percentage of correct predictions.
    • Mathematical Formula: Since this metric is an average across multiple tasks, there isn't a single universal formula beyond the definition of accuracy for each task. For an individual task, accuracy is: $ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} \times 100\% $ The Average Zero-Shot Score is then: $ \text{Avg. Zero-Shot Score} = \frac{1}{M} \sum_{j=1}^M \text{Accuracy}_j $
    • Symbol Explanation:
      • $\text{Accuracy}_j$: The accuracy achieved on the $j$-th zero-shot task.
      • $M$: The total number of zero-shot tasks evaluated (in this paper, $M = 9$).
  • Speedup

    • Conceptual Definition: Speedup quantifies how much faster the quantized model performs inference compared to its full-precision (e.g., FP16) counterpart. A speedup of $2\times$ means the quantized model is twice as fast.
    • Mathematical Formula: $ \text{Speedup} = \frac{\text{Inference Time (Full Precision)}}{\text{Inference Time (Quantized)}} $
    • Symbol Explanation:
      • $\text{Inference Time (Full Precision)}$: The time taken for a specific inference operation (e.g., prefill, decoding a token) using the full-precision model.
      • $\text{Inference Time (Quantized)}$: The time taken for the same inference operation using the quantized model.
  • Memory Saving Factor

    • Conceptual Definition: Memory saving factor indicates how much less memory the quantized model consumes compared to its full-precision version. A factor of $3\times$ means the quantized model uses one-third of the memory.
    • Mathematical Formula: $ \text{Memory Saving Factor} = \frac{\text{Memory Usage (Full Precision)}}{\text{Memory Usage (Quantized)}} $
    • Symbol Explanation:
      • $\text{Memory Usage (Full Precision)}$: The memory consumed by the full-precision model (e.g., for storing weights or activations).
      • $\text{Memory Usage (Quantized)}$: The memory consumed by the quantized model.
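As a quick illustration of the perplexity formula referenced above, the following sketch computes PPL from per-token log-probabilities; the log-probability values are made-up numbers for illustration, not model outputs:

```python
import math

# Hypothetical log-probabilities assigned by a model to each token of a short test sequence.
token_log_probs = [-2.1, -0.4, -3.3, -1.0, -0.7]

n = len(token_log_probs)
perplexity = math.exp(-sum(token_log_probs) / n)
print(f"PPL = {perplexity:.2f}")   # lower is better
```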

5.3. Baselines

OSTQuant is benchmarked against several representative Post-Training Quantization (PTQ) methods, ranging from basic approaches to state-of-the-art techniques.

  • RTN (Round-to-Nearest): A basic quantization method where floating-point values are simply rounded to the nearest available integer quantization level. It serves as a strong baseline to show the benefits of more sophisticated methods over naive quantization.
  • SmoothQuant (Xiao et al., 2022): A smooth-based transformation method that shifts quantization difficulty from activations to weights by scaling activations and inversely scaling weights. This aims to reduce inter-channel variance in activations.
  • GPTQ (Frantar et al., 2022): An accurate weight-only PTQ method that uses Hessian-based error compensation to minimize quantization errors in generative pre-trained transformers.
  • OmniQuant (Shao et al., 2023): A method that further enhances quantization performance by training quantization parameters (scales and zero-points) and transformation coefficients through an optimization process, often involving block-wise reconstruction.
  • AWQ (Lin et al., 2023): Activation-aware weight quantization, which focuses on protecting salient weights (those critical for model performance) from quantization errors by identifying activation outliers and applying appropriate scaling.
  • QuaRot (Ashkboos et al., 2024): A rotation-based method that uses random rotation matrices to make activations and weights more uniform and outlier-free, enabling 4-bit quantization for both.
  • SpinQuant (Liu et al., 2024): An advancement over QuaRot that learns the rotation matrices for 4-bit weight-activation quantization rather than relying on random ones, aiming for better optimization.

5.4. Implementation Details

The paper specifies key aspects of OSTQuant's implementation and optimization:

  • Quantization Scheme:
    • Activations: Per-token asymmetric quantization without any pruning operations. Per-token means scales and zero-points are calculated for each token in a sequence, allowing adaptation to dynamic ranges. Asymmetric means the quantization range does not necessarily center around zero.
    • Weights: Per-channel symmetric quantization. Per-channel means scales and zero-points are calculated independently for each output channel of a weight tensor. Symmetric means the quantization range is centered around zero (e.g., from -Max to +Max). A small sketch of both granularities follows this list.
  • Optimizer: RiemannAdam (Bécigneul & Ganea, 2018) is used to optimize all unit orthogonal matrices and scaling matrices. RiemannAdam is suitable for Riemannian optimization on Stiefel Manifolds, where orthogonal matrices reside.
  • Calibration Data and Iterations:
    • Calibration Samples: 1,000 samples from WikiText2.
    • Token Length: Each sample has a token length of 2,048.
    • Iterations: 150 iterations.
    • Batch Size: 8.
  • Learning Rate Schedule: Cosine learning rate decay is applied.
    • Initial Learning Rate for orthogonal matrix parameters: $2 \times 10^{-2}$.
    • Initial Learning Rate for scaling parameters: $3 \times 10^{-2}$.
    • The paper notes that using a learning rate for the Stiefel manifold (orthogonal matrices) 10 times larger than that for scaling parameters typically leads to better results.
  • Hardware: Quantization and inference speed tests were conducted on an A800 GPU for optimization timing, and on a 3090 GPU for detailed prefill/memory speedup tests (Table 9), and an A6000 GPU for decoding speed (Table 10).
  • Kernel Implementation: The 4-bit matrix multiplication kernel was implemented with NVIDIA's CUTLASS, a high-performance CUDA library. The self-attention mechanism uses PyTorch's native SDPA (scaled dot-product attention) function.
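To illustrate the quantization scheme listed above (per-token asymmetric for activations, per-channel symmetric for weights), here is a simplified NumPy fake-quantization sketch; it reflects only the granularities described, not the actual CUTLASS kernels:

```python
import numpy as np

def fake_quant_act_per_token(x: np.ndarray, n_bits: int = 4) -> np.ndarray:
    """Per-token asymmetric fake quantization: one (scale, zero-point) per row/token."""
    qmax = 2 ** n_bits - 1
    x_min = x.min(axis=-1, keepdims=True)
    x_max = x.max(axis=-1, keepdims=True)
    s = (x_max - x_min) / qmax
    zp = np.round(-x_min / s)
    x_int = np.clip(np.round(x / s) + zp, 0, qmax)
    return (x_int - zp) * s

def fake_quant_weight_per_channel(w: np.ndarray, n_bits: int = 4) -> np.ndarray:
    """Per-channel symmetric fake quantization: one scale per output channel (row), zero-point = 0."""
    qmax = 2 ** (n_bits - 1) - 1
    s = np.abs(w).max(axis=-1, keepdims=True) / qmax
    w_int = np.clip(np.round(w / s), -qmax - 1, qmax)
    return w_int * s

acts = np.random.randn(3, 16)        # (tokens, hidden)
weights = np.random.randn(8, 16)     # (out_channels, in_features)
print("max activation error:", np.abs(acts - fake_quant_act_per_token(acts)).max())
print("max weight error    :", np.abs(weights - fake_quant_weight_per_channel(weights)).max())
```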

6. Results & Analysis

6.1. Core Results Analysis

OSTQuant consistently outperforms existing state-of-the-art methods across various LLMs and quantization configurations, demonstrating its effectiveness in preserving accuracy at low bit-widths and achieving significant efficiency gains.

Quantization Performance (Table 2):

  • W4-only (4-16-16) Setting: OSTQuant achieves remarkable accuracy retention, maintaining at least 99.5% of the floating-point (FP) accuracy on zero-shot tasks. For the challenging LLaMA-3-8B model, OSTQuant drops only 0.29 points (67.80 vs. FP 68.09), while most competing methods in Table 2 lose a full point or more (e.g., SpinQuant drops 1.55 points). This demonstrates its superior ability to quantize weights with minimal impact.

  • W4A4KV16 Setting (4-4-16): When activations are also quantized but the KV cache remains at FP16, OSTQuant shows substantial performance boosts. For instance, on LLaMA-2-7B, OSTQuant (63.90) outperforms SpinQuant (57.37) by 6.53 points in zero-shot accuracy. This highlights its effectiveness in handling activation distributions.

  • W4A4KV4 Setting (4-4-4): In this most aggressive 4-bit quantization configuration for weights, activations, and KV cache, OSTQuant still maintains significant performance gains. It outperforms the previous SOTA method, SpinQuant, by approximately 1 point across multiple models (e.g., LLaMA-3-8B: 65.37 vs 64.10, LLaMA-2-7B: 63.18 vs 62.01). This demonstrates its robustness in extreme compression scenarios, making 4-bit inference truly feasible.

  • Comparison to Smooth-based vs. Rotation-based: The results clearly show that rotation-based methods (like QuaRot, SpinQuant, OSTQuant) significantly outperform smooth-based methods (like SmoothQuant) once activations are quantized. This confirms the paper's argument that smooth-based methods struggle more with the prominent outliers and uneven distributions in activations.

    QSUR and Accuracy Correlation (Figure 3): The evaluation of QSUR across different quantization methods for LLaMA variants reveals a clear positive correlation with model performance (Zero-Shot accuracy). OSTQuant achieves the highest QSUR, which directly translates to improved model accuracy. This validates the QSUR metric as an effective indicator of quantizability and the success of OSTQuant in optimizing data distributions.

Activation Distribution Uniformity (Figure 2, 11, 12): Visualizations (Figure 2) show that LLaMA-3-8B's activation distributions prior to OSTQuant exhibit substantial variation across channels and numerous outliers. After OSTQuant's transformations, the distributions become significantly more uniform across channels and outliers are mitigated. This qualitative observation supports the quantitative QSUR improvements and performance gains.

Speedup and Memory Savings (Table 3, 9, 10):

  • Inference Speedup: OSTQuant enables 4-bit inference that delivers an average inference speedup of over 2× compared to FP16. For larger models like LLaMA-30B, it achieves approximately 3× acceleration. This is attributed to reduced memory-access overhead and efficient low-precision computation units.
  • Memory Savings: OSTQuant provides memory savings exceeding 3.5×. This is crucial for deploying LLMs on resource-constrained devices or running larger models on existing hardware.
  • Decoding Stage Efficiency (Table 10): In the decoding stage (which is often memory-bound), quantizing to 4-bit significantly reduces memory access overhead. This allows running very large models like LLaMA-3-70B on a single A6000 GPU at a respectable speed of 15 tokens per second, which would be Out of Memory (OOM) for FP16.
  • Training Speed (Table 11): OSTQuant also offers substantial advantages in optimization speed over block-reconstruction-based methods such as OmniQuant. With only 150 iterations and few learnable parameters, it quantizes smaller models (7B) in roughly 20 minutes (a 5.3× speedup over OmniQuant) and the 30B model in about 2.2 hours (a 3.3× speedup).

6.2. Data Presentation (Tables)

The following are the results from Table 1 of the original paper:

Model Loss Type Wiki PPL Arc-Easy Score Arc-Challenge Score
LLaMA-2-7B Origin 5.38 69.87 42.41
KL-Top 5.94 72.69 44.62
LLaMA-2 13B Origin 5.12 75.09 46.08
KL-Top 5.25 75.29 47.10
LLaMA-3 8B Origin 6.80 76.68 49.26
KL-Top 7.29 76.73 49.32

The following are the results from Table 2 of the original paper:

#Bits Method LLaMA-3 8B LLaMA-3 70B LLaMA-2 7B LLaMA-2 13B LLaMA-2 70B LLaMA 7B LLaMA 13B LLaMA 30B
0-shot Wiki 0-shot Wiki 0-shot Wiki 0-shot Wiki 0-shot Wiki 0-shot Wiki 0-shot Wiki 0-shot Wiki
Avg.(↑) (↓) Avg.(↑) (↓) Avg.(↑) (↓) Avg.(↑) (↓) Avg.(↑) (↓) Avg.(↑) (↓) Avg.(↑) (↓) Avg.(↑) (↓)
W-A-KV FloatingPoint 68.09 - 6.14 - 73.81 - 2.86 - 65.21 - 5.47 - 67.61 - 4.88 - 71.59 - 3.32 - 64.48 - 5.68 - 66.67 - 5.09 - 70.00 - 4.10 -
4-16-16 RTN 63.70 8.13 - 31.15 - 1e5 - 61.27 7.02 - 60.24 -6.39 - 69.62 3.87 - 62.67 - 7.94 - 63.45 - 8.60 - 65.69 -6.13 -
SmoothQuant 62.79 8.12 - 67.94 6.70 - 58.88 8.03 - 62.03 5.86 - 65.93 5.50 - 62.24 - 7.46 - 62.69 - 18.75 - 65.69 5.80 -
GPTQ 61.03 7.43 - 31.45 - 9e3 - 60.86 9.84 - 64.71 5.79 - 70.96 3.94 - 60.15 - 7.93 - 64.36 6.58 - 66.95 5.26 -
Omniquant 65.66 7.19 - - - - 63.19 5.74 - 66.38 5.02 - 71.04 3.47 - 63.42 - 5.86 - 66.22 - 5.21 - 69.07 4.25 -
AWQ 67.03 7.36 - 68.92 5.92 - 63.89 5.83 - 66.25 5.07 - 70.88 4.03 - 63.30 5.97 - 65.58 5.28 - 69.44 4.28 -
QuaRot 67.27 6.53 - 72.93 3.53 - 64.30 5.62 - 66.95 5.00 - 71.21 3.41 - 63.40 5.83 - 65.91 5.20 - 69.73 4.27 -
SpinQuant 66.54 6.49 - 72.90 3.49 - 63.59 5.58 - 67.14 5.00 - 71.12 3.43 - 63.94 5.76 - 66.32 5.16 - 69.62 4.21 -
OSTQuant 67.80 6.53 - 73.69 3.19 - 64.37 5.64 - 67.31 4.94 - 71.48 3.41 - 64.13 5.81 - 66.62 5.21 - 69.84 4.19 -
RTN 33.42 -6e2 - 31.21 - 8e3 - 32.44 nan - 30.86 δe3 - 30.90 7e4 - 32.51 - 7e3 - 31.63 - 3e4 - 31.57 2e3 -
SmoothQuant 33.04 1e3 - 34.67 2e2 - 32.13 nan - 34.26 1e3 - 35.86 3e2 - 34.42 - 3e2 - 33.29 - 6e2 - 34.64 1e3 -
GPTQ 32.98 5e2 - 31.47 - 4e4 - 32.72 nan - 30.11 4e3 - 30.86 nan - 32.12 - 1e3 - 31.51 - 3e3 - 30.88 2e3 -
QuaRot 61.69 8.02 - 65.56 6.35 - 61.87 6.05 - 65.13 5.35 - 69.96 3.78 - 61.76 - 6.22 - 64.46 5.50 - 68.14 4.57 -
4-4-16 SpinQuant 64.11 7.28 - 66.99 6.10 - 57.37 6.78 - 63.23 5.24 - 70.58 3.68 - 61.82 - 6.08 - 64.59 5.36 - 68.08 4.53 -
OSTQuant 65.14 7.24 - 72.21 3.97 - 63.90 5.60 - 66.24 5.14 - 70.92 3.57 - 62.72 - 6.04 - 65.80 5.40 - 68.52 4.43 -
RTN 33.18 7e2 - 30.82 - 8e3 - 32.67 nan - 30.93 7e3 - 31.73 7e4 - 32.87 - 1e4 - 31.33 - 3e4 - 31.64 2e3 -
4-4-4 SmoothQuant 32.96 1e3 - 33.76 3e2 - 32.12 nan - 33.36 1e3 - 35.54 3e2 - 33.32 - 3e2 - 33.28 - 5e2 - 34.65 1e3 -
GPTQ 33.71 6e2 - 31.20 - 4e4 - 33.52 nan - 27.85 5e3 - 31.09 nan - 31.80 - 2e3 - 30.63 - 3e3 - 31.07 2e3 -
Omniquant 32.33 4e2 - - - - 48.40 14.26 - 50.35 12.30 - - - - 48.46 - 11.26 - 45.63 - 10.87 - 45.04 12.35 -
QuaRot 61.38 8.18 - 65.33 6.60 - 61.48 6.11 - 65.16 5.39 - 70.30 3.80 - 61.22 - 6.26 - 64.59 5.53 - 68.08 4.60 -
SpinQuant 64.10 7.35 - 66.31 6.24 - 62.01 5.96 - 64.13 5.74 - 70.57 3.61 - 61.32 - 6.12 - 64.95 5.39 - 68.14 4.55 -
OSTQuant 65.37 7.29 - 71.69 4.01 - 63.18 5.91 - 65.41 5.25 - 70.84 3.59 - 62.55 - 6.07 - 65.43 5.40 - 68.20 4.42 -

The following are the results from Table 3 of the original paper:

Model Size Prefill Speedup (Seqlen) Memory Saving Factor (Seqlen)
256 512 1024 2048 4096 8192 256 512 1024 2048 4096 8192
7B 2.24x 2.27x 2.23x 2.14x 2.11x 2.02x 3.48x 3.34x 3.12x 2.86x 2.57x 2.34x
8B 2.42x 2.52x 2.52x 2.43x 2.36x 2.23x 3.48x 3.36x 3.12x 2.77x 2.38x 2.00x
13B 2.62x 2.68x 2.63x 2.52x 2.83x 2.32x 3.64x 3.51x 3.30x 3.02x 2.70x 2.43x
30B 3.18x 3.01x 2.98x 3.40x 2.84x 2.68x 3.70x 3.59x 3.42x 3.15x 2.83x 2.53x

The following are the results from Table 4 of the original paper:

Metric Baseline +Rres +Sres +Rdn +Su|d +Rqk +Sqk +Rov +Sov
Wiki PPL nan 9.70 9.46 6.16 6.00 5.92 5.92 5.94 5.91
Zero-shot9 33.51 54.33 53.74 61.75 61.79 62.35 62.56 63.11 63.18

The following are the results from Table 5 of the original paper:

Model Optimizer Type Best Steps Best LR1 Best LR2 Zero-Shot Score
LLaMA-2-7B Cayley SGD 150 1.50 0.20 63.11
Riemann SGD 500 0.10 0.02 63.09
Riemann Adam 150 0.02 1e-3 63.18
LLaMA-2-13B Cayley SGD 200 1.50 0.2 64.77
Riemann SGD 500 0.1 0.02 65.19
Riemann Adam 150 0.02 0.002 65.41

The following are the results from Table 6 of the original paper:

Setting Metric k=5 k=50 k=100 k=500 k=1000 k=5000 k=10000
W3 Only Zero-Shot9 Score 61.87 61.88 61.75 62.18 62.30 61.25 61.21
Wiki PPL 6.06 6.116 6.13 6.07 6.06 6.06 6.12
W4A4KV4 Zero-Shot Score 62.4 62.13 62.38 62.34 63.18 62.44 62.11
Wiki PPL 5.99 5.96 5.95 5.96 5.96 5.93 5.94

The following are the results from Table 7 of the original paper:

Model Quant Setting Method Zero-Shot9 Wiki PPL
LLaMA-2-7B Full-Precision 65.21 5.47
W4A16KV16 Hadamard 63.32 5.62
W4A16KV16 WOMI 63.45 5.59
W4A4KV4 Hadamard 61.47 6.11
W4A4KV4 WOMI 61.52 6.09
LLaMA-3-8B Full-Precision - 68.09 6.14
W4A16KV16 Hadamard 67.27 6.53
W4A16KV16 WOMI 67.41 6.48
W4A4KV4 Hadamard 61.38 8.18
W4A4KV4 WOMI 61.40 8.17

The following are the results from Table 8 of the original paper:

Model Method ARC-c ARC-e BoolQ HellaS Lam. OBQA PIQA SIQA WinoG. Avg. Wiki2 PPL
LLaMA3-8B RTN 23.72 30.56 46.18 29.83 2.70 28.60 52.45 34.39 50.20 33.18 704.34
SpinQuant 46.33 73.57 76.15 75.43 71.40 41.40 79.16 44.68 68.75 64.10 7.35
SpinQuant + KL-Top 47.29 73.95 75.82 75.64 71.40 41.58 78.16 44.38 68.45 64.07 7.54
OSTQuant 49.26 76.68 78.25 76.18 70.48 43.19 77.85 45.18 69.13 65.13 6.80
OSTQuant + KL-Top 49.32 76.73 78.87 76.01 70.77 43.20 78.51 45.70 69.22 65.37 7.29
LLaMA2-7B RTN 27.22 27.06 50.83 27.34 0.93 25.80 49.51 34.85 50.51 32.67 nan
SpinQuant 40.44 71.08 74.40 73.51 70.66 41.80 76.88 43.50 65.82 62.01 5.96
SpinQuant + KL-Top 40.76 71.29 74.61 73.08 70.19 40.94 76.32 43.85 67.78 62.09 6.16
OSQuant 42.41 69.87 75.07 72.90 70.21 40.87 78.16 44.16 68.40 62.45 5.38
OSTQuant + KL-Top 44.62 72.69 75.10 73.27 70.21 41.00 78.13 44.42 68.27 63.11 5.94

The following are the results from Table 9 of the original paper:

Model Seqlen Prefill Time Prefill Speedup Memory Memory Saving
FP16 INT4 FP16 INT4
LLaMA2-7B 256 8.050ms 3.597ms 2.238x 0.411GB 0.118GB 3.479x
512 14.904ms 6.579ms 2.265x 0.435GB 0.130GB 3.341x
1024 27.989ms 12.582ms 2.225x 0.483GB 0.155GB 3.116x
2048 54.276ms 25.312ms 2.144x 0.577GB 0.202GB 2.857x
4096 112.230ms 53.145ms 2.112x 0.766GB 0.299GB 2.566x
8192 244.675ms 121.339ms 2.016x 1.147GB 0.491GB 2.336x
LLaMA3-8B 256 8.035ms 3.314ms 2.424x 0.430GB 0.124GB 3.478x
512 15.545ms 6.176ms 2.517x 0.442GB 0.132GB 3.356x
1024 29.169ms 11.599ms 2.515x 0.466GB 0.149GB 3.116x
2048 57.470ms 23.631ms 2.432x 0.513GB 0.185GB 2.774x
4096 117.523ms 49.835ms 2.358x 0.608GB 0.256GB 2.378x
8192 256.394ms 114.815ms 2.233x 0.795GB 0.397GB 2.003x
LLaMA2-13B 256 11.449ms 4.370ms 2.620x 0.634GB 0.174GB 3.643x
512 21.195ms 7.924ms 2.675x 0.663GB 0.189GB 3.512x
1024 41.752ms 15.867ms 2.631x 0.723GB 0.219GB 3.301x
2048 81.965ms 32.553ms 2.518x 0.841GB 0.279GB 3.018x
4096 199.046ms 70.442ms 2.826x 1.079GB 0.399GB 2.702x
8192 359.409ms 154.640ms 2.324x 1.551GB 0.639GB 2.426x
LLaMA-30B 256 18.682ms 5.883ms 3.175x 1.047GB 0.283GB 3.703x
512 34.393ms 11.445ms 3.005x 1.085GB 0.302GB 3.589x
1024 66.880ms 22.464ms 2.977x 1.162GB 0.340GB 3.416x
2048 157.500ms 46.317ms 3.400x 1.315GB 0.418GB 3.148x
4096 272.355ms 96.052ms 2.835x 1.625GB 0.575GB 2.828x
8192 576.555ms 215.27ms 2.678x 2.242GB 0.887GB 2.527x

The following are the results from Table 10 of the original paper:

Model Decoder Speed (tokens/sec) Memory Use (GB)
FP Quantized Speed up FP Quantized Memory Saving
LLaMA-2-7B 47.32 89.4 1.89x 13.94 4.32 3.23x
LLaMA-3-8B 38.33 77.71 2.03x 15.83 5.88 2.69x
LLaMA-2-13B 23.7 55.35 2.34x 23.7 8.5 2.79x
LLaMA-30B OOM 30.49 - OOM 18.19 -
LLaMA-3-70B OOM 14.68 - OOM 38.41 -

The following are the results from Table 11 of the original paper:

Method 7B 8B 13B 30B 70B
Omniquant 1.6h 1.8h 3.3h 7.3h 9.5h
OSTQuant 0.3h 0.4h 0.8h 2.2h 5.5h
Speedup 5.3x 4.5x 4.1x 3.3x 1.7x

The following are the results from Table 12 of the original paper:

Model #BitsW-A-KV Method Zero-Shot Common Sense Reasoning Tasks (Avg. Accuracy %) Wiki2 PPL (↓)
ARC-c (↑) ARC-e (↑) BoolQ (↑) HellaS. (↑) Lam. (↑) OBQA (↑) PIQA (↑) SIQA (↑) WinoG. (↑)
(Avg. Zero-Shot Score (↑))
2-7B 16-16-16 Full Precision 46.42 74.33 77.71 75.94 73.69 44.20 79.16 45.91 69.53 65.21 5.47
4-16-16 RTN 42.15 67.59 73.06 72.34 67.18 41.80 76.50 44.11 66.69 61.27 7.02
SmoothQuant 39.59 65.19 69.82 73.83 67.61 42.40 77.64 44.52 68.43 60.86 8.30
GPTQ 42.49 71.00 74.70 73.50 70.87 42.40 78.40 44.93 68.82 63.19 5.74
Omniquant 44.11 70.50 78.07 74.98 70.68 43.80 78.30 45.14 69.38 63.89 5.83
AWQ 43.94 73.50 76.97 74.87 72.07 44.00 78.24 45.40 69.38 64.30 5.62
QuaRot 43.77 72.69 73.36 75.10 73.80 43.00 77.86 45.60 67.66 63.59 5.58
SpinQuant 44.54 73.31 75.57 75.04 73.67 44.20 78.89 45.50 68.59 64.37 5.64
OSTQuant 44.54 73.31 75.57 75.04 73.67 44.20 78.89 45.50 68.59 64.37 5.64
4-4-16 RTN 25.34 28.03 50.52 27.71 1.01 26.20 50.80 33.93 48.38 32.44 nan
SmoothQuant 28.33 26.39 49.39 27.28 1.18 23.40 48.80 33.75 40.75 32.13 nan
GPTQ 24.40 28.70 51.62 28.66 1.36 24.60 51.14 34.49 49.49 32.72 nan
QuaRot 42.32 69.65 74.77 72.91 70.81 39.80 77.20 43.55 65.82 61.76 6.05
SpinQuant 37.54 62.58 71.16 70.48 67.16 34.80 76.20 39.96 60.62 57.37 6.78
OSTQuant 44.03 71.93 75.41 74.94 73.22 43.20 78.51 45.85 68.03 63.90 5.60
4-4-4 RTN 27.22 27.06 50.83 27.34 0.93 25.80 49.51 34.85 50.51 32.67 nan
SmoothQuant 26.37 25.63 47.71 27.05 1.11 26.40 41.90 34.49 48.38 32.12 nan
GPTQ 26.96 27.65 52.84 28.83 1.63 29.00 49.62 35.11 49.80 33.52 nan
Omniquant 31.40 53.75 63.79 60.13 55.63 34.40 60.59 40.28 54.07 48.40 14.26
QuaRot 41.43 69.32 74.83 72.50 70.66 39.80 77.42 43.70 64.64 61.48 6.11
SpinQuant 40.44 71.08 74.40 73.51 70.66 41.80 76.88 43.50 65.82 62.01 5.96
OSTQuant 42.92 72.56 74.71 73.10 71.76 44.00 77.20 44.98 67.77 63.18 5.91
2-13B 16-16-16 Full Precision 49.15 77.53 80.58 79.39 76.62 45.20 80.63 47.49 71.90 67.61 4.88
4-16-16 RTN 42.92 66.54 71.38 66.62 68.99 39.40 76.93 44.06 65.35 60.24 6.39
SmoothQuant 46.25 70.45 75.92 69.16 70.49 39.80 77.86 45.14 64.17 62.03 5.86
GPTQ 46.63 73.95 74.83 73.77 73.00 42.40 78.51 45.50 70.64 64.71 5.79
Omniquant 48.29 75.42 77.92 77.80 75.59 45.20 80.41 46.62 70.17 66.38 5.02
AWQ 48.63 78.16 78.81 78.48 75.70 45.00 79.54 46.21 72.45 66.25 5.07
QuaRot 49.07 78.17 76.58 77.60 76.70 45.40 80.03 45.50 71.11 66.95 5.00
SpinQuant 49.15 77.48 79.27 78.60 77.10 44.60 80.03 46.47 71.67 67.14 5.00
OSTQuant 48.72 76.26 80.67 78.27 76.54 45.54 80.25 47.65 71.90 67.31 4.94
4-4-16 RTN 27.99 26.81 38.50 26.08 0.00 23.60 48.20 34.90 51.62 30.86 nan
SmoothQuant 24.49 35.06 47.98 30.87 3.67 26.20 51.01 35.31 49.20 34.26 nan
GPTQ 27.82 26.77 37.92 25.67 0.00 21.80 47.77 33.51 48.15 30.11 nan
QuaRot 46.42 72.70 78.10 75.58 74.31 43.00 79.05 44.37 71.35 65.13 5.35
SpinQuant 43.77 69.99 76.57 74.63 72.81 41.60 77.72 44.27 68.19 63.23 5.24
OSTQuant 47.78 74.66 80.03 77.60 75.94 44.00 79.38 46.06 70.32 66.24 5.14
4-4-4 RTN 27.82 26.52 38.38 26.27 0.02 26.00 49.78 34.39 49.17 30.93 nan
SmoothQuant 24.49 28.83 45.84 30.70 2.70 23.40 53.81 34.80 51.07 33.36 nan
GPTQ 27.90 26.39 37.95 26.16 0.00 20.00 48.26 34.39 50.43 27.85 nan
Omniquant 32.85 55.13 64.34 60.13 56.45 33.40 68.17 40.76 56.51 40.35 12.30
QuaRot 47.27 73.91 78.41 75.33 75.30 43.80 79.27 45.85 69.06 65.16 5.39
SpinQuant 46.67 74.49 76.06 75.22 72.19 42.40 78.20 43.45 67.20 64.13 5.74
OSTQuant 48.10 75.21 77.46 76.71 75.14 44.60 78.67 45.75 68.30 65.41 5.25
3-8B 16-16-16 Full Precision 53.50 77.74 81.10 79.18 75.74 44.80 80.63 47.08 73.01 68.09 6.14
4-16-16 RTN 48.98 73.23 72.75 75.90 63.85 43.20 78.40 43.81 73.16 63.70 8.13
SmoothQuant 47.22 72.35 72.11 74.92 62.41 43.00 77.80 43.91 71.27 62.79 8.12
GPTQ 49.74 72.50 71.28 68.34 46.69 43.60 78.50 46.47 71.82 61.03 7.43
Omniquant 50.09 74.54 79.51 76.92 70.31 43.00 79.54 44.52 71.74 65.66 7.19
AWQ 52.22 76.68 80.31 75.91 74.81 44.20 79.87 46.26 71.67 67.03 7.36
QuaRot 51.88 77.53 79.60 77.87 74.11 44.40 80.14 46.37 72.50 67.27 6.53
SpinQuant 52.13 77.28 79.99 78.40 73.60 44.80 79.98 45.22 72.77 66.54 6.49
OSTQuant 52.82 79.84 80.31 77.86 76.48 42.80 80.74 45.55 72.80 67.80 6.53
4-4-16 RTN 23.72 30.89 46.30 31.26 3.03 27.60 52.72 35.26 50.04 33.42 nan
SmoothQuant 23.29 28.24 48.93 29.19 1.57 28.60 44.46 33.37 49.64 33.04 nan
GPTQ 23.46 32.07 43.79 30.10 2.40 28.00 53.97 34.14 48.60 32.98 nan
QuaRot 47.26 72.73 73.60 67.00 43.00 76.61 45.04 65.90 61.69 8.02 nan
SpinQuant 47.35 74.12 76.36 75.98 69.88 42.46 77.37 44.47 68.98 64.11 7.28
OSTQuant 48.81 73.48 79.82 75.97 72.20 42.70 78.18 45.75 69.22 65.14 7.24
4-4-4 RTN 23.72 30.56 46.18 29.83 2.70 28.60 52.45 34.39 50.20 33.18 nan
SmoothQuant 23.55 28.96 44.84 28.90 1.44 29.40 51.09 34.14 50.36 32.96 nan
GPTQ 22.87 30.35 44.34 29.72 2.39 29.80 54.95 34.75 51.30 33.71 nan
Omniquant 29.10 33.79 41.53 31.11 1.86 25.80 53.37 34.08 50.43 32.33 nan
QuaRot 49.49 74.37 79.16 77.22 71.69 42.29 78.89 43.87 71.03 65.33 6.60
SpinQuant 46.33 73.57 76.15 75.43 71.40 41.40 79.16 44.68 68.75 64.10 7.35
OSTQuant 49.32 76.73 78.87 76.01 70.77 43.20 78.51 45.70 69.22 65.37 7.29

The following are the results from Table 13 of the original paper:

Model #Bits-A-KV Method Zero-Shot Common Sense Reasoning Tasks (Avg. Accuracy %) Wiki2 PPL (↓)
ARC-c (↑) ARC-e (↑) BoolQ (↑) HellaS. (↑) LambA. (↑) OBQA (↑) PIQA (↑) SIQA (↑) WinoG. (↑)
(Avg. Zero-Shot Score (↑))
7B [16-16-16] Full Precision 44.71 72.90 74.98 76.20 73.08 43.80 79.16 45.55 69.93 64.48 5.68
4-16-16 RTN 41.70 69.82 73.30 69.67 42.00 78.13 45.34 68.00 62.67 7.94 nan
SmoothQuant 40.96 68.60 74.04 73.16 68.74 42.00 78.07 44.37 69.46 62.69 7.46
GPTQ 41.70 67.98 69.50 63.15 40.80 66.55 44.37 69.46 60.15 7.93 nan
Omniquant 42.49 71.38 74.62 74.71 71.98 42.00 79.05 45.09 69.14 63.30 5.86
AWQ 43.86 70.79 75.27 69.94 43.00 78.45 45.09 69.14 63.30 5.97 nan
QuaRot 42.75 69.99 73.30 72.71 73.55 42.80 78.35 45.14 69.61 63.40 5.83
SpinQuant 43.77 71.17 74.06 72.87 72.00 44.40 78.40 44.52 70.60 63.94 5.76
OSTQuant 44.20 72.56 73.73 75.05 73.45 44.60 78.73 45.45 69.38 64.13 5.81
4-4-16 RTN 23.46 29.34 45.05 29.02 1.24 26.00 52.07 35.11 51.30 32.51 nan
SmoothQuant 25.40 31.40 45.68 29.73 5.43 28.20 49.68 34.44 49.09 34.42 nan
GPTQ 23.89 27.74 42.87 28.49 1.28 27.40 51.00 36.23 50.20 32.12 nan
Omniquant 40.36 67.26 73.15 72.89 70.81 42.00 77.90 44.27 67.17 61.76 6.22
AWQ 40.19 68.43 72.35 70.91 70.68 41.20 77.75 44.17 68.67 61.82 6.08
QuaRot 42.58 70.79 72.87 74.06 70.77 43.40 77.69 45.04 67.25 62.72 6.04
SpinQuant 40.19 68.43 72.35 70.91 70.68 41.20 77.75 44.17 68.67 61.82 6.08
OSTQuant 42.58 70.79 72.87 74.06 70.77 43.40 77.69 45.04 67.88 62.55 6.07
4-4-4 RTN 23.89 29.59 46.67 28.37 1.13 26.40 52.90 35.21 51.54 32.87 nan
SmoothQuant 23.38 30.18 50.03 29.67 4.89 24.60 51.74 34.75 50.67 33.32 nan
GPTQ 23.89 27.90 43.88 27.86 1.05 26.20 51.85 34.75 49.49 31.80 nan
Omniquant 31.40 54.00 61.90 58.29 46.45 31.80 60.59 38.54 55.17 48.46 11.26
QuaRot 40.27 67.55 72.20 72.71 70.39 39.80 77.20 44.88 65.90 61.22 6.26
SpinQuant 40.30 68.18 73.60 72.87 70.46 40.60 77.42 42.68 67.56 61.32 6.12
OSTQuant 42.92 70.33 72.11 73.77 70.66 42.42 77.91 44.93 67.88 62.55 6.07
13B 16-16-16 Full Precision 47.87 74.49 77.86 79.10 76.03 44.40 80.30 46.72 73.24 66.67 5.09
4-16-16 RTN 45.56 70.66 72.45 76.06 70.50 42.00 78.84 44.93 70.01 64.45 8.60
SmoothQuant 43.86 71.21 75.19 74.19 69.34 40.00 77.80 45.45 70.72 62.69 8.75
GPTQ 45.99 72.85 75.27 75.31 70.10 44.60 79.87 46.16 71.11 64.36 6.58
Omniquant 47.01 73.86 77.22 77.95 75.59 45.00 79.87 46.88 72.61 66.22 5.21
AWQ 47.30 73.86 77.60 79.03 78.30 43.40 79.87 45.85 71.67 65.58 5.28
QuaRot 47.80 72.22 76.50 78.07 75.90 45.00 79.90 45.50 72.60 65.91 5.20
SpinQuant 47.44 74.83 77.37 78.30 75.50 45.60 79.90 46.01 72.06 66.32 5.16
OSTQuant 48.04 73.86 78.10 77.28 75.99 45.60 80.55 46.93 72.30 66.62 5.21
4-4-16 RTN 25.85 26.26 42.05 29.17 0.00 28.00 50.33 34.60 50.67 31.63 nan
SmoothQuant 25.43 29.29 51.00 28.10 2.02 26.00 53.32 34.30 49.57 33.29 nan
GPTQ 24.66 27.80 40.80 25.30 0.70 24.20 51.31 33.65 51.70 31.51 nan
QuaRot 46.93 72.50 75.57 76.63 74.13 42.40 78.73 45.24 68.98 64.46 5.50
SpinQuant 45.30 72.56 75.38 76.86 73.28 43.60 78.90 44.63 70.40 64.59 5.36
OSTQuant 48.04 74.07 77.13 77.22 74.58 45.00 78.62 46.16 71.35 65.80 5.40
4-4-4 RTN 26.28 27.27 42.35 25.85 0.19 26.60 49.95 34.19 49.25 31.33 nan
SmoothQuant 24.49 28.83 51.65 27.91 2.08 26.00 52.56 35.41 50.59 33.28 nan
GPTQ 23.63 27.31 39.50 26.17 0.56 26.00 50.59 35.00 49.57 30.63 nan
Omniquant 31.40 54.00 61.90 58.29 46.45 31.80 60.59 38.54 55.17 45.63 10.87
QuaRot 46.50 71.55 75.50 74.30 73.47 45.00 78.90 44.37 70.09 64.59 5.53
SpinQuant 45.90 70.71 76.51 77.00 73.63 45.60 79.00 45.65 70.32 64.95 5.39
OSTQuant 45.90 75.25 76.94 77.21 74.23 43.40 79.43 45.91 70.56 65.43 5.40
30B 16-16-16 Full Precision 52.99 80.39 82.75 82.62 77.59 48.00 82.26 47.75 75.69 70.00 4.10
4-16-16 RTN 49.74 73.99 77.89 72.21 70.21 44.20 79.00 45.70 73.88 65.90 6.13
SmoothQuant 48.98 72.90 80.00 79.00 71.49 44.80 78.13 45.96 73.70 65.69 5.80
GPTQ 52.22 77.62 81.10 81.94 76.80 47.00 81.07 47.50 74.43 69.07 4.25
Omniquant 53.20 77.77 81.68 82.29 76.79 48.20 81.20 48.16 75.37 69.44 4.28
AWQ 53.58 78.62 82.11 82.10 77.30 48.00 81.72 47.90 75.09 69.73 4.27
QuaRot 52.90 78.49 82.02 82.21 78.00 48.00 81.01 48.41 75.06 69.62 4.21
SpinQuant 53.07 79.12 82.09 82.04 78.58 48.60 81.18 48.06 74.80 69.84 4.19
OSTQuant 53.07 79.12 82.09 82.04 78.58 48.60 81.18 48.06 74.80 69.84 4.19
4-4-16 RTN 25.00 27.95 42.02 27.22 0.21 27.00 49.13 34.65 50.91 31.57 nan
SmoothQuant 25.63 31.91 54.86 32.52 3.80 28.20 50.13 34.49 50.04 34.64 nan
GPTQ 27.30 27.19 38.69 26.75 0.70 25.80 49.40 35.21 47.75 30.88 nan
Omniquant 30.60 33.79 41.95 32.44 1.86 26.40 51.56 38.54 53.91 45.04 10.33
QuaRot 51.79 76.39 80.76 77.08 77.08 45.80 80.58 45.60 74.35 68.14 4.57
SpinQuant 52.06 77.06 81.38 80.62 76.90 46.00 80.14 46.26 74.27 68.08 4.53
OSTQuant 51.37 78.11 82.48 79.51 75.99 45.40 81.18 47.80 74.82 68.52 4.43
4-4-4 RTN 25.00 28.87 44.07 27.29 0.39 25.60 49.67 34.54 49.33 31.64 nan
SmoothQuant 22.61 31.48 55.05 31.28 3.40 28.00 53.50 34.65 50.28 34.65 nan
GPTQ 22.87 27.70 39.36 27.00 0.33 24.00 50.71 34.30 47.91 31.07 nan
Omniquant 29.10 33.79 41.95 32.44 1.86 25.80 53.37 34.08 50.43 45.04 10.33
QuaRot 51.71 76.50 80.50 77.00 76.20 45.00 80.63 45.00 73.32 68.08 4.55
SpinQuant 51.60 76.98 81.07 80.57 77.00 46.00 79.92 46.26 74.90 68.14 4.55
OSTQuant 49.76 78.52 81.16 81.13 77.57 46.40 80.90 46.11 74.27 68.20 4.42

The following are the results from Table 14 of the original paper:

Model #BitsW-A-KV Method Zero-Shot Common Sense Reasoning Tasks (Avg. Accuracy %) Wiki2 PPL (↓)
ARC-c (↑) ARC-e (↑) BoolQ (↑) HellaS. (↑) LambA. (↑) OBQA (↑) PIQA (↑) SIQA (↑) WinoG. (↑)
(Avg. Zero-Shot Score (↑))
2-70B 16-16-16 Full Precision 57.42 81.02 83.79 83.81 79.60 48.80 82.70 49.18 77.98 71.59 3.32
4-16-16 RTN 55.05 79.29 81.35 81.78 75.51 47.60 81.94 46.83 76.46 69.62 3.87
SmoothQuant 50.26 76.56 81.53 67.81 73.63 44.40 81.34 44.17 73.64 65.93 5.50
GPTQ 56.91 80.81 83.24 82.47 79.06 47.80 82.50 48.06 77.51 70.96 3.94
Omniquant 57.08 80.81 82.69 83.07 79.18 47.40 83.08 48.87 77.19 71.04 3.47
AWQ 56.67 80.54 82.98 82.54 78.83 47.67 82.97 48.20 77.70 70.88 4.03
QuaRot 57.34 80.85 83.24 83.27 80.38 47.60 82.21 48.62 77.35 71.21 3.41
SpinQuant 56.91 80.60 83.18 83.06 79.16 49.00 82.54 48.31 77.11 71.12 3.43
OSTQuant 57.36 81.37 83.20 83.86 79.77 48.73 82.69 48.46 77.89 71.48 3.41
4-4-16 RTN 29.35 26.05 37.74 25.97 0.02 24.80 51.31 34.14 48.70 30.90 nan
SmoothQuant 25.00 25.98 55.23 32.52 7.49 25.00 54.62 35.21 51.70 35.86 nan
GPTQ 27.82 25.80 37.95 25.82 0.00 27.00 49.67 33.98 49.00 30.86 nan
QuaRot 55.29 80.35 81.10 81.87 79.06 45.80 82.05 47.90 76.24 69.96 3.78
SpinQuant 55.38 78.96 83.36 82.54 79.00 47.20 82.10 48.67 77.43 70.58 3.68
OSTQuant 56.61 80.51 83.03 82.68 79.11 47.86 83.00 48.76 76.70 70.92 3.57
4-4-4 RTN 30.30 27.74 38.23 26.12 0.02 24.60 51.74 34.29 52.43 31.73 nan
SmoothQuant 24.15 33.88 55.32 31.75 7.14 26.40 54.95 34.14 52.17 35.54 nan
GPTQ 28.75 26.39 37.86 25.96 0.00 26.40 50.00 34.44 50.04 31.09 nan
QuaRot 56.48 80.56 81.59 81.93 79.16 46.00 82.21 48.00 76.80 70.30 3.80
SpinQuant 56.31 80.64 83.55 82.36 79.41 47.20 82.21 47.29 77.05 70.57 3.61
OSTQuant 56.58 80.17 83.64 82.49 78.72 48.00 82.76 48.67 76.49 70.84 3.59
3-70B 16-16-16 Full Precision 64.42 85.98 85.14 84.95 79.47 48.46 84.39 50.82 80.66 73.81 2.86
4-16-16 RTN 26.28 25.55 37.83 26.36 0.00 29.00 50.98 34.70 49.30 31.15 nan
SmoothQuant 51.88 77.53 80.09 80.47 73.16 46.60 80.58 45.29 75.85 67.94 6.70
GPTQ 25.77 25.29 37.83 26.36 0.12 28.40 51.74 34.90 52.64 31.45 nan
Omniquant 48.29 75.42 77.92 77.80 75.59 45.20 80.41 46.62 70.17 66.38 5.02
AWQ 62.20 83.88 85.57 84.18 79.04 48.20 83.13 50.10 80.03 72.93 3.53
QuaRot 62.03 84.97 85.11 84.60 78.30 47.00 83.90 49.85 80.90 72.90 3.49
SpinQuant 63.76 85.82 84.99 85.16 79.53 48.45 84.26 51.01 80.22 73.69 3.19
OSTQuant 63.76 85.82 84.99 85.16 79.53 48.45 84.26 51.01 80.22 73.69 3.19
4-4-16 RTN 27.47 25.88 37.83 26.26 0.00 27.20 51.63 35.26 49.33 31.21 nan
SmoothQuant 26.00 34.47 50.46 32.48 11.98 30.00 54.24 34.83 48.93 34.67 nan
GPTQ 25.77 25.80 43.64 26.42 0.00 27.40 52.01 32.55 49.33 31.47 nan
QuaRot 60.60 73.65 77.46 77.83 71.96 43.20 78.13 45.29 71.90 65.56 6.35
SpinQuant 53.84 77.69 80.24 78.19 77.06 45.00 78.67 43.24 73.01 66.99 6.10
OSTQuant 61.84 84.56 84.14 82.47 77.08 46.07 83.38 50.23 80.13 72.21 3.97
4-4-4 RTN 27.13 25.42 37.83 26.12 0.00 26.60 50.76 34.95 51.22 33.76 nan
SmoothQuant 23.46 33.18 52.56 32.48 4.13 28.00 53.50 34.95 51.22 33.76 nan
QuaRot 49.49 74.37 79.16 77.22 71.69 42.29 78.89 43.87 71.03 65.33 6.60
SpinQuant 51.88 76.39 80.98 76.50 71.43 43.46 79.27 44.17 72.69 66.31 6.24
OSTQuant 61.29 82.39 83.43 83.25 75.90 48.93 81.73 51.24 77.01 71.69 4.01

6.3. Ablation Studies / Parameter Analysis

6.3.1. Effect of Different Transformation Matrices (Table 4)

The ablation study investigates the contribution of different learnable transformation matrices on LLaMA-2 7B under W4A4KV4 quantization.

  • Baseline (Zero-Shot9 33.51): This represents the performance without any of the proposed transformations, likely using a basic RTN or SmoothQuant-like approach, resulting in poor accuracy.

  • +$R_{res}$ (Zero-Shot9 54.33, Wiki PPL 9.70): Adding the global orthogonal transformation $R_{res}$ brings the most significant improvement (over 20 points in Zero-Shot9 accuracy). This highlights the critical role of rotating activations globally across the residual paths to manage distributions.

  • +$S_{res}$ (Zero-Shot9 53.74, Wiki PPL 9.46): Adding the global scaling transformation $S_{res}$ further reduces perplexity (9.70 → 9.46) even though the zero-shot average dips slightly, indicating that global scaling is beneficial but less impactful than the global rotation; the combination of the two is what matters.

  • +$R_{dn}$ (Zero-Shot9 61.75, Wiki PPL 6.16): Adding the orthogonal transformation $R_{dn}$ (likely the FFN down-projection rotation or a similar intra-block rotation) provides the next largest jump in performance. This indicates that local rotations within critical layers are also highly effective.

  • +$S_{u|d}$ (Zero-Shot9 61.79, Wiki PPL 6.00): This represents the scaling transformations for the FFN up/down-projections. It offers a slight improvement over +$R_{dn}$, confirming that scaling helps balance variances across channels after rotation.

  • +$R_{qk}$ (Zero-Shot9 62.35, Wiki PPL 5.92): The Hadamard transformation ($R_{qk}$, the Query/Key rotation) applied after ROPE further improves accuracy.

  • +$S_{qk}$ (Zero-Shot9 62.56, Wiki PPL 5.92): The scaling transformation $S_{qk}$ for Query/Key also contributes.

  • +$R_{ov}$ (Zero-Shot9 63.11, Wiki PPL 5.94): The orthogonal transformation for the Value/Output projections in MHSA gives a further boost (62.56 → 63.11).

  • +$S_{ov}$ (Zero-Shot9 63.18, Wiki PPL 5.91): The scaling transformation for the Value/Output projections provides the final marginal improvement.

    Conclusion: The results demonstrate that orthogonal transformations (especially the global $R_{res}$ and the local $R_{dn}$ and $R_{ov}$) provide the most substantial gains by reshaping the distributions. Scaling transformations ($S_{res}$, $S_{u|d}$, $S_{qk}$, $S_{ov}$) then build on these by finely balancing variance across channels, minimizing residual quantization losses and maximizing QSUR. The cumulative effect of all these learnable transformations is critical for OSTQuant's strong performance. (A toy numerical check of the underlying weight-fusion equivalence follows below.)
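To make the equivalence-pair idea concrete, here is a small self-contained check (my own toy example, not code from the paper): an orthogonal rotation combined with a positive per-channel scaling applied to activations can be compensated by folding the inverse transform into the following weight, so the layer output is numerically unchanged while the intermediate distribution that gets quantized is reshaped.

```python
import torch

torch.manual_seed(0)
d_in, d_out = 64, 32
X = torch.randn(8, d_in)                           # activations entering a linear layer
W = torch.randn(d_in, d_out)                       # that layer's weight

R, _ = torch.linalg.qr(torch.randn(d_in, d_in))    # random orthogonal matrix
s = torch.rand(d_in) + 0.5                         # positive per-channel scales

X_t = (X @ R) * s                                  # transformed activations (quantization target)
W_t = torch.diag(1.0 / s) @ R.T @ W                # compensated ("fused") weight

print(torch.allclose(X @ W, X_t @ W_t, atol=1e-4)) # True: the layer output is preserved
```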

6.3.2. Different Manifold Optimizers (Table 5)

The paper compares different Riemannian optimizers for the unit orthogonal matrices (which reside on a Stiefel manifold) on LLaMA-2-7B and LLaMA-2-13B under W4A4KV4 configuration.

  • Cayley SGD (Stochastic Gradient Descent): Requires a relatively high learning rate (LR1 1.50, LR2 0.20) and achieves good results (63.11 for 7B, 64.77 for 13B) within 150-200 steps.

  • Riemann SGD: Needs more iterations (500 steps) and lower learning rates (LR1 0.10, LR2 0.02) to reach comparable performance (63.09 for 7B, 65.19 for 13B).

  • Riemann Adam: Delivers the best results (63.18 for 7B, 65.41 for 13B) with the fewest iterations (150 steps) and lower learning rates (LR1 0.02, LR2 1e-3 or 0.002). This highlights RiemannAdam's efficiency and effectiveness in optimizing on Stiefel manifolds.

    Learning Rate Ratio: A key finding is that setting the learning rate for the Stiefel-manifold parameters (the orthogonal matrices, LR1 in Table 5) roughly 10 times higher than that for the scaling parameters (LR2) leads to better results, consistent with the best values reported in Table 5. This suggests that the orthogonal transformations benefit from larger optimization steps, as sketched below.
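As an illustration of this setup, the sketch below uses the open-source geoopt library (my choice for the example; the paper only names RiemannAdam from Bécigneul & Ganea, 2018) to jointly optimize an orthogonal matrix on the Stiefel manifold and a Euclidean scaling vector, with the roughly 10× learning-rate gap discussed above. The hidden size, data, and loss are toy placeholders, not the paper's objective.

```python
import torch
import geoopt

d = 128                                                          # hypothetical hidden size
manifold = geoopt.Stiefel()                                      # square orthogonal matrices
R = geoopt.ManifoldParameter(torch.eye(d), manifold=manifold)    # learnable rotation
log_s = torch.nn.Parameter(torch.zeros(d))                       # learnable per-channel scale (log-domain)

opt = geoopt.optim.RiemannianAdam([
    {"params": [R], "lr": 2e-2},                                 # Stiefel-manifold parameters
    {"params": [log_s], "lr": 2e-3},                             # Euclidean scaling parameters (~10x smaller)
])

x = torch.randn(8, d)                                            # toy calibration activations
for _ in range(150):                                             # 150 iterations, as in the paper
    opt.zero_grad()
    y = (x @ R) * torch.exp(log_s)                               # transformed activations
    loss = y.abs().amax(dim=-1).mean()                           # toy surrogate for quantization range
    loss.backward()
    opt.step()                                                   # retraction keeps R orthogonal
```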

6.3.3. Influence of $k$ in KL-Top Loss (Table 6)

The parameter $k$ in the KL-Top loss (Equation 12) determines the number of top-probability classes considered in the KL-divergence calculation. The ablation study explores the impact of different $k$ values on LLaMA-2 7B's Zero-Shot9 score and Wiki PPL under the W3-only and W4A4KV4 settings; a hedged sketch of this loss appears after the list below.

  • Balanced $k$ is Key: Both excessively small ($k=5$, $k=50$) and excessively large ($k=5000$, $k=10000$) values of $k$ hurt optimization.
    • Small $k$: May discard too much semantic information, leading to sub-optimal optimization.
    • Large $k$: Reintroduces noise from long-tail distributions and increases computational cost, approaching full KL divergence.
  • Optimal $k$ Value: For both the W3-only and W4A4KV4 settings, $k=1000$ consistently yields the best Zero-Shot9 scores (62.30 and 63.18, respectively), striking the right balance between retaining semantic information and suppressing noise.
  • PPL Stability: Wiki PPL is less sensitive to $k$ than the Zero-Shot scores, suggesting that zero-shot accuracy is the more discriminative indicator of quantization quality here.
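A hedged sketch of a KL-Top style loss, based only on the description above: the KL divergence is restricted to the $k$ classes that receive the highest probability under the full-precision (teacher) model, with both distributions renormalized over that subset. The renormalization and the teacher-side top-$k$ selection are my reading, not necessarily the paper's exact Equation 12.

```python
import torch
import torch.nn.functional as F

def kl_top_loss(student_logits, teacher_logits, k=1000):
    # student_logits, teacher_logits: (batch, vocab) for the positions being supervised.
    topk_idx = teacher_logits.topk(k, dim=-1).indices        # top-k classes per position
    t_sel = torch.gather(teacher_logits, -1, topk_idx)       # teacher logits on those classes
    s_sel = torch.gather(student_logits, -1, topk_idx)       # student logits on the same classes
    t_prob = F.softmax(t_sel, dim=-1)                        # renormalized teacher distribution
    s_logprob = F.log_softmax(s_sel, dim=-1)                 # renormalized student distribution
    # KL(teacher || student) over the restricted support, averaged over the batch
    return F.kl_div(s_logprob, t_prob, reduction="batchmean")
```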

6.3.4. Effect of Weight Outlier Minimization Initialization (WOMI) (Table 7, Figure 7, 13)

WOMI is designed to initialize trainable orthogonal matrices by leveraging eigenvalue decomposition of weights and Hadamard matrices.

  • Performance Improvement (Table 7): WOMI consistently achieves lower perplexity and higher zero-shot accuracy than random Hadamard initialization for LLaMA-2-7B and LLaMA-3-8B under both W4A16KV16 and W4A4KV4 configurations, confirming that it provides a better starting point for optimization.
  • Enhanced W4-only Performance: Notably, WOMI shows greater performance improvements in W4-only quantization settings compared to W4A4KV4. This suggests that WOMI is particularly effective at minimizing weight quantization errors, which is crucial when weights are the primary quantized component.
  • Visualization (Figure 7, 13): The visualizations show that original weight distributions often vary substantially across input/output channels. While Hadamard transformations reduce some of these differences, WOMI (by incorporating the covariance matrix) further smooths inter-channel disparities and suppresses outliers, yielding a more compact distribution within the quantization space and a lower relative quantization error. (A speculative sketch of such an initialization follows below.)
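Purely as a speculative illustration of the description above (the paper's exact WOMI construction may differ), one way to combine an eigen-decomposition of the weight covariance with a Hadamard rotation into an orthogonal initialization is sketched below; the covariance proxy, eigenvector ordering, and normalization are all assumptions.

```python
import torch
from scipy.linalg import hadamard

def womi_like_init(weight):
    # weight: (out_features, in_features); returns an orthogonal (in, in) matrix.
    d = weight.shape[1]                                   # must be a power of 2 for hadamard()
    cov = weight.T @ weight / weight.shape[0]             # input-channel covariance proxy
    _, eigvecs = torch.linalg.eigh(cov)                   # orthonormal eigenvector basis
    H = torch.tensor(hadamard(d), dtype=weight.dtype) / d ** 0.5   # normalized Hadamard matrix
    return eigvecs @ H                                    # product of two orthogonal matrices

R0 = womi_like_init(torch.randn(256, 128))
print(torch.allclose(R0 @ R0.T, torch.eye(128), atol=1e-4))   # orthogonality check
```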

6.3.5. Effect of KL-Top Loss with SpinQuant (Table 8)

This ablation investigates the impact of KL-Top loss when applied to SpinQuant versus OSTQuant.

  • SpinQuant with KL-Top: When KL-Top loss is introduced to SpinQuant alone, it does not lead to significant performance improvements and may even cause slight degradation (Avg. Zero-Shot for LLaMA3-8B 64.10 -> 64.07, LLaMA2-7B 62.01 -> 62.09). This suggests that KL-Top loss is most effective when coupled with powerful distribution-reshaping transformations.
  • OSTQuant with KL-Top: For OSTQuant, using Cross-Entropy (CE) loss (referred to as Origin in Table 1) results in overfitting on the calibration set, leading to higher Wiki PPL and lower Zero-Shot scores. The introduction of KL-Top loss (OSTQuant + KL-Top) alleviates this overfitting problem, leading to better generalization and improved Zero-Shot scores (e.g., LLaMA3-8B 65.13 -> 65.37, LLaMA2-7B 62.45 -> 63.11). This highlights that KL-Top loss is crucial for OSTQuant to leverage its transformation capabilities effectively on limited PTQ data.

6.4. QSUR and Evaluation Loss Dynamics During Training (Figure 10)

Figure 10 illustrates the dynamic behavior of QSUR and the evaluation loss during the training process for the LLaMA-3-8B model.

  • As the number of training iterations increases, the local and average QSURs (Quantization Space Utilization Rates) consistently increase and stabilize at a high level. This directly shows that OSTQuant's learnable transformations are effectively optimizing the data distributions to better utilize the quantization space.
  • Concurrently, the evaluation loss gradually decreases, reflecting the model's improved performance during the quantization-aware optimization.
  • The positive correlation between increasing QSUR and decreasing evaluation loss (or improving accuracy, as seen in Figure 3) empirically validates QSUR as an effective metric for guiding and assessing quantization quality. (An illustrative proxy for computing QSUR is sketched below.)
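For intuition only, here is an illustrative proxy for QSUR based on its qualitative definition (how much of the min-max quantization hypercube the data's covariance ellipsoid occupies); the paper's exact Equation 7 is not reproduced in this analysis, so the 1-sigma ellipsoid radius and constants below are assumptions.

```python
import math
import torch

def qsur_proxy(x):
    # x: (n_samples, n_channels); returns an illustrative utilization ratio.
    n, d = x.shape
    cov = torch.cov(x.T)                                     # (d, d) sample covariance
    eigvals = torch.linalg.eigvalsh(cov).clamp(min=1e-12)
    # log-volume of the 1-sigma covariance ellipsoid: pi^(d/2) / Gamma(d/2 + 1) * prod(sqrt(eigvals))
    log_ellipsoid = (0.5 * torch.log(eigvals).sum()
                     + 0.5 * d * math.log(math.pi)
                     - torch.lgamma(torch.tensor(d / 2 + 1)))
    # log-volume of the axis-aligned min-max hypercube seen by the quantizer
    edges = (x.amax(dim=0) - x.amin(dim=0)).clamp(min=1e-12)
    log_cube = torch.log(edges).sum()
    # exponentiate for modest d; for large d, compare the log ratio directly to avoid underflow
    return torch.exp(log_ellipsoid - log_cube)
```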

6.5. Activation and Weight Distribution Visualizations (Figure 11, 12, 14)

These visualizations provide qualitative evidence of OSTQuant's impact on data distributions and quantization error.

  • Activation Distributions (Figure 11, 12): These figures display the activation distributions of different layers in LLaMA-2-7B and LLaMA-3-8B before and after OSTQuant.
    • Before OSTQuant: The distributions show significant variations across different channels and often exhibit prominent outliers, leading to wide ranges and low QSUR.
    • After OSTQuant: The distributions become noticeably more uniform across channels, and outliers are substantially mitigated. This "flattening" and "centering" of distributions allows for a more efficient mapping to discrete quantization levels, reducing the relative quantization error.
  • Weight Distributions and WOMI (Figure 13): This figure compares the weight distribution and relative L1 error for LLaMA-2-7B when quantized with original, Hadamard transformed, and WOMI transformed weights.
    • Original: Shows large variations and outliers.
    • Hadamard Transformed: Reduces some inter-channel differences, but some spikes or extreme values might remain.
    • WOMI Transformed: Further smooths these inter-channel differences and reduces outliers more effectively, leading to a smaller relative L1 error. This demonstrates WOMI's role in providing a better initial distribution for weight quantization.
  • Comparison of Activation Quantization Errors (Figure 14): This figure compares the activation distribution and relative L1 error for QuaRot, SpinQuant, and OSTQuant on LLaMA-2-7B.
    • QuaRot and SpinQuant show improvements over baselines, but OSTQuant achieves the most uniform activation distributions with the lowest relative L1 errors. This visually confirms that OSTQuant's combined orthogonal and scaling transformations are superior in making activations quantization-friendly by effectively managing outliers and variances, resulting in less information loss during per-token quantization.

7. Conclusion & Reflections

7.1. Conclusion Summary

In this paper, the authors introduce OSTQuant, a novel post-training quantization (PTQ) method aimed at significantly improving the efficiency of Large Language Models (LLMs). The central innovation is the Quantization Space Utilization Rate (QSUR), a new metric that quantitatively assesses the quantizability of transformed data by measuring how effectively it occupies and utilizes the available quantization space. This metric, supported by rigorous mathematical derivations on the effects and limitations of various linear transformations, provides a principled guide for optimizing data distributions.

Leveraging these insights, OSTQuant employs learnable equivalent transformation pairs, consisting of both orthogonal and scaling transformations, to optimize the distributions of weights and activations across the entire quantization space. This combined approach effectively addresses outliers and inter-channel disparities, leading to more quantization-friendly distributions. Furthermore, to overcome the challenges of limited calibration data inherent in PTQ, the KL-Top loss function is proposed. This loss function mitigates optimization noise while retaining richer semantic information by focusing the KL divergence on only the top-$k$ highest-probability logits.

Extensive experiments across various LLMs (LLaMA-1, -2, -3) and benchmarks demonstrate OSTQuant's superior performance. It retains over 99.5% of the floating-point accuracy in W4-only settings and reduces the performance gap by 32% on LLaMA-3-8B in the more challenging W4A4KV4 configuration compared to state-of-the-art methods. The method also delivers significant inference speedups (over 2×) and memory savings (over 3.5×), while accelerating the quantization process itself by up to 5.3×. These results underscore the effectiveness of OSTQuant's principled approach to optimizing data distributions for practical and efficient LLM deployment.

7.2. Limitations & Future Work

The authors identify clear directions for future research, primarily focusing on extending OSTQuant to fully-quantized LLMs.

  • Extension to Full Quantization: The current OSTQuant framework, while effective, is primarily presented for traditional PTQ where some components (like KV cache or certain activations) might still be in higher precision. The future work section (Appendix A.5.1) outlines an extension for full quantization, where all activations within each Transformer Block are quantized to low bits. This would further reduce memory transfer overhead and fully leverage low-precision computational units.
    • New Transformation Pairs: This extension proposes introducing more equivalence transformations around ROPE (Rotary Positional Encoding) and SiLU (Sigmoid Linear Unit) operations, which are critical components in modern LLMs.
      • ROPE Handling: Treating ROPE as a lightweight GEMM layer and constructing a weight matrix based on its principle allows for pre-ROPE and post-ROPE transformation pairs. This involves scaling transformations ($S_q^i$, $S_k^i$, $S_{post}^i$) and orthogonal transformations ($R_{pre}^i$, $R_{post}^i$) to optimize query and key distributions before and after ROPE and the attention computation.
      • SiLU Smoothing: Decomposing $\mathrm{SiLU}(X) = X \cdot \sigma(X)$ allows for equivalence transformations, e.g., $X_1 \cdot X_2 \cdot \sigma(X_1) = (X_1 \cdot S) \cdot (X_2 \cdot \frac{1}{S}) \cdot \sigma\big((X_1 \cdot S) \cdot \frac{1}{S}\big)$, to alleviate inter-channel discrepancies of activations before and after SiLU using scaling matrices such as $S_{u|g}$. The authors plan to conduct experiments in the full-quantization domain to explore the full potential of OSTQuant. (A small numerical check of this identity is sketched below.)
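A quick numerical check of the stated SiLU identity (my own toy example with hypothetical shapes): fusing a positive per-channel scale $S$ into the gate branch and $1/S$ into the up branch leaves the SwiGLU product unchanged.

```python
import torch

torch.manual_seed(0)
x1 = torch.randn(4, 16)            # gate-branch activations
x2 = torch.randn(4, 16)            # up-branch activations
s = torch.rand(16) + 0.5           # positive per-channel scales

lhs = x1 * x2 * torch.sigmoid(x1)                                # original SwiGLU product
rhs = (x1 * s) * (x2 / s) * torch.sigmoid((x1 * s) * (1.0 / s))  # scaled-and-compensated form
print(torch.allclose(lhs, rhs, atol=1e-6))                       # True up to floating-point error
```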

7.3. Personal Insights & Critique

This paper presents a rigorous and well-motivated approach to LLM quantization. Here are some personal insights and critiques:

  • Strengths and Innovations:

    • Principled Metric (QSUR): The introduction of QSUR is a significant contribution. It moves beyond heuristic observation to provide a quantitative, theoretically-backed metric to guide quantization optimization. This allows for a more direct and transparent evaluation of transformation effectiveness. The correlation between QSUR and accuracy (Figure 3) strongly supports its utility.
    • Holistic Transformation Strategy: The combination of learnable orthogonal and scaling transformations is powerful. Orthogonal transformations excel at reshaping distributions to be more ball-shaped and outlier-resistant, while scaling transformations fine-tune inter-channel variances. This dual approach provides a more comprehensive solution than methods relying solely on one type of linear transformation.
    • Addressing PTQ Constraints: The KL-Top loss function is a clever solution to the problem of limited calibration data and large vocabulary sizes in LLMs. It effectively balances the need for semantic preservation with the practical challenges of PTQ optimization, preventing overfitting and noise.
    • Efficiency: The method not only achieves high accuracy but also demonstrates impressive speedups in inference and memory savings, making LLMs more deployable. The fast training time for quantization is also a major practical advantage.
    • Equivalence Preservation: The design of equivalent transformation pairs that are fused into weights ensures no additional runtime overhead, which is crucial for real-world deployment.
  • Potential Issues, Unverified Assumptions, or Areas for Improvement:

    • QSUR Formula Simplifications: While the QSUR derivation provides a solid theoretical basis, some simplifications (e.g., neglecting the mean vector $\boldsymbol{\mu}$, assuming $\lambda_{\mathrm{max}} = \lambda_{\mathrm{min}} = \lambda_1$) are made to arrive at a tractable form (Equation 7). The practical impact of these simplifications on the optimality of the derived transformation (Equation 8) could be further explored. Do real LLM distributions always conform well enough to these assumptions for the derived optimum to hold globally?
    • Complexity of Riemannian Optimization: While RiemannAdam is effective, optimizing on Stiefel manifolds is inherently more complex than standard Euclidean optimization. This might pose a barrier for practitioners less familiar with Riemannian geometry or require specialized libraries. The ease of integration into common ML frameworks (like PyTorch) is good, but understanding the underlying mechanics remains a learning curve.
    • Sensitivity to $k$ in KL-Top: While an optimal $k=1000$ is identified for the tested models, this value might be sensitive to specific LLM architectures, vocabulary sizes, or downstream tasks. A more adaptive or data-driven way to select $k$ could enhance robustness.
    • Generalizability of WOMI: WOMI relies on the assumption that weights follow a Gaussian distribution with zero mean. While often true for typical weight initializations, deviations could affect its optimality.
    • Full Quantization Details: The future work section on full quantization is promising, but the mathematical details of how the ROPE and SiLU transformations are maintained as equivalent transformation pairs (similar to Equation 9) could be further elaborated in a subsequent paper. The decomposition of SiLU into a product and its transformation for inter-channel discrepancies (Equation 31) is an intriguing idea, and its empirical validation would be valuable.
  • Transferability and Future Value:

    • The QSUR metric is highly transferable and could be adopted by other quantization methods as a standard evaluation and optimization target, fostering more principled research.

    • The combined orthogonal and scaling transformation strategy is broadly applicable to any neural network layer that performs matrix multiplication, not just LLMs.

    • The insights from KL-Top loss are relevant for any PTQ scenario involving large output spaces (e.g., in sparse expert models) and limited calibration data.

    • The paper's direction towards full quantization is critical, as this is the ultimate goal for maximizing hardware efficiency. OSTQuant seems well-positioned to contribute significantly to this challenging area.

      Overall, OSTQuant represents a significant step forward in LLM quantization, offering a robust, theoretically grounded, and highly effective solution that addresses key challenges in LLM compression and acceleration.
