INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats
TL;DR Summary
This study offers a systematic comparison between floating-point (FP) and integer (INT) quantization formats, revealing that MXINT8 outperforms FP in 8-bit fine-grained formats. For 4-bit formats, FP often excels, but NVINT4 can surpass it with outlier-mitigation techniques. A new symmetric clipping method further enables nearly lossless MXINT8 training, and the findings overall favor fine-grained INT formats for future AI accelerators.
Abstract
Modern AI hardware, such as Nvidia's Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats to handle the pervasive activation outliers in Large Language Models (LLMs). Despite this industry trend, a unified comparison of FP and integer (INT) quantization across varying granularities has been missing, leaving algorithm and hardware co-design without clear guidance. This paper fills that gap by systematically investigating the trade-offs between FP and INT formats. We reveal a critical performance crossover: while FP excels in coarse-grained quantization, the comparison at fine-grained (block-wise) levels is more nuanced. Our comprehensive comparison demonstrates that for popular 8-bit fine-grained formats (e.g., MX with block size 32), MXINT8 is superior to its FP counterpart in both algorithmic accuracy and hardware efficiency. However, for 4-bit formats, FP (e.g., MXFP4, NVFP4) often holds an accuracy advantage, though we show that NVINT4 can surpass NVFP4 when outlier-mitigation techniques like Hadamard rotation are applied. We also introduce a symmetric clipping method that resolves gradient bias in fine-grained low-bit INT training, enabling nearly lossless performance for MXINT8 training. These findings challenge the current hardware trajectory, demonstrating that a one-size-fits-all FP approach is suboptimal and advocating that fine-grained INT formats, particularly MXINT8, offer a better balance of accuracy, power, and efficiency for future AI accelerators.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The title of the paper is: "INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats".
1.2. Authors
The authors are:
- Mengzhao Chen (The University of Hong Kong, ByteDance Seed)
- Meng Wu (PicoHeart)
- Hui Jin (ByteDance Seed)
- Zhihang Yuan (ByteDance Seed)
- Jing Liu (ByteDance Seed)
- Chaoyi Zhang (ByteDance Seed)
- Yunshui Li (ByteDance Seed)
- Jie Huang (ByteDance Seed)
- Jin Ma (ByteDance Seed)
- Zeyue Xue (The University of Hong Kong)
- Zhiheng Liu (The University of Hong Kong)
- Xingyan Bin (ByteDance Seed, Corresponding author)
- Ping Luo (The University of Hong Kong, Corresponding author)
1.3. Journal/Conference
The paper was published on arXiv, a preprint server, on 2025-10-29. As a preprint, it has not yet undergone formal peer review by a specific journal or conference. However, arXiv is a widely used and respected platform for disseminating cutting-edge research in fields like artificial intelligence, allowing for rapid sharing of findings.
1.4. Publication Year
The publication year is 2025.
1.5. Abstract
The abstract summarizes the paper's critical investigation into low-precision quantization formats, specifically comparing floating-point (FP) and integer (INT) representations. It addresses a gap in unified comparisons, particularly concerning varying granularities, despite the industry's trend towards FP formats (e.g., in Nvidia's Blackwell architecture) for handling activation outliers in Large Language Models (LLMs). The paper reveals a performance "crossover" where FP excels in coarse-grained quantization, but the comparison becomes more nuanced at fine-grained (block-wise) levels.
A key finding is that for popular 8-bit fine-grained formats (like MX with block size 32), MXINT8 outperforms its FP counterpart in both algorithmic accuracy and hardware efficiency. In contrast, for 4-bit formats, FP (e.g., MXFP4, NVFP4) generally holds an accuracy advantage, although NVINT4 can surpass NVFP4 when combined with outlier-mitigation techniques such as Hadamard rotation. The authors also introduce a symmetric clipping method that resolves gradient bias during fine-grained low-bit INT training, enabling nearly lossless performance for MXINT8 training. These results challenge the current hardware focus on FP formats, advocating for fine-grained INT formats, especially MXINT8, as a more balanced solution for accuracy, power, and efficiency in future AI accelerators.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2510.25602v1
- PDF Link: https://arxiv.org/pdf/2510.25602v1.pdf

The paper is available as a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The proliferation of Large Language Models (LLMs) has led to an exponential increase in their computational and memory demands. To make these models more efficient for deployment, quantization has become an indispensable technique. A significant challenge in quantizing LLMs, particularly those based on the Transformer architecture, is the presence of activation outliers—values with large magnitudes but infrequent occurrence. These outliers can severely degrade the performance of low-precision representations.
The AI hardware industry, exemplified by NVIDIA's Blackwell architecture, has largely responded to this challenge by pivoting towards low-precision floating-point (FP) formats (e.g., FP8, FP4). This trend is driven by FP's inherently superior dynamic range, which is believed to handle outliers more gracefully than traditional integer (INT) formats.
However, the authors argue that this industry-wide momentum towards FP is based on an incomplete picture. A systematic and unified comparison of FP and INT quantization across different granularities has been missing. Most existing studies tend to focus on a single format or compare them only at coarse granularities. Given that fine-grained (block-wise) quantization is now a standard technique for mitigating outliers and improving accuracy at low precision, understanding the interplay between different number formats and quantization granularity is crucial for effective algorithm-hardware co-design. The paper aims to fill this critical research gap.
2.2. Main Contributions / Findings
The paper makes several significant contributions:
- Performance Crossover Revelation: It reveals a critical performance crossover where FP formats hold a distinct advantage in coarse-grained scenarios, but INT formats become highly competitive as the block size shrinks (i.e., fine-grained quantization).
- Theoretical and Statistical Framework: The authors develop a theoretical and statistical framework to model the Quantization Signal-to-Noise Ratio (QSNR) for both INT and FP formats. This framework enables a direct theoretical comparison and clarifies the crossover points.
- MXINT8 Superiority: The study demonstrates that MXINT8 consistently outperforms MXFP8 in both direct-cast inference and low-bit training (8-bit settings). This is a strong finding challenging the FP-centric trend.
- NVINT4 with Outlier Mitigation: For 4-bit formats, while FP (e.g., MXFP4, NVFP4) often shows an initial accuracy advantage, the paper shows that NVINT4 can surpass NVFP4 when combined with outlier-mitigation techniques like Hadamard rotation.
- Symmetric Clipping for INT Training: A novel symmetric clipping method is introduced to resolve gradient bias in fine-grained low-bit INT training. This technique enables nearly lossless performance for MXINT8 training, addressing a critical challenge for INT formats in training contexts.
- Hardware Efficiency of INT: A comparative hardware cost analysis reveals that fine-grained INT formats are significantly more area- and energy-efficient than their floating-point counterparts at matched throughput.
- Challenge to Hardware Trajectory: Collectively, these findings challenge the prevailing FP-centric trajectory in AI hardware design. The paper advocates for prioritizing fine-grained INT formats to achieve a more optimal balance of accuracy and efficiency in future AI accelerators, asserting that a one-size-fits-all FP approach is suboptimal.
3. Prerequisite Knowledge & Related Work
This section provides foundational knowledge essential for understanding the paper's contribution.
3.1. Foundational Concepts
3.1.1. Large Language Models (LLMs)
Large Language Models (LLMs) are advanced artificial intelligence models, typically based on the Transformer architecture, designed to understand, generate, and process human language. They are characterized by their vast number of parameters (ranging from billions to trillions) and are trained on enormous datasets of text and code. Due to their size, deploying LLMs efficiently requires significant computational and memory resources.
3.1.2. Quantization
Quantization is a technique used in deep learning to reduce the precision of numerical representations (e.g., weights, activations) in neural networks, typically from 32-bit floating-point (FP32) to lower bit-widths like 8-bit integers (INT8) or 4-bit floating-point (FP4). The primary goals of quantization are to:
- Reduce Memory Footprint: Lower-precision numbers require less storage.
- Improve Computational Efficiency: Operations on lower-precision numbers are often faster and consume less energy on specialized hardware.
- Reduce Bandwidth Requirements: Less data needs to be moved between memory and compute units.
The challenge is to achieve these benefits while minimizing accuracy loss.
3.1.3. Integer (INT) Quantization
Integer (INT) quantization maps high-precision floating-point numbers to a finite set of integer values. This typically involves a scale factor and sometimes a zero-point (for asymmetric quantization) to map the floating-point range to the integer range.
The basic process often looks like this:
- Scaling: Divide the floating-point value by a scale factor.
- Rounding: Round the scaled value to the nearest integer.
- Clamping/Clipping: Ensure the integer falls within the target integer range (e.g., $[-127, 127]$ for symmetric INT8).
- Dequantization: To use the quantized values in computation or convert them back, they are multiplied by the scale factor.

The paper uses a symmetric INT quantization where values are centered around zero. For $b$-bit integer quantization, the general formula for quantization and dequantization is:
$
\mathbf{X}_q = \mathrm{clip}\left(\left\lfloor \frac{\mathbf{X}}{s} \right\rceil, Q_{\min}, Q_{\max}\right) \cdot s
$
Where:
- $\mathbf{X}$: The original high-precision tensor.
- $\mathbf{X}_q$: The dequantized (reconstructed) tensor after quantization.
- $s$: The scale factor used to normalize $\mathbf{X}$ to the target integer range.
- $\lfloor \cdot \rceil$: The round-to-nearest function, which rounds a number to the closest integer.
- $\mathrm{clip}(\cdot, Q_{\min}, Q_{\max})$: A function that clips (constrains) the value to be within the range $[Q_{\min}, Q_{\max}]$.
- $Q_{\min}$, $Q_{\max}$: The minimum and maximum representable integer values for the given bit-width $b$. For standard signed $b$-bit integers, these are typically $-2^{b-1}$ and $2^{b-1}-1$. However, the paper introduces a symmetric clipping method where $Q_{\min} = -(2^{b-1}-1)$ and $Q_{\max} = 2^{b-1}-1$ to avoid gradient bias, meaning an INT8 range would be $[-127, 127]$ instead of $[-128, 127]$.
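To make the mapping concrete, here is a minimal NumPy sketch of symmetric AbsMax INT quantize/dequantize following the formula above; the function name, the per-tensor granularity, and the random test data are illustrative choices, not the paper's implementation.

```python
import numpy as np

def int_quant_dequant(x: np.ndarray, bits: int = 8) -> np.ndarray:
    """Symmetric AbsMax INT quantization followed by dequantization."""
    q_max = 2 ** (bits - 1) - 1              # symmetric clipping: Q_max = 2^(b-1) - 1
    q_min = -q_max                           # Q_min = -(2^(b-1) - 1), e.g. [-127, 127] for INT8
    scale = np.max(np.abs(x)) / q_max        # AbsMax scale factor s
    x_int = np.clip(np.rint(x / scale), q_min, q_max)   # round-to-nearest, then clip
    return x_int * scale                     # dequantize back to high precision

x = np.random.randn(1024).astype(np.float32)
x_q = int_quant_dequant(x, bits=8)
print("max abs error:", float(np.max(np.abs(x - x_q))))
```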
3.1.4. Floating-Point (FP) Quantization
Floating-point (FP) quantization represents numbers using a sign bit, an exponent, and a mantissa. This format offers a wider dynamic range (the range of expressible values) compared to integers for the same number of bits, making it more robust to outliers (extremely large or small values). The exponent determines the magnitude, and the mantissa determines the precision.
A floating-point number is decoded as:
$
\mathbb{C}_{\mathrm{FP}} = \begin{cases} (-1)^s \times (1.m)_2 \times 2^{e-\mathrm{bias}} & \text{if } e \neq 0 \text{ (Normal)}, \\ (-1)^s \times (0.m)_2 \times 2^{1-\mathrm{bias}} & \text{if } e = 0,\, m \neq 0 \text{ (Subnormal)}, \end{cases}
$
Where:
- $s$: The sign bit (0 for positive, 1 for negative).
- $e$: The exponent value.
- $m$: The mantissa (or significand) value.
- bias: An offset applied to the exponent so that both very small and very large magnitudes can be represented.
- $(1.m)_2$: The value $1 + \text{fraction}$ for normal numbers, where the fraction is encoded by the $y$ mantissa bits. The "1." is an implicit leading bit.
- $(0.m)_2$: The corresponding value for subnormal numbers, which are used to represent numbers very close to zero without losing precision (at the cost of dynamic range).
- ExMy: A notation where $x$ is the number of exponent bits and $y$ is the number of mantissa bits. For example, E4M3 has 4 exponent bits and 3 mantissa bits.

Floating-point quantization is expressed as:
$
\mathbf{X}_q = \mathrm{Nearest}\left(\frac{\mathbf{X}}{s}, \mathbb{C}_{\mathrm{FP}}\right) \cdot s
$
Where:
- $\mathbf{X}$: The original high-precision tensor.
- $\mathbf{X}_q$: The dequantized tensor.
- $s$: The scale factor.
- $\mathrm{Nearest}(\cdot, \mathbb{C}_{\mathrm{FP}})$: A function that maps a normalized value to the nearest representable value in the set of low-bit floating-point values $\mathbb{C}_{\mathrm{FP}}$.
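As a concrete (brute-force) illustration of the Nearest mapping, the sketch below quantizes a tensor onto the E2M1 (FP4) value grid — {0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6}, cf. Table 1 — after AbsMax scaling. This is a reference implementation for intuition only, not how hardware performs the rounding.

```python
import numpy as np

# Representable E2M1 (FP4) magnitudes: 0, subnormal 0.5, normals 1..6 (cf. Table 1: max ±6, min ±0.5)
_E2M1_POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
E2M1_GRID = np.unique(np.concatenate([-_E2M1_POS, _E2M1_POS]))

def fp4_quant_dequant(x: np.ndarray) -> np.ndarray:
    """AbsMax scaling followed by round-to-nearest onto the E2M1 value grid."""
    scale = np.max(np.abs(x)) / E2M1_GRID.max()                # map AbsMax to the largest FP4 value (6)
    nearest = np.abs(x[:, None] / scale - E2M1_GRID[None, :]).argmin(axis=1)  # Nearest(., C_FP)
    return E2M1_GRID[nearest] * scale

x = np.random.randn(64).astype(np.float32)
print(np.round(fp4_quant_dequant(x), 3)[:8])
```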
3.1.5. Quantization Granularity
Quantization granularity refers to the scope over which a single scale factor (and zero-point if applicable) is applied. Finer granularity generally leads to better accuracy because it can adapt to local variations in data distribution, but it also increases the overhead (memory and computation) for storing and applying more scale factors.
Common granularities include:
- Per-tensor: A single scale factor is applied to the entire tensor. Simplest, but least accurate for diverse distributions.
- Per-channel: A scale factor is applied to each input or output channel of a layer. Common for weights.
- Block-k (Block-wise): The tensor is partitioned into smaller blocks (e.g., k consecutive elements), and each block has its own scale factor. This is a fine-grained quantization method and is the primary focus of this paper. For LLMs, this is particularly effective because activation outliers are often localized to small regions (a small sketch follows below).
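A minimal sketch of block-wise (block-k) symmetric INT8 quantization along the last dimension; the block size of 32 mirrors the MX convention, and the helper name is an illustrative assumption.

```python
import numpy as np

def blockwise_int8_dequant(x: np.ndarray, block: int = 32) -> np.ndarray:
    """Per-block symmetric INT8 quantize/dequantize along the last axis."""
    blocks = x.reshape(-1, block)                                  # partition into blocks of `block` elements
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0     # one AbsMax scale per block
    scales = np.where(scales == 0, 1.0, scales)                    # guard all-zero blocks
    q = np.clip(np.rint(blocks / scales), -127, 127)
    return (q * scales).reshape(x.shape)

x = np.random.randn(4, 128).astype(np.float32)
x_q = blockwise_int8_dequant(x, block=32)
print("round-trip MSE:", float(np.mean((x - x_q) ** 2)))
```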
3.1.6. Activation Outliers
Activation outliers are values with exceptionally large magnitudes that appear infrequently within the activation tensors of neural networks, particularly in LLMs based on the Transformer architecture. These outliers pose a significant challenge for low-precision quantization because they demand a very wide dynamic range to be represented accurately. If the quantization range is not wide enough, these outliers get clipped, leading to a large quantization error that can severely degrade model accuracy. Conversely, if the range is made wide enough to accommodate outliers, the precision for the majority of "normal" values might be reduced, also impacting accuracy.
3.1.7. Crest Factor (κ)
The crest factor (κ) is a dimensionless parameter that quantifies the ratio of the peak value to the effective value of a signal. In the context of quantization, it measures how "peaky" a data distribution is.
The paper defines the crest factor as:
$
\kappa := \frac{\max(|\mathbf{X}|)}{\sigma}
$
Where:
- $\max(|\mathbf{X}|)$: The maximum absolute value within a block of the tensor $\mathbf{X}$.
- $\sigma$: The root-mean-square (RMS) value (standard deviation for zero-mean data) of the block.

A higher crest factor indicates the presence of significant outliers that are much larger than the average magnitude of the values in the block. A lower crest factor suggests a more uniform distribution without extreme peaks. This metric is critical because it directly influences the required dynamic range for accurate quantization and helps determine whether INT or FP formats are more suitable.
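The crest factor of a block is easy to measure directly; a small sketch (assuming roughly zero-mean data, so RMS plays the role of σ):

```python
import numpy as np

def crest_factor(block: np.ndarray) -> float:
    """kappa = max(|X|) / RMS(X): how 'peaky' the block is."""
    rms = np.sqrt(np.mean(block.astype(np.float64) ** 2))
    return float(np.max(np.abs(block)) / rms)

smooth = np.random.randn(32)        # Gaussian block: crest factor is typically modest
spiky = smooth.copy()
spiky[0] = 50.0                     # inject one large activation outlier
print("smooth:", round(crest_factor(smooth), 2), " spiky:", round(crest_factor(spiky), 2))
```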
3.1.8. Quantization Signal-to-Noise Ratio (QSNR)
The Quantization Signal-to-Noise Ratio (QSNR) is a metric used to quantify the numerical fidelity of a quantized signal. It measures the ratio of the power of the original signal to the power of the quantization noise (the error introduced by quantization). A higher QSNR indicates that the quantized signal is a more faithful representation of the original, with less introduced error.
The formula for QSNR (in decibels, dB) is:
$
\mathrm{QSNR} = -10\log_{10}\left(\frac{\|\mathbf{X} - \mathbf{X}_q\|^2}{\|\mathbf{X}\|^2}\right)
$
Where:
- $\|\mathbf{X} - \mathbf{X}_q\|^2$: The squared Euclidean norm (sum of squared differences) of the quantization error (difference between the original tensor $\mathbf{X}$ and the dequantized tensor $\mathbf{X}_q$). This represents the power of the noise.
- $\|\mathbf{X}\|^2$: The squared Euclidean norm of the original tensor $\mathbf{X}$. This represents the power of the signal.
- $-10\log_{10}(\cdot)$: Converts the ratio to decibels, where a larger positive number indicates a better signal-to-noise ratio.
3.1.9. Hadamard Rotation
Hadamard rotation is an outlier-mitigation technique used in conjunction with quantization. It involves multiplying the input tensor by a Hadamard matrix, an orthogonal matrix with entries of +1 or -1. The effect of this transformation is to spread out the values of the input tensor, making the distribution more uniform and effectively reducing the crest factor. By reducing the crest factor, Hadamard rotation can make INT quantization more effective, as it minimizes the impact of extreme outliers that would otherwise demand a larger dynamic range. After the quantized operation, the inverse Hadamard rotation can be applied to recover the original distribution.
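A small sketch of the idea: an orthonormal Hadamard rotation (Sylvester construction here, an illustrative choice) spreads a localized outlier across the block and lowers the crest factor, while the inverse rotation recovers the original values exactly.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Sylvester-construction Hadamard matrix of size n (a power of two), normalized to be orthonormal."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

crest = lambda v: float(np.max(np.abs(v)) / np.sqrt(np.mean(v ** 2)))

x = np.random.randn(32)
x[3] = 40.0                      # a strong activation outlier
H = hadamard(32)
x_rot = H @ x                    # rotate before quantization; H.T @ x_rot recovers x exactly
print("crest factor before/after rotation:", round(crest(x), 2), round(crest(x_rot), 2))
```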
3.2. Previous Works
The paper contextualizes its work by referencing prior studies in quantization algorithms and hardware support.
- Quantization Algorithms: Previous research includes post-training quantization (PTQ) methods (e.g., [15, 20, 36, 41]), where quantization is applied after a model is fully trained, and quantization-aware training (QAT) [7, 23], which incorporates quantization effects into the training loop. Low-bit training [9, 27, 39] aims to train models directly using low-precision numbers for both forward and backward passes. Some works also explore scaling laws for quantization [5, 8, 16, 19].
  - Core Gap: The paper notes that most prior work focuses on a single low-bit format (either INT or FP) and lacks direct, systematic comparisons between them across varying granularities. While [45] studies mixed-format quantization in PTQ, it doesn't provide the unified INT vs. FP comparison the current paper undertakes.
- Hardware Support: Earlier AI accelerators [29, 30] typically did not natively support fine-grained quantization, posing challenges for algorithms dealing with outliers using per-channel quantization [6, 41]. More recently, the Microscaling (MX) data formats [34] were proposed, combining per-block scaling with a block size of 32 to enhance low-bit quantization. NVIDIA's Blackwell architecture [31] has incorporated native hardware support for MXFP8, MXFP4, and NVFP4, underscoring the industry's lean towards fine-grained floating-point formats.
3.3. Technological Evolution
The field of AI model deployment has seen a continuous drive towards efficiency. Initially, models primarily operated in FP32 or FP16. As models grew, BFloat16 (a 16-bit floating-point format with a wider exponent range than FP16) became popular for training, offering a good balance between precision and computational efficiency. INT8 quantization gained traction for inference due to its speed and memory benefits, but faced challenges with outliers, especially in LLMs.
To address outliers, the industry started favoring low-precision floating-point (FP) formats like FP8 and FP4, believing their superior dynamic range could handle the extreme values better. This led to hardware advancements like NVIDIA's Blackwell offering native support for MXFP formats.
This paper positions itself at a critical juncture in this evolution. It challenges the assumption that FP is universally superior for low-bit quantization, especially at fine-grained granularities. By systematically comparing INT and FP across various bit-widths and granularities, and by introducing new techniques like symmetric clipping for INT training and demonstrating the efficacy of Hadamard rotation for NVINT4, the paper suggests a re-evaluation of the current hardware trajectory. It advocates for fine-grained INT formats as a potentially more optimal solution for future AI accelerators, potentially shifting the focus back towards specialized integer hardware where appropriate.
3.4. Differentiation Analysis
The core differentiations and innovations of this paper's approach compared to main methods in related work are:
- Unified and Systematic Comparison: Unlike previous studies that often focus on a single low-bit format or conduct comparisons at coarse granularities, this paper provides a comprehensive, systematic, and unified comparison of FP and INT formats across varying granularities (specifically fine-grained block-wise quantization) and multiple bit-widths (8-bit, 6-bit, 4-bit).
- Identification of Performance Crossover: The paper uniquely identifies a critical performance crossover point where the relative advantage of FP and INT formats changes based on quantization granularity and crest factor. It shows FP excels coarsely, but INT becomes highly competitive finely.
- Introduction of Integer Counterparts for MX/NV Formats: To enable a direct comparison, the paper introduces and evaluates integer variants (e.g., MXINT8, MXINT6, MXINT4, NVINT4) that align with existing Microscaling (MX) and NVIDIA (NV) floating-point formats. This allows for a fair algorithm-hardware co-design perspective.
- Algorithmic and Hardware Efficiency Argument for MXINT8: The paper presents strong evidence that MXINT8 not only outperforms MXFP8 in algorithmic accuracy but also offers significant hardware efficiency benefits (area and energy reduction), challenging the perceived universal superiority of FP8.
- Enhanced NVINT4 with Outlier Mitigation: For 4-bit, where FP often holds an initial edge, the paper demonstrates that NVINT4 can surpass NVFP4 by integrating an outlier-mitigation technique (Hadamard rotation), showcasing the potential of combining INT with algorithmic enhancements.
- Novel Symmetric Clipping Method for INT Training: A practical innovation is the introduction of a symmetric clipping method to resolve gradient bias in fine-grained low-bit INT training. This addresses a specific limitation of INT formats and enables nearly lossless MXINT8 training, which was previously a domain primarily explored for FP8.
- Comprehensive Trade-off Analysis: The study integrates theoretical QSNR analysis, tensor-wise analysis on real LLM data, direct-cast inference on diverse LLMs, low-bit training results, and hardware cost modeling, providing a holistic view of the trade-offs.

In essence, the paper moves beyond simply demonstrating quantization, instead offering a deep, comparative dive that questions prevailing industry assumptions and provides concrete guidance for future AI accelerator design.
4. Methodology
This section details the technical solutions proposed and evaluated in the paper. The core idea is to systematically compare low-bit integer (INT) and floating-point (FP) quantization formats across different granularities, bit-widths, and operational contexts (inference and training), supported by theoretical analysis and hardware cost modeling.
4.1. Principles
The fundamental principle driving this research is the hypothesis that while floating-point (FP) formats might be advantageous for coarse-grained quantization due to their superior dynamic range in handling widespread outliers, the landscape could shift significantly at fine-grained granularities. As quantization granularity becomes finer (i.e., smaller block sizes), the local dynamic range within each block is reduced. This reduced local variation might diminish FP's advantage, allowing integer (INT) formats, with their uniform precision and simpler hardware implementation, to become highly competitive or even superior, especially when combined with appropriate outlier-mitigation techniques and training stability measures. The paper aims to rigorously test this hypothesis through a comprehensive theoretical, empirical, and hardware-centric study.
4.2. Core Methodology In-depth
4.2.1. Low-Precision Integer Formats
For $b$-bit integer quantization, the paper defines the quantization and dequantization process as follows:
$
\mathbf{X}_q = \mathrm{clip}\left(\left\lfloor \frac{\mathbf{X}}{s} \right\rceil, Q_{\min}, Q_{\max}\right) \cdot s
$
Where:
- $\mathbf{X}$: Represents the high-precision input tensor (e.g., BFloat16).
- $\mathbf{X}_q$: Represents the dequantized output tensor, an approximation of $\mathbf{X}$ after being mapped to a low-bit integer representation and then converted back to high precision.
- $s$: Is the scale factor. This floating-precision value is crucial as it determines the mapping between the high-precision range of $\mathbf{X}$ and the fixed integer range. It is typically calculated to cover the range of $\mathbf{X}$ without excessive clipping.
- $\lfloor \cdot \rceil$: Denotes the round-to-nearest function. This operation takes a floating-point number and rounds it to the closest integer.
- $\mathrm{clip}(\cdot, Q_{\min}, Q_{\max})$: This function performs clipping. It ensures that the integer value, after rounding, stays within the defined range $[Q_{\min}, Q_{\max}]$. If the value is less than $Q_{\min}$, it becomes $Q_{\min}$; if it's greater than $Q_{\max}$, it becomes $Q_{\max}$.
- $Q_{\min}$, $Q_{\max}$: These define the minimum and maximum representable integer values for a given bit-width $b$. For standard signed $b$-bit integers using two's complement, $Q_{\min} = -2^{b-1}$ and $Q_{\max} = 2^{b-1} - 1$. For example, for INT8, this would be $[-128, 127]$.

The paper, however, introduces a crucial modification for INT formats: symmetric clipping. They find that the standard asymmetric range (e.g., $[-128, 127]$ for INT8) can degrade INT8 training due to a persistent negative bias in gradients. To resolve this, they enforce a symmetric integer range for all INT quantizers:
$
Q_{\min} = -(2^{b-1} - 1), \quad Q_{\max} = 2^{b-1} - 1
$
For INT8, this translates to $Q_{\min} = -127$ and $Q_{\max} = 127$. This means the range is $[-127, 127]$, sacrificing one negative value to ensure symmetry around zero.
4.2.2. Low-Precision Floating-Point Formats
Floating-point (FP) representation is characterized by a sign bit (S), an exponent (E), and a mantissa (M). The paper uses the ExMy notation, where $x$ is the number of exponent bits and $y$ is the number of mantissa bits.
A floating-point number is decoded according to the following formula:
$
\mathbb{C}_{\mathrm{FP}} = \begin{cases} (-1)^s \times (1.m)_2 \times 2^{e-\mathrm{bias}} & \text{if } e \neq 0 \text{ (Normal)}, \\ (-1)^s \times (0.m)_2 \times 2^{1-\mathrm{bias}} & \text{if } e = 0,\, m \neq 0 \text{ (Subnormal)}, \end{cases}
$
Where:
- $\mathbb{C}_{\mathrm{FP}}$: Represents the set of all representable low-bit floating-point values for a given format.
- $s$: Is the value of the sign bit. If $s = 0$, the number is positive; if $s = 1$, it is negative.
- $(1.m)_2$: Represents the mantissa (or significand) for normal numbers. It implicitly has a leading '1', meaning the value is one plus the fraction encoded by the mantissa bits.
- $2^{e-\mathrm{bias}}$: Is the exponent part. $e$ is the raw exponent value, and bias is an offset specific to the FP format that allows representation of both very small and very large numbers.
- Normal: Refers to floating-point numbers where the exponent field is not all zeros or all ones. These numbers have an implicit leading '1' in their mantissa.
- $(0.m)_2$: Represents the mantissa for subnormal numbers. In this case, the implicit leading bit is '0', allowing representation of numbers even closer to zero than the smallest normal number, at the cost of precision.
- Subnormal: Refers to floating-point numbers where the exponent field is all zeros, but the mantissa is not all zeros.

The general form for floating-point quantization is given by:
$
\mathbf{X}_q = \mathrm{Nearest}\left(\frac{\mathbf{X}}{s}, \mathbb{C}_{\mathrm{FP}}\right) \cdot s
$
Where:
- $\mathrm{Nearest}(\mathbf{X}/s, \mathbb{C}_{\mathrm{FP}})$: This function takes a high-precision value (normalized by $s$) and maps it to the closest representable value within the discrete set of low-bit floating-point numbers $\mathbb{C}_{\mathrm{FP}}$.
4.2.3. Quantization Granularity
The paper primarily focuses on block quantization, a fine-grained approach. In this scheme, a tensor is divided into smaller blocks, and each block receives its own scale factor. This allows the quantization range to adapt more precisely to local data distributions, which is crucial for mitigating the impact of outliers common in LLMs.
4.2.4. Block-Quantization Formats
The paper compares standard formats and introduces custom integer variants for a fair comparison. The formats are derived from Microscaling (MX) and NVIDIA (NV) specifications.
- MX Formats:
  - Block Size: 32 elements.
  - Scale Type: UE8M0 (Unsigned Exponent 8, Mantissa 0) for the scale factor. This means the scale factor itself is represented in a low-precision floating-point format that primarily offers dynamic range (via 8 exponent bits) with minimal precision (0 mantissa bits).
  - Variants: MXFP8 (E4M3), MXFP6 (E2M3), MXFP4 (E2M1), and their integer counterparts MXINT8, MXINT6, MXINT4. The FP variants prioritize mantissa bits for precision given the fine-grained context.
- NV Formats:
  - Block Size: 16 elements (finer than MX).
  - Scale Type: E4M3 for the first-level scale, and an FP32 second-level per-tensor scale to prevent overflow. E4M3 (Exponent 4, Mantissa 3) offers more precision for the scale factor compared to UE8M0.
  - Variants: NVFP4 and its integer counterpart NVINT4 (a small sketch of this two-level scaling follows Table 1).

The following table summarizes the block formats studied, including the integer variants introduced by the paper:
The following are the results from Table 1 of the original paper:
| Format | Block Size | Max Value | Min Value | Dynamic Range | Scale-1 | Scale-2 |
|---|---|---|---|---|---|---|
| MXFP8 (E4M3) | 32 | ±448 | ±2^-9 | 1.75 × 2^17 | UE8M0 | - |
| MXINT8 | 32 | ±127 | ±1 | 127 | UE8M0 | - |
| MXFP6 (E2M3) | 32 | ±7.5 | ±0.125 | 60 | UE8M0 | - |
| MXINT6 | 32 | ±31 | ±1 | 31 | UE8M0 | - |
| MXFP4 (E2M1) | 32 | ±6 | ±0.5 | 12 | UE8M0 | - |
| MXINT4 | 32 | ±7 | ±1 | 7 | UE8M0 | - |
| NVFP4 | 16 | ±6 | ±0.5 | 12 | E4M3 | FP32 |
| NVINT4 | 16 | ±7 | ±1 | 7 | E4M3 | FP32 |
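For intuition about the NV-style two-level scaling in Table 1, here is a rough NumPy sketch of NVINT4: a per-tensor FP32 scale keeps every per-block scale within the E4M3 range, and each 16-element block then stores an INT4 payload. The `to_e4m3` rounding is a simplified model of the E4M3 grid, and all names are illustrative assumptions rather than the paper's code.

```python
import numpy as np

def to_e4m3(s: float) -> float:
    """Round a positive scale onto an E4M3-like grid (3 mantissa bits, max 448); a simplified model."""
    s = float(np.clip(s, 2.0 ** -9, 448.0))        # clamp to the representable range
    e = max(np.floor(np.log2(s)), -6.0)            # exponent, with a subnormal floor at 2^-6
    step = 2.0 ** e / 8.0                          # 3 mantissa bits -> 8 steps per binade
    return float(np.round(s / step) * step)

def nvint4_dequant(x: np.ndarray, block: int = 16) -> np.ndarray:
    """NVINT4-style two-level scaling: FP32 per-tensor scale + E4M3 per-block scale, INT4 payload (Q_max = 7)."""
    tensor_scale = np.max(np.abs(x)) / (448.0 * 7.0)       # second-level FP32 scale keeps block scales <= 448
    blocks = (x / tensor_scale).reshape(-1, block)
    out = np.empty_like(blocks)
    for i, b in enumerate(blocks):
        s = to_e4m3(np.max(np.abs(b)) / 7.0 + 1e-12)        # first-level E4M3 block scale
        out[i] = np.clip(np.rint(b / s), -7, 7) * s
    return (out * tensor_scale).reshape(x.shape)

x = np.random.randn(8, 64).astype(np.float32)
print("NVINT4 round-trip MSE:", float(np.mean((x - nvint4_dequant(x)) ** 2)))
```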
4.2.5. Quantization Compute Flow
The paper illustrates the computation flow for low-bit inference and training using a linear layer as an example. This flow dictates where and when quantization operations occur for weights, activations, and their gradients.
The following figure (Figure 1 from the original paper) shows the compute flow of low-bit forward and backward propagation of linear layer:
Figure 1 Compute flow of low-bit forward and backward propagation of linear layer.
Given high-precision (e.g., BFloat16) activations $\mathbf{X}$ and weights $\mathbf{W}$, the forward pass of a quantized linear layer computes the output $\mathbf{Y}$:
$
\mathbf{Y} = \underbrace{\mathrm{Quantize}(\mathbf{X})}_{\textcircled{1}}\;\underbrace{\mathrm{Quantize}(\mathbf{W})}_{\textcircled{2}}
$
Here:
- $\mathrm{Quantize}(\mathbf{X})$ (①): Represents the quantization of the input activations $\mathbf{X}$.
- $\mathrm{Quantize}(\mathbf{W})$ (②): Represents the quantization of the weights $\mathbf{W}$.
The output $\mathbf{Y}$ would then be computed using low-precision General Matrix Multiply (GEMM) operations.
The backward pass involves computing gradients for activations ($d\mathbf{X}$) and weights ($d\mathbf{W}$).
To compute $d\mathbf{X}$:
$
d\mathbf{X} = \underbrace{\mathrm{Quantize}(d\mathbf{Y})}_{\textcircled{3}}\;\underbrace{\mathrm{Quantize}(\mathbf{W}^T)}_{\textcircled{4}}
$
Here:
- $\mathrm{Quantize}(d\mathbf{Y})$ (③): Represents the quantization of the gradient of the output $d\mathbf{Y}$.
- $\mathrm{Quantize}(\mathbf{W}^T)$ (④): Represents the quantization of the transpose of the weights $\mathbf{W}$.
To compute $d\mathbf{W}$:
$
d\mathbf{W} = \underbrace{\mathrm{Quantize}(\mathbf{X}^T)}_{\textcircled{5}}\;\underbrace{\mathrm{Quantize}(d\mathbf{Y}^T)}_{\textcircled{6}}
$
Here:
- $\mathrm{Quantize}(\mathbf{X}^T)$ (⑤): Represents the quantization of the transpose of the input activations $\mathbf{X}$.
- $\mathrm{Quantize}(d\mathbf{Y}^T)$ (⑥): Represents the quantization of the transpose of the gradient of the output $d\mathbf{Y}$.
In total, there are six quantization operations in one linear layer during training. The paper notes that for block-wise quantization, tensors must be quantized along the GEMM reduction dimension to gain hardware benefits. This means the quantization axes used for computing $\mathbf{Y}$ (① and ②), $d\mathbf{X}$ (③ and ④), and $d\mathbf{W}$ (⑤ and ⑥) are different.
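The six quantization points can be sketched as plain NumPy matrix products. The stand-in `quantize` helper blocks along the last axis, which is arranged to be the GEMM reduction dimension in each of the three products; the operand transposes follow standard linear-layer gradients for Y = X Wᵀ and are an assumption about conventions, not a transcription of Figure 1.

```python
import numpy as np

def quantize(t: np.ndarray, block: int = 32, q_max: int = 127) -> np.ndarray:
    """Stand-in block quantizer: symmetric AbsMax INT along the last (reduction) axis."""
    b = t.reshape(-1, block)
    s = np.abs(b).max(axis=1, keepdims=True) / q_max
    s = np.where(s == 0, 1.0, s)
    return (np.clip(np.rint(b / s), -q_max, q_max) * s).reshape(t.shape)

N, K, M = 64, 128, 32                                   # tokens, in-features, out-features
X, W, dY = np.random.randn(N, K), np.random.randn(M, K), np.random.randn(N, M)

# Forward:         Y  = Quantize(X)    @ Quantize(W)^T     (ops 1, 2; reduce over K)
Y = quantize(X) @ quantize(W).T
# Activation grad: dX = Quantize(dY)   @ Quantize(W^T)^T   (ops 3, 4; reduce over M)
dX = quantize(dY) @ quantize(W.T).T
# Weight grad:     dW = Quantize(dY^T) @ Quantize(X^T)^T   (ops 5, 6; reduce over N)
dW = quantize(dY.T) @ quantize(X.T).T
print(Y.shape, dX.shape, dW.shape)   # (64, 32) (64, 128) (32, 128)
```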
4.2.6. Quantization Operation: Scale Factor Computation
The scale factor is crucial for both INT and FP quantization. The paper employs the AbsMax quantizer approach, where $s$ is computed to map the maximum absolute value in a group to the maximum representable low-precision value.
The initial scale factor is calculated as:
$
s = \frac{\mathrm{AbsMax}(\mathbf{X})}{Q_{\max}}
$
Where:
- $\mathrm{AbsMax}(\mathbf{X})$: Is the maximum absolute value within the group of values that share a single scale factor (e.g., within a block).
- $Q_{\max}$: Is the maximum value of the target quantized type (refer to Table 1 for specific formats). This ensures that no value greater than $Q_{\max}$ is represented.

For MX formats, the high-precision scale factor is further converted to the UE8M0 format. The conventional approach (as used by OCP [34]) involves rounding down:
$
s' = 2^{\mathrm{clip}\left(\lfloor \log_2(\mathrm{AbsMax}(\mathbf{X})) \rfloor - \lfloor \log_2(Q_{\max}) \rfloor,\ -127,\ 127\right)}
$
Where:
- $s'$: The UE8M0 quantized scale factor.
- $\lfloor \cdot \rfloor$: The floor function (rounding down).

This approach can introduce extra clipping error because rounding down might make the effective scale too small, causing values to exceed the maximum representable range.
Following existing work, the paper adopts a strategy to round up the UE8M0 scale to avoid this clipping error:
$
s' = 2^{\mathrm{clip}\left(\lceil \log_2(s) \rceil,\ -127,\ 127\right)}
$
Where:
- $\lceil \cdot \rceil$: Denotes the ceiling function (rounding up). This ensures that the effective range is always sufficient to cover the AbsMax value, preventing overflow, although it might slightly reduce precision for smaller values.
4.2.7. Quantization Operation: Symmetric Clipping
As previously mentioned, the paper identifies a problem with asymmetric integer ranges (like $[-128, 127]$ for INT8) during low-bit training. This asymmetry leads to a persistent negative bias in gradients, especially pronounced in fine-grained quantization, where more values might map to the unique negative endpoint (e.g., -128).
The following figure (Figure 2 from the original paper) shows the impact of clipping range on INT8 final training loss on 145M model with 20B training tokens:
Figure 2 Impact of clipping range on INT8 final training loss on 145M model with 20B training tokens. Scale factor is kept on BF16 to emphasize the harm of asymmetric representation space during low-bit training.
Figure 2 clearly illustrates that using the asymmetric range $[-128, 127]$ for INT8 results in a higher (worse) final training loss compared to the symmetric range $[-127, 127]$. This degradation is more severe for finer granularities (smaller block sizes, like block 32), as more individual quantization blocks increase the probability of mapping values to the problematic Q_min.
To mitigate this, the paper mandates the use of a symmetric integer range for all INT quantizers, as shown in Table 1:
$
Q_{\min} = -(2^{b-1} - 1), \quad Q_{\max} = 2^{b-1} - 1
$
This adjustment ensures that the integer range is balanced around zero, preventing the gradient bias and enabling more stable low-bit INT training.
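The asymmetry problem can be illustrated with a toy round-trip experiment: when the AbsMax scale targets the |−128| endpoint (one common convention, assumed here), positive extremes get clipped to 127 and the mean error drifts negative; the symmetric [−127, 127] range removes that drift. This is only an intuition-level demo, not the paper's gradient-bias analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

def roundtrip_bias(q_lo: int, q_hi: int, n_blocks: int = 20000, block: int = 32) -> float:
    """Mean signed round-trip error of AbsMax block quantization for a given integer range."""
    x = rng.standard_normal((n_blocks, block))
    # Assumption: the scale maps AbsMax onto the larger end of the range (|q_lo| for [-128, 127]).
    s = np.abs(x).max(axis=1, keepdims=True) / max(abs(q_lo), q_hi)
    q = np.clip(np.rint(x / s), q_lo, q_hi)
    return float(np.mean(q * s - x))

print("asymmetric [-128, 127] mean error:", roundtrip_bias(-128, 127))   # slightly negative
print("symmetric  [-127, 127] mean error:", roundtrip_bias(-127, 127))   # ~0
```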
4.2.8. Theoretical Framework: QSNR Metric
To quantitatively compare the numerical fidelity of different quantization schemes, the paper uses the Quantization Signal-to-Noise Ratio (QSNR), measured in decibels (dB).
$
\mathrm{QSNR} = -10\log_{10}\left(\frac{\|\mathbf{X} - \mathbf{X}_q\|^2}{\|\mathbf{X}\|^2}\right)
$
Where:
- $\|\mathbf{X} - \mathbf{X}_q\|^2$: Represents the power of the quantization noise, which is the squared Euclidean norm of the difference between the original signal $\mathbf{X}$ and the dequantized signal $\mathbf{X}_q$.
- $\|\mathbf{X}\|^2$: Represents the power of the original signal, which is the squared Euclidean norm of $\mathbf{X}$.

A higher QSNR value indicates a better preservation of the original signal's magnitude and direction, meaning lower quantization error.
4.2.9. Theoretical Framework: Common Assumptions for QSNR Derivation
For deriving the theoretical QSNR expressions, the paper makes several common assumptions:
- Block Vectors: They consider block vectors $\mathbf{X} \in \mathbb{R}^g$ (where $g$ is the block size).
- I.I.D. Entries: The entries within each block are assumed to be independent and identically distributed (i.i.d.) and follow a Gaussian distribution $\mathcal{N}(0, \sigma^2)$, meaning they have a mean of 0 and a variance of $\sigma^2$.
- Block RMS: The block root-mean-square (RMS) value is approximated by $\sigma$.
- Crest Factor: The crest factor is defined as the ratio of the maximum absolute value in the block to its RMS:
$
\kappa := \frac{\max(|\mathbf{X}|)}{\sigma}
$
- Blockwise AbsMax Scaling: The scale factor used for quantization is derived from blockwise absolute-maximum (AbsMax) scaling. The ideal scale $s$ matches the largest magnitude in the block to the maximum representable value $Q_{\mathrm{ref}}$ of the low-precision format:
$
s = \frac{\max(|\mathbf{X}|)}{Q_{\mathrm{ref}}}
$
Where $Q_{\mathrm{ref}}$ is $2^{b-1}-1$ for symmetric INT($b$) and the largest finite magnitude for FP(E,M,B).
- Scale Overhead: The actual scale $s'$ is related to the ideal scale $s$ by a factor $\rho$:
$
s' = \rho s
$
  - For the UE8M0 scale (used in MX formats), $\rho$ models the overhead due to rounding the scale up to a power of two, so $\rho \in [1, 2)$.
  - For the E4M3 scale (used in NV formats), $\rho \approx 1$ is assumed because this scale closely matches the ideal value, introducing minimal overhead.
4.2.10. Theoretical Framework: Theorem 1 (INT QSNR)
Under the described assumptions, the QSNR for $b$-bit INT quantization is derived.
The INT QSNR (in dB) is:
$
\mathrm{QSNR}_{\mathrm{INT}} \approx \begin{cases} 4.78 + 6.02\,b - 20\log_{10}(\rho) - 20\log_{10}(\kappa), & \text{UE8M0 scale} \\ 4.78 + 6.02\,b - 20\log_{10}(\kappa) + 10\log_{10}\!\left(\frac{g}{g-1}\right), & \text{E4M3 scale} \end{cases}
$
Where:
- $b$: Is the bit width of the integer format.
- $\rho$: Is the scale overhead factor (for the UE8M0 scale, $\rho \in [1, 2)$; for the E4M3 scale, $\rho \approx 1$).
- $\kappa$: Is the crest factor of the data in the block.
- $g$: Is the block size (number of elements per block).
- Interpretation:
  - Each additional bit provides approximately 6.02 dB gain in QSNR.
  - The UE8M0 scale introduces a penalty of up to $20\log_{10}(\rho)$ (at most 6.02 dB as $\rho$ approaches 2).
  - A larger crest factor $\kappa$ (more prominent outliers) reduces QSNR. Smaller blocks generally have a smaller $\kappa$, thus improving QSNR.
  - The E4M3 scale (like in NVINT4) avoids the $\rho$ overhead and benefits from a $10\log_{10}\!\left(\frac{g}{g-1}\right)$ gain, accounting for the near-error-free mapping of the block's maximum value.
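Theorem 1 is a closed-form expression and can be evaluated directly; the sketch below computes INT curves of the kind plotted in Figure 3 (the ρ value for the UE8M0 case is an arbitrary illustrative choice within [1, 2)).

```python
import numpy as np

def qsnr_int_db(b: int, kappa: np.ndarray, scale: str = "UE8M0",
                rho: float = 1.5, g: int = 32) -> np.ndarray:
    """Theorem 1: theoretical QSNR (dB) of b-bit INT with blockwise AbsMax scaling."""
    base = 4.78 + 6.02 * b - 20.0 * np.log10(kappa)
    if scale == "UE8M0":                              # power-of-two scale pays the rho overhead
        return base - 20.0 * np.log10(rho)            # rho in [1, 2); 1.5 is an arbitrary illustrative value
    return base + 10.0 * np.log10(g / (g - 1))        # E4M3 scale: near-exact mapping of the block max

kappa = np.linspace(1.5, 12.0, 5)
print("MXINT8-style:", np.round(qsnr_int_db(8, kappa, scale="UE8M0", g=32), 2))
print("NVINT4-style:", np.round(qsnr_int_db(4, kappa, scale="E4M3", g=16), 2))
```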
4.2.11. Theoretical Framework: Theorem 2 (FP QSNR)
For floating-point quantization, the QSNR (in dB) is derived. This derivation considers the different error contributions from normal and subnormal regions of the FP representation.
The FP QSNR (in dB) is:
$
\mathrm{QSNR}_{\mathrm{FP}} \approx \begin{cases} -10\log_{10}\!\left(\alpha_M\, w_{\mathrm{norm}} + \beta(\rho\kappa)^2\, p_{\mathrm{sub}}\right), & \text{UE8M0 scale} \\ -10\log_{10}\!\left(\alpha_M\left(w_{\mathrm{norm}} - \frac{\kappa^2}{g}\right) + \beta\kappa^2\, p_{\mathrm{sub}}\right), & \text{E4M3 scale} \end{cases}
$
Where the auxiliary terms are defined as:
- $\alpha_M$: Represents the mantissa quantization error term; $M$ is the mantissa bit width.
- $\beta$: Related to the subnormal step error; $B$ is the exponent bias of the format.
- The largest finite normal magnitude of the target FP format (e.g., 448 for E4M3) also enters the derivation.
- $w_{\mathrm{norm}}$: Is the fraction of signal energy carried by normal FP numbers. It measures how much of the distribution falls into the normal region.
- $p_{\mathrm{sub}}$: Is the probability that a value encodes as subnormal. It measures how much of the distribution falls into the subnormal region.
- $\rho$: Is the scale overhead factor (for the UE8M0 scale, $\rho \in [1, 2)$; for the E4M3 scale, $\rho \approx 1$).
- $\kappa$: Is the crest factor.
- $g$: Is the block size.
- Interpretation:
  - The mantissa bit width is crucial, setting an upper bound on FP QSNR. If there is ample dynamic range ($w_{\mathrm{norm}} \to 1$ and $p_{\mathrm{sub}} \to 0$), QSNR approaches $-10\log_{10}(\alpha_M)$ dB, largely independent of block specifics or data distribution.
  - A larger crest factor $\kappa$ increases the share of subnormals ($p_{\mathrm{sub}}$) and reduces QSNR. Finer-grained blocks (smaller $g$) tend to reduce $\kappa$, lower $p_{\mathrm{sub}}$, and thus improve QSNR.
  - The E4M3 scale (like in NVFP4) has no $\rho$ overhead and accounts for the per-block maximum value being mapped with minimal error, which reduces the effective error energy in the normal region by the $\kappa^2/g$ term.
4.2.12. Theoretical Framework: Theoretical Comparisons
Using the derived QSNR formulas (Eq. 13 and Eq. 14), the paper theoretically compares INT and FP formats by plotting QSNR against the crest factor $\kappa$. A key finding is that the crest factor is the primary determinant of which format performs better.
The following figure (Figure 3 from the original paper) shows the theoretical QSNR comparison between various integer (INT) and floating-point (FP) formats across a range of crest factors $\kappa$:
Figure 3 Theoretical QSNR comparison between various integer (INT) and floating-point (FP) formats across a range of crest factors $\kappa$, derived from Eq. (13) and Eq. (14). The boxes represent the crest factor and QSNR of the crossover point of the INT and FP curves.
Figure 3 illustrates distinct crossover points for different bit-widths:
- MXINT8 vs. MXFP8: MXINT8 outperforms MXFP8 when the crest factor is below the crossover point ($\kappa < 7.55$). MXFP8's QSNR is relatively constant due to its large dynamic range and mantissa-bit bound.
- MXINT6 vs. MXFP6: MXFP6 initially performs similarly to MXFP8 (both have three mantissa bits), but its QSNR drops rapidly as $\kappa$ increases due to a more limited dynamic range. MXINT6 only surpasses MXFP6 below a much smaller crossover crest factor.
- MXINT4 vs. MXFP4: MXINT4 beats MXFP4 only when the crest factor falls below its (small) crossover point.
- NVINT4 vs. NVFP4: NVINT4 wins when the crest factor is below the crossover shown in Figure 3. An interesting observation is that NVFP4's QSNR can increase as $\kappa$ grows within a certain range, because there the normal-domain error dominates, and a larger $\kappa$ can reduce this component, even as it increases the subnormal error.

These theoretical insights highlight that the decision between INT and FP is not absolute but depends critically on the data's crest factor and the specific bit-width.
4.2.13. Hardware Cost Modeling
The paper includes a hardware cost analysis to compare the area and energy efficiency of INT and FP formats. This analysis focuses on the core Matrix-Multiply Unit (MMU) components.
The model is based on the following:
- Components Modeled: Multiply-and-Accumulate (MAC) unit, dequantizer, and FP32 accumulator. The quantizer is explicitly excluded from cost accounting.
- Accumulation: FP32 accumulation is chosen to prevent error growth and preserve scalability.
- MAC Unit Differences: FP multipliers are generally more area/energy efficient than INT multipliers, but FP adders are more complex and expensive than INT adders due to requirements like exponent comparison, mantissa alignment, and normalization.
- Mantissa Aligner Width ($n$): A crucial parameter that affects both numerical fidelity and hardware complexity for FP operations. It's defined as:
$
n = \min\left(2^{x+1} + 2y,\ \mathrm{psum\_bit\_width}\right)
$
Where:
  - $x$: Is the number of exponent bits.
  - $y$: Is the number of mantissa bits (INT formats have no exponent bits, i.e., $x = 0$).
  - psum_bit_width: A cap, set to 24 in this evaluation.
  - This ensures the aligner is wide enough but doesn't exceed the accumulator's precision.
- MAC Unit Structure: Modeled as a $k$-lane array (e.g., $k = 32$ for MX, $k = 16$ for NV, matching the block sizes). Each lane has a multiplier, and adders are fused into a multi-input adder tree with FP-specific logic. A single normalizer is shared across MAC lanes to reduce cost.

The following are the results from Table 6 of the original paper:
| Sub-block | INT Mul | FP Mul | INT Add | FP Add | Main Cells |
|---|---|---|---|---|---|
| Multiplier | k(x+y+1)² | k(y+1)² | | | AND, FA, HA |
| Adder (mantissa/int) | | | 2k(x+y+1) | kn | FA, HA |
| Exponent adder | | kx | | − | FA, HA |
| Exponent subtractor | | | | kx | XOR, FA, HA |
| Comparator | | | | kx | XOR, AND, OR |
| Aligner (barrel) | | | | k·n·log₂(n) | MUX |
| Normalizer (shared) | | | | n·log₂(n) | MUX, OR |
Table 6 provides a gate-complexity model for the MAC Unit's sub-blocks. For instance, an INT multiplier scales with the square of the total bit-width ($k(x+y+1)^2$), while an FP multiplier scales with the square of the mantissa bit-width ($k(y+1)^2$). FP adders involve more complex components like exponent adders/subtractors, comparators, and a barrel aligner that scales with $k\,n\log_2 n$, where $n$ is the aligner width. Main Cells (AND, FA, HA, XOR, OR, MUX) are standard logic gates.
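A tiny calculator for the aligner width n and the Table 6 multiplier terms makes the INT/FP asymmetry visible (format parameters follow the ExMy definitions; treating INT-b as x = 0, y = b − 1 is an assumption for illustration).

```python
def aligner_width(x: int, y: int, psum_bit_width: int = 24) -> int:
    """n = min(2^(x+1) + 2y, psum_bit_width); x = exponent bits, y = mantissa bits."""
    return min(2 ** (x + 1) + 2 * y, psum_bit_width)

def multiplier_gates(x: int, y: int, k: int, is_fp: bool) -> int:
    """Table 6 multiplier term: k(x+y+1)^2 for INT lanes, k(y+1)^2 for FP lanes."""
    return k * (y + 1) ** 2 if is_fp else k * (x + y + 1) ** 2

formats = {                      # name: (exponent bits x, mantissa bits y, lanes k, is_fp)
    "INT8":       (0, 7, 32, False),
    "FP8 (E4M3)": (4, 3, 32, True),
    "INT4":       (0, 3, 32, False),
    "FP4 (E2M1)": (2, 1, 32, True),
}
for name, (x, y, k, is_fp) in formats.items():
    line = f"{name:11s} multiplier gate term = {multiplier_gates(x, y, k, is_fp):5d}"
    if is_fp:
        line += f", aligner n = {aligner_width(x, y)}"
    print(line)
```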
- Area and Energy Aggregation: The total Area and Energy for each component (MAC, ACC32, DEQ) are calculated by summing the contributions of individual logic gates (e.g., FA for Full Adder, HA for Half Adder, MUX for Multiplexer), weighted by their technology-dependent area ($A_g$) and energy ($E_g$) factors, and a toggle rate ($\tau_g$).
  - MAC Unit cost:
$
\mathrm{Area}_{\mathrm{MAC}} = \sum_{s \in S} \sum_{g \in \mathcal{G}} c_{s,g}(x, y, k, n)\, A_g, \qquad \mathrm{Energy}_{\mathrm{MAC}} = \sum_{s \in S} \sum_{g \in \mathcal{G}} c_{s,g}(x, y, k, n)\, E_g \tau_g
$
    Where $S$ is the set of sub-block types, $\mathcal{G}$ is the set of cell types, $c_{s,g}$ is the count of cell $g$ in sub-block $s$, and $\tau_g$ is the toggle rate.
  - FP32 Accumulator (ACC32) cost:
$
\mathrm{Area}_{\mathrm{ACC32}} = \sum_{g \in \mathcal{G}} c_g^{\mathrm{ACC32}} A_g, \qquad \mathrm{Energy}_{\mathrm{ACC32}} = \sum_{g \in \mathcal{G}} c_g^{\mathrm{ACC32}} E_g \tau_g
$
  - Dequantizer (DEQ) cost:
$
\mathrm{Area}_{\mathrm{DEQ}} = \sum_{g \in \mathcal{G}} c_g^{\mathrm{DEQ}} A_g, \qquad \mathrm{Energy}_{\mathrm{DEQ}} = \sum_{g \in \mathcal{G}} c_g^{\mathrm{DEQ}} E_g \tau_g
$
- Total MMU Cost:
$
\mathrm{Area}_{\mathrm{MMU}} = \mathrm{Area}_{\mathrm{MAC}} + \mathrm{Area}_{\mathrm{DEQ}} + \mathrm{Area}_{\mathrm{ACC32}}, \qquad \mathrm{Energy}_{\mathrm{MMU}} = \mathrm{Energy}_{\mathrm{MAC}} + \mathrm{Energy}_{\mathrm{DEQ}} + \mathrm{Energy}_{\mathrm{ACC32}}
$
- Reuse Schemes: The paper also considers mixed-format configurations by evaluating different MAC unit configurations that allow reuse of hardware for different bit-widths. For example, an INT8 lane could be reconfigured for INT4, leading to area savings.

The following are the results from Table 7 of the original paper:
| Scheme | Configuration |
|---|---|
| Throughput Ratio INT8 : INT4 = 1 : 2 | |
| No reuse | 1 × int8_MAC_unit + 2 × int4_MAC_unit |
| INT reuse scheme 1 | 1 × int8_MAC_unit + 1 × int4_MAC_unit |
| INT reuse scheme 2 | 2 × int8_(u)int4_MAC_unit |
| Throughput Ratio FP8 : FP4 = 1 : 2 | |
| No reuse | 1 × e4m3_MAC_unit + 2 × e2m1_MAC_unit |
| FP reuse scheme | 1 × e4m3_MAC_unit + 1 × e2m1_MAC_unit |
Table 7 outlines different configurations for MAC units with throughput ratios for INT8:INT4 and FP8:FP4. No reuse implies separate hardware, while reuse schemes (like INT reuse scheme 2, which uses two INT8 lanes reconfigurable for INT4) are designed to optimize area. This model allows for a detailed comparison of the hardware implications of supporting various low-bit formats.
5. Experimental Setup
This section details the datasets, evaluation metrics, and models used to conduct the experimental comparisons.
5.1. Datasets
5.1.1. WikiText2
- Source & Characteristics: WikiText2 [25] is a widely used dataset for language modeling, comprising a collection of "good" and "featured" articles from Wikipedia. It is known for having a diverse vocabulary and realistic text.
- Usage: In this paper, WikiText2 sequences (length 4096) are fed into Llama3.1-8B to capture intermediate tensors (weights, activations, gradients) during both forward and backward propagation in BFloat16 precision. These captured tensors are then used for tensor-wise analysis (computing crest factors and QSNR) and for direct-cast inference evaluation (computing KL divergence and perplexity).
- Why Chosen: WikiText2 serves as a representative dataset for evaluating language model performance, providing a realistic distribution of values for QSNR analysis and a standard benchmark for inference accuracy.
5.1.2. OLMo2-Mix-1124
- Source & Characteristics: OLMo2-Mix-1124 [33] is a large-scale pretraining dataset.
- Usage: This dataset is used for low-bit training experiments with Llama3-style models (1B and 3B parameters) over 100B and 200B training tokens, respectively.
- Why Chosen: As a large pretraining dataset, it provides a robust environment to evaluate the stability and performance of low-bit training methods over extended training schedules.
5.2. Evaluation Metrics
5.2.1. Quantization Signal-to-Noise Ratio (QSNR)
- Conceptual Definition: QSNR measures the fidelity of a quantized signal by comparing the power of the original signal to the power of the error (noise) introduced by quantization. A higher QSNR in decibels (dB) indicates less distortion and a more accurate representation of the original data.
- Mathematical Formula:
$
\mathrm{QSNR} = -10\log_{10}\left(\frac{\|\mathbf{X} - \mathbf{X}_q\|^2}{\|\mathbf{X}\|^2}\right)
$
- Symbol Explanation:
  - $\mathbf{X}$: The original high-precision tensor.
  - $\mathbf{X}_q$: The dequantized tensor (original tensor after quantization and dequantization).
  - $\|\cdot\|^2$: Denotes the squared Euclidean norm (sum of squares of all elements in the tensor).
  - $\|\mathbf{X} - \mathbf{X}_q\|^2$: Represents the total squared error (or noise power) introduced by quantization.
  - $\|\mathbf{X}\|^2$: Represents the total squared magnitude (or signal power) of the original tensor.
  - $-10\log_{10}(\cdot)$: Converts the ratio of noise power to signal power into a decibel scale, where a larger positive value signifies better quality.
5.2.2. KL Divergence (Kullback-Leibler Divergence)
- Conceptual Definition: KL Divergence (also known as relative entropy) is a measure of how one probability distribution differs from a second, reference probability distribution. In the context of LLMs, it quantifies the difference between the output probability distribution (logits) of a quantized model and a full-precision BFloat16 model. A lower KL Divergence indicates that the quantized model's predictions are closer to the full-precision model's predictions, implying better preservation of algorithmic accuracy. The paper calculates this over the softmax distribution restricted to the top-25 logits of the BFloat16 model to reduce noise and focus on critical predictions.
- Mathematical Formula:
$
D_{KL}(P \,\|\, Q) = \sum_{i} P(i) \log\left(\frac{P(i)}{Q(i)}\right)
$
- Symbol Explanation:
  - P(i): The probability of outcome $i$ according to the reference BFloat16 model's output distribution (softmax over top-25 logits).
  - Q(i): The probability of outcome $i$ according to the quantized model's output distribution (softmax over top-25 logits).
  - $\sum_i$: Summation over all possible outcomes (in this case, the top-25 logits).
  - $\log$: Typically the natural logarithm ($\ln$).
  - $D_{KL}(P \,\|\, Q)$: The KL Divergence from $Q$ to $P$. It measures the information lost when $Q$ is used to approximate $P$.
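A sketch of the top-25 KL computation between BF16 and quantized logits, with the softmax restricted to the reference model's top-25 indices as described above; renormalizing over that restricted support is an assumption about the implementation detail.

```python
import numpy as np

def topk_kl(ref_logits: np.ndarray, quant_logits: np.ndarray, k: int = 25) -> float:
    """KL(P || Q) over the softmax restricted to the reference model's top-k logits."""
    idx = np.argsort(ref_logits)[-k:]                    # top-k indices of the BF16 reference
    p = np.exp(ref_logits[idx] - ref_logits[idx].max())
    q = np.exp(quant_logits[idx] - quant_logits[idx].max())
    p, q = p / p.sum(), q / q.sum()                      # renormalize over the restricted support
    return float(np.sum(p * np.log(p / q)))

vocab = 32000
ref = np.random.randn(vocab)
quant = ref + 0.05 * np.random.randn(vocab)              # stand-in for a quantized model's logits
print(f"top-25 KL divergence: {topk_kl(ref, quant):.6f}")
```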
5.2.3. Perplexity (PPL)
- Conceptual Definition: Perplexity is a common metric for evaluating language models. It measures how well a probability distribution (the language model) predicts a sample (a sequence of words). A lower perplexity score indicates that the model is better at predicting the next word in a sequence, suggesting higher quality text generation and understanding.
- Mathematical Formula: For a sequence of words $W = (w_1, \dots, w_N)$, the perplexity is defined as:
$
PPL(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1, \dots, w_{i-1})}} = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i \mid w_1, \dots, w_{i-1})}
$
This is equivalent to 2 raised to the power of the cross-entropy loss per word (in bits).
- Symbol Explanation:
  - $W$: A sequence of words.
  - $N$: The total number of words in the sequence.
  - $P(w_i \mid w_1, \dots, w_{i-1})$: The probability assigned by the language model to the $i$-th word $w_i$, given all the preceding words.
  - $\prod$: Product over all words in the sequence.
  - $\sum$: Summation over all words in the sequence.
  - $\log_2$: Logarithm base 2.
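Perplexity follows directly from per-token probabilities; a minimal sketch of the base-2 form above:

```python
import numpy as np

def perplexity(token_probs: np.ndarray) -> float:
    """PPL = 2^(-(1/N) * sum_i log2 P(w_i | w_1..w_{i-1}))."""
    return float(2.0 ** (-np.mean(np.log2(token_probs))))

probs = np.array([0.40, 0.10, 0.25, 0.05, 0.30])   # model probabilities of the observed next tokens
print(f"perplexity: {perplexity(probs):.3f}")
```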
5.2.4. Training Loss
- Conceptual Definition: During model training, training loss quantifies the discrepancy between the model's predictions and the true labels. The goal of training is to minimize this loss. For language models, this is typically cross-entropy loss. A lower training loss indicates that the model is learning more effectively and fitting the training data better. The paper uses an exponential moving average (EMA) with a coefficient of 0.9 to smooth the loss curves.
- Mathematical Formula: For cross-entropy loss in a classification task (like next-token prediction in LLMs):
$
L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{ic} \log(p_{ic})
$
- Symbol Explanation:
  - $N$: The number of samples (e.g., tokens in a batch).
  - $C$: The number of possible classes (e.g., vocabulary size).
  - $y_{ic}$: A binary indicator (1 if sample $i$ belongs to class $c$, 0 otherwise).
  - $p_{ic}$: The predicted probability that sample $i$ belongs to class $c$.
  - $\log$: Typically the natural logarithm ($\ln$).
5.2.5. Task Accuracy
- Conceptual Definition: This refers to the standard accuracy metric on various common-sense reasoning tasks used to evaluate the downstream performance of the trained LLMs. It measures the proportion of correctly answered questions or tasks. The paper uses 5-shot evaluation, meaning the model is given 5 examples before being asked to perform the task.
- Specific Tasks and Metrics:
  - WinoGrande [35]: Evaluated using acc (accuracy). This dataset focuses on common-sense reasoning by resolving ambiguous pronouns.
  - HellaSwag [44]: Evaluated using acc_norm (normalized accuracy). This task involves selecting the most plausible ending to a given sentence, designed to be challenging for models.
  - Arc_Challenge, Arc_Easy [10]: Evaluated using acc_norm. These are scientific question-answering tasks, with Arc_Challenge being harder.
  - PIQA [4]: Evaluated using acc_norm. This dataset involves physical common-sense reasoning.
  - Openbookqa [26]: Evaluated using acc_norm. This task also involves common-sense reasoning based on an open book of facts.
- Why Chosen: These tasks are standard benchmarks for evaluating the common-sense reasoning capabilities of LLMs, providing a comprehensive assessment of how well the models retain their intelligence after low-bit training.
5.3. Baselines
The paper compares its proposed INT formats against several established and industry-standard FP formats, as well as the full-precision BFloat16 baseline.
- Full Precision Baseline:
  - BF16 (BFloat16): This is the standard 16-bit floating-point format often used for training LLMs due to its wide dynamic range. All comparisons aim to achieve performance close to BF16.
- Floating-Point Baselines (Fine-Grained):
  - MXFP8 (E4M3): A Microscaling 8-bit floating-point format with a block size of 32 and UE8M0 scale.
  - MXFP6 (E2M3): A Microscaling 6-bit floating-point format with a block size of 32 and UE8M0 scale.
  - MXFP4 (E2M1): A Microscaling 4-bit floating-point format with a block size of 32 and UE8M0 scale.
  - NVFP4: An NVIDIA 4-bit floating-point format with a block size of 16 and E4M3 scale (plus FP32 second-level scale).
- Integer Counterparts (Introduced for Comparison):
  - MXINT8, MXINT6, MXINT4, NVINT4: These integer variants are designed to match the block sizes and scale factor types of their FP counterparts, enabling a direct and fair comparison.
5.4. Models
5.4.1. Models for Direct-Cast Inference Evaluation
For direct-cast inference (quantizing only the forward pass from a pretrained BFloat16 model), the paper evaluates a diverse set of 12 LLMs, covering various sizes and architectures.
The following are the results from Table 8 of the original paper:
| Model Name | Huggingface ID |
|---|---|
| Qwen3-0.6B | Qwen/Qwen3-0.6B-Base |
| Qwen3-1.7B | Qwen/Qwen3-1.7B-Base |
| Qwen3-4B | Qwen/Qwen3-4B-Base |
| Qwen3-8B | Qwen/Qwen3-8B-Base |
| Qwen3-14B | Qwen/Qwen3-14B-Base |
| Qwen3-32B | Qwen/Qwen3-32B |
| Qwen3-30B-A3B | Qwen/Qwen3-30B-A3B-Instruct-2507 |
| Qwen3-235B-A22B | Qwen/Qwen3-235B-A22B-Instruct-2507 |
| Llama-3.2-1B | meta-llama/Llama-3.2-1B |
| Llama-3.2-3B | meta-llama/Llama-3.2-3B |
| Llama-3.1-8B | meta-llama/Meta-Llama-3.1-8B |
| Llama-3.1-70B | meta-llama/Meta-Llama-3.1-70B |
The models range from 0.6B to 235B parameters and include both dense and Mixture-of-Experts (MoE) architectures, providing a broad evaluation scope. The paper uses base models without Supervised Fine-Tuning (SFT) when available, otherwise selecting SFT models.
5.4.2. Models for Low-Bit Training Evaluation
For low-bit training experiments, the paper uses Llama3-style models due to their widespread adoption.
The following are the results from Table 9 of the original paper:
| Model Size | 145M | 1B | 3B |
|---|---|---|---|
| Layers | 12 | 16 | 28 |
| Hidden Size | 1024 | 2048 | 3072 |
| FFN Hidden Size | 3072 | 8192 | 8192 |
| Attention Heads | 16 | 32 | 24 |
| KV Heads | 4 | 8 | 8 |
| Batch Size (# Sequence) | 256 | 512 | 512 |
| Max LR | 1.0e-3 | 6e-4 | 6e-4 |
| Min LR | 0.1 × Max LR | | |
| Optimizer | AdamW (β1 = 0.9, β2 = 0.95) | | |
| Weight Decay | 0.1 | | |
| Clip Grad Norm | 1.0 | | |
| LR Schedule | Cosine | | |
| Warmup Steps | 500 | | |
| Sequence Length | 2048 | | |
Table 9 provides detailed architectural settings and training hyperparameters for the Llama3-style models. The models feature Group Query Attention (GQA) [1] and SwiGLU [37] for efficiency. Training is performed on 1B and 3B models, with a 145M model likely used for initial ablation studies.
6. Results & Analysis
This section presents and analyzes the experimental results, providing empirical validation for the paper's theoretical insights and contributions.
6.1. Core Results Analysis
6.1.1. Tensor-wise Analysis: Crest Factor and QSNR
The paper first performs a tensor-wise analysis on intermediate tensors (activations, weights, gradients) collected from Llama3.1-8B during WikiText2 processing. This allows for direct measurement of crest factors and QSNR under various formats.
The following are the results from Table 2 of the original paper:
| Type | Block Size | Min | Q1 | Median | Q3 | Max |
|---|---|---|---|---|---|---|
| Crest factor | -1 | 3.55 | 4.26 | 6.2 | 11.97 | 60.15 |
| | 32 | 2.28 | 2.40 | 2.48 | 2.96 | 4.26 |
| | 16 | 2.04 | 2.13 | 2.16 | 2.39 | 3.16 |
| Crest factor w/ Hadamard rotation | -1 | 3.62 | 3.9 | 4.15 | 5.79 | 13.02 |
| | 32 | 1.91 | 2.29 | 2.35 | 2.36 | 2.57 |
| | 16 | 1.77 | 2.06 | 2.1 | 2.11 | 2.21 |
Table 2 shows the crest factor statistics across different block sizes. The Q3 (75th percentile) is highlighted as representing typical worst-case behavior.
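As a quick illustration of why finer blocks tame these statistics, here is a small NumPy sketch that computes per-block crest factors, assuming the standard peak-to-RMS definition, on a synthetic heavy-tailed tensor. The exact numbers will differ from the paper's real LLM tensors; only the qualitative trend (smaller blocks give smaller crest factors) is expected to match.

```python
import numpy as np

def crest_factor(x: np.ndarray) -> float:
    """Crest factor = peak magnitude / RMS (assumed standard definition)."""
    return float(np.max(np.abs(x)) / np.sqrt(np.mean(x ** 2)))

def blockwise_crest_q3(x: np.ndarray, block_size: int) -> float:
    """75th percentile (Q3) of per-block crest factors, as reported in Table 2."""
    blocks = x.reshape(-1, block_size)
    peaks = np.abs(blocks).max(axis=1)
    rms = np.sqrt((blocks ** 2).mean(axis=1))
    return float(np.percentile(peaks / rms, 75))

# Heavy-tailed toy tensor standing in for an LLM activation (illustrative only).
x = np.random.standard_t(df=3, size=(4096, 1024)).astype(np.float32)
print("per-tensor crest factor:", round(crest_factor(x), 2))
print("Q3 of block-32 crest factors:", round(blockwise_crest_q3(x, 32), 2))
print("Q3 of block-16 crest factors:", round(blockwise_crest_q3(x, 16), 2))
```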
- Coarse-grained (Block Size -1, likely per-channel): Q3 is 11.97, significantly above the MXINT8 vs. MXFP8 crossover point (7.55) from Figure 3. This indicates that FP is generally superior at coarse granularity.
- MX format (Block Size 32): Q3 drops to 2.96, well below the MXINT8 vs. MXFP8 crossover point (7.55), so MXINT8 should outperform MXFP8 in most cases. However, 2.96 remains above the crossover points for MXINT6 vs. MXFP6 and MXINT4 vs. MXFP4, implying that MXINT6 and MXINT4 would underperform their FP counterparts.
- NV format (Block Size 16): Q3 is 2.39, approximately at the NVINT4 vs. NVFP4 crossover point, suggesting a more balanced competition.
- Impact of Hadamard Rotation: Hadamard rotation effectively reduces the crest factor. For block size 32, Q3 decreases from 2.96 to 2.39; for block size 16, Q3 drops from 2.39 to 2.11, pushing it further below the NVINT4 vs. NVFP4 crossover and favoring NVINT4 post-rotation.

The following figure (Figure 4 from the original paper) shows the practical QSNR across crest factors, measured on 10,752 tensors sourced from the compute flow in Figure 1:
Figure 4: Practical QSNR across crest factors, measured on 10,752 tensors sourced from the compute flow in Figure 1.
Figure 4 illustrates the practical QSNR (measured on real LLM tensors) across crest factors. The empirical results largely corroborate the theoretical predictions from Section 4.2 (a small sketch of the QSNR measurement appears after this list):
- MXINT8 vs. MXFP8: MXFP8's QSNR is almost constant at 31.50 dB (due to its ample dynamic range and its mantissa-bit bound). MXINT8 achieves a significantly higher average QSNR of 40.35 dB, confirming its strong performance.
- MXINT6 and MXINT4: These INT formats consistently lag behind their FP counterparts (MXFP6 and MXFP4), even with Hadamard rotation, as predicted by their crest factors remaining above the crossover points.
- NVINT4 vs. NVFP4: Initially, NVINT4's average QSNR (20.55 dB) is slightly below NVFP4's (20.60 dB), despite a 64.3% win rate, because NVINT4's QSNR degrades faster as the crest factor increases. After applying Hadamard rotation, however, NVINT4's average QSNR rises to 21.65 dB, surpassing NVFP4's 20.35 dB. The slight decrease in NVFP4's QSNR after rotation is consistent with the theoretical plot in Figure 3, where NVFP4's QSNR can increase over a certain crest-factor range, so reducing the crest factor further can sometimes lower its QSNR.
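The QSNR numbers above can be reproduced in spirit with a few lines of NumPy: quantize-dequantize a tensor with a symmetric block-wise integer grid and compare the signal power to the quantization-noise power. This is a sketch under the assumption that QSNR is the usual 10·log10(signal power / noise power); it covers only the INT side, since emulating the FP4/FP6 element formats requires more machinery.

```python
import numpy as np

def qsnr_db(x: np.ndarray, x_hat: np.ndarray) -> float:
    """Quantization signal-to-noise ratio in dB (assumed standard definition)."""
    noise = x - x_hat
    return float(10.0 * np.log10(np.sum(x ** 2) / np.sum(noise ** 2)))

def fake_quant_int(x: np.ndarray, bits: int, block_size: int) -> np.ndarray:
    """Quantize-dequantize with a symmetric block-wise integer grid (sketch)."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 127 for 8-bit, 7 for 4-bit
    blocks = x.reshape(-1, block_size).astype(np.float32)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(blocks / scale), -qmax, qmax)
    return (q * scale).reshape(x.shape)

x = np.random.randn(1024, 1024).astype(np.float32)
for bits in (8, 6, 4):
    print(f"MXINT{bits}-style QSNR:",
          round(qsnr_db(x, fake_quant_int(x, bits, 32)), 2), "dB")
```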
6.1.2. Direct-Cast Inference
Direct-cast inference evaluates model accuracy when quantization is applied only to the forward pass of pretrained models. KL divergence is used as the primary metric, with perplexity also reported.
The following are the results from Table 3 of the original paper:
| Format pair | INT Win (Original) | FP Win (Original) | INT Win (w/ Hadamard rotation) | FP Win (w/ Hadamard rotation) |
|---|---|---|---|---|
| MXINT8 vs. MXFP8 | 12 | 0 | 12 | 0 |
| MXINT6 vs. MXFP6 | 0 | 12 | 1 | 11 |
| MXINT4 vs. MXFP4 | 0 | 12 | 0 | 12 |
| NVINT4 vs. NVFP4 | 0 | 12 | 12 | 0 |
Table 3 summarizes the win rate (number of models where INT or FP performed better in terms of KL divergence) across 12 LLMs.
- Without Rotation: MXINT8 consistently outperforms MXFP8 on all 12 models. MXINT6, MXINT4, and NVINT4 generally underperform their FP counterparts, confirming the predictions from the tensor-wise analysis regarding their respective crest-factor crossover points. NVINT4 performs worse than NVFP4 here, suggesting that even with a similar average QSNR, higher crest factors can lead to worse worst-case behavior for integers.
- With Random Hadamard Rotation: MXINT8 and NVINT4 now win on all 12 models. This highlights the effectiveness of Hadamard rotation in mitigating outliers, making NVINT4 competitive with or superior to NVFP4. MXINT6 wins on 1 of 12 models, and MXINT4 still loses on all 12, consistent with the tensor-wise analysis: their crest factors generally remain above their respective FP crossover points even after rotation.

The following are the results from Table 12 of the original paper:
| Format (Qwen-3) | 0.6B | 1.7B | 4B | 8B | 14B | 32B | 30B-A3B | 235B-A22B |
|---|---|---|---|---|---|---|---|---|
| MXINT8 | 191 | 209 | 112 | 168 | 96 | 118 | 160 | 276 |
| MXFP8 | 579 | 406 | 346 | 362 | 300 | 457 | 380 | 483 |
| MXINT6 | 1944 | 2464 | 928 | 1104 | 804 | 1012 | 768 | 1333 |
| MXFP6 | 1030 | 874 | 539 | 592 | 467 | 627 | 606 | 1099 |
| MXINT4 | 39936 | 30208 | 1708 | 15552 | 34304 | 27392 | 13248 | 1631 |
| MXFP4 | 17602 | 14614 | 8568 | 8228 | 8119 | 10302 | 6194 | 16238 |
| NVINT4 | 10560 | 8320 | 4864 | 5120 | 568 | 7968 | 3120 | 9702 |
| NVFP4 | 8104 | 4995 | 3844 | 3430 | 2835 | 3778 | 2443 | 9238 |

| Format (Qwen-3, w/ random Hadamard rotation) | 0.6B | 1.7B | 4B | 8B | 14B | 32B | 30B-A3B | 235B-A22B |
|---|---|---|---|---|---|---|---|---|
| MXINT8 | 137 | 150 | 80 | 130 | 70 | 88 | 135 | 229 |
| MXFP8 | 921 | 1321 | 468 | 577 | 393 | 497 | 391 | 707 |
| MXINT6 | 1137 | 1274 | 547 | 690 | 481 | 615 | 444 | 809 |
| MXFP6 | 1007 | 1446 | 497 | 618 | 454 | 558 | 422 | 740 |
| MXINT4 | 26488 | 26578 | 10498 | 12241 | 8459 | 9510 | 6080 | 9660 |
| MXFP4 | 17995 | 20443 | 7260 | 8562 | 6410 | 6536 | 5087 | 7058 |
| NVINT4 | 7771 | 7236 | 3431 | 4026 | 30700 | 3647 | 22 | 3931 |
| NVFP4 | 12031 | 10582 | 5065 | 5912 | 4214 | 4662 | 3200 | 5786 |
Table 12 presents the KL divergence results for Qwen-3 models; lower values indicate better accuracy. For example, for Qwen3-0.6B without rotation, MXINT8 has a KL divergence of 191, much lower than MXFP8's 579. After Hadamard rotation, MXINT8 improves further to 137, whereas MXFP8 actually degrades to 921 in this case. The trend of MXINT8 outperforming MXFP8 holds consistently across all Qwen-3 models. For 4-bit, NVINT4 generally improves with rotation, often surpassing NVFP4.
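For reference, this kind of metric can be sketched as the mean per-token KL divergence between the BF16 model's output distribution and the quantized model's. The paper's exact reduction and the scaling of the values reported in Tables 12-13 are not specified in this summary, so the snippet below is only illustrative.

```python
import torch
import torch.nn.functional as F

def mean_token_kl(logits_ref: torch.Tensor, logits_quant: torch.Tensor) -> torch.Tensor:
    """Mean per-token KL(P_ref || P_quant) between output distributions (sketch)."""
    log_p = F.log_softmax(logits_ref.float(), dim=-1)
    log_q = F.log_softmax(logits_quant.float(), dim=-1)
    kl = torch.sum(log_p.exp() * (log_p - log_q), dim=-1)  # one value per token
    return kl.mean()

# Toy usage with random "logits" of shape (batch, seq_len, vocab).
ref = torch.randn(2, 8, 32000)
quant = ref + 0.01 * torch.randn_like(ref)   # stand-in for quantization error
print(mean_token_kl(ref, quant).item())
```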
The following are the results from Table 13 of the original paper:
| Format (Llama) | 3.2-1B | 3.2-3B | 3.1-8B | 3.1-70B |
|---|---|---|---|---|
| MXINT8 | 111 | 77 | 82 | 191 |
| MXFP8 | 464 | 325 | 359 | 514 |
| MXINT6 | 1133 | 743 | 776 | 1744 |
| MXFP6 | 651 | 457 | 491 | 1436 |
| MXINT4 | 26153 | 14089 | 12380 | 22538 |
| MXFP4 | 14446 | 8251 | 7586 | 21372 |
| NVINT4 | 508 | 4312 | 4224 | 10970 |
| NVFP4 | 5691 | 3684 | 3718 | 10544 |

| Format (Llama, w/ random Hadamard rotation) | 3.2-1B | 3.2-3B | 3.1-8B | 3.1-70B |
|---|---|---|---|---|
| MXINT8 | 89 | 63 | 65 | 145 |
| MXFP8 | 573 | 388 | 409 | 1393 |
| MXINT6 | 773 | 531 | 558 | 1518 |
| MXFP6 | 643 | 447 | 457 | 1476 |
| MXINT4 | 20126 | 116 | 10272 | 137612 |
| MXFP4 | 11967 | 8269 | 7189 | 129471 |
| NVINT4 | 5854 | 3912 | 609 | 19975 |
| NVFP4 | 8129 | 5240 | 4752 | 77363 |
Table 13 shows KL divergence for Llama models, reinforcing the findings for Qwen-3. MXINT8 consistently achieves lower KL divergence than MXFP8. With Hadamard rotation, NVINT4 again shows significant improvement, often outperforming NVFP4.
The following are the results from Table 14 of the original paper:
| Format (Qwen-3) | 0.6B | 1.7B | 4B | 8B | 14B | 32B | 30B-A3B | 235B-A22B |
|---|---|---|---|---|---|---|---|---|
| BF16 | 11.5868 | 8.7084 | 7.3368 | 6.5135 | 5.9498 | 7.0168 | 6.8178 | 4.0929 |
| MXINT8 | 11.6377 | 8.7424 | 7.3511 | 6.5174 | 5.955 | 7.0185 | 6.8167 | 4.0959 |
| MXFP8 | 11.7494 | 8.7822 | 7.3813 | 6.5444 | 5.9711 | 7.0357 | 6.8335 | 4.1101 |
| MXINT6 | 12.2297 | 9.2622 | 7.496 | 6.6499 | 6.0483 | 7.05 | 6.8745 | 4.1743 |
| MXFP6 | 11.9108 | 8.8961 | 7.4135 | 6.5825 | 5.9953 | 7.0285 | 6.8467 | 4.1662 |
| MXINT4 | 48.673 | 21.8749 | 11.9487 | 10.0423 | 16.7227 | 15.1619 | 9.3837 | 5.918 |
| MXFP4 | 20.4522 | 24.0766 | 9.1553 | 8.0135 | 7.2471 | 8.2047 | 7.8203 | 5.9007 |
| NVINT4 | 15.9729 | 10.9128 | 8.3304 | 7.415 | 6.81 | 8.0161 | 7.2024 | 4.88916 |
| NVFP4 | 14.6818 | 9.9966 | 8.0144 | 7.0285 | 6.3129 | 7.3604 | 7.1874 | 4.8309 |

| Format (Qwen-3, w/ random Hadamard rotation) | 0.6B | 1.7B | 4B | 8B | 14B | 32B | 30B-A3B | 235B-A22B |
|---|---|---|---|---|---|---|---|---|
| MXINT8 | 11.6179 | 8.7240 | 7.3407 | 6.5170 | 5.9521 | 7.0187 | 6.8231 | 4.0973 |
| MXFP8 | 11.8629 | 8.9972 | 7.4068 | 6.5898 | 5.9839 | 7.0448 | 6.8918 | 4.1287 |
| MXINT6 | 11.9422 | 9.0122 | 7.4071 | 6.6119 | 5.990 | 7.0627 | 6.8666 | 4.1263 |
| MXFP6 | 11.9096 | 9.0089 | 7.4108 | 6.5911 | 5.9981 | 7.0787 | 6.8711 | 4.1252 |
| MXINT4 | 28.6510 | 1.3032 | 9.8238 | 9.2029 | 7.3564 | 8.2083 | 7.8292 | 4.9891 |
| MXFP4 | 20.3684 | 15.9527 | 8.8148 | 8.1113 | 6.9521 | 7.7401 | 7.9673 | 4.7035 |
| NVINT4 | 14.6052 | 110.7822 | 7.9824 | 7.1705 | 6.3702 | 7.3625 | 1557 | 4.3913 |
| NVFP4 | 16.5762 | 11.7541 | 8.2716 | 7.5084 | 6.5427 | 7.4522 | 7.3214 | 4.5918 |
Table 14 provides the perplexity results for Qwen-3 models. Lower perplexity is better. For Qwen3-0.6B, BF16 achieves 11.5868. MXINT8 (11.6377) is very close, while MXFP8 (11.7494) is slightly worse. This again confirms MXINT8's strong performance. The trends are generally consistent with KL divergence, showing that MXINT8 performs well, and Hadamard rotation can significantly improve NVINT4.
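As a reminder of the metric, perplexity is the exponential of the mean token-level cross-entropy; a minimal sketch (with toy shapes, not the paper's evaluation pipeline) looks like this:

```python
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Perplexity = exp(mean token-level cross-entropy), the metric in Tables 14-15."""
    nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)).float(),
                          targets.reshape(-1))
    return torch.exp(nll).item()

logits = torch.randn(2, 128, 32000)            # (batch, seq_len, vocab)
targets = torch.randint(0, 32000, (2, 128))
print(perplexity(logits, targets))
```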
The following are the results from Table 15 of the original paper:
| Format (Llama) | 3.2-1B | 3.2-3B | 3.1-8B | 3.1-70B |
|---|---|---|---|---|
| BF16 | 9.0625 | 7.2857 | 5.8402 | 2.637 |
| MXINT8 | 9.0815 | 7.2944 | 5.8487 | 2.664 |
| MXFP8 | 9.1695 | 7.3381 | 5.895 | 2.6674 |
| MXINT6 | 9.3557 | 7.4184 | 5.9643 | 2.7298 |
| MXFP6 | 9.2209 | 7.3605 | 5.916 | 2.7298 |
| MXINT4 | 21.9893 | 11.2715 | 8.7408 | 5.1894 |
| MXFP4 | 14.0516 | 9.2355 | 6.4845 | 4.9492 |
| NVINT4 | 11.3987 | 8.225 | 6.5957 | 3.5502 |
| NVFP4 | 10.7473 | 8.0343 | 6.4917 | 3.492 |

| Format (Llama, w/ random Hadamard rotation) | 3.2-1B | 3.2-3B | 3.1-8B | 3.1-70B |
|---|---|---|---|---|
| MXINT8 | 9.0715 | 7.2912 | 5.845 | 2.6428 |
| MXFP8 | 9.1932 | 7.3465 | 5.9001 | 2.7232 |
| MXINT6 | 9.2622 | 7.3828 | 5.9276 | 2.7333 |
| MXFP6 | 9.2204 | 7.3703 | 5.9075 | 2.735 |
| MXINT4 | 17.9797 | 10.357 | 8.0745 | 1146.7256 |
| MXFP4 | 13.3987 | 9.262 | 7.2318 | 1118.4431 |
| NVINT4 | 10.8399 | 8.1119 | 6.4701 | 4.9786 |
| NVFP4 | | 8.4693 | 6.7028 | 79.7586 |
Table 15 presents the perplexity results for Llama models, again showing that MXINT8 is very competitive with BF16 and superior to MXFP8. The effectiveness of Hadamard rotation for NVINT4 is also clear here.
6.1.3. Training
The paper investigates low-bit training stability and performance, focusing on 8-bit formats (MXINT8 vs. MXFP8).
The following figure (Figure 5 from the original paper) shows the loss curves comparison among BF16, MXFP8 and MXINT8 training on Llama-1B with 100B tokens:
Figure 5 Loss curves comparison among BF16, MXFP8 and MXINT8 training on Llama-1B with 100B tokens. Results are smoothed by exponential moving average with a coefficient of 0.9.
Figure 5 plots the training loss curves for BF16, MXFP8, and MXINT8 on Llama-1B. It shows that both MXFP8 and MXINT8 achieve nearly lossless training, with their curves closely tracking the BF16 baseline. An enlarged view reveals that MXINT8 consistently maintains a slightly lower loss (by approximately 0.001) than MXFP8. This is a significant finding, demonstrating that MXINT8 can support nearly lossless low-bit training, a domain where FP8 training has been the primary focus in prior work.
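The smoothing mentioned in the caption is a plain exponential moving average; a tiny sketch with the stated coefficient of 0.9 (assuming the conventional recursive form) is shown below.

```python
def ema_smooth(values, coeff=0.9):
    """Exponential-moving-average smoothing, as applied to the loss curves in Figure 5.

    Assumes the conventional recursion s_t = coeff * s_{t-1} + (1 - coeff) * x_t.
    """
    smoothed, running = [], None
    for v in values:
        running = v if running is None else coeff * running + (1.0 - coeff) * v
        smoothed.append(running)
    return smoothed

print(ema_smooth([2.90, 2.80, 2.75, 2.70, 2.68]))
```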
The following are the results from Table 4 of the original paper:
| Model size | Training tokens | Precision | Loss | Arc_E | Arc_C | HS | OB | PIQA | WG | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| 1B | 100B | BF16 | 2.6727 | 37.80 | 69.40 | 60.20 | 38.40 | 74.43 | 61.09 | 56.89 |
| 1B | 100B | MXFP8 | 2.6767 | 37.03 | 69.82 | 60.28 | 38.00 | 74.37 | 61.64 | 56.86 |
| 1B | 100B | MXINT8 | 2.6758 | 37.95 | 69.45 | 60.02 | 38.80 | 74.54 | 61.38 | 57.02 |
| 3B | 200B | BF16 | 2.4794 | 46.50 | 75.42 | 72.28 | 45.00 | 78.07 | 69.45 | 64.45 |
| 3B | 200B | MXFP8 | 2.4821 | 46.70 | 74.12 | 72.08 | 44.60 | 77.56 | 69.25 | 64.05 |
| 3B | 200B | MXINT8 | 2.4812 | 46.10 | 75.58 | 72.00 | 44.80 | 77.78 | 69.55 | 64.30 |
Table 4 presents the low-bit training comparisons across various common-sense reasoning tasks.
- For the 1B model trained on 100B tokens: BF16 has an average accuracy of 56.89, MXFP8 achieves 56.86, and MXINT8 achieves 57.02, slightly surpassing both BF16 and MXFP8 on average.
- For the 3B model trained on 200B tokens: BF16 has an average accuracy of 64.45, MXFP8 achieves 64.05, and MXINT8 achieves 64.30, again very close to BF16 and slightly outperforming MXFP8.

These results confirm that MXINT8 training is not only stable but can also achieve nearly lossless performance compared to BF16, even slightly outperforming MXFP8 on downstream tasks.
6.1.4. Hardware Cost Analysis
The paper evaluates the energy and area costs of INT and FP formats based on a Matrix-Multiply Unit (MMU) model.
The following are the results from Table 5 of the original paper:
| | MXFP8 | MXINT8 | NVFP4 | NVINT4 | MXFP8+NVFP4 (mixed) | MXINT8+NVINT4 (mixed) |
|---|---|---|---|---|---|---|
| Energy | 1x | 0.63x | 0.55x | 0.34x | 1x | 0.75x |
| Area | 1x | 0.79x | 0.54x | 0.38x | 1x | 0.66x |
Table 5 shows normalized energy and area costs at the same throughput, with MXFP8 (for the single formats) and MXFP8+NVFP4 (for the mixed configurations) serving as the 1x baselines.
- Single Format Comparisons: MXINT8 consumes only 0.63x energy and 0.79x area compared to MXFP8, i.e., a 37% energy reduction and a 21% area reduction. NVINT4 is even more efficient, at 0.34x energy and 0.38x area against the same MXFP8 baseline, versus NVFP4's 0.55x and 0.54x; relative to NVFP4, this works out to roughly 38% less energy and 30% less area. These substantial reductions demonstrate that low-bit integer formats are considerably more hardware-efficient than their floating-point counterparts.
- Mixed Format Configurations: For a unit supporting both 8-bit and 4-bit data types, the MXINT8+NVINT4 configuration achieves 0.75x energy and 0.66x area compared to MXFP8+NVFP4, i.e., a 25% energy reduction and a 34% area reduction. This efficiency gain in the mixed-format case is attributed to the simpler circuit reuse in the INT pipeline (as described in Table 7 in the methodology).

This hardware analysis provides compelling evidence that fine-grained INT formats offer superior hardware efficiency in terms of both area and energy, reinforcing their overall advantages.
6.2. Ablation Studies / Parameter Analysis
6.2.1. Necessity of Symmetric Integer Representation
The paper conducts an ablation study to demonstrate the critical importance of using a symmetric integer range for INT quantization, particularly during training.
The following are the results from Table 10 of the original paper:
| Block size | BF16 scale: [-128, 127] | BF16 scale: [-127, 127] | UE8M0 scale: [-128, 127] | UE8M0 scale: [-127, 127] |
|---|---|---|---|---|
| per-channel | 3.2544 | 3.2560 | 3.3602 | 3.4307 |
| 256 | 3.1340 | 3.1307 | 3.1628 | 3.1574 |
| 128 | 3.1309 | 3.1289 | 3.1353 | 3.1326 |
| 64 | 3.1312 | 3.1269 | 3.1312 | 3.1288 |
| 32 | 3.1354 | 3.1251 | 3.1299 | 3.1269 |
Table 10 shows the 8-bit training loss (lower is better) on a 145M model trained on 20B tokens, comparing the asymmetric ([-128, 127]) and symmetric ([-127, 127]) INT8 ranges, using both BFloat16 and UE8M0 scale factors.
- Asymmetric Degradation: With BFloat16 scale factors, the asymmetric range ([-128, 127]) consistently yields worse training loss than the symmetric range ([-127, 127]). The degradation grows with finer-grained quantization: with the asymmetric range, block size 32 (3.1354) shows a higher loss than block size 256 (3.1340), reversing the usual benefit of finer granularity. Finer granularity means more blocks, which increases the chance of values mapping to the problematic unique negative endpoint (-128) and thus causes a gradient bias.
- Impact of Scale Factor: The asymmetric range also degrades performance with UE8M0 scale factors, though slightly less severely than with BFloat16 scales. This is because UE8M0 scale factors are generally greater than or equal to BFloat16 scales, so fewer high-precision values map to -128.

This ablation clearly demonstrates that the symmetric clipping method (enforcing the [-127, 127] range for INT8) is essential for stable, high-performing low-bit INT training.
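To make the remedy concrete, here is a minimal PyTorch sketch of a symmetric block-wise INT8 fake-quantizer that enforces the [-127, 127] range; it is a simplified stand-in, not the authors' training kernel.

```python
import torch

def symmetric_int8_fake_quant(x: torch.Tensor, block_size: int = 32) -> torch.Tensor:
    """Block-wise INT8 fake-quantization with the symmetric clip range [-127, 127].

    A simplified stand-in for the paper's symmetric clipping method: the per-block
    AbsMax scale maps the block maximum to 127, and the explicit clamp guarantees
    that the -128 code is never emitted, even when the scaling arithmetic runs in
    a low-precision dtype such as BF16.
    """
    blocks = x.reshape(-1, block_size)
    scale = blocks.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp(torch.round(blocks / scale), -127, 127)
    return (q * scale).reshape(x.shape)

x = torch.randn(2048, 2048, dtype=torch.bfloat16)
err = (symmetric_int8_fake_quant(x) - x).abs().mean()
print(f"mean abs fake-quantization error: {err.item():.5f}")
```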
6.2.2. Numerical Stability Analysis
To further explain the need for symmetric clipping, the paper conducts a numerical stability analysis for different floating-point precisions during the quantization mapping process.
The algorithm for this analysis (Algorithm 1 in the paper) involves:
- Generating a random matrix in different precisions (BFloat16, Float16, Float32).
- Calculating a scaler matrix, designed so that in exact arithmetic the scaled values map to at most 127.
- Normalizing the matrix by the scaler and rounding the result.
- Counting the number of elements that map to 128. A nonzero count means some scaled values have crept past the INT8 maximum of 127, which can occur due to arithmetic precision issues in low-precision FP formats.

The following are the results from Table 11 of the original paper:
| BFloat16 | Float16 | Float32 |
|---|---|---|
| 16.82% | 0.02% | 0% |
Table 11 shows that:
- In BFloat16 precision, a significant 16.82% of values are numerically mapped to 128 (i.e., they land just above 127), even though the scale factor is theoretically designed to map the maximum to 127.
- In Float16, this phenomenon is much rarer (0.02%).
- In Float32, it does not occur (0%).

This analysis demonstrates that low-precision floating-point formats (especially BFloat16) can introduce numerical instability and overflow during the scaling step, causing values to fall outside the intended integer range. It strongly supports the paper's argument that a forced symmetric clipping step is essential for guaranteeing the correctness and stability of integer quantization, particularly when the underlying arithmetic is performed in low-precision data types like BFloat16.
6.2.3. Impact of Hadamard Rotation
The direct-cast inference results (Table 3, 12, 13) and tensor-wise QSNR analysis (Figure 4, Table 2) clearly show the positive impact of random Hadamard rotation.
- Reduced Crest Factor: Hadamard rotation effectively reduces the crest factor of the intermediate tensors (Table 2), making the distributions more amenable to INT quantization.
- Improved NVINT4 Accuracy: With rotation, NVINT4 significantly improves its QSNR (Figure 4) and its KL divergence (Tables 12 and 13), enabling it to surpass NVFP4 in inference accuracy. This demonstrates that outlier-mitigation techniques can unlock the potential of low-bit INT formats in scenarios where they would otherwise be weaker than FP. A minimal sketch of such a rotation follows this list.
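Below is a minimal NumPy sketch of a random Hadamard rotation applied per block before quantization. The Sylvester construction, the random sign flips, and the per-block application are illustrative choices, not the paper's exact compute-flow integration; the point is simply that an orthonormal transform spreads outliers across the block and lowers its crest factor.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Sylvester construction of an n x n Hadamard matrix (n must be a power of 2)."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h

def random_hadamard_rotate(x: np.ndarray, block_size: int = 16) -> np.ndarray:
    """Apply a random Hadamard rotation to each block before quantization (sketch)."""
    h = hadamard(block_size) / np.sqrt(block_size)      # orthonormal transform
    signs = np.sign(np.random.randn(block_size))        # random +/-1 diagonal
    blocks = x.reshape(-1, block_size)
    return (blocks * signs) @ h

def crest_q3(x: np.ndarray, block_size: int) -> float:
    """Q3 of per-block crest factors (peak / RMS), mirroring Table 2's statistic."""
    b = x.reshape(-1, block_size)
    return float(np.percentile(np.abs(b).max(1) / np.sqrt((b ** 2).mean(1)), 75))

x = np.random.standard_t(df=3, size=(4096, 1024)).astype(np.float32)
print("Q3 crest factor before rotation:", round(crest_q3(x, 16), 2))
print("Q3 crest factor after rotation: ", round(crest_q3(random_hadamard_rotate(x, 16), 16), 2))
```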
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper presents a comprehensive and rigorous study comparing integer (INT) and floating-point (FP) low-bit quantization formats, addressing a critical gap in the existing literature regarding their trade-offs across varying granularities. The central finding is the identification of a performance crossover point: while FP formats tend to excel in coarse-grained quantization due to their superior dynamic range, fine-grained quantization presents a more nuanced picture.
The study provides strong evidence that for popular 8-bit fine-grained formats (e.g., MX with block size 32), MXINT8 consistently outperforms MXFP8 in both algorithmic accuracy and hardware efficiency (reduced area and energy consumption). For 4-bit formats, FP (MXFP4, NVFP4) often holds an accuracy advantage, but the paper demonstrates that NVINT4 can surpass NVFP4 when combined with outlier-mitigation techniques like Hadamard rotation. Furthermore, the authors introduce a crucial symmetric clipping method that resolves gradient bias in fine-grained low-bit INT training, enabling nearly lossless performance for MXINT8 training.
These findings collectively challenge the current FP-centric trajectory in AI hardware design, which has largely favored floating-point formats for LLM quantization. The paper advocates for a strategic shift towards fine-grained INT formats, particularly MXINT8, as they offer a superior balance of accuracy, power, and efficiency for future AI accelerators. The extensive theoretical framework, empirical validation, and hardware cost analysis provide clear guidance for algorithm-hardware co-design.
7.2. Limitations & Future Work
The paper implicitly points to several areas for future exploration, although it does not explicitly list "Limitations" or "Future Work" sections:
- Computational Overhead of Outlier Mitigation: While Hadamard rotation is shown to be effective, its computational overhead during inference or training is not explicitly quantified or optimized. Future work could investigate the most hardware-efficient ways to implement such techniques.
- Broader Range of Outlier Mitigation Techniques: The paper primarily focuses on Hadamard rotation for 4-bit INT. Exploring other or more advanced outlier-mitigation techniques (e.g., more sophisticated scaling, mixed precision within a block, adaptive transformations) could yield further improvements for low-bit INT formats.
- Complex Mixed-Precision Strategies: The hardware cost analysis touches upon mixed-format configurations. Future work could delve into more dynamic or fine-grained mixed-precision strategies, where different layers or even sub-layers might benefit from different INT or FP formats and bit-widths.
- Impact on Different LLM Architectures: While the study covers various LLM sizes and MoE architectures, further investigation into specific architectural components or novel LLM designs could reveal additional nuances in INT vs. FP trade-offs.
- Quantizer Cost Modeling: The current hardware model excludes the quantizer block from cost accounting. A more comprehensive analysis might include the area and energy costs associated with the quantization logic itself.
- Dynamic vs. Static Quantization: The paper focuses on AbsMax quantization, but exploring dynamic quantization schemes (where ranges are determined at runtime) or more advanced static schemes could provide further insights.
7.3. Personal Insights & Critique
This paper provides a highly valuable contribution to the field of efficient LLM deployment. The rigorous approach, combining theoretical analysis with extensive empirical validation and hardware modeling, is commendable. The identification of the performance crossover based on granularity and crest factor is a crucial insight that challenges the industry's predominant FP-centric mindset.
My personal insights include:
- Paradigm Shift Potential: The findings suggest that focusing solely on FP for LLM acceleration might be suboptimal. MXINT8 emerges as a surprisingly strong contender, offering both accuracy and significant hardware efficiency. This could lead to a paradigm shift in how AI accelerators are designed, with a greater emphasis on optimizing fine-grained INT pipelines.
- Importance of Co-Design: The paper strongly reinforces the necessity of algorithm-hardware co-design. The success of NVINT4 with Hadamard rotation is a prime example: an algorithmic improvement directly translates into competitive hardware performance. Similarly, the symmetric clipping method addresses a specific numerical stability issue in INT training, making the format more viable.
- Nuance over Dogma: The paper moves beyond a simplistic "INT vs. FP" debate, demonstrating that the optimal choice is highly contextual, dependent on bit-width, granularity, and the application of outlier-mitigation techniques. This nuanced view is essential for designing truly efficient systems.

Potential areas for improvement or further critique:
- Generalizability of Hadamard Rotation: While effective, Hadamard rotation adds an extra computational step. Investigating its real-world latency impact and comparing it with other outlier-mitigation techniques (e.g., advanced clipping, mixed precision within blocks) would be beneficial.
- Hardware Model Simplifications: The hardware model, while thorough, simplifies some aspects (e.g., toggle rates, interconnects, quantizer cost). A more detailed VLSI implementation and characterization could provide even more precise cost estimates.
- Activation Outlier Characteristics: The paper relies on the crest factor as a key indicator. A deeper dive into the specific characteristics and dynamics of LLM activation outliers (e.g., their sparsity, distribution shapes, temporal stability) could lead to more tailored quantization and outlier-mitigation strategies.

Overall, this paper is an excellent piece of research that provides compelling evidence and clear guidance for the future of low-bit quantization in AI accelerators, advocating for a well-deserved re-evaluation of fine-grained integer formats.