INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats

Published: 10/29/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This study offers a systematic comparison between floating-point (FP) and integer (INT) quantization formats, revealing that MXINT8 outperforms FP in 8-bit fine-grained formats. For 4-bit formats, FP often excels, but NVINT4 can surpass it with outlier-mitigation techniques. A new symmetric clipping method further resolves gradient bias, enabling nearly lossless MXINT8 training.

Abstract

Modern AI hardware, such as Nvidia's Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats to handle the pervasive activation outliers in Large Language Models (LLMs). Despite this industry trend, a unified comparison of FP and integer (INT) quantization across varying granularities has been missing, leaving algorithm and hardware co-design without clear guidance. This paper fills that gap by systematically investigating the trade-offs between FP and INT formats. We reveal a critical performance crossover: while FP excels in coarse-grained quantization, the comparison at fine-grained (block-wise) levels is more nuanced. Our comprehensive comparison demonstrates that for popular 8-bit fine-grained formats (e.g., MX with block size 32), MXINT8 is superior to its FP counterpart in both algorithmic accuracy and hardware efficiency. However, for 4-bit formats, FP (e.g., MXFP4, NVFP4) often holds an accuracy advantage, though we show that NVINT4 can surpass NVFP4 when outlier-mitigation techniques like Hadamard rotation are applied. We also introduce a symmetric clipping method that resolves gradient bias in fine-grained low-bit INT training, enabling nearly lossless performance for MXINT8 training. These findings challenge the current hardware trajectory, demonstrating that a one-size-fits-all FP approach is suboptimal and advocating that fine-grained INT formats, particularly MXINT8, offer a better balance of accuracy, power, and efficiency for future AI accelerators.

In-depth Reading

1. Bibliographic Information

1.1. Title

The title of the paper is: "INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats".

1.2. Authors

The authors are:

  • Mengzhao Chen (The University of Hong Kong, ByteDance Seed)
  • Meng Wu (PicoHeart)
  • Hui Jin (ByteDance Seed)
  • Zhihang Yuan (ByteDance Seed)
  • Jing Liu (ByteDance Seed)
  • Chaoyi Zhang (ByteDance Seed)
  • Yunshui Li (ByteDance Seed)
  • Jie Huang (ByteDance Seed)
  • Jin Ma (ByteDance Seed)
  • Zeyue Xue (The University of Hong Kong)
  • Zhiheng Liu (The University of Hong Kong)
  • Xingyan Bin (ByteDance Seed, Corresponding author)
  • Ping Luo (The University of Hong Kong, Corresponding author)

1.3. Journal/Conference

The paper was published on arXiv, a preprint server, on October 29, 2025. As a preprint, it has not yet undergone formal peer review by a specific journal or conference. However, arXiv is a widely used and respected platform for disseminating cutting-edge research in fields like artificial intelligence, allowing for rapid sharing of findings.

1.4. Publication Year

The publication year is 2025.

1.5. Abstract

The abstract summarizes the paper's critical investigation into low-precision quantization formats, specifically comparing floating-point (FP) and integer (INT) representations. It addresses a gap in unified comparisons, particularly concerning varying granularities, despite the industry's trend towards FP formats (e.g., in Nvidia's Blackwell architecture) for handling activation outliers in Large Language Models (LLMs). The paper reveals a performance "crossover" where FP excels in coarse-grained quantization, but the comparison becomes more nuanced at fine-grained (block-wise) levels.

A key finding is that for popular 8-bit fine-grained formats (like MX with block size 32), MXINT8 outperforms its FP counterpart in both algorithmic accuracy and hardware efficiency. In contrast, for 4-bit formats, FP (e.g., MXFP4, NVFP4) generally holds an accuracy advantage, although NVINT4 can surpass NVFP4 when combined with outlier-mitigation techniques such as Hadamard rotation. The authors also introduce a symmetric clipping method that resolves gradient bias during fine-grained low-bit INT training, enabling nearly lossless performance for MXINT8 training. These results challenge the current hardware focus on FP formats, advocating for fine-grained INT formats, especially MXINT8, as a more balanced solution for accuracy, power, and efficiency in future AI accelerators.

2. Executive Summary

2.1. Background & Motivation

The proliferation of Large Language Models (LLMs) has led to an exponential increase in their computational and memory demands. To make these models more efficient for deployment, quantization has become an indispensable technique. A significant challenge in quantizing LLMs, particularly those based on the Transformer architecture, is the presence of activation outliers—values with large magnitudes but infrequent occurrence. These outliers can severely degrade the performance of low-precision representations.

The AI hardware industry, exemplified by NVIDIA's Blackwell architecture, has largely responded to this challenge by pivoting towards low-precision floating-point (FP) formats (e.g., FP8, FP4). This trend is driven by FP's inherently superior dynamic range, which is believed to handle outliers more gracefully than traditional integer (INT) formats.

However, the authors argue that this industry-wide momentum towards FP is based on an incomplete picture. A systematic and unified comparison of FP and INT quantization across different granularities has been missing. Most existing studies tend to focus on a single format or compare them only at coarse granularities. Given that fine-grained (block-wise) quantization is now a standard technique for mitigating outliers and improving accuracy at low precision, understanding the interplay between different number formats and quantization granularity is crucial for effective algorithm-hardware co-design. The paper aims to fill this critical research gap.

2.2. Main Contributions / Findings

The paper makes several significant contributions:

  • Performance Crossover Revelation: It reveals a critical performance crossover where FP formats hold a distinct advantage in coarse-grained scenarios, but INT formats become highly competitive as the block size shrinks (i.e., fine-grained quantization).
  • Theoretical and Statistical Framework: The authors develop a theoretical and statistical framework to model the Quantization Signal-to-Noise Ratio (QSNR) for both INT and FP formats. This framework enables a direct theoretical comparison and clarifies the crossover points.
  • MXINT8 Superiority: The study demonstrates that MXINT8 consistently outperforms MXFP8 in both direct-cast inference and low-bit training (8-bit settings). This is a strong finding challenging the FP-centric trend.
  • NVINT4 with Outlier Mitigation: For 4-bit formats, while FP (e.g., MXFP4, NVFP4) often shows an initial accuracy advantage, the paper shows that NVINT4 can surpass NVFP4 when combined with outlier-mitigation techniques like Hadamard rotation.
  • Symmetric Clipping for INT Training: A novel symmetric clipping method is introduced to resolve gradient bias in fine-grained low-bit INT training. This technique enables nearly lossless performance for MXINT8 training, addressing a critical challenge for INT formats in training contexts.
  • Hardware Efficiency of INT: A comparative hardware cost analysis reveals that fine-grained INT formats are significantly more area- and energy-efficient than their floating-point counterparts at matched throughput.
  • Challenge to Hardware Trajectory: Collectively, these findings challenge the prevailing FP-centric trajectory in AI hardware design. The paper advocates for prioritizing fine-grained INT formats to achieve a more optimal balance of accuracy and efficiency in future AI accelerators, asserting that a one-size-fits-all FP approach is suboptimal.

3. Prerequisite Knowledge & Related Work

This section provides foundational knowledge essential for understanding the paper's contribution.

3.1. Foundational Concepts

3.1.1. Large Language Models (LLMs)

Large Language Models (LLMs) are advanced artificial intelligence models, typically based on the Transformer architecture, designed to understand, generate, and process human language. They are characterized by their vast number of parameters (ranging from billions to trillions) and are trained on enormous datasets of text and code. Due to their size, deploying LLMs efficiently requires significant computational and memory resources.

3.1.2. Quantization

Quantization is a technique used in deep learning to reduce the precision of numerical representations (e.g., weights, activations) in neural networks, typically from 32-bit floating-point (FP32) to lower bit-widths like 8-bit integers (INT8) or 4-bit floating-point (FP4). The primary goals of quantization are to:

  • Reduce Memory Footprint: Lower-precision numbers require less storage.
  • Improve Computational Efficiency: Operations on lower-precision numbers are often faster and consume less energy on specialized hardware.
  • Reduce Bandwidth Requirements: Less data needs to be moved between memory and compute units. The challenge is to achieve these benefits while minimizing accuracy loss.

3.1.3. Integer (INT) Quantization

Integer (INT) quantization maps high-precision floating-point numbers to a finite set of integer values. This typically involves a scale factor and sometimes a zero-point (for asymmetric quantization) to map the floating-point range to the integer range. The basic process often looks like this:

  1. Scaling: Divide the floating-point value by a scale factor $s$.

  2. Rounding: Round the scaled value to the nearest integer.

  3. Clamping/Clipping: Ensure the integer falls within the target integer range (e.g., $[-127, 127]$ for symmetric INT8).

  4. Dequantization: To use the quantized values in computation or convert them back, they are multiplied by the scale factor $s$.

    The paper uses a symmetric INT quantization where values are centered around zero. For $b$-bit integer quantization, the general formula for quantization and dequantization is: $\mathbf{X_q} = \mathrm{clip}\left( \left\lfloor \frac{\mathbf{X}}{s} \right\rceil, Q_{\mathrm{min}}, Q_{\mathrm{max}} \right) \cdot s$ Where:

  • $\mathbf{X}$: The original high-precision tensor.
  • $\mathbf{X_q}$: The dequantized (reconstructed) tensor after quantization.
  • $s$: The scale factor used to normalize $\mathbf{X}$ to the target integer range.
  • $\lfloor \cdot \rceil$: The round-to-nearest function, which rounds a number to the closest integer.
  • $\mathrm{clip}(\cdot, Q_{\mathrm{min}}, Q_{\mathrm{max}})$: A function that clips (constrains) the value to be within the range $[Q_{\mathrm{min}}, Q_{\mathrm{max}}]$.
  • $Q_{\mathrm{min}}$, $Q_{\mathrm{max}}$: The minimum and maximum representable integer values for the given bit-width $b$. For standard signed $b$-bit integers, these are typically $-2^{b-1}$ and $2^{b-1}-1$. However, the paper introduces a symmetric clipping method where $Q_{\mathrm{min}} = -(2^{b-1}-1)$ and $Q_{\mathrm{max}} = 2^{b-1}-1$ to avoid gradient bias, meaning an INT8 range would be $[-127, 127]$ instead of $[-128, 127]$.
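
To make the round trip concrete, the following is a minimal NumPy sketch of this symmetric AbsMax quantize-dequantize for a single block; the function and variable names are illustrative, and block/scale-format details (e.g., UE8M0 scales) are omitted.

```python
import numpy as np

def int_quant_dequant(x: np.ndarray, bits: int = 8) -> np.ndarray:
    """Symmetric AbsMax INT quantize-dequantize of one block (illustrative sketch)."""
    q_max = 2 ** (bits - 1) - 1            # symmetric range [-(2^{b-1}-1), 2^{b-1}-1]
    s = np.max(np.abs(x)) / q_max          # AbsMax scale factor
    if s == 0:
        return np.zeros_like(x)            # all-zero block: nothing to quantize
    x_int = np.clip(np.rint(x / s), -q_max, q_max)  # scale, round to nearest, clip
    return x_int * s                       # dequantize back to high precision

# Example: one block of 32 values quantized to INT8 and reconstructed
block = np.random.randn(32).astype(np.float32)
print(int_quant_dequant(block, bits=8))
```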

3.1.4. Floating-Point (FP) Quantization

Floating-point (FP) quantization represents numbers using a sign bit, an exponent, and a mantissa. This format offers a wider dynamic range (the range of expressible values) compared to integers for the same number of bits, making it more robust to outliers (extremely large or small values). The exponent determines the magnitude, and the mantissa determines the precision.

A floating-point number is decoded as: $\mathbb{C}_{\mathrm{FP}} = \begin{cases} (-1)^{s} \times (1.m)_2 \times 2^{e-\mathrm{bias}} & \text{if } e \neq 0 \text{ (Normal)}, \\ (-1)^{s} \times (0.m)_2 \times 2^{1-\mathrm{bias}} & \text{if } e = 0, m \neq 0 \text{ (Subnormal)}, \end{cases}$ Where:

  • $s$: The sign bit (0 for positive, 1 for negative).

  • $e$: The exponent value.

  • $m$: The mantissa (or significand) value.

  • bias: An offset added to the exponent to allow representation of both very small and very large numbers.

  • $(1.m)_2$: Represents $1 + \sum_{i=1}^{M} m_i 2^{-i}$ for normal numbers, where $M$ is the number of mantissa bits. The "1." is an implicit leading bit.

  • $(0.m)_2$: Represents $\sum_{i=1}^{M} m_i 2^{-i}$ for subnormal numbers, which are used to represent numbers very close to zero without losing precision (at the cost of dynamic range).

  • ExMy: A notation where $x$ is the number of exponent bits and $y$ is the number of mantissa bits. For example, E4M3 has 4 exponent bits and 3 mantissa bits.

    Floating-point quantization is expressed as: $\mathbf{X_q} = \mathrm{Nearest}\left( \frac{\mathbf{X}}{s}, \mathbb{C}_{\mathrm{FP}} \right) \cdot s$ Where:

  • $\mathbf{X}$: The original high-precision tensor.

  • $\mathbf{X_q}$: The dequantized tensor.

  • $s$: The scale factor.

  • $\mathrm{Nearest}(\cdot, \mathbb{C}_{\mathrm{FP}})$: A function that maps a normalized value to the nearest representable value in the set of low-bit floating-point values $\mathbb{C}_{\mathrm{FP}}$.
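
As a concrete illustration of the $\mathrm{Nearest}(\cdot, \mathbb{C}_{\mathrm{FP}})$ mapping, the sketch below enumerates the E2M1 (FP4, bias 1) codebook from the decoding rule above and snaps an AbsMax-scaled block to it; the helper names are illustrative and not taken from the paper.

```python
import numpy as np

def e2m1_codebook() -> np.ndarray:
    """Enumerate representable FP4 (E2M1, bias = 1) values from the decoding rule."""
    mags = {0.0, 0.5 * 2.0 ** (1 - 1)}            # zero and the subnormal (0.1)_2 * 2^{1-bias}
    for e in range(1, 4):                          # normal exponents e = 1..3
        for m in (0.0, 0.5):                       # one mantissa bit: (1.0)_2 or (1.1)_2
            mags.add((1.0 + m) * 2.0 ** (e - 1))
    mags = np.array(sorted(mags))                  # {0, 0.5, 1, 1.5, 2, 3, 4, 6}
    return np.concatenate([-mags[::-1], mags])     # signed codebook

def fp_quant_dequant(x: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """AbsMax-scaled nearest-codebook FP quantize-dequantize of one block."""
    s = np.max(np.abs(x)) / np.max(codebook)       # map the block max to Q_max (6 for E2M1)
    if s == 0:
        return np.zeros_like(x)
    idx = np.abs(x[:, None] / s - codebook[None, :]).argmin(axis=1)
    return codebook[idx] * s

block = np.random.randn(16).astype(np.float32)
print(fp_quant_dequant(block, e2m1_codebook()))
```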

3.1.5. Quantization Granularity

Quantization granularity refers to the scope over which a single scale factor (and zero-point if applicable) is applied. Finer granularity generally leads to better accuracy because it can adapt to local variations in data distribution, but it also increases the overhead (memory and computation) for storing and applying more scale factors. Common granularities include:

  • Per-tensor: A single scale factor is applied to the entire tensor. Simplest, but least accurate for diverse distributions.
  • Per-channel: A scale factor is applied to each input or output channel of a layer. Common for weights.
  • Block-k (Block-wise): The tensor is partitioned into smaller blocks (e.g., $1 \times k$ elements), and each block has its own scale factor. This is a fine-grained quantization method and is the primary focus of this paper. For LLMs, this is particularly effective because activation outliers are often localized to small regions.

3.1.6. Activation Outliers

Activation outliers are values with exceptionally large magnitudes that appear infrequently within the activation tensors of neural networks, particularly in LLMs based on the Transformer architecture. These outliers pose a significant challenge for low-precision quantization because they demand a very wide dynamic range to be represented accurately. If the quantization range is not wide enough, these outliers get clipped, leading to a large quantization error that can severely degrade model accuracy. Conversely, if the range is made wide enough to accommodate outliers, the precision for the majority of "normal" values might be reduced, also impacting accuracy.

3.1.7. Crest Factor ($\kappa$)

The crest factor ($\kappa$) is a dimensionless parameter that quantifies the ratio of the peak value to the effective value of a signal. In the context of quantization, it measures how "peaky" a data distribution is. The paper defines the crest factor as: $\kappa := \frac{\max(|\mathbf{X}|)}{\sigma}$ Where:

  • $\max(|\mathbf{X}|)$: The maximum absolute value within a block of the tensor $\mathbf{X}$.
  • $\sigma$: The root-mean-square (RMS) value (standard deviation for zero-mean data) of the block. A higher crest factor indicates the presence of significant outliers that are much larger than the average magnitude of the values in the block. A lower crest factor suggests a more uniform distribution without extreme peaks. This metric is critical because it directly influences the required dynamic range for accurate quantization and helps determine whether INT or FP formats are more suitable.

3.1.8. Quantization Signal-to-Noise Ratio (QSNR)

The Quantization Signal-to-Noise Ratio (QSNR) is a metric used to quantify the numerical fidelity of a quantized signal. It measures the ratio of the power of the original signal to the power of the quantization noise (the error introduced by quantization). A higher QSNR indicates that the quantized signal is a more faithful representation of the original, with less introduced error. The formula for QSNR (in decibels, dB) is: $\mathrm{QSNR} = -10 \log_{10} \left( \frac{\| \mathbf{X} - \mathbf{X}_q \|^2}{\| \mathbf{X} \|^2} \right)$ Where:

  • $\| \mathbf{X} - \mathbf{X}_q \|^2$: The squared Euclidean norm (or sum of squared differences) of the quantization error (difference between the original tensor $\mathbf{X}$ and the dequantized tensor $\mathbf{X}_q$). This represents the power of the noise.
  • $\| \mathbf{X} \|^2$: The squared Euclidean norm of the original tensor $\mathbf{X}$. This represents the power of the signal.
  • $-10 \log_{10}(\cdot)$: Converts the ratio to decibels, where a larger positive number indicates a better signal-to-noise ratio.
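
Both block statistics are easy to compute directly; the sketch below (illustrative names, with INT8 AbsMax quantization used only as an example quantizer) evaluates the crest factor of a block and the empirical QSNR of its quantized reconstruction.

```python
import numpy as np

def crest_factor(block: np.ndarray) -> float:
    """kappa = max(|X|) / RMS(X) for one block."""
    rms = np.sqrt(np.mean(block ** 2))
    return float(np.max(np.abs(block)) / rms)

def qsnr_db(x: np.ndarray, x_q: np.ndarray) -> float:
    """QSNR = -10 * log10(||X - X_q||^2 / ||X||^2), in dB."""
    return float(-10.0 * np.log10(np.sum((x - x_q) ** 2) / np.sum(x ** 2)))

# Example: an injected outlier raises the crest factor and typically lowers INT QSNR.
block = np.random.randn(32)
block[0] = 10.0                              # outlier
s = np.max(np.abs(block)) / 127.0            # INT8 AbsMax scale
x_q = np.clip(np.rint(block / s), -127, 127) * s
print(crest_factor(block), qsnr_db(block, x_q))
```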

3.1.9. Hadamard Rotation

Hadamard rotation is an outlier-mitigation technique used in conjunction with quantization. It involves multiplying the input tensor by a Hadamard matrix, which is an orthogonal matrix with entries of +1 or -1. The effect of this transformation is to spread out the values of the input tensor, making the distribution more uniform and effectively reducing the crest factor. By reducing the crest factor, Hadamard rotation can make INT quantization more effective, as it minimizes the impact of extreme outliers that would otherwise demand a larger dynamic range. After the quantized operation, the inverse Hadamard rotation can be applied to recover the original distribution.
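
A minimal sketch of this idea, assuming a power-of-two block size and a normalized Hadamard matrix built via Sylvester's construction (helper names are illustrative): rotating a block that contains an outlier spreads its energy and lowers the crest factor, and the transpose rotation recovers the original values.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Normalized Hadamard matrix of size n (power of two), Sylvester construction."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)                       # orthonormal: H @ H.T == I

def crest(v: np.ndarray) -> float:
    return float(np.max(np.abs(v)) / np.sqrt(np.mean(v ** 2)))

block = np.random.randn(32)
block[0] = 20.0                                 # inject an outlier
H = hadamard(32)
rotated = H @ block                             # spread the outlier's energy

print(crest(block), crest(rotated))             # crest factor drops after rotation
print(np.allclose(H.T @ rotated, block))        # inverse rotation recovers the block
```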

3.2. Previous Works

The paper contextualizes its work by referencing prior studies in quantization algorithms and hardware support.

  • Quantization Algorithms: Previous research includes post-training quantization (PTQ) methods (e.g., [15, 20, 36, 41]), where quantization is applied after a model is fully trained, and quantization-aware training (QAT) [7, 23], which incorporates quantization effects into the training loop. Low-bit training [9, 27, 39] aims to train models directly using low-precision numbers for both forward and backward passes. Some works also explore scaling laws for quantization [5, 8, 16, 19].
    • Core Gap: The paper notes that most prior work focuses on a single low-bit format (either INT or FP) and lacks direct, systematic comparisons between them across varying granularities. While [45] studies mixed-format quantization in PTQ, it doesn't provide the unified INT vs. FP comparison the current paper undertakes.
  • Hardware Support: Earlier AI accelerators [29, 30] typically did not natively support fine-grained quantization, posing challenges for algorithms dealing with outliers using per-channel quantization [6, 41]. More recently, the Microscaling (MX) data formats [34] were proposed, combining per-block scaling with a block size of 32 to enhance low-bit quantization. NVIDIA's Blackwell architecture [31] has incorporated native hardware support for MXFP8, MXFP4, and NVFP4, underscoring the industry's lean towards fine-grained floating-point formats.

3.3. Technological Evolution

The field of AI model deployment has seen a continuous drive towards efficiency. Initially, models primarily operated in FP32 or FP16. As models grew, BFloat16 (a 16-bit floating-point format with a wider exponent range than FP16) became popular for training, offering a good balance between precision and computational efficiency. INT8 quantization gained traction for inference due to its speed and memory benefits, but faced challenges with outliers, especially in LLMs.

To address outliers, the industry started favoring low-precision floating-point (FP) formats like FP8 and FP4, believing their superior dynamic range could handle the extreme values better. This led to hardware advancements like NVIDIA's Blackwell offering native support for MXFP formats.

This paper positions itself at a critical juncture in this evolution. It challenges the assumption that FP is universally superior for low-bit quantization, especially at fine-grained granularities. By systematically comparing INT and FP across various bit-widths and granularities, and by introducing new techniques like symmetric clipping for INT training and demonstrating the efficacy of Hadamard rotation for NVINT4, the paper suggests a re-evaluation of the current hardware trajectory. It advocates for fine-grained INT formats as a potentially more optimal solution for future AI accelerators, potentially shifting the focus back towards specialized integer hardware where appropriate.

3.4. Differentiation Analysis

The core differentiations and innovations of this paper's approach compared to main methods in related work are:

  • Unified and Systematic Comparison: Unlike previous studies that often focus on a single low-bit format or conduct comparisons at coarse granularities, this paper provides a comprehensive, systematic, and unified comparison of FP and INT formats across varying granularities (specifically fine-grained block-wise quantization) and multiple bit-widths (8-bit, 6-bit, 4-bit).

  • Identification of Performance Crossover: The paper uniquely identifies a critical performance crossover point where the relative advantage of FP and INT formats changes based on quantization granularity and crest factor. It shows FP excels coarsely, but INT becomes highly competitive finely.

  • Introduction of Integer Counterparts for MX/NV Formats: To enable a direct comparison, the paper introduces and evaluates integer variants (e.g., MXINT8, MXINT6, MXINT4, NVINT4) that align with existing Microscaling (MX) and NVIDIA (NV) floating-point formats. This allows for a fair algorithm-hardware co-design perspective.

  • Algorithmic and Hardware Efficiency Argument for MXINT8: The paper presents strong evidence that MXINT8 not only outperforms MXFP8 in algorithmic accuracy but also offers significant hardware efficiency benefits (area and energy reduction), challenging the perceived universal superiority of FP8.

  • Enhanced NVINT4 with Outlier Mitigation: For 4-bit, where FP often holds an initial edge, the paper demonstrates that NVINT4 can surpass NVFP4 by integrating an outlier-mitigation technique (Hadamard rotation), showcasing the potential of combining INT with algorithmic enhancements.

  • Novel Symmetric Clipping Method for INT Training: A practical innovation is the introduction of a symmetric clipping method to resolve gradient bias in fine-grained low-bit INT training. This addresses a specific limitation of INT formats and enables nearly lossless MXINT8 training, which was previously a domain primarily explored for FP8.

  • Comprehensive Trade-off Analysis: The study integrates theoretical QSNR analysis, tensor-wise analysis on real LLM data, direct-cast inference on diverse LLMs, low-bit training results, and hardware cost modeling, providing a holistic view of the trade-offs.

    In essence, the paper moves beyond simply demonstrating quantization, instead offering a deep, comparative dive that questions prevailing industry assumptions and provides concrete guidance for future AI accelerator design.

4. Methodology

This section details the technical solutions proposed and evaluated in the paper. The core idea is to systematically compare low-bit integer (INT) and floating-point (FP) quantization formats across different granularities, bit-widths, and operational contexts (inference and training), supported by theoretical analysis and hardware cost modeling.

4.1. Principles

The fundamental principle driving this research is the hypothesis that while floating-point (FP) formats might be advantageous for coarse-grained quantization due to their superior dynamic range in handling widespread outliers, the landscape could shift significantly at fine-grained granularities. As quantization granularity becomes finer (i.e., smaller block sizes), the local dynamic range within each block is reduced. This reduced local variation might diminish FP's advantage, allowing integer (INT) formats, with their uniform precision and simpler hardware implementation, to become highly competitive or even superior, especially when combined with appropriate outlier-mitigation techniques and training stability measures. The paper aims to rigorously test this hypothesis through a comprehensive theoretical, empirical, and hardware-centric study.

4.2. Core Methodology In-depth

4.2.1. Low-Precision Integer Formats

For $b$-bit integer quantization, the paper defines the quantization and dequantization process as follows: $\mathbf{X_q} = \mathrm{clip}\left( \left\lfloor \frac{\mathbf{X}}{s} \right\rceil, Q_{\mathrm{min}}, Q_{\mathrm{max}} \right) \cdot s$ Where:

  • $\mathbf{X}$: Represents the high-precision input tensor (e.g., BFloat16).

  • $\mathbf{X_q}$: Represents the dequantized output tensor, which is an approximation of $\mathbf{X}$ after being mapped to a low-bit integer representation and then converted back to high precision.

  • $s$: Is the scale factor. This floating-point value is crucial as it determines the mapping between the high-precision range of $\mathbf{X}$ and the fixed integer range. It is typically calculated to cover the range of $\mathbf{X}$ without excessive clipping.

  • $\lfloor \cdot \rceil$: Denotes the round-to-nearest function. This operation takes a floating-point number and rounds it to the closest integer.

  • $\mathrm{clip}(\cdot, Q_{\mathrm{min}}, Q_{\mathrm{max}})$: This function performs clipping. It ensures that the integer value, after rounding, stays within the defined range $[Q_{\mathrm{min}}, Q_{\mathrm{max}}]$. If the value is less than $Q_{\mathrm{min}}$, it becomes $Q_{\mathrm{min}}$; if it is greater than $Q_{\mathrm{max}}$, it becomes $Q_{\mathrm{max}}$.

  • $Q_{\mathrm{min}}$, $Q_{\mathrm{max}}$: These define the minimum and maximum representable integer values for a given bit-width $b$. For standard signed $b$-bit integers using two's complement, $Q_{\mathrm{min}} = -2^{b-1}$ and $Q_{\mathrm{max}} = 2^{b-1}-1$. For example, for INT8, this would be $[-128, 127]$.

    The paper, however, introduces a crucial modification for INT formats: symmetric clipping. They find that the standard asymmetric range (e.g., $[-128, 127]$ for INT8) can degrade INT8 training due to a persistent negative bias in gradients. To resolve this, they enforce a symmetric integer range for all INT quantizers: $Q_{\mathrm{min}} = -(2^{b-1}-1), \quad Q_{\mathrm{max}} = 2^{b-1}-1$ For INT8, this translates to $Q_{\mathrm{min}} = -(2^{8-1}-1) = -127$ and $Q_{\mathrm{max}} = 2^{8-1}-1 = 127$. This means the range is $[-127, 127]$, sacrificing one negative value to ensure symmetry around zero.

4.2.2. Low-Precision Floating-Point Formats

Floating-point (FP) representation is characterized by a sign bit (S), an exponent (E), and a mantissa (M). The paper uses the ExMy notation, where $x$ is the number of exponent bits and $y$ is the number of mantissa bits. A floating-point number is decoded according to the following formula: $\mathbb{C}_{\mathrm{FP}} = \begin{cases} (-1)^{s} \times (1.m)_2 \times 2^{e-\mathrm{bias}} & \text{if } e \neq 0 \text{ (Normal)}, \\ (-1)^{s} \times (0.m)_2 \times 2^{1-\mathrm{bias}} & \text{if } e = 0, m \neq 0 \text{ (Subnormal)}, \end{cases}$ Where:

  • $\mathbb{C}_{\mathrm{FP}}$: Represents the set of all representable low-bit floating-point values for a given format.

  • $s$: Is the value of the sign bit. If $s=0$, the number is positive; if $s=1$, it is negative.

  • $(1.m)_2$: Represents the mantissa (or significand) for normal numbers. It implicitly has a leading '1', meaning the value is $1 + m$, where $m$ is the fractional part represented by the mantissa bits.

  • $2^{e-\mathrm{bias}}$: Is the exponent part. $e$ is the raw exponent value, and bias is an offset specific to the FP format that allows representation of both very small and very large numbers.

  • Normal: Refers to floating-point numbers where the exponent field $e$ is not all zeros or all ones. These numbers have an implicit leading '1' in their mantissa.

  • $(0.m)_2$: Represents the mantissa for subnormal numbers. In this case, the implicit leading bit is '0', allowing representation of numbers even closer to zero than the smallest normal number, at the cost of precision.

  • Subnormal: Refers to floating-point numbers where the exponent field $e$ is all zeros, but the mantissa $m$ is not all zeros.

    The general form for floating-point quantization is given by: $\mathbf{X_q} = \mathrm{Nearest}\left( \frac{\mathbf{X}}{s}, \mathbb{C}_{\mathrm{FP}} \right) \cdot s$ Where:

  • $\mathrm{Nearest}(\cdot, \mathbb{C}_{\mathrm{FP}})$: This function takes a high-precision value (normalized by $s$) and maps it to the closest representable value within the discrete set of low-bit floating-point numbers $\mathbb{C}_{\mathrm{FP}}$.

4.2.3. Quantization Granularity

The paper primarily focuses on block quantization, a fine-grained approach. In this scheme, a tensor is divided into smaller blocks, and each block receives its own scale factor. This allows the quantization range to adapt more precisely to local data distributions, which is crucial for mitigating the impact of outliers common in LLMs.

4.2.4. Block-Quantization Formats

The paper compares standard formats and introduces custom integer variants for a fair comparison. The formats are derived from Microscaling (MX) and NVIDIA (NV) specifications.

  • MX Formats:
    • Block Size: 32 elements.
    • Scale Type: UE8M0 (Unsigned Exponent 8, Mantissa 0) for the scale factor. This means the scale factor itself is represented in a low-precision floating-point format that primarily offers dynamic range (via 8 exponent bits) with minimal precision (0 mantissa bits).
    • Variants: MXFP8 (E4M3), MXFP6 (E2M3), MXFP4 (E2M1), and their integer counterparts MXINT8, MXINT6, MXINT4. The FP variants prioritize mantissa bits for precision given the fine-grained context.
  • NV Formats:
    • Block Size: 16 elements (finer than MX).

    • Scale Type: E4M3 for the first-level scale, and an FP32 second-level per-tensor scale to prevent overflow. E4M3 (Exponent 4, Mantissa 3) offers more precision for the scale factor compared to UE8M0.

    • Variants: NVFP4 and its integer counterpart NVINT4.

      The following table summarizes the block formats studied, including the integer variants introduced by the paper:

The following are the results from Table 1 of the original paper:

| Format | Block Size | Max Value | Min Value | Dynamic Range | Scale-1 | Scale-2 |
| --- | --- | --- | --- | --- | --- | --- |
| MXFP8 (E4M3) | 32 | ±448 | ±2^-9 | 1.75 × 2^17 | UE8M0 | - |
| MXINT8 | 32 | ±127 | ±1 | 127 | UE8M0 | - |
| MXFP6 (E2M3) | 32 | ±7.5 | ±0.125 | 60 | UE8M0 | - |
| MXINT6 | 32 | ±31 | ±1 | 31 | UE8M0 | - |
| MXFP4 (E2M1) | 32 | ±6 | ±0.5 | 12 | UE8M0 | - |
| MXINT4 | 32 | ±7 | ±1 | 7 | UE8M0 | - |
| NVFP4 | 16 | ±6 | ±0.5 | 12 | E4M3 | FP32 |
| NVINT4 | 16 | ±7 | ±1 | 7 | E4M3 | FP32 |

4.2.5. Quantization Compute Flow

The paper illustrates the computation flow for low-bit inference and training using a linear layer as an example. This flow dictates where and when quantization operations occur for weights, activations, and their gradients.

The following figure (Figure 1 from the original paper) shows the compute flow of low-bit forward and backward propagation of linear layer:

Figure 1 Compute flow of low-bit forward and backward propagation of a linear layer.

Given high-precision (e.g., BFloat16) activations $\mathbf{X}$ and weights $\mathbf{W}$, the forward pass of a quantized linear layer computes the output $\mathbf{Y}$: $\mathbf{Y} = \underbrace{\mathrm{Quantize}(\mathbf{X})}_{\textcircled{1}} \cdot \underbrace{\mathrm{Quantize}(\mathbf{W})}_{\textcircled{2}}$ Here:

  • ①: Represents the quantization of the input activations $\mathbf{X}$.
  • ②: Represents the quantization of the weights $\mathbf{W}$. The output $\mathbf{Y}$ would then be computed using low-precision General Matrix Multiply (GEMM) operations.

The backward pass involves computing gradients for activations ($d\mathbf{X}$) and weights ($d\mathbf{W}$). To compute $d\mathbf{X}$: $d\mathbf{X} = \underbrace{\mathrm{Quantize}(\mathbf{dY})}_{\textcircled{3}} \cdot \underbrace{\mathrm{Quantize}(\mathbf{W}^{T})}_{\textcircled{4}}$ Here:

  • ③: Represents the quantization of the gradient of the output $\mathbf{dY}$.

  • ④: Represents the quantization of the transpose of the weights $\mathbf{W}^{T}$.

    To compute $d\mathbf{W}$: $d\mathbf{W} = \underbrace{\mathrm{Quantize}(\mathbf{X}^{T})}_{\textcircled{5}} \cdot \underbrace{\mathrm{Quantize}(\mathbf{dY}^{T})}_{\textcircled{6}}$ Here:

  • ⑤: Represents the quantization of the transpose of the input activations $\mathbf{X}^{T}$.

  • ⑥: Represents the quantization of the transpose of the gradient of the output $\mathbf{dY}^{T}$.

    In total, there are six quantization operations in one linear layer during training. The paper notes that for block-wise quantization, tensors must be quantized along the GEMM reduction dimension to gain hardware benefits. This means the quantization axes for operations (① and ⑤), (② and ④), and (③ and ⑥) are different.
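
A rough sketch of these six quantization sites, using a placeholder block quantizer and standard linear-layer shape conventions ($\mathbf{X}$: [tokens, in], $\mathbf{W}$: [out, in], $\mathbf{Y} = \mathbf{X}\mathbf{W}^{T}$); the shapes, names, and the simplified per-row quantizer are illustrative assumptions, not the paper's kernels.

```python
import numpy as np

def quantize(t: np.ndarray) -> np.ndarray:
    """Placeholder block quantizer: quantize-dequantize along the last (reduction) axis."""
    s = np.max(np.abs(t), axis=-1, keepdims=True) / 127.0 + 1e-12
    return np.clip(np.rint(t / s), -127, 127) * s

def linear_forward(x, w):
    # Y = Quantize(X) @ Quantize(W)^T        -> sites 1 and 2, reduction over the input dim
    return quantize(x) @ quantize(w).T

def linear_backward(x, w, d_y):
    # dX = Quantize(dY) @ Quantize(W^T)^T    -> sites 3 and 4, reduction over the output dim
    d_x = quantize(d_y) @ quantize(w.T).T
    # dW = Quantize(dY^T) @ Quantize(X^T)^T  -> sites 5 and 6, reduction over the token dim
    d_w = quantize(d_y.T) @ quantize(x.T).T
    return d_x, d_w

x = np.random.randn(8, 64); w = np.random.randn(32, 64); d_y = np.random.randn(8, 32)
y = linear_forward(x, w)
d_x, d_w = linear_backward(x, w, d_y)
print(y.shape, d_x.shape, d_w.shape)   # (8, 32) (8, 64) (32, 64)
```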

4.2.6. Quantization Operation: Scale Factor Computation

The scale factor $s$ is crucial for both INT and FP quantization. The paper employs the AbsMax quantizer approach, where $s$ is computed to map the maximum absolute value in a group to the maximum representable low-precision value. The initial scale factor $s$ is calculated as: $s = \frac{\mathrm{AbsMax}(\mathbf{X})}{Q_{\mathrm{max}}}$ Where:

  • $\mathrm{AbsMax}(\mathbf{X})$: Is the maximum absolute value within the group of values that share a single scale factor (e.g., within a block).

  • $Q_{\mathrm{max}}$: Is the maximum value of the target quantized type (refer to Table 1 for specific formats). This ensures that no value greater than $Q_{\mathrm{max}}$ is represented.

    For MX formats, the high-precision scale factor is further converted to the UE8M0 format. The conventional approach (as used by OCP [34]) involves rounding down: $s' = 2^{\mathrm{clip}\left( \lfloor \log_2(\mathrm{AbsMax}(\mathbf{X})) \rfloor - \lfloor \log_2(Q_{\mathrm{max}}) \rfloor, -127, 127 \right)}$ Where:

  • $s'$: The UE8M0 quantized scale factor.

  • $\lfloor \cdot \rfloor$: The floor function (rounding down).

  • This approach can introduce extra clipping error because rounding down might make the effective scale too small, causing values to exceed the maximum representable range.

    Following existing work, the paper adopts a strategy to round up the UE8M0 scale to avoid this clipping error: $s' = 2^{\mathrm{clip}\left( \lceil \log_2(s) \rceil, -127, 127 \right)}$ Where:

  • $\lceil \cdot \rceil$: Denotes the ceiling function (rounding up). This ensures that the effective range is always sufficient to cover the AbsMax value, preventing overflow, although it might slightly reduce precision for smaller values.
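
A small sketch contrasting the two UE8M0 conversions above (round-down vs. the adopted round-up); the block statistics in the example and the helper names are illustrative, and the symmetric INT8 case with $Q_{\mathrm{max}} = 127$ is assumed.

```python
import numpy as np

def ue8m0_scale_round_down(absmax: float, q_max: float) -> float:
    """OCP-style power-of-two scale: 2^(floor(log2(AbsMax)) - floor(log2(Q_max)))."""
    e = np.floor(np.log2(absmax)) - np.floor(np.log2(q_max))
    return float(2.0 ** np.clip(e, -127, 127))

def ue8m0_scale_round_up(absmax: float, q_max: float) -> float:
    """Round-up variant adopted in the paper: 2^(ceil(log2(AbsMax / Q_max)))."""
    e = np.ceil(np.log2(absmax / q_max))
    return float(2.0 ** np.clip(e, -127, 127))

absmax, q_max = 1.99, 127.0                       # example block statistics (INT8)
s_down = ue8m0_scale_round_down(absmax, q_max)
s_up = ue8m0_scale_round_up(absmax, q_max)
# Rounding down can push absmax / s above Q_max (clipping); rounding up cannot.
print(absmax / s_down, absmax / s_up, q_max)      # ~127.4 (clipped) vs ~63.7 vs 127
```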

4.2.7. Quantization Operation: Symmetric Clipping

As previously mentioned, the paper identifies a problem with asymmetric integer ranges (like $[-128, 127]$ for INT8) during low-bit training. This asymmetry leads to a persistent negative bias in gradients, especially pronounced in fine-grained quantization where more values might map to the unique negative endpoint (e.g., -128).

The following figure (Figure 2 from the original paper) shows the impact of clipping range on INT8 final training loss on 145M model with 20B training tokens:

Figure 2 Impact of clipping range on INT8 final training loss on a 145M model with 20B training tokens. Scale factor is kept in BF16 to emphasize the harm of asymmetric representation space during low-bit training.

Figure 2 clearly illustrates that using the asymmetric range $[-128, 127]$ for INT8 results in a higher (worse) final training loss compared to the symmetric range $[-127, 127]$. This degradation is more severe for finer granularities (smaller block sizes, like block 32), as more individual quantization blocks increase the probability of mapping values to the problematic $Q_{\mathrm{min}}$.

To mitigate this, the paper mandates the use of a symmetric integer range for all INT quantizers, as shown in Table 1: $Q_{\mathrm{min}} = -(2^{b-1}-1), \quad Q_{\mathrm{max}} = 2^{b-1}-1$ This adjustment ensures that the integer range is balanced around zero, preventing the gradient bias and enabling more stable low-bit INT training.
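
The bias can be illustrated numerically with a toy experiment (not the paper's training setup): if the asymmetric quantizer's scale is chosen so that the -128 endpoint is reachable (scale = AbsMax / $2^{b-1}$, an assumption made here for illustration), the block maximum is clipped whenever it is positive, so the mean reconstruction error drifts negative; the symmetric range shows no such drift.

```python
import numpy as np

def quant_dequant(x, q_min, q_max, denom):
    """AbsMax quantize-dequantize with an explicit integer range and scale denominator."""
    s = np.max(np.abs(x)) / denom
    return np.clip(np.rint(x / s), q_min, q_max) * s

rng = np.random.default_rng(0)
bias_asym, bias_sym = [], []
for _ in range(2000):
    block = rng.standard_normal(32)
    # (a) asymmetric range [-128, 127], scale chosen to reach the -128 endpoint (assumption)
    bias_asym.append(np.mean(quant_dequant(block, -128, 127, 128.0) - block))
    # (b) symmetric range [-127, 127], as adopted by the paper
    bias_sym.append(np.mean(quant_dequant(block, -127, 127, 127.0) - block))

print(np.mean(bias_asym), np.mean(bias_sym))   # (a) drifts negative, (b) stays near zero
```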

4.2.8. Theoretical Framework: QSNR Metric

To quantitatively compare the numerical fidelity of different quantization schemes, the paper uses the Quantization Signal-to-Noise Ratio (QSNR), measured in decibels (dB): $\mathrm{QSNR} = -10 \log_{10} \left( \frac{\| \mathbf{X} - \mathbf{X}_q \|^2}{\| \mathbf{X} \|^2} \right)$ Where:

  • $\| \mathbf{X} - \mathbf{X}_q \|^2$: Represents the power of the quantization noise, which is the squared Euclidean norm of the difference between the original signal $\mathbf{X}$ and the dequantized signal $\mathbf{X}_q$.
  • $\| \mathbf{X} \|^2$: Represents the power of the original signal, which is the squared Euclidean norm of $\mathbf{X}$. A higher QSNR value indicates a better preservation of the original signal's magnitude and direction, meaning lower quantization error.

4.2.9. Theoretical Framework: Common Assumptions for QSNR Derivation

For deriving the theoretical QSNR expressions, the paper makes several common assumptions:

  • Block Vectors: They consider block vectors $\mathbf{X} \in \mathbb{R}^k$ (where $k$ is the block size).
  • I.I.D. Entries: The entries $X_i$ within each block are assumed to be independent and identically distributed (i.i.d.) and follow a Gaussian distribution ($X_i \sim \mathcal{N}(0, \sigma^2)$), meaning they have a mean of 0 and a variance of $\sigma^2$.
  • Block RMS: The block root-mean-square (RMS) value is approximated by $\sigma$.
  • Crest Factor: The crest factor $\kappa$ is defined as the ratio of the maximum absolute value in the block to its RMS: $\kappa := \frac{\max(|\mathbf{X}|)}{\sigma}$
  • Blockwise AbsMax Scaling: The scale factor $s'$ (the actual scale used for quantization) is derived from blockwise absolute-maximum (AbsMax) scaling. The ideal scale $s$ matches the largest magnitude in the block to the maximum representable value of the low-precision format ($Q_{\mathrm{ref}}$): $s = \frac{\max(|\mathbf{X}|)}{Q_{\mathrm{ref}}}$ Where $Q_{\mathrm{ref}}$ is $2^{b-1}-1$ for symmetric INT(b) and $Q_{\mathrm{max}}$ for FP(E,M,B).
  • Scale Overhead: The actual scale $s'$ is related to the ideal scale $s$ by a factor $\rho$: $s' = \rho s$
    • For UE8M0 scale (used in MX formats), $\rho \in [1, 2)$ models the overhead due to rounding the scale to a power of two.
    • For E4M3 scale (used in NV formats), $\rho = 1$ is assumed because this scale closely matches the ideal value, introducing minimal overhead.

4.2.10. Theoretical Framework: Theorem 1 (INT QSNR)

Under the described assumptions, the QSNR for $b$-bit INT quantization is derived. The INT QSNR (in dB) is: $\mathrm{QSNR}_{\mathrm{INT}} \approx \begin{cases} 4.78 + 6.02\,b - 20\log_{10}(\rho) - 20\log_{10}(\kappa), & \text{UE8M0 scale} \\ 4.78 + 6.02\,b - 20\log_{10}(\kappa) + 10\log_{10}\left( \frac{g}{g-1} \right), & \text{E4M3 scale} \end{cases}$ Where:

  • $b$: Is the bit width of the integer format.
  • $\rho$: Is the scale overhead factor (for UE8M0 scale, $\rho \in [1, 2)$; for E4M3 scale, $\rho \approx 1$).
  • $\kappa$: Is the crest factor of the data in the block.
  • $g$: Is the block size (number of elements per block).
  • Interpretation:
    • Each additional bit $b$ provides approximately 6.02 dB gain in QSNR.
    • The UE8M0 scale introduces a penalty of up to $20\log_{10}(\rho)$ (maximum 6.02 dB if $\rho = 2$).
    • A larger crest factor $\kappa$ (more prominent outliers) reduces QSNR. Smaller blocks generally have smaller $\kappa$, thus improving QSNR.
    • The E4M3 scale (like in NVINT4) avoids the $\rho$ overhead and benefits from a $10\log_{10}(g/(g-1))$ gain, accounting for the near-error-free mapping of the block's maximum value.
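
Plugging numbers into Theorem 1 gives a feel for these trade-offs; the helper below evaluates the closed form for both scale types (the worst-case $\rho = 2$ for the UE8M0 branch and the example crest factor are assumptions chosen for illustration).

```python
import numpy as np

def qsnr_int_db(b: int, kappa: float, scale: str = "UE8M0",
                rho: float = 2.0, g: int = 16) -> float:
    """Theoretical INT QSNR from Theorem 1 (Eq. 13)."""
    if scale == "UE8M0":
        return 4.78 + 6.02 * b - 20 * np.log10(rho) - 20 * np.log10(kappa)
    # E4M3 scale: no rho overhead, plus a small gain from the block-max term
    return 4.78 + 6.02 * b - 20 * np.log10(kappa) + 10 * np.log10(g / (g - 1))

# MXINT8 (b=8, UE8M0 scale, worst-case rho=2) vs. NVINT4 (b=4, E4M3 scale, g=16) at kappa=4
print(qsnr_int_db(8, kappa=4.0, scale="UE8M0", rho=2.0))
print(qsnr_int_db(4, kappa=4.0, scale="E4M3", g=16))
```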

4.2.11. Theoretical Framework: Theorem 2 (FP QSNR)

For floating-point quantization, the QSNR (in dB) is derived. This derivation considers the different error contributions from normal and subnormal regions of the FP representation. The FP QSNR (in dB) is: $\mathrm{QSNR}_{\mathrm{FP}} \approx \begin{cases} -10 \log_{10}\left( \alpha_M w_{\mathrm{norm}} + \beta (\rho\kappa)^2 p_{\mathrm{sub}} \right), & \text{UE8M0 scale} \\ -10 \log_{10}\left( \alpha_M \left( w_{\mathrm{norm}} - \frac{\kappa^2}{g} \right) + \beta \kappa^2 p_{\mathrm{sub}} \right), & \text{E4M3 scale} \end{cases}$ Where the auxiliary terms are defined as:

  • $\alpha_M = \frac{1}{24 \cdot 2^{2M}}$: Represents the mantissa quantization error. $M$ is the mantissa bit width.

  • $\beta = \frac{2^{2(1-B-M)}}{12 Q_{\mathrm{max}}^2}$: Related to the subnormal step error. $B$ is the exponent bias.

  • $Q_{\mathrm{max}}$: Is the largest finite normal magnitude of the target FP format (e.g., 448 for E4M3).

  • $w_{\mathrm{norm}}$: Is the fraction of signal energy carried by normal FP numbers. It measures how much of the distribution falls into the normal region.

  • $p_{\mathrm{sub}}$: Is the probability that a value encodes as subnormal. It measures how much of the distribution falls into the subnormal region.

  • $\rho$: Is the scale overhead factor (for UE8M0 scale, $\rho \in [1, 2)$; for E4M3 scale, $\rho \approx 1$).

  • $\kappa$: Is the crest factor.

  • $g$: Is the block size.

  • Interpretation:

    • The mantissa bit width $M$ is crucial, setting an upper bound on FP QSNR. If there is ample dynamic range ($w_{\mathrm{norm}} \approx 1$ and $p_{\mathrm{sub}} \approx 0$), QSNR approaches $13.80 + 6.02\,M$ dB, largely independent of block specifics or data distribution.
    • A larger crest factor $\kappa$ increases the share of subnormals ($p_{\mathrm{sub}}$) and reduces QSNR. Finer-grained blocks (smaller $g$) tend to reduce $\kappa$, lower $p_{\mathrm{sub}}$, and thus improve QSNR.
    • The E4M3 scale (like in NVFP4) has no $\rho$ overhead and accounts for the per-block maximum value being mapped with minimal error, which reduces the effective error energy in the normal region by $\frac{\kappa^2}{g}$.

4.2.12. Theoretical Framework: Theoretical Comparisons

Using the derived QSNR formulas (Eq. 13 and Eq. 14), the paper theoretically compares INT and FP formats by plotting QSNR against the crest factor $\kappa$. A key finding is that the crest factor is the primary determinant of which format performs better.

The following figure (Figure 3 from the original paper) shows the theoretical QSNR comparison between various integer (INT) and floating-point (FP) formats across a range of crest factors ($\kappa$):

Figure 3 Theoretical QSNR comparison between various integer (INT) and floating-point (FP) formats across a range of crest factors ($\kappa$), derived from Eq. (13) and Eq. (14). The boxes represent the crest factor and QSNR of the crossover point of the INT and FP curves.

Figure 3 illustrates distinct crossover points for different bit-widths:

  • MXINT8 vs. MXFP8: MXINT8 outperforms MXFP8 when $\kappa < 7.55$. MXFP8's QSNR is relatively constant due to its large dynamic range and mantissa-bit bound.

  • MXINT6 vs. MXFP6: MXFP6 initially performs similarly to MXFP8 (both have three mantissa bits), but its QSNR drops rapidly as $\kappa$ increases due to a more limited dynamic range. MXINT6 only surpasses MXFP6 when $\kappa < 1.96$.

  • MXINT4 vs. MXFP4: MXINT4 beats MXFP4 when $\kappa < 2.04$.

  • NVINT4 vs. NVFP4: NVINT4 wins when $\kappa < 2.39$. An interesting observation is that NVFP4's QSNR can increase when $\kappa < 4$, because in this range, the normal domain error dominates, and a larger $\kappa$ can reduce this component, even as it increases subnormal error.

    These theoretical insights highlight that the decision between INT and FP is not absolute but depends critically on the data's crest factor and the specific bit-width.

4.2.13. Hardware Cost Modeling

The paper includes a hardware cost analysis to compare the area and energy efficiency of INT and FP formats. This analysis focuses on the core Matrix-Multiply Unit (MMU) components.

The model is based on the following:

  • Components Modeled: Multiply-and-Accumulate (MAC) unit, dequantizer, and FP32 accumulator. The quantizer is explicitly excluded from cost accounting.

  • Accumulation: FP32 accumulation is chosen to prevent error growth and preserve scalability.

  • MAC Unit Differences: FP multipliers are generally more area/energy efficient than INT multipliers, but FP adders are more complex and expensive than INT adders due to requirements like exponent comparison, mantissa alignment, and normalization.

  • Mantissa Aligner Width ($n$): A crucial parameter that affects both numerical fidelity and hardware complexity for FP operations. It is defined as: $n = \min\left( 2^{x+1} + 2y, \mathrm{psum\_bit\_width} \right)$ Where:

    • $x$: Is the number of exponent bits.
    • $y$: Is the number of mantissa bits. For INT formats, $x = 0$.
    • psum_bit_width: A cap, set to 24 in this evaluation.
    • This ensures the aligner is wide enough but doesn't exceed the accumulator's precision.
  • MAC Unit Structure: Modeled as a $k$-lane array (e.g., $k = 32$ for MX, $k = 16$ for NV). Each lane has a multiplier, and adders are fused into a multi-input adder tree with FP-specific logic. A single normalizer is shared across $k$ MAC lanes to reduce cost.

    The following are the results from Table 6 of the original paper:

    Sub-block INT Mul FP Mul INT Add FP Add Main Cells
    Multiplier k(x+y+1)^2 k(y+1)^2 AND, FA, HA
    Adder (mantissa/int) 2k(x+y+1) kn FA, HA
    Exponent adder kx FA, HA
    Exponent subtractor kx XOR, FA, HA
    Comparator kx XOR, AND, OR
    Aligner (barrel) kn·log2(n) MUX
    Normalizer (shared) n·log2(n) MUX, OR

Table 6 provides a gate-complexity model for the MAC Unit's sub-blocks. For instance, an INT multiplier scales with the square of the total bit-width ($x+y+1$), while an FP multiplier scales with the mantissa bit-width ($y+1$). FP adders involve more complex components like exponent adders/subtractors, comparators, and a barrel aligner that scales with $k \cdot n \cdot \log_2 n$, where $n$ is the aligner width. Main Cells (AND, FA, HA, XOR, OR, MUX) are standard logic gates.

  • Area and Energy Aggregation: The total Area and Energy for each component (MAC, ACC32, DEQ) are calculated by summing the contributions of individual logic gates (e.g., FA for Full Adder, HA for Half Adder, MUX for Multiplexer), weighted by their technology-dependent area ($A_g$) and energy ($E_g$) factors, and a toggle rate ($\tau_g$).

    • MAC Unit cost: $\mathrm{Area}_{\mathrm{MAC}} = \sum_{s \in S} \sum_{g \in \mathcal{G}} c_{s,g}(x, y, k, n)\, A_g$ and $\mathrm{Energy}_{\mathrm{MAC}} = \sum_{s \in S} \sum_{g \in \mathcal{G}} c_{s,g}(x, y, k, n)\, E_g \tau_g$ Where $S$ is the set of sub-block types, $\mathcal{G}$ is the set of cell types, $c_{s,g}$ is the count of cell $g$ in sub-block $s$, and $\tau_g$ is the toggle rate.
    • FP32 Accumulator (ACC32) cost: $\mathrm{Area}_{\mathrm{ACC32}} = \sum_{g \in \mathcal{G}} c_g^{\mathrm{ACC32}} A_g$ and $\mathrm{Energy}_{\mathrm{ACC32}} = \sum_{g \in \mathcal{G}} c_g^{\mathrm{ACC32}} E_g \tau_g$
    • Dequantizer (DEQ) cost: $\mathrm{Area}_{\mathrm{DEQ}} = \sum_{g \in \mathcal{G}} c_g^{\mathrm{DEQ}} A_g$ and $\mathrm{Energy}_{\mathrm{DEQ}} = \sum_{g \in \mathcal{G}} c_g^{\mathrm{DEQ}} E_g \tau_g$

  • Total MMU Cost: $\mathrm{Area}_{\mathrm{MMU}} = \mathrm{Area}_{\mathrm{MAC}} + \mathrm{Area}_{\mathrm{DEQ}} + \mathrm{Area}_{\mathrm{ACC32}}, \quad \mathrm{Energy}_{\mathrm{MMU}} = \mathrm{Energy}_{\mathrm{MAC}} + \mathrm{Energy}_{\mathrm{DEQ}} + \mathrm{Energy}_{\mathrm{ACC32}}$

  • Reuse Schemes: The paper also considers mixed-format configurations by evaluating different MAC unit configurations that allow reuse of hardware for different bit-widths. For example, an INT8 lane could be reconfigured for INT4, leading to area savings.

    The following are the results from Table 7 of the original paper:

    Throughput Ratio INT8 : INT4 = 1 : 2
    No reuse: 1 * int8_MAC_unit + 2 * int4_MAC_unit
    INT reuse scheme 1: 1 * int8_MAC_unit + 1 * int4_MAC_unit
    INT reuse scheme 2: 2 * int8_(u)int4_MAC_unit
    Throughput Ratio FP8 : FP4 = 1 : 2
    No reuse: 1 * e4m3_MAC_unit + 2 * e2m1_MAC_unit
    FP reuse scheme: 1 * e4m3_MAC_unit + 1 * e2m1_MAC_unit

Table 7 outlines different configurations for MAC units with throughput ratios for INT8:INT4 and FP8:FP4. No reuse implies separate hardware, while reuse schemes (like INT reuse scheme 2, which uses two INT8 lanes reconfigurable for INT4) are designed to optimize area. This model allows for a detailed comparison of the hardware implications of supporting various low-bit formats.
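
As a small worked example of the cost model's inputs, the sketch below evaluates the aligner-width rule $n = \min(2^{x+1} + 2y, 24)$ from this section for the studied formats; treating INT formats as $x = 0$ with $y = b - 1$ non-sign bits is an assumption made here for illustration.

```python
def aligner_width(x_bits: int, y_bits: int, psum_bit_width: int = 24) -> int:
    """n = min(2^(x+1) + 2*y, psum_bit_width); INT formats use x = 0."""
    return min(2 ** (x_bits + 1) + 2 * y_bits, psum_bit_width)

# (exponent bits x, mantissa bits y); INT8/INT4 modeled as x=0, y=b-1 (assumption)
formats = {"E4M3": (4, 3), "E2M3": (2, 3), "E2M1": (2, 1), "INT8": (0, 7), "INT4": (0, 3)}
for name, (x, y) in formats.items():
    print(name, aligner_width(x, y))   # E4M3 hits the 24-bit cap; INT formats stay narrow
```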

5. Experimental Setup

This section details the datasets, evaluation metrics, and models used to conduct the experimental comparisons.

5.1. Datasets

5.1.1. WikiText2

  • Source & Characteristics: WikiText2 [25] is a widely used dataset for language modeling, comprising a collection of "good" and "featured" articles from Wikipedia. It is known for having a diverse vocabulary and realistic text.
  • Usage: In this paper, WikiText2 sequences (length 4096) are fed into Llama3.1-8B to capture intermediate tensors (weights, activations, gradients) during both forward and backward propagation in BFloat16 precision. These captured tensors are then used for tensor-wise analysis (computing crest factors and QSNR) and for direct-cast inference evaluation (computing KL divergence and perplexity).
  • Why Chosen: WikiText2 serves as a representative dataset for evaluating language model performance, providing a realistic distribution of values for QSNR analysis and a standard benchmark for inference accuracy.

5.1.2. OLMo2-Mix-1124

  • Source & Characteristics: OLMo2-Mix-1124 [33] is a large-scale pretraining dataset.
  • Usage: This dataset is used for low-bit training experiments with Llama3-style models (1B and 3B parameters) over 100B and 200B training tokens, respectively.
  • Why Chosen: As a large pretraining dataset, it provides a robust environment to evaluate the stability and performance of low-bit training methods over extended training schedules.

5.2. Evaluation Metrics

5.2.1. Quantization Signal-to-Noise Ratio (QSNR)

  • Conceptual Definition: QSNR measures the fidelity of a quantized signal by comparing the power of the original signal to the power of the error (noise) introduced by quantization. A higher QSNR in decibels (dB) indicates less distortion and a more accurate representation of the original data.
  • Mathematical Formula: $\mathrm{QSNR} = -10 \log_{10} \left( \frac{\| \mathbf{X} - \mathbf{X}_q \|^2}{\| \mathbf{X} \|^2} \right)$
  • Symbol Explanation:
    • $\mathbf{X}$: The original high-precision tensor.
    • $\mathbf{X}_q$: The dequantized tensor (original tensor after quantization and dequantization).
    • $\| \cdot \|^2$: Denotes the squared Euclidean norm (sum of squares of all elements in the tensor).
    • $\| \mathbf{X} - \mathbf{X}_q \|^2$: Represents the total squared error (or noise power) introduced by quantization.
    • $\| \mathbf{X} \|^2$: Represents the total squared magnitude (or signal power) of the original tensor.
    • $-10 \log_{10}(\cdot)$: Converts the ratio of noise power to signal power into a decibel scale, where a larger positive value signifies better quality.

5.2.2. KL Divergence (Kullback-Leibler Divergence)

  • Conceptual Definition: KL Divergence (also known as relative entropy) is a measure of how one probability distribution is different from a second, reference probability distribution. In the context of LLMs, it quantifies the difference between the output probability distribution (logits) of a quantized model and a full-precision BFloat16 model. A lower KL Divergence indicates that the quantized model's predictions are closer to the full-precision model's predictions, implying better preservation of algorithmic accuracy. The paper calculates this over the softmax distribution restricted to the top-25 logits of the BFloat16 model to reduce noise and focus on critical predictions.
  • Mathematical Formula: $ D_{KL}(P || Q) = \sum_{i} P(i) \log \left(\frac{P(i)}{Q(i)}\right) $
  • Symbol Explanation:
    • $P(i)$: The probability of outcome $i$ according to the reference BFloat16 model's output distribution (softmax over top-25 logits).
    • $Q(i)$: The probability of outcome $i$ according to the quantized model's output distribution (softmax over top-25 logits).
    • $\sum_{i}$: Summation over all possible outcomes $i$ (in this case, the top-25 logits).
    • $\log$: Typically the natural logarithm ($\ln$).
    • $D_{KL}(P \| Q)$: The KL Divergence from $Q$ to $P$. It measures the information lost when $Q$ is used to approximate $P$.
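
A sketch of this evaluation protocol (restricting both softmaxes to the BF16 model's top-25 logits); the function name and the exact restriction details are illustrative assumptions rather than the paper's released code.

```python
import numpy as np

def topk_kl_divergence(ref_logits: np.ndarray, quant_logits: np.ndarray, k: int = 25) -> float:
    """KL(P || Q) over the softmax restricted to the reference model's top-k logits."""
    top = np.argsort(ref_logits)[-k:]                 # indices of the BF16 top-k logits

    def softmax(z: np.ndarray) -> np.ndarray:
        z = z - np.max(z)
        e = np.exp(z)
        return e / e.sum()

    p = softmax(ref_logits[top])                      # reference (BF16) distribution
    q = softmax(quant_logits[top])                    # quantized-model distribution
    return float(np.sum(p * np.log(p / q)))

# Example over a toy vocabulary, with small perturbations mimicking quantization error
rng = np.random.default_rng(0)
ref = rng.standard_normal(1000)
quant = ref + 0.05 * rng.standard_normal(1000)
print(topk_kl_divergence(ref, quant))
```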

5.2.3. Perplexity (PPL)

  • Conceptual Definition: Perplexity is a common metric for evaluating language models. It measures how well a probability distribution (the language model) predicts a sample (a sequence of words). A lower perplexity score indicates that the model is better at predicting the next word in a sequence, suggesting higher quality text generation and understanding.
  • Mathematical Formula: For a sequence of words W=w1,w2,,wNW = w_1, w_2, \dots, w_N, the perplexity is defined as: $ PPL(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i | w_1, \dots, w_{i-1})}} = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i | w_1, \dots, w_{i-1})} $ This is equivalent to 2 raised to the power of the cross-entropy loss per word.
  • Symbol Explanation:
    • WW: A sequence of NN words.
    • NN: The total number of words in the sequence.
    • P(wiw1,,wi1)P(w_i | w_1, \dots, w_{i-1}): The probability assigned by the language model to the ii-th word wiw_i, given all the preceding words w1,,wi1w_1, \dots, w_{i-1}.
    • i=1N\prod_{i=1}^{N}: Product over all words in the sequence.
    • i=1N\sum_{i=1}^{N}: Summation over all words in the sequence.
    • log2\log_2: Logarithm base 2.
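A minimal sketch, assuming per-token log-probabilities are already available from the model; note that the base of the logarithm does not matter as long as it matches the base of the exponentiation:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the average negative log-likelihood per token.

    Equivalent to the base-2 form above when log base and exponent base match.
    """
    n = len(token_logprobs)
    nll = -sum(token_logprobs) / n        # average negative log-likelihood (natural log)
    return math.exp(nll)

# Toy example: log P(w_i | context) for a 5-token sequence.
logprobs = [-2.1, -0.7, -1.3, -0.2, -3.0]
print(f"PPL = {perplexity(logprobs):.3f}")
```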

5.2.4. Training Loss

  • Conceptual Definition: During model training, training loss quantifies the discrepancy between the model's predictions and the true labels. The goal of training is to minimize this loss. For language models, this is typically cross-entropy loss. A lower training loss indicates that the model is learning more effectively and fitting the training data better. The paper uses an exponential moving average (EMA) with a coefficient of 0.9 to smooth the loss curves.
  • Mathematical Formula: For cross-entropy loss in a classification task (like next-token prediction in LLMs): $ L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{ic} \log(p_{ic}) $
  • Symbol Explanation:
    • NN: The number of samples (e.g., tokens in a batch).
    • CC: The number of possible classes (e.g., vocabulary size).
    • yicy_{ic}: A binary indicator (1 if sample ii belongs to class cc, 0 otherwise).
    • picp_{ic}: The predicted probability that sample ii belongs to class cc.
    • log\log: Typically the natural logarithm (ln\ln).
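The following sketch shows the cross-entropy computation together with the EMA smoothing (coefficient 0.9) described above; variable names and toy values are purely illustrative:

```python
import numpy as np

def cross_entropy(probs: np.ndarray, targets: np.ndarray) -> float:
    """Mean negative log-probability of the target class (natural log)."""
    n = probs.shape[0]
    return float(-np.mean(np.log(probs[np.arange(n), targets])))

def ema_smooth(values, coeff=0.9):
    """Exponential moving average, as used to smooth the training-loss curves."""
    out, running = [], values[0]
    for v in values:
        running = coeff * running + (1 - coeff) * v
        out.append(running)
    return out

# Toy example: 4 samples over 3 classes, plus a short loss curve.
p = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4], [0.25, 0.5, 0.25]])
y = np.array([0, 1, 2, 1])
print("cross-entropy:", cross_entropy(p, y))
print("smoothed loss:", ema_smooth([3.2, 3.0, 2.9, 2.85, 2.8]))
```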

5.2.5. Task Accuracy

  • Conceptual Definition: This refers to the standard accuracy metric on various common-sense reasoning tasks used to evaluate the downstream performance of the trained LLMs. It measures the proportion of correctly answered questions or tasks. The paper uses 5-shot evaluation, meaning the model is given 5 examples before being asked to perform the task.
  • Specific Tasks and Metrics:
    • WinoGrande [35]: Evaluated using acc (accuracy). This dataset focuses on common-sense reasoning by resolving ambiguous pronouns.
    • HellaSwag [44]: Evaluated using acc_norm (normalized accuracy). This task involves selecting the most plausible ending to a given sentence, designed to be challenging for models.
    • Arc_Challenge, Arc_Easy [10]: Evaluated using acc_norm. These are scientific question-answering tasks, with Arc_Challenge being harder.
    • PIQA [4]: Evaluated using acc_norm. This dataset involves physical common-sense reasoning.
    • Openbookqa [26]: Evaluated using acc_norm. This task also involves common-sense reasoning based on an open book of facts.
  • Why Chosen: These tasks are standard benchmarks for evaluating the common-sense reasoning capabilities of LLMs, providing a comprehensive assessment of how well the models retain their intelligence after low-bit training.

5.3. Baselines

The paper compares its proposed INT formats against several established and industry-standard FP formats, as well as the full-precision BFloat16 baseline.

  • Full Precision Baseline:
    • BFloat16 (BF16): This is the standard 16-bit floating-point format often used for training LLMs due to its wide dynamic range. All comparisons aim to achieve performance close to BF16.
  • Floating-Point Baselines (Fine-Grained):
    • MXFP8 (E4M3): A Microscaling 8-bit floating-point format with a block size of 32 and UE8M0 scale.
    • MXFP6 (E2M3): A Microscaling 6-bit floating-point format with a block size of 32 and UE8M0 scale.
    • MXFP4 (E2M1): A Microscaling 4-bit floating-point format with a block size of 32 and UE8M0 scale.
    • NVFP4: An NVIDIA 4-bit floating-point format with a block size of 16 and E4M3 scale (plus FP32 second-level scale).
  • Integer Counterparts (Introduced for Comparison):
    • MXINT8
    • MXINT6
    • MXINT4
    • NVINT4
These integer variants are designed to match the block sizes and scale-factor types of their FP counterparts, enabling a direct and fair comparison.
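To make the pairing concrete, the sketch below shows a simplified MXINT8-style fake quantization: blocks of 32 values share one power-of-two scale (mimicking a UE8M0 exponent-only scale) and integers are clipped symmetrically to [-127, 127]. It omits several details of the actual OCP MX specification and of the paper's kernels, so it should be read as an illustration only:

```python
import numpy as np

def mxint8_quant_dequant(x: np.ndarray, block: int = 32) -> np.ndarray:
    """Simplified MXINT8-style fake quantization: per-block power-of-two scale + symmetric INT8."""
    flat = x.reshape(-1, block).astype(np.float32)
    absmax = np.abs(flat).max(axis=1, keepdims=True)
    absmax = np.where(absmax == 0, 1.0, absmax)
    # Shared scale rounded up to a power of two, mimicking a UE8M0-style exponent-only scale.
    scale = 2.0 ** np.ceil(np.log2(absmax / 127.0))
    q = np.clip(np.round(flat / scale), -127, 127)   # symmetric clipping: the -128 code is never used
    return (q * scale).reshape(x.shape)

x = np.random.randn(4, 128).astype(np.float32)
x_dq = mxint8_quant_dequant(x)
print("max abs reconstruction error:", np.abs(x - x_dq).max())
```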

5.4. Models

5.4.1. Models for Direct-Cast Inference Evaluation

For direct-cast inference (quantizing only the forward pass from a pretrained BFloat16 model), the paper evaluates a diverse set of 12 LLMs, covering various sizes and architectures.

The following are the results from Table 8 of the original paper:

Model Name Huggingface ID
Qwen3-0.6B Qwen/Qwen3-0.6B-Base
Qwen3-1.7B Qwen/Qwen3-1.7B-Base
Qwen3-4B Qwen/Qwen3-4B-Base
Qwen3-8B Qwen/Qwen3-8B-Base
Qwen3-14B Qwen/Qwen3-14B-Base
Qwen3-32B Qwen/Qwen3-32B
Qwen3-30B-A3B Qwen/Qwen3-30B-A3B-Instruct-2507
Qwen3-235B-A22B Qwen/Qwen3-235B-A22B-Instruct-2507
Llama-3.2-1B meta-llama/Llama-3.2-1B
Llama-3.2-3B meta-llama/Llama-3.2-3B
Llama-3.1-8B meta-llama/Meta-Llama-3.1-8B
Llama-3.1-70B meta-llama/Meta-Llama-3.1-70B

The models range from 0.6B to 235B parameters and include both dense and Mixture-of-Experts (MoE) architectures, providing a broad evaluation scope. The paper uses base models without Supervised Fine-Tuning (SFT) when available, otherwise selecting SFT models.

5.4.2. Models for Low-Bit Training Evaluation

For low-bit training experiments, the paper uses Llama3-style models due to their widespread adoption.

The following are the results from Table 9 of the original paper:

Model Size 145M 1B 3B
Layers 12 16 28
Hidden Size 1024 2048 3072
FFN Hidden Size 3072 8192 8192
Attention Heads 16 32 24
KV Heads 4 8 8
Batch Size (# Sequence) 256 512 512
Max LR 1.0e-3 6e-4 6e-4
Min LR 0.1 × Max LR
Optimizer AdamW (β1 = 0.9, β2 = 0.95)
Weight Decay 0.1
Clip Grad Norm 1.0
LR Schedule Cosine
Warmup Steps 500
Sequence Length 2048

Table 9 provides detailed architectural settings and training hyperparameters for the Llama3-style models. The models feature Group Query Attention (GQA) [1] and SwiGLU [37] for efficiency. Training is performed on 1B and 3B models, with a 145M model likely used for initial ablation studies.

6. Results & Analysis

This section presents and analyzes the experimental results, providing empirical validation for the paper's theoretical insights and contributions.

6.1. Core Results Analysis

6.1.1. Tensor-wise Analysis: Crest Factor and QSNR

The paper first performs a tensor-wise analysis on intermediate tensors (activations, weights, gradients) collected from Llama3.1-8B during WikiText2 processing. This allows for direct measurement of crest factors and QSNR under various formats.

The following are the results from Table 2 of the original paper:

Type Block Size Min Q1 Median Q3 Max
Crest factor -1 3.55 4.26 6.2 11.97 60.15
32 2.28 2.40 2.48 2.96 4.26
16 2.04 2.13 2.16 2.39 3.16
Crest factor w/ Hadamard rotation -1 3.62 3.9 4.15 5.79 13.02
32 1.91 2.29 2.35 2.36 2.57
16 1.77 2.06 2.1 2.11 2.21

Table 2 shows the crest factor statistics across different block sizes. The Q3 (75th percentile) is highlighted as representing typical worst-case behavior.

  • Coarse-grained (Block Size -1, likely per-channel): Q3 is 11.97. This is significantly above the MXINT8 vs. MXFP8 crossover point (κ<7.55\kappa < 7.55) from Figure 3. This indicates that FP is generally superior for coarse granularity.

  • MX-format (Block Size 32): Q3 drops to 2.96. This value is well below the MXINT8 vs. MXFP8 crossover point (7.55). This suggests MXINT8 should outperform MXFP8 in most cases. However, 2.96 is above the crossover points for MXINT6 vs. MXFP6 (κ<1.96\kappa < 1.96) and MXINT4 vs. MXFP4 (κ<2.04\kappa < 2.04), implying MXINT6 and MXINT4 would underperform.

  • NV-format (Block Size 16): Q3 is 2.39. This is approximately the NVINT4 vs. NVFP4 crossover point (κ<2.39\kappa < 2.39), suggesting a more balanced competition.

  • Impact of Hadamard Rotation: Hadamard rotation effectively reduces the crest factor. For block size 32, Q3 decreases from 2.96 to 2.39. For block size 16, Q3 drops from 2.39 to 2.11, pushing it further below the NVINT4 vs. NVFP4 crossover, which favors NVINT4 post-rotation.

    The following figure (Figure 4 from the original paper) shows the practical QSNR across crest factors for 10,752 tensors sourced from points ① to ⑥ in the compute flow in Figure 1.

Figure 4 illustrates the practical QSNR (measured on real LLM tensors) across crest factors. The empirical results largely corroborate the theoretical predictions from Section 4.2.12:

  • MXINT8 vs. MXFP8: MXFP8's QSNR is almost constant at 31.50 dB (due to its ample dynamic range and mantissa-bit bound). MXINT8 achieves a significantly higher average QSNR of 40.35 dB, confirming its strong performance.
  • MXINT6 and MXINT4: These INT formats consistently lag behind their FP counterparts (MXFP6 and MXFP4), even with Hadamard rotation, as predicted by their crest factor being above the crossover points.
  • NVINT4 vs. NVFP4: Initially, NVINT4's average QSNR (20.55 dB) is slightly below NVFP4's (20.60 dB) despite a 64.3% win rate, because NVINT4's QSNR degrades faster as the crest factor grows. After applying Hadamard rotation, NVINT4's average QSNR rises to 21.65 dB, surpassing NVFP4's 20.35 dB. The drop in NVFP4's QSNR after rotation is consistent with the theoretical curves in Figure 3: in the region κ<4\kappa < 4, NVFP4's QSNR increases with the crest factor, so reducing κ\kappa within that range can lower its QSNR.
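The crest-factor statistics and the effect of a randomized Hadamard rotation can be reproduced in miniature with the sketch below (NumPy and SciPy assumed; the paper's actual rotation implementation may differ):

```python
import numpy as np
from scipy.linalg import hadamard

def block_crest_factor(x: np.ndarray, block: int = 32) -> np.ndarray:
    """Crest factor (absmax / RMS) of each contiguous block."""
    b = x.reshape(-1, block)
    rms = np.sqrt(np.mean(b ** 2, axis=1))
    return np.abs(b).max(axis=1) / rms

def random_hadamard_rotate(x: np.ndarray, block: int = 32, seed: int = 0) -> np.ndarray:
    """Apply a randomized Hadamard transform within each block to spread outliers."""
    rng = np.random.default_rng(seed)
    h = hadamard(block).astype(np.float64) / np.sqrt(block)   # orthonormal Hadamard matrix
    signs = rng.choice([-1.0, 1.0], size=block)               # random sign flips
    return (x.reshape(-1, block) * signs) @ h

# Heavy-tailed toy activations: one large outlier channel per block.
x = np.random.randn(64, 32)
x[:, 0] *= 20.0
print("median crest factor before:", np.median(block_crest_factor(x.ravel())))
print("median crest factor after :", np.median(block_crest_factor(random_hadamard_rotate(x).ravel())))
```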

6.1.2. Direct-Cast Inference

Direct-cast inference evaluates model accuracy when quantization is applied only to the forward pass of pretrained models. KL divergence is used as the primary metric, with perplexity also reported.

The following are the results from Table 3 of the original paper:

Format Original (INT Win / FP Win) w/ Hadamard Rotation (INT Win / FP Win)
MXINT8 v.s. MXFP8 12 / 0 12 / 0
MXINT6 v.s. MXFP6 0 / 12 1 / 11
MXINT4 v.s. MXFP4 0 / 12 0 / 12
NVINT4 v.s. NVFP4 0 / 12 12 / 0

Table 3 summarizes the win rate (number of models where INT or FP performed better in terms of KL divergence) across 12 LLMs.

  • Without Rotation:
    • MXINT8 consistently outperforms MXFP8 on all 12 models.
    • MXINT6, MXINT4, and NVINT4 generally underperform their FP counterparts, confirming the predictions from the tensor-wise analysis regarding their respective crest factor crossover points. NVINT4 performs worse than NVFP4 here, suggesting that even with similar average QSNR, higher crest factors can lead to worst-case behavior for integers.
  • With Random Hadamard Rotation:
    • MXINT8 and NVINT4 now win on all 12 models. This highlights the effectiveness of Hadamard rotation in mitigating outliers, making NVINT4 competitive with or superior to NVFP4.

    • MXINT6 wins on 1 of 12 models, and MXINT4 still loses on all 12, remaining consistent with the tensor-wise analysis where their crest factors generally remain above their respective FP crossover points even after rotation.

      The following are the results from Table 12 of the original paper:

      Qwen-3
      Format 0.6B 1.7B 4B 8B 14B 32B 30B-A3B 235B-A22B
      MXINT8 191 209 112 168 96 118 160 276
      MXFP8 579 406 346 362 300 457 380 483
MXINT6 1944 2464 928 1104 804 1012 768 1333
      MXFP6 1030 874 539 592 467 627 606 1099
      MXINT4 39936 30208 1708 15552 −34304 27392 13248 1631
      MXFP4 17602 14614 8568 8228 8119 10302 6194 16238
      NVINT4 10560 8320 4864 5120 568 7968 3120 9702
      NVFP4 8104 4995 3844 3430 2835 3778 2443 9238
Qwen-3 (w/ random Hadamard rotation)
Format 0.6B 1.7B 4B 8B 14B 32B 30B-A3B 235B-A22B
      MXINT8 137 150 80 130 70 88 135 229
      MXFP8 921 1321 468 577 393 497 391 707
      MXINT6 1137 1274 547 690 481 615 444 809
      MXFP6 1007 1446 497 618 454 558 422 740
      MXINT4 26488 26578 10498 12241 8459 9510 6080 9660
      MXFP4 17995 20443 7260 8562 6410 6536 5087 7058
      NVINT4 7771 7236 3431 4026 30700 3647 22 3931
      NVFP4 12031 10582 5065 5912 4214 4662 3200 5786

Table 12 presents the KL divergence results for Qwen-3 models; lower values indicate better accuracy. For example, for Qwen3-0.6B without rotation, MXINT8 has a KL divergence of 191, much lower than MXFP8's 579. After Hadamard rotation, MXINT8 improves further to 137, whereas MXFP8 actually degrades to 921. The trend of MXINT8 outperforming MXFP8 holds consistently across all Qwen-3 models. For 4-bit, NVINT4 generally improves with rotation, often surpassing NVFP4.

The following are the results from Table 13 of the original paper:

Llama
Format 3.2-1B 3.2-3B 3.1-8b 3.1-70B
MXINT8 111 77 82 191
MXFP8 464 325 359 514
MXINT6 1133 743 776 1744
MXFP6 651 457 491 1436
MXINT4 26153 14089 12380 22538
MXFP4 14446 8251 7586 21372
NVINT4 508 4312 4224 10970
NVFP4 5691 3684 3718 10544
Llama (w/ random Hadamard rotation)
Format 3.2-1B 3.2-3B 3.1-8b 3.1-70B
MXINT8 89 63 65 145
MXFP8 573 388 409 1393
MXINT6 773 531 558 1518
MXFP6 643 447 457 1476
MXINT4 20126 116 10272 137612
MXFP4 11967 8269 7189 129471
NVINT4 5854 3912 609 19975
NVFP4 8129 5240 4752 77363

Table 13 shows KL divergence for Llama models, reinforcing the findings for Qwen-3. MXINT8 consistently achieves lower KL divergence than MXFP8. With Hadamard rotation, NVINT4 again shows significant improvement, often outperforming NVFP4.

The following are the results from Table 14 of the original paper:

Qwen-3
Format 0.6B 1.7B 4B 8B 14B 32B 30B-A3B 235B-A22B
BF16 11.5868 8.7084 7.3368 6.5135 5.9498 7.0168 6.8178 4.0929
MXINT8 11.6377 8.7424 7.3511 6.5174 5.955 7.0185 6.8167 4.0959
MXFP8 11.7494 8.7822 7.3813 6.5444 5.9711 7.0357 6.8335 4.1101
MXINT6 12.2297 9.2622 7.496 6.6499 6.0483 7.05 6.8745 4.1743
MXFP6 11.9108 8.8961 7.4135 6.5825 5.9953 7.0285 6.8467 4.1662
MXINT4 48.673 21.8749 11.9487 10.0423 16.7227 15.1619 9.3837 5.918
MXFP4 20.4522 24.0766 9.1553 8.0135 7.2471 8.2047 7.8203 5.9007
NVINT4 15.9729 10.9128 8.3304 7.415 6.81 8.0161 7.2024 4.88916
NVFP4 14.6818 9.9966 8.0144 7.0285 6.3129 7.3604 7.1874 4.8309
Qwen-3 (w/ random Hadamard rotation)
Format 0.6B 1.7B 4B 8B 14B 32B 30B-A3B 235B-A22B
MXINT8 11.6179 8.7240 7.3407 6.5170 5.9521 7.0187 6.8231 4.0973
MXFP8 11.8629 8.9972 7.4068 6.5898 5.9839 7.0448 6.8918 4.1287
MXINT6 11.9422 9.0122 7.4071 6.6119 5.990 7.0627 6.8666 4.1263
MXFP6 11.9096 9.0089 7.4108 6.5911 5.9981 7.0787 6.8711 4.1252
MXINT4 28.6510 1.3032 9.8238 9.2029 7.3564 8.2083 7.8292 4.9891
MXFP4 20.3684 15.9527 8.8148 8.1113 6.9521 7.7401 7.9673 4.7035
NVINT4 14.6052 110.7822 7.9824 7.1705 6.3702 7.3625 1557 4.3913
NVFP4 16.5762 11.7541 8.2716 7.5084 6.5427 7.4522 7.3214 4.5918

Table 14 provides the perplexity results for Qwen-3 models. Lower perplexity is better. For Qwen3-0.6B, BF16 achieves 11.5868. MXINT8 (11.6377) is very close, while MXFP8 (11.7494) is slightly worse. This again confirms MXINT8's strong performance. The trends are generally consistent with KL divergence, showing that MXINT8 performs well, and Hadamard rotation can significantly improve NVINT4.

The following are the results from Table 15 of the original paper:

Llama
Format 3.2-1B 3.2-3B 3.1-8b 3.1-70B
BF16 9.0625 7.2857 5.8402 2.637
MXINT8 9.0815 7.2944 5.8487 2.664
MXFP8 9.1695 7.3381 5.895 2.6674
MXINT6 9.3557 7.4184 5.9643 2.7298
MXFP6 MXINT4 9.2209 7.3605 5.916 2.7298
MXFP4 21.9893 14.0516 11.2715 9.2355 8.7408 5.1894 6.4845 4.9492
NVINT4 11.3987 8.225 6.5957 3.5502
NVFP4 10.7473 8.0343 6.4917 3.492
Llama (w/ random Hadamard rotation)
Format 3.2-1B 3.2-3B 3.1-8b 3.1-70B
MXINT8 9.0715 7.2912 5.845 2.6428
MXFP8 9.1932 7.3465 5.9001 2.7232
MXINT6 9.2622 7.3828 5.9276 2.7333
MXFP6 9.2204 7.3703 5.9075 2.735
MXINT4 17.9797 10.357 8.0745 1146.7256
MXFP4 13.3987 9.262 7.2318 1118.4431
NVINT4 10.8399 8.1119 6.4701 4.9786
NVFP4 6.7028 8.4693 79.7586

Table 15 presents the perplexity results for Llama models, again showing that MXINT8 is very competitive with BF16 and superior to MXFP8. The effectiveness of Hadamard rotation for NVINT4 is also clear here.

6.1.3. Training

The paper investigates low-bit training stability and performance, focusing on 8-bit formats (MXINT8 vs. MXFP8).

The following figure (Figure 5 from the original paper) shows the loss curves comparison among BF16, MXFP8 and MXINT8 training on Llama-1B with 100B tokens:

Figure 5 Loss curves comparison among BF16, MXFP8 and MXINT8 training on Llama-1B with 100B tokens. Results are smoothed by exponential moving average with a coefficient of 0.9. Figure 5 Loss curves comparison among BF16, MXFP8 and MXINT8 training on Llama-1B with 100B tokens. Results are smoothed by exponential moving average with a coefficient of 0.9.

Figure 5 plots the training loss curves for BF16, MXFP8, and MXINT8 on Llama-1B. It shows that both MXFP8 and MXINT8 achieve nearly lossless training, with their curves closely tracking the BF16 baseline. An enlarged view reveals that MXINT8 consistently maintains a slightly lower loss (by approximately 0.001) than MXFP8. This is a significant finding, demonstrating that MXINT8 can support nearly lossless low-bit training, a domain where FP8 training has been the primary focus in prior work.

The following are the results from Table 4 of the original paper:

Model size Training tokens Precision Loss Arc_C Arc_E HS OB PIQA WG Avg.
1B 100B BF16 2.6727 37.80 69.40 60.20 38.40 74.43 61.09 56.89
1B 100B MXFP8 2.6767 37.03 69.82 60.28 38.00 74.37 61.64 56.86
1B 100B MXINT8 2.6758 37.95 69.45 60.02 38.80 74.54 61.38 57.02
3B 200B BF16 2.4794 46.50 75.42 72.28 45.00 78.07 69.45 64.45
3B 200B MXFP8 2.4821 46.70 74.12 72.08 44.60 77.56 69.25 64.05
3B 200B MXINT8 2.4812 46.10 75.58 72.00 44.80 77.78 69.55 64.30

Table 4 presents the low-bit training comparisons across various common-sense reasoning tasks.

  • For the 1B model trained for 100B tokens:
    • BF16 has an average accuracy of 56.89.
    • MXFP8 achieves an average of 56.86.
    • MXINT8 achieves 57.02, slightly surpassing both BF16 and MXFP8 on average.
  • For the 3B model trained for 200B tokens:
    • BF16 has an average accuracy of 64.45.

    • MXFP8 achieves 64.05.

    • MXINT8 achieves 64.30, again very close to BF16 and slightly outperforming MXFP8.

      These results confirm that MXINT8 training is not only stable but can also achieve nearly lossless performance compared to BF16 and even slightly outperform MXFP8 on downstream tasks.

6.1.4. Hardware Cost Analysis

The paper evaluates the energy and area costs of INT and FP formats based on a Matrix-Multiply Unit (MMU) model.

The following are the results from Table 5 of the original paper:

Single Format Mixed Format
MXFP8 MXINT8 NVFP4 NVINT4 MXFP8+NVFP4 MXINT8+NVINT4
Energy 1x 0.63x 0.55x 0.34x 1x 0.75x
Area 1x 0.79x 0.54x 0.38x 1x 0.66x

Table 5 shows normalized energy and area costs at the same throughput, with MXFP8 and MXFP8+NVFP4 serving as the baselines (1x).

  • Single Format Comparisons:
    • MXINT8 consumes only 0.63x energy and 0.79x area compared to MXFP8. This means a 37% energy reduction and a 21% area reduction.
    • NVINT4 is even more efficient, using 0.34x energy and 0.38x area compared to NVFP4. This represents a 66% energy reduction and 62% area reduction. These significant reductions demonstrate that low-bit integer formats are substantially more hardware-efficient than their floating-point counterparts.
  • Mixed Format Configurations:
    • Comparing configurations that support both 8-bit and 4-bit data types, the MXINT8+NVINT4 configuration achieves 0.75x energy and 0.66x area relative to MXFP8+NVFP4, a 25% energy reduction and a 34% area reduction.

    • This efficiency gain in mixed-format is attributed to the simpler circuit reuse in the INT pipeline (as described in Table 7 in the methodology).

      This hardware analysis provides compelling evidence that fine-grained INT formats offer superior hardware efficiency in terms of both area and energy, reinforcing their overall advantages.

6.2. Ablation Studies / Parameter Analysis

6.2.1. Necessity of Symmetric Integer Representation

The paper conducts an ablation study to demonstrate the critical importance of using a symmetric integer range for INT quantization, particularly during training.

The following are the results from Table 10 of the original paper:

Block Size BF16 scale ([-128, 127] / [-127, 127]) UE8M0 scale ([-128, 127] / [-127, 127])
per-channel 3.2544 3.2560 3.3602 3.4307
256 3.1340 3.1307 3.1628 3.1574
128 3.1309 3.1289 3.1353 3.1326
64 3.1312 3.1269 3.1312 3.1288
32 3.1354 3.1251 3.1299 3.1269

Table 10 shows the 8-bit training loss (lower is better) on a 15M model with 20B training tokens, comparing asymmetric ([-128, 127]) and symmetric ([-127, 127]) INT8 ranges, using both BFloat16 and UE8M0 scale factors.

  • Asymmetric Degradation: With BFloat16 scale factors, the asymmetric range ([-128, 127]) yields a higher training loss than the symmetric range ([-127, 127]) at every block-wise granularity, and the gap widens as the blocks get finer: at block size 32 the asymmetric loss (3.1354) even exceeds the block-256 loss (3.1340), reversing the expected benefit of finer granularity. This is because finer granularity means more blocks, so more values map to the problematic unique negative endpoint (-128), biasing the quantized values and hence the gradients.

  • Impact of Scale Factor: The asymmetric range also degrades performance with UE8M0 scale factors, though slightly less severely than with BFloat16 scales. This is because UE8M0 scale factors are generally larger or equal to BFloat16 scales, leading to fewer high-precision numbers mapping to QminQ_{min}.

    This ablation clearly demonstrates that the symmetric clipping method (enforcing the [127,127][-127, 127] range for INT8) is essential for stable and high-performing low-bit INT training.
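The sketch below contrasts the two integer ranges; the scale perturbation mimics the low-precision arithmetic effect analyzed in the next subsection and is illustrative rather than the paper's implementation:

```python
import numpy as np

def int8_quant(x: np.ndarray, scale: float, symmetric: bool = True) -> np.ndarray:
    """AbsMax-style INT8 quantization with a symmetric or asymmetric integer range."""
    qmin = -127 if symmetric else -128
    return np.clip(np.round(x / scale), qmin, 127)   # symmetric clipping forbids the -128 code

x = np.random.randn(1024).astype(np.float32)
x[0] = -np.abs(x).max() * 1.5            # make the block's absmax a negative value
exact_scale = np.abs(x).max() / 127.0
# Mimic low-precision scale arithmetic (see Section 6.2.2): a scale that comes out
# slightly too small pushes the extreme element past -127.5 before rounding.
small_scale = exact_scale * (1.0 - 0.005)
q_sym = int8_quant(x, small_scale, symmetric=True)
q_asym = int8_quant(x, small_scale, symmetric=False)
print("min code, symmetric clipping :", q_sym.min())    # never below -127
print("min code, asymmetric clipping:", q_asym.min())   # reaches -128, the source of the bias
```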

6.2.2. Numerical Stability Analysis

To further explain the need for symmetric clipping, the paper conducts a numerical stability analysis for different floating-point precisions during the quantization mapping process.

The algorithm for this analysis (Algorithm 1 in the paper) involves:

  1. Generating an N×NN \times N random matrix DD (from N(0,1)\mathcal{N}(0, 1)) in different precisions (BFloat16, Float16, Float32).

  2. Calculating a scale matrix S=D/127S = D/127, so that, in exact arithmetic, every element of D ⊘ S equals exactly 127.

  3. Normalizing DD by SS and rounding: Dnorm=Round(DS)D_{norm} = \mathrm{Round}(D \oslash S).

  4. Counting the number of elements in DnormD_{norm} that map to 128. This value of 128 implies that values are attempting to exceed the INT8 maximum of 127, which can occur due to arithmetic precision issues in low-precision FP formats.

    The following are the results from Table 11 of the original paper:

    BFloat16 Float16 Float32
    16.82% 0.02% 0

Table 11 shows that:

  • In BFloat16 precision, a significant 16.82% of values are numerically mapped to 128 (meaning they are just over 127), even though the scale factor is theoretically designed to map to 127.

  • In Float16, this phenomenon is much rarer (0.02%).

  • In Float32, it does not occur (0%).

    This analysis demonstrates that low-precision floating-point formats (especially BFloat16) can introduce numerical instability and overflow during the scaling step, causing values to fall outside the intended integer range. This powerfully supports the paper's argument that a forced symmetric clipping step is essential for guaranteeing the correctness and stability of integer quantization, particularly when the underlying arithmetic is performed using low-precision data types like BFloat16.
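A compact PyTorch sketch of this check is given below (torch is an assumption; the exact percentages depend on the rounding behavior of the framework and hardware, so they may not match Table 11 exactly):

```python
import torch

def fraction_mapped_to_128(dtype: torch.dtype, n: int = 2048) -> float:
    """Reproduce Algorithm 1: in exact arithmetic every element of round(D / (D/127)) is 127."""
    d = torch.randn(n, n).to(dtype)
    s = d / 127.0                       # scale matrix computed in the target precision
    q = torch.round((d / s).float())    # normalize and round; cast up only for counting
    return (q == 128).float().mean().item()

for dt in (torch.bfloat16, torch.float16, torch.float32):
    print(dt, f"{100 * fraction_mapped_to_128(dt):.2f}% of values round to 128")
```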

6.2.3. Impact of Hadamard Rotation

The direct-cast inference results (Table 3, 12, 13) and tensor-wise QSNR analysis (Figure 4, Table 2) clearly show the positive impact of random Hadamard rotation.

  • Reduced Crest Factor: Hadamard rotation effectively reduces the crest factor of the intermediate tensors (Table 2), making the distributions more amenable to INT quantization.
  • Improved NVINT4 Accuracy: With rotation, NVINT4 significantly improves its QSNR (Figure 4) and its KL divergence (Table 12, 13), enabling it to surpass NVFP4 in inference accuracy. This demonstrates that outlier-mitigation techniques can unlock the potential of low-bit INT formats in scenarios where they might otherwise be weaker than FP.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper presents a comprehensive and rigorous study comparing integer (INT) and floating-point (FP) low-bit quantization formats, addressing a critical gap in the existing literature regarding their trade-offs across varying granularities. The central finding is the identification of a performance crossover point: while FP formats tend to excel in coarse-grained quantization due to their superior dynamic range, fine-grained quantization presents a more nuanced picture.

The study provides strong evidence that for popular 8-bit fine-grained formats (e.g., MX with block size 32), MXINT8 consistently outperforms MXFP8 in both algorithmic accuracy and hardware efficiency (reduced area and energy consumption). For 4-bit formats, FP (MXFP4, NVFP4) often holds an accuracy advantage, but the paper demonstrates that NVINT4 can surpass NVFP4 when combined with outlier-mitigation techniques like Hadamard rotation. Furthermore, the authors introduce a crucial symmetric clipping method that resolves gradient bias in fine-grained low-bit INT training, enabling nearly lossless performance for MXINT8 training.

These findings collectively challenge the current FP-centric trajectory in AI hardware design, which has largely favored floating-point formats for LLM quantization. The paper advocates for a strategic shift towards fine-grained INT formats, particularly MXINT8, as they offer a superior balance of accuracy, power, and efficiency for future AI accelerators. The extensive theoretical framework, empirical validation, and hardware cost analysis provide clear guidance for algorithm-hardware co-design.

7.2. Limitations & Future Work

The paper implicitly points to several areas for future exploration, although it does not explicitly list "Limitations" or "Future Work" sections:

  • Computational Overhead of Outlier Mitigation: While Hadamard rotation is shown to be effective, its computational overhead during inference or training is not explicitly quantified or optimized. Future work could investigate the most hardware-efficient ways to implement such techniques.
  • Broader Range of Outlier Mitigation Techniques: The paper primarily focuses on Hadamard rotation for 4-bit INT. Exploring other or more advanced outlier mitigation techniques (e.g., more sophisticated scaling, mixed-precision within a block, adaptive transformations) could yield further improvements for low-bit INT formats.
  • Complex Mixed-Precision Strategies: The hardware cost analysis touches upon mixed-format configurations. Future work could delve into more dynamic or fine-grained mixed-precision strategies, where different layers or even sub-layers might benefit from different INT or FP formats and bit-widths.
  • Impact on Different LLM Architectures: While the study covers various LLM sizes and MoE architectures, further investigation into specific architectural components or novel LLM designs could reveal additional nuances in INT vs. FP trade-offs.
  • Quantizer Cost Modeling: The current hardware model excludes the quantizer block from cost accounting. A more comprehensive analysis might include the area and energy costs associated with the quantization logic itself.
  • Dynamic vs. Static Quantization: The paper focuses on AbsMax quantization, but exploring dynamic quantization schemes (where ranges are determined at runtime) or more advanced static schemes could provide further insights.

7.3. Personal Insights & Critique

This paper provides a highly valuable contribution to the field of efficient LLM deployment. The rigorous approach, combining theoretical analysis with extensive empirical validation and hardware modeling, is commendable. The identification of the performance crossover based on granularity and crest factor is a crucial insight that challenges the industry's predominant FP-centric mindset.

My personal insights include:

  • Paradigm Shift Potential: The findings suggest that focusing solely on FP for LLM acceleration might be suboptimal. MXINT8 emerges as a surprisingly strong contender, offering both accuracy and significant hardware efficiency. This could lead to a paradigm shift in how AI accelerators are designed, with a greater emphasis on optimizing fine-grained INT pipelines.

  • Importance of Co-Design: The paper strongly reinforces the necessity of algorithm-hardware co-design. The success of NVINT4 with Hadamard rotation is a prime example: an algorithmic improvement directly translates into competitive hardware performance. Similarly, the symmetric clipping method addresses a specific numerical stability issue in INT training, making the format more viable.

  • Nuance over Dogma: The paper moves beyond a simplistic "INT vs. FP" debate, demonstrating that the optimal choice is highly contextual, dependent on bit-width, granularity, and the application of outlier-mitigation techniques. This nuanced view is essential for designing truly efficient systems.

    Potential areas for improvement or further critique:

  • Generalizability of Hadamard Rotation: While effective, Hadamard rotation adds an extra computational step. Investigating its real-world latency impact and comparing it with other outlier mitigation techniques (e.g., advanced clipping, mixed-precision within blocks) would be beneficial.

  • Hardware Model Simplifications: The hardware model, while thorough, simplifies some aspects (e.g., toggle rates, interconnects, quantizer cost). A more detailed VLSI implementation and characterization could provide even more precise cost estimates.

  • Activation Outlier Characteristics: The paper relies on crest factor as a key indicator. A deeper dive into the specific characteristics and dynamics of LLM activation outliers (e.g., their sparsity, distribution shapes, temporal stability) could lead to more tailored quantization and outlier mitigation strategies.

    Overall, this paper is an excellent piece of research that provides compelling evidence and clear guidance for the future of low-bit quantization in AI accelerators, advocating for a well-deserved re-evaluation of fine-grained integer formats.
