
Accurate INT8 Training Through Dynamic Block-Level Fallback

Published: 03/11/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

To enable accurate INT8 training of modern GLU-based Transformers despite activation outliers, this paper proposes `Fallback Quantization`. This method dynamically reverts outlier-containing activation blocks to 16-bit within a mixed-precision GEMM, achieving robust accuracy and a 1.57x end-to-end training speedup on RTX 4090 GPUs.

Abstract

Transformer models have achieved remarkable success across various AI applications but face significant training costs. Low-bit training, such as INT8 training, can leverage computational units with higher throughput, and has already demonstrated its effectiveness on GPT2 models with block-level quantization. However, it struggles with modern Transformer variants incorporating GLU units. This is because those variants demonstrate complex distributions of activation outliers. To address the challenge, we propose Fallback Quantization, implementing mixed-precision GEMM that dynamically falls back 8-bit to 16-bit for activation blocks containing outliers. Experiments show that our approach is robustly competent in both fine-tuning and pretraining settings. Moreover, our method achieves a 1.57x end-to-end training speedup on RTX4090 GPUs.


In-depth Reading

English Analysis

1. Bibliographic Information

  • Title: Accurate INT8 Training Through Dynamic Block-Level Fallback
  • Authors: Pengle Zhang, Jia Wei, Jintao Zhang, Jun Zhu, Jianfei Chen. The affiliations are not explicitly listed in the provided text, but the authors are likely associated with academic or industrial research labs focused on machine learning and systems optimization.
  • Journal/Conference: The paper is presented as a preprint on arXiv. Preprints are research articles shared before formal peer review, often to disseminate findings quickly. The specific conference or journal it was submitted to is not mentioned.
  • Publication Year: The original version was submitted in 2025 (as per the citation format), with the specific version analyzed being v3.
  • Abstract: The paper addresses the high computational cost of training Transformer models. It proposes a solution using low-bit INT8 training, which can utilize faster hardware units. While existing INT8 training methods work for older models like GPT-2, they fail on modern Transformers that use Gated Linear Units (GLU) due to complex activation outlier patterns. The proposed method, Fallback Quantization, is a mixed-precision technique that dynamically switches from 8-bit to 16-bit computation for specific activation blocks that contain outliers. Experimental results show this approach achieves accuracy comparable to standard 16-bit training in both fine-tuning and pretraining scenarios. Furthermore, it delivers a 1.57x end-to-end training speedup on an RTX 4090 GPU.
  • Original Source Link: The paper is available at https://arxiv.org/abs/2503.08040v3 (PDF: http://arxiv.org/pdf/2503.08040v3) and is currently in a preprint status.

2. Executive Summary

  • Background & Motivation (Why):

    • Core Problem: Training large Transformer models is extremely expensive due to the massive number of matrix multiplications (GEMMs). While using lower-precision numerical formats like 8-bit integers (INT8) can significantly speed up these operations on modern GPUs, standard INT8 training struggles with accuracy.
    • Importance & Gap: This accuracy loss is particularly severe in modern Transformer architectures (like Llama and Qwen) that use Gated Linear Units (GLU). GLU layers produce activations with sparse, high-magnitude "outliers," which are values far outside the normal range. Standard INT8 quantization, which scales all values in a group by the maximum absolute value, causes significant information loss for the non-outlier values, crippling the model's ability to learn. Prior methods like Jetfire use fine-grained quantization but are either too slow or not robust enough for these modern architectures.
    • Innovation: The paper introduces Fallback Quantization, a dynamic, block-level, mixed-precision training method. Instead of treating all data uniformly, it identifies blocks of the activation matrix that contain outliers and "falls back" to a higher-precision representation (effectively 16-bit) for those specific blocks, while computing the rest in fast INT8. This approach isolates the disruptive effect of outliers without sacrificing the speed benefits of INT8 training.
  • Main Contributions / Findings (What):

    • Novel Algorithm (Fallback Quantization): The paper proposes a novel mixed-precision GEMM that dynamically identifies and handles outlier blocks in activations by representing them with two INT8 components (quantization block + residual block), effectively achieving 16-bit precision for those parts.
    • State-of-the-Art Accuracy: This is the first work to demonstrate that INT8 training can achieve lossless accuracy (matching BF16 baselines) on challenging fine-tuning and pretraining tasks for modern, powerful models like Llama-3.1 and Qwen-2.5.
    • High Efficiency: The proposed method is designed to be hardware-friendly. The custom mixed-precision GEMM kernel achieves 425 TOPS on an RTX 4090, which is 2.58x faster than standard BF16 and 1.65x faster than the previous Jetfire method. This translates to a 1.57x end-to-end training speedup and a 38% reduction in activation memory.

3. Prerequisite Knowledge & Related Work

  • Foundational Concepts:

    • Low-Precision Training: Standard deep learning training often uses 32-bit floating-point numbers (FP32). To reduce memory usage and accelerate computation, lower-precision formats are used. FP16 (16-bit floating-point) and BF16 (bfloat16) are common, offering a trade-off between range and precision. INT8 (8-bit integer) offers even greater speedups, as GPUs have specialized hardware (TensorCores) that can perform INT8 matrix multiplications at a much higher throughput than FP16/BF16.
    • Quantization: This is the process of converting a high-precision number (like FP32) into a low-precision one (like INT8). This typically involves scaling the values to fit within the INT8 range [-127, 127] and then rounding.
    • Group Quantization: Instead of using a single scaling factor for an entire matrix (per-tensor), group quantization divides the matrix into smaller blocks (or groups) and calculates a separate scaling factor for each. This allows for more accurate representation, as the scaling factor is tailored to a smaller, more uniform set of values. The granularity can be per-channel, per-token, or per-block (a small 2D tile).
    • Activation Outliers: In Transformers, some activation values can be orders of magnitude larger than the rest. These outliers are problematic for quantization because the scaling factor for a group is determined by the maximum absolute value. A large outlier forces the scaling factor to be large, which in turn "squashes" all other smaller values towards zero, causing a massive loss of information; a minimal code sketch of this effect appears just after this list of concepts.
    • Gated Linear Unit (GLU): A component used in modern Transformers' feed-forward networks. It involves an element-wise multiplication of two activation vectors, one of which is passed through a non-linearity like SiLU. This multiplicative interaction tends to create more extreme and sparse outliers compared to older architectures.
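
To make the absmax scaling and the outlier "squashing" effect concrete, here is a minimal NumPy sketch of symmetric per-block INT8 quantization. It is a generic illustration with made-up values, not the paper's kernel.

```python
import numpy as np

def quantize_block_int8(block):
    """Symmetric absmax quantization of one block to INT8: block ~= codes * scale."""
    scale = max(np.abs(block).max() / 127.0, 1e-12)   # guard against all-zero blocks
    codes = np.clip(np.round(block / scale), -127, 127).astype(np.int8)
    return codes, scale

def dequantize_block(codes, scale):
    return codes.astype(np.float32) * scale

# One large outlier dominates the per-block scale and squashes everything else to zero.
block = np.array([[0.01, -0.02, 0.03],
                  [0.05, 150.0, -0.04]], dtype=np.float32)
codes, scale = quantize_block_int8(block)
print(codes)                          # only the outlier survives rounding
print(dequantize_block(codes, scale)) # all other values are lost
```
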
  • Previous Works:

    • Post-Training Quantization (PTQ): Methods like GPTQ quantize a pre-trained model for faster inference without retraining. They are not suitable for training.
    • Quantization-Aware Training (QAT): Simulates quantization effects during training to help the model adapt, but typically keeps the actual computations in high precision. It is also aimed at inference acceleration.
    • Fully Quantized Training (FQT): This is the focus of the paper. It performs computations in low precision during the training process itself (both forward and backward passes) to accelerate training.
      • Switchback: An FQT method that used per-token and per-channel quantization. It was effective for vision transformers but struggled with the complex outliers in large language models (LLMs).
      • Jetfire: A more recent FQT method that introduced per-block quantization (with a small 32x32 block size). While it handled outliers better than Switchback and worked for GPT-2 models, its small block size created significant overhead, limiting its actual speedup. Furthermore, its effectiveness on modern GLU-based models was unproven.
    • Mixed-Precision for Outliers:
      • LLM.int8(): A technique for inference that identifies outliers and computes the parts of the matrix multiplication involving them in FP16, while the rest is in INT8. This approach is not designed for the dynamic nature of training.
  • Differentiation: The proposed Fallback Quantization method innovates by:

    1. Being Dynamic: It adapts to the changing activation patterns during training.
    2. Using Block-Level Fallback: It applies higher precision only to the small blocks containing outliers, making it more efficient than row- or column-wise methods.
    3. Being Hardware-Efficient: The fallback mechanism is implemented in a way that minimally disrupts the efficient INT8 GEMM pipeline, unlike methods that require switching between different types of TensorCores.
    4. Targeting Modern Architectures: It is specifically designed to handle the challenging outlier patterns produced by GLU-based models like Llama and Qwen, a domain where previous FQT methods failed.

4. Methodology (Core Technology & Implementation)

The core of the paper is a new quantization scheme designed to handle the difficult activation patterns in modern Transformers.

4.1. Outlier Pattern Analysis

The authors first analyze why modern GLU-based models (Llama-3.1, Qwen-2.5) are harder to quantize than older models like GPT-2.

  • GLU's Role: A GLU layer computes its output $y$ as $y = \sigma(x_1) \odot x_2$, where $\odot$ is element-wise multiplication. This multiplication can amplify magnitudes, creating much larger outliers than in non-GLU models; a toy illustration appears at the end of this subsection.

  • Key Outlier Patterns (P1, P2, P3):

    1. (P1) Larger Magnitude: As shown in Table 1, the maximum outlier magnitude in GLU models like Llama-3.1 (605.77) and Qwen-2.5 (6558.65) is drastically higher than in non-GLU models like GPT2-XL (58.07). Values this large make standard INT8 quantization (range [-127, 127]) extremely inaccurate for all other numbers in the same quantization group.

    2. (P2) Random Occasional Outliers: Outliers don't just appear in specific rows (token-wise) or columns (channel-wise). There are also "rogue" outliers that appear randomly. This makes simple per-token or per-channel quantization schemes insufficient.

    3. (P3) Sparsity: The outliers are sparse. Even within an outlier channel or token, only a few values are extremely large. This is visualized in Image 4, where the final activation y has very few bright spots (high values).

      [Figure (Image 4): comparison of GLU and non-GLU activations. (a) Density of activation values: non-GLU activations are concentrated, GLU activations are more spread out. (b) Percentile-sorted normalized magnitudes of activations. (c) Heatmaps of σ(x₁), x₂, and their product y for two samples, highlighting the sparse, high-magnitude outliers in the GLU output.]

    • Image 4 Analysis: This figure compares GLU and non-GLU activation distributions. (a) shows GLU activations have a wider distribution ("heavier tails"). (b) shows the sorted magnitude, confirming GLU has more extreme values. (c) provides a heatmap of the GLU computation, visually demonstrating that the output y is sparse, with large values appearing only when both inputs σ(x₁) and x₂ are large.

      These patterns motivate a solution that can isolate sparse outliers in small regions (blocks) and give those regions special treatment (higher precision).
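
As referenced above, a toy illustration of how the element-wise gating product produces heavier tails than its inputs. The inputs are synthetic Gaussian samples, not data from the paper.

```python
import numpy as np

def silu(x):
    # SiLU (swish): x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x1 = rng.normal(0.0, 2.0, size=100_000)
x2 = rng.normal(0.0, 2.0, size=100_000)

y = silu(x1) * x2   # GLU-style gating: element-wise product

# The gated product has far heavier tails than either input.
print(np.abs(x1).max(), np.abs(x2).max(), np.abs(y).max())
print(np.percentile(np.abs(y), 99.9) / np.median(np.abs(y)))
```
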

4.2. GEMM with Block Quantization (Recap of Jetfire)

The paper first explains how block-quantized GEMM works. A matrix multiplication $C = AB$ is broken down into smaller block-wise multiplications. Each block $G$ of the input matrices $A$ and $B$ is quantized to INT8. The final result for a block of $C$ is computed by accumulating the results of the INT8 block multiplications, with scales applied during accumulation (a small simulation sketch appears at the end of this subsection). The computation for a block $G_{i,j}^C$ is:

$$G_{i,j}^{C} = \sum_{k=0}^{\lceil K / K_g \rceil - 1} \left[ \hat{Q}(G_{i,k}^{A})\, \hat{Q}(G_{k,j}^{B}) \right]_{\mathrm{INT}\times\mathrm{INT}} \left( a_{i,k}^{A} \times a_{k,j}^{B} \right)$$

  • Symbol Explanation:

    • $G_{i,j}^C$, $G_{i,k}^A$, $G_{k,j}^B$: Sub-blocks of matrices $C$, $A$, and $B$.

    • $\hat{Q}(\cdot)$: The quantization function, which returns INT8 values.

    • $a_{i,k}^A$, $a_{k,j}^B$: The scaling factors for the respective blocks.

    • $[\cdot]_{\mathrm{INT}\times\mathrm{INT}}$: An INT8 matrix multiplication operation producing INT32 output.

      The issue with this, as seen in Image 1(b), is that for high accuracy, Jetfire needed a small block size (32x32), which is computationally inefficient. A larger block size (e.g., 128x128) is much faster but less accurate in the presence of outliers.

      [Figure (Image 1): (a) illustration of group-quantization granularities: per-block (Mg×Ng), per-token (1×N), and per-tensor (M×N). (b) Line plot of GEMM throughput (TFLOPS) versus group size K (32 to 256) for sequence lengths 2048, 4096, and 8192, showing throughput rising with larger group sizes.]

  • Image 1 Analysis: (a) illustrates different quantization granularities. (b) shows that GEMM throughput (in TFLOPS) increases significantly with a larger group size K. A group size of 32, as used by Jetfire, is much slower than a group size of 128 or 256.
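
To make the block-accumulation formula of this subsection concrete, here is a minimal NumPy simulation of block-quantized GEMM with per-block absmax scales. The loop structure and the re-quantization of A blocks for every output tile are simplifications for clarity, not the actual CUDA kernel.

```python
import numpy as np

def absmax_quant(block):
    """Per-block symmetric INT8 quantization (same helper as in Section 3)."""
    scale = max(np.abs(block).max() / 127.0, 1e-12)
    q = np.clip(np.round(block / scale), -127, 127).astype(np.int8)
    return q, scale

def block_quant_gemm(A, B, Mg=128, Ng=128, Kg=128):
    """Simulate C = A @ B with per-block INT8 codes and scales,
    mirroring the accumulation formula above."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=np.float32)
    for i in range(0, M, Mg):
        for j in range(0, N, Ng):
            acc = np.zeros((min(Mg, M - i), min(Ng, N - j)), dtype=np.float32)
            for k in range(0, K, Kg):
                qA, sA = absmax_quant(A[i:i+Mg, k:k+Kg])
                qB, sB = absmax_quant(B[k:k+Kg, j:j+Ng])
                # INT8 x INT8 accumulated as integers, then rescaled by both block scales
                acc += (qA.astype(np.int32) @ qB.astype(np.int32)) * (sA * sB)
            C[i:i+Mg, j:j+Ng] = acc
    return C

# Example: the quantized result stays close to the full-precision reference.
rng = np.random.default_rng(0)
A = rng.normal(size=(256, 256)).astype(np.float32)
B = rng.normal(size=(256, 256)).astype(np.float32)
print(np.abs(block_quant_gemm(A, B) - A @ B).max())
```
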

4.3. Block Fallback Quantization

This is the core proposal of the paper. When a block $G_{i,k}^A$ in the activation matrix $A$ is identified as an "outlier block" (i.e., it contains a very large value), it is represented with higher precision using a two-step process:

  1. First Step (Coarse Quantization): The block is quantized to INT8 as usual. This step captures the general magnitude and the outlier itself. Let's call this $Q(G_{i,k}^A)$.

  2. Second Step (Residual Quantization): The error from the first step, called the residual $\Delta Q(G_{i,k}^A) = G_{i,k}^A - Q(G_{i,k}^A)$, is calculated. This residual is then also quantized to INT8.

    The block $G_{i,k}^A$ is now represented by two INT8 blocks: the main quantization block and the fallback (residual) block. This is conceptually similar to an INT16 representation but is empirically more accurate, because the first step isolates the large outlier, allowing the second step to quantize the smaller residual values with much higher fidelity.
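
A minimal sketch of this two-step representation, assuming plain absmax INT8 quantization for both passes; the 157.00 outlier value is borrowed from the Image 5 example below.

```python
import numpy as np

def absmax_quant(block):
    scale = max(np.abs(block).max() / 127.0, 1e-12)
    q = np.clip(np.round(block / scale), -127, 127).astype(np.int8)
    return q, scale

def fallback_quant(block):
    """Coarse INT8 pass captures the outlier; a second INT8 pass quantizes the residual."""
    q1, s1 = absmax_quant(block)
    residual = block - q1.astype(np.float32) * s1
    q2, s2 = absmax_quant(residual)
    return (q1, s1), (q2, s2)

def dequant_fallback(parts):
    (q1, s1), (q2, s2) = parts
    return q1.astype(np.float32) * s1 + q2.astype(np.float32) * s2

block = np.array([[0.03, -0.08, 157.00],
                  [0.05, -0.02, 0.11]], dtype=np.float32)
q1, s1 = absmax_quant(block)
err_plain = np.abs(block - q1.astype(np.float32) * s1).max()
err_fallback = np.abs(block - dequant_fallback(fallback_quant(block))).max()
print(err_plain, err_fallback)   # the two-INT8 representation is far more accurate
```
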

    [Figure (Image 5): (a) illustration of Fallback Quantization: the original block $G_{ij}$ is first quantized to $Q(G_{ij})$, then the residual $\Delta Q(G_{ij})$ of the outlier-containing block is quantized again, and the two parts sum to a more accurate representation. (b) RMSE versus quantization bit-width for the fallback method and plain quantization, with fallback performing better at low bit-widths. (c) Output cosine similarity (CosSim) versus fallback rate for the Amax, L1, and L1-Rel selection criteria, with similarity improving as the fallback rate increases.]

  • Image 5 Analysis: (a) visually explains Fallback Quantization. A naive INT8 quantization of a block $G_{ij}$ with a large outlier (157.00) causes other values to be quantized poorly or to zero. In fallback, the first quantization captures the outlier, and the second quantization accurately represents the remaining residual values. (b) shows that the proposed Fallback method has a lower Root Mean Square Error (RMSE) than standard double-bit (INT16) quantization. (c) shows that using the AbsMax value of a block as the criterion for fallback is as effective as more complex L1-error metrics.

    The GEMM operation is modified to incorporate this. If a block $G_{i,k}^A$ is a fallback block, the computation involves two INT8 multiplications instead of one:

$$G_{i,j}^{C} = \sum_{k=0}^{\lceil K / G_K \rceil - 1} \left( Q(G_{i,k}^{A}) + u(i,k)\, Q(\Delta Q(G_{i,k}^{A})) \right) Q(G_{k,j}^{B})$$

  • Symbol Explanation:

    • $u(i,k)$: An indicator variable, which is 1 if block $(i,k)$ is a fallback block and 0 otherwise.

    • $Q(\Delta Q(G_{i,k}^A))$: The quantized residual block.

      This is efficient because it reuses the same quantized block $Q(G_{k,j}^B)$ from matrix $B$ for both multiplications, requiring only one extra INT8 GEMM and a load of the fallback block. Algorithm 1 outlines this process.
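
Putting the pieces together, here is a sketch of the mixed-precision GEMM above: blocks of the activation matrix whose AbsMax exceeds a threshold (the criterion introduced in the next subsection) contribute a second INT8 product built from their quantized residual. This is an illustrative simulation, not the paper's kernel implementation.

```python
import numpy as np

def absmax_quant(block):
    scale = max(np.abs(block).max() / 127.0, 1e-12)
    q = np.clip(np.round(block / scale), -127, 127).astype(np.int8)
    return q, scale

def fallback_gemm(A, B, theta, Mg=128, Ng=128, Kg=128):
    """Mixed-precision GEMM sketch: activation blocks of A whose AbsMax exceeds
    `theta` (u(i,k) = 1) add a second INT8 product built from their quantized
    residual, reusing the already-quantized B block."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=np.float32)
    for i in range(0, M, Mg):
        for j in range(0, N, Ng):
            acc = np.zeros((min(Mg, M - i), min(Ng, N - j)), dtype=np.float32)
            for k in range(0, K, Kg):
                Ablk = A[i:i+Mg, k:k+Kg]
                qA, sA = absmax_quant(Ablk)
                qB, sB = absmax_quant(B[k:k+Kg, j:j+Ng])
                acc += (qA.astype(np.int32) @ qB.astype(np.int32)) * (sA * sB)
                if np.abs(Ablk).max() > theta:        # u(i,k) = 1: fallback block
                    res = Ablk - qA.astype(np.float32) * sA
                    qR, sR = absmax_quant(res)
                    acc += (qR.astype(np.int32) @ qB.astype(np.int32)) * (sR * sB)
            C[i:i+Mg, j:j+Ng] = acc
    return C
```
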

4.4. Threshold for Dynamic Fallback

A simple and effective criterion is needed to decide which blocks get the fallback treatment. The paper finds that simply checking whether the maximum absolute value in a block (AbsMax) exceeds a threshold $\theta$ works best:

$$u(i,k) = \left[\max\left(\mathrm{abs}\, G_{i,k}^A\right) > \theta\right]$$

This is efficient because AbsMax is already computed during the standard quantization process.

To avoid a fixed threshold, which may not adapt well during training, they use a Delay Threshold method (described in Algorithm 2). This method dynamically adjusts the threshold $\theta$ for each layer to keep the percentage of fallback blocks within a target range (e.g., [10%, 30%]). If the rate is too low, the threshold is decreased; if too high, it's increased.
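
A minimal sketch of this delayed-threshold idea; the multiplicative step factor and the exact update rule are assumptions, since Algorithm 2's details are not reproduced in this summary.

```python
def update_threshold(theta, fallback_rate, lo=0.10, hi=0.30, step=1.5):
    """Keep the per-layer fraction of fallback blocks inside [lo, hi] by nudging
    theta between training steps. The multiplicative step factor is an assumption."""
    if fallback_rate < lo:
        theta /= step    # too few blocks fall back: lower the threshold
    elif fallback_rate > hi:
        theta *= step    # too many blocks fall back: raise the threshold
    return theta
```
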

[Figure (Image 2): (a) matrix diagram showing fallback-quantized blocks containing outliers, with channels on the horizontal axis and tokens on the vertical axis. (b) Line plot comparing the perplexity of 1.5B and 3B models under Block quantization and Fallback quantization across block sizes, showing that Fallback lowers perplexity at every block size.]

  • Image 2 Analysis: (a) shows the distribution of fallback blocks for a layer's input. The pattern is sparse, with some full columns (channel-wise outliers) and some scattered blocks (occasional outliers), justifying the dynamic, block-level approach. (b) shows that adding fallback (dashed lines) significantly reduces perplexity (a measure of error, lower is better) compared to naive block quantization (solid lines) across various block sizes.

4.5. Kernel Implementation for Better Acceleration

The fallback mechanism is accurate enough to allow the use of a large block size of 128x128. As shown in Figure 1(b), this block size is much more efficient on GPUs than the 32x32 size used by Jetfire. This choice is a key reason for the high end-to-end speedup.

5. Training System Design

The paper details how Fallback Quantization is integrated into a full training system.

  • Linear Layers:

    • In a linear layer, three GEMMs are performed: one in the forward pass ($Y = XW^T$) and two in the backward pass ($\nabla X = \nabla Y\, W$ and $\nabla W = \nabla Y^T X$).

    • Fallback Quantization is only applied to the activation matrix $X$ in the forward pass.

    • For the gradient matrix $\nabla Y$, they use stochastic rounding, an unbiased rounding method that is sufficient to preserve accuracy. This is more efficient than applying fallback.

    • For the backward GEMM involving $X$ ($\nabla W = \nabla Y^T X$), they also use stochastic rounding for $X$ instead of the full fallback mechanism, to simplify memory management of saved activations. Figure 5 shows this simplification has minimal impact on accuracy. A brief sketch of stochastic rounding follows Image 6 below.

      [Figure (Image 6): (a) gradient cosine similarity (CosSim) versus quantization bit-width (4 to 12 bits) when quantizing X, W, or DY, with W the least sensitive and CosSim approaching 1 as bit-width increases. (b) CosSim versus fallback rate for two strategies, Fallback Fwd and Fallback Both, both exceeding 0.99 at high fallback rates, indicating the fallback mechanism substantially improves quantization accuracy.]

    • Image 6 Analysis: (a) compares the impact of quantizing different tensors ($X$, $W$, $\nabla Y$) on gradient cosine similarity. It shows that the activation $X$ is the most sensitive, justifying the focus on improving its quantization. (b) shows that applying fallback in both forward and backward passes offers little benefit over applying it only in the forward pass, supporting the design choice for efficiency.
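
As referenced above, a generic sketch of stochastic rounding to INT8 with absmax scaling. This is the standard formulation of stochastic rounding, not necessarily the paper's exact kernel.

```python
import numpy as np

def stochastic_round_int8(x, rng=None):
    """Unbiased stochastic rounding to INT8: each scaled value rounds up with
    probability equal to its fractional part, so the expectation of the
    dequantized result equals the input (up to clipping)."""
    rng = rng or np.random.default_rng()
    scale = max(np.abs(x).max() / 127.0, 1e-12)
    y = x / scale
    lower = np.floor(y)
    q = lower + (rng.random(x.shape) < (y - lower))   # round up with prob = fractional part
    q = np.clip(q, -127, 127).astype(np.int8)
    return q, scale
```
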

  • Non-Linear Layers:

    • Layers like Normalization and Activation functions are more sensitive to quantization errors than linear layers (which can average out errors over the accumulation dimension). This is shown in Figure 6(a).

    • However, they account for a smaller fraction of total computation time, especially in large models (Figure 6(b)).

    • Therefore, the paper's strategy is to keep the computations for these layers in BF16 to maintain accuracy, but compress the activations saved for the backward pass to INT10. This saves memory without introducing computational errors. Figure 7(a) shows that 10-bit quantization for these activations is nearly lossless. A generic sketch of this kind of compression follows Image 7 below.

      [Figure (Image 7): (a) sensitivity analysis: perplexity versus quantization bit-width for linear and non-linear layers at 1.5B and 3B model scales, with non-linear layers degrading more sharply. (b) Bar chart of per-layer time (ms) split into linear and non-linear parts for 1.5B, 3B, and 7B models, showing the linear share grows with model size, so accelerating linear layers has more impact.]

    • Image 7 Analysis: (a) demonstrates that non-linear layers are much more sensitive to low-bit quantization than linear layers. (b) shows that as model size increases, the proportion of time spent on linear layers grows, making their acceleration more impactful.
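
As referenced above, a generic sketch of compressing a saved activation to a lower bit-width such as INT10. Per-tensor scaling and int16 storage containers are simplifications chosen for clarity; the paper's packing and granularity may differ.

```python
import numpy as np

def quantize_saved_activation(x, bits=10):
    """Compress an activation saved for the backward pass to `bits` levels
    (e.g. INT10) with symmetric absmax scaling; values are held in int16
    containers here, whereas a real kernel would bit-pack them."""
    qmax = 2 ** (bits - 1) - 1            # 511 for 10 bits
    scale = max(np.abs(x).max() / qmax, 1e-12)
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int16)
    return q, scale

def dequantize_saved_activation(q, scale):
    return q.astype(np.float32) * scale
```
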

  • Training Framework Implementation:

    • The overall data flow is shown in Figures 7(c) and 7(d).

    • Attention Module: The QKV projection is a linear layer and uses the proposed INT8 fallback method. The dot-product attention itself (FlashAttention) is kept in BF16.

    • MLP Module: The linear projections in the MLP block use INT8 fallback. The non-linear operations (SiLU, Gating) are computed in BF16, with their outputs quantized to INT10 before being saved for the backward pass.

      [Figure (Image 3): (a) cosine similarity of full versus normalized activations for 1.5B and 3B models across bit-widths. (b) Pretraining loss versus training tokens for BF16, Ours, Jetfire, and Block quantization. (c) Quantization data flow of the Attention module, marking normalization, the QKV projection, the attention output, and where Fallback and stochastic rounding are applied in the backward pass. (d) Quantization data flow of the MLP module with gated activation, showing the forward and backward passes and the corresponding Fallback steps of the dynamic block-level mixed-precision design.]

    • Image 3 Analysis: (a) shows that 10-bit quantization is a sweet spot for non-linear activations, preserving high gradient similarity. (b) displays pretraining loss curves, showing the proposed method (Ours) closely tracking the BF16 baseline. (c) and (d) are diagrams illustrating the data flow and precision choices for the Attention and MLP modules respectively.

6. Experimental Setup

  • Datasets:

    • Fine-tuning:
      • GSM8K: A dataset of grade-school math word problems.
      • DROP: A reading comprehension dataset requiring discrete reasoning.
      • MMLU: A massive multi-task language understanding benchmark.
      • HELLASWAG: A commonsense reasoning benchmark (sentence completion).
    • Pretraining:
      • OpenWebText: A large, open-source corpus of English text scraped from the web.
  • Evaluation Metrics:

    • Accuracy (Acc):
      • Conceptual Definition: The proportion of correct predictions out of the total number of predictions. Used for classification tasks like GSM8K and MMLU.
      • Mathematical Formula: $$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}$$
    • F1-Score (F1):
      • Conceptual Definition: The harmonic mean of precision and recall, which credits partial overlap between the predicted and reference answers. Used for the DROP reading-comprehension task.
      • Mathematical Formula: $$\text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
