
Error-free transformations of matrix multiplication by using fast routines of matrix multiplication and its applications

Published: 06/13/2011
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The paper introduces an error-free transformation of matrix products based on a fast error-free splitting of floating-point numbers. Because the dominant work is carried out by optimized BLAS matrix-multiplication routines, the resulting algorithms achieve high accuracy and efficiency, come with an a priori error estimate, and outperform XBLAS in speed on large matrices.

Abstract

Numer Algor (2012) 59:95–118, DOI 10.1007/s11075-011-9478-1 (Original Paper). Katsuhisa Ozaki, Takeshi Ogita, Shin'ichi Oishi, Siegfried M. Rump. Received: 8 February 2010 / Accepted: 26 May 2011 / Published online: 14 June 2011. © Springer Science+Business Media, LLC 2011.

This paper is concerned with accurate matrix multiplication in floating-point arithmetic. Recently, an accurate summation algorithm was developed by Rump et al. (SIAM J Sci Comput 31(1):189–224, 2008). The key technique of their method is a fast error-free splitting of floating-point numbers. Using this technique, we first develop an error-free transformation of a product of two floating-point matrices into a sum of floating-point matrices. Next, we partially apply this error-free transformation and develop an algorithm which aims to output an accurate approximation of the matrix product. In addition, an a priori error estimate is given. It is a characteristic of the proposed method that in terms of computation as well as in terms of memory consumption, the dominant part …

In-depth Reading

English Analysis

1. Bibliographic Information

  • Title: Error-free transformations of matrix multiplication by using fast routines of matrix multiplication and its applications
  • Authors: Katsuhisa Ozaki, Takeshi Ogita, Shin'ichi Oishi, Siegfried M. Rump.
  • Journal/Conference: Numerical Algorithms (Springer), volume 59, pages 95–118, published online by Springer Science+Business Media, LLC. Numerical Algorithms is a well-regarded venue in numerical analysis; the key prior work it builds on appeared in SIAM J Sci Comput, a top journal in scientific computing.
  • Publication Year: 2011 (published online; the journal issue appeared in 2012)
  • Abstract: The paper presents novel algorithms for performing highly accurate matrix multiplication using standard floating-point arithmetic. The core technique is an "error-free splitting" method, which transforms the product of two matrices into a sum of several other matrices. This transformation is designed so that each sub-product can be computed without any rounding error. The authors then introduce a practical algorithm that applies this transformation partially to obtain a very accurate approximation of the final product. A key feature of these algorithms is that they are constructed primarily from standard, highly optimized matrix multiplication routines (like BLAS gemm), which makes them computationally efficient. While the methods require substantial working memory, they are shown to be significantly faster than other high-accuracy libraries like XBLAS for large matrices.
  • Original Source Link: /files/papers/68f361a0d77e2c20857d89b5/paper.pdf (This is a local file path provided in the context of the analysis).

2. Executive Summary

  • Background & Motivation (Why): Standard computer hardware performs arithmetic using floating-point numbers (e.g., binary64 or double precision), which have limited precision. When multiplying matrices, small rounding errors occur at each step and can accumulate, sometimes leading to a final result that is significantly inaccurate. This is a major problem in scientific and engineering simulations where high precision is critical. While solutions like arbitrary-precision arithmetic libraries exist, they are often extremely slow because they cannot use the hardware's fast floating-point units. The paper aims to bridge this gap by creating algorithms that achieve high accuracy while still leveraging the immense speed of optimized hardware-level routines for matrix multiplication (BLAS).

  • Main Contributions / Findings (What):

    1. A Novel Error-Free Transformation (EFT) for Matrix Products: The paper introduces a method to take two floating-point matrices, $A$ and $B$, and represent their exact product $A \times B$ as an unevaluated sum of several other floating-point matrices, $\sum_i C^{(i)}$. Each matrix $C^{(i)}$ in this sum is the result of a standard matrix multiplication that is mathematically proven to have zero rounding error.
    2. A Fast, High-Accuracy Approximation Algorithm (Acc_Mul): Based on the EFT, the authors develop a practical algorithm that computes a highly accurate approximation of the matrix product. It works by partially applying the transformation, computing the most significant parts of the product error-free and accepting small, controlled errors in the less significant parts. This provides a tunable trade-off between accuracy and computational cost.
    3. Performance via Optimized BLAS: The algorithms are designed so that the most computationally intensive part is a series of standard matrix-matrix multiplications. This allows them to harness the power of highly optimized libraries like BLAS, Intel MKL, or ATLAS, making them much faster than competing high-accuracy methods.
    4. Efficiency through Sparsity: The splitting technique often produces matrices that are sparse (containing many zeros). The algorithms exploit this by using sparse matrix multiplication routines, which further accelerates computation, especially for input matrices with a large dynamic range of values.

3. Prerequisite Knowledge & Related Work

  • Foundational Concepts:

    • Floating-Point Arithmetic: A system for representing real numbers on a computer with a fixed number of bits, standardized by IEEE 754. A number is stored as a sign, a significand (the digits), and an exponent. Because the significand is finite, most real numbers cannot be represented exactly, leading to rounding errors in calculations. The unit roundoff, denoted $\mathbf{u}$, is the maximum relative error incurred when rounding a number to the nearest floating-point representation; for binary64 (double precision), $\mathbf{u} = 2^{-53}$. A short numerical illustration follows this list.
    • BLAS (Basic Linear Algebra Subprograms): A specification for a set of low-level routines for common linear algebra operations. Level 3 BLAS, which includes matrix-matrix multiplication (gemm), is particularly important. Implementations like Intel MKL and OpenBLAS are hand-tuned for specific CPU architectures to maximize performance by optimizing memory access patterns (caching) and using vector instructions.
    • Error-Free Transformation (EFT): A technique to represent the result of a floating-point operation, which would normally be rounded, as the exact sum of two or more floating-point numbers. A classic example is Dekker's algorithm for splitting a number or computing an exact product. The method used in this paper is a different kind of EFT, specialized for splitting a number into high- and low-order bits.
  • Previous Works:

    • Rump et al. (2008) [11]: This is the foundational work. The authors developed a fast and accurate summation algorithm based on an "error-free splitting" technique. This paper adapts that splitting technique from summing a vector of numbers to multiplying matrices.
    • XBLAS [7]: An "eXtra-precise" BLAS library that provides highly accurate results by using techniques like double-double or quad-double arithmetic internally. It serves as a key high-accuracy baseline for performance comparison. It is generally more accurate but slower than standard BLAS.
    • INTLAB [20]: A MATLAB toolbox for interval arithmetic and reliable computing. The paper notes that its lssresidual function uses a similar splitting idea for calculating accurate residuals, though the application to full matrix multiplication is a novel contribution of this paper.
  • Differentiation: While previous works offered either high speed (standard BLAS) or high accuracy (XBLAS, multiple-precision libraries), this paper's approach is novel because it combines both. It strategically restructures the matrix multiplication problem into a series of standard gemm calls that can be executed at high speed, while mathematically ensuring that rounding errors are either eliminated or systematically controlled.
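
To ground these concepts, here is a small Python sketch (illustrative code written for this analysis, assuming binary64 arithmetic); it shows the unit roundoff $\mathbf{u} = 2^{-53}$ and how rounding errors in a plain dot product grow for data with a large dynamic range, which is exactly the setting the paper targets.

```python
import math
import numpy as np

u = 2.0 ** -53                       # unit roundoff for binary64
print(1.0 + u == 1.0)                # True: u is lost when added to 1.0 (round-to-nearest-even)
print(np.finfo(np.float64).eps)      # machine epsilon 2**-52, i.e. 2*u

# rounding errors accumulate in a naive dot product of wide-dynamic-range data
rng = np.random.default_rng(0)
x = (rng.random(10_000) - 0.5) * np.exp(10.0 * rng.standard_normal(10_000))
y = (rng.random(10_000) - 0.5) * np.exp(10.0 * rng.standard_normal(10_000))
naive = float(np.dot(x, y))                                      # recursively rounded summation
better = math.fsum(float(a) * float(b) for a, b in zip(x, y))    # exact sum of the rounded products
print(abs(naive - better) / abs(better))                         # often noticeably larger than u
```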

4. Methodology (Core Technology & Implementation)

The core of the paper is a multi-stage process: first, splitting floating-point numbers, then applying this to vectors (for dot products), and finally scaling it up to matrices.

2.2 Error-free splitting

The fundamental building block is ExtractScalar (Algorithm 1), a technique from Rump et al. for splitting a floating-point number $p$ into a high-order part $q$ and a low-order part $p'$.

The splitting works as follows:

  1. A scaling factor $\sigma$ is chosen, which must be a power of two and larger than $|p|$.

  2. The high-order part is computed as $q = \mathrm{fl}((\sigma + p) - \sigma)$. In floating-point arithmetic, when adding a small number ($p$) to a very large one ($\sigma$), the result is rounded. This effectively aligns $p$ to the bit positions of $\sigma$, truncating its lower-order bits. Subtracting $\sigma$ then isolates this truncated, high-order part of $p$.

  3. The low-order part is simply the remainder: $p' = \mathrm{fl}(p - q)$.

    This transformation is error-free, meaning the identity $p = q + p'$ holds exactly. Figure 1 provides a visual intuition for this process, showing how a vector of numbers $x$ is split into two vectors $x^{(1)}$ (the high-order parts) and $x^{(2)}$ (the low-order parts); a minimal numerical sketch follows Fig. 1 below.

    Fig. 1: Each rectangle depicts a floating-point number; the left and right end points of a rectangle mark the unit in the first place and the unit in the last place, respectively. Algorithm 2 divides each floating-point number $x_i$ into two floating-point numbers $x_i^{(1)}$ and $x_i^{(2)}$.
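
To make this concrete, here is a minimal Python sketch of the error-free splitting; the function name `extract_scalar` and the particular choice of $\sigma$ are illustrative assumptions for this analysis, not code taken from Algorithm 1.

```python
import math

def extract_scalar(sigma, p):
    """Split p into a high-order part q and a low-order part p - q, so that p = q + (p - q) exactly.

    sigma must be a power of two larger than |p| (step 1 above).
    """
    q = (sigma + p) - sigma   # fl((sigma + p) - sigma): the low-order bits of p are rounded away
    return q, p - q           # p - q is itself computed without rounding error

p = math.pi
sigma = 2.0 ** 27             # a power of two well above |p|
q, p_low = extract_scalar(sigma, p)
print(q, p_low)
assert q + p_low == p         # the error-free identity p = q + p' holds exactly
```

The larger $\sigma$ is chosen, the fewer bits remain in the high-order part $q$; this is the knob used below to bound the number of leading bits per split.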

2.3 & 2.4 Error-Free Transformation of Matrix Products

The paper extends this splitting idea to matrix multiplication. Since a matrix product $C = AB$ is composed of dot products between the rows of $A$ and the columns of $B$, the authors first analyze the dot product case.

  1. Splitting Vectors: Given two vectors $x, y \in \mathbb{F}^n$, they are split into $x = x^{(1)} + \underline{x}^{(2)}$ and $y = y^{(1)} + \underline{y}^{(2)}$. The dot product is then $x^T y = (x^{(1)})^T y^{(1)} + (x^{(1)})^T \underline{y}^{(2)} + (\underline{x}^{(2)})^T y$.

  2. Error-Free Sub-product: The crucial insight is that the term $(x^{(1)})^T y^{(1)}$ can be computed without any rounding errors. This is achieved by carefully choosing the split point: the number of leading bits in the elements of $x^{(1)}$ and $y^{(1)}$ is restricted to a value $\alpha$, where
$$\alpha = \left\lfloor -\frac{\log_2 \mathbf{u} + \lceil \log_2 n \rceil}{2} \right\rfloor$$

    • $\mathbf{u}$ is the unit roundoff (e.g., $2^{-53}$).

    • $n$ is the vector dimension. Each product $x_i^{(1)} y_i^{(1)}$ has roughly $2\alpha$ significant bits, and the sum $\sum_{i=1}^n x_i^{(1)} y_i^{(1)}$ needs approximately $2\alpha + \log_2 n$ bits. The value of $\alpha$ is chosen such that $2\alpha + \log_2 n$ does not exceed the number of bits in a floating-point significand ($-\log_2 \mathbf{u}$). Therefore, the sum can be represented exactly as a floating-point number. Fig. 3 illustrates how the individual products and their sum fit within the available precision; a worked numerical sketch of this splitting follows the list below.

      Fig. 3: Visualization of the dot product. Each of $x_1^{(1)} y_1^{(1)}, \dots, x_n^{(1)} y_n^{(1)}$ has at most $2\alpha$ nonzero leading bits, so $|(x^{(1)})^T y^{(1)}|$ can be represented within $-\log_2 \mathbf{u}$ bits.

  3. Iterative Splitting and Transformation: This process can be repeated on the remainder terms. For instance, $\underline{x}^{(2)}$ can be split again into $x^{(2)} + \underline{x}^{(3)}$. By repeating this, a vector $x$ can be fully decomposed into an error-free sum $x = \sum_{i=1}^{p} x^{(i)}$. Fig. 5 shows this multi-level splitting.

    Fig. 5: Illustration of the splittings. Each floating-point number $x_i$ is decomposed into parts $x_i^{(1)}, x_i^{(2)}, x_i^{(3)}, \dots$, each having at most $\alpha$ nonzero leading bits.

    If both matrices $A$ and $B$ are fully decomposed into $A = \sum_r A^{(r)}$ and $B = \sum_s B^{(s)}$, their product is $AB = \sum_{r,s} A^{(r)} B^{(s)}$. Theorem 1 proves that every single matrix product $A^{(r)} B^{(s)}$ in this sum can be computed by a standard gemm call with zero rounding error. This transforms the original matrix product into an unevaluated sum of error-free matrices.

  4. Algorithms:

    • Split_Mat (Algorithm 3): This function implements the iterative splitting of a matrix $A$ into a sequence of matrices D{1}, D{2}, …. It includes a check to switch to a sparse matrix format if a split-off matrix D{k} has few non-zero elements, which is a key optimization.
    • EFT_Mul (Algorithm 4): This algorithm performs the full error-free transformation. It uses Split_Mat to decompose $A$ and $B$ completely and then computes all the error-free cross-products D{r} * E{s}. The output is a collection of matrices that, when summed, equal the exact product $AB$.
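
The following Python sketch walks through one level of this splitting for the dot-product case. The splitting factor $\sigma = 2^{e + 53 - \alpha}$ (with $2^{e}$ a power of two bounding $\max_i |x_i|$) and the function name `split_vector` are simplifications written for this analysis in the spirit of Algorithm 2, not the paper's exact code; exact rational arithmetic is used only to verify that the high-part dot product carries no rounding error.

```python
import math
from fractions import Fraction
import numpy as np

def split_vector(x):
    """One splitting step: x = x_high + x_low, where each element of x_high
    keeps at most alpha leading nonzero bits (simplified Algorithm 2)."""
    n = x.size
    alpha = math.floor((53 - math.ceil(math.log2(n))) / 2)   # alpha from the formula above
    _, e = math.frexp(float(np.max(np.abs(x))))              # max |x_i| <= 2**e
    sigma = 2.0 ** (e + 53 - alpha)                          # power-of-two splitting factor
    x_high = (x + sigma) - sigma                             # ExtractScalar applied elementwise
    return x_high, x - x_high

rng = np.random.default_rng(1)
n = 1000
x = (rng.random(n) - 0.5) * np.exp(5.0 * rng.standard_normal(n))
y = (rng.random(n) - 0.5) * np.exp(5.0 * rng.standard_normal(n))

x1, x2 = split_vector(x)   # x = x1 + x2 holds exactly
y1, y2 = split_vector(y)   # y = y1 + y2 holds exactly

# the high-part dot product (x1)^T y1 should be free of rounding error:
exact = sum(Fraction(float(a)) * Fraction(float(b)) for a, b in zip(x1, y1))
print(np.dot(x1, y1) == float(exact))   # expected: True
```

Note that elements of $x$ far below the largest magnitude are extracted as exact zeros in $x^{(1)}$; this is the sparsity that Split_Mat detects and exploits with sparse routines.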

3.1 Accurate Matrix Multiplication (Acc_Mul)

The full EFT can be computationally expensive. For practical use, the paper proposes Acc_Mul (Algorithm 5), which applies the transformation only partially.

  • It splits $A$ and $B$ into $k$ components each:
$$A = \sum_{i=1}^{k-1} A^{(i)} + \underline{A}^{(k)}, \qquad B = \sum_{j=1}^{k-1} B^{(j)} + \underline{B}^{(k)}$$

  • It then computes the product using a carefully structured sum:
$$AB = \sum_{i+j \le k} A^{(i)} B^{(j)} + \sum_{i=1}^{k-1} A^{(i)} \underline{B}^{(k-i+1)} + \underline{A}^{(k)} B$$

  • The terms in the first sum, where $i + j \le k$, are computed error-free.

  • The remaining terms involve the small-magnitude remainders ($\underline{A}^{(k)}$ and $\underline{B}^{(k-i+1)}$). While these multiplications incur rounding errors, the errors are small because the input magnitudes are small.

  • The final result is obtained by summing all these partial products. This yields a highly accurate approximation of $AB$, with accuracy increasing with the parameter $k$; a minimal $k = 2$ sketch follows the figure below.

    Figure: Structure of the blockwise splitting of floating-point vectors. The left side marks the 53-bit significand width of each block, and the right side lists the corresponding block products, such as $(x^{(i)})^T y^{(j)}$.
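
As a concrete illustration of the $k = 2$ case, the following NumPy sketch (written for this analysis; the row-wise and column-wise splitting factors and the names `split_rows` / `acc_mul_k2` are assumptions rather than the paper's Algorithm 5) computes $A^{(1)} B^{(1)}$ error-free and adds the two remainder products, which contribute only small errors.

```python
import math
import numpy as np

def split_rows(A, n_inner):
    """Row-wise splitting A = A_high + A_low so that products of high parts stay exact
    (a simplified stand-in for Split_Mat)."""
    alpha = math.floor((53 - math.ceil(math.log2(n_inner))) / 2)
    mu = np.max(np.abs(A), axis=1, keepdims=True)   # per-row magnitude
    _, e = np.frexp(mu)                             # mu <= 2**e, one exponent per row
    sigma = 2.0 ** (e + 53 - alpha)                 # per-row power-of-two splitting factor
    A_high = (A + sigma) - sigma
    return A_high, A - A_high

def acc_mul_k2(A, B):
    """Accurate matrix product in the spirit of Acc_Mul with k = 2."""
    n_inner = A.shape[1]
    A1, A2 = split_rows(A, n_inner)                 # A = A1 + A2
    B1t, B2t = split_rows(B.T, n_inner)             # split B column-wise via its transpose
    B1, B2 = B1t.T, B2t.T                           # B = B1 + B2
    return (A1 @ B1) + (A1 @ B2 + A2 @ B)           # A1 @ B1 is error-free; the rest is small

rng = np.random.default_rng(2)
n = 500
A = (rng.random((n, n)) - 0.5) * np.exp(5.0 * rng.standard_normal((n, n)))
B = (rng.random((n, n)) - 0.5) * np.exp(5.0 * rng.standard_normal((n, n)))
C_std, C_acc = A @ B, acc_mul_k2(A, B)              # C_acc should be far closer to the exact AB
```

Since the dominant cost is three dense matrix products handled by optimized BLAS, and the split-off factors are often sparse, this is how the paper keeps the computing-time ratio near 3 or below for $k = 2$.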

5. Experimental Setup

  • Datasets: The experiments use randomly generated square matrices of size $n \times n$. The elements are created with the MATLAB formula (rand(n) - 0.5) .* exp(phi * randn(n)). The parameter $\phi$ is critical: it controls the dynamic range of the matrix elements. A larger $\phi$ yields matrices containing both very large and very small elements, which makes them harder to multiply accurately and requires more splits in the EFT. (A NumPy analogue of this generator and of the error metric below is sketched after the baselines list.)

  • Evaluation Metrics:

    1. Maximum Relative Error: To measure accuracy, the authors compute the maximum relative error between the computed result $C$ and the "exact" product $AB$ (calculated using a high-precision library).
      • Conceptual Definition: This metric captures the worst-case relative discrepancy for any single element in the resulting matrix. A smaller value indicates higher accuracy.
      • Mathematical Formula:
$$\mathtt{RelErr}(AB, C) := \max_{1 \le i,j \le n,\ (AB)_{ij} \neq 0} \frac{|(AB)_{ij} - C_{ij}|}{|(AB)_{ij}|}$$
      • Symbol Explanation:
        • $(AB)_{ij}$ is the element in the $i$-th row and $j$-th column of the exact product matrix.
        • $C_{ij}$ is the corresponding element in the computed matrix.
    2. Computing Time Ratio: Performance is measured by the ratio of the execution time of a given method to the time taken by a standard, pure floating-point matrix multiplication in MATLAB. A ratio of 3.0, for example, means the method was three times slower than the standard gemm.
  • Baselines: The proposed methods are compared against:

    • M1: Standard MATLAB matrix multiplication (A * B), representing the baseline performance and accuracy.
    • M2: lssresidual from the INTLAB toolbox.
    • M3, M4, M5: The proposed Acc_Mul algorithm with $k = 2, 3, 4$, respectively.
    • M6: gemmx from the XBLAS library, a well-known high-accuracy competitor.
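
For reference, a NumPy analogue of the test-matrix generator and of the RelErr metric might look as follows (illustrative code written for this analysis; the "exact" product must come from a high-precision or exact reference, which is left as an input here):

```python
import numpy as np

def test_matrix(n, phi, rng):
    """NumPy analogue of MATLAB's (rand(n) - 0.5) .* exp(phi * randn(n))."""
    return (rng.random((n, n)) - 0.5) * np.exp(phi * rng.standard_normal((n, n)))

def rel_err(AB_exact, C):
    """Maximum elementwise relative error over entries where the exact product is nonzero."""
    mask = AB_exact != 0
    return float(np.max(np.abs(AB_exact[mask] - C[mask]) / np.abs(AB_exact[mask])))
```

A larger phi widens the dynamic range of the entries, which drives up both the number of splits (nA, nB in Tables 1–2) and the relative error of the plain product (M1 in Table 3).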

6. Results & Analysis

The experiments validate the two main claims of the paper: high accuracy and high performance.

EFT Performance (Algorithm 4)

Tables 1 and 2 show the performance of the full error-free transformation (EFT_Mul).

(Manual transcription of Table 1 from the paper) Table 1: The ratio of computing time of Algorithm 4 to the built-in matrix product for various problems with n = 1,000 (Intel Core 2 Duo).

φ nA nB Ratio (no sparse) Ratio (using sparse)
1 4 4 18.7 13.0
5 6 6 40.9 29.8
10 9 9 85.0 42.1
15 12 12 151 66.1

(Manual transcription of Table 2 from the paper) Table 2: The ratio of computing time of Algorithm 4 to the built-in matrix product for various problems with n = 2,000 (Intel Core 2 Extreme).

φ nA nB Ratio (no sparse) Ratio (using sparse)
1 4 4 19.7 20.8
5 6 6 41.5 32.8
10 9 9 89.0 74.9
15 12 12 154 108
  • Key Findings:
    • As $\phi$ increases, the matrices become more challenging, requiring more splits (nA and nB increase). This leads to a higher number of matrix multiplications (nA * nB) and thus longer runtimes.
    • The "using sparse" optimization provides a dramatic speedup. For $\phi = 15$ in Table 1, the runtime ratio drops from 151 to 66.1. This confirms that the split-off matrices become sparse and that using sparse gemm is highly effective.

Accuracy and Performance of Acc_Mul (Algorithm 5)

Table 3 shows the accuracy achieved by the practical Acc_Mul algorithm compared to baselines.

(Manual transcription of Table 3 from the paper) Table 3: Comparison of RelErr(AB, C).

φ M1 (Std) M2 (INTLAB) M3 (Acc_Mul k=2) M4 (Acc_Mul k=3) M5 (Acc_Mul k=4) M6 (XBLAS)
1 5.64e-10 8.12e-11 7.95e-15 2.20e-16 3.27e-16 1.11e-16
5 3.48e-11 2.39e-11 7.28e-12 2.19e-16 3.24e-16 1.11e-16
10 2.90e-11 2.68e-11 8.88e-11 1.59e-12 2.21e-14 1.11e-16
15 6.81e-12 5.29e-12 5.39e-12 5.60e-12 4.18e-12 1.11e-16
  • Accuracy Analysis:

    • Standard gemm (M1) provides relatively low accuracy, with errors around $10^{-10}$ to $10^{-12}$.
    • The proposed methods (M3, M4, M5) significantly improve accuracy. For small $\phi$, M3 ($k=2$) already reaches near machine precision ($\approx 10^{-15}$), and M4 ($k=3$) delivers results as accurate as XBLAS (M6).
    • As $k$ increases from 2 to 4, the accuracy of Acc_Mul steadily improves, demonstrating the effectiveness of the tunable partial transformation.
  • Performance Analysis:

    • The paper states that its methods are significantly faster than gemmx in XBLAS.
    • For $\phi = 10$, the computing time ratio for M3 ($k=2$) was only 2.40. This is very impressive, as a naive implementation would require 3 matrix multiplications (a ratio of 3.0). The speedup comes from the sparse matrix optimization.
    • This result highlights the core strength of the paper's approach: achieving accuracy comparable to specialized libraries like XBLAS but at a much lower computational cost, thanks to its reliance on optimized standard gemm routines.

7. Conclusion & Reflections

  • Conclusion Summary: The paper successfully introduces a powerful and practical framework for high-accuracy matrix multiplication in standard floating-point arithmetic. By transforming a matrix product into an error-free sum of other matrix products, it enables the use of fast, hardware-optimized BLAS routines to perform the bulk of the computation. The proposed Acc_Mul algorithm provides a tunable method that delivers accuracy close to or matching that of specialized libraries like XBLAS, but with significantly better performance.

  • Limitations & Future Work:

    • Memory Consumption: The authors explicitly note that the algorithms require a "significant amount of working memory" to store the multiple split-off matrices ($A^{(r)}$, $B^{(s)}$). This could be a major constraint for extremely large matrices on memory-limited systems.
    • Dependence on BLAS Quality: The performance is fundamentally tied to the efficiency of the underlying BLAS and sparse matrix multiplication libraries. Any inefficiencies in those routines would bottleneck the proposed algorithms.
    • Future work could explore extending these techniques to other linear algebra operations (e.g., matrix factorizations like LU or QR), developing GPU-accelerated versions, or devising strategies to reduce the memory footprint.
  • Personal Insights & Critique:

    • The paper's central idea is both elegant and highly practical. It reframes the problem of error accumulation not as something to be avoided with expensive custom arithmetic, but as something to be managed through a clever algebraic restructuring that plays to the strengths of modern computer hardware.
    • The use of sparsity detection and corresponding routines is a critical engineering detail that turns a theoretically interesting idea into a performant algorithm. It shows a deep understanding of practical high-performance computing.
    • This work represents a significant contribution to the field of numerical computing. It provides a valuable tool for scientists and engineers who need reliable, accurate results without having to sacrifice computational speed. The principles demonstrated here could inspire similar "accuracy-enhancing transformations" for a wider range of numerical problems.
