Plaintext-Ciphertext Matrix Multiplication and FHE Bootstrapping: Fast and Fused
TL;DR Summary
This work develops multi-secret RLWE-based PC-MM algorithms that reduce to efficient plaintext-plaintext matrix multiplications (PP-MM), integrates them with CKKS bootstrapping, and introduces MaMBo to accelerate privacy-preserving model inference with minimal overhead.
Abstract
Homomorphically multiplying a plaintext matrix with a ciphertext matrix (PC-MM) is a central task for the private evaluation of transformers, commonly used for large language models. We provide several RLWE-based algorithms for PC-MM that consist of multiplications of plaintext matrices (PP-MM) and comparatively cheap pre-processing and post-processing steps: for small and large dimensions compared to the RLWE ring degree, and with and without precomputation. For the algorithms with precomputation, we show how to perform a PC-MM with a single floating-point PP-MM of the same dimensions. This is particularly meaningful for practical purposes as a floating-point PP-MM can be implemented using high-performance BLAS libraries. The algorithms rely on the multi-secret variant of RLWE, which allows multiple ciphertexts to be represented more compactly. We give algorithms to convert from usual shared-secret RLWE ciphertexts to multi-secret ciphertexts and back. Further, we show that this format is compatible with homomorphic addition, plaintext-ciphertext multiplication, and key-switching. This in turn allows us to accelerate the slots-to-coeffs and coeffs-to-slots steps of CKKS bootstrapping when several ciphertexts are bootstrapped at once. Combining batch-bootstrapping with efficient PC-MM results in MaMBo (Matrix Multiplication Bootstrapping), a bootstrapping algorithm that can perform a PC-MM for a limited overhead.
English Analysis
1. Bibliographic Information
- Title: Plaintext-Ciphertext Matrix Multiplication and FHE Bootstrapping: Fast and Fused
- Authors: Youngjin Bae, Jung Hee Cheon, Guillaume Hanrot, Jai Hyun Park, and Damien Stehlé.
- Affiliations: The authors are affiliated with CryptoLab Inc. (Seoul, Korea and Lyon, France) and Seoul National University (Seoul, Korea), which are prominent institutions in the field of applied cryptography and Fully Homomorphic Encryption (FHE).
- Journal/Conference: The paper is available on the IACR ePrint Archive. The ePrint Archive is a preprint server for cryptographic research, serving as a platform for rapid dissemination of new results. Papers on ePrint are often later submitted to peer-reviewed conferences and journals.
- Publication Year: 2024
- Abstract: The paper presents novel algorithms for Plaintext-Ciphertext Matrix Multiplication (PC-MM), a critical operation for privacy-preserving large language models (LLMs). The proposed methods reduce PC-MM to highly efficient Plaintext-Plaintext Matrix Multiplications (PP-MM), which can be accelerated using standard high-performance libraries (BLAS). The algorithms are designed for various matrix dimensions and support an optional precomputation phase for even greater speed. The core innovation relies on a "multi-secret" variant of the Ring Learning With Errors (RLWE) problem, where multiple ciphertexts share a common component to reduce data size and computational cost. The authors demonstrate that this format is compatible with key FHE operations, leading to an accelerated batch bootstrapping procedure. By combining these techniques, they introduce MaMBo (Matrix Multiplication Bootstrapping), an algorithm that fuses a PC-MM operation into the bootstrapping process with minimal overhead.
- Original Source Link: https://eprint.iacr.org/2024/1284.pdf (Preprint)
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: Fully Homomorphic Encryption (FHE) allows computation on encrypted data but is notoriously slow. A key bottleneck in applying FHE to machine learning, particularly large language models (LLMs), is matrix multiplication. Specifically, multiplying a public matrix (e.g., model weights) with an encrypted matrix (e.g., user input), known as Plaintext-Ciphertext Matrix Multiplication (PC-MM), is computationally expensive.
- Importance & Gaps: As LLMs grow in size (with weight matrices of dimension 4096x4096 or larger), efficient PC-MM becomes essential for practical privacy-preserving AI. Prior FHE-based methods for PC-MM were often slow, inflexible with respect to matrix dimensions, or involved costly internal data-shuffling operations within ciphertexts. There was a need for algorithms that could handle large, variable dimensions and leverage hardware-optimized libraries.
- Innovation: This paper's fresh angle is to transform the complex homomorphic PC-MM problem into a much simpler, faster one: standard matrix multiplication on unencrypted data (PP-MM). This reduction allows the use of highly optimized Basic Linear Algebra Subprograms (BLAS) libraries, bridging the gap between theoretical cryptography and high-performance computing.
- Main Contributions / Findings (What):
- A Suite of PC-MM Algorithms: The paper introduces several new algorithms for PC-MM that reduce the operation to one or two PP-MMs. These algorithms cater to different scenarios:
- Matrix dimensions (d) that are smaller or larger than the FHE ring degree (N).
- Modes with and without a precomputation phase.
- The "Shared-a" Ciphertext Format: The authors leverage a multi-secret ciphertext format where multiple ciphertexts share one of their two components (the a-part). This makes the representation of matrices more compact and reduces the computational load of PC-MM. They provide algorithms to convert between the standard and shared-a formats.
- Reduction to Floating-Point PP-MM: Crucially, the authors show that the modular PP-MM can be replaced by a standard floating-point PP-MM, with a careful error analysis. This is a major practical breakthrough, as it allows direct use of ubiquitous, highly-optimized libraries like OpenBLAS.
- Efficient Batch Bootstrapping: They demonstrate that the shared-a format is compatible with essential FHE operations (addition, plaintext multiplication, key-switching). This enables batch bootstrapping, where multiple ciphertexts are "refreshed" simultaneously at a lower amortized cost than refreshing them one by one.
- MaMBo (Matrix Multiplication Bootstrapping): The final contribution is MaMBo, a novel procedure that "fuses" the PC-MM operation directly into the batch bootstrapping process. This makes the computationally expensive bootstrapping step do double duty, performing a matrix multiplication for a very small additional cost. The paper reports that MaMBo (with a PC-MM) can be over 40% faster than a standard bootstrapping procedure (without a PC-MM).
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Fully Homomorphic Encryption (FHE): A powerful form of encryption that allows a third party (like a cloud server) to perform arbitrary computations on encrypted data without decrypting it. The result of the computation remains encrypted and can only be read by the owner of the secret key.
- Approximate FHE (CKKS): The paper focuses on the Cheon-Kim-Kim-Song (CKKS) scheme, a popular FHE scheme designed for computations on real or complex numbers. It is an "approximate" scheme because small errors are introduced and accumulate with each operation. This is well-suited for applications like machine learning where perfect precision is not required.
- Ring Learning With Errors (RLWE): The security of CKKS and many other FHE schemes is based on the presumed hardness of the RLWE problem. In this context, data is encoded as polynomials in a special algebraic structure called a ring (typically Z_q[X]/(X^N + 1)). An RLWE ciphertext is a pair of polynomials (a, b) that encrypts a message polynomial m under a secret key polynomial s via the relation b = a·s + m + e mod q, where q is a large integer modulus and e is a small error.
- Module Learning With Errors (MLWE): A generalization of RLWE where the secret key and the a-part of the ciphertext are vectors of polynomials. This provides more flexibility in structuring ciphertexts.
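To make the RLWE relation concrete, here is a minimal numpy sketch of encryption and decryption in the ring Z_q[X]/(X^N + 1). The parameters are toy values (not secure), and the convention b = a·s + m + e mod q is one common choice:

```python
import numpy as np

N, q = 8, 2**16          # toy ring degree and modulus (illustrative only)
rng = np.random.default_rng(0)

def ring_mul(f, g):
    """Product of two polynomials modulo X^N + 1 and q (negacyclic convolution)."""
    res = np.zeros(2 * N, dtype=np.int64)
    for i in range(N):
        res[i:i + N] += f[i] * g
    return (res[:N] - res[N:]) % q   # X^N = -1 folds the upper half with a sign flip

s = rng.integers(-1, 2, N)               # small ternary secret key
m = rng.integers(0, 16, N) * 256         # message scaled above the noise
e = rng.integers(-2, 3, N)               # small error
a = rng.integers(0, q, N)                # uniform a-part
b = (ring_mul(a, s) + m + e) % q         # b = a*s + m + e  (mod q)

# Decryption recovers m up to the small error e.
m_dec = (b - ring_mul(a, s)) % q
assert np.all(np.abs(((m_dec - m + q // 2) % q) - q // 2) <= 2)
```

The final assertion checks that the centered difference between the decryption and the message is exactly the small error e.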
- Slots vs. Coefficients Encoding: In CKKS, a vector of complex numbers can be packed into a single polynomial in two main ways.
- Slot Encoding: The numbers are encoded into the "slots" of the ciphertext, which correspond to evaluations of the polynomial at specific complex roots of unity. This allows for efficient SIMD (Single Instruction, Multiple Data) operations.
- Coefficient Encoding: The numbers are directly encoded as the coefficients of the plaintext polynomial. This is less common for general computation but is a critical intermediate step in bootstrapping.
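The SIMD property behind slot encoding can be illustrated without any encryption: evaluating polynomials at the roots of X^N + 1 turns polynomial multiplication modulo X^N + 1 into an elementwise product of the evaluation ("slot") values. A small numpy check, simplified relative to the real CKKS encoding (which also involves a scaling factor and conjugate symmetry):

```python
import numpy as np

N = 8
rng = np.random.default_rng(1)
roots = np.exp(1j * np.pi * (2 * np.arange(N) + 1) / N)   # the N complex roots of X^N + 1

def negacyclic_mul(f, g):
    """Polynomial product modulo X^N + 1 (coefficients, lowest degree first)."""
    res = np.zeros(2 * N)
    for i in range(N):
        res[i:i + N] += f[i] * g
    return res[:N] - res[N:]

def to_slots(f):
    """'Slot' values: evaluations of f at the roots of X^N + 1."""
    return np.polyval(f[::-1], roots)    # polyval expects highest-degree coefficient first

f = rng.standard_normal(N)
g = rng.standard_normal(N)

# Multiplying polynomials mod X^N + 1 multiplies their slot values pointwise (SIMD).
lhs = to_slots(negacyclic_mul(f, g))
rhs = to_slots(f) * to_slots(g)
assert np.allclose(lhs, rhs)
```

This works because reducing modulo X^N + 1 does not change a polynomial's value at any root of X^N + 1.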
- Bootstrapping: A "re-encryption" procedure that takes a noisy ciphertext (one that has undergone many operations and whose modulus has become small) and converts it into a "fresh" ciphertext with less noise and a large modulus, encrypting the same underlying message. This is what makes FHE "fully" homomorphic, enabling unlimited computations. Key steps include Slots-to-Coeffs (S2C), ModRaise, Coeffs-to-Slots (C2S), and EvalMod.
- Previous Works:
- The paper builds upon a lineage of work on matrix multiplication in FHE. A notable prior method from Halevi and Shoup (2014) [24] was designed for ciphertext-ciphertext multiplication but can be adapted for PC-MM. However, it requires expensive homomorphic operations to move data within ciphertext slots, hindering performance.
- The most direct predecessor is a method by Lou and Yao (2022) [26], which this paper refers to as the "LZ algorithm". This method was the first to reduce PC-MM to two PP-MMs, but it was limited to cases where the matrix dimension d was greater than or equal to the RLWE ring degree N.
- Differentiation: This paper significantly advances the state of the art by:
- Generalizing the LZ algorithm: It provides solutions for both d >= N and d < N, offering much greater flexibility.
- Improving Efficiency: By using the shared-a format, it reduces the two PP-MMs of the LZ algorithm to one large and one small PP-MM, or even a single PP-MM in the online phase with precomputation.
- Leveraging Standard Hardware: The reduction to floating-point PP-MM is a major practical contribution, allowing FHE applications to directly benefit from decades of optimization in numerical computing libraries.
- Fusing with Bootstrapping: The MaMBo algorithm is a novel paradigm where the expensive bootstrapping process is made to perform useful work (PC-MM), greatly improving the throughput of FHE applications.
4. Methodology (Core Technology & Implementation)
The core technical contribution is a set of methods to transform a homomorphic PC-MM into one or more non-homomorphic PP-MMs.
- Principles: The fundamental idea is to express the RLWE decryption equation in matrix form and then left-multiply by the public plaintext matrix U. Stacking the coefficient vectors of d ciphertexts as the rows of matrices A (the a-parts) and B (the b-parts), an RLWE encryption of a matrix M can be seen as a matrix equation: M ≈ B − A·S mod q, where S is a structured (negacyclic) matrix derived from the secret key s. Left-multiplying by the public matrix U gives: U·M ≈ U·B − (U·A)·S mod q. The new pair of matrices, U·A and U·B, forms the ciphertexts for the desired result U·M. The computationally heavy part is computing U·A and U·B, which are standard PP-MMs.
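A toy numpy sketch of this reduction, under the assumed convention that ciphertext coefficient vectors are stacked as rows and M ≈ B − A·S (mod q). The parameters are illustrative only, not the paper's implementation:

```python
import numpy as np

N, q, d = 8, 2**20, 4
rng = np.random.default_rng(2)

def neg_mat(s):
    """Negacyclic matrix of s: row vector a times neg_mat(s) gives coeffs of a(X)*s(X) mod X^N + 1."""
    S = np.zeros((N, N), dtype=np.int64)
    for i in range(N):
        for j in range(N):
            sign = 1 if i + j < N else -1
            S[i, (i + j) % N] += sign * s[j]
    return S

s = rng.integers(-1, 2, N)                      # small secret key
S = neg_mat(s)

# Stack d ciphertexts as rows: B = A @ S + M + E (mod q), i.e. M ≈ B - A @ S.
M = rng.integers(0, 16, (d, N)) * 4096          # messages scaled above the noise
E = rng.integers(-2, 3, (d, N))                 # small errors
A = rng.integers(0, q, (d, N))
B = (A @ S + M + E) % q

U = rng.integers(-3, 4, (d, d))                 # public plaintext matrix
A2, B2 = (U @ A) % q, (U @ B) % q               # the whole homomorphic PC-MM: two PP-MMs

# (A2, B2) decrypts (approximately) to U @ M.
dec = (B2 - A2 @ S) % q
err = ((dec - U @ M) % q + q // 2) % q - q // 2
assert np.max(np.abs(err)) <= 2 * np.abs(U).sum(axis=1).max()
```

The final check confirms the decryption error of the product ciphertexts is exactly U @ E, i.e. the noise grows only by the plaintext matrix's row sums.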
- Shared-a Ciphertext Format: This is a key enabler for the paper's efficiency gains.
- Standard RLWE: n messages are encrypted into n ciphertexts (a_i, b_i) under a single secret key s. This requires storing 2n polynomials.
- Shared-a RLWE: n messages are encrypted under n different secret keys s_1, ..., s_n but with a single, shared a-part: (a, b_1), ..., (a, b_n). This only requires storing n + 1 polynomials (a and the b_i's), a significant space saving.
- Format Conversion (FmtSwitch): The paper provides a mechanism to convert a batch of standard ciphertexts into the shared-a format using a special "format switching key". This allows existing CKKS ciphertexts to be used with the new, more efficient algorithms.
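A minimal numpy sketch of the shared-a idea, with toy parameters and errors omitted so that decryption is exact (conventions assumed as b_i = a·s_i + m_i mod q):

```python
import numpy as np

N, q, n = 8, 2**16, 4
rng = np.random.default_rng(3)

def ring_mul(f, g):
    """Product of two polynomials modulo X^N + 1 and q (negacyclic convolution)."""
    res = np.zeros(2 * N, dtype=np.int64)
    for i in range(N):
        res[i:i + N] += f[i] * g
    return (res[:N] - res[N:]) % q

secrets = rng.integers(-1, 2, (n, N))          # one small secret key per ciphertext
msgs = rng.integers(0, 16, (n, N))
a = rng.integers(0, q, N)                      # the single shared a-part
bs = [(ring_mul(a, secrets[i]) + msgs[i]) % q for i in range(n)]

# Storage: n + 1 polynomials (a plus the n b-parts) instead of 2n for standard RLWE.
shared_storage, standard_storage = 1 + n, 2 * n
assert shared_storage < standard_storage

# Each b_i decrypts under its own key s_i (errors omitted here, so recovery is exact).
for i in range(n):
    assert np.array_equal((bs[i] - ring_mul(a, secrets[i])) % q, msgs[i])
```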
- PC-MM Algorithms: The paper presents a family of algorithms, summarized in the transcribed table below.
Manual Transcription of Table 1: List of PC-MM algorithms for square matrices of dimension d. "sh.-a" refers to the shared-a format. Mod-PP-MM_{d,d,d'} (resp. FP-PP-MM_{d,d,d'}) stands for a PP-MM of dimensions d x d x d' over Z_q (resp. using floating-point arithmetic).

| Ctxt format | Reduces to | Reference |
| --- | --- | --- |
| MLWE | Mod-PP-MM_{d,d,d} + Mod-PP-MM_{d,d,N} | Alg. 2, p. 21 |
| sh.-a MLWE | Precomp. + Mod-PP-MM_{d,d,d} | Alg. 4, p. 38 |
| sh.-a MLWE | Precomp. + FP-PP-MM_{d,d,d} | Alg. 4, p. 38 + Sec. 5 |
| RLWE | 2 Mod-PP-MM_{d,d,d} | [26] |
| sh.-a RLWE | Mod-PP-MM_{d,d,d} + Mod-PP-MM_{d,d,N} | Alg. 1, p. 18 |
| sh.-a RLWE | Precomp. + Mod-PP-MM_{d,d,d} | Alg. 3, p. 37 |
| sh.-a RLWE | Precomp. + FP-PP-MM_{d,d,d} | Alg. 3, p. 37 + Sec. 5 |

- Without Precomputation: For large dimensions (d >= N), the algorithm uses the shared-a format to reduce the PC-MM to one d x d x d PP-MM and one smaller d x d x N PP-MM. This is an improvement over the [26] baseline, which requires two d x d x d PP-MMs. For small dimensions (d < N), an MLWE-based approach is used.
- With Precomputation: If the plaintext matrix U is known in advance, a precomputation step involving the secret keys can be performed. This reduces the online cost of PC-MM to a single PP-MM of dimension d x d x d. This is extremely efficient for applications where model weights are fixed.
- PC-MM with Floating-Point PP-MM: This is the paper's most practical optimization. The authors observe that in approximate FHE, the least significant bits of a ciphertext contain noise and can be discarded. Therefore, the modular PP-MM can be well approximated by a standard floating-point multiplication of U with the most significant bits of the ciphertext matrices, followed by a final modular reduction. This allows the use of highly optimized BLAS libraries like OpenBLAS for the most expensive part of the computation.
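The following numpy sketch illustrates the principle (it is not the paper's exact algorithm or error analysis): drop the k least significant bits, multiply in float64 (exact here because all intermediate integers stay below 2^53), reduce modulo q, and check that the result differs from the exact modular product only by a small bounded error:

```python
import numpy as np

d, q, k = 64, 2**40, 12                 # toy sizes; the k low bits are treated as noise
rng = np.random.default_rng(4)
U = rng.integers(-8, 9, (d, d))         # public plaintext matrix (small entries)
A = rng.integers(0, q, (d, d))          # one matrix of ciphertext coefficients

exact = (U @ A) % q                     # modular PP-MM: the target result

# Keep only the most significant bits, multiply in floating point, then reduce mod q.
A_msb = (A >> k).astype(np.float64)     # entries < 2^28, so float64 products are exact
prod = (U.astype(np.float64) @ A_msb) % (q >> k)
approx = prod.astype(np.int64) << k     # scale back to the original modulus

# The approximation differs from the exact product only by U @ (A mod 2^k).
err = ((exact - approx) % q + q // 2) % q - q // 2
assert np.max(np.abs(err)) < 8 * d * (1 << k)
```

Here the discarded low bits contribute an additive error of at most max|U| * d * 2^k, far below the modulus, which is why this substitution is harmless for approximate FHE.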
- MaMBo: Fused PC-MM and Batch Bootstrapping:
- Batch Bootstrapping: The S2C and C2S steps of CKKS bootstrapping consist of linear operations (additions, plaintext multiplications, rotations). The authors show that these operations are compatible with the shared-a format. When bootstrapping a batch of shared-a ciphertexts, the operations on the shared a-part are performed only once, and the operations on the distinct b-parts are batched. This amortizes the computational cost, making batch bootstrapping significantly faster than bootstrapping ciphertexts individually.
- Fused Pipeline (MaMBo): MaMBo cleverly integrates PC-MM into the bootstrapping process. The typical bootstrapping pipeline is S2C -> ModRaise -> C2S -> EvalMod. MaMBo modifies this to:
- Batch S2C: Convert the input ciphertexts (in slot encoding) to coefficient encoding. This is done in a batch using the shared-a format.
- PC-MM: The ciphertexts are now in coefficient format, which is exactly what the paper's new PC-MM algorithms require. The PC-MM is performed at this stage.
- Half-Bootstrapping: The rest of the bootstrapping pipeline (ModRaise, Batch-C2S, EvalMod) is executed on the result of the PC-MM. By performing the matrix multiplication "for free" during the mandatory refresh step, MaMBo achieves a remarkable efficiency boost for applications that interleave linear transformations and non-linear functions (which require bootstrapping).
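The compatibility of the shared-a format with linear operations can be made concrete: multiplying a batch of shared-a ciphertexts by a public plaintext polynomial requires processing the shared a-part only once. A toy numpy sketch (conventions assumed as b_i = a·s_i + m_i mod q, errors omitted for exactness):

```python
import numpy as np

N, q, n = 8, 2**16, 4
rng = np.random.default_rng(5)

def ring_mul(f, g):
    """Product of two polynomials modulo X^N + 1 and q (negacyclic convolution)."""
    res = np.zeros(2 * N, dtype=np.int64)
    for i in range(N):
        res[i:i + N] += f[i] * g
    return (res[:N] - res[N:]) % q

secrets = rng.integers(-1, 2, (n, N))
msgs = rng.integers(0, 16, (n, N))
a = rng.integers(0, q, N)                                  # shared a-part
bs = [(ring_mul(a, secrets[i]) + msgs[i]) % q for i in range(n)]

p = rng.integers(0, 8, N)                                  # public plaintext polynomial

# Plaintext multiplication on the whole batch:
ap = ring_mul(a, p)                                        # shared work, done once for the batch
bps = [ring_mul(bs[i], p) for i in range(n)]               # per-ciphertext work, batched

# Each (ap, bp_i) still decrypts to m_i * p under its own key s_i.
for i in range(n):
    dec = (bps[i] - ring_mul(ap, secrets[i])) % q
    assert np.array_equal(dec, ring_mul(msgs[i], p))
```

The same once-per-batch saving on the a-part is what makes the batched S2C/C2S linear maps cheaper than applying them to n independent ciphertexts.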
5. Experimental Setup
- Datasets: The experiments do not use public machine learning datasets. Instead, they are microbenchmarks designed to measure the performance of the cryptographic primitives. The inputs are matrices of various dimensions with random entries, which is standard practice for evaluating the performance of FHE operations.
- Evaluation Metrics: The primary metric is execution time, measured in seconds. The goal is to demonstrate the speedup of the proposed algorithms compared to baselines. The paper reports timings for individual components (e.g., PC-MM, batch bootstrapping) and the end-to-end MaMBo procedure.
- Baselines: The paper compares its results against:
- The state-of-the-art PC-MM method from [26].
- Standard CKKS bootstrapping (as implemented in the HEaaN library) without any matrix multiplication, to highlight the low overhead of MaMBo.
- Bootstrapping multiple ciphertexts independently, to show the benefit of the proposed batch bootstrapping.
6. Results & Analysis
- Core Results:
- PC-MM Performance: The new algorithms are shown to be significantly faster than previous work, especially when leveraging floating-point PP-MM via BLAS libraries. The precomputation-based methods offer the best online performance.
- Batch Bootstrapping Speedup: The shared-a format provides a substantial speedup for bootstrapping multiple ciphertexts at once. The cost is amortized effectively, making the per-ciphertext cost decrease as the batch size increases.
- MaMBo Efficiency: The flagship result is the performance of MaMBo. The abstract claims that, for sufficiently large matrix dimensions, MaMBo (which includes a PC-MM) is more than 40% faster than a standard CKKS bootstrapping procedure that performs no matrix multiplication at all. This demonstrates that the PC-MM can be performed at a "negative cost" by optimizing the overall computational flow.
- Ablations / Parameter Sensitivity: The paper's structure inherently presents an ablation study by comparing different algorithmic choices. The summary in Table 1 (transcribed in the Methodology section) clearly delineates the trade-offs:
- Precomputation vs. No Precomputation: Algorithms with precomputation have a much faster online phase (one PP-MM instead of two) but require the plaintext matrix to be known in advance.
- Dimension Flexibility: The paper provides distinct algorithms for d < N and d >= N, demonstrating the flexibility of the framework.
- Modular vs. Floating-Point PP-MM: The move to floating-point PP-MM provides a major speedup, confirming that leveraging optimized numerical libraries is a winning strategy.
7. Conclusion & Reflections
- Conclusion Summary: The paper makes a significant contribution to the practicality of FHE. It introduces a family of fast and flexible algorithms for PC-MM, a critical bottleneck in privacy-preserving machine learning. By reducing PC-MM to standard, optimizable PP-MM and introducing the MaMBo framework to fuse this operation with bootstrapping, the authors have substantially lowered the computational barrier for complex FHE applications like LLM inference.
- Limitations & Future Work: (Based on the provided text and general knowledge of the field)
Limitations & Future Work: (Based on the provided text and general knowledge of the field)
- Memory Overhead: The format-switching and key-switching keys, especially for large batches or large dimensions, can be very large, posing a memory challenge. The paper notes that some of these key materials consist of a large number of ring elements.
- Parameter Selection: The use of floating-point approximations and the noise increase from the matrix multiplication require very careful parameter selection to ensure both correctness (sufficient precision) and security.
- Client-side Cost: While the server-side (online) cost is heavily optimized, some algorithms shift complexity to the precomputation or client side (e.g., the secret-key-dependent precomputation). This trade-off must be considered in a full application context.
- Personal Insights & Critique:
- Practical Engineering at its Best: The most impressive aspect of this work is its focus on practical performance. The insight to replace a modular matrix multiplication with a floating-point one is a brilliant piece of engineering that bridges the often-separate worlds of theoretical cryptography and high-performance computing.
- A Powerful Paradigm: The MaMBo concept exemplifies the idea of "programmable bootstrapping", where the expensive but necessary process of noise reduction is co-opted to perform other useful computations. This is a powerful paradigm that could be extended to other functions beyond matrix multiplication.
- Unlocking Private AI: This work directly addresses a major roadblock for private LLM inference. By making the ubiquitous matrix-vector and matrix-matrix multiplications significantly faster, this research brings the deployment of large-scale, privacy-preserving AI services a step closer to reality.
- Clever Repurposing of Ideas: The use of the shared-a format, historically a tool for security proofs, as a computational primitive for efficiency is a clever and insightful cross-pollination of ideas within the field of lattice-based cryptography.