CipherGPT: Secure Two-Party GPT Inference
TL;DR Summary
CipherGPT is a framework addressing user privacy in GPT inference. It innovatively optimizes secure matrix multiplication and GELU computation, achieving significant performance gains. Additionally, it introduces a secure top-k sampling protocol, providing comprehensive benchmark
Abstract
ChatGPT is recognized as a significant revolution in the field of artificial intelligence, but it raises serious concerns regarding user privacy, as the data submitted by users may contain sensitive information. Existing solutions for secure inference face significant challenges in supporting GPT-like models due to the enormous number of model parameters and complex activation functions. In this paper, we develop CipherGPT, the first framework for secure two-party GPT inference, building upon a series of innovative protocols. First, we propose a secure matrix multiplication that is customized for GPT inference, achieving up to 6.2× speedup and 4.1× bandwidth reduction over SOTA. We also propose a novel protocol for securely computing GELU, surpassing SOTA by 1.8× in runtime, 2.5× in communication and 7.4× in precision. Furthermore, we come up with the first protocol for secure top-k sampling. We provide a full-fledged implementation and comprehensive benchmark for CipherGPT. In particular, we measure the runtime and communication for each individual operation, along with their corresponding proportions. We believe this can serve as a reference for future research in this area.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is secure two-party inference for Generative Pre-trained Transformer (GPT) models, specifically focusing on addressing user privacy concerns when interacting with such models. The title, CipherGPT: Secure Two-Party GPT Inference, clearly reflects this focus.
1.2. Authors
The authors and their affiliations are:
-
Xiaoyang Hou (Zhejiang University)
-
Jian Liu (Zhejiang University)
-
Jingyu Li (Zhejiang University)
-
Yuhan Li (Zhejiang University)
-
Wen-jie Lu (Ant Group)
-
Cheng Hong (Ant Group)
-
Kui Ren (Zhejiang University)
The authors are affiliated with Zhejiang University, a prominent research institution in China, and Ant Group, a large financial technology company, suggesting a collaboration between academia and industry. This blend of affiliations often indicates research with strong theoretical foundations and practical applicability.
1.3. Journal/Conference
The paper does not explicitly state the journal or conference of publication. Based on its structure and content, it appears to be a research paper submitted to or presented at a peer-reviewed conference in the fields of cryptography, privacy-preserving machine learning, or computer security.
1.4. Publication Year
Based on the content, which references multiple papers from 2022 and 2023 and discusses ChatGPT (which gained widespread public attention in late 2022), the likely publication year is 2023.
1.5. Abstract
ChatGPT, a significant advancement in AI, raises serious user privacy concerns due to the potential submission of sensitive data. Existing secure inference solutions struggle with GPT-like models because of their massive parameter counts and complex activation functions. This paper introduces CipherGPT, the first framework for secure two-party GPT inference, built upon several novel protocols. The authors propose a customized secure matrix multiplication for GPT, achieving up to 6.2x speedup and 4.1x bandwidth reduction over state-of-the-art (SOTA) methods. They also present a new protocol for securely computing GELU, outperforming SOTA by 1.8x in runtime, 2.5x in communication, and 7.4x in precision. Furthermore, CipherGPT introduces the first protocol for secure top-k sampling. The paper provides a full implementation and a comprehensive benchmark, detailing the runtime and communication for individual operations and their proportions, intending it as a reference for future research.
1.6. Original Source Link
The provided link is /files/papers/692f0055713363b9c83c81fc/paper.pdf. This appears to be a local file path or an internal identifier for a PDF, rather than a publicly accessible URL. Its publication status (e.g., officially published, preprint) is not specified.
2. Executive Summary
2.1. Background & Motivation
The widespread adoption of large language models (LLMs) like ChatGPT marks a revolution in artificial intelligence, offering capabilities from question answering to content generation. However, the paradigm of interacting with these models through online services or APIs necessitates users submitting prompts or messages, which often contain sensitive personal or proprietary information. This poses a significant user privacy risk, potentially restricting the deployment of LLMs in privacy-critical scenarios (e.g., healthcare, legal, finance).
Current secure inference solutions, which aim to perform computations on encrypted data while preserving privacy for both the user's input and the model's parameters, face substantial challenges when applied to GPT-like models. These challenges stem from:
-
Enormous Model Parameters: GPT models, such as GPT-2 (117 million parameters), involve a multitude of high-dimensional
matrix multiplications. -
Complex Activation Functions: The models utilize non-linear activation functions like
GELU(Gaussian Error Linear Unit), which are computationally expensive to secure using traditional cryptographic primitives. -
Generative Tasks: Unlike discriminative tasks (e.g., classification), generative LLMs require
repeated inferencesto produce output word-by-word, and they often employrandom sampling mechanismsto ensure creativity and diversity, which are difficult to secure efficiently. -
Limitations of Prior Work: Existing secure inference protocols, such as
Iron, primarily focus on transformer-based models for non-generative tasks (e.g., BERT inference). Others likeBoltandBumbleBeeoptimizematrix multiplicationbut may introduce higher computational complexity. State-of-the-art approaches forGELUapproximation use high-degree polynomials, often sacrificing accuracy for efficiency. Prior work onrandom samplingeither relies on extremely heavyGarbled Circuitsor simplifies the problem by selecting the highest-scoring word, diminishing utility.The paper's entry point is to directly address these specific challenges by designing a dedicated framework and innovative cryptographic protocols optimized for the unique characteristics of
GPTinference, thereby enabling practicalsecure two-party GPT inferencefor the first time.
2.2. Main Contributions / Findings
The paper's primary contributions are:
-
First Secure Two-Party GPT Inference Framework (
CipherGPT): DevelopingCipherGPT, the first framework designed specifically for secure two-party inference ofGPT-like models. -
Customized Secure Matrix Multiplication: Proposing a novel
secure matrix multiplicationprotocol tailored forGPT's autoregressive nature, leveragingsVOLE. This achieves up to 6.2x speedup and 4.1x bandwidth reduction over state-of-the-art (SOTA) methods. -
Novel Secure GELU Protocol: Introducing a new protocol for securely computing the
GELUactivation function. This spline-based approximation, usingLUTsand secret-shared linear functions, surpasses SOTA by 1.8x in runtime, 2.5x in communication, and achieves 7.4x higher precision. -
First Secure Top-K Sampling Protocol: Presenting the first protocol for
secure top-k sampling, which is crucial for enabling creative and diverse text generation in a privacy-preserving manner. -
Comprehensive Implementation and Benchmark: Providing a full-fledged implementation of
CipherGPTand a detailed benchmark. This benchmark measures the runtime and communication costs for each individual operation and their respective proportions, serving as a valuable reference for future research in this area.The key findings demonstrate that despite the inherent complexity,
CipherGPTsignificantly advances the feasibility of secureGPTinference by developing highly optimized cryptographic primitives for its core operations. While current costs remain high, the paper's quantitative analysis highlights bottlenecks and potential areas for future improvements, moving closer to practical deployment.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand CipherGPT, a grasp of several fundamental cryptographic and machine learning concepts is essential:
-
Generative Pre-trained Transformer (GPT):
GPTis a type oflarge language model (LLM)based on theTransformerarchitecture.GPTmodels are "generative" because they can produce new text, and "pre-trained" because they are initially trained on vast amounts of text data to learn language patterns. They are "autoregressive," meaning they generate text word by word, using previously generated words as context for the next prediction. TheGPTarchitecture (illustrated in Figure 2 from the original paper) primarily consists of:Embedding Layer: Maps input words into numerical vector representations (word embeddings) and adds positional information (position embeddings).Transformer Decoders: Multiple identical layers (with different weights) that process the embedded sequence. Each decoder typically includes:Masked Self-Attention: Allows the model to weigh the importance of different words in the input sequence when processing each word, but only considers prior words to prevent "cheating" during training/inference of generative tasks.Feed-Forward Neural Network: Standard neural network layers applied independently to each position.Layer Normalization: A technique to normalize the activations of previous layers, improving training stability and performance.
Vec2word Layer: The final layer that maps the output of the last transformer decoder to a probability distribution over the vocabulary, from which a response word is selected.
-
Secure Inference: A
two-party cryptographic protocolwhere two entities—aclient (C)with a private input (e.g., a prompt) and aserver (S)with a private model (e.g., aGPTmodel)—collaborate to perform model inference without either party fully revealing their private data to the other.- Client's privacy: learns nothing about 's input beyond its length and the output length.
- Server's privacy: learns nothing about 's model beyond its architecture and the final inference result.
- This is achieved by computing on encrypted or secret-shared data.
-
Threat Model (Semi-honest Adversary): This paper assumes a
semi-honest adversary(also known as "honest-but-curious"). A party acting as a semi-honest adversary follows the protocol correctly but attempts to extract as much information as possible from the data they observe during the protocol execution. This is a common and practical security assumption in cryptographic protocols. denotes the computational security parameter. -
Secret Sharing: A cryptographic technique that splits a secret value into multiple shares, , such that:
- No single share reveals any information about .
- Combining a sufficient number of shares (e.g., two shares in a 2-out-of-2 scheme) allows reconstruction of .
- 2-out-of-2 Additive Secret Sharing: Used in this paper over
power-of-2 rings(). For a secret , it is split into two shares (for ) and (for ) such that . Each party holds one share.
-
Oblivious Transfer (OT): A fundamental cryptographic primitive where a
senderhas multiple messages and areceiverwants to choose one message without thesenderknowing which one was chosen, and without thereceiverlearning anything about the other messages.- : The ideal functionality for
Oblivious Transfer. - 1-out-of-M OT (): The sender has messages, and the receiver chooses one.
- Random OT (rOT): A variant where the messages and choices are randomly sampled.
Ferretprovides efficient protocols for generating large batches ofrOTs.
- : The ideal functionality for
-
Homomorphic Encryption (HE): An encryption scheme that allows computations to be performed directly on encrypted data without decrypting it first. The results of these computations remain encrypted and, when decrypted, are the same as if the operations were performed on the unencrypted data.
- Fully Homomorphic Encryption (FHE): Supports arbitrary computations, but usually
leveled FHEis used in practice, meaning operations can only be performed a limited number of times beforebootstrapping(an expensive refresh operation) is needed. - RLWE (Ring Learning With Errors): A common mathematical problem on which many efficient
FHEschemes (likeBFV,CKKS) are based. Plaintexts are encoded as polynomials in a quotient ring (e.g., ), and ciphertexts consist of two polynomials in a larger ring. - Additively Homomorphic Encryption (AHE): A type of
HEthat supports addition operations on ciphertexts.
- Fully Homomorphic Encryption (FHE): Supports arbitrary computations, but usually
-
Vector Oblivious Linear Evaluation (VOLE) and Subfield VOLE (sVOLE):
VOLEis a two-party functionality where asenderwith input and areceiverinteract. Thesenderlearns a vector of length , and thereceiverlearns vectors , both of length , such that .sVOLE(subfieldVOLE) is a generalization where (a subfield of ), and the resulting correlation is where and . It's more cost-effective for computing unbalancedMatrixMulswhen . It can be adapted to work over finite rings like .
-
Batch Oblivious Linear Evaluation (BOLE): A two-party functionality that takes vectors from a
senderand from areceiverand generates a correlation . Thereceiverlearns and thesenderlearns . This can be interpreted as generatingBeaver triplesfor secure multiplication. -
Ideal Functionalities: The paper uses notation like , , , , , , to represent
ideal functionalities. An ideal functionality defines the desired secure computation in an abstract way, assuming a trusted third party. Cryptographic protocols then aim to securely realize these ideal functionalities in a real-world setting without a trusted party.- : Secure Multiplication (element-wise or matrix).
- : Secure Comparison ( if , otherwise).
- : Secure Multiplexer (output if ,
0if ). - : Secure Truncation (right-shift by bits, , keeping bits).
- : Secure Truncate-then-Reduce (right-shift by bits, , reducing to
l-sbits). - : Secure Lookup Table (retrieve
T[i]given ). - : Secure Shuffle (permute a vector according to a secret permutation ).
3.2. Previous Works
The paper contextualizes CipherGPT against existing secure inference solutions, highlighting their limitations for GPT-like models.
- Early Secure Inference: Efforts date back to the early 2010s (
[39],[7],[55]) for simpler ML algorithms like SVMs and linear regression. - CryptoNets (
[22]): This was an initial endeavor insecure neural network inferencerelying solely onFHE. Its main limitation was supporting only linear operations and low-degree polynomials, suitable for networks with few layers. - MiniONN (
[34]): The first work to customize2PC protocolsfor neural network inference. It introduced aspline-based approximationfor non-linear operations, whichCipherGPTdraws inspiration from for itssecure GELUprotocol. - SIMD vs. Coefficient Packing for Matrix Multiplication:
GAZELLE [31]: Reduced cost of linear layers by mapping them toSIMD-based matrix-vector multiplication.SIMD(Single Instruction, Multiple Data) allows batching multiple elements into oneRLWE ciphertextfor parallel element-wise operations but requires expensivehomomorphic rotationsfor summation. It often operates over prime fields, necessitating conversions for rings.Cheetah [29]: SubstitutedSIMDwithcoefficient packing, eliminating expensiveSIMD rotationsand directly computing over rings.Iron [27]: Further reduced communication complexity compared toCheetah.Bolt [40]andBumbleBee [30]: Optimizedmatrix multiplicationin communication but introduced more computational complexity, often involving expensive rotations or conversions between prime fields and rings. They focused on transformer-based models for non-generating tasks (likeBERT).
- Secure Activations (GELU, Sigmoid, Tanh):
CrypTFlow2 [44]: Provided efficient protocols forsecure comparisonanddivision.SIRNN [43]: Offered crypto-friendly approximations for math functions likeexponential,sigmoid,tanh, andreciprocal square root, along with corresponding2PC implementations.SIRNNandIrontypically employLookup Tables (LUTs)to approximate and its reciprocal, a multi-step process that can accumulate precision errors.Bolt [40]andBumbleBee [30]: State-of-the-art approaches forGELUthat split the curve into several parts and use high-degree polynomials for approximation within each part. This involves multipleMultiplicationandTruncationoperations, and comparisons for selecting the correct part, which can affect both efficiency and accuracy.
- Secure Sorting and Sampling:
Bitonic sorting network [28]: A commonly used data-independent sorting algorithm in secure computation, but generally heavy in communication.- Prior works for
random samplingoften usedGarbled Circuits [58], which are computationally and communicatively expensive, or simplified by selecting the highest-scoring word, which sacrifices the LLM's utility.Chengkun Wei et al. [54]explored securely sampling discrete Gaussian noise forMPC Differential Privacy, which differs fromCipherGPT's goal of sampling from secret-shared probabilities.
- Crypto-friendly Model Structures: Some solutions (
DeepSecure [48],XONN [46],Quotient [2]) modify the neural network architecture (e.g., binarized neural networks) to be more compatible with cryptographic primitives, often requiring model retraining.Delphi [37]usesneural architecture searchto find performance-accuracy trade-offs.CipherGPTaims to avoid retraining. - GPU Acceleration: Solutions like
GForce [38]leverageGPU parallelismfor the online phase but don't address the expensive preprocessing. - Multi-Party (3PC) Settings: Some works (
[47],[52],[5]) explore three-party settings, assuming non-colluding servers. While potentially more efficient, this assumption is often considered less practical than the two-party model.
3.3. Technological Evolution
The evolution of secure inference has progressed from simple machine learning models to complex deep neural networks (DNNs) and now to large language models (LLMs).
- Early Works (2010s): Focused on basic ML models like SVMs and linear regression, often using generic
multi-party computation (MPC)orhomomorphic encryption (HE)primitives. - CryptoNets Era (Mid-2010s): Demonstrated the feasibility of
HE-only inferencefor simpleCNNs, but highlighted the limitations ofHEfor non-linear operations and deep architectures. - Hybrid Approaches (Late 2010s): The emergence of
2PC frameworkslikeMiniONN,GAZELLE,Cheetah, andIroncombined the strengths ofHE(for linear operations) andsecret sharingorGarbled Circuits(for non-linear operations). This enabled more complexDNNs, includingtransformersfor discriminative tasks (BERT). Significant efforts were made to optimizematrix multiplication(e.g.,SIMDvs.coefficient packing) and non-linear activations (e.g.,lookup tables, polynomial approximations). - LLM Era (Early 2020s): With the rise of
GPTmodels, new challenges surfaced:-
Scale: Handling hundreds of millions or billions of parameters efficiently.
-
Generative Nature: Supporting
autoregressive inferenceandrandom samplingfor diverse outputs, which were not primary concerns in discriminativeDNNinference. -
Specific Activations: Efficiently securing
GELU. -
Top-K Selection: A crucial component for vocabulary management in
LLMs.CipherGPTrepresents a significant step in this evolution, specifically addressing the unique demands ofGPT-like models, moving beyond generictransformerinference to tackle the fullgenerative LLMworkflow, including sophisticated sampling and customizedmatrix multiplicationfor autoregressive tasks.
-
3.4. Differentiation Analysis
CipherGPT differentiates itself from existing secure inference methods primarily by its tailored optimizations for the specific characteristics of GPT-like models, particularly their autoregressive nature and the need for creative sampling.
-
Matrix Multiplication (MatrixMul):
- Previous SOTA (
Cheetah,Iron,BumbleBee,Bolt): TheseRLWE-based HEapproaches (usingSIMD slotsorpolynomial coefficients) handleMatrixMulby performing operations on encrypted data.BoltandBumbleBeeintroduce complex techniques likehomomorphic rotationsorSIMD rotationsto save communication, often at the cost of computation, andBoltrequires secure conversion between prime fields and rings. - CipherGPT's Innovation:
CipherGPTleverages the observation thatGPT's autoregressive generation repeatedly uses the same model weights () with different inputs (). Itbatchesthese multipleMatrixMuloperations into a single, large,unbalanced MatrixMulduring thepreprocessing phase. This is then efficiently processed usingsVOLE(subfield Vector Oblivious Linear Evaluation). - Differentiation: By amortizing the cost over many
MatrixMulinstances ( iterations),CipherGPTachieves significant speedup (up to 6.2x) and bandwidth reduction (4.1x) compared toHE-basedmethods which might process eachMatrixMulindependently or with less efficient batching for this specific use case. The communication cost ofsVOLEis almost independent of (input vector length), providing greater savings for larger .
- Previous SOTA (
-
GELU Activation Function:
- Previous SOTA (
SIRNN,Iron,BumbleBee,Bolt):SIRNNandIronuse multipleLookup Tables (LUTs)to approximate parts like and reciprocals, which involves a multi-step process, leading to accumulated precision errors.BumbleBeeandBoltsplitGELUinto several parts and approximate each part with high-degree polynomials. This requires numerousmultiplication-then-truncationoperations andcomparisonsfor selecting the correct polynomial, often sacrificing accuracy for efficiency due to polynomial degree and limited split parts.
- CipherGPT's Innovation:
CipherGPTadopts aspline-based approximationusing only linear functions () within several small intervals of theGELUcurve. It simplifies the curve into three main parts (, , ), then shifts the curve to simplify the middle part to . A singleLUTis used to find the correct interval and retrieve the linear function's coefficients, followed by a singlemultiplication-then-truncation. - Differentiation: This approach drastically reduces the number of cryptographic primitives (fewer
Mult,Trunc,CMP,MUXoperations) compared to polynomial-based methods and avoids error accumulation from multi-step approximations. It leads to superior precision (7.4x better maximal ULP error) while being faster (1.8x runtime speedup) and more communication-efficient (2.5x reduction).
- Previous SOTA (
-
Top-K Selection:
- Previous SOTA (Implicit/General): Often, general
secure sorting algorithmslikeBitonic sorting network([28]) would be used, which are very heavy. - CipherGPT's Innovation: Introduces a novel protocol for
secure top-K selectionbased on a modifiedquicksortalgorithm. It firstsecurely shufflesthe input vector (to prevent information leakage during comparisons) and then applies a comparison-based selection. Critically, it only recursively processes partitions containing the top-K elements, reducing comparisons from to . - Differentiation: This is presented as the first protocol for
secure top-K samplingin this context, offering significant speedup (8.8x) and communication reduction (14.8x) compared to genericsecure sorting networks.
- Previous SOTA (Implicit/General): Often, general
-
Secure Sampling:
-
Previous SOTA: Typically relied on computationally heavy
Garbled Circuits [58]or simplified to selecting the highest score, losing model utility. -
CipherGPT's Innovation: Provides a specific protocol for
securely samplingan element from a vector based on secret-shared probabilities. It uses aserver-sampled random numberandK-1 secure comparisonsandK multiplexersto find the correct index based on cumulative probability sums. It also includes a mechanism to securely map the sampled index back to the original (unshuffled) word vector index without revealing intermediate information. -
Differentiation: This is claimed as the first exploration of secure sampling from secret-shared probabilities for
LLMgeneration, providing a practical cryptographic primitive for this crucialGPTfunctionality.In summary,
CipherGPTachieves its differentiation by moving beyond genericsecure inferenceto deeply understand and exploit the specific computational patterns and privacy requirements ofGPT's autoregressive, generative workflow, resulting in highly specialized and more efficient cryptographic protocols.
-
4. Methodology
4.1. Principles
The core idea behind CipherGPT is to enable secure two-party GPT inference by breaking down the complex GPT architecture into individual operations and designing highly optimized, privacy-preserving protocols for each, especially focusing on bottlenecks like matrix multiplication, non-linear activation (GELU), and generative mechanisms (Top-K selection, sampling). The theoretical basis relies on 2-out-of-2 additive secret sharing for general operations, homomorphic encryption (HE) for specific matrix operations, and vector oblivious linear evaluation (VOLE) for highly efficient batched matrix multiplication.
The key intuitions are:
- Amortized Matrix Multiplication:
GPT's autoregressive nature means the same model weights () are used repeatedly with different inputs () over many inference steps. Instead of securing eachmatrix multiplicationindependently, these can be batched together and processed usingsVOLEin a preprocessing phase, significantly reducing the amortized online cost. - Spline-based Activation: Complex non-linear functions like
GELUcan be accurately and efficiently approximated by splitting their curve into linear segments (splines). By leveraging a singlelookup table (LUT)andsecret-sharedlinear function computation, this avoids the multi-step approximations and high-degree polynomial computations used in prior works, improving both precision and efficiency. - Optimized Top-K Selection: Standard
secure sortingis too expensive. By firstsecurely shufflingthe input vector, subsequent comparisons forTop-Kselection can reveal relative order without revealing original values. A modifiedquicksortthat only processes relevant partitions further reduces computational cost. - Secure Probabilistic Sampling: To enable diverse text generation, a word must be sampled from a secret-shared probability distribution. This can be done by comparing a
randomly sampled valueagainst cumulative sums of probabilities in asecret-sharedmanner, followed by carefully constructedmultiplexers.
4.2. Core Methodology In-depth (Layer by Layer)
CipherGPT implements the GPT inference workflow, shown in Figure 2 of the original paper, by securing each component.
4.2.1. VOLE-based Matrix Multiplication
MatrixMul () is a fundamental operation in GPT. For autoregressive generation, GPT takes a sentence as input, generates a response word, then adds that word to the input sentence, and repeats the process. This means MatrixMul operations are performed repeatedly, each time with a new input matrix but the same weight matrix . CipherGPT exploits this by batching multiple MatrixMul operations into a single unbalanced MatrixMul to reduce amortized cost.
Let and . The result is . The calculation can be expressed as a sum of outer products: $ \mathbf{Z} = \sum_{i=1}^{m} (\mathbf{x}_i \otimes \mathbf{y}_i') $ where (columns of ) and (rows of ).
Suppose and need to generate response words, leading to input matrices: .
Each .
The key idea is to concatenate the corresponding columns of all matrices:
for each .
Then, the outer product of this concatenated vector with becomes:
$
\mathbf{x}_i' \otimes \mathbf{y}i' = (\mathbf{x}{1,i} \otimes \mathbf{y}i') || (\mathbf{x}{2,i} \otimes \mathbf{y}i') || \dots || (\mathbf{x}{t,i} \otimes \mathbf{y}i')
$
Summing these outer products gives the concatenated results of all MatrixMul operations:
$
\sum{i=1}^{m} (\mathbf{x}_i' \otimes \mathbf{y}_i') = \mathbf{Z}_1 || \mathbf{Z}_2 || \dots || \mathbf{Z}_t
$
The sVOLE protocol is used to implement this batched MatrixMul.
-
Preprocessing Phase: and generate
sVOLEcorrelations. For each : $ \mathbf{W}_i = \mathbf{u}_i \otimes \mathbf{y}_i' + \mathbf{V}_i $ Here, holds (a vector of length ) and . holds and .- Note: here denotes the
Kronecker product(outer product), which results in a matrix.
- Note: here denotes the
-
Online Phase: For an input matrix (for the -th response word):
- computes a masked version of its share of by subtracting the relevant portion of : $ \langle \mathbf{x}{j,i} \rangle_S := \mathbf{x}{j,i} - \mathbf{u}_i[(j-1)n+1, \dots, j \cdot n] \quad \forall i \in [1, m] $ then sends these masked shares to .
- receives and locally computes an outer product with its known : $ \langle \mathbf{x}_{j,i} \rangle_S \otimes \mathbf{y}i' = (\mathbf{x}{j,i} - \mathbf{u}_i[(j-1)n+1, \dots, j \cdot n]) \otimes \mathbf{y}i' \ = \mathbf{x}{j,i} \otimes \mathbf{y}_i' - \mathbf{u}_i[(j-1)n+1, \dots, j \cdot n] \otimes \mathbf{y}_i' $
- then uses its share and 's to reconstruct shares of . The rearranged equation for is:
$
\mathbf{x}_{j,i} \otimes \mathbf{y}i' = \langle \mathbf{x}{j,i} \rangle_S \otimes \mathbf{y}_i' + \mathbf{u}_i[(j-1)n+1, \dots, j \cdot n] \otimes \mathbf{y}_i'
$
From the
sVOLEcorrelation, we know that . So, and obtain shares:- holds:
- holds:
This means and now hold additive shares of . They can then locally sum these shares over to obtain secret shares of
\mathbf{Z}_j = \sum_{i=1}^m (\mathbf{x}_{j,i} \otimes \mathbf{y}_i').
4.2.2. Spline-based GELU
The GELU (Gaussian Error Linear Unit) activation function is used in GPT. Its formula is:
$
\mathsf { G E L U } ( x ) = 0 . 5 x ( 1 + \mathsf { Tanh } \left[ \sqrt { 2 / \pi } ( x + 0 . 0 4 4 7 1 5 x ^ { 3 } ) \right] )
$
The paper's secure GELU protocol (Algorithm 1) simplifies its secure computation.
The intuition (as shown in Figure 1 from the original paper) is to divide the GELU curve into three main parts:
-
:
-
: (approximated by splines)
-
:
To simplify the approximation in the middle part, the entire curve is effectively right-shifted by , transforming the interval to . This allows a single
lookup tableoperation.
Algorithm 1: Secure GELU
Input: & hold (secret shares of with bits), public value (for splitting), lookup table size .
Output: & get for .
- Scale : Let .
- Explanation: The input to the model is assumed to be scaled up by (e.g., floating points converted to integers with fractional bits). To maintain correct alignment for comparisons and arithmetic, the public
split valueis also scaled by .
- Explanation: The input to the model is assumed to be scaled up by (e.g., floating points converted to integers with fractional bits). To maintain correct alignment for comparisons and arithmetic, the public
- Shift Input: & (locally) compute .
- Explanation: This step performs the conceptual right-shift of the
GELUcurve. By adding to , the input is transformed to . This can be done locally because is a public value; each party adds to their respective share of , and the sum property ofadditive secret sharingis preserved.
- Explanation: This step performs the conceptual right-shift of the
- Define Range for Approximation: Let .
- Explanation: This defines the new upper bound for the approximation interval, which is now .
- Determine Bit-length for Interval Indexing: Let .
- Explanation: is the number of bits required to represent values within the range (assuming is a power of 2, or adjusted otherwise). This helps in extracting the relevant bits for
lookup tableindexing.
- Explanation: is the number of bits required to represent values within the range (assuming is a power of 2, or adjusted otherwise). This helps in extracting the relevant bits for
- Extract Lower Bits: & (locally) extract the lower bits of to get .
- Explanation: Since (under the initial assumption for the spline part), only the lower bits of are relevant for determining the interval. This operation can be done locally if is a power of 2, otherwise, secure
truncationis used.
- Explanation: Since (under the initial assumption for the spline part), only the lower bits of are relevant for determining the interval. This operation can be done locally if is a power of 2, otherwise, secure
- Find Interval Index: & invoke .
- Explanation: The
F_TR(Truncate-then-Reduce) ideal functionality is used. It takes the -bit value and right-shifts it byh-sbits, effectively extracting the most significant bits. These bits represent the index of the small interval that belongs to within . The output is secret-shared. - : Takes secret-shared of bits and a public shift amount . Returns secret-shared of
l-sbits, where . This functionality truncates the value by right-shifting and reduces the bit-length of the ring.
- Explanation: The
- Lookup Coefficients: & invoke .
- Explanation: holds a public table containing the
coefficients(, ) for each linear spline segment . Using the secret-shared index , and execute theF_LUT(Lookup Table) functionality to retrieve the -th entry of in a secret-shared form. - : Takes a public table and a secret-shared index . Returns secret-shared
T[i].
- Explanation: holds a public table containing the
- Compute term: & invoke .
- Explanation: They securely multiply the retrieved secret-shared coefficient with the shifted input using the
F_Mult(Multiplication) ideal functionality. This result is still at a higher scale ( bits if and are -scaled). - : Takes secret-shared of bits and secret-shared of bits. Returns secret-shared of bits.
- Explanation: They securely multiply the retrieved secret-shared coefficient with the shifted input using the
- Truncate Product: & invoke .
- Explanation: The result from the previous step is likely at a higher fixed-point scale. This step truncates the result by right-shifting bits to bring it back to the desired fixed-point representation.
- : Takes secret-shared of bits and a public shift amount . Returns secret-shared of bits, but the value is effectively scaled down.
- Add term: & (locally) compute .
- Explanation: The final step of the linear approximation is completed by adding the secret-shared coefficient to . This can be done locally. is the potential
GELUoutput assuming .
- Explanation: The final step of the linear approximation is completed by adding the secret-shared coefficient to . This can be done locally. is the potential
- Compare with : & invoke .
- Explanation: This compares the shifted input with (the upper bound of the spline approximation range). will be 1 if (i.e., ), and 0 otherwise.
- : Takes secret-shared and (both bits). Returns secret-shared bit if , otherwise.
- Compare with 0: & invoke .
- Explanation: This compares the shifted input with 0 (the lower bound of the spline approximation range). will be 1 if (i.e., ), and 0 otherwise.
- Multiplex for and 0: & invoke .
- Explanation: This
multiplexerselects between the approximated value and 0.- If (i.e., ), then and , so . The output becomes .
- If (i.e., ), then and , so . The output becomes 0.
- If (i.e., ), then and , so . The output becomes 0.
- : Takes secret-shared ( bits) and a secret-shared control bit . Returns secret-shared if , and if .
- Explanation: This
- Multiplex for and 0: & invoke .
- Explanation: This
multiplexerselects between the original input and 0.- If (i.e., ), then . The output becomes .
- Otherwise (), then . The output becomes 0.
- This handles the case where
GELU(x)should be .
- Explanation: This
- Combine results: & (locally) compute .
- Explanation: This final local addition combines the outputs from the two
multiplexersto produce the correctGELUresult:- If : .
- If : (spline approximation).
- If : .
- This correctly implements the three-part
GELUfunction.
- Explanation: This final local addition combines the outputs from the two
4.2.3. Secure Top-K Selection
In the vec2word layer, GPT generates a vector of probabilities for all possible words. From this, the top-K largest probabilities need to be selected. The paper proposes Algorithm 2: Secure TopK.
Algorithm 2: Secure TopK Input: & hold (secret shares of a vector of length ). Output: & get (secret shares of vector of length containing the largest values of ).
- Secure Shuffle: & invoke .
- Explanation: The input vector is first
securely shuffledusing theF_Shuffleideal functionality. This is crucial: subsequent comparisons will reveal the relative order of elements, but because the elements have been randomly permuted, these comparisons do not leak information about the original elements' values or their original positions. - : Takes a secret-shared vector and a secret-shared permutation . Returns . (The protocol implemented uses permutations chosen by each party separately).
- Assumption: Elements in are assumed to be distinct. This is typically achieved by appending a unique index (1 to ) to each element before shuffling and comparison, which is truncated after selection.
- Explanation: The input vector is first
- Select Top-K: .
- Explanation: The
selectfunction (Lines 3-21) is a modifiedquicksort-like algorithm that identifies thetop-Kelements from the shuffled vector .
- Explanation: The
Function select(, K):
3.
4. Base Case: If , return .
* Explanation: If the current sub-vector has or fewer elements, all of them are part of the top-K result, so no further selection is needed.
5. Choose Pivot: Let .
* Explanation: The last element of the current sub-vector is chosen as the pivot for partitioning.
6. Initialize Partitions: Let , .
7. Partitioning Loop: for to n-2 do
8. Secure Comparison: .
* Explanation: Each element (except the pivot) is securely compared with the pivot using F_CMP. is 1 if , and 0 otherwise.
9. Reveal Comparison Result: Reveal to and .
* Explanation: This is a critical step. The comparison result can be revealed because is a shuffled version of the original input. Knowing whether a shuffled element is greater or smaller than a shuffled pivot leaks no information about the original elements or their original positions.
10. Populate or : If , add to . Else, add to .
* Explanation: Based on the revealed comparison result, each element is deterministically placed into either (elements greater than or equal to pivot) or (elements smaller than pivot). This entire partitioning step can be performed locally by and after the comparison result is revealed.
11. end for
12. Add Pivot to : Add to .
* Explanation: The pivot element itself is considered to be in the "greater than or equal to" partition.
13. Determine Size of : Let .
14. Recursive Steps:
15. If : return .
* Explanation: If the size of is exactly , then these are precisely the K largest elements, and the selection is complete.
16. If : return .
* Explanation: If contains more than elements, the top-K elements must be within . The function recursively calls itself on to find the K largest within it.
17. If : return .
* Explanation: If contains fewer than elements, then all elements in are part of the top-K. The remaining elements must be found in . The function recursively calls itself on to find these additional elements.
This `select` function requires `CMPs` (secure comparisons) in expectation, a significant improvement over for full sorting.
4.2.4. Secure Sampling
After Top-K selection, a word must be sampled from the selected probabilities. Algorithm 3 describes this secure sampling protocol.
Algorithm 3: Secure Sampling
Input: & hold (secret shares of a vector of probabilities, scaled by ).
Output: & get (secret shares of an index , where \Pr(j=i) = x_i / \sum_{k=1}^K x_k).
The protocol is based on the idea that if a random value is sampled from , the selected index satisfies .
- Sample Random Value: samples with .
- Explanation: generates a random number within the range of the scaled probabilities. It's acceptable for to sample this value alone because the final output index remains secret from .
- Initialize Cumulative Sum: & (locally) initialize .
- Explanation: will store the cumulative sum of probabilities up to .
- Compute Cumulative Sums and Comparisons: for to
K-1do - & (locally) compute .
- Explanation: They locally compute the cumulative sum up to the -th probability.
- & invoke .
- Explanation: They securely compare the random value with the cumulative sum . is 1 if , and 0 otherwise.
- end for
- Explanation: After this loop, a secret-shared bit vector is obtained. This vector will have for all (where is the sampled index) and for all .
- Initialize Auxiliary Bits: & (locally) initialize and .
- Explanation: These boundary conditions simplify the next step.
- Derive Indicator Bits: for to do
- & (locally) compute .
- Explanation: This operation effectively identifies the unique index where the random value falls.
- If , then and . Their XOR will be 1.
- For any other , either both are 1 or both are 0, so their XOR will be 0. Thus, will be 1 only for the sampled index , and 0 for all other indices.
- Explanation: This operation effectively identifies the unique index where the random value falls.
- end for
- Compute Sampled Index: & compute .
- Explanation: A final
multiplexeroperation is performed for each index . If is 1 (meaning is the sampled index), it selects ; otherwise, it selects 0. Summing these results yields the sampled index .
- Explanation: A final
Mapping to Response Word:
The index (from Line 11 of Algorithm 3) corresponds to an element in the shuffled and Top-K selected vector. To get the actual word, this index needs to be mapped back to the original vocabulary index.
- Inverse Shuffle for S's permutation: The
Top-Kselection process reveals the index of the selected element in the shuffled vector. Let be the original index in the shuffled vector corresponding to the -th element in theTop-Koutput. The paper states computes and secret-shares . (Here, is the permutation chosen by during shuffling, as described inF_Shuffle). The final summation inAlgorithm 3(Line 11) is then modified to: $ \langle j \rangle := \sum_{i=1}^K \mathsf{F_{MUX}}(\langle i' \rangle, \langle b'_i \rangle^1) $ Now, does not know (because is secret) and does not know (because is secret to ). - Inverse Shuffle for C's permutation: Once is revealed to , computes , where is the permutation chosen by during shuffling. This is the correct index in the original word vector (vocabulary), allowing to retrieve the final response word.
4.2.5. CipherGPT Framework Integration (Section VII)
The CipherGPT framework integrates these protocols to secure the entire GPT inference pipeline:
The architecture and workflow of GPT (Figure 2 from the paper) shows the data flow through different layers:
The image "images/2.jpg" shows the architecture and workflow of GPT.
- Input Sequence: Text input from the user.
- Word Embeddings: Each word is converted into a numerical vector.
- Position Embeddings: Positional information is added to the word embeddings.
- Multiple Transformer Decoders: These are the core processing units, each consisting of:
Masked Self-AttentionFeed-Forward Neural NetworkLayer Normalization
- Vec2Word Layer: The final layer that outputs the predicted word.
A. Embedding
This layer maps input words to word embedding vectors and augments them with position embedding vectors.
- encrypts all rows of the
embedding matrixusingAHE(e.g.,RLWE ciphertextfor each row ). sends these ciphertexts to . This is a one-time preprocessing step. - (knowing its input words) identifies the corresponding ciphertexts. For each, adds a random vector to the ciphertext (which is
homomorphicallyadded) and sends back to . - decrypts to get . then locally adds the corresponding
position embedding vector: . - The
embedding vectorfor each word, , is now secret-shared: and .
B. Layer Normalization
Layer Normalization normalizes each element in an input vector according to:
$
x_i := \frac{x_i - \mathrm{E}[\mathbf{x}]}{\sqrt{\mathrm{Var}[\mathbf{x}] + \epsilon}} \cdot \gamma + \beta
$
where is the mean, is the variance, and are learnable parameters, and prevents division by zero.
The secure computation proceeds as:
- Compute
varianceterms: for each . - Compute
inverse square root: to compute . - Compute scaled difference: for .
- Apply
gamma: for . - Apply
betaandTruncation: and locally add , then run to reduce the scale to bits and truncate the width to bits.- Note: The paper mentions using for multiplication and performing
truncationonly once at the end for efficiency.
- Note: The paper mentions using for multiplication and performing
C. Masked Self-Attention
Self-attention involves computing Query (Q), Key (K), and Value (V) matrices, scoring, masking, softmax, and output generation.
- Q, K, V Matrices: , , . Here is the input, and are model weights.
- These
MatrixMulsuse the same weights (e.g., ) across autoregressive steps. Therefore,CipherGPTapplies itssVOLE-based MatrixMul(Section 4.2.1) here. - After
MatrixMul,F_Truncis used to maintain -bit scaling.
- These
- Multi-headed Attention: are partitioned into segments (attention heads): , each of size , where . This is a local operation.
- Score Matrix: for each head .
- Since and are not known beforehand (they are outputs of previous layers),
sVOLE-based MatrixMulis not suitable. Instead, theAHE-based MatrixMulproposed in[27](Iron) is used.
- Since and are not known beforehand (they are outputs of previous layers),
- Self-attention Masking: The upper triangle of each is zeroed out to prevent attending to future words. This is a local operation.
- Softmax: Applied to each row of each to normalize scores to sum to 1.
CipherGPTleveragesBumbleBee's [30]approach:- Normalize input row : .
- Bound values: Use and to set to 0 if .
- Approximate : Use the approximation implemented with and .
- Output Calculation: for each head .
- Again,
AHE-based MatrixMul [27]is used. - Results are reassembled: .
- Residual connection: (local addition).
- Again,
D. Feed Forward
This block involves Layer Normalization, two fully-connected (FC) layers, and the GELU activation.
- Layer Normalization: Applied as described in Section 4.2.5 B.
- First FC Layer: .
- , , .
- Since is a model weight,
CipherGPTapplies itssVOLE-based MatrixMul.
- GELU Activation: The
secure GELUprotocol (Algorithm 1) is applied element-wise to , producing . - Second FC Layer: .
- , , .
- Again,
sVOLE-based MatrixMulis used as is a model weight.
- Another residual connection and
Layer Normalizationfollows, continuing the decoder structure.
E. Vec2word
This final layer generates the predicted response word.
- Initial MatrixMul: .
- (the last row of the final transformer output ), (where is the vocabulary size, typically very large, e.g., 50257 in GPT-2), .
- Since (vocabulary size) is very large, the
sVOLE-based MatrixMulmight not be as efficient here. The paper states it usesAHE-based MatrixMul [27]. F_Truncis applied.
- Top-K Selection: .
- The
secure Top-Kprotocol (Algorithm 2) is applied to select the largest probability scores from , resulting in .
- The
- Temperature Scaling: Each value in is multiplied by a
temperature T(a hyperparameter held by ). This is done securely usingAHE:- encrypts its shares and sends to .
- adds its shares to ciphertexts, decrypts to , multiplies by , re-encrypts .
- adds random number to each ciphertext: , returns to .
- decrypts. New shares and .
F_Truncis applied.
- Softmax: Applied to to get a probability vector . (As described in Section 4.2.5 C).
- Random Sampling: .
- The
secure samplingprotocol (Algorithm 3), with the modifications for index mapping, is used to sample an index from .
- The
- Word Mapping: obtains the final sampled index and maps it back to the original word vector (vocabulary) index to retrieve the response word. This completes one autoregressive step.
5. Experimental Setup
5.1. Datasets
The paper primarily evaluates the accuracy loss of CipherGPT using the WikiText-103 dataset [36].
- Source and Characteristics:
WikiText-103is a large dataset of over 103 million words extracted from the set of verified "Good" and "Featured" articles on Wikipedia. It is designed for language modeling benchmarks, providing a diverse collection of high-quality text. - Scale: Contains a vocabulary of 267,735 unique words.
- Domain: General-purpose English text from Wikipedia articles.
- Usage: The authors randomly selected 10,000 sentences from
WikiText-103for their accuracy evaluation. This dataset is suitable because it represents typical text data thatGPTmodels are trained on and expected to generate, allowing for a realistic assessment of how the cryptographic transformations affect the model's output quality compared to the original model.
5.2. Evaluation Metrics
The paper employs several metrics to evaluate the performance and accuracy of CipherGPT and its components.
5.2.1. Runtime (s)
- Conceptual Definition: Measures the total time taken to execute a protocol or operation, encompassing both computation (CPU cycles) and communication (data transfer latency). It quantifies the efficiency in terms of speed.
- Mathematical Formula: Not a specific formula, but rather measured empirically as wall-clock time. $ \text{Runtime (s)} = \text{CPU Time} + \text{Communication Latency} $
- Symbol Explanation:
CPU Time: Time spent by processors performing computations.Communication Latency: Time spent waiting for data to be transferred between parties over the network.
5.2.2. Communication (MB)
- Conceptual Definition: Quantifies the total amount of data exchanged between
Client (C)andServer (S)during the execution of a protocol or operation. It reflects the network bandwidth consumption and is a critical factor for efficiency, especially in geographically distributed settings. - Mathematical Formula: Not a specific formula, but measured empirically as total bytes transferred. $ \text{Communication (MB)} = \sum_{k=1}^{\text{num_messages}} \text{size_of_message}_k \quad (\text{converted to MB}) $
- Symbol Explanation:
num_messages: The total number of messages exchanged.size_of_message}_k: The size in bytes of the -th message.
5.2.3. ULP Error (Units in the Last Place)
- Conceptual Definition: Measures the precision of a numerical approximation by quantifying the error in terms of
Units in the Last Place. It represents the number of possible floating-point values between the exact mathematical result and the approximated result. In the context ofCipherGPT, where floating-point numbers are scaled to integers, theULP errorsimplifies to the absolute difference between the exact and approximated integer values. - Mathematical Formula: For scaled integer values (exact) and (approximated), the ULP error is defined as: $ \text{ULP Error} = |y - \tilde{y}| $
- Symbol Explanation:
- : The exact (infinite precision) real result, after scaling to an integer.
- : The approximated result obtained from the secure protocol, after scaling to an integer.
- : Absolute value function.
- The paper reports both
maximal ULP error(the largest absolute difference found) andaverage ULP error(the mean of absolute differences over all test cases).
5.2.4. Accuracy (for Generated Text)
- Conceptual Definition: Assesses the quality of the text generated by
CipherGPTcompared to the originalGPTmodel. It measures how oftenCipherGPTproduces the exact same word as the original model and how close its "wrong" predictions are to the original's top choices. - Mathematical Formula: Two primary forms:
- Percentage of Identical Outputs: $ \text{Identical Output Accuracy} = \frac{\text{Number of identical words}}{\text{Total number of words sampled}} \times 100% $
- Top-k Accuracy (implicitly used as "falls within top-5"):
$
\text{Top-k Accuracy} = \frac{\text{Number of times original output is in CipherGPT's top-k list}}{\text{Total number of words sampled}} \times 100%
$
(In this paper, it's used for "wrong" outputs: how many of them were still in
GPT-original's top-5.)
- Symbol Explanation:
Number of identical words: Count of words whereCipherGPT's prediction matchesGPT-original's prediction.Total number of words sampled: The total number of words for which predictions were made (e.g., 10,000 sentences, with meaning 10,000 word predictions).Number of times original output is in CipherGPT's top-k list: How often the word predicted byGPT-originalwas among thetop-khighest-probability words predicted byCipherGPT. The paper reverses this by checking ifCipherGPT's "wrong" output was inGPT-original's top-5.
5.3. Baselines
CipherGPT is compared against several state-of-the-art and foundational secure inference solutions, as well as general cryptographic primitives:
-
For GELU:
Iron [27]: A secure inference framework that reduces communication complexity forHE-based matrix multiplicationand usesLUTsfor non-linear functions.Bolt [40]: A recent framework optimized fortransformers, usingSIMD slotsinHEformatrix multiplicationand high-degree polynomial approximation forGELU.Bolt+ [40], [30]: An improved version ofBoltwhere itsOT-based multiplicationis replaced withBumbleBee'smore efficientBOLEfor fair comparison.BumbleBee [30]: Another recent secure inference framework for largetransformers, known forciphertext compactingandhigh-degree polynomialapproximations for activations.
-
For Matrix Multiplication:
Cheetah [29]: A framework that usescoefficient packing(instead ofSIMD) forRLWE-based HEto eliminate expensive rotations.Iron [27]: (See above)BumbleBee [30]: (See above)Bolt [40]: (See above)
-
For Top-K Selection:
-
Bitonic sorting network [28]: A well-knowndata-independent sorting algorithmcommonly used insecure computationscenarios. It provides a strong baseline for general secure sorting, but is computationally intensive.The choice of these baselines is representative because they cover various approaches to
secure inference(HE-only, hybrid 2PC, differentHEpacking schemes) and are considered state-of-the-art fortransformer-based models or fundamental cryptographic primitives.
-
5.4. Implementation Details
The authors provide details on their implementation to ensure reproducibility and transparency:
- Language: Full implementation in .
- Security Parameter: Set to
128bits, a standard level for cryptographic security. - Homomorphic Encryption:
Microsoft SEAL(version 4.0) library is used forAHE.- Specifically, the
Brakerski-Fan-Vercauteren (BFV) scheme [11], [19]is employed. - is used as the
polynomial modulus degree, with defaultSEALparameters for 128-bit security. Hexlis used to accelerateHEoperations withAVX-512instructions.Noise flooding [45], [30]is performed on returned ciphertexts to ensurecircuit privacy.
- Multiplication Protocols:
Uniform bit-width product: The open-sourcedBOLE(Batch Oblivious Linear Evaluation) code fromBumbleBee [30]is used.Non-uniform bit-width product: The open-sourced code fromSIRNN [43]is used.
- Secure GELU:
LUT,Mult,Trunc,CMP, andMUXprimitives are implemented leveragingSIRNN's open-sourced code.IKNP-OT [32](used inSIRNN) is replaced withFerret OT [57]for efficiency.OT-based multiplicationis replaced withFHE-based BOLE [30].
- sVOLE-based MatrixMul:
Reverse-VOLEis implemented withAHE.Halftree [26], [25]optimization is incorporated forPPRF(Pseudorandom Permutation Families).- Optimizations from
[56], [6]are included. - Adherence to advice in
[33]for protection against known attacks.
- TopK: A custom implementation of
secret-shared shuffle(from[14]) is developed, as it was not open-sourced. - Bolt Baseline:
Bolt [40]was implemented based onSIRNNwithFerret OTusing parameters from theBoltpaper, as its code was unavailable.
5.5. Optimizations for HE
Two optimizations are leveraged to reduce the communication overhead of transferring HE ciphertexts:
- Symmetric Version of FHE: In the
symmetric versionofFHE, a freshly encrypted ciphertext consists of two polynomial components. One of these polynomials is uniformly sampled and can be represented by aseedinstead of being transmitted entirely.- Effect: This saves half of the communication when sending a ciphertext without compromising security or correctness.
- Modulus Switch Before Return:
FHE ciphertextsare polynomials in . These can be converted to a smaller ring , where , without affecting the decryption result.- Mechanism: This operation, known as
modulus reduction [13], only requires public parameters and can be performed by either party (even without the secret key). - Effect: This optimization can compress ciphertexts by a factor of , significantly reducing bandwidth.
- Mechanism: This operation, known as
5.6. Experimental setup
- Network Environment:
LAN networksetting, simulating a high-speed local network.Bandwidth: 3000 Mbps (Megabits per second).RTT (Round-Trip Time): 0.8 ms (milliseconds).
- Hardware: Experiments were conducted on
AWS c5.9xlarge instances.CPU: Intel Xeon 8000 series CPUs clocked at 3.6 GHz.
- Parallelism: All experiments were performed using a
single thread, indicating that significant further speedups could potentially be achieved with parallel computing. - Data Collection: All reported results are
average valuesfrom 5 runs, with minimal variance.
5.7. Model Parameters
The GPT-2 model ([42]) is used as the benchmark target:
- Model Size: 117 million parameters.
- Architecture: 12
transformer decoders. - Embedding Size: 768.
- Fixed-point Representation:
- Floating-point numbers are
left-shiftedby bits. This effectively converts floating-point numbers into fixed-point integers, where bits represent the fractional part. - The fractional part is then dropped.
- During inference,
F_Truncis used to ensure that the largest value remains smaller than , with bits (meaning the total integer bit-length, including fractional bits, is 37).
- Floating-point numbers are
6. Results & Analysis
6.1. Core Results Analysis
The experimental evaluation of CipherGPT provides a comprehensive benchmark of its individual components and overall framework, demonstrating significant improvements over existing solutions.
Evaluation of GELU
The secure GELU protocol (Algorithm 1) is evaluated for 37-bit elements in a -length vector. The approximation uses a 64-piece spline within the range .
The following are the results from Table IV of the original paper:
| GELU(Z) | Runtime(s) | Comm.(MB) | MaximalULP Err. | AverageULP Err. |
| Iron[27] | 694 | 12 225 | 9 | 1.93 |
| Bolt[40] | 55.61 | 1 962.23 | 37 | 4.55 |
| Bolt+[40], [30] | 52.22 | 559.28 | 37 | 4.55 |
| BumbleBee[30] | 73.52 | 641.02 | 73 | 10.82 |
| Ours | 30.56 1.8×↓ | 764.96 2.5×↓ | 5 7.4×↓ | 1.06 4.3×↓ |
Analysis:
- Runtime:
CipherGPTachieves a runtime of 30.56s, which is a 1.8x speedup overBolt(55.61s) and even a 1.7x speedup over the optimized (52.22s). This demonstrates the efficiency of the spline-based approach withLUTscompared to high-degree polynomial approximations.Ironis significantly slower (694s). - Communication:
CipherGPT's communication is 764.96MB, showing a 2.5x reduction compared toBolt(1962.23MB). While (559.28MB) andBumbleBee(641.02MB) have lower communication,CipherGPT's overall efficiency is still competitive given its superior precision.Ironhas extremely high communication (12225MB). - Precision (ULP Error): This is where
CipherGPTtruly excels.- Maximal ULP Error:
CipherGPTachieves a maximal ULP error of 5, which is 7.4x more accurate thanBolt(37) and 14.6x more accurate thanBumbleBee(73). - Average ULP Error:
CipherGPT's average ULP error is 1.06, which is 4.3x more accurate thanBolt(4.55) and 10.2x more accurate thanBumbleBee(10.82). - Reasoning for Precision: The paper attributes this superior precision to its single-step, spline-based approximation, which avoids error accumulation seen in multi-step approaches (like
Iron/SIRNNapproximating exponentiation and reciprocation separately) and the precision loss inherent in high-degree polynomial approximations (likeBolt/BumbleBee).
- Maximal ULP Error:
Evaluation of MatrixMul
The sVOLE-based MatrixMul protocol is designed for unbalanced matrix multiplications that occur repeatedly with the same weights in GPT's autoregressive inference. It is benchmarked for computing iterations of . The results highlight the benefits of amortized costs.
The image "images/3.jpg" shows the evaluation of MatrixMul. The left plot (a) shows amortized runtime (s) vs. iterations, and the right plot (b) shows amortized communication (MB) vs. iterations. For (a), sVOLE runtime significantly decreases as iterations increase. For (b), sVOLE communication also decreases.
Analysis from Figure 3:
- Amortized Runtime (Figure 3a):
- For iterations (a reasonable number for
ChatGPTresponses),CipherGPT's amortized runtime is 1462 ms. This represents a 2.3x speedup over Iron, 6.4x speedup over Bolt, and 5.8x speedup over BumbleBee. The curves clearly showCipherGPT(sVOLE) consistently outperforming otherHE-basedmethods, with its runtime decreasing significantly as increases due to amortization. - For iterations,
CipherGPTfurther improves, showing 2.5x speedup over Iron, 6.9x speedup over Bolt, and 6.2x speedup over BumbleBee.
- For iterations (a reasonable number for
- Amortized Communication (Figure 3b):
- For iterations,
CipherGPT's amortized communication is 8.2MB. This is a 3.7x reduction over Iron, 3.0x reduction over Bolt, and 1.4x reduction over BumbleBee. Similar to runtime, communication also decreases with increasing . - For iterations,
CipherGPTachieves even greater reductions: 11.2x over Iron, 8.9x over Bolt, and 4.1x over BumbleBee.
- For iterations,
- Reasoning: The effectiveness of
sVOLEstems from its communication complexity being almost independent of (input vector length). By batching multipleMatrixMuloperations that share the same weights, the one-timesVOLEsetup cost is amortized, leading to superior performance for generativeLLMtasks.
Evaluation of TopK
The secure TopK protocol (Algorithm 2) is benchmarked for selecting 100 elements from a vector of length 50257 (typical GPT-2 vocabulary size).
- Performance: It takes 3281 ms and consumes 136.1 MB of bandwidth.
- Comparison: Compared to the commonly used
Bitonic sorting network [28],CipherGPTachieves an 8.8x speedup in runtime and a 14.8x reduction in communication. This highlights the efficiency gained by using a shuffledquicksort-like selection that only processes necessary partitions, rather than a full sorting algorithm.
Evaluation of CipherGPT (Overall Framework)
The overall CipherGPT framework is evaluated for generating a sentence of 256 response words.
The following are the results from Table V of the original paper:
| Layer | Operation | Output←Input | Method | Times | Runtime (ms) | Runtime % | Comm. (MB) | Comm. % | |
| Embedding | Embedding | Z256×768←Z2166, L = 12 | §VII-A | 1 | 46 | < 0.01% | 2.20 | < 0.01% | |
| LayerNorm | LayerNorm | Z256×768←Z256×768, L = 12 | §VII-B | 12 | 4756 × 12 | 4.83% | 65 × 12 | 5.17% | |
| Self-attention | MatrixMul | (Z256×768←Z256×768 × Z768×768) × 3, L = 24 | §II | 12 | 4358 × 12 | 4.43% | 21.79 × 12 | 1.73% | |
| Trunc | (Z256×768←Z256×768) × 3, L = 12 | [43] | 12 | 3117 × 12 | 3.17% | 24.75 × 12 | 1.97% | ||
| Multi-head | (Z256×64) × 12 ← (Z256×768) × 3, L = 12 | plain | 12 | (< 1) × 12 | ≈ 0% | 0 | 0% | ||
| MatrixMul | (Z256×256←Z256×64 × Z64×256) × 12, L = 24 | [30] | 12 | 7614 × 12 | 7.74% | 29.77 × 12 | 2.37% | ||
| Trunc | (Z256×256←Z256×256) × 12, L = 12 | [43] | 12 | 3262 × 12 | 3.31% | 30.61 × 12 | 2.43% | ||
| Masking | (Z256×256←Z256×256) × 12, L = 12 | plain | 12 | (< 1) × 12 | ≈ 0% | 0 | 0% | ||
| Softmax | (Z256×256←Z256×256) × 12, L = 12, (by row) | §VII-C | 12 | 14865 × 12 | 15.10% | 277.59 × 12 | 22.06% | ||
| MatrixMul | (Z256×64←Z256×256 × Z256×64) × 12, L = 24 | [30] | 12 | 7570 × 12 | 7.69% | 10.62 × 12 | 0.84% | ||
| Trunc | (Z256×64←Z256×64) × 12, L = 12 | [43] | 12 | 2910 × 12 | 2.96% | 13.03 × 12 | 1.04% | ||
| Reassemble | Z256×768←((Z256×64) × 12), L = 12 | plain | 12 | (< 1) × 12 | ≈ 0% | 0 | 0% | ||
| MatrixMul | Z256×768←Z256×768 × Z768×768, L = 24 | §II | 12 | 1463 × 12 | 1.49% | 8.20 × 12 | 0.65% | ||
| Trunc | Z256×768←Z256×768, L = 12 | [43] | 12 | 2910 × 12 | 2.96% | 13.03 × 12 | 1.04% | ||
| Matrix Add | Z256×768←Z256×768 + Z256×768, L = 12 | plain | 12 | (< 1) × 12 | ≈ 0% | 0 | 0% | ||
| LayerNorm | LayerNorm | Z256×768←Z256×768, L = 12 | §VII-B | 12 | 4756 × 12 | 4.83% | 65 × 12 | 5.17% | |
| Feed-forward | MatrixMul | Z256×3072←Z256×768 × Z768×3072, L = 24 | §II | 12 | 5997 × 12 | 6.09% | 28.5 × 12 | 2.27% | |
| Trunc | Z256×3072←Z256×3072, L = 12 | [43] | 12 | 3204 × 12 | 3.26% | 30.61 × 12 | 2.43% | ||
| GELU | Z256×3072←Z256×3072, L = 12 | §IV | 12 | 20657 × 12 | 20.99% | 575.91 × 12 | 45.78% | ||
| MatrixMul | Z256×768←Z256×3072 × Z3072×768, L = 24 | §II | 12 | 5841 × 12 | 5.94% | 32.42 × 12 | 2.58% | ||
| Trunc | Z256×768←Z256×768, L = 12 | [43] | 12 | 2910 × 12 | 2.96% | 13.03 × 12 | 1.04% | ||
| Matrix Add | Z256×768←Z256×768 + Z256×768, L = 12 | plain | 12 | (< 1) × 12 | ≈ 0% | 0 | 0% | ||
| LayerNorm | LayerNorm | Z256×768←Z256×768, L = 12 | §VII-B | 1 | 4756 | 0.40% | 65 | 0.43% | |
| Vec2Word | MatrixMul | Z50257←Z768 × Z768×50257, L = 24 | [30] | 1 | 12000 | 1.02% | 5.5 | 0.04% | |
| Trunc | Z50257←Z50257, L = 12 | [43] | 1 | 2834 | 0.24% | 8.67 | 0.06% | ||
| Shuffle | Z50257←Z50257, L = 12 | [14] | 1 | 4004 | 0.34% | 51.3 | 0.34% | ||
| TopK | Z100←Z50257, L = 12 | §V | 1 | 1277 | 0.11% | 84.8 | 0.56% | ||
| Vec2Word | Temperature | Z100←Z100, L = 24 | §VII-E | 1 | 6 | < 0.01% | 0.084 | < 0.01% | |
| Trunc | Z100←Z100, L = 12 | [43] | 1 | 1 | < 0.01% | < 0.01 | < 0.01% | ||
| Softmax | Z100←Z100, L = 12 | §VII-C | 1 | 1705 | 0.14% | 0.71 | < 0.01% | ||
| Sampling | Z37←Z100, L = 12 | §VI | 1 | 7.843 | < 0.01% | 0.11 | < 0.01% | ||
| Total | 1 180 933 | 15 096.82 | |||||||
Analysis of Overall Performance (for 256 response words):
- Total Runtime: 1,180,933 ms, which is approximately 19.68 minutes.
- Total Communication: 15,096.82 MB, which is approximately 14.74 GB.
Proportional Breakdown:
- Runtime Bottlenecks:
GELU: Dominates runtime, accounting for 20.99% of total time. This operation is performed 12 times within theFeed-forwardlayers.Softmax: Contributes 15.10% of runtime (also performed 12 times withinSelf-attention).MatrixMul(various types): Summing up allMatrixMuloperations, they collectively contribute a significant portion. Specifically, theAHE-based MatrixMulinSelf-attention(7.74% and 7.69%) andsVOLE-based MatrixMulinFeed-forward(6.09% and 5.94%) are major contributors. The totalMatrixMul(includingQKV, internalSelf-attention,FClayers) across all 12 decoders is around 34.39%.Truncation (Trunc): Essential for fixed-point arithmetic, accumulates to approximately 18.85% of runtime across the model.LayerNorm: Contributes approximately 10.07% of runtime.
- Communication Bottlenecks:
GELU: Is the largest communication consumer, occupying 45.78% of the total bandwidth.Softmax: Contributes 22.06% of communication.LayerNorm: Consumes 10.76% of bandwidth.MatrixMul(various types): OverallMatrixMulcontributes approximately 10.49%.Truncation (Trunc): Accounts for approximately 10% of bandwidth.
Observations:
- The customized protocols for
GELU(Section IV) andMatrixMul(Section III) are indeed critical, as they form the largest proportions of both runtime and communication. - Operations like
Embedding,Multi-head partitioning,Masking,Reassemble,Matrix Add,Temperature,Sampling, andTrunc(after Temperature) forVec2Wordare relatively cheap, often taking negligible time and communication. - The high overall latency (around 20 minutes) and bandwidth (around 15 GB) for generating 256 words clearly indicate that
CipherGPT, despite its significant optimizations, is not yet practical for real-time interactiveGPTinference. However, the detailed breakdown provides a clear roadmap for future optimizations by highlighting the most expensive operations.
Accuracy Evaluation
The paper also assessed the accuracy loss introduced by CipherGPT using WikiText-103.
- Methodology: 10,000 sentences were randomly selected.
CipherGPTwas run (with forTopKto predict the most probable word, thus eliminatingTopKsampling interference) and compared againstGPT-original(the original model with floating-point numbers, no truncations or approximations). - Results:
- 99.22% of
CipherGPT's outputs were identical toGPT-original's outputs. This is a very high degree of fidelity, indicating that the fixed-point approximations and secure protocols introduce minimal changes to the model's core predictions. - For the remaining 0.78% (78 out of 10,000) of outputs that were different, each "wrong" output still fell within the
top-5outputs produced byGPT-originalfor the corresponding sentence. This suggests that even whenCipherGPTdeviates, its predictions are still highly plausible and close to the original model's top choices, preserving the overall utility of theLLM.
- 99.22% of
6.2. Data Presentation (Tables)
The tables presented in the "Core Results Analysis" section are direct transcriptions from the original paper's Table IV and Table V, using Markdown for Table IV (no merged cells) and HTML for Table V (due to merged cells).
6.3. Ablation Studies / Parameter Analysis
The paper does not present explicit ablation studies to verify the individual contribution of each component of CipherGPT (e.g., removing sVOLE optimization, or using a less precise GELU approximation). However, the detailed breakdown of runtime and communication costs for each operation in Table V implicitly serves a similar purpose by showing the relative impact of each component.
For GELU, the authors specify the parameters used: and (resulting in a 64-piece spline) for the approximation. They don't analyze how varying these parameters might affect performance or precision, but the choice is justified based on balancing accuracy and efficiency.
The comparison of MatrixMul performance for different (number of iterations) in Figure 3 acts as a form of parameter analysis, showing how the amortization benefits scale with the batch size.
7. Conclusion & Reflections
7.1. Conclusion Summary
CipherGPT introduces the first comprehensive framework for secure two-party GPT inference, addressing critical privacy concerns associated with large language models. The framework is built upon a suite of innovative cryptographic protocols specifically tailored to the unique computational demands of GPT. Key contributions include a novel sVOLE-based secure matrix multiplication that leverages GPT's autoregressive nature for significant speedup and bandwidth reduction, a spline-based secure GELU protocol that achieves superior precision and efficiency over existing methods, and the first protocol for secure top-K sampling, crucial for generative text diversity. Despite the current practical challenges in terms of latency and bandwidth, CipherGPT provides a full implementation and a detailed benchmark, offering valuable insights into performance bottlenecks and a foundational reference for future research in this nascent field.
7.2. Limitations & Future Work
The authors candidly acknowledge the current limitations and suggest avenues for future research:
- Impractical Performance: The primary limitation is the current high cost. Generating a single token (response word) requires a latency of approximately 20 minutes and consumes 15 GB of bandwidth (for a 256-word response, amortized). This makes real-time, interactive
GPTinference currently impractical. - Future Technological Advancements: They anticipate that ongoing advancements in computing and network technologies will eventually pave the way for practical implementations.
- Computing Power: Leveraging parallel computing technologies such as
GPUorFPGA acceleration, and exploring new architectures likein-memory computing [50]andin-storage computing [49], could significantly speed up computation. The current single-threaded implementation leaves ample room for improvement. - Network Bandwidth: The emergence of
100 Gigabit Ethernet [1]under theIEEE standardcould effectively address the high bandwidth requirements.
- Computing Power: Leveraging parallel computing technologies such as
- Non-Real-Time Applications: Despite the high latency,
secure GPT inferencecan find valuable applications in scenarios where real-time responsiveness is not critical. An example provided is an institution evaluating anLLM's performance using confidential prompts, where both the prompts and the model itself need to remain private. In such cases, the long latency is tolerable.
7.3. Personal Insights & Critique
CipherGPT represents a significant leap forward in the crucial field of privacy-preserving AI, especially for large language models. The authors' approach of deeply analyzing GPT's architectural and operational specifics to design custom cryptographic primitives is highly effective and demonstrates a deep understanding of both fields.
Strengths:
- Problem Relevance: The privacy concerns surrounding
LLMsare paramount and growing. This paper directly tackles a pressing real-world issue. - Customized Optimizations: The
sVOLE-based MatrixMulfor autoregressiveGPTand thespline-based GELUare particularly clever and impactful innovations. They highlight that genericsecure inferencesolutions are often insufficient for specializedAI models. The improvements in precision forGELUare also highly commendable, as accuracy is often sacrificed in secure computation. - Comprehensive Benchmark: The detailed breakdown of runtime and communication for each operation (Table V) is invaluable. It not only provides a baseline but also clearly points to the current bottlenecks, guiding future research efforts. This level of transparency is excellent.
- First-of-its-Kind Protocols: Introducing the first protocols for
secure Top-K selectionandsecure samplingin this context fills critical gaps for truly generativeLLMprivacy.
Critique and Future Directions:
-
Practicality Gap: The latency of 20 minutes per token is indeed a "showstopper" for most interactive applications. While the authors correctly point to future hardware and network advancements, the cryptographic overhead itself remains extremely high. Further research into more efficient cryptographic primitives (e.g., post-quantum FHE, faster
OTextensions, or newsecret-sharingschemes) will be crucial. -
Preprocessing Cost: While the
sVOLE MatrixMulamortizes online costs, the preprocessing phase (which generatessVOLEcorrelations) can itself be costly. The paper doesn't deeply elaborate on the non-amortized preprocessing time, which could be significant for models with many layers or large weights. -
AHE-based MatrixMul: For parts of the model (like score matrix calculation in
self-attentionand the finalvec2word MatrixMul) wheresVOLEis not applicable, the framework reverts toAHE-based MatrixMul [27]. This indicates that these operations might still be relatively expensive compared to thesVOLEoptimized parts, and further optimizations for these non-batchedMatrixMulinstances could be beneficial. -
Scaling K for Top-K and Sampling: The performance of
TopKandSamplingmight be sensitive to the value of (the number of elements selected/sampled). A larger could increase costs. Analyzing this sensitivity could provide further insights. -
Trust Assumptions: The
semi-honestmodel is a standard assumption. Exploring solutions under a strongermalicious adversarymodel would enhance security guarantees, albeit with higher costs. -
Model Flexibility: The current approach is highly specialized for
GPT. While this brings efficiency, it might require significant adaptation for otherLLMarchitectures (e.g.,Llama,Mixtral) or newer activation functions.Overall,
CipherGPTlays a foundational stone for securegenerative AI. Its innovations provide a clear direction for bridging the gap between cutting-edgeLLMsand privacy, even if the journey to real-time practicality is still long. The detailed analysis provided in this paper will be an indispensable reference for those working on makingsecure LLM inferencea reality.
Similar papers
Recommended via semantic vector search.