
CipherGPT: Secure Two-Party GPT Inference

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

CipherGPT is a framework addressing user privacy in GPT inference. It optimizes secure matrix multiplication and GELU computation, achieving significant performance gains. Additionally, it introduces a secure top-k sampling protocol and provides a comprehensive benchmark of the framework.

Abstract

ChatGPT is recognized as a significant revolution in the field of artificial intelligence, but it raises serious concerns regarding user privacy, as the data submitted by users may contain sensitive information. Existing solutions for secure inference face significant challenges in supporting GPT-like models due to the enormous number of model parameters and complex activation functions. In this paper, we develop CipherGPT, the first framework for secure two-party GPT inference, building upon a series of innovative protocols. First, we propose a secure matrix multiplication that is customized for GPT inference, achieving up to 6.2× speedup and 4.1× bandwidth reduction over SOTA. We also propose a novel protocol for securely computing GELU, surpassing SOTA by 1.8× in runtime, 2.5× in communication and 7.4× in precision. Furthermore, we come up with the first protocol for secure top-k sampling. We provide a full-fledged implementation and comprehensive benchmark for CipherGPT. In particular, we measure the runtime and communication for each individual operation, along with their corresponding proportions. We believe this can serve as a reference for future research in this area.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of the paper is secure two-party inference for Generative Pre-trained Transformer (GPT) models, specifically focusing on addressing user privacy concerns when interacting with such models. The title, CipherGPT: Secure Two-Party GPT Inference, clearly reflects this focus.

1.2. Authors

The authors and their affiliations are:

  • Xiaoyang Hou (Zhejiang University)

  • Jian Liu (Zhejiang University)

  • Jingyu Li (Zhejiang University)

  • Yuhan Li (Zhejiang University)

  • Wen-jie Lu (Ant Group)

  • Cheng Hong (Ant Group)

  • Kui Ren (Zhejiang University)

    The authors are affiliated with Zhejiang University, a prominent research institution in China, and Ant Group, a large financial technology company, suggesting a collaboration between academia and industry. This blend of affiliations often indicates research with strong theoretical foundations and practical applicability.

1.3. Journal/Conference

The paper does not explicitly state the journal or conference of publication. Based on its structure and content, it appears to be a research paper submitted to or presented at a peer-reviewed conference in the fields of cryptography, privacy-preserving machine learning, or computer security.

1.4. Publication Year

Based on the content, which references multiple papers from 2022 and 2023 and discusses ChatGPT (which gained widespread public attention in late 2022), the likely publication year is 2023.

1.5. Abstract

ChatGPT, a significant advancement in AI, raises serious user privacy concerns due to the potential submission of sensitive data. Existing secure inference solutions struggle with GPT-like models because of their massive parameter counts and complex activation functions. This paper introduces CipherGPT, the first framework for secure two-party GPT inference, built upon several novel protocols. The authors propose a customized secure matrix multiplication for GPT, achieving up to 6.2x speedup and 4.1x bandwidth reduction over state-of-the-art (SOTA) methods. They also present a new protocol for securely computing GELU, outperforming SOTA by 1.8x in runtime, 2.5x in communication, and 7.4x in precision. Furthermore, CipherGPT introduces the first protocol for secure top-k sampling. The paper provides a full implementation and a comprehensive benchmark, detailing the runtime and communication for individual operations and their proportions, intending it as a reference for future research.

The provided link is /files/papers/692f0055713363b9c83c81fc/paper.pdf. This appears to be a local file path or an internal identifier for a PDF, rather than a publicly accessible URL. Its publication status (e.g., officially published, preprint) is not specified.

2. Executive Summary

2.1. Background & Motivation

The widespread adoption of large language models (LLMs) like ChatGPT marks a revolution in artificial intelligence, offering capabilities from question answering to content generation. However, the paradigm of interacting with these models through online services or APIs necessitates users submitting prompts or messages, which often contain sensitive personal or proprietary information. This poses a significant user privacy risk, potentially restricting the deployment of LLMs in privacy-critical scenarios (e.g., healthcare, legal, finance).

Current secure inference solutions, which aim to perform computations on encrypted data while preserving privacy for both the user's input and the model's parameters, face substantial challenges when applied to GPT-like models. These challenges stem from:

  1. Enormous Model Parameters: GPT models, such as GPT-2 (117 million parameters), involve a multitude of high-dimensional matrix multiplications.

  2. Complex Activation Functions: The models utilize non-linear activation functions like GELU (Gaussian Error Linear Unit), which are computationally expensive to secure using traditional cryptographic primitives.

  3. Generative Tasks: Unlike discriminative tasks (e.g., classification), generative LLMs require repeated inferences to produce output word-by-word, and they often employ random sampling mechanisms to ensure creativity and diversity, which are difficult to secure efficiently.

  4. Limitations of Prior Work: Existing secure inference protocols, such as Iron, primarily focus on transformer-based models for non-generative tasks (e.g., BERT inference). Others like Bolt and BumbleBee optimize matrix multiplication but may introduce higher computational complexity. State-of-the-art approaches for GELU approximation use high-degree polynomials, often sacrificing accuracy for efficiency. Prior work on random sampling either relies on extremely heavy Garbled Circuits or simplifies the problem by selecting the highest-scoring word, diminishing utility.

    The paper's entry point is to directly address these specific challenges by designing a dedicated framework and innovative cryptographic protocols optimized for the unique characteristics of GPT inference, thereby enabling practical secure two-party GPT inference for the first time.

2.2. Main Contributions / Findings

The paper's primary contributions are:

  • First Secure Two-Party GPT Inference Framework (CipherGPT): Developing CipherGPT, the first framework designed specifically for secure two-party inference of GPT-like models.

  • Customized Secure Matrix Multiplication: Proposing a novel secure matrix multiplication protocol tailored for GPT's autoregressive nature, leveraging sVOLE. This achieves up to 6.2x speedup and 4.1x bandwidth reduction over state-of-the-art (SOTA) methods.

  • Novel Secure GELU Protocol: Introducing a new protocol for securely computing the GELU activation function. This spline-based approximation, using LUTs and secret-shared linear functions, surpasses SOTA by 1.8x in runtime, 2.5x in communication, and achieves 7.4x higher precision.

  • First Secure Top-K Sampling Protocol: Presenting the first protocol for secure top-k sampling, which is crucial for enabling creative and diverse text generation in a privacy-preserving manner.

  • Comprehensive Implementation and Benchmark: Providing a full-fledged implementation of CipherGPT and a detailed benchmark. This benchmark measures the runtime and communication costs for each individual operation and their respective proportions, serving as a valuable reference for future research in this area.

    The key findings demonstrate that despite the inherent complexity, CipherGPT significantly advances the feasibility of secure GPT inference by developing highly optimized cryptographic primitives for its core operations. While current costs remain high, the paper's quantitative analysis highlights bottlenecks and potential areas for future improvements, moving closer to practical deployment.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand CipherGPT, a grasp of several fundamental cryptographic and machine learning concepts is essential:

  • Generative Pre-trained Transformer (GPT): GPT is a type of large language model (LLM) based on the Transformer architecture. GPT models are "generative" because they can produce new text, and "pre-trained" because they are initially trained on vast amounts of text data to learn language patterns. They are "autoregressive," meaning they generate text word by word, using previously generated words as context for the next prediction. The GPT architecture (illustrated in Figure 2 from the original paper) primarily consists of:

    • Embedding Layer: Maps input words into numerical vector representations (word embeddings) and adds positional information (position embeddings).
    • Transformer Decoders: Multiple identical layers (with different weights) that process the embedded sequence. Each decoder typically includes:
      • Masked Self-Attention: Allows the model to weigh the importance of different words in the input sequence when processing each word, but only considers prior words to prevent "cheating" during training/inference of generative tasks.
      • Feed-Forward Neural Network: Standard neural network layers applied independently to each position.
      • Layer Normalization: A technique to normalize the activations of previous layers, improving training stability and performance.
    • Vec2word Layer: The final layer that maps the output of the last transformer decoder to a probability distribution over the vocabulary, from which a response word is selected.
  • Secure Inference: A two-party cryptographic protocol where two entities—a client (C) with a private input (e.g., a prompt) and a server (S) with a private model (e.g., a GPT model)—collaborate to perform model inference without either party fully revealing their private data to the other.

    • Client's privacy: $S$ learns nothing about $C$'s input beyond its length and the output length.
    • Server's privacy: $C$ learns nothing about $S$'s model beyond its architecture and the final inference result.
    • This is achieved by computing on encrypted or secret-shared data.
  • Threat Model (Semi-honest Adversary): This paper assumes a semi-honest adversary (also known as "honest-but-curious"). A party acting as a semi-honest adversary follows the protocol correctly but attempts to extract as much information as possible from the data it observes during the protocol execution. This is a common and practical security assumption in cryptographic protocols. $\lambda$ denotes the computational security parameter.

  • Secret Sharing: A cryptographic technique that splits a secret value $x$ into multiple shares, $\langle x \rangle = (\langle x \rangle_S, \langle x \rangle_C)$, such that:

    • No single share reveals any information about $x$.
    • Combining a sufficient number of shares (e.g., two shares in a 2-out-of-2 scheme) allows reconstruction of $x$.
    • 2-out-of-2 Additive Secret Sharing: Used in this paper over power-of-2 rings ($\mathbb{Z}_{2^l}$). For a secret $x \in \mathbb{Z}_{2^l}$, it is split into two shares $\langle x \rangle_S$ (for $S$) and $\langle x \rangle_C$ (for $C$) such that $x = \langle x \rangle_S + \langle x \rangle_C \pmod{2^l}$. Each party holds one share (see the sketch after this list).
  • Oblivious Transfer (OT): A fundamental cryptographic primitive where a sender has multiple messages and a receiver wants to choose one message without the sender knowing which one was chosen, and without the receiver learning anything about the other messages.

    • $\mathsf{F_{OT}}$: The ideal functionality for Oblivious Transfer.
    • 1-out-of-$M$ OT ($\binom{M}{1}$-OT): The sender has $M$ messages, and the receiver chooses one.
    • Random OT (rOT): A variant where the messages and choices are randomly sampled. Ferret provides efficient protocols for generating large batches of rOTs.
  • Homomorphic Encryption (HE): An encryption scheme that allows computations to be performed directly on encrypted data without decrypting it first. The results of these computations remain encrypted and, when decrypted, are the same as if the operations were performed on the unencrypted data.

    • Fully Homomorphic Encryption (FHE): Supports arbitrary computations, but usually leveled FHE is used in practice, meaning operations can only be performed a limited number of times before bootstrapping (an expensive refresh operation) is needed.
    • RLWE (Ring Learning With Errors): A common mathematical problem on which many efficient FHE schemes (like BFV, CKKS) are based. Plaintexts are encoded as polynomials in a quotient ring (e.g., $\mathbb{Z}_p[x]/(x^N+1)$), and ciphertexts consist of two polynomials in a larger ring.
    • Additively Homomorphic Encryption (AHE): A type of HE that supports addition operations on ciphertexts.
  • Vector Oblivious Linear Evaluation (VOLE) and Subfield VOLE (sVOLE):

    • VOLE is a two-party functionality where a sender with input $x \in \mathbb{F}_p$ and a receiver interact. The sender learns a vector $\mathbf{w}$ of length $n$, and the receiver learns vectors $(\mathbf{u}, \mathbf{v})$, both of length $n$, such that $\mathbf{w} = \mathbf{u}x + \mathbf{v}$.
    • sVOLE (subfield VOLE) is a generalization where $x \in \mathbb{F}_q$ (a subfield of $\mathbb{F}_p$), and the resulting correlation is $\mathbf{w} = \mathbf{u}x + \mathbf{v}$ with $\mathbf{u} \in \mathbb{F}_p^n$ and $\mathbf{w}, \mathbf{v} \in \mathbb{F}_p^n$. It is more cost-effective for computing unbalanced MatrixMuls when $n \gg k$, and it can be adapted to work over finite rings like $\mathbb{Z}_{2^l}$.
  • Batch Oblivious Linear Evaluation (BOLE): A two-party functionality that takes a vector $\mathbf{x} \in \mathbb{F}_p^n$ from a sender and $\mathbf{y} \in \mathbb{F}_p^n$ from a receiver and generates the correlation $v_i + w_i = x_i \cdot y_i$ for every $i$. The receiver learns $\mathbf{v}$ and the sender learns $\mathbf{w}$. This can be interpreted as generating Beaver triples for secure multiplication.

  • Ideal Functionalities: The paper uses notation like $\mathsf{F_{Mult}}$, $\mathsf{F_{CMP}}$, $\mathsf{F_{MUX}}$, $\mathsf{F_{Trunc}}$, $\mathsf{F_{TR}}$, $\mathsf{F_{LUT}}$, and $\mathsf{F_{Shuffle}}$ to represent ideal functionalities. An ideal functionality defines the desired secure computation in an abstract way, assuming a trusted third party. Cryptographic protocols then aim to securely realize these ideal functionalities in a real-world setting without a trusted party.

    • $\mathsf{F_{Mult}}$: Secure Multiplication (element-wise or matrix).
    • $\mathsf{F_{CMP}}$: Secure Comparison ($b=1$ if $x \ge y$, $b=0$ otherwise).
    • $\mathsf{F_{MUX}}$: Secure Multiplexer (output $x$ if $b=1$, 0 if $b=0$).
    • $\mathsf{F_{Trunc}}$: Secure Truncation (right-shift by $s$ bits, $y = x \gg s$, keeping $l$ bits).
    • $\mathsf{F_{TR}}$: Secure Truncate-then-Reduce (right-shift by $s$ bits, $y = x \gg s$, reducing to $l-s$ bits).
    • $\mathsf{F_{LUT}}$: Secure Lookup Table (retrieve $T[i]$ given $\langle i \rangle$).
    • $\mathsf{F_{Shuffle}}$: Secure Shuffle (permute a vector $\mathbf{x}$ according to a secret permutation $\pi$).
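
To make the sharing notation above concrete, here is a minimal plaintext sketch of 2-out-of-2 additive secret sharing over $\mathbb{Z}_{2^l}$ (our illustration, not code from the paper; the bit-width $l = 64$ and helper names are arbitrary choices):

```python
# Minimal sketch of 2-out-of-2 additive secret sharing over Z_{2^l}.
# Illustration only: real protocols combine this with fixed-point
# encodings and the functionalities listed above.
import secrets

L_BITS = 64
MOD = 1 << L_BITS  # the ring Z_{2^l}

def share(x: int) -> tuple[int, int]:
    """Split x into (x_S, x_C) with x = x_S + x_C mod 2^l."""
    x_c = secrets.randbelow(MOD)   # uniformly random share for C
    x_s = (x - x_c) % MOD          # S's share completes the sum
    return x_s, x_c

def reconstruct(x_s: int, x_c: int) -> int:
    return (x_s + x_c) % MOD

x_s, x_c = share(12345)
assert reconstruct(x_s, x_c) == 12345   # shares recombine to the secret
# Addition of two secrets is local: each party adds its own shares.
y_s, y_c = share(67890)
assert reconstruct((x_s + y_s) % MOD, (x_c + y_c) % MOD) == 12345 + 67890
```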

3.2. Previous Works

The paper contextualizes CipherGPT against existing secure inference solutions, highlighting their limitations for GPT-like models.

  • Early Secure Inference: Efforts date back to the early 2010s ([39], [7], [55]) for simpler ML algorithms like SVMs and linear regression.
  • CryptoNets ([22]): This was an initial endeavor in secure neural network inference relying solely on FHE. Its main limitation was supporting only linear operations and low-degree polynomials, suitable for networks with few layers.
  • MiniONN ([34]): The first work to customize 2PC protocols for neural network inference. It introduced a spline-based approximation for non-linear operations, which CipherGPT draws inspiration from for its secure GELU protocol.
  • SIMD vs. Coefficient Packing for Matrix Multiplication:
    • GAZELLE [31]: Reduced the cost of linear layers by mapping them to SIMD-based matrix-vector multiplication. SIMD (Single Instruction, Multiple Data) allows batching multiple elements into one RLWE ciphertext for parallel element-wise operations but requires expensive homomorphic rotations for summation. It often operates over prime fields, necessitating conversions for $\mathbb{Z}_{2^l}$ rings.
    • Cheetah [29]: Substituted SIMD with coefficient packing, eliminating expensive SIMD rotations and directly computing over $\mathbb{Z}_{2^l}$ rings.
    • Iron [27]: Further reduced communication complexity compared to Cheetah.
    • Bolt [40] and BumbleBee [30]: Optimized matrix multiplication in communication but introduced more computational complexity, often involving expensive rotations or conversions between prime fields and $\mathbb{Z}_{2^l}$ rings. They focused on transformer-based models for non-generative tasks (like BERT).
  • Secure Activations (GELU, Sigmoid, Tanh):
    • CrypTFlow2 [44]: Provided efficient protocols for secure comparison and division.
    • SIRNN [43]: Offered crypto-friendly approximations for math functions like exponential, sigmoid, tanh, and reciprocal square root, along with corresponding 2PC implementations. SIRNN and Iron typically employ Lookup Tables (LUTs) to approximate $e^{-x}$ and its reciprocal, a multi-step process that can accumulate precision errors.
    • Bolt [40] and BumbleBee [30]: State-of-the-art approaches for GELU that split the curve into several parts and use high-degree polynomials for approximation within each part. This involves multiple Multiplication and Truncation operations, and comparisons for selecting the correct part, which can affect both efficiency and accuracy.
  • Secure Sorting and Sampling:
    • Bitonic sorting network [28]: A commonly used data-independent sorting algorithm in secure computation, but generally heavy in communication.
    • Prior works for random sampling often used Garbled Circuits [58], which are computationally and communicatively expensive, or simplified by selecting the highest-scoring word, which sacrifices the LLM's utility. Chengkun Wei et al. [54] explored securely sampling discrete Gaussian noise for MPC Differential Privacy, which differs from CipherGPT's goal of sampling from secret-shared probabilities.
  • Crypto-friendly Model Structures: Some solutions (DeepSecure [48], XONN [46], Quotient [2]) modify the neural network architecture (e.g., binarized neural networks) to be more compatible with cryptographic primitives, often requiring model retraining. Delphi [37] uses neural architecture search to find performance-accuracy trade-offs. CipherGPT aims to avoid retraining.
  • GPU Acceleration: Solutions like GForce [38] leverage GPU parallelism for the online phase but don't address the expensive preprocessing.
  • Multi-Party (3PC) Settings: Some works ([47], [52], [5]) explore three-party settings, assuming non-colluding servers. While potentially more efficient, this assumption is often considered less practical than the two-party model.

3.3. Technological Evolution

The evolution of secure inference has progressed from simple machine learning models to complex deep neural networks (DNNs) and now to large language models (LLMs).

  1. Early Works (2010s): Focused on basic ML models like SVMs and linear regression, often using generic multi-party computation (MPC) or homomorphic encryption (HE) primitives.
  2. CryptoNets Era (Mid-2010s): Demonstrated the feasibility of HE-only inference for simple CNNs, but highlighted the limitations of HE for non-linear operations and deep architectures.
  3. Hybrid Approaches (Late 2010s): The emergence of 2PC frameworks like MiniONN, GAZELLE, Cheetah, and Iron combined the strengths of HE (for linear operations) and secret sharing or Garbled Circuits (for non-linear operations). This enabled more complex DNNs, including transformers for discriminative tasks (BERT). Significant efforts were made to optimize matrix multiplication (e.g., SIMD vs. coefficient packing) and non-linear activations (e.g., lookup tables, polynomial approximations).
  4. LLM Era (Early 2020s): With the rise of GPT models, new challenges surfaced:
    • Scale: Handling hundreds of millions or billions of parameters efficiently.

    • Generative Nature: Supporting autoregressive inference and random sampling for diverse outputs, which were not primary concerns in discriminative DNN inference.

    • Specific Activations: Efficiently securing GELU.

    • Top-K Selection: A crucial component for vocabulary management in LLMs.

      CipherGPT represents a significant step in this evolution, specifically addressing the unique demands of GPT-like models, moving beyond generic transformer inference to tackle the full generative LLM workflow, including sophisticated sampling and customized matrix multiplication for autoregressive tasks.

3.4. Differentiation Analysis

CipherGPT differentiates itself from existing secure inference methods primarily by its tailored optimizations for the specific characteristics of GPT-like models, particularly their autoregressive nature and the need for creative sampling.

  1. Matrix Multiplication (MatrixMul):

    • Previous SOTA (Cheetah, Iron, BumbleBee, Bolt): These RLWE-based HE approaches (using SIMD slots or polynomial coefficients) handle MatrixMul by performing operations on encrypted data. Bolt and BumbleBee introduce complex techniques like homomorphic rotations or SIMD rotations to save communication, often at the cost of computation, and Bolt requires secure conversion between prime fields and $\mathbb{Z}_{2^l}$ rings.
    • CipherGPT's Innovation: CipherGPT leverages the observation that GPT's autoregressive generation repeatedly uses the same model weights ($\mathbf{Y}$) with different inputs ($\mathbf{X}$). It batches these multiple MatrixMul operations into a single, large, unbalanced MatrixMul during the preprocessing phase. This is then efficiently processed using sVOLE (subfield Vector Oblivious Linear Evaluation).
    • Differentiation: By amortizing the cost over many MatrixMul instances ($t$ iterations), CipherGPT achieves significant speedup (up to 6.2x) and bandwidth reduction (4.1x) compared to HE-based methods, which might process each MatrixMul independently or with less efficient batching for this specific use case. The communication cost of sVOLE is almost independent of $n$ (the input vector length), providing greater savings for larger $n$.
  2. GELU Activation Function:

    • Previous SOTA (SIRNN, Iron, BumbleBee, Bolt):
      • SIRNN and Iron use multiple Lookup Tables (LUTs) to approximate parts like $e^{-x}$ and reciprocals, which involves a multi-step process, leading to accumulated precision errors.
      • BumbleBee and Bolt split GELU into several parts and approximate each part with high-degree polynomials. This requires numerous multiplication-then-truncation operations and comparisons for selecting the correct polynomial, often sacrificing accuracy for efficiency due to polynomial degree and limited split parts.
    • CipherGPT's Innovation: CipherGPT adopts a spline-based approximation using only linear functions ($y = ax + d$) within several small intervals of the GELU curve. It simplifies the curve into three main parts ($x < -\alpha$, $-\alpha \le x \le \alpha$, $x > \alpha$), then shifts the curve to simplify the middle part to $[0, 2\alpha]$. A single LUT is used to find the correct interval and retrieve the linear function's coefficients, followed by a single multiplication-then-truncation.
    • Differentiation: This approach drastically reduces the number of cryptographic primitives (fewer Mult, Trunc, CMP, MUX operations) compared to polynomial-based methods and avoids error accumulation from multi-step approximations. It leads to superior precision (7.4x better maximal ULP error) while being faster (1.8x runtime speedup) and more communication-efficient (2.5x reduction).
  3. Top-K Selection:

    • Previous SOTA (Implicit/General): Often, general secure sorting algorithms like Bitonic sorting network ([28]) would be used, which are very heavy.
    • CipherGPT's Innovation: Introduces a novel protocol for secure top-K selection based on a modified quicksort algorithm. It first securely shuffles the input vector (to prevent information leakage during comparisons) and then applies a comparison-based selection. Critically, it only recursively processes partitions containing the top-K elements, reducing comparisons from $O(n \log n)$ to $O(n)$.
    • Differentiation: This is presented as the first protocol for secure top-K sampling in this context, offering significant speedup (8.8x) and communication reduction (14.8x) compared to generic secure sorting networks.
  4. Secure Sampling:

    • Previous SOTA: Typically relied on computationally heavy Garbled Circuits [58] or simplified to selecting the highest score, losing model utility.

    • CipherGPT's Innovation: Provides a specific protocol for securely sampling an element from a vector based on secret-shared probabilities. It uses a server-sampled random number $v$, $K-1$ secure comparisons, and $K$ multiplexers to find the correct index based on cumulative probability sums. It also includes a mechanism to securely map the sampled index back to the original (unshuffled) word vector index without revealing intermediate information.

    • Differentiation: This is claimed as the first exploration of secure sampling from secret-shared probabilities for LLM generation, providing a practical cryptographic primitive for this crucial GPT functionality.

      In summary, CipherGPT achieves its differentiation by moving beyond generic secure inference to deeply understand and exploit the specific computational patterns and privacy requirements of GPT's autoregressive, generative workflow, resulting in highly specialized and more efficient cryptographic protocols.

4. Methodology

4.1. Principles

The core idea behind CipherGPT is to enable secure two-party GPT inference by breaking down the complex GPT architecture into individual operations and designing highly optimized, privacy-preserving protocols for each, especially focusing on bottlenecks like matrix multiplication, non-linear activation (GELU), and generative mechanisms (Top-K selection, sampling). The theoretical basis relies on 2-out-of-2 additive secret sharing for general operations, homomorphic encryption (HE) for specific matrix operations, and vector oblivious linear evaluation (VOLE) for highly efficient batched matrix multiplication.

The key intuitions are:

  1. Amortized Matrix Multiplication: GPT's autoregressive nature means the same model weights ($\mathbf{Y}$) are used repeatedly with different inputs ($\mathbf{X}$) over many inference steps. Instead of securing each matrix multiplication independently, these can be batched together and processed using sVOLE in a preprocessing phase, significantly reducing the amortized online cost.
  2. Spline-based Activation: Complex non-linear functions like GELU can be accurately and efficiently approximated by splitting their curve into linear segments (splines). By leveraging a single lookup table (LUT) and secret-shared linear function computation, this avoids the multi-step approximations and high-degree polynomial computations used in prior works, improving both precision and efficiency.
  3. Optimized Top-K Selection: Standard secure sorting is too expensive. By first securely shuffling the input vector, subsequent comparisons for Top-K selection can reveal relative order without revealing original values. A modified quicksort that only processes relevant partitions further reduces computational cost.
  4. Secure Probabilistic Sampling: To enable diverse text generation, a word must be sampled from a secret-shared probability distribution. This can be done by comparing a randomly sampled value against cumulative sums of probabilities in a secret-shared manner, followed by carefully constructed multiplexers.

4.2. Core Methodology In-depth (Layer by Layer)

CipherGPT implements the GPT inference workflow, shown in Figure 2 of the original paper, by securing each component.

4.2.1. VOLE-based Matrix Multiplication

MatrixMul ($\mathbf{Z} = \mathbf{X}\mathbf{Y}$) is a fundamental operation in GPT. For autoregressive generation, GPT takes a sentence as input, generates a response word, then adds that word to the input sentence, and repeats the process. This means MatrixMul operations are performed repeatedly, each time with a new input matrix $\mathbf{X}$ but the same weight matrix $\mathbf{Y}$. CipherGPT exploits this by batching multiple MatrixMul operations into a single unbalanced MatrixMul to reduce the amortized cost.

Let $\mathbf{X} \in \mathbb{Z}_{2^l}^{n \times m}$ and $\mathbf{Y} \in \mathbb{Z}_{2^l}^{m \times k}$. The result is $\mathbf{Z} \in \mathbb{Z}_{2^l}^{n \times k}$. The calculation can be expressed as a sum of outer products: $ \mathbf{Z} = \sum_{i=1}^{m} (\mathbf{x}_i \otimes \mathbf{y}_i') $ where $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_m]$ (columns of $\mathbf{X}$) and $\mathbf{Y}^T = [\mathbf{y}_1', \mathbf{y}_2', \dots, \mathbf{y}_m']$ (rows of $\mathbf{Y}$).

Suppose $S$ and $C$ need to generate $t$ response words, leading to $t$ input matrices $\mathbf{X}_1, \mathbf{X}_2, \dots, \mathbf{X}_t$, with each $\mathbf{X}_j = [\mathbf{x}_{j,1}, \mathbf{x}_{j,2}, \dots, \mathbf{x}_{j,m}]$. The key idea is to concatenate the corresponding columns of all $\mathbf{X}_j$ matrices: $\mathbf{x}_i' = \mathbf{x}_{1,i} \| \mathbf{x}_{2,i} \| \dots \| \mathbf{x}_{t,i}$ for each $i \in [1, m]$. Then, the outer product of this concatenated vector with $\mathbf{y}_i'$ becomes: $ \mathbf{x}_i' \otimes \mathbf{y}_i' = (\mathbf{x}_{1,i} \otimes \mathbf{y}_i') \| (\mathbf{x}_{2,i} \otimes \mathbf{y}_i') \| \dots \| (\mathbf{x}_{t,i} \otimes \mathbf{y}_i') $ Summing these outer products gives the concatenated results of all $t$ MatrixMul operations: $ \sum_{i=1}^{m} (\mathbf{x}_i' \otimes \mathbf{y}_i') = \mathbf{Z}_1 \| \mathbf{Z}_2 \| \dots \| \mathbf{Z}_t $

The sVOLE protocol is used to implement this batched MatrixMul.

  • Preprocessing Phase: $S$ and $C$ generate $m$ sVOLE correlations. For each $i \in [1, m]$: $ \mathbf{W}_i = \mathbf{u}_i \otimes \mathbf{y}_i' + \mathbf{V}_i $ Here, $C$ holds $\mathbf{u}_i \in \mathbb{Z}_{2^l}^{t \cdot n}$ (a vector of length $t \cdot n$) and $\mathbf{V}_i \in \mathbb{Z}_{2^l}^{(t \cdot n) \times k}$. $S$ holds $\mathbf{y}_i' \in \mathbb{Z}_{2^l}^{k}$ and $\mathbf{W}_i \in \mathbb{Z}_{2^l}^{(t \cdot n) \times k}$.

    • Note: $\otimes$ here denotes the Kronecker product (outer product), which results in a matrix.
  • Online Phase: For an input matrix $\mathbf{X}_j = [\mathbf{x}_{j,1}, \mathbf{x}_{j,2}, \dots, \mathbf{x}_{j,m}]$ (for the $j$-th response word):

    1. $C$ computes a masked version of its share of $\mathbf{x}_{j,i}$ by subtracting the relevant portion of $\mathbf{u}_i$: $ \langle \mathbf{x}_{j,i} \rangle_S := \mathbf{x}_{j,i} - \mathbf{u}_i[(j-1)n+1, \dots, j \cdot n] \quad \forall i \in [1, m] $ $C$ then sends these masked values to $S$.
    2. $S$ receives $\langle \mathbf{x}_{j,i} \rangle_S$ and locally computes an outer product with its known $\mathbf{y}_i'$: $ \langle \mathbf{x}_{j,i} \rangle_S \otimes \mathbf{y}_i' = (\mathbf{x}_{j,i} - \mathbf{u}_i[(j-1)n+1, \dots, j \cdot n]) \otimes \mathbf{y}_i' = \mathbf{x}_{j,i} \otimes \mathbf{y}_i' - \mathbf{u}_i[(j-1)n+1, \dots, j \cdot n] \otimes \mathbf{y}_i' $
    3. $S$ and $C$ then use $\mathbf{W}_i$ and $\mathbf{V}_i$ to reconstruct shares of $\mathbf{x}_{j,i} \otimes \mathbf{y}_i'$. Rearranging: $ \mathbf{x}_{j,i} \otimes \mathbf{y}_i' = \langle \mathbf{x}_{j,i} \rangle_S \otimes \mathbf{y}_i' + \mathbf{u}_i[(j-1)n+1, \dots, j \cdot n] \otimes \mathbf{y}_i' $ From the sVOLE correlation, $\mathbf{u}_i \otimes \mathbf{y}_i' = \mathbf{W}_i - \mathbf{V}_i$. So $S$ and $C$ obtain shares:
      • $S$ holds: $\langle \mathbf{x}_{j,i} \rangle_S \otimes \mathbf{y}_i' + \mathbf{W}_i[(j-1)kn+1, \dots, j \cdot k \cdot n]$
      • $C$ holds: $-\mathbf{V}_i[(j-1)kn+1, \dots, j \cdot k \cdot n]$ (the negation makes the two shares sum to $\mathbf{x}_{j,i} \otimes \mathbf{y}_i'$). They can then locally sum these shares over $i$ to obtain secret shares of $\mathbf{Z}_j = \sum_{i=1}^m (\mathbf{x}_{j,i} \otimes \mathbf{y}_i')$.
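
The share arithmetic of this online phase can be checked with a small plaintext mock (our sketch, not the paper's implementation): the sVOLE correlations are simulated with a local RNG, whereas a real run would obtain them from a protocol such as Ferret, and the dimensions are toy values.

```python
# Plaintext mock of the sVOLE-based batched MatrixMul (numpy), checking
# only the share arithmetic. A real deployment obtains the correlations
# W_i = u_i (x) y_i' + V_i from an sVOLE protocol rather than a local RNG.
import numpy as np

rng = np.random.default_rng(0)
MOD = 1 << 16
t, n, m, k = 3, 2, 4, 5                     # t iterations of (n x m)·(m x k)

Y = rng.integers(0, MOD, (m, k))            # S's weights; rows are y_i'
Xs = [rng.integers(0, MOD, (n, m)) for _ in range(t)]   # C's inputs X_1..X_t

# --- Preprocessing: one correlation per column index i ---
U = rng.integers(0, MOD, (m, t * n))        # C holds u_i (length t*n each)
V = rng.integers(0, MOD, (m, t * n, k))     # C holds V_i
W = (np.einsum('ij,ik->ijk', U, Y) + V) % MOD   # S holds W_i = u_i (x) y_i' + V_i

# --- Online phase for the j-th response word ---
def matmul_shares(j):
    Zs = np.zeros((n, k), dtype=np.int64)   # S's share of Z_j
    Zc = np.zeros((n, k), dtype=np.int64)   # C's share of Z_j
    for i in range(m):
        x_ji = Xs[j][:, i]                              # i-th column of X_j
        masked = (x_ji - U[i, j*n:(j+1)*n]) % MOD       # C -> S: one message
        Zs = (Zs + np.outer(masked, Y[i]) + W[i, j*n:(j+1)*n]) % MOD
        Zc = (Zc - V[i, j*n:(j+1)*n]) % MOD             # C negates its V slice
    return Zs, Zc

for j in range(t):
    Zs, Zc = matmul_shares(j)
    assert np.array_equal((Zs + Zc) % MOD, (Xs[j] @ Y) % MOD)
```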

4.2.2. Spline-based GELU

The GELU (Gaussian Error Linear Unit) activation function is used in GPT. Its formula is: $ \mathsf{GELU}(x) = 0.5x\left(1 + \mathsf{Tanh}\left[\sqrt{2/\pi}\,(x + 0.044715\,x^3)\right]\right) $ The paper's secure GELU protocol (Algorithm 1) simplifies its secure computation.
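
For reference, the formula can be evaluated directly in the clear; this tiny function (ours, not part of the protocol) serves as ground truth when judging approximation error:

```python
# Reference (insecure) evaluation of the GELU formula above; useful as
# ground truth when checking a secure approximation's precision.
import math

def gelu(x: float) -> float:
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

print(gelu(1.0))   # ~0.8412
```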

The intuition (as shown in Figure 1 from the original paper) is to divide the GELU curve into three main parts:

  1. $x < -\alpha$: $y = 0$

  2. $-\alpha \le x \le \alpha$: $y = \mathsf{GELU}(x)$ (approximated by splines)

  3. $x > \alpha$: $y = x$

    To simplify the approximation in the middle part, the entire curve is effectively right-shifted by $\alpha$, transforming the interval $[-\alpha, \alpha]$ to $[0, 2\alpha]$. This allows a single lookup table operation.

Algorithm 1: Secure GELU Input: $S$ & $C$ hold $\langle x \rangle^l$ (secret shares of $x$ with $l$ bits), a public value $\alpha$ (for splitting), and a lookup table of size $2^s$. Output: $S$ & $C$ get $\langle y \rangle^l$ for $y = \mathsf{GELU}(x)$.

  1. Scale $\alpha$: Let $\alpha' := 2^L \alpha$.
    • Explanation: The input $x$ to the model is assumed to be scaled up by $2^L$ (e.g., floating points converted to integers with $L$ fractional bits). To maintain correct alignment for comparisons and arithmetic, the public split value $\alpha$ is also scaled by $2^L$.
  2. Shift Input: $S$ & $C$ (locally) compute $\langle x' \rangle^l := \langle x \rangle^l + \alpha'$.
    • Explanation: This step performs the conceptual right-shift of the GELU curve. By adding $\alpha'$ to $\langle x \rangle^l$, the input $x$ is transformed to $x' = x + \alpha'$. Since $\alpha'$ is a public value, this can be done locally: one party adds $\alpha'$ to its share while the other leaves its share unchanged, preserving the additive secret sharing.
  3. Define Range for Approximation: Let $\beta := 2\alpha'$.
    • Explanation: This defines the new upper bound for the approximation interval, which is now $[0, \beta]$.
  4. Determine Bit-length for Interval Indexing: Let $h := \log \beta$.
    • Explanation: $h$ is the number of bits required to represent values within the range $[0, \beta]$ (assuming $\beta$ is a power of 2, or adjusted otherwise). This helps in extracting the relevant bits for lookup table indexing.
  5. Extract Lower Bits: $S$ & $C$ (locally) extract the lower $h$ bits of $\langle x' \rangle^l$ to get $\langle x' \rangle^h$.
    • Explanation: Since $x' \in [0, \beta]$ (under the initial assumption for the spline part), only the lower $h$ bits of $x'$ are relevant for determining the interval. This operation can be done locally if $\beta$ is a power of 2; otherwise, secure truncation is used.
  6. Find Interval Index: $S$ & $C$ invoke $\langle i \rangle^s \gets \mathsf{F_{TR}}(\langle x' \rangle^h, h-s)$.
    • Explanation: The $\mathsf{F_{TR}}$ (Truncate-then-Reduce) ideal functionality takes the $h$-bit value $\langle x' \rangle^h$ and right-shifts it by $h-s$ bits, effectively extracting the most significant $s$ bits. These $s$ bits represent the index $i \in \mathbb{Z}_{2^s}$ of the small interval that $x'$ belongs to within $[0, \beta]$. The output $\langle i \rangle^s$ is secret-shared.
    • $\mathsf{F_{TR}}(\langle x \rangle^l, s)$: Takes secret-shared $x$ of $l$ bits and a public shift amount $s$. Returns secret-shared $y$ of $l-s$ bits, where $y = x \gg s$. This functionality truncates the value by right-shifting and reduces the bit-length of the ring.
  7. Lookup Coefficients: $S$ & $C$ invoke $(\langle a_i \rangle^l, \langle d_i \rangle^l) \gets \mathsf{F_{LUT}}(T, \langle i \rangle^s)$.
    • Explanation: $S$ holds a public table $T$ containing the coefficients $(a_i, d_i)$ for each linear spline segment $y = a_i x + d_i$. Using the secret-shared index $\langle i \rangle^s$, $S$ and $C$ execute the $\mathsf{F_{LUT}}$ (Lookup Table) functionality to retrieve the $i$-th entry of $T$ in secret-shared form.
    • $\mathsf{F_{LUT}}(T, \langle i \rangle)$: Takes a public table $T$ and a secret-shared index $\langle i \rangle$. Returns secret-shared $T[i]$.
  8. Compute the $a_i x'$ Term: $S$ & $C$ invoke $\langle a_i x' \rangle^l \gets \mathsf{F_{Mult}}(\langle a_i \rangle^l, \langle x' \rangle^l)$.
    • Explanation: They securely multiply the retrieved secret-shared coefficient $\langle a_i \rangle^l$ with the shifted input $\langle x' \rangle^l$ using the $\mathsf{F_{Mult}}$ (Multiplication) ideal functionality. The result is at a higher scale ($2L$ fractional bits, since $a_i$ and $x'$ are both $L$-scaled).
    • $\mathsf{F_{Mult}}(\langle x \rangle^g, \langle y \rangle^h)$: Takes secret-shared $x$ of $g$ bits and secret-shared $y$ of $h$ bits. Returns secret-shared $z = x \cdot y$ of $g+h$ bits.
  9. Truncate Product: $S$ & $C$ invoke $\langle k \rangle^l \gets \mathsf{F_{Trunc}}(\langle a_i x' \rangle^l, L)$.
    • Explanation: The result from the previous step is at a higher fixed-point scale. This step truncates it by right-shifting $L$ bits to bring it back to the desired fixed-point representation.
    • $\mathsf{F_{Trunc}}(\langle x \rangle^l, s)$: Takes secret-shared $x$ of $l$ bits and a public shift amount $s$. Returns secret-shared $y = x \gg s$ of $l$ bits; the value is effectively scaled down.
  10. Add the $d_i$ Term: $S$ & $C$ (locally) compute $\langle z \rangle^l := \langle k \rangle^l + \langle d_i \rangle^l$.
    • Explanation: The final step of the linear approximation $y = a_i x' + d_i$ is completed by adding the secret-shared coefficient $\langle d_i \rangle^l$ to $\langle k \rangle^l$. This can be done locally. $\langle z \rangle^l$ is the candidate GELU output under the assumption $x' \in [0, \beta]$.
  11. Compare $x'$ with $\beta$: $S$ & $C$ invoke $\langle b \rangle^1 \gets \mathsf{F_{CMP}}(\langle x' \rangle^l, \beta)$.
    • Explanation: This compares the shifted input $\langle x' \rangle^l$ with $\beta$ (the upper bound of the spline approximation range). $\langle b \rangle^1$ will be 1 if $x' \geq \beta$ (i.e., $x > \alpha$), and 0 otherwise.
    • $\mathsf{F_{CMP}}(\langle x \rangle^l, \langle y \rangle^l)$: Takes secret-shared $x$ and $y$ (both $l$ bits). Returns a secret-shared bit $b=1$ if $x \ge y$, $b=0$ otherwise.
  12. Compare $x'$ with 0: $S$ & $C$ invoke $\langle b' \rangle^1 \gets \mathsf{F_{CMP}}(\langle x' \rangle^l, 0)$.
    • Explanation: This compares the shifted input $\langle x' \rangle^l$ with 0 (the lower bound of the spline approximation range). $\langle b' \rangle^1$ will be 1 if $x' \geq 0$ (i.e., $x \ge -\alpha$), and 0 otherwise.
  13. Multiplex Between $z$ and 0: $S$ & $C$ invoke $\langle u \rangle^l \gets \mathsf{F_{MUX}}(\langle z \rangle^l, \langle b \rangle^1 \oplus \langle b' \rangle^1)$.
    • Explanation: This multiplexer selects between the approximated value $\langle z \rangle^l$ and 0.
      • If $x' \in [0, \beta]$ (i.e., $-\alpha \le x \le \alpha$), then $b=0$ and $b'=1$, so $b \oplus b' = 1$. The output $u$ becomes $z$.
      • If $x' < 0$ (i.e., $x < -\alpha$), then $b=0$ and $b'=0$, so $b \oplus b' = 0$. The output $u$ becomes 0.
      • If $x' \geq \beta$ (i.e., $x > \alpha$), then $b=1$ and $b'=1$, so $b \oplus b' = 0$. The output $u$ becomes 0.
      • $\mathsf{F_{MUX}}(\langle x \rangle^l, \langle b \rangle^1)$: Takes secret-shared $x$ ($l$ bits) and a secret-shared control bit $b$. Returns secret-shared $y=x$ if $b=1$, and $y=0$ if $b=0$.
  14. Multiplex Between $x$ and 0: $S$ & $C$ invoke $\langle v \rangle^l \gets \mathsf{F_{MUX}}(\langle x \rangle^l, \langle b \rangle^1)$.
    • Explanation: This multiplexer selects between the original input $\langle x \rangle^l$ and 0.
      • If $x' \geq \beta$ (i.e., $x > \alpha$), then $b=1$ and the output $v$ becomes $x$.
      • Otherwise ($x' < \beta$), $b=0$ and the output $v$ becomes 0.
    • This handles the case where $\mathsf{GELU}(x)$ should be $x$.
  15. Combine Results: $S$ & $C$ (locally) compute $\langle y \rangle^l := \langle u \rangle^l + \langle v \rangle^l$.
    • Explanation: This final local addition combines the outputs of the two multiplexers to produce the correct GELU result:
      • If $x < -\alpha$: $u=0, v=0 \Rightarrow y=0$.
      • If $-\alpha \le x \le \alpha$: $u=z, v=0 \Rightarrow y=z$ (spline approximation).
      • If $x > \alpha$: $u=0, v=x \Rightarrow y=x$.
    • This correctly implements the three-part GELU function.
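
The full control flow of Algorithm 1 can be mocked in the clear as follows (our sketch: the secant-line table construction and the parameters $\alpha = 4$, $s = 6$ are illustrative assumptions, not necessarily the paper's choices):

```python
# Plaintext mock of Algorithm 1's control flow: a table of 2^s linear
# segments over the shifted interval [0, 2*alpha] plus the two boundary
# cases. Per-segment coefficients are fit by secant lines here; the
# paper's table construction may differ.
import math

ALPHA, S_BITS = 4.0, 6
SEGS = 1 << S_BITS
WIDTH = 2 * ALPHA / SEGS

def gelu(x):
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))

# LUT entry i covers x' in [i*WIDTH, (i+1)*WIDTH); it stores (a_i, d_i)
TABLE = []
for i in range(SEGS):
    lo, hi = i * WIDTH - ALPHA, (i + 1) * WIDTH - ALPHA   # back in x-coords
    a = (gelu(hi) - gelu(lo)) / (hi - lo)
    d = gelu(lo) - a * (lo + ALPHA)                       # so y = a*x' + d
    TABLE.append((a, d))

def gelu_spline(x):
    xp = x + ALPHA                      # step 2: shift the curve right
    if xp < 0:
        return 0.0                      # x < -alpha  ->  0
    if xp >= 2 * ALPHA:
        return x                        # x >  alpha  ->  x
    a, d = TABLE[int(xp / WIDTH)]       # steps 6-7: index, then lookup
    return a * xp + d                   # steps 8-10: a_i * x' + d_i

err = max(abs(gelu_spline(v / 100) - gelu(v / 100)) for v in range(-600, 600))
print(f"max abs error on [-6, 6): {err:.2e}")
```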

4.2.3. Secure Top-K Selection

In the vec2word layer, GPT generates a vector of probabilities for all possible words. From this, the top-K largest probabilities need to be selected. The paper proposes Algorithm 2: Secure TopK.

Algorithm 2: Secure TopK Input: $S$ & $C$ hold $\langle \mathbf{x} \rangle^l$ (secret shares of a vector $\mathbf{x}$ of length $n$). Output: $S$ & $C$ get $\langle \mathbf{y} \rangle^l$ (secret shares of a vector $\mathbf{y}$ of length $K$ containing the $K$ largest values of $\mathbf{x}$).

  1. Secure Shuffle: $S$ & $C$ invoke $\langle \mathbf{x}' \rangle \gets \mathsf{F_{Shuffle}}(\langle \mathbf{x} \rangle)$.
    • Explanation: The input vector $\mathbf{x}$ is first securely shuffled using the $\mathsf{F_{Shuffle}}$ ideal functionality. This is crucial: subsequent comparisons will reveal the relative order of elements, but because the elements have been randomly permuted, these comparisons do not leak information about the original elements' values or their original positions.
    • $\mathsf{F_{Shuffle}}(\langle \mathbf{x} \rangle^l, \langle \pi \rangle)$: Takes a secret-shared vector $\mathbf{x}$ and a secret-shared permutation $\pi$. Returns $\langle \pi(\mathbf{x}) \rangle^l$. (The implemented protocol composes permutations chosen by each party separately.)
    • Assumption: Elements in $\mathbf{x}$ are assumed to be distinct. This is typically achieved by appending a unique index (1 to $n$) to each element before shuffling and comparison, which is truncated after selection.
  2. Select Top-K: $\langle \mathbf{y} \rangle \gets \text{select}(\langle \mathbf{x}' \rangle, K)$.
    • Explanation: The select function (Lines 3-21) is a modified quicksort-like algorithm that identifies the top-K elements from the shuffled vector $\langle \mathbf{x}' \rangle$.

Function select($\langle \mathbf{x}' \rangle$, $K$):

  3. $n := |\langle \mathbf{x}' \rangle|$.
  4. Base Case: If $n \le K$, return $\langle \mathbf{x}' \rangle$.
    • Explanation: If the current sub-vector has $K$ or fewer elements, all of them belong to the top-K result, so no further selection is needed.
  5. Choose Pivot: Let $\langle \text{pivot} \rangle := \langle \mathbf{x}' \rangle[n-1]$.
    • Explanation: The last element of the current sub-vector is chosen as the pivot for partitioning.
  6. Initialize Partitions: Let $S_L := \text{empty list}$, $S_R := \text{empty list}$.
  7. Partitioning Loop: for $j := 0$ to $n-2$ do
  8. Secure Comparison: $\langle b \rangle^1 \gets \mathsf{F_{CMP}}(\langle \mathbf{x}' \rangle[j], \langle \text{pivot} \rangle)$.
    • Explanation: Each element $\langle \mathbf{x}' \rangle[j]$ (except the pivot) is securely compared with the pivot using $\mathsf{F_{CMP}}$; $\langle b \rangle^1$ is 1 if $\mathbf{x}'[j] \ge \text{pivot}$, and 0 otherwise.
  9. Reveal Comparison Result: Reveal $b$ to $S$ and $C$.
    • Explanation: This is a critical step. The comparison result $b$ can be revealed because $\mathbf{x}'$ is a shuffled version of the original input: knowing whether a shuffled element is greater or smaller than a shuffled pivot leaks nothing about the original elements or their original positions.
  10. Populate $S_L$ or $S_R$: If $b=1$, add $\langle \mathbf{x}' \rangle[j]$ to $S_R$; else, add it to $S_L$.
    • Explanation: Based on the revealed bit, each element is deterministically placed into $S_R$ (elements $\ge$ pivot) or $S_L$ (elements $<$ pivot). This partitioning is performed locally by $S$ and $C$ once $b$ is revealed.
  11. end for
  12. Add Pivot to $S_R$: Add $\langle \text{pivot} \rangle$ to $S_R$.
    • Explanation: The pivot element itself belongs to the "greater than or equal to" partition.
  13. Determine Size of $S_R$: Let $K' := |S_R|$.
  14. Recursive Steps:
  15. If $K' = K$: return $S_R$.
    • Explanation: $S_R$ contains exactly the $K$ largest elements; selection is complete.
  16. If $K' > K$: return $\text{select}(S_R, K)$.
    • Explanation: The top-K elements must all lie in $S_R$, so the function recurses on $S_R$ to find the $K$ largest within it.
  17. If $K' < K$: return $S_R \cup \text{select}(S_L, K - K')$.
    • Explanation: All elements of $S_R$ belong to the top-K; the remaining $K - K'$ elements are found by recursing on $S_L$.

    This `select` function requires $O(n)$ secure comparisons in expectation, a significant improvement over the $O(n \log n)$ required for full sorting.
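
A plaintext mock of the recursion (ours) makes the cost argument tangible; every `>=` below corresponds to one $\mathsf{F_{CMP}}$ invocation whose result may safely be revealed thanks to the preceding shuffle:

```python
# Plaintext mock of `select`: only the partition that can still hold
# top-K elements is recursed into, giving O(n) comparisons in
# expectation instead of O(n log n) for a full sort.
import random

def select_top_k(xs, k):
    if len(xs) <= k:                              # base case (line 4)
        return xs
    pivot = xs[-1]                                # pivot (line 5)
    right = [v for v in xs[:-1] if v >= pivot]    # one CMP per element
    left = [v for v in xs[:-1] if v < pivot]
    right.append(pivot)
    if len(right) == k:                           # K' = K
        return right
    if len(right) > k:                            # K' > K: recurse right
        return select_top_k(right, k)
    return right + select_top_k(left, k - len(right))   # K' < K

xs = random.sample(range(10_000), 50)             # distinct, as the paper assumes
assert sorted(select_top_k(xs, 5)) == sorted(xs)[-5:]
```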

4.2.4. Secure Sampling

After Top-K selection, a word must be sampled from the KK selected probabilities. Algorithm 3 describes this secure sampling protocol.

Algorithm 3: Secure Sampling Input: $S$ & $C$ hold $\langle \mathbf{x} \rangle$ (secret shares of a vector $\mathbf{x}$ of $K$ probabilities, scaled by $2^L$). Output: $S$ & $C$ get $\langle j \rangle$ (secret shares of an index $j \in [1, K]$, where $\Pr(j=i) = x_i / \sum_{k=1}^K x_k$).

The protocol is based on the idea that if a random value $p'$ is sampled from $[0, \sum x_k]$, the selected index $j$ satisfies $\sum_{k=1}^{j-1} x_k \leq p' < \sum_{k=1}^{j} x_k$.

  1. Sample Random Value: $S$ samples $v \gets [0, 2^L-1]$ with $v \in \mathbb{Z}_{2^l}$.
    • Explanation: $S$ generates a random number $v$ within the range of the scaled probabilities. It is acceptable for $S$ to sample this value alone because the final output index $j$ remains secret from $S$.
  2. Initialize Cumulative Sum: $S$ & $C$ (locally) initialize $\langle s_0 \rangle := 0$.
    • Explanation: $\langle s_i \rangle$ will store the cumulative sum of probabilities up to $x_i$.
  3. Compute Cumulative Sums and Comparisons: for $i := 1$ to $K-1$ do
  4. $S$ & $C$ (locally) compute $\langle s_i \rangle := \langle x_i \rangle + \langle s_{i-1} \rangle$.
    • Explanation: They locally compute the cumulative sum up to the $i$-th probability.
  5. $S$ & $C$ invoke $\langle b_i \rangle^1 \gets \mathsf{F_{CMP}}(\langle v \rangle, \langle s_i \rangle)$.
    • Explanation: They securely compare the random value $\langle v \rangle$ with the cumulative sum $\langle s_i \rangle$. $\langle b_i \rangle^1$ is 1 if $v \ge s_i$, and 0 otherwise.
  6. end for
    • Explanation: After this loop, a secret-shared bit vector $\langle \mathbf{b} \rangle = (\langle b_1 \rangle, \dots, \langle b_{K-1} \rangle)$ is obtained. This vector has $b_i = 1$ for all $i < j$ (where $j$ is the sampled index) and $b_i = 0$ for all $i \ge j$.
  7. Initialize Auxiliary Bits: $S$ & $C$ (locally) initialize $\langle b_0 \rangle^1 := 1$ and $\langle b_K \rangle^1 := 0$.
    • Explanation: These boundary conditions simplify the next step.
  8. Derive Indicator Bits: for $i := 1$ to $K$ do
  9. $S$ & $C$ (locally) compute $\langle b'_i \rangle^1 := \langle b_{i-1} \rangle^1 \oplus \langle b_i \rangle^1$.
    • Explanation: This operation effectively identifies the unique index $j$ where the random value $v$ falls.
      • If $s_{i-1} \le v < s_i$, then $\langle b_{i-1} \rangle^1 = 1$ and $\langle b_i \rangle^1 = 0$, so their XOR is 1.
      • For any other $i$, either both bits are 1 or both are 0, so their XOR is 0. Thus, $\langle b'_i \rangle^1$ is 1 only for the sampled index $i=j$, and 0 for all other indices.
  10. end for
  11. Compute Sampled Index: $S$ & $C$ compute $\langle j \rangle := \sum_{i=1}^K \mathsf{F_{MUX}}(i, \langle b'_i \rangle^1)$.
    • Explanation: A multiplexer is invoked for each index $i$: if $\langle b'_i \rangle^1$ is 1 (meaning $i$ is the sampled index), it selects $i$; otherwise, it selects 0. Summing these results yields the sampled index $\langle j \rangle$.
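
The bit manipulation in steps 7-11 is easy to sanity-check in the clear (our mock: `random.randrange` stands in for $S$'s secret draw, and plain integers stand in for shares):

```python
# Plaintext mock of Algorithm 3: comparing a random v against the
# cumulative sums yields prefix bits b_1..b_{K-1}; XORing adjacent bits
# isolates the sampled index. In the protocol the comparisons and MUXes
# are F_CMP / F_MUX on shares.
import random

def sample_index(probs):               # probs: fixed-point weights
    total = sum(probs)                 # ~2^L after softmax scaling
    v = random.randrange(total)        # S's random draw
    b = [1]                            # b_0 := 1
    s = 0
    for x in probs[:-1]:               # K-1 comparisons v >= s_i
        s += x
        b.append(1 if v >= s else 0)
    b.append(0)                        # b_K := 0
    # b'_i = b_{i-1} XOR b_i is 1 exactly at the sampled index
    return sum(i * (b[i - 1] ^ b[i]) for i in range(1, len(probs) + 1))

counts = [0, 0, 0]
for _ in range(30_000):
    counts[sample_index([512, 256, 256]) - 1] += 1
print(counts)   # roughly proportional to 2 : 1 : 1
```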

Mapping to Response Word: The index $\langle j \rangle$ (from Line 11 of Algorithm 3) corresponds to an element in the shuffled and Top-K selected vector. To get the actual word, this index needs to be mapped back to the original vocabulary index.

  1. Inverse Shuffle for S's Permutation: The Top-K selection process reveals the index of the selected element in the shuffled vector. Let $t_i$ be the index in the shuffled vector corresponding to the $i$-th element in the Top-K output. The paper states that $S$ computes $i' := \pi_S^{-1}(t_i)$ and secret-shares $i'$. (Here, $\pi_S$ is the permutation chosen by $S$ during shuffling, as described in $\mathsf{F_{Shuffle}}$.) The final summation in Algorithm 3 (Line 11) is then modified to: $ \langle j \rangle := \sum_{i=1}^K \mathsf{F_{MUX}}(\langle i' \rangle, \langle b'_i \rangle^1) $ Now, $S$ does not know $j$ (because $v$ is secret) and $C$ does not know $i'$ (because $\pi_S^{-1}$ is known only to $S$).
  2. Inverse Shuffle for C's Permutation: Once $\langle j \rangle$ is revealed to $C$, $C$ computes $j' := \pi_C^{-1}(j)$, where $\pi_C$ is the permutation chosen by $C$ during shuffling. This $j'$ is the correct index in the original word vector (vocabulary), allowing $C$ to retrieve the final response word.

4.2.5. CipherGPT Framework Integration (Section VII)

The CipherGPT framework integrates these protocols to secure the entire GPT inference pipeline:

The architecture and workflow of GPT (Figure 2 from the paper) shows the data flow through different layers:


  • Input Sequence: Text input from the user.
  • Word Embeddings: Each word is converted into a numerical vector.
  • Position Embeddings: Positional information is added to the word embeddings.
  • Multiple Transformer Decoders: These are the core processing units, each consisting of:
    • Masked Self-Attention
    • Feed-Forward Neural Network
    • Layer Normalization
  • Vec2Word Layer: The final layer that outputs the predicted word.

A. Embedding

This layer maps input words to word embedding vectors and augments them with position embedding vectors.

  1. $S$ encrypts all rows of the embedding matrix using AHE (e.g., an RLWE ciphertext $E(\mathbf{w}_i)$ for each row $\mathbf{w}_i$) and sends the ciphertexts to $C$. This is a one-time preprocessing step.
  2. $C$ (knowing its input words) identifies the corresponding ciphertexts. For each, $C$ homomorphically adds a random vector $\mathbf{r}_i$ to the ciphertext and sends $E(\mathbf{w}_i + \mathbf{r}_i)$ back to $S$.
  3. $S$ decrypts to get $\mathbf{w}_i + \mathbf{r}_i$, then locally adds the corresponding position embedding vector $\mathbf{p}_i$, obtaining $\mathbf{w}_i + \mathbf{r}_i + \mathbf{p}_i$.
  4. The embedding vector for each word, $\mathbf{x}_i$, is now secret-shared: $\langle \mathbf{x}_i \rangle_C = -\mathbf{r}_i$ and $\langle \mathbf{x}_i \rangle_S = \mathbf{w}_i + \mathbf{r}_i + \mathbf{p}_i$.
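
The share bookkeeping of this exchange can be verified with plain modular arithmetic standing in for AHE (our toy sketch; in the real protocol the `masked` values are RLWE ciphertexts):

```python
# Share bookkeeping for the embedding step, with AHE replaced by plain
# modular addition purely to check the arithmetic; a real run uses RLWE
# ciphertexts E(w_i) and homomorphic addition of C's mask r_i.
import secrets

MOD = 1 << 32

def embed_shares(w_i, p_i):
    r_i = [secrets.randbelow(MOD) for _ in w_i]             # C's random mask
    masked = [(w + r) % MOD for w, r in zip(w_i, r_i)]      # "E(w_i + r_i)" to S
    share_S = [(x + p) % MOD for x, p in zip(masked, p_i)]  # S adds p_i
    share_C = [(-r) % MOD for r in r_i]                     # C keeps -r_i
    return share_S, share_C

w, p = [7, 11], [100, 200]
s, c = embed_shares(w, p)
assert [(a + b) % MOD for a, b in zip(s, c)] == [107, 211]  # = w_i + p_i
```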

B. Layer Normalization

Layer Normalization normalizes each element $x_i$ of an input vector $\mathbf{x} \in \mathbb{Z}_{2^l}^m$ according to: $ x_i := \frac{x_i - \mathrm{E}[\mathbf{x}]}{\sqrt{\mathrm{Var}[\mathbf{x}] + \epsilon}} \cdot \gamma + \beta $ where $\mathrm{E}[\mathbf{x}] = \frac{1}{m} \sum x_i$ is the mean, $\mathrm{Var}[\mathbf{x}] = \frac{1}{m-1} \sum (x_i - \mathrm{E}[\mathbf{x}])^2$ is the variance, $\gamma$ and $\beta$ are learnable parameters, and $\epsilon$ prevents division by zero. The secure computation proceeds as:

  1. Compute variance terms: $\mathsf{F_{Mult}}$ for each $var_i := (x_i - \mathrm{E}[\mathbf{x}])^2$.
  2. Compute inverse square root: $\mathsf{F_{LUT}}$ to compute $\frac{1}{\sqrt{\mathrm{Var}[\mathbf{x}] + \epsilon}}$.
  3. Compute scaled difference: $\mathsf{F_{Mult}}$ for $\frac{x_i - \mathrm{E}[\mathbf{x}]}{\sqrt{\mathrm{Var}[\mathbf{x}] + \epsilon}}$.
  4. Apply gamma: $\mathsf{F_{Mult}}$ for $\frac{x_i - \mathrm{E}[\mathbf{x}]}{\sqrt{\mathrm{Var}[\mathbf{x}] + \epsilon}} \cdot \gamma$.
  5. Apply beta and Truncate: $S$ and $C$ locally add $\beta$, then run $\mathsf{F_{TR}}$ to reduce the scale to $L$ bits and truncate the width to $l$ bits.
    • Note: The paper mentions using BOLE-based $\mathsf{F_{Mult}}$ for multiplication and performing truncation only once, at the end, for efficiency (see the fixed-point sketch below).
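
A fixed-point rendition of this pipeline in the clear (our sketch) shows how the scales line up when truncation is deferred to the end; `math.isqrt` stands in for the $\mathsf{F_{LUT}}$-based inverse square root, and $L = 12$ is an arbitrary choice:

```python
# Fixed-point layer normalization in the clear, mirroring the order of
# operations above: multiplications accumulate one factor of 2^L per
# operand and are truncated at the end.
import math

L = 12
SCALE = 1 << L

def layernorm_fixed(x, gamma, beta, eps=1):   # all values scaled by 2^L
    m = len(x)
    mean = sum(x) // m
    var = sum((v - mean) ** 2 for v in x) // ((m - 1) * SCALE)    # scale 2^L
    inv_std = (SCALE * SCALE) // math.isqrt((var + eps) * SCALE)  # ~2^L / std
    return [((v - mean) * inv_std // SCALE) * gamma // SCALE + beta for v in x]

x = [int(v * SCALE) for v in (1.0, 2.0, 3.0, 4.0)]
print([v / SCALE for v in layernorm_fixed(x, gamma=SCALE, beta=0)])
# ~[-1.16, -0.39, 0.39, 1.16]
```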

C. Masked Self-Attention

Self-attention involves computing Query (Q), Key (K), and Value (V) matrices, scoring, masking, softmax, and output generation.

  1. Q, K, V Matrices: $\mathbf{Q} := \mathbf{X}\mathbf{W}_Q$, $\mathbf{K} := \mathbf{X}\mathbf{W}_K$, $\mathbf{V} := \mathbf{X}\mathbf{W}_V$. Here $\mathbf{X} \in \mathbb{Z}_{2^l}^{n \times m}$ is the input, and $\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V \in \mathbb{Z}_{2^l}^{m \times m}$ are model weights.
    • These MatrixMuls use the same weights (e.g., $\mathbf{W}_Q$) across all $t$ autoregressive steps, so CipherGPT applies its sVOLE-based MatrixMul (Section 4.2.1) here.
    • After the MatrixMul, $\mathsf{F_{Trunc}}$ is used to maintain $L$-bit scaling.
  2. Multi-headed Attention: $\langle \mathbf{Q} \rangle, \langle \mathbf{K} \rangle, \langle \mathbf{V} \rangle$ are partitioned into $M$ segments (attention heads) $\langle \mathbf{q}_i \rangle, \langle \mathbf{k}_i \rangle, \langle \mathbf{v}_i \rangle$, each of size $n \times m'$, where $m' = m/M$. This is a local operation.
  3. Score Matrix: $\mathbf{s}_i := \mathbf{q}_i \mathbf{k}_i^T$ for each head $i$.
    • Since $\mathbf{q}_i$ and $\mathbf{k}_i$ are not known beforehand (they are outputs of previous layers), sVOLE-based MatrixMul is not suitable. Instead, the AHE-based MatrixMul proposed in [27] (Iron) is used.
  4. Self-attention Masking: The upper triangle of each $\mathbf{s}_i$ is zeroed out to prevent attending to future words. This is a local operation.
  5. Softmax: Applied to each row of each $\mathbf{s}_i$ to normalize scores to sum to 1.
    • CipherGPT leverages BumbleBee's [30] approach (a quick accuracy check follows this list):
      1. Normalize each input row $\mathbf{x} \in \mathbb{Z}_{2^L}^n$: $x_i' := x_i - \max(\mathbf{x})$.
      2. Bound values: Use $\mathsf{F_{CMP}}$ and $\mathsf{F_{MUX}}$ to set $e^{x_i'}$ to 0 if $x_i' < -16 \times 2^L$.
      3. Approximate $e^{x_i'}$: Use the approximation $e^{x_i'} \approx (1 + \frac{x_i'}{2^n})^{2^n}$, implemented with $\mathsf{F_{Mult}}$ and $\mathsf{F_{Trunc}}$.
  6. Output Calculation: zi:=sivi\mathbf{z}_i := \mathbf{s}_i \mathbf{v}_i for each head ii.
    • Again, AHE-based MatrixMul [27] is used.
    • Results are reassembled: Z:=z1zn\langle \mathbf{Z} \rangle := \langle \mathbf{z}_1 \rangle || \dots || \langle \mathbf{z}_n \rangle.
    • Residual connection: X:=X+Z\langle \mathbf{X} \rangle := \langle \mathbf{X} \rangle + \langle \mathbf{Z} \rangle (local addition).
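
As referenced in step 5, the following is a minimal cleartext fixed-point sketch of this softmax recipe. The iteration count $r = 6$ and the plain integer division standing in for the secure reciprocal are assumptions for illustration; neither is specified above.

```cpp
// Cleartext fixed-point sketch of the three softmax steps above.
#include <algorithm>
#include <cstdint>
#include <vector>

constexpr int L = 12;  // fractional bits
constexpr int R = 6;   // assumed squaring count: e^x ~ (1 + x/2^R)^(2^R)

std::vector<int64_t> softmax_fixed(const std::vector<int64_t>& x) {
    const int64_t mx = *std::max_element(x.begin(), x.end());
    std::vector<int64_t> e(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        const int64_t xp = x[i] - mx;                    // step 1: x' <= 0
        if (xp < -(16LL << L)) { e[i] = 0; continue; }   // step 2: F_CMP + F_MUX
        // Step 3: base = 1 + x'/2^R at scale 2^L, then square R times;
        // each squaring is one F_Mult followed by one F_Trunc.
        __int128 acc = (1LL << L) + (xp >> R);
        for (int k = 0; k < R; ++k) acc = (acc * acc) >> L;
        e[i] = static_cast<int64_t>(acc);
    }
    // Final normalization so each row sums to 1; plain division here is a
    // stand-in for the secure reciprocal, which this analysis does not detail.
    __int128 sum = 0;
    for (int64_t v : e) sum += v;
    for (auto& v : e)
        v = static_cast<int64_t>((static_cast<__int128>(v) << L) / sum);
    return e;
}
```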

D. Feed Forward

This block involves Layer Normalization, two fully-connected (FC) layers, and the GELU activation.

  1. Layer Normalization: Applied as described in Section 4.2.5 B.
  2. First FC Layer: X1:=XW1+B1\mathbf{X}_1 := \mathbf{X}\mathbf{W}_1 + \mathbf{B}_1.
    • XZ2ln×m\mathbf{X} \in \mathbb{Z}_{2^l}^{n \times m}, W1Z2lm×k\mathbf{W}_1 \in \mathbb{Z}_{2^l}^{m \times k}, B1Z2ln×k\mathbf{B}_1 \in \mathbb{Z}_{2^l}^{n \times k}.
    • Since W1\mathbf{W}_1 is a model weight, CipherGPT applies its sVOLE-based MatrixMul.
  3. GELU Activation: The secure GELU protocol (Algorithm 1) is applied element-wise to X1\mathbf{X}_1, producing X1\mathbf{X}_1'.
  4. Second FC Layer: X2:=X1W2+B2\mathbf{X}_2 := \mathbf{X}_1'\mathbf{W}_2 + \mathbf{B}_2.
    • X1Z2ln×k\mathbf{X}_1' \in \mathbb{Z}_{2^l}^{n \times k}, W2Z2lk×m\mathbf{W}_2 \in \mathbb{Z}_{2^l}^{k \times m}, B2Z2ln×m\mathbf{B}_2 \in \mathbb{Z}_{2^l}^{n \times m}.
    • Again, sVOLE-based MatrixMul is used as W2\mathbf{W}_2 is a model weight.
  5. Another residual connection and Layer Normalization follow, continuing the decoder structure.

E. Vec2word

This final layer generates the predicted response word.

  1. Initial MatrixMul: y0:=xW\mathbf{y}_0 := \mathbf{x}\mathbf{W}.
    • xZ2lm\mathbf{x} \in \mathbb{Z}_{2^l}^m (the last row of the final transformer output X\mathbf{X}), WZ2lm×k\mathbf{W} \in \mathbb{Z}_{2^l}^{m \times k} (where kk is the vocabulary size, typically very large, e.g., 50257 in GPT-2), y0Z2lk\mathbf{y}_0 \in \mathbb{Z}_{2^l}^k.
    • Because the vocabulary size $k$ is very large, the sVOLE-based MatrixMul might not be as efficient here; the paper states it uses the AHE-based MatrixMul [27].
    • $\mathsf{F_{Trunc}}$ is applied.
  2. Top-K Selection: y1ΠTopK(y0)\langle \mathbf{y}_1 \rangle \gets \Pi_{\mathsf{TopK}}(\langle \mathbf{y}_0 \rangle).
    • The secure Top-K protocol (Algorithm 2) is applied to select the KK largest probability scores from y0\mathbf{y}_0, resulting in y1Z2lK\mathbf{y}_1 \in \mathbb{Z}_{2^l}^K.
  3. Temperature Scaling: Each value in $\mathbf{y}_1$ is scaled by a temperature $T$ (a hyperparameter held by $S$). This is done securely using AHE (a share-bookkeeping sketch follows this list):
    1. $C$ encrypts its shares $E(\langle y_{1,i} \rangle_C)$ and sends them to $S$.
    2. $S$ homomorphically adds its shares to obtain $E(y_{1,i})$ and multiplies each ciphertext by the scalar $T$, yielding $E(T \cdot y_{1,i})$; AHE supports both operations without any decryption by $S$.
    3. $S$ adds a random mask $r_i$ to each ciphertext: $E(T \cdot y_{1,i} + r_i)$, and returns it to $C$.
    4. $C$ decrypts. The new shares are $\langle y_{2,i} \rangle_C := T \cdot y_{1,i} + r_i$ and $\langle y_{2,i} \rangle_S := -r_i$.
    • $\mathsf{F_{Trunc}}$ is applied.
  4. Softmax: Applied to y2\mathbf{y}_2 to get a probability vector y3\mathbf{y}_3. (As described in Section 4.2.5 C).
  5. Random Sampling: jΠSample(y3)\langle j \rangle \gets \Pi_{\mathsf{Sample}}(\langle \mathbf{y}_3 \rangle).
    • The secure sampling protocol (Algorithm 3), with the modifications for index mapping, is used to sample an index jj from y3\mathbf{y}_3.
  6. Word Mapping: CC obtains the final sampled index jj and maps it back to the original word vector (vocabulary) index j=πC1(j)j' = \pi_C^{-1}(j) to retrieve the response word. This completes one autoregressive step.
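
A minimal sketch of the share bookkeeping in the temperature-scaling step above, with encryption mocked out (in the real protocol the value travels under AHE, so $S$ never sees $C$'s share in the clear; the ring size and seeds are arbitrary):

```cpp
// Share bookkeeping for temperature scaling; Enc/Dec are mocked out.
#include <cassert>
#include <cstdint>
#include <random>

int main() {
    using u64 = uint64_t;            // Z_{2^64} stands in for the ring Z_{2^l}
    std::mt19937_64 rng(42);

    const u64 y1 = 1234;             // logical value y_{1,i}
    const u64 y1_C = rng();          // C's additive share
    const u64 y1_S = y1 - y1_C;      // S's additive share
    const u64 T = 3;                 // temperature, held by S

    // S's work (done under AHE in the real protocol): add own share,
    // multiply by the scalar T, then mask with a random r_i.
    const u64 r = rng();
    const u64 ct = (y1_C + y1_S) * T + r;   // "ciphertext" returned to C

    // C decrypts; the parties now hold fresh shares of y2 = T * y1.
    const u64 y2_C = ct;             // T*y1 + r_i
    const u64 y2_S = 0 - r;          // S keeps -r_i
    assert(y2_C + y2_S == T * y1);   // shares reconstruct correctly mod 2^64
    return 0;
}
```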

5. Experimental Setup

5.1. Datasets

The paper primarily evaluates the accuracy loss of CipherGPT using the WikiText-103 dataset [36].

  • Source and Characteristics: WikiText-103 is a large dataset of over 103 million words extracted from the set of verified "Good" and "Featured" articles on Wikipedia. It is designed for language modeling benchmarks, providing a diverse collection of high-quality text.
  • Scale: Contains a vocabulary of 267,735 unique words.
  • Domain: General-purpose English text from Wikipedia articles.
  • Usage: The authors randomly selected 10,000 sentences from WikiText-103 for their accuracy evaluation. This dataset is suitable because it represents typical text data that GPT models are trained on and expected to generate, allowing for a realistic assessment of how the cryptographic transformations affect the model's output quality compared to the original model.

5.2. Evaluation Metrics

The paper employs several metrics to evaluate the performance and accuracy of CipherGPT and its components.

5.2.1. Runtime (s)

  • Conceptual Definition: Measures the total time taken to execute a protocol or operation, encompassing both computation (CPU cycles) and communication (data transfer latency). It quantifies the efficiency in terms of speed.
  • Mathematical Formula: Not a specific formula, but rather measured empirically as wall-clock time. $ \text{Runtime (s)} = \text{CPU Time} + \text{Communication Latency} $
  • Symbol Explanation:
    • CPU Time: Time spent by processors performing computations.
    • Communication Latency: Time spent waiting for data to be transferred between parties over the network.

5.2.2. Communication (MB)

  • Conceptual Definition: Quantifies the total amount of data exchanged between Client (C) and Server (S) during the execution of a protocol or operation. It reflects the network bandwidth consumption and is a critical factor for efficiency, especially in geographically distributed settings.
  • Mathematical Formula: Not a specific formula, but measured empirically as total bytes transferred. $ \text{Communication (MB)} = \sum_{k=1}^{N} \mathrm{size}(\text{message}_k) \quad (\text{converted to MB}) $
  • Symbol Explanation:
    • $N$: The total number of messages exchanged.
    • $\mathrm{size}(\text{message}_k)$: The size in bytes of the $k$-th message.

5.2.3. ULP Error (Units in the Last Place)

  • Conceptual Definition: Measures the precision of a numerical approximation by quantifying the error in terms of Units in the Last Place. It represents the number of possible floating-point values between the exact mathematical result and the approximated result. In the context of CipherGPT, where floating-point numbers are scaled to integers, the ULP error simplifies to the absolute difference between the exact and approximated integer values.
  • Mathematical Formula: For scaled integer values yy (exact) and y~\tilde{y} (approximated), the ULP error is defined as: $ \text{ULP Error} = |y - \tilde{y}| $
  • Symbol Explanation:
    • yy: The exact (infinite precision) real result, after scaling to an integer.
    • y~\tilde{y}: The approximated result obtained from the secure protocol, after scaling to an integer.
    • |\cdot|: Absolute value function.
    • The paper reports both maximal ULP error (the largest absolute difference found) and average ULP error (the mean of absolute differences over all test cases).
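
For concreteness, a small helper of the kind that could reproduce the reported maximal and average ULP errors (a sketch; names are illustrative):

```cpp
// Computes maximal and average ULP error between exact and approximated
// results, both already scaled to integers.
#include <cstdint>
#include <cstdlib>
#include <vector>

struct UlpStats { int64_t max_ulp; double avg_ulp; };

UlpStats ulp_errors(const std::vector<int64_t>& exact,
                    const std::vector<int64_t>& approx) {
    int64_t mx = 0;
    double sum = 0.0;
    for (size_t i = 0; i < exact.size(); ++i) {
        const int64_t e = std::llabs(exact[i] - approx[i]);  // |y - y~|
        if (e > mx) mx = e;
        sum += static_cast<double>(e);
    }
    return {mx, sum / static_cast<double>(exact.size())};
}
```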

5.2.4. Accuracy (for Generated Text)

  • Conceptual Definition: Assesses the quality of the text generated by CipherGPT compared to the original GPT model. It measures how often CipherGPT produces the exact same word as the original model and how close its "wrong" predictions are to the original's top choices.
  • Mathematical Formula: Two primary forms:
    1. Percentage of Identical Outputs: $ \text{Identical Output Accuracy} = \frac{\text{Number of identical words}}{\text{Total number of words sampled}} \times 100\% $
    2. Top-k Accuracy: $ \text{Top-}k\text{ Accuracy} = \frac{\text{Number of outputs within the top-}k\text{ list}}{\text{Total number of words sampled}} \times 100\% $ (In this paper, it is applied to the mismatched outputs: checking how many of CipherGPT's "wrong" words still fall within GPT-original's top-5.)
  • Symbol Explanation:
    • Number of identical words: Count of words where CipherGPT's prediction matches GPT-original's prediction.
    • Total number of words sampled: The total number of words for which predictions were made (e.g., 10,000 sentences, with K=1K=1 meaning 10,000 word predictions).
    • Number of times original output is in CipherGPT's top-k list: How often one model's predicted word appears among the other's top-k highest-probability words. The paper applies this in the reverse direction, checking whether each of CipherGPT's "wrong" outputs was among GPT-original's top-5.

5.3. Baselines

CipherGPT is compared against several state-of-the-art and foundational secure inference solutions, as well as general cryptographic primitives:

  • For GELU:

    • Iron [27]: A secure inference framework that reduces communication complexity for HE-based matrix multiplication and uses LUTs for non-linear functions.
    • Bolt [40]: A recent framework optimized for transformers, using SIMD slots in HE for matrix multiplication and high-degree polynomial approximation for GELU.
    • Bolt+ [40], [30]: An improved version of Bolt where its OT-based multiplication is replaced with BumbleBee's more efficient BOLE for fair comparison.
    • BumbleBee [30]: Another recent secure inference framework for large transformers, known for ciphertext compacting and high-degree polynomial approximations for activations.
  • For Matrix Multiplication:

    • Cheetah [29]: A framework that uses coefficient packing (instead of SIMD) for RLWE-based HE to eliminate expensive rotations.
    • Iron [27]: (See above)
    • BumbleBee [30]: (See above)
    • Bolt [40]: (See above)
  • For Top-K Selection:

    • Bitonic sorting network [28]: A well-known data-independent sorting algorithm commonly used in secure computation scenarios. It provides a strong baseline for general secure sorting, but is computationally intensive.

      The choice of these baselines is representative because they cover various approaches to secure inference (HE-only, hybrid 2PC, different HE packing schemes) and are considered state-of-the-art for transformer-based models or fundamental cryptographic primitives.

5.4. Implementation Details

The authors provide details on their implementation to ensure reproducibility and transparency:

  • Language: Full implementation in C++.
  • Security Parameter: Set to 128 bits, a standard level for cryptographic security.
  • Homomorphic Encryption:
    • Microsoft SEAL (version 4.0) library is used for AHE.
    • Specifically, the Brakerski-Fan-Vercauteren (BFV) scheme [11], [19] is employed.
    • N=4096N = 4096 is used as the polynomial modulus degree, with default SEAL parameters for 128-bit security.
    • Intel HEXL is used to accelerate HE operations with AVX-512 instructions.
    • Noise flooding [45], [30] is performed on returned ciphertexts to ensure circuit privacy.
  • Multiplication Protocols:
    • Uniform bit-width product: The open-sourced BOLE (Batch Oblivious Linear Evaluation) code from BumbleBee [30] is used.
    • Non-uniform bit-width product: The open-sourced code from SIRNN [43] is used.
  • Secure GELU:
    • LUT, Mult, Trunc, CMP, and MUX primitives are implemented leveraging SIRNN's open-sourced code.
    • IKNP-OT [32] (used in SIRNN) is replaced with Ferret OT [57] for efficiency.
    • OT-based multiplication is replaced with FHE-based BOLE [30].
  • sVOLE-based MatrixMul:
    • Reverse-VOLE is implemented with AHE.
    • The half-tree optimization [26], [25] is incorporated for the PPRF (puncturable pseudorandom function).
    • Optimizations from [56], [6] are included.
    • Adherence to advice in [33] for protection against known attacks.
  • TopK: A custom implementation of secret-shared shuffle (from [14]) is developed, as it was not open-sourced.
  • Bolt Baseline: Bolt [40] was implemented based on SIRNN with Ferret OT using parameters from the Bolt paper, as its code was unavailable.

5.5. Optimizations for HE

Two optimizations are leveraged to reduce the communication overhead of transferring HE ciphertexts:

  1. Symmetric Version of FHE: In the symmetric version of FHE, a freshly encrypted ciphertext consists of two polynomial components. One of these polynomials is uniformly sampled and can be represented by a seed instead of being transmitted entirely.
    • Effect: This saves half of the communication when sending a ciphertext without compromising security or correctness.
  2. Modulus Switch Before Return: FHE ciphertexts are polynomials in Zq[x]/(xN+1)\mathbb{Z}_q[x]/(x^N+1). These can be converted to a smaller ring Zq[x]/(xN+1)\mathbb{Z}_{q'}[x]/(x^N+1), where q<qq' < q, without affecting the decryption result.
    • Mechanism: This operation, known as modulus reduction [13], only requires public parameters and can be performed by either party (even without the secret key).
    • Effect: This optimization can compress ciphertexts by a factor of $\frac{\log q}{\log q'}$, significantly reducing bandwidth. For instance, switching from a 109-bit $q$ to a 55-bit $q'$ would shrink each returned ciphertext by roughly 2x.

5.6. Experimental Environment

  • Network Environment: LAN network setting, simulating a high-speed local network.
    • Bandwidth: 3000 Mbps (Megabits per second).
    • RTT (Round-Trip Time): 0.8 ms (milliseconds).
  • Hardware: Experiments were conducted on AWS c5.9xlarge instances.
    • CPU: Intel Xeon 8000 series CPUs clocked at 3.6 GHz.
  • Parallelism: All experiments were performed using a single thread, indicating that significant further speedups could potentially be achieved with parallel computing.
  • Data Collection: All reported results are average values from 5 runs, with minimal variance.

5.7. Model Parameters

The GPT-2 model ([42]) is used as the benchmark target:

  • Model Size: 117 million parameters.
  • Architecture: 12 transformer decoders.
  • Embedding Size: 768.
  • Fixed-point Representation:
    • Floating-point numbers are left-shifted by L=12L=12 bits. This effectively converts floating-point numbers into fixed-point integers, where LL bits represent the fractional part.
    • The fractional part is then dropped.
    • During inference, F_Trunc is used to ensure that the largest value remains smaller than 2l12^l - 1, with l=37l=37 bits (meaning the total integer bit-length, including fractional bits, is 37).
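
A small sketch of this fixed-point convention (helper names are illustrative; the actual F_Trunc protocol operates on secret shares, not plaintext integers):

```cpp
// Fixed-point convention: scale by 2^L = 2^12; products land at scale
// 2^(2L) and are truncated back; values must stay below 2^l with l = 37.
#include <cmath>
#include <cstdint>

constexpr int L = 12;  // fractional bits
constexpr int l = 37;  // total ring bit-width

int64_t encode(double x)  { return std::llround(x * (1LL << L)); }
double  decode(int64_t v) { return static_cast<double>(v) / (1LL << L); }

// After a multiplication the scale is 2^(2L); an F_Trunc-style shift
// restores scale 2^L. (The secure protocol does this on secret shares.)
int64_t mul_trunc(int64_t a, int64_t b) {
    const __int128 prod = static_cast<__int128>(a) * b;  // scale 2^(2L)
    return static_cast<int64_t>(prod >> L);              // back to 2^L
}
// Keeping values below 2^l - 1 bounds the representable magnitude to
// |decode(v)| < 2^(l - L) = 2^25.
```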

6. Results & Analysis

6.1. Core Results Analysis

The experimental evaluation of CipherGPT provides a comprehensive benchmark of its individual components and overall framework, demonstrating significant improvements over existing solutions.

Evaluation of GELU

The secure GELU protocol (Algorithm 1) is evaluated for 37-bit elements in a 2202^{20}-length vector. The approximation uses a 64-piece spline within the range [3.25×212,3.25×212][-3.25 \times 2^{12}, 3.25 \times 2^{12}].
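
To illustrate what a 64-piece spline over this range looks like, below is a cleartext sketch that linearly interpolates between tabulated GELU values. This is an assumption for illustration: the paper's actual spline degree and coefficient fitting may differ, and in the secure protocol the piece selection and table access are realized with $\mathsf{F_{LUT}}$.

```cpp
// Piecewise GELU over [-3.25, 3.25] with 64 pieces, using linear
// interpolation between tabulated endpoints (degree and coefficients
// are illustrative, not the paper's).
#include <array>
#include <cmath>

constexpr int    PIECES = 64;
constexpr double ALPHA  = 3.25;

double gelu_exact(double x) {  // reference definition
    return 0.5 * x * (1.0 + std::erf(x / std::sqrt(2.0)));
}

double gelu_spline(double x) {
    if (x <= -ALPHA) return 0.0;  // left tail: GELU(x) ~ 0
    if (x >=  ALPHA) return x;    // right tail: GELU(x) ~ x
    static const auto table = [] {  // endpoint values, computed once
        std::array<double, PIECES + 1> t{};
        for (int i = 0; i <= PIECES; ++i)
            t[i] = gelu_exact(-ALPHA + 2.0 * ALPHA * i / PIECES);
        return t;
    }();
    const double u    = (x + ALPHA) * PIECES / (2.0 * ALPHA);
    const int    idx  = static_cast<int>(u);  // piece index: F_LUT in MPC
    const double frac = u - idx;
    return table[idx] * (1.0 - frac) + table[idx + 1] * frac;
}
```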

The following are the results from Table IV of the original paper:

| GELU ($\mathbb{Z}_{2^{37}}$) | Runtime (s) | Comm. (MB) | Maximal ULP Err. | Average ULP Err. |
| --- | --- | --- | --- | --- |
| Iron [27] | 694 | 12 225 | 9 | 1.93 |
| Bolt [40] | 55.61 | 1 962.23 | 37 | 4.55 |
| Bolt+ [40], [30] | 52.22 | 559.28 | 37 | 4.55 |
| BumbleBee [30] | 73.52 | 641.02 | 73 | 10.82 |
| Ours | 30.56 (1.8×↓) | 764.96 (2.5×↓) | 5 (7.4×↓) | 1.06 (4.3×↓) |

Analysis:

  • Runtime: CipherGPT achieves a runtime of 30.56s, a 1.8x speedup over Bolt (55.61s) and a 1.7x speedup over the optimized Bolt+ (52.22s). This demonstrates the efficiency of the spline-based approach with LUTs compared to high-degree polynomial approximations. Iron is significantly slower (694s).
  • Communication: CipherGPT's communication is 764.96MB, a 2.5x reduction compared to Bolt (1962.23MB). While Bolt+ (559.28MB) and BumbleBee (641.02MB) have lower communication, CipherGPT's overall efficiency is still competitive given its superior precision. Iron's communication is extremely high (12,225MB).
  • Precision (ULP Error): This is where CipherGPT truly excels.
    • Maximal ULP Error: CipherGPT achieves a maximal ULP error of 5, which is 7.4x more accurate than Bolt (37) and 14.6x more accurate than BumbleBee (73).
    • Average ULP Error: CipherGPT's average ULP error is 1.06, which is 4.3x more accurate than Bolt (4.55) and 10.2x more accurate than BumbleBee (10.82).
    • Reasoning for Precision: The paper attributes this superior precision to its single-step, spline-based approximation, which avoids error accumulation seen in multi-step approaches (like Iron/SIRNN approximating exponentiation and reciprocation separately) and the precision loss inherent in high-degree polynomial approximations (like Bolt/BumbleBee).

Evaluation of MatrixMul

The sVOLE-based MatrixMul protocol is designed for unbalanced matrix multiplications that occur repeatedly with the same weights in GPT's autoregressive inference. It is benchmarked for computing tt iterations of Z237256×768×Z237768×768\mathbb{Z}_{2^{37}}^{256 \times 768} \times \mathbb{Z}_{2^{37}}^{768 \times 768}. The results highlight the benefits of amortized costs.

The image "images/3.jpg" shows the evaluation of MatrixMul. The left plot (a) shows amortized runtime (s) vs. iterations, and the right plot (b) shows amortized communication (MB) vs. iterations. For (a), sVOLE runtime significantly decreases as iterations increase. For (b), sVOLE communication also decreases.

Analysis from Figure 3:

  • Amortized Runtime (Figure 3a):
    • For t=256t = 256 iterations (a reasonable number for ChatGPT responses), CipherGPT's amortized runtime is 1462 ms. This represents a 2.3x speedup over Iron, 6.4x speedup over Bolt, and 5.8x speedup over BumbleBee. The curves clearly show CipherGPT (sVOLE) consistently outperforming other HE-based methods, with its runtime decreasing significantly as tt increases due to amortization.
    • For t=1024t = 1024 iterations, CipherGPT further improves, showing 2.5x speedup over Iron, 6.9x speedup over Bolt, and 6.2x speedup over BumbleBee.
  • Amortized Communication (Figure 3b):
    • For t=256t = 256 iterations, CipherGPT's amortized communication is 8.2MB. This is a 3.7x reduction over Iron, 3.0x reduction over Bolt, and 1.4x reduction over BumbleBee. Similar to runtime, communication also decreases with increasing tt.
    • For t=1024t = 1024 iterations, CipherGPT achieves even greater reductions: 11.2x over Iron, 8.9x over Bolt, and 4.1x over BumbleBee.
  • Reasoning: The effectiveness of sVOLE stems from its communication complexity being almost independent of nn (input vector length). By batching multiple MatrixMul operations that share the same weights, the one-time sVOLE setup cost is amortized, leading to superior performance for generative LLM tasks.
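
The shape of both curves is consistent with a simple amortization model (an illustrative decomposition, not a formula from the paper). If $C_{\text{setup}}$ denotes the one-time cost of generating the sVOLE correlations for a fixed weight matrix and $C_{\text{iter}}$ the marginal cost of one MatrixMul, then $ \text{AmortizedCost}(t) = \frac{C_{\text{setup}}}{t} + C_{\text{iter}} $, which decreases monotonically in $t$ and approaches the per-iteration floor $C_{\text{iter}}$. This explains why the sVOLE curves flatten out at large $t$ while the per-call HE baselines stay roughly constant.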

Evaluation of TopK

The secure TopK protocol (Algorithm 2) is benchmarked for selecting 100 elements from a vector of length 50257 (typical GPT-2 vocabulary size).

  • Performance: It takes 3281 ms and consumes 136.1 MB of bandwidth.
  • Comparison: Compared to the commonly used Bitonic sorting network [28], CipherGPT achieves an 8.8x speedup in runtime and a 14.8x reduction in communication. This highlights the efficiency gained by a shuffle-then-select strategy that only recurses into the necessary partitions, rather than fully sorting the vector (a cleartext sketch follows).
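
A cleartext analogue of shuffle-then-select, illustrative only; the secure protocol operates on secret-shared data and uses the secret-shared shuffle of [14]:

```cpp
// Cleartext analogue of shuffle-then-select: after a random shuffle,
// a quickselect-style partition touches O(n) elements on average,
// versus the O(n log^2 n) comparators of a bitonic sorting network.
#include <algorithm>
#include <cstdint>
#include <random>
#include <vector>

std::vector<int64_t> top_k(std::vector<int64_t> v, size_t K) {
    std::mt19937_64 rng(7);
    // In CipherGPT the shuffle is a secret-shared shuffle [14]; shuffling
    // first makes the data-dependent recursion pattern safe to reveal.
    std::shuffle(v.begin(), v.end(), rng);
    // Partition so the K largest land in the first K slots, recursing
    // into only one side at each level (unlike a full sort).
    std::nth_element(v.begin(), v.begin() + K, v.end(),
                     std::greater<int64_t>());
    v.resize(K);
    return v;  // the K largest elements, in no particular order
}
```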

Evaluation of CipherGPT (Overall Framework)

The overall CipherGPT framework is evaluated for generating a sentence of 256 response words.

The following are the results from Table V of the original paper:

| Layer | Operation | Output ← Input | Method | Times | Runtime (ms) | Runtime % | Comm. (MB) | Comm. % |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Embedding | Embedding | Z256×768 ← Z2166, L = 12 | §VII-A | 1 | 46 | < 0.01% | 2.20 | < 0.01% |
| LayerNorm | LayerNorm | Z256×768 ← Z256×768, L = 12 | §VII-B | 12 | 4756 × 12 | 4.83% | 65 × 12 | 5.17% |
| Self-attention | MatrixMul | (Z256×768 ← Z256×768 × Z768×768) × 3, L = 24 | §II | 12 | 4358 × 12 | 4.43% | 21.79 × 12 | 1.73% |
| | Trunc | (Z256×768 ← Z256×768) × 3, L = 12 | [43] | 12 | 3117 × 12 | 3.17% | 24.75 × 12 | 1.97% |
| | Multi-head | (Z256×64) × 12 ← (Z256×768) × 3, L = 12 | plain | 12 | (< 1) × 12 | ≈ 0% | 0 | 0% |
| | MatrixMul | (Z256×256 ← Z256×64 × Z64×256) × 12, L = 24 | [30] | 12 | 7614 × 12 | 7.74% | 29.77 × 12 | 2.37% |
| | Trunc | (Z256×256 ← Z256×256) × 12, L = 12 | [43] | 12 | 3262 × 12 | 3.31% | 30.61 × 12 | 2.43% |
| | Masking | (Z256×256 ← Z256×256) × 12, L = 12 | plain | 12 | (< 1) × 12 | ≈ 0% | 0 | 0% |
| | Softmax | (Z256×256 ← Z256×256) × 12, L = 12, by row | §VII-C | 12 | 14865 × 12 | 15.10% | 277.59 × 12 | 22.06% |
| | MatrixMul | (Z256×64 ← Z256×256 × Z256×64) × 12, L = 24 | [30] | 12 | 7570 × 12 | 7.69% | 10.62 × 12 | 0.84% |
| | Trunc | (Z256×64 ← Z256×64) × 12, L = 12 | [43] | 12 | 2910 × 12 | 2.96% | 13.03 × 12 | 1.04% |
| | Reassemble | Z256×768 ← ((Z256×64) × 12), L = 12 | plain | 12 | (< 1) × 12 | ≈ 0% | 0 | 0% |
| | MatrixMul | Z256×768 ← Z256×768 × Z768×768, L = 24 | §II | 12 | 1463 × 12 | 1.49% | 8.20 × 12 | 0.65% |
| | Trunc | Z256×768 ← Z256×768, L = 12 | [43] | 12 | 2910 × 12 | 2.96% | 13.03 × 12 | 1.04% |
| | Matrix Add | Z256×768 ← Z256×768 + Z256×768, L = 12 | plain | 12 | (< 1) × 12 | ≈ 0% | 0 | 0% |
| LayerNorm | LayerNorm | Z256×768 ← Z256×768, L = 12 | §VII-B | 12 | 4756 × 12 | 4.83% | 65 × 12 | 5.17% |
| Feed-forward | MatrixMul | Z256×3072 ← Z256×768 × Z768×3072, L = 24 | §II | 12 | 5997 × 12 | 6.09% | 28.5 × 12 | 2.27% |
| | Trunc | Z256×3072 ← Z256×3072, L = 12 | [43] | 12 | 3204 × 12 | 3.26% | 30.61 × 12 | 2.43% |
| | GELU | Z256×3072 ← Z256×3072, L = 12 | §IV | 12 | 20657 × 12 | 20.99% | 575.91 × 12 | 45.78% |
| | MatrixMul | Z256×768 ← Z256×3072 × Z3072×768, L = 24 | §II | 12 | 5841 × 12 | 5.94% | 32.42 × 12 | 2.58% |
| | Trunc | Z256×768 ← Z256×768, L = 12 | [43] | 12 | 2910 × 12 | 2.96% | 13.03 × 12 | 1.04% |
| | Matrix Add | Z256×768 ← Z256×768 + Z256×768, L = 12 | plain | 12 | (< 1) × 12 | ≈ 0% | 0 | 0% |
| LayerNorm | LayerNorm | Z256×768 ← Z256×768, L = 12 | §VII-B | 1 | 4756 | 0.40% | 65 | 0.43% |
| Vec2Word | MatrixMul | Z50257 ← Z768 × Z768×50257, L = 24 | [30] | 1 | 12000 | 1.02% | 5.5 | 0.04% |
| | Trunc | Z50257 ← Z50257, L = 12 | [43] | 1 | 2834 | 0.24% | 8.67 | 0.06% |
| | Shuffle | Z50257 ← Z50257, L = 12 | [14] | 1 | 4004 | 0.34% | 51.3 | 0.34% |
| | TopK | Z100 ← Z50257, L = 12 | §V | 1 | 1277 | 0.11% | 84.8 | 0.56% |
| | Temperature | Z100 ← Z100, L = 24 | §VII-E | 1 | 6 | < 0.01% | 0.084 | < 0.01% |
| | Trunc | Z100 ← Z100, L = 12 | [43] | 1 | 1 | < 0.01% | < 0.01 | < 0.01% |
| | Softmax | Z100 ← Z100, L = 12 | §VII-C | 1 | 1705 | 0.14% | 0.71 | < 0.01% |
| | Sampling | Z37 ← Z100, L = 12 | §VI | 1 | 7.843 | < 0.01% | 0.11 | < 0.01% |
| Total | | | | | 1 180 933 | | 15 096.82 | |

Analysis of Overall Performance (for 256 response words):

  • Total Runtime: 1,180,933 ms, which is approximately 19.68 minutes.
  • Total Communication: 15,096.82 MB, which is approximately 14.74 GB.

Proportional Breakdown:

  • Runtime Bottlenecks:
    1. GELU: Dominates runtime, accounting for 20.99% of total time. This operation is performed 12 times within the Feed-forward layers.
    2. Softmax: Contributes 15.10% of runtime (also performed 12 times within Self-attention).
    3. MatrixMul (various types): Summing up all MatrixMul operations, they collectively contribute a significant portion. Specifically, the AHE-based MatrixMul in Self-attention (7.74% and 7.69%) and sVOLE-based MatrixMul in Feed-forward (6.09% and 5.94%) are major contributors. The total MatrixMul (including QKV, internal Self-attention, FC layers) across all 12 decoders is around 34.39%.
    4. Truncation (Trunc): Essential for fixed-point arithmetic, accumulates to approximately 18.85% of runtime across the model.
    5. LayerNorm: Contributes approximately 10.07% of runtime.
  • Communication Bottlenecks:
    1. GELU: Is the largest communication consumer, occupying 45.78% of the total bandwidth.
    2. Softmax: Contributes 22.06% of communication.
    3. LayerNorm: Consumes 10.76% of bandwidth.
    4. MatrixMul (various types): Overall MatrixMul contributes approximately 10.49%.
    5. Truncation (Trunc): Accounts for approximately 10% of bandwidth.

Observations:

  • The customized protocols for GELU (Section IV) and MatrixMul (Section III) are indeed critical, as they form the largest proportions of both runtime and communication.
  • Operations like Embedding, Multi-head partitioning, Masking, Reassemble, Matrix Add, Temperature, Sampling, and Trunc (after Temperature) for Vec2Word are relatively cheap, often taking negligible time and communication.
  • The high overall latency (around 20 minutes) and bandwidth (around 15 GB) for generating 256 words clearly indicate that CipherGPT, despite its significant optimizations, is not yet practical for real-time interactive GPT inference. However, the detailed breakdown provides a clear roadmap for future optimizations by highlighting the most expensive operations.

Accuracy Evaluation

The paper also assessed the accuracy loss introduced by CipherGPT using WikiText-103.

  • Methodology: 10,000 sentences were randomly selected. CipherGPT was run (with K=1K=1 for TopK to predict the most probable word, thus eliminating TopK sampling interference) and compared against GPT-original (the original model with floating-point numbers, no truncations or approximations).
  • Results:
    • 99.22% of CipherGPT's outputs were identical to GPT-original's outputs. This is a very high degree of fidelity, indicating that the fixed-point approximations and secure protocols introduce minimal changes to the model's core predictions.
    • For the remaining 0.78% (78 out of 10,000) of outputs that were different, each "wrong" output still fell within the top-5 outputs produced by GPT-original for the corresponding sentence. This suggests that even when CipherGPT deviates, its predictions are still highly plausible and close to the original model's top choices, preserving the overall utility of the LLM.

6.2. Data Presentation (Tables)

The tables presented in the "Core Results Analysis" section are direct transcriptions from the original paper's Table IV and Table V, rendered here as Markdown tables; the merged cells of the paper's Table V are flattened, with continuation rows left blank in the Layer column.

6.3. Ablation Studies / Parameter Analysis

The paper does not present explicit ablation studies to verify the individual contribution of each component of CipherGPT (e.g., removing sVOLE optimization, or using a less precise GELU approximation). However, the detailed breakdown of runtime and communication costs for each operation in Table V implicitly serves a similar purpose by showing the relative impact of each component.

For GELU, the authors specify the parameters used for the approximation: $\alpha = 3.25$ and $s = 6$, giving $2^6 = 64$ spline pieces. They do not analyze how varying these parameters affects performance or precision, but the choice is justified as balancing accuracy against efficiency.

The comparison of MatrixMul performance for different tt (number of iterations) in Figure 3 acts as a form of parameter analysis, showing how the amortization benefits scale with the batch size.

7. Conclusion & Reflections

7.1. Conclusion Summary

CipherGPT introduces the first comprehensive framework for secure two-party GPT inference, addressing critical privacy concerns associated with large language models. The framework is built upon a suite of innovative cryptographic protocols specifically tailored to the unique computational demands of GPT. Key contributions include a novel sVOLE-based secure matrix multiplication that leverages GPT's autoregressive nature for significant speedup and bandwidth reduction, a spline-based secure GELU protocol that achieves superior precision and efficiency over existing methods, and the first protocol for secure top-K sampling, crucial for generative text diversity. Despite the current practical challenges in terms of latency and bandwidth, CipherGPT provides a full implementation and a detailed benchmark, offering valuable insights into performance bottlenecks and a foundational reference for future research in this nascent field.

7.2. Limitations & Future Work

The authors candidly acknowledge the current limitations and suggest avenues for future research:

  • Impractical Performance: The primary limitation is the current high cost: generating a 256-word response requires a latency of approximately 20 minutes and consumes about 15 GB of bandwidth. This makes real-time, interactive GPT inference currently impractical.
  • Future Technological Advancements: They anticipate that ongoing advancements in computing and network technologies will eventually pave the way for practical implementations.
    • Computing Power: Leveraging parallel computing technologies such as GPU or FPGA acceleration, and exploring new architectures like in-memory computing [50] and in-storage computing [49], could significantly speed up computation. The current single-threaded implementation leaves ample room for improvement.
    • Network Bandwidth: The emergence of 100 Gigabit Ethernet [1] under the IEEE standard could effectively address the high bandwidth requirements.
  • Non-Real-Time Applications: Despite the high latency, secure GPT inference can find valuable applications in scenarios where real-time responsiveness is not critical. An example provided is an institution evaluating an LLM's performance using confidential prompts, where both the prompts and the model itself need to remain private. In such cases, the long latency is tolerable.

7.3. Personal Insights & Critique

CipherGPT represents a significant leap forward in the crucial field of privacy-preserving AI, especially for large language models. The authors' approach of deeply analyzing GPT's architectural and operational specifics to design custom cryptographic primitives is highly effective and demonstrates a deep understanding of both fields.

Strengths:

  • Problem Relevance: The privacy concerns surrounding LLMs are paramount and growing. This paper directly tackles a pressing real-world issue.
  • Customized Optimizations: The sVOLE-based MatrixMul for autoregressive GPT and the spline-based GELU are particularly clever and impactful innovations. They highlight that generic secure inference solutions are often insufficient for specialized AI models. The improvements in precision for GELU are also highly commendable, as accuracy is often sacrificed in secure computation.
  • Comprehensive Benchmark: The detailed breakdown of runtime and communication for each operation (Table V) is invaluable. It not only provides a baseline but also clearly points to the current bottlenecks, guiding future research efforts. This level of transparency is excellent.
  • First-of-its-Kind Protocols: Introducing the first protocols for secure Top-K selection and secure sampling in this context fills critical gaps for truly generative LLM privacy.

Critique and Future Directions:

  • Practicality Gap: The latency of roughly 20 minutes for a 256-word response is indeed a "showstopper" for most interactive applications. While the authors correctly point to future hardware and network advancements, the cryptographic overhead itself remains extremely high. Further research into more efficient cryptographic primitives (e.g., post-quantum FHE, faster OT extensions, or new secret-sharing schemes) will be crucial.

  • Preprocessing Cost: While the sVOLE MatrixMul amortizes online costs, the preprocessing phase (which generates sVOLE correlations) can itself be costly. The paper doesn't deeply elaborate on the non-amortized preprocessing time, which could be significant for models with many layers or large weights.

  • AHE-based MatrixMul: For parts of the model (like score matrix calculation in self-attention and the final vec2word MatrixMul) where sVOLE is not applicable, the framework reverts to AHE-based MatrixMul [27]. This indicates that these operations might still be relatively expensive compared to the sVOLE optimized parts, and further optimizations for these non-batched MatrixMul instances could be beneficial.

  • Scaling K for Top-K and Sampling: The performance of TopK and Sampling might be sensitive to the value of KK (the number of elements selected/sampled). A larger KK could increase costs. Analyzing this sensitivity could provide further insights.

  • Trust Assumptions: The semi-honest model is a standard assumption. Exploring solutions under a stronger malicious adversary model would enhance security guarantees, albeit with higher costs.

  • Model Flexibility: The current approach is highly specialized for GPT. While this brings efficiency, it might require significant adaptation for other LLM architectures (e.g., Llama, Mixtral) or newer activation functions.

    Overall, CipherGPT lays a foundational stone for secure generative AI. Its innovations provide a clear direction for bridging the gap between cutting-edge LLMs and privacy, even if the journey to real-time practicality is still long. The detailed analysis provided in this paper will be an indispensable reference for those working on making secure LLM inference a reality.
