The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
TL;DR Summary
This study introduces BitNet b1.58, a 1-bit LLM variant using ternary weights {-1, 0, 1}. It matches the performance of full-precision models while being more cost-effective in latency, memory, throughput, and energy, paving the way for new training methods and hardware design.
Abstract
Recent research, such as BitNet, is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption. More profoundly, the 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs that are both high-performance and cost-effective. Furthermore, it enables a new computation paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits," which introduces a novel 1-bit Large Language Model (LLM) variant, BitNet b1.58, that quantizes model weights to the ternary values {-1, 0, 1}.
1.2. Authors
The authors are Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, and Furu Wei. Their affiliation is with Microsoft, as indicated by the URL https://aka.ms/GeneralAI. These researchers are actively involved in AI and machine learning research, particularly in the domain of large language models and model compression.
1.3. Journal/Conference
The paper was published on arXiv, a preprint server, under the identifier arXiv:2402.17764. While arXiv is not a peer-reviewed journal or conference, it is a highly influential platform for rapid dissemination of research in physics, mathematics, computer science, and related fields. Papers published on arXiv are often subsequently submitted to and published in top-tier conferences (e.g., NeurIPS, ICML, ICLR) or journals, but as of the publication date, this specific version is a preprint.
1.4. Publication Year
The paper was published on 2024-02-27.
1.5. Abstract
This paper introduces BitNet b1.58, a 1-bit LLM variant where every parameter (weight) is constrained to the ternary values {-1, 0, 1}. The research demonstrates that BitNet b1.58 matches the performance of full-precision (FP16 or BF16) Transformer LLMs of the same size and trained with the same number of tokens, in terms of both perplexity and end-task performance. Crucially, it achieves this while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption. The authors claim that this 1.58-bit LLM establishes a new scaling law and training recipe for future generations of high-performance and cost-effective LLMs. Furthermore, it paves the way for a new computation paradigm and the design of specialized hardware optimized for 1-bit LLMs.
1.6. Original Source Link
The official source link for the paper is https://arxiv.org/abs/2402.17764. The PDF link is https://arxiv.org/pdf/2402.17764v1.pdf. This is a preprint available on arXiv.
2. Executive Summary
2.1. Background & Motivation
The rapid growth in the size and capabilities of Large Language Models (LLMs) has led to remarkable performance across various natural language processing tasks. However, this increasing size poses significant challenges for deployment due to high computational demands, substantial memory requirements, and considerable energy consumption. These factors raise concerns about the economic and environmental impact of LLMs, hindering their widespread accessibility and deployment, especially on resource-constrained devices.
Prior research has attempted to address these issues through methods like post-training quantization, which reduces the precision of weights and activations in already trained models. While effective, these methods are often suboptimal. The paper highlights a more promising direction: 1-bit model architectures, such as the original BitNet, which replace floating-point operations with integer arithmetic, leading to substantial energy savings.
The core problem the paper aims to solve is the high cost of deploying and operating large, full-precision LLMs, while maintaining their high performance. The paper's innovative idea is to extend the concept by introducing a ternary weight representation ({-1, 0, 1}), which the authors refer to as 1.58-bit, to bridge the performance gap with full-precision models while retaining the cost benefits of extreme quantization.
2.2. Main Contributions / Findings
The primary contributions of this paper are:
- Introduction of BitNet b1.58: A novel 1-bit LLM variant where every single parameter (weight) is quantized to a ternary value in {-1, 0, 1}. This adds a 0 value to the original 1-bit BitNet's {-1, 1} scheme, which allows for explicit feature filtering and improves modeling capability.
- Performance Parity with Full Precision: The BitNet b1.58 model, starting from a 3B parameter size, is shown to match the perplexity and end-task performance of full-precision (FP16 or BF16) Transformer LLMs of equivalent size and trained with the same number of tokens.
- Significant Cost-Effectiveness: BitNet b1.58 demonstrates substantial improvements in latency, memory consumption, throughput, and energy consumption compared to full-precision baselines. For example, a 3B BitNet b1.58 is 2.71 times faster and uses 3.55 times less GPU memory than a 3B LLaMA LLM. A 70B BitNet b1.58 achieves 8.9 times higher throughput and 4.1 times faster latency than a 70B LLaMA LLM. It also saves 71.4 times the arithmetic operations energy for matrix multiplication on 7nm chips.
- New Scaling Law and Recipe: The work defines a new scaling law and recipe for training high-performance and cost-effective LLMs, suggesting that extreme quantization is viable for large-scale models trained from scratch.
- Enabling a New Computation Paradigm and Hardware: BitNet b1.58's unique computation paradigm, which heavily relies on integer additions rather than floating-point multiplications, opens the door for the design of specialized hardware (LPUs) optimized specifically for 1-bit LLMs, similar to how GPUs are optimized for floating-point operations.

These findings collectively demonstrate that BitNet b1.58 represents a Pareto improvement over state-of-the-art LLMs, offering superior cost-efficiency without sacrificing performance.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully understand this paper, a grasp of several foundational concepts in deep learning and large language models is essential:
- Large Language Models (LLMs): These are deep learning models, typically based on the Transformer architecture, trained on vast amounts of text data to understand, generate, and process human language. They exhibit emergent abilities as their size and training data increase.
- Transformer Architecture: A neural network architecture introduced in "Attention Is All You Need" (Vaswani et al., 2017). It relies heavily on self-attention mechanisms to weigh the importance of different parts of the input sequence, rather than recurrent or convolutional layers. Key components include multi-head attention, feed-forward networks, residual connections, and layer normalization.
  - Self-Attention: A mechanism that allows the model to weigh the importance of different words in the input sequence when processing a specific word. It computes a weighted sum of values based on query, key, and value vectors derived from the input (a minimal code sketch follows this list). The core formula for Attention is:
    $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $
    Where:
    - $Q$ is the Query matrix, representing the current token's query for other tokens.
    - $K$ is the Key matrix, representing the "key" of each token that other tokens can query.
    - $V$ is the Value matrix, representing the actual information to be aggregated from other tokens.
    - $d_k$ is the dimension of the key vectors, used for scaling to prevent very small gradients.
    - $QK^T$ computes the similarity scores between queries and keys.
    - $\mathrm{softmax}$ normalizes these scores into probability distributions.
- Quantization: The process of reducing the precision of numerical representations (e.g., weights, activations) in a neural network. This typically involves mapping floating-point numbers to lower-bit integer representations.
- Post-Training Quantization (PTQ): Quantization applied to an already trained full-precision model. It's often simpler but can lead to performance degradation.
- Quantization-Aware Training (QAT): Quantization applied during the training process, where the model learns to operate with low-precision numbers, often leading to better performance than PTQ.
- 1-bit Quantization: An extreme form of quantization where weights are restricted to two values, typically {-1, 1}. This drastically reduces model size and computational cost.
- Floating-Point (FP) and Brain Floating-Point (BF) Precision:
- FP16: Half-precision floating-point format using 16 bits (1 sign bit, 5 exponent bits, 10 mantissa bits). It offers a balance between precision and computational efficiency for deep learning.
- BF16: Brain floating-point format, also 16 bits (1 sign bit, 8 exponent bits, 7 mantissa bits). It offers a wider dynamic range than FP16, which can be beneficial for training stability in some cases.
- Matrix Multiplication: The most computationally intensive operation in Transformer models. It involves multiplying large matrices of weights and activations. In full-precision models, this involves floating-point multiplications and additions.
- DRAM (Dynamic Random-Access Memory) / SRAM (Static Random-Access Memory): Types of computer memory. DRAM is typically larger, slower, and cheaper, used for main system memory. SRAM is smaller, faster, and more expensive, often used for caches within CPUs and GPUs. Reducing the memory footprint can lead to less data transfer from DRAM to SRAM, improving latency.
- Perplexity (PPL): A common metric for evaluating language models. It measures how well a probability model predicts a sample. Lower perplexity generally indicates a better model.
  - Mathematically, for a given sequence of tokens $W = (w_1, \dots, w_N)$, the perplexity is defined as the inverse probability of the sequence, normalized by the number of tokens:
    $ \mathrm{PPL}(W) = \exp \left( - \frac{1}{N} \sum_{i=1}^{N} \log P(w_i | w_1, \dots, w_{i-1}) \right) $
    Where:
    - $W = (w_1, \dots, w_N)$ is the sequence of tokens.
    - $N$ is the number of tokens in the sequence.
    - $P(w_i | w_1, \dots, w_{i-1})$ is the probability of the $i$-th token given the preceding $i-1$ tokens, as predicted by the language model.
    - $\log$ is the natural logarithm.
- Zero-shot Accuracy: A measure of a model's ability to perform a task it was not explicitly trained on, without any examples or fine-tuning for that specific task.
- Latency: The time delay between a cause and effect, often measured as the time it takes for a system to respond to a request (e.g., generating one output token). Lower latency is better.
- Throughput: The rate at which a system processes tasks or data (e.g., number of tokens generated per second). Higher throughput is better.
- Energy Consumption: The amount of electrical energy used by a system. In deep learning, energy consumption is a significant concern due to environmental impact and operational costs. Reduced energy consumption is highly desirable.
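To make the attention formula above concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention. The array shapes and random inputs are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # query-key similarity scores
    weights = softmax(scores, axis=-1)   # normalized attention weights
    return weights @ V                   # weighted sum of value vectors

# Toy example: a sequence of 3 tokens with dimension 4.
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 3, 4))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 4)
```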
3.2. Previous Works
The paper builds upon and references several key prior studies:
- BitNet: The precursor to BitNet b1.58. This work introduced the concept of 1-bit Transformers trained from scratch. The core idea is to replace the standard nn.Linear layers with BitLinear layers, where weights are 1-bit ({-1, 1}) and activations are 8-bit. This drastically reduces the number of bits required to represent weights and allows for matrix multiplications to be performed primarily with integer additions, saving significant energy.
- LLaMA Architecture: The LLaMA (Large Language Model Meta AI) architecture has become a de-facto standard for open-source LLMs. BitNet b1.58 adopts LLaMA-alike components to ensure compatibility and leverage community-driven optimizations. Key architectural components include:
  - RMSNorm [ZS19]: Root Mean Square Layer Normalization, a simplified alternative to LayerNorm that normalizes inputs based on their root mean square.
  - SwiGLU [Sha20]: A variant of the Gated Linear Unit (GLU) activation function, often used in feed-forward networks within Transformers to improve performance.
  - Rotary Position Embedding (RoPE): An embedding technique that encodes absolute positional information with a rotation matrix, allowing for effective handling of longer sequences.
  - Bias Removal: LLaMA models typically remove biases in linear layers and LayerNorm for simplicity and sometimes improved stability.
- Post-Training Quantization (PTQ) Methods: These methods aim to quantize pre-trained full-precision models to lower bit-widths (e.g., 4-bit, 2-bit, even 1-bit) without retraining. Examples include SmoothQuant, OPTQ [FAHA23], AWQ, QuIP [CCKS23], and QuIP#. While widely used in industry for inference, the paper notes that PTQ is often suboptimal compared to quantization-aware training from scratch.
- Mixture-of-Experts (MoE): A technique used in deep learning to scale models by conditionally activating different "expert" sub-networks for different inputs. While MoE models can be very large, they only activate a subset of parameters per input, leading to reduced computational cost. However, they typically suffer from high memory consumption and inter-chip communication overhead.
- FasterTransformer: A highly optimized inference engine for Transformer models developed by NVIDIA, used here for measuring latency.
- Ladder: A framework for efficient tensor compilation on customized data formats, mentioned for its 2-bit kernel integration with BitNet b1.58.
- PagedAttention: An efficient memory management technique for LLM serving, used in systems like vLLM, which BitNet b1.58 aims to be compatible with.
- Energy Models [Hor14, ZZL22]: The paper references prior work on estimating energy consumption for arithmetic operations, such as the 7nm process node energy model by Horowitz [Hor14] and PokeBNN [ZZL22].
3.3. Technological Evolution
The evolution of LLMs has seen a clear trend towards larger models exhibiting superior capabilities. However, this growth has been accompanied by a proportional increase in computational and memory resource demands. Initially, research focused on making LLMs more efficient through post-training quantization (PTQ), converting trained models to lower precision (e.g., 4-bit) for inference. While this offered some relief, it often came with performance compromises, as models weren't inherently designed for low-bit operations.
The next evolutionary step, exemplified by the original BitNet , shifted from PTQ to quantization-aware training of 1-bit models from scratch. This approach aimed to bake the low-precision constraint directly into the training process, hoping to mitigate performance loss while maximizing efficiency gains. BitNet replaced floating-point matrix multiplications with integer additions, drastically reducing energy consumption.
This paper's work, BitNet b1.58, represents a refinement of this 1-bit training paradigm. By introducing a ternary weight representation ({-1, 0, 1}), it moves beyond the strict {-1, 1} weights of previous 1-bit models. This 1.58-bit scheme seeks to find a sweet spot: gaining the explicit feature filtering capability of the 0 weight (effectively turning off connections) while retaining the extreme efficiency benefits. This positions BitNet b1.58 as a leader in the movement towards highly efficient LLMs that are performant enough for practical deployment, even on resource-constrained environments.
3.4. Differentiation Analysis
Compared to the main methods in related work, BitNet b1.58 introduces several core differences and innovations:
- From Binary to Ternary Weights: The most significant differentiation from the original 1-bit BitNet is the expansion of weights from {-1, 1} to {-1, 0, 1}. This ternary representation, dubbed 1.58-bit, provides an additional value 0. This 0 value allows for explicit feature filtering, effectively "turning off" certain connections or features. The authors argue this significantly improves the modeling capability of 1-bit LLMs, which is crucial for matching full-precision performance.
- Performance Parity with Full-Precision Models (from scratch): Unlike many post-training quantization methods that often incur some performance degradation, BitNet b1.58 is trained from scratch and demonstrated to match FP16 LLaMA LLM performance (both perplexity and end-task accuracy) starting from a 3B parameter size. This is a critical distinction, as it shows that extreme quantization doesn't necessitate a performance trade-off if integrated into the training process.
- Enhanced Cost-Effectiveness: While 1-bit BitNet already offered significant cost savings, BitNet b1.58 retains these benefits and, in some aspects, enhances them, since its improved modeling capacity allows smaller models to achieve performance comparable to larger full-precision ones. It offers lower latency, memory, and energy consumption, and higher throughput, compared to FP16 models.
- New Scaling Law: The paper suggests that BitNet b1.58 defines a new scaling law. This implies that the conventional understanding of how performance scales with model size and precision might need re-evaluation for 1-bit architectures, indicating that highly efficient 1.58-bit models can achieve performance typically associated with much larger FP16 models. For example, a 13B BitNet b1.58 can be more efficient than a 3B FP16 LLM, a 30B BitNet b1.58 more efficient than a 7B FP16 LLM, and a 70B BitNet b1.58 more efficient than a 13B FP16 LLM.
- Direct Support for Hardware Optimization: By adopting integer-only arithmetic (or primarily integer arithmetic), BitNet b1.58 further strengthens the case for designing new, specialized hardware (LPUs) optimized for 1-bit LLMs. This is distinct from optimizing existing hardware (like GPUs) for floating-point operations or using specialized kernels.

In essence, BitNet b1.58 differentiates itself by pushing the boundary of extreme quantization from binary to ternary, demonstrating performance parity with full-precision models when trained from scratch, and consequently offering unprecedented cost savings and enabling a paradigm shift in LLM deployment and hardware design.
4. Methodology
4.1. Principles
The core principle behind BitNet b1.58 is to significantly reduce the computational and memory footprint of Large Language Models (LLMs) by quantizing all model weights to the ternary representation {-1, 0, 1}. This extreme quantization, which the authors term "1.58-bit," aims to leverage the efficiency benefits of integer arithmetic (primarily additions) for matrix multiplications, while the inclusion of the 0 value provides enhanced modeling capability by allowing for explicit feature filtering or "pruning" of connections during training. The method operates under the premise that an LLM trained from scratch with these low-precision constraints can achieve performance parity with full-precision FP16 models, thereby offering a Pareto optimal solution for performance and cost.
4.2. Core Methodology In-depth (Layer by Layer)
BitNet b1.58 is fundamentally based on the BitNet architecture, which modifies the standard Transformer by replacing conventional nn.Linear layers with BitLinear layers. This new variant introduces specific modifications to the quantization function and adheres to LLaMA-alike architectural components.
4.2.1. Quantization Function for Weights
To constrain the weights to the ternary set {-1, 0, +1}, BitNet b1.58 adopts an absmean quantization function. This function first scales the weight matrix by its average absolute value and then rounds each scaled value to the nearest integer among {-1, 0, +1}.
The process can be broken down as follows:
1. Calculate the average absolute value (gamma): For a given weight matrix $W$, the average absolute value is computed across all its elements. This acts as a scaling factor.
   $ \gamma = \frac{1}{nm} \sum_{ij} | W_{ij} | $
   Where:
   - $W$ is the full-precision weight matrix.
   - $n$ and $m$ are the dimensions of the weight matrix (number of rows and columns).
   - $W_{ij}$ is an individual element of the weight matrix.
   - $| \cdot |$ denotes the absolute value of an element.
   - $\sum_{ij}$ indicates summation over all elements of the matrix.

2. Scale the weight matrix: Each element of the original full-precision weight matrix is divided by the calculated $\gamma$ (with a small epsilon $\epsilon$ added for numerical stability), resulting in the scaled weight $W / (\gamma + \epsilon)$.

3. Round and Clip: The scaled weights are then passed through a RoundClip function. This function first rounds the scaled value to the nearest integer and then clips it to be within a specified range [a, b]. In this case, the range is [-1, 1].
   $ \widetilde{W} = \mathrm{RoundClip}\left( \frac{W}{\gamma + \epsilon}, -1, 1 \right) $
   $ \mathrm{RoundClip}(x, a, b) = \max\bigl( a, \min(b, \mathrm{round}(x)) \bigr) $
   Where:
   - $\widetilde{W}$ is the resulting quantized weight matrix, with elements in {-1, 0, +1}.
   - $W / (\gamma + \epsilon)$ is the scaled full-precision weight.
   - $\epsilon$ is a small constant added to the denominator to prevent division by zero, especially if $\gamma$ is very small.
   - $\mathrm{round}(x)$ rounds $x$ to the nearest integer.
   - $\min(b, \cdot)$ ensures the value does not exceed $b$.
   - $\max(a, \cdot)$ ensures the value is not less than $a$.
   - For BitNet b1.58, $a = -1$ and $b = 1$.

This process effectively maps the continuous range of full-precision weights to the discrete set {-1, 0, +1}; a minimal code sketch of the function is given below.
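The following NumPy sketch illustrates the absmean quantization described above. The epsilon value and the int8 storage type are implementation assumptions for the sketch, not details specified by the paper:

```python
import numpy as np

def absmean_quantize_weights(W: np.ndarray, eps: float = 1e-5):
    """Map a full-precision weight matrix to ternary values {-1, 0, +1}.

    Implements gamma = mean(|W|) followed by RoundClip(W / (gamma + eps), -1, 1).
    Returns the ternary matrix and gamma (kept so outputs can be rescaled later).
    """
    gamma = np.abs(W).mean()                          # average absolute value of all elements
    W_ternary = np.clip(np.round(W / (gamma + eps)), -1, 1)
    return W_ternary.astype(np.int8), gamma

# Illustrative usage on a random 4x8 weight matrix.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8)).astype(np.float32)
W_q, gamma = absmean_quantize_weights(W)
print(W_q)      # entries are only -1, 0, or +1
print(gamma)    # scaling factor used during quantization
```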
4.2.2. Quantization Function for Activations
The quantization function for activations in BitNet b1.58 largely follows the implementation used in the original BitNet, with one key modification:
- Scaling Range: Instead of scaling activations before non-linear functions to the range $[0, Q_b]$ (where $Q_b$ typically represents the maximum value of the quantized range, e.g., 255 for an 8-bit unsigned integer), the activations are scaled to $[-Q_b, Q_b]$ per token. This eliminates the need for zero-point quantization, which simplifies both implementation and system-level optimization. The authors report that this change introduces negligible effects on performance.
- Bit-width: Activations are quantized to 8-bit, which is a common practice in low-bit quantization to maintain sufficient information flow while still offering memory and computational benefits over full-precision activations.

The combination of 1.58-bit weights and 8-bit activations means that the core matrix multiplications, which form the bulk of LLM computation, involve operations between very low-precision values. The multiplication of a weight $w \in \{-1, 0, 1\}$ by an 8-bit activation $x$ can be implemented using simple integer additions and subtractions (e.g., $-1 \cdot x = -x$, $0 \cdot x = 0$, $1 \cdot x = x$), avoiding complex floating-point operations. This new computational paradigm is key to the energy and latency savings; a sketch follows.
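As a rough illustration of this computation paradigm, the sketch below quantizes activations per token with a symmetric absmax scale and then carries out the matrix product with a ternary weight matrix using only integer additions and subtractions. The choice of $Q_b = 127$ (symmetric signed int8) and the epsilon are assumptions for the sketch, not values taken from the paper:

```python
import numpy as np

def quantize_activations_per_token(X: np.ndarray, Qb: int = 127, eps: float = 1e-5):
    """Scale each token (row) of X into [-Qb, Qb] by its absmax, with no zero-point,
    and round to 8-bit integers. Returns the int8 activations and per-token scales."""
    scale = Qb / (np.abs(X).max(axis=-1, keepdims=True) + eps)
    Xq = np.clip(np.round(X * scale), -Qb, Qb).astype(np.int8)
    return Xq, scale

def ternary_matmul(Xq: np.ndarray, W_ternary: np.ndarray):
    """Multiply int8 activations by a ternary {-1, 0, +1} weight matrix.
    Each product is +x, -x, or 0, so every output is just a signed sum of inputs."""
    out = np.zeros((Xq.shape[0], W_ternary.shape[1]), dtype=np.int32)
    X32 = Xq.astype(np.int32)
    for j in range(W_ternary.shape[1]):
        col = W_ternary[:, j]
        out[:, j] = X32[:, col == 1].sum(axis=1) - X32[:, col == -1].sum(axis=1)
    return out  # rescale by gamma / scale afterwards to recover real-valued outputs

# Illustrative usage with a random ternary weight matrix.
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 8)).astype(np.float32)      # 2 tokens, hidden size 8
W_t = rng.integers(-1, 2, size=(8, 4)).astype(np.int8)  # ternary weights in {-1, 0, 1}
Xq, scale = quantize_activations_per_token(X)
print(ternary_matmul(Xq, W_t))
```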
4.2.3. LLaMA-alike Components
To facilitate integration with existing open-source frameworks and leverage well-established architectural designs, BitNet b1.58 adopts several components from the LLaMA architecture:

- RMSNorm [ZS19]: Root Mean Square Layer Normalization is used instead of standard LayerNorm. RMSNorm normalizes inputs based on their root mean square, which can be computationally simpler and often performs comparably to LayerNorm.
- SwiGLU [Sha20]: The Swish-Gated Linear Unit activation function is employed in the feed-forward networks of the Transformer blocks. SwiGLU is known for improving Transformer performance.
- Rotary Position Embedding (RoPE): This type of positional embedding is used to inject positional information into the model's inputs. RoPE has been shown to be effective for handling long sequences by allowing attention to decay with distance in a way that respects relative positions.
- Bias Removal: Consistent with LLaMA models, all biases are removed from linear layers and normalization layers within BitNet b1.58. This simplifies the model and can sometimes improve generalization.

By incorporating these widely adopted and optimized LLaMA components, BitNet b1.58 aims for seamless integration into popular open-source software ecosystems like Huggingface, vLLM, and llama.cpp, with minimal adaptation efforts required.
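For reference, here are minimal NumPy sketches of the standard RMSNorm and SwiGLU formulations adopted above. The epsilon and the shapes of the learned parameters are illustrative assumptions; the paper does not spell out these details:

```python
import numpy as np

def rms_norm(x: np.ndarray, gain: np.ndarray, eps: float = 1e-6):
    """RMSNorm: divide by the root mean square of the features (no mean
    subtraction, no bias) and apply a learned per-feature gain."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * gain

def swiglu(x: np.ndarray, W_gate: np.ndarray, W_up: np.ndarray):
    """SwiGLU gating used in the feed-forward block: Swish(x W_gate) * (x W_up).
    The subsequent down-projection is omitted for brevity."""
    gate = x @ W_gate
    swish = gate / (1.0 + np.exp(-gate))   # Swish / SiLU activation
    return swish * (x @ W_up)
```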
5. Experimental Setup
5.1. Datasets
The experiments used a variety of datasets for pre-training and evaluation:
-
RedPajama dataset
[Com23]: This is a large, open-source dataset designed for training LLMs. It consists of diverse web data.
  - Usage: Models were pre-trained on the RedPajama dataset for 100 billion tokens.
  - Characteristics: Large scale, general domain text, suitable for foundational LLM pre-training.
-
WikiText2
[MXBS16] and C4 datasets: These are standard benchmarks for evaluating language models.
  - Usage: Used to report validation perplexity.
  - Characteristics: WikiText2 is a collection of "good" articles from Wikipedia, known for its clean and diverse content. C4 (Colossal Clean Crawled Corpus) is a much larger dataset derived from web crawls, preprocessed for quality.
-
Zero-shot Accuracy Benchmarks: A suite of common NLP tasks for evaluating LLM performance without task-specific fine-tuning. The evaluation pipeline from lm-evaluation-harness was used.
  - ARC-Easy [YBS19] and ARC-Challenge [YBS19]: The AI2 Reasoning Challenge datasets, designed to test scientific reasoning. ARC-Easy contains questions solvable by retrieved knowledge, while ARC-Challenge requires multi-step reasoning.
    - Example (ARC-Easy): "Which type of wave has the greatest frequency and shortest wavelength?" (A) Radio waves (B) Microwaves (C) Gamma rays (D) X-rays. (Correct: C)
  - Hellaswag: A dataset for commonsense natural language inference, requiring models to choose the most plausible continuation of a given sentence.
    - Example: "A person is playing a guitar. They sit on a chair and start to play. Then they ..." (A) sing a song. (B) play a drum. (C) eat a sandwich. (D) stop playing. (Correct: A)
  - Winogrande [SBBC20]: An adversarial Winograd Schema Challenge, testing commonsense reasoning by resolving ambiguous pronouns in sentences.
    - Example: "The city councilmen refused the demonstrators a permit because they [feared/advocated] violence." (Which word makes "they" refer to the councilmen or the demonstrators?)
  - PIQA: Physical Interaction Question Answering, focused on physical commonsense.
    - Example: "How can I cut a piece of wood?" (A) Use a saw. (B) Use a spoon. (C) Use a pillow. (Correct: A)
  - OpenbookQA [MCKS18]: Questions that require commonsense knowledge found in textbooks.
  - BoolQ: A dataset of yes/no questions from Wikipedia articles.
  - SciQ [WLG17]: Science question answering dataset.
  - LAMBADA: A dataset designed to test the ability of models to predict the last word of a sentence, requiring broad discourse context.
-
StableLM-3B
[TBMR]: For the 2T token training comparison, BitNet b1.58 was trained following the data recipe of StableLM-3B, which is a state-of-the-art open-source 3B model. The evaluation benchmark consisted of Winogrande, PIQA, SciQ, LAMBADA, and ARC-easy.

These datasets were chosen to cover a wide range of LLM capabilities, from basic language modeling (perplexity) to complex reasoning and commonsense inference (zero-shot accuracy tasks), ensuring a comprehensive validation of the proposed method's performance across different aspects.
5.2. Evaluation Metrics
For every evaluation metric mentioned in the paper, here is a complete explanation:
-
Perplexity (PPL):
- Conceptual Definition: Perplexity is a measure of how well a probability distribution or probability model predicts a sample. In the context of language models, a lower perplexity score indicates that the model is better at predicting the next word in a sequence, suggesting a more accurate and fluent understanding of the language. It is inversely related to the probability of the test set, normalized by the number of words.
- Mathematical Formula:
  $ \mathrm{PPL}(W) = \exp \left( - \frac{1}{N} \sum_{i=1}^{N} \log P(w_i | w_1, \dots, w_{i-1}) \right) $
  (Note: the base of the logarithm can vary. Natural log paired with exp, as written here, is the common convention; base 2 instead yields "bits per word." A short computational sketch follows this list of metrics.)
- Symbol Explanation:
  - $W = (w_1, \dots, w_N)$ is a sequence of tokens (words) from the test set.
  - $N$ is the total number of tokens in the sequence.
  - $P(w_i | w_1, \dots, w_{i-1})$ is the probability of the $i$-th token given the preceding tokens, as assigned by the language model.
  - $\log$ is the natural logarithm.
  - $\exp$ is the exponential function (e to the power of the argument).
-
Zero-shot Accuracy:
- Conceptual Definition: Zero-shot accuracy refers to a model's ability to correctly perform a task without having been explicitly trained on any examples of that task. For
LLMs, this means providing a prompt for a task (e.g., question answering, summarization) and evaluating the model's direct output based on its general language understanding capabilities, without any task-specific fine-tuning or examples (few-shot learning). It is typically reported as the percentage of correct answers.
- Mathematical Formula: $ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} \times 100\% $
- Symbol Explanation:
  - Number of Correct Predictions: The count of instances where the model's output matches the ground truth for a given task.
  - Total Number of Predictions: The total count of instances (questions, prompts) evaluated for the task.
-
Memory (GB):
- Conceptual Definition: This metric quantifies the amount of Graphics Processing Unit (GPU) memory (in Gigabytes) consumed by the model during inference. Lower memory usage is crucial for deploying LLMs on devices with limited resources or for allowing larger batch sizes.
- Mathematical Formula: No specific formula; it is a direct measurement of GPU memory allocation.
- Symbol Explanation: Measured in GB (Gigabytes).
-
Latency (ms):
- Conceptual Definition: Latency refers to the time taken for the model to generate one output token during inference. It is a critical metric for real-time applications where quick responses are necessary. Lower latency indicates faster inference.
- Mathematical Formula: No specific formula; it's a direct measurement of time.
- Symbol Explanation: Measured in ms (milliseconds). The paper specifically reports "time per output token."
-
Throughput (tokens/s):
- Conceptual Definition: Throughput measures the rate at which the model can process and generate tokens per unit of time, typically tokens per second. Higher throughput indicates greater efficiency in processing multiple requests or larger batches simultaneously, which is important for serving many users or large workloads.
- Mathematical Formula: $ \text{Throughput} = \frac{\text{Total Number of Generated Tokens}}{\text{Total Inference Time (seconds)}} $
- Symbol Explanation:
  - Total Number of Generated Tokens: The sum of all tokens produced across all processed inputs.
  - Total Inference Time (seconds): The total time elapsed during the generation of these tokens. Throughput is measured in tokens/s.
-
Energy Consumption:
- Conceptual Definition: This metric estimates the electrical energy consumed by the model's operations, primarily focusing on matrix multiplication, which is the dominant contributor to LLM computational cost. Reduced energy consumption leads to lower operational costs and a smaller environmental footprint. The paper differentiates between arithmetic operations energy and end-to-end energy cost.
- Mathematical Formula: No single universal formula is provided, as it depends on the specific hardware and operation. It is often calculated based on power models of different arithmetic operations (e.g., FP16 add/multiply, INT8 add) on specific process nodes (e.g., 7nm). For example, Horowitz [Hor14] provides energy costs per operation type. For matrix multiplication, the number of operations (e.g., multiplications and additions) is multiplied by their respective energy cost per operation.
- Symbol Explanation: Measured in units of energy (e.g., Joules or relative units). The paper reports relative savings (e.g., 71.4 times).
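To make the perplexity definition above concrete, here is a minimal sketch that computes PPL from per-token log-probabilities. The log-probability values are made-up placeholders, not numbers from the paper:

```python
import math

def perplexity(token_log_probs):
    """PPL = exp(-(1/N) * sum_i log P(w_i | w_1..w_{i-1})), where the inputs are
    natural-log probabilities assigned by the language model to each token."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# Placeholder log-probabilities for a 4-token sequence.
print(perplexity([-2.3, -1.1, -0.7, -3.0]))  # ~5.9; lower is better
```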
5.3. Baselines
The paper primarily compared BitNet b1.58 against the following baseline models:
- Reproduced FP16 LLaMA LLM: This is the main baseline. The authors reproduced the LLaMA LLM in various sizes (700M, 1.3B, 3B, 7B, 13B, 70B parameters) using FP16 (full-precision) weights and activations.
  - Why representative: LLaMA has become a widely adopted and highly competitive open-source LLM architecture, making it an excellent standard for comparison. Using FP16 represents the current common practice for high-performance LLMs. Reproducing it ensures a fair comparison under the same training conditions (e.g., the RedPajama dataset for 100 billion tokens).
- StableLM-3B [TBMR]: For the extended training token experiment (2T tokens), BitNet b1.58 was compared against StableLM-3B, a state-of-the-art open-source 3B parameter model.
  - Why representative: It serves as a benchmark for comparing LLM performance specifically when trained on a very large number of tokens (2T), demonstrating BitNet b1.58's ability to scale effectively with training data.
- Original 1-bit BitNet (implicit): Although not explicitly listed as a baseline in the results tables, BitNet b1.58 is presented as a variant of and improvement over the original 1-bit BitNet. The paper differentiates BitNet b1.58 by its ternary weights, which provide stronger modeling capabilities than the binary weights of the original.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate that BitNet b1.58 achieves performance comparable to full-precision LLaMA LLMs while offering significant cost savings across multiple dimensions.
The following are the results from Table 1 of the original paper:
| Models | Size | Memory (GB)↓ | Latency (ms)↓ | PPL↓ |
|---|---|---|---|---|
| LLaMA LLM | 700M | 2.08 (1.00x) | 1.18 (1.00x) | 12.33 |
| BitNet b1.58 | 700M | 0.80 (2.60x) | 0.96 (1.23x) | 12.87 |
| LLaMA LLM | 1.3B | 3.34 (1.00x) | 1.62 (1.00x) | 11.25 |
| BitNet b1.58 | 1.3B | 1.14 (2.93x) | 0.97 (1.67x) | 11.29 |
| LLaMA LLM | 3B | 7.89 (1.00x) | 5.07 (1.00x) | 10.04 |
| BitNet b1.58 | 3B | 2.22 (3.55x) | 1.87 (2.71x) | 9.91 |
| BitNet b1.58 | 3.9B | 2.38 (3.32x) | 2.11 (2.40x) | 9.62 |
Perplexity and Cost (Table 1):
BitNet b1.58 starts to match or even outperform the FP16 LLaMA LLM in terms of perplexity (PPL) at the 3B parameter size. For instance, BitNet b1.58 (3B) achieves a PPL of 9.91, slightly better than LLaMA LLM (3B) at 10.04. This performance parity is achieved with significantly reduced cost:
- Memory: The 3B BitNet b1.58 uses only 2.22 GB of memory, 3.55 times less than the 7.89 GB used by the 3B LLaMA LLM. This reduction is critical for deploying LLMs on devices with limited GPU memory.
- Latency: The 3B BitNet b1.58 is 2.71 times faster, with a latency of 1.87 ms compared to 5.07 ms for the 3B LLaMA LLM. This translates to much faster inference and generation.

The 3.9B BitNet b1.58 variant further improves PPL to 9.62, outperforming the 3B LLaMA LLM baseline, while still being significantly more efficient (3.32x less memory, 2.40x faster). At smaller scales (700M, 1.3B), BitNet b1.58 shows higher PPL but still offers substantial cost reductions (2.60-2.93x memory, 1.23-1.67x latency). This suggests the benefits of BitNet b1.58 become more pronounced as model size increases, allowing it to better approximate full-precision performance.
The following are the results from Table 2 of the original paper:
| Models | Size | ARCe | ARCc | HS | BQ | OQ | PQ | WGe | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| LLaMA LLM | 700M | 54.7 | 23.0 | 37.0 | 60.0 | 20.2 | 68.9 | 54.8 | 45.5 |
| BitNet b1.58 | 700M | 51.8 | 21.4 | 35.1 | 58.2 | 20.0 | 68.1 | 55.2 | 44.3 |
| LLaMA LLM | 1.3B | 56.9 | 23.5 | 38.5 | 59.1 | 21.6 | 70.0 | 53.9 | 46.2 |
| BitNet b1.58 | 1.3B | 54.9 | 24.2 | 37.7 | 56.7 | 19.6 | 68.8 | 55.8 | 45.4 |
| LLaMA LLM | 3B | 62.1 | 25.6 | 43.3 | 61.8 | 24.6 | 72.1 | 58.2 | 49.7 |
| BitNet b1.58 | 3B | 61.4 | 28.3 | 42.9 | 61.5 | 26.6 | 71.5 | 59.3 | 50.2 |
| BitNet b1.58 | 3.9B | 64.2 | 28.7 | 44.2 | 63.5 | 24.2 | 73.2 | 60.5 | 51.2 |

(Column abbreviations: ARCe = ARC-Easy, ARCc = ARC-Challenge, HS = Hellaswag, BQ = BoolQ, OQ = OpenbookQA, PQ = PIQA, WGe = Winogrande.)
Zero-shot Accuracy (Table 2):
The zero-shot accuracy on various NLP tasks further reinforces the findings from perplexity.
- At 700M and 1.3B sizes, BitNet b1.58 shows a slight performance gap compared to LLaMA LLM. For example, at 700M, BitNet b1.58 has an average accuracy of 44.3% vs. LLaMA LLM's 45.5%.
- However, at 3B parameters, BitNet b1.58 not only closes this gap but slightly surpasses LLaMA LLM, achieving an average accuracy of 50.2% compared to 49.7%. This indicates that the 1.58-bit representation, with its 0 value for feature filtering, becomes sufficiently expressive at this scale.
- The 3.9B BitNet b1.58 model demonstrates even stronger performance, with an average accuracy of 51.2%, clearly outperforming the 3B LLaMA LLM baseline. This is a Pareto improvement because it achieves higher performance with lower resource costs.

Figure 2 shows the decoding latency (left) and memory consumption (right) of BitNet b1.58 and LLaMA LLM across different model sizes.
Memory and Latency Scaling (Figure 2):
Figure 2 visually confirms the trends observed in Table 1 for larger models (up to 70B). The speed-up and memory reduction for BitNet b1.58 increase with model size. For a 70B model, BitNet b1.58 is 4.1 times faster than LLaMA LLM. This is attributed to the fact that the cost of nn.Linear layers, which are heavily optimized in BitNet b1.58, grows with model size, making the relative savings more substantial. Memory consumption follows a similar trend, as the full-precision embedding layer becomes a smaller proportion of the total model size for larger models. The authors note that further optimizations are possible, as these measurements were taken with a 2-bit kernel.
The following are the results from Table 3 of the original paper:
| Models | Size | Max Batch Size | Throughput (tokens/s) |
|---|---|---|---|
| LLaMA LLM | 70B | 16 (1.0x) | 333 (1.0x) |
| BitNet b1.58 | 70B | 176 (11.0x) | 2977 (8.9x) |
Throughput (Table 3):
For a 70B parameter model, BitNet b1.58 achieves an impressive throughput improvement. Using two 80GB A100 GPUs with pipeline parallelism, BitNet b1.58 can support 11 times the batch size (176 vs. 16) of LLaMA LLM, resulting in an 8.9 times higher throughput (2977 tokens/s vs. 333 tokens/s). This is crucial for serving high-demand applications and reducing inference costs at scale.
Figure 3 shows the energy consumption of BitNet b1.58 compared to LLaMA LLM at 7nm process nodes.
Energy Consumption (Figure 3):
The energy consumption analysis highlights one of the most profound benefits.
- Arithmetic Operations: For matrix multiplication on 7nm chips, BitNet b1.58 saves 71.4 times the energy consumption compared to LLaMA LLM. This is because BitNet b1.58 primarily uses INT8 addition, which is far more energy-efficient than the FP16 addition and multiplication used by LLaMA LLM.
- End-to-End Energy: As model size scales, BitNet b1.58 becomes increasingly more efficient in end-to-end energy consumption. This is because the proportion of nn.Linear operations (where BitNet b1.58 excels in efficiency) grows with model size, while other components' costs become relatively smaller. This translates directly to lower operational costs and a reduced environmental footprint.

New Scaling Law: The paper effectively defines a new scaling law where BitNet b1.58 models of a certain size can be more efficient than FP16 LLMs of a significantly smaller size in terms of latency, memory, and energy consumption:

- A 13B BitNet b1.58 is more efficient than a 3B FP16 LLM.
- A 30B BitNet b1.58 is more efficient than a 7B FP16 LLM.
- A 70B BitNet b1.58 is more efficient than a 13B FP16 LLM.

The following are the results from Table 4 of the original paper:
| Models | Tokens | Winogrande | PIQA | SciQ | LAMBADA | ARC-easy | Avg. |
|---|---|---|---|---|---|---|---|
| StableLM-3B | 2T | 64.56 | 76.93 | 90.75 | 66.09 | 67.78 | 73.22 |
| BitNet b1.58 3B | 2T | 66.37 | 78.40 | 91.20 | 67.63 | 68.12 | 74.34 |
Training with 2T Tokens (Table 4):
When trained with 2 trillion tokens, a BitNet b1.58 3B model surpasses StableLM-3B (a state-of-the-art FP16 3B model also trained with 2T tokens) across all evaluated zero-shot accuracy tasks. BitNet b1.58 3B achieves an average accuracy of 74.34%, compared to StableLM-3B's 73.22%. This demonstrates BitNet b1.58's strong generalization capabilities and its ability to scale effectively with vast amounts of training data, further validating its performance and potential as a foundational LLM architecture.
6.2. Data Presentation (Tables)
All tables from the original paper have been transcribed and presented in the "Core Results Analysis" section above, specifically Table 1, Table 2, Table 3, and Table 4.
6.3. Ablation Studies / Parameter Analysis
The paper does not explicitly present dedicated ablation studies or detailed parameter sensitivity analyses. However, the comparison of BitNet b1.58 at various model sizes (700M, 1.3B, 3B, 3.9B, 7B, 13B, 70B) against LLaMA LLM provides an implicit analysis of how the 1.58-bit quantization scales with model size. The results consistently show that the performance gap narrows and eventually closes, even exceeding FP16 performance, as the model size of BitNet b1.58 increases. This suggests that the 1.58-bit quantization scheme becomes more effective with larger models, likely due to the increased model capacity that can absorb the expressiveness limitations of low-bit weights. The inclusion of the 0 weight for feature filtering is a core design choice differentiating BitNet b1.58 from previous 1-bit models, and its effectiveness is validated by the observed performance parity. The authors also implicitly analyze the impact of their activation quantization choice (scaling to $[-Q_b, Q_b]$ per token without zero-point quantization), stating it introduces "negligible effects to the performance in our experiments."
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces BitNet b1.58, a groundbreaking 1.58-bit Large Language Model (LLM) variant where all weights are ternary ({-1, 0, 1}). The core finding is that BitNet b1.58 can match the perplexity and end-task performance of full-precision (FP16) Transformer LLMs of the same size and trained with the same number of tokens, starting from a 3B parameter model. This is achieved while delivering vastly superior cost-effectiveness across key metrics: latency, memory consumption, throughput, and energy consumption. For instance, a 3B BitNet b1.58 is 2.71x faster and uses 3.55x less memory than a 3B FP16 LLaMA LLM, and a 70B BitNet b1.58 achieves 8.9x higher throughput. The paper establishes a new scaling law, demonstrating that larger BitNet b1.58 models can be more efficient than significantly smaller FP16 counterparts. This work pioneers a new computation paradigm, relying heavily on integer arithmetic, and explicitly calls for the development of specialized hardware optimized for 1-bit LLMs.
7.2. Limitations & Future Work
The authors highlight several promising avenues for future research and development:
- 1-bit Mixture-of-Experts (MoE) LLMs: MoE models, while computationally efficient in terms of FLOPs, suffer from high memory consumption and inter-chip communication overhead. 1.58-bit LLMs can address these challenges by significantly reducing the memory footprint, potentially enabling MoE models to be deployed on fewer devices or even a single chip, thereby eliminating cross-device communication overheads.
- Native Support of Long Sequences in LLMs: The KV cache (Key-Value cache) is a major memory bottleneck for long-sequence inference. BitNet b1.58's 8-bit activations already reduce memory usage, effectively doubling the context length given the same resources. Future work could explore further lossless compression of these activations to 4 bits or even lower for 1.58-bit LLMs, thereby enabling even longer sequence lengths.
- LLMs on Edge and Mobile: The reduced memory and energy consumption of 1.58-bit LLMs make them ideal candidates for deployment on resource-constrained edge and mobile devices. This opens up new possibilities for on-device LLM applications that were previously infeasible. Furthermore, the CPU-friendly nature of integer operations makes them well-suited for the primary processors in these devices, enhancing their capabilities.
- New Hardware for 1-bit LLMs: The paper explicitly calls for the design of specific hardware (e.g., LPUs, or Logic Processing Units) tailored for the unique computation paradigm of 1-bit LLMs. The shift from floating-point to predominantly integer arithmetic offers a fundamental opportunity for specialized hardware architectures that can drastically improve performance and energy efficiency beyond what general-purpose GPUs can achieve.
7.3. Personal Insights & Critique
This paper marks a significant milestone in the efficiency of Large Language Models. The demonstration of performance parity between 1.58-bit models and FP16 models, especially when trained from scratch, is a powerful rebuttal to the notion that extreme quantization necessarily entails a performance compromise. The inclusion of the 0 in the ternary weight system is a clever yet simple innovation that likely contributes significantly to the model's expressiveness, allowing it to "prune" irrelevant connections and effectively perform feature filtering. This small addition differentiates it from purely binary neural networks and appears to be a critical factor in achieving competitive performance.
The implications of BitNet b1.58 are profound. If LLMs can be deployed with orders of magnitude less memory and energy, it democratizes access to advanced AI capabilities, making them viable for a much wider range of applications and devices, from smartphones to embedded systems. This could lead to a proliferation of personalized, offline AI assistants, reducing reliance on cloud computing and addressing privacy concerns.
However, several aspects warrant further investigation and potential critique:
- Training Stability and Complexity: While the paper states BitNet b1.58 is "trained from scratch," the training process for 1-bit models can be notoriously difficult and sensitive to hyper-parameters. A more detailed exposition of the training recipe, optimization strategies, and stability challenges would be valuable for the community.
- Generalization Beyond LLaMA-alike Architectures: The experiments are predominantly based on LLaMA-alike components. While this is a strong starting point, it would be insightful to see if the 1.58-bit quantization strategy generalizes as effectively to other Transformer variants or entirely different model architectures.
- Hardware Realization Challenges: The call for new hardware is exciting, but the development of specialized LPUs is a massive undertaking. The practical timeline for such hardware, and the ecosystem of tools (compilers, software stacks) required to support it, remains an open question. Bridging the gap between theoretical efficiency and practical deployment on custom hardware will be critical.
- Activation Quantization Details: While 8-bit activations are used, the choice of scaling to $[-Q_b, Q_b]$ per token is stated to have "negligible effects." A deeper analysis or ablation demonstrating the robustness of this choice and its impact on different model scales or tasks would strengthen this claim.
- Impact of the 0 Weight: The paper attributes improved modeling capability to the 0 value for feature filtering. It would be interesting to see a direct ablation study comparing BitNet b1.58 ({-1, 0, 1}) with a hypothetical BitNet b1 ({-1, 1}) trained from scratch with the same setup, to quantify the specific contribution of the 0 value.

Overall, BitNet b1.58 represents a compelling advancement, challenging the conventional wisdom of LLM scaling and pointing towards a future where high-performance AI is inherently more efficient and accessible. Its methods and conclusions could be highly transferable to other deep learning domains facing similar computational constraints.