ModuLoRA: Finetuning 2-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers
TL;DR Summary
ModuLoRA is a memory-efficient finetuning algorithm that enables 2/3/4-bit precision tuning of 65B LLMs on a 24GB consumer GPU, integrating any weight quantizer for improved performance across various tasks with significantly reduced memory usage.
Abstract
We propose a memory-efficient finetuning algorithm for large language models (LLMs) that supports finetuning LLMs with 65B parameters in 2/3/4-bit precision on as little as one 24GB GPU. Our method, modular low-rank adaptation (ModuLoRA), integrates any user-specified weight quantizer with finetuning via low-rank adapters (LoRAs). Our approach relies on a simple quantization-agnostic backward pass that adaptively materializes low-precision LLM weights from a custom black-box quantization module. This approach enables finetuning 2-bit and 3-bit LLMs for the first time -- leveraging state-of-the-art 2-bit QuIP# quantization and 3-bit OPTQ quantization -- outperforming finetuning that relies on less sophisticated 4-bit and 8-bit methods. In our experiments, ModuLoRA attains competitive performance on text classification, natural language inference, and instruction following tasks using significantly less memory than existing approaches, and we also surpass the state-of-the-art ROUGE score on a popular summarization task. We release ModuLoRA together with a series of low-precision models as part of LLMTools, a user-friendly library for quantizing, running, and finetuning LLMs on consumer GPUs.
In-depth Reading
1. Bibliographic Information
1.1. Title
The central topic of this paper is a memory-efficient finetuning algorithm for large language models (LLMs) that operates on consumer-grade GPUs by integrating with various quantization methods. The title is ModuLoRA: Finetuning 2-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers.
1.2. Authors
- Junjie Yin: Department of Computer Science, Johns Hopkins University.
- Jiahao Dong: Department of Computer Science, Cornell University and Cornell Tech.
- Yingheng Wang: Department of Computer Science, Cornell University.
- Christopher De Sa: Department of Computer Science, Cornell University.
- Volodymyr Kuleshov: Department of Computer Science, Cornell University and Cornell Tech.
The authors are primarily affiliated with the Department of Computer Science at Johns Hopkins University and Cornell University, indicating a strong background in computer science research, particularly in areas related to machine learning, efficient computing, and large language models. Volodymyr Kuleshov and Christopher De Sa are often associated with research in machine learning, optimization, and efficient AI systems.
1.3. Journal/Conference
This paper is published as a preprint on arXiv, with the publication date (UTC) being 2023-09-28T02:55:01.000Z. As an arXiv preprint, it has not yet undergone formal peer review for a specific journal or conference at the time of this publication. However, arXiv is a highly reputable platform for disseminating cutting-edge research in fields like AI and machine learning, allowing researchers to share their work rapidly.
1.4. Publication Year
2023
1.5. Abstract
The paper introduces ModuLoRA, a memory-efficient finetuning algorithm for large language models (LLMs) that allows finetuning of models up to 65 billion parameters in 2, 3, or 4-bit precision on consumer GPUs (e.g., a single 24GB GPU). ModuLoRA achieves this by integrating any user-specified weight quantizer with Low-Rank Adapters (LoRAs) through a quantization-agnostic backward pass. This novel approach adaptively materializes low-precision LLM weights from a black-box quantization module. For the first time, ModuLoRA enables finetuning 2-bit and 3-bit LLMs, leveraging advanced quantization techniques like 2-bit QuIP# and 3-bit OPTQ. The experiments demonstrate that ModuLoRA outperforms finetuning methods based on less sophisticated 4-bit and 8-bit quantization, achieving competitive performance on tasks like text classification, natural language inference, and instruction following with significantly less memory. It also sets a new state-of-the-art ROUGE score on a summarization task. The authors release ModuLoRA as part of LLMTools, a user-friendly library for quantizing, running, and finetuning LLMs on consumer hardware.
1.6. Original Source Link
Abstract page: https://arxiv.org/abs/2309.16119. PDF: https://arxiv.org/pdf/2309.16119v2.pdf. Publication status: preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the prohibitive memory requirements for finetuning large language models (LLMs), which limits their accessibility and deployment on consumer-grade hardware. LLMs, with hundreds of billions of parameters, demand significant computational and memory resources, making tasks like finetuning feasible only with expensive, specialized GPUs.
This problem is crucial because it creates a barrier to entry for many researchers and practitioners who lack access to high-end data center hardware. Democratizing access to LLMs for finetuning would accelerate research, foster open-source development, and enable wider application of these powerful models across diverse domains. The specific challenges are:
- High Memory Footprint: Storing full-precision LLM weights (e.g., 16-bit or 32-bit floating point) for models with tens or hundreds of billions of parameters quickly exceeds the memory capacity of even high-end consumer GPUs (typically 24GB or 48GB).
- Finetuning Overhead: Beyond just storing weights, finetuning requires additional memory for activations, gradients, and optimizer states, further exacerbating the memory crunch.
- Limited Quantization Integration: While quantization (reducing the precision of weights) has been explored for inference, its effective integration with finetuning, especially at very low bit-widths (e.g., 2-bit, 3-bit), has been challenging. Existing finetuning methods for quantized LLMs often rely on simpler 4-bit or 8-bit schemes, potentially sacrificing performance.
The paper's entry point and innovative idea revolve around combining the memory efficiency of quantization with the parameter efficiency of Low-Rank Adaptation (LoRA). Crucially, instead of developing a new quantization scheme, ModuLoRA proposes a modular, quantization-agnostic approach: it can integrate with any state-of-the-art quantizer (treated as a black box), allowing users to leverage the best available quantization methods, even those operating at aggressively low bit-widths like 2-bit or 3-bit, for finetuning on consumer GPUs.
2.2. Main Contributions / Findings
The paper makes several primary contributions:
- Proposed ModuLoRA: It introduces ModuLoRA, a memory-efficient finetuning method for LLMs that operates over low-precision weights. A key innovation is its quantization-agnostic backward pass, which allows seamless integration with any user-specified black-box quantization module. This enables finetuning of LLMs up to 65B parameters in 2, 3, or 4-bit precision on consumer GPUs (e.g., a single 24GB or 48GB GPU), directly addressing the memory limitation and making finetuning more accessible.
- Release of the LLMTools Library: The authors release LLMTools, a user-friendly Python library that implements ModuLoRA. The library facilitates quantizing, running, and finetuning LLMs on consumer hardware, with modular support for various quantizers, LLMs (such as LLaMA, BLOOM, and OPT), and optimization algorithms. This contributes to the open-source community and simplifies practical application of the method.
- Empirical Evidence of High Performance with Smaller Quantized LLMs: The research provides extensive empirical evidence that high performance on downstream tasks can be achieved with significantly smaller and more aggressively quantized LLMs than previously believed. Specifically:
  - ModuLoRA enables finetuning 2-bit and 3-bit LLMs for the first time, leveraging advanced quantizers like QuIP# (2-bit) and OPTQ (3-bit).
  - These low-precision models (2-bit, 3-bit, 4-bit) often match or outperform finetuning methods based on less sophisticated 4-bit or 8-bit quantization (e.g., QLoRA or LLM.int8()).
  - ModuLoRA achieves a new state-of-the-art ROUGE score on a popular summarization task using a 4-bit quantized LLaMA-65B model.
  - For instruction following, 4-bit and 3-bit 65B models outperform 8-bit 30B models, despite using fewer total bits.
These findings suggest that competitive finetuning performance is attainable even with aggressive quantization, challenging previous assumptions about the model size and precision needed for high-quality results.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand ModuLoRA, a reader needs to grasp several fundamental concepts in large language models and efficiency techniques:
- Large Language Models (LLMs): Advanced artificial intelligence models, typically based on the Transformer architecture, trained on vast amounts of text data to understand, generate, and process human language. They can perform diverse tasks like translation, summarization, question answering, and code generation. Their massive size (billions to trillions of parameters) makes them powerful but also computationally demanding.
- Finetuning: A process where a pre-trained LLM, which has learned general language patterns, is further trained on a smaller, task-specific dataset. The goal is to adapt the model to perform a particular downstream task (e.g., sentiment analysis, summarization) more effectively. Traditional finetuning involves updating all or most of the model's parameters, which is memory-intensive.
- Parameter-Efficient Finetuning (PEFT): Given the high memory and computational costs of full finetuning, PEFT methods aim to adapt LLMs to new tasks by training only a small subset of parameters, or by introducing new, small, trainable parameters, while keeping most of the original model's parameters frozen (unchanged). This significantly reduces memory usage and training time.
- Low-Rank Adaptation (LoRA): A popular PEFT technique. Instead of directly modifying the weight matrices of a pre-trained model, LoRA introduces small, trainable low-rank matrices alongside the original, frozen weight matrices. During finetuning, only these low-rank matrices are updated, while the original weights remain fixed, drastically reducing the number of trainable parameters. Mathematically, for an original weight matrix $\mathbf{W}_0 \in \mathbb{R}^{d \times d}$, LoRA reparameterizes it as $ \mathbf{W} = \mathbf{W}_0 + \mathbf{A}\mathbf{B}^\top $ where:
  - $\mathbf{W}_0$: the original, frozen weight matrix of the pre-trained LLM; it remains fixed during finetuning.
  - $\mathbf{A} \in \mathbb{R}^{d \times r}$: a randomly initialized matrix with rank $r$.
  - $\mathbf{B} \in \mathbb{R}^{d \times r}$: a matrix initialized to zeros.
  - $\mathbf{A}\mathbf{B}^\top$: the low-rank adapter, the product of $\mathbf{A}$ and the transpose of $\mathbf{B}$. The rank $r$ is typically much smaller than the original dimension (i.e., $r \ll d$), making $\mathbf{A}\mathbf{B}^\top$ a low-rank approximation of the update to $\mathbf{W}_0$.
  - Only $\mathbf{A}$ and $\mathbf{B}$ are trainable. The number of trainable parameters in the adapter is 2dr, which is significantly less than the $d^2$ of the original weight matrix when $r \ll d$.
- Quantization: A technique to reduce the memory footprint and computational cost of neural networks by representing their weights and activations with lower-precision numbers (e.g., 8-bit, 4-bit, or 2-bit integers) instead of standard 16-bit or 32-bit floating point. A $b$-bit quantization method typically takes a full-precision weight matrix $\mathbf{W}$ and outputs a quantized version $\hat{\mathbf{W}}_q$, along with zero-point and scale parameters (often stored in full precision). The quantized weights are stored using $b$ bits per entry, while $\mathbf{z}$ and $\mathbf{s}$ allow dequantization back to an approximation of the original full-precision weights. The quantization process can be summarized as $ (\hat{\mathbf{W}}_q, \mathbf{z}, \mathbf{s}) = \mathcal{Q}(\mathbf{W}) $ and the dequantization process as $ \hat{\mathbf{W}} = \mathcal{D}(\hat{\mathbf{W}}_q, \mathbf{z}, \mathbf{s}) = \mathbf{s} \odot \hat{\mathbf{W}}_q + \mathbf{z} $ where:
  - $\mathcal{Q}$: the quantization algorithm.
  - $\mathcal{D}$: the dequantization algorithm.
  - $\mathbf{W}$: the original full-precision weight matrix.
  - $\hat{\mathbf{W}}_q$: the quantized weight matrix, stored using $b$ bits per entry.
  - $\mathbf{z}$: zero-point parameters, typically full-precision.
  - $\mathbf{s}$: scale parameters, typically full-precision.
  - $\hat{\mathbf{W}}$: the dequantized approximation of $\mathbf{W}$.
  - $\odot$: Hadamard product (element-wise multiplication); numpy-style broadcasting means that $\mathbf{s}$ and $\mathbf{z}$ (if they are vectors) are broadcast across the dimensions of $\hat{\mathbf{W}}_q$ for the element-wise operations.
3.2. Previous Works
The paper discusses several key prior works that inform its approach:
- LoRA (Low-Rank Adaptation): Proposed by Hu et al. (2022), LoRA is a foundational PEFT method. As explained above, it adds small, trainable low-rank matrices ($\mathbf{A}\mathbf{B}^\top$) to the original frozen weight matrices ($\mathbf{W}_0$). While LoRA reduces the number of trained parameters, it still requires storing the entire full-precision base model weights in memory, which can be substantial for very large LLMs. ModuLoRA builds upon LoRA but addresses this memory bottleneck by quantizing $\mathbf{W}_0$.
- OPTQ (Optimal Quantization for Generative Pre-trained Transformers): Introduced by Frantar et al. (2023), OPTQ is a state-of-the-art quantization algorithm for modern LLMs. It iteratively runs two steps over the weight columns: (1) quantizing with nearest rounding and computing the error, and (2) updating the remaining weights with a scaled error. The method scales effectively to LLMs and achieves good performance, making it a suitable "black-box" quantizer for ModuLoRA to integrate with, especially for 3-bit and 4-bit quantization.
- QuIP and QuIP# (Quantization with Incoherence Processing): Chee et al. (2023) proposed QuIP, which made 2-bit LLM compression viable. It uses an adaptive rounding procedure to minimize a quadratic proxy objective and an efficient pre- and post-processing procedure that ensures weight and Hessian incoherence through multiplication by random orthogonal matrices. Tseng et al. (2023) then introduced QuIP#, combining lattice codebooks with QuIP's incoherence processing to create state-of-the-art 2-bit quantized models. ModuLoRA leverages QuIP# to enable 2-bit finetuning for the first time, demonstrating its modularity.
- QLoRA (Quantized LoRA): Concurrent work by Dettmers et al. (2023), QLoRA is another approach for finetuning quantized LLMs based on LoRA. QLoRA defines its own quantization scheme, which is simpler than OPTQ or QuIP, primarily supports 4-bit finetuning, and includes innovations such as a specialized packing routine and quantization of zero-points and scales. ModuLoRA differentiates itself from QLoRA by being quantizer-agnostic and supporting lower bit-widths (2-bit, 3-bit) through integration with more advanced quantizers.
- LLM.int8(): Dettmers et al. (2022) proposed LLM.int8(), an 8-bit quantization method that decomposes matrix multiplications into a majority of 8-bit operations and a minority of 16-bit operations. This allows large models to fit into memory for inference and serves as a baseline for 8-bit LoRA finetuning.
3.3. Technological Evolution
The evolution of LLM efficiency techniques has generally followed these stages:
- Full Finetuning: Initially, researchers would finetune all parameters of a pre-trained model. This yielded high performance but was extremely resource-intensive, requiring vast amounts of memory and compute.
- Parameter-Efficient Finetuning (PEFT): Methods like prompt tuning, adapter layers, and LoRA emerged to reduce the number of trainable parameters. This eased the computational burden but still often required storing the full-precision base model.
- Quantization for Inference: Techniques for quantizing LLMs (e.g., to 8-bit or 4-bit) became prevalent to enable inference on less powerful hardware, but these often were not directly compatible with finetuning or led to significant performance degradation when applied naively.
- Finetuning Quantized Models (QLoRA): The next step was combining PEFT (like LoRA) with quantization for finetuning. QLoRA was a notable development, allowing 4-bit finetuning of large models, but it used its own quantization scheme and was limited in bit-width.
- Modular, Ultra-Low-Precision Finetuning (ModuLoRA): This paper's ModuLoRA represents a further advancement. It takes the LoRA approach and makes the underlying quantization strategy entirely modular and black-box, allowing seamless integration with the best available quantizers (e.g., OPTQ, QuIP#) and enabling finetuning at aggressively low bit-widths (2-bit, 3-bit) while maintaining high performance. It also focuses explicitly on enabling this on commodity consumer hardware.
3.4. Differentiation Analysis
Compared to the main methods in related work, ModuLoRA introduces several core differences and innovations:
- Modularity and Quantizer-Agnostic Design: The most significant innovation is that ModuLoRA does not define its own quantization procedure. Instead, it is designed to integrate with any user-specified, black-box quantization module. This is a crucial distinction from QLoRA, which uses its own specific 4-bit quantization scheme. ModuLoRA's modularity lets it leverage cutting-edge quantization research as it develops, without requiring changes to its core finetuning mechanism.
- Support for Ultra-Low Bit-Widths (2-bit and 3-bit Finetuning): By integrating with advanced quantizers like OPTQ (3-bit) and QuIP# (2-bit), ModuLoRA enables finetuning LLMs at these aggressively low bit-widths for the first time. This goes beyond QLoRA's 4-bit limitation and LLM.int8()'s 8-bit standard, yielding even greater memory savings.
- Performance with Advanced Quantizers: The paper demonstrates that using sophisticated, data-driven quantizers (like OPTQ and QuIP#) within the ModuLoRA framework achieves better performance than simpler quantization strategies (like round-to-nearest or QLoRA's internal scheme) for a given bit budget, highlighting the synergy between ModuLoRA's finetuning approach and the quality of the quantization algorithm.
- Memory Efficiency: ModuLoRA pushes the boundaries of memory efficiency further, enabling finetuning of a 65B LLM on a single 24GB GPU in 2-bit precision and on a 48GB GPU in 3-bit/4-bit precision. This significantly lowers the hardware barrier compared to previous LoRA (which needs full-precision base weights) or even QLoRA in some settings, thanks to the lower bit-width support.
- Quantization-Agnostic Backward Pass: The paper introduces a simple yet effective quantization-agnostic backward pass that adaptively materializes low-precision LLM weights only when needed. Memory for the full dequantized weights is never held for the entire model simultaneously, contributing to the overall memory efficiency.
4. Methodology
4.1. Principles
The core idea behind ModuLoRA is to enable memory-efficient finetuning of large language models by combining the parameter-efficiency of Low-Rank Adaptation (LoRA) with the memory-efficiency of weight quantization, while remaining flexible enough to integrate with any state-of-the-art quantization algorithm. The theoretical basis or intuition is that LoRA allows for efficient finetuning by only updating a small number of parameters, and quantization drastically reduces the memory footprint of the frozen base model weights. By treating the quantizer as a black box and adaptively dequantizing weights during both the forward and backward passes, ModuLoRA ensures that the benefits of low-precision storage are maintained while still allowing gradients to be computed accurately for the LoRA adapters. This allows finetuning to occur on hardware with limited memory, such as consumer GPUs.
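To make the principle concrete, here is a minimal sketch (illustrative only, not the LLMTools implementation; the round-to-nearest quantizer, tensor names, and shapes are assumptions standing in for a black-box quantizer such as OPTQ or QuIP#):

import torch

def fake_quantize(W, bits=4):
    # Illustrative per-row round-to-nearest quantizer standing in for a black-box quantizer.
    z = W.min(dim=1, keepdim=True).values
    s = ((W.max(dim=1, keepdim=True).values - z) / (2**bits - 1)).clamp(min=1e-8)
    Wq = torch.clamp(torch.round((W - z) / s), 0, 2**bits - 1).to(torch.uint8)
    return Wq, z.half(), s.half()

def dequantize(Wq, z, s):
    return s.float() * Wq.float() + z.float()

d, r = 512, 8
W = torch.randn(d, d)
Wq, z, s = fake_quantize(W)                      # frozen base weights, stored in low precision
A = torch.nn.Parameter(0.01 * torch.randn(d, r)) # trainable LoRA factor (random init)
B = torch.nn.Parameter(torch.zeros(d, r))        # trainable LoRA factor (zero init)

x = torch.randn(16, d)
y = x @ dequantize(Wq, z, s).t() + x @ B @ A.t() # dequantize just-in-time, add LoRA term
y.sum().backward()                               # gradients flow only into the LoRA factors

The dequantized copy of the weight exists only while this layer is being evaluated, which is exactly the property the quantization-agnostic backward pass preserves.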
4.2. Core Methodology In-depth (Layer by Layer)
ModuLoRA's methodology can be broken down into three main stages: initial quantization, reparameterization of linear layers with LoRA adapters and quantized weights, and an efficient quantization-agnostic backward pass.
4.2.1. Initial Quantization
The first step is to take a pre-trained LLM and apply a chosen black-box quantization algorithm $\mathcal{Q}$ to its full-precision weight matrices $\mathbf{W}^{(i)}$.
For each weight matrix $\mathbf{W}^{(i)}$ (where $i$ indexes the different linear layers in the LLM), the quantization algorithm produces:
- Quantized weights $\hat{\mathbf{W}}_q^{(i)}$, stored in low precision (e.g., 2, 3, or 4 bits per entry).
- Zero-point parameters $\mathbf{z}^{(i)}$.
- Scale parameters $\mathbf{s}^{(i)}$.
This process is defined as: $ (\hat{\mathbf{W}}_q^{(i)}, \mathbf{z}^{(i)}, \mathbf{s}^{(i)}) = \mathcal{Q}(\mathbf{W}^{(i)}) $ The paper emphasizes that ModuLoRA itself does not specify $\mathcal{Q}$; it treats it as a black box. The dequantization algorithm $\mathcal{D}$ then recovers an approximation of the original full-precision weights as: $ \hat{\mathbf{W}}^{(i)} = \mathcal{D}(\hat{\mathbf{W}}_q^{(i)}, \mathbf{z}^{(i)}, \mathbf{s}^{(i)}) = \mathbf{s}^{(i)} \odot \hat{\mathbf{W}}_q^{(i)} + \mathbf{z}^{(i)} $ These quantized weights, along with their scale and zero-point parameters, become the frozen base of the LLM for finetuning. They are stored in memory in their low-precision format (e.g., 2, 3, or 4 bits for $\hat{\mathbf{W}}_q^{(i)}$) to minimize memory usage.
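As a rough illustration of this per-layer step (a sketch under assumptions, not the LLMTools API; the `quantize` callable and its signature are placeholders for an OPTQ/QuIP#-style quantizer):

import torch.nn as nn

def quantize_model_weights(model: nn.Module, quantize, bits: int = 3):
    quantized_state = {}
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            Wq, z, s = quantize(module.weight.data, bits)  # black-box quantizer call
            quantized_state[name] = (Wq, z, s)             # keep only the low-precision triple
            module.weight.requires_grad_(False)            # base weights stay frozen
    return quantized_state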
4.2.2. Reparameterized ModuLoRALinear Layer
ModuLoRA modifies the original LLM by replacing each standard linear layer with a ModuLoRALinear layer. An original linear layer computes $\mathbf{x} \mapsto \mathbf{x}(\mathbf{W}^{(i)})^\top + \mathbf{b}^{(i)}$, where $\mathbf{x}$ is the input, $\mathbf{W}^{(i)}$ are the weights, and $\mathbf{b}^{(i)}$ are the biases.
The ModuLoRALinear layer reparameterizes this operation. Instead of using the original full-precision weights $\mathbf{W}^{(i)}$, it uses the dequantized approximation $\hat{\mathbf{W}}^{(i)}$ and adds a low-rank adapter $\mathbf{A}^{(i)}(\mathbf{B}^{(i)})^\top$. The new affine map becomes:
$
x \mapsto x (\hat{\mathbf{W}}^{(i)})^\top + x \mathbf{B}^{(i)} (\mathbf{A}^{(i)})^\top + \mathbf{b}^{(i)}
$
Here:
- $\hat{\mathbf{W}}^{(i)}$: the dequantized weight matrix obtained from the black-box quantizer and stored in low precision. This part of the weight is frozen and not updated during finetuning.
- $\mathbf{A}^{(i)}, \mathbf{B}^{(i)}$: the learnable LoRA parameters, initialized as in Hu et al. (2022), typically with $\mathbf{A}^{(i)}$ random and $\mathbf{B}^{(i)}$ zero-initialized. These matrices are stored in full precision (e.g., 16-bit float) and are the only parameters updated during finetuning.
- $\mathbf{b}^{(i)}$: the bias term, which can also be frozen or finetuned. The paper indicates biases are stored as float16.
The core ModuLoRALinear class structure is conceptually represented as:
class ModuLoRALinear(Module):
"Linear ModuLoRA Layer"
def __init__(self, ...):
self.hatWq_z_s = quantize(pretrained_W) # Stores (quantized weights, zero-point, scale)
(self.A, self.B) = lora_init(...) # Initializes LoRA adapter matrices
def forward(self, x):
(hatWq, z, s) = self.hatWq_z_s
# LPLinear.apply handles dequantization and multiplication for the quantized part
# The second term handles the LoRA adapter part
return LPLinear.apply(x, hatWq, z, s) \
+ x @ (self.B @ self.A.t()) \
+ self.bias
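For orientation, a layer like the sketch above would be swapped into an existing model roughly as follows (a hypothetical helper under assumptions; the constructor passed as `make_layer` is illustrative, not the actual LLMTools API):

import torch.nn as nn

def inject_modulora(model: nn.Module, make_layer):
    # Recursively replace every nn.Linear with a ModuLoRALinear-style module
    # built from its pretrained weight (e.g., make_layer(child)).
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, make_layer(child))
        else:
            inject_modulora(child, make_layer)
    return model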
4.2.3. Efficient Mixed-Precision Computation: Forward Pass
The ModuLoRALinear layer utilizes a custom autograd.Function called LPLinear (Low-Precision Linear Map) to handle the operations involving the quantized base weights. This is crucial for managing memory efficiently.
In the forward pass of LPLinear:
- The quantized weights $\hat{\mathbf{W}}_q$, zero-point $\mathbf{z}$, and scale $\mathbf{s}$ are passed along with the input $\mathbf{x}$.
- The dequantization algorithm is called to materialize the high-precision approximation $\hat{\mathbf{W}}$ just-in-time.
- A standard matrix multiplication is performed: input @ hatW.t().
- Crucially, immediately after its use, the materialized high-precision $\hat{\mathbf{W}}$ is deallocated (hatW is deallocated in the pseudocode). This prevents the entire model's dequantized weights from residing in memory simultaneously.
The pseudocode for the forward pass is:
class LPLinear(Function):
"Low-Precision Linear Map"
@staticmethod
def forward(ctx, input, hatWq, z, s):
ctx.save_for_backward(hatWq, z, s) # Saves low-precision components for backward pass
hatW = dequantize(hatWq, z, s) # Dequantize to high-precision
output = input @ hatW.t() # Perform matrix multiplication
return output # hatW is deallocated after this
By deallocating immediately, the memory footprint for the base quantized model is kept minimal.
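To make the savings concrete with an assumed layer size (an illustration, not a figure from the paper): a single 8192 × 8192 linear layer materialized in float16 occupies $8192 \times 8192 \times 2 \ \text{bytes} \approx 134 \ \text{MB}$ only transiently during its own forward or backward step, while its persistent 2-bit storage is roughly $8192 \times 8192 \times 0.25 \ \text{bytes} \approx 17 \ \text{MB}$. Because the float16 buffer is freed immediately, peak memory grows by about one such buffer rather than by one per layer.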
4.2.4. Efficient Mixed-Precision Computation: Backward Pass
The backward pass is critical for finetuning the LoRA adapters and . The chain rule requires calculating gradients that involve the transpose of the weight matrices.
Consider the overall weight matrix of a ModuLoRALinear layer as $\mathbf{W}^{(i)} = \hat{\mathbf{W}}^{(i)} + \mathbf{A}^{(i)}(\mathbf{B}^{(i)})^\top$.
The pre-activation output is $\bar{\mathbf{y}}_i = \mathbf{x}_i (\mathbf{W}^{(i)})^\top + \mathbf{b}^{(i)}$.
The loss is $L$. We want to compute $\frac{\mathrm{d}L}{\mathrm{d}\mathbf{A}^{(i)}}$ and $\frac{\mathrm{d}L}{\mathrm{d}\mathbf{B}^{(i)}}$.
By the chain rule:
$
\frac{\mathrm{d}L}{\mathrm{d}\mathbf{A}^{(i)}} = \frac{\mathrm{d}L}{\mathrm{d}\bar{\mathbf{y}}_i} \cdot \frac{\mathrm{d}\bar{\mathbf{y}}_i}{\mathrm{d}\mathbf{A}^{(i)}}
$
And similarly for $\mathbf{B}^{(i)}$.
The term $\frac{\mathrm{d}L}{\mathrm{d}\bar{\mathbf{y}}_i}$ is propagated from subsequent layers. Its computation involves the transpose of the weight matrix of the next layer:
$
\frac{\mathrm{d}L}{\mathrm{d}\bar{\mathbf{y}}_i} = \frac{\mathrm{d}L}{\mathrm{d}\bar{\mathbf{y}}_{i+1}} \cdot \frac{\mathrm{d}\bar{\mathbf{y}}_{i+1}}{\mathrm{d}\mathbf{y}_i} \cdot \frac{\mathrm{d}\mathbf{y}_i}{\mathrm{d}\bar{\mathbf{y}}_i}
$
where $\frac{\mathrm{d}\mathbf{y}_i}{\mathrm{d}\bar{\mathbf{y}}_i}$ is the derivative of the activation function, and $\frac{\mathrm{d}\bar{\mathbf{y}}_{i+1}}{\mathrm{d}\mathbf{y}_i}$ involves the next layer's weight matrix $(\mathbf{W}^{(i+1)})^\top$.
This shows that the backward pass also requires performing matrix-vector multiplications involving the dequantized weights $\hat{\mathbf{W}}$.
To maintain memory efficiency, the LPLinear backward pass re-implements the dequantization and matrix multiplication steps:
- The low-precision quantized components ($\hat{\mathbf{W}}_q$, $\mathbf{z}$, $\mathbf{s}$) saved in ctx during the forward pass are retrieved.
- The high-precision approximation $\hat{\mathbf{W}}$ is recomputed (re-dequantized) from these components.
- The gradient grad_input is calculated using $\hat{\mathbf{W}}$.
- Again, the recomputed hatW is immediately deallocated after use. This ensures that memory is freed quickly and not accumulated across layers.
The pseudocode for the backward pass is:
@staticmethod
def backward(ctx, grad_output):
hatWq, z, s = ctx.saved_tensors # Retrieve low-precision components
hatW = dequantize(hatWq, z, s) # Recompute high-precision hatW
grad_input = grad_output @ hatW # Compute gradient
return grad_input, None, None, None # hatW is deallocated after this
This strategy of recomputing (re-dequantizing) on-the-fly for both forward and backward passes, rather than storing it, is the core mechanism that allows ModuLoRA to avoid manifesting all full-precision weights in memory simultaneously, thereby enabling finetuning on consumer GPUs.
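The two pseudocode fragments above can be combined into a small self-contained toy (a sketch under assumptions: a simple round-to-nearest quantizer stands in for the black-box quantizer, and the real LLMTools kernels are CUDA implementations rather than this pure-PyTorch version):

import torch

def quantize(W, bits=4):
    z = W.min(dim=1, keepdim=True).values
    s = ((W.max(dim=1, keepdim=True).values - z) / (2**bits - 1)).clamp(min=1e-8)
    Wq = torch.clamp(torch.round((W - z) / s), 0, 2**bits - 1).to(torch.uint8)
    return Wq, z, s

def dequantize(Wq, z, s):
    return s * Wq.float() + z

class LPLinear(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input, Wq, z, s):
        ctx.save_for_backward(Wq, z, s)   # only low-precision tensors are saved
        W_hat = dequantize(Wq, z, s)      # materialized just-in-time
        return input @ W_hat.t()          # W_hat is freed when it goes out of scope

    @staticmethod
    def backward(ctx, grad_output):
        Wq, z, s = ctx.saved_tensors
        W_hat = dequantize(Wq, z, s)      # re-materialized for the backward pass
        return grad_output @ W_hat, None, None, None

# Quick parity check against a dense linear map on the dequantized weights.
W = torch.randn(64, 64)
Wq, z, s = quantize(W)
x = torch.randn(8, 64, requires_grad=True)
out = LPLinear.apply(x, Wq, z, s)
ref = x @ dequantize(Wq, z, s).t()
assert torch.allclose(out, ref, atol=1e-5)
out.sum().backward()                      # grad_input flows through the quantized weights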
4.2.5. Increasing Efficiency Further
To further reduce memory consumption beyond simply recomputing dequantized weights, ModuLoRA can employ more granular materialization strategies:
- Row Materialization: For many quantization algorithms (e.g., Nagel et al., 2020; Frantar et al., 2023), it is possible to dequantize one row at a time. Each dequantized row is immediately multiplied with the corresponding part of the input $\mathbf{x}$ and then freed, so the entire weight matrix is never materialized, even temporarily (a minimal sketch of this loop appears below).
- Direct Vector-by-Quantized-Matrix Product: The most efficient approach would be a quantizer that provides a direct subroutine for computing vector-by-quantized-matrix products ($\mathbf{x}\hat{\mathbf{W}}^\top$) without explicitly dequantizing into $\hat{\mathbf{W}}$ at all. ModuLoRA's modular design can generalize to such subroutines, eliminating the need to materialize any part of $\hat{\mathbf{W}}$.
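A minimal sketch of row materialization (assuming the per-row scale/zero-point layout used earlier; illustrative only, the real kernels are fused CUDA code):

import torch

def row_materialized_matmul(x, Wq, z, s):
    # Computes x @ dequantize(Wq, z, s).t(), one weight row at a time.
    out = x.new_zeros(x.shape[0], Wq.shape[0])
    for i in range(Wq.shape[0]):
        w_row = s[i] * Wq[i].float() + z[i]   # dequantize a single row
        out[:, i] = x @ w_row                 # use it immediately
        del w_row                             # the row can be freed right away
    return out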
4.2.6. LLMTools Implementation
ModuLoRA is implemented as part of LLMTools, a user-friendly library. LLMTools provides:
- Implementation of ModuLoRA for 2-bit, 3-bit, and 4-bit precision.
- A Python API for quantization, inference, and finetuning.
- Modular support for various quantizers (e.g., OPTQ, QuIP#), LLMs (LLaMA 1, LLaMA 2, BLOOM, OPT), and optimization algorithms compatible with the Hugging Face Trainer.
- Efficient CUDA implementations for mixed-precision matrix-vector multiplication, including row and weight materialization. CUDA kernels are provided for both row and weight materialization in the forward and backward passes. For maximum efficiency, materialized elements of $\hat{\mathbf{W}}$ are in float16. Base quantized LLM models are represented by weights in 3 or 4 bits, with scales, zero-points, and biases all stored as float16.
- For QuIP# integration (2-bit), LLMTools provides CUDA kernels for weight re-materialization and orthogonal matrix multiplication. The base models use 2-bit $\hat{\mathbf{W}}_q$.
5. Experimental Setup
5.1. Datasets
The experiments used a variety of datasets to evaluate ModuLoRA across different natural language processing tasks:
-
Text Classification:
- Dataset: A custom dataset derived from Williams et al. (2018), comprising 392,702 text snippets (up to 50 words each) from five genres. Evaluation is performed on 9,815 held-out instances.
- Purpose: To assess the model's ability to classify short text into distinct categories.
- Data Sample Example (Hypothetical): A short text like "The protagonist, a detective, unravels a complex mystery in the heart of London." might be classified as "fiction." A snippet from a phone call transcript would be "telephone chat."
-
Natural Language Inference (NLI):
- Dataset:
Multi-Genre Natural Language Inference Corpus (MNLI) (Williams et al., 2018).
hypothesisand apremise). The task is to predict if thehypothesisentails, contradicts, or is neutral to thepremise. - Data Sample Example (from MNLI):
- Premise: "A man is standing on a ladder and painting a wall."
- Hypothesis: "A man is painting a wall."
- Label:
Entailment
- Data Sample Example (from MNLI):
- Premise: "A man is standing on a ladder and painting a wall."
- Hypothesis: "A man is flying a kite."
- Label:
Contradiction
- Dataset:
-
Abstractive Summarization:
- Dataset:
SAMSum dataset (Gliwa et al., 2019). It contains 14,732 (text, summary) training pairs and 819 test pairs.
- Data Sample Example (Hypothetical from SAMSum):
- Dialogue (Input): Person A: Hey, did you finish the report for the Q3 meeting? Person B: Almost, just need to finalize the sales figures. Should be done by lunch. Person A: Great, I'll review it then.
- Summary (Target): Person B is finishing the Q3 report and will send it to Person A for review by lunch.
- Dataset:
-
Instruction Following:
- Dataset:
Alpaca dataset (Taori et al., 2023), consisting of 52,000 instructions, and CodeAlpaca dataset (Chaudhary, 2023), consisting of 20,000 code generation instructions.
- Data Sample Example (from Alpaca):
- Instruction: "Explain the concept of recursion to a 5-year-old."
- Response (Target): "Imagine you have a magic box, and inside that box is another magic box, and inside that one is another, and so on! Recursion is like when you keep opening the boxes until you find the smallest one, then you close them all back up."
- Dataset:
-
Calibration Data for Quantization:
-
Dataset: 128 samples from
C4 (Raffel et al., 2020) were used for calibrating models quantized with OPTQ. C4 is a massive, cleaned web text dataset.
Purpose:
Quantization algorithms often require a small amount of data to determine optimal scale and zero-point parameters that minimize information loss during the conversion to lower precision. These datasets were chosen because they represent a diverse set of common LLM tasks, allowing for a comprehensive evaluation of
ModuLoRA's performance and efficiency across different domains and complexities.
-
5.2. Evaluation Metrics
The paper uses several standard evaluation metrics tailored to each task:
- Accuracy:
  - Conceptual Definition: Accuracy measures the proportion of correctly predicted instances out of the total number of instances. It quantifies how often the model's predictions match the true labels and is the standard metric for classification tasks.
  - Mathematical Formula: $ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
  - Symbol Explanation: Number of Correct Predictions is the count of instances where the model's output matches the ground-truth label; Total Number of Predictions is the total count of all instances evaluated.
  - Used for: text classification and natural language inference (MNLI-m). Also for BigBench-Hard (BBH) in the instruction-following experiments, where it is reported as exact-match accuracy.
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation):
  - Conceptual Definition: ROUGE is a set of metrics for evaluating automatic summarization (and machine translation) that compares a generated summary against human-written reference summaries by measuring the overlap of n-grams and subsequences. The paper reports ROUGE-1, ROUGE-2, and ROUGE-L. ROUGE-N measures n-gram overlap between candidate and reference (unigrams for ROUGE-1, bigrams for ROUGE-2); ROUGE-L measures the longest common subsequence (LCS), which captures sentence-level structure and does not require consecutive matches.
  - Mathematical Formulas (standard definitions), for a candidate summary and reference summaries $R_1, \dots, R_m$:
    - ROUGE-N: $ \text{ROUGE-N} = \frac{\sum_{i=1}^{m} \sum_{\text{n-gram} \in R_i} \text{Count}_{\text{match}}(\text{n-gram})}{\sum_{i=1}^{m} \sum_{\text{n-gram} \in R_i} \text{Count}(\text{n-gram})} $
    - ROUGE-L (F-measure based on LCS): $ P_{LCS} = \frac{\text{LCS}(\text{candidate}, \text{reference})}{\text{Length}(\text{candidate})} $, $ R_{LCS} = \frac{\text{LCS}(\text{candidate}, \text{reference})}{\text{Length}(\text{reference})} $, $ F_{LCS} = \frac{(1+\beta^2) P_{LCS} R_{LCS}}{\beta^2 P_{LCS} + R_{LCS}} $ (typically $\beta = 1$ for the F1-score). For multiple references, the maximum is usually taken.
  - Symbol Explanation: $\text{Count}_{\text{match}}(\text{n-gram})$ is the maximum number of n-grams co-occurring in the candidate summary and a reference; $\text{Count}(\text{n-gram})$ is the number of n-grams in the reference summaries; LCS is the length of the longest common subsequence between candidate and reference (a subsequence need not be consecutive); Length is the number of words in the candidate or reference summary; $\beta$ weights recall against precision in the F-measure.
  - Used for: abstractive summarization (SAMSum) and the Code Alpaca evaluation (ROUGE 1/2/LSum).
- Perplexity (PPL):
  - Conceptual Definition: Perplexity measures how well a probability model predicts a sample. In NLP it is used to evaluate language models: a lower perplexity indicates that the model is better at predicting the next word in a sequence. It can be interpreted as the inverse probability of the test set, normalized by the number of words.
  - Mathematical Formula: for a word sequence $W = w_1, \dots, w_N$, $ \text{PPL}(W) = P(w_1, w_2, \dots, w_N)^{-\frac{1}{N}} = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_1, \dots, w_{i-1}) \right) $
  - Symbol Explanation: $P(w_1, \dots, w_N)$ is the joint probability of the sequence under the language model; $P(w_i \mid w_1, \dots, w_{i-1})$ is the model's probability of the $i$-th word given the preceding words; $N$ is the number of words; $\exp$ is the exponential function.
  - Used for: evaluating base LLM quality, presented alongside BBH results to examine the (imperfect) correlation between perplexity and downstream task performance.
5.3. Baselines
The paper compares ModuLoRA against several representative baselines:
- LoRA (Full-Precision): LoRA (Hu et al., 2022) applied to full-precision base models. This represents the state of the art in parameter-efficient finetuning without quantization of the base weights; its memory requirements are very high.
- BitsAndBytes 8-bit (LLM.int8()): LoRA combined with LLM.int8() quantization (Dettmers et al., 2022). This is a common method for fitting larger models into GPU memory for inference and finetuning via LoRA; it uses 8-bit quantization for most matrix multiplications.
- BitsAndBytes 4-bit (QLoRA): QLoRA (Dettmers et al., 2023) is a concurrent approach that allows 4-bit finetuning using LoRA. It defines its own quantization scheme and incorporates specialized techniques like quantization of zero-points and scales and double quantization. This is a direct competitor in the low-bit finetuning space.
- Full Finetuning: For some tasks, results from full finetuning of large models like GPT-3 and T5 are included from the literature (Hu et al., 2022; Chung et al., 2022) to provide an upper bound on performance.
- Other PEFT Methods: Adapter tuning (Houlsby et al., 2019) and SLiC (for Pegasus) are mentioned in the summarization benchmarks for comparison.
- No Finetuning: For instruction-following tasks, FLAN-T5 and LLaMA without any finetuning are included to show the baseline performance of the raw models.
5.4. Training Details
- Models: LLaMA (7B, 13B, 30B, 65B), BLOOM, and OPT models (7B, 13B, 30B).
- Quantization:
  - 3-bit and 4-bit quantization used OPTQ (Frantar et al., 2023), with calibration on 128 samples from C4.
  - 2-bit quantization used QuIP# (Chee et al., 2023; Tseng et al., 2023) with lattice codebooks.
- Hardware: Finetuning was performed on NVIDIA TITAN, 3090, and A6000 GPUs, depending on model size and memory requirements.
- LoRA Configuration: LoRA rank $r$ = 8; LoRA alpha $\alpha$ = 32.
- Optimization: the AdamW optimizer was used.
- Random Seeds: results are reported over 3 random seeds to account for variability.
- Hyperparameters:
  - SAMSum: training for 350 steps, batch size of 128 samples, learning rate 1e-3, cosine learning rate schedule, weight decay 0.0, max sequence length 250.
  - Text Classification: batch size 256, evaluation batch size 32, 100 evaluation steps, 1000 total training steps, learning rate 1e-3, cosine learning rate schedule, weight decay 0.0, max sequence length 128.
  - Code-Alpaca: batch size 128, evaluation batch size 4, 40 evaluation steps, 120 total training steps, learning rate 1e-3, linear learning rate schedule, weight decay 0.0, max sequence length 165.
  - MNLI-m: batch size 128, evaluation batch size 64, 64 evaluation steps, 1.0 training epoch, learning rate 1e-3, cosine learning rate schedule, weight decay 0.0.
  - Alpaca (for BBH): batch size 128, 3 total training epochs, learning rate 1e-3, linear learning rate schedule, weight decay 0.0.
- Fair Comparison: hyperparameters were chosen to match those used in QLoRA (Dettmers et al., 2023) for a fair comparison.
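For orientation, these LoRA and optimizer settings map naturally onto a Hugging Face peft/transformers configuration. The snippet below is a hedged illustration of such a setup for the SAMSum run, not the actual LLMTools training script; the target module names and the batch-size split via gradient accumulation are assumptions.

from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=8,                                  # LoRA rank (r_q = r_v = 8)
    lora_alpha=32,                        # LoRA alpha
    target_modules=["q_proj", "v_proj"],  # assumed LLaMA attention projections
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="modulora-samsum",
    max_steps=350,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=32,       # effective batch size of 128
    learning_rate=1e-3,
    lr_scheduler_type="cosine",
    warmup_ratio=0.06,
    weight_decay=0.0,
    optim="adamw_torch",
)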
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate that ModuLoRA consistently achieves competitive performance across various tasks, often outperforming less sophisticated quantization methods, while drastically reducing memory requirements.
- Text Classification (Table 1): ModuLoRA with 3-bit and 4-bit LLaMA models shows accuracy comparable to 8-bit Bits&Bytes finetuning. For instance, LLaMA-65B achieves 97.2% (3-bit) and 98.0% (4-bit) with ModuLoRA, versus 98.6% with 8-bit Bits&Bytes. Aggressive quantization (3-bit, 4-bit) combined with ModuLoRA can therefore maintain high performance on simpler classification tasks while using significantly less memory.
- Natural Language Inference (MNLI-m) (Table 2): ModuLoRA demonstrates strong results. The 2-bit and 3-bit 65B LLaMA models match a full-precision GPT-3 + LoRA baseline, and the 2-bit 65B model quantized with QuIP# outperforms other 65B models at higher precisions. Across the model-size range, ModuLoRA's 3-bit and 4-bit models consistently outperform 8-bit Bits&Bytes models, and ModuLoRA models (2, 3, or 4-bit) either match or exceed their 4-bit QLoRA counterparts, often with lower memory usage. This highlights the benefit of ModuLoRA's ability to integrate with advanced quantizers.
- Abstractive Summarization (SAMSum) (Tables 3, 4, 5):
  - LLaMA models: ModuLoRA's 4-bit 65B LLaMA models attain a new state-of-the-art ROUGE score (54.8 / 31.3 / 47.2 for ROUGE 1/2/L) on SAMSum, surpassing GPT-3 baselines. ModuLoRA consistently outperforms 4-bit QLoRA and 8-bit BitsAndBytes methods, and even 2-bit ModuLoRA models match the 8-bit baselines. The performance drop from 4-bit to 3-bit to 2-bit is marginal (about 1 ROUGE point), showcasing the robustness of ultra-low-precision finetuning.
  - Ablation with RTN (Table 4): The comparison with Round-to-Nearest (RTN) quantization shows that OPTQ (used by ModuLoRA) performs better, underscoring the importance of sophisticated, data-driven quantizers over simpler ones and justifying ModuLoRA's modular design.
  - OPT models (Table 5): Similar trends hold; ModuLoRA (3-bit and 4-bit) matches or outperforms 4-bit QLoRA and 8-bit Bits&Bytes baselines.
- Instruction Following (BBH) (Tables 6, 9):
  - Alpaca (Table 6): Performance drops only slightly for 2-bit, 3-bit, and 4-bit ModuLoRA models compared to 8-bit models, and 2-bit models match 4-bit QLoRA performance. More impressively, 4-bit and 3-bit 65B ModuLoRA models outperform 8-bit 30B models, demonstrating the efficiency of combining larger model sizes with aggressive quantization. ModuLoRA provides consistent improvements over QLoRA, especially for smaller models.
  - Code Alpaca (Table 9): ModuLoRA (3-bit and 4-bit) performs comparably or better than 8-bit Bits&Bytes models, confirming the general trend.
- Memory Requirements (Table 7 & Figure 2): This is a key highlight. For a 65B model on MNLI-m, ModuLoRA (2-bit) uses only 21.8 GB, making it finetunable on a single 24GB GPU; QLoRA (4-bit) requires 36.7 GB, and full-precision LoRA needs 360.4 GB. ModuLoRA thus uses only about 6% of the memory of full-precision LoRA for a 65B model, a major gain for accessibility.
- Finetuning & Inference Latency (Tables 10 & 11): For finetuning, ModuLoRA (2-bit) is significantly faster (0.61 s/it) than QLoRA (0.80 s/it) and full-precision LoRA (1.50 s/it), reducing training time by approximately 59.3% and memory usage by 91.5% compared to full-precision LoRA, largely thanks to reduced data movement. For inference, ModuLoRA (2-bit) is slightly slower (0.68 s/it) than QLoRA and full-precision LoRA (both 0.52 s/it); the authors attribute this to less optimized CUDA kernels, suggesting room for future optimization.
- BBH vs. PPL (Table 8): The correlation between perplexity (PPL) on Wiki2 and finetuning performance on BBH is not perfect: large differences in PPL sometimes correspond to only small differences in BBH accuracy. This questions traditional LLM evaluation metrics when the goal is finetuning and suggests that factors beyond raw perplexity may be more indicative of finetuning potential.
In summary, ModuLoRA successfully demonstrates that integrating advanced quantization with LoRA in a modular fashion allows highly memory-efficient finetuning (even down to 2-bit) on consumer GPUs, often achieving state-of-the-art or competitive performance.
6.2. Data Presentation (Tables)
The following are the results from Table 1 of the original paper:
| LLAMA Tuning | 13B | 30B | 65B |
| LLMTOOLS (3-bit) | 93.5 ± 0.7 | 97.0 ± 0.9 | 97.2 ± 0.8 |
| LLMTOOLS (4-bit) | 92.9 ± 0.7 | 96.3 ± 1.0 | 98.0 ± 0.9 |
| Bits&Bytes 8-bit (LLM.int8()) | 93.0 ± 0.7 | 93.7 ± 1.0 | 98.6 ± 1.0 |
The following are the results from Table 2 of the original paper:
| Models | Finetuning Adaptation | Model Size | # Trainable Parameters | MNLI-m (accuracy) | ||
| GPT-3 | Full Finetuning | 175B | 175,255.8M | 89.5 ± 0.1 | ||
| GPT-3 | Adapter | 175B | 40.1M | 91.5 ± 0.1 | ||
| GPT-3 | LoRA | 175B | 4.7M | 91.7 ± 0.1 | ||
| T5 | Full Finetuning | 11B | 11,307.4M | 92.2 ± 0.1 | ||
| LLaMA Finetuning | Quantizer | 7B | 13B | 30B | ||
| LLMTooLS (2-bit) | QuIP#(E8) | 88.50 ± 0.3 | 89.72 ± 0.3 | 91.30 ± 0.3 | ||
| LLMTOOLS (3-bit) | OPTQ | 88.98 ± 0.2 | 90.20 ± 0.2 | 91.09 ± 0.2 | ||
| LLMTOOLS (4-bit) | OPTQ | 89.31 ± 0.2 | 90.41 ± 0.2 | 91.31 ± 0.1 | ||
| Bits&Bytes (4-bit) | QLoRA | 89.28 ± 0.2 | 89.67 ± 0.2 | |||
| Bits&Bytes (8-bit) | LLM.int8() | 88.95 ± 0.1 | 90.08 ± 0.1 | 91.22 ± 0.1 91.15 ± 0.1 | ||
The following are the results from Table 3 of the original paper:
| Models | Finetuning Adaptation | |||||
| GPT-3 | Full Finetuning | 175,255.8M | 52.0 28.0/ 44.5 | |||
| GPT-3 | Adapter | 53.2 29.0 / 45.1 | ||||
| GPT-3 | LoRA | 40.1M 4.7M | 53.8 / 29.8 / 45.9 | |||
| Pegasus | SliC | 2B | 54.4 / 29.9 / 45.9 | |||
| LLAMA Finetuning | Quantizer | 30B | 65B | |||
| LLMTOOLS (2-bit) | QuIP# (E8) | 7B 51.3 / 27.3 / 43.7 | / 46.0 | 54.0/ 30.6 / 46.2 | ||
| LLMTOOLS (3-bit) | OPTQ | 51.2 / 28.2 / 44.0 | 52.3 / 29.0 / 45.0 52.4 / 29.6 / 45.1 | 53.3 / 30.2 53.6 / 30.8 / 46.3 | 54.1 / 30.9 / | 46.5 |
| LLMTOOLS (4-bit) | OPT | 51.7 / 28.3 / 44.4 | 53.2 / 30.2 / 46.1 | 53.9 / 31.2 / 46.9 | 54.8 / | 31.3 / 47.2 |
| Bits&Bytes (4-bit) | QLoRA | |||||
| Bits&Bytes (8-bit) | LLM.int8() | 51.6 / 28.3 3/ 44.5 51.9 / 28.1 / 44.5 | 51.3 / 28.1 / 44.1 51.3 / 28.2 / 43.6 | 53.0 / 30.2 50.8 / 28.4 | / 45.7 53.8 / 44.1 53.9 | 3 / 30.5 /4 45.9 / 30.4 / 46.3 |
The following are the results from Table 4 of the original paper:
| SAMSum Performance | Quantizer | 7B | 13B | |||||
| LLMTOOLS (3-bit) | OPTQ | 51.2 / | 28.2 / | 44.0 / 44.2 | 52.4 / 29.6 / 45.1 / | 45.1 | ||
| RTN OPTQ | 51.7 | 50.7 / 27.2 / 28.3 | 43.6 / 43.6 44.4/ | 44.4 | 51.1/ 28.7 / | 44.3 / | 44.5 46.1 | |
| LLMTOOLS (4-bit) | RTN | / 51.2 / | / 28.5 / | 44.2 / 44.2 | 53.2 52.5 | / 30.2 / 29.9 | 46.1 45.5 | 45.5 |
The following are the results from Table 5 of the original paper:
| OPT Finetuning | Quantizer | 13B | 30B |
| LLMTOOLS (3-bit) | OPTQ | 48.8 / 26.7 / 41.9 | 49.9 / 27.1 / 42.5 |
| LLMTOOLS (4-bit) | OPTQ | 49.3 / 26.8 / 42.0 | 49.6 / 27.1 / 42.4 |
| Bits&Bytes (4-bit) | QLoRA | 49.2 / 27.0 / 42.1 | 49.9 / 27.0 / 42.5 |
| Bits&Bytes (8-bit) | LLM.int8() | 48.8 26.5 / 41.7 | 49.3 | 27.1 / 42.3 |
The following are the results from Table 6 of the original paper:
| Model | Method | Quantizer | BASE (250M) | L (780M) | XL (3B) | XXL (11B) | |
| FLAN-T5 | No Finetuning | None | 30.8 | 30.3 | 39.9 | 47.4 | |
| LLaMA | Methods | Quantizer | 7B | 30B | 65B | ||
| LLMToOLS (2-bit) | QuIP# (E8) | 30.8 ± 0.5 | 38.3 ± 0.6 | 43.5 ± 0.5 | |||
| LLMTOOLS (3-bit) | OPTQ | 31.1 ± 0.4 | |||||
| 35.3 ± 0.2 | 37.2 ± 0.6 | 43.3 ± 0.4 | |||||
| LLMTOOLS (4-bit) | OPTQ | 36.2 ± 0.4 | 40.4 ± 0.2 | 43.7 ± 0.4 | |||
| Bits&Bytes (4-bit) | QLoRA | 35.4 ± 0.2 | 39.0 ± 0.4 | 43.5 ± 0.5 | |||
| Bits&Bytes (8-bit) | LLM.int8() | 31.9 ± 0.1 36.8 ± 0.2 37.1 | 39.1 ± 0.5 | 44.7 ± 0.4 | |||
| No Finetuning | None | 33.3 ± 0.3 30.9 | 39.3 | 42.6 | |||
The following are the results from Table 7 of the original paper:
| LLaMA Finetuning | 7B | 13B | 30B | 65B |
| LLMTOOLS (2-bit) | 3.2 GB | 5.4 GB | 11.4 GB | 21.8 GB |
| QLoRA (4-bit) | 5.2 GB | 8.6 GB | 19.5 GB | 36.7 GB |
| Full Precision (LoRA) | 38.4 GB | 73.9 GB | 183.3 GB | 360.4 GB |
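As a rough sanity check on these numbers (our own arithmetic, not a figure from the paper): storing 65B parameters at 2 bits each takes $65 \times 10^9 \times 2 / 8 \ \text{bytes} \approx 16.3 \ \text{GB}$, so the bulk of the reported 21.8 GB is the quantized base model itself, with the remainder plausibly covering float16 scales and zero-points, the LoRA adapters, activations, and optimizer state.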
The following are the results from Table 8 of the original paper:
| Models | Quantization | BBH | PPL |
| LLAMA (13B) | 3-bit | 35.3 | 6.63 |
| 4-bit | 36.2 | 5.36 | |
| LLAMA (65B) | 3-bit | 43.3 | 5.04 |
| 4-bit | 43.7 | 3.84 |
The following are the results from Table 9 of the original paper:
| Code Alpaca Per- formance | 13B | 30B | ||||||||
| LLMTOOLS LLMTOOLS | (3-bit) | 53.6 / 36.3 54.6 37.2 | 50.7 51.4 | 57.0 40.0 / 40.6 | 53.3 | 58.1 40.7 | | 54.3 60.0 | 44.1 | 58.8 | |
| Bits&Bytes (LLM.int8()) | (4-bit) 8-bit | 54.0 / | 36.3 | 50.9 | 57.4 57.7 / | 54.3 / 41.3 / 54.9 | 59.0 / 41.4 60.6 √ 43.5/ | 57.5 60.2 57.5 61.1/ | 43.5 44.1 / 58.0 | / 56.8 |
The following are the results from Table 10 of the original paper:
| Precision | LLMTools (2-bit) | QLoRA (4-bit) | LoRA (Full Precision) |
| Seconds/Iteration | 0.61 s/it | 0.80 s/it | 1.50 s/it |
The following are the results from Table 11 of the original paper:
| Precision | LLMTools (2-bit) | QLoRA (4-bit) | LoRA (Full Precision) |
| Seconds/Iteration | 0.68 s/it | 0.52 s/it | 0.52 s/it |
The following are the results from Table 12 of the original paper:
| Dataset | Model | LLaMA 7B/13B/30B/65B, OPT 7B/13B/30B |
| SAMSum | Optimizer | AdamW |
| | Warmup Ratio | 0.06 |
| | Batch size | 128 |
| | Evaluation Batch size | 16 |
| | Evaluation Steps | 50 |
| | Total # Training Steps | 350 |
| | Learning Rate Schedule | Cosine |
| | Learning Rate | 1e-3 |
| | Weight Decay | 0.0 |
| | LoRA Config | rq = rv = 8 |
| | LoRA α | 32 |
| | Max Seq. Len | 250 |
The following are the results from Table 13 of the original paper:
| Dataset | LLaMA Model | 13/30/65 B |
| Text Classification | Optimizer | AdamW |
| | Warmup Ratio | 0.06 |
| | Batch size | 256 |
| | Evaluation Batch size | 32 |
| | Evaluation Steps | 100 |
| | Total # Training Steps | 1000 |
| | Learning Rate Schedule | Cosine |
| | Learning Rate | 1e-3 |
| | Weight Decay | 0.0 |
| | LoRA Config | rq = rv = 8 |
| | LoRA α | 32 |
| | Max Seq. Len | 128 |
The following are the results from Table 14 of the original paper:
| Dataset | LLaMA Model | 7/13/30/65 B |
| Code-Alpaca | Optimizer | AdamW |
| | Warmup Ratio | 0.06 |
| | Batch size | 128 |
| | Evaluation Batch size | 4 |
| | Evaluation Steps | 40 |
| | Total # Training Steps | 120 |
| | Learning Rate Schedule | Linear |
| | Learning Rate | 1e-3 |
| | Weight Decay | 0.0 |
| | LoRA Config | rq = rv = 8 |
| | LoRA α | 32 |
| | Max Seq. Len | 165 |
The following are the results from Table 15 of the original paper:
| Dataset | LLaMA Model | 7B 13B 30B 65B |
| MNLI-M | Optimizer | AdamW |
| | Warmup Ratio | 0.06 |
| | Batch size | 128 |
| | Evaluation Batch size | 64 |
| | Evaluation Steps | 64 |
| | Total # Training Epochs | 1.0 |
| | Learning Rate Schedule | Cosine |
| | Learning Rate | 1e-3 |
| | Weight Decay | 0.0 |
The following are the results from Table 16 of the original paper:
| Dataset | LLaMA Model | 7B 13B 30B 65B |
| Alpaca | Optimizer | AdamW |
| | Warmup Ratio | 0.06 |
| | Batch size | 128 |
| | Total # Training Epochs | 3 |
| | Learning Rate Schedule | Linear |
| | Learning Rate | 1e-3 |
| | Weight Decay | 0.0 |
| | LoRA Config | rq = rv = 8 |
6.3. Ablation Studies / Parameter Analysis
The paper includes a significant ablation study and an analysis of model quality metrics:
- Impact of Quantizer Quality (Table 4): This ablation study on the SAMSum dataset directly compares ModuLoRA's performance when using the sophisticated OPTQ quantizer versus a simpler Round-to-Nearest (RTN) approach. OPTQ consistently yields better ROUGE scores than RTN for both 3-bit and 4-bit quantization across the 7B and 13B LLaMA models; for example, the 7B LLaMA quantized to 3 bits scores noticeably higher ROUGE 1/2/L with OPTQ than with RTN. This highlights that the choice of quantization algorithm is crucial and that ModuLoRA's modularity, which lets it integrate with high-quality quantizers like OPTQ, is a key advantage for achieving superior performance.
- Correlation between Perplexity and Finetuning Performance (Table 8): The authors investigate the relationship between a base model's perplexity (PPL) on Wiki2 and its finetuning performance (BBH accuracy) on instruction-following tasks for LLaMA models. Interestingly, the correlation is not perfect. For 13B LLaMA, 3-bit quantization yields 35.3 BBH with 6.63 PPL, while 4-bit quantization results in 36.2 BBH with 5.36 PPL; for 65B LLaMA, 3-bit gives 43.3 BBH with 5.04 PPL and 4-bit gives 43.7 BBH with 3.84 PPL. Models with slightly worse perplexity can thus still achieve very competitive finetuning performance; as the paper puts it, "large gaps in PPL admit small gaps in BBH," suggesting that perplexity alone may not be a sufficient proxy for a base LLM's finetuning potential. This finding prompts a re-evaluation of LLM evaluation strategies when the end goal is finetuning for downstream tasks rather than pure language modeling.
The following figure (Figure 2 from the original paper) visualizes the memory requirements with different methods:
The figure is a schematic showing the memory required at different model sizes: the x-axis is model size in billions of parameters and the y-axis is required memory in GB. The curves correspond to 2-bit and 4-bit quantization methods and to full precision, with LLMTools (2-bit) markedly reducing memory requirements.
The visualization of memory requirements in Figure 2 strongly supports the paper's claims about ModuLoRA's efficiency. It clearly shows that LLMTools (2-bit) (which implements ModuLoRA) drastically reduces the memory needed compared to QLoRA (4-bit) and especially Full Precision (LoRA), across all model sizes. For a 65B model, LLMTools (2-bit) uses the least memory, making it feasible on consumer GPUs. This visual representation underscores the practical impact of ModuLoRA in democratizing access to LLM finetuning.
6.4. Other Model Families
The paper extends its evaluation to OPT models (Table 5) and observes consistent trends. ModuLoRA with 3-bit and 4-bit OPTQ quantization for OPT models (13B and 30B) achieves ROUGE 1/2/L scores that match or slightly outperform 4-bit QLoRA and 8-bit LLM.int8() baselines on the SAMSum dataset. This generalization across different LLM architectures (LLaMA and OPT) reinforces the robustness and effectiveness of the ModuLoRA approach. While OPT models generally perform worse than LLaMA models, ModuLoRA still provides competitive results relative to more memory-intensive finetuning strategies within the OPT family.
6.5. Overall Conclusions from Results
The experimental results collectively highlight that ModuLoRA successfully bridges the gap between ultra-low-precision quantization and high-performance LoRA finetuning. It demonstrates that it's possible to achieve state-of-the-art or competitive results on diverse NLP tasks (classification, NLI, summarization, instruction following) using significantly less memory by leveraging:
- Modular Quantizer Integration: allowing the use of advanced, performance-optimized quantizers (e.g., OPTQ, QuIP#).
- Aggressive Low Bit-Widths: pushing finetuning down to 2-bit and 3-bit precision.
- Memory-Efficient Backward Pass: through adaptive dequantization and recomputation.
This leads to a substantial reduction in hardware requirements, making LLM finetuning much more accessible on consumer GPUs.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduces ModuLoRA, a novel and memory-efficient finetuning algorithm that addresses the critical challenge of deploying and adapting large language models on resource-constrained hardware, specifically consumer GPUs. At its core, ModuLoRA leverages a modular design, integrating low-rank adapters (LoRAs) with any user-specified, black-box quantization module. This is achieved through a quantization-agnostic backward pass that adaptively dequantizes (materializes) low-precision LLM weights only when necessary, minimizing memory footprint.
A key achievement of ModuLoRA is enabling, for the first time, the finetuning of 2-bit and 3-bit LLMs (up to 65 billion parameters) on single consumer GPUs (e.g., 24GB or 48GB). By integrating with state-of-the-art quantizers like 2-bit QuIP# and 3-bit OPTQ, ModuLoRA demonstrates superior or competitive performance compared to finetuning methods that rely on less sophisticated 4-bit and 8-bit quantization. The empirical evaluations across text classification, natural language inference, instruction following, and summarization tasks consistently show ModuLoRA matching or surpassing existing approaches while using significantly less memory. Notably, it achieved a new state-of-the-art ROUGE score on the SAMSum summarization benchmark. The accompanying LLMTools library further democratizes access by providing a user-friendly platform for quantizing, running, and finetuning these models.
7.2. Limitations & Future Work
The authors thoughtfully acknowledge several limitations of ModuLoRA:
- Inference Overhead: A primary advantage of traditional LoRA is that the low-rank adapter ($\mathbf{A}\mathbf{B}^\top$) can be fused with the full-precision base weight matrix ($\mathbf{W}_0$) at inference time, effectively becoming a single full-precision weight matrix with minimal inference overhead. ModuLoRA loses this advantage relative to the black-box quantized model: since the adapter is full-precision and the base weight matrix is quantized, they cannot be trivially fused. The ModuLoRALinear layer (as depicted conceptually in Figure 1 of the paper) must therefore perform the dequantization and the separate LoRA addition during inference, potentially leading to slightly higher inference latency than a fully fused full-precision LoRA or optimized quantized-inference kernels. The latency results (Table 11) confirm this, showing ModuLoRA (2-bit) inference being slightly slower than QLoRA or full-precision LoRA, which suggests a need for further CUDA kernel optimization of ModuLoRA's inference path.
- Hardware Limits for Trillion-Parameter Models: While ModuLoRA significantly pushes the boundaries for consumer GPUs, even at the most aggressive 1 bit per parameter a trillion-parameter model would still require 125 GB of memory, exceeding the capacity of current high-end consumer GPUs (e.g., 24GB or 48GB). ModuLoRA therefore cannot yet make the largest-scale models (like GPT-4 or beyond) finetunable on single commodity hardware; model parallelism or more advanced distributed quantization techniques would still be necessary for such massive models.
- LLM Safety Concerns: The authors briefly touch upon the ethical implications, noting that making LLM finetuning more accessible on commodity hardware could "make finetuning too easy," potentially presenting problems related to LLM safety. This highlights a broader societal concern as powerful AI models become easier to customize and deploy, potentially for malicious purposes or without sufficient safeguards.
Future research directions could involve:
- Developing optimized CUDA kernels for ModuLoRA inference to reduce the latency observed in the current implementation.
- Exploring hybrid quantization and parallelization strategies to bring trillion-parameter models within reach of more accessible hardware.
- Investigating quantization-aware training (QAT) approaches within the ModuLoRA framework to potentially further improve performance at ultra-low bit-widths.
- Developing tools and guidelines for responsible finetuning to mitigate the LLM safety concerns raised.
7.3. Personal Insights & Critique
ModuLoRA represents a significant step forward in making LLM finetuning more democratic and accessible. The paper's core strength lies in its modularity. By decoupling the finetuning mechanism from the specific quantization algorithm, ModuLoRA effectively future-proofs itself against advancements in quantization research. As quantizers become even more sophisticated and capable of preserving quality at lower bit-widths, ModuLoRA can immediately benefit without requiring architectural changes. This is a very elegant design choice that leverages the best of both worlds: PEFT for trainable parameters and state-of-the-art quantization for the frozen base.
The empirical evidence is compelling, particularly the achievement of 2-bit finetuning and the surpassing of state-of-the-art ROUGE scores for summarization. The drastic reduction in memory requirements is the most impactful practical contribution, enabling researchers and developers with consumer-grade GPUs to participate in LLM finetuning, which was previously reserved for well-funded institutions. This will undoubtedly accelerate open-source LLM development and foster innovation.
One area for potential improvement, as the authors noted, is the inference overhead. While finetuning efficiency is the primary goal, a streamlined inference path that can fuse the adapter and quantized base more effectively would enhance the holistic utility of ModuLoRA. This is a common challenge in mixed-precision and parameter-efficient methods, and active research in optimized kernel development could address this.
The observation about the imperfect correlation between perplexity and finetuning performance (BBH) is also a valuable insight. It suggests that researchers should diversify their evaluation metrics for base LLMs if their ultimate goal is finetuning for specific downstream tasks, rather than relying solely on traditional language modeling benchmarks. This could lead to new directions in pre-training or base model selection strategies.
Overall, ModuLoRA offers a practical, high-performance, and forward-looking solution to the LLM finetuning accessibility problem. Its methods and conclusions could be easily transferred to other domains beyond NLP where large transformer models are used, such as computer vision, as long as LoRA and quantization are applicable. The project's release as an open-source library, LLMTools, further amplifies its potential impact.