
ModuLoRA: Finetuning 2-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers

Published: 09/28/2023
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

ModuLoRA is a memory-efficient finetuning algorithm that enables 2/3/4-bit precision tuning of 65B LLMs on a 24GB consumer GPU, integrating any weight quantizer for improved performance across various tasks with significantly reduced memory usage.

Abstract

We propose a memory-efficient finetuning algorithm for large language models (LLMs) that supports finetuning LLMs with 65B parameters in 2/3/4-bit precision on as little as one 24GB GPU. Our method, modular low-rank adaptation (ModuLoRA), integrates any user-specified weight quantizer with finetuning via low-rank adapters (LoRAs). Our approach relies on a simple quantization-agnostic backward pass that adaptively materializes low-precision LLM weights from a custom black-box quantization module. This approach enables finetuning 2-bit and 3-bit LLMs for the first time -- leveraging state-of-the-art 2-bit QuIP# quantization and 3-bit OPTQ quantization -- outperforming finetuning that relies on less sophisticated 4-bit and 8-bit methods. In our experiments, ModuLoRA attains competitive performance on text classification, natural language inference, and instruction following tasks using significantly less memory than existing approaches, and we also surpass the state-of-the-art ROUGE score on a popular summarization task. We release ModuLoRA together with a series of low-precision models as part of LLMTools, a user-friendly library for quantizing, running, and finetuning LLMs on consumer GPUs.


1. Bibliographic Information

1.1. Title

The title is ModuLoRA: Finetuning 2-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers. It reflects the paper's central topic: a memory-efficient finetuning algorithm for large language models (LLMs) that operates on consumer-grade GPUs by integrating with arbitrary (modular) quantization methods.

1.2. Authors

  • Junjie Yin: Department of Computer Science, Johns Hopkins University.

  • Jiahao Dong: Department of Computer Science, Cornell University and Cornell Tech.

  • Yingheng Wang: Department of Computer Science, Cornell University.

  • Christopher De Sa: Department of Computer Science, Cornell University.

  • Volodymyr Kuleshov: Department of Computer Science, Cornell University and Cornell Tech.

    The authors are primarily affiliated with the Department of Computer Science at Johns Hopkins University and Cornell University, indicating a strong background in computer science research, particularly in areas related to machine learning, efficient computing, and large language models. Volodymyr Kuleshov and Christopher De Sa are often associated with research in machine learning, optimization, and efficient AI systems.

1.3. Journal/Conference

This paper is published as a preprint on arXiv, with the publication date (UTC) being 2023-09-28T02:55:01.000Z. As an arXiv preprint, it has not yet undergone formal peer review for a specific journal or conference at the time of this publication. However, arXiv is a highly reputable platform for disseminating cutting-edge research in fields like AI and machine learning, allowing researchers to share their work rapidly.

1.4. Publication Year

2023

1.5. Abstract

The paper introduces ModuLoRA, a memory-efficient finetuning algorithm for large language models (LLMs) that allows finetuning of models up to 65 billion parameters in 2, 3, or 4-bit precision on consumer GPUs (e.g., a single 24GB GPU). ModuLoRA achieves this by integrating any user-specified weight quantizer with Low-Rank Adapters (LoRAs) through a quantization-agnostic backward pass. This novel approach adaptively materializes low-precision LLM weights from a black-box quantization module. For the first time, ModuLoRA enables finetuning 2-bit and 3-bit LLMs, leveraging advanced quantization techniques like 2-bit QuIP# and 3-bit OPTQ. The experiments demonstrate that ModuLoRA outperforms finetuning methods based on less sophisticated 4-bit and 8-bit quantization, achieving competitive performance on tasks like text classification, natural language inference, and instruction following with significantly less memory. It also sets a new state-of-the-art ROUGE score on a summarization task. The authors release ModuLoRA as part of LLMTools, a user-friendly library for quantizing, running, and finetuning LLMs on consumer hardware.

Paper Link: https://arxiv.org/abs/2309.16119
PDF Link: https://arxiv.org/pdf/2309.16119v2.pdf
Publication Status: Preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the prohibitive memory requirements for finetuning large language models (LLMs), which limits their accessibility and deployment on consumer-grade hardware. LLMs, with hundreds of billions of parameters, demand significant computational and memory resources, making tasks like finetuning feasible only with expensive, specialized GPUs.

This problem is crucial because it creates a barrier to entry for many researchers and practitioners who lack access to high-end data center hardware. Democratizing access to LLMs for finetuning would accelerate research, foster open-source development, and enable wider application of these powerful models across diverse domains. The specific challenges are:

  1. High Memory Footprint: Storing full-precision LLM weights (e.g., 16-bit or 32-bit floating point) for models with tens or hundreds of billions of parameters quickly exceeds the memory capacity of even high-end consumer GPUs (typically 24GB, 48GB).

  2. Finetuning Overhead: Beyond just storing weights, finetuning requires additional memory for activations, gradients, and optimizer states, further exacerbating the memory crunch.

  3. Limited Quantization Integration: While quantization (reducing the precision of weights) has been explored for inference, its effective integration with finetuning, especially at very low bit-widths (e.g., 2-bit, 3-bit), has been challenging. Existing finetuning methods for quantized LLMs often rely on simpler 4-bit or 8-bit schemes, potentially sacrificing performance.

    The paper's entry point and innovative idea revolve around combining the memory efficiency of quantization with the parameter efficiency of Low-Rank Adaptation (LoRA). Crucially, instead of developing a new quantization scheme, ModuLoRA proposes a modular, quantization-agnostic approach. This means it can integrate with any state-of-the-art quantizer (treated as a black box), allowing users to leverage the best available quantization methods, even those operating at aggressively low bit-widths like 2-bit or 3-bit, for finetuning on consumer GPUs.

2.2. Main Contributions / Findings

The paper makes several primary contributions:

  1. Proposed ModuLoRA: It introduces ModuLoRA, a memory-efficient finetuning method for LLMs that operates over low-precision weights. A key innovation is its quantization-agnostic backward pass, which allows seamless integration with any user-specified black-box quantization module. This enables finetuning of LLMs up to 65B parameters in 2, 3, or 4-bit precision on consumer GPUs (e.g., a single 24GB or 48GB GPU). This addresses the memory limitation directly, making finetuning more accessible.
  2. Release of LLMTools Library: The authors release LLMTools, a user-friendly Python library that implements ModuLoRA. This library facilitates the quantization, running, and finetuning of LLMs on consumer hardware, offering modular support for various quantizers, LLMs (like LLaMA, BLOOM, OPT), and optimization algorithms. This contributes to the open-source community and simplifies practical application of their method.
  3. Empirical Evidence of High Performance with Smaller Quantized LLMs: The research provides extensive empirical evidence showing that high performance on downstream tasks can be achieved with significantly smaller and more aggressively quantized LLMs than previously believed. Specifically, it demonstrates that:
    • ModuLoRA enables finetuning 2-bit and 3-bit LLMs for the first time, leveraging advanced quantizers like QuIP# (2-bit) and OPTQ (3-bit).
    • These low-precision models (2-bit, 3-bit, 4-bit) often match or outperform finetuning methods based on less sophisticated 4-bit or 8-bit quantization (e.g., QLoRA or LLM.int8()).
    • ModuLoRA achieves a new state-of-the-art ROUGE score on a popular summarization task using a 4-bit quantized LLaMA-65B model.
    • For instruction following, 4-bit and 3-bit 65B models outperform 8-bit 30B models, despite using fewer total bits. These findings suggest that competitive finetuning performance is attainable even with aggressive quantization, challenging previous assumptions about the necessary model size and precision for high-quality results.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand ModuLoRA, a reader needs to grasp several fundamental concepts in large language models and efficiency techniques:

  • Large Language Models (LLMs): These are advanced artificial intelligence models, typically based on the Transformer architecture, that are trained on vast amounts of text data to understand, generate, and process human language. They can perform diverse tasks like translation, summarization, question answering, and code generation. Their massive size (billions to trillions of parameters) makes them powerful but also computationally demanding.

  • Finetuning: This is a process where a pre-trained LLM, which has learned general language patterns, is further trained on a smaller, task-specific dataset. The goal is to adapt the model to perform a particular downstream task (e.g., sentiment analysis, summarization) more effectively. Traditional finetuning involves updating all or most of the model's parameters, which is memory-intensive.

  • Parameter-Efficient Finetuning (PEFT): Given the high memory and computational costs of full finetuning, PEFT methods aim to adapt LLMs to new tasks by training only a small subset of parameters, or by introducing new, small, trainable parameters, while keeping most of the original model's parameters frozen (unchanged). This significantly reduces memory usage and training time.

  • Low-Rank Adaptation (LoRA): LoRA is a popular PEFT technique. Instead of directly modifying the weight matrices of a pre-trained model, LoRA introduces small, trainable low-rank matrices alongside the original, frozen weight matrices. During finetuning, only these low-rank matrices are updated, while the original weights remain fixed. This drastically reduces the number of trainable parameters.

    Mathematically, for an original weight matrix $\mathbf{W} \in \mathbb{R}^{d \times d}$, LoRA reparameterizes it as: $ \mathbf{W} = \mathbf{W}_0 + \mathbf{A}\mathbf{B}^\top $ Where:

    • $\mathbf{W}_0$: The original, frozen weight matrix of the pre-trained LLM. It remains fixed during finetuning.
    • $\mathbf{A} \in \mathbb{R}^{d \times r}$: A randomly initialized matrix with rank $r$.
    • $\mathbf{B} \in \mathbb{R}^{d \times r}$: A matrix initialized to zeros.
    • $\mathbf{A}\mathbf{B}^\top$: The low-rank adapter, which is the product of matrix $\mathbf{A}$ and the transpose of matrix $\mathbf{B}$. The rank $r$ is typically much smaller than the original dimension $d$ (i.e., $r \ll d$), making $\mathbf{A}\mathbf{B}^\top$ a low-rank approximation of the update to $\mathbf{W}$.
    • Only $\mathbf{A}$ and $\mathbf{B}$ are trainable parameters. The number of trainable parameters in the adapter is $2dr$, which is significantly less than $d^2$ for the original weight matrix when $r \ll d$.
  • Quantization: This is a technique to reduce the memory footprint and computational cost of neural networks by representing their weights and activations with lower-precision numbers (e.g., 8-bit, 4-bit, or 2-bit integers) instead of standard 16-bit or 32-bit floating-point numbers. A $b$-bit quantization method typically takes a full-precision weight matrix $\mathbf{W}$ and outputs a quantized version $\hat{\mathbf{W}}_q$, along with zero-point $\mathbf{z}$ and scale $\mathbf{s}$ parameters (often stored in full precision). The quantized weights $\hat{\mathbf{W}}_q$ are stored using $b$ bits per entry, while $\mathbf{z}$ and $\mathbf{s}$ allow for dequantization back to an approximation $\hat{\mathbf{W}}$ of the original full-precision weights. The quantization process can be summarized as: $ (\hat{\mathbf{W}}_q, \mathbf{z}, \mathbf{s}) = \mathcal{Q}(\mathbf{W}) $ And the dequantization process as: $ \hat{\mathbf{W}} = \mathcal{D}(\hat{\mathbf{W}}_q, \mathbf{z}, \mathbf{s}) = \mathbf{s} \odot \hat{\mathbf{W}}_q + \mathbf{z} $ Where:

    • $\mathcal{Q}$: The quantization algorithm.
    • $\mathcal{D}$: The dequantization algorithm.
    • $\mathbf{W} \in \mathbb{R}^{d \times d}$: The original full-precision weight matrix.
    • $\hat{\mathbf{W}}_q \in \{0, 1, \dots, 2^b-1\}^{d \times d}$: The quantized weight matrix, stored using $b$ bits per entry.
    • $\mathbf{z} \in \mathbb{R}^d$: Zero-point parameters, typically full-precision.
    • $\mathbf{s} \in \mathbb{R}^d$: Scale parameters, typically full-precision.
    • $\hat{\mathbf{W}} \in \mathbb{R}^{d \times d}$: The dequantized approximation of $\mathbf{W}$.
    • $\odot$: Hadamard product (element-wise multiplication).
    • The numpy-style broadcasting means that $\mathbf{s}$ and $\mathbf{z}$ (if they are vectors) are broadcast across the dimensions of $\hat{\mathbf{W}}_q$ for the element-wise operations. (A minimal code sketch combining this quantize/dequantize scheme with a LoRA adapter follows this list.)
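
To make the two concepts above concrete, here is a minimal PyTorch sketch (illustrative only, not the paper's implementation) that pairs a toy per-row round-to-nearest quantizer with a LoRA-reparameterized linear layer; all names (quantize_rtn, LoRALinear) are hypothetical.

import torch

def quantize_rtn(W, bits=4):
    # Toy per-row round-to-nearest quantizer: returns (Wq, zero-point z, scale s).
    lo = W.min(dim=1, keepdim=True).values
    hi = W.max(dim=1, keepdim=True).values
    s = ((hi - lo) / (2 ** bits - 1)).clamp_min(1e-8)   # per-row scale
    z = lo                                              # per-row zero-point
    Wq = torch.clamp(torch.round((W - z) / s), 0, 2 ** bits - 1)
    return Wq.to(torch.uint8), z, s

def dequantize(Wq, z, s):
    # Recover an approximation of the original weights: W_hat = s * Wq + z (broadcasted).
    return s * Wq.to(s.dtype) + z

class LoRALinear(torch.nn.Module):
    # Frozen (here: quantized) base weight plus a trainable low-rank update.
    def __init__(self, W, bits=4, r=8):
        super().__init__()
        d_out, d_in = W.shape
        Wq, z, s = quantize_rtn(W, bits)
        self.register_buffer("Wq", Wq)   # low-precision base weights, never trained
        self.register_buffer("z", z)
        self.register_buffer("s", s)
        self.A = torch.nn.Parameter(torch.randn(d_out, r) * 0.01)  # random init
        self.B = torch.nn.Parameter(torch.zeros(d_in, r))          # zero init: adapter starts as a no-op

    def forward(self, x):
        W_hat = dequantize(self.Wq, self.z, self.s)
        return x @ W_hat.t() + x @ self.B @ self.A.t()

layer = LoRALinear(torch.randn(512, 512))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 2*d*r = 8192 trainable values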

3.2. Previous Works

The paper discusses several key prior works that inform its approach:

  • LoRA (Low-Rank Adaptation): Proposed by Hu et al. (2022), LoRA is a foundational PEFT method. As explained above, it involves adding small, trainable low-rank matrices ($\mathbf{A}\mathbf{B}^\top$) to the original frozen weight matrices ($\mathbf{W}_0$). While LoRA reduces the number of trained parameters, it still requires storing the entire full-precision base model weights ($\mathbf{W}_0$) in memory, which can be substantial for very large LLMs. ModuLoRA builds upon LoRA but addresses this memory bottleneck by quantizing $\mathbf{W}_0$.

  • OPTQ (Optimal Quantization for Generative Pre-trained Transformers): Introduced by Frantar et al. (2023), OPTQ is a state-of-the-art quantization algorithm for modern LLMs. It works by iteratively running two steps over the weight columns: (1) quantizing with nearest rounding and computing the error, and (2) updating the remaining weights with a scaled error. This method scales effectively to LLMs and achieves good performance, making it a suitable "black-box" quantizer for ModuLoRA to integrate with, especially for 3-bit and 4-bit quantization.

  • QuIP and QuIP# (Quantization with Incoherence Processing): Chee et al. (2023) proposed QuIP, which made 2-bit LLM compression viable. It uses an adaptive rounding procedure to minimize a quadratic proxy objective and an efficient pre- and post-processing procedure to ensure weight and Hessian incoherence through multiplication by random orthogonal matrices. Following this, Tseng et al. (2023) introduced QuIP#, combining lattice codebooks with QuIP's incoherence processing to create state-of-the-art 2-bit quantized models. ModuLoRA leverages QuIP# to enable 2-bit finetuning for the first time, demonstrating its modularity.

  • QLoRA (Quantized LoRA): Concurrent work by Dettmers et al. (2023), QLoRA is another approach for finetuning quantized LLMs based on LoRA. QLoRA defines its own quantization scheme, which is simpler than OPTQ or QuIP. It primarily supports 4-bit finetuning and includes innovations like a specialized packing routine and quantization of zero-points and scales. ModuLoRA differentiates itself from QLoRA by being quantizer-agnostic and supporting lower bit-widths (2-bit, 3-bit) by integrating with more advanced quantizers.

  • LLM.int8(): Dettmers et al. (2022) proposed LLM.int8(), an 8-bit quantization method that decomposes matrix multiplications into a majority of 8-bit operations and a minority of 16-bit operations. This allows large models to fit into memory for inference and serves as a baseline for 8-bit LoRA finetuning.

3.3. Technological Evolution

The evolution of LLM efficiency techniques has generally followed these stages:

  1. Full Finetuning: Initially, researchers would finetune all parameters of a pre-trained model. This yielded high performance but was extremely resource-intensive, requiring vast amounts of memory and compute.
  2. Parameter-Efficient Finetuning (PEFT): Methods like prompt tuning, adapter layers, and LoRA emerged to reduce the number of trainable parameters. This eased the computational burden but still often required storing the full-precision base model.
  3. Quantization for Inference: Techniques for quantizing LLMs (e.g., to 8-bit, 4-bit) became prevalent to enable inference on less powerful hardware, but these often weren't directly compatible with finetuning or led to significant performance degradation when applied naively.
  4. Finetuning Quantized Models (QLoRA): The next step was combining PEFT (like LoRA) with quantization for finetuning. QLoRA was a notable development, allowing 4-bit finetuning of large models, but it used its own quantization scheme and was limited in bit-width.
  5. Modular, Ultra-Low-Precision Finetuning (ModuLoRA): This paper's ModuLoRA represents a further advancement. It takes the LoRA approach and makes the underlying quantization strategy entirely modular and black-box. This allows ModuLoRA to seamlessly integrate with the best available quantizers (e.g., OPTQ, QuIP#), enabling finetuning at aggressively low bit-widths (2-bit, 3-bit) while maintaining high performance. It also focuses explicitly on enabling this on commodity consumer hardware.

3.4. Differentiation Analysis

Compared to the main methods in related work, ModuLoRA introduces several core differences and innovations:

  • Modularity and Quantizer-Agnostic Design: The most significant innovation is that ModuLoRA does not define its own quantization procedure. Instead, it is designed to integrate with any user-specified, black-box quantization module. This is a crucial distinction from QLoRA, which uses its own specific 4-bit quantization scheme. ModuLoRA's modularity allows it to leverage the cutting-edge quantization research as it develops, without requiring changes to its core finetuning mechanism.

  • Support for Ultra-Low Bit-Widths (2-bit and 3-bit Finetuning): By integrating with advanced quantizers like OPTQ (for 3-bit) and QuIP# (for 2-bit), ModuLoRA enables finetuning LLMs at these aggressively low bit-widths for the first time. This goes beyond QLoRA's 4-bit limitation and LLM.int8()'s 8-bit standard, leading to even greater memory savings.

  • Performance with Advanced Quantizers: The paper demonstrates that by using sophisticated data-driven quantizers (like OPTQ and QuIP#) within the ModuLoRA framework, it can achieve better performance than simpler quantization strategies (like round-to-nearest or QLoRA's internal scheme) for a given bit budget. This highlights the synergy between ModuLoRA's finetuning approach and the quality of the quantization algorithm.

  • Memory Efficiency: ModuLoRA pushes the boundaries of memory efficiency further. It enables finetuning of a 65B LLM on a single 24GB GPU in 2-bit precision, and a 65B LLM on a 48GB GPU in 3-bit/4-bit precision. This significantly lowers the hardware barrier compared to previous LoRA (which needs full-precision base weights) or even QLoRA in some settings due to the lower bit-width support.

  • Quantization-Agnostic Backward Pass: The paper introduces a simple, yet effective, quantization-agnostic backward pass that adaptively materializes low-precision LLM weights only when needed. This ensures that the memory for the full dequantized weights is not held for the entire model simultaneously, contributing to the overall memory efficiency.

4. Methodology

4.1. Principles

The core idea behind ModuLoRA is to enable memory-efficient finetuning of large language models by combining the parameter-efficiency of Low-Rank Adaptation (LoRA) with the memory-efficiency of weight quantization, while remaining flexible enough to integrate with any state-of-the-art quantization algorithm. The theoretical basis or intuition is that LoRA allows for efficient finetuning by only updating a small number of parameters, and quantization drastically reduces the memory footprint of the frozen base model weights. By treating the quantizer as a black box and adaptively dequantizing weights during both the forward and backward passes, ModuLoRA ensures that the benefits of low-precision storage are maintained while still allowing gradients to be computed accurately for the LoRA adapters. This allows finetuning to occur on hardware with limited memory, such as consumer GPUs.

4.2. Core Methodology In-depth (Layer by Layer)

ModuLoRA's methodology can be broken down into three main stages: initial quantization, reparameterization of linear layers with LoRA adapters and quantized weights, and an efficient quantization-agnostic backward pass.

4.2.1. Initial Quantization

The first step is to take a pre-trained LLM and apply a chosen black-box quantization algorithm $\mathcal{Q}$ to its full-precision weight matrices $\mathbf{W}^{(i)}$. For each weight matrix $\mathbf{W}^{(i)}$ (where $i$ indexes the different linear layers in the LLM), the quantization algorithm $\mathcal{Q}$ produces:

  • Quantized weights $\hat{\mathbf{W}}_q^{(i)}$, stored in low precision (e.g., 2, 3, or 4 bits per entry).

  • Zero-point parameters $\mathbf{z}^{(i)}$.

  • Scale parameters $\mathbf{s}^{(i)}$.

    This process is defined as: $ (\hat{\mathbf{W}}_q^{(i)}, \mathbf{z}^{(i)}, \mathbf{s}^{(i)}) = \mathcal{Q}(\mathbf{W}^{(i)}) $ The paper emphasizes that ModuLoRA itself does not specify $\mathcal{Q}$; it treats it as a black box. The dequantization algorithm $\mathcal{D}$ then recovers an approximation $\hat{\mathbf{W}}^{(i)}$ of the original full-precision weights $\mathbf{W}^{(i)}$: $ \hat{\mathbf{W}}^{(i)} = \mathcal{D}(\hat{\mathbf{W}}_q^{(i)}, \mathbf{z}^{(i)}, \mathbf{s}^{(i)}) = \mathbf{s}^{(i)} \odot \hat{\mathbf{W}}_q^{(i)} + \mathbf{z}^{(i)} $ These quantized weights $\hat{\mathbf{W}}_q^{(i)}$, along with their scale $\mathbf{s}^{(i)}$ and zero-point $\mathbf{z}^{(i)}$ parameters, become the frozen base of the LLM for finetuning. They are stored in memory in their low-precision format (e.g., 2, 3, or 4 bits for $\hat{\mathbf{W}}_q^{(i)}$) to minimize memory usage.
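
As an illustration of this per-layer setup, the sketch below walks over a model's linear layers and keeps only the low-precision output of a black-box quantizer; the function and variable names are illustrative and not part of LLMTools, and quantizer can be any routine implementing the $(\hat{\mathbf{W}}_q, \mathbf{z}, \mathbf{s}) = \mathcal{Q}(\mathbf{W})$ interface (e.g., the toy quantizer sketched in Section 3.1, or a wrapper around OPTQ/QuIP#).

import torch

def quantize_model_linears(model, quantizer, bits=4):
    # Apply the black-box quantizer Q to every linear layer and retain only the
    # low-precision representation (Wq, z, s); the base weights stay frozen.
    quantized_state = {}
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            Wq, z, s = quantizer(module.weight.data, bits)
            quantized_state[name] = (Wq, z, s)
            module.weight.requires_grad_(False)   # the frozen base is never updated
    return quantized_state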

4.2.2. Reparameterized ModuLoRALinear Layer

ModuLoRA modifies the original LLM by replacing each standard linear layer with a ModuLoRALinear layer. An original linear layer computes $x \mapsto x (\mathbf{W}^{(i)})^\top + \mathbf{b}^{(i)}$, where $x$ is the input, $\mathbf{W}^{(i)}$ are the weights, and $\mathbf{b}^{(i)}$ are the biases.

The ModuLoRALinear layer reparameterizes this operation. Instead of using the original full-precision weights $\mathbf{W}^{(i)}$, it uses the dequantized approximation $\hat{\mathbf{W}}^{(i)}$ and adds a low-rank adapter $\mathbf{B}^{(i)}(\mathbf{A}^{(i)})^\top$. The new affine map becomes: $ x \mapsto x (\hat{\mathbf{W}}^{(i)})^\top + x \mathbf{B}^{(i)} (\mathbf{A}^{(i)})^\top + \mathbf{b}^{(i)} $ Here:

  • $\hat{\mathbf{W}}^{(i)}$: The dequantized weight matrix obtained from the black-box quantizer and stored in low precision. This part of the weight is frozen and not updated during finetuning.

  • $\mathbf{A}^{(i)}, \mathbf{B}^{(i)} \in \mathbb{R}^{d \times r}$: These are the learnable LoRA parameters. $\mathbf{A}^{(i)}$ and $\mathbf{B}^{(i)}$ are initialized as in Hu et al. (2022), typically with $\mathbf{A}^{(i)}$ being random and $\mathbf{B}^{(i)}$ being zero-initialized. These matrices are stored in full precision (e.g., 16-bit float) and are the only parameters updated during finetuning.

  • $\mathbf{b}^{(i)}$: The bias term, which can also be frozen or finetuned. The paper indicates biases are stored as float16.

    The core ModuLoRALinear class structure is conceptually represented as:

class ModuLoRALinear(Module):
    "Linear ModuLoRA Layer"
    def __init__(self, ...):
        self.hatWq_z_s = quantize(pretrained_W) # Stores (quantized weights, zero-point, scale)
        (self.A, self.B) = lora_init(...)       # Initializes LoRA adapter matrices (A random, B zeros)
        self.bias = pretrained_bias             # Frozen bias, kept in float16

    def forward(self, x):
        (hatWq, z, s) = self.hatWq_z_s
        # LPLinear.apply performs just-in-time dequantization and multiplication by the frozen base
        # The second term adds the trainable low-rank adapter x B A^T
        return LPLinear.apply(x, hatWq, z, s) \
             + x @ (self.B @ self.A.t()) \
             + self.bias
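
During finetuning only the adapter matrices receive gradient updates; the following is a hedged sketch of the optimizer setup, assuming the adapters are exposed as parameters named A and B as in the listing above (the names are illustrative, not an LLMTools API).

import torch

def adapter_parameters(model):
    # Yield only the trainable LoRA matrices; the quantized base weights are kept
    # as buffers (or frozen tensors) and therefore receive no gradient updates.
    for name, param in model.named_parameters():
        if name.split(".")[-1] in ("A", "B") and param.requires_grad:
            yield param

# The paper finetunes with AdamW at learning rate 1e-3; a matching setup would be:
# optimizer = torch.optim.AdamW(adapter_parameters(model), lr=1e-3)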

4.2.3. Efficient Mixed-Precision Computation: Forward Pass

The ModuLoRALinear layer utilizes a custom autograd.Function called LPLinear (Low-Precision Linear Map) to handle the operations involving the quantized base weights. This is crucial for managing memory efficiently.

In the forward pass of LPLinear:

  1. The quantized weights $\hat{\mathbf{W}}_q$, zero-point $\mathbf{z}$, and scale $\mathbf{s}$ are passed along with the input $x$.

  2. The dequantization algorithm $\mathcal{D}$ is called to materialize the high-precision approximation $\hat{\mathbf{W}}$ just-in-time.

  3. A standard matrix multiplication is performed: input @ hatW.t().

  4. Crucially, immediately after its use, the materialized high-precision $\hat{\mathbf{W}}$ is deallocated (hatW is deallocated in the pseudocode). This prevents the entire model's dequantized weights from residing in memory simultaneously.

    The pseudocode for the forward pass is:

class LPLinear(Function):
    "Low-Precision Linear Map"
    @staticmethod
    def forward(ctx, input, hatWq, z, s):
        ctx.save_for_backward(hatWq, z, s) # Saves low-precision components for backward pass
        hatW = dequantize(hatWq, z, s)      # Dequantize to high-precision
        output = input @ hatW.t()           # Perform matrix multiplication
        return output                       # hatW is deallocated after this

By deallocating $\hat{\mathbf{W}}$ immediately, the memory footprint for the base quantized model is kept minimal.

4.2.4. Efficient Mixed-Precision Computation: Backward Pass

The backward pass is critical for finetuning the LoRA adapters $\mathbf{A}^{(i)}$ and $\mathbf{B}^{(i)}$. The chain rule requires calculating gradients that involve the transpose of the weight matrices.

Consider the overall weight matrix of a ModuLoRALinear layer as $\mathbf{W}_l^{(i)} = \hat{\mathbf{W}}^{(i)} + \mathbf{A}^{(i)} (\mathbf{B}^{(i)})^\top$. The pre-activation output is $\bar{\mathbf{y}}_i = \mathbf{W}_l^{(i)} \mathbf{x} + \mathbf{b}^{(i)}$. The loss is $L$. We want to compute $\mathrm{d}L / \mathrm{d}\mathbf{A}^{(i)}$ and $\mathrm{d}L / \mathrm{d}\mathbf{B}^{(i)}$. By the chain rule: $ \frac{\mathrm{d}L}{\mathrm{d}\mathbf{A}^{(i)}} = \frac{\mathrm{d}L}{\mathrm{d}\bar{\mathbf{y}}_i} \cdot \frac{\mathrm{d}\bar{\mathbf{y}}_i}{\mathrm{d}\mathbf{A}^{(i)}} $ And similarly for $\mathbf{B}^{(i)}$. The term $\frac{\mathrm{d}L}{\mathrm{d}\bar{\mathbf{y}}_i}$ is propagated from subsequent layers. Its computation involves the transpose of the weight matrix of the next layer: $ \frac{\mathrm{d}L}{\mathrm{d}\bar{\mathbf{y}}_i} = \frac{\mathrm{d}L}{\mathrm{d}\bar{\mathbf{y}}_{i+1}} \cdot \frac{\mathrm{d}\bar{\mathbf{y}}_{i+1}}{\mathrm{d}\mathbf{y}_i} \cdot \frac{\mathrm{d}\mathbf{y}_i}{\mathrm{d}\bar{\mathbf{y}}_i} $ where $\frac{\mathrm{d}\mathbf{y}_i}{\mathrm{d}\bar{\mathbf{y}}_i}$ is the derivative of the activation function, and $\frac{\mathrm{d}\bar{\mathbf{y}}_{i+1}}{\mathrm{d}\mathbf{y}_i} = (\mathbf{W}_l^{(i+1)})^\top = (\hat{\mathbf{W}}^{(i+1)})^\top + \mathbf{B}^{(i+1)} (\mathbf{A}^{(i+1)})^\top$.
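
Concretely, writing $\mathbf{g}_i = \mathrm{d}L / \mathrm{d}\bar{\mathbf{y}}_i$ for the upstream gradient of a single example, a standard LoRA-style derivation consistent with the conventions above (a sketch, not a formula quoted from the paper) gives: $ \frac{\mathrm{d}L}{\mathrm{d}\mathbf{A}^{(i)}} = \mathbf{g}_i \mathbf{x}^\top \mathbf{B}^{(i)}, \quad \frac{\mathrm{d}L}{\mathrm{d}\mathbf{B}^{(i)}} = \mathbf{x} \mathbf{g}_i^\top \mathbf{A}^{(i)}, \quad \frac{\mathrm{d}L}{\mathrm{d}\mathbf{x}} = \left( (\hat{\mathbf{W}}^{(i)})^\top + \mathbf{B}^{(i)} (\mathbf{A}^{(i)})^\top \right) \mathbf{g}_i, $ so the only term that touches the frozen base is the multiplication by $(\hat{\mathbf{W}}^{(i)})^\top$ in the input gradient.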

This shows that the backward pass also requires performing matrix-vector multiplications involving $(\hat{\mathbf{W}}^{(i)})^\top$. To maintain memory efficiency, the LPLinear backward pass re-implements the dequantization and matrix multiplication steps:

  1. The low-precision quantized components ($\hat{\mathbf{W}}_q$, $\mathbf{z}$, $\mathbf{s}$) saved in ctx during the forward pass are retrieved.

  2. The high-precision approximation $\hat{\mathbf{W}}$ is recomputed (re-dequantized) from these components.

  3. The gradient grad_input is calculated using $\hat{\mathbf{W}}$.

  4. Again, the recomputed hatW is immediately deallocated after use (hatW can be deallocated). This ensures that memory is freed quickly and not accumulated across layers.

    The pseudocode for the backward pass is:

    @staticmethod
    def backward(ctx, grad_output):
        hatWq, z, s = ctx.saved_tensors # Retrieve low-precision components
        hatW = dequantize(hatWq, z, s)  # Recompute high-precision hatW
        grad_input = grad_output @ hatW # Compute gradient
        return grad_input, None, None, None # hatW is deallocated after this

This strategy of recomputing (re-dequantizing) $\hat{\mathbf{W}}$ on-the-fly for both forward and backward passes, rather than storing it, is the core mechanism that allows ModuLoRA to avoid manifesting all full-precision weights in memory simultaneously, thereby enabling finetuning on consumer GPUs.
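
Putting the two passes together, here is a self-contained, runnable sketch of this recompute-on-demand pattern built on a toy per-row round-to-nearest quantizer. It is illustrative only: the paper's released kernels are CUDA implementations, and its quantizers are OPTQ and QuIP#, not this toy scheme.

import torch

def dequantize(Wq, z, s):
    # s * Wq + z with per-row scale/zero-point broadcasting
    return s * Wq.to(s.dtype) + z

class LPLinear(torch.autograd.Function):
    # Low-precision linear map: materialize hatW just-in-time in forward and backward.
    @staticmethod
    def forward(ctx, input, hatWq, z, s):
        ctx.save_for_backward(hatWq, z, s)   # keep only the low-precision components
        hatW = dequantize(hatWq, z, s)       # materialize float weights on demand
        output = input @ hatW.t()
        return output                        # hatW goes out of scope and is freed

    @staticmethod
    def backward(ctx, grad_output):
        hatWq, z, s = ctx.saved_tensors
        hatW = dequantize(hatWq, z, s)       # re-materialize for the backward matmul
        grad_input = grad_output @ hatW      # gradient w.r.t. the layer input
        return grad_input, None, None, None  # no gradients for the frozen quantized weights

# Toy usage: 4-bit round-to-nearest quantization of a random weight matrix.
W = torch.randn(64, 64)
lo = W.min(dim=1, keepdim=True).values
hi = W.max(dim=1, keepdim=True).values
s = ((hi - lo) / 15).clamp_min(1e-8)
z = lo
Wq = torch.clamp(torch.round((W - z) / s), 0, 15).to(torch.uint8)

x = torch.randn(8, 64, requires_grad=True)
y = LPLinear.apply(x, Wq, z, s)
y.sum().backward()
print(x.grad.shape)  # torch.Size([8, 64])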

4.2.5. Increasing Efficiency Further

To further reduce memory consumption beyond simply recomputing dequantized weights, ModuLoRA can employ more granular materialization strategies:

  • Row Materialization: For many quantization algorithms (e.g., Nagel et al., 2020; Frantar et al., 2023), it is possible to dequantize $\hat{\mathbf{W}}^{(i)}$ one row at a time. Each dequantized row is immediately multiplied with the corresponding part of the input $\mathbf{x}$ and then freed. This avoids materializing the entire weight matrix $\hat{\mathbf{W}}^{(i)}$ even temporarily (a sketch of this pattern follows this list).
  • Direct Vector-by-Quantized-Matrix Product: The most efficient approach would be if the quantizer $\mathcal{Q}$ itself provided a direct subroutine for computing vector-by-quantized-matrix products ($x \hat{\mathbf{W}}_q^\top$) without explicitly dequantizing $\hat{\mathbf{W}}_q$ into $\hat{\mathbf{W}}$ at all. ModuLoRA's modular design generalizes to such subroutines, eliminating the need to materialize any part of $\hat{\mathbf{W}}^{(i)}$.
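
A sketch of the row-materialization pattern follows (illustrative; in practice this loop lives inside a fused CUDA kernel, and the dequantize argument stands in for the quantizer's own dequantization routine).

import torch

def lp_matmul_by_rows(x, hatWq, z, s, dequantize):
    # Compute x @ hatW.t() without ever holding the full dequantized matrix hatW.
    d_out = hatWq.shape[0]
    out = x.new_zeros(x.shape[0], d_out)
    for i in range(d_out):
        # Dequantize a single row of hatW, use it, and let it be freed immediately.
        row = dequantize(hatWq[i:i + 1], z[i:i + 1], s[i:i + 1])   # shape (1, d_in)
        out[:, i] = (x @ row.t()).squeeze(1)
    return out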

4.2.6. LLMTools Implementation

ModuLoRA is implemented as part of LLMTools, a user-friendly library. LLMTools provides:

  • Implementation of ModuLoRA for 2-bit, 3-bit, and 4-bit precision.
  • Python API for quantization, inference, and finetuning.
  • Modular support for various quantizers (e.g., OPTQ, QuIP#), LLMs (LLaMA1, LLaMA2, BLOOM, OPT), and optimization algorithms compatible with Hugging Face Trainer.
  • Efficient CUDA implementations for mixed-precision matrix-vector multiplication, including row and weight materialization.
  • CUDA kernels are provided for both row and weight materialization in forward and backward passes. For maximum efficiency, materialized elements of $\hat{\mathbf{W}}_q^{(i)}$ are in float16.
  • Base quantized LLM models are represented by weights $\hat{\mathbf{W}}_q^{(i)}$ in 3 or 4 bits, with scales $\mathbf{s}^{(i)}$, zero-points $\mathbf{z}^{(i)}$, and biases $\mathbf{b}^{(i)}$ all stored as float16.
  • For QuIP# integration (2-bit), LLMTools provides CUDA kernels for weight re-materialization and multiplication by the incoherence-processing orthogonal matrices. The base models use 2-bit $\hat{\mathbf{W}}_q^{(i)}$.

5. Experimental Setup

5.1. Datasets

The experiments used a variety of datasets to evaluate ModuLoRA across different natural language processing tasks:

  • Text Classification:

    • Dataset: A custom dataset derived from Williams et al. (2018), comprising 392,702 text snippets (up to 50 words each) from five genres. Evaluation is performed on 9,815 held-out instances.
    • Purpose: To assess the model's ability to classify short text into distinct categories.
    • Data Sample Example (Hypothetical): A short text like "The protagonist, a detective, unravels a complex mystery in the heart of London." might be classified as "fiction." A snippet from a phone call transcript would be "telephone chat."
  • Natural Language Inference (NLI):

    • Dataset: Multi-Genre Natural Language Inference Corpus (MNLI) (Williams et al., 2018).
    • Purpose: To evaluate the model's understanding of semantic relationships between sentence pairs (a hypothesis and a premise). The task is to predict if the hypothesis entails, contradicts, or is neutral to the premise.
    • Data Sample Example (from MNLI):
      • Premise: "A man is standing on a ladder and painting a wall."
      • Hypothesis: "A man is painting a wall."
      • Label: Entailment
    • Data Sample Example (from MNLI):
      • Premise: "A man is standing on a ladder and painting a wall."
      • Hypothesis: "A man is flying a kite."
      • Label: Contradiction
  • Abstractive Summarization:

    • Dataset: SAMSum dataset (Gliwa et al., 2019). It contains 14,732 (text, summary) training pairs and 819 test pairs.
    • Purpose: To assess the model's ability to generate concise and coherent summaries from longer text inputs.
    • Data Sample Example (Hypothetical from SAMSum):
      • Dialogue (Input): Person A: Hey, did you finish the report for the Q3 meeting? Person B: Almost, just need to finalize the sales figures. Should be done by lunch. Person A: Great, I'll review it then.
      • Summary (Target): Person B is finishing the Q3 report and will send it to Person A for review by lunch.
  • Instruction Following:

    • Dataset: Alpaca dataset (Taori et al., 2023), consisting of 52,000 instructions, and CodeAlpaca dataset (Chaudhary, 2023), consisting of 20,000 code generation instructions.
    • Purpose: To evaluate how well models follow natural language instructions to generate appropriate responses, including code.
    • Data Sample Example (from Alpaca):
      • Instruction: "Explain the concept of recursion to a 5-year-old."
      • Response (Target): "Imagine you have a magic box, and inside that box is another magic box, and inside that one is another, and so on! Recursion is like when you keep opening the boxes until you find the smallest one, then you close them all back up."
  • Calibration Data for Quantization:

    • Dataset: 128 samples from C4 (Raffel et al., 2020) were used for calibrating models quantized with OPTQ. C4 is a massive, cleaned web text dataset.

    • Purpose: Quantization algorithms often require a small amount of data to determine optimal scale and zero-point parameters to minimize information loss during the conversion to lower precision.

      These datasets were chosen because they represent a diverse set of common LLM tasks, allowing for a comprehensive evaluation of ModuLoRA's performance and efficiency across different domains and complexities.

5.2. Evaluation Metrics

The paper uses several standard evaluation metrics tailored to each task:

  • Accuracy:

    • Conceptual Definition: Accuracy measures the proportion of correctly predicted instances out of the total number of instances. It quantifies how often the model's predictions match the true labels. It is a straightforward metric, commonly used in classification tasks.
    • Mathematical Formula: $ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
    • Symbol Explanation:
      • Number of Correct Predictions: The count of instances where the model's output matches the ground truth label.
      • Total Number of Predictions: The total count of all instances evaluated.
    • Used for: Text classification and Natural Language Inference (MNLI-m). Also for BigBenchHard (BBH) in the context of instruction following tasks where it's referred to as exact match accuracy.
  • ROUGE (Recall-Oriented Understudy for Gisting Evaluation):

    • Conceptual Definition: ROUGE is a set of metrics used for evaluating automatic summarization and machine translation software. It works by comparing an automatically produced summary or translation against a set of human-produced reference summaries or translations. It measures the overlap of n-grams (sequences of words) between the candidate and reference summaries. The paper specifically mentions ROUGE-1, ROUGE-2, and ROUGE-L.
      • ROUGE-N: Measures the overlap of n-grams between the candidate and reference summary. ROUGE-1 measures unigram (single word) overlap, ROUGE-2 measures bigram (two-word sequence) overlap.
      • ROUGE-L: Measures the longest common subsequence (LCS) between the candidate and reference summary. It captures sentence-level structure similarity more naturally and does not require consecutive matches.
    • Mathematical Formulas (Standard Definitions): Let $S$ be the candidate summary and $R = \{R_1, R_2, \dots, R_m\}$ be the set of reference summaries.
      • ROUGE-N: $ \text{ROUGE-N} = \frac{\sum_{i=1}^m \sum_{\text{n-gram} \in R_i} \text{Count}_{\text{match}}(\text{n-gram})}{\sum_{i=1}^m \sum_{\text{n-gram} \in R_i} \text{Count}(\text{n-gram})} $
      • ROUGE-L (F-measure based on LCS): First, calculate Precision ($P_{LCS}$), Recall ($R_{LCS}$), and F-measure ($F_{LCS}$) for the Longest Common Subsequence (LCS). $ P_{LCS} = \frac{\text{LCS}(\text{candidate}, \text{reference})}{\text{Length}(\text{candidate})} $ $ R_{LCS} = \frac{\text{LCS}(\text{candidate}, \text{reference})}{\text{Length}(\text{reference})} $ $ F_{LCS} = \frac{(1+\beta^2) P_{LCS} R_{LCS}}{\beta^2 P_{LCS} + R_{LCS}} \quad (\text{typically } \beta=1 \text{ for the F1-score}) $ For multiple references, the maximum $F_{LCS}$ is usually taken.
    • Symbol Explanation:
      • $\text{Count}_{\text{match}}(\text{n-gram})$: The maximum number of n-grams co-occurring in a candidate summary and a set of reference summaries.
      • $\text{Count}(\text{n-gram})$: The number of n-grams in the reference summary (or set of references).
      • $\text{LCS}(\text{candidate}, \text{reference})$: The length of the Longest Common Subsequence between the candidate and reference summaries. A subsequence does not have to be consecutive.
      • $\text{Length}(\text{candidate})$: The length (number of words) of the candidate summary.
      • $\text{Length}(\text{reference})$: The length (number of words) of the reference summary.
      • $\beta$: A weight for the F-measure, typically set to 1 for the F1-score (equal importance for precision and recall).
    • Used for: Abstractive summarization (SAMSum dataset) and Code Alpaca evaluation (ROUGE 1/2/LSum).
  • Perplexity (PPL):

    • Conceptual Definition: Perplexity is a measure of how well a probability model predicts a sample. In natural language processing, it's used to evaluate language models. A lower perplexity score indicates that the model is better at predicting the next word in a sequence, suggesting a better understanding of the language. It can be interpreted as the inverse probability of the test set, normalized by the number of words.
    • Mathematical Formula: Given a sequence of words $W = (w_1, w_2, \dots, w_N)$, perplexity is defined as: $ \text{PPL}(W) = P(w_1, w_2, \dots, w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1, w_2, \dots, w_N)}} $ Which can be rewritten using natural-log probabilities: $ \text{PPL}(W) = \exp \left( -\frac{1}{N} \sum_{i=1}^N \log P(w_i | w_1, \dots, w_{i-1}) \right) $
    • Symbol Explanation:
      • $W = (w_1, w_2, \dots, w_N)$: A sequence of $N$ words.
      • $P(w_1, w_2, \dots, w_N)$: The joint probability of the entire sequence according to the language model.
      • $P(w_i | w_1, \dots, w_{i-1})$: The probability of the $i$-th word given the preceding words, as predicted by the language model.
      • $N$: The total number of words in the sequence.
      • $\exp(\cdot)$: The exponential function.
    • Used for: Evaluating base LLM quality, presented alongside BBH results to show correlation (or lack thereof) between perplexity and downstream task performance.
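
For reference, here is a small sketch of how perplexity and a unigram ROUGE-1 recall can be computed from their definitions (the paper's evaluation relies on established implementations rather than code like this).

import math
from collections import Counter

def perplexity(token_log_probs):
    # PPL = exp(-1/N * sum_i log P(w_i | w_<i)) for natural-log token probabilities.
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

def rouge1_recall(candidate, reference):
    # Fraction of reference unigrams also found in the candidate (with clipped counts).
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum(min(count, cand[token]) for token, count in ref.items())
    return overlap / max(sum(ref.values()), 1)

print(perplexity([-2.3, -0.7, -1.1]))                    # exp(1.366...) ≈ 3.92
print(rouge1_recall("the cat sat", "the cat sat down"))  # 3 of 4 reference unigrams matched: 0.75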

5.3. Baselines

The paper compares ModuLoRA against several representative baselines:

  • LoRA (Full-Precision): LoRA (Hu et al., 2022) applied to full-precision base models. This represents the state-of-the-art in parameter-efficient finetuning without quantization of the base weights. Its memory requirements are very high.
  • BitsAndBytes 8-bit (LLM.int8()): LoRA combined with LLM.int8() quantization (Dettmers et al., 2022). This is a common method for fitting larger models into GPU memory for inference and finetuning via LoRA. It uses 8-bit quantization for most matrix multiplications.
  • BitsAndBytes 4-bit (QLoRA): QLoRA (Dettmers et al., 2023) is a concurrent approach that allows 4-bit finetuning using LoRA. It defines its own quantization scheme and incorporates specialized techniques like quantization of zero-points and scales and double quantization. This is a direct competitor in the low-bit finetuning space.
  • Full Finetuning: For some tasks, results from full finetuning of large models like GPT-3 and T5 are included from existing literature (Hu et al., 2022; Chung et al., 2022) to provide an upper bound on performance.
  • Other PEFT Methods: Adapter tuning (Houlsby et al., 2019) and SliC (for Pegasus) are mentioned in summarization benchmarks for comparison.
  • No Finetuning: For instruction following tasks, FLAN-T5 and LLaMA without any finetuning are included to show the baseline performance of the raw models.

5.4. Training Details

  • Models: LLaMA (7B, 13B, 30B, 65B), BLOOM, and OPT models (7B, 13B, 30B).
  • Quantization:
    • 3-bit and 4-bit quantization used OPTQ (Frantar et al., 2023) with calibration on 128 samples from C4.
    • 2-bit quantization used QuIP# (Chee et al., 2023; Tseng et al., 2023) with $E_8$ lattice codebooks.
  • Hardware: Finetuning was performed on NVIDIA TITAN, 3090, and A6000 GPUs, depending on the model size and memory requirements.
  • LoRA Configuration:
    • LoRA rank ($r$): 8
    • LoRA alpha ($\alpha$): 32
  • Optimization: AdamW optimizer was used.
  • Random Seeds: Results are reported from 3 random seeds to account for variability.
  • Hyperparameters:
    • SAMSum: Training for 350 steps, batch size of 128 samples, learning rate 1e-3, cosine learning rate schedule, weight decay 0.0, Max sequence length 250.
    • Text Classification: Batch size 256, evaluation batch size 32, 100 evaluation steps, 1000 total training steps, learning rate 1e-3, cosine learning rate schedule, weight decay 0.0, Max sequence length 128.
    • Code-Alpaca: Batch size 128, evaluation batch size 4, 40 evaluation steps, 120 total training steps, learning rate 1e-3, linear learning rate schedule, weight decay 0.0, Max sequence length 165.
    • MNLI-M: Batch size 128, evaluation batch size 64, 64 evaluation steps, 1.0 training epoch, learning rate 1e-3, cosine learning rate schedule, weight decay 0.0.
    • Alpaca (for BBH): Batch size 128, 3 total training epochs, learning rate 1e-3, linear learning rate schedule, weight decay 0.0.
  • Fair Comparison: Hyperparameters were chosen to match those used in QLoRA (Dettmers et al., 2023) for a fair comparison.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate that ModuLoRA consistently achieves competitive performance across various tasks, often outperforming less sophisticated quantization methods, while drastically reducing memory requirements.

  • Text Classification (Table 1): ModuLoRA with 3-bit and 4-bit LLaMA models shows accuracy comparable to 8-bit Bits&Bytes finetuning. For instance, LLaMA-65B achieves 97.2% (3-bit) and 98.0% (4-bit) with ModuLoRA, versus 98.6% with 8-bit Bits&Bytes. This indicates that aggressive quantization (3-bit, 4-bit) combined with ModuLoRA can maintain high performance on simpler classification tasks while using significantly less memory.

  • Natural Language Inference (MNLI-m) (Table 2): ModuLoRA demonstrates strong results here. The 2-bit and 3-bit 65B LLaMA models achieve performance matching a full-precision GPT-3 + LoRA baseline. Notably, the 2-bit 65B model quantized with QuIP# outperforms other 65B models at higher precisions. Across the model size range, ModuLoRA's 3-bit and 4-bit models consistently outperform 8-bit Bits&Bytes models. Also, ModuLoRA models (2, 3, or 4-bit) either match or exceed their 4-bit QLoRA counterparts, often with lower memory usage due to finer precision. This highlights the benefit of ModuLoRA's ability to integrate with advanced quantizers.

  • Abstractive Summarization (SAMSum) (Table 3, 4, 5):

    • LLaMA models: A significant finding is that ModuLoRA's 4-bit 65B LLaMA models attain a new state-of-the-art ROUGE score (e.g., 54.8 / 31.3 / 47.2 for ROUGE 1/2/L) on SAMSum, surpassing GPT-3 baselines. ModuLoRA consistently outperforms 4-bit QLoRA and 8-bit BitsAndBytes methods. Even 2-bit ModuLoRA models match the performance of 8-bit baselines. The performance drop from 4-bit to 3-bit to 2-bit is marginal (about 1% ROUGE), showcasing the robustness of ultra-low precision finetuning.
    • Ablation with RTN (Table 4): The comparison with Round-to-Nearest (RTN) quantization reveals that OPTQ (used by ModuLoRA) performs better, underscoring the importance of using sophisticated, data-driven quantizers over simpler ones. This justifies ModuLoRA's modular design.
    • OPT models (Table 5): Similar trends are observed with OPT models, where ModuLoRA (3-bit and 4-bit) matches or outperforms 4-bit QLoRA and 8-bit Bits&Bytes baselines.
  • Instruction Following (BBH) (Table 6, 9):

    • Alpaca (Table 6): Performance drops only slightly for 2-bit, 3-bit, and 4-bit ModuLoRA models compared to 8-bit models. Crucially, 2-bit models match 4-bit QLoRA performance. More impressively, 4-bit and 3-bit 65B ModuLoRA models outperform 8-bit 30B models, demonstrating the efficiency of combining larger model sizes with aggressive quantization. ModuLoRA provides consistent improvements over QLoRA, especially for smaller models.
    • Code Alpaca (Table 9): ModuLoRA (3-bit and 4-bit) performs comparably or better than 8-bit Bits&Bytes models, confirming the general trend.
  • Memory Requirements (Table 7 & Figure 2): This is a key highlight. ModuLoRA significantly reduces memory needs. For a 65B model on MNLI-M, ModuLoRA (2-bit) uses only 21.8 GB, making it finetunable on a single 24GB GPU. In contrast, QLoRA (4-bit) requires 36.7 GB, and full-precision LoRA needs 360.4 GB. ModuLoRA uses only about 6% of the memory of full-precision LoRA for a 65B model. This is a groundbreaking achievement for accessibility (a back-of-the-envelope check of these numbers appears at the end of this subsection).

  • Finetuning & Inference Latency (Table 10 & 11):

    • Finetuning (Table 10): ModuLoRA (2-bit) is significantly faster (0.61 s/it) than QLoRA (0.80 s/it) and full-precision LoRA (1.50 s/it), reducing training time by approximately 59.3% and memory usage by 91.5% compared to full-precision LoRA. This efficiency comes from reduced data movement.
    • Inference (Table 11): ModuLoRA (2-bit) is slightly slower (0.68 s/it) than QLoRA and full-precision LoRA (both 0.52 s/it). The authors attribute this to less optimized CUDA kernels compared to QLoRA, suggesting future optimization potential.
  • BBH vs. PPL (Table 8): The paper also notes an interesting finding: the correlation between perplexity (PPL) on Wiki2 and finetuning performance on BBH is not perfect. Large differences in PPL sometimes correspond to only small differences in BBH accuracy. This questions traditional LLM evaluation metrics when the goal is finetuning and suggests that factors beyond raw perplexity might be more indicative of finetuning potential.

    In summary, ModuLoRA successfully demonstrates that integrating advanced quantization with LoRA in a modular fashion allows for highly memory-efficient finetuning (even down to 2-bit) on consumer GPUs, often achieving state-of-the-art or competitive performance.
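
As a back-of-the-envelope check of the memory numbers above (an estimate that counts weight storage only, ignoring activations, LoRA adapters, optimizer state, and full-precision scales/zero-points):

params = 65e9                       # LLaMA-65B parameter count
for bits in (16, 8, 4, 2):
    gigabytes = params * bits / 8 / 1e9
    print(f"{bits:>2}-bit weights: ~{gigabytes:.1f} GB")
# 16-bit: ~130 GB, 8-bit: ~65 GB, 4-bit: ~32.5 GB, 2-bit: ~16.3 GB.
# The 2-bit figure is consistent with the 21.8 GB total reported in Table 7 once
# adapters, activations, and per-layer quantization metadata are added on top.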

6.2. Data Presentation (Tables)

The following are the results from Table 1 of the original paper:

LLAMA Tuning 13B 30B 65B
LLMTOOLS (3-bit) 93.5 ± 0.7 97.0 ± 0.9 97.2 ± 0.8
LLMTOOLS (4-bit) 92.9 ± 0.7 96.3 ± 1.0 98.0 ± 0.9
Bits&Bytes 8-bit (LLM.int8()) 93.0 ± 0.7 93.7 ± 1.0 98.6 ± 1.0

The following are the results from Table 2 of the original paper:

Models Finetuning Adaptation Model Size # Trainable Parameters MNLI-m (accuracy)
GPT-3 Full Finetuning 175B 175,255.8M 89.5 ± 0.1
GPT-3 Adapter 175B 40.1M 91.5 ± 0.1
GPT-3 LoRA 175B 4.7M 91.7 ± 0.1
T5 Full Finetuning 11B 11,307.4M 92.2 ± 0.1
LLaMA Finetuning Quantizer 7B 13B 30B
LLMTOOLS (2-bit) QuIP# (E8) 88.50 ± 0.3 89.72 ± 0.3 91.30 ± 0.3
LLMTOOLS (3-bit) OPTQ 88.98 ± 0.2 90.20 ± 0.2 91.09 ± 0.2
LLMTOOLS (4-bit) OPTQ 89.31 ± 0.2 90.41 ± 0.2 91.31 ± 0.1
Bits&Bytes (4-bit) QLoRA 89.28 ± 0.2 89.67 ± 0.2
Bits&Bytes (8-bit) LLM.int8() 88.95 ± 0.1 90.08 ± 0.1 91.22 ± 0.1 91.15 ± 0.1

The following are the results from Table 3 of the original paper:

Models Adaptation # Trainable Parameters ROUGE 1/2/L
GPT-3 Full Finetuning 175,255.8M 52.0 / 28.0 / 44.5
GPT-3 Adapter 40.1M 53.2 / 29.0 / 45.1
GPT-3 LoRA 4.7M 53.8 / 29.8 / 45.9
Pegasus SliC 2B 54.4 / 29.9 / 45.9
LLaMA Finetuning Quantizer 7B 13B 30B 65B
LLMTOOLS (2-bit) QuIP# (E8) 51.3 / 27.3 / 43.7 52.3 / 29.0 / 45.0 53.3 / 30.2 / 46.0 54.0 / 30.6 / 46.2
LLMTOOLS (3-bit) OPTQ 51.2 / 28.2 / 44.0 52.4 / 29.6 / 45.1 53.6 / 30.8 / 46.3 54.1 / 30.9 / 46.5
LLMTOOLS (4-bit) OPTQ 51.7 / 28.3 / 44.4 53.2 / 30.2 / 46.1 53.9 / 31.2 / 46.9 54.8 / 31.3 / 47.2
Bits&Bytes (4-bit) QLoRA
Bits&Bytes (8-bit) LLM.int8() 51.6 / 28.3 3/ 44.5 51.9 / 28.1 / 44.5 51.3 / 28.1 / 44.1 51.3 / 28.2 / 43.6 53.0 / 30.2 50.8 / 28.4 / 45.7 53.8 / 44.1 53.9 3 / 30.5 /4 45.9 / 30.4 / 46.3

The following are the results from Table 4 of the original paper:

SAMSum Performance Quantizer 7B 13B
LLMTOOLS (3-bit) OPTQ 51.2 / 28.2 / 44.0 / 44.2 52.4 / 29.6 / 45.1 / 45.1
LLMTOOLS (3-bit) RTN 50.7 / 27.2 / 43.6 / 43.6 51.1 / 28.7 / 44.3 / 44.5
LLMTOOLS (4-bit) OPTQ 51.7 / 28.3 / 44.4 / 44.4 53.2 / 30.2 / 46.1 / 46.1
LLMTOOLS (4-bit) RTN 51.2 / 28.5 / 44.2 / 44.2 52.5 / 29.9 / 45.5 / 45.5

The following are the results from Table 5 of the original paper:

OPT Finetuning Quantizer 13B 30B
LLMTOOLS (3-bit) OPTQ 48.8 / 26.7 / 41.9 49.9 / 27.1 / 42.5
LLMTOOLS (4-bit) OPTQ 49.3 / 26.8 / 42.0 49.6 / 27.1 / 42.4
Bits&Bytes (4-bit) QLoRA 49.2 / 27.0 / 42.1 49.9 / 27.0 / 42.5
Bits&Bytes (8-bit) LLM.int8() 48.8 / 26.5 / 41.7 49.3 / 27.1 / 42.3

The following are the results from Table 6 of the original paper:

Model Method Quantizer BASE (250M) L (780M) XL (3B) XXL (11B)
FLAN-T5 No Finetuning None 30.8 30.3 39.9 47.4
LLaMA Methods Quantizer 7B 13B 30B 65B
LLMTOOLS (2-bit) QuIP# (E8) 30.8 ± 0.5 38.3 ± 0.6 43.5 ± 0.5
LLMTOOLS (3-bit) OPTQ 31.1 ± 0.4 35.3 ± 0.2 37.2 ± 0.6 43.3 ± 0.4
LLMTOOLS (4-bit) OPTQ 36.2 ± 0.4 40.4 ± 0.2 43.7 ± 0.4
Bits&Bytes (4-bit) QLoRA 35.4 ± 0.2 39.0 ± 0.4 43.5 ± 0.5
Bits&Bytes (8-bit) LLM.int8() 31.9 ± 0.1 36.8 ± 0.2 37.1 39.1 ± 0.5 44.7 ± 0.4
No Finetuning None 33.3 ± 0.3 30.9 39.3 42.6

The following are the results from Table 7 of the original paper:

LLaMA Finetuning 7B 13B 30B 65B
LLMTOOLS (2-bit) 3.2 GB 5.4 GB 11.4 GB 21.8 GB
QLoRA (4-bit) 5.2 GB 8.6 GB 19.5 GB 36.7 GB
Full Precision (LoRA) 38.4 GB 73.9 GB 183.3 GB 360.4 GB

The following are the results from Table 8 of the original paper:

Models Quantization BBH PPL
LLAMA (13B) 3-bit 35.3 6.63
4-bit 36.2 5.36
LLAMA (65B) 3-bit 43.3 5.04
4-bit 43.7 3.84

The following are the results from Table 9 of the original paper:

Code Alpaca Performance 13B 30B
LLMTOOLS LLMTOOLS (3-bit) 53.6 / 36.3 54.6 37.2 50.7 51.4 57.0 40.0 / 40.6 53.3 58.1 40.7 | 54.3 60.0 44.1 58.8
Bits&Bytes (LLM.int8()) (4-bit) 8-bit 54.0 / 36.3 50.9 57.4 57.7 / 54.3 / 41.3 / 54.9 59.0 / 41.4 60.6 √ 43.5/ 57.5 60.2 57.5 61.1/ 43.5 44.1 / 58.0 / 56.8

The following are the results from Table 10 of the original paper:

Precision LLMTools (2-bit) QLoRA (4-bit) LoRA (Full Precision)
Seconds/Iteration 0.61 s/it 0.80 s/it 1.50 s/it

The following are the results from Table 11 of the original paper:

Precision LLMTools (2-bit) QLoRA (4-bit) LoRA (Full Precision)
Seconds/Iteration 0.68 s/it 0.52 s/it 0.52 s/it

The following are the results from Table 12 of the original paper:

Dataset: SAMSum (Models: LLaMA 7B/13B/30B/65B, OPT 7B/13B/30B)
Optimizer AdamW
Warmup Ratio 0.06
Batch size 128
Evaluation Batch size 16
Evaluation Steps 50
Total # Training Steps 350
Learning Rate Schedule Cosine
Learning Rate 1e-3
Weight Decay 0.0
LoRA Config rq = rv = 8
LoRA α 32
Max Seq. Len 250

The following are the results from Table 13 of the original paper:

Dataset: Text Classification (Models: LLaMA 13B/30B/65B)
Optimizer AdamW
Warmup Ratio 0.06
Batch size 256
Evaluation Batch size 32
Evaluation Steps 100
Total # Training Steps 1000
Learning Rate Schedule Cosine
Learning Rate 1e-3
Weight Decay 0.0
LoRA Config rq = rv = 8
LoRA α 32
Max Seq. Len 128

The following are the results from Table 14 of the original paper:

Dataset: Code-Alpaca (Models: LLaMA 7B/13B/30B/65B)
Optimizer AdamW
Warmup Ratio 0.06
Batch size 128
Evaluation Batch size 4
Evaluation Steps 40
Total # Training Steps 120
Learning Rate Schedule Linear
Learning Rate 1e-3
Weight Decay 0.0
LoRA Config rq = rv = 8
LoRA α 32
Max Seq. Len 165

The following are the results from Table 15 of the original paper:

Dataset: MNLI-M (Models: LLaMA 7B/13B/30B/65B)
Optimizer AdamW
Warmup Ratio 0.06
Batch size 128
Evaluation Batch size 64
Evaluation Steps 64
Total # Training Epochs 1.0
Learning Rate Schedule Cosine
Learning Rate 1e-3
Weight Decay 0.0

The following are the results from Table 16 of the original paper:

Dataset: Alpaca (Models: LLaMA 7B/13B/30B/65B)
Optimizer AdamW
Warmup Ratio 0.06
Batch size 128
Total # Training Epochs 3
Learning Rate Schedule Linear
Learning Rate 1e-3
Weight Decay 0.0
LoRA Config rq = rv = 8

6.3. Ablation Studies / Parameter Analysis

The paper includes a significant ablation study and analysis of model quality metrics:

  • Impact of Quantizer Quality (Table 4): This ablation study on the SAMSum dataset directly compares ModuLoRA's performance when using the sophisticated OPTQ quantizer versus a simpler Round-to-Nearest (RTN) approach. The results show that OPTQ consistently yields better ROUGE scores than RTN for both 3-bit and 4-bit quantization across 7B and 13B LLaMA models. For example, a 7B LLaMA with 3-bit OPTQ achieves ROUGE-1/2/L of 51.2 / 28.2 / 44.0, while with 3-bit RTN it drops to 50.7 / 27.2 / 43.6. This highlights that the choice of quantization algorithm is crucial and that ModuLoRA's modularity, allowing it to integrate with high-quality quantizers like OPTQ, is a key advantage for achieving superior performance.

  • Correlation between Perplexity and Finetuning Performance (Table 8): The authors investigate the relationship between a base model's perplexity (PPL) on Wiki2 and its finetuning performance (BBH accuracy) on instruction-following tasks for LLaMA models. Interestingly, the correlation is not perfect. For 13B LLaMA, 3-bit quantization yields 35.3 BBH with 6.63 PPL, while 4-bit quantization results in 36.2 BBH with 5.36 PPL. For 65B LLaMA, 3-bit gives 43.3 BBH with 5.04 PPL, and 4-bit gives 43.7 BBH with 3.84 PPL. This analysis indicates that models with slightly worse perplexity (e.g., 3-bit vs. 4-bit) can still achieve very competitive finetuning performance. The paper points out that "large gaps in PPL admit small gaps in BBH," suggesting that perplexity alone might not be a sufficient proxy for a base LLM's finetuning potential. This finding prompts a re-evaluation of LLM evaluation strategies, especially when the end goal is finetuning for downstream tasks rather than pure language modeling capability.

The following figure (Figure 2 from the original paper) visualizes the memory requirements with different methods:

Figure 2: Visualization of memory requirements with different methods. The figure plots required memory (GB, y-axis) against model size (billions of parameters, x-axis) for 2-bit and 4-bit quantized finetuning and full-precision LoRA; LLMTools (2-bit) markedly reduces the memory required.

The visualization of memory requirements in Figure 2 strongly supports the paper's claims about ModuLoRA's efficiency. It clearly shows that LLMTools (2-bit) (which implements ModuLoRA) drastically reduces the memory needed compared to QLoRA (4-bit) and especially Full Precision (LoRA), across all model sizes. For a 65B model, LLMTools (2-bit) uses the least memory, making it feasible on consumer GPUs. This visual representation underscores the practical impact of ModuLoRA in democratizing access to LLM finetuning.

6.4. Other Model Families

The paper extends its evaluation to OPT models (Table 5) and observes consistent trends. ModuLoRA with 3-bit and 4-bit OPTQ quantization for OPT models (13B and 30B) achieves ROUGE-1/2/L scores that match or slightly outperform 4-bit QLoRA and 8-bit LLM.int8() baselines on the SAMSum dataset. This generalization across different LLM architectures (LLaMA and OPT) reinforces the robustness and effectiveness of the ModuLoRA approach. While OPT models generally perform worse than LLaMA models, ModuLoRA still provides competitive results relative to more memory-intensive finetuning strategies within the OPT family.

6.5. Overall Conclusions from Results

The experimental results collectively highlight that ModuLoRA successfully bridges the gap between ultra-low-precision quantization and high-performance LoRA finetuning. It demonstrates that it's possible to achieve state-of-the-art or competitive results on diverse NLP tasks (classification, NLI, summarization, instruction following) using significantly less memory by leveraging:

  1. Modular Quantizer Integration: Allowing the use of advanced, performance-optimized quantizers (e.g., OPTQ, QuIP#).

  2. Aggressive Low Bit-Widths: Pushing finetuning down to 2-bit and 3-bit precision.

  3. Memory-Efficient Backward Pass: Through adaptive dequantization and recomputation of the frozen weights (see the sketch after this list).

    This leads to a substantial reduction in hardware requirements, making LLM finetuning much more accessible on consumer GPUs.
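As a rough illustration of point 3, the sketch below shows one way a quantization-agnostic backward pass can be written in PyTorch: the frozen weight is dequantized on the fly in the forward pass and re-materialized in the backward pass instead of being cached, and only the low-rank adapter receives gradients. The class and method names (QuantizedMatMul, ModuLoRALinearSketch, quantizer.dequantize) are assumptions for illustration, not the LLMTools API.

```python
import torch

class QuantizedMatMul(torch.autograd.Function):
    """Compute x @ W^T against a black-box quantized weight.

    The dequantized weight is created temporarily in forward and again in
    backward (recomputation), so only the low-precision codes persist
    between the two passes.
    """

    @staticmethod
    def forward(ctx, x, qweight, quantizer):
        ctx.save_for_backward(x, qweight)
        ctx.quantizer = quantizer
        W = quantizer.dequantize(qweight)          # temporary full-precision copy
        return x @ W.t()                           # W is freed after this call

    @staticmethod
    def backward(ctx, grad_out):
        x, qweight = ctx.saved_tensors
        W = ctx.quantizer.dequantize(qweight)      # re-materialize, do not cache
        grad_x = grad_out @ W                      # no gradient for the frozen base
        return grad_x, None, None


class ModuLoRALinearSketch(torch.nn.Module):
    """Frozen quantized base layer plus a trainable low-rank adapter."""

    def __init__(self, qweight, quantizer, in_features, out_features, r=8, alpha=32):
        super().__init__()
        self.qweight, self.quantizer = qweight, quantizer
        self.lora_A = torch.nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = torch.nn.Parameter(torch.zeros(out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        base = QuantizedMatMul.apply(x, self.qweight, self.quantizer)
        return base + self.scale * (x @ self.lora_A.t()) @ self.lora_B.t()
```

Under this scheme, the only persistent per-layer state during finetuning is the quantized codes plus the small A and B matrices, which is what drives the memory savings described above.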

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully introduces ModuLoRA, a novel and memory-efficient finetuning algorithm that addresses the critical challenge of deploying and adapting large language models on resource-constrained hardware, specifically consumer GPUs. At its core, ModuLoRA leverages a modular design, integrating low-rank adapters (LoRAs) with any user-specified, black-box quantization module. This is achieved through a quantization-agnostic backward pass that adaptively dequantizes (materializes) low-precision LLM weights only when necessary, minimizing memory footprint.

A key achievement of ModuLoRA is enabling, for the first time, the finetuning of 2-bit and 3-bit LLMs (up to 65 billion parameters) on single consumer GPUs (e.g., 24GB or 48GB). By integrating with state-of-the-art quantizers like 2-bit QuIP# and 3-bit OPTQ, ModuLoRA demonstrates superior or competitive performance compared to finetuning methods that rely on less sophisticated 4-bit and 8-bit quantization. The empirical evaluations across text classification, natural language inference, instruction following, and summarization tasks consistently show ModuLoRA matching or surpassing existing approaches while using significantly less memory. Notably, it achieved a new state-of-the-art ROUGE score on the SAMSum summarization benchmark. The accompanying LLMTools library further democratizes access by providing a user-friendly platform for quantizing, running, and finetuning these models.

7.2. Limitations & Future Work

The authors thoughtfully acknowledge several limitations of ModuLoRA:

  • Inference Overhead: A primary advantage of traditional LoRA is that the low-rank adapter $\mathbf{AB}^\top$ can be fused with the full-precision base weight matrix $\mathbf{W}_0$ after finetuning, yielding a single full-precision weight matrix and essentially no inference overhead. ModuLoRA loses this advantage relative to the black-box quantized model: because the adapter is full-precision while the base weight matrix is quantized, the two cannot be trivially fused. The ModuLoRALinear layer (as depicted conceptually in Figure 1 of the paper) must therefore perform dequantization and a separate LoRA addition at inference time, which can lead to slightly higher latency than a fully fused full-precision LoRA or an optimized quantized inference kernel. The latency results (Table 11) confirm this, showing ModuLoRA (2-bit) inference to be slightly slower than QLoRA or full-precision LoRA and suggesting a need for further CUDA kernel optimization of ModuLoRA's inference path (a short sketch contrasting the fused and unfused inference paths follows this list).

  • Hardware Limits for Trillion-Parameter Models: While ModuLoRA significantly pushes the boundaries for consumer GPUs, even at the most aggressive 1-bit per parameter, a trillion-parameter model would still require 125GB of memory, exceeding the capacity of current high-end consumer GPUs (e.g., 24GB or 48GB). This means ModuLoRA cannot yet make the largest-scale models (like GPT-4 or beyond) finetunable on single commodity hardware. Model parallelism or more advanced distributed quantization techniques would still be necessary for such massive models.

  • LLM Safety Concerns: The authors briefly touch upon the ethical implications, noting that making finetuning LLMs more accessible on commodity hardware could "make finetuning too easy," potentially presenting problems related to LLM safety. This highlights a broader societal concern as powerful AI models become easier to customize and deploy, potentially for malicious purposes or without sufficient safeguards.

    Future research directions could involve:

  • Developing optimized CUDA kernels for ModuLoRA inference to reduce the latency observed in the current implementation.

  • Exploring hybrid quantization and parallelization strategies to bring trillion-parameter models within reach of more accessible hardware.

  • Investigating quantization-aware training (QAT) approaches within the ModuLoRA framework to potentially further improve performance at ultra-low bit-widths.

  • Developing tools and guidelines for responsible finetuning to mitigate the LLM safety concerns raised.
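To illustrate the inference-overhead point from the first limitation above, the snippet below contrasts the two inference paths. All tensor names are placeholders, and dequantize_stub stands in for a real quantizer's decode step; it is a simplification of how an unfused quantized-base layer behaves, not the paper's implementation.

```python
import torch

out_f, in_f, r, alpha = 256, 256, 8, 32
x = torch.randn(1, in_f)
W0 = torch.randn(out_f, in_f)                       # stand-in full-precision base
A, B = torch.randn(r, in_f), torch.randn(out_f, r)  # stand-in LoRA factors

# Full-precision LoRA: fold the adapter into the base once, offline;
# inference afterwards is a single dense matmul with no extra latency.
W_fused = W0 + (alpha / r) * (B @ A)
y_fused = x @ W_fused.t()

# ModuLoRA-style setting: the base exists only as quantized codes, so the
# full-precision adapter cannot be folded in. Each forward pass pays for
# dequantization (or a custom mixed-precision kernel) plus a separate
# low-rank branch.
def dequantize_stub(codes):                         # placeholder for a real quantizer
    return codes                                    # pretend codes decode to fp weights

y_unfused = x @ dequantize_stub(W0).t() + (alpha / r) * (x @ A.t()) @ B.t()
```

Mathematically the two outputs coincide; the difference is purely in how much work must be done per forward pass, which is why optimized kernels for the unfused path are listed as future work.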

7.3. Personal Insights & Critique

ModuLoRA represents a significant step forward in making LLM finetuning more democratic and accessible. The paper's core strength lies in its modularity. By decoupling the finetuning mechanism from the specific quantization algorithm, ModuLoRA effectively future-proofs itself against advancements in quantization research. As quantizers become even more sophisticated and capable of preserving quality at lower bit-widths, ModuLoRA can immediately benefit without requiring architectural changes. This is a very elegant design choice that leverages the best of both worlds: PEFT for trainable parameters and state-of-the-art quantization for the frozen base.

The empirical evidence is compelling, particularly the achievement of 2-bit finetuning and the surpassing of state-of-the-art ROUGE scores for summarization. The drastic reduction in memory requirements is the most impactful practical contribution, enabling researchers and developers with consumer-grade GPUs to participate in LLM finetuning, which was previously reserved for well-funded institutions. This will undoubtedly accelerate open-source LLM development and foster innovation.

One area for potential improvement, as the authors noted, is the inference overhead. While finetuning efficiency is the primary goal, a streamlined inference path that can fuse the adapter and quantized base more effectively would enhance the holistic utility of ModuLoRA. This is a common challenge in mixed-precision and parameter-efficient methods, and active research in optimized kernel development could address this.

The observation about the imperfect correlation between perplexity and finetuning performance (BBH) is also a valuable insight. It suggests that researchers should diversify their evaluation metrics for base LLMs if their ultimate goal is finetuning for specific downstream tasks, rather than relying solely on traditional language modeling benchmarks. This could lead to new directions in pre-training or base model selection strategies.

Overall, ModuLoRA offers a practical, high-performance, and forward-looking solution to the LLM finetuning accessibility problem. Its methods and conclusions could be easily transferred to other domains beyond NLP where large transformer models are used, such as computer vision, as long as LoRA and quantization are applicable. The project's release as an open-source library, LLMTools, further amplifies its potential impact.
