LoRA: Low-Rank Adaptation of Large Language Models
TL;DR Summary
LoRA introduces a low-rank adaptation method for fine-tuning large language models, significantly reducing trainable parameters by injecting rank decomposition matrices while freezing the pre-trained weights. It achieves comparable or better performance than full fine-tuning on RoBERTa, DeBERTa, GPT-2, and GPT-3, while adding no additional inference latency.
Abstract
An important paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example -- deploying independent instances of fine-tuned models, each with 175B parameters, is prohibitively expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times. LoRA performs on-par or better than fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher training throughput, and, unlike adapters, no additional inference latency. We also provide an empirical investigation into rank-deficiency in language model adaptation, which sheds light on the efficacy of LoRA. We release a package that facilitates the integration of LoRA with PyTorch models and provide our implementations and model checkpoints for RoBERTa, DeBERTa, and GPT-2 at https://github.com/microsoft/LoRA.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
LoRA: Low-Rank Adaptation of Large Language Models
The title clearly states the paper's core contribution: a method named LoRA (Low-Rank Adaptation) designed for adapting Large Language Models (LLMs). This immediately signals that the paper addresses the challenge of efficiently fine-tuning massive neural networks.
1.2. Authors
Edward Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen.
All authors were affiliated with Microsoft Corporation at the time of publication, with Yuanzhi Li also holding a position at Carnegie Mellon University. This affiliation is significant, as Microsoft has access to large-scale computational resources and proprietary models (like the GPT series through its collaboration with OpenAI), which enabled the authors to validate their method on a massive 175-billion-parameter model like GPT-3.
1.3. Journal/Conference
The version of the paper provided was submitted to arXiv, a preprint server, on June 17, 2021. The paper was later officially accepted and published at the International Conference on Learning Representations (ICLR) 2022. ICLR is a top-tier, highly competitive conference in the field of machine learning and artificial intelligence. Its acceptance signifies the paper's high quality, novelty, and impact as recognized by the research community.
1.4. Publication Year
2021 (arXiv preprint), 2022 (ICLR publication).
1.5. Abstract
The abstract introduces the central problem: as language models grow larger (e.g., GPT-3 with 175B parameters), fully fine-tuning them for every new task becomes impractical and prohibitively expensive. To solve this, the authors propose Low-Rank Adaptation (LoRA). LoRA's approach is to freeze the pre-trained model weights and inject smaller, trainable "rank decomposition matrices" into each Transformer layer. This drastically reduces the number of parameters that need to be trained for downstream tasks. The abstract highlights LoRA's key achievements compared to full fine-tuning with an Adam optimizer on GPT-3 175B: a 10,000-fold reduction in trainable parameters and a 3-fold reduction in GPU memory requirements. Crucially, LoRA achieves performance that is on-par with or better than full fine-tuning across various models (RoBERTa, DeBERTa, GPT-2, GPT-3) and, unlike methods such as adapters, introduces no additional inference latency. The paper also provides an empirical study supporting the hypothesis that weight updates during model adaptation are inherently low-rank.
1.6. Original Source Link
- Official Source Link: https://arxiv.org/abs/2106.09685
- PDF Link: https://arxiv.org/pdf/2106.09685v2.pdf
- Publication Status: This is an arXiv preprint. The paper was peer-reviewed and officially published at ICLR 2022.
2. Executive Summary
2.1. Background & Motivation
The dominant paradigm in Natural Language Processing (NLP) involves taking a massive Large Language Model (LLM) pre-trained on a general web-scale text corpus and adapting it to specific downstream tasks (e.g., summarization, question answering, code generation). The standard adaptation method is full fine-tuning, where all of the model's weights are updated using task-specific data.
While effective, this approach faces a critical scalability crisis. For a model like GPT-3 with 175 billion parameters, full fine-tuning results in a new, distinct set of 175 billion parameters for every single task. The operational costs are enormous:
- Storage Cost: Storing a full copy of the model for each task is extremely expensive (hundreds of gigabytes per model).
- Deployment Cost: Loading and swapping these massive models in memory for different tasks is slow and resource-intensive, making it difficult to serve many customized models simultaneously.
- Training Cost: The GPU memory required to train all 175 billion parameters (including optimizer states, as in Adam) is immense, creating a high hardware barrier.

Existing solutions, known as Parameter-Efficient Fine-Tuning (PEFT) methods, tried to address this but had their own drawbacks:

- Adapter Tuning: Inserts small new layers into the model. While it reduces trainable parameters, these extra layers are processed sequentially, which introduces inference latency. In latency-sensitive applications, even a small delay can be unacceptable.
- Prefix/Prompt Tuning: Modifies the input to the model by learning continuous "soft prompts". This can be difficult to optimize and reduces the effective context window available for the actual task input, as the learned prefixes consume sequence length.
The paper's innovative entry point is a simple but powerful hypothesis: the change in a model's weights during adaptation ($\Delta W$) does not need to be full-rank. Instead, it likely has a very low "intrinsic rank". This means the update can be represented far more efficiently, which is the core idea behind LoRA.
2.2. Main Contributions / Findings
The paper makes several key contributions that have had a profound impact on the field:
- Proposal of LoRA: LoRA is a novel PEFT technique that freezes the pre-trained weights and injects a pair of trainable low-rank matrices ($A$ and $B$) in parallel with the existing weight matrices in the Transformer architecture. Only these small matrices are trained.
- Extreme Parameter Efficiency: LoRA dramatically reduces the number of trainable parameters. For GPT-3 175B, it achieved a reduction of up to 10,000 times (from 175B to just a few million) while also cutting GPU memory usage during training by a factor of 3.
- Zero Additional Inference Latency: This is a crucial advantage over methods like adapters. After training, the low-rank matrices ($A$ and $B$) can be mathematically merged with the original frozen weights ($W_0$). This means the deployed model has the exact same architecture and size as the original, resulting in no extra computational steps or latency during inference.
- High-Quality Performance: Across a wide range of models (RoBERTa, DeBERTa, GPT-2, GPT-3) and tasks (NLU and NLG), LoRA performs on-par with or even better than full fine-tuning, demonstrating that efficiency is achieved without sacrificing model quality.
- Empirical Justification of the Low-Rank Hypothesis: The paper provides an analysis showing that the weight updates learned during adaptation are indeed rank-deficient. It finds that a very small rank (e.g., 1, 2, or 4) is often sufficient and that higher ranks do not necessarily lead to better performance, suggesting they may just capture noise.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
3.1.1. Transformer Architecture
The Transformer, introduced by Vaswani et al. in "Attention Is All You Need" (2017), is a neural network architecture that has become the foundation for most modern LLMs. Unlike previous recurrent models (RNNs), it processes input sequences in parallel, making it highly efficient. Its key components are:
- Self-Attention Mechanism: This allows the model to weigh the importance of different words in the input sequence when processing a specific word. For each word, it creates three vectors: a Query (Q), a Key (K), and a Value (V), by multiplying its embedding with learned weight matrices ($W_q$, $W_k$, $W_v$). The attention score is calculated based on the compatibility of the Query of the current word with the Keys of all other words. The core formula for scaled dot-product attention is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Here, $d_k$ is the dimension of the key vectors. The softmax function converts scores into probabilities, and the result is a weighted sum of the Value vectors (see the short sketch after this list). LoRA specifically targets adapting the weight matrices $W_q$, $W_k$, $W_v$, and $W_o$ (the output projection matrix) in this module.
- Feed-Forward Networks (FFN): Each Transformer layer also contains a position-wise FFN, which is a two-layer multi-layer perceptron (MLP) applied independently to each position.
- Residual Connections and Layer Normalization: These are used to stabilize training in deep networks.
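To make the attention computation above concrete, here is a minimal, self-contained sketch of single-head scaled dot-product attention in PyTorch. It is purely illustrative: the function name, shapes, and toy dimensions are assumptions for this example, not code from the paper.

```python
# Minimal single-head scaled dot-product attention (illustrative sketch).
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: tensors of shape (seq_len, d_k) for a single head."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / (d_k ** 0.5)  # query-key compatibility
    weights = F.softmax(scores, dim=-1)              # attention probabilities
    return weights @ V                               # weighted sum of values

# Toy usage: project token embeddings with (randomly chosen) W_q, W_k, W_v.
seq_len, d_model, d_k = 4, 16, 8
x = torch.randn(seq_len, d_model)
W_q, W_k, W_v = (torch.randn(d_model, d_k) for _ in range(3))
out = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)  # (seq_len, d_k)
```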
3.1.2. Pre-training and Fine-tuning Paradigm
This is a two-stage process for training powerful NLP models:
- Pre-training: An LLM is trained on a massive, unlabeled text corpus (like the entire internet). The model learns general linguistic patterns, facts, and reasoning abilities through self-supervised objectives, such as predicting the next word in a sentence. This stage is extremely computationally expensive and produces a "base model."
- Fine-tuning: The pre-trained base model is then further trained on a smaller, labeled dataset for a specific downstream task (e.g., sentiment analysis). In full fine-tuning, all of the model's parameters are updated. This adapts the model's general knowledge to the specifics of the task. LoRA is a method for making this second stage much more efficient.
3.1.3. Linear Algebra: Matrix Rank and Low-Rank Decomposition
This concept is central to LoRA.
- Matrix Rank: The rank of a matrix is the maximum number of linearly independent columns (or rows) it has. A matrix with a rank lower than its smallest dimension is called "rank-deficient." Intuitively, a low-rank matrix contains redundant information and can be represented more compactly.
- Low-Rank Decomposition: A key idea in linear algebra is that any matrix can be represented or approximated as the product of two or more matrices. If $W \in \mathbb{R}^{d \times k}$ is low-rank (with rank $r$), it can be exactly decomposed into two "thin" matrices: $W = BA$, where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$. The number of parameters in $W$ is $d \times k$. The number of parameters in $B$ and $A$ combined is $r \times (d + k)$. When $r$ is very small, $r(d + k) \ll d \times k$. LoRA leverages this property to represent the weight update matrix efficiently (see the worked example below).
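The parameter savings are easy to verify with a quick back-of-the-envelope calculation. The dimensions below are arbitrary illustrative values, not taken from the paper:

```python
# Illustrative parameter-count comparison for a low-rank factorization W ≈ BA.
d, k, r = 1024, 1024, 8

full_params = d * k          # parameters in a dense d x k matrix
lora_params = r * (d + k)    # parameters in B (d x r) plus A (r x k)

print(full_params)               # 1048576
print(lora_params)               # 16384
print(full_params / lora_params) # 64.0  -> roughly 64x fewer trainable parameters
```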
3.2. Previous Works
3.2.1. Adapter Tuning
Proposed by Houlsby et al. (2019), adapter tuning freezes the entire pre-trained model and injects small, new modules called "adapters" inside each Transformer layer. An adapter typically consists of a down-projection linear layer, a non-linearity (like ReLU), and an up-projection linear layer. This bottleneck structure ensures the adapter has very few parameters. Only the adapter layers are trained.
- Drawback: Adapters are added sequentially within the network's data flow. This means that during inference, the computation must pass through these additional layers, which inevitably adds latency.
3.2.2. Prefix-Tuning and Prompt-Tuning
This family of methods, including Prefix-Tuning (Li & Liang, 2021) and Prompt Tuning (Lester et al., 2021), also freezes the entire LLM. Instead of adding modules, they learn a small set of continuous vectors (a "soft prompt" or "prefix") that are prepended to the input sequence's embeddings. These learned vectors steer the frozen model's behavior to perform the desired task.
- Drawbacks:
- Reduced Context Length: The learned prefix vectors occupy positions in the input sequence, reducing the available space for the actual user prompt and context.
- Optimization Instability: The authors of LoRA and the original prefix-tuning paper note that these methods can be difficult to optimize, with performance sometimes degrading as more trainable parameters (longer prefixes) are added.
3.3. Technological Evolution
The evolution of model adaptation techniques can be seen as a progression towards greater parameter efficiency without sacrificing performance or introducing new bottlenecks:
- Full Fine-Tuning: The original and most powerful method, but computationally infeasible for modern LLMs.
- Partial Fine-Tuning: Simpler approaches like fine-tuning only the last few layers or only the bias terms (BitFit). These are efficient but often lead to a significant drop in model quality.
- Adapter-based Methods: A major step forward in parameter efficiency, proving that good performance could be achieved by training only ~1% of the parameters. However, they introduced the problem of inference latency.
- Prompt-based Methods: An orthogonal approach that manipulates the input rather than the model weights. This avoids latency but comes with its own challenges regarding context length and optimization stability.
- LoRA: Represents a conceptual breakthrough by proposing to modify weights indirectly and in parallel. It combines the parameter efficiency of adapters with the zero-latency benefit of prompt-tuning (and full fine-tuning), offering a more practical and robust solution.
3.4. Differentiation Analysis
| Method | Trainable Parameters | Modifies | Inference Latency | Key Advantage | Key Disadvantage |
|---|---|---|---|---|---|
| Full Fine-Tuning | All model weights | All weights directly | None (Baseline) | Highest potential performance | Prohibitively expensive (memory, storage) |
| Adapter Tuning | Small adapter modules | Adds new sequential layers | Yes (Adds latency) | High parameter efficiency | Slower inference |
| Prefix-Tuning | Continuous prefix vectors | Input activations | None | High parameter efficiency | Reduces context length, can be unstable |
| LoRA (This Paper) | Low-rank matrices A, B | Weights indirectly (parallel update) | None (Mergeable) | High parameter efficiency, no latency | Requires merging step for deployment |
4. Methodology
4.1. Principles
The core principle of LoRA is based on the hypothesis that the update to the weights of a pre-trained model during adaptation has a low "intrinsic rank".
Let's consider a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$. When we perform full fine-tuning, we are learning an update matrix $\Delta W$, such that the new weight matrix is $W_0 + \Delta W$. The number of parameters in $\Delta W$ is the same as in $W_0$, which is very large.
LoRA hypothesizes that this large update matrix $\Delta W$ is rank-deficient. This means it can be effectively approximated by the product of two much smaller, "thin" matrices. Instead of learning all the parameters in $\Delta W$ directly, LoRA proposes to learn these two smaller matrices, which is far more efficient. This approach is a form of low-rank decomposition applied to the weight update matrix, not the original weight matrix itself.
4.2. Core Methodology In-depth (Layer by Layer)
The LoRA method modifies the forward pass of any dense layer (like the linear projections in a Transformer's self-attention module) that has a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$.
- Standard Forward Pass: The original forward pass for a given input $x$ is: $ h = W_0 x $ During LoRA's training, the pre-trained weights $W_0$ are frozen and do not receive any gradient updates.
- Low-Rank Parametrization: The update to the weights, $\Delta W$, is represented by a low-rank decomposition using two matrices, $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$. The rank $r$ is a hyperparameter and is chosen to be very small (e.g., 1, 2, 4, 8), where $r \ll \min(d, k)$. The weight update is thus constrained: $\Delta W = BA$.
- Modified Forward Pass: The LoRA module is added in parallel to the frozen layer. The modified forward pass combines the output of the original layer and the new LoRA path: $ h = W_0 x + \Delta W x = W_0 x + BAx $ In this equation (a minimal code sketch appears after the figure below):
  - $W_0 x$ is the output from the original, frozen pre-trained weights.
  - $BAx$ is the output from the new, trainable LoRA matrices. The input $x$ is first multiplied by $A$, then by $B$.
  - Only the matrices $A$ and $B$ contain trainable parameters. The total number of trainable parameters is $r \times (d + k)$, which is significantly smaller than $d \times k$.
The following diagram from the paper illustrates this parallel structure.
Figure 1 (from the paper): LoRA's reparametrization. The pre-trained weights $W$ are frozen; only the low-rank matrices $A$ (initialized from a Gaussian $\mathcal{N}(0, \sigma^2)$) and $B$ (initialized to zero) are trained.
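The parallel structure in the figure can be sketched in a few lines of PyTorch. This is a minimal illustration, not the authors' released implementation; the class and attribute names are assumptions, and the $\frac{\alpha}{r}$ scaling discussed next is omitted for clarity.

```python
# Minimal sketch of a LoRA-augmented linear layer (illustrative only).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in, d_out, r):
        super().__init__()
        self.W0 = nn.Linear(d_in, d_out, bias=False)
        self.W0.weight.requires_grad = False                 # frozen pre-trained weight
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # trainable, Gaussian init
        self.B = nn.Parameter(torch.zeros(d_out, r))         # trainable, zero init

    def forward(self, x):
        # h = W0 x + B A x : the original path plus the low-rank path in parallel
        return self.W0(x) + (x @ self.A.T) @ self.B.T
```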
- Initialization and Scaling: The initialization is crucial for stable training.
  - Matrix $A$ is initialized with random values from a Gaussian distribution.
  - Matrix $B$ is initialized to zero.
  - This ensures that at the very beginning of training, the product $BA$ is zero. Therefore, $\Delta W = 0$, and the model's output is identical to that of the original pre-trained model. Training then gradually introduces non-zero updates.

  The paper also introduces a scaling factor $\frac{\alpha}{r}$ for the LoRA path's output. The final forward pass is: $ h = W_0 x + \frac{\alpha}{r} BAx $
  - $\alpha$ is a constant hyperparameter.
  - $r$ is the rank.
  - Purpose of Scaling: This scaling normalizes the update's magnitude with respect to the rank $r$. It helps reduce the need to retune other hyperparameters (like the learning rate) when changing $r$. The paper mentions they simply set $\alpha$ to the first $r$ they try and do not tune it further.
- Application to Transformer Models: The authors choose to apply LoRA only to the weight matrices in the self-attention module ($W_q$, $W_k$, $W_v$, $W_o$) for simplicity and parameter efficiency. The much larger MLP modules are kept frozen. This targeted application proves to be highly effective.
- No Inference Latency via Merging: This is a key practical advantage of LoRA. After training is complete, the matrices $A$ and $B$ are fixed. We can compute the final, adapted weight matrix as: $ W = W_0 + BA $ Since $W_0$, $B$, and $A$ are all constant matrices, this is a one-time calculation. The resulting matrix $W$ has the same dimensions as the original $W_0$. For deployment, we can simply discard $A$ and $B$ and use this new merged weight matrix $W$. The model's architecture remains identical to the original, so inference proceeds with zero additional latency. When switching tasks, one can subtract the old $BA$ and add the new $B'A'$, which is a very fast operation. A sketch of the scaling and merging steps follows below.
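The following is a hedged sketch of the scaling and merging steps, reusing the illustrative LoRALinear class from the earlier sketch (again, this is not the authors' released package):

```python
# Scaling and weight-merging for the illustrative LoRALinear layer above.
import torch

def lora_forward(layer, x, alpha, r):
    # h = W0 x + (alpha / r) * B A x
    return layer.W0(x) + (alpha / r) * ((x @ layer.A.T) @ layer.B.T)

@torch.no_grad()
def merge(layer, alpha, r):
    # One-time merge for deployment: W = W0 + (alpha / r) * B A.
    # The merged weight has the same shape as W0, so inference adds no latency.
    layer.W0.weight += (alpha / r) * (layer.B @ layer.A)

@torch.no_grad()
def unmerge(layer, alpha, r):
    # Task switching: subtract the old BA before adding a new task's B'A'.
    layer.W0.weight -= (alpha / r) * (layer.B @ layer.A)
```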
5. Experimental Setup
5.1. Datasets
The authors used a diverse set of datasets to evaluate LoRA on both Natural Language Understanding (NLU) and Natural Language Generation (NLG) tasks.
- GLUE (General Language Understanding Evaluation) Benchmark: A collection of nine NLU tasks.
- MNLI (Multi-Genre Natural Language Inference): Given a premise, determine if a hypothesis is an entailment, contradiction, or neutral.
- SST-2 (Stanford Sentiment Treebank): Classify the sentiment of movie reviews.
- MRPC (Microsoft Research Paraphrase Corpus): Determine if two sentences are paraphrases of each other.
- CoLA (Corpus of Linguistic Acceptability): Determine if a sentence is grammatically acceptable.
- QNLI (Question Natural Language Inference): Determine if a sentence contains the answer to a question.
- QQP (Quora Question Pairs): Determine if two questions are semantically equivalent.
- RTE (Recognizing Textual Entailment): A smaller textual entailment task.
- STS-B (Semantic Textual Similarity Benchmark): Score the similarity of two sentences on a 1-5 scale.
- WikiSQL: A dataset for generating SQL queries from natural language questions about tables in a database.
  - Example Data:
    - Context (x): Table schema [Col1: Player, Col2: Team], Question: "What team does player John Doe belong to?"
    - Target (y): the corresponding SQL query (e.g., SELECT Team FROM table WHERE Player = 'John Doe')
- SAMSum: A dataset of chat conversations and their corresponding human-written abstractive summaries.
- E2E NLG Challenge: A data-to-text generation task in the restaurant domain, where the model must generate a natural language description from a set of slot-value pairs.
- DART & WebNLG: Additional data-to-text generation datasets with structured data (triples) as input.
These datasets were chosen because they represent a wide variety of common NLP tasks and are standard benchmarks for evaluating model performance.
5.2. Evaluation Metrics
The paper uses standard metrics for each task.
5.2.1. NLU Metrics (GLUE)
- Accuracy:
- Conceptual Definition: The proportion of predictions that are correct. It is the most straightforward metric for classification tasks.
- Mathematical Formula: $ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
- Symbol Explanation: N/A.
- Matthew's Correlation Coefficient (MCC) (for CoLA):
- Conceptual Definition: A metric for binary classification that is more robust than accuracy, especially for imbalanced datasets. It measures the correlation between the true and predicted classes, with a value from -1 (total disagreement) to +1 (perfect agreement). 0 indicates random performance.
- Mathematical Formula: $ \text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} $
- Symbol Explanation:
- TP: True Positives
- TN: True Negatives
- FP: False Positives
- FN: False Negatives
- Pearson Correlation Coefficient (for STS-B):
- Conceptual Definition: Measures the linear relationship between two sets of data. For STS-B, it measures the correlation between the model's predicted similarity scores and the human-annotated gold scores.
- Mathematical Formula: $ \rho_{X, Y} = \frac{\text{cov}(X, Y)}{\sigma_X \sigma_Y} $
- Symbol Explanation:
- cov(X, Y): Covariance of variables X (predicted scores) and Y (true scores).
- $\sigma_X$, $\sigma_Y$: Standard deviations of X and Y.
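For reference, these NLU metrics are available in standard Python libraries. The snippet below is a toy example assuming scikit-learn and SciPy are installed; the arrays are made-up values, not results from the paper.

```python
# Toy computation of GLUE-style metrics with scikit-learn and SciPy.
from sklearn.metrics import accuracy_score, matthews_corrcoef
from scipy.stats import pearsonr

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
print(accuracy_score(y_true, y_pred))     # fraction of correct predictions
print(matthews_corrcoef(y_true, y_pred))  # MCC, robust to class imbalance (CoLA)

gold_scores = [4.5, 2.0, 3.2, 5.0]        # STS-B-style similarity scores
pred_scores = [4.3, 2.4, 3.0, 4.8]
print(pearsonr(gold_scores, pred_scores)[0])  # Pearson correlation coefficient
```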
5.2.2. NLG Metrics
- BLEU (Bilingual Evaluation Understudy):
- Conceptual Definition: Measures how many n-grams (contiguous sequences of n words) in the machine-generated text overlap with n-grams in a set of human-written reference texts. It focuses on precision.
- Mathematical Formula: $ \text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right) $
- Symbol Explanation:
- BP: Brevity Penalty, penalizes generated text that is too short.
- $p_n$: Modified n-gram precision.
- $w_n$: Weights for each n-gram (typically uniform, e.g., 0.25 for N=4).
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation):
- Conceptual Definition: Similar to BLEU, but focuses on recall—how many n-grams from the reference texts appear in the generated text. ROUGE-L specifically measures the longest common subsequence.
- Mathematical Formula (ROUGE-L): $ R_{lcs} = \frac{LCS(X, Y)}{m}, \quad P_{lcs} = \frac{LCS(X, Y)}{n}, \quad F_{lcs} = \frac{(1+\beta^2) R_{lcs} P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}} $
- Symbol Explanation:
- LCS(X, Y): Length of the Longest Common Subsequence between reference X and generated text Y.
- m, n: Lengths of X and Y.
- $\beta$: A weight balancing recall and precision (a small illustrative implementation appears after this list).
- Other NLG metrics like METEOR, NIST, and CIDEr are also used, which are more advanced variants that consider synonyms, stemming, and term frequency, respectively.
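To make the ROUGE-L definition concrete, here is a small self-contained sketch of an LCS-based ROUGE-L F-score. It is illustrative only; published results rely on the official scoring scripts rather than this simplified version.

```python
# Simplified ROUGE-L via longest common subsequence (illustrative sketch).
def lcs_length(x, y):
    # Classic dynamic-programming LCS over token sequences.
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if xi == yj else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def rouge_l_f1(reference, hypothesis, beta=1.2):
    ref, hyp = reference.split(), hypothesis.split()
    lcs = lcs_length(ref, hyp)
    if lcs == 0:
        return 0.0
    r, p = lcs / len(ref), lcs / len(hyp)          # LCS-based recall and precision
    return (1 + beta**2) * r * p / (r + beta**2 * p)

print(rouge_l_f1("the cat sat on the mat", "the cat is on the mat"))
```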
5.3. Baselines
LoRA was compared against a comprehensive set of baselines representing the state-of-the-art in model adaptation at the time:
- Fine-Tuning (FT): The standard full fine-tuning approach where all model parameters are updated. This serves as the primary performance benchmark.
- FT_top2: A variant of FT where only the top two layers of the model are trained.
- BitFit (Bias-only): A very parameter-efficient baseline where only the bias terms throughout the network are trained.
- Adapter Tuning (AdapterH, AdapterL, AdapterP, AdapterD): Several variants of the adapter method, differing in where the adapter layers are placed and their internal design. AdapterH is the original design from Houlsby et al.
- Prefix-Embedding Tuning (PreEmbed): Learns continuous embeddings for special tokens prepended to the input.
- Prefix-Layer Tuning (PreLayer): A more powerful version of prefix-tuning where trainable activations are learned for the prefix tokens at every layer of the Transformer.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results strongly validate LoRA's effectiveness across all tested models and tasks, establishing it as a superior parameter-efficient adaptation method.
6.1.1. Inference Latency
The paper first addresses a key weakness of adapter-based methods. The following are the results from Table 1 of the original paper:
| Method | Batch 32, Seq Len 512, \|Θ\| = 0.5M | Batch 16, Seq Len 256, \|Θ\| = 11M | Batch 1, Seq Len 128, \|Θ\| = 11M |
|---|---|---|---|
| Fine-Tune/LoRA | 1449.4±0.8 | 338.0±0.6 | 19.8±2.7 |
| AdapterL | 1482.0±1.0 (+2.2%) | 354.8±0.5 (+5.0%) | 23.9±2.1 (+20.7%) |
| AdapterH | 1492.2±1.0 (+3.0%) | 366.3±0.5 (+8.4%) | 25.8±2.2 (+30.3%) |
Analysis: This table shows the inference latency in milliseconds for GPT-2 medium.
- With large batch sizes and long sequences (left column), the overhead from adapters is minimal (2-3%) because the GPU is already saturated with computation.
- However, in a realistic online inference scenario with a batch size of 1 and short sequences (right column), the latency penalty from adapters is substantial, reaching +20.7% and +30.3%.
- LoRA has the same latency as Fine-Tune by design, as the weights are merged. This result directly proves LoRA's advantage in latency-critical applications.
6.1.2. Performance on RoBERTa and DeBERTa (GLUE Benchmark)
The following are the results from Table 2 of the original paper:
| Model & Method | # Trainable Parameters | MNLI | SST-2 | MRPC | CoLA | QNLI | QQP | RTE | STS-B | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| RoBbase (FT)* | 125.0M | 87.6 | 94.8 | 90.2 | 63.6 | 92.8 | 91.8 | 78.7 | 91.2 | 85.2 |
| RoBbase (BitFit)* | 0.1M | 84.7 | 93.7 | 81.5 | 62.0 | 91.9 | 90.8 | 84.0 | 90.8 | 86.4 |
| RoBbase (AdptD)* | 0.3M | 87.1±.0 | 94.2±.1 | 88.5±1.1 | 60.8±.4 | 93.1±.1 | 90.2±.0 | 71.5±2.7 | 89.7±.3 | 84.4 |
| RoBbase (AdptP)* | 0.9M | 87.3±.1 | 94.7±.3 | 88.4±.1 | 62.6±.9 | 93.0±.2 | 90.6±.0 | 75.9±2.2 | 90.3±.1 | 85.4 |
| RoBbase (LoRA) | 0.3M | 87.5±.3 | 95.1±.2 | 89.7±.7 | 63.4±1.2 | 93.3±.3 | 90.8±.1 | 86.6±.7 | 91.5±.2 | 87.2 |
| RoBlarge (FT)* | 355.0M | 90.2 | 96.4 | 90.9 | 68.0 | 94.7 | 92.2 | 86.6 | 92.4 | 88.9 |
| RoBlarge (LoRA) | 0.8M | 90.6±.2 | 96.2±.5 | 90.9±1.2 | 68.2±1.9 | 94.9±.3 | 91.6±.1 | 87.4±2.5 | 92.6±.2 | 89.0 |
| DeBxxL (FT)* | 1500.0M | 91.8 | 97.2 | 92.0 | 72.0 | 96.0 | 92.7 | 93.9 | 92.9 | 91.1 |
| DeBxxL (LoRA) | 4.7M | 91.9±.2 | 96.9±.2 | 92.6±.6 | 72.4±1.1 | 96.0±.1 | 92.9±.1 | 94.9±.4 | 93.0±.2 | 91.3 |
Analysis:
- On RoBERTa-base, LoRA (0.3M params) achieves an average score of 87.2, significantly outperforming adapter methods with comparable or even larger parameter budgets (84.4, 85.4) and even beating full fine-tuning (85.2, although the paper notes different setups affect the FT avg).
- On RoBERTa-large, LoRA (0.8M params) achieves an average score of 89.0, slightly better than full fine-tuning (355M params) at 88.9.
- On the massive DeBERTa-XXL (1.5B params), LoRA with only 4.7M parameters (~0.3% of the total) achieves an average score of 91.3, again slightly surpassing the full fine-tuning baseline of 91.1. This demonstrates LoRA's scalability and effectiveness on very large models.
6.1.3. Performance on GPT-3 175B
This is the paper's flagship experiment, demonstrating LoRA's utility at an unprecedented scale. The following are the results from Table 4 of the original paper:
| Model & Method | # Trainable Parameters | WikiSQL Acc. (%) | MNLI-m Acc. (%) | SAMSum R1/R2/RL |
|---|---|---|---|---|
| GPT-3 (FT) | 175,255.8M | 73.8 | 89.5 | 52.0/28.0/44.5 |
| GPT-3 (BitFit) | 14.2M | 71.3 | 91.0 | 51.3/27.4/43.5 |
| GPT-3 (PreEmbed) | 3.2M | 63.1 | 88.6 | 48.3/24.2/40.5 |
| GPT-3 (PreLayer) | 20.2M | 70.1 | 89.5 | 50.8/27.3/43.5 |
| GPT-3 (AdapterH) | 7.1M | 71.9 | 89.8 | 53.0/28.9/44.8 |
| GPT-3 (AdapterH) | 40.1M | 73.2 | 91.5 | 53.2/29.0/45.1 |
| GPT-3 (LoRA) | 4.7M | 73.4 | 91.7 | 53.8/29.8/45.9 |
| GPT-3 (LoRA) | 37.7M | 74.0 | 91.6 | 53.4/29.2/45.1 |
Analysis:
- On all three tasks, LoRA consistently matches or outperforms full fine-tuning while using a minuscule fraction of the parameters. For instance, on MNLI-m, LoRA with 4.7M parameters reaches 91.7% accuracy, while full fine-tuning with 175B parameters reaches 89.5%.
- LoRA also outperforms all other PEFT baselines, including adapters and prefix-tuning, often with fewer parameters.
- The chart below (Figure 2 from the paper) visualizes this trend, showing that LoRA's performance scales gracefully with more parameters, while prefix-based methods can degrade, highlighting LoRA's superior stability and effectiveness.
Figure 2 (from the paper): Validation accuracy on WikiSQL and MultiNLI-matched versus the number of trainable parameters for several adaptation methods. LoRA outperforms the other methods in both parameter efficiency and task performance.
6.2. Ablation Studies / Parameter Analysis
Section 7 of the paper provides a deep dive into why LoRA works so well, effectively serving as a series of ablation studies.
6.2.1. Which Weight Matrices Should We Adapt?
The authors tested applying LoRA to different combinations of attention weights ($W_q$, $W_k$, $W_v$, $W_o$) with a fixed parameter budget. The following are the results from Table 5 of the original paper:
| Weight Type / Rank r (# of Trainable Parameters = 18M) | Wq (r=8) | Wk (r=8) | Wv (r=8) | Wo (r=8) | Wq, Wk (r=4) | Wq, Wv (r=4) | Wq, Wk, Wv, Wo (r=2) |
|---|---|---|---|---|---|---|---|
| WikiSQL (±0.5%) | 70.4 | 70.0 | 73.0 | 73.2 | 71.4 | 73.7 | 73.7 |
| MultiNLI (±0.1%) | 91.0 | 90.8 | 91.0 | 91.3 | 91.3 | 91.3 | 91.7 |
Analysis: The results show that adapting more sets of matrices with a smaller rank is generally better than adapting fewer matrices with a higher rank. For example, adapting both $W_q$ and $W_v$ with rank $r = 4$ outperforms adapting only $W_q$ with rank $r = 8$. This suggests that the adaptation needs to be applied broadly, even if shallowly (low rank).
6.2.2. What is the Optimal Rank $r$?
The paper investigates how performance changes with the rank $r$. The following are the results from Table 6 of the original paper:
| Dataset | Weight Type | r = 1 | r = 2 | r = 4 | r = 8 | r = 64 |
|---|---|---|---|---|---|---|
| WikiSQL (±0.5%) | Wq | 68.8 | 69.6 | 70.5 | 70.4 | 70.0 |
| | Wq, Wv | 73.4 | 73.3 | 73.7 | 73.8 | 73.5 |
| | Wq, Wk, Wv, Wo | 74.1 | 73.7 | 74.0 | 74.0 | 73.9 |
| MultiNLI (±0.1%) | Wq | 90.7 | 90.9 | 91.1 | 90.7 | 90.7 |
| | Wq, Wv | 91.3 | 91.4 | 91.3 | 91.6 | 91.4 |
| | Wq, Wk, Wv, Wo | 91.2 | 91.7 | 91.7 | 91.5 | 91.4 |
Analysis: This is a striking result. A very low rank, even $r = 1$ or $r = 2$, achieves highly competitive performance. Performance generally peaks around $r = 4$ to $r = 8$ and then plateaus or even slightly degrades at $r = 64$. This provides strong evidence for the core hypothesis: the weight update matrix $\Delta W$ is indeed extremely rank-deficient, and a small rank is sufficient to capture the essential information for adaptation.
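These ablation findings (a small rank applied to the attention projections) correspond directly to the knobs exposed by today's LoRA tooling. The following configuration sketch uses the Hugging Face peft library, which is not part of the paper and is assumed to be installed; the module names "query" and "value" are specific to RoBERTa's implementation in transformers, and the hyperparameter values are illustrative.

```python
# Hedged example: small rank, attention-only LoRA with the `peft` library.
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=3)
config = LoraConfig(
    r=8,                                # low rank; the ablations suggest even r=1-4 can suffice
    lora_alpha=8,                       # scaling constant alpha
    target_modules=["query", "value"],  # adapt W_q and W_v only, as in the paper's GPT-3 setup
    lora_dropout=0.0,
)
model = get_peft_model(model, config)
model.print_trainable_parameters()      # only the A/B matrices (plus the task head) are trainable
```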
6.2.3. Analysis of Learned Subspaces
To further support the low-rank hypothesis, the authors analyzed the subspaces spanned by the learned LoRA matrices.
- Comparison of Ranks (r=8 vs. r=64): They found that the top singular direction learned with $r = 8$ had a very high overlap with the top singular direction learned with $r = 64$. However, the remaining directions showed little correlation. This suggests that the most important information is captured in the very first few dimensions, and increasing the rank primarily adds noise.
- Comparison of Random Seeds: When comparing two runs with different random seeds (both with $r = 64$), they again found that only a few top singular directions were consistently learned by both, while the rest were different. This reinforces the idea that the "true" intrinsic rank of the task adaptation is small.
6.2.4. Relationship between $\Delta W$ and $W$
The authors investigated whether the learned update $\Delta W$ simply amplifies the most prominent features already present in the pre-trained weights $W$. Their analysis in Table 7 reveals that this is not the case. Instead of correlating with the top singular directions of $W$, $\Delta W$ tends to amplify directions that were not emphasized in $W$. The amplification factor is very large (e.g., ~21.5 for $r = 4$). This suggests that LoRA identifies and significantly boosts task-specific features that were learned during pre-training but were not dominant in the general-purpose model.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduces LoRA, a simple, elegant, and highly effective method for adapting large language models. LoRA's core innovation is to constrain the weight updates during fine-tuning to be low-rank, which drastically reduces the number of trainable parameters and the associated memory and storage costs. The authors demonstrate empirically that LoRA matches or exceeds the performance of full fine-tuning on a wide array of models and tasks, including the 175B parameter GPT-3. A key practical contribution is that LoRA introduces no inference latency, as the learned low-rank matrices can be merged back into the original weights for deployment. This makes LoRA an ideal solution for creating and serving many customized LLMs efficiently.
7.2. Limitations & Future Work
The authors acknowledge several areas for future research:
- Combining with Other Methods: LoRA is orthogonal to many other PEFT methods (like prefix-tuning) and could potentially be combined with them for further gains.
- Deeper Understanding of Mechanism: While empirically successful, the fundamental mechanisms of how pre-trained features are transformed for downstream tasks remain an open question. LoRA, with its constrained updates, provides a more tractable way to study this phenomenon compared to full fine-tuning.
- Principled Matrix Selection: The choice of which weight matrices to apply LoRA to (e.g., only attention weights) was based on heuristics. More principled or automated methods for selecting the optimal subset of weights could lead to better performance.
- Rank-Deficiency of Pre-trained Weights: The finding that the update matrix $\Delta W$ is rank-deficient raises the question of whether the original pre-trained weight matrices are also rank-deficient, which could inspire new model compression or training techniques.
The authors also note one practical limitation: if the LoRA weights are not merged, batching inputs from different tasks (with different LoRA modules) in a single forward pass is not straightforward.
7.3. Personal Insights & Critique
- Impact and Inspiration: LoRA is a landmark paper in the field of efficient LLM training. Its simplicity and effectiveness have made it the de facto standard for PEFT in the open-source community. The core insight—that the change in weights is low-rank—is brilliant and has inspired a wealth of follow-up research. It fundamentally shifted the community's approach to fine-tuning from "how do we add things?" (adapters) to "how do we modify things efficiently?".
- Practical Significance: The zero-latency guarantee is arguably LoRA's most significant practical advantage. For real-world production systems where inference speed is critical, this feature makes LoRA vastly superior to methods like adapters. The ability to store a single base model and many tiny (megabyte-sized) LoRA checkpoints makes custom model deployment feasible for a wide range of users and organizations.
- Potential Areas for Improvement:
- The optimal rank $r$ remains a hyperparameter that requires tuning for each task, though the paper shows that performance is robust to a range of small values.
- The analysis focuses on applying a uniform rank across all layers. It's possible that different layers would benefit from different ranks, and an adaptive rank allocation strategy could be more efficient.
- The paper's theoretical justification rests on an empirical hypothesis. While the results are compelling, a more formal theoretical analysis of why adaptation is a low-rank process would further strengthen the work. However, the empirical evidence provided is so strong that it has been widely accepted by the community.