LoRAFusion: Efficient LoRA Fine-Tuning for LLMs
TL;DR Summary
LoRAFusion is an efficient LoRA fine-tuning system for LLMs that addresses inefficiencies in existing methods through graph-splitting kernel fusion and adaptive multi-job batching, achieving up to 1.96x end-to-end acceleration over Megatron-LM.
Abstract
Low-Rank Adaptation (LoRA) has become the leading Parameter-Efficient Fine-Tuning (PEFT) method for Large Language Models (LLMs), as it significantly reduces GPU memory usage while maintaining competitive fine-tuned model quality on downstream tasks. Despite these benefits, we identify two key inefficiencies in existing LoRA fine-tuning systems. First, they incur substantial runtime overhead due to redundant memory accesses on large activation tensors. Second, they miss the opportunity to concurrently fine-tune multiple independent LoRA adapters that share the same base model on the same set of GPUs. This leads to missed performance gains such as reduced pipeline bubbles, better communication overlap, and improved GPU load balance. To address these issues, we introduce LoRAFusion, an efficient LoRA fine-tuning system for LLMs. At the kernel level, we propose a graph-splitting method that fuses memory-bound operations. This design eliminates unnecessary memory accesses and preserves the performance of compute-bound GEMMs without incurring the cost of recomputation or synchronization. At the scheduling level, LoRAFusion introduces an adaptive batching algorithm for multi-job fine-tuning. It first splits LoRA adapters into groups to intentionally stagger batch execution across jobs, and then solves a bin-packing problem within each group to generate balanced, dependency-aware microbatches. LoRAFusion achieves up to 1.96x end-to-end speedup compared to Megatron-LM, and a further improvement over mLoRA, the state-of-the-art multi-LoRA fine-tuning system. Our fused kernel delivers an additional kernel-level performance improvement and can directly serve as a plug-and-play replacement in existing LoRA systems. We open-source LoRAFusion at https://github.com/CentML/lorafusion.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is "LoRAFusion: Efficient LoRA Fine-Tuning for LLMs," which focuses on improving the efficiency of Low-Rank Adaptation (LoRA) for fine-tuning Large Language Models (LLMs).
1.2. Authors
The authors are Zhanda Zhu, Qidong Su, Yaoyao Ding, Kevin Song, Shang Wang, and Gennady Pekhimenko. Their affiliations include the University of Toronto, Vector Institute, and NVIDIA, suggesting a strong background in both academic research and industry application in the fields of machine learning systems, distributed computing, and LLMs.
1.3. Journal/Conference
The paper is published at the "21st European Conference on Computer Systems (EUROSYS '26), April 27-30, 2026, Edinburgh, Scotland UK." EuroSys is a highly reputable and influential conference in the field of computer systems, known for publishing cutting-edge research in operating systems, distributed systems, and networked systems. Publication at such a venue indicates significant contributions and rigorous peer review.
1.4. Publication Year
The arXiv preprint was posted on 2025-09-30 (UTC), while the ACM reference format indicates the conference year as 2026.
1.5. Abstract
This paper introduces LoRAFusion, an efficient system designed to accelerate LoRA fine-tuning for Large Language Models (LLMs). The authors identify two key inefficiencies in existing LoRA fine-tuning systems: substantial runtime overhead from redundant memory accesses on large activation tensors, and the missed opportunity to concurrently fine-tune multiple independent LoRA adapters that share the same base model on the same set of GPUs.
LoRAFusion addresses these issues through a two-pronged approach. At the kernel level, it proposes a graph-splitting method to fuse memory-bound operations. This design eliminates unnecessary memory accesses and maintains the performance of compute-bound General Matrix Multiplications (GEMMs) without incurring recomputation or synchronization costs. At the scheduling level, LoRAFusion introduces an adaptive batching algorithm for multi-job fine-tuning. This algorithm groups LoRA adapters to stagger batch execution across jobs and then uses a bin-packing problem formulation to generate balanced, dependency-aware microbatches.
The system achieves significant performance improvements, with up to 1.96x end-to-end speedup compared to Megatron-LM and a further improvement over mLoRA, which is the state-of-the-art multi-LoRA fine-tuning system. The proposed fused kernel alone provides an additional kernel-level speedup and can be used as a drop-in replacement in existing LoRA systems. The authors open-source LoRAFusion at https://github.com/CentML/lorafusion.
1.6. Original Source Link
The official source link is https://arxiv.org/abs/2510.00206, and the PDF link is https://arxiv.org/pdf/2510.00206v1.pdf. This indicates it is a preprint, likely submitted to arXiv before its formal publication at EuroSys '26.
2. Executive Summary
2.1. Background & Motivation
The rapid advancement of pre-trained Large Language Models (LLMs) like GPT and LLaMa has made fine-tuning these models essential for adapting them to specific, personalized, or domain-specific tasks (e.g., biomedical analysis, specialized chatbots). However, traditional full-model fine-tuning, which updates all model parameters, demands exorbitant hardware resources. For instance, fine-tuning LLaMa-3.1-70B requires roughly 1120GB of GPU memory just for model states (parameters, gradients, optimizer states), making it prohibitively expensive.
To counter this, Parameter-Efficient Fine-Tuning (PEFT) methods have emerged. These methods freeze the majority of pre-trained LLM parameters and only train a small subset of injected trainable parameters, called adapters. Among PEFT methods, Low-Rank Adaptation (LoRA) has become a leading technique due to its simplicity and effectiveness. LoRA introduces a low-rank decomposition of weight updates, drastically reducing the number of trainable parameters. For example, LLaMa-3.1-70B with LoRA rank 16 requires only 142GB of GPU memory, a substantial reduction while maintaining model quality.
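A rough back-of-envelope check of these memory figures, assuming a standard mixed-precision Adam accounting (2-byte weights and gradients plus roughly 12 bytes of FP32 optimizer state and master weights per trainable parameter; the paper may break the per-parameter bytes down differently):

```latex
\text{Full fine-tuning: } 70\times10^{9}\ \text{params}\times(2+2+12)\ \text{bytes/param}\approx 1120\ \text{GB}\\
\text{LoRA (rank 16): } 70\times10^{9}\times 2\ \text{bytes}\approx 140\ \text{GB of frozen weights}\ +\ \text{small adapter states}\approx 142\ \text{GB}
```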
Despite LoRA's algorithmic benefits in reducing memory footprint, current LoRA fine-tuning systems mainly reuse optimizations from traditional full-model fine-tuning. The authors identify two critical inefficiencies that these systems fail to address:
- Substantial Runtime Overhead from Redundant Memory Access: Although LoRA adapters add less than 1% of parameters, they introduce significant runtime overhead. Profiling shows a ~40% reduction in training throughput compared to a frozen linear layer. This is largely due to memory-bandwidth-bound operations in the small LoRA projection layers and repeated loading/storing of large activation tensors, leading to a ~2.64x increase in GPU global memory traffic.
- Missed Opportunities for Multi-Job Fine-Tuning: Existing systems typically fine-tune each LoRA adapter independently, even when the adapters share the same base model. This overlooks the potential to concurrently fine-tune multiple adapters on the same GPUs, which could yield performance gains by reducing distributed training overhead (e.g., pipeline bubbles, communication, load imbalance). While mLoRA attempts multi-LoRA fine-tuning, it has limitations, such as not addressing kernel-level memory bottlenecks, failing to balance variable sequence lengths across GPUs, and relying on inefficient communication for pipeline parallelism.
The paper's innovative idea is to address these inefficiencies at both the low-level kernel execution and the high-level job scheduling. By doing so, LoRAFusion aims to unlock the full potential of LoRA for efficient, multi-job fine-tuning on modern GPU clusters.
2.2. Main Contributions / Findings
LoRAFusion makes the following primary contributions:
- Identification of Key Bottlenecks: The paper rigorously identifies and quantifies the two core limitations in existing LoRA fine-tuning systems: high runtime overhead due to redundant memory accesses (bottlenecked by memory bandwidth) and missed opportunities for optimizing multi-job training scenarios, which lead to distributed parallelism overhead and GPU load imbalance.
- Novel Kernel-Level Optimization (FusedLoRA & FusedMultiLoRA): It proposes a graph-splitting method that fuses memory-bound operations within the LoRA computation graph. This design strategically splits the graph at small, intermediate tensors to eliminate unnecessary memory accesses without recomputing or synchronizing large tensors, while preserving the optimal performance of compute-bound General Matrix Multiplications (GEMMs).
- Adaptive Job-Level Scheduling (Multi-LoRA Scheduler): It introduces a hierarchical adaptive batching algorithm for multi-job fine-tuning. The scheduler first groups LoRA adapters to intentionally stagger batch execution across jobs, and then employs a two-stage Mixed Integer Linear Programming (MILP)-based bin-packing algorithm to generate balanced, dependency-aware microbatches. This strategy significantly improves GPU load balance and reduces pipeline bubbles in distributed training.
- Comprehensive System Implementation and Evaluation: The authors implement LoRAFusion on top of Megatron-LM and evaluate it across diverse LLMs (LLaMa-3.1-8B, Qwen-2.5-32B, LLaMa-3.1-70B), realistic datasets, and NVIDIA H100/L40S GPUs.
The key conclusions and findings are:
- LoRAFusion achieves substantial end-to-end speedups, up to 1.96x compared to Megatron-LM, with further gains over mLoRA.
- The FusedLoRA kernel alone delivers consistent kernel-level speedups and can serve as a plug-and-play replacement in existing LoRA systems.
- The kernel fusion substantially reduces GPU global memory read/write traffic.
- The job-level scheduling significantly reduces pipeline bubbles from 44.17% (single adapter) to 11.09% (four adapters), and improves GPU utilization from 65% to 89%.
- The system demonstrates strong scalability and robustness, even in challenging heterogeneous multi-job settings.
These findings collectively address the problem of inefficient LoRA fine-tuning by optimizing both low-level memory access and high-level distributed training coordination, making LLM adaptation more accessible and cost-effective.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
3.1.1. Large Language Models (LLMs)
Large Language Models (LLMs) are deep learning models, typically based on the transformer architecture, trained on vast amounts of text data to understand and generate human-like text. They excel at tasks like text generation, question answering, and code generation. Their pre-training involves learning general language patterns and knowledge, while fine-tuning adapts them to specific downstream tasks or domains. Examples include GPT and LLaMa.
3.1.2. Low-Rank Adaptation (LoRA)
LoRA (Low-Rank Adaptation) is a leading Parameter-Efficient Fine-Tuning (PEFT) method. Instead of fine-tuning all parameters of a pre-trained LLM, LoRA freezes the original, large pre-trained weights. For specific layers (typically linear layers), it injects a small, trainable adapter composed of two low-rank linear layers.
Formally, for a pre-trained weight matrix $W \in \mathbb{R}^{d_{in} \times d_{out}}$ (where $d_{in}$ is the input dimension and $d_{out}$ is the output dimension), LoRA introduces two new trainable matrices: $A \in \mathbb{R}^{d_{in} \times r}$ and $B \in \mathbb{R}^{r \times d_{out}}$. Here, $r$ is the LoRA rank, and $r \ll \min(d_{in}, d_{out})$. The update to the original layer's output is then represented as the product of these two low-rank matrices, scaled by a constant $\alpha$.
The output of a LoRA-equipped linear layer is calculated as:
$$Y = XW + \alpha \cdot \underbrace{\mathrm{dropout}(X)}_{X_d} A B$$
Where:
- $X \in \mathbb{R}^{M \times d_{in}}$ is the input tensor, where $M$ is the number of aggregated tokens (batch size multiplied by sequence length).
- $W$ represents the frozen (non-trainable) base model weights.
- $X_d = \mathrm{dropout}(X)$ is the input tensor after dropout (a regularization technique where random elements of the input are set to zero during training to prevent overfitting).
- $A$ is the first trainable LoRA weight matrix (down-projection).
- $B$ is the second trainable LoRA weight matrix (up-projection).
- $S = X_d A \in \mathbb{R}^{M \times r}$ is an intermediate result.
- $\alpha$ is a constant scalar that scales the output of the LoRA branch.
The key benefit is that only $A$ and $B$ (and their corresponding gradients and optimizer states) need to be stored and updated, dramatically reducing memory usage and computational cost compared to fine-tuning $W$.
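To make the computation concrete, here is a minimal PyTorch sketch of a LoRA-equipped linear layer following the formula above (an illustrative, unfused reference rather than the paper's implementation; the class name and the common α/r scaling convention are our choices):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Reference (unfused) LoRA linear layer: Y = X W + alpha * dropout(X) A B."""
    def __init__(self, d_in, d_out, r=16, alpha=32.0, dropout=0.1):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)                    # frozen base weights W
        self.lora_A = nn.Parameter(torch.randn(d_in, r) * 0.01)   # down-projection A
        self.lora_B = nn.Parameter(torch.zeros(r, d_out))         # up-projection B, zero-init
        self.scaling = alpha / r                                  # common PEFT convention for the scalar
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                                         # x: [M, d_in] aggregated tokens
        s = self.dropout(x) @ self.lora_A                         # S = X_d A, shape [M, r]
        return self.base(x) + self.scaling * (s @ self.lora_B)    # Y = X W + alpha * S B

y = LoRALinear(4096, 4096)(torch.randn(8, 4096))                  # toy usage
```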
3.1.3. Parameter-Efficient Fine-Tuning (PEFT)
PEFT methods are a family of techniques designed to adapt large pre-trained models to new tasks with minimal computational and memory resources. Instead of fine-tuning all parameters, PEFT methods typically introduce a small number of new, trainable parameters (adapters) or modify existing ones in a parameter-efficient way, while keeping the bulk of the pre-trained model frozen. LoRA is one of the most popular PEFT methods. Others include Prefix-tuning, Prompt tuning, and AdapterDrop.
3.1.4. Distributed Training
Distributed training involves training a model across multiple computational devices (e.g., GPUs) or multiple machines (nodes) to handle large models or datasets that cannot fit into a single device's memory or to accelerate training. Common strategies include:
- Data Parallelism (DP): The model is replicated on each GPU, and the training data is partitioned across GPUs. Each GPU computes gradients on its data subset, and then gradients are aggregated (e.g., averaged) across all GPUs to update the model copies.
- Fully Sharded Data Parallelism (FSDP) / ZeRO-3: An advanced form of data parallelism where not just data, but also model states (parameters, gradients, and optimizer states), are sharded (partitioned) across GPUs. This drastically reduces the memory footprint per GPU, allowing larger models to be trained. Model states are only gathered to a GPU when needed for computation.
- Tensor Parallelism (TP): Individual layers (e.g., linear layers or attention layers) within the model are split across multiple GPUs. For instance, a weight matrix $W$ might be split into $W_1$ and $W_2$ across two GPUs; each GPU computes its partial product with the input $X$, and the partial results are combined through communication to form the complete output.
- Pipeline Parallelism (PP): The model's layers are divided into sequential stages, and each stage is assigned to a different GPU or device. Data batches are then processed in a pipeline fashion, where different GPUs work on different stages of the model for different microbatches simultaneously. This reduces communication overhead but can introduce pipeline bubbles (idle time) if stages are imbalanced or microbatches are not perfectly aligned (a standard bubble-ratio estimate is given below).
3.1.5. Kernel Fusion
Kernel fusion is an optimization technique that merges multiple elementary computational operations (called kernels) into a single, larger kernel. This reduces overhead associated with:
- Memory Transfers: By fusing operations, intermediate results can remain in faster on-chip memory (e.g., shared memory or registers) rather than being written to and read from slower global memory (DRAM).
- Kernel Launch Overhead: Each kernel launch incurs a small overhead on the GPU. Fusing multiple operations into one reduces the number of launches.
Examples include FlashAttention, which fuses the operations of the attention mechanism into a single kernel.
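As a small illustration of the idea (using torch.compile on a CUDA device rather than the hand-written Triton kernels discussed later), fusing a chain of memory-bound element-wise operations avoids materializing intermediates in global memory:

```python
import torch

def scale_bias_gelu(x, bias):
    # Three memory-bound element-wise ops; eager PyTorch launches separate
    # kernels and writes each intermediate result back to global memory.
    return torch.nn.functional.gelu(x * 0.5 + bias)

fused = torch.compile(scale_bias_gelu)   # TorchInductor can fuse these into one kernel

x = torch.randn(4096, 4096, device="cuda")
bias = torch.randn(4096, device="cuda")
out = fused(x, bias)                     # ideally one read of x and one write of out
```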
3.1.6. On-the-fly Data Packing
In LLM fine-tuning, training data often consists of sequences of variable lengths.
- Traditional Batch Padding: Shorter sequences are padded with special padding tokens to match the length of the longest sequence in a batch (Figure 2(a)). This leads to wasted computation on padding tokens.
- Dataset Pre-packing: Sequences are pre-processed and concatenated into fixed-length blocks offline (Figure 2(b)). While efficient, it can introduce variable sample counts per batch and might affect training stability or randomness if not handled carefully.
- On-the-fly Packing: Sequences are dynamically concatenated within each batch to fill a fixed token capacity, avoiding wasted computation on padding while maintaining a consistent number of actual training samples per batch (Figure 2(c)). This method is commonly adopted for its effectiveness; a first-fit sketch follows below.
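A minimal first-fit packing sketch, assuming samples are given as token lengths and a fixed per-microbatch token capacity (this illustrates the general idea only, not LoRAFusion's MILP-based batching described later):

```python
from typing import List

def pack_on_the_fly(lengths: List[int], capacity: int) -> List[List[int]]:
    """Greedy first-fit: place each sample into the first microbatch with room."""
    bins: List[List[int]] = []   # each bin holds sample indices
    loads: List[int] = []        # running token count per bin
    for idx, length in enumerate(lengths):
        for b, load in enumerate(loads):
            if load + length <= capacity:
                bins[b].append(idx)
                loads[b] += length
                break
        else:                    # no existing bin fits: open a new microbatch
            bins.append([idx])
            loads.append(length)
    return bins

print(pack_on_the_fly([900, 300, 1200, 150, 700], capacity=2048))
# [[0, 1, 3], [2, 4]] -> two packed microbatches instead of five padded samples
```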
3.2. Previous Works
3.2.1. PEFT Library
The PEFT Library [55] from Hugging Face is a popular open-source library that provides implementations of various Parameter-Efficient Fine-Tuning methods, including LoRA. It simplifies the application of PEFT techniques to pre-trained transformer models, making it a common tool for researchers and practitioners. LoRAFusion integrates with this library for model architecture and LoRA adaptation.
3.2.2. Megatron-LM
Megatron-LM [81] is a state-of-the-art distributed training framework developed by NVIDIA, designed for training very large transformer models. It implements various parallelization strategies, including Tensor Parallelism (TP) and Pipeline Parallelism (PP), to enable efficient training across many GPUs and nodes. LoRAFusion builds its multi-adapter pipeline parallelism on top of Megatron-LM.
3.2.3. mLoRA
mLoRA [98] is a prior system designed for multi-LoRA fine-tuning. It focuses on grouping multiple LoRA adapters that share the same base model from separate tasks into a single batched operation. Its primary motivation was to reduce the memory footprint of replicated pre-trained models and improve training efficiency, particularly in pipeline parallelism by filling pipeline bubbles with independent groups of samples from different adapters. However, LoRAFusion critically analyzes mLoRA's limitations, which include:
- Reliance on generic LoRA kernels that are bottlenecked by redundant memory accesses.
- Lack of handling for load imbalance from variable sequence lengths in real workloads.
- A narrow focus on Pipeline Parallelism and reliance on inefficient CPU-based communication, limiting scalability.
3.2.4. LoRA Serving Systems (Punica, S-LoRA, dLoRA)
Works like Punica [9], S-LoRA [79], and dLoRA [94] are system-level optimizations for LoRA, but specifically for inference and serving scenarios. They focus on efficiently serving thousands of concurrent LoRA adapters by batching requests to increase arithmetic intensity during autoregressive single-token decoding. While they also use multi-LoRA grouping, their motivations and challenges differ significantly from fine-tuning, which already processes full sequences with sufficient arithmetic intensity.
3.2.5. Kernel Fusion Works (FlashAttention, Triton, Mirage)
FlashAttention [13] is a prominent example of kernel fusion that significantly improves the efficiency of the attention mechanism in transformers by fusing operations to reduce memory transfers. Triton [84] is an open-source domain-specific language and compiler for writing highly optimized GPU kernels with fine-grained control, which LoRAFusion uses to implement its FusedLoRA and FusedMultiLoRA kernels. Mirage [95] explores multi-level superoptimizers for tensor programs, showing benefits for LoRA serving, but it doesn't fully address the complexities of LoRA fine-tuning (e.g., dropout, backward computation, multi-LoRA kernels).
3.3. Technological Evolution
The evolution of fine-tuning LLMs has progressed through several stages:
- Full-Model Fine-Tuning: Initial approaches involved updating all parameters of a pre-trained LLM. While effective for task adaptation, this method is extremely memory-intensive and computationally expensive, requiring vast GPU memory and powerful distributed training setups.
- Parameter-Efficient Fine-Tuning (PEFT): To address resource constraints, PEFT methods emerged. These techniques, such as Adapter-tuning, Prefix-tuning, and LoRA, reduce the number of trainable parameters by keeping most of the base model frozen and introducing small, task-specific adapters. LoRA, in particular, gained popularity due to its simplicity and effectiveness, significantly reducing the memory footprint for gradients and optimizer states.
- System-Level Optimizations for PEFT: As PEFT methods became prevalent, the focus shifted to optimizing the systems that execute them. Initially, these systems reused distributed training techniques (like FSDP, TP, and PP) developed for full-model pre-training. However, these generic optimizations often failed to fully leverage the unique characteristics of lightweight LoRA adapters.
- Specialized Optimizations for LoRA (Inference): The inference side of LoRA saw specialized system optimizations (e.g., Punica, S-LoRA) to efficiently serve multiple adapters concurrently, primarily by batching requests during autoregressive decoding to improve GPU utilization.
- Specialized Optimizations for LoRA (Fine-Tuning): The fine-tuning side, however, lagged. While mLoRA started exploring multi-adapter fine-tuning, it still relied on generic kernel performance and lacked robust scheduling for variable sequence lengths. This is where LoRAFusion enters the timeline, providing specialized kernel fusion and adaptive scheduling specifically for LoRA fine-tuning, addressing its unique memory-bandwidth-bound bottlenecks and distributed parallelism challenges.
3.4. Differentiation Analysis
LoRAFusion differentiates itself from prior works by addressing two fundamental inefficiencies in LoRA fine-tuning that existing systems largely overlook:
- Kernel-Level Efficiency for LoRA Operations:
  - Differentiation from Megatron-LM and standard LoRA implementations: Megatron-LM and other frameworks (like those using the Hugging Face PEFT Library) rely on generic kernel implementations for LoRA. As shown in the paper, these are memory-bandwidth-bound due to redundant memory accesses on large activation tensors, leading to significant overhead despite LoRA's small parameter count. torch.compile also offers limited benefits.
  - LoRAFusion's innovation: LoRAFusion introduces FusedLoRA and FusedMultiLoRA kernels that employ a novel graph-splitting fusion strategy. This strategy fuses memory-bound operations (e.g., dropout, element-wise additions) around full-sized activation tensors without recomputing them or introducing expensive synchronization. It strategically splits the graph at the small, intermediate LoRA-rank tensor, preserving the optimal performance of compute-bound GEMMs (the base model's large matrix multiplications). This directly tackles the memory traffic bottleneck, substantially reducing global memory reads and writes.
- Adaptive Scheduling for Multi-Job LoRA Fine-Tuning:
  - Differentiation from Megatron-LM (single-job focus): Megatron-LM (with FSDP or PP) is designed for single-job, large-scale training and does not natively support efficient concurrent fine-tuning of multiple LoRA adapters. Running multiple jobs would typically mean sequential execution or inefficient resource partitioning.
  - Differentiation from mLoRA (incomplete multi-job optimization): While mLoRA attempts multi-LoRA fine-tuning by batching adapters and filling pipeline bubbles, it has key limitations:
    - It still uses generic LoRA kernels, ignoring the memory-bandwidth bottleneck.
    - It fails to adequately address load imbalance caused by variable sequence lengths in real datasets, which can lead to pipeline stalls and idle GPUs. mLoRA assumes uniform adapter grouping and schedules based on memory, not sequence length variability.
    - Its BatchLoRA kernel reduces kernel launch overhead but does not solve the core memory access bottleneck.
    - It is narrowly focused on Pipeline Parallelism and relies on inefficient CPU-based communication.
  - LoRAFusion's innovation: LoRAFusion introduces a sophisticated Multi-LoRA Scheduler that goes beyond simple adapter grouping. It uses a two-stage hierarchical strategy:
    - Adapter Grouping with Bubble Lemma consideration: Groups adapters based on sample length distributions to ensure global batches from the same adapter are sufficiently spaced, satisfying pipeline parallelism dependencies.
    - Adaptive Batching with Two-Stage MILP: Solves a bin-packing problem using Mixed Integer Linear Programming (MILP), with a greedy fallback, to create balanced, dependency-aware microbatches that minimize the total number of microbatches and maximize slack in underfilled ones. This significantly reduces pipeline bubbles (from 34.11% in mLoRA to 11.09% in LoRAFusion) and improves GPU load balance by intelligently packing variable-length samples across jobs.
In essence, LoRAFusion provides a more holistic and deeply optimized solution for LoRA fine-tuning by tackling both the micro-level (kernel execution) and macro-level (job scheduling) inefficiencies, leading to superior end-to-end throughput and resource utilization.
4. Methodology
4.1. Principles
The core idea of LoRAFusion is to address the identified inefficiencies in LoRA fine-tuning by applying multi-level fusion:
- Kernel-level Optimization (FusedLoRA & FusedMultiLoRA): The primary principle is to reduce redundant memory accesses to large activation tensors by fusing memory-bound operations within the LoRA computation graph. This is achieved through a graph-splitting strategy that isolates the compute-bound General Matrix Multiplications (GEMMs) of the base model to preserve their optimal performance, while combining related memory-bound operations (from the LoRA branch and element-wise operations) that share large inputs and outputs. This strategic fusion minimizes memory traffic without incurring the costs of recomputation or synchronization for large intermediate tensors.
- Job-level Scheduling (Multi-LoRA Scheduler): The principle is to maximize GPU utilization and mitigate distributed parallelism overhead by intelligently scheduling multiple independent LoRA fine-tuning jobs. This involves adaptively batching samples from different jobs into balanced microbatches, considering sequence length variability and respecting data dependencies (especially in pipeline parallelism). The goal is to reduce pipeline bubbles and load imbalance across GPUs, which are major bottlenecks in multi-GPU and multi-node training.
These two levels of optimization work synergistically: the fused kernels make individual LoRA operations more efficient, while the scheduler ensures that these efficient operations are fed with optimally packed data across multiple concurrent jobs, maximizing system throughput.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. FusedLoRA and FusedMultiLoRA
4.2.1.1. Design Considerations and Challenges
Traditional kernel fusion aims to merge multiple operations into a single GPU kernel to reduce memory transfers and kernel launch overhead. However, applying full fusion to LoRA modules presents specific challenges:
- Sensitivity of Compute-Bound GEMMs: The base model's GEMM operation ($XW$) is typically compute-bound and highly optimized. Fusing other operations with it could disrupt its optimal tiling strategies and GPU resource usage (e.g., registers, shared memory), leading to performance degradation.
- Recomputation or Synchronization Overhead: Fusing operations with producer-consumer dependencies, such as the down-projection followed by the up-projection, might require recomputing intermediate results (costly for large inputs) or introducing thread block synchronization (which adds coordination overhead).
A good fusion strategy must therefore reduce memory access without compromising the performance of compute-bound operations, and avoid expensive synchronization or recomputation.
4.2.1.2. Full Graph Fusion vs. Split Graph Fusion
The paper explores different strategies for handling the intermediate tensor $S = X_d A$ (the output of the LoRA down-projection) in the forward pass, as illustrated in Figure 9.
- Full Graph Fusion (Recompute S): One option is to recompute $S$ within each tile of a fully fused kernel. However, this repeatedly loads the entire input matrix, which becomes expensive as the number of tokens grows.
- Full Graph Fusion (Synchronize to Share S): Another option is to fuse the computation and use synchronization across thread blocks to share $S$. Here, a single M tile (a block along the token dimension) computes the intermediate tiles of $S$ and writes them to global memory, while other tiles wait. This adds coordination overhead.
- Split Graph Fusion (LoRAFusion's approach): LoRAFusion adopts a third approach: explicitly storing $S$ to and reloading it from GPU global memory. This is feasible because $S$ (of size $M \times r$) is much smaller than the other tensors (which are $M \times d_{in}$ or $M \times d_{out}$) thanks to the small LoRA rank, so the cost of reading and writing it is low. By splitting the graph at $S$, LoRAFusion avoids both recomputation over large inputs and synchronization overhead, while still reducing the memory traffic associated with full-sized activation tensors. This also preserves GPU resources for optimal tiling of the compute-bound $XW$ operation.
The following figure (Figure 9 from the original paper) illustrates the full graph fusion approach versus the split graph fusion approach:
Figure 9 (from the original paper): Fusion strategies for the LoRA module in the forward pass, comparing the full graph fusion options (fuse with recomputation, fuse with synchronization) against the split graph fusion approach used to optimize memory access and compute efficiency.
4.2.1.3. FusedLoRA Design
The FusedLoRA kernels are designed to fuse memory-heavy operations while maintaining the efficiency of compute-bound operations. Figure 10 illustrates this design for both the forward and backward passes.
The following figure (Figure 10 from the original paper) illustrates the LoRA kernel design:

Forward Pass (Figure 10(a)):
- Operation 1 (Dropout + Down-projection): This kernel fuses the dropout operation on the input tensor $X$ (producing $X_d$) with the down-projection by $A$. By doing so, it eliminates the need to load the full-sized activation tensor twice and saves an intermediate write/read of $X_d$ to global memory. The output of this fused kernel is the intermediate tensor $S = X_d A$.
- Operation 2 (Base GEMM + LoRA Up-projection + Addition): This kernel combines the compute-bound base model GEMM ($XW$) with the memory-bound LoRA up-projection ($SB$) and the final element-wise addition of their results ($Y = XW + \alpha \cdot SB$). This fusion is crucial: it eliminates redundant memory operations by directly accumulating the partial results of $XW$ and $\alpha \cdot SB$, saving one read and one write of a full-sized output tensor by performing the addition within the same kernel. This is achieved without affecting base GEMM performance, as the accumulation happens after the $XW$ computation. The small tensor $S$ (the output of Operation 1) is read into this kernel.
Backward Pass (Figure 10(b)): The backward pass similarly applies fusion principles to efficiently compute gradients.
- Operation 3 (Gradient of S + Gradient of B): This kernel fuses the computation of the gradient of $S$ ($dS$) and the gradient of $B$ ($dB$) from the output gradient $dY$. By operating directly on $dY$ and $S$, it eliminates the need to reload $dY$ for separate gradient computations, reducing memory traffic.
- Operation 4 (Gradient of A): This operation remains separate. It computes the gradient of $A$ ($dA$) using the masked input $X_d$ and $dS$. Since it operates on the relatively small masked input and intermediate gradient, fusion provides minimal additional benefit here.
- Operation 5 (Gradient of the Base-Model Path + Gradient of the LoRA Path + Addition): This kernel horizontally fuses the compute-intensive input-gradient GEMM through the frozen base weights with the memory-bound input-gradient contribution from the LoRA path and the addition of their partial gradients. Similar to the forward pass, this prevents redundant reads and writes of the partial input gradients by directly accumulating them, while preserving the performance of the base model's gradient GEMM.
The key insight is to identify operations that share large activation tensors (such as $X$, $X_d$, and $dY$) and fuse them. This significantly reduces memory bottlenecks while allowing compute-bound operations to use optimal tiling strategies.
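For reference, the unfused computation that these kernels cover can be written out in plain PyTorch as below (our illustrative, autograd-free sketch of the forward and backward math, not the paper's Triton kernels; variable names are ours):

```python
import torch

def lora_forward(X, W, A, B, alpha, p_drop):
    """Forward: Y = X W + alpha * dropout(X) A B, keeping S for the backward pass."""
    mask = (torch.rand_like(X) >= p_drop).to(X.dtype) / (1.0 - p_drop)
    Xd = X * mask                          # dropout(X)
    S = Xd @ A                             # small intermediate, [M, r]
    Y = X @ W + alpha * (S @ B)            # fused Operation 2 accumulates both terms
    return Y, (Xd, S, mask)

def lora_backward(dY, W, A, B, alpha, saved):
    """Backward: gradients for A, B, and the layer input (W is frozen)."""
    Xd, S, mask = saved
    dB = alpha * (S.t() @ dY)              # grad of the up-projection
    dS = alpha * (dY @ B.t())              # grad of the small intermediate
    dA = Xd.t() @ dS                       # grad of the down-projection
    dX = dY @ W.t() + (dS @ A.t()) * mask  # base path + LoRA path, accumulated
    return dX, dA, dB
```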
4.2.1.4. Extending to FusedMultiLoRA
To support concurrent fine-tuning of multiple LoRA adapters, FusedLoRA is extended to FusedMultiLoRA. This allows the fused kernels to operate on mixed-adapter batches from different jobs efficiently.
As shown in Figure 11, tile-level routing is employed:
- Each input M tile (a block of tokens) is tagged with an adapter ID and its configuration (e.g., LoRA rank, scaling factor, dropout ratio).
- This information is stored in a lightweight lookup table.
- During execution, the frozen model computation ($XW$) is shared across all tokens, regardless of their adapter.
- Adapter-specific logic (e.g., applying the $A$ and $B$ matrices, scaling, dropout) is applied dynamically per M tile. For each (M tile, N tile) of the output, the kernel loads the appropriate $A$ and $B$ matrices based on the adapter ID associated with that M tile.
- In the backward pass, the same mapping mechanism is used to route gradients to their respective adapters without interference.
This tile-level routing enables efficient execution of heterogeneous adapters within a single fused kernel, avoiding redundant kernel launches per adapter and maintaining high GPU utilization across multiple jobs. The system dynamically chooses between FusedLoRA (if only one adapter is in the batch) and FusedMultiLoRA (for multiple adapters). A simplified reference of this routing logic is sketched below.
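A simplified, unfused PyTorch reference of the per-segment routing (grouping contiguous token blocks by adapter ID and applying that adapter's A/B pair while sharing the base GEMM) might look as follows; this is our illustrative sketch, with dropout omitted, not the FusedMultiLoRA Triton kernel:

```python
import torch

def multi_lora_forward(X, W, adapters, seg_adapter_ids, seg_bounds):
    """X: [M, d_in]; W: frozen [d_in, d_out];
    adapters: dict adapter_id -> (A, B, alpha);
    tokens seg_bounds[i]:seg_bounds[i+1] form segment i, owned by seg_adapter_ids[i]."""
    Y = X @ W                                       # shared frozen-model GEMM for all tokens
    for i, aid in enumerate(seg_adapter_ids):
        lo, hi = seg_bounds[i], seg_bounds[i + 1]
        A, B, alpha = adapters[aid]                 # route to this segment's adapter weights
        Y[lo:hi] += alpha * ((X[lo:hi] @ A) @ B)    # adapter-specific low-rank update
    return Y

# toy usage: two adapters sharing one base layer within a packed batch
d_in, d_out, r = 64, 64, 8
W = torch.randn(d_in, d_out)
adapters = {0: (torch.randn(d_in, r), torch.zeros(r, d_out), 2.0),
            1: (torch.randn(d_in, r), torch.zeros(r, d_out), 2.0)}
X = torch.randn(10, d_in)
Y = multi_lora_forward(X, W, adapters, seg_adapter_ids=[0, 1], seg_bounds=[0, 6, 10])
```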
The following figure (Figure 11 from the original paper) illustrates FusedMultiLoRA in the forward pass:
Figure 11 (from the original paper): FusedMultiLoRA in the forward pass, showing how M tiles are routed to the corresponding MultiLoRA A and B matrices via the mapping lookup table, which ensures correct per-adapter routing and efficient scheduling.
4.2.2. Multi-LoRA Scheduler
The Multi-LoRA Scheduler in LoRAFusion orchestrates the grouping of adapters and adaptive batching of their samples across multiple fine-tuning jobs to optimize GPU load balance and minimize distributed parallelism overhead. The overall workflow is depicted in Figure 12.
The following figure (Figure 12 from the original paper) illustrates the Multi-LoRA adapter scheduling workflow:
Figure 12 (from the original paper): The Multi-LoRA adapter scheduling workflow. The top shows adapter grouping based on sequence-length statistics, the middle shows two-stage MILP microbatch creation, and the bottom shows cross-batch merging of underfilled microbatches.
4.2.2.1. Granularity
The scheduling operates at the global batch level. Each adapter's dataset is conceptually divided into global batches based on a user-specified global batch size. The scheduler then aggregates all samples belonging to the same global batch index across all active adapters and packs them into multiple microbatches for actual processing.
4.2.2.2. Bubble Lemma & Adapter Grouping
A critical challenge in pipeline parallelism is respecting data dependencies between consecutive global batches. Specifically, for a given adapter, if a sample from global batch $i$ completes its forward pass in microbatch $k$, its backward pass will only begin after $S-1$ other microbatches (where $S$ is the number of pipeline stages) complete their forward passes. To ensure correctness, no sample from global batch $i+1$ of the same adapter can start its forward pass before microbatch $k+S-1$ (i.e., before the backward pass of the sample from batch $i$ completes). This is termed the bubble lemma.
To manage this, LoRAFusion first groups LoRA adapters before batching samples:
- Strict Ordering between Groups: Adapters are grouped such that there is a strict execution order between groups. This creates natural gaps, or staggering, between global batches from different adapter groups.
- Flexible Merging within Groups: Within each group, samples can be flexibly merged into microbatches.
- Head-Tail Pairing for Load Balance: To enhance load balance within groups, adapters are sorted by their mean token length, and short-sequence adapters are then paired with long-sequence adapters. This head-tail pairing strategy aims to create more balanced microbatches within a group (a small sketch of the pairing follows below). Overall, this grouping approach balances the need to respect pipeline dependencies with the desire for flexible batching to improve load balance.
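A minimal sketch of the head-tail pairing step, assuming each adapter is summarized by its mean sample length (our illustration of the described strategy; the dataset names and lengths are made up):

```python
def head_tail_groups(mean_lengths: dict, group_size: int = 2):
    """Sort adapters by mean token length, then pair the shortest with the longest."""
    order = sorted(mean_lengths, key=mean_lengths.get)   # short ... long
    groups = []
    while len(order) >= group_size:
        groups.append([order.pop(0), order.pop(-1)])     # head + tail
    if order:                                            # leftover adapter(s)
        groups.append(order)
    return groups

print(head_tail_groups({"xsum": 420, "wikisum": 1900, "cnndm": 780, "mixed": 1100}))
# [['xsum', 'wikisum'], ['cnndm', 'mixed']]
```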
4.2.2.3. Data Batching with Two-Stage MILP
After adapter grouping, the scheduler solves a bin-packing problem to pack samples into microbatches, each constrained by a fixed token capacity. The goal is twofold: (i) minimize the total number of microbatches needed to pack all samples, and (ii) make the smallest microbatch as empty as possible to facilitate merging in later stages.
The paper uses a two-stage Mixed Integer Linear Programming (MILP) formulation, as outlined in Algorithm 1.
For notation:
- $P$: Padding multiple, a user-specified parameter used to pad the total sequence length of samples from the same adapter in a bin up to a multiple of $P$ (e.g., 64 or 128).
- $x_{i,b}$: Binary variable indicating whether sample $i$ is assigned to bin (microbatch) $b$.
- $n_{a,b}$: The number of padded multiples contributed by adapter $a$ in bin $b$.
- $y_b$: Binary variable indicating whether bin $b$ is used.
Algorithm 1: Data Batching & Merging (Per Group)
1 foreach global batch b in parallel do
2 (Bg, {mi}) ← GreedyPacking(b, C) // Greedy fallback as baseline
3 B* ← MILP_MinBins(b, C, timeout = t) // Stage 1: minimize number of microbatches
4 if B* ≥ Bg then
5 B* ← Bg
6 end
7 {m_i*} ← MILP_MinSmallestBin(b, B*, C, timeout = t) // Stage 2: minimize smallest bin tokens
8 if B* = Bg and {m_i*} ≥ {m_i} then
9 return GreedyPacking(b,C)
10 end
11 end
12 foreach consecutive batch pairs (b, b+1) do
13 Shift tokens from b+1 into b if bubble lemma is preserved
14 end
15 VerifyAndFix(schedule) // Insert no-ops where needed
16 return Scheduled microbatches
Stage 1: Minimize the Number of Microbatches (MILP_MinBins)
This stage aims to find the minimum number of microbatches required to pack all samples within a global batch.
The optimization problem is formulated as:
$$\min_{x,\,n,\,y}\ \sum_{b} y_b$$
subject to:
- $y_b \le y_{b-1}$ for all $b > 1$: used bins are contiguous from the start (i.e., if bin $b$ is used, bin $b-1$ must also be used).
- $\sum_{b} x_{i,b} = 1$ for every sample $i$: each sample is assigned to exactly one bin.
- $\sum_{i \in a} l_i \, x_{i,b} \le P \cdot n_{a,b}$ for every adapter $a$ and bin $b$: the total length $l_i$ of samples from adapter $a$ assigned to bin $b$ does not exceed that adapter's padded contribution $P \cdot n_{a,b}$.
- $\sum_{a} P \cdot n_{a,b} \le C \cdot y_b$ for every bin $b$: this links the usage of bin $b$ ($y_b$) to its token count. If $y_b = 1$, the bin's total padded token count must not exceed the microbatch token capacity $C$; if $y_b = 0$, the bin is empty.
Stage 2: Minimize Smallest Bin Tokens (MILP_MinSmallestBin)
After the first stage determines the optimal number of bins $B^*$, the second stage fixes the bin count to $B^*$ and aims to minimize the smallest total token count among all bins. This leaves more slack (empty space) in the least-full microbatch, making it more amenable to merging with tokens from subsequent global batches.
The optimization problem is:
$$\min_{x,\,n}\ \min_{b \in \{1, \dots, B^*\}} \sum_{a} P \cdot n_{a,b}$$
Where:
- The objective minimizes the minimum total padded token count across the $B^*$ bins.
- The constraints are the same as in Stage 1: each sample is assigned to exactly one bin, adapter-specific token counts respect the padding multiple $P$, and the bin capacity $C$ is not exceeded, with the number of bins fixed to $B^*$.
Runtime Efficiency Techniques:
To improve the runtime of the MILP solver, LoRAFusion employs two techniques:
- Greedy Fallback: A timeout is set for the MILP solver. If the solver exceeds this time, the scheduler falls back to a simpler greedy bin-packing algorithm (Algorithm 1, lines 2, 5, and 9), balancing optimality with computational cost.
- Multiprocessing: Since global batches are independent, the bin-packing optimization for different global batches can be parallelized using multiprocessing (Algorithm 1, line 1). This allows efficient scheduling of all training data, as sketched below.
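A minimal sketch of how the per-global-batch scheduling could be parallelized with a solver timeout and a greedy fallback (our illustration of the described mechanism; `milp_min_bins` and `pack_on_the_fly` refer to the earlier sketches and return only bin counts here for brevity):

```python
from multiprocessing import Pool

def schedule_one_batch(args):
    samples, capacity, pad_multiple = args
    greedy_bins = pack_on_the_fly([length for _, length in samples], capacity)
    try:
        # Time-limited MILP; keep its answer only if it beats the greedy packing.
        milp_bins = milp_min_bins(samples, capacity, pad_multiple,
                                  max_bins=len(greedy_bins), time_limit=10)
        return min(milp_bins, len(greedy_bins))
    except Exception:
        return len(greedy_bins)              # fall back to the greedy result

def schedule_all(global_batches, capacity, pad_multiple, workers=8):
    jobs = [(batch, capacity, pad_multiple) for batch in global_batches]
    with Pool(workers) as pool:              # independent global batches solved in parallel
        return pool.map(schedule_one_batch, jobs)
```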
4.2.2.4. Merging & Verification
After microbatch packing, the final microbatch in a global batch might be underfilled, leading to reduced GPU efficiency and increased pipeline bubbles.
- Greedy Merge Pass: A greedy merge pass (Figure 12 bottom; Algorithm 1, lines 12-14) attempts to shift tokens from the next global batch into the current batch's final microbatch. This is done only if the token capacity is not exceeded and the bubble lemma (the data dependency constraint) is preserved.
- Verification and Fix: A final verification step ensures that no constraint is violated. If any bubble condition is not met, no-op microbatches (empty microbatches) are inserted into the schedule (Algorithm 1, line 15) to restore correctness and maintain pipeline consistency.
4.2.2.5. Parallelism Profiler
The Multi-LoRA Scheduler requires a specific token capacity for its microbatches. This capacity depends on the underlying parallelism strategy (e.g., FSDP + PP) and hardware. LoRAFusion includes a lightweight parallelism profiler. This profiler benchmarks the runtime under different model parallelism configurations using fixed-length inputs and collects throughput data. The configuration yielding the best throughput determines the token capacity passed to the data batching stage, ensuring the scheduler's packing aligns with the system's optimal performance characteristics. This decouples the low-level parallelism tuning from the high-level scheduling logic.
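In spirit, the profiler can be thought of as a small search loop over candidate configurations, as in the hedged sketch below (it assumes a user-supplied `benchmark(config, capacity)` that runs a few warm-up and timed steps and returns tokens/sec; the real profiler's interface is not specified in this summary):

```python
def pick_token_capacity(configs, candidate_capacities, benchmark):
    """Benchmark each (parallelism config, token capacity) pair and keep the best."""
    best = None
    for config in configs:                      # e.g., {"pp": 4, "dp": 1}, {"pp": 2, "dp": 2}, ...
        for capacity in candidate_capacities:   # e.g., 8192, 16384, 32768 tokens per microbatch
            tokens_per_sec = benchmark(config, capacity)
            if best is None or tokens_per_sec > best[0]:
                best = (tokens_per_sec, config, capacity)
    _, best_config, best_capacity = best
    return best_config, best_capacity           # the capacity is handed to the data batching stage
```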
4.2.3. System Workflow
Figure 8 provides an overview of the LoRAFusion system workflow:
- Input: A set of fine-tuning jobs is provided, each specifying its LoRA adapter and dataset.
- Dataset Statistics Extraction: LoRAFusion first extracts dataset statistics, particularly sample length distributions, which are crucial for adaptive batching.
- Parallelism Simulation & Token Budget Proposal: A parallelism simulator is used to propose an optimal microbatch token budget based on the available hardware and parallelism strategy.
- Adapter Grouping: The Multi-LoRA Scheduler groups LoRA adapters based on dataset statistics (e.g., sequence length distributions) and bubble lemma considerations.
- Microbatch Construction: Within each group, the scheduler constructs microbatches using the two-stage MILP-based adaptive batching algorithm, aiming for balanced token counts and respecting dependencies.
- Simulation & Iteration: The proposed grouping and batching configuration is re-evaluated through simulation. If a higher-throughput configuration can be found, the process iterates.
- Execution with Fused Kernels: Once an optimal schedule is determined, the fine-tuning jobs are executed by the Executor using the FusedLoRA and FusedMultiLoRA kernels.
- Multi-Adapter Runtime Coordinator: A multi-adapter runtime coordinator ensures token-to-adapter consistency (i.e., each token is processed by its correct adapter), manages resource sharing, and tracks gradients across job boundaries.
Through this combined approach, LoRAFusion systematically addresses both memory bandwidth bottlenecks at the kernel level and distributed training overhead at the job scheduling level.
5. Experimental Setup
5.1. Datasets
The authors evaluate LoRAFusion on three public summarization datasets, chosen for their diverse length distributions (as illustrated in Figure 13), which stress the batching and scheduling capabilities under realistic variable-length scenarios:
The following figure (Figure 13 from the original paper) shows the distribution of sample lengths across the XSum, CNN/DailyMail, and WikiSum datasets:
Figure 13 (from the original paper): Distribution of sample lengths across the XSum, CNN/DailyMail, and WikiSum datasets. The x-axis is the sample length in tokens and the y-axis is the density; curves show each dataset's distribution and dashed lines mark the corresponding means.
- XSum [61]: A dataset for extreme summarization, where summaries are very short (a single sentence) and abstractive. It typically features shorter input and output sequences.
- CNN/DailyMail (CNNDM) [78]: A widely used dataset for abstractive summarization, consisting of news articles and accompanying multi-sentence summaries. This dataset generally has medium-length sequences.
- WikiSum [12]: A dataset for coherent summarization, which likely contains longer documents and summaries, posing challenges for token capacity and memory management.
Dataset Characteristics: As seen in Figure 13:
-
XSum has a distribution centered around shorter sample lengths (e.g., < 512 tokens).
CNN/DailyMailhas a broader distribution, with many samples in the mid-range (e.g., 512-1024 tokens) and some extending to longer lengths. -
WikiSumexhibits a distribution with a significant number of longer sequences (e.g., > 1024 tokens), demonstrating highersequence length variabilityand largertoken countsper sample.These diverse distributions are effective for validating the method's performance because:
-
They represent realistic fine-tuning
workloadswherevariable sequence lengthsare common. -
They allow
LoRAFusionto demonstrate its ability to mitigateload imbalanceand efficiently packmicrobatchesunder differenttoken distributions. -
The presence of longer sequences (e.g., in
WikiSum) specifically challengesbatchingandmemory management, highlighting the benefits ofLoRAFusion's optimizedkernelsandscheduler.
Workload Settings for Multi-LoRA Experiments:
For multi-LoRA experiments, four LoRA adapters are trained in parallel, with varying dataset configurations:
- XSum, CNN/DailyMail, WikiSum Configurations: All four adapters are trained independently on the same respective dataset (e.g., 4 adapters all on XSum).
- Mixed Setting: Each of the four adapters is trained on a dataset that combines samples from all three
XSum,CNN/DailyMail, andWikiSum. - Heterogeneous (Het) Setting: The four adapters are trained on different datasets: one adapter on
XSum, one onCNN/DailyMail, one onWikiSum, and one on theMixeddataset. This is the most challenging scenario, testing the scheduler's ability to handle highly diverse workloads concurrently.
5.2. Evaluation Metrics
The primary evaluation metric used in the paper is throughput, specifically measured in trained tokens per second.
5.2.1. Conceptual Definition
Throughput in the context of LLM training measures the efficiency of the training system. When dealing with inputs of variable sequence lengths (common in LLM fine-tuning), simply counting "samples per second" can be misleading because samples can have vastly different computational costs. "Tokens per second" provides a more accurate and normalized measure of computational work completed per unit of time, reflecting the system's ability to process the actual information content. A higher tokens per second indicates better system efficiency and faster training.
5.2.2. Mathematical Formula
While the paper does not explicitly provide a formula for "tokens per second," it is generally calculated as $\text{Throughput} = \dfrac{\sum_{i=1}^{N} T_i}{T_{\text{total}}}$, where $N$ is the number of processed samples, $T_i$ is the number of tokens in the $i$-th sample, and $T_{\text{total}}$ is the total wall-clock time.
5.2.3. Symbol Explanation
-
: The metric quantifying the number of tokens processed by the training system per second.
-
: The total number of training samples processed during the measurement period.
-
: The number of tokens in the -th training sample. This sum represents the total number of actual tokens processed.
-
: The total wall-clock time taken to process all samples, including computation, communication, and any idle time.
Additionally, the paper uses other derived metrics to analyze components:
-
Speedup: Calculated as (Throughput of LoRAFusion) / (Throughput of Baseline).
-
Memory Traffic Reduction: Measured as a ratio of
DRAMread/write traffic usingNVIDIA Nsight Compute (NCU). -
Pipeline Bubble Ratio: Measures the percentage of idle time in
pipeline parallelismdue topipeline bubbles. -
Tuning Time: Time taken by the
schedulerto determine the optimal batching strategy.
5.3. Baselines
LoRAFusion is compared against three representative baselines to demonstrate its performance advantages:
-
Megatron-LM [81] with Fully Sharded Data Parallelism (FSDP):
- Description:
Megatron-LM is a state-of-the-art distributed training framework. FSDP (or ZeRO-3) is a memory optimization technique that shards all model states (parameters, gradients, and optimizer states) across data-parallel ranks, drastically reducing GPU memory consumption.
- Limitation (for LoRA):
Megatron-LMdoes not natively support multi-LoRA fine-tuning, meaning multiple jobs would typically be trained sequentially, failing to exploitmulti-job concurrencyor specific LoRA characteristics.
- Description:
-
Megatron-LM [81] with Pipeline Parallelism (PP):
- Description:
Pipeline Parallelism divides the model layers into sequential stages, with each stage assigned to a different GPU. Data microbatches flow through these stages in a pipeline, reducing communication overhead but potentially introducing pipeline bubbles (idle time).
PPis another criticaldistributed trainingstrategy, especially for models too large to fit on a single GPU even withFSDP, and is often combined withFSDPforhybrid parallelism. - Limitation (for LoRA): Similar to
FSDP, it lacks native multi-LoRA support, and its efficiency is susceptible topipeline bubblesandload imbalance, especially withvariable sequence lengths.
- Description:
-
mLoRA [98]:
-
Description:
mLoRA is the state-of-the-art multi-LoRA fine-tuning system prior to LoRAFusion. It specifically aims to group multiple LoRA adapters sharing the same base model to improve training efficiency, primarily by filling pipeline bubbles in pipeline parallelism.
Representativeness: This is the most direct and relevant baseline for multi-LoRA fine-tuning, as it attempts to solve a similar problem.
-
Authors' Reimplementation: The paper notes that the original
mLoRAused inefficientPython RPCfor inter-GPU communication. To ensure a fair comparison, the authors reimplementedmLoRAwithin their system usinghigh-performance communication primitives. They also optimistically assumemLoRA'sBatchLoRA kernelhas the same performance as a naive single LoRA kernel, as it doesn't provide a unique multi-LoRA CUDA kernel. This ensures that the comparison focuses on the core scheduling and architectural differences rather than implementation details.Common Software Stack: All experiments use PyTorch 2.6, CUDA Toolkit 12.4, Triton 3.2.0, and Megatron-Core 0.11.0, ensuring a consistent environment.
-
5.4. Hardware Settings
The experiments are primarily conducted on modern GPU clusters:
- NVIDIA H100 (80GB) GPUs:
- Configuration: Each node is equipped with 8 NVIDIA H100 GPUs, connected via
NVLink(high-speed interconnect for intra-node communication). Nodes are interconnected viaInfiniBandfor multi-node communication. Each node also has 208 vCPUs. - Importance: H100 GPUs represent the cutting-edge in AI accelerators, offering high compute
FLOPSandmemory bandwidth. Testing on H100s demonstratesLoRAFusion's performance on powerful, modern hardware, wherememory bandwidthcan still be a bottleneck relative tocompute capability.
- Configuration: Each node is equipped with 8 NVIDIA H100 GPUs, connected via
- NVIDIA L40S (48GB) GPUs:
-
Configuration: Each server contains 4 L40S GPUs, connected over
PCIe(a slower interconnect compared to NVLink), and 128 vCPUs. -
Importance: L40S GPUs offer a different hardware profile, typically with lower
compute-to-memory bandwidth ratioscompared to H100s. Evaluating on L40S demonstrates thegeneralizabilityofLoRAFusion's benefits across different GPU generations and interconnect types, showing thatmemory bandwidth optimizationis valuable even on less premium hardware.GPU Allocation: Most experiments use 1, 2, or 4 GPUs per job, representing typical deployment scenarios. The paper notes that assigning fewer GPUs per job and running more independent jobs often leads to better efficiency by reducing inter-GPU communication and synchronization overhead, a finding explored in their
Scalability Studies.
-
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. End-to-End Results
The end-to-end throughput (measured in tokens per second) demonstrates LoRAFusion's significant performance improvements across various models and hardware.
6.1.1.1. Speedup on H100 GPUs
The following figure (Figure 14 from the original paper) reports the end-to-end throughput of training 4 LoRA adapters on 1, 2, and 4 H100 GPUs:
Figure 14 (from the original paper): End-to-end training throughput (tokens/sec) for training 4 LoRA adapters on 1, 2, and 4 H100 GPUs across datasets, comparing LoRAFusion against mLoRA and the other baselines.
- Overall Performance: LoRAFusion consistently outperforms all baselines across models and datasets.
- LLaMa-3.1-8B (Single H100 GPU): For LLaMa-3.1-8B, which fits on a single H100 GPU, LoRAFusion still achieves a clear average speedup. This improvement is primarily attributed to the FusedLoRA kernel's ability to reduce memory traffic. Since single-GPU setups inherently do not suffer from load imbalance or distributed parallelism overhead, this gain directly reflects the kernel-level optimizations.
- Qwen-2.5-32B and LLaMa-3.1-70B (Distributed Training): For larger models like Qwen-2.5-32B and LLaMa-3.1-70B, which necessitate distributed training (2 or 4 H100 GPUs), LoRAFusion achieves even higher average speedups. This indicates that larger models benefit more from LoRAFusion's combined kernel and scheduling optimizations, as pipeline stalls and load imbalance become more pronounced with higher parallelism.
- Dataset Impact: The WikiSum dataset, known for its large variance in sample lengths, shows high speedups. While baseline methods struggle with out-of-memory errors or severe slowdowns, LoRAFusion achieves stable and efficient packing.
- Heterogeneous Setting (Het): In the most challenging heterogeneous setting, where each of the four adapters is trained on a different dataset, LoRAFusion still maintains strong performance. This highlights the robustness and adaptability of its Multi-LoRA Scheduler.
6.1.1.2. Speedup on L40S GPUs
The following figure (Figure 15 from the original paper) presents results on NVIDIA L40S GPUs:
Figure 15 (from the original paper): Training throughput (tokens/sec) on NVIDIA L40S GPUs for LLaMa-3.1-8B and Qwen-2.5-32B under the Mixed and Heterogeneous settings, comparing LoRAFusion, mLoRA, and the other baselines, with LoRAFusion showing clear gains on both models.
- Overall Performance: LoRAFusion achieves consistent average speedups for both LLaMa-3.1-8B and Qwen-2.5-32B on L40S GPUs.
- LLaMa-3.1-8B on L40S: The benefit for LLaMa-3.1-8B is slightly smaller on L40S than on H100. This is attributed to the limited memory capacity (48GB vs. 80GB) of a single L40S GPU, which constrains the batch size and thus limits the full effectiveness of kernel fusion. Despite these constraints, LoRAFusion still delivers consistent improvements, demonstrating its generalizability across different model sizes and hardware platforms. The gains from memory bandwidth optimization are particularly relevant on hardware where memory bandwidth is a more limiting factor.
6.1.2. Scalability Studies
The scalability of LoRAFusion is evaluated across 4, 8, and 16 H100 GPUs under two scaling strategies: DP scaling (more GPUs per job) and job-level scaling (more concurrent jobs). Global batch sizes are scaled proportionally with the GPU count for fair comparison.
The following figure (Figure 16 from the original paper) shows the scalability of LoRAFusion across 4, 8, and 16 H100 GPUs:
Figure 16 (from the original paper): Scalability of LoRAFusion versus Megatron-LM when training 4 LoRA adapters concurrently, reporting throughput (K tokens/s) and scaling factors on 4, 8, and 16 GPUs under both DP scaling and job-level scaling.
- Job-Level Scaling vs. DP Scaling: The results clearly show that job-level scaling consistently outperforms DP scaling, which is attributed to better load balance when running multiple independent jobs; job-level scaling achieves higher throughput than DP scaling on both 8 and 16 GPUs. This implies that, for LoRA fine-tuning, efficiently running more independent jobs across the available GPUs is often more beneficial than simply increasing the data parallelism degree of a single job, as it better utilizes resources by reducing inter-GPU communication and synchronization overhead.
- Compatibility and Performance with DP Scaling: Even under DP scaling, LoRAFusion demonstrates strong performance, with clear average speedups over both Megatron-LM and mLoRA. This confirms that LoRAFusion is fully compatible with traditional data parallelism and multi-node fine-tuning setups, while still providing significant gains through its kernel and scheduling optimizations.
6.1.3. Effectiveness of FusedLoRA Kernel
6.1.3.1. Kernel Performance
The following figure (Figure 17 from the original paper) shows the throughput of FusedLoRA and FusedMultiLoRA kernels compared to the standard Torch LoRA implementation:
Figure 17 (from the original paper): Normalized throughput of the Torch LoRA, FusedLoRA, and FusedMultiLoRA kernels across different token counts for N=K=4096, N=K=5120, and N=K=8192, with FusedLoRA performing best in most cases.
- FusedLoRA and FusedMultiLoRA both achieve consistent average kernel-level speedups over the baseline.
- In the forward pass, FusedMultiLoRA performs similarly to FusedLoRA because the majority of the computation (the base model GEMM) is shared across tokens regardless of the adapter.
- In the backward pass, FusedMultiLoRA incurs a slight overhead. This is due to the additional complexity of accumulating gradients across different adapters and performing extra element-wise operations (e.g., routing gradients to the correct adapter weights).
- Despite this slight overhead in the backward pass, both fused kernels consistently outperform the baseline Torch LoRA implementation across different token sizes and model configurations. This validates the effectiveness of the graph-splitting fusion strategy in reducing memory access bottlenecks.
6.1.3.2. Layer-wise Performance
The following figure (Figure 18 from the original paper) compares the speedup across different linear layers in various models:
Figure 18 (from the original paper): Normalized throughput across different linear layers and batch sizes for Torch LoRA, FusedLoRA, and FusedMultiLoRA on LLaMa-3.1-8B, Qwen-2.5-32B, and LLaMa-3.1-70B.
- FusedLoRA and FusedMultiLoRA again deliver consistent speedups across the different linear layers of each model.
- These results are based on microbatches containing four adapters. The paper notes that in practical fine-tuning workloads, each microbatch often contains only one or two adapters; in such scenarios, FusedMultiLoRA's performance would be even closer to FusedLoRA's due to less overhead from multi-adapter gradient accumulation. This further confirms the robustness of the fused kernels across different layers and model sizes.
6.1.3.3. Memory Traffic Reduction
The following figure (Figure 19 from the original paper) illustrates the DRAM read and write traffic from NVIDIA Nsight Compute (NCU) across representative GEMM shapes:
The figure compares DRAM read/write traffic across LoRA implementations. Under different dimension configurations, FusedLoRA and FusedMultiLoRA incur lower memory traffic than the conventional Torch LoRA; for instance, at the large matrix size, Torch LoRA sits at 1.00x while FusedLoRA is at 0.63x.
- Both FusedLoRA and FusedMultiLoRA consistently reduce memory traffic compared to Torch LoRA.
- For the large GEMM shape highlighted in the figure, total DRAM traffic drops to 0.63x of the Torch LoRA baseline, i.e., a 37% reduction.
- Across all tested settings, DRAM traffic is consistently reduced. This directly confirms that the graph-splitting fusion design effectively addresses the redundant-memory-access bottleneck identified in the motivation, leading to substantial memory bandwidth savings (a back-of-envelope traffic model is sketched below).
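The intuition behind these savings can be illustrated with a rough traffic model. The accounting below is a simplification (it ignores weight reads, caching, and the backward pass), so it will not reproduce the NCU measurements exactly; it only shows why the extra round trips over the large output tensor dominate the unfused path.

```python
def lora_forward_traffic_bytes(M, N, K, fused, dtype_bytes=2):
    """Rough DRAM-traffic model for y = x @ W + ((x @ A) @ B) * s.

    Only the large (M, K) and (M, N) tensors are counted; weights and the
    tiny (M, r) intermediate are ignored. The unfused path is assumed to
    materialize the adapter output and then add it to y in place, while the
    fused path folds the addition into the same pass as the base GEMM.
    """
    traffic = M * K + M * N          # base GEMM: read x, write y
    traffic += M * K                 # adapter branch: read x again for x @ A
    if not fused:
        traffic += M * N             # write the materialized adapter output
        traffic += 3 * M * N         # in-place add: read y, read adapter out, write y
    return traffic * dtype_bytes

M = N = K = 8192
ratio = (lora_forward_traffic_bytes(M, N, K, fused=True)
         / lora_forward_traffic_bytes(M, N, K, fused=False))
print(f"estimated fused/unfused traffic ratio: {ratio:.2f}")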
6.1.3.4. Performance Insights Across Diverse Hardware
The paper emphasizes that the FusedLoRA and FusedMultiLoRA kernels' benefits (reducing redundant memory access for large activation tensors) are particularly pronounced on hardware where memory bandwidth is a limiting factor compared to compute FLOPS. As modern accelerators continue to increase compute FLOPS faster than memory bandwidth [27], the relative performance gains from LoRAFusion's fused kernels are expected to grow in future systems.
6.1.4. Effectiveness of Job-Level Scheduling
6.1.4.1. Pipeline Bubble Reduction
The following figure (Figure 20 from the original paper) illustrates how LoRAFusion helps reduce pipeline bubbles by scheduling multiple adapters together:
The figure is a chart of the pipeline bubble ratio under different methods. Megatron-LM exhibits 48.79% and mLoRA 34.11%; LoRAFusion's ratio is 44.17% with one adapter, 15.00% with two adapters, 12.23% with three, and drops to 11.09% with four adapters.
The analysis reveals three key observations:
- Single Adapter Performance: With only one adapter, the pipeline bubble ratio remains high at 44.17%, close to Megatron-LM's 48.79%. This demonstrates that adapter grouping is ineffective when only a single dataset/job is available, highlighting the importance of multi-LoRA fine-tuning for improved scheduling flexibility.
- Impact of Multiple Adapters: As more adapters are trained concurrently, the bubble ratio steadily decreases: 15.00% for 2 adapters, 12.23% for 3 adapters, and 11.09% for 4 adapters. In contrast, mLoRA (a multi-LoRA baseline) only achieves 34.11%. This confirms that LoRAFusion's grouping and adaptive batching strategies significantly reduce pipeline idle time by effectively filling pipeline bubbles with useful work from other jobs (an idealized bubble-ratio approximation is given after this list).
- Residual Bubbles: Even with four adapters, a bubble ratio of 11.09% remains. The authors attribute this to uneven execution times across pipeline stages, specifically the last stage taking longer due to an extra linear layer and the cross-entropy loss computation. This limitation is inherent to the model's structure and pipeline partitioning, and thus is not addressable by the scheduler alone.
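The trend in the second observation can be related to the standard idealized bubble-ratio estimate for a pipeline with $p$ stages and $m$ microbatches per global batch (a textbook approximation that assumes equal stage times, not a number from the paper):

$$\text{bubble ratio} \approx \frac{p-1}{m+p-1}$$

For $p = 4$ stages, $m = 8$ microbatches give $3/11 \approx 27\%$, while pooling microbatches from two adapters ($m = 16$) gives $3/19 \approx 16\%$. The measured ratios in Figure 20 also reflect uneven stage times and other scheduling effects, which is why they do not follow this formula exactly.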
6.1.4.2. Tuning Time
The following figure (Figure 21 from the original paper) shows how tuning (scheduling) and computation time grow with the number of training samples for a 4-stage pipeline with 4 adapters:
The figure is a chart of tuning time versus the number of samples for a 4-stage pipeline with 4 adapters. As the number of samples increases, computation time grows markedly while tuning time remains relatively flat.
- Linear Scalability: The scheduling time (for the MILP-based optimization) increases nearly linearly with the number of samples, from 15.74 seconds at 640 samples to 102.12 seconds at 25,600 samples, demonstrating the linear scalability of the scheduler.
- Negligible Overhead: The scheduling overhead is negligible compared to the overall computation time, due to three factors (a minimal sketch of the overlap-and-fallback pattern follows this list):
  - Overlap: The CPU-based scheduling runs in parallel with GPU training of the preceding global batch. With linear scaling of roughly 4 ms of CPU time per sample and GPU execution times that are far larger, the scheduler's latency is fully hidden by this overlap.
  - Saturation of Gains: Performance gains from adding more adapters saturate at around 4 adapters, so practical deployments can operate with a small, constant number of adapters.
  - Timeout and Fallback: The MILP solver has a timeout mechanism and falls back to a greedy bin-packing algorithm if it takes too long, keeping the scheduling overhead within a controllable range.
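The overlap-and-fallback pattern can be sketched with Python's standard multiprocessing module. Here `milp_solve` and `greedy_pack` are placeholder callables and the 10-second timeout mirrors the value reported below; none of this is LoRAFusion's actual interface.

```python
import multiprocessing as mp

def _solve_worker(milp_solve, samples, token_capacity, out_q):
    # Runs in a separate process so the parent can enforce a wall-clock timeout.
    out_q.put(milp_solve(samples, token_capacity))

def plan_next_global_batch(samples, token_capacity, milp_solve, greedy_pack,
                           timeout_s=10.0):
    """Plan microbatches for the next global batch on the CPU (overlapped with
    GPU training of the current one). Falls back to greedy packing if the MILP
    solver exceeds the timeout."""
    out_q = mp.Queue()
    proc = mp.Process(target=_solve_worker,
                      args=(milp_solve, samples, token_capacity, out_q))
    proc.start()
    proc.join(timeout_s)              # wait at most `timeout_s` seconds
    if proc.is_alive():               # solver too slow: kill it and fall back
        proc.terminate()
        proc.join()
        return greedy_pack(samples, token_capacity)
    return out_q.get()
```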
6.1.4.3. Effectiveness of the Merging & Greedy Fallback
An evaluation on 4 adapters of LLaMa-3.1-70B fine-tuned on four H100 GPUs quantifies the contribution of the merging pass and two-stage MILP optimization:
- The merging pass (which shifts tokens from the next global batch into an underfilled microbatch) improves throughput by 4.34%.
- The two-stage MILP optimization (compared to pure greedy bin-packing; a minimal greedy-packing sketch follows this list) provides an additional 3.82% improvement.
- The MILP solver path is selected for 77.4% of global batches (with a timeout of 10 seconds), indicating its effectiveness in reducing the token counts of underfilled microbatches. These modest improvements suggest that most microbatches are already well packed and that these algorithms primarily optimize the final microbatch in each global batch. Given that scheduling overhead is hidden by parallel GPU execution, these optimizations push performance closer to the hardware limit without introducing additional latency.
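For reference, the flavor of the greedy fallback can be captured by a first-fit-decreasing packer over per-sample token counts. The real scheduler additionally handles adapter grouping, gradient-accumulation dependencies, and the merging pass, none of which are modeled in this sketch.

```python
def greedy_pack(sample_lengths, token_capacity):
    """First-fit-decreasing packing of samples (given as token counts) into
    microbatches whose total token count stays within `token_capacity`.
    Returns a list of microbatches, each a list of sample indices."""
    order = sorted(range(len(sample_lengths)),
                   key=lambda i: sample_lengths[i], reverse=True)
    bins, loads = [], []
    for i in order:
        length = sample_lengths[i]
        for b, load in enumerate(loads):
            if load + length <= token_capacity:   # fits in an existing microbatch
                bins[b].append(i)
                loads[b] += length
                break
        else:
            bins.append([i])                      # open a new microbatch
            loads.append(length)
    return bins

# Example: pack 8 variable-length samples under a 4096-token capacity.
lengths = [3500, 2800, 1200, 900, 650, 400, 300, 250]
print(greedy_pack(lengths, token_capacity=4096))
# -> [[0, 5], [1, 2], [3, 4, 6, 7]]
```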
6.1.5. Speedup Breakdown
The following figure (Figure 22 from the original paper) shows the contribution of each component in LoRAFusion on LLaMa-3.1-70B with 4 GPUs:
The figure is a chart of LoRAFusion's relative speedups on LLaMa-3.1-70B with 4 GPUs. Each configuration is compared against the 1F1B PP baseline of 1.00x, with Balanced Multi-LoRA ZeRO Bubble PP + FusedMultiLoRA reaching the highest speedup of 2.05x.
- Baseline (1F1B PP in Megatron-LM): This serves as the reference point (1.00x speedup).
- 1F1B PP + FusedLoRA: Adding FusedLoRA alone yields a 1.13x speedup. This gain is modest because load imbalance and suboptimal token shapes (which limit kernel efficiency) are still present.
- Multi-LoRA Zero-Bubble PP: Replacing 1F1B PP with Multi-LoRA zero-bubble pipeline parallelism (without FusedLoRA) improves throughput to 1.50x. This substantial gain comes from eliminating pipeline stalls by filling pipeline bubbles with microbatches from independent adapters.
- Multi-LoRA Zero-Bubble PP + FusedMultiLoRA: Further adding the FusedMultiLoRA kernel (which enables multi-adapter microbatches and reduces redundant memory access) raises the speedup to 1.72x.
- Balanced Multi-LoRA ZeRO Bubble PP (with Scheduler): Applying the Multi-LoRA scheduler to rebalance token distribution across microbatches (even without fusion) achieves a 1.57x speedup. This shows the significant impact of the scheduler in reducing load imbalance.
- Balanced Multi-LoRA ZeRO Bubble PP + FusedMultiLoRA (LoRAFusion's full system): Combining the adaptive scheduling with the fused kernels achieves the highest speedup of 2.05x. This clearly demonstrates the importance of jointly optimizing kernel efficiency, parallelism, and workload balance.
Speedup over mLoRA:
The speedup over mLoRA is driven by two main optimizations:
- Kernel Fusion: The kernel fusion (comparing Multi-LoRA Zero-Bubble PP to Multi-LoRA Zero-Bubble PP + FusedMultiLoRA) yields a 1.15x speedup (1.72x / 1.50x). This is consistent with the average kernel-level speedups reported in Figure 17 and the 1.13x average speedup in Figure 18.
- Adaptive Batching: The adaptive batching (comparing Multi-LoRA Zero-Bubble PP + FusedMultiLoRA to Balanced Multi-LoRA ZeRO Bubble PP + FusedMultiLoRA) mitigates load imbalance, providing an additional 1.19x speedup (2.05x / 1.72x). This is supported by Figure 20, which shows a 23.02% reduction in pipeline bubbles with LoRAFusion's scheduling compared to mLoRA. (The arithmetic is spelled out below.)
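These two factors can be cross-checked against the breakdown values in Figure 22; the arithmetic below is derived from those reported numbers rather than stated separately in the paper:

$$\frac{1.72}{1.50} \approx 1.15, \qquad \frac{2.05}{1.72} \approx 1.19, \qquad 1.15 \times 1.19 \approx \frac{2.05}{1.50} \approx 1.37$$

so kernel fusion and adaptive batching together account for roughly a 1.37x gain over the plain Multi-LoRA Zero-Bubble PP configuration.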
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper effectively identifies and addresses two crucial performance bottlenecks in LLM LoRA fine-tuning: the substantial runtime overhead caused by redundant memory accesses in LoRA modules, and the optimization opportunities that are missed when multiple concurrent LoRA jobs are not scheduled together. The proposed LoRAFusion system offers a dual-level solution:
- Kernel-level Optimization: A novel horizontal fusion technique (FusedLoRA and FusedMultiLoRA) is introduced. This method judiciously splits the computation graph to fuse memory-bound operations while preserving the efficiency of compute-bound GEMMs, reducing memory traffic by up to 37%.
- Job-level Scheduling: An adaptive multi-LoRA scheduling strategy is presented. The scheduler intelligently groups adapters and uses a two-stage MILP-based bin-packing algorithm to create balanced, dependency-aware microbatches, improving GPU utilization from 65% to 89% and reducing pipeline bubbles from 44.17% to 11.09%.

Combined, these optimizations achieve an end-to-end speedup of up to 1.96x over Megatron-LM, along with further gains over mLoRA, the previous state-of-the-art multi-LoRA fine-tuning system. LoRAFusion not only enhances performance but also improves the accessibility and efficiency of LLM LoRA fine-tuning for both researchers and practitioners.
7.2. Limitations & Future Work
The authors acknowledge several areas for future work and discuss the generalizability of their approach:
7.2.1. Generalizability to LoRA Variants
The proposed kernel fusion design is inherently extensible to other popular LoRA variants such as DoRA [52] and VeRA [37]. These variants typically introduce pre- or post-processing functions around the core LoRA computation. The authors state that LoRAFusion's optimizations are orthogonal to these modifications, meaning users could define prologue/epilogue functions to extend the existing kernels manually.
For a more general approach, they plan to integrate these fusion patterns into a compiler framework, specifically by adding compiler annotations as hints for torch.compile. This would automate the optimization process for both existing and future LoRA variants, removing the need for manual kernel development and specialized system expertise from users.
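As a rough illustration of the prologue/epilogue idea, a variant-agnostic LoRA linear might look like the following. The hook names and signature are assumptions for this sketch, not LoRAFusion's interface, and torch.compile is shown only as the intended integration point.

```python
import torch

def lora_linear_with_hooks(x, W, A, B, scaling=1.0, prologue=None, epilogue=None):
    """Reference LoRA linear with optional prologue/epilogue callables.
    A variant such as DoRA or VeRA would supply these hooks; a fused kernel
    could fold them into the same memory pass over the activations."""
    h = prologue(x) if prologue is not None else x      # e.g., dropout or input scaling
    y = x @ W + ((h @ A) @ B) * scaling                 # base GEMM + low-rank adapter path
    return epilogue(y) if epilogue is not None else y   # e.g., per-column rescaling

# torch.compile can already fuse the surrounding elementwise hooks; the planned
# compiler annotations would extend this toward the full graph-splitting fusion.
compiled_lora = torch.compile(lora_linear_with_hooks)
```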
7.2.2. Generalizability to Quantization
The kernels developed in LoRAFusion can be directly applied to 4-bit QLoRA [14]. Current QLoRA implementations typically dequantize 4-bit weights to half-precision (e.g., FP16) before performing the LoRA computation. This means LoRAFusion's kernels can operate without modification on these dequantized weights. While it might be possible to fuse dequantization with the LoRA path, recent research suggests that two-step approaches (dequantize then compute) are often more performant for large token counts.
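A self-contained sketch of the two-step approach follows: dequantize a packed 4-bit weight to a dense matrix, then run the unchanged LoRA path on it. The symmetric int4 scheme and helper names are simplified stand-ins for QLoRA's NF4 format, used only to illustrate why the fused LoRA kernels need no modification; float32 is used so the sketch runs on CPU, whereas half precision would be used in practice.

```python
import torch

def dequantize_int4(packed, scales):
    """Unpack two 4-bit values per byte and rescale per output channel.
    A simplified stand-in for QLoRA's NF4 dequantization."""
    low = (packed & 0x0F).to(torch.int8) - 8            # low nibble  -> [-8, 7]
    high = (packed >> 4).to(torch.int8) - 8             # high nibble -> [-8, 7]
    q = torch.stack((low, high), dim=-1).reshape(packed.shape[0], -1)
    return q.to(torch.float32) * scales                 # dense (K, N) weight

def qlora_forward(x, packed_W, scales, A, B, scaling=1.0):
    W = dequantize_int4(packed_W, scales)                # step 1: dequantize
    return x @ W + ((x @ A) @ B) * scaling               # step 2: unchanged LoRA path

# Illustrative shapes: K inputs, N outputs packed two values per byte along N.
K, N, r = 64, 32, 8
packed_W = torch.randint(0, 256, (K, N // 2), dtype=torch.uint8)
scales = torch.rand(1, N) * 0.1
x, A, B = torch.randn(4, K), torch.randn(K, r), torch.randn(r, N)
print(qlora_forward(x, packed_W, scales, A, B).shape)   # torch.Size([4, 32])
```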
7.3. Personal Insights & Critique
7.3.1. Strengths and Innovations
- Holistic Approach: The paper's greatest strength lies in its holistic approach, tackling inefficiencies at both low-level kernel execution and high-level job scheduling. This multi-level optimization is crucial for maximizing performance in complex distributed systems.
- Rigorous Bottleneck Analysis: The detailed profiling in Section 3, which clearly identifies memory bandwidth as the primary bottleneck for LoRA, is highly insightful. This rigorous analysis guides the elegant graph-splitting fusion strategy, which is a key innovation.
- Practicality of Kernel Fusion: The FusedLoRA and FusedMultiLoRA kernels offer a practical and immediate benefit. Their plug-and-play nature means existing LoRA systems can integrate them with minimal effort for immediate throughput gains. The tile-level routing in FusedMultiLoRA is particularly clever for handling heterogeneous jobs.
- Sophisticated Scheduling: The two-stage MILP-based adaptive batching algorithm is a significant advancement over prior multi-LoRA scheduling. It directly addresses the critical issues of load imbalance and pipeline bubbles caused by variable sequence lengths and data dependencies, making distributed fine-tuning much more efficient. The use of greedy fallbacks and multiprocessing for the MILP demonstrates a pragmatic balance between optimality and runtime.
- Strong Evaluation: The extensive evaluation across diverse LLMs, datasets, and GPU platforms (H100 and L40S) under various workload configurations (homogeneous, mixed, heterogeneous) provides robust evidence for the system's effectiveness and generalizability.
7.3.2. Potential Issues, Unverified Assumptions, or Areas for Improvement
- Complexity of MILP: While effective, the MILP-based scheduler can be computationally intensive for extremely large numbers of samples or adapters, even with timeouts and greedy fallbacks. The paper shows linear scaling, but the constant factor might be significant for very dynamic or extremely large-scale scheduling scenarios with frequent re-scheduling. Further work could explore more lightweight heuristics or reinforcement-learning-based schedulers for extreme scale.
- Dependency on a Parallelism Profiler: The scheduler relies on an external parallelism profiler to determine the optimal token capacity. While this decouples concerns, it introduces an extra profiling step and assumes that the token capacity remains optimal throughout training, which might not hold if workload characteristics shift significantly. Integrating this profiling more dynamically into the scheduling loop could be an enhancement.
- Fixed Pipeline-Stage Imbalance: The paper acknowledges that the residual pipeline bubbles (11.09% with four adapters) are due to uneven execution times across pipeline stages, a limitation not solvable by the scheduler. This points to an area where dynamic pipeline re-balancing or stage-aware kernel fusion could offer further improvements, potentially beyond the current scope of LoRAFusion.
- Triton Kernel Maintenance: While Triton offers fine-grained control for kernel optimization, it also requires manual tuning and maintenance (as seen in the Artifact Appendix's tune_kernels.py script). As hardware evolves, these kernels may need re-tuning. The proposed future work on integrating with compiler frameworks like torch.compile is crucial to mitigate this.
- Specificity to LoRA: The optimizations are highly tailored to LoRA. While this yields significant gains, adapting them to other PEFT methods (beyond DoRA or VeRA variants) might require substantial re-engineering.
7.3.3. Transferability and Broader Impact
- Memory-Bound Optimization: The graph-splitting fusion strategy for memory-bound operations could be transferable to other domains where small computational additions cause disproportionate memory-traffic overhead. This principle could apply to other adapter-based architectures or even to specific layers in full models that are memory-bandwidth-bound.
- Adaptive Batching for Variable Workloads: The adaptive batching and scheduling algorithm for variable sequence lengths and multi-job concurrency is highly transferable. It can benefit any distributed training or inference system dealing with heterogeneous, variable-length workloads, not just LLMs or LoRA. For example, similar scheduling challenges arise in multi-modal models or graph neural networks with variable graph sizes.
- Cost Reduction in AI: By significantly improving the efficiency of LLM fine-tuning, LoRAFusion directly reduces the operational costs and hardware requirements for adapting LLMs. This democratizes access to advanced AI capabilities, making it more feasible for smaller organizations and individual researchers to fine-tune high-quality models.
- Foundation for Future Systems: LoRAFusion establishes a strong foundation for future LLM fine-tuning systems, demonstrating the power of combining low-level kernel engineering with high-level system scheduling for specialized workloads. This multi-level optimization paradigm will likely inspire further research in system-level AI acceleration.