Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
TL;DR Summary
This paper introduces a novel interleaved pipeline parallelism schedule, combining tensor, pipeline, and data parallelism, to enhance the training efficiency of large language models on GPU clusters, achieving 502 petaFLOP/s on 3072 GPUs with over 10% throughput improvement.
Abstract
Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on even a multi-GPU server, and b) the number of compute operations required to train these models can result in unrealistically long training times. Consequently, new methods of model parallelism such as tensor and pipeline parallelism have been proposed. Unfortunately, naive usage of these methods leads to fundamental scaling issues at thousands of GPUs, e.g., due to expensive cross-node communication or devices spending significant time waiting on other devices to make progress. In this paper, we show how different types of parallelism methods (tensor, pipeline, and data parallelism) can be composed to scale to thousands of GPUs and models with trillions of parameters. We survey techniques for pipeline parallelism and propose a novel interleaved pipeline parallelism schedule that can improve throughput by 10+% with memory footprint comparable to existing approaches. We quantitatively study the trade-offs between tensor, pipeline, and data parallelism, and provide intuition as to how to configure distributed training of a large model. Our approach allows us to perform training iterations on a model with 1 trillion parameters at 502 petaFLOP/s on 3072 GPUs with achieved per-GPU throughput of 52% of theoretical peak. Our code is open sourced at this https URL.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is the efficient training of large-scale language models on GPU clusters, specifically utilizing the Megatron-LM framework.
1.2. Authors
The authors of this paper are Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. Their affiliations are:
- †NVIDIA: Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro.
- ‡Stanford University: Matei Zaharia.
- *Microsoft Research: Deepak Narayanan, Amar Phanishayee. The authors represent a mix of industry research (NVIDIA, Microsoft Research) and academia (Stanford University), indicating a strong collaboration between leading researchers in deep learning systems and large-scale computing.
1.3. Journal/Conference
This paper was published as a preprint on arXiv. While not a formal journal or conference publication, arXiv is a highly influential open-access repository for scientific preprints, especially in fields like AI and machine learning. Papers on arXiv often precede formal peer-reviewed publications and are widely read and cited in the research community. The work is also heavily associated with NVIDIA, a key player in GPU hardware for deep learning.
1.4. Publication Year
2021
1.5. Abstract
Large language models (LLMs) achieve state-of-the-art results across various tasks, but training them efficiently is highly challenging for two primary reasons: limited GPU memory prevents large models from fitting on a single GPU or even a multi-GPU server, and the immense computational demands lead to impractically long training times. Existing model parallelism methods such as tensor parallelism and pipeline parallelism have been proposed, but their naive application faces fundamental scaling problems on thousands of GPUs, such as high cross-node communication costs or significant device idle time.
This paper introduces a novel approach that effectively composes different parallelism methods—tensor, pipeline, and data parallelism—to scale training to thousands of GPUs and models with up to a trillion parameters. The authors survey existing pipeline parallelism techniques and propose a new interleaved pipeline parallelism schedule, which improves throughput by over 10% while maintaining a comparable memory footprint. Through quantitative studies, they analyze the trade-offs between these parallelism types and provide practical guidance for configuring distributed training. Their method achieves a remarkable training iteration throughput of 502 petaFLOP/s for a 1-trillion-parameter model on 3072 GPUs, demonstrating a per-GPU throughput of 52% of the theoretical peak. The accompanying code is open-sourced.
1.6. Original Source Link
Official Source: https://arxiv.org/abs/2104.04473 PDF Link: https://arxiv.org/pdf/2104.04473.pdf Publication Status: Preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the inefficient and challenging training of extremely large language models (LLMs) on GPU clusters. Recent advancements in Transformer-based language models have led to unparalleled performance across Natural Language Processing (NLP) tasks, driven by an exponential increase in model size (number of parameters) and dataset scale.
This problem is critical in the current field for several reasons:
- Memory Capacity Limitations: State-of-the-art LLMs, now reaching hundreds of billions or even trillions of parameters, simply cannot fit into the memory of a single GPU, or even a single multi-GPU server. This hardware limitation is a fundamental bottleneck.
- Unrealistically Long Training Times: Even if a model could hypothetically fit (e.g., by constantly swapping data between host and device memory), the sheer number of floating-point operations (FLOPs) required for training would lead to training durations of hundreds of years on a single GPU, rendering such training impractical.
- Limitations of Traditional Data Parallelism: While data parallelism is a common scale-out strategy, it faces two main limitations for very large models: a) per-GPU batch sizes become too small beyond a certain point, leading to low GPU utilization and high communication overhead; and b) the maximum number of devices that can be utilized is bounded by the global batch size, which can be insufficient for scaling to thousands of GPUs.
- Scaling Issues with Existing Model Parallelism: Prior model parallelism techniques, such as tensor parallelism and pipeline parallelism, address memory constraints, but their naive application introduces new scaling challenges:
  - Tensor parallelism within a multi-GPU server (e.g., an NVIDIA DGX A100) works well up to a point, but scaling it across servers becomes inefficient because its all-reduce communication must traverse slower inter-server links. It can also lead to small matrix multiplications (GEMMs) and decreased GPU utilization.
  - Pipeline parallelism, while effective for partitioning layers across devices, suffers from significant pipeline bubbles (idle time) caused by the pipeline flushes required to maintain strict optimizer semantics. This idle time can consume as much as 50% of the training time, especially with fewer microbatches.

The paper's entry point and innovative idea is a composed parallelism strategy (PTD-P) that intelligently combines tensor, pipeline, and data parallelism to overcome these scaling limitations. It further introduces a novel interleaved pipeline parallelism schedule to mitigate pipeline bubbles and raise throughput, alongside various communication and computation optimizations. The authors aim to provide both a practical solution and a quantitative analysis of the trade-offs involved in configuring these complex distributed training setups.
2.2. Main Contributions / Findings
The primary contributions and key findings of this paper are:
- PTD-P: A Composed Parallelism Approach: The paper demonstrates how pipeline parallelism (across multi-GPU servers), tensor parallelism (within a multi-GPU server), and data parallelism can be effectively combined. This PTD-P strategy enables practical training of models with up to a trillion parameters, with graceful scaling on optimized GPU clusters.
- Novel Interleaved Pipeline Parallelism Schedule: The authors propose a new interleaved pipeline schedule that significantly reduces pipeline bubble time compared to previous schedules such as PipeDream-Flush. It improves throughput by more than 10% while maintaining a comparable memory footprint, and is especially beneficial at smaller batch sizes.
- Quantitative Study of Parallelism Trade-offs: The paper provides a comprehensive quantitative and analytical study of the interactions and trade-offs among tensor, pipeline, and data parallelism, as well as the impact of hyperparameters such as microbatch size and activation recomputation. This analysis offers guiding principles for configuring distributed training effectively.
- State-of-the-Art Training Throughput: The proposed approach sustains training iterations on a 1-trillion-parameter model at an aggregate throughput of 502 petaFLOP/s on 3072 A100 GPUs. This corresponds to a per-GPU throughput of 163 teraFLOP/s, or 52% of the theoretical peak device throughput, which is significantly higher than previous systems.
- Practical Training Times: Based on these throughputs, the authors estimate end-to-end training of a 1-trillion-parameter model to take approximately 3 months, demonstrating the practical viability of training such colossal models. Training a 175-billion-parameter GPT-3 model is estimated at 34 days on 1024 A100 GPUs.
- Superior Performance to ZeRO-3: The PTD-P approach outperforms ZeRO-3 (without model parallelism) by 70% for the 175- and 530-billion-parameter models when scaling to more GPUs, primarily because it requires less cross-node communication.
- Key Optimizations: The high throughput is attributed to several innovations and engineering efforts, including:
  - Scatter/gather communication optimization: reduces redundant cross-node communication by exploiting the replicated outputs of tensor-parallel ranks.
  - Efficient kernel implementations and operator fusion: keep computation compute-bound rather than memory-bound.
  - Careful partitioning of the computation graph: minimizes network traffic and device idle periods.
- Open-Sourced Software: The implemented system, an extension of the Megatron-LM codebase, is open-sourced at https://github.com/nvidia/megatron-lm, facilitating adoption by the broader research community.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a beginner should be familiar with the following fundamental concepts in deep learning and distributed computing:
- Deep Learning Models (specifically Transformers):
  - Neural Network: A computational model inspired by the human brain, composed of interconnected nodes (neurons) organized in layers. It learns to recognize patterns in data.
  - Transformer: A neural network architecture introduced in 2017 that became the backbone of many modern Large Language Models (LLMs). It revolutionized sequence-to-sequence tasks (like machine translation) and later NLP by relying entirely on attention mechanisms instead of recurrent neural networks (RNNs) or convolutional neural networks (CNNs).
  - Self-Attention: A core component of the Transformer. It allows the model to weigh the importance of different parts of the input sequence when processing a specific element. For example, when processing the word "it" in a sentence, self-attention helps the model decide whether "it" refers to "the river" or "the bank."
  - Multi-Layer Perceptron (MLP) / Feed-Forward Network: A simple sub-network within each Transformer layer that processes the output of the self-attention mechanism. It typically consists of two linear transformations with an activation function in between.
  - Parameters: The learnable weights and biases in a neural network. The number of parameters directly correlates with a model's capacity and memory footprint.
  - Activations: The outputs of neurons at each layer of a neural network. During the forward pass, these are computed and often stored for use in the backward pass.
  - Gradients: During training, gradients are computed to indicate how much each parameter should be adjusted to minimize the model's error.
  - Optimizer: An algorithm (e.g., Adam, SGD) that uses the computed gradients to update the model's parameters, iteratively improving its performance. Strict optimizer semantics mean that all GPUs must see a consistent version of the weights for gradient updates.
- GPU Clusters and Hardware:
  - GPU (Graphics Processing Unit): A processor originally designed to accelerate graphics rendering. In deep learning, GPUs are used extensively for their parallel processing capabilities, which are highly efficient for the matrix multiplications and other computations fundamental to neural networks.
  - GPU Cluster: A group of interconnected GPUs (often housed in multi-GPU servers, or nodes) that work together to perform large-scale computations, such as training very large deep learning models.
  - Multi-GPU Server (Node): A single physical server containing multiple GPUs (e.g., 8 NVIDIA A100 GPUs in a DGX A100 system).
  - NVLink and NVSwitch: High-bandwidth, low-latency interconnect technologies developed by NVIDIA for direct GPU-to-GPU communication within a multi-GPU server. NVLink provides direct links between GPUs, and NVSwitch acts as a routing chip that enables all-to-all communication between the GPUs in a server.
  - InfiniBand: A high-speed networking interconnect used in High-Performance Computing (HPC) for inter-server communication (between different multi-GPU servers in a cluster). It offers higher bandwidth and lower latency than standard Ethernet.
  - Floating-Point Operations (FLOPs): A measure of computational workload, referring to the number of floating-point arithmetic operations (additions, multiplications, etc.) performed. TeraFLOP/s (trillions of FLOPs per second) and petaFLOP/s (quadrillions of FLOPs per second) are units of computational speed.
  - Mixed Precision Training: A technique that uses both 16-bit floating-point formats (e.g., FP16 or bfloat16) and 32-bit floating-point (FP32) during training. It reduces memory consumption and speeds up computation on GPUs with specialized Tensor Cores, while keeping FP32 precision for critical steps such as weight updates to prevent numerical instability.
- Distributed Training Concepts:
  - Data Parallelism: The most common approach to distributed training. Each GPU holds a complete copy of the model; the input batch of data is divided among the GPUs, each computes gradients for its portion of the data, and the gradients are then aggregated (e.g., averaged) across all GPUs to update the model parameters.
  - Model Parallelism: When a model is too large to fit on a single GPU, its layers or components are split and distributed across multiple GPUs, allowing training of models larger than a single GPU's memory capacity.
  - Tensor Parallelism (Intra-layer Model Parallelism): A form of model parallelism in which individual layers (specifically, the large matrix multiplications within a layer) are partitioned across multiple GPUs. For example, a large weight matrix might be split into column or row chunks, with each GPU computing a part of the matrix multiplication. This usually requires all-reduce communication to synchronize intermediate results.
  - Pipeline Parallelism (Inter-layer Model Parallelism): A form of model parallelism in which different layers, or sequential groups of layers, are assigned to different GPUs (pipeline stages). Data (in the form of microbatches) flows sequentially through these stages, similar to an assembly line. This helps fit very deep models and can overlap computation with communication.
  - Microbatch: In pipeline parallelism, a large training batch is divided into smaller units called microbatches, which are processed sequentially through the pipeline so that computation can be overlapped across pipeline stages.
  - Pipeline Bubble (or Stall): In pipeline parallelism, the idle time when GPUs in the pipeline are waiting for data or computation from other stages, particularly at the beginning and end of a batch during pipeline warm-up or flush phases. Minimizing this bubble is crucial for efficiency.
  - All-reduce: A collective communication operation in which all participating GPUs contribute data and all receive the final, reduced (e.g., summed or averaged) result. It is used for gradient aggregation in data parallelism and for synchronizing intermediate results in tensor parallelism.
  - Point-to-point communication: A communication operation between exactly two GPUs (a sender and a receiver). It typically has lower overhead than collective operations like all-reduce, making it suitable for the sequential data transfers in pipeline parallelism.
  - Activation Recomputation (or Checkpointing): A memory-saving technique in which, instead of storing all intermediate activations from the forward pass for use in the backward pass, some activations are recomputed during the backward pass. This trades increased computation for a reduced memory footprint.
3.2. Previous Works
The paper builds upon and differentiates itself from several key prior studies, primarily in distributed deep learning:
-
Transformer-based Language Models (e.g., GPT-3, BERT, T5):
- Attention is All You Need [42] (Vaswani et al., 2017): Introduced the Transformer architecture, which uses self-attention mechanisms rather than RNNs or CNNs. This paper is foundational for the models being scaled.
- BERT [13] (Devlin et al., 2018), GPT [33] (Radford et al., 2018), RoBERTa [27] (Liu et al., 2019), T5 [35] (Raffel et al., 2019), GPT-3 [11] (Brown et al., 2020): These works represent the lineage of large Transformer models that demonstrate the power of scaling, thereby motivating the need for efficient large-scale training. The current paper uses GPT models for its experiments.
-
Megatron-LM (Tensor Parallelism):
- Megatron [40] (Shoeybi et al., 2019): Introduced a specific tensor model parallelism strategy for Transformer layers, partitioning the large matrix multiplications (GEMMs) within each Transformer layer across multiple GPUs. This strategy is leveraged in the current paper for intra-node (within a multi-GPU server) parallelization. The current paper extends Megatron-LM by combining this tensor parallelism with pipeline and data parallelism.
-
Pipeline Parallelism Schemes:
- GPipe [20] (Huang et al., 2019): Proposed one of the early pipeline parallelism schedules, in which all forward passes for the microbatches in a batch are completed first, followed by all backward passes. While this reduces pipeline bubbles, it has a high memory footprint because intermediate activations must be stashed for all microbatches in a batch.
- PipeDream-Flush [30] (Narayanan et al., 2020): An improvement over GPipe, this schedule uses a "1F1B" (one forward pass, one backward pass) pattern after an initial warm-up. It significantly reduces the memory footprint by limiting the number of in-flight microbatches (and thus stashed activations) to the number of pipeline stages, instead of the total number of microbatches in a batch. The current paper uses PipeDream-Flush as its "default schedule" and further improves upon it with the interleaved schedule.
- PipeDream [29] (Narayanan et al., 2019): A foundational work that generalized pipeline parallelism for DNN training and combined it with data parallelism. It also explored asynchronous and bounded-staleness approaches.
- TeraPipe [26] (Li et al., 2021): Explores fine-grained pipeline parallelism across tokens within a single training sequence for auto-regressive models like GPT.
- PipeTransformer [19] (He et al., 2021): Elastically adjusts the degree of pipelining and data parallelism by freezing "stable" layers and training only the "active" ones.
- HetPipe [31] (Park et al., 2020): Combines pipeline and data parallelism on heterogeneous GPU clusters.
- PipeMare [45] (Yang et al., 2021) and Kosson et al. [23] (2021): Focus on asynchronous pipeline parallelism or relaxed weight-update semantics to improve throughput by eliminating flushes, potentially at the cost of convergence or final accuracy. The current paper instead maintains strict optimizer semantics.
-
Sharded Data Parallelism (ZeRO):
- ZeRO [36, 37] (Rajbhandari et al., 2019; Rajbhandari et al., 2021): A family of memory optimization techniques. ZeRO (Zero Redundancy Optimizer) shards the optimizer state, gradients, and eventually the model parameters across data-parallel workers. This reduces the per-GPU memory footprint without extra communication over vanilla data parallelism (for ZeRO-1/2), or with additional, often hidden, communication (for ZeRO-3). ZeRO-Infinity [37] extends this by using NVMe storage to swap parameters to disk, enabling training of very large models on a small number of GPUs, though with potentially very long training times. The current paper compares its PTD-P approach to ZeRO-3 and finds its method superior for large models due to less cross-node communication.
-
Automatic Partitioning:
- FlexFlow [22] (Jia et al., 2018), PipeDream [29] (Narayanan et al., 2019), DAPPLE [14] (Fan et al., 2021), and Tarnawski et al. [41] (2020): These systems explore automatic partitioning of DNN training graphs over multiple devices using cost models. The current paper instead suggests heuristics rather than automatic exploration, acknowledging the increased complexity once more parallelism dimensions are involved.
-
HPC for Model Training:
- Goyal et al. [17] (2017) and You et al. [47] (2018): Demonstrated HPC techniques for training ImageNet models quickly. Those models, however, are typically smaller, fit on a single accelerator, use very large batch sizes that allow extensive data parallelism, and are CNN-based, which is inherently amenable to data-parallel communication.
3.3. Technological Evolution
The field of large-scale deep learning, particularly NLP, has seen an exponential growth in model size, as illustrated in Figure 1 of the paper. Initially, models could be trained on a single GPU. As models grew, data parallelism became the dominant strategy, allowing multiple GPUs to train a single model by distributing data. However, for models with billions of parameters, data parallelism alone became insufficient due to GPU memory limits (models wouldn't fit) and scaling constraints (max number of GPUs is limited by batch size).
This necessitated the development of model parallelism techniques:
- Early Model Parallelism: Involved manually splitting layers or operations across GPUs.
- Tensor Parallelism (intra-layer): Megatron-LM [40] emerged as a prominent solution, partitioning the matrix multiplications within a single layer across the GPUs of a multi-GPU server (e.g., a DGX A100). This was efficient thanks to high-bandwidth NVLink interconnects.
- Pipeline Parallelism (inter-layer): Techniques like GPipe [20] and PipeDream [29] distributed entire layers across different GPUs, allowing deeper models to be trained. PipeDream-Flush [30] improved memory efficiency. These approaches generally use point-to-point communication between pipeline stages.
- Sharded Data Parallelism (ZeRO): Approaches like ZeRO [36, 37] further optimized data parallelism by sharding optimizer states, gradients, and even parameters across data-parallel workers, significantly reducing the per-GPU memory footprint.

The current paper's work (the Megatron-LM extension) represents the next evolutionary step: a sophisticated composition of these techniques (PTD-P). It recognizes that no single parallelism strategy is optimal at every scale. Instead, it combines the strengths of tensor parallelism (for efficient intra-node computation), pipeline parallelism (for scaling inter-node across layers), and data parallelism (for overall scale-out when memory allows), while introducing new scheduling and communication optimizations to mitigate their individual weaknesses (e.g., pipeline bubbles and cross-node all-reduce overheads).
3.4. Differentiation Analysis
Compared to the main methods in related work, the core differences and innovations of this paper's PTD-P approach are:
- Holistic Composition of Parallelism (PTD-P): Unlike many prior works that focus on one or two forms of parallelism in isolation (e.g., Megatron with tensor parallelism, PipeDream with pipeline and data parallelism, ZeRO with sharded data parallelism), this paper rigorously studies and optimizes the composition of all three major forms: tensor, pipeline, and data parallelism. This allows scaling to truly unprecedented model sizes (a trillion parameters) on thousands of GPUs. The paper emphasizes that neither tensor model parallelism nor pipeline model parallelism in isolation can match the performance of using both techniques in conjunction.
- Optimized Hierarchical Parallelism Strategy: The PTD-P strategy explicitly leverages the hardware topology:
  - Tensor parallelism is used primarily intra-node (within a multi-GPU server), where high-bandwidth NVLink makes its all-reduce communication efficient.
  - Pipeline parallelism is used inter-node (across multi-GPU servers), where its point-to-point communication is less sensitive to the slower InfiniBand links than all-reduce would be.
  - Data parallelism then scales the combined model-parallel replica out to more GPUs. This hierarchical approach is key to the method's superior scaling.
- Novel Interleaved Pipeline Parallelism Schedule: The paper proposes a new interleaved pipeline schedule that significantly reduces the pipeline bubble (idle time) compared to the widely used PipeDream-Flush schedule (a 1F1B approach). It achieves 10%+ higher throughput by causing pipeline flushes to complete sooner, with a comparable memory footprint. This is a direct improvement over existing pipeline parallelism techniques that maintain strict optimizer semantics.
- Quantitative Trade-off Analysis and Guiding Principles: The paper provides a detailed quantitative and analytical study of how the parallelization dimensions (p, t, d), the microbatch size, and activation recomputation interact and affect throughput, memory footprint, and communication overhead. It offers concrete "takeaways", or guiding principles, for configuring distributed training, which is crucial for practical deployment but often missing from theoretical work.
- Specialized Communication Optimization (Scatter/Gather): The scatter/gather communication optimization specifically addresses a redundancy that arises when tensor and pipeline parallelism are combined. By scattering the replicated outputs of the tensor-parallel ranks across the InfiniBand cards and then all-gathering them over fast NVLink on the receiving node, it reduces the cross-node communication volume by a factor equal to the tensor-parallel size (e.g., 8x for t = 8). This makes the interleaved schedule more feasible and improves overall throughput.
- Hardware-Software Co-design and Engineering Excellence: The paper highlights the importance of efficient kernel implementations, operator fusion, and careful data layout specific to the Transformer architecture to ensure computation remains compute-bound on A100 GPUs. This deep engineering, combined with optimized hardware (the Selene supercomputer with A100 GPUs, NVLink, NVSwitch, and InfiniBand), yields performance significantly higher than comparable systems such as DeepSpeed or ZeRO-3 (e.g., 52% of peak versus 36% of peak for DeepSpeed [2] on trillion-parameter-scale models).
4. Methodology
4.1. Principles
The core idea of the method used in this paper is PTD-P, which stands for Pipeline, Tensor, and Data Parallelism. It's a comprehensive strategy for efficiently training extremely large language models (LLMs) on GPU clusters. The theoretical basis or intuition behind PTD-P is to intelligently combine the strengths of different parallelism types while mitigating their individual weaknesses, especially concerning GPU memory capacity, computation time, and communication overhead in a hierarchical cluster environment.
The key principles are:
- Memory Management: Models with trillions of parameters cannot fit on a single GPU or even a single multi-GPU server. PTD-P uses tensor parallelism and pipeline parallelism to shard the model's parameters and intermediate activations across many GPUs, ensuring the model can be held in distributed memory. Activation recomputation further helps manage activation memory.
- Maximize GPU Utilization and Minimize Idle Time: Distributed training inherently introduces idle time (e.g., pipeline bubbles) and communication overhead. The goal is to keep GPUs as busy as possible performing useful computation. The interleaved pipeline schedule is designed to reduce pipeline bubbles, while communication optimizations (scatter/gather) and computation optimizations (fused kernels, data layout) keep computation compute-bound and prevent communication from becoming a bottleneck.
- Optimal Communication Strategy: The hierarchical nature of GPU clusters (fast intra-server NVLink, slower inter-server InfiniBand) dictates where communication-intensive operations should be placed. Tensor parallelism, which involves frequent all-reduce operations, is best kept intra-server. Pipeline parallelism, relying on point-to-point communication, is more suitable for inter-server scaling. Data parallelism, which typically involves less frequent but larger all-reduce operations, is used for overall scale-out.
- Scalability: The composition allows scaling to thousands of GPUs and trillion-parameter models, addressing the limitations of single parallelism techniques that may not scale beyond certain model sizes or device counts.
- Strict Optimizer Semantics: The paper prioritizes maintaining strict optimizer semantics, meaning all GPUs see consistent weight versions, which is crucial for predictable convergence and accuracy. This necessitates mechanisms like pipeline flushes, but the system is designed to minimize their impact.

In essence, PTD-P is about finding the optimal balance between memory footprint, device utilization (reducing pipeline bubbles), and communication overhead by strategically applying and optimizing multiple forms of parallelism across a GPU cluster's hardware hierarchy.
4.2. Core Methodology In-depth (Layer by Layer)
The PTD-P methodology combines data parallelism, pipeline model parallelism, and tensor model parallelism. The implementation extends the Megatron-LM codebase using PyTorch.
4.2.1. Modes of Parallelism
The paper combines pipeline model parallelism and tensor model parallelism with data parallelism, which they call PTD-P. The overall structure of how tensor and pipeline parallelism are combined is visualized in Figure 2.
Figure 2: Schematic of the combined tensor model parallelism (Tensor MP) and pipeline model parallelism (Pipeline MP) partitioning of two Transformer layers, illustrating how the two partitioning schemes compose during model training.
4.2.1.1. Data Parallelism
With data parallelism [25, 43], each worker (which could be a GPU or a model-parallel group of GPUs) typically has a copy of the full model. The input dataset is sharded, meaning each worker receives a different subset of the data for its local computation. After processing its local data and computing gradients, workers periodically aggregate their gradients (e.g., using all-reduce) to ensure that all workers maintain a consistent version of the model's weights.
For very large models that do not fit entirely on a single worker, data parallelism can be used on smaller model shards. This means that if a model is too large for one GPU, it must first be split across multiple GPUs using model parallelism (tensor or pipeline), and then multiple copies of this model-parallel group can be run in data-parallel fashion.
4.2.1.2. Pipeline Model Parallelism
In pipeline parallelism, the layers of a model are sharded across multiple devices. For models with repetitive structures, such as Transformer blocks, each device can be assigned an equal number of layers. The paper focuses on such symmetric architectures.
A key aspect of pipeline parallelism is that a batch of data is split into smaller units called microbatches. These microbatches are then processed in a pipelined fashion across the devices. To ensure that optimizer steps are synchronized and inputs see consistent weight versions across forward and backward passes (maintaining strict optimizer semantics), the paper introduces periodic pipeline flushes. During a pipeline flush, no new microbatches are injected, and all currently in-flight microbatches are allowed to complete their forward and backward passes. This leads to periods where devices are idle, known as the pipeline bubble. The goal is to minimize this pipeline bubble to improve efficiency.
The paper discusses two approaches to scheduling forward and backward microbatches:
4.2.1.2.1. Default Schedule (PipeDream-Flush)
The GPipe schedule [20] first executes all forward passes for all microbatches in a batch, followed by all backward passes. This strategy, however, requires a high memory footprint because intermediate activations for all microbatches (or at least input activations for activation recomputation) must be stored in memory throughout the entire training iteration.
Instead, the paper uses the PipeDream-Flush schedule [30], which is more memory-efficient. This schedule operates in three phases, as shown in Figure 3.
Figure 3: Default (PipeDream-Flush) pipeline schedule across four devices. The timeline shows each device performing forward passes (blue) and backward passes (green) at different time steps, with devices idle during the pipeline flush.
- Warm-up Phase: Workers perform differing numbers of forward passes to fill the pipeline.
- Steady State (1F1B): After warm-up, each worker enters a steady state in which it performs one forward pass (1F) immediately followed by one backward pass (1B) for a microbatch. This limits the number of in-flight microbatches (those whose backward pass is outstanding and whose activations must be retained) to the depth of the pipeline ($p$), rather than the total number of microbatches ($m$) in a batch.
- Cool-down Phase: At the end of a batch, the backward passes for all remaining in-flight microbatches are completed.

The pipeline bubble in this schedule is quantified as follows. Let $m$ be the number of microbatches in a batch, $p$ the number of pipeline stages (devices used for pipeline parallelism), $t_{id}$ the ideal time per iteration (assuming perfect scaling), and $t_f$ and $t_b$ the times to execute a single microbatch's forward and backward pass, respectively. The total time spent in the pipeline bubble is:
$
t_{pb} = (p - 1) \cdot (t_f + t_b)
$
The ideal processing time for the batch is:
$
t_{id} = m \cdot (t_f + t_b)
$
Therefore, the fraction of ideal computation time spent in the pipeline bubble (the pipeline bubble size) is:
$
\text{Bubble time fraction} = \frac{t_{pb}}{t_{id}} = \frac{p - 1}{m}
$

- $p$: Number of pipeline stages (devices).
- $m$: Number of microbatches in a batch per pipeline.
- $t_{pb}$: Time spent in the pipeline bubble.
- $t_{id}$: Ideal processing time for the batch.
- $t_f$: Time for a single microbatch's forward pass.
- $t_b$: Time for a single microbatch's backward pass.

For the bubble time fraction to be small, it is necessary that $m \gg p$. While the PipeDream-Flush schedule has the same bubble time fraction as GPipe, its primary advantage is a significantly reduced memory footprint: activations need to be stashed for at most $p$ in-flight microbatches (compared to $m$ for GPipe), making it much more memory-efficient when $m \gg p$.
4.2.1.2.2. Schedule with Interleaved Stages
To further reduce the pipeline bubble size, the paper proposes a novel interleaved schedule. Instead of assigning a single contiguous set of layers to each device, each device performs computation for multiple subsets of layers, called model chunks ($v$ chunks, or stages, per device). For example, if a device previously handled layers 1-4, it could now handle layers 1, 2, 9, and 10, i.e., two model chunks. Each device in the pipeline is thus assigned multiple pipeline stages, where each stage now has less computation than before.
Similar to PipeDream-Flush, this interleaved schedule adapts the memory-efficient 1F1B pattern. This new schedule, shown in Figure 4, requires the number of microbatches in a batch to be an integer multiple of the degree of pipeline parallelism (number of devices in the pipeline).
Figure 4: Interleaved pipeline schedule. The top timeline shows the traditional assignment of a single contiguous stage per device; the bottom shows each device handling multiple smaller stages (model chunks), which improves device utilization.
As depicted in Figure 4, the pipeline flush for the same batch size occurs sooner in the interleaved schedule. If each device has $v$ stages (model chunks), the forward and backward pass times for a microbatch for each stage become $t_f / v$ and $t_b / v$, respectively.
The pipeline bubble time is thus reduced to:
$
t_{pb}^{\mathrm{int.}} = \frac{(p - 1) \cdot (t_f + t_b)}{v}
$
And the bubble time fraction becomes:
$
\text{Bubble time fraction (interleaved)} = \frac{t_{pb}^{\mathrm{int.}}}{t_{id}} = \frac{1}{v} \cdot \frac{p - 1}{m}
$

- $p$: Number of pipeline stages (devices).
- $m$: Number of microbatches in a batch per pipeline.
- $v$: Number of model chunks or stages assigned to each device.
- $t_{pb}^{\mathrm{int.}}$: Interleaved pipeline bubble time.
- $t_{id}$: Ideal processing time for the batch.

This means the new schedule reduces the bubble time by a factor of $v$. However, this reduction comes at the cost of increased communication, also by a factor of $v$, because data must be exchanged between pipeline stages more frequently due to the finer-grained chunking. The paper discusses how to mitigate this extra communication using the scatter/gather communication optimization. A small numeric comparison of the two schedules' bubble fractions is sketched below.
4.2.1.3. Tensor Model Parallelism
Tensor model parallelism partitions individual layers of the model over multiple devices. The paper specifically uses the partitioning strategy from Megatron [40] for Transformer layers.
A Transformer layer fundamentally consists of a self-attention block followed by a two-layer Multi-Layer Perceptron (MLP).
- MLP Block Partitioning: The MLP block involves two General Matrix Multiplications (GEMMs) and a GeLU non-linearity:
  $
  Y = \mathrm{GeLU}(XA), \qquad Z = \mathrm{Dropout}(YB)
  $
  Here, $X$ is the input and $A$ and $B$ are weight matrices.
  - Splitting $A$: The first weight matrix is split along its columns, $A = [A_1, A_2]$. This allows the GeLU non-linearity to be applied independently to the output of each partitioned GEMM on different GPUs:
    $
    [Y_1, Y_2] = [\mathrm{GeLU}(XA_1), \mathrm{GeLU}(XA_2)]
    $
    This column-wise partitioning of $A$ is advantageous because GeLU is non-linear: if $A$ were split along its rows, an all-reduce would be needed before applying GeLU, introducing extra communication. By splitting columns, GeLU can be applied locally, removing that synchronization point.
  - Splitting $B$: The second weight matrix is then split along its rows, $B = \begin{bmatrix} B_1 \\ B_2 \end{bmatrix}$, to remove the need for communication between the two GEMMs. With $Y = [Y_1, Y_2]$ the concatenated output of the first GEMM and GeLU, the second GEMM can be computed in parallel as $Z = Y_1 B_1 + Y_2 B_2$, where each GPU computes one term of the sum. The partial results are then summed across the GPUs with an all-reduce operation (the $g$ operator in Figure 5a) before the Dropout layer.
- Self-Attention Block Partitioning: The multi-head attention operation is inherently parallel. The Key ($K$), Query ($Q$), and Value ($V$) matrices, which are derived from the input through linear transformations, can be partitioned in a column-parallel fashion. After computing attention on these partitioned matrices, the output linear layer operates directly on the partitioned output of the attention operation, with its weight matrix partitioned across rows.
This tensor model parallelism approach, as illustrated in Figure 5, requires only two all-reduce operations in the forward pass (performed by the $g$ operators) and two all-reduce operations in the backward pass (performed by the $f$ operators) per Transformer layer. The operators $f$ and $g$ are conjugates: $f$ is the identity in the forward pass and an all-reduce in the backward pass, while $g$ is the reverse (an all-reduce in the forward pass and the identity in the backward pass). A minimal sketch of this column/row split appears after Figure 5.
Figure 5: Blocks of the Transformer model partitioned with tensor model parallelism (figures borrowed from Megatron [40]). $f$ and $g$ are conjugate operators: $f$ is the identity in the forward pass and an all-reduce in the backward pass, while $g$ is the reverse.
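The following is a minimal, single-file sketch of the column/row split described above. The class names mirror Megatron-LM's terminology, but this is an illustrative assumption rather than the library's actual implementation; it presumes a tensor-parallel process group has been set up elsewhere when run distributed.

```python
import torch
import torch.nn.functional as F
import torch.distributed as dist

class ColumnParallelLinear(torch.nn.Module):
    """Holds one column shard A_i of the weight A; GeLU is applied locally to X @ A_i."""
    def __init__(self, in_features: int, shard_out_features: int):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.empty(in_features, shard_out_features))
        torch.nn.init.normal_(self.weight, std=0.02)

    def forward(self, x):
        # No communication needed: the non-linearity acts on each column shard independently.
        return F.gelu(x @ self.weight)

class RowParallelLinear(torch.nn.Module):
    """Holds one row shard B_i of the weight B; partial outputs are summed with an all-reduce."""
    def __init__(self, shard_in_features: int, out_features: int, tp_group=None):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.empty(shard_in_features, out_features))
        torch.nn.init.normal_(self.weight, std=0.02)
        self.tp_group = tp_group

    def forward(self, y_shard):
        z_partial = y_shard @ self.weight
        if dist.is_available() and dist.is_initialized():
            # The g operator of Figure 5: all-reduce of the partial sums in the forward pass.
            dist.all_reduce(z_partial, group=self.tp_group)
        return z_partial
```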
4.2.2. Performance Analysis of Parallelization Configurations
This section analyzes the performance implications of combining pipeline and tensor model parallelism with data parallelism, discussing trade-offs between memory footprint, device utilization, and communication.
4.2.2.1. Notation
- (p, t, d): Parallelization dimensions.
  - p: Pipeline-model-parallel size (number of devices for pipeline parallelism).
  - t: Tensor-model-parallel size (number of devices for tensor parallelism).
  - d: Data-parallel size (number of devices for data parallelism).
- n: Total number of GPUs; the product p · t · d = n.
- B: Global batch size (provided as input).
- b: Microbatch size.
- m = B / (b · d): Number of microbatches in a batch per pipeline.

A quick numeric check of these relationships follows.
4.2.2.2. Tensor and Pipeline Model Parallelism
The paper considers the interactions between tensor and pipeline model parallelism. With data-parallel size $d = 1$, the total number of GPUs is $n = t \cdot p$. The pipeline bubble size (from Section 4.2.1.2.1), expressed in terms of $t$, is:
$
\frac{p - 1}{m} = \frac{n/t - 1}{m}
$
As $t$ (the tensor-model-parallel size) increases, the pipeline-parallel size $p = n/t$ decreases, so the term $(n/t - 1)/m$ decreases, and thus the pipeline bubble shrinks for fixed global batch size $B$, microbatch size $b$, and data-parallel size $d$ (since $m = B/(b \cdot d)$ is then fixed). This suggests that increasing tensor parallelism can reduce pipeline bubble overhead.
However, communication costs differ significantly:
- Pipeline Model Parallelism: Features cheaper point-to-point communication. For each microbatch, the total communication between consecutive devices (for either the forward or the backward pass) is $b \cdot s \cdot h$, where $s$ is the sequence length and $h$ is the hidden size.
- Tensor Model Parallelism: Uses expensive all-reduce communication. For each microbatch and each layer, tensors of total size $b \cdot s \cdot h$ must be all-reduced among the $t$ model replicas twice in the forward pass and twice in the backward pass. This leads to a total communication of $8 \cdot b \cdot s \cdot h \cdot \frac{t-1}{t}$ per layer per device per microbatch. If a pipeline stage has $l^{\text{stage}}$ layers, the total tensor-parallel communication per device per microbatch is $l^{\text{stage}} \cdot 8 \cdot b \cdot s \cdot h \cdot \frac{t-1}{t}$.

This means that tensor model parallelism significantly increases the amount of communication between devices. If $t$ is larger than the number of GPUs within a single multi-GPU server, this all-reduce communication has to traverse the slower inter-node links (e.g., InfiniBand), leading to a substantial performance bottleneck.
Takeaway #1: When considering different forms of model parallelism, tensor model parallelism should generally be used up to the degree equal to the number of GPUs within a multi-GPU server (e.g., 8 on a DGX A100). Beyond this, pipeline model parallelism should be used to scale up to larger models across different servers, leveraging its more efficient point-to-point communication.
4.2.2.3. Data and Model Parallelism
This section considers the interaction between data parallelism and model parallelism (pipeline and tensor) independently.
4.2.2.3.1. Pipeline Model Parallelism
Assume the tensor-model-parallel size is $t = 1$. The number of microbatches per pipeline is then $m = B/(d \cdot b)$. Let $b' = B/b$ be the ratio of the global batch size to the microbatch size, so that $m = b'/d$. With a total of $n$ GPUs, the number of pipeline stages is $p = n/(t \cdot d) = n/d$.
The pipeline bubble size (from Section 4.2.1.2.1) is then:
$
\frac{p - 1}{m} = \frac{n/d - 1}{b'/d} = \frac{n - d}{b'}
$
- $p$: Number of pipeline stages.
- $m$: Number of microbatches per pipeline.
- $n$: Total number of GPUs.
- $d$: Data-parallel size.
- $b'$: Ratio of global batch size to microbatch size ($B/b$).

As $d$ (the data-parallel size) increases, $(n - d)/b'$ becomes smaller, which in turn reduces the pipeline bubble size. Figure 6 illustrates this behavior.
Figure 6: Fraction of time spent idling due to pipeline flushes (pipeline bubble size) versus data-parallel size $d$, for different numbers of GPUs $n$ and ratios of batch size to microbatch size $b' = B/b$.
Overall throughput will increase if the all-reduce communication required for data parallelism does not grow drastically with larger $d$. This is generally true for ring-based implementations, where the communication time scales with $(d - 1)/d$ and is thus largely independent of $d$.
Increasing the global batch size $B$ also affects throughput. For a given parallel configuration, as $B$ increases, $b' = B/b$ increases, so the term $(n - d)/b'$ decreases, reducing the pipeline bubble fraction and increasing throughput. Additionally, the all-reduce communication for data parallelism becomes less frequent, further boosting throughput. A small numeric illustration follows.
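The sketch below evaluates the $(n - d)/b'$ expression with hypothetical values of $n$, $d$, and $b'$ to reproduce the qualitative trend shown in Figure 6:

```python
# Bubble fraction with data parallelism (t = 1): (n - d) / b'.
def bubble_fraction_data_parallel(n: int, d: int, b_prime: int) -> float:
    return (n - d) / b_prime

n = 128
for b_prime in (128, 512):
    fracs = [round(bubble_fraction_data_parallel(n, d, b_prime), 3) for d in (2, 8, 32)]
    print(f"b'={b_prime}: {fracs}")
# b'=128: [0.984, 0.938, 0.75]   b'=512: [0.246, 0.234, 0.188]
```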
4.2.2.3.2. Data and Tensor Model Parallelism
With tensor model parallelism, all-reduce communication must be performed for every microbatch within each layer. This can be very expensive, especially across multi-GPU servers using InfiniBand.
In contrast, data parallelism only needs to perform its expensive all-reduce communication once per batch (for the gradients). Furthermore, as the tensor-model-parallel size $t$ increases, each model-parallel rank (GPU) performs a smaller subset of the computation within each model layer. For insufficiently large layers (e.g., small hidden sizes or sequence lengths), these smaller sub-matrix multiplications (GEMMs) may not run at peak efficiency on modern GPUs, leading to lower GPU utilization.
Takeaway #2: When combining data and model parallelism, a total model-parallel size of $M = t \cdot p$ should be chosen such that the model's parameters and intermediate metadata (e.g., activations) fit in GPU memory. Once this memory constraint is met, data parallelism should be used to scale training up to more GPUs, since it generally incurs less frequent (and, across nodes, more efficient) collective communication than tensor parallelism.
4.2.2.4. Microbatch Size
The choice of microbatch size is a critical hyperparameter affecting model training throughput. It impacts both the arithmetic intensity of GPU kernels and the pipeline bubble size. Figure 7 illustrates how per-GPU throughput can increase significantly with a larger microbatch size on a single GPU.
Figure 7: Per-GPU throughput versus microbatch size for a GPT model with a billion parameters (128 attention heads, hidden size of 4096, 4 transformer layers).
The optimal microbatch size $b$ must be determined for a given parallel configuration $(p, t, d)$ and global batch size $B$. The amount of data-parallel communication is the same regardless of the microbatch size.
Given functions $t_f(b)$ and $t_b(b)$ that map the microbatch size to the forward and backward computation times for a single microbatch, the total time spent computing a batch, ignoring communication cost, is:
$
\left( m + p - 1 \right) \cdot \left( t_f(b) + t_b(b) \right), \qquad m = \frac{B}{d \cdot b}
$

- $m = B/(d \cdot b)$: Number of microbatches processed by a single pipeline over the global batch.
- $p$: Pipeline-model-parallel size.
- $b$: Microbatch size.
- $t_f(b)$: Forward-pass computation time for a microbatch of size $b$.
- $t_b(b)$: Backward-pass computation time for a microbatch of size $b$.

The first factor, $m + p - 1$, is the effective number of microbatch processing steps, which incorporates the pipeline bubble overhead. The microbatch size therefore influences both the efficiency of GPU kernel execution (larger microbatches usually yield better arithmetic intensity and GPU utilization) and the pipeline bubble size (smaller $m$ leads to a larger bubble fraction). These two factors are at odds.
Figure 8 shows the estimated throughput (using the above equation) for a GPT model with a billion parameters.
Figure 8: Behavior of the normalized estimated throughput (time computed using the expression above) with respect to the microbatch size, for the same GPT model as in Figure 7.
The plot indicates an optimal microbatch size of 4 for both batch sizes in this specific example. The optimal value is problem-dependent.
Takeaway #3: The optimal microbatch size $b$ is a complex balance that depends on the model's throughput and memory-footprint characteristics, as well as on the pipeline depth $p$, the data-parallel size $d$, and the global batch size $B$. It is not a one-size-fits-all value.
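One way to make this concrete is to sweep candidate microbatch sizes against the batch-time expression above. The timing model $t_f(b)$, $t_b(b)$ in the sketch below is invented for illustration and is not measured data from the paper.

```python
# Pick the microbatch size b that minimizes (m + p - 1) * (t_f(b) + t_b(b)), m = B / (d * b).
def batch_time(B, d, p, b, t_f, t_b):
    m = B // (d * b)                      # microbatches per pipeline
    return (m + p - 1) * (t_f(b) + t_b(b))

# Hypothetical timing model: larger microbatches amortize per-kernel overheads.
t_f = lambda b: 1.0 + 0.9 * b             # ms, assumed
t_b = lambda b: 2.0 + 1.8 * b             # ms, assumed

B, d, p = 1024, 8, 8
best = min((b for b in (1, 2, 4, 8) if (B // d) % b == 0),
           key=lambda b: batch_time(B, d, p, b, t_f, t_b))
print(best, batch_time(B, d, p, best, t_f, t_b))   # best microbatch size and batch time
```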
4.2.2.5. Activation Recomputation
Activation recomputation [12, 18, 20, 21] (also known as gradient checkpointing) is an optional memory-saving technique. During the forward pass, instead of storing all intermediate activations needed for the backward pass, only a subset (or just the input activations for each pipeline stage) is stored. The remaining activations are recomputed during the backward pass. This trades off increased computation (FLOPs) for reduced memory footprint.
If a model stage has $l$ layers and activation checkpoints are taken at $c$ points within that stage, the total memory footprint for activations is approximately:
$
c \cdot A^{\text{input}} + \frac{l}{c} \cdot A^{\text{intermediate}}
$

- $A^{\text{input}}$: Size of the input activations of a layer.
- $A^{\text{intermediate}}$: Size of the intermediate activations within a layer.
- $l$: Number of layers in a model stage.
- $c$: Number of checkpoints (recomputation points) within the stage.

The number of checkpoints that minimizes the memory footprint is obtained at $c = \sqrt{l \cdot A^{\text{intermediate}} / A^{\text{input}}}$. In practice, the authors find that checkpointing every 1 or 2 Transformer layers is close to optimal. While activation recomputation increases the total FLOPs (parts of the forward pass are effectively run twice), it is often required to train very large models that would otherwise exceed GPU memory. A minimal PyTorch illustration follows.
Other techniques like activation partitioning [36] can also be used with tensor model parallelism for further memory footprint reduction.
4.2.3. Implementation
The PTD-P system is implemented as an extension to the Megatron-LM codebase, built using PyTorch [32]. NCCL [7] is used for inter-device communication. To achieve high performance, specific communication and computation optimizations were implemented.
4.2.3.1. Communication Optimizations
When combining pipeline and tensor model parallelism, a key challenge arises from cross-node communication. Each DGX A100 server has 8 InfiniBand (IB) networking cards. Point-to-point communication in pipeline parallelism typically occurs between a single pair of GPUs on adjacent servers, making it hard to utilize all 8 IB cards for a single logical communication.
The paper introduces a scatter/gather communication optimization to address this, leveraging the fact that the outputs of the tensor-parallel ranks are replicated. Specifically, after the all-reduce in tensor parallelism (e.g., after the $g$ operator in the MLP block), the output tensor is replicated across all tensor-parallel ranks within a multi-GPU server.
- Redundancy: If this replicated tensor were simply sent from each of the $t$ tensor-parallel ranks on the sending node to its corresponding rank on the next pipeline-stage node, $t$ redundant copies of the same tensor would be sent over the inter-node InfiniBand links (as shown in Figure 9a).
- Scatter/Gather Optimization: Instead, the optimization works as follows (Figure 9b):
  - Sender Side: The replicated tensor on the sending multi-GPU server is split (scattered) into $t$ equal-sized chunks. Each tensor-parallel rank then sends only one unique chunk to its corresponding rank on the next pipeline-stage node, using its dedicated InfiniBand card. This reduces the amount of data sent over the slower InfiniBand links by a factor of $t$.
  - Receiver Side: On the receiving node, after each tensor-parallel rank receives its chunk, an all-gather is performed over the high-bandwidth NVLink (much faster than InfiniBand) to re-materialize the full tensor on every tensor-parallel rank.

(a) Without scatter/gather optimization. (b) With scatter/gather optimization. Figure 9: Scatter/gather communication optimization. Light blue blocks are layers in the first pipeline stage, and dark blue blocks are layers in the second pipeline stage. Without the optimization, the same tensor is sent redundantly over inter-node InfiniBand links. With it, the sender scatters the tensor into smaller chunks, reducing the sizes of the tensors sent over InfiniBand; the full tensor is then rematerialized at the receiver with a gather (all-gather) operation.
Quantitatively, with this optimization the total amount of communication between every pair of consecutive pipeline stages is reduced from $b \cdot s \cdot h$ to $\frac{b \cdot s \cdot h}{t}$, where $t$ is the tensor-model-parallel size (8 in their experiments), $s$ is the sequence length, and $h$ is the hidden size. This optimization effectively utilizes multiple InfiniBand cards and makes communication-intensive schedules like the interleaved one more feasible. A minimal sketch of the mechanism is shown below.
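The following is a minimal, assumption-level sketch of the scatter/gather pattern using torch.distributed point-to-point and collective calls. The helper names (send_scattered, recv_gathered), shapes, and process-group setup are invented for illustration; this is not Megatron-LM's implementation.

```python
import torch
import torch.distributed as dist

def send_scattered(activation: torch.Tensor, tp_rank: int, tp_size: int, dst: int):
    # Sender side: slice the replicated [s, b, h] activation along the hidden dimension
    # and ship only this rank's 1/t chunk across the inter-node (InfiniBand) link.
    chunk = activation.chunk(tp_size, dim=-1)[tp_rank].contiguous()
    dist.send(chunk, dst=dst)
    return chunk.shape

def recv_gathered(chunk_shape, src: int, tp_group) -> torch.Tensor:
    # Receiver side: receive one chunk, then all-gather over the intra-node
    # tensor-parallel group (NVLink) to rematerialize the full activation on every rank.
    chunk = torch.empty(chunk_shape)
    dist.recv(chunk, src=src)
    gathered = [torch.empty_like(chunk) for _ in range(dist.get_world_size(tp_group))]
    dist.all_gather(gathered, chunk, group=tp_group)
    return torch.cat(gathered, dim=-1)
```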
4.2.3.2. Computation Optimizations
The authors implemented three model-specific optimizations to the computation graph to achieve high performance:
- Data Layout Change: The data layout in the Transformer layer was changed from [b, s, a, h] (batch, sequence, attention-head, hidden-size) to [s, b, a, h]. This avoids memory-intensive transpose operations and enables the use of strided batched GEMM kernels, which are more efficient.
- Fused Kernels: For sequences of element-wise operations (e.g., bias + GeLU and bias + dropout + add), custom fused kernels were generated using PyTorch JIT [10]. Operator fusion reduces memory-bandwidth usage by combining multiple operations into a single kernel, cutting the number of times data must be loaded from and stored to GPU memory. A hedged example of this idea appears below.
- Custom Kernels for Attention: Two custom kernels were created to fuse the scale, mask, and softmax (reduction) operations within the attention block. One supports general masking (for models like BERT) and the other supports implicit causal masking (for auto-regressive models like GPT). This also improves efficiency by reducing memory accesses and kernel-launch overheads.
4.3. Performance Analysis of Parallelization Configurations (Continued)
4.3.1. Activation Recomputation (Continued)
The total number of FLOPs per iteration, $F$, for a Transformer model with $l$ Transformer layers, hidden size $h$, sequence length $s$, vocabulary size $V$, and training batch size $B$, accounting for activation recomputation (the forward pass is effectively run twice: once normally and once for recomputation during the backward pass) and for the logit layer, is:
$
F = 96 \cdot B \cdot s \cdot l \cdot h^2 \left( 1 + \frac{s}{6h} + \frac{V}{16 \cdot l \cdot h} \right)
$

- $F$: Total number of floating-point operations per iteration.
- $B$: Batch size.
- $s$: Sequence length.
- $l$: Number of Transformer layers.
- $h$: Hidden size.
- $V$: Vocabulary size.

This formula is based on counting the GEMMs in the Transformer and logit layers, assuming mixed precision. The leading factor accounts for the forward and backward passes of the Transformer layers with activation recomputation, while the $s/(6h)$ and $V/(16\,l\,h)$ terms account for the self-attention-specific FLOPs and the logit-layer contribution, respectively. The formula is easy to evaluate directly, as shown below.
5. Experimental Setup
5.1. Datasets
The experiments in this paper primarily use GPT models of varying sizes, which are Transformer-based language models. The paper does not specify a particular training dataset (e.g., C4, Pile) but focuses on the model architectures and their scaling properties.
- Model Characteristics:
  - Vocabulary Size ($V$): All models use a vocabulary size of 51,200 (a multiple of 1024).
  - Sequence Length ($s$): All models use a sequence length of 2048.
  - Varying Parameters: The authors vary the hidden size ($h$), number of attention heads, and number of layers ($l$) to create models ranging from 1 billion to 1 trillion parameters.
- Data Sample: While no explicit data sample is provided, GPT models are trained on vast text corpora. A typical data sample would be a sequence of tokens (words or sub-word units) from natural language text, e.g., "The quick brown fox jumps over the lazy dog." The model learns to predict the next token in the sequence.
- Choice of Datasets (Implicit): The use of GPT models is highly effective for validating the method's performance because these models are known for their extremely large scale and computational demands, making them ideal benchmarks for distributed training efficiency. The paper's goal is to enable training of these large models, so using them directly as the experimental subject is appropriate.

The number of parameters ($P$) in a model is computed as:

$ P = 12 l h^2 \left( 1 + \frac{13}{12h} + \frac{V + s}{12lh} \right) $

- $P$: Total number of parameters in the GPT model.
- $l$: Number of Transformer layers.
- $h$: Hidden size.
- $V$: Vocabulary size.
- $s$: Sequence length.

This formula approximates the total parameters, accounting for the embedding layer, Transformer layers (which contain attention and MLP blocks), and the logit layer.
5.2. Evaluation Metrics
The paper evaluates the performance of its distributed training system using several key metrics, primarily focused on throughput and efficiency.
1. Achieved teraFLOP/s per GPU:
   - Conceptual Definition: This metric quantifies the average number of trillion floating-point operations (FLOPs) that each GPU in the cluster performs per second, including all overheads like communication, data processing, and optimizer steps. It is a direct measure of the computational work done per GPU per unit time; a higher value indicates better GPU utilization and efficiency.
   - Mathematical Formula: While not explicitly provided as a single formula in the paper, it is derived from the total FLOPs per iteration and the measured iteration time. $ \text{Achieved teraFLOP/s per GPU} = \frac{\text{Total FLOPs per iteration}}{\text{Number of GPUs} \times \text{Iteration time (seconds)}} \times 10^{-12} $
   - Symbol Explanation:
     - Total FLOPs per iteration: The total floating-point operations required for one forward and backward pass (including activation recomputation) for the entire model and batch. This is denoted as $F$ in the paper's Appendix, where $F = 96 B s l h^2 (1 + \frac{s}{6h} + \frac{V}{16lh})$.
     - Number of GPUs: The total count of GPUs used in the training run ($n$).
     - Iteration time (seconds): The measured wall-clock time for one complete training iteration (one forward pass, one backward pass, and parameter updates for a global batch).
2. Percentage of Theoretical Peak FLOP/s:
   - Conceptual Definition: This metric expresses the achieved teraFLOP/s per GPU as a percentage of the maximum possible floating-point operations that a single GPU could theoretically perform per second. It indicates how close the actual performance is to the hardware's ideal capability, highlighting how effectively the software stack and parallelization strategy utilize the GPU's compute power.
   - Mathematical Formula: $ \text{Percentage of Theoretical Peak FLOP/s} = \frac{\text{Achieved teraFLOP/s per GPU}}{\text{Theoretical Peak FLOP/s of a single GPU}} \times 100\% $
   - Symbol Explanation:
     - Achieved teraFLOP/s per GPU: As defined above.
     - Theoretical Peak FLOP/s of a single GPU: The maximum specified computational capacity of the GPU hardware. For an A100 GPU with 16-bit precision, this is stated as 312 teraFLOP/s.
3. Achieved Aggregate petaFLOP/s:
   - Conceptual Definition: This metric represents the total floating-point operations performed by all GPUs in the cluster per second, expressed in petaFLOP/s (quadrillions of FLOPs per second). It is a measure of the overall computational power achieved by the entire distributed system.
   - Mathematical Formula: $ \text{Achieved Aggregate petaFLOP/s} = \text{Achieved teraFLOP/s per GPU} \times \text{Number of GPUs} \times 10^{-3} $
   - Symbol Explanation:
     - Achieved teraFLOP/s per GPU: As defined above.
     - Number of GPUs: The total count of GPUs used in the training run ($n$).
4. Training Time Estimates:
   - Conceptual Definition: This is an estimation of the total wall-clock time required to train a given model to convergence on a specific number of tokens, based on the achieved throughput. It provides a practical measure of how long a training job would take.
   - Mathematical Formula: The paper provides a simplified approximate formula, based on the observation that for their configurations $6h \gg s$, $16lh \gg V$, and $12lh \gg V + s$, so the lower-order terms drop out and $F \approx 96 B s l h^2$ while $P \approx 12 l h^2$. Since training on $T$ tokens takes $\frac{T}{Bs}$ iterations, the total compute is roughly $96 T l h^2 = 8T \cdot (12 l h^2) \approx 8TP$, giving $ \text{End-to-end training time} \approx \frac{8TP}{nX} $
   - Symbol Explanation:
     - End-to-end training time: Estimated total time for model training.
     - $T$: Total number of tokens required for training (e.g., 300 billion for GPT-3).
     - $P$: Total number of parameters in the model.
     - $n$: Total number of GPUs used.
     - $X$: Achieved teraFLOP/s per GPU (from Table 1).
     - The constant 8 is the ratio between the leading FLOP coefficient (96, which covers the forward, backward, and recomputation passes) and the leading parameter coefficient (12). A short code sketch of all four metrics follows this list.
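The following sketch collects these metrics as small Python helpers; the usage example reuses the GPT-3 figures quoted in §6.1 (T = 300 billion tokens, P = 175 billion parameters, n = 1024 GPUs, X = 140 teraFLOP/s per GPU) and reproduces the roughly 34-day estimate.

```python
def achieved_tflops_per_gpu(flops_per_iter: float, n_gpus: int, iter_time_s: float) -> float:
    """Average per-GPU throughput in teraFLOP/s."""
    return flops_per_iter / (n_gpus * iter_time_s) / 1e12

def percent_of_peak(achieved_tflops: float, peak_tflops: float = 312.0) -> float:
    """Fraction of the A100's 312 teraFLOP/s 16-bit peak, as a percentage."""
    return 100.0 * achieved_tflops / peak_tflops

def aggregate_pflops(achieved_tflops: float, n_gpus: int) -> float:
    """Cluster-wide throughput in petaFLOP/s."""
    return achieved_tflops * n_gpus / 1e3

def training_time_days(T_tokens: float, P_params: float, n_gpus: int, X_tflops: float) -> float:
    """End-to-end training time ~ 8*T*P / (n*X), converted to days."""
    seconds = 8 * T_tokens * P_params / (n_gpus * X_tflops * 1e12)
    return seconds / 86400

# GPT-3-scale example from Section 6.1: T=300B tokens, P=175B, n=1024 GPUs, X=140 TFLOP/s
print(round(training_time_days(300e9, 175e9, 1024, 140)))   # ~34 days
print(round(percent_of_peak(163)))                          # ~52 percent, as in Table 1
```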
5.3. Baselines
The paper primarily compares its PTD-P approach against:
- ZeRO-3 without Model Parallelism: ZeRO-3 [36, 37] is a state-of-the-art memory optimization technique that shards all optimizer states, gradients, and model parameters across data-parallel workers. This allows training very large models that would not fit on a single GPU even with data parallelism. The comparison is crucial because ZeRO-3 is a strong baseline for memory-efficient data-parallel training of large models. The paper specifically notes that it considers ZeRO-3 without any additional model parallelism (i.e., a model-parallel size of 1 for the entire model). This highlights PTD-P's advantage when explicit model parallelism (tensor + pipeline) is combined with data parallelism.

The paper also implicitly compares against prior individual model parallelism techniques by analyzing pipeline parallelism in isolation (e.g., default vs. interleaved schedule), tensor parallelism in isolation, and their various combinations, demonstrating that PTD-P outperforms these isolated approaches.
5.4. Hardware
All experiments were conducted on the Selene supercomputer [8]. This is a powerful, optimized HPC cluster with specific characteristics:
- GPUs: Each cluster node features 8 NVIDIA 80-GB A100 GPUs [6]. These are state-of-the-art GPUs at the time of publication, offering high computational throughput (312 teraFLOP/s with 16-bit precision) and substantial GPU memory (80 GB).
- Intra-node Interconnect: Within each node, the 8 A100 GPUs are connected by NVLink and NVSwitch [9]. NVLink provides high-bandwidth, direct GPU-to-GPU connections, while NVSwitch enables all-to-all communication between all GPUs in the server with extremely low latency. This is crucial for efficient intra-node tensor parallelism.
- Inter-node Interconnect: Each node is equipped with eight NVIDIA Mellanox 200 Gbps HDR InfiniBand HCAs (Host Channel Adapters) for application communication, plus an additional two HCAs for dedicated storage. The nodes are connected via a three-level (leaf, spine, core) fat-tree topology with 850 switches. This topology is designed for efficient all-reduce communication, which is a dominant pattern in deep learning training.
- Storage: The cluster uses an all-NVMe shared parallel filesystem for high-performance data access and storage, which is important for fast checkpoint loading and saving.
- Software Stack: The implementation is built using PyTorch [32] and leverages NCCL [7] for collective communication. All results are run with mixed precision.
6. Results & Analysis
6.1. Core Results Analysis
The paper's core results demonstrate the exceptional weak-scaling performance of PTD-P for GPT models, ranging from 1 billion to a trillion parameters. The key findings validate the effectiveness of the proposed method in achieving high throughput and practical training times.
The following are the results from Table 1 of the original paper:
| Number of parameters (billion) | Attention heads | Hidden size | Number of layers | Tensor model-parallel size | Pipeline model-parallel size | Number of GPUs | Batch size | Achieved teraFLOP/s per GPU | Percentage of theoretical peak FLOP/s | Achieved aggregate petaFLOP/s |
|---|---|---|---|---|---|---|---|---|---|---|
| 1.7 | 24 | 2304 | 24 | 1 | 1 | 32 | 512 | 137 | 44% | 4.4 |
| 3.6 | 32 | 3072 | 30 | 2 | 1 | 64 | 512 | 138 | 44% | 8.8 |
| 7.5 | 32 | 4096 | 36 | 4 | 1 | 128 | 512 | 142 | 46% | 18.2 |
| 18.4 | 48 | 6144 | 40 | 8 | 1 | 256 | 1024 | 135 | 43% | 34.6 |
| 39.1 | 64 | 8192 | 48 | 8 | 2 | 512 | 1536 | 138 | 44% | 70.8 |
| 76.1 | 80 | 10240 | 60 | 8 | 4 | 1024 | 1792 | 140 | 45% | 143.8 |
| 145.6 | 96 | 12288 | 80 | 8 | 8 | 1536 | 2304 | 148 | 47% | 227.1 |
| 310.1 | 128 | 16384 | 96 | 8 | 16 | 1920 | 2160 | 155 | 50% | 297.4 |
| 529.6 | 128 | 20480 | 105 | 8 | 35 | 2520 | 2520 | 163 | 52% | 410.2 |
| 1008.0 | 160 | 25600 | 128 | 8 | 64 | 3072 | 3072 | 163 | 52% | 502.0 |
Analysis of Table 1:
- Weak Scaling: As the model size increases from 1.7 billion to 1 trillion parameters, the number of GPUs used also increases proportionally (from 32 to 3072). This demonstrates weak scaling, where the problem size per GPU generally stays constant or increases, allowing the system to scale efficiently.
- High GPU Utilization: The achieved teraFLOP/s per GPU consistently remains high, ranging from 135 to 163 teraFLOP/s. For the largest models (529.6B and 1T parameters), the system achieves 52% of the theoretical peak FLOP/s of an A100 GPU. This is an impressive utilization rate for a complex distributed training setup, indicating that the PTD-P strategy and its optimizations effectively keep the GPUs busy with computation rather than waiting for communication or idling.
- Aggregate Throughput: The aggregate petaFLOP/s scales dramatically, reaching 502.0 petaFLOP/s for the 1-trillion-parameter model on 3072 GPUs. This demonstrates near-linear scaling, showcasing the system's ability to harness vast computational resources effectively.
- Parallelism Configuration: For models larger than 18.4 billion parameters, tensor model parallelism is consistently set to 8 (matching the number of GPUs within a DGX A100 server, as per Takeaway #1). Pipeline model parallelism is then used to scale across servers, increasing from 1 to 64 as model size and GPU count increase. This validates the PTD-P approach's hierarchical design.
- Increasing Batch Size: As the models and GPU counts grow, the batch size also increases (from 512 to 3072). This is often necessary in pipeline parallelism to amortize the pipeline bubble overhead over more microbatches and maintain efficiency.
Training Time Estimates: Given these impressive throughputs, the paper estimates practical end-to-end training times:
- GPT-3 (175 billion parameters): On $n = 1024$ A100 GPUs with a batch size of 1536, achieving $X = 140$ teraFLOP/s per GPU. For training on $T = 300$ billion tokens, the estimated time is approximately 34 days.
- 1 Trillion Parameter Model: On $n = 3072$ A100 GPUs with $X = 163$ teraFLOP/s per GPU. Assuming $T = 450$ billion tokens for convergence, the estimated training time is approximately 84 days (around 3 months).

These estimates highlight that PTD-P makes training such colossal models feasible within reasonable timeframes, moving them from theoretical possibility to practical reality.
6.2. Comparison to ZeRO-3
The paper compares PTD-P with ZeRO-3, a strong baseline for memory-efficient data-parallel training without explicit model parallelism.
The following are the results from Table 2 of the original paper:
| Scheme | Number of parameters (billion) | Model-parallel size | Batch size | Number of GPUs | Microbatch size | Achieved teraFLOP/s per GPU | Training time for 300B tokens (days) |
|---|---|---|---|---|---|---|---|
| ZeRO-3 without Model Parallelism | 174.6 | 1 | 1536 | 384 | 4 | 144 | 90 |
| ZeRO-3 without Model Parallelism | 174.6 | 1 | 1536 | 768 | 2 | 88 | 74 |
| ZeRO-3 without Model Parallelism | 174.6 | 1 | 1536 | 1536 | 1 | 44 | 74 |
| ZeRO-3 without Model Parallelism | 529.6 | 1 | 2560* | 640 | 4 | 138 | 169 |
| ZeRO-3 without Model Parallelism | 529.6 | 1 | 2240 | 1120 | 2 | 98 | 137 |
| ZeRO-3 without Model Parallelism | 529.6 | 1 | 2240 | 2240 | 1 | 48 | 140 |
| PTD Parallelism | 174.6 | 96 | 1536 | 384 | 1 | 153 | 84 |
| PTD Parallelism | 174.6 | 96 | 1536 | 768 | 1 | 149 | 43 |
| PTD Parallelism | 174.6 | 96 | 1536 | 1536 | 1 | 141 | 23 |
| PTD Parallelism | 529.6 | 280 | 2240 | 560 | 1 | 171 | 156 |
| PTD Parallelism | 529.6 | 280 | 2240 | 1120 | 1 | 167 | 80 |
| PTD Parallelism | 529.6 | 280 | 2240 | 2240 | 1 | 159 | 42 |
Analysis of Table 2 and Figure 10:
Figure 10: Throughput per GPU of PTD-P and ZeRO-3 for two different GPT models (the 175B GPT-3 model is shown with dotted lines, and the 530B model is shown with solid lines). Global batch sizes are fixed and ZeRO-3 is used without any model parallelism.
- Initial Performance (Lower GPU Counts): For the 175-billion-parameter model with 384 GPUs and a microbatch size of 4, PTD-P achieves 153 teraFLOP/s per GPU versus ZeRO-3's 144 teraFLOP/s per GPU, representing 6% higher throughput. For the 530-billion-parameter model with 640 GPUs and microbatch size 4, PTD-P (171 teraFLOP/s per GPU) outperforms ZeRO-3 (138 teraFLOP/s per GPU) by 24%. This initial lead suggests PTD-P is more efficient even at relatively lower GPU counts.
- Scaling Behavior: As the number of GPUs increases (while keeping the global batch size fixed), PTD-P scales significantly more gracefully than ZeRO-3.
  - For the 175B model, doubling the GPUs from 384 to 768 results in PTD-P maintaining high throughput (149 teraFLOP/s per GPU) while ZeRO-3 drops significantly (to 88 teraFLOP/s per GPU). At 1536 GPUs, PTD-P still achieves 141 teraFLOP/s per GPU, whereas ZeRO-3 falls to just 44 teraFLOP/s per GPU.
  - Similar trends are observed for the 530B model. At 2240 GPUs, PTD-P achieves 159 teraFLOP/s per GPU, while ZeRO-3 drops to 48 teraFLOP/s per GPU.
- Performance Gap: At higher GPU counts, PTD-P outperforms ZeRO-3 by as much as 70% for both model sizes. This is primarily attributed to ZeRO-3's increased cross-node communication overhead. ZeRO-3 shards parameters and gradients and requires all-gather operations for weights and all-reduce for gradients for every layer if not carefully optimized or hidden. When this communication has to cross slower inter-node links frequently, it becomes a significant bottleneck, especially at larger scales. PTD-P's combination of tensor parallelism (intra-node) and pipeline parallelism (inter-node) with optimizations like scatter/gather is designed to minimize and optimize this cross-node communication.
- Model Parallelism Size: It's important to note that ZeRO-3 in this comparison has a model-parallel size of 1 (meaning the entire model is conceptually replicated and sharded across data-parallel ranks), whereas PTD-P uses a significant model-parallel size (e.g., 96 for the 174.6B model, 280 for the 529.6B model), composed of tensor and pipeline parallelism. This means PTD-P explicitly partitions the model, which is essential for these memory-intensive models that might not even fit in the ZeRO-3 setup on a smaller number of GPUs without very fine-grained microbatching.
6.3. Pipeline Parallelism
6.3.1. Weak Scaling
The paper evaluates the weak-scaling performance of the default non-interleaved pipeline-parallel schedule. In a weak-scaling setup, as the number of pipeline stages (and thus GPUs) increases, the size of the model also increases proportionally (by increasing the number of layers), aiming to keep the workload per GPU roughly constant.
Figure 11: Throughput per GPU of pipeline parallelism using two different batch sizes in a weak-scaling experiment setup (model size increases with the pipeline-parallel size). The blue line shows a batch size of 8 and the orange line a batch size of 128; per-GPU throughput declines noticeably as the pipeline-parallel size increases from 1 to 8.
Analysis of Figure 11:
- Impact of Pipeline Bubble: The figure shows throughput per GPU for a GPT model (128 attention heads, hidden size 20480, microbatch size 1) with a tensor-parallel size of 8. As the pipeline-parallel size increases from 1 to 8, the throughput per GPU for both batch sizes (8 and 128) decreases. This aligns with the analytical model for the pipeline bubble fraction, $\frac{p-1}{m}$, which increases with the pipeline-parallel size $p$. Even in a weak-scaling scenario, the inherent idle time introduced by the pipeline flush becomes more significant as the pipeline lengthens.
- Batch Size Amortization: The higher batch size (128, orange line) consistently performs better and scales more gracefully than the smaller batch size (8, blue line). This is because the pipeline bubble (idle time) is a fixed cost per batch. With a larger batch size, this fixed cost is amortized over more microbatches (larger $m$), reducing its relative impact on overall efficiency.
6.3.2. Interleaved versus Non-Interleaved Schedule
The paper compares the throughput of the proposed interleaved schedule against the non-interleaved (default) PipeDream-Flush schedule on a GPT-3 model with 175 billion parameters.
Figure 12: Throughput per GPU of interleaved and non-interleaved schedules for a GPT model (175 billion parameters) on 96 GPUs.
Analysis of Figure 12:
- Interleaved Schedule Advantage: The interleaved schedule (green line) with the scatter/gather communication optimization enabled consistently achieves higher per-GPU throughput than the non-interleaved (default) schedule (blue line). This validates the theoretical reduction in the pipeline bubble fraction offered by interleaving (from $\frac{p-1}{m}$ to $\frac{p-1}{v \cdot m}$ for $v$ model chunks per device), leading to up to 10% improvement in throughput.
- Batch Size Impact:
  - At smaller batch sizes, the interleaved schedule shows a more pronounced advantage. This is because smaller batch sizes lead to fewer microbatches ($m$), making the pipeline bubble a larger fraction of the total time for the non-interleaved schedule. The interleaved schedule mitigates this by reducing the effective pipeline depth.
  - As the batch size increases, the throughput gap between the two schedules narrows. This is due to two factors: a) the pipeline bubble in the default schedule becomes a smaller fraction of the total time as it is amortized over more microbatches; and b) the interleaved schedule requires more point-to-point communication per sample, and this communication overhead becomes more significant at larger batch sizes. (A short numeric sketch of the bubble fraction follows this list.)
- Importance of Scatter/Gather: The paper notes that without the scatter/gather optimization, the default schedule would perform better than the interleaved schedule at larger batch sizes. This underscores that the communication overhead introduced by interleaving must be mitigated by specific optimizations like scatter/gather to realize its performance benefits.
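A minimal numeric sketch of the bubble fraction discussed in the last two subsections (symbols as above; the example values are illustrative, not measurements from the paper):

```python
def bubble_fraction(p: int, m: int, v: int = 1) -> float:
    """Ratio of pipeline-bubble time to ideal computation time: (p - 1) / (v * m).
    p: pipeline-parallel size, m: microbatches per pipeline, v: interleaved model chunks
    per device (v = 1 corresponds to the default non-interleaved schedule)."""
    return (p - 1) / (v * m)

# Illustrative values: a deeper pipeline or fewer microbatches inflates the bubble,
# while interleaving (v > 1) shrinks it for the same m.
for m in (8, 32, 128):
    print(m, round(bubble_fraction(p=8, m=m), 3), round(bubble_fraction(p=8, m=m, v=4), 3))
```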
6.4. Comparison of Parallel Configurations
This section demonstrates the critical trade-offs and interactions when combining different parallelization dimensions.
6.4.1. Tensor versus Pipeline Parallelism
The paper evaluates the performance of various combinations of tensor and pipeline model parallelism for a 161-billion-parameter GPT model on 64 A100 GPUs.
Figure 13: Throughput per GPU of various parallel configurations that combine pipeline and tensor model parallelism using a GPT model with 162.2 billion parameters and 64 A100 GPUs.
Analysis of Figure 13:
- Optimal Combination: The peak performance is achieved when the tensor-parallel size $t$ equals 8. This corresponds to utilizing all GPUs within a single DGX A100 server for tensor parallelism, where high-bandwidth NVLink facilitates efficient all-reduce communication.
- Tensor Parallelism Benefits (Intra-node): Increasing the tensor-parallel size up to 8 (within a node) generally improves throughput. This is because tensor parallelism efficiently partitions the large GEMMs within layers across GPUs connected by fast NVLink, making computations more efficient and allowing larger models to fit.
- Pipeline Parallelism Benefits (Inter-node): Once tensor parallelism saturates the intra-node capability (at $t = 8$), pipeline parallelism ($p$) is used to scale across nodes. The $(t, p) = (8, 8)$ configuration, i.e., eight tensor-parallel groups, each on its own DGX A100 node, forming an 8-stage pipeline, achieves much higher throughput than configurations in which tensor parallelism would have to span nodes. This effectively leverages point-to-point communication for pipeline stages across slower inter-node links (InfiniBand), avoiding the costly all-reduce of tensor parallelism over those links.
- Limitations of Isolated Parallelism:
  - Pure tensor parallelism ($p = 1$, varying $t$): If $t$ goes beyond 8 (meaning tensor parallelism spans across nodes), throughput degrades severely due to the expensive all-reduce over InfiniBand (not shown explicitly past $t = 8$ on this graph, but implied by Takeaway #1).
  - Pure pipeline parallelism ($t = 1$, varying $p$): As $p$ increases, throughput decreases significantly. This is because the entire Transformer layer must be processed on a single GPU, which quickly becomes a memory bottleneck, and the pipeline bubble becomes dominant as $p$ grows without sufficient microbatches.
- Takeaway Validation: This figure strongly validates Takeaway #1: tensor model parallelism is most effective intra-node (up to the number of GPUs per server), and pipeline model parallelism is best used for inter-node scaling. Neither in isolation performs as well as their optimized combination.
6.4.2. Pipeline versus Data Parallelism
This comparison uses a smaller GPT model (5.9 billion parameters) that can fit on fewer model-parallel GPUs to illustrate the dynamics between data and pipeline parallelism.
Figure 14: Throughput per GPU of various parallel configurations that combine data and pipeline model parallelism using a GPT model with 5.9 billion parameters, three different batch sizes, microbatch size of 1, and 64 A100 GPUs.
Analysis of Figure 14:
- Data Parallelism Preference: For a fixed total number of GPUs (64) and microbatch size (1), increasing the data-parallel size ($d$) while decreasing the pipeline-parallel size ($p$) generally leads to higher throughput per GPU. This is evident from the rising throughput as $d$ increases from 1 to 64 (and $p$ decreases from 64 to 1) for all batch sizes.
- Pipeline Bubble Impact: As the pipeline-parallel size $p$ increases, throughput tends to decrease. This confirms the analytical model from §3.3.1, where the pipeline bubble fraction $\frac{p-1}{m}$ grows with $p$ (with $m = \frac{B}{d \cdot b'}$ microbatches per pipeline for global batch size $B$ and microbatch size $b'$). Pipeline parallelism introduces idle time due to pipeline flushes, which becomes more significant as the pipeline grows longer relative to the number of microbatches.
- Batch Size and Throughput: Larger batch sizes (e.g., 2048, green line) consistently yield higher throughput than smaller ones (e.g., 512, blue line). This is because a larger batch size helps amortize the pipeline bubble overhead and makes data-parallel communication less frequent.
- Takeaway Validation: This figure supports Takeaway #2: pipeline model parallelism should primarily be employed to enable training of models that are too large to fit on a single GPU or a small model-parallel group. Once the model can fit, data parallelism is generally the more efficient strategy for scaling training to more GPUs, due to its lower communication overhead per microbatch and better pipeline bubble characteristics.
6.4.3. Tensor versus Data Parallelism
This section evaluates the interaction between data and tensor model parallelism using the same 5.9-billion-parameter GPT model as in the previous section.
Figure 15: Throughput per GPU of various parallel configurations that combine data and tensor model parallelism using a GPT model with 5.9 billion parameters, three different batch sizes, microbatch size of 1, and 64 A100 GPUs.
Analysis of Figure 15:
- Data Parallelism Preference: Similar to the previous comparison, increasing the data-parallel size ($d$) while decreasing the tensor-parallel size ($t$) generally leads to higher throughput. Throughput is highest with $t = 1$ (no tensor parallelism) and $d = 64$ (full data parallelism), especially for large batch sizes.
- Tensor Parallelism Overhead: As the tensor-parallel size $t$ increases (and $d$ decreases), throughput tends to decline. This is because tensor model parallelism requires frequent all-reduce communication for every microbatch and every layer. If $t$ grew beyond the 8 GPUs of a DGX A100 node, a larger portion of this all-reduce communication would span slower inter-node links.
- Small GEMMs: Additionally, increasing $t$ means that the matrix multiplications on each GPU become smaller. If these GEMMs are not sufficiently large, GPU utilization can decrease, as the GPU might not achieve its peak computational efficiency.
- Batch Size Impact: Again, larger batch sizes (e.g., 2048, green line) lead to better throughput. This is because data-parallel communication is less frequent per batch than tensor-parallel communication per microbatch. When the batch size is large, the gradient all-reduce in data parallelism is amortized, while the tensor-parallel all-reduce remains frequent.
- Takeaway Validation: This further reinforces Takeaway #2: data parallelism is generally more efficient for scaling up training once the model fits within the model-parallel group. Tensor parallelism should be chosen strategically, usually intra-node, to handle models that exceed a single GPU's memory, but its all-reduce communication overheads can be significant, especially cross-node. (A small configuration heuristic embodying both takeaways is sketched after this list.)
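The two takeaways can be condensed into a small heuristic. The sketch below is only an illustration of the guidance in this section, not Megatron-LM's configuration logic; in particular, `min_model_parallel_gpus` (the smallest t·p that fits the model) is an assumed input that would in practice come from a memory estimate.

```python
def suggest_parallelism(n_gpus: int, gpus_per_node: int, min_model_parallel_gpus: int):
    """Heuristic sketch of Takeaways #1 and #2: keep tensor parallelism on NVLink inside a
    node, add pipeline stages only until the model fits (t * p >= the assumed minimum), and
    spend every remaining GPU on data parallelism."""
    t = min(gpus_per_node, min_model_parallel_gpus)       # tensor parallelism stays intra-node
    p = max(1, -(-min_model_parallel_gpus // t))          # ceiling division: stages across nodes
    d = n_gpus // (t * p)                                 # leftover factor goes to data parallelism
    return t, p, d

# Example: 3072 GPUs, 8 GPUs per node, and a model assumed to need t*p = 512 GPUs to fit;
# the result (8, 64, 6) matches the 1-trillion-parameter row of Table 1.
print(suggest_parallelism(3072, 8, 512))
```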
6.5. Microbatch Size
The choice of microbatch size is a crucial hyperparameter that balances GPU utilization (from larger GEMMs) and pipeline bubble overhead.
Figure 16: Throughput per GPU of a parallel configuration for different microbatch sizes on a GPT model with 91 billion parameters, for two different batch sizes using 64 A100 GPUs.
Analysis of Figure 16:
- Optimal Microbatch Size: For the 91-billion-parameter GPT model with a fixed tensor- and pipeline-parallel configuration on 64 A100 GPUs, the figure shows that a microbatch size of 2 yields the highest throughput per GPU for both batch sizes (2048 and 1024).
- Trade-off in Action:
  - Smaller Microbatch Sizes (e.g., 1): Lead to a larger number of microbatches ($m$) within a batch. This could mean the pipeline bubble is better amortized, but GPU utilization for individual GEMMs might be lower due to smaller matrix sizes.
  - Larger Microbatch Sizes (e.g., 4, 8): Lead to fewer microbatches ($m$), which means the pipeline bubble takes up a larger fraction of the overall time (as $m$ decreases, $\frac{p-1}{m}$ increases). However, larger microbatch sizes generally improve the arithmetic intensity of GPU kernels and thus GPU utilization for the actual computation.
- Problem-Dependence: The optimal microbatch size is a sweet spot between these opposing forces, and it is model-dependent and configuration-dependent. The paper's analytical model from §3.3.3 (now §3.4) provides a reasonable approximation and can guide the selection of this hyperparameter.
6.6. Activation Recomputation
Activation recomputation (or gradient checkpointing) is a memory-saving technique that trades increased computation for reduced memory.
Figure 17: Throughput (in sequences per second) with and without activation recomputation for a GPT model with 145 billion parameters using 128 A100 GPUs.
Analysis of Figure 17:
- Throughput Drop for Small Batch Sizes: For small batch sizes, enabling activation recomputation leads to up to 33% lower throughput (measured in sequences per second). This is because activation recomputation requires re-executing parts of the forward pass during the backward pass, effectively doubling the FLOPs for those sections. This additional computation directly reduces throughput if memory is not a bottleneck.
- Enabling Larger Batch Sizes: The crucial benefit of activation recomputation is its memory savings. Without it, the model might not fit in GPU memory at all, or only with very small batch sizes. The figure shows that activation recomputation is needed to support larger batch sizes that would otherwise run out of memory.
- Overall Throughput Gain: At large batch sizes (made possible by activation recomputation), the system can achieve significantly higher overall throughput. For example, the throughput at large batch sizes with activation recomputation is up to 2x higher than the best throughput achieved without it (which was limited to smaller batch sizes). This is because larger batch sizes better amortize the pipeline bubble and other overheads, even with the increased FLOPs from recomputation. Activation recomputation is thus an essential technique for training very large models (a minimal PyTorch sketch follows this list).
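For readers unfamiliar with the mechanism, here is a minimal PyTorch sketch of activation recomputation using `torch.utils.checkpoint`; the toy block and shapes are illustrative and not taken from Megatron-LM.

```python
# A minimal PyTorch sketch of activation recomputation via torch.utils.checkpoint;
# the toy block, hidden size, and shapes are illustrative, not Megatron-LM code.
import torch
from torch.utils.checkpoint import checkpoint

class ToyTransformerBlock(torch.nn.Module):
    def __init__(self, h: int):
        super().__init__()
        self.ff1 = torch.nn.Linear(h, 4 * h)
        self.ff2 = torch.nn.Linear(4 * h, h)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.ff2(torch.nn.functional.gelu(self.ff1(x)))

block = ToyTransformerBlock(h=1024)
x = torch.randn(128, 4, 1024, requires_grad=True)   # [s, b, h] toy shapes

# Intermediate activations inside `block` are freed after the forward pass and recomputed
# during backward, trading extra FLOPs for a much smaller activation memory footprint.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```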
6.7. Scatter-Gather Optimization
The scatter/gather communication optimization (§4.1) was designed to reduce redundant cross-node communication when tensor and pipeline parallelism are combined.
Figure 18: Throughput per GPU with and without the scatter/gather optimization for a GPT model with 175 billion parameters using 96 A100 GPUs and the interleaved schedule.
Analysis of Figure 18:
- Throughput Improvement: The figure clearly shows that enabling the scatter/gather optimization (green line) leads to a substantial improvement in per-GPU throughput compared to running without it (blue line). For communication-intensive scenarios, especially with the interleaved schedule and larger batch sizes, this optimization improves throughput by up to 11%.
- Mitigating Interleaving Overhead: The interleaved schedule inherently increases the point-to-point communication volume (by a factor of $v$, the number of model chunks per device). The scatter/gather optimization is critical for making the interleaved schedule feasible and efficient by reducing the actual data transferred over slower cross-node InfiniBand links. By scattering the replicated tensor-parallel outputs and then all-gathering them intra-node over fast NVLink, it effectively turns redundant inter-node communication into efficient intra-node communication.
6.8. Fused Operators
The paper implemented three model-specific computation optimizations (data layout change, fused kernels, custom attention kernels) as described in §4.2.
- Impact on GPT-3 (175 billion parameters): With these fused operators, throughput increased by 19% (from 113 teraFLOP/s per GPU to 135 teraFLOP/s per GPU).
- Impact on Larger GPT (530 billion parameters): For the even larger 530-billion-parameter GPT model, throughput increased by 11% (from 133 teraFLOP/s per GPU to 148 teraFLOP/s per GPU).

These results highlight the importance of low-level software optimizations and kernel engineering to ensure that computation remains compute-bound and GPUs are maximally utilized, rather than being bottlenecked by memory access or kernel-launch overheads.
6.9. Inter-Node Communication Bandwidth
The exceptional results achieved are partly due to the optimized Selene supercomputer hardware stack. The paper quantifies the effective bisection bandwidth utilization:
- Pipeline-Parallel Communication: For point-to-point communication among pipeline stages (primarily inter-node), an effective bisection bandwidth of 892 GB/s was observed. This is a very high bandwidth, underscoring the efficiency of the InfiniBand interconnect and the scatter/gather optimization.
- Data-Parallel Communication: For all-reduce operations among data-parallel replicas, an even higher effective bisection bandwidth of 12.9 TB/s was observed. This highlights the highly optimized all-reduce implementation (likely leveraging the fat-tree topology and NCCL) and the fact that data-parallel all-reduces are often larger but less frequent than tensor-parallel all-reduces, allowing for better saturation of network links. The authors emphasize that a less optimized partitioning or more communication-intensive operations would severely hamper scaling performance, especially given these high bandwidth requirements.
6.10. Checkpoint Loading and Saving
An important practical consideration for training large models is managing their checkpoints.
- Trillion-parameter model checkpoint size: A single checkpoint for the trillion-parameter model is a massive 13.8 terabytes (a back-of-envelope check follows this list).
- Loading Performance: The initial loading of this checkpoint by all 384 nodes (3072 GPUs) achieved a peak read bandwidth of 1 TB/s from the parallel filesystem. This demonstrates the capability of the all-NVMe shared parallel filesystem.
- Saving Performance: Checkpoint saves reached 40% of peak write bandwidth (273 GB/s). While lower than the read bandwidth, this is still substantial and crucial for periodic saving during long training runs.

Efficient checkpointing is vital for fault tolerance and for resuming training, and the system demonstrates robust performance in handling these enormous data volumes.
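As a loose back-of-envelope check (an assumption, since the checkpoint's on-disk layout is not described in this analysis): mixed-precision training with Adam typically stores roughly 14 bytes of weight and optimizer state per parameter, which for about 1.008 trillion parameters lands in the same ballpark as the reported 13.8 TB.

```python
# Back-of-envelope checkpoint size, assuming fp16 weights plus an fp32 master copy and two
# fp32 Adam states (~14 bytes per parameter); the actual on-disk layout is not specified here.
params = 1.008e12                        # 1-trillion-parameter model from Table 1
bytes_per_param = 2 + 4 + 4 + 4          # fp16 weight + fp32 copy + Adam momentum + variance
print(params * bytes_per_param / 1e12)   # ~14.1 TB, in the same ballpark as the reported 13.8 TB
```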
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper presents a highly effective and innovative solution for training large language models (LLMs) with up to a trillion parameters on GPU clusters. The core contribution is PTD-P, a composed parallelism strategy that intelligently integrates pipeline parallelism (across multi-GPU servers), tensor parallelism (within multi-GPU servers), and data parallelism for maximal efficiency.
Key achievements include:
- Achieving an aggregate throughput of 502 petaFLOP/s on 3072 A100 GPUs for a 1-trillion-parameter model, representing 52% of theoretical peak FLOP/s per GPU.
- Proposing a novel interleaved pipeline parallelism schedule that improves throughput by over 10% by reducing pipeline bubble time, without significantly increasing the memory footprint.
- Developing critical communication optimizations (e.g., scatter/gather) and computation optimizations (e.g., fused kernels, data layout) that are essential for realizing high GPU utilization and mitigating communication overheads.
- Providing a quantitative analysis and practical heuristics for configuring distributed training, highlighting the complex interactions between different parallelism dimensions and hyperparameters.
- Demonstrating practical end-to-end training times (e.g., roughly 3 months for a trillion-parameter model), making such large-scale model development feasible.
- Open-sourcing the Megatron-LM codebase with these advancements, enabling wider adoption by the research community.

The significance of this work lies in pushing the boundaries of LLM training efficiency, directly addressing the memory and computational challenges that arise from the exponential growth of AI models.
7.2. Limitations & Future Work
The authors acknowledge certain limitations and suggest avenues for future research:
- Heuristic-based Parallelism Configuration: The paper does not automatically explore the vast search space of parallelization strategies. Instead, it suggests practical heuristics based on empirical and analytical studies. Future work could involve developing more sophisticated auto-partitioning or cost-model-based systems (like FlexFlow, PipeDream, DAPPLE) that can automatically determine the best PTD-P configuration for arbitrary models and hardware.
- Strict Optimizer Semantics: The paper focuses on pipeline parallelism with flushes to ensure strict optimizer semantics, which inherently introduces pipeline bubbles. The authors defer consideration of asynchronous and bounded-staleness approaches (like PipeMare, PipeDream-2BW) to future work. These alternative schemes could potentially improve throughput by eliminating flushes entirely, but might come at the cost of convergence rate or final accuracy, which would need careful investigation.
- Model Architecture Focus: The optimizations and analyses are specifically tailored to Transformer-based language models. While the core ideas are general, applying them to other, more asymmetric model architectures (e.g., CNNs with highly varied layer sizes) might require different layer-assignment strategies for pipeline stages, which the paper defers to related work.
- Hardware Dependence: The strong results are a byproduct of a highly optimized software and hardware stack, particularly the reliance on high-bandwidth interconnects like NVLink and InfiniBand. Performance on less-optimized or lower-bandwidth cluster environments might be significantly different, limiting the direct transferability of certain performance numbers.
7.3. Personal Insights & Critique
This paper is a landmark contribution to the field of large-scale AI model training, showcasing an impressive feat of engineering and distributed systems design.
Personal Insights:
- Hardware-Software Co-design is Paramount: The paper vividly illustrates that pushing the frontiers of AI scale is not just about novel algorithms, but critically about an intricate co-design between hardware capabilities (e.g., A100 GPUs, NVLink, InfiniBand) and highly optimized software (PTD-P, fused kernels, scatter/gather). It is a testament to the fact that maximum efficiency is extracted when the software stack is deeply aware of, and optimized for, the underlying hardware topology and communication patterns.
- Complexity of Distributed Training: The analysis of interactions between tensor, pipeline, and data parallelism reveals the immense complexity of configuring such systems. The trade-offs are non-trivial and often contradictory (e.g., a larger microbatch size improving GPU utilization but increasing the pipeline bubble). The guiding principles provided are invaluable for practitioners navigating this complexity.
- Value of Open-Sourcing: Open-sourcing the Megatron-LM extension is a significant contribution. It democratizes access to state-of-the-art LLM training infrastructure, enabling broader research and development that would otherwise be limited to organizations with massive HPC resources and expertise.
- Practicality over Purity: The focus on strict optimizer semantics ensures predictable model quality, which is crucial for real-world applications. While asynchronous methods promise higher throughput, the paper's choice to first optimize a synchronous approach thoroughly is a pragmatic one, providing a solid, reliable baseline.
Critique / Areas for Improvement:
- Generalizability of Heuristics: While the heuristics are effective for Transformer models on NVIDIA hardware, their direct applicability to vastly different model architectures (e.g., sparse models, diffusion models) or different accelerator types (e.g., TPUs, IPUs) might require re-evaluation. The observation that the optimal microbatch size is problem-dependent highlights the need for more adaptive strategies.
- Environmental Impact: Training trillion-parameter models on thousands of GPUs for months consumes enormous amounts of energy. While this paper focuses on efficiency (reducing the energy per FLOP or per iteration), the sheer scale of the computation means the overall environmental footprint remains substantial. Future research might explore ways to achieve similar model capabilities with less computational intensity or more energy-efficient hardware.
- Cost of Hardware: The Selene supercomputer is a top-tier HPC facility. While the software is open-sourced, replicating this performance requires access to comparable, expensive hardware. This could be a barrier for smaller research groups, despite the software's availability.
- Dynamic Adaptation: The paper's approach relies on a pre-configured parallelization strategy. A potential future improvement could involve dynamic adaptation of the parallelism degrees ($t$, $p$, $d$) and microbatch sizes during training, perhaps based on real-time GPU utilization and network metrics, to handle varying model characteristics or workload fluctuations more optimally. This aligns with the "automatic partitioning" research mentioned in related work.

Overall, this paper provides a robust and highly optimized solution that significantly advances the practical capabilities of large-scale AI model training, setting a new benchmark for throughput and efficiency.