
Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

Published: 04/09/2021

TL;DR Summary

This paper combines tensor, pipeline, and data parallelism and introduces a novel interleaved pipeline parallelism schedule to train large language models efficiently on GPU clusters, achieving 502 petaFLOP/s on 3072 GPUs; the interleaved schedule improves throughput by more than 10% over existing pipeline approaches.

Abstract

Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on even a multi-GPU server, and b) the number of compute operations required to train these models can result in unrealistically long training times. Consequently, new methods of model parallelism such as tensor and pipeline parallelism have been proposed. Unfortunately, naive usage of these methods leads to fundamental scaling issues at thousands of GPUs, e.g., due to expensive cross-node communication or devices spending significant time waiting on other devices to make progress. In this paper, we show how different types of parallelism methods (tensor, pipeline, and data parallelism) can be composed to scale to thousands of GPUs and models with trillions of parameters. We survey techniques for pipeline parallelism and propose a novel interleaved pipeline parallelism schedule that can improve throughput by 10+% with memory footprint comparable to existing approaches. We quantitatively study the trade-offs between tensor, pipeline, and data parallelism, and provide intuition as to how to configure distributed training of a large model. Our approach allows us to perform training iterations on a model with 1 trillion parameters at 502 petaFLOP/s on 3072 GPUs with achieved per-GPU throughput of 52% of theoretical peak. Our code is open sourced at https://github.com/nvidia/megatron-lm.


In-depth Reading


1. Bibliographic Information

1.1. Title

The title is "Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM." It identifies the central topic of the paper: the efficient training of large-scale language models on GPU clusters, specifically using the Megatron-LM framework.

1.2. Authors

The authors of this paper are Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. Their affiliations are:

  • †NVIDIA: Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro.
  • ‡Stanford University: Matei Zaharia.
  • *Microsoft Research: Deepak Narayanan, Amar Phanishayee.

The authors represent a mix of industry research (NVIDIA, Microsoft Research) and academia (Stanford University), indicating a strong collaboration between leading researchers in deep learning systems and large-scale computing.

1.3. Journal/Conference

This paper was published as a preprint on arXiv. While not a formal journal or conference publication, arXiv is a highly influential open-access repository for scientific preprints, especially in fields like AI and machine learning. Papers on arXiv often precede formal peer-reviewed publications and are widely read and cited in the research community. The work is also heavily associated with NVIDIA, a key player in GPU hardware for deep learning.

1.4. Publication Year

2021

1.5. Abstract

Large language models (LLMs) achieve state-of-the-art results across various tasks, but training them efficiently is highly challenging due to two primary issues: limited GPU memory prevents large models from fitting on single- or even multi-GPU servers, and the immense computational demands lead to impractically long training times. Model parallelism methods such as tensor parallelism and pipeline parallelism have been proposed, but their naive application faces fundamental scaling problems on thousands of GPUs, such as high cross-node communication costs or significant device idle time.

This paper introduces a novel approach that effectively composes different parallelism methods—tensor, pipeline, and data parallelism—to scale training to thousands of GPUs and models with up to a trillion parameters. The authors survey existing pipeline parallelism techniques and propose a new interleaved pipeline parallelism schedule, which improves throughput by over 10% while maintaining a comparable memory footprint. Through quantitative studies, they analyze the trade-offs between these parallelism types and provide practical guidance for configuring distributed training. Their method achieves a remarkable training iteration throughput of 502 petaFLOP/s for a 1-trillion-parameter model on 3072 GPUs, demonstrating a per-GPU throughput of 52% of the theoretical peak. The accompanying code is open-sourced.

Official Source: https://arxiv.org/abs/2104.04473
PDF Link: https://arxiv.org/pdf/2104.04473.pdf
Publication Status: Preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the inefficient and challenging training of extremely large language models (LLMs) on GPU clusters. Recent advancements in Transformer-based language models have led to unparalleled performance across Natural Language Processing (NLP) tasks, driven by an exponential increase in model size (number of parameters) and dataset scale.

This problem is critical in the current field for several reasons:

  • Memory Capacity Limitations: State-of-the-art LLMs, now reaching hundreds of billions or even trillions of parameters, simply cannot fit into the memory of a single GPU, or even a single multi-GPU server. This hardware limitation is a fundamental bottleneck.
  • Unrealistically Long Training Times: Even if a model could hypothetically fit (e.g., by constantly swapping data between host and device memory), the sheer number of floating-point operations (FLOPs) required for training would lead to training durations of hundreds of years on a single GPU, rendering such training impractical.
  • Limitations of Traditional Data Parallelism: While data parallelism is a common scale-out strategy, it faces two main limitations for very large models: a) per-GPU batch sizes become too small beyond a certain point, leading to low GPU utilization and high communication overhead; and b) the maximum number of devices that can be utilized is bounded by the global batch size, which can be insufficient for scaling to thousands of GPUs.
  • Scaling Issues with Existing Model Parallelism: Prior model parallelism techniques, such as tensor parallelism and pipeline parallelism, address memory constraints. However, their naive application introduces new scaling challenges:
    • Tensor parallelism within a multi-GPU server (e.g., NVIDIA DGX A100) works well up to a point, but scaling it across multi-GPU servers becomes inefficient due to slower inter-server links for all-reduce communication. It can also lead to small matrix multiplications (GEMMs) and decreased GPU utilization.

    • Pipeline parallelism, while effective for partitioning layers across devices, suffers from significant pipeline bubbles (idle time) due to synchronization requirements (pipeline flushes) to maintain strict optimizer semantics. This idle time can consume up to 50% of the training time, especially with fewer microbatches.

      The paper's entry point and innovative idea revolve around developing a composed parallelism strategy (PTD-P) that intelligently combines tensor, pipeline, and data parallelism to overcome these scaling limitations. Furthermore, it introduces a novel interleaved pipeline parallelism schedule to mitigate pipeline bubbles and enhance throughput, alongside various communication and computation optimizations. The authors aim to provide both an effective practical solution and a quantitative analysis of the tradeoffs involved in configuring these complex distributed training setups.

2.2. Main Contributions / Findings

The primary contributions and key findings of this paper are:

  • PTD-P: A Composed Parallelism Approach: The paper demonstrates how pipeline parallelism (across multi-GPU servers), tensor parallelism (within a multi-GPU server), and data parallelism can be effectively combined. This PTD-P strategy enables the practical training of models with up to a trillion parameters with graceful scaling on optimized GPU clusters.
  • Novel Interleaved Pipeline Parallelism Schedule: The authors propose a new interleaved pipeline schedule that significantly reduces pipeline bubble time compared to previous schedules (like PipeDream-Flush). This schedule improves throughput by as much as 10%+ while maintaining a comparable memory footprint, especially beneficial at smaller batch sizes.
  • Quantitative Study of Parallelism Trade-offs: The paper provides a comprehensive quantitative and analytical study of the interactions and trade-offs between tensor, pipeline, and data parallelism, as well as the impact of hyperparameters like microbatch size and activation recomputation. This analysis offers guiding principles for configuring distributed training effectively.
  • Achieved State-of-the-Art Training Throughput: The proposed approach enables training iterations on a 1-trillion-parameter model to reach an aggregate throughput of 502 petaFLOP/s on 3072 A100 GPUs. This corresponds to an achieved per-GPU throughput of 163 teraFLOP/s, or 52% of the theoretical peak device throughput, which is significantly higher than previous systems.
  • Practical Training Times: Based on these throughputs, the authors estimate the end-to-end training of a 1-trillion-parameter model to take approximately 3 months, demonstrating the practical viability of training such colossal models. For a 175-billion-parameter GPT-3 model, training is estimated at 34 days on 1024 A100 GPUs.
  • Superior Performance to ZeRO-3: The PTD-P approach outperforms ZeRO-3 (without model parallelism) by 70% for 175- and 530-billion-parameter models when scaling to more GPUs, primarily due to less cross-node communication.
  • Key Optimizations: The high throughput is attributed to several innovations and engineering efforts, including:
    • Scatter/gather communication optimization: Reduces redundant cross-node communication by leveraging tensor parallelism outputs.
    • Efficient kernel implementations and operator fusion: Ensures computation is compute-bound rather than memory-bound.
    • Smart partitioning of computation graphs: Minimizes network traffic and device idle periods.
  • Open-Sourced Software: The implemented system, an extension to the Megatron-LM codebase, is open-sourced at https://github.com/nvidia/megatron-lm, facilitating its adoption by the broader research community.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a beginner should be familiar with the following fundamental concepts in deep learning and distributed computing:

  • Deep Learning Models (specifically Transformers):

    • Neural Network: A computational model inspired by the human brain, composed of interconnected nodes (neurons) organized in layers. It learns to recognize patterns in data.
    • Transformer: A neural network architecture introduced in 2017, which became the backbone for many modern Large Language Models (LLMs). It revolutionized sequence-to-sequence tasks (like machine translation) and later NLP by relying entirely on attention mechanisms instead of recurrent neural networks (RNNs) or convolutional neural networks (CNNs).
    • Self-Attention: A core component of the Transformer model. It allows the model to weigh the importance of different parts of the input sequence when processing a specific element. For example, in a sentence, when processing the word "it," self-attention helps the model decide whether "it" refers to "the river" or "the bank."
    • Multi-Layer Perceptron (MLP) / Feed-Forward Network: A simple type of neural network within each Transformer layer that processes the output of the self-attention mechanism. It typically consists of two linear transformations with an activation function in between.
    • Parameters: The learnable weights and biases in a neural network. The number of parameters directly correlates with a model's capacity and memory footprint.
    • Activations: The outputs of neurons at each layer of a neural network. During the forward pass, these are computed and often stored for use in the backward pass.
    • Gradients: During training, gradients are computed to indicate how much each parameter should be adjusted to minimize the model's error.
    • Optimizer: An algorithm (e.g., Adam, SGD) that uses the computed gradients to update the model's parameters, iteratively improving its performance. Strict optimizer semantics mean that all GPUs must see a consistent version of the weights for gradient updates.
  • GPU Clusters and Hardware:

    • GPU (Graphics Processing Unit): A specialized electronic circuit designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer for output to a display device. In deep learning, GPUs are extensively used for their parallel processing capabilities, which are highly efficient for matrix multiplications and other computations fundamental to neural networks.
    • GPU Cluster: A group of interconnected GPUs (often housed in multi-GPU servers or nodes) that work together to perform large-scale computations, such as training very large deep learning models.
    • Multi-GPU Server (Node): A single physical server containing multiple GPUs (e.g., 8 NVIDIA A100 GPUs in a DGX A100 system).
    • NVLink and NVSwitch: High-bandwidth, low-latency interconnect technologies developed by NVIDIA for direct GPU-to-GPU communication within a multi-GPU server. NVLink provides direct links between GPUs, and NVSwitch acts as a routing chip to enable all-to-all communication between GPUs in a server.
    • InfiniBand: A high-speed networking interconnect technology used in High-Performance Computing (HPC) for inter-server communication (between different multi-GPU servers in a cluster). It offers higher bandwidth and lower latency compared to standard Ethernet.
    • Floating-Point Operations (FLOPs): A measure of computational workload, referring to the number of floating-point arithmetic operations (additions, multiplications, etc.) performed. TeraFLOP/s (trillions of FLOPs per second) and PetaFLOP/s (quadrillions of FLOPs per second) are units of computational speed.
    • Mixed Precision Training: A technique that uses both 16-bit floating-point formats (e.g., FP16 or bfloat16) and 32-bit floating-point formats (FP32) during training. It helps reduce memory consumption and speed up computations on GPUs with specialized Tensor Cores, while maintaining FP32 precision for critical parts like weight updates to prevent numerical instability.
  • Distributed Training Concepts:

    • Data Parallelism: The most common approach to distributed training. Each GPU holds a complete copy of the model. The input batch of data is divided among the GPUs, each computes gradients for its portion of the data, and then these gradients are aggregated (e.g., averaged) across all GPUs to update the model parameters.
    • Model Parallelism: When a model is too large to fit on a single GPU, its layers or components are split and distributed across multiple GPUs. This allows training models larger than a single GPU's memory capacity.
    • Tensor Parallelism (Intra-layer Model Parallelism): A type of model parallelism where individual layers (specifically, large matrix multiplications within a layer) of the model are partitioned across multiple GPUs. For example, a large weight matrix might be split into column or row chunks, with each GPU computing a part of the matrix multiplication. This usually requires all-reduce communication to synchronize intermediate results.
    • Pipeline Parallelism (Inter-layer Model Parallelism): A type of model parallelism where different layers or sequential groups of layers of a model are assigned to different GPUs (or pipeline stages). Data (in the form of microbatches) flows sequentially through these pipeline stages, similar to an assembly line. This helps fit very deep models and can overlap computation with communication.
    • Microbatch: In pipeline parallelism, a large training batch is divided into smaller units called microbatches. These microbatches are processed sequentially through the pipeline, allowing computation to be overlapped between pipeline stages.
    • Pipeline Bubble (or Stall): In pipeline parallelism, this refers to the idle time when GPUs in the pipeline are waiting for data or computation from other stages, particularly at the beginning and end of a batch during pipeline warm-up or flush phases. Minimizing this bubble is crucial for efficiency.
    • All-reduce: A collective communication operation where all participating GPUs contribute data, and all receive the final, reduced (e.g., summed, averaged) result. It's often used for gradient aggregation in data parallelism and for synchronizing intermediate results in tensor parallelism.
    • Point-to-point communication: A communication operation between exactly two GPUs (a sender and a receiver). This is typically faster and has lower overhead than collective operations like all-reduce, making it suitable for sequential data transfer in pipeline parallelism.
    • Activation Recomputation (or Checkpointing): A memory-saving technique where instead of storing all intermediate activations from the forward pass for use in the backward pass, some activations are recomputed during the backward pass. This trades off increased computation for reduced memory footprint.

3.2. Previous Works

The paper builds upon and differentiates itself from several key prior studies, primarily in distributed deep learning:

  • Transformer-based Language Models (e.g., GPT-3, BERT, T5):

    • Attention is All You Need [42] (Vaswani et al., 2017): Introduced the Transformer architecture, which uses self-attention mechanisms rather than RNNs or CNNs. This paper is foundational for the models being scaled.
    • BERT [13] (Devlin et al., 2018), GPT [33] (Radford et al., 2018), RoBERTa [27] (Liu et al., 2019), T5 [35] (Raffel et al., 2019), GPT-3 [11] (Brown et al., 2020): These works represent the lineage of large Transformer models that demonstrate the power of scaling, thereby motivating the need for efficient large-scale training. The current paper uses GPT models for its experiments.
  • Megatron-LM (Tensor Parallelism):

    • Megatron [40] (Shoeybi et al., 2019): This work introduced a specific tensor model parallelism strategy for Transformer layers. It partitions large matrix multiplications (GEMMs) within each Transformer layer across multiple GPUs. This strategy is leveraged in the current paper for intra-node (within a multi-GPU server) parallelization. The current paper extends Megatron-LM by combining this tensor parallelism with pipeline and data parallelism.
  • Pipeline Parallelism Schemes:

    • GPipe [20] (Huang et al., 2019): Proposed one of the early pipeline parallelism schedules where all forward passes for microbatches are completed first, followed by all backward passes. While reducing pipeline bubbles, it has a high memory footprint as it requires stashing all intermediate activations for all microbatches in a batch.
    • PipeDream-Flush [30] (Narayanan et al., 2020): An improvement over GPipe, this schedule uses a "1F1B" (one forward pass, one backward pass) pattern after an initial warm-up. It significantly reduces memory footprint by limiting the number of in-flight microbatches (and thus activations to be stashed) to the number of pipeline stages (instead of the total number of microbatches in a batch). The current paper uses PipeDream-Flush as its "default schedule" and further improves upon it with the interleaved schedule.
    • PipeDream [29] (Narayanan et al., 2019): A foundational work that generalized pipeline parallelism for DNN training and combined it with data parallelism. It also explored asynchronous and bounded-staleness approaches.
    • TeraPipe [26] (Li et al., 2021): Explores fine-grained pipeline parallelism across tokens within a single training sequence for auto-regressive models like GPT.
    • PipeTransformer [19] (He et al., 2021): Elastically adjusts the degree of pipelining and data parallelism by freezing "stable" layers and training "active" ones.
    • HetPipe [31] (Park et al., 2020): Combines pipeline and data parallelism on heterogeneous GPU clusters.
    • PipeMare [45] (Yang et al., 2021) and Kosson et al. [23] (Kosson et al., 2021): These works focus on asynchronous pipeline parallelism or relaxed weight update semantics to improve throughput by doing away with flushes, but potentially at the cost of convergence or final accuracy. The current paper focuses on strict optimizer semantics.
  • Sharded Data Parallelism (ZeRO):

    • ZeRO [36, 37] (Rajbhandari et al., 2019; Rajbhandari et al., 2021): A family of memory optimization techniques, ZeRO (Zero Redundancy Optimizer) shards the optimizer state, gradients, and eventually model parameters across data-parallel workers. This reduces memory footprint per GPU without introducing extra communication over vanilla data parallelism (for ZeRO-1/2) or by introducing additional, often hidden, communication (for ZeRO-3). ZeRO-Infinity [37] extends this by using NVMe (solid-state drives) for swapping parameters to disk, enabling training of very large models on a small number of GPUs, though with potentially very long training times. The current paper compares its PTD-P approach to ZeRO-3 and finds its method superior for large models due to less cross-node communication.
  • Automatic Partitioning:

    • FlexFlow [22] (Jia et al., 2018), PipeDream [29] (Narayanan et al., 2019), DAPPLE [14] (Fan et al., 2021), Tarnawski et al. [41] (Tarnawski et al., 2020): These systems explore automatic partitioning of DNN training graphs over multiple devices using cost models. The current paper, however, suggests heuristics rather than automatic exploration, acknowledging the increased complexity with more parallelism dimensions.
  • HPC for Model Training:

    • Goyal et al. [17] (Goyal et al., 2017) and You et al. [47] (You et al., 2018): Demonstrated HPC techniques to train ImageNet models quickly. These models, however, are typically smaller, fit on a single accelerator, use very large batch sizes allowing extensive data parallelism, and are CNN-based, which is inherently amenable to data-parallel communication.

3.3. Technological Evolution

The field of large-scale deep learning, particularly NLP, has seen an exponential growth in model size, as illustrated in Figure 1 of the paper. Initially, models could be trained on a single GPU. As models grew, data parallelism became the dominant strategy, allowing multiple GPUs to train a single model by distributing data. However, for models with billions of parameters, data parallelism alone became insufficient due to GPU memory limits (models wouldn't fit) and scaling constraints (max number of GPUs is limited by batch size).

This necessitated the development of model parallelism techniques:

  1. Early Model Parallelism: Involved manually splitting layers or operations across GPUs.

  2. Tensor Parallelism (intra-layer): Megatron-LM [40] emerged as a prominent solution, partitioning matrix multiplications within a single layer across GPUs within a multi-GPU server (e.g., DGX A100). This was efficient due to high-bandwidth NVLink interconnects.

  3. Pipeline Parallelism (inter-layer): Techniques like GPipe [20] and PipeDream [29] distributed entire layers across different GPUs, allowing deeper models to be trained. PipeDream-Flush [30] improved memory efficiency. These generally used point-to-point communication between pipeline stages.

  4. Sharded Data Parallelism (ZeRO): Approaches like ZeRO [36, 37] further optimized data parallelism by sharding optimizer states, gradients, and even parameters across data-parallel workers, reducing the memory footprint per GPU significantly.

    The current paper's work (Megatron-LM extension) represents the next evolutionary step: a sophisticated composition of these techniques (PTD-P). It recognizes that no single parallelism strategy is optimal for all scales and scenarios. Instead, it intelligently combines the strengths of tensor parallelism (for efficient intra-node computation), pipeline parallelism (for scaling inter-node across layers), and data parallelism (for overall scale-out when memory allows), while introducing new scheduling and communication optimizations to mitigate their individual weaknesses (e.g., pipeline bubbles, cross-node all-reduce overheads).

3.4. Differentiation Analysis

Compared to the main methods in related work, the core differences and innovations of this paper's PTD-P approach are:

  • Holistic Composition of Parallelism (PTD-P): Unlike many prior works that focus on one or two forms of parallelism in isolation (e.g., Megatron with tensor parallelism, PipeDream with pipeline and data parallelism, ZeRO with sharded data parallelism), this paper rigorously studies and optimizes the composition of all three major forms: tensor, pipeline, and data parallelism. This allows for scaling to truly unprecedented model sizes (trillion parameters) on thousands of GPUs. The paper emphasizes that neither tensor model parallelism nor pipeline model parallelism in isolation can match the performance of using both techniques in conjunction.

  • Optimized Hierarchical Parallelism Strategy: The PTD-P strategy implicitly leverages hardware topology:

    • Tensor parallelism is primarily used intra-node (within a multi-GPU server) where high-bandwidth NVLink makes its all-reduce communication efficient.
    • Pipeline parallelism is used inter-node (across multi-GPU servers) where its point-to-point communication is less sensitive to slower InfiniBand links than all-reduce.
    • Data parallelism then scales out the combined model-parallel replica to more GPUs. This hierarchical approach is key to its superior scaling.
  • Novel Interleaved Pipeline Parallelism Schedule: The paper proposes a new interleaved pipeline schedule that significantly reduces the pipeline bubble (idle time) compared to the widely used PipeDream-Flush schedule (a 1F1B approach). This schedule achieves 10%+ higher throughput by effectively making pipeline flushes happen sooner, with a comparable memory footprint. This is a direct improvement on existing pipeline parallelism techniques that maintain strict optimizer semantics.

  • Quantitative Trade-off Analysis and Guiding Principles: The paper provides a detailed quantitative and analytical study of how different parallelization dimensions (p, t, d), microbatch size, and activation recomputation interact and impact throughput, memory footprint, and communication overhead. It offers concrete "takeaways" or guiding principles for configuring distributed training, which is crucial for practical deployment but often lacking in theoretical works.

  • Specialized Communication Optimization (Scatter/Gather): The scatter/gather communication optimization specifically addresses a redundancy issue when tensor and pipeline parallelism are combined. By scattering the replicated outputs of tensor-parallel ranks across InfiniBand cards and then all-gathering them via fast NVLink on the receiving node, it reduces cross-node communication volume by a factor of the tensor-parallel size (e.g., 8x for t = 8). This makes the interleaved schedule more feasible and improves overall throughput.

  • Hardware-Software Co-design and Engineering Excellence: The paper highlights the importance of efficient kernel implementations, operator fusion, and careful data layout specific to the Transformer architecture to ensure computation remains compute-bound on A100 GPUs. This deep engineering, combined with optimized hardware (Selene supercomputer with A100 GPUs, NVLink, NVSwitch, InfiniBand), contributes to the achieved performance levels, which are significantly higher (e.g., 52% of peak vs. 36% of peak for DeepSpeed [2] for trillion-parameter models) than comparable systems like DeepSpeed or ZeRO-3.

4. Methodology

4.1. Principles

The core idea of the method used in this paper is PTD-P, which stands for Pipeline, Tensor, and Data Parallelism. It's a comprehensive strategy for efficiently training extremely large language models (LLMs) on GPU clusters. The theoretical basis or intuition behind PTD-P is to intelligently combine the strengths of different parallelism types while mitigating their individual weaknesses, especially concerning GPU memory capacity, computation time, and communication overhead in a hierarchical cluster environment.

The key principles are:

  1. Memory Management: Models with trillions of parameters cannot fit on a single GPU or even a single multi-GPU server. PTD-P uses tensor parallelism and pipeline parallelism to shard the model's parameters and intermediate activations across many GPUs, ensuring the model can be represented in distributed memory. Activation recomputation further helps manage activation memory.

  2. Maximize GPU Utilization and Minimize Idle Time: Distributed training inherently introduces idle time (e.g., pipeline bubbles) and communication overhead. The goal is to keep GPUs as busy as possible performing useful computation. The interleaved pipeline schedule is designed to reduce pipeline bubbles, while communication optimizations (scatter/gather) and computation optimizations (fused kernels, data layout) aim to make computations compute-bound and communication less of a bottleneck.

  3. Optimal Communication Strategy: The hierarchical nature of GPU clusters (fast intra-server NVLink, slower inter-server InfiniBand) dictates how communication-intensive operations should be placed. Tensor parallelism, which involves frequent all-reduce operations, is best kept intra-server. Pipeline parallelism, relying on point-to-point communication, is more suitable for inter-server scaling. Data parallelism, which typically involves less frequent but larger all-reduce operations, is used for overall scale-out.

  4. Scalability: The composition allows scaling to thousands of GPUs and trillion-parameter models, addressing the limitations of single parallelism techniques that might not scale beyond certain model sizes or numbers of devices.

  5. Strict Optimizer Semantics: The paper prioritizes maintaining strict optimizer semantics, meaning all GPUs see consistent weight versions, which is crucial for predictable convergence and accuracy. This necessitates mechanisms like pipeline flushes, but the system is designed to minimize their impact.

    In essence, PTD-P is about finding the optimal balance between memory footprint, device utilization (reducing pipeline bubbles), and communication overhead by strategically applying and optimizing multiple forms of parallelism across a GPU cluster's hardware hierarchy.

4.2. Core Methodology In-depth (Layer by Layer)

The PTD-P methodology combines data parallelism, pipeline model parallelism, and tensor model parallelism. The implementation extends the Megatron-LM codebase using PyTorch.

4.2.1. Modes of Parallelism

The paper combines pipeline model parallelism and tensor model parallelism with data parallelism, which they call PTD-P. The overall structure of how tensor and pipeline parallelism are combined is visualized in Figure 2.

Figure 2 (schematic): The tensor-parallel (Tensor MP) and pipeline-parallel (Pipeline MP) partitioning scheme applied to two Transformer layers, indicating each layer's tensor and pipeline partitions.

4.2.1.1. Data Parallelism

With data parallelism [25, 43], each worker (which could be a GPU or a model-parallel group of GPUs) typically has a copy of the full model. The input dataset is sharded, meaning each worker receives a different subset of the data for its local computation. After processing its local data and computing gradients, workers periodically aggregate their gradients (e.g., using all-reduce) to ensure that all workers maintain a consistent version of the model's weights.

For very large models that do not fit entirely on a single worker, data parallelism can be used on smaller model shards. This means that if a model is too large for one GPU, it must first be split across multiple GPUs using model parallelism (tensor or pipeline), and then multiple copies of this model-parallel group can be run in data-parallel fashion.
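To make the gradient-aggregation step concrete, here is a minimal single-process sketch of our own (not Megatron-LM code) that emulates two data-parallel workers with plain PyTorch on CPU; in a real cluster the per-parameter averaging below would be an all-reduce over NCCL.

```python
import torch

# Two "workers", each holding an identical replica of a tiny model.
torch.manual_seed(0)
replicas = [torch.nn.Linear(4, 2) for _ in range(2)]
replicas[1].load_state_dict(replicas[0].state_dict())

# Each worker sees a different shard of the global batch.
shards = [torch.randn(8, 4), torch.randn(8, 4)]
targets = [torch.randn(8, 2), torch.randn(8, 2)]

# Local forward/backward passes produce per-worker gradients.
for model, x, y in zip(replicas, shards, targets):
    torch.nn.functional.mse_loss(model(x), y).backward()

# Emulated all-reduce: average gradients parameter-by-parameter across workers,
# so both replicas apply the same update and their weights stay consistent.
for params in zip(*(m.parameters() for m in replicas)):
    avg_grad = torch.stack([p.grad for p in params]).mean(dim=0)
    for p in params:
        p.grad = avg_grad.clone()
```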

4.2.1.2. Pipeline Model Parallelism

In pipeline parallelism, the layers of a model are sharded across multiple devices. For models with repetitive structures, such as Transformer blocks, each device can be assigned an equal number of layers. The paper focuses on such symmetric architectures.

A key aspect of pipeline parallelism is that a batch of data is split into smaller units called microbatches. These microbatches are then processed in a pipelined fashion across the devices. To ensure that optimizer steps are synchronized and inputs see consistent weight versions across forward and backward passes (maintaining strict optimizer semantics), the paper introduces periodic pipeline flushes. During a pipeline flush, no new microbatches are injected, and all currently in-flight microbatches are allowed to complete their forward and backward passes. This leads to periods where devices are idle, known as the pipeline bubble. The goal is to minimize this pipeline bubble to improve efficiency.

The paper discusses two approaches to scheduling forward and backward microbatches:

4.2.1.2.1. Default Schedule (PipeDream-Flush)

The GPipe schedule [20] first executes all forward passes for all microbatches in a batch, followed by all backward passes. This strategy, however, requires a high memory footprint because intermediate activations for all microbatches (or at least input activations for activation recomputation) must be stored in memory throughout the entire training iteration.

Instead, the paper uses the PipeDream-Flush schedule [30], which is more memory-efficient. This schedule operates in three phases, as shown in Figure 3.

Figure 3 (schematic): Forward passes (blue) and backward passes (green) of microbatches across four pipeline devices over time, including the idle periods during the pipeline flush.

  • Warm-up Phase: Workers perform differing numbers of forward passes to fill the pipeline.

  • Steady State (1F1B): After warm-up, each worker enters a steady state where it performs one forward pass (1F) immediately followed by one backward pass (1B) for a microbatch. This ensures that the number of in-flight microbatches (those for which the backward pass is outstanding and whose activations need to be retained) is limited to the depth of the pipeline (p), rather than the total number of microbatches (m) in a batch.

  • Cool-down Phase: At the end of a batch, backward passes for all remaining in-flight microbatches are completed.

    The pipeline bubble in this schedule is quantified as $t_{pb}$. Let $m$ be the number of microbatches in a batch, $p$ the number of pipeline stages (devices used for pipeline parallelism), $t_{id}$ the ideal time per iteration (assuming perfect scaling), and $t_f$ and $t_b$ the time to execute a single microbatch's forward and backward pass, respectively. The total time spent in the pipeline bubble is: $ t_{pb} = (p - 1) \cdot (t_f + t_b) $. The ideal processing time for the batch is: $ t_{id} = m \cdot (t_f + t_b) $. Therefore, the fraction of ideal computation time spent in the pipeline bubble (the pipeline bubble size) is: $ \text{Bubble time fraction} = \frac{t_{pb}}{t_{id}} = \frac{p - 1}{m} $.

  • $p$: Number of pipeline stages (devices).

  • $m$: Number of microbatches in a batch per pipeline.

  • $t_{pb}$: Time spent in the pipeline bubble.

  • $t_{id}$: Ideal processing time for the batch.

  • $t_f$: Time for a single microbatch's forward pass.

  • $t_b$: Time for a single microbatch's backward pass.

    For the bubble time fraction to be small, it is necessary that $m \gg p$. While the PipeDream-Flush schedule has the same bubble time fraction as GPipe, its primary advantage is a significantly reduced memory footprint: activations need to be stashed for at most $p$ in-flight microbatches (compared to all $m$ microbatches for GPipe), making it much more memory-efficient when $m \gg p$ (a short numeric illustration follows).
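The illustration below simply plugs a few example values into the bubble-fraction formula; the numbers are illustrative, not taken from the paper.

```python
def bubble_fraction(p: int, m: int) -> float:
    """Pipeline bubble size for the default (PipeDream-Flush / GPipe) schedule: (p - 1) / m."""
    return (p - 1) / m

# With 8 pipeline stages, the bubble shrinks as more microbatches are used per batch.
for m in (8, 32, 128):
    print(f"p=8, m={m:4d} -> bubble fraction = {bubble_fraction(8, m):.3f}")
```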

4.2.1.2.2. Schedule with Interleaved Stages

To further reduce the pipeline bubble size, the paper proposes a novel interleaved schedule. Instead of assigning a single contiguous set of layers to each device, each device performs computation for multiple subsets of layers, called model chunks (the number of chunks per device is denoted v). For example, if a device previously handled layers 1-4, it could instead handle layers 1, 2, 9, 10, effectively managing two model chunks. Each device in the pipeline is thus assigned multiple pipeline stages, where each stage involves less computation than before.

Similar to PipeDream-Flush, this interleaved schedule adapts the memory-efficient 1F1B pattern. This new schedule, shown in Figure 4, requires the number of microbatches in a batch to be an integer multiple of the degree of pipeline parallelism (number of devices in the pipeline).

Figure 4 (schematic): Distribution of forward and backward passes across multiple devices. The top shows the default assignment of a single contiguous stage per device; the bottom shows the interleaved assignment of multiple smaller stages per device, which improves device utilization.

As depicted in Figure 4, the pipeline flush for the same batch size occurs sooner in the interleaved schedule. If each device has $v$ stages (or model chunks), the forward and backward pass times for a microbatch of each stage become $t_f / v$ and $t_b / v$, respectively. The pipeline bubble time is thus reduced to: $ t_{pb}^{\mathrm{int.}} = \frac{(p - 1) \cdot (t_f + t_b)}{v} $ And the bubble time fraction becomes: $ \text{Bubble time fraction (pipeline bubble size)} = \frac{t_{pb}^{\mathrm{int.}}}{t_{id}} = \frac{1}{v} \cdot \frac{p - 1}{m} $.

  • $p$: Number of pipeline stages (devices).

  • $m$: Number of microbatches in a batch per pipeline.

  • $v$: Number of model chunks (stages) assigned to each device.

  • $t_{pb}^{\mathrm{int.}}$: Interleaved pipeline bubble time.

  • $t_{id}$: Ideal processing time for the batch.

    The new schedule therefore reduces the bubble time by a factor of $v$. However, this reduction comes at the cost of increased communication, also by a factor of $v$, because activations must be exchanged between pipeline stages more frequently due to the finer-grained chunking. The paper discusses how to mitigate this extra communication using the scatter/gather optimization described later. The snippet below compares the two schedules numerically.
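The comparison below is a rough counting model with illustrative values of our own choosing, not measurements: the bubble shrinks by a factor of v while the point-to-point communication volume grows by the same factor.

```python
def interleaved_bubble(p: int, m: int, v: int) -> float:
    """Bubble fraction with v model chunks per device: (1/v) * (p - 1) / m."""
    return (p - 1) / (v * m)

p, m = 8, 32
for v in (1, 2, 4):
    print(f"v={v}: bubble fraction = {interleaved_bubble(p, m, v):.4f}, "
          f"relative pipeline p2p communication volume ~ {v}x")
```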

4.2.1.3. Tensor Model Parallelism

Tensor model parallelism partitions individual layers of the model over multiple devices. The paper specifically uses the partitioning strategy from Megatron [40] for Transformer layers.

A Transformer layer fundamentally consists of a self-attention block followed by a two-layer Multi-Layer Perceptron (MLP).

  • MLP Block Partitioning: The MLP block involves two General Matrix Multiplications (GEMMs) and a GeLU non-linearity: $ Y = \mathrm{GeLU}(XA), \ Z = \mathrm{Dropout}(YB) $. Here, $X$ is the input and $A$ and $B$ are weight matrices.

    1. Splitting $A$: The first weight matrix $A$ is split along its columns: $ A = [A_1, A_2] $. This allows the GeLU non-linearity to be applied independently to the output of each partitioned GEMM on different GPUs: $ [Y_1, Y_2] = [\mathrm{GeLU}(XA_1), \mathrm{GeLU}(XA_2)] $. The column-wise partitioning of $A$ matters because GeLU is non-linear: if $A$ were split along its rows, an all-reduce would be needed before applying GeLU, introducing an extra synchronization point. By splitting along columns, GeLU can be applied locally with no communication.
    2. Splitting $B$: The second weight matrix $B$ is then split along its rows to remove the need for communication between the two GEMMs: $ B = \begin{bmatrix} B_1 \\ B_2 \end{bmatrix}, \ Y = [Y_1, Y_2] $. The second GEMM $YB$ can then be computed in parallel as $Y_1 B_1 + Y_2 B_2$, where each GPU computes one term of the sum. The partial results $Y_1 B_1$ and $Y_2 B_2$ are then summed across the GPUs with an all-reduce (the $g$ operator in Figure 5a) before the Dropout layer.
  • Self-Attention Block Partitioning: The multi-head attention operation is inherently parallel. The Key ($K$), Query ($Q$), and Value ($V$) matrices, which are derived from the input through linear transformations, can be partitioned in a column-parallel fashion. After computing the attention mechanism on these partitioned matrices, the output linear layer operates directly on the partitioned output of the attention operation, with its weight matrix partitioned across rows.

This tensor model parallelism approach, as illustrated in Figure 5, requires only two all-reduce operations in the forward pass (the $g$ operators) and two all-reduce operations in the backward pass (the $f$ operators) per Transformer layer. The operators $f$ and $g$ are conjugates: $f$ is the identity in the forward pass and an all-reduce in the backward pass, while $g$ is the reverse (an all-reduce in the forward pass and the identity in the backward pass). A small numerical check of this partitioning appears after Figure 5.

Figure 5: Blocks of the transformer model partitioned with tensor model parallelism (figures borrowed from Megatron [40]). $f$ and $g$ are conjugate: $f$ is the identity operator in the forward pass and all-reduce in the backward pass, while $g$ is the reverse.
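The column/row split can be verified numerically. The sketch below is our own illustration (not Megatron-LM's sharded implementation): it emulates t = 2 tensor-parallel ranks inside a single process using plain tensor slices, the final sum stands in for the forward-pass all-reduce (the g operator), and dropout is omitted for determinism.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
b, h = 4, 16                       # microbatch size and hidden size (toy values)
X = torch.randn(b, h)
A = torch.randn(h, 4 * h)          # first MLP GEMM weight
B_mat = torch.randn(4 * h, h)      # second MLP GEMM weight

# Reference: unpartitioned MLP.
Z_ref = F.gelu(X @ A) @ B_mat

# Tensor parallelism with t = 2: A split by columns, B split by rows.
t = 2
A_cols = A.chunk(t, dim=1)         # A = [A1, A2]
B_rows = B_mat.chunk(t, dim=0)     # B = [B1; B2]

# Each "rank" computes GeLU(X @ A_i) @ B_i locally; summing the partial results
# emulates the all-reduce that precedes the Dropout layer.
partials = [F.gelu(X @ A_i) @ B_i for A_i, B_i in zip(A_cols, B_rows)]
Z_tp = sum(partials)

assert torch.allclose(Z_ref, Z_tp, atol=1e-4)
```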

4.2.2. Performance Analysis of Parallelization Configurations

This section analyzes the performance implications of combining pipeline and tensor model parallelism with data parallelism, discussing trade-offs between memory footprint, device utilization, and communication.

4.2.2.1. Notation

  • (p, t, d): Parallelization dimensions.
    • p: Pipeline-model-parallel size (number of devices used for pipeline parallelism).
    • t: Tensor-model-parallel size (number of devices used for tensor parallelism).
    • d: Data-parallel size (number of devices used for data parallelism).
  • n: Total number of GPUs; the product $p \cdot t \cdot d = n$.
  • B: Global batch size (provided as input).
  • b: Microbatch size.
  • m: Number of microbatches in a batch per pipeline, calculated as $m = \frac{1}{b} \cdot \frac{B}{d}$.
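These definitions are easy to sanity-check in code. The helper below is our own, with illustrative numbers in the spirit of the paper's large-scale runs; it enforces $p \cdot t \cdot d = n$ and computes m.

```python
def microbatches_per_pipeline(n: int, p: int, t: int, d: int, B: int, b: int) -> int:
    """Validate a (p, t, d) configuration and return m = B / (d * b)."""
    assert p * t * d == n, "parallelism degrees must multiply to the total GPU count"
    assert B % (d * b) == 0, "global batch must split evenly into microbatches"
    return B // (d * b)

# Illustrative example: 3072 GPUs split as (p, t, d) = (64, 8, 6),
# global batch size 3072, microbatch size 1.
print(microbatches_per_pipeline(n=3072, p=64, t=8, d=6, B=3072, b=1))  # -> 512
```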

4.2.2.2. Tensor and Pipeline Model Parallelism

The paper considers the interactions between tensor and pipeline model parallelism. Assuming data-parallel size $d = 1$, the total number of GPUs is $n = t \cdot p$. The pipeline bubble size (from section 4.2.1.2.1) in terms of $t$ is: $ \frac{p - 1}{m} = \frac{n/t - 1}{m} $. As the tensor-model-parallel size $t$ increases, the pipeline-parallel size $p = n/t$ decreases, so the term $n/t - 1$ decreases, and thus the pipeline bubble shrinks for a fixed global batch size B, microbatch size b, and data-parallel size d (since m is fixed). This suggests that increasing tensor parallelism can reduce pipeline bubble overhead.

However, communication costs differ significantly:

  • Pipeline Model Parallelism: Uses cheaper point-to-point communication. For each microbatch, the total communication between consecutive devices (for either the forward or the backward pass) is $b \cdot s \cdot h$, where s is the sequence length and h is the hidden size.

  • Tensor Model Parallelism: Uses expensive all-reduce communication. For each microbatch and each layer, tensors of total size $b \cdot s \cdot h$ need to be all-reduced among the $t$ tensor-parallel ranks twice in the forward pass and twice in the backward pass. This leads to a total communication of $8 b s h \left( \frac{t - 1}{t} \right)$ per layer per device per microbatch. If a pipeline stage has $l^{\mathrm{stage}}$ layers, the total tensor-parallel communication per device per microbatch is $l^{\mathrm{stage}} \cdot 8 b s h \left( \frac{t - 1}{t} \right)$.

    Tensor model parallelism therefore significantly increases the amount of communication between devices. If $t$ is larger than the number of GPUs within a single multi-GPU server, this all-reduce communication must traverse slower inter-node links (e.g., InfiniBand), creating a substantial performance bottleneck (the sketch below puts the two expressions side by side).
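This sketch is a simple counting model in elements per device per microbatch; it ignores latency, overlap, and link topology, and the sizes are illustrative.

```python
def pipeline_p2p_elements(b: int, s: int, h: int) -> int:
    # Point-to-point activation transfer between consecutive pipeline stages.
    return b * s * h

def tensor_allreduce_elements(b: int, s: int, h: int, t: int, layers_per_stage: int) -> float:
    # Four all-reduces of a b*s*h tensor per layer (2 forward + 2 backward),
    # costing 8*b*s*h*(t-1)/t per device per layer.
    return layers_per_stage * 8 * b * s * h * (t - 1) / t

b, s, h = 1, 2048, 4096
print(pipeline_p2p_elements(b, s, h))                                # ~8.4e6 elements
print(tensor_allreduce_elements(b, s, h, t=8, layers_per_stage=16))  # far larger
```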

Takeaway #1: When considering different forms of model parallelism, tensor model parallelism should generally be used up to the degree equal to the number of GPUs within a multi-GPU server (e.g., 8 on a DGX A100). Beyond this, pipeline model parallelism should be used to scale up to larger models across different servers, leveraging its more efficient point-to-point communication.

4.2.2.3. Data and Model Parallelism

This section considers the interaction between data parallelism and model parallelism (pipeline and tensor) independently.

4.2.2.3.1. Pipeline Model Parallelism

Assuming tensor-model-parallel size $t = 1$, the number of microbatches per pipeline is $m = B / (d \cdot b)$. Let $b' := B/b$ denote the ratio of global batch size to microbatch size; then $m = b'/d$. With a total of n GPUs, the number of pipeline stages is $p = n / (t \cdot d) = n/d$. The pipeline bubble size (from section 4.2.1.2.1) is then: $ \frac{p - 1}{m} = \frac{n/d - 1}{b'/d} = \frac{n - d}{b'} $.

  • $p$: Number of pipeline stages.

  • $m$: Number of microbatches per pipeline.

  • $n$: Total number of GPUs.

  • $d$: Data-parallel size.

  • $b'$: Ratio of global batch size to microbatch size ($B/b$).

    As the data-parallel size d increases, $n - d$ becomes smaller, which in turn reduces the pipeline bubble size. Figure 6 illustrates this behavior.

Figure 6: Fraction of time spent idling due to pipeline flush (pipeline bubble size) versus data-parallel size d, for different numbers of GPUs n and ratios of batch size to microbatch size $b' = B/b$.

Overall throughput will increase if the all-reduce communication required for data parallelism does not increase drastically with higher d. This is generally true for ring-based implementations, where the per-device communication time scales with $(d - 1)/d = 1 - 1/d$ and therefore grows only slowly as d increases.

Increasing the global batch size B also affects throughput. For a given parallel configuration, as B increases, $b' = B/b$ increases, so the term $(n - d)/b'$ decreases, reducing the pipeline bubble fraction and increasing throughput. Additionally, all-reduce communication for data parallelism becomes less frequent, further boosting throughput.
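The $(n - d)/b'$ expression can be tabulated directly to reproduce the qualitative trend in Figure 6; the values below are illustrative only.

```python
def bubble_vs_data_parallel(n: int, d: int, b_prime: int) -> float:
    # Pipeline bubble fraction with t = 1: (n/d - 1) / (b'/d) = (n - d) / b'.
    return (n - d) / b_prime

n, b_prime = 128, 512
for d in (1, 2, 4, 8, 16, 32, 64, 128):
    print(f"d={d:3d} -> bubble fraction = {bubble_vs_data_parallel(n, d, b_prime):.3f}")
```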

4.2.2.3.2. Data and Tensor Model Parallelism

With tensor model parallelism, all-reduce communication must be performed for every microbatch within each layer. This can be very expensive, especially across multi-GPU servers using InfiniBand. In contrast, data parallelism only needs to perform its expensive all-reduce communication once per batch (for gradients). Furthermore, as the tensor-model-parallel size increases, each model-parallel rank (GPU) performs a smaller subset of computations within each model layer. For insufficiently-large layers (e.g., small hidden sizes or sequence lengths), these smaller sub-matrix computations (GEMMs) might not be performed by modern GPUs with peak efficiency, leading to lower GPU utilization.

Takeaway #2: When combining data and model parallelism, a total model-parallel size of $M = t \cdot p$ should be chosen such that the model's parameters and intermediate metadata (e.g., activations) fit within GPU memory. Once this memory constraint is met, data parallelism should be used to scale training up to more GPUs, as it generally incurs less frequent and potentially more efficient collective communication than tensor parallelism across nodes.

4.2.2.4. Microbatch Size

The choice of microbatch size b is a critical hyperparameter affecting model training throughput. It impacts both the arithmetic intensity of GPU kernels and the pipeline bubble size. Figure 7 illustrates how per-GPU throughput can increase significantly with a larger microbatch size on a single GPU.

Figure 7: Per-GPU throughput versus microbatch size for a GPT model with a billion parameters (128 attention heads, hidden size of 4096, 4 transformer layers).

The optimal microbatch size must be determined for a given parallel configuration (p, t, d) and global batch size B. The amount of data-parallel communication remains constant regardless of microbatch size. Given functions $t_f(b)$ and $t_b(b)$ that map the microbatch size b to the forward and backward computation times for a single microbatch, the total time spent computing a batch, ignoring communication cost, is: $ (b'/b + p - 1) \cdot (t_f(b) + t_b(b)) $, where here $b'$ denotes the batch size per pipeline, $B/d$.

  • $b'$: Here, the batch size per pipeline, $B/d$, so that $b'/b = m$ is the number of microbatches processed by a single pipeline over the global batch. (Note that this differs from the $b' = B/b$ used in Section 4.2.2.3.1.)

  • $p$: Pipeline-model-parallel size.

  • $b$: Microbatch size.

  • $t_f(b)$: Forward pass computation time for a microbatch of size b.

  • $t_b(b)$: Backward pass computation time for a microbatch of size b.

    The first factor, $(b'/b + p - 1)$, is the effective number of microbatch processing steps, including the pipeline bubble overhead. The microbatch size thus influences both the efficiency of GPU kernel execution (larger microbatches generally give better arithmetic intensity and GPU utilization) and the pipeline bubble size (fewer microbatches $m = b'/b$ mean a larger bubble fraction $\frac{p-1}{m}$). These two factors are at odds.

Figure 8 shows the estimated throughput (using the above equation) for a GPT model with a billion parameters.

Figure 8: Behavior of normalized estimated throughput (time computed as $t = (b'/b + p - 1) \cdot (t_f(b) + t_b(b))$) with respect to the microbatch size b for the same GPT model as in Figure 7.

The plot indicates an optimal microbatch size of 4 for both batch sizes in this specific example. The optimal value is problem-dependent.
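To make the trade-off concrete, the sketch below minimizes the estimated batch time over candidate microbatch sizes. The affine cost model (a fixed per-microbatch overhead plus a per-sample cost) is made up purely for illustration; real values would come from profiling.

```python
def batch_time(B: int, d: int, p: int, b: int, t_f, t_b) -> float:
    """Estimated compute time per batch: (b'/b + p - 1) * (t_f(b) + t_b(b)), with b' = B/d."""
    b_prime = B / d
    return (b_prime / b + p - 1) * (t_f(b) + t_b(b))

# Toy cost model: larger microbatches amortize a fixed per-kernel overhead.
t_f = lambda b: 1.0 + 0.6 * b
t_b = lambda b: 2.0 + 1.2 * b

B, d, p = 512, 4, 8
times = {b: batch_time(B, d, p, b, t_f, t_b) for b in (1, 2, 4, 8, 16)}
best_b = min(times, key=times.get)
print(f"best microbatch size under this toy model: b = {best_b}")
```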

Takeaway #3: The optimal microbatch size b is a complex balance that depends on the model's throughput and memory-footprint characteristics, as well as the pipeline depth p, the data-parallel size d, and the global batch size B. It is not a one-size-fits-all value.

4.2.2.5. Activation Recomputation

Activation recomputation [12, 18, 20, 21] (also known as gradient checkpointing) is an optional memory-saving technique. During the forward pass, instead of storing all intermediate activations needed for the backward pass, only a subset (or just the input activations for each pipeline stage) is stored. The remaining activations are recomputed during the backward pass. This trades off increased computation (FLOPs) for reduced memory footprint.

If a model stage has l layers and activation checkpoints are taken at c points within that stage, the total memory footprint for activations is approximately: $ c \cdot A^{\mathrm{input}} + \frac{l}{c} \cdot A^{\mathrm{intermediate}} $

  • $A^{\mathrm{input}}$: Size of the input activations of a layer.

  • $A^{\mathrm{intermediate}}$: Size of the intermediate activations within a layer.

  • $l$: Number of layers in a model stage.

  • $c$: Number of checkpoints (recomputation points) within the stage.

    The number of checkpoints that minimizes the memory footprint is $c = \sqrt{l \cdot (A^{\mathrm{intermediate}} / A^{\mathrm{input}})}$. In practice, the authors find that checkpointing every 1 or 2 Transformer layers is close to optimal. While activation recomputation increases total FLOPs (parts of the forward pass are effectively run twice), it is often required to train very large models that would otherwise exceed GPU memory. The snippet below evaluates these expressions.
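The activation sizes in this snippet are arbitrary placeholders, not measured values; it just evaluates the memory expression and its closed-form minimizer.

```python
import math

def checkpoint_memory(c: float, l: int, A_input: float, A_intermediate: float) -> float:
    # Stored inputs for c checkpoints plus live intermediates of one recomputed segment.
    return c * A_input + (l / c) * A_intermediate

def optimal_checkpoints(l: int, A_input: float, A_intermediate: float) -> float:
    # Minimizer of the expression above: c* = sqrt(l * A_intermediate / A_input).
    return math.sqrt(l * A_intermediate / A_input)

l, A_in, A_int = 24, 1.0, 4.0      # arbitrary units
c_star = optimal_checkpoints(l, A_in, A_int)
print(c_star, checkpoint_memory(round(c_star), l, A_in, A_int))
```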

Other techniques like activation partitioning [36] can also be used with tensor model parallelism for further memory footprint reduction.

4.2.3. Implementation

The PTD-P system is implemented as an extension to the Megatron-LM codebase, built using PyTorch [32]. NCCL [7] is used for inter-device communication. To achieve high performance, specific communication and computation optimizations were implemented.

4.2.3.1. Communication Optimizations

When combining pipeline and tensor model parallelism, a key challenge arises from cross-node communication. Each DGX A100 server has 8 InfiniBand (IB) networking cards. Point-to-point communication in pipeline parallelism typically occurs between a single pair of GPUs on adjacent servers, making it hard to utilize all 8 IB cards for a single logical communication.

The paper introduces a scatter/gather communication optimization to address this, leveraging the fact that outputs of tensor-parallel ranks are often replicated. Specifically, after the all-reduce operation in tensor parallelism (e.g., after the $g$ operator in the MLP block), the output tensor is replicated across all tensor-parallel ranks within a multi-GPU server.

  1. Redundancy: If this replicated tensor were simply sent from each of the $t$ tensor-parallel ranks on the sending node to its corresponding rank on the next pipeline stage node, it would result in $t$ redundant copies of the same tensor being sent over inter-node InfiniBand links (as shown in Figure 9a).
  2. Scatter/Gather Optimization: Instead, the optimization works as follows (Figure 9b):
    • Sender Side: The replicated tensor on the sending multi-GPU server is split (scattered) into $t$ equal-sized chunks. Each tensor-parallel rank then sends only one unique chunk to its corresponding rank on the next pipeline stage node, using its dedicated InfiniBand card. This reduces the size of data sent over the slower InfiniBand links by a factor of $t$.

    • Receiver Side: On the receiving node, after each tensor-parallel rank receives its chunk, an all-gather operation is performed over the high-bandwidth NVLink (which is much faster than InfiniBand) to re-materialize the full tensor on each tensor-parallel rank.


(a) W/o scatter/gather optimization. (b) With scatter/gather optimization. Figure 9: Scatter/gather communication optimization. Light blue blocks are layers in the first pipeline stage, and dark blue blocks are layers in the second pipeline stage. Without the scatter/gather optimization, the same tensor is sent redundantly over inter-node InfiniBand links. Instead, at the sender, we can scatter the tensor into smaller chunks, reducing the sizes of tensors sent over InfiniBand links. The final tensor can then be rematerialized at the receiver using a gather operation.

Quantitatively, with this optimization, the total amount of communication between every pair of consecutive pipeline stages is reduced from $bsh$ to $\frac{bsh}{t}$, where $t$ is the tensor-model-parallel size (e.g., $t=8$ in their experiments), $s$ is the sequence length, and $h$ is the hidden size. This optimization effectively utilizes multiple InfiniBand cards and makes communication-intensive schedules like the interleaved one more feasible.
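A hedged sketch of the pattern is shown below using `torch.distributed` primitives; the process group `tp_group`, the rank variables, and the tensor shapes are illustrative assumptions, and this is not the Megatron-LM implementation.

```python
import torch
import torch.distributed as dist

# Sketch of the scatter/gather pattern. `tp_group` (the intra-node
# tensor-parallel process group of size t), `next_stage_rank`, and
# `prev_stage_rank` are assumed to have been set up elsewhere.

def send_scattered(output: torch.Tensor, tp_rank: int, t: int, next_stage_rank: int):
    # Sender: the [s, b, h] output is replicated on all t tensor-parallel ranks,
    # so each rank sends only its 1/t slice over the inter-node link.
    chunk = output.chunk(t, dim=-1)[tp_rank].contiguous()
    dist.send(chunk, dst=next_stage_rank)

def recv_gathered(s: int, b: int, h: int, t: int, prev_stage_rank: int, tp_group):
    # Receiver: get this rank's 1/t slice, then all-gather over NVLink
    # (the intra-node group) to rematerialize the full [s, b, h] tensor.
    chunk = torch.empty(s, b, h // t, dtype=torch.float16, device="cuda")
    dist.recv(chunk, src=prev_stage_rank)
    chunks = [torch.empty_like(chunk) for _ in range(t)]
    dist.all_gather(chunks, chunk, group=tp_group)
    return torch.cat(chunks, dim=-1)
```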

4.2.3.2. Computation Optimizations

The authors implemented three model-specific optimizations to the computation graph to achieve high performance:

  1. Data Layout Change: The data layout in the Transformer layer was changed from [b, s, a, h] (batch, sequence, attention-head, hidden-size) to [s, b, a, h]. This avoids memory-intensive transpose operations and enables the use of strided batched GEMM kernels, which are more efficient.
  2. Fused Kernels: For sequences of element-wise operations (e.g., bias + GeLU, and bias + dropout + add), custom fused kernels were generated using PyTorch JIT [10]. Operator fusion reduces memory bandwidth usage by combining multiple operations into a single kernel, reducing the number of times data needs to be loaded from and stored to GPU memory (a small illustrative sketch follows this list).
  3. Custom Kernels for Attention: Two custom kernels were created to fuse scale, mask, and softmax (reduction) operations within the attention block. One supports general masking (for models like BERT), and the other supports implicit causal masking (for auto-regressive models like GPT). This also improves efficiency by reducing memory access and kernel launch overheads.
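As a small illustration of the operator-fusion idea from item 2 (not the actual Megatron-LM kernels), the TorchScript sketch below fuses a bias add with a tanh-approximated GeLU so the element-wise work can be compiled into fewer kernels.

```python
import torch

# TorchScript sketch of bias + GeLU fusion (tanh approximation of GeLU).
# Element-wise ops inside a scripted function can be fused into fewer kernels,
# reducing round trips to GPU memory.
@torch.jit.script
def bias_gelu(bias: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    x = y + bias
    # 0.79788456 ~= sqrt(2 / pi)
    return x * 0.5 * (1.0 + torch.tanh(0.79788456 * x * (1.0 + 0.044715 * x * x)))

y = torch.randn(2048, 4, 1024)
bias = torch.randn(1024)
out = bias_gelu(bias, y)
```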

4.3. Performance Analysis of Parallelization Configurations (Continued)

4.3.1. Activation Recomputation (Continued)

The total number of FLOPs per iteration ($F$) for a Transformer model with $l$ Transformer layers, hidden size $h$, sequence length $s$, vocabulary size $V$, and training batch size $B$ accounts for activation recomputation (the forward pass is effectively run a second time) and includes the logit layer: $F = 96 B s l h^2 \left(1 + \frac{s}{6h} + \frac{V}{16lh}\right).$

  • $F$: Total number of floating-point operations per iteration.

  • $B$: Batch size.

  • $s$: Sequence length.

  • $l$: Number of Transformer layers.

  • $h$: Hidden size.

  • $V$: Vocabulary size.

    This formula is based on counting GEMMs in the Transformer and logit layers, assuming mixed precision. The factor $96 B s l h^2$ comes from the forward and backward passes of the Transformer layers with activation recomputation, while the terms $\left(1 + \frac{s}{6h} + \frac{V}{16lh}\right)$ account for the self-attention-specific FLOPs and the logit-layer contribution.
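As a sanity check, the short script below evaluates this formula for the 1-trillion-parameter configuration in Table 1; the implied iteration time is derived from the table's aggregate throughput and is an approximation, not a number reported in the paper.

```python
# Evaluating F = 96*B*s*l*h^2 * (1 + s/(6h) + V/(16*l*h)) for the
# 1-trillion-parameter row of Table 1 (B=3072, s=2048, l=128, h=25600, V=51200).
def flops_per_iteration(B, s, l, h, V):
    return 96 * B * s * l * h**2 * (1 + s / (6 * h) + V / (16 * l * h))

F = flops_per_iteration(B=3072, s=2048, l=128, h=25600, V=51200)
print(f"F ≈ {F:.2e} FLOPs per iteration")   # ≈ 5.1e19

# Dividing by the 502 petaFLOP/s aggregate in Table 1 suggests an iteration
# time on the order of 100 seconds (an inference, not a reported value).
print(f"implied iteration time ≈ {F / 502e15:.0f} s")
```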

5. Experimental Setup

5.1. Datasets

The experiments in this paper primarily use GPT models of varying sizes, which are Transformer-based language models. The paper does not specify a particular training dataset (e.g., C4, Pile) but focuses on the model architectures and their scaling properties.

  • Model Characteristics:

    • Vocabulary Size ($V$): All models use a vocabulary size of 51,200 (a multiple of 1024).
    • Sequence Length ($s$): All models use a sequence length of 2048.
    • Varying Parameters: The authors vary the hidden size ($h$), number of attention heads, and number of layers ($l$) to create models ranging from 1 billion to 1 trillion parameters.
  • Data Sample: While no explicit data sample is provided, GPT models are trained on vast text corpora. A typical data sample would be a sequence of tokens (words or sub-word units) from natural language text, e.g., "The quick brown fox jumps over the lazy dog." The model would learn to predict the next token in the sequence.

  • Choice of Datasets (Implicit): The use of GPT models is highly effective for validating the method's performance because these models are known for their extremely large scale and computational demands, making them ideal benchmarks for distributed training efficiency. The paper's goal is to enable training of these large models, so using them directly as the experimental subject is appropriate.

    The number of parameters ($P$) in a model is computed as $P = 12 l h^2 \left(1 + \frac{13}{12h} + \frac{V+s}{12lh}\right)$; a quick numerical check follows the symbol list below.

  • $P$: Total number of parameters in the GPT model.

  • $l$: Number of Transformer layers.

  • $h$: Hidden size.

  • $V$: Vocabulary size.

  • $s$: Sequence length.

    This formula approximates the total parameter count, accounting for the embedding layer, the Transformer layers (which contain attention and MLP blocks), and the logit layer.
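The formula can be checked directly against Table 1; the short script below reproduces the first and last rows (≈1.7 billion and ≈1008 billion parameters) from their layer counts and hidden sizes.

```python
# Checking P = 12*l*h^2 * (1 + 13/(12h) + (V + s)/(12*l*h)) against Table 1,
# with V = 51200 and s = 2048 for all models.
def num_parameters(l, h, V=51200, s=2048):
    return 12 * l * h**2 * (1 + 13 / (12 * h) + (V + s) / (12 * l * h))

print(f"{num_parameters(l=24,  h=2304)  / 1e9:7.1f} B")  # ≈    1.7 (first row)
print(f"{num_parameters(l=128, h=25600) / 1e9:7.1f} B")  # ≈ 1008.0 (last row)
```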

5.2. Evaluation Metrics

The paper evaluates the performance of its distributed training system using several key metrics, primarily focused on throughput and efficiency.

  • 1. Achieved teraFLOP/s per GPU:

    • Conceptual Definition: This metric quantifies the average number of trillion floating-point operations (FLOPs) that each GPU in the cluster performs per second, including all overheads like communication, data processing, and optimizer steps. It is a direct measure of the computational work done per GPU unit time. A higher value indicates better GPU utilization and efficiency.
    • Mathematical Formula: While not explicitly provided as a single formula in the paper, it is derived from the total FLOPs per iteration and the measured iteration time. $ \text{Achieved teraFLOP/s per GPU} = \frac{\text{Total FLOPs per iteration}}{\text{Number of GPUs} \times \text{Iteration time (seconds)}} \times 10^{-12} $
    • Symbol Explanation:
      • Total FLOPs per iteration: The total floating-point operations required for one forward and backward pass (including activation recomputation) for the entire model and batch; this is $F = 96 B s l h^2 \left(1 + \frac{s}{6h} + \frac{V}{16lh}\right)$ as defined above.
      • Number of GPUs: The total count of GPUs used in the training run ($n$).
      • Iteration time (seconds): The measured wall-clock time for one complete training iteration (one forward pass, one backward pass, and parameter updates for a global batch).
  • 2. Percentage of Theoretical Peak FLOP/s:

    • Conceptual Definition: This metric expresses the achieved teraFLOP/s per GPU as a percentage of the maximum possible floating-point operations that a single GPU could theoretically perform per second. It indicates how close the actual performance is to the hardware's ideal capability, highlighting the efficiency of the software stack and parallelization strategy in utilizing the GPU's compute power.
    • Mathematical Formula: $ \text{Percentage of Theoretical Peak FLOP/s} = \frac{\text{Achieved teraFLOP/s per GPU}}{\text{Theoretical Peak FLOP/s of a single GPU}} \times 100\% $
    • Symbol Explanation:
      • Achieved teraFLOP/s per GPU: As defined above.
      • Theoretical Peak FLOP/s of a single GPU: The maximum specified computational capacity of the GPU hardware. For an A100 GPU with 16-bit precision, this is stated as 312 teraFLOP/s.
  • 3. Achieved Aggregate petaFLOP/s:

    • Conceptual Definition: This metric represents the total floating-point operations performed by all GPUs in the cluster per second, expressed in petaFLOP/s (quadrillions of FLOPs per second). It's a measure of the overall computational power achieved by the entire distributed system.
    • Mathematical Formula: $ \text{Achieved Aggregate petaFLOP/s} = \text{Achieved teraFLOP/s per GPU} \times \text{Number of GPUs} \times 10^{-3} $
    • Symbol Explanation:
      • Achieved teraFLOP/s per GPU: As defined above.
      • Number of GPUs: The total count of GPUs used in the training run ($n$).
  • 4. Training Time Estimates:

    • Conceptual Definition: This is an estimation of the total wall-clock time required to train a given model to convergence on a specific number of tokens, based on the achieved throughput. It provides a practical measure of how long a training job would take.
    • Mathematical Formula: The paper provides a simplified approximation based on the observations that, for their configurations, $6h \gg s$, $16lh \gg (V+s)$, and $12lh \gg V$: $\text{End-to-end training time} \approx \frac{8TP}{nX}.$
    • Symbol Explanation:
      • End-to-end training time: Estimated total time for model training.
      • $T$: Total number of tokens required for training (e.g., 300 billion for GPT-3).
      • $P$: Total number of parameters in the model.
      • $n$: Total number of GPUs used.
      • $X$: Achieved teraFLOP/s per GPU (from Table 1).
      • The constant 8 reflects the roughly 8 FLOPs per parameter per token implied by the formula for $F$ above: about 2 for the forward pass, 4 for the backward pass, and 2 for the recomputed forward pass under activation recomputation. A short worked example follows this list.
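The worked example below ties these metrics together using numbers quoted in the text and in Table 1; it is simple arithmetic on reported values, not additional measurements.

```python
# Simple arithmetic on reported values: percentage of peak, aggregate
# throughput, and the end-to-end training-time estimate 8*T*P / (n*X).

def percent_of_peak(x_tflops, peak_tflops=312.0):   # A100 16-bit peak
    return 100.0 * x_tflops / peak_tflops

def aggregate_pflops(x_tflops, n_gpus):
    return x_tflops * n_gpus / 1e3

def training_days(T_tokens, P_params, n_gpus, x_tflops):
    return 8 * T_tokens * P_params / (n_gpus * x_tflops * 1e12) / 86400

# 1T-parameter row of Table 1: 163 teraFLOP/s per GPU on 3072 GPUs.
print(f"{percent_of_peak(163):.0f}% of peak")           # ≈ 52%
print(f"{aggregate_pflops(163, 3072):.0f} petaFLOP/s")  # ≈ 501 (Table 1 reports 502; the per-GPU value is rounded)

# GPT-3 (175B): 300B tokens, 1024 GPUs, 140 teraFLOP/s per GPU.
print(f"{training_days(300e9, 175e9, 1024, 140):.0f} days")  # ≈ 34
# 1T model: 450B tokens, 3072 GPUs, 163 teraFLOP/s per GPU.
print(f"{training_days(450e9, 1e12, 3072, 163):.0f} days")   # ≈ 83, i.e. roughly 3 months
```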

5.3. Baselines

The paper primarily compares its PTD-P approach against:

  • ZeRO-3 without Model Parallelism: ZeRO-3 [36, 37] is a state-of-the-art memory optimization technique that shards all optimizer states, gradients, and model parameters across data-parallel workers. This allows training very large models that wouldn't fit on a single GPU even with data parallelism. The comparison is crucial because ZeRO-3 is a strong baseline for memory-efficient data-parallel training of large models. The paper specifically notes that it considers ZeRO-3 without any additional model parallelism (i.e., model-parallel size of 1 for the entire model). This highlights PTD-P's advantage when explicit model parallelism (tensor + pipeline) is combined with data parallelism.

    The paper also implicitly compares against prior individual model parallelism techniques by analyzing pipeline parallelism in isolation (e.g., default vs. interleaved schedule), tensor parallelism in isolation, and their various combinations, demonstrating that PTD-P outperforms these isolated approaches.

5.4. Hardware

All experiments were conducted on the Selene supercomputer [8]. This is a powerful, optimized HPC cluster with specific characteristics:

  • GPUs: Each cluster node features 8 NVIDIA 80-GB A100 GPUs [6]. These are state-of-the-art GPUs at the time of publication, offering high computational throughput (312 teraFLOP/s with 16-bit precision) and substantial GPU memory (80 GB).
  • Intra-node Interconnect: Within each node, the 8 A100 GPUs are connected by NVLink and NVSwitch [9]. NVLink provides high-bandwidth, direct GPU-to-GPU connections, while NVSwitch enables all-to-all communication between all GPUs in the server with extremely low latency. This is crucial for efficient intra-node tensor parallelism.
  • Inter-node Interconnect: Each node is equipped with eight NVIDIA Mellanox 200Gbps HDR InfiniBand HCAs (Host Channel Adapters) for application communication, plus an additional two HCAs for dedicated storage. The nodes are connected via a three-level (leaf, spine, core) fat-tree topology with 850 switches. This topology is designed for efficient all-reduce communication, which is a dominant pattern in deep learning training.
  • Storage: The cluster uses an all-NVMe shared parallel filesystem for high-performance data access and storage, which is important for fast checkpoint loading and saving.
  • Software Stack: The implementation is built using PyTorch [32] and leverages NCCL [7] for collective communication. All results are run with mixed precision.

6. Results & Analysis

6.1. Core Results Analysis

The paper's core results demonstrate the exceptional weak-scaling performance of PTD-P for GPT models, ranging from 1 billion to a trillion parameters. The key findings validate the effectiveness of the proposed method in achieving high throughput and practical training times.

The following are the results from Table 1 of the original paper:

| Number of parameters (billion) | Attention heads | Hidden size | Number of layers | Tensor model-parallel size | Pipeline model-parallel size | Number of GPUs | Batch size | Achieved teraFLOP/s per GPU | Percentage of theoretical peak FLOP/s | Achieved aggregate petaFLOP/s |
|---|---|---|---|---|---|---|---|---|---|---|
| 1.7 | 24 | 2304 | 24 | 1 | 1 | 32 | 512 | 137 | 44% | 4.4 |
| 3.6 | 32 | 3072 | 30 | 2 | 1 | 64 | 512 | 138 | 44% | 8.8 |
| 7.5 | 32 | 4096 | 36 | 4 | 1 | 128 | 512 | 142 | 46% | 18.2 |
| 18.4 | 48 | 6144 | 40 | 8 | 1 | 256 | 1024 | 135 | 43% | 34.6 |
| 39.1 | 64 | 8192 | 48 | 8 | 2 | 512 | 1536 | 138 | 44% | 70.8 |
| 76.1 | 80 | 10240 | 60 | 8 | 4 | 1024 | 1792 | 140 | 45% | 143.8 |
| 145.6 | 96 | 12288 | 80 | 8 | 8 | 1536 | 2304 | 148 | 47% | 227.1 |
| 310.1 | 128 | 16384 | 96 | 8 | 16 | 1920 | 2160 | 155 | 50% | 297.4 |
| 529.6 | 128 | 20480 | 105 | 8 | 35 | 2520 | 2520 | 163 | 52% | 410.2 |
| 1008.0 | 160 | 25600 | 128 | 8 | 64 | 3072 | 3072 | 163 | 52% | 502.0 |

Analysis of Table 1:

  • Weak Scaling: As the model size increases from 1.7 billion to 1 trillion parameters, the number of GPUs used also increases proportionally (from 32 to 3072). This demonstrates weak scaling, where the problem size per GPU generally stays constant or increases, allowing the system to scale efficiently.
  • High GPU Utilization: The achieved teraFLOP/s per GPU consistently remains high, ranging from 135 to 163 teraFLOP/s. For the largest models (529.6B and 1T parameters), the system achieves 52% of the theoretical peak FLOP/s of an A100 GPU. This is an impressive utilization rate for a complex distributed training setup, indicating that the PTD-P strategy and its optimizations effectively keep the GPUs busy with computation rather than waiting for communication or idling.
  • Aggregate Throughput: The aggregate petaFLOP/s scales dramatically, reaching 502.0 petaFLOP/s for the 1-trillion-parameter model on 3072 GPUs. This demonstrates near-linear scaling, showcasing the system's ability to harness vast computational resources effectively.
  • Parallelism Configuration: For models larger than 18.4 billion parameters, tensor model parallelism is consistently set to 8 (matching the number of GPUs within a DGX A100 server, as per Takeaway #1). Pipeline model parallelism is then used to scale across servers, increasing from 1 to 64 as model size and GPU count increase. This validates the PTD-P approach's hierarchical design.
  • Increasing Batch Size: As the models and GPU counts grow, the batch size also increases (from 512 to 3072). This is often necessary in pipeline parallelism to amortize the pipeline bubble overhead over more microbatches and maintain efficiency.

Training Time Estimates: Given these impressive throughputs, the paper estimates practical end-to-end training times:

  • GPT-3 (175 billion parameters): On $n = 1024$ A100 GPUs with a batch size of 1536, achieving $X = 140$ teraFLOP/s per GPU. For training on $T = 300$ billion tokens, the estimated time is approximately 34 days.
  • 1 Trillion Parameter Model: On $n = 3072$ A100 GPUs with $X = 163$ teraFLOP/s per GPU. Assuming $T = 450$ billion tokens for convergence, the estimated training time is approximately 84 days (around 3 months). These estimates highlight that PTD-P makes training such colossal models feasible within reasonable timeframes, moving them from theoretical possibility to practical reality.

6.2. Comparison to ZeRO-3

The paper compares PTD-P with ZeRO-3, a strong baseline for memory-efficient data-parallel training without explicit model parallelism.

The following are the results from Table 2 of the original paper:

| Scheme | Number of parameters (billion) | Model-parallel size | Batch size | Number of GPUs | Microbatch size | Achieved teraFLOP/s per GPU | Training time for 300B tokens (days) |
|---|---|---|---|---|---|---|---|
| ZeRO-3 without Model Parallelism | 174.6 | 1 | 1536 | 384 | 4 | 144 | 90 |
| | | | 1536 | 768 | 2 | 88 | 74 |
| | | | 1536 | 1536 | 1 | 44 | 74 |
| | 529.6 | 1 | 2560* | 640 | 4 | 138 | 169 |
| | | | 2240 | 1120 | 2 | 98 | 137 |
| | | | 2240 | 2240 | 1 | 48 | 140 |
| PTD Parallelism | 174.6 | 96 | 1536 | 384 | 1 | 153 | 84 |
| | | | 1536 | 768 | 1 | 149 | 43 |
| | | | 1536 | 1536 | 1 | 141 | 23 |
| | 529.6 | 280 | 2240 | 560 | 1 | 171 | 156 |
| | | | 2240 | 1120 | 1 | 167 | 80 |
| | | | 2240 | 2240 | 1 | 159 | 42 |

Analysis of Table 2 and Figure 10:

Figure 10: Throughput per GPU of PTD-P and ZeRO-3 for two different GPT models (the 175B GPT-3 model is shown with dotted lines, and the 530B model is shown with solid lines). Global batch sizes are fixed, and ZeRO-3 is used without any model parallelism.

  • Initial Performance (Lower GPU Counts): For the 175-billion-parameter model with 384 GPUs and a microbatch size of 4, PTD-P achieves 153 teraFLOP/s per GPU versus ZeRO-3's 144 teraFLOP/s per GPU, representing a 6% higher throughput. For the 530-billion-parameter model with 640 GPUs and microbatch size 4, PTD-P (171 teraFLOP/s per GPU) outperforms ZeRO-3 (138 teraFLOP/s per GPU) by 24%. This initial lead suggests PTD-P is more efficient even at relatively lower GPU counts.
  • Scaling Behavior: As the number of GPUs increases (while keeping the global batch size fixed), PTD-P scales significantly more gracefully than ZeRO-3.
    • For the 175B model, doubling the GPUs from 384 to 768 results in PTD-P maintaining high throughput (149 teraFLOP/s per GPU) while ZeRO-3 drops significantly (to 88 teraFLOP/s per GPU). At 1536 GPUs, PTD-P still achieves 141 teraFLOP/s per GPU, whereas ZeRO-3 falls to just 44 teraFLOP/s per GPU.
    • Similar trends are observed for the 530B model. At 2240 GPUs, PTD-P achieves 159 teraFLOP/s per GPU, while ZeRO-3 drops to 48 teraFLOP/s per GPU.
  • Performance Gap: At higher GPU counts, PTD-P outperforms ZeRO-3 by as much as 70% for both model sizes. This is primarily attributed to ZeRO-3's increased cross-node communication overhead. ZeRO-3 shards parameters and gradients and requires all-gather operations for weights and all-reduce for gradients for every layer if not carefully optimized or hidden. When this communication has to cross slower inter-node links frequently, it becomes a significant bottleneck, especially at larger scales. PTD-P's combination of tensor parallelism (intra-node) and pipeline parallelism (inter-node) with optimizations like scatter/gather is designed to minimize and optimize this cross-node communication.
  • Model Parallelism Size: It's important to note that ZeRO-3 in this comparison has a model-parallel size of 1 (meaning the entire model is conceptually replicated and sharded across data-parallel ranks), whereas PTD-P uses a significant model-parallel size (e.g., 96 for 174.6B model, 280 for 529.6B model) which is composed of tensor and pipeline parallelism. This means PTD-P explicitly partitions the model, which is essential for these memory-intensive models that might not even fit the ZeRO-3 setup on a smaller number of GPUs without very fine-grained microbatching.

6.3. Pipeline Parallelism

6.3.1. Weak Scaling

The paper evaluates the weak-scaling performance of the default non-interleaved pipeline-parallel schedule. In a weak-scaling setup, as the number of pipeline stages (and thus GPUs) increases, the size of the model also increases proportionally (by increasing the number of layers), aiming to keep the workload per GPU roughly constant.

Figure 11: Throughput per GPU of pipeline parallelism using two different batch sizes in a weak-scaling experiment setup (model size increases with the pipeline-parallel size). The blue line corresponds to a batch size of 8 and the orange line to a batch size of 128; throughput per GPU declines as the pipeline-parallel size increases from 1 to 8.

Analysis of Figure 11:

  • Impact of Pipeline Bubble: The figure shows throughput per GPU for a GPT model (128 attention heads, hidden size 20480, microbatch size 1) with a tensor-parallel size of 8. As the pipeline-parallel size increases from 1 to 8, throughput per GPU decreases for both batch sizes (8 and 128). This aligns with the analytical pipeline bubble fraction $\frac{p-1}{m}$, which grows with $p$. Even in a weak-scaling scenario, the idle time introduced by the pipeline flush becomes more significant as the pipeline lengthens.
  • Batch Size Amortization: The larger batch size (128, orange line) consistently performs better and scales more gracefully than the smaller batch size (8, blue line). The pipeline bubble (idle time) is a fixed cost per batch; with a larger batch size, this cost is amortized over more microbatches ($m = B/(d \cdot b)$), reducing its relative impact on overall efficiency. A short calculation illustrating this follows the list.
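A back-of-the-envelope calculation makes the amortization concrete; it assumes a data-parallel size of 1 alongside the stated microbatch size of 1, which is an assumption about Figure 11's setup rather than a configuration reported in the text.

```python
# Bubble fraction (p - 1) / m with m = B / (d * b), assuming d = 1 and b = 1.
def bubble_fraction(p, B, d=1, b=1):
    m = B // (d * b)       # microbatches per pipeline per batch
    return (p - 1) / m

for B in (8, 128):
    print(f"batch size {B:3d}: bubble fraction at p=8 is {bubble_fraction(8, B):.3f}")
# batch size   8: 0.875  -> the flush dominates each batch
# batch size 128: 0.055  -> the flush is amortized over many more microbatches
```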

6.3.2. Interleaved versus Non-Interleaved Schedule

The paper compares the throughput of the proposed interleaved schedule against the non-interleaved (default) PipeDream-Flush schedule on a GPT-3 model with 175 billion parameters.

Figure 12: Throughput per GPU of interleaved and non-interleaved schedules for a GPT model (175 billion parameters) on 96 GPUs.

Analysis of Figure 12:

  • Interleaved Schedule Advantage: The interleaved schedule (green line) with the scatter/gather communication optimization enabled consistently achieves higher per-GPU throughput than the non-interleaved (default) schedule (blue line). This validates the theoretical reduction of the pipeline bubble to $\frac{1}{v} \cdot \frac{p-1}{m}$ offered by interleaving stages, leading to up to 10% improvement in throughput.
  • Batch Size Impact:
    • At smaller batch sizes, the interleaved schedule shows a more pronounced advantage. Smaller batch sizes mean fewer microbatches ($m$), making the pipeline bubble a larger fraction of total time for the non-interleaved schedule; the interleaved schedule mitigates this by reducing the effective pipeline depth.
    • As the batch size increases, the throughput gap between the two schedules narrows. This is due to two factors: a) the pipeline bubble in the default schedule becomes a smaller fraction of the total time as it's amortized over more microbatches; and b) the interleaved schedule requires more point-to-point communication per sample, and this communication overhead becomes more significant at larger batch sizes.
  • Importance of Scatter/Gather: The paper notes that without the scatter/gather optimization, the default schedule would perform better than the interleaved schedule at larger batch sizes. This underscores that the communication overhead introduced by interleaving must be mitigated by specific optimizations like scatter/gather to realize its performance benefits.

6.4. Comparison of Parallel Configurations

This section demonstrates the critical trade-offs and interactions when combining different parallelization dimensions.

6.4.1. Tensor versus Pipeline Parallelism

The paper evaluates the performance of various combinations of tensor and pipeline model parallelism for a GPT model with 162.2 billion parameters on 64 A100 GPUs.

Figure 13: Throughput per GPU of various parallel configurations that combine pipeline and tensor model parallelism using a GPT model with 162.2 billion parameters and 64 A100 GPUs.

Analysis of Figure 13:

  • Optimal Combination: The peak performance is achieved when the tensor-parallel size ($t$) is equal to 8. This corresponds to utilizing all GPUs within a single DGX A100 server for tensor parallelism, where high-bandwidth NVLink facilitates efficient all-reduce communication.
  • Tensor Parallelism Benefits (Intra-node): Increasing the tensor-parallel size from 1 to 8 (within a node) generally improves throughput. Tensor parallelism partitions the large GEMMs within layers across GPUs connected by fast NVLink, keeping computation efficient and allowing larger models to fit.
  • Pipeline Parallelism Benefits (Inter-node): Once tensor parallelism saturates the intra-node capability (at $t=8$), pipeline parallelism ($p$) is used to scale across nodes. For instance, comparing $(t=1, p=64)$ with $(t=8, p=8)$, the latter achieves much higher throughput. The configuration $(t=8, p=8)$ means 8 tensor-parallel groups, each on a DGX A100 node, forming an 8-stage pipeline. This leverages point-to-point communication for pipeline stages across the slower inter-node InfiniBand links, avoiding the costly all-reduce of tensor parallelism over those links.
  • Limitations of Isolated Parallelism:
    • Pure tensor parallelism ($p=1$, varying $t$): If $t$ went beyond 8 (so that tensor parallelism spans nodes), throughput would degrade severely due to expensive all-reduce over InfiniBand (not shown explicitly past $t=8$ in this figure, but implied by Takeaway #1).
    • Pure pipeline parallelism ($t=1$, varying $p$): As $p$ increases, throughput decreases significantly (e.g., from $(t=1, p=1)$ to $(t=1, p=64)$). The entire Transformer layer must be processed on a single GPU, which quickly becomes a memory bottleneck, and the pipeline bubble dominates as $p$ grows without sufficient microbatches.
  • Takeaway Validation: This figure strongly validates Takeaway #1: tensor model parallelism is most effective intra-node (up to the number of GPUs per server), and pipeline model parallelism is best used for inter-node scaling. Neither in isolation performs as well as their optimized combination.

6.4.2. Pipeline versus Data Parallelism

This comparison uses a smaller GPT model (5.9 billion parameters) that can fit on fewer model-parallel GPUs to illustrate the dynamics between data and pipeline parallelism.

Figure 14: Throughput per GPU of various parallel configurations that combine data and pipeline model parallelism using a GPT model with 5.9 billion parameters, three different batch sizes, a microbatch size of 1, and 64 A100 GPUs.

Analysis of Figure 14:

  • Data Parallelism Preference: For a fixed total number of GPUs (64) and microbatch size (1), increasing the data-parallel size ($d$) while decreasing the pipeline-parallel size ($p$) generally leads to higher throughput per GPU. This is evident in the rising throughput as $d$ increases from 1 to 64 (and $p$ decreases from 64 to 1) for all batch sizes.
  • Pipeline Bubble Impact: As the pipeline-parallel size ($p$) increases, throughput tends to decrease. This confirms the analytical model from §3.3.1: with $t=1$, the pipeline bubble fraction is $\frac{p-1}{m} = \frac{n/d - 1}{b'/b}$ (here $n = 64$), which grows as $d$ shrinks. Pipeline parallelism introduces idle time due to pipeline flushes, which becomes more significant as the pipeline grows longer relative to the number of microbatches.
  • Batch Size and Throughput: Larger batch sizes (e.g., 2048, green line) consistently yield higher throughput than smaller ones (e.g., 512, blue line). This is because a larger batch size helps amortize the pipeline bubble overhead and makes data-parallel communication less frequent.
  • Takeaway Validation: This figure supports Takeaway #2: pipeline model parallelism should primarily be employed to enable the training of models that are too large to fit on a single GPU or a small model-parallel group. Once the model can fit, data parallelism is generally the more efficient strategy for scaling up training to more GPUs due to its lower communication overhead per microbatch and better pipeline bubble characteristics.

6.4.3. Tensor versus Data Parallelism

This section evaluates the interaction between data and tensor model parallelism using the same 5.9-billion-parameter GPT model as in the previous section.

Figure 15: Throughput per GPU of various parallel configurations that combine data and tensor model parallelism using a GPT model with 5.9 billion parameters, three different batch sizes, a microbatch size of 1, and 64 A100 GPUs.

Analysis of Figure 15:

  • Data Parallelism Preference: Similar to the previous comparison, increasing the data-parallel size ($d$) while decreasing the tensor-parallel size ($t$) generally leads to higher throughput. Throughput is highest when $t=1$ (no tensor parallelism) and $d=64$ (full data parallelism), especially for large batch sizes.
  • Tensor Parallelism Overhead: As the tensor-parallel size ($t$) increases (and $d$ decreases), throughput tends to decline, because tensor model parallelism requires all-reduce communication for every microbatch and every layer. When $t$ grows, a larger portion of this all-reduce may span inter-node links (if $t > 8$ on DGX A100 nodes), which are slower.
  • Small GEMMs: Additionally, increasing $t$ means the matrix multiplications on each GPU become smaller. If these GEMMs are not sufficiently large, GPU utilization drops, since the GPU cannot reach its peak computational efficiency.
  • Batch Size Impact: Again, larger batch sizes (e.g., 2048, green line) lead to better throughput. This is because data-parallel communication is less frequent per batch than tensor-parallel communication per microbatch. When batch size is large, the gradient all-reduce in data parallelism is amortized, while the tensor-parallel all-reduce remains frequent.
  • Takeaway Validation: This further reinforces Takeaway #2: data parallelism is generally more efficient for scaling up training once the model fits the model-parallel group. Tensor parallelism should be chosen strategically, usually intra-node, to handle models that exceed a single GPU's memory, but its all-reduce communication overheads can be significant, especially cross-node.

6.5. Microbatch Size

The choice of microbatch size is a crucial hyperparameter that balances GPU utilization (from larger GEMMs) and pipeline bubble overhead.

Figure 16: Throughput per GPU of a $(t, p) = (8, 8)$ parallel configuration for different microbatch sizes on a GPT model with 91 billion parameters, for two different batch sizes using 64 A100 GPUs.

Analysis of Figure 16:

  • Optimal Microbatch Size: For the 91-billion-parameter GPT model with a $(t, p) = (8, 8)$ parallel configuration on 64 A100 GPUs, the figure shows that a microbatch size of 2 yields the highest throughput per GPU for both batch sizes (2048 and 1024).
  • Trade-off in Action:
    • Smaller Microbatch Sizes (e.g., 1): Lead to a larger number of microbatches ($m$) within a batch. The pipeline bubble is better amortized, but GPU utilization for individual GEMMs may be lower due to smaller matrix sizes.
    • Larger Microbatch Sizes (e.g., 4, 8): Lead to fewer microbatches ($m$), so the pipeline bubble takes up a larger fraction of the overall time (as $m$ decreases, $\frac{p-1}{m}$ increases). However, larger microbatches generally improve the arithmetic intensity of GPU kernels and thus GPU utilization for the actual computation.
  • Problem-Dependence: The optimal microbatch size is a sweet spot between these opposing forces, and it is model- and configuration-dependent. The paper's analytical model (the estimated time-per-batch formula discussed above) provides a reasonable approximation and can guide the selection of this hyperparameter.

6.6. Activation Recomputation

Activation recomputation (or gradient checkpointing) is a memory-saving technique that trades increased computation for reduced memory.

Figure 17: Throughput (in sequences per second) with and without activation recomputation for a GPT model with 145 billion parameters using 128 A100 GPUs $\left((t, p) = (8, 16)\right)$.

Analysis of Figure 17:

  • Throughput Drop for Small Batch Sizes: For small batch sizes, enabling activation recomputation leads to up to 33% lower throughput (measured in sequences per second). This is because activation recomputation requires re-executing parts of the forward pass during the backward pass, effectively doubling the FLOPs for those sections. This additional computation directly reduces throughput if memory is not a bottleneck.
  • Enabling Larger Batch Sizes: The crucial benefit of activation recomputation is its memory savings. Without it, the model might not fit in GPU memory at all, or only with very small batch sizes. The figure shows that activation recomputation is needed to support larger batch sizes that would otherwise be out-of-memory.
  • Overall Throughput Gain: At large batch sizes (made possible by activation recomputation), the system can achieve significantly higher overall throughput. For example, the throughput at large batch sizes with activation recomputation is up to 2x higher than the best throughput achieved without activation recomputation (which was limited to smaller batch sizes). This is because larger batch sizes better amortize the pipeline bubble and other overheads, even with the increased FLOPs from recomputation. Thus, activation recomputation is an essential technique for training very large models.

6.7. Scatter-Gather Optimization

The scatter/gather communication optimization (§4.2.3.1) was designed to reduce redundant cross-node communication when tensor and pipeline parallelism are combined.

Figure 18: Throughput per GPU with and without the scatter/gather optimization for a GPT model with 175 billion parameters using 96 A100 GPUs and the interleaved schedule.

Analysis of Figure 18:

  • Throughput Improvement: The figure clearly shows that enabling the scatter/gather optimization (green line) leads to a substantial improvement in per-GPU throughput compared to running without it (blue line). For communication-intensive scenarios, especially with the interleaved schedule and larger batch sizes, this optimization improves throughput by up to 11%.
  • Mitigating Interleaving Overhead: The interleaved schedule inherently increases point-to-point communication volume (by a factor of $v$, the number of model chunks per device). The scatter/gather optimization is critical for making the interleaved schedule feasible and efficient by reducing the data actually transferred over the slower cross-node InfiniBand links: by scattering the replicated tensor-parallel outputs and all-gathering them intra-node over fast NVLink, it turns redundant inter-node communication into efficient intra-node communication.

6.8. Fused Operators

The paper implemented three model-specific computation optimizations (data layout change, fused kernels, custom attention kernels) as described in §4.2.3.2.

  • Impact on GPT-3 (175 billion parameters): With these fused operators, throughput increased by 19% (from 113 teraFLOP/s per GPU to 135 teraFLOP/s per GPU).

  • Impact on Larger GPT (530 billion parameters): For the even larger 530-billion-parameter GPT model, throughput increased by 11% (from 133 teraFLOP/s per GPU to 148 teraFLOP/s per GPU).

    These results highlight the importance of low-level software optimizations and kernel engineering to ensure that computation remains compute-bound and GPUs are maximally utilized, rather than being bottlenecked by memory access or kernel launch overheads.

6.9. Inter-Node Communication Bandwidth

The exceptional results achieved are partly due to the optimized Selene supercomputer hardware stack. The paper quantifies the effective bisection bandwidth utilization:

  • Pipeline-Parallel Communication: For point-to-point communication among pipeline stages (primarily inter-node), an effective bisection bandwidth of 892 GB/s was observed. This is a very high bandwidth, underscoring the efficiency of the InfiniBand interconnect and the scatter/gather optimization.
  • Data-Parallel Communication: For all-reduce operations among data-parallel replicas, an even higher effective bisection bandwidth of 12.9 TB/s was observed. This highlights the highly optimized all-reduce implementation (likely leveraging the fat-tree topology and NCCL) and the fact that data-parallel all-reduces are often larger but less frequent than tensor-parallel all-reduces, allowing for better saturation of network links. The authors emphasize that a less optimized partitioning or more communication-intensive operations would severely hamper scaling performance, especially with these high bandwidth requirements.

6.10. Checkpoint Loading and Saving

An important practical consideration for training large models is managing their checkpoints.

  • Trillion-parameter model checkpoint size: A single checkpoint for the trillion-parameter model is a massive 13.8 terabytes.

  • Loading Performance: The initial loading of this checkpoint by all 384 nodes (3072 GPUs) achieved a peak read bandwidth of 1 TB/s from the parallel filesystem. This demonstrates the capability of the all-NVMe shared parallel filesystem.

  • Saving Performance: Checkpoint saves reached 40% of peak write bandwidth (273 GB/s). While lower than read, this is still substantial and crucial for periodic saving during long training runs.

    Efficient checkpointing is vital for fault tolerance and for resuming training, and the system demonstrates robust performance in handling these enormous data volumes.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper presents a highly effective and innovative solution for training large language models (LLMs) with up to a trillion parameters on GPU clusters. The core contribution is PTD-P, a composed parallelism strategy that intelligently integrates pipeline parallelism (across multi-GPU servers), tensor parallelism (within multi-GPU servers), and data parallelism for maximal efficiency.

Key achievements include:

  • Achieving an aggregate throughput of 502 petaFLOP/s on 3072 A100 GPUs for a 1-trillion-parameter model, representing 52% of theoretical peak FLOP/s per GPU.

  • Proposing a novel interleaved pipeline parallelism schedule that improves throughput by over 10% by reducing pipeline bubble time, without significantly increasing memory footprint.

  • Developing critical communication optimizations (e.g., scatter/gather) and computation optimizations (e.g., fused kernels, data layout) that are essential for realizing high GPU utilization and mitigating communication overheads.

  • Providing a quantitative analysis and practical heuristics for configuring distributed training, highlighting the complex interactions between different parallelism dimensions and hyperparameters.

  • Demonstrating practical end-to-end training times (e.g., ~3 months for a trillion-parameter model), making such large-scale model development feasible.

  • Open-sourcing the Megatron-LM codebase with these advancements, enabling wider adoption by the research community.

    The significance of this work lies in pushing the boundaries of LLM training efficiency, directly addressing the memory and computational challenges that arise from the exponential growth of AI models.

7.2. Limitations & Future Work

The authors acknowledge certain limitations and suggest avenues for future research:

  • Heuristic-based Parallelism Configuration: The paper does not automatically explore the vast search space of optimal parallelization strategies. Instead, it suggests practical heuristics based on empirical and analytical studies. Future work could involve developing more sophisticated auto-partitioning or cost-model-based systems (like FlexFlow, PipeDream, DAPPLE) that can automatically determine the best PTD-P configuration for arbitrary models and hardware.
  • Strict Optimizer Semantics: The paper focuses on pipeline parallelism with flushes to ensure strict optimizer semantics. This inherently introduces pipeline bubbles. The authors defer consideration of asynchronous and bounded-staleness approaches (like PipeMare, PipeDream-2BW) to future work. These alternative schemes could potentially improve throughput by eliminating flushes entirely, but might come at the cost of convergence rate or final accuracy, which would need careful investigation.
  • Model Architecture Focus: The optimizations and analyses are specifically tailored to Transformer-based language models. While the core ideas are general, their application to other asymmetric model architectures (e.g., CNNs with highly varied layer sizes) might require different layer assignment strategies for pipeline stages, which the paper defers to related work.
  • Hardware Dependence: The strong results are a byproduct of a highly optimized software and hardware stack, particularly relying on high-bandwidth interconnects like NVLink and InfiniBand. Performance on less-optimized or lower-bandwidth cluster environments might be significantly different, limiting the direct transferability of certain performance numbers.

7.3. Personal Insights & Critique

This paper is a landmark contribution to the field of large-scale AI model training, showcasing an impressive feat of engineering and distributed systems design.

Personal Insights:

  • Hardware-Software Co-design is Paramount: The paper vividly illustrates that pushing the frontiers of AI scale is not just about novel algorithms, but critically about an intricate co-design between hardware capabilities (e.g., A100 GPUs, NVLink, InfiniBand) and highly optimized software (PTD-P, fused kernels, scatter/gather). It's a testament to the fact that maximum efficiency is extracted when the software stack is deeply aware of and optimized for the underlying hardware topology and communication patterns.
  • Complexity of Distributed Training: The analysis of interactions between tensor, pipeline, and data parallelism reveals the immense complexity of configuring such systems. The trade-offs are non-trivial and often contradictory (e.g., microbatch size improving GPU utilization but increasing pipeline bubble). The guiding principles provided are invaluable for practitioners navigating this complexity.
  • Value of Open-Sourcing: Open-sourcing the Megatron-LM extension is a significant contribution. It democratizes access to state-of-the-art LLM training infrastructure, enabling broader research and development that would otherwise be limited to organizations with massive HPC resources and expertise.
  • Practicality over Purity: The focus on strict optimizer semantics ensures predictable model quality, which is crucial for real-world applications. While asynchronous methods promise higher throughput, the paper's choice to first optimize a synchronous approach thoroughly is a pragmatic one, providing a solid, reliable baseline.

Critique / Areas for Improvement:

  • Generalizability of Heuristics: While the heuristics are effective for Transformer models on NVIDIA hardware, their direct applicability to vastly different model architectures (e.g., sparse models, diffusion models) or different accelerator types (e.g., TPUs, IPUs) might require re-evaluation. The "optimal microbatch size is problem-dependent" highlights the need for more adaptive strategies.

  • Environmental Impact: Training trillion-parameter models on thousands of GPUs for months consumes enormous amounts of energy. While this paper focuses on efficiency (reducing the energy per FLOP or per iteration), the sheer scale of the computation means the overall environmental footprint remains substantial. Future research might explore ways to achieve similar model capabilities with less computational intensity or more energy-efficient hardware.

  • Cost of Hardware: The Selene supercomputer is a top-tier HPC facility. While the software is open-sourced, replicating this performance requires access to comparable, expensive hardware. This could be a barrier for smaller research groups, despite the software's availability.

  • Dynamic Adaptation: The paper's approach relies on a pre-configured parallelization strategy. A potential future improvement could involve dynamic adaptation of parallelism degrees (pp, tt, dd) and microbatch sizes during training, perhaps based on real-time GPU utilization and network metrics, to handle varying model characteristics or workload fluctuations more optimally. This aligns with the "automatic partitioning" research mentioned in related work.

    Overall, this paper provides a robust and highly optimized solution that significantly advances the practical capabilities of large-scale AI model training, setting a new benchmark for throughput and efficiency.
