ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning
TL;DR Summary
ZeRO-Infinity leverages GPU, CPU, and NVMe memory to break the GPU memory wall, enabling trillion-parameter model training and fine-tuning without code refactoring, achieving high throughput and superlinear scalability in extreme-scale deep learning.
Abstract
In the last three years, the largest dense deep learning models have grown over 1000x to reach hundreds of billions of parameters, while the GPU memory has only grown by 5x (16 GB to 80 GB). Therefore, the growth in model scale has been supported primarily through system innovations that allow large models to fit in the aggregate GPU memory of multiple GPUs. However, we are getting close to the GPU memory wall. It requires 800 NVIDIA V100 GPUs just to fit a trillion parameter model for training, and such clusters are simply out of reach for most data scientists. In addition, training models at that scale requires complex combinations of parallelism techniques that put a big burden on the data scientists to refactor their model. In this paper we present ZeRO-Infinity, a novel heterogeneous system technology that leverages GPU, CPU, and NVMe memory to allow for unprecedented model scale on limited resources without requiring model code refactoring. At the same time it achieves excellent training throughput and scalability, unencumbered by the limited CPU or NVMe bandwidth. ZeRO-Infinity can fit models with tens and even hundreds of trillions of parameters for training on current generation GPU clusters. It can be used to fine-tune trillion parameter models on a single NVIDIA DGX-2 node, making large models more accessible. In terms of training throughput and scalability, it sustains over 25 petaflops on 512 NVIDIA V100 GPUs (40% of peak), while also demonstrating super-linear scalability. An open source implementation of ZeRO-Infinity is available through DeepSpeed, a deep learning optimization library that makes distributed training easy, efficient, and effective.
In-depth Reading
1. Bibliographic Information
1.1. Title
ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning
1.2. Authors
Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, Yuxiong He. All authors are affiliated with Microsoft. Their research backgrounds primarily lie in distributed systems, deep learning optimization, and high-performance computing, particularly focused on scaling deep learning models.
1.3. Journal/Conference
The paper was published on arXiv, a preprint server. The specific publication status (e.g., if it was later accepted to a peer-reviewed conference or journal) is not explicitly stated in the provided abstract, but it is typical for such foundational work to be presented at top-tier conferences in machine learning systems (e.g., OSDI, SOSP, MLSys, SC). DeepSpeed, the open-source library implementing ZeRO-Infinity, is widely adopted and well-regarded in the deep learning community.
1.4. Publication Year
2021
1.5. Abstract
The paper addresses the significant disparity between the rapid growth in large dense deep learning (DL) model sizes (over 1000x in three years) and the much slower increase in GPU memory (only 5x). This disparity has led to a "GPU memory wall," making it prohibitively expensive and complex (requiring 800 NVIDIA V100 GPUs for a trillion-parameter model) to train extreme-scale models using existing system innovations like 3D parallelism. Furthermore, current methods demand extensive model code refactoring, burdening data scientists.
ZeRO-Infinity is introduced as a novel heterogeneous system technology that overcomes these limitations. It leverages GPU, CPU, and NVMe memory simultaneously, enabling unprecedented model scale on limited resources without requiring model code refactoring. The system achieves excellent training throughput and scalability by managing the slower CPU and NVMe bandwidths efficiently. ZeRO-Infinity can train models with tens to hundreds of trillions of parameters on current-generation GPU clusters and allows fine-tuning trillion-parameter models on a single NVIDIA DGX-2 node, significantly improving accessibility. It demonstrates strong performance, sustaining over 25 petaflops on 512 NVIDIA V100 GPUs (40% of peak) and exhibiting super-linear scalability. An open-source implementation is available through DeepSpeed, a deep learning optimization library.
1.6. Original Source Link
https://arxiv.org/abs/2104.07857v1 Publication Status: Preprint, available on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the "GPU memory wall" in deep learning training. In recent years, the size of state-of-the-art dense deep learning models has grown exponentially, often by orders of magnitude (e.g., GPT-3 with 175 billion parameters). However, the memory capacity of individual GPUs has increased at a much slower rate. This growing gap means that many cutting-edge models are simply too large to fit into the memory of a single GPU, or even the aggregate GPU memory of a moderately sized cluster.
This problem is critical for several reasons:
- Limited Model Scale: Without innovative system solutions, the growth of model sizes, which has been a primary driver of DL advancements, would be severely constrained. The paper notes that training a trillion-parameter model on current V100 GPUs would require 800 GPUs just to fit the model states, making it inaccessible for most researchers and businesses.
- High Cost and Inaccessibility: The immense GPU resources required for large model training or even fine-tuning make it an exclusive endeavor for organizations with access to massive clusters. For example, fine-tuning GPT-3 would require 8 DGX-2 nodes (128 GPUs) using existing 3D parallelism techniques, even though a single DGX-2 node might have sufficient compute power.
- Complexity for Data Scientists: Current state-of-the-art solutions for large model training, such as 3D parallelism (combining data, model, and pipeline parallelism), demand significant effort from data scientists. They often involve complex model code refactoring, including splitting models into tensor-sliced versions or load-balanced pipeline stages, which is both time-consuming and inflexible for diverse model architectures.

The paper's entry point and innovative idea is to transcend the GPU memory wall by holistically leveraging the entire memory hierarchy of modern computing systems: GPU memory (fast but small), CPU memory (slower but larger), and NVMe (Non-Volatile Memory Express) storage (slowest but massive). This heterogeneous memory management aims to allow extreme model scaling without imposing the burden of complex model refactoring on data scientists, thereby democratizing access to large model training.
2.2. Main Contributions / Findings
The primary contributions and key findings of the paper are:
- Unprecedented Model Scale: ZeRO-Infinity enables the training of models with tens and even hundreds of trillions of parameters, a significant leap from previous state-of-the-art methods. Specifically, it can train models with over 32 trillion parameters, a 50x increase in model scale compared to 3D parallelism. This is achieved by simultaneously exploiting GPU, CPU, and NVMe memory through its infinity offload engine.
- Accessibility and Ease-of-Use:
  - It allows fine-tuning trillion-parameter models on a single NVIDIA DGX-2 node (16 GPUs), making large model training more accessible to researchers and companies with limited resources.
  - It eliminates the need for manual model code refactoring, pipeline parallelism, or model parallelism (tensor-slicing), simplifying the development process for data scientists through its ease-inspired implementation and memory-centric tiling.
- Excellent Training Throughput and Scalability:
  - ZeRO-Infinity sustains over 25 petaflops on 512 NVIDIA V100 GPUs (40% of peak theoretical performance), demonstrating high efficiency even while offloading data to slower memory tiers.
  - It achieves super-linear scalability for trillion-parameter models, meaning that as the number of GPUs increases, the throughput per GPU can sometimes increase more than proportionally, primarily by effectively leveraging aggregate PCIe and NVMe bandwidths and CPU compute across multiple nodes.
- Innovative System Technologies:
  - Infinity Offload Engine: A novel mechanism to intelligently offload model states (optimizer states, gradients, parameters) to CPU or NVMe memory based on memory requirements, while maintaining high performance.
  - Memory-centric Tiling: A GPU memory optimization technique that breaks down large operators into smaller, sequentially executable tiles, reducing GPU working memory requirements and eliminating the need for model parallelism for massive layers.
  - Bandwidth-centric Partitioning: A data partitioning and retrieval strategy that leverages the aggregate bandwidth across all parallel devices (e.g., PCIe links) by partitioning individual parameters across data parallel processes and using allgather operations.
  - Overlap-centric Design: An architecture that aggressively overlaps compute with various forms of communication (GPU-GPU, NVMe-CPU, CPU-GPU) to hide the latency of data movement to and from slower memory tiers.
- Open-Source Implementation: ZeRO-Infinity is available as part of DeepSpeed, an open-source library, making these advanced capabilities widely accessible to the DL community.

These findings address the problems of limited model scale, high training cost, and complexity, pushing the boundaries of what is possible in large-scale deep learning.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully understand ZeRO-Infinity, a reader should be familiar with the following foundational concepts in deep learning and distributed systems:
- Deep Learning (DL): A subfield of machine learning inspired by the structure and function of the human brain. It involves training neural networks with multiple layers (hence "deep") on large datasets to learn complex patterns.
- GPU Memory Wall: The growing gap between the increasing memory demands of large DL models and the relatively slower growth of memory capacity on GPUs. As models become larger, they quickly exhaust the available GPU memory, hindering training.
- Model States: In DL training, the model states are the components that define and are updated during the learning process. These include:
  - Parameters: The learnable weights and biases of the neural network. For example, in a linear layer y = Wx + b, the weight matrix W and bias b are parameters. In mixed precision training, parameters are typically stored in FP16 (16-bit floating point).
  - Gradients: The derivatives of the loss function with respect to the parameters. These indicate the direction and magnitude of parameter adjustments needed to reduce the loss. Gradients are also often stored in FP16.
  - Optimizer States: Additional state variables maintained by optimizers (like Adam or SGD) to facilitate parameter updates. For example, the Adam optimizer maintains first-order momentum and second-order variance estimates for each parameter. These are typically stored in FP32 (32-bit floating point) for higher precision.
  - For mixed precision training with Adam, the paper's figure of 20 bytes per parameter corresponds to an FP16 parameter (2 bytes) + FP16 gradient (2 bytes) + FP32 momentum (4 bytes) + FP32 variance (4 bytes) + FP32 master copy of the parameter (4 bytes) + FP32 copy of the gradient used during the optimizer step (4 bytes), totaling 20 bytes per parameter.
- Residual States: Temporary memory allocations needed during training, primarily:
  - Activations: The outputs of intermediate layers during the forward pass of a neural network. These activations need to be stored because they are required to compute gradients during the backward pass using the chain rule. Storing all activations can be memory-intensive.
  - Activation Checkpointing: A technique to reduce activation memory consumption. Instead of storing all activations, only a subset (checkpoints) is stored. Intermediate activations between checkpoints are recomputed during the backward pass when needed, trading memory for additional computation. This is also known as gradient checkpointing.
- Mixed Precision Training: A training technique that uses both FP16 (half-precision floating point) and FP32 (single-precision floating point) numbers. FP16 computations are faster and use less memory on modern GPUs (especially with Tensor Cores), while FP32 is used for critical operations like parameter updates to maintain numerical stability and accuracy.
- Adam Optimizer: A popular adaptive learning rate optimization algorithm that computes adaptive learning rates for each parameter. It stores exponentially decaying averages of past gradients (first moment estimates, m) and of squared past gradients (second moment estimates, v), which contribute significantly to optimizer state memory requirements.
- Arithmetic Intensity (AIT): The ratio of total floating-point operations (computations) to the total amount of data moved (memory accesses) by a computation kernel. Higher AIT means more computation is performed for each unit of data transferred, making the computation less sensitive to memory bandwidth limitations.
- Parallelism Techniques:
  - Data Parallelism (DP): The simplest form of distributed training, where multiple GPUs each hold a complete copy of the model and process different mini-batches of data. Gradients are then aggregated (e.g., averaged) across all GPUs.
  - Model Parallelism (MP) / Tensor Slicing: When a model is too large for a single GPU, its layers or even individual tensors within a layer can be split across multiple GPUs. This requires careful partitioning and communication. Tensor slicing is a specific form where a tensor (e.g., a large weight matrix) is partitioned.
  - Pipeline Parallelism (PP): Divides the layers of a model into stages, with each stage assigned to a different GPU (or set of GPUs). Data flows through these stages sequentially, forming a pipeline.
  - 3D Parallelism: A composite technique that combines data parallelism, model parallelism, and pipeline parallelism to efficiently train extremely large models on hundreds or thousands of GPUs.
- ZeRO (Zero Redundancy Optimizer): A family of memory optimization technologies that partition model states across data parallel processes to eliminate memory redundancies.
  - ZeRO-1: Partitions only the optimizer states.
  - ZeRO-2: Partitions both optimizer states and gradients.
  - ZeRO-3: Partitions all three model states: optimizer states, gradients, and parameters. This is the most memory-efficient stage, where each parameter is owned by a unique data parallel process.
- Heterogeneous Memory Systems: Modern computing systems feature a hierarchy of memory types with different characteristics:
  - GPU Memory (HBM/GDDR): High bandwidth, low latency, but relatively small capacity. Ideal for active computation.
  - CPU Memory (DDR): Lower bandwidth and higher latency than GPU memory, but much larger capacity. Connected to GPUs via PCIe (Peripheral Component Interconnect Express).
  - NVMe Storage: Non-Volatile Memory Express is a storage interface for SSDs (Solid State Drives) that offers much higher bandwidth and lower latency than traditional SATA SSDs or HDDs. It provides massive storage capacity but is significantly slower than CPU or GPU memory.
- PyTorch: An open-source machine learning framework known for its flexibility and dynamic computational graph.
  - Hooks: Mechanisms in PyTorch that allow users to insert custom code to be executed before or after forward or backward passes of a module or specific tensors. ZeRO-Infinity uses these for automated data movement.
  - Submodules: In PyTorch, neural networks are typically composed of torch.nn.Module objects, which can contain other Module objects (submodules), forming a hierarchical structure (e.g., a Transformer layer containing attention and feed-forward submodules).
- Communication Collectives: Operations used in distributed computing to exchange data among multiple processes (see the sketch after this list).
  - Allgather: Each process contributes its local data, and all processes receive the concatenation of data from all other processes.
  - Broadcast: One process sends its data to all other processes.
  - Reduce-scatter: Each process contributes its local data, and the results of a reduction (e.g., sum) are scattered across all processes.
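The snippet below is a minimal sketch of these three collectives using torch.distributed; it assumes the process group has already been initialized by a launcher (e.g., torchrun), and the tensor shapes are purely illustrative.

```python
# Minimal sketch of the collectives that ZeRO-style systems rely on.
import torch
import torch.distributed as dist

def demo_collectives(local_shard: torch.Tensor) -> torch.Tensor:
    world = dist.get_world_size()

    # Allgather: every rank contributes its shard and receives all shards,
    # reconstructing the full tensor -- this is how a partitioned parameter
    # is assembled before a forward/backward pass.
    gathered = [torch.empty_like(local_shard) for _ in range(world)]
    dist.all_gather(gathered, local_shard)
    full_param = torch.cat(gathered)

    # Reduce-scatter: each rank ends up with the sum of one slice of the
    # inputs -- this is how gradients are aggregated and re-partitioned.
    grad_slices = list(torch.chunk(torch.ones_like(full_param), world))
    my_reduced_slice = torch.empty_like(grad_slices[0])
    dist.reduce_scatter(my_reduced_slice, grad_slices, op=dist.ReduceOp.SUM)

    # Broadcast: rank 0 sends its tensor to every other rank.
    dist.broadcast(local_shard, src=0)
    return my_reduced_slice
```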
3.2. Previous Works
The paper contextualizes ZeRO-Infinity against several prior advancements in large model training:
- ELMo [6] and GPT-3 [4]: These models are cited as examples of the rapid growth in model scale, from hundreds of millions to hundreds of billions of parameters. They highlight the trend that motivates the need for systems like ZeRO-Infinity.
- Model Parallelism (MP) [7, 17, 18]: Early techniques for splitting models across GPUs when they would not fit on a single device. Megatron-LM [7] is a prominent example.
- Pipeline Parallelism (PP) [8-10]: Techniques like PipeDream [8], GPipe [9], and memory-efficient pipeline-parallel DNN training [10] aim to improve efficiency by pipelining computations across GPUs.
- 3D Parallelism [13, 14]: The state-of-the-art for large model training that ZeRO-Infinity directly aims to surpass. It combines data parallelism, model parallelism, and pipeline parallelism. The DeepSpeed implementation [15] of 3D parallelism can scale to over a trillion parameters on 800 NVIDIA V100 GPUs. The paper points out its drawbacks:
  - GPU Memory Wall: Even with 3D parallelism, the aggregate GPU memory becomes insufficient for future model scales (e.g., 320 A100 GPUs for 1T parameters, 6K GPUs for 100T parameters).
  - Complexity and Refactoring Burden: It requires significant model code refactoring and splitting models into load-balanced pipeline stages and tensor-sliced components.
  - Inflexibility: Models with complex dependencies are hard to fit into pipeline parallelism.
- ZeRO (Zero Redundancy Optimizer) [11]: ZeRO-Infinity is built upon the ZeRO family of technologies, specifically ZeRO-3. ZeRO removes memory redundancies across data-parallel processes by partitioning model states (optimizer states, gradients, parameters).
  - ZeRO-1: Partitions optimizer states.
  - ZeRO-2: Partitions optimizer states and gradients.
  - ZeRO-3: Partitions all model states. During forward and backward passes, parameters are broadcast from their owning GPU to the others, then freed. Optimizer states are updated only on the owning GPU.
- ZeRO-Offload [12]: A heterogeneous training approach built on ZeRO-2. It stores gradients and optimizer states in CPU memory to save GPU memory. Its limitations: parameters must still be stored in GPU memory and replicated, limiting model scale to what a single GPU can host, and it needs large batch sizes to be efficient due to suboptimal data partitioning and limited PCIe bandwidth.
- Other Heterogeneous CPU/NVMe Approaches [20-26]: The paper acknowledges other works that leverage CPU or NVMe memory (e.g., Capuchin [23], SuperNeurons [31], vDNN [25], Sentinel [24]). ZeRO-Infinity is positioned as a generic system for massive dense models, in contrast with specialized uses such as Zhao et al. [27] for sparse DL ads systems.
- Reducing Activation Memory [28-31]: Techniques like activation checkpointing [29, 30] and compression [28] are mentioned as complementary to ZeRO-Infinity for managing activation memory.
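Activation checkpointing, referenced here and in Section 3.1, is available directly in stock PyTorch. The snippet below is a minimal, self-contained illustration; the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# The block's intermediate activations are not stored during the forward pass
# and are recomputed during backward, trading compute for memory.
block = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
x = torch.randn(8, 1024, requires_grad=True)

y = checkpoint(block, x, use_reentrant=False)  # forward without saving intermediates
y.sum().backward()                             # the block is re-run here to get gradients
```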
3.3. Technological Evolution
The evolution of large model training has been a continuous effort to overcome memory and computational bottlenecks:
- Single GPU: Early DL models fit entirely on a single GPU.
- Data Parallelism (DP): As models grew, DP allowed scaling computation by replicating the model across multiple GPUs and distributing the data. This hit a wall when the model itself became too large for a single GPU.
- Model Parallelism (MP) and Pipeline Parallelism (PP): These techniques were developed to handle models too large for a single GPU by splitting model layers or components across GPUs. However, they introduce complexity in model design and communication.
- 3D Parallelism: A sophisticated combination of DP, MP, and PP that pushes the limits further, but is still constrained by aggregate GPU memory and the refactoring burden.
- ZeRO (Zero Redundancy Optimizer): An innovation that drastically reduced the memory footprint by partitioning model states instead of replicating them, moving beyond traditional DP's memory inefficiencies. ZeRO-3 fully partitions all model states.
- ZeRO-Offload: Extended ZeRO-2 by offloading optimizer states and gradients to CPU memory, marking an early step towards heterogeneous memory utilization.
- ZeRO-Infinity: This paper represents the next major leap, moving beyond the GPU memory wall by fully embracing a heterogeneous memory approach (GPU, CPU, NVMe) for all model states, and critically, doing so without model refactoring while maintaining high efficiency. It takes the ZeRO-3 partitioning concept and extends it to offload the parameters themselves to slower memory tiers.
3.4. Differentiation Analysis
Compared to the main methods in related work, ZeRO-Infinity introduces several core differences and innovations:
- Compared to 3D Parallelism:
  - Memory Wall Transcendence: 3D parallelism is fundamentally limited by the aggregate GPU memory available in a cluster. ZeRO-Infinity breaks this limit by leveraging CPU and NVMe memory, allowing for models tens to hundreds of times larger.
  - Ease of Use: 3D parallelism requires complex model parallelism and pipeline parallelism strategies, necessitating significant model code refactoring. ZeRO-Infinity eliminates this need, making it much easier for data scientists to use without modifying their model code. This is a crucial differentiator for accessibility.
  - Memory-centric Tiling: 3D parallelism uses model parallelism (tensor-slicing) to fit large individual layers. ZeRO-Infinity achieves this with memory-centric tiling, which dynamically breaks down operators, avoiding explicit model parallelism.
- Compared to ZeRO-Offload:
  - Full Model State Offload: ZeRO-Offload primarily offloads optimizer states and gradients to CPU memory, but parameters must still reside in GPU memory. ZeRO-Infinity, built on ZeRO-3, can offload all model states (including parameters) to CPU or NVMe, enabling much larger models.
  - Bandwidth Efficiency for Parameters: ZeRO-Offload's parameter handling is limited by single PCIe bandwidth, as parameters are replicated on the GPUs. ZeRO-Infinity introduces bandwidth-centric partitioning, where parameters are partitioned across data parallel processes and retrieved using allgather, effectively utilizing the aggregate PCIe and NVMe bandwidths for parameter movement. This is critical for efficiency when offloading parameters.
  - Scope: ZeRO-Offload is a CPU offload technique for ZeRO-2, while ZeRO-Infinity is a complete heterogeneous memory management system for ZeRO-3 that includes NVMe and advanced overlap strategies.
- Novelty in Heterogeneous Memory Management:
  - ZeRO-Infinity is the first system to simultaneously and efficiently coordinate GPU, CPU, and NVMe memory for all model states in DL training.
  - The infinity offload engine, with DeepNVMe and pinned memory management, provides a robust and high-performance solution for utilizing NVMe storage, far beyond previous systems that at most touch CPU memory.
  - Its overlap-centric design is comprehensive, hiding latencies across NVMe-CPU, CPU-GPU, and GPU-GPU transfers, which is crucial for making slower memory tiers practical for training.

In essence, ZeRO-Infinity is a holistic approach that redefines the memory paradigm for DL training, moving beyond the GPU as the sole memory bottleneck and making extreme-scale DL more accessible and practical.
4. Methodology
4.1. Principles
The core principle behind ZeRO-Infinity is to break the GPU memory wall by treating the entire memory hierarchy of a modern computing cluster (GPU, CPU, and NVMe storage) as a single, virtualized, and massively scalable memory pool for deep learning model states and activations. It achieves this by:
- Extreme Partitioning: Building upon ZeRO-3, it partitions all model states (parameters, gradients, and optimizer states) across data parallel processes, eliminating memory redundancies.
- Heterogeneous Offloading: Instead of keeping partitioned model states exclusively on GPUs, it intelligently offloads them to CPU or NVMe memory, leveraging their larger capacities.
- Aggressive Overlapping: It employs sophisticated communication-computation overlapping strategies to hide the latency of accessing the slower CPU and NVMe memories, making offloading practical and efficient.
- Ease of Use: It automates complex memory management and data movement, abstracting away the need for model parallelism or code refactoring from data scientists.

The theoretical basis is that by partitioning model states and distributing them across heterogeneous memory tiers, the aggregate memory capacity becomes virtually unlimited. The challenge then becomes efficiently moving the necessary data to the GPU just in time for computation, and ZeRO-Infinity addresses this with its bandwidth-centric partitioning and overlap-centric design. The intuition is that even if individual CPU or NVMe accesses are slow, by performing many such accesses in parallel and overlapping them with GPU compute, the overall training throughput can remain high.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Memory Requirements Characterization
The paper begins by characterizing the memory requirements for DL training, focusing on Transformer-based architectures with mixed precision training and the Adam optimizer. Memory is categorized into Model states and Residual states.
Model States Memory
Model states comprise optimizer states, gradients, and parameters.
- Parameters and gradients are stored in FP16.
- Optimizer states consist of FP32 momentum, variance, and typically FP32 copies of the parameters and gradients for accurate updates.
- The paper states that each parameter requires 20 bytes of memory in total: 2 (FP16 parameter) + 2 (FP16 gradient) + 4 (FP32 momentum) + 4 (FP32 variance) + 4 (FP32 parameter copy) + 4 (FP32 gradient copy).
- For Transformer-based models, the total number of parameters depends primarily on the hidden dimension (hd) and the number of Transformer layers (nl). Most parameters come from four linear layers within each block with sizes (hd, 3hd), (hd, hd), (hd, 4hd), and (4hd, hd).

The total number of parameters in a Transformer-based model can therefore be approximated as:

total_params ≈ 12 × nl × hd²

where:
- nl: number of Transformer layers.
- hd: hidden dimension.

The total memory (in bytes) required to store the model states is then:

model_state_bytes ≈ 20 × 12 × nl × hd² = 240 × nl × hd²

This formula uses 20 bytes per parameter. Figure 2a (column 5 in the table below) illustrates these memory requirements, showing how quickly they grow to terabytes for models with hundreds of billions to trillions of parameters. For context, Figure 2b shows the aggregate GPU, CPU, and NVMe memory available on DGX-2 systems.
The following are the results from Figure 2(a) and 2(b) of the original paper:
Figure 2(a): Memory requirements for massive Transformer models.

| Params (Trillions) | Layers | Hidden Size | Attn Heads | Model States (TB) | Act. (TB/Node) | Act. Ckpt. (TB/Node) | Model State Working Mem. per GPU (GB) | Act. Working Mem. per GPU (GB) |
|---|---|---|---|---|---|---|---|---|
| 0.10 | 80 | 10K | 128 | 1.83 | 2.03 | 0.05 | 1.95 | 1.63 |
| 0.50 | 100 | 20K | 160 | 9.16 | 3.91 | 0.12 | 6.25 | 2.50 |
| 1.01 | 128 | 25K | 256 | 18.31 | 7.13 | 0.20 | 9.77 | 3.56 |
| 10.05 | 195 | 64K | 512 | 182.81 | 24.38 | 0.76 | 64.00 | 8.00 |
| 101.47 | 315 | 160K | 1024 | 1845.70 | 88.59 | 3.08 | 400.00 | 18.00 |

Figure 2(b): Aggregate memory and per-GPU bandwidth available on NVIDIA V100 DGX-2 clusters.

| Nodes | GPUs | Aggregate GPU Mem (TB) | Aggregate CPU Mem (TB) | Aggregate NVMe (TB) | GPU-GPU Bandwidth (GB/s) | GPU Mem BW per GPU (GB/s) | CPU BW per GPU (GB/s) | NVMe BW per GPU (GB/s) |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 0.032 | 1.5 | 28.0 | N/A | 600-900 | 12.0 | 12.0 |
| 1 | 16 | 0.5 | 1.5 | 28.0 | 150-300 | 600-900 | 3.0 | 1.6 |
| 4 | 64 | 2.0 | 6.0 | 112.0 | 60-100 | 600-900 | 3.0 | 1.6 |
| 16 | 256 | 8.0 | 24.0 | 448.0 | 60-100 | 600-900 | 3.0 | 1.6 |
| 64 | 1024 | 32.0 | 96.0 | 1792.0 | 60-100 | 600-900 | 3.0 | 1.6 |
| 96 | 1536 | 48.0 | 144.0 | 2688.0 | 60-100 | 600-900 | 3.0 | 1.6 |
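As a sanity check on the formulas above, the short script below evaluates them for roughly the 1-trillion-parameter row of the table (nl = 128, hd = 25600); the per-node batch size of 32 used for the activation-checkpoint estimate is an assumption (2 per GPU on a 16-GPU node), not a value given in the table.

```python
# Back-of-the-envelope memory estimator for the formulas in this section.
def transformer_params(nl: int, hd: int) -> int:
    # Four linear layers per block: hd*3hd + hd*hd + hd*4hd + 4hd*hd = 12*hd^2
    return 12 * nl * hd * hd

def model_state_bytes(nl: int, hd: int, bytes_per_param: int = 20) -> int:
    return bytes_per_param * transformer_params(nl, hd)

def activation_ckpt_bytes(nl: int, hd: int, bsz: int, seq: int, ci: int = 1) -> int:
    # One fp16 checkpoint (2 bytes/element) of size bsz*seq*hd every ci blocks.
    return 2 * bsz * seq * hd * nl // ci

if __name__ == "__main__":
    nl, hd = 128, 25600                     # roughly the 1T-parameter configuration
    tb = 1024 ** 4
    print(f"params           ~ {transformer_params(nl, hd) / 1e12:.2f} trillion")
    print(f"model states     ~ {model_state_bytes(nl, hd) / tb:.1f} TB")
    print(f"act. checkpoints ~ {activation_ckpt_bytes(nl, hd, bsz=32, seq=1024) / tb:.2f} TB per node")
```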
Residual States Memory
Residual states primarily refer to activation memory.
- Activation Checkpointing: This technique reduces activation memory. The memory (in bytes) required to store the activation checkpoints is estimated as:

  activation_ckpt_bytes ≈ 2 × bsz × seq × hd × nl / ci

  where:
  - bsz: batch size.
  - seq: sequence length.
  - hd: hidden dimension.
  - nl: number of Transformer layers.
  - ci: number of Transformer blocks between two activation checkpoints.
- The term 2 × bsz × seq × hd is the size (in bytes, for FP16 activations) of the input to each Transformer block. Figure 2a (column 7) shows that even with activation checkpointing, the memory for activations can become large for multi-trillion parameter models.
Model State Working Memory (MSWM)
MSWM is the minimum GPU memory needed for forward or backward propagation of the largest single operator (e.g., a linear layer) after the model states have been offloaded. It is approximately the size of the parameters and gradients of that operator.
- For a Transformer, the largest operator is the linear layer transforming hidden states from hd to 4hd.
- The size (in bytes) of the parameters and gradients for this linear layer is approximately:

  MSWM ≈ 4 × hd × 4hd

  since the layer holds FP16 parameters (2 bytes per element) and FP16 gradients (2 bytes per element) for a weight matrix of hd × 4hd elements, i.e., 2 × (hd × 4hd) bytes for the parameters plus 2 × (hd × 4hd) bytes for the gradients. MSWM (Figure 2a, column 8) can require multiple gigabytes of contiguous memory, leading to out-of-memory errors. 3D parallelism uses model parallelism to split these operators; ZeRO-Infinity instead introduces memory-centric tiling to address this.
Activation Working Memory (AWM)
AWM is the memory needed during backward propagation to recompute the activations between two consecutive activation checkpoints.
- With one activation checkpoint per Transformer block (ci = 1), the memory (in bytes) is approximately:

  AWM ≈ bsz × seq × ci × (16 × hd + 2 × attn_heads × seq)

  where attn_heads is the number of attention heads. This expression estimates the total size of the activations produced within the ci blocks that must be recomputed and held. AWM (Figure 2a, column 9) also becomes large beyond 10 trillion parameters.
4.2.2. Bandwidth Requirements Characterization
The paper analyzes the impact of CPU and NVMe bandwidth on training efficiency.
Efficiency Metric
The efficiency of a workload, assuming no compute-communication overlap, is defined as:

efficiency = compute_time / (compute_time + communication_time)

With compute_time = total_computation / peak_tp and communication_time = total_data_movement / bw = total_computation / (ait × bw), this can be expressed in terms of the arithmetic intensity (ait), the data movement bandwidth (bw), and the peak computational throughput (peak_tp):

efficiency = (ait × bw) / (ait × bw + peak_tp)

where:
- ait: arithmetic intensity, i.e., total_computation / total_data_movement.
- bw: bandwidth to the memory tier holding the data.
- peak_tp: peak achievable computational throughput of the accelerator.
Quantifying AIT in DL Training
The AIT varies for the different model states and for the activation checkpoints.
- Total Computation per Iteration: Dominated by the linear layers in the Transformer.
  - Forward propagation computation: ≈ 2 × bsz × seq × total_params.
  - Backward propagation computation: approximately 2x the forward propagation.
  - Activation checkpointing adds one additional forward computation for recomputation.
  - Total computation per iteration: ≈ 2 × 4 × 12 × bsz × seq × nl × hd² = 8 × bsz × seq × total_params.
  The factor of 4 accounts for the forward pass (1x), the backward pass (2x), and the recomputation due to activation checkpointing (1x); 12 × nl × hd² is the approximate total parameter count.
- AIT w.r.t. Parameters and Gradients:
  - Parameters are loaded at least twice (forward and backward) and a third time when recomputing with activation checkpointing; gradients are stored at least once.
  - Total data movement: ≈ 4 × total_params values, i.e., 2 × 4 × total_params = 8 × total_params bytes in FP16.
  - AIT with respect to parameters and gradients is therefore approximately seq × bsz, meaning roughly seq × bsz floating-point operations are performed for each byte of parameter or gradient data moved.
- AIT w.r.t. Optimizer States:
  - Optimizer states are read once and written once per iteration during the optimizer step.
  - Total data movement: ≈ 2 × 16 × total_params = 32 × total_params bytes (each parameter carries 16 bytes of optimizer state: FP32 momentum, variance, parameter copy, and gradient copy).
  - AIT with respect to optimizer states is therefore approximately seq × bsz / 4. This AIT is 4x lower than for parameters and gradients, making optimizer states the most bandwidth-sensitive.
- AIT w.r.t. Activation Checkpoints:
  - Activation checkpoints are written during the forward pass and read back during the backward pass.
  - Total data movement: ≈ 2 × (2 × bsz × seq × hd × nl / ci) = 4 × bsz × seq × hd × nl / ci bytes, where 2 × bsz × seq × hd × nl / ci is the activation checkpoint size from the Residual States section above.
  - AIT with respect to activation checkpoints is therefore approximately 24 × hd × ci, which grows with the hidden dimension and the checkpointing interval.
The following plots (Figures 2a, 2b, and 2c referenced in the original paper) illustrate these bandwidth requirements.

Figure 2a: Parameter and gradient bandwidth (efficiency vs. bandwidth in GB/s for batch sizes 1 to 16).

Figure 2b: Optimizer states bandwidth (efficiency vs. bandwidth in GB/s for batch sizes 1 to 8).

Figure 2c: Activation checkpoint bandwidth (efficiency vs. bandwidth in GB/s for hidden sizes 2K to 64K).
From these analyses, optimizer states have the highest bandwidth requirements to reach 90% efficiency at small batch sizes (their AIT is 4x lower), followed by parameters and gradients, while activation checkpoints have comparatively low bandwidth requirements, especially for large hidden sizes.
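The sketch below plugs the AIT expressions above into the efficiency formula. The 70 TFlops peak per GPU and the example bandwidth values are illustrative assumptions for a V100-class device, not figures taken from the plots.

```python
# Sketch of the efficiency model:  efficiency = ait * bw / (ait * bw + peak)
def efficiency(ait_flops_per_byte: float, bw_gbps: float, peak_tflops: float = 70.0) -> float:
    ait_times_bw = ait_flops_per_byte * bw_gbps * 1e9   # FLOP/s sustainable by data movement
    peak = peak_tflops * 1e12
    return ait_times_bw / (ait_times_bw + peak)

def ait_params_and_grads(seq: int, bsz: int) -> float:
    return seq * bsz                 # FLOPs per byte moved

def ait_optimizer_states(seq: int, bsz: int) -> float:
    return seq * bsz / 4.0

def ait_activation_ckpts(hd: int, ci: int = 1) -> float:
    return 24.0 * hd * ci

if __name__ == "__main__":
    seq, bsz, hd = 1024, 2, 25600
    cases = [("params/grads", ait_params_and_grads(seq, bsz)),
             ("optimizer states", ait_optimizer_states(seq, bsz)),
             ("activation ckpts", ait_activation_ckpts(hd))]
    for name, ait in cases:
        for bw in (3, 25, 100):      # GB/s; illustrative NVMe / aggregate PCIe values
            print(f"{name:17s} bw={bw:4d} GB/s -> efficiency {efficiency(ait, bw):.0%}")
```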
4.2.3. ZeRO-Infinity Design Overview
The overall architecture of ZeRO-Infinity is depicted in Figure 4, showing the interaction between GPU, CPU, NVMe, and network communication.

Figure 4: A snapshot of ZeRO-Infinity training a model with two layers on four data parallel (DP) ranks. Communication for the backward pass of the first layer is depicted. Partitioned parameters are moved from slow memory to the GPU and then collected to form the full layer. After gradients are computed, they are aggregated, re-partitioned, and then offloaded to slow memory. Layers are denoted with subscripts and DP ranks with superscripts; for example, P0(2) is the portion of layer 0's parameters owned by GPU 2.
Design for Unprecedented Scale
- Infinity Offload Engine for Model States:
  - ZeRO-Infinity builds on ZeRO-3, which partitions all model states.
  - The infinity offload engine allows offloading these partitioned model states (parameters, gradients, optimizer states) to CPU or NVMe memory, or keeping them on the GPU, based on available resources and performance considerations.
  - This enables fitting models with hundreds of trillions of parameters, as the aggregate NVMe memory in a large cluster can reach thousands of terabytes (about 2.7 PB across 96 DGX-2 nodes, per Figure 2b).
- CPU Offload for Activations:
  - Activation checkpoints can also be offloaded to CPU memory when GPU memory is insufficient.
  - A 10 trillion parameter model's activation checkpoints (0.76 TB) fit in a DGX-2's 1.5 TB of CPU memory. This helps scale the activation memory requirements.
- Memory-centric Tiling for Working Memory:
  - This novel technique addresses the large Model State Working Memory (MSWM) requirements of massive individual operators (e.g., a huge linear layer).
  - Instead of model parallelism (splitting the operator itself across GPUs), memory-centric tiling breaks a large operator down into smaller, mathematically equivalent tiles.
  - These tiles are executed sequentially. When combined with ZeRO-3, the parameters and gradients for each tile are fetched and released one at a time. This reduces the working memory proportionally to the number of tiles, allowing arbitrary operator sizes without explicit model parallelism.
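A minimal sketch of the tiling idea follows, splitting a plain PyTorch linear layer along its output dimension. DeepSpeed ships its own tiled operators, so this is only an illustration of the principle, with arbitrary sizes.

```python
import torch
import torch.nn as nn

class TiledLinearSketch(nn.Module):
    """Illustrative tiling of a large nn.Linear along the output dimension.

    Each tile is a separate submodule, so a ZeRO-3 style runtime can gather and
    release one tile's parameters at a time, shrinking working memory roughly
    by the tile count.
    """
    def __init__(self, in_features: int, out_features: int, tiles: int):
        super().__init__()
        assert out_features % tiles == 0
        self.tiles = nn.ModuleList(
            nn.Linear(in_features, out_features // tiles) for _ in range(tiles)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Tiles run sequentially; only one tile's weights need to be "live" at a time.
        return torch.cat([tile(x) for tile in self.tiles], dim=-1)

# Example: an hd -> 4*hd projection split into 8 tiles.
hd = 1024
layer = TiledLinearSketch(hd, 4 * hd, tiles=8)
out = layer(torch.randn(2, hd))   # shape (2, 4096)
```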
Design for Excellent Training Efficiency
Offloading to CPU and NVMe is only viable if efficiency is maintained despite their slower bandwidth.
- Bandwidth-Centric Partitioning (Section 6.1):
  - Traditional ZeRO and ZeRO-Offload use a broadcast-based approach in which one data parallel process owns a full parameter and broadcasts it. This is limited by a single PCIe link's bandwidth from the source memory (CPU/NVMe) to the GPU.
  - ZeRO-Infinity instead partitions individual parameters across all data parallel processes. When a parameter is needed, an allgather collective is used.
  - This means each data parallel process only fetches 1/dp of the parameter (where dp is the data parallel degree) from CPU/NVMe to its GPU via its own PCIe link.
  - This parallel retrieval effectively increases the CPU/NVMe-to-GPU bandwidth linearly with the dp degree, achieving virtually unlimited heterogeneous memory bandwidth on multi-node setups. For example, on a DGX-2 (16 GPUs), the effective bandwidth grows from that of a single PCIe link to the aggregate CPU-to-GPU or NVMe-to-GPU bandwidth of all 16 links.
- Overlap Centric Design (Section 6.2):
  - To hide the latencies of slow memory, ZeRO-Infinity aggressively overlaps compute with communication across all tiers: GPU-GPU, NVMe-CPU, and CPU-GPU.
  - Dynamic Prefetcher:
    - Traces the forward and backward computation to build an internal map of the operator sequence.
    - Prefetches the parameters needed by future operators.
    - It understands the three-step NVMe access process (nc-transfer: NVMe to CPU, cg-transfer: CPU to GPU, gg-transfer: GPU allgather).
    - While operator i executes, it can simultaneously invoke the nc-transfer for the parameters of operator i+3, the cg-transfer for operator i+2, and the gg-transfer for operator i+1.
  - Communication and Offload Overlapping for Gradients:
    - During the backward pass, it overlaps the reduce-scatter for the gradients of operator i+1 with the computation of operator i.
    - Simultaneously, it transfers the partitioned gradients produced by the reduce-scatter of operator i+2 to CPU or NVMe.
- Infinity Offload Engine (Section 6.3): This engine comprises specialized components:
  - DeepNVMe: A powerful library for NVMe read/write operations.
    - Supports bulk read/write requests with asynchronous completion.
    - Allows explicit synchronization.
    - Enables asynchrony so that NVMe transfers can overlap with GPU/CPU operations.
    - Achieves near-peak sequential read/write bandwidths through aggressive parallelization of I/O requests, smart work scheduling, avoidance of data copying, and memory pinning.
  - Pinned Memory Management Layer:
    - Crucial for high-performance NVMe/CPU data transfers, as source and destination tensors must reside in pinned memory (host memory that the GPU can access directly without staging).
    - Manages the scarce pinned memory by reusing a small amount (tens of GBs) to offload massive model states (tens of TBs).
    - Minimizes memory fragmentation in both CPU and GPU memory.
    - Provides PyTorch tensors backed by pinned memory, allowing in-place computation and direct writing to NVMe without extra copies, boosting bandwidth.
Design for Ease of Use
ZeRO-Infinity aims for PyTorch-native usage without model refactoring.
- Automated Data Movement (Section 7.1):
  - Coordinates the movement of parameters, gradients, and optimizer states.
  - When a tensor is not actively in use, it is partitioned and potentially offloaded.
  - Hooks: ZeRO-Infinity recursively injects hooks into the PyTorch submodules.
    - Pre-forward/backward hooks ensure parameters are on the GPU by executing allgather collectives before computation.
    - Post-forward/backward hooks partition and optionally offload parameters or gradients after computation.
  - The overlap-centric design ensures these allgather operations do not cause significant stalls.
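The following is a schematic of this hook-based data movement. The gather_params and partition_params callbacks are hypothetical placeholders standing in for ZeRO-Infinity's internal allgather and partition/offload machinery; only the hook wiring uses the real PyTorch API.

```python
import torch.nn as nn

def attach_zero3_style_hooks(module: nn.Module, gather_params, partition_params):
    """Fetch a submodule's parameters just before it runs, release them after."""
    def pre_hook(mod, inputs):
        gather_params(mod)       # e.g., allgather this submodule's parameters onto the GPU
        return None              # do not alter the inputs

    def post_hook(mod, inputs, output):
        partition_params(mod)    # re-partition / offload the parameters again
        return None              # do not alter the output

    for sub in module.modules():
        sub.register_forward_pre_hook(pre_hook)
        sub.register_forward_hook(post_hook)
    # Backward-direction hooks would be registered analogously for the backward pass.
```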
- Auto Registration of External Parameters (Section 7.1.1):
  - Addresses cases where parameters defined in one submodule are used in another (e.g., shared embedding weights in GPT).
  - Manual API: Provided for users to register external parameters explicitly.
  - Automatic Mechanisms:
    - Intercepting Partitioned Parameter Accesses: Replaces the PyTorch hash table used for a module's tensor parameters with a subclass that overrides tensor accesses. When a partitioned parameter is accessed, it triggers a blocking allgather, registers the parameter as external, and returns the gathered parameter.
    - Activation Introspection: Inspects the activation outputs of submodule forward passes for partitioned parameters. If any are found, it collects and registers them as external.
- Automatic Model Partitioning during Initialization (Section 7.2):
  - Solves the problem of large models not fitting into a single GPU or the CPU memory during initialization (i.e., before they can be partitioned).
  - A Python ZeRO-Infinity context decorates torch.nn.Module's __init__ method.
  - This ensures that the parameters allocated under each module/sub-module are partitioned among the data parallel processes immediately after their initialization.
  - The full model is never replicated on a single data parallel process, allowing initialization of models (e.g., 500 billion parameters) that require terabytes of aggregate memory, without requiring any single node to have that capacity.
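In practice these capabilities are driven by DeepSpeed configuration plus the zero.Init construction context rather than model code changes. The sketch below is illustrative only: the key names follow DeepSpeed's public config schema but should be verified against the current documentation, "/local_nvme" is a placeholder path, and the small nn.Sequential stands in for a real model.

```python
import deepspeed
import torch.nn as nn

# ZeRO stage 3 with parameters and optimizer states offloaded to NVMe.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param":     {"device": "nvme", "nvme_path": "/local_nvme"},
        "offload_optimizer": {"device": "nvme", "nvme_path": "/local_nvme"},
    },
}

# zero.Init partitions parameters across data-parallel ranks as each submodule
# is constructed, so the full model is never materialized on one process.
with deepspeed.zero.Init(config_dict_or_path=ds_config):
    model = nn.Sequential(nn.Linear(8192, 32768), nn.Linear(32768, 8192))

engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```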
5. Experimental Setup
5.1. Datasets
The experiments primarily use GPT-like Transformer-based models. These models are chosen because they represent the class of large, dense models whose scaling is the focus of the paper.
- Sequence Length: Fixed to 1024, a common value for Transformer models like GPT-2, Megatron-LM, and Turing-NLG.
- Model Size Variation: The hidden dimension and number of layers are varied to create models with different total parameter counts, ranging from billions to tens of trillions.
- Data Sample: The paper does not provide a concrete example of a data sample (e.g., an input text sequence) but implies typical text data used for language modeling. For a Transformer with sequence length 1024, an input sample would be a sequence of 1024 tokens (words or subword units), which are embedded into hidden-dimension vectors before being processed by the Transformer layers.
- Suitability: Transformer models are ideal for validating ZeRO-Infinity because they are known for their massive scale, their quadratic memory consumption with sequence length (in some components), and their widespread use in SOTA DL applications, making memory efficiency a critical challenge.

Specific model configurations used in the evaluation are detailed in Table 1 and in Appendix A (Tables 4-8), which are transcribed below.
The following are the results from Table 1 of the original paper:
| # nodes | # params | hidden dim | # layers | batch/GPU | mp | fp16 param | Opt State |
|---|---|---|---|---|---|---|---|
| 1 | 10 B | 4K | 50 | 8 | 1 | GPU | GPU |
| 1 | 50, 100 B | 8K | 62, 125 | 26, 24 | 1 | CPU | NVMe |
| 1 | 0.5, 1 T | 18K, 25K | 124, 128 | 8,7 | 1 | NVMe | NVMe |
| 32 | 0.5, 1 T | 18K, 25K | 124, 128 | 7,5 | 4 | GPU | GPU |
| 32 | 5, 10, 20 T | 48K, 64K, 88K | 174, 200, 205 | 3, 2, 1.25 | 4, 4,8 | NVMe | NVMe |
5.2. Evaluation Metrics
The paper uses the following metrics to evaluate ZeRO-Infinity's performance:
- Model Size (Number of Parameters):
  - Conceptual Definition: Quantifies the maximum size of a neural network model, in terms of its learnable weights and biases, that can be trained by a system given specific hardware resources. This directly addresses the GPU memory wall problem.
  - Mathematical Formula: Model size is a count rather than a derived metric. For the Transformer models used here it follows from the architectural parameters: total_params ≈ 12 × nl × hd².
  - Symbol Explanation:
    - nl: number of Transformer layers.
    - hd: hidden dimension of the model.
- Training Throughput (TFlops/GPU or Petaflops):
  - Conceptual Definition: Measures the computational efficiency of the training process, indicating how many floating-point operations per second are performed per GPU or across the entire cluster. Higher throughput signifies faster training.
  - Mathematical Formula: throughput = FLOPs_per_step / time_per_step, often reported in TeraFLOP/s (TFlops, 10^12 FLOP/s) or PetaFLOP/s (PFlops, 10^15 FLOP/s).
  - Symbol Explanation:
    - FLOPs_per_step: the total number of floating-point operations required to complete one training step (forward pass, backward pass, optimizer update).
    - time_per_step: the duration of one training step.
- Scalability (Super-linear, Linear, Sub-linear):
  - Conceptual Definition: Describes how efficiently the training throughput increases as more GPUs (or nodes) are added to the system.
    - Linear Scaling: Throughput increases proportionally with the number of GPUs.
    - Super-linear Scaling: Throughput increases more than proportionally with the number of GPUs, often because the aggregate bandwidth, memory capacity, or CPU compute unlocks new efficiencies.
    - Sub-linear Scaling: Throughput increases less than proportionally, typically due to communication overheads or load imbalance.
  - Mathematical Formula: scaling efficiency E(N) = T_N / (N × T_1). For weak scaling (as used in the paper for super-linear scaling), the batch size per GPU is kept constant, so the total batch size and total problem size increase with N. The efficiency would ideally be 1 (linear); super-linear scaling means E(N) > 1.
  - Symbol Explanation:
    - T_N: the total throughput measured when using N GPUs.
    - N: the number of GPUs used.
    - T_1: the baseline throughput measured when using a single GPU.
- Backward Propagation Time:
  - Conceptual Definition: Measures the time taken specifically for the backward pass of the neural network, which includes gradient computation and aggregation. This metric is used to evaluate the efficiency of gradient offloading and communication strategies.
  - Mathematical Formula: Typically measured directly as wall-clock time in seconds or milliseconds.
- Maximum Hidden Size:
  - Conceptual Definition: The largest hidden dimension that a single layer of a Transformer model can be configured to have and still be trained, particularly in the presence of memory fragmentation or other limitations. This metric evaluates the effectiveness of memory-centric tiling in handling large individual operators.
  - Mathematical Formula: A direct integer value representing the dimension size.
5.3. Baselines
The paper compares ZeRO-Infinity against several state-of-the-art and foundational distributed training methods:
- torch's Distributed Data Parallel (DDP) [42]:
  - Description: The standard data parallelism implementation in PyTorch. Each GPU holds a full replica of the model, processes a subset of the data, computes gradients, and then all-reduces these gradients to keep all model replicas synchronized.
  - Representativeness: The most common and simplest data parallelism baseline, representing the scenario where the model fits entirely on each GPU.
- Megatron-LM [7]:
  - Description: An early and influential framework developed by NVIDIA for training multi-billion parameter language models. It primarily relies on model parallelism (specifically tensor-slicing for linear layers) and pipeline parallelism.
  - Representativeness: Represents the state-of-the-art in model parallelism and pipeline parallelism for Transformer models.
- 3D Parallelism [13, 14]:
  - Description: A composite distributed training strategy that combines data parallelism, model parallelism, and pipeline parallelism. The DeepSpeed implementation [15] is highlighted. This is positioned as the immediate predecessor and primary competitor to ZeRO-Infinity for extreme-scale DL.
  - Representativeness: The most advanced existing baseline for training trillion-parameter models on large GPU clusters, making it the most direct comparison for ZeRO-Infinity's scalability claims.
- ZeRO [11] (specifically ZeRO-3):
  - Description: The Zero Redundancy Optimizer partitions model states to eliminate memory redundancy. ZeRO-3 partitions all model states (parameters, gradients, optimizer states).
  - Representativeness: ZeRO-Infinity builds upon ZeRO-3's partitioning strategy, so ZeRO-3 serves as a crucial baseline to demonstrate the added value of heterogeneous offloading and advanced communication strategies.
- ZeRO-Offload [12]:
  - Description: An extension of ZeRO-2 that offloads gradients and optimizer states to CPU memory.
  - Representativeness: A direct baseline for evaluating the benefits of ZeRO-Infinity's more comprehensive heterogeneous memory management (including parameter offload to CPU/NVMe) and its bandwidth-centric partitioning over ZeRO-Offload's CPU offloading approach.

These baselines collectively cover different dimensions of large model training, from simple data parallelism to complex 3D parallelism and memory-efficient ZeRO variants, allowing ZeRO-Infinity to demonstrate its superiority in terms of model scale, efficiency, and ease of use.
5.4. Hardware
The experiments were conducted on a cluster of NVIDIA V100 SXM3 32 GB GPUs.
- GPU Model: NVIDIA V100 SXM3 32 GB (each GPU has 32 GB of memory).
- Cluster Size: Up to 512 V100 GPUs, corresponding to 32 NVIDIA DGX-2 nodes.
- Inter-node Communication: 800 Gbps (Gigabits per second) of bandwidth, a high-performance network for multi-node distributed training.
- Node Configuration: A single DGX-2 node contains 16 V100 GPUs and 1.5 TB of CPU memory (as mentioned in Section 5.1.2 of the paper), and typically a substantial amount of NVMe storage.

This hardware setup represents a powerful, high-end GPU cluster suitable for extreme-scale deep learning, providing a realistic and challenging environment for evaluating ZeRO-Infinity's capabilities.
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Model Size and Speed
ZeRO-Infinity demonstrates a significant leap in model scale and competitive training throughput.
- Model Scale: ZeRO-Infinity can train models exceeding 32 trillion parameters, a 50x increase compared to the roughly 650 billion parameters achievable with 3D parallelism on similar hardware. This directly addresses the GPU memory wall and pushes the boundary of trainable model sizes.
- Training Throughput:
  - For a 500 billion parameter model (a size near the upper limit of 3D parallelism), ZeRO-Infinity achieves nearly identical throughput to 3D parallelism, indicating that its offloading mechanisms do not introduce significant overhead for models manageable by the existing state-of-the-art.
  - When scaling to 1 trillion, 5 trillion, 10 trillion, and 20 trillion parameter models, where 3D parallelism runs out of memory, ZeRO-Infinity continues to train with excellent throughput, reaching up to 49 TFlops/GPU for the 5 trillion parameter model.
  - A performance drop is observed for the 20 trillion parameter model (down to 34 TFlops/GPU). This is attributed not to NVMe bandwidth saturation (both the 10T and 20T configurations use NVMe offload), but to the extremely small batch size per GPU required for the 20T model due to limited CPU memory for storing activation checkpoints. The paper suggests this could be improved by increasing CPU memory or by offloading activation checkpoints to NVMe in the future.

The following chart (Figure 5(a) from the original paper) illustrates the throughput performance.

Figure 5: Efficiency and scalability of ZeRO-Infinity for training multi-trillion parameter models, comparing 3D parallelism, ZeRO-Offload, and ZeRO-Infinity across model sizes and GPU counts.
6.1.2. Superlinear Scalability
ZeRO-Infinity exhibits super-linear scalability when training a 1 trillion parameter model, particularly from 4 nodes (64 GPUs) to 32 nodes (512 GPUs).
- Observation: The system exceeds perfect linear scaling, meaning the throughput per GPU increases as more GPUs are added while the batch size per GPU is held constant (weak scaling).
- Reason: This super-linear scaling is primarily due to:
  - Leveraging Aggregate Bandwidth: As more nodes are added, the aggregate PCIe and NVMe bandwidths increase linearly. ZeRO-Infinity's bandwidth-centric partitioning effectively utilizes this increased aggregate bandwidth to accelerate the offloading of parameters and optimizer states.
  - Increased CPU Compute: Additional nodes also bring more CPU compute power, which ZeRO-Infinity leverages for tasks like the optimizer step computations.
- Efficiency at Modest Scale: Even with just 4 nodes, ZeRO-Infinity achieves over 2.8 petaflops (44 TFlops/GPU), demonstrating that the aggregated NVMe bandwidth is sufficient for good efficiency even at a relatively modest scale.

The following chart (Figure 5(b) from the original paper) illustrates the super-linear scalability.

Figure 5: Efficiency and scalability of ZeRO-Infinity for training multi-trillion parameter models.
6.1.3. Democratizing Large Model Training
ZeRO-Infinity significantly improves the accessibility and ease of use of large model training.
- Accessibility: It enables training (and specifically fine-tuning) of models up to 1 trillion parameters on a single NVIDIA DGX-2 node (16 GPUs). This makes models like GPT-3 (175B parameters) accessible for fine-tuning to users who do not have access to massive GPU clusters. In contrast, 3D parallelism cannot scale beyond 20 billion parameters on a single DGX-2 node.
- Ease-of-Use: This capability is achieved without requiring model parallelism or pipeline parallelism, and without any model code refactoring. This significantly reduces the burden on data scientists, allowing them to scale their models with minimal effort.
- Performance: On a single DGX-2 node, ZeRO-Infinity achieves excellent performance of over 40 TFlops/GPU for models up to 100 billion parameters.

The following chart (Figure 5(c) from the original paper) illustrates the single-node performance.

Figure 5: Efficiency and scalability of ZeRO-Infinity for training multi-trillion parameter models.
6.2. Data Presentation (Tables)
The following are the results from Table 1 of the original paper:
| # nodes | # params | hidden dim | # layers | batch/GPU | mp | fp16 param | Opt State |
|---|---|---|---|---|---|---|---|
| 1 | 10 B | 4K | 50 | 8 | 1 | GPU | GPU |
| 1 | 50, 100 B | 8K | 62, 125 | 26, 24 | 1 | CPU | NVMe |
| 1 | 0.5, 1 T | 18K, 25K | 124, 128 | 8,7 | 1 | NVMe | NVMe |
| 32 | 0.5, 1 T | 18K, 25K | 124, 128 | 7,5 | 4 | GPU | GPU |
| 32 | 5, 10, 20 T | 48K, 64K, 88K | 174, 200, 205 | 3, 2, 1.25 | 4, 4,8 | NVMe | NVMe |
The following are the results from Table 2 of the original paper (devices holding each model state, and whether it is partitioned):

| Name | Optimizer + Grad (devices / partitioned) | Parameters (devices / partitioned) |
|---|---|---|
| Data parallel | [GPU] / ✗ | [GPU] / ✗ |
| ZeRO 2 | [GPU] / ✓ | [GPU] / ✗ |
| ZeRO-Offload | [CPU, GPU] / ✓ | [GPU] / ✗ |
| 3D Parallelism | [GPU] / ✓ | [GPU] / ✓ |
| ZeRO 3 | [GPU] / ✓ | [GPU] / ✓ |
| ZeRO-Inf-CPU | [CPU, GPU] / ✓ | [CPU, GPU] / ✓ |
| ZeRO-Inf-NVMe | [NVMe, CPU, GPU] / ✓ | [NVMe, CPU, GPU] / ✓ |
The following are the results from Table 4 of the original paper:
| Figure 6(a) | |||||||
| Model size | Number of GPUs | MP | Layers | Hidden Size | Attention head | Batch size | Total batch size |
|---|---|---|---|---|---|---|---|
| 1.4B | 16 | 1 | 40 | 1536 | 16 | 1 | 16 |
| 10B | 16 | 1 | 50 | 4096 | 16 | 1 | 16 |
| 13B | 16 | 1 | 64 | 4096 | 16 | 1 | 16 |
| 20B (ZeRO-3) | 16 | 1 | 98 | 4096 | 32 | 1 | 16 |
| 20B(3D Par.) | 16 | 4 | 98 | 4096 | 32 | 1 | 16 |
| 70B | 16 | 1 | 125 | 8192 | 32 | 1 | 16 |
| 1000B | 16 | 4 | 128 | 25600 | 256 | 5 | 20 |
The following are the results from Table 5 of the original paper:
| Figure 6(b) | |||||||
| Hidden size | Number of GPUs | MP | Layers | Model size | Attention head | Batch size | Total batch size |
|---|---|---|---|---|---|---|---|
| 8192 | 16 | 1 | 1 | 900M | 16 | 1 | 16 |
| 16384 | 16 | 1 | 1 | 3B | 16 | 1 | 16 |
| 32768 | 16 | 1 | 1 | 13B | 16 | 1 | 16 |
| 65536 | 16 | 1 | 1 | 50B | 32 | 1 | 16 |
The following are the results from Table 6 of the original paper:
Configurations for Figure 6(c):
| Number of GPUs | Hidden size | MP | Layers | Model size | Attention head | Batch size | Total batch size |
|---|---|---|---|---|---|---|---|
| [4,16,32,64] | 8192 | 1 | 10 | 8B | 16 | 2 | [8,32,64,128] |
The following are the results from Table 7 of the original paper:
Configurations for Figure 6(d):
| Batch size | Number of GPUs | Hidden size | MP | Layers | Model size | Attention head | Total batch size |
|---|---|---|---|---|---|---|---|
| [2,4,8,10,14,16] | 64 | 8192 | 1 | 10 | 8B | 16 | [128,256,512,640,896,1024] |
The following are the results from Table 8 of the original paper:
Configurations for Figure 6(e):
| Hidden size | Number of GPUs | Opt Device | MP | Layers | Model size | Attention head | Batch size | Total batch size |
|---|---|---|---|---|---|---|---|---|
| 2048 | 32 | CPU | 1 | 5 | 275M | 16 | 4 | 128 |
| 8192 | 32 | CPU | 1 | 5 | 4B | 16 | 4 | 128 |
| 16384 | 32 | CPU | 1 | 5 | 16B | 16 | 4 | 128 |
| 32768 | 32 | CPU | 1 | 5 | 64B | 16 | 4 | 128 |
| 65536 | 64 | NVMe | 1 | 5 | 260B | 16 | 4 | 128 |
6.3. Ablation Studies / Parameter Analysis
The paper conducts several ablation studies to demonstrate the impact of individual ZeRO-Infinity features on model scale and performance.
The following image (Figure 6 from the original paper) illustrates the impact of system features on model scale and performance:
The image is Figure 6 from the paper, showing the impact of system features on model scale and performance: panel (a) compares the maximum model size under different ZeRO strategies, (b) shows the maximum hidden dimension at different tiling factors, (c) compares the backward time of ZeRO-Infinity and ZeRO-Offload, (d) shows the speedup from communication overlapping, and (e) shows the overhead of offloading activation checkpoints to CPU.
Figure 6: Impact of system features on model scale and performance.
6.3.1. Impact of System Features on Model Scale (Figure 6a)
This experiment investigates how different device placement and partitioning strategies influence the maximum model size trainable on a single DGX-2 system (16 GPUs). Table 2 defines these strategies.
- Data Parallelism: The baseline, limited to 1.4 billion parameters due to limited GPU memory and redundant model states on every GPU.
- ZeRO-2: With optimizer/gradient partitioning and CPU offload for these states, it scales to 13 billion parameters (a 9x increase).
- ZeRO-Infinity-CPU (ZeRO-Inf-CPU): Offloading parameter states to CPU along with the optimizer and gradient states (the Table 2 entries for Optimizer + Grad and for Parameters) enables scaling to nearly 100 billion parameters.
- ZeRO-Infinity-NVMe (ZeRO-Inf-NVMe): The final major jump comes from offloading all model states to NVMe (the [NVMe, CPU, GPU] / ✓ entries in Table 2), allowing training of models up to 1 trillion parameters.
- Conclusion: This demonstrates a 700x increase in model size compared to data parallelism alone, highlighting the critical role of heterogeneous memory offloading (especially to NVMe) in achieving extreme scale; a back-of-the-envelope footprint calculation follows this list.
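As promised above, a quick footprint sketch shows why each tier unlocks the next jump in scale. It uses the standard mixed-precision Adam accounting of roughly 16 bytes of model state per parameter (2 for fp16 parameters, 2 for fp16 gradients, 12 for the fp32 master weights, momentum, and variance), together with rough single-DGX-2 capacities; the NVMe capacity is an assumption, and activations and working buffers are ignored.

```python
# Sketch: model-state footprint vs. memory tiers on one DGX-2 node, assuming the
# states are fully partitioned across the node (ZeRO-3 semantics) and ignoring
# activations, fragmentation, and working buffers.
BYTES_PER_PARAM = 16  # 2 B fp16 param + 2 B fp16 grad + 12 B fp32 Adam states

def model_state_tb(num_params: float) -> float:
    return num_params * BYTES_PER_PARAM / 1e12

gpu_tb = 16 * 32e9 / 1e12   # 16 x 32 GB V100 HBM, roughly 0.5 TB
cpu_tb = 1.5                # roughly 1.5 TB of DRAM on a DGX-2
nvme_tb = 28.0              # assumption: tens of TB of local NVMe

for params in (1.4e9, 13e9, 100e9, 1e12):
    need = model_state_tb(params)
    tier = ("fits in GPU memory" if need <= gpu_tb
            else "needs CPU offload" if need <= gpu_tb + cpu_tb
            else "needs NVMe offload")
    print(f"{params / 1e9:7.1f}B params -> {need:6.2f} TB of model states -> {tier}")
```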
6.3.2. Maximum Hidden Size with Memory-centric Tiling (Figure 6b)
This study evaluates the effectiveness of memory-centric tiling (Section 5.1.3) in enabling large hidden sizes despite memory fragmentation. A single-layer Transformer model is trained on a DGX-2 (16 GPUs), with GPU memory pre-fragmented into 2 GB chunks.
- Without Tiling: The largest hidden size trainable is 8K. Any larger size fails due to insufficient contiguous memory for the large operator.
- With Tiling: Using a memory-centric tiling factor of 16, ZeRO-Infinity can train a massive hidden size of 64K.
- Conclusion: Memory-centric tiling significantly simplifies the DL system stack by allowing large operators to fit in GPU memory without resorting to model parallelism (tensor-slicing), making it easier for data scientists to use large hidden sizes; a sketch of the tiling idea follows this list.
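The following is an independent illustration of the tiling idea (not the DeepSpeed implementation; the class name `TiledLinear` and the tile count are assumptions): a single logical linear operator is split along its output dimension into smaller linears that run sequentially, so no single monolithic weight tensor or GEMM work buffer has to be resident at once, and each tile's parameters can be fetched and partitioned independently.

```python
import torch
import torch.nn as nn

class TiledLinear(nn.Module):
    """Sketch of memory-centric tiling: a logical Linear(in_f, out_f) is expressed as
    `tiles` smaller Linear operators along the output dimension, executed one after
    another, so only one tile's weights need to be materialized/gathered at a time."""
    def __init__(self, in_features: int, out_features: int, tiles: int = 16):
        super().__init__()
        assert out_features % tiles == 0, "output dim must divide evenly into tiles"
        self.tiles = nn.ModuleList(
            nn.Linear(in_features, out_features // tiles) for _ in range(tiles)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Concatenating the per-tile outputs reproduces the monolithic layer's output
        # shape; numerics match a single Linear up to how the weights are initialized.
        return torch.cat([tile(x) for tile in self.tiles], dim=-1)

# Example: a hidden size of 8K tiled by 16 becomes sixteen 8192 x 512 operators.
layer = TiledLinear(in_features=8192, out_features=8192, tiles=16)
out = layer(torch.randn(4, 8192))
```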
6.3.3. Impact of System Features on Performance
ZeRO-Infinity vs. ZeRO-Offload (Figure 6c)
This experiment compares the backward propagation time for an 8 billion parameter model between ZeRO-Infinity and ZeRO-Offload, focusing on gradient offloading to CPU memory.
- Observation: ZeRO-Infinity achieves a speedup of nearly 2x at 64 GPUs compared to ZeRO-Offload.
- Reason: ZeRO-Offload is limited by the PCIe bandwidth of a single GPU for gradient offloading. ZeRO-Infinity, through its bandwidth-centric partitioning (Section 6.1) and allgather-based approach, leverages the aggregate PCIe bandwidth across all GPUs to offload gradients in parallel.
- Conclusion: This demonstrates the superior bandwidth utilization strategy of ZeRO-Infinity for offloaded states (see the arithmetic sketch after this list).
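A small arithmetic sketch (illustrative numbers, not measurements from the paper) of why partitioning the transfer across GPUs matters: with an owner-centric scheme only one PCIe link is busy, while with bandwidth-centric partitioning every GPU moves a 1/N slice over its own link in parallel.

```python
# Sketch: effective offload time with vs. without bandwidth-centric partitioning.
PCIE_GBS = 12.0  # assumed achievable PCIe bandwidth per GPU, GB/s (illustrative)

def transfer_time_ms(data_gb: float, num_gpus: int, partitioned: bool) -> float:
    if partitioned:
        # each GPU moves only data_gb / num_gpus, and all links operate in parallel
        return (data_gb / num_gpus) / PCIE_GBS * 1e3
    # a single owner GPU's link carries the entire transfer
    return data_gb / PCIE_GBS * 1e3

layer_gb = 2.0  # e.g., fp16 gradients of one large layer
for n in (1, 16, 64):
    print(f"{n:3d} GPUs: single-link {transfer_time_ms(layer_gb, n, False):7.1f} ms, "
          f"partitioned {transfer_time_ms(layer_gb, n, True):6.1f} ms")
```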
Prefetching and Overlapping (Figure 6d)
This study examines the effect of communication overlapping and prefetching (part of overlap-centric design, Section 6.2) on throughput for an 8 billion parameter model with 64 GPUs across varying batch sizes.
- Observation: Prefetching and overlapping are crucial for achieving good performance at small batch sizes per GPU. As the batch size increases, their impact diminishes.
- Reason: At small batch sizes, the arithmetic intensity is lower and communication time becomes a larger proportion of the total time; prefetching and overlapping effectively hide this communication latency. At large batch sizes, the compute-to-communication ratio is higher, making the system less sensitive to communication overheads.
- Conclusion: This validates the importance of the overlap-centric design for maintaining efficiency, especially in scenarios where the batch size per GPU is constrained (e.g., extremely large models where even a small batch might exhaust memory); a stream-based prefetching sketch follows this list.
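As a sketch of the prefetching idea referenced above (my own stream-based illustration, not the DeepSpeed prefetcher): while layer i computes on the default CUDA stream, the next layer's weights are copied host-to-device on a side stream, so the transfer hides behind compute. The CPU tensors must be pinned for the copy to be truly asynchronous.

```python
import torch

copy_stream = torch.cuda.Stream()  # side stream dedicated to host->device prefetches

def prefetch(pinned_cpu_weight: torch.Tensor) -> torch.Tensor:
    """Launch an asynchronous H2D copy of the next layer's weights on the side stream."""
    with torch.cuda.stream(copy_stream):
        return pinned_cpu_weight.to("cuda", non_blocking=True)

def run_layers(pinned_weights, x):
    """pinned_weights: list of pinned CPU weight tensors, one per layer; x: CUDA activation."""
    next_w = prefetch(pinned_weights[0])
    for i in range(len(pinned_weights)):
        # make the compute stream wait until this layer's weights have arrived
        torch.cuda.current_stream().wait_stream(copy_stream)
        w = next_w
        if i + 1 < len(pinned_weights):
            next_w = prefetch(pinned_weights[i + 1])  # overlaps with the matmul below
        x = x @ w  # stand-in for the layer's actual computation
    return x
```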
Activation Checkpoint Offload (Figure 6e)
This experiment measures the training throughput impact of CPU offloading of activation checkpoints in ZeRO-Infinity for different hidden sizes.
- Observation: For small hidden sizes (e.g., 2K), CPU offloading of activation checkpoints reduces training throughput by up to 1.2x. For larger hidden sizes (32K and 64K), the performance impact is minimal.
- Reason: For smaller hidden sizes, the arithmetic intensity associated with the activation checkpoints is lower, so the overhead of the CPU-GPU transfer is more noticeable. For larger hidden sizes, the arithmetic intensity increases, making the cost of recomputation and CPU offload relatively insignificant compared to the overall GPU computation.
- Conclusion: ZeRO-Infinity can offload activation checkpoints to CPU memory without significantly impacting efficiency for large hidden sizes, which is crucial for training models where even checkpointed activations exceed GPU memory; an illustrative offload sketch follows this list.
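For intuition, the sketch below emulates activation-checkpoint offload with stock PyTorch utilities (a stand-in for ZeRO-Infinity's pinned-buffer offload engine, not the paper's implementation, and it assumes a reasonably recent PyTorch): activation checkpointing discards intermediate activations and recomputes them in backward, while `save_on_cpu` moves the tensors that checkpointing does keep into pinned host memory, pulling them back only when the backward pass needs them.

```python
import torch
from torch.utils.checkpoint import checkpoint

def transformer_block(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # stand-in for a real Transformer block's computation
    return torch.nn.functional.gelu(x @ w)

def forward_with_offloaded_checkpoints(x: torch.Tensor, weights) -> torch.Tensor:
    # save_on_cpu moves tensors saved for backward to (pinned) CPU memory and copies
    # them back on demand; checkpoint() recomputes the block internals in backward.
    with torch.autograd.graph.save_on_cpu(pin_memory=True):
        for w in weights:
            x = checkpoint(transformer_block, x, w, use_reentrant=False)
    return x
```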
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper presents ZeRO-Infinity, a groundbreaking heterogeneous system technology designed to overcome the GPU memory wall for extreme-scale deep learning. By intelligently leveraging GPU, CPU, and NVMe memory, ZeRO-Infinity enables the training of models with tens and even hundreds of trillions of parameters, a 50x increase over previous state-of-the-art 3D parallelism. Key innovations include the infinity offload engine for comprehensive model state management, memory-centric tiling to handle massive individual operators without model parallelism, bandwidth-centric partitioning for efficient data movement across memory tiers, and an overlap-centric design to mask communication latency. Crucially, ZeRO-Infinity achieves this unprecedented scale and efficiency without requiring model code refactoring, significantly democratizing access to large model training by allowing fine-tuning of trillion-parameter models on a single NVIDIA DGX-2 node. It demonstrates robust performance, sustaining over 25 petaflops on 512 V100 GPUs, and exhibits super-linear scalability.
7.2. Limitations & Future Work
The authors acknowledge a current limitation and project future needs:
- Current Limitation: The performance drop observed for the 20 trillion parameter model is attributed to an extremely small batch size per GPU, which is a consequence of limited CPU memory for storing activation checkpoints.
- Suggested Future Work: The authors propose that this limitation can be addressed by increasing CPU memory or, more significantly, by offloading activation checkpoints to NVMe in future implementations of ZeRO-Infinity.
- Future Hardware Implications: Looking ahead, the paper emphasizes that while ZeRO-Infinity effectively removes the device memory bottleneck, training models with tens or hundreds of trillions of parameters in a reasonable time will necessitate massive leaps in compute power (more and/or more powerful accelerators). These future devices will, in turn, require a proportional increase in device-to-device bandwidth to remain efficient (e.g., GPU-to-GPU bandwidth growing from roughly 70 GB/s today to 700 GB/s or 7 TB/s, per Table 3). The paper also highlights that even today's technology, such as NVLink connecting GPUs to CPU memory, can meet the slow memory bandwidth requirements of GPUs 10x faster than a V100 when using ZeRO-Infinity.

The following are the results from Table 3 of the original paper:
| | V100 | 10x | 100x |
|---|---|---|---|
| Total devices | 512 | 512 | 512 |
| Achievable peak (pflops/device) | 0.07 | 0.70 | 7.00 |
| Slow memory bw requirement (GB/s per device) | 3.0 | 30.0 | 300.0 |
| Slow memory aggregate bw (TB/s) | 1.5 | 15.0 | 150.0 |
| GPU-to-GPU bw (GB/s) | 70.0 | 700.0 | 7000.0 |
7.3. Personal Insights & Critique
ZeRO-Infinity represents a pivotal advancement in distributed deep learning, fundamentally changing how we think about memory resources for extreme-scale models.
- Inspiration: The most significant inspiration is the holistic approach to heterogeneous memory management. Instead of just optimizing GPU memory, it treats GPU, CPU, and NVMe as a unified, tiered memory system. This paradigm shift will likely influence future system designs, pushing for better integration and high-bandwidth pathways between these memory components. The super-linear scalability is particularly intriguing, demonstrating that intelligently designed distributed systems can yield unexpected benefits beyond simple aggregation of resources. This challenges the conventional wisdom that adding more resources inevitably leads to diminishing returns due to communication overhead.
- Transferability: The core principles of bandwidth-centric partitioning, overlap-centric design, and memory-centric tiling are highly transferable.
  - Other DL tasks: These techniques could be adapted for other memory-intensive DL tasks beyond Transformer training, such as large graph neural networks or diffusion models.
  - General HPC: The approach to managing heterogeneous memory and overlapping I/O could inspire optimizations in broader High-Performance Computing (HPC) domains that deal with large datasets and complex computations on diverse memory architectures.
  - Cloud Computing: Cloud providers could leverage ZeRO-Infinity to offer more cost-effective large model training by intelligently using local SSDs (like NVMe) alongside GPU and CPU memory, potentially allowing users to provision fewer expensive GPU instances.
- Potential Issues, Unverified Assumptions, or Areas for Improvement:
  - NVMe Lifespan and Reliability: While NVMe offers massive capacity, frequent read/write cycles for model states (especially optimizer states and parameters during full offloading) could impact NVMe drive lifespan due to write endurance limits. The paper does not discuss strategies to mitigate this, which might be critical for prolonged training runs.
  - CPU Overhead: Despite the excellent throughput, ZeRO-Infinity relies heavily on the CPU for NVMe data transfers, pinned memory management, and potentially optimizer steps. While CPU compute is leveraged, excessive CPU utilization could become a bottleneck or impact other processes on the system, especially on nodes with less powerful CPUs. The small batch size per GPU for the 20T model, caused by the CPU memory limit on activation checkpoints, hints that CPU resources can still be a constraint.
  - Complexity for Debugging: While ZeRO-Infinity simplifies the user experience by eliminating model refactoring, the underlying system is highly complex, involving intricate interactions across GPU, CPU, and NVMe memory, asynchronous I/O, and multiple layers of communication-computation overlap. Debugging performance issues or memory errors in such a system could be challenging for developers or advanced users trying to push its limits.
  - Generalizability to Diverse Workloads: The analysis focuses on Transformer-based models and the Adam optimizer. While these are prevalent, ZeRO-Infinity's efficiency might vary for DL workloads with different arithmetic intensities, memory access patterns, or optimizer types. Further characterization across a wider range of DL architectures would strengthen its claims.
  - Cost-Benefit Analysis of NVMe: While NVMe is cheaper per GB than GPU memory, the overall system cost of equipping nodes with massive NVMe arrays, coupled with the power consumption of such I/O, might warrant a more detailed cost-benefit analysis in certain deployment scenarios.
  - Activation Offload to NVMe: The authors acknowledge that offloading activations to NVMe is a future step. This will be crucial for truly scaling models beyond 20T parameters, as CPU memory (even 1.5 TB on a DGX-2) can still be a bottleneck for activation checkpoints. It will introduce another layer of I/O complexity and latency management that ZeRO-Infinity will need to address.

Overall, ZeRO-Infinity is a testament to sophisticated system co-design, effectively pushing DL capabilities into realms previously thought impossible without prohibitively expensive hardware. Its contributions will undoubtedly shape the development of future DL systems and hardware architectures.