
ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning

Published: 04/16/2021
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

ZeRO-Infinity leverages GPU, CPU, and NVMe memory to break the GPU memory wall, enabling trillion-parameter model training and fine-tuning without code refactoring, achieving high throughput and superlinear scalability in extreme-scale deep learning.

Abstract

In the last three years, the largest dense deep learning models have grown over 1000x to reach hundreds of billions of parameters, while the GPU memory has only grown by 5x (16 GB to 80 GB). Therefore, the growth in model scale has been supported primarily through system innovations that allow large models to fit in the aggregate GPU memory of multiple GPUs. However, we are getting close to the GPU memory wall. It requires 800 NVIDIA V100 GPUs just to fit a trillion parameter model for training, and such clusters are simply out of reach for most data scientists. In addition, training models at that scale requires complex combinations of parallelism techniques that put a big burden on the data scientists to refactor their model. In this paper we present ZeRO-Infinity, a novel heterogeneous system technology that leverages GPU, CPU, and NVMe memory to allow for unprecedented model scale on limited resources without requiring model code refactoring. At the same time it achieves excellent training throughput and scalability, unencumbered by the limited CPU or NVMe bandwidth. ZeRO-Infinity can fit models with tens and even hundreds of trillions of parameters for training on current generation GPU clusters. It can be used to fine-tune trillion parameter models on a single NVIDIA DGX-2 node, making large models more accessible. In terms of training throughput and scalability, it sustains over 25 petaflops on 512 NVIDIA V100 GPUs (40% of peak), while also demonstrating super linear scalability. An open source implementation of ZeRO-Infinity is available through DeepSpeed, a deep learning optimization library that makes distributed training easy, efficient, and effective.


In-depth Reading


1. Bibliographic Information

1.1. Title

ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning

1.2. Authors

Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, Yuxiong He. All authors are affiliated with Microsoft. Their research backgrounds primarily lie in distributed systems, deep learning optimization, and high-performance computing, particularly focused on scaling deep learning models.

1.3. Journal/Conference

The paper was published on arXiv, a preprint server. The specific publication status (e.g., if it was later accepted to a peer-reviewed conference or journal) is not explicitly stated in the provided abstract, but it is typical for such foundational work to be presented at top-tier conferences in machine learning systems (e.g., OSDI, SOSP, MLSys, SC). DeepSpeed, the open-source library implementing ZeRO-Infinity, is widely adopted and well-regarded in the deep learning community.

1.4. Publication Year

2021

1.5. Abstract

The paper addresses the significant disparity between the rapid growth in large dense deep learning (DL) model sizes (over 1000x in three years) and the much slower increase in GPU memory (only 5x). This disparity has led to a "GPU memory wall," making it prohibitively expensive and complex (requiring 800 NVIDIA V100 GPUs for a trillion-parameter model) to train extreme-scale models using existing system innovations like 3D parallelism. Furthermore, current methods demand extensive model code refactoring, burdening data scientists.

ZeRO-Infinity is introduced as a novel heterogeneous system technology that overcomes these limitations. It leverages GPU, CPU, and NVMe memory simultaneously, enabling unprecedented model scale on limited resources without requiring model code refactoring. The system achieves excellent training throughput and scalability by managing the slower CPU and NVMe bandwidths efficiently. ZeRO-Infinity can train models with tens to hundreds of trillions of parameters on current-generation GPU clusters and allows fine-tuning trillion-parameter models on a single NVIDIA DGX-2 node, significantly improving accessibility. It demonstrates strong performance, sustaining over 25 petaflops on 512 NVIDIA V100 GPUs (40% of peak) and exhibiting super-linear scalability. An open-source implementation is available through DeepSpeed, a deep learning optimization library.

https://arxiv.org/abs/2104.07857v1 (Publication Status: Preprint, available on arXiv.)

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the "GPU memory wall" in deep learning training. In recent years, the size of state-of-the-art dense deep learning models has grown exponentially, often by orders of magnitude (e.g., GPT-3 with 175 billion parameters). However, the memory capacity of individual GPUs has increased at a much slower rate. This growing gap means that many cutting-edge models are simply too large to fit into the memory of a single GPU, or even the aggregate GPU memory of a moderately sized cluster.

This problem is critical for several reasons:

  • Limited Model Scale: Without innovative system solutions, the growth of model sizes, which has been a primary driver of DL advancements, would be severely constrained. The paper notes that training a trillion-parameter model on current V100 GPUs would require 800 GPUs just to fit the model states, making it inaccessible for most researchers and businesses.

  • High Cost and Inaccessibility: The immense GPU resources required for large model training or even fine-tuning make it an exclusive endeavor for organizations with access to massive clusters. For example, fine-tuning GPT-3 would require 8 DGX-2 nodes (128 GPUs) using existing 3D parallelism techniques, even though a single DGX-2 node might have sufficient compute power.

  • Complexity for Data Scientists: Current state-of-the-art solutions for large model training, such as 3D parallelism (combining data, model, and pipeline parallelism), demand significant effort from data scientists. They often involve complex model code refactoring, including splitting models into tensor-sliced versions or load-balanced pipeline stages, which is both time-consuming and inflexible for diverse model architectures.

    The paper's entry point and innovative idea is to transcend the GPU memory wall by holistically leveraging the entire memory hierarchy of modern computing systems: GPU memory (fast but small), CPU memory (slower but larger), and NVMe (Non-Volatile Memory Express) storage (slowest but massive). This heterogeneous memory management aims to allow extreme model scaling without imposing the burden of complex model refactoring on data scientists, thereby democratizing access to large model training.

2.2. Main Contributions / Findings

The primary contributions and key findings of the paper are:

  • Unprecedented Model Scale: ZeRO-Infinity enables the training of models with tens and even hundreds of trillions of parameters, a significant leap from previous state-of-the-art methods. Specifically, it can train models over 32 trillion parameters, representing a 50x increase in model scale compared to 3D parallelism. This is achieved by simultaneously exploiting GPU, CPU, and NVMe memory through its infinity offload engine.

  • Accessibility and Ease-of-Use:

    • It allows fine-tuning trillion-parameter models on a single NVIDIA DGX-2 node (16 GPUs), making large model training more accessible to researchers and companies with limited resources.
    • It eliminates the need for manual model code refactoring, pipeline parallelism, or model parallelism (tensor-slicing), simplifying the development process for data scientists through its ease-inspired implementation and memory-centric tiling.
  • Excellent Training Throughput and Scalability:

    • ZeRO-Infinity sustains over 25 petaflops on 512 NVIDIA V100 GPUs (40% of peak theoretical performance), demonstrating high efficiency even with offloading data to slower memory tiers.
    • It achieves super-linear scalability for trillion-parameter models, meaning that as the number of GPUs increases, the throughput per GPU can sometimes increase more than proportionally, primarily by effectively leveraging aggregate PCIe and NVMe bandwidths and CPU compute across multiple nodes.
  • Innovative System Technologies:

    • Infinity Offload Engine: A novel mechanism to intelligently offload model states (optimizer states, gradients, parameters) to CPU or NVMe memory based on memory requirements, while maintaining high performance.
    • Memory-centric Tiling: A GPU memory optimization technique that breaks down large operators into smaller, sequentially executable tiles, reducing GPU working memory requirements and eliminating the need for model parallelism for massive layers.
    • Bandwidth-centric Partitioning: A data partitioning and retrieval strategy that leverages the aggregate bandwidth across all parallel devices (e.g., PCIe links) by partitioning individual parameters across data parallel processes and using allgather operations.
    • Overlap-centric Design: An architecture that aggressively overlaps compute with various forms of communication (GPU-GPU, NVMe-CPU, CPU-GPU) to hide the latency of data movement to/from slower memory tiers.
  • Open-Source Implementation: ZeRO-Infinity is available as part of DeepSpeed, an open-source library, making these advanced capabilities widely accessible to the DL community.

    These findings solve the problems of limited model scale, high training cost, and complexity, pushing the boundaries of what is possible in large-scale deep learning.
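
To make the open-source availability concrete, the following is a minimal usage sketch (not from the paper) of how ZeRO-Infinity-style offloading is typically enabled through a DeepSpeed configuration. The exact configuration keys, the `/local_nvme` path, and the `deepspeed.initialize` keyword arguments are illustrative assumptions that may differ across DeepSpeed versions.

```python
# Hedged sketch: enabling ZeRO stage 3 with NVMe offload through DeepSpeed.
# Configuration keys and paths are illustrative; consult the DeepSpeed
# documentation for the authoritative schema.
import deepspeed
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 3,  # partition parameters, gradients, and optimizer states (ZeRO-3)
        "offload_param": {"device": "nvme", "nvme_path": "/local_nvme"},
        "offload_optimizer": {"device": "nvme", "nvme_path": "/local_nvme"},
    },
}

# deepspeed.initialize wraps the model; parameter partitioning, offloading,
# and prefetching are then handled automatically during training.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```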

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully understand ZeRO-Infinity, a reader should be familiar with the following foundational concepts in deep learning and distributed systems:

  • Deep Learning (DL): A subfield of machine learning inspired by the structure and function of the human brain. It involves training neural networks with multiple layers (hence "deep") on large datasets to learn complex patterns.
  • GPU Memory Wall: This refers to the growing gap between the increasing memory demands of large DL models and the relatively slower growth of memory capacity on GPUs. As models become larger, they quickly exhaust the available GPU memory, hindering training.
  • Model States: In DL training, the model states refer to the various components that define and are updated during the learning process. These include:
    • Parameters: The learnable weights and biases of the neural network. For example, in a linear layer y = Wx + b, W and b are parameters. In mixed precision training, parameters are typically stored in FP16 (16-bit floating point).
    • Gradients: The derivatives of the loss function with respect to the parameters. These indicate the direction and magnitude of parameter adjustments needed to reduce the loss. Gradients are also often stored in FP16.
    • Optimizer States: Additional state variables maintained by optimizers (like Adam or SGD) to facilitate parameter updates. For example, the Adam optimizer maintains first-order momentum and second-order variance estimates for each parameter. These are typically stored in FP32 (32-bit floating point) for higher precision.
    • For mixed precision training with Adam, each parameter requires an FP16 parameter and an FP16 gradient (2 bytes each), plus FP32 optimizer states: momentum, variance, a master copy of the parameter, and a copy of the gradient used for the update (4 bytes each). This totals 2 + 2 + 4 + 4 + 4 + 4 = 20 bytes per parameter, matching the 20 bytes per parameter the paper reports for Adam in mixed precision (16 of which are optimizer states).
  • Residual States: These are temporary memory allocations needed during training, primarily:
    • Activations: The outputs of intermediate layers during the forward pass of a neural network. These activations need to be stored because they are required to compute gradients during the backward pass using the chain rule. Storing all activations can be memory-intensive.
    • Activation Checkpointing: A technique to reduce activation memory consumption. Instead of storing all activations, only a subset (checkpoints) are stored. Intermediate activations between checkpoints are recomputed during the backward pass when needed, trading off memory for additional computation. This is also known as gradient checkpointing.
  • Mixed Precision Training: A training technique that uses both FP16 (half-precision floating point) and FP32 (single-precision floating point) numbers. FP16 computations are faster and use less memory on modern GPUs (especially with Tensor Cores), while FP32 is used for critical operations like parameter updates to maintain numerical stability and accuracy.
  • Adam Optimizer: A popular adaptive learning rate optimization algorithm that computes adaptive learning rates for each parameter. It stores exponentially decaying averages of past gradients (first moment estimates, m) and squared past gradients (second moment estimates, v), which contribute significantly to optimizer state memory requirements.
  • Arithmetic Intensity (AIT): A measure of the ratio of total floating-point operations (computations) to the total amount of data moved (memory accesses) by a computation kernel. Higher AIT means more computation is performed for each unit of data transferred, making the computation less sensitive to memory bandwidth limitations.
  • Parallelism Techniques:
    • Data Parallelism (DP): The simplest form of distributed training where multiple GPUs each hold a complete copy of the model and process different mini-batches of data. Gradients are then aggregated (e.g., averaged) across all GPUs.
    • Model Parallelism (MP) / Tensor Slicing: When a model is too large for a single GPU, its layers or even individual tensors within a layer can be split across multiple GPUs. This requires careful partitioning and communication. Tensor slicing is a specific form where a tensor (e.g., a large weight matrix) is partitioned.
    • Pipeline Parallelism (PP): Divides the layers of a model into stages, with each stage assigned to a different GPU (or set of GPUs). Data flows through these stages sequentially, forming a pipeline.
    • 3D Parallelism: A composite technique that combines Data Parallelism, Model Parallelism, and Pipeline Parallelism to efficiently train extremely large models on hundreds or thousands of GPUs.
  • ZeRO (Zero Redundancy Optimizer): A family of memory optimization technologies that partition model states across data parallel processes to eliminate memory redundancies.
    • ZeRO-1: Partitions only the optimizer states.
    • ZeRO-2: Partitions both optimizer states and gradients.
    • ZeRO-3: Partitions all three model states: optimizer states, gradients, and parameters. This is the most memory-efficient stage, where each parameter is owned by a unique data parallel process.
  • Heterogeneous Memory Systems: Modern computing systems often feature a hierarchy of memory types with different characteristics:
    • GPU Memory (HBM/GDDR): High-bandwidth, low-latency, but relatively small capacity. Ideal for active computation.
    • CPU Memory (DDR): Lower bandwidth and higher latency than GPU memory, but much larger capacity. Connected to GPUs via PCIe (Peripheral Component Interconnect Express).
    • NVMe Storage: Non-Volatile Memory Express is a storage interface for SSDs (Solid State Drives) that offers much higher bandwidth and lower latency than traditional SATA SSDs or HDDs. It provides massive storage capacity but is significantly slower than CPU or GPU memory.
  • PyTorch: An open-source machine learning framework known for its flexibility and dynamic computational graph.
    • Hooks: Mechanisms in PyTorch that allow users to insert custom code to be executed before or after forward or backward passes of a module or specific tensors. ZeRO-Infinity uses these for automated data movement.
    • Submodules: In PyTorch, neural networks are typically composed of torch.nn.Module objects, which can contain other Module objects (submodules), forming a hierarchical structure (e.g., a Transformer layer containing attention and feed-forward submodules).
  • Communication Collectives: Operations used in distributed computing to exchange data among multiple processes.
    • Allgather: Each process contributes its local data, and all processes receive the concatenation of data from all other processes.
    • Broadcast: One process sends its data to all other processes.
    • Reduce-scatter: Each process contributes its local data, and the results of a reduction (e.g., sum) are scattered across all processes.
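
For readers unfamiliar with these collectives, below is a minimal sketch (not from the paper) using torch.distributed; it assumes a process group has already been initialized, for example via torchrun.

```python
# Minimal illustration of the collectives used throughout ZeRO-Infinity,
# expressed with torch.distributed. Assumes dist.init_process_group has
# already been called (e.g., launched via torchrun).
import torch
import torch.distributed as dist

def demo_collectives():
    rank, world = dist.get_rank(), dist.get_world_size()

    # Allgather: every rank contributes a shard and receives all shards.
    shard = torch.full((4,), float(rank))
    gathered = [torch.empty(4) for _ in range(world)]
    dist.all_gather(gathered, shard)            # full tensor = torch.cat(gathered)

    # Broadcast: rank 0 sends its tensor to every other rank.
    t = torch.arange(4.0) if rank == 0 else torch.empty(4)
    dist.broadcast(t, src=0)

    # Reduce-scatter: element-wise sum across ranks, each rank keeps one shard.
    inputs = [torch.ones(4) * rank for _ in range(world)]
    out = torch.empty(4)
    dist.reduce_scatter(out, inputs, op=dist.ReduceOp.SUM)
    return gathered, t, out
```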

3.2. Previous Works

The paper contextualizes ZeRO-Infinity against several prior advancements in large model training:

  • ELMo [6] and GPT-3 [4]: These models are cited as examples of the rapid growth in model scale, from hundreds of millions to hundreds of billions of parameters. They highlight the trend that motivates the need for systems like ZeRO-Infinity.
  • Model Parallelism (MP) [7, 17, 18]: Early techniques for splitting models across GPUs when they wouldn't fit on a single device. Megatron-LM [7] is a prominent example.
  • Pipeline Parallelism (PP) [8-10]: Techniques like PipeDream [8], Gpipe [9], and Memory-efficient pipeline-parallel DNN training [10] aim to improve efficiency by pipelining computations across GPUs.
  • 3D Parallelism [13, 14]: This is the state-of-the-art for large model training that ZeRO-Infinity directly aims to surpass. It combines data parallelism, model parallelism, and pipeline parallelism. The DeepSpeed implementation [15] of 3D parallelism can scale to over a trillion parameters on 800 NVIDIA V100 GPUs.
    • Limitations of 3D Parallelism: The paper points out its drawbacks:
      • GPU Memory Wall: Even with 3D parallelism, the aggregate GPU memory becomes insufficient for future model scales (e.g., 320 A100 GPUs for 1T parameters, 6K GPUs for 100T parameters).
      • Complexity and Refactoring Burden: Requires significant model code refactoring and splitting models into load-balanced pipeline stages and tensor-sliced components.
      • Inflexibility: Models with complex dependencies are hard to fit into pipeline parallelism.
  • ZeRO (Zero Redundancy Optimizer) [11]: ZeRO-Infinity is built upon the ZeRO family of technologies, specifically ZeRO-3. ZeRO removes memory redundancies across data-parallel processes by partitioning model states (optimizer states, gradients, parameters).
    • ZeRO-1: Partitions optimizer states.
    • ZeRO-2: Partitions optimizer states and gradients.
    • ZeRO-3: Partitions all model states. During forward and backward passes, parameters are broadcast from their owning GPU to others, then freed. Optimizer states are updated only on the owning GPU.
  • ZeRO-Offload [12]: This is a specific heterogeneous training approach built on ZeRO-2. It stores gradients and optimizer states in CPU memory to save GPU memory.
    • Limitations of ZeRO-Offload: It still requires parameters to be stored in GPU memory and replicated, limiting model scale to what a single GPU can host. It also needs large batch sizes to be efficient due to suboptimal data partitioning and limited PCIe bandwidth.
  • Other Heterogeneous CPU/NVMe Approaches [20-26]: The paper acknowledges other works that explore leveraging CPU or NVMe memory (e.g., Capuchin [23], SuperNeurons [31], vDNN [25], Sentinel [24]). However, ZeRO-Infinity is positioned as a generic system for massive dense models, contrasting with specialized uses like Zhao et al. [27] for sparse DL Ads Systems.
  • Reducing Activation Memory [28-31]: Techniques like activation checkpointing [29, 30] and compression [28] are mentioned as complementary to ZeRO-Infinity for managing activation memory.

3.3. Technological Evolution

The evolution of large model training has been a continuous effort to overcome memory and computational bottlenecks:

  1. Single GPU: Early DL models fit entirely on a single GPU.
  2. Data Parallelism (DP): As models grew, DP allowed scaling computation by replicating the model across multiple GPUs and distributing data. This hit a wall when the model itself became too large for a single GPU.
  3. Model Parallelism (MP) and Pipeline Parallelism (PP): These techniques were developed to address models too large for a single GPU, by splitting model layers or components across GPUs. However, they introduce complexity in model design and communication.
  4. 3D Parallelism: A sophisticated combination of DP, MP, and PP to push the limits further, but still constrained by aggregate GPU memory and the refactoring burden.
  5. ZeRO (Zero Redundancy Optimizer): An innovation that drastically reduced memory footprint by partitioning model states instead of replicating them, moving beyond traditional DP's memory inefficiencies. ZeRO-3 fully partitions all model states.
  6. ZeRO-Offload: Extended ZeRO-2 by offloading optimizer states and gradients to CPU memory, marking an early step towards heterogeneous memory utilization.
  7. ZeRO-Infinity: This paper represents the next major leap, moving beyond the GPU memory wall by fully embracing a heterogeneous memory approach (GPU, CPU, NVMe) for all model states, and critically, doing so without model refactoring while maintaining high efficiency. It takes the ZeRO-3 partitioning concept and extends it to offload parameters themselves to slower memory tiers.

3.4. Differentiation Analysis

Compared to the main methods in related work, ZeRO-Infinity introduces several core differences and innovations:

  • Compared to 3D Parallelism:
    • Memory Wall Transcendence: 3D parallelism is fundamentally limited by the aggregate GPU memory available in a cluster. ZeRO-Infinity breaks this by leveraging CPU and NVMe memory, allowing for models tens to hundreds of times larger.
    • Ease of Use: 3D parallelism requires complex model parallelism and pipeline parallelism strategies, necessitating significant model code refactoring. ZeRO-Infinity eliminates this need, making it much easier for data scientists to use without modifying their model code. This is a crucial differentiator for accessibility.
    • Memory-centric Tiling: 3D parallelism uses model parallelism (tensor-slicing) to fit large individual layers. ZeRO-Infinity achieves this with memory-centric tiling, which dynamically breaks down operators, avoiding explicit model parallelism.
  • Compared to ZeRO-Offload:
    • Full Model State Offload: ZeRO-Offload primarily offloads optimizer states and gradients to CPU memory, but parameters must still reside in GPU memory. ZeRO-Infinity, built on ZeRO-3, can offload all model states (including parameters) to CPU or NVMe, enabling much larger models.
    • Bandwidth Efficiency for Parameters: ZeRO-Offload's parameter handling is limited by single PCIe bandwidth as parameters are replicated on GPUs. ZeRO-Infinity introduces bandwidth-centric partitioning where parameters are partitioned across data parallel processes and retrieved using allgather, effectively utilizing aggregate PCIe and NVMe bandwidths for parameter movement. This is critical for efficiency when offloading parameters.
    • Scope: ZeRO-Offload is more of a CPU offload technique for ZeRO-2, while ZeRO-Infinity is a complete heterogeneous memory management system for ZeRO-3 that includes NVMe and advanced overlap strategies.
  • Novelty in Heterogeneous Memory Management:
    • ZeRO-Infinity is the first to simultaneously and efficiently coordinate GPU, CPU, and NVMe memory for all model states in DL training.

    • The infinity offload engine with DeepNVMe and pinned memory management provides a robust and high-performance solution for utilizing NVMe storage, which is far beyond the capabilities of previous systems that might only touch CPU memory.

    • Its overlap-centric design is comprehensive, hiding latencies across NVMe-CPU, CPU-GPU, and GPU-GPU transfers, which is crucial for making slower memory tiers practical for training.

      In essence, ZeRO-Infinity is a holistic approach that redefines the memory paradigm for DL training, moving beyond the GPU as the sole memory bottleneck and making extreme-scale DL more accessible and practical.

4. Methodology

4.1. Principles

The core principle behind ZeRO-Infinity is to break the GPU memory wall by treating the entire memory hierarchy of a modern computing cluster (GPU, CPU, and NVMe storage) as a single, virtualized, and massively scalable memory pool for deep learning model states and activations. It achieves this by:

  1. Extreme Partitioning: Building upon ZeRO-3, it partitions all model states (parameters, gradients, and optimizer states) across data parallel processes, eliminating memory redundancies.

  2. Heterogeneous Offloading: Instead of keeping partitioned model states exclusively on GPUs, it intelligently offloads them to CPU or NVMe memory, leveraging their larger capacities.

  3. Aggressive Overlapping: It employs sophisticated communication-computation overlapping strategies to hide the latency associated with accessing slower CPU and NVMe memories, making offloading practical and efficient.

  4. Ease of Use: It automates complex memory management and data movement, abstracting away the need for model parallelism or code refactoring from data scientists.

    The theoretical basis is that by partitioning model states and distributing them across heterogeneous memory tiers, the aggregate memory capacity becomes virtually unlimited. The challenge then becomes efficiently moving the necessary data to the GPU just-in-time for computation, and ZeRO-Infinity addresses this with its bandwidth-centric partitioning and overlap-centric design. The intuition is that even if individual CPU or NVMe access is slow, by performing many such accesses in parallel and overlapping them with GPU compute, the overall training throughput can remain high.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Memory Requirements Characterization

The paper begins by characterizing the memory requirements for DL training, focusing on Transformer-based architectures with mixed precision training and the Adam optimizer. Memory is categorized into Model states and Residual states.

Model States Memory

Model states comprise optimizer states, gradients, and parameters.

  • Parameters and gradients are stored in FP16.

  • Optimizer states consist of FP32 momentum, variance, and typically an FP32 copy of the parameters for accurate updates.

  • The paper states that each parameter requires 20 bytes of memory in total.

  • For Transformer-based models, the total number of parameters primarily depends on the hidden dimension (hd) and the number of Transformer layers (nl). Most parameters come from four linear layers within each block with sizes: (hd, 3hd), (hd, hd), (hd, 4hd), and (4hd, hd).

    The total number of parameters in a Transformer-based model can be approximated as $\text{Total Parameters} \approx 12 \times nl \times hd^2$, where:

  • nl: Number of Transformer layers.

  • hd: Hidden dimension.

    The total memory (in bytes) required to store the model states is then $\text{Memory for Model States} \approx 240 \times nl \times hd^2$ bytes, i.e., 20 bytes/parameter $\times \; 12 \times nl \times hd^2$ parameters. Figure 2a (column 5 in the table below) illustrates these memory requirements, showing how quickly they grow to terabytes for models with hundreds of billions to trillions of parameters. For context, Figure 2b shows aggregate GPU, CPU, and NVMe memory available on DGX-2 systems.
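
The short sketch below (not from the paper) simply evaluates these two approximations to reproduce the model-state column of Figure 2a; the hidden size is taken as 25 x 1024 for the "25K" configuration.

```python
# Estimate Transformer parameter count and model-state memory using the
# approximations above: params ~ 12 * nl * hd^2 and 20 bytes per parameter.
def model_state_memory(nl: int, hd: int, bytes_per_param: int = 20):
    params = 12 * nl * hd ** 2
    return params, params * bytes_per_param

# Roughly the 1T-parameter configuration of Figure 2a (128 layers, hidden 25K).
params, total_bytes = model_state_memory(nl=128, hd=25 * 1024)
print(f"{params / 1e12:.2f}T parameters, {total_bytes / 2**40:.2f} TB of model states")
# -> about 1.01T parameters and ~18.3 TB, matching the 1T row of the table below.
```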

The following are the results from Figure 2(a) and 2(b) of the original paper:

Figure 2(a): Memory requirements.

| Params (Trillions) | Layers | Hidden Size | Attn Heads | Model States (TB/Model) | Act. (TB/Node) | Act. Ckpt. (TB/Node) | Model State Working Mem. per GPU (GB) | Act. Working Mem. per GPU (GB) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.10 | 80 | 10K | 128 | 1.83 | 2.03 | 0.05 | 1.95 | 1.63 |
| 0.50 | 100 | 20K | 160 | 9.16 | 3.91 | 0.12 | 6.25 | 2.50 |
| 1.01 | 128 | 25K | 256 | 18.31 | 7.13 | 0.20 | 9.77 | 3.56 |
| 10.05 | 195 | 64K | 512 | 182.81 | 24.38 | 0.76 | 64.00 | 8.00 |
| 101.47 | 315 | 160K | 1024 | 1845.70 | 88.59 | 3.08 | 400.00 | 18.00 |

Figure 2(b): Available memory and bandwidth on NVIDIA DGX-2 clusters.

| Nodes | GPUs | Aggregate GPU Mem. (TB) | Aggregate CPU Mem. (TB) | Aggregate NVMe (TB) | GPU-GPU Bandwidth (GB/s) | GPU Mem. BW per GPU (GB/s) | CPU Mem. BW per GPU (GB/s) | NVMe BW per GPU (GB/s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 | 0.032 | 1.5 | 28.0 | N/A | 600-900 | 12.0 | 12.0 |
| 1 | 16 | 0.5 | 1.5 | 28.0 | 150-300 | 600-900 | 3.0 | 1.6 |
| 4 | 64 | 2.0 | 6.0 | 112.0 | 60-100 | 600-900 | 3.0 | 1.6 |
| 16 | 256 | 8.0 | 24.0 | 448.0 | 60-100 | 600-900 | 3.0 | 1.6 |
| 64 | 1024 | 32.0 | 96.0 | 1792.0 | 60-100 | 600-900 | 3.0 | 1.6 |
| 96 | 1536 | 48.0 | 144.0 | 2688.0 | 60-100 | 600-900 | 3.0 | 1.6 |

Residual States Memory

Residual states primarily refer to activation memory.

  • Activation Checkpointing: This technique reduces memory. The memory required to store activation checkpoints is estimated as $2 \times bsz \times seq \times hd \times nl / ci$ bytes, where:
    • bsz: Batch size.
    • seq: Sequence length.
    • hd: Hidden dimension.
    • nl: Number of Transformer layers.
    • ci: Number of Transformer blocks between two activation checkpoints.
    • The term $bsz \times seq \times hd$ represents the size of the input to each Transformer block. Figure 2a (column 7) shows that even with activation checkpointing, the memory for activations can become large for multi-trillion parameter models.

Model State Working Memory (MSWM)

MSWM is the minimum GPU memory needed for forward or backward propagation of the largest single operator (e.g., a linear layer) after model states are offloaded. It's approximately the size of the parameters and gradients of that operator.

  • For a Transformer, the largest operator is a linear layer transforming hidden states from hd to 4hd.
  • The size (in bytes) of the parameters and gradients for this linear layer is approximately $4 \times hd \times 4hd$ bytes: the $hd \times 4hd$ weight matrix is stored as FP16 parameters (2 bytes per element) and FP16 gradients (2 bytes per element), so $2 \times (hd \times 4hd) + 2 \times (hd \times 4hd) = 4 \times hd \times 4hd$ bytes.
  • MSWM (Figure 2a, column 8) can require multiple gigabytes of contiguous memory, leading to out-of-memory errors. 3D parallelism uses model parallelism to split these operators. ZeRO-Infinity introduces memory-centric tiling to address this.

Activation Working Memory (AWM)

AWM is the memory needed during backward propagation for recomputing activations between two activation checkpoints.

  • If one activation checkpoint per Transformer block (ci = 1) is used, the memory (in bytes) is approximately $bsz \times seq \times ci \times (16 \times hd + 2 \times attn\_heads \times seq)$, where:
    • attn_heads: Number of attention heads.
    • This formula estimates the total activation size per Transformer block.
  • AWM (Figure 2a, column 9) also gets large beyond 10 trillion parameters.
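
The following sketch (not from the paper) evaluates the residual/working-memory estimates from this section for roughly the 1T-parameter configuration of Figure 2a; the batch sizes used are illustrative assumptions.

```python
# Evaluate the working-memory estimates for nl=128, hd=25K, 256 attention heads.
def act_ckpt_bytes(bsz, seq, hd, nl, ci=1):
    # Activation checkpoint memory: 2 * bsz * seq * hd * nl / ci bytes
    return 2 * bsz * seq * hd * nl / ci

def mswm_bytes(hd):
    # Largest linear layer (hd x 4hd): FP16 params + FP16 grads = 4 * hd * 4hd bytes
    return 4 * hd * 4 * hd

def awm_bytes(bsz, seq, hd, attn_heads, ci=1):
    # Activation working memory: bsz * seq * ci * (16*hd + 2*attn_heads*seq) bytes
    return bsz * seq * ci * (16 * hd + 2 * attn_heads * seq)

hd, nl, heads, seq = 25 * 1024, 128, 256, 1024
print(f"MSWM per GPU:  {mswm_bytes(hd) / 2**30:.2f} GB")                        # ~9.77
print(f"AWM per GPU:   {awm_bytes(4, seq, hd, heads) / 2**30:.2f} GB")          # ~3.56 (batch 4 assumed)
print(f"Act. ckpt.:    {act_ckpt_bytes(32, seq, hd, nl) / 2**40:.2f} TB/node")  # ~0.20 (node batch 32 assumed)
```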

4.2.2. Bandwidth Requirements Characterization

The paper analyzes the impact of CPU and NVMe bandwidth on training efficiency.

Efficiency Metric

The efficiency of a workload, assuming no compute-communication overlap, is defined as $efficiency = \frac{compute\_time}{compute\_time + communication\_time}$. This can be expressed in terms of arithmetic intensity (ait), data movement bandwidth (bw), and peak computational throughput (peak_tp) as $efficiency = \frac{ait \times bw}{ait \times bw + peak\_tp}$, where:

  • $compute\_time = \frac{total\_computation}{peak\_tp}$
  • $communication\_time = \frac{total\_data\_movement}{bw}$
  • $ait = \frac{total\_computation}{total\_data\_movement}$
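
A minimal sketch (not from the paper) of this efficiency model; the 70 TFLOPs achievable peak per V100 GPU is an assumed value used only for illustration.

```python
# Efficiency model from this section: efficiency = ait*bw / (ait*bw + peak_tp).
def efficiency(ait: float, bw_bytes_per_s: float, peak_tp_flops: float) -> float:
    """ait in FLOPs per byte moved, bw in bytes/s, peak_tp in FLOPs/s."""
    return ait * bw_bytes_per_s / (ait * bw_bytes_per_s + peak_tp_flops)

# Example: parameter/gradient AIT (seq * bsz, derived below) with seq=1024, bsz=1,
# a bandwidth of 70 GB/s, and an assumed 70 TFLOPs achievable peak per GPU.
print(f"{efficiency(ait=1024 * 1, bw_bytes_per_s=70e9, peak_tp_flops=70e12):.0%}")
# -> roughly 50% efficiency, in line with the analysis that follows.
```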

Quantifying AIT in DL Training

The AIT varies for different model states and activation checkpoints.

  • Total Computation per Iteration: Dominated by linear layers in the Transformer.

    • Forward propagation computation: $2 \times bsz \times seq \times params$.
    • Backward propagation computation: approximately 2x the forward propagation computation.
    • Activation checkpointing adds an additional forward computation (recomputation).
    • Total computation per iteration: $computation\_per\_iter = 2 \times 4 \times bsz \times seq \times params = 2 \times 4 \times 12 \times bsz \times seq \times nl \times hd^2$. The factor of 4 accounts for the forward pass (1x), the backward pass (2x), and the recomputation due to activation checkpointing (1x); $2 \times bsz \times seq \times params$ is the forward-pass computation and $params \approx 12 \times nl \times hd^2$.
  • AIT w.r.t. Parameters and Gradients:

    • Parameters are loaded at least twice (forward, backward), potentially three times (recomputation with activation checkpointing).
    • Gradients are stored at least once.
    • Total data movement: $4 \times params$ values per iteration, i.e., $2 \times 4 \times params$ bytes in FP16.
    • AIT with respect to parameters and gradients is $ait_{params\_grads} = seq \times bsz$, meaning that roughly $seq \times bsz$ floating-point operations are performed for every byte of parameter or gradient data moved.
  • AIT w.r.t. Optimizer States:

    • Optimizer states are read once and written once per iteration during the optimizer step.
    • Total data movement: $2 \times optimizer\_states$, which is approximately $2 \times 16 \times params$ bytes (each parameter carries about 16 bytes of optimizer state: FP32 momentum, variance, master parameter, and gradient; the FP16 parameter and gradient account for the remaining 4 of the 20 bytes per parameter).
    • AIT with respect to optimizer states is $ait_{opt\_states} = seq \times bsz / 4$. This AIT is 4x lower than for parameters and gradients, making optimizer states the most bandwidth-sensitive model state.
  • AIT w.r.t. Activation Checkpoints:

    • Activation checkpoints are saved during forward and retrieved during backward.
    • Total data movement: $2 \times total\_activation\_checkpoints\_in\_bytes$, which from the activation checkpoint formula above is $2 \times (2 \times bsz \times seq \times hd \times nl / ci)$, i.e., $4 \times nl/ci \times hd \times seq \times bsz$ bytes.
    • AIT with respect to activation checkpoints is $ait_{act\_ckpts} = 24 \times hd \times ci$. This AIT depends on the hidden dimension and checkpoint frequency.
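
Combining these AIT expressions with the efficiency model above, the sketch below (not from the paper) estimates the bandwidth each data stream needs for a target efficiency; the 70 TFLOPs per-GPU peak and the batch/hidden sizes are assumptions, so the exact thresholds differ slightly from the paper's figures.

```python
# Invert the efficiency model: bandwidth needed to reach a target efficiency
# given the arithmetic intensity (AIT) of each data stream.
def required_bw(ait: float, peak_tp_flops: float, target_eff: float) -> float:
    # From efficiency = ait*bw / (ait*bw + peak): bw = peak*eff / (ait*(1 - eff))
    return peak_tp_flops * target_eff / (ait * (1.0 - target_eff))

seq, bsz, hd, ci, peak = 1024, 2, 32 * 1024, 1, 70e12   # illustrative values

streams = {
    "params/grads":     seq * bsz,        # ait = seq * bsz
    "optimizer states": seq * bsz / 4,    # 4x lower AIT
    "activation ckpts": 24 * hd * ci,     # ait = 24 * hd * ci
}
for name, ait in streams.items():
    print(f"{name:>16}: {required_bw(ait, peak, 0.90) / 1e9:8.1f} GB/s for 90% efficiency")
```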

The following figures (Figure 3a, 3b, and 3c in the original paper) illustrate the bandwidth requirements.

Figure 3a: Efficiency vs. available bandwidth (GB/s) with respect to parameters and gradients, for per-GPU batch sizes 1 to 16.

Figure 3b: Efficiency vs. available bandwidth (GB/s) with respect to optimizer states, for per-GPU batch sizes 1 to 8.

Figure 3c: Efficiency vs. available bandwidth (GB/s) with respect to activation checkpoints, for hidden sizes 2K to 64K.

From these analyses (Figure 3 in the original paper), optimizer states have the highest bandwidth requirements (about 1.5 TB/s for 90% efficiency with batch size 2), followed by parameters and gradients (over 70 GB/s), while activation checkpoints have relatively low bandwidth requirements (under 2 GB/s for large hidden sizes).

4.2.3. ZeRO-Infinity Design Overview

The overall architecture of ZeRO-Infinity is depicted in Figure 4, showing the interaction between GPU, CPU, NVMe, and network communication.

Figure 4: A snapshot of ZeRO-Infinity training a model with two layers on four data parallel (DP) ranks, showing how parameters and gradients move between the GPUs, slow memory (CPU + NVMe), and the network. Communication for the backward pass of the first layer is depicted: partitioned parameters are moved from slow memory to the GPUs and then collected to form the full layer; after gradients are computed, they are aggregated, repartitioned, and then offloaded to slow memory. Layers are denoted with subscripts and DP ranks with superscripts; for example, P0(2) is the portion of layer 0's parameters owned by GPU(2).

Design for Unprecedented Scale

  1. Infinity Offload Engine for Model States:

    • ZeRO-Infinity builds on ZeRO-3, which partitions all model states.
    • The infinity offload engine allows offloading these partitioned model states (parameters, gradients, optimizer states) to CPU or NVMe memory, or keeping them on the GPU, based on available resources and performance considerations.
    • This enables fitting models with hundreds of trillions of parameters, as aggregate NVMe memory in a large cluster (e.g., 96 DGX-2 nodes) can reach terabytes.
  2. CPU Offload for Activations:

    • Activation checkpoints can also be offloaded to CPU memory when GPU memory is insufficient.
    • A 10 trillion parameter model's activation checkpoints (0.76 TB) can fit in a DGX-2's 1.5TB CPU memory. This helps scale activation memory requirements.
  3. Memory-centric Tiling for Working Memory:

    • This novel technique addresses large Model State Working Memory (MSWM) requirements for massive individual operators (e.g., a huge linear layer).
    • Instead of model parallelism (splitting the operator itself across GPUs), memory-centric tiling breaks down a large operator into smaller, mathematically equivalent tiles.
    • These tiles are executed sequentially. When combined with ZeRO-3, the parameters and gradients for each tile are fetched and released one at a time. This reduces the working memory proportionally to the number of tiles, allowing arbitrary operator sizes without explicit model parallelism.
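
As a rough illustration of memory-centric tiling (a conceptual sketch in plain PyTorch, not DeepSpeed's implementation), a large linear operator can be split along its output dimension into smaller tiles that are executed sequentially and concatenated, so only one tile's parameters need to be resident at a time.

```python
# Conceptual sketch of memory-centric tiling: a large Linear(hd, 4*hd) is
# replaced by `tiles` smaller Linear modules executed one after another.
import torch
import torch.nn as nn

class TiledLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, tiles: int):
        super().__init__()
        assert out_features % tiles == 0
        self.tiles = nn.ModuleList(
            nn.Linear(in_features, out_features // tiles) for _ in range(tiles)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Under ZeRO-3, each tile's (smaller) parameters can be fetched and
        # released one at a time, shrinking the working memory proportionally.
        return torch.cat([tile(x) for tile in self.tiles], dim=-1)

x = torch.randn(2, 1024)
tiled = TiledLinear(1024, 4096, tiles=4)
assert tiled(x).shape == (2, 4096)   # same output shape as nn.Linear(1024, 4096)
```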

Design for Excellent Training Efficiency

Offloading to CPU and NVMe is only viable if efficiency is maintained despite their slower bandwidth.

  1. Bandwidth-Centric Partitioning (Section 6.1):

    • Traditional ZeRO and ZeRO-Offload use a broadcast-based approach where one data parallel process owns a full parameter and broadcasts it. This is limited by a single PCIe link's bandwidth from the source memory (CPU/NVMe) to the GPU.
    • ZeRO-Infinity partitions individual parameters across all data parallel processes. When a parameter is needed, an allgather collective is used.
    • This means each data parallel process only fetches 1/dp of the parameter (where dp is the data parallel degree) from CPU/NVMe to its GPU via its own PCIe link.
    • This parallel retrieval effectively increases the CPU/NVMe-to-GPU bandwidth linearly with the dp degree, achieving virtually unlimited heterogeneous memory bandwidth on multi-node setups. For example, on a DGX-2 (16 GPUs), this scales the achievable bandwidth from about 12 GB/s over a single PCIe link to roughly 48 GB/s (CPU to GPU) or 25 GB/s (NVMe to GPU) aggregated across the node's GPUs (see the sketch after this list).
  2. Overlap Centric Design (Section 6.2):

    • To hide latencies from slow memory, ZeRO-Infinity aggressively overlaps compute with communication across all tiers: GPU-GPU, NVMe-CPU, and CPU-GPU.
    • Dynamic Prefetcher:
      • Traces forward and backward computation to build an internal map of the operator sequence.
      • Prefetches parameters needed by future operators.
      • It understands the three-step NVMe access process (nc-transfer: NVMe to CPU, cg-transfer: CPU to GPU, gg-transfer: GPU allgather).
      • For operator i, it can simultaneously invoke the nc-transfer for parameters of operator i+3, the cg-transfer for operator i+2, and the gg-transfer for operator i+1, all while operator i executes.
    • Communication and Offload Overlapping for Gradients:
      • During the backward pass, it overlaps the reduce-scatter for gradients of operator i+1 with the computation of operator i.
      • Simultaneously, it transfers the partitioned gradients produced by the reduce-scatter of operator i+2 to CPU or NVMe.
  3. Infinity Offload Engine (Section 6.3): This engine comprises specialized components:

    • DeepNVMe: A powerful C++ library for NVMe read/write operations.
      • Supports bulk read/write requests for asynchronous completion.
      • Allows explicit synchronization.
      • Enables asynchrony to overlap NVMe transfers with GPU/CPU operations.
      • Achieves near-peak sequential read/write bandwidths through aggressive parallelization of I/O requests, smart work scheduling, avoiding data copying, and memory pinning.
    • Pinned Memory Management Layer:
      • Crucial for high-performance NVMe/CPU data transfers, as source/destination tensors must reside in pinned memory (host memory that GPU can directly access without staging).
      • Manages the scarce pinned memory by reusing a small amount (tens of GBs) to offload massive model states (tens of TBs).
      • Minimizes memory fragmentation in both CPU and GPU memory.
      • Provides PyTorch tensors with pinned memory data, allowing in-place computation and direct writing to NVMe without extra copies, boosting bandwidth.
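
Below is a much-simplified sketch (not from the paper, and far less sophisticated than the infinity offload engine) of the bandwidth-centric partitioning referenced in item 1 above: each data parallel rank keeps only its 1/dp shard of a parameter in pinned CPU memory, copies just that shard to its GPU over its own PCIe link, and the full parameter is then assembled with an allgather.

```python
# Simplified sketch of bandwidth-centric parameter retrieval. Assumes an
# initialized process group (NCCL backend) with one GPU per rank.
import torch
import torch.distributed as dist

def gather_partitioned_param(cpu_shard: torch.Tensor, device: torch.device) -> torch.Tensor:
    world = dist.get_world_size()
    # Each rank copies only its own 1/world shard across PCIe, so the aggregate
    # CPU/NVMe-to-GPU bandwidth grows with the number of ranks.
    gpu_shard = cpu_shard.to(device, non_blocking=True)
    shards = [torch.empty_like(gpu_shard) for _ in range(world)]
    dist.all_gather(shards, gpu_shard)   # GPU-GPU collective (NVLink/InfiniBand)
    return torch.cat(shards)             # full flattened parameter on the GPU

# Usage sketch:
# shard = full_param_flat.chunk(dist.get_world_size())[dist.get_rank()].pin_memory()
# param = gather_partitioned_param(shard, torch.device("cuda"))
```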

Design for Ease of Use

ZeRO-Infinity aims for PyTorch-native usage without model refactoring.

  1. Automated Data Movement (Section 7.1):

    • Coordinates the movement of parameters, gradients, and optimizer states.
    • When a tensor is not active, it's partitioned and potentially offloaded.
    • Hooks: ZeRO-Infinity recursively injects hooks into PyTorch submodules.
      • Pre-forward/backward hooks: Ensure parameters are on GPU by executing allgather collectives before computation.
      • Post-forward/backward hooks: Partition and optionally offload parameters or gradients after computation.
    • The overlap-centric design ensures these allgather operations don't cause significant stalls.
  2. Auto Registration of External Parameters (Section 7.1.1):

    • Addresses cases where parameters defined in one submodule are used in another (e.g., shared embedding weights in GPT).
    • Manual API: Provided for users to register external parameters.
    • Automatic Mechanisms:
      • Intercepting Partitioned Parameter Accesses: Replaces the PyTorch hash table for tensor parameters with a subclass that overrides tensor accesses. When a partitioned parameter is accessed, it triggers a blocking allgather, registers it as external, and returns the gathered parameter.
      • Activation Introspection: Inspects activation outputs from submodule forward passes for partitioned parameters. If found, it collects and registers them as external.
  3. Automatic Model Partitioning during Initialization (Section 7.2):

    • Solves the problem of large models not fitting into a single GPU or CPU memory during initialization (before partitioning).
    • A Python ZeRO-Infinity context decorates torch.nn.Module's __init__ method.
    • This ensures that parameters allocated under each module/sub-module are immediately partitioned among data parallel processes after their initialization.
    • The full model is never replicated on a single data parallel process, allowing initialization of models (e.g., 500 billion parameters) that require terabytes of aggregate memory, without requiring a single node to have that capacity.
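
As a loose illustration of the hook-based automation described above (not DeepSpeed's actual code), the sketch below registers pre- and post-forward hooks on every submodule so that parameters are "gathered" before compute and "released" afterwards; gather_params and release_params are placeholders standing in for ZeRO-Infinity's allgather and partition-plus-offload steps.

```python
# Simplified illustration of hook-driven parameter management in PyTorch.
import torch.nn as nn

def gather_params(module: nn.Module) -> None:
    pass  # placeholder: allgather this submodule's partitioned parameters onto the GPU

def release_params(module: nn.Module) -> None:
    pass  # placeholder: re-partition the parameters and offload them to CPU/NVMe

def install_zero_style_hooks(model: nn.Module) -> None:
    for submodule in model.modules():
        # Pre-forward hook: ensure parameters are resident before computation.
        submodule.register_forward_pre_hook(lambda m, inp: gather_params(m))
        # Post-forward hook: free/offload parameters once computation is done.
        submodule.register_forward_hook(lambda m, inp, out: release_params(m))

model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))
install_zero_style_hooks(model)
```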

5. Experimental Setup

5.1. Datasets

The experiments primarily use GPT-like Transformer-based models. These models are chosen because they represent the class of large, dense models whose scaling is the focus of the paper.

  • Sequence Length: Fixed to 1024, a common value for Transformer models like GPT-2, Megatron-LM, and Turing-NLG.

  • Model Size Variation: The hidden dimension and number of layers are varied to create models with different total parameter counts, ranging from billions to tens of trillions.

  • Data Sample: The paper does not provide a concrete example of a data sample (e.g., an input text sequence) but implies typical text data used for language modeling. For a Transformer with sequence length 1024, an input sample would be a sequence of 1024 tokens (words or subword units), which are then embedded into hidden dimension vectors before being processed by the Transformer layers.

  • Suitability: Transformer models are ideal for validating ZeRO-Infinity because they are known for their massive scale, their quadratic memory consumption with sequence length (in some components), and their widespread use in SOTA DL applications, making memory efficiency a critical challenge.

    Specific model configurations used in the evaluation are detailed in Table 1 and in Appendix A (Tables 4-8), which are transcribed below.

The following are the results from Table 1 of the original paper:

| # nodes | # params | hidden dim | # layers | batch/GPU | mp | fp16 param | Opt. state |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 10B | 4K | 50 | 8 | 1 | GPU | GPU |
| 1 | 50, 100B | 8K | 62, 125 | 26, 24 | 1 | CPU | NVMe |
| 1 | 0.5, 1T | 18K, 25K | 124, 128 | 8, 7 | 1 | NVMe | NVMe |
| 32 | 0.5, 1T | 18K, 25K | 124, 128 | 7, 5 | 4 | GPU | GPU |
| 32 | 5, 10, 20T | 48K, 64K, 88K | 174, 200, 205 | 3, 2, 1.25 | 4, 4, 8 | NVMe | NVMe |

5.2. Evaluation Metrics

The paper uses the following metrics to evaluate ZeRO-Infinity's performance:

  • Model Size (Number of Parameters):
    • Conceptual Definition: Quantifies the maximum size of a neural network model, in terms of its learnable weights and biases, that can be trained by a system given specific hardware resources. This directly addresses the GPU memory wall problem.
    • Mathematical Formula: There isn't a single formula for "model size" as it is a count. For Transformer models, it is typically derived from architectural parameters: $\text{Total Parameters} \approx 12 \times nl \times hd^2$
    • Symbol Explanation:
      • nl: Number of Transformer layers.
      • hd: Hidden dimension of the model.
  • Training Throughput (TFlops/GPU or Petaflops):
    • Conceptual Definition: Measures the computational efficiency of the training process, indicating how many floating-point operations per second (flops) are performed per GPU or across the entire cluster. Higher throughput signifies faster training.
    • Mathematical Formula: $\text{Throughput (FLOPs/s)} = \frac{\text{Total FLOPs per iteration}}{\text{Time per iteration}}$. Often reported in TeraFLOPs (TFlops, $10^{12}$ FLOPs) or PetaFLOPs (PFlops, $10^{15}$ FLOPs).
    • Symbol Explanation:
      • Total FLOPs per iteration: The total number of floating-point operations required to complete one training step (forward pass, backward pass, optimizer update).
      • Time per iteration: The duration it takes to complete one training step.
  • Scalability (Super-linear, Linear, Sub-linear):
    • Conceptual Definition: Describes how efficiently the training throughput increases as more GPUs (or nodes) are added to the system.
      • Linear Scaling: Throughput increases proportionally with the number of GPUs.
      • Super-linear Scaling: Throughput increases more than proportionally with the number of GPUs, often due to increased aggregate bandwidth or memory capacity unlocking new efficiencies.
      • Sub-linear Scaling: Throughput increases less than proportionally, typically due to communication overheads or load imbalance.
    • Mathematical Formula: $\text{Scaling Efficiency} = \frac{\text{Throughput with } N \text{ GPUs}}{N \times \text{Throughput with 1 GPU}}$. For weak scaling (as used in the paper for super-linear scaling), the batch size per GPU is kept constant, so the total batch size and total problem size increase with N. The efficiency would ideally be 1 (linear); super-linear means efficiency greater than 1.
    • Symbol Explanation:
      • Throughput with N GPUs: The total throughput measured when using N GPUs.
      • N: The number of GPUs used.
      • Throughput with 1 GPU: The baseline throughput measured when using a single GPU.
  • Backward Propagation Time:
    • Conceptual Definition: Measures the time taken specifically for the backward pass of the neural network, which includes gradient computation and aggregation. This metric is used to evaluate the efficiency of gradient offloading and communication strategies.
    • Mathematical Formula: Typically measured directly as wall-clock time in seconds or milliseconds.
  • Maximum Hidden Size:
    • Conceptual Definition: The largest hidden dimension that a single layer of a Transformer model can be configured to have and still be trained, particularly in the context of memory fragmentation or limitations. This metric evaluates the effectiveness of memory-centric tiling in handling large individual operators.
    • Mathematical Formula: A direct integer value representing the dimension size.
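
The tiny sketch below (not from the paper) shows how the throughput and weak-scaling efficiency metrics above are computed; it plugs in the throughput numbers reported later in Section 6 (about 44 TFlops/GPU at 4 nodes and over 25 petaflops at 512 GPUs), using the per-GPU throughput at the smallest measured scale as the baseline rather than a literal single-GPU run.

```python
# Throughput and weak-scaling efficiency as defined above.
def throughput_tflops(total_flops_per_iter: float, iter_time_s: float) -> float:
    return total_flops_per_iter / iter_time_s / 1e12

def weak_scaling_efficiency(total_throughput: float, n_gpus: int,
                            per_gpu_baseline: float) -> float:
    # Values above 1.0 indicate super-linear scaling.
    return total_throughput / (n_gpus * per_gpu_baseline)

print(f"{throughput_tflops(1.6e16, 350.0):.0f} TFLOPs")          # placeholder FLOPs and time
print(f"{weak_scaling_efficiency(25_000.0, 512, 44.0):.2f}x of linear scaling")
# -> about 1.11x, echoing the super-linear scaling reported for the 1T model.
```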

5.3. Baselines

The paper compares ZeRO-Infinity against several state-of-the-art and foundational distributed training methods:

  • torch's Distributed Data Parallel (DDP) [42]:
    • Description: The standard data parallelism implementation in PyTorch. Each GPU holds a full replica of the model, processes a subset of the data, computes gradients, and then all-reduces these gradients to keep all model replicas synchronized.
    • Representativeness: It's the most common and simplest data parallelism baseline, representing the scenario where the model fits entirely on each GPU.
  • Megatron-LM [7]:
    • Description: An early and influential framework developed by NVIDIA for training multi-billion parameter language models. It primarily relies on model parallelism (specifically tensor-slicing for linear layers) and pipeline parallelism.
    • Representativeness: It represents the state-of-the-art in model parallelism and pipeline parallelism for Transformer models.
  • 3D Parallelism [13, 14]:
    • Description: A composite distributed training strategy that combines data parallelism, model parallelism, and pipeline parallelism. The DeepSpeed implementation [15] is highlighted. This is positioned as the immediate predecessor and primary competitor to ZeRO-Infinity for extreme-scale DL.
    • Representativeness: This is the most advanced existing baseline for training trillion-parameter models on large GPU clusters, making it the most direct comparison for ZeRO-Infinity's scalability claims.
  • ZeRO [11] (specifically ZeRO-3):
    • Description: Zero Redundancy Optimizer partitions model states to eliminate memory redundancy. ZeRO-3 partitions all model states (parameters, gradients, optimizer states).
    • Representativeness: ZeRO-Infinity builds upon ZeRO-3's partitioning strategy, so ZeRO-3 serves as a crucial baseline to demonstrate the added value of heterogeneous offloading and advanced communication strategies.
  • ZeRO-Offload [12]:
    • Description: An extension of ZeRO-2 that offloads gradients and optimizer states to CPU memory.

    • Representativeness: This is a direct baseline for evaluating the benefits of ZeRO-Infinity's more comprehensive heterogeneous memory management (including parameter offload to CPU/NVMe) and its bandwidth-centric partitioning over ZeRO-Offload's CPU offloading approach.

      These baselines collectively cover different dimensions of large model training—from simple data parallelism to complex 3D parallelism and memory-efficient ZeRO variants—allowing ZeRO-Infinity to demonstrate its superiority in terms of model scale, efficiency, and ease of use.

5.4. Hardware

The experiments were conducted on a cluster of NVIDIA V100 SXM3 32 GB GPUs.

  • GPU Model: NVIDIA V100 SXM3 32 GB (each GPU has 32 GB of memory).

  • Cluster Size: Up to 512 V100 GPUs, which corresponds to 32 NVIDIA DGX-2 nodes.

  • Inter-node Communication: 800 Gbps (Gigabits per second) bandwidth, indicating a high-performance network for multi-node distributed training.

  • Node Configuration: A single DGX-2 node contains 16 V100 GPUs and 1.5 TB of CPU memory (as mentioned in Section 5.1.2 of the paper), and often a substantial amount of NVMe storage.

    This hardware setup represents a powerful, high-end GPU cluster suitable for extreme-scale deep learning, providing a realistic and challenging environment for evaluating ZeRO-Infinity's capabilities.

6. Results & Analysis

6.1. Core Results Analysis

6.1.1. Model Size and Speed

ZeRO-Infinity demonstrates a significant leap in model scale and competitive training throughput.

  • Model Scale: ZeRO-Infinity can train models exceeding 32 trillion parameters, which is a 50x increase compared to the roughly 650 billion parameters achievable with 3D parallelism on similar hardware. This directly addresses the GPU memory wall and pushes the boundary of trainable model sizes.
  • Training Throughput:
    • For a 500 billion parameter model (a size near the upper limit of 3D parallelism), ZeRO-Infinity achieves nearly identical throughput to 3D parallelism, indicating that its offloading mechanisms do not introduce significant overhead for models manageable by existing state-of-the-art.

    • When scaling to 1 trillion, 5 trillion, 10 trillion, and 20 trillion parameter models, where 3D parallelism runs out of memory, ZeRO-Infinity continues to train these models with excellent throughput. It achieves up to 49 TFlops/GPU for the 5 trillion parameter model.

    • A performance drop is observed for the 20 trillion parameter model (down to 34 TFlops/GPU). This is attributed not to NVMe bandwidth saturation (as both 10T and 20T use NVMe offload), but to an extremely small batch size per GPU necessary for the 20T model due to limited CPU memory to store activation checkpoints. The paper suggests this could be improved by increasing CPU memory or offloading activation checkpoints to NVMe in the future.

      The following image (Figure 5(a) from the original paper) illustrates the throughput performance:

Figure 5(a): Training throughput of ZeRO-Infinity compared with 3D parallelism for models from 500 billion to 20 trillion parameters (from the paper's Figure 5: efficiency and scalability of ZeRO-Infinity for training multi-trillion parameter models).

6.1.2. Superlinear Scalability

ZeRO-Infinity exhibits super-linear scalability when training a 1 trillion parameter model, particularly from 4 nodes (64 GPUs) to 32 nodes (512 GPUs).

  • Observation: The system exceeds perfect linear scaling, meaning the throughput per GPU increases as more GPUs are added, holding the batch size per GPU constant (weak scaling).

  • Reason: This super-linear scaling is primarily due to:

    1. Leveraging Aggregate Bandwidth: As more nodes are added, the aggregate PCIe and NVMe bandwidths increase linearly. ZeRO-Infinity's bandwidth-centric partitioning effectively utilizes this increased aggregate bandwidth to accelerate the offloading of parameters and optimizer states.
    2. Increased CPU Compute: Additional nodes also bring more CPU compute power, which ZeRO-Infinity leverages for tasks like optimizer step computations.
  • Efficiency at Modest Scale: Even with just 4 nodes, ZeRO-Infinity achieves over 2.8 petaflops (44 TFlops/GPU), demonstrating that the aggregated NVMe bandwidth is sufficient for good efficiency even at a relatively modest scale (see the aggregate-bandwidth sketch after Figure 5(b)).

    The following image (Figure 5(b) from the original paper) illustrates the super-linear scalability:

Figure 5: Efficiency and scalability of ZeRO-Infinity for training multi-trillion parameter models.
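
The scaling mechanism can be sketched numerically. Under bandwidth-centric partitioning, each data-parallel rank reads only 1/N of every offloaded parameter through its own local PCIe/NVMe link before the shards are recombined with an allgather over the fast GPU interconnect, so the effective bandwidth feeding the offloaded states grows with the number of ranks. The per-node bandwidth figure below is an illustrative assumption, not a measurement from the paper.

```python
# Illustrative sketch of why per-GPU throughput improves with node count under
# bandwidth-centric partitioning. The per-node bandwidth is an assumed value.
GPUS_PER_NODE = 16
NVME_READ_GBPS_PER_NODE = 25.0   # assumed aggregate local NVMe read bandwidth per node

def effective_fetch_bandwidth(nodes: int) -> float:
    """Each of the N ranks reads a 1/N slice of a parameter in parallel from its
    local slow memory, so the effective fetch bandwidth is roughly the aggregate."""
    return nodes * NVME_READ_GBPS_PER_NODE

for nodes in (4, 8, 16, 32):
    print(f"{nodes:>2} nodes ({nodes * GPUS_PER_NODE:>3} GPUs): "
          f"effective parameter-fetch bandwidth ~ {effective_fetch_bandwidth(nodes):5.0f} GB/s")

# Under weak scaling the per-GPU work stays fixed while this aggregate bandwidth
# (and the CPU compute available for optimizer steps) grows linearly with node
# count, which is what produces the super-linear curve in Figure 5(b).
```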

6.1.3. Democratizing Large Model Training

ZeRO-Infinity significantly improves the accessibility and ease of use for large model training.

  • Accessibility: It enables training (and specifically fine-tuning) of models up to 1 trillion parameters on a single NVIDIA DGX-2 node (which has 16 GPUs). This makes models like GPT-3 (175B parameters) accessible for fine-tuning to users who do not have access to massive GPU clusters. In contrast, 3D parallelism cannot scale beyond 20 billion parameters on a single DGX-2 node.

  • Ease-of-Use: This capability is achieved without requiring model parallelism or pipeline parallelism, and without any model code refactoring. This significantly reduces the burden on data scientists, allowing them to scale their models with minimal effort; a configuration sketch after Figure 5(c) illustrates what this looks like in practice.

  • Performance: On a single DGX-2 node, ZeRO-Infinity achieves excellent performance of over 40 TFlops/GPU for models up to 100 billion parameters.

    The following image (Figure 5(c) from the original paper) illustrates the single-node performance:

Figure 5: Efficiency and scalability of ZeRO-Infinity for training multi-trillion parameter models.
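
As a concrete illustration of the ease-of-use point, the sketch below shows a minimal DeepSpeed-style ZeRO-Infinity configuration. The key names follow recent DeepSpeed releases and /local_nvme is a placeholder path; this is an assumption-laden sketch rather than the paper's exact setup, so consult the DeepSpeed documentation for the options supported by your version.

```python
# Minimal ZeRO-Infinity-style configuration sketch for DeepSpeed.
# Key names follow recent DeepSpeed releases; "/local_nvme" is a placeholder path.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,  # ZeRO-3: partition parameters, gradients, and optimizer states
        "offload_param":     {"device": "nvme", "nvme_path": "/local_nvme", "pin_memory": True},
        "offload_optimizer": {"device": "nvme", "nvme_path": "/local_nvme", "pin_memory": True},
        "overlap_comm": True,  # overlap communication with computation
    },
}

# Typical usage (sketch): the model stays plain PyTorch, with no model- or
# pipeline-parallel refactoring, and is handed to DeepSpeed with this config:
#   model_engine, optimizer, _, _ = deepspeed.initialize(
#       model=model, model_parameters=model.parameters(), config=ds_config)
```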

6.2. Data Presentation (Tables)

The following are the results from Table 1 of the original paper:

| # nodes | # params | hidden dim | # layers | batch/GPU | mp | fp16 param | Opt state |
|---|---|---|---|---|---|---|---|
| 1 | 10 B | 4K | 50 | 8 | 1 | GPU | GPU |
| 1 | 50, 100 B | 8K | 62, 125 | 26, 24 | 1 | CPU | NVMe |
| 1 | 0.5, 1 T | 18K, 25K | 124, 128 | 8, 7 | 1 | NVMe | NVMe |
| 32 | 0.5, 1 T | 18K, 25K | 124, 128 | 7, 5 | 4 | GPU | GPU |
| 32 | 5, 10, 20 T | 48K, 64K, 88K | 174, 200, 205 | 3, 2, 1.25 | 4, 4, 8 | NVMe | NVMe |

The following are the results from Table 2 of the original paper:

| Name | Optimizer + Grad (devices / partitioned) | Parameters (devices / partitioned) |
|---|---|---|
| Data parallel | [GPU] / ✗ | [GPU] / ✗ |
| ZeRO-2 | [GPU] / ✓ | [GPU] / ✗ |
| ZeRO-Offload | [CPU, GPU] / ✓ | [GPU] / ✗ |
| 3D Parallelism | [GPU] / ✓ | [GPU] / ✓ |
| ZeRO-3 | [GPU] / ✓ | [GPU] / ✓ |
| ZeRO-Inf-CPU | [CPU, GPU] / ✓ | [CPU, GPU] / ✓ |
| ZeRO-Inf-NVMe | [NVMe, CPU, GPU] / ✓ | [NVMe, CPU, GPU] / ✓ |

The following are the results from Table 4 of the original paper:

Figure 6(a)

| Model size | Number of GPUs | MP | Layers | Hidden size | Attention heads | Batch size | Total batch size |
|---|---|---|---|---|---|---|---|
| 1.4B | 16 | 1 | 40 | 1536 | 16 | 1 | 16 |
| 10B | 16 | 1 | 50 | 4096 | 16 | 1 | 16 |
| 13B | 16 | 1 | 64 | 4096 | 16 | 1 | 16 |
| 20B (ZeRO-3) | 16 | 1 | 98 | 4096 | 32 | 1 | 16 |
| 20B (3D Par.) | 16 | 4 | 98 | 4096 | 32 | 1 | 16 |
| 70B | 16 | 1 | 125 | 8192 | 32 | 1 | 16 |
| 1000B | 16 | 4 | 128 | 25600 | 256 | 5 | 20 |

The following are the results from Table 5 of the original paper:

Figure 6(b)

| Hidden size | Number of GPUs | MP | Layers | Model size | Attention heads | Batch size | Total batch size |
|---|---|---|---|---|---|---|---|
| 8192 | 16 | 1 | 1 | 900M | 16 | 1 | 16 |
| 16384 | 16 | 1 | 1 | 3B | 16 | 1 | 16 |
| 32768 | 16 | 1 | 1 | 13B | 16 | 1 | 16 |
| 65536 | 16 | 1 | 1 | 50B | 32 | 1 | 16 |

The following are the results from Table 6 of the original paper:

Figure 6(c)

| Number of GPUs | Hidden size | MP | Layers | Model size | Attention heads | Batch size | Total batch size |
|---|---|---|---|---|---|---|---|
| [4, 16, 32, 64] | 8192 | 1 | 10 | 8B | 16 | 2 | [8, 32, 64, 128] |

The following are the results from Table 7 of the original paper:

Figure 6(d)

| Batch size | Number of GPUs | Hidden size | MP | Layers | Model size | Attention heads | Total batch size |
|---|---|---|---|---|---|---|---|
| [2, 4, 8, 10, 14, 16] | 64 | 8192 | 1 | 10 | 8B | 16 | [128, 256, 512, 640, 896, 1024] |

The following are the results from Table 8 of the original paper:

Figure 6(e)

| Hidden size | Number of GPUs | Opt device | MP | Layers | Model size | Attention heads | Batch size | Total batch size |
|---|---|---|---|---|---|---|---|---|
| 2048 | 32 | CPU | 1 | 5 | 275M | 16 | 4 | 128 |
| 8192 | 32 | CPU | 1 | 5 | 4B | 16 | 4 | 128 |
| 16384 | 32 | CPU | 1 | 5 | 16B | 16 | 4 | 128 |
| 32768 | 32 | CPU | 1 | 5 | 64B | 16 | 4 | 128 |
| 65536 | 64 | NVMe | 1 | 5 | 260B | 16 | 4 | 128 |

6.3. Ablation Studies / Parameter Analysis

The paper conducts several ablation studies to demonstrate the impact of individual ZeRO-Infinity features on model scale and performance.

The following image (Figure 6 from the original paper) illustrates the impact of system features on model scale and performance:

Figure 6: Impact of system features on model scale and performance. Panels: (a) largest model size under different ZeRO strategies; (b) largest hidden size under different tiling factors; (c) backward time of ZeRO-Infinity vs. ZeRO-Offload; (d) speedup from communication overlap; (e) overhead of offloading activation checkpoints to CPU.

6.3.1. Impact of System Features on Model Scale (Figure 6a)

This experiment investigates how different device placement and partitioning strategies influence the maximum model size trainable on a single DGX-2 system (16 GPUs). Table 2 defines these strategies.

  • Data Parallelism: The baseline, limited to 1.4 billion parameters due to GPU memory and model state redundancies.
  • ZeRO-2 / ZeRO-Offload: Partitioning optimizer and gradient states and offloading them to CPU memory (ZeRO-Offload) raises the limit to 13 billion parameters, roughly a 9x increase.
  • ZeRO-Infinity-CPU (ZeRO-Inf-CPU): Additionally offloading parameters to CPU, so that all model states are partitioned and placed on [CPU, GPU] (see Table 2), enables scaling to nearly 100 billion parameters.
  • ZeRO-Infinity-NVMe (ZeRO-Inf-NVMe): The final major jump comes from offloading model states to NVMe ([NVMe, CPU, GPU] for both the optimizer/gradient and parameter states in Table 2), allowing training of models up to 1 trillion parameters.
  • Conclusion: This demonstrates a 700x increase in model size compared to data parallelism alone, highlighting the critical role of heterogeneous memory offloading (especially to NVMe) in achieving extreme scale.
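
These jumps line up with a simple capacity estimate (a sketch using the 16 bytes-per-parameter model-state accounting; NVMe capacity varies by node configuration):

```python
# Single-DGX-2 capacity check with ~16 B of model state per parameter.
BYTES_PER_PARAM = 16
GPU_MEM_BYTES = 16 * 32e9     # 16 GPUs x 32 GB
CPU_MEM_BYTES = 1.5e12        # 1.5 TB of CPU memory

print(f"GPU-only model states cap out near {GPU_MEM_BYTES / BYTES_PER_PARAM / 1e9:.0f} B params")
print(f"CPU-offloaded model states cap out near {CPU_MEM_BYTES / BYTES_PER_PARAM / 1e9:.0f} B params")
# -> roughly 32 B and 94 B parameters, matching the ~100 B ceiling of ZeRO-Inf-CPU;
# a trillion parameters needs ~16 TB of model states, which only local NVMe
# (typically tens of TB per node) can hold, hence ZeRO-Inf-NVMe's 1 T result.
```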

6.3.2. Maximum Hidden Size with Memory-centric Tiling (Figure 6b)

This study evaluates the effectiveness of memory-centric tiling (Section 5.1.3) in enabling large hidden sizes despite memory fragmentation. A single-layer Transformer model is trained on a DGX-2 (16 GPUs), with GPU memory pre-fragmented into 2 GB chunks.

  • Without Tiling: The largest hidden size trainable is 8K. Any larger size would fail due to insufficient contiguous memory for the large operator.
  • With Tiling: Using a memory-centric tiling factor of 16, ZeRO-Infinity can train a massive hidden size of 64K.
  • Conclusion: Memory-centric tiling significantly simplifies the DL system stack by allowing large operators to fit in GPU memory without resorting to model parallelism (tensor-slicing), making it easier for data scientists to use large hidden sizes.
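
A minimal PyTorch sketch of the tiling idea follows. It illustrates the concept rather than DeepSpeed's implementation (DeepSpeed ships a comparable tiled-linear utility); the class and parameter names here are chosen for clarity.

```python
import torch
import torch.nn as nn

class TiledLinearSketch(nn.Module):
    """Illustrative memory-centric tiling: one huge linear operator is replaced by
    `tiles` smaller linears applied sequentially along the output dimension, so only
    one tile's (partitioned/offloaded) weights need to be resident at a time."""

    def __init__(self, in_features: int, out_features: int, tiles: int = 16):
        super().__init__()
        assert out_features % tiles == 0, "out_features must divide evenly into tiles"
        self.tiles = nn.ModuleList(
            nn.Linear(in_features, out_features // tiles) for _ in range(tiles)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Under ZeRO-3-style partitioning, each tile's parameters can be gathered,
        # used, and released before the next tile runs, bounding peak GPU memory.
        return torch.cat([tile(x) for tile in self.tiles], dim=-1)

if __name__ == "__main__":
    layer = TiledLinearSketch(in_features=1024, out_features=4096, tiles=16)
    out = layer(torch.randn(2, 1024))
    print(out.shape)  # torch.Size([2, 4096])
```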

6.3.3. Impact of System Features on Performance

ZeRO-Infinity vs. ZeRO-Offload (Figure 6c)

This experiment compares the backward propagation time for an 8 billion parameter model between ZeRO-Infinity and ZeRO-Offload, focusing on gradient offloading to CPU memory.

  • Observation: ZeRO-Infinity achieves a speedup of nearly 2x at 64 GPUs compared to ZeRO-Offload.
  • Reason: ZeRO-Offload is limited by the PCIe bandwidth of a single GPU for gradient offloading. ZeRO-Infinity, through its bandwidth-centric partitioning (Section 6.1) and allgather-based approach, leverages the aggregate PCIe bandwidth across all GPUs to offload gradients in parallel.
  • Conclusion: This demonstrates the superior bandwidth utilization strategy of ZeRO-Infinity for offloaded states.
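
The contrast can be sketched with collective-communication primitives: ZeRO-Offload funnels the full gradient over one GPU's PCIe link, whereas a bandwidth-centric scheme reduce-scatters the gradient so every rank copies only its own shard to CPU over its own link. The helper below is a conceptual sketch (it assumes an initialized torch.distributed process group and a PyTorch version providing reduce_scatter_tensor), not DeepSpeed's actual code path.

```python
import torch
import torch.distributed as dist

def offload_gradient_shard(flat_grad: torch.Tensor, pinned_cpu_shard: torch.Tensor) -> None:
    """Sketch of bandwidth-centric gradient offload: reduce-scatter leaves each rank
    with the reduced values for its 1/world_size partition, and every rank then
    copies only that shard to pinned CPU memory, so all PCIe links work in parallel
    instead of a single link carrying the whole gradient."""
    world_size = dist.get_world_size()
    assert flat_grad.numel() % world_size == 0
    shard = torch.empty(flat_grad.numel() // world_size,
                        dtype=flat_grad.dtype, device=flat_grad.device)
    dist.reduce_scatter_tensor(shard, flat_grad, op=dist.ReduceOp.SUM)
    pinned_cpu_shard.copy_(shard, non_blocking=True)  # asynchronous device-to-host copy
```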

Prefetching and Overlapping (Figure 6d)

This study examines the effect of communication overlapping and prefetching (part of overlap-centric design, Section 6.2) on throughput for an 8 billion parameter model with 64 GPUs across varying batch sizes.

  • Observation: Prefetching and overlapping are crucial for achieving good performance at small batch sizes per GPU. As the batch size increases, their impact diminishes.
  • Reason: At small batch sizes, the arithmetic intensity is lower, and the communication time becomes a larger proportion of the total time. Prefetching and overlapping effectively hide this communication latency. At large batch sizes, the compute-to-communication ratio is higher, making the system less sensitive to communication overheads.
  • Conclusion: This validates the importance of overlap-centric design for maintaining efficiency, especially in scenarios where batch size per GPU is constrained (e.g., extremely large models where even a small batch might exhaust memory).
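
Below is a minimal sketch of the overlap idea, using a side CUDA stream to gather the next layer's parameters while the current layer computes. fetch_params is a hypothetical helper standing in for the actual allgather/NVMe read; this is not DeepSpeed's implementation.

```python
import torch

def forward_with_prefetch(layers, x, fetch_params):
    """Sketch of overlap-centric prefetching: while layer i runs on the default
    stream, the (hypothetical) fetch_params helper gathers layer i+1's parameters
    on a side stream, hiding communication behind computation."""
    prefetch_stream = torch.cuda.Stream()
    fetch_params(layers[0])  # the first layer must be fetched up front
    for i, layer in enumerate(layers):
        if i + 1 < len(layers):
            with torch.cuda.stream(prefetch_stream):
                fetch_params(layers[i + 1])       # overlapped with the compute below
        x = layer(x)                              # compute on the default stream
        torch.cuda.current_stream().wait_stream(prefetch_stream)  # params ready for next layer
    return x
```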

Activation Checkpoint Offload (Figure 6e)

This experiment measures the training throughput impact of CPU offloading of activation checkpoints in ZeRO-Infinity for different hidden sizes.

  • Observation: For small hidden sizes (e.g., 2K), CPU offloading of activation checkpoints reduces training throughput by up to 1.2x. However, for larger hidden sizes (32K and 64K), the performance impact is minimal.
  • Reason: For smaller hidden sizes, the arithmetic intensity associated with activation checkpoints might be lower, and the overhead of CPU-GPU transfer becomes more noticeable. For larger hidden sizes, the AIT increases, making the cost of recomputation and CPU offload relatively less significant compared to the overall GPU computation.
  • Conclusion: ZeRO-Infinity can offload activation checkpoints to CPU memory without significantly impacting efficiency for large hidden sizes, which is crucial for training models where even checkpointed activations exceed GPU memory.
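
A sketch of CPU activation-checkpoint offload using pinned host buffers and asynchronous copies follows; the function names are illustrative, and the real engine additionally overlaps these transfers with forward and backward compute.

```python
import torch

def checkpoint_to_cpu(activation: torch.Tensor) -> torch.Tensor:
    """Copy an activation checkpoint to pinned CPU memory with a non-blocking
    transfer so it can overlap with the ongoing forward pass."""
    host_buf = torch.empty(activation.shape, dtype=activation.dtype,
                           device="cpu", pin_memory=True)
    host_buf.copy_(activation, non_blocking=True)
    return host_buf

def checkpoint_to_gpu(host_buf: torch.Tensor, device: str = "cuda") -> torch.Tensor:
    """Bring the checkpoint back just before backward recomputation needs it."""
    return host_buf.to(device, non_blocking=True)
```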

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper presents ZeRO-Infinity, a groundbreaking heterogeneous system technology designed to overcome the GPU memory wall for extreme-scale deep learning. By intelligently leveraging GPU, CPU, and NVMe memory, ZeRO-Infinity enables the training of models with tens and even hundreds of trillions of parameters, a 50x increase over previous state-of-the-art 3D parallelism. Key innovations include the infinity offload engine for comprehensive model state management, memory-centric tiling to handle massive individual operators without model parallelism, bandwidth-centric partitioning for efficient data movement across memory tiers, and an overlap-centric design to mask communication latency. Crucially, ZeRO-Infinity achieves this unprecedented scale and efficiency without requiring model code refactoring, significantly democratizing access to large model training by allowing fine-tuning of trillion-parameter models on a single NVIDIA DGX-2 node. It demonstrates robust performance, sustaining over 25 petaflops on 512 V100 GPUs, and exhibits super-linear scalability.

7.2. Limitations & Future Work

The authors acknowledge a current limitation and project future needs:

  • Current Limitation: The performance drop observed for the 20 trillion parameter model is attributed to an extremely small batch size per GPU, which is a consequence of limited CPU memory to store activation checkpoints.

  • Suggested Future Work: The authors propose that this limitation can be addressed by increasing CPU memory or, more significantly, by offloading activation checkpoints to NVMe in future implementations of ZeRO-Infinity.

  • Future Hardware Implications: Looking ahead, the paper emphasizes that while ZeRO-Infinity effectively removes the device memory bottleneck, training models with tens or hundreds of trillions of parameters in a reasonable time will necessitate massive leaps in compute power (10x or 100x more powerful accelerators). These future devices will, in turn, require a proportional increase in device-to-device bandwidth (e.g., GPU-GPU bandwidth growing from 700 GB/s to 7000 GB/s) to remain efficient. The paper highlights that even today's technology (such as NVLink connecting GPUs to CPU memory at 40 GB/s) can meet the slow-memory bandwidth requirements for 10x faster GPUs using ZeRO-Infinity.

    The following are the results from Table 3 of the original paper:

| | V100 | 10x | 100x |
|---|---|---|---|
| Total devices | 512 | 512 | 512 |
| Achievable peak (pflops/device) | 0.07 | 0.70 | 7.00 |
| Slow memory bw requirement (GB/s per device) | 3.0 | 30.0 | 300.0 |
| Slow memory aggregate bw (TB/s) | 1.5 | 15.0 | 150.0 |
| GPU-to-GPU bw (GB/s) | 70.0 | 700.0 | 7000.0 |
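
The scaling pattern in Table 3 follows from the paper's bandwidth analysis, which models efficiency as eff = ait * bw / (ait * bw + peak) for arithmetic intensity ait, slow-memory bandwidth bw, and device peak. The short sketch below only illustrates that relationship; the ait value used is a placeholder assumption, not the paper's figure.

```python
def efficiency(ait: float, bw_gb_s: float, peak_tflops: float) -> float:
    """eff = ait*bw / (ait*bw + peak); ait in flops/byte, bw in GB/s, peak in Tflops."""
    compute_rate = peak_tflops * 1e12   # flops/s the device can sustain
    feed_rate = ait * bw_gb_s * 1e9     # flops/s the slow-memory link can feed
    return feed_rate / (feed_rate + compute_rate)

AIT = 10_000  # flops/byte -- an illustrative placeholder, not the paper's value

# Scaling peak and bandwidth together by 10x and 100x leaves efficiency unchanged,
# which is why Table 3's bandwidth requirements grow linearly with device peak.
for peak, bw in ((70.0, 3.0), (700.0, 30.0), (7000.0, 300.0)):
    print(f"peak {peak:6.1f} Tflops, bw {bw:5.1f} GB/s -> efficiency {efficiency(AIT, bw, peak):.3f}")
```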

7.3. Personal Insights & Critique

ZeRO-Infinity represents a pivotal advancement in distributed deep learning, fundamentally changing how we think about memory resources for extreme-scale models.

  • Inspiration: The most significant inspiration is the holistic approach to heterogeneous memory management. Instead of just optimizing GPU memory, it treats GPU, CPU, and NVMe as a unified, tiered memory system. This paradigm shift will likely influence future system designs, pushing for better integration and high-bandwidth pathways between these memory components. The super-linear scalability is particularly intriguing, demonstrating that intelligently designed distributed systems can yield unexpected benefits beyond simple aggregation of resources. This challenges the conventional wisdom that adding more resources inevitably leads to diminishing returns due to communication overhead.

  • Transferability: The core principles of bandwidth-centric partitioning, overlap-centric design, and memory-centric tiling are highly transferable.

    • Other DL tasks: These techniques could be adapted for other memory-intensive DL tasks beyond Transformer training, such as large graph neural networks or diffusion models.
    • General HPC: The approach to managing heterogeneous memory and overlapping I/O could inspire optimizations in broader High-Performance Computing (HPC) domains that deal with large datasets and complex computations on diverse memory architectures.
    • Cloud Computing: Cloud providers could leverage ZeRO-Infinity to offer more cost-effective large model training by intelligently using local SSDs (like NVMe) alongside GPU and CPU memory, potentially allowing users to provision fewer expensive GPU instances.
  • Potential Issues, Unverified Assumptions, or Areas for Improvement:

    • NVMe Lifespan and Reliability: While NVMe offers massive capacity, frequent read/write cycles for model states (especially optimizer states and parameters during full offloading) could impact NVMe drive lifespan due to write endurance limits. The paper doesn't discuss strategies to mitigate this, which might be critical for prolonged training runs.

    • CPU Overhead: Despite the excellent throughput, ZeRO-Infinity relies heavily on the CPU for NVMe data transfers, pinned memory management, and potentially optimizer steps. While CPU compute is leveraged, excessive CPU utilization could become a bottleneck or impact other processes running on the system, especially on less powerful CPU nodes. The "small batch size per GPU" for the 20T model due to CPU memory limitation for activations hints at CPU resources still being a potential constraint.

    • Complexity for Debugging: While ZeRO-Infinity simplifies the user experience by eliminating model refactoring, the underlying system is highly complex, involving intricate interactions across GPU, CPU, and NVMe memory, asynchronous I/O, and multiple layers of communication-computation overlap. Debugging performance issues or memory errors in such a system could be challenging for developers or advanced users trying to push its limits.

    • Generalizability to Diverse Workloads: The analysis is focused on Transformer-based models and Adam optimizer. While these are prevalent, ZeRO-Infinity's efficiency might vary for DL workloads with different arithmetic intensities, memory access patterns, or optimizer types. Further characterization for a wider range of DL architectures would strengthen its claims.

    • Cost-Benefit Analysis of NVMe: While NVMe is cheaper per GB than GPU memory, the overall system cost of equipping nodes with massive NVMe arrays, coupled with the power consumption for such I/O, might warrant a more detailed cost-benefit analysis in certain deployment scenarios.

    • Activation Offload to NVMe: The authors acknowledge that offloading activations to NVMe is a future step. This will be crucial for truly scaling models beyond 20T parameters, as CPU memory (even 1.5TB on a DGX-2) can still be a bottleneck for activation checkpoints. This will introduce another layer of I/O complexity and latency management that ZeRO-Infinity would need to address.

      Overall, ZeRO-Infinity is a testament to sophisticated system co-design, effectively pushing DL capabilities into realms previously thought impossible without prohibitively expensive hardware. Its contributions will undoubtedly shape the development of future DL systems and hardware architectures.
