
WeiPipe: Weight Pipeline Parallelism for Communication-Effective Long-Context Large Model Training


TL;DR Summary

WeiPipe is a weight pipeline parallelism method that effectively reduces communication costs in large model training by overlapping communication and computation, significantly enhancing scalability and throughput compared to existing methods.

Abstract

Training large models with long context lengths requires significant communication overhead, which becomes a bottleneck in distributed training. We propose WeiPipe, a weight pipeline parallelism method designed to reduce communication costs effectively. By dividing the model weights into pipeline stages and overlapping communication with computation, WeiPipe minimizes idle times and achieves a communication-efficient training paradigm. Experimental results demonstrate that WeiPipe significantly improves scalability and throughput in training large models with extensive context lengths compared to existing methods.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

WeiPipe: Weight Pipeline Parallelism for Communication-Effective Long-Context Large Model Training

1.2. Authors

  • Junfeng Lin: Tsinghua University, Beijing, China
  • Ziming Liu: National University of Singapore, Singapore City, Singapore
  • Yang You: National University of Singapore, Singapore City, Singapore
  • Jun Wang: CETHIK Group Co. Ltd., Hangzhou, China
  • Weihao Zhang: Lynxi Technologies, Beijing, China
  • Rong Zhao: Tsinghua University, Beijing, China

1.3. Journal/Conference

Published at PPoPP '25: The 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, March 1-5, 2025, Las Vegas, NV, USA. PPoPP is a highly reputable and influential conference in the field of parallel programming and distributed systems. Publication at PPoPP signifies significant contributions to the principles and practice of parallel computing, indicating a strong peer-review process and relevance to the research community.

1.4. Publication Year

2025

1.5. Abstract

Training large models with long context lengths requires significant communication overhead, which becomes a bottleneck in distributed training. The paper proposes WeiPipe, a weight pipeline parallelism method designed to reduce communication costs effectively. By dividing the model weights into pipeline stages and overlapping communication with computation, WeiPipe minimizes idle times and achieves a communication-efficient training paradigm. Experimental results demonstrate that WeiPipe significantly improves scalability and throughput in training large models with extensive context lengths compared to existing methods.

The paper was published on 28 February 2025 at the PPoPP '25 conference.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the significant communication overhead that acts as a bottleneck in distributed training of large models, especially those with long context lengths. As Large Language Models (LLMs) and other Transformer-based models continue to grow in size (billions to hundreds of billions of parameters) and process longer input sequences (long-context capabilities), the computational resources required for training escalate dramatically.

This problem is important because:

  1. Model Size Growth: LLMs like Llama-3.1 require an immense amount of GPU memory (e.g., 6480 GB for 405 billion parameters), making it impossible to fit them on a single device. Distributed training techniques are essential.

  2. Long-Context Training: Tasks such as multi-round conversations, summarization, and processing large codebases necessitate models that can handle extended token lengths. This leads to a substantial increase in the size of activations (intermediate outputs of neural network layers) during training.

  3. Memory Optimization Techniques: Techniques like mixed-precision training, recomputation (gradient checkpointing), and Flash Attention, while reducing peak memory usage and allowing larger microbatch sizes, inadvertently increase the communication burden in traditional pipeline parallelism (PP) because they enable larger activations and gradients to be passed between workers.

    Existing distributed training techniques, such as Data Parallelism (DP), Tensor Parallelism (TP), Pipeline Parallelism (PP), and Fully Sharded Data Parallelism (FSDP), each have limitations. Traditional PP (activation-passing PP) is favored for its reliance on peer-to-peer (P2P) communication, which requires less bandwidth than collective communication. However, the increasing size of activations and gradients of activations due to long contexts and larger microbatch sizes makes communication the new bottleneck for activation-passing PP. Specifically, the paper notes that the size of output activations ($GSH$) can easily exceed the size of weights ($12H^2$ for a Transformer layer) once the ratio $GSH / (12H^2)$ exceeds 1, making activation-passing less efficient.

The paper's entry point is this observation: if the communication cost of activations and their gradients (in traditional PP) becomes higher than the communication cost of weights and their gradients, then a weight-passing approach might be more efficient. This innovative idea forms the basis of WeiPipe.

2.2. Main Contributions / Findings

The primary contributions of this paper are:

  1. Introduction of WeiPipe (Weight Pipeline Parallelism): The paper proposes a novel distributed training technique that shifts from an activation-passing pipeline (where activations and their gradients are passed between stages) to a weight-passing pipeline (where weights and their gradients are passed). This fundamentally rethinks how data is communicated in pipeline parallelism to address the communication bottleneck in long-context training.

  2. WeiPipe-Naive and WeiPipe-Interleave: The paper first introduces the basic concept of WeiPipe-Naive and then proposes WeiPipe-Interleave, an enhanced strategy that interleaves forward and backward passes. This improvement significantly reduces the pipeline bubble ratio (idle time) and halves communication requirements compared to the naive approach, making it more practical and efficient.

  3. Exploration of WeiPipe-zero-bubble strategies: The authors investigate the potential of integrating WeiPipe with zero-bubble parallelism (a technique that aims to eliminate idle times in pipelines). They discuss two conceptual variations, WZB1 and WZB2, demonstrating that a weight-passing pipeline can achieve near-zero bubble ratios, although with potential trade-offs in memory or communication.

  4. Scalable Implementation and Experimental Validation: The paper implements WeiPipe-Interleave from scratch in PyTorch, incorporating standard optimizations like mixed precision training, communication overlap, recomputation (gradient checkpointing), and Flash Attention.

  5. Superior Performance and Scalability: Experimental results demonstrate that WeiPipe-Interleave significantly improves throughput (by approximately 30%-80%) compared to state-of-the-art pipeline parallelism (1F1B, ZB1, ZB2) and Fully Sharded Data Parallelism (FSDP). This holds true across various model configurations, including large-context LLM training, and different underlying infrastructures (e.g., NVLink within clusters, Ethernet between clusters, PCIe within clusters). The strategy also shows greater weak and strong scalability in communication-constrained scenarios.

    The key conclusions and findings indicate that WeiPipe effectively addresses the communication bottleneck in long-context large model training. By fundamentally changing the communication paradigm from activations to weights, it achieves higher efficiency and scalability, especially in environments with less robust network connections. This approach reduces the reliance on expensive high-bandwidth communication infrastructure, opening up new possibilities for training very large models with extensive contexts.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand WeiPipe, a reader should be familiar with the basic concepts of neural network training and distributed training paradigms.

  • Large Language Models (LLMs): These are neural networks, typically based on the Transformer architecture, that are trained on vast amounts of text data to understand, generate, and process human language. Their "large" aspect refers to billions or even hundreds of billions of parameters (weights and biases) that define the model.
  • Transformer Architecture: A neural network architecture introduced in 2017 by Vaswani et al. in "Attention Is All You Need." It relies heavily on self-attention mechanisms to process sequential data, making it highly effective for tasks like natural language processing. Unlike recurrent neural networks, Transformers process input sequences in parallel, leading to faster training.
  • Context Length / Sequence Length: In LLMs, this refers to the number of tokens (words or sub-word units) that the model can process or attend to simultaneously. "Long context" implies handling thousands or even tens of thousands of tokens, which is crucial for tasks like summarizing long documents or multi-round conversations.
  • Distributed Training: The process of training a single neural network model across multiple computing devices (e.g., GPUs, CPUs) to overcome memory and computational limitations of a single device. This typically involves partitioning the model or data and coordinating computations and data exchange between devices.
  • Parameters (Weights and Biases): The learnable values within a neural network that are adjusted during training to minimize the difference between the model's predictions and the actual target values.
  • Activations: The output of a neuron or a layer in a neural network after applying an activation function. These are intermediate values that are passed from one layer to the next during the forward pass.
  • Gradients: During the backward pass (backpropagation), gradients are calculated. A gradient indicates the direction and magnitude of the change needed for a parameter (weight or bias) to reduce the model's error. Gradients of activations (also called error signals or deltas) are passed backward through the network to compute gradients of weights.
  • Microbatch: In pipeline parallelism, a large batch of data is often divided into smaller units called microbatches. This allows for more granular scheduling of computations and communication, helping to keep the pipeline full and reduce idle times.
  • Forward Pass: The process where input data is fed through the neural network, layer by layer, to produce an output prediction. Activations are generated during this pass.
  • Backward Pass: The process of computing and propagating gradients backward through the neural network, from the output layer to the input layer, to update the model's parameters.
  • Communication Overhead: The time and resources spent on transferring data between different devices in a distributed training system. This includes the latency (delay) and bandwidth (data transfer rate) of the network.
  • Pipeline Bubble / Idle Time: In pipeline parallelism, a bubble refers to periods when a GPU or a stage in the pipeline is idle, waiting for data from a previous stage or for a subsequent stage to become available. Minimizing these bubbles is crucial for efficiency.
  • Mixed-Precision Training: A technique that uses different numerical precisions (e.g., FP16 or BF16 for activations and weights, FP32 for optimizer states) during training. This reduces memory usage and can speed up computation on hardware with specialized FP16/BF16 cores (like Tensor Cores on NVIDIA GPUs), while maintaining sufficient numerical stability.
  • Recomputation (Gradient Checkpointing): A memory optimization technique where activations from certain layers are not stored during the forward pass but are recomputed during the backward pass when needed. This reduces peak memory consumption at the cost of increased computation.
  • Flash Attention: An optimized attention mechanism for Transformers that reduces memory usage and speeds up attention computation by leveraging fast on-chip memory (SRAM) and reducing redundant data transfers between high-bandwidth memory (HBM) and SRAM.
  • Peer-to-Peer (P2P) Communication: Direct data transfer between two specific devices. This is generally more efficient and has lower overhead than collective communication for small transfers.
  • Collective Communication: Operations that involve multiple devices simultaneously, such as all-reduce (summing or averaging data across all devices), all-gather (collecting data from all devices onto all devices), or reduce-scatter (reducing data and distributing chunks to different devices). These operations often require higher bandwidth and can be more prone to bottlenecks.
  • NVLink / PCIe / Ethernet: Different hardware technologies for inter-device communication:
    • NVLink: A high-speed interconnect developed by NVIDIA for direct GPU-to-GPU communication within a server or between tightly coupled servers. Offers very high bandwidth.
    • PCIe (Peripheral Component Interconnect Express): A standard interface for connecting high-speed components within a computer, including GPUs. Slower than NVLink for direct GPU-to-GPU but commonly used for host-to-GPU.
    • Ethernet: A widely used networking technology for connecting computers over a local area network (LAN) or wider area networks. Bandwidth can vary greatly (e.g., 1Gb, 10Gb, 100Gb), but it's generally slower and has higher latency than NVLink or PCIe for inter-GPU communication within a cluster.
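
For readers less familiar with these terms, the contrast between the two communication styles can be sketched with a few torch.distributed calls (a minimal illustration assuming an already-initialized process group; this is not code from the paper):

```python
import torch
import torch.distributed as dist

# Peer-to-peer (P2P): one explicit sender and one explicit receiver.
def p2p_example(t: torch.Tensor, rank: int):
    if rank == 0:
        dist.send(t, dst=1)        # only rank 0 and rank 1 are involved
    elif rank == 1:
        dist.recv(t, src=0)

# Collective: every rank in the group participates in the same operation.
def collective_example(t: torch.Tensor):
    dist.all_reduce(t, op=dist.ReduceOp.SUM)   # sums t across all ranks
```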

3.2. Previous Works

The paper discusses several existing distributed training techniques, highlighting their strengths and weaknesses, especially in the context of long-context LLM training.

3.2.1. Parallelism Techniques for Training

  • Data Parallelism (DP):
    • Concept: Each device gets a copy of the entire model, but only a portion of the training data batch. All devices compute gradients on their data subset, and then these gradients are all-reduced (e.g., averaged) across all devices to keep the model weights synchronized.
    • Pros: Easy to implement, good for scaling computation when communication is not a bottleneck.
    • Cons: Each device must store the entire model, limiting the maximum model size. All-reduce operations can be communication-heavy, especially with large models or slow networks.
  • Fully Sharded Data Parallelism (FSDP) / ZeRO-3:
    • Concept: An advanced form of DP where not just data, but also model weights, gradients, and optimizer states are sharded (divided) across devices. Weights and gradients are only gathered onto a device when needed for computation (all-gather for forward, reduce-scatter for backward) and then immediately discarded.
    • Pros: Highly memory efficient, allowing much larger models than traditional DP. Incorporates asynchronous communication overlap.
    • Cons: Requires substantial collective communication (all-gather, reduce-scatter) during both forward and backward passes, which can be a bottleneck in scaled scenarios or with poor network connections.
  • Tensor Parallelism (TP):
    • Concept: Splits individual matrix operations (tensors) within a layer across multiple workers. For example, a large matrix multiplication can be split, with each worker computing a part of the output.
    • Pros: Allows for training models that are too large to fit on a single device even if pipeline parallelism is not feasible.
    • Cons: Requires frequent and fine-grained collective communication to synchronize intermediate results within a layer, which can be very bandwidth-intensive.
  • Pipeline Parallelism (PP):
    • Concept: Partitions the model at the layer level, assigning different layers (or groups of layers, called stages) to different devices. The input batch is divided into microbatches, which flow sequentially through the pipeline stages.
    • Pros: Reduces memory requirements per worker (each worker only stores its assigned layers). Relies primarily on P2P communication (passing activations from one stage to the next), which can be more bandwidth-efficient than collective communication for large models.
    • Cons: Inherently suffers from pipeline bubbles (idle times) due to the sequential nature of computation across stages. The primary focus of traditional PP research is to minimize these bubbles. Communication of activations and their gradients can become a bottleneck with long contexts and large microbatch sizes.
      • GPipe [15]: A classic PP technique where all microbatches complete their forward pass before any backward pass begins.
      • Dapple [12] (1F1B): Improves GPipe by initiating the backward pass for a microbatch immediately after its forward pass finishes. This reduces peak memory and bubble ratios. It's a common baseline.
      • Chimera [22]: Further optimizes by combining multiple pipelines in different directions to reduce bubble ratio.
      • Hanayo [26]: Builds on Chimera by using a wave-shaped pipeline to decouple it from model duplication, improving efficiency.
      • Zero-bubble PP [32] (ZB1, ZB2): Computes gradients for weights and activations separately (B pass for activations, W pass for weights) to achieve almost zero-bubble configurations. Aims to minimize peak memory or bubble ratio.
  • Sequence Parallelism (SP):
    • Concept: Divides activations along the sequence dimension (e.g., token dimension) across different devices. Each device processes a part of the input sequence.
    • Pros: Specifically designed for long sequences, reducing memory per device for activations.
    • Cons: Requires specific handling for attention mechanisms and can introduce communication overhead for attention calculations across devices.

3.2.2. Core Formula (Self-Attention in Transformer)

While not directly modified by WeiPipe, the Transformer architecture's self-attention mechanism is fundamental to LLMs and is the source of the activations that are passed in traditional PP. Understanding it helps contextualize the activation size.

The Attention mechanism, specifically Scaled Dot-Product Attention, is defined as: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$ Where:

  • $Q$ (Query), $K$ (Key), $V$ (Value) are matrices representing the input sequence after linear transformations. Their rows correspond to tokens in the sequence, and their columns correspond to the head dimension $d_k$.

  • $QK^T$ calculates the dot-product similarity between queries and keys for all token pairs.

  • $\sqrt{d_k}$ is a scaling factor to prevent the dot products from becoming too large, which could push the softmax function into regions with very small gradients.

  • $\mathrm{softmax}$ normalizes the scores to obtain attention weights.

  • $V$ is then weighted by these attention weights to produce the output.

    The size of these Q, K, V matrices, and consequently the activations passing between Transformer layers, heavily depends on the sequence length (S) and hidden dimension (H). This is why long-context models (large S) lead to large activations, increasing communication burden in activation-passing PP.
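
For reference, a minimal PyTorch sketch of this computation (the shapes are only illustrative and chosen to show how the inputs and outputs grow with the sequence length $S$; this is not code from the paper):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (microbatch G, heads, sequence length S, head dim d_k)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (G, heads, S, S)
    weights = torch.softmax(scores, dim=-1)
    return weights @ V                                  # (G, heads, S, d_k)

# Illustrative long-context shapes: G=4, 32 heads, S=16384, d_k=128 (H=4096).
# The layer's output activation then has G*S*H elements, growing linearly with S.
```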

3.3. Technological Evolution

Distributed training techniques have evolved primarily to cope with the ever-increasing size of neural networks and datasets.

  1. Early DP (1980s-1990s): Basic data parallelism was one of the first methods to scale training.

  2. Model Parallelism (e.g., early TP, PP): As models grew too large for a single device, researchers explored partitioning the model itself. Early PP (like GPipe) emerged to distribute layers.

  3. Optimization for PP (2010s-Present): The focus shifted to reducing pipeline bubbles and improving throughput (e.g., 1F1B, Chimera, Hanayo, zero-bubble PP).

  4. Memory-Efficient DP (e.g., ZeRO, FSDP, 2020s): To scale DP to even larger models, techniques that shard weights, gradients, and optimizer states became critical.

  5. Long-Context Specific Optimizations (2020s-Present): With the rise of LLMs capable of processing very long sequences, Sequence Parallelism and memory-optimized attention (e.g., Flash Attention) became essential.

    WeiPipe fits into this evolution by addressing a new bottleneck emerging from the combination of long-context models and advanced memory optimization techniques. While prior PP focused on bubble reduction, WeiPipe tackles the fundamental communication cost by changing the nature of what is communicated, specifically in the context where activations become excessively large.

3.4. Differentiation Analysis

Compared to the main methods in related work, WeiPipe offers several core differences and innovations:

  • vs. Traditional Activation-Passing PP (e.g., GPipe, 1F1B, Chimera, Hanayo, zero-bubble PP):

    • Core Difference: WeiPipe passes weights and their gradients between pipeline stages, whereas traditional PP passes activations and their gradients.
    • Innovation: This shift is motivated by the observation that for long-context models and large microbatch sizes (enabled by memory optimizations), the size of activations ($GSH$) can exceed the size of weights ($12H^2$). By passing weights, WeiPipe's communication volume becomes independent of microbatch size ($G$) and sequence length ($S$), making it significantly more communication-efficient in long-context scenarios. Traditional PP communication scales with $G$ and $S$.
    • Communication Type: Both rely on P2P communication, which is inherently scalable. However, WeiPipe's P2P communication volume is often much smaller.
  • vs. FSDP / ZeRO-3:

    • Core Difference: FSDP shards weights, gradients, and optimizer states and uses collective communication (all-gather, reduce-scatter) to bring necessary data to each GPU for computation. WeiPipe uses P2P communication to circulate weights and their gradients in a pipeline fashion.
    • Innovation: WeiPipe avoids collective communication primitives, which can be a major bottleneck in clusters with less robust network connections (e.g., Ethernet between nodes). FSDP's performance is heavily dependent on high-bandwidth collective communication. WeiPipe aims for scalability even in communication-constrained environments.
    • Memory Footprint: FSDP is highly memory efficient by sharding all states. WeiPipe also aims for memory efficiency by distributing weights and optimizer states and by using recomputation for activations, but its peak memory characteristics might differ from FSDP depending on implementation details and microbatch sizes.
  • vs. Tensor Parallelism (TP):

    • Core Difference: TP splits individual matrix operations within a layer, requiring frequent collective communication within a layer. WeiPipe partitions the model at the layer level and uses P2P communication for weights.

    • Innovation: WeiPipe avoids the very fine-grained, high-bandwidth collective communication of TP, which can be prohibitive for very large models or slower interconnects. TP is typically combined with PP or DP.

      In summary, WeiPipe innovates by changing the fundamental currency of pipeline communication from activations to weights. This makes it uniquely suited for long-context LLM training where activation sizes explode, and it offers better scalability in communication-constrained environments by avoiding collective communication and reducing P2P communication volume compared to traditional PP.

4. Methodology

4.1. Principles

The core idea behind WeiPipe is to transition from an activation-passing pipeline to a weight-passing pipeline. The theoretical basis or intuition stems from the observation that as Large Language Models (LLMs) grow in context length ($S$) and are trained with larger microbatch sizes ($G$) (often enabled by memory optimization techniques), the size of activations ($A$) and their gradients ($B$) transmitted between pipeline stages can become significantly larger than the weights ($W$) and their gradients ($D$) for a given layer.

For a Transformer layer, the output activation size is approximately $G \times S \times H$ (where $H$ is the hidden dimension size). The weight size for one layer in a model like Llama2 is about $12H^2$ (e.g., $4H^2$ for attention, $8H^2$ for the FFN). If the ratio of output activation size to weight size, $GSH / (12H^2) = GS / (12H)$, exceeds 1, then passing weights could be more communication-efficient than passing activations. This is particularly relevant for long-context LLMs where $S$ is very large.
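
A tiny sketch of this crossover check (it simply restates the $GS/(12H)$ ratio above in Python; the example configuration is illustrative):

```python
def activation_vs_weight_ratio(G, S, H):
    # Output activation of one layer: G*S*H elements.
    # Llama-style layer weights: ~12*H^2 elements.
    # A ratio above 1 means the activation is larger than the layer's weights.
    return (G * S * H) / (12 * H * H)

# Example long-context setting: micro-batch size 4, sequence length 16384, H=4096.
print(f"{activation_vs_weight_ratio(4, 16384, 4096):.2f}")  # ~1.33 > 1
```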

By passing weights and their gradients, WeiPipe aims to:

  1. Reduce Communication Volume: The size of weights for a layer is constant, independent of the microbatch size ($G$) or sequence length ($S$). This makes communication volume predictable and potentially much smaller for long-context scenarios.
  2. Overlap Communication with Computation: By carefully orchestrating the circulation of weights and gradients among workers, WeiPipe seeks to hide communication latency behind computation.
  3. Balance Resource Utilization: Distribute computational and memory workloads more evenly across workers and over time.
  4. Leverage P2P Communication: Like traditional PP, WeiPipe relies on peer-to-peer (P2P) communication, which is known for its scalability compared to collective communication.

4.2. Core Methodology In-depth (Layer by Layer)

WeiPipe utilizes a ring topology for its pipeline, meaning workers are arranged in a circle, passing data to their neighbor. The model layers ($L$) are initially distributed evenly across $P$ workers. For simplicity, assume $L=P$, so each worker initially holds the weights for one layer.

4.2.1. WeiPipe-Naive

The WeiPipe-Naive strategy introduces the fundamental concept of weight-passing in a pipeline.

  1. Initialization: Each worker $p$ initially holds the weights $W_p$ for its assigned layer $p$.

  2. Forward Pass:

    • The microbatch input starts at worker 0.
    • Worker 0, holding $W_0$, performs the forward computation for the first layer.
    • Simultaneously, worker 0 passes $W_0$ to worker 1 and receives $W_1$ from worker $P-1$ (if using a ring).
    • After computing with $W_0$, worker 0 discards $W_0$ but retains the activations $A_0$ of the first layer in its memory. It then starts forwarding the second layer using $W_1$ (which it just received).
    • This process continues: weights circulate counter-clockwise among workers. Each worker receives weights from its predecessor, performs computation, stores activations, and passes the weights to its successor.
    • During this phase, activations $A_j$ for each layer $j$ are accumulated in the memory of the worker that computed that layer for the current microbatch.
    • Figure 1 (from $t=0$ to $t=3$ for $P=4$) illustrates this. Worker $i$ (0 to 3) initially has $W_i$. As the circle rotates counter-clockwise (representing time steps), each worker processes the weights it receives. For example, at $t=0$, worker 0 uses $W_0$. At $t=1$, worker 0 receives $W_1$ and processes it, while worker 1 receives $W_0$ and processes it. This continues until worker 0 has processed all layers (using $W_0, W_1, W_2, W_3$) and completed its full forward pass for a microbatch. A minimal code sketch of this circulation appears at the end of this subsection.
  3. Backward Pass:

    • The backward pass requires weights in reverse order. To maintain the weight-pipelining, gradients of weights ($D$s) are also circulated.
    • When a worker finishes its forward pass for all layers, it starts its backward pass. The weights for backward pass flow in the opposite direction (conceptually, or are pre-arranged).
    • In WeiPipe-Naive, to handle the fact that some workers might still be in forward pass while others are in backward pass, both forward and backward weights must be transferred simultaneously. This is visualized in Figure 1 by having $W_0$ to $W_3$ (for forward) and $W_3$ to $W_0$ (for backward) on opposite sides of the circular representation.
    • As weights for the backward pass circulate, workers compute gradients of weights ($D_j$) for their respective layers.
    • Figure 1 (from $t=4$ to $t=8$) shows worker 0 performing the backward pass. At $t=4$, worker 0 processes layer 3 with $W_3$ (received) and generates $D_3$. This $D_3$ then travels with $W_3$ around the circle.
  4. Update Pass:

    • Each worker generates only a partial gradient of weights ($D_j$) for the microbatch it processed. To get the full gradient for a layer's weights, all $D_j$ from different microbatches that processed that layer must be aggregated.

    • Instead of using all-reduce (like in DP), WeiPipe circulates $D$s through the ring. When a worker receives an existing $D_j$ and generates a new one, it averages, sums, or normalizes them. This keeps the communication volume of gradients consistent.

    • After all microbatches are processed, each worker updates the weights ($W$s) and gradients ($D$s) it holds using the specified optimizer. Since each worker is responsible for updating a specific layer's weights, it also stores the corresponding optimizer state for that layer, which doesn't need to be transmitted.

      Flaws of WeiPipe-Naive:

  • Redundant Transmission: Two weight-flows circulate simultaneously (one for forward, one for backward), but only one is used at a time for computation, increasing communication costs.
  • High Bubble Ratio: The backward pass of a layer takes approximately twice as long as the forward pass. When one worker enters the backward pass, it creates significant idle time (pipeline bubbles) for other workers still engaged in the forward pass.
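
A minimal, self-contained sketch of the forward-phase circulation described above (a simplification, not the paper's code: it assumes one layer per worker, a hypothetical layer_forward helper, and an initial chunk placement arranged so that chunks arrive at each worker in layer order, roughly as in Figure 1):

```python
import torch
import torch.distributed as dist

def weipipe_naive_forward(x, w_start, layer_forward, rank, world_size):
    # Every worker keeps its own microbatch resident and runs all layers
    # locally; the per-layer weight chunks rotate around the ring so that a
    # worker always computes with the chunk it has just received.
    downstream = (rank + 1) % world_size
    upstream = (rank - 1) % world_size
    w_cur = w_start                  # the layer chunk this worker starts with
    acts = [x]
    for _ in range(world_size):      # one full rotation = one forward pass
        send_req = dist.isend(w_cur, downstream)      # pass the chunk along...
        acts.append(layer_forward(acts[-1], w_cur))   # ...while computing with it
        w_next = torch.empty_like(w_cur)
        dist.recv(w_next, upstream)  # pull the next layer's chunk
        send_req.wait()
        w_cur = w_next
    return acts  # activations stay local for this worker's backward pass
```

The real implementation batches and prefetches these transfers asynchronously (see Section 4.3), and additionally circulates the backward-pass weights and the weight gradients $D$.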

4.2.2. WeiPipe-Interleave for Lower Bubble Ratio

WeiPipe-Interleave addresses the flaws of WeiPipe-Naive by interleaving forward and backward passes more efficiently.

  1. Core Idea: Utilize the weights at the "diagonal positions" of the circular representation (i.e., the weights that would be used by another worker in the Naive scheme) for a different microbatch's forward or backward computation. This allows for simultaneous forward and backward operations on a single worker for different microbatches.

  2. Initial Forward Pass: The initial forward pass for the first microbatch is similar to WeiPipe-Naive.

  3. Interleaved Forward-Backward Pass:

    • Once a worker completes its forward pass for the first microbatch and is ready to start its backward pass, it also simultaneously begins the forward pass for a new microbatch.
    • Figure 2 illustrates this from $t=4$ to $t=11$. At $t=4$, worker 0 begins its backward pass for microbatch 0. Crucially, it also starts the forward pass for microbatch 4 using $W_0$.
    • In this interleave stage, a worker executes one backward pass and one forward pass before the weight circle makes its next turn. This means worker 0 is computing $B_3^0$ (gradient of the activation of layer 3, microbatch 0) and $A_0^4$ (activation of layer 0, microbatch 4) concurrently.
    • As the process continues, other workers also enter this forward-backward interleave stage, balancing computation workloads.
    • This strategy ensures that there are virtually no pipeline bubbles during the forward-backward interleave stage because workers are continuously active with either forward or backward computations.
    • Workers can dynamically decide their execution order for forward and backward based on data transmission availability.
  4. Memory and Communication Efficiency:

    • WeiPipe-Interleave utilizes idle memory during forward and backward passes to store activation values generated by new microbatches, leading to more balanced memory utilization.
    • The communication overhead is reduced. For a Llama-style model with $12H^2$ parameters per layer, during the forward-backward interleave stage, each worker receives two layers of weights ($W$) and one layer of gradients of weights ($D$) per turn. This results in a communication volume of $3 \times 12H^2 = 36H^2$ (assuming $W$ and $D$ are of similar size in number of elements), effectively doubling the compute/communicate ratio compared to WeiPipe-Naive; a small numeric illustration follows below.
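
To put a number on the per-turn volume (a small illustration using the $12H^2$-per-layer estimate above and 2-byte fp16 elements; the configurations are just examples):

```python
def interleave_turn_mib(H, layers_per_chunk=1, bytes_per_elem=2):
    # Two weight chunks + one weight-gradient chunk per turn,
    # ~12*H^2 parameters per Llama-style layer.
    elems = 3 * 12 * H * H * layers_per_chunk
    return elems * bytes_per_elem / 2**20

for H in (1024, 2048, 4096):
    print(H, f"{interleave_turn_mib(H):.0f} MiB per turn")
# ~72, ~288, and ~1152 MiB per turn with fp16 chunks -- independent of both
# the micro-batch size G and the sequence length S.
```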

4.2.3. WeiPipe-zero-bubble

To push the boundaries of WeiPipe, the paper explores WeiPipe-zero-bubble strategies by integrating with the concept of zero-bubble pipeline parallelism. Zero-bubble PP typically splits the backward pass into two distinct phases:

  • B pass: Computes gradients for activations.
  • W pass: Computes gradients for weights. This decoupling allows for more flexible scheduling and filling of pipeline gaps.

4.2.3.1. WeiPipe-zero-bubble 1 (WZB1)

WZB1 slightly reduces the bubble ratio compared to WeiPipe-Interleave with relatively low storage and communication overhead.

  1. Procedure:

    • Figure 3 illustrates WZB1 from $t=4$.
    • At $t=4$, worker 0 performs two tasks concurrently:
      • Forward pass: For layer 0 of a new microbatch (e.g., microbatch 4) using $W_0$, generating activations $A_0^4$.
      • B pass: For layer 3 of an older microbatch (e.g., microbatch 0) using $W_3$, producing the activation gradient $B_3^0$.
    • Worker 0 retains a portion of $A_3^0$ (activations for layer 3, microbatch 0) to support a future W pass on $W_3$.
    • Unlike WeiPipe-Interleave, weights for the backward pass (e.g., $W_1$) are placed together with weights for the forward pass (e.g., $W_3$) in the circular flow.
    • At $t=5$, worker 0 performs the W pass, consuming $A_3^0$ and $B_3^0$ to generate $D_3$ (gradient of weights for layer 3, microbatch 0), which is then sent to worker 1. Simultaneously, worker 0 also conducts the forward pass for microbatch 4 with $W_1$.
    • This pattern continues with alternating "one-forward-one-B" and "one-forward-one-W" operations until the forward pass of microbatch 4 is completed.
    • At $t=8$, worker 0 performs two B passes: one for microbatch 0 with $W_1$ and one for microbatch 4 with $W_3$.
    • From $t=8$ to $t=11$, worker 0 alternates between two B passes and two W passes until $t=12$, when all passes for microbatch 0 are completed, and a new forward pass for microbatch 8 begins.
  2. Data Arrangement:

    • The weights $W_0$ to $W_3$ for the forward pass are arranged similarly to previous strategies.
    • The weights for the B pass and the gradients generated by the W pass ($D$s) are placed in pairs. For $L$ layers, $W_{L-1}$ and $W_{\frac{L}{2}-1}$ are paired, $W_{L-2}$ and $W_{\frac{L}{2}-2}$ are paired, and so forth. The same pairing applies to the $D$s.
    • This ensures each worker performs two-chunk operations while transmitting three chunks of data to the next worker within one turn.
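
Spelled out for a hypothetical $L = 8$, just enumerating the pairing rule above (illustrative, not from the paper):

```python
def wzb1_pairs(L):
    # Pair W_{L-1} with W_{L/2-1}, W_{L-2} with W_{L/2-2}, and so on;
    # the gradients D follow the same pairing.
    return [(L - 1 - k, L // 2 - 1 - k) for k in range(L // 2)]

print(wzb1_pairs(8))  # [(7, 3), (6, 2), (5, 1), (4, 0)]
```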

4.2.3.2. WeiPipe-zero-bubble 2 (WZB2)

WZB2 aims for a nearly zero-bubble configuration with a simpler procedure but potentially higher communication and storage costs.

  1. Procedure:

    • Figure 4 illustrates WZB2.
    • The arrangement of $W$s and $D$s is the same as WeiPipe-Interleave.
    • In WZB2, the forward, B, and W passes for all layers are executed sequentially within a worker for a given microbatch.
    • Crucially, the W pass progresses from layer 0 to layer 3, matching the forward order.
    • During the B pass, old versions of weights can be discarded to reduce transmission volume (indicated by blanks from $t=8$ to $t=11$ in Figure 4).
    • The last worker (worker 3 in Figure 4) aggregates all gradients of weights ($D$s) and updates the weights.
    • At $t=11$, worker 3 holds the aggregated $D_0$ and updates $W_0$ using the optimizer.
    • Worker 3 then sends the updated $W_0$, which initiates a new forward pass at $t=12$. This seamless handover allows for frequent weight updates with fewer microbatches and achieves almost zero bubble.
  2. Trade-offs: WZB2 incurs higher communication and storage costs because it performs one chunk operation while transmitting two chunks of data to the next worker.

4.2.4. Theoretical Analysis of Communication and Memory

Table 1. Meaning of the symbols used in this paper.

Symbol Meaning
$N$ The number of micro-batches in an iteration
Iter The number of iterations
$G$ Micro-batch size
$P$ The number of workers
$L$ The number of layers in the neural network
$A_j^i$ Activation values of the $j$th layer in the $i$th micro-batch
$B_j^i$ Gradients of $A_j^i$
$W_j$ Weights of the $j$th layer
$D_j$ Gradients of $W_j$
$M_A / M_B / M_W / M_D$ Memory consumption of $A$, $B$, $W$, or $D$
$T_F / T_B / T_W$ Time cost for a complete forward pass, backward pass, or weight (W) pass, respectively
TBW Total bandwidth usage

The paper provides a theoretical comparison of bubble ratio, communication efficiency, and memory consumption.

Bubble Ratio (Pipeline Efficiency)

  • 1F1B and WeiPipe-Interleave: Have similar bubble ratios, as both aim for forward-backward interleaving.
  • Zero-bubble strategies (ZB1, ZB2, WZB1, WZB2): Significantly reduce the bubble ratio by decoupling B pass and W pass, aiming for near-zero idle times. The bubble ratio equation for traditional PP is: $ \text{Bubble Ratio} = \frac{(P-1) (T_F + T_B) - (P-1) T_{\text{overlap}}}{N \cdot (T_F + T_B)} $ where $T_{\text{overlap}}$ is the time computation and communication can overlap. Zero-bubble strategies effectively multiply $N$ by a factor greater than 1 in the denominator, reducing the bubble ratio.
  • ZB2 and WZB2: Claim to have almost no bubbles along the iteration.

Communication Efficiency (Bandwidth Usage)

Measured by Total Bandwidth Usage (TBW). The paper presents the equation for theoretical bandwidth usage in "Zone 1" (where passes are fully alternated) for activation-passing PP: $ TBW = \frac{2 \cdot M_A \cdot N}{T_{\text{Zone1}}} $

  • For activation-passing PP: Communication volume is $2 \cdot M_A$ (for activations and gradients of activations) per microbatch. This scales linearly with microbatch size ($G$) and sequence length ($S$), as $M_A \approx G \cdot S \cdot H \cdot \text{precision\_bytes}$.
  • For WeiPipe: Communication is dictated by the amount of weights ($M_W$) and their gradients ($M_D$), which is independent of $G$ and $S$.
    • For Llama-style models, weights per layer are about $12H^2$.
    • WeiPipe-Interleave requires transmitting two chunks of $W$ and one chunk of $D$ per turn. Assuming similar sizes for $W$ and $D$, the communication volume is approximately $3 \times 12H^2 = 36H^2$.
    • The time duration for this communication is approximately $\frac{T_F + T_B}{P}$.
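
Combining the last two bullets gives a rough WeiPipe analogue of the activation-passing expression above (an estimate implied by these statements rather than a formula quoted from the paper): $ \mathrm{TBW}_{\mathrm{WeiPipe}} \approx \frac{36H^2}{(T_F + T_B)/P} = \frac{36 H^2 P}{T_F + T_B} $ in elements per unit time (multiply by the bytes per element for bytes/s), which depends on neither $G$, $S$, nor $N$.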

Memory Consumption

Memory consumption is implementation-dependent, but theoretical estimates are provided:

  • 1F1B and WeiPipe-Interleave: Dominant memory usage is for storing activations, roughly $G \cdot M_A$. WeiPipe-Interleave is said to have similar memory consumption to 1F1B but with more balanced distribution.
  • Strategies with separate B pass and W pass (ZB1, ZB2, WZB1, WZB2):
    • A B pass stage (which is $\frac{1}{P}$ of a complete B pass) consumes part of the activations $M_A/P$ and generates gradients of activations $M_B/P$.
    • The paper assumes that $\alpha M_A/P$ activation storage is left after one B pass stage, where $\alpha$ is a factor for remaining activations.
    • ZB1 and WZB2: Both need to store $G \cdot (\alpha M_A + M_B)$ data.
    • ZB2: Nearly doubles this requirement compared to ZB1.
    • WZB1: Can achieve a maximum memory consumption of approximately $1.5 G M_A$, which is less than other zero-bubble strategies when $M_B/M_A > 1.5 - \alpha$.
  • The paper notes that for zero-bubble pipelines, peak memory can be tricky to calculate. With techniques like Flash Attention, the activations generated by attention are reduced. The remaining activations are mainly from FFN (Feed-Forward Network) computations. The size of gradients produced during the B pass can be approximately equal to the size of activations generated in one forward pass. This can cause peak memory to occur before the first W pass of the last rank, potentially being twice the peak of the first rank. This highlights a challenge for zero-bubble PP with Flash Attention enabled.

4.3. Implementation Details

The authors implemented WeiPipe-Interleave from scratch in PyTorch, focusing on LLM training but noting general applicability. The WeiPipe-zero-bubble variations (WZB1, WZB2) are discussed for their potential but not implemented in the current work due to their intricate control requirements.

Key implementation considerations include:

  • Mixed Precision: To reduce storage and communication, the implementation uses:

    • activations ($A$), weights ($W$), and gradients of weights ($D$) in fp16 precision.
    • gradients of activations ($B$) in bf16 precision.
    • optimizer states in fp32, distributed among workers.
    • Explanation: fp16 (half-precision floating point) and bf16 (bfloat16) use 16 bits per number, compared to fp32 (single-precision floating point) which uses 32 bits. This halves memory usage and can speed up computation on compatible hardware. bf16 generally offers a better dynamic range than fp16, which can be beneficial for gradient accumulation, while fp16 is often sufficient for activations and weights. Optimizer states are kept in fp32 for numerical stability during weight updates.
  • Communication Overlap: WeiPipe is designed to balance communication and computation. To hide communication latency:

    • Ws (weights) and Ds (gradients of weights) are prefetched using asynchronous communication.
    • This is realized via the batch_isend_irecv function provided by the PyTorch distributed library.
    • Explanation: Asynchronous communication allows a device to initiate a data transfer operation and then immediately proceed with computation, without waiting for the transfer to complete. The isend (immediate send) and irecv (immediate receive) functions are non-blocking calls that return a handle, allowing the program to check for completion later. This hides communication time behind computation, reducing the overall execution time. A minimal sketch of this prefetch pattern appears after this list.
  • Recomputation (Gradient Checkpointing) and Flash Attention: These memory optimization techniques are integrated to enable larger microbatch sizes, enhancing WeiPipe's advantages.

    • Recomputation: Saves memory by discarding activations during the forward pass and recomputing them during the backward pass. This adds computational overhead but significantly reduces peak memory.
    • Flash Attention: Optimizes attention module memory access and saves memory by performing operations in a memory-efficient manner, specifically by avoiding writing large intermediate attention matrices to high-bandwidth memory.
    • Fair Comparison: These optimizations are also applied to other comparison strategies to ensure a fair evaluation. However, recomputation is not applied to zero-bubble pipeline strategies because, as the paper notes, it offers no storage savings there and merely adds computational overhead due to how zero-bubble PP handles activations for B pass.
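
A minimal sketch of the prefetch pattern described in the Communication Overlap item above, using the torch.distributed primitives the implementation is said to rely on (buffer names, sizes, and the ring direction are illustrative assumptions, not the paper's actual code; it presumes an initialized NCCL process group and a GPU per rank):

```python
import torch
import torch.distributed as dist

def post_prefetch(w_cur, d_cur, w_next, d_next, upstream, downstream):
    # Post all four transfers for the next turn in one batch: send the weight
    # and weight-gradient chunks just used downstream, and receive the next
    # ones from upstream. Returns request handles to wait on after compute.
    ops = [
        dist.P2POp(dist.isend, w_cur, downstream),
        dist.P2POp(dist.isend, d_cur, downstream),
        dist.P2POp(dist.irecv, w_next, upstream),
        dist.P2POp(dist.irecv, d_next, upstream),
    ]
    return dist.batch_isend_irecv(ops)

# Precision choices mirroring the description above: fp16 buffers for W and D
# (gradients of activations would be bf16, optimizer states fp32 and local).
H = 1024
w_cur = torch.zeros(12 * H * H, dtype=torch.float16, device="cuda")
d_cur = torch.zeros_like(w_cur)
w_next, d_next = torch.empty_like(w_cur), torch.empty_like(w_cur)

rank, world = dist.get_rank(), dist.get_world_size()
reqs = post_prefetch(w_cur, d_cur, w_next, d_next,
                     upstream=(rank - 1) % world, downstream=(rank + 1) % world)
# ... run this turn's forward/backward compute here, then:
for req in reqs:
    req.wait()
```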

5. Experimental Setup

5.1. Datasets

The paper trains models based on the open-source Llama-2 structure (GPT-style models). No specific training dataset (e.g., C4, Wikipedia) is mentioned for the training process itself. The focus is on the performance of the distributed training method rather than the trained model's downstream task performance. The model configurations are varied to simulate different scales and long-context scenarios.

The parameters varied for model configuration are:

  • $H$: Hidden dimension size

  • $S$: Token length (sequence length)

  • $G$: Micro-batch size

  • $N$: Number of micro-batches in an iteration

  • Fixed head number: 32

  • Fixed layer number: 32

    The paper tests different $H$ and $S$ values:

  • $H \in \{1024, 2048, 4096\}$

  • $S \in \{4096, 8192, 16384\}$. These combinations cover model sizes ranging from 384M (million) to 6.1B (billion) parameters, specifically targeting scenarios with long contexts.

5.2. Evaluation Metrics

The primary evaluation metrics used are:

  1. Throughput (Tokens/second/GPU):

    • Conceptual Definition: This metric quantifies the efficiency of the training process by measuring how many tokens (units of text) are processed per second per GPU. A higher throughput indicates a more efficient and faster training setup. It reflects the overall speed of the distributed training system, taking into account computation, communication, and synchronization overheads.
    • Mathematical Formula: The paper does not explicitly state a formula for Throughput. However, it can be generally understood as: $ \text{Throughput} = \frac{\text{Total Tokens Processed}}{\text{Total Training Time} \times \text{Number of GPUs}} $ Or, more commonly, within one iteration: $ \text{Throughput}_{\text{per GPU}} = \frac{G \times S \times N}{\text{Time per Iteration}} \times \frac{1}{\text{Number of GPUs}} $
    • Symbol Explanation:
      • $G$: Micro-batch size (number of samples per microbatch).
      • $S$: Sequence length (number of tokens per sample).
      • $N$: Number of microbatches in an iteration.
      • Time per Iteration: The total time taken to complete one full training iteration (forward, backward, and parameter update for all microbatches).
      • Number of GPUs: The total count of GPUs used in the distributed training setup.
  2. Memory (GB):

    • Conceptual Definition: This metric measures the peak memory consumption in Gigabytes (GB) on a single GPU during the training process. Lower memory consumption indicates better memory efficiency, allowing for larger models, larger batch sizes, or longer context lengths to be trained on available hardware.
    • Mathematical Formula: No explicit formula is provided, as it's a direct measurement of GPU memory usage.
    • Symbol Explanation: Measured in GB (Gigabytes).
  3. Scalability:

    • Conceptual Definition: This assesses how effectively the training method utilizes additional hardware resources (GPUs).
      • Weak Scaling: Measures how throughput changes when both the problem size (e.g., total batch size) and the number of workers (GPUs) increase proportionally, such that the workload per GPU remains constant. An ideal weak scaling shows constant throughput per GPU as the number of GPUs increases.
      • Strong Scaling: Measures how throughput changes when the total problem size (e.g., total batch size) is kept constant, but the number of workers (GPUs) increases. An ideal strong scaling shows a linear increase in total throughput (or a proportional decrease in total time) as the number of GPUs increases.
    • Mathematical Formula: No explicit formula for scalability, but it's evaluated by observing the trends in throughput metrics (total throughput and throughput per GPU) as the number of GPUs changes under specific workload conditions.
    • Symbol Explanation: $P$ (number of workers/GPUs) is the primary variable, while the total batch size ($G \times N \times P$ for weak scaling, $G \times N$ for strong scaling) is controlled.
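
As a small worked example of the per-GPU throughput definition in metric 1 above (the numbers are hypothetical, not taken from the paper):

```python
def tokens_per_second_per_gpu(G, S, N, iter_time_s, num_gpus):
    # Tokens processed in one iteration, divided by wall-clock time and GPU count.
    return G * S * N / (iter_time_s * num_gpus)

# Hypothetical: G=4, S=16384, N=64 microbatches, a 170 s iteration on 16 GPUs.
print(round(tokens_per_second_per_gpu(4, 16384, 64, 170.0, 16)))  # ~1542 tokens/s/GPU
```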

5.3. Baselines

WeiPipe-Interleave is compared against several state-of-the-art distributed training strategies:

  1. 1F1B (One Forward, One Backward): A widely used pipeline parallelism strategy.

    • Implementation: Through Megatron-LM.
    • Why representative: It's a standard and effective PP method that reduces pipeline bubbles compared to earlier GPipe.
  2. ZB1 (Zero-bubble 1): A zero-bubble pipeline parallelism strategy from the paper "Zero Bubble Pipeline Parallelism."

    • Implementation: Through Megatron-LM.
    • Why representative: It's a state-of-the-art PP method that aims to minimize pipeline bubbles by decoupling B pass and W pass.
  3. ZB2 (Zero-bubble 2): Another zero-bubble pipeline parallelism strategy from the same paper.

    • Implementation: Through Megatron-LM.
    • Why representative: It's an alternative zero-bubble PP configuration that also aims for minimal bubbles, often with different memory-throughput trade-offs.
  4. FSDP (Fully Sharded Data Parallelism): A highly memory-efficient data parallelism strategy.

    • Implementation: Achieved by the ZeRO-3 optimization in DeepSpeed.

    • Why representative: It's a leading memory-efficient DP method capable of scaling to very large models by sharding model states. It's often considered a strong baseline for memory and throughput in large-scale training.

      All comparison strategies are configured with the same model, microbatch size, mixed-precision settings, Flash Attention, and hardware environment to ensure a fair comparison. Recomputation is applied to all strategies except zero-bubble pipeline strategies (ZB1, ZB2), as the authors note recomputation offers no storage savings and adds computational overhead when used with these specific zero-bubble methods.

5.4. Hardware Environment

Experiments were conducted on Colossal Cloud using A800 GPUs.

  • A800 GPU: Features 80GB HBM (High Bandwidth Memory) and 312 TFLOPS fp16/bf16 tensor cores. Notably, its NVLink bandwidth is limited to 400 GB/s (compared to 600 GB/s on A100 GPUs). This limitation helps in evaluating communication effectiveness under slightly more constrained high-bandwidth scenarios.

  • Communication Library: NCCL (NVIDIA Collective Communications Library) is used as the underlying communication library. NCCL's default behavior for collective primitives like reduce-scatter and all-gather (used in FSDP) is a ring-based implementation, and tree algorithms were not adopted in these experiments. This justifies maintaining a ring topology for all parallel strategies in the experiments.

    Two different communication infrastructures were used to evaluate WeiPipe's communication effectiveness:

  1. NVLink Environment (Within Cluster):

    • Setup: 16 A800 GPUs in two clusters, with NVLink connections. This represents a high-bandwidth, low-latency environment, typical for GPUs within a single server or tightly coupled servers.
    • Purpose: To assess performance when communication is relatively fast but activation size is still a factor.
  2. PCIe and Ethernet Environment (Across Clusters):

    • Setup: 32 A800 GPUs across 4 clusters. NVLink connections are used within each cluster, but the clusters themselves are connected by 10Gb Ethernet.

    • Purpose: To simulate a more communication-constrained scenario, where the low bandwidth of Ethernet between clusters can expose communication bottlenecks. This is crucial for evaluating WeiPipe's ability to perform well in less ideal network conditions.

      The layer number for small-scale weak scaling experiments was set to 16, while for large-scale experiments and throughput/memory tests, it was 32.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate WeiPipe-Interleave's superior performance and scalability, particularly in communication-constrained scenarios and for long-context models.

The following are the results from Table 2 of the original paper:

Model Config Throughput (Tokens/second/GPU) Memory (GB)
H S G 1F1B ZB1 ZB2 FSDP WeiPipe 1F1B ZB1 ZB2 FSDP WeiPipe
1024 4096 16 8581.7 7547.0 7638.5 11525.9 15138.8 13.0 20.4 39.3 8.6 9.4
1024 8192 8 7403.8 6739.6 6768.1 9424.4 12122.3 9.9 10.7 20.5 8.6 9.4
1024 16384 4 5641.2 5651.6 5651.9 6973.6 8188.3 9.1 21.6 42.2 8.6 9.4
2048 4096 16 4163.2 3823.3 OOM 4104.8 6499.7 18.7 44.3 OOM 17.9 19.9
2048 8192 8 3791.3 3517.8 OOM 3706.8 6033.2 19.6 22.3 OOM 17.9 19.9
2048 16384 4 3146.3 3050.1 OOM 3087.2 4607.8 22.9 42.9 OOM 17.9 19.9
4096 4096 16 1662.7 OOM OOM 1110.5 2023.1 40.5 OOM OOM 39 44.5
4096 8192 8 1556.2 OOM OOM 1063.2 2059.4 41.6 OOM OOM 39 44.5
4096 16384 4 1331.6 OOM OOM 944.2 1684.9 45.1 OOM OOM 39 44.5

The results are for training Llama-style models on 16 GPUs with NVLink connections. OOM indicates "Out Of Memory." For ZB strategies, $G$ (micro-batch size) is set to 4 if $S=4096$ and $G=1$ if $S=8192$ or 16384, due to memory limitations.

  • Throughput Advantage: WeiPipe consistently demonstrates higher throughput across almost all configurations.
    • For example, with $H=1024$, $S=4096$, $G=16$, WeiPipe achieves 15138.8 Tokens/second/GPU, significantly outperforming FSDP (11525.9) and 1F1B (8581.7).
    • The improvement is particularly notable for larger $H$ and $S$. When $H=4096$, $S=16384$, $G=4$, WeiPipe (1684.9) shows a 26.5% improvement over 1F1B (1331.6) and a 78.4% improvement over FSDP (944.2). This highlights WeiPipe's effectiveness in long-context scenarios where activation sizes are large.
  • Memory Consumption:
    • 1F1B generally has the smallest memory usage among pipeline parallelism strategies, benefiting from recomputation and Flash Attention.
    • FSDP also exhibits good memory efficiency, typically having lower memory usage than WeiPipe for some configurations (e.g., $H=1024$, $S=4096$, $G=16$: FSDP 8.6GB vs WeiPipe 9.4GB). This is attributed to FSDP's operator-wise buffer creation leading to smaller, more fragmented buffers, while WeiPipe uses larger buffers for sending and receiving $W$s and $D$s.
    • Zero-bubble strategies (ZB1, ZB2) show significantly higher memory consumption, frequently leading to OOM errors, especially for larger $H$ and $S$. This contradicts previous claims in the literature that ZB1's memory is comparable to 1F1B's and ZB2's is twice 1F1B's. The authors attribute this discrepancy to the use of Flash Attention, which reduces attention activation memory, making FFN activations (and their gradients) the dominant factor and leading to higher peak memory during the B pass and W pass for zero-bubble methods. The necessity for smaller microbatch sizes in ZB strategies (e.g., $G=1$ for $S=8192$ or 16384) further compromises their computational efficiency despite their theoretical zero-bubble potential.

6.1.2. Throughput (PCIe and Ethernet Environment)

The following are the results from Table 3 of the original paper:

Model Config Throughput (Tokens/second/GPU)
H S G 1F1B ZB1 ZB2 FSDP WeiPipe
1k 4k 16 8193 7708 7952 11545 13847
1k 16k 4 5394 4583 4630 6764 7551
2k 4k 16 4030 3701 OOM 4205 5587
2k 16k 4 2907 2638 OOM 3150 4151
4k 4k 16 1530 OOM OOM 1186 1402
4k 16k 4 1232 OOM OOM 966 1505

The results are for training Llama-style models on 16 GPUs with PCIe and Ethernet connections. OOM indicates "Out Of Memory." For ZB strategies, $G$ (micro-batch size) is set to 4 if $S=4096$ and $G=1$ if $S=8192$ or 16384, due to memory limitations.

  • Enhanced Performance in Communication-Constrained Environments: In this environment, where 10Gb Ethernet connects clusters, communication becomes a more severe bottleneck.
    • WeiPipe-Interleave further solidifies its advantage. For $S=16384$, $H=2048$, $G=4$, WeiPipe (4151) improves throughput by 31.7% compared to the best performing alternative strategy (3150 for FSDP).
    • For $S=16384$, $H=4096$, $G=4$, WeiPipe (1505) outperforms FSDP (966) by 55.8%, and 1F1B (1232) by 22.2%.
  • This confirms WeiPipe's effectiveness in reducing communication pressure, making it less dependent on high-bandwidth interconnects. WeiPipe's ability to transmit weights concurrently with computation (communication hiding) contributes to this.

The following are the results from Table 4 of the original paper:

Model Config Throughput (Kilo Tokens/second/GPU)
H S G 1F1B ZB1 ZB2 FSDP WeiPipe
1k 4k 16 32.0 45.8 46.5 37.9 31.3
2k 16k 4 15.9 22.0 22.1 17.8 16.9
4k 16k 16 4 15.0 22.4 9.4 12.8 OOM 17.0 14.2
4k 4k 16 5.2 OOM OOM OOM 10.1 6.0 9.7 4.9
16k 4 3.7 OOM OOM 3.8 3.6

The results are for training Llama-style models on 8 GPUs with NVLink connections, with the layer number set to 16. For ZB strategies, $G$ (micro-batch size) is set to 4 if $S=4096$ and $G=1$ if $S=8192$ or 16384, due to memory limitations.

  • Less Significant Advantage in High-Bandwidth, Small-Scale Scenarios: In an environment with fewer GPUs (8) and solely NVLink connections (less communication constraint), the advantage of WeiPipe can be less pronounced, and conventional methods may have advantages for certain configurations. For example, for $H=1024$, $S=4096$, $G=16$, FSDP (37.9) and ZB1/ZB2 (45.8/46.5) outperform WeiPipe (31.3). This suggests that WeiPipe's strength lies where communication is a bottleneck, especially with large activation sizes. When activation sizes are moderate and interconnects are very fast, other optimizations (like zero-bubble techniques or FSDP's optimized collective communication) might sometimes take precedence.

6.1.4. Weak Scaling

The following are the results from Figure 6 and Figure 7 of the original paper:

Figure 6. Small scale weak scaling. The number of GPUs scales from 4 to 16 (4 GPUs in 1 server), and the batch size increases proportionally from 64 to 256. The left axis (bar chart) shows overall throughput; the right axis shows tokens per second per GPU.

As can be seen from the results in Figure 6, WeiPipe demonstrates strong weak scaling. Its Tokens/second/GPU (right y-axis) remains relatively stable, or even slightly increases, as the number of GPUs and problem size (batch size) grow proportionally. This indicates that WeiPipe effectively utilizes added resources without significant per-GPU performance degradation, even in small-scale setups with Ethernet communication between servers. In contrast, 1F1B, ZB1, ZB2, and FSDP show a more noticeable decrease in Tokens/second/GPU as the number of GPUs increases, suggesting they are more sensitive to communication overhead in this environment.

Figure 7. Large scale weak scaling. The number of GPUs scales from 8 to 32 (8 GPUs in 1 server) with the batch size increasing proportionally from 128 to 512. The chart reports overall throughput (Kilo Tokens/s) and per-GPU throughput (Kilo Tokens/s/GPU) for 1F1B, FSDP, and WeiPipe.

As can be seen from the results in Figure 7, in large-scale weak scaling (up to 32 GPUs), WeiPipe continues to show superior weak scaling. Its Tokens/second/GPU remains the highest and most stable as the number of GPUs increases, especially compared to 1F1B and FSDP, which experience more significant drops in per-GPU efficiency. This confirms WeiPipe's ability to maintain efficiency when scaling up, making it suitable for very large-scale training.

6.1.5. Strong Scaling

The following are the results from Figure 8 and Figure 9 of the original paper:

Figure 8. Small scale strong scaling. The number of GPUs scales from 4 to 16, with the batch size fixed at 128.

As can be seen from the results in Figure 8, WeiPipe exhibits superior strong scaling in the small-scale setup (4 to 16 GPUs, fixed batch size of 128). Its total throughput (Kilo Tokens/second) scales closer to linearly with the number of GPUs than the other methods, indicating better use of additional GPUs to speed up a fixed task. 1F1B and FSDP show sub-linear scaling, implying that their overheads (collective communication for FSDP; bubbles and activation communication for 1F1B) become more prominent as more GPUs are added for a fixed workload.

Figure 9. Large scale strong scaling. The number of GPUs scales from 8 to 32, with the batch size fixed at 256. The chart compares overall throughput (Kilo Tokens/s) for 1F1B, FSDP, and WeiPipe, with WeiPipe reaching the highest throughput at 32 GPUs.

As can be seen from the results in Figure 9, in large-scale strong scaling (8 to 32 GPUs, fixed batch size of 256), WeiPipe again outperforms 1F1B and FSDP. WeiPipe achieves the highest total throughput and demonstrates a better scaling trend. This further validates WeiPipe's potential to utilize more GPUs to achieve greater speed-up for a fixed training task, especially when large token lengths and low-bandwidth communication infrastructure (like Ethernet between clusters) create challenges for conventional PP and FSDP.

6.2. Ablation Studies / Parameter Analysis

The paper primarily focuses on comparing WeiPipe-Interleave against existing state-of-the-art methods under various conditions rather than conducting ablation studies on WeiPipe's internal components. However, the comparison of WeiPipe-Naive (described conceptually) with WeiPipe-Interleave implicitly serves as an ablation, showing the benefit of interleaving forward and backward passes to reduce bubbles and communication.

The different model configurations (varying H, S, G) and hardware environments (NVLink vs. PCIe/Ethernet) serve as parameter analyses, demonstrating how WeiPipe's performance is affected by hidden dimension size, sequence length, micro-batch size, and network bandwidth. These analyses highlight WeiPipe's robust performance under conditions that stress communication (large S, Ethernet connections).

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces WeiPipe (Weight Pipeline Parallelism), a novel distributed training strategy designed to address the communication bottleneck in training Large Language Models (LLMs) with long context lengths. By shifting from the traditional activation-passing pipeline to a weight-passing pipeline, WeiPipe significantly reduces communication overhead, as the volume of weights and their gradients is independent of microbatch size and sequence length.

The core contribution, WeiPipe-Interleave, efficiently overlaps communication with computation, minimizes pipeline bubbles by interleaving forward and backward passes, and balances memory utilization. The authors also theoretically explored WeiPipe-zero-bubble strategies, demonstrating the potential for near-zero idle times in weight-passing pipelines.

Experimental results on various LLama-style models and hardware configurations (including challenging PCIe and Ethernet environments) show that WeiPipe-Interleave consistently outperforms state-of-the-art pipeline parallelism (1F1B, ZB1, ZB2) and Fully Sharded Data Parallelism (FSDP) in terms of throughput and scalability, especially for long-context training. This demonstrates WeiPipe's ability to provide efficient and scalable training, even in communication-constrained scenarios, thereby reducing the reliance on expensive high-bandwidth communication infrastructure.

7.2. Limitations & Future Work

The authors implicitly or explicitly acknowledge several limitations and areas for future work:

  1. Implementation of WeiPipe-zero-bubble: The paper discusses WZB1 and WZB2 conceptually but explicitly states that their implementation is left for future exploration due to the need for intricate and fine-grained control. This suggests that the practical challenges of achieving true zero-bubble with weight-passing are substantial.
  2. Memory Overhead of Zero-Bubble PP: The paper finds that existing zero-bubble PP methods (ZB1, ZB2), when combined with Flash Attention, incur significantly higher memory consumption and OOM errors than previously reported. This implies that while zero-bubble reduces idle time, it might come with memory costs that limit achievable batch sizes, compromising overall efficiency. WeiPipe-zero-bubble might face similar memory challenges.
  3. Specific to Transformer Architecture: While stated that the approach is "not limited to Transformers," the analysis and experiments are heavily focused on LLama-style models (a type of Transformer). Its applicability and performance characteristics on other neural network architectures would need further investigation.
  4. Hardware Dependency: The performance gains are most pronounced in communication-constrained environments. In highly optimized, fast-interconnect (e.g., full NVLink) environments with fewer GPUs, traditional methods might sometimes still be competitive or even superior for certain configurations (as shown in Table 4). This suggests WeiPipe is a specialized solution particularly effective where communication is the dominant bottleneck.
  5. Complexity of Implementation: Implementing WeiPipe from scratch (as the authors did) suggests that integrating it into existing frameworks might require significant engineering effort due to the fundamental shift in pipeline data flow.

7.3. Personal Insights & Critique

This paper presents a highly insightful and timely contribution to the field of distributed deep learning. The core idea of switching from activation-passing to weight-passing in pipeline parallelism is a fundamental shift that directly addresses a growing bottleneck in long-context LLM training.

Inspirations and Applications:

  • Rethinking Communication Primitives: The paper inspires a deeper look into the nature of data being communicated in distributed systems. When activations become the bottleneck, it's not just about optimizing how they are sent, but what is sent. This paradigm shift could be applicable to other distributed algorithms where intermediate data structures grow disproportionately large.

  • Resilience to Network Constraints: WeiPipe's ability to perform well under low-bandwidth Ethernet conditions is a significant practical advantage. This makes high-performance LLM training more accessible on commodity clusters or geographically dispersed compute resources, reducing the reliance on expensive, specialized NVLink-heavy superclusters. This could democratize LLM training.

  • Synergy with Memory Optimizations: The paper's demonstration that Flash Attention and recomputation can enhance WeiPipe's benefits (by allowing larger microbatches and thus reducing total communication/computation phases) is crucial. It shows that WeiPipe is not an isolated optimization but works well within the current ecosystem of LLM training techniques.

Potential Issues and Areas for Improvement:

  • Dynamic Layer Assignment: The paper assumes an even distribution of layers. In practice, layers can have varying computational complexities. Dynamic or adaptive layer assignment to balance workload (not just layer count) could further improve efficiency.

  • Optimizer State Management: While WeiPipe relies on P2P communication over a ring topology for weights and gradients, the paper gives little detail on how optimizer states interact with the circulating weights during updates. In practice, since each worker is responsible for updating one layer's weights, it can also store that layer's optimizer state locally, so optimizer states never need to be transmitted between workers.

  • Generalization Beyond Transformers: While the principle is general, empirical validation on other large models (e.g., large vision models, graph neural networks) would strengthen the claim of broad applicability.

  • Complexity of zero-bubble variants: The theoretical discussion of WZB1 and WZB2 is intriguing, but their non-implementation suggests significant practical challenges. Future work on making these zero-bubble weight-passing methods practical and robust would be highly valuable. The identified memory issues with existing zero-bubble PP combined with Flash Attention also point to a critical trade-off that needs careful consideration.

  • Interaction with Tensor Parallelism: How WeiPipe could be effectively combined with Tensor Parallelism (for extremely large individual layers) is not explored, but often necessary for models beyond a certain scale. This might introduce new communication patterns and challenges.

Overall, WeiPipe offers a promising new direction for efficient distributed training of long-context LLMs, particularly in environments where network bandwidth is a primary constraint. Its innovation in shifting the communication paradigm is a well-motivated response to the evolving challenges of model scaling.
