TASP: Topology-aware Sequence Parallelism
TL;DR Summary
TASP improves long-context LLM training by decomposing modern accelerator topologies into concurrent ring paths, enabling efficient parallel communication and achieving up to 3.58x speedup over Ring Attention on NVIDIA H100 and AMD MI300X.
Abstract
Long-context large language models (LLMs) face constraints due to the quadratic complexity of the self-attention mechanism. The mainstream sequence parallelism (SP) method, Ring Attention, attempts to solve this by distributing the query into multiple query chunks across accelerators and enabling each Q tensor to access all KV tensors from other accelerators via the Ring AllGather communication primitive. However, it exhibits low communication efficiency, restricting its practical applicability. This inefficiency stems from the mismatch between the Ring AllGather communication primitive it adopts and the AlltoAll topology of modern accelerators. A Ring AllGather primitive is composed of iterations of ring-styled data transfer, which can utilize only a very limited fraction of an AlltoAll topology. Inspired by the Hamiltonian decomposition of complete directed graphs, we identify that modern accelerator topology can be decomposed into multiple orthogonal ring datapaths that can transfer data concurrently without interference. Based on this, we further observe that the Ring AllGather primitive can likewise be decomposed into the same number of concurrent ring-styled data transfers at every iteration. Building on these insights, we propose TASP, a topology-aware SP method for long-context LLMs that fully utilizes the communication capacity of modern accelerators via topology decomposition and primitive decomposition. Experimental results on both single-node and multi-node NVIDIA H100 systems and a single-node AMD MI300X system demonstrate that TASP achieves higher communication efficiency than Ring Attention on these modern accelerator topologies and achieves up to a 3.58x speedup over Ring Attention and its variant Zigzag-Ring Attention. The code is available at https://github.com/infinigence/HamiltonAttention.
In-depth Reading
1. Bibliographic Information
- Title: TASP: Topology-aware Sequence Parallelism
- Authors: Yida Wang, Ke Hong, Xiuhong Li, Yuanchao Xu, Wenxun Wang, Guohao Dai, Yu Wang
- Journal/Conference: This paper is available on arXiv, a preprint server. This means it has not yet undergone formal peer review for a major conference or journal at the time of this analysis.
- Publication Year: 2025 (the arXiv identifier 2509.26541 corresponds to a September 2025 submission).
- Abstract: The paper addresses the communication inefficiency of Ring Attention, a mainstream Sequence Parallelism (SP) method for long-context Large Language Models (LLMs). The authors argue that Ring Attention's Ring AllGather communication primitive is mismatched with the AlltoAll (fully-connected) topology of modern accelerators, leading to underutilization of communication bandwidth. Inspired by Hamiltonian decomposition from graph theory, the paper proposes TASP (Topology-aware Sequence Parallelism). TASP decomposes the accelerator topology into multiple concurrent, non-interfering ring datapaths and decomposes the communication primitive into multiple concurrent ring-style transfers. This creates a mapping that fully utilizes the available hardware communication capacity. Experiments on NVIDIA H100 and AMD MI300X systems show TASP achieves up to a 3.58x speedup over Ring Attention and its variants.
- Original Source Link: https://arxiv.org/abs/2509.26541
2. Executive Summary
- Background & Motivation (Why):
  - Core Problem: The self-attention mechanism in Transformer-based LLMs has computational and memory complexity that scales quadratically with the input sequence length (O(N^2) for length N). This makes processing very long contexts (e.g., millions of tokens) prohibitively expensive on a single accelerator.
  - Existing Solution & Gap: Sequence Parallelism (SP) is a technique to mitigate this by splitting the sequence across multiple accelerators. The leading method, Ring Attention, works by having each accelerator hold a chunk of the query (Q) and rotating chunks of keys and values (KV) among all accelerators in a ring-like fashion. However, this Ring AllGather communication pattern only uses a single "ring" of communication links at a time. Modern accelerator clusters (like an 8-GPU server) have a fully-connected (AlltoAll) topology, meaning every accelerator can communicate with every other accelerator simultaneously. Ring Attention's communication pattern thus severely underutilizes this hardware capability, using only 1/(n-1) of the links, where n is the number of accelerators. This communication inefficiency becomes the primary performance bottleneck, especially when computation is fast relative to communication.
  - Fresh Angle: The paper's key innovation is to solve this mismatch by treating it as a graph theory problem. The authors realized that a fully-connected hardware topology can be mathematically decomposed into multiple, independent "ring" communication paths (Hamiltonian cycles) that can operate concurrently. They then designed a new communication primitive, Multi-Ring AllGather, to run multiple data transfers in parallel, one on each of these decomposed hardware paths. This aligns the software's communication pattern with the hardware's full capacity.
- Main Contributions / Findings (What):
  - A Novel Communication Optimization Methodology: The paper introduces a general methodology of topology decomposition and primitive decomposition. This involves analyzing the hardware network graph to find parallel data paths and then redesigning the communication algorithm to exploit them.
  - TASP Algorithm: The paper applies this methodology to create TASP. It uses Hamiltonian decomposition to break down AlltoAll accelerator topologies into multiple orthogonal ring datapaths. It then decomposes the standard Ring AllGather into a Multi-Ring AllGather primitive that drives concurrent data transfers along all these paths, maximizing bandwidth utilization.
  - Significant Performance Improvements: Experimental results demonstrate that TASP substantially outperforms Ring Attention. On an 8-GPU AMD MI300X server, it achieves up to a 3.58x speedup, and on an NVIDIA H100 server, up to a 2.31x speedup. The speedup is most pronounced in communication-bound scenarios. The method also shows strong performance in multi-node clusters.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
  - Large Language Models (LLMs): AI models, typically based on the Transformer architecture, trained on vast amounts of text data to understand and generate human-like language. Long-context LLMs can process and reason over very long inputs, like entire books or codebases.
  - Self-Attention: The core mechanism of the Transformer architecture. For each token in a sequence, it calculates an attention score with every other token to determine how much "attention" to pay to them. This is what leads to the quadratic O(N^2) complexity, as every token must be compared with every other token. The inputs to this mechanism are Query (Q), Key (K), and Value (V) matrices derived from the input tokens.
  - Sequence Parallelism (SP): A distributed computing strategy for training/running LLMs. Instead of splitting the model's parameters, SP splits the input data sequence across multiple accelerators (GPUs/TPUs). This reduces the memory footprint and computational load on each device when dealing with long sequences.
  - Accelerator Interconnect: The physical network connecting accelerators.
    - AlltoAll Topology: A network where every accelerator has a direct, high-speed communication link to every other accelerator. This can be achieved via a full-mesh architecture (direct cables, as in AMD MI300X) or a switch-based architecture (high-speed switches like NVIDIA's NVSwitch).
  - Communication Primitives: Standard operations for exchanging data in a distributed system (a minimal code sketch of the ring pass appears after this list).
    - AllGather: Each accelerator starts with a piece of data. After the operation, every accelerator has a copy of all the data from all other accelerators.
    - Ring AllGather: An implementation of AllGather where data is passed sequentially from one accelerator to the next in a logical ring. It is memory-efficient, as each accelerator only holds one extra piece of data at a time, but it is slow on AlltoAll topologies.
    - AlltoAll: Each accelerator has a buffer of data to be sent, split into chunks for every other accelerator. It sends each chunk to its corresponding destination and receives a chunk from every other accelerator.
  - Hamiltonian Decomposition: A concept from graph theory. A Hamiltonian cycle is a path in a graph that visits every node (vertex) exactly once before returning to the starting node. A Hamiltonian decomposition of a graph is the partitioning of its edges into a set of edge-disjoint Hamiltonian cycles. This is the mathematical foundation of TASP's topology decomposition.
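To make the Ring AllGather pattern concrete, here is a minimal sketch using torch.distributed point-to-point ops. It assumes an already-initialized process group (e.g., launched via torchrun) and equal-sized KV blocks; the function and variable names are illustrative, not taken from the paper's code.

```python
# Minimal Ring AllGather sketch: each rank forwards the most recently
# received block to its ring successor while receiving from its predecessor.
import torch
import torch.distributed as dist

def ring_allgather(kv_block: torch.Tensor) -> list[torch.Tensor]:
    rank = dist.get_rank()
    world = dist.get_world_size()
    send_to = (rank + 1) % world
    recv_from = (rank - 1) % world
    blocks = [kv_block]                       # KV blocks seen so far
    for _ in range(world - 1):
        recv_buf = torch.empty_like(kv_block)
        # Each step activates only one outgoing link per GPU (rank -> rank+1),
        # i.e., 1/(n-1) of the links of an AlltoAll topology.
        reqs = dist.batch_isend_irecv([
            dist.P2POp(dist.isend, blocks[-1], send_to),
            dist.P2POp(dist.irecv, recv_buf, recv_from),
        ])
        for req in reqs:
            req.wait()
        blocks.append(recv_buf)
        # In Ring Attention, attention over the previous block overlaps here.
    return blocks
```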
- Previous Works:
  - Ulysses: An SP method that first shards the sequence and then uses an AlltoAll operation to partition the data by attention heads. Its main limitation is that its scalability is tied to the number of KV heads in the model. Modern models using Grouped-Query Attention (GQA) or Multi-Query Attention (MQA) have few KV heads, limiting Ulysses's effectiveness.
  - Ring Attention: The primary baseline. It partitions the sequence and rotates KV blocks using a Ring AllGather primitive, which allows computation to be overlapped with communication. Its variants include:
    - Striped Attention: Aims to improve load balance for causal attention masks.
    - Zig-zag Ring Attention / Megatron-CP: Achieves perfect load balance for causal attention by using a more complex, symmetric data partitioning scheme.
  - The key issue with all Ring Attention variants is their reliance on the Ring AllGather primitive, which, as this paper argues, is fundamentally inefficient on modern hardware.
- Differentiation: While previous works like Zig-zag Ring Attention focused on balancing the computational load, TASP focuses on optimizing the communication pattern. It does not change the attention math but fundamentally redesigns how data is moved between accelerators: instead of a single, slow ring communication, TASP orchestrates multiple parallel ring communications to saturate the hardware's available bandwidth.
4. Methodology (Core Technology & Implementation)
The core of TASP is a two-part decomposition strategy: first, decompose the hardware topology into parallel paths, and second, decompose the communication workload to run on these paths.
Image 1: This figure visually explains the core problem and solution. On the left, Ring Attention uses a single communication ring, leaving most high-speed links in a fully-connected accelerator cluster idle. On the right, TASP decomposes the topology into multiple, independent rings and the data transfer into parallel streams, achieving full utilization of the communication links.
Principles:
The central idea is to perfectly map the communication algorithm onto the underlying physical network topology: if a network offers multiple parallel, non-interfering paths, the algorithm should be designed to perform that many data transfers in parallel.
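As a quick sanity check on the size of the mismatch (a worked calculation for the paper's 8-accelerator setting):

$$
\text{Ring AllGather link utilization} \;=\; \frac{n \text{ active links per step}}{n(n-1) \text{ directed links}} \;=\; \frac{1}{n-1}
\;\;\Rightarrow\;\; \frac{8}{56} = \frac{1}{7} \approx 14\% \ \text{for } n = 8 .
$$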
Steps & Procedures:
1. Decomposition of Accelerator Topology
The paper models the accelerator cluster as a directed graph G = (V, E), where vertices are accelerators and edges are unidirectional communication links.
- Ring Datapath: A ring datapath is defined as a Hamiltonian cycle in this graph: a directed path that visits every accelerator exactly once and returns to the start.
- Orthogonal Ring Datapaths: Two or more ring datapaths are orthogonal if they are edge-disjoint (i.e., they share no communication links). A set of orthogonal ring datapaths can carry simultaneous data transfers without interference.
- Single-Node Topology Decomposition:
  - Modern single-node systems with n accelerators (e.g., an NVIDIA DGX with 8 H100s or an AMD MI300X server) have an AlltoAll topology, which can be modeled as a complete directed graph K_n.
  - The paper applies a known graph theory result: the Hamiltonian decomposition of a complete graph. A complete directed graph can be decomposed into n-1 edge-disjoint Hamiltonian cycles.
  - This decomposition provides n-1 orthogonal ring datapaths. For an 8-GPU node (n = 8), this yields 7 parallel, non-interfering communication rings.
Image 2 (Figure 2 in the paper): This figure illustrates the decomposition of an 8-accelerator AlltoAll topology (K_8) into 7 distinct, edge-disjoint directed Hamiltonian cycles, the core idea of topology decomposition. Each color represents a separate ring datapath that can operate concurrently.
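The decomposition's two defining properties (every cycle is Hamiltonian; no two cycles share a link) are easy to check programmatically. Below is a small pure-Python verifier, a sketch with illustrative names rather than the paper's code; the shift-ring example also shows why a naive construction does not suffice for even n.

```python
# Verify that a set of directed cycles over n accelerators forms a
# Hamiltonian decomposition of the complete digraph K_n: every cycle visits
# each node exactly once, no two cycles share a directed link (orthogonality),
# and together they use all n(n-1) links.
def is_hamiltonian_decomposition(n: int, rings: list[list[int]]) -> bool:
    edges = set()
    for ring in rings:
        if sorted(ring) != list(range(n)):            # Hamiltonian check
            return False
        for a, b in zip(ring, ring[1:] + ring[:1]):   # closed directed cycle
            if (a, b) in edges:                       # edge-disjointness
                return False
            edges.add((a, b))
    return len(edges) == n * (n - 1)

# Naive "shift" rings i -> (i + s) mod n are edge-disjoint, but for n = 8
# only shifts with gcd(s, 8) = 1 yield Hamiltonian cycles, so they cover
# just 4 * 8 = 32 of the 56 links. Hence the paper's construction (Figure 2)
# must be less regular than a simple shift family.
n = 8
shift_rings = [[(i * s) % n for i in range(n)] for s in (1, 3, 5, 7)]
assert not is_hamiltonian_decomposition(n, shift_rings)  # incomplete cover
```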
- Multi-Node Topology Decomposition:
  - Multi-node clusters have a hierarchical network: very high-speed intra-node links (e.g., NVLink) and comparatively lower-speed inter-node links (e.g., InfiniBand).
  - A naive approach is to model a cluster of m nodes with n GPUs each as a single large complete graph K_mn (e.g., K_16 for two 8-GPU nodes, yielding 15 ring paths). The paper shows this is inefficient because the slower inter-node links would bottleneck the entire system, underutilizing the fast intra-node links.
  - The paper proposes a more sophisticated, topology-aware (TA) decomposition scheme. For a 2-node system (8 GPUs/node), this works by:
    - Decomposing the AlltoAll topology within each node (K_8) into Hamiltonian paths (paths that visit every accelerator once but do not form a closed loop).
    - Using the inter-node InfiniBand links to connect the ends of these paths across the two nodes, thereby forming complete Hamiltonian cycles that span the entire 16-GPU cluster (see the sketch below).
  - This scheme balances the load between intra-node and inter-node links far more effectively. The paper proves this scheme can be generalized to any number of nodes.
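The stitching idea can be sketched in a few lines. Here `paths_a` and `paths_b` are assumed to be edge-disjoint Hamiltonian-path decompositions of each node's intra-node K_8 (8 paths of 7 links each covers all 56 intra-node links, matching the 8 ring paths in Figure 3); the construction of those path decompositions is taken as given, and all names are illustrative.

```python
# Sketch: close each pair of intra-node Hamiltonian paths into one 16-GPU
# ring using two inter-node (InfiniBand) hops. GPU ids 0-7 live on node 0
# and 8-15 on node 1.
def stitch_two_node_rings(paths_a: list[list[int]],
                          paths_b: list[list[int]],
                          n_per_node: int = 8) -> list[list[int]]:
    rings = []
    for pa, pb in zip(paths_a, paths_b):
        pb_global = [g + n_per_node for g in pb]   # remap node-1 GPU ids
        # Traverse node 0's path, hop to node 1 (IB), traverse node 1's
        # path, then hop back (IB) to close the Hamiltonian cycle.
        rings.append(pa + pb_global)
    return rings
```

Each stitched ring crosses the node boundary exactly twice, which is what keeps the slower inter-node links from being oversubscribed relative to the fast intra-node links.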
Image 3 (Figure 3 in the paper): This figure compares the two multi-node decomposition schemes for a 2-node H100 system. The naive Hamiltonian-cycle decomposition (left) yields 15 ring paths but creates a large bandwidth mismatch; the proposed Hamiltonian-path-based decomposition (right) yields 8 ring paths with far more balanced bandwidth usage between intra-node (gray) and inter-node (blue) links, with intra-node and inter-node communication rates of 64 GB/s and 6.25 GB/s or 50 GB/s, respectively.

2. Decomposition of Ring AllGather Primitive
- To use the decomposed topology, the communication primitive itself must be decomposed. TASP replaces Ring AllGather with Multi-Ring AllGather.
- In standard Ring Attention with n accelerators, the KV cache is split into n chunks, and one chunk is assigned to each accelerator.
- In TASP with an (n-1)-ring decomposition, the KV cache is split into n(n-1) chunks. Each accelerator is initially assigned n-1 chunks. Each of these chunks is assigned to one of the n-1 orthogonal ring datapaths and circulates independently and concurrently on its designated path.
Image 4 (Figure 4 in the paper): This diagram contrasts the data flow of the Ring AllGather primitive with the Multi-Ring AllGather primitive across iterations, showing the multi-ring concurrent transfers of the topology-aware scheme. The top shows Ring AllGather, where a single set of KV blocks (t[0] to t[7]) circulates in one ring. The bottom shows TASP's Multi-Ring AllGather for a K_8 decomposition, where 7 sets of KV blocks (e.g., t[0,j], t[1,j], ..., t[6,j]) circulate simultaneously on 7 different ring paths in each iteration.
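The chunk accounting works out as follows (total sequence length S, n = 8):

$$
\underbrace{n}_{\text{start accelerators}} \times \underbrace{(n-1)}_{\text{rings}} = 56 \ \text{KV chunks, each covering } \tfrac{S}{n(n-1)} \ \text{of the sequence},
$$

so per step each accelerator still sends $(n-1)\cdot\frac{S}{n(n-1)} = \frac{S}{n}$ tokens' worth of KV, the same volume as Ring AllGather, but spread across n-1 links at once instead of one.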
The paper proves that Multi-Ring AllGather maintains the essential properties of Ring AllGather:
- Accessibility: Every Q chunk will eventually see every KV chunk, as each KV chunk traverses a complete Hamiltonian cycle and therefore visits every accelerator (a simulation sketch follows below).
- Zero-copy: Each KV chunk is unique and follows its own path, so no extra memory is used for duplication.
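Accessibility is easy to simulate: advance every chunk one hop per step along its ring and check that after n-1 steps each chunk has visited all accelerators. This sketch reuses the ring format from the checker above; the per-ring next-hop maps are, in effect, what the paper's pre-computed routing tables encode (names here are illustrative).

```python
# Simulate Multi-Ring AllGather: chunk (i, j) starts on accelerator i and
# moves one hop per step along ring j. Accessibility holds iff every chunk
# visits all n accelerators within n-1 steps.
def check_accessibility(n: int, rings: list[list[int]]) -> bool:
    for ring in rings:                               # one ring datapath j
        next_hop = {a: b for a, b in zip(ring, ring[1:] + ring[:1])}
        for start in range(n):                       # chunk starting at GPU i
            holder, visited = start, {start}
            for _ in range(n - 1):
                holder = next_hop[holder]            # routing-table lookup
                visited.add(holder)
            if len(visited) != n:                    # needs Hamiltonicity
                return False
    return True
```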
3. Zig-zag Chunk Placement for TASP
Standard Ring Attention with a causal mask suffers from load imbalance (in early steps, some accelerators have no work). Zig-zag Ring Attention solves this with a symmetric partitioning scheme. TASP requires its own adapted version:
- Zig-zag TASP: For an n-accelerator system with n-1 rings, the sequence is divided into 2n(n-1) small segments. Each KV chunk t[i, j] (for ring j and initial accelerator i) is composed of two non-contiguous segments: one from the beginning of the sequence and one from the end.
- The paper gives closed-form formulas for the two segment offsets, expressed in terms of:
  - the total sequence length;
  - n: the number of accelerators;
  - i: the index of the accelerator where the chunk starts;
  - j: the index of the ring datapath the chunk will travel on.
- The paper provides a proof showing that with this placement, every accelerator performs the same amount of computation in every step, achieving perfect load balance even with a causal mask.
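TASP's own placement formulas are given in the paper; as a simpler illustration of the balancing idea, here is a check of the standard zig-zag placement used by Zig-zag Ring Attention / Megatron-CP, where the sequence is cut into 2n chunks and GPU i holds chunks i and 2n-1-i. Because each GPU's two chunk indices always sum to 2n-1, every GPU ends up with the same number of causal query-key pairs. This is a sketch of that known scheme, not the paper's TASP variant.

```python
# Count causal (q, k) pairs with k <= q handled by each GPU over a full pass
# (each GPU's queries eventually meet all keys). Balanced iff all entries equal.
def causal_pairs_per_gpu(n: int, chunk_len: int) -> list[int]:
    work = []
    for i in range(n):
        pairs = 0
        for c in (i, 2 * n - 1 - i):                 # GPU i's two chunks
            for q in range(c * chunk_len, (c + 1) * chunk_len):
                pairs += q + 1                       # q attends to keys 0..q
        work.append(pairs)
    return work

assert len(set(causal_pairs_per_gpu(n=8, chunk_len=4))) == 1  # perfectly balanced
```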
Implementation
- Pre-computed Routing Tables: To avoid runtime overhead, the communication patterns (which accelerator sends which chunk to whom in each step) are pre-calculated and stored in routing tables.
- Concurrent Transfer via AlltoAll: For single-node K_n topologies, the Multi-Ring AllGather can be efficiently implemented with a single AlltoAll collective communication call per step. Because the rings are edge-disjoint, each accelerator's n-1 resident chunks have n-1 distinct next hops, which is exactly the send pattern of an AlltoAll. This leverages highly optimized libraries like NVIDIA's NCCL and AMD's RCCL.
- Forward Pass (Algorithm 3): In each of the n-1 steps, each accelerator uses the routing tables to prepare a send buffer containing the KV chunks for all other accelerators. It then executes an AlltoAll call to perform all n-1 ring-style transfers at once. While communication happens, it computes attention on the KV chunks it currently holds (see the sketch below).
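A sketch of one such step with torch.distributed. `routing` (chunk key to next rank) stands in for the paper's pre-computed routing tables; buffer management and the overlap with the flash-attn computation are elided, and all names are illustrative assumptions rather than the paper's implementation.

```python
# One Multi-Ring AllGather step on a K_n topology: bucket resident chunks by
# their next hop and advance all n-1 rings with a single AlltoAll.
import torch
import torch.distributed as dist

def multi_ring_step(chunks: dict[tuple, torch.Tensor],
                    routing: dict[tuple, int]) -> list[torch.Tensor]:
    world = dist.get_world_size()
    example = next(iter(chunks.values()))
    send = [torch.empty(0, dtype=example.dtype, device=example.device)
            for _ in range(world)]                  # nothing sent to self
    for key, tensor in chunks.items():
        send[routing[key]] = tensor                 # edge-disjoint rings =>
                                                    # exactly one chunk per peer
    recv = [torch.empty_like(t) for t in send]      # uniform chunk sizes
    dist.all_to_all(recv, send)                     # all n-1 rings advance at once
    # In Algorithm 3, attention over the currently held chunks runs
    # concurrently with this transfer.
    return recv
```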
5. Experimental Setup
- Datasets/Workloads: The evaluation was not on a specific named dataset but on synthetic workloads simulating LLM attention computation. The experiments covered a wide range of parameters:
- Batch size: 1 to 128
- Sequence length: 3k to 1M
- Attention heads: 4, 12, 20 (with a head dimension of 64)
- Precision: BF16
- Testbeds:
  - H100-1: A single server with 8 NVIDIA H100 GPUs (switch-based AlltoAll topology).
  - MI300X-1: A single server with 8 AMD MI300X GPUs (full-mesh AlltoAll topology).
  - H100-2: Two servers with 8 H100s each, interconnected via InfiniBand.
  - H100-4: Four servers with 8 H100s each, interconnected via InfiniBand.
- Evaluation Metrics:
  - t_comm (Total Communication Time): The total time spent on data transfers between accelerators.
  - t_comp (Total Computation Time): The total time spent on GPU computations (primarily the flash-attn kernel).
  - t_all (Overall Latency): The total end-to-end time for the attention layer, which is the sum of t_comm and t_comp.
  - CCR (Baseline Compute-to-Communication Ratio):
    - Conceptual Definition: This metric measures how compute-bound or communication-bound the baseline method (Ring Attention) is for a given workload. A low CCR (< 1.0) means communication takes more time than computation, so the system is communication-bound. A high CCR (> 1.0) means it is compute-bound.
    - Mathematical Formula:
      $$ \mathrm{CCR} = \frac{t_{\mathrm{comp}}^{\mathrm{baseline}}}{t_{\mathrm{comm}}^{\mathrm{baseline}}} $$
    - Symbol Explanation:
      - $t_{\mathrm{comp}}^{\mathrm{baseline}}$: the total computation time of the baseline method.
      - $t_{\mathrm{comm}}^{\mathrm{baseline}}$: the total communication time of the baseline method.
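A back-of-the-envelope CCR estimate can be written directly from these definitions. The cost model below (4 * S_q * S_k * h * d attention FLOPs, BF16 KV traffic, user-supplied peak rates) is our simplification, not the paper's measurement methodology, and the plugged-in numbers are placeholders.

```python
# Rough CCR model for one Ring Attention pass: per-GPU compute is the local
# Q shard against the full KV; per-GPU communication is receiving the
# (n-1)/n remote share of K and V in BF16 (2 bytes/element).
def estimate_ccr(seq_len: int, n_gpus: int, heads: int, head_dim: int,
                 peak_tflops: float, link_gb_per_s: float) -> float:
    shard = seq_len // n_gpus
    flops = 4 * shard * seq_len * heads * head_dim        # QK^T + PV matmuls
    t_comp = flops / (peak_tflops * 1e12)
    kv_bytes = 2 * (seq_len - shard) * heads * head_dim * 2   # K and V, bf16
    t_comm = kv_bytes / (link_gb_per_s * 1e9)
    return t_comp / t_comm

print(estimate_ccr(seq_len=100_000, n_gpus=8, heads=12, head_dim=64,
                   peak_tflops=400, link_gb_per_s=64))
```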
- Baselines:
  - Ring Attention: Used as the baseline for non-causal (full) attention masks.
  - Zig-zag Ring Attention: Used as the baseline for causal attention masks, as it is the state of the art for load-balanced causal Ring Attention.
6. Results & Analysis
This is a manual transcription of Table 1 from the paper.
| SeqLen | 10K | 20K | 40K | 50K | 100K |
|---|---|---|---|---|---|
| CCR | 0.39 | 0.65 | 0.80 | 0.98 | 1.17 |
| Speedup | 2.4 | 1.8 | 1.5 | 1.3 | 1.1 |
Table 1 demonstrates that as the sequence length increases, the CCR for Ring Attention also increases (computation becomes more dominant). Correspondingly, the speedup of TASP over Ring Attention decreases, highlighting that TASP's benefit is greatest when communication is the bottleneck (low CCR).
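This trend matches a crude analytical model. If compute and communication overlap fully and TASP cuts communication time by roughly (n-1)x while leaving compute unchanged (our simplifying assumptions, not the paper's analysis), then:

$$
\text{speedup} \;\approx\; \frac{\max(t_{\mathrm{comp}},\, t_{\mathrm{comm}})}{\max\!\left(t_{\mathrm{comp}},\, \tfrac{t_{\mathrm{comm}}}{n-1}\right)}
\;=\; \frac{\max(1,\, 1/\mathrm{CCR})}{\max\!\left(1,\, \tfrac{1}{(n-1)\,\mathrm{CCR}}\right)} ,
$$

which for n = 8 gives about 2.6 at CCR = 0.39 and approaches 1 as CCR nears 1.0, broadly consistent with the measured 2.4x and 1.1x.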
Core Results:
- Single-Node Evaluation:
  - TASP achieves significant speedups, especially when the CCR is low. On the MI300X-1, it achieved a maximum speedup of 3.58x, and on the H100-1, a maximum of 2.31x.
  - The performance gain on the AMD MI300X was higher than on the NVIDIA H100. The paper attributes this to the MI300X having lower inter-GPU communication bandwidth, which makes workloads on it more communication-bound and thus benefit more from TASP's communication optimizations.
  - As predicted by theory, the speedup is highest for low CCR values and diminishes as the CCR approaches and exceeds 1.0, where computation becomes the bottleneck, which TASP does not address.

Image 5 (Figure 5 in the paper): This figure plots the overall speedup (t_all speedup) against the baseline's CCR for single-node systems, with a cubic polynomial fit over red and blue point classes. The red dots, representing workloads with data transfers large enough to saturate bandwidth, show a clear trend: speedup is highest when the CCR is low and decreases as it increases. This confirms that TASP is most effective in communication-bound scenarios.
- Multi-Node Evaluation:
  - Two decomposition schemes were tested: the naive full-graph decomposition (K_mn) and the topology-aware (TA) decomposition.
  - On the 2-node H100-2 system, the naive scheme paradoxically performed better, achieving a 1.43x average speedup versus 1.20x for the TA scheme. The authors explain this as an implementation artifact: the naive scheme maps directly to a highly-optimized AlltoAll library call, while the more complex TA pattern had to be implemented with less optimized batched SendRecv calls.
  - Crucially, when scaling from 2 to 4 nodes, the naive scheme's speedup dropped (from 1.43x to 1.27x), while the TA scheme's speedup remained stable (1.20x on both). This demonstrates the superior scalability of the topology-aware decomposition, confirming its theoretical benefits.

Image 6 (Figure 6 in the paper): This figure shows the multi-node results, plotting speedup against CCR across configurations, with red and blue points distinguishing communication-time regimes. It highlights that the TA decomposition maintains a relatively constant speedup as the system scales, showcasing its better scalability compared to the naive decomposition.
- Overhead Analysis:
  - Communication Overhead: TASP's use of more complex communication primitives (AlltoAll or many batched SendRecvs) has a higher startup overhead than Ring Attention's simple SendRecv. This makes TASP slower for very small workloads (short sequences, small batches), as shown by the blue dots in the figures.
  - Computational Overhead: TASP introduces a minor computational overhead (1-5%) because it splits the KV cache into many more, smaller chunks. These chunks must be concatenated on the GPU before they can be processed by the flash-attn kernel. The authors note this could be optimized away with a custom kernel.
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully identifies and solves a critical performance bottleneck in Ring Attention-based sequence parallelism. By introducing a novel methodology of topology decomposition and primitive decomposition, TASP creates a communication strategy that is aware of and fully utilizes the AlltoAll interconnects of modern accelerators. This results in significant, sometimes dramatic, speedups in communication-bound scenarios, making long-context inference more practical.
- Limitations & Future Work: The authors acknowledge that TASP's benefits are directly tied to the compute-to-communication ratio (CCR). In strongly compute-bound scenarios (high CCR), the gains from communication optimization are marginal. Future work could involve creating custom GPU kernels to eliminate the minor computational overhead from KV cache concatenation.
- Personal Insights & Critique:
- Elegance of the Approach: The core idea of applying graph theory (Hamiltonian decomposition) to solve a systems-level performance problem is both elegant and powerful. It represents a first-principles approach to algorithm-hardware co-design.
- Generality: The methodology of decomposing topology and primitives is not limited to sequence parallelism. It could be applied to other distributed algorithms that rely on collective communication, such as tensor parallelism or general data-parallel training, whenever there is a mismatch between the communication pattern and the underlying hardware.
- Practical Implications: The finding that a theoretically superior algorithm can be outperformed by a theoretically inferior one due to library-level optimizations (AlltoAll vs. manual SendRecv) is a crucial practical insight. It underscores the need for communication libraries like NCCL/RCCL to support more flexible and complex communication patterns natively.
- Untested Assumptions: The paper's model of network topology is simplified. Real-world data centers often use more complex fat-tree networks, where not all AlltoAll paths are truly equal due to potential switch contention. An analysis on a real, large-scale cluster could reveal additional nuances. Nonetheless, the paper provides a robust and highly effective new direction for optimizing distributed LLM systems.