Paper status: completed

Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler

Published: 04/28/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper presents Triton-distributed, an extension of the Triton compiler, addressing programming challenges in distributed AI systems. It supports native overlapping optimizations, integrates OpenSHMEM communication primitives, and achieves complex joint optimization of computation, memory access, and communication to enhance performance on distributed AI workloads.

Abstract

In this report, we propose Triton-distributed, an extension of existing Triton compiler, to overcome the programming challenges in distributed AI systems. Triton-distributed is the first compiler that supports native overlapping optimizations for distributed AI workloads, providing a good coverage of existing optimizations from different frameworks. First, we integrate communication primitives compliant with the OpenSHMEM standard into the compiler. This enables programmers to utilize these primitives with a higher-level Python programming model. Second, we illustrate how to achieve complex joint optimization of computation, memory access, and communication with the assistance of the compiler. In particular, we show how to use overlapping techniques to hide latency and present our compiler-based programming methods in both single-node and multi-node scenarios. Finally, we showcase the performance of the code generated by our compiler. In a test environment with up to 64 devices, our compiler can fully utilize heterogeneous communication and computation resources to provide effective overlapping and high performance. In many cases, the performance of the generated code can even outperform hand-optimized code. Moreover, the development difficulty and the time cost for development using our compiler are far less than those of low-level programming such as CUDA/C++, which clearly demonstrates significant productivity advantages.


In-depth Reading


1. Bibliographic Information

1.1. Title

Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler

1.2. Authors

The authors are a large team primarily from ByteDance, with collaborators from Tsinghua University, Peking University, Shanghai Jiao Tong University, and Zhejiang University. The presence of researchers from a major tech company (ByteDance) and top academic institutions indicates a strong blend of industrial application and rigorous academic research, focusing on solving practical, large-scale AI infrastructure challenges.

1.3. Journal/Conference

The paper is available as a preprint on arXiv. Preprints are versions of scholarly papers that precede formal peer review and publication in a journal or conference. This allows for rapid dissemination of new research ideas.

1.4. Publication Year

The paper was submitted to arXiv with a timestamp of April 28, 2025.

1.5. Abstract

The paper introduces Triton-distributed, an extension of the Triton compiler designed to simplify programming on distributed AI systems. The core problem it addresses is the difficulty of jointly optimizing computation, memory access, and communication, which are often handled at different programming levels in existing frameworks. Triton-distributed is presented as the first compiler to offer native support for fine-grained overlapping optimizations. Key contributions include integrating OpenSHMEM-compliant communication primitives into Triton's Python-based programming model, demonstrating how to achieve complex latency-hiding optimizations through this model, and showcasing superior performance on clusters of up to 64 devices. The authors claim that the code generated by Triton-distributed can outperform hand-optimized CUDA/C++ code while significantly reducing development time and complexity.

2. Executive Summary

2.1. Background & Motivation

The training and inference of modern large language models (LLMs) have outgrown the capabilities of single accelerators (like GPUs). This necessitates the use of large, distributed systems composed of many accelerators. In such systems, performance is determined by three concurrent activities: computation (mathematical operations), memory access (moving data within a device), and communication (moving data between devices).

The core problem is that these three activities are typically optimized independently. AI developers work in high-level languages like Python, while performance engineers write low-level communication and computation code in CUDA/C++. This creates a "programming gap," making it extremely difficult to perform joint optimization, where computation and communication are carefully scheduled to overlap. Overlapping is a critical technique where a device performs computation on one piece of data while simultaneously communicating another piece of data, effectively hiding the communication latency and maximizing hardware utilization. Achieving this overlap traditionally requires expert-level, low-level programming, which is slow, error-prone, and inaccessible to most developers.

The paper's innovative entry point is to extend a high-level, Python-based compiler, Triton, to natively handle distributed communication and overlapping. This allows developers to program complex, fine-grained distributed logic entirely within Python, bridging the gap between algorithm development and systems optimization.

2.2. Main Contributions / Findings

The paper's primary contributions are:

  1. Proposal of Triton-distributed: An extension to the open-source Triton compiler that is the first of its kind to support native, fine-grained computation-communication overlapping for distributed AI workloads.

  2. A Unified High-Level Programming Model: It introduces a Python-level programming model based on symmetric memory, signal exchange, and async-tasks. This model integrates OpenSHMEM-compliant communication primitives, enabling developers to write distributed code with high-level abstractions.

  3. Comprehensive Overlapping Optimization Coverage: The framework is shown to cover a wide range of advanced overlapping techniques (13 in total, including swizzling, copy engine utilization, and low-latency protocols), many of which were previously only available in specialized, low-level frameworks.

  4. Demonstrated High Performance and Productivity: The authors provide extensive experimental results on both NVIDIA and AMD GPU clusters (up to 64 devices). The code generated by Triton-distributed not only achieves performance comparable to or exceeding hand-optimized baselines (like PyTorch+NCCL and specialized frameworks like FLUX) but also drastically reduces development complexity and time. For instance, it achieves speedups ranging from 1.09x to 44.97x over baselines.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • AI Compiler (Triton): In the context of AI, a compiler translates high-level code (like Python) into low-level machine code that can run efficiently on accelerators like GPUs. Triton is a specific open-source language and compiler developed by OpenAI. It allows developers to write GPU programs (called kernels) in a Python-like syntax, and the Triton compiler automatically optimizes this code to achieve performance close to what a human expert writing low-level CUDA code could achieve. Its main goal is to improve programmer productivity for single-GPU optimization.

  • Distributed AI Systems: When a model is too large to fit on one GPU, it must be split across multiple GPUs. This can be done in several ways:

    • Data Parallelism: Each GPU holds a full copy of the model and processes a different batch of data. Gradients are averaged across GPUs.
    • Tensor Parallelism (TP): Individual layers or tensors within the model are split across GPUs. This requires frequent communication between GPUs within a single forward/backward pass.
    • Mixture of Experts (MoE): The model consists of many "expert" sub-networks. For each input, only a few experts are activated. When experts are distributed across GPUs, this requires AllToAll communication to route tokens to their assigned experts.
  • Computation-Communication Overlapping: This is the core performance optimization technique in distributed systems. A GPU has specialized hardware units: compute cores (e.g., SMs on NVIDIA GPUs) for calculations and copy engines (DMAs) for data movement. Overlapping means scheduling a computation task on the compute cores at the same time a data transfer task is running on a copy engine or over the network. If the computation takes as long as or longer than the communication, the communication latency is effectively "hidden," leading to higher overall efficiency.

  • Collective Communication Operations: These are standard patterns of communication involving a group of processes (e.g., GPUs). Key examples in the paper include:

    • AllGather: Each GPU starts with a piece of data and ends up with a copy of the data from all other GPUs.
    • ReduceScatter: Each GPU starts with a piece of data. The data from all GPUs is combined (e.g., summed) through a reduction operation, and the final result is split and distributed (scattered) among the GPUs.
    • AllToAll: Each GPU has data to send to every other GPU. It's a complex permutation where every device is both a sender and a receiver to all others.
  • OpenSHMEM (SHMEM): This is a standard for one-sided communication. In traditional two-sided communication (like MPI), a send operation on one process must be matched by a receive operation on another. In one-sided communication, a process can directly perform put (write) or get (read) operations on the memory of a remote process without requiring the remote process to be actively involved at that moment. This is highly suitable for the fine-grained, asynchronous operations needed for overlapping. NVSHMEM (NVIDIA) and ROCSHMEM (AMD) are vendor-specific implementations of this standard.

3.2. Previous Works

The paper compares Triton-distributed against several existing frameworks and libraries:

  • NCCL (NVIDIA Collective Communications Library) / RCCL (ROCm Communication Collectives Library): These are highly optimized libraries from NVIDIA and AMD, respectively, that provide implementations of collective communication primitives. While extremely fast, they are typically "black boxes" that perform synchronization before and after execution, making it difficult to achieve fine-grained overlap with computation within the collective operation itself.
  • PyTorch: A popular deep learning framework. When used in a distributed setting, it relies on backends like NCCL for communication. Overlapping in PyTorch is typically coarse-grained, managed at the operator level using streams.
  • FLUX: A research framework specifically designed for computation-communication overlapping in tensor parallelism. It achieves this by fusing communication logic directly into computation kernels written in CUDA. While high-performance, it requires low-level programming.
  • DeepEP: A specialized framework for optimizing expert-parallel MoE models. It features highly hand-optimized AllToAll kernels written in thousands of lines of CUDA code, demonstrating the high engineering effort required for state-of-the-art performance.
  • Pallas, CoCoNet, TE (Tensor Expression): Other compiler-based approaches or domain-specific languages (DSLs) for performance optimization, but they either focus on single-device scenarios or lack the native support for the complex, overlapping communication patterns required by modern LLMs.

3.3. Technological Evolution

The field has evolved from manual, low-level optimization of distributed communication (using CUDA and libraries like NCCL) towards higher levels of abstraction. Initially, performance was king, and the complexity of CUDA/C++ was a necessary evil. Then, compilers like Triton emerged to simplify single-device kernel programming, boosting productivity. However, these compilers did not address distributed programming. Triton-distributed represents the next logical step in this evolution: extending a high-level, productive compiler framework to encompass the complexities of distributed, overlapping execution.

3.4. Differentiation Analysis

The core innovation of Triton-distributed compared to prior work is its unification of high-level programming with low-level distributed performance.

  • vs. NCCL/PyTorch: Triton-distributed enables fine-grained, compiler-assisted overlapping. Instead of treating communication as a monolithic block, it breaks it down into one-sided put/get operations that can be interleaved with computation at the tile level, all orchestrated from Python.

  • vs. FLUX/DeepEP: Triton-distributed achieves similar or better performance but with dramatically higher productivity. It replaces thousands of lines of complex CUDA/C++ with a few hundred lines of more maintainable Python code, lowering the barrier to entry for developing novel distributed optimizations.

  • vs. Other Compilers: It is the first compiler to natively support overlapping primitives for distributed AI workloads. While other compilers might target distributed systems, they typically do not provide the specific low-level control over asynchronous communication and synchronization needed for effective latency hiding.

    The comparison in Table 2 of the paper spans the following optimization dimensions: intra-node, inter-node, and inter-NUMA swizzling; copy-engine usage; high-bandwidth links; network communication; PCIe communication; OpenSHMEM support; low-latency protocols; the multimem feature; fusion; and code generation, plus NVIDIA/AMD hardware support. Triton-distributed is shown to cover all of the listed optimizations, whereas each baseline (NCCL, PyTorch, TE, Pallas, CoCoNet, FLUX, DeepEP) supports only a subset; on the hardware axis, only PyTorch and Triton-distributed support both NVIDIA and AMD GPUs, while the other frameworks target NVIDIA only.

4. Methodology

4.1. Principles

The core principle of Triton-distributed is to provide a high-level programming model that gives developers fine-grained control over distributed operations. This is achieved by abstracting low-level hardware features and communication protocols into Python-level primitives managed by the compiler. The compiler then translates these high-level descriptions into highly optimized, low-level code that effectively schedules computation and communication to run in parallel.

The methodology is built upon the MPMD (Multiple Programs Multiple Data) model, where different programs (e.g., a communication kernel and a computation kernel) can run in parallel on different data, coordinating to achieve a global task. The following figure provides an overview of the compilation stack.

Figure 2 Overview of our compilation stack. The figure categorizes users into four groups (power user, ordinary user, fresh hand, and the target user of this work) and traces the compilation workflow from user-written communication and computation scripts through multiple steps down to LLVM IR and the associated libraries, illustrating how the Triton-distributed compiler operates.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. The MPMD Programming Model

The programming model rests on three fundamental concepts, which are illustrated in the figure below. The left side shows how different ranks (processes) have their own symmetric memory and use signals to coordinate. The right side shows a timeline where intra-node communication, inter-node communication, and computation tasks run in parallel.

Figure 3 Explanation of Symmetric Memory, Signal Exchange, and Async-Task. The figure shows rank 0 and rank 1 on NODE 0 coordinating communication and computation through signal exchange over symmetric memory, together with a task graph laying out the computation and communication tasks across nodes.

  1. Symmetric Memory: Each process (or rank, typically one GPU) allocates a memory buffer that is visible to other ranks. However, this is not a globally unified address space. A rank cannot simply use a pointer to access another rank's memory. Instead, it must use specific communication primitives (like put or get) to transfer data to or from a remote memory buffer.

  2. Signal Exchange: This is the primary synchronization mechanism. A signal is a data object (e.g., an integer) residing in symmetric memory. Ranks can perform atomic operations on these signals (e.g., set, add, wait until value is X). This allows for producer-consumer relationships. For example, a communication task can set a signal to 1 after it has finished transferring a data tile, and a computation task can spin-lock (repeatedly check) on that signal, only proceeding to compute on that tile once the signal becomes 1.

  3. Async-Task: All operations, including data transfers and computations, are treated as asynchronous tasks that can run in parallel. On GPUs, this is typically implemented using multi-streaming, where different tasks are launched on different CUDA streams, allowing the GPU's hardware scheduler to execute them concurrently.
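To make these three concepts concrete, here is a minimal single-process Python analogy (using threads and NumPy, not the Triton-distributed API): a buffer stands in for symmetric memory, an integer array plays the role of per-tile signals, and a producer and a consumer run as separate async tasks, with the consumer spin-waiting on each tile's signal before computing on it.

```python
import threading
import time
import numpy as np

NUM_TILES = 4
symmetric_buf = np.zeros((NUM_TILES, 1024), dtype=np.float32)  # stands in for a symmetric-memory buffer
signals = np.zeros(NUM_TILES, dtype=np.int32)                  # one signal per tile
result = np.zeros(NUM_TILES, dtype=np.float32)

def producer():                      # async-task 1: "communication"
    for t in range(NUM_TILES):
        symmetric_buf[t] = t + 1.0   # stands in for a putmem of tile t from a remote rank
        signals[t] = 1               # signal_op: mark tile t as ready

def consumer():                      # async-task 2: "computation"
    for t in range(NUM_TILES):
        while signals[t] != 1:       # signal_wait_until: spin until tile t has arrived
            time.sleep(0)            # yield; a GPU kernel would spin on the signal in memory
        result[t] = symmetric_buf[t].sum()  # compute on the tile only after its signal is set

threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(result)  # each entry is computed strictly after its tile's signal was set
```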

4.2.2. Communication Primitives

Triton-distributed extends Triton with a set of communication primitives, divided into two categories. These primitives are the building blocks for creating complex distributed kernels.

The following table from the paper (Table 1) lists the available primitives:

Primitive Explanation
OpenSHMEM Primitives
my_pe Get the current device id
n_pes The number of devices in the world
int_p Put an integer to a remote device
remote_ptr Convert local shared memory pointer to remote pointer
barrier_all Barrier all the devices
sync_all Synchronize all the devices
quiet Ensure completion of shared memory operation of calling device
fence Ensure order of shared memory operation of calling device
getmem Blocking get data from remote device
getmem_nbi Non-blocking get data from remote device
putmem Blocking put data to remote device
putmem_nbi Non-blocking put data to remote device
putmem_signal Blocking put data and write signal to remote device
putmem_signal_nbi Non-blocking put data and write signal to remote device
signal_op Perform signal set/add operation to remote
signal_wait_until Wait on a local signal until the condition is met
broadcast Broadcast data into all the other ranks
non-OpenSHMEM Primitives
wait Wait locally on a signal until it equals a given value
consume_token Used with the wait primitive to create a data dependency
notify Notify a remote signal, similar to the signal_op primitive
atomic_cas Atomic compare and swap
atomic_add Atomic add value
ld_acquire Load with acquire semantic
red_release Reduction add with release semantic
multimem_ld_reduce Multimem load data and perform reduction
multimem_st Multimem broadcast data

The wait and consume_token primitives are particularly important. wait creates a dependency on a signal, and consume_token links this dependency to a subsequent memory load, ensuring the compiler schedules the load only after the signal has been satisfied. This is crucial for creating fine-grained dataflow pipelines.
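As a hypothetical sketch of this pattern, the snippet below shows how a consumer kernel might gate a tile load on a signal; the import path and the exact signatures of dl.wait and dl.consume_token are assumptions based on the primitive names in Table 1, not the verbatim Triton-distributed API.

```python
import triton
import triton.language as tl
from triton_dist import language as dl  # assumed module name, not verified

@triton.jit
def load_tile_after_signal(a_ptrs, signal_ptr, tile_id):
    # Spin until the producer has set this tile's signal to 1 (assumed signature).
    token = dl.wait(signal_ptr + tile_id, 1)
    # Tie the load to the wait so the compiler cannot hoist it above the signal check.
    a_ptrs = dl.consume_token(a_ptrs, token)
    return tl.load(a_ptrs)
```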

4.2.3. Example: Inter-node Overlapping AllGather GEMM

The paper provides a code example (Figure 4) showing how these concepts come together to implement an overlapping AllGather + GEMM operation, a common pattern in tensor parallelism.

Figure 4 Code Example of AllGather GEMM for Inter-node. The listing has two parts: a producer that coordinates inter-node data transfers and uses signal waiting to keep the data synchronized, and a consumer that performs the matrix multiplication and obtains its data through the signal mechanism, reflecting the compiler-based approach to overlapping computation and communication in distributed AI systems.

The implementation has three parts:

  • Communication Part (Producer): This kernel is responsible for the AllGather operation. It is written in Python using the new primitives. Different threadblocks are assigned to handle intra-node and inter-node communication in parallel.
  • Computation Part (Consumer): This is a standard Triton GEMM kernel, with minimal modification. The key additions are the wait and consume_token primitives. Before loading a tile of data for multiplication, the kernel waits on a signal corresponding to that tile. This ensures the computation for a given tile only begins after the communication part has finished transferring that tile's data.
  • Host Side: The host code allocates the symmetric memory buffers and launches the communication and computation kernels on different CUDA streams, enabling them to run concurrently.
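A host-side sketch of this launch pattern is shown below, assuming PyTorch CUDA streams; comm_kernel and gemm_kernel are placeholders for the user's producer and consumer kernels, and the plain torch.empty allocation stands in for the symmetric-memory allocator used in the real code.

```python
import torch

def launch_overlapped(comm_kernel, gemm_kernel, local_a, b, world_size):
    m, k = local_a.shape
    # Simplification: in Triton-distributed these buffers come from a symmetric allocator.
    gathered_a = torch.empty(world_size * m, k, device="cuda", dtype=local_a.dtype)
    out = torch.empty(world_size * m, b.shape[1], device="cuda", dtype=local_a.dtype)
    signals = torch.zeros(world_size, device="cuda", dtype=torch.int32)

    comm_stream, comp_stream = torch.cuda.Stream(), torch.cuda.Stream()
    with torch.cuda.stream(comm_stream):
        comm_kernel(local_a, gathered_a, signals)   # producer: transfers tiles, sets signals
    with torch.cuda.stream(comp_stream):
        gemm_kernel(gathered_a, b, out, signals)    # consumer: waits per tile, then computes
    torch.cuda.synchronize()                        # join both streams before using `out`
    return out
```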

4.2.4. Overlapping Optimization Techniques

The paper details how Triton-distributed can be used to implement various advanced overlapping optimizations.

Intra-node Communication with Copy Engine (AllGather and ReduceScatter)

For communication within a single node (e.g., 8 GPUs connected by NVLink), the dedicated GPU copy engines are used to offload data transfers from the main compute cores.

  • AllGather: The paper describes two modes. In push mode (Algorithm 1), each rank actively copies its local data to all other ranks' buffers. In pull mode (Algorithm 2), each rank first copies its data to its own symmetric buffer and then waits for all other ranks to do the same before pulling the required data from them.
  • ReduceScatter: This is the reverse of AllGather. As shown in Algorithm 3, it's implemented with two parallel streams. One stream waits for the producer (GEMM kernel) to generate a tile of data and then pushes it to other ranks. The second stream waits for signals indicating that it has received tiles from all other ranks, and then performs the local reduction (summation).
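The contrast between the push and pull AllGather modes described above can be illustrated with a single-process NumPy analogue (no real copy-engine or inter-GPU traffic) for four hypothetical ranks:

```python
import numpy as np

WORLD = 4
local = [np.full(8, r, dtype=np.float32) for r in range(WORLD)]        # each rank's shard
symm = [np.zeros((WORLD, 8), dtype=np.float32) for _ in range(WORLD)]  # symmetric buffers

# Push mode: every rank actively writes its shard into every peer's buffer.
for src in range(WORLD):
    for dst in range(WORLD):
        symm[dst][src] = local[src]          # putmem from src into dst's buffer

# Pull mode: every rank first publishes its shard locally, then reads peers' shards.
for r in range(WORLD):
    symm[r][r] = local[r]                    # publish own shard
for dst in range(WORLD):
    for src in range(WORLD):
        symm[dst][src] = symm[src][src]      # getmem: dst pulls src's published shard

# Either way, every rank ends up with all shards.
assert all((symm[r] == np.arange(WORLD, dtype=np.float32)[:, None]).all() for r in range(WORLD))
```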

Inter-node AllGather with Low-latency Protocol

For inter-node communication, latency is a major bottleneck, especially for small messages. The naive approach of looping through P2P transfers introduces serialization and signal overhead. The timeline comparison below (Figure 5) illustrates this problem and the proposed solution.

Figure 5 The Timeline of Baseline AllGather and Low-latency AllGather. The left timeline shows the baseline AllGather, with a worst case of roughly 25 µs; the right shows the optimized AllGather, which stabilizes at about 13.5 µs, a substantial latency reduction.

To achieve low latency, Triton-distributed implements a low-latency (LL) protocol.

  • For intra-node broadcast, it uses the multimem_st primitive, which maps to a special hardware instruction on NVIDIA GPUs (multimem) to broadcast data to all GPUs on an NVLink fabric simultaneously.
  • For inter-node transfer, it uses the LL protocol, where an 8-byte chunk containing both data and a flag is sent atomically. The receiver spins on the flag to detect data arrival. This eliminates the need for separate signal operations, reducing overhead, but it doubles the message size, making it suitable only for small messages.
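The idea behind the LL protocol can be sketched on the host with Python's struct module: a 4-byte payload and a 4-byte flag are packed into one 8-byte word so that a single write delivers both, and the receiver detects arrival by inspecting the flag alone. This is an illustration of the protocol's layout, not the GPU implementation.

```python
import struct

FLAG = 1

def ll_pack(payload: int) -> bytes:
    # 4 bytes of data plus 4 bytes of flag, delivered as a single 8-byte word.
    return struct.pack("<iI", payload, FLAG)

def ll_poll(word: bytes):
    payload, flag = struct.unpack("<iI", word)
    return payload if flag == FLAG else None  # receiver spins until the flag appears

slot = bytearray(8)                           # one 8-byte slot in the receive buffer
assert ll_poll(bytes(slot)) is None           # nothing arrived yet (flag still 0)
slot[:] = ll_pack(42)                         # the "network" delivers data and flag together
assert ll_poll(bytes(slot)) == 42             # arrival detected without a separate signal op
```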

Inter-node ReduceScatter with Heterogeneous Communication

This operation is decomposed into three stages: intra-node scatter, local reduction, and inter-node P2P communication. To maximize overlap, these stages are carefully scheduled across different GPU resources. As shown in the timeline (Figure 9), the intra-node scatter uses the copy engine, while the local reduction and inter-node P2P communication use a small, partitioned number of compute cores (SMs). This ensures that the main GEMM computation, which uses the bulk of the SMs, is minimally impacted.

Figure 9 The Timeline of Inter-node GEMM ReduceScatter. The GEMM kernel (producer) runs on 116 SMs while other streams handle device-to-device copies, the local reduction kernel, and the inter-node P2P kernel, showing how the different streams interact and overlap.
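A simplified NumPy sketch of this three-stage decomposition (single process, hypothetical shapes, 2 nodes with 2 GPUs each) is shown below; it checks that intra-node scatter plus local reduction followed by inter-node P2P accumulation reproduces the full ReduceScatter result.

```python
import numpy as np

NODES, LOCAL = 2, 2
WORLD = NODES * LOCAL
rng = np.random.default_rng(0)
# Each rank's GEMM output: one row (shard) destined for every rank in the world.
outputs = [rng.standard_normal((WORLD, 8)) for _ in range(WORLD)]

# Stage 1: intra-node scatter -- inside each node, collect every shard's rows from the
# node's local ranks (this traffic uses the copy engine in the paper's design).
scattered = [
    [[outputs[n * LOCAL + l][r] for l in range(LOCAL)] for r in range(WORLD)]
    for n in range(NODES)
]

# Stage 2: local reduction -- sum the scattered rows within each node.
partial = [[np.sum(rows, axis=0) for rows in node] for node in scattered]

# Stage 3: inter-node P2P -- each rank fetches the other node's partial sum of its own
# shard and accumulates it (handled by a small partition of SMs in the paper's design).
result = []
for r in range(WORLD):
    node = r // LOCAL
    acc = partial[node][r].copy()
    for peer in range(NODES):
        if peer != node:
            acc = acc + partial[peer][r]
    result.append(acc)

expected = np.sum(outputs, axis=0)  # reference: reduce everything, then take each rank's row
assert all(np.allclose(result[r], expected[r]) for r in range(WORLD))
```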

Optimizations for AMD GPUs

The paper demonstrates portability by applying similar principles to AMD GPUs. Key differences include:

  • Topology: AMD MI308X GPUs use a full-mesh interconnect, whereas NVIDIA H800 uses NVSwitches. This requires different communication scheduling to saturate bandwidth.
  • API behavior: The authors found that AMD's driver APIs for signaling (hipStreamWriteValue) interfered with compute kernels, forcing them to use alternative signaling methods or fuse communication directly into the producer kernel to avoid explicit signals.

Overlapping Computation with Swizzling

Swizzling is the technique of reordering the sequence in which a computation kernel processes data tiles. To achieve perfect overlap, the computation order must align with the communication order. Triton-distributed enables this by allowing the tile mapping logic in the kernel to be customized.
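As a minimal illustration (plain Python, not the kernel code), a round-robin swizzle such as the NVIDIA intra-node case described below can be expressed as a per-rank reordering of tile indices, with each rank starting from its own local shard:

```python
WORLD = 4

def swizzled_tile_order(rank: int, world: int = WORLD):
    # Step 0 computes on the rank's own shard; step k computes on the shard that is
    # scheduled to arrive at step k of the round-robin AllGather.
    return [(rank + step) % world for step in range(world)]

for r in range(WORLD):
    print(f"rank {r} computes tiles in order {swizzled_tile_order(r)}")
# rank 0 -> [0, 1, 2, 3], rank 1 -> [1, 2, 3, 0], rank 2 -> [2, 3, 0, 1], ...
```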

  • For NVIDIA GPUs (AllGather GEMM): Since each NVLink connection has high bandwidth, each rank gathers data from one other rank at a time in a round-robin fashion. The swizzling logic (Figure 7) makes each rank start its GEMM computation on a different tile, corresponding to the data it is scheduled to receive first.

    Figure 7 Swizzle Example for Intra-node AllGather GEMM on Nvidia GPUs. (Assume 4 ranks.) The figure shows four steps, illustrating how each rank uses its local data and the data gathered from its peers for computation and accumulation.

  • For AMD GPUs (AllGather GEMM): The individual links have lower bandwidth, so to maximize throughput, each rank must gather data from all other ranks simultaneously. The swizzling logic (Figure 8) reflects this by having each rank process sub-chunks from all peers in each step.

    Figure 8 Swizzle Example for Intra-node AllGather GEMM on AMD GPUs. (Assume 4 ranks.) The figure is divided into four steps, in each of which every rank gathers data from the other ranks and computes on it, tracing the data movement and computation at every step.

  • For Inter-node GEMM ReduceScatter: This is the most complex example (Figure 10). The swizzling strategy is designed to hide both intra-node and inter-node communication latency. Each rank starts computation on the data chunks that are needed by the other node first. This ensures that the data for the first inter-node P2P transfer is ready as early as possible, maximizing overlap.


Auto-Tuning and Resource Partition

  • Auto-Tuning: Triton-distributed includes a distributed autotuner. Unlike single-device tuners, it can profile a complete distributed workflow (communication + computation) across multiple devices, reset synchronization state (signals) between runs, and aggregate results to find the globally optimal configuration (e.g., tile sizes).
  • Resource Partition: This involves manually assigning different numbers of compute units (SMs) to the concurrent asynchronous tasks (e.g., main computation, reduction, P2P communication). The goal is to balance the execution time of all parallel tasks to avoid any single task becoming a bottleneck (a "long tail").
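A simplified sketch of such a distributed autotuning loop is given below, assuming torch.distributed has already been initialized; run_workflow and reset_signals are placeholders for the user's benchmarked workflow and signal-reset routine, and the per-configuration cost is taken as the maximum latency across ranks.

```python
import torch
import torch.distributed as dist

def distributed_autotune(run_workflow, reset_signals, configs, iters=10):
    best_cfg, best_ms = None, float("inf")
    for cfg in configs:
        dist.barrier()                               # all ranks start the config together
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            run_workflow(cfg)                        # full communication + computation workflow
            reset_signals()                          # clear synchronization state between runs
        end.record()
        torch.cuda.synchronize()
        t = torch.tensor([start.elapsed_time(end) / iters], device="cuda")
        dist.all_reduce(t, op=dist.ReduceOp.MAX)     # a config is only as fast as its slowest rank
        if t.item() < best_ms:
            best_cfg, best_ms = cfg, t.item()
    return best_cfg, best_ms
```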

5. Experimental Setup

5.1. Datasets

The paper does not use traditional datasets like ImageNet or Wikipedia. Instead, it evaluates the performance of individual distributed kernels (AllGather + GEMM, ReduceScatter, etc.) on synthetic data. The performance of these kernels is a function of the tensor shapes (e.g., matrix dimensions, number of experts, sequence length), which are varied across experiments to simulate workloads from real-world LLMs. The specific shapes used are detailed in the results tables and charts.

5.2. Evaluation Metrics

The primary evaluation metric is performance, measured in one of two ways:

  1. Latency: The time taken to complete an operation, measured in milliseconds (ms) or microseconds (µs). Lower latency is better.

  2. Throughput/Bandwidth: The rate at which data is processed or transferred, measured in Gigabytes per second (GB/s) or Terabytes per second (TB/s). Higher throughput is better.

    The paper also frequently reports Speedup, which is a relative performance measure.

  • Conceptual Definition: Speedup quantifies how many times faster one system (e.g., Triton-distributed) is compared to a baseline system (e.g., PyTorch+NCCL).
  • Mathematical Formula: $\text{Speedup} = \frac{\text{Latency}_{\text{baseline}}}{\text{Latency}_{\text{ours}}}$
  • Symbol Explanation:
    • $\text{Latency}_{\text{baseline}}$: The execution time of the baseline method.
    • $\text{Latency}_{\text{ours}}$: The execution time of the proposed method (Triton-distributed).
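For illustration, latencies of this kind are typically measured with CUDA events; the sketch below (assuming PyTorch, not the paper's actual benchmarking harness) shows such a measurement and the resulting speedup computation.

```python
import torch

def measure_ms(fn, warmup=5, iters=20):
    for _ in range(warmup):
        fn()                                   # warm up caches and JIT-compiled kernels
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters     # average latency in milliseconds

# speedup = baseline latency / our latency (higher is better), e.g.:
# speedup = measure_ms(baseline_op) / measure_ms(triton_distributed_op)
```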

5.3. Baselines

The paper compares its generated kernels against several state-of-the-art baselines:

  • PyTorch+NCCL/RCCL: This represents the standard, widely-used approach for distributed training in the PyTorch ecosystem. NCCL (for NVIDIA) and RCCL (for AMD) are highly optimized communication libraries.

  • FLUX: A specialized, low-level CUDA framework for overlapping tensor-parallel operations. This serves as a strong baseline for hand-optimized performance. The paper uses performance numbers reported in the original FLUX paper for comparison.

  • DeepEP: A highly-optimized, hand-written CUDA framework for expert-parallel MoE models. It represents the pinnacle of manual optimization for AllToAll communication.

  • NVSHMEM: NVIDIA's implementation of the OpenSHMEM standard. This is used as a baseline for low-latency communication on PCIe-based systems.

    These baselines are representative because they cover the spectrum from standard industry practice (PyTorch+NCCL) to highly specialized, expert-tuned research frameworks (FLUX, DeepEP).

6. Results & Analysis

The experiments were conducted on clusters of NVIDIA H800, NVIDIA L20, and AMD MI308X GPUs, at scales ranging from 8 to 64 devices.

6.1. Core Results Analysis

6.1.1. Intra-node Performance on Nvidia H800 GPUs (8 GPUs)

  • AllGather + GEMM: As shown in Figure 11, Triton-distributed outperforms both PyTorch+NCCL (by 1.42x on average) and the specialized FLUX framework (by 1.09x on average). This is a significant result, as it shows that the compiler-generated code can even beat a dedicated, hand-optimized CUDA framework.

    Figure 11 Performance of Intra-node AllGather GEMM on 8 H800 GPUs. The chart shows the relative performance of PyTorch+NCCL, FLUX, and Triton-distributed on a single H800x8 node across different input shapes.

  • GEMM + ReduceScatter: Figure 12 shows a similar trend. Triton-distributed achieves an average speedup of 1.28x over PyTorch+NCCL and 1.30x over FLUX. This demonstrates the effectiveness of its asynchronous scatter and reduction scheduling.

    Figure 12 Performance of Intra-node GEMM ReduceScatter on 8 H800 GPUs. The chart compares PyTorch+NCCL, FLUX, and Triton-distributed on a single H800x8 node across different settings, highlighting the compiler's performance advantage.

  • Mixture of Experts (MoE) Kernels: For MoE workloads, the baseline PyTorch+NCCL implementation is very inefficient. Triton-distributed achieves massive speedups: 44.97x for AllGather + MoE and 15.55x for MoE + ReduceScatter. While impressive, it's important to note this comparison is against a weak baseline. The absolute performance numbers are more informative.

    The following are the results from Table 4 of the original paper:

    Name  tokens/rank  in-hidden  out-hidden  experts  topk  Ours-Intra  Ours-Inter  PyTorch-Intra  PyTorch-Inter  (latency; lower is better)
    AG+MoE-1 256 2048 1408 60 4 0.33 0.45 23.95 28.84
    AG+MoE-2 512 2048 1408 60 4 0.40 1.37 26.25 29.77
    AG+MoE-3 1024 2048 1408 60 4 0.58 1.80 30.42 43.31
    AG+MoE-4 2048 2048 1408 60 4 0.97 3.07 55.63 63.73
    AG+MoE-5 256 14336 4096 8 2 0.54 1.01 7.05 19.92
    AG+MoE-6 512 14336 4096 8 2 0.72 1.89 26.34 36.07
    AG+MoE-7 1024 14336 4096 8 2 1.19 3.41 52.99 67.61
    AG+MoE-8 2048 14336 4096 8 2 2.10 6.51 107.32 129.30
    AG+MoE-9 256 16384 6144 8 2 0.81 1.39 11.02 27.29
    AG+MoE-10 512 16384 6144 8 2 1.06 2.21 39.65 52.32
    AG+MoE-11 1024 16384 6144 8 2 1.66 4.32 80.46 101.61
    AG+MoE-12 2048 16384 6144 8 2 2.92 8.28 159.69 192.67
    AG+MoE-13 512 1408 2048 64 6 0.45 0.84 29.25 38.17
    AG+MoE-14 1024 1408 2048 64 6 0.67 1.26 48.86 56.77
    AG+MoE-15 2048 1408 2048 64 6 1.18 2.18 74.26 90.44

The following are the results from Table 5 of the original paper:

Name  tokens/rank  in-hidden  out-hidden  experts  topk  Ours-Intra  Ours-Inter  PyTorch-Intra  PyTorch-Inter  (latency; lower is better)
MoE-RS-1 1024 1536 2048 8 2 0.51 3.62 4.35 12.41
MoE-RS-2 1024 1536 2048 32 2 0.55 3.90 13.89 33.05
MoE-RS-3 1024 1536 2048 64 2 0.67 4.82 27.91 61.70
MoE-RS-4 1024 1536 2048 32 5 0.92 7.78 14.48 35.35
MoE-RS-5 1024 1536 2048 64 5 0.93 8.25 29.96 64.88
MoE-RS-6 1024 2048 4096 8 2 0.98 7.00 5.02 17.93
MoE-RS-7 1024 2048 4096 32 2 1.08 7.86 14.12 38.24
MoE-RS-8 1024 2048 4096 64 2 1.34 9.87 28.61 66.48
MoE-RS-9 1024 2048 4096 32 5 1.84 15.51 16.70 44.37
MoE-RS-10 1024 2048 4096 64 5 1.86 16.60 27.71 71.82

6.1.2. Inter-node Performance on Nvidia H800 GPUs (up to 64 GPUs)

  • AllGather + GEMM (16 GPUs): On two nodes, Triton-distributed achieves an average speedup of 1.33x over PyTorch+NCCL and reaches 95.6% of the performance of FLUX (Figure 13). This shows its effectiveness extends to multi-node scenarios.

    Figure 13 Performance of Inter-node AllGather GEMM on 16 H800 GPUs. The chart shows the relative performance of PyTorch+NCCL, FLUX, and Triton-distributed when executing AllGather GEMM across two H800 nodes.

  • GEMM + ReduceScatter (16 GPUs): Similarly, it achieves 96.36% of FLUX's performance and a 1.42x speedup over PyTorch+NCCL (Figure 14).

    Figure 14 Performance of Inter-node GEMM ReduceScatter on 16 H800 GPUs. The bar chart compares PyTorch+NCCL, FLUX, and Triton-distributed on 2 H800x8 nodes, with Triton-distributed showing the best performance.

  • Distributed Flash Decoding: The paper presents a novel distributed implementation of FlashAttention-style decoding. Figure 15 shows that the implementation scales well. In a weak scaling scenario (KV cache per GPU is fixed), the HBM bandwidth per GPU remains high even at 32 GPUs. This demonstrates the effectiveness of the low-latency AllGather kernel and enables efficient inference for models with extremely long contexts.

    Figure 15 Performance of Distributed Flash Decoding. Three panels show strong-scaling latency for 1, 2, and 4 nodes across different global KV lengths, weak-scaling bandwidth for different GPU counts at a fixed local KV length, and strong-scaling bandwidth for different GPU counts across global KV lengths, characterizing the system's performance and resource utilization.

  • Low-latency AllToAll: Triton-distributed was used to reimplement the complex AllToAll kernel from DeepEP in just a few hundred lines of Python. Figure 16 shows that this Python-based version consistently outperforms the hand-written CUDA version of DeepEP on up to 32 GPUs (1.18x speedup for Dispatch, 1.44x for Combine). The paper notes DeepEP is faster at 64 GPUs and beyond due to its use of a more scalable network backend (IBGDA), which is a target for future work in Triton-distributed.

    Figure 16 Performance of AllToAll. The left panel shows AllToAll Dispatch and the right panel AllToAll Combine, comparing the relative performance of Triton-distributed and DeepEP across different GPU counts.

  • Low-latency AllGather on PCIe (L20 GPUs): On a different hardware setup with slower PCIe interconnects, Triton-distributed still provides the lowest latency compared to NVSHMEM and NCCL, demonstrating its robustness across different hardware environments (Figure 19).

    Figure 19 Performance of Low-latency AllGather on L20 GPUs. The chart shows latency across message sizes for NVSHMEM 32-bit, NVSHMEM 64-bit, and Triton-distributed, compared on 8xL20 and 16xL20 setups.

6.1.3. Intra-node Performance on AMD GPUs (8 GPUs)

To prove its generality, the paper shows results on AMD MI308X GPUs.

  • AllGather + GEMM: Achieves an average speedup of 1.09x over PyTorch+RCCL (Figure 17).

  • GEMM + ReduceScatter: Achieves an average speedup of 1.16x over PyTorch+RCCL (Figure 18).

    These results are significant because they demonstrate that the high-level concepts and primitives in Triton-distributed are portable and can be used to generate high-performance code for different hardware backends, not just NVIDIA's.

    Figure 17 Performance of Intra-node AllGather GEMM on AMD GPUs. The chart compares relative performance on a single 8xMI308X node, with green bars for PyTorch+RCCL and orange bars for Triton-distributed across different configurations.

    Figure 18 Performance of Intra-node GEMM ReduceScatter on AMD GPUs. The chart shows relative performance on a single 8xMI308X node, comparing PyTorch+RCCL and Triton-distributed across different input parameters.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully introduces Triton-distributed, a powerful extension to the Triton compiler that bridges the gap between high-level AI development and low-level distributed systems optimization. By integrating OpenSHMEM-compliant primitives and a well-defined programming model into Python, it enables developers to implement sophisticated computation-communication overlapping strategies with unprecedented ease. The experimental results robustly demonstrate that this approach not only drastically improves developer productivity but also generates code that is highly competitive with, and often superior to, manually optimized low-level code across a variety of workloads and hardware platforms (NVIDIA and AMD). This work marks a significant step towards democratizing high-performance distributed computing for the broader AI community.

7.2. Limitations & Future Work

The authors acknowledge a few areas for future improvement:

  • Hardware/Software Stack Dependencies: The current implementation relies on hardware vendors providing an OpenSHMEM-compliant library (NVSHMEM, ROCSHMEM) and a bitcode library for compiler integration. Broader adoption across more NPUs would require these vendors to support the standard.
  • Scalability on Different Network Backends: The AllToAll performance comparison with DeepEP revealed that the choice of underlying network protocol (IBRC vs. IBGDA) impacts scalability at very large node counts. Future work includes integrating support for more scalable backends like IBGDA directly into the compiler.
  • Dedicated Kernels for More Workloads: The scaling of the MoE+RS-inter kernel was suboptimal, indicating that a more specialized inter-node ReduceScatter kernel is needed for that specific workload. This suggests that while the framework is general, achieving peak performance on every workload may still require some specialized kernel design.

7.3. Personal Insights & Critique

  • Major Advancement in Productivity: The most significant contribution of this paper is the massive improvement in developer productivity. The ability to express complex, fine-grained overlapping patterns in a high-level language like Python is a game-changer. It lowers the barrier to entry for performance optimization, allowing more researchers and engineers to experiment with novel distributed algorithms without needing to be CUDA experts.
  • Critique of Baselines: While the performance results are strong, some of the most dramatic speedups (e.g., >40x for MoE kernels) are against a known-to-be-inefficient PyTorch baseline. The more telling comparisons are against specialized frameworks like FLUX and DeepEP, where Triton-distributed shows it is competitive or superior, which is a very strong result.
  • The Power of Compiler Abstractions: This work is a testament to the power of compiler technology. By providing the right set of high-level abstractions (signals, async-tasks) and a compiler smart enough to lower them to efficient hardware-specific code, Triton-distributed hides immense complexity from the user. This paradigm of "compiler-assisted performance engineering" is likely to become increasingly important as hardware and systems grow more complex.
  • Future Implications: The foundation laid by Triton-distributed could be extended in many exciting ways. For instance, the compiler could incorporate more automated resource partitioning and scheduling, moving from a manual/autotuned approach to a fully analytical one. Furthermore, as new interconnect technologies and hardware features emerge, they can be exposed as new primitives within the compiler, allowing developers to quickly leverage them without rewriting their entire codebase. The work also paves the way for higher-level compilers (like the mentioned TileLink) to be built on top, creating an even more abstract and powerful programming environment for distributed AI.

Similar papers

Recommended via semantic vector search.

No similar papers found yet.