
RDMA Point-to-Point Communication for LLM Systems

Published: 11/01/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

TransferEngine offers a portable RDMA point-to-point communication interface with an ImmCounter completion primitive, achieving 400 Gbps peak throughput on both NVIDIA ConnectX-7 and AWS EFA. It supports disaggregated inference, asynchronous RL fine-tuning, and MoE routing while avoiding vendor lock-in.

Abstract

Emerging Large Language Model (LLM) system patterns, such as disaggregated inference, Mixture-of-Experts (MoE) routing, and asynchronous reinforcement fine-tuning, require flexible point-to-point communication beyond simple collectives. Existing implementations are locked to specific Network Interface Controllers (NICs), hindering integration into inference engines and portability across hardware providers. We present TransferEngine, which bridges the functionality of common NICs to expose a uniform interface. TransferEngine exposes one-sided WriteImm operations with an ImmCounter primitive for completion notification, without ordering assumptions of network transport, transparently managing multiple NICs per GPU. We demonstrate peak throughput of 400 Gbps on both NVIDIA ConnectX-7 and AWS Elastic Fabric Adapter (EFA). We showcase TransferEngine through three production systems: (1) KvCache transfer for disaggregated inference with dynamic scaling, (2) RL weight updates achieving 1.3 seconds for trillion-parameter models, and (3) MoE dispatch/combine implementation exceeding DeepEP decode latency on ConnectX-7, with the first viable latencies on EFA. We demonstrate that our portable point-to-point communication complements collectives while avoiding lock-in.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

RDMA Point-to-Point Communication for LLM Systems

1.2. Authors

Nandor Licker, Kevin Hu, Vladimir Zaytsev, Lequn Chen. The paper does not provide author affiliations or research backgrounds within the text.

1.3. Journal/Conference

The paper lists a publication timestamp of 2025-10-31T17:28:22 UTC, indicating a recent preprint. Its focus on LLM systems, distributed computing, and high-performance networking suggests it would typically be presented at top-tier venues in machine learning systems (MLSys, NeurIPS Systems Track), computer architecture (ISCA, ASPLOS), or distributed systems (OSDI, NSDI).

1.4. Publication Year

2025

1.5. Abstract

The paper introduces TransferEngine, a portable Remote Direct Memory Access (RDMA) communication library designed to address the limitations of existing, vendor-locked solutions for emerging Large Language Model (LLM) system patterns. These patterns, such as disaggregated inference, Mixture-of-Experts (MoE) routing, and asynchronous reinforcement fine-tuning, require flexible point-to-point communication beyond traditional collectives. TransferEngine bridges various Network Interface Controllers (NICs), including NVIDIA ConnectX-7 and AWS Elastic Fabric Adapter (EFA), exposing a uniform interface. It utilizes one-sided WRITEIMM operations with a novel ImmCounter primitive for completion notification, bypassing network transport ordering assumptions and transparently managing multiple NICs per GPU. The system achieves peak throughput of 400 Gbps on both ConnectX-7 and EFA. The paper demonstrates TransferEngine's utility through three production systems: KvCache transfer for dynamic disaggregated inference, RL weight updates (achieving 1.3 seconds for trillion-parameter models), and MoE dispatch/combine (outperforming DeepEP on ConnectX-7 and providing the first viable latencies on EFA). The authors conclude that TransferEngine offers portable point-to-point communication that complements collectives, avoiding vendor lock-in.

https://arxiv.org/abs/2510.27656

http://arxiv.org/pdf/2510.27656v1 (PDF). The paper is available as a preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The rapidly evolving landscape of Large Language Model (LLM) systems is introducing new architectural patterns that demand more flexible and dynamic communication capabilities than traditionally offered by collective communication libraries. These emerging patterns include:

  • Disaggregated Inference: Separating different stages of LLM inference (e.g., prefill and decode) onto distinct computing resources. This requires efficient, flexible communication between these stages.

  • Mixture-of-Experts (MoE) Routing: Models that dynamically route different parts of an input to specialized "expert" sub-networks. This necessitates dynamic, sparse, and often asynchronous point-to-point communication among experts spread across many devices.

  • Asynchronous Reinforcement Fine-Tuning: Training paradigms where model updates and inference rollouts occur asynchronously on separate sets of GPUs, requiring fast and efficient weight synchronization.

    Traditional LLM frameworks heavily rely on collective communication (e.g., NCCL, torch.distributed). While excellent for static patterns like tensor or data parallelism, these collectives impose significant constraints:

  1. Fixed Membership: Requires all participants to be known in advance, hindering dynamic scaling.

  2. Synchronized Initialization: Adds overhead and global coordination requirements.

  3. Uniform Buffer Sizes: Forces dense communication even for sparse patterns.

    While these libraries offer basic point-to-point primitives (SEND/RECV), they often cannot be composed effectively for low-latency, flexible scenarios.

Remote Direct Memory Access (RDMA) offers high-throughput, low-latency communication, which has long been used in high-performance computing. However, a key barrier to its adoption in LLM frameworks is hardware diversity and vendor lock-in. Cloud providers deploy different RDMA implementations:

  • NVIDIA ConnectX NICs use Reliable Connection (RC) transport, which offers in-order delivery.

  • AWS Elastic Fabric Adapter (EFA) uses a proprietary Scalable Reliable Datagram (SRD) protocol, which is inherently out-of-order.

    Existing RDMA solutions are often tied to specific hardware (e.g., DeepEP to ConnectX's GPU-initiated RDMA, NVSHMEM's performance degradation on EFA, or lack of EFA support in NIXL/Mooncake). This vendor lock-in prevents seamless integration into LLM inference engines and portability across different cloud providers, hindering widespread adoption for these critical emerging LLM workloads.

The paper aims to solve this problem by providing a portable, high-performance point-to-point communication library that abstracts away hardware differences and addresses the specific communication needs of modern LLM systems, thereby avoiding vendor lock-in.

2.2. Main Contributions / Findings

The paper makes several significant contributions to the field of high-performance communication for LLMs:

  1. Portable RDMA Abstraction (TransferEngine): It introduces TransferEngine, a novel communication library that bridges the functionality of diverse RDMA hardware (NVIDIA ConnectX-7 and AWS EFA) to expose a uniform, minimal application programming interface (API). This is achieved by leveraging the commonality of reliable but unordered delivery across these different transport protocols, fundamentally addressing the vendor lock-in problem.

  2. Novel Completion Notification Primitive (ImmCounter): TransferEngine exposes one-sided WRITEIMM operations complemented by a unique ImmCounter primitive for completion notification. This mechanism operates without relying on network transport ordering assumptions, which is crucial for supporting inherently out-of-order protocols like EFA's SRD, while also being efficient for ConnectX's RC by ignoring its ordering guarantees where not strictly necessary.

  3. Transparent Multi-NIC Management: The library transparently manages multiple Network Interface Controllers (NICs) per GPU. This is particularly important for aggregating bandwidth, especially on platforms like AWS EFA where multiple lower-bandwidth NICs (e.g., four 100 Gbps NICs) must be combined to achieve peak performance (400 Gbps).

  4. Demonstrated Peak Throughput: The paper empirically demonstrates that TransferEngine achieves peak throughput of 400 Gbps on both NVIDIA ConnectX-7 and AWS EFA, showcasing its high performance across heterogeneous hardware.

  5. Validation in Production Systems: The effectiveness and practicality of TransferEngine are validated through its integration into and demonstration with three distinct production LLM system patterns:

    • KvCache Transfer for Disaggregated Inference: Enables flexible, dynamic scaling and low-latency layer-by-layer transfers for disaggregated inference on EFA, complete with CUDA Graph support.

    • RL Weight Updates: Achieves remarkably fast 1.3-second updates for trillion-parameter models, utilizing full cluster bandwidth via one-sided RDMA WRITE from training GPUs to inference GPUs, a 100x speedup over existing RL frameworks.

    • MoE Dispatch/Combine: Delivers state-of-the-art decode latency on ConnectX-7, competitive with specialized DeepEP kernels despite using a host proxy thread, and provides the first viable latencies for MoE on EFA by relying on parallel token and route transfers.

      In essence, the paper provides a crucial building block for cloud-native LLM deployments, enabling high-performance, portable point-to-point communication that is essential for the flexibility and scale required by modern LLM architectures, thereby complementing traditional collective communication.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a reader should be familiar with several key concepts in distributed systems, high-performance computing, and machine learning.

  • Large Language Models (LLMs): These are advanced artificial intelligence models capable of understanding and generating human-like text. They are typically very large, with billions or even trillions of parameters, requiring significant computational resources and often distributed processing.

    • LLM System Patterns:
      • Disaggregated Inference: A system architecture where different parts or stages of an LLM inference pipeline (e.g., the initial "prefill" stage that processes input tokens and the subsequent "decode" stage that generates output tokens one by one) are run on separate, potentially specialized, hardware or clusters. This allows for more flexible resource allocation and scaling.
      • Mixture-of-Experts (MoE) Routing: A technique used in LLMs to scale model capacity without a proportional increase in computational cost. In an MoE layer, instead of using all parameters for every input, a "router" network selects a small number of "expert" sub-networks to process specific parts of the input. This requires dynamic routing of data (tokens) to the selected experts, which might be distributed across different devices.
      • Asynchronous Reinforcement Fine-Tuning: A method for improving LLMs using reinforcement learning (RL), where the model's behavior is refined based on feedback. "Asynchronous" implies that the training (weight updates) and inference (generating responses to collect feedback) components might operate independently and in parallel, requiring efficient mechanisms to synchronize model weights.
  • Communication Patterns:

    • Collective Communication: Operations involving a group of processes/devices where all participants perform a coordinated action (e.g., AllReduce, Broadcast, AllGather). These are common in traditional data parallelism (where each device processes a chunk of data) and tensor parallelism (where layers or tensors are split across devices).
    • Point-to-Point Communication: Operations involving only two specific processes/devices, where one sends data and the other receives it. This offers more flexibility for dynamic and irregular data exchange.
  • Remote Direct Memory Access (RDMA): A technology that allows one computer to directly access memory from another computer without involving the operating system or CPU of the target computer. This significantly reduces latency and increases throughput by bypassing the CPU and OS kernel.

    • Control Plane vs. Data Plane:
      • Control Plane: Operations involved in setting up the RDMA connection, registering memory regions, and initializing devices. These typically involve the operating system.
      • Data Plane: Operations involved in the actual data transfer and completion polling. These bypass the kernel, leading to high performance.
    • RDMA Operations:
      • Two-sided Operations (SEND/RECV): Require explicit coordination between sender and receiver. The receiver must post a RECV operation to allocate memory before the sender can issue a SEND.
      • One-sided Operations (WRITE/READ): Allow a remote machine to access memory on another machine without explicit involvement from the remote machine's CPU. The initiator specifies the remote memory address and a security key (RKEY).
        • WRITE: Copies local memory to a remote memory region.
        • WRITEIMM: An extension of WRITE that also delivers a small 32-bit immediate value to the receiver's completion queue upon completion, serving as a notification.
        • READ: Copies remote memory to a local memory region.
        • Atomic Operations: Operations like Fetch-and-Add or Compare-and-Swap on remote memory. The paper notes READ and Atomic operations are generally slower and not suitable for their purposes.
    • RDMA Transports: The protocols used for RDMA communication.
      • Reliable Connection (RC): A connection-oriented, reliable transport that guarantees in-order delivery of messages. This is common in traditional InfiniBand networks (e.g., NVIDIA ConnectX NICs).
      • Unreliable Connection (UC): Connection-oriented but does not guarantee reliable delivery.
      • Unreliable Datagram (UD): Connectionless and does not guarantee reliable delivery or ordering. Used for small, broadcast-like messages.
      • Scalable Reliable Datagram (SRD): A proprietary, connectionless protocol used by AWS Elastic Fabric Adapter (EFA). It provides reliable delivery but does not guarantee in-order delivery. This out-of-order nature is a key challenge addressed by the paper.
  • Network Interface Controllers (NICs): Hardware components that connect a computer to a computer network. In the context of RDMA, these are specialized NICs that offload network processing from the CPU.

    • NVIDIA ConnectX: A series of high-performance InfiniBand and Ethernet NICs widely used in HPC and AI.
    • AWS Elastic Fabric Adapter (EFA): A network interface for Amazon EC2 instances that enables customers to run HPC and machine learning applications requiring high levels of inter-instance communication. It uses the SRD protocol.
  • GPUDirect: A technology from NVIDIA that allows direct data transfer between NVIDIA GPUs and other devices (like RDMA NICs or NVMe storage) over PCIe, bypassing the CPU and system memory.

    • GPUDirect RDMA: Enables direct memory access between GPU memory and RDMA NICs.
    • GPUDirect Async (IBGDA): An extension that allows the GPU itself to initiate RDMA transfers asynchronously, further reducing CPU involvement and latency. Currently exclusive to ConnectX NICs.
  • Non-Uniform Memory Access (NUMA): A computer memory design used in multiprocessing, where the memory access time depends on the memory's location relative to the processor. Processors can access their local memory faster than non-local memory (memory attached to another processor). Correct NUMA awareness is crucial for performance in multi-CPU systems.

  • CUDA Graph: A feature in NVIDIA's CUDA programming model that allows sequences of GPU operations (kernels, memory copies) to be recorded into a graph and then launched multiple times with minimal CPU overhead. This is vital for reducing launch latencies in repetitive GPU workloads.

  • Unified Virtual Memory (UVM): A CUDA feature that provides a single virtual memory address space accessible by both CPU and GPU, simplifying memory management.

  • GDRCopy: A library that enables efficient memory copies between GPU device memory and host memory, often leveraging GPUDirect RDMA capabilities for low-latency transfers.

  • Fully Sharded Data Parallel (FSDP): A data parallelism strategy for training large models, especially LLMs. It shards (splits) the model's parameters, gradients, and optimizer states across different GPUs, allowing larger models to fit into memory and reducing communication overhead compared to traditional data parallelism.

  • General Matrix Multiply (GEMM): A fundamental operation in linear algebra that computes the product of two matrices. It is a cornerstone of deep learning computations, particularly in neural network layers.

3.2. Previous Works

The paper contextualizes its contributions against several categories of existing work:

  • Collective Communication Libraries:

    • NCCL (NVIDIA Collective Communication Library) and torch.distributed (built on top of NCCL and MPI for PyTorch) are standard in ML frameworks for inter-GPU data exchange. They excel at structured, static patterns like AllReduce for data parallelism or AllGather for tensor parallelism. However, their limitations (fixed membership, synchronized initialization, uniform buffer sizes, limited point-to-point composition) make them unsuitable for the dynamic and sparse patterns of emerging LLM architectures like MoE or disaggregated inference.
    • MPI (Message Passing Interface) is a widely used standard for parallel programming in high-performance computing, offering both collective and point-to-point communication primitives. However, its direct use in ML frameworks is less common than NCCL for GPU-specific collectives.
  • Existing RDMA Solutions for LLMs (Vendor-Locked):

    • DeepEP (Zhao et al., 2025): An efficient expert-parallel communication library. It achieves state-of-the-art MoE latency but is tied to ConnectX NICs because it relies on GPU-initiated RDMA (IBGDA), a feature exclusive to ConnectX.
    • NVSHMEM (Langer et al., 2021): Exposes both collective and flexible point-to-point communication. It supports both IBGDA and host-proxy communication. However, it suffers from severe performance degradation on AWS EFA, making it non-viable for cross-provider deployments.
    • NVIDIA Inference Xfer Library (NIXL) (NVIDIA, 2025): Targets point-to-point communication for LLM inference, built on UCX (a high-performance communication framework). The paper notes that its production-deployed EFA implementation predates preliminary EFA support in NIXL (v0.6.1, October 2025).
    • Mooncake (Qin et al., 2025): Provides an RDMA Transfer Engine for disaggregated inference but lacks support for EFA.
    • Other libraries like UCCL (Zhou et al., 2025) and MSCCL++ (Shah et al., 2025) focus on network-layer optimizations for collectives rather than flexible point-to-point.
  • Disaggregated Inference Architectures:

    • Splitwise (Patel et al., 2024), DistServe (Zhong et al., 2024), Mooncake (Qin et al., 2025): These works demonstrate the benefits of disaggregating prefill and decode stages of LLM inference onto distinct devices. The paper's KvCache transfer use case builds on this paradigm.
  • Distributed KvCache Storage:

    • Mooncake Store and DeepSeek 3FS (DeepSeek AI, 2025): Provide distributed storage solutions for KV caches. The paper notes these currently lack EFA support, highlighting the need for portable RDMA primitives like TransferEngine to complement them in cloud deployments.
  • Compute-Communication Overlapping:

    • Research such as Flux (Chang et al., 2024), COMET (Zhang et al., 2025), TritonDistributed (Zheng et al., 2025a), and TileLink (Zheng et al., 2025b) focuses on optimizing LLM kernels by overlapping computation with collective communication. The paper notes its work is orthogonal to this but enables background transfers.
  • LLM Frameworks:

    • The paper mentions TransferEngine can be integrated into existing LLM inference frameworks like vLLM (Kwon et al., 2023), SGLang (Zheng et al., 2024), TensorRT-LLM (NVIDIA, 2023), FlashInfer (Ye et al., 2025).
    • Its P2P weight update approach can be adopted by RL frameworks such as Slime (Wu et al., 2025), OpenRLHF (Zhu et al., 2025), AReaL (Mei et al., 2025), veRL (Fu et al., 2025), LlamaRL (Sheng et al., 2024), NVIDIA Nemo (Hu et al., 2025a).

3.3. Technological Evolution

The evolution of communication for ML systems has largely mirrored the growth in model size and complexity. Initially, MPI provided the backbone for distributed computing, followed by specialized libraries like NCCL that were highly optimized for GPU-to-GPU collectives, becoming the de-facto standard for data and tensor parallelism in deep learning. These collective-centric approaches worked well for static, dense communication patterns.

However, the advent of LLMs and their specialized architectures (e.g., MoE, disaggregated inference) introduced new demands: dynamic routing, sparse communication, and flexible partitioning. This necessitated a shift towards more adaptable point-to-point communication. While RDMA hardware (like ConnectX) has been available for some time, its integration into ML frameworks has been slow, often hindered by complex APIs and a lack of a unified abstraction. The emergence of cloud-specific RDMA solutions (like AWS EFA with its SRD protocol, Alibaba Cloud eRDMA, Google Falcon) further fragmented the ecosystem, introducing incompatibility challenges (e.g., out-of-order delivery).

This paper's TransferEngine fits into this timeline as a critical step forward. It bridges the gap between the high-performance capabilities of diverse RDMA hardware and the specific, flexible communication needs of modern LLMs, moving beyond the limitations of collective-only approaches and addressing the crucial problem of vendor lock-in in a heterogeneous cloud environment.

3.4. Differentiation Analysis

Compared to related work, TransferEngine introduces several core differences and innovations:

  • Hardware Agnosticism and Portability: The most significant differentiation is its ability to provide a uniform interface across heterogeneous RDMA hardware, specifically targeting both NVIDIA ConnectX (RC protocol) and AWS EFA (SRD protocol). Previous solutions were largely vendor-locked (DeepEP to ConnectX), suffered performance degradation on alternative hardware (NVSHMEM on EFA), or lacked support entirely (Mooncake, NIXL's early EFA support). TransferEngine achieves true portability by focusing on common features (reliable but unordered delivery) rather than specific transport guarantees.

  • ImmCounter for Unordered Reliability: Instead of relying on in-order delivery guarantees (which EFA SRD does not provide), TransferEngine introduces a novel ImmCounter primitive. This mechanism tracks completion notifications via immediate values, enabling reliable communication even with out-of-order packet arrival, thus enabling EFA compatibility without sacrificing performance. This is a crucial design choice that differentiates it from systems that assume or require strict ordering.

  • Transparent Multi-NIC Aggregation: TransferEngine automatically manages and aggregates multiple NICs per GPU, which is essential for achieving full bandwidth on cloud platforms like AWS EFA (which uses multiple 100 Gbps NICs to reach 400 Gbps). This simplifies development for users and maximizes hardware utilization.

  • Focus on Flexible Point-to-Point: While collective libraries (NCCL, torch.distributed) offer SEND/RECV, their composition for complex, dynamic, and sparse patterns is often inefficient. TransferEngine is explicitly designed for these flexible point-to-point scenarios (KvCache, MoE, RL weight updates), providing specialized primitives like paged WRITEs, SCATTER, and BARRIER operations.

  • Host Proxy for Portability vs. GPU-Initiated RDMA: TransferEngine utilizes a host proxy thread to coordinate GPUs and NICs. While this introduces a slight overhead compared to GPU-initiated RDMA (like ConnectX's IBGDA used by DeepEP), it is the key enabler for portability to hardware that lacks IBGDA (like EFA). The paper demonstrates that despite this overhead, it still achieves competitive or even superior performance on ConnectX-7 and viable performance on EFA, proving the efficacy of this design choice.

    In essence, TransferEngine fills a critical gap by providing a high-performance, vendor-agnostic point-to-point communication solution tailored for the dynamic and diverse requirements of next-generation LLM systems, which was previously unavailable.

4. Methodology

4.1. Principles

The core principle behind TransferEngine is to provide a minimal, uniform Application Programming Interface (API) for high-performance point-to-point communication across heterogeneous RDMA hardware. It achieves this by recognizing a common denominator: both NVIDIA ConnectX NICs (using Reliable Connection (RC) transport) and AWS Elastic Fabric Adapter (EFA) (using Scalable Reliable Datagram (SRD) protocol) offer reliable but potentially unordered delivery.

Instead of relying on strict in-order delivery guarantees (which ConnectX RC provides but EFA SRD does not), TransferEngine builds its completion notification mechanism around a novel ImmCounter primitive. This primitive allows the sender to attach a 32-bit immediate value to a one-sided WRITE operation (WRITEIMM), which is then delivered to the receiver's completion queue. The ImmCounter tracks the receipt of these immediate values, providing a way to confirm transfer completion without assuming any particular order of message arrival. This design choice is fundamental to its portability and performance on out-of-order transports like EFA SRD.

The system also emphasizes transparent management of multiple NICs per GPU, crucial for aggregating bandwidth on platforms where a single NIC might not provide the desired throughput (e.g., EFA). Low-latency operations are further ensured through zero-copy interfaces and hardware-specific optimizations for both ConnectX and EFA.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Overview and Design Goals

TransferEngine is designed as a foundational library to enable efficient RDMA-based point-to-point communication. Its primary goals are:

  • Abstraction: Hide the complexities of underlying RDMA interfaces.

  • Portability: Support diverse hardware, including EFA (SRD protocol) and libibverbs-programmable NICs (like ConnectX-7).

  • Performance: Achieve low-latency and high-throughput using zero-copy interfaces and hardware-specific optimizations.

  • Multi-NIC Management: Transparently detect and manage multiple NICs per GPU to aggregate bandwidth (e.g., 4x100 Gbps EFA NICs for 400 Gbps).

  • Minimal API: Expose a small set of powerful primitives.

  • Reliable but Unordered Semantics: The key design insight is that ConnectX RC's ordering guarantees can be ignored for these operations, while EFA SRD is inherently unordered. TransferEngine therefore standardizes on reliable but unordered data delivery, using a specialized ImmCounter for completion notification.

    The library supports various communication patterns:

  • SEND/RECV for RPC-like interactions.

  • Paged WRITEs for KvCache transfers.

  • Low-latency, high-throughput WRITE operations for RL weight transfers.

  • Specialized WRITEs (SCATTER, BARRIER) for MoE routing. Crucially, there are no ordering guarantees across any of these operations.

4.2.2. Architecture

The architecture of TransferEngine is designed for high performance and scalability within a single node.

The following figure (Figure 1 from the original paper) illustrates the architecture of TransferEngine:

Figure 1. TransferEngine managing GPUs across NUMA nodes, each with multiple NICs. Commands are forwarded to workers, which respond back to the callback handler or ImmCounter.

Components:

  • TransferEngine Instance: A single instance manages all GPUs and NICs within a node. It exposes a main address (NetAddr) for identification and discovery by remote peers.
  • Worker Threads: For each GPU, TransferEngine spawns a dedicated worker thread. Each worker thread is pinned to a CPU core on the NUMA node to which its associated devices are attached. This minimizes scheduling and memory access latency.
  • DomainGroup: Each worker thread manages a generic DomainGroup. A DomainGroup coordinates all associated RDMA NICs for a specific GPU, typically 1-4 NICs depending on the hardware platform (e.g., 1 ConnectX-7, or 4 EFA NICs for 400 Gbps bandwidth).
  • Domain: Within each DomainGroup, each Domain is specialized to the hardware (e.g., EFA or ConnectX) and responsible for a single NIC. It handles queue pair management, work submission, and completion polling for that NIC.
  • NetAddr: A structure used to capture and serialize the network address of a Domain. These are exchanged between peers to establish communication. A restriction is that all peers must use the same number of NICs per GPU, allowing the TransferEngine to have full knowledge of the NICs between source and destination and enable intelligent sharding or load balancing of requests.
  • ImmCounter: A dedicated component that tracks per-immediate counters. These counters are incremented upon receipt of WRITEIMM operations. Events from completion queues are aggregated to deliver per-transfer notifications through callbacks or atomic flags. The counters are allocated in the same NUMA node as the domain worker.
  • UVM Watcher: A CPU thread that continuously polls a Unified Virtual Memory (UVM) location (allocated via alloc_uvm_watcher) using GDRCopy. This allows GPU-side progress (e.g., within a CUDA Graph) to trigger host-side actions, enabling CPU-GPU synchronization.
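
To make the UVM watcher concrete, the following minimal sketch models the host-side polling loop with a plain shared atomic standing in for the GDRCopy-read UVM location. The function name spawn_watcher, the layer-counter semantics, and the shutdown sentinel are illustrative assumptions, not part of the TransferEngine API.

use std::sync::Arc;
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;
use std::time::Duration;

/// Spawn a host thread that polls `watched` and invokes `on_change(old, new)`
/// whenever the value advances. In TransferEngine the watched location lives
/// in UVM and is read via GDRCopy; an AtomicU64 stands in for it here.
fn spawn_watcher(
    watched: Arc<AtomicU64>,
    on_change: impl Fn(u64, u64) + Send + 'static,
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        let mut last = watched.load(Ordering::Acquire);
        loop {
            let now = watched.load(Ordering::Acquire);
            if now == u64::MAX {
                break; // illustrative shutdown sentinel
            }
            if now != last {
                on_change(last, now); // e.g. kick off paged writes for a finished layer
                last = now;
            }
            thread::yield_now();
        }
    })
}

fn main() {
    let progress = Arc::new(AtomicU64::new(0));
    let handle = spawn_watcher(progress.clone(), |old, new| {
        println!("GPU progressed from step {old} to step {new}");
    });
    // Stand-in for the GPU bumping the counter after each layer's attention output.
    for layer in 1..=4 {
        thread::sleep(Duration::from_millis(10));
        progress.store(layer, Ordering::Release);
    }
    thread::sleep(Duration::from_millis(20));
    progress.store(u64::MAX, Ordering::Release);
    handle.join().unwrap();
}

In the real system the GPU bumps the watched value from inside a CUDA Graph, and the host-side callback can then submit the corresponding transfers, as the KvCache use case below describes.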

4.2.3. API Design

The API of TransferEngine is designed to be minimal yet powerful, abstracting underlying RDMA complexities. The following is the TransferEngine API signature as presented in the paper (Figure 2):

#[serde] struct NetAddr(Bytes);
#[serde] struct MrDesc { ptr: u64, rkeys: Vec<(NetAddr, u64)> }
struct MrHandle(NonNull<c_void>);
type Offset = u64;
struct Pages { indices: Vec<u32>, stride: u64, offset: Offset }
struct PeerGroupHandle(u64);
struct ScatterDst { len: u64, src: Offset, dst: (MrDesc, Offset) }
enum OnDone { Callback(fn () -> ()), Flag(Atomic<bool>) }

trait TransferEngine {
    fn main_address() -> NetAddr;
    // Memory Region Management
    fn reg_mr(ptr, len, device) -> (MrHandle, MrDesc);
    // Two-sided Send/Recv
    fn submit_send(addr: NetAddr, msg: &[u8], cb: fn () -> ());
    fn submit_recvs(len: u64, cnt: u64, cb: fn (&[u8]) -> ());
    // One-sided Write
    fn expect_imm_count(imm: u32, count: u32, cb: fn () -> ());
    fn submit_single_write(len: u64, imm: Option<u32>, src: (MrHandle, Offset), dst: (MrDesc, Offset), OnDone);
    fn submit_paged_writes(page_len: u64, imm: Option<u32>, src: (MrHandle, Pages), dst: (MrDesc, Pages), OnDone);
    // One-sided Write to a group of peers
    fn add_peer_group(addrs: Vec<NetAddr>) -> PeerGroupHandle;
    fn submit_scatter(h: Option<PeerGroupHandle>, OnDone, imm: Option<u32>, src: MrHandle, dst: Vec<ScatterDst>);
    fn submit_barrier(h: Option<PeerGroupHandle>, OnDone, imm: u32, dst: Vec<MrDesc>);
    // Watcher for CPU-GPU synchronization
    fn alloc_uvm_watcher(cb: fn(u64,u64) -> ()) -> NonNull<u64>;
}
  • NetAddr: A serializable struct NetAddr(Bytes) used to uniquely identify a DoMAIN and facilitate peer discovery.
  • MrDesc, MrHandle:
    • MrDesc: A serializable structure containing a ptr (pointer to memory) and rkeys (a vector of (NetAddr, u64) pairs, where the u64 is the remote key). This MrDesc can be exchanged with peers to allow them to write to the registered memory.
    • MrHandle: A local handle (NonNull<c_void>) used as the source for transfers.
  • Offset: A u64 type representing an offset within a memory region.
  • Pages: A struct { indices: Vec<u32>, stride: u64, offset: Offset } used to describe paged memory regions for transfers, supporting indirect addressing.
  • PeerGroupHandle: A u64 identifier for a pre-registered group of peers, enabling optimized group operations.
  • ScatterDst: A struct { len: u64, src: Offset, dst: (MrDesc, Offset) } defining a destination for a scatter operation, including length, source offset, and destination MrDesc with its offset.
  • OnDone: An enum { Callback(fn () -> ()), Flag(Atomic<bool>) } specifying how completion notifications should be delivered: either via a callback function or by setting an atomic boolean flag.

Trait TransferEngine API Functions:

  1. main_address() -> NetAddr: Returns the main network address for the current TransferEngine instance, used for identification and discovery.

  2. Memory Region Management:

    • reg_mr(ptr, len, device) -> (MrHandle, MrDesc): Registers a memory region (host-side buffer or GPU memory) of len bytes at ptr on a specific device. It returns a local MrHandle for use as a source in transfers and a serializable MrDesc that can be shared with remote peers. The MrDesc internally holds the addresses of associated NICs and their respective remote keys (RKEYs).
  3. Two-sided Send/Recv:

    • submit_send(addr: NetAddr, msg: &[u8], cb: fn () -> ()): Initiates a two-sided send operation. addr specifies the remote peer, msg is the payload to send, and cb is a callback for send completion. The payload is copied, allowing the caller to reuse the buffer immediately. These operations use only the first NIC in a DomainGroup.
    • submit_recvs(len: u64, cnt: u64, cb: fn (&[u8]) -> ()): Posts cnt receive buffers, each of len bytes, to receive incoming messages. Upon message arrival, a buffer is temporarily taken for the cb (callback) to process the data without copying. After the callback, the buffer is automatically re-posted. Adequate cnt is crucial to avoid rejecting messages.
  4. One-sided Write:

    • expect_imm_count(imm: u32, count: u32, cb: fn () -> ()): Registers an expectation for count immediate values (imm) to be received. Once the ImmCounter reaches count for imm, the provided cb is invoked. This is the primary mechanism for receiver-side completion notification without ordering assumptions.
    • submit_single_write(len: u64, imm: Option<u32>, src: (MrHandle, Offset), dst: (MrDesc, Offset), OnDone): Submits a single, contiguous one-sided write operation. len is the data length, imm is an optional 32-bit immediate value for notification, src specifies the source memory region and offset (local), dst specifies the destination memory region and offset (remote), and OnDone defines the completion notification method. The engine translates this into zero-copy RDMA WRITE operations, sharding and rotating them across available NICs.
    • submit_paged_writes(page_len: u64, imm: Option<u32>, src: (MrHandle, Pages), dst: (MrDesc, Pages), OnDone): Submits multiple one-sided write operations for paged memory regions. page_len is the size of each page. src and dst use Pages structs to define indirect indices, strides, and offsets for the transfers. This is useful for transferring non-contiguous data efficiently.
  5. One-sided Write to a group of peers:

    • add_peer_group(addrs: Vec<NetAddr>) -> PeerGroupHandle: Registers a group of remote peer addresses, returning a PeerGroupHandle for subsequent group operations.
    • submit_scatter(h: Option<PeerGroupHandle>, OnDone, imm: Option<u32>, src: MrHandle, dst: Vec<ScatterDst>): Sends a slice from a local source buffer (src) to multiple peers in a PeerGroup (identified by h). Each ScatterDst in the dst vector specifies a different offset in the remote receive buffer for each peer. This is an optimized wrapper around multiple WRITE operations.
    • submit_barrier(h: Option<PeerGroupHandle>, OnDone, imm: u32, dst: Vec<MrDesc>): An immediate-only operation for peer notification, acting as a lightweight barrier for a PeerGroup. It sends an imm to each MrDesc in dst.
  6. Watcher for CPU-GPU synchronization:

    • alloc_uvm_watcher(cb: fn(u64,u64) -> ()) -> NonNull<u64>: Allocates a Unified Virtual Memory (UVM) location. A dedicated CPU thread continuously polls this location using GDRCopy. When the GPU updates this location, the cb (callback) is invoked with the old and new values, allowing the host to react to GPU-side progress (e.g., from within a CUDA Graph).

Completion Notification Mechanism: As described, ImmCounter is central. It's a component that maintains per-immediate counters, incremented by events retrieved from the underlying NIC's completion queues. Events are generated on the sender after a transfer is completed and on the receiver once a WRITEIMM payload has been fully delivered, guaranteeing atomicity. These counters are NUMA-aware. Synchronization relies on these counters, as no other ordering guarantees exist. They can be synchronized with the GPU via GDRCopy, polled directly, or handled by a dedicated callback thread (expect_imm_count).
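
As a concrete illustration of this mechanism, the sketch below models the counting logic behind expect_imm_count using only standard-library types: each completion event carrying an immediate increments a per-immediate counter, and a registered expectation fires once the counter reaches its target, independent of arrival order. The struct and method names are illustrative; the real ImmCounter is fed by NIC completion queues, is NUMA-aware, and is shared across worker and callback threads.

use std::collections::HashMap;

/// Per-immediate counters plus registered expectations, as a single-threaded model.
struct ImmCounterModel {
    counts: HashMap<u32, u32>,
    expectations: Vec<(u32, u32, Box<dyn FnMut()>)>, // (imm, expected count, callback)
}

impl ImmCounterModel {
    fn new() -> Self {
        Self { counts: HashMap::new(), expectations: Vec::new() }
    }

    /// Mirrors expect_imm_count(imm, count, cb).
    fn expect_imm_count(&mut self, imm: u32, count: u32, cb: impl FnMut() + 'static) {
        self.expectations.push((imm, count, Box::new(cb)));
        self.check(imm);
    }

    /// Called for every completion-queue event that carries an immediate value.
    fn on_write_imm(&mut self, imm: u32) {
        *self.counts.entry(imm).or_insert(0) += 1;
        self.check(imm);
    }

    fn check(&mut self, imm: u32) {
        let seen = *self.counts.get(&imm).unwrap_or(&0);
        // Fire and drop every expectation on this immediate that is now satisfied.
        let mut remaining = Vec::new();
        for (e_imm, expected, mut cb) in self.expectations.drain(..) {
            if e_imm == imm && seen >= expected {
                cb();
            } else {
                remaining.push((e_imm, expected, cb));
            }
        }
        self.expectations = remaining;
    }
}

fn main() {
    let mut counter = ImmCounterModel::new();
    // A receiver expecting 3 write notifications tagged with immediate 42.
    counter.expect_imm_count(42, 3, || println!("all expected writes arrived"));
    // Notifications may arrive in any order; only the count matters.
    for _ in 0..3 {
        counter.on_write_imm(42);
    }
}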

4.2.4. Implementation

TransferEngine is implemented in Rust, with careful optimizations for performance:

  • Resource Allocation: Allocations, threading, and synchronization are optimized for minimal latency.
  • NUMA Awareness: Worker threads are spawned and pinned to CPU cores on the NUMA node corresponding to their associated devices. Data structures specific to a domain are allocated after pinning to ensure memory is reserved in the correct NUMA node, reducing memory access latency.
  • Threading Model:
    • One worker thread per DomainGroup (i.e., per GPU), handling up to 4 Domains (each managing a single NIC).
    • A dedicated thread for polling the GPU via GDRCopy to update UVM watchers.
    • A dedicated callback thread (shared by all groups) for delivering transfer notifications.
  • Inter-thread Communication: Uses lock-free queues to minimize overhead.
  • Request Processing Loop: Each domain worker operates in a tight loop:
    1. Polls for new work, prioritizing submission.
    2. Shards and load-balances requests across available Domains (NICs).
    3. Immediately posts the first WRITE of a composite request to the NIC's send queue.
    4. Posts subsequent WRITEs to fill the hardware pipeline.
    5. Polls completion queues for finished transfers and immediate counter increments.
    6. Aggregates events to deliver per-transfer notifications to the shared callback thread.
  • Sharding: DomainGroups offer flexible sharding. Transfers can target specific NICs by index, or a single WRITE can be split. Paged transfers, scatter, and barrier operations (which translate to multiple WRITEs) can shard across all NICs to maximize bandwidth.
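
The sharding step can be pictured as splitting one logical WRITE into per-NIC chunks before posting. The sketch below is an illustration of that idea only (the real scheduler also rotates across NICs and respects hardware queue limits); it splits a transfer into contiguous, near-equal slices across the NICs of one DomainGroup. The Chunk type and shard_write function are illustrative names.

/// One RDMA WRITE destined for a specific NIC (Domain) within a DomainGroup.
#[derive(Debug, PartialEq)]
struct Chunk {
    nic: usize,  // index of the Domain that will post this WRITE
    offset: u64, // offset into both source and destination regions
    len: u64,    // bytes carried by this WRITE
}

/// Split a single logical WRITE into contiguous per-NIC chunks.
fn shard_write(len: u64, nic_count: u64) -> Vec<Chunk> {
    let base = len / nic_count;
    let rem = len % nic_count;
    let mut chunks = Vec::new();
    let mut offset = 0;
    for nic in 0..nic_count {
        // Spread the remainder over the first `rem` NICs so sizes differ by at most one byte.
        let this = base + if nic < rem { 1 } else { 0 };
        if this > 0 {
            chunks.push(Chunk { nic: nic as usize, offset, len: this });
            offset += this;
        }
    }
    chunks
}

fn main() {
    // A 1 MiB write spread over the four 100 Gbps EFA NICs of one GPU.
    for chunk in shard_write(1 << 20, 4) {
        println!("{chunk:?}");
    }
}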

4.2.5. Hardware-Specific Optimizations

The Domain components are specialized and optimized for the specific RDMA hardware they control:

  • AWS EFA:

    • Implemented using libfabric, with a fabric domain managed per NIC within the DomainGroup.
    • Addresses EFA's divergence from the RDMA spec regarding immediate-only zero-sized writes by enforcing valid descriptors for all transfers.
    • Employs work request (WR) templating for bulk transfers and peer groups, pre-populating and retaining common fields of libfabric descriptors before posting, reducing per-transfer overhead.
  • NVIDIA ConnectX-7:

    • Implemented through libibverbs.
    • Uses a UD queue pair per peer for exchanging RC handshakes (setup information).
    • Creates two RC queue pairs per peer:
      1. One for two-sided SEND/RECV operations.
      2. Another for one-sided WRITE and WRITEIMM operations. This separation is critical because RECV and WRITEIMM completions both consume work requests in posting order. Separating them prevents interference, allowing high-level RECV semantics while supporting WRITEIMM efficiently.
    • In addition to WR templating, it employs WR chaining by linking up to 4 work requests through the next pointer of ibv_send_wr. This reduces the number of doorbell rings to the NIC, improving efficiency.
    • Enables IBV_ACCESS_RELAXED_ORDERING to permit out-of-order PCIe transactions between the NIC and GPU memory, which reduces latency by allowing the NIC to reorder operations for better pipeline utilization.

4.3. Use Cases

4.3.1. KvCache Transfer (Section 4)

This use case demonstrates TransferEngine for disaggregated inference. The following figure (Figure 3 from the original paper) illustrates the KvCache transfer process:

Figure 3. KV transfer between prefillers and decoders

  • Scenario: A "prefiller" node processes input tokens, generating key-value (KV) cache pages and context (e.g., last token hidden states, logits for speculative decoding). This data is then transferred to a "decoder" node, which performs token-by-token decoding.
  • Dynamic Scaling: TransferEngine enables elastic scaling without synchronized initialization or fixed membership, crucial for cloud environments.
  • Workflow:
    1. A request arrives; a global scheduler assigns it to a prefiller and a decoder.
    2. The decoder pre-allocates KV pages and context storage.
    3. The decoder dispatches the request to the prefiller using submit_send(), indicating target KV page indices.
    4. During chunked prefill, the prefiller updates a UVM watcher after the attention output projection of each layer (CUDA Graph compatible).
    5. TransferEngine detects the UVM watcher change and initiates the transfer of KV pages from the current chunk using submit_paged_writes().
    6. Once the last chunk is complete and context is populated, it's copied via submit_single_write().
    7. Prefillers await commands using submit_recvs().
  • KV Cache Layout: Handles differences in KV cache sharding and replication. For MLA, the cache contents do not depend on the rank, so prefiller and decoder ranks are matched randomly. For GQA, KV heads are sharded across ranks, so page-wise offsets and strides select the relevant head slices (the page-addressing arithmetic is sketched after this list). KV caches are laid out with heads preceding pages to keep paged writes contiguous.
  • Completion: The prefiller doesn't send explicit completion messages. The decoder knows the number of expected transfers and uses expect_imm_count() to be notified by TransferEngine and start decoding.
  • Error Handling: Involves heartbeats to detect transport failures. Cancellation (from decoder) requires explicit confirmation from prefiller to ensure KV pages are not prematurely reused. A per-request cancellation token can stop future transfers and wait for pending operations.
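
The page-wise addressing mentioned in the KV cache layout item above maps directly onto the Pages triple from the API: each page lands at offset + index * stride. The sketch below works through that arithmetic under an assumed [kv_heads, num_pages, page_bytes] layout; the layout, sizes, and helper names are illustrative assumptions rather than the paper's exact scheme.

/// Pages descriptor as in the TransferEngine API: indirect page indices plus a
/// common stride and base offset.
struct Pages {
    indices: Vec<u32>,
    stride: u64,
    offset: u64,
}

/// Byte offsets addressed by a Pages descriptor: offset + index * stride per page.
fn page_offsets(p: &Pages) -> Vec<u64> {
    p.indices.iter().map(|&i| p.offset + i as u64 * p.stride).collect()
}

fn main() {
    // Assumed layout: [kv_heads, num_pages, page_bytes], heads before pages,
    // so the pages belonging to one head are selected by a per-head base offset.
    let page_bytes: u64 = 64 * 1024; // 64 KiB KV page, as in Table 2
    let num_pages: u64 = 1024;
    let head = 3u64;

    let pages = Pages {
        indices: vec![7, 42, 512],             // page slots chosen by the decoder
        stride: page_bytes,                     // consecutive pages of one head
        offset: head * num_pages * page_bytes,  // skip to this head's region
    };
    for (i, off) in pages.indices.iter().zip(page_offsets(&pages)) {
        println!("page index {i} -> byte offset {off}");
    }
}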

4.3.2. RL Rollout Weight Transfer (Section 5)

This section addresses the slow weight updates in asynchronous reinforcement learning fine-tuning. The following figure (Figure 4 from the original paper) illustrates the data path for weight transfer using different approaches:

Figure 4. Weight transfer data path for different approaches. Part (a) shows the Rank0-based approach, where all weights funnel through training worker 0's GPU and NIC; part (b) shows the point-to-point approach, where training and inference GPUs transfer directly through their own NICs.

  • Problem: Existing RL frameworks often form a global collective world, bottlenecking weight transfers through a single Rank0 GPU, leading to tens to hundreds of seconds for trillion-parameter models.

  • TransferEngine Solution (P2P Approach): Each training GPU directly sends weights to inference GPUs via one-sided RDMA WRITE operations, utilizing the full cluster bandwidth across all NICs. This is shown in Figure 4(b).

  • Initialization:

    1. Controller script gathers parameter metadata (name, shape, dtype, DTensor sharding) from all training and inference GPUs.
    2. Computes a static weight transfer schedule (which training GPU sends which parameter to which inference GPU, in what order).
    3. Broadcasts the schedule to all training GPUs.
  • Execution: At each training step, the controller signals training GPUs to send weights. Inference nodes do not actively participate, since the transfer uses one-sided operations.

  • Pipelined Execution: To optimize, the transfer of each parameter tensor is treated as a task, split into four overlapping stages. The following figure (Figure 5 from the original paper) illustrates the pipelined weight transfer execution:

    Figure 5. Pipelined weight transfer execution.

    1. Host-to-Device (H2D) memcpy: If FSDP offloads weights to CPU, this stage copies them back to GPU.
    2. Parameter Preparation: Reconstructs the full weight (using full_tensor()), applies projection fusion, and quantizes if necessary.
    3. RDMA Transfer: Zero-copy WRITE to remote inference GPU memory.
    4. Global Barrier: After all full_tensor() calls complete, synchronization across mesh groups using Gloo via Ethernet.
  • Memory Management: GPU operations (like full_tensor()) introduce temporary memory usage. New tasks are only started if in-flight tasks occupy less GPU memory than a configurable watermark to prevent Out-Of-Memory (OOM) errors.
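
The watermark rule above is a small piece of admission control: start the next scheduled task only while the temporary memory of in-flight tasks stays under a limit. The sketch below models it with plain byte counters; task names, sizes, and the watermark value are illustrative, and a real implementation would account for the actual full_tensor() and staging allocations.

use std::collections::VecDeque;

/// A pending weight-transfer task and the temporary GPU memory it pins while
/// its copy / preparation / RDMA stages are in flight.
struct Task {
    name: &'static str,
    bytes: u64,
}

/// Start tasks in schedule order, but only while in-flight temporary memory
/// stays below the configured watermark.
fn drive(mut schedule: VecDeque<Task>, watermark: u64) {
    let mut in_flight: Vec<Task> = Vec::new();
    let mut used: u64 = 0;

    while !schedule.is_empty() || !in_flight.is_empty() {
        // Launch as many tasks as the watermark allows; always admit at least one
        // task when nothing is in flight, so oversized tensors cannot deadlock.
        while let Some(next) = schedule.front() {
            if used + next.bytes > watermark && !in_flight.is_empty() {
                break; // wait for memory to be released before starting more
            }
            let task = schedule.pop_front().unwrap();
            used += task.bytes;
            println!("start  {:<6} (in-flight {} MiB)", task.name, used >> 20);
            in_flight.push(task);
        }
        // Stand-in for one task finishing all of its stages.
        let done = in_flight.remove(0);
        used -= done.bytes;
        println!("finish {:<6} (in-flight {} MiB)", done.name, used >> 20);
    }
}

fn main() {
    let schedule = VecDeque::from(vec![
        Task { name: "wq", bytes: 512 << 20 },
        Task { name: "wkv", bytes: 256 << 20 },
        Task { name: "wo", bytes: 512 << 20 },
        Task { name: "mlp", bytes: 1024 << 20 },
    ]);
    drive(schedule, 1 << 30); // 1 GiB watermark (illustrative)
}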

4.3.3. MoE Dispatch/Combine (Section 6)

This use case demonstrates low-latency MoE routing using TransferEngine. The following figure (Figure 6 from the original paper) illustrates the Dispatch and Combine GPU-CPU-NIC coordination:

Figure 6. Dispatch and Combine GPU-CPU-NIC coordination

  • Goal: Provide portable, low-latency MoE dispatch and combine kernels, especially viable on EFA.

  • Architecture:

    • Host Proxy Thread: Coordinates GPUs and NICs, using GDRCopy to poll GPU for progress and invoke TransferEngine.
    • Split Kernels: Dispatch and combine operations are implemented with sender halves (preparing data into send buffers) and receiver halves (shuffling tokens from receive buffers).
    • NVLink: Utilized within a node to reduce network load.
    • Minimize Writes: To reduce proxy overhead, peers first exchange routing information (per-expert token counts) to determine unique ranges in contiguous receive buffers, allowing senders to pack writes efficiently (see the prefix-sum sketch after this section's workflows).
  • Dispatch Workflow:

    1. Routing Information Exchange: Small payloads exchanged to determine token counts per expert. This latency is hidden by speculatively dispatching a small number of tokens.

    2. Dispatch Kernel: Receives tokens and expert indices.

    3. Count Tokens: Kernel counts tokens sent to each expert in shared memory, transfers counts to host via UVM.

    4. Proxy Signal: GPU signals proxy via UVM.

    5. Scatter Routes: Proxy uses TransferEngine to initiate submit_scatter() for routes to all peers.

    6. Copy Tokens: Input tokens are copied into send buffers, creating a contiguous source. The following figure (Figure 7 from the original paper) illustrates dispatch into private and contiguous buffers:

      Figure 7. Dispatch into private and contiguous buffers

    7. Initial Token Scatter: The proxy scatters tokens (up to a fixed limit) into private buffers on each receiver peer, hiding routing overhead.

    8. Remainder Token Scatter: Once routes are received and token positions are known, the remaining tokens are scattered to shared receiver buffers.

    9. NVLink Transfers: While RDMA transfers are pending, intra-node tokens are transferred via NVLink.

    10. Synchronization: NVLink peers use flags for synchronization. A grid barrier and careful write ordering minimize critical-path latency.

    11. Receiver Kernel: Waits for TransferEngine to confirm all transfers via ImmCounter and GDRCopy, then reorders tokens into a layout suitable for Grouped GEMM.

  • Combine Workflow:

    1. Single Scatter: Since routing information is centralized during dispatch, combine transfers all payloads in a single SCATTER operation.
    2. Command Preparation: Prepared during idle time (e.g., between send and combine, while Grouped GEMM executes).
    3. NVLink and RDMA Barriers: Reuses buffers from dispatch stage, requiring NVLink and RDMA barriers to ensure completion of prior operations before overwriting send buffers. Host proxy waits for TransferEngine to confirm all writes sent.
    4. Weighted Average: Receiver caches offsets from routing information, waits for all tokens, then computes weighted average locally.
  • Comparison with DeepEP:

    • DeepEP: Relies on IBGDA (ConnectX-exclusive) and strong RC QP ordering. Balances tokens across SMs, transferring them one-by-one. Lower latency to first transfer but more per-token work and packets. Signals completion via Atomics.
    • TransferEngine: Portable (EFA & ConnectX) via host proxy. Longer latency to first transfer due to GPU-CPU-NIC overhead. Bulk transfers achieve better network utilization. ImmCounter for completion.
    • Prefill: DeepEP uses pre-accumulation via NVLink to reduce RDMA data for prefill. TransferEngine scales its single-transfer strategy without specific prefill tweaks, leading to higher memory overhead. DeepEP's partial sum in combine also reduces RDMA bytes.
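
The "minimize writes" step referenced above reduces to simple bookkeeping on the receiver: the per-sender token counts exchanged during routing are turned into disjoint ranges of the contiguous receive buffer via an exclusive prefix sum. The function name and values below are illustrative.

/// Given the number of tokens each sending rank will deliver to this receiver,
/// compute the (start, len) range each sender owns in the contiguous receive buffer.
fn receive_ranges(tokens_per_sender: &[u32]) -> Vec<(u32, u32)> {
    let mut ranges = Vec::with_capacity(tokens_per_sender.len());
    let mut start = 0u32;
    for &count in tokens_per_sender {
        ranges.push((start, count)); // sender writes its tokens at [start, start + count)
        start += count;
    }
    ranges
}

fn main() {
    // Counts received during the routing-information exchange (illustrative values).
    let counts = [5u32, 0, 12, 3];
    for (sender, (start, len)) in receive_ranges(&counts).iter().enumerate() {
        println!("sender {sender}: slots {start}..{}", start + len);
    }
}

Because every sender learns its range before the bulk transfer, the proxy can pack each peer's tokens into a single large WRITE rather than many small ones, which is what keeps host-proxy overhead low.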

5. Experimental Setup

5.1. Datasets

The paper's evaluation focuses on communication performance rather than specific model training or inference tasks that would require distinct datasets. However, for certain application-level evaluations, it refers to:

  • KvCache Transfer: No specific dataset is mentioned. The evaluation focuses on the efficiency of transferring generic KV pages, which are intermediate states in LLM inference.
  • RL Weight Updates: While no dataset for RL training is mentioned, the evaluation context involves models at the scale of:
    • Kimi-K2 (1 trillion parameters)
    • DeepSeek V3 (671 billion parameters)
    • Qwen3 (235 billion parameters)
      These models serve as proxies for the scale of weights being transferred.
  • MoE Dispatch/Combine: The benchmarks are based on the settings of the DeepSeek-V3 model.
    • Tokens: 7168 × fp8 tokens (each token is a 7168-element fp8 vector). fp8 refers to 8-bit floating-point precision, a common quantization scheme for LLMs to save memory and improve performance.

    • Scaling Factors: 56 × fp32 scaling factors. fp32 refers to standard 32-bit floating-point precision. Scaling factors are often used in conjunction with quantized formats (like fp8) to restore dynamic range.

    • Tensors: Combine bf16 tensors of the same dimension. bf16 (bfloat16) is a 16-bit floating-point format offering a wider dynamic range than fp16 (half-precision float) but lower precision than fp32.

    • Expert Routing: Each token is dispatched to 8 random experts. This reflects the typical sparse activation pattern of MoE models.

    • Batch Sizes: Decode is evaluated on batches of 128 tokens, while prefill is evaluated on chunks of 4096 tokens.

      The choice of these model settings and token/tensor types is representative of real-world large-scale LLM deployments and their specific precision requirements.

5.2. Evaluation Metrics

The paper uses standard metrics to evaluate communication performance and application-level latency.

  • Throughput (Gbps):

    • Conceptual Definition: Throughput measures the amount of data successfully transferred per unit of time. In networking, it's typically expressed in gigabits per second (Gbps), indicating the rate at which data can be moved across the network. Higher throughput is desirable for bulk data transfers.
    • Mathematical Formula: $ \text{Throughput (Gbps)} = \frac{\text{Total Data Transferred (bits)}}{\text{Total Time (seconds)}} \times 10^{-9} $
    • Symbol Explanation:
      • Total Data Transferred (bits): The total volume of data moved, usually measured in bits (i.e., bytes × 8).
      • Total Time (seconds): The duration it took for the data transfer to complete.
      • 10^{-9}: Conversion factor to express the result in Gigabits per second (Gbps). A worked example of this formula follows the metric definitions below.
  • Latency (μs):

    • Conceptual Definition: Latency measures the time delay between the initiation of an operation (e.g., sending a message or starting a task) and its completion. In high-performance computing, it's often expressed in microseconds (μs) or nanoseconds, indicating the responsiveness of the system. Lower latency is critical for interactive or time-sensitive operations.
    • Mathematical Formula: This is typically a direct measurement of time. There isn't a single universal formula, but it's measured as: $ \text{Latency} = \text{Time of Completion} - \text{Time of Initiation} $
    • Symbol Explanation:
      • Time of Completion: The timestamp when the operation finishes.
      • Time of Initiation: The timestamp when the operation starts.
      • Note: For statistical analysis, the paper reports median latency (p50) and various percentiles (p01, p25, p75, p95, p99) to capture the distribution of latencies.
  • Operations per Second (op/s):

    • Conceptual Definition: Operations per second measures the rate at which discrete operations are completed. For paged writes, this indicates how many individual page write operations can be performed in one second. This metric is useful for understanding the processing capacity for fine-grained tasks.
    • Mathematical Formula: $ \text{Operations per Second (op/s)} = \frac{\text{Total Number of Operations}}{\text{Total Time (seconds)}} $
    • Symbol Explanation:
      • Total Number of Operations: The count of discrete operations performed.
      • Total Time (seconds): The duration over which these operations were performed.
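
As a worked instance of the throughput formula above, consider the 16 MiB single-write message size that Section 6 reports as the saturation point of a 400 Gbps link; the ideal transfer time for one such message is

$ \text{Time} = \frac{16 \times 2^{20} \times 8 \ \text{bits}}{400 \times 10^{9} \ \text{bits/s}} \approx 3.4 \times 10^{-4} \ \text{s} \approx 0.34 \ \text{ms} $

so per-operation overheads measured in tens of microseconds are already a noticeable fraction of the wire time, which is one reason the implementation works to reduce submission-side costs (WR templating and chaining).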

5.3. Baselines

The paper evaluates TransferEngine against several baselines and state-of-the-art implementations, tailored to the specific evaluation context:

  • For General Point-to-Point Communication Performance (Throughput):

    • NIXL v0.6.1: NVIDIA Inference Xfer Library, a contemporary library targeting P2P communication for LLM inference. This serves as a direct comparison for general RDMA performance.
    • Hardware Benchmarks:
      • ib_write_bw (from rdma-core): A standard benchmark tool for measuring peak RDMA write bandwidth on InfiniBand/ConnectX hardware.
      • fi_rma_bw (from libfabric): A benchmark tool for measuring peak RDMA bandwidth (specifically RMA, Remote Memory Access) using the libfabric API, which is used for EFA. These serve as theoretical peak performance indicators for the underlying hardware.
  • For MoE Dispatch/Combine Latency:

    • DeepEP (Zhao et al., 2025): The "most performant open-source implementation" for MoE communication, according to the authors. DeepEP uses GPU-initiated RDMA (IBGDA) and is specialized for ConnectX NICs. It represents a highly optimized, but vendor-specific, baseline.

    • pplx-kernels (Licker et al., 2025): These are portable and open-source MoE kernels built around NVSHMEM v3.4.5. This baseline is important for comparing TransferEngine's portability and performance against another open-source solution that aims for some level of hardware abstraction, especially its IBRC (host-proxy) mode. The paper notes NVSHMEM on EFA is "unusably slow," so pplx-kernels are primarily evaluated on ConnectX-7.

    • UCCLEP (Mao et al., 2025): The paper mentions this work but explicitly states that a detailed comparison is not included as its published latencies are "substantially higher" than TransferEngine's.

      By comparing against these baselines, the paper demonstrates TransferEngine's ability to achieve competitive or superior performance while offering crucial portability across different cloud RDMA hardware, a capability largely absent in prior work.

6. Results & Analysis

6.1. Core Results Analysis

The evaluation demonstrates TransferEngine's high performance and portability across various LLM communication patterns.

6.1.1. Point-to-Point Communication Performance

The paper first evaluates the raw throughput of TransferEngine for single and paged WRITE operations, comparing it against NIXL and native hardware benchmarks. This is crucial for KvCache transfers and RL rollouts.

The following figure (Figure 8 from the original paper) shows the relative bandwidth performance:

Figure 8. Point-to-Point communication performance (Single Write, left; Paged Write, right; Ours vs. NIXL on EFA and ConnectX-7).

The plot shows the fraction of peak bandwidth achieved as a function of message size for Single Write (left) and Paged Write (right).

  • Single Write: To saturate the 400 Gbps link with a single WRITE operation, messages of at least 16 MiB are required. TransferEngine (Ours) and NIXL show comparable performance on both EFA and ConnectX-7, with TransferEngine being slightly faster in some regimes.

  • Paged Write: TransferEngine saturates the link with 32 KiB messages, while NIXL requires 64 KiB. This indicates TransferEngine is more efficient for smaller, paged transfers.

  • EFA vs. ConnectX: EFA generally requires larger messages to saturate the link, explaining observed performance gaps, especially in MoE routing.

    The following are the results from Table 2 of the original paper, detailing absolute performance numbers for TransferEngine:

| Operation | Message Size | EFA Throughput | CX-7 Throughput |
| --- | --- | --- | --- |
| Single Write | 64 KiB | 16 Gbps | 44 Gbps |
| Single Write | 256 KiB | 54 Gbps | 116 Gbps |
| Single Write | 1 MiB | 145 Gbps | 245 Gbps |
| Single Write | 32 MiB | 336 Gbps | 378 Gbps |
| Paged Write | 1 KiB | 17 Gbps (2.11M op/s) | 91 Gbps (11.10M op/s) |
| Paged Write | 8 KiB | 138 Gbps (2.10M op/s) | 320 Gbps (4.89M op/s) |
| Paged Write | 16 KiB | 274 Gbps (2.08M op/s) | 367 Gbps (2.80M op/s) |
| Paged Write | 64 KiB | 364 Gbps (0.69M op/s) | 370 Gbps (0.71M op/s) |
Table 2. EFA and ConnectX-7 performance comparison

  • 256 KiB Single WRITE: A typical size for MoE routing. TransferEngine achieves 54 Gbps on EFA and 116 Gbps on ConnectX-7.

  • 64 KiB Paged WRITE: A typical size for a KvCache page. Both EFA and ConnectX-7 are able to saturate the available bandwidth (364 Gbps on EFA, 370 Gbps on CX-7).

  • RL Weight Transfer: Message sizes for RL weight updates are well beyond the saturation point, indicating that TransferEngine can efficiently handle large bulk transfers for this use case.

    This data confirms that TransferEngine delivers high throughput across both NIC types, making it suitable for bandwidth-heavy LLM workloads.
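
As a quick consistency check (our arithmetic, not a figure reported in the paper), the paged-write throughput in Table 2 follows directly from the operation rate multiplied by the page size: $ 0.69 \times 10^{6} \ \text{op/s} \times 64 \ \text{KiB} \times 8 \ \text{bits/byte} \approx 362 \ \text{Gbps} $, consistent with the 364 Gbps reported for 64 KiB paged writes on EFA; similarly, $ 11.10 \times 10^{6} \ \text{op/s} \times 1 \ \text{KiB} \times 8 \ \text{bits/byte} \approx 91 \ \text{Gbps} $ for 1 KiB pages on ConnectX-7.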

6.1.2. MoE Dispatch/Combine

The MoE kernels are evaluated for decode and prefill latencies across different numbers of GPUs (8, 16, 32, 64), comparing TransferEngine with DeepEP and pplx-kernels.

6.1.2.1. Private Buffer Size

The paper investigates how the size of the private buffer, a mechanism designed to hide the latency of the routing information exchange, affects p50 decode latency.

The following figure (Figure 9 from the original paper) shows the impact of private buffer size on p50 decode latency:

Figure 9. Impact of private buffer size on p50 decode latency (relative latency slowdown versus maximum private token count for EFA and ConnectX-7 under EP8, EP16, EP32, and EP64).

  • The plot shows the relative latency slowdown compared to the peak performance (achieved when the buffer is large enough for all tokens in one burst).
  • Performance Degradation: As the private buffer size decreases (meaning less speculative dispatch), latency increases.
  • Intra-node: For intra-node transfers (faster), at least ~32 tokens are needed to hide the route exchange latency for both NIC types.
  • Inter-node: For inter-node transfers, ConnectX-7 requires as few as 24 tokens, while EFA shows performance degradation below 32 tokens, indicating slower route exchange on EFA.
  • Justification: This analysis justifies the use of private buffers to mask overheads, especially for latency-sensitive decode operations.
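
To illustrate the mechanism (a simplified sketch under our own assumptions, not the paper's kernel code), dispatch can be viewed as sending up to a fixed budget of tokens speculatively into per-peer private buffers while the routing counts are still being exchanged, and deferring the remaining tokens until those counts arrive:

```python
def dispatch_with_private_buffer(tokens, routes, max_private_tokens,
                                 send, routing_counts_ready):
    """Hypothetical sketch of speculative dispatch. `send` and
    `routing_counts_ready` (an Event-like handle) are assumed helpers;
    the real kernels overlap this work with NVLink and RDMA transfers."""
    # Speculative phase: transfers into the private buffer overlap with the
    # routing-information exchange, hiding its latency when the budget is
    # large enough (roughly 32 tokens in the paper's measurements).
    for token, route in zip(tokens[:max_private_tokens],
                            routes[:max_private_tokens]):
        send(token, route, buffer="private")

    # Deferred phase: the remaining tokens need the exchanged counts to know
    # their offsets in the shared receive buffers.
    routing_counts_ready.wait()
    for token, route in zip(tokens[max_private_tokens:],
                            routes[max_private_tokens:]):
        send(token, route, buffer="shared")
```

When the budget covers the whole decode batch, the deferred phase is empty and the route exchange is fully hidden, which corresponds to the peak-performance case in Figure 9.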

6.1.2.2. Send and Receive Latency

The kernels are split into senders and receivers, with NVLink and RDMA transfers overlapping. The following figure (Figure 10 from the original paper) shows separate send and receive latency for EP=64:

Figure 10. Separate Send and Receive Latency for EP=64 (Dispatch-Send, Dispatch-Recv, Combine-Send, and Combine-Recv latency in μs for Ours-EFA-SRD, Ours-CX7-RC, and DeepEP-CX7-GDA).

  • Dispatch Send: Ours-EFA-SRD (21 μs) and Ours-CX7-RC (18 μs) outperform DeepEP-CX7-GDA (28 μs). This is attributed to TransferEngine only copying memory for sends, implying a more direct data path or better pipelining.
  • Dispatch Receive: DeepEP-CX7-GDA (34 μs) is faster than Ours-EFA-SRD (41 μs) and Ours-CX7-RC (40 μs). The paper explains that Dispatch receive is an outlier for TransferEngine because it pulls data using NVLink loads, which can be slower than DeepEP's approach.
  • Combine Send: Similar to dispatch send, Ours-EFA-SRD (24 μs) and Ours-CX7-RC (21 μs) outperform DeepEP-CX7-GDA (28 μs).
  • Combine Receive: Ours-EFA-SRD (37 μs) and Ours-CX7-RC (35 μs) are faster than DeepEP-CX7-GDA (40 μs). This is attributed to faster accumulation in TransferEngine's combine receive stage.
  • Overall: The total execution time of these kernels (excluding transfer time) is under 15% of the transfer times, indicating good compute-communication overlap. The proxy starts RDMA work midway through send kernels, with shuffling adding only ~15 μs of idle time.

6.1.2.3. Decode Latency

Decode latency is critical as it directly impacts user experience. The following figure (Figure 11 from the original paper) shows the MoE decode latency:

Figure 11. MoE Decode Latency (Dispatch and Combine latency at EP64, EP32, EP16, and EP8 for Ours-EFA-SRD, Ours-CX7-RC, DeepEP-CX7-GDA, pplx-CX7-GDA, and pplx-CX7-RC).

  • ConnectX-7 Performance: On ConnectX-7, TransferEngine (Ours-CX7-RC) achieves state-of-the-art decode latency, outperforming DeepEP-CX7-GDA for 16, 32, and 64 ranks. For example, at EP64 (64 GPUs), Ours-CX7-RC has a mean latency of 52 μs (Dispatch) and 49 μs (Combine), compared to DeepEP-CX7-GDA's 58 μs (Dispatch) and 53 μs (Combine). This is significant, especially considering TransferEngine uses a host proxy while DeepEP uses IBGDA. The pplx-CX7-RC (NVSHMEM via host proxy) is an order of magnitude slower, highlighting TransferEngine's efficiency.
  • EFA Performance: TransferEngine (Ours-EFA-SRD) provides the first viable implementation and latencies on EFA. While trailing ConnectX-7 by about 30% (e.g., 68 μs Dispatch, 64 μs Combine at EP64), this is still a major achievement given the challenges of EFA's unordered delivery.
  • Intra-node (8 Ranks): DeepEP is slightly faster (29 μs Dispatch, 28 μs Combine) than TransferEngine (31 μs Dispatch, 29 μs Combine) in the intra-node setup. This is attributed to TransferEngine's use of NICs for routing information and DeepEP's highly efficient NVLink transfers in this specific configuration.
  • Scaling: As ranks increase (16, 32, 64), TransferEngine's performance improves relative to DeepEP, particularly for combine. At 64 ranks, TransferEngine's combine still outperforms DeepEP, but the host proxy's CPU overhead becomes noticeable for dispatch.
  • Bandwidth Saturation: EFA's latencies are only 30% behind ConnectX despite lower raw 256 KiB WRITE throughput, suggesting that decode operations are not fully bandwidth-bound but rather latency-sensitive.

6.1.2.4. Prefill Latency

Prefill latency is evaluated for larger batches (4096 tokens). The following figure (Figure 12 from the original paper) shows the MoE prefill latency:

Figure 12. MoE Prefill Latency. Bar height is mean. Error bars show p01, p25, p50, p75, p95, p99 (Dispatch and Combine latency across EP sizes for Ours-EFA-SRD, Ours-CX7-RC, and DeepEP-CX7-GDA).

  • Comparison: pplx-kernels are excluded as they are ineffective for 4096 tokens.
  • DeepEP Advantage: DeepEP generally performs better for prefill, especially at lower ranks. For dispatch, DeepEP's strategy of transferring only one token replica via RDMA and moving copies via NVLink within a node is effective. For combine, DeepEP's sender-side partial sum significantly reduces RDMA bytes, leading to lower latency, though with reduced accumulation precision (bf16); the note after this list sketches why the partial sum shrinks the transferred bytes.
  • TransferEngine Limitations for Prefill: The paper notes that TransferEngine's decode-optimized kernels have memory overheads that limit viability for some prefill models. This is because the design is optimized for latency-bound decode rather than bandwidth-bound prefill.
  • EFA for Prefill: Similar to decode, Ours-EFA-SRD performs worse than Ours-CX7-RC for prefill (e.g., 208 μs vs 142 μs for Dispatch at EP64), but still provides a viable solution on EFA.
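
To see why the sender-side partial sum referenced above reduces combine traffic (our back-of-the-envelope reasoning, not a number from the paper): if r of a token's selected experts reside on the same sender node, pre-accumulating their outputs locally shrinks the RDMA payload sent back to the token's origin by roughly a factor of r (one hidden-state vector instead of r), at the cost of performing that accumulation in bf16 before transfer.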

6.2. Ablation Studies / Parameter Analysis

While not explicitly labeled as traditional "ablation studies," the paper conducts several experiments that function similarly by varying key parameters or comparing architectural choices:

  • Private Buffer Size Analysis (Figure 9): This directly evaluates the impact of a design parameter (the size of speculatively dispatched tokens) on decode latency. It shows how this parameter helps hide network latency for routing information, justifying its inclusion in the MoE dispatch mechanism. The experiment demonstrates a clear trade-off: insufficient buffer size leads to performance degradation.

  • Comparison of EFA vs. ConnectX-7: The entire evaluation framework implicitly compares the performance of TransferEngine on two distinct hardware platforms (EFA with SRD and ConnectX-7 with RC). This is a crucial "ablation" or comparison of the underlying network stack's impact on TransferEngine's portable design. It validates the claim of portability and highlights performance characteristics specific to each NIC.

  • Comparison with Baselines (DeepEP, pplx-kernels): These comparisons serve as a form of architectural ablation. By comparing TransferEngine's proxy-based approach to DeepEP's IBGDA and pplx-kernels' NVSHMEM, the authors implicitly evaluate the trade-offs of different RDMA initiation mechanisms and library choices. This shows that the chosen host-proxy design, despite its inherent overhead, can still yield superior or competitive results while gaining portability.

  • Separate Send and Receive Latency (Figure 10): This detailed breakdown helps understand the individual components of latency and the effectiveness of pipelining and overlapping. It shows where TransferEngine excels (send phases, combine receive) and where DeepEP might have an edge (dispatch receive), providing insights into the fine-grained performance characteristics of the design.

    These analyses collectively demonstrate that the design choices made in TransferEngine (e.g., ImmCounter, multi-NIC aggregation, host proxy, specific optimizations) are effective in achieving its goals of portability and high performance for LLM workloads.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully introduces TransferEngine, a portable RDMA communication library designed to address the critical need for flexible point-to-point communication in emerging LLM system patterns. By identifying and leveraging the commonality of reliable but unordered delivery across heterogeneous RDMA hardware (specifically NVIDIA ConnectX-7 and AWS EFA), TransferEngine provides a uniform API, effectively mitigating vendor lock-in. Its novel ImmCounter primitive enables completion notification without relying on network transport ordering, and it transparently manages multiple NICs per GPU to maximize bandwidth.
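
To make the completion-notification idea concrete, the following is a minimal sketch of an ImmCounter-style primitive (our illustration in Python; the actual library is implemented in Rust and its API may differ): the receiver counts WriteImm notifications per immediate value and considers a transfer complete once the expected number of writes has arrived, so correctness never depends on the order in which the writes land.

```python
import threading
from collections import defaultdict

class ImmCounter:
    """Illustrative completion primitive: counts immediate-data notifications
    per 32-bit immediate value, independent of arrival order. Names and
    structure are assumptions, not the TransferEngine API."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = defaultdict(int)

    def on_write_imm(self, imm: int) -> None:
        # Invoked for each completion-queue entry carrying immediate data.
        with self._lock:
            self._counts[imm] += 1

    def is_complete(self, imm: int, expected: int) -> bool:
        # A transfer tagged with `imm` is complete once `expected` writes
        # (e.g., one per page or per NIC shard) have been counted.
        with self._lock:
            return self._counts[imm] >= expected

# Usage sketch: a transfer split into 8 paged writes tagged imm=42.
counter = ImmCounter()
for _ in range(8):
    counter.on_write_imm(42)   # delivered in any order by the transport
assert counter.is_complete(42, expected=8)
```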

The efficacy of TransferEngine is robustly demonstrated through its integration and evaluation in three production-grade LLM applications:

  1. KvCache transfer for disaggregated inference: Enables dynamic scaling and low-latency layer-by-layer transfers.

  2. RL weight updates: Achieves unprecedented 1.3-second updates for trillion-parameter models by fully utilizing cluster bandwidth via one-sided RDMA WRITEs.

  3. MoE dispatch/combine: Delivers state-of-the-art decode latency on ConnectX-7 (surpassing DeepEP in many cases) and provides the first viable, high-performance implementation on EFA.

    The paper concludes that TransferEngine provides a crucial complement to existing collective communication libraries, offering the flexibility and performance required for modern, cloud-native LLM architectures while avoiding vendor lock-in.

7.2. Limitations & Future Work

The paper explicitly mentions one limitation:

  • Memory Overhead for Prefill in MoE: The current MoE kernels, optimized for decode latency, have memory overheads that limit their viability for some models during the prefill stage. This suggests a trade-off between decode-optimized latency and memory efficiency for prefill. The paper notes that DeepEP, for instance, uses pre-accumulation and partial sums to reduce memory and RDMA bytes during prefill, which TransferEngine's current decode-optimized kernels do not.

    While the paper doesn't explicitly outline future work, the identified limitation for prefill optimization naturally suggests a direction:

  • Prefill-Specific Optimizations: Developing MoE kernels specifically optimized for prefill to reduce memory overhead and further improve prefill latency, potentially incorporating techniques like partial sums or more aggressive data reduction.

  • Reducing Host Proxy Overhead: Although TransferEngine achieves strong performance with a host proxy, further reducing this overhead could enhance performance, especially as the number of peers (and thus proxy interactions) scales. This might involve exploring more aggressive batching of work requests to the NIC or optimizing the CPU-GPU communication path.

  • Broader Hardware Support: While EFA and ConnectX are covered, extending TransferEngine to other cloud-specific RDMA solutions (e.g., Alibaba Cloud eRDMA, Google Falcon) could further enhance its portability.

  • Integration with LLM Frameworks: Continued integration into a wider range of LLM inference and training frameworks (vLLM, TensorRT-LLM, etc.) would be a natural next step for broader adoption.

7.3. Personal Insights & Critique

This paper presents a highly practical and impactful solution for a growing challenge in the LLM ecosystem. My insights and critique are as follows:

  • Innovation of ImmCounter: The ImmCounter primitive is a particularly elegant solution to the problem of out-of-order reliable transports like EFA SRD. By decoupling reliability from ordering at the transport layer and providing an atomic, application-level completion notification, TransferEngine neatly sidesteps a fundamental incompatibility between different RDMA implementations. This design choice is the cornerstone of its portability. It's a clever re-interpretation of what a "reliable" primitive means in a distributed ML context, where explicit ordering might not always be necessary, but confirmation of completion is.

  • Pragmatism of Host Proxy: While GPU-initiated RDMA (IBGDA) offers the lowest theoretical latency by completely bypassing the CPU, TransferEngine's choice of a host proxy, despite introducing some overhead, is a pragmatic and necessary step for achieving true portability. The results convincingly show that this overhead can be managed to achieve competitive or even superior performance on ConnectX, while being the only viable option for EFA. This highlights a crucial trade-off: sometimes a slightly higher baseline latency due to a host proxy is worth the immense gain in hardware agnosticism and deployability.

  • Relevance to Cloud-Native LLMs: The paper directly addresses a pain point for deploying LLMs in heterogeneous cloud environments. As LLM inference and training scales, organizations will increasingly leverage diverse hardware offerings from different cloud providers. A library like TransferEngine is essential for avoiding vendor lock-in and maximizing resource utilization across these environments. The 400 Gbps peak throughput on both ConnectX and EFA validates its readiness for production workloads.

  • Complementary to Collectives: The emphasis that point-to-point communication complements collectives, rather than replaces them, is important. Different communication patterns are optimal for different tasks. TransferEngine doesn't aim to rewrite NCCL; instead, it fills a critical gap for dynamic, sparse, and asynchronous operations where collectives are ill-suited. This nuanced view reflects a mature understanding of distributed systems design.

  • Potential Areas for Improvement/Critique:

    • Proxy Overhead at Scale: While TransferEngine outperforms DeepEP at larger scales for Combine operations, Dispatch performance shows a slight dip at 64 ranks due to CPU overhead. Further optimization of the host proxy, perhaps through more aggressive batching of work requests or offloading more logic to the NIC's own processor (if available and programmable), could improve scalability for Dispatch.

    • Memory Efficiency for Prefill: The noted limitation regarding memory overhead for prefill is a significant one, as prefill is often memory-intensive. Developing specialized prefill kernels that balance latency and memory footprint would make TransferEngine a more complete solution for MoE workloads.

    • Complexity Management: While the API is minimal, the internal implementation (NUMA awareness, multi-threading, hardware-specific optimizations, ImmCounter logic) is complex. Maintaining and extending this in the long run will require significant engineering effort. The Rust implementation likely aids in reliability and performance but implies a specific language ecosystem.

      Overall, TransferEngine is a strong contribution that moves the needle forward for flexible, high-performance communication in LLM systems, particularly in the context of increasing hardware diversity and the demands of cloud-native deployments. Its design principles are sound, and its demonstrated performance is compelling.
