eGPU: Extending eBPF Programmability and Observability to GPUs
TL;DR Summary
eGPU dynamically compiles eBPF bytecode into PTX and injects it into running GPU kernels, providing real-time, low-overhead instrumentation for fine-grained performance monitoring and bottleneck detection in AI and HPC workloads, and a step toward programmable GPU infrastructure.
Abstract
eGPU: Extending eBPF Programmability and Observability to GPUs
Yiwei Yang (UC Santa Cruz, Santa Cruz, California, USA, yyang363@ucsc.edu), Tong Yu (Eunomia Inc, Wuhan, Hubei, China, yt.xyxx@gmail.com), Yusheng Zheng (UC Santa Cruz, Santa Cruz, California, USA, yzhen165@ucsc.edu), Andrew Quinn (UC Santa Cruz, Santa Cruz, California, USA, aquinn1@ucsc.edu)
Precise GPU observability and programmability are essential for optimizing performance in AI workloads and other computationally intensive high-performance computing (HPC) applications. In this paper, we introduce eGPU, the first framework and eBPF runtime that dynamically offloads eBPF bytecode onto GPUs via dynamic PTX injection. Designed primarily for observability, our system leverages real-time GPU telemetry, eBPF-based dynamic instrumentation, and automated performance analysis to pinpoint bottlenecks at multiple layers (including kernel execution, memory transfers, and heterogeneous compute orchestration) without incurring significant overhead or interrupting active GPU kernels. By dynamically compiling eBPF programs into PTX snippets and injecting them directly into running GPU kernels, eGPU provides fine-grained, low-overhead …
In-depth Reading
1. Bibliographic Information
- Title: eGPU: Extending eBPF Programmability and Observability to GPUs
- Authors:
- Yiwei Yang (University of California, Santa Cruz)
- Tong Yu (Eunomia Inc)
- Yusheng Zheng (University of California, Santa Cruz)
- Andrew Quinn (University of California, Santa Cruz)
- Journal/Conference: Published in the 4th Workshop on Heterogeneous Composable and Disaggregated Systems (HCDS '25). Workshops like HCDS are academic venues for presenting early-stage, novel, or highly focused research. They provide a platform for rapid dissemination of new ideas to a specialized community, often co-located with major systems conferences.
- Publication Year: 2025
- Abstract: The paper introduces eGPU, a novel framework that extends the capabilities of eBPF (extended Berkeley Packet Filter) to Graphics Processing Units (GPUs). The core innovation is a runtime that dynamically compiles eBPF bytecode into PTX (NVIDIA's intermediate GPU code) and injects it into active GPU kernels. This allows fine-grained, low-overhead observability of GPU activities, such as kernel execution and memory transfers, without interrupting the GPU. The authors' evaluation shows that eGPU has low instrumentation overhead while providing high-resolution performance data. They conclude that eGPU paves the way for programmable GPU infrastructures that can adapt to changing workload demands.
- Original Source Link: The paper is available at /files/papers/68f35de4d77e2c20857d89a7/paper.pdf. Its publication in an ACM-indexed workshop and the presence of a DOI suggest it is a formally published paper, not just a preprint. The open-source code is available at https://github.com/victoryang00/bpftime-super.
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: Modern AI and High-Performance Computing (HPC) applications are critically dependent on GPUs. However, understanding exactly what a GPU is doing to optimize its performance is incredibly difficult. Existing tools for GPU profiling, like NVIDIA's CUPTI or binary instrumentation frameworks like NVBit, either impose significant performance overhead, are too coarse-grained, or require stopping and restarting the application, making them unsuitable for live production environments.
- The Gap: On the CPU side, eBPF (extended Berkeley Packet Filter) has revolutionized system observability. It allows developers to run small, safe programs directly in the operating system kernel to trace events with very low overhead. However, eBPF has always been CPU-centric, with no direct way to "see" inside a running GPU kernel.
- Innovation: This paper introduces eGPU, the first system that bridges this gap. It takes the power and flexibility of eBPF and extends it directly onto the GPU, enabling a new level of dynamic, low-overhead instrumentation for GPU workloads.
- Main Contributions / Findings (What):
- The paper introduces eGPU, the first framework to dynamically offload eBPF programs onto running GPU kernels by translating them into PTX code and injecting them at runtime.
- It presents the complete system design, which integrates a user-space eBPF runtime (bpftime), a runtime PTX compiler, and shared memory for efficient CPU-GPU communication.
- Through micro-benchmarks, it demonstrates that eGPU achieves significantly lower instrumentation overhead than existing GPU instrumentation methods.
- It highlights the broader potential of this technology beyond observability, suggesting future applications in dynamic GPU optimization, resource management, and security.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- GPU (Graphics Processing Unit): A specialized processor designed to accelerate parallel computations. GPUs execute programs called kernels on thousands of threads simultaneously. Understanding their performance requires insight into this massively parallel execution model and its complex memory hierarchy.
- eBPF (extended Berkeley Packet Filter): A powerful in-kernel virtual machine in Linux that allows developers to run sandboxed programs (eBPF programs) in response to system events (like network packet arrivals or function calls). It provides a safe and efficient way to instrument, monitor, and modify kernel behavior without changing the kernel's source code.
- bpftime: A user-space implementation of the eBPF runtime. Traditional eBPF runs in the kernel, and instrumenting user applications (uprobes) requires costly context switches between the application and the kernel. bpftime moves this functionality entirely into user space, drastically reducing overhead and simplifying deployment.
- PTX (Parallel Thread Execution): An intermediate, assembly-like language for NVIDIA GPUs. CUDA C++ code is first compiled into PTX. The GPU driver then performs a final Just-In-Time (JIT) compilation step to translate the PTX code into the specific machine code (SASS) for the target GPU architecture. eGPU leverages this JIT step to inject its instrumentation (a short sketch of this driver-level JIT step follows this list).
- CXL (Compute Express Link): An open industry standard for high-speed, low-latency interconnects between CPUs and peripheral devices such as accelerators or memory modules. CXL.mem is the protocol through which hosts and attached devices can share memory coherently, creating a unified, tiered memory system.
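To make the JIT handoff concrete, here is a minimal sketch (mine, not the paper's code) that loads a tiny PTX string through the CUDA driver API; the driver JIT-compiles it to SASS inside cuModuleLoadData, which is the stage eGPU intercepts to splice in instrumentation. The PTX text and the kernel name noop are illustrative.

```cpp
// Minimal sketch (not from the paper): loading PTX through the CUDA driver API.
// The driver JIT-compiles this PTX to device machine code (SASS) inside
// cuModuleLoadData, which is the stage eGPU hooks to splice in instrumentation.
#include <cuda.h>
#include <cstdio>

// Hypothetical PTX for an empty kernel named "noop" (illustrative only).
static const char *ptx_src =
    ".version 7.0\n.target sm_60\n.address_size 64\n"
    ".visible .entry noop()\n{\n    ret;\n}\n";

int main() {
    cuInit(0);
    CUdevice dev;  cuDeviceGet(&dev, 0);
    CUcontext ctx; cuCtxCreate(&ctx, 0, dev);

    CUmodule mod;
    CUresult rc = cuModuleLoadData(&mod, ptx_src);   // JIT to SASS happens here
    if (rc != CUDA_SUCCESS) {
        std::fprintf(stderr, "PTX load failed: %d\n", (int)rc);
        return 1;
    }

    CUfunction fn;
    cuModuleGetFunction(&fn, mod, "noop");
    cuLaunchKernel(fn, 1, 1, 1, 1, 1, 1, 0, nullptr, nullptr, nullptr);
    cuCtxSynchronize();

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```

eGPU's injection works by rewriting the PTX that reaches this load path at runtime, rather than by hand-writing PTX as shown here.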
- Previous Works: The paper positions itself against existing observability solutions, particularly Meta's sophisticated AI observability stack:
- Traditional Tools (CUPTI, NVBit): NVIDIA's official CUDA Profiling Tools Interface (CUPTI) provides hooks into the CUDA runtime but can be intrusive. NVBit instruments by rewriting the final GPU machine code (SASS), but this process is complex and can incur high overhead.
- Meta's Observability Stack: Meta uses a layered approach. Dynolog collects system-level metrics. For deeper insights, they use tools like the PyTorch Profiler and Strobelight, which leverages CPU-side eBPF uprobes to trace CUDA API calls.
- Strobelight/BPF: This is a key point of comparison. Strobelight can attach eBPF probes to CUDA library functions (e.g., memory allocation calls) on the CPU side. This provides valuable information but is fundamentally limited to observing interactions between the CPU and the GPU driver; it cannot see what happens inside the GPU kernel itself.
- Differentiation: eGPU's key innovation is its ability to move instrumentation from the CPU side into the GPU kernel. Instead of just observing the requests for GPU work (as Strobelight does), eGPU observes the execution of that work on the GPU itself. It achieves this by dynamically compiling eBPF logic into PTX and injecting it directly into a running kernel, enabling fine-grained, on-device tracing of events such as individual memory accesses.
4. Methodology (Core Technology & Implementation)
The core of eGPU is a pipeline that transforms a standard eBPF program into live instrumentation running on a GPU. This process is illustrated in the system design diagram.
(Figure 1: Schematic of the eGPU system design, showing the flow from eBPF program source code to dynamic PTX injection on the GPU and the interactions among the key components.)
- Principles: The central idea is to hijack the GPU's normal execution flow at the PTX level. By intercepting the code just before it is compiled to final machine code, eGPU can insert new functionality (derived from an eBPF program) directly into the GPU kernel without stopping it. Communication between the injected code and the user-space monitoring application is handled efficiently via shared memory.
- Steps & Procedures (following Figure 1):
- eBPF Program Compilation: A developer writes an eBPF program in C and compiles it into eBPF bytecode using standard toolchains like clang.
- User-space Loading: The eBPF bytecode is loaded into the bpftime user-space runtime. bpftime manages the eBPF program's lifecycle without involving the Linux kernel, using a shared library (bpftime-syscall.so) to intercept and handle eBPF-related system calls.
- eBPF-to-PTX Translation: This is the core novelty. eGPU includes a JIT compiler that translates the loaded eBPF bytecode into an equivalent PTX code snippet. This snippet performs the original eBPF program's logic (e.g., counting events, recording timestamps).
- Shared Memory for Communication: eGPU sets up a shared memory region using boost::managed_shared_memory. This region holds eBPF maps, key-value data structures that eBPF programs use to store state, aggregate statistics, and communicate with user-space applications. Because this memory is accessible to both the CPU and the GPU, it acts as a high-speed channel for data exchange.
- PTX Injection: The generated PTX snippet is dynamically injected into the target CUDA application's PTX code before the NVIDIA driver JIT-compiles it. The paper cites the technique used in ParallelGPU OS (POS) [8], which intercepts driver APIs to perform this "self-modifying code" action on the fly. This allows eGPU to add uprobe-like or tracepoint-like hooks directly inside the GPU kernel logic.
- On-GPU Execution & Data Collection: The instrumented kernel now runs on the GPU. When the injected PTX code executes (e.g., on every memory load), it updates the eBPF maps in the shared memory region (a conceptual sketch of this probe logic follows this list).
- User-space Observation: A user-space application reads the eBPF maps from shared memory to get real-time insights into the GPU's behavior.
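To ground these steps, below is a conceptual CUDA-level sketch (not the paper's implementation, which generates and injects PTX rather than CUDA source) of what an injected memory-access probe does: each instrumented load reports its size into a map that both host and device can read. The names egpu_mem_map, egpu_probe_load, and vector_sum are hypothetical, and cudaMallocManaged stands in for the boost-managed shared region.

```cpp
// Conceptual sketch only: eGPU injects PTX, not CUDA source. This shows the
// equivalent logic of a probe attached to every load instruction.
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>

// A toy "eBPF map": counters placed in memory visible to host and device
// (hypothetical layout; the real maps live in the shared-memory region).
struct egpu_mem_map {
    unsigned long long bytes_loaded;   // total bytes observed
    unsigned long long num_accesses;   // total load operations observed
};

// Probe body corresponding to the injected PTX snippet: one call per load.
__device__ void egpu_probe_load(egpu_mem_map *map, const void *addr, uint32_t size) {
    (void)addr;  // a real probe might hash the address into a histogram map
    atomicAdd(&map->bytes_loaded, (unsigned long long)size);
    atomicAdd(&map->num_accesses, 1ULL);
}

// An application kernel with the probe "spliced in" before each load.
__global__ void vector_sum(const float *x, float *out, int n, egpu_mem_map *map) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        egpu_probe_load(map, &x[i], sizeof(float));   // instrumentation point
        atomicAdd(out, x[i]);
    }
}

int main() {
    const int n = 1 << 20;
    float *x, *out;
    egpu_mem_map *map;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    cudaMallocManaged(&map, sizeof(egpu_mem_map));   // stands in for the shared-memory map
    for (int i = 0; i < n; ++i) x[i] = 1.0f;
    *out = 0.0f; map->bytes_loaded = 0; map->num_accesses = 0;

    vector_sum<<<(n + 255) / 256, 256>>>(x, out, n, map);
    cudaDeviceSynchronize();

    // The "user-space observation" step: read the map from the host.
    std::printf("loads=%llu bytes=%llu\n", map->num_accesses, map->bytes_loaded);
    cudaFree(x); cudaFree(out); cudaFree(map);
    return 0;
}
```

In the actual system, the probe body is generated from eBPF bytecode and spliced into the kernel's PTX, and the maps reside in the boost-managed shared region rather than a cudaMallocManaged allocation.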
- Implementation Details:
- Synchronization: Since both CPU threads and GPU threads can access the shared eBPF maps concurrently, synchronization is critical. The system uses standard CPU atomic operations (std::atomic) for host-side access and CUDA atomic operations plus memory fences for device-side access to prevent race conditions and ensure data consistency (see the sketch after this list).
- Self-Modifying Code: The ability to inject code into a running kernel is based on the approach from ParallelGPU OS (POS) [8]. This technique avoids the massive overhead of stopping, recompiling, and relaunching the entire kernel, enabling true dynamic instrumentation.
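The sketch below (an illustration under stated assumptions, not the paper's code) shows the flavor of this host/device coordination: the GPU updates a counter in mapped pinned memory with an atomic followed by a system-scope fence, and the host polls it while the kernel is still running. The paper uses std::atomic on the host side and a boost-managed shared region; here a plain volatile read and cudaHostAlloc stand in for both.

```cpp
// Sketch of CPU/GPU synchronization over a shared counter (simplified; see caveats above).
#include <cuda_runtime.h>
#include <cstdio>

__global__ void produce(unsigned long long *counter) {
    // Device side: a CUDA atomic plus a system-scope fence so the mapped
    // host memory reflects the update while the kernel is still running.
    atomicAdd(counter, 1ULL);
    __threadfence_system();
}

int main() {
    cudaSetDeviceFlags(cudaDeviceMapHost);           // allow zero-copy mapped allocations

    unsigned long long *host_counter = nullptr;      // stands in for a shared eBPF map entry
    cudaHostAlloc((void **)&host_counter, sizeof(*host_counter), cudaHostAllocMapped);
    *host_counter = 0;

    unsigned long long *dev_counter = nullptr;
    cudaHostGetDevicePointer((void **)&dev_counter, host_counter, 0);

    const unsigned long long expected = 32ULL * 64ULL;
    produce<<<32, 64>>>(dev_counter);

    // Host side: poll the counter while the kernel runs. A volatile read stands in
    // for the std::atomic accesses the paper describes; no error handling for brevity.
    while (*(volatile unsigned long long *)host_counter < expected) { /* spin */ }
    cudaDeviceSynchronize();

    std::printf("observed %llu probe hits\n", *host_counter);
    cudaFreeHost(host_counter);
    return 0;
}
```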
5. Experimental Setup
The paper evaluates eGPU using a micro-benchmark and discusses its application in two key use cases.
- Hardware: The experiments were conducted on a server with dual-socket Intel Xeon E5-2697-v2 CPUs, 256 GB of RAM, and an NVIDIA P40 GPU.
- Evaluation Use Cases:
- GPU Memory Observer and CXL.mem Simulator:
- Objective: To demonstrate fine-grained memory observability and simulate next-generation memory architectures.
- Method: eGPU is used to instrument every LD (load) and ST (store) instruction in a GPU kernel. For each operation, it records metadata such as the memory address and access size.
- Metrics: The collected data can be used to calculate real-time memory bandwidth. The paper provides the formula $BW = \frac{1}{T}\sum_{i=1}^{N} B_i$, where:
- $BW$: memory bandwidth.
- $B_i$: the number of bytes transferred in the $i$-th memory operation.
- $N$: the total number of memory operations in the time window.
- $T$: the duration of the observation window.
- To simulate CXL.mem, eGPU injects an artificial delay into each memory access. The latency is modeled as $L_{\text{CXL}} = L_{\text{local}} + \Delta(\text{pattern}, BW_{\text{curr}})$, where:
- $L_{\text{CXL}}$: the simulated access latency with CXL memory.
- $L_{\text{local}}$: the native latency of the GPU's local memory.
- $\Delta(\cdot)$: an added delay that is a function of the memory access pattern (e.g., sequential vs. random) and the current bandwidth utilization $BW_{\text{curr}}$. (A host-side sketch of both calculations appears after this list.)
- LLM CPU-GPU Collaborative Caching:
- Objective: To optimize memory management for Large Language Models (LLMs) that are too big to fit entirely in GPU memory.
- Method: eGPU provides detailed visibility into memory access patterns, cache hit/miss rates, and data transfer volumes between the CPU and GPU. This allows developers to analyze the effectiveness of different caching policies (such as LRU or LFU) and data transfer mechanisms (such as Unified Memory or pinned memory); a small sketch of these placement mechanisms also follows this list.
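As a hedged illustration of the two formulas above (not code from the paper), the host-side snippet below turns per-access byte counts, as a probe might record them, into a bandwidth estimate and evaluates a toy delay model; the record layout, the constants, and the form of the delta term are assumptions.

```cpp
// Host-side sketch: derive bandwidth and a simulated CXL.mem delay from probe data.
// Everything here (record layout, constants, delta model) is illustrative.
#include <cstdint>
#include <vector>
#include <numeric>
#include <cstdio>

struct AccessRecord { uint64_t addr; uint32_t bytes; };   // what a probe might emit

// BW = (sum of B_i) / T over the observation window.
double bandwidth_bytes_per_sec(const std::vector<AccessRecord> &window, double window_sec) {
    uint64_t total = std::accumulate(window.begin(), window.end(), uint64_t{0},
        [](uint64_t acc, const AccessRecord &r) { return acc + r.bytes; });
    return window_sec > 0.0 ? static_cast<double>(total) / window_sec : 0.0;
}

// L_cxl = L_local + delta(pattern, BW_curr); the delta model below is a guess:
// random accesses pay a larger penalty, and the penalty grows with utilization.
double simulated_cxl_latency_ns(double local_ns, bool sequential,
                                double bw_curr, double bw_peak) {
    double base_penalty_ns = sequential ? 150.0 : 300.0;       // assumed constants
    double utilization = bw_peak > 0.0 ? bw_curr / bw_peak : 0.0;
    return local_ns + base_penalty_ns * (1.0 + utilization);
}

int main() {
    std::vector<AccessRecord> window = {{0x1000, 4}, {0x1004, 4}, {0x2000, 64}};
    double bw = bandwidth_bytes_per_sec(window, 1e-6);          // 1 microsecond window
    std::printf("BW = %.1f MB/s, simulated latency = %.1f ns\n",
                bw / 1e6, simulated_cxl_latency_ns(400.0, false, bw, 300e9));
    return 0;
}
```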
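To ground the placement mechanisms named in the LLM caching use case, here is a minimal sketch (not from the paper) of the Unified Memory hints a CPU-GPU collaborative cache could toggle per weight shard; shard sizes and device IDs are illustrative, and a real policy would flip these hints based on the access traces eGPU collects.

```cpp
// Sketch: Unified Memory placement hints a collaborative cache for LLM weights
// might issue (illustrative values, not the paper's code).
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int gpu = 0;
    cudaSetDevice(gpu);

    size_t shard_bytes = 64ull << 20;          // hypothetical 64 MB weight shards
    float *hot = nullptr, *cold = nullptr;
    cudaMallocManaged(&hot,  shard_bytes);     // shards can migrate between CPU and GPU
    cudaMallocManaged(&cold, shard_bytes);

    // Hot shard: prefer GPU residency and prefetch before kernels touch it.
    cudaMemAdvise(hot, shard_bytes, cudaMemAdviseSetPreferredLocation, gpu);
    cudaMemPrefetchAsync(hot, shard_bytes, gpu, /*stream=*/0);

    // Cold shard: keep it in host memory, but let the GPU fault it in on demand.
    cudaMemAdvise(cold, shard_bytes, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
    cudaMemAdvise(cold, shard_bytes, cudaMemAdviseSetAccessedBy, gpu);

    // A real policy (e.g., LRU over shards) would re-issue these hints as the
    // observed access pattern changes.
    cudaDeviceSynchronize();
    std::printf("placement hints issued\n");
    cudaFree(hot); cudaFree(cold);
    return 0;
}
```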
- Baselines: The primary baseline for performance comparison is iGuard [9], a tool that also uses GPU instrumentation. iGuard and similar tools like gpumemtrace [14] (based on NVBit) represent the state of the art in binary/IR-level GPU instrumentation, making them a strong point of comparison for overhead.
6. Results & Analysis
- Core Results (Micro-benchmark): The paper's main quantitative result is presented in Figure 2, which measures the end-to-end latency overhead of eGPU's instrumentation compared to a baseline.
(Figure 2: Runtime overhead comparison, plotting the end-to-end latency of 100 lookups, in nanoseconds, against data size in bytes, with curves for eGPU and the iGuard baseline.)
- Analysis of Figure 2: The chart plots the latency for 100 memory lookups against the size of the data being accessed.
- The iGuard baseline (red line) shows an extremely high and roughly constant latency, reflecting the high fixed overhead of traditional binary instrumentation frameworks.
- The eGPU latency (blue line) starts very low (around 100 ns) for small access sizes and increases gradually as the access size grows.
- Interpretation: The results convincingly show that eGPU's dynamic PTX injection method introduces significantly lower overhead than the baseline, especially for the common case of frequent, small memory accesses. Even for large accesses (e.g., 4 MB), its latency remains orders of magnitude lower than the iGuard baseline shown. This validates the core design goal of providing low-overhead instrumentation.
- Discussion and Limitations: The authors responsibly acknowledge the limitations of their current work:
- Limited Evaluation: The evaluation is based on a micro-benchmark. The performance on complex, real-world applications like LLM training, which involve thousands of different and often short-lived kernels, has not yet been assessed.
- JIT Compilation Overhead: The one-time cost of JIT-compiling eBPF to PTX and injecting it, while small, could become noticeable in scenarios with many short-lived kernels that are launched and destroyed rapidly.
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully demonstrates eGPU, a pioneering framework that extends the powerful observability model of eBPF to the GPU. By combining a user-space eBPF runtime with dynamic PTX injection, it enables fine-grained, low-overhead instrumentation of active GPU kernels. The preliminary results are promising, showing a significant reduction in overhead compared to existing methods and opening up new possibilities for performance analysis, dynamic optimization, and real-time adaptation of GPU workloads.
- Limitations & Future Work: Beyond the limitations discussed in the evaluation, the authors plan to:
- Conduct extensive end-to-end evaluations on real-world HPC and AI workloads.
- Explore further optimizations, such as tuning resource usage on the GPU's streaming multiprocessors (SMs).
- Leverage next-generation hardware like NVIDIA's Grace Hopper Superchips and CXL-based memory pooling to further enhance the capabilities and reduce the overhead of CPU-GPU data sharing.
- Personal Insights & Critique:
- Significance and Novelty: The concept of marrying eBPF with GPU PTX injection is highly innovative and impactful. It addresses a long-standing and critical pain point in the HPC/AI community. If this technology matures, it could become a standard tool for anyone looking to deeply understand and optimize GPU performance.
- Potential Impact: The vision of a "programmable GPU infrastructure" is particularly exciting. This work lays the foundation for systems that can dynamically reconfigure GPU kernels at runtime to enforce security policies, adapt to workload changes, or even self-optimize performance.
- Open Questions and Critiques:
- Robustness and Generality: The PTX injection mechanism is the "secret sauce." The paper defers to another work (POS [8]) for its implementation. The stability and compatibility of such a technique across different NVIDIA driver versions and GPU architectures will be a major challenge for production adoption.
- Security and Safety: eBPF on the CPU is made safe by a strict in-kernel verifier that analyzes the bytecode for safety before loading it. The paper mentions a user-space verifier but does not detail how safety is guaranteed after the eBPF code is translated to PTX and injected into the GPU. A bug in this process could crash the GPU or introduce subtle correctness issues. This GPU-side safety model needs further elaboration.
- Complexity: While powerful, the system is complex, integrating multiple cutting-edge technologies (bpftime, PTX JIT, self-modifying code). Making this accessible to the average developer will require significant engineering and tooling effort.