CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning
TL;DR Summary
The paper introduces CUDA-L2, which uses Large Language Models and Reinforcement Learning to optimize HGEMM CUDA kernels, outperforming major baselines — from torch.matmul and cuBLAS up to cuBLASLt-AutoTuning — by at least 11% on average across 1,000 configurations.
Abstract
In this paper, we propose CUDA-L2, a system that combines large language models (LLMs) and reinforcement learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels. Using CUDA execution speed as the RL reward, CUDA-L2 automatically optimizes HGEMM kernels across 1,000 configurations. CUDA-L2 systematically outperforms major matmul baselines to date, from the widely-used torch.matmul to state-of-the-art Nvidia's closed-source libraries, i.e., cuBLAS, cuBLASLt. In offline mode, where kernels are executed consecutively without time intervals, CUDA-L2 yields +22.0% over torch.matmul on average; +19.2% over cuBLAS using the optimal layout configuration (normal-normal NN and transposed-normal TN); +16.8% over cuBLASLt-heuristic, which queries the cuBLASLt library and selects the algorithm based on the heuristic's suggestion; and +11.4% over the most competitive cuBLASLt-AutoTuning model, which selects the fastest algorithm from up to 100 candidates from cuBLASLt's suggestions. In server mode, where kernels are executed at random intervals simulating real-time inference, the speedups further increase to +28.7%, +26.0%, +22.4%, and +15.9% for torch.matmul, cuBLAS, cuBLASLt-heuristic, and cuBLASLt-AutoTuning respectively. CUDA-L2 shows that even the most performance-critical, heavily-optimized kernels like HGEMM can be improved through LLM-guided RL automation by systematically exploring configuration spaces at scales impractical for humans. Project and code can be found at github.com/deepreinforce-ai/CUDA-L2
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is the optimization of Half-precision General Matrix Multiply (HGEMM) CUDA kernels through a novel system called CUDA-L2, which combines Large Language Models (LLMs) and Reinforcement Learning (RL). The paper claims that CUDA-L2 surpasses the performance of NVIDIA's cuBLAS library.
1.2. Authors
Songqiao Su, Xiaofei Sun, Xiaoya Li, Albert Wang, Jiwei Li and Chris Shum. Their affiliation is the DeepReinforce Team, as indicated by the GitHub link.
1.3. Journal/Conference
The paper is published on arXiv, a preprint server, under the identifier arXiv:2512.02551. As it is a preprint, it has not yet undergone formal peer review for a specific journal or conference. However, arXiv is a widely recognized platform for disseminating research in fields like AI, ML, and computer science.
1.4. Publication Year
The arXiv identifier 2512.02551 corresponds to a December 2025 submission, so the publication year is 2025.
1.5. Abstract
This paper introduces CUDA-L2, a system that leverages Large Language Models (LLMs) and Reinforcement Learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels. By using CUDA execution speed as the RL reward, CUDA-L2 optimizes HGEMM kernels across 1,000 diverse configurations relevant to modern LLMs. The system consistently outperforms existing major matrix multiplication baselines, including torch.matmul and NVIDIA's state-of-the-art closed-source libraries like cuBLAS and cuBLASLt. In offline mode, CUDA-L2 achieves average speedups of +22.0% over torch.matmul, +19.2% over cuBLAS (optimal layout), +16.8% over cuBLASLt-heuristic, and +11.4% over cuBLASLt-AutoTuning. These speedups further increase in server mode (simulating real-time inference) to +28.7%, +26.0%, +22.4%, and +15.9% respectively. The paper concludes that CUDA-L2 demonstrates the potential of LLM-guided RL automation to improve even highly optimized kernels by exploring complex configuration spaces at scales impractical for human experts.
1.6. Original Source Link
https://arxiv.org/abs/2512.02551 (Preprint) PDF Link: https://arxiv.org/pdf/2512.02551v1.pdf
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the performance optimization of Half-precision General Matrix Multiply (HGEMM) operations on GPUs. HGEMM is a fundamental building block for many modern AI workloads, especially Large Language Models (LLMs), where it constitutes a significant portion of the computational cost in layers like attention and feed-forward networks.
This problem is crucial because even small percentage improvements in these highly frequent operations can lead to substantial gains in overall LLM inference and training speed, translating to reduced operational costs and faster research cycles. Existing solutions, such as NVIDIA's highly optimized cuBLAS and cuBLASLt libraries, represent the state-of-the-art, meticulously hand-tuned by expert engineers. However, the paper argues that the vast and complex configuration space of modern GPU architectures (e.g., memory hierarchies, specialized Tensor Cores, instruction sets) makes it challenging for human experts to exhaustively explore all optimization opportunities. There remains a gap where automated, systematic exploration could uncover further performance gains.
The paper's entry point and innovative idea lie in applying a combination of Large Language Models (LLMs) and Reinforcement Learning (RL) to automatically generate and optimize CUDA kernels for HGEMM. By treating kernel optimization as a search problem and using real execution speed as a reward signal, CUDA-L2 aims to systematically explore the optimization landscape beyond human capabilities.
2.2. Main Contributions / Findings
The primary contributions and key findings of the paper are:
- Novel System (CUDA-L2): Introduction of CUDA-L2, a system that integrates LLMs and RL for automatic optimization of HGEMM CUDA kernels. It extends previous work (CUDA-L1) with continued pretraining on diverse CUDA code and a more generalized RL framework.
- Systematic Performance Outperformance: CUDA-L2 systematically surpasses all major matmul baselines, including torch.matmul, cuBLAS (with optimal layout), cuBLASLt-heuristic, and even cuBLASLt-AutoTuning (which itself performs extensive auto-tuning).
    - In offline mode, CUDA-L2 achieves average speedups of +22.0% over torch.matmul, +19.2% over cuBLAS-max, +16.8% over cuBLASLt-heuristic-max, and +11.4% over cuBLASLt-AutoTuning-max.
    - In server mode (simulating real-time inference), these speedups are even higher: +28.7% over torch.matmul, +26.0% over cuBLAS-max, +22.4% over cuBLASLt-heuristic-max, and +15.9% over cuBLASLt-AutoTuning-max.
- Discovery of Novel Optimization Techniques: The LLM-guided RL approach not only achieves superior performance but also discovers and applies sophisticated CUDA optimization techniques, including dynamic abstraction selection (PTX/CuTe based on problem size), padding input matrices with zeros to enable better tiling, and advanced techniques such as Double-Buffered Register Fragments with Ping-Pong Execution, Aggressive Register-Level Prefetching, Epilogue Optimization with Direct Register-to-Shared-Memory Copy, and Staggered/Split A-B Prefetch Scheduling.
- Scalability and Adaptability: CUDA-L2 optimizes kernels across 1,000 configurations of matrix dimensions (M, N, K), covering those used in widely adopted LLMs. The approach is also designed to adapt to new GPU architectures (e.g., Ada Lovelace, Hopper, Blackwell) and their architectural characteristics.
- Implications for Performance-Critical Kernels: The findings suggest that even the most performance-critical, heavily optimized low-level kernels can be further improved through automated, LLM-guided RL exploration, opening new avenues for compiler and hardware-software co-design optimization.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a reader should be familiar with the following fundamental concepts:
- General Matrix Multiply (GEMM): At its core, GEMM is a linear algebra operation that computes the matrix product $C = \alpha A B + \beta C$. Here, $A$, $B$, and $C$ are matrices, and $\alpha$ and $\beta$ are scalar coefficients. In the context of deep learning, the common case simplifies to $C = A B$ (where $\alpha = 1$ and $\beta = 0$). GEMM is a cornerstone operation for neural networks, used in dense layers, convolutions, and attention mechanisms. (A minimal PyTorch sketch illustrating an FP16 GEMM appears after this list.)
- Half-Precision (FP16): This refers to using 16-bit floating-point numbers instead of the standard 32-bit (FP32) or 64-bit (FP64) precision. FP16 offers several advantages for deep learning:
    - Reduced Memory Footprint: Less memory is required to store weights and activations, allowing larger models or batch sizes.
    - Faster Computation: GPUs often have specialized hardware (like NVIDIA's Tensor Cores) that can perform FP16 operations much faster than FP32.
    - Lower Bandwidth Usage: Moving less data between memory and processing units improves efficiency. The trade-off is reduced numerical precision, which can sometimes lead to accuracy issues if not handled carefully, but for many deep learning tasks it is sufficient.
- CUDA (Compute Unified Device Architecture): NVIDIA's parallel computing platform and programming model that enables software developers to use a GPU for general-purpose processing. CUDA kernels are functions written in CUDA C/C++ that are executed in parallel on the GPU. Optimizing these kernels involves managing memory, thread scheduling, and utilizing specialized hardware efficiently.
- Reinforcement Learning (RL): A subfield of machine learning where an "agent" learns to make decisions by performing "actions" in an "environment" to maximize a cumulative "reward." In this paper, the LLM acts as the agent, generating CUDA kernel code (actions), and the GPU environment provides feedback (execution speed) as the reward.
- Large Language Models (LLMs): Advanced AI models (e.g., GPT-4, Llama) capable of understanding, generating, and processing human language. In this context, LLMs are used as powerful code generators, capable of producing diverse and complex CUDA kernel code given prompts or architectural specifications.
- GPU Architecture Concepts:
    - Thread Block: A group of threads that can communicate with each other via shared memory and synchronize their execution. Multiple thread blocks can run concurrently on a GPU.
    - Shared Memory: A small, fast memory on a GPU that is shared by threads within the same thread block. It is much faster than global memory but has limited capacity; efficient use of shared memory is critical for high-performance CUDA kernels.
    - Registers: The fastest memory available on a GPU, directly accessible by individual threads. Used to store local variables and intermediate computation results.
    - Global Memory: The main GPU memory, larger but significantly slower than shared memory or registers. Data often needs to be explicitly moved from global memory to shared memory/registers for computation.
    - Tensor Cores: Specialized processing units on NVIDIA GPUs (starting with the Volta architecture) designed to accelerate matrix multiplication, particularly at FP16 precision. They perform Mixed-Precision Matrix Multiply-Accumulate (MMA) operations.
- torch.matmul: A function in the PyTorch deep learning framework that performs matrix multiplication. For GPU operations, it typically dispatches to optimized backend libraries (like cuBLAS). It represents a common, high-level interface for users.
- cuBLAS: NVIDIA's CUDA Basic Linear Algebra Subprograms (BLAS) library. It provides highly optimized implementations of standard linear algebra routines, including GEMM, for NVIDIA GPUs. It is a closed-source, heavily hand-tuned library, serving as a de facto benchmark for matrix multiplication performance.
    - cuBLAS-NN: Standard configuration where both input matrices A and B are in Normal (row-major) layout.
    - cuBLAS-TN: Configuration where matrix A is Transposed and matrix B is in Normal layout.
    - cuBLAS-max: Refers to selecting the faster of cuBLAS-NN and cuBLAS-TN for a given matrix configuration.
- cuBLASLt: NVIDIA's CUDA BLAS Light library. It offers a lower-level, more flexible API than cuBLAS, allowing developers finer control over algorithm selection and execution parameters. It is designed for advanced users and scenarios where maximum performance tuning is required.
    - cuBLASLt-heuristic: This mode queries the cuBLASLt library's internal heuristic, which suggests an optimal algorithm based on predefined rules and the characteristics of the input matrices (M, N, K, data types, etc.).
    - cuBLASLt-AutoTuning (or benchmark): This mode goes a step further by empirically benchmarking multiple candidate algorithms suggested by cuBLASLt (up to 100 in this paper) and selecting the fastest one for a specific configuration. This is often considered the most competitive baseline, as it involves runtime profiling and selection.
- CUTLASS: CUDA Templates for Linear Algebra Subroutines. An open-source, high-performance CUDA library built on top of CUDA C++ and CuTe. It provides highly optimized GEMM kernel implementations using Tensor Cores and sophisticated memory-management patterns, and acts as a framework for building highly efficient kernels.
    - CuTe (CUDA Template Engine): A template metaprogramming library within CUTLASS that provides abstractions for defining complex tiled matrix operations, memory layouts, and data-movement patterns on GPUs. It allows for flexible and performant kernel generation.
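To make the GEMM and FP16 concepts above concrete, here is a minimal PyTorch sketch (not from the paper) that runs an FP16 matrix multiplication on the GPU via torch.matmul and measures its element-wise deviation from an FP32 reference:

```python
# Minimal sketch (not from the paper): FP16 GEMM C = A @ B via torch.matmul,
# compared against an FP32 reference to illustrate half-precision error.
import torch

M, N, K = 1024, 1024, 1024
a = torch.randn(M, K, device="cuda", dtype=torch.float16)
b = torch.randn(K, N, device="cuda", dtype=torch.float16)

c_fp16 = torch.matmul(a, b)                  # dispatches to a cuBLAS HGEMM kernel
c_fp32 = torch.matmul(a.float(), b.float())  # higher-precision reference

max_abs_err = (c_fp16.float() - c_fp32).abs().max().item()
print(f"max element-wise |FP16 - FP32| error: {max_abs_err:.4f}")
```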
3.2. Previous Works
The paper builds upon a lineage of research involving LLMs and automated kernel optimization:
- KernelBench [10]: This benchmark suite covers five tasks, including Half-precision General Matrix Multiply (HGEMM), and aims to evaluate the ability of LLMs to write efficient GPU kernels. It serves as a foundational environment for testing LLM-based code generation for performance-critical tasks.
- CUDA-L1 [7]: A direct predecessor to CUDA-L2. CUDA-L1 is described as a pipeline system that optimizes kernels on KernelBench; the current paper, CUDA-L2, extends CUDA-L1 through several technical enhancements. While the specific details of CUDA-L1 are not fully elaborated in this paper, it clearly laid the groundwork for using LLMs and RL for CUDA kernel optimization.
- CUTLASS and CuTe: These NVIDIA libraries are fundamental to high-performance GEMM on GPUs. CUDA-L2 does not aim to replace them but rather to leverage them: CUTLASS provides templates and CuTe provides abstractions that CUDA-L2 can utilize to generate highly optimized code. This means CUDA-L2 can output code that uses CUTLASS/CuTe components, or even lower-level PTX assembly.
- Other LLM/AI-driven optimization efforts: The introduction mentions Sakana AI CUDA [6] and Swizzleper [16] as related works in AI for CUDA kernel optimization. These works generally explore the use of AI to discover, optimize, or compose GPU kernels, indicating an active research area. Swizzleper, for instance, focuses on hardware-aware LLMs for GPU kernel performance optimization.

The paper notes that, to the authors' best knowledge, no prior work had achieved performance comparable to manually optimized matmul kernels, especially when compared against NVIDIA's cuBLAS library, before CUDA-L2.
3.3. Technological Evolution
The field of GPU kernel optimization has evolved from purely manual, expert-driven tuning to increasingly automated and AI-assisted approaches:
- Manual Optimization: Early CUDA kernels were primarily hand-tuned by deep experts who understood the intricacies of GPU architecture, memory hierarchies, and instruction sets. Libraries like cuBLAS are products of this era, representing years of highly specialized human effort. This approach is effective but slow, non-scalable, and often cannot explore the entire vast optimization space.
- Domain-Specific Languages (DSLs) and Code Generators: Tools like Halide, TVM, and Tensor Comprehensions emerged to abstract away some low-level details, allowing developers to express computations at a higher level while optimizing compilers generate efficient GPU code. These tools automate parts of the optimization process but still rely on predefined optimization passes or search heuristics.
- Meta-Programming Libraries (CUTLASS, CuTe): These libraries provide highly parameterized templates and abstractions that allow developers to construct complex, high-performance kernels by combining optimized building blocks. They offer flexibility but still require human expertise to select the right parameters and configurations.
- AI-driven Optimization (LLMs + RL): This paper represents a significant step in this evolution. LLMs are used to generate initial code or variations, leveraging their vast knowledge base from pretraining on code; reinforcement learning then guides the LLM to iteratively refine and optimize the generated code based on actual execution performance on the hardware. This approach aims to autonomously discover novel optimizations that might be missed by human experts or traditional compilers. CUDA-L2 fits into this evolution as a cutting-edge example of the AI-driven optimization paradigm, pushing the boundaries of what automated systems can achieve in performance-critical low-level code generation.
3.4. Differentiation Analysis
Compared to the main methods in related work, CUDA-L2 presents several core differences and innovations:
- Beyond Heuristics and Predefined Search Spaces: Unlike traditional auto-tuning methods (like cuBLASLt-heuristic or even cuBLASLt-AutoTuning) that operate within a predefined set of algorithms or a constrained search space, CUDA-L2 uses LLMs to generate code. This allows it to explore a much broader, more flexible, and potentially novel set of kernel implementations, not limited by human preconceptions or fixed templates.
- Direct Performance Feedback (RL): The use of reinforcement learning with CUDA execution speed as the reward signal provides a direct, empirical feedback loop. This contrasts with approaches that rely on static analysis, cost models, or surrogate metrics. The direct hardware performance measurement ensures that the generated kernels are genuinely optimal for the target GPU.
- Combining Generative AI with Optimization: The core innovation is the synergistic combination of LLMs (for code generation, leveraging vast pretrained knowledge) and RL (for guided, performance-driven search and refinement). This allows the system to learn to write not just syntactically correct code, but performance-optimal code.
- Outperforming State-of-the-Art Closed-Source Libraries: A key differentiator is CUDA-L2's ability to consistently outperform NVIDIA's highly optimized, closed-source cuBLAS and cuBLASLt libraries. This is a significant claim, as these libraries are products of immense human engineering effort and are considered the gold standard for GPU BLAS operations. Prior LLM- or AI-based kernel optimization efforts generally struggled to match cuBLAS performance.
- Adaptability and Generalization: CUDA-L2 emphasizes continued pretraining on diverse CUDA code and LLM contexts (e.g., using DeepSeek 671B). This aims to equip the LLM with more general CUDA optimization capabilities, allowing it to adapt to new GPU architectures (e.g., Hopper, Blackwell) or characteristics not explicitly covered in its initial training.
- Discovery of Non-Obvious Techniques: The paper highlights that CUDA-L2 can discover non-standard or novel variations of optimization techniques (e.g., padding M values not divisible by BM, double-buffered register fragments, staggered prefetching), suggesting that the automated search can go beyond conventional human-designed solutions.
4. Methodology
4.1. Principles
The core idea behind CUDA-L2 is to automatically generate and optimize high-performance Half-precision General Matrix Multiply (HGEMM) CUDA kernels using a Large Language Model (LLM) guided by Reinforcement Learning (RL). The LLM acts as a code generator, proposing CUDA kernel implementations for given matrix dimensions (M, N, K). The RL component evaluates these generated kernels by executing them on a GPU and using the measured execution speed as a reward signal. This reward then guides the LLM to iteratively improve its code generation strategy. The theoretical basis is that by systematically exploring the vast search space of possible CUDA kernel optimizations, an LLM-RL agent can discover highly efficient configurations and even novel optimization techniques that might be impractical for human experts to find. The process is designed to be hardware-aware, directly optimizing for real-world GPU execution performance.
4.2. Core Methodology In-depth (Layer by Layer)
CUDA-L2 extends a previous system called CUDA-L1 by introducing several key enhancements to improve its generalization and optimization capabilities for HGEMM kernels.
4.2.1. A Brief Review of CUDA-L1
CUDA-L1 is described as a pipeline system [7] designed to optimize kernels on KernelBench [10], which covers 50 tasks across five categories (e.g., HGEMM, SGEMM). CUDA-L1 leveraged LLMs for code generation and RL for optimization. CUDA-L2 builds upon this foundation, aiming to provide more general CUDA optimization capabilities beyond the specific tasks in KernelBench.
4.2.2. CUDA-L2 Enhancements
CUDA-L2 introduces the following key enhancements over CUDA-L1:
4.2.2.1. Continued Pretraining
To enable more general-purpose CUDA optimization, CUDA-L2 extends the LLM's pretraining beyond the KernelBench dataset. This involves pretraining on a diverse range of CUDA code found in open-source projects, including CUTLASS versions, the CuTe library, and various NVIDIA CUDA optimization technical blogs. The goal is to equip the LLM with a broader understanding of CUDA programming patterns, optimization idioms, and GPU architectural characteristics. The paper explicitly mentions using DeepSeek 671B [8] for this continued pretraining, which allows the model to acquire more general CUDA optimization capabilities. This pretraining helps the LLM generate more relevant and potentially performant kernel code from the outset.
4.2.2.2. General Kernel RL
CUDA-L2 incorporates a more general Reinforcement Learning strategy for CUDA kernel generation. This involves using CUTLASS as a foundational library. CUTLASS provides highly optimized, modular CUDA C++ templates for linear algebra, making it a good target for LLM-generated code. The RL framework generates CUDA kernel code (e.g., in .cu files), which is then compiled and executed on the GPU. The execution speed is measured and used as the reward signal for the RL agent. This feedback loop allows the LLM to learn which code structures and optimization techniques lead to better performance. The system is designed to prevent "timing measurement hacking" by ensuring proper CUDA synchronization and event-based timing, making the reward reliable.
4.2.2.3. HGEMM RL
For the specific task of HGEMM optimization, a tailored RL strategy is adopted within CUDA-L2. The reward function is designed to balance execution speed, correctness, and code conciseness.
The reward function for a generated custom kernel is defined as:
$ r(\mathrm{custom}) = \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{t_{\mathrm{ref}}^{i}}{t_{\mathrm{custom}}^{i}} - \alpha \cdot \mathrm{diff}^{i} \right] - \beta \, \mathrm{L}(\mathrm{custom}) $
Where:
- $N$: The total number of configurations (matrix sizes M, N, K) evaluated. In this paper, $N = 1000$.
- $t_{\mathrm{ref}}^{i}$: The execution time of a reference kernel for the $i$-th configuration. This is typically the best available baseline kernel (e.g., cuBLASLt-AutoTuning).
- $t_{\mathrm{custom}}^{i}$: The execution time of the custom kernel generated by CUDA-L2 for the $i$-th configuration.
- $\alpha$: A positive penalty coefficient for correctness errors. This term ensures that the LLM prioritizes generating correct kernels.
- $\mathrm{diff}^{i}$: The maximum element-wise absolute difference between the FP32 reference output and the custom kernel's FP16 output for the $i$-th configuration. It is defined as $ \mathrm{diff}^{i} = \max_{j} | \mathrm{out}_{\mathrm{FP32}}^{i}[j] - \mathrm{out}_{\mathrm{custom}}^{i}[j] | $, where $\mathrm{out}_{\mathrm{FP32}}^{i}[j]$ is the $j$-th element of the ground-truth output computed in FP32 (high precision), and $\mathrm{out}_{\mathrm{custom}}^{i}[j]$ is the $j$-th element of the output from the custom kernel. This term penalizes numerical inaccuracies.
- $\beta$: A positive penalty coefficient for code length. This term encourages the LLM to generate concise and efficient code rather than overly verbose or complex implementations.
- $\mathrm{L}(\mathrm{custom})$: The length of the generated custom kernel code.

The generated HGEMM kernels are parsed into .cu files and compiled using nvcc (the NVIDIA CUDA Compiler). This implies that the LLM can generate code leveraging CUDA C/C++, CuTe (CUDA Template Engine), inline PTX assembly (low-level parallel thread execution instructions), and CUDA intrinsics (specialized GPU instructions), as well as CUTLASS templates. However, Python-based Domain-Specific Languages (DSLs) like Triton are explicitly excluded, as CUDA-L2 focuses on generating raw CUDA C++ or CuTe-based implementations.
4.2.3. Kernel Successfulness
A custom HGEMM kernel generated by CUDA-L2 is considered successful if it meets two criteria: executability and correctness.
4.2.3.1. Executability
For a kernel to be executable, it must compile successfully with nvcc and run without crashing or causing illegal memory access. The paper states that compute-sanitizer --tool memcheck is used to check for memory access violations, ensuring the kernel operates within its allocated memory boundaries.
4.2.3.2. Correctness
Determining the correctness of FP16 GEMM results is challenging due to the non-associative nature of floating-point arithmetic (e.g., $(a + b) + c \neq a + (b + c)$ in general). This means a direct bit-for-bit comparison with an FP32 reference is often impossible. CUDA-L2 adopts two practical criteria:
- Exact Match with Binary Inputs: This criterion addresses the challenges of floating-point comparison by using specially crafted inputs. Matrices $A$ and $B$ are randomly generated with binary elements, i.e., $a_{ik}, b_{kj} \in \{0, 1\}$. For a matrix multiplication $C = A \cdot B$, each output element is calculated as $ c_{ij} = \sum_{k=1}^{K} a_{ik} \cdot b_{kj} $. Since $a_{ik}, b_{kj} \in \{0, 1\}$, each product is either 0 or 1, so the sum is guaranteed to be a non-negative integer. The reference output is first computed using FP32 on the CPU, which provides exact integer results, and is then compared against the output of the generated custom kernel. The correctness check is based on the value of $c_{ij}$ (a minimal sketch of this check appears after this list):
    - If $c_{ij} < 2048$, the custom kernel's output element is required to be exactly equal to $c_{ij}$, because all integers in the range [0, 2048) can be exactly represented in half-precision (FP16).
    - If $c_{ij} \geq 2048$, FP16 may not be able to represent the integer exactly, so exact equality is not enforced. However, because each term $a_{ik} \cdot b_{kj} \geq 0$, the partial sums are monotonically non-decreasing; whenever the final value of $c_{ij}$ is below 2048 and the custom kernel produces a different result, the kernel is considered incorrect.
- Error Tolerance Relative to Baselines: This criterion compares the custom kernel's output against the outputs of established baselines (cuBLAS-NN, cuBLAS-TN, cuBLASLt-heuristic-NN, cuBLASLt-heuristic-TN, cuBLASLt-AutoTuning-NN, cuBLASLt-AutoTuning-TN). A custom kernel is deemed correct if its element-wise deviation from the FP32 reference output does not exceed the deviation of the worst-performing baseline. This allows for typical FP16 precision variations while still ensuring competitive numerical accuracy.
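A minimal sketch of the exact-match criterion with binary inputs might look as follows (our illustration, not the paper's code; custom_matmul_fp16 is a hypothetical stand-in for the generated kernel, and the FP32 reference is computed on the GPU here rather than the CPU):

```python
# Sketch of the binary-input exact-match check (our illustration, not the paper's code).
import torch

def exact_match_check(custom_matmul_fp16, M=512, N=512, K=1024):
    # Binary {0, 1} inputs guarantee every exact output element is a non-negative integer.
    a = torch.randint(0, 2, (M, K), device="cuda").half()
    b = torch.randint(0, 2, (K, N), device="cuda").half()

    ref = torch.matmul(a.float(), b.float())   # exact integer results in FP32
    out = custom_matmul_fp16(a, b).float()     # hypothetical generated kernel

    # Exact equality is only required where the true value is below 2048,
    # since all integers in [0, 2048) are exactly representable in FP16.
    small = ref < 2048
    return torch.equal(out[small], ref[small])

# Usage: exact_match_check(my_generated_kernel)
```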
4.2.4. Evaluation
To benchmark the custom kernel against a reference kernel, the single-run speedup score is defined as $ s(\mathrm{custom}) = \frac{t_{\mathrm{ref}}}{t_{\mathrm{custom}}} - 1 $, where:
- $t_{\mathrm{ref}}$: The execution time of the reference kernel.
- $t_{\mathrm{custom}}$: The execution time of the custom kernel generated by CUDA-L2.

Each run executes both the reference and custom kernels. The reported evaluation score is the mean speedup score over all runs. A positive value of $s(\mathrm{custom})$ indicates that the custom kernel is faster than the reference kernel.
4.2.4.1. Avoiding Timing Measurement Hacking
To ensure accurate and fair timing measurements, CUDA-L2 employs robust timing mechanisms that prevent common issues like Python's lazy evaluation or GPU warm-up effects. The process involves:
- torch.cuda.synchronize(): Ensures all previously launched CUDA kernels complete before starting measurement.
- torch.cuda.Event: CUDA events are used to record timestamps on the GPU, providing accurate kernel execution times without CPU-side overhead (start_event = torch.cuda.Event(enable_timing=True), end_event = torch.cuda.Event(enable_timing=True)).
- start_event.record(): Records the start time on the GPU.
- kernel(a, b, b_col_major, out): Executes the target HGEMM kernel.
- end_event.record(): Records the end time on the GPU.
- torch.cuda.synchronize(): Ensures all GPU operations, including event recording, are complete before reading the elapsed time.
- elapsed_time_ms = start_event.elapsed_time(end_event): Calculates the duration between the start and end events in milliseconds.

The LLM generates HGEMM kernels directly as CUDA code in .cu files, which inherently bypasses Python's lazy evaluation that can complicate timing in frameworks like PyTorch.
4.2.4.2. Offline v.s. Server Mode
The paper evaluates performance in two distinct modes to simulate different real-world scenarios:
- Offline Mode: CUDA kernels are executed consecutively, without any significant time intervals between them. This simulates a scenario where the GPU is continuously busy and its caches remain "hot," which is typical for batch processing or training workloads.
- Server Mode: In contrast, CUDA kernels are executed at random intervals. This simulates a real-time inference scenario where requests arrive sporadically and the GPU may experience "cold starts" as its caches cool down between executions. This mode is often more challenging for performance due to cache misses and setup overheads. The paper states that the speedups typically increase in server mode because CUDA-L2's optimizations are better at mitigating these cold-start penalties. (A simple simulation sketch of the two modes follows below.)
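One simple way to picture the difference between the two modes is the following sketch (our illustration; the random-interval distribution is an assumption, not the paper's exact protocol):

```python
# Sketch contrasting offline and server-style measurement (assumed interval distribution).
import random
import time
import torch

def run_offline(kernel, args, n_runs=100):
    for _ in range(n_runs):
        kernel(*args)            # back-to-back launches: caches stay "hot"
    torch.cuda.synchronize()

def run_server(kernel, args, n_runs=100, max_gap_s=0.05):
    for _ in range(n_runs):
        kernel(*args)
        torch.cuda.synchronize()
        time.sleep(random.uniform(0.0, max_gap_s))  # random idle gap: sporadic requests
```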
5. Experimental Setup
5.1. Datasets
The experiments in the paper are conducted across 1,000 distinct configurations for Half-precision General Matrix Multiply (HGEMM). These configurations represent all combinations of M, N, and K values, where M, N, and K are chosen from the set: {64, 128, 256, 512, 1024, 2048, 4096, 8192, 12288, 16384}.
- Source: The matrices are generated with binary elements in {0, 1} for correctness checking, and presumably with random floating-point values for performance benchmarking.
- Scale: The range of dimensions is broad, covering each of M, N, and K from 64 up to 16,384.
- Characteristics: Each (M, N, K) triplet defines the dimensions of matrix A (M×K) and matrix B (K×N) for the operation C = A·B.
- Domain: The chosen configurations are explicitly stated to cover those commonly used in attention and Feed-Forward Network (FFN) layers of widely open-sourced Large Language Models (LLMs) such as Qwen [11], Llama [4], and DeepSeek [5]. This ensures the experimental results are directly relevant to real-world LLM workloads.

These datasets are chosen because they represent a comprehensive and realistic set of HGEMM operations encountered in modern LLM inference and training, allowing robust validation of the method's performance across various problem sizes and aspect ratios.
5.2. Evaluation Metrics
The primary evaluation metric used in the paper is speedup, specifically the relative speedup of a custom kernel over a reference kernel.
- Conceptual Definition: Speedup quantifies how much faster a custom kernel executes compared to a reference kernel. A higher speedup indicates better performance; the metric focuses on the time efficiency of the GPU kernel execution.
- Mathematical Formula: The speedup for a single run of a custom kernel against a reference kernel is defined as $ s(\mathrm{custom}) = \frac{t_{\mathrm{ref}}}{t_{\mathrm{custom}}} - 1 $. The reported evaluation score is the mean speedup score over all runs for a given configuration.
- Symbol Explanation:
    - $s(\mathrm{custom})$: The speedup score of the custom kernel.
    - $t_{\mathrm{ref}}$: The execution time of the reference kernel (a baseline, e.g., torch.matmul, cuBLAS, cuBLASLt).
    - $t_{\mathrm{custom}}$: The execution time of the custom kernel generated by CUDA-L2.

A speedup of +0.10, for example, means the custom kernel is 10% faster than the reference kernel; if the custom kernel is slower, the speedup is negative. The paper also reports the mean, median, standard deviation (Std), and win rate (the ">1" column) of the speedups across the 1,000 configurations, where ">1" refers to the number of configurations out of 1,000 on which CUDA-L2 outperforms the baseline. (A short sketch of how these statistics are computed follows below.)

The paper also implicitly uses correctness as a binary metric (pass/fail) during RL training and evaluation, as described in Section 4.2.3. Correctness is essential; an optimized kernel is useless if it produces incorrect results. The diff term in the reward function directly penalizes incorrectness.
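For concreteness, the reported statistics can be derived from per-configuration timings roughly as follows (a sketch of the metric definitions above, not the authors' evaluation code):

```python
# Sketch: per-configuration speedup statistics (mean, median, std, win rate).
import statistics

def speedup_stats(t_ref, t_custom):
    s = [r / c - 1 for r, c in zip(t_ref, t_custom)]   # s = t_ref / t_custom - 1
    wins = sum(1 for x in s if x > 0)                   # configs where custom is faster
    return {
        "mean": statistics.mean(s),
        "median": statistics.median(s),
        "std": statistics.pstdev(s),
        "win_rate": f"{wins}/{len(s)}",
    }
```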
5.3. Baselines
The paper compares CUDA-L2 against several representative and state-of-the-art matrix multiplication baselines:
- torch.matmul: The PyTorch framework's default matrix multiplication operation. It represents the common-case usage for most deep learning practitioners and typically relies on highly optimized backend libraries like cuBLAS for GPU execution.
- cuBLAS: NVIDIA's highly optimized CUDA BLAS library. This is a critical baseline, as it is a closed-source, manually tuned library considered a gold standard for GPU GEMM performance.
    - cuBLAS-NN: The cuBLAS implementation where both input matrices A and B are in Normal (row-major) layout.
    - cuBLAS-TN: The cuBLAS implementation where matrix A is Transposed (column-major) and matrix B is in Normal (row-major) layout.
    - cuBLAS-max: For each (M, N, K) configuration, CUDA-L2 is compared against the optimal cuBLAS layout, meaning the faster of cuBLAS-NN and cuBLAS-TN. This provides the strongest cuBLAS comparison.
    - The implementation uses the cublasGemmEx function with CUBLAS_GEMM_DEFAULT_TENSOR_OP to leverage Ampere FP16 Tensor Cores.
- cuBLASLt: NVIDIA's lower-level BLAS Light library, offering more control over algorithm selection.
    - cuBLASLt-heuristic: This baseline queries the cuBLASLt library for its heuristic-based suggestion of the optimal algorithm for a given (M, N, K) configuration. It represents a more advanced optimization than cuBLAS alone, as it dynamically selects algorithms.
    - Similar to cuBLAS, it evaluates cuBLASLt-heuristic-NN and cuBLASLt-heuristic-TN; cuBLASLt-heuristic-max selects the optimal layout (NN or TN) from the heuristic's suggestions.
- cuBLASLt-AutoTuning (or cuBLASLt-benchmark): The most competitive baseline. It queries cuBLASLt for up to 100 candidate algorithms, empirically benchmarks all of them (with warm-up and measurement rounds), and selects the fastest one for the specific (M, N, K) configuration. This represents the best performance achievable by cuBLASLt through extensive runtime profiling.
    - It also evaluates cuBLASLt-AutoTuning-NN and cuBLASLt-AutoTuning-TN; cuBLASLt-AutoTuning-max selects the optimal layout (NN or TN) after auto-tuning.

These baselines are chosen because they span a spectrum from widely used to state-of-the-art, highly optimized matrix multiplication implementations on NVIDIA GPUs. cuBLASLt-AutoTuning is particularly important, as it establishes a very high bar for automated kernel optimization.
6. Results & Analysis
6.1. Core Results Analysis
CUDA-L2 consistently demonstrates superior performance across all baselines and evaluation modes (offline and server). The results clearly validate the effectiveness of the proposed LLM-guided RL approach.
The following are the results from the abstract and Table 1 of the original paper:
- Overall CUDA-L2 Performance:
    - Offline Mode: CUDA-L2 yields +22.0% over torch.matmul on average; +19.2% over cuBLAS-max; +16.8% over cuBLASLt-heuristic-max; and +11.4% over the most competitive cuBLASLt-AutoTuning-max.
    - Server Mode: The speedups further increase to +28.7% over torch.matmul; +26.0% over cuBLAS-max; +22.4% over cuBLASLt-heuristic-max; and +15.9% over cuBLASLt-AutoTuning-max.
    - The speedups in server mode are consistently higher than in offline mode, indicating that CUDA-L2's optimizations are particularly effective at mitigating the overheads associated with "cold starts" and cache misses that occur when kernels are executed at random intervals, simulating real-time inference.
- The win rates (the number of configurations out of 1,000 where CUDA-L2 outperforms the baseline) range from 79.3% to 95.7% across all baselines (as seen in Table 1). This confirms that the improvements are systematic and not merely driven by a few outlier configurations.
The following are the results from Table 1 of the original paper:
| Baseline | Offline Mean | Offline Median | Offline Std | Offline >1 | Server Mean | Server Median | Server Std | Server >1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| cuBLAS-NN | 20.0% | 17.5% | 0.197 | 884/1000 | 28.8% | 25.2% | 0.283 | 862/1000 |
| cuBLAS-TN | 21.4% | 19.5% | 0.193 | 913/1000 | 30.2% | 26.3% | 0.275 | 887/1000 |
| cuBLASLt-heuristic-NN | 17.3% | 15.6% | 0.143 | 914/1000 | 24.4% | 22.6% | 0.202 | 901/1000 |
| cuBLASLt-heuristic-TN | 19.1% | 17.1% | 0.140 | 957/1000 | 25.9% | 24.1% | 0.198 | 939/1000 |
| cuBLASLt-AutoTuning-NN | 12.1% | 11.4% | 0.157 | 803/1000 | 17.9% | 15.9% | 0.220 | 818/1000 |
| cuBLASLt-AutoTuning-TN | 13.3% | 13.5% | 0.152 | 852/1000 | 19.1% | 17.6% | 0.217 | 844/1000 |
| torch.matmul | 22.0% | 19.2% | 0.211 | 902/1000 | 28.7% | 25.6% | 0.275 | 899/1000 |
| cuBLAS-max | 19.2% | 16.4% | 0.191 | 878/1000 | 26.0% | 22.9% | 0.260 | 849/1000 |
| cuBLASLt-heuristic-max | 16.8% | 15.3% | 0.140 | 913/1000 | 22.4% | 21.4% | 0.186 | 894/1000 |
| cuBLASLt-AutoTuning-max | 11.4% | 11.2% | 0.152 | 793/1000 | 15.9% | 14.4% | 0.207 | 798/1000 |
The raw cuBLAS and cuBLASLt variants (e.g., cuBLAS-NN, cuBLAS-TN) show that CUDA-L2 generally performs better against TN (Transposed-Normal) configurations than NN (Normal-Normal) configurations. For instance, CUDA-L2 gains 20.0% over cuBLAS-NN versus 21.4% over cuBLAS-TN in offline mode. A similar pattern is observed for cuBLASLt-heuristic (17.3% vs. 19.1%) and cuBLASLt-AutoTuning (12.1% vs. 13.3%). This might suggest that CUDA-L2 is particularly adept at optimizing kernels for scenarios involving transposed inputs, or that the baseline TN implementations have more room for improvement.
The most challenging baseline for CUDA-L2 is cuBLASLt-AutoTuning-max, which already selects the best algorithm from up to 100 candidates by empirical benchmarking. Yet, CUDA-L2 still achieves a +11.4% speedup in offline mode and +15.9% in server mode over this highly optimized baseline. This highlights the power of the LLM-guided RL approach in discovering even finer-grained optimizations beyond what standard auto-tuning can achieve.
6.2. Max(CUDA-L2, baseline)
The paper also presents results for a scenario where the user can choose the faster kernel between CUDA-L2's generated kernel and the baseline's kernel for each configuration. This max(CUDA-L2, baseline) scenario simulates a practical deployment where the best available option is always selected.
The following are the results from Table 2 of the original paper:
| Baseline | CUDA-L2 vs. baseline (Offline) | CUDA-L2 vs. baseline (Server) | max(CUDA-L2, baseline) vs. baseline (Offline) | max(CUDA-L2, baseline) vs. baseline (Server) |
| --- | --- | --- | --- | --- |
| torch.matmul | 22.0% | 28.7% | 23.1% | 29.8% |
| cuBLAS-max | 19.2% | 26.0% | 20.2% | 27.2% |
| cuBLASLt-heuristic-max | 16.8% | 22.4% | 17.0% | 22.7% |
| cuBLASLt-AutoTuning-max | 11.4% | 15.9% | 13.2% | 18.1% |
As expected, combining CUDA-L2 with the baselines by always picking the faster option yields additional marginal gains across all baselines and modes. For example, the offline speedup over torch.matmul increases from 22.0% to 23.1%, and over cuBLASLt-AutoTuning-max from 11.4% to 13.2%. This confirms that even for configurations where CUDA-L2 might not be strictly faster, it provides a competitive alternative, and leveraging both can lead to a more robust and faster solution overall. The server mode shows similar increases, with torch.matmul going from 28.7% to 29.8% and cuBLASLt-AutoTuning-max from 15.9% to 18.1%.
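In deployment, the max(CUDA-L2, baseline) scenario amounts to a per-configuration lookup table built from profiling, along these lines (a sketch; the kernel objects and profiling functions are placeholders, not the paper's implementation):

```python
# Sketch: per-(M, N, K) dispatch that always picks the faster of the CUDA-L2
# kernel and the baseline, based on offline profiling (placeholder functions).
def build_dispatch_table(configs, time_cuda_l2, time_baseline):
    # configs: iterable of (M, N, K); time_* return measured milliseconds per config.
    return {
        cfg: ("cuda_l2" if time_cuda_l2(cfg) <= time_baseline(cfg) else "baseline")
        for cfg in configs
    }

def matmul_dispatch(a, b, table, cuda_l2_kernel, baseline_kernel):
    cfg = (a.shape[0], b.shape[1], a.shape[1])          # (M, N, K)
    kernel = cuda_l2_kernel if table.get(cfg) == "cuda_l2" else baseline_kernel
    return kernel(a, b)
```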
6.3. Speedup vs. Problem Size
The paper analyzes how CUDA-L2's speedup over cuBLASLt-AutoTuning-max varies with matrix dimensions. Problem size is characterized by the magnitude of the (M, N, K) dimensions; the figure's subplots consider an overall size measure, the average dimension, and the maximum dimension.
The following figure (Figure 2 from the original paper) shows how CUDA-L2's speedup against cuBLASLt-AutoTuning-max varies with matrix dimensions:
[Chart with three subplots: (a) relative speedup versus overall problem size; (b) relative speedup versus the average dimension; (c) relative speedup versus the maximum dimension, with distribution statistics.]
The figure, titled "Table 3: Speedup over cuBLASLt-AutoTuning-max by matrix size in offline mode," shows a clear trend: as the problem size increases, the speedup provided by CUDA-L2 generally decreases.
- For smaller problems, CUDA-L2 achieves substantial speedups, often in the range of 1.3x to 1.4x (i.e., 30-40% faster).
- For larger problems, the speedup tends to converge toward 1.0x (i.e., 0% speedup), meaning CUDA-L2 performs similarly to the highly optimized cuBLASLt-AutoTuning-max baseline.

This trend suggests that CUDA-L2 provides the most significant improvements for small and medium-sized matrix multiplications. These smaller GEMM operations are often found in scenarios like batch-size-1 inference, or in the attention heads of Transformers, where M, N, or K corresponds to sequence length or head dimension and may not always be very large. The likely reason for this pattern is that smaller problems have more "headroom" for optimization in terms of latency hiding, cache utilization, and instruction scheduling, where CUDA-L2's fine-grained tuning makes a proportionally larger impact. For very large problems, the GPU is already heavily utilized, and most standard optimizations (efficient memory access, pipelining) are already well implemented by the baselines, leaving less room for further significant speedups.
The analysis of average dimension and maximum dimension in the subplots of Figure 2 (Table 3) further supports this, showing higher speedups for smaller average and maximum dimensions. This implies that problems with less extreme dimensions benefit more from CUDA-L2's approach.
6.4. Optimization Techniques Discovered and Applied by CUDA-L2
CUDA-L2's LLM-guided RL approach was able to discover and apply a range of sophisticated CUDA optimization techniques, some of which are novel variations beyond standard implementations.
6.4.1. Abstraction Selection
CUDA-L2 automatically selects between different levels of abstraction for implementation based on the (M, N, K) dimensions.
- For smaller matrices: CUDA-L2 tends to generate kernels using lower-level PTX assembly and CUDA intrinsics. These provide maximum control over the hardware, allowing highly specialized optimizations, fewer pipeline stages, and minimal synchronization overhead, leading to faster speeds for smaller problem sizes.
- For larger matrices: CUDA-L2 leans toward CuTe's abstractions. CuTe (CUDA Template Engine) provides higher-level tools to manage sophisticated tiled MMA (Matrix Multiply-Accumulate) operations with Tensor Cores, which are essential for efficiently handling large data volumes. This allows CUDA-L2 to leverage the powerful capabilities of CUTLASS/CuTe for large-scale operations.
6.4.2. Padding the Input Matrix with Zeros
A common challenge in tiled GEMM implementations is handling matrix dimensions that are not perfectly divisible by the thread block tile sizes (BM, BN). This can lead to out-of-bounds memory accesses for boundary thread blocks. Traditionally, BM (the M-dimension tile size) and BN (the N-dimension tile size) are chosen such that M is divisible by BM and N by BN.
CUDA-L2 discovered an alternative: padding the input matrix with zeros.
- Mechanism: When M is not divisible by BM, CUDA-L2 can nevertheless choose such a BM. It then pads M to a slightly larger value that is divisible by the chosen BM, which amounts to appending zero-valued rows to the input matrix.
- Benefit: This allows CUDA-L2 to select a BM that is more optimal for the Tensor Core instruction format, memory access patterns, or other architectural characteristics, even if it introduces a small overhead from padding. In the reported example with M = 8192, CUDA-L2 selects a BM that does not divide 8192 and pads M to 8320 (a 1.6% overhead). This choice outperforms conventional tile sizes that divide 8192 exactly: the padded configuration yields a clear speedup over cuBLASLt-AutoTuning-TN, whereas the conventional choices achieve only 0.4% and -15.7%, respectively. This demonstrates that intelligent padding can unlock significant performance gains by enabling a more effective tile size. (A short sketch of the padding idea follows below.)
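The padding trick itself is easy to express in PyTorch. The sketch below is our illustration, not the generated kernel; BM = 160 is an illustrative tile size that happens to turn 8192 into 8320, not necessarily the paper's choice:

```python
# Sketch: pad M up to a multiple of the chosen tile size BM, run the GEMM on the
# padded matrix, then slice the result back. BM = 160 is illustrative only.
import torch

def padded_matmul(a, b, BM=160):
    M, K = a.shape
    M_pad = ((M + BM - 1) // BM) * BM            # e.g., 8192 -> 8320 when BM = 160
    if M_pad != M:
        pad_rows = torch.zeros(M_pad - M, K, device=a.device, dtype=a.dtype)
        a = torch.cat([a, pad_rows], dim=0)      # zero rows do not affect the valid output
    c = torch.matmul(a, b)                       # stand-in for the tiled custom kernel
    return c[:M]                                 # drop the padded rows
```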
6.4.3. Proficiency in Using CUDA Optimization Techniques
CUDA-L2 is proficient in using several standard CUDA optimization techniques and discovering novel variations:
- Shared memory with bank conflict avoidance: Organizing the data layout (e.g., using a swizzle pattern and bit-level permutations) to prevent multiple threads from simultaneously accessing the same bank of shared memory, which causes serialization and performance degradation.
- Double buffering: Using two buffers (e.g., in shared memory) to overlap data transfer and computation. While one buffer is being processed by the compute units, the other is being filled with the next batch of data from global memory, hiding memory latency. The paper denotes the number of buffering stages with n_stage.
- Asynchronous memory copy: Initiating memory transfers between global memory and shared memory (e.g., 128-bit transfers) without blocking the GPU's compute units. This allows computation to proceed while data is being fetched, improving parallelism.
- Register accumulation: Storing partial results of computations in registers (the fastest memory) for as long as possible to minimize accesses to slower shared memory or global memory.
- Block Swizzling (blockwise reordering of block indices): Determines the stride pattern used to reorder how thread blocks access global memory. It can improve cache locality and coalescing of global memory accesses, especially for specific matrix dimensions or access patterns.
- Epilogue optimization: The final stage of a GEMM kernel, where accumulated results from registers are written back to global memory. CUDA-L2 effectively handles transpositions and reorders results from registers to global memory (potentially using 1-byte and 2-byte instructions) to reduce memory traffic.
- Loop unrolling and simple loops: CUDA-L2 is proficient in generating efficient loop structures, including loop unrolling (expanding a loop's body to reduce loop overhead and expose more instruction-level parallelism) and designing simple, optimized loops.
- Tile sizes, pipeline stages, etc.: CUDA-L2 can determine optimal tile sizes (BM, BN, BK) and pipeline stages (n_stage) for specific (M, N, K) configurations.

Moreover, CUDA-L2 can discover novel variations beyond standard implementations:
6.4.3.1. Double-Buffered Register Fragments with Ping-Pong Execution
This is an advanced form of double buffering applied at the register level.
The following figure (Listing 2 from the original paper) illustrates register fragment buffering:
[Code listing (Listing 2): a fragment of the HGEMM kernel showing tensor partitioning and copies, with a conditional that alternates between two register buffers.]
- Standard (single buffer, left side of Figure 3): Data is loaded into a single register fragment (e.g., tCrA_fragment) and processed; the next data load waits for the current processing to complete.
- Optimized (double buffer with ping-pong, right side of Figure 3): Two register fragments are used. While one fragment is being used for computation, the other is asynchronously loaded with the next set of data. This "ping-pong" mechanism (alternating between buffers) allows data loading and computation to overlap, effectively hiding memory latency. The LLM can automatically generate this more complex logic, which is crucial for maximizing Tensor Core utilization.
6.4.3.2. Aggressive Register-Level Prefetching
Prefetching aims to load data into faster memory (like registers) before it is actually needed, reducing stalls.
The following figure (Listing 3 from the original paper) illustrates register-level prefetching:
[Code listing (Listing 3): on the left, a single-step (K+1) prefetch strategy; on the right, pipelined prefetching of K+0, K+1, and K+2 using #pragma unroll.]
- Standard (single-step prefetch, left side of Figure 4): Prefetching data only for the next iteration (e.g., K+1). The LLM is capable of generating this prefetching logic using #pragma unroll.
- Optimized (multi-step prefetch, right side of Figure 4): CUDA-L2 can discover more aggressive prefetching strategies, loading data for multiple future iterations (e.g., K+0, K+1, K+2). This keeps the compute units busy for longer by ensuring a deeper pipeline of data, and is particularly effective when iteration counts are high and there is sufficient register headroom (i.e., enough available registers to store prefetched data without spilling to slower memory).
6.4.3.3. Epilogue Optimization with Direct Register-to-Shared-Memory Copy
The epilogue phase involves writing the final results from registers back to shared memory or global memory.
The following figure (Listing 4 from the original paper) illustrates epilogue register-to-shared-memory copy:
[Code listing (Listing 4): on the left, an epilogue copy staged through an intermediate tensor; on the right, a direct register-to-shared-memory (R2S) copy.]
- Two-step (left side of Figure 5): The standard approach, often seen in CUTLASS/CuTe, may involve an intermediate tensor (intermediate_tensor) to reorganize data. This is necessary if the register layout does not directly match the shared memory layout.
- Direct wide copy (right side of Figure 5): CUDA-L2 can discover how to directly copy data from registers to shared memory using wide memory accesses (e.g., 8-byte or 16-byte stores). This avoids the overhead of intermediate tensors and extra memory movements, making the epilogue more efficient. It requires a precise understanding of memory layouts and hardware capabilities.
6.4.3.4. Staggered/Split A-B Prefetch Scheduling
This optimization deals with the timing of prefetching input matrices A and B within the GEMM main loop.
The following figure (Listing 5 from the original paper) illustrates A-B prefetch scheduling:
[Code listing (Listing 5): consecutive prefetching of A and B on the left versus staggered/split A-B prefetch scheduling around the cute::gemm call on the right.]
- Standard (consecutive prefetch, left side of Figure 6): Typically, prefetch A and prefetch B happen consecutively: the A prefetch completes, then the B prefetch starts, and only then does the gemm (Matrix Multiply-Accumulate) operation execute. This can leave gaps where execution units are idle because the memory and compute pipelines are not fully overlapped.
- Optimized (staggered prefetch, right side of Figure 6): CUDA-L2 can generate staggered prefetch scheduling. For instance, it can issue the A prefetch, immediately follow with the gemm operation (cute::gemm(tiled_mma, ...)), and then issue the B prefetch. Splitting the A and B prefetches fills the gap after the MMA issues and achieves better overlap between memory operations and compute operations, improving pipeline utilization.
6.5. Hyperparameter Selection Patterns in Optimized CUDA Matrix Multiplication Kernels
The paper analyzes the patterns of hyperparameter selection made by CUDA-L2's LLM for optimized CUDA kernels. These patterns provide insights into the strategies learned by the RL agent.
The following table (Table 4 from the original paper) displays hyperparameter selection patterns:
[Figure: six box plots of hyperparameter selection patterns. (a) and (b): BM grows with M and N (ρ = 0.652 and ρ = 0.705); (c): BK is only weakly correlated with K (ρ = 0.256); (d): BM and BN tend to be similar (ρ = 0.695); (e): the number of pipeline stages grows with K (ρ = 0.473); (f): block swizzling usage versus problem size.]
The image contains Figure 4 (not Table 4) which shows six subplots (a-f) illustrating hyperparameter selection patterns.
6.5.1. How to Choose BM, BN, BK
BM, BN, and BK refer to the tile sizes for the M, N, and K dimensions, respectively, used by each thread block during the matrix multiplication.
- BM and M (Figure 6a): BM (the M-dimension tile size) scales proportionally with the M dimension, with a positive correlation coefficient of ρ = 0.652. As M increases from small to large values, BM grows from approximately 60 to 160. This indicates that larger M dimensions benefit from larger M-tiles, likely by better utilizing Tensor Cores and improving data locality for larger output tiles.
- BN and N (Figure 6b): A similar, even stronger, positive correlation is observed between BN (the N-dimension tile size) and the N dimension, with ρ = 0.705. Larger N dimensions likewise call for larger N-tiles, for reasons similar to BM.
- BK and K (Figure 6c): In contrast, BK (the K-dimension tile size) shows only a weak correlation with K (ρ = 0.256). This is expected because BK is often constrained by shared memory capacity, Tensor Core instruction formats, and the number of pipeline stages rather than scaling directly with K. Its primary role is to determine the granularity of data loading and inner-loop iterations.
- BM and BN correlation (Figure 6d): There is a high correlation between BM and BN (ρ = 0.695), indicating that CUDA-L2 tends to choose balanced BM and BN values. This is likely due to Tensor Core instruction formats preferring balanced input dimensions, and to reduce the risk of resource imbalance where one dimension becomes a bottleneck.
6.5.2. How to Choose Stage Number in Multi-stage Pipelining
The n_stage parameter refers to the number of pipeline stages used in double buffering or multi-stage pipelining for data movement.
- n_stage and K (Figure 6e): As the K dimension increases, the number of pipeline stages (n_stage) tends to increase (correlation ρ = 0.473). This is crucial for latency hiding.
    - Small K values typically require only 2-3 stages for adequate latency hiding.
    - Large K values require 6 or more stages to maintain high throughput by keeping multiple data loads "in flight" simultaneously, ensuring the Tensor Cores are continuously fed with data.
6.5.3. When and How to Use Block Swizzling
Block Swizzling is a technique to reorder how thread blocks access global memory, typically to improve memory access patterns and cache locality.
- Block Swizzling and Problem Size (Figure 6f): The decision to enable block swizzling is primarily driven by problem size (total operations, proportional to M·N·K).
    - For small problems, block swizzling is optional and used in only 44% of configurations, as the overhead may outweigh the benefits.
    - For medium problems, block swizzling becomes increasingly prevalent, used 73-80% of the time.
    - For very large problems, block swizzling is almost universally applied, with 99% usage, indicating that it is essential as memory access patterns grow more complex at larger scales.
- Block Swizzling Stride and Problem Size: For larger problems, CUDA-L2 tends to select larger stride values for block swizzling (512-16,384). A strong correlation demonstrates that both the decision to use block swizzling and the choice of its stride are closely tied to problem size, becoming essential as memory access patterns grow more complex.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduces CUDA-L2, an innovative system that integrates Large Language Models (LLMs) and Reinforcement Learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels. Through continuous pretraining on a diverse range of CUDA code and LLM contexts, CUDA-L2 achieved remarkable performance gains across 1,000 distinct HGEMM configurations on A100 GPUs.
Specifically, CUDA-L2 demonstrated significant speedups over state-of-the-art baselines:
- In offline mode, it yielded a +11.4% average speedup over NVIDIA's highly competitive cuBLASLt-AutoTuning library.
- In server mode, this speedup further increased to +15.9%.

Against more commonly used baselines, the gains were even larger: +22.0% over torch.matmul (+28.7% in server mode) and +19.2% over cuBLAS (+26.0% in server mode). Beyond raw performance, CUDA-L2 also demonstrated the ability to discover and apply sophisticated CUDA optimization techniques, including novel variations like double-buffered register fragments and staggered A-B prefetching. This work shows that even heavily optimized, performance-critical kernels like HGEMM can be substantially improved through LLM-guided RL automation, by systematically exploring configuration spaces at scales impractical for human experts.
7.2. Limitations & Future Work
The provided text does not explicitly list a "Limitations" section, but some can be inferred from the results and discussions:
- Decreasing Speedup for Large Problems: As shown in the "Speedup vs. Problem Size" analysis, CUDA-L2's performance gains tend to diminish for very large matrix sizes. For the largest problems, the speedup approaches 1.0x, meaning it performs similarly to cuBLASLt-AutoTuning-max. This suggests that the remaining optimization opportunities for extremely large, compute-bound kernels may be more limited or harder to discover through the current approach.
- Computational Cost of RL Search: While not explicitly discussed, RL-based optimization, especially involving LLMs for code generation and empirical evaluation on hardware, can be computationally expensive and time-consuming. Generating, compiling, and benchmarking thousands of kernels across 1,000 configurations would require substantial GPU resources and time.
- Generalizability Beyond HGEMM: The paper focuses specifically on HGEMM. While the methodology (LLM-guided RL for CUDA kernels) is general, the extent to which these specific speedups and discovered techniques transfer to other types of CUDA kernels (e.g., convolutions, reductions, custom layers) remains to be fully explored.

The paper implicitly points toward future work in its ambition for general CUDA optimization capabilities and mentions adaptability to new GPU architectures like Blackwell (B200). This suggests ongoing research into maintaining performance gains on future hardware and expanding the scope of automatically optimizable kernels.
7.3. Personal Insights & Critique
This paper presents a highly inspiring advancement in the field of automated code optimization. The ability of an LLM-RL system to surpass cuBLAS performance is a significant milestone, challenging the long-held belief that manual, expert-driven tuning is irreplaceable for such critical low-level kernels.
Inspirations:
- Paradigm Shift: This work truly exemplifies a paradigm shift where AI is not just assisting humans, but actively discovering and implementing solutions that outperform human-engineered ones in highly specialized domains. This has profound implications for compiler design, hardware-software co-design, and even the future of low-level systems programming.
- Leveraging LLMs' Code Generation Prowess: The paper effectively demonstrates that LLMs, when properly guided by RL and validated by real-world performance, can move beyond generating syntactically correct code to generating performance-optimal code. The fine-grained optimization techniques discovered (e.g., staggered prefetching, padding with zeros for better BM selection) show a deep understanding of GPU architecture implicitly learned by the system.
- Scalability of Exploration: The LLM-RL framework's capacity to explore 1,000 configurations (and, implicitly, many more kernel variations during training) is a scale that manual optimization cannot match, leading to the discovery of niche optimizations.
Potential Issues & Areas for Improvement:
- Training Cost and Efficiency: While CUDA-L2 achieves impressive results, the computational cost and time required to train such an LLM-RL system to this level of performance are likely substantial. Future work could focus on making the training process more sample-efficient or on transferring knowledge across different GPU architectures or kernel types.
- Interpretability and Debugging: LLM-generated code can be complex, and RL often operates as a black box. Understanding why certain generated kernels are faster, or why certain optimization patterns emerge, can be challenging. Tools for interpreting the LLM's "reasoning" or debugging generated code would enhance trust and allow human insight to further refine the process.
- Generalizability Across Kernel Types: While impressive for HGEMM, generalizing this approach to a wider array of CUDA kernels (e.g., sparse operations, custom activation functions, non-linear operations) with similar performance gains would be the ultimate test. Each kernel type has its own unique optimization challenges and architectural bottlenecks.
- Dynamic Adaptation: The paper frames server mode as mimicking real-time inference. A more advanced system might not just generate an optimal kernel offline but dynamically adapt or select kernels based on runtime conditions (e.g., GPU load, memory pressure, specific input data characteristics).
- Integration with Compilers: To make such a system widely usable, seamless integration into existing compiler toolchains (e.g., LLVM, GCC) or deep learning frameworks (e.g., PyTorch, TensorFlow) would be crucial.

Overall, CUDA-L2 is a landmark paper that showcases the immense potential of blending AI with low-level systems optimization, opening exciting new frontiers for high-performance computing.