
CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning

Published: 12/02/2025

TL;DR Summary

The paper introduces CUDA-L2, which uses Large Language Models and Reinforcement Learning to optimize HGEMM CUDA kernels, outperforming major baselines such as torch.matmul, cuBLAS, and cuBLASLt by average offline margins ranging from +11.4% (vs. the strongest cuBLASLt-AutoTuning baseline) to +22.0% (vs. torch.matmul) across 1,000 configurations.

Abstract

In this paper, we propose CUDA-L2, a system that combines large language models (LLMs) and reinforcement learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels. Using CUDA execution speed as the RL reward, CUDA-L2 automatically optimizes HGEMM kernels across 1,000 configurations. CUDA-L2 systematically outperforms major matmul baselines to date, from the widely-used torch.matmul to state-of-the-art Nvidia's closed-source libraries, i.e., cuBLAS, cuBLASLt. In offline mode, where kernels are executed consecutively without time intervals, CUDA-L2 yields +22.0% over torch.matmul on average; +19.2% over cuBLAS using the optimal layout configuration (normal-normal NN and transposed-normal TN); +16.8% over cuBLASLt-heuristic, which queries the cuBLASLt library and selects the algorithm based on the heuristic's suggestion; and +11.4% over the most competitive cuBLASLt-AutoTuning model, which selects the fastest algorithm from up to 100 candidates from cuBLASLt's suggestions. In server mode, where kernels are executed at random intervals simulating real-time inference, the speedups further increase to +28.7%, +26.0%, +22.4%, and +15.9% for torch.matmul, cuBLAS, cuBLASLt-heuristic, and cuBLASLt-AutoTuning respectively. CUDA-L2 shows that even the most performance-critical, heavily-optimized kernels like HGEMM can be improved through LLM-guided RL automation by systematically exploring configuration spaces at scales impractical for humans. Project and code can be found at github.com/deepreinforce-ai/CUDA-L2


In-depth Reading


1. Bibliographic Information

1.1. Title

The central topic of the paper is the optimization of Half-precision General Matrix Multiply (HGEMM) CUDA kernels through a novel system called CUDA-L2, which combines Large Language Models (LLMs) and Reinforcement Learning (RL). The paper claims that CUDA-L2 surpasses the performance of NVIDIA's cuBLAS library.

1.2. Authors

Songqiao Su, Xiaofei Sun, Xiaoya Li, Albert Wang, Jiwei Li and Chris Shum. Their affiliation is the DeepReinforce Team, as indicated by the GitHub link.

1.3. Journal/Conference

The paper is published on arXiv, a preprint server, under the identifier arXiv:2512.02551. As it is a preprint, it has not yet undergone formal peer review for a specific journal or conference. However, arXiv is a widely recognized platform for disseminating research in fields like AI, ML, and computer science.

1.4. Publication Year

2025 (the preprint, arXiv:2512.02551, was submitted to arXiv in December 2025).

1.5. Abstract

This paper introduces CUDA-L2, a system that leverages Large Language Models (LLMs) and Reinforcement Learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels. By using CUDA execution speed as the RL reward, CUDA-L2 optimizes HGEMM kernels across 1,000 diverse configurations relevant to modern LLMs. The system consistently outperforms existing major matrix multiplication baselines, including torch.matmul and NVIDIA's state-of-the-art closed-source libraries like cuBLAS and cuBLASLt. In offline mode, CUDA-L2 achieves average speedups of +22.0% over torch.matmul, +19.2% over cuBLAS (optimal layout), +16.8% over cuBLASLt-heuristic, and +11.4% over cuBLASLt-AutoTuning. These speedups further increase in server mode (simulating real-time inference) to +28.7%, +26.0%, +22.4%, and +15.9% respectively. The paper concludes that CUDA-L2 demonstrates the potential of LLM-guided RL automation to improve even highly optimized kernels by exploring complex configuration spaces at scales impractical for human experts.

https://arxiv.org/abs/2512.02551 (Preprint) PDF Link: https://arxiv.org/pdf/2512.02551v1.pdf

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the performance optimization of Half-precision General Matrix Multiply (HGEMM) operations on GPUs. HGEMM is a fundamental building block for many modern AI workloads, especially Large Language Models (LLMs), where it constitutes a significant portion of the computational cost in layers like attention and feed-forward networks.

This problem is crucial because even small percentage improvements in these highly frequent operations can lead to substantial gains in overall LLM inference and training speed, translating to reduced operational costs and faster research cycles. Existing solutions, such as NVIDIA's highly optimized cuBLAS and cuBLASLt libraries, represent the state-of-the-art, meticulously hand-tuned by expert engineers. However, the paper argues that the vast and complex configuration space of modern GPU architectures (e.g., memory hierarchies, specialized Tensor Cores, instruction sets) makes it challenging for human experts to exhaustively explore all optimization opportunities. There remains a gap where automated, systematic exploration could uncover further performance gains.

The paper's entry point and innovative idea lie in applying a combination of Large Language Models (LLMs) and Reinforcement Learning (RL) to automatically generate and optimize CUDA kernels for HGEMM. By treating kernel optimization as a search problem and using real execution speed as a reward signal, CUDA-L2 aims to systematically explore the optimization landscape beyond human capabilities.

2.2. Main Contributions / Findings

The primary contributions and key findings of the paper are:

  • Novel System CUDA-L2: Introduction of CUDA-L2, a system that integrates LLMs and RL for automatic optimization of HGEMM CUDA kernels. This system extends previous work (CUDA-L1) with continued pretraining on diverse CUDA code and a more generalized RL framework.
  • Systematic Performance Outperformance: CUDA-L2 systematically surpasses the performance of all major matmul baselines, including torch.matmul, cuBLAS (with optimal layout), cuBLASLt-heuristic, and even cuBLASLt-AutoTuning (which itself performs extensive auto-tuning).
    • In offline mode, CUDA-L2 achieves average speedups of +22.0% over torch.matmul, +19.2% over cuBLAS-max, +16.8% over cuBLASLt-heuristic-max, and +11.4% over cuBLASLt-AutoTuning-max.
    • In server mode (simulating real-time inference), these speedups are even higher: +28.7% over torch.matmul, +26.0% over cuBLAS-max, +22.4% over cuBLASLt-heuristic-max, and +15.9% over cuBLASLt-AutoTuning-max.
  • Discovery of Novel Optimization Techniques: The LLM-guided RL approach not only achieves superior performance but also discovers and applies sophisticated CUDA optimization techniques. These include dynamic abstraction selection (PTX/CuTe based on problem size), padding input matrices with zeros to enable better tiling, and advanced techniques like Double-Buffered Register Fragments with Ping-Pong Execution, Aggressive Register-Level Prefetching, Epilogue Optimization with Direct Register-to-Shared-Memory Copy, and Staggered/Split A-B Prefetch Scheduling.
  • Scalability and Adaptability: CUDA-L2 demonstrates its ability to optimize kernels across 1,000 configurations of matrix dimensions (M, N, K), covering those used in widely-used LLMs. The approach is also designed to adapt to new GPU architectures (e.g., Ada Lovelace, Hopper, Blackwell) and architectural characteristics.
  • Implications for Performance-Critical Kernels: The findings suggest that even the most performance-critical and heavily-optimized low-level kernels can be further improved through automated, LLM-guided RL exploration, opening new avenues for compiler and hardware-software co-design optimization.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a reader should be familiar with the following fundamental concepts:

  • General Matrix Multiply (GEMM): At its core, GEMM is a linear algebra operation that computes the matrix product $C = \alpha A B + \beta C$. Here, $A$, $B$, and $C$ are matrices, and $\alpha$ and $\beta$ are scalar coefficients. In the context of deep learning, the common case simplifies to $C = AB$ (where $\alpha = 1, \beta = 0$). GEMM is a cornerstone operation for neural networks, used in dense layers, convolutions, and attention mechanisms (a minimal CUDA kernel illustrating this operation is sketched after this list).
  • Half-Precision (FP16): This refers to using 16-bit floating-point numbers instead of the standard 32-bit (FP32) or 64-bit (FP64) precision. FP16 offers several advantages for deep learning:
    • Reduced Memory Footprint: Less memory is required to store weights and activations, allowing larger models or batch sizes.
    • Faster Computation: GPUs often have specialized hardware (like NVIDIA's Tensor Cores) that can perform FP16 operations much faster than FP32.
    • Lower Bandwidth Usage: Moving less data between memory and processing units improves efficiency. The trade-off is reduced numerical precision, which can sometimes lead to accuracy issues if not handled carefully, but for many deep learning tasks, it is sufficient.
  • CUDA (Compute Unified Device Architecture): NVIDIA's parallel computing platform and programming model that enables software developers to use a GPU for general-purpose processing. CUDA kernels are functions written in CUDA C/C++ that are executed in parallel on the GPU. Optimizing these kernels involves managing memory, thread scheduling, and utilizing specialized hardware efficiently.
  • Reinforcement Learning (RL): A subfield of machine learning where an "agent" learns to make decisions by performing "actions" in an "environment" to maximize a cumulative "reward." In this paper, the LLM acts as the agent, generating CUDA kernel code (actions), and the GPU environment provides feedback (execution speed) as a reward.
  • Large Language Models (LLMs): Advanced AI models (e.g., GPT-4, Llama) capable of understanding, generating, and processing human language. In this context, LLMs are used as a powerful code generation tool, capable of producing diverse and complex CUDA kernel code given prompts or architectural specifications.
  • GPU Architecture Concepts:
    • Thread Block: A group of threads that can communicate with each other via shared memory and synchronize their execution. Multiple thread blocks can run concurrently on a GPU.
    • Shared Memory: A small, fast memory on a GPU that is shared by threads within the same thread block. It's much faster than global memory but has limited capacity. Efficient use of shared memory is critical for high-performance CUDA kernels.
    • Registers: The fastest memory available on a GPU, directly accessible by individual threads. Used to store local variables and intermediate computation results.
    • Global Memory: The main GPU memory, larger but significantly slower than shared memory or registers. Data often needs to be explicitly moved from global memory to shared memory/registers for computation.
    • Tensor Cores: Specialized processing units on NVIDIA GPUs (starting with Volta architecture) designed to accelerate matrix multiplication operations, particularly for FP16 precision. They perform Mixed-Precision Matrix Multiply-Accumulate (MMA) operations.
  • torch.matmul: A function in the PyTorch deep learning framework that performs matrix multiplication. For GPU operations, it typically dispatches to optimized backend libraries (like cuBLAS). It represents a common, high-level interface for users.
  • cuBLAS: NVIDIA's CUDA Basic Linear Algebra Subprograms (BLAS) library. It provides highly optimized implementations of standard linear algebra routines, including GEMM, for NVIDIA GPUs. It is a closed-source, heavily hand-tuned library, serving as a de facto benchmark for matrix multiplication performance.
    • cuBLAS-NN: Standard configuration where both input matrices $A$ and $B$ are in Normal (row-major) layout.
    • cuBLAS-TN: Configuration where matrix $A$ is Transposed and matrix $B$ is in Normal layout.
    • cuBLAS-max: Refers to selecting the faster of cuBLAS-NN and cuBLAS-TN for a given matrix configuration.
  • cuBLASLt: NVIDIA's CUDA Basic Linear Algebra Subprograms (BLAS) Light library. It offers a lower-level, more flexible API than cuBLAS, allowing developers finer control over algorithm selection and execution parameters. It's designed for more advanced users and scenarios where maximum performance tuning is required.
    • cuBLASLt-heuristic: This mode queries the cuBLASLt library's internal heuristic, which suggests an optimal algorithm based on predefined rules and characteristics of the input matrices (M, N, K, data types, etc.).
    • cuBLASLt-AutoTuning (or benchmark): This mode goes a step further by empirically benchmarking multiple candidate algorithms suggested by cuBLASLt (up to 100 in this paper) and selecting the fastest one for a specific configuration. This is often considered the most competitive baseline as it involves runtime profiling and selection.
  • CUTLASS: CUDA Templates for Linear Algebra Subroutines. An open-source, high-performance CUDA library built on top of CUDA C++ and CuTe (CUDA Template Engine). It provides highly optimized GEMM kernel implementations using Tensor Cores and sophisticated memory management patterns. It acts as a framework for building highly efficient kernels.
  • CuTe (CUDA Template Engine): A template metaprogramming library within CUTLASS that provides abstractions for defining complex tiled matrix operations, memory layouts, and data movement patterns on GPUs. It allows for flexible and performant kernel generation.
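To make the GEMM, FP16, and CUDA-kernel concepts above concrete, here is a minimal, deliberately unoptimized HGEMM kernel sketch. It is not taken from the paper or from any library; the kernel name and launch geometry are illustrative, and it only shows what HGEMM computes, not how optimized kernels such as cuBLAS or CUDA-L2's outputs implement it.

```cuda
// Naive illustration of C = A * B in half precision (row-major layout).
// No tiling, shared memory, or Tensor Cores; one thread computes one output element.
#include <cuda_fp16.h>

__global__ void naive_hgemm(const __half* A, const __half* B, __half* C,
                            int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // index into the M dimension
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // index into the N dimension
    if (row < M && col < N) {
        float acc = 0.0f;  // accumulate in FP32 for numerical safety
        for (int k = 0; k < K; ++k) {
            acc += __half2float(A[row * K + k]) * __half2float(B[k * N + col]);
        }
        C[row * N + col] = __float2half(acc);
    }
}
// Example launch (assuming 16x16 thread blocks):
//   dim3 block(16, 16), grid((N + 15) / 16, (M + 15) / 16);
//   naive_hgemm<<<grid, block>>>(dA, dB, dC, M, N, K);
```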

3.2. Previous Works

The paper builds upon a lineage of research involving LLMs and automated kernel optimization:

  • KernelBench [10]: This benchmark suite covers five tasks, including Half-precision General Matrix Multiply (HGEMM), and aims to evaluate the ability of LLMs to write efficient GPU kernels. It serves as a foundational environment for testing LLM-based code generation for performance-critical tasks.

  • CUDA-L1 [7]: This is a direct predecessor to CUDA-L2. CUDA-L1 is described as a pipeline system that optimizes kernels on KernelBench. The current paper CUDA-L2 extends CUDA-L1 through several technical enhancements. While the specific details of CUDA-L1 are not fully elaborated in this paper, it's clear it laid the groundwork for using LLMs and RL for CUDA kernel optimization.

  • CUTLASS and CuTe: These NVIDIA libraries are fundamental to high-performance GEMM on GPUs. CUDA-L2 does not aim to replace them but rather to leverage them. CUTLASS provides templates and CuTe provides abstractions that CUDA-L2 can utilize to generate highly optimized code. This means CUDA-L2 can output code that uses CUTLASS/CuTe components, or even lower-level PTX assembly.

  • Other LLM/AI-driven optimization efforts: The introduction mentions Sakana AI CUDA [6] and Swizzleper [16] as related works in AI for CUDA kernel optimization. These works generally explore the use of AI to discover, optimize, or compose GPU kernels, indicating an active research area. Swizzleper, for instance, focuses on hardware-aware LLMs for GPU kernel performance optimization.

    The paper notes that to their best knowledge, no prior work has achieved performance comparable to manually-optimized matmul kernels, especially when compared against NVIDIA's cuBLAS library, before CUDA-L2.

3.3. Technological Evolution

The field of GPU kernel optimization has evolved from purely manual, expert-driven tuning to increasingly automated and AI-assisted approaches:

  1. Manual Optimization: Early CUDA kernels were primarily hand-tuned by deep experts who understood the intricacies of GPU architecture, memory hierarchies, and instruction sets. Libraries like cuBLAS are products of this era, representing years of highly specialized human effort. This approach is effective but slow, non-scalable, and often cannot explore the entire vast optimization space.

  2. Domain-Specific Languages (DSLs) and Code Generators: Tools like Halide, TVM, and Tensor Comprehensions emerged to abstract away some low-level details, allowing developers to express computations at a higher level while optimizing compilers generate efficient GPU code. These tools automate parts of the optimization process but still rely on predefined optimization passes or search heuristics.

  3. Meta-Programming Libraries (CUTLASS, CuTe): These libraries provide highly parameterized templates and abstractions that allow developers to construct complex, high-performance kernels by combining optimized building blocks. They offer flexibility but still require human expertise to select the right parameters and configurations.

  4. AI-driven Optimization (LLMs + RL): This paper represents a significant step in this evolution. LLMs are used to generate initial code or variations, leveraging their vast knowledge base from pretraining on code. Reinforcement Learning then guides the LLM to iteratively refine and optimize the generated code based on actual execution performance on the hardware. This approach aims to autonomously discover novel optimizations that might be missed by human experts or traditional compilers.

    CUDA-L2 fits into this evolution as a cutting-edge example of the AI-driven optimization paradigm, pushing the boundaries of what automated systems can achieve in performance-critical low-level code generation.

3.4. Differentiation Analysis

Compared to the main methods in related work, CUDA-L2 presents several core differences and innovations:

  • Beyond Heuristics and Predefined Search Spaces: Unlike traditional auto-tuning methods (like cuBLASLt-heuristic or even cuBLASLt-AutoTuning) that operate within a predefined set of algorithms or a constrained search space, CUDA-L2 uses LLMs to generate code. This allows it to explore a much broader, more flexible, and potentially novel set of kernel implementations, not limited by human preconceptions or fixed templates.
  • Direct Performance Feedback (RL): The use of Reinforcement Learning with CUDA execution speed as the reward signal provides a direct, empirical feedback loop. This contrasts with approaches that might rely on static analysis, cost models, or surrogate metrics. The direct hardware performance measurement ensures that the generated kernels are genuinely optimal for the target GPU.
  • Combining Generative AI with Optimization: The core innovation is the synergistic combination of LLMs (for code generation and leveraging vast pre-trained knowledge) and RL (for guided, performance-driven search and refinement). This allows the system to learn to write not just syntactically correct, but performance-optimal code.
  • Outperforming State-of-the-Art Closed-Source Libraries: A key differentiator is CUDA-L2's ability to consistently outperform NVIDIA's highly-optimized and closed-source cuBLAS and cuBLASLt libraries. This is a significant claim, as these libraries are products of immense human engineering effort and are considered the gold standard for GPU BLAS operations. Prior LLM or AI-based kernel optimization efforts generally struggled to match cuBLAS performance.
  • Adaptability and Generalization: CUDA-L2 emphasizes continued pretraining on diverse CUDA code and LLM contexts (e.g., using DeepSeek 671B). This aims to equip the LLM with more general CUDA optimization capabilities, allowing it to adapt to new GPU architectures (e.g., Hopper, Blackwell) or characteristics not explicitly covered in its initial training.
  • Discovery of Non-Obvious Techniques: The paper highlights that CUDA-L2 can discover non-standard or novel variations of optimization techniques (e.g., padding M values not divisible by BM, double-buffered register fragments, staggered prefetching), suggesting that the automated search can go beyond conventional human-designed solutions.

4. Methodology

4.1. Principles

The core idea behind CUDA-L2 is to automatically generate and optimize high-performance Half-precision General Matrix Multiply (HGEMM) CUDA kernels using a Large Language Model (LLM) guided by Reinforcement Learning (RL). The LLM acts as a code generator, proposing CUDA kernel implementations for given matrix dimensions (M, N, K). The RL component evaluates these generated kernels by executing them on a GPU and using the measured execution speed as a reward signal. This reward then guides the LLM to iteratively improve its code generation strategy. The theoretical basis is that by systematically exploring the vast search space of possible CUDA kernel optimizations, an LLM-RL agent can discover highly efficient configurations and even novel optimization techniques that might be impractical for human experts to find. The process is designed to be hardware-aware, directly optimizing for real-world GPU execution performance.

4.2. Core Methodology In-depth (Layer by Layer)

CUDA-L2 extends a previous system called CUDA-L1 by introducing several key enhancements to improve its generalization and optimization capabilities for HGEMM kernels.

4.2.1. A Brief Review of CUDA-L1

CUDA-L1 is described as a pipeline system [7] designed to optimize kernels on KernelBench [10], which covers 50 tasks across five categories (e.g., HGEMM, SGEMM). CUDA-L1 leveraged LLMs for code generation and RL for optimization. CUDA-L2 builds upon this foundation, aiming to provide more general CUDA optimization capabilities beyond the specific tasks in KernelBench.

4.2.2. CUDA-L2 Enhancements

CUDA-L2 introduces the following key enhancements over CUDA-L1:

4.2.2.1. Continued Pretraining

To enable more general-purpose CUDA optimization, CUDA-L2 extends the LLM's pretraining beyond the KernelBench dataset. This involves pretraining on a diverse range of CUDA code found in open-source projects, including CUTLASS versions, the CuTe library, and various NVIDIA CUDA optimization technical blogs. The goal is to equip the LLM with a broader understanding of CUDA programming patterns, optimization idioms, and GPU architectural characteristics. The paper explicitly mentions using DeepSeek 671B [8] for this continued pretraining, which allows the model to acquire more general CUDA optimization capabilities. This pretraining helps the LLM generate more relevant and potentially performant kernel code from the outset.

4.2.2.2. General Kernel RL

CUDA-L2 incorporates a more general Reinforcement Learning strategy for CUDA kernel generation. This involves using CUTLASS as a foundational library. CUTLASS provides highly optimized, modular CUDA C++ templates for linear algebra, making it a good target for LLM-generated code. The RL framework generates CUDA kernel code (e.g., in .cu files), which is then compiled and executed on the GPU. The execution speed is measured and used as the reward signal for the RL agent. This feedback loop allows the LLM to learn which code structures and optimization techniques lead to better performance. The system is designed to prevent "timing measurement hacking" by ensuring proper CUDA synchronization and event-based timing, making the reward reliable.

4.2.2.3. HGEMM RL

For the specific task of HGEMM optimization, a tailored RL strategy is adopted within CUDA-L2. The reward function is designed to balance execution speed, correctness, and code conciseness.

The reward function $r(\mathrm{custom})$ for a generated custom kernel is defined as:
$ r(\mathrm{custom}) = \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{t_{\mathrm{ref}}^{i}}{t_{\mathrm{custom}}^{i}} - \alpha \cdot \mathrm{diff}^{i} \right] - \beta \, \mathrm{L}(\mathrm{custom}) $

Where:

  • $N$: The total number of configurations (matrix sizes M, N, K) evaluated. In this paper, $N = 1000$.

  • $t_{\mathrm{ref}}^{i}$: The execution time of a reference kernel for the $i$-th configuration. This is typically the best available baseline kernel (e.g., cuBLASLt-AutoTuning).

  • $t_{\mathrm{custom}}^{i}$: The execution time of the custom kernel generated by CUDA-L2 for the $i$-th configuration.

  • $\alpha$: A positive penalty coefficient for correctness errors. This term ensures that the LLM prioritizes generating correct kernels.

  • $\mathrm{diff}^{i}$: The maximum element-wise absolute difference between the FP32 reference output and the custom kernel's FP16 output for the $i$-th configuration. It is defined as: $ \mathrm{diff}^{i} = \max_{j} | \mathrm{out}_{\mathrm{FP32}}^{i}[j] - \mathrm{out}_{\mathrm{custom}}^{i}[j] | $ where $\mathrm{out}_{\mathrm{FP32}}^{i}[j]$ is the $j$-th element of the ground-truth output computed in FP32 (high precision), and $\mathrm{out}_{\mathrm{custom}}^{i}[j]$ is the $j$-th element of the output from the custom kernel. This term penalizes numerical inaccuracies.

  • $\beta$: A positive penalty coefficient for code length. This term encourages the LLM to generate concise and efficient code rather than overly verbose or complex implementations.

  • $\mathrm{L}(\mathrm{custom})$: The length of the generated custom kernel code.

    The generated HGEMM kernels are parsed into .cu files and compiled using nvcc (NVIDIA CUDA Compiler). This implies that the LLM can generate code leveraging CUDA C/C++, CuTe (CUDA Template Engine), inline PTX assembly (low-level parallel thread execution instructions), and CUDA intrinsics (specialized GPU instructions), as well as CUTLASS templates. However, Python-based Domain-Specific Languages (DSLs) like Triton are explicitly excluded, as CUDA-L2 focuses on generating raw CUDA C++ or CuTe-based implementations.
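To make the reward formula concrete, the following is a hedged host-side sketch of how it could be computed from measured quantities. The function name `reward` and the vector-based interface are assumptions for illustration, not the authors' code.

```cuda
// Host-side computation of the RL reward described above, given per-configuration
// timings and maximum output differences. alpha, beta, and code_length are the
// penalty terms from the paper's reward definition.
#include <vector>
#include <cstddef>

double reward(const std::vector<double>& t_ref,     // reference kernel times, one per config
              const std::vector<double>& t_custom,  // custom kernel times, one per config
              const std::vector<double>& diff,      // max |FP32 ref - FP16 custom| per config
              double alpha, double beta, std::size_t code_length) {
    double sum = 0.0;
    for (std::size_t i = 0; i < t_ref.size(); ++i) {
        sum += t_ref[i] / t_custom[i] - alpha * diff[i];  // speed term minus correctness penalty
    }
    return sum / static_cast<double>(t_ref.size())
         - beta * static_cast<double>(code_length);       // code-length penalty
}
```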

4.2.3. Kernel Successfulness

A custom HGEMM kernel generated by CUDA-L2 is considered successful if it meets two criteria: executability and correctness.

4.2.3.1. Executability

For a kernel to be executable, it must compile successfully with nvcc and run without crashing or causing illegal memory access. The paper states that compute-sanitizer --tool memcheck is used to check for memory access violations, ensuring the kernel operates within its allocated memory boundaries.

4.2.3.2. Correctness

Determining the correctness of FP16 GEMM results is challenging due to the non-associative nature of floating-point arithmetic (e.g., $(a+b)+c \neq a+(b+c)$). This means a direct bit-for-bit comparison with an FP32 reference is often impossible. CUDA-L2 adopts two practical criteria:

  • Exact Match with Binary Inputs: This criterion sidesteps the difficulty of floating-point comparison by using specially crafted inputs. Matrices $A$ and $B$ are randomly generated with binary elements, i.e., $a_{ik}, b_{kj} \in \{0, 1\}$. For a matrix multiplication $C = A \times B$, each output element $c_{ij}$ is calculated as: $ c_{ij} = \sum_{k=1}^{K} a_{ik} \cdot b_{kj} $ Since $a_{ik}, b_{kj} \in \{0, 1\}$, each product $a_{ik} \cdot b_{kj}$ is either 0 or 1, so the sum $c_{ij}$ is guaranteed to be a non-negative integer. The reference output $C^{\mathrm{ref}}$ is first computed in FP32 on the CPU, which yields exact integer results, and is then compared against the output $C^{\mathrm{custom}}$ of the generated custom kernel (a minimal sketch of this check follows this list). The correctness requirement depends on the value of $c_{ij}^{\mathrm{ref}}$:

    • If $c_{ij}^{\mathrm{ref}} < 2048$, the custom kernel's output element $c_{ij}^{\mathrm{custom}}$ must be exactly equal to $c_{ij}^{\mathrm{ref}}$, because all integers in the range [0, 2048) can be exactly represented in half precision (FP16).
    • If $c_{ij}^{\mathrm{ref}} \ge 2048$, FP16 may not be able to represent the integer exactly, so exact equality is not enforced; because each term $a_{ik} \cdot b_{kj} \in \{0, 1\}$, the partial sums are monotonically non-decreasing, and the exact-match requirement therefore only applies to outputs whose final value stays below 2048.
  • Error-tolerance relative to Baselines: This criterion compares the custom kernel's output against the outputs of established baselines (cuBLAS-NN, cuBLAS-TN, cuBLASLt-heuristic-NN, cuBLASLt-heuristic-TN, cuBLASLt-AutoTuning-NN, cuBLASLt-AutoTuning-TN). A custom kernel is deemed correct if its element-wise deviation from the FP32 reference output does not exceed the deviation of the worst-performing baseline. This allows for typical FP16 precision variations while still ensuring competitive numerical accuracy.
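The following is a hedged sketch of the binary-input exact-match check, illustrative rather than the authors' harness. It assumes the FP16 kernel output has already been converted to float for comparison.

```cuda
// Because A and B contain only 0/1 values, the FP32 CPU reference is an exact
// non-negative integer; any output below 2048 must be reproduced exactly by the
// FP16 kernel. Values at or above 2048 fall back to the error-tolerance criterion.
bool exact_match_binary(const float* c_ref,     // FP32 CPU reference, size M*N
                        const float* c_custom,  // custom-kernel FP16 output, converted to float
                        int M, int N) {
    for (int idx = 0; idx < M * N; ++idx) {
        if (c_ref[idx] < 2048.0f && c_custom[idx] != c_ref[idx])
            return false;  // exact match required below 2048
    }
    return true;
}
```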

4.2.4. Evaluation

To benchmark the custom kernel against a reference kernel, the single-run speedup score $s(\mathrm{custom})$ is defined as:
$ s(\mathrm{custom}) = \frac{t_{\mathrm{ref}}}{t_{\mathrm{custom}}} - 1 $
Where:

  • $t_{\mathrm{ref}}$: The execution time of the reference kernel.

  • $t_{\mathrm{custom}}$: The execution time of the custom kernel generated by CUDA-L2.

    Each run executes both the reference and custom kernels. The reported evaluation score is the mean speedup score over all runs. A positive value of $s(\mathrm{custom})$ indicates that the custom kernel is faster than the reference kernel.

4.2.4.1. Avoiding Timing Measurement Hacking

To ensure accurate and fair timing measurements, CUDA-L2 employs robust timing mechanisms that prevent common issues like Python's lazy evaluation or GPU warm-up effects. The process involves:

  1. torch.cuda.synchronize(): Ensures all previously launched CUDA kernels complete before starting measurement.

  2. torch.cuda.Event: CUDA events are used to record timestamps on the GPU, providing accurate kernel execution times without CPU-side overhead.

    • start_event = torch.cuda.Event(enable_timing=True)
    • end_event = torch.cuda.Event(enable_timing=True)
  3. start_event.record(): Records the start time on the GPU.

  4. kernel(a, b, b_col_major, out): Executes the target HGEMM kernel.

  5. end_event.record(): Records the end time on the GPU.

  6. torch.cuda.synchronize(): Ensures all GPU operations, including event recording, are complete before reading the elapsed time.

  7. elapsed_time_ms = start_event.elapsed_time(end_event): Calculates the duration between the start and end events in milliseconds.

    The LLM generates HGEMM kernels directly as CUDA code in .cu files, which inherently bypasses Python's lazy evaluation that can complicate timing in frameworks like PyTorch.
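The same event-based timing pattern can be expressed in raw CUDA C++. The sketch below is an illustrative equivalent of the torch.cuda.Event procedure above, not the paper's harness; the helper name `time_kernel_ms` and the launch callback are assumptions.

```cuda
// Event-based GPU timing: synchronize, record start/stop events around the
// kernel launch, then read the elapsed GPU time in milliseconds.
#include <cuda_runtime.h>

float time_kernel_ms(void (*launch)(cudaStream_t), cudaStream_t stream) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaStreamSynchronize(stream);   // drain pending work (cf. torch.cuda.synchronize)
    cudaEventRecord(start, stream);  // timestamp recorded on the GPU
    launch(stream);                  // enqueue the HGEMM kernel under test
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);      // wait until the stop event has been reached

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```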

4.2.4.2. Offline vs. Server Mode

The paper evaluates performance in two distinct modes to simulate different real-world scenarios:

  • Offline Mode: In this mode, CUDA kernels are executed consecutively without any significant time intervals between them. This simulates a scenario where the GPU is continuously busy, and its caches remain "hot." This is typical for batch processing or training workloads.
  • Server Mode: In contrast, CUDA kernels are executed at random intervals. This simulates a real-time inference scenario where requests arrive sporadically, and the GPU might experience "cold starts" where its caches have cooled down between executions. This mode is often more challenging for performance due to cache misses and setup overheads. The paper states that the speedups typically increase in server mode because CUDA-L2's optimizations are better at mitigating these cold start penalties.

5. Experimental Setup

5.1. Datasets

The experiments in the paper are conducted across 1,000 distinct configurations for Half-precision General Matrix Multiply (HGEMM). These configurations represent all $10^3$ combinations of M, N, and K values, where M, N, and K are chosen from the set {64, 128, 256, 512, 1024, 2048, 4096, 8192, 12288, 16384}.

  • Source: The matrices are generated with binary elements in $\{0, 1\}$ for correctness checking, and presumably with random floating-point values for performance benchmarking.

  • Scale: The range of dimensions is broad, covering matrix sizes from $64 \times 64$ up to $16384 \times 16384$.

  • Characteristics: These (M, N, K) triplets define the dimensions of matrix $A \in \mathbb{R}^{M \times K}$ and matrix $B \in \mathbb{R}^{K \times N}$ for the operation $C = AB$.

  • Domain: The chosen configurations are explicitly stated to cover those commonly used in the attention and Feed-Forward Network (FFN) layers of widely used open-source Large Language Models (LLMs) such as Qwen [11], Llama [4], and DeepSeek [5]. This ensures the experimental results are directly relevant to real-world LLM workloads.

    These datasets are chosen because they represent a comprehensive and realistic set of HGEMM operations encountered in modern LLM inference and training. This allows for robust validation of the method's performance across various problem sizes and aspect ratios.

5.2. Evaluation Metrics

The primary evaluation metric used in the paper is speedup, specifically the relative speedup of a custom kernel over a reference kernel.

  • Conceptual Definition: Speedup quantifies how much faster a custom kernel executes compared to a reference kernel. A higher speedup indicates better performance. It focuses on the time efficiency of the GPU kernel execution.

  • Mathematical Formula: The speedup $s(\mathrm{custom})$ for a single run of a custom kernel against a reference kernel is defined as: $ s(\mathrm{custom}) = \frac{t_{\mathrm{ref}}}{t_{\mathrm{custom}}} - 1 $ The reported evaluation score is the mean speedup score over all runs for a given configuration.

  • Symbol Explanation:

    • $s(\mathrm{custom})$: The speedup score of the custom kernel.

    • $t_{\mathrm{ref}}$: The execution time of the reference kernel (a baseline, e.g., torch.matmul, cuBLAS, cuBLASLt).

    • $t_{\mathrm{custom}}$: The execution time of the custom kernel generated by CUDA-L2.

      A speedup of +0.10, for example, means the custom kernel is 10% faster than the reference kernel. If the custom kernel is slower, the speedup is negative. The paper also reports the mean, median, standard deviation (Std), and win rate (the ">1" column) of the speedups across the 1,000 configurations; ">1" is the number of configurations out of 1,000 for which CUDA-L2 outperforms the baseline.

The paper also implicitly uses correctness as a binary metric (pass/fail) during the RL training and evaluation, as described in Section 4.2.3. Correctness is essential; an optimized kernel is useless if it produces incorrect results. The diff term in the reward function directly penalizes incorrectness.

5.3. Baselines

The paper compares CUDA-L2 against several representative and state-of-the-art matrix multiplication baselines:

  • torch.matmul: This is the PyTorch framework's default matrix multiplication operation. It represents the common-case usage for most deep learning practitioners. It typically relies on highly optimized backend libraries like cuBLAS for GPU execution.
  • cuBLAS: NVIDIA's highly optimized CUDA BLAS library. This is a critical baseline as it's a closed-source, manually-tuned library considered a gold standard for GPU GEMM performance.
    • cuBLAS-NN: Refers to the cuBLAS implementation where both input matrices $A$ and $B$ are in Normal (row-major) layout.
    • cuBLAS-TN: Refers to the cuBLAS implementation where matrix $A$ is Transposed (column-major) and matrix $B$ is in Normal (row-major) layout.
    • cuBLAS-max: For each (M, N, K) configuration, CUDA-L2 is compared against the optimal cuBLAS layout, meaning the faster of cuBLAS-NN or cuBLAS-TN. This provides the strongest cuBLAS comparison.
    • The implementation uses the cublasGemmEx function with CUBLAS_GEMM_DEFAULT_TENSOR_OP to leverage Ampere FP16 Tensor Cores (a hedged example of such a call is sketched after this list).
  • cuBLASLt: NVIDIA's lower-level BLAS Light library, offering more control over algorithm selection.
    • cuBLASLt-heuristic: This baseline queries the cuBLASLt library for its heuristic-based suggestion for the optimal algorithm for a given (M, N, K) configuration. It represents a more advanced optimization than cuBLAS alone, as it dynamically selects algorithms.
      • Similar to cuBLAS, it evaluates cuBLASLt-heuristic-NN and cuBLASLt-heuristic-TN.
      • cuBLASLt-heuristic-max: Selects the optimal layout (NN or TN) from the heuristic's suggestions.
    • cuBLASLt-AutoTuning (or cuBLASLt-benchmark): This is the most competitive baseline. It queries cuBLASLt for up to 100 candidate algorithms, then empirically benchmarks all of them (with warm-up and measurement rounds) and selects the fastest one for the specific (M, N, K) configuration. This represents the absolute best performance achievable by cuBLASLt through extensive runtime profiling.
      • It also evaluates cuBLASLt-AutoTuning-NN and cuBLASLt-AutoTuning-TN.

      • cuBLASLt-AutoTuning-max: Selects the optimal layout (NN or TN) after auto-tuning.

        These baselines are chosen because they represent a spectrum of widely-used to state-of-the-art, highly optimized matrix multiplication implementations on NVIDIA GPUs. cuBLASLt-AutoTuning is particularly important as it establishes a very high bar for automated kernel optimization.
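For reference, here is a hedged sketch of what a cuBLAS baseline call for FP16 GEMM with Tensor Cores might look like. The exact compute type, layout handling (including how the NN/TN variants are realized), and error handling in the paper's harness may differ; the wrapper name `hgemm_cublas` and parameter choices are assumptions.

```cuda
// cuBLAS assumes column-major storage, so m/n/k and the leading dimensions
// below refer to column-major A (m x k), B (k x n), C (m x n).
#include <cublas_v2.h>
#include <cuda_fp16.h>

cublasStatus_t hgemm_cublas(cublasHandle_t handle,
                            const __half* A, const __half* B, __half* C,
                            int m, int n, int k) {
    const float alpha = 1.0f, beta = 0.0f;           // FP32 scalars for FP32 accumulation
    return cublasGemmEx(handle,
                        CUBLAS_OP_N, CUBLAS_OP_N,    // "NN"; the TN baseline transposes A instead
                        m, n, k,
                        &alpha,
                        A, CUDA_R_16F, m,            // lda
                        B, CUDA_R_16F, k,            // ldb
                        &beta,
                        C, CUDA_R_16F, m,            // ldc
                        CUBLAS_COMPUTE_32F,          // accumulate in FP32
                        CUBLAS_GEMM_DEFAULT_TENSOR_OP);  // allow Tensor Core algorithms
}
```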

6. Results & Analysis

6.1. Core Results Analysis

CUDA-L2 consistently demonstrates superior performance across all baselines and evaluation modes (offline and server). The results clearly validate the effectiveness of the proposed LLM-guided RL approach.

The following are the results from the abstract and Table 1 of the original paper:

  • Overall CUDA-L2 Performance:
    • Offline Mode: CUDA-L2 yields +22.0% over torch.matmul on average; +19.2% over cuBLAS-max; +16.8% over cuBLASLt-heuristic-max; and +11.4% over the most competitive cuBLASLt-AutoTuning-max.

    • Server Mode: The speedups are further increased: +28.7% over torch.matmul; +26.0% over cuBLAS-max; +22.4% over cuBLASLt-heuristic-max; and +15.9% over cuBLASLt-AutoTuning-max.

      The speedups in server mode are consistently higher than in offline mode. This indicates that CUDA-L2's optimizations are particularly effective at mitigating the overheads associated with "cold starts" and cache misses that occur when kernels are executed at random intervals, simulating real-time inference.

The win rates (number of configurations out of 1000 where CUDA-L2 outperforms the baseline) for CUDA-L2 range from 79.3% to 95.7% across all baselines (as seen in Table 1). This confirms that the improvements are systematic and not merely driven by a few outlier configurations.

The following are the results from Table 1 of the original paper:

| Baseline | Offline Mean | Offline Median | Offline Std | Offline >1 | Server Mean | Server Median | Server Std | Server >1 |
|---|---|---|---|---|---|---|---|---|
| cuBLAS-NN | 20.0% | 17.5% | 0.197 | 884/1000 | 28.8% | 25.2% | 0.283 | 862/1000 |
| cuBLAS-TN | 21.4% | 19.5% | 0.193 | 913/1000 | 30.2% | 26.3% | 0.275 | 887/1000 |
| cuBLASLt-heuristic-NN | 17.3% | 15.6% | 0.143 | 914/1000 | 24.4% | 22.6% | 0.202 | 901/1000 |
| cuBLASLt-heuristic-TN | 19.1% | 17.1% | 0.140 | 957/1000 | 25.9% | 24.1% | 0.198 | 939/1000 |
| cuBLASLt-AutoTuning-NN | 12.1% | 11.4% | 0.157 | 803/1000 | 17.9% | 15.9% | 0.220 | 818/1000 |
| cuBLASLt-AutoTuning-TN | 13.3% | 13.5% | 0.152 | 852/1000 | 19.1% | 17.6% | 0.217 | 844/1000 |
| torch.matmul | 22.0% | 19.2% | 0.211 | 902/1000 | 28.7% | 25.6% | 0.275 | 899/1000 |
| cuBLAS-max | 19.2% | 16.4% | 0.191 | 878/1000 | 26.0% | 22.9% | 0.260 | 849/1000 |
| cuBLASLt-heuristic-max | 16.8% | 15.3% | 0.140 | 913/1000 | 22.4% | 21.4% | 0.186 | 894/1000 |
| cuBLASLt-AutoTuning-max | 11.4% | 11.2% | 0.152 | 793/1000 | 15.9% | 14.4% | 0.207 | 798/1000 |

The raw cuBLAS and cuBLASLt versions (e.g., CUBLAS-NN, CUBLAS-TN) show that CUDA-L2 generally performs better against TN (Transposed-Normal) configurations than NN (Normal-Normal) configurations. For instance, CUDA-L2 gains 20.0% over CUBLAS-NN versus 21.4% over CUBLAS-TN in offline mode. A similar pattern is observed for cuBLASLt-heuristic (17.3% vs 19.1%) and cuBLASLt-AutoTuning (12.1% vs 13.3%). This might suggest that CUDA-L2 is particularly adept at optimizing kernels for scenarios involving transposed inputs or that the baseline TN implementations have more room for improvement.

The most challenging baseline for CUDA-L2 is cuBLASLt-AutoTuning-max, which already selects the best algorithm from up to 100 candidates by empirical benchmarking. Yet, CUDA-L2 still achieves a significant +11.4% speedup in offline mode and +15.9% in server mode over this highly optimized baseline. This highlights the power of the LLM-guided RL approach in discovering even finer-grained optimizations beyond what standard auto-tuning can achieve.

6.2. Max(CUDA-L2, baseline)

The paper also presents results for a scenario where the user can choose the faster kernel between CUDA-L2's generated kernel and the baseline's kernel for each configuration. This max(CUDA-L2, baseline) scenario simulates a practical deployment where the best available option is always selected.

The following are the results from Table 2 of the original paper:

| Baseline | CUDA-L2 vs baseline (Offline) | CUDA-L2 vs baseline (Server) | max(CUDA-L2, baseline) vs baseline (Offline) | max(CUDA-L2, baseline) vs baseline (Server) |
|---|---|---|---|---|
| torch.matmul | 22.0% | 28.7% | 23.1% | 29.8% |
| cuBLAS-max | 19.2% | 26.0% | 20.2% | 27.2% |
| cuBLASLt-heuristic-max | 16.8% | 22.4% | 17.0% | 22.7% |
| cuBLASLt-AutoTuning-max | 11.4% | 15.9% | 13.2% | 18.1% |

As expected, combining CUDA-L2 with the baselines by always picking the faster option yields additional marginal gains across all baselines and modes. For example, the offline speedup over torch.matmul increases from 22.0% to 23.1%, and over cuBLASLt-AutoTuning-max from 11.4% to 13.2%. This confirms that even for configurations where CUDA-L2 might not be strictly faster, it provides a competitive alternative, and leveraging both can lead to a more robust and faster solution overall. The server mode shows similar increases, with torch.matmul going from 28.7% to 29.8% and cuBLASLt-AutoTuning-max from 15.9% to 18.1%.

6.3. Speedup vs. Problem Size

The paper analyzes how CUDA-L2's speedup over cuBLASLt-AutoTuning-max varies with matrix dimensions. The problem size is characterized by $\log_2(M \times N \times K)$.

The following figure (Figure 2 from the original paper) shows how CUDA-L2's speedup against cuBLASLt-AutoTuning-max varies with matrix dimensions:

[Figure (labeled Table 3 in the original paper): relative speedup over cuBLASLt-AutoTuning-max by matrix size in offline mode, shown in three panels: (a) speedup vs. $\log_2(M \times N \times K)$, (b) speedup vs. average dimension, and (c) speedup vs. maximum dimension.]

The figure, titled "Table 3: Speedup over cuBLASLt-AutoTuning-max by matrix size in offline mode," shows a clear trend: as the problem size (represented by $\log_2(M \times N \times K)$) increases, the speedup provided by CUDA-L2 generally decreases.

  • For smaller problems ($\log_2(M \times N \times K) \approx$ 18-20), CUDA-L2 achieves substantial speedups, often in the range of 1.3x to 1.4x (i.e., 30-40% faster).

  • For larger problems ($\log_2(M \times N \times K) > 38$), the speedup tends to converge toward 1.0x (i.e., 0% speedup), meaning CUDA-L2 performs similarly to the highly optimized cuBLASLt-AutoTuning-max baseline.

    This trend suggests that CUDA-L2 provides the most significant improvements for smaller and medium-sized matrix multiplications. These smaller GEMM operations are often found in scenarios like batch-size-1 inference, or in the attention heads of Transformers, where $M$, $N$, or $K$ may be a sequence length or head dimension and need not be very large. The reason for this pattern is that smaller problems often have more "headroom" for optimization in terms of latency hiding, cache utilization, and instruction scheduling, where CUDA-L2's fine-grained tuning can make a proportionally larger impact. For very large problems, the GPU is already heavily utilized, and most standard optimizations (such as efficient memory access and pipelining) are already well-implemented by the baselines, leaving less room for further significant speedups.

The analysis of average dimension and maximum dimension in the subplots of Figure 2 (Table 3) further supports this, showing higher speedups for smaller average and maximum dimensions. This implies that problems with less extreme dimensions benefit more from CUDA-L2's approach.

6.4. Optimization Techniques Discovered and Applied by CUDA-L2

CUDA-L2's LLM-guided RL approach was able to discover and apply a range of sophisticated CUDA optimization techniques, some of which are novel variations beyond standard implementations.

6.4.1. Abstraction Selection

CUDA-L2 automatically selects between different levels of abstraction for implementation based on the (M, N, K) dimensions.

  • For smaller matrices: CUDA-L2 tends to generate kernels using lower-level PTX assembly and CUDA intrinsics. These provide maximum control over hardware, allowing for highly specialized optimizations, fewer pipeline stages, and minimal synchronization overhead, leading to faster speeds for smaller problem sizes.
  • For larger matrices: CUDA-L2 leans towards using CuTe's abstractions. CuTe (CUDA Template Engine) provides higher-level tools to manage sophisticated tiled MMA (Matrix Multiply-Accumulate) operations with Tensor Cores, which are essential for efficiently handling large data volumes. This allows CUDA-L2 to leverage the powerful capabilities of CUTLASS/CuTe for large-scale operations.

6.4.2. Padding the Input Matrix with Zeros

A common challenge in tiled GEMM implementations is handling matrix dimensions that are not perfectly divisible by the thread-block tile sizes (BM, BN). This can lead to out-of-bounds memory accesses for boundary thread blocks. Traditionally, BM (the M-dimension tile size) and BN (the N-dimension tile size) are chosen such that $M$ is divisible by BM and $N$ by BN.

CUDA-L2 discovered an alternative: padding the input matrix with zeros.

  • Mechanism: When $M$ is not divisible by BM, CUDA-L2 can choose a BM that does not divide $M$. It then pads $M$ to a slightly larger value $M'$ that is divisible by the chosen BM. This padding involves adding zero-valued rows or columns to the input matrix.
  • Benefit: This allows CUDA-L2 to select a BM that is better suited to the Tensor Core instruction format, memory access patterns, or other architectural characteristics, even if it introduces a small overhead from padding. The example provided is for $M = 8192$, $N = 512$, $K = 2048$. CUDA-L2 selects BM = 160, which does not divide 8192, and pads $M$ to 8320 (a 1.6% overhead). This choice outperforms conventional choices such as BM = 128 or BM = 256: the speedup over cuBLASLt-AutoTuning-TN is +15.2% with BM = 160, but only 0.4% with BM = 128 and -15.7% with BM = 256. This demonstrates that intelligent padding can unlock significant performance gains by enabling a more effective tile size.
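The host-side sketch below illustrates the padding idea under the M = 8192, BM = 160 example above (8192 rounds up to 8320, a 1.6% overhead). It is illustrative only; the function name and memory-management choices are assumptions, not the released kernels.

```cuda
// Pad M up to a multiple of a non-divisible tile size BM so every thread block
// works on a full tile and boundary checks disappear from the hot loop. The
// zero rows contribute nothing to C, so only rows [0, M) of C are read back.
#include <cuda_runtime.h>
#include <cuda_fp16.h>

void launch_padded_hgemm(const __half* dA, int M, int K, int BM) {
    int M_pad = ((M + BM - 1) / BM) * BM;                 // e.g. 8192 -> 8320 for BM = 160
    __half* dA_pad = nullptr;
    cudaMalloc(reinterpret_cast<void**>(&dA_pad), sizeof(__half) * M_pad * K);
    cudaMemset(dA_pad, 0, sizeof(__half) * M_pad * K);    // zero the rows beyond M
    cudaMemcpy(dA_pad, dA, sizeof(__half) * M * K,
               cudaMemcpyDeviceToDevice);                 // copy the real data (row-major)
    // ... launch the tiled kernel with gridDim.y = M_pad / BM, then free dA_pad.
    cudaFree(dA_pad);
}
```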

6.4.3. Proficiency in Using CUDA Optimization Techniques

CUDA-L2 is proficient in using several standard CUDA optimization techniques and discovering novel variations:

  • Shared memory with bank conflict avoidance: This involves organizing the data layout (e.g., using a swizzle pattern and bit-level permutations) so that threads in a warp do not simultaneously access the same bank of shared memory, which would serialize the accesses and degrade performance (a minimal swizzle sketch appears after this list).

  • Double buffering: Using two buffers (e.g., in shared memory) to overlap data transfer and computation. While one buffer is being processed by compute units, the other is being filled with the next batch of data from global memory. This hides memory latency. The paper denotes this with n_stage for buffering stages.

  • Asynchronous memory copy: Initiating memory transfers between global memory and shared memory (e.g., 128-bit transfers) without blocking the GPU's compute units. This allows computation to proceed while data is being fetched, improving parallelism.

  • Register accumulation: Storing partial results of computations in registers (the fastest memory) for as long as possible to minimize accesses to slower shared memory or global memory.

  • Block Swizzling (Blockwise reordering of block indices): This technique determines the stride pattern used to reorder how thread blocks access global memory. It can help improve cache locality and coalescing of global memory accesses, especially for specific matrix dimensions or access patterns.

  • Epilogue optimization: The final stage of a GEMM kernel, where accumulated results in registers are written back to global memory. CUDA-L2 effectively handles transpositions and reorders results from registers to global memory, using wide vectorized store instructions where possible, to reduce memory traffic.

  • Loop unrolling and simple loops: CUDA-L2 is proficient in generating efficient loop structures, including loop unrolling (expanding a loop's body to reduce loop overhead and expose more instruction-level parallelism) and designing simple, optimized loops.

  • Tile sizes, pipeline stages, etc.: CUDA-L2 can determine optimal tile sizes (BM, BN, BK) and pipeline stages (n_stage) for specific (M, N, K) configurations.
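Before turning to the novel variations below, here is a minimal, hedged illustration of one of the standard techniques above: XOR-based shared-memory swizzling to avoid bank conflicts. It uses the classic tiled-transpose setting rather than GEMM, is not taken from the paper, and assumes 32x32 thread blocks and matrix dimensions that are multiples of 32.

```cuda
// XOR-swizzled tiled transpose: permuting the shared-memory column with the row
// index spreads the column-wise reads across all 32 banks, so both the store
// and the load phases are bank-conflict free.
#define TILE 32

__global__ void transpose_swizzled(const float* in, float* out, int n) {
    __shared__ float tile[TILE][TILE];
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    // Store with an XOR-swizzled column index.
    tile[threadIdx.y][threadIdx.x ^ threadIdx.y] = in[y * n + x];
    __syncthreads();

    // Read back through the same permutation; coordinates are transposed per block.
    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * n + tx] = tile[threadIdx.x][threadIdx.y ^ threadIdx.x];
}
```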

    Moreover, CUDA-L2 can discover novel variations beyond standard implementations:

6.4.3.1. Double-Buffered Register Fragments with Ping-Pong Execution

This is an advanced form of double buffering applied at the register level. The following figure (Listing 2 from the original paper) illustrates register fragment buffering:

[Listing 2 (shown as an image in the original paper): a CuTe-based HGEMM main-loop fragment in which matrices are partitioned and copied, and a conditional ping-pong index selects which register buffer to load and which to compute on.]

  • Standard (Single-buffer - left side of Figure 3): Data is loaded into a single register fragment (e.g., tCrA_fragment) and processed. The next data load waits for the current processing to complete.
  • Optimized (Double-buffer with ping-pong - right side of Figure 3): Two register fragments (tCrA_fragment_0, tCrA_fragment_1) are used. While one fragment is being used for computation, the other is asynchronously loaded with the next set of data. This "ping-pong" mechanism (alternating between buffers) allows data loading and computation to overlap, effectively hiding memory latency. The LLM can automatically generate this more complex logic, which is crucial for maximizing Tensor Core utilization.
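The paper's listing is CuTe-based; the following is a hedged structural sketch of the same ping-pong pattern in plain CUDA. The helper functions are simplified stand-ins (assumptions) for the real shared-memory copy and Tensor Core MMA.

```cuda
// Two register-resident fragments alternate ("ping-pong") so that loading the
// next K-slice overlaps with computation on the current one.
#define FRAG 8

__device__ inline void load_fragment(float frag[FRAG], const float* smem, int k) {
    #pragma unroll
    for (int i = 0; i < FRAG; ++i) frag[i] = smem[k * FRAG + i];  // stand-in for ldmatrix/cp
}

__device__ inline void compute_on(const float frag[FRAG], float acc[FRAG]) {
    #pragma unroll
    for (int i = 0; i < FRAG; ++i) acc[i] += frag[i];             // stand-in for mma.sync
}

__device__ void mainloop_ping_pong(const float* smem_a, float acc[FRAG], int k_tiles) {
    float frag[2][FRAG];                      // double-buffered register fragments
    load_fragment(frag[0], smem_a, 0);        // prologue: fill buffer 0
    for (int k = 0; k < k_tiles; ++k) {
        int cur = k & 1;                      // ping-pong index
        if (k + 1 < k_tiles)
            load_fragment(frag[cur ^ 1], smem_a, k + 1);  // prefetch next slice into the other buffer
        compute_on(frag[cur], acc);           // compute overlaps the in-flight load
    }
}
```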

6.4.3.2. Aggressive Register-Level Prefetching

Prefetching aims to load data into faster memory (like registers) before it is actually needed, reducing stalls. The following figure (Listing 3 from the original paper) illustrates register-level prefetching:

[Listing 3 (shown as an image in the original paper): two prefetching variants for the HGEMM main loop, a single-step K+1 prefetch on the left and a pipelined K+0/K+1/K+2 prefetch on the right, both using #pragma unroll.]

  • Standard (Single-Step Prefetch - left side of Figure 4): This involves prefetching data for the next iteration (e.g., K+1). The LLM is capable of generating this K+1 prefetching logic using #pragma unroll.
  • Optimized (Multi-Step Prefetch - right side of Figure 4): CUDA-L2 can discover more aggressive prefetching strategies, loading data for multiple future iterations (e.g., K+0, K+1, K+2). This keeps the compute units busy for longer by maintaining a deeper pipeline of data. It is particularly effective when iteration counts are high and there is sufficient register headroom (i.e., enough available registers to hold the prefetched data without spilling to slower memory).
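A hedged sketch of the multi-step variant follows. As above, the helpers are simplified stand-ins (assumptions) for the real copy and MMA operations; the three-deep circular buffer keeps slices K+0, K+1, and K+2 in flight.

```cuda
// Deep register prefetching: three register buffers are kept in flight so the
// compute loop never waits on a shared-memory read, as long as register
// pressure allows it.
#define FRAG 8
#define DEPTH 3

__device__ inline void load_slice(float frag[FRAG], const float* smem, int k) {
    #pragma unroll
    for (int i = 0; i < FRAG; ++i) frag[i] = smem[k * FRAG + i];   // stand-in for the real copy
}

__device__ void mainloop_deep_prefetch(const float* smem_a, float acc[FRAG], int k_tiles) {
    float frag[DEPTH][FRAG];
    for (int k = 0; k < DEPTH && k < k_tiles; ++k)
        load_slice(frag[k], smem_a, k);                            // prologue: K+0, K+1, K+2
    for (int k = 0; k < k_tiles; ++k) {
        int cur = k % DEPTH;
        #pragma unroll
        for (int i = 0; i < FRAG; ++i) acc[i] += frag[cur][i];     // stand-in for mma.sync
        if (k + DEPTH < k_tiles)
            load_slice(frag[cur], smem_a, k + DEPTH);              // refill the slot just consumed
    }
}
```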

6.4.3.3. Epilogue Optimization with Direct Register-to-Shared-Memory Copy

The epilogue phase involves writing the final results from registers back to shared memory or global memory. The following figure (Listing 4 from the original paper) illustrates epilogue register-to-shared-memory copy:

[Listing 4 (shown as an image in the original paper): two ways to perform the epilogue copy, on the left via an intermediate tensor and on the right as a direct register-to-shared-memory (R2S) copy.]

  • Two-step (left side of Figure 5): The standard approach, often seen in CUTLASS/CuTe, might involve an intermediate tensor (intermediate_tensor) to reorganize data. This is necessary if the register layout does not directly match the shared memory layout.
  • Direct Wide Copy (right side of Figure 5): CUDA-L2 can discover how to directly copy data from registers to shared memory using wide memory accesses (e.g., 8-byte or 16-byte stores). This avoids the overhead of intermediate tensors and extra memory movements, making the epilogue more efficient. It requires a precise understanding of memory layouts and hardware capabilities.
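A hedged sketch of the direct-copy idea in plain CUDA follows (the paper's version operates on CuTe tensors). The function name, fragment size, and the 16-byte alignment of the shared-memory destination are assumptions.

```cuda
// Accumulator registers are written straight to shared memory with one
// 16-byte vectorized store per thread, instead of bouncing through an
// intermediate register tensor.
#include <cuda_fp16.h>

__device__ void epilogue_direct_r2s(const float acc[8], __half* smem_c, int lane_offset) {
    __align__(16) __half h[8];
    #pragma unroll
    for (int i = 0; i < 8; ++i) h[i] = __float2half(acc[i]);   // convert FP32 accumulators to FP16
    // One wide 16-byte store: all 8 halves leave the registers in a single instruction.
    // Assumes smem_c + lane_offset is 16-byte aligned.
    *reinterpret_cast<float4*>(smem_c + lane_offset) = *reinterpret_cast<const float4*>(h);
}
```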

6.4.3.4. Staggered/Split A-B Prefetch Scheduling

This optimization deals with the timing of prefetching input matrices A and B within the GEMM main loop. The following figure (Listing 5 from the original paper) illustrates A-B prefetch scheduling:

[Listing 5 (shown as an image in the original paper): the GEMM main loop with consecutive A/B prefetches on the left versus a staggered schedule on the right, in which the B prefetch is issued after the MMA call.]

  • Standard (Consecutive Prefetch - left side of Figure 6): Typically, the prefetches of A and B happen consecutively. For example, copy_a completes, then copy_b starts, and only then does the gemm (Matrix Multiply-Accumulate) unit execute. This can leave gaps where execution units are idle because the memory and compute pipelines are not fully overlapped.
  • Optimized (Staggered Prefetch - right side of Figure 6): CUDA-L2 can generate a staggered prefetch schedule. For instance, it can issue the A prefetch (cute::copy(s2r_tiled_copy_a, ...)), immediately follow with the gemm operation (cute::gemm(tiled_mma, ...)), and then prefetch B (cute::copy(s2r_tiled_copy_b, ...)). This "splits" the A and B prefetches, filling the gap after the MMA issues and achieving better overlap between memory and compute operations, thus improving pipeline utilization.
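The following hedged sketch shows the staggered schedule in plain CUDA (the paper's listing uses cute::copy and cute::gemm); the helper and the fragment size are stand-ins, not the released code. The B load is issued only after the compute on the current fragments has been started.

```cuda
// Staggered A/B prefetch: prefetch A, start the MMA-style compute, then
// prefetch B, so the B load fills the gap behind the MMA issue.
#define FRAG 8

__device__ inline void load_frag(float f[FRAG], const float* smem, int k) {
    #pragma unroll
    for (int i = 0; i < FRAG; ++i) f[i] = smem[k * FRAG + i];    // stand-in for cute::copy
}

__device__ void mainloop_staggered(const float* smem_a, const float* smem_b,
                                   float acc[FRAG], int k_tiles) {
    float a[2][FRAG], b[2][FRAG];
    load_frag(a[0], smem_a, 0);
    load_frag(b[0], smem_b, 0);
    for (int k = 0; k < k_tiles; ++k) {
        int cur = k & 1, nxt = cur ^ 1;
        if (k + 1 < k_tiles) load_frag(a[nxt], smem_a, k + 1);   // prefetch A first
        #pragma unroll
        for (int i = 0; i < FRAG; ++i) acc[i] += a[cur][i] * b[cur][i];  // stand-in for cute::gemm
        if (k + 1 < k_tiles) load_frag(b[nxt], smem_b, k + 1);   // prefetch B after the MMA issues
    }
}
```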

6.5. Hyperparameter Selection Patterns in Optimized CUDA Matrix Multiplication Kernels

The paper analyzes the patterns of hyperparameter selection made by CUDA-L2's LLM for optimized CUDA kernels. These patterns provide insights into the strategies learned by the RL agent.

The following table (Table 4 from the original paper) displays hyperparameter selection patterns:

[Figure (labeled Table 4 in the original paper): six box-plot panels showing hyperparameter selection patterns: (a) BM vs. M (ρ = 0.652), (b) BN vs. N (ρ = 0.705), (c) the weak BK-K correlation (ρ = 0.256), (d) the BM-BN correlation (ρ = 0.695), (e) stage count vs. K (ρ = 0.473), and (f) block-swizzling usage vs. problem size.]

The figure contains six subplots, (a) through (f), illustrating the hyperparameter selection patterns discussed below (referenced as Figure 6a-6f).

6.5.1. How to Choose BM, BN, BK

BM, BN, and BK refer to the tile sizes for the M, N, and K dimensions, respectively, used by each thread block during the matrix multiplication.

  • BM and M (Figure 6a): BM (the M-dimension tile size) scales proportionally with the M dimension, with a positive correlation of ρ = 0.652. As M increases from small values (≤ 256) to large values (> 4K), BM grows from approximately 60 to 160. This indicates that larger M dimensions benefit from larger M-tiles, likely because they better utilize Tensor Cores and improve data locality for larger output tiles.
  • BN and N (Figure 6b): A similar, even stronger, positive correlation is observed between BN (the N-dimension tile size) and the N dimension, with ρ = 0.705, suggesting that larger N dimensions likewise call for larger N-tiles.
  • BK and K (Figure 6c): In contrast, BK (the K-dimension tile size) shows only a weak correlation with K (ρ = 0.256). This is expected: BK is typically constrained by shared-memory capacity, Tensor Core instruction shapes, and the number of pipeline stages rather than scaling with K; its primary role is to set the granularity of data loading and inner-loop iterations.
  • BM and BN Correlation (Figure 6d): BM and BN are highly correlated (ρ = 0.695), indicating that CUDA-L2 tends to choose balanced BM and BN values. This likely reflects Tensor Core instruction shapes (e.g., 16 × 8 × 16), which favor balanced tile dimensions, and reduces the risk of one dimension becoming a resource bottleneck. A rough heuristic consistent with these trends is sketched after this list.
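
As a rough illustration of these trends, the sketch below encodes them as a fixed heuristic. The thresholds and the names TileConfig and choose_tiles are assumptions made for this example; CUDA-L2 arrives at its tile sizes through RL search rather than a hand-written rule.

```cuda
// Illustrative only: a fixed heuristic mirroring the correlations in panels
// (a)-(d). The thresholds and names (TileConfig, choose_tiles) are assumptions
// for this sketch; CUDA-L2 discovers its actual values via RL search.
struct TileConfig { int BM, BN, BK; };

TileConfig choose_tiles(int M, int N, int /*K*/) {
    // BM/BN grow with M/N (panels a, b) and are kept roughly balanced (panel d).
    auto pick = [](int dim) {
        if (dim <= 256)  return 64;    // small dims: modest tiles (~60 in the data)
        if (dim <= 4096) return 128;   // mid-range dims
        return 160;                    // large dims: upper end of the observed range
    };
    // BK is set by shared memory and MMA shape, not by K itself (panel c).
    return {pick(M), pick(N), 32};
}
```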

6.5.2. How to Choose Stage Number in Multi-stage Pipelining

The n_stage parameter refers to the number of pipeline stages used in double buffering or multi-stage pipelining for data movement.

  • n_stage and K (Figure 6e): As the K dimension increases, the number of pipeline stages (n_stage) tends to increase (correlation ρ = 0.473). This is crucial for latency hiding; a toy selection rule is sketched after this list.
    • Small K values (≤ 128) typically require only 2-3 stages for adequate latency hiding.
    • Large K values (> 8K) require 6 or more stages to maintain high throughput by keeping multiple data loads "in flight" simultaneously, ensuring the Tensor Cores are continuously fed with data.
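
A toy rule matching this trend might look like the sketch below; the exact cutoffs and the function name choose_num_stages are assumptions for illustration, not the paper's learned policy.

```cuda
// Illustrative only: a toy rule matching the trend in panel (e); the exact
// cutoffs are assumptions, not CUDA-L2's learned policy.
int choose_num_stages(int K) {
    if (K <= 128)  return 3;   // small K: 2-3 stages already hide latency
    if (K <= 8192) return 4;   // intermediate K: moderate pipelining
    return 6;                  // very large K: 6+ stages keep loads in flight
}
```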

6.5.3. When and How to Use Block Swizzling

Block Swizzling is a technique that reorders the mapping from thread-block indices to output tiles, changing the order in which blocks traverse global memory to improve memory access patterns and cache locality.

  • Block Swizzling and Problem Size (Figure 6f): The decision to enable Block Swizzling is primarily driven by problem size (total operations M × N × K); a sketch of one possible swizzled index mapping follows this list.
    • For small problems (below 2^27, roughly 134 million operations), Block Swizzling is optional and used in only 44% of configurations, as the overhead might outweigh the benefits.
    • For medium problems (2^27 to 2^33 operations), Block Swizzling becomes increasingly prevalent, used 73-80% of the time.
    • For very large problems (above 2^36, roughly 68 billion operations), Block Swizzling is applied almost universally, with 99% usage, indicating it becomes essential as memory access patterns grow more complex at larger scales.
  • Block Swizzling Stride and Problem Size: For larger problems, CUDA-L2 tends to select larger stride values (512-16,384). The correlation (ρ = 0.453) shows that both the decision to use Block Swizzling and the choice of its stride are closely tied to problem size.
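
To illustrate what block swizzling does, the sketch below shows one common way a swizzled block-index mapping can be written: a 1-D block index is remapped into tile coordinates so that consecutive blocks stay within a strip of swizzleStride tile rows, improving L2 reuse. The mapping, the parameter names, and the divisibility assumption are illustrative; CUDA-L2's generated kernels may use a different scheme and a different interpretation of "stride".

```cuda
// Illustrative only: one way to express a block-swizzled mapping from a 1-D
// block index to output-tile coordinates. Consecutive blocks stay within a
// strip of swizzleStride tile rows, so the tiles they read remain resident in
// L2. Assumes the number of tile rows is a multiple of swizzleStride.
__device__ void swizzled_tile_coords(int gridN, int swizzleStride,
                                     int& tileRow, int& tileCol) {
    const int linear  = blockIdx.x;              // kernel launched with a 1-D grid
    const int groupSz = swizzleStride * gridN;   // blocks per row-strip
    const int group   = linear / groupSz;        // which strip of tile rows
    const int inGroup = linear % groupSz;
    tileRow = group * swizzleStride + inGroup % swizzleStride;
    tileCol = inGroup / swizzleStride;
}
```

In this formulation a larger stride widens the strip of tiles kept hot in cache, which is consistent with the observation above that larger problems favor larger swizzle strides.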

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully introduces CUDA-L2, an innovative system that integrates Large Language Models (LLMs) and Reinforcement Learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels. Through continuous pretraining on a diverse range of CUDA code and LLM contexts, CUDA-L2 achieved remarkable performance gains across 1,000 distinct HGEMM configurations on A100 GPUs. Specifically, CUDA-L2 demonstrated significant speedups over state-of-the-art baselines:

  • In offline mode, it yielded +11.4% over NVIDIA's highly competitive cuBLASLt-AutoTuning library.
  • In server mode, this speedup further increased to +15.9%.
  • Against more commonly used baselines, the gains were even larger: +22.0% over torch.matmul (+28.7% in server mode) and +19.2% over cuBLAS (+26.0% in server mode).

Beyond raw performance, CUDA-L2 also demonstrated the ability to discover and apply sophisticated CUDA optimization techniques, including novel variations like double-buffered register fragments and staggered A-B prefetching. This work shows that even heavily optimized, performance-critical kernels like HGEMM can be substantially improved through LLM-guided RL automation that systematically explores configuration spaces at scales impractical for human experts.

7.2. Limitations & Future Work

The provided text does not explicitly list a "Limitations" section, but some can be inferred from the results and discussions:

  • Decreasing Speedup for Large Problems: As shown in the "Speedup vs. Problem Size" analysis, CUDA-L2's performance gains tend to diminish for very large matrix sizes. For the largest problems (log2(M × N × K) > 38), the speedup approaches 1.0x, meaning it performs similarly to cuBLASLt-AutoTuning-max. This suggests that the remaining optimization opportunities for extremely large, compute-bound kernels may be more limited or harder to discover through the current approach.

  • Computational Cost of RL Search: While not explicitly discussed, RL-based optimization, especially involving LLMs for code generation and empirical evaluation on hardware, can be computationally expensive and time-consuming. Generating, compiling, and benchmarking thousands of kernels across 1,000 configurations would require substantial GPU resources and time.

  • Generalizability Beyond HGEMM: The paper focuses specifically on HGEMM. While the methodology (LLM-guided RL for CUDA kernels) is general, the extent to which these specific speedups and discovered techniques transfer to other types of CUDA kernels (e.g., convolutions, reductions, custom layers) remains to be fully explored.

    The paper implicitly points towards future work in its ambition for general CUDA optimization capabilities and mentions adaptability to new GPU architectures like Blackwell (B200). This suggests ongoing research into maintaining performance gains on future hardware and expanding the scope of automatically optimizable kernels.

7.3. Personal Insights & Critique

This paper presents a highly inspiring advancement in the field of automated code optimization. The ability of an LLM-RL system to surpass cuBLAS performance is a significant milestone, challenging the long-held belief that manual, expert-driven tuning is irreplaceable for such critical low-level kernels.

Inspirations:

  • Paradigm Shift: This work truly exemplifies a paradigm shift where AI is not just assisting humans, but actively discovering and implementing solutions that outperform human-engineered ones in highly specialized domains. This has profound implications for compiler design, hardware-software co-design, and even the future of low-level systems programming.
  • Leveraging LLM's Code Generation Prowess: The paper effectively demonstrates that LLMs, when properly guided by RL and validated by real-world performance, can move beyond generating syntactically correct code to generating performance-optimal code. The fine-grained optimization techniques discovered (e.g., staggered prefetching, padding with zeros for better BM selection) show a deep understanding of GPU architecture implicitly learned by the system.
  • Scalability of Exploration: The LLM-RL framework's capacity to explore 1,000 configurations (and implicitly, many more kernel variations during training) is a scale that manual optimization cannot match, leading to the discovery of niche optimizations.

Potential Issues & Areas for Improvement:

  • Training Cost and Efficiency: While CUDA-L2 achieves impressive results, the computational cost and time required to train such an LLM-RL system to reach this level of performance are likely substantial. Future work could focus on making the training process more sample-efficient or knowledge-transferable across different GPU architectures or kernel types.

  • Interpretability and Debugging: LLM-generated code can be complex, and RL often operates as a black box. Understanding why certain generated kernels are faster or why certain optimization patterns emerge could be challenging. Tools for interpreting the LLM's "reasoning" or debugging generated code would enhance trust and allow for human insight to further refine the process.

  • Generalizability Across Kernel Types: While impressive for HGEMM, generalizing this approach to a wider array of CUDA kernels (e.g., sparse operations, custom activation functions, non-linear operations) with similar performance gains would be the ultimate test. Each kernel type might have its own unique optimization challenges and architectural bottlenecks.

  • Dynamic Adaptation: The paper mentions server mode as mimicking real-time inference. A more advanced system might not just generate an optimal kernel offline but dynamically adapt or select kernels based on runtime conditions (e.g., GPU load, memory pressure, specific input data characteristics).

  • Integration with Compilers: To make such a system widely usable, seamless integration into existing compiler toolchains (e.g., LLVM, GCC) or deep learning frameworks (e.g., PyTorch, TensorFlow) would be crucial.

    Overall, CUDA-L2 is a landmark paper that showcases the immense potential of blending AI with low-level systems optimization, opening exciting new frontiers for high-performance computing.
