QiMeng-Xpiler: Transcompiling Tensor Programs for Deep Learning Systems with a Neural-Symbolic Approach
TL;DR Summary
QiMeng-Xpiler integrates large language models with symbolic synthesis for automatic tensor program transcompilation across deep learning systems, reducing manual effort while ensuring correctness and high performance.
Abstract
Heterogeneous deep learning systems (DLS) such as GPUs and ASICs have been widely deployed in industrial data centers, which requires to develop multiple low-level tensor programs for different platforms. An attractive solution to relieve the programming burden is to transcompile the legacy code of one platform to others. However, current transcompilation techniques struggle with either tremendous manual efforts or functional incorrectness, rendering "Write Once, Run Anywhere" of tensor programs an open question. We propose a novel transcompiler, i.e., QiMeng-Xpiler, for automatically translating tensor programs across DLS via both large language models (LLMs) and symbolic program synthesis, i.e., neural-symbolic synthesis. The key insight is leveraging the powerful code generation ability of LLM to make costly search-based symbolic synthesis computationally tractable. Concretely, we propose multiple LLM-assisted compilation passes via pre-defined meta-prompts for program transformation. During each program transformation, efficient symbolic program synthesis is employed to repair incorrect code snippets with a limited scale. To attain high performance, we propose a hierarchical auto-tuning approach to systematically explore both the parameters and sequences of transformation passes. Experiments on 4 DLS with distinct programming interfaces, i.e., Intel DL Boost with VNNI, NVIDIA GPU with CUDA, AMD MI with HIP, and Cambricon MLU with BANG, demonstrate that QiMeng-Xpiler correctly translates different tensor programs at the accuracy of 95% on average, and the performance of translated programs achieves up to 2.0x over vendor-provided manually-optimized libraries. As a result, the programming productivity of DLS is improved by up to 96.0x via transcompiling legacy tensor programs.
In-depth Reading
1. Bibliographic Information
1.1. Title
QiMeng-Xpiler: Transcompiling Tensor Programs for Deep Learning Systems with a Neural-Symbolic Approach
1.2. Authors
Shouyang Dong, Yuanbo Wen, Jun Bi, Di Huang, Jiaming Guo, Jianxing Xu, Ruibai Xu, Xinkai Song, Yifan Hao, Ling Li, Xuehai Zhou, Tianshi Chen, Qi Guo, and Yunji Chen.
Affiliations:
- University of Science and Technology of China
- Cambricon Technologies
- SKL of Processors, Institute of Computing Technology, Chinese Academy of Sciences
- Institute of Software, Chinese Academy of Sciences
1.3. Journal/Conference
The paper was published on arXiv, which is an open-access archive for scholarly articles, primarily in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. It serves as a preprint server, allowing researchers to share their work before formal peer review and publication in a traditional journal or conference. While not a peer-reviewed venue itself, papers on arXiv are widely read and often later appear in prestigious conferences or journals.
1.4. Publication Year
2025 (submitted to arXiv on May 4, 2025)
1.5. Abstract
The abstract outlines the problem of developing multiple low-level tensor programs for diverse heterogeneous deep learning systems (DLS) like GPUs and ASICs, leading to a significant programming burden. Existing transcompilation techniques (converting code from one language/platform to another) are deemed insufficient due to either requiring extensive manual effort or leading to functional incorrectness, thus failing to achieve the "Write Once, Run Anywhere" ideal for tensor programs.
The paper introduces QiMeng-Xpiler, a novel transcompiler that automates tensor program translation across DLS. It employs a neural-symbolic synthesis approach, combining large language models (LLMs) for powerful code generation with symbolic program synthesis for correctness. The core idea is to leverage LLMs to make the usually costly search-based symbolic synthesis computationally tractable. This is achieved through multiple LLM-assisted compilation passes, each guided by meta-prompts for program transformation. During each pass, efficient symbolic program synthesis is used to repair small, incorrect code snippets. To ensure high performance, a hierarchical auto-tuning approach systematically explores both the parameters of transformation passes and their optimal sequences.
Experiments across four distinct DLS—Intel DL Boost (VNNI), NVIDIA GPU (CUDA), AMD MI (HIP), and Cambricon MLU (BANG)—demonstrate that QiMeng-Xpiler correctly translates tensor programs with an average accuracy of 95%. Translated programs achieve up to 2.0x performance improvement over vendor-provided, manually-optimized libraries. This results in a programming productivity improvement of up to 96.0x by effectively transcompiling legacy tensor programs.
1.6. Original Source Link
https://arxiv.org/abs/2505.02146
1.7. PDF Link
https://arxiv.org/pdf/2505.02146v1.pdf
2. Executive Summary
2.1. Background & Motivation (Why)
The proliferation of diverse heterogeneous deep learning systems (DLS), such as NVIDIA GPUs, Google TPUs, GraphCore IPUs, Intel DL Boost, AMD MI, and Cambricon MLUs, in industrial data centers presents a significant challenge: each platform often requires its own set of high-performance, low-level tensor programs (implementations of tensor operations like matrix multiplication or convolution). Writing and optimizing these programs for every unique architecture is notoriously difficult and labor-intensive due to their complex architectures, unique programming models (e.g., SIMT vs. SIMD), intricate memory hierarchies, and specialized instruction sets (intrinsics). This creates a substantial programming burden, hindering the full exploitation of these powerful systems.
The ideal solution is "Write Once, Run Anywhere" for tensor programs, which transcompilation (source-to-source code translation) aims to achieve. However, existing transcompilation techniques fall short:
- Rule-based approaches: Require extensive manual effort to define transformation rules between vastly different architectures, making them unscalable.
- Symbolic synthesis approaches: Suffer from large search spaces, making them computationally expensive and difficult to apply to general-purpose programs, especially when dealing with diverse parallel semantics. They also demand considerable manual effort to specify input constraints.
- Data-driven approaches (e.g., LLMs): While promising, they struggle with preserving tensor program semantics, leading to low translation accuracy (e.g., 29.6% reported in prior work) and necessitating significant manual correction.

These limitations mean that current methods either demand tremendous manual effort, have limited scalability, or result in functional incorrectness, leaving the "Write Once, Run Anywhere" problem largely unsolved for tensor programs.
The paper's novel approach, QiMeng-Xpiler, addresses these gaps by combining the flexibility of data-driven (LLM) approaches with the soundness of symbolic synthesis. The core innovation is a neural-symbolic method that leverages the code generation capabilities of LLMs to guide and prune the search space for symbolic synthesis, making it computationally tractable, and uses symbolic synthesis to ensure the correctness that LLMs alone cannot provide.
2.2. Main Contributions / Findings (What)
The paper makes the following key contributions:
- Neural-symbolic program synthesis: It proposes a novel neural-symbolic approach that uses LLMs to generate high-level program sketches (via meta-prompts) and then employs SMT-based symbolic synthesis to repair incorrect low-level details. This achieves fully automatic program translation with a strong guarantee of functional correctness, and it is the first work to automatically translate tensor programs across DLS with different programming models using such a hybrid approach.
- Hierarchical performance auto-tuning: It introduces a hierarchical auto-tuning strategy to maximize the performance of translated programs. This involves intra-pass auto-tuning, which explores parameters (e.g., tiling sizes) within a transformation pass using brute-force search, and inter-pass auto-tuning, which uses Monte Carlo Tree Search (MCTS) to find the optimal sequence of transformation passes.
- Extensive evaluation: The proposed QiMeng-Xpiler is rigorously evaluated on four distinct DLS (Intel DL Boost, NVIDIA GPU, AMD MI, and Cambricon MLU).

The key findings from the experiments are:
- High Accuracy: QiMeng-Xpiler correctly translates different tensor programs with an average accuracy of 95% (ranging from 86.9% to 100% depending on the translation direction and target). It significantly outperforms state-of-the-art LLM-based and rule-based methods in both compilation and computation accuracy, often achieving 100% compilation accuracy and near-100% computation accuracy where other methods fall short.
- Improved Performance: The translated programs achieve up to 2.0x performance (and 0.78x on average) relative to vendor-provided, manually-optimized libraries such as cuDNN/cuBLAS and oneDNN.
- Enhanced Productivity: The programming productivity for DLS is improved by up to 96.0x (for Cambricon MLU) and 34.3x (for NVIDIA GPU) by automating the transcompilation of legacy tensor programs.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a reader should be familiar with the following concepts:
- Deep Learning Systems (DLS): Specialized hardware architectures designed to accelerate deep learning workloads. Examples include:
  - GPUs (Graphics Processing Units): Originally for graphics rendering, now widely used for general-purpose parallel computing (GPGPU) in deep learning. NVIDIA GPUs with Tensor Cores are a prime example.
  - ASICs (Application-Specific Integrated Circuits): Custom-designed chips optimized for specific tasks, offering higher efficiency for deep learning. Examples include the Google TPU (Tensor Processing Unit) and the Cambricon MLU (Machine Learning Unit).
- Tensor Programs/Operators: Low-level implementations of mathematical operations involving tensors (multi-dimensional arrays), the fundamental data structures in deep learning. Examples include matrix multiplication (MatMul), convolution, activation functions (e.g., ReLU), and pooling.
- Heterogeneous Systems: Computing environments composed of different types of processors or specialized hardware, each with unique capabilities and programming models.
- Transcompilation (Source-to-Source Compilation): The process of automatically translating source code written in one programming language (or for one platform's programming model) into another language (or for another platform's model), while preserving its functionality.
- Programming Models: The conceptual framework that defines how a programmer specifies computation and data management for a particular hardware architecture.
  - SIMT (Single Instruction, Multiple Threads): Used by NVIDIA CUDA. Multiple threads execute the same instruction on different data simultaneously; threads are grouped into warps (or wavefronts on AMD).
  - SIMD (Single Instruction, Multiple Data): A type of parallel processing in which a single instruction operates on multiple data points simultaneously. Often used with specialized vector units or instruction sets such as VNNI or Cambricon BANG.
  - DSA (Domain-Specific Architecture): Hardware tailored for specific applications or domains, such as deep learning, offering high performance and efficiency for those tasks. The Cambricon MLU is a DSA.
- Parallelism:
  - Task-level parallelism: Different independent tasks run concurrently.
  - Multi-core parallelism: Tasks distributed across multiple processing cores.
  - Thread-level parallelism: Execution of different threads concurrently (e.g., CUDA blockIdx, threadIdx).
- Memory Hierarchy: The multi-layered organization of memory in a computer system, from fast, small storage (registers, shared memory) to slower, larger main memory (global memory), designed to optimize data access. Programmers often need to explicitly manage data movement within this hierarchy for performance.
- Specialized Intrinsics: Low-level functions or instructions provided by hardware vendors that directly access specialized hardware capabilities (e.g., Tensor Cores, Matrix Cores, VNNI instructions) for highly optimized computation. These are often platform-specific.
- Large Language Models (LLMs): Advanced artificial intelligence models (such as GPT-4, OpenAI o1, StarCoder) trained on vast amounts of text and code data. They can understand natural language prompts and generate human-like text, including source code.
- Symbolic Program Synthesis: A technique that automatically generates programs from a high-level specification (e.g., input/output examples, logical constraints). It often involves searching a space of possible programs.
- Satisfiability Modulo Theories (SMT) Solver: A tool used in symbolic program synthesis to determine whether a logical formula, potentially involving variables from different theories (e.g., arithmetic, arrays), has a solution. It is used to verify correctness or to find program components that satisfy given constraints.
- Abstract Syntax Tree (AST): A tree representation of the abstract syntactic structure of source code, in which each node denotes a construct occurring in the source. Used for program analysis and transformation.
- Meta-prompts: High-level, parameterized prompts or templates used with LLMs that can be adapted with specific context (e.g., program annotations) to guide code generation for different tasks or scenarios.
- Unit Tests: Small, isolated tests that verify the correctness of individual components or functions of a program. Essential for validating translated code.
- Monte Carlo Tree Search (MCTS): A heuristic search algorithm used in decision processes, particularly in games. It explores a search space by building a search tree, using random simulations (Monte Carlo) to estimate the value of nodes and guide exploration towards promising branches. Here, it is used to find optimal sequences of transformation passes.
- Brute-force search: An algorithm that systematically checks all possible solutions to a problem. Computationally expensive for large search spaces, but guaranteed to find the best option within the enumerated space.
3.2. Previous Works
The paper categorizes existing transcompilation efforts into three main groups:
- Rule-based Approaches:
  - Concept: Experts define manual transformation rules, often applied to an Abstract Syntax Tree (AST), to convert code between languages/platforms. Pattern matching is used to apply these rules.
  - Examples: FCUDA (CUDA-to-FPGA translator) [37], C2Rust [13], CxGo [5], HIPIFY (CUDA to HIP) [7], PPCG (polyhedral auto-parallelization from C to CUDA) [47].
  - Limitations: Highly labor-intensive and unscalable due to the need for extensive manual rules; struggle with architectural discrepancies and irregular code structures; lack general applicability across many DLS.
- Symbolic Synthesis Approaches:
  - Concept: Generate semantics-preserving target code from specifications (e.g., domain-specific languages (DSLs), input/output examples). Often rely on search-based SMT solvers.
  - Examples: C2TACO (translates C tensor code to the TACO DSL using a guided enumerative synthesizer) [33], MetaLift [18], Tenspiler [39] (the latter two convert source programs into a unified intermediate representation (IR) and then synthesize target programs in the same IR domain).
  - Limitations: Struggle with large search spaces, making them unscalable for large, general-purpose programs or commodity DLS; cannot efficiently handle diverse parallel semantics (e.g., SIMT vs. SIMD) or complex memory hierarchies; require manual definition of target language semantics in a specification IR, which is challenging and error-prone.
- Data-driven Approaches (LLM-assisted):
  - Concept: Train neural networks (typically LLMs) on large code corpora to generate code or translate between languages.
  - Examples: TransCoder [40] (sequence-to-sequence models for C++, Java, and Python translation), StarCoder [31], GPT-4 [15, 20, 36], OpenAI o1 [8], Codex [21], CodeGen [35], CodeT5 [48], CodeGeeX [52], LLaMA [44], Gemini [43], AlphaCode [32].
  - Limitations: Lack a deep understanding of program semantics, leading to functional incorrectness or incompleteness; perform poorly in specific domains such as DLS tensor programs because of scarce high-quality training data; translation accuracy for DLS is often low (e.g., 29.6% reported in prior work), requiring significant manual correction and making them unsuitable as high-accuracy transcompilers.
3.3. Technological Evolution
The evolution of transcompilation reflects the increasing complexity and heterogeneity of computing systems. Early efforts focused on rule-based transformations, which were feasible for relatively simpler or more aligned source and target languages/platforms. As hardware diversified, especially with the advent of DLS, the manual burden of rule-based methods became prohibitive. Symbolic synthesis emerged to automate correctness guarantees by leveraging formal methods, but its inherent computational cost limited its applicability to smaller, more constrained domains. More recently, the rise of powerful LLMs has opened new avenues for code generation and translation, offering unprecedented flexibility and automation. However, their data-driven nature struggles with the precise semantic preservation and domain-specific correctness required for low-level tensor programs on highly specialized DLS.
QiMeng-Xpiler represents a significant step in this evolution by attempting to bridge the gap between the flexibility of LLMs and the soundness of symbolic synthesis. It acknowledges the strengths and weaknesses of both, combining them into a neural-symbolic framework to tackle the unique challenges of DLS transcompilation.
3.4. Differentiation
QiMeng-Xpiler differentiates itself from existing approaches by:
- Hybrid Neural-Symbolic Approach: Unlike purely rule-based, symbolic, or data-driven methods, QiMeng-Xpiler combines LLMs for flexible high-level program sketching with SMT-based symbolic synthesis for precise low-level repair and correctness guarantees. This avoids the manual effort of rule-based systems, the scalability issues of pure symbolic synthesis, and the accuracy problems of pure LLM-based systems.
- Tackling DLS Heterogeneity: It specifically targets the challenges of transcompiling tensor programs across diverse DLS with distinct programming models (SIMT, SIMD), memory hierarchies, and specialized intrinsics, a domain where previous methods generally failed.
- Decomposed Transformation Passes: Instead of a single LLM prompt for the entire translation, QiMeng-Xpiler breaks the process into a series of LLM-assisted transformation passes. This modularity significantly improves accuracy and keeps the symbolic synthesis problems tractable.
- Hierarchical Performance Auto-Tuning: It integrates hierarchical auto-tuning (brute-force search for intra-pass parameters, MCTS for inter-pass sequences) to not only ensure functional correctness but also maximize the execution performance of the translated code, often rivaling manually optimized libraries. This is a critical feature often missing in code generation tools.
- Correctness Guarantee: By incorporating SMT-based repair at a small scale within each transformation pass, QiMeng-Xpiler ensures functional equivalence, a crucial property for production-grade transcompilers that LLM-only solutions cannot reliably provide.
4. Methodology
4.1. Principles
The core idea behind QiMeng-Xpiler is to combine the strengths of large language models (LLMs) and symbolic program synthesis in a synergistic way, forming a neural-symbolic synthesis approach. The underlying principles are:
- Leverage LLMs for Tractability: Utilize the powerful code generation capabilities of LLMs to produce initial program sketches and guide each transformation. This makes the typically costly search-based symbolic synthesis computationally tractable by significantly pruning its search space and focusing it on smaller, more manageable problem instances.
- Ensure Correctness with Symbolic Synthesis: Employ symbolic program synthesis to formally verify and repair the small, incorrect code snippets generated by the LLM. This guarantees semantic equivalence and functional correctness, which LLMs alone often struggle to provide for low-level tensor programs.
- Decomposed Transformation: Break down the complex transcompilation process into a series of modular, LLM-assisted transformation passes. Each pass addresses specific characteristics of DLS (parallelism, memory hierarchy, specialized instructions), improving the accuracy and manageability of the overall translation.
- Performance Optimization through Auto-Tuning: Systematically explore the vast search space of transformation parameters and pass sequences using a hierarchical auto-tuning approach. This ensures that the generated code is not just functionally correct but also well tuned for the target DLS.
4.2. Steps & Procedures
QiMeng-Xpiler operates in two main parts: Neural-Symbolic Program Synthesis and Hierarchical Performance Auto-Tuning.
The overall system architecture is presented in Figure 3.
Figure 3: An overview of QiMeng-Xpiler in two parts. Part (a) shows LLM-based program annotation and transformation together with SMT-based code repair; part (b) shows hierarchical auto-tuning of the transformation passes, covering both inter-pass and intra-pass tuning strategies.
The system fundamentally separates Neural-Symbolic Program Synthesis (left) from Hierarchical Performance Auto-Tuning (right) to first ensure correctness and then optimize performance.
4.2.1. Neural-Symbolic Program Synthesis
This part decomposes the transcompilation into a series of LLM-assisted transformation passes, with symbolic synthesis for repair. There are 11 such passes, categorized into three classes (Table 4):
The following table shows the transformation passes and their description.
| Category | Name | Description |
|---|---|---|
| (1) Sequentialization / Parallelization | Loop Recovery | Convert parallel variables to sequential for loops |
| | Loop Bind | Assign a sequential loop to parallel variables |
| | Loop Split | Divide a loop into several sub-loops |
| | Loop Fuse | Merge several loops into a hyper-loop |
| | Loop Reorder | Change the execution order of loops |
| | Loop Expansion | Split a loop body into several loop bodies |
| | Loop Contraction | Merge the producer into the loop body of its consumer |
| (2) Memory Conversion | Cache | Adapt to the memory hierarchy for efficient load/store of inputs/outputs |
| | Pipeline | Pipeline data load/store and computation |
| (3) (De)Tensorization | Tensorize | Replace a specific loop body to leverage special intrinsics |
| | Detensorize | Restore a specific loop body from special intrinsics |
- Sequentialization/Parallelization: Loop Recovery, Loop Bind, Loop Split, Loop Fuse, Loop Reorder, Loop Expansion, Loop Contraction. These passes handle the conversion between parallel program constructs (e.g., CUDA threadIdx.x) and sequential loops, or restructure loops for better parallelism.
- Memory Conversion: Cache, Pipeline. These passes adapt the code to the target system's memory hierarchy by managing data movement and access patterns (e.g., caching, pipelining loads/stores).
- (De)Tensorization: Tensorize, Detensorize. These passes convert sequential code into parallel tensor intrinsics or vice versa, leveraging specialized hardware instructions.

Each program transformation pass follows the same workflow: Program Annotation, Meta-Prompts based Transformation, Bug Localization, and SMT-based Code Repairing. This process is illustrated in Figure 4 for a tensorization case.
Figure 4: The neural-symbolic program synthesis workflow on a tensorization case, covering program annotation, meta-prompt based program transformation, and SMT-based bug localization and repair.
Figure 4 illustrates how a CUDA C matmul loop is tensorized into a BANG C __bang_mlp intrinsic using the neural-symbolic approach.
4.2.1.1. Program Annotation
This initial step marks code blocks with semantics relevant to the transformation pass, aiding the LLM. It consists of two parts (Algorithm 1):
Algorithm 1: Program Annotation Algorithm
Input: A source program P
Output: An annotated program P_annotated
Given: LLM, BM25 search engine (SE), programming manual M
1 P_annotated ← P
2 // Annotate program with identified computation semantics
3 P_annotated ← LLM.AnnotateSemantics(P_annotated)
4 // Get the list of computation information
5 L ← TraverseComputation(P_annotated)
6 for n in L do
7 // Retrieve the programming manual of each computation
8 retrieved_info ← SE.Retrieve(n, M)
9 // Annotate program with memory spaces or tensor intrinsics
10 P_annotated ← LLM.AnnotateReferences(P_annotated, n, retrieved_info)
11 end
12 return P_annotated
- Semantics Annotation (lines 2-3): An LLM identifies computational operations and annotates them with platform-agnostic semantics (e.g., operation(matmul)). This abstracts away platform-specific details.
- Reference Annotation (lines 4-11): For each identified operation, a BM25 search engine [45] retrieves relevant information (e.g., the target operation, its parameters, memory constraints) from the target system's programming manual. An LLM then uses this information together with the source program to annotate the operator with specific hardware characteristics (e.g., intrinsic(__bang_mlp(input[Nram, Wram], output[Nram]))). This provides platform-specific references to guide the subsequent LLM-based transformation; a retrieval sketch is shown below.
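The reference-annotation step can be pictured as a small retrieval loop over the target platform's manual. The sketch below is only illustrative: it assumes the third-party rank_bm25 package and a toy in-memory "manual"; the entry texts and function name are hypothetical, not taken from the paper.

```python
from rank_bm25 import BM25Okapi

# Toy "programming manual": one entry per documented intrinsic.
manual_entries = [
    "__bang_mlp matmul intrinsic multiplies matrices held in NRAM and WRAM",
    "__bang_conv conv intrinsic convolves an NRAM input with WRAM weights",
    "__bang_add add intrinsic adds two NRAM tensors element-wise",
]
bm25 = BM25Okapi([entry.lower().split() for entry in manual_entries])

def annotate_reference(operation_annotation: str) -> str:
    """Return the manual entry most relevant to an identified operation,
    e.g. 'operation(matmul)' -> the __bang_mlp documentation snippet."""
    query = operation_annotation.replace("operation(", "").rstrip(")").split()
    return bm25.get_top_n(query, manual_entries, n=1)[0]

print(annotate_reference("operation(matmul)"))
```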
4.2.1.2. Meta-Prompts based Transformation
This step uses LLMs to generate the transformed code. A meta-prompt is a high-level template adapted for each source program based on annotations. Each transformation pass has a unique meta-prompt comprising three parts:
- Platform-agnostic description: Describes the program's functionality and general constraints (e.g., tensorization in the context of SIMD). This part is constant across platforms.
- Platform-specific examples: Uses the annotated source program to search for relevant examples in the target programming manual. These examples (e.g., a C to BANG C matmul) are incorporated into the prompt, guiding the LLM to generate functionally correct code with appropriate syntax and intrinsics.
- Tuning knobs: Optional constraints for passes such as Loop Split or Loop Reorder, defining parameters such as the loop split alignment size in order to generate a search space for auto-tuning.

The LLM then generates a transformed code sketch that is generally correct in structure but may contain error-prone values. A sketch of how such a meta-prompt could be assembled is shown below.
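To make the three-part structure concrete, the following sketch assembles a tensorization meta-prompt from a platform-agnostic description, a retrieved manual example, and optional tuning knobs. The template text, function names, and dictionary layout are illustrative assumptions, not the paper's actual prompts.

```python
# A sketch of assembling a per-pass meta-prompt from its three parts.

TENSORIZE_DESCRIPTION = (
    "You are transforming a tensor program for a SIMD accelerator. "
    "Replace the annotated loop nest with the matching tensor intrinsic, "
    "keeping the computation semantics unchanged."
)

def retrieve_example(annotation, manual_index):
    # Placeholder for the BM25 lookup over the target programming manual.
    return manual_index.get(annotation["operation"], "")

def build_meta_prompt(annotated_source, annotation, manual_index, tuning_knobs=None):
    parts = [
        TENSORIZE_DESCRIPTION,                      # platform-agnostic description
        "Reference example:\n" + retrieve_example(annotation, manual_index),
        "Source program:\n" + annotated_source,     # annotated input program
    ]
    if tuning_knobs:                                # optional tuning constraints
        parts.append("Tuning constraints: " + ", ".join(
            f"{k}={v}" for k, v in tuning_knobs.items()))
    return "\n\n".join(parts)

manual = {"matmul": "C to BANG C matmul example from the programming manual ..."}
prompt = build_meta_prompt("/* annotated CUDA C kernel ... */",
                           {"operation": "matmul"}, manual,
                           {"loop_split_alignment": 64})
print(prompt)
```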
4.2.1.3. Bug Localization
If the LLM-generated code fails unit tests, Algorithm 2 is used to precisely locate buggy code snippets.
Algorithm 2: Bug Localization Algorithm
Input: Source program S, Buggy transcompiled program E, Loop index I, Buffer name N, control-flow matching model M
Output: Error Nodes Mt
1 S_n ← AST(S)
2 S_s ← ExtractASTSegment(S_n, I, N)
3 // Obtain the buffer list
4 B_s ← TraverseBuffers(S_s)
5 E_n ← BinarySearchErrorNodes(B_s, E, S)
6 M_b ← FindMatchingCodeBlocks(M, S_s, E_n)
7 for Ms, Mt in M_b do
8 if Diverse_ds(Ms, Mt) then
9 return Mt
10 end
11 end
The process involves:
- Identifying the code snippet transformed based on the loop index or buffer name.
- Comparing the buggy transformed snippet with the original.
- Using binary search, by inserting print statements at relevant memory locations, to narrow down the error scope (a sketch of this idea follows below).
- Mapping source-level control flow to diverging inputs to pinpoint erroneous behavior.
- Enumerating constraints (index variables, loop boundaries, buffer sizes, tensor semantics) within the snippet to create queries for the SMT solver.
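A minimal sketch of the binary-search idea: assume each program dumps the same sequence of intermediate buffers (the stand-in for the inserted print statements) and that once a buffer diverges it stays divergent at later checkpoints. The function and variable names are hypothetical.

```python
def first_divergence(source_dumps, target_dumps, atol=1e-5):
    """Both arguments are equal-length lists of intermediate buffers
    (lists of floats) recorded at the same checkpoints; returns the index
    of the first checkpoint whose buffers differ, or None if all match.
    Assumes divergence, once present, persists at later checkpoints."""
    def matches(k):
        return all(abs(a - b) <= atol
                   for a, b in zip(source_dumps[k], target_dumps[k]))
    if matches(len(source_dumps) - 1):
        return None                    # final outputs already agree
    lo, hi = 0, len(source_dumps)      # invariant: first divergence in [lo, hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if matches(mid - 1):
            lo = mid                   # everything up to mid-1 agrees
        else:
            hi = mid                   # divergence is at or before mid-1
    return lo

src = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
bad = [[1.0, 2.0], [3.0, 9.0], [5.0, 0.0]]   # diverges from the 2nd checkpoint on
print(first_divergence(src, bad))            # -> 1
```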
4.2.1.4. SMT-based Code Repairing
Once a bug is localized, the SMT solver repairs the buggy code snippet, focusing on low-level implementation details such as loop boundaries, indexing, and instruction parameters (Algorithm 3).
Algorithm 3: SMT-based Code Repairing Algorithm
Input: A source program P, error program E
Output: A repaired program F
1 Given: Error code snippet S_error
2 T_s ← AST(P); S_s ← extract(T_s, S_error)
3 T_e ← AST(E); S_e ← extract(T_e, S_error)
4 // Generate the sketch according to the transformation definition.
5 S_k ← GenerateCodeSketch(S_s, S_e)
6 // Generate the smt query
7 Q ← CreateSMTQuery(S_s, S_e)
8 // Generate the verified code snippet
9 R ← SynthesisCode(Q)
10 F ← StitchBack(P, R, S_error)
11 return F
The process involves:
- Parsing the source and error code to extract the buggy snippet.
- Generating a code sketch based on the specific loop transformation's definition.
- Explicitly enumerating constraints over concrete index variables, loop boundaries, and buffer indexes.
- Formulating an SMT query Q from these boundary and buffer-index constraints.
- Using the SMT solver (SynthesisCode(Q)) to fill in the missing expressions in the sketch, ensuring symbolic alignment with the source program.

Figure 5 shows examples of SMT constraints for loop split and cache read.
Figure 5: SMT constraints for loop split and cache read, showing code examples together with the corresponding symbolic constraints (loop-split index constraints, and cache-read index-range and assignment-consistency constraints).
Figure 5 illustrates how constraints for loop splitting and cache access are formulated for the SMT solver. For loop split, the constraints ensure that the outer and inner indices reconstruct the original iteration space (the original index equals the outer index times the split factor plus the inner index, with the original loop bounds preserved). For cache read, the constraints bound the cache indices and require assignment consistency, i.e., every element read from the cache equals the element loaded from global memory at the corresponding loop index. A small solver-based check in this spirit is sketched below.
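As an illustration of posing such constraints to an SMT solver, the sketch below uses the Z3 Python bindings to check whether a candidate split factor exactly preserves the iteration space of a loop of extent N when no guard is inserted: a factor is rejected if some original index is missed or if the split loops produce an out-of-bounds index. The extent, the candidate factors, and the function name are assumptions for the example, not values from the paper.

```python
from z3 import Int, Solver, ForAll, Implies, And, Or, unsat

N = 128  # extent of the original loop (illustrative)

def split_is_valid(factor: int) -> bool:
    """True if i = io * factor + ii with io in [0, ceil(N/factor)) and
    ii in [0, factor) visits exactly the indices 0..N-1, i.e. the split
    needs no extra guard in the rewritten loop."""
    i, io, ii = Int("i"), Int("io"), Int("ii")
    outer = (N + factor - 1) // factor
    in_range = And(0 <= io, io < outer, 0 <= ii, ii < factor)
    # Counterexample 1: some original index is never produced by the split loops.
    missed = And(0 <= i, i < N,
                 ForAll([io, ii], Implies(in_range, io * factor + ii != i)))
    # Counterexample 2: the split loops produce an out-of-bounds index,
    # meaning a guard such as `if (i < N)` would be required.
    overrun = And(in_range, io * factor + ii >= N)
    s = Solver()
    s.add(Or(missed, overrun))
    return s.check() == unsat  # no counterexample -> the split is safe

print([f for f in (8, 16, 24, 32, 48) if split_is_valid(f)])  # -> [8, 16, 32]
```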
4.2.2. Hierarchical Performance Auto-Tuning
This component aims to automatically maximize the performance of transcompiled programs by exploring transformation parameters and sequences. The transcompilation exploration space $S$ is defined as:

$$S = \left\{ P_N \;\middle|\; P_i = p_i\!\left(t_i, P_{i-1}\right),\; p_i \in \mathcal{P},\; t_i \in \mathcal{T},\; i = 1, \dots, N \right\}$$

where $P_0$ is the source tensor program, $p_i$ is a randomly selected pass from the predefined passes $\mathcal{P}$, $t_i$ is a tuning option from the predefined tuning knobs $\mathcal{T}$, and $N$ is the number of transformations. The size of this search space is:

$$|S| = \prod_{i=1}^{N} |\mathcal{P}_i| \times |\mathcal{T}_i|$$
This vast search space is explored using a hierarchical approach:
4.2.2.1. Intra-Pass Auto-Tuning
This focuses on fine-tuning parameters within a single transformation pass.
- Search space generation: LLMs and the programming manuals help generate the search space. For example, for Loop Split and Loop Reorder, specially designed meta-prompts (Figure 6) are used to generate programs with different adjustable parameters (e.g., the number of blocks for logical loops, the loop order, and tiling sizes).
Figure 6: The auto-tuning meta-prompt for loop split, showing how the loop variable i is split into two nested loops and instructing the LLM to return all possible loop indices and their extents.
Figure 6 shows a meta-prompt for loop split which instructs the LLM to split a loop variable into two nested loops and return all possible loop indices and their extents, providing a structured way for the LLM to explore tiling options.
- Search space exploration: A brute-force search is employed to find the optimal program configuration within the generated search space. This is feasible because, for some DLS (e.g., BANG C), the search space is relatively small (e.g., about 10 configurations for Matmul on MLU, compared to 150 on GPU). A brute-force sketch is shown below.
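A minimal intra-pass tuner can simply enumerate the Cartesian product of knob values and keep the fastest valid variant. In the sketch below, the knob names and the instantiate/benchmark helpers are hypothetical placeholders for the pass's generated configurations and the on-device measurement.

```python
import itertools

# Illustrative tuning knobs for one pass (not the paper's actual search space).
knob_space = {
    "tile_m": [16, 32, 64],
    "tile_n": [16, 32, 64],
    "loop_order": ["mnk", "mkn"],
}

def instantiate(config):
    # Placeholder: would emit the transformed program for this configuration.
    return config

def benchmark(program):
    # Placeholder: would compile, run unit tests, and time the program;
    # invalid programs map to throughput 0.0.
    return sum(v for v in program.values() if isinstance(v, int)) / 128.0

best_config, best_score = None, 0.0
for values in itertools.product(*knob_space.values()):
    config = dict(zip(knob_space.keys(), values))
    score = benchmark(instantiate(config))
    if score > best_score:
        best_config, best_score = config, score

print(best_config, best_score)
```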
4.2.2.2. Inter-Pass Auto-Tuning with MCTS
This component uses Monte Carlo Tree Search (MCTS) to find the optimal sequence of transformation passes, treating transcompilation as a Markov decision process. Each intermediate program is a state, and applying a pass is an action.
- Reward function: The reward for a given transformation sequence is tied to the execution-time improvement. MCTS runs in parallel with the current program, recording the best real throughput as the reward. Here $P_{i,j}$ denotes the transformed program at step $i$, $j$ is the index of that program within the intra-pass auto-tuning search space, and $T(P_{i,j})$ is its throughput (see Section 4.3.2 for the exact definition).
- Searching parameter settings: Through design space exploration, the maximum search depth $N$ for MCTS was set to 13 and the total number of simulations to 512. This balances search time against reward, allowing QiMeng-Xpiler to find well-optimized transcompiled programs within a reasonable time (e.g., hours).
4.3. Mathematical Formulas & Key Details
4.3.1. Exploration Space Formalization
The transcompilation exploration space is defined as the set of all possible programs that can be generated by applying a sequence of transformations.
Let $P_0$ represent the initial source tensor program.
Let $p_i$ represent a transformation pass selected from the set of predefined passes $\mathcal{P}$ at step $i$.
Let $t_i$ represent a tuning option (e.g., a specific tiling size or loop order) selected from the set of predefined tuning knobs $\mathcal{T}$ at step $i$.
The application of pass $p_i$ with tuning option $t_i$ to the program $P_{i-1}$ yields a new program $P_i = p_i(t_i, P_{i-1})$.
The entire exploration space for a sequence of $N$ transformations is given by:

$$S = \left\{ P_N \;\middle|\; P_i = p_i\!\left(t_i, P_{i-1}\right),\; p_i \in \mathcal{P},\; t_i \in \mathcal{T},\; i = 1, \dots, N \right\}$$

Here,
- $P_N$ is the final transformed program after $N$ steps.
- $p_i(\cdot)$ is the function that applies a transformation pass to a program.
- $t_i$ is the tuning knob (parameter) supplied to that pass, controlling the details of the transformation.

The size of this exploration space is the product of the number of choices at each step for both the pass and the tuning knob:

$$|S| = \prod_{i=1}^{N} |\mathcal{P}_i| \times |\mathcal{T}_i|$$

where $|\mathcal{P}_i|$ is the number of available transformation passes at step $i$ and $|\mathcal{T}_i|$ is the number of available tuning-knob options at step $i$. This formula highlights the exponential growth of the search space with the number of transformation steps $N$.
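The exponential growth is easy to see numerically; the snippet below plugs in illustrative per-step counts (not figures from the paper) and also samples one random point of the space.

```python
import math
import random

# Hypothetical per-step choice counts: 11 transformation passes and a few
# tuning-knob options at every step (illustrative numbers only).
num_passes = [11] * 13          # |P_i| for N = 13 steps (the MCTS depth limit)
num_knobs  = [4] * 13           # |T_i|

space_size = math.prod(p * t for p, t in zip(num_passes, num_knobs))
print(f"|S| = {space_size:.3e}")   # exponential in the number of steps N

# One random point of S: a sequence of (pass index, knob index) choices.
sequence = [(random.randrange(p), random.randrange(t))
            for p, t in zip(num_passes, num_knobs)]
print(sequence)
```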
4.3.2. MCTS Reward Function
The reward function for Monte Carlo Tree Search (MCTS) quantifies the performance of a transcompiled program.
Let $P_{i,j}$ denote a transformed program at step $i$ resulting from the $j$-th choice within the intra-pass auto-tuning search space.
Let $T(P_{i,j})$ be the measured throughput (e.g., operations per second) of the program on the target DLS.
The intermediate throughput is defined as:

$$T(P_{i,j}) = \begin{cases} \text{throughput}(P_{i,j}) & \text{if } P_{i,j} \text{ is valid} \\ 0 & \text{otherwise} \end{cases}$$

Here,
- $T(P_{i,j})$: the throughput value assigned to program $P_{i,j}$.
- $\text{throughput}(P_{i,j})$: the actual measured throughput of the program.
- "is valid": the program successfully compiles, executes, and is functionally correct (i.e., it passes the unit tests). If it is not valid, its throughput is treated as 0 and yields no reward.

The reward for MCTS at step $i$ is the maximum throughput achieved among all programs generated within the intra-pass auto-tuning search space at that step:

$$R_i = \max_{j} \; T(P_{i,j})$$

Here,
- $R_i$: the reward obtained at step $i$.
- $\max_j T(P_{i,j})$: the maximum throughput found among all programs at step $i$.

This ensures that MCTS favors sequences of transformations that lead to the best-performing valid programs.
5. Experimental Setup
5.1. Datasets
The study evaluates QiMeng-Xpiler on 21 widely-used deep learning operators. These operators are grouped into 6 types of common deep learning operations:
- MatMul (matrix multiplication)
- Convolution
- Activation functions (e.g., ReLU, Softmax, GeLU, Sigmoid)
- Pooling operations (MaxPool, AvgPool, MinPool, SumPool)
- Element-wise operations (Add, Sign)
- LLM-specific operations (LayerNorm, Deformable Attention, Self Attention, RMSNorm)

Each of these 21 operators is further evaluated with 8 typical shapes. These shapes are extracted from real-world neural network architectures, ensuring practical relevance. The networks from which they are derived include GPT [36], LLaMA-2 [44], DAT [51], BERT [23], ResNet [25], MobileNet [26], and VGG [42].

In total, this leads to 168 test cases (21 operators × 8 shapes). The size of the code for these operators ranges from 7 to 214 lines, covering a variety of complexities and types.

Example of a data sample:
For a MatMul operator, a "shape" refers to the dimensions of the input matrices. For instance, a GEMM (General Matrix Multiplication) operation might involve matrices of shape M × K and K × N, producing an M × N output. This specific configuration would be one of the "8 typical shapes" for MatMul. The "lines of code" (LoC) would be the CUDA C, BANG C, HIP, or C with VNNI implementation for this operation.
The following table (Table 5 in the paper) provides a breakdown of the evaluated benchmarks, including the lines of code for each target language and the number of test cases per operator type.
| Type | Operators | LoC (CUDA C) | LoC (BANG C) | LoC (HIP) | LoC (C with VNNI) | Cases |
|---|---|---|---|---|---|---|
| MatMul | GEMM, GEMV | 26, 12 | 15, 11 | 25, 12 | 41, 34 | 24 |
| Batch GEMM | 29 | 16 | 28 | 47 | ||
| Convolution | Conv1D | 10 | 14 | 10 | 9 | 32 |
| Conv2D NHWC, NCHW | 32, 30 | 23, 30 | 32, 30 | 83, 85 | ||
| Depthwise Conv | 27 | 24 | 27 | 17 | ||
| Activation | ReLU, Softmax | 7, 24 | 18, 18 | 7, 24 | 14, 28 | 32 |
| GeLU, Sigmoid | 7, 10 | 29, 30 | 7, 10 | 14, 14 | ||
| Elementwise | Add, Sign | 20, 12 | 26, 29 | 18, 12 | 17, 14 | 16 |
| Pooling | MaxPool, AvgPool | 25, 30 | 25, 25 | 25, 30 | 31, 34 | 32 |
| MinPool, SumPool | 23, 29 | 25, 25 | 23, 29 | 33, 30 | ||
| LLM | LayerNorm | 42 | 35 | 42 | 46 | 32 |
| Deformable Attention | 139 | 191 | 139 | 214 | ||
| Self Attention, RMSNorm | 54, 25 | 64, 18 | 54, 25 | 111, 18 | ||
These datasets were chosen to comprehensively assess QiMeng-Xpiler's capabilities by covering a wide range of operator types, varying code complexities, and diverse input shapes representative of real-world deep learning workloads.
5.2. Evaluated Platforms
The experiments were conducted on four distinct Deep Learning Systems (DLS), each with its own programming interface:
- Intel Gold 6348 CPU: Equipped with VNNI (Vector Neural Network Instructions) extensions, i.e., specialized instructions designed to accelerate deep learning operations. The programming interface is C with VNNI intrinsics.
- NVIDIA A100 GPU: A high-performance GPU widely used in deep learning. It follows the SIMT (Single Instruction, Multiple Threads) programming model. The programming interface is CUDA C.
- AMD MI200 (Instinct MI250X): An AMD data center GPU. It provides an alternative to CUDA with its HIP (Heterogeneous-Compute Interface for Portability) language, which is designed as a porting target for CUDA.
- Cambricon MLU: A domain-specific deep learning accelerator (DSA) from Cambricon. It follows a SIMD (Single Instruction, Multiple Data) programming model. The programming interface is BANG C.

Table 1 provides a detailed comparison of these DLS and their interfaces, illustrating the programming challenges they present.
The following table shows Table 1:

| Platforms | Interfaces | Categories | Examples |
|---|---|---|---|
| Intel DL Boost | C with VNNI extensions | Specialized Intrinsics | _mm_dpbusds_epi32(...), _mm512_dpbusd_epi32(...) |
| NVIDIA GPU with Tensor Core | CUDA C | Parallelism | blockIdx, threadIdx |
| | | Memory Hierarchy | registers, __shared__, __global__; matrix_a, matrix_b, and accumulator in Tensor Core fragments |
| | | Specialized Intrinsics | wmma::mma_sync(d, a, b, c) |
| AMD MI with Matrix Core | HIP | Parallelism | blockIdx, threadIdx |
| | | Memory Hierarchy | registers, __shared__, __global__; matrix_a, matrix_b, and accumulator in Matrix Core fragments |
| | | Specialized Intrinsics | d = __builtin_amdgcn_mfma_f32_16x16x4f32(a, b, c, ...) |
| Cambricon MLU | BANG C | Parallelism | taskId for task-level parallelism; clusterId, coreId for multi-core parallelism |
| | | Memory Hierarchy | registers, __mlu_shared__, __mlu_device__, __nram__, __wram__ |
| | | Specialized Intrinsics | __bang_mlp(...), __bang_conv(...) |
5.3. Evaluation Metrics
The performance and accuracy of QiMeng-Xpiler are evaluated using three key metrics:
- Compilation Accuracy:
  - Conceptual Definition: The proportion of transcompiled programs that successfully pass the compiler checks for the target platform. It indicates syntactic correctness and adherence to the target language's rules and API usage.
  - Importance: A high compilation accuracy signifies the transcompiler's ability to generate valid, buildable code for the target system. Without successful compilation, a program cannot be executed or tested for functional correctness or performance.
  - Mathematical Formula:
    $$\text{Compilation Accuracy} = \frac{N_{\text{compiled}}}{N_{\text{total}}}$$
  - Symbol Explanation:
    - $N_{\text{compiled}}$: the count of generated target programs that compile without errors.
    - $N_{\text{total}}$: the total count of source programs subjected to transcompilation.
- Computation Accuracy:
  - Conceptual Definition: The functional correctness of the transcompiled code. A program is deemed "computationally accurate" if it produces the correct output, verified by passing a predefined set of unit tests against a ground truth. This goes beyond mere compilation to ensure the translated program behaves as expected.
  - Importance: This is a crucial metric for transcompilers in critical applications such as deep learning, where even subtle numerical discrepancies can lead to incorrect model behavior. It guarantees semantic equivalence between the source and target programs.
  - Mathematical Formula:
    $$\text{Computation Accuracy} = \frac{N_{\text{correct}}}{N_{\text{compiled}}}$$
  - Symbol Explanation:
    - $N_{\text{correct}}$: the count of compiled target programs that produce the expected output when executed against the unit tests.
    - $N_{\text{compiled}}$: the total count of programs that compiled without errors (a subset of the programs counted for compilation accuracy).
- Execution Performance:
  - Conceptual Definition: How fast the transcompiled code runs on the target platform compared to vendor-provided, manually-optimized libraries (e.g., cuDNN, cuBLAS, oneDNN). It is typically expressed as a speedup factor or ratio.
  - Importance: Beyond correctness, performance is paramount in deep learning systems. A transcompiler must generate code that is not only correct but also efficient, ideally matching or exceeding highly optimized, hand-tuned libraries. This highlights the practicality and real-world applicability of the method.
  - Mathematical Formula:
    $$\text{Relative Performance} = \frac{T_{\text{library}}}{T_{\text{QiMeng-Xpiler}}}$$
  - Symbol Explanation:
    - $T_{\text{library}}$: the time taken by the corresponding vendor-provided, manually-optimized library to execute the operation.
    - $T_{\text{QiMeng-Xpiler}}$: the time taken by the program transcompiled by QiMeng-Xpiler to execute the same operation.
    - A ratio greater than 1.0 means QiMeng-Xpiler's program is faster; a ratio less than 1.0 means it is slower.

A small computation of these metrics over per-test-case records is sketched below.
5.4. Baselines
QiMeng-Xpiler is compared against both state-of-the-art LLM-based and rule-based transcompilation approaches:
- LLM-based Approaches:
  - GPT-4 Zero-Shot: GPT-4 without any in-context examples (i.e., a minimal prompt).
  - OpenAI o1 Zero-Shot: OpenAI o1 without any in-context examples.
  - GPT-4 Few-Shot: GPT-4 with a few relevant examples provided in the prompt to guide the translation.
  - OpenAI o1 Few-Shot: OpenAI o1 with a few relevant examples in the prompt.
  - Purpose: These baselines evaluate the raw capabilities of general-purpose LLMs for code translation, both with and without explicit examples.
- Rule-based Approaches:
  - PPCG [47]: A polyhedral-model-based auto-parallelization tool that translates C code to CUDA C.
  - HIPIFY [7]: A vendor-provided tool by AMD designed to migrate CUDA code to AMD HIP code.
  - Purpose: These baselines represent established, hand-crafted transcompilation methods. The paper notes that rule-based approaches are generally not available for other transcompilation directions because of the immense human effort they require.
- Search-based Program Synthesis Approaches:
  - Implicit Baseline: While not used as a direct comparison in the results tables, the paper discusses search-based program synthesis as a category of related work. It states that these approaches "struggle with the large search space problem" and "cannot perform transcompilation directly across different DLS," which is why they are not viable direct baselines for the broad transcompilation task that QiMeng-Xpiler addresses.
6. Results & Analysis
6.1. Core Results
QiMeng-Xpiler demonstrates superior performance across all evaluation metrics and transcompilation directions. It achieves near-perfect compilation accuracy and high computation accuracy, significantly outperforming LLM-based and rule-based baselines. Moreover, it translates programs that achieve competitive or even superior execution performance compared to manually optimized libraries, leading to substantial productivity gains.
6.2. Data Presentation (Tables)
The following table shows the results from Table 6:
| Source | Method | Compilation Accuracy | Computation Accuracy | ||||||
|---|---|---|---|---|---|---|---|---|---|
| CUDA C | BANG C | Hip | C with VNNI | CUDA C | BANG C | Hip | C with VNNI | ||
| CUDA C | GPT-4 Zero-Shot | 0 | 82.7 | 9.5 | 0 | 82.7 | 4.2 | ||
| OpenAI o1 Zero-Shot | 0 | 85.7 | 61.9 | 0 | 82.7 | 60.7 | |||
| GPT-4 Few-Shot | 50.6 | 97.0 | 84.5 | 7.7 | 96.4 | 30.4 | |||
| OpenAI o1 Few-Shot | 51.8 | 98.2 | 85.1 | 48.2 | 98.2 | 55.4 | |||
| QiMeng-Xpiler w/o SMT | 82.7 | 98.2 | 88.1 | 54.2 | 98.2 | 58.3 | |||
| QiMeng-Xpiler w/o SMT + Self-Debugging | 87.5 | 98.8 | 89.3 | 54.8 | 98.2 | 58.9 | |||
| QiMeng-Xpiler | 100 | 100 | 100 | 91.7 | 100 | 95.2 | |||
| BANG C | GPT-4 Zero-Shot | 24.4 | 26.8 | 0 | 0 | 0 | 0 | ||
| OpenAI o1 Zero-Shot | 27.4 | 97.0 | 9.5 | 0 | 0 | 4.2 | |||
| GPT-4 Few-Shot | 69.0 | 66.1 | 23.8 | 6.5 | 6.5 | 13.1 | |||
| OpenAI o1 Few-Shot | 71.4 | 97.0 | 41.7 | 10.1 | 7.7 | 23.2 | |||
| QiMeng-Xpiler w/o SMT | 85.1 | 84.5 | 47.6 | 77.4 | 78.6 | 41.1 | |||
| QiMeng-Xpiler w/o SMT + Self-Debugging | 88.1 | 88.7 | 50.6 | 77.4 | 78.6 | 41.1 | |||
| QiMeng-Xpiler | 100 | 100 | 100 | 95.8 | / | 97.0 | 95.2 | ||
| Hip | GPT-4 Zero-Shot | 97.0 | 0 | 23.8 | 97.0 | 0 | 5.4 | ||
| OpenAI o1 Zero-Shot | 98.2 | 0 | 45.8 | 98.2 | 0 | 4.2 | |||
| GPT-4 Few-Shot | 97.0 | 35.1 | 85.1 | 97.0 | 5.4 | 24.4 | |||
| OpenAI o1 Few-Shot | 98.8 | 42.3 | 88.7 | 98.2 | 9.0 | 30.4 | |||
| QiMeng-Xpiler w/o SMT | 98.2 | 60.7 | 65.5 | 97.6 | 52.4 | 57.1 | |||
| QiMeng-Xpiler w/o SMT + Self-Debugging | 98.8 | 62.5 | 66.1 | 98.2 | 52.4 | 57.1 | |||
| QiMeng-Xpiler | 100 | 100 | 100 | 100 | 86.9 | 96.4 | |||
| C with VNNI | GPT-4 Zero-Shot | 57.1 | 0 | 60.1 | 8.3 | 0 | 8.9 | ||
| OpenAI o1 Zero-Shot | 66.1 | 0 | 97.0 | 10.1 | 0 | 96.4 | |||
| GPT-4 Few-Shot | 81.5 | 41.7 | 74.4 | 14.3 | 6.0 | 12.5 | |||
| OpenAI o1 Few-Shot | 87.5 | 55.4 | 97.0 | 51.2 | 10.7 | 96.4 | |||
| QiMeng-Xpiler w/o SMT | 95.8 | 78.0 | 87.5 | 83.9 | 58.3 | 85.7 | |||
| QiMeng-Xpiler w/o SMT + Self-Debugging | 97.0 | 84.5 | 89.3 | 83.9 | 58.3 | 85.7 | |||
| QiMeng-Xpiler | 100 | 99.4 | 100 | 98.2 | 88.7 | 99.4 | |||
Analysis of Table 6 (Accuracy vs. LLM-based Methods):
- Superiority of QiMeng-Xpiler: QiMeng-Xpiler consistently achieves the highest compilation and computation accuracy across all transcompilation directions. It reaches 100% compilation accuracy in most scenarios, and computation accuracy ranges from 86.9% to 100%. This is a significant advantage for practical transcompilers, where high correctness is critical.
- Limitations of LLM-only Baselines:
  - Zero-Shot: GPT-4 and OpenAI o1 in Zero-Shot mode show extremely poor performance, especially for challenging targets such as BANG C (0% accuracy). This indicates that LLMs lack inherent domain-specific knowledge of DLS.
  - Few-Shot: Providing few-shot examples improves LLM performance significantly, but still falls short of QiMeng-Xpiler. For instance, OpenAI o1 Few-Shot translating CUDA C to BANG C achieves 51.8% compilation and 48.2% computation accuracy, far from 100%. Even for easier tasks such as CUDA C to HIP, where compilation accuracy is high, the 98.2% computation accuracy still indicates residual functional errors.
- Impact of the SMT Solver (Ablation Study):
  - The rows QiMeng-Xpiler w/o SMT and QiMeng-Xpiler w/o SMT + Self-Debugging clearly demonstrate the necessity of the SMT solver. Without it, even with the proposed LLM-assisted passes and self-debugging techniques, computation accuracy drops significantly (e.g., HIP to BANG C goes from 86.9% to 52.4%).
  - This confirms that LLMs can generate good sketches but struggle with low-level correctness, which the SMT solver effectively fixes. The small gain from Self-Debugging on top of QiMeng-Xpiler w/o SMT (e.g., CUDA C to BANG C computation accuracy rises only from 54.2% to 54.8%) further underscores that traditional LLM-based debugging is insufficient for the semantic precision required.

The following table shows the results from Table 7:

| Direction | Method | Compilation (%) | Computation (%) |
|---|---|---|---|
| CUDA C → HIP | Hipify | 85.7 | 85.7 |
| CUDA C → HIP | QiMeng-Xpiler | 100 | 100 |
| C → CUDA C | PPCG | 47.6 | 47.6 |
| C → CUDA C | QiMeng-Xpiler | 100 | 98.2 |
-
Analysis of Table 7 (Accuracy vs. Rule-based Methods):
- Superiority over Rule-based Methods: QiMeng-Xpiler outperforms established rule-based methods such as Hipify and PPCG.
  - For CUDA C to HIP, QiMeng-Xpiler achieves perfect 100% compilation and computation accuracy, whereas Hipify manages 85.7%. This highlights QiMeng-Xpiler's robustness even for a vendor-supported transcompilation task.
  - For C to CUDA C, QiMeng-Xpiler reaches 100% compilation and 98.2% computation accuracy, roughly 50 percentage points higher than PPCG's 47.6% on both metrics. This demonstrates QiMeng-Xpiler's ability to handle auto-parallelization tasks more effectively.
- Flexibility: The paper notes that rule-based methods are often limited to specific pairs of languages, whereas QiMeng-Xpiler works across all four DLS with minimal adaptation cost, a key advantage.
6.3. Execution Performance
The performance evaluation (Figure 7) compares QiMeng-Xpiler's transcompiled programs against vendor-provided, manually-optimized libraries (e.g., cuDNN/cuBLAS, CNNL, rocBLAS, oneDNN).
Figure 7: Performance of QiMeng-Xpiler-transcompiled tensor programs (and the number of corrected cases) across platforms, covering C with VNNI to CUDA C, CUDA C to BANG C, CUDA C to HIP, and CUDA C to C with VNNI.
Figure 7 presents the performance of QiMeng-Xpiler's translated programs across various transcompilation directions and operators, relative to manually optimized libraries. The Y-axis represents the performance relative to optimized libraries, while the X-axis lists operators.
Analysis of Figure 7:
- QiMeng-Xpiler achieves an average relative performance of 0.78x and, in some cases, up to 2.00x over the manually optimized counterparts. That is, on average its programs run at 78% of the speed of highly optimized libraries, which is highly competitive, and in specific instances they are up to twice as fast.
- This demonstrates that the hierarchical auto-tuning approach is effective in generating not just correct but also high-quality, performant code that can rival or even surpass human expert optimizations.
- The results are shown across four common transcompilation directions: C with VNNI → CUDA C, CUDA C → BANG C, CUDA C → HIP, and CUDA C → C with VNNI. This confirms the general applicability of QiMeng-Xpiler in generating performant code for diverse DLS.
6.4. Case Study: Transcompilation from CUDA C to BANG C
This transcompilation direction is highlighted as particularly challenging due to:
- Programming Model Discrepancy: Translating from SIMT (CUDA C) to SIMD (BANG C) involves fundamental differences in parallel execution paradigms.
- Uncommon Target Language: BANG C is an uncommon language with significantly less training data available to LLMs, making it harder for LLM-only approaches.
Analysis:
- The results in Table 6 show that for CUDA C to BANG C:
  - OpenAI o1 Zero-Shot yields 0% accuracy, a complete failure on this challenging task.
  - Few-shot prompting helps (OpenAI o1 Few-Shot reaches 51.8% compilation and 48.2% computation accuracy), but LLMs still struggle significantly with correctness.
  - QiMeng-Xpiler w/o SMT achieves much higher accuracy (82.7% compilation, 54.2% computation) by incorporating document retrieval for domain knowledge and decomposition into passes. This highlights the strength of the neural-symbolic framework even without the SMT component.
  - Crucially, QiMeng-Xpiler (with SMT) achieves 100% compilation and 91.7% computation accuracy. This significant leap confirms that small-scale symbolic synthesis is essential for ensuring functional equivalence and reaching high accuracy in challenging transcompilation scenarios.
6.5. Compilation Time
Figure 8 illustrates the breakdown of QiMeng-Xpiler's compilation time for 6 typical operators when transcompiling from CUDA C to BANG C.
Figure 8: A breakdown of QiMeng-Xpiler's compilation time across stages (LLM, unit test, SMT, auto-tuning, and evaluation), showing the time distribution and average cost for different deep learning operators.
Figure 8 breaks down the compilation time for six typical operators (GEMM, Conv2D, ReLU, Add, MaxPool, LayerNorm) when translating from CUDA C to BANG C. The components are LLM, Unit Test, SMT, Auto-tuning, and Evaluation.
Analysis of Figure 8:
- The total compilation time for these operators ranges from 1.2 to 7.8 hours, with an average of 3.7 hours. This indicates that, while automated, the process can still be lengthy due to the complexity of auto-tuning and SMT repair.
- SMT Component: The SMT component's time consumption is variable; it is only triggered when the LLM fails to produce a correct translation. For simpler programs, or when the LLM is highly accurate, its share is smaller. This aligns with the principle of neural-symbolic synthesis, in which the LLM handles the bulk of the work and SMT acts as a targeted repair mechanism.
- Auto-tuning Component: For operators such as GEMM and Conv2D, which inherently have larger search spaces for tiling sizes, loop orders, etc., the proportion of time spent on auto-tuning (exploring parameters and pass sequences) increases significantly. This is the trade-off for achieving high performance.
- Special Intrinsics: Compilation time tends to increase with the number of special intrinsics used in the program, likely due to increased complexity in program annotation, LLM transformation, and potentially SMT repair or auto-tuning for these hardware-specific instructions.
6.6. Productivity Improvement
The paper evaluates the productivity improvement for Deformable Attention (a complex LLM operator with ~200 lines of code), comparing manual implementation against QiMeng-Xpiler's transcompilation. This is assessed for two DLS: NVIDIA GPU (C with VNNI CUDA C) and MLU (CUDA C BANG C), involving junior and senior coders.
The following table shows the results from Table 8:
Table 8: Productivity comparison for Deformable Attention (~200 LoC). In the "w/ QiMeng-Xpiler" rows, the cost is reported as transcompilation time plus manual debugging time.

| Coder | Method | CUDA C → BANG C: Costs | CUDA C → BANG C: Performance | C with VNNI → CUDA C: Costs | C with VNNI → CUDA C: Performance |
|---|---|---|---|---|---|
| Senior Coder | Manual | ~6 d | 100% | ~1 d | 100% |
| Senior Coder | w/ QiMeng-Xpiler (time saving) | 4.5 + 0.5 h (~28.8×) | 69.20% | 2.1 h (~11.4×) | 132.50% |
| Junior Coder | Manual | ~30 d | 49.85% | ~3 d | 75.76% |
| Junior Coder | w/ QiMeng-Xpiler (time saving) | 4.5 + 3 h (~96.0×) | 65.17% | 2.1 h (~34.3×) | 132.50% |
Analysis of Table 8:
- Significant Productivity Gains: QiMeng-Xpiler dramatically improves programming productivity.
  - For MLU (CUDA C → BANG C), productivity improves by up to 96.0x for junior coders and 28.8x for senior coders.
  - For GPU (C with VNNI → CUDA C), productivity improves by up to 34.3x for junior coders and 11.4x for senior coders.
- Reduced Manual Effort:
  - Manual costs: A senior coder takes ~6 days for MLU and ~1 day for GPU to manually implement Deformable Attention; a junior coder takes ~30 days for MLU and ~3 days for GPU. This highlights the immense human effort and skill required.
  - QiMeng-Xpiler costs: The transcompilation time for Deformable Attention is about 4.5 hours for CUDA C → BANG C and 2.1 hours for C with VNNI → CUDA C, a massive reduction.
- Performance with QiMeng-Xpiler:
  - For MLU (CUDA C → BANG C), the generated code achieves 69.20% of the performance of a senior coder's manual implementation and 65.17% of a junior coder's, which is highly competitive given the huge time savings.
  - For GPU (C with VNNI → CUDA C), QiMeng-Xpiler's code actually outperforms the manual implementations by 32.50% (132.50% of manual performance), showing its effectiveness in optimization.
- Debugging: Even when QiMeng-Xpiler fails to generate a fully correct program on the first attempt (as for MLU), the required manual debugging time is small (0.5 hours for a senior coder, 3 hours for a junior coder) compared with the days or weeks of initial manual coding.
6.7. Failure Case
The paper identifies a failure case in transcompiling Deformable Attention from CUDA C to BANG C, specifically shown in Figure 9.

Figure 9 illustrates a complex control flow within the Deformable Attention operator, featuring multiple nested loops and conditional statements. This complexity makes it difficult for current LLMs to map to SIMD intrinsics and for SMT solvers to extract constraints.
Analysis of Figure 9:
- The Deformable Attention operator, with its complex control flow, multiple nested loops, and conditional statements, presents a significant challenge.
- Limitations of Neural Component: The LLM component of QiMeng-Xpiler struggles to correctly handle such intricate logical structures and conditional branches when mapping them to the required SIMD intrinsics for BANG C. This is likely due to the inherent difficulty LLMs have in reasoning deeply about complex program logic while preserving semantics across vastly different parallel models.
- Limitations of Symbolic Component: The symbolic component (the SMT solver) also faces difficulties here. Extracting comprehensive constraints from such complex control flow, and then repairing the code within a "limited scale", becomes problematic: the search space may still grow too large, or the logical relationships become too convoluted, for efficient solving (a small illustration of this style of SMT-based repair follows this list).
- Future Outlook: The paper suggests that more powerful LLMs (with better logical reasoning) and advanced SMT analysis tools could address these limitations, further improving QiMeng-Xpiler's ability to handle such highly complex cases. This indicates an area for future research and development.
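To give a concrete feel for what "limited-scale" symbolic repair means, the sketch below (a toy illustration under assumed names and array extents, not the paper's actual pipeline) uses the z3 SMT solver to fill a single unknown stride so that a candidate index expression matches a reference over all bounded loop iterations.

```python
# Toy illustration of limited-scale SMT-based repair with z3 (pip install z3-solver).
# The extents H, W, the variable names, and the "hole" s are assumptions for the demo.
from z3 import Int, Ints, Solver, ForAll, Implies, And, sat

H, W = 4, 8                      # small, bounded problem ("limited scale")
i, j = Ints("i j")               # loop induction variables
s = Int("s")                     # the hole the repair step must fill

reference = i * W + j            # indexing used by the source (reference) program
candidate = i * s + j            # LLM-produced sketch with an unknown stride

solver = Solver()
solver.add(
    ForAll(
        [i, j],
        Implies(And(0 <= i, i < H, 0 <= j, j < W), reference == candidate),
    )
)

if solver.check() == sat:
    print("repaired stride:", solver.model()[s])   # expected: 8 (== W)
else:
    print("no repair found within this search space")
```

The failure case above is essentially the situation where this style of query stops scaling: deeply nested, data-dependent control flow means far more holes, path conditions, and quantified constraints than a single stride over a small rectangular domain.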
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduces QiMeng-Xpiler, a pioneering neural-symbolic synthesis approach for transcompiling tensor programs across heterogeneous deep learning systems (DLS). By strategically combining the code generation prowess of LLMs with the correctness guarantees of small-scale symbolic program synthesis, QiMeng-Xpiler addresses the long-standing challenge of "Write Once, Run Anywhere" for tensor programs. The system's unique architecture, involving LLM-assisted transformation passes and a hierarchical auto-tuning mechanism, not only ensures high functional accuracy (averaging 95%) but also achieves remarkable performance (up to 2.0x over manually optimized libraries) and unprecedented programming productivity gains (up to 96.0x). The extensive evaluation across four diverse DLS underscores its effectiveness and broad applicability.
7.2. Limitations & Future Work
The authors acknowledge a primary limitation:
- Complex Control Flow: QiMeng-Xpiler struggles with tensor programs exhibiting highly complex control flow, such as the Deformable Attention example with multiple nested loops and conditional statements. In these cases, the LLM component fails to generate accurate mappings to SIMD intrinsics, and the SMT solver has difficulty extracting and solving the necessary constraints within its "limited scale."
Future Work:
- The authors propose that more powerful LLMs (i.e., models with stronger logical reasoning) and advanced SMT analysis tools are promising avenues for overcoming this limitation and improving QiMeng-Xpiler's handling of such complex cases. This points to continued improvement of both the neural and symbolic components.
7.3. Personal Insights & Critique
7.3.1. Novelty and Significance
The primary novelty of QiMeng-Xpiler lies in its rigorous integration of neural and symbolic methods for the specific domain of tensor program transcompilation across heterogeneous DLS. While neural-symbolic approaches exist in other areas, their application to this highly specialized and performance-critical domain, addressing challenges like diverse programming models (SIMT vs. SIMD), memory hierarchies, and specialized intrinsics, is a significant contribution. The problem of transcompiling low-level, performance-sensitive code is much harder than high-level language translation, where LLMs have seen success. The explicit use of SMT to repair LLM imperfections, making neural-symbolic computationally tractable, is a clever design choice.
The paper's significance is profound. By providing a truly automatic and high-quality transcompiler for DLS, QiMeng-Xpiler has the potential to:
- Democratize DLS Development: Lower the barrier to entry for developers to utilize specialized hardware, reducing the need for deep expertise in every DLS's idiosyncratic programming model.
- Accelerate Innovation: Enable faster deployment of deep learning models on new or specialized hardware, as porting code becomes less of a bottleneck.
- Improve Hardware Utilization: Maximize the efficiency of heterogeneous DLS in data centers by generating highly optimized code.
7.3.2. Strengths
- Hybrid Approach Effectiveness: The core strength is the effective synergy between LLMs and SMT. LLMs provide the flexibility and breadth for initial code generation, while SMT provides the precision and correctness guarantees essential for low-level systems programming.
- High Accuracy and Performance: Achieving 95% average accuracy and up to 2.0x performance over vendor-optimized libraries is exceptional. This level of quality makes the tool practically viable, unlike many LLM-only solutions.
- Hierarchical Auto-tuning: The intra-pass and inter-pass auto-tuning is crucial for performance. Simply translating code correctly is not enough; it must also be fast. This mechanism systematically explores the optimizations that human experts would otherwise perform by hand.
- Comprehensive Evaluation: The experiments on four distinct DLS and 21 deep learning operators, each with 8 shapes (168 test cases), are extensive and demonstrate the robustness and generality of QiMeng-Xpiler.
DLSand 21deep learning operatorswith 8shapes(168 test cases) are extensive and demonstrate the robustness and generality ofQiMeng-Xpiler. - Productivity Improvement: The reported productivity gains (up to 96x) are staggering and directly address a major pain point in the industry.
7.3.3. Weaknesses and Potential Improvements
- Scalability of SMT for Complexity: While the paper states SMT is used for "limited scale" repair, the identified failure case (complex control flow in Deformable Attention) suggests that "limited scale" may still be a bottleneck for truly intricate code. Future work could explore more advanced SMT-solver techniques, or hybrid approaches that simplify complex logical expressions before feeding them to SMT.
- Compilation Time: An average compilation time of 3.7 hours (and up to 7.8 hours) is significant. While a massive improvement over manual coding, it is not "on-the-fly" transcompilation; for iterative development, this could still be a factor. Optimizing the auto-tuning and SMT phases, perhaps through more aggressive LLM-guided pruning or by learning from past tuning results, could be beneficial.
- Dependency on LLMs: The quality of the initial LLM-generated sketch is still critical. If the LLM consistently produces very poor sketches, the SMT component may be overwhelmed or the auto-tuning process may take too long. Continuous improvement in LLM code generation, especially for specialized domains, will directly benefit QiMeng-Xpiler.
- Maintenance of Meta-prompts and Manuals: The system relies on meta-prompts and access to programming manuals. As DLS evolve, these inputs will need continuous updates, which incurs maintenance cost, albeit far less than manual transcompilation.
- Dynamic Program Behavior: The current approach likely focuses on static analysis and optimization. Programs with highly dynamic behavior or runtime-dependent optimizations might pose new challenges.
7.3.4. Open Questions and Future Impact
- Generality to New DLS: How easily can QiMeng-Xpiler adapt to entirely new DLS beyond the four evaluated ones? The neural-symbolic framework seems robust, but the quality of the programming manuals and the LLM's ability to generalize to new intrinsics would be key.
- Integration with Higher-Level Frameworks: Can QiMeng-Xpiler be integrated directly into deep learning frameworks (e.g., PyTorch, TensorFlow) to automatically transcompile intermediate representations or higher-level graph definitions into optimized DLS code? This would further enhance productivity.
- Formal Verification of LLM-generated Sketches: While SMT verifies snippets, could there be methods to formally verify properties of the LLM-generated sketches themselves, reducing the burden on SMT or unit testing?
- User Interface and Debugging Experience: While QiMeng-Xpiler is automatic, what is the user experience for debugging (as mentioned in the productivity section) when it does not achieve 100% correctness? How intuitive are the error messages or repair suggestions?

Overall, QiMeng-Xpiler presents a compelling solution to a critical problem in deep learning hardware utilization. Its neural-symbolic paradigm, combined with auto-tuning, marks a significant step towards truly automated and efficient programming for diverse DLS.