
QiMeng-Xpiler: Transcompiling Tensor Programs for Deep Learning Systems with a Neural-Symbolic Approach


TL;DR Summary

QiMeng-Xpiler integrates large language models with symbolic synthesis for automatic tensor program transcompilation across deep learning systems, reducing manual effort while ensuring correctness and high performance.

Abstract

Heterogeneous deep learning systems (DLS) such as GPUs and ASICs have been widely deployed in industrial data centers, which requires to develop multiple low-level tensor programs for different platforms. An attractive solution to relieve the programming burden is to transcompile the legacy code of one platform to others. However, current transcompilation techniques struggle with either tremendous manual efforts or functional incorrectness, rendering "Write Once, Run Anywhere" of tensor programs an open question. We propose a novel transcompiler, i.e., QiMeng-Xpiler, for automatically translating tensor programs across DLS via both large language models (LLMs) and symbolic program synthesis, i.e., neural-symbolic synthesis. The key insight is leveraging the powerful code generation ability of LLM to make costly search-based symbolic synthesis computationally tractable. Concretely, we propose multiple LLM-assisted compilation passes via pre-defined meta-prompts for program transformation. During each program transformation, efficient symbolic program synthesis is employed to repair incorrect code snippets with a limited scale. To attain high performance, we propose a hierarchical auto-tuning approach to systematically explore both the parameters and sequences of transformation passes. Experiments on 4 DLS with distinct programming interfaces, i.e., Intel DL Boost with VNNI, NVIDIA GPU with CUDA, AMD MI with HIP, and Cambricon MLU with BANG, demonstrate that QiMeng-Xpiler correctly translates different tensor programs at the accuracy of 95% on average, and the performance of translated programs achieves up to 2.0x over vendor-provided manually-optimized libraries. As a result, the programming productivity of DLS is improved by up to 96.0x via transcompiling legacy tensor programs.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

QiMeng-Xpiler: Transcompiling Tensor Programs for Deep Learning Systems with a Neural-Symbolic Approach

1.2. Authors

Shouyang Dong, Yuanbo Wen, Jun Bi, Di Huang, Jiaming Guo, Jianxing Xu, Ruibai Xu, Xinkai Song, Yifan Hao, Ling Li, Xuehai Zhou, Tianshi Chen, Qi Guo, and Yunji Chen.

Affiliations:

  • University of Science and Technology of China
  • Cambricon Technologies
  • SKL of Processors, Institute of Computing Technology, Chinese Academy of Sciences
  • Institute of Software, Chinese Academy of Sciences

1.3. Journal/Conference

The paper was published on arXiv, which is an open-access archive for scholarly articles, primarily in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. It serves as a preprint server, allowing researchers to share their work before formal peer review and publication in a traditional journal or conference. While not a peer-reviewed venue itself, papers on arXiv are widely read and often later appear in prestigious conferences or journals.

1.4. Publication Year

2025 (published on arXiv on May 4, 2025)

1.5. Abstract

The abstract outlines the problem of developing multiple low-level tensor programs for diverse heterogeneous deep learning systems (DLS) like GPUs and ASICs, leading to a significant programming burden. Existing transcompilation techniques (converting code from one language/platform to another) are deemed insufficient due to either requiring extensive manual effort or leading to functional incorrectness, thus failing to achieve the "Write Once, Run Anywhere" ideal for tensor programs.

The paper introduces QiMeng-Xpiler, a novel transcompiler that automates tensor program translation across DLS. It employs a neural-symbolic synthesis approach, combining large language models (LLMs) for powerful code generation with symbolic program synthesis for correctness. The core idea is to leverage LLMs to make the usually costly search-based symbolic synthesis computationally tractable. This is achieved through multiple LLM-assisted compilation passes, each guided by meta-prompts for program transformation. During each pass, efficient symbolic program synthesis is used to repair small, incorrect code snippets. To ensure high performance, a hierarchical auto-tuning approach systematically explores both the parameters of transformation passes and their optimal sequences.

Experiments across four distinct DLS—Intel DL Boost (VNNI), NVIDIA GPU (CUDA), AMD MI (HIP), and Cambricon MLU (BANG)—demonstrate that QiMeng-Xpiler correctly translates tensor programs with an average accuracy of 95%. Translated programs achieve up to 2.0x performance improvement over vendor-provided, manually-optimized libraries. This results in a programming productivity improvement of up to 96.0x by effectively transcompiling legacy tensor programs.

https://arxiv.org/abs/2505.02146

https://arxiv.org/pdf/2505.02146v1.pdf

2. Executive Summary

2.1. Background & Motivation (Why)

The proliferation of diverse heterogeneous deep learning systems (DLS), such as NVIDIA GPUs, Google TPUs, GraphCore IPUs, Intel DL Boost, AMD MI, and Cambricon MLUs, in industrial data centers presents a significant challenge: each platform often requires its own set of high-performance, low-level tensor programs (implementations of tensor operations like matrix multiplication or convolution). Writing and optimizing these programs for every unique architecture is notoriously difficult and labor-intensive due to their complex architectures, unique programming models (e.g., SIMT vs. SIMD), intricate memory hierarchies, and specialized instruction sets (intrinsics). This creates a substantial programming burden, hindering the full exploitation of these powerful systems.

The ideal solution is "Write Once, Run Anywhere" for tensor programs, which transcompilation (source-to-source code translation) aims to achieve. However, existing transcompilation techniques fall short:

  1. Rule-based approaches: Require extensive manual effort to define transformation rules between vastly different architectures, making them unscalable.

  2. Symbolic synthesis approaches: Suffer from large search spaces, making them computationally expensive and difficult to apply to general-purpose programs, especially when dealing with diverse parallel semantics. They also demand considerable manual effort to specify input constraints.

  3. Data-driven approaches (e.g., LLMs): While promising, they struggle with preserving tensor program semantics, leading to low translation accuracy (e.g., 29.6% reported in prior work), necessitating significant manual correction.

    These limitations mean that current methods either demand tremendous manual efforts, have limited scalability, or result in functional incorrectness, leaving the "Write Once, Run Anywhere" problem largely unsolved for tensor programs.

The paper's novel approach, QiMeng-Xpiler, addresses these gaps by combining the flexibility of data-driven (LLM) approaches with the soundness of symbolic synthesis. The core innovation is a neural-symbolic method that leverages the code generation capabilities of LLMs to guide and prune the search space for symbolic synthesis, making it computationally tractable, and uses symbolic synthesis to ensure the correctness that LLMs alone cannot provide.

2.2. Main Contributions / Findings (What)

The paper makes the following key contributions:

  1. Neural-symbolic program synthesis: It proposes a novel neural-symbolic approach that uses LLMs to generate high-level program sketches (meta-prompts) and then employs SMT-based symbolic synthesis to repair incorrect low-level details. This achieves fully automatic program translation with a strong guarantee of functional correctness. This is the first work to automatically translate tensor programs across DLS with different programming models using such a hybrid approach.

  2. Hierarchical performance auto-tuning: It introduces a hierarchical auto-tuning strategy to maximize the performance of translated programs. This involves intra-pass auto-tuning for exploring parameters (e.g., tiling sizes) within a transformation pass using brute-force search, and inter-pass auto-tuning using Monte Carlo Tree Search (MCTS) to find the optimal sequence of transformation passes.

  3. Extensive evaluation: The proposed QiMeng-Xpiler is rigorously evaluated on four distinct DLS (Intel DL Boost, NVIDIA GPU, AMD MI, and Cambricon MLU).

    The key findings from the experiments are:

  • High Accuracy: QiMeng-Xpiler correctly translates different tensor programs with an average accuracy of 95% (ranging from 86.9% to 100% depending on the translation direction and target). It significantly outperforms state-of-the-art LLM-based and rule-based methods in terms of both compilation and computation accuracy, often achieving 100% compilation accuracy and near-100% computation accuracy where other methods fall short.
  • Improved Performance: The translated programs achieve up to 2.0x performance improvement (and an average of 0.78x) over vendor-provided, manually-optimized libraries like cuDNN/cuBLAS and oneDNN.
  • Enhanced Productivity: The programming productivity for DLS is improved by up to 96.0x (e.g., for Cambricon MLU) and 34.3x (for NVIDIA GPU) by automating the transcompilation of legacy tensor programs.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a reader should be familiar with the following concepts:

  • Deep Learning Systems (DLS): Specialized hardware architectures designed to accelerate deep learning workloads. Examples include:
    • GPUs (Graphics Processing Units): Originally for graphics rendering, now widely used for general-purpose parallel computing (GPGPU) in deep learning. NVIDIA GPUs with Tensor Cores are a prime example.
    • ASICs (Application-Specific Integrated Circuits): Custom-designed chips optimized for specific tasks, offering higher efficiency for deep learning. Examples include Google TPU (Tensor Processing Unit) and Cambricon MLU (Machine Learning Unit).
  • Tensor Programs/Operators: Low-level implementations of mathematical operations involving tensors (multi-dimensional arrays), which are the fundamental data structures in deep learning. Examples include matrix multiplication (MatMul), convolution, activation functions (e.g., ReLU), and pooling.
  • Heterogeneous Systems: Computing environments composed of different types of processors or specialized hardware, each with unique capabilities and programming models.
  • Transcompilation (Source-to-Source Compilation): The process of automatically translating source code written in one programming language (or for one platform's programming model) into another language (or for another platform's model), while preserving its functionality.
  • Programming Models: The conceptual framework that defines how a programmer specifies computation and data management for a particular hardware architecture.
    • SIMT (Single Instruction, Multiple Threads): Used by NVIDIA CUDA. Multiple threads execute the same instruction on different data simultaneously. Threads are grouped into warps (or wavefronts on AMD).
    • SIMD (Single Instruction, Multiple Data): A type of parallel processing where a single instruction operates on multiple data points simultaneously. Often used with specialized vector units or instruction sets like VNNI or Cambricon BANG.
    • DSA (Domain-Specific Architecture): Hardware tailored for specific applications or domains, like deep learning, offering high performance and efficiency for those tasks. Cambricon MLU is a DSA.
  • Parallelism:
    • Task-level parallelism: Different independent tasks run concurrently.
    • Multi-core parallelism: Tasks distributed across multiple processing cores.
    • Thread-level parallelism: Execution of different threads concurrently (e.g., CUDA blockIdx, threadIdx).
  • Memory Hierarchy: The multi-layered organization of memory in a computer system, from fast, small caches (registers, shared memory) to slower, larger main memory (global memory), designed to optimize data access. Programmers often need to explicitly manage data movement within this hierarchy for performance.
  • Specialized Intrinsics: Low-level functions or instructions provided by hardware vendors that directly access specialized hardware capabilities (e.g., Tensor Cores, Matrix Cores, VNNI instructions) for highly optimized computation. These are often platform-specific.
  • Large Language Models (LLMs): Advanced artificial intelligence models (like GPT-4, OpenAI o1, StarCoder) trained on vast amounts of text and code data. They can understand natural language prompts and generate human-like text, including source code.
  • Symbolic Program Synthesis: A technique that automatically generates programs from a high-level specification (e.g., input/output examples, logical constraints). It often involves searching a space of possible programs.
  • Satisfiability Modulo Theories (SMT) Solver: A tool used in symbolic program synthesis to determine if a logical formula, potentially involving variables from different theories (e.g., arithmetic, arrays), has a solution. It's used to verify correctness or find program components that satisfy given constraints.
  • Abstract Syntax Tree (AST): A tree representation of the abstract syntactic structure of source code. Each node in the tree denotes a construct occurring in the source code. Used for program analysis and transformation.
  • Meta-prompts: High-level, parameterized prompts or templates used with LLMs that can be adapted with specific context (e.g., program annotations) to guide code generation for different tasks or scenarios.
  • Unit Tests: Small, isolated tests that verify the correctness of individual components or functions of a program. Essential for validating translated code.
  • Monte Carlo Tree Search (MCTS): A heuristic search algorithm used in decision processes, particularly in games. It explores a search space by building a search tree, using random simulations (Monte Carlo) to estimate the value of nodes and guide exploration towards promising branches. In this context, it's used to find optimal sequences of transformation passes.
  • Brute-force search: An algorithm that systematically checks all possible solutions to a problem until the correct one is found. Computationally expensive for large search spaces but guarantees optimality.

3.2. Previous Works

The paper categorizes existing transcompilation efforts into three main groups:

  1. Rule-based Approaches:

    • Concept: Experts define manual transformation rules, often applied to an Abstract Syntax Tree (AST), to convert code between languages/platforms. Pattern matching is used to apply these rules.
    • Examples: FCUDA (CUDA to FPGA translator) [37], C2Rust [13], CxGo [5], HIPIFY (CUDA to HIP) [7], PPCG (polyhedral auto-parallelization from C to CUDA) [47].
    • Limitations: Highly labor-intensive, unscalable due to the need for extensive manual rules, struggle with architectural discrepancies and irregular code structures, and lack general applicability across many DLS.
  2. Symbolic Synthesis Approaches:

    • Concept: Generate semantic-preserving target code from specifications (e.g., domain-specific languages (DSLs), input/output examples). Often rely on search-based SMT solvers.
    • Examples: C2TACO (translates C tensor code to TACO DSL using a guided enumerative synthesizer) [33], MetaLift [18], Tenspiler [39] (these convert source programs into a unified intermediate representation (IR) and then synthesize target programs in the same IR domain).
    • Limitations: Struggle with large search spaces, making them unscalable for large, general-purpose programs or commodity DLS. Cannot efficiently handle diverse parallel semantics (e.g., SIMT vs. SIMD) or complex memory hierarchies. Requires manual definition of target language semantics using specification IR, which is challenging and error-prone.
  3. Data-driven Approaches (LLM-assisted):

    • Concept: Train neural networks (typically LLMs) on large code corpora to generate code or translate between languages.
    • Examples: TransCoder [40] (sequence-to-sequence models for C++, Java, Python translation), StarCoder [31], GPT-4 [15, 20, 36], OpenAI-o1 [8], Codex [21], CodeGen [35], CodeT5 [48], CodeGeeX [52], LLaMA [44], Gemini [43], AlphaCode [32].
    • Limitations: Lack a deep understanding of program semantics, leading to functional incorrectness or incompleteness. Perform poorly in specific domains like DLS tensor programs due to scarce high-quality training data. Translation accuracy for DLS is often low (e.g., 29.6% mentioned in prior work), requiring significant manual correction and making them unsuitable for high-accuracy transcompilers.

3.3. Technological Evolution

The evolution of transcompilation reflects the increasing complexity and heterogeneity of computing systems. Early efforts focused on rule-based transformations, which were feasible for relatively simpler or more aligned source and target languages/platforms. As hardware diversified, especially with the advent of DLS, the manual burden of rule-based methods became prohibitive. Symbolic synthesis emerged to automate correctness guarantees by leveraging formal methods, but its inherent computational cost limited its applicability to smaller, more constrained domains. More recently, the rise of powerful LLMs has opened new avenues for code generation and translation, offering unprecedented flexibility and automation. However, their data-driven nature struggles with the precise semantic preservation and domain-specific correctness required for low-level tensor programs on highly specialized DLS.

QiMeng-Xpiler represents a significant step in this evolution by attempting to bridge the gap between the flexibility of LLMs and the soundness of symbolic synthesis. It acknowledges the strengths and weaknesses of both, combining them into a neural-symbolic framework to tackle the unique challenges of DLS transcompilation.

3.4. Differentiation

QiMeng-Xpiler differentiates itself from existing approaches by:

  • Hybrid Neural-Symbolic Approach: Unlike purely rule-based, symbolic, or data-driven methods, QiMeng-Xpiler intelligently combines LLMs for flexible high-level program sketching and SMT-based symbolic synthesis for precise low-level detail repair and correctness guarantees. This avoids the manual effort of rule-based systems, the scalability issues of pure symbolic synthesis, and the accuracy/correctness problems of pure LLM-based systems.
  • Tackling DLS Heterogeneity: It specifically targets the complex challenges of transcompiling tensor programs across diverse DLS with distinct programming models (SIMT, SIMD), memory hierarchies, and specialized intrinsics. Previous methods generally failed in this domain due to these complexities.
  • Decomposed Transformation Passes: Instead of a single LLM prompt for the entire translation, QiMeng-Xpiler breaks the process into a series of LLM-assisted transformation passes. This modularity significantly improves accuracy and tractability for symbolic synthesis.
  • Hierarchical Performance Auto-Tuning: It uniquely integrates hierarchical auto-tuning (brute-force for intra-pass parameters, MCTS for inter-pass sequences) to not only ensure functional correctness but also maximize the execution performance of the translated code, often outperforming manually optimized libraries. This is a critical feature often missing in code generation tools.
  • Correctness Guarantee: By incorporating SMT-based repair at a small scale within each transformation pass, QiMeng-Xpiler ensures functional equivalence, a crucial aspect for production-grade transcompilers that LLM-only solutions cannot reliably provide.

4. Methodology

4.1. Principles

The core idea behind QiMeng-Xpiler is to combine the strengths of large language models (LLMs) and symbolic program synthesis in a synergistic way, forming a neural-symbolic synthesis approach. The underlying principles are:

  1. Leverage LLM for Tractability: Utilize the powerful code generation capabilities of LLMs to produce initial program sketches and guide transformation. This makes the typically costly search-based symbolic synthesis computationally tractable by significantly pruning its search space and focusing it on smaller, more manageable problem instances.
  2. Ensure Correctness with Symbolic Synthesis: Employ symbolic program synthesis to formally verify and repair small, incorrect code snippets generated by the LLM. This guarantees semantic equivalence and functional correctness, which LLMs alone often struggle to provide for low-level tensor programs.
  3. Decomposed Transformation: Break down the complex transcompilation process into a series of modular, LLM-assisted transformation passes. Each pass addresses specific characteristics of DLS (parallelism, memory hierarchy, specialized instructions), improving the accuracy and manageability of the overall translation.
  4. Performance Optimization through Auto-Tuning: Systematically explore the vast search space of transformation parameters and sequences using a hierarchical auto-tuning approach to achieve high-performance translated programs. This ensures that the generated code is not just functionally correct but also optimally tuned for the target DLS.

4.2. Steps & Procedures

QiMeng-Xpiler operates in two main parts: Neural-Symbolic Program Synthesis and Hierarchical Performance Auto-Tuning.

The overall system architecture is presented in Figure 3.

Figure 3: QiMeng-Xpiler's overall workflow, shown as two sub-diagrams. (a) Neural-symbolic program synthesis, covering LLM-based program annotation and transformation and SMT-based code repair. (b) Hierarchical performance auto-tuning of the transformation passes, covering both inter-pass and intra-pass tuning strategies.

The system fundamentally separates Neural-Symbolic Program Synthesis (left) from Hierarchical Performance Auto-Tuning (right) to first ensure correctness and then optimize performance.

4.2.1. Neural-Symbolic Program Synthesis

This part decomposes the transcompilation into a series of LLM-assisted transformation passes, with symbolic synthesis for repair. There are 11 such passes, categorized into three classes (Table 4):

The following table shows the transformation passes and their description.

| Class | Name | Description |
|---|---|---|
| (1) | Loop Recovery | Convert parallel variables to sequential for loops |
| (1) | Loop Bind | Assign a sequential loop to parallel variables |
| (1) | Loop Split | Divide a loop into several sub-loops |
| (1) | Loop Fuse | Merge several loops into a hyper-loop |
| (1) | Loop Reorder | Change the execution order of loops |
| (1) | Loop Expansion | Split a loop body into several loop bodies |
| (1) | Loop Contraction | Merge the producer into the loop body of the consumer |
| (2) | Cache | Adapt to the memory hierarchy for efficient load/store of inputs/outputs |
| (2) | Pipeline | Pipeline data load/store and computation |
| (3) | Tensorize | Replace a specific loop body to leverage special intrinsics |
| (3) | Detensorize | Restore a specific loop body from special intrinsics |
  • Sequentialization/Parallelization: Loop Recovery, Loop Bind, Loop Split, Loop Fuse, Loop Reorder, Loop Expansion, Loop Contraction. These passes handle the conversion between parallel program constructs (e.g., CUDA threadIdx.x) and sequential loops, or restructure loops for better parallelism.

  • Memory Conversion: Cache, Pipeline. These passes adapt the code to the target system's memory hierarchy by managing data movement and access patterns (e.g., caching, pipelining loads/stores).

  • (De)Tensorization: Tensorize, Detensorize. These passes convert sequential code into parallel tensor intrinsics or vice versa, leveraging specialized hardware instructions.

    Each program transformation pass follows a workflow: Program Annotation, Meta-Prompts based Transformation, Bug Localization, and SMT-based Code Repairing. This process is illustrated in Figure 4 for a tensorization case.

    Figure 4: An illustrative example of the proposed neural-symbolic program synthesis on a tensorization case, showing the flow of program annotation, meta-prompt based program transformation, and SMT-based bug localization and repair.

Figure 4 illustrates how a CUDA C matmul loop is tensorized into a BANG C __bang_mlp intrinsic using the neural-symbolic approach.
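
To make this per-pass workflow concrete before detailing each step, the following minimal Python sketch strings the four stages together; llm, run_unit_tests, localize_bug, and smt_repair are hypothetical placeholders for the components described in Sections 4.2.1.1-4.2.1.4, not APIs from the paper.

```python
# Hedged sketch of one LLM-assisted transformation pass with symbolic repair.
# llm, run_unit_tests, localize_bug, and smt_repair are hypothetical placeholders
# for the components described in Sections 4.2.1.1-4.2.1.4.

def apply_pass(source, meta_prompt, manual_snippets, llm,
               run_unit_tests, localize_bug, smt_repair):
    # 1. Program annotation: mark computation semantics and target-specific references.
    annotated = llm(f"Annotate the computation semantics of:\n{source}")
    # 2. Meta-prompt based transformation: fill the pass-specific template.
    prompt = meta_prompt.format(program=annotated, examples="\n".join(manual_snippets))
    candidate = llm(prompt)
    # 3. Verification: accept the candidate if it already passes the unit tests.
    if run_unit_tests(candidate):
        return candidate
    # 4. Bug localization and SMT-based repair of the small offending snippet.
    snippet = localize_bug(source, candidate)
    return smt_repair(source, candidate, snippet)
```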

4.2.1.1. Program Annotation

This initial step marks code blocks with semantics relevant to the transformation pass, aiding the LLM. It consists of two parts (Algorithm 1):

Algorithm 1: Program Annotation Algorithm

Input: A source program P
Output: An annotated program P_annotated
Given: LLM, BM25 search engine (SE), programming manual M

1 P_annotated ← P
2 // Annotate program with identified computation semantics
3 P_annotated ← LLM.AnnotateSemantics(P_annotated)
4 // Get the list of computation information
5 L ← TraverseComputation(P_annotated)
6 for n in L do
7   // Retrieve the programming manual of each computation
8   retrieved_info ← SE.Retrieve(n, M)
9   // Annotate program with memory spaces or tensor intrinsics
10  P_annotated ← LLM.AnnotateReferences(P_annotated, n, retrieved_info)
11 end
12 return P_annotated
  1. Semantics Annotation (lines 2-3): An LLM identifies computational operations and annotates them with platform-agnostic semantics (e.g., operation(matmul)). This abstracts away platform-specific details.
  2. Reference Annotation (lines 4-11): For each identified operation, a BM25 search engine [45] retrieves relevant information (e.g., target operation, parameters, memory constraints) from the target system's programming manual. An LLM then uses this information and the source program to annotate the operator with specific hardware characteristics (e.g., intrinsic(__bang_mlp(input[Nram, Wram], output[Nram]))). This provides platform-specific examples to guide the subsequent LLM-based transformation.
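
The sketch below mirrors Algorithm 1 with a simple keyword-overlap retriever standing in for BM25 and a hypothetical llm callable; none of these helpers are the paper's actual implementation.

```python
# Hedged sketch of Algorithm 1: semantics annotation, then reference annotation.
# The keyword-overlap retriever is a simplified stand-in for BM25, and llm() is a
# hypothetical callable; neither reflects the paper's actual implementation.

def retrieve(query, manual):
    """Return the manual entry whose text shares the most words with the query."""
    q = set(query.lower().split())
    best_key = max(manual, key=lambda k: len(q & set(manual[k].lower().split())))
    return manual[best_key]

def annotate(program, operations, manual, llm):
    # Step 1: platform-agnostic semantics, e.g. an "operation(matmul)" annotation.
    annotated = llm(f"Annotate each computation in this program with its semantics:\n{program}")
    # Step 2: per-operation reference annotation from the target programming manual.
    for op in operations:  # e.g. ["matmul"], as identified in step 1
        info = retrieve(op, manual)
        annotated = llm(
            f"Annotate the {op} block with target intrinsics and memory spaces, "
            f"using this manual excerpt:\n{info}\n\nProgram:\n{annotated}"
        )
    return annotated
```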

4.2.1.2. Meta-Prompts based Transformation

This step uses LLMs to generate the transformed code. A meta-prompt is a high-level template adapted for each source program based on annotations. Each transformation pass has a unique meta-prompt comprising three parts:

  1. Platform-agnostic description: Describes the program's functionality and general constraints (e.g., tensorization in the context of SIMD). This part is constant across platforms.

  2. Platform-specific examples: Uses the annotated source program to search for relevant examples in the target programming manual. These examples (e.g., C to BANGC matmul) are incorporated into the prompt, guiding the LLM to generate functionally correct code with appropriate syntax and intrinsics.

  3. Tuning knobs: Optional constraints for passes like Loop Split or Loop Reorder, defining parameters such as loop split alignment size to generate a search space for auto-tuning.

    The LLM generates a transformed code sketch that is generally correct in structure but may contain error-prone values.
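
As an illustration, a Tensorize meta-prompt could be assembled from the three parts roughly as follows; the template wording, field names, and tile sizes are assumptions for illustration, not the paper's exact prompt.

```python
# Hedged sketch: assembling a Tensorize meta-prompt from its three parts.
# The wording, field names, and tile sizes are illustrative assumptions.
TENSORIZE_META_PROMPT = """\
# 1. Platform-agnostic description (kept constant across platforms)
Convert the annotated sequential loop nest into SIMD tensor intrinsics
while preserving its semantics.

# 2. Platform-specific examples (retrieved from the target programming manual)
{examples}

# 3. Tuning knobs (optional constraints that define the auto-tuning search space)
Alignment / tile sizes to consider: {tile_sizes}

Now transform the following annotated program:
{program}
"""

prompt = TENSORIZE_META_PROMPT.format(
    examples="<C-to-BANG C matmul example retrieved from the manual>",
    tile_sizes=[16, 32, 64],
    program="<annotated source program>",
)
```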

4.2.1.3. Bug Localization

If the LLM-generated code fails unit tests, Algorithm 2 is used to precisely locate buggy code snippets.

Algorithm 2: Bug Localization Algorithm

Input: Source program S, Buggy transcompiled program E, Loop index I, Buffer name N, control-flow matching model M
Output: Error nodes M_t

1 S_n ← AST(S)
2 S_s ← ExtractASTSegment(S_n, I, N)
3 // Obtain the buffer list
4 B_s ← TraverseBuffers(S_s)
5 E_n ← BinarySearchErrorNodes(B_s, E, S)
6 M_b ← FindMatchingCodeBlocks(M, S_s, E_n)
7 for M_s, M_t in M_b do
8   if Diverse_ds(M_s, M_t) then
9     return M_t
10  end
11 end

The process involves:

  1. Identifying the code snippet transformed based on loop index or buffer name.
  2. Comparing the buggy transformed snippet with the original.
  3. Using binary search by inserting print statements at relevant memory locations to narrow down the error scope.
  4. Mapping source-level control flow to diverging inputs to pinpoint erroneous behavior.
  5. Enumerating constraints (index variables, loop boundaries, buffer sizes, tensor semantics) within the snippet to create queries for the SMT solver.
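
A simplified sketch of the binary-search step is shown below: it narrows the error down to the first buffer checkpoint whose contents diverge between the source and the transcompiled program. The compare_buffer_at hook (which would insert print statements or dump memory at a checkpoint) is a hypothetical placeholder.

```python
# Hedged sketch: binary-search bug localization over buffer checkpoints.
# compare_buffer_at is a hypothetical hook that dumps/compares memory at a checkpoint.

def locate_divergence(checkpoints, compare_buffer_at):
    """Return the first checkpoint whose buffer contents differ between the
    source program and the buggy transcompiled program."""
    lo, hi = 0, len(checkpoints) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if compare_buffer_at(checkpoints[mid]):   # True: buffers still match here
            lo = mid + 1                          # divergence must be later
        else:
            hi = mid                              # divergence is here or earlier
    return checkpoints[lo]
```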

4.2.1.4. SMT-based Code Repairing

Once a bug is localized, the SMT solver repairs the buggy code snippet, focusing on low-level implementation details such as loop boundaries, indexing, and instruction parameters (Algorithm 3).

Algorithm 3: SMT-based Code Repairing Algorithm

Input: A source program P, error program E
Output: A repaired program F

1 Given: Error code snippet S_error
2 T_s ← AST(P); S_s ← extract(T_s, S_error)
3 T_e ← AST(E); S_e ← extract(T_e, S_error)
4 // Generate the sketch according to the transformation definition.
5 S_k ← GenerateCodeSketch(S_s, S_e)
6 // Generate the smt query
7 Q ← CreateSMTQuery(S_s, S_e)
8 // Generate the verified code snippet
9 R ← SynthesisCode(Q)
10 F ← StitchBack(P, R, S_error)
11 return F

The process involves:

  1. Parsing source and error code to extract the buggy snippet.

  2. Generating a code sketch based on the specific loop transformation's definition.

  3. Explicitly enumerating constraints over concrete index variables, loop boundaries, and buffer indexes.

  4. Formulating an SMT query (Q) using these boundary and buffer index constraints.

  5. Using the SMT solver (SynthesisCode(Q)) to fill in missing expressions in the sketch, ensuring symbolic alignment with the source program.

    Figure 5 shows examples of SMT constraints for loop split and cache read.

    Figure 5: The SMT constraints for loop split and cache read, showing code snippets alongside their symbolic constraints, e.g. the loop-split constraint r_i - r'_i < 4 \wedge r_i - r'_i > -4 and the index-range and assignment-consistency constraints for the cache read.

Figure 5 illustrates how constraints for loop splitting and cache access are formulated for the SMT solver. For loop split, the constraints keep the split loop's iterations aligned with the original loop's behavior (e.g., r_i - r'_i < 4 \wedge r_i - r'_i > -4). For cache read, the constraints tie the cached access back to the original global-memory access: data is first loaded from global memory into the cache and then indexed from the cache via a flattened expression such as k' * 1024 + k'', together with range constraints that keep k' and k'' within the original buffer extents.
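
To make the repair step concrete, here is a minimal sketch using the z3 SMT solver's Python bindings to synthesize the unknown coefficients of an index expression after a loop split, in the spirit of Algorithm 3 and Figure 5. The loop extents (a 1024-iteration loop split 32 × 32) and all variable names are illustrative assumptions, not values from the paper.

```python
# Hedged sketch: SMT-based synthesis of missing index coefficients with z3 (Python bindings).
# The 32 x 32 split of a 1024-iteration loop and all names are illustrative assumptions.
from z3 import Ints, Solver, ForAll, Implies, And, sat

c0, c1 = Ints("c0 c1")   # unknown coefficients in the sketched index: c0*ko + c1*ki
ko, ki = Ints("ko ki")   # outer/inner loop variables after splitting i (0..1023) by 32

s = Solver()
# For every legal (ko, ki), the sketched index must equal the original flat index ko*32 + ki.
s.add(ForAll([ko, ki],
             Implies(And(0 <= ko, ko < 32, 0 <= ki, ki < 32),
                     c0 * ko + c1 * ki == ko * 32 + ki)))
if s.check() == sat:
    m = s.model()
    print(f"repaired index expression: {m[c0]}*ko + {m[c1]}*ki")  # expect 32*ko + 1*ki
```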

4.2.2. Hierarchical Performance Auto-Tuning

This component aims to automatically maximize the performance of transcompiled programs by exploring transformation parameters and sequences. The transcompilation exploration space S is defined as:

S = \left\{ S^{(n)} \mid S^{(t)} = \mathrm{apply}(\mathrm{apply}(S^{(t-1)}, d_t), k_t), \forall d_t \in D_t, \forall k_t \in K_t, 1 \leq t \leq n \right\}

where S^{(0)} is the source tensor program, d_t is a pass selected from the predefined passes D_t, k_t is a tuning option from the predefined tuning knobs K_t, and n is the number of transformations. The size of this search space is:

|S| = |D_1| \times |K_1| \times |D_2| \times |K_2| \times \ldots \times |D_n| \times |K_n|

This vast search space is explored using a hierarchical approach:

4.2.2.1. Intra-Pass Auto-Tuning

This focuses on fine-tuning parameters within a single transformation pass.

  • Search space generation: LLMs and programming manuals help generate the search space. For example, for Loop Split and Loop Reorder, specially designed meta-prompts (Figure 6) are used to generate programs with different adjustable parameters (e.g., number of blocks for logical loops, loop order, tiling sizes).

    Figure 6: The auto-tuning prompt for loop split, showing how a loop variable i is split into two nested loops and how all possible loop indices and their extents are returned.

Figure 6 shows a meta-prompt for loop split which instructs the LLM to split a loop variable i into two nested loops and return all possible loop indices and their extents, providing a structured way for the LLM to explore tiling options.

  • Search space exploration: A brute-force search is employed to find the optimal program configuration within the generated search space. This is deemed feasible because, for some DLS (e.g., BANG C), the search space size is relatively small (e.g., 10 for Matmul on MLU, compared to 150 on GPU).
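
A minimal sketch of this brute-force intra-pass search is given below; propose_candidates (standing in for the meta-prompted LLM of Figure 6) and measure_throughput (an on-device benchmark assumed to return 0 for invalid programs) are hypothetical helpers, not APIs from the paper.

```python
# Hedged sketch: brute-force intra-pass auto-tuning over LLM-proposed candidates.
def intra_pass_tune(program, propose_candidates, measure_throughput):
    """Enumerate every candidate (e.g., tiling sizes, loop orders) and keep the fastest valid one."""
    best_prog, best_tput = program, 0.0
    for candidate in propose_candidates(program):
        tput = measure_throughput(candidate)   # assumed to return 0.0 for invalid/incorrect programs
        if tput > best_tput:
            best_prog, best_tput = candidate, tput
    return best_prog, best_tput
```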

4.2.2.2. Inter-Pass Auto-Tuning with MCTS

This component uses Monte Carlo Tree Search (MCTS) to find the optimal sequence of transformation passes, treating transcompilation as a Markov decision process. Each intermediate program is a state, and applying a pass is an action.

  • Reward function: The reward for a given transformation sequence is tied to its execution-time improvement. MCTS runs in parallel with the current program, recording the best measured throughput as the reward:

    T_{ti} = \left\{ \begin{array}{ll} T(p_{ti}), & \mathrm{if~} p_{ti} \mathrm{~is~valid} \\ 0, & \mathrm{otherwise} \end{array} \right. \qquad R_t = \max_i(T_{ti})

    where p_{ti} denotes the transformed programs at iteration t, i indexes a transformed program within the intra-pass auto-tuning search space, and T is the measured throughput of the program.
  • Search parameter settings: Through design space exploration, the maximum search depth (n in the equation above) for MCTS was set to 13, and the total number of simulations to 512. This balances search time and reward, allowing QiMeng-Xpiler to find optimal transcompiled programs within a reasonable time (e.g., hours).
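
A highly simplified MCTS loop over pass sequences is sketched below, keeping only the UCB-style selection and the max-throughput reward described above; apply_transform(prog, pass_name) and tune(prog) -> (tuned_prog, reward) are hypothetical stand-ins for the LLM-assisted pass and the intra-pass auto-tuner, and this is not the paper's full search implementation.

```python
# Hedged sketch: UCB-style MCTS over transformation-pass sequences.
# apply_transform and tune are hypothetical stand-ins for the LLM-assisted pass
# and the intra-pass auto-tuner; this is a simplified search, not the paper's.
import math

def mcts_tune(source, passes, apply_transform, tune,
              max_depth=13, simulations=512, c=1.4):
    stats = {}                       # pass-sequence prefix -> (visits, total_reward)
    best_reward, best_prog = 0.0, source

    def ucb(prefix, parent_visits):
        visits, total = stats.get(prefix, (0, 0.0))
        if visits == 0:
            return float("inf")      # always try unvisited children first
        return total / visits + c * math.sqrt(math.log(parent_visits + 1) / visits)

    for _ in range(simulations):
        prog, seq, rollout_reward = source, (), 0.0
        for _ in range(max_depth):
            parent_visits = stats.get(seq, (0, 0.0))[0]
            nxt = max(passes, key=lambda p: ucb(seq + (p,), parent_visits))
            seq = seq + (nxt,)
            prog = apply_transform(prog, nxt)
            prog, reward = tune(prog)            # R_t: best throughput among valid candidates, else 0
            rollout_reward = max(rollout_reward, reward)
            if reward > best_reward:
                best_reward, best_prog = reward, prog
        # Backpropagate the rollout's best reward along every visited prefix.
        for depth in range(1, len(seq) + 1):
            v, r = stats.get(seq[:depth], (0, 0.0))
            stats[seq[:depth]] = (v + 1, r + rollout_reward)
    return best_prog, best_reward
```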

4.3. Mathematical Formulas & Key Details

4.3.1. Exploration Space Formalization

The transcompilation exploration space S is defined as the set of all possible programs that can be generated by applying a sequence of transformations. Let S^{(0)} be the initial source tensor program, d_t a transformation pass selected from the set of predefined passes D_t at step t, and k_t a tuning option (e.g., a specific tiling size or loop order) selected from the set of predefined tuning knobs K_t at step t. Applying pass d_t with tuning option k_t to a program S^{(t-1)} yields a new program S^{(t)}. The entire exploration space S for a sequence of n transformations is given by:

S = \left\{ S^{(n)} \mid S^{(t)} = \mathrm{apply}(\mathrm{apply}(S^{(t-1)}, d_t), k_t), \forall d_t \in D_t, \forall k_t \in K_t, 1 \leq t \leq n \right\}

Here,

  • S^{(n)} is the final transformed program after n steps.

  • \mathrm{apply}(program, pass) is a function that applies a transformation pass to a program.

  • \mathrm{apply}(program, knob) is a function that applies a tuning knob (parameter) to a program, typically affecting the details of the transformation.

    The size of this exploration space |S| is the product of the number of choices at each step for both the pass and the tuning knob:

    |S| = |D_1| \times |K_1| \times |D_2| \times |K_2| \times \ldots \times |D_n| \times |K_n|

    where |D_t| is the number of available transformation passes at step t, and |K_t| is the number of available tuning-knob options at step t. This formula highlights the exponential growth of the search space with the number of transformation steps n.
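
As a rough, purely illustrative sense of scale (the per-pass option count is an assumption, not a figure from the paper): with the 11 passes of Table 4, around 4 tuning options per pass, and n = 13 steps (the MCTS search depth used later), |S| = (11 \times 4)^{13} \approx 2 \times 10^{21} candidate programs, which is far beyond exhaustive enumeration and motivates the hierarchical intra-/inter-pass strategy.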

4.3.2. MCTS Reward Function

The reward function for Monte Carlo Tree Search (MCTS) quantifies the performance of a transcompiled program. Let p_{ti} denote a transformed program at iteration t resulting from the i-th choice within the intra-pass auto-tuning search space, and let T(p_{ti}) be the measured throughput (e.g., operations per second) of p_{ti} on the target DLS. The intermediate throughput T_{ti} is defined as:

T_{ti} = \left\{ \begin{array}{ll} T(p_{ti}), & \mathrm{if~} p_{ti} \mathrm{~is~valid} \\ 0, & \mathrm{otherwise} \end{array} \right.

Here,

  • T_{ti}: The throughput value assigned to program p_{ti}.

  • T(p_{ti}): The actual measured throughput of the program.

  • "is valid": Implies that the program successfully compiles and executes and is functionally correct (i.e., passes unit tests). If it's not valid, its throughput is considered 0, providing no reward.

    The reward R_t for the MCTS at step t is the maximum throughput achieved among all valid programs generated within the intra-pass auto-tuning search space at that step:

    R_t = \max_i(T_{ti})

    Here,

  • R_t: The reward obtained at step t.

  • \max_i(T_{ti}): The maximum throughput found among all programs p_{ti} at step t. This ensures that MCTS favors sequences of transformations that lead to the best-performing valid programs.

5. Experimental Setup

5.1. Datasets

The study evaluates QiMeng-Xpiler on 21 widely-used deep learning operators. These operators are grouped into 6 types of common deep learning operations:

  1. MatMul (Matrix Multiplication)

  2. Convolution

  3. Activation functions (e.g., ReLU, Softmax, GeLU, Sigmoid)

  4. Pooling operations (MaxPool, AvgPool, MinPool, SumPool)

  5. Element-wise operations (Add, Sign)

  6. LLM (Large Language Model) specific operations (LayerNorm, Deformable Attention, Self Attention, RMSNorm)

    Each of these 21 operators is further evaluated with 8 typical shapes. These shapes are extracted from real-world neural network architectures, ensuring practical relevance. The networks from which these shapes are derived include:

  • GPT [36]

  • LLaMA-2 [44]

  • DAT [51]

  • BERT [23]

  • ResNet [25]

  • MobileNet [26]

  • VGG [42]

    In total, this leads to 168 test cases (21 operators × 8 shapes). The size of the code for these operators ranges from 7 to 214 lines of code, covering a variety of complexities and types.

Example of a data sample: For a MatMul operator, a "shape" refers to the dimensions of the input matrices. For instance, a GEMM (General Matrix Multiplication) operation might involve matrices A_{512 \times 512} and B_{512 \times 512}, producing an output C_{512 \times 512}. This specific 512 \times 512 \times 512 configuration would be one of the "8 typical shapes" for MatMul. The "lines of code" (LoC) would be the CUDA C, BANG C, HIP, or C with VNNI implementation of this operation.

The following table provides a breakdown of the evaluated benchmarks for QiMeng-Xpiler, including the lines of code for different target languages and the total number of cases for each operator type.

The following table shows the results from Table 5:

| Type | Operators | CUDA C (LoC) | BANG C (LoC) | Hip (LoC) | C with VNNI (LoC) | Cases |
|---|---|---|---|---|---|---|
| MatMul | GEMM, GEMV | 26, 12 | 15, 11 | 25, 12 | 41, 34 | 24 |
| MatMul | Batch GEMM | 29 | 16 | 28 | 47 | |
| Convolution | Conv1D | 10 | 14 | 10 | 9 | 32 |
| Convolution | Conv2D NHWC, NCHW | 32, 30 | 23, 30 | 32, 30 | 83, 85 | |
| Convolution | Depthwise Conv | 27 | 24 | 27 | 17 | |
| Activation | ReLU, Softmax | 7, 24 | 18, 18 | 7, 24 | 14, 28 | 32 |
| Activation | GeLU, Sigmoid | 7, 10 | 29, 30 | 7, 10 | 14, 14 | |
| Elementwise | Add, Sign | 20, 12 | 26, 29 | 18, 12 | 17, 14 | 16 |
| Pooling | MaxPool, AvgPool | 25, 30 | 25, 25 | 25, 30 | 31, 34 | 32 |
| Pooling | MinPool, SumPool | 23, 29 | 25, 25 | 23, 29 | 33, 30 | |
| LLM | LayerNorm | 42 | 35 | 42 | 46 | 32 |
| LLM | Deformable Attention | 139 | 191 | 139 | 214 | |
| LLM | Self Attention, RMSNorm | 54, 25 | 64, 18 | 54, 25 | 111, 18 | |

These datasets were chosen to comprehensively assess QiMeng-Xpiler's capabilities by covering a wide range of operator types, varying code complexities, and diverse input shapes representative of real-world deep learning workloads.

5.2. Evaluated Platforms

The experiments were conducted on four distinct Deep Learning Systems (DLS), each with its own programming interface:

  1. Intel Gold 6348 CPU: Equipped with VNNI (Vector Neural Network Instructions) extensions. This platform utilizes specialized instruction sets designed to accelerate deep learning operations. The programming interface is C with VNNI intrinsics.

  2. NVIDIA A100 GPU: A high-performance GPU widely used in deep learning. It follows the SIMT (Single Instruction, Multiple Threads) programming model. The programming interface is CUDA C.

  3. AMD MI200 (Instinct MI250X): An AMD data center GPU. It provides an alternative to CUDA with its HIP (Heterogeneous-Compute Interface for Portability) language, which is designed to be a porting tool from CUDA.

  4. Cambricon MLU: A Deep Learning Accelerator (DSA) from Cambricon. It typically follows a SIMD (Single Instruction, Multiple Data) programming model. The programming interface is BANG C.

    Table 1 provides a detailed comparison of these DLS and their interfaces, illustrating the programming challenges they present.

The following table shows the results from Table 1:

| Platforms | Interfaces | Categories | Examples |
|---|---|---|---|
| Intel DL Boost | C with VNNI extensions | Specialized Intrinsics | _mm_dpbusds_epi32(...), _mm512_dpbusd_epi32(...) |
| NVIDIA GPU with Tensor Core | CUDA C | Parallelism | blockIdx, threadIdx |
| | | Memory Hierarchy | registers, __shared__, __global__; matrix_a, matrix_b, and accumulator in Tensor Core fragments |
| | | Specialized Intrinsics | wmma::mma_sync(d, a, b, c) |
| AMD MI with Matrix Core | HIP | Parallelism | blockIdx, threadIdx |
| | | Memory Hierarchy | registers, __shared__, __global__; matrix_a, matrix_b, and accumulator in Matrix Core fragments |
| | | Specialized Intrinsics | d = __builtin_amdgcn_mfma_f32_16x16x4f32(a, b, c, ...) |
| Cambricon MLU | BANG C | Parallelism | taskId for task-level parallelism; clusterId, coreId for multi-core parallelism |
| | | Memory Hierarchy | registers, __mlu_shared__, __mlu_device__, __nram__, __wram__ |
| | | Specialized Intrinsics | __bang_mlp(...), __bang_conv(...) |

5.3. Evaluation Metrics

The performance and accuracy of QiMeng-Xpiler are evaluated using three key metrics:

  1. Compilation Accuracy:

    • Conceptual Definition: This metric measures the proportion of transcompiled programs that successfully pass the compiler checks for the target platform. It indicates the syntactic correctness and adherence to the target language's rules and API usage.
    • Importance: A high compilation accuracy signifies the transcompiler's ability to generate valid, buildable code for the target system. Without successful compilation, a program cannot be executed or tested for functional correctness or performance.
    • Mathematical Formula: \text{Compilation Accuracy} = \frac{\text{Number of successfully compiled programs}}{\text{Total number of programs attempted}} \times 100\%
    • Symbol Explanation:
      • \text{Number of successfully compiled programs}: The count of generated target programs that compile without errors.
      • \text{Total number of programs attempted}: The total count of source programs that were subjected to transcompilation.
  2. Computation Accuracy:

    • Conceptual Definition: This metric assesses the functional correctness of the transcompiled code. A program is deemed "computationally accurate" if it produces the correct output, verified by passing a predefined set of unit tests against a ground truth. This goes beyond mere compilation to ensure the translated program behaves as expected.
    • Importance: This is a crucial metric for transcompilers in critical applications like deep learning, where even subtle numerical discrepancies can lead to incorrect model behavior. It guarantees semantic equivalence between the source and target programs.
    • Mathematical Formula: \text{Computation Accuracy} = \frac{\text{Number of functionally correct programs (passing unit tests)}}{\text{Total number of successfully compiled programs}} \times 100\%
    • Symbol Explanation:
      • \text{Number of functionally correct programs (passing unit tests)}: The count of compiled target programs that produce the expected output when executed with unit tests.
      • \text{Total number of successfully compiled programs}: The total count of programs that compiled without errors. This is usually a subset of the "total number of programs attempted" used for compilation accuracy.
  3. Execution Performance:

    • Conceptual Definition: This metric evaluates how fast the transcompiled code runs on the target platform compared to vendor-provided, manually-optimized libraries (e.g., cuDNN, cuBLAS, oneDNN). It's typically expressed as a speedup factor or a ratio.
    • Importance: Beyond correctness, performance is paramount in deep learning systems. A transcompiler must generate code that is not only correct but also efficient, ideally matching or exceeding the performance of highly optimized, hand-tuned libraries. This highlights the practicality and real-world applicability of the method.
    • Mathematical Formula: \text{Execution Performance Ratio} = \frac{\text{Execution Time of Manually Optimized Library}}{\text{Execution Time of QiMeng-Xpiler Translated Program}}
    • Symbol Explanation:
      • \text{Execution Time of Manually Optimized Library}: The time taken by the corresponding vendor-provided, manually-optimized library to execute the operation.
      • \text{Execution Time of QiMeng-Xpiler Translated Program}: The time taken by the program transcompiled by QiMeng-Xpiler to execute the same operation.
      • A ratio greater than 1.0 means QiMeng-Xpiler's program is faster; a ratio less than 1.0 means it's slower.
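
For concreteness, the three metrics could be computed from per-case records as in the following sketch; the record field names (compiled, passed_tests, lib_time, xpiler_time) are illustrative assumptions, not the paper's actual evaluation harness.

```python
# Hedged sketch: computing the three evaluation metrics from per-case records.
def evaluate(cases):
    """Each record is assumed to carry: compiled (bool), passed_tests (bool),
    lib_time and xpiler_time in seconds. Field names are illustrative."""
    attempted = len(cases)
    compiled = [c for c in cases if c["compiled"]]
    correct = [c for c in compiled if c["passed_tests"]]
    return {
        "compilation_accuracy": 100.0 * len(compiled) / attempted,
        "computation_accuracy": 100.0 * len(correct) / len(compiled) if compiled else 0.0,
        # A ratio > 1.0 means the transcompiled program beats the optimized library.
        "mean_perf_ratio": (sum(c["lib_time"] / c["xpiler_time"] for c in correct) / len(correct))
                           if correct else 0.0,
    }
```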

5.4. Baselines

QiMeng-Xpiler is compared against both state-of-the-art LLM-based and rule-based transcompilation approaches:

  1. LLM-based Approaches:

    • GPT-4 Zero-Shot: Using GPT-4 without providing any in-context examples (i.e., minimal prompt).
    • OpenAI o1 Zero-Shot: Using OpenAI o1 without any in-context examples.
    • GPT-4 Few-Shot: Using GPT-4 with a few relevant examples provided in the prompt to guide the translation.
    • OpenAI o1 Few-Shot: Using OpenAI o1 with a few relevant examples in the prompt.
    • Purpose: These baselines evaluate the raw capabilities of general-purpose LLMs for code translation, both with and without explicit examples.
  2. Rule-based Approaches:

    • PPCG [47]: A polyhedral model-based auto-parallelization tool that translates C code to CUDA C.
    • HIPIFY [7]: A vendor-provided tool by AMD designed to migrate CUDA code to AMD HIP code.
    • Purpose: These baselines represent established, hand-crafted transcompilation methods. The paper notes that rule-based approaches are generally not available for other transcompilation directions due to the immense human effort required.
  3. Search-based Program Synthesis Approaches:

    • Implicit Baseline: While not directly used as a comparative baseline in the results tables, the paper discusses search-based program synthesis as a category of related work. It explicitly states that these approaches "struggle with the large search space problem" and "cannot perform transcompilation directly across different DLS." This implies they are not considered viable direct baselines for the broad transcompilation task that QiMeng-Xpiler addresses.

6. Results & Analysis

6.1. Core Results

QiMeng-Xpiler demonstrates superior performance across all evaluation metrics and transcompilation directions. It achieves near-perfect compilation accuracy and high computation accuracy, significantly outperforming LLM-based and rule-based baselines. Moreover, it translates programs that achieve competitive or even superior execution performance compared to manually optimized libraries, leading to substantial productivity gains.

6.2. Data Presentation (Tables)

The following table shows the results from Table 6:

| Source | Method | Compilation: CUDA C | Compilation: BANG C | Compilation: Hip | Compilation: C with VNNI | Computation: CUDA C | Computation: BANG C | Computation: Hip | Computation: C with VNNI |
|---|---|---|---|---|---|---|---|---|---|
| CUDA C | GPT-4 Zero-Shot | — | 0 | 82.7 | 9.5 | — | 0 | 82.7 | 4.2 |
| CUDA C | OpenAI o1 Zero-Shot | — | 0 | 85.7 | 61.9 | — | 0 | 82.7 | 60.7 |
| CUDA C | GPT-4 Few-Shot | — | 50.6 | 97.0 | 84.5 | — | 7.7 | 96.4 | 30.4 |
| CUDA C | OpenAI o1 Few-Shot | — | 51.8 | 98.2 | 85.1 | — | 48.2 | 98.2 | 55.4 |
| CUDA C | QiMeng-Xpiler w/o SMT | — | 82.7 | 98.2 | 88.1 | — | 54.2 | 98.2 | 58.3 |
| CUDA C | QiMeng-Xpiler w/o SMT + Self-Debugging | — | 87.5 | 98.8 | 89.3 | — | 54.8 | 98.2 | 58.9 |
| CUDA C | QiMeng-Xpiler | — | 100 | 100 | 100 | — | 91.7 | 100 | 95.2 |
| BANG C | GPT-4 Zero-Shot | 24.4 | — | 26.8 | 0 | 0 | — | 0 | 0 |
| BANG C | OpenAI o1 Zero-Shot | 27.4 | — | 97.0 | 9.5 | 0 | — | 0 | 4.2 |
| BANG C | GPT-4 Few-Shot | 69.0 | — | 66.1 | 23.8 | 6.5 | — | 6.5 | 13.1 |
| BANG C | OpenAI o1 Few-Shot | 71.4 | — | 97.0 | 41.7 | 10.1 | — | 7.7 | 23.2 |
| BANG C | QiMeng-Xpiler w/o SMT | 85.1 | — | 84.5 | 47.6 | 77.4 | — | 78.6 | 41.1 |
| BANG C | QiMeng-Xpiler w/o SMT + Self-Debugging | 88.1 | — | 88.7 | 50.6 | 77.4 | — | 78.6 | 41.1 |
| BANG C | QiMeng-Xpiler | 100 | — | 100 | 100 | 95.8 | — | 97.0 | 95.2 |
| Hip | GPT-4 Zero-Shot | 97.0 | 0 | — | 23.8 | 97.0 | 0 | — | 5.4 |
| Hip | OpenAI o1 Zero-Shot | 98.2 | 0 | — | 45.8 | 98.2 | 0 | — | 4.2 |
| Hip | GPT-4 Few-Shot | 97.0 | 35.1 | — | 85.1 | 97.0 | 5.4 | — | 24.4 |
| Hip | OpenAI o1 Few-Shot | 98.8 | 42.3 | — | 88.7 | 98.2 | 9.0 | — | 30.4 |
| Hip | QiMeng-Xpiler w/o SMT | 98.2 | 60.7 | — | 65.5 | 97.6 | 52.4 | — | 57.1 |
| Hip | QiMeng-Xpiler w/o SMT + Self-Debugging | 98.8 | 62.5 | — | 66.1 | 98.2 | 52.4 | — | 57.1 |
| Hip | QiMeng-Xpiler | 100 | 100 | — | 100 | 100 | 86.9 | — | 96.4 |
| C with VNNI | GPT-4 Zero-Shot | 57.1 | 0 | 60.1 | — | 8.3 | 0 | 8.9 | — |
| C with VNNI | OpenAI o1 Zero-Shot | 66.1 | 0 | 97.0 | — | 10.1 | 0 | 96.4 | — |
| C with VNNI | GPT-4 Few-Shot | 81.5 | 41.7 | 74.4 | — | 14.3 | 6.0 | 12.5 | — |
| C with VNNI | OpenAI o1 Few-Shot | 87.5 | 55.4 | 97.0 | — | 51.2 | 10.7 | 96.4 | — |
| C with VNNI | QiMeng-Xpiler w/o SMT | 95.8 | 78.0 | 87.5 | — | 83.9 | 58.3 | 85.7 | — |
| C with VNNI | QiMeng-Xpiler w/o SMT + Self-Debugging | 97.0 | 84.5 | 89.3 | — | 83.9 | 58.3 | 85.7 | — |
| C with VNNI | QiMeng-Xpiler | 100 | 99.4 | 100 | — | 98.2 | 88.7 | 99.4 | — |

Analysis of Table 6 (Accuracy vs. LLM-based Methods):

  • Superiority of QiMeng-Xpiler: QiMeng-Xpiler consistently achieves the highest compilation and computation accuracy across all transcompilation directions. It reaches 100% compilation accuracy in most scenarios and computation accuracy ranges from 86.9% to 100%. This is a significant advantage for practical transcompilers where high correctness is critical.
  • Limitations of LLM-only Baselines:
    • Zero-Shot: GPT-4 and OpenAI o1 in Zero-Shot mode show extremely poor performance, especially for challenging targets like BANG C (0% accuracy). This indicates that LLMs lack inherent domain-specific knowledge for DLS.
    • Few-Shot: Providing few-shot examples improves LLM performance significantly, but still falls short of QiMeng-Xpiler. For instance, OpenAI o1 Few-Shot translating CUDA C to BANG C achieves 51.8% compilation and 48.2% computation accuracy, far from 100%. Even for easier directions like CUDA C to Hip, where OpenAI o1 Few-Shot reaches 98.2% compilation and 98.2% computation accuracy, it still does not match the 100% correctness that QiMeng-Xpiler attains, leaving residual functional errors.
  • Impact of SMT Solver (Ablation Study):
    • The rows QiMeng-Xpiler w/o SMT and QiMeng-Xpiler w/o SMT + Self-Debugging clearly demonstrate the necessity of the SMT solver. Without it, even with the proposed LLM-assisted passes and even self-debugging techniques, computation accuracy drops significantly (e.g., HIP to BANG C goes from 86.9% to 52.4%).

    • This confirms that LLMs can generate good sketches but struggle with low-level correctness, which SMT effectively fixes. The small gain from Self-Debugging on top of QiMeng-Xpiler w/o SMT (e.g., CUDA C to BANG C computation accuracy from 54.2% to 54.8%) further underscores that traditional LLM-based debugging is insufficient for the semantic precision required.

The following table shows the results from Table 7:

| Direction | Method | Compilation (%) | Computation (%) |
|---|---|---|---|
| CUDA C → HIP | Hipify | 85.7 | 85.7 |
| CUDA C → HIP | QiMeng-Xpiler | 100 | 100 |
| C → CUDA C | PPCG | 47.6 | 47.6 |
| C → CUDA C | QiMeng-Xpiler | 100 | 98.2 |

Analysis of Table 7 (Accuracy vs. Rule-based Methods):

  • Superiority over Rule-based: QiMeng-Xpiler outperforms established rule-based methods like Hipify and PPCG.
    • For CUDA C to HIP, QiMeng-Xpiler achieves perfect 100% compilation and computation accuracy, whereas Hipify manages 85.7%. This highlights QiMeng-Xpiler's robustness even for a vendor-supported transcompilation task.
    • For C to CUDA C, QiMeng-Xpiler reaches 100% compilation and 98.2% computation accuracy, roughly 50 percentage points higher than PPCG's 47.6% on both metrics. This demonstrates QiMeng-Xpiler's ability to handle auto-parallelization tasks more effectively.
  • Flexibility: The paper notes that rule-based methods are often limited to specific pairs of languages, whereas QiMeng-Xpiler shows flexibility across all four DLS with minimal adaptation cost, a key advantage.

6.3. Execution Performance

The performance evaluation (Figure 7) compares QiMeng-Xpiler's transcompiled programs against vendor-provided, manually-optimized libraries (e.g., cuDNN/cuBLAS, CNNL, rocBLAS, oneDNN).

Figure 7: Multi-panel bar charts showing the performance of QiMeng-Xpiler's transcompiled tensor programs (and the number of corrected cases) across platforms, covering C with VNNI → CUDA C, CUDA C → BANG C, CUDA C → HIP, and CUDA C → C with VNNI.

Figure 7 presents the performance of QiMeng-Xpiler's translated programs across various transcompilation directions and operators, relative to manually optimized libraries. The Y-axis represents the performance relative to optimized libraries, while the X-axis lists operators.

Analysis of Figure 7:

  • QiMeng-Xpiler achieves an average performance of 0.78x and, in some cases, up to 2.00x better performance than its manually optimized counterparts. This means that on average, QiMeng-Xpiler's programs run at 78% of the speed of highly optimized libraries, which is highly competitive, and in specific instances, they can be twice as fast.
  • This demonstrates that the hierarchical auto-tuning approach is effective in generating not just correct, but also high-quality, performant code that can rival or even surpass human expert optimizations.
  • The results are shown across four common transcompilation directions: C with VNNI → CUDA C, CUDA C → BANG C, CUDA C → HIP, and CUDA C → C with VNNI. This confirms the general applicability of QiMeng-Xpiler in generating performant code for diverse DLS.

6.4. Case Study: Transcompilation from CUDA C to BANG C

This transcompilation direction is highlighted as particularly challenging due to:

  1. Programming Model Discrepancy: Translating from SIMT (CUDA C) to SIMD (BANG C), which involves fundamental differences in parallel execution paradigms.
  2. Uncommon Target Language: BANG C is an uncommon language with significantly less training data available for LLMs, making it harder for LLM-only approaches.

Analysis:

  • The results in Table 6 show that for CUDA C to BANG C:
    • OpenAI o1 Zero-Shot yields 0% accuracy, demonstrating its complete failure for this challenging task.
    • OpenAI o1 Few-Shot improves to 51.8% compilation and 48.2% computation accuracy, showing that examples help, but LLMs still struggle significantly with correctness.
    • QiMeng-Xpiler w/o SMT achieves much higher accuracy (82.7% compilation, 54.2% computation) by incorporating document retrieval for domain knowledge and decomposition into passes. This highlights the strength of the neural-symbolic framework even without the SMT component for full correctness.
    • Crucially, QiMeng-Xpiler (with SMT) achieves 100% compilation and 91.7% computation accuracy. This significant leap confirms that small-scale symbolic synthesis is essential for ensuring functional equivalence and reaching high accuracy in challenging transcompilation scenarios.

6.5. Compilation Time

Figure 8 illustrates the breakdown of QiMeng-Xpiler's compilation time for 6 typical operators when transcompiling from CUDA C to BANG C.

Figure 8: Breakdown of QiMeng-Xpiler's compilation time, shown as bar charts of the time spent in the LLM, unit-test, SMT, auto-tuning, and evaluation stages for different deep learning operators, together with the average.

Figure 8 breaks down the compilation time for six typical operators (GEMM, Conv2D, ReLU, Add, MaxPool, LayerNorm) when translating from CUDA C to BANG C. The components are LLM, Unit Test, SMT, Auto-tuning, and Evaluation.

Analysis of Figure 8:

  • The total compilation time for these operators ranges from 1.2 to 7.8 hours, with an average of 3.7 hours. This indicates that while automated, the process can still be lengthy due to the complexity of auto-tuning and SMT repair.
  • SMT Component: The SMT component's time consumption is variable. It is "only triggered when the LLM fails to produce a correct translation." For simpler programs or when the LLM is highly accurate, SMT's proportion is smaller. This aligns with the principle of neural-symbolic synthesis, where LLM handles the bulk, and SMT acts as a targeted repair mechanism.
  • Auto-tuning Component: For operators like GEMM and Conv2D, which inherently have larger search spaces (tiling sizes, loop orders, etc.), the proportion of time spent on auto-tuning, i.e., exploring both pass parameters and pass sequences, increases significantly. This is the trade-off for achieving high performance (a minimal grid-search sketch follows this list).
  • Special Intrinsics: Compilation time also tends to grow with the number of special intrinsics used in the program, likely because program annotation, LLM transformation, and potentially SMT repair or auto-tuning all become more involved when handling these hardware-specific instructions.
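
As a hedged illustration of the intra-pass parameter search, the sketch below grid-searches candidate tile sizes, keeps only configurations that pass a correctness check, and retains the fastest one. The two helper functions are hypothetical stand-ins for "compile the candidate kernel, run its unit tests, and time it"; they are not QiMeng-Xpiler's actual tuner.

```cuda
#include <cstdio>
#include <vector>

// Stand-ins for compiling, unit-testing, and timing a candidate kernel.
// A toy cost model keeps the sketch self-contained and executable.
static bool passes_unit_tests(int, int, int) { return true; }
static float measured_ms(int tm, int tn, int tk) {
    return 100.0f / (tm * tn) + 0.05f * tk;
}

int main() {
    const std::vector<int> candidates = {16, 32, 64, 128};
    float best_ms = 1e30f;
    int best_tm = 0, best_tn = 0, best_tk = 0;

    // Exhaustive grid search over tile sizes; correctness is checked first,
    // mirroring the "correct, then fast" ordering of the overall pipeline.
    for (int tm : candidates)
        for (int tn : candidates)
            for (int tk : candidates) {
                if (!passes_unit_tests(tm, tn, tk)) continue;
                float ms = measured_ms(tm, tn, tk);
                if (ms < best_ms) {
                    best_ms = ms;
                    best_tm = tm; best_tn = tn; best_tk = tk;
                }
            }
    std::printf("best tiling: %d x %d x %d (%.3f ms)\n",
                best_tm, best_tn, best_tk, best_ms);
    return 0;
}
```

The paper's hierarchical tuner additionally explores the ordering and parameters of entire transformation passes, so the real search space is much larger than this single-kernel grid, which is why GEMM and Conv2D dominate the tuning time.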

6.6. Productivity Improvement

The paper evaluates the productivity improvement for Deformable Attention (a complex LLM operator with ~200 lines of code), comparing manual implementation against QiMeng-Xpiler's transcompilation. This is assessed for two DLS: NVIDIA GPU (C with VNNI → CUDA C) and MLU (CUDA C → BANG C), involving junior and senior coders.

The following table shows the results from Table 8:

Deformable Attention (~200 LoCs):

| Coder | Implementation | CUDA C → BANG C: Costs | CUDA C → BANG C: Performance | C with VNNI → CUDA C: Costs | C with VNNI → CUDA C: Performance |
| --- | --- | --- | --- | --- | --- |
| Senior Coder | Manual | ~6 d | 100% | ~1 d | 100% |
| Senior Coder | w/ QiMeng-Xpiler | 4.5 + 0.5 h (~28.8× time saving) | 69.20% | 2.1 h (~11.4× time saving) | 132.50% |
| Junior Coder | Manual | ~30 d | 49.85% | ~3 d | 75.76% |
| Junior Coder | w/ QiMeng-Xpiler | 4.5 + 3 h (~96.0× time saving) | 65.17% | 2.1 h (~34.3× time saving) | 132.50% |

Analysis of Table 8:

  • Significant Productivity Gains: QiMeng-Xpiler dramatically improves programming productivity.
    • For MLU (CUDA C → BANG C), productivity is improved by up to 96.0x for junior coders and 28.8x for senior coders.
    • For GPU (C with VNNI → CUDA C), productivity is improved by up to 34.3x for junior coders and 11.4x for senior coders.
  • Reduced Manual Effort:
    • Manual Costs: A senior coder takes ~6 days for MLU and ~1 day for GPU to manually implement Deformable Attention. A junior coder takes ~30 days for MLU and ~3 days for GPU. This highlights the immense human effort and skill required.
    • QiMeng-Xpiler Costs: The transcompilation time by QiMeng-Xpiler for Deformable Attention is about 4.5 hours for CUDA C → BANG C and 2.1 hours for C with VNNI → CUDA C. This is a massive time reduction.
  • Performance with QiMeng-Xpiler:
    • For MLU (CUDA C → BANG C), code generated with QiMeng-Xpiler reaches 69.20% (senior coder workflow) and 65.17% (junior coder workflow) of the senior coder's manual implementation; the junior-coder result even exceeds the junior coder's own manual implementation (49.85%), which is highly competitive given the huge time savings.
    • For GPU (C with VNNI → CUDA C), QiMeng-Xpiler's code actually outperforms the manual implementations by 32.50% (132.50% of manual performance), showing its effectiveness in optimization.
  • Debugging: Even when QiMeng-Xpiler fails to produce a fully functionally correct program (as implied for MLU, where "minimal effort" debugging is still required), the manual debugging time is drastically reduced (0.5 hours for the senior coder, 3 hours for the junior coder) compared to the days or weeks of fully manual implementation.

6.7. Failure Case

The paper identifies a failure case in transcompiling Deformable Attention from CUDA C to BANG C, specifically shown in Figure 9.

Figure 9 illustrates a complex control flow within the Deformable Attention operator, featuring multiple nested loops and conditional statements. This complexity makes it difficult for current LLMs to map to SIMD intrinsics and for SMT solvers to extract constraints.
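
To give a flavor of the difficulty, consider the bilinear-sampling core of deformable attention. The excerpt below is a simplified, hypothetical CUDA-style sketch, not the paper's actual kernel: sampling locations are computed at runtime, and the bounds check introduces a divergent branch inside nested loops, which is natural in SIMT but awkward to express with fixed-width SIMD intrinsics.

```cuda
#include <math.h>

// Simplified, illustrative sketch of the data-dependent control flow in
// Deformable Attention (not the paper's code): each query gathers values
// at runtime-computed, bounds-checked locations.
__global__ void deform_sample(const float* value, const float* loc,
                              float* out, int Q, int H, int W, int P) {
    int q = blockIdx.x * blockDim.x + threadIdx.x;
    if (q >= Q) return;
    float acc = 0.0f;
    for (int p = 0; p < P; ++p) {                      // sampling points
        float x = loc[(q * P + p) * 2 + 0] * (W - 1);  // data-dependent position
        float y = loc[(q * P + p) * 2 + 1] * (H - 1);
        int x0 = (int)floorf(x), y0 = (int)floorf(y);
        if (x0 >= 0 && y0 >= 0 && x0 + 1 < W && y0 + 1 < H) {  // divergent branch
            float wx = x - x0, wy = y - y0;
            acc += (1 - wx) * (1 - wy) * value[y0 * W + x0]
                 +       wx * (1 - wy) * value[y0 * W + x0 + 1]
                 + (1 - wx) *       wy * value[(y0 + 1) * W + x0]
                 +       wx *       wy * value[(y0 + 1) * W + x0 + 1];
        }
    }
    out[q] = acc;
}
```

Vectorizing such code for a SIMD target requires either masking the branch or gathering from irregular addresses, both of which are hard for the LLM to emit correctly and hard for the SMT stage to constrain within a small search scope.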

Analysis of Figure 9:

  • The Deformable Attention operator, with its complex control flow, multiple nested loops, and conditional statements, presents a significant challenge.
  • Limitations of Neural Component: The LLM component of QiMeng-Xpiler struggles to correctly handle such intricate logical structures and conditional branches when mapping them to the required SIMD intrinsics for BANG C. This is likely due to the inherent difficulty for LLMs to reason deeply about complex programmatic logic and ensure precise semantic preservation across vastly different parallel models.
  • Limitations of Symbolic Component: The symbolic component (SMT solver) also faces difficulties in this scenario. Extracting comprehensive mathematical constraints from such complex control flow and then performing effective code repair within a "limited scale" becomes problematic. The search space for SMT might still become too large or the logical relationships too convoluted for efficient solving.
  • Future Outlook: The paper suggests that more powerful LLMs (with better logical reasoning) and advanced SMT analysis tools could potentially address these limitations, further improving QiMeng-Xpiler's capability in handling such highly complex cases. This indicates an area for future research and development.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully introduces QiMeng-Xpiler, a pioneering neural-symbolic synthesis approach for transcompiling tensor programs across heterogeneous deep learning systems (DLS). By strategically combining the code generation prowess of LLMs with the correctness guarantees of small-scale symbolic program synthesis, QiMeng-Xpiler addresses the long-standing challenge of "Write Once, Run Anywhere" for tensor programs. The system's unique architecture, involving LLM-assisted transformation passes and a hierarchical auto-tuning mechanism, not only ensures high functional accuracy (averaging 95%) but also achieves remarkable performance (up to 2.0x over manually optimized libraries) and unprecedented programming productivity gains (up to 96.0x). The extensive evaluation across four diverse DLS underscores its effectiveness and broad applicability.

7.2. Limitations & Future Work

The authors acknowledge a primary limitation:

  • Complex Control Flow: QiMeng-Xpiler struggles with tensor programs exhibiting highly complex control flow, such as the Deformable Attention example with multiple nested loops and conditional statements. In these cases, the LLM component fails to generate accurate mappings to SIMD intrinsics, and the SMT solver struggles to extract and solve the necessary mathematical constraints within its "limited scale."

Future Work:

  • The authors propose that more powerful LLMs (implying models with enhanced logical reasoning capabilities) and advanced SMT analysis tools could be promising avenues to overcome this limitation and improve QiMeng-Xpiler's ability to handle such complex cases. This suggests continuous improvement in both the neural and symbolic components.

7.3. Personal Insights & Critique

7.3.1. Novelty and Significance

The primary novelty of QiMeng-Xpiler lies in its rigorous integration of neural and symbolic methods for the specific domain of tensor program transcompilation across heterogeneous DLS. While neural-symbolic approaches exist in other areas, their application to this highly specialized and performance-critical domain, addressing challenges like diverse programming models (SIMT vs. SIMD), memory hierarchies, and specialized intrinsics, is a significant contribution. The problem of transcompiling low-level, performance-sensitive code is much harder than high-level language translation, where LLMs have seen success. The explicit use of SMT to repair LLM imperfections, making neural-symbolic computationally tractable, is a clever design choice.

The paper's significance is profound. By providing a truly automatic and high-quality transcompiler for DLS, QiMeng-Xpiler has the potential to:

  • Democratize DLS Development: Lower the barrier to entry for developers to utilize specialized hardware, reducing the need for deep expertise in every DLS's idiosyncratic programming model.
  • Accelerate Innovation: Enable faster deployment of deep learning models on new or specialized hardware, as porting code becomes less of a bottleneck.
  • Improve Hardware Utilization: Maximize the efficiency of heterogeneous DLS in data centers by generating highly optimized code.

7.3.2. Strengths

  • Hybrid Approach Effectiveness: The core strength is the effective synergy between LLMs and SMT. LLMs provide the flexibility and breadth for initial code generation, while SMT provides the precision and correctness guarantees essential for low-level system programming.
  • High Accuracy and Performance: Achieving 95% average accuracy and up to 2.0x performance over vendor-optimized libraries is exceptional. This level of quality makes the tool practically viable, unlike many LLM-only solutions.
  • Hierarchical Auto-tuning: The intra-pass and inter-pass auto-tuning is crucial for performance. Simply translating code correctly is not enough; it must also be fast. This mechanism systematically explores optimizations that human experts would typically perform.
  • Comprehensive Evaluation: The experiments on four distinct DLS and 21 deep learning operators with 8 shapes (168 test cases) are extensive and demonstrate the robustness and generality of QiMeng-Xpiler.
  • Productivity Improvement: The reported productivity gains (up to 96x) are staggering and directly address a major pain point in the industry.

7.3.3. Weaknesses and Potential Improvements

  • Scalability of SMT for Complexity: While the paper states SMT is used for "limited scale" repair, the identified failure case (complex control flow in Deformable Attention) suggests that "limited scale" might still be a bottleneck for truly intricate code. Future work could explore more advanced SMT-solver techniques or hybrid approaches that simplify complex logical expressions before feeding them to SMT.
  • Compilation Time: An average compilation time of 3.7 hours (and up to 7.8 hours) is significant. While a massive improvement over manual coding, it's not "on-the-fly" transcompilation. For iterative development, this could still be a factor. Optimizing the auto-tuning and SMT phases, perhaps through more aggressive LLM-guided pruning or learning from past tuning results, could be beneficial.
  • Dependency on LLMs: The quality of the initial LLM-generated sketch is still critical. If the LLM consistently produces very poor sketches, the SMT component might be overwhelmed or the auto-tuning process might take too long. Continuous improvements in LLM code generation, especially for specialized domains, will directly benefit QiMeng-Xpiler.
  • Maintenance of Meta-prompts and Manuals: The system relies on meta-prompts and access to programming manuals. As DLS evolve, these inputs would need continuous updates, which could incur maintenance costs, albeit less than manual transcompilation.
  • Dynamic Program Behavior: The current approach likely focuses on static analysis and optimization. Handling programs with highly dynamic behavior or runtime-dependent optimizations might pose new challenges.

7.3.4. Open Questions and Future Impact

  • Generality to New DLS: How easily can QiMeng-Xpiler adapt to entirely new DLS beyond the four evaluated ones? The neural-symbolic framework seems robust, but the quality of programming manuals and the LLM's ability to generalize to new intrinsics would be key.

  • Integration with Higher-Level Frameworks: Can QiMeng-Xpiler be integrated directly into deep learning frameworks (e.g., PyTorch, TensorFlow) to automatically transcompile intermediate representations or higher-level graph definitions into optimized DLS code? This would further enhance productivity.

  • Formal Verification of LLM-generated Sketches: While SMT verifies snippets, could there be methods to formally verify properties of the LLM-generated sketches themselves, reducing the burden on SMT or unit testing?

  • User Interface and Debugging Experience: While QiMeng-Xpiler is automatic, what is the user experience for debugging (as mentioned in the productivity section) when it doesn't achieve 100% correctness? How intuitive are the error messages or repair suggestions?

Overall, QiMeng-Xpiler presents a compelling solution to a critical problem in deep learning hardware utilization. Its neural-symbolic paradigm, combined with auto-tuning, marks a significant step towards truly automated and efficient programming for diverse DLS.
