HStencil: Matrix-Vector Stencil Computation with Interleaved Outer Product and MLA
TL;DR Summary
HStencil accelerates stencil computations on modern CPUs by employing hybrid matrix-vector micro-kernels, interleaved instruction execution for enhanced instruction-level parallelism, and spatial prefetching. This approach achieves up to a 4.75x speedup over auto-vectorization, significantly outperforming the prior state-of-the-art matrix-only method by up to 91%.
Abstract
HStencil: Matrix-Vector Stencil Computation with Interleaved Outer Product and MLA
Han Huang, Jiabin Xie, Guangnan Feng, Xianwei Zhang, Dan Huang, Zhiguang Chen, and Yutong Lu (CSE, Sun Yat-sen University, Guangzhou, China)
Stencil computations are fundamental to many intelligent computing applications, often accounting for a substantial portion of their execution time. The emergence of specialized matrix units presents new opportunities to accelerate stencil computations. Scalable matrix compute units provide abundant computing power through outer product operations, but prior efforts fail to fully utilize them for stencils due to suboptimal matrix unit utilization, insufficient instruction-level parallelism (ILP), and low cache hit rates.
In-depth Reading
1. Bibliographic Information
- Title: HStencil: Matrix-Vector Stencil Computation with Interleaved Outer Product and MLA
- Authors: Han Huang, Jiabin Xie, Guangnan Feng, Xianwei Zhang, Dan Huang, Zhiguang Chen, and Yutong Lu.
- Affiliations: All authors are affiliated with the CSE (Department of Computer Science and Engineering) at Sun Yat-sen University, Guangzhou, China.
- Journal/Conference: The paper does not explicitly state the publication venue. However, based on its content, structure, and references to recent top-tier conferences like SC '24 and PPoPP '24, it is characteristic of a high-quality submission to a leading conference in High-Performance Computing (HPC) or Parallel Programming.
- Publication Year: The paper references materials from 2024, indicating it was likely written and submitted in 2024 for publication in late 2024 or 2025.
- Abstract: The paper introduces HStencil, a framework to accelerate stencil computations on modern CPUs equipped with both scalable vector and matrix compute units. Stencil computations are critical for many applications but are not efficiently accelerated by existing methods that use new matrix units. Prior work suffers from poor matrix unit utilization, low instruction-level parallelism (ILP), and poor cache performance. HStencil addresses these issues with three main contributions: (1) hybrid micro-kernels using both matrix and vector units, (2) fine-grained instruction scheduling to increase ILP, and (3) a spatial prefetching mechanism for large, out-of-cache problems. The framework achieves significant speed-ups (1.81x–4.75x) over standard auto-vectorization and outperforms the state-of-the-art by up to 91%.
- Original Source Link: /files/papers/68ec50cd00e47ee3518bc90b/paper.pdf. This appears to be a direct link to the paper's PDF, likely a preprint or a version submitted for review.
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: Stencil computations, a fundamental pattern in scientific and intelligent computing, are performance-critical. Modern CPUs are introducing powerful specialized matrix units to accelerate matrix-heavy workloads like those in AI. However, existing methods for stencil computation fail to efficiently harness the full power of these new units.
- Key Gaps in Prior Work: The state-of-the-art approach for using matrix units (STOP) suffers from three major performance bottlenecks:
- Suboptimal Hardware Utilization: Especially for "star" shaped stencils, which have sparse dependency patterns, the matrix units are poorly utilized.
- Insufficient Instruction-Level Parallelism (ILP): The computation is often bottlenecked by either matrix, vector, or memory instructions, leaving other processor pipelines idle.
- Low Cache Hit Rates: The 2D data access patterns of matrix-based stencil methods do not align well with the 1D-optimized caching and prefetching mechanisms in standard CPUs, leading to poor performance on large datasets that do not fit in the cache.
- Paper's Innovation: HStencil introduces a novel hybrid approach that intelligently combines the strengths of both scalable matrix units (using outer product operations) and scalable vector units (using multiply-accumulate operations) within the same computation kernel. This co-design allows it to overcome the limitations of matrix-only approaches.
- Main Contributions / Findings (What):
- The paper presents HStencil, a comprehensive framework for high-performance stencil computation on modern CPUs. Its primary contributions are:
- Hybrid Matrix-Vector Micro Kernels: Novel compute kernels that use matrix instructions for dense parts of the stencil and vector instructions for sparse parts, coupled with an efficient in-place accumulation technique to avoid redundant memory operations.
- Fine-Grained Instruction Scheduling: A scheduling strategy that interleaves matrix, vector, and memory (load/store) instructions to maximize ILP and hide operational latencies, keeping all parts of the processor busy.
- Spatial Prefetch Strategy: A software-based prefetching technique that manually guides the CPU's cache to better handle the 2D memory access patterns of stencil computations, significantly improving performance for out-of-cache problems.
- Key Findings: HStencil demonstrates superior performance, achieving speed-ups of 1.81x to 4.75x over compiler auto-vectorization. It also outperforms the previous state-of-the-art matrix-based method (STOP) by 31% to 91%. Furthermore, the framework's principles are shown to be portable to other modern architectures, such as the Apple M4 CPU.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Stencil Computations: A class of algorithms common in scientific simulations (e.g., solving differential equations) where each point in a multi-dimensional grid is updated iteratively based on the values of its neighbors. The pattern of neighbors is called the "stencil." A minimal example follows below.
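To make the pattern concrete, here is a minimal plain-C sweep of a 2D 5-point star stencil (the shape behind benchmarks such as Heat-2D). The function and coefficient names are illustrative, not taken from the HStencil source.

```c
/* One sweep of a 2D 5-point star stencil over an n x n grid (flat,
 * row-major layout). Each interior point is updated from its own value
 * and its four axis-aligned neighbors. */
#include <stddef.h>

void star5_sweep(size_t n, const double *in, double *out, const double c[5])
{
    for (size_t i = 1; i < n - 1; i++)
        for (size_t j = 1; j < n - 1; j++)
            out[i * n + j] = c[0] * in[i * n + j]          /* center */
                           + c[1] * in[(i - 1) * n + j]    /* north  */
                           + c[2] * in[(i + 1) * n + j]    /* south  */
                           + c[3] * in[i * n + j - 1]      /* west   */
                           + c[4] * in[i * n + j + 1];     /* east   */
}
```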
- Box vs. Star Stencils: As shown in Figure 1, these are two common patterns. A Box Stencil considers all immediate neighbors in a square/cube, while a Star Stencil only considers neighbors along the coordinate axes, resulting in a sparser dependency pattern.
Figure 1: Diagram of the two common stencil patterns, Box Stencil and Star Stencil. Colors and arrows mark the center element and its neighborhood, corresponding to the differently shaped dependency regions in the matrix.
- Vector Units and SIMD: Single Instruction, Multiple Data (SIMD) is a form of parallelism where a single instruction can operate on multiple data points simultaneously. This is achieved using wide vector registers and specialized instructions, like Multiply-Accumulate (MLA), which performs A = A + B * C on entire vectors of numbers at once, as sketched below.
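A one-line illustration of the MLA semantics, using the ARM NEON 128-bit intrinsic as a stand-in (this compiles on AArch64 only; SVE/SME hardware applies the same semantics to wider, scalable registers):

```c
#include <arm_neon.h>

/* Vector multiply-accumulate: per lane, acc = acc + b * c. */
float64x2_t vector_mla(float64x2_t acc, float64x2_t b, float64x2_t c)
{
    return vfmaq_f64(acc, b, c);
}
```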
- Matrix Units: These are newer, more specialized hardware units designed to accelerate matrix operations. They are common in GPUs (NVIDIA Tensor Cores) and emerging CPUs (Intel AMX, ARM SME).
- Inner Product vs. Outer Product: These are two fundamental ways to perform matrix multiplication. Inner product (dot product) multiplies a row vector by a column vector. Outer product multiplies a column vector by a row vector to produce a matrix. The hardware targeted by HStencil (e.g., ARM's Scalable Matrix Extension) is optimized for outer product operations; a scalar sketch follows Figure 2.
Figure 2: A visual comparison of inner product (left) and outer product (right) operations. HStencil targets architectures optimized for outer products.
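The accumulating outer product that such matrix units execute in a single instruction can be written out in scalar C as follows (a sketch; the tile dimension `TILE` is an assumption for illustration):

```c
/* Accumulating outer product: C += a * b^T, the single-instruction
 * primitive of outer-product matrix units, spelled out in scalar C. */
#define TILE 8

void outer_product_acc(double C[TILE][TILE],
                       const double a[TILE], const double b[TILE])
{
    for (int i = 0; i < TILE; i++)
        for (int j = 0; j < TILE; j++)
            C[i][j] += a[i] * b[j];   /* column vector a times row vector b */
}
```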
- Gather vs. Scatter Paradigms:
- Gather: In this approach, for each output point, the necessary input neighbor values are "gathered" to compute the result. This naturally maps to vector MLA instructions.
- Scatter: In this approach, each input point "scatters" its contribution to all the output neighbors it affects. This maps well to outer product instructions, where a single input value can update a block of output points. Both paradigms are sketched below.
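The two paradigms for a 1D 3-point stencil, in plain C (illustrative only; the scatter version assumes `out` is zero-initialized):

```c
/* Gather: each output point collects its inputs (maps to vector MLA). */
void gather_1d(int n, const double *in, double *out, const double c[3])
{
    for (int i = 1; i < n - 1; i++)
        out[i] = c[0] * in[i - 1] + c[1] * in[i] + c[2] * in[i + 1];
}

/* Scatter: each input point pushes its contribution to every output it
 * affects (maps to outer product); out[] must start at zero. */
void scatter_1d(int n, const double *in, double *out, const double c[3])
{
    for (int i = 1; i < n - 1; i++) {
        out[i + 1] += c[0] * in[i];
        out[i]     += c[1] * in[i];
        out[i - 1] += c[2] * in[i];
    }
}
```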
- Instruction-Level Parallelism (ILP): The ability of a modern processor to execute multiple instructions from a single program thread in parallel by using different execution units (e.g., integer, floating-point, memory units) simultaneously. High ILP is crucial for performance.
- Previous Works & Differentiation:
- Vector-based Methods: Early work focused on optimizing stencil computations for vector units. DLT rearranged data in memory for better vectorization, and Temporal Vectorization fused computations across time steps. These methods are limited by the width of vector units.
- GPU Matrix-based Methods: On NVIDIA GPUs, works like TCStencil, ConvStencil, and LoRAStencil have successfully used Tensor Cores (which are inner-product-based matrix units) for stencils. However, these are GPU-specific solutions.
- CPU Matrix-based Method (STOP): This is the most direct predecessor and the state-of-the-art that HStencil compares against. STOP was the first to use outer product-based matrix units on CPUs for stencils. HStencil's motivation stems directly from STOP's weaknesses:
- STOP is inefficient for star stencils due to low matrix unit utilization.
- It does not effectively overlap computation and memory access, leading to low ILP.
- It performs poorly on out-of-cache problems.
- HStencil's Novelty: HStencil is the first framework to co-design stencil computations for hybrid execution on both matrix and vector units on a CPU. It doesn't treat them as alternatives but as cooperative partners, scheduling instructions across their distinct execution pipelines to maximize throughput. It is also the first to provide an efficient stencil implementation for the LX2 and Apple M4 CPUs.
4. Methodology (Core Technology & Implementation)
The HStencil framework is built on three pillars, as illustrated in the overview diagram below. The core idea is to process data in tiles, applying an optimized micro-kernel that leverages instruction scheduling and is supported by a smart prefetching strategy.
Figure: Overview of the HStencil framework. Data is processed tile by tile: tiles are loaded, computed by the micro-kernel (combining matrix, vector, and load/store instructions), instruction-scheduled, and spatially prefetched, and the results are written to the output data. The annotations map each stage to the corresponding sections on micro-kernel design, instruction scheduling, and spatial prefetching.
4.1. Hybrid Matrix-Vector Micro Kernel
The goal of the micro-kernel is to maximize hardware utilization while maintaining efficient, continuous memory access. This is especially challenging for sparse stencils like the star pattern.
- The Problem with Matrix-Only for Star Stencils: A star stencil's coefficient matrix is sparse (contains many zeros). Using an outer product on the full matrix is wasteful, as many computations result in zero. STOP's solution was to use a mix of outer products along rows and columns, but this leads to non-contiguous memory access, which is slow. The sparsity pattern is illustrated below.
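For instance, the coefficient matrix of a radius-1 2D star stencil has nonzeros only along the axes, so an outer product over the full block multiplies by the zero corners (symbol names here are our notation):

$$C_{\text{star}} = \begin{pmatrix} 0 & c_N & 0 \\ c_W & c_C & c_E \\ 0 & c_S & 0 \end{pmatrix}$$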
- Naive Hybrid Approach: A straightforward idea is to use matrix units (outer product) for the row-wise (outer-axis) computations and vector units (MLA) for the column-wise (inner-axis) computations.
Figure 7: The naive matrix-vector method in three steps: (1) an outer product writes into the matrix register, (2) an MLA (Multiply-Accumulate) instruction performs the vector multiply and accumulation, and (3) the accumulated result updates matrix B, with matrix elements loaded, stored, and updated step by step.
As shown in Figure 7, this approach has a major flaw: it requires an explicit, separate accumulation step. The results from the matrix unit (B') and the vector unit (B'') are stored back to memory and then re-loaded to be added together. This incurs significant overhead from redundant load and store operations. The total cost takes the form
$$T_{\text{naive}} = T_{\text{mat}} + T_{\text{vec}} + T_{\text{acc}},$$
where $T$ denotes execution time; in total, the naive kernel performs seven L1 load/store operations per row, each load costing $C_{ld}$ cycles and each store $C_{st}$ cycles.
- HStencil's Solution: In-place Accumulation: HStencil's key innovation is to eliminate the intermediate memory traffic. It accumulates the vector results directly into the matrix registers. This is cleverly achieved by using an outer product instruction, which can add its result to an existing matrix register.
Figure: In-place accumulation method. The vector computation result is directly accumulated into the matrix register using an outer product, avoiding intermediate stores and loads.
The vector unit computes its part of the stencil, and the result vector is then used in an outer product operation that adds it to the correct row of the matrix register holding the main results. This transforms a costly memory-based accumulation into a single, efficient instruction. The improved cost is
$$T_{\text{HStencil}} = \max(T_{\text{mat}}, T_{\text{vec}}),$$
where the max function appears because matrix and vector instructions can be executed concurrently, overlapping their execution time. The memory operations are reduced from 7 to 3 per row. A plain-C emulation follows below.
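A plain-C emulation of the trick as we read it from the paper: the vector unit's partial result row v is folded into one row of the matrix accumulator by an accumulating outer product with a one-hot selector vector. The names and the selector construction are our illustration, not the paper's literal code.

```c
#define TILE 8   /* assumed tile dimension, e.g., an 8x8 FP64 tile */

/* Fold vector result v into row `row` of the matrix accumulator via an
 * accumulating outer product acc += e * v^T, where e is one-hot. Only
 * row `row` changes, and v is never stored to and re-loaded from memory. */
void accumulate_row_inplace(double acc[TILE][TILE], const double v[TILE], int row)
{
    double e[TILE] = {0.0};
    e[row] = 1.0;                      /* one-hot selector vector */
    for (int i = 0; i < TILE; i++)
        for (int j = 0; j < TILE; j++)
            acc[i][j] += e[i] * v[j];  /* hardware does this in one instruction */
}
```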
- Multi-register Kernel: To achieve peak theoretical performance, modern matrix units require several independent instructions to be in flight simultaneously. HStencil achieves this by unrolling the outer loop (j axis) to process multiple tiles of data in each iteration, thus utilizing four or more matrix registers at once. To manage the increased memory pressure from unrolling, it uses vector EXT instructions to reuse overlapping data between adjacent tiles. A sketch of the unrolled structure follows.
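A sketch of the unrolled structure: four independent accumulator tiles receive four independent outer products per iteration, so the matrix pipeline always has work in flight. The tile size and helper function are assumptions, not the paper's code.

```c
#define TILE 8

static void tile_outer_acc(double acc[TILE][TILE],
                           const double a[TILE], const double b[TILE])
{
    for (int i = 0; i < TILE; i++)
        for (int j = 0; j < TILE; j++)
            acc[i][j] += a[i] * b[j];
}

/* Four independent accumulators: there is no dependence chain between
 * the four calls, so four outer-product instructions can overlap. */
void unrolled_outer_step(double acc[4][TILE][TILE],
                         const double a[4][TILE], const double b[TILE])
{
    tile_outer_acc(acc[0], a[0], b);
    tile_outer_acc(acc[1], a[1], b);
    tile_outer_acc(acc[2], a[2], b);
    tile_outer_acc(acc[3], a[3], b);
}
```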
4.2. Fine-grained Matrix-Vector Instruction Replacement and Scheduling
Simply having a good kernel is not enough; the instructions must be scheduled effectively to maximize ILP.
- Vector Instruction Replacement: The hybrid kernel can create an imbalance where either the matrix or vector pipeline is the bottleneck.
- MLA Replacement: If vector MLA instructions become the bottleneck, HStencil selectively reverts some of them back to outer product instructions. Even if an outer product is slower for a specific sub-task, using it can free up the vector pipeline and leverage idle matrix units, improving overall throughput.
- EXT Instruction Replacement: Similarly, vector EXT instructions (for data reuse) can compete with MLA instructions for the same pipeline. HStencil replaces some EXT instructions with direct memory load instructions. Since the data is likely in the L1 cache, this extra load is cheap and helps balance pipeline usage.
- Matrix-Vector Instruction Scheduling: HStencil proposes a schedule to overlap computation, data preparation, and memory operations.
Figure 10: Comparison of an unscheduled pipeline (a) with HStencil's scheduled pipeline (b). Bubbles (idle cycles) are removed, and store operations are overlapped with computation.
The key ideas, sketched in code after this list, are:
- Overlap Loads and Computation: Load the data for the next iteration while computing on the current data.
- Overlap Matrix and Vector Work: Execute matrix outer product instructions concurrently with vector EXT and MLA instructions.
- Scatter Store Operations: Instead of storing all results in a large burst at the end of the tile computation (which stresses the memory system), HStencil stores rows as soon as they are fully computed. A row i is ready to be stored after the computation has processed all input rows that contribute to it. This spreads memory traffic evenly throughout the execution.
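A software-pipelining skeleton of these three ideas; every helper here is a hypothetical stand-in for the real matrix, vector, and memory operations.

```c
/* Hypothetical stand-ins for the real tile operations. */
static void load_row(int k)     { (void)k; /* issue loads for input row k        */ }
static void compute_row(int k)  { (void)k; /* outer products + EXT/MLA for row k */ }
static int  row_complete(int k) { (void)k; /* all contributions to row k landed? */ return 1; }
static void store_row(int k)    { (void)k; /* write back finished output row k   */ }

void scheduled_tile(int rows)
{
    load_row(0);                         /* prologue: first row's loads */
    for (int k = 0; k < rows; k++) {
        if (k + 1 < rows)
            load_row(k + 1);             /* next row's loads overlap ...      */
        compute_row(k);                  /* ... with this row's compute       */
        if (row_complete(k))
            store_row(k);                /* scatter stores across iterations  */
    }
}
```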
4.3. Spatial Prefetch for Matrix Units
For large stencils that do not fit in the cache ("out-of-cache"), memory access becomes the dominant bottleneck.
- The Problem: Standard CPU hardware prefetchers are optimized for 1D, linear memory access. The tiled, 2D access pattern of HStencil's matrix-based approach confuses these prefetchers, leading to low cache hit rates.
Figure 11: Data access pattern of HStencil. The computation moves row by row through tiles, but from one tile to the next, the access pattern is non-contiguous from the hardware's perspective.
- HStencil's Solution: HStencil inserts explicit prefetch instructions into the code to manually guide the cache. In each iteration, it prefetches:
- The next row of the input matrix A that will be needed.
- The corresponding row of the output matrix B where the results will be stored.
This software-guided approach effectively creates a 2D prefetching effect, ensuring data is already in the L1 cache when it is needed and dramatically improving performance for out-of-cache workloads. A sketch using the compiler's prefetch builtin follows.
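A sketch of the idea with the GCC/Clang `__builtin_prefetch` builtin, which maps to the target's prefetch instruction; the one-tile-ahead distance and row-major layout are assumptions for illustration.

```c
#include <stddef.h>

/* Prefetch the next input row of A and the matching output row of B one
 * tile ahead of the computation, approximating the 2D access pattern
 * that the 1D hardware prefetcher cannot infer on its own. */
void prefetch_next_rows(int n, int tile_rows, int row,
                        const double *A, double *B)
{
    if (row + tile_rows < n) {
        __builtin_prefetch(&A[(size_t)(row + tile_rows) * n], 0, 3); /* for read  */
        __builtin_prefetch(&B[(size_t)(row + tile_rows) * n], 1, 3); /* for write */
    }
    /* ... micro-kernel computation for the current rows would go here ... */
}
```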
5. Experimental Setup
- Platforms:
- LX2 CPU: A high-performance CPU with 512-bit scalable vector units and 8x8 double-precision scalable matrix units.
- Apple M4 Pro CPU: A commercial CPU that also features scalable matrix units, used to demonstrate performance portability. It has a 128KB L1 data cache and a 4MB L2 cache.
- Baselines:
- Auto: Standard C++ code compiled with Clang 17 with auto-vectorization enabled. This is the main performance baseline.
- Vector-only: A hand-optimized stencil implementation using only scalable vector instructions.
- Matrix-only: The state-of-the-art method STOP [45], which uses only scalable matrix instructions.
- Evaluation Metrics:
- Speed-up: The ratio of the execution time of a baseline (usually Auto) to the execution time of the target method. It measures relative performance improvement.
- Instructions Per Cycle (IPC): The average number of instructions the CPU executes per clock cycle. A higher IPC indicates better utilization of the processor's parallel execution capabilities.
- L1 Cache Hit Rate: The percentage of memory requests that are successfully found in the L1 cache. A higher hit rate means fewer slow accesses to main memory.
- GStencil/s: Giga-Stencils per second. A measure of throughput, indicating how many billion stencil grid point updates are performed per second (see the formula after this list).
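This metric follows directly from its definition: for a 2D grid of N×N points advanced for k time steps in t seconds,

$$\text{GStencil/s} = \frac{N^2 \cdot k}{t \cdot 10^{9}}.$$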
- Benchmarks: A suite of common 2D and 3D stencil patterns with varying radii, including Heat-2D, Star-2D9P, Box-2D9P, Star-3D13P, and Box-3D75P.
6. Results & Analysis
6.1 In-cache Stencils
- Overall Performance Improvement:
Figure 12 (bar chart): Speed-ups of HStencil versus the state-of-the-art Vector-only and Matrix-only methods with 128×128 micro-kernels, normalized to auto-vectorization; HStencil clearly outperforms both across all benchmarks, with speed-ups approaching 4x.
Figure 12 shows that HStencil consistently and significantly outperforms both Vector-only and Matrix-only (STOP) methods across all stencil types. For star stencils, where Matrix-only struggles, HStencil's hybrid approach provides a large benefit (e.g., ~1.8x speed-up for Star-2D9P vs. ~1.4x for Matrix-only). For box stencils, HStencil still provides a clear advantage, achieving up to a 4.16x speed-up for 3D box stencils.
- Performance Breakdown:
Figure 13 (bar charts): Speed-ups on 2D star (a) and box (b) stencils across matrix sizes from 64×64 to 256×256; HStencil achieves the highest speed-up at every size, reaching about 3x.
Figure 13 demonstrates the value of each of HStencil's optimizations. Mat-only (STOP) provides a good baseline improvement over auto-vectorization. HStencil-nosched (the hybrid kernel without scheduling) already outperforms Mat-only, confirming the benefit of the hybrid matrix-vector approach. HStencil (with full instruction scheduling) provides the largest speed-up, showing that fine-grained scheduling is critical for unlocking the full potential of the hardware.
- Matrix-Vector Cooperation (ILP):
Figure (chart): IPC comparison of HStencil, Matrix-only, and Vector-only on 128×128 2D stencils; HStencil achieves the highest IPC in every case, indicating superior hardware utilization.
This result is crucial as it directly measures hardware efficiency. HStencil achieves a much higher IPC (up to 2.30) than Matrix-only (IPC < 1.6) and Vector-only (IPC ~ 1.8). This proves that by co-scheduling instructions to the separate matrix and vector pipelines, HStencil keeps the processor's execution units busier and achieves higher parallelism.
6.2 Out-of-cache Stencils
- Single-core Performance & Prefetching:
Figure 15 (chart): Speed-ups over auto-vectorization for HStencil with and without prefetching versus STOP at out-of-cache matrix sizes; HStencil with prefetching performs best, with speed-ups declining as the matrix size grows.
Figure 15 shows the critical impact of the spatial prefetch strategy. Without prefetching, HStencil's performance degrades as the matrix size increases and exceeds the cache capacity. With prefetching, HStencil sustains its high performance, achieving up to 42% higher speed-up than the non-prefetched version on the largest problem size.
The following transcribed table provides the underlying cache metrics that explain this performance gain.
Table 7: L1 cache metrics of box stencils.
| Matrix Size | L1 Hit Rate (w/o Prefetch) | L1 Hit Rate (with Prefetch) |
| :--- | :--- | :--- |
| 1024 × 1024 | 66.16% | 71.72% |
| 2048 × 2048 | 66.26% | 73.66% |
| 4096 × 4096 | 34.58% | 60.21% |
| 8192 × 8192 | 33.51% | 59.74% |
As the matrix size grows to 8192x8192, the L1 hit rate without prefetching plummets to ~33%. With spatial prefetching, the hit rate is maintained at nearly 60%, almost doubling the effectiveness of the L1 cache.
- Multi-core Scaling:
Figure 16 (line chart): Scaling of Box-2D9P at 8192×8192 from 1 to 32 cores, plotting GStencil/s against thread count for Vector-only, Matrix-only, and HStencil; HStencil shows a clear advantage at higher core counts.
Figure 16 shows that HStencil scales well on multi-core systems. On 32 cores, it achieves a throughput of 12.91 GStencil/s, significantly outperforming both Matrix-only (7.76 GStencil/s) and Vector-only (7.14 GStencil/s). This demonstrates that the single-core optimizations translate effectively to parallel environments.
6.3 Performance Portability on Apple CPUs
The paper evaluates HStencil on an Apple M4 Pro CPU to demonstrate that its principles are not limited to a single custom architecture. Although the M4 lacks dedicated vector MLA units, it has matrix MLA capabilities. HStencil was adapted to this architecture. The results (described in the paper, though the figure is not provided in the resources) show average speed-ups of 3.07x for box stencils and 1.90x for star stencils over auto-vectorization. This confirms the portability and general applicability of the hybrid computation and scheduling approach.
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully introduces HStencil, a pioneering framework for accelerating stencil computations on modern CPUs that feature both scalable vector and matrix units. By treating these units as cooperative partners rather than alternatives, HStencil overcomes the key limitations of previous matrix-only approaches. Its three core contributions—hybrid micro-kernels with in-place accumulation, fine-grained instruction scheduling, and spatial prefetching—combine to deliver substantial performance gains across a range of benchmarks, problem sizes, and hardware platforms (LX2 and Apple M4).
- Limitations & Future Work:
- Architecture Specificity: The most efficient "in-place accumulation" technique was not directly portable to the Apple M4 and required reverting to a less optimal "naive" method. This suggests that achieving peak performance still requires some architecture-specific tuning.
- Heuristic-based Scheduling: The instruction replacement and scheduling strategies appear to be based on expert analysis and heuristics. A more automated, model-driven approach could potentially find an even better balance of instruction mix and scheduling for different stencil patterns and architectures.
- Generalization: The paper focuses on box and star stencils. The framework's effectiveness on more complex or irregular stencil patterns remains an area for future exploration.
- Personal Insights & Critique:
- HStencil is a strong piece of systems research that provides a practical and well-engineered solution to a timely problem. The emergence of heterogeneous compute units within a single CPU core is a key architectural trend, and this paper offers a valuable blueprint for how to exploit them effectively.
- The paper’s methodology is rigorous, with a clear breakdown of performance contributions and thorough analysis of bottlenecks like ILP and cache behavior. The inclusion of results on two different modern CPUs adds significant weight to its claims of portability.
- A key takeaway for the field is the principle of co-designing for heterogeneous execution units. Simply offloading work to a new, powerful unit (like a matrix engine) is not enough. True performance gains come from understanding the interplay between all available resources and orchestrating their use in a balanced, parallel manner. HStencil is an excellent case study in this principle.