
Inference Performance of Large Language Models on a 64-core RISC-V CPU with Silicon-Enabled Vectors


TL;DR Summary

This study evaluates LLM inference performance on a 64-core RISC-V CPU with Silicon-Enabled Vectors, revealing significant throughput and energy efficiency improvements, particularly for smaller models. It offers practical insights for deploying LLMs on future heterogeneous computing platforms.

Abstract

Large Language Models (LLMs) are revolutionizing computing, but their inference is resource-intensive. This paper investigates the performance of LLMs on a novel 64-core RISC-V CPU equipped with Silicon-Enabled Vectors (SEV) — a new vector extension designed to complement RISC-V's ISA. Our motivation is to explore how SEV can accelerate LLM inference, particularly in the context of emerging, energy-efficient architectures. We benchmark three prominent LLMs (Llama2-7B, Llama2-13B, Llama2-70B) using a comprehensive suite of operations including matrix multiplication, attention mechanisms, and tokenization. Our methodology involves leveraging a custom-built, open-source inference engine that integrates seamlessly with the SEV ISA and implements hardware-specific optimizations. We conduct experiments across varying compute configurations, including different core counts and vector width settings, and measure performance in terms of throughput, latency, and energy efficiency. Our results demonstrate that SEV delivers significant performance improvements, with up to 1.8x speedup for the 7B model and 1.4x for the 13B model compared to a baseline RISC-V CPU without SEV. The 70B model, while showing less relative gain, still benefits from improved throughput and reduced energy consumption. Crucially, we find that LLM inference performance is highly sensitive to the vector width used. Our methodology identifies an optimal vector width that maximizes throughput for each model size, with wider vectors yielding better performance for smaller models. We also observe that the added hardware complexity of SEV increases energy consumption for non-vectorized workloads but offers substantial gains for workloads dominated by vectorizable operations, which is characteristic of LLM inference. These findings reveal that SEV is a viable and promising solution for accelerating LLM inference on multi-core RISC-V platforms, bridging the gap between energy efficiency and computational performance. The paper concludes with practical insights for architects and developers seeking to deploy LLMs on next-generation, heterogeneous computing platforms.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of the paper is the "Inference Performance of Large Language Models on a 64-core RISC-V CPU with Silicon-Enabled Vectors."

1.2. Authors

The authors are Adriano Marques Garcia, Giulio Malenza, Robert Birke, and Marco Aldinucci. Adriano Marques Garcia is a postdoctoral researcher at the University of Turin. His research interests include parallel stream processing, benchmarking, and HPC for AI. Giulio Malenza is a PhD student at the University of Turin, focusing on performance portability of parallel programming frameworks for scientific codes on accelerators. Robert Birke is a tenured Assistant Professor at the University of Turin, with research interests in virtual resource management, network design, workload characterization, and AI/big-data application optimization. Marco Aldinucci is a Full Professor and Head of the Parallel Computing research group at the University of Turin, specializing in parallel and distributed systems, HPC, Cloud, and large AI systems.

All authors are affiliated with the University of Turin, Corso Svizzera 185, Turin, 10149, Italy.

1.3. Journal/Conference

The paper is slated to appear in "Future Generation Computer Systems," a peer-reviewed journal focused on advanced computer systems, including parallel and distributed computing, high-performance computing, and emerging architectures. It is a reputable venue in computer science, particularly for research on future computing paradigms and systems.

1.4. Publication Year

The paper was received on April 5, 2025, revised on October 9, 2025, and accepted on November 3, 2025. It is scheduled to be published in 2025.

1.5. Abstract

This paper investigates the performance of Large Language Models (LLMs) during inference on a 64-core RISC-V CPU featuring Silicon-Enabled Vectors (SEV), also known as RISC-V Vector (RVV) extension. The primary motivation is to understand how these vector extensions can accelerate LLM inference on emerging, energy-efficient architectures. The authors benchmark three prominent LLMs (Llama2-7B, Llama2-13B, Llama2-70B - though the paper body refers to different models like LLaMA-3.2 and DeepSeek-LLM) using key operations like matrix multiplication, attention mechanisms, and tokenization. They employ a custom-built, open-source inference engine integrated with the RVV ISA and hardware-specific optimizations. Experiments are conducted across various configurations, including different core counts and vector width settings, measuring throughput, latency, and energy efficiency.

The results indicate that RVV delivers significant performance improvements, achieving up to 1.8x speedup for the 7B model and 1.4x for the 13B model compared to a baseline RISC-V CPU without RVV. While the 70B model shows less relative gain, it still benefits from improved throughput and reduced energy consumption. A crucial finding is that LLM inference performance is highly sensitive to the vector width, with an optimal width identified for each model size, where wider vectors generally benefit smaller models more. The paper also observes that while RVV's added hardware complexity increases energy consumption for non-vectorized workloads, it offers substantial gains for vectorizable operations, which characterize LLM inference. These findings suggest that RVV is a viable solution for accelerating LLM inference on multi-core RISC-V platforms, balancing energy efficiency and computational performance. The paper concludes with practical insights for architects and developers deploying LLMs on next-generation heterogeneous computing platforms.

The paper is provided as a "Journal Pre-proof" and carries the DOI: https://doi.org/10.1016/j.future.2025.108242

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the resource-intensive nature of Large Language Model (LLM) inference, particularly in the context of emerging hardware architectures. LLMs, such as those based on BERT and GPT, have revolutionized Natural Language Processing (NLP) with their advanced capabilities in understanding, generating, and summarizing human language. However, their exceptional performance comes at the cost of significant computational resources, with models often encompassing billions of parameters (e.g., Llama v3 with 400 billion parameters). The growing demand for fast response times in AI applications like text generation further exacerbates this challenge, necessitating more efficient and versatile hardware solutions.

This problem is highly important because it addresses the fundamental bottleneck in deploying advanced AI models broadly and sustainably. Traditional hardware architectures often struggle to deliver both high performance and energy efficiency required for widespread LLM adoption, especially for edge devices or embedded systems with tight power budgets. The paper highlights RISC-V (Reduced Instruction Set Computer-Five) as a compelling emerging architecture due to its open-source nature, flexibility, scalability, and potential for energy-efficient performance, setting it apart from traditional CISC (Complex Instruction Set Computing) architectures. The recent commercial availability of RISC-V processors equipped with RISC-V Vector (RVV) extensions further amplifies its significance, as vector instructions are critical for accelerating the parallel processing tasks inherent in LLM operations.

The paper's entry point and innovative idea revolve around exploring how the RVV extension on a 64-core RISC-V CPU can specifically accelerate LLM inference. It seeks to fill the gap in understanding the practical performance implications of RVV on real-world LLM workloads, particularly contrasting with micro-benchmarks that might not fully represent end-to-end inference behavior. The motivation is to explore if and how this novel combination of energy-efficient architecture and dedicated vector processing can bridge the gap between computational demands and resource constraints for next-generation AI.

2.2. Main Contributions / Findings

The paper makes several primary contributions and reaches key conclusions regarding LLM inference on RISC-V platforms with RVV:

  • Enabling RVV Support for PyTorch: The authors successfully built the PyTorch library with RISC-V Vector 0.7.1 support on the SOPHON SG2042 platform, leveraging OpenBLAS and BLIS backends. This addresses a critical software ecosystem gap for RISC-V in deep learning.
  • Comprehensive Evaluation of RVV Impact: They extensively explored the impact of RISC-V Vector instructions on the inference performance of diverse LLM models, including BERT, GPT-2 (various sizes), Gemma-2, LLaMA-3.2, and DeepSeek-LLM.
  • Scalability Analysis Across Linear Algebra Libraries and Configurations: The study analyzed the scalability of LLM inference across different linear algebra libraries (OpenBLAS, BLIS), PyTorch precision settings, and varying core counts on the SOPHON SG2042 64-core RISC-V processor.
  • Performance Gains with RVV: The RVV extension delivers significant performance improvements, with up to 1.8x speedup for smaller models (e.g., 7B) and 1.4x for 13B models compared to RISC-V without RVV. Even larger models (e.g., 70B) show throughput improvements and reduced energy consumption.
  • Sensitivity to Arithmetic Intensity and Batch Size: A critical finding is that RVV effectiveness is highly configuration-dependent and tied to the matrix shape and arithmetic intensity. Specifically, vectorization can degrade performance for memory-bound GEMM operations (e.g., with batch size $N=1$) due to increased memory bandwidth demands. However, RVV shows clear benefits in compute-bound regions, typically achieved with higher batch sizes.
  • Optimal Vector Utilization: The methodology identifies that LLM inference performance is sensitive to the vector width, with wider vectors generally yielding better performance for smaller models. More broadly, RVV is most beneficial in compute-bound configurations that avoid caching mechanisms and rely on natively supported data types.
  • Impact of Unsupported Data Types: Running LLMs with data types not natively supported by the hardware (e.g., bfloat16 on the RISC-V platform) leads to flat, non-scalable performance due to inefficient runtime conversions and reliance on single-threaded fallbacks. Converting to natively supported types (e.g., float32) is essential to unlock parallelism and vectorization benefits.
  • Validation through Roofline Modeling and Traced GEMM Timing: The authors validated their observations experimentally using roofline modeling and traced GEMM timing, which revealed performance bottlenecks invisible to synthetic micro-benchmarks. This emphasizes the importance of analyzing real-world workloads.
  • Practical Insights: The paper offers practical insights for architects and developers, highlighting that RVV is a viable and promising solution for accelerating LLM inference on multi-core RISC-V platforms when workload characteristics, threading behavior, and datatype are carefully aligned.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a foundational grasp of several computing and machine learning concepts is essential:

  • Large Language Models (LLMs):

    • Conceptual Definition: An LLM is a type of artificial intelligence model designed to understand, generate, and process human language. They are typically based on deep neural networks with billions of parameters, trained on vast amounts of text data. LLMs learn the statistical relationships and structures within language, enabling them to perform various Natural Language Processing (NLP) tasks such as text generation, translation, summarization, and question answering.
    • Transformer Architecture: The dominant architecture for LLMs is the Transformer, introduced in 2017. It relies on attention mechanisms to weigh the importance of different parts of the input sequence when processing each word. Key components include self-attention layers, which allow the model to consider all other words in a sequence when encoding a single word, and feed-forward networks.
    • Specific LLMs mentioned in the paper:
      • BERT (Bidirectional Encoder Representations from Transformers): An encoder-only Transformer model that learns contextual representations from both directions of a sentence. It excels in natural language understanding (NLU) tasks.
      • GPT-2 (Generative Pre-trained Transformer 2): A decoder-only Transformer model that predicts the next token in a sequence based on preceding context (causal modeling), making it highly effective for text generation.
      • Gemma-2: A "lightweight" open model from Google that enhances the decoder-only Transformer architecture with interleaved local-global attention and grouped-query attention, and is trained with knowledge distillation.
      • LLaMA-3.2: An open model series from Meta, improving upon previous LLaMA versions with higher quality training data and reinforcement learning with human feedback.
      • DeepSeek-LLM: Follows the LLaMA design with a pre-norm structure, RMSNorm function, SwiGLU activation, and rotary embedding for positional encoding.
  • LLM Inference Characteristics:

    • Inference: The process of using a trained LLM to make predictions or generate outputs based on new input data.
    • Prefill Stage: The first stage of LLM inference, where the entire input prompt (sequence) is processed at once. This stage typically involves larger matrix operations with high data reuse, making it compute-bound (performance limited by computational speed rather than memory access). The results are stored in a key-value (KV) cache.
    • Decode Stage: The second stage, where the model generates output tokens one at a time, autoregressively. Each new token is generated based on the current context and the KV cache. Operations here involve smaller matrix multiplications with limited data reuse, often making this stage memory-bound (performance limited by the speed of data transfer to/from memory, especially for small batch sizes).
    • Batch Size: The number of input sequences (prompts) processed simultaneously. A larger batch size can improve hardware utilization and shift memory-bound operations towards compute-bound, but latency-sensitive applications (e.g., interactive chat) often use a batch size of 1.
  • RISC-V Architecture:

    • Conceptual Definition: RISC-V (Reduced Instruction Set Computer-Five) is an open-standard Instruction Set Architecture (ISA) based on RISC principles. Unlike proprietary ISAs like x86 or ARM, RISC-V is freely available for anyone to use, modify, and implement. This open nature fosters innovation, customization, and allows for flexible, scalable, and potentially more energy-efficient hardware designs.
    • SOPHON SG2042 SoC: A System-on-Chip (SoC) featuring 64 RISC-V cores, divided into 16 clusters. Each cluster contains a XuanTie C920 4-core RISC-V CPU. It includes a hierarchical cache system (L1, L2 per cluster, L3 shared) and multiple DDR4-3200 memory controllers.
    • XuanTie C920 CPU: A high-performance 64-bit multi-core RISC-V CPU architecture supporting RV64GCV instruction set, including RISC-V Vector (RVV) extension version 0.7.1. It features a superscalar pipeline and operates up to 2 GHz.
  • RISC-V Vector (RVV) Extension:

    • Conceptual Definition: RVV is an optional ISA extension for RISC-V that enables Single Instruction, Multiple Data (SIMD) parallelism. SIMD allows a single instruction to operate on multiple data elements simultaneously, significantly accelerating operations that can be vectorized, such as those common in scientific computing, multimedia, and deep learning. RVV is highly configurable, allowing implementations to choose different vector register lengths and capabilities. The XuanTie C920 supports RVV v0.7.1 with 128-bit vector registers and various data types (FP16, FP32, FP64, INT8, INT16, INT32, INT64).
  • PyTorch:

    • Conceptual Definition: An open-source machine learning framework widely used for deep learning. PyTorch provides tools for building and training neural networks, including a flexible tensor library and automatic differentiation. It can delegate computationally intensive tasks to specialized low-level libraries (e.g., Intel MKL for Intel CPUs, cuDNN for NVIDIA GPUs, OpenBLAS/BLIS for BLAS operations) to optimize performance on specific hardware.
  • BLAS (Basic Linear Algebra Subprograms) Libraries:

    • Conceptual Definition: BLAS is a specification that defines a set of low-level routines for common linear algebra operations (e.g., vector addition, scalar multiplication, matrix multiplication). Optimized BLAS libraries (like OpenBLAS and BLIS) provide highly tuned implementations of these routines for specific CPU architectures, taking advantage of features like SIMD instructions and cache hierarchies to achieve high performance.
    • OpenBLAS: An open-source implementation of the BLAS and LAPACK standards, optimized for various processor architectures.
    • BLIS (BLAS-like Library Instantiation Software): A framework for rapidly instantiating high-performance BLAS-like libraries. It's known for its portability and ability to generate highly optimized kernels.
  • GEMM (General Matrix Multiplication):

    • Conceptual Definition: The operation $C = \alpha AB + \beta C$, where A, B, C are matrices and $\alpha, \beta$ are scalars. GEMM is a fundamental and computationally intensive operation in linear algebra, forming the backbone of many scientific simulations and deep learning workloads, including LLM inference. Its efficiency is critical for overall performance.
    • aten::addmm and aten::mm: These are PyTorch ATen backend operations that perform matrix multiplication (aten::mm) and matrix multiplication followed by addition (aten::addmm, i.e., $C = A \times B + C$). They are identified as dominant operations in LLM inference (a short sketch follows this list).
    • Arithmetic Intensity (AI): The ratio of floating-point operations (FLOPs) to memory traffic (bytes moved).
      • Compute-bound tasks have high AI (performance limited by the CPU's FLOPs capacity).
      • Memory-bound tasks have low AI (performance limited by memory bandwidth).
    • Fused Multiply-Add (FMA): A single instruction that performs both a multiplication and an addition (e.g., $a \times b + c$) in one step. FMA instructions are crucial for high-performance computing as they reduce instruction count, improve throughput, and can sometimes reduce rounding errors.
  • Roofline Model:

    • Conceptual Definition: A visual performance model that characterizes the maximum achievable performance of a computing system based on its peak floating-point performance ($Fp_{peak}$) and peak memory bandwidth ($B_{peak}$). It plots performance (FLOPs/second) against arithmetic intensity (FLOPs/byte). The "roof" consists of two parts: a horizontal line representing compute-bound performance (limited by $Fp_{peak}$) and a diagonal line representing memory-bound performance (limited by $B_{peak}$).
    • Machine Balance Point (BP): The arithmetic intensity at which a workload transitions from being memory-bound to compute-bound. It is calculated as $BP = \frac{Fp_{peak}}{B_{peak}}$. If a kernel's arithmetic intensity is below BP, it is memory-bound; otherwise, it is compute-bound.
  • Data Types:

    • FP32 (Single-Precision Floating-Point): Standard 32-bit floating-point format, widely supported, offers good precision.
    • BF16 (Bfloat16): 16-bit brain floating-point format, designed to offer a similar dynamic range to FP32 but with reduced precision. It's common in deep learning to save memory and accelerate computations on specialized hardware (e.g., TPUs, some GPUs) that natively support it.
    • FP16 (Half-Precision Floating-Point): 16-bit floating-point format, also used for memory savings and speedups, but with a smaller dynamic range and potentially less precision than BF16.
    • The SOPHON SG2042 with XuanTie C920 natively supports FP16, FP32, FP64, INT8, INT16, INT32, INT64, but not BF16.
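
To make the GEMM terminology above concrete, the following is a minimal sketch (not taken from the paper) showing the PyTorch calls that lower to aten::mm and aten::addmm, together with a helper that applies the paper's approximate arithmetic-intensity formula; the shapes are illustrative.

```python
import torch

M, K, N = 1024, 1024, 64            # illustrative GEMM shape
A = torch.randn(M, K)
B = torch.randn(K, N)
C = torch.randn(M, N)

out_mm = torch.mm(A, B)             # lowers to aten::mm:    A @ B
out_addmm = torch.addmm(C, A, B)    # lowers to aten::addmm: C + A @ B (alpha = beta = 1)

def gemm_arithmetic_intensity(M, K, N, bytes_per_elem=4):
    """Approximate FLOPs / bytes for C = A x B + C, following the paper's formula."""
    flops = 2 * M * N * K                                   # multiply-add count
    bytes_moved = bytes_per_elem * (M * N + N * K + K * M)  # A, B, C traffic
    return flops / bytes_moved

print(gemm_arithmetic_intensity(1024, 1024, 1))    # ~0.5 FLOP/byte: memory-bound decode shape
print(gemm_arithmetic_intensity(1024, 1024, 64))   # ~28 FLOP/byte: moves toward compute-bound
```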

3.2. Previous Works

The paper contextualizes its research by referencing existing literature that highlights the challenges and opportunities in optimizing deep learning workloads, especially on RISC-V architectures.

  • PyTorch on RISC-V:

    • Colonnelli et al. [9] described the initial porting of PyTorch v2.0 to the RISC-V ISA. However, their platform offered limited acceleration (only fused multiply-add (FMA) support), highlighting a gap in leveraging full vector capabilities which this paper addresses with RVV.
  • RISC-V with RVV Capabilities:

    • Brown et al. [3] evaluated RAJAPerf workloads on the SOPHON SG2042 SoC, the same platform used in this paper. They compared its performance against ARM and x86 architectures, observing that while the SG2042 delivered strong per-core performance for RISC-V, it was still outperformed by x86 CPUs in multi-threaded scenarios. They noted the importance of custom thread mapping strategies. This paper extends their work by focusing on LLM inference, a more complex and high-impact workload.
    • Lee et al. [26] also tested RAJAPerf kernels on a T-Head C906 single-core RISC-V CPU with RVV v0.7.1. They compared RVV performance against ARM NEON and SVE instruction sets and against SiFive U74 (RISC-V without RVV). Their findings showed that vectorized code on the C906 could outperform the U74 by about 80%, but noted the immaturity of RISC-V tooling and hardware.
  • GEMM Optimization on RISC-V:

    • Igual et al. [21] evaluated GEMM kernels on C906 and C910 T-Head RISC-V architectures (both RVV v0.7.1). They reported OpenBLAS with RVV achieving up to 80% performance gains for SGEMM kernels. This work builds on such findings by applying it to LLM context, noting that isolated kernel benchmarks might not reflect real-world LLM behavior.
    • Banchelli et al. [2] explored RISC-V long vector capabilities for batched GEMM (specifically for small matrices in earth sciences), achieving significant speedups. This highlights the potential of RVV for specific matrix sizes.
  • LLM Inference Optimization:

    • Liu et al. [29] focused on LLM inference optimization for edge devices using model-level techniques like quantization and pruning. This paper complements such work by analyzing low-level GEMM performance under architectural constraints like RVV and thread scalability.
  • Memory-Bound Application Optimization:

    • Olas et al. [33] evaluated the same SG2042 RISC-V platform using the NAS Parallel Benchmark Suite. They found that only embarrassingly parallel problems scaled up to 64 threads, and heavy manual optimizations were needed for memory-bound applications to achieve scalability, including vectorization. This paper tackles similar memory-bound challenges inherent in LLM inference.
  • Specialized Matrix Optimizations:

    • Pirova et al. [35] investigated accelerating banded matrices on RISC-V with RVV, focusing on structured sparsity to improve vector utilization. Their work differs by focusing on dense matrix multiplications (addmm) in LLMs where irregular shapes and small batch sizes pose different vectorization challenges.
  • SIMD/Vectorization on Other ISAs:

    • Several studies on ARM architectures (using NEON/SVE instructions) demonstrated substantial latency and throughput gains for deep learning workloads [31, 46]. Rossi et al. [40] reported up to 4.25x speedups for LLaMA-2 training with SVE. Other RISC-V efforts [8, 18] explored vectorized posit arithmetic and CNN inference, showing benefits of longer vectors and larger caches. These works collectively establish the general benefits of SIMD for deep learning, but this paper explicitly differentiates by focusing on the unique challenges and specific behavior of RVV on RISC-V for full LLM inference.

3.3. Technological Evolution

The field of NLP and AI has seen rapid evolution, primarily driven by LLMs. Starting from early language models in the 1980s, the field advanced significantly with the introduction of the Transformer architecture and attention mechanism by Google in 2017 [11]. This led to foundational models like BERT and GPT-2, which showcased unprecedented capabilities in language understanding and generation. Models have continuously grown in size, with Llama v3 reaching 400 billion parameters, demanding ever-increasing computational resources.

Concurrently, hardware architectures have been evolving to meet these demands. RISC-V emerged as an open-source alternative to proprietary ISAs, offering flexibility and customization for various applications, including specialized AI accelerators. The development and commercialization of RISC-V processors with vector extensions (RVV) mark a significant step towards enabling high-performance, energy-efficient AI computation on these platforms.

This paper's work fits within this technological timeline by being at the forefront of evaluating the practical applicability of a nascent yet promising hardware paradigm (RISC-V with RVV) for the most demanding AI workloads (LLM inference). It moves beyond theoretical discussions or micro-benchmarks to assess end-to-end LLM performance on real RISC-V silicon, addressing the critical need for efficient hardware solutions as LLMs become ubiquitous.

3.4. Differentiation Analysis

Compared to the main methods in related work, this paper's core differences and innovations lie in its comprehensive and practical evaluation specific to LLM inference on a real-world, multi-core RISC-V platform with RVV.

  1. Focus on Full LLM Inference: Many related works, especially those evaluating RISC-V vector capabilities, tend to focus on synthetic GEMM kernels [21, 2] or CNN inference [18] or generic HPC benchmarks [3, 26]. This paper distinctively evaluates full, real-world LLM inference for a range of Transformer-based models (BERT, GPT-2, Gemma, LLaMA, DeepSeek). This approach is crucial because, as the paper demonstrates, isolated kernel performance doesn't always translate to end-to-end model performance due to factors like varying tensor shapes, memory access patterns, non-linear activations, and control-flow logic.

  2. Real Silicon Evaluation: Unlike studies that might rely on simulators [18], this work uses an in-silicon RISC-V platform (SOPHON SG2042 with XuanTie C920 and RVV v0.7.1). This provides concrete, experimentally validated results that reflect actual hardware behavior, including its nuances and limitations (e.g., BF16 support, memory bandwidth constraints).

  3. Detailed Bottleneck Analysis with Roofline Model: The paper goes beyond mere performance numbers by employing roofline modeling and detailed traced GEMM timing to deeply understand performance bottlenecks. It identifies and explains why RVV can sometimes degrade performance (e.g., in memory-bound $N=1$ scenarios), a finding that contrasts with the generally consistent SIMD gains reported for other ISAs like ARM SVE [40]. This detailed analysis of arithmetic intensity and machine balance point provides crucial insights into when and why RVV is effective.

  4. Software Stack Integration: The work involves the significant effort of building PyTorch with RVV-enabled OpenBLAS and BLIS for the specific RISC-V v0.7.1 architecture. This practical contribution addresses the immaturity of the RISC-V software ecosystem, which is often a hurdle for adopting new architectures.

  5. Investigation of Practical Factors: The paper analyzes the impact of practical factors like attention caching (KV cache) and unsupported data types (BF16) on RVV performance. It reveals that optimizations beneficial in one context (e.g., KV cache reducing operations) might inadvertently limit the effectiveness of vectorization, and that data type compatibility is paramount.

    In essence, while related works explore components or general benchmarks, this paper provides a holistic, rigorous, and practical investigation into the specific challenges and opportunities of running production-scale LLMs on RISC-V with RVV, offering unique insights into its performance characteristics and effective deployment strategies.

4. Methodology

4.1. Principles

The core idea of the method used in this paper is to thoroughly evaluate the inference performance of Large Language Models (LLMs) on a novel 64-core RISC-V CPU equipped with Silicon-Enabled Vectors (RVV) v0.7.1. The theoretical basis is that LLM inference is dominated by General Matrix Multiplication (GEMM) operations, which are highly parallelizable and thus potentially benefit significantly from SIMD (Single Instruction, Multiple Data) capabilities provided by RVV. The intuition is to quantify these benefits on a real hardware platform and understand the conditions under which RVV provides speedups, especially considering factors like arithmetic intensity, batch size, and software stack configurations. The authors aim to move beyond micro-benchmarks to assess end-to-end model performance.

4.2. Core Methodology In-depth (Layer by Layer)

The methodology involves a comprehensive approach, from building the software stack to detailed performance profiling and analysis using models like the roofline model.

4.2.1. Hardware Platform

The experiments were conducted on a Milk-V Pioneer Box, a commercial development platform. This system is equipped with:

  • A SOPHON SG2042 System-on-Chip (SoC) as the main processor.
    • The SG2042 SoC contains 64 RISC-V cores.
    • These cores are organized into 16 clusters, with each cluster comprising a XuanTie C920 4-core RISC-V CPU.
    • Each XuanTie C920 core has 64KB of L1 instruction and data cache.
    • Each 4-core cluster shares 1MB of L2 cache.
    • A unified 64MB L3 cache is shared among all 64 cores.
    • Four DDR4-3200 memory controllers manage access to the main memory.
    • The XuanTie C920 supports the RV64GCV instruction set, including RISC-V Vector (RVV) extension version 0.7.1.
    • The Vector Processing Unit (VPU) has 128-bit vector registers and supports FP16, FP32, FP64, INT8, INT16, INT32, and INT64 data types.
  • 128GB of DDR4 RAM operating at 3200MHz.
  • A 1TB PCIe 3.0 SSD.
  • The operating system used was Linux fedora-riscv 6.1.31.

4.2.2. Software Stack Preparation

To enable LLM inference and RVV support, a specialized software stack was built:

  • PyTorch Compilation: PyTorch v2.3 was compiled for Python v3.10.10.
    • The primary compiler used was Xuantie's GCC v13.2.
    • OpenMP v4.5 was enabled for multi-threading support.
    • Key build options for PyTorch included:
      • USE_OPENMP=1: To leverage the multiple cores on the SG2042 SoC.
      • USE_BLAS=1 and USE_LAPACK=1: To utilize BLAS libraries as the main computational backend for linear algebra.
      • USE_KINETO=ON: For profiling capabilities.
      • USE_NUMPY=ON: For NumPy support.
  • BLAS Library Compilation (with RVV support):
    • OpenBLAS: OpenBLAS v0.3.26 was compiled using the XuanTie GNU Compiler Toolchain v2.8.0. To enable RVV support, the TARGET was set to C910v.
    • BLIS: A modified version [32] based on BLIS v0.9.0 was used. This BLIS version requires an LLVM-based compiler, so it was compiled using LLVM-EPI v0.7. The target architecture was configured as rv64iv0p7 to match the RVV v0.7.1 standard of the SG2042 SoC.
    • The paper notes that due to different RVV standard versions (e.g., v0.7.1 vs v1.0), careful matching between the target architecture and a compatible compiler is required.
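
As a rough illustration of how such a build can be driven, the sketch below sets the build options listed above as environment variables and invokes PyTorch's source build; whether each option is consumed as an environment variable or a CMake flag, and the exact BLAS selector, are assumptions here rather than the authors' exact procedure.

```python
import os
import subprocess

build_env = dict(os.environ)
build_env.update({
    "USE_OPENMP": "1",     # exploit the 64 SG2042 cores via OpenMP
    "USE_BLAS": "1",       # delegate linear algebra to an external BLAS backend
    "USE_LAPACK": "1",
    "USE_KINETO": "ON",    # profiling support
    "USE_NUMPY": "ON",
    "BLAS": "OpenBLAS",    # assumed selector; point it at the RVV-enabled OpenBLAS or BLIS build
})

# Run from a PyTorch v2.3 source checkout, with XuanTie GCC 13.2 as the toolchain.
subprocess.run(["python", "setup.py", "develop"], env=build_env, check=True)
```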

4.2.3. Language Models Benchmarked

The study evaluates the inference performance of several prominent LLMs. Note that the abstract mentions Llama2-7B, Llama2-13B, and Llama2-70B, but the paper body explicitly lists the models in its Table 1 as those actually used. The following LLMs were used:

| Model | Parameters | Hidden Dim | # Layers | # Attention Heads | Context Length | FFN Dim | Data Type |
|---|---|---|---|---|---|---|---|
| GPT-2 | 137M | 768 | 12 | 12 | 1024 | 3072 | float32 |
| GPT-2 Medium | 380M | 1024 | 24 | 16 | 1024 | 4096 | float32 |
| GPT-2 Large | 812M | 1280 | 36 | 20 | 1024 | 5120 | float32 |
| GPT-2 XL | 1.61B | 1600 | 48 | 25 | 1024 | 6400 | float32 |
| BERT-large-cased | 335M | 1024 | 24 | 16 | 512 | 4096 | float32 |
| Gemma-2-2B | 2.61B | 2304 | 26 | 8 | 8192 | 9216 | float32 |
| LLaMA-3.2-1B | 1.24B | 2048 | 16 | 32 | 131072 | 8192 | bfloat16 |
| DeepSeek-LLM-7B-base | 7B | 4096 | 30 | 32 | 4096 | 11008 | bfloat16 |

These values are reproduced from Table 1 of the original paper.

4.2.4. Benchmarking Methodology for LLMs

  • Benchmark Applications: Simple Python scripts were written for each model, based on examples from the Hugging Face platform, to perform text generation.
  • Input Prompt: A standard input prompt "The quick brown fox ran" was used for all LLM experiments.
  • Output Generation: Models were configured to generate/predict the next $n$ tokens. For GPT-2, an example output was "The quick brown fox ran off. This fox needs to think for a while. This fox needs a rest."
  • Experiment Runs: All performance experiments reported the average of 10 runs, with standard deviation represented by whiskers in plots.
  • Configurations Tested: The performance was compared across five PyTorch configurations:
    1. PyTorch (default backend, without OpenBLAS/BLIS).
    2. PyTorch + OpenBLAS (without RVV support, generic RISC-V build).
    3. PyTorch + OpenBLAS (with RVV support).
    4. PyTorch + BLIS (without RVV support, generic RISC-V build).
    5. PyTorch + BLIS (with RVV support).
  • Varying Thread Counts: Experiments were conducted by varying the number of OpenMP threads (core counts) from 1 to 64 to analyze scalability.
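
A sketch of the kind of benchmark script described above (assumed, not the authors' exact code) is shown below: it loads one of the Hugging Face models from Table 1, generates a fixed number of tokens from the standard prompt, and averages the wall-clock time over 10 runs for each thread count.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"                         # any of the Table 1 models
PROMPT = "The quick brown fox ran"
NEW_TOKENS = 25
RUNS = 10

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()
inputs = tokenizer(PROMPT, return_tensors="pt")

for threads in (1, 4, 16, 32, 64):
    torch.set_num_threads(threads)     # mirrors varying the OpenMP thread count
    times = []
    for _ in range(RUNS):
        start = time.perf_counter()
        with torch.no_grad():
            model.generate(**inputs, max_new_tokens=NEW_TOKENS, do_sample=False)
        times.append(time.perf_counter() - start)
    print(f"{threads:2d} threads: {sum(times) / RUNS:.3f} s (avg of {RUNS} runs)")
```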

4.2.5. Microbenchmarking aten::addmm

Recognizing that matrix multiplication (MM) operations are dominant in LLM inference (e.g., aten::addmm and aten::mm accounting for 70-99% of CPU operations), a microbenchmark for aten::addmm was implemented.

  • Operation: aten::addmm performs $C = A \times B + C$, fusing a matrix multiplication and an addition.

  • Matrix Sizes: Two square matrix sizes were used:

    • 768 x 768: Reflecting the hidden dimension size of lighter models like GPT-2.
    • 4096 x 4096: Corresponding to heavier models like DeepSeek.
  • Purpose: To observe the direct impact of BLAS backends and RVV on a core linear algebra primitive.

    The following figure (Figure 2 from the original paper) shows the aten::addmm kernel performance:

    The figure is a chart showing the execution time of PyTorch and its optimized BLIS and OpenBLAS backends on 768×768 and 4096×4096 matrices across different OMP thread counts; as the thread count increases, the optimized backends markedly reduce execution time.

Figure 2: aten::addmm kernel performance using $768 \times 768$ and $4096 \times 4096$ matrices when using PyTorch with different BLAS runtimes over multiple thread counts.
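
A minimal version of this microbenchmark can be sketched as follows (an assumed setup, not the authors' code): torch.addmm dispatches to aten::addmm and thus to the linked BLAS GEMM, so timing it for the two square sizes across thread counts isolates the kernel the figure reports on.

```python
import time
import torch

def time_addmm(dim, threads, reps=10):
    torch.set_num_threads(threads)
    A, B, C = torch.randn(dim, dim), torch.randn(dim, dim), torch.randn(dim, dim)
    torch.addmm(C, A, B)                       # warm-up call
    start = time.perf_counter()
    for _ in range(reps):
        torch.addmm(C, A, B)                   # C + A @ B via the BLAS backend
    return (time.perf_counter() - start) / reps

for dim in (768, 4096):
    for threads in (1, 4, 16, 32, 64):
        print(f"addmm {dim}x{dim}, {threads:2d} threads: {time_addmm(dim, threads):.4f} s")
```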

4.2.6. Profiling GEMM Execution in Full LLMs

To understand the actual proportion of time spent on GEMM operations within a complete LLM inference run, profiling was performed.

  • Method: The PyTorch execution of the models was profiled. The ATen library was instrumented to identify direct calls to GEMM functions within the BLAS libraries. Routines were added to measure time before and after GEMM function calls, and PyTorch was recompiled with this modified ATen code.

  • Purpose: To ascertain how much of the total execution time is exclusively spent on GEMM operations, especially when increasing parallelism, as microbenchmarks alone can be misleading.

    The following figure (Figure 3 from the original paper) shows the time spent exclusively running GEMM computations:


Figure 3: Time spent exclusively running GEMM computations by GPT-2 models compared to the total execution time across different thread counts. The models were run with PyTorch + OpenBLAS with RVV enabled. * Both Llama and Deepseek originally run with Bfloat16 precision, which is not supported by the BLAS libraries and the architecture used on these experiments. While we are running Llama with float32 precision, we kept the original Bfloat16 for Deepseek.
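
The authors obtained these numbers by instrumenting ATen itself; a rough, stock-PyTorch approximation of the same measurement is sketched below, summing the self CPU time that the aten::addmm and aten::mm operators report during one generation pass.

```python
import torch
from torch.profiler import profile, ProfilerActivity
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
inputs = tok("The quick brown fox ran", return_tensors="pt")

with profile(activities=[ProfilerActivity.CPU]) as prof:
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=25, do_sample=False)

events = prof.key_averages()
total = sum(e.self_cpu_time_total for e in events)
gemm = sum(e.self_cpu_time_total for e in events if e.key in ("aten::addmm", "aten::mm"))
print(f"GEMM share of self CPU time: {100 * gemm / total:.1f}%")
```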

4.2.7. Analyzing Impact of Model Scaling, Caching, and Data Types

  • Model Scaling: The paper analyzed GPT-2 models of varying sizes (from gpt2 to gpt2-xl) to observe how performance characteristics change with increasing model complexity and parameter count.
  • Attention Caching (KV Caching): The use_cache parameter (from HuggingFace Transformers library) was examined. This parameter, enabled by default, reuses key and value projections from previous tokens in the attention mechanism. Experiments were run with use_cache=True and use_cache=False for gpt2 and gpt2-medium to understand its interaction with RVV performance.
  • Data Types: Models like LLaMA-3.2-1B and DeepSeek-LLM-7B-base use bfloat16 (BF16) by default. Since the RISC-V platform lacks native BF16 support, the LLaMA model was converted from torch.bfloat16 to torch.float32 (using the save_pretrained() method) to evaluate the impact of unsupported data types and the benefits of conversion.
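
The two knobs discussed in this subsection can be exercised with a few lines of standard transformers/PyTorch API, sketched below under the assumption of the public LLaMA-3.2-1B checkpoint identifier; the conversion path (cast to float32, then save_pretrained) avoids repeated runtime BF16-to-FP32 conversion, and use_cache toggles the KV cache during generation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-3.2-1B"                  # assumed Hub identifier; ships bfloat16 weights
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)
model = model.to(torch.float32)                   # cast weights to natively supported FP32
model.save_pretrained("llama-3.2-1b-fp32")        # reload later without runtime reconversion

tok = AutoTokenizer.from_pretrained(name)
inputs = tok("The quick brown fox ran", return_tensors="pt")
with torch.no_grad():
    out_cached = model.generate(**inputs, max_new_tokens=25, use_cache=True)   # reuse KV projections
    out_recomp = model.generate(**inputs, max_new_tokens=25, use_cache=False)  # recompute attention
```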

4.2.8. Roofline Model Analysis

To deeply understand GEMM performance and memory-bound vs. compute-bound behavior, the roofline model was applied.

  • Theoretical Peak Performance ($Fp_{Peak}^{32}$): The maximum single-precision floating-point performance was calculated using the formula: $ Fp_{Peak}^{32}\ [\mathrm{GFLOP/s}] = \#C \times CF \times \#FPC $ Where:

    • $\#C$: Number of total cores.
    • CF: Clock frequency, which is 2 GHz for the XuanTie C920 cores.
    • $\#FPC$: Number of FLOPs per cycle.
      • Without RVV: Assuming one FMA instruction per cycle, it's 2 FLOPs per cycle per core.
      • With RVV: The 128-bit vector registers can accommodate four 32-bit FMA operations per cycle, resulting in 8 FLOPs per cycle per core.
  • Memory Bandwidth ($B_{peak}$):

    • Theoretical maximum bandwidth for DDR4-3200 (4 controllers) is $25.6 \times 4 = 102.4$ GB/s.
    • Measured effective bandwidth was 41.2 GB/s with 64 cores.
  • Arithmetic Intensity (AI): For GEMM operations, the approximate arithmetic intensity is given by: $ AI \approx \frac{O(2 \times M \times N \times K)}{O(4 \times (M \times N + N \times K + K \times M))} $ Where:

    • M, N, K: Dimensions of the matrices in the $C = A \times B + C$ operation, with $A \in \mathbb{R}^{M \times K}$, $B \in \mathbb{R}^{K \times N}$, and $C \in \mathbb{R}^{M \times N}$.
    • $O(2 \times M \times N \times K)$: Represents the order of floating-point operations (approximately 2MNK for $A \times B$ plus MN for the addition, so $2MNK + MN$).
    • $O(4 \times (M \times N + N \times K + K \times M))$: Represents the order of memory accesses (reading $A$, $B$, $C$ and writing $C$).
    • For the specific case of $N=1$, common in LLM autoregressive decoding, the paper simplifies this to $AI \approx \frac{O(M \times K)}{O(M \times K)} \approx O(1)$. This $O(1)$ notation implies a constant arithmetic intensity that does not scale favorably with matrix size, thus indicating a memory-bound workload where the computational work is not significantly greater than the memory traffic.
  • Machine Balance Point (BP): $ BP = \frac{Fp_{peak}}{B_{peak}} \quad \left[ \frac{\mathrm{FLOP}}{\mathrm{Byte}} \right] $

    • If $AI < BP$, the kernel is memory-bound.
    • If $AI > BP$, the kernel is compute-bound.
  • Matrix Dimensions for Roofline: The roofline model was computed for GEMM with $M = 1024$, $K = 1024$, and varying $N = 1, 8, 64$.

  • Purpose: To visually identify whether GEMM kernels are memory-bound or compute-bound under different conditions (e.g., $N$ values, thread counts, RVV enabled/disabled) and understand the underlying reasons for observed performance.

    The following figure (Figure 9 from the original paper) shows the roofline models for RVV and no-RVV:


Figure 9: Comparison of roofline models for RVV and no-RVV on 1, 8, 32, and 64 cores, varying $N$ while keeping $M = K = 1024$.
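
Plugging the figures quoted in the text into these formulas gives a small worked example (a sketch; the values come from the platform numbers above, with the measured 41.2 GB/s bandwidth as the roof):

```python
CORES = 64
CLOCK_GHZ = 2.0
FLOPS_PER_CYCLE = {"no-RVV": 2, "RVV": 8}   # one scalar FMA vs. four FP32 FMAs per 128-bit register
B_PEAK = 41.2                               # measured effective bandwidth, GB/s

for label, fpc in FLOPS_PER_CYCLE.items():
    fp_peak = CORES * CLOCK_GHZ * fpc       # GFLOP/s
    bp = fp_peak / B_PEAK                   # machine balance point, FLOP/byte
    print(f"{label}: Fp_peak = {fp_peak:.0f} GFLOP/s, BP = {bp:.1f} FLOP/byte")

# With M = K = 1024, a GEMM at N = 1 has an AI of roughly 0.5 FLOP/byte, far below either
# balance point (memory-bound); at N = 64 the AI rises to roughly 28 FLOP/byte.
```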

4.2.9. Validation on GPT2-medium with Simulated GEMM Operations

To validate insights from the roofline model, a program was implemented to simulate only the GEMM operations performed during a full GPT2-medium inference pass.

  • Method: The program executed GEMM operations using the exact matrix shapes and in the same order as traced from the real GPT2-medium model.

  • Configurations: Two representative batch sizes were tested:

    • $N=1$: Corresponding to typical autoregressive decoding.
    • $N=64$: Simulating a higher-batch workload.
  • Purpose: To directly observe how RVV and non-RVV OpenBLAS perform for different batch sizes and thread counts, confirming whether memory-bound behavior for $N=1$ and compute-bound behavior for larger $N$ hold true in a more controlled, but still model-representative, environment.

    The following figure (Figure 10 from the original paper) shows the GEMM computation, considering two different sizes for N, and RVV enabled and disabled:


Figure 10: GEMM computation, considering two different sizes for N, and RVV enabled and disabled.
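
The replay idea can be sketched as follows; the (M, K) shapes listed here are hypothetical stand-ins for the traced GPT2-medium layer shapes, used only to show how N = 1 versus N = 64 and the thread count are swept over the same GEMM sequence.

```python
import time
import torch

# Hypothetical (M, K) pairs standing in for the traced GPT2-medium GEMM shapes.
TRACED_SHAPES = [(1024, 1024), (1024, 4096), (4096, 1024), (1024, 3072)]

def replay(n, threads, reps=5):
    torch.set_num_threads(threads)
    mats = [(torch.randn(m, k), torch.randn(k, n), torch.randn(m, n))
            for m, k in TRACED_SHAPES]
    start = time.perf_counter()
    for _ in range(reps):
        for A, B, C in mats:
            torch.addmm(C, A, B)            # same GEMM sequence, varying only N and threads
    return (time.perf_counter() - start) / reps

for n in (1, 64):
    for threads in (1, 16, 64):
        print(f"N={n:2d}, {threads:2d} threads: {replay(n, threads):.4f} s")
```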

5. Experimental Setup

5.1. Datasets

In the context of LLM inference, "datasets" refer to the models themselves and the input prompts used.

  • Source Models: The LLMs used were publicly available pre-trained models from various sources, primarily through the Hugging Face platform.
    • GPT-2 and its variants (gpt2, gpt2-medium, gpt2-large, gpt2-xl) are from OpenAI's profile on Hugging Face.
    • bert-large-cased from Google.
    • Gemma-2-2B from Google.
    • LLaMA-3.2-1B from Meta.
    • deepseek-llm-7b-base from DeepSeek-AI.
  • Characteristics: These models vary significantly in size, from 137M parameters for GPT-2 to 7B parameters for DeepSeek-LLM, and employ different architectures (encoder-only for BERT, decoder-only for GPT-variants, Gemma, LLaMA, DeepSeek). They also use different default data types (float32 for GPT-2, BERT, Gemma-2; bfloat16 for LLaMA-3.2, DeepSeek-LLM).
  • Input Data (Prompt): For all experiments involving text generation, a consistent input prompt was used: "The quick brown fox ran". An example of generated output from GPT-2 for this prompt was: "The quick brown fox ran off. This fox needs to think for a while. This fox needs a rest."
  • Purpose of Choice: These models were chosen to represent a range of LLM sizes and architectural characteristics, allowing for a comprehensive evaluation of RVV impact across different computational demands. The standard prompt ensures consistency across comparisons.

5.2. Evaluation Metrics

For every evaluation metric mentioned in the paper, the following provides a complete explanation:

  • Execution Time (seconds/milliseconds):

    1. Conceptual Definition: This metric directly measures the wall-clock time taken for a specific operation or a complete LLM inference task to finish. It quantifies how fast a task can be completed. Lower execution times indicate better performance.
    2. Mathematical Formula: Not explicitly a formula, but a direct measurement, often represented as $T_{execution}$.
    3. Symbol Explanation:
      • $T_{execution}$: The measured duration from the start to the end of the specified task.
  • Speedup (X factor):

    1. Conceptual Definition: Speedup quantifies the performance improvement of an optimized system or method compared to a baseline. It shows how many times faster the optimized version is.
    2. Mathematical Formula: $ \text{Speedup} = \frac{T_{\text{baseline}}}{T_{\text{optimized}}} $
    3. Symbol Explanation:
      • $T_{\text{baseline}}$: The execution time of the reference or unoptimized system/method.
      • $T_{\text{optimized}}$: The execution time of the optimized system/method being evaluated.
  • Arithmetic Intensity (AI):

    1. Conceptual Definition: Arithmetic Intensity is a fundamental metric in performance analysis, particularly relevant for the Roofline Model. It represents the ratio of the total number of floating-point operations (FLOPs) performed by a computational kernel to the total number of bytes transferred between memory and the processor during that kernel's execution. It helps determine whether a workload is compute-bound (high AI) or memory-bound (low AI).
    2. Mathematical Formula: For a General Matrix Multiplication (GEMM) of the form $C = A \times B + C$, where $A \in \mathbb{R}^{M \times K}$, $B \in \mathbb{R}^{K \times N}$, and $C \in \mathbb{R}^{M \times N}$, the paper provides an approximate AI formula: $ AI \approx \frac{O(2 \times M \times N \times K)}{O(4 \times (M \times N + N \times K + K \times M))} $ The paper further simplifies this for the case of $N=1$ (small batch size) to: $ AI \approx \frac{O(M \times K)}{O(M \times K)} \approx O(1) $
    3. Symbol Explanation:
      • $O(\cdot)$: Denotes the order of magnitude.
      • M, N, K: Dimensions of the matrices involved in the GEMM operation.
      • $2 \times M \times N \times K$: Represents the approximate number of floating-point operations (multiply-adds) for the matrix multiplication $A \times B$. The addition of $C$ adds MN operations, which is often small compared to 2MNK for large matrices.
      • $4 \times (M \times N + N \times K + K \times M)$: Represents the approximate number of bytes transferred. This includes reading matrices $A$, $B$, and $C$, and writing matrix $C$, assuming each element is 4 bytes (for float32).
      • $O(1)$: Indicates that for $N=1$, the arithmetic intensity tends to be a small constant, signifying that the workload is memory-bound and its AI does not grow proportionally to $M$ or $K$ in a way that would make it compute-bound.
  • Machine Balance Point (BP):

    1. Conceptual Definition: The Machine Balance Point is a critical threshold derived from the Roofline Model. It defines the minimum arithmetic intensity a computational kernel must possess to become compute-bound on a given hardware platform. If a kernel's AI is below BP, its performance will be limited by memory bandwidth (memory-bound). If its AI is above BP, its performance will be limited by the processor's peak computational capacity (compute-bound).
    2. Mathematical Formula: $ BP = \frac{Fp_{peak}}{B_{peak}} \quad \left[ \frac{\mathrm{FLOP}}{\mathrm{Byte}} \right] $
    3. Symbol Explanation:
      • $Fp_{peak}$: The peak floating-point performance of the processor in FLOPs/second.
      • $B_{peak}$: The peak memory bandwidth of the system in Bytes/second.
  • Instructions Per Cycle (IPC):

    1. Conceptual Definition: IPC measures the average number of instructions executed by a processor per clock cycle. It is a key indicator of a processor's efficiency and pipeline utilization. Higher IPC values generally indicate better processor performance, as more work is being done in each cycle.
    2. Mathematical Formula: Not explicitly given in the paper, but conceptually $IPC = \frac{\text{Number of Instructions}}{\text{Number of Clock Cycles}}$.
    3. Symbol Explanation: N/A (conceptual definition is sufficient here as no formula was provided).
  • Stalled Cycles Per Instruction:

    1. Conceptual Definition: This metric quantifies the average number of clock cycles during which the processor pipeline is stalled (not performing useful work) for each instruction executed. Stalls can be caused by various factors, such as cache misses, data dependencies, or branch mispredictions. Lower values indicate better pipeline efficiency.
    2. Mathematical Formula: Not explicitly given in the paper, but conceptually $\text{Stalled Cycles Per Instruction} = \frac{\text{Total Stalled Cycles}}{\text{Total Instructions Executed}}$.
    3. Symbol Explanation: N/A (conceptual definition is sufficient here as no formula was provided).
  • Cache Miss Rates (L1, LLC):

    1. Conceptual Definition: Cache miss rates measure the percentage of times the processor attempts to access data from a specific cache level (e.g., L1 cache, Last Level Cache (LLC)) but finds that the data is not present, requiring it to fetch data from a slower memory level. High cache miss rates indicate poor data locality and can lead to significant performance bottlenecks due to increased memory latency.
    2. Mathematical Formula: Not explicitly given in the paper, but conceptually $\text{Cache Miss Rate} = \frac{\text{Number of Cache Misses}}{\text{Total Cache Accesses}} \times 100\%$.
    3. Symbol Explanation: N/A (conceptual definition is sufficient here as no formula was provided).

5.3. Baselines

The paper's method (PyTorch with RVV-enabled OpenBLAS or BLIS) was compared against several baselines to evaluate the effectiveness of RVV and optimized BLAS libraries:

  • PyTorch Default Backend: This configuration represents the standard PyTorch installation without explicit linking to OpenBLAS or BLIS. It uses its internal ATen kernels or potentially falls back to less optimized CPU implementations. This serves as a general baseline for comparing the impact of using optimized BLAS libraries.

  • PyTorch with OpenBLAS (no RVV): This baseline uses the OpenBLAS library but compiled for generic RISC-V architectures without RVV vectorization enabled. This helps isolate the performance gains attributable solely to RVV instructions by comparing it against the RVV-enabled OpenBLAS configuration.

  • PyTorch with BLIS (no RVV): Similar to OpenBLAS (no RVV), this baseline uses BLIS compiled without RVV support. It allows for a comparison of the BLIS library's baseline performance and subsequently the RVV-specific gains when RVV is enabled for BLIS.

    By comparing against these baselines, the authors can systematically assess the individual and combined contributions of optimized BLAS libraries and RVV vectorization to LLM inference performance on the RISC-V platform.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results provide a detailed analysis of LLM inference performance on the RISC-V platform, highlighting the nuanced impact of RVV, different BLAS backends, thread counts, model sizes, caching mechanisms, and data types.

6.1.1. aten::addmm Kernel Performance (Microbenchmark)

The aten::addmm microbenchmark, a core GEMM operation, showed clear trends:

  • Optimized BLAS Benefit: Switching from PyTorch default to OpenBLAS or BLIS consistently improved performance. This underscores the importance of optimized linear algebra libraries.

  • RVV Additional Speedups: Enabling RVV with OpenBLAS or BLIS led to further performance gains, in some scenarios doubling the performance compared to their non-RVV counterparts. This demonstrates RVV's potential for dense matrix operations.

  • Scalability with Threads: Performance generally scaled well with increasing thread counts, with optimal performance observed around 16 and 32 threads.

  • Overhead for Small Matrices: For the smaller $768 \times 768$ matrix size, a performance loss was observed at 32 and 64 threads. This indicates that thread generation and handling overheads can outweigh the benefits of parallelization for smaller workloads.

    The following figure (Figure 2 from the original paper) presents the aten::addmm kernel performance:


Figure 2: aten::addmm kernel performance using $768 \times 768$ and $4096 \times 4096$ matrices when using PyTorch with different BLAS runtimes over multiple thread counts.

6.1.2. GEMM Operations Share of Total Execution Time

While GEMM operations are numerically dominant, their actual execution time share can vary significantly in full LLM inference:

  • Dominance in CPU Operations: aten::addmm and aten::mm can account for a substantial portion (e.g., 70% for GPT-2, 96% for BERT, over 99% for DeepSeek) of CPU operations, confirming their importance.

  • Declining Share with Parallelism: However, the percentage of total execution time spent exclusively on GEMM can drop significantly with increased parallelism, sometimes falling below 50% (e.g., GPT2-medium with 64 threads). This implies that as GEMM operations are accelerated, other parts of the LLM inference pipeline (memory access patterns, non-linear activations, normalization, control flow, thread synchronization, communication overhead) become new bottlenecks.

    The following figure (Figure 3 from the original paper) illustrates the time spent exclusively running GEMM computations:


Figure 3: Time spent exclusively running GEMM computations by GPT-2 models compared to the total execution time across different thread counts. The models were run with PyTorch + OpenBLAS with RVV enabled. * Both Llama and Deepseek originally run with Bfloat16 precision, which is not supported by the BLAS libraries and the architecture used on these experiments. While we are running Llama with float32 precision, we kept the original Bfloat16 for Deepseek.

6.1.3. Whole Model Inference Performance

The results for full LLM inference categorized performance into three recurring patterns:

  • BERT-like Behavior:

    • BERT showed predictable scaling: performance improved with increasing thread counts, and RVV provided clear benefits.
    • PyTorch's default backend degraded with parallelism, highlighting its inefficiency.
    • BLIS consistently achieved the best performance among all configurations for BERT.
  • Gemma-like Behavior:

    • Gemma-2 exhibited an "unexpected pattern": PyTorch + OpenBLAS performed significantly worse than BLIS and default PyTorch in the single-threaded configuration (up to 3x longer execution).
    • However, at 16 and 32 threads, PyTorch + OpenBLAS with RVV achieved the best execution times, demonstrating strong multi-threaded performance gains (26.7% speedup at 16 threads over non-RVV). The single-thread anomaly for OpenBLAS suggests potential library overheads or specific kernel choices impacting performance at low parallelism.
  • DeepSeek-like Behavior:

    • DeepSeek-LLM showed a complete lack of scalability, with execution times nearly identical across all configurations and thread counts.

    • This was attributed to the model's use of the BF16 data type, which is not natively supported by the RISC-V hardware or the BLAS libraries. The resulting runtime conversions to FP32 introduced significant overhead, negating any benefits from multithreading or RVV.

      The following figure (Figure 4 from the original paper) shows LLM inference performance:


Figure 4: LLMs inference performance when generating 25 tokens and using PyTorch with different BLAS runtimes over multiple thread counts.
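
A minimal wall-clock version of the "generate 25 tokens" experiment could look like the following sketch; the gpt2 model name and prompt are placeholders, greedy decoding is assumed, and the BLAS runtime is again determined by how PyTorch was built.

```python
# Minimal sketch: end-to-end time to generate 25 tokens for several intra-op
# thread counts. Timings will not match the paper's; this only shows the shape
# of the experiment.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def time_generation(model_name: str, threads: int) -> float:
    torch.set_num_threads(threads)
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()  # reloaded per run for simplicity
    inputs = tok("Vector extensions accelerate", return_tensors="pt")
    with torch.no_grad():
        start = time.perf_counter()
        model.generate(**inputs, max_new_tokens=25, do_sample=False)
        return time.perf_counter() - start

for threads in (1, 4, 16, 32, 64):
    print(f"threads={threads:2d} time={time_generation('gpt2', threads):.2f} s")
```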

6.1.4. Impact of Model Scaling on GPT-2 Performance

Evaluating different sizes of GPT-2 revealed varying behaviors:

  • PyTorch Default Degradation: The default PyTorch backend performed well for smaller gpt2 and gpt2-medium but showed severe degradation for gpt2-large and gpt2-xl.

    • For gpt2-medium, IPC was 0.95 with 0.67 stalled cycles/instruction.
    • For gpt2-large, IPC dropped to 0.37, and stalled cycles increased to 2.35, indicating severe backend and frontend bottlenecks.
    • Cache miss rates (L1 and LLC) increased significantly for gpt2-large, pointing to higher memory pressure.
  • Single-Threaded RVV Discrepancy: For gpt2-medium, the RVV-enabled OpenBLAS backend in the single-threaded configuration was over 3x slower than the non-RVV version. Further investigation revealed this was consistent across all common GEMM shapes (e.g., $1024 \times 1 \times 1024$), especially for $N=1$ autoregressive decoding. This highlighted that microbenchmarks can be misleading and real-world execution traces are crucial.

    The following figure (Figure 5 from the original paper) shows the inference execution time of various LLMs (including openai-community/gpt2 and its variants):


The following figure (Figure 6 from the original paper) shows the execution times of the recorded GEMM shapes for GPT2-medium:

Figure 6: Execution times of the recorded GEMM shapes for GPT2-medium, comparing RVV and no-RVV executions across thread counts, for $N=1$ (top) and $N=64$ (bottom).

6.1.5. Impact of RVV with and without Attention Caching (KV caching)

The use_cache parameter in Transformer models (which reuses key and value projections) was investigated:

  • Limited RVV Benefit with Caching: While use_cache=True significantly reduced instruction count, the runtime improvement with RVV execution was modest.

  • RVV Benefits from Recomputation: The performance gap between use_cache=True and use_cache=False was much smaller under RVV. For gpt2-medium, RVV + OpenBLAS showed only 1.2x speedup with caching, compared to 3.2x for non-RVV OpenBLAS and 11x for PyTorch default.

  • Conclusion: This suggests RVV vectorization is more effective when all attention layers are recomputed (use_cache=False), as this path provides higher computational intensity and better vector utilization. Caching, while generally beneficial, can reduce computational intensity in a way that limits RVV's potential.

    The following figure (Figure 7 from the original paper) shows the single-thread performance of GPT-2 with the transformers caching mechanism (KV cache) enabled and disabled:


Figure 7: Single-thread performance of GPT-2 with the transformers caching mechanism (KV cache) enabled and disabled.
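
The comparison above can be reproduced in spirit through the use_cache flag of transformers' generate(), as in the sketch below; the gpt2-medium checkpoint, prompt, and 25-token output are illustrative choices, and absolute timings will differ from the paper's.

```python
# Minimal sketch: contrast generation with the KV cache enabled and disabled.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2-medium")
model = AutoModelForCausalLM.from_pretrained("gpt2-medium").eval()
inputs = tok("Benchmarking attention caching", return_tensors="pt")

for use_cache in (True, False):
    with torch.no_grad():
        start = time.perf_counter()
        model.generate(**inputs, max_new_tokens=25,
                       do_sample=False, use_cache=use_cache)
    elapsed = time.perf_counter() - start
    print(f"use_cache={use_cache}: {elapsed:.2f} s")
```

With use_cache=False every step recomputes the full attention projections, which is exactly the higher-arithmetic-intensity path on which RVV pays off.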

6.1.6. Handling Unsupported Data Types: BF16 vs FP32 on RISC-V

The lack of native BF16 support on the RISC-V platform proved to be a major bottleneck:

  • Flat Performance with BF16: Models using BF16 (e.g., LLaMA-3.2-1B, DeepSeek-LLM) exhibited flat, non-scalable performance regardless of backend or thread count. PyTorch was forced to dispatch BF16 matrix multiplications to inefficient, mostly single-threaded non-BLAS kernels or NumPy-based implementations.

  • Unlocking Performance with FP32 Conversion: Converting the LLaMA model to FP32 significantly improved performance. With FP32, PyTorch correctly dispatched GEMM operations to multi-threaded BLAS libraries, enabling hardware vectorization. For instance, with four threads, PyTorch + OpenBLAS (no RVV) improved from 63s (BF16) to 38s (FP32).

  • Key Insight: Converting BF16 models to FP32 is essential on RISC-V platforms lacking native BF16 support. This avoids costly runtime fallbacks and unlocks thread-level parallelism and RVV benefits.

    The following figure (Figure 8 from the original paper) shows the inference performance of the LLaMA LLM on the SG2042 RISC-V CPU:


Figure 8: Inference performance of the LLaMA LLM on the SG2042 RISC-V CPU using PyTorch with different BLAS runtimes and thread counts. Results are shown for the model using its original torch.bfloat16 weight representation and after conversion to torch.float32.
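
A minimal sketch of the conversion step discussed above: the checkpoint is loaded in its original bfloat16 representation and then cast to float32 so that matrix multiplications can dispatch to the multi-threaded BLAS path. The model name is a placeholder (any bfloat16 checkpoint works), and this is not the authors' conversion script.

```python
# Minimal sketch: load a bfloat16 checkpoint and cast its weights to float32.
import torch
from transformers import AutoModelForCausalLM

name = "meta-llama/Llama-3.2-1B"   # placeholder; gated checkpoint, any bf16 model works
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)
print(next(model.parameters()).dtype)   # torch.bfloat16

model = model.to(torch.float32)         # cast all weights to float32
print(next(model.parameters()).dtype)   # torch.float32
```

Saving the converted model once with model.save_pretrained(...) avoids repeating the cast on every run.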

6.1.7. LLMs' GEMM Performance and Roofline Analysis

The roofline model provided crucial insights into GEMM behavior:

  • Theoretical Peak Performance: The following are the results from Table 2 of the original paper:

    Cores 1 4 8 16 32 64
    $Fp_{peak}$ (no RVV) 4 8 16 32 64 128
    $Fp_{peak}$ (with RVV) 16 64 128 256 512 1024
  • Memory Bandwidth: The measured effective memory bandwidth was 41.2 GB/s with 64 cores.

  • $N=1$ (Memory-Bound): For GEMM with $N=1$ (e.g., $M=K=1024$, $N=1$), the arithmetic intensity (AI) is very low ($AI \approx 0.49$). This places the operation deep into the memory-bound region, especially for RVV, which has a higher Machine Balance Point (BP) ($BP_{RVV} = 16/4 = 4$ FLOP/Byte for 1 core). In this scenario, the non-RVV implementation often outperformed the RVV one. This is because RVV instructions, by loading/storing multiple values simultaneously, place greater demands on memory bandwidth, which is already the bottleneck.

  • $N=8$ (Borderline): With $N=8$, the AI increases and RVV starts to surpass non-RVV performance, reaching close to the BP.

  • $N=64$ (Compute-Bound): At $N=64$, the computation fully enters the compute-bound region. Here, both RVV and non-RVV implementations could reach close to their respective $Fp_{peak}$ limits, with RVV showing significant benefits.

  • Scalability Challenges: As the number of cores increased (to 32 and 64), the work per thread decreased substantially. Even though the relative positions of AI with respect to BP remained similar, the performance gap to optimal increased. This indicated that overheads from data movement, thread synchronization, and increased memory contention outweighed the benefits of more compute capacity, particularly for small batch sizes like $N=1$.

    The following figure (Figure 9 from the original paper) shows the roofline models for RVV and no-RVV:


Figure 9: Comparison of roofline models for RVV and no-RVV on 1, 8, 32, and 64 cores, varying $N$ while keeping $M = K = 1024$.
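
To make the roofline reasoning concrete, the short sketch below recomputes the arithmetic intensity for an FP32 GEMM with $M = K = 1024$ under the minimal data-movement assumption that each of A, B, and C is moved to or from memory exactly once. It reproduces $AI \approx 0.5$ for $N=1$ (in line with the $\approx 0.49$ quoted above) and shows how the value crosses the single-core RVV balance point of 4 FLOP/Byte as $N$ grows.

```python
# Small sketch: arithmetic intensity of an FP32 GEMM with M = K = 1024,
# assuming each operand is transferred once (no cache reuse).
def gemm_arithmetic_intensity(m: int, n: int, k: int, bytes_per_elem: int = 4) -> float:
    flops = 2 * m * n * k                                   # one multiply-add = 2 FLOPs
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # A, B and C traffic
    return flops / bytes_moved

BP_RVV = 16 / 4  # single-core machine balance point with RVV, FLOP/Byte
for n in (1, 8, 64):
    ai = gemm_arithmetic_intensity(1024, n, 1024)
    print(f"N={n:3d}: AI = {ai:.2f} FLOP/Byte (RVV balance point = {BP_RVV:.0f})")
```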

6.1.8. Validation on GPT2-medium (Simulated GEMM)

The simulated GEMM operations for GPT2-medium validated the roofline insights:

  • $N=1$ (Memory-Bound Confirmation): For $N=1$, arithmetic intensity remained low. RVV vector instructions provided no performance benefit and often degraded performance due to increased memory bandwidth demands and register pressure. Non-RVV consistently outperformed RVV at low thread counts.

  • $N=64$ (Compute-Bound Confirmation): For $N=64$, arithmetic intensity rose significantly, pushing computation into the compute-bound region. RVV became effective, showing substantial improvements over non-RVV execution, especially at low-to-moderate thread counts (1-16 cores).

  • Diminishing Returns: At higher core counts (32 and 64), the performance gap narrowed or even reversed. This was attributed to diminishing returns from threading, increased contention on memory channels, and reduced per-thread workload, which limited vector register utilization.

  • Conclusion: RVV is primarily beneficial when arithmetic intensity is sufficiently high and the workload per core is large enough to amortize vectorization overhead. This highlights the challenge for typical LLM inference workloads dominated by $N=1$ small matrix multiplications.

    The following figure (Figure 10 from the original paper) presents the GEMM computation, considering two different sizes for N, and RVV enabled and disabled:


Figure 10: GEMM computation for GPT2-medium, considering two different sizes for $N$, with RVV enabled and disabled.
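
The sketch below mirrors the simulated-GEMM validation in a simplified form: it times an FP32 product with $M = K = 1024$ for $N = 1$ and $N = 64$ across thread counts. Whether the RVV path is used depends on which OpenBLAS/BLIS build PyTorch is linked against, so this script exposes only the shape and threading behaviour, not the RVV/no-RVV split itself.

```python
# Minimal sketch: time torch.mm for the two N regimes discussed above.
import time
import torch

def time_gemm(m: int, n: int, k: int, threads: int, repeats: int = 50) -> float:
    torch.set_num_threads(threads)
    a = torch.randn(m, k)
    b = torch.randn(k, n)
    torch.mm(a, b)                        # warm-up
    start = time.perf_counter()
    for _ in range(repeats):
        torch.mm(a, b)
    return (time.perf_counter() - start) / repeats

for n in (1, 64):
    for threads in (1, 8, 32, 64):
        avg = time_gemm(1024, n, 1024, threads)
        print(f"N={n:2d} threads={threads:2d} avg={avg * 1e3:.3f} ms")
```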

6.2. Data Presentation (Tables)

The following are the results from Table 1 of the original paper:

Model Parameters Hidden Dim # Layers # Attention Heads Context Length FFN Dim Data Type
GPT-2 137M 768 12 12 1024 3072 float32
GPT-2 Medium 380M 1024 24 16 1024 4096 float32
GPT-2 Large 812M 1280 36 20 1024 5120 float32
GPT-2 XL 1.61B 1600 48 25 1024 6400 float32
BERT-large-cased 335M 1024 24 16 512 4096 float32
Gemma-2-2B 2.61B 2304 26 8 8192 9216 float32
LLaMA-3.2-1B 1.24B 2048 16 32 131072 8192 bfloat16
DeepSeek-LLM-7B-base 7B 4096 30 32 4096 11008 bfloat16

The following are the results from Table 2 of the original paper:

Cores 1 4 8 16 32 64
$Fp_{peak}$ (no RVV) 4 8 16 32 64 128
$Fp_{peak}$ (with RVV) 16 64 128 256 512 1024

6.3. Ablation Studies / Parameter Analysis

The authors did not conduct formal ablation studies in the traditional sense (removing a component from their proposed method). However, their experimental design implicitly acts as a series of parameter analyses and comparisons that reveal the effectiveness of different components and configurations:

  • RVV On/Off Comparison: By systematically comparing performance with RVV enabled versus RVV disabled (for both OpenBLAS and BLIS), they demonstrated the direct impact of the vector extension. This showed that RVV provides benefits primarily in compute-bound scenarios.

  • Choice of BLAS Library: Comparing OpenBLAS and BLIS (both with and without RVV) against the PyTorch default backend evaluated the efficacy of different optimized BLAS implementations. This revealed that BLIS often performed better or scaled differently than OpenBLAS in certain configurations (e.g., BERT, Gemma-2 single-thread anomaly).

  • Varying Core Counts: Running experiments across 1, 4, 8, 16, 32, and 64 cores analyzed the scalability of LLM inference with and without RVV. This highlighted the diminishing returns at higher core counts due to thread overheads and memory contention, especially for memory-bound workloads.

  • Model Size Impact: Using four different sizes of GPT-2 (from 137M to 1.61B parameters) allowed for an analysis of how model complexity affects performance and RVV effectiveness. This revealed that larger models can expose different bottlenecks (e.g., increased memory pressure for PyTorch default).

  • Impact of use_cache Parameter: The comparison of LLM inference with and without the attention caching (KV cache) feature demonstrated how model-level optimizations interact with hardware vectorization. It showed that caching, while reducing overall computation, could inadvertently limit RVV utilization by reducing arithmetic intensity.

  • Data Type Analysis: The explicit comparison of BF16 and FP32 for LLaMA models highlighted the critical importance of native hardware support for data types. It showed that conversion to a supported FP32 data type was essential to unlock multi-threading and RVV benefits on the RISC-V platform.

    These analyses effectively break down the overall system performance, attributing gains or losses to specific architectural features, software choices, and workload characteristics, thereby mimicking the insights gained from ablation studies.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper rigorously assessed the inference performance of pre-trained Large Language Models (LLMs) on a multi-core RISC-V CPU featuring Silicon-Enabled Vectors (RVV) v0.7.1. A significant initial challenge was building PyTorch for RISC-V with RVV-enabled OpenBLAS and BLIS backends.

The experimental evaluation demonstrated that while RVV can provide performance gains for LLM inference, these gains are highly dependent on specific configuration choices. For instance, disabling the use_cache feature in Transformers had a lesser impact when RVV was active, suggesting that RVV thrives with higher computational intensity, which caching often reduces.

A deeper investigation, supported by roofline modeling and traced execution data, revealed that RVV-enabled matrix multiplication can be slower than scalar implementations when inference workloads operate on small batch sizes (e.g., $N=1$). This is primarily because such GEMMs are memory-bound, limiting the effective utilization of vectorization.

Furthermore, the study highlighted that running LLMs with data types not natively supported by the hardware, such as bfloat16 on the RISC-V platform, leads to flat and non-scalable performance due to the lack of BLAS library support and costly runtime emulation. Converting models to a natively supported format like float32 was crucial for achieving performance improvements and scalability across thread counts and BLAS backends, allowing RVV-enabled backends to deliver measurable benefits, especially with multiple threads.

In summary, the results suggest that RVV is most beneficial in compute-bound configurations that avoid caching and rely on native data types. Its effectiveness is highly contingent on workload characteristics, including matrix shape, number of threads, and arithmetic intensity.

7.2. Limitations & Future Work

The authors identified several limitations in their current work and proposed avenues for future exploration:

  • FP16 Data Type Performance: Preliminary experiments showed that converting BF16 models to FP16 (which is supported by the RISC-V architecture) resulted in significant performance overhead compared to FP32-converted models. This unexpected behavior warrants further investigation.
  • RVV Scalability and Overheads: The inconsistent scalability of RVV (plateauing or regressing in certain models and thread counts) requires further study. Understanding how threading overheads, memory bandwidth saturation, and core-level contention impact RVV efficiency is critical for future optimizations.
  • Quantized Models and Sparse Representations: The authors plan to expand their analysis to include quantized models and sparse representations, which might further amplify RVV benefits due to reduced memory footprint and potentially higher arithmetic intensity.
  • Unsupported Data Types on Other Architectures: Future work will investigate the effects of running unsupported data types on available RISC-V architectures compared to other state-of-the-art architectures (e.g., x86 and ARM), which might have better BF16 emulation or native support.
  • NUMA Design and Task Scheduling: The SG2042 platform has a NUMA (Non-Uniform Memory Access) design. Investigating the effects of this NUMA design on performance, possibly through testing more complex task scheduling policies and understanding associated overheads, is another future direction.

7.3. Personal Insights & Critique

This paper offers valuable, practical insights into a cutting-edge area: evaluating LLM inference on real RISC-V hardware with vector extensions. Its rigorous methodology, employing both micro-benchmarks and full LLM inference profiling coupled with roofline analysis, is particularly commendable.

Inspirations Drawn:

  • Real-world Applicability: The paper strongly highlights the chasm between theoretical peak performance or micro-benchmark results and actual end-to-end application performance. This underscores the need for application-specific profiling and optimization, especially in emerging hardware ecosystems.
  • Interplay of Hardware and Software: The profound impact of data type support and BLAS library integration on RVV effectiveness demonstrates that hardware innovation alone is insufficient; a mature and optimized software stack is equally crucial. This is a key takeaway for anyone working on new architectures.
  • Memory-Bound vs. Compute-Bound Nuances: The detailed roofline analysis provides a clear mental model for understanding why vectorization might not always be a panacea, especially for memory-bound workloads common in LLM decoding. This helps architects and developers make informed decisions about where to invest optimization efforts.

Potential Issues, Unverified Assumptions, or Areas for Improvement:

  • RVV Version Dependency: The paper focuses on RVV v0.7.1. As RVV v1.0 is now stable and gaining traction, the results might differ significantly on newer hardware implementing v1.0. The v0.7.1 standard's immaturity, as noted in related work, might contribute to some observed inconsistencies.

  • "Optimal Vector Width" Discrepancy: The abstract mentions finding an "optimal vector width that maximizes throughput for each model size." While the paper discusses the effectiveness of vectorization based on arithmetic intensity and problem size, it doesn't explicitly detail a methodology for determining or the results of finding an "optimal vector width." This might be a slight overstatement in the abstract compared to the body's detailed analysis of NN values.

  • Single-Threaded OpenBLAS Anomaly: The "Gemma-like behavior" where OpenBLAS (even without RVV) performs poorly in single-threaded configurations for certain models (like Gemma-2 and gpt2-medium) needs deeper investigation. Is it related to OpenBLAS's internal threading model, initialization overheads, or a specific kernel choice that is sub-optimal for single-core execution? This might point to an issue within the OpenBLAS port itself rather than an inherent RVV limitation.

  • Energy Efficiency Quantification: While the abstract mentions RVV offering "substantial gains" in energy consumption for vectorizable operations, the main body of the paper does not present explicit energy consumption measurements or analysis. Providing specific power/energy numbers would strengthen this claim.

  • Broader LLM Workloads: The paper focuses on text generation with a short prompt and generating 25 tokens. Real-world LLM use cases often involve much longer context lengths, larger batch sizes for throughput-oriented tasks, and different fine-tuning scenarios. How these factors interact with RVV performance could reveal further complexities.

  • Tooling and Ecosystem Maturity: The challenges in building the PyTorch stack with specific BLAS versions and compiler toolchains highlight the nascent state of the RISC-V software ecosystem. While the paper successfully navigated this, it's a practical barrier that affects broader adoption. Further work could quantify the overheads or complexities of this setup.

    Overall, this paper serves as a crucial benchmark and analytical foundation for the burgeoning RISC-V AI ecosystem. It not only showcases the potential of RVV but also pragmatically identifies its current limitations and the critical factors for its effective deployment.
