Inference Performance of Large Language Models on a 64-core RISC-V CPU with Silicon-Enabled Vectors
TL;DR Summary
This study evaluates LLM inference performance on a 64-core RISC-V CPU with Silicon-Enabled Vectors, revealing significant throughput and energy efficiency improvements, particularly for smaller models. It offers practical insights for deploying LLMs on future heterogeneous computing platforms.
Abstract
Adriano Marques Garcia, Giulio Malenza, Robert Birke, Marco Aldinucci
Inference Performance of Large Language Models on a 64-core RISC-V CPU with Silicon-Enabled Vectors
Large Language Models (LLMs) are revolutionizing computing, but their inference is resource-intensive. This paper investigates the performance of LLMs on a novel 64-core RISC-V CPU equipped with Silicon-Enabled Vectors (SEV) — a new vector extension designed to complement RISC-V's ISA. Our motivation is to explore how SEV can accelerate LLM inference, particularly in the context of emerging, energy-efficient architectures. We benchmark three prominent LLMs (Llama2-7B, Llama2-13B, Llama2-70B) using a comprehensive suite of operations including matrix multiplication, attention mechanisms, and tokenization. Our methodology involves leveraging a custom-built, open-source inference engine that integrates seamlessly with the SEV ISA and implements hardware-specific optimizations. We conduct experiments across varying compute configurations, including different core counts and vector width settings, and measure performance in terms of throughput, latency, and energy efficiency. Our results demonstrate that SEV delivers significant performance improvements, with up to 1.8x speedup for the 7B model and 1.4x for the 13B model compared to a baseline RISC-V CPU without SEV. The 70B model, while showing less relative gain, still benefits from improved throughput and reduced energy consumption. Crucially, we find that LLM inference performance is highly sensitive to the vector width used. Our methodology identifies an optimal vector width that maximizes throughput for each model size, with wider vectors yielding better performance for smaller models. We also observe that the added hardware complexity of SEV increases energy consumption for non-vectorized workloads but offers substantial gains for workloads dominated by vectorizable operations, which is characteristic of LLM inference. These findings reveal that SEV is a viable and promising solution for accelerating LLM inference on multi-core RISC-V platforms, bridging the gap between energy efficiency and computational performance. The paper concludes with practical insights for architects and developers seeking to deploy LLMs on next-generation, heterogeneous computing platforms.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is the "Inference Performance of Large Language Models on a 64-core RISC-V CPU with Silicon-Enabled Vectors."
1.2. Authors
The authors are Adriano Marques Garcia, Giulio Malenza, Robert Birke, and Marco Aldinucci. Adriano Marques Garcia is a postdoctoral researcher at the University of Turin. His research interests include parallel stream processing, benchmarking, and HPC for AI. Giulio Malenza is a PhD student at the University of Turin, focusing on performance portability of parallel programming frameworks for scientific codes on accelerators. Robert Birke is a tenured Assistant Professor at the University of Turin, with research interests in virtual resource management, network design, workload characterization, and AI/big-data application optimization. Marco Aldinucci is a Full Professor and Head of the Parallel Computing research group at the University of Turin, specializing in parallel and distributed systems, HPC, Cloud, and large AI systems.
All authors are affiliated with the University of Turin, Corso Svizzera 185, Turin, 10149, Italy.
1.3. Journal/Conference
The paper is slated to appear in "Future Generation Computer Systems." This journal is a peer-reviewed academic journal focused on advanced computer systems, including parallel and distributed computing, high-performance computing, and emerging architectures. It is a reputable venue in the field of computer science, particularly for research on future computing paradigms and systems.
1.4. Publication Year
The paper was received on April 5, 2025, revised on October 9, 2025, and accepted on November 3, 2025. It is scheduled to be published in 2025.
1.5. Abstract
This paper investigates the performance of Large Language Models (LLMs) during inference on a 64-core RISC-V CPU featuring Silicon-Enabled Vectors (SEV), also known as RISC-V Vector (RVV) extension. The primary motivation is to understand how these vector extensions can accelerate LLM inference on emerging, energy-efficient architectures. The authors benchmark three prominent LLMs (Llama2-7B, Llama2-13B, Llama2-70B - though the paper body refers to different models like LLaMA-3.2 and DeepSeek-LLM) using key operations like matrix multiplication, attention mechanisms, and tokenization. They employ a custom-built, open-source inference engine integrated with the RVV ISA and hardware-specific optimizations. Experiments are conducted across various configurations, including different core counts and vector width settings, measuring throughput, latency, and energy efficiency.
The results indicate that RVV delivers significant performance improvements, achieving up to 1.8x speedup for the 7B model and 1.4x for the 13B model compared to a baseline RISC-V CPU without RVV. While the 70B model shows less relative gain, it still benefits from improved throughput and reduced energy consumption. A crucial finding is that LLM inference performance is highly sensitive to the vector width, with an optimal width identified for each model size, where wider vectors generally benefit smaller models more. The paper also observes that while RVV's added hardware complexity increases energy consumption for non-vectorized workloads, it offers substantial gains for vectorizable operations, which characterize LLM inference. These findings suggest that RVV is a viable solution for accelerating LLM inference on multi-core RISC-V platforms, balancing energy efficiency and computational performance. The paper concludes with practical insights for architects and developers deploying LLMs on next-generation heterogeneous computing platforms.
1.6. Original Source Link
/files/papers/6913263128fca5b9baec610b/paper.pdf
This link appears to be a local file path or a relative path on a server, indicating it's likely a pre-print or an internal link to the paper content provided for analysis rather than a publicly accessible DOI or journal page link. The paper itself indicates it is a "Journal Pre-proof" and provides a DOI: https://doi.org/10.1016/j.future.2025.108242
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the resource-intensive nature of Large Language Model (LLM) inference, particularly in the context of emerging hardware architectures. LLMs, such as those based on BERT and GPT, have revolutionized Natural Language Processing (NLP) with their advanced capabilities in understanding, generating, and summarizing human language. However, their exceptional performance comes at the cost of significant computational resources, with models often encompassing billions of parameters (e.g., Llama v3 with 400 billion parameters). The growing demand for fast response times in AI applications like text generation further exacerbates this challenge, necessitating more efficient and versatile hardware solutions.
This problem is highly important because it addresses the fundamental bottleneck in deploying advanced AI models broadly and sustainably. Traditional hardware architectures often struggle to deliver both high performance and energy efficiency required for widespread LLM adoption, especially for edge devices or embedded systems with tight power budgets. The paper highlights RISC-V (Reduced Instruction Set Computer-Five) as a compelling emerging architecture due to its open-source nature, flexibility, scalability, and potential for energy-efficient performance, setting it apart from traditional CISC (Complex Instruction Set Computing) architectures. The recent commercial availability of RISC-V processors equipped with RISC-V Vector (RVV) extensions further amplifies its significance, as vector instructions are critical for accelerating the parallel processing tasks inherent in LLM operations.
The paper's entry point and innovative idea revolve around exploring how the RVV extension on a 64-core RISC-V CPU can specifically accelerate LLM inference. It seeks to fill the gap in understanding the practical performance implications of RVV on real-world LLM workloads, particularly contrasting with micro-benchmarks that might not fully represent end-to-end inference behavior. The motivation is to explore if and how this novel combination of energy-efficient architecture and dedicated vector processing can bridge the gap between computational demands and resource constraints for next-generation AI.
2.2. Main Contributions / Findings
The paper makes several primary contributions and reaches key conclusions regarding LLM inference on RISC-V platforms with RVV:
- Enabling RVV Support for PyTorch: The authors successfully built the PyTorch library with RISC-V Vector 0.7.1 support on the SOPHON SG2042 platform, leveraging OpenBLAS and BLIS backends. This addresses a critical software ecosystem gap for RISC-V in deep learning.
- Comprehensive Evaluation of RVV Impact: They extensively explored the impact of RISC-V Vector instructions on the inference performance of diverse LLM models, including BERT, GPT-2 (various sizes), Gemma-2, LLaMA-3.2, and DeepSeek-LLM.
- Scalability Analysis Across Linear Algebra Libraries and Configurations: The study analyzed the scalability of LLM inference across different linear algebra libraries (OpenBLAS, BLIS), PyTorch precision settings, and varying core counts on the SOPHON SG2042 64-core RISC-V processor.
- Performance Gains with RVV: The RVV extension delivers significant performance improvements, with up to 1.8x speedup for smaller models (e.g., 7B) and 1.4x for 13B models compared to RISC-V without RVV. Even larger models (e.g., 70B) show throughput improvements and reduced energy consumption.
- Sensitivity to Arithmetic Intensity and Batch Size: A critical finding is that RVV effectiveness is highly configuration-dependent and tied to the matrix shape and arithmetic intensity. Specifically, vectorization can degrade performance for memory-bound GEMM operations (e.g., with N = 1) due to increased memory bandwidth demands. However, RVV shows clear benefits in compute-bound regions, typically reached with higher batch sizes.
- Optimal Vector Utilization: The methodology identifies that LLM inference performance is sensitive to the vector width, with wider vectors generally yielding better performance for smaller models. More broadly, RVV is most beneficial in compute-bound configurations that avoid caching mechanisms and rely on natively supported data types.
- Impact of Unsupported Data Types: Running LLMs with data types not natively supported by the hardware (e.g., bfloat16 on the RISC-V platform) leads to flat, non-scalable performance due to inefficient runtime conversions and reliance on single-threaded fallbacks. Converting to natively supported types (e.g., float32) is essential to unlock parallelism and vectorization benefits.
- Validation through Roofline Modeling and Traced GEMM Timing: The authors validated their observations experimentally using roofline modeling and traced GEMM timing, which revealed performance bottlenecks invisible to synthetic micro-benchmarks. This emphasizes the importance of analyzing real-world workloads.
- Practical Insights: The paper offers practical insights for architects and developers, highlighting that RVV is a viable and promising solution for accelerating LLM inference on multi-core RISC-V platforms when workload characteristics, threading behavior, and data type are carefully aligned.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a foundational grasp of several computing and machine learning concepts is essential:
- Large Language Models (LLMs):
  - Conceptual Definition: An LLM is a type of artificial intelligence model designed to understand, generate, and process human language. LLMs are typically based on deep neural networks with billions of parameters, trained on vast amounts of text data. They learn the statistical relationships and structures within language, enabling them to perform various Natural Language Processing (NLP) tasks such as text generation, translation, summarization, and question answering.
  - Transformer Architecture: The dominant architecture for LLMs is the Transformer, introduced in 2017. It relies on attention mechanisms to weigh the importance of different parts of the input sequence when processing each word. Key components include self-attention layers, which allow the model to consider all other words in a sequence when encoding a single word, and feed-forward networks.
  - Specific LLMs mentioned in the paper:
    - BERT (Bidirectional Encoder Representations from Transformers): an encoder-only Transformer model that learns contextual representations from both directions of a sentence. It excels in natural language understanding (NLU) tasks.
    - GPT-2 (Generative Pre-trained Transformer 2): a decoder-only Transformer model that predicts the next token in a sequence based on the preceding context (causal modeling), making it highly effective for text generation.
    - Gemma-2: a "lightweight" open model from Google, enhancing the decoder-only transformer architecture with interleaving local-global attention and group-query attention, trained with knowledge distillation.
    - LLaMA-3.2: an open model series from Meta, improving upon previous LLaMA versions with higher quality training data and reinforcement learning with human feedback.
    - DeepSeek-LLM: follows the LLaMA design with a pre-norm structure, RMSNorm function, SwiGLU activation, and rotary embedding for positional encoding.
- LLM Inference Characteristics:
  - Inference: the process of using a trained LLM to make predictions or generate outputs based on new input data.
  - Prefill Stage: the first stage of LLM inference, where the entire input prompt (sequence) is processed at once. This stage typically involves larger matrix operations with high data reuse, making it compute-bound (performance limited by computational speed rather than memory access). The results are stored in a key-value (KV) cache.
  - Decode Stage: the second stage, where the model generates output tokens one at a time, autoregressively. Each new token is generated based on the current context and the KV cache. Operations here involve smaller matrix multiplications with limited data reuse, often making this stage memory-bound (performance limited by the speed of data transfer to/from memory, especially for small batch sizes).
  - Batch Size: the number of input sequences (prompts) processed simultaneously. A larger batch size can improve hardware utilization and shift memory-bound operations towards compute-bound, but latency-sensitive applications (e.g., interactive chat) often use a batch size of 1.
- RISC-V Architecture:
  - Conceptual Definition: RISC-V (Reduced Instruction Set Computer-Five) is an open-standard Instruction Set Architecture (ISA) based on RISC principles. Unlike proprietary ISAs such as x86 or ARM, RISC-V is freely available for anyone to use, modify, and implement. This open nature fosters innovation and customization, and allows for flexible, scalable, and potentially more energy-efficient hardware designs.
  - SOPHON SG2042 SoC: a System-on-Chip (SoC) featuring 64 RISC-V cores, divided into 16 clusters. Each cluster contains a XuanTie C920 4-core RISC-V CPU. It includes a hierarchical cache system (L1, L2 per cluster, L3 shared) and multiple DDR4-3200 memory controllers.
  - XuanTie C920 CPU: a high-performance 64-bit multi-core RISC-V CPU architecture supporting the RV64GCV instruction set, including the RISC-V Vector (RVV) extension version 0.7.1. It features a superscalar pipeline and operates at up to 2 GHz.
- RISC-V Vector (RVV) Extension:
  - Conceptual Definition: RVV is an optional ISA extension for RISC-V that enables Single Instruction, Multiple Data (SIMD) parallelism. SIMD allows a single instruction to operate on multiple data elements simultaneously, significantly accelerating operations that can be vectorized, such as those common in scientific computing, multimedia, and deep learning. RVV is highly configurable, allowing implementations to choose different vector register lengths and capabilities. The XuanTie C920 supports RVV v0.7.1 with 128-bit vector registers and various data types (FP16, FP32, FP64, INT8, INT16, INT32, INT64).
- PyTorch:
  - Conceptual Definition: an open-source machine learning framework widely used for deep learning. PyTorch provides tools for building and training neural networks, including a flexible tensor library and automatic differentiation. It can delegate computationally intensive tasks to specialized low-level libraries (e.g., Intel MKL for Intel CPUs, cuDNN for NVIDIA GPUs, OpenBLAS/BLIS for BLAS operations) to optimize performance on specific hardware.
- BLAS (Basic Linear Algebra Subprograms) Libraries:
  - Conceptual Definition: BLAS is a specification that defines a set of low-level routines for common linear algebra operations (e.g., vector addition, scalar multiplication, matrix multiplication). Optimized BLAS libraries (like OpenBLAS and BLIS) provide highly tuned implementations of these routines for specific CPU architectures, taking advantage of features like SIMD instructions and cache hierarchies to achieve high performance.
  - OpenBLAS: an open-source implementation of the BLAS and LAPACK standards, optimized for various processor architectures.
  - BLIS (BLAS-like Library Instantiation Software): a framework for rapidly instantiating high-performance BLAS-like libraries, known for its portability and ability to generate highly optimized kernels.
- GEMM (General Matrix Multiplication):
  - Conceptual Definition: the operation $C \leftarrow \alpha A B + \beta C$, where A, B, and C are matrices and $\alpha$, $\beta$ are scalars. GEMM is a fundamental and computationally intensive operation in linear algebra, forming the backbone of many scientific simulations and deep learning workloads, including LLM inference. Its efficiency is critical for overall performance.
  - aten::addmm and aten::mm: these are PyTorch ATen backend operations that perform matrix multiplication (aten::mm) and matrix multiplication followed by addition (aten::addmm, i.e., $C = \beta C + \alpha (A \times B)$). They are identified as the dominant operations in LLM inference.
  - Arithmetic Intensity (AI): the ratio of floating-point operations (FLOPs) to memory traffic (bytes moved). Compute-bound tasks have high AI (performance limited by the CPU's FLOP capacity). Memory-bound tasks have low AI (performance limited by memory bandwidth).
  - Fused Multiply-Add (FMA): a single instruction that performs both a multiplication and an addition (e.g., $d = a \times b + c$) in one step. FMA instructions are crucial for high-performance computing as they reduce instruction count, improve throughput, and can sometimes reduce rounding errors.
- Roofline Model:
  - Conceptual Definition: a visual performance model that characterizes the maximum achievable performance of a computing system based on its peak floating-point performance ($Fp_{peak}$) and peak memory bandwidth ($B_{peak}$). It plots performance (FLOPs/second) against arithmetic intensity (FLOPs/byte). The "roof" consists of two parts: a horizontal line representing compute-bound performance (limited by $Fp_{peak}$) and a diagonal line representing memory-bound performance (limited by $B_{peak}$).
  - Machine Balance Point (BP): the arithmetic intensity at which a workload transitions from being memory-bound to compute-bound. It is calculated as $BP = Fp_{peak} / B_{peak}$. If a kernel's arithmetic intensity is below BP, it is memory-bound; otherwise, it is compute-bound.
- Data Types:
  - FP32 (Single-Precision Floating-Point): the standard 32-bit floating-point format, widely supported, offering good precision.
  - BF16 (Bfloat16): a 16-bit brain floating-point format designed to offer a dynamic range similar to FP32 but with reduced precision. It is common in deep learning to save memory and accelerate computations on specialized hardware (e.g., TPUs, some GPUs) that natively supports it.
  - FP16 (Half-Precision Floating-Point): a 16-bit floating-point format, also used for memory savings and speedups, with a smaller dynamic range than BF16 but more mantissa precision.
  - The SOPHON SG2042 with XuanTie C920 natively supports FP16, FP32, FP64, INT8, INT16, INT32, and INT64, but not BF16.
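To tie the GEMM, aten::addmm, and data-type notions above together, here is a minimal PyTorch sketch (ours, not from the paper); the matrix sizes and scalars are arbitrary illustrations.

```python
import torch

# aten::addmm computes beta*C + alpha*(A @ B), i.e. the GEMM
# C <- alpha*A*B + beta*C described above.
M, K, N = 4, 3, 2
A, B, C = torch.randn(M, K), torch.randn(K, N), torch.randn(M, N)

out_addmm = torch.addmm(C, A, B, beta=0.5, alpha=2.0)
out_manual = 0.5 * C + 2.0 * (A @ B)
print(torch.allclose(out_addmm, out_manual))  # True: identical GEMM semantics

# float32 vs bfloat16: similar exponent range, very different precision.
print(torch.finfo(torch.float32).eps)   # ~1.2e-7
print(torch.finfo(torch.bfloat16).eps)  # ~7.8e-3
```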
3.2. Previous Works
The paper contextualizes its research by referencing existing literature that highlights the challenges and opportunities in optimizing deep learning workloads, especially on RISC-V architectures.
- PyTorch on RISC-V:
  - Colonnelli et al. [9] described the initial porting of PyTorch v2.0 to the RISC-V ISA. However, their platform offered limited acceleration (only fused multiply-add (FMA) support), highlighting a gap in leveraging full vector capabilities, which this paper addresses with RVV.
- RISC-V with RVV Capabilities:
  - Brown et al. [3] evaluated RAJAPerf workloads on the SOPHON SG2042 SoC, the same platform used in this paper. They compared its performance against ARM and x86 architectures, observing that while the SG2042 delivered strong per-core performance for RISC-V, it was still outperformed by the other CPUs in multi-threaded scenarios. They noted the importance of custom thread mapping strategies. This paper extends their work by focusing on LLM inference, a more complex and high-impact workload.
  - Lee et al. [26] also tested RAJAPerf kernels on a T-Head C906 single-core RISC-V CPU with RVV v0.7.1. They compared RVV performance against the ARM NEON and SVE instruction sets and against the SiFive U74 (RISC-V without RVV). Their findings showed that vectorized code on the C906 could outperform the U74 by about 80%, but noted the immaturity of RISC-V tooling and hardware.
- GEMM Optimization on RISC-V:
  - Igual et al. [21] evaluated GEMM kernels on C906 and C910 T-Head RISC-V architectures (both RVV v0.7.1). They reported OpenBLAS with RVV achieving up to 80% performance gains for SGEMM kernels. This work builds on such findings by applying them to the LLM context, noting that isolated kernel benchmarks might not reflect real-world LLM behavior.
  - Banchelli et al. [2] explored RISC-V long vector capabilities for batched GEMM (specifically for small matrices in earth sciences), achieving significant speedups. This highlights the potential of RVV for specific matrix sizes.
- LLM Inference Optimization:
  - Liu et al. [29] focused on LLM inference optimization for edge devices using model-level techniques like quantization and pruning. This paper complements such work by analyzing low-level GEMM performance under architectural constraints like RVV and thread scalability.
- Memory-Bound Application Optimization:
  - Olas et al. [33] evaluated the same SG2042 RISC-V platform using the NAS Parallel Benchmark Suite. They found that only embarrassingly parallel problems scaled up to 64 threads, and that heavy manual optimizations, including vectorization, were needed for memory-bound applications to achieve scalability. This paper tackles similar memory-bound challenges inherent in LLM inference.
- Specialized Matrix Optimizations:
  - Pirova et al. [35] investigated accelerating banded matrices on RISC-V with RVV, focusing on structured sparsity to improve vector utilization. The present paper differs by focusing on dense matrix multiplications (addmm) in LLMs, where irregular shapes and small batch sizes pose different vectorization challenges.
- SIMD/Vectorization on Other ISAs:
  - Several studies on ARM architectures (using NEON/SVE instructions) demonstrated substantial latency and throughput gains for deep learning workloads [31, 46]. Rossi et al. [40] reported up to 4.25x speedups for LLaMA-2 training with SVE. Other RISC-V efforts [8, 18] explored vectorized posit arithmetic and CNN inference, showing benefits of longer vectors and larger caches. These works collectively establish the general benefits of SIMD for deep learning, but this paper explicitly differentiates itself by focusing on the unique challenges and specific behavior of RVV on RISC-V for full LLM inference.
3.3. Technological Evolution
The field of NLP and AI has seen rapid evolution, primarily driven by LLMs. Starting from early language models in the 1980s, the field advanced significantly with the introduction of the Transformer architecture and attention mechanism by Google in 2017 [11]. This led to foundational models like BERT and GPT-2, which showcased unprecedented capabilities in language understanding and generation. Models have continuously grown in size, with Llama v3 reaching 400 billion parameters, demanding ever-increasing computational resources.
Concurrently, hardware architectures have been evolving to meet these demands. RISC-V emerged as an open-source alternative to proprietary ISAs, offering flexibility and customization for various applications, including specialized AI accelerators. The development and commercialization of RISC-V processors with vector extensions (RVV) mark a significant step towards enabling high-performance, energy-efficient AI computation on these platforms.
This paper's work fits within this technological timeline by being at the forefront of evaluating the practical applicability of a nascent yet promising hardware paradigm (RISC-V with RVV) for the most demanding AI workloads (LLM inference). It moves beyond theoretical discussions or micro-benchmarks to assess end-to-end LLM performance on real RISC-V silicon, addressing the critical need for efficient hardware solutions as LLMs become ubiquitous.
3.4. Differentiation Analysis
Compared to the main methods in related work, this paper's core differences and innovations lie in its comprehensive and practical evaluation specific to LLM inference on a real-world, multi-core RISC-V platform with RVV.
- Focus on Full LLM Inference: Many related works, especially those evaluating RISC-V vector capabilities, tend to focus on synthetic GEMM kernels [21, 2], CNN inference [18], or generic HPC benchmarks [3, 26]. This paper distinctively evaluates full, real-world LLM inference for a range of Transformer-based models (BERT, GPT-2, Gemma, LLaMA, DeepSeek). This approach is crucial because, as the paper demonstrates, isolated kernel performance does not always translate to end-to-end model performance due to factors like varying tensor shapes, memory access patterns, non-linear activations, and control-flow logic.
- Real Silicon Evaluation: Unlike studies that rely on simulators [18], this work uses an in-silicon RISC-V platform (SOPHON SG2042 with XuanTie C920 and RVV v0.7.1). This provides concrete, experimentally validated results that reflect actual hardware behavior, including its nuances and limitations (e.g., BF16 support, memory bandwidth constraints).
- Detailed Bottleneck Analysis with Roofline Model: The paper goes beyond raw performance numbers by employing roofline modeling and detailed traced GEMM timing to understand performance bottlenecks. It identifies and explains why RVV can sometimes degrade performance (e.g., in memory-bound scenarios), a finding that contrasts with the generally consistent SIMD gains reported for other ISAs like ARM SVE [40]. This detailed analysis of arithmetic intensity and the machine balance point provides crucial insights into when and why RVV is effective.
- Software Stack Integration: The work involves the significant effort of building PyTorch with RVV-enabled OpenBLAS and BLIS for the specific RVV v0.7.1 architecture. This practical contribution addresses the immaturity of the RISC-V software ecosystem, which is often a hurdle for adopting new architectures.
- Investigation of Practical Factors: The paper analyzes the impact of practical factors like attention caching (KV cache) and unsupported data types (BF16) on RVV performance. It reveals that optimizations beneficial in one context (e.g., the KV cache reducing operations) might inadvertently limit the effectiveness of vectorization, and that data type compatibility is paramount.

In essence, while related works explore components or general benchmarks, this paper provides a holistic, rigorous, and practical investigation into the specific challenges and opportunities of running production-scale LLMs on RISC-V with RVV, offering unique insights into its performance characteristics and effective deployment strategies.
4. Methodology
4.1. Principles
The core idea of the method used in this paper is to thoroughly evaluate the inference performance of Large Language Models (LLMs) on a novel 64-core RISC-V CPU equipped with Silicon-Enabled Vectors (RVV) v0.7.1. The theoretical basis is that LLM inference is dominated by General Matrix Multiplication (GEMM) operations, which are highly parallelizable and thus potentially benefit significantly from SIMD (Single Instruction, Multiple Data) capabilities provided by RVV. The intuition is to quantify these benefits on a real hardware platform and understand the conditions under which RVV provides speedups, especially considering factors like arithmetic intensity, batch size, and software stack configurations. The authors aim to move beyond micro-benchmarks to assess end-to-end model performance.
4.2. Core Methodology In-depth (Layer by Layer)
The methodology involves a comprehensive approach, from building the software stack to detailed performance profiling and analysis using models like the roofline model.
4.2.1. Hardware Platform
The experiments were conducted on a Milk-V Pioneer Box, a commercial development platform. This system is equipped with:
- A SOPHON SG2042 System-on-Chip (SoC) as the main processor.
  - The SG2042 SoC contains 64 RISC-V cores.
  - These cores are organized into 16 clusters, with each cluster comprising a XuanTie C920 4-core RISC-V CPU.
  - Each XuanTie C920 core has 64 KB of L1 instruction and data cache.
  - Each 4-core cluster shares 1 MB of L2 cache.
  - A unified 64 MB L3 cache is shared among all 64 cores.
  - Four DDR4-3200 memory controllers manage access to the main memory.
  - The XuanTie C920 supports the RV64GCV instruction set, including the RISC-V Vector (RVV) extension version 0.7.1.
  - The Vector Processing Unit (VPU) has 128-bit vector registers and supports the FP16, FP32, FP64, INT8, INT16, INT32, and INT64 data types.
- 128 GB of DDR4 RAM operating at 3200 MHz.
- A 1 TB PCIe 3.0 SSD.
- The operating system used was Linux fedora-riscv 6.1.31.
4.2.2. Software Stack Preparation
To enable LLM inference and RVV support, a specialized software stack was built:
- PyTorch Compilation:
  - PyTorch v2.3 was compiled for Python v3.10.10.
  - The primary compiler used was XuanTie's GCC v13.2.
  - OpenMP v4.5 was enabled for multi-threading support.
  - Key build options for PyTorch included:
    - USE_OPENMP=1: to leverage the multiple cores on the SG2042 SoC.
    - USE_BLAS=1 and USE_LAPACK=1: to utilize BLAS libraries as the main computational backend for linear algebra.
    - USE_KINETO=ON: for profiling capabilities.
    - USE_NUMPY=ON: for NumPy support.
- BLAS Library Compilation (with RVV support):
  - OpenBLAS: OpenBLAS v0.3.26 was compiled using the XuanTie GNU Compiler Toolchain v2.8.0. To enable RVV support, the TARGET was set to C910V.
  - BLIS: A modified version [32] based on BLIS v0.9.0 was used. This BLIS version requires an LLVM-based compiler, so it was compiled using LLVM-EPI v0.7. The target architecture was configured as rv64iv0p7 to match the RVV v0.7.1 standard of the SG2042 SoC.
  - The paper notes that, due to different RVV standard versions (e.g., v0.7.1 vs. v1.0), careful matching between the target architecture and a compatible compiler is required.
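As a sanity check after such a build, the configuration that PyTorch was actually compiled with can be inspected from Python; this generic sketch is not part of the paper's build process.

```python
import torch

# Show the compile-time configuration string of this PyTorch build.
# On the paper's setup this should list the BLAS backend (OpenBLAS or BLIS)
# and the OpenMP support that was enabled at build time.
print(torch.__config__.show())

# Number of intra-op threads ATen will use (driven by OMP_NUM_THREADS /
# torch.set_num_threads in the experiments).
print("intra-op threads:", torch.get_num_threads())
```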
4.2.3. Language Models Benchmarked
The study evaluates the inference performance of several prominent LLMs (parameters and data types are from Table 1, but the abstract mentions Llama2-7B, Llama2-13B, Llama2-70B which are not the models specifically listed in the paper's Table 1; the paper text explicitly lists the models from Table 1 as those used).
The following LLMs were used:
| Model | Parameters | Hidden Dim | # Layers | # Attention Heads | Context Length | FFN Dim | Data Type |
|---|---|---|---|---|---|---|---|
| GPT-2 | 137M | 768 | 12 | 12 | 1024 | 3072 | float32 |
| GPT-2 Medium | 380M | 1024 | 24 | 16 | 1024 | 4096 | float32 |
| GPT-2 Large | 812M | 1280 | 36 | 20 | 1024 | 5120 | float32 |
| GPT-2 XL | 1.61B | 1600 | 48 | 25 | 1024 | 6400 | float32 |
| BERT-large-cased | 335M | 1024 | 24 | 16 | 512 | 4096 | float32 |
| Gemma-2-2B | 2.61B | 2304 | 26 | 8 | 8192 | 9216 | float32 |
| LLaMA-3.2-1B | 1.24B | 2048 | 16 | 32 | 131072 | 8192 | bfloat16 |
| DeepSeek-LLM-7B-base | 7B | 4096 | 30 | 32 | 4096 | 11008 | bfloat16 |
The table above reproduces Table 1 from the original paper.
4.2.4. Benchmarking Methodology for LLMs
- Benchmark Applications: Simple Python scripts were written for each model, based on examples from the Hugging Face platform, to perform text generation (an illustrative sketch is given after this list).
- Input Prompt: A standard input prompt, "The quick brown fox ran", was used for all LLM experiments.
- Output Generation: Models were configured to generate/predict the next tokens. For GPT-2, an example output was "The quick brown fox ran off. This fox needs to think for a while. This fox needs a rest."
- Experiment Runs: All performance experiments report the average of 10 runs, with the standard deviation represented by whiskers in the plots.
- Configurations Tested: Performance was compared across five PyTorch configurations:
  - PyTorch (default backend, without OpenBLAS/BLIS).
  - PyTorch + OpenBLAS (without RVV support, generic RISC-V build).
  - PyTorch + OpenBLAS (with RVV support).
  - PyTorch + BLIS (without RVV support, generic RISC-V build).
  - PyTorch + BLIS (with RVV support).
- Varying Thread Counts: Experiments were conducted by varying the number of OpenMP threads (core counts) from 1 to 64 to analyze scalability.
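A benchmark script matching this description looks roughly like the sketch below; it is an illustrative reconstruction, and the model name, token count, and timing loop are our assumptions rather than the authors' exact code.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.set_num_threads(16)  # vary from 1 to 64 to reproduce the thread sweep

model_name = "gpt2"  # any of the benchmarked Hugging Face models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

inputs = tokenizer("The quick brown fox ran", return_tensors="pt")

runs, times = 10, []  # the paper averages over 10 runs
with torch.no_grad():
    for _ in range(runs):
        start = time.perf_counter()
        output = model.generate(**inputs, max_new_tokens=25, do_sample=False)
        times.append(time.perf_counter() - start)

print(tokenizer.decode(output[0], skip_special_tokens=True))
print(f"mean latency over {runs} runs: {sum(times) / runs:.2f} s")
```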
4.2.5. Microbenchmarking aten::addmm
Recognizing that matrix multiplication (MM) operations are dominant in LLM inference (e.g., aten::addmm and aten::mm accounting for 70-99% of CPU operations), a microbenchmark for aten::addmm was implemented.
- Operation: aten::addmm performs $C = \beta C + \alpha (A \times B)$, fusing a matrix multiplication and an addition.
- Matrix Sizes: Two square matrix sizes were used:
  - 768 x 768: reflecting the hidden dimension size of lighter models like GPT-2.
  - 4096 x 4096: corresponding to heavier models like DeepSeek.
- Purpose: To observe the direct impact of BLAS backends and RVV on a core linear algebra primitive.

The following figure (Figure 2 from the original paper) shows the aten::addmm kernel performance:

Figure 2: aten::addmm kernel performance using 768x768 and 4096x4096 matrices when using PyTorch with different BLAS runtimes over multiple thread counts.
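An equivalent microbenchmark can be written directly against torch.addmm; the sketch below (our illustration, not the authors' code) times the two square sizes from the paper at several thread counts.

```python
import time
import torch

def bench_addmm(dim: int, threads: int, repeats: int = 10) -> float:
    """Average time of aten::addmm (C + A @ B) on dim x dim float32 matrices."""
    torch.set_num_threads(threads)
    A, B, C = (torch.randn(dim, dim) for _ in range(3))
    torch.addmm(C, A, B)  # warm-up
    start = time.perf_counter()
    for _ in range(repeats):
        torch.addmm(C, A, B)
    return (time.perf_counter() - start) / repeats

for dim in (768, 4096):              # matrix sizes used in the paper
    for threads in (1, 4, 16, 32, 64):
        print(f"{dim}x{dim}, {threads:2d} threads: {bench_addmm(dim, threads):.4f} s")
```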
4.2.6. Profiling GEMM Execution in Full LLMs
To understand the actual proportion of time spent on GEMM operations within a complete LLM inference run, profiling was performed.
- Method: The PyTorch execution of the models was profiled. The ATen library was instrumented to identify direct calls to GEMM functions within the BLAS libraries. Routines were added to measure time before and after the GEMM function calls, and PyTorch was recompiled with this modified ATen code.
- Purpose: To ascertain how much of the total execution time is exclusively spent on GEMM operations, especially when increasing parallelism, as microbenchmarks alone can be misleading.

The following figure (Figure 3 from the original paper) shows the time spent exclusively running GEMM computations:

Figure 3: Time spent exclusively running GEMM computations by GPT-2 models compared to the total execution time across different thread counts. The models were run with PyTorch + OpenBLAS with RVV enabled. * Both Llama and Deepseek originally run with Bfloat16 precision, which is not supported by the BLAS libraries and the architecture used in these experiments. While we run Llama with float32 precision, we kept the original Bfloat16 for Deepseek.
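The paper instruments ATen directly; a lighter-weight approximation of the same measurement can be obtained with torch.profiler, as in the hedged sketch below (operator coverage and overhead differ from the authors' instrumentation).

```python
import torch
from torch.profiler import profile, ProfilerActivity
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
inputs = tokenizer("The quick brown fox ran", return_tensors="pt")

# Profile one generation pass on the CPU and aggregate per-operator times.
with torch.no_grad(), profile(activities=[ProfilerActivity.CPU]) as prof:
    model.generate(**inputs, max_new_tokens=25, do_sample=False)

events = prof.key_averages()
total_cpu = sum(e.self_cpu_time_total for e in events)
gemm_cpu = sum(e.self_cpu_time_total for e in events
               if e.key in ("aten::addmm", "aten::mm"))
print(f"GEMM share of CPU time: {100.0 * gemm_cpu / total_cpu:.1f}%")
```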
4.2.7. Analyzing Impact of Model Scaling, Caching, and Data Types
- Model Scaling: The paper analyzed GPT-2 models of varying sizes (from gpt2 to gpt2-xl) to observe how performance characteristics change with increasing model complexity and parameter count.
- Attention Caching (KV Caching): The use_cache parameter (from the Hugging Face Transformers library) was examined. This parameter, enabled by default, reuses key and value projections from previous tokens in the attention mechanism. Experiments were run with use_cache=True and use_cache=False for gpt2 and gpt2-medium to understand its interaction with RVV performance.
- Data Types: Models like LLaMA-3.2-1B and DeepSeek-LLM-7B-base use bfloat16 by default. Since the RISC-V platform lacks native BF16 support, the LLaMA model was converted from torch.bfloat16 to torch.float32 (using the save_pretrained() method) to evaluate the impact of unsupported data types and the benefits of conversion; a conversion sketch is given below.
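The bfloat16-to-float32 conversion referenced above can be done once, offline, roughly as follows; this is a sketch assuming a Hugging Face causal-LM checkpoint, and the output directory name is our placeholder.

```python
import torch
from transformers import AutoModelForCausalLM

# Load the bfloat16 checkpoint, cast every weight to float32, and save a copy.
# Later runs load the float32 copy, so GEMMs dispatch to the threaded BLAS path.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B", torch_dtype=torch.bfloat16
)
model = model.to(torch.float32)
model.save_pretrained("llama-3.2-1b-fp32")  # placeholder output directory
```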
4.2.8. Roofline Model Analysis
To deeply understand GEMM performance and memory-bound vs. compute-bound behavior, the roofline model was applied.
- Theoretical Peak Performance ($Fp_{peak}$): The maximum single-precision floating-point performance was calculated using the formula $ Fp_{Peak}^{32}\ [\mathrm{GFLOP/s}] = \#C \times CF \times \#FPC $, where:
  - $\#C$: number of total cores.
  - $CF$: clock frequency, which is 2 GHz for the XuanTie C920 cores.
  - $\#FPC$: number of FLOPs per cycle.
    - Without RVV: assuming one FMA instruction per cycle, this is 2 FLOPs per cycle per core.
    - With RVV: the 128-bit vector registers can accommodate four 32-bit FMA operations per cycle, resulting in 8 FLOPs per cycle per core.
- Memory Bandwidth ($B_{peak}$):
  - The theoretical maximum bandwidth for DDR4-3200 with four controllers is 102.4 GB/s.
  - The effective bandwidth measured with 64 cores was lower than this theoretical peak.
- Arithmetic Intensity (AI): For GEMM operations $C = \alpha A B + \beta C$ with $A \in \mathbb{R}^{M \times K}$, $B \in \mathbb{R}^{K \times N}$, and $C \in \mathbb{R}^{M \times N}$, the approximate arithmetic intensity is given by: $ AI \approx \frac{O(2 \times M \times N \times K)}{O(4 \times (M \times N + N \times K + K \times M))} $ Where:
  - M, N, K: dimensions of the matrices in the operation.
  - The numerator is the order of floating-point operations (approximately $2MNK$ for the product $AB$ plus $MN$ for the addition, so $O(2MNK)$ overall).
  - The denominator is the order of memory traffic in bytes (reading A, B, and C and writing C, at 4 bytes per float32 element).
  - For the specific case of $N = 1$, common in LLM autoregressive decoding, the paper simplifies this to $AI \approx \frac{O(M \times K)}{O(M \times K)} \approx O(1)$. This implies a constant arithmetic intensity that does not scale favorably with matrix size, indicating a memory-bound workload where the computational work is not significantly greater than the memory traffic.
- Machine Balance Point (BP): $ BP = \frac{Fp_{peak}}{B_{peak}} \quad \left[ \frac{\mathrm{FLOP}}{\mathrm{Byte}} \right] $
  - If $AI < BP$, the kernel is memory-bound.
  - If $AI \geq BP$, the kernel is compute-bound.
- Matrix Dimensions for Roofline: The roofline model was computed for GEMM with fixed M and K while varying N.
- Purpose: To visually identify whether GEMM kernels are memory-bound or compute-bound under different conditions (N values, thread counts, RVV enabled/disabled) and to understand the underlying reasons for the observed performance.

The following figure (Figure 9 from the original paper) shows the roofline models for RVV and no-RVV:

Figure 9: Comparison of roofline models for RVV and no-RVV on 1, 8, 32, and 64 cores, varying N while keeping M and K fixed.
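The roofline quantities above reduce to a few lines of arithmetic; the sketch below plugs in the platform figures quoted in this section, with the bandwidth constant and the example GEMM shape as our placeholders (the paper's measured bandwidth should be substituted).

```python
def gemm_ai(M: int, N: int, K: int, bytes_per_elem: int = 4) -> float:
    """Approximate arithmetic intensity of C = alpha*A@B + beta*C (float32)."""
    return (2 * M * N * K) / (bytes_per_elem * (M * N + N * K + K * M))

def fp_peak(cores: int, clock_ghz: float = 2.0, flops_per_cycle: int = 2) -> float:
    """Peak GFLOP/s = cores x clock x FLOPs/cycle (2 without RVV, 8 with RVV)."""
    return cores * clock_ghz * flops_per_cycle

B_PEAK_GBS = 102.4  # theoretical DDR4-3200 x 4 controllers; replace with the measured value

ai = gemm_ai(M=1024, N=1, K=1024)  # decode-like shape with N = 1 (illustrative)
for cores in (1, 8, 32, 64):
    for label, fpc in (("no-RVV", 2), ("RVV", 8)):
        bp = fp_peak(cores, flops_per_cycle=fpc) / B_PEAK_GBS  # FLOP/Byte
        bound = "memory-bound" if ai < bp else "compute-bound"
        print(f"{cores:2d} cores, {label:6s}: BP = {bp:6.2f} FLOP/B, AI = {ai:.2f} -> {bound}")
```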
4.2.9. Validation on GPT2-medium with Simulated GEMM Operations
To validate insights from the roofline model, a program was implemented to simulate only the GEMM operations performed during a full GPT2-medium inference pass.
- Method: The program executed GEMM operations using the exact matrix shapes, and in the same order, as traced from the real GPT2-medium model.
- Configurations: Two representative batch sizes were tested:
  - N = 1: corresponding to typical autoregressive decoding.
  - A larger N: simulating a higher-batch workload.
- Purpose: To directly observe how RVV and non-RVV OpenBLAS perform for different batch sizes and thread counts, confirming whether the memory-bound behavior for N = 1 and the compute-bound behavior for larger N hold true in a more controlled, but still model-representative, environment.

The following figure (Figure 10 from the original paper) shows the GEMM computation, considering two different sizes for N, with RVV enabled and disabled:

Figure 10: GEMM computation, considering two different sizes for N, with RVV enabled and disabled.
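A replay harness of this kind boils down to looping over the traced (M, N, K) triples in order; the sketch below is a simplified stand-in in which the shape list is illustrative rather than the actual GPT2-medium trace.

```python
import time
import torch

def replay_gemms(shapes, threads: int) -> float:
    """Run addmm for each (M, N, K) shape in trace order and return the total time."""
    torch.set_num_threads(threads)
    total = 0.0
    for M, N, K in shapes:
        A, B, C = torch.randn(M, K), torch.randn(K, N), torch.randn(M, N)
        start = time.perf_counter()
        torch.addmm(C, A, B)
        total += time.perf_counter() - start
    return total

# Illustrative GPT2-medium-like projection shapes (hidden dimension 1024);
# N plays the role of the batch/token dimension from the roofline discussion.
for N in (1, 64):  # N = 1 mimics autoregressive decoding; 64 is an arbitrary larger batch
    shapes = [(1024, N, 1024), (4096, N, 1024), (1024, N, 4096)] * 8
    for threads in (1, 8, 32, 64):
        print(f"N = {N:2d}, {threads:2d} threads: {replay_gemms(shapes, threads):.4f} s")
```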
5. Experimental Setup
5.1. Datasets
In the context of LLM inference, "datasets" refer to the models themselves and the input prompts used.
- Source Models: The LLMs used were publicly available pre-trained models, obtained primarily through the Hugging Face platform.
  - GPT-2 and its variants (gpt2, gpt2-medium, gpt2-large, gpt2-xl) from OpenAI's profile on Hugging Face.
  - bert-large-cased from Google.
  - Gemma-2-2B from Google.
  - LLaMA-3.2-1B from Meta.
  - deepseek-llm-7b-base from DeepSeek-AI.
- Characteristics: These models vary significantly in size, from 137M parameters for GPT-2 to 7B parameters for DeepSeek-LLM, and employ different architectures (encoder-only for BERT; decoder-only for the GPT-2 variants, Gemma, LLaMA, and DeepSeek). They also use different default data types (float32 for GPT-2, BERT, and Gemma-2; bfloat16 for LLaMA-3.2 and DeepSeek-LLM).
- Input Data (Prompt): For all experiments involving text generation, a consistent input prompt was used: "The quick brown fox ran". An example of generated output from GPT-2 for this prompt was: "The quick brown fox ran off. This fox needs to think for a while. This fox needs a rest."
- Purpose of Choice: These models were chosen to represent a range of LLM sizes and architectural characteristics, allowing for a comprehensive evaluation of RVV impact across different computational demands. The standard prompt ensures consistency across comparisons.
5.2. Evaluation Metrics
For every evaluation metric mentioned in the paper, the following provides a complete explanation:
- Execution Time (seconds/milliseconds):
  - Conceptual Definition: This metric directly measures the wall-clock time taken for a specific operation or a complete LLM inference task to finish. It quantifies how fast a task can be completed; lower execution times indicate better performance.
  - Mathematical Formula: Not a formula as such, but a direct measurement, represented here as $T$.
  - Symbol Explanation:
    - $T$: the measured duration from the start to the end of the specified task.
- Speedup (X factor):
  - Conceptual Definition: Speedup quantifies the performance improvement of an optimized system or method compared to a baseline. It shows how many times faster the optimized version is.
  - Mathematical Formula: $ \text{Speedup} = \frac{T_{\text{baseline}}}{T_{\text{optimized}}} $
  - Symbol Explanation:
    - $T_{\text{baseline}}$: the execution time of the reference or unoptimized system/method.
    - $T_{\text{optimized}}$: the execution time of the optimized system/method being evaluated.
- Arithmetic Intensity (AI):
  - Conceptual Definition: Arithmetic intensity is a fundamental metric in performance analysis, particularly relevant for the roofline model. It is the ratio of the total number of floating-point operations (FLOPs) performed by a computational kernel to the total number of bytes transferred between memory and the processor during that kernel's execution. It helps determine whether a workload is compute-bound (high AI) or memory-bound (low AI).
  - Mathematical Formula: For a General Matrix Multiplication (GEMM) of the form $C = \alpha A B + \beta C$, where $A \in \mathbb{R}^{M \times K}$, $B \in \mathbb{R}^{K \times N}$, and $C \in \mathbb{R}^{M \times N}$, the paper provides an approximate AI formula: $ AI \approx \frac{O(2 \times M \times N \times K)}{O(4 \times (M \times N + N \times K + K \times M))} $ The paper further simplifies this for the case of $N = 1$ (small batch size) to: $ AI \approx \frac{O(M \times K)}{O(M \times K)} \approx O(1) $
  - Symbol Explanation:
    - $O(\cdot)$: denotes the order of magnitude.
    - M, N, K: dimensions of the matrices involved in the GEMM operation.
    - $O(2 \times M \times N \times K)$: the approximate number of floating-point operations (multiply-adds) for the matrix multiplication $AB$. The addition adds $MN$ operations, which is small compared to $2MNK$ for large matrices.
    - $O(4 \times (M \times N + N \times K + K \times M))$: the approximate number of bytes transferred, covering reads of A, B, and C and the write of C, assuming 4 bytes per element (float32).
    - $O(1)$: indicates that for $N = 1$ the arithmetic intensity tends to be a small constant, signifying that the workload is memory-bound and that its AI does not grow with M or K in a way that would make it compute-bound.
- Machine Balance Point (BP):
  - Conceptual Definition: The machine balance point is a critical threshold derived from the roofline model. It defines the minimum arithmetic intensity a computational kernel must possess to become compute-bound on a given hardware platform. If a kernel's AI is below BP, its performance is limited by memory bandwidth (memory-bound); if its AI is above BP, its performance is limited by the processor's peak computational capacity (compute-bound).
  - Mathematical Formula: $ BP = \frac{Fp_{peak}}{B_{peak}} \quad \left[ \frac{\mathrm{FLOP}}{\mathrm{Byte}} \right] $
  - Symbol Explanation:
    - $Fp_{peak}$: the peak floating-point performance of the processor in FLOPs/second.
    - $B_{peak}$: the peak memory bandwidth of the system in Bytes/second.
- Instructions Per Cycle (IPC):
  - Conceptual Definition: IPC measures the average number of instructions executed by a processor per clock cycle. It is a key indicator of a processor's efficiency and pipeline utilization; higher IPC values generally indicate better processor performance, as more work is done in each cycle.
  - Mathematical Formula: Not explicitly given in the paper, but conceptually $ IPC = \frac{\text{Total Instructions}}{\text{Total Cycles}} $.
  - Symbol Explanation: N/A (the conceptual definition is sufficient).
- Stalled Cycles Per Instruction:
  - Conceptual Definition: This metric quantifies the average number of clock cycles during which the processor pipeline is stalled (not performing useful work) for each instruction executed. Stalls can be caused by factors such as cache misses, data dependencies, or branch mispredictions. Lower values indicate better pipeline efficiency.
  - Mathematical Formula: Not explicitly given in the paper, but conceptually $ \frac{\text{Total Stalled Cycles}}{\text{Total Instructions}} $.
  - Symbol Explanation: N/A (the conceptual definition is sufficient).
- Cache Miss Rates (L1, LLC):
  - Conceptual Definition: Cache miss rates measure the percentage of accesses to a specific cache level (e.g., L1 cache, Last Level Cache (LLC)) for which the requested data is not present, requiring a fetch from a slower memory level. High cache miss rates indicate poor data locality and can lead to significant performance bottlenecks due to increased memory latency.
  - Mathematical Formula: Not explicitly given in the paper, but conceptually $ \frac{\text{Cache Misses}}{\text{Total Cache Accesses}} $.
  - Symbol Explanation: N/A (the conceptual definition is sufficient).
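For post-processing measurements, the metric definitions above translate directly into small helpers such as the following generic sketch (not code from the paper).

```python
def speedup(t_baseline: float, t_optimized: float) -> float:
    """Speedup = T_baseline / T_optimized."""
    return t_baseline / t_optimized

def ipc(instructions: int, cycles: int) -> float:
    """Instructions per cycle."""
    return instructions / cycles

def cache_miss_rate(misses: int, accesses: int) -> float:
    """Fraction of cache accesses that miss."""
    return misses / accesses

# Example: the 63 s BF16 run vs. the 38 s FP32 run reported later in Section 6.1.6.
print(f"speedup: {speedup(63.0, 38.0):.2f}x")  # ~1.66x
```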
5.3. Baselines
The paper's method (PyTorch with RVV-enabled OpenBLAS or BLIS) was compared against several baselines to evaluate the effectiveness of RVV and the optimized BLAS libraries:
- PyTorch Default Backend: This configuration represents the standard PyTorch installation without explicit linking to OpenBLAS or BLIS. It uses its internal ATen kernels or potentially falls back to less optimized CPU implementations. This serves as a general baseline for assessing the impact of using optimized BLAS libraries.
- PyTorch with OpenBLAS (no RVV): This baseline uses the OpenBLAS library compiled for generic RISC-V without RVV vectorization enabled. It helps isolate the performance gains attributable solely to RVV instructions when compared against the RVV-enabled OpenBLAS configuration.
- PyTorch with BLIS (no RVV): Similarly, this baseline uses BLIS compiled without RVV support. It allows for a comparison of the BLIS library's baseline performance and, subsequently, of the RVV-specific gains when RVV is enabled for BLIS.

By comparing against these baselines, the authors can systematically assess the individual and combined contributions of optimized BLAS libraries and RVV vectorization to LLM inference performance on the RISC-V platform.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results provide a detailed analysis of LLM inference performance on the RISC-V platform, highlighting the nuanced impact of RVV, different BLAS backends, thread counts, model sizes, caching mechanisms, and data types.
6.1.1. aten::addmm Kernel Performance (Microbenchmark)
The aten::addmm microbenchmark, a core GEMM operation, showed clear trends:
- Optimized BLAS Benefit: Switching from the PyTorch default backend to OpenBLAS or BLIS consistently improved performance, underscoring the importance of optimized linear algebra libraries.
- RVV Additional Speedups: Enabling RVV with OpenBLAS or BLIS led to further performance gains, in some scenarios doubling the performance compared to the non-RVV counterparts. This demonstrates RVV's potential for dense matrix operations.
- Scalability with Threads: Performance generally scaled well with increasing thread counts, with optimal performance observed around 16 and 32 threads.
- Overhead for Small Matrices: For the smaller 768 x 768 matrix size, a performance loss was observed at 32 and 64 threads, indicating that thread creation and handling overheads can outweigh the benefits of parallelization for smaller workloads.

The following figure (Figure 2 from the original paper) presents the aten::addmm kernel performance:

Figure 2: aten::addmm kernel performance using 768x768 and 4096x4096 matrices when using PyTorch with different BLAS runtimes over multiple thread counts.
6.1.2. GEMM Operations Share of Total Execution Time
While GEMM operations are numerically dominant, their actual share of execution time can vary significantly in full LLM inference:
- Dominance in CPU Operations: aten::addmm and aten::mm can account for a substantial portion of CPU operations (e.g., 70% for GPT-2, 96% for BERT, over 99% for DeepSeek), confirming their importance.
- Declining Share with Parallelism: However, the percentage of total execution time spent exclusively on GEMM can drop significantly with increased parallelism, sometimes falling below 50% (e.g., GPT2-medium with 64 threads). This implies that as GEMM operations are accelerated, other parts of the LLM inference pipeline (memory access patterns, non-linear activations, normalization, control flow, thread synchronization, communication overhead) become the new bottlenecks.

The following figure (Figure 3 from the original paper) illustrates the time spent exclusively running GEMM computations:

Figure 3: Time spent exclusively running GEMM computations by GPT-2 models compared to the total execution time across different thread counts. The models were run with PyTorch + OpenBLAS with RVV enabled. * Both Llama and Deepseek originally run with Bfloat16 precision, which is not supported by the BLAS libraries and the architecture used in these experiments. While we run Llama with float32 precision, we kept the original Bfloat16 for Deepseek.
6.1.3. Whole Model Inference Performance
The results for full LLM inference fall into three recurring performance patterns:
- BERT-like Behavior: BERT showed predictable scaling: performance improved with increasing thread counts, and RVV provided clear benefits. PyTorch's default backend degraded with parallelism, highlighting its inefficiency. BLIS consistently achieved the best performance among all configurations for BERT.
- Gemma-like Behavior: Gemma-2 exhibited an unexpected pattern:
  - PyTorch + OpenBLAS performed significantly worse than BLIS and default PyTorch in the single-threaded configuration (up to 3x longer execution).
  - However, at 16 and 32 threads, PyTorch + OpenBLAS with RVV achieved the best execution times, demonstrating strong multi-threaded performance gains (a 26.7% speedup at 16 threads over non-RVV). The single-thread anomaly for OpenBLAS suggests potential library overheads or specific kernel choices impacting performance at low parallelism.
- DeepSeek-like Behavior:
  - DeepSeek-LLM showed a complete lack of scalability, with execution times nearly identical across all configurations and thread counts.
  - This was attributed to the model's use of the BF16 data type, which is not natively supported by the RISC-V hardware or the BLAS libraries. The resulting runtime conversions to FP32 introduced significant overhead, negating any benefits from multithreading or RVV.

The following figure (Figure 4 from the original paper) shows LLM inference performance:

Figure 4: LLMs inference performance when generating 25 tokens and using PyTorch with different BLAS runtimes over multiple thread counts.
6.1.4. Impact of Model Scaling on GPT-2 Performance
Evaluating different sizes of GPT-2 revealed varying behaviors:
- PyTorch Default Degradation: The default PyTorch backend performed well for the smaller gpt2 and gpt2-medium models but showed severe degradation for gpt2-large and gpt2-xl.
  - For gpt2-medium, IPC was 0.95 with 0.67 stalled cycles/instruction.
  - For gpt2-large, IPC dropped to 0.37 and stalled cycles increased to 2.35, indicating severe backend and frontend bottlenecks. Cache miss rates (L1 and LLC) increased significantly for gpt2-large, pointing to higher memory pressure.
- Single-Threaded RVV Discrepancy: For gpt2-medium, the RVV-enabled OpenBLAS backend in the single-threaded configuration was over 3x slower than the non-RVV version. Further investigation revealed this was consistent across all common GEMM shapes observed in the trace, especially those used in autoregressive decoding. This highlighted that microbenchmarks can be misleading and that real-world execution traces are crucial.

The following figure (Figure 5 from the original paper) shows the inference execution time of various LLMs (including openai-community/gpt2 and its variants):

The following figure (Image 9 from the original image list, referred to as Figure 6 in the text) shows the execution times of the recorded GEMM shapes for GPT2-medium:

Figure 6: Execution times of the recorded GEMM shapes for GPT2-medium, comparing RVV and no-RVV across thread counts, with separate panels for the smaller and larger matrix shapes.
6.1.5. Impact of RVV with and without Attention Caching (KV caching)
The use_cache parameter in Transformer models (which reuses key and value projections) was investigated:
-
Limited RVV Benefit with Caching: While
use_cache=Truesignificantly reduced instruction count, the runtime improvement withRVVexecution was modest. -
RVV Benefits from Recomputation: The performance gap between
use_cache=Trueanduse_cache=Falsewas much smaller underRVV. Forgpt2-medium,RVV + OpenBLASshowed only1.2xspeedup with caching, compared to3.2xfor non-RVV OpenBLASand11xforPyTorchdefault. -
Conclusion: This suggests
RVVvectorization is more effective when all attention layers are recomputed (use_cache=False), as this path provides higher computational intensity and bettervector utilization. Caching, while generally beneficial, can reducecomputational intensityin a way that limitsRVV's potential.The following figure (Figure 7 from the original paper) shows the single-thread performance of
GPT-2with thetransformers caching mechanism (KV cache)enabled and disabled:
Figure 7: Single-thread performance of GPT-2 with the transformers caching mechanism (KV cache) enabled and disabled.
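The caching comparison itself only requires toggling one generation flag. A minimal sketch, assuming the Hugging Face transformers generation API and the gpt2-medium checkpoint from the paper's model list; the prompt is arbitrary and the timing is illustrative, not the paper's measurement harness.

```python
# Minimal sketch: time greedy decoding with the KV cache enabled and disabled.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("openai-community/gpt2-medium")
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2-medium")
inputs = tok("The quick brown fox", return_tensors="pt")  # arbitrary prompt

for use_cache in (True, False):
    t0 = time.perf_counter()
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=25, do_sample=False,
                       use_cache=use_cache)
    print(f"use_cache={use_cache}: {time.perf_counter() - t0:.2f} s")
```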
6.1.6. Handling Unsupported Data Types: BF16 vs FP32 on RISC-V
The lack of native BF16 support on the RISC-V platform proved to be a major bottleneck:
- Flat Performance with BF16: Models using BF16 (e.g., LLaMA-3.2-1B, DeepSeek-LLM) exhibited flat, non-scalable performance regardless of backend or thread count. PyTorch was forced to dispatch BF16 matrix multiplications to inefficient, mostly single-threaded non-BLAS kernels or NumPy-based implementations.
- Unlocking Performance with FP32 Conversion: Converting the LLaMA model to FP32 significantly improved performance. With FP32, PyTorch correctly dispatched GEMM operations to multi-threaded BLAS libraries, enabling hardware vectorization. For instance, with four threads, PyTorch + OpenBLAS (no RVV) improved from 63 s (BF16) to 38 s (FP32).
- Key Insight: Converting BF16 models to FP32 is essential on RISC-V platforms lacking native BF16 support. This avoids costly runtime fallbacks and unlocks thread-level parallelism and RVV benefits. A minimal conversion sketch follows Figure 8 below.

The following figure (Figure 8 from the original paper) shows the inference performance of the LLaMA LLM on the SG2042 RISC-V CPU:
Figure 8: Inference performance of the LLaMA LLM on the SG2042 RISC-V CPU using PyTorch with different BLAS runtimes and thread counts. Results are shown for the model using its original torch.bfloat16 weight representation and after conversion to torch.float32.
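A minimal sketch of the FP32 conversion step, assuming the transformers API; the checkpoint ID and output directory are illustrative, and one could equally pass torch_dtype=torch.float32 directly to from_pretrained.

```python
# Minimal sketch (illustrative, not the authors' script): load a bfloat16
# checkpoint, convert it to float32 once, and save it so later runs dispatch
# GEMMs to the multi-threaded BLAS backend instead of BF16 fallback kernels.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",        # illustrative (gated) checkpoint ID
    torch_dtype=torch.bfloat16,
)
model = model.to(torch.float32)        # one-time conversion instead of per-op fallbacks
model.save_pretrained("llama-3.2-1b-fp32")   # hypothetical output directory
```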
6.1.7. LLMs' GEMM Performance and Roofline Analysis
The roofline model provided crucial insights into GEMM behavior:
- Theoretical Peak Performance: The following are the results from Table 2 of the original paper:

| Cores | 1 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|
| (no RVV) | 4 | 8 | 16 | 32 | 64 | 128 |
| (with RVV) | 16 | 64 | 128 | 256 | 512 | 1024 |
- Memory Bandwidth: The effective memory bandwidth was measured with 64 cores.
- Small N (Memory-Bound): For GEMM with a small N (e.g., ), the arithmetic intensity (AI) is very low. This places the operation deep into the memory-bound region, especially for RVV, which has a higher Machine Balance Point (BP) ( FLOP/Byte for 1 core). In this scenario, the non-RVV implementation often outperformed the RVV one, because RVV instructions, by loading/storing multiple values simultaneously, place greater demands on memory bandwidth, which is already the bottleneck.
- Intermediate N (Borderline): With a larger N, the AI increases and RVV starts to surpass non-RVV performance, reaching close to the BP.
- Large N (Compute-Bound): At still larger N, the computation fully enters the compute-bound region. Here, both RVV and non-RVV implementations could reach close to their respective limits, with RVV showing significant benefits.
- Scalability Challenges: As the number of cores increased (to 32 and 64), the work per thread decreased substantially. Even though the relative positions of AI with respect to BP remained similar, the performance gap to optimal increased. This indicated that overheads from data movement, thread synchronization, and increased memory contention outweighed the benefits of more compute capacity, particularly for small batch sizes. (The GEMM arithmetic intensity these bullets refer to is written out after Figure 9 below.)

The following figure (Figure 9 from the original paper) shows the roofline models for RVV and no-RVV:
Figure 9: Comparison of roofline models for RVV and no-RVV on 1, 8, 32, and 64 cores, varying N while keeping the other GEMM dimensions fixed.
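For reference, the arithmetic intensity these bullets compare against the machine balance point can be written out for an FP32 GEMM. This is the standard roofline bookkeeping (each matrix counted once at 4 bytes per element); the exact traffic model used by the authors may differ.

$$
\mathrm{AI}(M,N,K) \;=\; \frac{2\,M N K}{4\,\bigl(MK + KN + MN\bigr)} \ \text{FLOP/Byte},
\qquad
\text{memory-bound if } \mathrm{AI} < \mathrm{BP} = \frac{P_{\text{peak}}}{B_{\text{mem}}}.
$$

Under this traffic model, a decode-style GEMM with N = 1 and large M = K yields an AI of roughly 0.5 FLOP/Byte, which is why such shapes sit deep in the memory-bound region regardless of the core count.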
6.1.8. Validation on GPT2-medium (Simulated GEMM)
The simulated GEMM operations for GPT2-medium validated the roofline insights:
- Smaller N (Memory-Bound Confirmation): For the smaller N, arithmetic intensity remained low. RVV vector instructions provided no performance benefit and often degraded performance due to increased memory bandwidth demands and register pressure. Non-RVV consistently outperformed RVV at low thread counts.
- Larger N (Compute-Bound Confirmation): For the larger N, arithmetic intensity rose significantly, pushing the computation into the compute-bound region. RVV became effective, showing substantial improvements over non-RVV execution, especially at low-to-moderate thread counts (1-16 cores).
- Diminishing Returns: At higher core counts (32 and 64), the performance gap narrowed or even reversed. This was attributed to diminishing returns from threading, increased contention on memory channels, and reduced per-thread workload, which limited vector register utilization.
- Conclusion: RVV is primarily beneficial when arithmetic intensity is sufficiently high and the workload per core is large enough to amortize vectorization overhead. This highlights the challenge for typical LLM inference workloads dominated by small matrix multiplications. (A sketch of this memory- vs compute-bound bookkeeping follows Figure 10 below.)

The following figure (Figure 10 from the original paper) presents the GEMM computation, considering two different sizes for N, with RVV enabled and disabled:
Figure 10: GEMM computation, considering two different sizes for N, and RVV enabled and disabled.
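A minimal sketch of the memory- vs compute-bound classification referenced above; the peak throughput, bandwidth, and matrix dimensions are placeholders, not the SG2042 figures from the paper.

```python
# Minimal sketch: compute the FP32 GEMM arithmetic intensity for two values of
# N and compare it against a machine balance point. All numbers are placeholders.
def gemm_arithmetic_intensity(m: int, k: int, n: int, bytes_per_elem: int = 4) -> float:
    flops = 2 * m * n * k                               # multiply-adds
    traffic = bytes_per_elem * (m * k + k * n + m * n)  # each matrix counted once
    return flops / traffic

peak_gflops = 128.0      # placeholder peak throughput
bandwidth_gbs = 16.0     # placeholder sustained memory bandwidth
balance_point = peak_gflops / bandwidth_gbs  # FLOP/Byte

for n in (1, 1024):      # decode-like vs. batched GEMM
    ai = gemm_arithmetic_intensity(m=1024, k=1024, n=n)
    regime = "compute-bound" if ai > balance_point else "memory-bound"
    print(f"N={n:>5}: AI={ai:6.2f} FLOP/Byte -> {regime} (BP={balance_point:.1f})")
```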
6.2. Data Presentation (Tables)
The following are the results from Table 1 of the original paper:
| Model | Parameters | Hidden Dim | # Layers | # Attention Heads | Context Length | FFN Dim | Data Type |
|---|---|---|---|---|---|---|---|
| GPT-2 | 137M | 768 | 12 | 12 | 1024 | 3072 | float32 |
| GPT-2 Medium | 380M | 1024 | 24 | 16 | 1024 | 4096 | float32 |
| GPT-2 Large | 812M | 1280 | 36 | 20 | 1024 | 5120 | float32 |
| GPT-2 XL | 1.61B | 1600 | 48 | 25 | 1024 | 6400 | float32 |
| BERT-large-cased | 335M | 1024 | 24 | 16 | 512 | 4096 | float32 |
| Gemma-2-2B | 2.61B | 2304 | 26 | 8 | 8192 | 9216 | float32 |
| LLaMA-3.2-1B | 1.24B | 2048 | 16 | 32 | 131072 | 8192 | bfloat16 |
| DeepSeek-LLM-7B-base | 7B | 4096 | 30 | 32 | 4096 | 11008 | bfloat16 |
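The architectural parameters in Table 1 can be read directly from each checkpoint's configuration rather than collected by hand. A minimal sketch using GPT-2 as an example, assuming transformers and Hub access; other model families expose analogous but differently named attributes.

```python
# Minimal sketch: read the Table 1 parameters for GPT-2 from its Hub config.
# LLaMA, Gemma, and DeepSeek configs use names such as hidden_size and
# num_hidden_layers instead of the GPT-2-specific attributes below.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("openai-community/gpt2")
print("hidden dim:     ", cfg.n_embd)                      # 768
print("layers:         ", cfg.n_layer)                     # 12
print("attention heads:", cfg.n_head)                      # 12
print("context length: ", cfg.n_positions)                 # 1024
print("FFN dim:        ", cfg.n_inner or 4 * cfg.n_embd)   # 3072 (defaults to 4x hidden)
```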
The following are the results from Table 2 of the original paper:
| Cores | 1 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|
| (no RVV) | 4 | 8 | 16 | 32 | 64 | 128 |
| (with RVV) | 16 | 64 | 128 | 256 | 512 | 1024 |
6.3. Ablation Studies / Parameter Analysis
The authors did not conduct formal ablation studies in the traditional sense (removing a component from their proposed method). However, their experimental design implicitly acts as a series of parameter analyses and comparisons that reveal the effectiveness of different components and configurations:
- RVV On/Off Comparison: By systematically comparing performance with RVV enabled versus RVV disabled (for both OpenBLAS and BLIS), they demonstrated the direct impact of the vector extension. This showed that RVV provides benefits primarily in compute-bound scenarios.
- Choice of BLAS Library: Comparing OpenBLAS and BLIS (both with and without RVV) against the PyTorch default backend evaluated the efficacy of different optimized BLAS implementations. This revealed that BLIS often performed better or scaled differently than OpenBLAS in certain configurations (e.g., the BERT and Gemma-2 single-thread anomaly).
- Varying Core Counts: Running experiments across 1, 4, 8, 16, 32, and 64 cores analyzed the scalability of LLM inference with and without RVV. This highlighted the diminishing returns at higher core counts due to thread overheads and memory contention, especially for memory-bound workloads.
- Model Size Impact: Using four different sizes of GPT-2 (from 137M to 1.61B parameters) allowed an analysis of how model complexity affects performance and RVV effectiveness. This revealed that larger models can expose different bottlenecks (e.g., increased memory pressure for the PyTorch default).
- Impact of the use_cache Parameter: The comparison of LLM inference with and without attention caching (KV cache) demonstrated how model-level optimizations interact with hardware vectorization. It showed that caching, while reducing overall computation, could inadvertently limit RVV utilization by reducing arithmetic intensity.
- Data Type Analysis: The explicit comparison of BF16 and FP32 for the LLaMA models highlighted the critical importance of native hardware support for data types. It showed that conversion to the supported FP32 data type was essential to unlock multi-threading and RVV benefits on the RISC-V platform.

These analyses effectively break down the overall system performance, attributing gains or losses to specific architectural features, software choices, and workload characteristics, thereby mimicking the insights gained from ablation studies. A brief sketch of the runtime knobs these sweeps exercise follows below.
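As referenced above, on the PyTorch side the sweeps mainly exercise the intra-op thread count and, fixed at build time, the linked BLAS library; both can be checked at runtime. A minimal sketch:

```python
# Minimal sketch: verify which BLAS library the PyTorch build was linked against
# and sweep the intra-op thread count, mirroring the paper's 1-64 core sweep.
import torch

print(torch.__config__.show())   # build info string; look for the BLAS/LAPACK backend

for threads in (1, 4, 8, 16, 32, 64):
    torch.set_num_threads(threads)
    print("intra-op threads:", torch.get_num_threads())
```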
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper rigorously assessed the inference performance of pre-trained Large Language Models (LLMs) on a multi-core RISC-V CPU featuring Silicon-Enabled Vectors (RVV) v0.7.1. A significant initial challenge was building the PyTorch stack for RISC-V with RVV-enabled BLAS backends (OpenBLAS and BLIS).
The experimental evaluation demonstrated that while RVV can provide performance gains for LLM inference, these gains are highly dependent on specific configuration choices. For instance, disabling the use_cache feature in Transformers had a lesser impact when RVV was active, suggesting that RVV thrives with higher computational intensity, which caching often reduces.
A deeper investigation, supported by roofline modeling and traced execution data, revealed that RVV-enabled matrix multiplication can be slower than scalar implementations when inference workloads operate on small batch sizes (e.g., ). This is primarily because such GEMMs are memory-bound, limiting the effective utilization of vectorization.
Furthermore, the study highlighted that running LLMs with data types not natively supported by the hardware, such as bfloat16 on the RISC-V platform, leads to flat and non-scalable performance due to the lack of BLAS library support and costly runtime emulation. Converting models to a natively supported format like float32 was crucial for achieving performance improvements and scalability across thread counts and BLAS backends, allowing RVV-enabled backends to deliver measurable benefits, especially with multiple threads.
In summary, the results suggest that RVV is most beneficial in compute-bound configurations that avoid caching and rely on native data types. Its effectiveness is highly contingent on workload characteristics, including matrix shape, number of threads, and arithmetic intensity.
7.2. Limitations & Future Work
The authors identified several limitations in their current work and proposed avenues for future exploration:
- F16 Data Type Performance: Preliminary experiments showed that converting BF16 models to FP16 (which is supported by the RISC-V architecture) resulted in significant performance overhead compared to FP32-converted models. This unexpected behavior warrants future investigation.
- RVV Scalability and Overheads: The inconsistent scalability of RVV (plateauing or regressing for certain models and thread counts) requires further study. Understanding how threading overheads, memory bandwidth saturation, and core-level contention impact RVV efficiency is critical for future optimizations.
- Quantized Models and Sparse Representations: The authors plan to expand their analysis to include quantized models and sparse representations, which might further amplify RVV benefits due to a reduced memory footprint and potentially higher arithmetic intensity.
- Unsupported Data Types on Other Architectures: Future work will investigate the effects of running unsupported data types on available RISC-V architectures compared to other state-of-the-art architectures (e.g., x86 and ARM), which may have better BF16 emulation or native support.
- NUMA Design and Task Scheduling: The SG2042 platform has a NUMA (Non-Uniform Memory Access) design. Investigating the effects of this NUMA design on performance, possibly by testing more complex task scheduling policies and understanding the associated overheads, is another future direction.
7.3. Personal Insights & Critique
This paper offers valuable, practical insights into a cutting-edge area: evaluating LLM inference on real RISC-V hardware with vector extensions. Its rigorous methodology, employing both micro-benchmarks and full LLM inference profiling coupled with roofline analysis, is particularly commendable.
Inspirations Drawn:
- Real-world Applicability: The paper strongly highlights the chasm between theoretical peak performance or micro-benchmark results and actual end-to-end application performance. This underscores the need for application-specific profiling and optimization, especially in emerging hardware ecosystems.
- Interplay of Hardware and Software: The profound impact of data type support and BLAS library integration on RVV effectiveness demonstrates that hardware innovation alone is insufficient; a mature and optimized software stack is equally crucial. This is a key takeaway for anyone working on new architectures.
- Memory-Bound vs. Compute-Bound Nuances: The detailed roofline analysis provides a clear mental model for understanding why vectorization is not always a panacea, especially for the memory-bound workloads common in LLM decoding. This helps architects and developers make informed decisions about where to invest optimization effort.
Potential Issues, Unverified Assumptions, or Areas for Improvement:
- RVV Version Dependency: The paper focuses on RVV v0.7.1. As RVV v1.0 is now stable and gaining traction, the results might differ significantly on newer hardware implementing v1.0. The v0.7.1 standard's immaturity, as noted in the related work, might contribute to some of the observed inconsistencies.
- "Optimal Vector Width" Discrepancy: The abstract mentions finding an "optimal vector width that maximizes throughput for each model size." While the paper discusses the effectiveness of vectorization based on arithmetic intensity and problem size, it does not explicitly detail a methodology for determining, or results of finding, an "optimal vector width." This might be a slight overstatement in the abstract compared to the detailed analysis in the body.
- Single-Threaded OpenBLAS Anomaly: The "Gemma-like behavior," where OpenBLAS (even without RVV) performs poorly in single-threaded configurations for certain models (like Gemma-2 and gpt2-medium), needs deeper investigation. Is it related to OpenBLAS's internal threading model, initialization overheads, or a specific kernel choice that is sub-optimal for single-core execution? This might point to an issue within the OpenBLAS port itself rather than an inherent RVV limitation.
- Energy Efficiency Quantification: While the abstract claims RVV offers "substantial gains" in energy consumption for vectorizable operations, the main body of the paper does not present explicit energy measurements or analysis. Providing specific power/energy numbers would strengthen this claim.
- Broader LLM Workloads: The paper focuses on text generation with a short prompt and 25 generated tokens. Real-world LLM use cases often involve much longer context lengths, larger batch sizes for throughput-oriented tasks, and different fine-tuning scenarios. How these factors interact with RVV performance could reveal further complexities.
- Tooling and Ecosystem Maturity: The challenges in building the PyTorch stack with specific BLAS versions and compiler toolchains highlight the nascent state of the RISC-V software ecosystem. While the paper successfully navigated this, it remains a practical barrier to broader adoption. Further work could quantify the overheads or complexities of this setup.

Overall, this paper serves as a crucial benchmark and analytical foundation for the burgeoning RISC-V AI ecosystem. It not only showcases the potential of RVV but also pragmatically identifies its current limitations and the critical factors for its effective deployment.