ShiDianNao: Shifting Vision Processing Closer to the Sensor
TL;DR Summary
The ShiDianNao CNN accelerator is placed next to a CMOS or CCD sensor to eliminate DRAM accesses entirely, achieving a 60x energy-efficiency improvement over the prior state-of-the-art neural network accelerator and roughly 30x higher performance than high-end GPUs, in a compact design of 4.86 mm² area and 320 mW power consumption.
Abstract
In recent years, neural network accelerators have been shown to achieve both high energy efficiency and high performance for a broad application scope within the important category of recognition and mining applications. Still, both the energy efficiency and performance of such accelerators remain limited by memory accesses. In this paper, we focus on image applications, arguably the most important category among recognition and mining applications. The neural networks which are state-of-the-art for these applications are Convolutional Neural Networks (CNNs), and they have an important property: weights are shared among many neurons, considerably reducing the neural network memory footprint. This property makes it possible to map a CNN entirely within an SRAM, eliminating all DRAM accesses for weights. By further hoisting this accelerator next to the image sensor, it is possible to eliminate all remaining DRAM accesses, i.e., for inputs and outputs. In this paper, we propose such a CNN accelerator, placed next to a CMOS or CCD sensor. The absence of DRAM accesses, combined with a careful exploitation of the specific data access patterns within CNNs, allows us to design an accelerator which is 60x more energy efficient than the previous state-of-the-art neural network accelerator. We present a full design down to the layout at 65 nm, with a modest footprint of 4.86 mm² and consuming only 320 mW, but still about 30x faster than high-end GPUs.
In-depth Reading
1. Bibliographic Information
1.1. Title
The central topic of the paper is "ShiDianNao: Shifting Vision Processing Closer to the Sensor."
1.2. Authors
The authors and their affiliations are:
- Zidong Du, Tianshi Chen, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen: State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences (CAS), China.
- Robert Fasthuber, Paolo Ienne: EPFL, Switzerland.
- Olivier Temam: Inria, France.
1.3. Journal/Conference
The paper was published at ISCA 2015 (the 42nd ACM/IEEE International Symposium on Computer Architecture), following the authors' previous work (DianNao [3] in ASPLOS). Such venues are highly reputable and influential in the fields of computer architecture, hardware design, and accelerators for emerging workloads like machine learning.
1.4. Publication Year
The paper was published in 2015, consistent with its most recent cited works (e.g., [4]).
1.5. Abstract
The paper addresses the energy efficiency and performance limitations of neural network accelerators, primarily caused by memory accesses. It focuses on image applications, specifically Convolutional Neural Networks (CNNs), which benefit from weight sharing, significantly reducing their memory footprint. This property allows an entire CNN to be mapped within on-chip SRAM, eliminating DRAM accesses for weights. The core innovation is to further integrate this CNN accelerator directly next to an image sensor (CMOS or CCD), thereby eliminating all remaining DRAM accesses for inputs and outputs. The proposed accelerator, ShiDianNao, leverages the absence of DRAM accesses and exploits specific data access patterns within CNNs to achieve 60 times greater energy efficiency than the previous state-of-the-art neural network accelerator. The design is presented down to a 65 nm layout, occupying a modest 4.86 mm² area and consuming only 320 mW, while still being approximately 30 times faster than high-end GPUs.
1.6. Original Source Link
/files/papers/6915a0914d6b2ff314a02e49/paper.pdf
This is an officially published paper, likely from a conference proceedings, provided as a PDF link.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the inherent limitation of neural network accelerators in terms of both energy efficiency and performance, largely due to frequent memory accesses, particularly to off-chip DRAM. While accelerators have gained traction as energy- and cost-effective alternatives to CPUs and GPUs for recognition and mining applications, memory bandwidth remains a bottleneck, as acknowledged even in prior specialized designs like DianNao [3].
This problem is particularly critical for vision applications, which constitute one of the broadest categories of recognition tasks. In many real-world and embedded scenarios (e.g., smartphones, security cameras, self-driving cars), image data originates directly from a CMOS or CCD sensor. The traditional pipeline involves the image being acquired by the sensor, transferred to DRAM, and then fetched by a CPU/GPU for processing. Each of these steps, especially DRAM accesses, incurs significant energy costs.
The paper's entry point and innovative idea stem from exploiting specific properties of Convolutional Neural Networks (CNNs), which are state-of-the-art for image applications. CNNs have a crucial property: weights are shared among many neurons (translation invariance), drastically reducing the neural network memory footprint. This makes it feasible to entirely map a CNN's weights within a small, fast, and energy-efficient on-chip SRAM. Building on this, the paper proposes hoisting this entire accelerator next to the image sensor. This radical architectural shift allows for the elimination of all DRAM accesses, not just for weights but also for inputs (from the sensor) and outputs (typically just classification results), thereby addressing the memory bottleneck at its root.
2.2. Main Contributions / Findings
The primary contributions of this paper are:
- Novel Accelerator Architecture (ShiDianNao): Proposal and detailed design of a CNN accelerator specifically engineered to be placed directly adjacent to CMOS or CCD image sensors. The architecture is optimized to eliminate all off-chip DRAM accesses, for both model weights and input/output data.
- Exploitation of CNN Properties: A careful exploitation of CNN-specific data access patterns, particularly the weight-sharing property and the 2D data locality of feature maps, which allows the entire CNN to be mapped within on-chip SRAM and enables efficient inter-Processing Element (PE) data reuse.
- Significant Energy Efficiency Improvement: ShiDianNao achieves, on average, 60x better energy efficiency than the previous state-of-the-art neural network accelerator, DianNao [3]. When integrated directly with a sensor, it is 87.39x more energy efficient than DianNao and 2.37x more energy efficient than an idealized DianNao-FreeMem (which assumes zero memory-access cost).
- High Performance: Despite its ultra-low power consumption and small footprint, ShiDianNao is approximately 30x faster than a high-end GPU on the targeted visual recognition tasks, and 1.87x faster than the resized DianNao baseline.
- Full Design and Evaluation: Presentation of a full design down to the layout in 65 nm CMOS technology, with a modest footprint of 4.86 mm² and a power consumption of only 320 mW at 1 GHz, while delivering a peak throughput of 194 GOP/s. The design is empirically evaluated on ten representative CNN benchmarks.

The key conclusion is that by shifting vision processing physically closer to the sensor, and by meticulously optimizing for CNN data patterns so that all DRAM accesses are eliminated, it is possible to achieve unprecedented levels of energy efficiency and performance for sophisticated visual recognition in embedded systems. This approach significantly lowers hardware and energy costs, potentially enabling widespread deployment of advanced vision processing in mobile and wearable devices.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully understand the ShiDianNao paper, a foundational understanding of several key concepts in computer architecture, machine learning, and digital design is essential.
- Neural Networks (NNs) and Accelerators:
  - Definition: Neural networks are computational models inspired by biological neural networks, used for tasks like pattern recognition and classification. They consist of layers of interconnected "neurons" that process data.
  - Accelerators: These are specialized hardware components designed to speed up specific computational tasks, in this case neural network operations. They aim to achieve higher performance and energy efficiency than general-purpose processors (CPUs/GPUs) for those tasks. The motivation comes from the observation that general-purpose chips are often inefficient for highly specialized, compute-intensive workloads [20].
  - Energy and Performance Bottlenecks: A major bottleneck in conventional computing systems, including those using CPUs and GPUs, is memory access. Moving data between levels of the memory hierarchy (e.g., between off-chip DRAM and on-chip caches/registers) consumes significantly more energy than performing arithmetic operations [20, 28]. Accelerators try to mitigate this by bringing computation closer to data or by reducing data movement.
- Convolutional Neural Networks (CNNs):
  - Definition: CNNs are a class of NNs particularly effective for processing grid-like data such as images, and are the state-of-the-art for many vision tasks.
  - Layers: A CNN typically consists of:
    - Convolutional Layers (C): These apply a set of learnable filters (or kernels) across the input data. Each filter slides over the input (e.g., an image or a feature map) and computes a dot product, generating a feature map that highlights specific features (edges, textures, etc.).
      - Weight Sharing (Translation Invariance): A key property of CNNs is that the same filter (set of weights) is applied across the entire input feature map: if a particular feature (e.g., an edge) is useful in one part of an image, it is likely useful elsewhere. This dramatically reduces the total number of unique weights that must be stored compared to Fully Connected (FC) layers (a parameter-count sketch at the end of this subsection illustrates the difference). It also implies translation invariance: the network can detect a feature regardless of its position in the input.
    - Pooling Layers (S): These layers reduce the spatial dimensions (width and height) of the input feature maps, which reduces the number of parameters and computations and makes the features more robust to slight shifts or distortions. Common operations are max pooling (taking the maximum value in a window) and average pooling.
    - Normalization Layers: These layers normalize the activities of neurons to improve the network's performance and stability during training. Examples include Local Response Normalization (LRN) and Local Contrast Normalization (LCN).
    - Classifier Layers (F) (Fully Connected Layers): Typically found at the end of a CNN, these are standard Multi-Layer Perceptrons (MLPs) in which every neuron in one layer connects to every neuron in the next layer with an independent weight. They perform the final classification based on the high-level features extracted by the convolutional and pooling layers.
  - Feature Map: A 2D array representing the output of a convolutional or pooling layer, effectively a map of where a specific feature is detected in the input.
  - Kernel/Filter: A small matrix of weights that slides over the input feature map to perform convolution. Its size is denoted by $K_x \times K_y$.
  - Stride: The step size by which the kernel moves across the input feature map, denoted by $S_x$ and $S_y$.
- Deep Neural Networks (DNNs):
  - Definition: A broader category of NNs with multiple hidden layers; CNNs are a type of DNN.
  - Distinction from CNNs: While CNNs use weight sharing, generic DNNs (often referring to MLPs, or networks without convolutional properties in their early layers) typically do not. In such a DNN, each connection between neurons has a unique weight, leading to a much larger number of synapses (weights) to store. This difference is critical for ShiDianNao's design.
- Memory Hierarchy (SRAM vs. DRAM):
  - DRAM (Dynamic Random Access Memory): High-capacity, relatively slow, and energy-intensive off-chip memory, used for main system memory. Accessing DRAM is a major energy and performance bottleneck.
  - SRAM (Static Random Access Memory): Low-capacity, very fast, and energy-efficient on-chip memory, often used for caches and registers. It is much more expensive per bit and less dense than DRAM. ShiDianNao aims to fit all necessary data into SRAM.
- Fixed-Point vs. Floating-Point Arithmetic:
  - Floating-Point: Represents numbers with a fractional component and a dynamic range, offering high precision, similar to how computers handle real numbers in scientific computing. Typically 32-bit (single-precision) or 64-bit (double-precision).
  - Fixed-Point: Represents numbers with a fixed number of bits for the integer part and a fixed number of bits for the fractional part. It has a smaller dynamic range and lower precision than floating-point, but requires significantly less hardware (smaller multipliers and adders) and consumes less energy.
  - Rationale for 16-bit Fixed-Point in NNs: Previous studies [3, 10, 57] have shown that for neural network inference (the recognition phase), 16-bit fixed-point arithmetic incurs negligible accuracy loss compared to 32-bit floating-point, while drastically reducing hardware cost (e.g., a 16-bit fixed-point multiplier is 6.10x smaller and 7.33x more energy-efficient than a 32-bit floating-point one [3]).
- Image Sensors (CMOS/CCD):
  - Definition: Devices that convert light into electrical signals, capturing images.
  - CMOS (Complementary Metal-Oxide-Semiconductor) / CCD (Charge-Coupled Device): The two main image sensor technologies, and the initial source of image data in many embedded applications.
  - Integration: ShiDianNao proposes placing the accelerator physically next to these sensors to intercept the data stream before it goes to off-chip memory.
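To make the weight-sharing point above concrete, the following back-of-the-envelope sketch (in Python, with illustrative layer dimensions not taken from the paper) compares the unique weights of a convolutional layer against a fully connected layer covering the same neurons:

```python
def conv_params(n_in_maps, n_out_maps, kx, ky):
    # one shared kx*ky kernel per (input map, output map) pair, plus one bias each
    return n_in_maps * n_out_maps * (kx * ky + 1)

def fc_params(n_in, n_out):
    # one unique weight per input-output connection, plus one bias per output
    return n_in * n_out + n_out

# 6 input maps of 28x28 neurons producing 16 output maps of 24x24 via 5x5 kernels:
shared = conv_params(6, 16, 5, 5)                # 2,496 parameters (~5 KB at 16 bits)
unshared = fc_params(6 * 28 * 28, 16 * 24 * 24)  # ~43.4 million parameters
print(shared, unshared, unshared // shared)      # weight sharing: ~17,000x fewer weights
```

This kind of gap is what lets a practical CNN's full set of synapses (at most 118 KB across the paper's benchmarks) fit in on-chip SRAM.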
3.2. Previous Works
The paper contextualizes ShiDianNao against several categories of previous work:
- General-Purpose Processors (CPUs and GPUs) for NNs:
  - Conventionally, NNs were executed on CPUs [59, 2] or GPUs [15, 51, 5]. While flexible, these platforms are not optimized for the specific data access patterns and arithmetic intensity of NNs, leading to poor energy efficiency. GPUs, though powerful for parallel computation, can be inefficient for the small computational kernels found in CNNs due to the overhead of managing thousands of threads.
- Dedicated Neural Network Accelerators:
  - Early Designs: A first wave of specialized NN hardware emerged in the late 20th century [25].
  - Modern Implementations: More recent accelerators have been implemented on FPGAs [50, 46, 52] or as ASICs [3, 16, 57].
  - Systolic Architectures:
    - Examples: NeuFlow [16] by Farabet et al., and a systolic-like coprocessor by Chakradhar et al. [2].
    - Characteristics: Effective for 2D convolution in signal processing [37, 58, 38, 22]; data flows through an array of processing elements in a pipelined fashion.
    - Limitations: Often lack the flexibility to support diverse CNN settings (e.g., varying kernel sizes and strides) [8, 50, 16, 2], and can have high memory bandwidth requirements.
  - SIMD-like Architectures:
    - Examples: NnSP (Neural Network Stream Processing core) by Esmaeilzadeh et al. [12] (though initially targeting MLPs). Peemen et al. [45] proposed an FPGA accelerator with a customized memory subsystem for CNNs, but it still required a host processor, limiting overall energy efficiency. Gokhale et al. [18] designed a mobile coprocessor for visual processing supporting both CNNs and DNNs.
    - Limitations: Many of these designs did not treat main memory accesses as a primary concern, or connected computational blocks directly to main memory via DMA, still incurring significant energy costs for data movement.
- DianNao Family [3, 4, 40]:
  - DianNao [3]: This is the most direct predecessor and the crucial baseline for ShiDianNao, proposed by some of the same authors. DianNao was designed as a small-footprint, high-throughput accelerator for a broad range of neural networks, including both CNNs and DNNs. It introduced dedicated on-chip SRAM buffers to reduce main memory accesses.
  - DianNao's Limitation (relevant to ShiDianNao): To support a broad scope of NNs (including DNNs without weight sharing), DianNao did not implement specialized hardware to extensively exploit the 2D data locality specific to CNNs. Instead, it often treated 2D feature maps as 1D data vectors, which required more data movement than ShiDianNao's approach (albeit to its on-chip buffers) and was thus less energy-efficient for CNNs than a specialized design.
  - Other DianNao Family Members: Later works such as DaDianNao [4] (for large-scale NNs) and another accelerator [40] (for classic machine learning techniques) were optimized for different scales or workloads and were not designed for embedded applications like ShiDianNao.
3.3. Technological Evolution
The field of NN acceleration has evolved significantly:
- Early NN Hardware (1980s-1990s): Initial attempts at hardware for NNs often involved purely spatial implementations or simple systolic arrays, but were limited by transistor densities and the complexity of NNs [25].
- GPU Dominance (2000s-early 2010s): GPUs became the go-to platform for parallel NN training and inference due to their high computational throughput. However, their general-purpose nature and high power consumption limited their suitability for embedded, energy-constrained scenarios.
- Emergence of Specialized Accelerators (early 2010s onwards): The realization that NNs (especially CNNs) have specific computational and data access patterns led to the development of ASIC and FPGA accelerators. These designs began focusing on:
  - Quantization: Using lower-precision arithmetic (e.g., 16-bit fixed-point) to reduce hardware cost and energy while maintaining sufficient accuracy.
  - On-chip Memory: Employing large on-chip SRAMs to reduce costly off-chip DRAM accesses.
  - Dataflow Optimization: Designing processing elements and memory access patterns to maximize data reuse and minimize data movement.
- "Near-Sensor" Processing: ShiDianNao represents a further evolution, pushing the accelerator physically closer to the data source (the image sensor) to completely eliminate off-chip memory accesses, targeting the most energy-sensitive edge applications.
3.4. Differentiation Analysis
Compared to the main methods in related work, ShiDianNao offers several core differences and innovations:
- Complete Elimination of DRAM Accesses: This is ShiDianNao's most significant differentiator. Unlike most prior accelerators (including DianNao), which still relied on DRAM for input/output data (and sometimes weights), ShiDianNao leverages the small footprint of CNNs (due to weight sharing) and its proximity to the sensor to avoid all DRAM accesses. This is achieved by fitting the entire CNN model and the relevant image portions into sufficiently large on-chip SRAMs.
- Sensor-Adjacent Integration: The physical placement of the accelerator directly next to the CMOS/CCD sensor is a unique architectural choice. It allows the accelerator to intercept raw image data streams and process them before they ever reach DRAM, drastically cutting energy consumption.
- Specialized CNN Data Locality Exploitation: While DianNao aimed for broader NN support (including DNNs, which do not benefit as much from 2D data locality), ShiDianNao is custom-built for CNNs. Its Neural Functional Unit (NFU) is a 2D mesh of PEs, optimized for 2D feature maps. Crucially, it incorporates inter-PE data propagation (using FIFOs) to efficiently reuse input neurons among adjacent PEs, further reducing internal SRAM bandwidth requirements. DianNao treated 2D data as 1D vectors, missing out on these 2D locality benefits.
- Flexible yet Optimized Design for CNNs: Unlike rigid systolic arrays that might be optimized for a single CNN configuration, ShiDianNao maintains flexibility through its Hierarchical Finite State Machine (HFSM) based control, allowing it to accommodate various CNN layers and parameters while remaining highly optimized for CNN operations.
- Orders-of-Magnitude Improvement in Energy Efficiency: The combined innovations lead to an unprecedented 60x energy-efficiency improvement over DianNao, and almost 4700x over GPUs, demonstrating the profound impact of its architectural choices.

In essence, ShiDianNao represents a paradigm shift towards ultra-low-power, embedded visual intelligence, achieved by tightly integrating specialized hardware with the image sensor and meticulously optimizing for CNN characteristics at every architectural level.
4. Methodology
4.1. Principles
The core idea behind ShiDianNao is to achieve extreme energy efficiency and high performance for Convolutional Neural Network (CNN)-based visual recognition by eliminating all costly off-chip DRAM accesses. This is accomplished through two main principles:
- On-Chip Mapping of the Entire CNN: By exploiting the weight-sharing property of CNNs, which significantly reduces their memory footprint compared to Deep Neural Networks (DNNs), ShiDianNao is designed with sufficiently large on-chip SRAM to store all synapses (weights) of a practical CNN. This removes the need for DRAM accesses for the model itself.
- Sensor-Adjacent Integration for Input/Output Elimination: The accelerator is small enough to be physically placed directly next to a CMOS or CCD image sensor, allowing it to process image data directly as it streams from the sensor and thus eliminating DRAM accesses for input images. Only the few bytes of recognition results (e.g., an image category) are then sent to DRAM or the host processor, effectively eliminating DRAM accesses for outputs as well.

These principles, combined with careful exploitation of the CNN's inherent 2D data locality and efficient data movement within the accelerator, form the foundation of ShiDianNao's high energy efficiency.
4.2. Core Methodology In-depth (Layer by Layer)
As illustrated in Figure 4, the ShiDianNao accelerator consists of several main components:
- NBin (Input Neuron Buffer): Stores input neurons for the current layer.
- NBout (Output Neuron Buffer): Stores output neurons generated by the current layer. (NBin and NBout can exchange roles for subsequent layers.)
- SB (Synapse Buffer): Stores the weights (synapses) of the CNN model.
- NFU (Neural Functional Unit): The main computational engine for fundamental neuron operations (multiplications, additions, comparisons).
- ALU (Arithmetic Logic Unit): Handles activation functions and other arithmetic operations not covered by the NFU.
- IB (Instruction Buffer): Stores control instructions for the accelerator.

The entire design uses 16-bit fixed-point arithmetic operators for both the NFU and ALU. This choice is based on prior studies showing negligible accuracy loss for NNs alongside significant hardware cost and energy savings (e.g., a 16-bit fixed-point multiplier is 6.10x smaller and 7.33x more energy-efficient than a 32-bit floating-point multiplier in TSMC 65 nm technology [3]).
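As a rough illustration of what such a 16-bit fixed-point operator computes, here is a minimal sketch assuming a Q1.15 format (the exact format is an assumption; the paper only states that 16-bit fixed-point operators are used):

```python
import numpy as np

FRAC_BITS = 15              # assumed Q1.15: 1 sign bit, 15 fractional bits
SCALE = 1 << FRAC_BITS

def to_fixed(x):
    # clamp to the representable range, then round to the nearest step
    q = np.clip(np.round(x * SCALE), -(1 << 15), (1 << 15) - 1)
    return q.astype(np.int32)

def fixed_mul(a, b):
    # a 16x16-bit multiply yields a 32-bit product; shifting rescales to Q1.15
    return (a * b) >> FRAC_BITS

w, x = to_fixed(0.42), to_fixed(-0.3)
print(fixed_mul(w, x) / SCALE)   # ~ -0.12601, vs the exact product -0.126
```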
The following figure (Figure 4 from the original paper) shows the overall architecture of the accelerator:
(Figure 4 from the original paper: schematic of the ShiDianNao accelerator architecture, including the input image interface, buffer controllers, decoder, NFU, and ALU.)
4.2.1. Neural Functional Unit (NFU)
The NFU is the core computational component, optimized for 2D feature maps. Unlike designs that treat 2D data as 1D vectors, ShiDianNao's NFU is a 2D mesh of $P_x \times P_y$ PEs (an 8 x 8 mesh in the evaluated design).
The following figure (Figure 5 from the original paper) depicts the architecture of the NFU:
(Figure 5 from the original paper: the NFU architecture and its input/output connections, showing input columns and rows, kernel inputs, and output control signals.)
- Neuron-PE Mapping: Instead of allocating a block of PEs to a single output neuron (which would lead to complex data sharing and variability issues), ShiDianNao maps each output neuron to a single PE. Each PE then time-shares its computation across the input neurons (synapses) connecting to that output neuron. When a CNN layer executes, each PE continuously processes a single output neuron until it is fully computed, then switches to another.
- Processing Elements (PEs): The following figure (Figure 6 from the original paper) shows the detailed architecture of a single PE:

(Figure 6 from the original paper: schematic of a PE's internals, including the multiplier, adder, and registers, with data entering and leaving through FIFO buffers.)

At each cycle, each PE (denoted $PE_{x,y}$ for the PE at the x-th row and y-th column of the NFU) can perform:
  - A multiplication and an addition (for convolutional, classifier, or normalization layers).
  - An addition (for average pooling).
  - A comparison (for max pooling).

Each PE has:
  - Three inputs: control signals; synapses (e.g., kernel values) from SB; and neurons from NBin/NBout or from a neighbor PE (the right or the bottom neighbor, depending on control).
  - Two outputs: computation results to NBout/NBin; and locally stored neurons propagated to neighbor PEs for data reuse.
- Inter-PE Data Propagation: This mechanism is crucial for exploiting CNN 2D data locality. In convolutional, pooling, and normalization layers, adjacent output neurons often require data from significantly overlapping rectangular windows of input neurons. While data could be repeatedly read from NBin/NBout, this would demand high bandwidth. The following figure (Figure 7 from the original paper) illustrates the internal bandwidth required with and without inter-PE data propagation:

(Figure 7 from the original paper: memory bandwidth (GB/s) for input neurons and kernels (synaptic weights) as a function of the number of PEs, with and without inter-PE data propagation.)

To support efficient data reuse, ShiDianNao allows inter-PE data propagation within the PE mesh. Each PE includes two FIFOs (First-In, First-Out buffers):
  - FIFO-H (Horizontal FIFO): Buffers data from NBin/NBout and from the right neighbor PE. This data is then propagated to the left neighbor PE for reuse.
  - FIFO-V (Vertical FIFO): Buffers data from NBin/NBout and from the upper neighbor PE. This data is then propagated to the lower neighbor PE for reuse.

This mechanism drastically reduces the internal bandwidth requirement between the on-chip buffers and the NFU. For example, for the convolutional layer C1 of LeNet-5, inter-PE data propagation reduces the NBin bandwidth requirement by 73.88%.
4.2.2. Arithmetic Logic Unit (ALU)
The ALU complements the NFU by performing computational primitives not covered by the PEs. It also uses 16-bit fixed-point arithmetic.
Its functions include:
- Division: Used for operations in average pooling and normalization layers.
- Non-linear Activation Functions: Computes the tanh() and sigmoid() functions (used in convolutional and pooling layers). These are approximated using piecewise linear interpolation: $f(x) \approx a_i \times x + b_i$ when $x \in [x_i, x_{i+1})$. The segment coefficients $a_i$ and $b_i$ are pre-stored in registers, enabling efficient computation with one multiplier and one adder. This approximation introduces only negligible accuracy loss [31, 3].
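A minimal sketch of this approximation scheme follows (the 16-segment count and the [-8, 8) input range are illustrative assumptions, not values from the paper):

```python
import numpy as np

SEGMENTS = 16
xs = np.linspace(-8.0, 8.0, SEGMENTS + 1)                       # boundaries x_i
a = (np.tanh(xs[1:]) - np.tanh(xs[:-1])) / (xs[1:] - xs[:-1])   # slopes a_i
b = np.tanh(xs[:-1]) - a * xs[:-1]                              # intercepts b_i

def tanh_pwl(x):
    # pick segment i with x in [x_i, x_{i+1}), then one multiply + one add
    i = np.clip(np.searchsorted(xs, x, side="right") - 1, 0, SEGMENTS - 1)
    return a[i] * x + b[i]

print(tanh_pwl(0.7), np.tanh(0.7))   # coarse with uniform segments; a real
                                     # design would place boundaries to bound error
```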
4.2.3. Storage Architecture
ShiDianNao relies heavily on on-chip SRAM to store all data and instructions, enabling the elimination of off-chip DRAM accesses.
- SRAM Capacity: The design incorporates 288 KB of on-chip SRAM, which is sufficient for all 10 practical CNNs benchmarked in the paper (Table 1 of the paper reports at most 45 KB of neuron storage and 118 KB of synapse storage across the benchmarks). This SRAM has a moderate cost: a 128 KB SRAM is estimated at 1.65 mm² and 0.44 nJ per read in a TSMC 65 nm process.
- SRAM Partitioning and Banking: The on-chip SRAM is split into dedicated buffers for different data types, allowing suitable read widths and minimizing the energy per read:
  - NBin & NBout: Store input and output neurons, respectively, and exchange roles for successive layers. Each is organized into $2P_y$ banks, supporting both SRAM-to-PE data movements and inter-PE data propagation. Each bank is $P_x \times 16$ bits wide (one row of 16-bit neurons). They must be large enough to hold all neurons of a whole layer.
  - SB (Synapse Buffer): Stores all CNN synapses, and is likewise banked so that all PEs can be fed in parallel.
  - IB (Instruction Buffer): Stores control instructions.
4.2.4. Control Architecture
The control mechanisms ensure efficient data reuse and flexible operation for various CNN layers.
- Buffer Controllers: The NB controller (shared by NBin and NBout) is a key example. The following figure (Figure 9 from the original paper) shows its architecture:

(Figure 9 from the original paper: architecture of the NB controller, connecting the SRAM banks to the NFU through a column buffer and a MUX array.)

It supports six read modes and one write mode. NBin has $2P_y$ banks, each $P_x \times 16$ bits wide. The following figure (Figure 10 from the original paper) illustrates the six read modes of the NB controller:

(Figure 10 from the original paper: the six read modes of the NB controller.)

The read modes are:
  - (a) Read multiple banks (banks #0 to #(P_y - 1)).
  - (b) Read multiple banks (banks #P_y to #(2P_y - 1)).
  - (c) Read one bank.
  - (d) Read a single neuron.
  - (e) Read neurons with a given step size.
  - (f) Read a single neuron per bank (banks #0 to #(P_y - 1), or banks #P_y to #(2P_y - 1)).

Different modes are selected based on the CNN layer type. For example, convolutional and pooling layers use modes (a), (b), (e), (c), and (f) to implement sliding-window data access, while classifier layers primarily use mode (d) to load a single input neuron shared by all output neurons. Normalization layers decompose into sub-layers that behave like the above.

The write mode of the NB controller is simpler. After the PEs compute their output neurons, results are temporarily stored in an output register array (Figure 9). Once results have been collected from all PEs, they are written to NBout simultaneously, organized as a data block of $P_y$ rows, each $P_x \times 16$ bits wide. The block is written to either the first $P_y$ banks or the second $P_y$ banks of NBout, depending on its column position in the output feature map. The following figure (Figure 11 from the original paper) shows the data organization of NB:

(Figure 11 from the original paper: relationship between the NB banks and feature maps, showing the access paths and the relevant dimensions.)
- Control Instructions: To support flexible CNN configurations without excessively large instruction storage, a two-level Hierarchical Finite State Machine (HFSM) is used. The following figure (Figure 12 from the original paper) illustrates the hierarchical control finite state machine:

(Figure 12 from the original paper: the two-level hierarchical control finite state machine.)

  - First-level states: Describe abstract tasks (e.g., the different layer types: Conv, Pool, Class, Norm, and ALU tasks).
  - Second-level states: Characterize low-level execution events within each first-level state (e.g., the execution phases for an input-output feature map pair in a Conv layer).

A 61-bit instruction represents each HFSM state and its related parameters (e.g., feature map size), and is decoded into detailed control signals spanning multiple accelerator cycles. This compact representation means a CNN requiring 50K cycles needs only 1 KB of instruction storage and a small decoder (0.03 mm² in the 65 nm process), saving significant area and power compared to storing cycle-by-cycle control signals.
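The following toy sketch illustrates the two-level idea: a single compact instruction expands into many cycles of low-level control events. The field names and widths here are hypothetical; the paper only specifies a 61-bit encoding of the HFSM state plus layer parameters.

```python
from dataclasses import dataclass

@dataclass
class HfsmInstruction:
    layer_type: str      # first-level state: Conv / Pool / Class / Norm / ALU
    fm_height: int       # example layer parameters carried in the instruction
    fm_width: int

def expand(inst):
    # second-level states: emit per-cycle control events for one instruction
    for row in range(inst.fm_height):
        for col in range(inst.fm_width):
            yield (inst.layer_type, row, col)   # stand-in for control signals

inst = HfsmInstruction("Conv", fm_height=28, fm_width=28)
print(sum(1 for _ in expand(inst)), "control cycles from one instruction")
```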
4.2.5. CNN Mapping
This section details how different CNN layer types are mapped onto the ShiDianNao architecture.
4.2.5.1. Convolutional Layer
A convolutional layer generates multiple output feature maps from multiple input feature maps. The accelerator processes one output feature map at a time. Within an output feature map, each PE continuously computes a single output neuron.
The output neuron at position (a, b) of output feature map #mo is computed with the following formula:
$
\mathbf{O}_{a,b}^{mo} = f\left( \sum_{mi \in A_{mo}} \left( \beta^{mi,mo} + \sum_{i=0}^{K_x-1} \sum_{j=0}^{K_y-1} \omega_{i,j}^{mi,mo} \times \mathbf{I}_{aS_x+i,\, bS_y+j}^{mi} \right) \right)
$
Where:
- $\mathbf{O}_{a,b}^{mo}$: The output neuron at position (a, b) of output feature map #mo.
- $f(\cdot)$: The non-linear activation function (e.g., tanh or sigmoid), computed by the ALU.
- $\sum_{mi \in A_{mo}}$: Summation over $A_{mo}$, the set of input feature maps connected to output feature map #mo.
- $\beta^{mi,mo}$: The bias value for the pair of input feature map #mi and output feature map #mo.
- $\sum_{i=0}^{K_x-1} \sum_{j=0}^{K_y-1}$: Summation over the kernel dimensions.
- $\omega_{i,j}^{mi,mo}$: The kernel coefficient (weight) at position (i, j) between input feature map #mi and output feature map #mo, read from SB.
- $\mathbf{I}_{aS_x+i,\, bS_y+j}^{mi}$: The input neuron at the corresponding position within input feature map #mi, where $S_x$ and $S_y$ are the step sizes (strides) of the convolutional window.

The following figure (Figure 13 from the original paper) illustrates an example of algorithm-hardware mapping for a convolutional layer:

(Figure 13 from the original paper: example algorithm-hardware mapping of a convolutional layer onto the PE array.)
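A direct (unoptimized) software rendering of the equation above may help before walking through the hardware schedule; this sketch assumes every input map connects to every output map and uses tanh as $f$:

```python
import numpy as np

def conv_layer(I, W, beta, Sx, Sy):
    Mi, Hi, Wi = I.shape           # input maps and their dimensions
    Mo, _, Kx, Ky = W.shape        # output maps and kernel dimensions
    Ho, Wo = (Hi - Kx) // Sx + 1, (Wi - Ky) // Sy + 1
    O = np.zeros((Mo, Ho, Wo))
    for mo in range(Mo):
        for a in range(Ho):
            for b in range(Wo):    # in ShiDianNao, one PE owns each (a, b)
                acc = 0.0
                for mi in range(Mi):
                    win = I[mi, a*Sx:a*Sx+Kx, b*Sy:b*Sy+Ky]
                    acc += beta[mi, mo] + np.sum(W[mo, mi] * win)
                O[mo, a, b] = np.tanh(acc)
    return O

I = np.random.rand(2, 8, 8)              # 2 input feature maps
W = np.random.rand(3, 2, 3, 3) * 0.1     # 3 output maps, 3x3 kernels
print(conv_layer(I, W, np.zeros((2, 3)), 1, 1).shape)   # (3, 6, 6)
```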
Let's consider a small design with 2 x 2 PEs and a 3 x 3 kernel with a 1 x 1 step size:
- Cycle #0:
  - All four PEs simultaneously read their first required input neurons from NBin using Read Mode (a).
  - They also read the same kernel value, k(0,0), from SB.
  - Each PE performs a multiplication and stores the result locally.
  - Each PE stores its received input neuron in FIFO-H and FIFO-V for future inter-PE data propagation.
- Cycle #1:
  - The two left-column PEs retrieve their required input neurons from the FIFO-Hs of their right-hand neighbors (horizontal inter-PE data propagation).
  - The two right-column PEs read their required input neurons from NBin using Read Mode (f).
  - All PEs share the same kernel value, k(0,1), from SB.
- Cycle #2: Similar to Cycle #1, with the left-column PEs getting data from the FIFO-Hs and the right-column PEs reading from NBin (Mode (f)); all PEs share kernel value k(0,2). At this point, each PE has processed the first row of its convolutional window.
- Cycle #3:
  - Two of the PEs retrieve their required input neurons from the FIFO-Vs of their vertical neighbors (vertical inter-PE data propagation).
  - The other two PEs read their required input neurons from NBin using Read Mode (c).
  - All PEs share the same kernel value, k(1,0), from SB.
  - Each PE again stores received input neurons in its FIFOs for future propagation.

This detailed, cycle-by-cycle management of data flow and inter-PE data propagation significantly reduces NBin reads and thus the internal bandwidth requirement.
4.2.5.2. Pooling Layer
A pooling layer downsamples input feature maps. Each output neuron is computed from a pooling window of input neurons. The accelerator computes one output feature map at a time, with each PE continuously working on a single output neuron.
The output neuron at position (a, b) of output feature map #mo for max pooling is computed with the following formula:
$
\mathbf{O}_{a,b}^{mo} = \max_{0 \leq i < K_x,\; 0 \leq j < K_y} \left( \mathbf{I}_{a+i,\, b+j}^{mi} \right)
$
Where:
- $\mathbf{O}_{a,b}^{mo}$: The output neuron at position (a, b) of output feature map #mo.
- $\max$: The maximum operation over the pooling window.
- $K_x \times K_y$: The dimensions of the pooling window.
- $\mathbf{I}_{a+i,\, b+j}^{mi}$: The input neuron at the corresponding position within input feature map #mi.
- mi = mo: The mapping between input and output feature maps is one-to-one (each input feature map produces exactly one output feature map).
The following figure (Figure 14 from the original paper) illustrates the algorithm-hardware mapping for a pooling layer:
(Figure 14 from the original paper: algorithm-hardware mapping of a pooling layer onto the PE array.)
In a typical pooling layer, pooling windows for adjacent output neurons are adjacent but non-overlapping (i.e., step size equals window size). In such cases, PEs do not mutually propagate data because there is no data reuse between them. Each PE reads its required input neurons directly from NBin (e.g., using Read Mode (e)). If pooling windows overlap (step size smaller than window size), the process is similar to a convolutional layer, but without synapses.
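For reference, a minimal software rendering of the non-overlapping max-pooling case described above:

```python
import numpy as np

def max_pool(I, Kx, Ky):
    # one-to-one map pairing; each output neuron (one per PE) takes the
    # maximum of its non-overlapping Kx x Ky pooling window
    Mi, H, W = I.shape
    O = np.zeros((Mi, H // Kx, W // Ky))
    for mi in range(Mi):
        for a in range(H // Kx):
            for b in range(W // Ky):
                O[mi, a, b] = I[mi, a*Kx:(a+1)*Kx, b*Ky:(b+1)*Ky].max()
    return O

print(max_pool(np.random.rand(6, 28, 28), 2, 2).shape)   # (6, 14, 14)
```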
4.2.5.3. Classifier Layer
Classifier layers are usually fully connected, meaning there is no synaptic weight sharing among different input-output neuron pairs. This results in these layers often consuming the largest portion of the SB (e.g., 97.28% for LeNet-5). Each PE works on a single output neuron.
The output neuron #no is computed with the following formula:
$
\mathbf{O}^{no} = f\left( \beta^{no} + \sum_{ni} \omega^{ni,no} \times \mathbf{I}^{ni} \right)
$
Where:
- $\mathbf{O}^{no}$: The output neuron #no.
- $f(\cdot)$: The activation function.
- $\beta^{no}$: The bias value of output neuron #no.
- $\sum_{ni}$: Summation over all input neurons #ni.
- $\omega^{ni,no}$: The synapse (weight) connecting input neuron #ni to output neuron #no.
- $\mathbf{I}^{ni}$: The input neuron #ni.
Unlike convolutional layers, where one shared synapse and multiple input neurons are read per cycle, a classifier layer reads $P_x \times P_y$ different synaptic weights and a single input neuron shared by all PEs in each cycle. Each PE then multiplies its unique synapse with the common input neuron and accumulates the result into a partial sum. Once the dot product for an output neuron is complete, the result goes to the ALU for activation function computation.
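This access pattern can be sketched as follows (one accumulator standing in for each PE; tanh standing in for the ALU's activation):

```python
import numpy as np

def fc_layer(I, W, beta):
    # each cycle broadcasts ONE input neuron to all PEs, while every PE
    # reads its own private weight from SB and updates its partial sum
    n_out, n_in = W.shape
    partial = beta.copy()                 # one accumulator per output neuron/PE
    for ni in range(n_in):                # one "cycle" per input neuron
        partial += W[:, ni] * I[ni]       # shared input, per-PE weights
    return np.tanh(partial)               # activation applied by the ALU

I = np.random.rand(120)
W = np.random.rand(84, 120) * 0.05
print(fc_layer(I, W, np.zeros(84)).shape)   # (84,)
```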
4.2.5.4. Normalization Layers
Normalization layers are decomposed into a series of sub-layers and fundamental computational primitives that can be executed by ShiDianNao.
The following figure (Figure 15 from the original paper) shows the decomposition of an LRN layer:
(Figure 15 from the original paper: decomposition of an LRN layer into sub-layers and computational primitives.)
An LRN layer is decomposed into:
- A classifier sub-layer.
- An element-wise square operation.
- A matrix addition.
- Exponential functions (computed by the ALU).
- Divisions (computed by the ALU).

The output neuron at position (a, b) of output feature map #mi in an LRN layer is computed with the following formula:
$
\mathbf{O}_{a,b}^{mi} = \mathbf{I}_{a,b}^{mi} \Big/ \left( k + \alpha \times \sum_{j=\max(0,\, mi-M/2)}^{\min(Mi-1,\, mi+M/2)} \left( \mathbf{I}_{a,b}^{j} \right)^2 \right)^{\beta}
$
Where:
- $\mathbf{O}_{a,b}^{mi}$: The output neuron at position (a, b) of output feature map #mi.
- $\mathbf{I}_{a,b}^{mi}$: The input neuron at position (a, b) of input feature map #mi.
- $k$, $\alpha$, $\beta$: Constant parameters.
- $\sum_{j}$: Summation over the neighboring feature maps.
- $Mi$: The total number of input feature maps.
- $M$: The maximum number of input feature maps connected to one output feature map.
- $(\mathbf{I}_{a,b}^{j})^2$: The element-wise square of the input neuron from feature map #j.
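A direct software rendering of this LRN equation follows (the constants k, alpha, beta and the neighborhood size M are illustrative values, not taken from the paper):

```python
import numpy as np

def lrn(I, k=2.0, alpha=1e-4, beta=0.75, M=5):
    Mi = I.shape[0]
    O = np.empty_like(I)
    for mi in range(Mi):
        lo, hi = max(0, mi - M // 2), min(Mi - 1, mi + M // 2)
        sq_sum = np.sum(I[lo:hi + 1] ** 2, axis=0)    # element-wise square + sum
        O[mi] = I[mi] / (k + alpha * sq_sum) ** beta  # power and division via ALU
    return O

print(lrn(np.random.rand(8, 6, 6)).shape)   # (8, 6, 6)
```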
The following figure (Figure 16 from the original paper) shows the decomposition of an LCN layer:

(Figure 16 from the original paper: decomposition of an LCN layer into sub-layers and computational primitives.)
An LCN layer is decomposed into:
- Two convolutional sub-layers.
- A pooling sub-layer.
- A classifier sub-layer.
- Two matrix additions.
- An element-wise square operation.
- Divisions (computed by the ALU).

The output neuron at position (a, b) of output feature map #mi in an LCN layer is computed with the following formula:
$
\mathbf{O}_{a,b}^{mi} = \nu_{a,b}^{mi} \Big/ \max\left( \operatorname{mean}(\delta_{a,b}),\, \delta_{a,b} \right)
$
Where:
- $\mathbf{O}_{a,b}^{mi}$: The output neuron at position (a, b) of output feature map #mi.
- $\nu_{a,b}^{mi}$: The subtractively normalized input, computed as
$
\nu_{a,b}^{mi} = \mathbf{I}_{a,b}^{mi} - \sum_{j,p,q} \omega_{p,q} \times \mathbf{I}_{a+p,\, b+q}^{j}
$
where $\mathbf{I}$ is the input neuron and $\omega_{p,q}$ is a normalized Gaussian weighting window ($\sum_{p,q} \omega_{p,q} = 1$).
- $\delta_{a,b}$: A local standard-deviation-like term, computed as
$
\delta_{a,b} = \sqrt{ \sum_{mi,p,q} \left( \nu_{a+p,\, b+q}^{mi} \right)^2 }
$
- $\max(\operatorname{mean}(\delta_{a,b}),\, \delta_{a,b})$: The normalization term, combining the mean of $\delta$ with the local value $\delta_{a,b}$.

For the element-wise square and matrix addition primitives, each PE works on one matrix element per cycle using its multiplier or adder, and the results are then written to NBout.
5. Experimental Setup
5.1. Datasets
The paper uses 10 CNNs collected from representative visual recognition applications as benchmarks to evaluate ShiDianNao. These CNNs represent diverse workloads with varying layer sizes and configurations. While specific example data samples are not provided in the paper, the context implies standard image data (e.g., LeNet-5 is known for document recognition, and other CNNs are for face recognition, general image classification). The characteristics mentioned for these benchmarks are:
- Input Neurons: Maximum 45 KB for any layer across all benchmarks.
- Synapses (Weights): Maximum 118 KB for any CNN across all benchmarks.

These sizes are critical, as they directly inform the SRAM capacities required for ShiDianNao to operate entirely on-chip.
The following are the results from Table 2 of the original paper:
| Layer | Kernel Size #@size | Layer Size #@size | Layer | Kernel Size #@size | Layer Size #@size | ||
|---|---|---|---|---|---|---|---|
| 0 | Input C1 S2 C3 S4 C5 F6 | 6@7x7 6@2x2 61@7x7 16@2x2 305@6x6 160@1x1 | 1@42x42 6@36x36 6@18x18 16@12x12 16@6x6 80@1x1 21@1x1 | 20 | Input C1 S2 C3 S4 C5 | 20@5x5 20@2x2 400@5x5 20@2x2 400@3x3 | 1@32x32 20@28x28 20@14x14 20@10x10 20@5x5 20@3x3 |
| Layer | Kernel Size #@size | Layer Size #@size | F6 F7 Layer | 6000@1x1 1800@1x1 Kernel Size #@size | 300@1x1 6@1x1 Layer Size #@size | ||
| 20 | Input C1 S2 C3 S4 F5 | 20@3x3 20@2x2 125@3x3 25@2x2 1000@1x1 | 1@23x28 20@21x26 20@11x13 25@9x11 25@5x6 40@1x1 | 20 | Input C1 S2 C3 S4 F5 F6 | 6@5x5 6@2x2 60@5x5 16@2x2 1920@5x5 10080@1x1 | 1@32x32 6@28x28 6@14x14 16@10x10 16@5x5 120@1x1 84@1x1 |
| Layer | Kernel Size #@size | Layer Size #@size | F7 Layer | 840@1x1 Kernel Size #@size | 10@1x1 Layer Size #@size | ||
| 20 | Input C1 C2 F3 F4 | 5@5x5 250@5x5 5000@1x1 1000@1x1 | 1@29x29 5@13x13 50@5x5 100@1x1 10@1x1 | CE | Input C1 S2 C3 S4 F5 | 4@5x5 4@2x2 20@3x3 14@2x2 14@6x7 | 1@32x36 4@28x32 4@14x16 14@12x14 14@6x7 14@1x1 |
| Layer | Kernel Size #@size | Layer Size #@size | F6 Layer | 14@1x1 Kernel Size #@size | 1@1x1 Layer Size #@size | ||
| 0 | Input C1 S2 C3 | 4@5x5 6@3x3 14@5x5 | 1@24x24 1@24x24 4@12x12 4@12x12 | Input C1 S2 C3 | 12@5x5 12@2x2 60@3x3 | 3@64x36 12@60x32 12@30x16 14@28x14 | |
| S4 F5 | 60@3x3 160@6x7 Kernel Size | 16@6x6 10@1x1 | 20 | S4 F5 F6 | 14@2x2 14@14x7 14@1x1 Kernel Size | 14@14x7 14@1x1 1@1x1 Layer Size | |
| Layer Input C1 | #@size | #@size 1@20x20 | Layer Input C1 | #@size 4@7x7 | #@size 1@46x56 4@40x50 | ||
| 20 | S2 C3 S4 F5 F6 | 4@5x5 4@2x2 20@3x3 14@2x2 14@1x1 14@1x1 | 4@16x16 4@8x8 14@6x6 14@3x3 14@1x1 1@1x1 | 20 | S2 C3 S4 F5 F6 | 4@2x2 6@5x5 3@2x2 180@8x10 240@1x1 | 4@20x25 3@16x21 3@8x10 60@1x1 4@1x1 |
These datasets were chosen to represent state-of-the-art CNN implementations for visual recognition, allowing for a comprehensive evaluation of ShiDianNao's performance and energy efficiency across a range of realistic workloads.
5.2. Evaluation Metrics
The paper uses several metrics to evaluate the proposed ShiDianNao accelerator, focusing on performance, energy efficiency, and hardware cost.
- Speedup:
  - Conceptual Definition: Speedup quantifies how much faster a task executes on one system compared to another. It is a dimensionless ratio.
  - Mathematical Formula: $\text{Speedup} = \text{Execution Time}_{\text{Baseline}} / \text{Execution Time}_{\text{Proposed}}$
  - Symbol Explanation:
    - $\text{Execution Time}_{\text{Baseline}}$: The time taken to complete the task on a baseline system (e.g., CPU).
    - $\text{Execution Time}_{\text{Proposed}}$: The time taken to complete the same task on the proposed ShiDianNao accelerator. A higher speedup value indicates better performance.
- Energy Efficiency:
  - Conceptual Definition: Energy efficiency measures the amount of useful work performed per unit of energy consumed. In this context, the paper expresses it as the ratio of the energy consumed by two systems for the same task.
  - Mathematical Formula: $\text{Energy Efficiency Ratio} = \text{Energy Consumed}_{\text{Baseline}} / \text{Energy Consumed}_{\text{Proposed}}$
  - Symbol Explanation:
    - $\text{Energy Consumed}_{\text{Baseline}}$: The total energy consumed by a baseline system for a task (including memory accesses).
    - $\text{Energy Consumed}_{\text{Proposed}}$: The total energy consumed by the ShiDianNao accelerator for the same task. A higher ratio indicates better energy efficiency.
- Area (mm²):
  - Conceptual Definition: The physical footprint of the chip design, measured in square millimeters. It directly relates to manufacturing cost and feasibility for embedded systems.
  - Mathematical Formula: N/A (directly measured in mm²).
- Power (mW):
  - Conceptual Definition: The average rate at which energy is consumed by the accelerator during operation, measured in milliwatts. Low power consumption is critical for embedded and mobile devices with limited battery life.
  - Mathematical Formula: N/A (directly measured in mW).
- GOP/s (Billions of fixed-point Operations per second):
  - Conceptual Definition: A measure of computational throughput, specifically for fixed-point operations. An "operation" in the context of neural networks typically refers to a multiply-accumulate (MAC) operation, the dominant operation in convolutional and fully connected layers.
  - Mathematical Formula: N/A (directly measured in GOP/s).
-
Frames Per Second (FPS):
- Conceptual Definition: The number of full image frames that can be processed per second, directly indicating the real-time processing capability for video streams.
- Mathematical Formula: N/A (derived from total processing time per frame).
- Symbol Explanation: N/A.
5.3. Baselines
To provide a comprehensive comparison, ShiDianNao is evaluated against three baselines:
- CPU:
  - Specification: Intel Xeon E7-8830, 2.13 GHz clock frequency, 1 TB main memory, with SIMD (Single Instruction, Multiple Data) instruction support (MMX, SSE, SSE2, SSE4.1, SSE4.2).
  - Configuration: Benchmarks compiled with GCC 4.4.7 using aggressive optimization options to maximize performance by exploiting the SIMD instructions.
  - Representativeness: A high-end server-class CPU, showcasing general-purpose processing capabilities.
- GPU:
  - Specification: NVIDIA K20M, a modern GPU card with 5 GB of GDDR5 memory and 3.52 TFlops (trillions of floating-point operations per second) peak performance, built in 28 nm technology.
  - Configuration: The Caffe library, widely recognized as a fast CNN library for GPUs [1], is used to execute the benchmarks.
  - Representativeness: A powerful, state-of-the-art accelerator for highly parallelizable workloads, commonly used for NN training and inference.
- Accelerator (DianNao Reimplementation):
  - Specification: A reimplementation of DianNao [3], resized for a fair comparison in an embedded context.
    - NFU: an 8 x 8 DianNao-NFU (8 hardware neurons, each processing 8 input neurons and 8 synapses per cycle), smaller than the original 16 x 16 DianNao-NFU.
    - Memory Bandwidth: a scaled-down memory model (the original DianNao's assumed memory bandwidth is considered unrealistic in a vision-sensor context).
    - On-chip Buffers: shrunk to 1 KB NBin/NBout and a 16 KB SB (half the size of the original DianNao's buffers).
  - Verification: The reimplementation's area (1.38 mm²) is roughly proportional to the original DianNao's (3.02 mm²), confirming its fidelity as a scaled-down version.
  - Representativeness: A previous-generation, state-of-the-art neural network accelerator from the same research group, targeting a broader range of NNs and with different memory access patterns than ShiDianNao. It provides a direct comparison point for the architectural specializations made in ShiDianNao.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate ShiDianNao's superior performance and energy efficiency, particularly when compared to general-purpose architectures and even its predecessor, DianNao.
6.1.1. Layout Characteristics
The following figure (Figure 17 from the original paper) shows the layout of ShiDianNao (65 nm):
(Figure 17 from the original paper: layout of ShiDianNao in 65 nm technology.)
The following are the results from Table 3 of the original paper:
| ShiDianNao | DianNao | |
|---|---|---|
| Data width | 16-bit | 16-bit |
| # multipliers | 64 | 64 |
| NBin SRAM size | 64 KB | 1 KB |
| NBout SRAM size | 64 KB | 1 KB |
| SB SRAM size | 128 KB | 16 KB |
| Inst. SRAM size | 32 KB | 8 KB |
ShiDianNao features 8 x 8 (64) PEs, matching the number of multipliers in the resized DianNao for a fair comparison of computational units. However, ShiDianNao significantly increases its on-chip SRAM capacity to enable full CNN mapping and eliminate DRAM accesses: 64 KB for NBin, 64 KB for NBout, 128 KB for SB, and 32 KB for IB. This totals 288 KB of SRAM, which is 11.1x larger than the total SRAM in the DianNao baseline. Despite this substantial increase in SRAM, the total area of ShiDianNao is only 4.86 mm², which is 3.52x larger than the DianNao baseline (1.38 mm²), demonstrating efficient area utilization for the increased memory.
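A quick arithmetic check on the figures quoted above:

```python
sram_kb = {"NBin": 64, "NBout": 64, "SB": 128, "IB": 32}
total = sum(sram_kb.values())            # 288 KB in ShiDianNao
baseline = 1 + 1 + 16 + 8                # 26 KB in the resized DianNao
print(total, round(total / baseline, 1)) # 288 KB, ~11.1x more SRAM
print(round(4.86 / 1.38, 2))             # yet only ~3.52x the area
```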
6.1.2. Performance
The following figure (Figure 18 from the original paper) shows the speedup of GPU, DianNao, and ShiDianNao over the CPU:
(Figure 18 from the original paper: speedups of the GPU, DianNao, and ShiDianNao over the CPU baseline.)
- Vs. General-Purpose Architectures: ShiDianNao significantly outperforms general-purpose architectures. It is, on average, 46.38x faster than the CPU (Intel Xeon E7-8830) and 28.94x faster than the high-end GPU (NVIDIA K20M). The GPU struggles to fully leverage its computational power because the small computational kernels of these visual recognition tasks map poorly onto its 2,496 hardware threads.
- Vs. Previous Accelerator (DianNao): ShiDianNao also outperforms the DianNao baseline on 9 out of 10 benchmarks, averaging 1.87x faster. This advantage stems from two main factors:
  - Elimination of Off-chip Memory Accesses: ShiDianNao's larger on-chip SRAM allows it to entirely avoid the off-chip memory accesses that were still present in DianNao.
  - Exploitation of 2D Data Locality: ShiDianNao's 2D PE mesh and inter-PE data reuse mechanisms are specifically designed to exploit the 2D data locality of CNN feature maps, which DianNao (designed for a broader NN scope) did not effectively utilize.
- Performance Anomaly (Simple Conv Benchmark): ShiDianNao performs slightly worse than the DianNao baseline on the Simple Conv benchmark. This is attributed to ShiDianNao's design, in which it processes a single output feature map at a time and each PE works on a single output neuron within that map. For CNNs with uncommonly small output feature maps (e.g., the 5x5 map in the C2 layer of Simple Conv, run on an 8x8 PE design), some PEs remain idle, reducing overall efficiency. The authors chose not to implement more complex control logic to alleviate this, prioritizing a simpler programming model.
- Real-time Processing Capability: For a 640x480 video frame, the longest processing time occurs for the ConvNN benchmark, which requires 0.047 ms to process a 64x36-pixel region. Since a frame consists of 1,073 such regions (with 16-pixel overlap), a full frame takes slightly over 50 ms to process, i.e., about 20 frames per second (FPS) for the most demanding workload. This speed, combined with partial frame buffering (a few tens of pixel rows, fitting within the 256 KB available in commercial image processors), allows ShiDianNao to process real-time video streams directly from the sensor.
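The frame-rate claim follows directly from the quoted numbers:

```python
region_ms = 0.047          # time to process one 64x36-pixel region (ConvNN)
regions_per_frame = 1073   # regions tiling a 640x480 frame with 16-pixel overlap
frame_ms = region_ms * regions_per_frame
print(frame_ms, round(1000 / frame_ms, 1))   # ~50.4 ms per frame -> ~19.8 FPS
```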
6.1.3. Energy
The following figure (Figure 19 from the original paper) shows the energy cost of GPU, DianNao, and ShiDianNao:
(Figure 19 from the original paper: log-scale energy consumption of the GPU, DianNao, DianNao-FreeMem, and ShiDianNao across the benchmark tasks.)
- Vs. General-Purpose Architectures: ShiDianNao is remarkably energy-efficient: on average, 4688.13x more energy-efficient than the GPU. This comparison includes main memory accesses for input data, even though ShiDianNao itself is designed not to access DRAM.
- Vs. Previous Accelerator (DianNao): ShiDianNao is 63.48x more energy-efficient than DianNao. Even when compared to an idealized DianNao-FreeMem (which assumes zero energy cost for main memory accesses), ShiDianNao is still 1.66x more energy-efficient. This highlights the effectiveness of ShiDianNao's specialized 2D data locality exploitation and internal data movement optimizations.
- Sensor-Integrated Advantage: When ShiDianNao is integrated directly with an embedded vision sensor and frames are streamed directly into its NBin (completely bypassing DRAM), its energy superiority becomes even more pronounced: 87.39x more energy-efficient than DianNao and 2.37x more energy-efficient than DianNao-FreeMem.

The following are the results from Table 4 of the original paper:

| Component | Area (mm²) | Power (mW) | Energy (nJ) |
|---|---|---|---|
| Total | 4.86 (100%) | 320.10 (100%) | 6048.70 (100%) |
| NFU | 0.66 (13.58%) | 268.82 (83.98%) | 5281.09 (87.29%) |
| NBin | 1.12 (23.05%) | 35.53 (11.10%) | 475.01 (7.85%) |
| NBout | 1.12 (23.05%) | 6.60 (2.06%) | 86.61 (1.43%) |
| SB | 1.65 (33.95%) | 6.77 (2.11%) | 94.08 (1.56%) |
| IB | 0.31 (6.38%) | 2.38 (0.74%) | 35.84 (0.59%) |

- Energy Breakdown: The breakdown of energy consumption (averaged over the 10 benchmarks) reveals a critical insight: the four SRAM buffers (NBin, NBout, SB, IB) account for only 11.43% of the overall energy, while the vast majority, 87.29%, is consumed by the NFU logic. This is in stark contrast to prior observations on DianNao [3], where DRAM accesses dominated energy consumption (more than 95%). ShiDianNao thus successfully shifts the energy bottleneck from memory accesses to computation, a significant achievement for on-chip processing.
6.2. Ablation Studies / Parameter Analysis
The paper implicitly conducts an ablation-like study by comparing ShiDianNao to DianNao and DianNao-FreeMem.

- DianNao comparison: Isolates the combined effect of 2D data locality exploitation and DRAM elimination. The 1.87x speedup and 63.48x energy-efficiency improvement show that ShiDianNao's CNN-specialized architecture and DRAM-less operation are highly effective.
- DianNao-FreeMem comparison: Isolates the effect of 2D data locality exploitation and internal data-movement minimization. Even if DianNao had free DRAM access, ShiDianNao's inter-PE data propagation and optimized NFU would still yield 1.66x better energy efficiency, confirming the value of these architectural choices.

The paper also discusses the impact of the PE count relative to feature map size. When output feature maps are smaller than the PE mesh (e.g., a 5x5 feature map on an 8x8 PE array), some PEs become idle. This indicates that the PE mesh size, while efficient for typical CNN layers, is a design parameter that affects utilization for specific workloads. The authors chose a fixed PE array size for simplicity of the programming model, acknowledging a potential trade-off for very small layers.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully presents ShiDianNao, a novel and highly efficient accelerator for state-of-the-art visual recognition algorithms based on Convolutional Neural Networks (CNNs). By integrating the accelerator directly next to a CMOS/CCD image sensor and meticulously designing its architecture around CNN weight sharing and 2D data locality, ShiDianNao completely eliminates all energy-costly DRAM accesses. This fundamental shift results in an average 46.38x speedup over a mainstream CPU and 28.94x over a high-end GPU, while being 1.87x faster than a carefully reimplemented DianNao accelerator. Crucially, ShiDianNao achieves roughly 4700x and 60x reductions in energy consumption compared to the GPU and DianNao, respectively. With a compact 4.86 mm² area (in a 65 nm process) and 320.10 mW power consumption at 1 GHz, ShiDianNao is ideally suited for embedded visual applications in mobile and wearable devices, promising to significantly reduce server workloads, enhance the Quality of Service (QoS) of visual applications, and accelerate the widespread adoption of sophisticated vision processing.
7.2. Limitations & Future Work
The authors implicitly point out a limitation related to PE utilization:
- Idle PEs for Small Feature Maps: When CNNs have uncommonly small output feature maps (fewer output neurons than the number of PEs), some PEs remain idle. While the authors considered adding complex control logic to allow different PEs to work on different feature maps simultaneously, they ultimately decided against it because of the detrimental impact on the programming model. This reflects a trade-off between maximizing PE utilization for all possible CNN layer configurations and maintaining architectural simplicity.

The paper does not explicitly outline future work, but the overall motivation and conclusion suggest:

- Widespread Adoption: The vision is to make sophisticated vision processing ubiquitous by considerably lowering its hardware and energy cost. This implies continued development to integrate such accelerators into various embedded systems.
- Scaling to Larger/More Complex CNNs: As CNN models evolve, future work might involve adapting the architecture to support even larger models or more diverse layer types while maintaining the DRAM-less principle.
- Enhanced Flexibility: Exploring ways to improve PE utilization for smaller feature maps without overly complicating the programming model could be another direction.
7.3. Personal Insights & Critique
The ShiDianNao paper presents a highly impactful and forward-thinking architectural design. My personal insights and critique are as follows:
- Significance of the "Shift Left" Principle: The most profound aspect of ShiDianNao is its radical adherence to the "shift left" principle: processing data as close to its source as possible. By truly eliminating all DRAM accesses for CNN inference, the paper demonstrates a path to unprecedented energy efficiency, which is critical for the proliferation of AI at the very edge (e.g., smart cameras, always-on sensors). This is a strong testament to how deep algorithm-architecture-system co-design can yield orders-of-magnitude improvements.
- Leveraging Algorithm Properties: The intelligent exploitation of CNN weight sharing and 2D data locality highlights the importance of understanding the specific characteristics of the target algorithm class in order to design maximally efficient hardware, rather than aiming for overly general solutions that compromise efficiency. The detailed inter-PE data propagation mechanism is a brilliant example of minimizing internal data movement, complementing the DRAM elimination.
- Validation of Fixed-Point Arithmetic: The paper further solidifies the argument for 16-bit fixed-point arithmetic in NN inference, showing that sufficient accuracy can be maintained alongside massive hardware and energy savings. This is a crucial validation for embedded AI systems.
- Potential Issues/Areas for Improvement:
  - Flexibility for Evolving CNNs: While ShiDianNao offers more flexibility than rigid systolic arrays, CNN architectures are constantly evolving. New layer types, new pooling mechanisms, or more dynamic kernel sizes might challenge its fixed 2D PE mesh and HFSM-based control. The trade-off between specialization for extreme efficiency and adaptability to future algorithmic innovation is always present.
  - Model Size Limitations: The 288 KB of on-chip SRAM is sufficient for the benchmarks tested. However, very large, state-of-the-art CNNs can have hundreds of millions or even billions of parameters, potentially exceeding this capacity without further model compression or specialized partitioning strategies. The promise of "entirely mapping a CNN within an SRAM" is contingent on the CNN's size.
  - Generalization to Other AI Tasks: While highly optimized for vision CNNs, ShiDianNao's architecture is less suited to other NN types (e.g., Recurrent Neural Networks (RNNs), Transformers) or broader AI workloads that lack the same 2D locality or weight-sharing properties. This narrow scope is a deliberate design choice for efficiency, but a limitation for general AI acceleration.
  - The "Simple Conv" Anomaly: The performance drop for small feature maps (due to idle PEs) indicates a potential area for micro-architectural refinement. While the authors explicitly chose not to add complexity in order to keep a simpler programming model, for applications dominated by small layers this could be a noticeable inefficiency.
  - "Hot" NFU: The energy breakdown shows the NFU consuming 87.29% of the total energy. While this confirms the bottleneck has been shifted away from memory, it also implies that further energy savings would need to focus on making the NFU operations themselves more efficient.

Overall, ShiDianNao is an excellent example of architecting for extreme efficiency by deeply understanding the target application domain and algorithm. Its innovations are highly transferable to other embedded AI contexts where data originates from sensors and ultra-low power is paramount.