
ShiDianNao: Shifting Vision Processing Closer to the Sensor

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The ShiDianNao CNN accelerator is placed next to a CMOS or CCD sensor to eliminate all DRAM accesses. It is about 60x more energy efficient than the previous state-of-the-art neural network accelerator and roughly 30x faster than a high-end GPU, in a compact design with a 4.86 mm² footprint and 320 mW power consumption.

Abstract

In recent years, neural network accelerators have been shown to achieve both high energy efficiency and high performance for a broad application scope within the important category of recognition and mining applications. Still, both the energy efficiency and performance of such accelerators remain limited by memory accesses. In this paper, we focus on image applications, arguably the most important category among recognition and mining applications. The neural networks which are state-of-the-art for these applications are Convolutional Neural Networks (CNN), and they have an important property: weights are shared among many neurons, considerably reducing the neural network memory footprint. This property allows to entirely map a CNN within an SRAM, eliminating all DRAM accesses for weights. By further hoisting this accelerator next to the image sensor, it is possible to eliminate all remaining DRAM accesses, i.e., for inputs and outputs. In this paper, we propose such a CNN accelerator, placed next to a CMOS or CCD sensor. The absence of DRAM accesses combined with a careful exploitation of the specific data access patterns within CNNs allows us to design an accelerator which is 60x more energy efficient than the previous state-of-the-art neural network accelerator. We present a full design down to the layout at 65 nm, with a modest footprint of 4.86 mm² and consuming only 320 mW, but still about 30x faster than high-end GPUs.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of the paper is "ShiDianNao: Shifting Vision Processing Closer to the Sensor."

1.2. Authors

The authors and their affiliations are:

  • Zidong Du, Tianshi Chen, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen: State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences (CAS), China.
  • Robert Fasthuber, Paolo Ienne: EPFL, Switzerland.
  • Olivier Temam: Inria, France.

1.3. Journal/Conference

The paper was published at the 42nd ACM/IEEE International Symposium on Computer Architecture (ISCA 2015), following the authors' previous work (DianNao [3] in ASPLOS). ISCA is a top-tier, highly influential venue in computer architecture, hardware design, and accelerators for emerging workloads such as machine learning.

1.4. Publication Year

The publication year is 2015, consistent with the references, which include cited works up to 2015 (e.g., [4]).

1.5. Abstract

The paper addresses the energy efficiency and performance limitations of neural network accelerators, primarily caused by memory accesses. It focuses on image applications, specifically Convolutional Neural Networks (CNNs), which benefit from weight sharing, significantly reducing their memory footprint. This property allows an entire CNN to be mapped within on-chip SRAM, eliminating DRAM accesses for weights. The core innovation is to further integrate this CNN accelerator directly next to an image sensor (CMOS or CCD), thereby eliminating all remaining DRAM accesses for inputs and outputs. The proposed accelerator, ShiDianNao, leverages the absence of DRAM accesses and exploits specific data access patterns within CNNs to achieve 60 times greater energy efficiency than the previous state-of-the-art neural network accelerator. The design is presented down to a 65 nm layout, occupying a modest 4.86 mm² area and consuming only 320 mW, while still being approximately 30 times faster than high-end GPUs.


2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the inherent limitation of neural network accelerators in terms of both energy efficiency and performance, largely due to frequent memory accesses, particularly to off-chip DRAM. While accelerators have gained traction as energy- and cost-effective alternatives to CPUs and GPUs for recognition and mining applications, memory bandwidth remains a bottleneck, as acknowledged even in prior specialized designs like DianNao [3].

This problem is particularly critical for vision applications, which constitute one of the broadest categories of recognition tasks. In many real-world and embedded scenarios (e.g., smartphones, security cameras, self-driving cars), image data originates directly from a CMOS or CCD sensor. The traditional pipeline involves the image being acquired by the sensor, transferred to DRAM, and then fetched by a CPU/GPU for processing. Each of these steps, especially DRAM accesses, incurs significant energy costs.

The paper's entry point and innovative idea stem from exploiting specific properties of Convolutional Neural Networks (CNNs), which are state-of-the-art for image applications. CNNs have a crucial property: weights are shared among many neurons (translation invariance), drastically reducing the neural network memory footprint. This makes it feasible to entirely map a CNN's weights within a small, fast, and energy-efficient on-chip SRAM. Building on this, the paper proposes hoisting this entire accelerator next to the image sensor. This radical architectural shift allows for the elimination of all DRAM accesses, not just for weights but also for inputs (from the sensor) and outputs (typically just classification results), thereby addressing the memory bottleneck at its root.

2.2. Main Contributions / Findings

The primary contributions of this paper are:

  • Novel Accelerator Architecture (ShiDianNao): Proposal and detailed design of a CNN accelerator specifically engineered to be placed directly adjacent to CMOS or CCD image sensors. This architecture is optimized to eliminate all off-chip DRAM accesses for both model weights and input/output data.

  • Exploitation of CNN Properties: A careful exploitation of CNN-specific data access patterns, particularly the weight sharing property and 2D data locality of feature maps, which enables the entire mapping of a CNN within on-chip SRAM and efficient inter-Processing Element (PE) data reuse.

  • Significant Energy Efficiency Improvement: The proposed ShiDianNao achieves an average of 60 times more energy efficiency than the previous state-of-the-art neural network accelerator, DianNao [3]. When integrated directly with a sensor, it is 87.39 times more energy efficient than DianNao and 2.37 times more energy efficient than an ideal DianNao-FreeMem (assuming no memory cost).

  • High Performance: Despite its ultra-low power consumption and small footprint, ShiDianNao is approximately 30 times faster than high-end GPUs for the targeted visual recognition tasks, and 1.87 times faster than the resized DianNao baseline.

  • Full Design and Evaluation: Presentation of a full design down to the layout at 65 nm CMOS technology, showcasing a modest footprint of 4.86 mm² and consuming only 320 mW at 1 GHz, while delivering 194 GOP/s. The design is empirically evaluated on ten representative CNN benchmarks.

    The key conclusions and findings are that by shifting vision processing physically closer to the sensor and meticulously optimizing for CNN data patterns to eliminate all DRAM accesses, it is possible to achieve unprecedented levels of energy efficiency and performance for sophisticated visual recognition tasks in embedded systems. This approach significantly lowers the hardware and energy cost, potentially enabling widespread deployment of advanced vision processing in mobile and wearable devices.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully understand the ShiDianNao paper, a foundational understanding of several key concepts in computer architecture, machine learning, and digital design is essential.

  • Neural Networks (NNs) and Accelerators:

    • Definition: Neural networks are computational models inspired by biological neural networks, used for tasks like pattern recognition and classification. They consist of layers of interconnected "neurons" that process data.
    • Accelerators: These are specialized hardware components designed to speed up specific computational tasks, in this case, neural network operations. They aim to achieve higher performance and energy efficiency compared to general-purpose processors (CPUs/GPUs) for those specific tasks. The motivation for accelerators comes from the observation that general-purpose chips are often inefficient for highly specialized, compute-intensive workloads [20].
    • Energy and Performance Bottlenecks: A major bottleneck in conventional computing systems, including those using CPUs and GPUs, is memory access. Moving data between different levels of memory hierarchy (e.g., between off-chip DRAM and on-chip caches/registers) consumes significantly more energy than performing arithmetic operations [20, 28]. Accelerators try to mitigate this by bringing computation closer to data or reducing data movement.
  • Convolutional Neural Networks (CNNs):

    • Definition: CNNs are a class of NNs particularly effective for processing grid-like data, such as images. They are the state-of-the-art for many vision tasks.
    • Layers: A CNN typically consists of:
      • Convolutional Layers (C): These apply a set of learnable filters (or kernels) across the input data. Each filter slides over the input (e.g., an image or a feature map) and computes a dot product, generating a feature map that highlights specific features (edges, textures, etc.).
        • Weight Sharing (Translation Invariance): A key property of CNNs is that the same filter (set of weights) is applied across the entire input feature map. This means that if a particular feature (e.g., an edge) is useful in one part of an image, it's likely useful elsewhere. This property dramatically reduces the total number of unique weights that need to be stored compared to Fully Connected (FC) layers. This also implies translation invariance, meaning the network can detect a feature regardless of its position in the input.
      • Pooling Layers (S): These layers reduce the spatial dimensions (width and height) of the input feature maps, which helps in reducing the number of parameters and computations, and makes the features more robust to slight shifts or distortions. Common operations include max pooling (taking the maximum value in a window) or average pooling.
      • Normalization Layers: These layers normalize the activities of neurons to improve the network's performance and stability during training. Examples include Local Response Normalization (LRN) and Local Contrast Normalization (LCN).
      • Classifier Layers (F) (Fully Connected Layers): Typically found at the end of a CNN, these are standard Multi-Layer Perceptrons (MLPs) where every neuron in one layer is connected to every neuron in the next layer with independent weights. These layers perform the final classification based on the high-level features extracted by the convolutional and pooling layers.
    • Feature Map: A 2D array representing the output of a convolutional or pooling layer, effectively a map of where a specific feature is detected in the input.
    • Kernel/Filter: A small matrix of weights that slides over the input feature map to perform convolution. Its size is denoted by $K_x \times K_y$.
    • Stride: The step size by which the kernel moves across the input feature map. Denoted by $S_x$ and $S_y$.
  • Deep Neural Networks (DNNs):

    • Definition: A broader category of NNs with multiple hidden layers. CNNs are a type of DNN.
    • Distinction from CNNs: While CNNs use weight sharing, generic DNNs (often referring to MLPs or networks without convolutional properties in their earlier layers) typically do not. In a DNN, each connection between neurons has a unique weight, leading to a much larger number of synapses (weights) to store. This difference is critical for ShiDianNao's design.
  • Memory Hierarchy (SRAM vs. DRAM):

    • DRAM (Dynamic Random Access Memory): High-capacity, relatively slow, and energy-intensive off-chip memory. It's used for main system memory. Accessing DRAM is a major energy and performance bottleneck.
    • SRAM (Static Random Access Memory): Low-capacity, very fast, and energy-efficient on-chip memory. It's often used for caches and registers due to its speed but is much more expensive per bit and takes up more area than DRAM. ShiDianNao aims to fit all necessary data into SRAM.
  • Fixed-Point vs. Floating-Point Arithmetic:

    • Floating-Point: Represents numbers with a fractional component and a dynamic range, offering high precision, similar to how computers handle real numbers in scientific computing. Typically 32-bit (single-precision) or 64-bit (double-precision).
    • Fixed-Point: Represents numbers with a fixed number of bits for the integer part and a fixed number of bits for the fractional part. It has a smaller dynamic range and precision than floating-point but requires significantly less hardware (smaller multipliers, adders) and consumes less energy.
    • Rationale for 16-bit Fixed-Point in NNs: Previous studies [3, 10, 57] have shown that for neural network inference (recognition phase), 16-bit fixed-point arithmetic provides negligible accuracy loss compared to 32-bit floating-point, while drastically reducing hardware cost (e.g., 6.10x smaller and 7.33x more energy-efficient for a 16-bit fixed-point multiplier vs. 32-bit floating-point [3]).
  • Image Sensors (CMOS/CCD):

    • Definition: Devices that convert light into electrical signals, capturing images.
    • CMOS (Complementary Metal-Oxide-Semiconductor) / CCD (Charge-Coupled Device): Two main types of image sensor technologies. They are the initial source of image data in many embedded applications.
    • Integration: ShiDianNao proposes placing the accelerator physically next to these sensors to intercept the data stream before it goes to off-chip memory.

3.2. Previous Works

The paper contextualizes ShiDianNao against several categories of previous work:

  • General-Purpose Processors (CPUs and GPUs) for NNs:

    • Conventionally, NNs were executed on CPUs [59, 2] or GPUs [15, 51, 5]. While flexible, these platforms are not optimized for the specific data access patterns and arithmetic intensity of NNs, leading to poor energy efficiency. GPUs, though powerful for parallel computation, can be inefficient for small computational kernels found in CNNs due to overheads of managing thousands of threads.
  • Dedicated Neural Network Accelerators:

    • Early Designs: A first wave of specialized NN hardware emerged in the late 20th century [25].
    • Modern Implementations: More recent accelerators exist, implemented on FPGAs [50, 46, 52] or ASICs [3, 16, 57].
    • Systolic Architectures:
      • Examples: NeuFlow [16] by Farabet et al., and a systolic-like coprocessor by Chakradhar et al. [2].
      • Characteristics: Effective for 2D convolution in signal processing [37, 58, 38, 22]. Data flows through an array of processing elements in a pipelined fashion.
      • Limitations: Often lack flexibility to support diverse CNN settings (e.g., varying kernel sizes, strides) [8, 50, 16, 2], and can have high memory bandwidth requirements.
    • SIMD-like Architectures:
      • Examples: NnSP (Neural Network Stream Processing core) by Esmaeilzadeh et al. [12] (though initially for MLPs). Peemen et al. [45] proposed an FPGA accelerator with a customized memory subsystem for CNNs, but it still required a host processor, limiting overall energy efficiency. Gokhale et al. [18] designed a mobile coprocessor for visual processing supporting both CNNs and DNNs.
      • Limitations: Many of these designs did not treat main memory accesses as a primary concern or connected computational blocks directly to main memory via DMA, still incurring significant energy costs for data movement.
  • DianNao Family [3, 4, 40]:

    • DianNao [3]: This is the most direct predecessor and a crucial baseline for ShiDianNao. It was proposed by some of the same authors. DianNao was designed as a small-footprint, high-throughput accelerator for a broad range of neural networks, including both CNNs and DNNs. It introduced dedicated on-chip SRAM buffers to reduce main memory accesses.
    • DianNao's Limitation (relevant to ShiDianNao): To support a broad scope of NNs (including DNNs without weight sharing), DianNao did not implement specialized hardware to extensively exploit the 2D data locality specific to CNNs. Instead, it often treated 2D feature maps as 1D data vectors, which still led to frequent memory accesses (albeit to its on-chip buffers, but requiring more data movement than ShiDianNao's approach) and thus was less energy-efficient for CNNs than a specialized design.
    • Other DianNao Family Members: Later works like DaDianNao [4] (for large-scale NNs) and another accelerator [40] (for classic machine learning) were optimized for different scales or techniques and were not designed for embedded applications like ShiDianNao.

3.3. Technological Evolution

The field of NN acceleration has evolved significantly:

  1. Early NN Hardware (1980s-1990s): Initial attempts at hardware for NNs often involved purely spatial implementations or simple systolic arrays, but were limited by transistor densities and the complexity of NNs [25].
  2. GPU Dominance (2000s-early 2010s): GPUs became the go-to for parallel NN training and inference due to their high computational throughput. However, their general-purpose nature and high power consumption limited their suitability for embedded, energy-constrained scenarios.
  3. Emergence of Specialized Accelerators (early 2010s onwards): The realization that NNs (especially CNNs) have specific computational and data access patterns led to the development of ASIC and FPGA accelerators. These designs started focusing on:
    • Quantization: Using lower precision arithmetic (e.g., 16-bit fixed-point) to reduce hardware cost and energy while maintaining sufficient accuracy.
    • On-chip Memory: Employing large on-chip SRAMs to reduce costly off-chip DRAM accesses.
    • Dataflow Optimization: Designing processing elements and memory access patterns to maximize data reuse and minimize data movement.
  4. "Near-Sensor" Processing: ShiDianNao represents a further evolution by pushing the accelerator physically closer to the data source (the image sensor) to achieve complete elimination of off-chip memory accesses, targeting the most energy-sensitive edge applications.

3.4. Differentiation Analysis

Compared to the main methods in related work, ShiDianNao offers several core differences and innovations:

  • Complete Elimination of DRAM Accesses: This is ShiDianNao's most significant differentiator. Unlike most prior accelerators (including DianNao), which still relied on DRAM for input/output data (and sometimes weights), ShiDianNao leverages the small footprint of CNNs (due to weight sharing) and its proximity to the sensor to avoid all DRAM accesses. This is achieved by fitting the entire CNN model and image portions into sufficiently large on-chip SRAMs.

  • Sensor-Adjacent Integration: The physical placement of the accelerator directly next to the CMOS/CCD sensor is a unique architectural choice. This allows it to intercept raw image data streams and process them before they ever reach DRAM, drastically cutting energy consumption.

  • Specialized CNN Data Locality Exploitation: While DianNao aimed for broader NN support (including DNNs that don't benefit as much from 2D data locality), ShiDianNao is custom-built for CNNs. Its Neural Functional Unit (NFU) is a 2D mesh of PEs, optimized for 2D feature maps. Crucially, it incorporates inter-PE data propagation (using FIFOs) to efficiently reuse input neurons among adjacent PEs, further reducing internal SRAM bandwidth requirements. DianNao treated 2D data as 1D vectors, missing out on these 2D locality benefits.

  • Flexible yet Optimized Design for CNNs: Unlike rigid systolic arrays that might be optimized for a single CNN configuration, ShiDianNao maintains flexibility through its Hierarchical Finite State Machine (HFSM) based control, allowing it to accommodate various CNN layers and parameters while still being highly optimized for CNN operations.

  • Orders of Magnitude Improvement in Energy Efficiency: The combined innovations lead to an unprecedented 60x energy efficiency improvement over DianNao and almost 4700x over GPUs, demonstrating the profound impact of its architectural choices.

    In essence, ShiDianNao represents a paradigm shift towards ultra-low-power, embedded visual intelligence by tightly integrating specialized hardware with the image sensor and meticulously optimizing for CNN characteristics at every architectural level.

4. Methodology

4.1. Principles

The core idea behind ShiDianNao is to achieve extreme energy efficiency and high performance for Convolutional Neural Network (CNN)-based visual recognition by eliminating all costly off-chip DRAM accesses. This is accomplished through two main principles:

  1. On-Chip Mapping of Entire CNN: By exploiting the weight sharing property of CNNs, which significantly reduces their memory footprint compared to Deep Neural Networks (DNNs), ShiDianNao is designed with sufficiently large on-chip SRAM to store all synapses (weights) of a practical CNN. This removes the need for DRAM access for the model itself.

  2. Sensor-Adjacent Integration for Input/Output Elimination: The accelerator is designed to be small enough to be physically placed directly next to a CMOS or CCD image sensor. This allows it to process image data directly as it streams from the sensor, thus eliminating DRAM accesses for input images. Only the few bytes of recognition results (e.g., an image category) are then sent to DRAM or the host processor, effectively eliminating DRAM accesses for outputs as well.

    These principles, combined with careful exploitation of CNN's inherent 2D data locality and efficient data movement within the accelerator, form the foundation for ShiDianNao's high energy efficiency.
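To make the first principle concrete, the following minimal sketch (with assumed layer shapes that are not taken from the paper) contrasts the number of unique weights in a convolutional layer, where one small kernel is shared across all positions, with a fully connected layer over comparably sized inputs and outputs:

```python
# Illustrative only: assumed layer shapes, not benchmarks from the paper.

def conv_params(in_maps, out_maps, kx, ky):
    # One kx*ky kernel per (input map, output map) pair, shared across all positions.
    return in_maps * out_maps * kx * ky

def fc_params(in_neurons, out_neurons):
    # One unique weight per (input neuron, output neuron) pair.
    return in_neurons * out_neurons

conv = conv_params(6, 16, 5, 5)               # 2,400 shared weights
fc = fc_params(6 * 28 * 28, 16 * 24 * 24)     # ~43 million unique weights
print(f"conv weights: {conv}, fc weights: {fc}, ratio: {fc / conv:.0f}x")
```

The orders-of-magnitude gap illustrated here is what allows a full CNN model to fit in a few hundred kilobytes of on-chip SRAM.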

4.2. Core Methodology In-depth (Layer by Layer)

As illustrated in Figure 4, the ShiDianNao accelerator consists of several main components:

  • NBin (Input Neuron Buffer): Stores input neurons for the current layer.

  • NBout (Output Neuron Buffer): Stores output neurons generated by the current layer. (NBin and NBout can exchange functionality for subsequent layers).

  • SB (Synapse Buffer): Stores the weights (synapses) of the CNN model.

  • NFU (Neural Functional Unit): The main computational engine for fundamental neuron operations (multiplications, additions, comparisons).

  • ALU (Arithmetic Logic Unit): Handles activation functions and other arithmetic operations not covered by the NFU.

  • IB (Instruction Buffer): Stores control instructions for the accelerator.

    The entire design uses 16-bit fixed-point arithmetic operators for both the NFU and ALU. This choice is based on prior studies showing negligible accuracy loss for NNs with significant hardware cost and energy savings (e.g., a 16-bit fixed-point multiplier is 6.10x smaller and 7.33x more energy-efficient than a 32-bit floating-point multiplier in TSMC 65nm technology [3]).
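The arithmetic style can be illustrated with a small Python sketch of signed 16-bit fixed-point quantization and multiplication; the Q1.15 split of integer and fractional bits below is an assumption for illustration, not a format mandated by the paper:

```python
FRAC_BITS = 15              # assumed fractional bits (Q1.15-style format)
SCALE = 1 << FRAC_BITS

def to_fixed(x: float) -> int:
    # Quantize a real value to a signed 16-bit fixed-point integer, with saturation.
    v = int(round(x * SCALE))
    return max(-32768, min(32767, v))

def fixed_mul(a: int, b: int) -> int:
    # 16x16 -> 32-bit product, rescaled back to 16-bit fixed point.
    return (a * b) >> FRAC_BITS

def to_float(x: int) -> float:
    return x / SCALE

# Example: 0.5 * -0.25 -> -0.125, recovered exactly in this case.
print(to_float(fixed_mul(to_fixed(0.5), to_fixed(-0.25))))
```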

The following figure (Figure 4 from the original paper) shows the overall architecture of the accelerator:

Figure 4: Accelerator architecture. (Schematic of the ShiDianNao accelerator, showing the input image, buffer controllers, decoder, NFU, and ALU.)

4.2.1. Neural Functional Unit (NFU)

The NFU is the core computational component, optimized for 2D feature maps. Unlike designs that treat 2D data as 1D vectors, ShiDianNao's NFU is a 2D mesh of $P_x \times P_y$ Processing Elements (PEs).

The following figure (Figure 5 from the original paper) depicts the architecture of the NFU:

Figure 5: NFU architecture. (Schematic of the NFU and how its inputs and outputs are connected: input columns and rows, kernel values, and output control signals.)

  • Neuron-PE Mapping: Instead of allocating a block of $K_x \times K_y$ PEs for a single output neuron (which would lead to complex data sharing and variability issues), ShiDianNao maps each output neuron to a single PE. Each PE then time-shares its computation across input neurons (synapses) connecting to that output neuron. When a CNN layer executes, each PE continuously processes a single output neuron until it is fully computed, then switches to another.

  • Processing Elements (PEs): The following figure (Figure 6 from the original paper) shows the detailed architecture of a single PE:

    Figure 6: PE architecture. (Schematic of a PE's internals, including the multiplier, adder, and registers, with data entering and leaving through FIFO buffers.)

    At each cycle, each PE (denoted $PE_{i,j}$ for the PE at the $i$-th row and $j$-th column of the NFU) can perform:

    • A multiplication and an addition (for convolutional, classifier, or normalization layers).
    • An addition (for average pooling).
    • A comparison (for max pooling). Each PE has:
    • Three Inputs:
      1. Control signals.
      2. Synapses (e.g., kernel values) from SB.
      3. Neurons from NBin/NBout, or from a neighbor PE ($PE_{i+1,j}$ for right, $PE_{i,j+1}$ for bottom), depending on control.
    • Two Outputs:
      1. Computation results to NBout/NBin.
      2. Locally stored neurons propagated to neighbor PEs for data reuse.
  • Inter-PE Data Propagation: This mechanism is crucial for CNN's 2D data locality. In convolutional, pooling, and normalization layers, adjacent output neurons often require data from significantly overlapping rectangular windows of input neurons. While data could be repeatedly read from NBin/NBout, this would demand high bandwidth. The following figure (Figure 7 from the original paper) illustrates the internal bandwidth required with and without inter-PE data propagation:

    Figure 7: Internal bandwidth from storage structures (input neurons and synapses) to NFU. (Chart of memory bandwidth in GB/s for input neurons and kernel weights versus the number of PEs, with and without inter-PE data propagation.)

    To support efficient data reuse, ShiDianNao allows inter-PE data propagation within the PE mesh. Each PE includes two FIFOs (First-In, First-Out buffers):

    • FIFO-H (Horizontal FIFO): Buffers data from NBin/NBout and from the right neighbor PE. This data is then propagated to the left neighbor PE for reuse.
    • FIFO-V (Vertical FIFO): Buffers data from NBin/NBout and from the upper neighbor PE. This data is then propagated to the lower neighbor PE for reuse. This mechanism drastically reduces the internal bandwidth requirement between the on-chip buffers and the NFU. For example, for a convolutional layer C1 of LeNet-5, inter-PE data propagation reduces the NBin bandwidth requirement by 73.88%.
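The effect of this reuse can be approximated with a rough counting sketch; the schedule below (an initial load by all PEs, then only one boundary row or column fetching new data per kernel step) is a simplification and not the paper's exact cycle-level schedule:

```python
# Rough counting model: estimate how many input-neuron reads an NBin-style buffer must serve
# for one pass of a Kx x Ky kernel over a Px x Py PE mesh, with and without inter-PE
# propagation. Without propagation every PE reads every input it needs from the buffer;
# with propagation, only PEs on the mesh boundary read new data and the rest receive it
# from a neighbor's FIFO.

def nbin_reads(px, py, kx, ky, inter_pe=True):
    reads = 0
    for j in range(ky):
        for i in range(kx):
            if not inter_pe:
                reads += px * py                 # every PE fetches from NBin
            elif i == 0 and j == 0:
                reads += px * py                 # initial load: all PEs fetch
            elif i > 0:
                reads += py                      # only one PE column fetches new data
            else:                                # i == 0, j > 0
                reads += px                      # only one PE row fetches new data
    return reads

px, py, kx, ky = 8, 8, 5, 5
with_prop = nbin_reads(px, py, kx, ky, True)
without = nbin_reads(px, py, kx, ky, False)
print(f"reads with propagation: {with_prop}, without: {without}, "
      f"saving: {1 - with_prop / without:.1%}")
```

For an 8x8 mesh and a 5x5 kernel this simplified model already saves roughly 84% of the buffer reads, in the same spirit as the 73.88% reduction reported for LeNet-5's C1 layer.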

4.2.2. Arithmetic Logic Unit (ALU)

The ALU complements the NFU by performing computational primitives not covered by the PEs. It also uses 16-bit fixed-point arithmetic. Its functions include:

  • Division: Used for operations in average pooling and normalization layers.
  • Non-linear Activation Functions: Computes tanh() and sigmoid() functions (used in convolutional and pooling layers). These are approximated using piecewise linear interpolation ($f(x) = a_i x + b_i$ for $x \in [x_i, x_{i+1}]$, $i = 0, \ldots, 15$). Segment coefficients $a_i$ and $b_i$ are pre-stored in registers, enabling efficient computation with a multiplier and an adder. This approximation introduces only negligible accuracy loss [31, 3].
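A minimal sketch of the piecewise-linear idea for tanh is shown below; the segment boundaries and the 16-entry coefficient table are illustrative choices, not the values used in the actual design:

```python
import math

SEGMENTS = 16
X_MIN, X_MAX = -4.0, 4.0                      # assumed input range covered by the table
xs = [X_MIN + i * (X_MAX - X_MIN) / SEGMENTS for i in range(SEGMENTS + 1)]

# Precompute a_i, b_i so that f(x) = a_i * x + b_i on [x_i, x_{i+1}].
coeffs = []
for i in range(SEGMENTS):
    x0, x1 = xs[i], xs[i + 1]
    a = (math.tanh(x1) - math.tanh(x0)) / (x1 - x0)
    b = math.tanh(x0) - a * x0
    coeffs.append((a, b))

def tanh_pwl(x: float) -> float:
    # Clamp outside the table, then apply the linear segment containing x.
    if x <= X_MIN: return math.tanh(X_MIN)
    if x >= X_MAX: return math.tanh(X_MAX)
    i = min(int((x - X_MIN) / ((X_MAX - X_MIN) / SEGMENTS)), SEGMENTS - 1)
    a, b = coeffs[i]
    return a * x + b

print(tanh_pwl(0.7), math.tanh(0.7))   # small approximation error
```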

4.2.3. Storage Architecture

ShiDianNao relies heavily on on-chip SRAM to store all data and instructions, enabling the elimination of off-chip DRAM accesses.

  • SRAM Capacity: The design incorporates 288 KB of on-chip SRAM, which is sufficient for all 10 practical CNNs benchmarked in the paper (Table 1 shows max neuron storage 45 KB, max synapse storage 118 KB). This SRAM has a moderate cost: 128 KB SRAM is estimated at 1.65 mm² and 0.44 nJ per read in TSMC 65nm process.
  • SRAM Partitioning and Banking: The on-chip SRAM is split into dedicated buffers for different data types to allow for suitable read widths and minimize energy per read:
    • NBin & NBout: Store input and output neurons, respectively. They exchange roles for sequential layers. Each has $2 \times P_y$ banks, supporting SRAM-to-PE data movements and inter-PE data propagation. Each bank is $P_x \times 2$ bytes wide (i.e., $P_x$ 16-bit neurons). They must be large enough for all neurons of a whole layer.
    • SB (Synapse Buffer): Stores all CNN synapses. It has $P_y$ banks.
    • IB (Instruction Buffer): Stores control instructions.

4.2.4. Control Architecture

The control mechanisms ensure efficient data reuse and flexible operation for various CNN layers.

  • Buffer Controllers: The NB controller (for both NBin and NBout) is a key example. The following figure (Figure 9 from the original paper) shows its architecture:

    Figure 9: NB controller architecture. (Schematic of the NB controller, showing how the storage banks (Bank #0, Bank #1, ...) connect to the column buffers, with a MUX array handling inputs and outputs for efficient data access.)

    It supports six read modes and one write mode. NBin has $2 \times P_y$ banks, each $P_x \times 2$ bytes wide. The following figure (Figure 10 from the original paper) illustrates these six read modes of the NB controller:

    Figure 10: Read modes of NB controller. (Schematic of the connection patterns used by the different read modes, showing the buffer banks (#0, #1, ..., #Py) and their connections to the NFU.)

    The read modes are:

    • (a) Read multiple banks (#0 to #P_y-1).
    • (b) Read multiple banks (#P_y to #2P_y-1).
    • (c) Read one bank.
    • (d) Read a single neuron.
    • (e) Read neurons with a given step size.
    • (f) Read a single neuron per bank (#0 to #P_y-1 or #P_y to #2P_y-1). Different modes are selected based on the CNN layer type. For example, convolutional and pooling layers use modes (a), (b), (e), (c), (f) for sliding windows and data access. Classifier layers primarily use mode (d) to load a single input neuron for all output neurons. Normalization layers decompose into sub-layers that behave similarly.

    The write mode of the NB controller is simpler. After PEs compute an output neuron, results are temporarily stored in an output register array (Figure 9). Once $P_x \times P_y$ results are collected from all PEs, they are written to NBout simultaneously. The $P_x \times P_y$ output neurons are organized into a data block ($P_y$ rows, each $P_x \times 2$ bytes wide) and written to either the first $P_y$ banks or the second $P_y$ banks of NBout, depending on their position (the $2kP_x$-th to $((2k+1)P_x-1)$-th columns vs. the gray columns) in the output feature map. The following figure (Figure 11 from the original paper) shows the data organization of NB:

    Figure 11: Data organization of NB.

  • Control Instructions: To support flexible CNN configurations without excessively large instruction storage, a two-level Hierarchical Finite State Machine (HFSM) is used. The following figure (Figure 12 from the original paper) illustrates the hierarchical control finite state machine:

    Figure 12: Hierarchical control finite state machine (HFSM).

    • First-level states: Describe abstract tasks (e.g., different layer types like Conv, Pool, Class, Norm, ALU task).
    • Second-level states: Characterize low-level execution events within each first-level state (e.g., execution phases for an input-output feature map pair in a Conv layer). A 61-bit instruction represents each HFSM state and related parameters (e.g., feature map size). This instruction can be decoded into detailed control signals for multiple accelerator cycles. This compact representation means a CNN requiring 50K cycles only needs 1 KB of instruction storage and a small decoder (0.03 mm² in 65nm process), saving significant area and power compared to storing cycle-by-cycle control signals.
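A minimal software sketch of such a two-level control scheme is given below; the state names and per-layer phase lists are illustrative and do not reproduce the paper's 61-bit instruction encoding:

```python
# Two-level hierarchical FSM sketch: a first-level state names the layer-type task, and a
# second-level state walks through execution phases for each input/output feature-map pair.

LAYER_PHASES = {
    "CONV":  ["load_kernel", "compute_window", "activate", "write_back"],
    "POOL":  ["load_window", "reduce", "write_back"],
    "CLASS": ["load_synapses", "accumulate", "activate", "write_back"],
}

def run_layer(layer_type, n_feature_map_pairs):
    # First level: the abstract task (layer type).
    for pair in range(n_feature_map_pairs):
        # Second level: low-level execution events for one feature-map pair.
        for phase in LAYER_PHASES[layer_type]:
            yield (layer_type, pair, phase)   # one "decoded" control step

# Example: iterate the control steps for a small convolutional layer with 3 map pairs.
for step in run_layer("CONV", 3):
    print(step)
```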

4.2.5. CNN Mapping

This section details how different CNN layer types are mapped onto the ShiDianNao architecture.

4.2.5.1. Convolutional Layer

A convolutional layer generates multiple output feature maps from multiple input feature maps. The accelerator processes one output feature map at a time. Within an output feature map, each PE continuously computes a single output neuron.

The output neuron at position (a, b) of the output feature map #mo is computed with the following formula: $ \mathbf { O } _ { a , b } ^ { m o } = f \left( \sum _ { m i \in A _ { m o } } \left( \beta ^ { m i , m o } + \sum _ { i = 0 } ^ { K _ { x } - 1 } \sum _ { j = 0 } ^ { K _ { y } - 1 } \omega _ { i , j } ^ { m i , m o } \times \mathbf { I } _ { a S _ { x } + i , b S _ { y } + j } ^ { m i } \right) \right) $ Where:

The output neuron at position (a, b) of the output feature map #mo is computed with the following formula: $ \mathbf{O}_{a,b}^{mo} = f\left( \sum_{mi \in A_{mo}} \left( \beta^{mi,mo} + \sum_{i=0}^{K_x-1} \sum_{j=0}^{K_y-1} \omega_{i,j}^{mi,mo} \times \mathbf{I}_{aS_x+i,\, bS_y+j}^{mi} \right) \right) $ Where:

  • $\mathbf{O}_{a,b}^{mo}$: The output neuron at position (a, b) of the output feature map #mo.

  • $f(\cdot)$: The non-linear activation function (e.g., tanh or sigmoid), computed by the ALU.

  • $\sum_{mi \in A_{mo}}$: Summation over the set $A_{mo}$ of input feature maps connected to output feature map #mo.

  • $\beta^{mi,mo}$: The bias value for the pair of input feature map #mi and output feature map #mo.

  • $\sum_{i=0}^{K_x-1} \sum_{j=0}^{K_y-1}$: Summation over the kernel dimensions.

  • $\omega_{i,j}^{mi,mo}$: The kernel coefficient (weight) at position (i, j) between input feature map #mi and output feature map #mo, read from SB.

  • $\mathbf{I}_{aS_x+i,\, bS_y+j}^{mi}$: The input neuron at position $(aS_x+i, bS_y+j)$ within input feature map #mi. $S_x$ and $S_y$ are the step sizes (strides) of the convolutional window.

    The following figure (Figure 13 from the original paper) illustrates an example of algorithm-hardware mapping for a convolutional layer:

    Figure 13: Example of algorithm-hardware mapping for a convolutional layer.

Let's consider a small design with 2 x 2 PEs and a 3 x 3 kernel with a 1 x 1 step size:

  • Cycle #0:
    • All four PEs ($PE_{0,0}$, $PE_{1,0}$, $PE_{0,1}$, $PE_{1,1}$) simultaneously read their first required input neurons ($x_{0,0}$, $x_{1,0}$, $x_{0,1}$, $x_{1,1}$) from NBin using Read Mode (a).
    • They also read the same kernel value ($k_{0,0}$) from SB.
    • Each PE performs a multiplication and stores the result locally.
    • Each PE stores its received input neuron in FIFO-H and FIFO-V for future inter-PE data propagation.
  • Cycle #1:
    • $PE_{0,0}$ and $PE_{0,1}$ retrieve their required input neurons ($x_{1,0}$ and $x_{1,1}$) from the FIFO-Hs of their right neighbors ($PE_{1,0}$ and $PE_{1,1}$) (horizontal inter-PE data propagation).
    • $PE_{1,0}$ and $PE_{1,1}$ read their required input neurons ($x_{2,0}$, $x_{2,1}$) from NBin using Read Mode (f).
    • All PEs share the kernel value ($k_{1,0}$) from SB.
  • Cycle #2: Similar to Cycle #1, with $PE_{0,0}$ and $PE_{0,1}$ getting data from FIFO-Hs and $PE_{1,0}$ and $PE_{1,1}$ reading from NBin (Mode (f)). All PEs share the kernel value ($k_{2,0}$). At this point, each PE has processed the first row of its convolutional window.
  • Cycle #3:
    • $PE_{0,0}$ and $PE_{1,0}$ retrieve their required input neurons ($x_{0,1}$ and $x_{1,1}$) from the FIFO-Vs of their upper neighbors ($PE_{0,1}$ and $PE_{1,1}$) (vertical inter-PE data propagation).
    • $PE_{0,1}$ and $PE_{1,1}$ read their required input neurons ($x_{0,2}$ and $x_{1,2}$) from NBin using Read Mode (c).
    • All PEs share the kernel value ($k_{0,1}$) from SB.
    • Each PE again stores received input neurons in its FIFOs for future propagation.

      This detailed, cycle-by-cycle management of data flow and inter-PE data propagation significantly reduces NBin reads and thus internal bandwidth requirements.
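For reference, a plain Python sketch of the convolutional-layer formula above follows; it computes the same result without modeling the PE mesh, FIFOs, or cycle-level scheduling, and the shapes and connection table are illustrative:

```python
import math

def conv_layer(inputs, kernels, biases, conn, Sx, Sy, act=math.tanh):
    """inputs[mi][y][x]; kernels[(mi, mo)] is a Ky x Kx matrix of weights;
    biases[(mi, mo)] is the per-pair bias; conn[mo] lists the connected input maps."""
    Ky = len(next(iter(kernels.values())))
    Kx = len(next(iter(kernels.values()))[0])
    outputs = {}
    for mo, in_maps in conn.items():
        Iy, Ix = len(inputs[in_maps[0]]), len(inputs[in_maps[0]][0])
        Oy, Ox = (Iy - Ky) // Sy + 1, (Ix - Kx) // Sx + 1
        out = [[0.0] * Ox for _ in range(Oy)]
        for b in range(Oy):
            for a in range(Ox):
                s = 0.0
                for mi in in_maps:
                    s += biases[(mi, mo)]
                    for j in range(Ky):
                        for i in range(Kx):
                            s += kernels[(mi, mo)][j][i] * inputs[mi][b * Sy + j][a * Sx + i]
                out[b][a] = act(s)              # activation done by the ALU in hardware
        outputs[mo] = out
    return outputs

# Tiny example: one 4x4 input map, one output map, 3x3 kernel, stride 1.
inp = {0: [[float(x + 4 * y) for x in range(4)] for y in range(4)]}
k = {(0, 0): [[0.1] * 3 for _ in range(3)]}
out = conv_layer(inp, k, {(0, 0): 0.0}, {0: [0]}, 1, 1)
print(out[0])   # 2x2 output feature map
```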

4.2.5.2. Pooling Layer

A pooling layer downsamples input feature maps. Each output neuron is computed from a pooling window of input neurons. The accelerator computes one output feature map at a time, with each PE continuously working on a single output neuron.

The output neuron at position (a, b) of the output feature map #mo for max pooling is computed with the following formula: $ \mathbf{O}_{a,b}^{mo} = \max_{0 \leq i < K_x,\; 0 \leq j < K_y} \left( \mathbf{I}_{a+i,\, b+j}^{mi} \right) $ Where:

  • $\mathbf{O}_{a,b}^{mo}$: The output neuron at position (a, b) of the output feature map #mo.

  • $\max$: The maximum operation.

  • $K_x, K_y$: Dimensions of the pooling window.

  • $\mathbf{I}_{a+i,\, b+j}^{mi}$: The input neuron at position $(a+i, b+j)$ within the input feature map #mi.

  • $mo = mi$: The mapping between input and output feature maps is one-to-one (i.e., each input feature map produces one output feature map).

    The following figure (Figure 14 from the original paper) illustrates the algorithm-hardware mapping for a pooling layer:

    Figure 14: Algorithm-hardware mapping for a pooling layer.

In a typical pooling layer, pooling windows for adjacent output neurons are adjacent but non-overlapping (i.e., step size equals window size). In such cases, PEs do not mutually propagate data because there is no data reuse between them. Each PE reads its required input neurons directly from NBin (e.g., using Read Mode (e)). If pooling windows overlap (step size smaller than window size), the process is similar to a convolutional layer, but without synapses.
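A minimal sketch of the max-pooling formula for the common non-overlapping case (step size equal to window size) follows; the shapes are illustrative:

```python
def max_pool(feature_map, Kx, Ky):
    # Non-overlapping pooling: the window steps by its own size.
    Iy, Ix = len(feature_map), len(feature_map[0])
    Oy, Ox = Iy // Ky, Ix // Kx
    out = [[0.0] * Ox for _ in range(Oy)]
    for b in range(Oy):
        for a in range(Ox):
            out[b][a] = max(feature_map[b * Ky + j][a * Kx + i]
                            for j in range(Ky) for i in range(Kx))
    return out

# Example: 4x4 map pooled with a 2x2 window -> 2x2 map.
print(max_pool([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]], 2, 2))
# [[6, 8], [14, 16]]
```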

4.2.5.3. Classifier Layer

Classifier layers are usually fully connected, meaning there is no synaptic weight sharing among different input-output neuron pairs. This results in these layers often consuming the largest portion of the SB (e.g., 97.28% for LeNet-5). Each PE works on a single output neuron.

The output neuron #no is computed with the following formula: $ \mathbf{O}^{no} = f\left( \beta^{no} + \sum_{ni} \omega^{ni,no} \times \mathbf{I}^{ni} \right) $ Where:

  • $\mathbf{O}^{no}$: The output neuron #no.

  • $f(\cdot)$: The activation function.

  • $\beta^{no}$: The bias value of output neuron #no.

  • $\sum_{ni}$: Summation over all input neurons #ni.

  • $\omega^{ni,no}$: The synapse (weight) connecting input neuron #ni to output neuron #no.

  • $\mathbf{I}^{ni}$: The input neuron #ni.

    Unlike convolutional layers, where one synapse and $P_x \times P_y$ input neurons are read per cycle, a classifier layer reads $P_x \times P_y$ different synaptic weights and a single input neuron for all PEs in each cycle. Each PE then multiplies its unique synapse with the common input neuron and accumulates the result into a partial sum. Once the dot product for an output neuron is complete, the result goes to the ALU for activation function computation.
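The schedule just described can be sketched in plain Python as follows; the PE-group size and activation are illustrative, and the loop structure mirrors "one broadcast input neuron per step, one private weight per PE" rather than the exact hardware timing:

```python
import math

def classifier_layer(inputs, weights, biases, act=math.tanh, n_pes=64):
    """weights[no][ni]; each group of up to n_pes output neurons is processed together,
    mimicking one PE per output neuron."""
    outputs = [0.0] * len(weights)
    for base in range(0, len(weights), n_pes):
        group = range(base, min(base + n_pes, len(weights)))
        partial = {no: biases[no] for no in group}
        for ni, x in enumerate(inputs):          # one broadcast input neuron per step
            for no in group:                     # each "PE" uses its own private weight
                partial[no] += weights[no][ni] * x
        for no in group:
            outputs[no] = act(partial[no])       # activation computed by the ALU
    return outputs

# Example: 3 inputs, 2 output neurons.
print(classifier_layer([1.0, 0.5, -1.0], [[0.2, 0.1, 0.3], [0.0, -0.4, 0.5]], [0.0, 0.1]))
```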

4.2.5.4. Normalization Layers

Normalization layers are decomposed into a series of sub-layers and fundamental computational primitives that can be executed by ShiDianNao.

The following figure (Figure 15 from the original paper) shows the decomposition of an LRN layer:

Figure 15: Decomposition of an LRN layer.

An LRN layer is decomposed into:

  • A classifier sub-layer.

  • An element-wise square operation.

  • A matrix addition.

  • Exponential functions (computed by ALU).

  • Divisions (computed by ALU).

    The output neuron at position (a, b) of output feature map #mi in an LRN layer is computed with the following formula: $ \mathbf{O}_{a,b}^{mi} = \mathbf{I}_{a,b}^{mi} / \left( k + \alpha \times \sum_{j=\max(0,\, mi-M/2)}^{\min(M_i-1,\, mi+M/2)} (\mathbf{I}_{a,b}^{j})^2 \right)^{\beta} $ Where:

  • $\mathbf{O}_{a,b}^{mi}$: The output neuron at position (a, b) of output feature map #mi.

  • $\mathbf{I}_{a,b}^{mi}$: The input neuron at position (a, b) of input feature map #mi.

  • $k, \alpha, \beta$: Constant parameters.

  • $\sum_{j=\max(0,\, mi-M/2)}^{\min(M_i-1,\, mi+M/2)}$: Summation over neighboring feature maps.

  • $M_i$: Total number of input feature maps.

  • $M$: Maximum number of input feature maps connected to one output feature map.

  • $(\mathbf{I}_{a,b}^{j})^2$: Element-wise square of the input neuron from feature map #j.

    The following figure (Figure 16 from the original paper) shows the decomposition of an LCN layer:

    Figure 16: Decomposition of an LCN layer.

An LCN layer is decomposed into:

  • Two convolutional sub-layers.

  • A pooling sub-layer.

  • A classifier sub-layer.

  • Two matrix additions.

  • An element-wise square operation.

  • Divisions (computed by ALU).

    The output neuron at position (a, b) of output feature map #mi in an LCN layer is computed with the following formula: $ \mathbf{O}_{a,b}^{mi} = \nu_{a,b}^{mi} / \max\left( \mathrm{mean}(\delta_{a,b}),\, \delta_{a,b} \right) $ Where:

  • $\mathbf{O}_{a,b}^{mi}$: The output neuron at position (a, b) of output feature map #mi.

  • $\nu_{a,b}^{mi}$: The subtractively normalized input, computed by $ \nu_{a,b}^{mi} = \mathbf{I}_{a,b}^{mi} - \sum_{j,p,q} \omega_{p,q} \times \mathbf{I}_{a+p,\, b+q}^{j} $, where $\mathbf{I}_{a,b}^{mi}$ is the input neuron and $\omega_{p,q}$ is a normalized Gaussian weighting window ($\sum_{p,q} \omega_{p,q} = 1$).

  • $\delta_{a,b}$: A local standard-deviation-like term, computed by $ \delta_{a,b} = \sqrt{\sum_{mi,p,q} (\nu_{a+p,\, b+q}^{mi})^2} $.

  • $\max(\mathrm{mean}(\delta_{a,b}),\, \delta_{a,b})$: The normalization term, combining the mean of $\delta$ over the map and the local $\delta_{a,b}$.

    For the element-wise square and matrix addition primitives, each PE works on one matrix element per cycle using its multiplier or adder, and the $P_x \times P_y$ results are then written to NBout.
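For reference, a direct Python sketch of the LRN formula above is shown below; it computes the result in one pass rather than via the accelerator's decomposition into sub-layers, and the constants k, alpha, beta, and M are illustrative:

```python
def lrn(inputs, k=2.0, alpha=1e-4, beta=0.75, M=5):
    """inputs[mi][y][x]: list of input feature maps; returns normalized maps of the same shape."""
    Mi = len(inputs)
    H, W = len(inputs[0]), len(inputs[0][0])
    out = [[[0.0] * W for _ in range(H)] for _ in range(Mi)]
    for mi in range(Mi):
        lo, hi = max(0, mi - M // 2), min(Mi - 1, mi + M // 2)
        for y in range(H):
            for x in range(W):
                # Sum of squares over the neighboring feature maps at the same position.
                sq_sum = sum(inputs[j][y][x] ** 2 for j in range(lo, hi + 1))
                out[mi][y][x] = inputs[mi][y][x] / (k + alpha * sq_sum) ** beta
    return out

# Example: three 1x1 feature maps.
print(lrn([[[1.0]], [[2.0]], [[3.0]]]))
```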

5. Experimental Setup

5.1. Datasets

The paper uses 10 CNNs collected from representative visual recognition applications as benchmarks to evaluate ShiDianNao. These CNNs represent diverse workloads with varying layer sizes and configurations. While specific example data samples are not provided in the paper, the context implies standard image data (e.g., LeNet-5 is known for document recognition, and other CNNs are for face recognition, general image classification). The characteristics mentioned for these benchmarks are:

  • Input Neurons: Maximum 45 KB for any layer across all benchmarks.
  • Synapses (Weights): Maximum 118 KB for any CNN across all benchmarks. These sizes are critical as they directly inform the required SRAM capacities for ShiDianNao to function entirely on-chip.

The following are the results from Table 2 of the original paper:

Layer Kernel Size #@size Layer Size #@size Layer Kernel Size #@size Layer Size #@size
0 Input C1 S2 C3 S4 C5 F6 6@7x7 6@2x2 61@7x7 16@2x2 305@6x6 160@1x1 1@42x42 6@36x36 6@18x18 16@12x12 16@6x6 80@1x1 21@1x1 20 Input C1 S2 C3 S4 C5 20@5x5 20@2x2 400@5x5 20@2x2 400@3x3 1@32x32 20@28x28 20@14x14 20@10x10 20@5x5 20@3x3
Layer Kernel Size #@size Layer Size #@size F6 F7 Layer 6000@1x1 1800@1x1 Kernel Size #@size 300@1x1 6@1x1 Layer Size #@size
20 Input C1 S2 C3 S4 F5 20@3x3 20@2x2 125@3x3 25@2x2 1000@1x1 1@23x28 20@21x26 20@11x13 25@9x11 25@5x6 40@1x1 20 Input C1 S2 C3 S4 F5 F6 6@5x5 6@2x2 60@5x5 16@2x2 1920@5x5 10080@1x1 1@32x32 6@28x28 6@14x14 16@10x10 16@5x5 120@1x1 84@1x1
Layer Kernel Size #@size Layer Size #@size F7 Layer 840@1x1 Kernel Size #@size 10@1x1 Layer Size #@size
20 Input C1 C2 F3 F4 5@5x5 250@5x5 5000@1x1 1000@1x1 1@29x29 5@13x13 50@5x5 100@1x1 10@1x1 CE Input C1 S2 C3 S4 F5 4@5x5 4@2x2 20@3x3 14@2x2 14@6x7 1@32x36 4@28x32 4@14x16 14@12x14 14@6x7 14@1x1
Layer Kernel Size #@size Layer Size #@size F6 Layer 14@1x1 Kernel Size #@size 1@1x1 Layer Size #@size
0 Input C1 S2 C3 4@5x5 6@3x3 14@5x5 1@24x24 1@24x24 4@12x12 4@12x12 Input C1 S2 C3 12@5x5 12@2x2 60@3x3 3@64x36 12@60x32 12@30x16 14@28x14
S4 F5 60@3x3 160@6x7 Kernel Size 16@6x6 10@1x1 20 S4 F5 F6 14@2x2 14@14x7 14@1x1 Kernel Size 14@14x7 14@1x1 1@1x1 Layer Size
Layer Input C1 #@size #@size 1@20x20 Layer Input C1 #@size 4@7x7 #@size 1@46x56 4@40x50
20 S2 C3 S4 F5 F6 4@5x5 4@2x2 20@3x3 14@2x2 14@1x1 14@1x1 4@16x16 4@8x8 14@6x6 14@3x3 14@1x1 1@1x1 20 S2 C3 S4 F5 F6 4@2x2 6@5x5 3@2x2 180@8x10 240@1x1 4@20x25 3@16x21 3@8x10 60@1x1 4@1x1

These datasets were chosen to represent state-of-the-art CNN implementations for visual recognition, allowing for a comprehensive evaluation of ShiDianNao's performance and energy efficiency across a range of realistic workloads.

5.2. Evaluation Metrics

The paper uses several metrics to evaluate the proposed ShiDianNao accelerator, focusing on performance, energy efficiency, and hardware cost.

  • Speedup:

    1. Conceptual Definition: Speedup quantifies how much faster a task executes on one system compared to another. It is a dimensionless ratio.
    2. Mathematical Formula: $ \text{Speedup} = \frac{\text{Execution Time}_{\text{Baseline}}}{\text{Execution Time}_{\text{Proposed}}} $
    3. Symbol Explanation:
      • $\text{Execution Time}_{\text{Baseline}}$: The time taken to complete the task on a baseline system (e.g., CPU).
      • $\text{Execution Time}_{\text{Proposed}}$: The time taken to complete the same task on the proposed ShiDianNao accelerator. A higher speedup value indicates better performance.
  • Energy Efficiency:

    1. Conceptual Definition: Energy efficiency measures the amount of useful work performed per unit of energy consumed. In this context, it often refers to how much less energy is consumed to perform the same task. The paper expresses it as a ratio of energy consumed.
    2. Mathematical Formula: $ \text{Energy Efficiency Ratio} = \frac{\text{Energy Consumed}_{\text{Baseline}}}{\text{Energy Consumed}_{\text{Proposed}}} $
    3. Symbol Explanation:
      • $\text{Energy Consumed}_{\text{Baseline}}$: The total energy consumed by a baseline system for a task (including memory accesses).
      • $\text{Energy Consumed}_{\text{Proposed}}$: The total energy consumed by the ShiDianNao accelerator for the same task. A higher ratio indicates better energy efficiency.
  • Area (mm²):

    1. Conceptual Definition: The physical footprint of the chip design, typically measured in square millimeters. It directly relates to manufacturing cost and feasibility for embedded systems.
    2. Mathematical Formula: N/A (directly measured in mm²).
    3. Symbol Explanation: N/A.
  • Power (mW):

    1. Conceptual Definition: The average rate at which energy is consumed by the accelerator during operation, measured in milliwatts. Low power consumption is critical for embedded and mobile devices with limited battery life.
    2. Mathematical Formula: N/A (directly measured in mW).
    3. Symbol Explanation: N/A.
  • GOP/s (Billions of fixed-point Operations per second):

    1. Conceptual Definition: A measure of computational throughput, specifically for fixed-point operations. An "operation" in the context of neural networks typically refers to a multiply-accumulate (MAC) operation, which is the dominant operation in convolutional and fully connected layers.
    2. Mathematical Formula: N/A (directly measured as GOP/s).
    3. Symbol Explanation: N/A.
  • Frames Per Second (FPS):

    1. Conceptual Definition: The number of full image frames that can be processed per second, directly indicating the real-time processing capability for video streams.
    2. Mathematical Formula: N/A (derived from total processing time per frame).
    3. Symbol Explanation: N/A.

5.3. Baselines

To provide a comprehensive comparison, ShiDianNao is evaluated against three baselines:

  • CPU:

    • Specification: Intel Xeon E7-8830, 2.13 GHz clock frequency, 1 TB main memory. It supports 256-bit SIMD (Single Instruction, Multiple Data) instructions (like MMX, SSE, SSE2, SSE4.1, SSE4.2).
    • Configuration: Benchmarks compiled with GCC 4.4.7 using optimization flags -O3 -lm -march=native to maximize performance by utilizing SIMD instructions.
    • Representativeness: Represents a high-end server-class CPU, showcasing general-purpose processing capabilities.
  • GPU:

    • Specification: NVIDIA K20M, a modern GPU card with 5 GB GDDR5 memory, capable of 3.52 TFlops (trillions of floating-point operations per second) peak performance, built on 28 nm technology.
    • Configuration: The Caffe library, widely recognized as a fast CNN library for GPUs [1], is used for executing benchmarks.
    • Representativeness: Represents a powerful, state-of-the-art accelerator for highly parallelizable workloads, often used for NN training and inference.
  • Accelerator (DianNao Reimplementation):

    • Specification: A reimplementation of DianNao [3], resized for a fair comparison in an embedded context.
      • NFU: 8 x 8 DianNao-NFU (8 hardware neurons, each processing 8 input neurons and 8 synapses per cycle). This is smaller than the original 16 x 16 DianNao-NFU.
      • Memory Bandwidth: 62.5 GB/s memory model (compared to the original 250 GB/s, which is considered unrealistic for a vision sensor).
      • On-chip Buffers: Shrunk to 1 KB NBin/NBout and 16 KB SB (half the size of the original DianNao's buffers).
    • Verification: The reimplementation's area (1.38 mm²) is roughly proportional to the original DianNao's area (3.02 mm²), confirming its fidelity as a scaled-down version.
    • Representativeness: Represents a previous generation, state-of-the-art neural network accelerator from the same research group, but one that targets a broader range of NNs and has different memory access patterns compared to ShiDianNao. This provides a direct comparison of the architectural specializations made in ShiDianNao.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate ShiDianNao's superior performance and energy efficiency, particularly when compared to general-purpose architectures and even its predecessor, DianNao.

6.1.1. Layout Characteristics

The following figure (Figure 17 from the original paper) shows the layout of ShiDianNao (65 nm):

Figure 17: Layout of ShiDianNao (65 nm).

The following are the results from Table 3 of the original paper:

                 ShiDianNao   DianNao
Data width       16-bit       16-bit
# multipliers    64           64
NBin SRAM size   64 KB        1 KB
NBout SRAM size  64 KB        1 KB
SB SRAM size     128 KB       16 KB
Inst. SRAM size  32 KB        8 KB

ShiDianNao features 8 x 8 (64) PEs, matching the number of multipliers in the resized DianNao for a fair comparison of computational units. However, ShiDianNao significantly increases its on-chip SRAM capacity to enable full CNN mapping and eliminate DRAM accesses: 64 KB for NBin, 64 KB for NBout, 128 KB for SB, and 32 KB for IB. This totals 288 KB of SRAM, which is 11.1x larger than the total SRAM in the DianNao baseline. Despite this substantial increase in SRAM, the total area of ShiDianNao is only 4.86 mm², which is 3.52x larger than the DianNao baseline (1.38 mm²), demonstrating efficient area utilization for the increased memory.

6.1.2. Performance

The following figure (Figure 18 from the original paper) shows the speedup of GPU, DianNao, and ShiDianNao over the CPU:

Figure 18: Speedup of GPU, DianNao, and ShiDianNao over the CPU.

  • Vs. General-Purpose Architectures: ShiDianNao significantly outperforms general-purpose architectures. It is, on average, 46.38 times faster than the CPU (Intel Xeon E7-8830) and 28.94 times faster than the high-end GPU (NVIDIA K20M). The GPU struggles to fully leverage its computational power due to the small computational kernels of the visual recognition tasks, which map poorly onto its 2,496 hardware threads.

  • Vs. Previous Accelerator (DianNao): ShiDianNao also outperforms the DianNao baseline on 9 out of 10 benchmarks, averaging 1.87 times faster. This advantage stems from two main factors:

    1. Elimination of Off-chip Memory Accesses: ShiDianNao's larger on-chip SRAM allows it to entirely avoid off-chip memory accesses during execution, which were still present in DianNao.
    2. Exploitation of 2D Data Locality: ShiDianNao's 2D PE mesh and inter-PE data reuse mechanisms are specifically designed to exploit the 2D data locality of CNN feature maps, which DianNao (designed for a broader NN scope) did not effectively utilize.
  • Performance Anomaly (Simple Conv Benchmark): ShiDianNao performs slightly worse than the DianNao baseline on the Simple Conv benchmark. This is attributed to ShiDianNao's design where it processes a single output feature map at a time, and each PE works on a single output neuron within that map. For CNNs with uncommonly small output feature maps (e.g., 5x5 in the C2 layer of Simple Conv for an 8x8 PE design), some PEs remain idle, reducing overall efficiency. The authors chose not to implement complex control logic to alleviate this, prioritizing a simpler programming model.

  • Real-time Processing Capability: For a 640x480 video frame, the longest processing time occurs for the ConvNN benchmark, requiring 0.047 ms to process a 64x36-pixel region. Since a frame consists of 1073 such regions (with 16-pixel overlap), a full frame takes slightly over 50 ms to process, resulting in a speed of 20 frames per second (FPS) for the most demanding workload. This speed, combined with partial frame buffering (a few tens of pixel rows, fitting within 256 KB of commercial image processors), allows ShiDianNao to process real-time video streams directly from the sensor.
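The frame-rate figure can be checked with the numbers quoted above:

```python
# Quick arithmetic check using the values from the text.
regions_per_frame = 1073          # 64x36-pixel regions per 640x480 frame (16-pixel overlap)
ms_per_region = 0.047             # ConvNN, the most demanding benchmark
ms_per_frame = regions_per_frame * ms_per_region
print(f"{ms_per_frame:.1f} ms per frame -> {1000 / ms_per_frame:.1f} FPS")
# ~50.4 ms per frame, i.e., roughly 20 frames per second
```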

6.1.3. Energy

The following figure (Figure 19 from the original paper) shows the energy cost of GPU, DianNao, and ShiDianNao:

Figure 19: Energy cost of GPU, DianNao, and ShiDianNao. (Chart of the energy consumption, on a logarithmic scale, of the GPU, DianNao, DianNao-FreeMem, and ShiDianNao across the benchmarks; ShiDianNao is markedly more energy efficient, especially on certain tasks.)

  • Vs. General-Purpose Architectures: ShiDianNao is remarkably energy-efficient. It is, on average, 4688.13 times more energy-efficient than the GPU. This includes main memory accesses for input data, even though ShiDianNao is designed not to access DRAM.

  • Vs. Previous Accelerator (DianNao): ShiDianNao is 63.48 times more energy-efficient than DianNao. Even when compared to an idealized DianNao-FreeMem (which assumes zero energy cost for main memory accesses), ShiDianNao is still 1.66 times more energy-efficient. This highlights the effectiveness of ShiDianNao's specialized 2D data locality exploitation and internal data movement optimizations.

  • Sensor-Integrated Advantage: When ShiDianNao is integrated directly with an embedded vision sensor, and frames are streamed directly into its NBin (completely bypassing DRAM), its energy superiority becomes even more pronounced: 87.39 times more energy-efficient than DianNao and 2.37 times more energy-efficient than DianNao-FreeMem.

    The following are the results from Table 4 of the original paper:

    | Component | Area (mm²) | Power (mW) | Energy (nJ) |
    |-----------|------------|------------|-------------|
    | Total | 4.86 (100%) | 320.10 (100%) | 6048.70 (100%) |
    | NFU | 0.66 (13.58%) | 268.82 (83.98%) | 5281.09 (87.29%) |
    | NBin | 1.12 (23.05%) | 35.53 (11.10%) | 475.01 (7.85%) |
    | NBout | 1.12 (23.05%) | 6.60 (2.06%) | 86.61 (1.43%) |
    | SB | 1.65 (33.95%) | 6.77 (2.11%) | 94.08 (1.56%) |
    | IB | 0.31 (6.38%) | 2.38 (0.74%) | 35.84 (0.59%) |
  • Energy Breakdown: The breakdown of energy consumption (averaged over the 10 benchmarks) reveals a critical insight: the four SRAM buffers (NBin, NBout, SB, IB) account for only 11.43% of the overall energy, while the vast majority, 87.29%, is consumed by the NFU logic (recomputed in the sketch below). This is in stark contrast to prior observations on DianNao [3], where DRAM accesses dominated energy consumption (more than 95%). ShiDianNao thus successfully shifts the energy bottleneck from memory accesses to computation, a significant achievement for on-chip processing.
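
The shares above can be recomputed from the absolute figures in Table 4; the short sketch below does exactly that (the small residual relative to the total presumably corresponds to logic not itemized in the table).

```python
# Recompute the energy breakdown of Table 4 (nJ, averaged over 10 benchmarks).
energy_nj = {
    "NFU":   5281.09,
    "NBin":   475.01,
    "NBout":   86.61,
    "SB":      94.08,
    "IB":      35.84,
}
total_nj = 6048.70

for name, e in energy_nj.items():
    print(f"{name:>5}: {100.0 * e / total_nj:5.2f}%")

buffers = ("NBin", "NBout", "SB", "IB")
sram_share = 100.0 * sum(energy_nj[k] for k in buffers) / total_nj
print(f"SRAM buffers combined: {sram_share:.2f}%")  # ~11.43%, NFU ~87.3%
```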

6.2. Ablation Studies / Parameter Analysis

The paper implicitly conducts an ablation-like study by comparing ShiDianNao to DianNao and DianNao-FreeMem.

  • DianNao comparison: Isolates the effect of 2D data locality exploitation and DRAM elimination. The 1.87x speedup and 63.48x energy efficiency improvement show that ShiDianNao's specialized architecture for CNNs and its DRAM-less operation are highly effective.

  • DianNao-FreeMem comparison: Isolates the effect of 2D data locality exploitation and internal data movement minimization. Even if DianNao had free DRAM access, ShiDianNao's inter-PE data propagation and optimized NFU still yield 1.66x better energy efficiency, confirming the value of these architectural choices.

    The paper also discusses the impact of PE count relative to feature map size. When an output feature map is smaller than the PE mesh (e.g., a 5x5 feature map on the 8x8 PE array), some PEs sit idle. The PE mesh size, while efficient for typical CNN layers, is therefore a design parameter that affects utilization for specific workloads, as illustrated in the sketch below. The authors chose a fixed mesh size to keep the programming model simple, accepting a utilization trade-off for very small layers.
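
A minimal utilization sketch for this effect, assuming the paper's 8x8 PE mesh and the one-output-neuron-per-PE mapping; the tiling model below is purely illustrative and is not the accelerator's actual control logic.

```python
import math

def pe_utilization(map_h, map_w, pe_rows=8, pe_cols=8):
    """Fraction of busy PEs when an output feature map of map_h x map_w
    is tiled onto a pe_rows x pe_cols mesh, one output neuron per PE.
    Illustrative model only."""
    tiles = math.ceil(map_h / pe_rows) * math.ceil(map_w / pe_cols)
    busy_slots = map_h * map_w                 # neurons actually computed
    offered_slots = tiles * pe_rows * pe_cols  # PE slots available
    return busy_slots / offered_slots

print(pe_utilization(5, 5))    # Simple Conv C2 layer: ~0.39 -> many idle PEs
print(pe_utilization(32, 32))  # typical larger map: 1.0 -> mesh fully used
```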

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully presents ShiDianNao, a novel and highly efficient accelerator for state-of-the-art visual recognition algorithms based on Convolutional Neural Networks (CNNs). By integrating the accelerator directly next to a CMOS/CCD image sensor and designing its architecture around CNN weight sharing and 2D data locality, ShiDianNao eliminates all energy-costly DRAM accesses. This fundamental shift yields an average 50x speedup over a mainstream CPU and 30x over a high-end GPU, and a 1.87x speedup over a carefully reimplemented DianNao accelerator. Crucially, ShiDianNao reduces energy consumption by roughly 4700x relative to the GPU and 60x relative to DianNao. With a compact 4.86 mm² footprint (65 nm process) and 320.10 mW power consumption at 1 GHz, ShiDianNao is well suited to embedded visual applications in mobile and wearable devices, promising to reduce server workloads, improve the Quality of Service (QoS) of visual applications, and accelerate the adoption of sophisticated vision processing.

7.2. Limitations & Future Work

The authors implicitly point out a limitation related to PE utilization:

  • Idle PEs for Small Feature Maps: When CNNs have uncommonly small output feature maps (fewer output neurons than the number of PEs), some PEs will remain idle. While the authors considered adding complex control logic to allow different PEs to work on different feature maps simultaneously, they ultimately decided against it due to the detrimental impact on the programming model. This suggests a trade-off between maximizing PE utilization for all possible CNN layer configurations and maintaining architectural simplicity.

    The paper doesn't explicitly outline specific future work directions, but the overall motivation and conclusion suggest:

  • Widespread Adoption: The vision is to make sophisticated vision processing ubiquitous by considerably lowering its hardware and energy cost. This implies continued development to integrate such accelerators into various embedded systems.

  • Scaling to Larger/More Complex CNNs: As CNN models evolve, future work might involve adapting the architecture to support even larger models or more diverse layer types while maintaining the DRAM-less principle.

  • Enhanced Flexibility: Exploring ways to improve PE utilization for smaller feature maps without overly complicating the programming model could be a direction.

7.3. Personal Insights & Critique

The ShiDianNao paper presents a highly impactful and forward-thinking architectural design. My personal insights and critique are as follows:

  • Significance of "Shift Left" Principle: The most profound aspect of ShiDianNao is its radical adherence to the "shift left" principle – processing data as close to its source as possible. By truly eliminating all DRAM accesses for CNN inference, the paper demonstrates a path to unprecedented energy efficiency, which is critical for the proliferation of AI at the very edge (e.g., smart cameras, always-on sensors). This is a strong testament to how deep co-design (algorithm-architecture-system) can yield orders-of-magnitude improvements.
  • Leveraging Algorithm Properties: The intelligent exploitation of CNN's weight sharing and 2D data locality is a masterstroke. This highlights the importance of understanding the specific characteristics of the target algorithm class to design maximally efficient hardware, rather than aiming for overly general solutions that compromise on efficiency. The detailed inter-PE data propagation mechanism is a brilliant example of minimizing internal data movement, complementing the DRAM elimination.
  • Validation of Fixed-Point Arithmetic: The paper further solidifies the argument for 16-bit fixed-point arithmetic in NN inference, showing that sufficient accuracy can be maintained with massive hardware and energy savings. This is a crucial validation for embedded AI systems.
  • Potential Issues/Areas for Improvement:
    • Flexibility for Evolving CNNs: While ShiDianNao offers more flexibility than rigid systolic arrays, CNN architectures are constantly evolving. New layers, new pooling mechanisms, or more dynamic kernel sizes might challenge its fixed 2D PE mesh and HFSM-based control. The trade-off between specialization for extreme efficiency and adaptability to future algorithmic innovation is always present.

    • Model Size Limitations: The 288 KB of on-chip SRAM is sufficient for the benchmarks tested. However, very large, state-of-the-art CNNs can have hundreds of millions or even billions of parameters, far exceeding this capacity without further model compression or specialized partitioning strategies (see the rough capacity estimate at the end of this section). The promise to "entirely map a CNN within an SRAM" is therefore contingent on the CNN's size.

    • Generalization to Other AI Tasks: While highly optimized for CNNs in vision, ShiDianNao's architecture is less suited for other NN types (e.g., Recurrent Neural Networks (RNNs), Transformers) or broader AI workloads that lack the same 2D locality or weight sharing properties. This narrow scope is a deliberate design choice for efficiency but is a limitation in terms of general AI acceleration.

    • The "Simple Conv" Anomaly: The performance drop for small feature maps (due to idle PEs) indicates a potential area for micro-architectural refinement. While the authors explicitly state their decision not to add complexity for a simpler programming model, for certain applications dominated by small layers, this could be a noticeable inefficiency.

    • "Hot" NFU: The energy breakdown shows the NFU consuming 87.29% of the total energy. While successful in shifting the bottleneck from memory, this implies that further energy savings would need to focus on making the NFU operations themselves even more efficient.

      Overall, ShiDianNao is an excellent example of how to architect for extreme efficiency by deeply understanding the target application domain and algorithm. Its innovations are highly transferable to other embedded AI contexts where data originates from sensors and ultra-low power is paramount.
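
To put the "Model Size Limitations" point above in rough numbers, here is a hypothetical back-of-envelope estimate assuming the design's 16-bit fixed-point weights and the 288 KB of total on-chip SRAM mentioned above; it ignores how the SRAM is split across NBin/NBout/SB/IB and any space needed for activations.

```python
# Rough, hypothetical capacity arithmetic (not from the paper).
sram_bytes = 288 * 1024       # total on-chip SRAM
bytes_per_weight = 2          # 16-bit fixed point

max_weights_on_chip = sram_bytes // bytes_per_weight
print(f"16-bit weights that fit on chip: ~{max_weights_on_chip:,}")  # ~147,456

large_model_params = 100_000_000  # hypothetical 100M-parameter CNN
footprint_mb = large_model_params * bytes_per_weight / (1024 ** 2)
print(f"100M-parameter model at 16 bits: ~{footprint_mb:.0f} MB")    # ~191 MB
```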
