
DaDianNao: A Machine-Learning Supercomputer

Published: 12/01/2014
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

DaDianNao is a machine-learning supercomputer optimized for CNNs and DNNs, demonstrating a 450.65x speedup and 150.31x energy reduction compared to GPUs, effectively addressing the high computational and memory demands of machine learning.

Abstract

Abstract —Many companies are deploying services, either for consumers or industry, which are largely based on machine-learning algorithms for sophisticated processing of large amounts of data. The state-of-the-art and most popular such machine-learning algorithms are Convolutional and Deep Neural Networks (CNNs and DNNs), which are known to be both computationally and memory intensive. A number of neural network accelerators have been recently proposed which can offer high computational capacity/area ratio, but which remain hampered by memory accesses. However, unlike the memory wall faced by processors on general-purpose workloads, the CNNs and DNNs memory footprint, while large, is not beyond the capability of the on-chip storage of a multi-chip system. This property, combined with the CNN/DNN algorithmic characteristics, can lead to high internal bandwidth and low external communications, which can in turn enable high-degree parallelism at a reasonable area cost. In this article, we introduce a custom multi-chip machine-learning architecture along those lines. We show that, on a subset of the largest known neural network layers, it is possible to achieve a speedup of 450.65x over a GPU, and reduce the energy by 150.31x on average for a 64-chip system. We implement the node down to the place and route at 28nm, containing a combination of custom storage and computational units, with industry-grade interconnects.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

DaDianNao: A Machine-Learning Supercomputer

1.2. Authors

Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, Olivier Temam. Their affiliations include:

  • SKL of Computer Architecture, ICT, CAS, China
  • Inria, Saclay, France
  • University of CAS, China
  • Inner Mongolia University, China

1.3. Journal/Conference

The provided text does not explicitly name a journal or conference. The publication timestamp (2014-12-01 UTC) places it in late 2014; DaDianNao is known to have been presented at the 47th IEEE/ACM International Symposium on Microarchitecture (MICRO-47), a premier computer architecture conference, in December 2014.

1.4. Publication Year

2014

1.5. Abstract

This paper introduces DaDianNao, a custom multi-chip architecture designed for accelerating Convolutional Neural Networks (CNNs) and Deep Neural Networks (DNNs). These machine-learning algorithms, while popular for sophisticated data processing in various services, are known to be computationally and memory-intensive. Existing neural network accelerators improve computational capacity but are often bottlenecked by memory accesses. The authors observe that, unlike general-purpose workloads, the large memory footprint of CNNs and DNNs can be accommodated by the on-chip storage of a multi-chip system. This property, combined with algorithmic characteristics, allows for high internal bandwidth and low external communications, facilitating high-degree parallelism at a reasonable area cost. The proposed DaDianNao system, implemented down to place and route at 28nm technology and featuring custom storage and computational units with industry-grade interconnects, demonstrates significant performance gains. For a 64-chip system, it achieves an average speedup of 450.65x over a GPU and reduces energy by 150.31x on a subset of the largest known neural network layers.


2. Executive Summary

2.1. Background & Motivation

The core problem DaDianNao aims to solve is the computational and memory intensity of state-of-the-art machine-learning (ML) algorithms, particularly Convolutional Neural Networks (CNNs) and Deep Neural Networks (DNNs). These algorithms have become ubiquitous in a wide range of applications, from speech recognition (e.g., Siri, Google Now) to image identification (e.g., Apple iPhoto) and even pharmaceutical research. The authors argue that machine-learning applications are "in the process of displacing scientific computing as the major driver for high-performance computing."

Despite their widespread adoption, CNNs and DNNs pose significant challenges for conventional hardware. While GPUs (Graphics Processing Units) are currently favored for their parallel processing capabilities, they still incur high area costs, considerable execution times, and moderate energy efficiency due to their general-purpose nature and reliance on off-chip memory for large models. Existing dedicated neural network accelerators offer high computational density but are often hampered by memory accesses, where data movement between on-chip processing units and off-chip main memory becomes the performance bottleneck – a phenomenon known as the memory wall.

The paper's innovative idea stems from a critical observation: although the memory footprint of CNNs and DNNs can be very large (up to tens of GB), it is not beyond the capability of the aggregated on-chip storage of a multi-chip system. This contrasts with general-purpose workloads where the memory wall is often insurmountable with on-chip resources alone. This unique property of CNNs and DNNs, combined with their algorithmic characteristics (e.g., high data reuse, local connectivity in many layers), suggests that a specialized multi-chip architecture could achieve high internal bandwidth and low external communication, thereby enabling high parallelism at a reasonable cost.

2.2. Main Contributions / Findings

The primary contributions of this paper are:

  • Novel Multi-Chip Architecture: Introduction of DaDianNao, a custom multi-chip, highly specialized architecture specifically designed for CNNs and DNNs. This architecture fundamentally addresses the memory wall problem by distributing the large neural network models across the on-chip eDRAM (embedded DRAM) of multiple interconnected chips.

  • Asymmetric Node Design: Each chip (node) in the system is designed with an asymmetric bias towards storage rather than computation. This involves integrating substantial eDRAM directly on-chip to hold synaptic weights close to the Neural Functional Units (NFUs), minimizing costly off-chip memory accesses.

  • Neuron-Centric Data Movement: The architecture prioritizes moving neuron values (which are typically fewer in number) between chips and within nodes, rather than synaptic weights. This strategy significantly reduces inter-chip communication bandwidth requirements.

  • High Internal Bandwidth via Tiling: Each node utilizes a tile-based design with multiple smaller NFUs and eDRAM banks, interconnected by a fat tree. This maximizes internal data bandwidth and mitigates wire congestion issues observed in monolithic designs.

  • Configurability for Flexibility: The architecture is designed to be highly configurable, supporting different CNN and DNN layer types (Convolutional, Pooling, Local Response Normalization, Classifier) and both inference (forward pass) and training (forward and backward passes) modes. This is achieved through reconfigurable NFU pipelines and support for fixed-point arithmetic at various precisions.

  • Significant Performance and Energy Efficiency Gains:

    • Speedup: On a subset of the largest known neural network layers, a 64-chip DaDianNao system achieved an average speedup of 450.65x compared to a GPU (NVIDIA K20M).
    • Energy Reduction: The same 64-chip system demonstrated an average energy reduction of 150.31x relative to the GPU. A single node shows an even higher energy reduction of 330.56x.
  • Practical Implementation: The design is not merely theoretical; a single node was implemented down to place and route at 28nm, confirming its feasibility and providing concrete area and power estimations. The node contains a combination of custom storage (eDRAM) and computational units (NFUs), integrated with industry-grade interconnects (HyperTransport).

    These findings collectively demonstrate that dedicated multi-chip architectures can overcome the memory wall for DNNs and CNNs, enabling supercomputer-level performance and energy efficiency for large-scale machine-learning applications.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Machine Learning (ML): A field of artificial intelligence that uses statistical techniques to give computer systems the ability to "learn" from data, without being explicitly programmed. It involves building models from sample data (training) that can make predictions or decisions on new, unseen data (inference).
    • Inference (Forward Phase/Testing): The process of using a trained ML model to make predictions or classify new data. This is typically a feed-forward computation.
    • Training (Backward Phase/Learning): The process of adjusting the parameters (weights and biases) of an ML model using a dataset to minimize prediction errors. This usually involves forward propagation, calculating a loss, and then backpropagation to compute gradients and update weights.
  • Neural Networks (NNs): A subset of machine learning inspired by the structure and function of the human brain. They consist of interconnected nodes (neurons) organized in layers. Each connection between neurons has a synaptic weight that is adjusted during training.
    • Deep Neural Networks (DNNs): Neural networks with multiple hidden layers between the input and output layers. The "deep" refers to the number of layers. They are capable of learning complex patterns.
    • Convolutional Neural Networks (CNNs): A specialized type of DNN particularly effective for processing grid-like data, such as images. CNNs use convolutional layers that apply filters (kernels) to local regions of the input, enabling them to learn hierarchical features and exhibit translation invariance.
      • Layer Types in CNNs/DNNs:
        • Convolutional Layers (CONV): Apply a set of learnable filters (kernels) to the input data, sliding them across the input to produce feature maps. These filters detect specific features (e.g., edges, textures).
        • Pooling Layers (POOL): Reduce the dimensionality of the feature maps by taking the maximum (max pooling) or average (average pooling) over local regions. This helps to reduce computation, control overfitting, and make the network more robust to small variations in input.
        • Local Response Normalization Layers (LRN): Implement a form of competition between neurons in different feature maps at the same spatial location. It normalizes the activation of a neuron based on the activity of its neighbors, similar to lateral inhibition in biological systems.
        • Classifier Layers (CLASS): Typically fully connected layers found at the end of a CNN or DNN. They take the features extracted by previous layers and map them to the final output categories (e.g., classifying an image as a "cat" or "dog"). These layers often have a large number of synaptic weights due to their full connectivity.
  • Synaptic Weights / Parameters: The adjustable values in a neural network that determine the strength of the connection between neurons. These are the core elements learned during training.
  • Memory Wall: A long-standing problem in computer architecture where the increasing speed of CPUs outpaces the improvement in memory access speed. This creates a bottleneck where the processor spends a significant amount of time waiting for data from memory, limiting overall performance. For accelerators, this means computation units might be idle waiting for data.
  • GPU (Graphics Processing Unit): A specialized electronic circuit designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device. Modern GPUs are highly parallel processors capable of performing many operations simultaneously, making them suitable for machine learning tasks that involve large-scale parallel computations.
  • On-chip vs. Off-chip Memory:
    • On-chip Memory: Memory integrated directly onto the same chip as the processing units. This typically offers much lower latency and higher bandwidth compared to off-chip memory (e.g., SRAM, eDRAM).
    • Off-chip Memory: Memory located on separate chips, connected to the processor chip via an external bus (e.g., DRAM, GDDR5). While offering larger capacity, it has significantly higher latency and lower effective bandwidth due to the physical distance and interface overheads.
  • eDRAM (embedded DRAM) vs. SRAM (Static RAM):
    • SRAM: A type of RAM that holds data in a static manner, meaning it does not need to be periodically refreshed. It is faster and consumes less power in standby than DRAM, but it is much less dense (takes up more area) and more expensive per bit. Often used for caches.
    • eDRAM: DRAM integrated directly onto the same chip as other logic circuits. It offers significantly higher density than SRAM (allowing more storage in less area) and better performance than off-chip DRAM, but it has higher latency than SRAM and requires periodic refreshing.
  • Fixed-point vs. Floating-point Arithmetic:
    • Floating-point: A method of representing real numbers that uses a fractional part and an exponent. It offers a wide dynamic range and high precision, commonly used in scientific computing and ML training.
    • Fixed-point: A method of representing real numbers where the position of the binary point (analogous to the decimal point) is fixed. It is simpler and faster to implement in hardware, consumes less power, and requires less memory bandwidth, but it has a more limited dynamic range and precision. Often used for ML inference when precision requirements are lower.
  • 2D Mesh Topology: A network topology where processors (or nodes) are arranged in a 2D grid. Each node is connected to its immediate neighbors (up, down, left, right). Data travels across multiple hops between non-adjacent nodes.
  • Place and Route: A crucial step in the physical design of integrated circuits (ICs). After logical design (synthesis), place and route software determines the physical location of all circuit components (placement) and then lays out all the connections (routing) between them on the silicon chip. This process significantly impacts area, performance, and power consumption.
  • 28nm Technology: Refers to the process technology node in semiconductor manufacturing. 28 nanometers is the typical minimum feature size (e.g., transistor gate length) that can be reliably manufactured. Smaller process nodes generally allow for denser circuits, higher performance, and lower power consumption.

3.2. Previous Works

The paper references several prior works, which can be broadly categorized into early NN accelerators, DNN/CNN specific accelerators, and large-scale biological emulation systems.

  • Early NN Accelerators (before Deep Learning surge):

    • Temam [47]: Proposed a neural network accelerator for multi-layer perceptrons. The key distinction noted by the authors is that this was not designed for deep learning neural networks, which gained prominence later.
    • Esmaeilzadeh et al. [16]: Introduced a hardware neural network called NPU for approximating any program function, rather than being specifically tailored for machine-learning applications. This highlights a general trend toward using NNs for hardware acceleration beyond just ML.
  • Emergence of Deep Learning and Dedicated Accelerators:

    • Chen et al. [5] (DianNao): This is a direct predecessor and the primary comparison point for DaDianNao. DianNao proposed an accelerator for Deep Learning (CNNs and DNNs) in a small form factor (3 mm² at 65 nm). Its architecture features buffers for caching input/output neurons and synapses, and a Neural Functional Unit (NFU), which is a pipelined processor for neuron evaluation (multiplication of synaptic values by input neurons, additions, and transfer function application).
      • Crucial Limitation (acknowledged by DianNao authors and addressed by DaDianNao): DianNao and similar single-chip accelerators were memory bandwidth-limited, especially for convolutional layers with private kernels and classifier layers, where the massive number of synapses (parameters) required frequent off-chip memory accesses. This significantly reduced their potential performance gains. The paper cites that DianNao could lose an order of magnitude in performance due to memory accesses.
    • Krizhevsky et al. [32]: Achieved state-of-the-art accuracy on the ImageNet database using a CNN with 60 million parameters. This work is a benchmark for the scale of modern DNNs and is used as a full NN benchmark in DaDianNao's evaluation.
    • 1-billion and 10-billion parameter neural networks [34], [8]: These represent extreme experiments in unsupervised learning requiring massive computational resources (thousands of CPUs or tens of GPUs), showcasing the trend towards increasingly large neural networks.
  • Large-Scale Custom Architectures (primarily for Biological Emulation):

    • Schemmel et al. [46]: Proposed a wafer-scale design capable of implementing thousands of neurons and millions of synapses.
    • Khan et al. [30] (SpiNNaker): A multi-chip supercomputer where each node contains multiple ARM9 cores linked by an asynchronous network, targeting a million-core machine for modeling a billion neurons.
    • IBM Cognitive Chip [39]: A functional chip designed to implement 256 neurons and 256K synapses in a small area.
    • Differentiation: The key distinction for these is that their primary goal is the emulation of biological neurons (often spiking neurons), not direct acceleration of machine-learning tasks based on CNNs and DNNs as defined in this paper. While they might show ML capabilities on simple tasks, their underlying neuron models are fundamentally different.
  • Other Parallel Architectures:

    • Majumdar et al. [37]: Investigated a parallel architecture for various machine-learning algorithms. Unlike DaDianNao, it uses off-chip banked memory and on-chip memory banks primarily for caching.
    • Anton [12]: A specialized supercomputer for molecular dynamics simulation, showcasing the broader trend of custom architectures for high-performance computing tasks.

3.3. Technological Evolution

The paper is situated at a critical juncture in computing and machine-learning history:

  1. Rise of Machine Learning as a HPC Driver: ML algorithms, especially CNNs and DNNs, were becoming pervasive across industries (speech, vision, search). This marked a shift from traditional scientific computing as the sole driver for high-performance computing, creating a new demand for specialized hardware.

  2. Hardware Specialization (Heterogeneous Computing): The end of Dennard Scaling (where power density remained constant as transistors shrunk) and the concept of Dark Silicon (where not all transistors on a chip can be powered simultaneously) pushed the computing community towards heterogeneous computing. This paradigm involves integrating specialized accelerators alongside general-purpose CPUs to achieve higher performance and energy efficiency.

  3. Deep Learning's Algorithmic Convergence: While ML is diverse, Deep Learning (particularly CNNs and DNNs) had emerged as the state-of-the-art for a broad range of applications. This convergence meant that a single category of algorithms could benefit from specialized hardware, making dedicated accelerator design economically viable and impactful.

    DaDianNao fits into this evolution by recognizing the unique opportunity presented by Deep Learning's characteristics and the need for specialized hardware. It aims to bridge the gap between algorithmic demand and hardware capabilities, especially concerning the memory wall problem that even early deep learning accelerators faced.

3.4. Differentiation Analysis

Compared to the main methods in related work, DaDianNao introduces several core innovations:

  • Addressing the Memory Wall for Large DNNs/CNNs: Previous NN accelerators (like DianNao) were primarily single-chip designs that struggled with the memory bandwidth requirements of larger DNNs. DaDianNao explicitly tackles this by moving to a multi-chip system where the entire synaptic weight footprint (up to tens of GBs) can be mapped to on-chip eDRAM distributed across nodes. This is a fundamental shift from relying on off-chip DRAM, which bottlenecks performance and energy.

  • Focus on On-Chip Storage Density: DaDianNao's node design is aggressively storage-biased, leveraging eDRAM's higher density over SRAM to accommodate large synaptic weight sets directly on the chip. This is a deliberate architectural choice to keep data close to computation.

  • Neuron-Centric Communication Paradigm: Unlike systems that might move weights or other data, DaDianNao explicitly states its principle of transferring only neuron values between nodes because they are orders of magnitude fewer than synapses for convolutional and classifier layers. This minimizes inter-chip communication overhead.

  • Scalability to Supercomputer-level: While DianNao was an accelerator for heterogeneous multi-cores, DaDianNao is designed as a machine-learning supercomputer – a system of interconnected chips to achieve performance significantly beyond single GPU or single-chip accelerator capabilities. The multi-chip mesh topology is a key enabler here.

  • Configurability for Both Inference and Training: Many early accelerators focused solely on inference (e.g., [18]). DaDianNao designs its NFU and pipeline to be reconfigurable for both inference and training (including backpropagation and pre-training with RBMs), offering broader utility. It also supports fixed-point arithmetic for both modes, with adaptable precision.

  • Specific CNN/DNN Focus vs. Biological Emulation: Unlike SpiNNaker, IBM Cognitive Chip, or Schemmel et al.'s work, which aimed at emulating biological spiking neurons, DaDianNao directly targets the specific computational patterns of CNNs and DNNs (multiplication-accumulation, transfer functions) which are the algorithms driving real-world ML applications.

    In essence, DaDianNao distinguishes itself by being a multi-chip, memory-centric, neuron-communication-optimized architecture tailored specifically to overcome the memory wall for large-scale, industry-relevant Deep Learning tasks.

4. Methodology

4.1. Principles

The core idea behind DaDianNao is to build a highly specialized, multi-chip machine-learning supercomputer that overcomes the memory wall bottleneck prevalent in GPUs and prior single-chip neural network accelerators. The theoretical basis and intuition are founded on several key observations specific to CNNs and DNNs:

  1. Large but Manageable Memory Footprint: While the total number of synaptic weights (parameters) in large DNNs can be enormous (tens of GBs), this size is still within the realm of what can be collectively stored on-chip across a reasonable number of specialized chips.

  2. Data Movement is the Bottleneck: For these algorithms, fetching synaptic weights from off-chip memory is the primary energy and performance limiter.

  3. Asymmetry in Data Volume: The number of synaptic weights is significantly higher than the number of neuron values (inputs/outputs of layers). Therefore, moving neuron values is far more efficient than moving synapses (a short sketch after the design-philosophy list below makes this concrete).

  4. Local Connectivity for Efficiency: Many CNN layers (e.g., convolutional, pooling) exhibit local connectivity, meaning neurons primarily interact with neighboring neurons. This suggests that distributing computation and synapses geographically can reduce communication for these layers.

  5. High Internal Bandwidth is Crucial: To keep specialized computational units busy, a very high internal bandwidth for data access is required.

    Based on these principles, DaDianNao adopts the following design philosophy:

  • Distributed On-Chip Storage: Utilize the collective on-chip eDRAM across a multi-chip system to store the entire neural network model, eliminating off-chip memory accesses.
  • Synapses Close to Computation: Place synaptic weights physically adjacent to the Neural Functional Units (NFUs) that will use them, minimizing data travel time and energy.
  • Neuron-Centric Communication: Design the inter-node communication mechanism to transfer neuron values rather than synapses.
  • Tiled Architecture for Internal Bandwidth: Break down the node (chip) into multiple tiles, each with its own NFU and eDRAM, connected by a high-bandwidth internal fat tree network.
  • Configurable Processing: Make the NFU configurable to efficiently execute different CNN/DNN layer types and support both inference and training modes with appropriate fixed-point precision.
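
To make the data-volume asymmetry above concrete, here is a minimal Python sketch (an illustration added for this analysis, not code from the paper) that counts synapses and neurons using the formulas quoted in Section 4.2.1.1; the fully connected case matches CLASS2's 32 MB weight set, while the private-kernel convolution dimensions are purely illustrative.

```python
def classifier_counts(n_in, n_out):
    """Fully connected (classifier) layer: synapses scale as O(N^2), neurons as O(N)."""
    return n_in * n_out, n_in + n_out

def private_conv_counts(nx, ny, kx, ky, n_if, n_of):
    """Convolutional layer with private (position-specific) kernels."""
    synapses = kx * ky * n_if * n_of * nx * ny
    neurons = n_if * nx * ny
    return synapses, neurons

examples = {
    "classifier 4096 x 4096 (CLASS2-sized)": classifier_counts(4096, 4096),
    "private-kernel conv (illustrative dims)": private_conv_counts(100, 100, 9, 9, 16, 16),
}
for name, (syn, neu) in examples.items():
    # 16-bit (2-byte) synapses, as assumed for inference in the paper
    print(f"{name}: {syn:,} synapses ({syn * 2 / 2**20:.1f} MB), "
          f"{neu:,} neurons, ratio ~{syn // neu}x")
```

Even in these small examples synapses outnumber neurons by roughly three orders of magnitude, which is exactly what makes neuron-centric communication attractive.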

4.2. Core Methodology In-depth (Layer by Layer)

The DaDianNao architecture is a multi-chip system where each chip is an identical node, arranged in a 2D mesh topology. Each node contains significant on-chip storage (primarily eDRAM) for synapses, Neural Functional Units (NFUs) for computation, and a router fabric for inter-node communication.

4.2.1. Node Architecture (V.B.)

The node is the fundamental building block of DaDianNao.

4.2.1.1. Synapses Close to Neurons (V.B.1.)

A central design characteristic is to store synapses (weights) as close as possible to the neurons that use them, and to make this storage massive.

  • Motivation for Neuron-Centric Data Movement:

    • Inference and Training Support: The architecture is designed for both inference (forward pass) and training (forward and backward passes). In training, neurons are forward-propagated and then backward-propagated. Depending on how data (neurons and synapses) are allocated, neurons of the previous or next layer act as inputs and need to be moved.
    • Relative Data Volumes: There are significantly more synapses than neurons. For classifier layers, synapses are O(N²) versus O(N) neurons. For convolutional layers with private kernels, synapses number Kx × Ky × Nif × Nof × Nx × Ny versus Nif × Nx × Ny neurons. This makes it much more efficient to move neuron outputs than synapses.
    • Low-Energy/Low-Latency Access: Keeping synapses close to computational operators provides low-energy and low-latency data transfers, enabling high internal bandwidth.
  • Choice of eDRAM for Storage:

    • SRAM is dense enough for caching but not for the large-scale storage (up to 1GB for single layers, tens of MB common) required for all synapses.
    • eDRAM offers higher storage density. For example, a 10MB SRAM at 28nm occupies 20.73 mm², while 10MB of eDRAM occupies 7.27 mm² (a 2.85x higher density).
    • Using eDRAM on-chip for the entire neural network eliminates off-chip DRAM accesses, which are extremely costly in terms of energy. A 256-bit read access to eDRAM at 28nm consumes 0.0192 nJ, while the same access to a Micron DDR3 DRAM consumes 6.18 nJ, a 321x energy ratio (a back-of-envelope sketch at the end of this subsection illustrates the gap). This large difference is due to the memory controller, DDR3 physical interface, on-chip bus access, and page activation required for off-chip DRAM.
  • Scaling NFU Capacity and Addressing eDRAM Challenges:

    • By removing the memory bandwidth bottleneck, the NFU size can be scaled up to process more output neurons (No) and more inputs per output neuron (Ni) simultaneously, improving throughput. For instance, to achieve 16x the operations of DianNao, an NFU would need Ni = 64 and No = 64, which requires fetching 64 × 64 × 16 = 65536 bits from eDRAM every cycle.
    • eDRAM has drawbacks: higher latency than SRAM, destructive reads, and periodic refresh. To compensate and sustain NFU operation every cycle, the eDRAM is split into four banks, and synapse rows are interleaved among them.
  • Initial Monolithic NFU Design Flaw:

    • An early design attempt for a single large NFU with a 65536-bit interface to eDRAM at 28nm led to a floorplan (Figure 4) dominated by wire congestion. The NFU itself was tiny (0.78 mm²), but the 65536 wires connecting it to eDRAM required a width of 3.2768 mm (given 0.2 µm spacing and 4 metal layers). As a result, the wires consumed 42.18 mm², nearly the combined area of all other components. This highlights the practical challenge of providing high internal bandwidth with conventional wiring.

      Figure 4: Simplified floorplan with a single central NFU, showing wire congestion.

    The image above is Figure 4 from the original paper, showing a simplified floorplan with a single central NFU and the resulting wire congestion.
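
As referenced above, the per-access energies quoted by the paper (0.0192 nJ per 256-bit eDRAM read vs. 6.18 nJ per 256-bit DDR3 access) already imply the system-level gap. The sketch below is a rough back-of-envelope estimate added for this analysis; it assumes the weight set is streamed exactly once and ignores refresh, controller idle power, and row-buffer effects.

```python
EDRAM_NJ = 0.0192   # energy per 256-bit eDRAM read at 28 nm (value quoted in the paper)
DDR3_NJ  = 6.18     # energy per 256-bit Micron DDR3 access (value quoted in the paper)

def fetch_energy_mj(total_bytes, nj_per_access, bits_per_access=256):
    """Energy to stream a weight set once, counting only per-access read energy."""
    accesses = total_bytes * 8 / bits_per_access
    return accesses * nj_per_access * 1e-6   # nJ -> mJ

weights = 32 * 2**20   # e.g., the 32 MB CLASS2 synapse set, read once
e_edram, e_ddr3 = (fetch_energy_mj(weights, c) for c in (EDRAM_NJ, DDR3_NJ))
print(f"eDRAM {e_edram:.2f} mJ vs DDR3 {e_ddr3:.2f} mJ (~{e_ddr3 / e_edram:.0f}x)")
```

The resulting ratio (~322x) is consistent with the 321x per-access energy ratio reported in the paper.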

4.2.1.2. High Internal Bandwidth (V.B.2.)

To address the wire congestion and achieve high internal bandwidth, a tile-based design is adopted, as shown in Figure 5.

  • Tile-based Architecture:

    • The output neurons are spread across different tiles. Each tile contains a smaller NFU capable of simultaneously processing 16 input neurons and 16 output neurons (256 parallel operations).

    • This reduces the synapse data required from eDRAM per tile to 16 × 16 × 16 = 4096 bits per cycle (see the short arithmetic sketch after the NFU description below).

    • Each tile retains the 4-bank eDRAM organization (each bank 4096-bit wide) to compensate for eDRAM weaknesses.

    • A tile (with its NFU, four eDRAM banks, and input/output interfaces) has an area of 1.89 mm². The 16 tiles together account for 30.16 mm², a 28.5% area reduction compared to the monolithic NFU design, because the routing network now consumes only 8.97% of the overall area.

      Figure 5: Tile-based organization of a node (left) and tile architecture (right). A node contains 16 tiles, two central eDRAM banks, and a fat tree interconnect; a tile has an NFU, four eDRAM banks, and input/output interfaces to/from the central eDRAM banks.

    The image above is Figure 5 from the original paper, illustrating the tile-based organization of a node (left) and the tile architecture (right). A node contains 16 tiles, two central eDRAM banks, and a fat tree interconnect; a tile has an NFU, four eDRAM banks, and input/output interfaces to/from the central eDRAM banks.

  • Internal Communication (Fat Tree):

    • All tiles within a node are connected via a fat tree network.
    • The fat tree is used to:
      • Broadcast input neuron values to each tile.
      • Collect output neuron values from each tile.
    • At the center of the chip, there are two special eDRAM banks: one for input neurons and one for output neurons.
  • Neuron Processing Flow within a Node:

    • Since the total number of hardware output neurons in all NFUs can still be smaller than the actual number of neurons in large layers, multiple different output neurons are computed on the same hardware neuron for each set of input neurons broadcasted.
    • Intermediate values of these neurons are saved locally in the tile eDRAM.
    • Once the computation of an output neuron is complete (all input neurons factored in), its value is sent via the fat tree to the central output neuron eDRAM bank at the chip's center.
  • Neural Functional Unit (NFU) Details:

    • The NFU (Figure 6) contains several parallel operators.

    • Multipliers: Perform the multiplication of synaptic values by input neuron values.

    • Adders: Aggregate the products, forming adder trees.

    • Max: Used in pooling layers (e.g., max pooling).

    • Transfer Function: Applies non-linear activation functions through linear interpolation.

      Figure 6: The different (parallel) operators of an NFU: multipliers, adders, max, transfer function.

    The image above is Figure 6 from the original paper, detailing the different (parallel) operators of an NFU: multipliers, adders, max, and transfer function.
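
As flagged earlier, the per-tile bandwidth numbers follow from simple arithmetic. The sketch below (an illustration using only figures stated in the paper) shows how the 4096-bit per-cycle requirement per tile and the node-level parallelism fall out of the 16-input, 16-output NFU organization.

```python
INPUTS_PER_TILE  = 16    # input neurons consumed per cycle by one tile's NFU
OUTPUTS_PER_TILE = 16    # output neurons produced in parallel per tile
BITS_PER_SYNAPSE = 16
TILES_PER_NODE   = 16

macs_per_tile   = INPUTS_PER_TILE * OUTPUTS_PER_TILE    # 256 parallel operations per tile
tile_bits_cycle = macs_per_tile * BITS_PER_SYNAPSE      # 4096 synapse bits per cycle per tile
node_bits_cycle = tile_bits_cycle * TILES_PER_NODE      # 65536 bits, now spread over 16 tiles
node_macs_cycle = macs_per_tile * TILES_PER_NODE        # 4096 multiply-accumulates per cycle

print(tile_bits_cycle, "bits/cycle per tile (one 4096-bit eDRAM bank row)")
print(node_bits_cycle, "bits/cycle per node (same total as the monolithic design, without the long wires)")
print(node_macs_cycle, "MACs per cycle per node")
```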

4.2.1.3. Configurability (Layers, Inference vs. Training) (V.B.3.)

The tile and its NFU pipeline are designed to be highly adaptable to different CNN/DNN layer types and inference or training modes.

  • NFU Hardware Blocks: The NFU is decomposed into:

    • Adder Block: Configurable as a 256-input, 16-output adder tree or 256 parallel adders.
    • Multiplier Block: 256 parallel multipliers.
    • Max Block: 16 parallel max operations.
    • Transfer Block: Two independent sub-blocks, each performing 16 piecewise linear interpolations. The interpolation coefficients (a, b, for y = a × x + b) are stored in two 16-entry SRAMs and can be loaded to implement any transfer function and its derivative (a small sketch at the end of this subsection illustrates the scheme).
  • Pipeline Configurations (Figure 7): The different hardware blocks can be configured into various pipelines to support different layer types (CONV, LRN, POOL, CLASS) and phases (forward for inference, backward for training).

    Figure 7: Different pipeline configurations for CONV, LRN, POOL and CLASS layers.

    The image above is Figure 7 from the original paper, showing the different pipeline configurations for CONV, LRN, POOL, and CLASS layers. These configurations vary for forward propagation (FP) and backpropagation (BP). For instance, a CLASS layer in FP uses Multipliers, Adders, and Transfer functions, while in BP it also involves Weight Update and Gradient Computation stages.

  • Bit-Width Aggregation for Training:

    • The hardware blocks are designed to allow aggregation of 16-bit operators (adders, multipliers, max) into 32-bit operators. For example, two 16-bit adders can form one 32-bit adder, or four 16-bit multipliers can form one 32-bit multiplier. The overhead for this is very low.
    • 16-bit operators are generally sufficient for inference. However, for training, higher precision is often necessary to maintain accuracy and ensure convergence.
    • Impact of Fixed-Point Arithmetic (Table II): The paper analyzes the effect of fixed-point computation on the error rate:
      • Floating-Point for both inference and training results in 0.82% error.

      • 16-bit Fixed-Point for inference and Floating-Point for training results in 0.83% error (negligible impact).

      • 16-bit Fixed-Point for both inference and training leads to no convergence.

      • 16-bit Fixed-Point for inference and 32-bit Fixed-Point for training results in 0.91% error (small impact).

      • Default for training mode is 32-bit operators.

        The following are the results from Table II of the original paper:

        Inference             | Training              | Error
        Floating-Point        | Floating-Point        | 0.82%
        Fixed-Point (16 bits) | Floating-Point        | 0.83%
        Fixed-Point (32 bits) | Floating-Point        | 0.83%
        Fixed-Point (16 bits) | Fixed-Point (16 bits) | (no convergence)
        Fixed-Point (16 bits) | Fixed-Point (32 bits) | 0.91%
  • Tile Data Movement Configurations:

    • Input neurons for a classifier layer can come from the node's central eDRAM (potentially after transfer from another node).
    • They can also come from two 16KB SRAM buffers used for input and output neuron values, or for temporary values (like neuron partial sums to enable reuse).
    • In the backward phase (training), the NFU must write to the tile eDRAM after weight updates.
    • During gradient computations, input and output gradients use the same data paths as input and output neurons in the forward phase.
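
As noted in the Transfer Block description above, the NFU evaluates activations as y = a·x + b using 16 coefficient pairs held in small SRAMs. The sketch below shows one plausible way to derive and apply such coefficients for a sigmoid; the segment range [-8, 8], the uniform segmentation, and the choice of sigmoid are illustrative assumptions, since the paper only states that coefficients can be loaded for any transfer function or its derivative.

```python
import math

def build_segments(f, lo=-8.0, hi=8.0, n=16):
    """Fit n piecewise-linear segments (a, b) to f over [lo, hi] from segment endpoints."""
    step = (hi - lo) / n
    segs = []
    for i in range(n):
        x0, x1 = lo + i * step, lo + (i + 1) * step
        a = (f(x1) - f(x0)) / (x1 - x0)
        b = f(x0) - a * x0
        segs.append((a, b))
    return segs, lo, step

def transfer(x, segs, lo, step):
    """Hardware-style evaluation: index into the coefficient table, then y = a*x + b."""
    i = min(max(int((x - lo) / step), 0), len(segs) - 1)
    a, b = segs[i]
    return a * x + b

sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
segs, lo, step = build_segments(sigmoid)
for x in (-4.0, -1.0, 0.5, 3.0):
    print(f"x={x:+.1f}  approx={transfer(x, segs, lo, step):.4f}  exact={sigmoid(x):.4f}")
```

With only 16 segments the approximation error stays small for smooth activation functions, which is why a tiny coefficient SRAM suffices.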

4.2.2. Interconnect (V.C.)

Inter-chip (inter-node) communication is crucial for a multi-chip system.

  • Communication Volume: Since only neuron values are transferred and these are heavily reused within each node, inter-node communication is significant but generally not a bottleneck, except for a few layers and very large systems.
  • Off-the-shelf Interconnect: The design does not rely on custom high-speed interconnects. Instead, it uses a commercially available HyperTransport (HT) 2.0 IP block.
  • Physical Layer: The HT2.0 PHY (physical layer interface) used at 28nm is a long, thin strip (5.635 mm × 0.5575 mm), typically placed at the die periphery.
  • Topology: A simple 2D mesh topology connects nodes. While a 3D mesh could be more efficient, it's left for future work.
  • Link Details: Each chip connects to four neighbors via four HT2.0 IP blocks. Each block has 16x HT links (16 pairs of differential outgoing and incoming signals) operating at 1.6GHz.
    • The HT is connected to the central eDRAM via a 128-bit, 4-entry, asynchronous FIFO.
    • Each HT block provides 6.4 GB/s of bandwidth in each direction (the per-link arithmetic is sketched after this list).
    • HT2.0 latency between two neighbor nodes is about 80ns.
  • Router: Next to the central block of the tile, a router is implemented (Figure 5).
    • It uses wormhole routing.
    • It has five input/output ports (four directions plus an injection/ejection port).
    • Each input port contains 8 virtual channels (with 5 flit slots per VC).
    • A 5x5 crossbar connects all input/output ports.
    • The router pipeline has four stages: routing computation (RC), VC allocation (VA), switch allocation (SA), and switch traversal (ST).
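
As noted above, each HT2.0 block's 6.4 GB/s per direction follows from the link width and clock, assuming HyperTransport's double-data-rate signaling (data transferred on both clock edges); the arithmetic check below is an illustration added for this analysis, not a statement from the paper.

```python
LINK_WIDTH_BITS     = 16      # a 16x HT link: 16 differential pairs per direction
LINK_CLOCK_HZ       = 1.6e9   # HT2.0 clock quoted in the paper
TRANSFERS_PER_CLOCK = 2       # HyperTransport transfers data on both clock edges (DDR)

gbytes_per_s = LINK_WIDTH_BITS * LINK_CLOCK_HZ * TRANSFERS_PER_CLOCK / 8 / 1e9
print(gbytes_per_s, "GB/s per direction per HT block")   # -> 6.4
```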

4.2.3. Overall Characteristics (V.D.)

The architecture's characteristics are summarized in Table III.

  • Frequency: NFU clocked at 606MHz, matching the eDRAM frequency in the 28nm technology. This is a conservative choice, as DianNao's NFU operated at 0.98GHz at 65nm. Faster NFU and asynchronous communications are future work.
  • Node Capacity:
    • 16 tiles per node.
    • Each tile eDRAM bank: 1024 rows of 4096 bits.
    • Total eDRAM per tile: 4 × 1024 × 4096 bits = 2 MB.
    • Central eDRAM per node: 4MB.
    • Total node eDRAM capacity: (16 × 2 MB) + 4 MB = 36 MB.
  • Peak Performance:
    • 16-bit operation: 16 × (288 multipliers + 288 adders) × 606 MHz = 5.58 TeraOps/s. (The 288 operators of each kind per tile correspond to the 256+32 multipliers and 256+32 adders listed in Table III.)

    • 32-bit operation: 16 × (144 multipliers + 72 adders) × 606 MHz = 2.09 TeraOps/s. This lower figure results from operator aggregation (combining several 16-bit operators into one 32-bit operator reduces the number of parallel 32-bit operations). The capacity and peak-throughput arithmetic is sketched in code after Table III below.

      The following are the results from Table III of the original paper:

      Parameter                    | Setting | Parameter             | Setting
      Frequency                    | 606 MHz | Tile eDRAM latency    | ~3 cycles
      # of tiles                   | 16      | Central eDRAM size    | 4 MB
      # of 16-bit multipliers/tile | 256+32  | Central eDRAM latency | ~10 cycles
      # of 16-bit adders/tile      | 256+32  | Link bandwidth        | 6.4 GB/s × 4 links
      Tile eDRAM size/tile         | 2 MB    | Link latency          | 80 ns
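
As mentioned above, the capacity and peak-throughput figures in Table III follow from straightforward arithmetic; the sketch below reproduces it, using only values stated in the paper, and is an illustration added for this analysis.

```python
ROWS, ROW_BITS, BANKS, TILES = 1024, 4096, 4, 16
FREQ_HZ = 606e6
OPS_16BIT_PER_TILE = 288 + 288   # multipliers + adders (256 main + 32 auxiliary of each)
OPS_32BIT_PER_TILE = 144 + 72    # after aggregating 16-bit operators into 32-bit ones

tile_edram_mb = BANKS * ROWS * ROW_BITS / 8 / 2**20          # 2.0 MB per tile
node_edram_mb = TILES * tile_edram_mb + 4                    # + 4 MB central bank = 36 MB
peak16_tops   = TILES * OPS_16BIT_PER_TILE * FREQ_HZ / 1e12  # ~5.58 TeraOps/s
peak32_tops   = TILES * OPS_32BIT_PER_TILE * FREQ_HZ / 1e12  # ~2.09 TeraOps/s

print(tile_edram_mb, node_edram_mb, round(peak16_tops, 2), round(peak32_tops, 2))
```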

4.2.4. Programming, Code Generation and Multi-Node Mapping (V.E.)

4.2.4.1. Programming, Control and Code Generation (V.E.1.)

  • System ASIC View: The architecture is treated as a system ASIC, implying low programming complexity. The system is primarily configured, and input data is fed to it.
  • Code Generator: A code generator produces a sequence of node instructions (one sequence per node). This configures the neural network.
  • Input Data Partitioning: The initial input layer values are partitioned across nodes and stored in a central eDRAM bank.
  • Example: CLASS2 Inference (Table V): (the resulting partitioning is also sketched in code after Table V below)
    • This example shows node instructions for a CLASS2 layer (Ni = 4096, No = 4096) on a 4-node system.
    • Output neurons are partitioned into 256-bit data blocks (each block contains 256/16 = 16 neurons).
    • Each node is allocated 4096 / 16 / 4 = 64 output data blocks.
    • Each node stores a quarter of all input neurons (4096 / 4 = 1024).
    • Each tile is allocated 64 / 16 = 4 output data blocks, resulting in 4 instructions per node.
    • An instruction loads 128 input data blocks from the central eDRAM to the tiles.
    • The first three instructions: all tiles receive the same input neurons, read synaptic weights from their local (tile) eDRAM, and write partial sums of output neurons to their local NBout SRAM.
    • The last instruction: NFU in each tile finalizes sums, applies the transfer function, and stores output values back to the central eDRAM.
  • Control Flow: Node instructions drive the control of each tile. The node control circuit generates tile instructions and sends them to each tile.
  • Instruction Granularity: The node or tile instruction performs the same layer computations (e.g., multiply-add-transfer for classifier layers) on a set of contiguous input data. This data is characterized by a start address, step, and number of iterations.
  • Operating Modes:
    • Processing one row at a time: Standard operation.

    • Batch learning [48]: Multiple rows (multiple instances of the same layer for different input data) are processed simultaneously. This improves synapse reuse and is common for stable gradient descent, though it can lead to slower convergence and requires larger memory capacity.

      The node instruction format is specified in Table IV of the original paper; the bit-level field layout is not reproduced here.

Table V of the original paper gives an example of the generated classifier code (Ni = 4096, No = 4096, 4 nodes); the four per-node instructions follow the partitioning described above.
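
As referenced in the CLASS2 example above, the partitioning arithmetic is mechanical. The sketch below (an illustration added for this analysis, not the paper's code generator) reproduces how the output data blocks and input neurons are divided among nodes and tiles.

```python
NI, NO          = 4096, 4096   # CLASS2 layer dimensions
NODES, TILES    = 4, 16
NEURONS_PER_BLK = 256 // 16    # 16 x 16-bit neurons per 256-bit data block

out_blocks_total    = NO // NEURONS_PER_BLK          # 256 output data blocks in the layer
out_blocks_per_node = out_blocks_total // NODES      # 64 blocks per node
out_blocks_per_tile = out_blocks_per_node // TILES   # 4 blocks -> 4 instructions per node
in_neurons_per_node = NI // NODES                    # 1024 input neurons stored per node

print(out_blocks_per_node, out_blocks_per_tile, in_neurons_per_node)
```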

4.2.4.2. Multi-Node Mapping (V.E.2.)

  • Layer Chaining: The output neurons from one layer become the input neurons for the next. At the start of a layer, input neurons are distributed across all nodes as 3D rectangles representing feature maps.

  • Input Distribution within Node: These input neurons are first distributed to all node tiles via the internal fat tree network.

  • Inter-Node Distribution: Simultaneously, the node control sends blocks of input neurons to other nodes through the mesh interconnect.

  • Communication Patterns for Different Layer Types (Figure 8):

    • Convolutional and Pooling Layers: Characterized by local connectivity (small kernel/window). This results in very low inter-node communication, mostly occurring at the border of the layer rectangle mapped to each node. Most communications are intra-node.
    • Local Response Normalization (LRN) Layers: Since all feature maps at a given location are mapped to the same node, there is no inter-node communication.
    • Classifier Layers: Can have high inter-node communication because each output neuron typically uses all input neurons, which may reside on different nodes. The communication pattern is a simple broadcast.
      • To manage this, a computing-and-forwarding communication scheme [24] is used, which arranges node communications in a regular ring pattern. A node can start processing newly arrived input neuron blocks as soon as it has finished its own computations and sent its previous block, so the decision is made locally without a global synchronization mechanism (a toy behavioral sketch of this ring pattern appears after Figure 8 below).

        Figure 8: Mapping of (left) a convolutional (or pooling) layer with 4 feature maps; the red section indicates the input neurons used by node O; (right) a classifier layer.

    The image above is Figure 8 from the original paper, showing the mapping of (left) a convolutional (or pooling) layer with 4 feature maps where the red section indicates the input neurons used by node O; and (right) a classifier layer which implies full connectivity.
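
As mentioned above for classifier layers, the computing-and-forwarding scheme arranges nodes in a ring. The toy sketch below is a purely behavioral illustration of that idea (node count, block names, and the bookkeeping are all made up); it only shows that after as many steps as there are nodes, every node has consumed every input block without any global synchronization.

```python
NODES = 4   # toy ring of nodes; block names are placeholders, not real data

# Each node initially holds its own quarter of the layer's input neurons.
holding  = {n: f"in_block_{n}" for n in range(NODES)}
consumed = {n: [] for n in range(NODES)}   # input blocks each node has already used

for _ in range(NODES):
    # 1) local compute: every node combines the block it currently holds
    #    with its locally stored synapses (here we only record which block).
    for n in range(NODES):
        consumed[n].append(holding[n])
    # 2) forward: every node sends its block to the next node on the ring,
    #    so the "go ahead" decision is purely local.
    holding = {(n + 1) % NODES: holding[n] for n in range(NODES)}

print(consumed)   # after NODES steps, every node has seen every input block
```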

5. Experimental Setup

5.1. Datasets

The experiments primarily use benchmarks derived from various large and state-of-the-art CNN and DNN models, rather than raw datasets like ImageNet. The paper benchmarks specific layers and one full network.

  • Layer Benchmarks: A sample of 10 of the largest known layers of each type (CONV, POOL, LRN, CLASS, and CONV* for convolutional layers with private kernels) was used. These layers are derived from applications like object recognition, speech recognition, natural image processing, street scene parsing, and face detection in YouTube videos.

    • Example Data Sample: While not explicitly providing raw images or speech, the paper implicitly refers to tasks like identifying characteristic elements of an image, or objects, or speech components. The inputs to these layers are neuron activities which could represent pixels, feature maps, or abstract representations.

    • Rationale: These layers were chosen because they represent the most computationally and memory-intensive components of modern DNNs and CNNs, making them ideal for evaluating the performance and energy efficiency of specialized hardware.

      The following are the results from Table I of the original paper:

      Layer  | Nx  | Ny  | Kx | Ky | Ni (or Nif) | No (or Nof) | Synapses | Description
      CLASS1 | -   | -   | -  | -  | 2560        | 2560        | 12.5MB   | Object recognition and speech recognition tasks (DNN) [11]
      CLASS2 | -   | -   | -  | -  | 4096        | 4096        | 32MB     | Multi-object recognition in natural images (DNN), winner of the 2012 ImageNet competition [32]
      CONV1  | 256 | 256 | 11 | 11 | 256         | 384         | 22.69MB  | (same as CLASS2)
      POOL2  | 256 | 256 | 2  | 2  | 256         | 256         | -        | (same as CLASS2)
      LRN1   | 55  | 55  | -  | -  | 96          | 96          | -        | (same as CLASS2)
      LRN2   | 27  | 27  | -  | -  | 256         | 256         | -        | (same as CLASS2)
      CONV2  | 500 | 375 | 9  | 9  | 32          | 48          | 0.24MB   | Street scene parsing (CNN), e.g., identifying buildings, vehicles, etc. [18]
      POOL1  | 492 | 367 | 2  | 2  | 12          | 12          | -        | (same as CONV2)
      CONV3* | 200 | 200 | 18 | 18 | 8           | 8           | 1.29GB   | Face detection in YouTube videos (DNN), Google [34]
      CONV4* | 200 | 200 | 20 | 20 | 3           | 18          | 1.32GB   | YouTube video object recognition, largest NN to date [8]
  • Full Neural Network Benchmark: The CNN that won the ImageNet 2012 competition [32] (Krizhevsky et al.) was used. This network consists of 12 layers:

    • CONV (224,224,11,11,3,96)
    • LRN (55,55,-,-,96,96)
    • POOL (55,55,3,3,96,96)
    • CONV (27,27,5,5,96,256)
    • LRN (27,27,-,-,256,256)
    • POOL (27,27,3,3,256,256)
    • CONV (13,13,3,3,256,384)
    • CONV (13,13,3,3,384,384)
    • CONV (13,13,3,3,384,256)
    • CLASS (-,9216,4096)
    • CLASS (-,4096,4096)
    • CLASS (-,4096,1000)
    • Parameters: For all convolutional layers, the strides (sx, sy) are 1, except for the first CONV layer (sx = sy = 4). For all pooling layers, the strides equal the kernel dimensions (sx = Kx, sy = Ky). For LRN layers, k = 5.
    • Rationale: This full network represents a complete, real-world, state-of-the-art CNN pipeline, allowing for a comprehensive evaluation of the proposed architecture's performance and scalability across diverse layer types.
  • Pre-training Benchmarks: Restricted Boltzmann Machines (RBM) [45], a popular pre-training method for initializing synaptic weights, were applied to the CLASS1 and CLASS2 layers, resulting in RBM1 (2560 × 2560) and RBM2 (4096 × 4096) benchmarks.

    • Rationale: This evaluates the architecture's efficiency not just for inference but also for the training process, which can be very time-consuming and often involves different computational demands.

5.2. Evaluation Metrics

The paper evaluates the DaDianNao architecture primarily using Speedup and Energy Reduction relative to a GPU baseline. Additionally, Area and Power Consumption of the node are reported.

  1. Speedup ($S$)

    • Conceptual Definition: Speedup quantifies how much faster a task can be executed on one system compared to another. It is a ratio that indicates performance improvement.
    • Mathematical Formula: $ S = \frac{T_{baseline}}{T_{proposed}} $
    • Symbol Explanation:
      • $S$: The speedup achieved by the proposed system over the baseline system.
      • $T_{baseline}$: The execution time of the task on the baseline system (e.g., GPU).
      • $T_{proposed}$: The execution time of the same task on the proposed system (DaDianNao).
  2. Energy Reduction (ER)

    • Conceptual Definition: Energy reduction measures how much less energy is consumed by one system compared to another for the same task. It reflects improved energy efficiency.
    • Mathematical Formula: $ ER = \frac{E_{baseline}}{E_{proposed}} $
    • Symbol Explanation:
      • ER: The energy reduction achieved by the proposed system over the baseline system.
      • $E_{baseline}$: The total energy consumed by the baseline system (e.g., GPU) to complete the task. Energy is typically calculated as power × time.
      • $E_{proposed}$: The total energy consumed by the proposed system (DaDianNao) to complete the same task.
  3. Area ($A$)

    • Conceptual Definition: Area refers to the physical space occupied by the circuit on a silicon chip, typically measured in mm². It is a critical metric for manufacturing cost and packaging.
    • Mathematical Formula: (Not given as a formula; reported directly) $A_{component}$ (mm²)
    • Symbol Explanation:
      • $A_{component}$: The area occupied by a specific component or the whole chip.
  4. Power Consumption ($P$)

    • Conceptual Definition: Power consumption is the rate at which energy is used by the circuit, measured in Watts (W). Lower power consumption is crucial for energy efficiency and thermal management.
    • Mathematical Formula: (Not given as a formula; reported directly) $P_{component}$ (W)
    • Symbol Explanation:
      • $P_{component}$: The power consumed by a specific component or the whole chip.
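
Both ratios reduce to a few lines of arithmetic; the helper below restates them, with purely illustrative numbers (225 W is the K20M TDP quoted later in this analysis, the rest are made up), not measurements from the paper.

```python
def speedup(t_baseline_s, t_proposed_s):
    """S = T_baseline / T_proposed."""
    return t_baseline_s / t_proposed_s

def energy_reduction(p_baseline_w, t_baseline_s, p_proposed_w, t_proposed_s):
    """ER = E_baseline / E_proposed, with energy = average power x execution time."""
    return (p_baseline_w * t_baseline_s) / (p_proposed_w * t_proposed_s)

print(speedup(1.0, 0.02))                        # 50x faster (illustrative)
print(energy_reduction(225.0, 1.0, 16.0, 0.02))  # ~703x less energy (illustrative)
```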

5.3. Baselines

The paper primarily compares DaDianNao against two main baselines: a modern GPU and, for earlier comparisons, a CPU with SIMD capabilities. It also references DianNao as a comparison for the accelerator approach.

  1. GPU Baseline:

    • Hardware: NVIDIA K20M GPU.
    • Specifications: 5GB GDDR5 memory, 208 GB/s memory bandwidth, 3.52 TFlops peak performance, manufactured using 28nm technology.
    • Implementation: CUDA versions of the neural network layers were implemented using CUDA SDK 5.5, derived from the tuned open-source CUDA Convnet [31].
    • Rationale: GPUs were the most favored and state-of-the-art approach for CNNs and DNNs at the time of publication due to their parallel processing capabilities. The NVIDIA K20M was a high-end GPU representing the best performance achievable with general-purpose parallel hardware. Its power usage was also monitored.
  2. CPU (SIMD) Baseline:

    • Hardware: Intel Xeon E5-4620 Sandy Bridge-EP processor.
    • Specifications: 2.2GHz clock speed, 1TB memory, 256-bit SIMD (Single Instruction, Multiple Data) capabilities.
    • Implementation: A C++ version of the neural network layers was implemented.
    • Rationale: This serves as a baseline to understand the performance gain of GPU over traditional CPUs. The SIMD version was confirmed to be 4.07x faster than a non-SIMD version, indicating that CPU performance was optimized. The GPU achieved an average 58.82x speedup over this SIMD CPU, aligning with state-of-the-art GPU acceleration results.
  3. DianNao Accelerator (Conceptual Baseline):

    • Hardware: DianNao accelerator [5].
    • Specifications: 3 mm² at 65nm, 0.98 GHz.
    • Implementation: A cycle-level bit-level version of DianNao was re-implemented using memory latency parameters from its original article.
    • Rationale: This comparison highlights the potential efficiency of custom architectures (47.91% of GPU performance in 0.53% of GPU area) but also underscores the memory bandwidth limitation that DaDianNao aims to overcome.

5.4. Measurements

The methodology for obtaining results is rigorous, combining industry-standard CAD tools for physical design with detailed simulations.

  • CAD Tools for Physical Design:

    • Verilog Implementation: The node was first implemented in Verilog.
    • Synthesis and Layout: Standard CAD tools were used for synthesis and layout (physical design) of the Verilog code.
    • Technology Node: ST 28nm Low Power (LP) technology (operating at 0.9V).
    • Tools Used: Synopsys Design Compiler for synthesis, ICC Compiler for layout, and Synopsys PrimeTime PX for power consumption estimation. This provides highly accurate area, energy, and critical path delay measurements post-layout.
  • Simulation for Time, eDRAM, and Inter-Node Measurements:

    • RTL Simulation: VCS was used to simulate the Register-Transfer Level (RTL) design of the node.
    • eDRAM Model: A custom eDRAM model was used, incorporating realistic characteristics like destructive reads and periodic refresh. The eDRAM was modeled as banked and running at 606MHz. eDRAM energy was collected using CACTI5.3 [1] after integrating 1T1C cell characteristics specific to 28nm technology [25].
    • Inter-Node Communication Simulation: Booksim2.0 [10], a cycle-level interconnection network simulator, was used to model inter-node communications. Orion2.0 [29] provided the network energy model.
  • GPU Measurements:

    • Power Usage: The NVIDIA K20M GPU provided its own power usage reporting capabilities.
    • Code Compilation: CUDA SDK 5.5 was used to compile the CUDA versions of the neural network codes.

6. Results & Analysis

6.1. Core Results Analysis

6.1.1. Main Characteristics of the Node Layout (VII.A.)

The physical implementation of a single DaDianNao node at 28nm provides concrete insights into its area and power consumption.

  • Chip Layout (Figure 9): The image shows a snapshot of the node layout, illustrating the distribution of various components.

    Figure 9: Snapshot of the node layout.

    The image above is Figure 9 from the original paper, showing a snapshot of the node layout. It indicates a central block surrounded by tiles, with HT IPs on the periphery, confirming the tiled, multi-component design.

  • Area Breakdown (Table VI): The total chip area is 67.73 mm².

    • Tiles: The 16 tiles consume 44.53% of the chip area (30.16 mm²). This is a substantial portion, reflecting the distributed computational and storage units.
    • HT IPs: The four HyperTransport IPs consume 26.02% of the area (17.62 mm²), indicating the significant footprint of high-speed inter-chip communication.
    • Central Block: The central block (including 4MB eDRAM, router, and control logic) takes 11.66% of the area (7.89 mm²).
    • Wires: The wires connecting the central block and the tiles occupy 8.97% of the area, confirming the effectiveness of the tiled design in reducing wiring overhead compared to the monolithic approach (which had wires taking up much more area, as discussed in the Methodology section).
    • Memory vs. Logic: About half (47.55%) of the chip area is consumed by memory cells (primarily eDRAM), emphasizing the memory-centric design. Combinational logic and registers account for 5.88% and 4.94% respectively.
  • Power Consumption:

    • Peak Power: The peak power consumption is 15.97 W (at a pessimistic 100% toggle rate), which is roughly 5-10% of a state-of-the-art GPU card (NVIDIA K20M has 225W TDP). This highlights significant energy efficiency.
    • Power Breakdown:
      • HT IPs: Consume about half (50.14%) of the total power, emphasizing that inter-chip communication is the most power-hungry component.

      • Tiles: Consume over one third (38.53%) of the power.

      • Memory cells: Account for 38.30% of total power (tile eDRAMs + central eDRAM).

      • Combinational logic: 37.97% (mostly NFUs and HT protocol analyzers).

      • Registers: 19.25%.

        The following are the results from Table VI of the original paper (a short sketch reproducing these percentages from the raw figures follows the table):

    | Component/Block | Area (µm²) | Area (%) | Power (W) | Power (%) |
    |---|---|---|---|---|
    | Whole chip | 67,732,900 | | 15.97 | |
    | Central Block | 7,898,081 | 11.66% | 1.80 | 11.27% |
    | Tiles | 30,161,968 | 44.53% | 6.15 | 38.53% |
    | HTs | 17,620,440 | 26.02% | 8.01 | 50.14% |
    | Wires | 6,078,608 | 8.97% | 0.01 | 0.06% |
    | Other | 5,973,803 | 8.82% | | |
    | Combinational | 3,979,345 | 5.88% | 6.06 | 37.97% |
    | Memory | 32,207,390 | 47.55% | 6.12 | 38.30% |
    | Registers | 3,348,677 | 4.94% | 3.07 | 19.25% |
    | Clock network | 586,323 | 0.87% | 0.71 | 4.48% |
    | Filler cell | 27,611,165 | 40.76% | | |
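
As a quick sanity check, the percentages can be recomputed directly from the raw area and power figures in Table VI. The following minimal Python sketch uses only numbers from the table and reproduces the reported shares up to rounding.

```python
# Raw post-layout figures from Table VI: area in um^2, power in W.
TOTAL_AREA_UM2 = 67_732_900
TOTAL_POWER_W = 15.97

blocks = {
    "Central Block": (7_898_081, 1.80),
    "Tiles":         (30_161_968, 6.15),
    "HTs":           (17_620_440, 8.01),
    "Wires":         (6_078_608, 0.01),
}

for name, (area_um2, power_w) in blocks.items():
    area_pct = 100 * area_um2 / TOTAL_AREA_UM2
    power_pct = 100 * power_w / TOTAL_POWER_W
    print(f"{name:14s}  area {area_pct:5.2f}%  power {power_pct:5.2f}%")
# e.g. Tiles -> area 44.53%, power 38.51% (Table VI rounds the latter to 38.53%).
```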

6.1.2. Performance (VII.B.)

The performance evaluation focuses on the speedup of DaDianNao over the GPU baseline for both inference and training across various neural network layers and different numbers of nodes.

  • Inference Speedup (Figure 10):

    • Average Speedup:
      • 1-node: 21.38x faster than GPU.
      • 4-node: 79.81x faster.
      • 16-node: 216.72x faster.
      • 64-node: 450.65x faster.
    • Reasons for High Performance:
      1. Large Number of Operators: Each node has 9216 operators (multipliers and adders) compared to 2496 MACs (Multiply-Accumulate units) in the GPU.
      2. On-chip eDRAM Bandwidth: The on-chip eDRAM provides the necessary high bandwidth and low-latency access to keep these many operators fed with data.
    • Layer-Specific Node Requirements:
      • CONV1: Requires a 4-node system due to its large memory footprint (22.69 MB for synapses, 32 MB for inputs, 44.32 MB for outputs, 99.01 MB in total), which exceeds the 36 MB of eDRAM available per node (see the footprint sketch after the figure description below).

      • CONV3* and CONV4* (with private kernels): Need a 36-node system, as their sizes are 1.29 GB and 1.32 GB respectively.

      • Full NN: Requires at least 4 nodes (59.48 M synapses in total, i.e., 118.96 MB of 16-bit synapse data).

        Figure 10: Speedup w.r.t. the GPU baseline (inference). Note that CONV1 and the full NN need a 4-node system, while CONV3* and CONV4* even need a 36-node system.

    The image above is Figure 10 from the original paper, showing the speedup relative to the GPU baseline for inference. Different bar colors represent 1, 4, 16, and 64 nodes. It clearly illustrates significant speedup for 64 nodes, with CONV1, CONV3*, CONV4*, and the full NN requiring multi-node systems.
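
To make the node-count requirements above concrete, here is a minimal sketch (referenced in the list above) that recomputes the memory footprints from the figures quoted in the text and compares them against the 36 MB of eDRAM per node. The 2-bytes-per-value factor assumes the 16-bit fixed-point inference format, and neuron storage for the full NN is omitted; the jump from the lower bound to the evaluated system size (e.g., 4 nodes for CONV1) is a configuration choice of the paper, not computed here.

```python
import math

EDRAM_PER_NODE_MB = 36  # 16 tiles x 2 MB + 4 MB central eDRAM

# Footprints quoted in the analysis (MB): synapses, inputs, outputs.
layers = {
    "CONV1":   (22.69, 32.0, 44.32),
    # Full NN: 59.48 M synapses at 2 bytes each (16-bit inference values);
    # input/output neuron storage is comparatively small and omitted here.
    "full NN": (59.48e6 * 2 / 1e6, 0.0, 0.0),
}

for name, parts in layers.items():
    total_mb = sum(parts)
    min_nodes = math.ceil(total_mb / EDRAM_PER_NODE_MB)
    print(f"{name}: {total_mb:.2f} MB -> at least {min_nodes} node(s)")
# CONV1:   99.01 MB -> at least 3 nodes (evaluated on the next system size, 4 nodes).
# full NN: 118.96 MB -> at least 4 nodes, matching the text.
```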

  • Scalability of Layers:

    • LRN Layers: Scale best (no inter-node communication), achieving up to 1340.77x for LRN2 with 64 nodes.
    • CONV and POOL Layers: Scale almost as well (inter-node communication is needed only for border elements); CONV1 achieves 2595.23x with 64 nodes. POOL layers scale similarly but reach lower absolute speedups than CONV layers because they are less computationally intensive.
    • CLASS Layers: Scale less well (e.g., 72.96x for CLASS1 with 64 nodes) due to heavy inter-node communication, since each output neuron uses all input neurons, which are spread across different nodes (a toy sketch contrasting these communication patterns follows this list).
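
The following toy sketch illustrates the communication contrast stated above. It is not the paper's actual mapping: the layer sizes are placeholders and the convolutional layer is assumed to be partitioned into horizontal stripes, but it shows why a fully connected (CLASS) layer forces every node to fetch the entire input vector while a convolutional layer only needs a small halo of border rows.

```python
import math

def fc_inputs_needed_per_node(num_inputs, num_nodes):
    # Fully connected: every output neuron reads every input neuron,
    # so each node needs the complete input vector regardless of partitioning.
    return num_inputs

def conv_inputs_needed_per_node(height, width, kernel, num_nodes):
    # Convolution partitioned into horizontal stripes: each node needs its own
    # stripe plus a halo of (kernel - 1) border rows from its neighbours.
    rows_per_node = math.ceil(height / num_nodes)
    return (rows_per_node + kernel - 1) * width

# Toy sizes, not taken from the paper.
print(fc_inputs_needed_per_node(num_inputs=4096, num_nodes=64))                     # 4096
print(conv_inputs_needed_per_node(height=256, width=256, kernel=3, num_nodes=64))   # 1536
```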
  • Time Breakdown (Figure 11):

    • The breakdown shows communication vs. computation for various node counts and layer types. CLASS layers show a higher proportion of time spent in communication as the number of nodes increases.

    • This communication issue for CLASS layers is attributed to the simple 2D mesh topology. A more sophisticated multi-dimensional torus topology could reduce broadcast time at larger node counts; a rough hop-count comparison is sketched after the figure description below.

      Figure 11: Time breakdown (left) for 4, 16 and 64 nodes, (right) breakdown for 1, 4, 16, 64 nodes; CLASS, CONV, POOL, LRN stand for the geometric means of all layers of the corresponding type, Gmean…

    The image above is Figure 11 from the original paper, showing the Time breakdown. The left chart displays communication and computation proportions for 4, 16, and 64 nodes for CLASS, CONV, POOL, LRN (geometric means), and the global geometric mean. The right chart shows the breakdown of NFU, eDRAM, Router, and HT usage for 1, 4, 16, 64 nodes.
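
As a rough illustration of why the topology matters (referenced above), the sketch below compares worst-case hop counts for a 2D mesh against a hypothetical 3D torus at 64 nodes. The 4x4x4 torus dimensions are an assumption for illustration; the paper only suggests a multi-dimensional torus as future work.

```python
def mesh_diameter(dims):
    """Worst-case hop count of a mesh: sum of (side - 1) over all dimensions."""
    return sum(d - 1 for d in dims)

def torus_diameter(dims):
    """Worst-case hop count of a torus: wrap-around links halve each dimension."""
    return sum(d // 2 for d in dims)

# 64 nodes arranged as an 8x8 2D mesh vs an assumed 4x4x4 3D torus.
print("8x8 mesh diameter:   ", mesh_diameter((8, 8)))       # 14 hops
print("4x4x4 torus diameter:", torus_diameter((4, 4, 4)))   # 6 hops
```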

  • Full NN Scaling: The full NN scales similarly to CLASS layers (63.35x for 4-node, 116.85x for 16-node, 164.80x for 64-node). This is not because CLASS layers dominate, but because the CNN layers in this specific benchmark are relatively small for a 64-node system, leading to inefficient mapping or frequent inter-node communications for kernel computations.

    The following are the results from Table VII of the original paper, showing the share of full-NN execution time spent in each layer type:

    | | CONV | LRN | POOL | CLASS |
    |---|---|---|---|---|
    | 4-node | 96.63% | 0.60% | 0.47% | 2.31% |
    | 16-node | 96.87% | 0.28% | 0.22% | 2.63% |
    | 64-node | 92.25% | 0.10% | 0.08% | 7.57% |
  • Training and Initialization Speedup (Figure 12):

    • Average Speedup:

      • 1-node: 12.62x faster.
      • 4-node: 43.23x faster.
      • 16-node: 126.66x faster.
      • 64-node: 300.04x faster.
    • Comparison to Inference: Speedups are high but lower than inference mainly due to operator aggregation (using 32-bit operators for training means fewer parallel operations).

    • CLASS Layer Scalability: The training phase for CLASS layers scales better than inference because backpropagation roughly doubles the computation for the same amount of communication.

    • RBM Initialization: Scalability is similar to CLASS layers in the inference phase.

      Figure 12: Speedup w.r.t. the GPU baseline (training).

    The image above is Figure 12 from the original paper, showing speedup relative to the GPU baseline for training. As with inference, 64 nodes show the most significant speedup across the various layers. The sketch below relates the average inference and training speedups to parallel scaling efficiency.
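
As referenced above, the average speedups can be turned into a rough parallel-efficiency figure by normalizing against ideal linear scaling from the 1-node result. This is a derived metric computed from the reported averages, not one the paper itself reports.

```python
# Average speedups over the GPU baseline, as reported in the analysis.
inference = {1: 21.38, 4: 79.81, 16: 216.72, 64: 450.65}
training  = {1: 12.62, 4: 43.23, 16: 126.66, 64: 300.04}

def scaling_efficiency(speedups):
    base = speedups[1]
    # Efficiency = actual speedup / (nodes x 1-node speedup).
    return {n: s / (n * base) for n, s in speedups.items()}

print("inference:", {n: round(e, 2) for n, e in scaling_efficiency(inference).items()})
print("training: ", {n: round(e, 2) for n, e in scaling_efficiency(training).items()})
# inference: {1: 1.0, 4: 0.93, 16: 0.63, 64: 0.33}
# training:  {1: 1.0, 4: 0.86, 16: 0.63, 64: 0.37}
```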

6.1.3. Energy Consumption (VII.C.)

The energy consumption analysis highlights DaDianNao's significant efficiency gains.

  • Inference Energy Reduction (Figure 13):

    • Average Energy Reduction:
      • 1-node: 330.56x reduction.
      • 4-node: 323.74x reduction.
      • 16-node: 276.04x reduction.
      • 64-node: 150.31x reduction.
    • Range: Minimum energy improvement is 47.66x (for CLASS1 with 64 nodes), while the best is 896.58x (for CONV2 on a single node).
    • Scalability Trend: Energy benefit remains relatively stable for convolutional, pooling, and LRN layers as nodes scale up, but degrades for classifier layers. This degradation is attributed to the increased communication time, suggesting a multi-dimensional torus could help.
    • Energy Breakdown (Figure 11, right):
      • 1-node architecture: NFU consumes about 83.89% of the energy.

      • 64-node system: The ratio of energy spent in HT (HyperTransport) progressively increases to 29.32% on average, and specifically 48.11% for classifier layers due to larger communication overheads.

        Figure 13: Energy reduction w.r.t. the GPU baseline (inference).

    The image above is Figure 13 from the original paper, showing energy reduction relative to the GPU baseline for inference. It indicates substantial energy savings; for several layers the 1-node system achieves a larger energy reduction than the multi-node configurations, but the benefit remains large throughout. A back-of-envelope sketch relating speedup and power to these energy figures follows.
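
As referenced above, a back-of-envelope sketch shows how speedup and power combine into energy reduction. It uses peak/TDP figures (15.97 W per node, 225 W for the K20M) rather than the measured dynamic power the paper reports, so it only roughly tracks the reported numbers; it does, however, make clear why the energy benefit shrinks as nodes are added even though speedup keeps growing.

```python
GPU_POWER_W = 225.0     # NVIDIA K20M TDP
NODE_POWER_W = 15.97    # DaDianNao node peak power

# Reported average inference speedups per system size.
speedup = {1: 21.38, 4: 79.81, 16: 216.72, 64: 450.65}

for nodes, s in speedup.items():
    # Energy ratio = (P_gpu * T_gpu) / (N * P_node * T_gpu / s) = s * P_gpu / (N * P_node)
    est = s * GPU_POWER_W / (nodes * NODE_POWER_W)
    print(f"{nodes:2d} node(s): estimated energy reduction ~{est:6.1f}x")
# ~301x, ~281x, ~191x, ~99x vs the reported 330.56x / 323.74x / 276.04x / 150.31x;
# this crude peak-power model captures the trend, not the exact values.
```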

  • Training and Initialization Energy Reduction (Figure 14):

    • Average Energy Reduction:

      • 1-node: 172.39x reduction.
      • 4-node: 180.42x reduction.
      • 16-node: 142.59x reduction.
      • 64-node: 66.94x reduction.
    • Scalability: The scalability behavior is similar to that of the inference phase, with energy reduction also showing a decreasing trend as node count increases, particularly for CLASS layers due to communication.

      Figure 14: Energy reduction w.r.t. the GPU baseline (training).

    The image above is Figure 14 from the original paper, showing Energy reduction relative to the GPU baseline for training. The trend of energy reduction is similar to inference, decreasing with increasing node count, but still offering substantial savings.

6.2. Data Presentation (Tables)

All tables were transcribed where they are discussed: the setup tables appear in the Methodology and Experimental Setup sections, and Tables VI and VII are transcribed in the Results section above.

6.3. Ablation Studies / Parameter Analysis

While not explicitly termed "ablation studies," the paper conducts several analyses that serve a similar purpose by varying key architectural parameters and evaluating their impact:

  • Fixed-Point Precision Analysis (Table II): This directly evaluates the impact of different fixed-point bit-widths (16-bit, 32-bit) on the network's error and convergence for inference and training. It shows that 16-bit fixed-point is sufficient for inference but causes training to fail (no convergence), while 32-bit fixed-point training yields accuracy comparable to floating point. This justifies the NFU's ability to switch precision depending on the execution phase; a small quantization sketch illustrating this kind of precision trade-off follows this list.

  • Number of Nodes Scalability Analysis (Figures 10, 12, 13, 14): The core experimental results systematically show how performance (speedup) and energy consumption (reduction) change as the number of nodes in the system increases (1, 4, 16, 64 nodes). This demonstrates the scalability of the proposed multi-chip architecture and highlights layer-specific scaling behaviors (e.g., LRN scales well, CLASS scales less due to communication).

  • Monolithic vs. Tiled Design (Floorplan analysis in V.B.1 and V.B.2): The paper implicitly performs an "ablation" by discussing an initial monolithic NFU design that suffered from extreme wire congestion (Figure 4). The subsequent adoption of the tile-based design (Figure 5) is presented as a solution that reduces area and improves internal bandwidth, justifying this key architectural choice.

  • Communication vs. Computation Breakdown (Figure 11): This analysis implicitly reveals the bottleneck shifts as the system scales. For CLASS layers on larger node counts, communication time becomes a more significant factor, indicating limitations of the 2D mesh topology for certain workloads. This parameter analysis informs future work directions (e.g., multi-dimensional torus).

  • Component Power Breakdown (Table VI, Figure 11 right): Analyzing how power is distributed among tiles, HT IPs, memory, and logic shows where energy is consumed most. For instance, the high power of the HT IPs highlights the cost of inter-chip communication, even though only neuron values (not synapses) cross chip boundaries.

    These analyses are crucial for understanding the trade-offs and design decisions behind DaDianNao, and for guiding future improvements.
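
To make the precision discussion concrete (as referenced in the fixed-point item above), here is a small, generic quantization sketch. It is not the paper's exact number format; the split between integer and fractional bits is an illustrative assumption. It simply shows why widening the fixed-point word shrinks rounding error, which is the kind of effect Table II evaluates for inference versus training.

```python
import numpy as np

def quantize_fixed(x, total_bits, frac_bits):
    """Round to signed fixed point with the given word width and fractional bits."""
    scale = 1 << frac_bits
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    return np.clip(np.round(x * scale), lo, hi) / scale

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.1, size=10_000)  # small values, typical of trained synapses

for total_bits, frac_bits in [(16, 10), (32, 26)]:  # illustrative bit splits
    err = np.max(np.abs(weights - quantize_fixed(weights, total_bits, frac_bits)))
    print(f"{total_bits}-bit fixed point ({frac_bits} fractional bits): max error {err:.1e}")
# The 32-bit format keeps rounding error orders of magnitude smaller, which is why
# training, which accumulates many tiny weight updates, needs the wider format.
```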

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces DaDianNao, a pioneering custom multi-chip architecture engineered to address the computational and memory challenges posed by state-of-the-art machine-learning algorithms such as Convolutional Neural Networks (CNNs) and Deep Neural Networks (DNNs). Recognizing that these algorithms are increasingly central to modern services and typically suffer from memory bandwidth limitations in GPU and single-chip accelerator implementations, DaDianNao proposes a multi-chip system where the entire neural network model's memory footprint can reside in on-chip eDRAM distributed across interconnected nodes.

The key architectural innovations include:

  1. An asymmetric node design heavily biased towards on-chip eDRAM storage to minimize data movement.

  2. A neuron-centric communication strategy that transfers fewer neuron values instead of numerous synaptic weights.

  3. A tile-based architecture with a fat tree interconnect for high internal bandwidth and efficient data distribution within a node.

  4. A configurable Neural Functional Unit (NFU) that can adapt to different layer types and support both inference and training with appropriate fixed-point precision.

    The experimental results demonstrate remarkable performance and energy efficiency. A 64-node DaDianNao system achieves an average speedup of 450.65x over a GPU baseline and reduces energy by 150.31x. A single node, implemented at 28nm, occupies 67.73 mm² and consumes 15.97 W of peak power, confirming the feasibility and efficiency of the design. While scalability varies per layer type (e.g., classifier layers are more sensitive to inter-node communication), the system consistently outperforms GPUs by a significant margin.

7.2. Limitations & Future Work

The authors candidly acknowledge several limitations and propose clear directions for future work:

  • NFU Clock Frequency: The current NFU is clocked conservatively at 606MHz, matching the eDRAM frequency. Future work includes implementing a faster NFU and investigating asynchronous communications with eDRAM to push performance further.
  • Interconnect Scalability for Classifier Layers: The simple 2D mesh topology used currently can lead to communication bottlenecks and reduced scalability for classifier layers as the number of nodes increases. The authors suggest exploring a more efficient multi-dimensional torus topology to mitigate this.
  • Flexible Control: The current control mechanism is based on generated node instructions. Future work aims to incorporate a more flexible control scheme, potentially using a simple VLIW (Very Long Instruction Word) core per node and developing the associated toolchain.
  • Hardware Prototyping: A concrete future step is the tape-out (manufacturing) of a node chip, followed by the development of a multi-node prototype to validate the system in real hardware.

7.3. Personal Insights & Critique

This paper represents a significant milestone in the development of specialized hardware for machine learning. Its insights and architectural choices have deeply influenced subsequent AI accelerator designs.

  • Pioneering a "ML Supercomputer": At a time when GPUs were becoming dominant, DaDianNao boldly proposed a dedicated, multi-chip system specifically for ML. This concept of an ML supercomputer was visionary, predating the explosion of large AI models and the widespread need for AI clusters. It effectively defined a new category of specialized computing.
  • Addressing the Memory Wall - A Core Insight: The central premise that DNN/CNN memory footprints, while large, are manageable with aggregated on-chip storage in a multi-chip system was a profound insight. This diverges from general-purpose computing's memory wall and directly led to the innovative eDRAM-centric, neuron-moving architecture. This focus on localizing weights and minimizing off-chip access is now a common theme in AI accelerator design.
  • Practicality and Rigor: The paper doesn't just propose a theoretical architecture; it details the implementation down to place and route at 28nm, including CAD tool usage, power breakdowns, and eDRAM modeling. This level of detail provides strong credibility and makes the results highly compelling. The comparison to DianNao and detailed GPU baselines demonstrates a robust evaluation methodology.
  • Influence on Future Work: DaDianNao and its follow-ups (DianNao series from ICT, CAS) effectively kickstarted a wave of research into dataflow architectures, in-memory computing, and specialized accelerators for deep learning. Concepts like tile-based designs, on-chip memory hierarchies, and configurable compute units became standard features in later AI chips.
  • Areas for Improvement / Unverified Assumptions:
    • 2D Mesh Limitation: The authors correctly identify the 2D mesh as a limitation for classifier layers. As AI models grow, the proportion of fully connected layers can increase, making this bottleneck more pronounced. Moving to 3D torus or other advanced interconnects is critical.

    • Fixed-Point Precision Trade-offs: While 32-bit fixed-point for training was shown to work, further research has explored 16-bit or even 8-bit fixed-point training with specialized techniques. The paper's early findings on fixed-point were foundational, but the field has since advanced in making lower precision viable for training.

    • Software Stack Complexity: Although DaDianNao is presented as a system ASIC with low programming requirements via a code generator, the complexity of mapping diverse and evolving neural network models efficiently onto such a rigid, specialized architecture remains a continuous challenge. The proposed VLIW core for more flexible control hints at this.

    • Generalizability: While DaDianNao is highly optimized for CNNs and DNNs, its specialization might limit its adaptability to fundamentally new ML algorithms that emerge in the future (e.g., Transformers, Graph Neural Networks) if their computational patterns differ significantly. However, the modular nature of the NFU and general principles of data locality and on-chip memory are still broadly applicable.

      Overall, DaDianNao was a visionary and rigorously executed project that not only delivered impressive performance and efficiency gains for Deep Learning but also laid crucial groundwork for the entire field of AI hardware acceleration. Its focus on overcoming the memory wall through on-chip distributed memory and neuron-centric data flow was a paradigm shift that continues to resonate today.
