DaDianNao: A Machine-Learning Supercomputer
TL;DR Summary
DaDianNao is a machine-learning supercomputer optimized for CNNs and DNNs, demonstrating a 450.65x speedup and 150.31x energy reduction compared to GPUs, effectively addressing the high computational and memory demands of machine learning.
Abstract
Many companies are deploying services, either for consumers or industry, which are largely based on machine-learning algorithms for sophisticated processing of large amounts of data. The state-of-the-art and most popular such machine-learning algorithms are Convolutional and Deep Neural Networks (CNNs and DNNs), which are known to be both computationally and memory intensive. A number of neural network accelerators have been recently proposed which can offer high computational capacity/area ratio, but which remain hampered by memory accesses. However, unlike the memory wall faced by processors on general-purpose workloads, the CNNs and DNNs memory footprint, while large, is not beyond the capability of the on-chip storage of a multi-chip system. This property, combined with the CNN/DNN algorithmic characteristics, can lead to high internal bandwidth and low external communications, which can in turn enable high-degree parallelism at a reasonable area cost. In this article, we introduce a custom multi-chip machine-learning architecture along those lines. We show that, on a subset of the largest known neural network layers, it is possible to achieve a speedup of 450.65x over a GPU, and reduce the energy by 150.31x on average for a 64-chip system. We implement the node down to the place and route at 28nm, containing a combination of custom storage and computational units, with industry-grade interconnects.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
DaDianNao: A Machine-Learning Supercomputer
1.2. Authors
Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, Olivier Temam. Their affiliations include:
- SKL of Computer Architecture, ICT, CAS, China
- Inria, Saclay, France
- University of CAS, China
- Inner Mongolia University, China
1.3. Journal/Conference
The provided text does not name the venue. The publication date ("Published at (UTC): 2014-12-01T00:00:00.000Z") and the paper itself correspond to the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-47), December 2014, a premier computer architecture conference.
1.4. Publication Year
2014
1.5. Abstract
This paper introduces DaDianNao, a custom multi-chip architecture designed for accelerating Convolutional Neural Networks (CNNs) and Deep Neural Networks (DNNs). These machine-learning algorithms, while popular for sophisticated data processing in various services, are known to be computationally and memory-intensive. Existing neural network accelerators improve computational capacity but are often bottlenecked by memory accesses. The authors observe that, unlike general-purpose workloads, the large memory footprint of CNNs and DNNs can be accommodated by the on-chip storage of a multi-chip system. This property, combined with algorithmic characteristics, allows for high internal bandwidth and low external communications, facilitating high-degree parallelism at a reasonable area cost. The proposed DaDianNao system, implemented down to place and route at 28nm technology and featuring custom storage and computational units with industry-grade interconnects, demonstrates significant performance gains. For a 64-chip system, it achieves an average speedup of 450.65x over a GPU and reduces energy by 150.31x on a subset of the largest known neural network layers.
1.6. Original Source Link
/files/papers/6915903fdb1128a32d47a715/paper.pdf (internal link to the paper PDF).
2. Executive Summary
2.1. Background & Motivation
The core problem DaDianNao aims to solve is the computational and memory intensity of state-of-the-art machine-learning (ML) algorithms, particularly Convolutional Neural Networks (CNNs) and Deep Neural Networks (DNNs). These algorithms have become ubiquitous in a wide range of applications, from speech recognition (e.g., Siri, Google Now) to image identification (e.g., Apple iPhoto) and even pharmaceutical research. The authors argue that machine-learning applications are "in the process of displacing scientific computing as the major driver for high-performance computing."
Despite their widespread adoption, CNNs and DNNs pose significant challenges for conventional hardware. While GPUs (Graphics Processing Units) are currently favored for their parallel processing capabilities, they still incur high area costs, considerable execution times, and moderate energy efficiency due to their general-purpose nature and reliance on off-chip memory for large models. Existing dedicated neural network accelerators offer high computational density but are often hampered by memory accesses, where data movement between on-chip processing units and off-chip main memory becomes the performance bottleneck – a phenomenon known as the memory wall.
The paper's innovative idea stems from a critical observation: although the memory footprint of CNNs and DNNs can be very large (up to tens of GB), it is not beyond the capability of the aggregated on-chip storage of a multi-chip system. This contrasts with general-purpose workloads where the memory wall is often insurmountable with on-chip resources alone. This unique property of CNNs and DNNs, combined with their algorithmic characteristics (e.g., high data reuse, local connectivity in many layers), suggests that a specialized multi-chip architecture could achieve high internal bandwidth and low external communication, thereby enabling high parallelism at a reasonable cost.
2.2. Main Contributions / Findings
The primary contributions of this paper are:
- Novel Multi-Chip Architecture: Introduction of DaDianNao, a custom multi-chip, highly specialized architecture designed for CNNs and DNNs. It fundamentally addresses the memory wall problem by distributing large neural network models across the on-chip eDRAM (embedded DRAM) of multiple interconnected chips.
- Asymmetric Node Design: Each chip (node) in the system is biased towards storage rather than computation. Substantial eDRAM is integrated directly on-chip to hold synaptic weights close to the Neural Functional Units (NFUs), minimizing costly off-chip memory accesses.
- Neuron-Centric Data Movement: The architecture moves neuron values (which are far fewer in number) between chips and within nodes, rather than synaptic weights. This significantly reduces inter-chip communication bandwidth requirements.
- High Internal Bandwidth via Tiling: Each node uses a tile-based design with multiple smaller NFUs and eDRAM banks, interconnected by a fat tree. This maximizes internal data bandwidth and mitigates the wire congestion observed in monolithic designs.
- Configurability for Flexibility: The architecture is highly configurable, supporting the different CNN and DNN layer types (convolutional, pooling, local response normalization, classifier) and both inference (forward pass) and training (forward and backward passes). This is achieved through reconfigurable NFU pipelines and support for fixed-point arithmetic at several precisions.
- Significant Performance and Energy Efficiency Gains:
  - Speedup: On a subset of the largest known neural network layers, a 64-chip DaDianNao system achieved an average speedup of 450.65x over a GPU (NVIDIA K20M).
  - Energy Reduction: The same 64-chip system demonstrated an average energy reduction of 150.31x relative to the GPU; a single node achieves an even higher energy reduction of 330.56x.
- Practical Implementation: The design is not merely theoretical; a single node was implemented down to place and route at 28nm, confirming its feasibility and providing concrete area and power estimates. The node combines custom storage (eDRAM) and computational units (NFUs) with industry-grade interconnects (HyperTransport).

These findings collectively demonstrate that dedicated multi-chip architectures can overcome the memory wall for DNNs and CNNs, enabling supercomputer-level performance and energy efficiency for large-scale machine-learning applications.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
- Machine Learning (ML): A field of artificial intelligence that uses statistical techniques to give computer systems the ability to "learn" from data, without being explicitly programmed. It involves building models from sample data (training) that can make predictions or decisions on new, unseen data (inference).
  - Inference (Forward Phase/Testing): The process of using a trained ML model to make predictions or classify new data. This is typically a feed-forward computation.
  - Training (Backward Phase/Learning): The process of adjusting the parameters (weights and biases) of an ML model on a dataset to minimize prediction errors. This usually involves forward propagation, computing a loss, and then backpropagation to compute gradients and update weights.
- Neural Networks (NNs): A subset of machine learning inspired by the structure and function of the human brain. They consist of interconnected nodes (neurons) organized in layers; each connection between neurons has a synaptic weight that is adjusted during training.
  - Deep Neural Networks (DNNs): Neural networks with multiple hidden layers between the input and output layers ("deep" refers to the number of layers). They are capable of learning complex patterns.
  - Convolutional Neural Networks (CNNs): A specialized type of DNN particularly effective for grid-like data such as images. CNNs use convolutional layers that apply filters (kernels) to local regions of the input, enabling them to learn hierarchical features and exhibit translation invariance.
  - Layer Types in CNNs/DNNs:
    - Convolutional Layers (CONV): Apply a set of learnable filters (kernels) to the input, sliding them across it to produce feature maps. These filters detect specific features (e.g., edges, textures).
    - Pooling Layers (POOL): Reduce the dimensionality of feature maps by taking the maximum (max pooling) or average (average pooling) over local regions. This reduces computation, controls overfitting, and makes the network more robust to small input variations.
    - Local Response Normalization Layers (LRN): Implement a form of competition between neurons in different feature maps at the same spatial location, normalizing a neuron's activation based on its neighbors' activity, similar to lateral inhibition in biological systems.
    - Classifier Layers (CLASS): Typically fully connected layers at the end of a CNN or DNN. They map the features extracted by previous layers to the final output categories (e.g., classifying an image as a "cat" or "dog") and often hold a very large number of synaptic weights due to their full connectivity.
- Synaptic Weights / Parameters: The adjustable values in a neural network that determine the strength of the connections between neurons; they are the core elements learned during training.
- Memory Wall: A long-standing computer architecture problem in which processor speed improves faster than memory access speed, so the processor spends much of its time waiting for data from memory. For accelerators, it means computational units sit idle waiting for data.
- GPU (Graphics Processing Unit): A specialized circuit originally designed to accelerate the creation of images in a frame buffer for display. Modern GPUs are highly parallel processors capable of performing many operations simultaneously, which makes them well suited to the large-scale parallel computations of machine learning.
- On-chip vs. Off-chip Memory:
  - On-chip Memory: Memory integrated on the same chip as the processing units (e.g., SRAM, eDRAM). It typically offers much lower latency and higher bandwidth than off-chip memory.
  - Off-chip Memory: Memory on separate chips connected via an external bus (e.g., DRAM, GDDR5). It offers larger capacity but significantly higher latency and lower effective bandwidth due to physical distance and interface overheads.
- eDRAM (embedded DRAM) vs. SRAM (Static RAM):
  - SRAM: Holds data statically, with no periodic refresh. It is faster and consumes less standby power than DRAM, but is far less dense (more area per bit) and more expensive per bit; it is commonly used for caches.
  - eDRAM: DRAM integrated on the same chip as logic. It offers much higher density than SRAM (more storage in less area) and better performance than off-chip DRAM, but has higher latency than SRAM and requires periodic refresh.
- Fixed-point vs. Floating-point Arithmetic:
  - Floating-point: Represents real numbers with a fractional part and an exponent, giving a wide dynamic range and high precision; it is commonly used in scientific computing and ML training.
  - Fixed-point: Represents real numbers with the binary point at a fixed position. It is simpler and faster in hardware, consumes less power, and needs less memory bandwidth, but has a more limited dynamic range and precision; it is often used for ML inference when precision requirements are lower.
- 2D Mesh Topology: A network topology in which processors (nodes) are arranged in a 2D grid, each connected to its immediate neighbors (up, down, left, right); data travels across multiple hops between non-adjacent nodes.
- Place and Route: A crucial step in the physical design of integrated circuits. After logical design (synthesis), place-and-route software determines the physical locations of all circuit components (placement) and lays out all the connections between them (routing). This step strongly affects area, performance, and power.
- 28nm Technology: A semiconductor process node in which the typical minimum feature size (e.g., transistor gate length) is 28 nanometers. Smaller nodes generally allow denser circuits, higher performance, and lower power consumption.
3.2. Previous Works
The paper references several prior works, which can be broadly categorized into early NN accelerators, DNN/CNN-specific accelerators, and large-scale biological emulation systems.
- Early NN Accelerators (before the Deep Learning surge):
  - Temam [47]: Proposed a neural network accelerator for multi-layer perceptrons. The key distinction noted by the authors is that it was not designed for deep learning neural networks, which gained prominence later.
  - Esmaeilzadeh et al. [16]: Introduced a hardware neural network, the NPU, for approximating arbitrary program functions rather than targeting machine-learning applications specifically. This highlights a broader trend of using NNs for hardware acceleration beyond ML.
- Emergence of Deep Learning and Dedicated Accelerators:
  - Chen et al. [5] (DianNao): The direct predecessor and primary comparison point for DaDianNao. DianNao proposed a small-footprint accelerator for deep learning (CNNs and DNNs). Its architecture features buffers for caching input/output neurons and synapses, plus a Neural Functional Unit (NFU), a pipelined datapath for neuron evaluation (multiplication of synaptic values by input neurons, additions, and application of the transfer function).
    - Crucial Limitation (acknowledged by the DianNao authors and addressed by DaDianNao): DianNao and similar single-chip accelerators were memory-bandwidth-limited, especially for convolutional layers with private kernels and for classifier layers, where the massive number of synapses (parameters) required frequent off-chip memory accesses. The paper notes that DianNao could lose an order of magnitude in performance to these memory accesses.
  - Krizhevsky et al. [32]: Achieved state-of-the-art accuracy on the ImageNet database using a CNN with 60 million parameters. This work is a benchmark for the scale of modern DNNs and serves as the full-NN benchmark in DaDianNao's evaluation.
  - 1-billion and 10-billion parameter neural networks [34], [8]: Extreme experiments in unsupervised learning requiring massive computational resources (thousands of CPUs or tens of GPUs), illustrating the trend towards ever larger neural networks.
- Large-Scale Custom Architectures (primarily for Biological Emulation):
  - Schemmel et al. [46]: A wafer-scale design capable of implementing thousands of neurons and millions of synapses.
  - Khan et al. [30] (SpiNNaker): A multi-chip supercomputer in which each node contains multiple ARM9 cores linked by an asynchronous network, targeting a million-core machine for modeling a billion neurons.
  - IBM Cognitive Chip [39]: A functional chip implementing 256 neurons and 256K synapses in a small area.
  - Differentiation: The primary goal of these designs is the emulation of biological (often spiking) neurons, not direct acceleration of machine-learning tasks based on CNNs and DNNs as defined in this paper. While they may show ML capability on simple tasks, their underlying neuron models are fundamentally different.
- Other Parallel Architectures:
  - Majumdar et al. [37]: A parallel architecture for various machine-learning algorithms. Unlike DaDianNao, it uses off-chip banked memory, with on-chip memory banks serving primarily as caches.
  - Anton [12]: A specialized supercomputer for molecular dynamics simulation, illustrating the broader trend of custom architectures for high-performance computing tasks.
3.3. Technological Evolution
The paper is situated at a critical juncture in computing and machine-learning history:
- Rise of Machine Learning as an HPC Driver: ML algorithms, especially CNNs and DNNs, were becoming pervasive across industries (speech, vision, search). This marked a shift away from scientific computing as the sole driver of high-performance computing and created new demand for specialized hardware.
- Hardware Specialization (Heterogeneous Computing): The end of Dennard scaling (in which power density remained constant as transistors shrank) and the advent of dark silicon (not all transistors on a chip can be powered simultaneously) pushed the community towards heterogeneous computing, i.e., integrating specialized accelerators alongside general-purpose CPUs for higher performance and energy efficiency.
- Deep Learning's Algorithmic Convergence: Although ML is diverse, deep learning (particularly CNNs and DNNs) had emerged as the state of the art for a broad range of applications. This convergence meant that a single family of algorithms could benefit from specialized hardware, making dedicated accelerator design economically viable and impactful.

DaDianNao fits into this evolution by recognizing the opportunity presented by deep learning's characteristics and the need for specialized hardware. It aims to bridge the gap between algorithmic demand and hardware capability, especially regarding the memory wall problem that even early deep-learning accelerators faced.
3.4. Differentiation Analysis
Compared to the main methods in related work, DaDianNao introduces several core innovations:
- Addressing the Memory Wall for Large DNNs/CNNs: Previous NN accelerators (like DianNao) were single-chip designs that struggled with the memory bandwidth demands of large DNNs. DaDianNao explicitly tackles this by moving to a multi-chip system in which the entire synaptic weight footprint (up to tens of GB) is mapped onto on-chip eDRAM distributed across nodes. This is a fundamental shift away from off-chip DRAM, which bottlenecks both performance and energy.
- Focus on On-Chip Storage Density: The node design is aggressively storage-biased, leveraging eDRAM's higher density over SRAM to hold large synaptic weight sets directly on the chip, a deliberate architectural choice to keep data close to computation.
- Neuron-Centric Communication Paradigm: Unlike systems that move weights or other data, DaDianNao transfers only neuron values between nodes, because they are orders of magnitude fewer than synapses for convolutional and classifier layers. This minimizes inter-chip communication overhead.
- Scalability to Supercomputer Level: While DianNao was an accelerator for heterogeneous multi-cores, DaDianNao is designed as a machine-learning supercomputer, a system of interconnected chips achieving performance far beyond a single GPU or single-chip accelerator. The multi-chip mesh topology is a key enabler.
- Configurability for Both Inference and Training: Many early accelerators targeted inference only (e.g., [18]). DaDianNao's NFU and pipeline are reconfigurable for both inference and training (including backpropagation and pre-training with RBMs), offering broader utility. It also supports fixed-point arithmetic in both modes, with adaptable precision.
- Specific CNN/DNN Focus vs. Biological Emulation: Unlike SpiNNaker, the IBM Cognitive Chip, or Schemmel et al.'s work, which aim at emulating biological spiking neurons, DaDianNao directly targets the computational patterns of CNNs and DNNs (multiply-accumulate and transfer functions), the algorithms driving real-world ML applications.

In essence, DaDianNao distinguishes itself as a multi-chip, memory-centric, neuron-communication-optimized architecture tailored specifically to overcome the memory wall for large-scale, industry-relevant deep learning workloads.
4. Methodology
4.1. Principles
The core idea behind DaDianNao is to build a highly specialized, multi-chip machine-learning supercomputer that overcomes the memory wall bottleneck of GPUs and prior single-chip neural network accelerators. The approach rests on several key observations specific to CNNs and DNNs:
- Large but Manageable Memory Footprint: Although the total number of synaptic weights (parameters) in large DNNs can be enormous (tens of GB), it is still within what can be stored collectively on-chip across a reasonable number of specialized chips.
- Data Movement is the Bottleneck: For these algorithms, fetching synaptic weights from off-chip memory is the primary energy and performance limiter.
- Asymmetry in Data Volume: The number of synaptic weights is far higher than the number of neuron values (layer inputs/outputs), so moving neuron values is far cheaper than moving synapses.
- Local Connectivity for Efficiency: Many CNN layers (e.g., convolutional, pooling) exhibit local connectivity: neurons interact mostly with neighboring neurons. Distributing computation and synapses geographically therefore reduces communication for these layers.
- High Internal Bandwidth is Crucial: Keeping the specialized computational units busy requires very high internal data bandwidth.

Based on these principles, DaDianNao adopts the following design philosophy:
- Distributed On-Chip Storage: Use the collective on-chip eDRAM of a multi-chip system to store the entire neural network model, eliminating off-chip memory accesses.
- Synapses Close to Computation: Place synaptic weights physically adjacent to the Neural Functional Units (NFUs) that use them, minimizing data travel time and energy.
- Neuron-Centric Communication: Design the inter-node communication mechanism to transfer neuron values rather than synapses.
- Tiled Architecture for Internal Bandwidth: Split each node (chip) into multiple tiles, each with its own NFU and eDRAM, connected by a high-bandwidth internal fat tree.
- Configurable Processing: Make the NFU configurable so it efficiently executes the different CNN/DNN layer types and supports both inference and training at appropriate fixed-point precisions.
4.2. Core Methodology In-depth (Layer by Layer)
The DaDianNao architecture is a multi-chip system where each chip is an identical node, arranged in a 2D mesh topology. Each node contains significant on-chip storage (primarily eDRAM) for synapses, Neural Functional Units (NFUs) for computation, and a router fabric for inter-node communication.
4.2.1. Node Architecture (V.B.)
The node is the fundamental building block of DaDianNao.
4.2.1.1. Synapses Close to Neurons (V.B.1.)
A central design characteristic is to store synapses (weights) as close as possible to the neurons that use them, and to make this storage massive.
- Motivation for Neuron-Centric Data Movement:
  - Inference and Training Support: The architecture supports both inference (forward pass) and training (forward and backward passes). During training, neurons are forward-propagated and then backward-propagated; depending on how neurons and synapses are allocated, the neurons of the previous or next layer act as inputs and must be moved.
  - Relative Data Volumes: There are far more synapses than neurons. For classifier (fully connected) layers, the number of synapses grows with the product of the input and output neuron counts, while the number of neurons grows only with their sum; for convolutional layers with private kernels, every output location has its own kernel, again making synapses vastly outnumber neurons. It is therefore much more efficient to move neuron outputs than synapses.
  - Low-Energy/Low-Latency Access: Keeping synapses next to the computational operators provides low-energy, low-latency data transfers and enables high internal bandwidth.
- Choice of eDRAM for Storage:
  - SRAM is dense enough for caching but not for the large-scale storage (up to 1GB for single layers, with tens of MB common) required to hold all synapses; eDRAM offers substantially higher storage density. At 28nm, 10MB of eDRAM occupies roughly 2.85x less area than 10MB of SRAM.
  - Keeping the entire neural network in on-chip eDRAM eliminates off-chip DRAM accesses, which are extremely costly in energy: the paper reports a 321x energy ratio between a 256-bit eDRAM read at 28nm and the same access to Micron DDR3 DRAM. The difference comes from the memory controller, the DDR3 physical interface, on-chip bus accesses, and page activation in off-chip DRAM.
- Scaling NFU Capacity and Addressing eDRAM Challenges:
  - With the memory bandwidth bottleneck removed, the NFU can be scaled up to process more output neurons and more inputs per output neuron simultaneously, improving throughput. Achieving 16x the operations of DianNao in this way would require fetching 65536 bits from eDRAM every cycle.
  - eDRAM has drawbacks: higher latency than SRAM, destructive reads, and periodic refresh. To compensate and sustain NFU operation every cycle, the eDRAM is split into four banks and synapse rows are interleaved among them.
- Initial Monolithic NFU Design Flaw:
  - An early design with a single large NFU and a 65536-bit interface to eDRAM at 28nm produced a floorplan (Figure 4) dominated by wire congestion: the NFU itself was tiny, but the 65536 wires connecting it to the eDRAM (given wire spacing and 4 metal layers) ended up consuming nearly as much area as all other components combined. This illustrates the practical difficulty of achieving high internal bandwidth with conventional wiring.

(Figure 4 of the original paper: a simplified floorplan with four eDRAM modules surrounding a single central NFU, about 3.27 mm x 3.27 mm overall with an NFU of about 0.88 mm, showing the resulting wire congestion.)
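To make the sizing argument above concrete, here is a minimal sketch (not from the paper) that estimates how many nodes a layer's synapses need, assuming the 36MB of eDRAM per node reported later in Table III and 2-byte (16-bit) synapses; real deployments round up to the mesh configurations actually evaluated (1, 4, 16, 64 nodes).

```python
NODE_EDRAM_BYTES = 36 * 1024 * 1024   # 16 tiles x 2MB tile eDRAM + 4MB central

def nodes_needed(num_synapses: int, bytes_per_synapse: int = 2) -> int:
    """Minimum number of nodes whose combined eDRAM can hold all synapses."""
    footprint = num_synapses * bytes_per_synapse
    return -(-footprint // NODE_EDRAM_BYTES)   # ceiling division

# CLASS2 (Table I): 4096 x 4096 fully connected -> 32MB of synapses -> 1 node,
# before accounting for input/output neurons and the mesh configuration.
print(nodes_needed(4096 * 4096))
```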
4.2.1.2. High Internal Bandwidth (V.B.2.)
To address the wire congestion and achieve high internal bandwidth, a tile-based design is adopted, as shown in Figure 5.
- Tile-based Architecture:
  - The output neurons are spread across different tiles. Each tile contains a smaller NFU that simultaneously processes 16 input neurons for 16 output neurons (256 parallel operations).
  - This reduces the data requirement from the tile eDRAM to 4096 bits per cycle (16 x 16 synapses of 16 bits).
  - Each tile retains the 4-bank eDRAM organization (each bank 4096 bits wide) to compensate for the eDRAM weaknesses noted above.
  - A tile (its NFU, four eDRAM banks, and input/output interfaces) is compact; 16 such tiles yield a 28.5% area reduction compared to the monolithic NFU design, because the routing network now consumes only 8.97% of the overall area.

(Figure 5 of the original paper: on the left, the tile-based organization of a node, with 16 tiles, two central eDRAM banks, a router, and a fat-tree interconnect; on the right, the tile architecture, with an NFU processing 16 input and 16 output neurons, four eDRAM banks, and input/output interfaces to/from the central eDRAM banks.)
- Internal Communication (Fat Tree):
  - All tiles within a node are connected by a fat tree, which is used to broadcast input neuron values to each tile and to collect output neuron values from each tile.
  - At the center of the chip sit two special eDRAM banks: one for input neurons and one for output neurons.
- Neuron Processing Flow within a Node:
  - Because the total number of hardware output neurons across all NFUs can still be smaller than the number of neurons in large layers, several different output neurons are computed on the same hardware neuron for each set of broadcast input neurons.
  - Intermediate neuron values are saved locally in the tile eDRAM.
  - Once an output neuron's computation is complete (all input neurons factored in), its value is sent through the fat tree to the central output-neuron eDRAM bank at the chip center.
- Neural Functional Unit (NFU) Details: the NFU (Figure 6) contains several parallel operators.
  - Multipliers: multiply synaptic values by input neuron values.
  - Adders: aggregate the products, forming adder trees.
  - Max: used in pooling layers (e.g., max pooling).
  - Transfer Function: applies non-linear activation functions through piecewise linear interpolation.

(Figure 6 of the original paper: the different parallel operators of an NFU — multipliers, adders, max, and transfer function — arranged as pipeline stages from input neurons to outputs, including the paths used to update synapses during training.)
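The following is a small functional sketch of the classifier-layer datapath just described (multipliers, adder tree, then a piecewise-linear transfer function). The segment boundaries and coefficients are illustrative assumptions, not the values the paper stores in its coefficient SRAMs.

```python
def piecewise_linear(x, segments):
    """segments: list of (upper_bound, a, b) sorted by upper_bound; y = a*x + b."""
    for upper, a, b in segments:
        if x <= upper:
            return a * x + b
    return segments[-1][1] * x + segments[-1][2]

# Crude, continuous sigmoid-like approximation (illustrative only).
SIGMOID_LIKE = [(-2.0, 0.0, 0.05), (2.0, 0.225, 0.5), (float("inf"), 0.0, 0.95)]

def classifier_output(inputs, weights, segments=SIGMOID_LIKE):
    """One hardware output neuron: multiply, accumulate, then transfer."""
    acc = sum(w * x for w, x in zip(weights, inputs))   # multipliers + adder tree
    return piecewise_linear(acc, segments)

print(classifier_output([0.5, -1.0, 0.25], [0.2, 0.1, 0.4]))
```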
4.2.1.3. Configurability (Layers, Inference vs. Training) (V.B.3.)
The tile and its NFU pipeline are designed to be highly adaptable to the different CNN/DNN layer types and to both inference and training modes.
- NFU Hardware Blocks: The NFU is decomposed into:
  - Adder Block: configurable as a 256-input, 16-output adder tree or as 256 parallel adders.
  - Multiplier Block: 256 parallel multipliers.
  - Max Block: 16 parallel max operations.
  - Transfer Block: two independent sub-blocks, each performing 16 piecewise linear interpolations. The interpolation coefficients a, b (for f(x) ≈ a*x + b on each segment) are stored in two 16-entry SRAMs and can be configured to implement any transfer function and its derivative.
- Pipeline Configurations (Figure 7): The hardware blocks can be arranged into different pipelines to support the different layer types (CONV, LRN, POOL, CLASS) and phases (forward for inference, backward for training). For instance, a CLASS layer in forward propagation uses the multipliers, adders, and transfer function, while in backpropagation it additionally involves weight-update and gradient-computation stages.

(Figure 7 of the original paper: data flow and operation steps — Multiply, Add, Transfer, etc. — for the forward (FP) and backward (BP) passes of classifier and convolutional layers, including the inputs, outputs, and weight updates of each stage.)
- Bit-Width Aggregation for Training:
  - The hardware blocks allow 16-bit operators (adders, multipliers, max) to be aggregated into 32-bit operators: two 16-bit adders form one 32-bit adder, and four 16-bit multipliers form one 32-bit multiplier, at very low overhead.
  - 16-bit operators are generally sufficient for inference, but training often needs higher precision to preserve accuracy and ensure convergence.
  - Impact of Fixed-Point (Table II): floating-point for both inference and training yields 0.82% error; 16-bit fixed-point inference with floating-point training yields 0.83% (negligible impact); 16-bit fixed-point for both inference and training fails to converge; 16-bit fixed-point inference with 32-bit fixed-point training yields 0.91% (a small impact). The default for training mode is therefore 32-bit operators.

The following are the results from Table II of the original paper:

| Inference | Training | Error |
|---|---|---|
| Floating-Point | Floating-Point | 0.82% |
| Fixed-Point (16 bits) | Floating-Point | 0.83% |
| Fixed-Point (32 bits) | Floating-Point | 0.83% |
| Fixed-Point (16 bits) | Fixed-Point (16 bits) | (no convergence) |
| Fixed-Point (16 bits) | Fixed-Point (32 bits) | 0.91% |

- Tile Data Movement Configurations:
  - Input neurons for a classifier layer can come from the node's central eDRAM (possibly after transfer from another node).
  - They can also come from two 16KB SRAM buffers used for input and output neuron values or for temporary values (such as neuron partial sums, to enable reuse).
  - In the backward (training) phase, the NFU must write to the tile eDRAM after weight updates.
  - During gradient computations, input and output gradients use the same data paths as input and output neurons in the forward phase.
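As a concrete illustration of the fixed-point formats discussed above, here is a small quantization sketch. The paper fixes only the operator widths (16-bit for inference, 32-bit for training); the 8-fractional-bit split below is an assumption for illustration.

```python
FRAC_BITS = 8                      # assumed split: 8 integer + 8 fractional bits
SCALE = 1 << FRAC_BITS

def to_fixed(x: float, bits: int = 16) -> int:
    q = round(x * SCALE)
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, q))     # saturate instead of wrapping on overflow

def to_float(q: int) -> float:
    return q / SCALE

w = 0.7321
wq = to_float(to_fixed(w))
print(wq, abs(w - wq) < 1 / SCALE)   # quantization error below one LSB
```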
4.2.2. Interconnect (V.C.)
Inter-chip (inter-node) communication is crucial for a multi-chip system.
- Communication Volume: Since only neuron values are transferred, and these are heavily reused within each node, inter-node traffic is significant but generally not a bottleneck, except for a few layers and very large systems.
- Off-the-shelf Interconnect: The design does not rely on custom high-speed interconnects; it uses a commercially available HyperTransport (HT) 2.0 IP block.
- Physical Layer: The HT 2.0 PHY (physical-layer interface) used at 28nm is a long, thin strip, typically placed at the die periphery.
- Topology: A simple 2D mesh connects the nodes. A 3D mesh could be more efficient, but is left for future work.
- Link Details: Each chip connects to its four neighbors via four HT 2.0 IP blocks. Each block has 16x HT links (16 pairs of differential outgoing and incoming signals) operating at 1.6GHz.
  - The HT block is connected to the central eDRAM via a 128-bit, 4-entry, asynchronous FIFO.
  - Each HT block provides 6.4GB/s of bandwidth in each direction.
  - HT 2.0 latency between two neighboring nodes is about 80ns.
- Router: Next to the central block, a router is implemented (Figure 5).
  - It uses wormhole routing.
  - It has five input/output ports (the four directions plus an injection/ejection port).
  - Each input port contains 8 virtual channels, with 5 flit slots per VC.
  - A 5x5 crossbar connects all input/output ports.
  - The router pipeline has four stages: routing computation (RC), VC allocation (VA), switch allocation (SA), and switch traversal (ST).
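A quick back-of-the-envelope check of the quoted per-block link bandwidth, assuming double-data-rate signaling on the 16 lanes (an assumption consistent with HyperTransport 2.0, though not spelled out above):

```python
lanes = 16
clock_hz = 1.6e9
transfers_per_clock = 2                  # DDR signaling (assumption)
bytes_per_s = lanes * clock_hz * transfers_per_clock / 8
print(bytes_per_s / 1e9)                 # -> 6.4 GB/s per direction per HT block
```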
4.2.3. Overall Characteristics (V.D.)
The architecture's characteristics are summarized in Table III.
- Frequency: The NFU is clocked at 606MHz, matching the eDRAM frequency in the 28nm technology. This is a conservative choice (DianNao's NFU ran at 0.98GHz at 65nm); a faster NFU with asynchronous communication is left for future work.
- Node Capacity: 16 tiles per node.
  - Each tile eDRAM bank: 1024 rows of 4096 bits.
  - Total tile eDRAM per tile: 2MB (4 banks).
  - Central eDRAM per node: 4MB.
  - Total node eDRAM capacity: 36MB (16 x 2MB + 4MB).
- Peak Performance:
  - 16-bit operation: 16 tiles x (256+32) multipliers and (256+32) adders per tile, i.e., 9216 16-bit operators per node at 606MHz, for a peak of roughly 5.58 tera 16-bit operations per second. (The 288 per block is 256 core operators plus 32 for the transfer function and other small operations.)
  - 32-bit operation: a lower peak, because operator aggregation (e.g., two 16-bit adders forming one 32-bit adder) reduces the number of parallel 32-bit operations.

The following are the results from Table III of the original paper:

| Parameters | Settings | Parameters | Settings |
|---|---|---|---|
| Frequency | 606MHz | tile eDRAM latency | ~3 cycles |
| # of tiles | 16 | central eDRAM size | 4MB |
| # of 16-bit multipliers/tile | 256+32 | central eDRAM latency | ~10 cycles |
| # of 16-bit adders/tile | 256+32 | Link bandwidth | 6.4x4GB/s |
| tile eDRAM size/tile | 2MB | Link latency | 80ns |
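Two of the figures above can be re-derived from Table III. The short sketch below recomputes the 36MB node eDRAM capacity and the approximate 16-bit peak throughput, assuming both the multiplier and adder blocks count toward the operator total (consistent with the 9216 operators per node quoted in the results section).

```python
tiles = 16
banks_per_tile, rows, row_bits = 4, 1024, 4096
tile_edram_bytes = banks_per_tile * rows * row_bits // 8      # 2 MB per tile
node_edram_mb = (tiles * tile_edram_bytes) / 2**20 + 4        # + 4MB central
print(node_edram_mb)                                          # -> 36.0 MB

ops_per_tile = (256 + 32) * 2        # 16-bit multipliers + adders per tile
peak_16bit_ops = tiles * ops_per_tile * 606e6
print(peak_16bit_ops / 1e12)         # ~5.6 tera 16-bit operations per second
```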
4.2.4. Programming, Code Generation and Multi-Node Mapping (V.E.)
4.2.4.1. Programming, Control and Code Generation (V.E.1.)
- System ASIC View: The architecture is treated as a system ASIC, implying low programming complexity: the system is mainly configured, and input data is fed to it.
- Code Generator: A code generator produces one sequence of node instructions per node; this sequence configures the neural network.
- Input Data Partitioning: The initial input layer values are partitioned across nodes and stored in a central eDRAM bank.
- Example: CLASS2 Inference (Table V):
  - Table V of the paper lists the node instructions for a CLASS2 layer (Ni = No = 4096) on a 4-node system; a short partitioning sketch follows in the code block after this list.
  - Output neurons are partitioned into 256-bit data blocks (each block holds 16 neurons at 16-bit precision).
  - Each node is allocated a quarter of the output data blocks, and each node stores a quarter of all input neurons.
  - Each tile is allocated 4 output data blocks, resulting in 4 instructions per node.
  - An instruction loads 128 input data blocks from the central eDRAM to the tiles.
  - The first three instructions: all tiles receive the same input neurons, read synaptic weights from their local (tile) eDRAM, and write partial sums of output neurons to their local NBout SRAM.
  - The last instruction: the NFU in each tile finalizes the sums, applies the transfer function, and stores the output values back to the central eDRAM.
- Control Flow: Node instructions drive the control of each tile; the node control circuit generates tile instructions and sends them to each tile.
- Instruction Granularity: A node or tile instruction performs the same layer computation (e.g., multiply-add-transfer for classifier layers) on a set of contiguous input data, characterized by a start address, a step, and a number of iterations (this is the information encoded by the node instruction format of Table IV in the paper).
- Operating Modes:
  - Processing one row at a time: standard operation.
  - Batch learning [48]: multiple rows (multiple instances of the same layer for different input data) are processed simultaneously. This improves synapse reuse and is common for stable gradient descent, though it can slow convergence and requires more memory capacity.
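The block arithmetic behind the CLASS2 example can be reproduced with a few lines; this sketch assumes 16-bit neuron values, so a 256-bit data block holds 16 neurons.

```python
neurons_per_block = 256 // 16                       # 16-bit values per 256-bit block
total_output_blocks = 4096 // neurons_per_block     # 256 blocks for No = 4096
blocks_per_node = total_output_blocks // 4          # 64 blocks on each of 4 nodes
blocks_per_tile = blocks_per_node // 16             # 4 blocks -> 4 instructions/node
inputs_per_node = 4096 // 4                         # a quarter of the input neurons
print(total_output_blocks, blocks_per_node, blocks_per_tile, inputs_per_node)
```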
4.2.4.2. Multi-Node Mapping (V.E.2.)
- Layer Chaining: The output neurons of one layer become the input neurons of the next. At the start of a layer, the input neurons are distributed across all nodes as 3D rectangles representing feature maps.
- Input Distribution within a Node: These input neurons are first distributed to all of the node's tiles via the internal fat-tree network.
- Inter-Node Distribution: Simultaneously, the node control sends blocks of input neurons to other nodes through the mesh interconnect.
- Communication Patterns for Different Layer Types (Figure 8):
  - Convolutional and Pooling Layers: characterized by local connectivity (small kernels/windows), so inter-node communication is very low and occurs mostly at the border of the layer rectangle mapped to each node; most communication is intra-node.
  - Local Response Normalization (LRN) Layers: since all feature maps at a given location are mapped to the same node, there is no inter-node communication.
  - Classifier Layers: can have high inter-node communication, because each output neuron typically uses all input neurons, which may reside on different nodes; the communication pattern is a simple broadcast.
    - To manage this, a computing-and-forwarding communication scheme [24] is used, arranging node communications in a regular ring pattern. A node starts processing a newly arrived block of input neurons as soon as it has finished its own computation and forwarded its previous block, so the decision is made locally, without a global synchronization mechanism (see the sketch after this list).

(Figure 8 of the original paper: on the left, the mapping of a convolutional or pooling layer with 4 feature maps, with the highlighted region indicating the input neurons used by node 0; on the right, a classifier layer, which implies full connectivity across nodes.)
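Below is a toy, software-only model of the computing-and-forwarding ring for classifier layers: each node repeatedly consumes the input-neuron block it currently holds and forwards it to its ring neighbor, so every block eventually visits every node. For simplicity each node computes a single output neuron; all names are hypothetical and timing/overlap details are omitted.

```python
def ring_classifier(node_inputs, weights):
    """node_inputs[k]: the input-neuron block initially held by node k.
    weights[k][j]: node k's weights for input block j (same length as a block).
    Each node accumulates one scalar output (a single output neuron)."""
    n = len(node_inputs)
    sums = [0.0] * n
    held = [(k, node_inputs[k]) for k in range(n)]      # (block id, data) per node
    for _ in range(n):                                  # every block visits every node
        for k, (j, block) in enumerate(held):
            sums[k] += sum(w * x for w, x in zip(weights[k][j], block))
        held = [held[(k - 1) % n] for k in range(n)]    # forward to the next node
    return sums

# Two nodes, two-element blocks, arbitrary illustrative weights.
print(ring_classifier([[1.0, 2.0], [3.0, 4.0]],
                      [[[0.1, 0.1], [0.2, 0.2]], [[0.3, 0.3], [0.4, 0.4]]]))
```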
5. Experimental Setup
5.1. Datasets
The experiments primarily use benchmarks derived from large, state-of-the-art CNN and DNN models rather than raw datasets such as ImageNet; the paper benchmarks individual layers and one full network.
- Layer Benchmarks: A sample of 10 of the largest known layers of each type (CONV, POOL, LRN, CLASS, plus convolutional layers with private kernels, marked with *) is used. These layers come from applications such as object recognition, speech recognition, natural image processing, street scene parsing, and face detection in YouTube videos.
  - Example Data Sample: the paper does not ship raw images or speech; the inputs to these layers are neuron activities, which may represent pixels, feature maps, or more abstract representations of images, objects, or speech components.
  - Rationale: these layers are the most computationally and memory-intensive components of modern DNNs and CNNs, which makes them ideal for evaluating the performance and energy efficiency of specialized hardware.

The following are the results from Table I of the original paper:

| Layer | Nx | Ny | Kx | Ky | Ni (or Nif) | No (or Nof) | Synapses | Description |
|---|---|---|---|---|---|---|---|---|
| CLASS1 | - | - | - | - | 2560 | 2560 | 12.5MB | Object recognition and speech recognition tasks (DNN) [11]. |
| CLASS2 | - | - | - | - | 4096 | 4096 | 32MB | Multi-object recognition in natural images (DNN), winner of the 2012 ImageNet competition [32]. |
| CONV1 | 256 | 256 | 11 | 11 | 256 | 384 | 22.69MB | Same as above [32]. |
| POOL2 | 256 | 256 | 2 | 2 | 256 | 256 | - | Same as above [32]. |
| LRN1 | 55 | 55 | - | - | 96 | 96 | - | Same as above [32]. |
| LRN2 | 27 | 27 | - | - | 256 | 256 | - | Same as above [32]. |
| CONV2 | 500 | 375 | 9 | 9 | 32 | 48 | 0.24MB | Street scene parsing (CNN), e.g., identifying buildings, vehicles, etc. [18]. |
| POOL1 | 492 | 367 | 2 | 2 | 12 | 12 | - | Same as above [18]. |
| CONV3* | 200 | 200 | 18 | 18 | 8 | 8 | 1.29GB | Face detection in YouTube videos (DNN), Google [34]. |
| CONV4* | 200 | 200 | 20 | 20 | 3 | 18 | 1.32GB | YouTube video object recognition, largest NN to date [8]. |

- Full Neural Network Benchmark: the winning CNN of the ImageNet 2012 competition [32] (Krizhevsky et al.) is used. The network comprises the following layers:
  - CONV (224,224,11,11,3,96)
  - POOL (55,55,3,3,96,96)
  - CONV (27,27,5,5,96,256)
  - POOL (27,27,3,3,256,256)
  - CONV (13,13,3,3,256,384)
  - CONV (13,13,3,3,384,384)
  - CONV (13,13,3,3,384,256)
  - CLASS (-,9216,4096)
  - CLASS (-,4096,4096)
  - CLASS (-,4096,1000)
  - Parameters: all convolutional layers use a stride of 1 except the first CONV layer; pooling layers use strides equal to their kernel dimensions; LRN parameters follow [32].
  - Rationale: this full network is a complete, real-world, state-of-the-art CNN pipeline, enabling a comprehensive evaluation of the architecture's performance and scalability across diverse layer types.
- Pre-training Benchmarks: Restricted Boltzmann Machines (RBMs) [45], a popular pre-training method for initializing synaptic weights, are applied to the CLASS1 and CLASS2 layers, yielding the RBM1 and RBM2 benchmarks.
  - Rationale: this evaluates the architecture's efficiency not just for inference but also for the training process, which can be very time-consuming and has different computational demands.
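As a sanity check on Table I, the classifier-layer synapse footprints follow directly from Ni x No 16-bit weights (a 2-byte-per-synapse assumption consistent with the inference precision used elsewhere):

```python
def class_layer_synapses_mb(ni: int, no: int, bytes_per_synapse: int = 2) -> float:
    return ni * no * bytes_per_synapse / 2**20

print(class_layer_synapses_mb(2560, 2560))   # CLASS1 -> 12.5 MB
print(class_layer_synapses_mb(4096, 4096))   # CLASS2 -> 32.0 MB
```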
5.2. Evaluation Metrics
The paper evaluates the DaDianNao architecture primarily using Speedup and Energy Reduction relative to a GPU baseline; the Area and Power Consumption of the node are also reported.
- Speedup (S)
  - Conceptual Definition: speedup quantifies how much faster a task runs on one system than on another; it is a ratio indicating performance improvement.
  - Mathematical Formula: $ S = \frac{T_{baseline}}{T_{proposed}} $
  - Symbol Explanation:
    - $S$: the speedup of the proposed system over the baseline.
    - $T_{baseline}$: execution time of the task on the baseline system (e.g., the GPU).
    - $T_{proposed}$: execution time of the same task on the proposed system (DaDianNao).
- Energy Reduction (ER)
  - Conceptual Definition: energy reduction measures how much less energy one system consumes than another for the same task; it reflects improved energy efficiency.
  - Mathematical Formula: $ ER = \frac{E_{baseline}}{E_{proposed}} $
  - Symbol Explanation:
    - $ER$: the energy reduction of the proposed system over the baseline.
    - $E_{baseline}$: total energy consumed by the baseline system (e.g., the GPU) to complete the task; energy is typically power multiplied by time.
    - $E_{proposed}$: total energy consumed by the proposed system (DaDianNao) for the same task.
- Area (A)
  - Conceptual Definition: the physical silicon area occupied by the circuit, typically measured in mm² (or µm²); a critical metric for manufacturing cost and packaging. It is reported directly rather than computed from a formula.
- Power Consumption (P)
  - Conceptual Definition: the rate at which the circuit uses energy, measured in Watts (W); lower power is crucial for energy efficiency and thermal management. It is also reported directly.
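A minimal example of how the two ratios are computed; the times and powers below are made-up numbers purely for illustration.

```python
def speedup(t_baseline_s: float, t_proposed_s: float) -> float:
    return t_baseline_s / t_proposed_s

def energy_reduction(p_baseline_w: float, t_baseline_s: float,
                     p_proposed_w: float, t_proposed_s: float) -> float:
    return (p_baseline_w * t_baseline_s) / (p_proposed_w * t_proposed_s)

print(speedup(10.0, 0.05))                        # 200x (toy numbers)
print(energy_reduction(225.0, 10.0, 16.0, 0.05))  # ~2812x (toy numbers)
```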
5.3. Baselines
The paper primarily compares DaDianNao against two baselines: a modern GPU and, for reference, a CPU with SIMD capabilities. It also uses DianNao as a point of comparison for the single-chip accelerator approach.
- GPU Baseline:
  - Hardware: NVIDIA K20M GPU.
  - Specifications: 5GB of GDDR5 memory, 208 GB/s memory bandwidth, 3.52 TFlops peak performance, manufactured in a 28nm technology.
  - Implementation: CUDA versions of the neural network layers were implemented with CUDA SDK 5.5, derived from the tuned open-source CUDA Convnet [31].
  - Rationale: GPUs were the most favored, state-of-the-art platform for CNNs and DNNs at publication time due to their parallel processing capabilities; the K20M was a high-end GPU representing the best performance achievable with general-purpose parallel hardware. Its power usage was also monitored.
- CPU (SIMD) Baseline:
  - Hardware: Intel Xeon E5-4620 (Sandy Bridge-EP).
  - Specifications: 2.2GHz clock, 1TB memory, 256-bit SIMD (Single Instruction, Multiple Data) support.
  - Implementation: a C++ version of the neural network layers.
  - Rationale: this baseline quantifies the GPU's gain over a traditional CPU. The SIMD version was confirmed to be 4.07x faster than a non-SIMD version, indicating a well-optimized CPU implementation; the GPU achieved an average 58.82x speedup over this SIMD CPU, in line with state-of-the-art GPU acceleration results.
- DianNao Accelerator (Conceptual Baseline):
  - Hardware: the DianNao accelerator [5].
  - Specifications: a small-footprint 65nm design clocked at 0.98GHz.
  - Implementation: a cycle-level, bit-level version of DianNao was re-implemented using the memory latency parameters of the original article.
  - Rationale: this comparison highlights the potential efficiency of custom architectures (47.91% of the GPU's performance in 0.53% of its area) but also underscores the memory bandwidth limitation that DaDianNao aims to overcome.
5.4. Measurements
The methodology for obtaining results combines industry-standard CAD tools for physical design with detailed simulation.
- CAD Tools for Physical Design:
  - Verilog Implementation: the node was first implemented in Verilog.
  - Synthesis and Layout: standard CAD tools were used for synthesis and layout (physical design).
  - Technology Node: ST 28nm Low Power (LP) technology, operating at 0.9V.
  - Tools Used: Synopsys Design Compiler for synthesis, ICC Compiler for layout, and Synopsys PrimeTime PX for power estimation, providing accurate post-layout area, energy, and critical-path delay measurements.
- Simulation for Time, eDRAM, and Inter-Node Measurements:
  - RTL Simulation: VCS was used to simulate the Register-Transfer Level (RTL) design of the node.
  - eDRAM Model: a custom eDRAM model incorporates realistic characteristics such as destructive reads and periodic refresh; the eDRAM is modeled as banked and running at 606MHz. eDRAM energy was obtained with CACTI 5.3 [1] after integrating 1T1C cell characteristics specific to the 28nm technology [25].
  - Inter-Node Communication Simulation: Booksim 2.0 [10], a cycle-level interconnection network simulator, models inter-node communication, and Orion 2.0 [29] provides the network energy model.
- GPU Measurements:
  - Power Usage: the NVIDIA K20M reports its own power usage.
  - Code Compilation: CUDA SDK 5.5 was used to compile the CUDA versions of the neural network codes.
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Main Characteristics of the Node Layout (VII.A.)
The physical implementation of a single DaDianNao node at 28nm provides concrete insight into its area and power consumption.
- Chip Layout (Figure 9): a snapshot of the node layout shows the distribution of the components: a central block surrounded by the 16 tiles and their controllers, with the HT IPs on the periphery, confirming the tiled, memory-rich design.

(Figure 9 of the original paper: a snapshot of the node layout, with the central block in the middle, tiles and controllers around it, and HT interfaces at the edges.)
- Area Breakdown (Table VI): the total chip area is 67,732,900 µm² (about 67.7 mm²).
  - Tiles: the 16 tiles consume 44.53% of the chip area, a substantial portion reflecting the distributed computational and storage units.
  - HT IPs: the four HyperTransport IPs consume 26.02% of the area, showing the significant footprint of high-speed inter-chip communication.
  - Central Block: the central block (4MB eDRAM, router, and control logic) takes 11.66% of the area.
  - Wires: the wires connecting the central block and the tiles occupy 8.97% of the area, confirming that the tiled design reduces the wiring overhead of the monolithic approach discussed in the Methodology section.
  - Memory vs. Logic: about half (47.55%) of the chip area is memory cells (primarily eDRAM), emphasizing the memory-centric design; combinational logic and registers account for 5.88% and 4.94%, respectively.
- Power Consumption:
  - Peak Power: 15.97W (at a pessimistic 100% toggle rate), roughly 5-10% of a state-of-the-art GPU card (the NVIDIA K20M has a 225W TDP), highlighting the design's energy efficiency.
  - Power Breakdown:
    - HT IPs: about half (50.14%) of total power, making inter-chip communication the most power-hungry component.
    - Tiles: over one third (38.53%) of total power.
    - Memory cells: 38.30% of total power (tile eDRAMs plus central eDRAM).
    - Combinational logic: 37.97% (mostly the NFUs and HT protocol analyzers).
    - Registers: 19.25%.

The following are the results from Table VI of the original paper:

| Component/Block | Area (µm²) | (%) | Power (W) | (%) |
|---|---|---|---|---|
| Whole Chip | 67,732,900 | | 15.97 | |
| Central Block | 7,898,081 | (11.66%) | 1.80 | (11.27%) |
| Tiles | 30,161,968 | (44.53%) | 6.15 | (38.53%) |
| HTs | 17,620,440 | (26.02%) | 8.01 | (50.14%) |
| Wires | 6,078,608 | (8.97%) | 0.01 | (0.06%) |
| Other | 5,973,803 | (8.82%) | | |
| Combinational | 3,979,345 | (5.88%) | 6.06 | (37.97%) |
| Memory | 32,207,390 | (47.55%) | 6.12 | (38.30%) |
| Registers | 3,348,677 | (4.94%) | 3.07 | (19.25%) |
| Clock network | 586,323 | (0.87%) | 0.71 | (4.48%) |
| Filler cell | 27,611,165 | (40.76%) | | |
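The area percentages quoted above can be re-derived from the absolute values in Table VI:

```python
total_um2 = 67_732_900
for name, area_um2 in [("Tiles", 30_161_968), ("HTs", 17_620_440),
                       ("Central Block", 7_898_081), ("Wires", 6_078_608)]:
    print(f"{name}: {100 * area_um2 / total_um2:.2f}%")
# -> 44.53%, 26.02%, 11.66%, 8.97%, matching Table VI.
```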
6.1.2. Performance (VII.B.)
The performance evaluation focuses on the speedup of DaDianNao over the GPU baseline for both inference and training across various neural network layers and different numbers of nodes.
-
Inference Speedup (Figure 10):
- Average Speedup:
1-node:21.38xfaster thanGPU.4-node:79.81xfaster.16-node:216.72xfaster.64-node:450.65xfaster.
- Reasons for High Performance:
- Large Number of Operators: Each
nodehas9216 operators(multipliers and adders) compared to2496 MACs(Multiply-Accumulateunits) in theGPU. - On-chip eDRAM Bandwidth: The
on-chip eDRAMprovides the necessary high bandwidth and low-latency access to keep these many operators fed with data.
- Large Number of Operators: Each
- Layer-Specific Node Requirements:
-
CONV1: Requires a4-node systemdue to its large memory footprint (22.69 MBforsynapses,32 MBfor inputs,44.32 MBfor outputs, totaling99.01 MB, exceeding36 MBper node). -
and (with private kernels): Need a
36-node systemas their sizes are1.29 GBand1.32 GBrespectively. -
Full NN: Requires at least4 nodes(total59.48 M synapses,118.96 MBdata).
该图像是一个图表,展示了不同芯片数量下(1、4、16和64芯片)在多种神经网络层(如CLASS1、CLASS2、CONV1、CONV2等)上的加速比(Speedup)。可以观察到,随着芯片数量的增加,加速效果显著提升。
-
The image above is Figure 10 from the original paper, showing the
Speeduprelative to theGPU baselineforinference. Different bar colors represent1, 4, 16, and 64 nodes. It clearly illustrates significantspeedupfor64 nodes, withCONV1, , , andfull NNrequiring multi-node systems. - Average Speedup:
-
Scalability of Layers:
- LRN Layers: Scale best (
no inter-node communication), achieving up to1340.77xforLRN2with64 nodes. - CONV and POOL Layers: Scale almost as well (
inter-node communicationonly onborder elements).CONV1achieves2595.23xfor64 nodes. However, their actual speedup can be lower thanCONVdue to being less computationally intensive. - CLASS Layers: Scale less well (
e.g., 72.96xforCLASS1with64 nodes) due tohigh inter-node communications, as eachoutput neuronusesall input neuronsfrom differentnodes.
- LRN Layers: Scale best (
-
Time Breakdown (Figure 11):
-
The breakdown shows
communicationvs.computationfor variousnodecounts andlayer types.CLASSlayers show a higher proportion of time spent incommunicationas the number ofnodesincreases. -
This
communication issueforCLASS layersis attributed to thesimple 2D mesh topology. A more sophisticatedmulti-dimensional torus topologycould potentially reducebroadcast timefor largernodecounts.
该图像是性能分析图,展示了不同神经网络结构在通信和计算方面的比例以及各组件的使用情况。左侧显示了CLASS、CONV、full NN和平均值的通信与计算百分比,右侧则针对NFU、eDRAM、Router和HT组件的使用情况进行了可视化。
The image above is Figure 11 from the original paper, showing the
Time breakdown. The left chart displayscommunicationandcomputationproportions for4, 16, and 64 nodesforCLASS,CONV,POOL,LRN(geometric means), and theglobal geometric mean. The right chart shows the breakdown ofNFU,eDRAM,Router, andHTusage for1, 4, 16, 64 nodes. -
- Full NN Scaling: The full NN scales similarly to the CLASS layers (63.35x for 4 nodes, 116.85x for 16 nodes, 164.80x for 64 nodes). This is not because CLASS layers dominate the execution time, but because the CNN layers in this particular benchmark are relatively small for a 64-node system, leading to inefficient mapping or frequent inter-node communication for kernel computations. The following are the results from Table VII of the original paper:

  | | CONV | LRN | POOL | CLASS |
  | --- | --- | --- | --- | --- |
  | 4-node | 96.63% | 0.60% | 0.47% | 2.31% |
  | 16-node | 96.87% | 0.28% | 0.22% | 2.63% |
  | 64-node | 92.25% | 0.10% | 0.08% | 7.57% |
- Training and Initialization Speedup (Figure 12):
  - Average Speedup: 1-node: 12.62x faster; 4-node: 43.23x; 16-node: 126.66x; 64-node: 300.04x.
  - Comparison to Inference: Speedups are high but lower than for inference, mainly due to operator aggregation: using 32-bit operators for training means fewer parallel operations (see the back-of-the-envelope sketch after this list).
  - CLASS Layer Scalability: The training phase of CLASS layers scales better than inference because training performs almost twice the computation for the same amount of communication (e.g., backpropagation).
  - RBM Initialization: Its scalability is similar to that of the CLASS layers in the inference phase.
  - The image above is Figure 12 from the original paper, showing the speedup relative to the GPU baseline for training; bar colors represent 1, 4, 16, and 64 nodes. As with inference, the 64-node system shows the most significant speedup across the various layers.
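As a back-of-the-envelope check on the operator-aggregation point, here is my own arithmetic using the 9216 operators per node and the 606 MHz NFU clock quoted elsewhere in this analysis; the assumption that training fuses pairs of 16-bit datapaths into single 32-bit operators is used purely for illustration.

```python
OPERATORS_16BIT = 9216   # multipliers and adders per node (quoted above)
NFU_CLOCK_HZ = 606e6     # conservative NFU clock quoted in the future-work section

# Peak 16-bit throughput of one node (inference), in operations per second.
peak_16bit = OPERATORS_16BIT * NFU_CLOCK_HZ
print(f"16-bit peak: {peak_16bit / 1e12:.2f} Tops/s")   # ~5.58 Tops/s

# If training aggregates two 16-bit datapaths into one 32-bit operator
# (an assumption made here for illustration), peak throughput roughly halves,
# which is consistent with training speedups trailing inference speedups.
peak_32bit = peak_16bit / 2
print(f"32-bit peak: {peak_32bit / 1e12:.2f} Tops/s")   # ~2.79 Tops/s
```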
6.1.3. Energy Consumption (VII.C.)
The energy consumption analysis highlights DaDianNao's significant efficiency gains.
- Inference Energy Reduction (Figure 13):
  - Average Energy Reduction: 1-node: 330.56x; 4-node: 323.74x; 16-node: 276.04x; 64-node: 150.31x (a simple decomposition of these ratios follows this list).
  - Range: The minimum energy improvement is 47.66x (for CLASS1 with 64 nodes), while the best is 896.58x (for CONV2 on a single node).
  - Scalability Trend: The energy benefit remains relatively stable for convolutional, pooling, and LRN layers as the node count scales up, but degrades for classifier layers. This degradation is attributed to the increased communication time, again suggesting that a multi-dimensional torus could help.
  - Energy Breakdown (Figure 11, right):
    - 1-node architecture: The NFU consumes about 83.89% of the energy.
    - 64-node system: The share of energy spent in HT (HyperTransport) progressively rises to 29.32% on average, and to 48.11% for classifier layers, due to larger communication overheads.
  - The image above is Figure 13 from the original paper, showing the energy reduction relative to the GPU baseline for inference. It indicates substantial energy savings; 1-node systems often achieve larger energy reductions than multi-node systems for certain layers, but the benefits are consistent overall.
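One way to relate the speedup and energy numbers is the small sketch below, which is my own illustration rather than a formula from the paper: treating energy as average power times execution time, the energy ratio factors into the speedup times the ratio of average powers. Both power values in the example are placeholders.

```python
def energy_reduction(speedup: float, p_gpu_watts: float, p_accel_watts: float) -> float:
    """Energy ratio E_gpu / E_accel under the approximation E = P_avg * T.

    E_gpu / E_accel = (P_gpu * T_gpu) / (P_accel * T_accel)
                    = speedup * (P_gpu / P_accel)

    The paper reports measured energy rather than this decomposition; the
    function only shows how speedup and power advantage combine."""
    return speedup * (p_gpu_watts / p_accel_watts)

# Illustrative only: a 1-node speedup of 21.38x combined with a hypothetical
# 15x average-power advantage already gives a ~320x energy reduction, in the
# same ballpark as the 330.56x reported for 1 node.
print(f"{energy_reduction(21.38, p_gpu_watts=240.0, p_accel_watts=16.0):.1f}x")
```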
- Training and Initialization Energy Reduction (Figure 14):
  - Average Energy Reduction: 1-node: 172.39x; 4-node: 180.42x; 16-node: 142.59x; 64-node: 66.94x.
  - Scalability: The behavior is similar to that of the inference phase; the energy reduction also decreases as the node count increases, particularly for CLASS layers, due to communication.
  - The image above is Figure 14 from the original paper, showing the energy reduction relative to the GPU baseline for training. The trend mirrors inference: the energy reduction decreases with increasing node count but still offers substantial savings.
6.2. Data Presentation (Tables)
All tables were transcribed and presented in the Methodology and Experimental Setup sections.
6.3. Ablation Studies / Parameter Analysis
While not explicitly termed "ablation studies," the paper conducts several analyses that serve a similar purpose by varying key architectural parameters and evaluating their impact:
- Fixed-Point Precision Analysis (Table II): This directly evaluates the impact of using different fixed-point bit-widths (16-bit, 32-bit) for inference and training on the neural network's error and convergence. It shows that 16-bit fixed-point is sufficient for inference but causes training to fail (no convergence), while 32-bit fixed-point for training yields accuracy comparable to floating-point. This justifies the NFU's configurability to use different precisions depending on the execution phase (a small quantization sketch follows these analyses).
- Number of Nodes Scalability Analysis (Figures 10, 12, 13, 14): The core experimental results systematically show how performance (speedup) and energy consumption (reduction) change as the number of nodes in the system increases (1, 4, 16, 64 nodes). This demonstrates the scalability of the proposed multi-chip architecture and highlights layer-specific scaling behaviors (e.g., LRN scales well, CLASS scales less well due to communication).
- Monolithic vs. Tiled Design (Floorplan analysis in V.B.1 and V.B.2): The paper implicitly performs an "ablation" by discussing an initial monolithic NFU design that suffered from extreme wire congestion (Figure 4). The subsequent adoption of the tile-based design (Figure 5) is presented as a solution that reduces area and improves internal bandwidth, justifying this key architectural choice.
- Communication vs. Computation Breakdown (Figure 11): This analysis implicitly reveals how the bottleneck shifts as the system scales. For CLASS layers on larger node counts, communication time becomes a more significant factor, indicating limitations of the 2D mesh topology for certain workloads. This parameter analysis informs future work directions (e.g., a multi-dimensional torus).
- Component Power Breakdown (Table VI, Figure 11 right): Analyzing how power is distributed among tiles, HT IPs, memory, and logic provides insight into where energy is consumed most. For instance, the high power of the HT IPs highlights the cost of inter-chip communication, even when only neurons are moved.

These analyses are crucial for understanding the trade-offs and design decisions behind DaDianNao, and for guiding future improvements.
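To make the precision discussion more concrete, here is a minimal fixed-point quantization sketch of my own (not a description of the NFU's actual number format): values are scaled, rounded, and clamped to a signed range, and the rounding error on tiny weight updates is what 16-bit training fails to tolerate. The choices of fractional bits below are assumptions for illustration.

```python
def quantize_fixed_point(x: float, total_bits: int, frac_bits: int) -> float:
    """Quantize x to signed fixed-point with `total_bits` bits overall and
    `frac_bits` fractional bits, then convert back to float.

    Generic illustration of fixed-point rounding and saturation; the actual
    NFU format is not specified here."""
    scale = 1 << frac_bits
    q_min = -(1 << (total_bits - 1))
    q_max = (1 << (total_bits - 1)) - 1
    q = max(q_min, min(q_max, round(x * scale)))
    return q / scale

w = 0.0123456789
print(quantize_fixed_point(w, total_bits=16, frac_bits=10))  # coarse: steps of 2**-10
print(quantize_fixed_point(w, total_bits=32, frac_bits=26))  # fine:   steps of 2**-26
# Training accumulates many tiny weight updates; with 16-bit precision most of
# them round away, which is consistent with the reported failure to converge.
```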
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces DaDianNao, a pioneering custom multi-chip architecture engineered to address the computational and memory challenges posed by state-of-the-art machine-learning algorithms such as Convolutional Neural Networks (CNNs) and Deep Neural Networks (DNNs). Recognizing that these algorithms are increasingly central to modern services and typically suffer from memory bandwidth limitations in GPU and single-chip accelerator implementations, DaDianNao proposes a multi-chip system where the entire neural network model's memory footprint can reside in on-chip eDRAM distributed across interconnected nodes.
The key architectural innovations include:
- An asymmetric node design heavily biased towards on-chip eDRAM storage to minimize data movement.
- A neuron-centric communication strategy that transfers a small number of neuron values instead of numerous synaptic weights (a brief data-movement comparison follows this summary).
- A tile-based architecture with a fat tree interconnect for high internal bandwidth and efficient data distribution within a node.
- A configurable Neural Functional Unit (NFU) that can adapt to different layer types and support both inference and training with appropriate fixed-point precision.

The experimental results demonstrate remarkable performance and energy efficiency. A 64-node DaDianNao system achieves an average speedup of 450.65x over a GPU baseline and reduces energy by 150.31x. A single node, implemented at 28nm, occupies 67.73 mm² and consumes 15.97 W of peak power, confirming the feasibility and efficiency of the design. While scalability varies per layer type (e.g., classifier layers are more sensitive to inter-node communication), the system consistently outperforms GPUs by a significant margin.
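The neuron-centric communication point can be quantified with a small sketch of my own, using hypothetical layer sizes: for a fully connected layer, shipping the input and output neuron values to the node that holds the weights moves orders of magnitude less data than shipping the synaptic weights to the data.

```python
def transfer_bytes(n_in: int, n_out: int, bytes_per_value: int = 2) -> dict:
    """Compare data moved for one fully connected layer when transferring
    neuron values versus synaptic weights (16-bit values assumed)."""
    return {
        "neurons_moved_bytes": (n_in + n_out) * bytes_per_value,  # inputs in, outputs out
        "synapses_moved_bytes": n_in * n_out * bytes_per_value,   # full weight matrix
    }

# Hypothetical classifier layer: 4096 inputs, 4096 outputs.
stats = transfer_bytes(4096, 4096)
print(stats["neurons_moved_bytes"])    # 16 KB of neuron traffic
print(stats["synapses_moved_bytes"])   # 32 MB of synapse traffic
# Keeping synapses stationary in eDRAM and moving only neuron values is what
# keeps the inter-chip traffic low.
```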
7.2. Limitations & Future Work
The authors candidly acknowledge several limitations and propose clear directions for future work:
- NFU Clock Frequency: The current NFU is clocked conservatively at 606 MHz, matching the eDRAM frequency. Future work includes implementing a faster NFU and investigating asynchronous communication with the eDRAM to push performance further.
- Interconnect Scalability for Classifier Layers: The simple 2D mesh topology used currently can lead to communication bottlenecks and reduced scalability for classifier layers as the number of nodes increases. The authors suggest exploring a more efficient multi-dimensional torus topology to mitigate this.
- Flexible Control: The current control mechanism is based on generated node instructions. Future work aims to incorporate a more flexible control scheme, potentially using a simple VLIW (Very Long Instruction Word) core per node and developing the associated toolchain.
- Hardware Prototyping: A concrete future step is the tape-out (manufacturing) of a node chip, followed by the development of a multi-node prototype to validate the system in real hardware.
7.3. Personal Insights & Critique
This paper represents a significant milestone in the development of specialized hardware for machine learning. Its insights and architectural choices have deeply influenced subsequent AI accelerator designs.
- Pioneering an "ML Supercomputer": At a time when GPUs were becoming dominant, DaDianNao boldly proposed a dedicated, multi-chip system specifically for ML. This concept of an ML supercomputer was visionary, predating the explosion of large AI models and the widespread need for AI clusters. It effectively defined a new category of specialized computing.
- Addressing the Memory Wall, a Core Insight: The central premise that DNN/CNN memory footprints, while large, are manageable with aggregated on-chip storage in a multi-chip system was a profound insight. This diverges from general-purpose computing's memory wall and directly led to the innovative eDRAM-centric, neuron-moving architecture. This focus on localizing weights and minimizing off-chip accesses is now a common theme in AI accelerator design.
- Practicality and Rigor: The paper doesn't just propose a theoretical architecture; it details the implementation down to place and route at 28nm, including CAD tool usage, power breakdowns, and eDRAM modeling. This level of detail provides strong credibility and makes the results highly compelling. The comparison to DianNao and the detailed GPU baselines demonstrate a robust evaluation methodology.
- Influence on Future Work: DaDianNao and its follow-ups (the DianNao series from ICT, CAS) effectively kickstarted a wave of research into dataflow architectures, in-memory computing, and specialized accelerators for deep learning. Concepts like tile-based designs, on-chip memory hierarchies, and configurable compute units became standard features in later AI chips.
- Areas for Improvement / Unverified Assumptions:
  - 2D Mesh Limitation: The authors correctly identify the 2D mesh as a limitation for classifier layers. As AI models grow, the proportion of fully connected layers can increase, making this bottleneck more pronounced. Moving to a 3D torus or other advanced interconnects is critical.
  - Fixed-Point Precision Trade-offs: While 32-bit fixed-point for training was shown to work, later research has explored 16-bit or even 8-bit fixed-point training with specialized techniques. The paper's early findings on fixed-point were foundational, but the field has since advanced in making lower precision viable for training.
  - Software Stack Complexity: Although DaDianNao is presented as a system ASIC with low programming requirements via a code generator, efficiently mapping diverse and evolving neural network models onto such a rigid, specialized architecture remains a continuous challenge. The proposed VLIW core for more flexible control hints at this.
  - Generalizability: While DaDianNao is highly optimized for CNNs and DNNs, its specialization might limit its adaptability to fundamentally new ML algorithms that emerge in the future (e.g., Transformers, Graph Neural Networks) if their computational patterns differ significantly. However, the modular nature of the NFU and the general principles of data locality and on-chip memory are still broadly applicable.

Overall, DaDianNao was a visionary and rigorously executed project that not only delivered impressive performance and efficiency gains for deep learning but also laid crucial groundwork for the entire field of AI hardware acceleration. Its focus on overcoming the memory wall through on-chip distributed memory and neuron-centric data flow was a paradigm shift that continues to resonate today.