
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 29, NO. 10, OCTOBER 2018

High-Performance Mixed-Signal Neurocomputing With Nanoscale Floating-Gate Memory Cell Arrays

Farnood Merrikh-Bayat, Xinjie Guo, Michael Klachko, Mirko Prezioso, Konstantin K. Likharev, Fellow, IEEE, and Dmitri B. Strukov, Senior Member, IEEE

Abstract— Potential advantages of analog- and mixed-signal nanoelectronic circuits, based on floating-gate devices with adjustable conductance, for neuromorphic computing were realized a long time ago. However, practical realizations of this approach suffered from the use of rudimentary floating-gate cells of relatively large area. Here, we report a prototype 28 × 28 binary-input, ten-output, three-layer neuromorphic network based on arrays of highly optimized embedded nonvolatile floating-gate cells, redesigned from a commercial 180-nm NOR flash memory. All active blocks of the circuit, including 101 780 floating-gate cells, have a total area below 1 mm². The network has shown a 94.7% classification fidelity on the common Modified National Institute of Standards and Technology benchmark, close to the 96.2% obtained in simulation. The classification of one pattern takes a sub-1-µs time and a sub-20-nJ energy—both numbers much better than in the best reported digital implementations of the same task. Estimates show that a straightforward optimization of the hardware and its transfer to the already available 55-nm technology may increase this advantage to more than 10²× in speed and 10⁴× in energy efficiency.

Index Terms— Deep learning, floating-gate memory cells, multilayer perceptron, neuromorphic networks, pattern classification.

I. INTRODUCTION

THE concept of using nonvolatile memories in analog- and mixed-signal neuromorphic networks, far superior to digital circuits of the same functionality in speed and energy efficiency, is at least 30 years old [1]. Recent work [2]–[5] has shown that such circuits, utilizing nanoscale devices, may increase neuromorphic network performance dramatically, leaving their digital and biological counterparts far behind and approaching the energy efficiency of the human brain.

Manuscript received February 9, 2017; revised July 21, 2017; accepted November 24, 2017. Date of publication December 22, 2017; date of current version September 17, 2018. This work was supported by the DARPA UPSIDE Program under Contract HR0011-13-C-0051 via BAE Systems, Inc. (Farnood Merrikh-Bayat and Xinjie Guo contributed equally to this work.) (Corresponding author: Dmitri B. Strukov.)

F. Merrikh-Bayat, X. Guo, M. Klachko, M. Prezioso, and D. B. Strukov are with the Electrical Engineering Department, University of California, Santa Barbara, CA 93106-9560 USA (e-mail: [email protected]).

K. K. Likharev is with the Department of Physics and Astronomy, Stony Brook University, Stony Brook, NY 11794-3800 USA (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TNNLS.2017.2778940

Fig. 1. Analog vector-by-matrix multiplication in a crossbar with adjustable crosspoint devices. For clarity, the output signal is shown for just one column of the array.

The basis of these advantages is the fact that in analog circuits, the vector-by-matrix multiplication, i.e., the key operation performed during signal propagation through any neuromorphic network, is implemented on the physical level, in a resistive crossbar circuit, using the fundamental Ohm and Kirchhoff laws (Fig. 1). On the other hand, the basic handicap of analog circuits, their finite precision, is typically not crucial in neuromorphic networks, due to the inherently high tolerance of their operation to synaptic weight variations [6].
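To make this concrete, here is a minimal Python/NumPy sketch of the crossbar operation of Fig. 1, with illustrative (not measured) conductance and voltage values: Ohm's law gives the crosspoint currents, and Kirchhoff's current law sums them along each output wire, so the whole operation reduces to one matrix-vector product.

```python
import numpy as np

# Minimal sketch of analog vector-by-matrix multiplication in a crossbar
# (Fig. 1). All values are illustrative placeholders, not device data.
rng = np.random.default_rng(0)

V = rng.uniform(0.0, 0.1, size=784)           # input voltages on the rows, V
G = rng.uniform(1e-9, 3e-7, size=(64, 784))   # crosspoint conductances, S

# Ohm's law: each crosspoint sources I = G_ji * V_i into its output wire;
# Kirchhoff's current law: the wire current is the sum over all crosspoints.
I = G @ V
print(I.shape)  # (64,) -- one summed output current per wire
```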

The key component of such mixed-signal neuromorphic networks is a device with adjustable (tunable) conductance—essentially an analog nonvolatile memory cell mimicking the biological synapse. Until recently, such devices were implemented mostly as floating-gate "synaptic transistors" [4], [7], which may be fabricated using the standard complementary metal–oxide–semiconductor (CMOS) technology. Some rather sophisticated neuromorphic systems have recently been demonstrated [8], [9] using this approach. However, synaptic transistors have relatively large areas (∼10³ F², where F is the minimum feature size), leading to larger time delays and energy consumption [4].

There have been significant recent advances in the development of alternative nanoscale nonvolatile memory devices, such as phase-change, ferroelectric, and magnetic memories, and memristors; for a review, see [10]–[14]. In particular, these emerging devices have already been used to demonstrate small neuromorphic networks [15]–[19]. However, their fabrication technology still needs much improvement and is not yet ready for the large-scale integration necessary for practically valuable neuromorphic networks.

In this paper, we describe a network prototype based on other alternative devices: the highly optimized, nanoscale, nonvolatile floating-gate memory cells used in the recently developed embedded NOR flash memories [20].



Fig. 2. ESF1 NOR flash memory cells. (a) Cross section of the two-cell "supercell" (schematic) and (b) its equivalent circuit. (c) TEM cross-sectional image of one memory cell, fabricated in a 180-nm process. (d) Drain current of the cell as a function of the gate voltage, at V_DS = 1 V, for several memory states. The gray-shaded region shows the subthreshold conduction region; the currents below I_DS = 10 pA (the level shown with the dashed line) are significantly contributed to by leakages in the experimental setup used for the measurements. Inset: extracted slope of this semilog plot, measured at I_DS = 10 nA, as a function of the memory state (characterized by the corresponding gate voltage).

These cells are quite suitable to serve as adjustable synapses in neuromorphic networks, provided that the memory arrays are redesigned to allow individual, precise adjustment of the memory state of each device. Recently, such a modification was performed [21], [22] on the 180-nm ESF1 embedded commercial NOR flash memory technology of SST Inc. [20] (Fig. 2) and, more recently, on the 55-nm ESF3 technology of the same company [23], with good prospects for scaling down to at least F = 28 nm. Though such modification nearly triples the cell area, the result is still at least an order of magnitude smaller, in terms of F², than a synaptic transistor [4].

The main result reported in this paper is the first successful use of this approach for the experimental implementation of a relatively simple mixed-signal neuromorphic network, which can perform high-fidelity classification of patterns of the standard Modified National Institute of Standards and Technology (MNIST) benchmark with record-breaking speed and energy efficiency.

II. MEMORY ARRAY CHARACTERIZATION

Our network design uses the energy-saving gate coupling [4], [21], [23], [24] of the peripheral and array cells, which works well in the subthreshold mode, with a nearly exponential dependence of the drain current I_DS of the memory cell on the gate voltage V_GS [Fig. 2(d)]:

$$I_{DS} \approx I_0 \exp\left\{\beta\,\frac{V_{GS} - V_t}{V_T}\right\} \qquad (1)$$

where V_t is a threshold voltage that depends on the memory state of the cell (physically, on the electric charge of its floating gate), V_T ≡ k_B T/e is the voltage scale of thermal excitations, equal to ∼26 mV at room temperature, while β < 1 is the dimensionless subthreshold slope d(ln I_DS)/dV_GS, measured in units of V_T and characterizing the efficiency of the gate-to-channel coupling. As the inset in Fig. 2(d) shows, in the ESF1 cells this slope stays relatively constant over a broad range of memory states—a feature enabling the gate-coupled circuit operation. [For lower V_t, the slope becomes higher, apparently due to the specific cell design shown in Fig. 2(a).]

Fig. 3. (a) Results of analog retention measurements for several memory states, performed in the gate-coupled array configuration. There are 1000 points for each state, each point representing an average over 65 samples taken within a 130-ms period. (b) Relative rms variation and the full (peak-to-valley) swing of the currents during the same time interval. Inset: equivalent circuit of the used gate coupling. (c) Spectral density of the cell current's noise measured at room temperature; the gray lines are just guides for the eye, corresponding to S_I ∝ 1/f^1.6.
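For illustration, a short Python sketch of the cell model (1) is given below; the values of I_0, β, and the threshold voltages are hypothetical placeholders rather than measured ESF1 parameters.

```python
import numpy as np

# Sketch of the subthreshold cell model (1). I_0, BETA, and the Vt values
# are hypothetical placeholders, not measured ESF1 parameters.
V_T = 0.026   # thermal voltage k_B*T/e at room temperature, V
BETA = 0.7    # dimensionless subthreshold slope, beta < 1
I_0 = 1e-9    # current prefactor, A

def drain_current(v_gs, v_t):
    """I_DS = I_0 * exp(BETA * (V_GS - V_t) / V_T), per (1)."""
    return I_0 * np.exp(BETA * (v_gs - v_t) / V_T)

# The floating-gate charge sets V_t, shifting the whole semilog I-V curve
# of Fig. 2(d); three memory states read at the same gate voltage:
for v_t in (1.2, 1.5, 1.8):
    print(f"Vt = {v_t} V -> I_DS = {drain_current(1.6, v_t):.3e} A")
```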

With the requirement to keep the relative current fluctuations [Fig. 3(b)] below 1%, the dynamic range of the subthreshold operation is about five orders of magnitude, from ∼10 pA to ∼300 nA, corresponding to a gate voltage swing of ∼1.5 V.

The ESF1 flash technology guarantees a 10-year digital-mode retention at temperatures up to 125 °C [20]. Our experiments have shown that these cells also feature at least a few days of analog-level retention, with very low fluctuations of the output current [see Fig. 3(a)]. (A more extensive test of the analog-level retention [23], performed for the 55-nm ESF3 NOR cells fabricated using a similar technology, has shown no substantial drift of the memory states for almost 1 day, even at an elevated temperature of 85 °C.)

Fig. 4. Network architecture. (a) Typical examples of B/W handwritten digits of the MNIST benchmark set. (b) Graph representation of our three-layer perceptron network. Each synapse is implemented using a differential pair of floating-gate memory cells. (c) High-level architecture, with the weight tuning circuitry for the second array (similar to that of the first one) not shown for clarity. (d) 2 × 2 cell fragment of the first crossbar array shown together with a hidden-layer neuron, consisting of a differential summing operational amplifier pair and an activation-function circuit. (e) 2 × 2 cell fragment of the second crossbar array with an output-layer neuron; these neurons do not implement an activation function. The voltage shifter, shown in (c), enables using voltage inputs of both polarities over a 1.65-V bias, and is also used to initiate the classification process by increasing the input background from 1.8 to 4.2 V.

Other features of the used ESF1 cell arrays, including the details of their modification, switching dynamics and statistics, and a demonstration of fast weight tuning with a ∼0.3% accuracy, were reported earlier [21], [22].

III. NETWORK DESIGN

For the first, proof-of-concept demonstration of this new hardware technology, we have selected the simplest possible neuromorphic network architecture suitable for classification of the most common MNIST benchmark set with reasonable fidelity. The binary inputs of this benchmark simplify the design of the first synaptic array. The implemented network (Fig. 4) was a three-layer (one hidden layer) perceptron with 784 binary inputs b_i, which may represent, for example, 28 × 28 black-and-white pixels of an input image [such as the MNIST data set images illustrated in Fig. 4(a)], 64 hidden-layer neurons with the rectified-tanh activation function, and

ten output neurons [Fig. 4(b)]. The goal of the network is to perform pattern inference via the following sequential transformation of the input signals:

$$h_j = \sum_{i=1}^{784} w^{(1)}_{ji} b_i + w^{(1)}_{j,785}, \qquad c_k = \sum_{j=1}^{64} w^{(2)}_{kj} f(h_j) + w^{(2)}_{k,65} f_{\max}, \qquad f(h) \equiv f_{\max} \times \begin{cases} \tanh(h), & \text{for } h \ge 0 \\ 0, & \text{for } h < 0 \end{cases} \qquad (2)$$

Here, h_j and f_j (with j = 1, 2, …, 64) are, respectively, the input and output signals of the hidden-layer neurons, c_k (with k = 1, 2, …, 10) are the output signals, providing the class of the input pattern, while w^(1) and w^(2) are the two matrices of tunable synaptic weights characterizing the coupling of the adjacent network layers. In our network, these weights are provided by the floating-gate cells of two crossbar arrays [Fig. 4(c)]. Each neuron also gets an additional input from a bias node, with a tunable weight based on a similar cell [Fig. 4(b)]. With the differential-pair implementation of each synapse (see below), the total number of utilized floating-gate memory cells is 2 × [(28 × 28 + 1) × 64 + (64 + 1) × 10] = 101 780.
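A software sketch of this transform (in Python/NumPy, with random stand-in weights instead of the trained ones) makes the data flow of (2), including the bias nodes and the cell count, explicit:

```python
import numpy as np

# Forward pass of the 784-64-10 perceptron of (2). The weights here are
# random stand-ins for the trained values imported into the chip.
rng = np.random.default_rng(1)
w1 = rng.normal(0.0, 0.05, size=(64, 785))  # hidden weights + bias column
w2 = rng.normal(0.0, 0.05, size=(10, 65))   # output weights + bias column
F_MAX = 1.0                                 # activation ceiling (300 nA on chip)

def f(h):
    """Rectified tanh of (2): f_max * tanh(h) for h >= 0, else 0."""
    return np.where(h >= 0.0, F_MAX * np.tanh(h), 0.0)

def classify(b):
    """b: 784-element binary (0/1) pixel vector; returns the class index."""
    h = w1 @ np.append(b, 1.0)       # hidden layer, with bias input b_785 = 1
    c = w2 @ np.append(f(h), F_MAX)  # output layer, with bias input f_max
    return int(np.argmax(c))

# With a differential cell pair per synapse, the chip's cell count follows:
assert 2 * ((28 * 28 + 1) * 64 + (64 + 1) * 10) == 101_780
print(classify(rng.integers(0, 2, size=784).astype(float)))
```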

The mixed-signal vector-by-matrix multiplication in the first crossbar array is implemented by applying the input voltages (4.2 V for black pixels, 0 V for white ones) directly to the gates of the array cell transistors, with fixed voltages on their sources (1.65 V) and drains (2.7 V) [see Fig. 4(d)]. As a result, the source-to-drain current of the cell located at the crosspoint of the ith column and the jth row of the array does not depend on the state of any other cell, and equals the product of the binary input b_i and the analog weight w^(1)_ji prerecorded in the memory cell. The sources of the transistors of each row are connected to a single wire (with an externally fixed voltage on it), so that the jth output current of the array is just the sum of the products w^(1)_ji b_i over all columns i, thus implementing the vector-by-matrix multiplication described by the first expression of (2).

In order to reduce random drifts, and also to work with zero-centered signals h_j, we used a differential scheme, in which each synaptic weight is recorded in two adjacent cells of each column, and the output currents of the two adjacent cell rows [Fig. 4(d), I^+_j and I^−_j] are subtracted in an operational amplifier, with its output, h_j ∝ I^+_j − I^−_j, passed to the activation-function circuit performing the function f(h). The sharing of the weight w^(1)_ji between the two cells of the differential pair is very simple: one of the cells (depending on the sign of the desired weight) is completely turned OFF, giving virtually no contribution to the output current. This arrangement keeps half of the cells virtually idle, but simplifies the design and speeds up the weight tuning process.
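The following sketch (Python, idealized) illustrates this encoding: the sign of the weight selects which cell of the pair carries it, while its counterpart stays OFF.

```python
import numpy as np

# Idealized sketch of the differential weight encoding: the sign of the
# weight selects which cell of the pair is programmed; the other stays OFF.
G_OFF = 0.0  # a fully turned-OFF cell contributes (virtually) no current

def encode(w):
    """Map a signed weight onto the (positive-row, negative-row) cell pair."""
    return (w, G_OFF) if w >= 0.0 else (G_OFF, -w)

def differential_output(weights, inputs):
    """h_j is proportional to I+_j - I-_j for one differential row pair."""
    pairs = np.array([encode(w) for w in weights])  # shape (N, 2)
    i_plus = pairs[:, 0] @ inputs
    i_minus = pairs[:, 1] @ inputs
    return i_plus - i_minus

w = np.array([0.3, -0.7, 0.1])
b = np.array([1.0, 1.0, 0.0])
print(differential_output(w, b))  # 0.3 - 0.7 = -0.4
```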

The analog vector-by-matrix calculation in the second array was performed using the gate-coupled approach [Fig. 4(e)]. In this approach [24], the synaptic gate array is complemented by an additional row of "peripheral" cells, which are physically similar to the array cells and hence have the same subthreshold slope β. The gate electrode of the peripheral cell of each column is connected to those of all cells of this column, so that their voltages V_GS are also equal. Applying (1) to the current of the cell located at the crosspoint of the kth row and the jth column of the array (I_kj), and to that of the peripheral cell of this column (I_j), and dividing the results, we get

$$w^{(2)}_{kj} \equiv \frac{I_{kj}}{I_j} = \exp\left\{\beta\,\frac{(V_t)_j - (V_t)_{kj}}{V_T}\right\}. \qquad (3)$$

The resulting currents I_kj are summed up exactly as those in the first array (with a similar differential scheme for drift reduction), so that if the array is fed by the output currents of the activation-function circuits, I_j ∝ f(h_j), it performs the vector-by-matrix multiplication described by the second expression of (2), with the synaptic weights given by (3), which depend on the preset memory states of the corresponding cells but are independent of the input currents. To minimize the error due to the dependence of β on the memory state [see the inset in Fig. 2(d)], in the second array we used a higher gate voltage range (1.1–2.7 V), with the upper bound set by technology restrictions.
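A numerical sketch of (3), with hypothetical slope and threshold values, shows how a threshold-voltage difference of a few tens of millivolts spans a decade of weight values:

```python
import numpy as np

# Sketch of the gate-coupled weight of (3); BETA and the threshold voltages
# are hypothetical values, not calibrated ESF1 parameters.
V_T = 0.026   # thermal voltage, V
BETA = 0.7    # subthreshold slope shared by peripheral and array cells

def weight(vt_peripheral, vt_cell):
    """w_kj = I_kj / I_j = exp(BETA * ((Vt)_j - (Vt)_kj) / V_T)."""
    return np.exp(BETA * (vt_peripheral - vt_cell) / V_T)

# Equal thresholds give w = 1; raising the array cell's threshold about
# 86 mV above the peripheral one suppresses the weight roughly 10x:
print(weight(1.50, 1.50))   # 1.0
print(weight(1.50, 1.586))  # ~0.1
```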

Fig. 5(a) shows the circuit used to subtract the currents I^+ and I^− of the differential-scheme rows, based on two operational amplifiers [Fig. 5(c)]. Assuming that the resistances R_F are equal and that the outputs of both opamps do not saturate (ensured by keeping the maximum currents I^± such that I_max R_F < 1 V for the chosen values, R_F = 16 kΩ in the first layer and R_F = 128 kΩ in the second one), the output voltage of the scheme is

$$V = R_F (I^+ - I^-) + \text{const}. \qquad (4)$$

Fig. 5. (a) Circuit-level diagram of the differential summing amplifier used in the hidden-layer and output-layer neurons; R_F = 16 kΩ for hidden neurons and R_F = 128 kΩ for output neurons. (b) Implemented activation function. Transistor-level schematics of (c) the operational amplifier and (d) the activation-function circuit; V_SS = 0 V, V_DD = 2.7 V.

Fig. 5(b) shows the rectified-tanh activation function f(h) used in the hidden-layer neurons [see (2)], with h [V] = 10 R_F [Ω] (I^+ − I^−) [A] and f_max = 300 nA, while Fig. 5(d) shows the CMOS circuit used to implement this function.
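Putting (4) and the activation together, a hidden-layer neuron can be modeled in a few lines; this is a behavioral sketch using the values quoted in the text, not a circuit simulation.

```python
import numpy as np

# Behavioral sketch of a hidden-layer neuron: differential current
# subtraction per (4), then the rectified tanh of Fig. 5(b).
R_F = 16e3      # feedback resistance of hidden-layer neurons, Ohm
F_MAX = 300e-9  # activation ceiling, A

def hidden_neuron(i_plus, i_minus):
    h = 10.0 * R_F * (i_plus - i_minus)  # h [V] = 10 * R_F * (I+ - I-)
    return F_MAX * np.tanh(h) if h >= 0.0 else 0.0

# A 2-uA current imbalance gives h = 0.32 V, safely inside the opamps'
# non-saturated range (I_max * R_F < 1 V):
print(hidden_neuron(12e-6, 10e-6))  # ~9.3e-08 A
```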

The desired synaptic weights, calculated on an external computer running a similar "precursor" software-implemented network trained with the standard error backpropagation algorithm, were imported into the network by analog tuning of the memory state of each floating-gate cell, using the peripheral analog demultiplexer circuitry [Fig. 4(c)]. In order to simplify this first, prototype design, the weights were tuned one by one, by applying proper bias voltage sequences to the select and half-select lines [21], [22]. (In principle, this process may be significantly parallelized.) The large voltages required for the weight import are decoupled from the basic, low-voltage circuitry using high-voltage pass transistors. The input pattern bits are shifted serially into a 785-b register before each classification; to start it, the bits are read out into the network in parallel.

The digital encoders and shift register circuits and their layouts were synthesized from Verilog in a standard 1.8-V digital CMOS process. All other circuits were designed manually for the embedded 180-nm process of Silterra Corp. (Such an approach was practicable due to the modular, repetitive design of the circuit.) All active components of the circuit have a total area of 0.78 mm² (Fig. 6), with the two synaptic arrays occupying less than a quarter of this area, while the total chip area, including very sparse routing (not yet optimized in this design), is about 5 × 5 mm².


Fig. 6. (a) Micrograph of the chip and (b) area breakdown of its active components (excluding wiring between the blocks, which was not optimized at this stage).

IV. NETWORK TESTING

Because of the digital (fixed-voltage) input of the first synaptic array, subthreshold conduction was not enforced there, so that the output currents of some cells exceeded 300 nA [Fig. 7(a)]. To reduce the computation error due to the potential slope mismatch between the peripheral and array cells, all peripheral floating-gate transistors in the second array were tuned to provide output currents of 300 nA at V_G = 2.7 V, i.e., at the largest voltage that could be supplied by the hidden-layer neurons in our design. With such a scheme, the error is conveniently smallest for the largest weight w_kj = 1, corresponding to an array cell tuned to run a current of 300 nA at V_GS = 1.6 V. The target current values for all cells in the second array (excluding the bias ones) were ensured to be between 0 and 300 nA by clipping the weights during training of the precursor network.

To decrease the weight import time, only one cell of each pair, corresponding to the particular sign of the weight value, was tuned, while its counterpart was kept at a very small, virtually zero initial conductance. Additionally, the nonbias cells in the first array for which the target conductances were below 30 nA were also not tuned, because of their negligible impact on the classification fidelity, confirmed by modeling. As a result, only about 30% of the cells were fine-tuned. Because of the sequential character of the tuning process, it took several hours to complete for the whole chip with the chosen accuracy. (In the future, the tuning may be greatly sped up by adjusting multiple weights at a time via integrated on-chip tuning circuitry [9], and by using the better tuning algorithms we have developed [22].)

Moreover, also to speed up the import process, the weight tuning accuracy for single-cell tuning was set to a relatively high value of 5%. As Fig. 7(b) indicates, some of the already tuned cells were disturbed beyond the target accuracy during the subsequent weight import. In this first experiment, these cells were not retuned, in part because even for such rather crude weight import, the experimentally tested classification fidelity (94.65%) on the MNIST benchmark test patterns (Fig. 8) is already remarkably close to the simulated value (96.2%) for the same network (Fig. 9). Both numbers are also not too far from the maximum fidelity (97.7%) of a similar perceptron of this size optimized without hardware constraints, with ∼0.5% of the fidelity recoverable by taking into account the small weights and ∼1% by not clipping the nonbias weights.

Excitingly, such classification fidelity in our network, with large optimization reserves (see below), is achieved at an ultralow (sub-20-nJ) energy consumption per average classified pattern [Fig. 10(a)], and an average classification time below 1 μs [Fig. 10(b)]. The upper bound on the energy is calculated as the product of the measured average power consumed by the network, 5.6 mA × 2.7 V + 2.9 mA × 1.05 V ≈ 20 mW, and the upper bound, 1 μs, of the average signal propagation delay. A more accurate measurement of the time delay, and hence the energy, requires a redesign of the signal input circuitry, which is currently rather slow [see Fig. 10(b)].
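The arithmetic behind these bounds is simple enough to restate; the ≈20 mW figure in the text is the same sum, rounded up.

```python
# Worked form of the quoted upper bounds: measured supply currents times
# rail voltages give the average power; times the 1-us delay bound, the
# energy per classified pattern. The text rounds the power up to ~20 mW.
power = 5.6e-3 * 2.7 + 2.9e-3 * 1.05   # ~0.0182 W
energy = power * 1e-6                  # ~1.8e-8 J per pattern
print(f"P = {power * 1e3:.1f} mW, E = {energy * 1e9:.1f} nJ")  # sub-20-nJ
```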

V. DISCUSSION

The achieved speed and energy efficiency are much better than those demonstrated, for the same task, by any digital network we are aware of. For example, the best results for the same MNIST benchmark classification were reported for IBM's TrueNorth chip [25]. For a comparable 95% fidelity, that chip can classify 1000 images per second while consuming 4 μJ of energy per image [26], i.e., it is at least three orders of magnitude slower and less energy-efficient than our still-unoptimized analog circuit. This difference is rather impressive, taking into account the advanced 28-nm CMOS process used for the TrueNorth chip implementation.

In a less direct comparison, in terms of energy per multiply-and-accumulate (MAC) operation, our network also outperforms the best reported digital systems. Indeed, the measured upper bound on the energy efficiency of our circuit is 0.2 pJ per MAC. This is a factor of 60 smaller than the 12 pJ per MAC reported for the 65-nm Eyeriss chip [27], which is highly optimized for machine learning applications. (It performs 16-b operations and, like the TrueNorth chip, was implemented using an advanced fabrication technology.) Note that both the TrueNorth and Eyeriss chips, in turn, far outperform modern graphical processing units (GPUs) for neuromorphic-network applications.
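One plausible accounting that reproduces the 0.2-pJ figure is to divide the sub-20-nJ energy bound by the cell count, assuming each of the 101 780 cells contributes one analog operation per classification; the paper does not spell out this count, so it is an assumption.

```python
# Assumed accounting behind the ~0.2 pJ/MAC figure: the sub-20-nJ energy
# bound divided over one analog operation per cell per classification.
energy_per_pattern = 20e-9   # J, upper bound from Section IV
ops_per_pattern = 101_780    # total floating-gate cells (assumption)
pj_per_mac = energy_per_pattern / ops_per_pattern * 1e12
print(pj_per_mac)            # ~0.2 pJ; vs. 12 pJ for Eyeriss -> ~60x
```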


Fig. 7. Weight import statistics. (a) Histogram of the imported cell current values (weights), measured at V_D = 2.7 V, with V_S = 1.65 V and V_G = 4.2 V in the first synaptic array, and V_S = 1.1 V and V_G = 2.7 V for the second one, as used in the experiment. (b) Comparison between the target synaptic cell currents (computed at the external network training) and the actual cell currents measured after their import, i.e., cell tuning. (c) Similar comparison for the positive fraction of hidden-neuron outputs computed for all test patterns. (The negative outputs are not shown, because they are discarded by the used activation function.) Red dashed lines are guides for the eye, corresponding to perfect weight import.

Fig. 8. Experimental results for the classification of all 10 000 MNIST test set patterns. (a) Histograms of voltages delivered by each output neuron. Red bars correspond to the patterns whose class belongs to this particular output, while the blue ones are for all remaining patterns. (b) Histograms of the largest output voltages (among all output neurons) for all test patterns of each class, showing that the correct outputs (red bars) always dominate. Note the logarithmic vertical scales.

Our result is also much better than the ∼1 pJ per analog operation recently reported for a small 130-nm mixed-signal neural network based on synaptic transistors [8]. It is also comparable with the best results obtained using the switched-capacitor approach [28], for example, the recent ∼0.1 pJ per operation achieved in a much smaller circuit, with only 8 × 8 × 3 discrete (3-b) synaptic weights, using a 40-nm process [29]. (Note that this approach does not allow analog tuning of synaptic weights, and its extension to larger circuits may be problematic because of the relatively large capacitor size.)

It should also be noted that the energy-per-MAC metric is generally less objective, because it does not account for the operation precision or for the complexity and functionality of the implemented system (e.g., general-purpose systems like a typical GPU versus application-specific ones like the Eyeriss chip).

There are still several unused reserves in our design. The most straightforward improvement is to use, for the neurons, a current-mirror design similar to the gate-coupled circuits shown in Fig. 4(e), but implemented with floating-gate-free transistors, and hence with a signal transfer weight w = 1. (In our current design, the neurons give the dominant contributions to the network latency and energy dissipation [see Fig. 10(a)].) The second direct path forward is to use the more advanced 55-nm ESF3 memory technology of the same company [20]. (Our preliminary testing [23] of its similar redesign has not found any evident showstoppers on that path.)


Fig. 9. Simulated classification fidelity, computed with 32-b floating-point precision, as a function of the weight import precision for the implemented network, with the particular set of weights used in the experiment. The weight error was modeled by adding, to each optimized value, normally distributed noise with the shown standard deviation. The red, blue (rectangles), and black (segment) markers denote, respectively, the median, the 25%–75% percentile range, and the minimum and maximum values over 30 simulation runs. The black and red horizontal dashed lines show, respectively, the calculated misclassification rate for perfect (no noise) weights and the rate obtained in the experiment.
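A minimal version of this simulation can be sketched as follows (Python/NumPy); the network weights and test data below are random placeholders standing in for the trained precursor network and the MNIST test set.

```python
import numpy as np

# Sketch of the Fig. 9 weight-noise experiment: normally distributed noise
# of a given standard deviation is added to the weights, and the
# misclassification rate is collected over 30 runs. Weights and test data
# are random placeholders for the trained network and MNIST.
rng = np.random.default_rng(2)
w1 = rng.normal(0.0, 0.05, size=(64, 785))
w2 = rng.normal(0.0, 0.05, size=(10, 65))
X = rng.integers(0, 2, size=(1000, 784)).astype(float)
y = rng.integers(0, 10, size=1000)

def forward(a1, a2, x):
    h = np.hstack([x, np.ones((len(x), 1))]) @ a1.T
    f = np.where(h >= 0.0, np.tanh(h), 0.0)       # rectified tanh, f_max = 1
    c = np.hstack([f, np.ones((len(x), 1))]) @ a2.T
    return c.argmax(axis=1)

def error_stats(sigma, runs=30):
    """Median, min, and max misclassification rate at noise level sigma."""
    rates = []
    for _ in range(runs):
        rates.append(np.mean(
            forward(w1 + sigma * rng.standard_normal(w1.shape),
                    w2 + sigma * rng.standard_normal(w2.shape), X) != y))
    return np.median(rates), np.min(rates), np.max(rates)

print(error_stats(0.05))
```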

The time delay and energy dissipation of a network with current-mirror neurons will be dominated by the synaptic arrays, and may be readily estimated using the experimentally measured values of the subthreshold slope β for the 180-nm ESF1 and 55-nm ESF3 cells. For example, our modeling of a large-scale deep-learning convolutional network suitable for the classification of large, complex patterns [30] (i.e., the same network that was implemented on the Eyeriss chip [27]), using these two improvements, showed at least a ∼100× advantage in operation speed, and an enormous, >10⁴× advantage in energy efficiency, over the state-of-the-art purely digital (GPU and custom) circuits [see Table I]. (In this table, the estimates for the floating-gate networks take into account the 55 × 55 = 3025-step time-division multiplexing natural for this particular network. The crude estimate of the human visual cortex operation is based on the ∼25-W power consumption of the ∼10¹¹ neurons of the whole brain and a 30-ms delay of the visual cortex, and assumes a uniform distribution of the power over the neurons, with the same number of neurons participating in a single-pattern classification.) Moreover, the energy efficiency of the floating-gate networks would closely approach that of the human visual cortex, at much higher speed [see the last two columns in Table I].

It should also be noted that the cell area of sub-100-nm embedded NOR floating-gate memories is only slightly larger than that of the "1T1R" variety of many emerging nonvolatile memory technologies [10]–[15]. (Here, "T" stands for a dedicated select transistor and "R" for the adjustable resistive memory element.) On the other hand, our crude estimates show [5] that the density and performance may be significantly improved using truly passive ("0T1R") memristor circuits and especially their 3-D versions [31].

Fig. 10. Physical performance. (a) Histogram of the experimentally measured total currents flowing into the circuit, characterizing the static power consumption of both memory cell arrays, for all patterns of the MNIST test set. Inset: list of the pattern-independent static currents of the neurons. (b) Typical signal dynamics after an abrupt turn-ON of the voltage shifter power supply, measured simultaneously at the network input, the output of a sample hidden-layer neuron, and all of the network's outputs. (The actual input voltage is 10× larger.) The oscillatory behavior of the outputs is a result of a suboptimal phase-stability design of the operational amplifiers. Until it is improved and the input circuit is sped up, we can only claim a sub-1-μs average time delay of the network, though it is probably closer to 0.5 μs.

TABLE I
SPEED AND ENERGY CONSUMPTION OF THE SIGNAL PROPAGATION THROUGH THE CONVOLUTIONAL (DOMINATING) PART OF A LARGE DEEP NETWORK [30]

Note also that the recent progress [32], [33] in the development of machine learning algorithms using binary weights implies that our approach may also be extended to novel 3-D NAND flash technologies. Such memories may ensure much higher areal densities of the floating-gate cells, but their redesign for analog weights may be more problematic. Also, the results of [33] show a significant drop in classification performance resulting from the use of binary weights in convolutional layers of large-scale neuromorphic networks. The performance of such networks may be improved by increasing the network size; however, its speed and energy efficiency


may suffer. So the tradeoff between the density and weight-precision effects in 3-D memories is not yet settled, and requires further study.

To summarize, we believe that the reported results give an important proof-of-concept demonstration of the exciting possibilities opened for neuromorphic networks by mixed-signal circuits based on industrial-grade floating-gate memory cells.

ACKNOWLEDGMENT

The authors would like to thank P.-A. Auroux, M. Bavandpour, N. Do, J. Edwards, M. Graziano, and M. R. Mahmoodi for useful discussions and technical support.

REFERENCES

[1] C. Mead, Analog VLSI and Neural Systems. Reading, MA, USA: Addison-Wesley, 1989.

[2] G. Indiveri et al., "Neuromorphic silicon neuron circuits," Frontiers Neurosci., vol. 5, no. 73, pp. 1–23, 2011.

[3] K. Likharev, "CrossNets: Neuromorphic hybrid CMOS/nanoelectronic networks," Sci. Adv. Mater., vol. 3, pp. 322–331, Jun. 2011.

[4] J. Hasler and H. Marr, "Finding a roadmap to achieve large neuromorphic hardware systems," Frontiers Neurosci., vol. 7, Sep. 2013, Art. no. 118.

[5] L. Ceze et al., "Nanoelectronic neurocomputing: Status and prospects," in Proc. DRC, Newark, DE, USA, Jun. 2016, pp. 1–2.

[6] E. Säckinger, "Measurement of finite-precision effects in handwriting- and speech-recognition algorithms," in Artificial Neural Networks—ICANN (Lecture Notes in Computer Science), vol. 1327. Springer, 1997, pp. 1223–1228.

[7] C. Diorio, P. Hasler, A. Minch, and C. A. Mead, "A single-transistor silicon synapse," IEEE Trans. Electron Devices, vol. 43, no. 11, pp. 1972–1980, Nov. 1996.

[8] J. Lu, S. Young, I. Arel, and J. Holleman, "A 1 TOPS/W analog deep machine-learning engine with floating-gate storage in 0.13 μm CMOS," IEEE J. Solid-State Circuits, vol. 50, no. 1, pp. 270–281, Jan. 2015.

[9] S. George et al., "A programmable and configurable mixed-mode FPAA SoC," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 24, no. 6, pp. 2253–2261, Jun. 2016.

[10] D. B. Strukov and H. Kohlstedt, "Resistive switching phenomena in thin films: Materials, devices, and applications," MRS Bull., vol. 37, no. 2, pp. 108–114, Feb. 2012.

[11] S. Raoux, D. Ielmini, M. Wuttig, and I. Karpov, "Phase change materials," MRS Bull., vol. 37, no. 2, pp. 118–123, 2012.

[12] W. Lu, D. S. Jeong, M. Kozicki, and R. Waser, "Electrochemical metallization cells—Blending nanoionics into nanoelectronics?" MRS Bull., vol. 37, no. 2, pp. 124–130, 2012.

[13] J. J. Yang, I. H. Inoue, T. Mikolajick, and C. S. Hwang, "Metal oxide memories based on thermochemical and valence change mechanisms," MRS Bull., vol. 37, no. 2, pp. 131–137, 2012.

[14] E. Y. Tsymbal, A. Gruverman, V. Garcia, M. Bibes, and A. Barthélémy, "Ferroelectric and multiferroic tunnel junctions," MRS Bull., vol. 37, no. 2, pp. 138–143, 2012.

[15] S. Park et al., "RRAM-based synapse for neuromorphic system with pattern recognition function," in IEDM Tech. Dig., Dec. 2012, pp. 10.2.1–10.2.4.

[16] Y. Nishitani, Y. Kaneko, and M. Ueda, "Supervised learning using spike-timing-dependent plasticity of memristive synapses," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 12, pp. 2999–3008, Dec. 2015.

[17] M. Prezioso, F. Merrikh-Bayat, B. D. Hoskins, G. C. Adam, K. K. Likharev, and D. B. Strukov, "Training and operation of an integrated neuromorphic network based on metal-oxide memristors," Nature, vol. 521, pp. 61–64, May 2015.

[18] S. Kim et al., "NVM neuromorphic core with 64k-cell (256-by-256) phase change memory synaptic array with on-chip neuron circuits for continuous in-situ learning," in IEDM Tech. Dig., Dec. 2015, pp. 443–446.

[19] F. Merrikh-Bayat, M. Prezioso, B. Chakrabarti, I. Kataeva, and D. B. Strukov. (Nov. 2016). "Advancing memristive analog neuromorphic networks: Increasing complexity, and coping with imperfect hardware components." [Online]. Available: https://arxiv.org/abs/1611.04465

[20] SST Inc. Superflash Technology Overview. Accessed: Feb. 7, 2017. [Online]. Available: www.sst.com/technology/sst-superflash-technology

[21] F. Merrikh-Bayat, X. Guo, H. A. Om'mani, N. Do, K. K. Likharev, and D. B. Strukov, "Redesigning commercial floating-gate memory for analog computing applications," in Proc. ISCAS, Lisbon, Portugal, May 2015, pp. 1921–1924.

[22] F. Merrikh-Bayat, X. Guo, M. Klachko, N. Do, K. Likharev, and D. Strukov, "Model-based high-precision tuning of NOR flash memory cells for analog computing applications," in Proc. DRC, Newark, DE, USA, Jun. 2016, pp. 1–2.

[23] X. Guo et al., "Temperature-insensitive analog vector-by-matrix multiplier based on 55 nm NOR flash memory cells," in Proc. CICC, Austin, TX, USA, Apr./May 2017, pp. 1–4.

[24] C. R. Schlottmann and P. E. Hasler, "A highly dense, low power, programmable analog vector-matrix multiplier: The FPAA implementation," IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 1, no. 3, pp. 403–411, Sep. 2011.

[25] P. A. Merolla et al., "A million spiking-neuron integrated circuit with a scalable communication network and interface," Science, vol. 345, no. 6197, pp. 668–673, Aug. 2014.

[26] S. K. Esser, R. Appuswamy, P. Merolla, J. V. Arthur, and D. S. Modha, "Backpropagation for energy-efficient neuromorphic computing," in Proc. NIPS, Montreal, QC, Canada, Dec. 2015, pp. 1117–1125.

[27] Y.-H. Chen, T. Krishna, J. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, San Francisco, CA, USA, Jan./Feb. 2016, pp. 262–263.

[28] D. Bankman and B. Murmann, "Passive charge redistribution digital-to-analogue multiplier," Electron. Lett., vol. 51, no. 5, pp. 386–388, 2015.

[29] E. H. Lee and S. S. Wong, "A 2.5 GHz 7.7 TOPS/W switched-capacitor matrix multiplier with co-designed local memory in 40 nm," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, San Francisco, CA, USA, Jan./Feb. 2016, pp. 418–420.

[30] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. NIPS, Lake Tahoe, NV, USA, Dec. 2012, pp. 1097–1105.

[31] G. C. Adam et al., "3-D memristor crossbars for analog and neuromorphic computing applications," IEEE Trans. Electron Devices, vol. 64, no. 1, pp. 312–318, Jan. 2017.

[32] M. Courbariaux, Y. Bengio, and J.-P. David, "BinaryConnect: Training deep neural networks with binary weights during propagations," in Proc. NIPS, Montreal, QC, Canada, Dec. 2015, pp. 3105–3113.

[33] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. (Sep. 2016). "Quantized neural networks: Training neural networks with low precision weights and activations." [Online]. Available: https://arxiv.org/abs/1609.07061

Farnood Merrikh-Bayat received the M.Sc. and Ph.D. degrees in electrical engineering from the Sharif University of Technology, Tehran, Iran, in 2008 and 2012, respectively, and the Ph.D. degree in computer engineering from the University of California, Santa Barbara, CA, USA, in 2015.

He was a Post-Doctoral Researcher in computer engineering with the University of California until 2017. His current research interests include abusing emerging device technologies to efficiently implement hardware accelerators for deep neural networks and machine learning algorithms, high-performance computing beyond complementary metal–oxide–semiconductor technology, nonconventional architectures and designs, and smart systems and vehicles.

Xinjie Guo received the B.S. degree in electronics engineering and computer science from Peking University, Beijing, China, in 2011, and the M.S. and Ph.D. degrees in electrical and computer engineering from the University of California, Santa Barbara, CA, USA, in 2013 and 2016, respectively.


Michael Klachko received the M.S. degree in computer engineering from the University of California, Santa Barbara, CA, USA, in 2016, where he is currently pursuing the Ph.D. degree as part of Prof. Strukov's research group.

His current research interests include novel deep learning architectures and algorithms, and low-precision neural computation.

Mirko Prezioso received the master's degree in condensed matter physics and the Ph.D. degree in advanced materials from the University of Parma, Parma, Italy, in 2008.

He was a Post-Doctoral Researcher with the Institute for the Study of Nanostructured Materials, National Research Council, Bologna, Italy, where he focused on organic spintronics and memristive systems. In 2013, he joined the Electrical and Computer Engineering Department, University of California, Santa Barbara, CA, USA, as a Research Assistant under the supervision of Prof. Strukov, where he has been involved in the fabrication, characterization, and application of oxide-based RRAM devices. His current research interests include their application to neuromorphic computing.

Konstantin K. Likharev (M'91–SM'06–F'08) received the Ph.D. degree in physics from Lomonosov Moscow State University, Moscow, Russia, in 1969, and the habilitation degree of Doctor of Sciences from the USSR's Highest Attestation Committee in 1979.

From 1969 to 1988, he was a Staff Scientist with Moscow State University, where he was the Head of the Laboratory for Cryoelectronics from 1989 to 1991. In 1991, he assumed a Professorship at Stony Brook University, Stony Brook, NY, USA, where he has been a Distinguished Professor since 2002. He has worked in the fields of nonlinear classical and dissipative quantum dynamics, and solid-state physics and electronics, notably including superconductor electronics and nanoelectronics. His current research interests include the architecture and nanoelectronic implementation of high-performance neuromorphic networks. He has authored more than 350 original publications, more than 75 review papers and book chapters, and two monographs, and holds several patents.

Prof. Likharev is an APS Fellow.

Dmitri B. Strukov (M'02–SM'16) received the M.S. degree in applied physics and mathematics from the Moscow Institute of Physics and Technology, Moscow, Russia, in 1999, and the Ph.D. degree in electrical engineering from Stony Brook University, Stony Brook, NY, USA, in 2006.

From 2006 to 2007, he was a Post-Doctoral Associate with Stony Brook University. From 2007 to 2009, he was with Hewlett Packard Laboratories, Palo Alto, CA, USA, where he was involved in various aspects of nanoelectronic systems. He is currently a Professor of electrical and computer engineering with the University of California, Santa Barbara, CA, USA. His current research interests include different aspects of computation, in particular how to efficiently perform computation at various levels of abstraction, and hardware implementations of artificial neural networks with emerging memory devices.