
Nanoelectronic Neurocomputing: Status and Prospects

L. Ceze,¹ J. Hasler,² K. K. Likharev,³ J.-s. Seo,⁴ T. Sherwood,⁵ D. Strukov,⁵* Y. Xie,⁵ and S. Yu⁴

¹University of Washington, Seattle, WA 98195-2350, U.S.A.; ²Georgia Institute of Technology, Atlanta, GA 30332-0250, U.S.A.; ³Stony Brook University, Stony Brook, NY 11794-3800, U.S.A.; ⁴Arizona State University, Tempe, AZ 85281-9309, U.S.A.; ⁵UC Santa Barbara, Santa Barbara, CA 93106-9560, U.S.A.

Email: *[email protected]

Potential advantages of specialized hardware for neuromorphic computing were recognized several decades ago (see, e.g., Refs. [1, 2]), but the need for it has become especially acute recently, due to significant advances in computational neuroscience and machine learning. The most vivid example is deep learning in convolutional neuromorphic networks [3]: the recent dramatic progress of this technology, with its rapid extension to several important applications, was enabled by the use of modern GPU clusters [4, 5]. Even higher performance and lower power consumption have recently been demonstrated using FPGAs [5-7] and custom digital circuits [5, 8]. However, the energy efficiency of these fully digital implementations of convolutional networks and other neuromorphic systems (see, e.g., Ref. [9]) is still well below that of their biological prototypes [2, 10], even when the most advanced CMOS technology is used. The main reason for this efficiency gap is that using high-precision digital operations to mimic high-noise biological neural networks is inherently unnatural. Theoretical estimates show that this gap may be closed by using purely analog and mixed-signal neuromorphic networks [1, 2, 10]. For instance, analog crossbar circuits may perform a key operation, the vector-by-matrix multiplication, by utilizing the fundamental Ohm and Kirchhoff laws: each crosspoint conductance passes a current proportional to its row voltage, and these currents are summed on the column wires. In addition, nanoelectronic implementation of such analog and mixed-signal neuromorphic circuits may allow them to approach the areal density of their biological prototypes, while far outperforming them (and their digital counterparts) in speed, all at manageable power consumption [2, 10] - see Fig. 1.
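
As a minimal numerical illustration of this principle (our sketch, not taken from the paper), an ideal crossbar computes a vector-by-matrix product "in physics": Ohm's law gives the per-crosspoint currents, and Kirchhoff's current law sums them on each column wire.

```python
import numpy as np

# Minimal model of an ideal analog crossbar (illustrative sketch):
# row voltages V drive crosspoint conductances G; by Ohm's law each
# crosspoint passes I_ij = G_ij * V_i, and by Kirchhoff's current law
# each column wire sums these contributions: I_j = sum_i G_ij * V_i.

def crossbar_vmm(V, G):
    """Column output currents (A) for row voltages V (V) and
    conductance matrix G (S) of shape (rows, columns)."""
    return V @ G

rng = np.random.default_rng(0)
V = rng.uniform(0.0, 0.2, size=10)          # small read voltages
G = rng.uniform(1e-6, 1e-4, size=(10, 4))   # hypothetical conductance range
print(crossbar_vmm(V, G))                   # one current per column
```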

The key analog component of such circuits is a compact electron device with adjustable conductance - essentially an analog nonvolatile memory cell - typically placed at each crosspoint of a crossbar array and mimicking the biological synapse. Until recently, such devices were implemented mostly as “synaptic transistors” [2, 11], which may be fabricated using standard CMOS technology. This approach has resulted in several sophisticated, efficient systems - see, e.g., Fig. 2 [11]. However, these devices have relatively large areas, leading to higher interconnect capacitance and hence larger time delays. Figures 3 and 4 illustrate two promising candidates for addressing this challenge: two-terminal nonvolatile resistive devices (“memristors”) [12, 13], and compact floating-gate memory cells, redesigned from commercial NOR flash memory to allow high-precision analog tuning [14, 15]. The advantage of the floating-gate memory devices is their mature, CMOS-compatible fabrication technology. On the other hand, the tuning of memristors is based on reversible displacements of just a few atoms, so even the best fabrication technologies developed so far do not yet provide device variability low enough for VLSI circuits. However, memristors may have an inherently smaller footprint than floating-gate cells, and may be scaled down below 10 nm without sacrificing their endurance, retention, and tuning accuracy [12]. Moreover, these devices are naturally suitable for 3D integration (Fig. 5) [12, 13]. (Certain recent developments in deep learning algorithms give hope for using novel 3D NAND technologies for neurocomputing as well.)
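
High-precision analog tuning of either device type is usually achieved with a closed-loop, write-and-verify procedure. The sketch below is our illustration of that generic feedback idea (all device and pulse parameters, and the read()/pulse() interface, are hypothetical placeholders, not the tuning protocols of Refs. [14, 15]): read the conductance, compare it with the target, and apply incremental set or reset pulses until the error falls within tolerance.

```python
def tune_conductance(device, g_target, tol=0.02, max_pulses=1000):
    """Generic write-and-verify tuning loop (illustrative sketch)."""
    for _ in range(max_pulses):
        g = device.read()                    # verify step
        error = (g - g_target) / g_target
        if abs(error) <= tol:                # within tolerance: done
            return g
        polarity = -1 if error > 0 else +1   # reset if too high, set if too low
        # Scale the pulse with the remaining error to avoid overshoot.
        device.pulse(polarity, amplitude=min(1.0, abs(error)))
    raise RuntimeError("device did not converge to target conductance")

class MockMemristor:
    """Toy device model (hypothetical) used only to exercise the loop."""
    def __init__(self): self.g = 5e-5                     # initial conductance, S
    def read(self): return self.g
    def pulse(self, polarity, amplitude): self.g *= 1 + 0.05 * polarity * amplitude

print(tune_conductance(MockMemristor(), g_target=1e-4))   # converges near 1e-4 S
```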

Further progress of nanoelectronic neurocomputing circuits may lead not only to high-performance convolutional networks, but eventually to other intelligent, high-speed, low-power information processing systems for emerging applications. (These applications notably include flexible, high-speed platforms for testing large-scale models of biological neural systems, including the cerebral cortex.) The important research directions toward these goals include: (i) the development of hardware-friendly neuromorphic network algorithms that enable the use of advanced nanoelectronic devices without compromising the system’s functional performance; (ii) device optimization for particular neuromorphic applications; (iii) progress toward understanding the optimal partitioning between the digital and analog domains in mixed-signal systems (such as the one shown in Fig. 2); (iv) the development of block-based architectures for general-purpose neuromorphic systems; and (v) the development of suitable programming models for such systems (see, e.g., Fig. 6).

[1] C. Mead, Analog VLSI and Neural Systems, 1989.
[2] J. Hasler et al., Front. Neurosci., vol. 7, p. 119, 2013.
[3] Y. LeCun et al., Nature, vol. 521, p. 436, 2015.
[4] A. Krizhevsky et al., NIPS, Lake Tahoe, CA, 2012, p. 1097.
[5] C. Farabet et al., in Machine Learning on Very Large Data Sets, ed. by R. Bekkerman et al., 2011, p. 399.
[6] A. Putnam et al., ISCA, Minneapolis, MN, 2014, p. 13.
[7] T. Moreau et al., HPCA, Burlingame, CA, 2015, p. 603.
[8] T. Chen et al., ASPLOS, Salt Lake City, UT, 2014, p. 269.
[9] P. Merolla et al., Science, vol. 345, p. 668, 2014.
[10] K. Likharev, Sci. Adv. Mater., vol. 3, p. 322, 2011.
[11] S. George et al., IEEE Trans. VLSI, 2016 (accepted).
[12] J. Yang et al., Nature Nanotech., vol. 8, p. 13, 2013.
[13] S. Yu et al., ACS Nano, vol. 7, p. 2320, 2013.
[14] F. Merrikh Bayat et al., IMW, Monterey, CA, 2015, p. 1.
[15] F. Merrikh Bayat et al., DRC, Newark, DE, 2016.

Fig. 1. Energy efficiency and performance comparison for a 64×64-pixel deep-learning neural-network image classification. The estimates for the state-of-the-art digital and emerging analog nanodevice implementations are based on the data from Refs. [5] and [12, 14], respectively.

                          Time (s)     Power (W)   Energy (J)
  Digital
    CPU 2.66 GHz          ~8×10⁻³      ~30 to 40   ~3×10⁻¹
    GPU 1 GHz             ~3×10⁻⁴      ~40         ~1×10⁻²
    FPGA 200 MHz          ~1.5×10⁻⁴    ~10         ~1×10⁻³
    ASIC 400 MHz          ~5×10⁻⁵      ~3          ~1×10⁻⁴
  Analog
    NOR 180 nm            ~2×10⁻⁶      ~1          ~2×10⁻⁶
    NOR 55 nm             ~7×10⁻⁷      ~1          ~7×10⁻⁷
    Memristors 200 nm     ~5×10⁻⁸      ~1          ~5×10⁻⁸
    3D memristors 10 nm   ~1×10⁻⁸      ~0.1        ~1×10⁻⁹
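
(The three rows are mutually consistent via E = P × t; for example, for the 200-nm memristor implementation, ~1 W × ~5×10⁻⁸ s ≈ 5×10⁻⁸ J. The more-than-six-orders-of-magnitude energy gap between the CPU and memristor entries thus stems mostly from the much shorter classification time at comparable or lower power.)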

Fig. 3. Memristor circuits for neurocomputing [14]: (a) 12×12 crossbar with integrated metal-oxide memristors (scale bar: 2 μm); (b) I-V switching curve of a memristor, showing the form, set, and reset transitions (current in μA versus applied voltage from -2.0 to +2.0 V), together with the device stack, whose labeled layers are Pt (60 nm), TiO2-x (30 nm), Ti (15 nm), Al2O3 (4 nm), Ta (5 nm), and Pt (60 nm) on a SiO2/Si substrate; (c) pattern classification results - the number of misclassified patterns versus training epoch - for a single-layer perceptron (bottom inset) with ten pre-synaptic and three post-synaptic neurons (weights w1,1 through w10,3), applied to 3×3 binary patterns (top inset).
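
For reference, the classifier of Fig. 3(c) is an elementary single-layer perceptron. The following software sketch (our illustration with randomly generated patterns, not the in-situ hardware training of Ref. [14]) reproduces its structure: nine 3×3 pixels plus a constant bias give ten pre-synaptic inputs, fully connected to three post-synaptic outputs and trained with a perceptron-style delta rule.

```python
import numpy as np

rng = rng = np.random.default_rng(1)

# Hypothetical training set: 3x3 binary patterns flattened to 9 pixels,
# plus a bias input -> 10 pre-synaptic neurons; 3 output classes.
X = rng.integers(0, 2, size=(30, 9)).astype(float)
X = np.hstack([X, np.ones((30, 1))])                    # bias column
labels = rng.integers(0, 3, size=30)
T = -np.ones((30, 3)); T[np.arange(30), labels] = 1.0   # +/-1 targets

W = rng.normal(0, 0.1, size=(10, 3))                    # 10x3 synaptic weights
eta = 0.05                                              # learning rate

for epoch in range(35):
    Y = np.tanh(X @ W)                                  # neuron activations
    W += eta * X.T @ (T - Y)                            # delta-rule update
    errors = np.count_nonzero(np.argmax(Y, axis=1) != labels)
    # `errors` plays the role of Fig. 3(c)'s misclassified-pattern count
print("misclassified patterns after training:", errors)
```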

Fig. 4. Gate-coupled vector-by-matrix multiplier implemented with 180-nm embedded NOR flash memory redesigned for analog computing applications [14, 15]: (a) layout, showing the supercell array with shared gate and drain lines, source lines, high-voltage pass gates, and peripheral devices; (b) equivalent circuit; (c) experimental results for a single differential weight (Ibias = 500 nA, Vdd = 1 V, w1+ ≈ 10⁻⁵), confirming the linear transfer characteristic Iout+ - Iout- = (w1+ - w1-)(Iin+ - Iin-), with both current differences measured in μA.
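
The differential organization verified in Fig. 4(c) can be emulated in software as follows (an illustrative sketch): each signed weight w is encoded as the difference of two non-negative cell weights w⁺ and w⁻, inputs and outputs are represented as current differences, and the array then realizes Iout⁺ - Iout⁻ = (w⁺ - w⁻)(Iin⁺ - Iin⁻) exactly.

```python
import numpy as np

def signed_to_differential(W):
    """Split a signed weight matrix into two non-negative parts,
    W = W_plus - W_minus, as in a differential pair of flash cells."""
    return np.maximum(W, 0.0), np.maximum(-W, 0.0)

def differential_vmm(I_in_plus, I_in_minus, W_plus, W_minus):
    # Each output is read as the difference of two column currents.
    I_out_plus  = I_in_plus @ W_plus  + I_in_minus @ W_minus
    I_out_minus = I_in_plus @ W_minus + I_in_minus @ W_plus
    return I_out_plus - I_out_minus   # = (I_in+ - I_in-) @ (W+ - W-)

W = np.array([[ 2e-5, -1e-5],
              [-3e-5,  4e-5]])                 # hypothetical signed weights
Wp, Wm = signed_to_differential(W)
x_p = np.array([1e-6, 2e-6])                   # input currents (A)
x_m = np.array([5e-7, 1e-6])
print(differential_vmm(x_p, x_m, Wp, Wm))
print((x_p - x_m) @ W)                         # identical, by construction
```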

Fig. 5. 3D memristor circuits: (a) circuit, (b) possible memristor structure, and (c) TEM cross-section of side-wall-integrated HfOx devices [13], showing a ~5-nm HfOx layer between TiN and Pt electrodes, embedded in SiO2. (d) General idea of CMOS/memristor hybrids [10, 12, 14], with the memristive add-on fabricated on top of the CMOS stack, and (e) top view of TiOx memristors back-integrated on a 0.25-μm CMOS circuit.

Fig. 2. RASP 3.0 field-programmable analog array [11]: (a) general architecture, and (b) chip layout in a 350-nm CMOS process. The chip consists of an embedded microprocessor, a static random-access memory, digital-to-analog and analog-to-digital converters, circuitry to program the nonvolatile memory, and digital/analog configurable blocks (denoted with D/A labels).

Fig. 6. FPGA-based neural network accelerator for approximate programs [7]: (a) top-level architecture; (b) a fragment of code, written in an approximation-aware programming language, which filters pixels in an image; (c) performance speed-up and (d) energy savings, evaluated on the set of studied benchmarks.
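
To make the idea behind Fig. 6 concrete, here is our illustrative sketch in plain Python (Ref. [7] actually uses compiler annotations and an FPGA-resident network, and none of the names below are from that system): a pure, error-tolerant per-pixel kernel is imitated by a small neural network trained on input/output samples, which can then stand in for the exact code, trading a bounded accuracy loss for speed and energy.

```python
import numpy as np

def sobel_magnitude(patch):
    """Exact per-pixel kernel to be approximated: Sobel gradient
    magnitude of a 3x3 grayscale patch (values in [0, 1])."""
    gx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
    return float(np.hypot(np.sum(gx * patch), np.sum(gx.T * patch)))

# Train a tiny one-hidden-layer network to mimic the exact kernel.
rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(5000, 9))                    # random 3x3 patches
y = np.array([sobel_magnitude(x.reshape(3, 3)) for x in X])

W1 = rng.normal(0, 0.5, (9, 16)); b1 = np.zeros(16)
W2 = rng.normal(0, 0.5, (16, 1)); b2 = np.zeros(1)
for _ in range(1000):                                    # full-batch gradient descent
    H = np.tanh(X @ W1 + b1)                             # hidden activations
    pred = (H @ W2 + b2).ravel()
    err = pred - y
    gW2 = H.T @ err[:, None] / len(X); gb2 = err.mean(keepdims=True)
    dH = (err[:, None] * W2.T) * (1 - H**2)              # backprop through tanh
    gW1 = X.T @ dH / len(X); gb1 = dH.mean(axis=0)
    for p, g in ((W2, gW2), (b2, gb2), (W1, gW1), (b1, gb1)):
        p -= 0.2 * g
print("mean |error| of the neural approximation:", np.abs(err).mean())
```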