Nervana and the Future of Computing


Transcript of Nervana and the Future of Computing

Page 1: Nervana and the Future of Computing

Proprietary and confidential. Do not distribute.

Nervana and the Future of Computing

26 April 2016 Arjun Bansal

Co-founder & VP Algorithms, Nervana

MAKING MACHINES SMARTER.™

Page 2: Nervana and the Future of Computing

AI on demand using Deep Learning

[Diagram: deep learning (DL) on the Nervana Platform powers image classification, object localization, video indexing, text analysis, and machine translation.]

Page 3: Nervana and the Future of Computing

Image classification and video activity detection

Deep learning model:

• Trained on a public dataset¹ of 13K videos in 100 categories

• Training was approximately 3 times faster than a competing framework

• Can be extended to perform scene and object detection, action similarity labeling, video retrieval, and anomaly detection

1: UCF101 dataset: http://crcv.ucf.edu/data/UCF101.php

Potential applications:

• Activity detection and monitoring for security

• Automatic editing of captured moments from a video camera

• Facial recognition and image-based retrieval

• Sense-and-avoid systems for autonomous driving

• Baggage screening at airports and other public venues

Demo: https://www.youtube.com/watch?v=ydnpgUOpdBw

Page 4: Nervana and the Future of Computing

Object localization and recognition

Page 5: Nervana and the Future of Computing


Speech to text

Demo: https://youtu.be/NaqZkV_fBIM

Page 6: Nervana and the Future of Computing

Question answering

Stories: Mary journeyed to Texas. John went to Maryland. Mary went to Iowa. John travelled to Florida.

Question: Where is John located?

Answer: Florida
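To make the task format concrete, here is a toy baseline (not the deep learning model) that answers this style of question by tracking each entity's most recently mentioned location; a learned model such as an LSTM or memory network has to infer the same behavior from data. The helper names below are illustrative.

```python
# Toy baseline for the bAbI-style example above: answer "where is X?" by
# tracking each entity's most recently mentioned location. This is NOT the
# deep learning model; it only makes the task format concrete.

def answer(story_sentences, entity):
    last_location = {}
    for sentence in story_sentences:
        words = sentence.rstrip(".").split()
        # sentences have the form "<entity> <verb> to <location>."
        last_location[words[0]] = words[-1]
    return last_location.get(entity)

story = [
    "Mary journeyed to Texas.", "John went to Maryland.",
    "Mary went to Iowa.", "John travelled to Florida.",
]
print(answer(story, "John"))  # -> Florida
```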

Page 7: Nervana and the Future of Computing

Reinforcement learning

Pong and Breakout

Demo videos: https://youtu.be/KkIf0Ok5GCE, https://youtu.be/0ZlgrQS3krg
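The Pong and Breakout agents are trained with deep Q-learning (DQN). Below is a tabular sketch of the underlying Q-learning update; DQN replaces the table with a convolutional network over game frames. The `env` interface (reset/step/actions) and all hyperparameters are illustrative assumptions, not Nervana's implementation.

```python
import random

# Tabular Q-learning sketch: the update rule that DQN approximates with a deep
# network when states (e.g. Atari frames) are too numerous for a table.

def q_learning(env, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    q = {}  # (state, action) -> estimated return

    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: q.get((state, a), 0.0))

            next_state, reward, done = env.step(action)

            # Bellman update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
            best_next = max(q.get((next_state, a), 0.0) for a in env.actions)
            target = reward + (0.0 if done else gamma * best_next)
            old = q.get((state, action), 0.0)
            q[(state, action)] = old + alpha * (target - old)

            state = next_state
    return q
```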

Page 8: Nervana and the Future of Computing

Application areas

Healthcare, Agriculture, Finance, Online Services, Automotive, Energy

Page 9: Nervana and the Future of Computing

Nervana is building the future of computing

The Economist, March 12, 2016

[Diagram: Cloud Computing + Custom ASIC + Deep Learning / AI]

Page 10: Nervana and the Future of Computing

nervana cloud

[Diagram: data of many kinds (images, text, tabular, speech, time series, video) flows into the nervana cloud through an import, train, build, deploy workflow.]

Page 11: Nervana and the Future of Computing

nervana neon

Page 12: Nervana and the Future of Computing

nervana neon

• Fastest library

Page 13: Nervana and the Future of Computing

nervana neon

• Fastest library

Page 14: Nervana and the Future of Computing

nervana neon

• Fastest library

• Model support

Models: Convnet, RNN, LSTM, MLP, DQN, NTM

Domains: images, video, speech, text, time series
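As a rough sketch of what defining and training one of these models looks like in neon's Python API of that era (module paths, class names, and signatures recalled from memory and may differ across neon versions; the random data stands in for a real dataset):

```python
import numpy as np
from neon.backends import gen_backend
from neon.data import ArrayIterator
from neon.initializers import Gaussian
from neon.layers import Affine, GeneralizedCost
from neon.models import Model
from neon.optimizers import GradientDescentMomentum
from neon.transforms import Rectlin, Softmax, CrossEntropyMulti
from neon.callbacks.callbacks import Callbacks

# Backend is generated once; 'cpu' or 'gpu' (see the backends list further on).
be = gen_backend(backend='cpu', batch_size=128)

# Stand-in data: 1000 random 784-dimensional samples with 10 classes.
X = np.random.rand(1000, 784)
y = np.random.randint(10, size=1000)
train_set = ArrayIterator(X, y, nclass=10)

# A small MLP; Conv / RNN / LSTM layers from neon.layers plug into the same list.
init = Gaussian(scale=0.01)
mlp = Model(layers=[
    Affine(nout=100, init=init, activation=Rectlin()),
    Affine(nout=10, init=init, activation=Softmax()),
])

cost = GeneralizedCost(costfunc=CrossEntropyMulti())
opt = GradientDescentMomentum(0.1, momentum_coef=0.9)
# Callback and iterator signatures varied across neon versions.
mlp.fit(train_set, optimizer=opt, num_epochs=2, cost=cost,
        callbacks=Callbacks(mlp))
```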

Page 15: Nervana and the Future of Computing


Running locally:

% python rnn.py # or neon rnn.yaml

Running in nervana cloud:

% ncloud submit --py rnn.py # or --yaml rnn.yaml

% ncloud show <model_id>

% ncloud list

% ncloud deploy <model_id>

% ncloud predict <model_id> <data> # or use REST api

nervana neon


• Fastest library

• Model support

• Cloud integration

Page 16: Nervana and the Future of Computing


Backends

• CPU

• GPU

• Multiple GPUs

• Parameter server

• (Xeon Phi)

• nervana TPU

nervana neon


• Fastest library

• Model support

• Cloud integration

• Multiple backends
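Switching among the backends listed above is meant to require no changes to the model code; only the backend construction differs. A minimal sketch, with argument values recalled from memory and possibly version-dependent:

```python
from neon.backends import gen_backend

# The same neon model and training script run against whichever backend is
# generated here; swapping backends is a one-line change.
be = gen_backend(backend='cpu', batch_size=64)    # NumPy-based CPU backend
# be = gen_backend(backend='gpu', batch_size=64)  # single NVIDIA GPU (nervanagpu kernels)
# Multi-GPU, parameter-server, and nervana TPU backends follow the same pattern
# in the corresponding neon / nervana cloud distributions.
```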

Page 17: Nervana and the Future of Computing

nervana neon

• Fastest library

• Model support

• Cloud integration

• Multiple backends

• Optimized at assembler level

Page 18: Nervana and the Future of Computing

nervana tensor processing unit (TPU)

Page 19: Nervana and the Future of Computing

nervana tensor processing unit (TPU)

• Unprecedented compute density

[Diagram: 1 nervana engine = 10 GPUs = 200 CPUs]

Page 20: Nervana and the Future of Computing

nervana tensor processing unit (TPU)

• Unprecedented compute density

• Scalable distributed architecture

Page 21: Nervana and the Future of Computing

nervana tensor processing unit (TPU)

• Unprecedented compute density

• Scalable distributed architecture

• Memory near computation

[Diagram: a conventional CPU with control logic, an ALU, and separate instruction and data memory, contrasted with the Nervana design, which places data memory next to the compute units.]

Page 22: Nervana and the Future of Computing

nervana tensor processing unit (TPU)

• Unprecedented compute density

• Scalable distributed architecture

• Memory near computation

• Learning and inference

Page 23: Nervana and the Future of Computing

nervana tensor processing unit (TPU)

• Unprecedented compute density

• Scalable distributed architecture

• Memory near computation

• Learning and inference

• Exploit limited precision

Page 24: Nervana and the Future of Computing

nervana tensor processing unit (TPU)

• Unprecedented compute density

• Scalable distributed architecture

• Memory near computation

• Learning and inference

• Exploit limited precision

• Power efficiency

Page 25: Nervana and the Future of Computing

nervana tensor processing unit (TPU)

• 10-100x gain

• Architecture optimized for:

• Unprecedented compute density

• Scalable distributed architecture

• Memory near computation

• Learning and inference

• Exploit limited precision

• Power efficiency
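To illustrate what "exploit limited precision" means in practice, here is a toy stochastic-rounding routine that quantizes tensors to a coarse fixed-point grid while keeping the rounding error zero in expectation. The step size and scheme are illustrative only, not Nervana's numeric format.

```python
import numpy as np

# Toy illustration of limited-precision arithmetic: stochastic rounding of
# tensors to a coarse fixed-point grid. Deep learning tolerates this well,
# which is what lets hardware use narrow multipliers for higher density.

def stochastic_round(x, step=1.0 / 256):
    scaled = x / step
    lower = np.floor(scaled)
    # round up with probability equal to the fractional remainder,
    # so the rounding error is zero in expectation
    up = np.random.random(x.shape) < (scaled - lower)
    return (lower + up) * step

w = np.random.randn(4, 4).astype(np.float32)
print(stochastic_round(w))
```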

Page 26: Nervana and the Future of Computing

Special purpose computation

1940s: Turing Bombe

Motivation: Automating calculations, code breaking

Page 27: Nervana and the Future of Computing

General purpose computation

2000s: SoC

Motivation: reduce power and cost, fungible computing.

Enabled inexpensive mobile devices.

Page 28: Nervana and the Future of Computing

Dennard scaling has ended

What business and technology constraints do we have now?

Page 29: Nervana and the Future of Computing

Many-core tiled architectures

[Excerpt from the Tilera TILEPro Series datasheet, Chapter 2, "Tile Processor Architecture Overview":]

The Tile Processor™ implements Tilera's multicore architecture, incorporating a two-dimensional array of processing elements (each referred to as a tile), connected via multiple two-dimensional mesh networks. Based on Tilera's iMesh™ Interconnect technology, the architecture is scalable and provides high-bandwidth, extremely low-latency communication among tiles. The Tile Processor integrates external memory and I/O interfaces on chip and is a complete programmable multicore processor; external memory and I/O interfaces are connected to the tiles via the iMesh interconnect.

Each tile is a powerful, full-featured computing system that can independently run an entire operating system, such as Linux. Each tile implements a 32-bit integer processor engine utilizing a three-way Very Long Instruction Word (VLIW) architecture with its own program counter (PC), cache, and DMA subsystem. An individual tile is capable of executing up to three operations per cycle.

[Figure 2-1: Tile Processor Hardware Architecture, showing the 64-core TILEPro64™ with detail of an individual tile's structure.]


2010s: multi-core, GPGPU

Motivation: increased performance without clock rate increase or smaller devices.

Requires changes in programming paradigm.

Tilera, NVIDIA GM204, Intel Xeon Phi Knights Landing

Page 30: Nervana and the Future of Computing

FPGA architectures

Altera Arria 10

Motivation: fine-grained parallelism, reconfigurable, lots of I/O, scalable.

Slow clock speeds and a lack of compute density for machine learning.

Page 31: Nervana and the Future of Computing


Neuromorphic architectures

IBM TrueNorth

[Excerpt from the TrueNorth paper (Science, vol. 345, issue 6197, 8 August 2014): spike events are carried between cores over time-multiplexed wires on a two-dimensional mesh of five-port routers interconnecting a 64-by-64 core array, using deadlock-free dimension-order routing, with merge/split structures at the chip edges so the mesh can extend across chip boundaries. TrueNorth itself is a fully functional digital chip with 1 million spiking neurons and 256 million (non-plastic) synapses, built from 5.4 billion transistors on a 4.3 cm² die in Samsung's 28-nm process, with roughly 428 million bits of on-chip memory and a power density of about 20 mW per cm². The paper's Figure 2 illustrates the architecture at core, chip, and multichip scales.]

Page 32: Nervana and the Future of Computing

Neural network parallelism

[Diagram: Data parallelism: data chunk 1 … data chunk n are fed to processor 1 … processor n, each holding the full deep network, with a parameter server coordinating parameter updates. Model parallelism: a single network is split across processors.]
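A minimal sketch of the data-parallel scheme described above: each worker holds a full model replica, computes a gradient on its own data chunk, and a parameter-server step averages the gradients before updating the shared weights. All names and the least-squares example are illustrative.

```python
import numpy as np

# Toy synchronous data parallelism: full model replica per "processor",
# per-chunk gradients, parameter-server averaging. Illustrative only.

def sgd_data_parallel(w, data_chunks, grad_fn, lr=0.01, steps=500):
    for _ in range(steps):
        # each worker computes a gradient on its local chunk (full model replica)
        grads = [grad_fn(w, chunk) for chunk in data_chunks]
        # parameter server: average gradients and update the shared weights
        w = w - lr * np.mean(grads, axis=0)
    return w

# Example: least-squares regression on synthetic data split across 4 workers.
def grad_fn(w, chunk):
    X, y = chunk
    return 2 * X.T @ (X @ w - y) / len(y)

X = np.random.randn(400, 5)
true_w = np.arange(5.0)
y = X @ true_w
chunks = [(X[i::4], y[i::4]) for i in range(4)]
w = sgd_data_parallel(np.zeros(5), chunks, grad_fn)
print(np.round(w, 2))  # approaches [0. 1. 2. 3. 4.]
```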

Page 33: Nervana and the Future of Computing

Existing computing topologies are lacking

[Diagram: a conventional server node: two CPUs with an SSD and an IB/10G network interface, plus four GPUs.]

Page 34: Nervana and the Future of Computing

Existing computing topologies are lacking

[Diagram: the same node as on the previous slide.]

Page 35: Nervana and the Future of Computing

Existing computing topologies are lacking

[Diagram: two such nodes, with the GPUs in each node attached through a PCIe switch.]

Page 36: Nervana and the Future of Computing

Existing computing topologies are lacking

[Diagram: the same two nodes as on the previous slide.]

Page 37: Nervana and the Future of Computing

Existing computing topologies are lacking

[Diagram: the two nodes as before, plus a configuration in which four GPUs share a single PCIe switch connected to a dual-CPU host with an SSD and an IB/10G interface.]

Page 38: Nervana and the Future of Computing

Existing computing topologies are lacking

[Diagram: as on the previous slide, with a further group of four GPUs on an additional PCIe switch.]

Page 39: Nervana and the Future of Computing

Existing computing topologies are lacking

[Diagram: same as the previous slide.]

Page 40: Nervana and the Future of Computing

nervana compute topology

[Diagram: host CPUs, each with an SSD and an IB/10G interface, connect through PCIe switches to an array of nervana engines (n).]

Page 41: Nervana and the Future of Computing

Distributed linear algebra and convolution

[Slides excerpted from Jim Demmel's CS267 lecture notes, "SUMMA distributed matrix multiply C = A*B":]

Summary of parallel matrix multiply:

• SUMMA (Scalable Universal Matrix Multiply Algorithm): attains the communication lower bounds (within a log p factor); presentation follows van de Geijn and Watts (www.netlib.org/lapack/lawns/lawn96.ps); used in practice in PBLAS, the Parallel BLAS (www.netlib.org/lapack/lawns/lawn100.ps).

• Cannon's algorithm: historically first to attain the lower bounds, but makes more assumptions (A and B square, P a perfect square).

• 2.5D SUMMA: uses more memory to communicate even less.

• Parallel Strassen: attains different, even lower bounds.

SUMMA uses the outer-product form of matrix multiply: C = A*B means C(i,j) = Σ_k A(i,k)*B(k,j), i.e. C = Σ_k (k-th column of A)*(k-th row of B), computed block-column-wise (for example in panels of 4 columns of A times 4 rows of B). For an n x n multiply on a P^(1/2) x P^(1/2) processor grid, C[i,j] is the n/P^(1/2) x n/P^(1/2) submatrix of C on processor P_ij, A[i,k] is an n/P^(1/2) x b panel of A, B[k,j] is a b x n/P^(1/2) panel of B, and each processor accumulates C[i,j] = C[i,j] + Σ_k A[i,k]*B[k,j]; the processor grid need not be square.

[Excerpt from "Matrix multiplication on multidimensional torus networks," Edgar Solomonik and James Demmel, Division of Computer Science, University of California at Berkeley:]

Abstract. Blocked matrix multiplication algorithms such as Cannon's algorithm and SUMMA have a 2-dimensional communication structure. We introduce a generalized "Split-Dimensional" version of Cannon's algorithm (SD-Cannon) with higher-dimensional and bidirectional communication structure. This algorithm is useful for higher-dimensional torus interconnects that can achieve more injection bandwidth than single-link bandwidth. On a bidirectional torus network of dimension d, SD-Cannon can lower the algorithmic bandwidth cost by a factor of up to d. With rectangular collectives, SUMMA also achieves the lower bandwidth cost but has a higher latency cost. We use Charm++ virtualization to efficiently map SD-Cannon onto unbalanced and odd-dimensional torus network partitions. Our performance study on Blue Gene/P demonstrates that an MPI version of SD-Cannon can exploit multiple communication links and improve performance.

[The paper's introduction contrasts the two algorithms: Cannon's algorithm uses only near-neighbor shifts on a √p-by-√p grid and so can saturate at most two network links per node, while SUMMA's row and column broadcasts can be mapped with rectangular collectives to exploit all 2d links of a d-dimensional bidirectional torus, at the cost of sending more messages.]

Page 42: Nervana and the Future of Computing

Summary

• Computers are tools for solving problems of their time

• Was: coding, calculation, graphics, the web

• Today: learning and inference on data

• Deep learning as a computational paradigm

• Custom architecture can do vastly better