Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k...

24
Bit Fusion Bit-Level Dynamically Composable Architecture for Deep Neural Networks Hardik Sharma Jongse Park Naveen Suda Liangzhen Lai Benson Chau Vikas Chandra Hadi Esmaeilzadeh Alternative Computing Technologies (ACT) Lab Arm, Inc. Georgia Institute of Technology University of California, San Diego

Transcript of Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k...

Page 1: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC

Bit FusionBit-Level Dynamically Composable

Architecture for Deep Neural Networks

Hardik SharmaJongse ParkNaveen Suda†

Liangzhen Lai†

Benson ChauVikas Chandra†

Hadi Esmaeilzadeh‡Alternative Computing Technologies (ACT) Lab

†Arm, Inc.

Georgia Institute of Technology

‡University of California, San Diego

Page 2: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC

0%20%40%60%80%

100%

AlexNet

CIFAR1

0

LSTM

LeNet-5

RESN

ET-

18 RNN

SVHN

VGG-7

Avg

1bit/1bit 2bit/2bit 4bit/4bit 8bit/1bit 8bit/8bit

DNNs Tolerate Low-Bitwidth Operations

>99.4% Multiply-Adds require less than 8-bits

2

Page 3: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC

Bitwidth Flexibility is Necessary for Accuracy

A fixed-bitwidth accelerator would either achieve limitedbenefits (8-bit), or compromise on accuracy (<8-bit)

3

Conv.8b/8b

Conv.4b/4b

Conv.4b/4b

Conv.4b/4b

Conv.4b/4b

FC4b/4b

FC4b/4b

FC8b/8b

Conv.2b/2b

Conv.2b/2b

FC2b/2b

FC2b/2b

AlexNet:IMAGENETdataset(Mishraetal.,WRPN,arXiv2017)

LeNet:MNISTdataset(Lietal.,TWN,arXiv2016)

Page 4: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC

Our Approach: Bit-level Composability

BitBricks (BBs) are bit-level composable compute units

sy y1 y0

33

6

sx x1 x0

signmode

BitBrick(BB)BB

BB

BB

BB+

BB

BB

BB

BB+

BB

BB

BB

BB+

BB

BB

BB

BB+

+

FusionUnitWBUF

PsumForward

InputForward

4

Page 5: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC

Compute units (BitBricks)

logically fuse at runtime to form

Fused-PEs (F-PEs) that dynamically match bit-width

of the DNN layers

5

(b)16xParallelism,Binary(1-bit)orTernary(2-bit)

Psum forward

+ +

+ +

+

F-PE F-PE F-PE F-PE

F-PE F-PE F-PE F-PE

F-PE F-PE F-PE F-PE

F-PE F-PE F-PE F-PE

WBUF

Input forward

(d)NoParallelism,8-bits

Psum forward

Input forward

+ +

+ +

+F-PEWBUF

(c)4xParallelism,Mixed-Bitwidth(2-bitweights,8-bitinputs)

Psum forward

WBUF

Input forward

F-PE

F-PE

F-PE

F-PE

(a)FusionUnitwith16BitBricks

Psum forward

BB

BB

BB

BB+

BB

BB

BB

BB+

BB

BB

BB

BB+

BB

BB

BB

BB+

+

Input forward

WBUF

Page 6: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC

Config #1 : Binary/Ternary Mode

Each BitBrick performs a binary/ternary multiplication16x parallelism

BB

BB

BB

BB+

BB

BB

BB

BB+

BB

BB

BB

BB+

BB

BB

BB

BB+

+

FusionUnit2-bit F-PE F-PEF-PE

F-PE F-PE F-PEF-PE

F-PE F-PEF-PE

F-PE F-PEF-PE

F-PE

F-PE

Input

Weight2-bit

6

Page 7: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC

Config #2: 4-bit Mode

Four BitBricks fuse to form a Fused-PE (F-PE)4x Parallelism

BB

BB

BB

BB+

BB

BB

BB

BB+

BB

BB

BB

BB+

BB

BB

BB

BB+

+

FusionUnit

F-PE

F-PEF-PE

Input(4-bit)

Weight(4-bit)

2-bit 2-bit

2-bit 2-bit

Par9alProducts

7

Page 8: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC

Config #3 : 8-bit, 4-bit (Mixed-Mode)

Eight BitBricks fuse to form a Fused-PE (F-PE)2x Parallelism

BB

BB

BB

BB+

BB

BB

BB

BB+

BB

BB

BB

BB+

BB

BB

BB

BB+

+

FusionUnit

F-PE

Input(8-bit)

Weight(4-bit)2-bit 2-bit

Par:alProducts

2-bit2-bit 2-bit2-bit

8

Page 9: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC

Spatial Fusion vs. Temporal Design

Temporal Design (Bit Serial): Combine results over time

Out

<<

g0 h0

g1 h1

g2 h2

g3 h3

Out

<<

e0 f0

e1 f1

e2 f2

e3 f3

Out

<<

c0 d0

c1 d1

c2 d2

c3 d3

Out

<<

a0 b0

a1 b1

a2 b2

a3 b3

Inpu

ts o

ver

time

1

2

3

Spatial Fusion (Bit Parallel): Combine results over space

Out

<<<<<<<<

a0 b0

c0 d0

e0 f0

g0 h0

a1 b1

c1 d1

e1 f1

g1 h1

a2 b2

c2 d2

e2 f2

g2 h2

a3 b3

c3 d3

e3 f3

g3 h3

1

2

3

Inpu

ts o

ver

time

9

Page 10: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC

Spatial Fusion Surpasses Temporal Design

Area (um^2) BitBricks Shift-Add RegisterTotal

Area

Temporal 463 2989 1454 4905Fusion Unit 369 934 91 1394

Power (nW) BitBricks Shift-Add RegisterTotal

Power

Temporal 60 550 1103 1712Fusion Unit 46 424 69 538Synthesized using a commercial 45 nm technology

10

3.5x lower

area

3.2x lower

power

Page 11: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC

Control

BB

BB

BB

BB+

BB

BB

BB

BB+

BB

BB

BB

BB+

BB

BB

BB

BB+

+

BB

BB

BB

BB+

BB

BB

BB

BB+

BB

BB

BB

BB+

BB

BB

BB

BB+

+

BB

BB

BB

BB+

BB

BB

BB

BB+

BB

BB

BB

BB+

BB

BB

BB

BB+

+

BB

BB

BB

BB+

BB

BB

BB

BB+

BB

BB

BB

BB+

BB

BB

BB

BB+

+

FusionUnit FusionUnit

FusionUnitFusionUnit

IBUF(Sha

red)

IBUF(Sha

red)

WBUF WBUF

WBUF WBUF

OBUF

+

PoolingUnit Ac.va.onUnit

OBUF

+

PoolingUnit Ac.va.onUnit

Bit Fusion Systolic Array Architecture

11

Page 12: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC

Programmability: BitFusion ISA

Requirements

Amortize cost of bit-level fusion

Enable flexible Data-Path

Concise

12

Page 13: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC

ISA: Amortize the Cost of Bit-Level Fusion

Use a block-structured ISA for groups of operations (layers)

Convolu'on8-bit/8-bit

Convolu'on4-bit/1-bit

Conv 1

Block begin: 8-bit/8-bit

Block end: next block

Block begin: 4-bit/1-bit

Convolu'on4-bit/8-bit

Block end: next block

13

Page 14: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC

ISA: Concise Expression for DNNs

Use loop instructions as DNNs consist of large number of repeated operations

OC

IC

IC

BB

OC

Fully-Connected Layer

loop: for j in (1 OC)loop: for k in (1 IC)

loop: for i in (1 B)

14

Page 15: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC

ISA: Concise Expression for DNNs

DNNs have regular memory access patternUse loop indices to generate memory accesses

OC

IC

IC

BB

OC

Fully-Connected Layer

loop: for j in (1 OC)loop: for k in (1 IC)input k 1 + j 0 + i ICweight k 1 + j IC + i 0output k 0 + j 1 + i OC

loop: for i in (1 B)

15

Page 16: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC

ISA: Flexible Storage

ISA changes the semantics of off-chip and on-chip memory accesses according to bitwidth of operands

2-bitmode

16xparallelism

Need:32-bitinputs,32-bitweights

8-bitmode

1xparallelism

Need:8-bitinput,8-bitweight

16

Page 17: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC

ISA: Flexible Storage (Software View)

WBUF

32-bit

WBUF

16-bitRegister Register

WBU

F

8-bitReg

Software views the buffers as having a flexible aspect ratio

17

Page 18: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC

Benchmarked Platforms18

NvidiaTitan-XGPU

NvidiaTegraTX2

ASICBit-Serial

Op5mizedDataflow

HighPerformance

LowPower

Stripes(Micro’16)

Eyeriss(ISCA’16)

Page 19: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC

Benchmarked DNN Models

SVHNVGG-7

RESNET-18RNN

DNN

AlexNetCIFAR10LSTM

LeNet-5

CNNCNN

CNNRNN

Type

CNNCNNRNNCNN

158MOps317MOps

4,269MOps17MOps

Mul(ply-Adds

2,678MOps617MOps13MOps16MOps

0.8MBytes2.7MBytes

13MBytes8.0MBytes

Bit-FlexibleModelWeights

116.3MBytes3.3MBytes6.2MBytes0.5MBytes

24.4MBytes43.3MBytes

103.7MBytes64.0MBytes

OriginalModelWeights898.6MBytes53.5MBytes49.4MBytes8.2MBytes

19

Page 20: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC

Comparison with Eyeriss

3.9× speedup and 5.1× energy reduction over Eyeriss

Impr

ovem

ent

over

Eye

riss

0x

4x

8x

12x

AlexNet Cifar-10 LSTM LeNet-5 ResNet-18 RNN SVHN VGG-7 geomean

5.1

9.910.0

5.1

1.9

4.34.8

14.0

1.5

3.9

7.78.6

2.71.92.72.4

13.0

1.9

Performance Energy Reduction

20

Page 21: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC

Comparison with Stripes

2.6× speedup and 3.9× energy reduction over Stripes

Impr

ovem

ent

over

Stri

pes

0×2×4×6×8×

AlexNet Cifar-10 LSTM LeNet-5ResNet-18 RNN SVHN VGG-7 geomean

3.9x4.4x2.7x3.0x

4.4x

7.8x

3.1x

6.0x

2.7x 2.6x2.9x1.8x2.0x2.6x

5.2x

2.1x4.0x

1.8x

Performance Energy Reduction

21

Page 22: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC

Spee

dup

over

TX

2

10×

20×

30×

AlexNet Cifar-10 LSTM LeNet-5 ResNet-18 RNN SVHN VGG-7 geomean

16x

48x

14x

39x

5x11x

38x34x

3x

19x

30x

21x

7x

31x27x

7x

29x23x

TitanX-INT8 Bit Fusion

Comparison with GPUs

Bit Fusion provides almost the same performance as Titan Xp (250 W) with only 895 mW

22

Page 23: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC

Conclusion

Emerging research shows we can reduce bitwidths for DNNs without losing accuracy

Bit Fusion defines a new dimension of bit-level dynamic composability to leverage this opportunity

BitFusion ISA exposes this capability to software stack

23

Page 24: Bit Fusion - Welcome to Iscaconf.org · OC Fully-Connected Layer loop: for j in (1 OC) loop: for k in (1 IC) input k 1 + j 0 + i IC weight k 1 + j IC + i 0 output k 0 + j 1 + i OC