Transcript of "Bit Fusion" (Iscaconf.org)
Bit Fusion: Bit-Level Dynamically Composable Architecture for Deep Neural Networks

Hardik Sharma, Jongse Park, Naveen Suda†, Liangzhen Lai†, Benson Chau, Vikas Chandra†, Hadi Esmaeilzadeh‡
Alternative Computing Technologies (ACT) Lab, Georgia Institute of Technology
†Arm, Inc.
‡University of California, San Diego
DNNs Tolerate Low-Bitwidth Operations

[Chart: percentage (0–100%) of multiply-add operations per operand configuration (1-bit/1-bit, 2-bit/2-bit, 4-bit/4-bit, 8-bit/1-bit, 8-bit/8-bit) for AlexNet, CIFAR-10, LSTM, LeNet-5, ResNet-18, RNN, SVHN, VGG-7, and the average]

More than 99.4% of multiply-adds require fewer than 8 bits.
Bitwidth Flexibility is Necessary for Accuracy

A fixed-bitwidth accelerator would either achieve limited benefits (8-bit) or compromise on accuracy (<8-bit).

[Diagram: per-layer bitwidths within two networks; layers mix 8-bit/8-bit, 4-bit/4-bit, and 2-bit/2-bit convolution and fully-connected layers]
AlexNet: IMAGENET dataset (Mishra et al., WRPN, arXiv 2017)
LeNet: MNIST dataset (Li et al., TWN, arXiv 2016)
Our Approach: Bit-Level Composability

BitBricks (BBs) are bit-level composable compute units.

[Diagram: a BitBrick multiplies two sign-extended 2-bit operands (sx x1 x0 and sy y1 y0) under a sign/mode control; a Fusion Unit arranges 16 BitBricks in a 4×4 grid with adders, a weight buffer (WBUF), partial-sum forwarding, and input forwarding]

Compute units (BitBricks) logically fuse at runtime to form Fused-PEs (F-PEs) that dynamically match the bitwidth of the DNN layers.
[Diagram, Fusion Unit configurations: (a) Fusion Unit with 16 BitBricks; (b) 16× parallelism, binary (1-bit) or ternary (2-bit); (c) 4× parallelism, mixed bitwidth (2-bit weights, 8-bit inputs); (d) no parallelism, 8-bit. Each configuration shows the same WBUF, partial-sum forwarding, and input forwarding paths with different F-PE groupings]
Config #1: Binary/Ternary Mode

Each BitBrick performs a binary/ternary multiplication: 16× parallelism.

[Diagram: all 16 BitBricks in the Fusion Unit act as independent 2-bit F-PEs, each taking a 2-bit input and a 2-bit weight]
Config #2: 4-bit Mode

Four BitBricks fuse to form a Fused-PE (F-PE): 4× parallelism.

[Diagram: a 4-bit input and a 4-bit weight are each split into 2-bit halves; four BitBricks produce the partial products, which are shifted and summed]
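The 4-bit mode can be sketched in software: a 4-bit × 4-bit product decomposes into four 2-bit × 2-bit partial products (one per BitBrick), each shifted by its chunk position and summed. A minimal unsigned sketch; the function names are illustrative, not from the paper:

```python
def bitbrick(x, y):
    """One BitBrick: an unsigned 2-bit x 2-bit multiply."""
    assert 0 <= x < 4 and 0 <= y < 4
    return x * y

def fused_pe_4bit(a, b):
    """Four BitBricks fused into one 4-bit x 4-bit F-PE.

    With a = (aH << 2) | aL and b = (bH << 2) | bL:
    a * b = (aH*bH << 4) + (aH*bL + aL*bH << 2) + aL*bL.
    """
    a_hi, a_lo = a >> 2, a & 0b11
    b_hi, b_lo = b >> 2, b & 0b11
    return ((bitbrick(a_hi, b_hi) << 4)
            + (bitbrick(a_hi, b_lo) << 2)
            + (bitbrick(a_lo, b_hi) << 2)
            + bitbrick(a_lo, b_lo))
```

The shift amounts mirror the hardware's shift-add network: each partial product is weighted by the combined significance of the two 2-bit chunks that produced it.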
Config #3: 8-bit/4-bit (Mixed Mode)

Eight BitBricks fuse to form a Fused-PE (F-PE): 2× parallelism.

[Diagram: an 8-bit input (four 2-bit chunks) and a 4-bit weight (two 2-bit chunks) feed eight BitBricks; their partial products are shifted and summed]
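The mixed mode generalizes the same decomposition: an 8-bit input splits into four 2-bit chunks and a 4-bit weight into two, so 4 × 2 = 8 BitBrick products form one result. A hedged unsigned sketch (names and defaults are mine, not the paper's):

```python
def fused_pe_mixed(x, w, x_bits=8, w_bits=4):
    """Fuse (x_bits/2) * (w_bits/2) BitBricks into one mixed-bitwidth F-PE.

    Each pair of 2-bit chunks is one BitBrick multiply; partial products
    are shifted by the combined chunk position and summed.
    """
    total = 0
    for i in range(x_bits // 2):       # 2-bit chunks of the input
        for j in range(w_bits // 2):   # 2-bit chunks of the weight
            x_chunk = (x >> (2 * i)) & 0b11
            w_chunk = (w >> (2 * j)) & 0b11
            total += (x_chunk * w_chunk) << (2 * (i + j))
    return total
```

With the defaults this exercises exactly eight BitBricks, matching the configuration on this slide; setting `x_bits = w_bits = 4` recovers the 4-bit mode.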
Spatial Fusion vs. Temporal Design

Temporal design (bit-serial): combine results over time.

[Diagram: four bit-serial units; each shifts and accumulates (<<) one pair of operand bits per cycle (a0 b0 through a3 b3, c0 d0 through c3 d3, and so on), with inputs arriving over time]

Spatial fusion (bit-parallel): combine results over space.

[Diagram: all sixteen operand bit pairs are multiplied in the same cycle, then shifted and combined in a single adder tree]
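The contrast can be expressed in code: a bit-serial (temporal) unit feeds one bit-slice per cycle into a shift-accumulate register, while spatial fusion evaluates all bit-slice products at once and sums them in an adder tree. A simplified unsigned sketch with my own naming:

```python
def temporal_multiply(a, b, bits=4):
    """Bit-serial: one bit of b per cycle, shift-and-accumulate over time."""
    acc = 0
    for cycle in range(bits):
        bit = (b >> cycle) & 1
        acc += (a * bit) << cycle   # one partial product per cycle
    return acc

def spatial_multiply(a, b, bits=4):
    """Bit-parallel: all partial products in one cycle, summed in space."""
    return sum((a * ((b >> k) & 1)) << k for k in range(bits))
```

Both compute the same product; the designs differ only in whether the partial products are separated in time (needing a register per serial unit) or in space (sharing one adder tree), which is what the area and power comparison on the next slide quantifies.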
Spatial Fusion Surpasses Temporal Design

Area (µm²)   | BitBricks | Shift-Add | Register | Total
Temporal     | 463       | 2989      | 1454     | 4905
Fusion Unit  | 369       | 934       | 91       | 1394

Power (nW)   | BitBricks | Shift-Add | Register | Total
Temporal     | 60        | 550       | 1103     | 1712
Fusion Unit  | 46        | 424       | 69       | 538

Synthesized using a commercial 45 nm technology. The Fusion Unit has 3.5× lower area and 3.2× lower power than the temporal design.
Bit Fusion Systolic Array Architecture

[Diagram: an array of Fusion Units (four shown) under a control unit; each row shares an input buffer (IBUF), each Fusion Unit has its own weight buffer (WBUF), and each column accumulates into an output buffer (OBUF) that feeds a Pooling Unit and an Activation Unit]
Programmability: The BitFusion ISA

Requirements:
- Amortize the cost of bit-level fusion
- Enable a flexible datapath
- Concise expression of DNNs
ISA: Amortize the Cost of Bit-Level Fusion

Use a block-structured ISA: each group of operations (a layer) forms a block, and the bitwidth is set at block boundaries.

Block begin: 8-bit/8-bit
  Convolution (8-bit/8-bit)
Block end: next block
Block begin: 4-bit/1-bit
  Convolution (4-bit/1-bit)
Block end: next block
Block begin: 4-bit/8-bit
  Convolution (4-bit/8-bit)
Block end: next block
ISA: Concise Expression for DNNs

DNNs consist of a large number of repeated operations, so the ISA provides loop instructions.

[Diagram: a fully-connected layer as a matrix multiply; (B × IC) inputs times (IC × OC) weights yield (B × OC) outputs]

Fully-connected layer:
loop: for j in (1, OC)
  loop: for k in (1, IC)
    loop: for i in (1, B)
ISA: Concise Expression for DNNs

DNNs have regular memory access patterns, so the loop indices generate the memory addresses directly. Each address is a sum of index × stride terms, with a zero stride for indices a tensor does not depend on:

loop: for j in (1, OC)
  loop: for k in (1, IC)
    loop: for i in (1, B)
      input:  k·1 + j·0  + i·IC
      weight: k·1 + j·IC + i·0
      output: k·0 + j·1  + i·OC
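The stride sums above can be sketched in software. This is my reading of the slide, not the ISA specification: the three loop indices drive all three access streams of a fully-connected layer, with the memory layout implied by the strides.

```python
def fc_layer(inputs, weights, B, IC, OC):
    """Fully-connected layer with addresses generated from loop indices.

    Strides (per the slide): input  = k*1 + j*0  + i*IC
                             weight = k*1 + j*IC + i*0
                             output = k*0 + j*1  + i*OC
    """
    outputs = [0] * (B * OC)
    for i in range(B):              # loop: for i in (1, B)
        for j in range(OC):         # loop: for j in (1, OC)
            for k in range(IC):     # loop: for k in (1, IC)
                in_addr = k * 1 + j * 0 + i * IC
                w_addr = k * 1 + j * IC + i * 0
                out_addr = k * 0 + j * 1 + i * OC
                outputs[out_addr] += inputs[in_addr] * weights[w_addr]
    return outputs
```

Because every address is a linear function of the loop indices, the hardware can generate the whole access stream from the loop instructions alone, with no per-element address instructions.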
ISA: Flexible Storage

The ISA changes the semantics of off-chip and on-chip memory accesses according to the bitwidth of the operands.

2-bit mode: 16× parallelism; needs 32 bits of inputs and 32 bits of weights per access.
8-bit mode: 1× parallelism; needs one 8-bit input and one 8-bit weight per access.
ISA: Flexible Storage (Software View)

Software views the buffers as having a flexible aspect ratio: the same WBUF appears as a 32-bit register, a 16-bit register, or an 8-bit register, depending on the operand bitwidth of the current mode.
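The flexible aspect ratio can be mimicked in software: the same physical buffer word is carved into lanes whose width matches the current mode. A small sketch under my own naming, with unsigned lanes:

```python
def as_lanes(word, lane_bits, word_bits=32):
    """View one physical buffer word as lanes of the current mode's width."""
    assert word_bits % lane_bits == 0
    mask = (1 << lane_bits) - 1
    return [(word >> (i * lane_bits)) & mask
            for i in range(word_bits // lane_bits)]
```

For example, one 32-bit WBUF word read in 2-bit mode yields 16 weights (one per BitBrick), while the same word in 8-bit mode yields four 8-bit weights; the storage itself never changes, only the lane width software sees.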
Benchmarked Platforms

- Nvidia Titan X GPU (high performance)
- Nvidia Tegra TX2 (low power)
- Stripes (MICRO'16): bit-serial ASIC
- Eyeriss (ISCA'16): optimized-dataflow ASIC
Benchmarked DNN Models

DNN       | Type | Multiply-Adds | Original Model Weights | Bit-Flexible Model Weights
AlexNet   | CNN  | 2,678 MOps    | 898.6 MB               | 116.3 MB
CIFAR-10  | CNN  | 617 MOps      | 53.5 MB                | 3.3 MB
LSTM      | RNN  | 13 MOps       | 49.4 MB                | 6.2 MB
LeNet-5   | CNN  | 16 MOps       | 8.2 MB                 | 0.5 MB
ResNet-18 | CNN  | 4,269 MOps    | 103.7 MB               | 13.0 MB
RNN       | RNN  | 17 MOps       | 64.0 MB                | 8.0 MB
SVHN      | CNN  | 158 MOps      | 24.4 MB                | 0.8 MB
VGG-7     | CNN  | 317 MOps      | 43.3 MB                | 2.7 MB
Comparison with Eyeriss

Bit Fusion achieves a 3.9× speedup and 5.1× energy reduction over Eyeriss (geomean).

[Bar chart: per-benchmark speedup and energy reduction over Eyeriss (0×–14×) for AlexNet, CIFAR-10, LSTM, LeNet-5, ResNet-18, RNN, SVHN, and VGG-7; per-benchmark improvements range from 1.5× to 14.0×]
Comparison with Stripes

Bit Fusion achieves a 2.6× speedup and 3.9× energy reduction over Stripes (geomean).

[Bar chart: per-benchmark speedup and energy reduction over Stripes (0×–8×) for AlexNet, CIFAR-10, LSTM, LeNet-5, ResNet-18, RNN, SVHN, and VGG-7; per-benchmark improvements range from 1.8× to 7.8×]
Comparison with GPUs

[Bar chart: speedup over the Tegra TX2 for Titan X (INT8) and Bit Fusion across AlexNet, CIFAR-10, LSTM, LeNet-5, ResNet-18, RNN, SVHN, VGG-7, and the geomean; per-benchmark speedups range from 3× to 48×]

Bit Fusion provides almost the same performance as a Titan Xp (250 W) while consuming only 895 mW.
Conclusion

- Emerging research shows that DNN bitwidths can be reduced without losing accuracy.
- Bit Fusion defines a new dimension of bit-level dynamic composability that exploits this opportunity.
- The BitFusion ISA exposes this capability to the software stack.