SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks
Angshuman Parashar, Minsoo Rhu*, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel Emer, Stephen W. Keckler, and William J. Dally
* Now at POSTECH (Pohang University of Science and Technology), South Korea
© NVIDIA 2017
Motivation
Convolution (dense)
Inner product between every input pixel and every filter coefficient
Sliding window is intuitive and maps to a reasonable hardware implementation
[Diagram: input activation maps * filters (weights) = output activation maps]
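The dense sliding-window computation can be sketched as follows (a minimal stride-1, no-padding version; shapes and names are illustrative, not the paper's implementation):

```python
import numpy as np

def conv2d_dense(ia, w):
    """Naive sliding-window convolution (stride 1, 'valid' padding).

    ia: input activation map (H x W); w: filter (R x S).
    Every window position multiplies every filter coefficient with the
    activation under it -- even when either operand is zero.
    """
    H, W = ia.shape
    R, S = w.shape
    out = np.zeros((H - R + 1, W - S + 1))
    for i in range(H - R + 1):
        for j in range(W - S + 1):
            # inner product of the window with the filter
            out[i, j] = np.sum(ia[i:i+R, j:j+S] * w)
    return out
```

For an H x W input and R x S filter this always issues (H-R+1)(W-S+1)·R·S multiplies, regardless of how many operands are zero.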
[Animation: as the window slides, multiplications with zero-valued operands are useless MUL ops]
Convolution with Sparsity
Most operand values are zero
Static sparsity: pruned network weights set to '0' during training
* Han et al., "Learning both Weights and Connections for Efficient Neural Networks", NIPS 2015
Dynamic sparsity: negative-valued activations clamped to '0' (by ReLU) during inference
* Albericio et al., "CNVLUTIN: Ineffectual-Neuron-Free Deep Neural Network Computing", ISCA 2016
Sliding-window convolution is wasteful
Fraction of non-zero (NZ) activations & weights is roughly 20–50% per layer
[Chart: per-layer density of activations and weights for VGGNet; y-axis: density, 0 to 1.2]
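Dynamic sparsity is easy to see by measuring density before and after ReLU (a toy sketch with made-up values; the 20–50% figures above come from real pruned networks):

```python
import numpy as np

def density(t):
    """Fraction of non-zero elements in a tensor."""
    return np.count_nonzero(t) / t.size

# Dynamic sparsity: ReLU clamps negative pre-activations to zero.
pre = np.array([-1.5, 0.2, -0.3, 0.9, 0.0, -2.0])
post = np.maximum(pre, 0.0)

print(density(pre))   # 5 of 6 pre-activations are non-zero
print(density(post))  # only 2 of 6 survive ReLU
```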
Motivation
CNN inference is often performed in power-limited environments
Our goal: a sparsity-optimized CNN accelerator for high energy-efficiency
Possible Solutions
Option 1: Leverage a Dense CNN Design
Employ a pair of bit-masks to track non-zero weights and/or activations
[Diagram: NZ activations * NZ weights, 3x3 sliding window]
ANDing the two NZ bitmasks selects the useful MULs for the vector ALUs:
CLK#0: 2 useful MULs across 4 ALUs
CLK#1: 1 useful MUL across 4 ALUs
CLK#2: 0 useful MULs across 4 ALUs
Challenge: find enough useful MULs to fully populate the vector ALUs
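The under-utilization across cycles can be modeled with a toy sketch (a hypothetical helper, not the paper's hardware): AND the two bitmasks, then count how many of the 4-wide ALU slots do useful work each cycle.

```python
def useful_muls_per_cycle(act_mask, wt_mask, lanes=4):
    """AND the non-zero bitmasks elementwise, then report how many of
    the `lanes` vector ALU slots perform a useful MUL in each cycle."""
    gated = [a & w for a, w in zip(act_mask, wt_mask)]
    return [sum(gated[i:i+lanes]) for i in range(0, len(gated), lanes)]

# 12 operand pairs issued over 3 cycles on 4 ALUs:
act = [1, 1, 0, 1,  0, 1, 0, 0,  0, 1, 1, 0]
wt  = [1, 0, 1, 1,  1, 1, 0, 1,  1, 0, 0, 1]
print(useful_muls_per_cycle(act, wt))  # [2, 1, 0]
```

On this example most ALU slots sit idle, which is exactly the utilization problem the slide points out.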
SCNN Intuition & Approach
Intuition behind SCNN
Forget sliding-window-based convolution
All NZ activations must (at some point in time) be multiplied by all NZ weights
Holds true for a convolution stride of '1'
[Animation: each NZ weight x, y, z is paired with every input position, scattering its products across the output]
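A small check of this claim on toy tensors (using a stride-1 "full" convolution so border windows are included): every NZ activation meets every NZ weight exactly once, so the useful-MUL count equals nnz(IA) x nnz(W).

```python
import numpy as np

ia = np.array([[0., 2., 0.],
               [3., 0., 0.],
               [0., 0., 5.]])
w = np.array([[1., 0.],
              [0., 4.]])

H, W = ia.shape
R, S = w.shape
useful = 0
# Slide the filter over every position of a stride-1 "full"
# convolution (zero-padded borders included).
for u in range(H + R - 1):
    for v in range(W + S - 1):
        for r in range(R):
            for s in range(S):
                y, x = u - r, v - s  # input pixel under this filter tap
                if 0 <= y < H and 0 <= x < W:
                    if ia[y, x] != 0 and w[r, s] != 0:
                        useful += 1

# Each (activation, weight) pair occurs at exactly one window position.
assert useful == np.count_nonzero(ia) * np.count_nonzero(w)  # 3 * 2
```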
The SCNN Approach
The Cartesian-product (i.e., all-to-all) based convolution operation
Assuming a convolution stride of '1':
Minimum # MULs = Cartesian product of the NZ activations and the NZ weights
[Diagram: input with NZ activations a–f * filter with NZ weights x, y, z]
Compress the operands:
[Diagram: NZ activations compressed to {a, b, c, d, e, f}; NZ weights compressed to {x, y, z}]
Multiply all-to-all, then scatter and accumulate:
[Diagram: compressed activations {a–f} and weights {x, y, z} feed a multiplier array producing a*x, a*y, a*z, b*x, b*y, b*z, ...; a scatter network routes each product to its accumulator]
PE frontend: compressed multiply; PE backend: scatter and accumulate
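Putting the frontend and backend together, a software sketch of the Cartesian-product dataflow (assumed names; stride 1, "full" mathematical convolution so every product lands in bounds):

```python
import numpy as np

def conv2d_cartesian(ia, w):
    """Compress, multiply all-to-all, then scatter-accumulate.

    Frontend: iterate only over the compressed NZ activations and NZ
    weights, forming every pairwise product.
    Backend: scatter each product to the output coordinate implied by
    its operands' coordinates and accumulate there.
    """
    H, W = ia.shape
    R, S = w.shape
    out = np.zeros((H + R - 1, W + S - 1))  # 'full' output
    ys, xs = np.nonzero(ia)   # compressed activation coordinates
    rs, ss = np.nonzero(w)    # compressed weight coordinates
    for y, x in zip(ys, xs):
        for r, s in zip(rs, ss):                     # Cartesian product
            out[y + r, x + s] += ia[y, x] * w[r, s]  # scatter-accumulate
    return out
```

Only nnz(IA)·nnz(W) multiplies are issued; on a 2x2 input and 2x2 filter with two non-zeros each, that is 4 multiplies instead of the 36 MAC slots a dense full-convolution loop would occupy.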
SCNN Architecture
2D spatially arranged processing elements (PEs)
SCNN Architecture: Workload distribution
Weights are broadcast to all PEs
All PEs have a copy of all the NZ weights of the CNN model
[Diagram: filters (weights) * input activation maps, tiled across PE-0..PE-3, = output activation maps, tiled across PE-0..PE-3]
Each PE is allocated a partial volume of the input and output activations
Input and output activations stay local to the PE
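The spatial partitioning can be sketched as follows (a hypothetical helper; assumes the activation map divides evenly by the PE grid):

```python
import numpy as np

def tile_activations(ia, pe_rows, pe_cols):
    """Partition an activation map into per-PE tiles. Weights are
    broadcast separately, so each PE pairs its local tile with the
    full compressed weight set."""
    th, tw = ia.shape[0] // pe_rows, ia.shape[1] // pe_cols
    return {(i, j): ia[i*th:(i+1)*th, j*tw:(j+1)*tw]
            for i in range(pe_rows) for j in range(pe_cols)}

tiles = tile_activations(np.arange(16).reshape(4, 4), 2, 2)
print(tiles[(1, 1)])  # bottom-right 2x2 tile
```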
Halo resolution
Output halos: PE-0 computes (A x B), but the result should be accumulated in PE-1 (X)
[Diagram: input activation A in PE-0's tile, filter weight B; output X lies in PE-1's output tile]
Inter-PE communication channel for halo resolution
[Diagram: each PE holds weights, input activations (IA), and output activations (OA); halo traffic flows to neighboring PEs (e.g., NE and SE neighbors)]
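The halo routing decision can be sketched as follows (hypothetical helpers): compute each product's output coordinate; if it falls outside the producing PE's own output tile, forward it over the inter-PE channel.

```python
def owner_pe(out_y, out_x, tile_h, tile_w):
    """Return the grid coordinates of the PE whose output tile
    contains (out_y, out_x)."""
    return (out_y // tile_h, out_x // tile_w)

def is_halo(producing_pe, out_y, out_x, tile_h, tile_w):
    """A product is halo traffic when its output coordinate is owned
    by a different PE than the one that computed it."""
    return owner_pe(out_y, out_x, tile_h, tile_w) != producing_pe

# PE-0 at grid (0, 0) with 2x2 output tiles: a product landing at
# output (1, 2) belongs to the PE to the east and must be forwarded.
print(is_halo((0, 0), 1, 2, 2, 2))  # True
print(is_halo((0, 0), 1, 1, 2, 2))  # False
```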
SCNN PE Microarchitecture
(Compressed-sparse frontend) + (scattered-dense backend)
[Block diagram of the PE frontend and PE backend]
Evaluation

Methodology
Network models and input activations: trained → pruned → retrained models using Caffe
Area & power: SystemC → Catapult HLS → Verilog RTL → synthesis of an SCNN PE
Performance & energy: performance model for cycle-level simulation of SCNN; analytical model for design-space exploration (dataflows, sparse vs. dense)
Architecture configurations
DCNN: operates solely on dense weights and activations
DCNN-opt: DCNN with (de)compression of activations + ALU power-gating
SCNN: the sparsity-optimized CNN accelerator
Performance: dense vs. sparse
[Chart: per-layer speedup for VGGNet]
* Network-wide improvement over DCNN: AlexNet (2.37x), GoogLeNet (2.19x)
Energy consumption: dense vs. sparse
[Chart: per-layer energy for VGGNet]
* Network-wide improvement over DCNN: AlexNet (2.13x), GoogLeNet (2.51x)
Related Work: qualitative comparison to prior work

Architecture            | Sparse opt. (weights) | Sparse opt. (activations) | Convolution dataflow
DaDianNao [ASPLOS '14]  | -                     | -                         | (variant of) sliding-window
Eyeriss [ISCA '16]      | -                     | Power-gating              | (variant of) sliding-window
CNVLUTIN [ISCA '16]     | -                     | Zero-skipping             | (variant of) sliding-window
Cambricon-X [MICRO '16] | Zero-skipping         | -                         | (variant of) sliding-window
SCNN                    | Zero-skipping         | Zero-skipping             | Cartesian-product
Follow-up Questions?
Contact the authors. Technical leads:
SCNN architecture & sparse models: Minsoo Rhu
TimeLoop (CNN analytical model): Angshuman Parashar
Power & area modeling: Rangharajan Venkatesan
Conclusion
SCNN: a compressed-sparse CNN accelerator
Novel Cartesian-product-based convolution operation
Simple, compressed-sparse PE frontend
Scattered-dense PE backend
Superior performance and energy-efficiency:
2.7x higher performance on average than a dense CNN architecture
2.3x higher energy-efficiency on average than a dense CNN architecture