SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks
Angshuman Parashar, Minsoo Rhu*, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel Emer, Stephen W. Keckler, and William J. Dally
* Now at POSTECH (Pohang University of Science and Technology), South Korea
© NVIDIA 2017
Motivation
Convolution (dense)
Inner product between every input pixel and every filter coefficient
Sliding window is intuitive and maps to a reasonable hardware implementation
[Diagram: input activation maps * filters (weights) = output activation maps]
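The dense sliding-window computation can be sketched as follows (a minimal stride-1, no-padding version; shapes and names are illustrative, not the paper's implementation):

```python
import numpy as np

def conv2d_dense(ia, w):
    """Naive sliding-window convolution (stride 1, 'valid' padding).

    ia: input activation map (H x W); w: filter (R x S).
    Every window position multiplies every filter coefficient with the
    activation under it -- even when either operand is zero.
    """
    H, W = ia.shape
    R, S = w.shape
    out = np.zeros((H - R + 1, W - S + 1))
    for i in range(H - R + 1):
        for j in range(W - S + 1):
            # inner product of the window with the filter
            out[i, j] = np.sum(ia[i:i+R, j:j+S] * w)
    return out
```

For an H x W input and R x S filter this always issues (H-R+1)(W-S+1)·R·S multiplies, regardless of how many operands are zero.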
[Animation: as the window slides, multiplications with zero-valued operands are useless MUL ops]
Convolution with Sparsity
Most operand values are zero
Static sparsity: pruned network weights set to '0' during training
* Han et al., "Learning both Weights and Connections for Efficient Neural Networks", NIPS 2015
Dynamic sparsity: negative-valued activations clamped to '0' (by ReLU) during inference
* Albericio et al., "CNVLUTIN: Ineffectual-Neuron-Free Deep Neural Network Computing", ISCA 2016
Sliding-window convolution is wasteful
Fraction of non-zero (NZ) activations & weights is roughly 20–50% per layer
[Chart: per-layer density of activations and weights for VGGNet; y-axis: density, 0 to 1.2]
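Dynamic sparsity is easy to see by measuring density before and after ReLU (a toy sketch with made-up values; the 20–50% figures above come from real pruned networks):

```python
import numpy as np

def density(t):
    """Fraction of non-zero elements in a tensor."""
    return np.count_nonzero(t) / t.size

# Dynamic sparsity: ReLU clamps negative pre-activations to zero.
pre = np.array([-1.5, 0.2, -0.3, 0.9, 0.0, -2.0])
post = np.maximum(pre, 0.0)

print(density(pre))   # 5 of 6 pre-activations are non-zero
print(density(post))  # only 2 of 6 survive ReLU
```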
Motivation
CNN inference is often performed in power-limited environments
Our goal: a sparsity-optimized CNN accelerator for high energy-efficiency
Possible Solutions
Option 1: Leverage a Dense CNN Design
Employ a pair of bit-masks to track non-zero weights and/or activations
[Diagram: NZ activations * NZ weights, 3x3 sliding window]
ANDing the two NZ bitmasks selects the useful MULs for the vector ALUs:
CLK#0: 2 useful MULs across 4 ALUs
CLK#1: 1 useful MUL across 4 ALUs
CLK#2: 0 useful MULs across 4 ALUs
Challenge: find enough useful MULs to fully populate the vector ALUs
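The under-utilization across cycles can be modeled with a toy sketch (a hypothetical helper, not the paper's hardware): AND the two bitmasks, then count how many of the 4-wide ALU slots do useful work each cycle.

```python
def useful_muls_per_cycle(act_mask, wt_mask, lanes=4):
    """AND the non-zero bitmasks elementwise, then report how many of
    the `lanes` vector ALU slots perform a useful MUL in each cycle."""
    gated = [a & w for a, w in zip(act_mask, wt_mask)]
    return [sum(gated[i:i+lanes]) for i in range(0, len(gated), lanes)]

# 12 operand pairs issued over 3 cycles on 4 ALUs:
act = [1, 1, 0, 1,  0, 1, 0, 0,  0, 1, 1, 0]
wt  = [1, 0, 1, 1,  1, 1, 0, 1,  1, 0, 0, 1]
print(useful_muls_per_cycle(act, wt))  # [2, 1, 0]
```

On this example most ALU slots sit idle, which is exactly the utilization problem the slide points out.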
SCNN Intuition & Approach
Intuition behind SCNN
Forget sliding-window-based convolution
All NZ activations must (at some point in time) be multiplied by all NZ weights
Holds true for a convolution stride of '1'
[Animation: each NZ weight x, y, z is paired with every input position, scattering its products across the output]
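A small check of this claim on toy tensors (using a stride-1 "full" convolution so border windows are included): every NZ activation meets every NZ weight exactly once, so the useful-MUL count equals nnz(IA) x nnz(W).

```python
import numpy as np

ia = np.array([[0., 2., 0.],
               [3., 0., 0.],
               [0., 0., 5.]])
w = np.array([[1., 0.],
              [0., 4.]])

H, W = ia.shape
R, S = w.shape
useful = 0
# Slide the filter over every position of a stride-1 "full"
# convolution (zero-padded borders included).
for u in range(H + R - 1):
    for v in range(W + S - 1):
        for r in range(R):
            for s in range(S):
                y, x = u - r, v - s  # input pixel under this filter tap
                if 0 <= y < H and 0 <= x < W:
                    if ia[y, x] != 0 and w[r, s] != 0:
                        useful += 1

# Each (activation, weight) pair occurs at exactly one window position.
assert useful == np.count_nonzero(ia) * np.count_nonzero(w)  # 3 * 2
```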
The SCNN Approach
The Cartesian-product (i.e., all-to-all) based convolution operation
Assuming a convolution stride of '1':
Minimum # MULs = Cartesian product of the NZ activations and the NZ weights
[Diagram: input with NZ activations a–f * filter with NZ weights x, y, z]
Compress the operands:
[Diagram: NZ activations compressed to {a, b, c, d, e, f}; NZ weights compressed to {x, y, z}]
Multiply all-to-all, then scatter and accumulate:
[Diagram: compressed activations {a–f} and weights {x, y, z} feed a multiplier array producing a*x, a*y, a*z, b*x, b*y, b*z, ...; a scatter network routes each product to its accumulator]
PE frontend: compressed multiply; PE backend: scatter and accumulate
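Putting the frontend and backend together, a software sketch of the Cartesian-product dataflow (assumed names; stride 1, "full" mathematical convolution so every product lands in bounds):

```python
import numpy as np

def conv2d_cartesian(ia, w):
    """Compress, multiply all-to-all, then scatter-accumulate.

    Frontend: iterate only over the compressed NZ activations and NZ
    weights, forming every pairwise product.
    Backend: scatter each product to the output coordinate implied by
    its operands' coordinates and accumulate there.
    """
    H, W = ia.shape
    R, S = w.shape
    out = np.zeros((H + R - 1, W + S - 1))  # 'full' output
    ys, xs = np.nonzero(ia)   # compressed activation coordinates
    rs, ss = np.nonzero(w)    # compressed weight coordinates
    for y, x in zip(ys, xs):
        for r, s in zip(rs, ss):                     # Cartesian product
            out[y + r, x + s] += ia[y, x] * w[r, s]  # scatter-accumulate
    return out
```

Only nnz(IA)·nnz(W) multiplies are issued; on a 2x2 input and 2x2 filter with two non-zeros each, that is 4 multiplies instead of the 36 MAC slots a dense full-convolution loop would occupy.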
SCNN Architecture
2D spatially arranged processing elements (PEs)
SCNN Architecture: Workload distribution
Weights are broadcast to all PEs
All PEs have a copy of all the NZ weights of the CNN model
[Diagram: filters (weights) * input activation maps, tiled across PE-0..PE-3, = output activation maps, tiled across PE-0..PE-3]
Each PE is allocated a partial volume of the input and output activations
Input and output activations stay local to the PE
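The spatial partitioning can be sketched as follows (a hypothetical helper; assumes the activation map divides evenly by the PE grid):

```python
import numpy as np

def tile_activations(ia, pe_rows, pe_cols):
    """Partition an activation map into per-PE tiles. Weights are
    broadcast separately, so each PE pairs its local tile with the
    full compressed weight set."""
    th, tw = ia.shape[0] // pe_rows, ia.shape[1] // pe_cols
    return {(i, j): ia[i*th:(i+1)*th, j*tw:(j+1)*tw]
            for i in range(pe_rows) for j in range(pe_cols)}

tiles = tile_activations(np.arange(16).reshape(4, 4), 2, 2)
print(tiles[(1, 1)])  # bottom-right 2x2 tile
```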
Halo resolution
Output halos: PE-0 computes (A x B), but the result should be accumulated in PE-1 (X)
[Diagram: input activation A in PE-0's tile, filter weight B; output X lies in PE-1's output tile]
Inter-PE communication channel for halo resolution
[Diagram: each PE holds weights, input activations (IA), and output activations (OA); halo traffic flows to neighboring PEs (e.g., NE and SE neighbors)]
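The halo routing decision can be sketched as follows (hypothetical helpers): compute each product's output coordinate; if it falls outside the producing PE's own output tile, forward it over the inter-PE channel.

```python
def owner_pe(out_y, out_x, tile_h, tile_w):
    """Return the grid coordinates of the PE whose output tile
    contains (out_y, out_x)."""
    return (out_y // tile_h, out_x // tile_w)

def is_halo(producing_pe, out_y, out_x, tile_h, tile_w):
    """A product is halo traffic when its output coordinate is owned
    by a different PE than the one that computed it."""
    return owner_pe(out_y, out_x, tile_h, tile_w) != producing_pe

# PE-0 at grid (0, 0) with 2x2 output tiles: a product landing at
# output (1, 2) belongs to the PE to the east and must be forwarded.
print(is_halo((0, 0), 1, 2, 2, 2))  # True
print(is_halo((0, 0), 1, 1, 2, 2))  # False
```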
SCNN PE Microarchitecture
(Compressed-sparse frontend) + (scattered-dense backend)
[Block diagram of the PE frontend and PE backend]
Evaluation

Methodology
Network models and input activations: trained → pruned → retrained models using Caffe
Area & power: SystemC → Catapult HLS → Verilog RTL → synthesis of an SCNN PE
Performance & energy: performance model for cycle-level simulation of SCNN; analytical model for design-space exploration (dataflows, sparse vs. dense)
Architecture configurations
DCNN: operates solely on dense weights and activations
DCNN-opt: DCNN with (de)compression of activations + ALU power-gating
SCNN: the sparsity-optimized CNN accelerator
Performance: dense vs. sparse
[Chart: per-layer speedup for VGGNet]
* Network-wide improvement over DCNN: AlexNet (2.37x), GoogLeNet (2.19x)
Energy consumption: dense vs. sparse
[Chart: per-layer energy for VGGNet]
* Network-wide improvement over DCNN: AlexNet (2.13x), GoogLeNet (2.51x)
Related Work: qualitative comparison to prior work

Architecture            | Sparse opt. (weights) | Sparse opt. (activations) | Convolution dataflow
DaDianNao [ASPLOS '14]  | -                     | -                         | (variant of) sliding-window
Eyeriss [ISCA '16]      | -                     | Power-gating              | (variant of) sliding-window
CNVLUTIN [ISCA '16]     | -                     | Zero-skipping             | (variant of) sliding-window
Cambricon-X [MICRO '16] | Zero-skipping         | -                         | (variant of) sliding-window
SCNN                    | Zero-skipping         | Zero-skipping             | Cartesian-product
Follow-up Questions?
Contact the authors. Technical leads:
SCNN architecture & sparse models: Minsoo Rhu
TimeLoop (CNN analytical model): Angshuman Parashar
Power & area modeling: Rangharajan Venkatesan
Conclusion
SCNN: a compressed-sparse CNN accelerator
Novel Cartesian-product-based convolution operation
Simple, compressed-sparse PE frontend
Scattered-dense PE backend
Superior performance and energy-efficiency:
2.7x higher performance on average than a dense CNN architecture
2.3x higher energy-efficiency on average than a dense CNN architecture