9.5: A 6K-MAC Feature-Map-Sparsity-Aware Neural Processing Unit in 5nm Flagship Mobile SoC
© 2021 IEEE International Solid-State Circuits Conference
A 6K-MAC Feature-Map-Sparsity-Aware Neural Processing Unit in 5nm Flagship Mobile SoC

Jun-Seok Park1, Jun-Woo Jang2, Heonsoo Lee1, Dongwoo Lee1, Sehwan Lee2, Hanwoong Jung2, Seungwon Lee2, Suknam Kwon1, Kyungah Jeong1, Joon-Ho Song2, SukHwan Lim1, Inyup Kang1
1 Samsung Electronics, 2 Samsung Advanced Institute of Technology
Outline
• Motivation
• NPU architecture
• Convolutional engine
• Adder-tree based dot-product engine
• Feature-map-aware zero-skipping
• Feature-map lossless compressor
• Command queue
• Measurement results
• Comparison
Motivation - Efficiency
On-device DNN inference is critical for mobile products. An NPU on a mobile AP faces heavy constraints on computing resources, power, and memory bandwidth.
[Figure: example on-device use cases — bokeh, scene detection, face detection, face landmark detection]
Motivation - Flexibility
The NPU needs to support a comprehensive range of neural networks: diverse kernel sizes, dilation rates, and strides.
[Figure: sources of diversity — feature-map shapes (width × height × channel, narrow-and-deep vs. wide-and-shallow), kernel variations (various kernel sizes, strides, dilated convolution), and layer types (convolutional neural network IFM→OFM, depthwise convolution IFM→OFM)]
NPU architecture
[Figure: top-level block diagram] A CPU (with I$/D$) and three NPU cores are attached to the system bus. The NPU control unit, with an instruction fetcher and a command-queue array (with synchronization), orchestrates the cores over a control path, while data transactions go through three feature-map lossless compressor (FLC) channels, each containing a compressor and a de-compressor in front of the IFM, OFM, and weight DMAs. Each NPU core holds two convolution engines (NPU Conv Engine 0/1), each with a weight fetcher, FM fetcher, PSUM fetcher, FM data holder, weight buffer, FM zero-skipping logic, MAC array, and activation-function unit, plus a vector processing unit and a 1MB shared scratchpad.
Convolutional engines (CE)
The CE processes 16-dimensional data in parallel along the channel direction. With 1x1x16 as the smallest unit of compute, convolutions with arbitrary kernel shapes are straightforward to serialize.
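The serialization idea can be sketched as follows. This is a hypothetical NumPy model of the 1x1x16 compute-unit concept, not the hardware datapath; the function name and layout conventions are illustrative.

```python
import numpy as np

def conv_serialized(ifm, weights, c_tile=16):
    """Serialize a KxK convolution into 1x1x(c_tile) dot-product steps.

    ifm:     (H, W, C) input feature map
    weights: (K, K, C, M) kernel with M output channels
    Each inner step is one 1x1x16 vector multiplied against the weights
    of all M output channels -- the smallest unit of compute.
    """
    H, W, C = ifm.shape
    K, _, _, M = weights.shape
    assert C % c_tile == 0, "channels assumed to be a multiple of the tile"
    ofm = np.zeros((H - K + 1, W - K + 1, M))
    for y in range(ofm.shape[0]):
        for x in range(ofm.shape[1]):
            # One output pixel = K*K*(C/c_tile) serialized 1x1x16 steps
            for ky in range(K):
                for kx in range(K):
                    for c0 in range(0, C, c_tile):
                        fm_vec = ifm[y + ky, x + kx, c0:c0 + c_tile]
                        ofm[y, x] += fm_vec @ weights[ky, kx, c0:c0 + c_tile]
    return ofm
```

Because every kernel tap reduces to the same 1x1x16 step, arbitrary kernel sizes, strides, and dilations only change the loop bounds and addresses, not the compute unit.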
[Figure: CE datapath — feature-map fetcher, weight fetcher, and PSUM/FM fetcher feed a data holder (which skips zero features) and a weight buffer; zero selection steers non-zero data into multiplier arrays whose outputs are summed by an adder tree into accumulate/max units (with a boost path)]
Challenges
• Maintain a high utilization factor for those diverse convolutions
• Achieve high energy efficiency

Key points (solutions)
• Serialize the convolutional operations along the spatial direction
• Skip redundant operations by exploiting the sparseness of the feature map

[Figure: per-layer feature-map distributions (zero vs. non-zero features, 0-100%) for Inception V3 and DeepLab V3, showing substantial zero fractions across layers]
Dilated convolution

[Figure: a 3x3 weight tensor is expanded into a dilated weight tensor; the IFM tensor is traversed as serialized tiles t0-t71 (groups t0-t8, t9-t17, t18-t26, ..., t63-t71) to produce the OFM tensor]
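With serialized tiles, dilation only changes which IFM offsets each kernel tap reads; the MAC count is unchanged. A hypothetical sketch of this addressing (the function name is illustrative):

```python
def dilated_taps(k=3, dilation=2):
    """Map each kernel tap (ky, kx) to its IFM offset under dilation.

    On a serialized datapath, a dilated convolution is handled by the
    feature-map fetcher simply reading strided addresses -- the number
    of 1x1 compute steps per output stays k*k.
    """
    return [(ky * dilation, kx * dilation)
            for ky in range(k) for kx in range(k)]
```

For example, a 3x3 kernel with dilation 2 still issues nine taps, but they span a 5x5 receptive field: offsets (0,0) through (4,4).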
Feature-map-aware zero skipping

[Figure: feature-map vectors fmVec 0-8 and weight vectors wVec 0-8; all-zero fmVecs are skipped, so only the non-zero fmVec/wVec pairs are issued to the MAC array (MAA), compressing nine vector slots into four cycles (#0-#3)]

Benefits of feature-map-aware zero skipping:
• Effective performance
• Energy efficiency
• HW utilization
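The scheduling idea can be sketched as follows. This is a hypothetical behavioral model of the skip decision, not the actual issue logic; names are illustrative.

```python
def zero_skip_schedule(fm_vecs, w_vecs):
    """Issue only non-zero feature-map vectors to the MAC array (MAA).

    fm_vecs / w_vecs: aligned lists of equal-length vectors.
    Returns the (fmVec, wVec) pairs actually issued; because all-zero
    fmVecs contribute nothing to the dot product, cycles taken equals
    the number of non-zero fmVecs rather than the total count.
    """
    issued = []
    for fm, w in zip(fm_vecs, w_vecs):
        if any(fm):              # skip all-zero feature vectors entirely
            issued.append((fm, w))
    return issued
```

With highly sparse activations (see the Inception V3 / DeepLab V3 distributions above), most slots are skipped, which is where the effective-performance and energy gains come from.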
Feature Map Lossless Compressor

[Figure: feature-map groups are encoded with a multi-level quad-tree (levels 0-2); each level's bitmap distinguishes zero features, non-zero features, and clustered zero features. The compressed feature map consists of meta-data (stream length, truncated non-zero bitwidth), the quad-tree header bits, and the packed non-zero features]
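A minimal sketch of the quad-tree header idea, assuming a flat feature-map group whose length is a power of 4. This is an illustration of the general scheme, not the chip's exact bitstream format (meta-data, truncated bitwidths, and level ordering are omitted).

```python
def quadtree_encode(feats):
    """Lossless FM compression sketch: quad-tree zero bitmap + non-zeros.

    feats: flat list of features (length a power of 4) for one group.
    Each level emits 4 header bits marking which quarters contain any
    non-zero feature; all-zero quarters are pruned, so clustered zeros
    cost almost nothing. Non-zero values are packed separately.
    """
    header, nonzeros = [], []

    def encode(block):
        if len(block) == 1:
            if block[0]:
                nonzeros.append(block[0])
            return
        quarter = len(block) // 4
        subs = [block[i * quarter:(i + 1) * quarter] for i in range(4)]
        bits = [1 if any(s) else 0 for s in subs]
        header.extend(bits)          # this level's 4-bit occupancy map
        for sub, bit in zip(subs, bits):
            if bit:                  # recurse only into non-zero quarters
                encode(sub)

    if any(feats):
        encode(feats)
    return header, nonzeros
```

For a 16-feature group with a single non-zero value, the output is 8 header bits plus one stored value, instead of 16 raw features; decompression walks the same tree and re-inserts zeros.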
Percentage of Compressed FM Size

[Figure: compressed feature-map size per layer — (a) Inception V3, average 53%; (b) DeepLab V3, average 50.8%]
Parallelization of DMA and MAC
• A sub-graph of a network is transformed into an NPU binary by the compiler
• The command queue (CMDQ) handles an interrupt from a module within tens of cycles
• The synchronization overhead incurs only a negligible drop in HW utilization
[Figure: sub-graph of a network (layers L1-L5 with weight tensors WL1-WL5) mapped onto an NPU core; IFM reads, per-layer weight reads, MAC compute steps 1-5, and the OFM write are pipelined, with synchronization points between dependent steps]
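The benefit of overlapping DMA with MAC compute can be sketched with a toy cost model. This is a hypothetical back-of-the-envelope model (uniform per-layer costs, one DMA engine), not the actual scheduler.

```python
def overlapped_time(layers, dma, mac):
    """Total time when weight DMA for layer i+1 overlaps MAC of layer i.

    layers: number of layers; dma/mac: per-layer DMA and compute costs.
    The command queue releases each MAC step as soon as its weights have
    arrived (the synch points in the pipeline diagram).
    """
    t_dma = 0.0   # when the DMA engine finishes its current transfer
    t_mac = 0.0   # when the MAC array finishes its current layer
    for _ in range(layers):
        t_dma += dma                      # fetch this layer's weights
        t_mac = max(t_mac, t_dma) + mac   # compute once weights arrive
    return t_mac

def serial_time(layers, dma, mac):
    """Baseline: each layer waits for its DMA, with no overlap."""
    return layers * (dma + mac)
```

For example, with 5 layers, dma=2 and mac=3, the overlapped schedule takes 17 time units versus 25 serially; as the slide notes, only the first DMA and the short synch handoffs remain exposed.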
Measurement Results
• The NPU achieves 623 inferences/s at 1196 MHz in multi-thread mode
• An energy efficiency of 0.84 mJ/inference (1,190 inf./J) was measured
• This corresponds to 13.6 TOPS/W for Inception V3, including DMA power (but not DRAM)
[Figure: (left) inferences/second and inferences/J vs. supply voltage (0.5-0.9 V), reaching 623 inferences/s and 1,190 inferences/J multi-core vs. 555 inferences/s and 1,107 inferences/J single-core; (right) time (ms)/inference breakdown of DMA vs. NPU compute across Base, +Skipping, +Reconf., Multithread, and Optimized-DMA configurations, with improvements of 30% and 62.6%]
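The reported figures are mutually consistent, which can be checked with a little arithmetic (values taken from the slides; the implied op count per inference is my derivation, not a stated number):

```python
# Cross-check of the measured numbers at 0.6 V (Inception V3, 8-bit).
mj_per_inference = 0.84                       # reported mJ/inference
inferences_per_joule = 1000 / mj_per_inference
# ~1190, matching the reported 1,190 inf./J

tops_per_watt = 13.6                          # reported, incl. DMA power
# Implied work per inference: (ops/J) x (J/inference)
ops_per_inference = tops_per_watt * 1e12 * mj_per_inference * 1e-3
# ~11.4 GOPs per inference, a plausible figure for Inception V3
```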
Comparison

|                                   | ISSCC 2018 [2]   | VLSI 2019 [3] | ISSCC 2020 [4] | ISSCC 2020 [5] | This work                                 |
|-----------------------------------|------------------|---------------|----------------|----------------|-------------------------------------------|
| Process (nm)                      | 8                | 16            | 7              | 12             | 5                                         |
| Area (mm2)                        | 5.5              | 2.4           | 3.04           | 709            | 5.46                                      |
| Supply voltage (V)                | 0.5 - 0.8        | 0.55 - 0.8    | 0.575 - 0.825  | -              | 0.55 - 0.9                                |
| Frequency (MHz)                   | 67 - 933         | 33 - 480      | 290 - 880      | 475 - 700      | 332 - 1196                                |
| On-chip memory (kB)               | 1,568            | 281           | 2,176          | 196,608        | 3,072                                     |
| Bit precision                     | 8, 16            | 16            | 8, 16          | 8, 16          | 8, 16                                     |
| Number of MACs                    | 1,024            | 252           | -              | 576K           | 6,144                                     |
| Peak performance (TOPS)           | 3.5*, 6.9** (8b) | -             | 3.6 (8b)       | 825            | 14.7 (8b) @ no-skip, 29.4 (8b) @ max-skip |
| Power (mW)                        | 39 - 1,553       | 16.3 - 364    | 173 - 1,053    | 108,000        | 327 @ 0.6V, 794 @ 0.9V                    |
| Measured network                  | Inception V3     | ResNet-50     | Inception V3   | ResNet-50      | Inception V3 (8-bit)                      |
| Energy efficiency (TOPS/W)        | 3.4 @ 0.5V       | 3.6 @ 0.55V   | 6.22           | -              | 13.6 (8b) @ 0.6V                          |
| Energy efficiency (mJ/inference)  | -                | -             | -              | 2.0            | 0.840                                     |
| Peak TOPS/mm2                     | 0.64*, 1.25**    | -             | 1.184          | 1.16           | 2.69                                      |
Die Photo

[Die photo: NPU Control Unit (1.08mm2) and NPU Cores 0, 1, 2 (1.51mm2 each)]

Process: 5nm CMOS technology (Samsung)
Area: 5.46mm2
Voltage: 0.55 to 0.9 V
Frequency: 332 to 1196 MHz
Best peak performance: 623 inferences/s @ 0.9V (Inception V3)
Best energy efficiency: 13.6 TOPS/W @ 0.6V (Inception V3)
Summary
• Adder-tree-based datapath and serialized convolutional operations for high utilization of a large number of MACs
• Feature-map-aware zero-skipping for high performance and energy efficiency
• Reduced memory footprint and bandwidth via weight and feature-map compression
• Parallelization of DMA and MAC compute time through fast resource scheduling
Thank you for your attention