A 6K-MAC Feature-Map-Sparsity-Aware Neural Processing Unit in 5nm Flagship Mobile SoC

9.5, © 2021 IEEE International Solid-State Circuits Conference (20 pages)

Jun-Seok Park1, Jun-Woo Jang2, Heonsoo Lee1, Dongwoo Lee1, Sehwan Lee2, Hanwoong Jung2, Seungwon Lee2, Suknam Kwon1, Kyungah Jeong1, Joon-Ho Song2, SukHwan Lim1, Inyup Kang1
1 Samsung Electronics; 2 Samsung Advanced Institute of Technology

Transcript of "A 6K-MAC Feature-Map-Sparsity-Aware Neural Processing Unit in 5nm Flagship Mobile SoC"

Page 1

Page 2

Outline
- Motivation
- NPU architecture
- Convolutional engine
- Adder-tree based dot-product engine
- Feature-map aware zero-skipping
- Feature-map lossless compressor
- Command queue
- Measurement result
- Comparison

Page 3

Motivation - Efficiency

On-device DNN inference is critical for mobile products, and the NPU on a mobile AP has heavy constraints on computing resources, power, and memory bandwidth.

(Use cases shown: Bokeh, scene detection, face detection, face landmark detection.)

Page 4

Motivation - Flexibility

The NPU needs to support a comprehensive range of neural networks, with diverse kernel sizes, dilation rates, and strides.

(Figure: feature-map shapes along width, height, and channel, from narrow & deep to wide & shallow; various kernel sizes, strides, and dilated convolution; and layer types of a convolutional neural network, including depthwise convolution, each mapping an IFM to an OFM.)

Page 5


Page 6

NPU architecture

(Block diagram. A CPU with I$/D$ and the NPU control unit sit on the bus. The NPU control unit contains the feature-map lossless compressor with three channels (FLC 0-2), each with a compressor and decompressor, plus IFM, OFM, and weight DMAs, and drives three NPU cores. Each NPU core holds two convolution engines (NPU Conv Engine 0/1), each with a weight fetcher, FM fetcher, PSUM fetcher, FM data holder, weight buffer, FM zero-skipping logic, a MAC array, and an activation-function unit; the core also includes a 1MB shared scratchpad, a vector processing unit, a command-queue array with synchronization, and an instruction-fetch unit. Data-transaction and control paths are marked.)

Page 7

Convolutional engines (CE)

The CE processes 16 data elements in parallel along the channel dimension. With 1x1x16 as the smallest unit of compute, convolutions with arbitrary kernel shapes are straightforward to serialize.
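As an illustration of this decomposition, here is a minimal NumPy sketch (hypothetical function name `conv_serialized`; the hardware's actual dataflow and buffering are not modeled) that serializes a KxK convolution, including stride and dilation, into 1x1x16 dot products accumulated over cycles:

```python
import numpy as np

def conv_serialized(ifm, weights, stride=1, dilation=1, vec=16):
    """Decompose a KxK convolution into serialized 1x1xvec dot
    products, the smallest unit of compute described in the talk.
    ifm: (H, W, C), weights: (K, K, C, M). Assumes C % vec == 0."""
    H, W, C = ifm.shape
    K, _, _, M = weights.shape
    span = dilation * (K - 1) + 1            # receptive field with dilation
    Ho = (H - span) // stride + 1
    Wo = (W - span) // stride + 1
    ofm = np.zeros((Ho, Wo, M))
    for ky in range(K):                      # serialize over kernel taps...
        for kx in range(K):
            for c0 in range(0, C, vec):      # ...and over 16-channel slices
                w = weights[ky, kx, c0:c0 + vec, :]        # (vec, M)
                for y in range(Ho):
                    for x in range(Wo):
                        fm = ifm[y * stride + ky * dilation,
                                 x * stride + kx * dilation, c0:c0 + vec]
                        ofm[y, x] += fm @ w  # one 1x1xvec dot product
    return ofm
```

Because every kernel shape, stride, and dilation reduces to the same 1x1x16 inner operation, the MAC array sees a uniform workload regardless of the layer's geometry.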

(Datapath: feature-map, weight, and PSUM/FM fetchers feed a data holder that skips zero features; zero selection steers non-zero operands into the multiplier arrays, whose outputs flow through adder trees into accumulate/max units with a boost stage.)

Page 8

Key points

Challenges:
- Maintain a high utilization factor for those diverse convolutions
- Achieve high energy efficiency

Solutions:
- Serialize the convolutional operations along the spatial direction
- Skip redundant operations by exploiting the sparsity of the feature map

(Charts: per-layer distribution of zero vs. non-zero features for Inception V3 and DeepLab V3.)

Page 9

Dilated convolution

(Figure: the weight tensor is expanded into a dilated weight tensor, and the IFM tensor is convolved with it to produce the OFM tensor. Weight elements are streamed serially in 3x3 groups over time: t0-t8, t9-t17, t18-t26, ..., t63-t71.)
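The dilated weight tensor in the figure can be sketched as zero-insertion between kernel taps (hypothetical helper `dilate_kernel`; this is an illustration, not the silicon's representation):

```python
import numpy as np

def dilate_kernel(w, rate):
    """Expand a KxK kernel into its dilated form by inserting
    (rate - 1) zeros between taps, as in the slide's dilated
    weight tensor. Works for (K, K) or (K, K, C, M) kernels."""
    k = w.shape[0]
    kd = rate * (k - 1) + 1
    wd = np.zeros((kd, kd) + w.shape[2:], dtype=w.dtype)
    wd[::rate, ::rate] = w   # original taps land on a strided grid
    return wd
```

In hardware the inserted zeros need not be stored or multiplied: the feature-map fetcher can simply stride its IFM addresses by the dilation rate, which is what makes dilated convolution cheap on a serialized datapath.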

Page 10

Feature-map-aware zero skipping

(Figure: feature-map vectors fmVec 0-8 and weight vectors wVec 0-8 are paired and issued to the MAC array (MAA) over cycles #0-#3; pairs whose fmVec is all zero data are skipped, so the nine pairs complete in four cycles.)

Benefits of feature-map-aware zero skipping: effective performance, energy efficiency, and HW utilization.
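A minimal model of the skipping decision, assuming (as the figure suggests) that a vector pair is issued only when its feature-map vector contains at least one non-zero element:

```python
def zero_skip_schedule(fm_vecs, w_vecs):
    """Only vector pairs whose feature-map vector has a non-zero
    element are issued to the MAC array (MAA); all-zero fmVecs
    contribute nothing to the dot product and cost no cycle."""
    issued = [(f, w) for f, w in zip(fm_vecs, w_vecs) if any(f)]
    return len(issued), issued

# Nine pairs, five of them all-zero: they finish in four MAA cycles,
# mirroring the slide's cycles #0-#3.
fm = [[1, 0], [0, 0], [2, 3], [0, 0], [0, 0],
      [4, 0], [0, 0], [0, 0], [5, 6]]
w = [[i, i] for i in range(9)]
cycles, _ = zero_skip_schedule(fm, w)
```

The speedup is data-dependent: it scales with the fraction of all-zero feature vectors, which the earlier sparsity charts show is substantial for ReLU-based networks.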

Page 11


Page 12

Feature Map Lossless Compressor

(Figure: feature-map groups are encoded with a multi-level quad-tree (levels 2, 1, 0). At each level, cells are marked as zero features, non-zero features, or clustered zero features; a clustered-zero node prunes its entire subtree. The compressed feature map consists of meta-data (stream length and truncated non-zero bitwidth), the quad-tree header bits, and the truncated non-zero feature values.)
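A sketch of the quad-tree header construction, under the assumption that an all-zero quadrant is encoded with a single bit and pruned (the slide's "clustered zero features"); the meta-data and the truncated non-zero payload are omitted:

```python
def quadtree_encode(block):
    """Header bits for one 2^n x 2^n zero-map: an all-zero region is
    a single '0' bit and its subtree is pruned ('clustered zero
    features'); otherwise a '1' bit is emitted and the region is
    split into four quadrants. 1x1 leaves mark zero/non-zero."""
    n = len(block)
    if all(v == 0 for row in block for v in row):
        return "0"
    if n == 1:
        return "1"
    h = n // 2
    top, bot = block[:h], block[h:]
    quads = [[r[:h] for r in top], [r[h:] for r in top],
             [r[:h] for r in bot], [r[h:] for r in bot]]
    return "1" + "".join(quadtree_encode(q) for q in quads)
```

Decoding walks the same bits to recover the zero map and fills the non-zero positions from the truncated-value payload; the exact bit ordering and group size here are assumptions, not the silicon's format.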

Page 13

Percentage of Compressed FM Size

(Charts: per-layer compressed feature-map size as a percentage of the original. (a) Inception V3, average 53%; (b) DeepLab V3, average 50.8%.)

Page 14

Parallelization of DMA and MAC

A sub-graph of a network is transformed into an NPU binary by the compiler. The CMDQ handles an interrupt from a module within tens of cycles, so the synchronization overhead incurs only a negligible drop in HW utilization.

(Figure: a sub-graph with layers L1-L5 and per-layer weights WL1-WL5 is mapped to an NPU core. The IFM read, the weight reads for WL1-WL5, the NPU-core compute of layers 1-5, and the OFM write overlap in time, with synchronization points ("Synch.") inserted only between dependent transfers and computations.)
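A toy timing model of this overlap (hypothetical function `pipelined_time`), assuming a single DMA engine running transfers back-to-back and a synch point that stalls a layer's compute until its own weights have landed:

```python
def pipelined_time(dma, mac):
    """dma[i]/mac[i]: per-layer weight-DMA and compute times. The DMA
    engine streams transfers back-to-back, overlapping the MAC
    array's compute of earlier layers; a synch point makes layer i's
    compute wait for its own weights."""
    dma_done = 0.0
    mac_done = 0.0
    for d, m in zip(dma, mac):
        dma_done += d                       # DMA runs ahead, back-to-back
        start = max(mac_done, dma_done)     # synch: weights landed, MAC free
        mac_done = start + m
    return mac_done

# e.g. pipelined_time([2, 2, 2], [3, 3, 3]) -> 11.0 vs. 15.0 serial
```

After the first layer's transfer, DMA time is largely hidden behind compute, which is why the measured results attribute a sizable share of the speedup to optimized DMA scheduling.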

Page 15


Page 16

Measurement Results

The NPU achieves 623 inferences/s at 1196MHz in multi-thread mode. An energy efficiency of 0.84mJ/inference (1190 inf./J) was measured, corresponding to 13.6 TOPS/W for Inception V3 including DMA power (but not DRAM).
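As a consistency check on these figures (simple arithmetic over the reported numbers, not additional measured data):

```python
# Reported: 0.84 mJ/inference, 1190 inferences/J, 13.6 TOPS/W (8b, 0.6V).
energy_per_inf = 0.84e-3                  # J per inference
inf_per_joule = 1.0 / energy_per_inf      # -> ~1190, matching the slide
assert round(inf_per_joule) == 1190

# TOPS/W x J/inference = ops/inference
ops_per_inf = 13.6e12 * energy_per_inf    # ~1.14e10 ops (~11.4 GOP)
assert 1.1e10 < ops_per_inf < 1.2e10
```

The implied ~11.4 GOP per inference is in line with Inception V3's commonly cited operation count when each MAC is counted as two operations.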

(Charts: (left) inferences/second and inferences/J versus supply voltage (0.5V-0.9V), with data labels 623, 555, 1,107, and 1,190; (right) time (ms) per inference broken into optimized-DMA and NPU-compute portions for Base, +Skipping, +Reconf., and Multithread configurations, single-core vs. multi-core, with annotations of 389, 30%, and 62.6%.)

Page 17

Comparison

                                  ISSCC2018 [2]     VLSI2019 [3]   ISSCC2020 [4]   ISSCC2020 [5]   This work
Process (nm)                      8                 16             7               12              5
Area (mm2)                        5.5               2.4            3.04            709             5.46
Supply voltage (V)                0.5 - 0.8         0.55 - 0.8     0.575 - 0.825   -               0.55 - 0.9
Frequency (MHz)                   67 - 933          33 - 480       290 - 880       475 - 700       332 - 1196
On-chip memory (kB)               1,568             281            2,176           196,608         3,072
Bit precision                     8, 16             16             8, 16           8, 16           8, 16
Number of MACs                    1,024             252            -               576K            6,144
Peak performance (TOPS)           3.5*, 6.9** (8b)  -              3.6 (8b)        825             14.7 (8b) @no-skip, 29.4 (8b) @max-skip
Power (mW)                        39 - 1,553        16.3 - 364     173 - 1,053     108,000         327 @0.6V, 794 @0.9V
Measured network                  Inception V3      ResNet-50      Inception V3    ResNet-50       Inception V3 (8b)
Energy efficiency (TOPS/W)        3.4 @0.5V         3.6 @0.55V     6.22            -               13.6 (8b) @0.6V
Energy efficiency (mJ/inference)  -                 -              -               2.0             0.840
Peak TOPS/mm2                     0.64*, 1.25**     -              1.184           1.16            2.69

Page 18

Die Photo

(Die photo: NPU Control Unit, 1.08mm2; NPU Cores 0, 1, and 2, 1.51mm2 each.)

Process: 5nm CMOS technology (Samsung)
Area: 5.46mm2
Voltage: 0.55-to-0.9V
Frequency: 332-to-1196MHz
Best peak performance: 623 inferences/s @ 0.9V (Inception V3)
Best energy efficiency: 13.6 TOPS/W @ 0.6V (Inception V3)

Page 19

Summary

Adder-tree-based datapath and serialized convolutional operations for high utilization of a large number of MACs

Feature-map-aware zero-skipping for high performance and energy efficiency

Reduced memory footprint and bandwidth via weight and FM compression

Parallelization of DMA and MAC compute via fast resource scheduling

Page 20

Thank you for your attention