9.5: A 6K-MAC Feature-Map-Sparsity-Aware Neural Processing Unit in 5nm Flagship Mobile SoC
© 2021 IEEE International Solid-State Circuits Conference
A 6K-MAC Feature-Map-Sparsity-Aware Neural Processing Unit in 5nm Flagship Mobile SoC

Jun-Seok Park1, Jun-Woo Jang2, Heonsoo Lee1, Dongwoo Lee1, Sehwan Lee2, Hanwoong Jung2, Seungwon Lee2, Suknam Kwon1, Kyungah Jeong1, Joon-Ho Song2, SukHwan Lim1, Inyup Kang1
1 Samsung Electronics, 2 Samsung Advanced Institute of Technology
Outline
• Motivation
• NPU architecture
• Convolutional engine
• Adder-tree based dot-product engine
• Feature-map-aware zero-skipping
• Feature-map lossless compressor
• Command queue
• Measurement results
• Comparison
Motivation - Efficiency
On-device DNN inference is critical for mobile products. An NPU on a mobile AP faces heavy constraints on computing resources, power, and memory bandwidth.
[Figure: example on-device use cases — bokeh, scene detection, face detection, face landmark detection]
Motivation - Flexibility
The NPU needs to support a comprehensive range of neural networks: diverse kernel sizes, dilation rates, and strides.
[Figure: sources of diversity — feature-map shapes (width × height × channel, narrow-and-deep vs. wide-and-shallow), kernel variations (various kernel sizes, strides, dilated convolution), and layer types (convolutional neural network IFM→OFM, depthwise convolution IFM→OFM)]
NPU architecture
[Figure: top-level block diagram] A CPU (with I$/D$) and three NPU cores are attached to the system bus. The NPU control unit, with an instruction fetcher and a command-queue array (with synchronization), orchestrates the cores over a control path, while data transactions go through three feature-map lossless compressor (FLC) channels, each containing a compressor and a de-compressor in front of the IFM, OFM, and weight DMAs. Each NPU core holds two convolution engines (NPU Conv Engine 0/1), each with a weight fetcher, FM fetcher, PSUM fetcher, FM data holder, weight buffer, FM zero-skipping logic, MAC array, and activation-function unit, plus a vector processing unit and a 1MB shared scratchpad.
Convolutional engines (CE)
The CE processes 16-dimensional data in parallel along the channel direction. With 1x1x16 as the smallest unit of compute, convolutions with arbitrary kernel shapes are straightforward to serialize.
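The serialization idea can be sketched as follows. This is a hypothetical NumPy model of the 1x1x16 compute-unit concept, not the hardware datapath; the function name and layout conventions are illustrative.

```python
import numpy as np

def conv_serialized(ifm, weights, c_tile=16):
    """Serialize a KxK convolution into 1x1x(c_tile) dot-product steps.

    ifm:     (H, W, C) input feature map
    weights: (K, K, C, M) kernel with M output channels
    Each inner step is one 1x1x16 vector multiplied against the weights
    of all M output channels -- the smallest unit of compute.
    """
    H, W, C = ifm.shape
    K, _, _, M = weights.shape
    assert C % c_tile == 0, "channels assumed to be a multiple of the tile"
    ofm = np.zeros((H - K + 1, W - K + 1, M))
    for y in range(ofm.shape[0]):
        for x in range(ofm.shape[1]):
            # One output pixel = K*K*(C/c_tile) serialized 1x1x16 steps
            for ky in range(K):
                for kx in range(K):
                    for c0 in range(0, C, c_tile):
                        fm_vec = ifm[y + ky, x + kx, c0:c0 + c_tile]
                        ofm[y, x] += fm_vec @ weights[ky, kx, c0:c0 + c_tile]
    return ofm
```

Because every kernel tap reduces to the same 1x1x16 step, arbitrary kernel sizes, strides, and dilations only change the loop bounds and addresses, not the compute unit.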
[Figure: CE datapath — feature-map fetcher, weight fetcher, and PSUM/FM fetcher feed a data holder (which skips zero features) and a weight buffer; zero selection steers non-zero data into multiplier arrays whose outputs are summed by an adder tree into accumulate/max units (with a boost path)]
Challenges
• Maintain a high utilization factor for those diverse convolutions
• Achieve high energy efficiency

Key points (solutions)
• Serialize the convolutional operations along the spatial direction
• Skip redundant operations by exploiting the sparseness of the feature map

[Figure: per-layer feature-map distributions (zero vs. non-zero features, 0-100%) for Inception V3 and DeepLab V3, showing substantial zero fractions across layers]
Dilated convolution

[Figure: a 3x3 weight tensor is expanded into a dilated weight tensor; the IFM tensor is traversed as serialized tiles t0-t71 (groups t0-t8, t9-t17, t18-t26, ..., t63-t71) to produce the OFM tensor]
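With serialized tiles, dilation only changes which IFM offsets each kernel tap reads; the MAC count is unchanged. A hypothetical sketch of this addressing (the function name is illustrative):

```python
def dilated_taps(k=3, dilation=2):
    """Map each kernel tap (ky, kx) to its IFM offset under dilation.

    On a serialized datapath, a dilated convolution is handled by the
    feature-map fetcher simply reading strided addresses -- the number
    of 1x1 compute steps per output stays k*k.
    """
    return [(ky * dilation, kx * dilation)
            for ky in range(k) for kx in range(k)]
```

For example, a 3x3 kernel with dilation 2 still issues nine taps, but they span a 5x5 receptive field: offsets (0,0) through (4,4).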
Feature-map-aware zero skipping

[Figure: feature-map vectors fmVec 0-8 and weight vectors wVec 0-8; all-zero fmVecs are skipped, so only the non-zero fmVec/wVec pairs are issued to the MAC array (MAA), compressing nine vector slots into four cycles (#0-#3)]

Benefits of feature-map-aware zero skipping:
• Effective performance
• Energy efficiency
• HW utilization
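The scheduling idea can be sketched as follows. This is a hypothetical behavioral model of the skip decision, not the actual issue logic; names are illustrative.

```python
def zero_skip_schedule(fm_vecs, w_vecs):
    """Issue only non-zero feature-map vectors to the MAC array (MAA).

    fm_vecs / w_vecs: aligned lists of equal-length vectors.
    Returns the (fmVec, wVec) pairs actually issued; because all-zero
    fmVecs contribute nothing to the dot product, cycles taken equals
    the number of non-zero fmVecs rather than the total count.
    """
    issued = []
    for fm, w in zip(fm_vecs, w_vecs):
        if any(fm):              # skip all-zero feature vectors entirely
            issued.append((fm, w))
    return issued
```

With highly sparse activations (see the Inception V3 / DeepLab V3 distributions above), most slots are skipped, which is where the effective-performance and energy gains come from.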
Feature Map Lossless Compressor

[Figure: feature-map groups are encoded with a multi-level quad-tree (levels 0-2); each level's bitmap distinguishes zero features, non-zero features, and clustered zero features. The compressed feature map consists of meta-data (stream length, truncated non-zero bitwidth), the quad-tree header bits, and the packed non-zero features]
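A minimal sketch of the quad-tree header idea, assuming a flat feature-map group whose length is a power of 4. This is an illustration of the general scheme, not the chip's exact bitstream format (meta-data, truncated bitwidths, and level ordering are omitted).

```python
def quadtree_encode(feats):
    """Lossless FM compression sketch: quad-tree zero bitmap + non-zeros.

    feats: flat list of features (length a power of 4) for one group.
    Each level emits 4 header bits marking which quarters contain any
    non-zero feature; all-zero quarters are pruned, so clustered zeros
    cost almost nothing. Non-zero values are packed separately.
    """
    header, nonzeros = [], []

    def encode(block):
        if len(block) == 1:
            if block[0]:
                nonzeros.append(block[0])
            return
        quarter = len(block) // 4
        subs = [block[i * quarter:(i + 1) * quarter] for i in range(4)]
        bits = [1 if any(s) else 0 for s in subs]
        header.extend(bits)          # this level's 4-bit occupancy map
        for sub, bit in zip(subs, bits):
            if bit:                  # recurse only into non-zero quarters
                encode(sub)

    if any(feats):
        encode(feats)
    return header, nonzeros
```

For a 16-feature group with a single non-zero value, the output is 8 header bits plus one stored value, instead of 16 raw features; decompression walks the same tree and re-inserts zeros.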
Percentage of Compressed FM Size

[Figure: compressed feature-map size per layer — (a) Inception V3, average 53%; (b) DeepLab V3, average 50.8%]
Parallelization of DMA and MAC
• A sub-graph of a network is transformed into an NPU binary by the compiler
• The command queue (CMDQ) handles an interrupt from a module within tens of cycles
• The synchronization overhead incurs only a negligible drop in HW utilization
[Figure: sub-graph of a network (layers L1-L5 with weight tensors WL1-WL5) mapped onto an NPU core; IFM reads, per-layer weight reads, MAC compute steps 1-5, and the OFM write are pipelined, with synchronization points between dependent steps]
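The benefit of overlapping DMA with MAC compute can be sketched with a toy cost model. This is a hypothetical back-of-the-envelope model (uniform per-layer costs, one DMA engine), not the actual scheduler.

```python
def overlapped_time(layers, dma, mac):
    """Total time when weight DMA for layer i+1 overlaps MAC of layer i.

    layers: number of layers; dma/mac: per-layer DMA and compute costs.
    The command queue releases each MAC step as soon as its weights have
    arrived (the synch points in the pipeline diagram).
    """
    t_dma = 0.0   # when the DMA engine finishes its current transfer
    t_mac = 0.0   # when the MAC array finishes its current layer
    for _ in range(layers):
        t_dma += dma                      # fetch this layer's weights
        t_mac = max(t_mac, t_dma) + mac   # compute once weights arrive
    return t_mac

def serial_time(layers, dma, mac):
    """Baseline: each layer waits for its DMA, with no overlap."""
    return layers * (dma + mac)
```

For example, with 5 layers, dma=2 and mac=3, the overlapped schedule takes 17 time units versus 25 serially; as the slide notes, only the first DMA and the short synch handoffs remain exposed.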
Measurement Results
• The NPU achieves 623 inferences/s at 1196 MHz in multi-thread mode
• An energy efficiency of 0.84 mJ/inference (1,190 inf./J) was measured
• This corresponds to 13.6 TOPS/W for Inception V3, including DMA power (but not DRAM)
[Figure: (left) inferences/second and inferences/J vs. supply voltage (0.5-0.9 V), reaching 623 inferences/s and 1,190 inferences/J multi-core vs. 555 inferences/s and 1,107 inferences/J single-core; (right) time (ms)/inference breakdown of DMA vs. NPU compute across Base, +Skipping, +Reconf., Multithread, and Optimized-DMA configurations, with improvements of 30% and 62.6%]
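The reported figures are mutually consistent, which can be checked with a little arithmetic (values taken from the slides; the implied op count per inference is my derivation, not a stated number):

```python
# Cross-check of the measured numbers at 0.6 V (Inception V3, 8-bit).
mj_per_inference = 0.84                       # reported mJ/inference
inferences_per_joule = 1000 / mj_per_inference
# ~1190, matching the reported 1,190 inf./J

tops_per_watt = 13.6                          # reported, incl. DMA power
# Implied work per inference: (ops/J) x (J/inference)
ops_per_inference = tops_per_watt * 1e12 * mj_per_inference * 1e-3
# ~11.4 GOPs per inference, a plausible figure for Inception V3
```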
Comparison

|                                   | ISSCC 2018 [2]   | VLSI 2019 [3] | ISSCC 2020 [4] | ISSCC 2020 [5] | This work                                 |
|-----------------------------------|------------------|---------------|----------------|----------------|-------------------------------------------|
| Process (nm)                      | 8                | 16            | 7              | 12             | 5                                         |
| Area (mm2)                        | 5.5              | 2.4           | 3.04           | 709            | 5.46                                      |
| Supply voltage (V)                | 0.5 - 0.8        | 0.55 - 0.8    | 0.575 - 0.825  | -              | 0.55 - 0.9                                |
| Frequency (MHz)                   | 67 - 933         | 33 - 480      | 290 - 880      | 475 - 700      | 332 - 1196                                |
| On-chip memory (kB)               | 1,568            | 281           | 2,176          | 196,608        | 3,072                                     |
| Bit precision                     | 8, 16            | 16            | 8, 16          | 8, 16          | 8, 16                                     |
| Number of MACs                    | 1,024            | 252           | -              | 576K           | 6,144                                     |
| Peak performance (TOPS)           | 3.5*, 6.9** (8b) | -             | 3.6 (8b)       | 825            | 14.7 (8b) @ no-skip, 29.4 (8b) @ max-skip |
| Power (mW)                        | 39 - 1,553       | 16.3 - 364    | 173 - 1,053    | 108,000        | 327 @ 0.6V, 794 @ 0.9V                    |
| Measured network                  | Inception V3     | ResNet-50     | Inception V3   | ResNet-50      | Inception V3 (8-bit)                      |
| Energy efficiency (TOPS/W)        | 3.4 @ 0.5V       | 3.6 @ 0.55V   | 6.22           | -              | 13.6 (8b) @ 0.6V                          |
| Energy efficiency (mJ/inference)  | -                | -             | -              | 2.0            | 0.840                                     |
| Peak TOPS/mm2                     | 0.64*, 1.25**    | -             | 1.184          | 1.16           | 2.69                                      |
Die Photo

[Die photo: NPU Control Unit (1.08mm2) and NPU Cores 0, 1, 2 (1.51mm2 each)]

Process: 5nm CMOS technology (Samsung)
Area: 5.46mm2
Voltage: 0.55 to 0.9 V
Frequency: 332 to 1196 MHz
Best peak performance: 623 inferences/s @ 0.9V (Inception V3)
Best energy efficiency: 13.6 TOPS/W @ 0.6V (Inception V3)
Summary
• Adder-tree-based datapath and serialized convolutional operations for high utilization of a large number of MACs
• Feature-map-aware zero-skipping for high performance and energy efficiency
• Reduced memory footprint and bandwidth via weight and feature-map compression
• Parallelization of DMA and MAC compute time through fast resource scheduling
Thank you for your attention