
How can we optimize convolutional neural network designs on mobile and embedded systems?

June 18, 2016

Sungjoo Yoo

Computing Memory Architecture Lab.

CSE, SNU

http://cmalab.snu.ac.kr

Agenda

• Introduction
  • Connecting two convolutions in human visual cortex and artificial neural network

• Optimizing CNN architecture
  • Boosting
  • Pruning unimportant connections
  • Low-rank approximations
  • Narrow data (quantization)

• Optimizing CNN implementation
  • GPU: cuDNN, Winograd convolution, …
  • Hardware accelerators

• Summary

[Bear]

Retina: On-center Cell ~ Image Sensor Cell

Retina LGN Primary Visual Cortex (V1)

[Kandel]

V1

Convolution

[Kolb_Whishaw]

Line Detection in V1 ~ Convolution

Convolutional Neural Network (CNN)

• LeNet (1989)

• Consists of convolution layers and subsampling (max-pooling) layers

• Training: backpropagation

Convolution: 2D Input Case

[Chen, 2016]

Convolution: 3D Input Case

• A receptive field in the input feature maps produces one output neuron

• Each output feature map has its own set of kernel weights

[Chen, 2016]

Convolution: 3D Input / 3D Output

[Chen, 2016]

Training (backprop) determines kernel weights

Convolution: Computation and Model Size

• # multiplications = (k×k×D) × (N×N×H)

• # weights = (k×k×D) × H

(k×k kernel, D input feature maps, N×N output feature map, H output feature maps; a calculation sketch follows below)
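Using AlexNet's second convolutional layer as an assumed example (k = 5, D = 48, N = 27, H = 128; my own illustration, not from the slide):

```python
# A minimal sketch of the multiplication/weight counts defined above.
def conv_costs(k: int, D: int, N: int, H: int) -> tuple[int, int]:
    """Return (# multiplications, # weights) for one convolutional layer."""
    mults = k * k * D * N * N * H   # every output pixel of every output map needs k*k*D multiplies
    weights = k * k * D * H         # one k*k*D kernel per output feature map
    return mults, weights

mults, weights = conv_costs(k=5, D=48, N=27, H=128)
print(f"{mults:,} multiplications, {weights:,} weights")  # ~112M multiplications, ~154K weights
```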

Agenda

• Introduction• Connecting two convolutions in human visual cortex and artificial neural

network

• Optimizing CNN architecture• Boosting• Pruning unimportant connections• Low-rank approximations• Narrow data (quantization)

• Optimizing CNN implementation• GPU: cuDNN, Winograd convolution, …• Hardware accelerators

• Summary

Simple Example: Classification for Two Classes

• Classification ~ draw a surface between two groups

• Complex (high order) surface ~ high cost

• Basic idea: classify simple ones first at low cost

[Venkataramani, 2015]

Big/Little DNN: Overview

[Flow: input image → ① classification by the little DNN → success checker → ② a) high confidence: output the result, or ② b) low confidence: re-classify with the big DNN. A minimal sketch of this flow follows below.]

[Park, 2015]
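A minimal sketch of the big/LITTLE decision flow (assumed interfaces and threshold, not the authors' implementation):

```python
# Run the little DNN first; fall back to the big DNN only when confidence is low.
import numpy as np

def big_little_predict(x, little_dnn, big_dnn, threshold=0.9):
    """little_dnn/big_dnn are assumed callables returning class probabilities."""
    probs = little_dnn(x)
    if np.max(probs) >= threshold:   # success checker: the little DNN is confident
        return int(np.argmax(probs))
    probs = big_dnn(x)               # low confidence: pay for the big DNN
    return int(np.argmax(probs))
```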

Experiment Setup: Comparison of Computation Workload between Big & Little DNNs

• The big DNN has a ~10X larger amount of computation

[Bar charts: # of multiplications. ImageNet classifiers (×10^9): big 19.51, s 2.54, m 1.58, f 0.67, c 0.79. MNIST classifiers (×10^4): big 188.80, little classifiers m1-m8 ranging from 54.40 down to 0.90.]

[Park, 2015]

Experiment Setup: HW NPU

• Based on [Zhang, FPGA 2015]

• HW NPU
  • 512 compute engines (PEs)
  • Double buffering
  • Loop unrolling
  • Verilog design, 65nm TSMC

• In-house cycle-accurate simulator + DRAMSim2
  • Micron DRAM power model

[Block diagram: off-chip DRAM ↔ DMA unit ↔ on-chip SRAM input/output buffers feeding an array of PEs (multiply-and-add units)]

*Zhang, et al., “Optimizing FPGA-based accelerator design for deep convolutional neural networks”, FPGA 2015.

[Park, 2015]

Result: MNIST

• 93.0% energy reduction

• Accuracy loss of 0.08%

[Charts: (1) energy [mJ/image] vs. SRAM size [B], broken down into DRAM, computation, and SRAM energy; (2) energy [mJ/image] for big/LITTLE configurations m1-m8, with 85.6% and 85.3% reductions marked; (3) accuracy [%] for configurations m1-m8 and big-only under big-only, static-threshold, and dynamic-threshold policies (accuracies between 98.35% and 99.12%)]

[Park, 2015]

Result: ImageNet

• 56.7% energy reduction

• Top-1 accuracy loss of 0.51%

[Charts: (1) energy [J/image] vs. SRAM size [B], broken down into DRAM, computation, and SRAM energy; (2) energy [J/image] for big/LITTLE configurations s, m, f, c, with 34.3% reductions marked; (3) top-1 accuracy [%] for configurations s, m, f, c and big-only under big-only, static-threshold, and dynamic-threshold policies (accuracies between 67.21% and 69.41%)]

[Park, 2015]

Agenda

• Introduction
  • Connecting two convolutions in human visual cortex and artificial neural network

• Optimizing CNN architecture
  • Boosting
  • Pruning unimportant connections
  • Low-rank approximations
  • Narrow data (quantization)

• Optimizing CNN implementation
  • GPU: cuDNN, Winograd convolution, …
  • Hardware accelerators

• Summary

Pruning CNN

[Han, 2015]

Neuron Pruning is Natural in Biological Systems

• # synapses increases until about 2 years of age and then decreases due to pruning

[https://universe-review.ca/R10-16-ANS12.htm]

Convolution with Matrix Multiplication (called Convolution Lowering)

• Input: 3x3x3

• Output: 2x2x2

• Convolutional kernel: 3x2x2

[Chetlur 2014]

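A minimal NumPy sketch of convolution lowering (im2col); this illustrates the idea in [Chetlur 2014], not the cuDNN implementation itself:

```python
# Patches of the input are unrolled into rows of a matrix so that convolution
# becomes a single matrix multiplication.
import numpy as np

def conv_by_lowering(x, w):
    """x: input (D, N, N); w: kernels (H, D, k, k); 'valid' convolution, stride 1."""
    D, N, _ = x.shape
    H, _, k, _ = w.shape
    out = N - k + 1
    # Build the lowered input matrix: one row per output position, k*k*D columns.
    rows = [x[:, i:i+k, j:j+k].ravel() for i in range(out) for j in range(out)]
    X = np.stack(rows)                       # (out*out, k*k*D)
    W = w.reshape(H, -1).T                   # (k*k*D, H)
    return (X @ W).T.reshape(H, out, out)    # (H, out, out) output feature maps

x = np.random.randn(3, 3, 3)                 # the 3x3x3 input from the slide
w = np.random.randn(2, 3, 2, 2)              # two 3x2x2 kernels -> 2x2x2 output
print(conv_by_lowering(x, w).shape)          # (2, 2, 2)
```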

Pruning [Han 2015] Hardly Reduces the Runtime of Convolution on GPU

[Han 2015][Chetlur 2014]

Group-wise Brain Damage

• For each input feature map, the same location of the 2D filter elements is pruned across all output feature maps

• Pruning is performed incrementally
  • Repeat the following until there is no more pruning candidate: prune a column of the F matrix and retrain the network to recover from the accuracy loss (see the sketch below)

• Result: 3X reduction in # multiplications for AlexNet

[Lebedev 2016]
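A minimal sketch of the group-wise pruning idea (my own simplified illustration with assumed shapes, not [Lebedev 2016]'s group-sparsity training procedure): zeroing one column of the lowered filter matrix F is equivalent to zeroing the same (input map, kernel position) weight in every output feature map.

```python
import numpy as np

def prune_column(w, in_ch, ky, kx):
    """w: kernels (H, D, k, k). Zero position (in_ch, ky, kx) in all H kernels."""
    w = w.copy()
    w[:, in_ch, ky, kx] = 0.0
    return w

def column_importance(w):
    """L2 norm of each column of F = w.reshape(H, D*k*k); small norms are pruning candidates."""
    H = w.shape[0]
    return np.linalg.norm(w.reshape(H, -1), axis=0)
```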

Agenda

• Introduction
  • Connecting two convolutions in human visual cortex and artificial neural network

• Optimizing CNN architecture
  • Boosting
  • Pruning unimportant connections
  • Low-rank approximations
  • Narrow data (quantization)

• Optimizing CNN implementation
  • GPU: cuDNN, Winograd convolution, …
  • Hardware accelerators

• Summary

Singular Value Decomposition (SVD)

[K. Baker, Singular Value Decomposition Tutorial]

Example of Truncated SVD: A ≈ USVᵀ

• S: take the square roots of the 3 largest eigenvalues
• U, V: take the 3 eigenvectors associated with the selected eigenvalues

(a NumPy sketch follows below)

[K. Baker, Singular Value Decomposition Tutorial]
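A minimal NumPy sketch of a truncated SVD, keeping only the 3 largest singular values:

```python
import numpy as np

A = np.random.randn(8, 6)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

r = 3
A_approx = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]   # rank-3 approximation A ~ U S V^T
print(np.linalg.norm(A - A_approx))                 # Frobenius norm of the approximation error
```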

Error degrades accuracy. How to reclaim lost accuracy?

[Figure: low-rank approximation of an AlexNet convolutional layer (48 input feature maps of 55x55, 5x5 kernels, 128 output feature maps of 27x27): the original convolution X → Y is decomposed into a chain X → Z → Z' → Y through factors U3, C, and U4, with internal ranks 25 and 59]

Low-rank Approximation in CNN

[Kim, 2016]
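As a simplified illustration (my own sketch, not the exact Tucker decomposition used in [Kim, 2016]), a truncated SVD can split one layer's weight matrix into two smaller layers:

```python
# W (out x in) is replaced by W ~ B @ A, so the multiply count drops
# from out*in to r*(out + in) for rank r.
import numpy as np

def low_rank_factors(W, r):
    """Return (A, B) with A: (r, in), B: (out, r) such that W ~ B @ A."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    B = U[:, :r] * s[:r]          # (out, r), singular values folded into B
    A = Vt[:r, :]                 # (r, in)
    return A, B

W = np.random.randn(128, 1200)    # e.g. a 48*5*5 -> 128 kernel flattened into a matrix
A, B = low_rank_factors(W, r=25)
x = np.random.randn(1200)
err = np.linalg.norm(W @ x - B @ (A @ x)) / np.linalg.norm(W @ x)
print(f"relative error at rank 25: {err:.2f}")  # large for random W; real, redundant weights fare much better
```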

Experiments on Samsung Galaxy S6

• Exynos 7420 + LPDDR4
  • ARM Mali T760: 190 GFLOPS for 8 cores, max 256 threads/core, 32KB L1$/core, 1MB shared L2$, 25.6GB/s LPDDR4 with four x16 channels

• Comparison: NVIDIA Titan X provides 6.6 TFLOPS for 24 cores, 336GB/s main memory, max 2K threads/core, 64KB L1$/core, 3MB shared L2$


[Kim, 2016]

AlexNet: Power Consumption

• Total: 245mJ/image, 117ms

• GPU power > DRAM power

• Convolutional layers dominate total energy consumption and runtime

• At the fully connected layers, GPU power drops while DRAM power increases
  • Due to a large number of memory accesses for weights and little data reuse, i.e., low core utilization (long total idle time)

[Plot: GPU and DRAM power over time across layers C1-C5 and F6-F8]

[Kim, 2016]

VGG_S: Power Consumption

• Total: 825mJ/image, 357ms

• Convolutional layers dominate total energy consumption and runtime

• At the convolutional layers, DRAM consumes more power than in AlexNet due to the large number of weights

• At the fully connected layers, the trend is similar to AlexNet: GPU power ~ DRAM power

[Plots: GPU and DRAM power traces for AlexNet vs. VGG_S]

[Kim, 2016]

GoogLeNet: Power Consumption

• Total: 473mJ/image, 273ms

• The 1st and 2nd convolutional layers consume 1/4 of total energy and runtime

• Inception modules
  • Relatively low power consumption in both GPU and DRAM
  • Power consumption fluctuates due to the many small inception modules and cache-unfriendly 1x1 convolutions

• The fully connected layer (1M parameters) consumes very little power in GPU and DRAM

[Plots: GPU and DRAM power traces for AlexNet and GoogLeNet; the GoogLeNet trace is annotated with per-inception-module parameter counts (159K-1388K) and operation counts (54M-304M), and 1M/1M for the fully connected layer]

[Kim, 2016]


AlexNet, VGG_S vs GoogLeNet: Top-5, Runtime and Power

• AlexNet: 80.0% top-5, 117ms, 245mJ
• VGG_S: 84.6% top-5, 357ms, 825mJ
• GoogLeNet: 88.9% top-5, 273ms, 473mJ

[Kim, 2016]

Results of Low Rank Approximation

• Significant reductions in energy consumption and runtime
  • Energy: x4.26 ~ x1.6
  • Runtime: x3.68 ~ x1.42

[Chart annotations: 3.41X, 4.26X, 1.6X]

[Kim, 2016]

Fine-tuning is Required for Accuracy Recovery

• Low-rank approximation loses accuracy

• Fine-tuning recovers the lost accuracy
  • 1 epoch = 1 run of backpropagation over the entire training set

[Kim, 2016]

Agenda

• Introduction
  • Connecting two convolutions in human visual cortex and artificial neural network

• Optimizing CNN architecture
  • Boosting
  • Pruning unimportant connections
  • Low-rank approximations
  • Narrow data (quantization)

• Optimizing CNN implementation
  • GPU: cuDNN, Winograd convolution, …
  • Hardware accelerators

• Summary

Narrow-Data CNN

• Performance improvement due to narrow data
  • E.g., 16-bit → 4-bit data: 4X speedup with the same memory bandwidth consumption

[Datapath figure: conventional convolution feeds 16b weights and activations through a MUL and an ADD; convolution with narrow data packs 4b weights and activations so that four multiplies feed the ADD per fetched word]

Logarithm-based Quantization (Log-Quant)

• Smaller quantization error for small values

• No need for multiplication

[Miyashita, 2016]
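A minimal sketch of logarithmic quantization, following the idea in [Miyashita, 2016] (the exponent range below is my assumption): values are stored as sign plus an integer exponent, so a multiply with a log-quantized weight becomes a shift.

```python
import numpy as np

def log_quant(x, bits=4):
    """Quantize x to sign * 2**e with e an integer clipped to a bits-wide range."""
    sign = np.sign(x)
    e = np.round(np.log2(np.abs(x) + 1e-12))
    e = np.clip(e, -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)   # assumed exponent range
    return sign * 2.0 ** e

w = np.array([0.37, -0.052, 1.8, 0.0009])
print(log_quant(w))   # [0.5, -0.0625, 2.0, ~0.0039]: every value becomes a signed power of two
```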

Log-Quant

• Performance improvement due to narrow data

• Replacing multipliers with shifters → better area/energy efficiency

[Datapath figure: conventional convolution (16b weights/activations through MUL and ADD) vs. LogQuant-based convolution with narrow data, where 4b log-quantized operands drive shifters (>>) feeding the ADD]

Preliminary Results: Log-Quant AlexNet

Act base = 1, Weight base = 4 (accuracy loss [%]):

Weight bit \ Activation bit      3        4        5        6
2                             79.328   79.326   79.334   79.334
3                             77.236   77.254   77.254   77.546
4                              8.894    8.66     8.662    8.662
5                              1.466    1.186    1.222    1.222
6                              1.314    1.184    1.006    1.006
7                              1.318    1.294    1.342    1.342

Act base = 4, Weight base = 4 (accuracy loss [%]):

Weight bit \ Activation bit      3        4        5        6
2                             79.488   79.41    79.336   79.328
3                             79.488   77.826   77.35    77.43
4                             64.632   12.364    6.402    7.4
5                             58.33     5.26     0.306    0.272
6                             58.296    5.354    0.158    0.172
7                             58.274    5.426    0.178    0.186

• 0.3% accuracy loss at 5-bit weight and activation

[CMA Lab, 2016]

Agenda

• Introduction
  • Connecting two convolutions in human visual cortex and artificial neural network

• Optimizing CNN architecture
  • Boosting
  • Pruning unimportant connections
  • Low-rank approximations
  • Narrow data (quantization)

• Optimizing CNN implementation
  • GPU: cuDNN, Winograd convolution, …
  • Hardware accelerators

• Summary

Convolution with Matrix Multiplication (a.k.a. Convolution Lowering)

• Input: 3x3x3

• Output: 2x2x2

• Convolutional kernel: 3x2x2

[Chetlur 2014]


Matrix Size vs. GPU Cache Size

• Example: 2nd convolutional layer of AlexNet

• Input size = 55x55x48x4B = 580KB
  • Lowered input matrix size = 580KB x 5x5 = 14.5MB

• Output size = 27x27x128x4B = 387KB

• Kernel size = 48x5x5x128x4B = 614KB

[Figure: lowered input matrix (14.5MB) multiplied by the kernel matrix (614KB) produces the output matrix (387KB); the lowered input alone exceeds the GPU's on-chip cache. A calculation sketch follows below.]
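A small calculation sketch of these sizes (4-byte floats assumed):

```python
# AlexNet conv2: 48 input maps of 55x55, 128 output maps of 27x27, 5x5 kernels.
def mb(nbytes): return nbytes / 2**20

in_maps, in_w = 48, 55
out_maps, out_w = 128, 27
k, bytes_per = 5, 4

input_bytes = in_w * in_w * in_maps * bytes_per
lowered_input_bytes = input_bytes * k * k          # each input element is replicated k*k times
kernel_bytes = k * k * in_maps * out_maps * bytes_per
output_bytes = out_w * out_w * out_maps * bytes_per

print(f"input {mb(input_bytes):.2f} MB, lowered {mb(lowered_input_bytes):.1f} MB, "
      f"kernels {mb(kernel_bytes):.2f} MB, output {mb(output_bytes):.2f} MB")
# ~0.55 MB input, ~13.8 MB lowered input, ~0.59 MB kernels, ~0.36 MB output
```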

cuBLAS vs. cuDNN

[Figure sequence (slides 43-47): a 3x3 input (D0-D8) is convolved with a 2x2 filter (F0-F3) to produce a 2x2 output (O0-O3) via the lowered matrix product

  D4 D3 D1 D0       F0       O0
  D5 D4 D2 D1   x   F1   =   O1
  D7 D6 D4 D3       F2       O2
  D8 D7 D5 D4       F3       O3

With cuBLAS, the full lowered input matrix is materialized in device DRAM before the multiplication. With cuDNN, the lowering is done tile by tile in on-chip memory (SM), so the duplicated input elements are not stored in DRAM.]

cuDNN has been widely adopted due to improvements in off-chip memory BW utilization and on-chip cache utilization.

However, the # of multiplications remains the same.

Winograd Convolution

• Reduces # multiplications at the cost of additional additions
  • 2.26X faster than FFT for F(2x2, 3x3) [Lavin, 2015]

• Example: F(2,3) (1D) and F(2x2, 3x3) (2D); a 1D F(2,3) sketch follows below

[Lavin, 2015]
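A minimal NumPy sketch of the 1D case F(2,3): two outputs of a 3-tap filter are computed with 4 multiplications instead of 6 (the filter-side multiplies can be precomputed and amortized):

```python
import numpy as np

def winograd_f23(d, g):
    """d: 4 input samples, g: 3 filter taps -> 2 outputs of the valid correlation."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return np.array([m1 + m2 + m3, m2 - m3 - m4])

d = np.random.randn(4)
g = np.random.randn(3)
direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                   d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
print(np.allclose(winograd_f23(d, g), direct))   # True
```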

F(4x4, 3x3) and F(6x6, 3x3)

[Figure: input tile, kernel, and output tile for 2D Winograd convolution F(4x4, 3x3) and F(6x6, 3x3)]

Tile-based 2D Convolution: E.g., Nine F(2x2, 3x3)’s for 6x6 Output Feature Map

Three Steps in Winograd Convolution

• The larger the tiles, the fewer the multiplications and the more the additions

• Eventually, additions dominate total runtime

[CMA Lab, 2016]

[Figure: the three steps of Winograd convolution, with the transforms and element-wise products repeated D, H, and D×H times respectively]


Agenda

• Introduction
  • Connecting two convolutions in human visual cortex and artificial neural network

• Optimizing CNN architecture
  • Boosting
  • Pruning unimportant connections
  • Low-rank approximations
  • Narrow data (quantization)

• Optimizing CNN implementation
  • GPU: cuDNN, Winograd convolution, …
  • Hardware accelerators

• Summary

Hardware Accelerator, a.k.a. Neural Processing Unit (NPU)

• Commercial chip solutions
  • Movidius Myriad 2
  • Mobileye EyeQ3/4
  • Google TPU
  • …

• Academic works
  • DianNao, ASPLOS 2014
  • ShiDianNao, ISCA 2015
  • EIE (Stanford), ISCA 2016
  • Eyeriss (MIT), ISSCC/ISCA 2016
  • KAIST, ISSCC 2016

IP solutions vs. chip solutions:

• GPU(-like): CogniVue Opus (IP); NVIDIA Tegra X1, Samsung Exynos (chips)
• CNN-aware: Synopsys EV52 (IP), TeraDeep, Qualcomm Zeroth, Mobileye EyeQ4 (chip)
• VLIW/SIMD: Apical Spirit core, Cadence (Tensilica) IVP core, CEVA XM-4 core, videantis v-MP4 vision core (IP); Movidius Myriad 2, Analog Devices BF609, Inuitive NU3000, Texas Instruments TDA3x (chips)

[BDTi 2015]

• Off-chip memory traffic: a moderately large on-chip memory (4MB down to ~400kB) is enough for 32b down to 3~4b data (3~4 bit data obtained from quantization)

• On-chip memory traffic for parallel computation: reuse of data fetched from on-chip memory is critical

KAIST, ISSCC 2016

• Kernel weights are reused 8 times
• Data items are reused 4 times

[KAIST, 2016]

Agenda

• Introduction
  • Connecting two convolutions in human visual cortex and artificial neural network

• Optimizing CNN architecture
  • Boosting
  • Pruning unimportant connections
  • Low-rank approximations
  • Narrow data (quantization)

• Optimizing CNN implementation
  • GPU: cuDNN, Winograd convolution, …
  • Hardware accelerators

• Summary

Take-Away

• Removing redundancy (= exploiting locality) in convolutional neural networks (CNNs)
  • Boosting, pruning, low-rank approximation, … → design-time solutions

• What about runtime solutions?
  • How to exploit value locality, e.g., zeros in weights and activations (at the granularity of neuron, sub-feature map, layer and sub-network)?

• Exploiting parallelism and data reuse in CNN execution
  • For inference only, a few megabytes (or ~100kB) of on-chip memory is sufficient to keep the input/output feature maps and convolution kernel weights for each layer
  • How to reduce on-chip memory accesses? → Data reuse (by broadcast)

• What about hardware accelerators for learning?

References

• [Bear] M. Bear et al., Neuroscience: Exploring the Brain 3e, Lippincott Williams and Wilkins, 2016.

• [Kandel] E. Kandel, Principles of Neural Science 5e, McGraw-Hill Education / Medical, 2012.

• [Kolb_Whishaw] B. Kolb and I. Q. Whishaw, An Introduction to Brain and Behavior 3e, Worth Publishers, 2009.

• [Chen, 2016] Y. Chen, et al., “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks,” ISSCC, 2016.

• [Chetlur, 2014] S. Chetlur, et al., “cuDNN: Efficient Primitives for Deep Learning,” arXiv preprint arXiv:1410.0759v3, 2014.

• [Han, 2015] S. Han, et al., “Learning both weights and connections for efficient neural network,” NIPS, 2015.

• [Kim, 2016] Y. Kim, et al., “Compression of Deep Convolutional Neural Networks for Fast and Low Power Mobile Applications,” Proc. International Conference on Learning Representations (ICLR), May 2016.

• [Park, 2015] E. Park, et al., “Big/Little Deep Neural Network for Ultra Low Power Inference,” Proc. CODES+ISSS, Oct. 2015.

• [Lavin, 2015] A. Lavin and S. Gray, “Fast algorithms for convolutional neural networks,” arXiv preprint arXiv:1509.09308, 2015.

• [Lebedev, 2016] V. Lebedev and V. Lempitsky, “Fast ConvNets Using Group-wise Brain Damage,” arXiv preprint arXiv:1506.02515v2, 2015.

• [Miyashita, 2016] D. Miyashita, et al., “Convolutional Neural Networks using Logarithmic Data Representation,” arXiv preprint arXiv:1603.01025v2, 2016.

• [Microsoft, 2015] K. Ovtcharov, et al., “Accelerating Deep Convolutional Neural Networks Using Specialized Hardware,” Microsoft Research whitepaper, 2015.

• [KAIST, 2016] J. Sim, et al., “A 1.42TOPS/W Deep Convolutional Neural Network Recognition Processor for Intelligent IoE Systems,” ISSCC, 2016.

Thank You!