Low Power Convolutional Neural Networks on a Chip · 2018. 11. 27.
Low Power Convolutional Neural Networks on a Chip
Yu Wang, Lixue Xia, Tianqi Tang, Boxun Li, Song Yao, Ming Cheng, Huazhong Yang
Dept. of E.E., Tsinghua National Laboratory for Information Science and Technology (TNList)
Tsinghua University, Beijing, China e-mail: [email protected]
Outline
• Background • FPGA Accelerator • CNN on RRAM
CNN: Application and Performance
• CNN: state of the art in visual recognition applications
– Tracking [UIUC2015]
– Vehicle and Lane Detection [Stanford2015]
– Pedestrian Detection [arxiv2015]
– Google Translate App (2015.7)
NN: Complexity
• AlexNet (a.k.a. CaffeNet) (2012)
• GoogLeNet (2015)
Computational complexity and energy consumption increase rapidly in pursuit of ever-better recognition accuracy
Energy Efficient Circuits and Systems

Energy Efficiency = Operations / Energy = Operations/J = (OP/s) / W
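As a concrete check, the metric can be computed directly from throughput and power; the numbers below are the "Ours (Zynq XC7Z045)" figures from the comparison table later in the talk:

```python
def energy_efficiency(gop_per_s, watts):
    """Energy efficiency in GOP/J: operations per second divided by power."""
    return gop_per_s / watts

# Ours on the Zynq XC7Z045: 136.97 GOP/s overall at 9.63 W.
eff = energy_efficiency(136.97, 9.63)
print(round(eff, 2))  # 14.22 GOP/J, matching the table
```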
How to improve the energy efficiency?
[Figure: complexity vs. energy, from CPU to XPU/FPGA/ASIC to brain-like architectures, trading delay and energy]
• Energy efficiency for non-computational workloads for cognitive and other related applications
• Architecture: irrelevant to the semantics → relevant
Architecture: Simple to Complex
Embedded GPU for Object Detection
• Pipelined Fast R-CNN on Embedded GPU
– Software: algorithm selection & modification for low-power object detection
– Hardware: two-stage pipeline on the NVIDIA TK1 embedded GPU
– ~0.8 sec/img & 9.6 W
– Champion of the 1st Low-Power Image Recognition Challenge (LPIRC)
– Winner of the HIGHEST ACCURACY with low energy
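The two-stage idea can be sketched in software. This is a hypothetical illustration of pipelining with a producer/consumer queue, not the actual TK1 implementation: while stage 2 processes frame i, stage 1 already works on frame i+1, so throughput is limited by the slower stage rather than the sum of both.

```python
import queue
import threading

def two_stage_pipeline(frames, stage1, stage2):
    """Run stage1 and stage2 concurrently, connected by a small queue."""
    q = queue.Queue(maxsize=2)
    results = []

    def producer():
        for f in frames:
            q.put(stage1(f))   # stage 1 runs ahead on later frames
        q.put(None)            # sentinel: no more frames

    t = threading.Thread(target=producer)
    t.start()
    while (item := q.get()) is not None:
        results.append(stage2(item))   # stage 2 consumes in order
    t.join()
    return results

# Illustrative stages (stand-ins for detection pre/post processing).
out = two_stage_pipeline(range(4), stage1=lambda f: f * 2, stage2=lambda x: x + 1)
print(out)  # [1, 3, 5, 7]
```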
Outline
• Background • FPGA Accelerator • CNN on RRAM
Architecture and Implementation Details
• Overall Architecture
[Figure: overall architecture — the Processing System (CPU, external memory, DMA, data & instruction bus) and the Programmable Logic (input buffer, output buffer, Computing Complex with multiple PEs, FIFOs, controller, configuration bus)]
• Processing System
– Flexibility
– CPU + DDR
– Scheduling operations
– Prepare data and instructions
– Realize Softmax function
• Programmable Logic
– Hardware acceleration
– Computing Complex + on-chip buffers + Controller + DMA
– Few complex PEs
• Achieve three-level parallelism
– Inter-output: multiple PEs
– Intra-output
– Operator-level
• 16-bit dynamic-precision data quantization
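A minimal sketch of what dynamic-precision quantization might look like, assuming the per-layer strategy is to search for the fractional bit-width that minimizes quantization error. Function names are illustrative, not from the actual toolchain:

```python
import numpy as np

def quantize(x, bits, frac_bits):
    """Round to fixed point with `bits` total bits and `frac_bits`
    fractional bits, then convert back to float."""
    scale = 2.0 ** frac_bits
    qmin, qmax = -2 ** (bits - 1), 2 ** (bits - 1) - 1
    q = np.clip(np.round(x * scale), qmin, qmax)
    return q / scale

def best_frac_bits(x, bits=16, candidates=range(0, 16)):
    """Pick the fractional bit-width minimizing L2 quantization error for
    this tensor -- the per-layer search behind dynamic precision."""
    errors = {fl: np.sum((x - quantize(x, bits, fl)) ** 2) for fl in candidates}
    return min(errors, key=errors.get)

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.05, size=1000)   # small weights favor many frac bits
fl = best_frac_bits(weights)
print(fl, np.max(np.abs(weights - quantize(weights, 16, fl))))
```

Each layer can get its own fractional position, so small conv weights and large feature-map values both fit in 16 bits.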
Architecture and Implementation Details
• Processing Engine Architecture
[Figure: PE architecture — input buffer feeding multiple 3x3 Convolvers (the Convolver Complex), an adder tree with bias shift and data shift, NL and Pool units, intermediate data path, controller, and output buffer]
• Achieves intra-output parallelism by placing multiple Convolvers
• Convolver: optimized for the 3x3 convolution operation
• Adder Tree: sums up the results of one convolution operation
• NL: supports the non-linear function (ReLU)
• Pool: supports max pooling
• Bias Shift & Data Shift: support dynamic-precision fixed-point numbers
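The PE datapath above can be sketched in NumPy. This is an illustrative software model of the Convolvers / adder tree / NL / Pool flow, not the actual RTL; names and shapes are assumptions:

```python
import numpy as np

def conv3x3(channel, kernel):
    """One Convolver: 3x3 convolution over a single input channel."""
    h, w = channel.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(channel[i:i+3, j:j+3] * kernel)
    return out

def pe(inputs, kernels, bias):
    """One PE: Convolvers in parallel, adder tree, bias, ReLU, 2x2 max pool."""
    partial = [conv3x3(c, k) for c, k in zip(inputs, kernels)]  # intra-output parallelism
    summed = np.sum(partial, axis=0) + bias                     # adder tree + bias
    relu = np.maximum(summed, 0.0)                              # NL unit
    h, w = relu.shape
    pooled = relu[:h - h % 2, :w - w % 2]
    pooled = pooled.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))  # Pool unit
    return pooled

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 6, 6))   # 4 input channels
k = rng.normal(size=(4, 3, 3))   # one 3x3 kernel per channel
y = pe(x, k, bias=0.1)
print(y.shape)  # (2, 2)
```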
Architecture and Implementation Details
• Line-buffer design
– Optimized for the 3x3 Convolver
– Supports operator-level parallelism
[Figure: line-buffer datapath — data and weight buffers with delay lines and MUXes feed 9 data inputs and 9 weight inputs to the multipliers and adder tree each cycle]
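The line-buffer idea can be sketched as follows: a hypothetical software model (not the hardware design) that streams image rows through a three-row buffer and emits one 3x3 window per step, so the Convolver sees a fresh window every cycle without re-reading the image.

```python
from collections import deque

def line_buffer_windows(rows, width):
    """Stream rows through a 3-row buffer and yield each 3x3 window."""
    buf = deque(maxlen=3)          # two line buffers + the incoming row
    for row in rows:
        assert len(row) == width
        buf.append(row)
        if len(buf) == 3:          # a full 3-row window is available
            for j in range(width - 2):
                yield [r[j:j+3] for r in buf]

image = [[r * 10 + c for c in range(5)] for r in range(4)]
wins = list(line_buffer_windows(iter(image), 5))
print(len(wins))  # (4-2) * (5-2) = 6 windows
print(wins[0])    # [[0, 1, 2], [10, 11, 12], [20, 21, 22]]
```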
Performance Comparison
• Performance and Energy Efficiency Comparison

| | Chakradhar 2010 | Gokhale 2014 | Zhang 2015 | Ours | Ours |
|---|---|---|---|---|---|
| Platform | Virtex 5 SX240t | Zynq XC7Z045 | Virtex7 VX485t | Zynq XC7Z045 | Zynq XC7Z020 |
| Clock (MHz) | 120 | 150 | 100 | 150 | 100 |
| Bandwidth (GB/s) | - | 4.2 | 12.8 | 4.2 | 4.2 |
| Quantization | 48-bit fixed | 16-bit fixed | 32-bit float | 16-bit fixed | 8-bit fixed |
| Problem Complexity (GOP) | 0.52 | 0.552 | 1.33 | 30.76 | 0.1 |
| Performance (GOP/s) | 16 | 23.18 | 61.62 | 136.97 (overall), 187.89 (conv) | 19.2 |
| Power (W) | 14 | 8 | 18.61 | 9.63 | 2 |
| Power Efficiency (GOP/J) | 1.14 | 2.90 | 3.31 | 14.22 (overall), 19.50 (conv) | 9.6 |
Video Demonstration
• Youku link: http://v.youku.com/v_show/id_XMTQ5MTI3NTM0OA==.html#paction
• Youtube link: https://www.youtube.com/watch?v=m4e1SV89Dpg
Outline
• Background • FPGA Accelerator • CNN on RRAM
Energy Efficiency Limitation of CMOS
• Scaling up will not improve the energy efficiency
• For the CNN task:
– CPU: 1.5 GOPS/W; FPGA: 14.2 GOPS/W
– DaDianNao: 350 GOPS/W (peak)
– Brain: 500,000 GOPS/W, still a >1000× gap
[Figure: CPU → XPU/FPGA/ASIC → brain-like architectures on the delay/energy plane; CMOS scaling down buys ~10×, accelerator architectures ~100×, and closing the remaining gap is the open question]
RRAM-based Computation
• The brain is NOT Boolean
• Emerging devices, such as RRAM, provide a promising way to realize brain-inspired circuits and systems
[Figure: RRAM crossbar — input voltages $V_{ik}$ drive the rows, cell conductances $g_{kj}$ sit at the crosspoints, and output voltages $V_{oj}$ appear on the columns]

$$V_{oj} = \sum_k r \cdot V_{ik} \cdot g_{kj}, \qquad g_{kj} = \frac{1}{M_{kj}}$$

• Matrix-vector multiplication in one analog step: O(n²) → O(n⁰)
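The crossbar computation can be sketched numerically. This is a minimal model assuming ideal devices (no wire resistance or device variation), with illustrative memristance values:

```python
import numpy as np

def crossbar_mvm(v_in, memristance, r=1.0):
    """Analog matrix-vector multiply on an ideal RRAM crossbar:
    V_out[j] = r * sum_k V_in[k] * g[k, j], with g[k, j] = 1 / M[k, j].
    One read operation computes the whole product, whatever the size."""
    g = 1.0 / memristance          # conductances stored in the cells
    return r * (v_in @ g)

M = np.array([[1.0, 2.0],
              [4.0, 5.0]])         # memristance values (hypothetical)
v = np.array([0.5, 0.25])          # input voltages
print(crossbar_mvm(v, M))          # output voltages, one per column
```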
Merge Mem. & Compute
• I&F neuron, LPF neuron
• Plasticity: configure with voltage/current
• High density
• Non-volatile
• 1 RRAM cell ≈ 1 m-bit multiplier + 1 m-bit adder + 1 m-bit register (SRAM)
• RRAM crossbar ≈ matrix-vector multiplication ASIC

RRAM Crossbar: non-volatile, merges memory & compute, ~100× efficiency gains
Circuit level:
• Device fault [DATE'14 / ICCAD'16 submitted]
• Device control (RD/WR) [JCST'16]
Architecture level:
• Interface with CPU [DAC'15]
• Interface between crossbars [DAC'16]
• Mapping & compile [ASPDAC'15]
• Process-in-memory [ISCA'16]
• Simulator [DATE'16]
Application level:
• Self-training with RRAM [ASPDAC'14 / ICCAD'16 submitted]
• Series of RRAM-based NN systems (ANN, SNN) [TCAD'15 / DATE'15 / GLSVLSI'15 / ISLPED'13]
Our Preliminary Work
• Interface between CPU and memristor
• Approximate computing
• Two chips have been taped out!
Structure of CNN
• CNN consists of cascaded convolutional layers and FC (fully connected) layers
• Conv layers account for the main part of CNN computation
[Chart: distribution of computations (GOPs) across layers CONV1–CONV5 and FC6–FC8 for CaffeNet, ZF, VGG11, VGG16, and VGG19; the conv layers dominate]
RRAM-based Convolution
• The function of a convolution kernel is also a vector-vector multiplication
• Multiple conv kernels share the same input data
– Convolution kernels can therefore be regarded as a matrix-vector multiplication
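This reduction can be sketched with an im2col-style model in NumPy; names and shapes are illustrative. Each 3x3 input patch becomes a vector, and stacking all kernels' weights into a matrix lets every kernel consume the same input vector at once:

```python
import numpy as np

def conv_as_matvec(image, kernels):
    """Apply all 3x3 kernels to one image via matrix-vector products."""
    h, w = image.shape
    K = np.stack([k.ravel() for k in kernels])      # one row per kernel
    out = np.zeros((len(kernels), h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = image[i:i+3, j:j+3].ravel()     # shared input vector
            out[:, i, j] = K @ patch                # all kernels at once
    return out

img = np.arange(25, dtype=float).reshape(5, 5)
ks = [np.ones((3, 3)), np.eye(3)]                   # two illustrative kernels
res = conv_as_matvec(img, ks)
print(res.shape)  # (2, 3, 3): one output map per kernel
```

On a crossbar, each row of `K` maps to one column of conductances, so the inner `K @ patch` is exactly the analog read described above.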
Convolutional Layer on RRAM�
• Implement convolution kernels on RRAM – Store weights of kernels on RRAM device – Input the data to multiple kernels simultaneously
• Peripheral functions are implemented in CMOS
• We use a line buffer similar to the one in our FPGA design
Functions of Conv Layer RRAM-based Conv Layer
| | GPU | FPGA | RRAM | ASIC [ISSCC 2016] |
|---|---|---|---|---|
| Network | VGG16 | VGG16 | VGG16 | AlexNet conv |
| Problem Complexity (GOP) | 30.76 | 30.76 | 30.76 | 5.32 |
| Weight (MB) | 528 | 264 | 132 | 4.6 |
| Data (MB) | 127 | 63 | 32 | 1.56 |
| Precision | 32-bit float | 16-bit fixed | 8-bit fixed | 16-bit fixed |
| Top-1 Accuracy (%) | 68.10 | 68.02 | 66.58 | - |
| Top-5 Accuracy (%) | 88.00 | 87.94 | 87.38 | - |
| Energy Efficiency (GOPS/W) | 7.14 | 14.22 | 462.67 | 166 |
Experimental Results
• Improves energy efficiency by more than 40× compared with the FPGA and GPU implementations
Conclusion
• We implement large-scale CNNs on an embedded FPGA-based chip
• The RRAM crossbar provides a further, more efficient way to implement the main computation of CNN
– We are designing our RRAM-based CNN chip to verify the energy-efficiency potential
References�
[GoogLeNet]Szegedy C, Liu W, Jia Y, et al. Going deeper with convolutions[J]. arXiv preprint arXiv:1409.4842, 2014.
[AlexNet] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks[C]//Advances in neural information processing systems. 2012: 1097-1105.
[FPGA16] Jiantao Qiu et al., "Going deeper with embedded fpga platform for convolutional neural network", to appear in FPGA 2016.
[Chen ISSCC 2016] Y. H. Chen, T. Krishna, J. Emer and V. Sze, "14.5 Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," 2016 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, 2016, pp. 262-263.
[Gokhale 2014] V. Gokhale, J. Jin, A. Dundar, B. Martini and E. Culurciello, "A 240 G-ops/s Mobile Coprocessor for Deep Neural Networks," 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops, Columbus, OH, 2014, pp. 696-701.
[Zhang 2015] Zhang C, Li P, Sun G, et al. Optimizing fpga-based accelerator design for deep convolutional neural networks[C]//Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2015: 161-170.
[Chakradhar 2010] Chakradhar S, Sankaradas M, Jakkula V, et al. A dynamically configurable coprocessor for convolutional neural networks[C]//ACM SIGARCH Computer Architecture News. ACM, 2010, 38(3): 247-257.
https://nicsefc.ee.tsinghua.edu.cn/