RANC as Neural Network Accelerator
DNNARA: A Deep Neural Network Accelerator using Residue Arithmetic and Integrated Photonics
Jiaxin Peng, Yousra Alkabani, Shuai Sun, Volker Sorger, and Tarek El-Ghazawi
49th International Conference on Parallel Processing – ICPP
August 2020
Outline
➢Introduction
➢Background
➢Integrated Photonic Residue Arithmetic Computing Engine for Convolutional Neural Network
➢Performance Evaluation
➢Conclusion
Introduction
➢Some NN applications require real-time analysis for inference
➢Computation-intensive: inference involves billions of multiply-accumulate (MAC) operations
➢We propose DNNARA: a deep neural network accelerator using residue arithmetic based on integrated photonics
➢All computations in the neural network are carried out in the residue number system (RNS), avoiding extra conversions between binary and RNS
Block Diagram of a DNNARA System
Introduction
➢DNNARA: RNS with wavelength-division multiplexing (WDM)
  • Executes multiple matrix-vector multiplications (MVMs) concurrently thanks to WDM
  • Speeds up MVMs because RNS digits are independent of one another
  • Residues are small
  • Increases system parallelism and saves area/hardware resources
Background
➢ Convolutional Neural Network
➢ Residue Number System
Background – Convolutional Neural Network
➢Widely applied in classification
  • Image recognition
➢Consists of several layers/functions
  • Convolutional layers
  • Activation functions – add non-linearity
    • ReLU (Rectified Linear Unit)
    • Sigmoid function / hyperbolic tangent function
  • Pooling layers – downsample the output
    • Max pooling
    • Average pooling
  • Fully-connected layers
➢Contains up to billions of multiply-accumulate (MAC) operations
Background - Residue Number System (RNS)
➢Each integer X is represented by its residues, the remainders obtained by dividing it by each modulus Mi
  • Example: moduli are M1=2, M2=3, M3=5, M4=7
  • X = 20 is represented as X = {0, 2, 0, 6}[2, 3, 5, 7]
  • Range of numbers that can be represented: 0 to M−1 (here 0 to 209), where M = M1*M2*M3*M4 = 210
  • Moduli should be pairwise relatively prime
➢Negative number notation: similar to 2's complement
  • r = |m − |X|_m|_m (where X is negative)
  • Example: −20 = {|2−0|_2, |3−2|_3, |5−0|_5, |7−6|_7}[2, 3, 5, 7] = {0, 1, 0, 1}[2, 3, 5, 7]
  • Range of numbers that can be represented: [−(M−1)/2, (M−1)/2] if M is odd, or [−M/2, M/2−1] if M is even
➢Residue arithmetic: operations are carried out on the residues
  • Example: X = 20 = {0, 2, 0, 6}[2, 3, 5, 7] and Y = 5 = {1, 2, 0, 5}[2, 3, 5, 7]
  • X+Y = {0+1, 2+2, 0+0, 6+5}[2, 3, 5, 7] = {1, 1, 0, 4}[2, 3, 5, 7] (i.e., 25)
  • X*Y = {0*1, 2*2, 0*0, 6*5}[2, 3, 5, 7] = {0, 1, 0, 2}[2, 3, 5, 7] (i.e., 100)
  • Residue arithmetic is carried out as modulo additions and multiplications on the residues
  • Residue arithmetic is carried out on each residue in parallel
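The RNS example on this slide can be checked with a few lines of Python. This is only an illustrative sketch: the function names (`to_rns`, `rns_add`, `rns_mul`, `rns_neg`, `from_rns`) are hypothetical helpers, and the decoding step uses the standard Chinese Remainder Theorem, which the slides do not spell out.

```python
from math import prod

MODULI = (2, 3, 5, 7)   # pairwise relatively prime moduli from the slide
M = prod(MODULI)        # dynamic range M = 210

def to_rns(x, moduli=MODULI):
    """Residues of x; Python's % maps negative x into [0, m) as well."""
    return tuple(x % m for m in moduli)

def rns_add(a, b, moduli=MODULI):
    """Digit-wise modular addition; every digit is independent of the others."""
    return tuple((ai + bi) % m for ai, bi, m in zip(a, b, moduli))

def rns_mul(a, b, moduli=MODULI):
    """Digit-wise modular multiplication."""
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, moduli))

def rns_neg(a, moduli=MODULI):
    """Negation per digit: r = |m - |X|_m|_m."""
    return tuple((m - ai) % m for ai, m in zip(a, moduli))

def from_rns(r, moduli=MODULI, signed=False):
    """Decode with the Chinese Remainder Theorem (assumed decoding method)."""
    total = prod(moduli)
    x = 0
    for ri, m in zip(r, moduli):
        Mi = total // m
        x += ri * Mi * pow(Mi, -1, m)   # pow(..., -1, m) is the modular inverse
    x %= total
    # Upper half of the range encodes negative numbers.
    if signed and x >= (total + 1) // 2:
        x -= total
    return x

X, Y = to_rns(20), to_rns(5)
print(X, Y)                                # (0, 2, 0, 6) (1, 2, 0, 5)
print(rns_add(X, Y), rns_mul(X, Y))        # (1, 1, 0, 4) (0, 1, 0, 2)
print(rns_neg(X), from_rns(rns_neg(X), signed=True))  # (0, 1, 0, 1) -20
```

Because each digit is processed independently, the loops inside `rns_add` and `rns_mul` are exactly the part that can run in parallel in hardware.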
Integrated Photonic Residue Arithmetic Computing Engine for Neural Network
➢ Overview
➢ Residue Adders and Multipliers
➢ Residue Matrix-Vector Multiplication Unit
➢ Sigmoid Unit
➢ Max Pooling Unit
• R-MVM: Residue Matrix-Vector Multiplication
• R-Multiplier: Residue Multiplier
• R-Adder: Residue Adder
• MRR: Micro-Ring Resonator
• PD: Photo-Detector
• LUT: Look-up Table
• RNS2Bin: RNS to Binary
• Bin2RNS: Binary to RNS
• T: tile
Overview Architecture
Integrated Photonic Residue Adder and Multiplier
➢Basic block
  • An electro-optical 2×2 switch
  • Light either propagates straight through ("bar" state, (a)) or crosses over ("cross" state, (b))
➢Residue adder [1] – one-hot encoding
  • Can be considered as a mapping (injection)
  • Arbitrary Size Benes (AS-Benes) network ((c) even number, (d) odd number)
  • Switch states are precomputed and stored in a look-up table (LUT)
➢An AS-Benes modulo-5 adder (e)
  • Example: |3+4|_5 = 2
➢A modulo-N residue multiplier implementation (f)
➢WDM capable
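The idea of a one-hot residue adder as a precomputed mapping can be sketched in software. The code below is an abstraction only: it models the end-to-end routing that a switch network would realize, not the AS-Benes switch settings themselves, and all names (`one_hot`, `adder_lut`, `multiplier_lut`, `route`, `decode`) are hypothetical.

```python
def one_hot(value, n):
    """One-hot encode a residue: light is injected on exactly one of n inputs."""
    return [1 if i == value else 0 for i in range(n)]

def adder_lut(n):
    """For each addend b, the routing that sends input port a to port (a+b) mod n."""
    return {b: [(a + b) % n for a in range(n)] for b in range(n)}

def multiplier_lut(n):
    """For each factor b, the routing that sends input port a to port (a*b) mod n."""
    return {b: [(a * b) % n for a in range(n)] for b in range(n)}

def route(signal, mapping):
    """Propagate the one-hot signal through the selected routing."""
    out = [0] * len(signal)
    for src, dst in enumerate(mapping):
        out[dst] |= signal[src]
    return out

def decode(signal):
    """Detect which output port carries light."""
    return signal.index(1)

ADD5 = adder_lut(5)
MUL5 = multiplier_lut(5)

# |3+4|_5: operand 3 is the light input, operand 4 selects the routing.
print(decode(route(one_hot(3, 5), ADD5[4])))  # 2
# |3*4|_5 under the same abstraction.
print(decode(route(one_hot(3, 5), MUL5[4])))  # 2
```

For the adder, every routing is a cyclic shift, so it is a one-to-one mapping. The multiplier routing for a factor of 0 sends every input to port 0, so it is not one-to-one; how the hardware multiplier in (f) handles this is not modeled here.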
Residue MVM (R-MVM) Computing Block
➢Schematic of designed R-MVM (b)
➢Wavelength-Division Multiplexing (WDM) Capable
➢Requires lasers, MRRs, PDs, LUTs, and registers, as well as photonic and electrical connections
➢A sel signal chooses either the partial sum or the bias
➢Example: 5x5 input feature and a 2x2 kernel
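To make the 5x5 input feature and 2x2 kernel example concrete, here is a minimal software sketch of a convolution evaluated separately in each residue channel. The feature values, kernel weights, bias, stride of 1, and lack of padding are all assumptions for illustration; the function `conv2d` is hypothetical and says nothing about how the photonic R-MVM schedules the work.

```python
MODULI = (2, 3, 5, 7)  # example moduli; the actual design may use others

def conv2d(feature, kernel, bias=0, m=None):
    """Valid (no padding, stride 1) convolution; reduces modulo m if m is given."""
    reduce = (lambda v: v % m) if m is not None else (lambda v: v)
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(feature) - kh + 1
    out_w = len(feature[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = reduce(bias)  # the accumulator starts from the bias
            for di in range(kh):
                for dj in range(kw):
                    x = reduce(feature[i + di][j + dj])
                    w = reduce(kernel[di][dj])
                    acc = reduce(acc + x * w)  # one MAC per kernel element
            row.append(acc)
        out.append(row)
    return out

# Made-up data: a 5x5 feature map and a 2x2 kernel give a 4x4 output.
feature = [[(3 * i + j) % 4 for j in range(5)] for i in range(5)]
kernel = [[1, -1], [2, 0]]
bias = 1

channels = {m: conv2d(feature, kernel, bias, m) for m in MODULI}
reference = conv2d(feature, kernel, bias)  # ordinary integer convolution

# Each residue channel agrees with the integer result reduced by its modulus.
for m in MODULI:
    assert all(channels[m][i][j] == reference[i][j] % m
               for i in range(4) for j in range(4))
```

Each of the 16 outputs is a dot product of four inputs with four weights, and the residue channels never exchange data, which is what allows them to be computed side by side.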
Pipeline of a MAC operation
• Cycle 1:
  • Input features (x) are encoded as light with different wavelengths
  • Weights (w) are encoded as the selection lines that load the switch states from the LUT
• Cycle 2:
  • Set up the switch states accordingly
  • Inject light and detect light – multiply
  • MRRs and PDs act as filters to obtain the results of all the multiplications
• Cycle 3:
  • Results from the previous cycle (w*x) are decoded as the selection lines that load the switch states for the adders
  • According to sel, either the partial sum or the bias is encoded as light
• Cycle 4:
  • Set up the adders
  • Inject light and detect light – add
• Cycle 5: Write back to the register
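The five cycles can be mimicked in software for a single modulus. This is a behavioral sketch under stated assumptions: the LUTs are modeled as plain dictionaries of routings, `sel == 0` is assumed to pick the bias and `sel == 1` the partial sum, and the names `build_luts` and `mac` are hypothetical. Timing, wavelengths, and the optical devices are not modeled.

```python
def build_luts(m):
    """Precomputed routings: mul_lut[w][x] = (x*w) mod m, add_lut[p][s] = (s+p) mod m."""
    mul_lut = {w: [(x * w) % m for x in range(m)] for w in range(m)}
    add_lut = {p: [(s + p) % m for s in range(m)] for p in range(m)}
    return mul_lut, add_lut

def mac(x, w, partial_sum, bias, sel, m, luts):
    """One residue MAC, following the five cycles described above."""
    mul_lut, add_lut = luts
    # Cycle 1: x becomes the light input; w selects switch states from the LUT.
    mul_states = mul_lut[w % m]
    # Cycle 2: set the switches, inject light, detect the product.
    product = mul_states[x % m]
    # Cycle 3: the product selects the adder states; sel picks what is sent as light.
    add_states = add_lut[product]
    addend = bias if sel == 0 else partial_sum   # assumed meaning of sel
    # Cycle 4: set the adders, inject light, detect the sum.
    result = add_states[addend % m]
    # Cycle 5: the caller writes the result back to the register.
    return result

m = 5
luts = build_luts(m)
xs, ws, bias = [3, 4, 1], [2, 2, 4], 1

register = 0
for step, (x, w) in enumerate(zip(xs, ws)):
    sel = 0 if step == 0 else 1          # first MAC starts from the bias
    register = mac(x, w, register, bias, sel, m, luts)

print(register)  # 4, i.e. (1 + 3*2 + 4*2 + 1*4) mod 5
```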
Sigmoid Function Unit - Polynomial
➢In the residue domain, the sigmoid function is hard to compute directly
➢Instead, it can be approximated by a polynomial, because the sigmoid function can be expanded as a Taylor series
➢The terms that include x need to be precomputed, and the connections built accordingly
➢Example: P(x) = ax^4 + bx^3 + cx^2 + dx + e in a modulo-5 system
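One way to read "pre-calculate the terms that include x" is that, for a small modulus, the polynomial collapses into a fixed input-to-output table. The sketch below builds that table for a modulo-5 system. The coefficient values are invented purely for illustration (they are not sigmoid coefficients from the paper), and `power_table` and `polynomial_lut` are hypothetical names.

```python
def power_table(m, degree):
    """powers[x][k] = x^k mod m for every residue x and every exponent k."""
    return [[pow(x, k, m) for k in range(degree + 1)] for x in range(m)]

def polynomial_lut(coeffs, m):
    """Table of P(x) mod m; coeffs run from the constant term up: (e, d, c, b, a)."""
    powers = power_table(m, len(coeffs) - 1)
    return [
        sum((c % m) * powers[x][k] for k, c in enumerate(coeffs)) % m
        for x in range(m)
    ]

# Hypothetical coefficients: P(x) = 1*x^4 + 2*x^3 + 3*x^2 + 4*x + 1
e, d, c, b, a = 1, 4, 3, 2, 1
table = polynomial_lut((e, d, c, b, a), 5)

print(table)  # [1, 1, 3, 0, 4]
```

With the table in hand, evaluating the polynomial for a residue x is a single lookup, `table[x]`, which matches the idea of building fixed connections from each input to its output.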
Max pool Function Unit
➢The sign is implicit in RNS, so sign detection is not straightforward
➢Instead, we convert the number from RNS to the mixed-radix number system (MRS) [2]
➢In the MRS representation, the coefficient associated with the even modulus 2 (a4) indicates whether the number is negative or non-negative
➢The conversion is serial but can be pipelined
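The sign test via mixed-radix conversion can be sketched as follows. The sketch assumes the moduli are ordered (3, 5, 7, 2) so that the digit for the even modulus 2 is the most significant one, a4; the slides do not state the ordering, and all function names are hypothetical. Max pooling is modeled as pairwise comparison using the sign of a difference, assuming every difference stays inside the signed range.

```python
# Assumption: the even modulus 2 comes last, so its digit a4 is most significant.
MODULI = (3, 5, 7, 2)

def to_rns(x, moduli=MODULI):
    return [x % m for m in moduli]

def rns_sub(a, b, moduli=MODULI):
    return [(ai - bi) % m for ai, bi, m in zip(a, b, moduli)]

def rns_to_mrs(residues, moduli=MODULI):
    """Mixed-radix digits a1..a4 with X = a1 + a2*3 + a3*3*5 + a4*3*5*7."""
    r = list(residues)
    digits = []
    for i, m_i in enumerate(moduli):
        a_i = r[i]                      # digit i is fixed once earlier digits are removed
        digits.append(a_i)
        for j in range(i + 1, len(moduli)):
            m_j = moduli[j]
            # Subtract the digit and divide by m_i (multiply by its inverse mod m_j).
            r[j] = ((r[j] - a_i) * pow(m_i, -1, m_j)) % m_j
    return digits

def is_negative(residues, moduli=MODULI):
    """a4 = 1 means X >= 105, the half of the range that encodes negatives."""
    return rns_to_mrs(residues, moduli)[-1] == 1

def rns_max(window, moduli=MODULI):
    """Max pooling by pairwise comparison on the sign of (best - candidate)."""
    best = window[0]
    for candidate in window[1:]:
        if is_negative(rns_sub(best, candidate, moduli), moduli):
            best = candidate
    return best

print(rns_to_mrs(to_rns(20)))    # [2, 1, 1, 0]: 2 + 1*3 + 1*15 + 0*105 = 20
print(rns_to_mrs(to_rns(-20)))   # [1, 3, 5, 1]: 1 + 9 + 75 + 105 = 190 = 210 - 20
print(rns_max([to_rns(v) for v in (3, -7, 12, 5)]) == to_rns(12))  # True
```

The outer loop in `rns_to_mrs` is the serial part: each digit depends on the previous ones. Successive conversions could still overlap stage by stage, which is the sense in which the unit can be pipelined.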
Performance Evaluation
Experiment Setup
➢Electrical memory components
  • CACTI 7.0 [3]
➢Optical switch [4]
  • Lumerical FDTD
➢Optical circuit
  • Lumerical Interconnect
➢Lasers/MRRs/PDs
  • Data from other work ([5], [6], and [7], respectively)
➢HyperTransport serial link
  • Data from [8]
➢System-level design
  • Our own simulator
Configurations of Selected Benchmarks
Design Space Exploration
➢Swept parameters
  • WDM size
  • # of tiles in a chip
  • # of MVMs in a tile
➢Computation capability
  • # of operations / (time*area*power)
Hardware Specification
Speed & Power Analysis
➢Real benchmarks
➢Using more chips is faster, but the speedup does not scale proportionally
➢Consumes more power
➢Due to communication
➢19 times faster than a GPU (Nvidia Tesla V100) for VGG-4 with the same power budget
Conclusion
➢Proposed DNNARA, a deep neural network accelerator that uses the residue number system
➢DNNARA is a hybrid electro-optical design
➢Proposed a system-level CNN accelerator chip based on nanophotonics
➢Built a system-level simulator for experimental estimation
➢Reaches up to 12.6 GOPS/(second·mm²·watt)
➢Runs 19 times faster than a state-of-the-art GPU (Nvidia Tesla V100) for VGG-4 with the same power budget
References
➢ [1] Jiaxin Peng, Yousra Alkabani, Shuai Sun, Volker Sorger, and Tarek El-Ghazawi. 2019. Integrated Photonics Architectures for Residue Number System Computations. In IEEE International Conference on Rebooting Computing (ICRC 2019). 129–137.
➢ [2] Nicholas S Szabo and Richard I Tanaka. 1967. Residue arithmetic and its applications to computer technology. McGraw-Hill.
➢ [3] Naveen Muralimanohar, Rajeev Balasubramonian, and Norm Jouppi. 2007. Optimizing NUCA organizations and wiring alternatives for large caches with CACTI 6.0. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 3–14.
➢ [4] Shuai Sun, Vikram K Narayana, Ibrahim Sarpkaya, Joseph Crandall, Richard A Soref, Hamed Dalir, Tarek El-Ghazawi, and Volker J Sorger. 2017. Hybrid photonic-plasmonic nonblocking broadband 5×5 router for optical networks. IEEE Photonics Journal 10, 2 (2017), 1–12.
➢ [5] Rupert F Oulton, Volker J Sorger, Thomas Zentgraf, Ren-Min Ma, Christopher Gladden, Lun Dai, Guy Bartal, and Xiang Zhang. 2009. Plasmon lasers at deep subwavelength scale. Nature 461, 7264 (2009), 629.
➢ [6] Erman Timurdogan, Cheryl M Sorace-Agaskar, Jie Sun, Ehsan Shah Hosseini, Aleksandr Biberman, and Michael R Watts. 2014. An ultralow power athermal silicon modulator. Nature Communications 5 (2014), 4008.
➢ [7] Yannick Salamin, Ping Ma, Benedikt Baeuerle, Alexandros Emboras, Yuriy Fedoryshyn, Wolfgang Heni, Bojun Cheng, Arne Josten, and Juerg Leuthold. 2018. 100 GHz plasmonic photodetector. ACS Photonics 5, 8 (2018), 3291–3297.
➢ [8] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, et al. 2014. DaDianNao: A machine-learning supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 609–622.
Thank you!