RANC as Neural Network Accelerator

DNNARA: A Deep Neural Network Accelerator using Residue Arithmetic and Integrated Photonics

Jiaxin Peng, Yousra Alkabani, Shuai Sun, Volker Sorger, and Tarek El-Ghazawi

49th International Conference on Parallel Processing – ICPP

August 2020

Outline

➢Introduction

➢Background

➢Integrated Photonic Residue Arithmetic Computing Engine for Convolutional Neural Network

➢Performance Evaluation

➢Conclusion


Introduction


➢Some NN applications require real-time analysis for inference

➢Computation-intensive: inference includes billions of multiply-accumulate (MAC) operations

➢We propose DNNARA: a deep neural network accelerator using residue arithmetic based on integrated photonics

➢All computations in the neural network are done in the residue number system (RNS), avoiding extra conversions between binary and RNS

Block Diagram of a DNNARA System

Introduction

➢DNNARA: RNS with wavelength-division multiplexing (WDM)

• Executes multiple matrix-vector multiplications (MVMs) at once thanks to WDM

• Speeds up MVMs because each residue digit is computed independently

• Residues are small-sized

• Increase the system parallelism – save area/hardware resources


Background

➢ Convolutional Neural Network

➢ Residue Number System


Background – Convolutional Neural Network

➢Widely applied in classification

• Image recognition

➢Includes several layers/functions

• Convolutional layers

• Activation functions – add non-linearity

  • ReLU (Rectified Linear Unit)

  • Sigmoid function / Hyperbolic tangent function

• Pooling layers – downsample the output

  • Max pooling

  • Average pooling

• Fully-connected layers

➢Contains up to billions of multiply-accumulate (MAC) operations


Background - Residue Number System (RNS)

➢Each integer X is represented by its residues, the remainders obtained by dividing X by each modulus M_i

• Example: moduli are M_1 = 2, M_2 = 3, M_3 = 5, M_4 = 7

• X = 20 is represented as X = {0, 2, 0, 6}[2, 3, 5, 7]

• Range of numbers that can be represented: 0 to (M − 1), where M = M_1 * M_2 * M_3 * M_4 (here M = 210, so 0 to 209)

• Moduli should be pairwise relatively prime

➢Negative number notation: similar to 2's complement

• r = |m − |−X|_m|_m (where X is negative)

• Example: −20 = {|2−0|_2, |3−2|_3, |5−0|_5, |7−6|_7}[2, 3, 5, 7] = {0, 1, 0, 1}[2, 3, 5, 7]

• Range of numbers that can be represented: [−(M−1)/2, (M−1)/2] if M is odd, or [−M/2, M/2−1] if M is even

➢Residue arithmetic: operations are carried out on the residues

• Example: X = 20 = {0, 2, 0, 6}[2, 3, 5, 7] and Y = 5 = {1, 2, 0, 5}[2, 3, 5, 7]

• X+Y = {|0+1|_2, |2+2|_3, |0+0|_5, |6+5|_7}[2, 3, 5, 7] = {1, 1, 0, 4}[2, 3, 5, 7]

• X*Y = {|0*1|_2, |2*2|_3, |0*0|_5, |6*5|_7}[2, 3, 5, 7] = {0, 1, 0, 2}[2, 3, 5, 7]

• Residue arithmetic is carried out as modulo additions and multiplications on the residues

• Residue arithmetic is carried out on each residue in parallel
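The RNS examples on this slide can be checked with a short Python sketch. It is only an illustration in software, not the photonic implementation; the helper names (`to_rns`, `rns_add`, `rns_mul`, `from_rns`) are made up for this example, and the decoding step uses the standard Chinese Remainder Theorem, which the slide does not spell out.

```python
from math import prod

MODULI = (2, 3, 5, 7)   # pairwise coprime moduli from the slide example
M = prod(MODULI)        # dynamic range: 210


def to_rns(x, moduli=MODULI):
    """Residues of x. Python's % already maps negative x into [0, m)."""
    return tuple(x % m for m in moduli)


def rns_add(a, b, moduli=MODULI):
    """Digit-wise modular addition; no carries cross between digits."""
    return tuple((p + q) % m for p, q, m in zip(a, b, moduli))


def rns_mul(a, b, moduli=MODULI):
    """Digit-wise modular multiplication."""
    return tuple((p * q) % m for p, q, m in zip(a, b, moduli))


def from_rns(residues, moduli=MODULI, signed=False):
    """Decode with the Chinese Remainder Theorem (needs Python 3.8+)."""
    total = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        partial = total // m
        x += r * partial * pow(partial, -1, m)
    x %= total
    # In signed mode the upper half of the range encodes negative values.
    if signed and x >= (total + 1) // 2:
        x -= total
    return x


x, y = to_rns(20), to_rns(5)
print(x, y)                                  # residues of 20 and 5
print(rns_add(x, y), rns_mul(x, y))          # residues of 25 and 100
print(from_rns(to_rns(-20), signed=True))    # decodes back to -20
```

Each digit is processed without reference to the others, which is the property the accelerator exploits for parallelism.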


Integrated Photonic Residue Arithmetic Computing Engine for Neural Network

➢ Overview

➢ Residue Adders and Multipliers

➢ Residue Matrix-Vector Multiplication Unit

➢ Sigmoid Unit

➢ Max Pooling Unit


• R-MVM: Residue Matrix-Vector Multiplication

• R-Multiplier: Residue Multiplier

• R-Adder: Residue Adder

• MRR: Micro-Ring Resonator

• PD: Photo-Detector

• LUT: Look-up Table

• RNS2Bin: RNS to Binary

• Bin2RNS: Binary to RNS

• T: tile

Overview Architecture


Integrated Photonic Residue Adder and Multiplier

➢Basic block

• An electro-optical 2×2 switch

• Light either propagates straight through (“bar” state, (a)) or crosses over (“cross” state, (b))

➢Residue adder [1] – one-hot encoding

• Can be viewed as a mapping (injection)

• Arbitrary Size Benes (AS-Benes) network ((c): even number, (d): odd number)

• Switch states are precomputed and stored in a look-up table (LUT)

➢An AS-Benes modulo-5 adder (e)

• Example with |3+4|_5 = 2

➢A Modulo-N Residue Multiplier Implementation (f)

➢WDM capable
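The idea of a one-hot residue adder as a port-to-port mapping can be sketched in Python. This models only the end-to-end routing (input port a goes to output port (a + b) mod N); it does not model the individual 2×2 switch states of an AS-Benes network, and the names `one_hot`, `build_adder_lut`, and `route` are hypothetical.

```python
def one_hot(value, modulus):
    """One-hot encode a residue: light enters exactly one input port."""
    return [1 if i == value else 0 for i in range(modulus)]


def build_adder_lut(modulus):
    """For each addend b, precompute the routing a -> (a + b) mod modulus.

    This stands in for the LUT of precomputed switch states.
    """
    return {b: [(a + b) % modulus for a in range(modulus)]
            for b in range(modulus)}


def route(signal, routing):
    """Send each lit input port to the output port the routing assigns."""
    out = [0] * len(signal)
    for in_port, level in enumerate(signal):
        if level:
            out[routing[in_port]] = level
    return out


lut = build_adder_lut(5)
result = route(one_hot(3, 5), lut[4])   # modulo-5 example: |3+4|_5
print(result.index(1))                  # port index of the detected light
```

For a fixed addend the routing is a permutation of the ports, so exactly one output port lights up for each one-hot input.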


Residue MVM (R-MVM) Computing Block

➢Schematic of designed R-MVM (b)

➢Wavelength-Division Multiplexing (WDM) Capable

➢Lasers, MRRs, PDs, LUTs, Registers, as well as photonic and electrical connections are needed

➢sel to choose either the partial sum or bias

➢Example: 5x5 input feature and a 2x2 kernel
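A software sketch of the 5x5 input / 2x2 kernel example, computed as a residue matrix-vector multiplication per modulus, might look as follows. The moduli are borrowed from the background example, the stride-1 and no-padding choices are assumptions, and the function names are invented for illustration; the slide's schematic is not reproduced here.

```python
from math import prod

MODULI = (2, 3, 5, 7)   # assumption: same moduli as the background example


def patches(image, k):
    """Flatten every k x k window (stride 1, no padding) into one row."""
    n = len(image)
    return [[image[r + i][c + j] for i in range(k) for j in range(k)]
            for r in range(n - k + 1)
            for c in range(n - k + 1)]


def residue_mvm(matrix, vector, m):
    """Matrix-vector product with every multiply and add reduced modulo m."""
    out = []
    for row in matrix:
        acc = 0
        for x, w in zip(row, vector):
            acc = (acc + (x % m) * (w % m)) % m
        out.append(acc)
    return out


def crt(residues, moduli):
    """Decode one RNS value with the Chinese Remainder Theorem."""
    total = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        p = total // m
        x += r * p * pow(p, -1, m)
    return x % total


def conv2d_rns(image, kernel, moduli=MODULI):
    """Convolution as one residue MVM per modulus, decoded at the end."""
    rows = patches(image, len(kernel))
    weights = [w for kernel_row in kernel for w in kernel_row]
    per_modulus = [residue_mvm(rows, weights, m) for m in moduli]
    return [crt(digits, moduli) for digits in zip(*per_modulus)]


# Small illustrative values so every output stays below M = 210.
image = [[(r + c) % 5 for c in range(5)] for r in range(5)]
kernel = [[1, 2], [3, 4]]
print(conv2d_rns(image, kernel))   # 16 outputs (a 4x4 feature map, flattened)
```

The per-modulus MVMs never exchange data, so in hardware they can run side by side; the final decode is included here only to compare against an ordinary integer convolution.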


Pipeline of a MAC operation

• Cycle 1:

  • Input features (x) are encoded as light with different wavelengths

  • Weights (w) are encoded as the selection lines, which load the switch states from the LUT

• Cycle 2:

  • Set up the switch states accordingly

  • Inject light and detect light – multiply

  • MRRs & PDs act as filters to extract the results of all the multiplications


• Cycle 3:

  • Results from the previous cycle (w*x) are decoded as the selection lines that load the switch states for the adders

  • According to sel, either the partial sum or the bias is encoded as the light

• Cycle 4:

  • Set up the adders

  • Inject light and detect light – add

• Cycle 5: Write back to the register
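The five pipeline cycles can be mirrored in a small Python model for a single residue channel. This is a behavioural sketch under stated assumptions: the modulus 5 is arbitrary, the two dictionaries stand in for the switch-state LUTs, and the polarity of `sel` (0 selects the bias, 1 selects the partial sum) is a guess, since the slides do not specify it.

```python
MODULUS = 5   # hypothetical single residue channel

# Stand-ins for the precomputed switch states: each operand selects a full
# input -> output routing of the multiplier or the adder.
MUL_LUT = {w: [(x * w) % MODULUS for x in range(MODULUS)]
           for w in range(MODULUS)}
ADD_LUT = {p: [(s + p) % MODULUS for s in range(MODULUS)]
           for p in range(MODULUS)}


def mac(x, w, partial_sum, bias, sel):
    """One MAC, stage by stage. Assumed: sel=0 adds bias, sel=1 partial sum."""
    # Cycle 1: x becomes the light input; w selects the multiplier routing.
    mul_routing = MUL_LUT[w % MODULUS]
    # Cycle 2: configure the switches, inject light, detect the product.
    product = mul_routing[x % MODULUS]
    # Cycle 3: the product selects the adder routing; sel picks the light input.
    add_routing = ADD_LUT[product]
    addend = partial_sum if sel else bias
    # Cycle 4: configure the adder, inject light, detect the sum.
    total = add_routing[addend % MODULUS]
    # Cycle 5: write back to the register (returned to the caller here).
    return total


def dot(xs, ws, bias):
    """Accumulate a dot product: the first MAC adds the bias, later ones the sum."""
    acc = 0
    for i, (x, w) in enumerate(zip(xs, ws)):
        acc = mac(x, w, acc, bias, sel=0 if i == 0 else 1)
    return acc


print(dot([1, 2, 3], [4, 0, 2], bias=3))   # equals (3 + 1*4 + 2*0 + 3*2) mod 5
```

The model runs the stages back to back; in a pipelined design a new MAC could enter cycle 1 while an earlier one is still in a later cycle.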


Sigmoid Function Unit - Polynomial

➢In the residue domain, the sigmoid function is hard to compute directly

➢Instead, it is treated as a polynomial, since the sigmoid function can be represented as a Taylor series

➢The terms that include x need to be pre-calculated, and the connections built accordingly

➢Example: P(x) = ax^4 + bx^3 + cx^2 + dx + e in a modulo-5 system
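Because a modulo-5 digit takes only five values, P(x) can be tabulated once for every residue. The sketch below builds such a table with Horner's rule. The coefficients are arbitrary integers chosen for illustration, not actual Taylor coefficients of the sigmoid (those are fractional and would need a scaling scheme the slides do not describe), and `poly_lut` is a hypothetical name.

```python
MODULUS = 5


def poly_lut(coeffs, modulus=MODULUS):
    """Tabulate P(x) mod modulus for every residue x using Horner's rule.

    coeffs are ordered from the highest power down: (a, b, c, d, e).
    """
    table = []
    for x in range(modulus):
        acc = 0
        for c in coeffs:
            acc = (acc * x + c) % modulus
        table.append(acc)
    return table


# Hypothetical coefficients: P(x) = x^4 + 2x^3 + 3x^2 + 4x + 1
table = poly_lut((1, 2, 3, 4, 1))
print(table)   # entry i is P(i) mod 5
```

Once the table exists, evaluating the polynomial for a one-hot input is a fixed mapping from input position to output position, which is one way to read "build the connection accordingly".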


Max pool Function Unit

➢The sign of a number is implicit in RNS, so it cannot be detected directly

➢Instead, we convert the number from RNS to MRS (mixed-radix number system) [2]

➢In the MRS, the coefficient associated with the even modulus 2 (a_4) tells whether the number is negative or non-negative

➢The conversion is serial but can be pipelined
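The sign test through mixed-radix conversion can be sketched as follows. The moduli set {2, 3, 5, 7} is taken from the background example, but the ordering (7, 5, 3, 2), which places modulus 2 last so that its digit is a_4, is an assumption made for this illustration, as are the function names.

```python
MODULI = (7, 5, 3, 2)   # assumed ordering: modulus 2 last, so a_4 is its digit


def to_rns(x, moduli=MODULI):
    """Residues of x; negatives wrap into [0, m) as in the signed notation."""
    return [x % m for m in moduli]


def rns_to_mrs(residues, moduli=MODULI):
    """Mixed-radix digits a_1..a_n with X = a_1 + a_2*m_1 + a_3*m_1*m_2 + ..."""
    work = list(residues)
    digits = []
    for i, m_i in enumerate(moduli):
        a_i = work[i] % m_i
        digits.append(a_i)
        # Subtract the digit and divide by m_i in every remaining channel.
        # Each step depends on the previous digit, hence the serial nature.
        for j in range(i + 1, len(moduli)):
            m_j = moduli[j]
            work[j] = ((work[j] - a_i) * pow(m_i, -1, m_j)) % m_j
    return digits


def is_negative(residues, moduli=MODULI):
    """With modulus 2 last, a_4 = 1 means X lies in the upper (negative) half."""
    return rns_to_mrs(residues, moduli)[-1] == 1


print(rns_to_mrs(to_rns(-20)), is_negative(to_rns(-20)))
print(rns_to_mrs(to_rns(20)), is_negative(to_rns(20)))
```

With this ordering the weight of a_4 is 7 * 5 * 3 = 105 = M/2, so a_4 splits the range [0, 209] exactly at the boundary between non-negative and negative values. A max comparison can then be built on the sign of a difference, provided that difference stays inside the representable range.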


Performance Evaluation


Experiment Setup

➢Electrical memory components

• CACTI 7.0 [3]

➢Optical switch [4]

• Lumerical FDTD

➢Optical circuit

• Lumerical INTERCONNECT

➢Lasers/MRRs/PDs

• Data from other work ([5], [6], and [7], respectively)

➢HyperTransport serial link

• Data from [8]

➢System-level design

• Our own simulator


Configurations of Selected Benchmarks

Design Space Exploration

➢Swept parameters

• WDM size

• # of tiles in a chip

• # of MVMs in a tile

➢Computation capability

• # of operations / (time * area * power)


Hardware Specification


Speed & Power Analysis

➢Real benchmarks

➢More chips give faster execution, but the speedup does not scale proportionally

➢Consumes more power

➢Due to communication

➢19 times faster than a GPU (Nvidia Tesla V100) for VGG-4 under the same power budget


Conclusion

➢Proposed DNNARA, a deep neural network accelerator that uses the residue number system

➢DNNARA is a hybrid electro-optical design

➢Proposed a system-level CNN accelerator chip based on nanophotonics

➢Built a system-level simulator for experimental estimation

➢Reaches up to 12.6 GOPS/(second·mm^2·watt)

➢Runs 19 times faster than a state-of-the-art GPU (Nvidia Tesla V100) for VGG-4 under the same power budget


References

➢ [1] Jiaxin Peng, Yousra Alkabani, Shuai Sun, Volker Sorger, and Tarek El-Ghazawi. 2019. Integrated Photonics Architectures for Residue Number System Computations. In IEEE International Conference on Rebooting Computing (ICRC 2019). 129–137.

➢ [2] Nicholas S Szabo and Richard I Tanaka. 1967. Residue arithmetic and its applications to computer technology. McGraw-Hill.

➢ [3] Naveen Muralimanohar, Rajeev Balasubramonian, and Norm Jouppi. 2007. Optimizing NUCA organizations and wiring alternatives for large caches with CACTI 6.0. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 3–14.

➢ [4] Shuai Sun, Vikram K Narayana, Ibrahim Sarpkaya, Joseph Crandall, Richard A Soref, Hamed Dalir, Tarek El-Ghazawi, and Volker J Sorger. 2017. Hybrid photonic-plasmonic nonblocking broadband 5×5 router for optical networks. IEEE Photonics Journal 10, 2 (2017), 1–12.

➢ [5] Rupert F Oulton, Volker J Sorger, Thomas Zentgraf, Ren-Min Ma, Christopher Gladden, Lun Dai, Guy Bartal, and Xiang Zhang. 2009. Plasmon lasers at deep subwavelength scale. Nature 461, 7264 (2009), 629.

➢ [6] Erman Timurdogan, Cheryl M Sorace-Agaskar, Jie Sun, Ehsan Shah Hosseini, Aleksandr Biberman, and Michael R Watts. 2014. An ultralow power athermal silicon modulator. Nature Communications 5 (2014), 4008.

➢ [7] Yannick Salamin, Ping Ma, Benedikt Baeuerle, Alexandros Emboras, Yuriy Fedoryshyn, Wolfgang Heni, Bojun Cheng, Arne Josten, and Juerg Leuthold. 2018. 100 GHz plasmonic photodetector. ACS Photonics 5, 8 (2018), 3291–3297.

➢ [8] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, et al. 2014. Dadiannao: A machine-learning supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 609–622.


Thank you!
