Using FPGAs to Accelerate Neural Network Inference
1st FPL Workshop on Reconfigurable Computing for Deep Learning (RC4DL), 8 September 2017, Ghent, Belgium
Associate Professor Magnus Jahre, Department of Computer Science, Norwegian University of Science and Technology
Many of the contributions discussed in this presentation have been developed in close collaboration with Xilinx Research Labs.
Outline
Part 1: CNN acceleration on FPGAs should exploit customization
Part 2: CNN customization possibilities
• Quantization
• Accelerator architecture
• FPGA-friendly network transformations
• Memory system challenges
Part 3: Conclusion
Properties of Acceleration-friendly Workloads
• Lots of compute and parallelism
• High arithmetic intensity
• Few or predictable branches
• Regular memory accesses
Roofline model of a platform: attainable performance (operations/second) plotted against arithmetic intensity (operations/byte). Below the ridge point an application is bandwidth-bound; beyond it, compute-bound. Application A and Application B mark workloads on either side of the ridge.
Williams et al., "Roofline: An Insightful Visual Performance Model for Multicore Architectures", CACM, 2009
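The roofline bound itself is a one-liner: attainable performance is the minimum of peak compute and peak bandwidth times arithmetic intensity. A minimal sketch in C, with made-up platform numbers purely for illustration:

```c
#include <stdio.h>

/* Roofline model: attainable performance is the lesser of the
 * platform's peak compute rate and what the memory system can feed
 * at the application's arithmetic intensity. */
double roofline(double peak_ops, double peak_bw, double ai)
{
    double mem_bound = peak_bw * ai;   /* ops/s sustainable from memory */
    return mem_bound < peak_ops ? mem_bound : peak_ops;
}

int main(void)
{
    double peak_ops = 1.0e12;  /* 1 TOPS peak compute (illustrative) */
    double peak_bw  = 10.0e9;  /* 10 GB/s DRAM bandwidth (illustrative) */

    printf("A: %.3g ops/s\n", roofline(peak_ops, peak_bw, 5.0));   /* memory-bound */
    printf("B: %.3g ops/s\n", roofline(peak_ops, peak_bw, 500.0)); /* compute-bound */
    return 0;
}
```

With these numbers, application A at 5 ops/byte is capped at 50 Gops/s by bandwidth, while application B at 500 ops/byte reaches the 1 TOPS compute roof.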
Convolutional Neural Networks (CNNs)
Heavy computation: e.g. AlexNet, mapping an input image through a stack of convolutional and fully-connected layers to a label such as "Cat".
Krizhevsky et al., "ImageNet Classification with Deep Convolutional Neural Networks", NIPS, 2012
CONV-Layer as Matrix-Matrix Multiplication
A convolutional layer convolves a set of filters over the input data.
Reorganize the data to create one matrix-matrix multiplication or a number of matrix-vector multiplications.
Potential problem: if the filters overlap, the input data is duplicated (in theory).
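This reorganization is commonly called im2col lowering. A minimal sketch in C (unit stride, no padding, and a row-major layout, all assumptions for illustration) that copies each filter window into one column of the lowered matrix, duplicating the pixels that overlapping windows share:

```c
/* im2col: lower a C x H x W input so that convolution with F filters
 * of size K x K becomes the GEMM
 *   out (F x OH*OW) = weights (F x C*K*K) * col (C*K*K x OH*OW).
 * Overlapping filter windows copy the same input pixel several times. */
void im2col(const float *in, float *col, int C, int H, int W, int K)
{
    int OH = H - K + 1, OW = W - K + 1;  /* output size: unit stride, no padding */
    for (int c = 0; c < C; c++)
        for (int kh = 0; kh < K; kh++)
            for (int kw = 0; kw < K; kw++) {
                int row = (c * K + kh) * K + kw;   /* row in lowered matrix */
                for (int oh = 0; oh < OH; oh++)
                    for (int ow = 0; ow < OW; ow++)
                        col[row * OH * OW + oh * OW + ow] =
                            in[(c * H + oh + kh) * W + ow + kw];
            }
}
```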
Choice of Matrix Multiplication Algorithm
The acceleration-friendly properties (lots of compute and parallelism, high arithmetic intensity, few or predictable branches, regular memory accesses) hold for dense matrix multiplication; sparse matrix multiplication can have them.
Choice of algorithm is a trade-off that depends on platform and input characteristics.
CNN Inference Platform Alternatives
Qualitative scores (++, +, 0, ?) for CPU, GPU, ASIC and FPGA across five metrics: high performance, sufficient performance, high energy efficiency, short development time and low cost. Each metric has a different "winner" (CPU, CPU/GPU, ASIC, GPU or FPGA).
Overall: no platform stands out as a clear winner across all metrics.
Differentiator: FPGAs can customize the architecture to fit the current problem.
CNN Customization Potential
Exploit the characteristics of neural networks:
• Many different CNNs can provide similar accuracy
• Choose one that matches the strengths of the target platform
• Retraining may be necessary
• Potential synergies between CNN algorithm development and acceleration potential
Dimensions of customization:
• Accelerator architecture (generic vs. network-specific)
• Data type (fixed point vs. floating point)
• Data precision (binary, 8-bit, 64-bit, etc.)
• FPGA-friendly network transformations
Redundancy and Quantization
Evidence of redundancy in trained networks:
• Sparsification, low-rank approximations, fault tolerance, …
Reduced precision (quantization):
• Restrict weights and/or activations to Q-bit values
• HW benefits: low-bitwidth datapaths, regular compute
Sung et al.: quantization works well when…
• …the network is "big enough"
• …the network is aware of quantization during training
"(…) the performance gap between the floating-point and the retrain-based ternary (+1, 0, -1) weight neural networks (…) almost vanishes in fully complex networks (…)" (Sung et al., "Resiliency of Deep Neural Networks Under Quantization")
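To make "restrict weights to Q-bit values" concrete, here is a minimal sketch of one common scheme, symmetric uniform quantization (an assumption; the papers above each use their own variants). With Q = 2 the representable codes collapse to exactly the ternary (+1, 0, -1) weights quoted from Sung et al.:

```c
#include <math.h>

/* Symmetric uniform quantization of a weight to a signed Q-bit code.
 * scale maps the float range onto [-(2^(Q-1)-1), +(2^(Q-1)-1)];
 * dequantization is simply q * scale. */
int quantize(float w, float scale, int Q)
{
    int qmax = (1 << (Q - 1)) - 1;   /* 127 for Q = 8; 1 for Q = 2 */
    int q = (int)lroundf(w / scale);
    if (q >  qmax) q =  qmax;        /* saturate out-of-range weights */
    if (q < -qmax) q = -qmax;
    return q;
}
```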
Binarized Neural Networks (BNNs): the extreme case of quantization
• Permit only two values: +1 and -1
• Binary weights, binary activations
By Courbariaux and Hubara et al. (NIPS 2016):
• Open-source training flow, based on Theano and Lasagne
• Layers: convolutional, fully-connected, batchnorm and maxpool
Close to state-of-the-art accuracy on smaller image classification benchmarks:
• And getting steadily better at the bigger benchmarks.
| Quantization | MNIST | SVHN | CIFAR-10 |
|---|---|---|---|
| Binary weights, binary activations | 0.96% | 2.53% | 10.15% |
| Binary weights, FP activations | 1.29% | 2.30% | 9.90% |
| FP weights, FP activations | 0.94% | 1.69% | 7.62% |
| BNN accuracy loss | -0.02% | -0.84% | -2.53% |

% classification error (lower is better)
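The hardware appeal is that a binarized multiply-accumulate needs no multiplier at all: encoding +1/-1 as bit 1/0, each product becomes an XNOR and the accumulation becomes a popcount. A minimal sketch (using the GCC/Clang popcount builtin):

```c
#include <stdint.h>

/* Dot product of two 64-element {+1,-1} vectors packed one bit per
 * element (bit 1 encodes +1, bit 0 encodes -1). XNOR gives +1 where
 * the bits match and -1 where they differ, so the sum equals
 * matches - mismatches = 2*popcount(~(w^a)) - 64. */
int bnn_dot64(uint64_t w, uint64_t a)
{
    return 2 * __builtin_popcountll(~(w ^ a)) - 64;
}
```

On an FPGA the same computation maps to a row of LUTs feeding an adder tree, which is where the peak-performance jump on the next slide comes from.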
BNN Performance Potential on FPGAs
• Fewer LUTs/op means higher peak performance: 1 TOPS vs. 66 TOPS.
• Stay on-chip to achieve more of the peak: 0.1 TOPS vs. 40 TOPS.
FINN's Heterogeneous Streaming Architecture
The BNN topology is mapped to a pipeline on the FPGA: image → Layer 0 → Layer 1 → … → Layer N → result.
One hardware layer per BNN layer, sized to the layer's compute: e.g. a 1 Mops layer gets 1x PE, a 10 Mops layer 10x PE.
Heterogeneous: avoid "one-size-fits-all" penalties (computation is not uniform across layers).
Streaming: maximize throughput and minimize latency (overlap communication and computation).
Umuroglu et al., "FINN: A Framework for Fast, Scalable Binarized Neural Network Inference", FPGA, 2017
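The 1x/10x PE counts come from rate-balancing the pipeline: every stage should take about the same time per image, so PEs are allocated in proportion to each layer's operation count. A rough sizing sketch (clock rate, ops/PE/cycle and layer op counts are all hypothetical, and this is not FINN's actual folding tool):

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    double ops[] = { 1e6, 10e6 };   /* ops per image per layer (hypothetical) */
    double fclk = 200e6;            /* accelerator clock, Hz (assumed) */
    double fps = 10000;             /* target images per second (assumed) */
    double ops_per_pe = 64;         /* ops one PE delivers per cycle (assumed) */

    for (int i = 0; i < 2; i++) {
        /* A layer must sustain ops[i]*fps ops/s; one PE delivers
         * ops_per_pe*fclk ops/s, so the PE count is their ratio. */
        int pes = (int)ceil(ops[i] * fps / (ops_per_pe * fclk));
        printf("layer %d: %d PE(s)\n", i, pes);
    }
    return 0;
}
```

With these numbers the required PE counts scale 1:10 with the layers' op counts (rounding up to 1 and 8 PEs here), so no stage bottlenecks the stream.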
Transformation Example: Streamlining
State-of-the-art quantized neural network methods still employ some floating-point computations to improve accuracy:
• Batch normalization
• Alpha-scaling
• Non-integer quantization levels
Streamlining avoids these floating-point computations by (see the worked example below):
• Viewing quantization as successive thresholding
• Moving and collapsing linear transformations
• Absorbing linear operations into thresholds
Fewer layers reduce FPGA resource consumption in network-specific architectures: floating-point layers become integer layers.
Umuroglu and Jahre, "Streamlined Deployment for Quantized Neural Networks", to appear at HENND, 2017
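As a sketch of the "absorb linear operations into thresholds" step (the algebra follows from the definitions; the function names are illustrative): a batch-normalized sign activation y = sign(gamma*(acc - mu)/sigma + beta) fires exactly when the integer accumulator acc crosses a threshold that can be precomputed offline, so no floating-point math survives into inference:

```c
#include <math.h>

/* Fold batchnorm + sign activation into a single integer threshold.
 * For gamma > 0 (and sigma > 0):
 *   gamma*(acc - mu)/sigma + beta >= 0  <=>  acc >= mu - beta*sigma/gamma
 * Since acc is an integer accumulator, the right-hand side can be
 * rounded up once, offline; flip the comparison if gamma < 0. */
long fold_threshold(float gamma, float beta, float mu, float sigma)
{
    return (long)ceilf(mu - beta * sigma / gamma);
}

/* At inference time the whole floating-point layer is one compare. */
int sign_activation(long acc, long threshold)
{
    return acc >= threshold ? +1 : -1;
}
```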
Off-Chip Weight Storage
Large neural networks force weights to be stored off-chip:
• Increases bandwidth needs
• Need to exploit Memory Level Parallelism (MLP) to maximize bandwidth utilization
Neural network structures can be made sparse:
• Potential for reducing compute and memory requirements
• Efficiently exploiting sparsity is tricky, as the irregular accesses in the sparse matrix-vector multiply (SpMV) kernel below show:
    for (int j = 0; j < n; j++)                        // each column of A (CSC format)
        for (int i = colptr[j]; i < colptr[j+1]; i++)  // nonzeros stored in column j
            y[rowind[i]] += values[i] * x[j];          // irregular (random) access to y
Accelerator Performance Tricks
Most of the random-access vector is unused at any given time:
• Use light-weight preprocessing to determine the needed cache capacity
• Use spare on-chip memory for something else
Overlap memory accesses and computation (see the sketch below):
• Balanced system: accelerator compute capability should match memory subsystem performance
• Parallelism effectively hides the compute latency
• Exploit Memory Level Parallelism (MLP) to further improve memory bus utilization and performance
• Solution: non-blocking caches
Umuroglu and Jahre, "A Vector Caching Scheme for Streaming FPGA SpMV Accelerators", ARC, 2015
Umuroglu and Jahre, "Random Access Schemes for Efficient FPGA SpMV Acceleration", Microprocessors and Microsystems, 2016
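To make the overlap idea concrete, a minimal ping-pong (double-buffering) sketch; fetch_tile and compute_tile are hypothetical stand-ins for a DMA engine and the accelerator datapath:

```c
#define TILE 1024

/* Hypothetical helpers: fetch_tile() would issue a DMA read of tile t
 * from DRAM into dst; compute_tile() runs the accelerator on src. */
void fetch_tile(float *dst, int t);
void compute_tile(const float *src);

/* Ping-pong schedule: while tile t is being computed, tile t+1 is
 * fetched into the spare buffer. Written sequentially for clarity;
 * real overlap requires the fetch to be non-blocking (asynchronous
 * DMA, or two concurrent processes on the FPGA). */
void process(int ntiles)
{
    static float buf[2][TILE];
    fetch_tile(buf[0], 0);                        /* prefetch the first tile */
    for (int t = 0; t < ntiles; t++) {
        if (t + 1 < ntiles)
            fetch_tile(buf[(t + 1) & 1], t + 1);  /* next tile, spare buffer */
        compute_tile(buf[t & 1]);                 /* current tile */
    }
}
```

Non-blocking caches extend the same principle to irregular reads: multiple outstanding misses keep the memory bus busy (MLP) instead of stalling the datapath on each miss.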
Conclusion
Deep learning is well suited for acceleration:
• No compute platform is a clear winner across all performance metrics
• FPGAs excel when we can leverage heavily customized accelerators
• Need to identify neural networks with computation and memory patterns that suit FPGA platform characteristics
Possibilities for customization:
• Examples: data type, precision, architecture and network transformations
• Significant potential for co-design of neural network algorithms and FPGA accelerators
The Telenor-NTNU AI-Lab
• To enable both basic and applied research
• To allow a wide variety of research areas
• To maintain research at the highest international level
• To enable cross-disciplinary collaboration
Thank You!
The following people have contributed to the ideas covered in this presentation:
• Yaman Umuroglu, NTNU
• Michaela Blott, Xilinx Research Labs
• Researchers at the NTNU Computer Architecture Lab
• Researchers at Xilinx Research Labs