Accelerated ML on cloud FPGAs
What software developers/users want
Source: Databricks, Apache Spark Survey 2016, Report
A short history of computing performance
Source: John Hennessy and David Patterson, Computer Architecture: A Quantitative Approach, 6/e, 2018; Norman P. Jouppi, Cliff Young, Nishant Patil, and David Patterson, “A domain-specific architecture for deep neural networks”
[Chart: computing performance eras, from single core at high frequency to multi core]
What’s left for faster computing?
David Patterson, 2019
Computing power to train a model
In 2018, OpenAI found that the amount of computational power used to train the largest AI models had doubled every 3.4 months since 2012.
https://www.technologyreview.com/s/614700/the-computing-power-needed-to-train-ai-is-now-rising-seven-times-faster-than-ever-before/
https://openai.com/blog/ai-and-compute/#addendum
Processing requirements in DNN
Data Center traffic
Christoforos Kachris, Microlab@NTUA
Data Center Requirements
˃ Traffic requirements increase significantly in the data centers but the power budget remains the same (Source: ITRS, HiPEAC, Cisco)
FPL 2016, Christoforos Kachris, ICCS/NTUA, September 2016
[Chart: traffic growth in data centers versus power constraints, 2012-2019; traffic growth far outpaces transistor count, power per chip, and heat load per rack]
https://www.iea.org/reports/data-centres-and-data-transmission-networks
Data Science: need for high computing power
How Big are Data Centers
Data Center Site          Sq ft
Facebook (Santa Clara)    86,000
Google (South Carolina)   200,000
HP (Atlanta)              200,000
IBM (Colorado)            300,000
Microsoft (Chicago)       700,000
[Source: “How Clean is Your Cloud?”, Greenpeace 2011]
Wembley Stadium: 172,000 square ft
Google data center
Data Centers Power Consumption
• Data centers consumed 330 billion kWh in 2007 and were expected to reach 1,012 billion kWh by 2020

              2007 (billion kWh)   2020 (billion kWh)
Data Centers         330                 1012
Telecoms             293                  951
Total Cloud          623                 1963

[Source: How Clean is Your Data Center?, Greenpeace, 2012]
Data Center power consumption
Power consumption
Hardware acceleration
Hardware acceleration is the use of specialized hardware components to perform some functions faster (10x-100x) than is possible in software running on a more general-purpose CPU.
˃ Hardware acceleration can be performed either by specialized chips (ASICs), or
˃ by programmable specialized chips (FPGAs) that can be configured for specific applications
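The 10x-100x claim is easiest to feel with a software analogy: the same computation run as generic interpreted loops versus dispatched to a specialized engine (here NumPy's optimized BLAS, standing in for dedicated hardware). The sizes and the measured ratio below are illustrative, not FPGA numbers.

```python
# Software analogy for hardware specialization: the same matrix multiply
# runs far faster on a specialized engine (optimized BLAS via NumPy) than
# as generic interpreted loops.
import time
import numpy as np

def matmul_naive(a, b):
    """Generic triple-loop matrix multiply (the 'general-purpose' path)."""
    n, k = a.shape
    k2, m = b.shape
    assert k == k2
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for t in range(k):
                s += a[i, t] * b[t, j]
            out[i][j] = s
    return np.array(out)

rng = np.random.default_rng(0)
a = rng.standard_normal((100, 100))
b = rng.standard_normal((100, 100))

t0 = time.perf_counter()
slow = matmul_naive(a, b)
t_naive = time.perf_counter() - t0

t0 = time.perf_counter()
fast = a @ b  # dispatched to an optimized BLAS kernel
t_blas = time.perf_counter() - t0

assert np.allclose(slow, fast)
print(f"naive: {t_naive:.4f}s  BLAS: {t_blas:.6f}s  speedup: {t_naive / t_blas:.0f}x")
```

The exact speedup depends on the machine, but the gap between a generic path and a specialized one is the point the slide is making.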
Data Center applications
Hardware Accelerators – Why is it faster?
Switch from sequential processing to parallel processing
Hardware accelerators
• HW acceleration can significantly reduce the execution time and the energy consumption of several applications (10x-100x)
[Source: Xilinx, 2016]
FPGAs in the data centers
CPU vs GPU vs FPGA
A GPU is effective at processing the same set of operations in parallel – single instruction, multiple data (SIMD).
An FPGA is effective at processing the same or different operations in parallel – multiple instructions, multiple data (MIMD) – using specialized circuits for functions.
[Diagram: CPU (one core: control, ALUs, cache, DRAM) vs GPU (each GPU has 2,880 of these cores; DRAM) vs FPGA (each FPGA has more than 2M logic cells with block RAM; DRAM)]
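The two parallelism styles named above can be sketched in software (this is an analogy, not device code): SIMD applies one instruction across many data items, while MIMD runs different operations on different data at the same time, like dedicated circuits laid out side by side on an FPGA fabric.

```python
# Software sketch of SIMD vs MIMD parallelism styles.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

data = np.arange(8, dtype=np.float64)

# SIMD style (GPU-like): the same multiply is applied to every element.
simd_result = data * 2.0

# MIMD style (FPGA-like): independent, different operations run in parallel.
stages = [np.sum, np.max, lambda x: float(np.dot(x, x))]
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(stage, data) for stage in stages]
    mimd_results = [f.result() for f in futures]

print(simd_result, mimd_results)
```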
Specialization
One of the most sophisticated systems in the universe is based on specialization
Processing Platforms
˃ HW acceleration can significantly reduce the execution time and the energy consumption of several applications (10x-100x)
Intel Xeon + FPGAs
Xeon and FPGA in the Cloud
FPGAs for DNN
˃ The xDNN processing engine has dedicated execution paths for each type of command (download, conv, pooling, element-wise, and upload). This allows convolution commands to be run in parallel with other commands if the network graph allows it.
FPGAs for DNN – Throughput & Latency
https://www.xilinx.com/support/documentation/white_papers/wp504-accel-dnns.pdf
FPGAs for DNN – Energy efficiency
Intel FPGAs for DNN
https://software.intel.com/content/www/us/en/develop/blogs/accelerate-computer-vision-from-edge-to-cloud-with-openvino-toolkit.html
FPGAs vs GPUs in DNN
GPU vs FPGA for DNN
HW Accelerators for Cloud Computing
A Survey on Reconfigurable Accelerators for Cloud Computing, FPL 2016, Kachris
Speedup vs Energy efficiency
Copyright: Christoforos Kachris, ICCS/NTUA
Speedup per category
˃ PageRank applications achieve the highest speedup
˃ Memcached applications achieve the highest energy efficiency
[Chart: speedup and energy efficiency per category (PageRank, ML, Memcached, Databases); values shown include 10.7, 3.7, 3.8, 7.5, 1.5, 18, 7.8, 3.7]
Catapult FPGA Acceleration Card
www.vineyard-h2020.eu
FPGA as a Service
• Amazon EC2 F1’s Xilinx FPGAs
Is there a market?
• The global Data Center Accelerator market size is expected to reach $35 billion by the end of 2025 [1].
• The market for FPGAs is expected to grow at the highest rate, owing to the increasing adoption of FPGAs for the acceleration of enterprise workloads [1].
[1] https://www.marketwatch.com/press-release/at-387-cagr-data-center-accelerator-market-size-is-expected-to-exhibit-35020-million-usd-by-2025-2019-10-15
[Figure: available FPGAs (Intel)]
Heterogeneous DCs for energy efficiency
“The only way to differentiate server offerings is through accelerators, like we saw with cell phones”, Leendert Van Doorn, AMD, OpenServer Summit 2014
[Diagram: today’s DCs (servers with processors only: low performance, high power consumption, best effort) vs future heterogeneous DCs with VINEYARD infrastructure (run-time manager and orchestrator, run-time scheduler, 3rd-party HW accelerators, VINEYARD servers with dataflow-based accelerators (DFEs): higher performance, lower power consumption, predictable performance)]
VINEYARD Heterogeneous Accelerators-based Data centre
[Diagram: Big Data applications (bioinformatics, finance) run on the VINEYARD programming framework and APIs; commonly used functions/tasks (pattern matching, string matching, analytics engines, compression, encryption, other processing) are synthesized (OpenSPL, OpenCL) into a repository/library of hardware functions as IP blocks; an HW manager, scheduler, and cluster resource manager map them, under throughput, latency, and power requirements, onto server racks with commodity processors, racks with programmable dataflow engine (DFE) accelerators, and racks with MPSoC FPGAs (programmable logic)]
VINEYARD Framework
• Accelerators stored in an AppStore
• Cloud users request accelerators based on application requirements
• Decouples hardware and software designers
[Diagram: cloud computing applications from cloud tenants go to the VINEYARD cloud resource manager (accelerator API, scheduler, accelerator virtualization, accelerator controller; driven by performance/energy), which dispatches to a heterogeneous data center of processors, dataflow engines (DFEs), and FPGA accelerators; 3rd-party IP developers publish a library of hardware accelerators as IP blocks to an IP accelerator app store]
AWS options
Performance evaluation on Machine Learning
˃ Up to 15x speedup for logistic regression classification
˃ Up to 14x speedup for K-means clustering
˃ Spark-GPU* (3.8x – 5.7x)
[Charts: logistic regression execution time, MNIST 24GB, 100 iter. (secs), r5d.4x vs f1.4x (InAccel), 15x speedup; K-means clustering execution time, MNIST 24GB, 100 iter. (secs), r5d.4x vs f1.4x (InAccel), 14x speedup; bars broken into data preprocessing, data transformation, and ML training]
*[Spark-GPU: An Accelerated In-Memory Data Processing Engine on Clusters]
1st to offer ML acceleration on the cloud using FPGAs
ML training
https://inaccel.com/cpu-gpu-or-fpga-performance-evaluation-of-cloud-computing-platforms-for-machine-learning-training/
Unique FPGA orchestrator by InAccel
Seamless integration with C/C++, Python, Java and Scala
Automatic virtualization and scheduling of the applications to the FPGA cluster
Fully scalable: scale-up (multiple FPGAs per node) and scale-out (multiple FPGA-based servers over Spark)
[Diagram: applications on the InAccel Coral resource manager and InAccel Runtime (resource isolation), over FPGA drivers, on a server with FPGA kernels]
Automating deployment, scaling, and management of FPGA clusters
Current limitations for FPGA deployment
˃ Currently only one application can talk to a single FPGA accelerator through OpenCL
˃ An application can talk to only a single FPGA
˃ Complex device sharing
• from multiple threads/processes
• even from the same thread
˃ Explicit allocation of the resources (memory/compute units)
˃ Users need to specify which FPGA to use (device ID, etc.)
[Diagram: App1 talks through vendor drivers to a single FPGA]
From single instance to data centers
˃ Easy deployment
˃ Instant scaling
˃ Seamless sharing
˃ Multiple-users
˃ Multiple applications
˃ Isolation
˃ Privacy
[Diagram: InAccel FPGA Orchestrator on a Kubernetes cluster]
Universities
˃ How do you allow multiple students to share the available FPGAs?
˃ Many universities have a limited number of FPGA cards that they want to share with multiple students.
˃ The InAccel FPGA orchestrator allows multiple students to share one or more FPGAs seamlessly.
˃ It allows students to just invoke the function they want to accelerate, and the InAccel FPGA manager performs the serialization and the scheduling of the functions on the available FPGA resources.
[Diagram: Lab1, Lab2, and Lab3 sharing FPGAs through the InAccel FPGA Orchestrator]
Universities
˃ But researchers want exclusive access
˃ The InAccel orchestrator allows selecting which FPGA cards will be available to multiple students and which FPGAs can be allocated exclusively to researchers and Ph.D. students (so they can get accurate measurements for their papers).
˃ The FPGAs that are shared among multiple students operate on a best-effort basis (the InAccel manager performs the serialization of the requested accesses), while the researchers have exclusive access to their FPGAs with zero overhead.
[Diagram: Lab1 (shared), Lab2 (shared), and a Researcher (exclusive access) via the InAccel FPGA Orchestrator]
Instant scalability: distribution of multi-threaded applications to multiple clusters with a single command
[Diagram: deployment options, including oneAPI, an OS with hypervisor, and InAccel running both on the OS and inside a container runtime]
From IaaS to PaaS and SaaS for FPGAs
[Diagram: three stacks (servers with FPGAs, virtualization/sharing, operating system, middleware, runtime, applications) delivered as Infrastructure as a Service, Platform as a Service, and Software as a Service, with an FPGA orchestrator and an FPGA repository with accelerators alongside]
Seamless Integration with any framework
[Logos, including KubeSphere]
Lab Exercise
˃ In this lab you are going to create your first accelerated application
˃ Use scikit-learn to find out the speedup you get when running the Naive Bayes algorithm using the original (CPU) and the FPGA implementation.
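A minimal timing harness for the lab, assuming only stock scikit-learn. The FPGA-accelerated classifier is left as a commented placeholder because its exact package and class name here would be an assumption; swap in the implementation provided in the lab environment.

```python
# Time CPU GaussianNB training as the baseline for the lab's speedup measurement.
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=20000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def time_fit(clf):
    """Fit a classifier and return (training seconds, test accuracy)."""
    t0 = time.perf_counter()
    clf.fit(X_tr, y_tr)
    return time.perf_counter() - t0, clf.score(X_te, y_te)

cpu_time, cpu_acc = time_fit(GaussianNB())
print(f"CPU GaussianNB: {cpu_time:.4f}s, accuracy {cpu_acc:.3f}")

# fpga_time, fpga_acc = time_fit(FPGAGaussianNB())  # hypothetical class name
# print(f"speedup: {cpu_time / fpga_time:.1f}x")
```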
Conclusions
˃ Future data centers will have to sustain huge amounts of network traffic
˃ However, the power consumption will have to remain almost the same
˃ FPGA acceleration is a promising solution for machine learning, providing high throughput, low latency, and energy-efficient processing
Domain Specific Accelerators
The amount of compute used in the largest AI training runs has been increasing exponentially with a 3.4-month doubling time (by comparison, Moore’s Law had a 2-year doubling period)
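The gap between the two growth rates quoted above is easy to quantify: compute the implied multiplication in training compute over a 6-year span for a 3.4-month doubling time versus Moore's law's roughly 2-year doubling.

```python
# Compare the compounded growth implied by the two doubling times.
months = 6 * 12  # a 6-year span

ai_growth = 2 ** (months / 3.4)     # largest-AI-training-run trend
moore_growth = 2 ** (months / 24)   # Moore's-law pace (~2-year doubling)

print(f"6 years at 3.4-month doubling: ~{ai_growth:,.0f}x")
print(f"6 years at 24-month doubling:  ~{moore_growth:.0f}x")
```

A 3.4-month doubling compounds to millions of times more compute over six years, while Moore's law gives only 8x; this is why domain-specific accelerators are needed at all.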
Distributed ML
˃ CSCS: Europe’s top supercomputer (world #3), 4500+ GPU nodes, state-of-the-art interconnect
˃ Task: image classification (ResNet-152 on ImageNet)
Single-node time (TensorFlow): 19 days
1024 nodes: 25 minutes (in theory)
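The "in theory" figure above is just ideal linear scaling: divide the single-node training time evenly across the nodes, ignoring all communication cost, and you land close to the quoted 25 minutes.

```python
# Sanity check on the ideal scaling claim: 19 days split across 1024 nodes.
single_node_minutes = 19 * 24 * 60  # 19 days in minutes
nodes = 1024

ideal_minutes = single_node_minutes / nodes
print(f"ideal time on {nodes} nodes: {ideal_minutes:.1f} minutes")
```

Real distributed training falls short of this because gradient synchronization and data loading do not parallelize for free, which is exactly the topic of the following slides.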
Distributed ML
˃ Parallelism in distributed machine learning:
˃ Data parallelism trains multiple instances of the same model on different subsets of the training dataset;
˃ model parallelism distributes parallel paths of a single model to multiple nodes
A Survey on Distributed Machine Learning: https://arxiv.org/ftp/arxiv/papers/1912/1912.09789.pdf
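The data-parallel scheme described above can be sketched in a few lines: each worker computes a gradient on its own shard, and for equal-size shards and a mean-squared-error loss the averaged per-worker gradient equals the gradient on the full dataset. The model and data here are synthetic, purely for illustration.

```python
# Data parallelism sketch: averaged shard gradients == full-data gradient.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.standard_normal(1200)
w = np.zeros(5)  # current model parameters

def grad_mse(Xs, ys, w):
    """Gradient of mean squared error for a linear model on one shard."""
    return 2.0 * Xs.T @ (Xs @ w - ys) / len(ys)

# Full-dataset gradient vs the average of 4 per-worker shard gradients.
full = grad_mse(X, y, w)
shards = np.array_split(np.arange(1200), 4)
avg = np.mean([grad_mse(X[idx], y[idx], w) for idx in shards], axis=0)

assert np.allclose(full, avg)
print("averaged shard gradients match the full-data gradient")
```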
˃ Centralized systems (Figure 3a) employ a strictly hierarchical approach to aggregation, which happens in a single central location.
˃ Decentralized systems allow for intermediate aggregation, either with a replicated model that is consistently updated when the aggregate is broadcast to all nodes, as in tree topologies (Figure 3b), or with a partitioned model that is sharded over multiple parameter servers (Figure 3c).
˃ Fully distributed systems (Figure 3d) consist of a network of independent nodes that assemble the solution together, where no specific roles are assigned to particular nodes.
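The centralized vs. fully distributed contrast can be sketched as follows; the ring pass is a deliberately simplified stand-in for ring all-reduce (a real implementation also chunks and pipelines the tensor), and only shows that every node ends up holding the same aggregate with no central server.

```python
import numpy as np

def central_aggregate(updates):
    """Centralized: one parameter server sums every worker's update."""
    return np.sum(updates, axis=0)

def ring_aggregate(updates):
    """Fully distributed: each node repeatedly adds the update arriving
    from its ring neighbour; after N-1 hops all nodes hold the full sum."""
    n = len(updates)
    acc = [u.copy() for u in updates]
    for step in range(n - 1):
        for i in range(n):
            acc[i] += updates[(i - step - 1) % n]
    return acc  # every node now holds the same aggregate

updates = [np.array([k, 2 * k]) for k in range(1, 5)]  # four workers' updates
```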
Distributed ML ecosystem
Data Science and ML platforms
FPGA for ML
˃ In many applications, the neural network is trained on back-end CPU or GPU clusters.
˃ FPGAs are very well suited to latency-sensitive, real-time inference jobs:
Unmanned vehicles
Speech recognition
Audio surveillance
Multimedia
CPU vs FPGAs
http://cadlab.cs.ucla.edu/~cong/slides/HALO15_keynote.pdf
Machine Learning on FPGAs
˃ Classification
Naïve Bayes
˃ Training
Logistic regression
˃ DNN
ResNet-50
Jupyter - JupyterHub
˃ Deploy and run your FPGA-accelerated applications using Jupyter Notebooks
˃ The InAccel manager allows instant deployment of FPGAs through JupyterHub
JupyterHub on FPGAs
˃ Instant acceleration of Jupyter Notebooks with zero code changes
˃ Offload the most computationally intensive tasks to FPGA-based servers
(Figure: JupyterHub deployment stack: authentication, spawner, Kubernetes cluster)
FPGA flow
1. FPGA logic design using Xilinx Vivado on a C4 or M4 instance
2. FPGA place-and-route using Xilinx Vivado on a C4 or M4 instance
3. Generate an FPGA bitstream
4. Program the FPGA
Bitstream repository
˃ The FPGA Resource Manager is integrated with a bitstream repository that is used to store FPGA bitstreams
(Figure: applications pull FPGA bitstreams from the repository and deploy them to the FPGA cluster)
https://store.inaccel.com
Lab Exercise
˃ Introduction
˃ Creating a Bitstream Artifact
˃ Running the first FPGA accelerated application
˃ Scikit-Learn on FPGAs
˃ Naive Bayes Example
˃ Logistic Regression Example
https://edu.inaccel.com/
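The lab's Naive Bayes example offloads the scikit-learn classifier to the FPGA. As a plain-NumPy reference for the arithmetic such a kernel computes, here is a minimal Gaussian Naive Bayes sketch (not the lab's actual implementation):

```python
import numpy as np

def gnb_fit(X, y):
    """Per-class mean, variance, and prior for Gaussian Naive Bayes."""
    return {c: (X[y == c].mean(axis=0),
                X[y == c].var(axis=0) + 1e-9,   # epsilon for numerical stability
                np.mean(y == c))
            for c in np.unique(y)}

def gnb_predict(model, X):
    """Argmax over per-class log posteriors (up to a shared constant)."""
    classes = list(model)
    scores = [-0.5 * np.sum(np.log(2 * np.pi * var) + (X - mu) ** 2 / var,
                            axis=1) + np.log(prior)
              for mu, var, prior in model.values()]
    return np.array(classes)[np.argmax(scores, axis=0)]
```

The per-sample, per-class log-likelihood sums are independent, which is exactly the kind of regular parallelism an FPGA kernel exploits.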
Useful links
˃ MIT: Tutorial on Hardware Accelerators for Deep Neural Networks
http://eyeriss.mit.edu/tutorial.html
˃ Intel
https://software.intel.com/content/www/us/en/develop/training/course-deep-learning-inference-fpga.html
˃ UCLA: Machine Learning on FPGAs
http://cadlab.cs.ucla.edu/~cong/slides/HALO15_keynote.pdf
˃ Distributed ML
https://www.podc.org/data/podc2018/podc2018-tutorial-alistarh.pdf
Spectrum of new architectures for DNN
DNN requirements
˃ Throughput
˃ Latency
˃ Energy
˃ Power
˃ Cost
˃ Optimized hardware acceleration of both AI inference and other performance-critical functions, achieved by tightly coupling custom accelerators into a dynamic-architecture silicon device.
˃ This delivers end-to-end application performance significantly greater than that of a fixed-architecture AI accelerator like a GPU.
Roofline
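The roofline model bounds attainable performance by the lower of the compute ceiling and memory bandwidth times arithmetic intensity. The peak and bandwidth numbers below are illustrative, not those of any specific device:

```python
def attainable_gflops(peak_gflops, mem_bw_gbs, flops_per_byte):
    """Roofline model: min(compute ceiling, memory-bandwidth ceiling)."""
    return min(peak_gflops, mem_bw_gbs * flops_per_byte)

# With a 100 GFLOP/s peak and 10 GB/s of bandwidth, the ridge point sits
# at 10 FLOP/byte; kernels below it are memory-bound, above it compute-bound.
memory_bound = attainable_gflops(100, 10, 2)    # 20 GFLOP/s
compute_bound = attainable_gflops(100, 10, 50)  # 100 GFLOP/s
```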
Adaptive to new models
FPGAs for DNN
˃ The xDNN processing engine has dedicated execution paths for each type of command (download, conv, pooling, element-wise, and upload). This allows convolution commands to be run in parallel with other commands if the network graph allows it.
DNN layers
https://www.xilinx.com/publications/events/machine-learning-live/colorado/HotChipsOverview.pdf
CPU-FPGA
˃ Even though the xDNN processing engine supports a wide range of CNN operations, new custom networks are constantly being developed, and sometimes select layers/instructions might not be supported by the engine in the FPGA. Layers that are not supported in the xDNN processing engine are identified by the xfDNN compiler and can be executed on the CPU. These unsupported layers can be in any part of the network: beginning, middle, end, or in a branch.
CPU-FPGA
˃ Networks and models are prepared for deployment on xDNN through Caffe, TensorFlow, or MxNet.
˃ The FPGA runs the layers supported by xDNN, while unsupported layers run on the CPU.
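That partitioning step can be sketched as grouping contiguous runs of supported layers for the FPGA and routing the rest to the CPU. The operator names and the supported-op set below are hypothetical, not the engine's real list:

```python
# Hypothetical supported-op set; the real xDNN engine's list differs.
FPGA_SUPPORTED = {"conv", "pool", "eltwise"}

def partition(layers):
    """Split a layer sequence into contiguous (device, layers) segments."""
    segments, run, run_dev = [], [], None
    for op in layers:
        dev = "fpga" if op in FPGA_SUPPORTED else "cpu"
        if dev != run_dev and run:
            segments.append((run_dev, run))  # close the previous run
            run = []
        run_dev = dev
        run.append(op)
    if run:
        segments.append((run_dev, run))
    return segments
```

An unsupported layer anywhere in the sequence simply starts a new CPU segment, which matches the "beginning, middle, end, or in a branch" observation above.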
Optimized architecture
˃ The network is optimized by fusing layers, optimizing memory dependencies in the network, and pre-scheduling the entire network. This removes CPU host-control bottlenecks.
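One standard instance of layer fusion is folding a BatchNorm into the preceding convolution's weights and bias. Treating the conv as a per-output-channel linear map, a NumPy sketch (illustrative, not the toolchain's actual pass):

```python
import numpy as np

def fuse_conv_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold y = gamma*(Wx + b - mean)/sqrt(var + eps) + beta into a single
    linear layer y = W'x + b'.  w has shape [out_channels, in_features];
    gamma, beta, mean, var are per-output-channel BatchNorm parameters."""
    scale = gamma / np.sqrt(var + eps)
    return w * scale[:, None], (b - mean) * scale + beta
```

Because both layers are affine, the fused layer is mathematically identical while halving the memory traffic between them.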
DNN tradeoffs
https://www.xilinx.com/support/documentation/white_papers/wp514-emerging-dnn.pdf
Precision vs. performance vs. power
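Reduced precision trades a bounded amount of accuracy for throughput, memory footprint, and power. A minimal symmetric int8 quantization sketch makes the error bound visible:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric linear quantization: float32 tensor -> int8 plus a scale."""
    scale = np.abs(x).max() / 127.0
    q = np.round(x / scale).clip(-127, 127).astype(np.int8)
    return q, scale

x = np.random.default_rng(0).normal(size=1000).astype(np.float32)
q, scale = quantize_int8(x)
max_err = np.abs(q.astype(np.float32) * scale - x).max()  # bounded by ~scale/2
```

The tensor shrinks 4x, and the worst-case round-off stays within half a quantization step, which is why low-precision arithmetic maps so well onto FPGA fabric.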
Design space trade-offs
J. Cong et al., Understanding Performance Differences of FPGAs and GPUs
Winners
https://www.semanticscholar.org/paper/Unified-Deep-Learning-with-CPU%2C-GPU%2C-and-FPGA-Rush-Sirasao/64c8428e93546479d44a5a3e44cb3d2553eab284#extracted
Links, more info