Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power...
Transcript of Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power...
![Page 1: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/1.jpg)
Performance and Power Challenges in Data Center GPUsDaniel Wong University of California, [email protected] Department of Electrical and Computer Engineering
![Page 2: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/2.jpg)
More than just graphics2
![Page 3: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/3.jpg)
GPUs are everywhere3
Picture sources: Nvidia
Data CentersAutonomous Cars
Embedded SystemsCryptocurrency Mining
![Page 4: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/4.jpg)
GPUs can do (almost) everything›Deep Learning, Mining, Graphics, HPC, Database, Network
4
![Page 5: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/5.jpg)
GPUs are massively parallel accelerators5
VS
![Page 6: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/6.jpg)
GPUs leverages Data-level Parallelism›CUDA program is partitioned into a grid of blocks
› Each block is then scheduled to an SM within the GPU
6
![Page 7: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/7.jpg)
GPU becoming more specialized7
Modern GPU “Processing Block”● 32 Threads● 16 INT● 16 single-precision FP● 8 double-precision FP● 4 SFU (sin, cos, log)● 2 Tensor units for DNN● 64KB RF
![Page 8: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/8.jpg)
GPU Streaming Multiprocessor8
● Contains 4 “Processing Blocks”● Each independently schedules a
set of 32 threads called a warp● Share L1 Cache between blocks
![Page 9: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/9.jpg)
GPU Hardware9
● V100 has 80 SM● 5376 FPU● Peak 15.7
TFLOPS
![Page 10: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/10.jpg)
GPU “Data center in a box”›DGX
› A Multi-GPU “Node”› 300GB/s NVlink 2.0 cube mesh› 1 PFLOPS› Faster Machine Learning
10
![Page 11: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/11.jpg)
11
![Page 12: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/12.jpg)
DGX Data Center12
![Page 13: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/13.jpg)
GPU Support in Cloud Computing Stack13
![Page 14: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/14.jpg)
GPUs in the Cloud
› Exponential demand for more compute power
14
![Page 15: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/15.jpg)
GPUs are power hungry15
https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/
~100 Watt Difference Compared to CPUs
}
![Page 16: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/16.jpg)
How can we save GPU power in data center environments?
16
![Page 17: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/17.jpg)
GPU inter-connection is getting complex17
![Page 18: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/18.jpg)
GPU inter-connection is getting complex18
![Page 19: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/19.jpg)
How can we make efficient use of GPU inter-connects?
19
![Page 20: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/20.jpg)
Power challenges
20
![Page 21: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/21.jpg)
Varying data center utilization ›Data Center load fluctuates over time
› Leads to underutilization of hardware resources
›Common solution:Dynamically scale clock Frequency with current load
21
Google’s Data Center Trace https://github.com/google/cluster-data/blob/master/ClusterData2011_2.md
![Page 22: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/22.jpg)
Tradeoff latency for power22
Slack
› Slow down request processing›Requests must meet latency constraints
› Must be serviced under 99 percentile › Costs money/time/energy if over
![Page 23: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/23.jpg)
Target DNN as a Service23
Source: DjiNN and Tonic: DNN as a Service, ISCA’15
![Page 24: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/24.jpg)
Djinn and Tonic24
Source: DjiNN and Tonic: DNN as a Service, ISCA’15
Running on NVIDIA Titan X
![Page 25: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/25.jpg)
GPU frequency scaling exploits thermal headroom
25
![Page 26: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/26.jpg)
Diminishing Returns from Frequency
› Power vs Frequency non-linear
26
![Page 27: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/27.jpg)
Frequency Scaling Challenges
› Frequency scaling achieves limited power savings › How to trade-off frequency for latency?
› In CPUs, frequency states are supplemented with deep sleep states … which do not exist in GPUs
27
![Page 28: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/28.jpg)
Scale parallelism w/ Thread Block Scaling28
› Exploit application-level characteristics
› Limiting the amount of thread blocks that a single request uses
›Potentially reduce dynamic power by utilizing less hardware
![Page 29: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/29.jpg)
DNN Inference calls multiple kernels 29
![Page 30: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/30.jpg)
Kernels vary in Thread Block usage30
›Most applications do not use all hardware resources, but are provisioned the entire GPU!
![Page 31: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/31.jpg)
Squeeze kernels into less TBs31
› Thread Blocks can be reduced without a major impact to execution time
› Latency becomes an issue at around 75% TB reduction
![Page 32: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/32.jpg)
Enable Colocatating Multiple Requests› Service multiple requests at the same time on a single GPU
›This allows the frequency to remain low while handling a higher load
› Increasing overall energy efficiency
32
![Page 33: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/33.jpg)
Thread Block Scaling Challenges
› Software vs Hardware implementation?
› How to coordinate thread block scaling with frequency scaling?
› Colocating multiple requests may lead to contention of hardware resources› How to allocate resources to kernels?
33
![Page 34: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/34.jpg)
Communication-related performance challenges
34
![Page 35: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/35.jpg)
NVLink: Fast communication between multi-GPUs
›
35
![Page 36: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/36.jpg)
NVLink vs PCIe
› NVLink also provides significantly lower latency› Even with bidirectional traffic!
36
![Page 37: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/37.jpg)
Challenges of complex GPU inter-connects
› Programming Multi-GPU applications is hard
› Not aware of inter-connect topology
› Poor placement of GPU kernel can lead to performance impact
37
![Page 38: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/38.jpg)
Solutions
› Develop new paradigms and APIs to ease multi-GPU application development
38
![Page 39: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/39.jpg)
Solutions› GPU Kernel scheduling and mapping algorithms
› Guided by topology information from programmer APIs› Utilize intermediate GPUs as NVLink routers to allow
communication between non-direct connect GPUs› Avoids PCIe
39
![Page 40: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/40.jpg)
NVLink routing preliminary results40
![Page 41: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/41.jpg)
Conclusion› Modern GPU-based data centers face many power and
performance related challenges› GPUs have limited power savings features (frequency scaling)
› Parallelism-scaling and co-location offers potential to improveenergy efficiency
› Multi-GPU programming and management is made difficult due to GPUs increasingly complex inter-connection › Requires new paradigms, programmer-support,
mapping/scheduling support, and runtime support.
41
![Page 42: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/42.jpg)
Thank you! Questions?
![Page 43: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/43.jpg)
Performance and Power Challenges in Data Center GPUsDaniel Wong University of California, [email protected] Department of Electrical and Computer Engineering
![Page 44: Performance and Power Challenges in Data Center GPUs · 2019-12-27 · Performance and Power Challenges in Data Center GPUs Daniel Wong University of California, Riverside dwong@ece.ucr.edu](https://reader033.fdocuments.us/reader033/viewer/2022042413/5f2da081da7bbd4f13135ea0/html5/thumbnails/44.jpg)
GPU Software View› Each block contains a grid of threads
›Blocks and threads can be logically grouped in 3 dimensions
44