THE MLPERF BENCHMARKS: DEEP LEARNING AT SCALE WITH NVIDIA GPUS
Vishal Mehta, DevTech Compute
HPC-AI Advisory Council, Swiss Conference, Lugano
Date: 02/04/2019
MLPERF BENCHMARKING AREAS
| Problem | Model | Dataset |
|---|---|---|
| Image Classification | Resnet-50 v1.5 | ImageNet |
| Object Detection (Heavy Weight) | Mask R-CNN | COCO |
| Object Detection (Light Weight) | Single-shot Detector (SSD) | COCO |
| Translation (non-recurrent) | Transformer | WMT English-German |
| Translation (recurrent) | Neural Machine Translator (NMT) | WMT English-German |
| Recommendation | Neural Collaborative Filtering (NCF) | MovieLens 20M |
| Reinforcement Learning | MiniGo (based on AlphaGo) | |

* Closed Divisions: fixed model parameters, fixed data format, results must be reproducible
* Open Divisions: encourage innovations; tricks and model adjustments welcomed
Diverse Use Cases Towards a Full Performance Picture
MLPERF
First Industry Benchmark for Measuring AI Performance
https://mlperf.org/
• Open source code on single/multi node DGX systems
• Everywhere: Workstation/Clusters with SLURM/Cloud
• Reproducible Performance!
MLPERF RESULTS: CLOSED DIVISIONS
Single Node & At Scale
Results are Time to Complete Model Training
[Charts: Single Node, At Scale]
* Full reference: https://mlperf.org/results/
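The "time to complete model training" metric can be sketched as a wall-clock timer around a train-and-evaluate loop that stops once a quality target is reached. This is a minimal illustration, not the MLPerf reference harness: `time_to_target`, the toy `train_one_epoch`/`evaluate` callables, and the 0.75 target are all hypothetical.

```python
import time

def time_to_target(train_one_epoch, evaluate, target_quality, max_epochs=100):
    """Return (epochs, seconds) needed for evaluate() to reach target_quality."""
    start = time.perf_counter()
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        if evaluate() >= target_quality:
            return epoch, time.perf_counter() - start
    raise RuntimeError("target quality not reached within max_epochs")

# Toy stand-in for a real model: quality rises by 0.2 per epoch (hypothetical).
state = {"quality": 0.0}

def train_one_epoch():
    state["quality"] += 0.2

def evaluate():
    return state["quality"]

epochs, seconds = time_to_target(train_one_epoch, evaluate, target_quality=0.75)
print(f"reached target after {epochs} epochs in {seconds:.6f} s")
```

In a real run the two callables would wrap the framework's training and validation steps, and the timer would cover whatever the benchmark rules say it must cover.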
AI TRAINING REQUIRES FULL STACK INNOVATION
| Year | Training time | System | Innovation |
|---|---|---|---|
| 2015 | 36000 mins (25 days) | 1x K80 | CUDA |
| 2016 | 1200 mins (20 hours) | DGX-1P | NVLink |
| 2017 | 480 mins (8 hours) | DGX-1V | Tensor Core |
| 2018 | 70 minutes on MLPerf | DGX-2H | NVSwitch |
| 2018 | 6.3 minutes on MLPerf | At Scale | DGX Cluster |
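To put the timeline in perspective, the speedup of each step over the 2015 baseline can be computed directly from the training times on the slide. The sketch below only does that arithmetic; the labels and the `speedups` helper are illustrative names.

```python
# Training times in minutes, as listed on the slide.
timeline = [
    ("1x K80, 2015", 36000.0),
    ("DGX-1P, 2016", 1200.0),
    ("DGX-1V, 2017", 480.0),
    ("DGX-2H, 2018", 70.0),
    ("At scale, 2018", 6.3),
]

def speedups(entries):
    """Speedup of every entry relative to the first (baseline) entry."""
    baseline = entries[0][1]
    return [(label, baseline / minutes) for label, minutes in entries]

results = speedups(timeline)
for label, factor in results:
    print(f"{label:>15}: {factor:8.1f}x vs. the 2015 baseline")
```

The last row works out to roughly 5,700x, i.e. from 25 days down to 6.3 minutes.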
PILLARS OF GPU PERFORMANCE
• CUDA Architecture: Massively Parallel Processing
• Tensor Cores: Mixed Precision Matrix Math Support
• NVLink/NVSwitch: High-Speed Connection between GPUs for Distributed Algorithms
• Integrated Software: Fully Integrated Software and Hardware for Instant Productivity
[Diagram labels: NVSwitch, 6x NVLink; CUDA, Python, Apache Arrow on GPU Memory, Dask, cuDNN, RAPIDS, cuML, cuDF, DL Frameworks]
TENSOR CORES
Mixed Precision Matrix Math
• CUDA TensorOp instructions & data formats. Automated Mixed Precision
• Using Tensor Cores via
  • Volta-optimized frameworks and libraries (cuDNN, cuBLAS, TensorRT, ...)
  • CUDA C++ Warp-Level Matrix Operations
  • cuTENSOR library (pre-release version available)
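Why mixed precision matrix math keeps FP16 inputs but accumulates in FP32 can be shown with a small NumPy emulation. This is a CPU sketch of the numerics only, not Tensor Core code; the function names are made up for the illustration.

```python
import numpy as np

def dot_fp16_accumulate(x, y):
    """Dot product with FP16 inputs AND an FP16 accumulator."""
    acc = np.float16(0.0)
    for xi, yi in zip(x.astype(np.float16), y.astype(np.float16)):
        acc = np.float16(acc + xi * yi)
    return acc

def dot_fp32_accumulate(x, y):
    """Dot product with FP16 inputs but an FP32 accumulator (mixed precision)."""
    acc = np.float32(0.0)
    for xi, yi in zip(x.astype(np.float16), y.astype(np.float16)):
        acc = acc + np.float32(xi) * np.float32(yi)
    return acc

ones = np.ones(4096)
fp16_result = float(dot_fp16_accumulate(ones, ones))
fp32_result = float(dot_fp32_accumulate(ones, ones))
print("FP16 accumulator:", fp16_result)
print("FP32 accumulator:", fp32_result)
```

With an FP16 accumulator the sum stalls at 2048, because above that value FP16 can no longer represent an increment of 1; the FP32 accumulator returns the exact 4096.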
cuTENSOR
A New High-Performance CUDA Library for Tensor Primitives
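As an illustration of what a "tensor primitive" is, the sketch below performs a tensor contraction on the CPU with NumPy. It does not use or mimic the cuTENSOR API; the mode names and shapes are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 5, 6))   # modes m, k, n
B = rng.standard_normal((5, 3))      # modes k, p

# Contract over the shared mode k: C[m, n, p] = sum_k A[m, k, n] * B[k, p]
C = np.einsum("mkn,kp->mnp", A, B)

# Same contraction expressed with tensordot, as a cross-check.
C_ref = np.tensordot(A, B, axes=([1], [0]))

print(C.shape, np.allclose(C, C_ref))
```

Contractions like this generalize matrix multiplication to higher-order tensors, which is why they map well onto GPU matrix hardware.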
TENSOR CORE AUTOMATIC MIXED PRECISION
Over 3x Speedup With Just Two Lines of Code
TOOLS AND LIBRARIES MAINTAIN NETWORK ACCURACY
Tensor Core Journey Page
Github
Profiler Tools
Performance increase using automatic mixed precision on a variety of
training data sets. All performance collected using 1xV100-16GB except
bert-squadqa, which ran on 1xV100-32GB.
Enable AMP in TensorFlow (NGC Container 19.03)
export TF_ENABLE_AUTO_MIXED_PRECISION=1
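The same switch can be set from inside a Python training script instead of the shell. A minimal sketch, assuming the variable is set at the top of the script before TensorFlow is imported (the conservative choice); the TensorFlow import itself is left as a comment so the snippet runs without it.

```python
import os

# Equivalent to `export TF_ENABLE_AUTO_MIXED_PRECISION=1`, but scoped to this process.
# Assumption: this runs before TensorFlow is imported and the training graph is built.
os.environ["TF_ENABLE_AUTO_MIXED_PRECISION"] = "1"

# import tensorflow as tf   # the training script would continue from here (not run in this sketch)

amp_requested = os.environ.get("TF_ENABLE_AUTO_MIXED_PRECISION") == "1"
print("AMP requested:", amp_requested)
```

Setting the variable in the script keeps the choice under version control with the model code, while the shell `export` is handy for toggling it per run.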
MIXED PRECISION MAINTAINS ACCURACY
Benefit From Higher Throughput Without Compromise
[Chart: ILSVRC12 classification top-1 accuracy (y-axis: model accuracy, 0% to 80%), FP32 vs. Mixed Precision**, for AlexNet, VGG-D, GoogleNet (Inception v1), Inception v2, Inception v3, Resnet50]
Source: Sharan Narang, Paulius Micikevicius et al., "Mixed Precision Training", ICLR 2018
**Same hyperparameters and learning rate schedule as FP32.
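One technique described in the cited "Mixed Precision Training" paper is loss scaling, which keeps very small gradient values from underflowing to zero in FP16. The NumPy sketch below shows the effect on a single value; the helper name, the gradient value 1e-8, and the scale factor 1024 are illustrative assumptions.

```python
import numpy as np

def fp16_round_trip(grad, loss_scale=1.0):
    """Scale in FP32, store in FP16, then unscale in FP32."""
    scaled_fp16 = np.float16(np.float32(grad) * np.float32(loss_scale))
    return float(np.float32(scaled_fp16) / np.float32(loss_scale))

tiny_grad = 1e-8  # below the smallest positive FP16 value (about 6e-8)

without_scaling = fp16_round_trip(tiny_grad)
with_scaling = fp16_round_trip(tiny_grad, loss_scale=1024.0)

print("without loss scaling:", without_scaling)
print("with loss scaling:   ", with_scaling)
```

Without scaling the gradient is flushed to zero; with scaling it survives the FP16 round trip to within about one percent.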
TENSOR CORES FOR SCIENCE
Multi-precision computing
[Chart: V100 TFLOPS, bars at 7.8, 15.7 and 125; label: FP64 + MULTI-PRECISION]
Applications: AI-POWERED WEATHER PREDICTION, PLASMA FUSION APPLICATION, EARTHQUAKE SIMULATION
Results shown:
• FP16 Solver: 3.5x faster
• FP16/FP32: 1.15x ExaOPS
• FP16-FP21-FP32-FP64: 25x faster
NVIDIA DGX-2
• NVIDIA Tesla V100 32GB
• Two GPU Boards: 8 V100 32GB GPUs per board, 6 NVSwitches per board, 512GB Total HBM2 Memory, interconnected by Plane Card
• Twelve NVSwitches: 2.4 TB/sec bi-section bandwidth
• Eight EDR Infiniband/100 GigE: 1600 Gb/sec Total Bi-directional Bandwidth
• PCIe Switch Complex
• Two Intel Xeon Platinum CPUs
• 1.5 TB System Memory
• 30 TB NVME SSDs Internal Storage
• Dual 10/25 Gb/sec Ethernet
NVSWITCH
• 18 NVLINK ports
• 50 GB/s per port, bi-directional
• 900 GB/s total, bi-directional
• Fully connected crossbar
• 2 billion transistors
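The headline numbers on the DGX-2 and NVSwitch slides follow from simple multiplication. The sketch below checks them; the bisection-bandwidth line is one plausible reading (eight GPUs on each side, six NVLinks per GPU as in "6x NVLink"), stated here as an assumption rather than a documented derivation.

```python
# NVSwitch: ports times per-port bandwidth.
ports_per_switch = 18
gb_s_per_port_bidir = 50
switch_total_gb_s = ports_per_switch * gb_s_per_port_bidir

# DGX-2 memory: two boards of eight 32 GB GPUs.
gpus = 2 * 8
total_hbm2_gb = gpus * 32

# Assumed reading of the 2.4 TB/s figure: each of the 8 GPUs in one half
# drives 6 NVLinks at 50 GB/s (bi-directional) toward the other half.
nvlinks_per_gpu = 6
bisection_gb_s = (gpus // 2) * nvlinks_per_gpu * gb_s_per_port_bidir

# Network: eight 100 Gb/s ports, counted in both directions.
network_bidir_gbit_s = 8 * 100 * 2

print(switch_total_gb_s, "GB/s per NVSwitch")
print(total_hbm2_gb, "GB HBM2")
print(bisection_gb_s / 1000, "TB/s bisection (assumed derivation)")
print(network_bidir_gbit_s, "Gb/s network, bi-directional")
```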
FULL NON-BLOCKING BANDWIDTH
NVSWITCH
UNIFIED MEMORY PROVIDES
• Single memory view shared by all GPUs
• Automatic migration of data between GPUs
• User control of data locality
• CUDA cooperative kernels across multiple GPUs
NVLINK PROVIDES
• All-to-all high-bandwidth peer mapping between GPUs
• Full inter-GPU memory interconnect (incl. atomics)
NCCL
NVIDIA Collective Communications Library
• Building block that abstracts highly optimized communication for each topology
  • Rings within a node over NVLink
  • Trees among nodes over the network
• Like MPI collectives, but with a CUDA stream
• Essential to deep learning and machine learning; can be relevant to traditional HPC
• Open source as of v2.3
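The ring pattern mentioned above can be illustrated with a CPU simulation of a ring all-reduce: a reduce-scatter phase followed by an all-gather phase. This is a textbook sketch in NumPy, not NCCL's implementation or API; `ring_allreduce` and the list-of-arrays "ranks" are stand-ins for real GPUs.

```python
import numpy as np

def ring_allreduce(buffers):
    """Sum equally sized arrays across simulated ranks arranged in a ring."""
    n = len(buffers)
    # Each rank splits its own copy of the data into n chunks.
    chunks = [np.array_split(np.asarray(b, dtype=np.float64).copy(), n)
              for b in buffers]

    # Reduce-scatter: after n-1 steps, rank r owns the full sum of chunk (r+1) % n.
    for step in range(n - 1):
        # Snapshot outgoing chunks so all sends in a step happen "at once".
        sends = [((r - step) % n, chunks[r][(r - step) % n].copy())
                 for r in range(n)]
        for r, (idx, payload) in enumerate(sends):
            chunks[(r + 1) % n][idx] += payload

    # All-gather: circulate the finished chunks until every rank has all of them.
    for step in range(n - 1):
        sends = [((r + 1 - step) % n, chunks[r][(r + 1 - step) % n].copy())
                 for r in range(n)]
        for r, (idx, payload) in enumerate(sends):
            chunks[(r + 1) % n][idx] = payload

    return [np.concatenate(c) for c in chunks]

# Four simulated ranks, each holding a different gradient vector.
inputs = [np.arange(8) * (rank + 1) for rank in range(4)]
outputs = ring_allreduce(inputs)
expected = np.sum(inputs, axis=0)
print(outputs[0])
```

Each rank sends and receives only one chunk per step, so the data volume per link stays constant as ranks are added, which is what makes rings bandwidth-efficient inside a node.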
NCCL
Trees vs Rings
https://devblogs.nvidia.com/massively-scale-deep-learning-training-nccl-2-4/
Runs on Summit supercomputer on up to 24,576 GPUs.
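A rough way to see why trees matter at this scale is to count sequential communication steps. The sketch below uses a simplified latency model (a ring needs about 2(N-1) steps, a balanced binary tree about 2*ceil(log2 N) hops for reduce plus broadcast); the formulas are approximations chosen for illustration and ignore bandwidth and NCCL's actual tree layout.

```python
import math

def ring_allreduce_steps(n):
    """Sequential steps for a ring all-reduce: reduce-scatter plus all-gather."""
    return 2 * (n - 1)

def tree_allreduce_steps(n):
    """Approximate hops up and back down a balanced binary tree."""
    return 2 * math.ceil(math.log2(n)) if n > 1 else 0

for gpus in (8, 1024, 24576):
    print(f"{gpus:>6} GPUs: ring {ring_allreduce_steps(gpus):>6} steps, "
          f"tree ~{tree_allreduce_steps(gpus):>3} steps")
```

Under this model, 24,576 GPUs need tens of thousands of ring steps but only a few dozen tree hops, so latency grows linearly for rings and logarithmically for trees.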
NCCL
Performance comparison on Resnet-50
NVIDIA GPU CLOUD
GPU-optimized Software Hub. Simplifying DL, ML and HPC Workflows
NGC
• 50+ Containers: DL, ML, HPC
• Pre-trained Models: NLP, Classification, Object Detection & more
• Industry Workflows: Medical Imaging, Intelligent Video Analytics
• Model Training Scripts: NLP, Image Classification, Object Detection & more
Innovate Faster
Deploy Anywhere
Simplify Deployments
NGC Support Services
https://ngc.nvidia.com
MLPERF KEY POINTS
NVIDIA's Platform Available Everywhere to All Developers
• The software innovations used to achieve this industry-leading performance are available via the NGC container registry.
• NGC containers are available for all key AI frameworks and can be used anywhere: on desktops, workstations, servers and all leading cloud services.
MLPERF KEY POINTS
NVIDIA Sets New Records in AI Performance
• NVIDIA ran models of all kinds of complexity in the industry's first comprehensive AI benchmark.
• Tensor Core GPUs are the fastest and, combined with CUDA, the most versatile platform for AI.
• The NVIDIA platform is available everywhere, from desktops to cloud services.
MLPERF KEY POINTS
State-of-the-art AI Computing Requires Full Stack Innovation
[Diagram: full-stack layers]
• CUSTOMER USE CASES
  • CONSUMER INTERNET & INDUSTRY APPLICATIONS: Speech, Translate, Recommender, Manufacturing, Healthcare, Finance
  • SCIENTIFIC APPLICATIONS: Molecular Simulations, Weather Forecasting, Seismic Mapping
  • VIRTUAL GRAPHICS: Creative & Technical, Knowledge Workers
• APPS & FRAMEWORKS: Amber, NAMD, +550 Applications, +600 Applications, DX/OGL
• CUDA-X NVIDIA LIBRARIES
  • MACHINE LEARNING: cuML, cuDF, cuGRAPH
  • DEEP LEARNING: cuDNN, CUTLASS, TensorRT
  • HPC: cuFFT, OpenACC
  • VIRTUAL GPU: vDWS, vPC, vAPPS
• CUDA & CORE LIBRARIES: cuBLAS | NCCL
• GPUs & SYSTEMS: TESLA GPU, NVIDIA DGX FAMILY, NVIDIA HGX, SYSTEM OEM, CLOUD
RAPIDS: MACHINE LEARNING AT SCALE
THANK YOU