Benchmarking Deep Learning Workloads on Large-scale HPC Systems
Ammar Ahmad Awan and Dhabaleswar K. [email protected], [email protected]
The Ohio State University
Benchmarking in the Datacenter Workshop @ PPoPP ‘20
Follow us on
https://twitter.com/mvapich
2Network Based Computing Laboratory
• Introduction
• Background
• ML/DL Benchmarks
• Solutions and Case Studies
• Conclusion and Future Directions
Agenda
3Network Based Computing Laboratory
Overview of Artificial Intelligence
Courtesy: http://www.zdnet.com/article/caffe2-deep-learning-wide-ambitions-flexibility-scalability-and-advocacy/
4Network Based Computing Laboratory 4Network Based Computing Laboratory
Applications: Style Transfer, Caption Generation, Translation, etc.
Courtesy: https://github.com/alexjc/neural-doodle
Courtesy: https://machinelearningmastery.com/inspirational-applications-deep-learning/
Courtesy: https://research.googleblog.com/2015/07/how-google-translate-squeezes-deep.html
5Network Based Computing Laboratory
The Deep Learning (DL) Revolution
Adopted from: http://www.deeplearningbook.org/contents/intro.html
• DL – a revolutionary sub-set of ML
– Feature extraction vs. hand-crafted features
• Key success: Deep Neural Networks (DNNs)
– Everything was invented in late 80s except ?
AI
Machine Learning
(ML)
Deep Learning (DL)
Examples:
MLPs, DNNs,Examples:
Logistic Regression
Courtesy: https://hackernoon.com/difference-between-artificial-intelligence-machine-learning-and-deep-learning-1pcv3zeg, https://blog.dataiku.com/ai-vs.-machine-learning-vs.-deep-learning
6Network Based Computing Laboratory 6Network Based Computing Laboratory
Deep Learning on GPUs
*https://blogs.nvidia.com/blog/2014/09/07/imagenet/
• NVIDIA GPUs - driving force for faster DL!
– 90% ImageNet teams used GPUs (2014*)
– DNNs like Inception, ResNet(s), NASNets, and AmoebaNets
– Natural fit for DL workloads – throughput-oriented
• High Performance Computing (HPC) arena
– 135/500 Top HPC systems used NVIDIA GPUs (Nov ’19)
– CUDA-Aware Message Passing Interface (MPI)
• MVAPICH2-GDR, SpectrumMPI, OpenMPI, etc.
– DGX-1/DGX-2- Dedicated DL supercomputers www.top500.org
7Network Based Computing Laboratory 7Network Based Computing Laboratory
Deep Learning on CPUs
1. https://dl.acm.org/citation.cfm?id=1993516, 2. http://ieeexplore.ieee.org/abstract/document/5762730/, 3. https://dspace.mit.edu/bitstream/handle/1721.1/51839/MIT-CSAIL-TR-2010-013.pdf?sequence=1
• CPUs (dense many-cores) are emerging
• CPUs exist on nodes with GPUs
– Many-core Xeon, POWER9, EPYC, ARM, etc.
• Are CPUs really 10x – 100x slower than GPUs? [1-3]
• But CPU-based DL is getting much better and faster
– MKL-DNN, Vectorization, MPI optimized for large messages
– Data-Parallelism (Intel-Caffe – MLHPC ‘17)
– Model/Hybrid-Parallelism (HyPar-Flow – ISC ‘20)
8Network Based Computing Laboratory
Three Key Pieces in the story:
• Computability of DNNs
– HPC enables faster DL!
• Datasets – ImageNet and beyond…
• State-of-the-art Accuracy – Vision, NLP, Translation, etc.
Why HPC and DL?
Courtesy: A. Canziani et al., “An Analysis of Deep Neural Network Models for Practical Applications”, CoRR, 2016.
0
2000
4000
6000
8000
10000
2 GTX 580 DGX-2
Min
utes
to T
rain
AlexNet
~500X in 5 years
9Network Based Computing Laboratory 9Network Based Computing Laboratory
• Introduction
• Background
• ML/DL Benchmarks
• Solutions and Case Studies
• Conclusion and Future Directions
Agenda
10Network Based Computing Laboratory 10Network Based Computing Laboratory
Two Phases in Deep Learning
Courtesy: https://devblogs.nvidia.com/
• Training is compute-intensive
– Many passes over data
– Can take days to weeks
– Model adjustment is done
• Inference
– Single pass over the data
– Takes seconds
– No model adjustment
• Challenge: How to make “Training” faster?
– Exploit HPC hardware and software
11Network Based Computing Laboratory 11Network Based Computing Laboratory
• Data Parallelism (most common)
• Model and Hybrid Parallelism (emerging)
• ‘X’-Parallelism
– ‘X’—> Spatial, Channel, Filter, etc.
Parallelization Strategies for DNN Training
Model ParallelismData Parallelism
Hybrid (Model and Data) ParallelismCourtesy: http://engineering.skymind.io/distributed-deep-learning-part-1-an-introduction-to-distributed-training-of-neural-networks
12Network Based Computing Laboratory
• Introduction
• Background
• ML/DL Benchmarks
• Solutions and Case Studies
• Conclusion and Future Directions
Agenda
13Network Based Computing Laboratory
Benchmarking DL Workloads
Benchmarking Scale-upand Scale-out
performance of DL workloads on large-scale HPC systems
Scal
e-up
Per
form
ance
Scale-out Performance
cuDNN
gRPC
Hadoop
MPIMKL-DNN
Desired
NCCL2
14Network Based Computing Laboratory
DNN Training: A Complex Solution Space
Data-Parallel Training
Out-of-Core Training
Model/Hybrid ParallelTraining
In-depth Performance Characterization and Profiling Analysis
15Network Based Computing Laboratory
Several Efforts focused on DL Benchmarking
Many benchmarks but most are not suitable
for HPC Systems
DAWNBench(Stanford)
convnet-benchmarks (Soumith Chintala)
tf_cnn_benchmarks(TensorFlow)
Time Benchmark(CAFFE)
DL Bench(Hong Kong University)
Deep500(ETH Zurich)
MLPerf(mlperf.org – strong industry support)
16Network Based Computing Laboratory
Application Areas
Frameworks
Communication Libraries I/O (?) Compute
Libraries
Simplified Deep Learning Stack
Language Style Transfer Translation Image Captioning
17Network Based Computing Laboratory
DL Execution Stack: Details
Distributed Training Middleware
Communication Middleware
DL Frameworks
HPC Platforms Multi-/Many-core CPUs (Intel Xeon, AMD EPYC, and
IBM POWER9)NVIDIA GPUs
High-Performance Interconnects
(InfiniBand, Omni-Path)
MPI
Caffe
Horovod
NCCL
HyPar-Flow
TensorFlow PyTorch Distributed TensorFlow
MXNet PyTorchDistributed
OSU-Caffe
Gloo gRPC
18Network Based Computing Laboratory
• Introduction
• Background
• ML/DL Benchmarks
• Solutions and Case Studies
• Conclusion and Future Directions
Agenda
19Network Based Computing Laboratory
• Single Node– Computation only - cuDNN, MKL-DNN, etc.
– Communication - shared-memory, CUDA IPC, etc.)
• Multiple Nodes– Computation - same as single node
– Communication - MPI, NCCL, Gloo, etc.
• Studies Covered in today’s talk1. CPU vs. GPU comparison is usually unfair -- MLHPC ‘17
2. Different approaches, different end-to-end performance for same DL framework -- CCGrid ’19
3. Communication in DL is not the same as Communication in HPC -- HotI ‘19, IEEE Micro ’19
4. Beyond Data-Parallelism -- HyPar-Flow (accepted to be presented at ISC ’20)
Solutions and Case Studies: Different Benchmarking Directions
Single Node Single
Core/GPU
Single Node Multiple
cores/GPUs
N/A
Multiple nodes
Multiple cores/GPUs
20Network Based Computing Laboratory
• Faster Convolutions à Faster Training
• Performance of Intel KNL == NVIDIA P100 for AlexNet
• Volta; different league!
1. Caffe: CPUs vs. GPUs
A. A. Awan, H. Subramoni, and Dhabaleswar K. Panda. “An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures”, In Proceedings of the Machine Learning on HPC Environments (MLHPC'17), in conjunction with SC ‘17, Denver, CO
DLApplications(ImageRecognition,SpeechProcessing,etc.)
DLFrameworks(Caffe,TensorFlow,etc.)
BLASLibraries
Hardware
Many-coreGPU(PascalP100)
GenericConvolutionLayer
MKLOptimizedConvolutionLayer
MKL2017 cuDNN/cuBLAS
Multi-/Many-core(Xeon,XeonPhi)
cuDNN OptimizedConvolutionLayer
OtherBLASLibraries
OpenBLASATLAS
OtherProcessors
• Caffe – the first framework NVIDIA optimized using cuDNN
– Everyone has optimized it ever since!
• Holistic View of Performance is needed!
21Network Based Computing Laboratory
• Out-of-Core workloads – no good baseline to compare
– Easiest fallback is to use CPU –> A lot more CPU memory available than GPU memory
• OC-Caffe-Optimized (Opt) designs provide much better than CPU/Optimized CPU designs!– DNN depth is the major cause for slow-downs à significantly more intra-GPU communication
OC-Caffe: GPU (Unified Memory) vs. CPU
0
200400
600800
10001200
1400
1600
Img/
sec (
High
er is
bet
ter)
caffe-gpu oc-caffe-naïve oc-caffe-optcaffe-cpu intel-caffe intel-caffe-opt
oc-caffe-opt is 5Xbetter than
intel-caffe-opt
caffe-gpucannot
run
X 0
50
100
150
200
250
300
Img/
sec (
High
er is
bet
ter)
caffe-gpu oc-caffe-naïve oc-caffe-optcaffe-cpu intel-caffe intel-caffe-opt
caffe-gpucannot
run
X
oc-caffe-opt is2.7X better than intel-caffe-opt
0
5
10
15
20
Img/
sec (
High
er is
bet
ter)
caffe-gpu oc-caffe-naïve oc-caffe-optcaffe-cpu intel-caffe intel-caffe-opt
oc-caffe-optis 80%
better than intel-caffecaffe-gpu
cannot run
X
intel-caffe-opt
(N/A)X
Out-of-Core AlexNet Out-of-Core GoogLeNet Out-of-Core ResNet-50
A. A. Awan, C-H Chu, H. Subramoni, X. Lu, and DK Panda, “OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training”, HiPC ’18
22Network Based Computing Laboratory
• gRPC (official support)
– Open-source – can be enhanced by others
– Accelerated gRPC (add RDMA to gRPC)
• gRPC+X
– Use gRPC for bootstrap and rendezvous
– Actual communication is in “X”
– Xà MPI, Verbs, GPUDirect RDMA (GDR), etc.
• No-gRPC
– Baidu – the first one to use MPI Collectives for TF
– Horovod – Use NCCL, or MPI, or any other future library (e.g. IBM DDL support recently added)
2. Same Framework, Different Communication Approaches
Distributed TensorFlow
gRPC
Accelerated gRPC
gRPC+X
gRPC+MPI
gRPC+Verbs
gRPC+GDR
No-gRPC
Baidu-MPI Horovod
MPI
NCCL
A. A. Awan, J. Bedorf, C.-H. Chu, H. Subramoni and D. K. Panda, “Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation”, CCGrid ‘19. https://arxiv.org/abs/1810.11112
23Network Based Computing Laboratory
• gRPC and gRPC+X designs are slower than No-gRPC designs
• Baidu design (ring-allreduce) still slower than Horovod-NCCL and gRPC+‘X’
• Horovod-MPI is about 10% slower than Horovod-NCCL2.
Performance Characterization of Distributed TensorFlow
CNTK0
100200300400500600700800900
1000
1 2 4 8 16Imag
es/s
econ
d (H
ighe
r is b
ette
r)
No. of Nodes (GPUs)
Baidu-MPI Horovod-MPI Horovod-NCCL2 gRPC+verbs gRPC+MPI gRPC (IPoIB)
A. A. Awan, J. Bedorf, C.-H. Chu, H. Subramoni and D. K. Panda, “Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation”, CCGrid ‘19. https://arxiv.org/abs/1810.11112
24Network Based Computing Laboratory
Overview of the MVAPICH2 Project• High Performance open-source MPI Library
• Support for multiple interconnects– InfiniBand, Omni-Path, Ethernet/iWARP, RDMA over Converged Ethernet (RoCE), and AWS
EFA
• Support for multiple platforms– x86, OpenPOWER, ARM, Xeon-Phi, GPGPUs
• Started in 2001, first open-source version demonstrated at SC ‘02
• Supports the latest MPI-3.1 standard
• http://mvapich.cse.ohio-state.edu
• Additional optimized versions for different systems/environments:– MVAPICH2-X (Advanced MPI + PGAS), since 2011
– MVAPICH2-GDR with support for NVIDIA GPGPUs, since 2014
– MVAPICH2-MIC with support for Intel Xeon-Phi, since 2014
– MVAPICH2-Virt with virtualization support, since 2015
– MVAPICH2-EA with support for Energy-Awareness, since 2015
– MVAPICH2-Azure for Azure HPC IB instances, since 2019
– MVAPICH2-X-AWS for AWS HPC+EFA instances, since 2019
• Tools:– OSU MPI Micro-Benchmarks (OMB), since 2003
– OSU InfiniBand Network Analysis and Monitoring (INAM), since 2015
• Used by more than 3,075 organizations in 89 countries
• More than 694,000 (> 0.6 million) downloads from the OSU site directly
• Empowering many TOP500 clusters (Nov ‘19 ranking)– 3rd, 10,649,600-core (Sunway TaihuLight) at NSC, Wuxi, China
– 5th, 448, 448 cores (Frontera) at TACC
– 8th, 391,680 cores (ABCI) in Japan
– 14th, 570,020 cores (Nurion) in South Korea and many others
• Available with software stacks of many vendors and Linux Distros (RedHat, SuSE, OpenHPC, and Spack)
• Partner in the 5th ranked TACC Frontera system
• Empowering Top500 systems for more than 15 years
25Network Based Computing Laboratory
MVAPICH2 (MPI)-driven Infrastructure for ML/DL Training
MVAPICH2-X for CPU-Based Training
MVAPICH2-GDR for GPU-Based Training
Horovod
TensorFlow PyTorch MXNet
ML/DL Applications
26Network Based Computing Laboratory
Communication Benchmark: MV2-GDR vs. NCCL2 – Allreduce (DGX-2)• Optimized designs in upcoming MVAPICH2-GDR offer better/comparable performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on 1 DGX-2 node (16 Volta GPUs)
1
10
100
1000
10000
256K512K 1M 2M 4M 8M
16M32M
64M128M
256M
Late
ncy
(us)
Message Size (Bytes)
MVAPICH2-GDR-Next NCCL-2.5
~2.5X better
Platform: Nvidia DGX-2 system (16 Nvidia Volta GPUs connected with NVSwitch), CUDA 9.2
05
101520253035404550
8 16 32 64 128
256
512 1K 2K 4K 8K 16K
32K
64K
128K
Late
ncy
(us)
Message Size (Bytes)
MVAPICH2-GDR-Next NCCL-2.5
~4.7X better
27Network Based Computing Laboratory
Communication Benchmark at Scale• Optimized designs in upcoming MVAPICH2-GDR offer better performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) up to 1,536 GPUs
0
200
400
600
4 16 64 256 1K 4K 16K
Late
ncy
(us)
Message Size (Bytes)
Latency on 1,536 GPUs
MVAPICH2-GDR-2.3.3 NCCL 2.5
1.6X better
Platform: Dual-socket IBM POWER9 CPU, 6 NVIDIA Volta V100 GPUs, and 2-port InfiniBand EDR Interconnect
0
5
10
24 48 96192
384768
1536Band
wid
th (G
B/s)
Number of GPUs
128MB Message
SpectrumMPI 10.3 OpenMPI 4.0.1
NCCL 2.5 MVAPICH2-GDR-2.3.3
1.7X better
28Network Based Computing Laboratory
End-to-end Performance Benchmark for TF on Summit• ResNet-50 Training using
TensorFlow benchmark on SUMMIT -- 1536 Volta GPUs!
• 1,281,167 images
• Time/epoch = 3.6 seconds
• Total Time (90 epochs) = 3.6 x 90 = 332 seconds = 5.5 minutes!
050
100150200250300350400
1 2 4 6 12 24 48 96 192 384 768 1536
Imag
e pe
r sec
ond
Thou
sand
s
Number of GPUs
NCCL2 MVAPICH2-GDR
Platform: The Summit Supercomputer (#1 on Top500.org) – 6 NVIDIA Volta GPUs per node connected with NVLink, CUDA 9.2
*We observed errors for NCCL2 beyond 96 GPUs
MVAPICH2-GDR reaching ~0.35 million images per second for ImageNet-1k!
ImageNet-1k has 1.2 million images
29Network Based Computing Laboratory
• White-box profiling is needed for complex DL frameworks
• hvprof provides multiple types of valuable metrics for– 1) ML/DL developers and 2) Designers of MPI libraries
• Profile of Latency for Allreduce (NVLink, PCIe, IB, Omni-Path)
• Summary: Non-power of 2 is under-optimized for all libraries!
3. Communication in DL vs. HPC
CNTK
Awan et al., “Communication Profiling and Characterization of Deep Learning Workloads on Clusters with High-Performance Interconnects”, IEEE Micro (Magazine) ‘20, Hot Interconnects ’19.
Inception-v4– Intel MPI ResNet-101– MVAPICH2
Communication Middleware
Deep Learning Frameworks
Distributed Training Middleware (Horovod)
HPC Platforms
PyTorch
CPUs
GPUs InfiniBand
NCCL MPI
Proposed Profiling Infrastructure (hvprof)
MXNet TensorFlow
Omni-Path
PCIe
NVLink
High-Performance Interconnects
30Network Based Computing Laboratory
• Data-Parallelism– only for models that fit the memory
• Out-of-core models– Deeper model à Better accuracy but more memory required!
• Model parallelism can work for out-of-core models!
• Key Challenges– Model Partitioning is difficult for application programmers
– Finding the right partition (grain) size is hard
– cut at which layer and why?
– Developing a practical system for model-parallelism• Redesign DL Framework or create additional layers?
• Existing Communication middleware or extensions needed?
4. Beyond Data Parallelism in DNNs
Pascal GPU
Volta GPU
CPU Broadwell (128 GB)
CPU Skylake(192 GB)
Only possible with Model Parallelism!
Memory Consumption (Extrapolated)
1
2
4
8
16
32
64
128
256
512
0 100 200 300 400 500 600 700 800
Mem
ory
(GB)
Input Image Size (Width X Height)ResNet-1k ResNet-5k
A. A. Awan, A. Jain, Q. Anthony, H. Subramoni, and DK Panda, “HyPar-Flow: Exploiting MPI and Keras for Hybrid Parallel Training of TensorFlow models”, ISC ‘20 (accepted to be presented), https://arxiv.org/pdf/1911.05146.pdf
31Network Based Computing Laboratory
• HyPar-Flow is practical (easy-to-use) and high-performance (uses MPI)
– Based on Keras models and exploits TF 2.0 Eager Execution
– Leverages performance of MPI pt-to-pt. and collectives for communication
Model and Hybrid Parallelism
= Max Pooling
DNN with Identity (Residual) Mappings (Input)
HyPar-Flow
Allreducecomm
Allreducecomm
Allreducecomm
send(D) recv(D)
send(∇)recv(∇)
send(D) recv(D)
send(∇)recv(∇)
ImageConv1 Conv2 Conv3 Conv4
FCOutput
Partition-0 Partition-1 Partition-2
∇
D
∇
D
send(∇)
recv (D)
recv (∇)
send(D) D
∇Replica-0
Replica-1
hf.fit(model, num_partitions, num_replicas, strategy)
E.g. hf.fit(model, 3, 2, hybrid)
A. A. Awan, A. Jain, Q. Anthony, H. Subramoni, and DK Panda, “HyPar-Flow: Exploiting MPI and Keras for Hybrid Parallel Training of TensorFlow models”, ISC ‘20 (accepted to be presented), https://arxiv.org/pdf/1911.05146.pdf
32Network Based Computing Laboratory
• CPU based results
– AMD EPYC
– Intel Xeon
• Excellent speedups for
– VGG-19
– ResNet-110
– ResNet-1000 (1k layers)
• Able to train “future” models
– E.g. ResNet-5000 (a synthetic 5000-layer model we benchmarked)
Benchmarking HyPar-Flow in Different Configurations
0
200
400
600
800
1000
1200
1 2 4 8 16 32 64 128 256Im
g/se
c (hi
gher
is b
ette
r)
Nodes
48 Partitions (1 node) 768 Partitions (16 nodes)192 Partitions (4 nodes) 96 Partitions (2 nodes)
EBS = Diameter of Circle
110x speedup on 128 Intel Xeon Skylake nodes (TACC Stampede2)
33Network Based Computing Laboratory
• ResNet-1001 with variable batch size
• Approach: – 48 model-partitions for 56 cores
– 512 model-replicas for 512 nodes
– Total cores: 48 x 512 = 24,576
• Speedup– 253X on 256 nodes
– 481X on 512 nodes
• Scaling Efficiency– 98% up to 256 nodes
– 93.9% for 512 nodes
End-to-end Performance at Scale (512 nodes on Frontera)
481x speedup on 512 Intel Xeon Skylake nodes (TACC Frontera)
0500
100015002000250030003500400045005000
1 2 4 8 16 32 64 128 256 512
Img/
sec (
high
er is
bet
ter)
Nodes (=Replicas)
56 Partitions (1 node) Ideal
34Network Based Computing Laboratory
• Introduction
• Background
• ML/DL Benchmarks
• Solutions and Case Studies
• Conclusion and Future Directions
Agenda
35Network Based Computing Laboratory
• Reproducibility is a challenge for HPC and DL
• Challenges
– No standard and user-friendly way of benchmarking ML/DL models
– Disconnect between HPC and DL communities
– Metrics cause confusion b/w communities -- images/second, time to train, latency, bandwidth, etc.
• MLPerf, a good start but mostly for a single-node/GPU
– Can we extend for HPC systems?
• Deep500 – a meta DL framework to evaluate DL or DL frameworks?
• Other benchmarks – framework specific (tf_cnn_benchmarks) or low-level (latency/bw)
Future Direction 1: Reproducibility via Open Infrastructures
36Network Based Computing Laboratory
• Models beyond ResNet(s) – Generative Adversarial Networks, Transformer, BERT, GPT-2, Mini-go, etc.
• Applications beyond Image Classification – Neural Machine Translation, Language Processing, Recommendation Systems, Reinforcement Learning, Neural Code Gen. etc.
Investigate scale-up/scale-out for published models on current systems like Sierra and Summit and upcoming systems El Capitan and Frontier
Future Direction 2: Benchmarking Emerging Models and Areas
37Network Based Computing Laboratory
• Deep Learning is an important area
• Several benchmarking efforts to better understand DL workloads
– MLPerf, Deep500, DAWNBench, etc.
• DL on single-node systems is complex enough
– cuDNN, MKL, BLAS, shared-memory communication, CUDA IPC, etc.
• DL on HPC systems is even more challenging
– All of the single node + MPI, NCCL, Gloo, etc.
• Several design choices and benchmarking studies
• No single benchmark can cover all aspects!
Conclusion
38Network Based Computing Laboratory
Thank You!
Questions?
Network-Based Computing Laboratoryhttp://nowlab.cse.ohio-state.edu/
https://awan-10.github.io
The High-Performance MPI/PGAS Projecthttp://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Projecthttp://hidl.cse.ohio-state.edu/
39Network Based Computing Laboratory
Please join us here from 1pm - 5pm
• Tutorial on High Performance Distributed Deep Learning
• Several topics on Distributed DL Trends and Designs
• Includes a Hands-on section on Distributed TensorFlow
• Speakers: DK Panda, Ammar Awan, and Hari Subramoni
Top Related