The MVAPICH2 Project: Latest Status and Future Plans

Presentation at the MPICH BoF (SC '19)

Hari Subramoni, The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~subramon
History of MVAPICH

• A long time ago, in a galaxy far, far away… (actually 20 years ago), there existed:
  • MPICH – High-performance and widely portable implementation of the MPI standard
    – From ANL
  • MVICH – Implementation of MPICH ADI-2 for VIA
    – VIA – Virtual Interface Architecture (precursor to InfiniBand)
    – From LBL
  • VAPI – Verbs-level API
    – Initial InfiniBand API from IB vendors (older version of OFED/IB verbs)

MPICH + MVICH + VAPI = MVAPICH
Overview of the MVAPICH2 Project

• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – Based on MPICH 3.2.1
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1); started in 2001, first version available in 2002 (SC '02)
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for virtualization (MVAPICH2-Virt), available since 2015
  – Support for energy-awareness (MVAPICH2-EA), available since 2015
  – Support for InfiniBand network analysis and monitoring (OSU INAM) since 2015
  – Used by more than 3,050 organizations in 89 countries
  – More than 615,000 (> 0.6 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (Nov '19 ranking)
    • 3rd: 10,649,600 cores (Sunway TaihuLight) at the National Supercomputing Center in Wuxi, China
    • 5th: 448,448 cores (Frontera) at TACC
    • 8th: 391,680 cores (ABCI) in Japan
    • 14th: 570,020 cores (Nurion) in South Korea, and many others
  – Available with the software stacks of many vendors and Linux distros (RedHat, SuSE, and OpenHPC)
  – http://mvapich.cse.ohio-state.edu
• Empowering TOP500 systems for over a decade
• Partner in the 5th-ranked TACC Frontera system
MVAPICH2 Release Timeline and Downloads

[Figure: cumulative number of downloads from the OSU site, Sep 2004 to May 2019 (0 to 600,000), annotated with release milestones from MV 0.9.4 and MV2 0.9.0 through MV2 2.3.2, MV2-X 2.3rc2, MV2-GDR 2.3.2, MV2-Virt 2.2, OSU INAM 0.9.3, MV2-Azure 2.3.2, and MV2-AWS 2.3.]
Architecture of MVAPICH2 Software Family (HPC and DL)

High-Performance Parallel Programming Models
• Message Passing Interface (MPI)
• PGAS (UPC, OpenSHMEM, CAF, UPC++)
• Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk); a minimal hybrid sketch follows this list

High-Performance and Scalable Communication Runtime: Diverse APIs and Mechanisms
• Point-to-point primitives
• Collective algorithms
• Energy-awareness
• Remote memory access
• I/O and file systems
• Fault tolerance
• Virtualization
• Active messages
• Job startup
• Introspection & analysis

Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, Omni-Path, Elastic Fabric Adapter)
• Transport protocols: RC, SRD, UD, DC
• Modern features: UMR, ODP, SR-IOV, multi-rail

Support for Modern Multi-/Many-core Architectures (Intel Xeon, OpenPOWER, Xeon Phi, ARM, NVIDIA GPGPU)
• Transport mechanisms: shared memory, CMA, IVSHMEM, XPMEM
• Modern features: Optane*, NVLink, CAPI*

* Upcoming
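To make the hybrid MPI + X model above concrete, here is a minimal sketch of an MPI + OpenMP program. It is generic MPI/OpenMP code that any compliant library (including MVAPICH2) can run; the loop body is only an illustrative workload.

```c
/* Hybrid MPI + OpenMP sketch: MPI across processes, OpenMP threads within a rank.
 * Build with, e.g.: mpicc -fopenmp hybrid.c -o hybrid */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* THREAD_FUNNELED: only the main thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double local = 0.0;
    /* OpenMP handles intra-rank parallelism... */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < 1000000; i++)
        local += 1.0 / (i + 1.0 + rank);

    double global = 0.0;
    /* ...while MPI handles inter-process communication. */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("%d ranks x %d threads, sum = %f\n",
               nranks, omp_get_max_threads(), global);

    MPI_Finalize();
    return 0;
}
```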
MVAPICH2 Software Family

High-Performance Parallel Programming Libraries
• MVAPICH2 – Support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
• MVAPICH2-X – Advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with a unified communication runtime
• MVAPICH2-GDR – Optimized MPI for clusters with NVIDIA GPUs
• MVAPICH2-Virt – High-performance and scalable MPI for hypervisor- and container-based HPC clouds
• MVAPICH2-EA – Energy-aware and high-performance MPI
• MVAPICH2-MIC – Optimized MPI for clusters with Intel KNC

Microbenchmarks
• OMB – Microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs; see the latency sketch after this list

Tools
• OSU INAM – Network monitoring, profiling, and analysis for clusters, with MPI and scheduler integration
• OEMT – Utility to measure the energy consumption of MPI applications
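For reference, OMB's point-to-point benchmarks such as osu_latency measure a simple ping-pong pattern. The sketch below is an illustrative reduction of that idea, not the actual OMB source; the real benchmark adds warm-up iterations and a full message-size sweep.

```c
/* Minimal ping-pong latency sketch in the spirit of OMB's osu_latency.
 * Illustrative only; run with exactly 2 MPI ranks. */
#include <mpi.h>
#include <stdio.h>

#define ITERS 1000
#define SIZE  8   /* bytes; osu_latency sweeps many sizes */

int main(int argc, char **argv)
{
    char buf[SIZE];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)   /* half the round-trip time is the one-way latency */
        printf("avg one-way latency: %.2f us\n",
               (t1 - t0) * 1e6 / (2.0 * ITERS));

    MPI_Finalize();
    return 0;
}
```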
MVAPICH2 – Basic MPI

[Figure: six performance panels]
• Fast startup on emerging many-cores: MPI_Init time (ms) on Frontera for 56 to 57,344 processes; Intel MPI 2019 takes 4.5 s vs. 3.9 s for MVAPICH2 2.3.2 at full scale.
• Advanced Allreduce with SHARP: Allreduce latency (s) at (4,28), (8,28), and (16,28) (nodes, PPN); MVAPICH2-SHArP is up to 13% better than MVAPICH2.
• Enhanced intra-node performance for ARM: point-to-point latency (us) for 0 B to 4 KB messages with MVAPICH2-2.3.2.
• Enhanced intra-node performance for OpenPOWER: point-to-point latency (us) for 4 B to 2 KB messages; MVAPICH2 2.3.2 reaches 0.25 us vs. SpectrumMPI-10.3.0.01.
• MPI_Bcast using RDMA_CM-based multicast: latency (us) on 8 to 512 nodes for 16 B to 512 KB messages.
• Advanced non-blocking Allreduce with SHARP: communication-computation overlap (%) for 4 B to 128 B messages, 1 PPN on 8 nodes; MVAPICH2-SHArP vs. MVAPICH2 (higher is better). A minimal overlap sketch follows this list.
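The non-blocking Allreduce overlap panel above measures how much independent computation can proceed while the collective is in flight. Below is a minimal, generic MPI-3 sketch of that pattern; it contains no MVAPICH2- or SHARP-specific calls, since any network offload is transparent to the application, and the busy loop is only a stand-in workload.

```c
/* Communication/computation overlap with a non-blocking Allreduce. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double in = rank + 1.0, out = 0.0;
    MPI_Request req;

    /* Start the reduction; the library (or switch offload) can progress it... */
    MPI_Iallreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);

    /* ...while the CPU performs independent computation. */
    double busy = 0.0;
    for (int i = 0; i < 1000000; i++)
        busy += i * 1e-9;

    /* Complete the collective before using 'out'. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    if (rank == 0)
        printf("allreduce sum = %f (overlapped work = %f)\n", out, busy);

    MPI_Finalize();
    return 0;
}
```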
MVAPICH2-X – Advanced MPI + PGAS + Tools
[Figure: four performance panels]
• CMA-aware MPI_Bcast: latency (us) for 1 KB to 8 MB messages on Power8 (160 processes), comparing MVAPICH2-2.3a, OpenMPI 2.1.0, and MVAPICH2-X 2.3rc1, with annotations marking the shared-memory (SHMEM) and CMA regimes.
• Shared-address-space (XPMEM)-based collective designs: OSU_Allreduce and OSU_Reduce latency (us) for 16 KB to 4 MB messages (Broadwell, 256 processes), comparing MVAPICH2-2.3b, IMPI-2017v1.132, and MVAPICH2-X; up to 1.8X faster Allreduce and 4X faster Reduce.
• Performance of P3DFFT with optimized asynchronous progress: time per loop (s) for 56 to 448 processes, comparing MVAPICH2-X Async, MVAPICH2-X Default, and Intel MPI 18.1.163; annotated gains of 44% and 33% for the async design.
• Neuron with the YuEtAl2012 model: execution time (s) for 512 to 4,096 processes; MVAPICH2-X improves over MVAPICH2 (annotated 10%, 76%, and 39%) by reducing the overhead of the RC protocol for connection establishment and communication.

A minimal MPI + OpenSHMEM hybrid sketch, illustrating the MPI + PGAS model that MVAPICH2-X targets, follows below.
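The "MPI + PGAS" part of MVAPICH2-X refers to mixing MPI and a PGAS model such as OpenSHMEM in one program over a unified communication runtime. The sketch below is a hedged illustration of that hybrid style using standard MPI and OpenSHMEM calls only; the supported initialization/finalization ordering and compiler wrappers for MVAPICH2-X should be taken from its user guide.

```c
/* Hybrid MPI + OpenSHMEM sketch (illustrative; check the MVAPICH2-X user
 * guide for the supported initialization order and build instructions). */
#include <mpi.h>
#include <shmem.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    shmem_init();                     /* PGAS runtime alongside MPI */

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int me = shmem_my_pe(), npes = shmem_n_pes();

    /* Symmetric (remotely accessible) allocation on every PE. */
    long *slots = (long *) shmem_malloc(npes * sizeof(long));
    slots[me] = -1;
    shmem_barrier_all();

    /* One-sided PGAS put: each PE writes its id into its slot on PE 0. */
    shmem_long_p(&slots[me], (long) me, 0);
    shmem_barrier_all();

    /* Two-sided and collective MPI remain available in the same program. */
    long mine = me, sum = 0;
    MPI_Reduce(&mine, &sum, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("%d PEs, MPI sum of PE ids = %ld\n", npes, sum);

    shmem_free(slots);
    shmem_finalize();
    MPI_Finalize();
    return 0;
}
```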
MVAPICH2-GDR – Optimized MPI for clusters with NVIDIA GPUs
[Figure: performance panels and COMB timing table]
• Best performance for GPU-based transfers: point-to-point latency (us) for 0 B to 8 KB messages; MV2-GDR-2.3 reaches 1.85 us, about 10x better than MV2 without GDR.
• Enhanced kernel-based datatype processing for the COSMO weather prediction model on x86: normalized execution time on 16 to 96 GPUs for the default, callback-based, and event-based designs.
• TensorFlow training with MVAPICH2-GDR on Summit: images per second (thousands) on 1 to 1,536 GPUs, MVAPICH2-GDR-2.3.2 vs. NCCL-2.4; time per epoch of 3.6 seconds, about 5.5 minutes in total for 90 epochs.
• GPU-based MPI_Allreduce on Summit: bandwidth (GB/s) for a 128 MB message on 24 to 1,536 GPUs, comparing SpectrumMPI 10.2.0.11, OpenMPI 4.0.1, NCCL 2.4, and MVAPICH2-GDR-2.3.2.

Enhanced kernel-based datatype processing for the COMB application kernel on POWER9 (16 GPUs; test: Comm mpi, Mesh cuda, Device Buffers mpi_type); times in seconds:

Library                         pre-comm  post-recv  post-send  wait-recv  wait-send  post-comm  start-up  test-comm  bench-comm
Spectrum MPI 10.3                0.0001    0.0000     1.6021     1.7204     0.0112     0.0001     0.0004    7.7383     83.6229
MVAPICH2-GDR 2.3.2               0.0001    0.0000     0.0862     0.0871     0.0018     0.0001     0.0009    0.3558      4.4396
MVAPICH2-GDR 2.3.3 (upcoming)    0.0001    0.0000     0.0030     0.0032     0.0001     0.0001     0.0009    0.0133      0.1602

MVAPICH2-GDR 2.3.2 is roughly 18x faster than Spectrum MPI 10.3 on bench-comm, and the upcoming 2.3.3 improves on 2.3.2 by a further 27x. A minimal GPU-aware point-to-point sketch follows below.
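The GPU results above rely on CUDA-aware ("GPU-aware") MPI: device pointers are passed directly to MPI calls and the library moves the data, using GPUDirect where available. Below is a minimal sketch, assuming a CUDA-aware MPI such as MVAPICH2-GDR with CUDA support enabled at run time (e.g., MV2_USE_CUDA=1 per the MVAPICH2-GDR user guide); buffer size and ranks are illustrative.

```c
/* GPU-aware point-to-point transfer: device buffers passed directly to MPI.
 * Build with the library's MPI compiler wrapper and link the CUDA runtime. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;                      /* 1M floats (~4 MB) */
    float *dbuf;
    cudaMalloc((void **) &dbuf, n * sizeof(float));
    cudaMemset(dbuf, 0, n * sizeof(float));

    if (rank == 0) {
        /* Device buffer handed straight to MPI_Send: no explicit
           cudaMemcpy to a host staging buffer is needed. */
        MPI_Send(dbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(dbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d floats into a GPU buffer\n", n);
    }

    cudaFree(dbuf);
    MPI_Finalize();
    return 0;
}
```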
MVAPICH2-X Advanced Support for HPC Clouds

[Figure: performance on Amazon EFA and Microsoft Azure]
• Performance on Amazon EFA (c5n.18xlarge instances, Intel Xeon Platinum 8124M @ 3.00 GHz; MVAPICH2-X 2.3rc2 + SRD support vs. Open MPI v3.1.3 with libfabric 1.7): miniGhost and CloverLeaf execution times (s) for 72 (2x36), 144 (4x36), and 288 (8x36) processes, comparing MV2X over UD and SRD transports with OpenMPI; annotated improvements of 10% and 27.5% for MV2X.
• Performance of Radix on Microsoft Azure: total execution time (s, lower is better); MVAPICH2-X is up to 3x faster than HPCx on HC instances and up to 38% faster on HB instances with 60 (1x60), 120 (2x60), and 240 (4x60) processes.

• MVAPICH2-Azure 2.3.2
  • Released on 08/16/2019
  • Major features and enhancements
    – Based on MVAPICH2-2.3.2
    – Enhanced tuning for point-to-point and collective operations
    – Targeted for Azure HB & HC virtual machine instances
    – Flexibility for 'one-click' deployment
    – Tested with Azure HB & HC VM instances
  • Available for download from http://mvapich.cse.ohio-state.edu/downloads/
  • Detailed user guide: http://mvapich.cse.ohio-state.edu/userguide/mv2-azure/

• MVAPICH2-X-AWS 2.3
  • Released on 08/12/2019
  • Major features and enhancements
    – Based on MVAPICH2-X 2.3
    – Support for the Amazon EFA adapter's Scalable Reliable Datagram (SRD) transport
    – Support for XPMEM-based intra-node communication for point-to-point and collectives
    – Enhanced tuning for point-to-point and collective operations
    – Targeted for AWS instances with the Amazon Linux 2 AMI and EFA support
    – Tested with c5n.18xlarge instances
MVAPICH2 – Future Roadmap and Plans for Exascale

• Update to the MPICH 3.3.2 CH3 channel (2020)
• Initial support for the CH4 channel (2020/2021)
• Making the CH4 channel the default (2021/2022)
• Performance and memory scalability toward 1M-10M cores
• Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF, …)
  • MPI + Task*
• Enhanced optimization for GPUs and FPGAs*
• Taking advantage of advanced features of Mellanox InfiniBand
  • Tag Matching*
  • Adapter Memory*
• Enhanced communication schemes for upcoming architectures
  • NVLINK*
  • CAPI*
• Extended topology-aware collectives
• Extended energy-aware designs and virtualization support
• Extended support for the MPI Tools Interface (as in MPI 3.0)
• Extended FT support
• Support for features marked * will be available in future MVAPICH2 releases
Thank You!

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
http://web.cse.ohio-state.edu/~subramon

The High-Performance MPI/PGAS Project
http://mvapich.cse.ohio-state.edu/

The High-Performance Deep Learning Project
http://hidl.cse.ohio-state.edu/