Software Libraries and Middleware for Exascale Systems


Transcript of Software Libraries and Middleware for Exascale Systems

Page 1:

Designing Scalable HPC, Deep Learning, Big Data and Cloud Middleware for Exascale Systems

Dhabaleswar K. (DK) Panda

The Ohio State University

E-mail: [email protected]

http://www.cse.ohio-state.edu/~panda

Keynote Talk at the HPC-AI UK 2019 Conference (September '19)

Follow us on https://twitter.com/mvapich

Page 2:

High-End Computing (HEC): PetaFlop to ExaFlop

Expected to have an ExaFlop system in 2020-2021!

100 PFlops in 2017

149 PFlops in 2018

1 EFlops in 2020-2021?

Page 3:

Big Data (Hadoop, Spark, HBase, Memcached, etc.)

Deep Learning (Caffe, TensorFlow, BigDL, etc.)

HPC (MPI, RDMA, Lustre, etc.)

Increasing Usage of HPC, Big Data and Deep Learning

Convergence of HPC, Big Data, and Deep Learning!

Increasing Need to Run these applications on the Cloud!!

Page 4:

Can We Run HPC, Big Data and Deep Learning Jobs on Existing HPC Infrastructure?

Physical Compute


Page 7:

Can We Run HPC, Big Data and Deep Learning Jobs on Existing HPC Infrastructure?

Spark Job

Hadoop Job

Deep Learning Job

Page 8:

Presentation Overview

• MVAPICH Project
  – MPI and PGAS Library with CUDA-Awareness

• HiBD Project
  – High-Performance Big Data Analytics Library

• HiDL Project
  – High-Performance Deep Learning

• Public Cloud Deployment
  – Microsoft-Azure and Amazon-AWS

• Conclusions

Page 9:

Parallel Programming Models Overview

[Figure: three abstract machine models. Shared Memory Model (P1-P3 over a single shared memory; SHMEM, DSM). Distributed Memory Model (P1-P3 each with private memory; MPI - Message Passing Interface). Partitioned Global Address Space (PGAS) (P1-P3 with private memories under a logical shared memory; OpenSHMEM, UPC, Chapel, X10, CAF, …)]

• Programming models provide abstract machine models

• Models can be mapped onto different types of systems
  – e.g., Distributed Shared Memory (DSM), MPI within a node, etc.

• PGAS models and hybrid MPI+PGAS models are gradually gaining importance
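
To make the distributed-memory (message-passing) model above concrete, here is a minimal, generic MPI sketch (not MVAPICH2-specific) in which rank 0 sends a value to rank 1: each process owns its private memory, and all sharing happens through explicit messages.

    /* Minimal sketch of the message-passing model: rank 0 sends an integer to
     * rank 1; all data exchange is explicit. Compile with an MPI wrapper
     * compiler such as mpicc and run with at least two ranks. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            value = 42;                /* lives only in rank 0's private memory */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }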

Page 10:

Overview of the MVAPICH2 Project

• High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)

– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002

– MVAPICH2-X (MPI + PGAS), Available since 2011

– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014

– Support for Virtualization (MVAPICH2-Virt), Available since 2015

– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015

– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015

– Used by more than 3,025 organizations in 89 countries

– More than 589,000 (> 0.5 million) downloads from the OSU site directly

– Empowering many TOP500 clusters (June '19 ranking)

• 3rd, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China

• 5th, 448,448 cores (Frontera) at TACC

• 8th, 391,680 cores (ABCI) in Japan

• 15th, 570,020 cores (Nurion) in South Korea and many others

– Available with software stacks of many vendors and Linux Distros (RedHat, SuSE, and OpenHPC)

– http://mvapich.cse.ohio-state.edu

• Empowering Top500 systems for over a decade

Partner in the TACC Frontera System

Page 11:

MVAPICH2 Release Timeline and Downloads

[Figure: cumulative number of downloads from the OSU site, Sep-04 through Apr-19, approaching 600,000; annotated with release milestones including MV 0.9.4, MV2 0.9.0, MV2 0.9.8, MV2 1.0-1.9, MV2-GDR 2.0b, MV2-MIC 2.0, MV2-Virt 2.2, MV2-X 2.3rc2, MV2-GDR 2.3.2, MV2 2.3.2, OSU INAM 0.9.3, MV2-Azure 2.3.2, and MV2-AWS 2.3]

Page 12:

Architecture of MVAPICH2 Software Family

High Performance Parallel Programming Models: Message Passing Interface (MPI); PGAS (UPC, OpenSHMEM, CAF, UPC++); Hybrid --- MPI + X (MPI + PGAS + OpenMP/Cilk)

High Performance and Scalable Communication Runtime (diverse APIs and mechanisms): point-to-point primitives, collectives algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, introspection & analysis

Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, Omni-Path, Elastic Fabric Adapter) and Modern Multi-/Many-core Architectures (Intel Xeon, OpenPOWER, Xeon Phi, ARM, NVIDIA GPGPU)

Transport protocols: RC, SRD, UD, DC. Modern features: UMR, ODP, SR-IOV, Multi-Rail. Transport mechanisms: Shared Memory, CMA, IVSHMEM, XPMEM. Modern features: Optane*, NVLink, CAPI* (* upcoming)

Page 13:

MVAPICH2 Software Family

Requirements: Library

MPI with IB, iWARP, Omni-Path, and RoCE: MVAPICH2

Advanced MPI features/support, OSU INAM, PGAS and MPI+PGAS with IB, Omni-Path, and RoCE: MVAPICH2-X

MPI with IB, RoCE & GPU and support for Deep Learning: MVAPICH2-GDR

HPC Cloud with MPI & IB: MVAPICH2-Virt

Energy-aware MPI with IB, iWARP and RoCE: MVAPICH2-EA

MPI energy monitoring tool: OEMT

InfiniBand network analysis and monitoring: OSU INAM

Microbenchmarks for measuring MPI and PGAS performance: OMB

Page 14:

Big Data (Hadoop, Spark, HBase, Memcached, etc.)

Deep Learning (Caffe, TensorFlow, BigDL, etc.)

HPC (MPI, RDMA, Lustre, etc.)

Convergent Software Stacks for HPC, Big Data and Deep Learning

Page 15:

One-way Latency: MPI over IB with MVAPICH2

[Figure: small-message and large-message one-way latency (us) vs. message size for TrueScale-QDR, ConnectX-3-FDR, ConnectIB-Dual FDR, ConnectX-4-EDR, Omni-Path, and ConnectX-6 HDR; small-message latencies of roughly 1.0-1.2 us (1.01, 1.04, 1.1, 1.11, 1.15, 1.19)]

Test beds: TrueScale-QDR, ConnectIB-Dual FDR, ConnectX-4-EDR, Omni-Path, and ConnectX-6-HDR on 3.1 GHz deca-core (Haswell) Intel nodes with PCIe Gen3 and an IB or Omni-Path switch; ConnectX-3-FDR on 2.8 GHz deca-core (IvyBridge) Intel nodes with PCIe Gen3 and an IB switch
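
The latency numbers above follow the osu_latency methodology from the OSU Micro-Benchmarks (OMB). Below is a simplified ping-pong sketch of that measurement; it assumes exactly two ranks (ideally on different nodes) and is illustrative rather than the actual OMB code.

    /* Simplified ping-pong latency sketch in the spirit of osu_latency:
     * one-way latency = measured round-trip time / 2. Run with two ranks. */
    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    #define ITERS 10000
    #define SIZE  8                      /* message size in bytes; sweep in a real run */

    int main(int argc, char **argv)
    {
        char buf[SIZE];
        int rank;
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        memset(buf, 'a', SIZE);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (int i = 0; i < ITERS; i++) {
            if (rank == 0) {
                MPI_Send(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();
        if (rank == 0)
            printf("one-way latency: %.2f us\n", (t1 - t0) * 1e6 / (2.0 * ITERS));
        MPI_Finalize();
        return 0;
    }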

Page 16:

Bandwidth: MPI over IB with MVAPICH2

[Figure: unidirectional and bidirectional bandwidth (MBytes/sec) vs. message size for the same six interconnects; peak unidirectional bandwidths of 3,373, 6,356, 12,083, 12,366, 12,590, and 24,532 MB/s, and peak bidirectional bandwidths of 6,228, 12,161, 21,227, 21,983, 24,136, and 48,027 MB/s]

Test beds: same Haswell/IvyBridge PCIe Gen3 nodes and IB/Omni-Path switches as on the previous slide
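
The bandwidth results are measured in the style of osu_bw, where the sender keeps a window of non-blocking sends in flight before synchronizing. A simplified two-rank sketch (illustrative, not the actual OMB code) follows.

    /* Simplified unidirectional bandwidth sketch in the spirit of osu_bw: the
     * sender keeps a window of non-blocking sends in flight, then both sides
     * synchronize with a zero-byte ack. Run with exactly two ranks. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define WINDOW 64
    #define ITERS  100
    #define SIZE   (1 << 20)             /* 1 MB messages */

    int main(int argc, char **argv)
    {
        char *buf = calloc(1, SIZE);
        MPI_Request req[WINDOW];
        int rank;
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (int i = 0; i < ITERS; i++) {
            for (int w = 0; w < WINDOW; w++) {
                if (rank == 0)
                    MPI_Isend(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req[w]);
                else
                    MPI_Irecv(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req[w]);
            }
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            if (rank == 0)               /* zero-byte ack keeps both sides in step */
                MPI_Recv(NULL, 0, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            else
                MPI_Send(NULL, 0, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
        }
        t1 = MPI_Wtime();
        if (rank == 0)
            printf("bandwidth: %.0f MB/s\n",
                   (double)SIZE * WINDOW * ITERS / (t1 - t0) / 1e6);
        free(buf);
        MPI_Finalize();
        return 0;
    }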

Page 17:

Startup Performance on TACC Frontera

• MPI_Init takes 3.9 seconds on 57,344 processes on 1,024 nodes
• All numbers reported with 56 processes per node

New designs available in MVAPICH2-2.3.2

[Figure: MPI_Init time on Frontera (milliseconds) vs. number of processes, from 56 up to 57,344, comparing Intel MPI 2019 (about 4.5 s at full scale) and MVAPICH2 2.3.2 (about 3.9 s at full scale)]
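
As a minimal sketch of how the startup time above can be measured from an application, the code below times MPI_Init per process and reduces the result to the maximum across ranks; it uses a POSIX clock and is not tied to any particular MPI library.

    /* Sketch: measure MPI_Init time per process, take the maximum across ranks. */
    #include <mpi.h>
    #include <stdio.h>
    #include <time.h>

    static double now_sec(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(int argc, char **argv)
    {
        double t0 = now_sec();
        MPI_Init(&argc, &argv);
        double local = now_sec() - t0, max_init;
        int rank, size;

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Reduce(&local, &max_init, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("MPI_Init on %d processes: %.0f ms (max across ranks)\n",
                   size, max_init * 1e3);
        MPI_Finalize();
        return 0;
    }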

Page 18:

MPI_Allreduce on KNL + Omni-Path (10,240 Processes)

[Figure: OSU Micro-Benchmark MPI_Allreduce latency (us) at 64 PPN, for 4 B-4 KB and 8 KB-256 KB message sizes, comparing MVAPICH2, MVAPICH2-OPT, and IMPI; MVAPICH2-OPT is up to 2.4X lower]

• For MPI_Allreduce latency with 32K bytes, MVAPICH2-OPT can reduce the latency by 2.4X

M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda, Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design, Supercomputing '17. Available since MVAPICH2-X 2.3b
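
The measurements above are OSU Micro-Benchmark style timings; the sketch below times MPI_Allreduce on a 32 KB buffer averaged over many iterations (illustrative only, with hypothetical iteration counts).

    /* Sketch of an MPI_Allreduce latency measurement in the spirit of
     * osu_allreduce, here for a 32 KB buffer of floats. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define COUNT (32 * 1024 / sizeof(float))   /* 32 KB message */
    #define ITERS 1000

    int main(int argc, char **argv)
    {
        float *sendbuf, *recvbuf;
        int rank;
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        sendbuf = calloc(COUNT, sizeof(float));
        recvbuf = calloc(COUNT, sizeof(float));

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (int i = 0; i < ITERS; i++)
            MPI_Allreduce(sendbuf, recvbuf, COUNT, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("MPI_Allreduce(32KB) avg latency: %.1f us\n",
                   (t1 - t0) * 1e6 / ITERS);
        free(sendbuf); free(recvbuf);
        MPI_Finalize();
        return 0;
    }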

Page 19:

Shared Address Space (XPMEM)-based Collectives Design

[Figure: OSU_Allreduce latency (us) on Broadwell with 256 processes, 16 KB-4 MB message sizes, comparing MVAPICH2-2.3b, IMPI-2017v1.132, and MVAPICH2-X-2.3rc1; annotated values 73.2 us and 1.8X]

• "Shared Address Space"-based true zero-copy reduction collective designs in MVAPICH2

• Computation/communication offloaded to peer ranks in the reduction collective operation

• Up to 4X improvement for 4 MB Reduce and up to 1.8X improvement for 4 MB Allreduce

[Figure: OSU_Reduce latency (us) on Broadwell with 256 processes, 16 KB-4 MB message sizes, comparing MVAPICH2-2.3b, IMPI-2017v1.132, and MVAPICH2-2.3rc1; annotated values 36.1, 37.9, and 16.8 us and 4X]

J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and D. Panda, Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores, International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018. Available since MVAPICH2-X 2.3rc1
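
For reference, the communication pattern being optimized is a plain MPI_Reduce over a large contiguous buffer, as in the generic sketch below; the zero-copy shared-address-space (XPMEM) path is selected inside the library, not in user code.

    /* Sketch of a large-message MPI_Reduce: every rank contributes a 4 MB
     * vector and the element-wise sums land on the root. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int count = 1 << 20;            /* 1M floats = 4 MB per rank */
        int rank;
        float *sendbuf, *recvbuf = NULL;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        sendbuf = malloc(count * sizeof(float));
        for (int i = 0; i < count; i++) sendbuf[i] = (float)rank;
        if (rank == 0) recvbuf = malloc(count * sizeof(float));

        /* element-wise sum across all ranks, gathered on rank 0 */
        MPI_Reduce(sendbuf, recvbuf, count, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0) printf("recvbuf[0] = %.0f\n", recvbuf[0]);
        free(sendbuf); free(recvbuf);
        MPI_Finalize();
        return 0;
    }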

Page 20:

Benefits of the New Asynchronous Progress Design: Broadwell + InfiniBand

• Up to 33% performance improvement in the P3DFFT application with 448 processes
• Up to 29% performance improvement in the HPL application with 896 processes

[Figure: P3DFFT time per loop in seconds (lower is better) and High Performance Linpack (HPL) performance in GFLOPS (higher is better) at PPN=28, comparing MVAPICH2 Async, MVAPICH2 Default, and IMPI 2019; annotated gains of 8%-33% across scales; memory consumption = 69%]

A. Ruhela, H. Subramoni, S. Chakraborty, M. Bayatpour, P. Kousha, and D. K. Panda, Efficient Asynchronous Communication Progress for MPI without Dedicated Resources, EuroMPI 2018. Enhanced version accepted for the PARCO journal.

Available since MVAPICH2-X 2.3rc1
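
The asynchronous progress design benefits codes that post non-blocking operations and compute while transfers are pending. A generic sketch of that overlap pattern (a simple ring exchange with a stand-in compute kernel) is shown below; the library-level progress machinery is transparent to this code.

    /* Sketch of computation/communication overlap with non-blocking MPI:
     * work runs between MPI_Isend/MPI_Irecv and MPI_Waitall, and asynchronous
     * progress lets the transfer advance meanwhile. */
    #include <mpi.h>
    #include <stdlib.h>

    static void compute(double *a, int n)          /* stand-in for application work */
    {
        for (int i = 0; i < n; i++) a[i] = a[i] * 1.0001 + 1.0;
    }

    int main(int argc, char **argv)
    {
        const int n = 1 << 20;
        int rank, size;
        double *halo_out, *halo_in, *interior;
        MPI_Request reqs[2];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        halo_out = calloc(n, sizeof(double));
        halo_in  = calloc(n, sizeof(double));
        interior = calloc(n, sizeof(double));

        /* simple ring exchange: receive from the left, send to the right */
        MPI_Irecv(halo_in,  n, MPI_DOUBLE, (rank - 1 + size) % size, 0,
                  MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(halo_out, n, MPI_DOUBLE, (rank + 1) % size, 0,
                  MPI_COMM_WORLD, &reqs[1]);

        compute(interior, n);                      /* overlapped with the transfers */

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        free(halo_out); free(halo_in); free(interior);
        MPI_Finalize();
        return 0;
    }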

Page 21:

Intra-node Point-to-Point Performance on OpenPOWER

Platform: two OpenPOWER (POWER9, ppc64le) nodes using Mellanox EDR (MT4121) HCAs

[Figure: intra-socket small-message latency (0.22 us), large-message latency, bandwidth, and bi-directional bandwidth vs. message size, comparing MVAPICH2-2.3.1 and SpectrumMPI-2019.02.07]

Page 22:

Point-to-point: Latency & Bandwidth (Intra-socket) on Mayer

[Figure: small-, medium-, and large-message latency (us) and small-, medium-, and large-message bandwidth (MB/s) vs. message size, comparing MVAPICH2-X and OpenMPI+UCX; MVAPICH2-X up to 8.2x better]

Page 23:

Point-to-point: Latency & Bandwidth (Inter-socket) on Mayer

[Figure: small-, medium-, and large-message latency (us) and small-, medium-, and large-message bandwidth (MB/s) vs. message size, comparing MVAPICH2-X and OpenMPI+UCX; MVAPICH2-X up to 3.5x, 5x, and 8.3x better]

Page 24:

GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU

At Sender:    MPI_Send(s_devbuf, size, …);
At Receiver:  MPI_Recv(r_devbuf, size, …);
(GPU data movement is handled inside MVAPICH2)

• Standard MPI interfaces used for unified data movement

• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)

• Overlaps data movement from GPU with RDMA transfers

High Performance and High Productivity
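
A slightly fuller, hedged sketch of the CUDA-aware usage model above: device pointers obtained from cudaMalloc are passed directly to MPI_Send/MPI_Recv, and a CUDA-aware library such as MVAPICH2-GDR moves the GPU data internally (with MVAPICH2, GPU support is enabled at run time via MV2_USE_CUDA=1, as on the HOOMD-blue slide later).

    /* Sketch of CUDA-aware MPI: GPU device pointers are handed straight to
     * MPI calls. Requires a CUDA-aware MPI build; compile with nvcc or link
     * against the CUDA runtime. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        const int count = 1 << 20;             /* 4 MB of floats */
        int rank;
        float *devbuf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        cudaSetDevice(0);                      /* one GPU per rank assumed */
        cudaMalloc((void **)&devbuf, count * sizeof(float));
        cudaMemset(devbuf, 0, count * sizeof(float));

        if (rank == 0)                         /* device buffer passed directly */
            MPI_Send(devbuf, count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(devbuf, count, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        cudaFree(devbuf);
        MPI_Finalize();
        return 0;
    }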

Page 25:

Optimized MVAPICH2-GDR Design

[Figure: GPU-GPU inter-node latency, bandwidth, and bi-bandwidth vs. message size, comparing MV2 (NO-GDR) and MV2-GDR 2.3; 1.85 us small-message latency (up to 11X lower) and up to 9x/10x higher bandwidth and bi-bandwidth]

MVAPICH2-GDR-2.3; Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores; NVIDIA Volta V100 GPU; Mellanox ConnectX-4 EDR HCA; CUDA 9.0; Mellanox OFED 4.0 with GPU-Direct RDMA

Page 26:

Application-Level Evaluation (HOOMD-blue)

• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)

• HOOMD-blue version 1.0.5

• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384

[Figure: average time steps per second (TPS) vs. number of processes (4-32) for 64K-particle and 256K-particle runs, comparing MV2 and MV2+GDR; 2X improvement in both cases]

Page 27:

Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland

[Figure: normalized execution time on the Wilkes GPU cluster (4-32 GPUs) and the CSCS GPU cluster (16-96 GPUs), comparing Default, Callback-based, and Event-based designs]

• 2X improvement on 32 GPU nodes
• 30% improvement on 96 GPU nodes (8 GPUs/node)

C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS '16

On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and the Cosmo application

Cosmo model: http://www2.cosmo-model.org/content/tasks/operational/meteoSwiss/
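
The IPDPS '16 work above targets non-contiguous data movement on GPU-enabled systems. As background, the sketch below shows the standard MPI way of describing a non-contiguous (strided) layout with a derived datatype so that a single send/receive covers the whole pattern; it is a generic host-memory example, not the GPU-optimized design itself.

    /* Sketch of non-contiguous data movement with an MPI derived datatype: a
     * strided column of a 2-D row-major field is described with
     * MPI_Type_vector and exchanged as a single message. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        const int nx = 1024, ny = 1024;
        static double field[1024 * 1024];      /* row-major 2-D field */
        int rank;
        MPI_Datatype column;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* ny blocks of 1 double each, separated by a stride of nx doubles */
        MPI_Type_vector(ny, 1, nx, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);

        if (rank == 0)
            MPI_Send(&field[0], 1, column, 1, 0, MPI_COMM_WORLD);      /* first column */
        else if (rank == 1)
            MPI_Recv(&field[nx - 1], 1, column, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                               /* last column */

        MPI_Type_free(&column);
        MPI_Finalize();
        return 0;
    }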

Page 28:

SPEC MPI 2007 Benchmarks: Broadwell + InfiniBand

[Figure: execution time (s) of MILC, Leslie3D, POP2, LAMMPS, WRF2, and LU with Intel MPI 18.1.163 vs. MVAPICH2-X; annotated per-benchmark differences of 31%, 29%, 5%, -12%, 1%, and 11%]

MVAPICH2-X outperforms Intel MPI by up to 31%

Configuration: 448 processes on 16 Intel E5-2680v4 (Broadwell) nodes with 28 PPN, interconnected with a 100 Gbps Mellanox MT4115 EDR ConnectX-4 HCA

Page 29:

Application Scalability on Skylake and KNL (Stampede2)

[Figure: MVAPICH2 execution time (s) for MiniFE (1300x1300x1300, ~910 GB) at 2048-8192 processes, NEURON (YuEtAl2012), and Cloverleaf (bm64, MPI+OpenMP, NUM_OMP_THREADS = 2) on Skylake (48 PPN) and KNL (64 or 68 PPN) nodes]

Runtime parameters: MV2_SMPI_LENGTH_QUEUE=524288 PSM2_MQ_RNDV_SHM_THRESH=128K PSM2_MQ_RNDV_HFI_THRESH=128K

Courtesy: Mahidhar Tatineni @SDSC, Dong Ju (DJ) Choi @SDSC, and Samuel Khuvis @OSC. Testbed: TACC Stampede2 using MVAPICH2-2.3b

Page 30:

Presentation Overview

• MVAPICH Project
  – MPI and PGAS Library with CUDA-Awareness

• HiBD Project
  – High-Performance Big Data Analytics Library

• HiDL Project
  – High-Performance Deep Learning

• Public Cloud Deployment
  – Microsoft-Azure and Amazon-AWS

• Conclusions

Page 31:

Data Management and Processing on Modern Datacenters

• Substantial impact on designing and utilizing data management and processing systems in multiple tiers

  – Front-end data accessing and serving (Online)
    • Memcached + DB (e.g., MySQL), HBase

  – Back-end data analytics (Offline)
    • HDFS, MapReduce, Spark

Page 32:

Big Data (Hadoop, Spark, HBase, Memcached, etc.)

Deep Learning (Caffe, TensorFlow, BigDL, etc.)

HPC (MPI, RDMA, Lustre, etc.)

Convergent Software Stacks for HPC, Big Data and Deep Learning

Page 33:

• RDMA for Apache Spark

• RDMA for Apache Hadoop 3.x (RDMA-Hadoop-3.x)

• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)

– Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions

• RDMA for Apache Kafka

• RDMA for Apache HBase

• RDMA for Memcached (RDMA-Memcached)

• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)

• OSU HiBD-Benchmarks (OHB)

– HDFS, Memcached, HBase, and Spark Micro-benchmarks

• http://hibd.cse.ohio-state.edu

• User base: 315 organizations from 35 countries

• More than 31,100 downloads from the project site

The High-Performance Big Data (HiBD) Project

Available for InfiniBand and RoCE

Also run on Ethernet

Available for x86 and OpenPOWER

Support for Singularity and Docker

Page 34:

HiBD Release Timeline and Downloads

[Figure: cumulative number of HiBD downloads over time, approaching 30,000; annotated with release milestones including RDMA-Hadoop 1.x 0.9.0-0.9.9, RDMA-Hadoop 2.x 0.9.1-1.3.0, RDMA-Memcached 0.9.1-0.9.6, RDMA-Spark 0.9.1-0.9.5, RDMA-HBase 0.9.1, and OHB 0.7.1-0.9.3]

Page 35:

Different Modes of RDMA for Apache Hadoop 2.x

• HHH: Heterogeneous storage devices with hybrid replication schemes are supported in this mode of operation for better fault tolerance as well as performance. This mode is enabled by default in the package.

• HHH-M: A high-performance in-memory based setup has been introduced in this package that can be utilized to perform all I/O operations in-memory and obtain as much performance benefit as possible.

• HHH-L: With parallel file systems integrated, HHH-L mode can take advantage of the Lustre available in the cluster.

• HHH-L-BB: This mode deploys a Memcached-based burst buffer system to reduce the bandwidth bottleneck of shared file system access. The burst buffer design is hosted by Memcached servers, each of which has a local SSD.

• MapReduce over Lustre, with/without local disks: Besides the HDFS-based solutions, this package also provides support to run MapReduce jobs on top of Lustre alone. Here, two different modes are introduced: with local disks and without local disks.

• Running with Slurm and PBS: Supports deploying RDMA for Apache Hadoop 2.x with Slurm and PBS in the different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre).

Page 36:

Performance Numbers of RDMA for Apache Hadoop 3.x (NEW)

• OSU-RI2 cluster

• The RDMA-IB design improves the job execution time of TestDFSIO by 1.4x – 3.4x compared to IPoIB (100 Gbps)

• For the TestDFSIO throughput experiment, the RDMA-IB design shows an improvement of 1.2x – 3.1x compared to IPoIB (100 Gbps)

• Support for erasure coding is being worked on (presented at HPDC '19)

[Figure: TestDFSIO latency (execution time in seconds) and TestDFSIO throughput (MBps) for 120 GB, 240 GB, and 360 GB data sizes, comparing IPoIB (100 Gbps) and RDMA-IB (100 Gbps)]

Page 37:

Using HiBD Packages for Big Data Processing on Existing HPC Infrastructure

Page 38:

• High-Performance Design of Spark over RDMA-enabled Interconnects

– High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level for Spark

– RDMA-based data shuffle and SEDA-based shuffle architecture

– Non-blocking and chunk-based data transfer

– Off-JVM-heap buffer management

– Support for OpenPOWER

– Easily configurable for different protocols (native InfiniBand, RoCE, and IPoIB)

• Current release: 0.9.5

– Based on Apache Spark 2.1.0

– Tested with

• Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR)

• RoCE support with Mellanox adapters

• Various multi-core platforms (x86, POWER)

• RAM disks, SSDs, and HDD

– http://hibd.cse.ohio-state.edu

RDMA for Apache Spark Distribution

Page 39:

• InfiniBand FDR, SSD, 32/64 Worker Nodes, 768/1536 Cores, (768/1536M 768/1536R)

• RDMA-based design for Spark 1.5.1

• RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node.

– 32 nodes/768 cores: Total time reduced by 37% over IPoIB (56Gbps)

– 64 nodes/1536 cores: Total time reduced by 43% over IPoIB (56Gbps)

Performance Evaluation on SDSC Comet – HiBench PageRank

[Figure: PageRank total time (sec) for Huge, BigData, and Gigantic data sizes, on 32 worker nodes / 768 cores and 64 worker nodes / 1536 cores, comparing IPoIB and RDMA; 37% and 43% reductions, respectively]

Page 40:

Using HiBD Packages for Big Data Processing on Existing HPC Infrastructure

Page 41:

Presentation Overview

• MVAPICH Project
  – MPI and PGAS Library with CUDA-Awareness

• HiBD Project
  – High-Performance Big Data Analytics Library

• HiDL Project
  – High-Performance Deep Learning

• Public Cloud Deployment
  – Microsoft-Azure and Amazon-AWS

• Conclusions

Page 42:

Deep Learning: New Challenges for MPI Runtimes

• Deep Learning frameworks are a different game altogether
  – Unusually large message sizes (on the order of megabytes)
  – Most communication based on GPU buffers

• Existing state-of-the-art
  – cuDNN, cuBLAS, NCCL --> scale-up performance
  – NCCL2, CUDA-Aware MPI --> scale-out performance
    • For small and medium message sizes only!

• Proposed: can we co-design the MPI runtime (MVAPICH2-GDR) and the DL framework (Caffe) to achieve both?
  – Efficient overlap of computation and communication
  – Efficient large-message communication (reductions)
  – What application co-designs are needed to exploit communication-runtime co-designs?

[Figure: scale-up vs. scale-out performance of cuDNN, MKL-DNN, gRPC, Hadoop, MPI, and NCCL2, with the desired region combining both]

A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters, in Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17)
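
The large-message reductions referred to above arise from data-parallel training, where every rank holds a full copy of the model and the gradients are summed across ranks each step. Below is a hedged sketch of that pattern, assuming a CUDA-aware MPI library such as MVAPICH2-GDR so the gradient buffers can stay on the GPU; the 25M-parameter size is an illustrative, ResNet-50-scale assumption.

    /* Sketch of data-parallel gradient averaging with a CUDA-aware MPI library:
     * GPU-resident gradient buffers are summed across all ranks with
     * MPI_Allreduce each training step. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        const int nparams = 25 * 1000 * 1000;   /* ~25M parameters, illustrative */
        int rank, size;
        float *grads, *summed;                  /* both buffers live on the GPU */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        cudaSetDevice(0);                       /* assumes one GPU per rank */
        cudaMalloc((void **)&grads,  nparams * sizeof(float));
        cudaMalloc((void **)&summed, nparams * sizeof(float));
        cudaMemset(grads, 0, nparams * sizeof(float));

        /* each step: backpropagation fills grads on the GPU, then all ranks
         * sum them; the framework then divides by size and updates the model */
        MPI_Allreduce(grads, summed, nparams, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

        cudaFree(grads);
        cudaFree(summed);
        MPI_Finalize();
        return 0;
    }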

Page 43:

Big Data (Hadoop, Spark, HBase, Memcached, etc.)

Deep Learning (Caffe, TensorFlow, BigDL, etc.)

HPC (MPI, RDMA, Lustre, etc.)

Convergent Software Stacks for HPC, Big Data and Deep Learning

Page 44:

High-Performance Deep Learning

• CPU-based Deep Learning
• GPU-based Deep Learning
• RDMA-Enabled TensorFlow

Page 45:

Performance of CNTK with MVAPICH2-X on CPU-based Deep Learning

• CPU-based training of the AlexNet neural network using the ImageNet ILSVRC2012 dataset

• Advanced XPMEM-based designs show up to 20% benefit over Intel MPI (IMPI) for CNTK DNN training using All_Reduce

• The proposed designs show good scalability with increasing system size

[Figure: CNTK AlexNet training execution time (s) at 28-224 processes (batch size = default, iterations = 50, PPN = 28), comparing Intel MPI, MVAPICH2, and MVAPICH2-XPMEM; 9%-20% improvement]

Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores, J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and D. K. Panda, 32nd IEEE International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018

Available since the MVAPICH2-X 2.3rc1 release

Page 46:

Deep Learning on Frontera

• TensorFlow, PyTorch, and MXNet are widely used Deep Learning frameworks

• Optimized by Intel using the Math Kernel Library for DNN (MKL-DNN) for Intel processors

• Single-node performance can be improved by running multiple MPI processes

[Figure: (left) impact of batch size (1-256) on images/sec for ResNet-50 with TensorFlow, PyTorch, and MXNet; (right) performance improvement from running multiple MPI processes per node (PPN 1-16) for TensorFlow and PyTorch]

A. Jain et al., Scaling Deep Learning Frameworks on Frontera using MVAPICH2 MPI, under review

Page 47:

Deep Learning on Frontera

• Observed 260K images per second for ResNet-50 on 2,048 nodes

• Scaled MVAPICH2-X on 2,048 nodes of Frontera for distributed training using TensorFlow

• ResNet-50 can be trained in 7 minutes on 2,048 nodes (114,688 cores)

[Figure: (left) images/sec vs. number of nodes (1-2,048) for Intel MPI 19.0.5, MVAPICH2-X 2.3.2, and ideal scaling; (right) images/sec vs. number of nodes (1-256) for Keras, TF_CNN, MXNet, and PyTorch]

A. Jain et al., Scaling Deep Learning Frameworks on Frontera using MVAPICH2 MPI, under review

Page 48:

MVAPICH2-GDR: Enhanced MPI_Allreduce at Scale

• Optimized designs in the upcoming MVAPICH2-GDR offer better performance for most cases

• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on up to 1,536 GPUs

[Figure: (left) latency (us) on 1,536 GPUs for 4 B-16 KB messages, MVAPICH2-GDR-2.3.2 up to 1.6X better than NCCL 2.4; (center) bandwidth (GB/s) on 1,536 GPUs for 32 MB-256 MB messages, up to 1.7X better; (right) bandwidth for a 128 MB message on 24-1,536 GPUs, comparing SpectrumMPI 10.2.0.11, OpenMPI 4.0.1, NCCL 2.4, and MVAPICH2-GDR-2.3.2, up to 1.7X better]

Platform: dual-socket IBM POWER9 CPUs, 6 NVIDIA Volta V100 GPUs per node, and a 2-port InfiniBand EDR interconnect

Page 49:

Distributed Training with TensorFlow and MVAPICH2-GDR

• ResNet-50 training using the TensorFlow benchmark on Summit -- 1,536 Volta GPUs!

• 1,281,167 (1.2 million) images

• Time/epoch = 3.6 seconds

• Total time (90 epochs) = 3.6 x 90 = 332 seconds = 5.5 minutes!

[Figure: images per second (thousands) vs. number of GPUs (1-1,536), comparing NCCL 2.4 and MVAPICH2-GDR-2.3.2; MVAPICH2-GDR reaches ~0.35 million images per second for ImageNet-1k (1.2 million images)]

*We observed errors for NCCL2 beyond 96 GPUs

Platform: the Summit supercomputer (#1 on Top500.org), 6 NVIDIA Volta GPUs per node connected with NVLink, CUDA 9.2

Page 50:

MVAPICH2-GDR on Containers with Negligible Overhead

[Figure: GPU-GPU inter-node latency (us), bandwidth (MB/s), and bi-bandwidth (MB/s) vs. message size, comparing Docker and native runs; the two are nearly indistinguishable]

MVAPICH2-GDR-2.3.2; Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores; NVIDIA Volta V100 GPU; Mellanox ConnectX-4 EDR HCA; CUDA 9.0; Mellanox OFED 4.0 with GPU-Direct RDMA

Page 51:

• High-Performance Design of TensorFlow over RDMA-enabled Interconnects

– High performance RDMA-enhanced design with native InfiniBand support at the verbs-level for gRPC and TensorFlow

– RDMA-based data communication

– Adaptive communication protocols

– Dynamic message chunking and accumulation

– Support for RDMA device selection

– Easily configurable for different protocols (native InfiniBand and IPoIB)

• Current release: 0.9.1

– Based on Google TensorFlow 1.3.0

– Tested with

• Mellanox InfiniBand adapters (e.g., EDR)

• NVIDIA GPGPU K80

• Tested with CUDA 8.0 and CUDNN 5.0

– http://hidl.cse.ohio-state.edu

RDMA-TensorFlow Distribution

OpenPOWER support will be coming soon

Page 52:

Performance Benefit for RDMA-TensorFlow (Inception3)

• TensorFlow Inception3 performance evaluation on an IB EDR cluster
  – Up to 20% performance speedup over default gRPC (IPoIB) for 8 GPUs
  – Up to 34% performance speedup over default gRPC (IPoIB) for 16 GPUs
  – Up to 37% performance speedup over default gRPC (IPoIB) for 24 GPUs

[Figure: images/second vs. batch size (16, 32, 64) on 4 nodes (8 GPUs), 8 nodes (16 GPUs), and 12 nodes (24 GPUs), comparing gRPC (IPoIB-100Gbps), Verbs (RDMA-100Gbps), MPI (RDMA-100Gbps), and AR-gRPC (RDMA-100Gbps)]

Page 53:

Using HiDL Packages for Deep Learning on Existing HPC Infrastructure


Page 54:

Presentation Overview

• MVAPICH Project
  – MPI and PGAS Library with CUDA-Awareness

• HiBD Project
  – High-Performance Big Data Analytics Library

• HiDL Project
  – High-Performance Deep Learning

• Public Cloud Deployment
  – Microsoft-Azure and Amazon-AWS

• Conclusions

Page 55:

• Released on 08/16/2019

• Major Features and Enhancements

– Based on MVAPICH2-2.3.2

– Enhanced tuning for point-to-point and collective operations

– Targeted for Azure HB & HC virtual machine instances

– Flexibility for 'one-click' deployment

– Tested with Azure HB & HC VM instances

MVAPICH2-Azure 2.3.2

Page 56:

Performance of Radix

[Figure: total execution time (seconds, lower is better) vs. number of processes (nodes X PPN) on Azure HC and HB virtual machine instances, comparing MVAPICH2-X and HPCx; up to 3x faster on HC and 38% faster on HB (60-240 processes)]

Page 57:

• Released on 08/12/2019

• Major Features and Enhancements

– Based on MVAPICH2-X 2.3

– New design based on Amazon EFA adapter's Scalable Reliable Datagram (SRD) transport protocol

– Support for XPMEM based intra-node communication for point-to-point and collectives

– Enhanced tuning for point-to-point and collective operations

– Targeted for AWS instances with Amazon Linux 2 AMI and EFA support

– Tested with c5n.18xlarge instance

MVAPICH2-X-AWS 2.3

Page 58:

Application Performance

• Up to 10% performance improvement for miniGhost on 8 nodes

• Up to 27% better performance with CloverLeaf on 8 nodes

[Figure: execution time (seconds) at 72 (2x36), 144 (4x36), and 288 (8x36) processes for miniGhost (MV2X vs. OpenMPI; 10% better) and CloverLeaf (MV2X-UD, MV2X-SRD, and OpenMPI; 27.5% better)]

S. Chakraborty, S. Xu, H. Subramoni, and D. K. Panda, Designing Scalable and High-Performance MPI Libraries on Amazon Elastic Fabric Adapter, Hot Interconnects 2019

Page 59:

Concluding Remarks

• Upcoming Exascale systems need to be designed with a holistic view of HPC, Big Data, Deep Learning, and Cloud

• Presented an overview of designing convergent software stacks

• The presented solutions enable the HPC, Big Data, and Deep Learning communities to take advantage of current and next-generation systems

Page 60:

• Supported through X-ScaleSolutions (http://x-scalesolutions.com)

• Benefits:

– Help and guidance with installation of the library

– Platform-specific optimizations and tuning

– Timely support for operational issues encountered with the library

– Web portal interface to submit issues and track their progress

– Advanced debugging techniques

– Application-specific optimizations and tuning

– Obtaining guidelines on best practices

– Periodic information on major fixes and updates

– Information on major releases

– Help with upgrading to the latest release

– Flexible Service Level Agreements

• Support provided to Lawrence Livermore National Laboratory (LLNL) for the last two years

Commercial Support for MVAPICH2, HiBD, and HiDL Libraries

Page 61:

• X-ScaleSolutions has joined the OpenPOWER Consortium as a silver ISV member

• Provides flexibility:

  – To have the MVAPICH2, HiDL and HiBD libraries integrated into the OpenPOWER software stack

  – To be a part of the OpenPOWER ecosystem

  – To participate with different vendors in the bidding, installation and deployment process

• Introduced two new integrated products with support for OpenPOWER systems

(Presented at the OpenPOWER North America Summit)

– X-ScaleHPC

– X-ScaleAI

– Send an e-mail to [email protected] for free trial!!

Silver ISV Member for the OpenPOWER Consortium + Products

Page 62:

7th Annual MVAPICH User Group (MUG) Meeting

• August 19-21, 2019; Columbus, Ohio, USA

• Keynote Speakers

– Dan Stanzione, Texas Advanced Computing Center (TACC)

– Robert Harrison, Director of the Institute of Advanced Computational Science (IACS) and Brookhaven Computational Science Center (CSC)

• Tutorials

• ARM

• IBM

• Mellanox

• OSU/MVAPICH2

Slides and Videos of the talks are available from

http://mug.mvapich.cse.ohio-state.edu

• Invited Speakers

– Gregory Blum Becker, Lawrence Livermore National Laboratory

– Nicholas Brown, EPCC, The University of Edinburgh (United Kingdom)

– Gene Cooperman, Northeastern University

– Hyon-Wook Jin, Konkuk University (South Korea)

– Jithin Jose, Microsoft Azure

– Minsik Kim, KISTI Supercomputing Center (South Korea)

– Pramod Kumbhar, Blue Brain Project, EPFL (Switzerland)

– Naoya Maruyama, Lawrence Livermore National Laboratory

– Heechang Na, Ohio Supercomputer Center

– Vikram Saletore, Intel

– Jeffrey Salmond, University of Cambridge (United Kingdom)

– Gilad Shainer, Mellanox

– Sameer Shende, Paratools and University of Oregon

– Sayantan Sur, Intel

– Shinichiro Takizawa, RWBC-OIL, AIST (Japan)

– Mahidhar Tatineni, San Diego Supercomputing Center (SDSC)

– Karen Tomko, Ohio Supercomputer Center

Page 63:

Funding Acknowledgments

Funding Support by

Equipment Support by

Page 64:

Personnel Acknowledgments

Current Students (Graduate)

– A. Awan (Ph.D.)

– M. Bayatpour (Ph.D.)

– C.-H. Chu (Ph.D.)

– J. Hashmi (Ph.D.)

– A. Jain (Ph.D.)

– K. S. Kandadi (M.S.)

Past Students

– A. Augustine (M.S.)

– P. Balaji (Ph.D.)

– R. Biswas (M.S.)

– S. Bhagvat (M.S.)

– A. Bhat (M.S.)

– D. Buntinas (Ph.D.)

– L. Chai (Ph.D.)

– B. Chandrasekharan (M.S.)

– S. Chakraborthy (Ph.D.)

– N. Dandapanthula (M.S.)

– V. Dhanraj (M.S.)

– R. Rajachandrasekar (Ph.D.)

– D. Shankar (Ph.D.)

– G. Santhanaraman (Ph.D.)

– A. Singh (Ph.D.)

– J. Sridhar (M.S.)

– S. Sur (Ph.D.)

– H. Subramoni (Ph.D.)

– K. Vaidyanathan (Ph.D.)

– A. Vishnu (Ph.D.)

– J. Wu (Ph.D.)

– W. Yu (Ph.D.)

– J. Zhang (Ph.D.)

Past Research Scientist

– K. Hamidouche

– S. Sur

– X. Lu

Past Post-Docs

– D. Banerjee

– X. Besseron

– H.-W. Jin

– T. Gangadharappa (M.S.)

– K. Gopalakrishnan (M.S.)

– W. Huang (Ph.D.)

– W. Jiang (M.S.)

– J. Jose (Ph.D.)

– S. Kini (M.S.)

– M. Koop (Ph.D.)

– K. Kulkarni (M.S.)

– R. Kumar (M.S.)

– S. Krishnamoorthy (M.S.)

– K. Kandalla (Ph.D.)

– M. Li (Ph.D.)

– P. Lai (M.S.)

– J. Liu (Ph.D.)

– M. Luo (Ph.D.)

– A. Mamidala (Ph.D.)

– G. Marsh (M.S.)

– V. Meshram (M.S.)

– A. Moody (M.S.)

– S. Naravula (Ph.D.)

– R. Noronha (Ph.D.)

– X. Ouyang (Ph.D.)

– S. Pai (M.S.)

– S. Potluri (Ph.D.)

– Kamal Raj (M.S.)

– K. S. Khorassani (Ph.D.)

– P. Kousha (Ph.D.)

– A. Quentin (Ph.D.)

– B. Ramesh (M. S.)

– S. Xu (M.S.)

– J. Lin

– M. Luo

– E. Mancini

Past Programmers

– D. Bureddy

– J. Perkins

Current Research Specialist

– J. Smith

– S. Marcarelli

– J. Vienne

– H. Wang

Current Post-doc

– M. S. Ghazimeersaeed

– A. Ruhela

– K. Manian

Current Students (Undergraduate)

– V. Gangal (B.S.)

– N. Sarkauskas (B.S.)

Past Research Specialist

– M. Arnold

Current Research Scientist

– H. Subramoni
– Q. Zhou (Ph.D.)

Page 65:

• Looking for Bright and Enthusiastic Personnel to join as

– PhD Students

– Post-Doctoral Researchers

– MPI Programmer/Software Engineer

– Hadoop/Big Data Programmer/Software Engineer

– Deep Learning and Cloud Programmer/Software Engineer

• If interested, please send an e-mail to [email protected]

Multiple Positions Available in My Group

Page 66:

Thank You!

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/

[email protected]

The High-Performance MPI/PGAS Project
http://mvapich.cse.ohio-state.edu/

The High-Performance Deep Learning Project
http://hidl.cse.ohio-state.edu/

The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/

Follow us on https://twitter.com/mvapich