High Performance, Scalable and Fault-Tolerant MPI over InfiniBand: An Overview of the MVAPICH/MVAPICH2 Project
Talk at Tsukuba University
by
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Current and Next Generation Applications and Computing Systems
• Big demand for
  – High Performance Computing (HPC)
  – File-systems, multimedia, database, visualization
  – Internet data-centers
• Processor performance continues to grow
  – Chip density doubling every 18 months
  – Multi-core chips are emerging
• Commodity networking also continues to grow
  – Increase in speed and features
  – Affordable pricing
• Clusters are increasingly becoming popular to design next generation computing systems
  – Scalability, Modularity and Upgradeability with compute and network technologies
Trends for Computing Clusters in the Top 500 List
• Top 500 list of Supercomputers (www.top500.org)
June 2001:  33/500  (6.6%)      June 2005: 304/500 (60.8%)
Nov  2001:  43/500  (8.6%)      Nov  2005: 360/500 (72.0%)
June 2002:  80/500 (16.0%)      June 2006: 364/500 (72.8%)
Nov  2002:  93/500 (18.6%)      Nov  2006: 361/500 (72.2%)
June 2003: 149/500 (29.8%)      June 2007: 373/500 (74.6%)
Nov  2003: 208/500 (41.6%)      Nov  2007: 406/500 (81.2%)
June 2004: 291/500 (58.2%)      June 2008: 400/500 (80.0%)
Nov  2004: 294/500 (58.8%)      Nov  2008: To be Announced
Growth in Commodity Network Technologies
Representative commodity networks and their entries into the market:

Technology                      Bandwidth
Ethernet (1979 - )              10 Mbit/sec
Fast Ethernet (1993 - )         100 Mbit/sec
Gigabit Ethernet (1995 - )      1000 Mbit/sec
ATM (1995 - )                   155/622/1024 Mbit/sec
Myrinet (1993 - )               1 Gbit/sec
Fibre Channel (1994 - )         1 Gbit/sec
InfiniBand (2001 - )            2 Gbit/sec (1X SDR)
10-Gigabit Ethernet (2001 - )   10 Gbit/sec
InfiniBand (2003 - )            8 Gbit/sec (4X SDR)
InfiniBand (2005 - )            16 Gbit/sec (4X DDR)
                                24 Gbit/sec (12X SDR)
InfiniBand (2007 - )            32 Gbit/sec (4X QDR)
A 16X increase in bandwidth over the last 7 years
Limitations of Traditional Host-based Protocols

• Ex: TCP/IP, UDP/IP
• Generic architecture for all network interfaces
• Host handles almost all aspects of communication
  – Data buffering (copies on sender and receiver)
  – Data integrity (checksum)
  – Routing aspects (IP routing)
• Signaling between different layers
  – Hardware interrupt whenever a packet arrives or is sent
  – Software signals between different layers to handle protocol processing at different priority levels
Previous High Performance Network Stacks

• Virtual Interface Architecture
  – Standardized by Intel, Compaq, Microsoft
• Fast Messages (FM)
  – Developed by UIUC
• Myricom GM
  – Proprietary protocol stack from Myricom
• These network stacks set the trend for high-performance communication requirements
  – Hardware-offloaded protocol stack
  – Support for fast and secure user-level access to the protocol stack
IB Trade Association
• The IB Trade Association was formed with seven industry leaders (Compaq, Dell, HP, IBM, Intel, Microsoft, and Sun)
• Goal: to design a scalable and high-performance communication and I/O architecture by taking an integrated view of computing, networking, and storage technologies
• Many other industry partners participated in the effort to define the IB architecture specification
• The IB Architecture (Volume 1, Version 1.0) was released to the public on Oct 24, 2000
  – Latest version 1.2.1 released January 2008
• http://www.infinibandta.org
Presentation Overview
• Overview of InfiniBand
  – Features
  – Products (Hardware and Software)
  – Trends
• MVAPICH and MVAPICH2 Features
• Design Insights and Sample Performance Numbers
• Future Plans
• Conclusions and Final Q&A
A Typical IB Network

Three primary components:
• Channel Adapters
• Switches/Routers
• Links and connectors
Hardware Protocol Offload
[Diagram: IB protocol processing offloaded to the adapter; complete hardware implementations exist]
Basic IB Capabilities at Each Protocol Layer

• Link Layer
  – CRC-based data integrity, buffering and flow control, Virtual Lanes, Service Levels and QoS, switching and multicast, WAN capabilities
• Network Layer
  – Routing and Flow Labels
• Transport Layer
  – Reliable Connection, Unreliable Datagram, Reliable Datagram and Unreliable Connection
  – Shared Receive Queues and eXtended Reliable Connections (discussed in more detail later)
Communication and Management Semantics

• Two forms of communication semantics
  – Channel semantics (Send/Recv)
  – Memory semantics (RDMA, atomic operations)
• Management model
  – A detailed management model complete with managers, agents, messages and protocols
• Verbs Interface
  – A low-level programming interface for performing communication as well as management
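To make the verbs-level memory semantics concrete, here is a minimal sketch (in C, against the OpenFabrics verbs API) of posting a one-sided RDMA Write. The queue pair, memory region, and the peer's registered address and rkey are assumed to have been created and exchanged out of band during connection setup (not shown).

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Minimal sketch: post an RDMA Write on an established RC queue pair.
 * Assumes qp, mr, the local buffer, and the peer's (remote_addr, rkey)
 * were created/exchanged during connection setup. */
int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr, void *buf,
                    size_t len, uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t) buf,   /* local source buffer     */
        .length = (uint32_t) len,
        .lkey   = mr->lkey,          /* local protection key    */
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE; /* memory semantics */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED; /* local completion */
    wr.wr.rdma.remote_addr = remote_addr;       /* peer's registered address */
    wr.wr.rdma.rkey        = rkey;              /* peer's remote key */

    return ibv_post_send(qp, &wr, &bad_wr);     /* 0 on success */
}
```

Because the opcode is IBV_WR_RDMA_WRITE, the target posts no receive and its CPU is not involved in the transfer, which is exactly the memory-semantics property shown on the next slides.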
Communication in the Channel Semantics (Send-Receive Model)

[Diagram: sender-side and receiver-side buffer pools in the send-receive model]
Communication in the Memory Semantics (RDMA Model)

[Diagram: two nodes with memory, processes P0/P1, PCI/PCI-Express buses and IB adapters; data moves directly between registered memories]
• No involvement by the CPU at the receiver (RDMA Write/Put)
• No involvement by the CPU at the sender (RDMA Read/Get)
• 1-2 µs latency (for short data)
• 1.5 – 2.6 GBps bandwidth (for large data)
• 3-5 µs for atomic operations
IB Transport Services

Advanced mechanisms like SRQ and the new transport eXtended Reliable Connection (XRC) have been introduced recently.
Shared Receive Queue (SRQ)

[Diagram: without SRQ, a process posts one receive queue of m buffers per connection (n-1 connections); with SRQ, one shared queue of p buffers serves all connections]

• SRQ is a hardware mechanism in IB by which a process can share receive resources (memory) across multiple connections
• A new feature, introduced in specification v1.2
• 0 < p << m*(n-1)
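As a hedged illustration of how SRQ pools receive buffers, the sketch below creates one SRQ and attaches it to each new RC QP at creation time, so that p posted buffers serve all n-1 connections. The pd and cq handles are assumed to exist from earlier setup, and the queue sizes are illustrative.

```c
#include <infiniband/verbs.h>

/* Sketch: create a single shared receive queue and hand it to every
 * new RC QP, so receive buffers are pooled across all connections. */
struct ibv_srq *make_shared_rq(struct ibv_pd *pd)
{
    struct ibv_srq_init_attr srq_attr = {
        .attr = {
            .max_wr  = 512, /* p receive buffers shared by all peers */
            .max_sge = 1,
        },
    };
    return ibv_create_srq(pd, &srq_attr);
}

struct ibv_qp *make_qp_on_srq(struct ibv_pd *pd, struct ibv_cq *cq,
                              struct ibv_srq *srq)
{
    struct ibv_qp_init_attr qp_attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .srq     = srq,      /* receives come from the shared pool */
        .qp_type = IBV_QPT_RC,
        .cap     = { .max_send_wr = 64, .max_send_sge = 1 },
    };
    return ibv_create_qp(pd, &qp_attr);
}
```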
eXtended Reliable Connection (XRC)

M = # of nodes, N = # of processes/node
[Diagram: RC requires (M^2-1)*N connections; XRC requires (M-1)*N connections]

• Each QP takes at least one page of memory
  – Connections between all processes are very costly for RC
• New IB transport added: eXtended Reliable Connection
  – Allows connections between nodes instead of processes
Subnet Manager

[Diagram: compute nodes send Multicast Join requests; the Subnet Manager performs Multicast Setup across the switches, turning inactive links into active ones]
Automatic Path Migration (APM)

• Automatically utilizes IB multipathing for network fault-tolerance
• Enables migrating connections to a different path
  – Connection recovery in the case of failures
  – Optional feature
• Available for RC, UC, and RD
• Reliability guarantees for the service type are maintained during migration
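For reference, arming APM on an RC queue pair goes through the standard verbs interface; a minimal sketch, assuming the caller has filled in an ibv_ah_attr describing the alternate route (for example, a second port or an LMC-derived LID):

```c
#include <infiniband/verbs.h>
#include <string.h>

/* Sketch: load an alternate path and re-arm automatic path migration.
 * alt describes the redundant route and is assumed to be filled in
 * from subnet data by the caller. */
int arm_apm(struct ibv_qp *qp, struct ibv_ah_attr *alt, uint8_t alt_timeout)
{
    struct ibv_qp_attr attr;
    memset(&attr, 0, sizeof(attr));

    attr.alt_ah_attr    = *alt;          /* alternate address vector  */
    attr.alt_pkey_index = 0;
    attr.alt_port_num   = alt->port_num;
    attr.alt_timeout    = alt_timeout;
    attr.path_mig_state = IBV_MIG_REARM; /* HCA migrates on path error */

    return ibv_modify_qp(qp, &attr,
                         IBV_QP_ALT_PATH | IBV_QP_PATH_MIG_STATE);
}
```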
Presentation Overview
• Overview of InfiniBand
  – Features
  – Products (Hardware and Software)
  – Trends
• MVAPICH and MVAPICH2 Features
• Design Insights and Sample Performance Numbers
• Future Plans
• Conclusions and Final Q&A
IB Hardware Products
• Many IB vendors: Mellanox, Voltaire, Cisco, Qlogic
  – Aligned with many server vendors: Intel, IBM, SUN, Dell
  – And many integrators: Appro, Advanced Clustering, Microway, ...
• Broadly two kinds of adapters
  – Offloading (Mellanox) and Onloading (Qlogic)
• Adapters with different interfaces:
  – Dual-port 4X with PCI-X (64 bit/133 MHz), PCIe x8, PCIe 2.0 and HT
• MemFree Adapter
  – No memory on HCA; uses system memory (through PCIe)
  – Good for LOM designs (Tyan S2935, Supermicro 6015T-INFB)
• Different speeds
  – SDR (8 Gbps), DDR (16 Gbps) and QDR (32 Gbps)
  – Some 12X SDR adapters exist as well (24 Gbps each way)
Tyan Thunder S2935 Board

[Board photo] (Courtesy Tyan)
IB Hardware Products (contd.)
• Customized adapters to work with IB switches
  – Cray XD1 (formerly by Octigabay), Cray CX1
• Switches:
  – 4X SDR switch (8-288 ports)
    • 12X ports available for inter-switch connectivity
  – 4X DDR switch (mainly available in 8 to 288 port models)
  – 12X switches (small sizes available)
  – 3456-port "Magnum" switch from SUN used at TACC
    • 72-port "nano magnum" switch with DDR speed
  – New 36-port InfiniScale IV QDR switch silicon by Mellanox
    • Will allow high-density switches to be built
• Switch routers with gateways
  – IB-to-FC; IB-to-IP
IB Software Products
• Low-level software stacks
  – VAPI (Verbs-Level API) from Mellanox
  – Modified and customized VAPI from other vendors
  – New initiative: OpenFabrics (formerly OpenIB)
    • http://www.openfabrics.org
    • Open-source code available with Linux distributions
    • Initially IB; later extended to incorporate iWARP
• High-level software stacks
  – MPI, SDP, IPoIB, SRP, iSER, DAPL, NFS, PVFS on various stacks (primarily VAPI and OpenFabrics)
OpenFabrics
• www.openfabrics.org
• Open-source organization (formerly OpenIB)
• Incorporates both IB and iWARP in a unified manner
• Focusing on Open Source IBA and iWARP support for Linux and Windows
• Design of a complete software stack with `best of breed' components
  – Gen1
  – Gen2 (current focus)
• Users can download the entire stack and run
  – Latest release is OFED 1.3.1
  – OFED 1.4 is being worked out
OpenFabrics Software Stack

[Diagram: application level (clustered DB access, sockets-based apps, various MPIs, file-system access, block storage, IP-based apps) runs over user-level verbs/APIs (InfiniBand and iWARP R-NIC) with user-space libraries (SDP Lib, UDAPL, user-level MAD API, OpenSM, diag tools); kernel space holds upper-layer protocols (SDP, IPoIB, SRP, iSER, RDS, NFS-RDMA RPC, cluster file systems), a mid-layer (connection manager abstraction/CMA, SA client, MAD, SMA) with kernel bypass paths, and hardware-specific drivers for InfiniBand HCAs and iWARP R-NICs]

Acronym legend:
  SA     Subnet Administrator
  MAD    Management Datagram
  SMA    Subnet Manager Agent
  PMA    Performance Manager Agent
  IPoIB  IP over InfiniBand
  SDP    Sockets Direct Protocol
  SRP    SCSI RDMA Protocol (Initiator)
  iSER   iSCSI RDMA Protocol (Initiator)
  RDS    Reliable Datagram Service
  UDAPL  User Direct Access Programming Lib
  HCA    Host Channel Adapter
  R-NIC  RDMA NIC
IB Installations
• 121 IB clusters (24.2%) in the June '08 TOP500 list (www.top500.org)
• 12 IB clusters in the TOP25
  – 122,400 cores (RoadRunner) at LANL (1st)
  – 62,976 cores (Ranger) at TACC (4th)
  – 14,336 cores at New Mexico (7th)
  – 14,384 cores at Tata CRL, India (8th)
  – 10,240 cores at TEP, France (10th)
  – 13,728 cores in Sweden (11th)
  – 8,320 cores in UK (18th)
  – 6,720 cores in Germany (19th)
  – 10,000 cores at CCS, Tsukuba, Japan (20th)
  – 9,600 cores at NCSA (23rd)
  – 12,344 cores at Tokyo Inst. of Technology (24th)
  – 13,824 cores at NASA/Columbia (25th)
• More are getting installed ...
InfiniBand in the Top500

[Charts: number of systems and aggregate performance by interconnect family]

The percentage share of InfiniBand is steadily increasing.
Presentation Overview
• Overview of InfiniBand
  – Features
  – Products (Hardware and Software)
  – Trends
• MVAPICH and MVAPICH2 Features
• Design Insights and Sample Performance Numbers
• Future Plans
• Conclusions and Final Q&A
Designing MPI Using IB/iWARP Features

[Diagram: MPI design components (protocol mapping, flow control, communication progress, multi-rail support, buffer management, connection management, collective communication, one-sided active/passive) mapped, through design alternatives and solutions, onto IB and iWARP/Ethernet features (RDMA operations, atomic operations, send/receive, unreliable datagram, shared receive queues, static and dynamic rate control, end-to-end flow control, multicast, out-of-order placement, QoS, multi-path LMC, VLANs)]
MVAPICH/MVAPICH2 Software
• High-performance MPI library for IB and 10GigE
  – MVAPICH (MPI-1) and MVAPICH2 (MPI-2)
  – Latest releases: MVAPICH 1.1RC1 and MVAPICH2 1.2RC2
  – Used by more than 765 organizations in 42 countries
  – More than 23,000 downloads from the OSU site directly
  – Empowering many TOP500 clusters
    • 4th-ranked 62,976-core cluster (Ranger) at TACC
  – Available with the software stacks of many IB, 10GigE and server vendors, including the Open Fabrics Enterprise Distribution (OFED)
  – Also supports a uDAPL device to work with any network supporting uDAPL
  – http://mvapich.cse.ohio-state.edu/
MVAPICH 1.1 Architecture

[Diagram: MVAPICH (MPI-1, 1.1) device channels: #1 OpenFabrics/Gen2 (single-rail) and #2 OpenFabrics/Gen2-Hybrid (single-rail) for Mellanox InfiniBand (PCI-X, PCIe, PCIe-Gen2; SDR, DDR and QDR), #3 PSM for QLogic InfiniBand (PCIe and HT; SDR and DDR), #4 shared memory for multi-core nodes, #5 TCP/IP for serial nodes/laptops; Gen2-Multirail, VAPI and uDAPL (deprecated) channels also shown. Platforms: IA-32, EM64T, Opteron, IA-64, ...]
MVAPICH2 1.2 Architecture

[Diagram: MVAPICH2 (MPI-2, 1.2) device channels: #1 OpenFabrics/Gen2 (integrated multirail) for Mellanox InfiniBand (PCI-X, PCIe, PCIe-Gen2; SDR, DDR and QDR) and for 10GigE/iWARP adapters (Chelsio, Neteffect; PCI-X, PCIe), #2 uDAPL (IB, Myrinet, Quadrics; Linux, Solaris), #3 TCP/IP, #4 PSM for QLogic InfiniBand (PCIe and HT; SDR and DDR); shared memory within the node; one channel marked "Under Design". Platforms: IA-32, EM64T, Opteron, IA-64, ...]
Major Features of MVAPICH 1.1

• OpenFabrics-Gen2
  – Scalable job start-up with mpirun_rsh, support for SLURM
  – RC and XRC support
  – Flexible message coalescing
  – Multi-core-aware pt-to-pt communication
  – User-defined processor affinity for multi-core platforms
  – Multi-core-optimized collective communication
  – Asynchronous and scalable on-demand connection management
  – RDMA Write and RDMA Read-based protocols
  – Lock-free asynchronous progress for better overlap between computation and communication
  – Polling and blocking support for communication progress
  – Multi-pathing support leveraging the LMC mechanism on large fabrics
  – Network-level fault tolerance with Automatic Path Migration (APM)
  – Mem-to-mem reliable data transfer mode (for detection of I/O errors with 32-bit CRC)
Major Features of MVAPICH 1.1 (Cont'd)

• OpenFabrics-Gen2-Hybrid
  – Newly introduced interface in 1.1
  – Replaces the UD interface in 1.0
  – Targeted for emerging multi-thousand-core clusters, to achieve the best performance with minimal memory footprint
  – Most of the features as in Gen2
  – Adaptive selection during run-time (based on application and system characteristics) to switch between RC and UD (or between XRC and UD) transports
  – Multiple buffer organization with XRC support
Major Features of MVAPICH2 1.2

• OpenFabrics-Gen2
  – All features as in MVAPICH 1.1 (OpenFabrics-Gen2) except asynchronous progress and XRC
  – RDMA CM-based connection management (Gen2-IB and Gen2-iWARP)
  – Integrated multi-rail support for IB and 10GigE/iWARP
  – Checkpoint-Restart (currently for IB)
    • Systems-level automatic
    • Application-initiated systems-level
• uDAPL
  – Most of the features of OpenFabrics-Gen2 except multi-rail and checkpointing
  – Flexibility for different adapters, software stacks and OSes (Linux and Solaris) supporting uDAPL
Support for Multiple Interfaces/Adapters

• OpenFabrics/Gen2-IB and OpenFabrics/Gen2-Hybrid
  – All IB adapters supporting OpenFabrics/Gen2
• Qlogic/PSM
  – Qlogic adapters
• OpenFabrics/Gen2-iWARP
  – Chelsio
• uDAPL
  – Linux-IB
  – Solaris-IB
  – Other adapters such as Neteffect 10GigE
• TCP/IP
  – Any adapter supporting a TCP/IP interface
• Shared memory channel (MVAPICH)
  – For running applications in a node with multi-core processors
Presentation Overview
• Overview of InfiniBand
  – Features
  – Products (Hardware and Software)
  – Trends
• MVAPICH and MVAPICH2 Features
• Design Insights and Sample Performance Numbers
• Future Plans
• Conclusions and Final Q&A
Design Insights and Sample Results
• Scalable Job Start-up
• Basic Performance
  – Two-sided Communication
  – One-sided Communication
• Multi-core-aware pt-to-pt communication
• Multi-core-aware Optimized Collectives
• Integrated Multi-rail Design
• Scalability for Large-scale Systems (SRQ, UD, Hybrid & XRC)
• Applications-level Scalability
• Asynchronous Progress
• Fault Tolerance
Scalable Startup

• An enhanced mpirun_rsh framework was introduced in MVAPICH 1.0 to significantly cut down job start-up time on large clusters
• Available with MVAPICH 1.1 and MVAPICH2 1.2

[Chart: average wallclock runtime (secs) of MPI Hello World for 1K to 32K MPI tasks (cores), MVAPICH-0.9.9 vs MVAPICH-1.0; courtesy TACC]
One-way Latency: MPI over IB

[Charts: small- and large-message latency (us) for MVAPICH over InfiniHost III-DDR, Qlogic-SDR, ConnectX-DDR, ConnectX-QDR-PCIe2 and Qlogic-DDR-PCIe2; small-message latencies of 1.06, 1.28, 1.49, 2.19 and 2.77 us]
InfiniHost III and ConnectX-DDR: 2.33 GHz Quad-core (Clovertown) Intel with IB switch
ConnectX-QDR-PCIe2: 2.83 GHz Quad-core (Harpertown) Intel with back-to-back
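The latencies above come from a standard ping-pong measurement (in the style of the OSU osu_latency benchmark); a minimal, self-contained sketch of that pattern:

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal ping-pong sketch: ranks 0 and 1 bounce a small message back
 * and forth; one-way latency is half the average round-trip time. */
int main(int argc, char **argv)
{
    enum { ITERS = 1000, SIZE = 4 };
    char buf[SIZE];
    int rank;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();
    if (rank == 0)
        printf("one-way latency: %.2f us\n",
               (t1 - t0) * 1e6 / (2.0 * ITERS));
    MPI_Finalize();
    return 0;
}
```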
Bandwidth: MPI over IB

[Charts: unidirectional bandwidth (peaks of 936.5, 1389.4, 1399.8, 1952.9 and 2570.6 MillionBytes/sec) and bidirectional bandwidth (peaks of 1519.8, 2457.4, 2718.3, 3621.4 and 5012.1 MillionBytes/sec) for MVAPICH over InfiniHost III-DDR, Qlogic-SDR, ConnectX-DDR, ConnectX-QDR-PCIe2 and Qlogic-DDR-PCIe2]
InfiniHost III and ConnectX-DDR: 2.33 GHz Quad-core (Clovertown) Intel with IB switch
ConnectX-QDR-PCIe2: 2.83 GHz Quad-core (Harpertown) Intel with back-to-back
RDMA CM and iWARP Support

• Available starting with MVAPICH2 0.9.8
• RDMA CM is supported for both
  – IB
  – 10GigE/iWARP
One-way Latency: MPI over iWARP

[Chart: one-way latency for 1-byte to 4 KB messages on Chelsio adapters; TCP/IP: 15.47 us, iWARP: 6.88 us]
2.0 GHz Quad-core Intel with 10GE (Fulcrum) Switch
Bandwidth: MPI over iWARP

[Charts: unidirectional and bidirectional bandwidth for Chelsio TCP/IP vs Chelsio iWARP; peak values include 855.3, 839.8, 1231.8 and 2260.8 MillionBytes/sec]
2.0 GHz Quad-core Intel with 10GE (Fulcrum) Switch
MPI One-sided Communication

• Specified by the MPI-2 standard
• Data movement operations
  – MPI_Put
  – MPI_Get
  – MPI_Accumulate
• Synchronization operations
  – MPI_Lock/MPI_Unlock
  – MPI_Win_fence
  – MPI_Win_post, MPI_Win_start, MPI_Win_complete, MPI_Win_wait

[Diagram: origin process issues Lock, Put, Get, Accumulate and Unlock against a target window]
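A minimal sketch of the MPI-2 one-sided model listed above, using fence synchronization (the window contents and values are illustrative):

```c
#include <mpi.h>
#include <stdio.h>

/* Sketch: rank 0 puts one integer into rank 1's window using
 * MPI-2 one-sided communication with fence synchronization. */
int main(int argc, char **argv)
{
    int rank, winbuf = 0, value = 42;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every process exposes one int through the window. */
    MPI_Win_create(&winbuf, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                      /* open access epoch */
    if (rank == 0)
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);                      /* close epoch       */

    if (rank == 1)
        printf("rank 1 received %d via MPI_Put\n", winbuf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```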
MPI_Put Performance (IB DDR)

[Charts: MPI_Put small-message latency of 2.57 us, unidirectional bandwidth, and bidirectional bandwidth up to 2716 MB/s with ConnectX-DDR]

• Single-port results only (EM64T, PCI-Express)
• Results for other platforms at http://mvapich.cse.ohio-state.edu
Design Insights and Sample Results
• Scalable Job Start-up
• Basic Performance
  – Two-sided Communication
  – One-sided Communication
• Multi-core-aware pt-to-pt communication
• Multi-core-aware Optimized Collectives
• Integrated Multi-rail Design
• Scalability for Large-scale Systems (SRQ, UD, Hybrid & XRC)
• Applications-level Scalability
• Asynchronous Progress
• Fault Tolerance
Multicore-aware Communication: Latency and Bandwidth

[Charts: small-message latency (us) and bandwidth (MB/s) for intra-CMP and inter-CMP communication, basic vs advanced designs]

• Multicore-aware design improves both latency and bandwidth
• Available in MVAPICH and MVAPICH2 stacks
L. Chai, A. Hartono and D. K. Panda, “Designing High Performance and Scalable MPI Intra-node Communication Support for Clusters”, Cluster ‘06
Shared-memory Aware Collectives

[Charts: MPI_Bcast latency (usec) with and without shared memory on 64x8 and 64x4 configurations, and MPI_Allreduce latency (us) on 512 cores for 4096 to 65536-byte messages, shared-memory vs non-shared-memory designs]
MVAPICH-PSM Collective Performance (512 cores)

[Charts: Broadcast latency (64x8) for 1 to 1024-byte messages and Barrier latency for system sizes 1x8 to 64x8, MVAPICH PSM 1.0 vs InfiniPath MPI 2.1]

• 64 Intel Quad-core systems with dual sockets; PCIe InfiniPath adapters
• Significant performance improvement for MPI_Bcast and MPI_Barrier
Design Insights and Sample Results
• Scalable Job Start-up
• Basic Performance
  – Two-sided Communication
  – One-sided Communication
• Multi-core-aware pt-to-pt communication
• Multi-core-aware Optimized Collectives
• Integrated Multi-rail Design
• Scalability for Large-scale Systems (SRQ, UD, Hybrid & XRC)
• Applications-level Scalability
• Asynchronous Progress
• Fault Tolerance
Integrated Multi-Rail Design (MVAPICH2)

[Diagram: application requests enter the MPI layer (eager and rendezvous protocols), pass through a communication scheduler with pluggable policies onto virtual subchannels over the InfiniBand layer; a notifier feeds completions back up]

• Multiple ports/adapters
• Multiple paths with LMC

J. Liu, A. Vishnu and D. K. Panda, "Building MultiRail InfiniBand Clusters: MPI Level Design and Performance Evaluation", Supercomputing '04
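As a hedged sketch of what the scheduler's striping policy might look like, the code below splits one large message round-robin across rails; post_rdma_on_rail() is a hypothetical stand-in, not an MVAPICH2 function:

```c
#include <stdio.h>
#include <stddef.h>

#define NUM_RAILS 2

/* Hypothetical stand-in for a rail-specific RDMA write. */
static void post_rdma_on_rail(int rail, const char *buf, size_t len)
{
    printf("rail %d: %zu bytes at %p\n", rail, len, (const void *) buf);
}

/* Round-robin striping of one large message across the rails. */
static void stripe_message(const char *buf, size_t len)
{
    size_t chunk = (len + NUM_RAILS - 1) / NUM_RAILS;
    for (int r = 0; r < NUM_RAILS && len > 0; r++) {
        size_t n = len < chunk ? len : chunk;
        post_rdma_on_rail(r, buf, n);  /* each rail carries one stripe */
        buf += n;
        len -= n;
    }
}

int main(void)
{
    static char msg[1 << 20];          /* 1 MB example payload */
    stripe_message(msg, sizeof(msg));
    return 0;
}
```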
Design Insights and Sample Results
• Scalable Job Start-up
• Basic Performance
  – Two-sided Communication
  – One-sided Communication
• Multi-core-aware pt-to-pt communication
• Multi-core-aware Optimized Collectives
• Integrated Multi-rail Design
• Scalability for Large-scale Systems (SRQ, UD, Hybrid & XRC)
• Applications-level Scalability
• Asynchronous Progress
• Fault Tolerance
Memory Utilization using Shared Receive Queues

[Charts: MPI_Init memory utilization (MB) for 2-32 processes and an analytical model of memory used (GB) for 128-16384 processes, comparing MVAPICH-RDMA, MVAPICH-SR and MVAPICH-SRQ]

• SRQ consumes only 1/10th of the memory of RDMA for 16,000 processes
• Send/Recv exhausts the buffer pool after 1,000 processes and consumes 2X the memory of SRQ for 16,000 processes
S. Sur, L. Chai, H. –W. Jin and D. K. Panda, “Shared Receive Queue Based Scalable MPI Design for InfiniBand Clusters”, IPDPS 2006
Communication Buffer Memory Utilization with NAMD (apoa1)

[Charts: memory usage (MB) for 16-64 processes and normalized performance for ARDMA-SR, ARDMA-SRQ and SRQ; profile: avg. RDMA channels 53.15, avg. low watermarks 0.03, unexpected msgs 48.2%, total messages 3.7e6, MPI time 23.54%]

• 50% of messages are < 128 bytes; the other 50% are between 128 bytes and 32 KB
  – 53 RDMA connections were set up for the 64-process experiment
• The SRQ channel takes 5-6 MB of memory
  – Memory needed by SRQ decreases by 1 MB going from 16 to 64 processes
S. Sur, M. Koop and D. K. Panda, “High-Performance and Scalable MPI over InfiniBand with Reduced Memory Usage: An In-Depth Performance Analysis”, SC ‘06
UD vs. RC: Performance and Scalability (SMG2000 Application)

[Chart: normalized execution time, RC (MVAPICH 0.9.8) vs UD design, for 128-4096 processes]

Memory usage (MB/process):

Procs | RC (MVAPICH 0.9.8): Conn. / Buffers / Struct. / Total | UD Design: Buffers / Struct. / Total
 512  | 22.9 / 65.0 / 0.3 /  88.2                             | 37.0 / 0.2 / 37.2
1024  | 29.5 / 65.0 / 0.6 /  95.1                             | 37.0 / 0.4 / 37.4
2048  | 42.4 / 65.0 / 1.2 / 107.4                             | 37.0 / 0.9 / 37.9
4096  | 66.7 / 65.0 / 2.4 / 134.1                             | 37.0 / 1.7 / 38.7

• Large number of peers per process (992 at maximum)
• UD reduces HCA QP cache thrashing

M. Koop, S. Sur, Q. Gao and D. K. Panda, "High Performance MPI Design using Unreliable Datagram for Ultra-Scale InfiniBand Clusters," ICS '07
Impact of Hybrid RC/UD Design

• Combines the benefits of both RC and UD

[Chart: application benchmark results on a 512-core system]
M. Koop, T. Jones and D. K. Panda, “MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand,” IPDPS ’08
Impact of XRC-based Design

[Chart: normalized time for RC-SRQ, RC-MSRQ, EXRC-SRQ, EXRC-MSRQ, SXRC-SRQ and SXRC-MSRQ on the apoa1, er-gre, f1atpase and jac datasets]

• For the jac dataset, RC-MSRQ shows 10% worse performance
  – The HCA cache is likely being thrashed
• SXRC modes show higher performance since fewer QPs are used (and they stay in cache)

M. Koop, J. Sridhar and D. K. Panda, "Scalable MPI Design over InfiniBand using eXtended Reliable Connection," Cluster '08
Performance of HPC Applications on TACC Ranger using MVAPICH + IB

• Rob Farber's facial recognition application was run on up to 60K cores using MVAPICH
• Ranges from 84% of peak at the low end to 65% of peak at the high end

http://www.tacc.utexas.edu/research/users/features/index.php?m_b_c=farber
Performance of HPC Applications on TACC Ranger: DNS/Turbulence

[Chart courtesy: P.K. Yeung, Diego Donzis, TG 2008]
Design Insights and Sample Results
• Scalable Job Start-up
• Basic Performance
  – Two-sided Communication
  – One-sided Communication
• Multi-core-aware pt-to-pt communication
• Multi-core-aware Optimized Collectives
• Integrated Multi-rail Design
• Scalability for Large-scale Systems (SRQ, UD, Hybrid & XRC)
• Applications-level Scalability
• Asynchronous Progress
• Fault Tolerance
Asynchronous Progress

• Asynchronous progress (both at sender and receiver) in MVAPICH 1.0
• The design has been enhanced to a lock-free design in MVAPICH 1.1
• Potential for overlap of computation and communication

R. Kumar, A. Mamidala, M. Koop, G. Santhanaraman and D. K. Panda, "Lock-free Asynchronous Rendezvous Design for MPI Point-to-Point Communication", EuroPVM/MPI 2008, September 2008.
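The overlap this design targets is what applications exploit with nonblocking calls; a minimal sketch of the pattern, where a large MPI_Isend proceeds (with asynchronous progress) while the host computes:

```c
#include <mpi.h>
#include <stdio.h>

/* Sketch: overlap computation with a large nonblocking send. With
 * asynchronous progress, the rendezvous transfer proceeds while the
 * host is busy in compute(); without it, most of the transfer would
 * wait until MPI_Wait() drives progress. */
#define N (4 * 1024 * 1024)
static double buf[N];

static double compute(void)
{
    double s = 0.0;
    for (int i = 0; i < N; i++)   /* stand-in for application work */
        s += buf[i] * 0.5;
    return s;
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Isend(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
        double s = compute();     /* overlapped with the transfer */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        printf("checksum %f\n", s);
    } else if (rank == 1) {
        MPI_Irecv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}
```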
Asynchronous Progress (Mellanox DDR)

[Charts: application availability (%) at sender and receiver for 8K to 4M-byte messages, ASYNC protocol vs RPUT protocol; results for the SMB benchmark]
Design Insights and Sample Results
• Scalable Job Start-up
• Basic Performance
  – Two-sided Communication
  – One-sided Communication
• Multi-core-aware pt-to-pt communication
• Multi-core-aware Optimized Collectives
• Integrated Multi-rail Design
• Scalability for Large-scale Systems (SRQ, UD, Hybrid & XRC)
• Applications-level Scalability
• Asynchronous Progress
• Fault Tolerance
Fault Tolerance

• Component failures are common in large-scale clusters
• This imposes a need for reliability and fault tolerance
• Working along the following three angles:
  – Reliable networking with Automatic Path Migration (APM) utilizing redundant communication paths (available since MVAPICH 1.0 and MVAPICH2 1.0)
  – Process fault tolerance with efficient checkpoint-restart (available since MVAPICH2 0.9.8)
  – End-to-end reliability with memory-to-memory CRC (available since MVAPICH 0.9.9)
Network Fault-Tolerance with APM

• Network fault tolerance using InfiniBand Automatic Path Migration (APM)
  – Utilizes redundant communication paths
    • Multiple ports
    • LMC
• Supported in OFED 1.2

A. Vishnu, A. Mamidala, S. Narravula and D. K. Panda, "Automatic Path Migration over InfiniBand: Early Experiences", Third International Workshop on System Management Techniques, Processes, and Services, held in conjunction with IPDPS '07
Screenshots: MPI Bandwidth Test with APM

[Screenshots of a running MPI bandwidth test migrating paths under APM]
Checkpoint-Restart Support in MVAPICH2

• Process-level fault tolerance
  – User-transparent, system-level checkpointing
  – Based on BLCR from LBNL to take coordinated checkpoints of the entire program, including the front end and individual processes
  – Designed novel schemes to
    • Coordinate all MPI processes to drain all in-flight messages in IB connections
    • Store communication state and buffers while checkpointing
    • Restart from the checkpoint
• System-level checkpoints can be initiated from the application (added in MVAPICH2 1.0)
A Running Example

[Terminal screenshots: (1) LU is running; (2) list the running programs; (3) select the program and take a checkpoint; (4) LU is not affected; stop it using CTRL-C; (5) restart from the checkpoint]
Checkpoint-Restart Performance with PVFS2

[Chart: execution time (seconds) of NAS LU Class C, 32x1 (storage: 8 PVFS2 servers on IPoIB) with no checkpoint and with 1-4 checkpoints at average intervals of 60, 40, 30 and 20 seconds]
Q. Gao, W. Yu, W. Huang and D.K. Panda, “Application-Transparent Checkpoint/Restart for MPI over InfiniBand”, ICPP ‘06
Presentation Overview
• Overview of InfiniBand
  – Features
  – Products (Hardware and Software)
  – Trends
• MVAPICH and MVAPICH2 Features
• Design Insights and Sample Performance Numbers
• Future Plans
• Conclusions and Final Q&A
Future Plans

• Most of the focus is toward MVAPICH2
• Further enhancements to scalable job start-up
• Kernel-based (LiMIC2) shared-memory pt-to-pt communication
• Optimization of collectives and one-sided communication based on the new LiMIC2 shared-memory communication
• Passive synchronization support for one-sided communication
• Flexible process binding for multi-rails
• Optimization of collectives
  – XRC
  – Multi-rail
• Automatic tuning framework for pt-to-pt and collectives
• Network reliability (transparent recovery in case of adapter failure)
• Job pause-restart framework
• Performance and memory scalability toward 100-200K cores
Conclusions

• MVAPICH and MVAPICH2 are being widely used in stable production IB clusters, delivering the best performance and scalability
• Also enabling clusters with 10GigE/iWARP support
• The user base stands at more than 765 organizations worldwide
• New features for scalability, high performance and fault tolerance are aimed at deploying large-scale clusters (100-200K nodes) in the near future
Funding Acknowledgments

Our research is supported by the following organizations:
• Current funding support [sponsor logos]
• Current equipment support [sponsor logos]
Personnel Acknowledgments

Current Students
  – L. Chai (Ph.D.)
  – T. Gangadharappa (M.S.)
  – K. Gopalakrishnan (M.S.)
  – M. Koop (Ph.D.)
  – P. Lai (Ph.D.)
  – G. Marsh (Ph.D.)
  – X. Ouyang (Ph.D.)
  – G. Santhanaraman (Ph.D.)
  – J. Sridhar (M.S.)
  – H. Subramoni (M.S.)

Past Students
  – P. Balaji (Ph.D.)
  – D. Buntinas (Ph.D.)
  – S. Bhagvat (M.S.)
  – B. Chandrasekharan (M.S.)
  – W. Jiang (M.S.)
  – W. Huang (Ph.D.)
  – S. Kini (M.S.)
  – R. Kumar (M.S.)
  – S. Krishnamoorthy (M.S.)
  – J. Liu (Ph.D.)
  – A. Mamidala (Ph.D.)
  – S. Narravula (Ph.D.)
  – R. Noronha (Ph.D.)
  – S. Sur (Ph.D.)
  – K. Vaidyanathan (Ph.D.)
  – A. Vishnu (Ph.D.)
  – J. Wu (Ph.D.)
  – W. Yu (Ph.D.)

Current Programmer
  – J. Perkins