High Performance, Scalable and Fault-Tolerant MPI over InfiniBand: An Overview of MVAPICH/MVAPICH2 Project

Talk at Tsukuba University
by Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda


Current and Next Generation Applications and Computing Systems

• Big demand for
  – High Performance Computing (HPC)
  – File-systems, multimedia, database, visualization
  – Internet data-centers
• Processor performance continues to grow
  – Chip density doubling every 18 months
  – Multi-core chips are emerging
• Commodity networking also continues to grow
  – Increase in speed and features
  – Affordable pricing
• Clusters are increasingly becoming popular to design next generation computing systems
  – Scalability, Modularity and Upgradeability with compute and network technologies

Tsukuba, Oct 2, 2008

Trends for Computing Clusters in the Top 500 List

• Top 500 list of Supercomputers (www.top500.org)

Clusters in the Top 500 list:

June 2001: 33/500 (6.6%)      June 2005: 304/500 (60.8%)
Nov 2001: 43/500 (8.6%)       Nov 2005: 360/500 (72.0%)
June 2002: 80/500 (16%)       June 2006: 364/500 (72.8%)
Nov 2002: 93/500 (18.6%)      Nov 2006: 361/500 (72.2%)
June 2003: 149/500 (29.8%)    June 2007: 373/500 (74.6%)
Nov 2003: 208/500 (41.6%)     Nov 2007: 406/500 (81.2%)
June 2004: 291/500 (58.2%)    June 2008: 400/500 (80.0%)
Nov 2004: 294/500 (58.8%)     Nov 2008: To be Announced

Growth in Commodity Network Technology

Representative commodity networks; their entries into the market

Technology (entry year)         Speed
Ethernet (1979 -)               10 Mbit/sec
Fast Ethernet (1993 -)          100 Mbit/sec
Gigabit Ethernet (1995 -)       1000 Mbit/sec
ATM (1995 -)                    155/622/1024 Mbit/sec
Myrinet (1993 -)                1 Gbit/sec
Fibre Channel (1994 -)          1 Gbit/sec
InfiniBand (2001 -)             2 Gbit/sec (1X SDR)
10-Gigabit Ethernet (2001 -)    10 Gbit/sec
InfiniBand (2003 -)             8 Gbit/sec (4X SDR)
InfiniBand (2005 -)             16 Gbit/sec (4X DDR), 24 Gbit/sec (12X SDR)
InfiniBand (2007 -)             32 Gbit/sec (4X QDR)

16 times in the last 7 years
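The InfiniBand entries above follow a simple pattern: link width (1X, 4X, 12X) multiplied by a per-lane data rate. The sketch below reproduces the listed speeds and the "16 times" figure. The per-lane rates are inferred from the entries themselves (2 Gbit/sec for 1X SDR, 16/4 for DDR, 32/4 for QDR), and the function names are illustrative only.

```python
# Per-lane data rates in Gbit/sec, inferred from the table entries
# (1X SDR = 2, 4X DDR = 16, 4X QDR = 32). Assumption for illustration.
LANE_RATE_GBPS = {"SDR": 2, "DDR": 4, "QDR": 8}


def link_rate_gbps(width, speed):
    """Data rate of a link with `width` lanes at the given speed grade."""
    return width * LANE_RATE_GBPS[speed]


def growth_factor(first_gbps, last_gbps):
    """How many times faster the later link is than the earlier one."""
    return last_gbps / first_gbps


if __name__ == "__main__":
    for width, speed in [(1, "SDR"), (4, "SDR"), (4, "DDR"), (12, "SDR"), (4, "QDR")]:
        print(f"{width}X {speed}: {link_rate_gbps(width, speed)} Gbit/sec")
    # 1X SDR (2001) compared with 4X QDR (2007)
    print(growth_factor(link_rate_gbps(1, "SDR"), link_rate_gbps(4, "QDR")))
```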

Limitations of Traditional Host-based Protocols

• Ex: TCP/IP, UDP/IP
• Generic architecture for all network interfaces
• Host handles almost all aspects of communication
  – Data buffering (copies on sender and receiver)
  – Data integrity (checksum)
  – Routing aspects (IP routing)
• Signaling between different layers
  – Hardware interrupt whenever a packet arrives or is sent
  – Software signals between different layers to handle protocol processing in different priority levels
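To make the "data integrity (checksum)" item concrete, here is a minimal sketch of the kind of per-byte work a host CPU performs in a host-based stack. It implements the 16-bit ones'-complement checksum in the style of RFC 1071, which is used by the IP family. The function name is illustrative, and the slide does not say which checksum it has in mind.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add one big-endian 16-bit word
    while total >> 16:
        # fold any carry bits back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


if __name__ == "__main__":
    payload = bytes.fromhex("0001f203f4f5f6f7")
    print(hex(internet_checksum(payload)))
```

Every byte of every packet passes through a loop like this on the host, which is one reason the talk contrasts host-based protocols with hardware-offloaded stacks.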

Previous High Performance Network Stacks

• Virtual Interface Architecture
  – Standardized by Intel, Compaq, Microsoft
• Fast Messages (FM)
  – Developed by UIUC
• Myricom GM
  – Proprietary protocol stack from Myricom
• These network stacks set the trend for high-performance communication requirements
  – Hardware offloaded protocol stack
  – Support for fast and secure user-level access to the protocol stack

IB Trade Association

• IB Trade Association was formed with seven industry leaders (Compaq, Dell, HP, IBM, Intel, Microsoft, and Sun)
• Goal: To design a scalable and high performance communication and I/O architecture by taking an integrated view of computing, networking, and storage technologies
• Many other industry members participated in the effort to define the IB architecture specification
• IB Architecture (Volume 1, Version 1.0) was released to the public on Oct 24, 2000
  – Latest version 1.2.1 released January 2008
• http://www.infinibandta.org

Presentation Overview

• Overview of InfiniBand
  – Features
  – Products (Hardware and Software)
  – Trends
• MVAPICH and MVAPICH2 Features
• Design Insights and Sample Performance Numbers
• Future Plans
• Conclusions and Final Q&A

A Typical IB Network

Three primary components:
• Channel Adapters
• Switches/Routers
• Links and connectors

Hardware Protocol Offload

Complete Hardware Implementations Exist

Basic IB Capabilities at Each Protocol Layer

• Link Layer
  – CRC-based data integrity, Buffering and Flow-control, Virtual Lanes, Service Levels and QoS, Switching and Multicast, WAN capabilities
• Network Layer
  – Routing and Flow Labels
• Transport Layer
  – Reliable Connection, Unreliable Datagram, Reliable Datagram and Unreliable Connection
  – Shared Receive Queues and Extended Reliable Connections (discussed in more detail later)

Communication and Management Semantics

• Two forms of communication semantics
  – Channel semantics (Send/Recv)
  – Memory semantics (RDMA, Atomic operations)
• Management model
  – A detailed management model complete with managers, agents, messages and protocols
• Verbs Interface
  – A low-level programming interface for performing communication as well as management

Communication in the Channel Semantics (Send-Receive Model)

[Figure: send-receive model, with a buffer pool on each side]

Communication in the Memory Semantics (RDMA Model)

[Figure: two nodes, each with processors P0 and P1, memory, and an IBA adapter attached over PCI/PCI-EX]

• No involvement by the CPU at the receiver (RDMA Write/Put)
• No involvement by the CPU at the sender (RDMA Read/Get)
• 1-2 µs latency (for short data)
• 1.5 – 2.6 GBps bandwidth (for large data)
• 3-5 µs for atomic operation
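The difference between channel semantics and memory semantics can be illustrated with a toy model. This is not the verbs API and involves no real hardware. `Node`, `send`, `rdma_write` and `rdma_read` are hypothetical names chosen for this sketch. The point it shows: a send only succeeds if the receiver has already posted a buffer, whereas an RDMA operation names the remote memory location directly and needs no action from the remote side.

```python
from collections import deque


class Node:
    """Toy endpoint: a receive queue (channel semantics) plus a
    byte array standing in for registered memory (memory semantics)."""

    def __init__(self, mem_size):
        self.recv_queue = deque()          # buffers posted by the receiver
        self.completed = []                # payloads delivered via send/recv
        self.memory = bytearray(mem_size)  # stand-in for registered memory

    def post_recv(self, size):
        # The receiver acts first: it posts a buffer for a future send.
        self.recv_queue.append(bytearray(size))


def send(dst, payload):
    """Channel semantics: consumes a buffer the receiver posted earlier."""
    if not dst.recv_queue:
        raise RuntimeError("no receive buffer posted")
    buf = dst.recv_queue.popleft()
    if len(payload) > len(buf):
        raise ValueError("payload larger than posted buffer")
    buf[:len(payload)] = payload
    dst.completed.append(bytes(buf[:len(payload)]))


def rdma_write(dst, offset, payload):
    """Memory semantics: the initiator picks the remote offset itself."""
    end = offset + len(payload)
    if offset < 0 or end > len(dst.memory):
        raise ValueError("write outside registered memory")
    dst.memory[offset:end] = payload


def rdma_read(src, offset, length):
    """Memory semantics: fetch remote bytes without remote participation."""
    return bytes(src.memory[offset:offset + length])
```

In this model `send` fails until `post_recv` has been called on the target, while `rdma_write` and `rdma_read` work against `memory` at any time, mirroring the "no involvement by the CPU" bullets above.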

IB Transport Services

Advanced mechanisms like SRQ and a new transport, eXtended Reliable Connection (XRC), were introduced recently

Shared Receive Queue (SRQ)

[Figure: two processes compared. One RQ per connection (labels m and n-1) versus one SRQ for all connections (label p)]

• SRQ is a hardware mechanism in IB by which a process can share receive resources (memory) across multiple connections
• A new feature, introduced in specification v1.2
• 0 < p << m*(n-1)
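The inequality 0 < p << m*(n-1) can be made concrete with a small calculation. The sketch assumes one reading of the figure labels: m receive buffers on each of n-1 connections without SRQ, and p buffers in the single shared queue. That reading, the function names and the sample numbers are assumptions for illustration, not values from the talk.

```python
def per_connection_buffers(m, n):
    """Receive buffers without SRQ: m buffers on each of n-1 connections."""
    return m * (n - 1)


def srq_saving(m, n, p):
    """Fraction of receive buffers saved by one SRQ holding p buffers."""
    baseline = per_connection_buffers(m, n)
    if not 0 < p <= baseline:
        raise ValueError("expected 0 < p <= m*(n-1)")
    return 1 - p / baseline


if __name__ == "__main__":
    # Hypothetical numbers: 32 buffers per connection, 1024 processes,
    # and a shared queue of 512 buffers.
    print(per_connection_buffers(32, 1024))
    print(round(srq_saving(32, 1024, 512), 4))
```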

eXtended Reliable Connection (XRC)

M = # of nodes
N = # of processes/node

RC Connections: (M^2-1)*N
XRC Connections: (M-1)*N

• Each QP takes at least one page of memory
  – Connections between all processes are very costly for RC
• New IB Transport added: eXtended Reliable Connection
  – Allows connections between nodes instead of processes
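A short calculation shows how quickly the two connection counts diverge. The sketch evaluates the two expressions as they appear on the slide (the exponent in the RC expression is reconstructed from a damaged extraction, so treat it with care). It also assumes a 4096-byte page for the "at least one page per QP" estimate, which the slide does not state.

```python
PAGE_BYTES = 4096  # assumed page size; the slide only says "at least one page"


def rc_connections(nodes, procs_per_node):
    """RC expression from the slide: (M^2 - 1) * N."""
    return (nodes * nodes - 1) * procs_per_node


def xrc_connections(nodes, procs_per_node):
    """XRC expression from the slide: (M - 1) * N."""
    return (nodes - 1) * procs_per_node


def qp_memory_bytes(connections, page_bytes=PAGE_BYTES):
    """Lower bound on QP memory if each connection needs one page."""
    return connections * page_bytes


if __name__ == "__main__":
    M, N = 64, 8  # hypothetical cluster: 64 nodes, 8 processes per node
    for name, count in [("RC", rc_connections(M, N)), ("XRC", xrc_connections(M, N))]:
        print(name, count, "connections,", qp_memory_bytes(count), "bytes")
```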

Subnet Manager

[Figure: a subnet manager, a switch and compute nodes; labels: active links, inactive links, multicast join, multicast setup]

Automatic Path Migration

• Automatically utilizes IB multipathing for network fault-tolerance
• Enables migrating connections to a different path
  – Connection recovery in the case of failures
  – Optional Feature
• Available for RC, UC, and RD
• Reliability guarantees for service type maintained during migration

Presentation Overview

• Overview of InfiniBand
  – Features
  – Products (Hardware and Software)
  – Trends
• MVAPICH and MVAPICH2 Features
• Design Insights and Sample Performance Numbers
• Future Plans
• Conclusions and Final Q&A

IB Hardware Products

• Many IB vendors: Mellanox, Voltaire, Cisco, Qlogic
  – Aligned with many server vendors: Intel, IBM, SUN, Dell
  – And many integrators: Appro, Advanced Clustering, Microway, …
• Broadly two kinds of adapters
  – Offloading (Mellanox) and Onloading (Qlogic)
• Adapters with different interfaces:
  – Dual port 4X with PCI-X (64 bit/133 MHz), PCIe x8, PCIe 2.0 and HT
• MemFree Adapter
  – No memory on HCA; uses system memory (through PCIe)
  – Good for LOM designs (Tyan S2935, Supermicro 6015T-INFB)
• Different speeds
  – SDR (8 Gbps), DDR (16 Gbps) and QDR (32 Gbps)
• Some 12X SDR adapters exist as well (24 Gbps each way)

Tyan Thunder S2935 Board

[Figure: photograph of the board (Courtesy Tyan)]

IB Hardware Products (contd.)

• Customized adapters to work with IB switches
  – Cray XD1 (formerly by Octigabay), Cray CX1
• Switches:
  – 4X SDR switch (8-288 ports)
    • 12X ports available for inter-switch connectivity
  – 4X DDR switch (mainly available in 8 to 288 port models)
  – 12X switches (small sizes available)
  – 3456-port “Magnum” switch from SUN used at TACC
    • 72-port “nano magnum” switch with DDR speed
  – New 36-port InfiniScale IV QDR switch silicon by Mellanox
    • Will allow high-density switches to be built
• Switch Routers with Gateways
  – IB-to-FC; IB-to-IP

IB Software Products

• Low-level software stacks
  – VAPI (Verbs-Level API) from Mellanox
  – Modified and customized VAPI from other vendors
  – New initiative: Open Fabrics (formerly OpenIB)
    • http://www.openfabrics.org
    • Open-source code available with Linux distributions
    • Initially IB; later extended to incorporate iWARP
• High-level software stacks
  – MPI, SDP, IPoIB, SRP, iSER, DAPL, NFS, PVFS on various stacks (primarily VAPI and OpenFabrics)

OpenFabrics

• www.openfabrics.org
• Open source organization (formerly OpenIB)
• Incorporates both IB and iWARP in a unified manner
• Focuses on Open Source IBA and iWARP support for Linux and Windows
• Design of complete software stack with `best of breed’ components
  – Gen1
  – Gen2 (current focus)
• Users can download the entire stack and run it
  – Latest release is OFED 1.3.1
  – OFED 1.4 is being worked out

OpenFabrics Software Stack

[Figure: layered diagram of the OpenFabrics stack, with kernel-bypass paths between user space and the hardware. Layers and labels, top to bottom:
  Application Level (apps and access methods for using the OF stack): IP Based App Access, Sockets Based Access, Various MPIs, Access to File Systems, Block Storage Access, Clustered DB Access, OpenSM, Diag Tools
  User APIs (user space): User Level MAD API, UDAPL, SDP Lib, InfiniBand / iWARP R-NIC OpenFabrics User Level Verbs / API
  Upper Layer Protocol (kernel space): IPoIB, SDP, SRP, iSER, RDS, NFS-RDMA RPC, Cluster File Sys
  Mid-Layer: Connection Manager Abstraction (CMA), Connection Managers, SA Client, MAD, SMA, InfiniBand / iWARP R-NIC OpenFabrics Kernel Level Verbs / API
  Provider: Hardware Specific Drivers
  Hardware: InfiniBand HCA, iWARP R-NIC]

Key:
SA       Subnet Administrator
MAD      Management Datagram
SMA      Subnet Manager Agent
PMA      Performance Manager Agent
IPoIB    IP over InfiniBand
SDP      Sockets Direct Protocol
SRP      SCSI RDMA Protocol (Initiator)
iSER     iSCSI RDMA Protocol (Initiator)
RDS      Reliable Datagram Service
UDAPL    User Direct Access Programming Lib
HCA      Host Channel Adapter
R-NIC    RDMA NIC

IB Installations

• 121 IB clusters (24.2%) in June ’08 TOP500 list (www.top500.org)
• 12 IB clusters in TOP25
  – 122,400-cores (RoadRunner) at LANL (1st)
  – 62,976-cores (Ranger) at TACC (4th)
  – 14,336-cores at New Mexico (7th)
  – 14,384-cores at Tata CRL, India (8th)
  – 10,240-cores at TEP, France (10th)
  – 13,728-cores in Sweden (11th)
  – 8,320-cores in UK (18th)
  – 6,720-cores in Germany (19th)
  – 10,000-cores at CCS, Tsukuba, Japan (20th)
  – 9,600-cores at NCSA (23rd)
  – 12,344-cores at Tokyo Inst. of Technology (24th)
  – 13,824-cores at NASA/Columbia (25th)
• More are getting installed ….

InfiniBand in the Top500

[Figure: two panels, Systems and Performance]

Percentage share of InfiniBand is steadily increasing

Presentation Overview

• Overview of InfiniBand
  – Features
  – Products (Hardware and Software)
  – Trends
• MVAPICH and MVAPICH2 Features
• Design Insights and Sample Performance Numbers
• Future Plans
• Conclusions and Final Q&A

Designing MPI Using IB/iWARP Features

[Figure: three-tier diagram.
  MPI Design Components: Protocol Mapping, Flow Control, Communication Progress, Multi-rail Support, Buffer Management, Connection Management, Collective Communication, One-sided Active/Passive
  Design Alternatives and Solutions
  IB and iWARP/Ethernet Features: Send/Receive, RDMA Operations, Atomic Operations, Unreliable Datagram, Shared Receive Queues, Static Rate Control, Dynamic Rate Control, End-to-End Flow Control, Multicast, Out-of-order Placement, QoS, Multi-Path VLANs, Multi-Path LMC]

Cluster '08

MVAPICH/MVAPICH2 Software

• High Performance MPI Library for IB and 10GE
  – MVAPICH (MPI-1) and MVAPICH2 (MPI-2)
  – Latest Releases: MVAPICH 1.1RC1 and MVAPICH2 1.2RC2
  – Used by more than 765 organizations in 42 countries
  – More than 23,000 downloads from OSU site directly
  – Empowering many TOP500 clusters
    • 4th ranked 62,976-core cluster (Ranger) at TACC
  – Available with software stacks of many IB, 10GE and server vendors including Open Fabrics Enterprise Distribution (OFED)
  – Also supports uDAPL device to work with any network supporting uDAPL
  – http://mvapich.cse.ohio-state.edu/

MVAPICH 1.1 Architecture

[Figure: MVAPICH (MPI-1) (1.1) and its interfaces: #1 OpenFabrics/Gen2 (single-rail), #2 OpenFabrics/Gen2-Hybrid (single-rail), #3 PSM, #4 Shared-Memory, #5 TCP/IP. Also shown, marked deprecated: VAPI, Gen2-Multirail, uDAPL. Targets shown: InfiniBand (Mellanox) on PCI-X, PCIe, PCIe-Gen2 (SDR, DDR and QDR); InfiniBand (QLogic) on PCIe & HT (SDR and DDR); memory; TCP/IP; serial nodes/laptops with multi-core. Platforms: IA-32, EM64T, Opteron, IA-64, ..]

MVAPICH2 1.2 Architecture

[Figure: MVAPICH2 (MPI-2) (1.2) and its interfaces, numbered #1 to #4: OpenFabrics/Gen2 (integrated multirail), uDAPL, PSM and TCP/IP, with one interface marked "Under Design". Targets shown: InfiniBand (Mellanox) on PCI-X, PCIe, PCIe-Gen2 (SDR, DDR and QDR); InfiniBand (QLogic) on PCIe & HT (SDR and DDR); 10GigE/iWARP (Chelsio, Neteffect) on PCI-X, PCIe; adapters supporting uDAPL (IB, Myrinet, Quadrics) (Linux, Solaris). Platforms: IA-32, EM64T, Opteron, IA-64, ..]

Major Features of MVAPICH 1.1

• OpenFabrics-Gen2
  – Scalable job start-up with mpirun_rsh, support for SLURM
  – RC and XRC support
  – Flexible message coalescing
  – Multi-core-aware pt-to-pt communication
  – User-defined processor affinity for multi-core platforms
  – Multi-core-optimized collective communication
  – Asynchronous and scalable on-demand connection management
  – RDMA Write and RDMA Read-based protocols
  – Lock-free Asynchronous Progress for better overlap between computation and communication
  – Polling and blocking support for communication progress
  – Multi-pathing support leveraging LMC mechanism on large fabrics
  – Network-level fault tolerance with Automatic Path Migration (APM)
  – Mem-to-mem reliable data transfer mode (for detection of I/O error with 32-bit CRC)
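The idea behind a memory-to-memory check with a 32-bit CRC can be sketched as follows. This is not MVAPICH code. It uses the CRC-32 from Python's `zlib` module as a stand-in, since the slide does not say which CRC polynomial or framing MVAPICH uses, and `attach_crc` / `verify_crc` are hypothetical helper names.

```python
import zlib

CRC_BYTES = 4  # a 32-bit CRC occupies four bytes


def attach_crc(payload: bytes) -> bytes:
    """Sender side: append a CRC computed over the source buffer."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return payload + crc.to_bytes(CRC_BYTES, "big")


def verify_crc(message: bytes) -> bytes:
    """Receiver side: recompute the CRC over the destination buffer."""
    if len(message) < CRC_BYTES:
        raise ValueError("message too short to carry a CRC")
    payload = message[:-CRC_BYTES]
    received = int.from_bytes(message[-CRC_BYTES:], "big")
    if zlib.crc32(payload) & 0xFFFFFFFF != received:
        raise ValueError("CRC mismatch: data corrupted between memories")
    return payload


if __name__ == "__main__":
    msg = attach_crc(b"halo exchange data")
    print(verify_crc(msg))
```

Because the CRC is computed from the sender's memory and checked against the receiver's memory, it can catch corruption introduced anywhere along the I/O path, not only on the wire.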

Major Features of MVAPICH 1.1 (Cont’d)

• OpenFabrics-Gen2-Hybrid
  – Newly introduced interface in 1.1
  – Replaces UD interface in 1.0
  – Targeted for emerging multi-thousand-core clusters to achieve the best performance with minimal memory footprint
  – Most of the features as in Gen2
  – Adaptive selection during run-time (based on application and systems characteristics) to switch between
    • RC and UD (or between XRC and UD) transports
  – Multiple buffer organization with XRC support
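One way to picture run-time switching between UD and RC is a per-peer policy: start every peer on the connectionless UD transport and promote only frequently used peers to RC, up to a cap that bounds memory. The sketch below is a hypothetical policy written for illustration. The class name, the message-count threshold and the cap are all assumptions, and the slides do not describe MVAPICH's actual selection criteria beyond "application and systems characteristics".

```python
from collections import Counter


class HybridTransportSelector:
    """Hypothetical policy: UD by default, RC for the busiest peers."""

    def __init__(self, promote_after=16, max_rc=8):
        self.promote_after = promote_after  # messages before promotion
        self.max_rc = max_rc                # cap on RC connections
        self.sent = Counter()               # messages sent per peer
        self.rc_peers = set()               # peers already promoted to RC

    def transport_for(self, peer):
        """Record one message to `peer` and return the transport to use."""
        self.sent[peer] += 1
        if peer in self.rc_peers:
            return "RC"
        if self.sent[peer] > self.promote_after and len(self.rc_peers) < self.max_rc:
            self.rc_peers.add(peer)
            return "RC"
        return "UD"


if __name__ == "__main__":
    selector = HybridTransportSelector(promote_after=2, max_rc=1)
    print([selector.transport_for(7) for _ in range(4)])
```

The cap is what keeps the memory footprint small on multi-thousand-core clusters: most peers never leave UD, so the number of RC queue pairs stays bounded regardless of job size.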

Major Features of MVAPICH2 1.2

• OpenFabrics-Gen2
  – All features as in MVAPICH 1.1 (OpenFabrics-Gen2) except asynchronous progress and XRC
  – RDMA CM-based connection management (Gen2-IB and Gen2-iWARP)
  – Integrated multi-rail support for IB and 10GigE/iWARP
  – Checkpoint-Restart (currently for IB)
    • Systems-level automatic
    • Application-initiated systems-level
• uDAPL
  – Most of the features of OpenFabrics-Gen2 except multi-rail and checkpointing
    • Flexibility for different adapters, software stacks and OS (Linux and Solaris) supporting uDAPL

Support for Multiple Interfaces/Adapters

• OpenFabrics/Gen2-IB and OpenFabrics/Gen2-Hybrid
  – All IB adapters supporting OpenFabrics/Gen2
• Qlogic/PSM
  – Qlogic adapters
• OpenFabrics/Gen2-iWARP
  – Chelsio
• uDAPL
  – Linux-IB
  – Solaris-IB
  – Other adapters such as Neteffect 10GigE
• TCP/IP
  – Any adapter supporting TCP/IP interface
• Shared Memory Channel (MVAPICH)
  – for running applications in a node with multi-core processors

Presentation Overview

• Overview of InfiniBand
  – Features
  – Products (Hardware and Software)
  – Trends
• MVAPICH and MVAPICH2 Features
• Design Insights and Sample Performance Numbers
• Future Plans
• Conclusions and Final Q&A

Design Insights and Sample Results

• Scalable Job Start-up
• Basic Performance
  – Two-sided Communication
  – One-sided Communication
• Multi-core-aware pt-to-pt communication
• Multi-core-aware Optimized Collective
• Integrated Multi-rail Design
• Scalability for Large-scale Systems (SRQ, UD, Hybrid & XRC)
• Applications-level Scalability
• Asynchronous Progress
• Fault Tolerance

Scalable Startup

• An enhanced mpirun_rsh framework was introduced in MVAPICH 1.0 to significantly cut down job start-up time on large clusters
• Is available with MVAPICH 1.1 and MVAPICH2 1.2

[Figure: Wallclock Runtime for MPI Hello World; average runtime (secs) vs. # of MPI tasks (cores), 1K to 32K. Series: MVAPICH-0.9.9, MVAPICH-1.0. Courtesy TACC.]

One-way Latency: MPI over IB

[Figure: two panels, Small Message Latency and Large Message Latency; latency (us) vs. message size (bytes). Series: MVAPICH-InfiniHost III-DDR, MVAPICH-Qlogic-SDR, MVAPICH-ConnectX-DDR, MVAPICH-ConnectX-QDR-PCIe2, MVAPICH-Qlogic-DDR-PCIe2. Latencies annotated on the plot: 1.06, 1.28, 1.49, 2.19 and 2.77 us.]

InfiniHost III and ConnectX-DDR: 2.33 GHz Quad-core (Clovertown) Intel with IB switch
ConnectX-QDR-PCIe2: 2.83 GHz Quad-core (Harpertown) Intel with back-to-back

Bandwidth: MPI over IB

[Figure: two panels, Unidirectional Bandwidth and Bidirectional Bandwidth; MillionBytes/sec vs. message size (bytes). Series: MVAPICH-InfiniHost III-DDR, MVAPICH-Qlogic-SDR, MVAPICH-ConnectX-DDR, MVAPICH-ConnectX-QDR-PCIe2, MVAPICH-Qlogic-DDR-PCIe2. Values annotated on the unidirectional plot: 2570.6, 1952.9, 1399.8, 1389.4 and 936.5. Values annotated on the bidirectional plot: 5012.1, 3621.4, 2718.3, 2457.4 and 1519.8.]

InfiniHost III and ConnectX-DDR: 2.33 GHz Quad-core (Clovertown) Intel with IB switch
ConnectX-QDR-PCIe2: 2.83 GHz Quad-core (Harpertown) Intel with back-to-back

RDMA CM and iWARP Support

• Available starting with MVAPICH2 0.9.8
• RDMA CM is supported for both
  – IB
  – 10GigE/iWARP

One-way Latency: MPI over iWARP

[Figure: One-way Latency; latency (us) vs. message size (bytes), 0 to 4K. Series: Chelsio (TCP/IP), Chelsio (iWARP). Latencies annotated on the plot: 15.47 and 6.88 us.]

2.0 GHz Quad-core Intel with 10GE (Fulcrum) Switch

Bandwidth: MPI over iWARP

[Figure: two panels, Unidirectional Bandwidth and Bidirectional Bandwidth; MillionBytes/sec vs. message size (bytes). Series: Chelsio (TCP/IP), Chelsio (iWARP). Values annotated on the unidirectional plot: 1231.8 and 839.8. Values annotated on the bidirectional plot: 2260.8 and 855.3.]

2.0 GHz Quad-core Intel with 10GE (Fulcrum) Switch

MPI One-sided Communication

• Specified by the MPI-2 standard
• Data movement operations
  – MPI_Put
  – MPI_Get
  – MPI_Accumulate
• Synchronization operations
  – MPI_Win_lock/MPI_Win_unlock
  – MPI_Win_fence
  – MPI_Win_post, MPI_Win_start, MPI_Win_complete, MPI_Win_wait

[Figure: example access sequence: Lock, Put, Get, Put, Accumulate, Put, Unlock]
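The lock / put / get / accumulate / unlock pattern can be modelled in a few lines. The sketch below is a single-process toy, not MPI: `Window` and `access_epoch` are hypothetical names, and a `threading.Lock` stands in for the window lock. It follows the Lock, Put, Get, Put, Accumulate, Put, Unlock order from the slide.

```python
import threading


class Window:
    """Toy stand-in for an MPI-2 window: a list of numbers plus a lock."""

    def __init__(self, size):
        self.data = [0] * size
        self._lock = threading.Lock()

    def lock(self):
        self._lock.acquire()

    def unlock(self):
        self._lock.release()

    def put(self, offset, values):
        # Overwrite target memory with the origin's values.
        self.data[offset:offset + len(values)] = list(values)

    def get(self, offset, count):
        # Copy target memory back to the origin.
        return list(self.data[offset:offset + count])

    def accumulate(self, offset, values):
        # Combine origin values into target memory (sum, in this model).
        for i, value in enumerate(values):
            self.data[offset + i] += value


def access_epoch(win):
    """One access epoch in the order Lock, Put, Get, Put, Accumulate, Put, Unlock."""
    win.lock()
    try:
        win.put(0, [1, 2])
        fetched = win.get(0, 2)
        win.put(2, [3])
        win.accumulate(0, [10, 10])
        win.put(3, [4])
    finally:
        win.unlock()
    return fetched
```

The model applies each operation immediately. In MPI itself, one-sided operations are only guaranteed to be complete once the synchronization call that closes the epoch returns, so real code should not read the result of a get before the unlock.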

MPI_Put Performance (IB DDR)

[Figure: three panels for ConnectX-DDR vs. message size (bytes): latency (us), bandwidth (MB/s) and bi-directional bandwidth (MB/s). Values annotated on the plots: 2.57, 1405 and 2716.]

• Single port results only (EM64T, PCI-Ex)

Results for other platforms at http://mvapich.cse.ohio-state.edu

Design Insights and Sample Results

• Scalable Job Start-up
• Basic Performance
  – Two-sided Communication
  – One-sided Communication
• Multi-core-aware pt-to-pt communication
• Multi-core-aware Optimized Collective
• Integrated Multi-rail Design
• Scalability for Large-scale Systems (SRQ, UD, Hybrid & XRC)
• Applications-level Scalability
• Asynchronous Progress
• Fault Tolerance

Page 49:

Multicore-aware Communication: Latency and Bandwidth

[Figure: two panels. Small Message Latency plots latency (us) against message size (1 to 128 bytes); Bandwidth plots bandwidth (MB/s) against message size (1 byte to 1 MB). Series: Intra-CMP Basic Design, Intra-CMP Advanced Design, Inter-CMP Basic Design, Inter-CMP Advanced Design.]

• Multicore-aware design improves both latency and bandwidth
• Available in MVAPICH and MVAPICH2 stacks

L. Chai, A. Hartono and D. K. Panda, “Designing High Performance and Scalable MPI Intra-node Communication Support for Clusters”, Cluster ‘06

Page 50:

Shared-memory Aware Collectives

[Figure: two panels plotting latency (usec) against message size (bytes). MPI_Bcast series: without-shmem (64x8), without-shmem (64x4), with-shmem (64x8), with-shmem (64x4). MPI_Allreduce (512 cores) series: Allreduce-shmem, Allreduce-noshmem, for message sizes from 4096 to 65536 bytes.]

Page 51:

MVAPICH-PSM Collective Performance (512 cores)

[Figure: two panels comparing MVAPICH PSM 1.0 with InfiniPath MPI 2.1. Broadcast (64X8) plots latency (us) against message size (1 to 1024 bytes); Barrier Latency plots latency (us) against system size (1X8 to 64X8).]

• 64 Intel Quad-core systems with dual sockets; PCIe InfiniPath Adapters
• Significant performance improvement for MPI_Bcast and MPI_Barrier

Page 52:

Design Insights and Sample Results

• Scalable Job Start-up
• Basic Performance
  – Two-sided Communication
  – One-sided Communication
• Multi-core-aware pt-to-pt communication
• Multi-core-aware Optimized Collective
• Integrated Multi-rail Design
• Scalability for Large-scale Systems (SRQ, UD, Hybrid & XRC)
• Applications-level Scalability
• Asynchronous Progress
• Fault Tolerance

Page 53:

Integrated Multi-Rail Design (MVAPICH2)

• Multiple ports/adapters
• Multiple adapters
• Multiple paths with LMCs

[Figure: block diagram with the labels Application, MPI Layer, Eager, Rendezvous, Input, Communication Scheduler, Scheduling Policies, Completion Notifier, Notification, Virtual Subchannels and InfiniBand Layer.]

J. Liu, A. Vishnu and D. K. Panda. Building MultiRail InfiniBand Clusters: MPI Level Design and Performance Evaluation. Presented at Supercomputing ’04, April, 2004
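
As a rough illustration of what a scheduling policy over several rails might look like, the sketch below stripes large messages evenly across rails and sends small ones round-robin. It is an assumption-laden toy, not the MVAPICH2 scheduler: the function names, the `RoundRobin` class and the 16 KB `stripe_threshold` are all invented for this example.

```python
def stripe(message_len, num_rails):
    """Split message_len bytes into (rail, offset, length) chunks, one per rail."""
    base, extra = divmod(message_len, num_rails)
    chunks = []
    offset = 0
    for rail in range(num_rails):
        # The first `extra` rails carry one additional byte each.
        length = base + (1 if rail < extra else 0)
        if length:
            chunks.append((rail, offset, length))
        offset += length
    return chunks


class RoundRobin:
    """Pick rails in rotation for messages too small to be worth splitting."""

    def __init__(self, num_rails):
        self.num_rails = num_rails
        self._next = 0

    def pick(self):
        rail = self._next
        self._next = (self._next + 1) % self.num_rails
        return rail


def schedule(message_len, num_rails, rr, stripe_threshold=16384):
    """Return the chunks to post; stripe_threshold is a made-up cutoff."""
    if message_len >= stripe_threshold:
        return stripe(message_len, num_rails)
    return [(rr.pick(), 0, message_len)]
```

With two rails, `schedule(32768, 2, RoundRobin(2))` yields two 16384-byte chunks, one per rail, while consecutive 100-byte messages alternate between rail 0 and rail 1.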

Page 54:

Design Insights and Sample Results

• Scalable Job Start-up
• Basic Performance
  – Two-sided Communication
  – One-sided Communication
• Multi-core-aware pt-to-pt communication
• Multi-core-aware Optimized Collective
• Integrated Multi-rail Design
• Scalability for Large-scale Systems (SRQ, UD, Hybrid & XRC)
• Applications-level Scalability
• Asynchronous Progress
• Fault Tolerance

Page 55:

Memory Utilization using Shared Receive Queues

[Figure: two panels comparing MVAPICH-RDMA, MVAPICH-SR and MVAPICH-SRQ. MPI_Init memory utilization plots memory used (MB) against number of processes (2 to 32); Analytical model plots memory used (GB) against number of processes (128 to 16384).]

• SRQ consumes only 1/10th of the memory used by RDMA for 16,000 processes
• Send/Recv exhausts the Buffer Pool after 1000 processes and consumes 2X the memory of SRQ for 16,000 processes

S. Sur, L. Chai, H.-W. Jin and D. K. Panda, “Shared Receive Queue Based Scalable MPI Design for InfiniBand Clusters”, IPDPS 2006
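
The intuition behind these curves can be sketched with a very small model: buffers posted per connection grow linearly with the number of peers, while a shared receive queue uses one pool regardless of peer count. The function names and all parameter values below (8 KB buffers, 32 buffers per connection, a 512-buffer pool) are hypothetical. The model ignores per-connection QP state and does not attempt to reproduce the ratios reported on the slide.

```python
def per_connection_mb(num_procs, bufs_per_conn, buf_bytes):
    """Receive-buffer memory when buffers are posted separately for every peer."""
    peers = num_procs - 1
    return peers * bufs_per_conn * buf_bytes / 2**20


def shared_queue_mb(pool_bufs, buf_bytes):
    """Receive-buffer memory when all connections share one pool (SRQ style)."""
    return pool_bufs * buf_bytes / 2**20


if __name__ == "__main__":
    # Hypothetical settings, chosen only to make the scaling visible.
    for n in (129, 1025, 16385):
        print(n, per_connection_mb(n, 32, 8192), shared_queue_mb(512, 8192))
```

Under these made-up settings the per-connection cost doubles whenever the peer count doubles, while the shared pool stays fixed, which is the qualitative behavior the slide attributes to SRQ.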

Page 56:

Communication Buffer Memory Utilization with NAMD (apoa1)

[Figure: two panels comparing ARDMA-SR, ARDMA-SRQ and SRQ for 16, 32 and 64 processes: memory usage (MB) and normalized performance.]

Statistics shown on the slide:
  Avg. RDMA channels    53.15
  Avg. Low watermarks   0.03
  Unexpected Msgs (%)   48.2
  Total Messages        3.7e6
  MPI Time (%)          23.54

• 50% messages < 128 Bytes, other 50% between 128 Bytes and 32 KB
  – 53 RDMA connections setup for 64 process experiment
• SRQ Channel takes 5-6MB of memory
  – Memory needed by SRQ decreases by 1MB going from 16 to 64

S. Sur, M. Koop and D. K. Panda, “High-Performance and Scalable MPI over InfiniBand with Reduced Memory Usage: An In-Depth Performance Analysis”, SC ‘06

Page 57:

UD vs. RC: Performance and Scalability (SMG2000 Application)

[Figure: Performance. Normalized time against number of processes (128 to 4096) for RC (MVAPICH 0.9.8) and UD (Progress).]

Memory Usage (MB/process)

             RC (MVAPICH 0.9.8)                 UD Design
Processes    Conn.  Buffers  Struct.  Total     Buffers  Struct.  Total
512          22.9   65.0     0.3      88.2      37.0     0.2      37.2
1024         29.5   65.0     0.6      95.1      37.0     0.4      37.4
2048         42.4   65.0     1.2      107.4     37.0     0.9      37.9
4096         66.7   65.0     2.4      134.1     37.0     1.7      38.7

• Large number of peers per process (992 at maximum)
• UD reduces HCA QP cache thrashing

M. Koop, S. Sur, Q. Gao and D. K. Panda, “High Performance MPI Design using Unreliable Datagram for Ultra-Scale InfiniBand Clusters,” ICS ‘07
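
A few lines of arithmetic on the memory table make the trend easier to see. The values are copied from the slide; the helper names are invented here. The RC total is roughly 2.4 times the UD total at 512 processes and about 3.5 times at 4096, and the RC connection memory grows by roughly 12.5 KB per added process between those two points.

```python
# Memory usage in MB per process, copied from the table on this slide.
# RC entries are (connection memory, total); UD entries are totals.
RC = {512: (22.9, 88.2), 1024: (29.5, 95.1), 2048: (42.4, 107.4), 4096: (66.7, 134.1)}
UD_TOTAL = {512: 37.2, 1024: 37.4, 2048: 37.9, 4096: 38.7}


def rc_over_ud(procs):
    """Ratio of RC total memory to UD total memory at a given process count."""
    return RC[procs][1] / UD_TOTAL[procs]


def conn_growth_kb_per_process(lo=512, hi=4096):
    """Average growth of RC connection memory per added process, in KB."""
    delta_mb = RC[hi][0] - RC[lo][0]
    return delta_mb * 1024 / (hi - lo)


if __name__ == "__main__":
    for procs in sorted(RC):
        print(procs, round(rc_over_ud(procs), 2))
    print(round(conn_growth_kb_per_process(), 1), "KB per added process")
```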

Page 58:

Impact of Hybrid RC/UD Design

Combine the benefits of both RC and UD together

Application benchmark results on 512-core system

M. Koop, T. Jones and D. K. Panda, “MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand,” IPDPS ’08

Page 59:

Impact of XRC-based Design

[Figure: normalized time for the datasets apoa1, er-gre, f1atpase and jac. Series: RC-SRQ, RC-MSRQ, EXRC-SRQ, EXRC-MSRQ, SXRC-SRQ, SXRC-MSRQ.]

• For the jac dataset RC-MSRQ shows 10% worse performance
• HCA cache is likely being thrashed
• SXRC modes show higher performance since fewer QPs are being used (and are staying in cache)

M. Koop, J. Sridhar and D. K. Panda, “Scalable MPI Design over InfiniBand using eXtended Reliable Connection,” Cluster ’08

Page 60:

Performance of HPC Applications on TACC Ranger using MVAPICH + IB

• Rob Farber’s facial recognition application was run up to 60K cores using MVAPICH
• Ranges from 84% of peak at low end to 65% of peak at high end

http://www.tacc.utexas.edu/research/users/features/index.php?m_b_c=farber

Cluster '08

Page 61:

Performance of HPC Applications on TACC Ranger: DNS/Turbulence

Courtesy: P.K. Yeung, Diego Donzis, TG 2008

Page 62:

Design Insights and Sample Results

• Scalable Job Start-up
• Basic Performance
  – Two-sided Communication
  – One-sided Communication
• Multi-core-aware pt-to-pt communication
• Multi-core-aware Optimized Collective
• Integrated Multi-rail Design
• Scalability for Large-scale Systems (SRQ, UD, Hybrid & XRC)
• Applications-level Scalability
• Asynchronous Progress
• Fault Tolerance

Page 63:

Asynchronous Progress

• Asynchronous progress (both at sender and receiver) in MVAPICH 1.0
• Design has been enhanced to a lock-free design in MVAPICH 1.1
• Potential for overlap of computation and communication

R. Kumar, A. Mamidala, M. Koop, G. Santhanaraman and D.K. Panda, Lock-free Asynchronous Rendezvous Design for MPI Point-to-Point Communication, EuroPVM/MPI 2008, September 2008.
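
The potential for overlap mentioned on this slide can be bounded with an idealized model: without overlap, compute and communication times add; with perfect overlap, the longer of the two dominates. The function names are invented here, time units are arbitrary, and the model assumes the transfer can progress without any host CPU involvement, which real systems only approximate.

```python
def no_overlap_time(compute, comm):
    """Total time when communication blocks the application."""
    return compute + comm


def full_overlap_time(compute, comm):
    """Total time when communication progresses entirely in the background."""
    return max(compute, comm)


def overlap_speedup(compute, comm):
    """Upper bound on the speedup that overlap can deliver in this model."""
    return no_overlap_time(compute, comm) / full_overlap_time(compute, comm)
```

The bound peaks at 2 when compute and communication times are equal, and approaches 1 as either one dominates.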

Page 64:

Asynchronous Progress (Mellanox DDR)

[Figure: two panels, Application Availability at Sender and Application Availability at Receiver, plotting percentage against message size (8K to 4M bytes) for ASYNC Protocol and RPUT Protocol.]

Results for SMB Benchmark

Page 65:

Design Insights and Sample Results

• Scalable Job Start-up
• Basic Performance
  – Two-sided Communication
  – One-sided Communication
• Multi-core-aware pt-to-pt communication
• Multi-core-aware Optimized Collective
• Integrated Multi-rail Design
• Scalability for Large-scale Systems (SRQ, UD, Hybrid & XRC)
• Applications-level Scalability
• Asynchronous Progress
• Fault Tolerance

Page 66:

Fault Tolerance

• Component failures are common in large-scale clusters
• This imposes a need for reliability and fault tolerance
• Working along the following three angles
  – Reliable Networking with Automatic Path Migration (APM) utilizing Redundant Communication Paths (available since MVAPICH 1.0 and MVAPICH2 1.0 onward)
  – Process Fault Tolerance with Efficient Checkpoint and Restart (available since MVAPICH2 0.9.8)
  – End-to-end Reliability with memory-to-memory CRC (available since MVAPICH 0.9.9)
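
The idea of a memory-to-memory CRC can be sketched in a few lines: the sender computes a checksum over the source buffer, and the receiver recomputes it over the destination buffer and compares. The slide does not say which CRC MVAPICH uses; the sketch below uses the CRC-32 from Python's `zlib` purely as an illustration, and the function names are invented.

```python
import zlib


def send_with_crc(payload: bytes):
    """Return (payload, crc) as a sender might before handing data to the network."""
    return payload, zlib.crc32(payload)


def verify(payload: bytes, crc: int) -> bool:
    """Recompute the CRC over the received buffer and compare with the sender's."""
    return zlib.crc32(payload) == crc


if __name__ == "__main__":
    data, crc = send_with_crc(b"hello")
    print(verify(data, crc))        # intact buffer
    print(verify(b"hellp", crc))    # buffer corrupted in transit
```

Because the check covers the buffer as it sits in memory on both ends, it can catch corruption introduced anywhere along the path, not only on the wire.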


Page 67:

Network Fault-Tolerance with APM

• Network Fault Tolerance using InfiniBand Automatic Path Migration (APM)
  – Utilizes Redundant Communication Paths
    • Multiple Ports
    • LMC
• Supported in OFED 1.2

A. Vishnu, A. Mamidala, S. Narravula and D. K. Panda, “Automatic Path Migration over InfiniBand: Early Experiences”, Third International Workshop on System Management Techniques, Processes, and Services, held in conjunction with IPDPS ’07
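
Path migration can be pictured as a small state machine on each connection. The sketch below borrows the Migrated, Rearm and Armed state names that InfiniBand defines for path migration; everything else (the `ToyConnection` class, its methods and the string path names) is invented for illustration and is not the verbs API or MVAPICH code.

```python
from enum import Enum


class MigState(Enum):
    MIGRATED = "migrated"   # running on the current path, no alternate armed
    REARM = "rearm"         # alternate path loaded, waiting for the peer
    ARMED = "armed"         # alternate path ready for failover


class ToyConnection:
    """Hypothetical connection with a primary and an alternate path."""

    def __init__(self, primary):
        self.active_path = primary
        self.alternate_path = None
        self.state = MigState.MIGRATED

    def load_alternate(self, path):
        # Software supplies a redundant path (another port or LMC path).
        self.alternate_path = path
        self.state = MigState.REARM

    def peer_ready(self):
        # Both sides have loaded an alternate path.
        if self.state is MigState.REARM:
            self.state = MigState.ARMED

    def path_failed(self):
        # On failure, switch to the armed alternate path.
        if self.state is not MigState.ARMED:
            raise ConnectionError("no armed alternate path")
        self.active_path, self.alternate_path = self.alternate_path, None
        self.state = MigState.MIGRATED
```

After a migration the connection is back in the Migrated state with no alternate loaded, so a new alternate path has to be loaded and armed before the next failure can be survived.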

Page 68:

Screenshots: MPI Bandwidth Test with APM

Page 69:

Checkpoint-Restart Support in MVAPICH2

• Process-level Fault Tolerance
  – User-transparent, system-level checkpointing
  – Based on BLCR from LBNL to take coordinated checkpoints of entire program, including front end and individual processes
  – Designed novel schemes to
    • Coordinate all MPI processes to drain all in-flight messages in IB connections
    • Store communication state & buffers while checkpointing
    • Restart from the checkpoint
• System-level checkpoint can be initiated from the application (added in MVAPICH2 1.0)
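
The coordinate, drain, store and restart steps listed on this slide can be walked through with a toy, single-process simulation. It is only a sketch of the ordering: the `ToyProcess` class and the helper functions are invented, queues stand in for InfiniBand connections, and nothing here reflects how BLCR or MVAPICH2 actually implement checkpointing.

```python
from collections import deque


class ToyProcess:
    """Hypothetical stand-in for one MPI process (not MVAPICH2 code)."""

    def __init__(self, rank):
        self.rank = rank
        self.state = 0           # application state
        self.inbox = deque()     # messages still in flight toward this rank
        self.buffered = []       # messages drained during a checkpoint
        self.suspended = False

    def send(self, other, value):
        if self.suspended:
            raise RuntimeError("sends are suspended during a checkpoint")
        other.inbox.append(value)

    def drain(self):
        # Pull every in-flight message off the "network" into a local buffer.
        while self.inbox:
            self.buffered.append(self.inbox.popleft())


def coordinated_checkpoint(procs):
    for p in procs:              # 1. coordinate: stop new communication
        p.suspended = True
    for p in procs:              # 2. drain in-flight messages
        p.drain()
    snapshot = {                 # 3. store state and buffers
        p.rank: {"state": p.state, "buffered": list(p.buffered)}
        for p in procs
    }
    for p in procs:              # 4. resume normal operation
        p.suspended = False
    return snapshot


def restart(snapshot):
    """Rebuild the processes from a stored snapshot."""
    procs = []
    for rank in sorted(snapshot):
        p = ToyProcess(rank)
        p.state = snapshot[rank]["state"]
        p.buffered = list(snapshot[rank]["buffered"])
        procs.append(p)
    return procs
```

Draining before storing is what keeps the snapshot consistent in this toy: once every queue is empty, no message exists only "on the wire", so the saved per-process state and buffers together describe the whole job.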

Page 70:

A Running Example (Cont.)

[Screenshots: Terminal A shows LU running. In Terminal B the user lists the running programs, selects the program and takes a checkpoint (steps 1 to 3).]

Page 71:

A Running Example (Cont.)

[Screenshots: in Terminal A, LU is not affected by the checkpoint and is then stopped using CTRL-C. In Terminal B, the program is then restarted from the checkpoint.]

Page 72:

Checkpoint-Restart Performance with PVFS2

[Figure: NAS, LU Class C, 32x1 (Storage: 8 PVFS2 servers on IPoIB). Execution time (seconds) against number of checkpoints taken: no checkpoint, 1 ckpt (avg 60 sec interval), 2 ckpts (avg 40 sec interval), 3 ckpts (avg 30 sec interval), 4 ckpts (avg 20 sec interval).]

Q. Gao, W. Yu, W. Huang and D.K. Panda, “Application-Transparent Checkpoint/Restart for MPI over InfiniBand”, ICPP ‘06

Page 73:

Presentation Overview

• Overview of InfiniBand
  – Features
  – Products (Hardware and Software)
  – Trends
• MVAPICH and MVAPICH2 Features
• Design Insights and Sample Performance Numbers
• Future Plans
• Conclusions and Final Q&A

Page 74:

Future Plans

• Most of the focus toward MVAPICH2
• Further enhancements to scalable job start-up
• Kernel-based (LiMIC2) shared memory pt-to-pt communication
• Optimization of collectives and one-sided communication based on new LiMIC2 shared memory communication
• Passive synchronization support for one-sided
• Flexible process binding for multi-rails
• Optimization of collectives
  – XRC
  – multi-rail
• Automatic tuning framework for pt-to-pt and collectives
• Network reliability (transparent recovery in case of adapter failure)
• Job pause-restart framework
• Performance and Memory scalability toward 100-200K cores

Page 75:

Conclusions

• MVAPICH and MVAPICH2 are being widely used in stable production IB clusters delivering best performance and scalability
• Also enabling clusters with 10GigE/iWARP support
• The user base stands at more than 765 organizations worldwide
• New features for scalability, high performance and fault tolerance support are aimed at deploying large-scale clusters (100-200K nodes) in the near future

Page 76:

Funding Acknowledgments

Our research is supported by the following organizations

• Current Funding support by

• Current Equipment support by

Page 77:

Personnel Acknowledgments

Current Students
– L. Chai (Ph.D.)
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– M. Koop (Ph.D.)
– P. Lai (Ph.D.)
– G. Marsh (Ph.D.)
– X. Ouyang (Ph.D.)
– G. Santhanaraman (Ph.D.)
– J. Sridhar (M.S.)
– H. Subramoni (M.S.)

Past Students
– P. Balaji (Ph.D.)
– D. Buntinas (Ph.D.)
– S. Bhagvat (M.S.)
– B. Chandrasekharan (M.S.)
– W. Jiang (M.S.)
– W. Huang (Ph.D.)
– S. Kini (M.S.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– J. Liu (Ph.D.)
– A. Mamidala (Ph.D.)
– S. Narravula (Ph.D.)
– R. Noronha (Ph.D.)
– S. Sur (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)

Current Programmer
– J. Perkins