High Performance, Scalable and Fault-Tolerant MPI over InfiniBand: An Overview of the MVAPICH/MVAPICH2 Project
Talk at Tsukuba University
by
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Current and Next Generation Applications and Computing Systems
• Big demand for
  – High Performance Computing (HPC)
  – File-systems, multimedia, database, visualization
  – Internet data-centers
• Processor performance continues to grow
  – Chip density doubling every 18 months
  – Multi-core chips are emerging
• Commodity networking also continues to grow
  – Increase in speed and features
  – Affordable pricing
• Clusters are increasingly becoming popular to design next generation computing systems
  – Scalability, Modularity and Upgradeability with compute and network technologies
Trends for Computing Clusters in the Top 500 List
• Top 500 list of Supercomputers (www.top500.org)
June 2001:  33/500  (6.6%)      June 2005: 304/500 (60.8%)
Nov  2001:  43/500  (8.6%)      Nov  2005: 360/500 (72.0%)
June 2002:  80/500 (16.0%)      June 2006: 364/500 (72.8%)
Nov  2002:  93/500 (18.6%)      Nov  2006: 361/500 (72.2%)
June 2003: 149/500 (29.8%)      June 2007: 373/500 (74.6%)
Nov  2003: 208/500 (41.6%)      Nov  2007: 406/500 (81.2%)
June 2004: 291/500 (58.2%)      June 2008: 400/500 (80.0%)
Nov  2004: 294/500 (58.8%)      Nov  2008: To be Announced
Growth in Commodity Network Technologies
Representative commodity networks and their entries into the market:

Technology                      Bandwidth
Ethernet (1979 - )              10 Mbit/sec
Fast Ethernet (1993 - )         100 Mbit/sec
Gigabit Ethernet (1995 - )      1000 Mbit/sec
ATM (1995 - )                   155/622/1024 Mbit/sec
Myrinet (1993 - )               1 Gbit/sec
Fibre Channel (1994 - )         1 Gbit/sec
InfiniBand (2001 - )            2 Gbit/sec (1X SDR)
10-Gigabit Ethernet (2001 - )   10 Gbit/sec
InfiniBand (2003 - )            8 Gbit/sec (4X SDR)
InfiniBand (2005 - )            16 Gbit/sec (4X DDR)
                                24 Gbit/sec (12X SDR)
InfiniBand (2007 - )            32 Gbit/sec (4X QDR)
A 16X increase in bandwidth over the last 7 years
Limitations of Traditional Host-based Protocols

• Ex: TCP/IP, UDP/IP
• Generic architecture for all network interfaces
• Host handles almost all aspects of communication
  – Data buffering (copies on sender and receiver)
  – Data integrity (checksum)
  – Routing aspects (IP routing)
• Signaling between different layers
  – Hardware interrupt whenever a packet arrives or is sent
  – Software signals between different layers to handle protocol processing at different priority levels
Previous High Performance Network Stacks

• Virtual Interface Architecture
  – Standardized by Intel, Compaq, Microsoft
• Fast Messages (FM)
  – Developed by UIUC
• Myricom GM
  – Proprietary protocol stack from Myricom
• These network stacks set the trend for high-performance communication requirements
  – Hardware-offloaded protocol stack
  – Support for fast and secure user-level access to the protocol stack
IB Trade Association
• The IB Trade Association was formed with seven industry leaders (Compaq, Dell, HP, IBM, Intel, Microsoft, and Sun)
• Goal: to design a scalable and high-performance communication and I/O architecture by taking an integrated view of computing, networking, and storage technologies
• Many other industry partners participated in the effort to define the IB architecture specification
• The IB Architecture (Volume 1, Version 1.0) was released to the public on Oct 24, 2000
  – Latest version 1.2.1 released January 2008
• http://www.infinibandta.org
Presentation Overview
• Overview of InfiniBand
  – Features
  – Products (Hardware and Software)
  – Trends
• MVAPICH and MVAPICH2 Features
• Design Insights and Sample Performance Numbers
• Future Plans
• Conclusions and Final Q&A
A Typical IB Network

Three primary components:
• Channel Adapters
• Switches/Routers
• Links and connectors
Hardware Protocol Offload
[Diagram: IB protocol processing offloaded to the adapter; complete hardware implementations exist]
Basic IB Capabilities at Each Protocol Layer

• Link Layer
  – CRC-based data integrity, buffering and flow control, Virtual Lanes, Service Levels and QoS, switching and multicast, WAN capabilities
• Network Layer
  – Routing and Flow Labels
• Transport Layer
  – Reliable Connection, Unreliable Datagram, Reliable Datagram and Unreliable Connection
  – Shared Receive Queues and eXtended Reliable Connections (discussed in more detail later)
Communication and Management Semantics

• Two forms of communication semantics
  – Channel semantics (Send/Recv)
  – Memory semantics (RDMA, atomic operations)
• Management model
  – A detailed management model complete with managers, agents, messages and protocols
• Verbs Interface
  – A low-level programming interface for performing communication as well as management
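To make the verbs-level memory semantics concrete, here is a minimal sketch (in C, against the OpenFabrics verbs API) of posting a one-sided RDMA Write. The queue pair, memory region, and the peer's registered address and rkey are assumed to have been created and exchanged out of band during connection setup (not shown).

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Minimal sketch: post an RDMA Write on an established RC queue pair.
 * Assumes qp, mr, the local buffer, and the peer's (remote_addr, rkey)
 * were created/exchanged during connection setup. */
int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr, void *buf,
                    size_t len, uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t) buf,   /* local source buffer     */
        .length = (uint32_t) len,
        .lkey   = mr->lkey,          /* local protection key    */
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE; /* memory semantics */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED; /* local completion */
    wr.wr.rdma.remote_addr = remote_addr;       /* peer's registered address */
    wr.wr.rdma.rkey        = rkey;              /* peer's remote key */

    return ibv_post_send(qp, &wr, &bad_wr);     /* 0 on success */
}
```

Because the opcode is IBV_WR_RDMA_WRITE, the target posts no receive and its CPU is not involved in the transfer, which is exactly the memory-semantics property shown on the next slides.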
Communication in the Channel Semantics (Send-Receive Model)

[Diagram: sender-side and receiver-side buffer pools in the send-receive model]
Communication in the Memory Semantics (RDMA Model)

[Diagram: two nodes with memory, processes P0/P1, PCI/PCI-Express buses and IB adapters; data moves directly between registered memories]
• No involvement by the CPU at the receiver (RDMA Write/Put)
• No involvement by the CPU at the sender (RDMA Read/Get)
• 1-2 µs latency (for short data)
• 1.5 – 2.6 GBps bandwidth (for large data)
• 3-5 µs for atomic operations
IB Transport Services

Advanced mechanisms like SRQ and the new transport eXtended Reliable Connection (XRC) have been introduced recently.
Shared Receive Queue (SRQ)

[Diagram: without SRQ, a process posts one receive queue of m buffers per connection (n-1 connections); with SRQ, one shared queue of p buffers serves all connections]

• SRQ is a hardware mechanism in IB by which a process can share receive resources (memory) across multiple connections
• A new feature, introduced in specification v1.2
• 0 < p << m*(n-1)
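As a hedged illustration of how SRQ pools receive buffers, the sketch below creates one SRQ and attaches it to each new RC QP at creation time, so that p posted buffers serve all n-1 connections. The pd and cq handles are assumed to exist from earlier setup, and the queue sizes are illustrative.

```c
#include <infiniband/verbs.h>

/* Sketch: create a single shared receive queue and hand it to every
 * new RC QP, so receive buffers are pooled across all connections. */
struct ibv_srq *make_shared_rq(struct ibv_pd *pd)
{
    struct ibv_srq_init_attr srq_attr = {
        .attr = {
            .max_wr  = 512, /* p receive buffers shared by all peers */
            .max_sge = 1,
        },
    };
    return ibv_create_srq(pd, &srq_attr);
}

struct ibv_qp *make_qp_on_srq(struct ibv_pd *pd, struct ibv_cq *cq,
                              struct ibv_srq *srq)
{
    struct ibv_qp_init_attr qp_attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .srq     = srq,      /* receives come from the shared pool */
        .qp_type = IBV_QPT_RC,
        .cap     = { .max_send_wr = 64, .max_send_sge = 1 },
    };
    return ibv_create_qp(pd, &qp_attr);
}
```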
eXtended Reliable Connection (XRC)

M = # of nodes, N = # of processes/node
[Diagram: RC requires (M^2-1)*N connections; XRC requires (M-1)*N connections]

• Each QP takes at least one page of memory
  – Connections between all processes are very costly for RC
• New IB transport added: eXtended Reliable Connection
  – Allows connections between nodes instead of processes
Subnet Manager

[Diagram: compute nodes send Multicast Join requests; the Subnet Manager performs Multicast Setup across the switches, turning inactive links into active ones]
Automatic Path Migration (APM)

• Automatically utilizes IB multipathing for network fault-tolerance
• Enables migrating connections to a different path
  – Connection recovery in the case of failures
  – Optional feature
• Available for RC, UC, and RD
• Reliability guarantees for the service type are maintained during migration
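For reference, arming APM on an RC queue pair goes through the standard verbs interface; a minimal sketch, assuming the caller has filled in an ibv_ah_attr describing the alternate route (for example, a second port or an LMC-derived LID):

```c
#include <infiniband/verbs.h>
#include <string.h>

/* Sketch: load an alternate path and re-arm automatic path migration.
 * alt describes the redundant route and is assumed to be filled in
 * from subnet data by the caller. */
int arm_apm(struct ibv_qp *qp, struct ibv_ah_attr *alt, uint8_t alt_timeout)
{
    struct ibv_qp_attr attr;
    memset(&attr, 0, sizeof(attr));

    attr.alt_ah_attr    = *alt;          /* alternate address vector  */
    attr.alt_pkey_index = 0;
    attr.alt_port_num   = alt->port_num;
    attr.alt_timeout    = alt_timeout;
    attr.path_mig_state = IBV_MIG_REARM; /* HCA migrates on path error */

    return ibv_modify_qp(qp, &attr,
                         IBV_QP_ALT_PATH | IBV_QP_PATH_MIG_STATE);
}
```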
Presentation Overview
• Overview of InfiniBand
  – Features
  – Products (Hardware and Software)
  – Trends
• MVAPICH and MVAPICH2 Features
• Design Insights and Sample Performance Numbers
• Future Plans
• Conclusions and Final Q&A
IB Hardware Products
• Many IB vendors: Mellanox, Voltaire, Cisco, Qlogic
  – Aligned with many server vendors: Intel, IBM, SUN, Dell
  – And many integrators: Appro, Advanced Clustering, Microway, ...
• Broadly two kinds of adapters
  – Offloading (Mellanox) and Onloading (Qlogic)
• Adapters with different interfaces:
  – Dual-port 4X with PCI-X (64 bit/133 MHz), PCIe x8, PCIe 2.0 and HT
• MemFree Adapter
  – No memory on HCA; uses system memory (through PCIe)
  – Good for LOM designs (Tyan S2935, Supermicro 6015T-INFB)
• Different speeds
  – SDR (8 Gbps), DDR (16 Gbps) and QDR (32 Gbps)
  – Some 12X SDR adapters exist as well (24 Gbps each way)
Tyan Thunder S2935 Board

[Board photo] (Courtesy Tyan)
IB Hardware Products (contd.)
• Customized adapters to work with IB switches
  – Cray XD1 (formerly by Octigabay), Cray CX1
• Switches:
  – 4X SDR switch (8-288 ports)
    • 12X ports available for inter-switch connectivity
  – 4X DDR switch (mainly available in 8 to 288 port models)
  – 12X switches (small sizes available)
  – 3456-port "Magnum" switch from SUN used at TACC
    • 72-port "nano magnum" switch with DDR speed
  – New 36-port InfiniScale IV QDR switch silicon by Mellanox
    • Will allow high-density switches to be built
• Switch routers with gateways
  – IB-to-FC; IB-to-IP
IB Software Products
• Low-level software stacks
  – VAPI (Verbs-Level API) from Mellanox
  – Modified and customized VAPI from other vendors
  – New initiative: OpenFabrics (formerly OpenIB)
    • http://www.openfabrics.org
    • Open-source code available with Linux distributions
    • Initially IB; later extended to incorporate iWARP
• High-level software stacks
  – MPI, SDP, IPoIB, SRP, iSER, DAPL, NFS, PVFS on various stacks (primarily VAPI and OpenFabrics)
OpenFabrics
• www.openfabrics.org
• Open-source organization (formerly OpenIB)
• Incorporates both IB and iWARP in a unified manner
• Focusing on Open Source IBA and iWARP support for Linux and Windows
• Design of a complete software stack with `best of breed' components
  – Gen1
  – Gen2 (current focus)
• Users can download the entire stack and run
  – Latest release is OFED 1.3.1
  – OFED 1.4 is being worked out
OpenFabrics Software Stack

[Diagram: application level (clustered DB access, sockets-based apps, various MPIs, file-system access, block storage, IP-based apps) runs over user-level verbs/APIs (InfiniBand and iWARP R-NIC) with user-space libraries (SDP Lib, UDAPL, user-level MAD API, OpenSM, diag tools); kernel space holds upper-layer protocols (SDP, IPoIB, SRP, iSER, RDS, NFS-RDMA RPC, cluster file systems), a mid-layer (connection manager abstraction/CMA, SA client, MAD, SMA) with kernel bypass paths, and hardware-specific drivers for InfiniBand HCAs and iWARP R-NICs]

Acronym legend:
  SA     Subnet Administrator
  MAD    Management Datagram
  SMA    Subnet Manager Agent
  PMA    Performance Manager Agent
  IPoIB  IP over InfiniBand
  SDP    Sockets Direct Protocol
  SRP    SCSI RDMA Protocol (Initiator)
  iSER   iSCSI RDMA Protocol (Initiator)
  RDS    Reliable Datagram Service
  UDAPL  User Direct Access Programming Lib
  HCA    Host Channel Adapter
  R-NIC  RDMA NIC
IB Installations
• 121 IB clusters (24.2%) in the June '08 TOP500 list (www.top500.org)
• 12 IB clusters in the TOP25
  – 122,400 cores (RoadRunner) at LANL (1st)
  – 62,976 cores (Ranger) at TACC (4th)
  – 14,336 cores at New Mexico (7th)
  – 14,384 cores at Tata CRL, India (8th)
  – 10,240 cores at TEP, France (10th)
  – 13,728 cores in Sweden (11th)
  – 8,320 cores in UK (18th)
  – 6,720 cores in Germany (19th)
  – 10,000 cores at CCS, Tsukuba, Japan (20th)
  – 9,600 cores at NCSA (23rd)
  – 12,344 cores at Tokyo Inst. of Technology (24th)
  – 13,824 cores at NASA/Columbia (25th)
• More are getting installed ...
InfiniBand in the Top500

[Charts: number of systems and aggregate performance by interconnect family]

The percentage share of InfiniBand is steadily increasing.
Presentation Overview
• Overview of InfiniBand
  – Features
  – Products (Hardware and Software)
  – Trends
• MVAPICH and MVAPICH2 Features
• Design Insights and Sample Performance Numbers
• Future Plans
• Conclusions and Final Q&A
Designing MPI Using IB/iWARP Features

[Diagram: MPI design components (protocol mapping, flow control, communication progress, multi-rail support, buffer management, connection management, collective communication, one-sided active/passive) mapped, through design alternatives and solutions, onto IB and iWARP/Ethernet features (RDMA operations, atomic operations, send/receive, unreliable datagram, shared receive queues, static and dynamic rate control, end-to-end flow control, multicast, out-of-order placement, QoS, multi-path LMC, VLANs)]
MVAPICH/MVAPICH2 Software
• High-performance MPI library for IB and 10GigE
  – MVAPICH (MPI-1) and MVAPICH2 (MPI-2)
  – Latest releases: MVAPICH 1.1RC1 and MVAPICH2 1.2RC2
  – Used by more than 765 organizations in 42 countries
  – More than 23,000 downloads from the OSU site directly
  – Empowering many TOP500 clusters
    • 4th-ranked 62,976-core cluster (Ranger) at TACC
  – Available with the software stacks of many IB, 10GigE and server vendors, including the Open Fabrics Enterprise Distribution (OFED)
  – Also supports a uDAPL device to work with any network supporting uDAPL
  – http://mvapich.cse.ohio-state.edu/
MVAPICH 1.1 Architecture

[Diagram: MVAPICH (MPI-1, 1.1) device channels: #1 OpenFabrics/Gen2 (single-rail) and #2 OpenFabrics/Gen2-Hybrid (single-rail) for Mellanox InfiniBand (PCI-X, PCIe, PCIe-Gen2; SDR, DDR and QDR), #3 PSM for QLogic InfiniBand (PCIe and HT; SDR and DDR), #4 shared memory for multi-core nodes, #5 TCP/IP for serial nodes/laptops; Gen2-Multirail, VAPI and uDAPL (deprecated) channels also shown. Platforms: IA-32, EM64T, Opteron, IA-64, ...]
MVAPICH2 1.2 Architecture

[Diagram: MVAPICH2 (MPI-2, 1.2) device channels: #1 OpenFabrics/Gen2 (integrated multirail) for Mellanox InfiniBand (PCI-X, PCIe, PCIe-Gen2; SDR, DDR and QDR) and for 10GigE/iWARP adapters (Chelsio, Neteffect; PCI-X, PCIe), #2 uDAPL (IB, Myrinet, Quadrics; Linux, Solaris), #3 TCP/IP, #4 PSM for QLogic InfiniBand (PCIe and HT; SDR and DDR); shared memory within the node; one channel marked "Under Design". Platforms: IA-32, EM64T, Opteron, IA-64, ...]
Major Features of MVAPICH 1.1

• OpenFabrics-Gen2
  – Scalable job start-up with mpirun_rsh, support for SLURM
  – RC and XRC support
  – Flexible message coalescing
  – Multi-core-aware pt-to-pt communication
  – User-defined processor affinity for multi-core platforms
  – Multi-core-optimized collective communication
  – Asynchronous and scalable on-demand connection management
  – RDMA Write and RDMA Read-based protocols
  – Lock-free asynchronous progress for better overlap between computation and communication
  – Polling and blocking support for communication progress
  – Multi-pathing support leveraging the LMC mechanism on large fabrics
  – Network-level fault tolerance with Automatic Path Migration (APM)
  – Mem-to-mem reliable data transfer mode (for detection of I/O errors with 32-bit CRC)
Major Features of MVAPICH 1.1 (Cont'd)

• OpenFabrics-Gen2-Hybrid
  – Newly introduced interface in 1.1
  – Replaces the UD interface in 1.0
  – Targeted for emerging multi-thousand-core clusters, to achieve the best performance with minimal memory footprint
  – Most of the features as in Gen2
  – Adaptive selection during run-time (based on application and system characteristics) to switch between RC and UD (or between XRC and UD) transports
  – Multiple buffer organization with XRC support
Major Features of MVAPICH2 1.2

• OpenFabrics-Gen2
  – All features as in MVAPICH 1.1 (OpenFabrics-Gen2) except asynchronous progress and XRC
  – RDMA CM-based connection management (Gen2-IB and Gen2-iWARP)
  – Integrated multi-rail support for IB and 10GigE/iWARP
  – Checkpoint-Restart (currently for IB)
    • Systems-level automatic
    • Application-initiated systems-level
• uDAPL
  – Most of the features of OpenFabrics-Gen2 except multi-rail and checkpointing
  – Flexibility for different adapters, software stacks and OSes (Linux and Solaris) supporting uDAPL
Support for Multiple Interfaces/Adapters

• OpenFabrics/Gen2-IB and OpenFabrics/Gen2-Hybrid
  – All IB adapters supporting OpenFabrics/Gen2
• Qlogic/PSM
  – Qlogic adapters
• OpenFabrics/Gen2-iWARP
  – Chelsio
• uDAPL
  – Linux-IB
  – Solaris-IB
  – Other adapters such as Neteffect 10GigE
• TCP/IP
  – Any adapter supporting a TCP/IP interface
• Shared memory channel (MVAPICH)
  – For running applications in a node with multi-core processors
Presentation Overview
• Overview of InfiniBand
  – Features
  – Products (Hardware and Software)
  – Trends
• MVAPICH and MVAPICH2 Features
• Design Insights and Sample Performance Numbers
• Future Plans
• Conclusions and Final Q&A
Design Insights and Sample Results
• Scalable Job Start-up
• Basic Performance
  – Two-sided Communication
  – One-sided Communication
• Multi-core-aware pt-to-pt communication
• Multi-core-aware Optimized Collectives
• Integrated Multi-rail Design
• Scalability for Large-scale Systems (SRQ, UD, Hybrid & XRC)
• Applications-level Scalability
• Asynchronous Progress
• Fault Tolerance
Scalable Startup

• An enhanced mpirun_rsh framework was introduced in MVAPICH 1.0 to significantly cut down job start-up time on large clusters
• Available with MVAPICH 1.1 and MVAPICH2 1.2

[Chart: average wallclock runtime (secs) of MPI Hello World for 1K to 32K MPI tasks (cores), MVAPICH-0.9.9 vs MVAPICH-1.0; courtesy TACC]
One-way Latency: MPI over IB

[Charts: small- and large-message latency (us) for MVAPICH over InfiniHost III-DDR, Qlogic-SDR, ConnectX-DDR, ConnectX-QDR-PCIe2 and Qlogic-DDR-PCIe2; small-message latencies of 1.06, 1.28, 1.49, 2.19 and 2.77 us]
InfiniHost III and ConnectX-DDR: 2.33 GHz Quad-core (Clovertown) Intel with IB switch
ConnectX-QDR-PCIe2: 2.83 GHz Quad-core (Harpertown) Intel with back-to-back
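The latencies above come from a standard ping-pong measurement (in the style of the OSU osu_latency benchmark); a minimal, self-contained sketch of that pattern:

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal ping-pong sketch: ranks 0 and 1 bounce a small message back
 * and forth; one-way latency is half the average round-trip time. */
int main(int argc, char **argv)
{
    enum { ITERS = 1000, SIZE = 4 };
    char buf[SIZE];
    int rank;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();
    if (rank == 0)
        printf("one-way latency: %.2f us\n",
               (t1 - t0) * 1e6 / (2.0 * ITERS));
    MPI_Finalize();
    return 0;
}
```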
Bandwidth: MPI over IB

[Charts: unidirectional bandwidth (peaks of 936.5, 1389.4, 1399.8, 1952.9 and 2570.6 MillionBytes/sec) and bidirectional bandwidth (peaks of 1519.8, 2457.4, 2718.3, 3621.4 and 5012.1 MillionBytes/sec) for MVAPICH over InfiniHost III-DDR, Qlogic-SDR, ConnectX-DDR, ConnectX-QDR-PCIe2 and Qlogic-DDR-PCIe2]
InfiniHost III and ConnectX-DDR: 2.33 GHz Quad-core (Clovertown) Intel with IB switch
ConnectX-QDR-PCIe2: 2.83 GHz Quad-core (Harpertown) Intel with back-to-back
RDMA CM and iWARP Support

• Available starting with MVAPICH2 0.9.8
• RDMA CM is supported for both
  – IB
  – 10GigE/iWARP
One-way Latency: MPI over iWARP

[Chart: one-way latency for 1-byte to 4 KB messages on Chelsio adapters; TCP/IP: 15.47 us, iWARP: 6.88 us]
2.0 GHz Quad-core Intel with 10GE (Fulcrum) Switch
Bandwidth: MPI over iWARP

[Charts: unidirectional and bidirectional bandwidth for Chelsio TCP/IP vs Chelsio iWARP; peak values include 855.3, 839.8, 1231.8 and 2260.8 MillionBytes/sec]
2.0 GHz Quad-core Intel with 10GE (Fulcrum) Switch
MPI One-sided Communication

• Specified by the MPI-2 standard
• Data movement operations
  – MPI_Put
  – MPI_Get
  – MPI_Accumulate
• Synchronization operations
  – MPI_Lock/MPI_Unlock
  – MPI_Win_fence
  – MPI_Win_post, MPI_Win_start, MPI_Win_complete, MPI_Win_wait

[Diagram: origin process issues Lock, Put, Get, Accumulate and Unlock against a target window]
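A minimal sketch of the MPI-2 one-sided model listed above, using fence synchronization (the window contents and values are illustrative):

```c
#include <mpi.h>
#include <stdio.h>

/* Sketch: rank 0 puts one integer into rank 1's window using
 * MPI-2 one-sided communication with fence synchronization. */
int main(int argc, char **argv)
{
    int rank, winbuf = 0, value = 42;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every process exposes one int through the window. */
    MPI_Win_create(&winbuf, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                      /* open access epoch */
    if (rank == 0)
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);                      /* close epoch       */

    if (rank == 1)
        printf("rank 1 received %d via MPI_Put\n", winbuf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```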
MPI_Put Performance (IB DDR)

[Charts: MPI_Put small-message latency of 2.57 us, unidirectional bandwidth, and bidirectional bandwidth up to 2716 MB/s with ConnectX-DDR]

• Single-port results only (EM64T, PCI-Express)
• Results for other platforms at http://mvapich.cse.ohio-state.edu
Design Insights and Sample Results
• Scalable Job Start-up
• Basic Performance
  – Two-sided Communication
  – One-sided Communication
• Multi-core-aware pt-to-pt communication
• Multi-core-aware Optimized Collectives
• Integrated Multi-rail Design
• Scalability for Large-scale Systems (SRQ, UD, Hybrid & XRC)
• Applications-level Scalability
• Asynchronous Progress
• Fault Tolerance
Multicore-aware Communication: Latency and Bandwidth

[Charts: small-message latency (us) and bandwidth (MB/s) for intra-CMP and inter-CMP communication, basic vs advanced designs]

• Multicore-aware design improves both latency and bandwidth
• Available in MVAPICH and MVAPICH2 stacks
L. Chai, A. Hartono and D. K. Panda, “Designing High Performance and Scalable MPI Intra-node Communication Support for Clusters”, Cluster ‘06
Shared-memory Aware Collectives

[Charts: MPI_Bcast latency (usec) with and without shared memory on 64x8 and 64x4 configurations, and MPI_Allreduce latency (us) on 512 cores for 4096 to 65536-byte messages, shared-memory vs non-shared-memory designs]
MVAPICH-PSM Collective Performance (512 cores)

[Charts: Broadcast latency (64x8) for 1 to 1024-byte messages and Barrier latency for system sizes 1x8 to 64x8, MVAPICH PSM 1.0 vs InfiniPath MPI 2.1]

• 64 Intel Quad-core systems with dual sockets; PCIe InfiniPath adapters
• Significant performance improvement for MPI_Bcast and MPI_Barrier
Design Insights and Sample Results
• Scalable Job Start-up
• Basic Performance
  – Two-sided Communication
  – One-sided Communication
• Multi-core-aware pt-to-pt communication
• Multi-core-aware Optimized Collectives
• Integrated Multi-rail Design
• Scalability for Large-scale Systems (SRQ, UD, Hybrid & XRC)
• Applications-level Scalability
• Asynchronous Progress
• Fault Tolerance
Integrated Multi-Rail Design (MVAPICH2)

[Diagram: application requests enter the MPI layer (eager and rendezvous protocols), pass through a communication scheduler with pluggable policies onto virtual subchannels over the InfiniBand layer; a notifier feeds completions back up]

• Multiple ports/adapters
• Multiple paths with LMC

J. Liu, A. Vishnu and D. K. Panda, "Building MultiRail InfiniBand Clusters: MPI Level Design and Performance Evaluation", Supercomputing '04
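As a hedged sketch of what the scheduler's striping policy might look like, the code below splits one large message round-robin across rails; post_rdma_on_rail() is a hypothetical stand-in, not an MVAPICH2 function:

```c
#include <stdio.h>
#include <stddef.h>

#define NUM_RAILS 2

/* Hypothetical stand-in for a rail-specific RDMA write. */
static void post_rdma_on_rail(int rail, const char *buf, size_t len)
{
    printf("rail %d: %zu bytes at %p\n", rail, len, (const void *) buf);
}

/* Round-robin striping of one large message across the rails. */
static void stripe_message(const char *buf, size_t len)
{
    size_t chunk = (len + NUM_RAILS - 1) / NUM_RAILS;
    for (int r = 0; r < NUM_RAILS && len > 0; r++) {
        size_t n = len < chunk ? len : chunk;
        post_rdma_on_rail(r, buf, n);  /* each rail carries one stripe */
        buf += n;
        len -= n;
    }
}

int main(void)
{
    static char msg[1 << 20];          /* 1 MB example payload */
    stripe_message(msg, sizeof(msg));
    return 0;
}
```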
Design Insights and Sample Results
• Scalable Job Start-up
• Basic Performance
  – Two-sided Communication
  – One-sided Communication
• Multi-core-aware pt-to-pt communication
• Multi-core-aware Optimized Collectives
• Integrated Multi-rail Design
• Scalability for Large-scale Systems (SRQ, UD, Hybrid & XRC)
• Applications-level Scalability
• Asynchronous Progress
• Fault Tolerance
Memory Utilization using Shared Receive Queues

[Charts: MPI_Init memory utilization (MB) for 2-32 processes and an analytical model of memory used (GB) for 128-16384 processes, comparing MVAPICH-RDMA, MVAPICH-SR and MVAPICH-SRQ]

• SRQ consumes only 1/10th of the memory of RDMA for 16,000 processes
• Send/Recv exhausts the buffer pool after 1,000 processes and consumes 2X the memory of SRQ for 16,000 processes
S. Sur, L. Chai, H. –W. Jin and D. K. Panda, “Shared Receive Queue Based Scalable MPI Design for InfiniBand Clusters”, IPDPS 2006
Communication Buffer Memory Utilization with NAMD (apoa1)

[Charts: memory usage (MB) for 16-64 processes and normalized performance for ARDMA-SR, ARDMA-SRQ and SRQ; profile: avg. RDMA channels 53.15, avg. low watermarks 0.03, unexpected msgs 48.2%, total messages 3.7e6, MPI time 23.54%]

• 50% of messages are < 128 bytes; the other 50% are between 128 bytes and 32 KB
  – 53 RDMA connections were set up for the 64-process experiment
• The SRQ channel takes 5-6 MB of memory
  – Memory needed by SRQ decreases by 1 MB going from 16 to 64 processes
S. Sur, M. Koop and D. K. Panda, “High-Performance and Scalable MPI over InfiniBand with Reduced Memory Usage: An In-Depth Performance Analysis”, SC ‘06
UD vs. RC: Performance and Scalability (SMG2000 Application)

[Chart: normalized execution time, RC (MVAPICH 0.9.8) vs UD design, for 128-4096 processes]

Memory usage (MB/process):

Procs | RC (MVAPICH 0.9.8): Conn. / Buffers / Struct. / Total | UD Design: Buffers / Struct. / Total
 512  | 22.9 / 65.0 / 0.3 /  88.2                             | 37.0 / 0.2 / 37.2
1024  | 29.5 / 65.0 / 0.6 /  95.1                             | 37.0 / 0.4 / 37.4
2048  | 42.4 / 65.0 / 1.2 / 107.4                             | 37.0 / 0.9 / 37.9
4096  | 66.7 / 65.0 / 2.4 / 134.1                             | 37.0 / 1.7 / 38.7

• Large number of peers per process (992 at maximum)
• UD reduces HCA QP cache thrashing

M. Koop, S. Sur, Q. Gao and D. K. Panda, "High Performance MPI Design using Unreliable Datagram for Ultra-Scale InfiniBand Clusters," ICS '07
Impact of Hybrid RC/UD Design

• Combines the benefits of both RC and UD

[Chart: application benchmark results on a 512-core system]
M. Koop, T. Jones and D. K. Panda, “MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand,” IPDPS ’08
Impact of XRC-based Design

[Chart: normalized time for RC-SRQ, RC-MSRQ, EXRC-SRQ, EXRC-MSRQ, SXRC-SRQ and SXRC-MSRQ on the apoa1, er-gre, f1atpase and jac datasets]

• For the jac dataset, RC-MSRQ shows 10% worse performance
  – The HCA cache is likely being thrashed
• SXRC modes show higher performance since fewer QPs are used (and they stay in cache)

M. Koop, J. Sridhar and D. K. Panda, "Scalable MPI Design over InfiniBand using eXtended Reliable Connection," Cluster '08
Performance of HPC Applications on TACC Ranger using MVAPICH + IB

• Rob Farber's facial recognition application was run on up to 60K cores using MVAPICH
• Ranges from 84% of peak at the low end to 65% of peak at the high end

http://www.tacc.utexas.edu/research/users/features/index.php?m_b_c=farber
Performance of HPC Applications on TACC Ranger: DNS/Turbulence

[Chart courtesy: P.K. Yeung, Diego Donzis, TG 2008]
Design Insights and Sample Results
• Scalable Job Start-up
• Basic Performance
  – Two-sided Communication
  – One-sided Communication
• Multi-core-aware pt-to-pt communication
• Multi-core-aware Optimized Collectives
• Integrated Multi-rail Design
• Scalability for Large-scale Systems (SRQ, UD, Hybrid & XRC)
• Applications-level Scalability
• Asynchronous Progress
• Fault Tolerance
Asynchronous Progress

• Asynchronous progress (both at sender and receiver) in MVAPICH 1.0
• The design has been enhanced to a lock-free design in MVAPICH 1.1
• Potential for overlap of computation and communication

R. Kumar, A. Mamidala, M. Koop, G. Santhanaraman and D. K. Panda, "Lock-free Asynchronous Rendezvous Design for MPI Point-to-Point Communication", EuroPVM/MPI 2008, September 2008.
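The overlap this design targets is what applications exploit with nonblocking calls; a minimal sketch of the pattern, where a large MPI_Isend proceeds (with asynchronous progress) while the host computes:

```c
#include <mpi.h>
#include <stdio.h>

/* Sketch: overlap computation with a large nonblocking send. With
 * asynchronous progress, the rendezvous transfer proceeds while the
 * host is busy in compute(); without it, most of the transfer would
 * wait until MPI_Wait() drives progress. */
#define N (4 * 1024 * 1024)
static double buf[N];

static double compute(void)
{
    double s = 0.0;
    for (int i = 0; i < N; i++)   /* stand-in for application work */
        s += buf[i] * 0.5;
    return s;
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Isend(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
        double s = compute();     /* overlapped with the transfer */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        printf("checksum %f\n", s);
    } else if (rank == 1) {
        MPI_Irecv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}
```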
Asynchronous Progress (Mellanox DDR)

[Charts: application availability (%) at sender and receiver for 8K to 4M-byte messages, ASYNC protocol vs RPUT protocol; results for the SMB benchmark]
Design Insights and Sample Results
• Scalable Job Start-up
• Basic Performance
  – Two-sided Communication
  – One-sided Communication
• Multi-core-aware pt-to-pt communication
• Multi-core-aware Optimized Collectives
• Integrated Multi-rail Design
• Scalability for Large-scale Systems (SRQ, UD, Hybrid & XRC)
• Applications-level Scalability
• Asynchronous Progress
• Fault Tolerance
Fault Tolerance

• Component failures are common in large-scale clusters
• This imposes a need for reliability and fault tolerance
• Working along the following three angles:
  – Reliable networking with Automatic Path Migration (APM) utilizing redundant communication paths (available since MVAPICH 1.0 and MVAPICH2 1.0)
  – Process fault tolerance with efficient checkpoint-restart (available since MVAPICH2 0.9.8)
  – End-to-end reliability with memory-to-memory CRC (available since MVAPICH 0.9.9)
Network Fault-Tolerance with APM

• Network fault tolerance using InfiniBand Automatic Path Migration (APM)
  – Utilizes redundant communication paths
    • Multiple ports
    • LMC
• Supported in OFED 1.2

A. Vishnu, A. Mamidala, S. Narravula and D. K. Panda, "Automatic Path Migration over InfiniBand: Early Experiences", Third International Workshop on System Management Techniques, Processes, and Services, held in conjunction with IPDPS '07
Screenshots: MPI Bandwidth Test with APM

[Screenshots of a running MPI bandwidth test migrating paths under APM]
Checkpoint-Restart Support in MVAPICH2

• Process-level fault tolerance
  – User-transparent, system-level checkpointing
  – Based on BLCR from LBNL to take coordinated checkpoints of the entire program, including the front end and individual processes
  – Designed novel schemes to
    • Coordinate all MPI processes to drain all in-flight messages in IB connections
    • Store communication state and buffers while checkpointing
    • Restart from the checkpoint
• System-level checkpoints can be initiated from the application (added in MVAPICH2 1.0)
A Running Example

[Terminal screenshots: (1) LU is running; (2) list the running programs; (3) select the program and take a checkpoint; (4) LU is not affected; stop it using CTRL-C; (5) restart from the checkpoint]
Checkpoint-Restart Performance with PVFS2

[Chart: execution time (seconds) of NAS LU Class C, 32x1 (storage: 8 PVFS2 servers on IPoIB) with no checkpoint and with 1-4 checkpoints at average intervals of 60, 40, 30 and 20 seconds]
Q. Gao, W. Yu, W. Huang and D.K. Panda, “Application-Transparent Checkpoint/Restart for MPI over InfiniBand”, ICPP ‘06
Presentation Overview
• Overview of InfiniBand
  – Features
  – Products (Hardware and Software)
  – Trends
• MVAPICH and MVAPICH2 Features
• Design Insights and Sample Performance Numbers
• Future Plans
• Conclusions and Final Q&A
Future Plans

• Most of the focus is toward MVAPICH2
• Further enhancements to scalable job start-up
• Kernel-based (LiMIC2) shared-memory pt-to-pt communication
• Optimization of collectives and one-sided communication based on the new LiMIC2 shared-memory communication
• Passive synchronization support for one-sided communication
• Flexible process binding for multi-rails
• Optimization of collectives
  – XRC
  – Multi-rail
• Automatic tuning framework for pt-to-pt and collectives
• Network reliability (transparent recovery in case of adapter failure)
• Job pause-restart framework
• Performance and memory scalability toward 100-200K cores
Conclusions

• MVAPICH and MVAPICH2 are being widely used in stable production IB clusters, delivering the best performance and scalability
• Also enabling clusters with 10GigE/iWARP support
• The user base stands at more than 765 organizations worldwide
• New features for scalability, high performance and fault tolerance are aimed at deploying large-scale clusters (100-200K nodes) in the near future
Funding Acknowledgments

Our research is supported by the following organizations:
• Current funding support [sponsor logos]
• Current equipment support [sponsor logos]
Personnel Acknowledgments

Current Students
  – L. Chai (Ph.D.)
  – T. Gangadharappa (M.S.)
  – K. Gopalakrishnan (M.S.)
  – M. Koop (Ph.D.)
  – P. Lai (Ph.D.)
  – G. Marsh (Ph.D.)
  – X. Ouyang (Ph.D.)
  – G. Santhanaraman (Ph.D.)
  – J. Sridhar (M.S.)
  – H. Subramoni (M.S.)

Past Students
  – P. Balaji (Ph.D.)
  – D. Buntinas (Ph.D.)
  – S. Bhagvat (M.S.)
  – B. Chandrasekharan (M.S.)
  – W. Jiang (M.S.)
  – W. Huang (Ph.D.)
  – S. Kini (M.S.)
  – R. Kumar (M.S.)
  – S. Krishnamoorthy (M.S.)
  – J. Liu (Ph.D.)
  – A. Mamidala (Ph.D.)
  – S. Narravula (Ph.D.)
  – R. Noronha (Ph.D.)
  – S. Sur (Ph.D.)
  – K. Vaidyanathan (Ph.D.)
  – A. Vishnu (Ph.D.)
  – J. Wu (Ph.D.)
  – W. Yu (Ph.D.)

Current Programmer
  – J. Perkins