Designing High-End Computing Systems with InfiniBand and 10-Gigabit Ethernet
A Tutorial at Supercomputing '09 by

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda

Pavan Balaji
Argonne National Laboratory
E-mail: balaji@mcs.anl.gov
http://www.mcs.anl.gov/~balaji

Matthew Koop
NASA Goddard
E-mail: matthew.koop@nasa.gov
http://www.cse.ohio-state.edu/~koop
Presentation Overview
• Networking Requirements of HEC Systems
• Recap of InfiniBand and 10GE Overview
• Advanced Features of IB
• Advanced Features of 10/40/100GE
• The Open Fabrics Software Stack
• Designing High-End Systems with IB and 10GE
  – MPI, Sockets, File Systems, Multi-Tier Data Centers and Virtualization
• Conclusions and Final Q&A
HPC Clusters and Applications
• Multi-core processors becoming increasingly common
• System scales growing rapidly
  – Small (~256 cores – 8/16 nodes with 8/16 cores each)
  – Medium (~1K cores – 32/64 nodes with 8/16 cores each)
  – Large (~10K cores – 320/640 nodes with 8/16 cores each)
  – Huge (~100K cores – 3200/6400 nodes with 8/16 cores each)
• Large range of applications
  – Scientific
  – Commercial
• Diverse computation and communication characteristics
• Diverse scaling requirements
Trends for Computing Clusters in the Top 500 List
• Top 500 list of Supercomputers (www.top500.org)
Top 500 List
Jun. 2001: 33/500 (6.6%) Nov. 2005: 360/500 (72.0%)
Nov. 2001: 43/500 (8.6%) Jun. 2006: 364/500 (72.8%)
Jun. 2002: 80/500 (16%) Nov. 2006: 361/500 (72.2%)
Nov. 2002: 93/500 (18.6%) Jun. 2007: 373/500 (74.6%)
Jun. 2003: 149/500 (29.8%) Nov. 2007: 406/500 (81.2%)
Nov. 2003: 208/500 (41.6%) Jun. 2008: 400/500 (80.0%)
Jun. 2004: 291/500 (58.2%) Nov. 2008: 410/500 (82.0%)
Nov. 2004: 294/500 (58.8%) Jun. 2009: 410/500 (82.0%)
Jun. 2005: 304/500 (60.8%) Nov. 2009: To be announced
PetaFlop to ExaFlop Computing

10 PFlops in 2011
100 PFlops in 2015
Expected to have an ExaFlop system in 2018-2019!
Integrated High-End Computing Environments

[Figure: three HEC environments side by side: a compute cluster (frontend node and compute nodes on a LAN), a storage cluster (meta-data manager holding metadata, plus I/O server nodes holding data), and a multi-tier datacenter for visualization and mining (routers/servers in Tier 1, application servers, and database servers in Tier 3, joined by switches), all connected over LAN/WAN.]

Different requirements exist for each type of HEC system.
Programming Models for HPC Clusters
• Message Passing Interface (MPI) is the de-facto standard
  – MPI 1.0 was the initial standard; MPI 2.2 is the latest
  – MPI 3.0 is under discussion in the MPI Forum
• Other models coming up as well
– Traditional Partitioned Global Address Space Models (Global Arrays, UPC, Coarray Fortran)
– HPCS languages (X10, Chapel, Fortress)
Networking Requirements for HPC Clusters

• Different Communication Patterns
  – Point-to-point
    • Low latency, high bandwidth, low CPU usage, overlap of computation & communication
  – Scalable collective or “group” communication
    • Broadcast, barrier, reduce, all-reduce, all-to-all, etc.
  – Support for concurrent multi-pair communication at the NIC
    • Emerging multi-core architecture
• Reduced network contention and congestion
  – Good routing scheme
  – Efficient congestion control management
• Reliability and Fault Tolerance
  – Failure detection and recovery (network and process level)
  – Ease of administration
Sample Diagram of State-of-the-Art File Systems

• Sample file systems:
  – Lustre, Panasas, GPFS, Sistina/Redhat GFS
  – PVFS, Google File System, Oracle Cluster File System (OCFS2)

[Figure: computing nodes and I/O servers connected through a network, with a metadata server on the side.]
Networking Requirements for Storage Clusters

• Several similar to HPC Clusters
  – Low latency, high bandwidth, low CPU utilization
  – Reduced network contention & congestion
  – Reliability and Fault Tolerance
• Several unique challenges as well
  – Aggregate bandwidth becomes very important
    • Systems typically use fewer I/O nodes than compute nodes
  – Quality of Service (QoS)
    • The same set of file servers support multiple “client applications”
  – Network Unification
    • Need to work with both compute systems & “standard” storage protocols (Fibre Channel, iSCSI)
Enterprise Datacenter Environments

[Figure: clients reach the datacenter over the WAN; proxy servers (Apache) front application servers (PHP) and database servers (MySQL) backed by storage; computation and communication requirements grow toward the back tiers.]

• Requests are received from clients over the WAN
• Proxy nodes perform caching, load balancing, resource monitoring, etc.
  – If not cached, the request is forwarded to the next tier: the Application Server
• The application server performs the business logic (CGI, Java servlets)
  – Retrieves appropriate data from the database to process the requests
Networking Requirements for Datacenters

• Support for large number of communication streams
• Heterogeneous compute requirements
  – Front-end servers are typically much less loaded than backend servers and need to handle communication in a load-resilient manner
• Quality of Service (QoS)
  – The same set of servers respond to many clients
• Network Virtualization
  – Zero-overhead communication in virtual environments
  – Efficient Migration
Networking Requirements for Integrated Environments

• High performance WAN-level communication
  – Low latency
  – High bandwidth (unidirectional and bidirectional)
  – Low CPU utilization
• Good performance in spite of delays
  – Out-of-order messages
  – Buffering requirements to keep high-bandwidth pipes full
• Seamless integration between LAN and WAN protocols
• Resiliency to network failure
Presentation Overview
• Networking Requirements of HEC Systems
• Recap of InfiniBand and 10GE Overview
• Advanced Features of IB
• Advanced Features of 10/40/100GE
• The Open Fabrics Software Stack Usage
• Designing High-End Systems with IB and 10GE
  – MPI, Sockets, File Systems, Multi-Tier Data Centers and Virtualization
• Conclusions and Final Q&A
A Typical IB Network

Three primary components:
– Channel Adapters
– Switches/Routers
– Links and connectors
Communication and Management Semantics

• Two forms of communication semantics
  – Channel semantics (Send/Recv)
  – Memory semantics (RDMA, Atomic operations)
• Management model
  – A detailed management model complete with managers, agents, messages and protocols
• Verbs Interface
  – A low-level programming interface for performing communication as well as management
Communication in the Channel Semantics (Send/Receive Model)

[Figure: sender and receiver processors with memory buffers, QPs (send/recv queues) and CQs on each side, connected through InfiniBand devices with a hardware ACK.]

The processor is involved only to:
1. Post receive WQE
2. Post send WQE
3. Pull out completed CQEs from the CQ

A send WQE contains information about the send buffer. A receive WQE contains information about the receive buffer; incoming messages have to be matched to a receive WQE to know where to place the data.
Communication in the Memory Semantics (RDMA Model)

[Figure: same queue-pair layout as the send/receive model, but only the initiator side posts work; the hardware ACK returns from the target's InfiniBand device.]

The initiator processor is involved only to:
1. Post send WQE
2. Pull out completed CQE from the send CQ

There is no involvement from the target processor. The send WQE contains information about both the send buffer and the receive buffer.
Basic iWARP Capabilities

• Supports most of the communication features supported by IB (with minor differences)
  – Hardware acceleration, RDMA, Multicast, QoS
• Lacks some features
  – E.g., Atomic operations
• … but supports some other features
  – Out-of-order data placement (useful for iSCSI semantics)
  – Fine-grained data rate control (very useful for long-haul networks)
  – Fixed bandwidth QoS (more details later)
IB and 10GE: Commonalities and Differences

Feature                | IB                         | iWARP/10GE
-----------------------|----------------------------|----------------------------------------------
Hardware Acceleration  | Supported                  | Supported (for TOE and iWARP)
RDMA                   | Supported                  | Supported (for iWARP)
Atomic Operations      | Supported                  | Not supported
Multicast              | Supported                  | Supported
Data Placement         | Ordered                    | Out-of-order (for iWARP)
Data Rate-control      | Static and Coarse-grained  | Dynamic and Fine-grained (for TOE and iWARP)
QoS                    | Prioritization             | Prioritization and Fixed Bandwidth QoS
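Atomic operations are the one feature above that IB supports and iWARP does not. As a hedged illustration (not part of the original slides), a remote fetch-and-add over IB verbs uses the same send-descriptor pattern shown later in this tutorial, with the atomic opcode and fields filled in; qp, buf, mr_handle, remote_addr and rkey are assumed to come from connection setup:

struct ibv_send_wr sr, *bad_wr;
struct ibv_sge sge;

memset(&sr, 0, sizeof(sr));
sge.addr   = (uintptr_t) buf;                 /* 8-byte local buffer receives the old value */
sge.length = 8;
sge.lkey   = mr_handle->lkey;
sr.opcode     = IBV_WR_ATOMIC_FETCH_AND_ADD;  /* IB-only; unavailable over iWARP */
sr.send_flags = IBV_SEND_SIGNALED;
sr.num_sge    = 1;
sr.sg_list    = &sge;
sr.wr.atomic.remote_addr = remote_addr;       /* 8-byte aligned address at the target */
sr.wr.atomic.rkey        = rkey;              /* rkey of the target memory region */
sr.wr.atomic.compare_add = 1;                 /* value added atomically at the target */
ret = ibv_post_send(qp, &sr, &bad_wr);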
One-way Latency: MPI over IB

[Figure: small- and large-message MPI latency (us) for MVAPICH over InfiniHost III-DDR, Qlogic-SDR, ConnectX-DDR, ConnectX-QDR-PCIe2 and Qlogic-DDR-PCIe2; small-message latencies range from 1.06 to 2.77 us across the adapters.]
InfiniHost III and ConnectX-DDR: 2.33 GHz Quad-core (Clovertown) Intel with IB switch
ConnectX-QDR-PCIe2: 2.83 GHz Quad-core (Harpertown) Intel with back-to-back
Bandwidth: MPI over IB

[Figure: unidirectional and bidirectional MPI bandwidth across message sizes for the same set of adapters; unidirectional peaks shown range from 936.5 to 3022.1 MillionBytes/sec, and the highest bidirectional bandwidth shown is 5553.5 MillionBytes/sec.]
InfiniHost III and ConnectX-DDR: 2.33 GHz Quad-core (Clovertown) Intel with IB switch
ConnectX-QDR-PCIe2: 2.4 GHz Quad-core (Nehalem) Intel with IB switch
One-way Latency: MPI over iWARP

[Figure: one-way MPI latency over Chelsio 10GE for message sizes from 1 byte to 4 KB; TCP/IP latency is 15.47 us versus 6.88 us for iWARP.]
2.0 GHz Quad-core Intel with 10GE (Fulcrum) Switch
Bandwidth: MPI over iWARP

[Figure: unidirectional and bidirectional MPI bandwidth over Chelsio 10GE; iWARP reaches about 1231.8 MillionBytes/sec unidirectional and 2260.8 MillionBytes/sec bidirectional, versus about 839.8 and 855.3 for TCP/IP.]
2.0 GHz Quad-core Intel with 10GE (Fulcrum) Switch
Presentation Overview
• Networking Requirements of HEC Systems
• Recap of InfiniBand and 10GE Overview
• Advanced Features of IB
• Advanced Features of 10/40/100GE
• The Open Fabrics Software Stack Usage
• Designing High-End Systems with IB and 10GE
  – MPI, Sockets, File Systems, Multi-Tier Data Centers and Virtualization
• Conclusions and Final Q&A
Advanced Capabilities in InfiniBand

[Figure: the IB protocol layers, noting that complete hardware implementations exist.]
Basic IB Capabilities at Each Protocol Layer

• Link Layer
  – CRC-based data integrity, Buffering and Flow-control, Virtual Lanes, Service Levels and QoS, Switching and Multicast, WAN capabilities
• Network Layer
  – Routing and Flow Labels
• Transport Layer
  – Reliable Connection, Unreliable Datagram, Reliable Datagram and Unreliable Connection
  – Shared Receive Queue and Extended Reliable Connections (discussed in more detail later)
Advanced Capabilities in IB

• Link Layer
  – Congestion Control
• Network Layer
  – Multipathing Capability and Automatic Path Migration
• Transport Layer
  – Shared Receive Queues, Extended Reliable Connection
  – Data Segmentation, Transaction Ordering
  – Message-level End-to-End Flow Control
  – Static Rate Control and Auto-Negotiation
• Management Tools
Congestion Control

• Switch detects congestion on a link
  – Detects whether it is the root or victim of congestion
• IB follows a three-step protocol
  – Forward Explicit Congestion Notification (FECN)
    • Used to communicate congested port status
    • Switch sets the FECN bit; marks packets leaving the congested state
  – Backward Explicit Congestion Notification (BECN)
    • Destination sends BECN to the sender, informing it about the congestion
  – Injection Rate Control (Throttling)
    • Source throttles its send rate temporarily (timer based)
    • The throttled rate gradually returns to the original injection rate over time
    • Congestion control may be performed per QP or SL
• Pro-active: does not wait for packet drops to occur
Multipathing Capability

• Similar to basic switching, except…
  – … the sender can utilize multiple LIDs associated with the same destination port
    • Packets sent to one DLID take a fixed path
    • Different packets can be sent using different DLIDs
    • Each DLID can have a different path (the switch can be configured differently for each DLID)
• Can cause out-of-order arrival of packets
  – IB uses a simplistic approach: if packets in one connection arrive out-of-order, they are dropped
  – Easier to use different DLIDs for different connections (see the sketch below)
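As an illustration of per-connection DLID selection, here is a minimal verbs sketch (not from the original slides) of the RTR transition for an RC QP, picking one of the 2^LMC LIDs assigned to the destination port. The names remote_base_lid, remote_qpn, remote_psn, path_index and lmc are assumed to be exchanged out-of-band:

struct ibv_qp_attr attr;
memset(&attr, 0, sizeof(attr));
attr.qp_state           = IBV_QPS_RTR;
attr.path_mtu           = IBV_MTU_2048;
attr.dest_qp_num        = remote_qpn;
attr.rq_psn             = remote_psn;
attr.max_dest_rd_atomic = 1;
attr.min_rnr_timer      = 12;
attr.ah_attr.is_global  = 0;
/* pick one of the 2^LMC consecutive LIDs; each may be routed differently */
attr.ah_attr.dlid       = remote_base_lid + (path_index % (1 << lmc));
attr.ah_attr.sl         = 0;
attr.ah_attr.src_path_bits = 0;
attr.ah_attr.port_num   = 1;
ret = ibv_modify_qp(qp, &attr,
          IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU | IBV_QP_DEST_QPN |
          IBV_QP_RQ_PSN | IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER);

Using a different path_index per connection gives the "different DLIDs for different connections" pattern without risking out-of-order packets within a connection.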
Automatic Path Migration

• Automatically utilizes IB multipathing for network fault-tolerance
• Enables migrating connections to a different path
  – Connection recovery in the case of failures
  – Optional feature
• Available for RC, UC, and RD
• Reliability guarantees for the service type are maintained during migration
Advanced Capabilities in IB

• Link Layer
  – Congestion Control
• Network Layer
  – Multipathing Capability and Automatic Path Migration
• Transport Layer
  – Shared Receive Queues, Extended Reliable Connection
  – Data Segmentation, Transaction Ordering
  – Message-level End-to-End Flow Control
  – Static Rate Control and Auto-Negotiation
• Management Tools
Shared Receive Queues (SRQs)

[Figure: with ordinary receive queues, each process posts one receive queue per QP; with an SRQ, many QPs feed a single shared receive queue.]

• Shared Receive Queues allow multiple QPs to share a single receive queue
• Allows much better scalability for applications/libraries that pre-post to receive queues (see the creation sketch below)
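A minimal creation sketch (assuming a protection domain pd already exists; the queue depth and SGE count are illustrative):

struct ibv_srq_init_attr sia;
memset(&sia, 0, sizeof(sia));
sia.attr.max_wr  = 1024;   /* receive WRs shared by all associated QPs */
sia.attr.max_sge = 1;
struct ibv_srq *srq = ibv_create_srq(pd, &sia);
/* QPs opt in at creation time by setting ibv_qp_init_attr.srq = srq;
   receives are then posted with ibv_post_srq_recv() instead of ibv_post_recv() */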
Flow Control with SRQs

• SRQ has link-level flow-control, but no E2E flow-control
  – Problem: how will the sender know how many messages to send (buffers are shared by all receivers)?
  – Solution: the SRQ limit event functionality can be used to post additional buffers
• Limit event functionality
  – Receiver requests an interrupt from the network when fewer than some number of buffers are left (low watermark)
  – Problem: what if a burst of messages uses up the buffers before you can post more?
  – Solution: the sender will just retransmit (handled in hardware)
    • Can lose some performance, but this is a very rare case!
Low Watermark Method

• Upon receiving a low watermark (SRQ limit) event, the receiver can post additional buffers
  – Example: post 10 buffers and set a limit of 4; when only 4 remain, the limit event fires and a thread posts 6 more
• If the SRQ is exhausted, the send operation does not complete until the receiver has posted more buffers
  – Sender hardware gets Receiver Not Ready (RNR NAK); see the arming sketch below
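A sketch of arming the limit event through the verbs API (the watermark mirrors the example above; ctx is the device context and error handling is omitted):

struct ibv_srq_attr attr;
memset(&attr, 0, sizeof(attr));
attr.srq_limit = 4;                         /* fire when 4 or fewer WRs remain */
ibv_modify_srq(srq, &attr, IBV_SRQ_LIMIT);  /* arms the (one-shot) limit event */

struct ibv_async_event ev;
if (ibv_get_async_event(ctx, &ev) == 0) {
    if (ev.event_type == IBV_EVENT_SRQ_LIMIT_REACHED) {
        /* post more buffers with ibv_post_srq_recv(), then re-arm the limit */
    }
    ibv_ack_async_event(&ev);
}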
SRQ Buffer Usage

• Buffers are consumed in the order received from RQs and SRQs
• Libraries such as MPI are forced to post fixed-size buffers, leading to potential waste
  – E.g., a 1 KB MPI message placed in a posted 8 KB buffer leaves 7 KB unused
• Multiple SRQs, each with a different size?
Multiple SRQs

[Figure: a process with one SRQ of 8 KB buffers and another of 4 KB buffers; each peer requires a separate QP per SRQ.]

• Each additional SRQ multiplies the number of QPs required
  – Each QP can only be associated with one SRQ
• Not a scalable solution, since many QPs are needed for each process
eXtended Reliable Connection (XRC)

• Instead of connecting processes, connect processes to a node
  – M = number of nodes; P = number of processes per node
• Each process needs only a single QP to a node
  – In the best case, M-1 QPs per process are needed for a fully-connected setup
• M*(M-1)*P QPs total for XRC vs. M^2*P^2 - M*P for RC (worked example below)
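A quick worked example with assumed sizes, consistent with the formulas above: for M = 64 nodes and P = 8 processes per node (512 processes total),

RC:  M^2*P^2 - M*P = 512*512 - 512 = 512*511 = 261,632 QPs
XRC: M*(M-1)*P     = 64*63*8       = 32,256 QPs

i.e., roughly 8x fewer QPs fabric-wide.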
XRC Addressing

• XRC uses SRQ Numbers (SRQN) to direct where an operation should complete

[Figure: processes 0 and 1 on one node send to SRQ #1 (process 2) and SRQ #2 (process 3) on another node over shared XRC connections.]

• Hardware does all routing of data, so p2 is not actually involved in the data transfer
• Connections are not bi-directional, so p3 cannot send to p0
Multiple SRQs with XRC

[Figure: with RC, each process needs one QP per peer process per SRQ (8 KB and 4 KB buffer pools); with XRC, one QP per peer node serves all SRQs.]

• Addressing by SRQ number also allows multiple SRQs per process without additional memory resources
• Potential to use less memory in applications and libraries
Advanced Capabilities in IB

• Link Layer
  – Congestion Control
• Network Layer
  – Multipathing Capability and Automatic Path Migration
• Transport Layer
  – Shared Receive Queues, Extended Reliable Connection
  – Data Segmentation, Transaction Ordering
  – Message-level End-to-End Flow Control
  – Static Rate Control and Auto-Negotiation
• Management Tools
Data Segmentation

• Application can hand over a large message
  – Network adapter segments it into MTU-sized packets
  – Single notification when the entire message is transmitted or received (not per packet)
• Reduced host overhead to send/receive messages
  – Depends on the number of messages, not the number of bytes
Transaction Ordering

• IB follows a strong transaction ordering for RC
• The sender network adapter transmits messages in the order in which WQEs were posted
• Each QP utilizes a single LID
  – All WQEs posted on the same QP take the same path
  – All packets are received by the receiver in the same order
  – All receive WQEs are completed in the order in which they were posted
Message-level Flow-Control

• Also called end-to-end flow-control
  – Does not depend on the number of network hops
• Separate from link-level flow-control
  – Link-level flow-control relies only on the number of bytes being transmitted, not the number of messages
  – Message-level flow-control relies only on the number of messages transferred, not the number of bytes
• If 5 receive WQEs are posted, the sender can send 5 messages (can post 5 send WQEs)
  – If the sent messages are larger than the posted receive buffers, flow-control cannot handle it
Static Rate Control and Auto-Negotiation

• IB allows link rates to be statically changed
  – On a 4X link, we can set data to be sent at 1X
  – For heterogeneous links, the rate can be set to the lowest link rate
  – Useful for low-priority traffic (see the one-line sketch below)
• Auto-negotiation is also available
  – E.g., if you connect a 4X adapter to a 1X switch, data is automatically sent at 1X rate
• Only fixed settings are available
  – Cannot set the rate requirement to 3.16 Gbps, for example
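In the verbs API this maps to the address-handle attributes; a one-line addition to the RTR sketch shown earlier (enum values come from <infiniband/verbs.h>):

/* throttle this connection to 1X SDR signaling (2.5 Gbps) */
attr.ah_attr.static_rate = IBV_RATE_2_5_GBPS;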
Advanced Capabilities in IB

• Link Layer
  – Congestion Control
• Network Layer
  – Multipathing Capability and Automatic Path Migration
• Transport Layer
  – Shared Receive Queues, Extended Reliable Connection
  – Data Segmentation, Transaction Ordering
  – Message-level End-to-End Flow Control
  – Static Rate Control and Auto-Negotiation
• Management Tools
InfiniBand Management & Tools

• Subnet Management
• Diagnostic Tools
  – System Discovery Tools
  – System Health Monitoring Tools
  – System Performance Monitoring Tools
Concepts in IB Management

• Agents
  – Processes or hardware units running on each adapter, switch, router (everything on the network)
  – Provide capability to query and set parameters
• Managers
  – Make high-level decisions and implement them on the network fabric using the agents
• Messaging schemes
  – Used for interactions between the manager and agents (or between agents)
• Messages
InfiniBand Management

• All IB management happens using packets called Management Datagrams
  – Popularly referred to as “MAD packets”
• Four major classes of management mechanisms
  – Subnet Management
  – Subnet Administration
  – Communication Management
  – General Services
Subnet Management & Administration

• Consists of at least one subnet manager (SM) and several subnet management agents (SMAs)
  – Each adapter, switch, router has an agent running
  – Communication between the SM and agents, or between agents, happens using MAD packets called Subnet Management Packets (SMPs)
• SM’s responsibilities include:
  – Discovering the physical topology of the subnet
  – Assigning LIDs to the end nodes, switches and routers
  – Populating switches and routers with routing paths
  – Subnet sweeps to discover topology changes
Subnet Manager

[Figure: a subnet manager discovers compute nodes and switches, activates inactive links, and handles multicast join and multicast setup requests.]
Subnet Manager Sweep Behavior

• SM can be configured to sweep once or continuously
• On the first sweep:
  – All ports are assigned LIDs
  – All routes are set up on the switches
• On subsequent sweeps:
  – If there has been any change to the topology, the appropriate routes are updated
  – If DLID X is down, the packet is not sent all the way
    • The first hop will not have a forwarding entry for LID X
• Sweep time is configured by the system administrator
  – Cannot be too high or too low
Subnet Manager Scalability Issues

• A single subnet manager has issues on large systems
  – Performance and overhead of scanning
    • Hardware implementations on switches are faster, but work only for small systems (memory usage)
    • Software implementations are more popular (OpenSM)
  – Fault tolerance
    • There can be multiple SMs
    • During initialization only one should be active; once started, other SMs can handle different network portions
• Asynchronous events are specified to improve scalability
  – E.g., TRAPs are events sent by an agent to the SM when a link goes down
Multicast Group Management

• Creation, joining/leaving, and deleting multicast groups occur as SA requests
  – The requesting node sends a request to the SA
  – The SA sends MAD packets to SMAs on the switches to set up routes for the multicast packets
    • Each switch contains information on which ports to forward the multicast packet to
• Multicast itself does not go through the subnet manager
  – Only the setup and teardown go through the SM (see the host-side attach sketch below)
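On the host side, once the SA join has completed (e.g., via librdmacm or a MAD library), a UD QP is attached to the group with a single verbs call; mgid and mlid below are assumed to come from the join response:

union ibv_gid mgid;   /* multicast GID returned by the SA join */
uint16_t mlid;        /* multicast LID returned by the SA join */
/* ... join performed out-of-band ... */
ret = ibv_attach_mcast(qp, &mgid, mlid);  /* qp must be a UD QP */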
General Services

• Several general service management features are provided by the standard
  – Performance Management
    • Several required and optional performance counters
    • Flow control counters, RNR counters, number of sent and received packets
  – Hardware Management
    • Baseboard Management
    • Device Management
    • SNMP Tunneling
    • Vendor Specific
    • Application Specific
InfiniBand Management & Tools

• Subnet Management
• Diagnostic Tools
  – System Discovery Tools
  – System Health Monitoring Tools
  – System Performance Monitoring Tools
Tools to Analyze InfiniBand Networks

• Different types of tools exist:
  – High-level tools that internally talk to the subnet manager using management datagrams
  – Each hardware device exposes a few mandatory counters and a number of optional (sometimes vendor-specific) counters
• Possible to write your own tools based on the management datagram interface
  – Several vendors provide such IB management tools
Network Discovery Tools

• Starting with almost no knowledge about the system, we can identify several details of the network configuration
  – Example tools include:
    • ibhosts: finds all the network adapters in the system
    • ibswitches: finds all the network switches in the system
    • ibnetdiscover: finds the connectivity between the ports
    • … and many others exist
  – Possible to write your own tools based on the management datagram interface
    • Several vendors provide such IB management tools
Discovering Network Adapters

% ibhosts
Ca : 0x0002c9020023c314 ports 2 " HCA-2"
Ca : 0x0002c9020023c05c ports 2 " HCA-2"
Ca : 0x0002c9020023c0e8 ports 2 " HCA-2"
Ca : 0x0002c9020023c178 ports 2 " HCA-2"
Ca : 0x0002c9020023c058 ports 2 " HCA-2"
Ca : 0x0002c9020023bffc ports 2 " HCA-2"
Ca : 0x0002c9020023c08c ports 2 "wci59"
Ca : 0x0011750000ffe01a ports 1 " HCA-1"
Ca : 0x0011750000ffe141 ports 1 " HCA-1"
Ca : 0x0011750000ffe1dd ports 1 " HCA-1"
Ca : 0x0011750000ffe079 ports 1 " HCA-1"
Ca : 0x0011750000ffe25c ports 1 " HCA-1"
Ca : 0x0002c9020023c318 ports 2 " HCA-2"
...

Each line shows the GUID of the adapter, the number of adapter ports, and the adapter description; 96 adapters are “online” in this example.
Network Adapter Classification

% ibnetdiscover -H    /* Some parts snipped out */
Ca : ports 2 devid 0x6282 vendid 0x2c9 " HCA-2"
Ca : ports 2 devid 0x6282 vendid 0x2c9 " HCA-2"
Ca : ports 2 devid 0x6282 vendid 0x2c9 " HCA-2"
Ca : ports 2 devid 0x6282 vendid 0x2c9 " HCA-2"
Ca : ports 2 devid 0x634a vendid 0x2c9 " HCA-1"
Ca : ports 2 devid 0x634a vendid 0x2c9 " HCA-1"
Ca : ports 2 devid 0x634a vendid 0x2c9 " HCA-1"
Ca : ports 1 devid 0x10 vendid 0x1fc1 " HCA-1"
Ca : ports 1 devid 0x10 vendid 0x1fc1 " HCA-1"
Ca : ports 1 devid 0x10 vendid 0x1fc1 " HCA-1"
Ca : ports 1 devid 0x10 vendid 0x1fc1 " HCA-1"
...

The vendor ID and device ID classify the adapters: 59 Mellanox InfiniHost III adapters, 29 Mellanox ConnectX adapters, and 8 Qlogic adapters in this fabric.
Discovering Network Switches

% ibswitches    /* Some parts snipped out */
Switch : ports 24 "SilverStorm 9120 Leaf 1, Chip A"
Switch : ports 24 "SilverStorm 9120 Spine 2, Chip A"
Switch : ports 24 "SilverStorm 9120 Spine 1, Chip A"
Switch : ports 24 "SilverStorm 9120 Spine 3, Chip A"
Switch : ports 24 "SilverStorm 9120 Spine 2, Chip B"
Switch : ports 24 "SilverStorm 9120 Spine 1, Chip B"
Switch : ports 24 "SilverStorm 9120 Spine 3, Chip B"
Switch : ports 24 "SilverStorm 9120 Leaf 8, Chip A"
Switch : ports 24 "SilverStorm 9120 Leaf 2, Chip A"
Switch : ports 24 "SilverStorm 9120 Leaf 4, Chip A"
Switch : ports 24 "SilverStorm 9120 Leaf 12, Chip A"
Switch : ports 24 "SilverStorm 9120 Leaf 6, Chip A"
Switch : ports 24 "SilverStorm 9120 Leaf 10, Chip A"
Switch : ports 24 "SilverStorm 9120 Leaf 7, Chip A"
Switch : ports 24 "SilverStorm 9120 Leaf 3, Chip A"
...

Each line shows the switch vendor information and the number of ports per switch: 12 leaf switches and 3 spine switches with 2 chips each.
Discovering Network Connectivity

% ibnetdiscover    /* Some parts snipped out */
Switch 24 "S-00066a000700067c"  # "SilverStorm 9120 GUID=0x00066a00020001aa Leaf 1, Chip A" base port 0 lid 66 lmc 0
[24]   # "SilverStorm 9120 Spine 2, Chip A" lid 104 4xDDR
[23]   # "SilverStorm 9120 Spine 2, Chip A" lid 104 4xDDR
[22]   # "SilverStorm 9120 Spine 3, Chip A" lid 100 4xDDR
[21]   # "SilverStorm 9120 Spine 1, Chip A" lid 110 4xDDR
...
[12] "H-0002c9030001e5e6"   # " HCA-1" lid 125 4xDDR
[11] "H-0002c9030001e3fa"   # " HCA-1" lid 142 4xDDR
[10] "H-0002c9030000b0c4"   # " HCA-1" lid 106 4xDDR
[9]  "H-0002c9030000b0c8"   # " HCA-1" lid 108 4xDDR
[8]  "H-0002c9030001e5fa"   # " HCA-1" lid 143 4xDDR
...

The output shows the connectivity of each switch port.
Roughly Constructed Network Fabric

[Figure: the reconstructed fabric: 12 leaf switches connected to 3 spine switches (Spine 1 and 2 managed, Spine 3 unmanaged, each with Chips A and B), serving 59 Mellanox InfiniHost III adapters, 29 Mellanox ConnectX adapters, and 8 Qlogic adapters; fans and power supplies shown in the chassis.]
InfiniBand Management & Tools

• Subnet Management
• Diagnostic Tools
  – System Discovery Tools
  – System Health Monitoring Tools
  – System Performance Monitoring Tools
Overall Diagnostics

• Tools to query overall fabric health

[ib1 ]# ibdiagnet -r
...
STAGE                          Errors  Warnings
Bad GUIDs/LIDs Check           0       0
Link State Active Check        0       0
Performance Counters Report    0       0
Partitions Check               0       0
IPoIB Subnets Check            0       0
Subnet Manager Check           0       0
Fabric Qualities Report        0       0
Credit Loops Check             0       0
Multicast Groups Report        0       0
End-node Adapter State

[ib1 ]# ibportstate 8 1
PortInfo:
# Port info: Lid 8 port 1
LinkState:.......................Active
PhysLinkState:...................LinkUp
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps
LinkSpeedActive:.................5.0 Gbps
End-node Adapter Counters

[ib1 ]# ibdatacounts 119 1
# Port counters: Lid 119 port 1
XmtData:.........................2102127705
RcvData:.........................2101904109
XmtPkts:.........................9069780
RcvPkts:.........................9068305

[ib1 ]# ibdatacounts 119 1
# Port counters: Lid 119 port 1
XmtData:.........................432
RcvData:.........................432
XmtPkts:.........................6
RcvPkts:.........................6

[ib1 ]# ibcheckerrs -v 20 1
Error check on lid 20 (ib12 HCA-2) port 1: OK
InfiniBand Management & Tools

• Subnet Management
• Diagnostic Tools
  – System Discovery Tools
  – System Health Monitoring Tools
  – System Performance Monitoring Tools
Network Switching and Routing

% ibroute -G 0x66a000700067c
Lid    Out   Destination
       Port  Info
0x0001 001 : (Channel Adapter portguid 0x0002c9030001e3f3: ' HCA-1')
0x0002 013 : (Channel Adapter portguid 0x0002c9020023c301: ' HCA-1')
0x0003 014 : (Channel Adapter portguid 0x0002c9030001e603: ' HCA-1')
0x0004 015 : (Channel Adapter portguid 0x0002c9020023c305: ' HCA-2')
0x0005 016 : (Channel Adapter portguid 0x0011750000ffe005: ' HCA-1')
0x0014 017 : (Switch portguid 0x00066a0007000728: 'SilverStorm 9120 GUID=0x00066a00020001aa Leaf 8, Chip A')
0x0015 020 : (Channel Adapter portguid 0x0002c9020023c131: ' HCA-2')
0x0016 019 : (Switch portguid 0x00066a0007000732: 'SilverStorm 9120 GUID=0x00066a00020001aa Leaf 10, Chip A')
0x0017 019 : (Channel Adapter portguid 0x0002c9030001c937: ' HCA-1')
0x0018 019 : (Channel Adapter portguid 0x0002c9020023c039: ' HCA-2')
...

Packets to LID 0x0001 will be sent out on Port 001.
Static Analysis of Network Contention

[Figure: leaf and spine blocks with the destination LIDs routed through each spine, showing how LIDs map onto shared spine links.]

• Based on destination LIDs and switching/routing information, the exact path of the packets can be identified
  – If the application communication pattern is known, we can statically identify possible network contention
Dynamic Analysis of Network Contention

• IB provides many optional performance counters to query (see the query example below)
  – PortXmitWait: number of ticks in which there was data to send, but no flow-control credits
  – RNR NAKs: number of times a message was sent, but the receiver had not yet posted a receive buffer
    • This can timeout, so it can be an error in some cases
  – PortXmitFlowPkts: number of (link-level) flow-control packets transmitted on the port
  – SWPortVLCongestion: number of packets dropped due to congestion
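These counters can be read with the perfquery tool from the infiniband-diags package; a sample invocation (assumed LID/port, consistent with the counter listings shown earlier):

[ib1 ]# perfquery 119 1     # dump the port counters for LID 119, port 1

Note that optional counters such as PortXmitWait appear only if the hardware implements them.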
Presentation Overview
• Networking Requirements of HEC Systems
• Recap of InfiniBand and 10GE Overview
• Advanced Features of IB
• Advanced Features of 10/40/100GE
• The Open Fabrics Software Stack Usage
• Designing High-End Systems with IB and 10GE
  – MPI, Sockets, File Systems, Multi-Tier Data Centers and Virtualization
• Conclusions and Final Q&A
Advanced Capabilities in iWARP/10GE

• Security in iWARP
• Multipathing support using VLANs (Ethernet Feature)
• 64b/66b Encoding Standards
• Link Aggregation (Ethernet Feature)
Security in iWARP

• iWARP was designed to be compliant with the Internet, while providing high performance
  – Security is an important consideration
  – E.g., it can be used in data-centers where unknown clients communicate with the server over iWARP
• Multiple levels of security measures are specified by the standard (in practice, only a few are implemented)
  – Untrusted peer access model
  – Encrypted Wire Protocol
  – Information Disclosure
Untrusted Peer Access Model

• Single access RDMA
  – The target exposes a memory region for a single protected access
  – Initiator performs three steps:
    • Writes data to the target location
    • Sends an invalidate STAG message (all access to the target memory is removed)
    • Sends a verification key
  – Target performs two steps:
    • Uses the verification key to ensure the message was not tampered with
    • Marks the process as complete
  – The verification model is unspecified by the iWARP standard
Encrypted Wire Protocol and Information Disclosure

• iWARP is built on top of TCP/IP, so all of its security protocols are directly usable by iWARP
  – The standard discusses IPSec, but does not specify it
    • Can be thought of as the “recommended mechanism”
  – Security capabilities are not directly accessible, except to turn on or off
• Information Disclosure
  – Peer-specific memory protection capabilities
    • E.g., only peer X can access this buffer
    • Only peer X can write to this, and peer Y can read from it
    • Mixed modes (a buffer is readable by a subset of peers)
Denial of Service Attacks

• Typical forms of DoS attacks are when the peer negotiates resources (e.g., by opening a connection) but performs no real work
  – Especially difficult to handle when done on the network adapter (limited resources)
• The iWARP standard does not specify any solution for this; it leaves it to the TCP/IP layer to handle
  – E.g., using authentication, or terminating connections by monitoring usage
  – Recommends that communication be offloaded only after the authentication is done
Advanced Capabilities in iWARP/10GE

• Security in iWARP
• Multipathing support using VLANs (Ethernet Feature)
• 64b/66b Encoding Standards
• Link Aggregation (Ethernet Feature)
VLAN-based Multipathing

• Ethernet basic switching
  – Network is broken down to a tree by disabling links
  – Pros: no live-locks and simple switching
  – Cons: single path between nodes and wastage of links
• VLAN-based multipathing
  – Overlay many logical networks on one physical network
    • Each overlay network will break down into a unique tree
    • Depending on which overlay network you send on, you get a different path
    • Adding nodes/links is simple; you just add a new overlay
    • Older overlays will continue to work as before
Example VLAN Configuration

• Basic Ethernet converts the topology to a tree
  – Wastes four of the links in this example
• Can be considered as two different VLANs
  – All the links in the network are utilized
• Can be used for:
  – High Performance
  – Security (if someone needs access to only a part of the network)
  – Fault tolerance
• Supported by several switch vendors
  – Woven Systems, Cisco (a host-side configuration sketch follows)
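As a concrete (assumed, Linux/iproute2) example of directing traffic onto two overlays from a host, two VLAN sub-interfaces can be created on one physical port; the interface names, VLAN IDs and addresses are illustrative:

ip link add link eth0 name eth0.10 type vlan id 10   # overlay / tree #1
ip link add link eth0 name eth0.20 type vlan id 20   # overlay / tree #2
ip addr add 192.168.10.1/24 dev eth0.10
ip addr add 192.168.20.1/24 dev eth0.20
ip link set eth0.10 up
ip link set eth0.20 up

Traffic sent via eth0.10 and eth0.20 is tagged for the respective VLAN and therefore follows that overlay's spanning tree through the switches.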
Advanced Capabilities in iWARP/10GE

• Security in iWARP
• Multipathing support using VLANs (Ethernet Feature)
• 64b/66b Encoding Standards
• Link Aggregation (Ethernet Feature)
64b/66b Network Data Encoding

• All communication channels utilize data encoding
  – There can be an imbalance in the number of 1’s and 0’s in the data bytes being transmitted
    • Leads to a problem called DC-balancing
    • Reduces signal integrity, especially for fast networks
  – Encoding converts data into a format with a more even mix of 1’s and 0’s
  – E.g., 10GE has 12.5 Gbps signaling; same for Myrinet, Quadrics, GigaNet cLAN and most other networks
• 8b/10b encoding
  – Pros: better signal integrity, so fewer retransmits
  – Cons: more bits sent over the wire (20% overhead)
• 64b/66b has the same benefit, but less overhead (see the arithmetic below)
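The overhead numbers follow directly from the code rates, worked out here for reference:

8b/10b:  10 signal bits carry 8 data bits, so 2/10 = 20% of the wire bits are overhead
         (10 Gbps of data needs 12.5 Gbps signaling, as in the 10GE example above)
64b/66b: 66 signal bits carry 64 data bits, so 2/66 is about 3% of the wire bits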
Link Aggregation

• Link aggregation allows multiple links to logically look like a single faster link (see the sketch below)
  – Done at a hardware level
  – Several multi-port network adapters allow for packet sequencing to avoid out-of-order packets
  – Both 64b/66b encoding and link aggregation are mainly driven by the need to conserve power
    • More data rate, but not necessarily a higher clock speed
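For reference, an assumed Linux-side configuration of link aggregation (802.3ad/LACP bonding) using iproute2; interface names are illustrative and the switch ports must be configured to match:

ip link add bond0 type bond mode 802.3ad   # create an LACP aggregate
ip link set eth0 down; ip link set eth0 master bond0
ip link set eth1 down; ip link set eth1 master bond0
ip link set bond0 up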
Presentation Overview
• Networking Requirements of HEC Systems
• Recap of InfiniBand and 10GE Overview
• Advanced Features of IB
• Advanced Features of 10/40/100GE
• The Open Fabrics Software Stack Usage
• Designing High-End Systems with IB and 10GE
  – MPI, Sockets, File Systems, Multi-Tier Data Centers and Virtualization
• Conclusions and Final Q&A
OpenFabrics

• www.openfabrics.org
• Open source organization (formerly OpenIB)
• Incorporates both IB and iWARP in a unified manner
• Focusing on the effort for Open Source IB and iWARP support for Linux and Windows
• Design of a complete software stack with ‘best of breed’ components
  – Gen1
  – Gen2 (current focus)
• Users can download the entire stack and run
  – Latest release is OFED 1.4.2
  – OFED 1.5 is being worked out
OpenFabrics Software Stack

[Figure: the OpenFabrics stack, from hardware (InfiniBand HCA, iWARP R-NIC) through hardware-specific drivers, kernel-level verbs/API, and the mid-layer (connection manager abstraction (CMA), SA client, MAD, SMA) up to kernel upper-layer protocols (IPoIB, SDP, SRP, iSER, RDS, NFS-RDMA RPC, cluster file systems) and user space (user-level verbs/API, user-level MAD API, UDAPL, SDP library, OpenSM, diagnostic tools), serving applications: clustered DB access, sockets-based access, various MPIs, file system and block storage access, and IP-based apps.]

Acronyms: SA = Subnet Administrator; MAD = Management Datagram; SMA = Subnet Manager Agent; PMA = Performance Manager Agent; IPoIB = IP over InfiniBand; SDP = Sockets Direct Protocol; SRP = SCSI RDMA Protocol (Initiator); iSER = iSCSI RDMA Protocol (Initiator); RDS = Reliable Datagram Service; UDAPL = User Direct Access Programming Lib; HCA = Host Channel Adapter; R-NIC = RDMA NIC.
Programming with OpenFabrics

Sample steps (sender and receiver):
1. Create QPs (endpoints)
2. Register memory for sending and receiving
3. Send
   – Channel semantics: post receive, then post send
   – Memory semantics: post RDMA operation

[Figure: a process talks to the HCA through the kernel for setup; data operations go directly to the HCA.]
Verb Steps

• Open HCA and create QPs to end nodes
  – Can be done with connection managers (rdma_cm or ibcm) or directly through verbs with out-of-band communication (a fuller setup sketch follows)
• Register memory

struct ibv_mr *mr_handle = ibv_reg_mr(pd, buffer, len,
        IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE |
        IBV_ACCESS_REMOTE_READ);

  – Permissions can be set as needed
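For completeness, a minimal sketch (assumed sizes, no error checking) of the first step through raw verbs, i.e., opening the HCA and creating the PD, CQ and an RC QP:

struct ibv_device **devs = ibv_get_device_list(NULL);
struct ibv_context *ctx  = ibv_open_device(devs[0]);   /* first HCA found */
struct ibv_pd *pd = ibv_alloc_pd(ctx);
struct ibv_cq *cq = ibv_create_cq(ctx, 128, NULL, NULL, 0);

struct ibv_qp_init_attr qia;
memset(&qia, 0, sizeof(qia));
qia.send_cq = cq;
qia.recv_cq = cq;
qia.qp_type = IBV_QPT_RC;        /* reliable connection */
qia.cap.max_send_wr  = 64;
qia.cap.max_recv_wr  = 64;
qia.cap.max_send_sge = 1;
qia.cap.max_recv_sge = 1;
struct ibv_qp *qp = ibv_create_qp(pd, &qia);
/* the QP is then moved INIT -> RTR -> RTS with ibv_modify_qp(), using QP
   numbers and LIDs exchanged out-of-band or via rdma_cm/ibcm */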
Verbs: Post Receive

• Prepare and post receive descriptor (channel semantics)

struct ibv_recv_wr *bad_wr;
struct ibv_recv_wr rr;
struct ibv_sge sg_entry;

rr.next = NULL;
rr.wr_id = 0;
rr.num_sge = 1;
rr.sg_list = &(sg_entry);
sg_entry.addr = (uintptr_t) buf;       /* local buffer address */
sg_entry.length = len;
sg_entry.lkey = mr_handle->lkey;       /* memory handle */

ret = ibv_post_recv(qp, &rr, &bad_wr);       /* post to QP */
ret = ibv_post_srq_recv(srq, &rr, &bad_wr);  /* or post to SRQ */
Verbs: Post Send

• Prepare and post send descriptor (channel semantics)

struct ibv_send_wr *bad_wr;
struct ibv_send_wr sr;
struct ibv_sge sg_entry;

sr.next = NULL;
sr.opcode = IBV_WR_SEND;
sr.wr_id = 0;
sr.num_sge = 1;
sr.send_flags = IBV_SEND_SIGNALED;
sr.sg_list = &(sg_entry);
sg_entry.addr = (uintptr_t) buf;
sg_entry.length = len;
sg_entry.lkey = mr_handle->lkey;

ret = ibv_post_send(qp, &sr, &bad_wr);
Verbs: Post RDMA Write

• Prepare and post RDMA write (memory semantics)

struct ibv_send_wr *bad_wr;
struct ibv_send_wr sr;
struct ibv_sge sg_entry;

sr.next = NULL;
sr.opcode = IBV_WR_RDMA_WRITE;           /* set type to RDMA Write */
sr.wr_id = 0;
sr.num_sge = 1;
sr.send_flags = IBV_SEND_SIGNALED;
sr.wr.rdma.remote_addr = remote_addr;    /* remote virtual addr. */
sr.wr.rdma.rkey = rkey;                  /* from remote node */
sr.sg_list = &(sg_entry);
sg_entry.addr = (uintptr_t) buf;         /* local buffer */
sg_entry.length = len;
sg_entry.lkey = mr_handle->lkey;

ret = ibv_post_send(qp, &sr, &bad_wr);
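The sample steps end with pulling completions out of the CQ; a minimal busy-wait polling sketch (assuming the cq created during setup):

struct ibv_wc wc;
int n;
do {
    n = ibv_poll_cq(cq, 1, &wc);   /* returns the number of completions drained */
} while (n == 0);
if (n < 0 || wc.status != IBV_WC_SUCCESS) {
    /* completion error: wc.status names the failure (e.g., RNR retry exceeded);
       the QP transitions to the error state and must be reset before reuse */
}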
Presentation Overview
• Networking Requirements of HEC Systems
• Recap of InfiniBand and 10GE Overview
• Advanced Features of IB
• Advanced Features of 10/40/100GE
• The Open Fabrics Software Stack Usage
• Designing High-End Systems with IB and 10GE
  – MPI, Sockets, File Systems, Multi-Tier Data Centers and Virtualization
• Conclusions and Final Q&A
Designing High-End Computing Systems with IB and iWARP

• Message Passing Interface (MPI)
• Sockets Direct Protocol (SDP) and IPoIB (TCP/IP)
• File Systems
• Multi-tier Data Centers
• Virtualization
Designing MPI Using IB/iWARP Features

MPI design components:
– Protocol Mapping
– Flow Control
– Buffer Management
– Connection Management
– Communication Progress
– Collective Communication
– Multi-rail Support
– One-sided Active/Passive
– Substrate

IB and Ethernet features used:
– Send/Receive
– RDMA Operations
– Atomic Operations
– Unreliable Datagram
– Shared Receive Queues
– Multicast
– Static Rate Control / Dynamic Rate Control
– Out-of-order Placement
– End-to-End Flow Control
– QoS
– Multi-Path (LMC, VLANs)
IB and Ethernet Features
MVAPICH/MVAPICH2 Software
• High Performance MPI Library for IB and 10GE– MVAPICH (MPI-1) and MVAPICH2 (MPI-2)
– Used by more than 975 organizations in 51 countries
– More than 32,000 downloads from OSU site directly
– Empowering many TOP500 clusters• 8th ranked 62,976-core cluster (Ranger) at TACC
– Available with software stacks of many IB, 10GE and server vendors including Open Fabrics Enterprise Distribution (OFED)vendors including Open Fabrics Enterprise Distribution (OFED)
– Also supports uDAPL device to work with any network supporting uDAPL
– http://mvapich.cse.ohio-state.edu/Supercomputing '09
MPICH2 Software Stack

• High-performance and widely portable MPI
  – Supports MPI-1, MPI-2 and MPI-2.1
  – Supports multiple networks (TCP, IB, iWARP, Myrinet)
  – Commercial support by many vendors
    • IBM (integrated stack distributed by Argonne)
    • Microsoft, Intel (in process of integrating their stack)
  – Used by many derivative implementations
    • E.g., MVAPICH2, IBM, Intel, Microsoft, SiCortex, Cray, Myricom
    • MPICH2 and its derivatives support many Top500 systems (estimated at more than 90%)
  – Available with many software distributions
  – Integrated with the ROMIO MPI-IO implementation and the MPE profiling library
  – http://www.mcs.anl.gov/research/projects/mpich2
Design Challenges and Sample Results

• Interaction with Multi-core Environments
  – Communication Characteristics on Multi-core Systems
  – Protocol Processing Interactions
• Network Congestion and Hot-spots
• Collective Communication
• Scalability for Large-scale Systems
• Fault Tolerance
• Quality of Service
• Application Scalability
MPI Bandwidth on ConnectX with Multicore

[Figure: multi-stream MPI bandwidth (MillionBytes/sec) for 1, 2, 4 and 8 concurrent pairs across message sizes; up to a 5-fold performance difference between configurations.]
S. Sur, M. J. Koop, L. Chai and D. K. Panda, “Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms”, IEEE Hot Interconnects, 2007
Design Challenges and Sample Results

• Interaction with Multi-core Environments
  – Communication Characteristics on Multi-core Systems
  – Protocol Processing Interactions
• Network Congestion and Hot-spots
• Collective Communication
• Scalability for Large-scale Systems
• Fault Tolerance
• Quality of Service
• Application Scalability
Analyzing Interrupts and Cache Misses

[Figure: hardware interrupts per message and percentage difference in L2 cache misses for cores 0-3, across message sizes from 1 byte to 2 MB.]
MPI Performance on Different Cores

[Figure: MPI bandwidth (Mbps) on Intel and AMD platforms for processes bound to cores 0-3, across message sizes from 1 byte to 2 MB; achieved bandwidth depends on which core handles the communication.]
Design Challenges and Sample Results

• Interaction with Multi-core Environments
• Network Congestion and Hot-spots
• Collective Communication
• Scalability for Large-scale Systems
• Fault Tolerance
• Quality of Service
• Application Scalability
Hot-Spot Avoidance with MVAPICH

• The deterministic nature of IB routing leads to network hot-spots
• Responsibility for path utilization is up to the MPI library
• We design HSAM (Hot-Spot Avoidance MVAPICH) to alleviate this problem

[Figure: NAS FT execution time (seconds) for 32x1 and 64x1 processes; HSAM with 4 paths outperforms the original design.]

A. Vishnu, M. Koop, A. Moody, A. Mamidala, S. Narravula and D. K. Panda, “Hot-Spot Avoidance With Multi-Pathing Over InfiniBand: An MPI Perspective”, CCGrid ’07 (Best Paper Nominee)
Network Congestion with Multi-cores

[Figure: faster processors and faster networks shift the balance between protocol processing and network communication; with multi-cores, many cores share the network, increasing network usage and congestion.]
Communication Burstiness

[Figure: communication throughput (Mbps) and network congestion, measured as switch pause frames (RX/TX), across configurations.]
A. Shah, B. N. Bryant, H. Shah, P. Balaji and W. Feng, “Switch Analysis of Network Congestion for Scientific Applications” (under preparation)
Out-of-Order (OoO) Packets

• Multi-path communication is supported by many networks
  – IB, 10GE: hardware feature!
  – Can cause OoO packets, which the protocol stack has to handle
    • Simple approach used by most protocols (e.g., IB): drop & retransmit
    • Not good for large-scale systems
  – iWARP specifies a more graceful approach
    • Out-of-order placement of data
    • Overhead of out-of-order packets should be minimized
Issues with Out-of-Order Packets

[Figure: intermediate switch segmentation can split an iWARP message so that some packets carry only partial payloads without an iWARP header; delayed and out-of-order packets then cannot be identified by their iWARP header.]
Handling Out-of-Order Packets in iWARP

[Figure: three placements of the iWARP stack (RDMAP, RDDP, markers, CRC over TCP/IP): host-based, host-offloaded, and host-assisted; marker segments delimit the payload so out-of-order packets can be placed.]

• The packet structure becomes overly complicated
• Performing this in hardware is no longer straightforward!
Overhead of Supporting OoO Packets

[Figure: iWARP latency (us) and bandwidth (Mbps) across message sizes for host-based, host-offloaded, host-assisted and in-order iWARP designs.]
Design Challenges and Sample Results

• Interaction with Multi-core Environments
• Network Congestion and Hot-spots
• Collective Communication
  – IB Multicast based MPI Broadcast
  – Shared memory aware collectives
• Scalability for Large-scale Systems
• Fault Tolerance
• Quality of Service
• Application Scalability
MPI Broadcast Using IB Multicast

[Figure: broadcast latency (us) for message sizes from 1 byte to 4 KB: actual measurements at 64 processes and an analytical model at 1024 processes; IB multicast shows lower latency than point-to-point based broadcast.]
Shared-memory Aware Collectives
(4K cores on TACC Ranger with MVAPICH2)

[Figure: MPI_Reduce and MPI_Allreduce latency (us) at 4096 cores for small message sizes; the shared-memory aware design shows much lower latency than the original.]
Design Challenges and Sample Results

• Interaction with Multi-core Environments
• Network Congestion and Hot-spots
• Collective Communication
• Scalability for Large-scale Systems
  – Memory Efficient Communication
• Fault Tolerance
• Quality of Service
• Application Scalability
Memory Utilization using Shared Receive Queues

[Figure: MPI_Init memory utilization (MB) measured for 2-32 processes, and an analytical model (GB) for 128-16384 processes, comparing MVAPICH-RDMA, MVAPICH-SR and MVAPICH-SRQ.]

• SRQ consumes only 1/10th of the memory compared to RDMA for 16,000 processes
• Send/Recv exhausts the buffer pool after 1,000 processes; consumes 2X the memory of SRQ for 16,000 processes
S. Sur, L. Chai, H.-W. Jin and D. K. Panda, “Shared Receive Queue Based Scalable MPI Design for InfiniBand Clusters”, IPDPS 2006
Communication Buffer Memory Utilization with NAMD (apoa1)

[Figure: normalized memory usage (MB) and performance for ARDMA-SR, ARDMA-SRQ and SRQ at 16, 32 and 64 processes.]

Avg. RDMA channels:    53.15
Avg. low watermarks:   0.03
Unexpected msgs (%):   48.2
Total messages:        3.7e6
MPI time (%):          23.54

• 50% of messages < 128 bytes; the other 50% between 128 bytes and 32 KB
  – 53 RDMA connections set up for the 64-process experiment
• The SRQ channel takes 5-6 MB of memory
  – Memory needed by SRQ decreases by 1 MB going from 16 to 64 processes
S. Sur, M. Koop and D. K. Panda, “High-Performance and Scalable MPI over InfiniBand with Reduced Memory Usage: An In-Depth Performance Analysis”, SC ‘06
UD vs. RC: Performance and Scalability (SMG2000 Application)

[Figure: normalized SMG2000 execution time for RC vs. UD at 128-4096 processes.]

Memory Usage (MB/process):

Processes | RC (MVAPICH 0.9.8): Conn. / Buffers / Struct. / Total | UD Design: Buffers / Struct. / Total
   512    | 22.9 / 65.0 / 0.3 /  88.2                             | 37.0 / 0.2 / 37.2
  1024    | 29.5 / 65.0 / 0.6 /  95.1                             | 37.0 / 0.4 / 37.4
  2048    | 42.4 / 65.0 / 1.2 / 107.4                             | 37.0 / 0.9 / 37.9
  4096    | 66.7 / 65.0 / 2.4 / 134.1                             | 37.0 / 1.7 / 38.7

• Large number of peers per process (992 at maximum)
• UD reduces HCA QP cache thrashing

M. Koop, S. Sur, Q. Gao and D. K. Panda, “High Performance MPI Design using Unreliable Datagram for Ultra-Scale InfiniBand Clusters,” ICS ‘07
eXtended Reliable Connection (XRC)

[Figure: memory usage (MB/process) vs. number of connections for RC, XRC (8-core) and XRC (16-core); and normalized NAMD time at 64 cores for RC vs. XRC on the apoa1, er-gre, f1atpase and jac datasets.]

• Memory usage for 32K processes with 16 cores per node can be 30 MB/process for connections alone
• Performance for NAMD can increase when there is frequent communication to many peers (the HCA cache miss rate goes down)

M. Koop, J. Sridhar and D. K. Panda, “Scalable MPI Design over InfiniBand using eXtended Reliable Connection,” Cluster ‘08
Hybrid Transport Design (UD/RC/XRC)
• Both UD and RC/XRC have benefits
• Evaluate the characteristics of each and use multiple transports within the same application to get the best of both (see the selection sketch below)
Supercomputing '09
M. Koop, T. Jones and D. K. Panda, “MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand,” IPDPS ‘08
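For illustration only (this is not the actual Aptus policy), a hybrid design can apply a per-message rule of the kind sketched below; both thresholds are hypothetical values.

```c
#include <stddef.h>

/* Frequent peers and large messages go over a connected transport (RC/XRC),
 * everything else over the connectionless, memory-cheap UD transport. */
enum transport { TRANSPORT_UD, TRANSPORT_RC };

enum transport choose_transport(size_t msg_size, unsigned msgs_to_peer)
{
    const size_t   large_msg = 8 * 1024;  /* hypothetical size cutoff */
    const unsigned hot_peer  = 16;        /* hypothetical frequency cutoff */

    if (msg_size >= large_msg || msgs_to_peer >= hot_peer)
        return TRANSPORT_RC;  /* worth the per-peer connection memory */
    return TRANSPORT_UD;      /* scalable default for cold peers */
}
```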
Design Challenges and Sample Results
• Interaction with Multi-core Environments
• Network Congestion and Hot-spots
• Collective Communication
• Scalability for Large-scale Systems
• Fault Tolerance
  – Network Faults: Automatic Path Migration
  – Process Faults: Checkpoint-Restart
• Quality of Service
• Application Scalability
Supercomputing '09
Fault Tolerance
• Component failures are common in large-scale clusters
• Imposes the need for reliability and fault tolerance
• Working along the following three angles:
  – Reliable networking with Automatic Path Migration (APM) utilizing redundant communication paths (available since MVAPICH 1.0 and MVAPICH2 1.0)
  – Process fault tolerance with efficient checkpoint and restart (available since MVAPICH2 0.9.8)
  – End-to-end reliability with memory-to-memory CRC (available since MVAPICH 0.9.9)
Supercomputing '09
Network Fault-Tolerance with APM
• Network Fault Tolerance using InfiniBand Automatic Path Migration (APM)
– Utilizes Redundant Communication Paths
• Multiple Ports
• LMC
• Supported in OFED 1.2
A. Vishnu, A. Mamidala, S. Narravula and D. K. Panda, “Automatic Path Migration over InfiniBand: Early Experiences”, Third International Workshop on System Management Techniques, Processes, and Services, held in conjunction with IPDPS ‘07
Supercomputing '09
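A minimal sketch of how APM is armed through the verbs interface, assuming the QP is already connected and an alternate path (for example via LMC or a second port) is known; the field values are illustrative.

```c
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Load an alternate path into a connected QP and arm migration: on a path
 * error the HCA fails over to the alternate path without application help. */
int arm_alternate_path(struct ibv_qp *qp, uint16_t alt_dlid, uint8_t alt_port)
{
    struct ibv_qp_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.alt_ah_attr.dlid     = alt_dlid;   /* redundant path (LMC/other port) */
    attr.alt_ah_attr.port_num = alt_port;
    attr.alt_pkey_index       = 0;
    attr.alt_timeout          = 14;
    attr.path_mig_state       = IBV_MIG_REARM;  /* arm automatic migration */
    return ibv_modify_qp(qp, &attr, IBV_QP_ALT_PATH | IBV_QP_PATH_MIG_STATE);
}
```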
APM Performance Evaluation
[Charts: execution time (seconds) for FT Class B and LU Class B with 8-64 QPs per process, comparing Original, Armed, Armed-Migrated and Network Fault cases]
Supercomputing '09
Checkpoint-Restart Support in MVAPICH2
• Process-level Fault Tolerance
  – User-transparent, system-level checkpointing
  – Based on BLCR from LBNL to take coordinated checkpoints of the entire program, including the front end and individual processes
  – Designed novel schemes to:
    • Coordinate all MPI processes to drain all in-flight messages in IB connections
    • Store communication state and buffers while checkpointing
    • Restart from the checkpoint
• System-level checkpoints can be initiated from the application (available since MVAPICH2 1.0)
Supercomputing '09
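The coordination step can be pictured with the simplified sketch below. This is a conceptual illustration only, not the MVAPICH2 code: `flush_pending_ib_sends()` and `checkpoint_with_blcr()` are hypothetical helpers, and the real protocol also drains and buffers in-flight receives before the snapshot is taken.

```c
#include <mpi.h>

extern void flush_pending_ib_sends(void);  /* hypothetical: drain send queues */
extern void checkpoint_with_blcr(void);    /* hypothetical: invoke BLCR */

/* Every process reaches a quiet point so no message is in flight on the
 * IB fabric when the system-level checkpoint is taken. */
void coordinated_checkpoint(void)
{
    flush_pending_ib_sends();     /* 1. push out all locally queued data */
    MPI_Barrier(MPI_COMM_WORLD);  /* 2. all processes reach a quiet point */
    checkpoint_with_blcr();       /* 3. snapshot each process */
    MPI_Barrier(MPI_COMM_WORLD);  /* 4. resume only when everyone is back */
}
```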
Checkpoint-Restart Performance with PVFS2
NAS, LU Class C, 32x1 (Storage: 8 PVFS2 servers on IPoIB)
[Chart: execution time (seconds) vs. number of checkpoints taken: no checkpoint, 1 ckpt (avg 60 s interval), 2 ckpts (avg 40 s), 3 ckpts (avg 30 s), 4 ckpts (avg 20 s)]
Supercomputing '09
Q. Gao, W. Yu, W. Huang and D.K. Panda, “Application-Transparent Checkpoint/Restart for MPI over InfiniBand”, ICPP ‘06
Enhancing CR Performance with I/O Aggregation for Multi-core Systems
[Chart: time to take one checkpoint (ms) for LU.C.64, SP.C.64, BT.C.64 and EP.C.64, Original BLCR vs. Aggregation, with speedups ranging from 1.67x to 13.08x]
• 64 MPI processes on 4 nodes, 16 processes/node
• Checkpoint data is written to local disk files
Supercomputing '09
X. Ouyang, K. Gopalakrishnan and D. K. Panda, “Accelerating Checkpoint Operation by Node-Level Write Aggregation on Multicore Systems”, Int'l Conference on Parallel Processing (ICPP '09), Sept. 2009
Design Challenges and Sample Results
• Interaction with Multi-core Environments
• Network Congestion and Hot-spots
• Collective Communication
• Scalability for Large-scale Systems
• Fault Tolerance
• Quality of Service
• Application Scalability
Supercomputing '09
QoS in IB
• IB is capable of providing network-level differentiated service (QoS)
• Uses Service Levels (SL) and Virtual Lanes (VL) to classify traffic
[Diagram: IB HCA buffer organization: a common buffer pool feeding multiple virtual lanes, with a virtual-lane arbiter multiplexing them onto the IB link]
Supercomputing '09
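In the verbs interface, traffic classification is requested by placing a Service Level in the address vector when a QP is moved to RTR; the subnet manager's SL-to-VL tables then map the flow onto a virtual lane. A minimal sketch, assuming the caller supplies the remaining path parameters and using illustrative constants:

```c
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Move an RC QP to RTR with an explicit Service Level; the SL is the knob
 * the QoS experiments above turn to separate jobs onto different VLs. */
int set_service_level(struct ibv_qp *qp, uint8_t sl, uint16_t dlid,
                      uint32_t dest_qpn, uint32_t rq_psn)
{
    struct ibv_qp_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.qp_state           = IBV_QPS_RTR;
    attr.path_mtu           = IBV_MTU_2048;
    attr.dest_qp_num        = dest_qpn;
    attr.rq_psn             = rq_psn;
    attr.max_dest_rd_atomic = 1;
    attr.min_rnr_timer      = 12;
    attr.ah_attr.dlid       = dlid;
    attr.ah_attr.sl         = sl;   /* Service Level -> Virtual Lane mapping */
    attr.ah_attr.port_num   = 1;
    return ibv_modify_qp(qp, &attr,
                         IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                         IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                         IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER);
}
```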
Inter-Job Quality of Service
[Charts: latency (µs) vs. message size for small and large messages, comparing No-Traffic, No-QoS and QoS cases]
• Can differentiate between multiple jobs
Supercomputing '09
Design Challenges and Sample Results
• Interaction with Multi-core Environments
• Network Congestion and Hot-spots
• Collective Communication
• Scalability for Large-scale Systems
• Fault Tolerance
• Quality of Service
• Application Scalability
Supercomputing '09
Performance of HPC Applications on TACC Ranger using MVAPICH + IB
• Rob Farber's facial recognition application was run on up to 60K cores using MVAPICH
• Ranges from 84% of peak at the low end to 65% of peak at the high end
http://www.tacc.utexas.edu/research/users/features/index.php?m_b_c=farber
Supercomputing '09
3DFFT-based Computations
• Internally utilize sequential 1D FFT libraries and perform data-grid transforms to collect the required data
  – Example implementations: P3DFFT, FFTW
  – 3D volume of data divided among a 2D grid of processes
  – Grid transpose: MPI_Alltoallv across the rows and columns
    • Each process communicates with all processes in its row
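The row phase of the grid transpose can be sketched as below: a hypothetical helper that splits the processes of a 2D grid into row communicators and runs MPI_Alltoallv within each row (the counts and displacements are assumed to be computed by the caller).

```c
#include <mpi.h>

/* Processes with the same rank/prow_size value share a row of the 2D
 * process grid; the all-to-all exchange then stays within that row. */
void row_transpose(double *sendbuf, int *scounts, int *sdispls,
                   double *recvbuf, int *rcounts, int *rdispls,
                   int prow_size)
{
    int rank;
    MPI_Comm row_comm;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_split(MPI_COMM_WORLD, rank / prow_size, rank, &row_comm);
    MPI_Alltoallv(sendbuf, scounts, sdispls, MPI_DOUBLE,
                  recvbuf, rcounts, rdispls, MPI_DOUBLE, row_comm);
    MPI_Comm_free(&row_comm);
}
```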
Supercomputing '09
Performance of HPC Applications on TACC Ranger: DNS/Turbulence
Courtesy: P.K. Yeung, Diego Donzis, TG 2008
Supercomputing '09
Application Example: Blast Simulations
• Researchers from the University of Utah have developed a simulation framework called Uintah
• Combines advanced mechanical, chemical and physical models into a novel computational framework
• Have run > 32K MPI tasks on Ranger
• Uses asynchronous communication
http://www.tacc.utexas.edu/news/feature-stories/2009/explosive-science/
Courtesy: J. Luitjens, M. Berzins, Univ. of Utah
Supercomputing '09
Application Example: OMEN
• OMEN is a two- and three-dimensional Schrodinger-Poisson solver
• Used in semiconductor modeling
• Run to almost 60K tasks
Courtesy: Mathieu Luisier, Gerhard Klimeck, Purdue
http://www.tacc.utexas.edu/RangerImpact/pdf/Save_Our_Semiconductors.pdf
Supercomputing '09
Designing High-End Computing Systems with IB and iWARP
• Message Passing Interface (MPI)
• Sockets Direct Protocol (SDP) and IPoIB (TCP/IP)
• File Systems
• Multi-tier Data Centers
• Virtualization
Supercomputing '09
IPoIB vs. SDP Architectural Models
[Diagram: Traditional model: sockets application, sockets API, kernel TCP/IP sockets provider, TCP/IP transport driver, InfiniBand CA driver, InfiniBand CA. Possible SDP model: sockets application, sockets API, Sockets Direct Protocol with kernel bypass and RDMA semantics going directly to the InfiniBand CA]
(Source: InfiniBand Trade Association 2002)
Supercomputing '09
SDP vs. IPoIB (IB QDR)
[Charts: bandwidth and bidirectional bandwidth (MBps) and latency (µs) vs. message size for IPoIB-RC, IPoIB-UD and SDP]
• SDP enables high bandwidth (up to 15 Gbps) and low latency (6.6 µs)
Supercomputing '09
Flow-Control in IB
• Previous implementations of high-speed sockets (such as SDP) were on other networks
  – Implemented flow-control in software
• IB provides end-to-end message-level flow-control in hardware
  – Benefits: asynchronous progress (i.e., the SDP stack does not need to keep waiting for the receiver to be ready; hardware will take care of it)
  – Issues:
    • No intelligent software coalescing
    • Does not handle buffer overflow
Supercomputing '09
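The software credit scheme that earlier high-speed sockets implemented can be pictured with the sketch below. This is an illustration of the general technique, not the SDP wire protocol; the helper names and the credit count are hypothetical.

```c
#include <stdbool.h>

#define MAX_CREDITS 64

/* The sender spends one credit per message and blocks at zero; the
 * receiver returns credits after re-posting fresh receive buffers. */
static int credits = MAX_CREDITS;

extern void send_message(const void *buf, int len);  /* hypothetical */
extern bool poll_credit_update(int *returned);       /* hypothetical */

void flow_controlled_send(const void *buf, int len)
{
    int returned;
    while (credits == 0)                   /* receiver buffers exhausted */
        if (poll_credit_update(&returned))
            credits += returned;           /* receiver re-posted buffers */
    credits--;                             /* consume one receive buffer */
    send_message(buf, len);
}
```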
Flow-control Performance
[Charts: latency (µs) and bandwidth (Mbps) vs. message size for credit-based, RDMA-based and NIC-assisted flow control]
Supercomputing '09
P. Balaji, S. Bhagvat, D. K. Panda, R. Thakur and W. Gropp, “Advanced Flow-control Mechanisms for the Sockets Direct Protocol over InfiniBand”, ICPP, Sep 2007
Application Evaluation
[Charts: execution time (s) vs. dataset dimensions for Iso-surface Rendering (1024-8192) and Virtual Microscope (512-4096), comparing credit-based, RDMA-based and NIC-assisted flow control]
Supercomputing '09
Designing High-End Computing Systems with IB and iWARP
• Message Passing Interface (MPI)
• Sockets Direct Protocol (SDP) and IPoIB (TCP/IP)
• File Systems
• Multi-tier Data Centers
• Virtualization
Supercomputing '09
Lustre Performance
[Charts: write and read throughput (MBps) with 4 OSSs vs. number of clients (1-4), IPoIB vs. Native IB]
• Lustre over Native IB
  – Write: 1.38X faster than IPoIB; Read: 2.16X faster than IPoIB
• Memory copies in IPoIB and Native IB
  – Reduced throughput and high overhead; I/O servers are saturated
Supercomputing '09
CPU Utilization
[Chart: CPU utilization (%) vs. number of clients (1-4) for IPoIB and Native IB reads and writes]
• 4 OSS nodes, IOzone record size 1 MB
• Offers potential for greater scalability
Supercomputing '09
Can we enhance NFS Performance using IB RDMA?
• Many enterprise environments use NFS
  – The IETF's current major revision is NFSv4; NFSv4.1 deals with pNFS
• Current systems use Ethernet with TCP/IP; can they use IB?
  – Metadata-intensive workloads are latency sensitive
  – Need throughput for large transfers and OLTP-type workloads
• An NFS over RDMA standard has been proposed
  – Designed and implemented this on InfiniBand in OpenSolaris
    • Taking advantage of RDMA mechanisms in InfiniBand
    • Design works for NFSv3 and NFSv4
    • Interoperable with Linux NFS/RDMA
  – NFS over RDMA design incorporated into OpenSolaris by Sun
    • Ongoing work for pNFS (NFSv4.1)
  – Joint work with Sun and NetApp
  – http://nowlab.cse.ohio-state.edu/projects/nfs-rdma/index.html
Supercomputing '09
NFS/RDMA Performance
[Charts: write and read throughput (MB/s) on tmpfs vs. number of threads (1-15), Read-Read vs. Read-Write designs]
• IOzone read bandwidth up to 913 MB/s (Sun x2200s with x8 PCIe)
• Read-Write design by OSU, available with the latest OpenSolaris
• NFS/RDMA is being added into OFED 1.4
Supercomputing '09
R. Noronha, L. Chai, T. Talpey and D. K. Panda, “Designing NFS With RDMA For Security, Performance and Scalability”, ICPP ‘07
Designing High-End Computing Systems with IB and iWARP
• Message Passing Interface (MPI)
• Sockets Direct Protocol (SDP) and IPoIB (TCP/IP)
• File Systems
• Multi-tier Data Centers
• Virtualization
Supercomputing '09
Enterprise Datacenter Environments
[Diagram: clients, WAN, proxy server (Apache), application server (PHP), database server (MySQL), storage; computation and communication requirements grow toward the back-end tiers]
• Requests are received from clients over the WAN
• Proxy nodes perform caching, load balancing, resource monitoring, etc.
  – If not cached, the request is forwarded to the next tier: the application server
• The application server performs the business logic (CGI, Java servlets)
  – Retrieves appropriate data from the database to process the requests
Supercomputing '09
Proposed Architecture for Enterprise Multi-tier Data-centers
[Diagram: existing data-center components layered over advanced system services: dynamic content caching (active caching, cooperative caching) and active resource adaptation (dynamic reconfiguration, resource monitoring); data-center service primitives (soft shared state, distributed lock manager, global memory aggregator, point-to-point primitives) over a distributed data sharing substrate; communication protocols and subsystems (Sockets Direct Protocol, RDMA, atomics, multicast, protocol offload, RDMA-based flow-control, asynchronous zero-copy communication) over the network]
Supercomputing '09
Data-Center Response Time with SDP (Internet Proxies)
[Chart: response time (ms) vs. message size (32K-2M bytes), IPoIB vs. SDP]
Supercomputing '09
Cache Coherency and Consistency with Dynamic Data
• Example of strong cache coherency: never send stale data
  [Diagram: user requests #1 and #2 served from a proxy-node cache backed by back-end data]
• Example of strong cache consistency: always follow an increasing time line of events
  [Diagram: user requests #1, #2 and #3 ordered through the proxy node and back-end data]
Supercomputing '09
Active Polling: An Approach for Strong Cache Coherency
[Diagram: on each request the proxy node issues an RDMA Read to the back-end to validate its cache entry; cache hits avoid the software overhead on the back-end, while cache misses fetch the data from the back-end]
Supercomputing '09
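The RDMA Read in the figure can be sketched in verbs as below: a minimal illustration in which `remote_addr` and `rkey` are assumed to locate a version counter the back-end has registered and advertised to the proxies.

```c
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* The proxy validates its cached object by RDMA-reading the back-end's
 * version word with no back-end CPU involvement, which is the core of
 * the active-polling coherency scheme above. */
int post_version_read(struct ibv_qp *qp, uint64_t *local_buf, uint32_t lkey,
                      uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge;
    struct ibv_send_wr wr, *bad_wr;

    memset(&sge, 0, sizeof(sge));
    sge.addr   = (uintptr_t)local_buf;  /* registered local buffer */
    sge.length = sizeof(uint64_t);
    sge.lkey   = lkey;

    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_READ;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.wr.rdma.remote_addr = remote_addr;  /* back-end's version counter */
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);  /* completion polled on the CQ */
}
```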
Strong Cache Coherency with RDMA
[Charts: data-center throughput (transactions per second) and response time (ms) vs. number of compute threads (0-200), comparing No Cache, IPoIB and RDMA]
• RDMA can sustain performance even with heavy load on the back-end
Supercomputing '09
S. Narravula, P. Balaji, K. Vaidyanathan, S. Krishnamoorthy, J. Wu, and D. K. Panda, “Supporting Strong Cache Coherency for Active Caches in Multi-Tier Data-Centers over InfiniBand”, SAN ‘04
Advanced System Services over IB
• Cache coherency is one of the possible enhancements IB can provide for enterprise datacenter environments
• Other concepts:
  – System load monitoring
    • RDMA-based monitoring of the load on each node (the kernel writes this information to a memory location; read it using RDMA)
    • Multicast capabilities can help spread such information quickly to multiple processes (reliability not very important)
  – Load balancing for performance and QoS
    • Asynchronously update forwarding tables to decide how many machines serve the content for a given website
Supercomputing '09
http://nowlab.cse.ohio-state.edu/projects/data-centers
Designing High-End Computing Systems with IB and iWARP
• Message Passing Interface (MPI)
• Sockets Direct Protocol (SDP) and IPoIB (TCP/IP)
• File Systems
• Multi-tier Data Centers
• Virtualization
Supercomputing '09
Current I/O Virtualization Approaches
• I/O in the VMM (e.g., VMware ESX)
  – Device drivers hosted in the VMM
  – I/O operations always trap into the VMM
  – VMM ensures safe device sharing among VMs
• I/O in a special VM
  – Device drivers are hosted in a special (privileged) VM
  – I/O operations always involve the VMM and the special VM
  – E.g., Xen and VMware Workstation
[Diagram: applications in guest VMs, a guest module talking to a backend module in Dom0, a privileged module in the VMM, and the physical device]
Supercomputing '09
From OS-bypass to VMM-bypass
• Guest modules in guest VMs handle setup and management (privileged access)
  – Guest modules communicate with backend modules to get jobs done
  – The original privileged module can be reused
• Once set up, devices are accessed directly from guest VMs (VMM-bypass)
  – Either from the OS kernel or from applications
• Backend and privileged modules can also reside in a special VM
[Diagram: privileged access flows from guest modules in the VMs through backend and privileged modules near the VMM; VMM-bypass access goes from applications directly to the device]
Supercomputing '09
Xen Overhead with VMM-Bypass
[Charts: one-way latency (µs) and unidirectional bandwidth (million bytes/sec) vs. message size, Xen vs. Native]
• Only VMM-bypass operations are used (MVAPICH implementation)
• Xen-IB performs similarly to native InfiniBand
Supercomputing '09
W. Huang, J. Liu, B. Abali, D. K. Panda. “A Case for High Performance Computing with Virtual Machines”, ICS ’06
Optimizing VM Migration through RDMA
[Diagram: a helper process transfers machine states between physical hosts, with resources pre-allocated on the target]
Live VM migration:
• Step 1: Pre-allocate resources on the target host
• Step 2: Pre-copy machine states for multiple iterations
• Step 3: Suspend the VM and copy the latest updates to machine states
• Step 4: Restart the VM on the new host
Supercomputing '09
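The four steps can be pictured with the sketch below. This is a conceptual outline only, with hypothetical helpers throughout; the point of the RDMA-based design is that the iterative pre-copy in step 2 moves pages with RDMA writes rather than through the host TCP stack.

```c
#include <stdint.h>

extern uint64_t dirty_page_count(void);      /* hypothetical */
extern void rdma_write_dirty_pages(void);    /* hypothetical: step 2 */
extern void suspend_vm_and_copy_rest(void);  /* hypothetical: step 3 */
extern void restart_vm_on_target(void);      /* hypothetical: step 4 */

/* Iterative pre-copy live migration; step 1 (pre-allocating resources
 * on the target host) is assumed to have happened already. */
void live_migrate(uint64_t stop_threshold, int max_iters)
{
    for (int i = 0; i < max_iters && dirty_page_count() > stop_threshold; i++)
        rdma_write_dirty_pages();  /* pre-copy while the VM keeps running */
    suspend_vm_and_copy_rest();    /* brief stop-and-copy of remaining state */
    restart_vm_on_target();
}
```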
Fast Migration over RDMA
[Charts: execution time (s) for SP, BT, FT, LU, EP and CG under Native, IPoIB and RDMA migration; effective bandwidth (MB/s) and CPU utilization (%) for RDMA vs. IPoIB on SP.A.9, BT.A.9, FT.B.8, LU.A.8, EP.B.9 and CG.B.8]
• One physical CPU disabled on the nodes
• Migration overhead with IPoIB drastically increases
• RDMA achieves higher migration performance with less CPU usage
Supercomputing '09
W. Huang, Q. Gao, J. Liu, D. K. Panda. “High Performance Virtual Machine Migration with RDMA over Modern Interconnects”, Cluster ’07 (Selected as a Best Paper)
Summary of Design Issues and Results
• Current-generation IB adapters, 10GE/iWARP adapters and software environments are already delivering competitive performance compared to other interconnects
• IB and 10GE/iWARP hardware, firmware, and software are going through rapid changes
• Significant performance improvement is expected in the near future
Supercomputing '09
Presentation Overview
• Networking Requirements of HEC Systems
• Recap of InfiniBand and 10GE Overview
• Advanced Features of IB
• Advanced Features of 10/40/100GE
• The Open Fabrics Software Stack Usage
• Designing High-End Systems with IB and 10GE
  – MPI, Sockets, File Systems, Multi-Tier Data Centers and Virtualization
• Conclusions and Final Q&A
Supercomputing '09
Concluding Remarks
• Presented networking requirements for HEC clusters
• Presented advanced features of IB and 10GE
• Discussed the OpenFabrics stack and its usage
• Discussed design issues, challenges, and the state of the art in designing various high-end systems with IB and 10GE
• IB and 10GE are emerging as new architectures leading to a new generation of networked computing systems, opening many research issues needing novel solutions
Supercomputing '09
Funding Acknowledgments
Our research is supported by the following organizations
• Funding support by
• Equipment support by
Supercomputing '09
Personnel Acknowledgments
Current Students: M. Kalaiya (M.S.), K. Kandalla (M.S.), P. Lai (Ph.D.), M. Luo (Ph.D.), G. Marsh (Ph.D.), X. Ouyang (Ph.D.), S. Potluri (M.S.), H. Subramoni (Ph.D.)
Past Students: P. Balaji (Ph.D.), S. Bhagvat (M.S.), D. Buntinas (Ph.D.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), S. Kini (M.S.), M. Koop (Ph.D.), S. Krishnamoorthy (M.S.), R. Kumar (M.S.), J. Liu (Ph.D.), A. Mamidala (Ph.D.), S. Narravula (Ph.D.), R. Noronha (Ph.D.), G. Santhanaraman (Ph.D.), J. Sridhar (M.S.), S. Sur (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.)
Current Post-Doc: E. Mancini
Current Programmer: J. Perkins
Supercomputing '09
Web Pointers
http://www.cse.ohio-state.edu/~panda
http://www.mcs.anl.gov/~balaji
http://www.cse.ohio-state.edu/~koop
http://nowlab.cse.ohio-state.edu

MVAPICH Web Page
http://mvapich.cse.ohio-state.edu

panda@cse.ohio-state.edu
balaji@mcs.anl.gov
matthew.koop@nasa.gov

Supercomputing '09