InfiniBand Meets OpenSHMEM
Transcript of: InfiniBand Meets OpenSHMEM
March 2014
InfiniBand Meets OpenSHMEM
© 2014 Mellanox Technologies 2
Leading Supplier of End-to-End Interconnect Solutions
[Diagram: end-to-end portfolio of host/fabric software, ICs, adapter cards, switches/gateways, and cables/modules; Virtual Protocol Interconnect across server/compute, switch/gateway, and storage front/back-end (56G InfiniBand & FCoIB, 10/40/56GbE & FCoE); data center to Metro/WAN]
Comprehensive End-to-End InfiniBand and Ethernet Portfolio
© 2014 Mellanox Technologies 3
Virtual Protocol Interconnect (VPI) Technology
[Diagram: VPI Switch with Switch OS layer and Unified Fabric Manager: 64 ports 10GbE, 36 ports 40/56GbE, 48 x 10GbE + 12 x 40/56GbE, 36 ports IB up to 56Gb/s, 8 VPI subnets]
[Diagram: VPI Adapter (mezzanine card, LOM, adapter card, 3.0): Ethernet 10/40/56 Gb/s, InfiniBand 10/20/40/56 Gb/s; acceleration engines for networking, storage, clustering, and management applications; from data center to campus and metro connectivity]
Standard Protocols of InfiniBand and Ethernet on the Same Wire!
© 2014 Mellanox Technologies 4
Mellanox ScalableHPC Communication Library to Accelerate Applications
Supported programming models: MPI, Berkeley UPC, OpenSHMEM / PGAS
MXM:
• Reliable Messaging
• Hybrid Transport Mechanism
• Efficient Memory Registration
• Receive-Side Tag Matching
FCA:
• Topology-Aware Collective Optimization
• Hardware Multicast
• Separate Virtual Fabric for Collectives
• CORE-Direct Hardware Offload
[Chart: Barrier Collective Latency, latency (us) vs. processes (PPN=8), with and without FCA]
[Chart: Reduce Collective Latency, latency (us) vs. processes (PPN=8), with and without FCA]
© 2014 Mellanox Technologies 5
Overview
Extreme scale programming-model challenges
Challenges for scaling OpenSHMEM to extreme scale
InfiniBand enhancements
• Dynamically Connected Transport (DC)
• Cross-Channel synchronization
• Non-contiguous data transfer
• On Demand Paging
Mellanox ScalableSHMEM
© 2014 Mellanox Technologies 6
Exascale-Class Computer Platforms – Communication Challenges
Very large functional unit count ~10,000,000
• Implication to communication stack: Need scalable communication capabilities
- Point-to-point
- Collective
Large on-”node” functional unit count ~500
• Implication to communication stack: Scalable HCA architecture
Deeper memory hierarchies
• Implication to communication stack: Cache aware network access
Smaller amounts of memory per functional unit
• Implication to communication stack: Low latency, high b/w capabilities
May have functional unit heterogeneity
• Implication to communication stack: Support for data heterogeneity
Component failures part of “normal” operation
• Implication to communication stack: Resilient and redundant stack
Data movement is expensive
• Implication to communication stack: Optimize data movement
© 2014 Mellanox Technologies 7
Challenges of a Scalable OpenSHMEM
Scalable Communication interface
• Interface objects should scale sublinearly with system size
• Not a current issue, but keep it in mind moving forward
Support for asynchronous communication
• Non-blocking communication interfaces
• Sparse and topology-aware interfaces?
System noise issues
• Interfaces that support communication delegation (e.g., offloading)
Minimize data motion
• Semantics that support data “aggregation”
© 2014 Mellanox Technologies 8
Scalable Performance Enhancements
© 2014 Mellanox Technologies 9
Dynamically Connected Transport
A Scalable Transport
© 2014 Mellanox Technologies 10
New Transport
Challenges being addressed:
• Scalable communication protocol
• High-performance communication
• Asynchronous communication
Current status: Transports in widest use
• RC
- High Performance: Supports RDMA and Atomic Operations
- Scalability limitations: One connection per destination
• UD
- Scalable: One QP services multiple destinations
- Limited communication support: No support for RDMA and Atomic Operations, unreliable
Need a scalable transport that also supports RDMA and Atomic operations
DC – the best of both worlds:
• High Performance: Supports RDMA and Atomic Operations, Reliable
• Scalable: One QP services multiple destinations
© 2014 Mellanox Technologies 11
IB Reliable Transports Model
[Diagram: per-process reliable-transport resources for the example n = 4K nodes, p = 16 processes per node:
RC: n*p connections per process, n*p*p per node (~1M)
XRC: n connections per process, ~n*p per node (~64K)
RD (+XRC): shared contexts, ~n per node (~4K)]
QoS/Multipathing: 2 to 8 times the above
Resource sharing (XRC/RD) causes processes to impact each other
© 2014 Mellanox Technologies 12
The DC Model
Dynamic Connectivity
Each DC Initiator can be used to reach any remote DC Target
No resource sharing between processes
• Each process controls how many resources it allocates (and can adapt to load)
• Each process controls the usage model (e.g., SQ allocation policy)
• No inter-process dependencies
Resource footprint
• Function of HCA capability
• Independent of system size
Fast Communication Setup Time
cs = concurrency of the sender; cr = concurrency of the responder
[Diagram: each of the p processes (0 through p-1) holds cs DC initiator and cr DC target objects; total per node shown as p*(cs+cr)/2. A worked comparison with RC and XRC follows below.]
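As a rough sanity check of the resource arithmetic above, the following small C program (illustrative only: n and p come from the slide's example, while the DC concurrency values cs and cr are assumed, not taken from the deck) prints per-process and per-node object counts for RC, XRC, and DC:

/* Illustrative arithmetic only: transport objects per process and per node
 * for RC, XRC, and DC. n and p come from the slide's example; the DC
 * concurrency values cs and cr are assumed, not measured. */
#include <stdio.h>

int main(void)
{
    long n  = 4096; /* nodes (4K, from the slide)          */
    long p  = 16;   /* processes per node (from the slide) */
    long cs = 8;    /* assumed sender concurrency          */
    long cr = 8;    /* assumed responder concurrency       */

    printf("RC : %ld QPs per process, %ld per node (~1M)\n", n * p, n * p * p);
    printf("XRC: %ld QPs per process, ~%ld per node (~64K)\n", n, n * p);
    printf("DC : %ld objects per process, ~%ld per node\n",
           cs + cr, p * (cs + cr) / 2);
    return 0;
}

The point of the comparison is that the DC footprint is set by cs and cr (a property of the HCA and the chosen concurrency), not by the system size n*p.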
© 2014 Mellanox Technologies 13
Connect-IB – Exascale Scalability
[Chart: host memory consumption (MB, log scale) vs. system size (8, 2K, 10K, 100K nodes) for InfiniHost RC (2002), InfiniHost-III SRQ (2005), ConnectX XRC (2008), and Connect-IB DCT (2012)]
© 2014 Mellanox Technologies 14
Dynamically Connected Transport
Key objects
• DC Initiator: Initiates data transfer
• DC Target: Handles incoming data
© 2014 Mellanox Technologies 15
Reliable Connection Transport Mode
© 2014 Mellanox Technologies 16
Dynamically Connected Transport Mode
© 2014 Mellanox Technologies 17
Cross-Channel Synchronization
© 2014 Mellanox Technologies 18
Challenges Being Addressed
Scalable Collective communication
Asynchronous communication
Communication resources (not computational resources) are used to manage communication
Avoids some of the effects of system noise
© 2014 Mellanox Technologies 19
High Level Objectives
Provide a synchronization mechanism between QPs
Provide mechanisms that allow communication dependencies to be managed without additional host intervention
Support asynchronous progress of multi-staged communication protocols
© 2014 Mellanox Technologies 20
Motivating Example - HPC
Collective communications optimization
Communication pattern involving multiple processes
Optimized collectives involve a communicator-wide data-dependent communication pattern, e.g.,
communication initiation is dependent on prior completion of other communication operations
Data needs to be manipulated at intermediate stages of a collective operation (reduction
operations)
Collective operations limit application scalability
• Performance, scalability, and system noise
© 2014 Mellanox Technologies 21
Scalability of Collective Operations
[Diagram: ideal algorithm vs. impact of system noise on a four-step collective communication]
© 2014 Mellanox Technologies 22
Scalability of Collective Operations - II
[Diagram: offloaded algorithm vs. nonblocking algorithm, with communication processing highlighted]
© 2014 Mellanox Technologies 23
Network Managed Multi-stage Communication
Key Ideas
• Create a local description of the communication pattern
• Pass the description to the communication subsystem
• Manage the communication operations on the network, freeing the CPU to do meaningful computation
• Check for full-operation completion
Current Assumptions
• Data delivery is detected by new Completion Queue Events
• Use Completion Queue to identify the data source
• Completion order is used to associate data with a specific operation
• Use RDMA write with immediate data to generate Completion Queue events (a minimal posting sketch follows)
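The last bullet can be made concrete with standard libibverbs calls. The sketch below posts an RDMA write with immediate data so the responder's completion queue sees an event carrying the source's identity; the QP, MR, remote address, rkey, and the immediate value are placeholders:

/* Sketch: RDMA WRITE with immediate so the target's CQ observes delivery.
 * qp, mr, remote_addr, rkey and the immediate value are placeholders. */
#include <infiniband/verbs.h>
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

static int post_rdma_write_with_imm(struct ibv_qp *qp, struct ibv_mr *mr,
                                    void *local_buf, size_t len,
                                    uint64_t remote_addr, uint32_t rkey,
                                    uint32_t source_rank)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE_WITH_IMM;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.imm_data            = htonl(source_rank); /* lets the responder identify the source */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}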
© 2014 Mellanox Technologies 24
Key New Features
New QP trait - Managed QP: WQEs on such a QP must be enabled by WQEs from other QPs
Synchronization primitives:
• Wait work queue entry: waits until a specified completion queue reaches a specified producer-index value
• Enable tasks: a WQE on one QP can “enable” a WQE on a second QP
Submit lists of tasks to multiple QPs in a single post - sufficient to describe chained operations (such as collective communication)
Can set up a special completion queue to monitor list completion (request a CQE from the relevant task)
© 2014 Mellanox Technologies 25
Setting up CORE-Direct QPs
Create QPs with the ability to use the CORE-Direct primitives
Decide whether managed QPs will be used; if so, create a QP that will take the enable tasks. Most likely this is a centralized resource handling both the enable and the wait tasks
Decide on a Completion Queue strategy
Set up all needed QPs
© 2014 Mellanox Technologies 26
Initiating CORE-Direct Communication
A task list is created; each task specifies:
Target QP for the task
Operation: send / wait / enable
For wait, the number of completions to wait for
• The number of completions is specified relative to the beginning of the task list
• The number of completions can be positive, zero, or negative (wait on previously posted tasks)
For enable, the number of send tasks on the target QP to enable
A hypothetical task list for rank 0 of the four-process barrier is sketched below.
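As a purely illustrative sketch (plain C data structures, not the actual verbs interface), the task list below describes rank 0 of the four-process recursive-doubling barrier shown later in the deck: a send to rank 1, a wait for rank 1's message, then an enabled send to rank 2 followed by a wait for rank 2's message:

/* Hypothetical task-list description (illustration only, not the real
 * verbs API): rank 0 of a 4-process recursive-doubling barrier. */
enum task_op { TASK_SEND, TASK_WAIT, TASK_ENABLE };

struct task {
    enum task_op op;
    int          target_qp;  /* QP (or CQ, for waits) the task applies to     */
    int          count;      /* waits: completions to wait for, relative to
                                the start of the list (may be <= 0);
                                enables: number of send tasks to enable       */
};

/* Step 1: exchange with rank 1; step 2: exchange with rank 2.
 * The step-2 send is released only after the step-1 wait completes. */
static const struct task rank0_barrier[] = {
    { TASK_SEND,   /* qp_to_rank1 */ 1, 0 },
    { TASK_WAIT,   /* recv_cq     */ 0, 1 },  /* wait for rank 1's message */
    { TASK_ENABLE, /* qp_to_rank2 */ 2, 1 },  /* release the next send     */
    { TASK_SEND,   /* qp_to_rank2 */ 2, 0 },
    { TASK_WAIT,   /* recv_cq     */ 0, 1 },  /* wait for rank 2's message */
};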
© 2014 Mellanox Technologies 27
Posting of Tasks
© 2014 Mellanox Technologies 28
CORE-Direct Task-List Completion
Can specify which task will generate a completion in the “collective” completion queue
Single CQ signals full list (collective) completion
CPU is not needed for progress
© 2014 Mellanox Technologies 29
Example – Four Process Recursive Doubling
[Diagram: four processes (1-4) exchange pairwise over two recursive-doubling steps: step 1 pairs neighboring processes, step 2 pairs processes two apart]
© 2014 Mellanox Technologies 30
Four Process Barrier Example – Using Managed Queues – Rank 0
© 2014 Mellanox Technologies 31
Four Process Barrier Example – No Managed Queues – Rank 0
© 2014 Mellanox Technologies 32
User-Mode Memory Registration
© 2014 Mellanox Technologies 33
Key Features
Supports combining separately registered contiguous memory regions into a single memory region; H/W treats them as one contiguous region (and handles the underlying non-contiguity)
For a given memory region, supports non-contiguous access to memory using a regular structure representation: base pointer, element length, stride, repeat count (see the descriptor sketch below)
• Can combine these from multiple different memory keys
Memory descriptors are created by posting WQEs that fill in the memory key
Supports local and remote non-contiguous memory access
• Eliminates the need for some memory copies
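To make the regular-structure representation concrete, here is a hypothetical descriptor together with the address arithmetic it implies; the names and fields are illustrative assumptions, not the UMR verbs API:

/* Hypothetical strided-access descriptor (illustration only): base pointer,
 * element length, stride, repeat count. */
#include <stddef.h>
#include <stdint.h>

struct strided_desc {
    uint8_t *base;      /* base pointer of the registered region        */
    size_t   elem_len;  /* bytes accessed per element                   */
    size_t   stride;    /* distance between consecutive elements        */
    size_t   repeat;    /* number of elements                           */
};

/* Address of byte `i` within the logically contiguous view that the HCA
 * would present for this descriptor. */
static inline uint8_t *strided_addr(const struct strided_desc *d, size_t i)
{
    return d->base + (i / d->elem_len) * d->stride + (i % d->elem_len);
}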
© 2014 Mellanox Technologies 34
Combining Contiguous Memory Regions
© 2014 Mellanox Technologies 35
Non-Contiguous Memory Access – Regular Access
© 2014 Mellanox Technologies 36
On-Demand Paging
© 2014 Mellanox Technologies 37
Memory Registration
Apps register Memory Regions (MRs) for IO
• Referenced memory must be part of process address
space at registration time
• A memory key is returned to identify the MR
Registration operation
• Pins down the MR
• Hands off the virtual-to-physical mapping to HW (a minimal registration sketch follows the diagram below)
[Diagram: application, user-space driver stack, kernel driver stack, and HW; the HW holds the MR, SQ, RQ, CQ, and the memory key]
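A minimal registration sketch using the standard verbs call; the protection domain, buffer size, and access flags are placeholders. It pins the buffer and returns the memory region whose lkey/rkey later work requests reference:

/* Minimal MR registration sketch: pins `buf` and yields lkey/rkey. */
#include <infiniband/verbs.h>
#include <stdlib.h>

struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len, void **buf_out)
{
    void *buf = malloc(len);   /* must be in the address space at registration time */
    if (!buf)
        return NULL;

    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        free(buf);
        return NULL;
    }
    *buf_out = buf;            /* mr->lkey / mr->rkey identify the MR */
    return mr;
}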
© 2014 Mellanox Technologies 38
Memory Registration – continued
Fast path
• Applications post IO operations directly to HCA
• HCA accesses memory using the translations
referenced by the memory key
[Diagram: as above; on the fast path the application posts to the SQ/RQ and the HCA uses the memory key to access the MR directly, without kernel involvement]
Wow!!! But…
© 2014 Mellanox Technologies 39
Challenges
The size of registered memory must fit in physical memory
Applications must have memory-locking privileges
Continuously synchronizing the translation tables between the address space and the HCA is hard
• Address space changes (malloc, mmap, stack)
• NUMA migration
• fork()
Registration is a costly operation
• No locality of reference
© 2014 Mellanox Technologies 40
Achieving High Performance
Requires careful design
Dynamic registration
• Naïve approach induces significant overheads
• Pin-down cache logic is complex and not complete (a naive sketch follows this list)
Pinned bounce buffers
• Application level memory management
• Copying adds overhead
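To show why the pin-down cache is only a partial answer, here is a deliberately naive sketch of the idea; a real implementation must also invalidate entries when memory is freed, remapped, or returned to the OS, which is where the complexity and incompleteness come from:

/* Naive pin-down (registration) cache sketch: illustration only. */
#include <infiniband/verbs.h>
#include <stdlib.h>

struct cache_entry {
    char               *addr;
    size_t              len;
    struct ibv_mr      *mr;
    struct cache_entry *next;
};

static struct cache_entry *cache_head;

struct ibv_mr *cached_reg_mr(struct ibv_pd *pd, void *addr, size_t len)
{
    /* Hit: an existing registration already covers [addr, addr+len). */
    for (struct cache_entry *e = cache_head; e; e = e->next)
        if ((char *)addr >= e->addr &&
            (char *)addr + len <= e->addr + e->len)
            return e->mr;

    /* Miss: register (pin) the region and remember it. */
    struct ibv_mr *mr = ibv_reg_mr(pd, addr, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr)
        return NULL;

    struct cache_entry *e = malloc(sizeof(*e));
    if (e) {
        e->addr = addr; e->len = len; e->mr = mr;
        e->next = cache_head; cache_head = e;
    }
    return mr;
}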
© 2014 Mellanox Technologies 41
On Demand Paging
MR pages are never pinned by the OS
• Paged in when HCA needs them
• Paged out when reclaimed by the OS
HCA translation tables may contain non-present pages
• Initially, a new MR is created with non-present pages
• Virtual memory mappings don’t necessarily exist
© 2014 Mellanox Technologies 42
Semantics
ODP memory registration
• Specify the IBV_ACCESS_ON_DEMAND access flag (see the registration sketch after this list)
Work request processing
• WQEs in HW ownership must reference mapped memory
- From Post_Send()/Recv() until PollCQ()
• RDMA operations must target mapped memory
• Access attempts to unmapped memory trigger an error
Transport
• RC semantics unchanged
• UD responder drops packets while page faults are resolved
- Standard semantics cannot be achieved unless wire is back-pressured
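A minimal ODP registration sketch, assuming a verbs stack that exposes the IBV_ACCESS_ON_DEMAND flag named above (at the time of this deck the feature shipped through Mellanox's experimental verbs, and checking the device's ODP capabilities first is advisable). The PD, buffer, and size are placeholders:

/* Sketch: registering an ODP memory region. Pages are not pinned up front;
 * the HCA faults them in on first access and the OS may reclaim them. */
#include <infiniband/verbs.h>

struct ibv_mr *register_odp(struct ibv_pd *pd, void *buf, size_t len)
{
    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_ON_DEMAND |
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE);
}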
© 2014 Mellanox Technologies 43
Execution Time Breakdown (Send Requestor)
Phase             4K page fault (135 us total)    4M page fault (1 ms total)
Schedule in       2%                              0%
WQE read          36%                             7%
Get User Pages    4%                              83%
PTE update        5%                              1%
TLB flush         16%                             2%
QP resume         37%                             7%
Misc              0%                              0%
© 2014 Mellanox Technologies 44
Mellanox ScalableSHMEM
© 2014 Mellanox Technologies 45
Mellanox’s OpenSHMEM Implementation
Implemented within the context of the Open MPI project
Exploits Open MPI's component architecture, re-using components used for the MPI implementation
Adds OpenSHMEM specific components
Uses InfiniBand-optimized point-to-point and collective communication modules
• Hardware Multicast
• CORE-Direct
• Dynamically Connected Transport
• Enhanced Hardware scatter/gather (UMR) – coming soon
• On Demand Paging – coming soon
© 2014 Mellanox Technologies 46
Leveraging the work already done for the MPI implementation
Similarities
• Atomics, collective operations
• One-sided operations (put/get)
• Job start and runtime support (mapping/binding/…)
Differences
• SPMD
• No communicators (yet)
• No user-defined data types
• Limited set of collectives, and slightly different semantics (Barrier)
• Applications can put/get data from a pre-allocated symmetric heap or static variables (see the example below)
• No file I/O support
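A minimal OpenSHMEM example of the put-to-symmetric-heap model mentioned above, written against the 1.0-era API (start_pes/shmalloc) current when this deck was given; the neighbor-exchange pattern itself is only an illustration:

/* Minimal OpenSHMEM sketch: each PE puts its rank into a symmetric-heap
 * array on its right neighbor, then everyone synchronizes and prints. */
#include <shmem.h>
#include <stdio.h>

int main(void)
{
    start_pes(0);
    int me   = _my_pe();
    int npes = _num_pes();

    int *dst = (int *)shmalloc(sizeof(int));  /* symmetric heap allocation */
    int  src = me;

    shmem_int_put(dst, &src, 1, (me + 1) % npes);  /* one-sided put        */
    shmem_barrier_all();                           /* complete + synchronize */

    printf("PE %d received %d\n", me, *dst);
    shfree(dst);
    return 0;
}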
© 2014 Mellanox Technologies 47
OMPI + OSHMEM
Many OMPI frameworks reused (runtime, platform support, job start,
BTL, BML, MTL, profiling, autotools)
OSHMEM-specific frameworks added, keeping the MCA plugin architecture (SCOLL, SPML, atomics, synchronization and ordering enforcement)
OSHMEM supports Mellanox p2p and collective accelerators (MXM, FCA) as well as OMPI-provided transports (TCP, OpenIB, Portals, …)
© 2014 Mellanox Technologies 48
SHMEM Implementation Architecture
© 2014 Mellanox Technologies 49
SHMEM Put Latency
[Chart: put latency (usec), y-axis 1.4 to 2, vs. data size (1 to 4096 bytes) for ConnectX-3 RC, Connect-IB RC, ConnectX-3 UD, and Connect-IB UD]
© 2014 Mellanox Technologies 50
SHMEM Put Latency
[Chart: put latency (usec) vs. data size (8 KB to 1 MB) for ConnectX-3 RC, Connect-IB RC, ConnectX-3 UD, and Connect-IB UD]
© 2014 Mellanox Technologies 51
SHMEM 8 Byte Message Rate
[Chart: message rate (million messages per second) vs. data size (1 to 8192 bytes) for ConnectX-3 RC, Connect-IB RC, ConnectX-3 UD, and Connect-IB UD]
© 2014 Mellanox Technologies 52
SHMEM Barrier - Latency
[Chart: barrier latency (usec) vs. number of hosts (2 to 63, 16 processes per node) for ConnectX-3 and Connect-IB]
© 2014 Mellanox Technologies 53
SHMEM All-to-All Benchmark (Message Size – 4096 Bytes)
[Chart: bandwidth (MBps per PE) vs. number of hosts (8 to 63, 16 PEs per node) for ConnectX-3 and Connect-IB]
© 2014 Mellanox Technologies 54
SHMEM Atomic Updates
[Chart: atomic update rate (million ops per second) for Connect-IB UD and ConnectX-3 UD]
For more information: [email protected]