COSC 6374
Parallel Computation
Remote Direct Memory Access
Edgar Gabriel
Fall 2015
Communication Models
[Figure: three communication models between processes P0 and P1, holding variables A and B:
– Message Passing Model: explicit send on P0 and matching receive on P1
– Remote Memory Access: P0 puts A directly into B on P1
– Shared Memory Model: direct assignment A = B]
Data Movement
Message Passing Model:
• Two-sided communication
[Figure: data travels from memory through CPU and NIC on one node, across the network, and through NIC and CPU into memory on the other node; both CPUs are involved]
Remote Memory Access:
• One-sided communication
[Figure: data travels from the memory of one node through the NICs directly into the memory of the other node; the remote CPU is not involved]
Remote Direct Memory Access
• Direct Memory Access (DMA) allows data to be sent
directly from an attached device to the memory on the
computer's motherboard.
• The CPU is freed from involvement in the data
transfer, thus speeding up overall operation
• Remote Direct Memory Access (RDMA): two or more
computers communicate directly from the main
memory of one system to the main memory of another
One-sided communication in MPI
• MPI-2 defines one-sided communication:
– A process can put some data into the main memory of another process (MPI_Put)
– A process can get some data from the main memory of another process (MPI_Get)
– A process can perform some operations on a data item in the main memory of another process (MPI_Accumulate)
• Target process not actively involved in the
communication
RDMA in MPI
• Problems:
– How can a process define which part of its main memory
is available for RDMA?
– How can a process define when this part of the main
memory is available for RDMA?
– How can a process define who is allowed to access its
memory?
– How can a process define which elements in a remote
memory it wants to access?
The window concept of MPI-2 (I)
• An MPI_Win defines the group of processes allowed to access a certain memory area
• Arguments:
– base: Starting address for the public memory region
– size: size of the public memory area in bytes
– disp_unit: local unit size for displacements, in bytes (target displacements in RMA calls are scaled by this value)
– info: Hint to the MPI library on how the window will be used (e.g. only reading or only writing)
– comm: communicator defining the group of processes allowed to access the memory window
MPI_Win_create(void *base, MPI_Aint size, int
disp_unit, MPI_Info info, MPI_Comm comm,
MPI_Win *win);
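A minimal usage sketch (not from the original slides; buffer size and communicator are illustrative):

double buf[1000];                       /* local memory to be exposed for RMA */
MPI_Win win;

/* expose buf to all processes of MPI_COMM_WORLD; target displacements
   will be given in units of sizeof(double) */
MPI_Win_create ( buf, 1000*sizeof(double), sizeof(double),
                 MPI_INFO_NULL, MPI_COMM_WORLD, &win );
/* ... access/exposure epochs using MPI_Put/MPI_Get/MPI_Accumulate ... */
MPI_Win_free ( &win );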
The window concept of MPI-2 (II)
• Definition of a temporal window:
– Access Epoch: time slot in which a process accesses
remote memory of another process
– Exposure Epoch: time slot in which a process allows
access to a memory window by other processes
• Does a process have control over when other processes
access its memory window?
– yes: active target communication
– no: passive target communication
Active Target Communication (I)
• Synchronization of all operations within a window
– collective across all processes of win
– No difference between access and exposure epoch
– Starts or closes an access and exposure epoch
• Arguments
– assert: Hint to the library on the usage (default: 0)
MPI_Win_fence ( int assert, MPI_Win win);
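An illustrative sketch of a fence-based epoch (not from the original slides): every process adds its rank into a counter exposed by rank 0, using the MPI_Accumulate operation mentioned earlier; variable names are assumptions.

int myrank, counter = 0;
MPI_Win win;

MPI_Comm_rank ( MPI_COMM_WORLD, &myrank );
MPI_Win_create ( &counter, sizeof(int), sizeof(int),
                 MPI_INFO_NULL, MPI_COMM_WORLD, &win );

MPI_Win_fence ( 0, win );                    /* open access and exposure epoch */
MPI_Accumulate ( &myrank, 1, MPI_INT, 0, 0,
                 1, MPI_INT, MPI_SUM, win ); /* add my rank to the counter on rank 0 */
MPI_Win_fence ( 0, win );                    /* close epoch: operations are complete */

MPI_Win_free ( &win );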
Data exchange (I)
• A single process (the origin) specifies the data parameters
for both processes
• Put the data described by (oaddr, ocount, otype)
into the main memory of the process with rank rank, into the window win, at the position described by
(base+disp*disp_unit, tcount, ttype)
– base and disp_unit have been defined in MPI_Win_create
– Value of base and disp_unit not known by the
process calling MPI_Put!
MPI_Put (void *oaddr, int ocount, MPI_Datatype otype,
         int rank, MPI_Aint disp, int tcount,
         MPI_Datatype ttype, MPI_Win win);
Example: Ghost-cell update
Parallel matrix-vector multiply for band-matrices
[Figure: band matrix and the vectors x and rhs distributed row-wise across Process 0 and Process 1]
• Process 0 needs x3
• Process 1 needs x2
Example: Ghost-cell update (II)
• Ghost cells: (read-only) copy of elements held by another process
• Ghost-cells for 2-D matrices: additional row of data
[Figure: 1-D domain decomposition across Process 0, Process 1, and Process 2; each process holds a block of nxlocal × ny data points plus one ghost row (e.g. x2, x3, x4) copied from the neighboring process]
Example: Ghost-cell update (III)
• Data structure: u[i][j] is stored in a matrix
• Extent of variable u: u[nxlocal+2][ny]
• with u[1:nxlocal][0:ny-1] containing the local data
– nxlocal: number of local data points in x direction
– ny: number of data points in y direction
Example: Ghost-cell update (IV)
MPI_Win_create ( u, (nxlocal+2)*ny*sizeof(double), 1,
                 MPI_INFO_NULL, MPI_COMM_WORLD, &win );
…
MPI_Win_fence ( 0, win );
/* put my first local row into the ghost row u[nxlocal+1][*] of rank-1 */
MPI_Put ( &u[1][0], ny, MPI_DOUBLE, rank-1,
          (nxlocal+1)*ny*sizeof(double), ny, MPI_DOUBLE, win );
/* put my last local row into the ghost row u[0][*] of rank+1 */
MPI_Put ( &u[nxlocal][0], ny, MPI_DOUBLE, rank+1,
          0, ny, MPI_DOUBLE, win );
MPI_Win_fence ( 0, win );
…
MPI_Win_free ( &win );
Comments to the example
• Modifications to data items might only become visible
after the corresponding epochs have been closed
– No guarantee whether a data item is actually transferred during MPI_Put or during MPI_Win_fence
• If multiple processes modify the very same memory
address at the very same process, no guarantees are
given on which data item will be visible.
– Responsibility of the user to get it right
Passive Target Communication
• MPI_Win_lock starts an access epoch to access the main
memory of the process with rank rank
• All RDMA operations between a lock/unlock appear atomic
• lock_type: MPI_LOCK_EXCLUSIVE or MPI_LOCK_SHARED
• Updates to the local memory exposed through the MPI window should also be performed using MPI_Win_lock/MPI_Put
– Otherwise the ordering between the local update and
concurrent RDMA accesses is undefined (race condition)
MPI_Win_lock (int lock_type, int rank, int assert,
MPI_Win win);
MPI_Win_unlock (int rank, MPI_Win win);
Example: Ghost-cell update (V)
MPI_Win_create ( u, (nxlocal+2)*ny*sizeof(double), 1,
                 MPI_INFO_NULL, MPI_COMM_WORLD, &win );
…
MPI_Win_lock ( MPI_LOCK_EXCLUSIVE, rank-1, 0, win);
MPI_Put ( &u[1][0], ny, MPI_DOUBLE, rank-1,
(nxlocal+1)*ny*sizeof(double), ny,
MPI_DOUBLE, win);
MPI_Win_unlock( rank-1, win);
MPI_Win_lock ( MPI_LOCK_EXCLUSIVE, rank+1, 0, win);
MPI_Put ( &u[nxlocal][0], ny, MPI_DOUBLE, rank+1, 0, ny, MPI_DOUBLE, win);
MPI_Win_unlock ( rank+1, win);
One-sided vs. Two-sided communication
• One-sided communication does not need
– message matching
– unexpected message queues
• Only one processor is actively involved in the data transfer
→ potentially faster!
• One-sided communication in MPI can potentially optimize
– multiple transactions
– between multiple processes
Limitations of the MPI-2 model
• Synchronization costs (e.g. MPI_Win_fence) can be
significant
• Static model
– Size of memory window can not be altered after creating an MPI_Win
– Difficult to support dynamic data structures such as a
linked list
• Passive target model has limited usability
– But that is what most other RDMA libraries focus on
• In MPI-3:
– Introduction of dynamic windows
– Extended functionality for passive target operations
Use case: distributed linked list
• A linked list maintained across multiple processes
– E.g. after a global sort operation of all elements
– E.g. having fixed rules for the keys
rank 0: keys which start with ‘a’ to ‘d’
rank 1: keys which start with ‘e’ to ‘h’ …
[Figure: linked-list elements distributed across Rank 0, Rank 1, and Rank 2]
Use case: Distributed linked list
typedef struct{
char key[MAX_KEY_SIZE];
char value[MAX_VALUE_SIZE];
MPI_Aint next_disp;
int next_rank;
void *next_local; // next local element
} ListElem;
// Create an MPI data type describing this
// structure using MPI_Type_create_struct. Not shown
// here for brevity
next_disp and next_rank together are equivalent to the
next pointer in a non-distributed linked list
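A possible construction of the datatype (a sketch, not from the original slides; it assumes the ListElem layout above and skips the local-only pointer next_local while resizing the extent to sizeof(ListElem)):

MPI_Datatype tmp, ListElem_type;
ListElem     dummy;
int          blocklens[4] = { MAX_KEY_SIZE, MAX_VALUE_SIZE, 1, 1 };
MPI_Datatype types[4]     = { MPI_CHAR, MPI_CHAR, MPI_AINT, MPI_INT };
MPI_Aint     base, disps[4];

MPI_Get_address ( &dummy,           &base );
MPI_Get_address ( &dummy.key,       &disps[0] );
MPI_Get_address ( &dummy.value,     &disps[1] );
MPI_Get_address ( &dummy.next_disp, &disps[2] );
MPI_Get_address ( &dummy.next_rank, &disps[3] );
for ( int i = 0; i < 4; i++ ) disps[i] -= base;

MPI_Type_create_struct ( 4, blocklens, disps, types, &tmp );
MPI_Type_create_resized ( tmp, 0, sizeof(ListElem), &ListElem_type );
MPI_Type_commit ( &ListElem_type );
MPI_Type_free ( &tmp );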
Traversing a distributed linked list

ListElem local_copy, *current;
ListElem *head;     // assumed to be already set
int found = 0;      // the key to search for is assumed to be set as well

current = head;
MPI_Win_lock_all ( 0, win );
while ( !found ) {
    if ( current->next_rank != myrank ) {
        // next element lives on another process: fetch a copy
        MPI_Get ( &local_copy, 1, ListElem_type,
                  current->next_rank, current->next_disp,
                  1, ListElem_type, win );
        MPI_Win_flush ( current->next_rank, win );
        current = &local_copy;
    } else {
        // next element is local: follow the local pointer
        current = current->next_local;
    }
    if ( strcmp(current->key, key) == 0 )
        found = 1;
}
MPI_Win_unlock_all ( win );

• MPI_Win_lock_all: obtain a shared (read-only) lock on
all processes that are part of win
• MPI_Win_flush: enforce the completion of all
pending operations to a process
without having to release the lock(s)
Inserting elements into a linked list
• Assuming that only a local process is allowed to insert an
element (e.g. after a global sort operation)
– Remote processes only allowed to read elements on other
processes
• Requires dynamically allocating memory and extending a
memory region
• A dynamic window defines only the participating group of processes
– More than one memory region can be assigned to a single window
MPI_Win_create_dynamic( MPI_Info info, MPI_Comm comm,
MPI_Win *win);
MPI_Win_attach (MPI_Win win, void *base, MPI_Aint size);
Inserting elements into a linked list (II)

// create window instance once
MPI_Win_create_dynamic ( MPI_INFO_NULL, comm, &win );

// insert each element into the memory window
t = (ListElem *) malloc ( sizeof(ListElem) );
strncpy ( t->key,   key,   MAX_KEY_SIZE );
strncpy ( t->value, value, MAX_VALUE_SIZE );
current = find_prev_element ( head, key, value );
t2 = current->next_local;
current->next_local = t;
t->next_local = t2;
MPI_Win_attach ( win, t, sizeof(ListElem) );
…
// add another element
t = (ListElem *) malloc ( sizeof(ListElem) );
…
MPI_Win_attach ( win, t, sizeof(ListElem) );
MPI_Barrier ( comm );

Similarly for updating next_rank and
next_disp on current and t
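One possible way that update could look (a sketch, not from the original slides; it assumes that, for a dynamic window, target displacements are the absolute addresses of the attached memory as returned by MPI_Get_address):

MPI_Aint disp;
MPI_Get_address ( t, &disp );
current->next_disp = disp;      /* remote "pointer" to t = (myrank, disp) */
current->next_rank = myrank;
MPI_Get_address ( t2, &disp );  /* t's successor t2 is also a local element */
t->next_disp = disp;
t->next_rank = myrank;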