Parallel Computing—Higher-level concepts of MPI
Transcript of Parallel Computing—Higher-level concepts of MPI
1
Parallel Computing—Higher-level concepts of MPI
2
MPI—Presentation Outline
• Communicators, Groups, and Contexts
• Collective Communication
• Derived Datatypes
• Virtual Topologies
3
Communicators, groups, and contexts
• MPI provides higher-level abstractions for building parallel libraries:
  • Safe communication spaces
  • Group scope for collective operations
  • Process naming
• Communicators + groups provide:
  • Process naming (instead of IP address + port)
  • Group scope for collective operations
• Contexts provide:
  • Safe communication
4
What are communicators?
• A data structure that contains a group (and thus processes)
• Why it is useful:
  • Process naming: ranks are names for application programmers
  • Easier than IP address + port
  • Group communication as well as point-to-point communication
• There are two types of communicators:
  • Intracommunicators: communication within a group
  • Intercommunicators: communication between two groups (the groups must be disjoint)
5
What are contexts?
• A unique integer:
  • An additional tag on the messages
• Each communicator has a distinct context that provides a safe communication universe:
  • A context is agreed upon by all processes when a communicator is built
• Intracommunicators have two contexts:
  • One for point-to-point communication
  • One for collective communication
• Intercommunicators have two contexts:
  • Explained in the coming slides
6
Intracommunicators
• Contain one group
• Allow point-to-point and collective communication between processes within this group
• Communicators can only be built from existing communicators:
  • MPI.COMM_WORLD is the first intracommunicator to start with
• Creating an intracommunicator is a collective operation:
  • All processes in the existing communicator must call it in order for it to execute successfully
• Intracommunicators can have process topologies:
  • Cartesian
  • Graph
7
Creating new Intracommunicators
[Figure: COMM_WORLD contains ranks 0, 1, 2, 3. A new communicator, newComm, is created from processes 0 and 3 only; they become ranks 0 and 1 in newComm.]
MPI.Init(args);
int[] incl1 = {0, 3};
Group grp1 = MPI.COMM_WORLD.Group();
Group grp2 = grp1.Incl(incl1);
Intracomm newComm = MPI.COMM_WORLD.Create(grp2);
8
How do processes agree on the context for a new intracommunicator?
• Each process has a static context variable that is incremented whenever an Intracomm is created
• Each process increments this variable and sends it to all the other processes
• The maximum integer is agreed upon as the context
• An existing communicator's context is used for sending the "context agreement" messages:
  • What about MPI.COMM_WORLD?
  • It is safe anyway, because it is the first intracommunicator and there is no chance of conflicts
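The max-agreement step described above can be sketched outside MPI. This is a plain-Java illustration, not mpiJava code; the counter values are made up for the example:

```java
import java.util.Arrays;

public class ContextAgreement {
    // Each process proposes its incremented context counter; the new
    // context is the maximum of all proposals, as described above.
    static int agreeOnContext(int[] proposals) {
        return Arrays.stream(proposals).max().getAsInt();
    }

    public static void main(String[] args) {
        // Hypothetical counters from four processes after incrementing.
        int[] proposals = {5, 7, 6, 7};
        System.out.println(agreeOnContext(proposals)); // prints 7
    }
}
```

Taking the maximum means every process lands on the same context even if their local counters have drifted apart.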
9
Intercommunicators
• Contain two groups:
  • A local group (the local process is in this group)
  • A remote group
  • The two groups must be disjoint
• Only allow point-to-point communication
• Intercommunicators cannot have process topologies
• Next slide: how to create intercommunicators
10
MPI.Init(args);
int[] incl2 = {0, 2, 4, 6};
int[] incl3 = {1, 3, 5, 7};
Group grp1 = MPI.COMM_WORLD.Group();
int rank = MPI.COMM_WORLD.Rank();
Group grp2 = grp1.Incl(incl2);
Group grp3 = grp1.Incl(incl3);
Intracomm comm1 = MPI.COMM_WORLD.Create(grp2);
Intracomm comm2 = MPI.COMM_WORLD.Create(grp3);
Intercomm icomm = null;

if (rank == 0 || rank == 2 || rank == 4 || rank == 6) {
    icomm = MPI.COMM_WORLD.Create_intercomm(comm1, 0, 1, 56);
} else {
    icomm = MPI.COMM_WORLD.Create_intercomm(comm2, 1, 0, 56);
}
[Figure: two intracommunicators built from COMM_WORLD. Comm1 holds processes a, b, c, d as ranks 0-3; Comm2 holds processes e, f, g, h as ranks 0-3. From each side's point of view, its own communicator is the local group of the new intercommunicator and the other is the remote group.]
Creating intercommunicators
11
Creating intercomms …
• The arguments to the Create_intercomm method:
  • Local communicator (which contains the current process)
  • local_leader (a rank)
  • remote_leader (a rank)
  • A tag for the messages sent during context selection
• But the groups are disjoint, so how can they communicate?
  • That is where a peer communicator is required
  • At least the local_leader and remote_leader must be part of this peer communicator
  • In the last figure, MPI.COMM_WORLD is the peer communicator, and processes 0 and 1 (ranks relative to MPI.COMM_WORLD) are the leaders of their respective groups
12
Selecting contexts for intercomms
• An intercommunicator has two contexts:
  • send_context (used for sending messages)
  • recv_context (used for receiving messages)
• In an intercommunicator, processes in the local group can only send messages to the remote group
• How is the context agreed upon?
  • Each group decides its own context
  • The leaders (local and remote) exchange the contexts agreed upon
  • The greater of the two is selected as the context
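The two-step selection above (agree within each group, then let the leaders pick the greater value) can be sketched in plain Java; the counter values below are made up for illustration:

```java
import java.util.Arrays;

public class IntercommContext {
    // Step 1: each group agrees internally on the max of its members' counters.
    static int groupAgree(int[] counters) {
        return Arrays.stream(counters).max().getAsInt();
    }

    // Step 2: the leaders exchange the group results; the greater one wins,
    // so both groups end up with the same final context.
    static int selectContext(int localContext, int remoteContext) {
        return Math.max(localContext, remoteContext);
    }

    public static void main(String[] args) {
        int a = groupAgree(new int[]{4, 9, 7, 6});   // group A proposes 9
        int b = groupAgree(new int[]{8, 5, 12, 10}); // group B proposes 12
        System.out.println(selectContext(a, b));     // prints 12
    }
}
```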
13
[Figure: COMM_WORLD with eight processes (0-7) split into two groups. Group1 holds the even-ranked processes and Group2 the odd-ranked ones; within each group the four members are renumbered with ranks 0-3.]
14
MPI—Presentation Outline
• Point to Point Communication
• Communicators, Groups, and Contexts
• Collective Communication
• Derived Datatypes
• Virtual Topologies
15
Collective communications
• Provided as a convenience for application developers:
  • Save significant development time
  • Efficient algorithms may be used
  • Stable (tested)
• Built on top of point-to-point communication
• These operations include:
  • Broadcast, Barrier, Reduce, Allreduce, Alltoall, Scatter, Scan, Allgather
  • Versions that allow displacements between the data
16
[Figure: broadcast, scatter, gather, allgather, and alltoall; image from the MPI standard document]
17
Reduce collective operations
[Figure: five processes hold the values 1, 2, 3, 4, 5. With MPI.SUM, reduce leaves the result 15 at the root process only, while allreduce leaves 15 at every process. Predefined operations: MPI.PROD, MPI.SUM, MPI.MIN, MPI.MAX, MPI.LAND, MPI.BAND, MPI.LOR, MPI.BOR, MPI.LXOR, MPI.BXOR, MPI.MINLOC, MPI.MAXLOC]
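The reduce/allreduce behaviour in the figure can be imitated in plain Java with no MPI involved. This sketch assumes the SUM operation and one integer contribution per process:

```java
import java.util.Arrays;

public class ReduceSketch {
    // reduce: combine everyone's value; only the root gets the result.
    // Non-root slots are left at 0 to mean "no result delivered here".
    static int[] reduce(int[] contributions, int root) {
        int[] results = new int[contributions.length];
        results[root] = Arrays.stream(contributions).sum();
        return results;
    }

    // allreduce: every process receives the combined result.
    static int[] allreduce(int[] contributions) {
        int sum = Arrays.stream(contributions).sum();
        int[] results = new int[contributions.length];
        Arrays.fill(results, sum);
        return results;
    }

    public static void main(String[] args) {
        int[] values = {1, 2, 3, 4, 5};
        System.out.println(Arrays.toString(reduce(values, 0))); // [15, 0, 0, 0, 0]
        System.out.println(Arrays.toString(allreduce(values))); // [15, 15, 15, 15, 15]
    }
}
```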
18
A Typical Barrier() Implementation
[Figure: timeline of eight processes (ranks 0-7, Group A) exchanging messages over several rounds]
• Eight processes, thus forming only one group
• Each process exchanges an integer 4 times
• Overlaps communications well
19
Intracomm.Bcast( … )
• Sends data from one process to all the other processes
• Code from adlib:
  • A communication library for HPJava
  • The current implementation is based on an n-ary tree:
    • Limitation: broadcasts only from rank=0
    • Generated dynamically
  • Cost: O(log2(N))
• MPICH 1.2.5 uses a linear algorithm:
  • Cost: O(N)
• MPICH2 has much-improved algorithms
• LAM/MPI uses n-ary trees:
  • Limitation: broadcasts only from rank=0
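The O(log2(N)) cost of a tree broadcast can be seen with a small simulation. This is a binary-tree sketch (one forward per holder per round), not the adlib or LAM/MPI implementation:

```java
public class TreeBroadcastSketch {
    // Simulates a tree broadcast from rank 0: in each round, every process
    // that already has the data forwards it to one process that does not,
    // so the number of holders doubles until all N are reached.
    static int roundsToReachAll(int n) {
        boolean[] hasData = new boolean[n];
        hasData[0] = true;
        int reached = 1;
        int rounds = 0;
        while (reached < n) {
            int senders = reached; // holders at the start of this round
            for (int i = 0; i < n && senders > 0; i++) {
                if (!hasData[i]) {
                    hasData[i] = true;
                    reached++;
                    senders--;
                }
            }
            rounds++;
        }
        return rounds; // equals ceil(log2(n))
    }

    public static void main(String[] args) {
        System.out.println(roundsToReachAll(8));  // prints 3
        System.out.println(roundsToReachAll(16)); // prints 4
    }
}
```

Contrast with the linear algorithm, where the root sends N-1 messages one after another.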
20
A Typical Broadcast Implementation
[Figure: an n-ary broadcast tree rooted at rank 0, fanning out to the remaining ranks]
21
MPI—Presentation Outline
• Point to Point Communication• Communicators, Groups, and Contexts• Collective Communication• Derived Datatypes• Virtual Topologies
22
MPI Datatypes
• What kind (type) of data can be sent using MPI messaging?
• Basically two types:
  • Basic (primitive) datatypes
  • Derived datatypes
23
MPI Basic Datatypes
• MPI_CHAR
• MPI_SHORT
• MPI_INT
• MPI_LONG
• MPI_UNSIGNED_CHAR
• MPI_UNSIGNED_SHORT
• MPI_UNSIGNED_LONG
• MPI_UNSIGNED
• MPI_FLOAT
• MPI_DOUBLE
• MPI_LONG_DOUBLE
• MPI_BYTE
24
Derived Datatypes
• Besides the basic datatypes, it is possible to communicate heterogeneous and non-contiguous data:
  • Contiguous
  • Indexed
  • Vector
  • Struct
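As an illustration of what a Vector derived type describes (this is plain Java, not the MPI datatype API): a count of blocks, a block length, and a stride select regularly spaced elements from a buffer.

```java
import java.util.ArrayList;
import java.util.List;

public class VectorTypeSketch {
    // Selects `count` blocks of `blockLength` elements, with the starts of
    // consecutive blocks `stride` elements apart -- the access pattern an
    // MPI vector datatype describes.
    static List<Integer> vectorSelect(int[] buf, int count, int blockLength, int stride) {
        List<Integer> out = new ArrayList<>();
        for (int b = 0; b < count; b++) {
            for (int i = 0; i < blockLength; i++) {
                out.add(buf[b * stride + i]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A 3x4 matrix stored row-major; picking column 0 means
        // count=3, blockLength=1, stride=4.
        int[] matrix = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11};
        System.out.println(vectorSelect(matrix, 3, 1, 4)); // [0, 4, 8]
    }
}
```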
25
MPI—Presentation Outline
• Point to Point Communication
• Communicators, Groups, and Contexts
• Collective Communication
• Derived Datatypes
• Virtual Topologies
26
Virtual topologies
• Used to arrange processes in a geometric shape
• Virtual topologies have no necessary connection with the physical layout of the machines:
  • It is, however, possible to exploit the underlying machine architecture
• These virtual topologies can be assigned to the processes of an intracommunicator
• MPI provides:
  • Cartesian topology
  • Graph topology
27
Cartesian topology: mapping four processes onto a 2x2 topology
• Each process is assigned a coordinate:
  • Rank 0: (0,0)
  • Rank 1: (1,0)
  • Rank 2: (0,1)
  • Rank 3: (1,1)
• Uses:
  • Calculate a rank from a grid position
  • Calculate a grid position from a rank
  • Easier to locate the ranks of neighbours
  • Applications may have matching communication patterns:
    • Lots of messaging with immediate neighbours
[Figure: communicator Comm1 with ranks 0-3 laid out on a 2x2 grid with x- and y-axes: rank 0 at (0,0), rank 1 at (1,0), rank 2 at (0,1), rank 3 at (1,1)]
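The rank-to-coordinate conversions mentioned above can be sketched in plain Java. This assumes the mapping shown in the figure, where x varies fastest (rank 1 is (1,0), rank 2 is (0,1)):

```java
public class CartesianSketch {
    // Rank -> (x, y) on a grid that is dimX processes wide.
    static int[] coordsOf(int rank, int dimX) {
        return new int[]{rank % dimX, rank / dimX};
    }

    // (x, y) -> rank, inverse of the mapping above.
    static int rankOf(int x, int y, int dimX) {
        return y * dimX + x;
    }

    public static void main(String[] args) {
        int dimX = 2; // the 2x2 grid from the figure
        int[] c = coordsOf(2, dimX);
        System.out.println(c[0] + "," + c[1]); // prints 0,1
        System.out.println(rankOf(1, 1, dimX)); // prints 3
    }
}
```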
28
Periods in cartesian topology
• Axis 1 (the y-axis) is periodic:
  • Processes in the top and bottom rows have valid neighbours towards the top and bottom respectively (the grid wraps around)
• Axis 0 (the x-axis) is non-periodic:
  • Processes in the right and left columns have undefined neighbours towards the right and left respectively

periodicity[0] = false;
periodicity[1] = true;
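The effect of the periodicity flags can be sketched as a plain-Java neighbour lookup along one axis. This is an illustration, not the MPI shift API; -1 stands in for an undefined neighbour (MPI uses MPI_PROC_NULL):

```java
public class PeriodicNeighbourSketch {
    // Returns the neighbour's coordinate along one axis, or -1 when the
    // neighbour is undefined. `periodic` mirrors the periodicity[] flags.
    static int neighbour(int coord, int step, int dim, boolean periodic) {
        int next = coord + step;
        if (next >= 0 && next < dim) return next;
        if (periodic) return ((next % dim) + dim) % dim; // wrap around
        return -1; // off the edge of a non-periodic axis
    }

    public static void main(String[] args) {
        int dim = 2;
        // y-axis periodic: stepping up from the top row wraps to the bottom.
        System.out.println(neighbour(1, +1, dim, true));  // prints 0
        // x-axis non-periodic: stepping right from the last column is undefined.
        System.out.println(neighbour(1, +1, dim, false)); // prints -1
    }
}
```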
29
Graph topology
nnodes = 4;
index = {2, 3, 4, 6};
edges = {1, 3, 0, 3, 0, 2};
[Figure: the resulting graph on nodes 0-3, with edges 0-1, 0-3, and 2-3]
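The index/edges encoding above can be decoded in plain Java: index[i] is the cumulative neighbour count up to and including node i, and edges lists each node's neighbours in order.

```java
import java.util.ArrayList;
import java.util.List;

public class GraphTopologySketch {
    // Rebuilds adjacency lists from MPI graph-topology arrays.
    static List<List<Integer>> adjacency(int nnodes, int[] index, int[] edges) {
        List<List<Integer>> adj = new ArrayList<>();
        int start = 0;
        for (int node = 0; node < nnodes; node++) {
            List<Integer> neighbours = new ArrayList<>();
            for (int e = start; e < index[node]; e++) {
                neighbours.add(edges[e]);
            }
            adj.add(neighbours);
            start = index[node];
        }
        return adj;
    }

    public static void main(String[] args) {
        int[] index = {2, 3, 4, 6};
        int[] edges = {1, 3, 0, 3, 0, 2};
        System.out.println(adjacency(4, index, edges));
        // prints [[1, 3], [0], [3], [0, 2]]
    }
}
```

So node 0 neighbours 1 and 3, node 2 neighbours 3, and the lists are symmetric, as an undirected graph requires.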
30
Doing Matrix Multiplication using MPI
• Just to give you an idea of how MPI-based applications are designed …
31
Basically how it works:

    | 1 0 2 |   | 0 1 0 |   | 2 3 2 |
    | 2 1 0 | x | 0 0 1 | = | 0 2 1 |
    | 0 2 2 |   | 1 1 1 |   | 2 2 4 |
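Before distributing the work with MPI, the 3x3 example above can be checked with a plain serial multiplication:

```java
import java.util.Arrays;

public class MatMulCheck {
    // Plain serial matrix multiplication: result[i][j] is the dot product
    // of row i of m with column j of n.
    static int[][] multiply(int[][] m, int[][] n) {
        int rows = m.length, cols = n[0].length, inner = n.length;
        int[][] result = new int[rows][cols];
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                for (int k = 0; k < inner; k++)
                    result[i][j] += m[i][k] * n[k][j];
        return result;
    }

    public static void main(String[] args) {
        int[][] m = {{1, 0, 2}, {2, 1, 0}, {0, 2, 2}};
        int[][] n = {{0, 1, 0}, {0, 0, 1}, {1, 1, 1}};
        System.out.println(Arrays.deepToString(multiply(m, n)));
        // prints [[2, 3, 2], [0, 2, 1], [2, 2, 4]]
    }
}
```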
32
Matrix Multiplication MxN ..

int rank = MPI.COMM_WORLD.Rank();
int size = MPI.COMM_WORLD.Size();

if (master_mpi_process) {
    initialize matrices M and N
    for (int i = 1; i < size; i++) {
        send rows of matrix M to process `i`
    }
    broadcast matrix N to all non-zero processes
    for (int i = 1; i < size; i++) {
        receive rows of the resultant matrix from process `i`
    }
    .. print results ..
} else {
    receive rows of matrix M
    call broadcast to receive matrix N
    compute the matrix multiplication for the sub-matrix (done in parallel)
    send the resultant rows back to the master process
}
..
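The master's data decomposition in the pseudocode above (sending rows of M to each worker) can be sketched in plain Java. How rows are dealt out is a design choice; this hypothetical helper uses a round-robin split:

```java
import java.util.Arrays;

public class RowPartitionSketch {
    // Splits `totalRows` rows of M as evenly as possible among `workers`
    // worker processes; rowCounts[w] is how many rows worker w receives.
    static int[] rowCounts(int totalRows, int workers) {
        int[] counts = new int[workers];
        for (int r = 0; r < totalRows; r++) {
            counts[r % workers]++; // deal rows out round-robin
        }
        return counts;
    }

    public static void main(String[] args) {
        // e.g. a 10-row matrix split among 4 workers
        System.out.println(Arrays.toString(rowCounts(10, 4))); // prints [3, 3, 2, 2]
    }
}
```

An even split keeps the workers' compute times balanced, which matters because the master waits for the slowest worker before printing results.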