Transcript of Characterization of communication.ppt
Characterization of Communication in Distributed Memory Multiprocessors
Harry F. Jordan and Gita Alaghband, "Fundamentals of Parallel Processing"
1. Point-to-Point Communication

Characteristics of the transmission of data: initiation, synchronization, binding, buffering, and latency control.

o Initiation: By the sender; possibly by the receiver when a single programmer writes both processes
o Synchronization: Blocking or non-blocking send/receive
o Binding: Designating sender and receiver by process ID, by message channel, or by a (receiver process, tag) pair; the tag identifies the message's role in the computation, so addressing is by content (associative addressing)
o Buffering: Finite capacity limits the number of messages sent but not yet received
o Latency: Time from execution of the send operation until the message data arrives at the receiver

Delivery time for a message of length L bytes is roughly characterized by

  T = Ts + L·Tb    (1)

where Ts is the start-up time and Tb the transfer time per byte. When Ts >> Tb, very long messages are sent in shorter parts.
1. Point-to-Point Communication (cont.)

Summary of point-to-point communication characteristics:

Initiation:       Sender; receiver request
Synchronization:  Blocking/non-blocking send/receive

Binding:
  Type                         Associated operations
  (source ID, destination ID)  send(destination ID); receive(source ID)
  Channel number               open(channel); close(channel); send(channel); receive(channel)
  (tag, destination ID)        send(tag, destination ID); receive(tag)
1. Point-to-Point Communication (cont.)

Summary of point-to-point communication characteristics (continued):

Buffering:
  Type                                 Location                   Capacity
  One per source for each destination  Sender:   user space       Byte limit
  One per channel                                system space     Message limit
  One per tag for each destination               I/O processor
                                       Receiver: user space
                                                 system space
                                                 I/O processor

Latency:
  Time parameter   Effect of transfer through an additional buffer
  Start-up time    Adds to start-up time
  Time per byte    Adds to time per byte unless pipelined
1. Point-to-Point Communication (cont.)

Message latency, buffering, and non-blocking sends allow communication and computation to overlap:
o Sending information as early as possible maximizes parallel activity. If send is non-blocking, communication and computation can be overlapped; whether the receiver can proceed depends on the transmitted data arriving.
o To overlap effectively, the message must be sent as soon as possible, and the subsequent computation must be independent of the communication.
o A message in the user's space is copied to a buffer in system space, or in an I/O processor, when the sender may recompute the values used in the send; the sender can then start constructing a new message in the same area immediately.
1. Point-to-Point Communication (cont.)

o Each copy of message data between buffers adds time to the producer's send operation, to the message latency, and to the consumer's receive operation.
o The choice of buffering depends on the software overhead of send and receive and on the performance characteristics of the interconnection network.
o A factor of 2 speedup is the most that perfect overlap of communication and computation can obtain.
o Initiating a new send enables further computation in other processes; overlap of many simultaneous communications is possible when a processor supports multiple concurrent messages.
1. Point-to-Point Communication (cont.)

[Figure: (a) well-overlapped communication; (b) poorly overlapped communication]
2. Variable Classes in a Distributed Memory Program

o Concerned with the way variables are shared among processes. Each variable resides in some processor's local memory and is private to the process running on that processor.
o In SPMD programs, the same variable name may have a representative in each processor, and the representatives may be updated to the same value in all processors.
o Parallelism classes of variables: private, unique, cooperative update, replicated, and partitioned.
2. Variable Classes (cont.)

o Private variables: A single name refers to a different memory cell, and a possibly different value, in each processor.
o Unique simple variables: The variable and its value are defined in only one processor.
o Unique structured variables: Individual components are unique to a single processor.
o Cooperative update shared variables: A variable with a single value visible to all processors, represented by one cell in each processor's memory; the update is performed cooperatively, supported by high-level, and sometimes complex, communication operations.
2. Variable Classes (cont.)

o Shared variables: A value may be made available to many processors by redundant computation (e.g., a loop index); such a variable is replicated.
o Replicated variables: Take the same sequence of values in every processor.
o Partitioned variables: Arise, for example, when a loop does not specify combining elements from different rows in an arithmetic operation, so the rows can be partitioned over processes.
o Collective communication is used when a variable of one of the shared classes is updated. Example: broadcast allows a single processor to give a new value to a cooperative update variable.
2. Variable Classes (cont.)

real myC, myA, myB, tmpA, tmpB;
integer i, j, m, k, N;
myC := 0;
for k := 0 step 1 until N-1      /* Loop over inner product terms */
begin
  for m := 0 step 1 until N-1    /* Loop over receivers */
  begin
    if k != m then
    {
      if j = k then P(i, m) ! myA;   /* P(i, k) sends A[i, k] */
      if j = m then P(i, k) ? tmpA;  /* to P(i, m) for all i  */
      if i = k then P(m, j) ! myB;   /* P(k, j) sends B[k, j] */
      if i = m then P(k, j) ? tmpB;  /* to P(m, j) for all j  */
    }
  end
  if j = k then tmpA := myA;  /* Copy when sender and receiver would be the same */
  if i = k then tmpB := myB;
  myC := myC + tmpA * tmpB;
end

Distributed memory matrix multiplication using CSP blocking communication
2. Variable Classes (cont.)

if i = q and j = k then
  for m := 0 step 1 until N-1
    if m != k then send A to P(i, m);
if i = q and j != k then receive A from P(q, k);

Broadcast from P(q, k) to all P(q, m) for m != k
3. High-Level Communication Operations

o Broadcast assigns a new value to a cooperative update variable.
o Communication operations may combine values from different processes and distribute the result.
o Summation, or reduction, can combine a value from each process and either deliver the sum to a single root process or distribute it to all processes.
o A prefix computation across values from different processes returns a different but related value to each process in the group.
o Example: a sum prefix that receives a one from each of p processes returns to each process an integer in the range 0 to p-1.
3. High-Level Communication Operations (cont.)

o Results might be private variables or components of a partitioned vector.
o Remapping a structure into a different partition over processes corresponds to a permutation of the structure's components among processes (as in matrix multiplication).
o Communication operations are characterized by the source of their input and the destination of their output.
o A combined operation can be implemented more efficiently than two communications in sequence.
o A prefix operation is related to reduction but produces a different value for each destination process.
3. High-Level Communication Operations (cont.)

o Scatter: Takes a vector of P items (P = number of processes) from one process and distributes it, one item to each process.
o Gather: The reverse operation; collects an item from each process and concatenates the items into a vector result delivered to a single destination process.
o Gather/scatter: Remaps a partitioned structure. A vector of P source items, one for each destination process, is taken from each process; the collection is reorganized as a vector of items for each destination and delivered to the respective processes.
3. High-Level Communication Operations (cont.)

Distributed variable classes and methods of updating them:

Variable class      Update methods
Unique              Assignment by one process
Private             Parallel assignment of different values by all processes; prefix computation
Cooperative update  Broadcast from a single process; reduction
Replicated          Parallel assignment of the same value by all processes
Partitioned         Each process assigns to its own components; prefix computation; permutation for remapping
3. High-Level Communication Operations (cont.)

Characterizing source and destination of collective communications:

Source
  One process:   single item; multiple items
  All processes: single item per process; multiple items per process

Destination
  One process:   single item; multiple items; concatenation; arithmetic combining
  All processes: single item per process; multiple items per process

Communication operations and their source and destination types:

Communication    Source                               Destination
Point-to-point   One process                          One process
Broadcast        One process: one item                All processes: item per process
Gather           All processes: item per process      One process: P items
Scatter          One process: P items                 All processes: item per process
Reduce           All processes: arithmetic combining  One process: one item
3. High-Level Communication Operations (cont.)

o There are different choices for the source of the item to be communicated and for the destination of messages.
o A specific language or library of communication functions exhibits a number of variations on these communication operations.
o One source of variation is the data type associated with the source or destination; vectors of values may be supported as sources and results.
o A particular arrangement of data may be used repeatedly in different communications, but latency control leads to aggregating loosely related, or unrelated, data into a specific communication.
3. High-level Communication Operations(Cont…)
• Data items merged into output file when read decomposed into individual items by file specific input code(Communication ideas are packing and unpacking message buffer)
• Motivation of packing and unpacking long messages is message start up overhead, irreducible amount of time taken to start sending a message of any length
• Start up time or latency vary with system • Packing: Items of different size and type are concatenated into
one long message• Unpacking: messages are separated in destination
3. High-Level Communication Operations (cont.)

[Figure: behavior of some collective communication operations]
4. Distributed Gauss Elimination

• Solves a system of linear equations.
• The machine has one host processor, which does all the I/O, and P identical worker processors. One process runs on each processor, and all worker processes execute the same program.
• The machine's communication library supports the high-level collective operations broadcast and sum reduce, as well as point-to-point communication.
• Communication latency is large compared to the floating point operation time (long-message regime).
• The program below stores the 2D matrix in column-major order.
• Each worker process has a unique ID, 0 ≤ id < p; the host process ID is outside this range.
4. Distributed Gauss Elimination (cont.)

• Process id communicates in point-to-point mode with processes (id+1) mod p and (id-1) mod p.
• Solve Ax = b by first factoring A = LU into lower (L) and upper (U) triangular matrices.
• The solution is obtained by solving two recurrence systems: forward substitution Ly = b, followed by backward substitution Ux = y.
• Matrix A is replaced by L and U, and is partitioned cyclically over processes by column, with processes Pr, 0 ≤ r < p.

[Figure: mapping of an 8×8 matrix to 3 processors]
4. Distributed Gauss Elimination (cont.)

[Figure: partition of the Gauss elimination problem over processes]

4. Distributed Gauss Elimination (cont.)

[Figure: program structure for distributed Gauss elimination]
4. Distributed Gauss Elimination (cont.)

Host program:
  Input number p of processes and order n of linear system;
  Broadcast p and n to all workers;
  Receive time to generate test case from process zero and print it;
  Receive time for LU factorization from process zero and print time and rate;
  Receive solution time for test b vector from process zero and print time and rate;
  Receive residual from process zero, check and print it;
End

Host program for distributed Gauss elimination
4. Distributed Gauss Elimination (cont.)

Worker program (process number id):
  Receive p and n from host;
  Compute the number m of matrix columns for this process;
  Generate m columns of the test matrix, A[i, j] = 1/(i - j + 0.5), for this process's partition of A;
  Process zero sends time for generation to host;
  Call PGEFA procedure to factor A into matrix product LU;
  Process zero sends time for factorization to host;
  Process zero computes test right hand side vector, b[i] = i, i = 1, ..., n;
  Call PGESL procedure to solve equations, leaving solution vector in process zero;
  Process zero sends time for solving to host;
  Call PGEMUL to compute Ax and leave result in process zero;
  Process zero computes residual, ∑|(Ax)[i] - b[i]|, and sends it to host;
End

Worker program for distributed Gauss elimination
4. Distributed Gauss Elimination (cont.)

• The O(n^3) work of factorization steps sequentially through the diagonal elements.
• The O(n^2) work of forward and backward substitution is done sequentially in the absence of vector operations.
• The host program is unique: it does the I/O and communicates with the workers by broadcast and point-to-point communication.
5. Process Topology vs. Processor Topology

• Process topology is different from the processor topology imposed by the interconnection network, even if one and only one process runs on each processor.
• Communication software makes the network topology support arbitrary source/destination pairs by forwarding messages from point to point in the network.

Let A = [a_ij] and B = [b_ij] be n×n matrices, and compute C = AB. The computational complexity of the sequential algorithm is O(n^3).
5. Process Topology vs. Processor Topology (cont.)

#include <mpi.h>

int main(int argc, char *argv[])
{
    int myid, numprocs, left, right;
    int buffer[10];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    right = (myid + 1) % numprocs;
    left = myid - 1;
    if (left < 0)
        left = numprocs - 1;

    /* Shift the buffer one step around a ring of processes:
     * send to the left neighbor, receive from the right neighbor. */
    MPI_Sendrecv_replace(buffer, 10, MPI_INT, left, 123, right, 123,
                         MPI_COMM_WORLD, &status);

    MPI_Finalize();
    return 0;
}
5. Process Topology vs. Processor Topology (cont.)

if j ≠ 0 then
begin
  for k := 0 step 1 until j-1
  begin
    send myB to P((i-1) mod N, j);
    receive myB from P((i+1) mod N, j);
  end
end
if i ≠ 0 then
begin
  for k := 0 step 1 until i-1
  begin
    send myA to P(i, (j-1) mod N);
    receive myA from P(i, (j+1) mod N);
  end
end

Initial distribution using one-step left and upward transmissions