Parallel Programming on the SGI Origin2000
Parallel Programming on the SGI Origin2000
With thanks to Igor Zacharov / Benoit Marchand, SGI
Taub Computer Center, Technion
Moshe Goldberg, [email protected]
Mar 2004 (v1.2)
Parallel Programming on the SGI Origin2000
1) Parallelization Concepts
2) SGI Computer Design
3) Efficient Scalar Design
4) Parallel Programming - OpenMP
5) Parallel Programming - MPI
Parallel Programming - MPI
Parallel classification
• Parallel architectures: Shared Memory / Distributed Memory
• Programming paradigms: Data parallel / Message passing
Shared Memory
• Each processor can access any part of the memory
• Access times are uniform (in principle)
• Easier to program (no explicit message passing)
• Bottleneck when several tasks access same location
Distributed Memory
• Processor can only access local memory
• Access times depend on location
• Processors must communicate via explicit message passing
Distributed Memory
[Diagram: processors, each with its own local memory, connected by an interconnection network]
Message Passing Programming
• Separate program on each processor
• Local Memory
• Control over distribution and transfer of data
• Additional complexity of debugging due to communications
Performance issues
• Concurrency – ability to perform actions simultaneously
• Scalability – performance is not impaired by increasing number of processors
• Locality - high ratio of local memory accesses to remote memory accesses (or low communication)
SP2 Benchmark
• Goal: checking the performance of real-world applications on the SP2
• Execution time (seconds): CPU time for the applications
• Speedup = execution time for 1 processor / execution time for p processors
WHAT is MPI?
• A message-passing library specification
• Extended message-passing model
• Not specific to implementation or computer
BASICS of MPI PROGRAMMING
• MPI is a message-passing library
• Assumes: a distributed memory architecture
• Includes: routines for performing communication (exchange of data and synchronization) among the processors
Message Passing
• Data transfer + synchronization
• Synchronization : the act of bringing one or more processes to known points in their execution
• Distributed memory: memory split up into segments, each may be accessed by only one process.
MPI STANDARD
• Standard by consensus, designed in an open forum
• Introduced by the MPI FORUM in May 1994, updated in June 1995.
• MPI-2 (1998) provides extensions to the MPI standard
Why use MPI ?
• Standardization
• Portability
• Performance
• Richness
• Designed to enable libraries
Writing an MPI Program
• If there is a serial version, make sure it is debugged
• If not, try to write a serial version first
• When debugging in parallel, start with a few nodes first
Format of MPI routines
C:
      MPI_Xxx(parameters)
      #include "mpi.h"
FORTRAN:
      call MPI_XXX(parameters, ierror)
      include 'mpif.h'
Six useful MPI functions
MPI_INIT: initializes the MPI environment
MPI_COMM_SIZE: returns the number of processes
MPI_COMM_RANK: returns this process's number (rank)
Communication routines
MPI_SEND: sends a message
MPI_RECV: receives a message
End MPI part of program
MPI_FINALIZE: exit in an orderly way
      program hello
      include 'mpif.h'
      integer status(MPI_STATUS_SIZE)
      character*12 message
      call MPI_INIT(ierror)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
      tag = 100
      if (rank .eq. 0) then
         message = 'Hello, world'
         do i = 1, size-1
            call MPI_SEND(message, 12, MPI_CHARACTER, i,
     &                    tag, MPI_COMM_WORLD, ierror)
         enddo
      else
         call MPI_RECV(message, 12, MPI_CHARACTER, 0, tag,
     &                 MPI_COMM_WORLD, status, ierror)
      endif
      print*, 'node', rank, ':', message
      call MPI_FINALIZE(ierror)
      end
#include <stdio.h>
#include <string.h>
#include "mpi.h"

int main( int argc, char *argv[] ){
    int tag = 100;
    int rank, size, i;
    MPI_Status status;
    char message[12];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    strcpy(message, "Hello,world");
    if (rank == 0) {
        for (i = 1; i < size; i++) {
            MPI_Send(message, 12, MPI_CHAR, i, tag, MPI_COMM_WORLD);
        }
    } else {
        MPI_Recv(message, 12, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &status);
    }
    printf("node %d : %s \n", rank, message);
    MPI_Finalize();
    return 0;
}
MPI Messages
• DATA: the data to be sent
• ENVELOPE: information to route the data
Description of MPI_Send (MPI_Recv)
Startbuf: the address where the data start
Count: number of elements in the message
Datatype: type of the elements
Destination/Source: rank in the communicator (0 .. size-1)
Description of MPI_Send (MPI_Recv)
Tag: arbitrary number to help distinguish between messages
Communicator: the communications universe
Status (receive only!): contains 3 fields - sender, tag and error code
Some useful remarks
• Source = MPI_ANY_SOURCE means that any source is acceptable
• Tags specified by sender and receiver must match, or use MPI_ANY_TAG: any tag is acceptable
• Communicator must be the same for send/receive. Usually: MPI_COMM_WORLD
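As a minimal C sketch of these rules (not from the original slides; the routine name receive_from_anyone is illustrative), a receiver can accept a message from any source and any tag and then read the envelope back from the status object:

#include <stdio.h>
#include "mpi.h"

/* Receive one integer from any sender with any tag, then inspect
   the status fields to find out who actually sent it. */
void receive_from_anyone(void)
{
    int value;
    MPI_Status status;

    MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
             MPI_COMM_WORLD, &status);

    /* the envelope information is recovered from the status object */
    printf("got %d from rank %d with tag %d\n",
           value, status.MPI_SOURCE, status.MPI_TAG);
}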
POINT-TO-POINT COMMUNICATION
• Transmission of a message between one pair of processes
• Programmer can choose mode of transmission
MODE of TRANSMISSION
• Can be chosen by the programmer
• ...or let the system decide
• Synchronous mode
• Ready mode
• Buffered mode
• Standard mode
BLOCKING /NON-BLOCKING COMMUNICATIONS
Blocking: the send or receive suspends execution until the message buffer is safe to use
Non-blocking: separates computation from communication. The send is initiated but not completed; a separate call is used to verify that the communication has completed.
BLOCKING STANDARD SEND (message size > threshold)
[Diagram: sender S calls MPI_SEND, receiver R calls MPI_RECV]
• The sending task waits; the transfer begins only when MPI_RECV has been posted
• The sending task continues when the data transfer from the source buffer is complete
NON-BLOCKING STANDARD SEND (message size > threshold)
[Diagram: sender S calls MPI_ISEND and later MPI_WAIT; receiver R calls MPI_IRECV and later MPI_WAIT]
• The transfer begins when MPI_IRECV has been posted; the data transfer from the source then completes
• No interruption if the wait is posted late enough
BLOCKING STANDARD SEND (message size <= threshold)
[Diagram: sender S calls MPI_SEND, receiver R calls MPI_RECV]
• The data is transferred to a buffer on the receiver; the sending task continues once the transfer from the source is complete
• The receiving task continues when the data transfer to the user's buffer is complete
NON-BLOCKING STANDARD SEND (message size <= threshold)
[Diagram: sender S calls MPI_ISEND and later MPI_WAIT; receiver R calls MPI_IRECV and later MPI_WAIT]
• No delay on the sender, even though the message is not yet in the buffer on R
• The copy to an intermediate buffer can be avoided if MPI_IRECV is posted early enough
• No delay if the wait is posted late enough
BLOCKING COMMUNICATION
      print *, "Task ", irank, " has sent the message"
      call MPI_Send(rmessage1, MSGLEN, MPI_REAL,
     &              idest, isend_tag, MPI_COMM_WORLD, ierr)
      call MPI_Recv(rmessage2, MSGLEN, MPI_REAL,
     &              isrc, irecv_tag, MPI_COMM_WORLD, istatus, ierr)

NON-BLOCKING
      call MPI_ISend(rmessage1, MSGLEN, MPI_REAL,
     &               idest, isend_tag, MPI_COMM_WORLD,
     &               request_send, ierr)
      call MPI_IRecv(rmessage2, MSGLEN, MPI_REAL,
     &               isrc, irecv_tag, MPI_COMM_WORLD,
     &               request_rec, ierr)
      call MPI_WAIT(request_rec, istatus, ierr)
      program deadlock
      implicit none
      include 'mpif.h'
      integer MSGLEN, ITAG_A, ITAG_B
      parameter ( MSGLEN = 2048, ITAG_A = 100, ITAG_B = 200 )
      real rmessage1(MSGLEN),           ! message buffers
     .     rmessage2(MSGLEN)
      integer irank,                    ! rank of task in communicator
     .        idest, isrc,              ! ranks of destination and source tasks
     .        isend_tag, irecv_tag,     ! message tags
     .        istatus(MPI_STATUS_SIZE), ! status of communication
     .        ierr,                     ! return status
     .        i

      call MPI_Init ( ierr )
      call MPI_Comm_Rank ( MPI_COMM_WORLD, irank, ierr )
      print *, " Task ", irank, " initialized"
C     initialize message buffers
      do i = 1, MSGLEN
         rmessage1(i) = 100
         rmessage2(i) = -100
      end do

Deadlock program (cont)
      if ( irank .EQ. 0 ) then
         idest     = 1
         isrc      = 1
         isend_tag = ITAG_A
         irecv_tag = ITAG_B
      else if ( irank .EQ. 1 ) then
         idest     = 0
         isrc      = 0
         isend_tag = ITAG_B
         irecv_tag = ITAG_A
      end if
C     ----------------------------------------------------------------
C     send and receive messages
C     ----------------------------------------------------------------
      print *, " Task ", irank, " has sent the message"
      call MPI_Send ( rmessage1, MSGLEN, MPI_REAL, idest, isend_tag,
     .                MPI_COMM_WORLD, ierr )
      call MPI_Recv ( rmessage2, MSGLEN, MPI_REAL, isrc, irecv_tag,
     .                MPI_COMM_WORLD, istatus, ierr )
      print *, " Task ", irank, " has received the message"
      call MPI_Finalize (ierr)
      end
DEADLOCK example
[Diagram: tasks A and B each call MPI_SEND first and then MPI_RECV]
Deadlock example
• SP2 implementation: no receive has been posted yet, so both processes block
• Solutions:
  Different ordering
  Non-blocking calls
  MPI_Sendrecv
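A hedged C sketch of the MPI_Sendrecv solution (assuming exactly two tasks and an illustrative tag of 100): the combined call performs the send and the receive together, so neither task blocks waiting for the other to post its receive first.

#include "mpi.h"

#define MSGLEN 2048

/* Each of two tasks exchanges a buffer with its partner using
   MPI_Sendrecv, avoiding the send/send deadlock shown above. */
void exchange(int irank)
{
    float rmessage1[MSGLEN], rmessage2[MSGLEN];
    int other = 1 - irank;             /* partner rank (assumes 2 tasks) */
    MPI_Status status;

    MPI_Sendrecv(rmessage1, MSGLEN, MPI_FLOAT, other, 100,  /* send part */
                 rmessage2, MSGLEN, MPI_FLOAT, other, 100,  /* receive part */
                 MPI_COMM_WORLD, &status);
}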
Determining Information about Messages
• Wait
• Test
• Probe
MPI_WAIT
• Useful for both sender and receiver of non-blocking communications
• Receiving process blocks until message is received, under programmer control
• Sending process blocks until send operation completes, at which time the message buffer is available for re-use
MPI_WAIT
[Diagram: the sender S starts the communication, computes while the message is transmitted to R, and then calls MPI_WAIT]
MPI_TEST
[Diagram: the sender S calls MPI_Isend, computes while the message is transmitted to R, and calls MPI_TEST periodically to check for completion]
MPI_TEST
• Used for both sender and receiver of non-blocking communication
• Non-blocking call
• Receiver checks to see if a specific sender has sent a message that is waiting to be delivered ... messages from all other senders are ignored
MPI_TEST (cont.)
The sender can find out if the message buffer can be re-used ... it has to wait until the operation is complete before doing so
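A possible C sketch of this pattern (do_some_work is an assumed user routine, not part of MPI): the sender starts a non-blocking send, keeps computing, and polls with MPI_Test until the buffer may safely be reused.

#include "mpi.h"

extern void do_some_work(void);   /* assumed user computation routine */

/* Start a non-blocking send, overlap it with computation, and poll
   with MPI_Test until the send buffer is safe to reuse. */
void send_and_compute(float *buf, int n, int dest, MPI_Comm comm)
{
    MPI_Request request;
    MPI_Status  status;
    int done = 0;

    MPI_Isend(buf, n, MPI_FLOAT, dest, 100, comm, &request);

    while (!done) {
        do_some_work();                      /* keep computing */
        MPI_Test(&request, &done, &status);  /* done = 1 when buf is reusable */
    }
}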
MPI_PROBE
• Receiver is notified when messages from potentially any sender arrive and are ready to be processed.
• Blocking call
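An illustrative C sketch (not from the original slides): probe for a message of unknown length, size the receive buffer from the status with MPI_Get_count, then receive it.

#include <stdlib.h>
#include "mpi.h"

/* Probe for a message of unknown length from any sender, allocate a
   buffer of the right size, then receive the message. */
void probe_and_receive(void)
{
    MPI_Status status;
    int count;
    double *buf;

    MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
    MPI_Get_count(&status, MPI_DOUBLE, &count);   /* how many items arrived */

    buf = (double *) malloc(count * sizeof(double));
    MPI_Recv(buf, count, MPI_DOUBLE, status.MPI_SOURCE, status.MPI_TAG,
             MPI_COMM_WORLD, &status);
    /* ... use the data ... */
    free(buf);
}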
Programming recommendations
• Blocking calls are needed when:
  Tasks must synchronize
  MPI_Wait immediately follows the communication call
Collective Communication
• Establish a communication pattern within a group of nodes.
• All processes in the group call the communication routine, with matching arguments.
• Collective routine calls can return when their participation in the collective communication is complete.
Properties of collective calls
• On completion: the caller is now free to access locations in the communication buffer
• Does NOT indicate that other processors in the group have completed
• Only MPI_BARRIER will synchronize all processes
Properties
• MPI guarantees that a message generated by collective communication calls will not be confused with a message generated by point-to-point communication
• Communicator is the group identifier.
Barrier
• Synchronization primitive. A node calling it will block until all the nodes within the group have called it.
• Syntax
MPI_Barrier(Comm, Ierr)
Broadcast
• Send data on one node to all other nodes in communicator.
• MPI_Bcast(buffer, count, datatype, root, comm, ierr)
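For example, a minimal C sketch of a broadcast (the array contents are illustrative):

#include "mpi.h"

/* Rank 0 fills a small array and broadcasts it to every process
   in MPI_COMM_WORLD. */
void broadcast_parameters(int rank)
{
    double params[4];

    if (rank == 0) {
        params[0] = 1.0; params[1] = 2.0;
        params[2] = 3.0; params[3] = 4.0;
    }
    /* after the call, all ranks hold the values set on the root */
    MPI_Bcast(params, 4, MPI_DOUBLE, 0, MPI_COMM_WORLD);
}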
Broadcast
[Diagram (data layout): before the call only P0 holds A0; after the broadcast P0, P1, P2 and P3 each hold A0]
Gather and Scatter
[Diagram (data layout): scatter - P0 holds A0 A1 A2 A3, and after the call P0, P1, P2, P3 hold A0, A1, A2, A3 respectively; gather is the reverse operation]
Allgather effect
[Diagram (data layout): before the call P0, P1, P2, P3 hold A0, B0, C0, D0 respectively; after the allgather every process holds A0 B0 C0 D0]
Syntax for Scatter & Gather
MPI_Gather(sendbuf, scount, sdatatype, recvbuf, rcount, rdatatype, root, comm, ierr)
MPI_Scatter(sendbuf, scount, sdatatype, recvbuf, rcount, rdatatype, root, comm, ierr)
Scatter and Gather
• Gather: Collect data from every member of the group (including the root) on the root node in linear order by the rank of the node.
• Scatter: Distribute data from the root to every member of the group in linear order by node.
ALLGATHER
• All processes, not just the root, receive the result. The jth block of the receive buffer is the block of data sent from the jth process
• Syntax:
MPI_Allgather(sendbuf, scount, sdatatype, recvbuf, rcount, rdatatype, comm, ierr)
Gather example
      DIMENSION A(25,100), b(100), cpart(25), ctotal(100)
      INTEGER root
      DATA root/0/
      DO I = 1, 25
         cpart(I) = 0.
         DO K = 1, 100
            cpart(I) = cpart(I) + A(I,K)*b(K)
         END DO
      END DO
      call MPI_GATHER(cpart, 25, MPI_REAL, ctotal, 25, MPI_REAL,
     .                root, MPI_COMM_WORLD, ierr)
AllGather example
      DIMENSION A(25,100), b(100), cpart(25), ctotal(100)
      DO I = 1, 25
         cpart(I) = 0.
         DO K = 1, 100
            cpart(I) = cpart(I) + A(I,K)*b(K)
         END DO
      END DO
      call MPI_ALLGATHER(cpart, 25, MPI_REAL, ctotal, 25, MPI_REAL,
     .                   MPI_COMM_WORLD, ierr)
Parallel matrix-vector multiplication
[Diagram: A * b = c, with the 100-row matrix A split into four 25-row blocks, one per process P1..P4; each process computes 25 elements of c]
Global Computations
• Reduction
• Scan
Reduction
• The partial result in each process in the group is combined in one specified process
Reduction
Dj: the jth item of data at the root process
*: the reduction operation (sum, max, min, ...)
Dj = D(0,j) * D(1,j) * ... * D(n-1,j)
Scan operation
• The scan (prefix-reduction) operation performs partial reductions on distributed data
• Dkj = D0j * D1j * ... * Dkj,  for k = 0, 1, ..., n-1
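A minimal C sketch of both operations, with assumed values (each rank contributes the single integer rank+1): MPI_Reduce collects the global sum on rank 0, while MPI_Scan gives each rank the sum over ranks 0..k.

#include "mpi.h"

/* Reduction and prefix-reduction over one integer per process. */
void reduce_and_scan(int rank)
{
    int mine = rank + 1;        /* this process's item of data */
    int total, prefix;

    MPI_Reduce(&mine, &total,  1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Scan(&mine, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    /* on 4 ranks: total = 10 on rank 0; prefix = 1, 3, 6, 10 on ranks 0..3 */
}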
Varying size gather and scatter
• Both the size and the memory location of the messages vary
• More flexibility in writing code
• Less need to copy data into temporary buffers
• More compact final code
• The vendor's implementation may be optimal
Scatterv syntax
MPI_Scatterv(sendbuf, scounts, displs, stype, recvbuf, rcount, rtype, root, comm, ierr)
SCOUNTS(I): number of items to send from process root to process I
DISPLS(I): displacement from sendbuf to the beginning of the I-th message
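An illustrative C sketch with assumed counts and displacements for four processes:

#include "mpi.h"

#define NPROCS 4   /* assumed number of processes for this sketch */

/* The root scatters blocks of different lengths: scounts[i] items to
   process i, starting displs[i] items into the send buffer. */
void scatter_varying(int rank)
{
    float sbuf[100];                       /* only significant on the root */
    int   scounts[NPROCS] = {10, 20, 30, 40};
    int   displs[NPROCS]  = { 0, 10, 30, 60};
    float rbuf[40];

    MPI_Scatterv(sbuf, scounts, displs, MPI_FLOAT,
                 rbuf, scounts[rank], MPI_FLOAT,
                 0, MPI_COMM_WORLD);
}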
SCATTER
[Diagram: P0 sends equal-sized, contiguous blocks to P0, P1, P2, P3]
SCATTERV
[Diagram: P0 sends blocks of varying size, taken from varying locations, to P0, P1, P2, P3]
Advanced Datatypes
• Predefined basic datatypes -- contiguous data of the same type.
• We sometimes need:
non-contiguous data of single type
contiguous data of mixed types
Solutions
• multiple MPI calls to send and receive each data element
• copy the data to a buffer before sending it (MPI_PACK)
• use MPI_BYTE to get around the datatype-matching rules
Drawbacks
• Slow, clumsy and wasteful of memory
• Using MPI_BYTE or MPI_PACKED can hamper portability
General Datatypes and Typemaps
• a sequence of basic datatypes
• a sequence of integer (byte) displacements
Typemaps
typemap = [(type0, disp0), (type1, disp1), ..., (typen, dispn)]
Displacements are relative to the start of the buffer
Example: Typemap(MPI_INT) = [(int, 0)]
Extent of a Derived Datatype
Lb = min(disp0, disp1, ..., dispn)
Ub = max(disp0 + sizeof(type0), ..., dispn + sizeof(typen))
Extent = Ub - Lb + pad
MPI_TYPE_EXTENT
• MPI_TYPE_EXTENT(datatype,extent,ierr)
Describes distance (in bytes) from start of datatype to start of the next datatype .
How and When Do I Use Derived Datatypes?
• MPI derived datatypes are created at run-time through calls to MPI library routines.
How to use
• Construct the datatype
• Allocate the datatype
• Use the datatype
• Deallocate the datatype
      integer oldtype, newtype, count, blocklength, stride
      integer ierr, n
      real buffer(n,n)
C     construct and commit the new datatype
      call MPI_TYPE_VECTOR(count, blocklength, stride, oldtype,
     &                     newtype, ierr)
      call MPI_TYPE_COMMIT(newtype, ierr)
C     use it in a communication operation
      call MPI_SEND(buffer, 1, newtype, dest, tag, comm, ierr)
C     deallocate it
      call MPI_TYPE_FREE(newtype, ierr)
Example on MPI_TYPE_VECTOR
[Diagram: newtype built from oldtype with COUNT = 2, BLOCKLENGTH = 3, STRIDE = 5 - two blocks of 3 elements, the starts of the blocks 5 elements apart]
Summary
• Derived datatypes are datatypes that are built from the basic MPI datatypes
• Derived datatypes provide a portable and elegant way of communicating non-contiguous or mixed types in a message.
• Efficiency may depend on the implementation (see how it compares to MPI_BYTE)
Several datatypes
MPI_TYPE_CONTIGUOUS: replicates the existing datatype into contiguous locations
MPI_TYPE_VECTOR: the same, but allows regular gaps in the displacements
MPI_TYPE_HVECTOR: same as the former, but the displacements are given in bytes
MPI_TYPE_INDEXED: replicates the datatype into a sequence of different blocks, each with its own length and displacement
Several datatypes
MPI_TYPE_HINDEXED: replicates the datatype into a sequence of different blocks, with displacements given in bytes
MPI_TYPE_STRUCT: a mix of different datatypes
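As an illustration of MPI_TYPE_STRUCT (a hedged sketch, not from the original slides; the particle struct is invented for the example), the MPI-1 style construction in C looks like this:

#include "mpi.h"

/* A C struct mixing an int and three doubles. */
struct particle { int id; double x[3]; };

/* Build and commit a derived datatype matching struct particle. */
MPI_Datatype make_particle_type(void)
{
    struct particle p;
    int          blocklens[2] = {1, 3};
    MPI_Datatype types[2]     = {MPI_INT, MPI_DOUBLE};
    MPI_Aint     displs[2], base;
    MPI_Datatype newtype;

    /* displacements measured in bytes from the start of the struct */
    MPI_Address(&p,      &base);
    MPI_Address(&p.id,   &displs[0]);
    MPI_Address(&p.x[0], &displs[1]);
    displs[0] -= base;
    displs[1] -= base;

    MPI_Type_struct(2, blocklens, displs, types, &newtype);
    MPI_Type_commit(&newtype);
    return newtype;
}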
GROUP
c     this is a program for testing MPI_Group
c
      program GROUP
      implicit none
      include 'mpif.h'
      INTEGER WCOMM, WGROUP, GROUP1, SUBCOMM, RANK, SIZE, IERR, I
      INTEGER SBUF(100), RBUF(100), count, count2, sbuf2(100),
     *        rbuf2(100)
      integer ranks(100)
      CALL MPI_INIT(IERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, RANK, IERR)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, SIZE, IERR)
c     call MPI_BARRIER(MPI_COMM_WORLD, IERR)
c     print*, 'rank =', rank, 'size =', size
      RANKS(1) = 0
      WCOMM = MPI_COMM_WORLD
c     WGROUP = MPI_COMM_GROUP
      CALL MPI_COMM_GROUP(WCOMM, WGROUP, IERR)

Group (cont.)
c     call MPI_BARRIER(MPI_COMM_WORLD, IERR)
      CALL MPI_GROUP_EXCL(WGROUP, 1, RANKS, GROUP1, IERR)
      CALL MPI_COMM_CREATE(WCOMM, GROUP1, SUBCOMM, IERR)
c     call MPI_BARRIER(MPI_COMM_WORLD, IERR)
c     print*, 'group1 =', rank, group1
c     print*, 'subcomm =', rank, subcomm
c     print*, 'after creation of group1 & subcomm'
      IF (RANK .NE. 0) THEN
         COUNT = size
         do i = 1, COUNT
            SBUF(i) = rank
         enddo
         CALL MPI_REDUCE(SBUF, RBUF, COUNT, MPI_INTEGER,
     *                   MPI_SUM, 0, SUBCOMM, IERR)
c        print*, 'sum of group1 at rank', rank, (rbuf(i), i=1, count)
      ENDIF

Group (cont.)
c
      if (rank .eq. 1) then
         print*, 'sum of group1', (rbuf(i), i=1, count)
c        print*, 'sum of group1', (sbuf(i), i=1, count)
      endif
      count2 = size
      do i = 1, count2
         sbuf2(i) = rank * rank
      enddo
      CALL MPI_REDUCE(SBUF2, RBUF2, COUNT2, MPI_INTEGER,
     *                MPI_SUM, 0, WCOMM, IERR)
      if (rank .eq. 0) then
         print*, 'sum of wgroup', (rbuf2(i), i=1, count2)
      else
         CALL MPI_COMM_FREE(SUBCOMM, IERR)
      endif
      CALL MPI_GROUP_FREE(GROUP1, IERR)
      CALL MPI_FINALIZE(IERR)
      stop
      end
PERFORMANCE ISSUES
• Hidden communication takes place
• Performance depends on implementation of MPI
• Because of forced synchronization, it is not always best to use collective communication
Example: simple broadcast
[Diagram: node 1 sends the same message of size B to nodes 2, 3, ..., 8, one at a time]
Data: B*(P-1)   Steps: P-1
Example: simple scatter
[Diagram: node 1 sends a different block of size B to each of nodes 2, 3, ..., 8, one at a time]
Data: B*(P-1)   Steps: P-1
Example: better scatter
[Diagram: tree-structured scatter - node 1 sends 4*B to node 2; then nodes 1 and 2 each send 2*B (to nodes 3 and 4); then nodes 1-4 each send B (to nodes 5-8)]
Data: B*P*log P   Steps: log P
Timing for sending a message
Time is composed of the startup time (the time to send a zero-length message) and the transfer time (the time to transfer one byte of data):
Tcomm = Tstartup + B * Ttransfer
It may be worthwhile to group several sends together
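For example, with assumed (illustrative) values Tstartup = 30 microseconds and Ttransfer = 0.01 microseconds per byte: ten separate 100-byte messages cost 10 * (30 + 100*0.01) = 310 microseconds, while one grouped 1000-byte message costs 30 + 1000*0.01 = 40 microseconds, roughly 8 times less.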
Performance evaluation
Fortran:
      real*8 t1
      t1 = MPI_Wtime()   ! returns elapsed wall-clock time in seconds
C:
      double t1;
      t1 = MPI_Wtime();
MPI References
• The MPI Standard: www-unix.mcs.anl.gov/mpi/index.html
• Parallel Programming with MPI, Peter S. Pacheco, Morgan Kaufmann, 1997
• Using MPI, W. Gropp, Ewing Lusk, Anthony Skjellum, The MIT Press, 1999
Example: better broadcast
[Diagram: tree-structured broadcast - node 1 sends B to node 2; then nodes 1 and 2 send to nodes 3 and 4; then nodes 1-4 send to nodes 5-8]
Data: B*(P-1)   Steps: log P