7/28/2019 Caplib Paper 013
1/33
CAPLib: A THIN LAYER MESSAGE PASSING LIBRARY TO SUPPORT COMPUTATIONAL MECHANICS CODES ON
DISTRIBUTED MEMORY PARALLEL SYSTEMS
By
P F Leggett, S P Johnson and M Cross
Parallel Processing Research Group
Centre for Numerical Modelling and Process Analysis
University of Greenwich
London SE18 6PF
UK.
ABSTRACT
The Computer Aided Parallelisation Tools (CAPTools) [1] are a set of interactive tools aimed
at providing automatic parallelisation of serial Fortran Computational Mechanics (CM)
programs. CAPTools analyses the user's serial code and then, through stages of array
partitioning, mask and communication calculation, generates parallel SPMD (Single Program
Multiple Data) message passing Fortran.
The parallel code generated by CAPTools contains calls to a collection of routines that form
the CAPTools Communications Library (CAPLib). The library provides a portable layer and
user-friendly abstraction over the underlying parallel environment. CAPLib contains
optimised message passing routines for data exchange between parallel processes and other
utility routines for parallel execution control, initialisation and debugging. By compiling and
linking with different implementations of the library the user is able to run on many different
parallel environments.
Even with today's parallel systems the concept of a single version of a parallel application
code is more of an aspiration than a reality. However, for CM codes the data partitioning
SPMD paradigm requires a relatively small set of message-passing communication calls. This
set can be implemented as an intermediate thin layer library of message-passing calls that
enables the parallel code (especially that generated automatically by a parallelisation tool
such as CAPTools) to be as generic as possible.
CAPLib is just such a thin layer message passing library that supports parallel CM codes,
by mapping generic calls onto machine specific libraries (such as CRAY SHMEM) and
portable general purpose libraries (such as PVM and MPI). This paper describes CAPLib
together with its three perceived advantages over other routes:
as a high level abstraction, it is both easy to understand (especially when generated
automatically by tools) and to implement by hand, for the CM community (who are not
generally parallel computing specialists);
the one parallel version of the application code is truly generic and portable;
the parallel application can readily utilise whatever message passing libraries on a given
machine yield optimum performance.
1 Introduction
Currently the most reliable and portable way to implement parallel versions of computational
mechanics (CM) software applications is to use a domain decomposition data partitioning
strategy to ensure that data locality is preserved and inter-processor communication is
minimised. The parallel hardware model assumes a set of processors, each with its own
memory, linked in some specified connection topology. The parallelisation paradigm is
single program multiple data (SPMD); that is, each processor runs the same application except
using its own local data set. Of course, neighbouring processors (at least) will need to
exchange data during the calculation and this must usually be done in a synchronised manner,
if the parallel computation is to faithfully emulate its scalar equivalent. One of the keys to
enabling this class of parallel application is the message-passing library that enables data to
be efficiently exchanged amongst the processors comprising the system.
Up until the early 1990s, parallel vendors typically provided their own message passing
libraries, which were naturally targeted at optimising performance on their own hardware.
This made it very difficult to port a CM application from one parallel system to another. In
the early 1990s, portable message passing libraries began to emerge. The two most popular
such libraries are PVM [2] and MPI [3]. One or other, or both of these libraries is now
implemented on most commercial parallel systems. Although this certainly addresses the
issue of portability, these generic message-passing libraries may give far from optimal
performance on any specific system. On CRAY-T3D systems, for example, the PVM library
performance is somewhat inferior to the manufacturer's own SHMEM library [4]. Hence, to
optimise performance on such a system the parallel application needs to utilise the in-house
library.
Although both PVM and MPI are powerful and flexible they actually provide much greater
functionality than is required by the CM community in porting their applications to
commercial parallel hardware. This issue was recognised by the authors some years ago
when they were working on the design phase of some automatic parallelisation tools for
FORTRAN computational mechanics codes CAPTools [1,5,6,7,8,9]. The challenge was to
produce generic parallel code that would run on any of the commercially available high
performance architectures. The key factor that inhibited the generation of truly generic
parallel code was the variety of the message passing libraries and the structure of the
information passed into the resulting calls as arguments. From an extensive experience base
of code parallelisation, the CAPTools team recognised that all typical inter-processor
communications required by structured mesh codes (typical of CFD applications) could be
addressed by a concise set of function calls. Furthermore it transpired that these calls could
be easily implemented as a thin software layer on top of the standard message passing
libraries PVM and MPI plus a parallel system's own optimised libraries (such as Cray
T3D/T3E SHMEM). Such a thin layer software library could have three distinct advantages
over other routes:
as a high level abstraction it is both easy to understand and to implement by hand, for the
CM community (who are not generally parallel computing specialists);
the one parallel version of the application code is truly generic and portable;
the parallel application can readily utilise whichever message passing libraries on a given
machine yield optimum performance.
In this paper we describe the design, development and performance of the CAPLib message
passing software library that is specifically targeted at structured mesh CM codes. As such,
we are concerned with:- ease of use by the CM community, portability, flexibility and
computational efficiency. Such a library, even if it is a very thin layer, must represent some
kind of overhead on the full scale message passing libraries; part of the performance
assessment considers this issue. For such a concept to be useful to the CM community its
overhead must be minimal.
2 CAPLib Design and Fundamentals
CAPLib's primary design goal was to provide the initialisation and communication facilities
needed to execute parallel Computational Mechanics code either parallelised manually or
generated by the CAPTools semi-automatic parallelisation environment. A secondary goal is
to provide a generic set of utilities that make the compilation and execution of parallel
programs using CAPLib as straightforward as possible. The library is also supplied with a set
of scripts to enable easy and standardised compilation of parallel code with different versions
of CAPLib and for the simple execution of the compiled executable on different machines.
This section discusses the design, features and fundamentals of the library.
2.1 Design
The different layers of software of CAPTools generated code are shown in Figure 1. CAPLib
has been implemented over MPI [3] and PVM [2], the most important standard parallel
communications libraries in current use, to provide an easy method of porting CAPLib to
different machines. Where possible, versions of CAPLib have been developed for proprietary
libraries in order to obtain maximum performance, for example, the Cray SHMEM library [4]
or Transtech's i860 toolset library [11].
CAPTools generated parallel code
CAPLib API
MPI / PVM / Cray SHMEM / Transtech i860 toolset
Figure 1 CAPLib software layers
The library has been designed to meet the following criteria:
Efficient. Speed of communication is perhaps the most vital characteristic of a parallel
message-passing library. Startup latency has been found to be a very important factor
affecting the performance of parallel programs. The addition of layers of communication
software over the hardware communication mechanism increases the startup latency of all
communications. It is therefore important to access the communication mechanism of a
machine at the lowest level possible. Each implementation of CAPLib attempts to utilise
the lowest level communications API of each parallel machine in order to achieve low
latency and therefore the fastest communications possible.
Portable. Code written to use CAPLib is portable across different machines. Only
recompilation is necessary.
Correct. It is vitally important for parallelised computational mechanics programs to give
the same answers in parallel as in serial. The commutative (global) message passing
functions provided by CAPLib are implemented so as to guarantee that the same result is
seen on every processor. This can be of vital importance for the correct execution of
parallel code and its successful completion. For example, a globally summed value may
be used to determine the exit of an iterative loop. If the summed value is not computed in a
consistent manner across all processors, then round off error may cause some processors
to continue executing the loop whilst others exit, resulting in communication deadlock.
Generic. The library is generic in the sense that decisions about which processor topology
to execute on are taken at run time. CAPTools generated code compiled with CAPLib will
run, for example, on 1 processor, a pipeline of 2 processors, a ring of 100 processors, or a
torus of 64. The scripts provided with the library are also generic. For example, capmake
and caprun are scripts that allow the user to compile and run parallel code without
knowing system specific compiler and execution procedures.
Simple. The library itself has been kept as simple as possible, both in the design of the
API and in its implementation. By keeping the library simple with the minimum number
of functions and also the minimum number of arguments to those functions, the library is
easily ported to different parallel machines. Also an uncomplicated interface is more easily
understood and assimilated by the user.
2.2 Parallel Hardware Model
CAPTools currently generates parallel code based on a Distributed Memory (DM) parallel
hardware model, which is illustrated in Figure 2. In the CAPLib parallel hardware model
processors are considered to be arranged in some form of topology, where each processor is
directly connected to several others, e.g. a pipe, ring, grid, torus or full (fully connected).
Each processor is assigned a unique number (starting from 1). In the case of grid and torus
topologies, each processor also has a dimensional processor number. Memory is considered
local to each processor and data is exchanged between processors via message passing of
some form between directly connected processors. CAPTools generated parallel code can
also be executed on Shared Memory (SM) systems providing, of course, CAPLib has been
ported to the system. On a SM system, each processor still executes the same SPMD program
operating on different sections of the problem data. The main difference between this and
operation on a DM system is that message-passing calls can be implemented inside CAPLib
as memory copies to and from hidden shared memory segments. In this respect the CAPLib
model differs from the usual parallelisation model used on SM machines that assume every
processor can directly access all memory of the problem. By restricting the memory each
processor accesses and enforcing a strict and explicit ordering to the update of halo regions
and calculation of global values, the CAPLib parallel hardware model ensures that there will
be very little memory contention on SM systems and particularly on Distributed Shared
Memory (DSM) systems. As the number of processors becomes large, for example, some of
the machines recently built for the Accelerated Strategic Computing Initiative [10] (ASCI)
have thousands of processors, the localisation of communications becomes very important.
Distributing data onto processors, taking into account the hardware processor topology, can
localise communication between processors and thus minimise contention in the
communications hardware.
[Figure content: processors, each a CPU with local memory, shown connected in pipeline, 2-D grid and fully connected topologies; each processor carries a linear number and, in the grid, a dimensional number such as (1,2).]
Figure 2 CAPLib parallel hardware model
2.3 Process Topologies
Knowledge of the processor topology of the parallel hardware a parallel code is to run on is
very important. It can be used to optimise the speed and distance travelled by messages
between processes. CAPTools attempts to generate code that will minimise the amount of
communication needed; however, to perform those communications that are required as
quickly as possible, the process topology must be mapped onto the processor topology.
CAPLib uses the concept of a process topology for this reason. An intelligent mapping of
process to processors will give better performance than would be possible from a random
allocation. By placing processes so that most communications are needed only between
directly connected neighbouring processors, the distance the communications have to travel is
minimised, avoiding hot spots and maximising bandwidth. An awareness of process topology
also allows for more efficient programming of global communications; for example, the use
of a hyper-cube pattern to perform global summations in parallel (see section 6.3).
By requiring that processes are connected in a pipe or grid type topology, it is possible for
CAPTools to generate parallel code for structured mesh parallelisations using directional
communications, i.e. where communication is specified as being up or down, left or right of a
process rather than to a particular process id. This programming style can make it easier for
the user to write and understand parallel code, especially for grids of two or more
dimensions.
Where possible, CAPLib tries to use the fastest methods of communication that are available
on a particular machine. It might be that communications to neighbouring processors could
be made directly through fast, dedicated hardware channels.
The topology required for a particular run of a parallel program, e.g. pipe or ring, and the
number of processes can be specified to the CAPLib utilities and to the parallel program at
run time in a number of ways: via an environment variable; as a flag on the command line;
in a configuration file; or, if none of these is set, by asking the user interactively. The
topologies currently available from CAPLib are pipe, ring, grid, torus and full (all to all).
2.4 Messages
Each message sent and received using the CAPLib communication routines has a length, a
type and a destination.
2.4.1 Message Length
The length is defined in terms of the number of items to be communicated. Zero or a negative
number of items must result in no message being sent. All CAPLib communication routines
check the length and perform no communication if it is zero or negative.
used to hold RI(2). This method has been found to be generic and works on every
machine tested so far.
3. Heterogeneous computing. If a parallel program is sending messages within a
heterogeneous environment then the size and storage of data types may differ between
processors. One processor may use little endian (low bytes first) and another big
endian (high bytes first) storage, i.e. bytes in a message may have to be swapped at
destination or origin depending on the data type. Floating point representation may
also be different; e.g. the default size might be 4 bytes on one machine and 8 bytes on
another. For the library to be able to convert between different storage types it must
know which type is being communicated in order to apply the correct translation.
Currently the library makes the assumption that all processors are homogeneous but the
knowledge of type of messages within the library allows for adding heterogeneous
capability in the future if this is found to be desirable.
2.4.3 Message Destination
Message destination is determined by an integer argument passed in each communication
call. A negative value indicates a direction; a positive value indicates a process number.
The code generated by CAPTools for structured mesh parallelisations currently assumes a
pipeline or grid process topology. The communication calls therefore use the negative values
to indicate direction to the left or right (or up and down) of a process's position in the topology.
These are available as predefined CAPLib constants such as CAP_LEFT, CAP_RIGHT for
improved readability. A characteristic of parallel SPMD code written for an ordered
topology is a test for neighbour existence before communication. This is because the first
processor does not have a neighbour to its left and the last processor does not have a
neighbour to its right. CAPLib functions perform the necessary tests for neighbour processor
existence internally to improve the readability of CAPTools generated parallel code. Having
the neighbour test within the library also reduces the possibility of error (and therefore
deadlock) in any manually written parallel code. The functions also test for zero-length
messages, as mentioned earlier, since this is often a possibility, so that the user avoids having
to perform this chore as well.
Typical hand written user code without these internal tests might look as follows:

      IF (N.GT.0) THEN
        IF (MYNUM.LT.NPROC) CALL ANY_RECEIVE(A,N*4,MYNUM+1)
        IF (MYNUM.GT.1) CALL ANY_SEND(A,N*4,MYNUM-1)
      ENDIF
where MYNUM is the processor number and NPROC is the number of processors.
Using the CAPTools communications library the code becomes

      CALL CAP_RECEIVE(A,N,1,CAP_RIGHT)
      CALL CAP_SEND(A,N,1,CAP_LEFT)

where the receive communication will only take place if N is greater than zero and a processor
is present to the right, and similarly for the send communication if a processor is available to
the left.
3 Requirements for Message-Passing from Structured Mesh Based Computational Mechanics Codes
CAPLib satisfies the general requirements for message-passing from parallelisations of
structured mesh based Computational Mechanics codes. The library has to provide for:
Initialisation of required process topology
Data Partition calculation
Termination of parallel execution
Point to point communications
Overlap area (halo) update operations
Commutative operations, i.e. local value -> global value using some function
Broadcast operations
Algorithmic Parallel Pipelines
In the following sections, the general requirements for communication and parallel constructs
for CM codes and the CAPLib calls that address these requirements are described,
particularly emphasising their novel aspects. To illustrate this discussion a simple
one-dimensional parallel Jacobi code (Figure 3) obtained using CAPTools is used. The CAPLib
library routines are summarised in Table 1 below.
CAPTools Communications Library (CAPLib) Routine Summary

Function Name       Arguments                                          Type  Blocking  Buffered  Cyclic
CAP_INIT            ()                                                 I     x
CAP_FINISH          ()                                                 I     x
CAP_SETUPPART       (LOASSN,HIASSN,LOPART,HIPART)                      I     x
CAP_SEND            (A,NITEMS,TYPE,PID)                                P     x
CAP_RECEIVE         (A,NITEMS,TYPE,PID)                                P     x
CAP_EXCHANGE        (A,B,NITEMS,TYPE,PID)                              E     x
CAP_BSEND           (A,NITEMS,STRIDE,NSTRIDE,ITYPE,PID)                P     x         x
CAP_BRECEIVE        (A,NITEMS,STRIDE,NSTRIDE,ITYPE,PID)                P     x         x
CAP_BEXCHANGE       (A,B,NITEMS,STRIDE,NSTRIDE,ITYPE,PID)              E     x         x
CAP_CSEND           (A,NITEMS,TYPE,PID)                                P     x                   x
CAP_CRECEIVE        (A,NITEMS,TYPE,PID)                                P     x                   x
CAP_CEXCHANGE       (A,B,NITEMS,TYPE,PID)                              E     x                   x
CAP_ASEND           (A,NITEMS,TYPE,PID,ISEND)                          P
CAP_ARECEIVE        (A,NITEMS,TYPE,PID,IRECV)                          P
CAP_AEXCHANGE       (A,B,NITEMS,TYPE,PID,ISEND,IRECV)                  E
CAP_ABSEND          (A,NITEMS,STRIDE,NSTRIDE,ITYPE,PID,ISYNC)          P               x
CAP_ABRECEIVE       (A,NITEMS,STRIDE,NSTRIDE,ITYPE,PID,ISYNC)          P               x
CAP_ABEXCHANGE      (A,STRIDE,NSTRIDE,NITEMS,TYPE,PID,ISEND,IRECV)     E               x
CAP_CASEND          (A,NITEMS,TYPE,PID,ISEND)                          P                         x
CAP_CARECEIVE       (A,NITEMS,TYPE,PID,IRECV)                          P                         x
CAP_CAEXCHANGE      (A,B,NITEMS,TYPE,PID,ISEND,IRECV)                  E                         x
CAP_SYNC_SEND       (PID,ISYNC)                                        S     x
CAP_SYNC_RECEIVE    (PID,ISYNC)                                        S     x
CAP_SYNC_EXCHANGE   (PID,ISEND,IRECV)                                  S     x
CAP_COMMUTATIVE     (VALUE,TYPE,FUNC)                                  G     x
CAP_COMMUPARENT     (VALUE,TYPE,FIRSTFOUND,FUNC)                       G     x
CAP_COMMUCHILD      (VALUE,TYPE)                                       G     x
CAP_DCOMMUTATIVE    (VALUE,TYPE,DIRECTION,FUNC)                        G     x
CAP_MCOMMUTATIVE    (VALUE,NITEMS,TYPE,FUNC)                           G     x
CAP_BROADCAST       (VALUE,TYPE)                                       G     x
CAP_MBROADCAST      (VALUE,TYPE,OWNER)                                 G     x

CAPLib function type key:
I  Initialisation, termination and control
P  Point to point communication
E  Ordered exchange communication between neighbours
S  Synchronisation on non-blocking communication
G  Global communication or commutative operation

Table 1 Summary of CAPLib Routines
      REAL TOLD(1000), TNEW(1000)
      EXTERNAL CAP_RMAX
      REAL CAP_RMAX
      INTEGER CAP_PROCNUM, CAP_NPROC
      COMMON /CAP_TOOLS/ CAP_PROCNUM, CAP_NPROC
      INTEGER CAP_HTOLD, CAP_LTOLD
C     Initialise CAPLib
      CALL CAP_INIT
      IF (CAP_PROCNUM.EQ.1) PRINT*,'ENTER N AND TOL'
      IF (CAP_PROCNUM.EQ.1) READ*,N,TOL
C     Broadcast N and TOL to every processor
      CALL CAP_RECEIVE(TOL,1,2,CAP_LEFT)
      CALL CAP_SEND(TOL,1,2,CAP_RIGHT)
      CALL CAP_RECEIVE(N,1,1,CAP_LEFT)
      CALL CAP_SEND(N,1,1,CAP_RIGHT)
C     Initialise data partition
      CALL CAP_SETUPPART(1,N,CAP_LTOLD,CAP_HTOLD)
      DO I=MAX(1,CAP_LTOLD),MIN(N,CAP_HTOLD),1
        TOLD(I)=0.0
      ENDDO
C     Boundary conditions (only execute on end processors)
      IF (1.GE.CAP_LTOLD.AND.1.LE.CAP_HTOLD) TOLD(1)=1
      IF (N.GE.CAP_LTOLD.AND.N.LE.CAP_HTOLD) TOLD(N)=100
   40 CONTINUE
C     Exchange overlap data prior to each Jacobi update
      CALL CAP_EXCHANGE(TOLD(CAP_HTOLD+1),TOLD(CAP_LTOLD),1,2,CAP_RIGHT)
      CALL CAP_EXCHANGE(TOLD(CAP_LTOLD-1),TOLD(CAP_HTOLD),1,2,CAP_LEFT)
      DO I=MAX(2,CAP_LTOLD),MIN(N-1,CAP_HTOLD),1
        TNEW(I)=(TOLD(I-1)+TOLD(I+1))/2.0
      ENDDO
C     Calculate maximum difference on each processor
      DIFMAX=0.0
      DO I=MAX(1,CAP_LTOLD),MIN(N,CAP_HTOLD),1
        DIFF=ABS(TNEW(I)-TOLD(I))
        IF (DIFF.GT.DIFMAX) DIFMAX=DIFF
        TOLD(I)=TNEW(I)
      ENDDO
C     Find global maximum difference
      CALL CAP_COMMUTATIVE(DIFMAX,2,CAP_RMAX)
      IF (DIFMAX.GT.TOL) GOTO 40
C     Output results via first processor
      DO I=1,N,1
        IF (I.GT.CAP_BHTNEW) CALL CAP_RECEIVE(TNEW(I),1,2,CAP_RIGHT)
        IF (I.GE.CAP_BLTNEW) CALL CAP_SEND(TNEW(I),1,2,CAP_LEFT)
        IF (CAP_PROCNUM.EQ.1) WRITE(UNIT=*,FMT=*) TNEW(I)
      ENDDO
      END
Figure 3 CAPTools generated parallel code for simple 1-D Jacobi program
3.1 Initialisation, Partition Calculation and Termination
The routine CAP_INIT is called in the example code to initialise the library. It must be called
before any other CAPLib function is used. This call sets up the internal channel arrays and
other data structures that the library needs to access. In some implementations of the library
(e.g. the PVM version) this routine is also responsible for starting all slave processes running.
CAP_INIT is responsible for the allocation of processes to processors in such a manner as to
minimise the number of hops between adjacent processes in the requested topology and
therefore the overall process to process communication latency, maximising communication
bandwidth. CAP_INIT is also responsible for communicating information on the runtime
environment such as hostname and X Window display name to all processes. The size of each
data type is also dynamically determined by CAP_INIT.
A general requirement for message-passing SPMD code is for each parallel process to be
assigned a unique number and also to know the total number of processors involved.
CAP_INIT sets CAP_PROCNUM (the process number) and CAP_NPROC (the number of
processes). Both variables are used internally, but can be referenced in the application code
through a common block in the generated code.
The next stage is the calculation of data assignment for each process. Adhering to the SPMD
model, the partitioning of the arrays TNEW and TOLD for this example on 4 processes
would require each process to be allocated a data range of 250 array elements in order for
each processor to obtain a balanced workload (see, for example Figure 4). The CAPLib
function CAP_SETUPPART is passed the minimum and maximum of the accessed data range
and the number of processes. It returns to each process its own unique values for the minimum
and maximum of its partitioned data range (variables CAP_LTOLD and CAP_HTOLD in
Figure 3). If the example was partitioned onto 4 processes then
CAP_SETUPPART would return to process 1 the partition range 1 to 250, process 2 the
partition range 251 to 500, process 3 the partition range 501 to 750 and process 4 the partition
range 751 to 1000. Each process also requires an overlap region because of data assigned on
one process but used on a neighbouring process. This will necessitate the communication of
data assigned on one process into the overlap region of their neighbouring process. Due to the
organised partition of the data the overlap areas need only be updated from their
neighbouring processes. The data partition of the partitioned array TOLD in comparison with
the original un-partitioned array is shown in Figure 4.
[Figure content: the un-partitioned array TOLD, elements 1 to 1000, alongside the partitioned array split across PE 1 to PE 4 as ranges 1-250, 251-500, 501-750 and 751-1000, with lower and higher overlap areas updated from the neighbouring processes.]
Figure 4 Comparison of an un-partitioned and partitioned 1-D array.
The routine CAP_FINISH must be called at the end of a program run to successfully
terminate use of the library. On some machines, this call is necessary if control is to return to
the user once the parallel run has completed.
3.2 Point to Point Communication
The CAP_SEND and CAP_RECEIVE functions perform point to point communications
between two processors. Typically these functions appear in pipeline communications (see
section 3.4) but are also used to distribute data across the processor topology during
initialisation of scalars and arrays etc.
CAPLib has a selection of communication routines that allow the user to perform point to
point communications in a variety of ways. There are two main groups, blocking and
non-blocking and these are discussed separately in the next sections. Each communication
has the generic arguments of address (A), length (NITEMS), type (TYPE) and destination
(PID) with additional arguments depending on the routine. All the point-to-point routines are
summarised in Table 1.
3.2.1 Blocking Communication
Blocking communications do not return until the message has been successfully sent or
received. The non-cyclic blocking communications will not communicate beyond the
boundaries of the process topology when directional message destinations are given;
directions are indicated by a negative PID argument. For example, in a pipeline, the first
process will not send to its left, or the last process to its right. This will also be true of a ring
topology, grid and torus (multi-dimensional ring). Where communications are required to
loop around a topology like a ring or torus, as is the case for programs with cyclic partitions,
the cyclic routines can be used. These do not test for the end or beginning of a processor
topology.
Buffered routines are provided so that data that is non-contiguous can be buffered and sent as
a single communication. The extra arguments are STRIDE (stride length in terms of ITYPE
elements) and NSTRIDE (the number of strides). In other words, NSTRIDE lots of NITEMS
elements, STRIDE elements apart, will be communicated in each call. This approach avoids
the multiple start up latencies incurred using a communication for each section of data. On
most platforms there is a message size dependent limit at which point the time spent
gathering and scattering data to and from buffers can be greater than the latency effect of
using multiple communications. The buffered routines switch internally to non-buffered
communications if this limit is exceeded. This limit is currently set statically but in the future
it is hoped to perform an optimal calculation for the limit during the call to CAP_INIT.
CAPTools provides a user option to generate buffered or non-buffered communications.
3.2.2 Non-Blocking Communication
It is often the speed of communication that reduces the efficiency of parallel programs more
than anything else. To improve code performance, many parallel computers allow programs
to start sending (and receiving) several messages and then to proceed with other computation
asynchronously whilst this communication takes place. CAPLib supports this approach by
providing non-blocking sends and receives. Non blocking communications are implemented
in CAPLib using the underlying host systems non-blocking routines where possible. Where
such routines are not available, non-blocking routines have been implemented using a variety
of techniques, for example, communication threads running in parallel with the main user
code. Table 1 lists the non-blocking routines currently available in the library.
Non-blocking communication routines, e.g. CAP_ASEND, begin the non-blocking operation
but return to the user program as soon as the communication has been initiated. The
communication itself takes place in parallel with execution of the following user code. The
arguments are the same as for the blocking communications but with the addition of a
message synchronisation id as the last argument. To make sure a message has completed its
journey the user code calls a CAP_SYNC routine to test for completion, passing the
destination and synchronisation id as arguments. The CAP_SYNC routines either return
immediately, if a communication has finished, or wait for it to complete, if it has not. A
particular communication is completely identified by the message destination and the
synchronisation id.
Depending on the hardware and underlying communications library that CAPLib is ported to,
the implementation of the non-blocking routines can be done in several different ways. For
some implementations the synchronisation call is used to actually unpack the messages
because the underlying library does not provide a non-blocking receive using the same model
as CAPLib, for example the PVM implementation.
Buffered non-blocking communications are also handled differently depending on the
underlying library and hardware. Buffered non-blocking communications consist of two
stages, for a send, first the packing of data into a buffer, and then the communication of the
buffered data. A receive communication must first receive the buffered data and then unpack
it. If the parallel processor node is of a type that has a separate processor for communications,
that can be programmed to perform work asynchronously with the main processor, then the
packing and unpacking can be performed by the communications processor and overlapped
with computation done on the main processor. This relies on the communications processor
having dual memory access to the main processor's memory. The benefit of this is that both
stages of buffered communication are then performed in parallel with computation. The
Transtech Paramid [11] is a good example of such a system. However, it may be that the
communications processor is of lower speed than the main processor and the time taken to
unpack is actually longer than if the main processor had done the unpacking in serial mode
itself. CAPLib therefore makes use of this approach only where it would improve
performance. It is more often the case that parallel nodes consist of single processors and do
not provide any direct hardware support for non-blocking buffered communications. On such
systems, messages can still be received asynchronously, but the processor must do data
unpacking and there is no real parallel overlapping during the packing/unpacking stage.
Libraries such as MPI implement non-blocking communications on workstations using
parallel threads. Although this provides the mechanism for non-blocking buffered
communications, the thread runs on the same processor, so the unpacking is performed not in
parallel but through time slicing. No real parallel benefit is therefore gained in
packing/unpacking.
If the underlying communications library used by CAPLib does not directly support buffered
non-blocking communication then the unpacking must be performed at the synchronisation
stage, once the buffered message has been received. CAPLib implements this by keeping a
list of asynchronous communications and whenever a CAPLib synchronisation call is made,
all outstanding messages from the list are unpacked.
Because of the extra complexity of using non-blocking communications, it is common
practice to write or generate message-passing code that uses blocking communications as a
first parallelisation attempt. Once this version has been tested thoroughly and proved to give
correct results, a non-blocking version can be produced to optimise performance (in
CAPTools this merely requires clicking one button [8]).
Before data that has been transmitted using non-blocking functions can be used in the case of
a receive communication, or re-assigned in the case of a send communication, the completion
of the communication involving the data must be verified. For maximum flexibility and
efficiency, the communication model used by CAPLib for the ordering of message arrival and
departure, and for synchronising on message completion, is as follows:
Messages are sent in order of the calls made to send to a particular destination, D.
Messages are received in order of calls made to receive from a particular destination, D.
This implies that:
Synchronisation on the sending of message Mi to destination D guarantees that
messages Mi-1, Mi-2, ... sent to destination D have arrived.
In the example below the synchronisation using ISENDB by statement S3 on the
message sent by S2 also guarantees that the message sent by S1 has arrived.
S1 CALL CAP_ASEND(A,1,1,CAP_LEFT,ISENDA)
S2 CALL CAP_ASEND(B,1,1,CAP_LEFT,ISENDB)
S3 CALL CAP_SYNC_SEND(CAP_LEFT,ISENDB)
Synchronisation on the receiving of message Mj from a destination D guarantees that
messages Mj-1, Mj-2, ... have been received from destination D.
In the example below the synchronisation using IRECVB by statement S3 on the
message requested by S2 also guarantees that the message requested by S1 has been
received.
S1 CALL CAP_ARECEIVE(A,1,1,CAP_LEFT,IRECVA)
S2 CALL CAP_ARECEIVE(B,1,1,CAP_LEFT,IRECVB)
S3 CALL CAP_SYNC_RECEIVE(CAP_LEFT,IRECVB)
Waiting for completion of a send to a destination does not guarantee that a particular
receive has taken place from that destination and vice versa.
In the example below the synchronisation using ISENDB by statement S3 on the
message sent by S2 does not guarantee that the message requested by S1 has arrived.
S1 CALL CAP_ARECEIVE(A,1,1,CAP_LEFT,IRECVA)
S2 CALL CAP_ASEND(B,1,1,CAP_LEFT,ISENDB)
S3 CALL CAP_SYNC_SEND(CAP_LEFT,ISENDB)
Waiting for completion of a communication with a particular destination D does not
guarantee that any other sends or receives to or from another destination have
completed.
In the example below the synchronisation using ISENDB by statement S4 on the
message sent by S3 does not guarantee that the messages requested by S1 or sent by
S2 have arrived.
S1 CALL CAP_ARECEIVE(A,1,1,CAP_RIGHT,IRECVA)
S2 CALL CAP_ASEND(A,1,1,CAP_RIGHT,ISENDA)
S3 CALL CAP_ASEND(B,1,1,CAP_LEFT,ISENDB)
S4 CALL CAP_SYNC_SEND(CAP_LEFT,ISENDB)
This model is flexible enough to allow for the automatic generation of non-blocking
communications within CAPTools [8]. The ability to synchronise several messages in a
particular direction with one synchronisation (waiting for the last message to be sent is
enough to guarantee that all previous messages have been sent) makes code generation much
easier. It also reduces the overhead of synchronisation. The model also
allows for overlapping both sends and receives simultaneously to a particular destination and
for multiple tests on the same synchronisation id, which is essential for an automatic
overlapping code generation algorithm.
The flexibility of this model has allowed CAPTools to generate overlapping communications
with synchronisation that guarantees correctness in a wide range of cases. This includes loop
unrolling transformations and the synchronisation of overlapping communications in
pipelined loops. Code appearance is also enhanced by the merging of synchronisation points,
which is only possible with this communication model.
3.3 Exchanges (Overlap Area/Halo Updates)
For any array that is distributed across the process topology each process will have an overlap
region in the array that is assigned on another process (see Figure 4). These overlap areas are
updated when necessary. The overlap region is updated by invoking a call to
CAP_EXCHANGE, which performs a similar function to the MPI call MPI_SENDRECV.
This communication function sends data into a neighbouring process's overlap area as well
as receiving data into its own overlap region from the neighbouring process.
CAP_EXCHANGE must ensure that no deadlock occurs and must allow for
non-communication beyond the edge of the process topology for the end processes. Most
importantly, this type of communication is fully scalable, i.e. not dependent on the number of
processes, taking at most 2 steps to complete (see Figure 5). If the hardware allows
non-blocking communication, an exchange can be performed in 1 step by communicating in
parallel.
Figure 5 Communication pattern for blocking exchange operation on 16 processors.
3.4 Pipelines
A pipeline in a parallel code involves each processor performing the same set of operations
on a successive stream of data. Pipelined loops are a common occurrence in parallel CM
codes and are often essential to implement, for example, recurrence relations while
guaranteeing correctness of the parallel code. Because a pipeline serialises a loop, it must be
surrounded by outer loop(s) in order to achieve parallel speed-up. The main disadvantage of
pipelines is that during the pipeline process some processors will be idle at the start-up and
shut-down stages. Another disadvantage is the potentially significant overhead of the
numerous communication start-up latencies. Figure 6 shows a simple example of a loop that
has been parallelised using a pipeline.
C Serial code
      DO I=1,NI
        DO J=2,NJ
          A(I,J)=A(I,J-1)
        END DO
      END DO

C Parallel code
      DO I=1,NI
        CALL CAP_RECEIVE(A(I,CAP_LA-1),1,2,CAP_LEFT)
        DO J=MAX(2,CAP_LA),MIN(NJ,CAP_HA)
          A(I,J)=A(I,J-1)
        END DO
        CALL CAP_SEND(A(I,CAP_HA),1,2,CAP_RIGHT)
      END DO

Figure 6 Example of a Pipeline
With a low communication startup latency, good parallel efficiency can be achieved (see
section 9.1).
3.5 Commutative Operations
The Jacobi example in Figure 3 uses a convergence criterion based on DIFMAX. Since the
TNEW and TOLD arrays have been partitioned across the processors, each processor will
calculate its own local value for DIFMAX; however, it is necessary to calculate the global
value. Collective parallel computation operations (minimum, maximum, sum, etc.) take a
value or values assigned on each process and combine them with the values on all other
processes into a global result that all processes receive. This is performed by the CAPLib
function CAP_COMMUTATIVE, which is analogous to the MPI global reduction function
MPI_ALLREDUCE. The latency for this type of communication is dependent on the
number of processors. CAPLib minimises the effect of this by internally using a hyper-cube
topology where possible to perform the commutative operation. The commutative routines in
CAPLib are summarised in Table 1.
The routine CAP_COMMUTATIVE performs the commutative operation defined by the
passed routine FUNC on the data item VALUE across all processes. On entry to the routine,
VALUE holds a local value on each processor; on exit it contains a global value computed
from all local values across processors, using routine FUNC to combine contributions. For
example, the serial code to sum a vector is:
      SUM=0.0
      DO I=1,N
        SUM=SUM+A(I)
      END DO
The parallel equivalent of this using CAP_COMMUTATIVE, given that the array A has been
partitioned across processes is:
      SUM=0.0
      DO I=CAP_LOW,CAP_HIGH
        SUM=SUM+A(I)
      END DO
      CALL CAP_COMMUTATIVE(SUM,CAP_REAL,CAP_RADD)
In this example CAP_LOW and CAP_HIGH are the local low and high limits of assignment
to A on each processor. The procedure CAP_RADD is a predefined procedure to perform a
real value addition. A list of all predefined commutative functions is given in the CAPLib
user manual [12]. Each procedure has the same arguments, which means that
CAP_COMMUTATIVE can call it generically. For example, CAP_RADD is defined as:
      SUBROUTINE CAP_RADD(R1,R2,R3)
      REAL R1,R2,R3
      R3=R1+R2
      END
The routine CAP_DCOMMUTATIVE is a variant of CAP_COMMUTATIVE that
performs a commutative operation in one dimension of a grid or torus process topology, the
direction (e.g. left or up) being indicated by an additional argument. This type of
commutative operation can be necessary when a structured mesh code is partitioned in more
than one direction. The routine CAP_MCOMMUTATIVE provides for commutative
operations on an array of data rather than one item. Combining several
CAP_COMMUTATIVE calls to form one CAP_MCOMMUTATIVE call allows a
corresponding reduction in latency.
CAPTools generates code with commutative operations whenever it can match the code in a
loop to the pattern of a commutative operation. Without commutative communications, the
code generated would involve complex and convoluted communication.
An interesting observation is that commutative operations performed in parallel will actually
give answers with less round-off error than the corresponding serial code. For example,
consider the summation of ten million small numbers in serial. As the summation continues,
each small value is added to an increasingly large sum. Eventually the small numbers cease
to have an impact on the sum because of the limited accuracy of adding a small value to a
large one in computer arithmetic. The parallel version of the summation will first have each
processor sum its own section locally, then communicate the local summations and add them
to obtain a global sum. The accuracy will be greater since each local summation involves
fewer numbers, and therefore smaller differences in magnitude, than the complete serial
summation. In addition, the summation of the local summations to obtain the global value
combines numbers of relatively similar size. If this were not the case, the approach would not
be acceptable for many users performing parallelisations of existing serial code. Part of the
parallelisation process is to validate the parallel version against the serial version; obviously,
the parallelised code must produce the same results in order to pass the validation process.
Although a parallel commutative operation may not produce exactly the same result as the
serial one, it will at least be more accurate rather than less accurate, and so most validation
tests should be passed.
As well as producing results as near as possible to those of the serial version, commutative
operations must also produce the same answer on all processes. For example, the calculation
of the sum of the difference between two vectors, i.e. a residual value, is often used to
determine whether to terminate an iterative loop. If the calculated residual value is not the
same on all processors then it may cause the loop to terminate on some processes but iterate
again on others. Obviously, this will cause the parallel execution to deadlock. To obtain the
same results on all processes, the commutative operation must either be performed in the
same order on all processes, incurring the same round-off errors, or broadcast a single global
value.
A common array operation is to find the maximum or minimum value and its location in the
array. The equivalent parallel commutative operation must be performed in the same order to
return the same location as the serial code. If there are several occurrences of the
maximum/minimum value in the array then several processes might each find their own local
maxima/minima. In order to resolve this, the commutative operation must know the direction
in which the array is traversed. The routines CAP_COMMUPARENT and
CAP_COMMUCHILD provide a mechanism for this. The argument FIRSTFOUND (see
Table 1) determines how the commutative operation chooses a location. If FIRSTFOUND is
set to true then, for a maximum commutative calculation, it is the maximum value location
found on the lowest numbered processor in the given dimension that is returned on all
processes. This would be the case for a serial loop running from low to high through an array.
For example, consider the example in Figure 7 with data A=(/7, 9, 2, 2, 9, 5, 9/). Although
there are maxima at positions 2, 5 and 7, the serial code will set MAXLOC to 2 due to the
use of a strict greater-than test. The parallel code will similarly produce the result
MAXLOC=2 on all processors.
C Serial code
      MAXVAL=0
      MAXLOC=1
      DO I=1,N
        IF (A(I).GT.MAXVAL) THEN
          MAXVAL=A(I)
          MAXLOC=I
        ENDIF
      END DO

C Parallel code
      MAXVAL=0
      MAXLOC=1
      DO I=LOW,HIGH
        IF (A(I).GT.MAXVAL) THEN
          MAXVAL=A(I)
          MAXLOC=I
        ENDIF
      END DO
      CALL CAP_COMMUPARENT(MAXVAL,1,CAP_IMAX,.TRUE.)
      CALL CAP_COMMUCHILD(MAXLOC,1)

Figure 7 Example of CAP_COMMUPARENT and CAP_COMMUCHILD
If the test for the maximum value had been .GE. rather than .GT., then MAXLOC would be
set to the location of the last maximum value rather than the first, and therefore
FIRSTFOUND in CAP_COMMUPARENT would be set to .FALSE.
CAP_COMMUPARENT works by sending the processor number of the current maximum
value along with the maximum value as the commutative operation is performed among the
processors. In the commutative communication algorithms, the location for
CAP_COMMUPARENT is also packed into the message. CAP_COMMUPARENT
internally stores the processor that owns the desired location(s). This processor is then used
in any number of calls to CAP_COMMUCHILD to broadcast the correct value to all
processors.
3.6 Broadcast Operations
Broadcast operations are used to move data from one process to all other processes. The
simplest of these is a broadcast of data from the first process to all others, termed a master
broadcast. CAPLib provides the CAP_MBROADCAST routine to do this. In fact, rather than
sending data directly to all processes from the master process, the master broadcast will use
the same communication strategies as the CAP_COMMUTATIVE call. These strategies,
described in section 6, take advantage of the internal process topology to reduce the number
of communications and steps to complete the operation.
A second type of broadcast is the communication of data from a particular process to all
others. CAPLib provides the routine CAP_BROADCAST to do this. The OWNER argument
is passed in set to true for the process owning the data and false for all others.
CAP_BROADCAST is currently implemented as a commutative MAX style operation on the
OWNER argument to tell every process which particular process is the owner of the data to
be broadcast. The data is then transmitted from the owning process to the other processes in
an optimal fashion using the internal process topology.
4 CAPLib on the Cray T3D/T3E
4.1 Implementation
CAPLib has been ported to the Cray T3D and T3E using PVM, MPI and the SHMEM
library. The SHMEM version is described below. Of the three, the SHMEM CAPLib is by
far the fastest, its latency and bandwidth being a reflection of the performance of the
SHMEM library. Typical latency is under 7 μs and bandwidth greater than 100 MB/s for
large messages on the T3D, and 5 μs and 300 MB/s on the T3E. The SHMEM version of
CAPLib is written in C rather than Fortran because of the need to do indirect accessing.
Synchronous message passing was implemented using a simple protocol built on the Cray
SHMEM_PUT library routine, which is faster than SHMEM_GET. Figure 8 shows this
protocol used to send data between two processors.
Figure 8 CAPLib protocol used for communication on T3D/T3E using SHMEM library.
The receiving processor first writes the starting address into which it wishes to receive data
to a known location on the sending processor, and then waits for the sending processor to
write the data and send a write-data-complete acknowledgement. The sending processor
waits on a tight spin lock (busy wait loop) for a non-zero value in the known location. When
the address has arrived, it uses SHMEM_PUT to place its data directly into that address on
the receiving processor. The sending processor then calls SHMEM_QUIET to make sure the
data has arrived, and then sends a write-data-complete acknowledgement to the receiving
processor. The pseudo code for this procedure is shown in Figure 9.
send(a, n, cn)
/* send data a(n) to channel cn (processor cn2p(cn)) */
{
  pe = cn2p(cn)
  /* wait for address from receiving processor to arrive in addr(pe) */
  while (!addr(pe)) ;
  /* send data */
  shmem_put(*addr(pe), a, n, pe)
  /* wait for completion */
  shmem_quiet()
  /* ack send complete */
  shmem_put(ack(mype), 1, 1, pe)
  /* reset address */
  addr(pe) = 0
}

recv(b, n, cn)
/* recv data b(n) from channel cn (processor cn2p(cn)) */
{
  pe = cn2p(cn)
  /* place recv address in sending pe at address addr(mype) */
  shmem_put(addr(mype), &b, 1, pe)
  /* wait for data ack to arrive */
  while (!ack(pe)) ;
  /* reset ack */
  ack(pe) = 0
}

Figure 9 Pseudo code for send/recv using Cray SHMEM calls
To obtain maximum performance, all internal arrays and variables involved in a
communication are cache aligned using compiler directives.
To avoid any conflicts in all-to-all communications, the variables used to store addresses and
act as acknowledgement flags are all declared as arrays, with the T3D processor number
being used to reference the array elements. In this way each send address and data
acknowledgement can only be set by one particular processor.
Asynchronous communication has been partially implemented by removing the wait on the
write-data-complete acknowledgement in the receive and placing it in CAP_SYNC_RECV.
The send operation is not currently fully asynchronous since it cannot start until it receives
an address from the receiving processor to send data to.
Commutative operations have also been implemented using these low-level functions, and
the hyper-cube B method (see section 6.3) is the default commutative method employed.
Pahud and Cornu [13] show that communication locality can influence communication times
in heavily loaded networks on the T3D. CAPLib uses the location of each processor within
the processor topology shape allocated to a particular run to determine CAP_PROCNUM
(the CAPTools processor number) in an optimal way so as to minimise communication time.
The numbering is chosen to provide a pipeline of processors through the 3-D topology shape
so that the number of hops from processor CAP_PROCNUM to processor
CAP_PROCNUM+1 is minimised.
Another way of improving communication performance for some parallel programs
(particularly all-to-all style communication) is to order the communications so that an
optimum communication pattern is used, reducing the number of steps to perform a
many-to-many operation. Unstructured mesh codes will often use this type of operation.
4.2 Performance
This section discusses the performance of the different CAPLib point-to-point and exchange
message passing functions on the Cray T3D and T3E. The speed of other message passing
libraries and CAPLib performance are compared where possible. Figure 10 and Figure 11
show the latency and bandwidth respectively on the T3D for SHMEM versions of
CAP_SEND (synchronous), CAP_ASEND (non-blocking), CAP_EXCHANGE and
CAP_AEXCHANGE (non-blocking). As a comparison, these graphs also show timings for
MPI_SEND, MPI_SSEND and MPI_SENDRECV. Figure 12 and Figure 13 show similar
graphs for the Cray T3E.
An examination of these figures shows that CAPLib performs at least as well as the standard
MPI implementation on each machine. The CAPLib SHMEM implementation is superior to
using MPI or PVM calls in both latency and bandwidth. Generally, the overhead of using the
CAPLib layer over MPI instead of direct calls to MPI is negligible. CAP_SEND
implemented using SHMEM has a startup latency of around 7 μs. The overall bandwidth
obtained on the T3E for all communication measurements is far higher than that of the T3D.
The bandwidth for CAP_SEND on the T3D for messages of 64KB is around 116 Mbytes/sec;
on the T3E this figure is 297 Mbytes/sec. This is due to hardware improvements between the
two systems. CAP_EXCHANGE has been implemented on the Cray systems under SHMEM
to partially overlap the pair-wise send and receive communications it performs, and this is
reflected in the bandwidth obtained: 143 Mbytes/sec on the T3D and 416 Mbytes/sec on the
T3E. Note that the bandwidths for MPI_SENDRECV (50 Mbytes/sec on the T3D and 284
Mbytes/sec on the T3E) are very poor in comparison with CAP_EXCHANGE. Each
performs a similar communication, a send and receive to other processors, but
CAP_EXCHANGE is able to schedule its communications so as to overlap because it is
based on directional communication, whereas MPI_SENDRECV communication is based on
processor numbers only and is unable to do this.
The graphs for the figures are obtained by performing a Ping-Pong communication many
times and taking average values. However, the non-blocking communication Ping-Pong test
has synchronisation after each communication. In this respect, the non-blocking results are
artificial in that they do not reflect the greater performance that will be obtained in real codes
where synchronisation will generally be performed after many communications. The graphs
for CAP_ASEND and CAP_AEXCHANGE therefore give a measure of the overhead of
performing synchronisation on non-blocking communication and do not reflect the latency
and bandwidth that is obtained in real use.
Figure 10 CAPLib communication latency on Cray T3D
Figure 11 CAPLib communication bandwidth on Cray T3D
Figure 12 CAPLib communication latency on Cray T3E
Figure 13 CAPLib communication bandwidth on Cray T3E
5 CAPLib on the Paramid
5.1 Implementation
The Transtech Paramid version of the CAPTools communications library uses the low-level
Transtech/Inmos i860 toolset communications library [11]. The Paramid's dual-processor
node architecture makes it ideal for non-blocking asynchronous communications, since the
Transputer part of a node can be performing communication whilst the i860 is computing.
Non-blocking communications are implemented for the Paramid in CAPLib using an
asynchronous router program that runs on the Transputer. To minimise latency for small non-
blocking communications, the period of synchronisation between the Transputer and the i860
during initialisation of a non-blocking communication must be kept to a minimum. In
addition, the amount of effort required for the Transputer to send and receive asynchronously
must be as small as possible, and as near to the time for a normal direct synchronous send as
possible. Figure 14 shows a process diagram of the router process that executes on the
Transputers during runs that use the asynchronous version of CAPLib. The diagram shows
the threads in the router process for sending data asynchronously down one channel. For
each channel pair (IN and OUT channels), there will be two sets of these threads to allow
independent communication in both directions. This arrangement is also duplicated for each
channel connection to other nodes. The send and receive threads of the router process are
linked by channels over the Transputer links to the destination node's corresponding send
and receive threads (where more links are needed than are physically available, implicit use
is made of the INMOS virtual routing mechanism [14]).
For every channel, the client send thread processes send requests and places them in a send
request queue. A similar action is performed for receive requests by the client receive thread.
The send thread removes requests from the send queue and communicates the data as soon as
the corresponding receive thread on the other processor is ready to receive, i.e. when the
receive thread has itself removed a request to receive from the receive request queue. The
send and receive threads update an acknowledgement counter for each channel so that the
user's program can synchronise on the completion of certain communications. It is worth
emphasising that, using this model, communication down one channel is completely
independent of communication down another. It is up to the user's program to synchronise at
the correct point to guarantee the validity of data communicated in each direction.
Figure 14 Transputer router process for asynchronous communication on Transtech Paramid
5.2 Performance
Figure 15 and Figure 16 give the latency and bandwidth characteristics of CAPLib on the
Paramid. The best latency is around 33 μs, with the bandwidth approaching peak
performance at around the 500-byte message size. Notice that the peak bandwidth of
CAP_AEXCHANGE is roughly twice that of CAP_SEND, showing that it is performing its
send and receive communications asynchronously in parallel. The latency cost for small
messages (~40 bytes) is higher than for the synchronous CAP_EXCHANGE because of the
extra complexity of setting up an asynchronous communication. However, in real
applications the increased asynchronous latency will usually be hidden by the overall benefit
of performing computation whilst communicating.
Figure 15 CAPLib latency on Transtech Paramid
Figure 16 CAPLib bandwidth on Transtech Paramid
6 Optimised Global Commutative Operations
As global commutative operations usually only involve the sending and receiving of very
small messages, typically 4 bytes, it is the communication startup latency which will
dominate the time taken to perform the commutative operation. This is because the
communication startup latency is relatively expensive on most parallel machines. It is for this
reason that in many parallelisations, commutative operations can be a governing factor
affecting efficiency and speed up. It is extremely important, therefore, to implement
commutative operations as efficiently as possible. In order to do this, the commutative
routines in CAPLib take advantage of the processor topology, that is, how each processor
may communicate with other processors.
Many of the parallel machines on the market today are connected using some kind of
topology to facilitate fast communication in hardware. For example, processors in the Cray
T3D are connected to a communications network arranged as a 3-D torus. However, although
the hardware is connected as a torus, there is in fact no limitation on what processors a
particular processor may talk to at the hardware level; the communication hardware will route
messages from one processor to another around the torus as needed. From the perspective of
the methods used to perform commutative (and broadcast) operations it is this direct
processor to processor topology that is important, not the underlying hardware topology that
implements it. This means, for example, that although the Cray T3D is based on a 3-D torus,
for commutative operations internally within CAPLib it is considered fully connected. The
commutative topology used internally within CAPLib will therefore depend on the direct
processor-to-processor routing available on the machine the program is running on. The
available commutative methods are then directly related to the commutative topology.
Currently CAPLib supports pipe, ring, grid and two different hyper-cube commutative
methods.
In order to compare the efficiency of each method we define the following:
P = The number of processes.
C = The total number of communications for a commutative operation.
S = The total number of steps involved in the method. We define a step as a number of
communications performed in parallel such that the time/latency of all the
communications is equivalent to that of one communication. Some communication
devices are serial, allowing only one communication at a time. For example, the
Ethernet connecting workstations is a serial communications device, since only one
packet may be present on the Ethernet at any one time. For these devices, although we
can consider the communications in one step as taking place in parallel for the
purposes of analysis, they will in fact be serialised in practice.
The key to efficient commutative operations is to perform as much communication in parallel
as possible, i.e. by minimising the number of parallel communication steps needed to perform
the commutative operation, the effect of communication startup latency will be minimised.
The time for the commutative operation to take place is approximately proportional to the
number of communication steps, S. This is the most important term to reduce.
The communication time between processes is often affected by the number of
communications occurring simultaneously. It is therefore important that both the overall
number of communications and the number of communications per step are also minimised.
The type of communication taking place at each step also determines performance. If all the
communications in a step are between neighbour processors then there will be little
contention on the communication network as the communications take place. If the
communications are not to nearest neighbours then the number of communications will affect
the time to complete the step since the routing mechanism of the hardware will be used to
deliver messages and contention may occur.
If the process topology has not been mapped well onto the hardware topology, it will often be
the case that communication from a nearest neighbour process is not in hardware a
communication between nearest neighbour processors. For example, a ring topology
implemented onto a pipeline of processors will require messages between the last and
first processor to be sent via a routing mechanism. Communications
along this link will always be slower than along the other links and in a commutative
communication step the slowest communication will determine the time for the step.
6.1 Commutative Operation using a Pipeline
Figure 17 shows a diagram of a pipeline of processes and the communication pattern for a
commutative operation. The number of communications and steps is proportional to the
number of processes. This is because the value contributed from each process must be passed
down to the last process and then the result is passed back up the pipeline again.
[Figure omitted: time diagram of a commutative operation on a 16-process pipeline]

S = 2(P - 1)
C = 2(P - 1)

Figure 17 Commutative operations using a pipeline connection topology.
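The two phases described above (values combined on the way down the pipe, result returned on the way back up) can be sketched as a simple simulation; plain Python stands in for real message passing, and the function name is illustrative, not a CAPLib routine.

```python
def pipeline_commutative(values, op):
    """Return (result, number_of_communication_steps) for a pipeline of
    len(values) processes, each contributing one value to the operation."""
    p = len(values)
    acc = values[0]
    steps = 0
    # Phase 1: P - 1 steps carry the running result down to the last process.
    for i in range(1, p):
        acc = op(acc, values[i])   # one neighbour-to-neighbour communication
        steps += 1
    # Phase 2: P - 1 further steps return the result back up the pipe.
    steps += p - 1
    return acc, steps

result, steps = pipeline_commutative([1, 2, 3, 4], lambda a, b: a + b)
# For P = 4 this takes S = 2(P - 1) = 6 steps.
```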
The number of steps for a ring commutative operation is the same as for a pipeline but the
number of communications is higher. On some hardware, this will give a pipe topology the
edge in performance over the ring. If a commutative operation can be performed around the ring using non-blocking communications then the number of steps can
be halved. Communication around a ring requires all the values to be accumulated in an array
in process order during communication and then the commutative computation performed
using the array once all values have been communicated. This is to avoid round off problems
and guarantees that each processor calculates the same result. Buffer space is required on
each process to perform this operation and for a very large parallel run, i.e. thousands of processes, this may be disadvantageous. If it is possible for the hardware to perform
communication simultaneously in both directions then the performance can be even higher
since values can travel both ways around the ring at the same time, reducing the distance to
the furthest process to P/2.
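The accumulate-then-reduce scheme used around the ring can be sketched as follows. The shared list below stands in for the P per-process buffers filled in process order during circulation; all names are illustrative, not part of CAPLib.

```python
def ring_commutative(values, op):
    """Every process gathers all P values in process order (after P - 1
    neighbour-to-neighbour steps around the ring), then reduces its buffer
    locally in that fixed order, so round-off is identical on all processes."""
    p = len(values)
    buffers = [list(values) for _ in range(p)]  # one ordered buffer per process
    results = []
    for buf in buffers:
        acc = buf[0]
        for v in buf[1:]:
            acc = op(acc, v)      # same order everywhere => same result
        results.append(acc)
    return results

res = ring_commutative([0.1, 0.2, 0.3, 0.4], lambda a, b: a + b)
# All entries of res are bit-identical, even for floating point.
```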
6.2 Commutative Operation using a Grid
Figure 18 shows a diagram of a 2D grid of processes and the communication pattern for a
commutative operation. Each stage of the commutative operation is across one of the
dimensions, d, of the grid. This method would be used when each processor in a grid can only
talk directly to its grid neighbours; otherwise it is advantageous to use a hyper-cube method
(see next section).
[Figure omitted: communication pattern on an 8 x 8 grid of processes; Stage 1 reduces along one dimension, Stage 2 along the other]

S = 2 * sum over i = 1..d of (Pi - 1)
C = 2 * sum over i = 1..d of (Pi - 1) * product over j = 1..d, j != i of Pj

Figure 18 Communication pattern for commutative operation using a grid.
where Pi is the number of processors in dimension i and d is the number of dimensions.
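Under these definitions the step and communication counts for the staged grid method can be computed directly. The helper below is a sketch of the cost formulas only (not a CAPLib routine), assuming the reconstruction of the counts given above.

```python
from math import prod

def grid_commutative_cost(dims):
    """S and C for a commutative operation performed stage by stage across
    each dimension of a process grid, where dims[i] is the number of
    processors in dimension i."""
    d = len(dims)
    S = 2 * sum(p - 1 for p in dims)
    C = 2 * sum((dims[i] - 1) * prod(dims[j] for j in range(d) if j != i)
                for i in range(d))
    return S, C

# For the 8 x 8 grid of Figure 18:
S, C = grid_commutative_cost([8, 8])
```

Note that for a single dimension the formulas collapse to the pipeline case, S = C = 2(P - 1).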
6.3 Commutative Operation using Hyper-cubes
In a hyper-cube topology of dimension d, each process is connected directly to d other
processes. Algorithms implemented using a hyper-cube offer the best performance generally
over other methods because the number of steps to perform a commutative operation is
related to d, i.e. 2d, not the number of processes, P. For non-trivial P, the hyper-cube offers far greater performance than any other topology.
There are a number of ways to implement a commutative operation on a hyper-cube. Two
methods are currently implemented in CAPLib. Method A uses a pair-wise exchange
between processes until every process has the result. Method B uses a binary tree algorithm.
Both rely on the connectivity offered by the hyper-cube. Both methods A and B guarantee the
order of computation will be the same on every process and therefore the values obtained will
be the same on all processes. This is obviously the case with method B. In method A, this is
guaranteed by always placing the contribution from the lower numbered processor
on the left hand side of the operation. The pair-wise exchange of data that characterises the
Method-A operation can be further improved if non-blocking communications are used. Overlapping
the exchange of data reduces the number of steps by a factor of two, but relies on the
performance of two small non-blocking communications out-performing two small blocking
communications. CAPLib does not currently implement a non-blocking version of Method-A.
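The pair-wise exchange of Method A can be sketched as a simulation; real message passing is replaced by array indexing, and the function name is illustrative. Note how the lower-numbered contribution is always placed on the left, so every process applies the operation in the same order.

```python
def hypercube_commutative_A(values, op):
    """Pair-wise exchange on P = 2**d processes: at step k each process
    exchanges its partial result with the partner whose rank differs in
    bit k, combining with the lower-numbered contribution on the left."""
    p = len(values)
    d = p.bit_length() - 1
    assert p == 1 << d, "sketch assumes a complete hyper-cube"
    partial = list(values)
    for k in range(d):
        nxt = []
        for rank in range(p):
            partner = rank ^ (1 << k)
            lo, hi = min(rank, partner), max(rank, partner)
            nxt.append(op(partial[lo], partial[hi]))  # lower rank on the left
        partial = nxt
    return partial   # after d steps every process holds the same result

res = hypercube_commutative_A(list(range(1, 9)), lambda a, b: a + b)
# d = 3 exchange steps for P = 8; every entry of res is the full sum.
```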
[Figure omitted: time diagram of pair-wise exchanges among 16 processes]

Method A:
S = 2d (blocking exchange)
S = d (non-blocking exchange)
C = dP (d > 1)

Figure 19 Communication pattern for commutative operation using a hyper-cube (method A, d=4)
[Figure omitted: time diagram of binary-tree reduction and broadcast among 16 processes]

Method B:
S = 2d
C = 2^(d+1) - 2

Figure 20 Communication pattern for commutative operation using a hyper-cube (method B, d=4).
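Method B's tree strategy can be sketched in the same simulated style: contributions are reduced up a binary tree to process 0, then the result is broadcast back down. The names and the comms counter are illustrative, not part of CAPLib.

```python
def hypercube_commutative_B(values, op):
    """Binary-tree reduction to process 0 followed by a tree broadcast of
    the result: (P - 1) communications up plus (P - 1) down, i.e.
    C = 2**(d+1) - 2 for P = 2**d processes, in S = 2d steps."""
    p = len(values)
    d = p.bit_length() - 1
    assert p == 1 << d, "sketch assumes a complete hyper-cube"
    partial = list(values)
    comms = 0
    for k in range(d):                       # reduction phase: d steps
        for rank in range(0, p, 1 << (k + 1)):
            partial[rank] = op(partial[rank], partial[rank | (1 << k)])
            comms += 1
    comms += p - 1                           # broadcast phase: d more steps
    return partial[0], comms

result, comms = hypercube_commutative_B([1] * 16, lambda a, b: a + b)
# For P = 16 (d = 4): result is the full sum, comms = 2**5 - 2 = 30.
```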
In order to use these methods in runs where the number of processes does not exactly make
up a hyper-cube, the methods must be modified to account for this. For method A, if we
denote by k the number of processes in excess of the largest complete hyper-cube, then the
last k processes send their values to the first k processes before the main part of the
procedure begins. This ensures the values from these processes are used. When the main
procedure is complete, the last k processes receive the result. Method B handles excess
processes by extending the binary tree communication strategy to include the extra k processes.
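The pre-fold step for Method A described above might look like this in outline; the helper name and the use of plain lists are notational assumptions.

```python
def pre_fold(values, op):
    """When P is not a power of two, fold the k excess values into the
    first k processes before the main pair-wise exchange begins; the
    excess processes receive the final result afterwards."""
    p = len(values)
    m = 1 << (p.bit_length() - 1)   # largest complete hyper-cube, m <= p
    k = p - m                        # number of excess processes
    core = list(values[:m])
    for i in range(k):
        core[i] = op(core[i], values[m + i])  # process m+i sends to process i
    return core, k

core, k = pre_fold(list(range(6)), lambda a, b: a + b)
# P = 6: m = 4, k = 2, so the values of processes 4 and 5 are folded
# into processes 0 and 1 before the main exchange.
```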
6.4 Comparison of Commutative Methods
Table 2 shows a comparison of the number of steps and the number of communications
needed for a commutative operation using the methods implemented inside CAPTools.
   P      Pipe             Ring              Hyper-cube A             Hyper-cube B
          Steps  Comms     Steps  Comms      Steps   Steps    Comms   Steps  Comms
                                             (sync)  (async)
    2         2      2         2        2        2       1        2       2      2
    4         6      6         6       12        4       2        8       4      6
    8        14     14        14       56        6       3       24       6     14
   16        30     30        30      240        8       4       64       8     30
   32        62     62        62      992       10       5      160      10     62
   64       126    126       126     4032       12       6      384      12    126
  128       254    254       254    16256       14       7      896      14    254
  256       510    510       510    65280       16       8     2048      16    510
  512      1022   1022      1022   261632       18       9     4608      18   1022
 1024      2046   2046      2046  1047552       20      10    10240      20   2046

Table 2 Number of steps S, and communications C, for a commutative operation using different methods.
Obviously the Hyper-cube methods are the best for P>4; the pipe and ring methods would
only be used on machines where the hyper-cube is not available, for example, machines built
of hard-wired directly connected processors in a pipeline or grid. Each of the hyper-cube
methods performs the operation in d steps, but B takes fewer communications overall than A,
for P>2. For a large number of processes this factor becomes very important as time for a
large number of simultaneous communications in one step can be affected by message
contention across the hardware processor interconnect. For A, the number of messages
remains constant at each step in a commutative operation at P/2. The number of
communications in each successive step using method B reduces by a factor of 2 and
therefore any contention is minimised to the first few steps. The number of steps needed to
complete the operation using A can however be halved if non-blocking communications are used.
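The formulas behind Table 2 can be collected in one place; the sketch below simply reproduces the counts for each method and is not part of CAPLib.

```python
def commutative_costs(P):
    """Steps S and communications C for each method, for P a power of two
    with hyper-cube dimension d = log2(P)."""
    d = P.bit_length() - 1
    assert P == 1 << d, "table assumes P is a power of two"
    return {
        "pipe":        {"S": 2 * (P - 1), "C": 2 * (P - 1)},
        "ring":        {"S": 2 * (P - 1), "C": P * (P - 1)},
        "hypercube_A": {"S_sync": 2 * d, "S_async": d, "C": d * P},
        "hypercube_B": {"S": 2 * d, "C": 2 ** (d + 1) - 2},
    }

c = commutative_costs(16)
# Matches the P = 16 row of Table 2: pipe 30/30, ring 30/240,
# hyper-cube A 8/4/64, hyper-cube B 8/30.
```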
Figure 21 shows a graph of communication latency for CAP_COMMUTATIVE using
CAPLib over SHMEM on the Cray T3D using a pipeline and the two hyper-cube methods.
The graph clearly demonstrates the effect of using different global communication
algorithms. Global communication using a pipeline becomes rapidly more expensive as the
number of processors increases. The best performance is given by the Hyper-cube B
algorithm. Note that in this case MPI_ALLREDUCE which is the MPI equivalent to
CAP_COMMUTATIVE does not perform as well as the Hyper-cube methods employed by
CAP_COMMUTATIVE. Indeed, the CAP_COMMUTATIVE function has performed better
than the corresponding MPI_ALLREDUCE function in all ports of CAPLib so far
undertaken.
[Figure omitted: graph of time (microseconds, 0-700) against number of processors (1 to 1000, log scale) for PIPELINE, HYPERCUBE A, HYPERCUBE B and MPI_ALLREDUCE, using CAPLib over SHMEM on the Cray T3D]

Figure 21 CAP_COMMUTATIVE latency on Cray T3D
7 CAPLib Support Environment
One of the major reasons that parallel environments are often difficult to use is the amount of
configuration and details the user must know about the system in order to successfully
compile and run their parallel programs. As part of the CAPTools parallelisation
environment, a set of utilities is provided to aid users in compiling, running and debugging
their parallel programs. The main utilities are capf77 and capmake, which allow compilation
of the user's source code; caprun, which provides a mechanism for parallel execution of the
user's compiled executable; and capsub, which provides a simple generic method for
submitting jobs to parallel batch queues. The characteristics of the utilities are:-
Simple to use The utilities hide from the user as much as possible the details of the
compilation and execution of parallel programs. Parallel compilation usually requires extra
flags on the compile line and special libraries linked in. Many parallel environments require a complex initialisation process to begin the execution of a parallel program. Parallel execution
often fails, not because the user's program is incorrectly coded, but because they have
wrongly configured the parallel environment in some way. By hiding the messy details of
configuration from the user, execution becomes both quicker and more reliable. In many
cases, the users do not need a detailed knowledge of the parallel environment they are
utilising at all.
Generic interface Each utility uses a set of common arguments across the domains of
parallel environment (e.g. MPI) and machine type, e.g. Cray T3D. This makes it easy for the
user to migrate from one machine or parallel environment to another. The main generic
arguments are:-
-mach  Machine type, e.g. Sun, Paramid, T3D.
-penv  Parallel environment type, e.g. PVM, MPI, i860toolset, shmem.
-top  Parallel topology type, e.g. pipe2, ring4, full6, grid2x2.
-debug n1 n2..  Execute in debug mode on processors n1, n2 etc.
When a utility is executed it first checks for the existence of the environment variables
CAPMACH and CAPPENV that provide default settings for the machine type and parallel
environment type. These can be set manually by the user in their login script or by the
execution of the usecaplib script, which attempts to determine these automatically from the
host system. The command line argument versions of the environment variables can be used
to override any defaults.
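The precedence rule described above (command-line argument over environment variable over built-in default) amounts to the following; the helper name is hypothetical and not part of the CAPLib scripts.

```python
import os

def resolve_setting(cli_value, env_name, default=None):
    """A command-line argument wins; otherwise fall back to the CAPMACH /
    CAPPENV environment variable; otherwise use any built-in default."""
    if cli_value is not None:
        return cli_value
    return os.environ.get(env_name, default)

os.environ["CAPMACH"] = "t3d"
machine = resolve_setting(None, "CAPMACH")    # taken from the environment
machine = resolve_setting("sun", "CAPMACH")   # command line overrides it
```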
8 Parallel Debugging
The debugging of parallel message passing code often requires the user to start up multiple
debuggers and trace and debug the execution on several processes. The main disadvantage
of having several debuggers running on the workstation screen is the large amount of
resource, both in computer time and physical memory, that this can require. Each debugger (with a graphical user interface) may require 40 Mbytes, and starting up several debuggers or
attaching to several running processes can take minutes on a typical workstation. Recently
computer vendors and third party software developers have begun to address this issue by
allowing the debugger to handle more than one process at a time and allowing the user to
quickly switch from one process to another. This dramatically reduces the memory cost since
only one debugger is now running and, if the same executable is running on all processors,
only a single set of debugging information need be loaded. Examples of commercial
debuggers that provide such a facility are TotalView [15] and Sun Microsystems's Workshop
development environment [16]. Cheng and Hood in [17] describe the design and
implementation of a portable debugger for parallel and distributed programs. Their client-
server design allows the same debugger to be used on both PVM and MPI programs; they suggest that the process abstractions used for debugging message-passing can be adopted to
debug HPF programs at the source level. Recently the High Performance Debugging Forum
[18] has been established to define a useful and appropriate set of standards relevant to
debugging tools for High Performance Computers.
The caprun script has a -debug argument that allows users to specify a set of
processes that they wish to debug. On systems that do not yet provide a multi-process
debugger but do provide some mechanism to debug parallel processes, using this option will
result in a set of debuggers appearing on the screen attached to the chosen process set.
CAPLib also provides a library routine called CAP_DEBUG_PROC that allows a debugger
to be attached to an already running process where this is possible, perhaps following some error condition. When a process calls CAP_INIT, one of the tasks undertaken is to check
command line arguments and environment variables. If -debug is found then a call is made to
CAP_DEBUG_PROC that calls a machine-dependent system routine to run the script
capdebug. This script is passed a set of information such as the calling process-id, DISPLAY
environment variable and executable pathname that allows a debugger to be started up,
attached to the calling process and displaying on the host machine's screen. The caprun script
also has a -dbxscript argument that allows the user to specify a set of debugger
commands to be executed by each debugger when starting up.
As an example
caprun -m sun -p pvm3 -top ring5 -debug 1-3 -dbxscript stopinsolve jac
This will start up 3 debuggers attached to processes 1-3 on the user's workstation; all
debuggers will then execute the script stopinsolve, which might contain:
print cap_procnum
stop in solve
cont
This would print the CAPTools processor number, set a break point in routine solve and
continue program execution.
9 Results
This section gives a series of results obtained for parallelisations using CAPTools and
CAPLib, of two of the well-known NAS Parallel Benchmarks (NPB) [19], APPLU (LU) and
APPBT (BT). The LU code is a lower-upper diagonal (LU) CFD application benchmark. It
does not, however, perform a LU factorisation but instead implements a symmetric
successive over-relaxation (SSOR) scheme to solve a regular-sparse, block lower and upper
triangular system. BT is representative of computations associated with the implicit operators
of CFD codes such as ARC3D at NASA Ames. BT solves multiple independent systems of
non-diagonally dominant, block tridiagonal equations. The codes are characterised in parallel form by pipeline algorithms, making all codes sensitive to communication latency.
The results for the benchmarks refer to three different versions/revisions of the same code.
Rev 4.3 is a serial version of the benchmarks written in 1994 as a starting point for optimised
implementations. Version NPB2.2 is a parallel version of the codes written by hand by
NASA and using MPI communication calls. Version NPB2.3, the successor to NPB2.2, has
both a serial and parallel version. The results presented here are for runs of CLASS A,
64x64x64 size problems. For each code, SPMD parallelisations using a 1-D and in some
cases a 2-D partitioning strategy were produced using CAPTools. The results for runs using
these parallelisations on the Cray T3D, Transtech Paramid and the SGI Origin2000 are
presented in the following sections together with results for runs of the NPB2.2/2.3 parallel MPI versions.
9.1 LU
The results for LU runs on the Cray T3D, T3E, SGI Origin 2000 and Transtech Paramid are
shown in Figure 22 to Figure 25 respectively. The T3D and T3E results compare the
performance of 1-D and 2-D parallelisations of LU using CAPTools. The 1-D version can
only be run on a maximum of 64 processors because of the size of problem being solved
(64x64x64). The 2-D version was run up to 8x8 processors and gives very reasonable results.
Figure 23 shows graphs of execution time for 1-D and 2-D parallelisations of LU using
CAPTools on the Cray T3E with different versions of CAPLib. The best results are given, as expected, by the SHMEM version of CAPLib, although for the 2-D runs the differences are
quite small. These small differences are in part due to the pipelines present in LU code. The
1-D version has pipelines with a much longer startup and shutdown period than the 2-D
version and therefore performance is more dependent on the startup latency of the
communications. Another factor is the memory access patterns required for communication
in the 2nd dimension which use buffered CAPLib calls such as CAP_BSEND/BRECEIVE
that gather data before sending and scatte