7/28/2019 Caplib Paper 013
1/33
CAPLib: A THIN LAYER MESSAGE PASSING LIBRARY TO SUPPORT COMPUTATIONAL MECHANICS CODES ON
DISTRIBUTED MEMORY PARALLEL SYSTEMS
By
P F Leggett, S P Johnson and M Cross
Parallel Processing Research Group
Centre for Numerical Modelling and Process Analysis
University of Greenwich
London SE18 6PF
UK.
ABSTRACT
The Computer Aided Parallelisation Tools (CAPTools) [1] are a set of interactive tools aimed
at providing automatic parallelisation of serial Fortran Computational Mechanics (CM)
programs. CAPTools analyses the user's serial code and then, through stages of array
partitioning, mask and communication calculation, generates parallel SPMD (Single Program
Multiple Data) message passing Fortran.
The parallel code generated by CAPTools contains calls to a collection of routines that form
the CAPTools Communications Library (CAPLib). The library provides a portable layer and
user-friendly abstraction over the underlying parallel environment. CAPLib contains
optimised message passing routines for data exchange between parallel processes and other
utility routines for parallel execution control, initialisation and debugging. By compiling and
linking with different implementations of the library the user is able to run on many different
parallel environments.
Even with today's parallel systems the concept of a single version of a parallel application
code is more of an aspiration than a reality. However, for CM codes the data partitioning
SPMD paradigm requires a relatively small set of message-passing communication calls. This
set can be implemented as an intermediate thin layer library of message-passing calls that
enables the parallel code (especially that generated automatically by a parallelisation tool
such as CAPTools) to be as generic as possible.
CAPLib is just such a thin layer message passing library that supports parallel CM codes,
by mapping generic calls onto machine specific libraries (such as CRAY SHMEM) and
portable general purpose libraries (such as PVM and MPI). This paper describes CAPLib
together with its three perceived advantages over other routes:
as a high level abstraction, it is both easy to understand (especially when generated
automatically by tools) and to implement by hand, for the CM community (who are not
generally parallel computing specialists);
the one parallel version of the application code is truly generic and portable;
the parallel application can readily utilise whatever message passing libraries on a given
machine yield optimum performance.
1 Introduction
Currently the most reliable and portable way to implement parallel versions of computational
mechanics (CM) software applications is to use a domain decomposition data partitioning
strategy to ensure that data locality is preserved and inter-processor communication is
minimised. The parallel hardware model assumes a set of processors, each with its own
memory, linked in some specified connection topology. The parallelisation paradigm is
single program multiple data (SPMD); that is, each processor runs the same application except
using its own local data set. Of course, neighbouring processors (at least) will need to
exchange data during the calculation and this must usually be done in a synchronised manner,
if the parallel computation is to faithfully emulate its scalar equivalent. One of the keys to
enabling this class of parallel application is the message-passing library that enables data to
be efficiently exchanged amongst the processors comprising the system.
Up until the early 1990s, parallel vendors typically provided their own message passing
libraries, which were naturally targeted at optimising performance on their own hardware.
This made it very difficult to port a CM application from one parallel system to another. In
the early 1990s, portable message passing libraries began to emerge. The two most popular
such libraries are PVM [2] and MPI [3]. One or other, or both of these libraries is now
implemented on most commercial parallel systems. Although this certainly addresses the
issue of portability, these generic message-passing libraries may give far from optimal
performance on any specific system. On CRAY-T3D systems, for example, the PVM library
performance is somewhat inferior to the manufacturer's own SHMEM library [4]. Hence, to
optimise performance on such a system the parallel application needs to utilise the in-house
library.
Although both PVM and MPI are powerful and flexible they actually provide much greater
functionality than is required by the CM community in porting their applications to
commercial parallel hardware. This issue was recognised by the authors some years ago
when they were working on the design phase of some automatic parallelisation tools for
FORTRAN computational mechanics codes CAPTools [1,5,6,7,8,9]. The challenge was to
produce generic parallel code that would run on any of the commercially available high
performance architectures. The key factor that inhibited the generation of truly generic
parallel code was the variety of the message passing libraries and the structure of the
information passed into the resulting calls as arguments. From an extensive experience base
of code parallelisation, the CAPTools team recognised that all typical inter-processor
communications required by structured mesh codes (typical of CFD applications) could be
addressed by a concise set of function calls. Furthermore it transpired that these calls could
be easily implemented as a thin software layer on top of the standard message passing
libraries PVM and MPI plus a parallel system's own optimised libraries (such as Cray
T3D/T3E SHMEM). Such a thin layer software library could have three distinct advantages
over other routes:
as a high level abstraction it is both easy to understand and to implement by hand, for the
CM community (who are not generally parallel computing specialists);
the one parallel version of the application code is truly generic and portable;
the parallel application can readily utilise whichever message passing libraries on a given
machine yield optimum performance.
In this paper we describe the design, development and performance of the CAPLib message
passing software library that is specifically targeted at structured mesh CM codes. As such,
we are concerned with:- ease of use by the CM community, portability, flexibility and
computational efficiency. Such a library, even if it is a very thin layer, must represent some
kind of overhead on the full scale message passing libraries; part of the performance
assessment considers this issue. For such a concept to be useful to the CM community its
overhead must be minimal.
2 CAPLib Design and Fundamentals
CAPLib's primary design goal was to provide the initialisation and communication facilities
needed to execute parallel Computational Mechanics code either parallelised manually or
generated by the CAPTools semi-automatic parallelisation environment. A secondary goal is
to provide a generic set of utilities that make the compilation and execution of parallel
programs using CAPLib as straightforward as possible. The library is also supplied with a set
of scripts to enable easy and standardised compilation of parallel code with different versions
of CAPLib and for the simple execution of the compiled executable on different machines.
This section discusses the design, features and fundamentals of the library.
2.1 Design
The different layers of software of CAPTools generated code are shown in Figure 1. CAPLib
has been implemented over MPI [3] and PVM [2], the most important standard parallel
communications libraries in current use, to provide an easy method of porting CAPLib to
different machines. Where possible, versions of CAPLib have been developed for proprietary
libraries in order to obtain maximum performance, for example, the Cray SHMEM library [4]
or Transtech's i860 toolset library [11].
CAPTools generated parallel code
CAPLib API
MPI / PVM / Cray SHMEM / Transtech i860 toolset
Figure 1 CAPLib software layers
The library has been designed to meet the following criteria:
Efficient. Speed of communication is perhaps the most vital characteristic of a parallel
message-passing library. Startup latency has been found to be a very important factor
affecting the performance of parallel programs. The addition of layers of communication
software over the hardware communication mechanism increases the startup latency of all
communications. It is therefore important to access the communication mechanism of a
machine at the lowest level possible. Each implementation of CAPLib attempts to utilise
the lowest level communications API of each parallel machine in order to achieve low
latency and therefore the fastest communications possible.
Portable. Code written to use CAPLib is portable across different machines. Only
recompilation is necessary.
Correct. It is vitally important for parallelised computational mechanics programs to give
the same answers in parallel as in serial. The commutative (global) message passing
functions provided by CAPLib are implemented so as to guarantee that the same result is
seen on every processor. This can be of vital importance for the correct execution of
parallel code and its successful completion. For example, a globally summed value may
be used to determine the exit of an iterative loop. If the summed value is not computed in a
consistent manner across all processors, then round off error may cause some processors
to continue executing the loop whilst others exit, resulting in communication deadlock.
Generic. The library is generic in the sense that decisions about which processor topology
to execute on are taken at run time. CAPTools generated code compiled with CAPLib will
run, for example, on 1 processor, a pipeline of 2 processors, a ring of 100 processors, or a
torus of 64. The scripts provided with the library are also generic. For example, capmake
and caprun are scripts that allow the user to compile and run parallel code without
knowing system specific compiler and execution procedures.
Simple. The library itself has been kept as simple as possible, both in the design of the
API and in its implementation. By keeping the library simple with the minimum number
of functions and also the minimum number of arguments to those functions, the library is
easily ported to different parallel machines. Also an uncomplicated interface is more easily
understood and assimilated by the user.
2.2 Parallel Hardware Model
CAPTools currently generates parallel code based on a Distributed Memory (DM) parallel
hardware model, which is illustrated in Figure 2. In the CAPLib parallel hardware model
processors are considered to be arranged in some form of topology, where each processor is
directly connected to several others, e.g. a pipe, ring, grid, torus or full (fully connected).
Each processor is assigned a unique number (starting from 1). In the case of grid and torus
topologies, each processor also has a dimensional processor number. Memory is considered
local to each processor and data is exchanged between processors via message passing of
some form between directly connected processors. CAPTools generated parallel code can
also be executed on Shared Memory (SM) systems providing, of course, CAPLib has been
ported to the system. On a SM system, each processor still executes the same SPMD program
operating on different sections of the problem data. The main difference between this and
operation on a DM system is that message-passing calls can be implemented inside CAPLib
as memory copies to and from hidden shared memory segments. In this respect the CAPLib
model differs from the usual parallelisation model used on SM machines that assume every
processor can directly access all memory of the problem. By restricting the memory each
processor accesses and enforcing a strict and explicit ordering to the update of halo regions
and calculation of global values, the CAPLib parallel hardware model ensures that there will
be very little memory contention on SM systems and particularly on Distributed Shared
Memory (DSM) systems. As the number of processors becomes large, for example, some of
the machines recently built for the Accelerated Strategic Computing Initiative [10] (ASCI)
have thousands of processors, the localisation of communications becomes very important.
Distributing data onto processors, taking into account the hardware processor topology, can
localise communication between processors and thus minimise contention in the
communications hardware.
[Figure content: processors, each a CPU with local memory, shown connected in pipeline, 2-D grid and fully connected topologies; each processor carries a linear number and, in the grid, a dimensional number such as (1,2).]
Figure 2 CAPLib parallel hardware model
2.3 Process Topologies
Knowledge of the processor topology of the parallel hardware a parallel code is to run on is
very important. It can be used to optimise the speed and distance travelled by messages
between processes. CAPTools attempts to generate code that will minimise the amount of
communication needed; however, to perform those communications that are required as
quickly as possible, the process topology must be mapped onto the processor topology.
CAPLib uses the concept of a process topology for this reason. An intelligent mapping of
process to processors will give better performance than would be possible from a random
allocation. By placing processes so that most communications are needed only between
directly connected neighbouring processors, the distance the communications have to travel is
minimised, avoiding hot spots and maximising bandwidth. An awareness of process topology
also allows for more efficient programming of global communications; for example, the use
of a hyper-cube pattern to perform global summations in parallel (see section 6.3).
By requiring that processes are connected in a pipe or grid type topology, it is possible for
CAPTools to generate parallel code for structured mesh parallelisations using directional
communications, i.e. where communication is specified as being up or down, left or right of a
process rather than to a particular process id. This programming style can make it easier for
the user to write and understand parallel code, especially for grids of two or more
dimensions.
Where possible, CAPLib tries to use the fastest methods of communication that are available
on a particular machine. It might be that communications to neighbouring processors could
be made directly through fast, dedicated hardware channels.
The topology required for a particular run of a parallel program, e.g. pipe or ring, and the
number of processes can be specified to the CAPLib utilities and to the parallel program at
run time in a number of ways: via an environment variable; as a flag on the command line;
in a configuration file; or, if none of these is set, by asking the user interactively. The
topologies currently available from CAPLib are pipe, ring, grid, torus and full (all to all).
2.4 Messages
Each message sent and received using the CAPLib communication routines has a length, a
type and a destination.
2.4.1 Message Length
The length is defined in terms of the number of items to be communicated. Zero or a negative
number of items must result in no message being sent. All CAPLib communication routines
check the length and perform no communication if it is zero or negative.
used to hold RI(2). This method has been found to be generic and works on every
machine tested so far.
3. Heterogeneous computing. If a parallel program is sending messages within a
heterogeneous environment then the size and storage of data types may differ between
processors. One processor may use little endian (low bytes first) and another big
endian (high bytes first) storage, i.e. bytes in a message may have to be swapped at
destination or origin depending on the data type. Floating point representation may
also be different; e.g. the default size might be 4 bytes on one machine and 8 bytes on
another. For the library to be able to convert between different storage types it must
know which type is being communicated in order to apply the correct translation.
Currently the library makes the assumption that all processors are homogeneous but the
knowledge of type of messages within the library allows for adding heterogeneous
capability in the future if this is found to be desirable.
2.4.3 Message Destination
Message destination is determined by an integer argument passed in each communication
call. A negative value indicates a direction; a positive value indicates a process number.
The code generated by CAPTools for structured mesh parallelisations currently assumes a
pipeline or grid process topology. The communication calls therefore use the negative values
to indicate direction to the left or right (or up and down) of a process's position in the topology.
These are available as predefined CAPLib constants such as CAP_LEFT, CAP_RIGHT for
improved readability. A characteristic of parallel SPMD code written for an ordered
topology is a test for neighbour existence before communication. This is because the first
processor does not have a neighbour to its left and the last processor does not have a
neighbour to its right. CAPLib functions perform the necessary tests for neighbour processor
existence internally to improve the readability of CAPTools generated parallel code. Having
the neighbour test within the library also reduces the possibility of error (and therefore
deadlock) in any manually written parallel code. The functions also test for zero-length
messages, as mentioned earlier, since this is often a possibility, so that the user avoids having
to perform this chore as well.
Typical hand written user code without these internal tests might look as follows:

      IF (N.GT.0) THEN
        IF (MYNUM.LT.NPROC) CALL ANY_RECEIVE(A,N*4,MYNUM+1)
        IF (MYNUM.GT.1) CALL ANY_SEND(A,N*4,MYNUM-1)
      ENDIF
where MYNUM is the processor number and NPROC is the number of processors.
Using the CAPTools communications library the code becomes

      CALL CAP_RECEIVE(A,N,1,CAP_RIGHT)
      CALL CAP_SEND(A,N,1,CAP_LEFT)

where the receive communication will only take place if N is greater than zero and a processor
is present to the right, and similarly for the send communication if a processor is available to
the left.
3 Requirements for Message-Passing from Structured Mesh Based Computational Mechanics Codes
CAPLib satisfies the general requirements for message-passing from parallelisations of
structured mesh based Computational Mechanics codes. The library has to provide for:
Initialisation of required process topology
Data Partition calculation
Termination of parallel execution
Point to point communications
Overlap area (halo) update operations
Commutative operations, i.e. local value -> global value using some function
Broadcast operations
Algorithmic Parallel Pipelines
In the following sections, the general requirements for communication and parallel constructs
for CM codes and the CAPLib calls that address these requirements are described,
particularly emphasising their novel aspects. To illustrate this discussion a simple
one-dimensional parallel Jacobi code (Figure 3) obtained using CAPTools is used. The CAPLib
library routines are summarised in Table 1 below.
CAPTools Communications Library (CAPLib) Routine Summary

Function Name       Arguments                                          Type  Blocking  Buffered  Cyclic
CAP_INIT            ()                                                 I     x
CAP_FINISH          ()                                                 I     x
CAP_SETUPPART       (LOASSN,HIASSN,LOPART,HIPART)                      I     x
CAP_SEND            (A,NITEMS,TYPE,PID)                                P     x
CAP_RECEIVE         (A,NITEMS,TYPE,PID)                                P     x
CAP_EXCHANGE        (A,B,NITEMS,TYPE,PID)                              E     x
CAP_BSEND           (A,NITEMS,STRIDE,NSTRIDE,ITYPE,PID)                P     x         x
CAP_BRECEIVE        (A,NITEMS,STRIDE,NSTRIDE,ITYPE,PID)                P     x         x
CAP_BEXCHANGE       (A,B,NITEMS,STRIDE,NSTRIDE,ITYPE,PID)              E     x         x
CAP_CSEND           (A,NITEMS,TYPE,PID)                                P     x                   x
CAP_CRECEIVE        (A,NITEMS,TYPE,PID)                                P     x                   x
CAP_CEXCHANGE       (A,B,NITEMS,TYPE,PID)                              E     x                   x
CAP_ASEND           (A,NITEMS,TYPE,PID,ISEND)                          P
CAP_ARECEIVE        (A,NITEMS,TYPE,PID,IRECV)                          P
CAP_AEXCHANGE       (A,B,NITEMS,TYPE,PID,ISEND,IRECV)                  E
CAP_ABSEND          (A,NITEMS,STRIDE,NSTRIDE,ITYPE,PID,ISYNC)          P               x
CAP_ABRECEIVE       (A,NITEMS,STRIDE,NSTRIDE,ITYPE,PID,ISYNC)          P               x
CAP_ABEXCHANGE      (A,STRIDE,NSTRIDE,NITEMS,TYPE,PID,ISEND,IRECV)     E               x
CAP_CASEND          (A,NITEMS,TYPE,PID,ISEND)                          P                         x
CAP_CARECEIVE       (A,NITEMS,TYPE,PID,IRECV)                          P                         x
CAP_CAEXCHANGE      (A,B,NITEMS,TYPE,PID,ISEND,IRECV)                  E                         x
CAP_SYNC_SEND       (PID,ISYNC)                                        S     x
CAP_SYNC_RECEIVE    (PID,ISYNC)                                        S     x
CAP_SYNC_EXCHANGE   (PID,ISEND,IRECV)                                  S     x
CAP_COMMUTATIVE     (VALUE,TYPE,FUNC)                                  G     x
CAP_COMMUPARENT     (VALUE,TYPE,FIRSTFOUND,FUNC)                       G     x
CAP_COMMUCHILD      (VALUE,TYPE)                                       G     x
CAP_DCOMMUTATIVE    (VALUE,TYPE,DIRECTION,FUNC)                        G     x
CAP_MCOMMUTATIVE    (VALUE,NITEMS,TYPE,FUNC)                           G     x
CAP_BROADCAST       (VALUE,TYPE)                                       G     x
CAP_MBROADCAST      (VALUE,TYPE,OWNER)                                 G     x

CAPLib function type key:
I  Initialisation, termination and control
P  Point to point communication
E  Ordered exchange communication between neighbours
S  Synchronisation on non-blocking communication
G  Global communication or commutative operation

Table 1 Summary of CAPLib Routines
      REAL TOLD(1000), TNEW(1000)
      EXTERNAL CAP_RMAX
      REAL CAP_RMAX
      INTEGER CAP_PROCNUM, CAP_NPROC
      COMMON /CAP_TOOLS/ CAP_PROCNUM, CAP_NPROC
      INTEGER CAP_HTOLD, CAP_LTOLD
C     Initialise CAPLib
      CALL CAP_INIT
      IF (CAP_PROCNUM.EQ.1) PRINT*,'ENTER N AND TOL'
      IF (CAP_PROCNUM.EQ.1) READ*,N,TOL
C     Broadcast N and TOL to every processor
      CALL CAP_RECEIVE(TOL,1,2,CAP_LEFT)
      CALL CAP_SEND(TOL,1,2,CAP_RIGHT)
      CALL CAP_RECEIVE(N,1,1,CAP_LEFT)
      CALL CAP_SEND(N,1,1,CAP_RIGHT)
C     Initialise data partition
      CALL CAP_SETUPPART(1,N,CAP_LTOLD,CAP_HTOLD)
      DO I=MAX(1,CAP_LTOLD),MIN(N,CAP_HTOLD),1
        TOLD(I)=0.0
      ENDDO
C     Boundary conditions (only execute on end processors)
      IF (1.GE.CAP_LTOLD.AND.1.LE.CAP_HTOLD) TOLD(1)=1
      IF (N.GE.CAP_LTOLD.AND.N.LE.CAP_HTOLD) TOLD(N)=100
   40 CONTINUE
C     Exchange overlap data prior to each Jacobi update
      CALL CAP_EXCHANGE(TOLD(CAP_HTOLD+1),TOLD(CAP_LTOLD),1,2,CAP_RIGHT)
      CALL CAP_EXCHANGE(TOLD(CAP_LTOLD-1),TOLD(CAP_HTOLD),1,2,CAP_LEFT)
      DO I=MAX(2,CAP_LTOLD),MIN(N-1,CAP_HTOLD),1
        TNEW(I)=(TOLD(I-1)+TOLD(I+1))/2.0
      ENDDO
C     Calculate maximum difference on each processor
      DIFMAX=0.0
      DO I=MAX(1,CAP_LTOLD),MIN(N,CAP_HTOLD),1
        DIFF=ABS(TNEW(I)-TOLD(I))
        IF (DIFF.GT.DIFMAX) DIFMAX=DIFF
        TOLD(I)=TNEW(I)
      ENDDO
C     Find global maximum difference
      CALL CAP_COMMUTATIVE(DIFMAX,2,CAP_RMAX)
      IF (DIFMAX.GT.TOL) GOTO 40
C     Output results via first processor
      DO I=1,N,1
        IF (I.GT.CAP_BHTNEW) CALL CAP_RECEIVE(TNEW(I),1,2,CAP_RIGHT)
        IF (I.GE.CAP_BLTNEW) CALL CAP_SEND(TNEW(I),1,2,CAP_LEFT)
        IF (CAP_PROCNUM.EQ.1) WRITE(UNIT=*,FMT=*) TNEW(I)
      ENDDO
      END
Figure 3 CAPTools generated parallel code for simple 1-D Jacobi program
3.1 Initialisation, Partition Calculation and Termination
The routine CAP_INIT is called in the example code to initialise the library. It must be called
before any other CAPLib function is used. This call sets up the internal channel arrays and
other data structures that the library needs to access. In some implementations of the library
(e.g. the PVM version) this routine is also responsible for starting all slave processes running.
CAP_INIT is responsible for the allocation of processes to processors in such a manner as to
minimise the number of hops between adjacent processes in the requested topology and
therefore the overall process to process communication latency, maximising communication
bandwidth. CAP_INIT is also responsible for communicating information on the runtime
environment such as hostname and X Window display name to all processes. The size of each
data type is also dynamically determined by CAP_INIT.
A general requirement for message-passing SPMD code is for each parallel process to be
assigned a unique number and also to know the total number of processors involved.
CAP_INIT sets CAP_PROCNUM (the process number) and CAP_NPROC (the number of
processes). Both variables are used internally, but can be referenced in the application code
through a common block in the generated code.
The next stage is the calculation of data assignment for each process. Adhering to the SPMD
model, the partitioning of the arrays TNEW and TOLD for this example on 4 processes
would require each process to be allocated a data range of 250 array elements in order for
each processor to obtain a balanced workload (see, for example Figure 4). The CAPLib
function CAP_SETUPPART is passed the minimum and maximum of the accessed data range
and the number of processes. It returns to each process its own unique values for the minimum
and maximum of its partitioned data range (variables CAP_LTOLD and CAP_HTOLD in
Figure 3). If the example was partitioned onto 4 processes then
CAP_SETUPPART would return to process 1 the partition range 1 to 250, process 2 the
partition range 251 to 500, process 3 the partition range 501 to 750 and process 4 the partition
range 751 to 1000. Each process also requires an overlap region because of data assigned on
one process but used on a neighbouring process. This will necessitate the communication of
data assigned on one process into the overlap region of their neighbouring process. Due to the
organised partition of the data the overlap areas need only be updated from their
neighbouring processes. The data partition of the partitioned array TOLD in comparison with
the original un-partitioned array is shown in Figure 4.
[Figure content: the un-partitioned array TOLD, elements 1 to 1000, alongside the partitioned array split across PE 1 to PE 4 as ranges 1-250, 251-500, 501-750 and 751-1000, with lower and higher overlap areas updated from the neighbouring processes.]
Figure 4 Comparison of an un-partitioned and partitioned 1-D array.
The routine CAP_FINISH must be called at the end of a program run to successfully
terminate use of the library. On some machines, this call is necessary if control is to return to
the user once the parallel run has completed.
3.2 Point to Point Communication
The CAP_SEND and CAP_RECEIVE functions perform point to point communications
between two processors. Typically these functions appear in pipeline communications (see
section 3.4) but are also used to distribute data across the processor topology during
initialisation of scalars and arrays etc.
CAPLib has a selection of communication routines that allow the user to perform point to
point communications in a variety of ways. There are two main groups, blocking and
non-blocking and these are discussed separately in the next sections. Each communication
has the generic arguments of address (A), length (NITEMS), type (TYPE) and destination
(PID) with additional arguments depending on the routine. All the point-to-point routines are
summarised in Table 1.
3.2.1 Blocking Communication
Blocking communications do not return until the message has been successfully sent or
received. The non-cyclic blocking communications will not communicate beyond the
boundaries of the process topology when directional message destinations are given;
directions are indicated by a negative PID argument. For example, in a pipeline, the first
process will not send to its left, or the last process to its right. This will also be true of a ring
topology, grid and torus (multi-dimensional ring). Where communications are required to
loop around a topology like a ring or torus, as is the case for programs with cyclic partitions,
the cyclic routines can be used. These do not test for the end or beginning of a processor
topology.
Buffered routines are provided so that data that is non-contiguous can be buffered and sent as
a single communication. The extra arguments are STRIDE (stride length in terms of ITYPE
elements) and NSTRIDE (the number of strides). In other words, NSTRIDE lots of NITEMS
elements, STRIDE elements apart, will be communicated in each call. This approach avoids
the multiple start up latencies incurred using a communication for each section of data. On
most platforms there is a message size dependent limit at which point the time spent
gathering and scattering data to and from buffers can be greater than the latency effect of
using multiple communications. The buffered routines switch internally to non-buffered
communications if this limit is exceeded. This limit is currently set statically but in the future
it is hoped to perform an optimal calculation for the limit during the call to CAP_INIT.
CAPTools provides a user option to generate buffered or non-buffered communications.
3.2.2 Non-Blocking Communication
It is often the speed of communication that reduces the efficiency of parallel programs more
than anything else. To improve code performance, many parallel computers allow programs
to start sending (and receiving) several messages and then to proceed with other computation
asynchronously whilst this communication takes place. CAPLib supports this approach by
providing non-blocking sends and receives. Non blocking communications are implemented
in CAPLib using the underlying host systems non-blocking routines where possible. Where
such routines are not available, non-blocking routines have been implemented using a variety
of techniques, for example, communication threads running in parallel with the main user
code. Table 1 lists the non-blocking routines currently available in the library.
Non-blocking communication routines, e.g. CAP_ASEND, begin the non-blocking operation
but return to the user program as soon as the communication has been initiated. The
communication itself takes place in parallel with execution of the following user code. The
arguments are the same as for the blocking communications but with the addition of a
message synchronisation id as the last argument. To make sure a message has completed its
journey the user code calls a CAP_SYNC routine to test for completion, passing the
destination and synchronisation id as arguments. The CAP_SYNC routines either return
immediately, if a communication has finished, or wait for it to complete, if it has not. A
particular communication is completely identified by the message destination and the
synchronisation id.
Depending on the hardware and underlying communications library that CAPLib is ported to,
the implementation of the non-blocking routines can be done in several different ways. For
some implementations the synchronisation call is used to actually unpack the messages
because the underlying library does not provide a non-blocking receive using the same model
as CAPLib, for example the PVM implementation.
Buffered non-blocking communications are also handled differently depending on the
underlying library and hardware. Buffered non-blocking communications consist of two
stages, for a send, first the packing of data into a buffer, and then the communication of the
buffered data. A receive communication must first receive the buffered data and then unpack
it. If the parallel processor node is of a type that has a separate processor for communications,
that can be programmed to perform work asynchronously with the main processor, then the
packing and unpacking can be performed by the communications processor and overlapped
with computation done on the main processor. This relies on the communications processor
having dual memory access to the main processor's memory. The benefit of this is that both
stages of buffered communication are then performed in parallel with computation. The
Transtech Paramid [11] is a good example of such a system. However, it may be that the
communications processor is of lower speed than the main processor and the time taken to
unpack is actually longer than if the main processor had done the unpacking in serial mode
itself. CAPLib therefore makes use of this approach only where it would improve
performance. It is more often the case that parallel nodes consist of single processors and do
not provide any direct hardware support for non-blocking buffered communications. On such
systems, messages can still be received asynchronously, but the processor must do data
unpacking and there is no real parallel overlapping during the packing/unpacking stage.
Libraries such as MPI implement non-blocking communications on workstations using
parallel threads. Although this provides the mechanism for non-blocking buffered
communications, the thread runs on the same processor, so the unpacking is performed not in
parallel but through time slicing. No real parallel benefit is therefore gained in
packing/unpacking.
If the underlying communications library used by CAPLib does not directly support buffered
non-blocking communication then the unpacking must be performed at the synchronisation
stage, once the buffered message has been received. CAPLib implements this by keeping a
list of asynchronous communications and whenever a CAPLib synchronisation call is made,
all outstanding messages from the list are unpacked.
Because of the extra complexity of using non-blocking communications, it is common
practice to write or generate message-passing code that uses blocking communications as a
first parallelisation attempt. Once this version has been tested thoroughly and proved to give
correct results, a non-blocking version can be produced to optimise performance (in
CAPTools this merely requires clicking one button [8]).
Before data that has been transmitted using non-blocking functions can be used in the case of
a receive communication, or re-assigned in the case of a send communication, the completion
of the communication involving the data must be verified. For maximum flexibility and
efficiency, the communication model used by CAPLib for the ordering of message arrival and
departure, and for synchronising on message completion, is as follows:
Messages are sent in order of the calls made to send to a particular destination, D.
Messages are received in order of calls made to receive from a particular destination, D.
This implies that:
Synchronisation on the sending of message Mi to destination D guarantees that
messages Mi-1, Mi-2, ... sent to destination D have arrived.
In the example below the synchronisation using ISENDB by statement S3 on the
message sent by S2 also guarantees that the message sent by S1 has arrived.
S1 CALL CAP_ASEND(A,1,1,CAP_LEFT,ISENDA)
S2 CALL CAP_ASEND(B,1,1,CAP_LEFT,ISENDB)
S3 CALL CAP_SYNC_SEND(CAP_LEFT,ISENDB)
Synchronisation on the receiving of message Mj from a destination D guarantees that
messages Mj-1, Mj-2, ... have been received from destination D.
In the example below the synchronisation using IRECVB by statement S3 on the
message requested by S2 also guarantees that the message requested by S1 has been
received.
S1 CALL CAP_ARECEIVE(A,1,1,CAP_LEFT,IRECVA)
S2 CALL CAP_ARECEIVE(B,1,1,CAP_LEFT,IRECVB)
S3 CALL CAP_SYNC_RECEIVE(CAP_LEFT,IRECVB)
Waiting for completion of a send to a destination does not guarantee that a particular
receive has taken place from that destination and vice versa.
In the example below the synchronisation using ISENDB by statement S3 on the
message sent by S2 does not guarantee that the message requested by S1 has arrived.
S1 CALL CAP_ARECEIVE(A,1,1,CAP_LEFT,IRECVA)
S2 CALL CAP_ASEND(B,1,1,CAP_LEFT,ISENDB)
S3 CALL CAP_SYNC_SEND(CAP_LEFT,ISENDB)
Waiting for completion of a communication with a particular destination D does not
guarantee that any other sends or receives to or from another destination have
completed.
In the example below the synchronisation using ISENDB by statement S4 on the
message sent by S3 does not guarantee that the messages requested by S1 or sent by
S2 have arrived.
S1 CALL CAP_ARECEIVE(A,1,1,CAP_RIGHT,IRECVA)
S2 CALL CAP_ASEND(A,1,1,CAP_RIGHT,ISENDA)
S3 CALL CAP_ASEND(B,1,1,CAP_LEFT,ISENDB)
S4 CALL CAP_SYNC_SEND(CAP_LEFT,ISENDB)
This model is flexible enough to allow for the automatic generation of non-blocking
communications within CAPTools [8]. The ability to synchronise several messages in a
particular direction with one synchronisation (waiting for the last message to be sent is
enough to guarantee that all previous messages have been sent) makes code generation much
easier. It also reduces the overhead of synchronisation. The model also
allows for overlapping both sends and receives simultaneously to a particular destination and
for multiple tests on the same synchronisation id, which is essential for an automatic
overlapping code generation algorithm.
The flexibility of this model has allowed CAPTools to generate overlapping communications
with synchronisation that guarantees correctness in a wide range of cases. This includes loop
unrolling transformations and the synchronisation of overlapping communications in
pipelined loops. Code appearance is also enhanced by the merging of synchronisation points,
which is only possible with this communication model.
3.3 Exchanges (Overlap Area/Halo Updates)
For any array that is distributed across the process topology each process will have an overlap
region in the array that is assigned on another process (see Figure 4). These overlap areas are
updated when necessary. The overlap region is updated by invoking a call to
CAP_EXCHANGE, which performs a similar function to the MPI call MPI_SENDRECV.
This communication function sends data into a neighbouring process's overlap area as well
as receiving data into its own overlap region from the neighbouring process.
CAP_EXCHANGE must ensure that no deadlock occurs and must allow for
non-communication beyond the edge of the process topology for the end processes. Most
importantly, this type of communication is fully scalable, i.e. not dependent on the number of
processes, taking at most 2 steps to complete (see Figure 5). If the hardware allows
non-blocking communication, an exchange can be performed in 1 step by communicating in
parallel.
Figure 5 Communication pattern for blocking exchange operation on 16 processors.
3.4 Pipelines
A pipeline in a parallel code involves each processor performing the same set of operations
on a successive stream of data. Pipelined loops are a common occurrence in parallel CM
codes and are often essential to implement, for example, recurrence relations while
guaranteeing correctness of the parallel code. Because a pipeline serialises a loop, it must be
surrounded by outer loop(s) in order to achieve parallel speed-up. The main disadvantage of
pipelines is that during the pipeline process some processors will be idle at the start-up and
shut-down stages. Another disadvantage is the potentially significant overhead of the
numerous communication start-up latencies. Figure 6 shows a simple example of a loop that
has been parallelised using a pipeline.
C Serial code
      DO I=1,NI
        DO J=2,NJ
          A(I,J)=A(I,J-1)
        END DO
      END DO

C Parallel code
      DO I=1,NI
        CALL CAP_RECEIVE(A(I,CAP_LA-1),1,2,CAP_LEFT)
        DO J=MAX(2,CAP_LA),MIN(NJ,CAP_HA)
          A(I,J)=A(I,J-1)
        END DO
        CALL CAP_SEND(A(I,CAP_HA),1,2,CAP_RIGHT)
      END DO

Figure 6 Example of a Pipeline
With a low communication startup latency, good parallel efficiency can be achieved (see
section 9.1).
3.5 Commutative Operations
The Jacobi example in Figure 3 uses a convergence criterion based on DIFMAX. Since the
TNEW and TOLD arrays have been partitioned across the processors, each processor will
calculate its own local value for DIFMAX; however, it is necessary to calculate the global
value. Collective parallel computation operations (minimum, maximum, sum, etc.) take a
value or values assigned on each process and combine them with the values on all other
processes into a global result that all processes receive. This is performed by the CAPLib
function CAP_COMMUTATIVE, which is analogous to the MPI global reduction function
MPI_ALLREDUCE. The latency for this type of communication is dependent on the
number of processors. CAPLib minimises the effect of this by internally using a hyper-cube
topology where possible to perform the commutative operation. The commutative routines in
CAPLib are summarised in Table 1.
The routine CAP_COMMUTATIVE performs the commutative operation defined by the
passed routine FUNC on the data item VALUE across all processes. On entry to the routine,
VALUE holds a local value on each processor; on exit it contains a global value computed
from all local values across processors, using routine FUNC to combine contributions. For
example, the serial code to sum a vector is:
      SUM=0.0
      DO I=1,N
        SUM=SUM+A(I)
      END DO
The parallel equivalent of this using CAP_COMMUTATIVE, given that the array A has been
partitioned across processes is:
      SUM=0.0
      DO I=CAP_LOW,CAP_HIGH
        SUM=SUM+A(I)
      END DO
      CALL CAP_COMMUTATIVE(SUM,CAP_REAL,CAP_RADD)
In this example CAP_LOW and CAP_HIGH are the local low and high limits of assignment
to A on each processor. The procedure CAP_RADD is a predefined procedure to perform a
real value addition. A list of all predefined commutative functions is given in the CAPLib
user manual [12]. Each procedure has the same arguments, which means that
CAP_COMMUTATIVE can call it generically. For example, CAP_RADD is defined as:
      SUBROUTINE CAP_RADD(R1,R2,R3)
      REAL R1,R2,R3
      R3=R1+R2
      END
The routine CAP_DCOMMUTATIVE is a variant of CAP_COMMUTATIVE that
performs a commutative operation in one dimension of a grid or torus process topology, the
direction (e.g. left or up) being indicated by an additional argument. This type of
commutative operation can be necessary when a structured mesh code is partitioned in more
than one direction. The routine CAP_MCOMMUTATIVE provides for commutative
operations on an array of data rather than one item. Combining several
CAP_COMMUTATIVE calls to form one CAP_MCOMMUTATIVE call allows a
corresponding reduction in latency.
CAPTools generates code with commutative operations whenever it can match the code in a
loop to the pattern of a commutative operation. Without commutative communications, the
code generated would involve complex and convoluted communication.
An interesting observation is that commutative operations performed in parallel will actually
give answers with less round-off error than the corresponding serial code. For example,
consider the summation of ten million small numbers in serial. As the summation continues,
each small value is added to an increasingly large sum. Eventually the small numbers cease
to have an impact on the sum because of the limited accuracy of adding a small value to a
large one in computer arithmetic. The parallel version of the summation will first have each
processor sum its own section locally, then communicate the local summations and add them
to obtain a global sum. The accuracy will be greater since each local summation involves
fewer numbers, and therefore smaller differences in magnitude, than the complete serial
summation. In addition, the summation of the local summations to obtain the global value
combines numbers of relatively similar size. If this were not the case, the approach would not
be acceptable for many users performing parallelisations of existing serial code. Part of the
parallelisation process is to validate the parallel version against the serial version; obviously,
the parallelised code must produce the same results in order to pass the validation process.
Although a parallel commutative operation may not produce exactly the same result as the
serial one, it will at least be more accurate rather than less accurate, and so most validation
tests should be passed.
As well as producing results as near as possible to those of the serial version, commutative
operations must also produce the same answer on all processes. For example, the calculation
of the sum of the difference between two vectors, i.e. a residual value, is often used to
determine whether to terminate an iterative loop. If the calculated residual value is not the
same on all processors then it may cause the loop to terminate on some processes but iterate
again on others. Obviously, this will cause the parallel execution to deadlock. To obtain the
same results on all processes, the commutative operation must either be performed in the
same order on all processes, incurring the same round-off errors, or broadcast a single global
value.
A common array operation is to find the maximum or minimum value and its location in the
array. The equivalent parallel commutative operation must be performed in the same order to
return the same location as the serial code. If there are several occurrences of the
maximum/minimum value in the array then several processes might each find their own local
maxima/minima. In order to resolve this, the commutative operation must know the direction
in which the array is traversed. The routines CAP_COMMUPARENT and
CAP_COMMUCHILD provide a mechanism for this. The argument FIRSTFOUND (see
Table 1) determines how the commutative operation chooses a location. If FIRSTFOUND is
set to true then, for a maximum commutative calculation, it is the maximum value location
found on the lowest numbered processor in the given dimension that is returned on all
processes. This would be the case for a serial loop running from low to high through an array.
For example, consider the example in Figure 7 with data A=(/7, 9, 2, 2, 9, 5, 9/). Although
there are maxima at positions 2, 5 and 7, the serial code will set MAXLOC to 2 due to the
use of a strict greater-than test. The parallel code will similarly produce the result
MAXLOC=2 on all processors.
C Serial code
      MAXVAL=0
      MAXLOC=1
      DO I=1,N
        IF (A(I).GT.MAXVAL) THEN
          MAXVAL=A(I)
          MAXLOC=I
        ENDIF
      END DO

C Parallel code
      MAXVAL=0
      MAXLOC=1
      DO I=LOW,HIGH
        IF (A(I).GT.MAXVAL) THEN
          MAXVAL=A(I)
          MAXLOC=I
        ENDIF
      END DO
      CALL CAP_COMMUPARENT(MAXVAL,1,CAP_IMAX,.TRUE.)
      CALL CAP_COMMUCHILD(MAXLOC,1)

Figure 7 Example of CAP_COMMUPARENT and CAP_COMMUCHILD
If the test for the maximum value had been .GE. rather than .GT., then MAXLOC would be
set to the location of the last maximum value rather than the first, and therefore
FIRSTFOUND in CAP_COMMUPARENT would be set to .FALSE.
CAP_COMMUPARENT works by sending the processor number of the current maximum
value along with the maximum value as the commutative operation is performed among the
processors. In the commutative communication algorithms, the location for
CAP_COMMUPARENT is also packed into the message. CAP_COMMUPARENT
internally stores the processor that owns the desired location(s). This processor is then used
in any number of calls to CAP_COMMUCHILD to broadcast the correct value to all
processors.
3.6 Broadcast Operations
Broadcast operations are used to move data from one process to all other processes. The
simplest of these is a broadcast of data from the first process to all others, termed a master
broadcast. CAPLib provides the CAP_MBROADCAST routine to do this. In fact, rather than
sending data directly to all processes from the master process, the master broadcast will use
the same communication strategies as the CAP_COMMUTATIVE call. These strategies,
described in section 6, take advantage of the internal process topology to reduce the number
of communications and steps to complete the operation.
A second type of broadcast is the communication of data from a particular process to all
others. CAPLib provides the routine CAP_BROADCAST to do this. The OWNER argument
is passed in set to true for the process owning the data and false for all others.
CAP_BROADCAST is currently implemented as a commutative MAX style operation on the
OWNER argument to tell every process which particular process is the owner of the data to
be broadcast. The data is then transmitted from the owning process to the other processes in
an optimal fashion using the internal process topology.
4 CAPLib on the Cray T3D/T3E
4.1 Implementation
CAPLib has been ported to the Cray T3D and T3E using PVM, MPI and the SHMEM
library. The SHMEM version is described below. Of the three, the SHMEM CAPLib is by
far the fastest, its latency and bandwidth being a reflection of the performance of the
SHMEM library. Typical latency is under 7 μs and bandwidth greater than 100 MB/s for
large messages on the T3D, and 5 μs and 300 MB/s on the T3E. The SHMEM version of
CAPLib is written in C rather than Fortran because of the need to do indirect accessing.
Synchronous message passing was implemented using a simple protocol built on the Cray
SHMEM_PUT library routine, which is faster than SHMEM_GET. Figure 8 shows this
protocol used to send data between two processors.
Figure 8 CAPLib protocol used for communication on T3D/T3E using SHMEM library.
The receiving processor first writes the starting address into which it wishes to receive data
to a known location on the sending processor, and then waits for the sending processor to
write the data and send a write-data-complete acknowledgement. The sending processor
waits on a tight spin lock (busy wait loop) for a non-zero value in the known location. When
the address has arrived, it uses SHMEM_PUT to place its data directly into that address on
the receiving processor. The sending processor then calls SHMEM_QUIET to make sure the
data has arrived, and then sends a write-data-complete acknowledgement to the receiving
processor. The pseudo code for this procedure is shown in Figure 9.
send(a, n, cn)
/* send data a(n) to channel cn (processor cn2p(cn)) */
{
  pe = cn2p(cn)
  /* wait for address from receiving processor to arrive in addr(pe) */
  while (!addr(pe)) ;
  /* send data */
  shmem_put(*addr(pe), a, n, pe)
  /* wait for completion */
  shmem_quiet()
  /* ack send complete */
  shmem_put(ack(mype), 1, 1, pe)
  /* reset address */
  addr(pe) = 0
}

recv(b, n, cn)
/* recv data b(n) from channel cn (processor cn2p(cn)) */
{
  pe = cn2p(cn)
  /* place recv address in sending pe at address addr(mype) */
  shmem_put(addr(mype), &b, 1, pe)
  /* wait for data ack to arrive */
  while (!ack(pe)) ;
  /* reset ack */
  ack(pe) = 0
}

Figure 9 Pseudo code for send/recv using Cray SHMEM calls
To obtain maximum performance, all internal arrays and variables involved in a
communication are cache aligned using compiler directives.
To avoid any conflicts in all-to-all communications, the variables used to store addresses and
act as acknowledgement flags are all declared as arrays, with the T3D processor number
being used to reference the array elements. In this way each send address and data
acknowledgement can only be set by one particular processor.
Asynchronous communication has been partially implemented by removing the wait on the
write-data-complete acknowledgement in the receive and placing it in CAP_SYNC_RECV.
The send operation is not currently fully asynchronous since it cannot start until it receives
an address from the receiving processor to send data to.
Commutative operations have also been implemented using these low-level functions, and
the hyper-cube B method (see section 6.3) is the default commutative method employed.
Pahud and Cornu [13] show that communication locality can influence communication times
in heavily loaded networks on the T3D. CAPLib uses the location of each processor within
the processor topology shape allocated to a particular run to determine CAP_PROCNUM
(the CAPTools processor number) in an optimal way so as to minimise communication time.
The numbering is chosen to provide a pipeline of processors through the 3-D topology shape
so that the number of hops from processor CAP_PROCNUM to processor
CAP_PROCNUM+1 is minimised.
Another way of improving communication performance for some parallel programs
(particularly all-to-all style communication) is to order the communications so that an
optimum communication pattern is used, reducing the number of steps to perform a
many-to-many operation. Unstructured mesh codes will often use this type of operation.
4.2 Performance
This section discusses the performance of the different CAPLib point-to-point and exchange
message passing functions on the Cray T3D and T3E. The speed of other message passing
libraries and CAPLib performance are compared where possible. Figure 10 and Figure 11
show the latency and bandwidth respectively on the T3D for SHMEM versions of
CAP_SEND (synchronous), CAP_ASEND (non-blocking), CAP_EXCHANGE and
CAP_AEXCHANGE (non-blocking). As a comparison, these graphs also show timings for
MPI_SEND, MPI_SSEND and MPI_SENDRECV. Figure 12 and Figure 13 show similar
graphs for the Cray T3E.
An examination of these figures shows that CAPLib performs at least as well as the standard
MPI implementation on each machine. The CAPLib SHMEM implementation is superior to
using MPI or PVM calls in both latency and bandwidth. Generally, the overhead of using the
CAPLib layer over MPI instead of direct calls to MPI is negligible. CAP_SEND
implemented using SHMEM has a startup latency of around 7 μs. The overall bandwidth
obtained on the T3E for all communication measurements is far higher than that of the T3D.
The bandwidth for CAP_SEND on the T3D for messages of 64KB is around 116 Mbytes/sec;
on the T3E this figure is 297 Mbytes/sec. This is due to hardware improvements between the
two systems. CAP_EXCHANGE has been implemented on the Cray systems under SHMEM
to partially overlap the pair-wise send and receive communications it performs, and this is
reflected in the bandwidth obtained: 143 Mbytes/sec on the T3D and 416 Mbytes/sec on the
T3E. Note that the bandwidths for MPI_SENDRECV (50 Mbytes/sec on the T3D and 284
Mbytes/sec on the T3E) are very poor in comparison with CAP_EXCHANGE. Each
performs a similar communication, a send and receive to other processors, but
CAP_EXCHANGE is able to schedule its communications so as to overlap because it is
based on directional communication, whereas MPI_SENDRECV communication is based on
processor numbers only and is unable to do this.
The graphs for the figures are obtained by performing a Ping-Pong communication many
times and taking average values. However, the non-blocking communication Ping-Pong test
has synchronisation after each communication. In this respect, the non-blocking results are
artificial in that they do not reflect the greater performance that will be obtained in real codes
where synchronisation will generally be performed after many communications. The graphs
for CAP_ASEND and CAP_AEXCHANGE therefore give a measure of the overhead of
performing synchronisation on non-blocking communication and do not reflect the latency
and bandwidth that is obtained in real use.
Figure 10 CAPLib communication latency on Cray T3D
Figure 11 CAPLib communication bandwidth on Cray T3D
Figure 12 CAPLib communication latency on Cray T3E
Figure 13 CAPLib communication bandwidth on Cray T3E
5 CAPLib on the Paramid
5.1 Implementation
The Transtech Paramid version of the CAPTools communications library uses the low-level
Transtech/Inmos i860 toolset communications library [11]. The Paramid's dual-processor
node architecture makes it ideal for non-blocking asynchronous communications, since the
Transputer part of a node can be performing communication whilst the i860 is computing.
Non-blocking communications are implemented for the Paramid in CAPLib using an
asynchronous router program that runs on the Transputer. To minimise latency for small non-
blocking communications, the period of synchronisation between the Transputer and the i860
during initialisation of a non-blocking communication must be kept to a minimum. In
addition, the amount of effort required for the Transputer to send and receive asynchronously
must be as small as possible, and as near to the time for a normal direct synchronous send as
possible. Figure 14 shows a process diagram of the router process that executes on the
Transputers during runs that use the asynchronous version of CAPLib. The diagram shows
the threads in the router process for sending data asynchronously down one channel. For
each channel pair (IN and OUT channels), there will be two sets of these threads to allow
independent communication in both directions. This arrangement is also duplicated for each
channel connection to other nodes. The send and receive threads of the router process are
linked by channels over the Transputer links to the destination node's corresponding send
and receive threads (where more links are needed than are physically available, implicit use
is made of the INMOS virtual routing mechanism [14]).
For every channel, the client send thread processes send requests and places them in a send
request queue. A similar action is performed for receive requests by the client receive thread.
The send thread removes requests from the send queue and communicates the data as soon as
the corresponding receive thread on the other processor is ready to receive, i.e. when the
receive thread has itself removed a request to receive from the receive request queue. The
send and receive threads update an acknowledgement counter for each channel so that the
user's program can synchronise on the completion of certain communications. It is worth
emphasising that, using this model, communication down one channel is completely
independent of communication down another. It is up to the user's program to synchronise at
the correct point to guarantee the validity of data communicated in each direction.
Figure 14 Transputer router process for asynchronous communication on Transtech Paramid
5.2 Performance
Figure 15 and Figure 16 give the latency and bandwidth characteristics of CAPLib on the
Paramid. The best latency is around 33 μs, with the bandwidth approaching peak
performance at around the 500-byte message size. Notice that the peak bandwidth of
CAP_AEXCHANGE is roughly twice that of CAP_SEND, showing that it is performing its
send and receive communications asynchronously in parallel. The latency cost for small
messages (~40 bytes) is higher than for the synchronous CAP_EXCHANGE because of the
extra complexity of setting up an asynchronous communication. However, in real
applications the increased asynchronous latency will usually be hidden by the overall benefit
of performing computation whilst communicating.
Figure 15 CAPLib latency on Transtech Paramid
Figure 16 CAPLib bandwidth on Transtech Paramid
6 Optimised Global Commutative Operations
As global commutative operations usually only involve the sending and receiving of very
small messages, typically 4 bytes, it is the communication startup latency which will
dominate the time taken to perform the commutative operation. This is because the
communication startup latency is relatively expensive on most parallel machines. It is for this
reason that in many parallelisations, commutative operations can be a governing factor
affecting efficiency and speed up. It is extremely important, therefore, to implement
commutative operations as efficiently as possible. In order to do this, the commutative
routines in CAPLib take advantage of the processor topology, that is, how each processor
may communicate with other processors.
Many of the parallel machines on the market today are connected using some kind of
topology to facilitate fast communication in hardware. For example, processors in the Cray
T3D are connected to a communications network arranged as a 3-D torus. However, although
the hardware is connected as a torus, there is in fact no limitation on what processors a
particular processor may talk to at the hardware level; the communication hardware will route
messages from one processor to another around the torus as needed. From the perspective of
the methods used to perform commutative (and broadcast) operations it is this direct
processor to processor topology that is important, not the underlying hardware topology that
implements it. This means, for example, that although the Cray T3D is based on a 3-D torus,
for commutative operations internally within CAPLib it is considered fully connected. The
commutative topology used internally within CAPLib will therefore depend on the direct
processor-to-processor routing available on the machine the program is running on. The
available commutative methods are then directly related to the commutative topology.
Currently CAPLib supports pipe, ring, grid and two different hyper-cube commutative
methods.
In order to compare the efficiency of each method we define the following:
P = The number of processes.
C = The total number of communications for a commutative operation.
S = The total number of steps involved in the method. We define a step as a number of
communications performed in parallel such that the time/latency of all the
communications is equivalent to that of one communication. Some communication
devices are serial, allowing only one communication at a time. For example, the
Ethernet connecting workstations is a serial communications device, since only one
packet may be present on the Ethernet at any one time. For these devices, although we
can consider the communications in one step as taking place in parallel for the
purposes of analysis, they will in fact be serialised in practice.
The key to efficient commutative operations is to perform as much communication in parallel
as possible, i.e. by minimising the number of parallel communication steps needed to perform
the commutative operation, the effect of communication startup latency will be minimised.
The time for the commutative operation to take place is approximately proportional to the
number of communication steps, S. This is the most important term to reduce.
The communication time between processes is often affected by the number of
communications occurring simultaneously. It is therefore important that both the overall
number of communications and the number of communications per step are also minimised.
The type of communication taking place at each step also determines performance. If all the
communications in a step are between neighbour processors then there will be little
contention on the communication network as the communications take place. If the
communications are not to nearest neighbours then the number of communications will affect
the time to complete the step since the routing mechanism of the hardware will be used to
deliver messages and contention may occur.
If the process topology has not been mapped well onto the hardware topology, it will often be
the case that communication from a nearest neighbour process is not in hardware a
communication between nearest neighbour processors. For example, a ring topology
implemented onto a pipeline of processors will require messages between the last and
first processor to be sent via a routing mechanism. Communications
along this link will always be slower than along the other links and in a commutative
communication step the slowest communication will determine the time for the step.
6.1 Commutative Operation using a Pipeline
Figure 17 shows a diagram of a pipeline of processes and the communication pattern for a
commutative operation. The number of communications and steps is proportional to the
number of processes. This is because the value contributed from each process must be passed
down to the last process and then the result is passed back up the pipeline again.
[Figure omitted: time diagram of a commutative operation on a 16-process pipeline]

S = 2(P - 1)
C = 2(P - 1)

Figure 17 Commutative operations using a pipeline connection topology.
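The two phases described above (values combined on the way down the pipe, result returned on the way back up) can be sketched as a simple simulation; plain Python stands in for real message passing, and the function name is illustrative, not a CAPLib routine.

```python
def pipeline_commutative(values, op):
    """Return (result, number_of_communication_steps) for a pipeline of
    len(values) processes, each contributing one value to the operation."""
    p = len(values)
    acc = values[0]
    steps = 0
    # Phase 1: P - 1 steps carry the running result down to the last process.
    for i in range(1, p):
        acc = op(acc, values[i])   # one neighbour-to-neighbour communication
        steps += 1
    # Phase 2: P - 1 further steps return the result back up the pipe.
    steps += p - 1
    return acc, steps

result, steps = pipeline_commutative([1, 2, 3, 4], lambda a, b: a + b)
# For P = 4 this takes S = 2(P - 1) = 6 steps.
```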
The number of steps for a ring commutative operation is the same as for a pipeline but the
number of communications is higher. On some hardware, this will give a pipe topology the
edge in performance over the ring. If a commutative operation can be performed around the ring using non-blocking communications then the number of steps can
be halved. Communication around a ring requires all the values to be accumulated in an array
in process order during communication and then the commutative computation performed
using the array once all values have been communicated. This is to avoid round off problems
and guarantees that each processor calculates the same result. Buffer space is required on
each process to perform this operation and for a very large parallel run, i.e. thousands of processes, this may be disadvantageous. If it is possible for the hardware to perform
communication simultaneously in both directions then the performance can be even higher
since values can travel both ways around the ring at the same time, reducing the distance to
the furthest process to P/2.
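The accumulate-then-reduce scheme used around the ring can be sketched as follows. The shared list below stands in for the P per-process buffers filled in process order during circulation; all names are illustrative, not part of CAPLib.

```python
def ring_commutative(values, op):
    """Every process gathers all P values in process order (after P - 1
    neighbour-to-neighbour steps around the ring), then reduces its buffer
    locally in that fixed order, so round-off is identical on all processes."""
    p = len(values)
    buffers = [list(values) for _ in range(p)]  # one ordered buffer per process
    results = []
    for buf in buffers:
        acc = buf[0]
        for v in buf[1:]:
            acc = op(acc, v)      # same order everywhere => same result
        results.append(acc)
    return results

res = ring_commutative([0.1, 0.2, 0.3, 0.4], lambda a, b: a + b)
# All entries of res are bit-identical, even for floating point.
```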
6.2 Commutative Operation using a Grid
Figure 18 shows a diagram of a 2D grid of processes and the communication pattern for a
commutative operation. Each stage of the commutative operation is across one of the
dimensions, d, of the grid. This method would be used when each processor in a grid can only
talk directly to its grid neighbours; otherwise it is advantageous to use a hyper-cube method
(see next section).
[Figure omitted: communication pattern on an 8 x 8 grid of processes; Stage 1 reduces along one dimension, Stage 2 along the other]

S = 2 * sum over i = 1..d of (Pi - 1)
C = 2 * sum over i = 1..d of (Pi - 1) * product over j = 1..d, j != i of Pj

Figure 18 Communication pattern for commutative operation using a grid.
where Pi is the number of processors in dimension i and d is the number of dimensions.
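Under these definitions the step and communication counts for the staged grid method can be computed directly. The helper below is a sketch of the cost formulas only (not a CAPLib routine), assuming the reconstruction of the counts given above.

```python
from math import prod

def grid_commutative_cost(dims):
    """S and C for a commutative operation performed stage by stage across
    each dimension of a process grid, where dims[i] is the number of
    processors in dimension i."""
    d = len(dims)
    S = 2 * sum(p - 1 for p in dims)
    C = 2 * sum((dims[i] - 1) * prod(dims[j] for j in range(d) if j != i)
                for i in range(d))
    return S, C

# For the 8 x 8 grid of Figure 18:
S, C = grid_commutative_cost([8, 8])
```

Note that for a single dimension the formulas collapse to the pipeline case, S = C = 2(P - 1).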
6.3 Commutative Operation using Hyper-cubes
In a hyper-cube topology of dimension d, each process is connected directly to d other
processes. Algorithms implemented using a hyper-cube offer the best performance generally
over other methods because the number of steps to perform a commutative operation is
related to d, i.e. 2d, not the number of processes, P. For non-trivial P, the hyper-cube offers far greater performance than any other topology.
There are a number of ways to implement a commutative operation on a hyper-cube. Two
methods are currently implemented in CAPLib. Method A uses a pair-wise exchange
between processes until every process has the result. Method B uses a binary tree algorithm.
Both rely on the connectivity offered by the hyper-cube. Both methods A and B guarantee the
order of computation will be the same on every process and therefore the values obtained will
be the same on all processes. This is obviously the case with method B. In method A, this is
guaranteed by always placing the contribution from the lower numbered processor
on the left hand side of the operation. The pair-wise exchange of data that characterises the
Method-A operation can be further improved if non-blocking communications are used. Overlapping
the exchange of data reduces the number of steps by a factor of two, but relies on the
performance of two small non-blocking communications out-performing two small blocking
communications. CAPLib does not currently implement a non-blocking version of Method-A.
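The pair-wise exchange of Method A can be sketched as a simulation; real message passing is replaced by array indexing, and the function name is illustrative. Note how the lower-numbered contribution is always placed on the left, so every process applies the operation in the same order.

```python
def hypercube_commutative_A(values, op):
    """Pair-wise exchange on P = 2**d processes: at step k each process
    exchanges its partial result with the partner whose rank differs in
    bit k, combining with the lower-numbered contribution on the left."""
    p = len(values)
    d = p.bit_length() - 1
    assert p == 1 << d, "sketch assumes a complete hyper-cube"
    partial = list(values)
    for k in range(d):
        nxt = []
        for rank in range(p):
            partner = rank ^ (1 << k)
            lo, hi = min(rank, partner), max(rank, partner)
            nxt.append(op(partial[lo], partial[hi]))  # lower rank on the left
        partial = nxt
    return partial   # after d steps every process holds the same result

res = hypercube_commutative_A(list(range(1, 9)), lambda a, b: a + b)
# d = 3 exchange steps for P = 8; every entry of res is the full sum.
```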
[Figure omitted: time diagram of pair-wise exchanges among 16 processes]

Method A:
S = 2d (blocking exchange)
S = d (non-blocking exchange)
C = dP (d > 1)

Figure 19 Communication pattern for commutative operation using a hyper-cube (method A, d=4)
[Figure omitted: time diagram of binary-tree reduction and broadcast among 16 processes]

Method B:
S = 2d
C = 2^(d+1) - 2

Figure 20 Communication pattern for commutative operation using a hyper-cube (method B, d=4).
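Method B's tree strategy can be sketched in the same simulated style: contributions are reduced up a binary tree to process 0, then the result is broadcast back down. The names and the comms counter are illustrative, not part of CAPLib.

```python
def hypercube_commutative_B(values, op):
    """Binary-tree reduction to process 0 followed by a tree broadcast of
    the result: (P - 1) communications up plus (P - 1) down, i.e.
    C = 2**(d+1) - 2 for P = 2**d processes, in S = 2d steps."""
    p = len(values)
    d = p.bit_length() - 1
    assert p == 1 << d, "sketch assumes a complete hyper-cube"
    partial = list(values)
    comms = 0
    for k in range(d):                       # reduction phase: d steps
        for rank in range(0, p, 1 << (k + 1)):
            partial[rank] = op(partial[rank], partial[rank | (1 << k)])
            comms += 1
    comms += p - 1                           # broadcast phase: d more steps
    return partial[0], comms

result, comms = hypercube_commutative_B([1] * 16, lambda a, b: a + b)
# For P = 16 (d = 4): result is the full sum, comms = 2**5 - 2 = 30.
```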
In order to use these methods in runs where the number of processes does not exactly make
up a hyper-cube, the methods must be modified to account for this. For method A, if we
denote by k the number of processes in excess of the largest complete hyper-cube, then the
last k processes send their values to the first k processes before the main part of the
procedure begins. This ensures the values from these processes are used. When the main
procedure is complete, the last k processes receive the result. Method B handles excess
processes by extending the binary tree communication strategy to include the extra k processes.
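The pre-fold step for Method A described above might look like this in outline; the helper name and the use of plain lists are notational assumptions.

```python
def pre_fold(values, op):
    """When P is not a power of two, fold the k excess values into the
    first k processes before the main pair-wise exchange begins; the
    excess processes receive the final result afterwards."""
    p = len(values)
    m = 1 << (p.bit_length() - 1)   # largest complete hyper-cube, m <= p
    k = p - m                        # number of excess processes
    core = list(values[:m])
    for i in range(k):
        core[i] = op(core[i], values[m + i])  # process m+i sends to process i
    return core, k

core, k = pre_fold(list(range(6)), lambda a, b: a + b)
# P = 6: m = 4, k = 2, so the values of processes 4 and 5 are folded
# into processes 0 and 1 before the main exchange.
```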
6.4 Comparison of Commutative Methods
Table 2 shows a comparison of the number of steps and the number of communications
needed for a commutative operation using the methods implemented inside CAPTools.
   P      Pipe             Ring              Hyper-cube A             Hyper-cube B
          Steps  Comms     Steps  Comms      Steps   Steps    Comms   Steps  Comms
                                             (sync)  (async)
    2         2      2         2        2        2       1        2       2      2
    4         6      6         6       12        4       2        8       4      6
    8        14     14        14       56        6       3       24       6     14
   16        30     30        30      240        8       4       64       8     30
   32        62     62        62      992       10       5      160      10     62
   64       126    126       126     4032       12       6      384      12    126
  128       254    254       254    16256       14       7      896      14    254
  256       510    510       510    65280       16       8     2048      16    510
  512      1022   1022      1022   261632       18       9     4608      18   1022
 1024      2046   2046      2046  1047552       20      10    10240      20   2046

Table 2 Number of steps S, and communications C, for a commutative operation using different methods.
Obviously the Hyper-cube methods are the best for P>4; the pipe and ring methods would
only be used on machines where the hyper-cube is not available, for example, machines built
of hard-wired directly connected processors in a pipeline or grid. Each of the hyper-cube
methods performs the operation in d steps, but B takes fewer communications overall than A,
for P>2. For a large number of processes this factor becomes very important as time for a
large number of simultaneous communications in one step can be affected by message
contention across the hardware processor interconnect. For A, the number of messages
remains constant at each step in a commutative operation at P/2. The number of
communications in each successive step using method B reduces by a factor of 2 and
therefore any contention is minimised to the first few steps. The number of steps needed to
complete the operation using A can however be halved if non-blocking communications are used.
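The formulas behind Table 2 can be collected in one place; the sketch below simply reproduces the counts for each method and is not part of CAPLib.

```python
def commutative_costs(P):
    """Steps S and communications C for each method, for P a power of two
    with hyper-cube dimension d = log2(P)."""
    d = P.bit_length() - 1
    assert P == 1 << d, "table assumes P is a power of two"
    return {
        "pipe":        {"S": 2 * (P - 1), "C": 2 * (P - 1)},
        "ring":        {"S": 2 * (P - 1), "C": P * (P - 1)},
        "hypercube_A": {"S_sync": 2 * d, "S_async": d, "C": d * P},
        "hypercube_B": {"S": 2 * d, "C": 2 ** (d + 1) - 2},
    }

c = commutative_costs(16)
# Matches the P = 16 row of Table 2: pipe 30/30, ring 30/240,
# hyper-cube A 8/4/64, hyper-cube B 8/30.
```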
Figure 21 shows a graph of communication latency for CAP_COMMUTATIVE using
CAPLib over SHMEM on the Cray T3D using a pipeline and the two hyper-cube methods.
The graph clearly demonstrates the effect of using different global communication
algorithms. Global communication using a pipeline becomes rapidly more expensive as the
number of processors increases. The best performance is given by the Hyper-cube B
algorithm. Note that in this case MPI_ALLREDUCE which is the MPI equivalent to
CAP_COMMUTATIVE does not perform as well as the Hyper-cube methods employed by
CAP_COMMUTATIVE. Indeed, the CAP_COMMUTATIVE function has performed better
than the corresponding MPI_ALLREDUCE function in all ports of CAPLib so far
undertaken.
[Figure omitted: graph of time (microseconds, 0-700) against number of processors (1 to 1000, log scale) for PIPELINE, HYPERCUBE A, HYPERCUBE B and MPI_ALLREDUCE, using CAPLib over SHMEM on the Cray T3D]

Figure 21 CAP_COMMUTATIVE latency on Cray T3D
7 CAPLib Support Environment
One of the major reasons that parallel environments are often difficult to use is the amount of
configuration and details the user must know about the system in order to successfully
compile and run their parallel programs. As part of the CAPTools parallelisation
environment, a set of utilities is provided to aid users in compiling, running and debugging
their parallel programs. The main utilities are capf77 and capmake, which allow compilation
of the user's source code; caprun, which provides a mechanism for parallel execution of the
user's compiled executable; and capsub, which provides a simple generic method for
submitting jobs to parallel batch queues. The characteristics of the utilities are:-
Simple to use The utilities hide from the user as much as possible the details of the
compilation and execution of parallel programs. Parallel compilation usually requires extra
flags on the compile line and special libraries linked in. Many parallel environments require a complex initialisation process to begin the execution of a parallel program. Parallel execution
often fails, not because the user's program is incorrectly coded, but because they have
wrongly configured the parallel environment in some way. By hiding the messy details of
configuration from the user, execution becomes both quicker and more reliable. In many
cases, the users do not need a detailed knowledge of the parallel environment they are
utilising at all.
Generic interface Each utility uses a set of common arguments across the domains of
parallel environment (e.g. MPI) and machine type, e.g. Cray T3D. This makes it easy for the
user to migrate from one machine or parallel environment to another. The main generic
arguments are:-
-mach  Machine type, e.g. Sun, Paramid, T3D.
-penv  Parallel environment type, e.g. PVM, MPI, i860toolset, shmem.
-top  Parallel topology type, e.g. pipe2, ring4, full6, grid2x2.
-debug n1 n2..  Execute in debug mode on processors n1, n2 etc.
When a utility is executed it first checks for the existence of the environment variables
CAPMACH and CAPPENV that provide default settings for the machine type and parallel
environment type. These can be set manually by the user in their login script or by the
execution of the usecaplib script, which attempts to determine these automatically from the
host system. The command line argument versions of the environment variables can be used
to override any defaults.
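The precedence rule described above (command-line argument over environment variable over built-in default) amounts to the following; the helper name is hypothetical and not part of the CAPLib scripts.

```python
import os

def resolve_setting(cli_value, env_name, default=None):
    """A command-line argument wins; otherwise fall back to the CAPMACH /
    CAPPENV environment variable; otherwise use any built-in default."""
    if cli_value is not None:
        return cli_value
    return os.environ.get(env_name, default)

os.environ["CAPMACH"] = "t3d"
machine = resolve_setting(None, "CAPMACH")    # taken from the environment
machine = resolve_setting("sun", "CAPMACH")   # command line overrides it
```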
8 Parallel Debugging
The debugging of parallel message passing code often requires the user to start up multiple
debuggers and trace and debug the execution on several processes. The main disadvantage
of having several debuggers running on the workstation screen is the large amount of
resource, both in computer time and physical memory, that this can require. Each debugger (with a graphical user interface) may require 40 Mbytes, and starting up several debuggers or
attaching to several running processes can take minutes on a typical workstation. Recently
computer vendors and third party software developers have begun to address this issue by
allowing the debugger to handle more than one process at a time and allowing the user to
quickly switch from one process to another. This dramatically reduces the memory cost since
only one debugger is now running and, if the same executable is running on all processors,
only a single set of debugging information need be loaded. Examples of commercial
debuggers that provide such a facility are TotalView [15] and Sun Microsystems's Workshop
development environment [16]. Cheng and Hood in [17] describe the design and
implementation of a portable debugger for parallel and distributed programs. Their client-
server design allows the same debugger to be used on both PVM and MPI programs; they suggest that the process abstractions used for debugging message-passing can be adopted to
debug HPF programs at the source level. Recently the High Performance Debugging Forum
[18] has been established to define a useful and appropriate set of standards relevant to
debugging tools for High Performance Computers.
The caprun script has a -debug argument that allows users to specify a set of
processes that they wish to debug. On systems that do not yet provide a multi-process
debugger but do provide some mechanism to debug parallel processes, using this option will
result in a set of debuggers appearing on the screen attached to the chosen process set.
CAPLib also provides a library routine called CAP_DEBUG_PROC that allows a debugger
to be attached to an already running process where this is possible, perhaps following some error condition. When a process calls CAP_INIT, one of the tasks undertaken is to check
command line arguments and environment variables. If -debug is found then a call is made to
CAP_DEBUG_PROC that calls a machine-dependent system routine to run the script
capdebug. This script is passed a set of information such as the calling process-id, DISPLAY
environment variable and executable pathname that allows a debugger to be started up,
attached to the calling process and displaying on the host machine's screen. The caprun script
also has a -dbxscript argument that allows the user to specify a set of debugger
commands to be executed by each debugger when starting up.
As an example
caprun -m sun -p pvm3 -top ring5 -debug 1-3 -dbxscript stopinsolve jac
This will start up 3 debuggers attached to processes 1-3 on the user's workstation; all
debuggers will then execute the script stopinsolve, which might contain:
print cap_procnum
stop in solve
cont
This would print the CAPTools processor number, set a break point in routine solve and
continue program execution.
9 Results
This section gives a series of results obtained for parallelisations using CAPTools and
CAPLib, of two of the well-known NAS Parallel Benchmarks (NPB) [19], APPLU (LU) and
APPBT (BT). The LU code is a lower-upper diagonal (LU) CFD application benchmark. It
does not, however, perform a LU factorisation but instead implements a symmetric
successive over-relaxation (SSOR) scheme to solve a regular-sparse, block lower and upper
triangular system. BT is representative of computations associated with the implicit operators
of CFD codes such as ARC3D at NASA Ames. BT solves multiple independent systems of
non-diagonally dominant, block tridiagonal equations. The codes are characterised in parallel form by pipeline algorithms, making all codes sensitive to communication latency.
The results for the benchmarks refer to three different versions/revisions of the same code.
Rev 4.3 is a serial version of the benchmarks written in 1994 as a starting point for optimised
implementations. Version NPB2.2 is a parallel version of the codes written by hand by
NASA and using MPI communication calls. Version NPB2.3, the successor to NPB2.2, has
both a serial and parallel version. The results presented here are for runs of CLASS A,
64x64x64 size problems. For each code, SPMD parallelisations using a 1-D and in some
cases a 2-D partitioning strategy were produced using CAPTools. The results for runs using
these parallelisations on the Cray T3D, Transtech Paramid and the SGI Origin2000 are
presented in the following sections together with results for runs of the NPB2.2/2.3 parallel MPI versions.
9.1 LU
The results for LU runs on the Cray T3D, T3E, SGI Origin 2000 and Transtech Paramid are
shown in Figure 22 to Figure 25 respectively. The T3D and T3E results compare the
performance of 1-D and 2-D parallelisations of LU using CAPTools. The 1-D version can
only be run on a maximum of 64 processors because of the size of problem being solved
(64x64x64). The 2-D version was run up to 8x8 processors and gives very reasonable results.
Figure 23 shows graphs of execution time for 1-D and 2-D parallelisations of LU using
CAPTools on the Cray T3E with different versions of CAPLib. The best results are given, as expected, by the SHMEM version of CAPLib, although for the 2-D runs the differences are
quite small. These small differences are in part due to the pipelines present in LU code. The
1-D version has pipelines with a much longer startup and shutdown period than the 2-D
version and therefore performance is more dependent on the startup latency of the
communications. Another factor is the memory access patterns required for communication
in the 2nd dimension which use buffered CAPLib calls such as CAP_BSEND/BRECEIVE
that gather data before sending and scatte