
CAPLib: A THIN LAYER MESSAGE PASSING LIBRARY TO SUPPORT COMPUTATIONAL MECHANICS CODES ON DISTRIBUTED MEMORY PARALLEL SYSTEMS

    By

    P F Leggett, S P Johnson and M Cross

    Parallel Processing Research Group

    Centre for Numerical Modelling and Process Analysis

    University of Greenwich

    London SE18 6PF

    UK.

    ABSTRACT

The Computer Aided Parallelisation Tools (CAPTools) [1] is a set of interactive tools aimed at providing automatic parallelisation of serial Fortran Computational Mechanics (CM) programs. CAPTools analyses the user's serial code and then, through stages of array partitioning, mask and communication calculation, generates parallel SPMD (Single Program Multiple Data) message passing Fortran.

The parallel code generated by CAPTools contains calls to a collection of routines that form the CAPTools Communications Library (CAPLib). The library provides a portable layer and a user friendly abstraction over the underlying parallel environment. CAPLib contains optimised message passing routines for data exchange between parallel processes and other utility routines for parallel execution control, initialisation and debugging. By compiling and linking with different implementations of the library the user is able to run on many different parallel environments.

Even with today's parallel systems, the concept of a single version of a parallel application code is more of an aspiration than a reality. However, for CM codes the data partitioning SPMD paradigm requires a relatively small set of message-passing communication calls. This set can be implemented as an intermediate thin layer library of message-passing calls that enables the parallel code (especially that generated automatically by a parallelisation tool such as CAPTools) to be as generic as possible.

CAPLib is just such a thin layer message passing library that supports parallel CM codes, by mapping generic calls onto machine specific libraries (such as Cray SHMEM) and portable general purpose libraries (such as PVM and MPI). This paper describes CAPLib together with its three perceived advantages over other routes:

as a high level abstraction, it is both easy to understand (especially when generated automatically by tools) and to implement by hand, for the CM community (who are not generally parallel computing specialists);

the one parallel version of the application code is truly generic and portable;

the parallel application can readily utilise whichever message passing libraries on a given machine yield optimum performance.


    1 Introduction

Currently the most reliable and portable way to implement parallel versions of computational mechanics (CM) software applications is to use a domain decomposition data partitioning strategy to ensure that data locality is preserved and inter-processor communication is minimised. The parallel hardware model assumes a set of processors, each with its own memory, linked in some specified connection topology. The parallelisation paradigm is single program multiple data (SPMD); that is, each processor runs the same application but using its own local data set. Of course, neighbouring processors (at least) will need to exchange data during the calculation and this must usually be done in a synchronised manner if the parallel computation is to faithfully emulate its scalar equivalent. One of the keys to enabling this class of parallel application is the message-passing library that enables data to be efficiently exchanged amongst the processors comprising the system.

Up until the early 1990s, parallel vendors typically provided their own message passing libraries, which were naturally targeted at optimising performance on their own hardware. This made it very difficult to port a CM application from one parallel system to another. In the early 1990s, portable message passing libraries began to emerge. The two most popular such libraries are PVM [2] and MPI [3]. One or other, or both, of these libraries is now implemented on most commercial parallel systems. Although this certainly addresses the issue of portability, these generic message-passing libraries may give far from optimal performance on any specific system. On Cray T3D systems, for example, the PVM library performance is somewhat inferior to the manufacturer's own SHMEM library [4]. Hence, to optimise performance on such a system the parallel application needs to utilise the in-house library.

Although both PVM and MPI are powerful and flexible, they actually provide much greater functionality than is required by the CM community in porting their applications to commercial parallel hardware. This issue was recognised by the authors some years ago when they were working on the design phase of some automatic parallelisation tools for FORTRAN computational mechanics codes, CAPTools [1,5,6,7,8,9]. The challenge was to produce generic parallel code that would run on any of the commercially available high performance architectures. The key factor that inhibited the generation of truly generic parallel code was the variety of the message passing libraries and the structure of the information passed into the resulting calls as arguments. From an extensive experience base of code parallelisation, the CAPTools team recognised that all typical inter-processor communications required by structured mesh codes (typical of CFD applications) could be addressed by a concise set of function calls. Furthermore, it transpired that these calls could easily be implemented as a thin software layer on top of the standard message passing libraries PVM and MPI, plus a parallel system's own optimised libraries (such as Cray T3D/T3E SHMEM). Such a thin layer software library could have three distinct advantages over other routes:

as a high level abstraction, it is both easy to understand and to implement by hand, for the CM community (who are not generally parallel computing specialists);

the one parallel version of the application code is truly generic and portable;

the parallel application can readily utilise whichever message passing libraries on a given machine yield optimum performance.

In this paper we describe the design, development and performance of the CAPLib message passing software library, which is specifically targeted at structured mesh CM codes. As such, we are concerned with ease of use by the CM community, portability, flexibility and


computational efficiency. Such a library, even if it is a very thin layer, must represent some kind of overhead on the full scale message passing libraries; part of the performance assessment considers this issue. For such a concept to be useful to the CM community its overhead must be minimal.

    2 CAPLib Design and Fundamentals

CAPLib's primary design goal was to provide the initialisation and communication facilities needed to execute parallel Computational Mechanics code, whether parallelised manually or generated by the CAPTools semi-automatic parallelisation environment. A secondary goal is to provide a generic set of utilities that make the compilation and execution of parallel programs using CAPLib as straightforward as possible. The library is also supplied with a set of scripts to enable easy and standardised compilation of parallel code with different versions of CAPLib and for the simple execution of the compiled executable on different machines. This section discusses the design, features and fundamentals of the library.

    2.1 Design

The different layers of software in CAPTools generated code are shown in Figure 1. CAPLib has been implemented over MPI [3] and PVM [2], the most important standard parallel communications libraries in current use, to provide an easy method of porting CAPLib to different machines. Where possible, versions of CAPLib have been developed for proprietary libraries in order to obtain maximum performance, for example the Cray SHMEM library [4] or Transtech's i860toolset library [11].

Figure 1 CAPLib software layers: CAPTools generated parallel code calls the CAPLib API, which is implemented over MPI, PVM, Cray SHMEM or the Transtech i860toolset.

    The library has been designed to meet the following criteria:

Efficient. Speed of communication is perhaps the most vital characteristic of a parallel message-passing library. Startup latency has been found to be a very important factor affecting the performance of parallel programs. The addition of layers of communication software over the hardware communication mechanism increases the startup latency of all communications. It is important, therefore, to access the communication mechanism of a machine at the lowest level possible. Each implementation of CAPLib attempts to utilise the lowest level communications API of each parallel machine in order to achieve low latency and therefore communications that are as fast as possible.

Portable. Code written to use CAPLib is portable across different machines. Only recompilation is necessary.

Correct. It is vitally important for parallelised computational mechanics programs to give the same answers in parallel as in serial. The commutative (global) message passing functions provided by CAPLib are implemented so as to guarantee that the same result is seen on every processor. This can be of vital importance for the correct execution of parallel code and its successful completion. For example, a globally summed value may


be used to determine the exit of an iterative loop. If the summed value is not computed in a consistent manner across all processors, then round off error may cause some processors to continue executing the loop whilst others exit, resulting in communication deadlock (a sketch of this pattern is given after these criteria).

Generic. The library is generic in the sense that decisions about which processor topology to execute on are taken at run time. CAPTools generated code compiled with CAPLib will run, for example, on 1 processor, a pipeline of 2 processors, a ring of 100 processors, or a torus of 64. The scripts provided with the library are also generic. For example, capmake and caprun are scripts that allow the user to compile and run parallel code without knowing system specific compiler and execution procedures.

Simple. The library itself has been kept as simple as possible, both in the design of the API and in its implementation. By keeping the library simple, with the minimum number of functions and also the minimum number of arguments to those functions, the library is easily ported to different parallel machines. An uncomplicated interface is also more easily understood and assimilated by the user.
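As an illustration of the correctness requirement above, the following sketch shows the typical loop-exit pattern: each process computes a local residual and a single, globally consistent value is then formed before the exit test. The combining routine CAP_RADD and the commutative call are described in section 3.5; the variable names, array bounds, the type code 2 for REAL and the loop label follow the conventions of the example in Figure 3 and are illustrative only.

      EXTERNAL CAP_RADD
      REAL RESID, TOL, A(1000), B(1000)
C     Each process sums the contribution from its own partition
      RESID = 0.0
      DO I = CAP_LOW, CAP_HIGH
        RESID = RESID + ABS(B(I)-A(I))
      ENDDO
C     Form one global residual, guaranteed identical on every process,
C     so that all processes take the same branch at the exit test
      CALL CAP_COMMUTATIVE(RESID, 2, CAP_RADD)
C     Loop back to the (not shown) start of the iteration at label 40
      IF (RESID.GT.TOL) GOTO 40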

    2.2 Parallel Hardware Model

    CAPTools currently generates parallel code based on a Distributed Memory (DM) parallel

    hardware model, which is illustrated in Figure 2. In the CAPLib parallel hardware model

    processors are considered to be arranged in some form of topology, where each processor is

    directly connected to several others, e.g. a pipe, ring, grid, torus or full (fully connected).

    Each processor is assigned a unique number (starting from 1). In the case of grid and torus

    topologies, each processor also has a dimensional processor number. Memory is considered

    local to each processor and data is exchanged between processors via message passing of

    some form between directly connected processors. CAPTools generated parallel code can

    also be executed on Shared Memory (SM) systems providing, of course, CAPLib has been

ported to the system. On an SM system, each processor still executes the same SPMD program, operating on different sections of the problem data. The main difference between this and

    operation on a DM system is that message-passing calls can be implemented inside CAPLib

    as memory copies to and from hidden shared memory segments. In this respect the CAPLib

model differs from the usual parallelisation model used on SM machines, which assumes that every

    processor can directly access all memory of the problem. By restricting the memory each

    processor accesses and enforcing a strict and explicit ordering to the update of halo regions

    and calculation of global values, the CAPLib parallel hardware model ensures that there will

    be very little memory contention on SM systems and particularly on Distributed Shared

Memory (DSM) systems. As the number of processors becomes large (some of the machines recently built for the Accelerated Strategic Computing Initiative (ASCI) [10] have thousands of processors, for example), the localisation of communications becomes very important.

    Distributing data onto processors, taking into account the hardware processor topology, can

    localise communication between processors and thus minimise contention in the

    communications hardware.


Figure 2 CAPLib parallel hardware model: processors (each a CPU with its own local memory) arranged in, for example, a pipeline topology, a 2-D grid topology or a fully connected topology, with each processor given a unique number and, for grid topologies, a dimensional processor number.

    2.3 Process Topologies

Knowledge of the processor topology of the parallel hardware on which a parallel code is to run is very important. It can be used to optimise the speed and distance travelled by messages between processes. CAPTools attempts to generate code that will minimise the amount of communication needed; however, to perform those communications that are required as quickly as possible, the process topology must be mapped onto the processor topology.

CAPLib uses the concept of a process topology for this reason. An intelligent mapping of processes to processors will give better performance than would be possible from a random allocation. By placing processes so that most communications are needed only between directly connected neighbouring processors, the distance the communications have to travel is minimised, avoiding hot spots and maximising bandwidth. An awareness of process topology also allows for more efficient programming of global communications; for example, the use of a hyper-cube to perform global summations in parallel (see section 6.3).

    By requiring that processes are connected in a pipe or grid type topology, it is possible for

    CAPTools to generate parallel code for structured mesh parallelisations using directional

    communications, i.e. where communication is specified as being up or down, left or right of a

    process rather than to a particular process id. This programming style can make it easier for

    the user to write and understand parallel code, especially for grids of two or more

    dimensions.

    Where possible, CAPLib tries to use the fastest methods of communication that are available

    on a particular machine. It might be that communications to neighbouring processors could

    be made directly through fast, dedicated hardware channels.

The topology required for a particular run of a parallel program (e.g. pipe or ring) and the number of processes can be specified to the CAPLib utilities and to the parallel program at run time in a number of ways: via an environment variable; as a flag on the command line; in a configuration file; or, if none of these is set, by asking the user interactively. The topologies currently available from CAPLib are pipe, ring, grid, torus and full (all to all).

    2.4 Messages

Each message sent and received using the CAPLib communication routines has a length, a type and a destination.


    2.4.1 Message Length

The length is defined in terms of the number of items to be communicated. Zero or a negative number of items must result in no message being sent. All CAPLib communication routines therefore check the length argument and send no message if it is zero or negative.


    used to hold RI(2). This method has been found to be generic and works on every

    machine tested so far.

3. Heterogeneous computing. If a parallel program is sending messages within a heterogeneous environment then the size and storage of data types may differ between processors. One processor may use little endian (low bytes first) and another big endian (high bytes first) storage, i.e. bytes in a message may have to be swapped at the destination or origin depending on the data type. Floating point representation may also be different; e.g. the default size might be 4 bytes on one machine and 8 bytes on another. For the library to be able to convert between different storage types it must know which type is being communicated in order to apply the correct translation. Currently the library makes the assumption that all processors are homogeneous, but the knowledge of the type of messages within the library allows heterogeneous capability to be added in the future if this is found to be desirable.

    2.4.3 Message Destination

Message destination is determined by an integer argument passed in each communication call. A negative value indicates a direction; a positive value indicates a process number. The code generated by CAPTools for structured mesh parallelisations currently assumes a pipeline or grid process topology. The communication calls therefore use the negative values to indicate direction to the left or right (or up and down) of a process's position in the topology.

These are available as predefined CAPLib constants such as CAP_LEFT and CAP_RIGHT for improved readability. A characteristic of parallel SPMD code written for an ordered topology is a test for neighbour existence before communication. This is because the first processor does not have a neighbour to its left and the last processor does not have a neighbour to its right. CAPLib functions perform the necessary tests for neighbour existence internally to improve the readability of CAPTools generated parallel code. Having the neighbour test within the library also reduces the possibility of error (and therefore deadlock) in any manually written parallel code. The functions also test for zero-length messages, as mentioned earlier, since this is often a possibility, so that the user avoids having to perform this chore as well.

Typical hand written user code without these internal tests might look as follows:

      IF (N.GT.0) THEN
        IF (MYNUM.LT.NPROC) CALL ANY_RECEIVE(A,N*4,MYNUM+1)
        IF (MYNUM.GT.1) CALL ANY_SEND(A,N*4,MYNUM-1)
      ENDIF

    where MYNUM is the processor number and NPROC is the number of processors.

Using the CAPTools communications library, the code becomes

      CALL CAP_RECEIVE(A,N,1,CAP_RIGHT)
      CALL CAP_SEND(A,N,1,CAP_LEFT)

where the receive communication will only take place if N is greater than zero and a processor is present to the right, and similarly for the send communication if a processor is available to the left.

3 Requirements for Message-Passing from Structured Mesh Based Computational Mechanics Codes

CAPLib satisfies the general requirements for message-passing from parallelisations of structured mesh based Computational Mechanics codes. The library has to provide for:


    Initialisation of required process topology

    Data Partition calculation

    Termination of parallel execution

    Point to point communications

    Overlap area (halo) update operations

Commutative operations, i.e. local value -> global value using some function

    Broadcast operations

    Algorithmic Parallel Pipelines

In the following sections, the general requirements for communication and parallel constructs for CM codes and the CAPLib calls that address these requirements are described, particularly emphasising their novel aspects. To illustrate this discussion, a simple one-dimensional parallel Jacobi code (Figure 3) obtained using CAPTools is used. The CAPLib library routines are summarised in Table 1 below.

CAPTools Communications Library (CAPLib) Routine Summary

Function Name       Function Arguments                                  Type  Blocking  Buffered  Cyclic
CAP_INIT            ()                                                  I     x
CAP_FINISH          ()                                                  I     x
CAP_SETUPPART       (LOASSN,HIASSN,LOPART,HIPART)                       I     x
CAP_SEND            (A,NITEMS,TYPE,PID)                                 P     x
CAP_RECEIVE         (A,NITEMS,TYPE,PID)                                 P     x
CAP_EXCHANGE        (A,B,NITEMS,TYPE,PID)                               E     x
CAP_BSEND           (A,NITEMS,STRIDE,NSTRIDE,ITYPE,PID)                 P     x         x
CAP_BRECEIVE        (A,NITEMS,STRIDE,NSTRIDE,ITYPE,PID)                 P     x         x
CAP_BEXCHANGE       (A,B,NITEMS,STRIDE,NSTRIDE,ITYPE,PID)               E     x         x
CAP_CSEND           (A,NITEMS,TYPE,PID)                                 P     x                   x
CAP_CRECEIVE        (A,NITEMS,TYPE,PID)                                 P     x                   x
CAP_CEXCHANGE       (A,B,NITEMS,TYPE,PID)                               E     x                   x
CAP_ASEND           (A,NITEMS,TYPE,PID,ISEND)                           P
CAP_ARECEIVE        (A,NITEMS,TYPE,PID,IRECV)                           P
CAP_AEXCHANGE       (A,B,NITEMS,TYPE,PID,ISEND,IRECV)                   E
CAP_ABSEND          (A,NITEMS,STRIDE,NSTRIDE,ITYPE,PID,ISYNC)           P               x
CAP_ABRECEIVE       (A,NITEMS,STRIDE,NSTRIDE,ITYPE,PID,ISYNC)           P               x
CAP_ABEXCHANGE      (A,STRIDE,NSTRIDE,NITEMS,TYPE,PID,ISEND,IRECV)      E               x
CAP_CASEND          (A,NITEMS,TYPE,PID,ISEND)                           P                         x
CAP_CARECEIVE       (A,NITEMS,TYPE,PID,IRECV)                           P                         x
CAP_CAEXCHANGE      (A,B,NITEMS,TYPE,PID,ISEND,IRECV)                   E                         x
CAP_SYNC_SEND       (PID,ISYNC)                                         S     x
CAP_SYNC_RECEIVE    (PID,ISYNC)                                         S     x
CAP_SYNC_EXCHANGE   (PID,ISEND,IRECV)                                   S     x
CAP_COMMUTATIVE     (VALUE,TYPE,FUNC)                                   G     x
CAP_COMMUPARENT     (VALUE,TYPE,FIRSTFOUND,FUNC)                        G     x
CAP_COMMUCHILD      (VALUE,TYPE)                                        G     x
CAP_DCOMMUTATIVE    (VALUE,TYPE,DIRECTION,FUNC)                         G     x
CAP_MCOMMUTATIVE    (VALUE,NITEMS,TYPE,FUNC)                            G     x
CAP_BROADCAST       (VALUE,TYPE,OWNER)                                  G     x
CAP_MBROADCAST      (VALUE,TYPE)                                        G     x

CAPLib Function Type Key:
I  Initialisation, termination and control
P  Point to point communication
E  Ordered exchange communication between neighbours
S  Synchronisation on non-blocking communication
G  Global communication or commutative operation

Table 1 Summary of CAPLib Routines

      REAL TOLD(1000), TNEW(1000)
      EXTERNAL CAP_RMAX
      REAL CAP_RMAX
      INTEGER CAP_PROCNUM,CAP_NPROC
      COMMON /CAP_TOOLS/CAP_PROCNUM,CAP_NPROC
      INTEGER CAP_HTOLD, CAP_LTOLD
C Initialise CAPLib
      CALL CAP_INIT
      IF (CAP_PROCNUM.EQ.1) PRINT*,'ENTER N AND TOL'
      IF (CAP_PROCNUM.EQ.1) READ*,N,TOL
C Broadcast N and TOL to every processor
      CALL CAP_RECEIVE(TOL,1,2,CAP_LEFT)
      CALL CAP_SEND(TOL,1,2,CAP_RIGHT)
      CALL CAP_RECEIVE(N,1,1,CAP_LEFT)
      CALL CAP_SEND(N,1,1,CAP_RIGHT)
C Initialise data partition
      CALL CAP_SETUPPART(1,N,CAP_LTOLD,CAP_HTOLD)
      DO I=MAX(1,CAP_LTOLD),MIN(N,CAP_HTOLD),1
        TOLD(I)=0.0
      ENDDO
C Boundary conditions (only execute on end processors)
      IF (1.GE.CAP_LTOLD.AND.1.LE.CAP_HTOLD) TOLD(1)=1
      IF (N.GE.CAP_LTOLD.AND.N.LE.CAP_HTOLD) TOLD(N)=100
   40 CONTINUE
C Exchange overlap data prior to each Jacobi update
      CALL CAP_EXCHANGE(TOLD(CAP_HTOLD+1),TOLD(CAP_LTOLD),1,2,CAP_RIGHT)
      CALL CAP_EXCHANGE(TOLD(CAP_LTOLD-1),TOLD(CAP_HTOLD),1,2,CAP_LEFT)
      DO I=MAX(2,CAP_LTOLD),MIN(N-1,CAP_HTOLD),1
        TNEW(I)=(TOLD(I-1)+TOLD(I+1))/2.0
      ENDDO
C Calculate maximum difference on each processor
      DIFMAX=0.0
      DO I=MAX(1,CAP_LTOLD),MIN(N,CAP_HTOLD),1
        DIFF=ABS(TNEW(I)-TOLD(I))
        IF (DIFF.GT.DIFMAX) DIFMAX=DIFF
        TOLD(I)=TNEW(I)
      ENDDO
C Find global maximum difference
      CALL CAP_COMMUTATIVE(DIFMAX,2,CAP_RMAX)
      IF (DIFMAX.GT.TOL) GOTO 40
C Output results via first processor
      DO I=1,N,1
        IF (I.GT.CAP_BHTNEW) CALL CAP_RECEIVE(TNEW(I),1,2,CAP_RIGHT)
        IF (I.GE.CAP_BLTNEW) CALL CAP_SEND(TNEW(I),1,2,CAP_LEFT)
        IF (CAP_PROCNUM.EQ.1) WRITE(UNIT=*,FMT=*)TNEW(I)
      ENDDO
      END

Figure 3 CAPTools generated parallel code for a simple 1-D Jacobi program

    3.1 Initialisation, Partition Calculation and Termination

    The routine CAP_INIT is called in the example code to initialise the library. It must be called

    before any other CAPLib function is used. This call sets up the internal channel arrays and

    other data structures that the library needs to access. In some implementations of the library

(e.g. the PVM version) this routine is also responsible for starting all slave processes running. CAP_INIT is responsible for the allocation of processes to processors in such a manner as to


    minimise the number of hops between adjacent processes in the requested topology and

    therefore the overall process to process communication latency, maximising communication

    bandwidth. CAP_INIT is also responsible for communicating information on the runtime

    environment such as hostname and X Window display name to all processes. The size of each

    data type is also dynamically determined by CAP_INIT.

    A general requirement for message-passing SPMD code is for each parallel process to be

    assigned a unique number and also to know the total number of processors involved.

CAP_INIT sets CAP_PROCNUM (the process number) and CAP_NPROC (the number of processes). Both variables are used internally, but can also be referenced in the application code through the common block included in the generated code.
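Any hand written routine that needs these values can simply reference the same common block that appears at the top of the generated code in Figure 3, for example:

      INTEGER CAP_PROCNUM, CAP_NPROC
      COMMON /CAP_TOOLS/ CAP_PROCNUM, CAP_NPROC
C     e.g. restrict screen output to the first process only
      IF (CAP_PROCNUM.EQ.1) WRITE(*,*) 'Running on ', CAP_NPROC, ' processes'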

    The next stage is the calculation of data assignment for each process. Adhering to the SPMD

    model, the partitioning of the arrays TNEW and TOLD for this example on 4 processes

    would require each process to be allocated a data range of 250 array elements in order for

    each processor to obtain a balanced workload (see, for example Figure 4). The CAPLib

    function CAP_SETUPPART is passed the minimum and maximum range of the accessed

data range and the number of processes. It returns to each process its own unique values for the minimum and maximum of the partitioned data range (variables CAP_LTOLD and

    CAP_HTOLD in Figure 3). If the example was partitioned onto 4 processes then

    CAP_SETUPPART would return to process 1 the partition range 1 to 250, process 2 the

    partition range 251 to 500, process 3 the partition range 501 to 750 and process 4 the partition

range 751 to 1000. Each process also requires an overlap region because data assigned on one process may be used on a neighbouring process. This necessitates the communication of data assigned on one process into the overlap region of its neighbouring process. Due to the organised partition of the data, the overlap areas need only be updated from the neighbouring processes. The data partition of the partitioned array TOLD in comparison with

    the original un-partitioned array is shown in Figure 4.

Figure 4 Comparison of an un-partitioned and partitioned 1-D array: the 1000 element array TOLD is divided across PE 1 to PE 4 into the ranges 1-250, 251-500, 501-750 and 751-1000, with each partition carrying overlap areas that are updated from the lower and higher neighbouring processes.
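The block partition described above amounts to simple integer arithmetic on the assigned index range. The sketch below shows one way such bounds could be computed; it is an illustration only, not CAPLib's implementation, and the routine name MY_SETUPPART and the choice to give any remainder elements to the last process are assumptions.

      SUBROUTINE MY_SETUPPART(LOASSN, HIASSN, LOPART, HIPART)
C     Illustrative block partition of the range LOASSN..HIASSN over
C     CAP_NPROC processes; returns this process's own sub-range
      INTEGER LOASSN, HIASSN, LOPART, HIPART
      INTEGER CAP_PROCNUM, CAP_NPROC
      COMMON /CAP_TOOLS/ CAP_PROCNUM, CAP_NPROC
      INTEGER IBLOCK
      IBLOCK = (HIASSN-LOASSN+1)/CAP_NPROC
      LOPART = LOASSN + (CAP_PROCNUM-1)*IBLOCK
      HIPART = LOPART + IBLOCK - 1
C     Any remainder elements are given to the last process
      IF (CAP_PROCNUM.EQ.CAP_NPROC) HIPART = HIASSN
      END

For the 1000 element range above this returns 1-250, 251-500, 501-750 and 751-1000 on processes 1 to 4 respectively.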

    The routine CAP_FINISH must be called at the end of a program run to successfully

    terminate use of the library. On some machines, this call is necessary if control is to return to

    the user once the parallel run has completed.


    3.2 Point to Point Communication

    The CAP_SEND and CAP_RECEIVE functions perform point to point communications

    between two processors. Typically these functions appear in pipeline communications (see

    section 3.4) but are also used to distribute data across the processor topology during

    initialisation of scalars and arrays etc.

    CAPLib has a selection of communication routines that allow the user to perform point to

point communications in a variety of ways. There are two main groups, blocking and non-blocking, and these are discussed separately in the following sections. Each communication

    has the generic arguments of address (A), length (NITEMS), type (TYPE) and destination

    (PID) with additional arguments depending on the routine. All the point-to-point routines are

    summarised in Table 1.

    3.2.1 Blocking Communication

Blocking communications do not return until the message has been successfully sent or received. The non-cyclic blocking communications will not communicate beyond the boundaries of the process topology when directional message destinations are given; directions are indicated by a negative PID argument. For example, in a pipeline, the first process will not send to its left, nor the last process to its right. This is also true of a ring topology, grid and torus (multi-dimensional ring). Where communications are required to loop around a topology like a ring or torus, as is the case for programs with cyclic partitions, the cyclic routines can be used. These do not test for the end or beginning of the processor topology.

Buffered routines are provided so that data that is non-contiguous can be buffered and sent as a single communication. The extra arguments are STRIDE (the stride length in terms of ITYPE elements) and NSTRIDE (the number of strides). In other words, NSTRIDE lots of NITEMS elements, STRIDE elements apart, will be communicated in each call. This approach avoids the multiple startup latencies incurred by using a separate communication for each section of data. On most platforms there is a message size dependent limit at which the time spent gathering and scattering data to and from buffers becomes greater than the latency saving over using multiple communications. The buffered routines switch internally to non-buffered communications if this limit is exceeded. This limit is currently set statically, but in the future it is hoped to perform an optimal calculation for the limit during the call to CAP_INIT. CAPTools provides a user option to generate buffered or non-buffered communications.
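For example, in Fortran's column-major storage a single row of a two-dimensional array is non-contiguous. The following sketch shows how such a row might be sent as one buffered message using the argument order given in Table 1; the array, its bounds and the type code 2 (taken to denote REAL, as in Figure 3) are illustrative assumptions.

      INTEGER NI, NJ, I
      PARAMETER (NI=100, NJ=50)
      REAL A(NI,NJ)
      I = 10
C     Row I of A is NJ REAL items spaced NI elements apart in memory,
C     i.e. NSTRIDE=NJ lots of NITEMS=1 element with STRIDE=NI, sent to
C     the right-hand neighbour as a single buffered message
      CALL CAP_BSEND(A(I,1), 1, NI, NJ, 2, CAP_RIGHT)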

    3.2.2 Non-Blocking Communication

It is often the speed of communication that reduces the efficiency of parallel programs more than anything else. To improve code performance, many parallel computers allow programs to start sending (and receiving) several messages and then to proceed with other computation asynchronously whilst the communication takes place. CAPLib supports this approach by providing non-blocking sends and receives. Non-blocking communications are implemented in CAPLib using the underlying host system's non-blocking routines where possible. Where such routines are not available, non-blocking routines have been implemented using a variety of techniques, for example communication threads running in parallel with the main user code. Table 1 lists the non-blocking routines currently available in the library.

Non-blocking communication routines, e.g. CAP_ASEND, begin the non-blocking operation but return to the user program as soon as the communication has been initiated. The communication itself takes place in parallel with execution of the following user code. The arguments are the same as for the blocking communications but with the addition of a message synchronisation id as the last argument. To make sure a message has completed its journey, the user code calls a CAP_SYNC routine to test for completion, passing the destination and synchronisation id as arguments. The CAP_SYNC routines either return immediately, if a communication has finished, or wait for it to complete, if it has not. A particular communication is identified completely by the message destination and the synchronisation id.
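As a hedged illustration of the intended usage (the array A, the item count N, the type code 2 for REAL and the subroutine WORK_ON_OTHER_DATA are all assumptions introduced for this sketch):

      INTEGER ISEND, N
      REAL A(1000)
C     Begin a non-blocking send of N items of A to the right-hand
C     neighbour; the call returns once the transfer has been initiated
      CALL CAP_ASEND(A, N, 2, CAP_RIGHT, ISEND)
C     Computation that does not touch A can proceed in the meantime
      CALL WORK_ON_OTHER_DATA
C     Wait for completion before A is reassigned
      CALL CAP_SYNC_SEND(CAP_RIGHT, ISEND)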

    Depending on the hardware and underlying communications library that CAPLib is ported to,

    the implementation of the non-blocking routines can be done in several different ways. For

    some implementations the synchronisation call is used to actually unpack the messages

    because the underlying library does not provide a non-blocking receive using the same model

    as CAPLib, for example the PVM implementation.

    Buffered non-blocking communications are also handled differently depending on the

    underlying library and hardware. Buffered non-blocking communications consist of two

stages: for a send, first the packing of data into a buffer, and then the communication of the buffered data. A receive communication must first receive the buffered data and then unpack it. If the parallel processor node is of a type that has a separate processor for communications, which can be programmed to perform work asynchronously with the main processor, then the packing and unpacking can be performed by the communications processor and overlapped with computation done on the main processor. This relies on the communications processor having dual memory access to the main processor's memory. The benefit of this is that both

    stages of buffered communication are then performed in parallel with computation. The

    Transtech Paramid [11] is a good example of such a system. However, it may be that the

    communications processor is of lower speed than the main processor and the time taken to

    unpack is actually longer than if the main processor had done the unpacking in serial mode

    itself. CAPLib therefore makes use of this approach only where it would improve

performance. It is more often the case that parallel nodes consist of single processors and do not provide any direct hardware support for non-blocking buffered communications. On such

    systems, messages can still be received asynchronously, but the processor must do data

    unpacking and there is no real parallel overlapping during the packing/unpacking stage.

    Libraries such as MPI implement non-blocking communications on workstations using

    parallel threads. Although this provides the mechanism for non-blocking buffered

    communications, because the thread will run on the same processor, the unpacking is not

    actually performed in parallel, but through time slicing. Therefore no real parallel benefit on

    packing/unpacking is gained.

    If the underlying communications library used by CAPLib does not directly support buffered

    non-blocking communication then the unpacking must be performed at the synchronisation

    stage, once the buffered message has been received. CAPLib implements this by keeping a

    list of asynchronous communications and whenever a CAPLib synchronisation call is made,

    all outstanding messages from the list are unpacked.

    Because of the extra complexity of using non-blocking communications it is a common

    procedure to write or generate message-passing code that uses blocking communications as a

    first parallelisation attempt. Once this version has been tested thoroughly and proved to give

    the correct results, a non-blocking version can be produced to optimise the performance (In

    CAPTools this merely requires clicking on one button [8]).

    Before data that has been transmitted using non-blocking functions can be used in the case of

    a receive communication, or re-assigned in the case of a send communication, the completion


    of the communication involving the data must be verified. For maximum flexibility and

efficiency in synchronising on message completion, the communication model used by CAPLib for the ordering of message arrival and departure is as follows:

    Messages are sent in order of the calls made to send to a particular destination, D.

    Messages are received in order of calls made to receive from a particular destination, D.

This implies that:

Synchronisation on the sending of message Mi to destination D guarantees that messages Mi-1, Mi-2, ... sent to destination D have arrived.

    In the example below the synchronisation using ISENDB by statement S3 on the

    message sent by S2 also guarantees that the message sent by S1 has arrived.

    S1 CALL CAP_ASEND(A,1,1,CAP_LEFT,ISENDA)

    S2 CALL CAP_ASEND(B,1,1,CAP_LEFT,ISENDB)

    S3 CALL CAP_SYNC_SEND(CAP_LEFT,ISENDB)

Synchronisation on the receiving of message Mj from a destination D guarantees that messages Mj-1, Mj-2, ... requested from destination D have been received.

    In the example below the synchronisation using IRECVB by statement S3 on the

    message requested by S2 also guarantees that the message requested by S1 has arrived.

    S1 CALL CAP_ARECEIVE(A,1,1,CAP_LEFT,IRECVA)

    S2 CALL CAP_ARECEIVE(B,1,1,CAP_LEFT,IRECVB)

    S3 CALL CAP_SYNC_RECEIVE(CAP_LEFT,IRECVB)

    Waiting for completion of a send to a destination does not guarantee that a particular

    receive has taken place from that destination and vice versa.

    In the example below the synchronisation using ISENDB by statement S3 on the

    message sent by S2 does not guarantee that the message requested by S1 has arrived.

    S1 CALL CAP_ARECEIVE(A,1,1,CAP_LEFT,IRECVA)

    S2 CALL CAP_ASEND(B,1,1,CAP_LEFT,ISENDB)

    S3 CALL CAP_SYNC_SEND(CAP_LEFT,ISENDB)

Waiting for completion of a communication with a particular destination D does not guarantee that any other sends or receives to or from another destination have completed.

In the example below the synchronisation using ISENDB by statement S4 on the message sent by S3 does not guarantee that the messages requested by S1 or sent by S2 have arrived.

    S1 CALL CAP_ARECEIVE(A,1,1,CAP_RIGHT,IRECVA)

    S2 CALL CAP_ASEND(A,1,1,CAP_RIGHT,ISENDA)

    S3 CALL CAP_ASEND(B,1,1,CAP_LEFT,ISENDB)

    S4 CALL CAP_SYNC_SEND(CAP_LEFT,ISENDB)

This model is flexible enough to allow for the automatic generation of non-blocking communications within CAPTools [8]. The ability to synchronise several messages in a particular direction with one synchronisation (waiting for the last message to be sent is enough to guarantee that all messages previous to the last have been sent) makes code generation a lot easier. It also reduces the overhead of synchronisation. The model also


    allows for overlapping both sends and receives simultaneously to a particular destination and

    for multiple tests on the same synchronisation id, which is essential for an automatic

    overlapping code generation algorithm.

    The flexibility of this model has allowed CAPTools to generate overlapping communications

    with synchronisation that guarantees correctness in a wide range of cases. This includes loop

    unrolling transformations, synchronisation and overlapping communications in pipelined

loops. Code appearance is enhanced by the merger of synchronisation points, which is only possible with this communication model.

    3.3 Exchanges (Overlap Area/Halo Updates)

    For any array that is distributed across the process topology each process will have an overlap

    region in the array that is assigned on another process (see Figure 4). These overlap areas are

    updated when necessary. The overlap region is updated by invoking a call to

    CAP_EXCHANGE, which performs a similar function to the MPI call MPI_SENDRECV.

This communication function will send data to a neighbouring process's overlap area as well as receive data into its own overlap region from the neighbouring processor.

    CAP_EXCHANGE must ensure that no deadlock occurs and allow for non-communication

    beyond the edge of the process topology for the end processes. Most important is the fact that

    this type of communication is fully scalable, i.e. is not dependent on the number of processes,

    taking at most 2 steps to complete (see Figure 5). If the hardware allows non-blocking

    communication an exchange can be performed in 1 step by communicating in parallel.

Figure 5 Communication pattern for a blocking exchange operation on 16 processors: all exchanges complete within two time steps regardless of the number of processors.
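One way such a two-step schedule could be arranged is sketched below. This is an illustration of the idea only, not CAPLib's implementation; ANY_SEND and ANY_RECEIVE are the same generic low-level primitives used in the hand written example of section 2.4.3, and SBUF and RBUF are illustrative send and receive buffers.

C     Shift exchange: every interior process sends SBUF to its left
C     neighbour and receives RBUF from its right neighbour.  Ordering
C     the blocking calls by odd/even process number avoids deadlock and
C     completes in two steps regardless of the number of processes.
      IF (MOD(CAP_PROCNUM,2).EQ.1) THEN
        IF (CAP_PROCNUM.GT.1) CALL ANY_SEND(SBUF,N*4,CAP_PROCNUM-1)
        IF (CAP_PROCNUM.LT.CAP_NPROC)
     &      CALL ANY_RECEIVE(RBUF,N*4,CAP_PROCNUM+1)
      ELSE
        IF (CAP_PROCNUM.LT.CAP_NPROC)
     &      CALL ANY_RECEIVE(RBUF,N*4,CAP_PROCNUM+1)
        IF (CAP_PROCNUM.GT.1) CALL ANY_SEND(SBUF,N*4,CAP_PROCNUM-1)
      ENDIF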

    3.4 Pipelines

    A Pipeline in a parallel code involves each processor performing the same set of operations

    on a successive stream of data. Pipelined loops are a common occurrence in parallel CM

    codes and are often essential to implement, for example, recurrence relations, guaranteeing

    correctness of the parallel code. Because a pipeline serialises a loop it must be surrounded by

    outer loop(s) in order to achieve parallel speed up. The main disadvantages of pipelines are

that during the pipeline process some processors will be idle at the start up and shut down stages. Another disadvantage is the potentially significant overhead of the numerous

    communication startup latencies. Figure 6 shows a simple example of a loop that has been

    parallelised using a pipeline.

C Serial code
      DO I=1, NI
        DO J=2,NJ
          A(I,J)=A(I,J-1)
        END DO
      END DO

C Parallel code
      DO I=1, NI
        CALL CAP_RECEIVE(A(I,CAP_LA-1),1,2,CAP_LEFT)
        DO J=MAX(2,CAP_LA),MIN(NJ,CAP_HA)
          A(I,J)=A(I,J-1)
        END DO
        CALL CAP_SEND(A(I,CAP_HA),1,2,CAP_RIGHT)
      END DO

    Figure 6 Example of a Pipeline


    With a low communication startup latency, good parallel efficiency can be achieved (see

    section 9.1)

    3.5 Commutative Operations

The Jacobi example in Figure 3 uses a convergence criterion based on DIFMAX. Since the TNEW and TOLD arrays have been partitioned across the processors, each processor will calculate its own local value for DIFMAX; however, it is necessary to calculate the global

    value. Collective parallel computation operations (minimum, maximum, sum, etc.) take a

    value or values assigned on each process and combine them with all the values on all other

    processes into a global result that all processes receive. This is performed by the CAPLib

    function CAP_COMMUTATIVE, which is analogous to the MPI global reduction function

MPI_ALLREDUCE. The latency for this type of communication is dependent on the

    number of processors. CAPLib minimises the effect of this by internally using a hyper-cube

    topology where possible to perform the commutative operation. The commutative routines in

    CAPLib are summarised in Table 1.
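The hyper-cube (recursive doubling) pattern referred to above can be sketched as follows, assuming a power-of-two process count. CAP_SEND and CAP_RECEIVE are used with explicit process numbers, the type code 2 is taken to denote REAL as in Figure 3, and the careful ordering of combinations that the real routines apply to keep results identical on every process is ignored here for brevity.

      INTEGER CAP_PROCNUM, CAP_NPROC
      COMMON /CAP_TOOLS/ CAP_PROCNUM, CAP_NPROC
      INTEGER K, NDIM, PARTNER
      REAL VALUE, REMOTE, TEMP
C     NDIM = log2(number of processes); assumes CAP_NPROC = 2**NDIM
      NDIM = NINT(LOG(REAL(CAP_NPROC))/LOG(2.0))
      DO K = 0, NDIM-1
C       The partner differs from this process in bit K of its 0-based id
        PARTNER = IEOR(CAP_PROCNUM-1, 2**K) + 1
C       Order the blocking calls by process number to avoid deadlock
        IF (CAP_PROCNUM.LT.PARTNER) THEN
          CALL CAP_SEND(VALUE, 1, 2, PARTNER)
          CALL CAP_RECEIVE(REMOTE, 1, 2, PARTNER)
        ELSE
          CALL CAP_RECEIVE(REMOTE, 1, 2, PARTNER)
          CALL CAP_SEND(VALUE, 1, 2, PARTNER)
        ENDIF
C       Combine the two partial results (here a sum, using CAP_RADD)
        CALL CAP_RADD(VALUE, REMOTE, TEMP)
        VALUE = TEMP
      ENDDO
C     After NDIM steps every process holds the same global value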

    The routine CAP_COMMUTATIVE performs the commutative operation defined by the

    passed routine FUNC on the data item VALUE for all processes. On entry to the routine,

VALUE holds a local value on each processor; on exit it contains a global value computed

    from all local values across processors using routine FUNC to combine contributions. For

example, the serial code to sum a vector is:

      SUM=0.0
      DO I = 1, N
        SUM=SUM+A(I)
      END DO

    The parallel equivalent of this using CAP_COMMUTATIVE, given that the array A has been

    partitioned across processes is:

      SUM=0.0
      DO I = CAP_LOW, CAP_HIGH
        SUM=SUM+A(I)
      END DO
      CALL CAP_COMMUTATIVE(SUM, CAP_REAL, CAP_RADD)

    In this example CAP_LOW and CAP_HIGH are the local low and high limits of assignment

    to A on each processor. The procedure CAP_RADD is a predefined procedure to perform a

    real value addition. A list of all predefined commutative functions is given in the CAPLib

user manual [12]. Each procedure has the same arguments, which means that

    CAP_COMMUTATIVE can call it generically. For example, CAP_RADD is defined as

      SUBROUTINE CAP_RADD(R1, R2, R3)
      REAL R1, R2, R3
      R3=R1+R2
      END

The routine CAP_DCOMMUTATIVE is a variant of CAP_COMMUTATIVE and

    performs a commutative operation in one dimension of a grid or torus process topology, the

    direction (e.g. left or up) being indicated by an additional argument. This type of

    commutative operation can be necessary when a structured mesh code is partitioned in more

    than one direction. The routine CAP_MCOMMUTATIVE provides for commutative

    operations on an array of data rather than one item. Combining several

    CAP_COMMUTATIVE calls to form one CAP_MCOMMUTATIVE call allows a

    corresponding reduction in latency.
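For instance, several scalar residuals can be combined in one call; a sketch, in which the array RES and its length are illustrative, the type code 2 is taken to denote REAL as in Figure 3, and CAP_RADD is the predefined real addition defined above:

      EXTERNAL CAP_RADD
      REAL RES(3)
C     RES(1..3) hold three locally computed residuals.  Three separate
C     CAP_COMMUTATIVE calls would incur three startup latencies; one
C     CAP_MCOMMUTATIVE call reduces all three items in a single operation
      CALL CAP_MCOMMUTATIVE(RES, 3, 2, CAP_RADD)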


    CAPTools generates code with commutative operations whenever it can match the code in a

    loop to the pattern of a commutative operation. Without commutative communications, the

    code generated would involve complex and convoluted communication.

    An interesting observation is that commutative operations performed in parallel will actually

    give answers with less round off than the corresponding serial code. For example, consider

    the summation of ten million small numbers in serial. As the summation continues, each

    small value will be added to an increasingly larger sum. Eventually the small numbers will

    cease to have an impact on the sum because of the accuracy of adding a small value to a large

one using computer arithmetic. The parallel version of the summation will first have each processor sum its own section locally, communicate the local summations, and add

them to obtain a global sum. The accuracy will be greater since each local summation will involve fewer numbers, and therefore there will be smaller differences in magnitude than in the complete serial summation. In addition, the summation of the local summations to obtain the global value will combine numbers of relatively similar size. This matters because many users performing parallelisations of existing serial code must validate the parallel version against the serial version as part of the parallelisation process; obviously, the parallelised code must produce the same results in order to pass this validation. Although a parallel commutative operation may not produce exactly the same result as the serial one, it will at least err on the side of greater accuracy rather than less, and so most validation tests should be passed.
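A small self-contained illustration of this effect (not taken from the paper, and assuming a default 32-bit REAL): integers above 2**24 are no longer exactly representable in IEEE single precision, so a long serial accumulation stalls, while partial sums combined afterwards remain exact.

      PROGRAM ROUNDOFF
      REAL S, P(4), PTOT
      INTEGER I, J
C     Serial accumulation: once S reaches 2**24 = 16777216.0, adding
C     1.0 no longer changes it, so the result is far too small
      S = 0.0
      DO I = 1, 40000000
        S = S + 1.0
      ENDDO
C     Four partial sums of 10000000 each remain exactly representable,
C     as does their total of 40000000
      PTOT = 0.0
      DO J = 1, 4
        P(J) = 0.0
        DO I = 1, 10000000
          P(J) = P(J) + 1.0
        ENDDO
        PTOT = PTOT + P(J)
      ENDDO
      PRINT *, 'Serial sum   ', S
      PRINT *, 'Partial sums ', PTOT
      END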

    As well as getting as near as possible the same results as the serial version, commutative

    operations must also produce the same answer on all processes. For example, often the

    calculation of the sum of the difference between two vectors, i.e. a residual value, is used to

    determine whether to terminate an iterative loop. If the calculation of the residual value is not

    the same on all processors then the calculated values may cause the loop to terminate on

some processes but loop again on others. Obviously, this will cause the parallel execution to deadlock. To obtain the same results on all processes, the commutative operation must either be performed in the same order on all processes, so as to incur the same round off errors, or a single global value must be broadcast.

    A common array operation is to find the maximum or minimum value and its location in the

    array. The equivalent commutative operation in parallel must be performed in the same order

    to return the same location as the serial code. If there are several occurrences of the local

maximum/minimum value in the array then several processes might find their own local maxima/minima. In order to avoid this, the commutative operation must know the

    direction in which the array is traversed. The routines CAP_COMMUPARENT and

    CAP_COMMUCHILD provide a mechanism for this. The argument FIRSTFOUND (see

Table 1) controls how the commutative operation determines a location. If FIRSTFOUND is set to true then for a maximum commutative calculation it is the maximum value location

    found on the lowest numbered processor in the given dimension that is required on all

    processes. This would be the case for a serial loop running from low to high through an array.

Consider, for example, the code in Figure 7 with data A=(/7, 9, 2, 2, 9, 5, 9/). Although there are maxima at positions 2, 5 and 7, the serial code will set MAXLOC to 2 due to the

    use of a strict greater than test. The parallel code will similarly produce the result

    MAXLOC=2 on all processors.

C Serial code
      MAXVAL=0
      MAXLOC=1
      DO I = 1, N
        IF (A(I).GT.MAXVAL) THEN
          MAXVAL=A(I)
          MAXLOC=I
        ENDIF
      ENDDO

C Parallel
      MAXVAL=0
      MAXLOC=1
      DO I = LOW, HIGH
        IF (A(I).GT.MAXVAL) THEN
          MAXVAL=A(I)
          MAXLOC=I
        ENDIF
      END DO
      CALL CAP_COMMUPARENT(MAXVAL,1,CAP_IMAX,.TRUE.)
      CALL CAP_COMMUCHILD(MAXLOC,1)

Figure 7 Example of CAP_COMMUPARENT and CAP_COMMUCHILD

    If the test for the maximum value had been .GE. rather than .GT. then MAXLOC would be

    set to the location of the last maximum value rather than the first and therefore the value of

    FIRSTFOUND in CAP_COMMUPARENT would be set to .FALSE..

CAP_COMMUPARENT works by sending the processor number of the current maximum value along with the maximum value itself as the commutative operation is performed among the processors.

    In the commutative communication algorithms, the location for the COMMUPARENT is

    also packed into the message. CAP_COMMUPARENT internally stores the processor that

    owns the desired location(s). This processor is then used in any number of calls to

    CAP_COMMUCHILD to broadcast the correct value to all processors.

    3.6 Broadcast Operations

    Broadcast operations are used to move data from one process to all other processes. The

    simplest of these is a broadcast of data from the first process to all others, termed a master

    broadcast. CAPLib provides the CAP_MBROADCAST routine to do this. In fact, rather than

    sending data directly to all processes from the master process, the master broadcast will use

    the same communication strategies as the CAP_COMMUTATIVE call. These strategies,

    described in section 6, take advantage of the internal process topology to reduce the number

    of communications and steps to complete the operation.

    A second type of broadcast is the communication of data from a particular process to all

others. CAPLib provides the routine CAP_BROADCAST to do this. The OWNER argument is passed in set to true on the process owning the data and false on all others.

    CAP_BROADCAST is implemented currently as a COMMUTATIVE MAX style operation

    on the OWNER argument to tell every process which particular process is the owner of the

    data to be broadcast. The data is again transmitted from the owning process to the other

    processes in an optimal fashion using the internal process topology.
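As a sketch of how these two routines might be used, following the descriptions above (the type code 1 is taken to denote INTEGER as in Figure 3, the variable ILOC and the choice of owning process are illustrative, and OWNER is assumed to be a logical flag):

      INTEGER N, ILOC
      LOGICAL OWNER
C     Master broadcast: process 1 reads N and every process receives it
      IF (CAP_PROCNUM.EQ.1) READ *, N
      CALL CAP_MBROADCAST(N, 1)
C     General broadcast: only the process owning ILOC sets OWNER true
      OWNER = (CAP_PROCNUM.EQ.3)
      CALL CAP_BROADCAST(ILOC, 1, OWNER)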

    4 CAPLib on the Cray T3D/T3E

    4.1 Implementation

    CAPLib has been ported to the Cray T3D and T3E using PVM, MPI and the SHMEM

    library. The SHMEM library version is described below. Of the three, the SHMEM CAPLib

    is by far the fastest, latency and bandwidth being a reflection of the performance of the

SHMEM library. Typical latency is under 7 µs and bandwidth greater than 100 MB/s for large messages on the T3D, and 5 µs and 300 MB/s on the T3E. The SHMEM version of CAPLib is written in C rather than FORTRAN because of the need to do indirect accessing.

Synchronous message passing was implemented using a simple protocol built on the Cray

    SHMEM_PUT library routine, which is faster than SHMEM_GET. Figure 8 shows this

    protocol used to send data between two processors.


Figure 8 CAPLib protocol used for communication on the T3D/T3E using the SHMEM library: the receiving processor sends its receive address to the sender, the sender puts the data directly into that address and then puts a write-data-complete acknowledgement, with each side busy waiting on the appropriate flag.

The receiving processor first writes the starting address into which it wishes to receive data to a known location on the sending processor and then waits for the sending processor to write the data and send a write-data-complete acknowledgement. The sending processor waits on a tight spin lock (busy wait loop) for a non-zero value in the known location. When the address

    has arrived it uses SHMEM_PUT to place its data directly into the address on the receiving

    processor. The sending processor then calls SHMEM_QUIET to make sure the data has

    arrived and then sends a write-data-complete acknowledgement to the receiving processor.

    The pseudo code for this procedure is shown in Figure 9.

send(a, n, cn)
/* send data a(n) to channel cn (processor cn2p(cn)) */
{
  pe = cn2p(cn)
  /* wait for address from receiving processor to arrive in addr(pe) */
  while (!addr(pe)) ;
  /* send data */
  shmem_put(*addr(pe), a, n, pe)
  /* wait */
  shmem_quiet()
  /* ack send complete */
  shmem_put(ack(mype), 1, 1, pe)
  /* reset address */
  addr(pe) = 0
}

recv(b, n, cn)
/* recv data b(n) from channel cn (processor cn2p(cn)) */
{
  pe = cn2p(cn)
  /* place recv address in sending pe at addr(mype) */
  shmem_put(addr(mype), &b, 1, pe)
  /* wait for data ack to arrive */
  while (!ack(pe)) ;
  /* reset ack */
  ack(pe) = 0
}

Figure 9 Pseudo code for send/recv using Cray SHMEM calls

    To obtain maximum performance all internal arrays and variables involved in a

    communication are cache aligned using compiler directives.

To avoid any conflicts in all-to-all communications, the variables used to store addresses and act as acknowledgement flags are all declared as arrays, with the T3D processor number being

    used to reference the array elements. In this way each send address and data

    acknowledgement can only be set by one particular processor.

    Asynchronous communication has been partially implemented by removing the wait on the

write-data-complete acknowledgement in the receive and placing it in CAP_SYNC_RECEIVE. The send operation is not currently fully asynchronous since it cannot start until it receives

    an address from the receiving processor to send data to.

    Commutative operations have also been implemented using these low-level functions and the

    hyper-cube B method (see section 6.3) is the default commutative employed.


    Pahud and Cornu [13] show that communication locality can influence the communication

    times in heavily loaded networks on the T3D. CAPLib uses the location of the processor

    within the processor topology shape allocated to a particular run to determine

    CAP_PROCNUM (the CAPTools processor number) for each processor in an optimal way so

    as to minimise communication time. The numbering is chosen to provide a pipeline of

processors through the 3-D topology shape so that the number of hops from processor CAP_PROCNUM to processor CAP_PROCNUM+1 is minimised.

    Another way of improving communication performance for some parallel programs

    (particularly all-to-all style communication) is to order the communications so that an

    optimum communication pattern is used, reducing the number of steps to perform a many-to-

    many operation. Unstructured mesh code will often use this type of operation.

    4.2 Performance

    This section discusses the performance of the different CAPLib point-to-point and exchange

    message passing functions on the Cray T3D and T3E. The speed of other message passing

    libraries and CAPLib performance are compared where possible. Figure 10 and Figure 11

show the latency and bandwidth respectively on the T3D for SHMEM versions of

    CAP_SEND (synchronous), CAP_ASEND (non-blocking), CAP_EXCHANGE and

    CAP_AEXCHANGE (non-blocking). As a comparison, these graphs also show timings for

    MPI_SEND, MPI_SSEND and MPI_SENDRECV. Figure 12 and Figure 13 show similar

    graphs for the Cray T3E.

    An examination of these figures shows that CAPLib performs at least as well as the standard

    MPI implementation on each machine. The CAPLib SHMEM implementation is superior to

    using MPI or PVM calls both in latency and bandwidth. Generally the overhead of using the

CAPLib-over-MPI library instead of direct calls to MPI is negligible. CAP_SEND implemented using SHMEM has a startup latency of around 7 µs. The overall bandwidth obtained on the T3E for all communication measurements is far higher than that of the T3D.

The bandwidth for CAP_SEND on the T3D for messages of 64 Kbytes is around 116 Mbytes/sec; on

the T3E this figure is 297 Mbytes/sec. This is due to hardware improvements between the two

    systems. CAP_EXCHANGE has been implemented on the Cray systems under SHMEM to

    partially overlap the pair-wise send and receive communications it performs, and this is

reflected in the bandwidth obtained: 143 Mbytes/sec on the T3D and 416 Mbytes/sec on the T3E.

Note that the bandwidth for MPI_SENDRECV (50 Mbytes/sec on the T3D and 284 Mbytes/sec on the

T3E) is very poor in comparison with CAP_EXCHANGE. Each performs a similar

    communication, a send and receive to other processors, but CAP_EXCHANGE is able to

    schedule its communications so as to overlap because it is based on directional

    communication whereas MPI_SENDRECV communication is based on processor numbers

    only and is unable to do this.
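As a sketch of the overlapping idea only (not the CAPLib implementation), a directional exchange can post its receive and send together so that the two transfers proceed concurrently; the neighbour ranks left and right are assumed to come from the partition topology.

    #include <mpi.h>

    /* Minimal sketch of an overlapped pair-wise exchange: post the receive and
     * the send together, then wait for both, so the two transfers can proceed
     * concurrently, unlike a strictly ordered send-then-receive. */
    void overlapped_exchange(double *sendbuf, double *recvbuf, int n,
                             int left, int right, MPI_Comm comm)
    {
        MPI_Request req[2];
        MPI_Irecv(recvbuf, n, MPI_DOUBLE, left,  0, comm, &req[0]);
        MPI_Isend(sendbuf, n, MPI_DOUBLE, right, 0, comm, &req[1]);
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    }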

    The graphs for the figures are obtained by performing a Ping-Pong communication many

    times and taking average values. However, the non-blocking communication Ping-Pong test

    has synchronisation after each communication. In this respect, the non-blocking results are

    artificial in that they do not reflect the greater performance that will be obtained in real codes

    where synchronisation will generally be performed after many communications. The graphs

    for CAP_ASEND and CAP_AEXCHANGE therefore give a measure of the overhead of

    performing synchronisation on non-blocking communication and do not reflect the latency

    and bandwidth that is obtained in real use.
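For reference, a Ping-Pong kernel of the kind used to produce such graphs looks roughly as follows; this is an MPI sketch with an assumed message size and repetition count, whereas the CAPLib curves were measured with the corresponding CAPLib calls.

    #include <mpi.h>
    #include <stdio.h>

    /* Ping-pong timing sketch: rank 0 sends a message to rank 1 and waits for
     * it to come back; half the averaged round-trip time approximates the
     * one-way transfer time for that message size. */
    int main(int argc, char **argv)
    {
        int rank, n = 1024, reps = 1000;       /* assumed size and repeat count */
        double buf[1024] = {0.0};
        double t0, t1;
        MPI_Status st;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &st);
            } else if (rank == 1) {
                MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &st);
                MPI_Send(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();
        if (rank == 0)
            printf("one-way time per message: %g us\n",
                   (t1 - t0) / (2.0 * reps) * 1e6);
        MPI_Finalize();
        return 0;
    }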


[Figure: log-log plot, "Latency (Cray T3D)", of data transfer time (µs) against data size (REAL items) for CAP_SEND, CAP_ASEND, CAP_EXCHANGE and CAP_AEXCHANGE (SHMEM) and for MPI_SEND, MPI_SSEND and MPI_SENDRECV.]

Figure 10 CAPLib communication latency on Cray T3D

[Figure: "Bandwidth (Cray T3D)", bandwidth (Mbytes/sec) against data size (REAL items) for CAP_SEND, CAP_ASEND, CAP_EXCHANGE and CAP_AEXCHANGE (SHMEM) and for MPI_SEND, MPI_SSEND and MPI_SENDRECV.]

Figure 11 CAPLib communication bandwidth on Cray T3D

[Figure: log-log plot, "Latency (Cray T3E)", of data transfer time (µs) against data size (REAL items) for CAP_SEND, CAP_ASEND, CAP_EXCHANGE and CAP_AEXCHANGE (SHMEM) and for MPI_SEND, MPI_SSEND and MPI_SENDRECV.]

Figure 12 CAPLib communication latency on Cray T3E

[Figure: "Bandwidth (Cray T3E)", bandwidth (Mbytes/sec) against data size (REAL items) for CAP_SEND, CAP_ASEND, CAP_EXCHANGE and CAP_AEXCHANGE (SHMEM) and for MPI_SEND, MPI_SSEND and MPI_SENDRECV.]

Figure 13 CAPLib communication bandwidth on Cray T3E

    5 CAPLib on the Paramid

    5.1 Implementation

    The Transtech Paramid version of the CAPTools communications library uses the low-level

Transtech/Inmos i860toolset communications library [11]. The Paramid's dual-processor

    node architecture makes it ideal for non-blocking asynchronous communications since the

    Transputer part of a node can be performing communication whilst the i860 is computing.

    Non-blocking communications are implemented for the Paramid in CAPLib using an

    asynchronous router program that runs on the Transputer. To minimise latency for small non-

    blocking communications, the period of synchronisation between the Transputer and the i860

    during initialisation of a non-blocking communication must be kept to a minimum. In

    addition, the amount of effort required for the Transputer to send and receive asynchronously

must be as small as possible, and as near to the time for a normal direct synchronous send as

    possible. Figure 14 shows a process diagram of the router process that executes on the

    Transputers during runs that use the asynchronous version of CAPLib. The diagram shows

    the threads in the router process for sending data asynchronously down one channel. For each

channel pair (IN and OUT channels), there will be two sets of these threads to allow independent communication in both directions. This arrangement is also duplicated for each


    channel connection to other nodes. The send and receive threads of the router process are

linked by channels over the Transputer links to the destination node's corresponding send and

    receive threads (where more links are needed than are physically available, implicit use is

    made of the INMOS virtual routing mechanism, [14]).

For every channel, the client send thread processes send requests and places them in the send

    request queue. A similar action is performed for receive requests by the client receive thread.

    The send thread removes requests from the send queue and communicates the data as soon as

    the corresponding receive thread on the other processor is ready to receive, i.e. when the

    receive thread has itself removed a request to receive from the receive request queue. The

    send and receive threads update an acknowledgement counter for each channel so that the

user's program can synchronise on the completion of certain communications. It is worth

emphasising that, using this model, communication down one channel is completely

independent of communication down another. It is up to the user's program to synchronise at

the correct point to guarantee the validity of the data communicated in each direction.

[Figure: the i860 (user's program) and the Transputer (router process) on a TTM200 TRAM. CAP_ASEND(A,N,1,-2,ISEND) sends a request packet (address A, length N, type 1) to the router process, whose client (send) request thread places it in the send request queue serviced by the send thread; CAP_SYNC_SEND(-2,ISEND) busy-waits until ISENDACK >= ISEND, where ISENDACK is incremented by one for every completed send. Correspondingly, CAP_ARECEIVE(B,N,1,-1,IRECV) queues a request via the client (recv) request thread and receive thread, and CAP_SYNC_RECEIVE(-1,IRECV) busy-waits until IRECVACK >= IRECV.]

Figure 14 Transputer router process for asynchronous communication on Transtech Paramid
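The handle/acknowledgement-counter mechanism of Figure 14 can be summarised by the following illustrative sketch; the variable names mirror the figure, while the helper routine and its behaviour are assumed for the example.

    /* Illustrative sketch of the handle/acknowledgement-counter idea in
     * Figure 14.  Each asynchronous send request is given an increasing
     * handle; the router's send thread increments ISENDACK when the
     * corresponding transfer has completed, so synchronisation is a simple
     * busy wait. */
    static volatile int ISENDACK = 0;   /* incremented by the router process  */
    static int          isend    = 0;   /* handle of the last request issued  */

    int post_send(void)                 /* hypothetical: queue a send request */
    {
        return ++isend;                 /* handle returned to the caller      */
    }

    void sync_send(int handle)          /* cf. CAP_SYNC_SEND                  */
    {
        while (ISENDACK < handle)
            ;                           /* busy wait until the send completed */
    }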

    5.2 Performance

    Figure 15 and Figure 16 give the latency and bandwidth characteristics of CAPLib on the

Paramid. The best latency is around 33 µs, with the bandwidth approaching peak performance

    at around the 500-byte message size. Notice that the peak bandwidth of CAP_AEXCHANGE

    is roughly twice that of CAP_SEND showing that it is performing its send and receive

    communication asynchronously in parallel. The latency cost for small messages (~40 bytes)

is higher than that of the synchronous CAP_EXCHANGE because of the extra complexity of setting

    up an asynchronous communication. However in real applications the increased

asynchronous latency will usually be hidden by the overall benefits of performing computation whilst communicating.


[Figure: "Latency (Transtech Paramid)", data transfer time (µs) against data size (REAL items) for CAP_SEND, CAP_BSEND, CAP_ASEND, CAP_EXCHANGE and CAP_AEXCHANGE (I860TOOLSET).]

Figure 15 CAPLib latency on Transtech Paramid

[Figure: "Bandwidth (Transtech Paramid)", bandwidth (Mbytes/sec) against data size (REAL items) for CAP_SEND, CAP_BSEND, CAP_ASEND, CAP_EXCHANGE and CAP_AEXCHANGE (I860TOOLSET).]

Figure 16 CAPLib bandwidth on Transtech Paramid

    6 Optimised Global Commutative Operations

    As global commutative operations usually only involve the sending and receiving of very

    small messages, typically 4 bytes, it is the communication startup latency which will

    dominate the time taken to perform the commutative operation. This is because the

    communication startup latency is relatively expensive on most parallel machines. It is for this

    reason that in many parallelisations, commutative operations can be a governing factor

    affecting efficiency and speed up. It is extremely important, therefore, to implement

    commutative operations as efficiently as possible. In order to do this, the commutative

    routines in CAPLib take advantage of the processor topology, that is, how each processor

    may communicate with other processors.

    Many of the parallel machines on the market today are connected using some kind of

    topology to facilitate fast communication in hardware. For example, processors in the Cray

    T3D are connected to a communications network arranged as a 3-D torus. However, although

    the hardware is connected as a torus, there is in fact no limitation on what processors a

    particular processor may talk to at the hardware level; the communication hardware will route

    messages from one processor to another around the torus as needed. From the perspective of

    the methods used to perform commutative (and broadcast) operations it is this direct

    processor to processor topology that is important, not the underlying hardware topology that

    implements it. This means, for example, that although the Cray T3D is based on a 3D Torus,

    for commutative operations internally within CAPLib it is considered fully connected. The

commutative topology used internally within CAPLib will therefore depend on the direct processor-to-processor routing available on the machine the program is running on. The

    commutative methods available are then directly related to the commutative topology.

    Currently CAPTools supports a pipe, ring, grid and two different hyper-cube commutative

    methods.

    In order to compare the efficiency of each method we define the following:-

    P = The number of processes.

    C = The total number of communications for a commutative operation.

    S = The total number of steps involved in the method. We define a step as a number of

communications performed in parallel such that the time/latency of all communications is equivalent to that of one communication. Some communication


devices are serial devices, only allowing one communication at a time. For example,

    the Ethernet connecting workstations is a serial communications device since only

    one packet may be present on the Ethernet at any one time. For these devices,

    although we can consider the communications in one step taking place in parallel for

    the purposes of analysis, they will in fact be serialised in practice.

    The key to efficient commutative operations is to perform as much communication in parallel

    as possible, i.e. by minimising the number of parallel communication steps needed to perform

    the commutative operation, the effect of communication startup latency will be minimised.

    The time for the commutative operation to take place is approximately proportional to the

    number of communication steps, S. This is the most important term to reduce.
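Assuming a simple linear communication cost model, where $t_s$ is the startup latency, $B$ the bandwidth and $m$ the message size, the time for the whole operation is approximately

    T_{\mathrm{commutative}} \;\approx\; S \left( t_s + \frac{m}{B} \right) \;\approx\; S\, t_s \quad \text{for small } m,

so for the typical 4-byte messages involved it is the $S\,t_s$ term that dominates.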

    The communication time between processes is often affected by the number of

    communications occurring simultaneously. It is therefore important that both the overall

    number of communications and the number of communications per step is also minimised.

    The type of communication taking place at each step also determines performance. If all the

    communications in a step are between neighbour processors then there will be little

    contention on the communication network as the communications take place. If the

    communications are not to nearest neighbours then the number of communications will affect

    the time to complete the step since the routing mechanism of the hardware will be used to

    deliver messages and contention may occur.

    If the process topology has not been mapped well onto the hardware topology, it will often be

    the case that communication from a nearest neighbour process is not in hardware a

    communication between nearest neighbour processors. For example, a ring topology

implemented onto a pipeline of processors will require messages between the last and

first processors to be sent via the routing mechanism. Communications

    along this link will always be slower than along the other links and in a commutative

    communication step the slowest communication will determine the time for the step.

    6.1 Commutative Operation using a Pipeline

    Figure 17 shows a diagram of a pipeline of processes and the communication pattern for a

    commutative operation. The number of communications and steps is proportional to the

    number of processes. This is because the value contributed from each process must be passed

    down to the last process and then the result is passed back up the pipeline again.


[Figure: 16 processes in a pipeline; time runs downwards as each contribution is passed down to the last process and the result is then passed back up. S = 2(P - 1), C = 2(P - 1).]

Figure 17 Commutative operations using a pipeline connection topology.

    The number of steps for a ring commutative operation is the same as for a pipeline but the

    number of communications is higher. On some hardware, this will give a pipe topology the

edge in performance over the ring. If it is possible for a commutative operation to be performed around the ring using non-blocking communications then the number of steps can

    be halved. Communication around a ring requires all the values to be accumulated in an array

    in process order during communication and then the commutative computation performed

    using the array once all values have been communicated. This is to avoid round off problems

    and guarantees that each processor calculates the same result. Buffer space is required on

each process to perform this operation and for a very large parallel run, i.e. thousands of processes, this may be disadvantageous. If it is possible for the hardware to perform

    communication simultaneously in both directions then the performance can be even higher

    since values can travel both ways around the ring at the same time, reducing the distance to

    the furthest process to p/2.
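Purely as an illustration of the pipeline method (an MPI sketch of a global sum, not the CAPLib code), the 2(P-1) steps are a forward accumulation towards the last process followed by a backward broadcast.

    #include <mpi.h>

    /* Sketch of a pipeline commutative (global sum) on P processes: a forward
     * accumulation towards process P-1 followed by a backward broadcast,
     * i.e. 2(P-1) communication steps in total. */
    double pipeline_sum(double myval, MPI_Comm comm)
    {
        int me, P;
        double acc = myval, in;
        MPI_Status st;

        MPI_Comm_rank(comm, &me);
        MPI_Comm_size(comm, &P);

        /* forward sweep: accumulate partial sums towards process P-1 */
        if (me > 0) {
            MPI_Recv(&in, 1, MPI_DOUBLE, me - 1, 0, comm, &st);
            acc += in;
        }
        if (me < P - 1)
            MPI_Send(&acc, 1, MPI_DOUBLE, me + 1, 0, comm);

        /* backward sweep: the result held by process P-1 is passed back up */
        if (me < P - 1)
            MPI_Recv(&acc, 1, MPI_DOUBLE, me + 1, 1, comm, &st);
        if (me > 0)
            MPI_Send(&acc, 1, MPI_DOUBLE, me - 1, 1, comm);

        return acc;    /* every process now holds the same global sum */
    }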

    6.2 Commutative Operation using a Grid

    Figure 18 shows a diagram of a 2D grid of processes and the communication pattern for a

    commutative operation. Each stage of the commutative operation is across one of the

d dimensions of the grid. This method would be used when processors in a grid can only

talk directly to their grid neighbours; otherwise it is advantageous to use a hyper-cube method

    (see next section).


[Figure: an 8 x 8 grid of processes. Stage 1 performs the operation across the first dimension of the grid and stage 2 across the second. S = Σ(i=1..d) 2(Pi - 1), C = Σ(i=1..d) 2(Pi - 1) Π(j=1..d, j≠i) Pj.]

Figure 18 Communication pattern for commutative operation using a grid.

where Pi is the number of processors in dimension i and d is the number of dimensions.

    6.3 Commutative Operation using Hyper-cubes

    In a hyper-cube topology of dimension d, each process is connected directly to d other

    processes. Algorithms implemented using a hyper-cube offer the best performance generally

over other methods because the number of steps to perform a commutative operation is related to the dimension d (at most 2d steps), not to the number of processes, P. Since d = log2 P, for non-trivial P the hyper-cube offers far greater performance than any other topology.

    There are a number of ways to implement a commutative operation on a hyper-cube. Two

    methods are currently implemented in CAPLib. Method A uses a pair-wise exchange

    between processes until every process has the result. Method B uses a binary tree algorithm.

    Both rely on the connectivity offered by the hyper-cube. Both methods A and B guarantee the

    order of computation will be the same on every process and therefore the values obtained will

    be the same on all processes. This is obviously the case with method B. In method A, this is

guaranteed by always combining contributions with that from the lower-numbered processor

on the left-hand side of the summation. The pair-wise exchange of data that characterises the

Method-A operation can be further improved if non-blocking communications are used. Overlapping

    the exchange of data reduces the number of steps by a factor of two but relies on the

    performance of two small non-blocking communications out-performing two small blocking

    communications. CAPLib does not currently implement a non-blocking version of Method-

    A.
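Method A can be sketched as follows for a global sum on P = 2^d processes; this is an illustrative MPI version of the pair-wise exchange with blocking send/receive ordering, not CAPLib's actual implementation.

    #include <mpi.h>

    /* Sketch of hyper-cube Method A (pair-wise exchange) for a global sum on
     * P = 2^d processes.  At each stage every process exchanges its partial
     * sum with the partner whose rank differs in one bit; after d stages every
     * process holds the full result.  With blocking send/receive each stage
     * costs two steps (2d in total), and the contribution from the
     * lower-numbered side is always placed on the left of the addition so
     * that every process computes the result in the same order. */
    double hypercube_sum(double myval, MPI_Comm comm)
    {
        int me, P;
        double acc = myval, other;
        MPI_Status st;

        MPI_Comm_rank(comm, &me);
        MPI_Comm_size(comm, &P);

        for (int bit = 1; bit < P; bit <<= 1) {
            int partner = me ^ bit;
            if (me < partner) {                 /* lower rank sends first */
                MPI_Send(&acc, 1, MPI_DOUBLE, partner, 0, comm);
                MPI_Recv(&other, 1, MPI_DOUBLE, partner, 0, comm, &st);
            } else {
                MPI_Recv(&other, 1, MPI_DOUBLE, partner, 0, comm, &st);
                MPI_Send(&acc, 1, MPI_DOUBLE, partner, 0, comm);
            }
            /* combine with the lower-numbered contribution on the left */
            acc = (me < partner) ? acc + other : other + acc;
        }
        return acc;
    }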


[Figure: 16 processes (d = 4); at each stage every process exchanges its partial result pair-wise with a partner process. S = 2d (blocking exchange), S = d (non-blocking exchange), C = dP (d > 1).]

Figure 19 Communication pattern for commutative operation using a hyper-cube (method A, d=4)

[Figure: 16 processes (d = 4); partial results are combined down a binary tree and the result is then broadcast back out. S = 2d, C = 2^(d+1) - 2.]

Figure 20 Communication pattern for commutative operation using a hyper-cube (method B, d=4).

    In order to use these methods in runs where the number of processes does not exactly make

up a hyper-cube, the methods must be modified to account for this. For method A, if we

denote by k the number of processes in excess of the largest complete hyper-cube, then the last k processes send their values to

    the first k processes before the main part of the procedure begins. This ensures the values

    from these processes are used. When the main procedure is complete, the end k processes

receive the result. Method B handles the extra processes by extending the binary tree communication strategy to include the additional k processes.


    6.4 Comparison of Commutative Methods

Table 2 shows a comparison of the number of steps and the number of communications

    needed for a commutative operation using the methods implemented inside CAPTools.

             Pipe             Ring              Hyper-cube A              Hyper-cube B
      P    Steps  Comms    Steps   Comms     Steps   Steps   Comms      Steps   Comms
                                             (sync)  (async)
      2        2      2        2       2         2       1       2          2       2
      4        6      6        6      12         4       2       8          4       6
      8       14     14       14      56         6       3      24          6      14
     16       30     30       30     240         8       4      64          8      30
     32       62     62       62     992        10       5     160         10      62
     64      126    126      126    4032        12       6     384         12     126
    128      254    254      254   16256        14       7     896         14     254
    256      510    510      510   65280        16       8    2048         16     510
    512     1022   1022     1022  261632        18       9    4608         18    1022
   1024     2046   2046     2046 1047552        20      10   10240         20    2046

    Table 2 Number of steps S, and communications C, for a commutative operation using different methods.

    Obviously the Hyper-cube methods are the best for P>4; the pipe and ring methods would

    only be used on machines where the hyper-cube is not available, for example, machines built

    of hard-wired directly connected processors in a pipeline or grid. Each of the hyper-cube

methods performs the operation in the same number of steps (2d with blocking communications), but B takes fewer communications overall than A,

for P>2. For a large number of processes this factor becomes very important, as the time for a

    large number of simultaneous communications in one step can be affected by message

    contention across the hardware processor interconnect. For A, the number of messages

    remains constant at each step in a commutative operation at P/2. The number of

    communications in each successive step using method B reduces by a factor of 2 and

    therefore any contention is minimised to the first few steps. The number of steps needed to

complete the operation using A can, however, be halved if non-blocking communications are used.
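The entries in Table 2 follow directly from the formulas quoted above; the short illustrative program below reproduces the steps and communications columns for each method (the ring communication count is taken as P(P-1), consistent with the table).

    #include <stdio.h>

    /* Reproduce the step and communication counts of Table 2 from the formulas:
     * pipe:         S = 2(P-1), C = 2(P-1)
     * ring:         S = 2(P-1), C = P(P-1)
     * hyper-cube A: S = 2d (blocking) or d (non-blocking), C = dP
     * hyper-cube B: S = 2d, C = 2^(d+1) - 2,   where P = 2^d. */
    int main(void)
    {
        for (int P = 2, d = 1; P <= 1024; P *= 2, d++)
            printf("P=%4d  pipe S=%4d C=%7d  ring S=%4d C=%7d  "
                   "A S=%2d/%2d C=%5d  B S=%2d C=%4d\n",
                   P, 2 * (P - 1), 2 * (P - 1),
                   2 * (P - 1), P * (P - 1),
                   2 * d, d, d * P,
                   2 * d, 2 * P - 2);
        return 0;
    }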

    Figure 21 shows a graph of communication latency for CAP_COMMUTATIVE using

    CAPLib over SHMEM on the Cray T3D using a pipeline and the two hyper-cube methods.

    The graph clearly demonstrates the effect of using different global communication

    algorithms. Global communication using a pipeline becomes rapidly more expensive as the

number of processors increases. The best performance is given by the Hyper-cube B

algorithm. Note that in this case MPI_ALLREDUCE, which is the MPI equivalent of

CAP_COMMUTATIVE, does not perform as well as the Hyper-cube methods employed by

    CAP_COMMUTATIVE. Indeed, the CAP_COMMUTATIVE function has performed better

than the corresponding MPI_ALLREDUCE function in all ports of CAPLib so far

    undertaken.
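For reference, the MPI operation used in this comparison is a single collective call; a minimal sketch follows (the routine and buffer names are arbitrary).

    #include <mpi.h>

    /* Minimal use of MPI_Allreduce, the MPI equivalent of CAP_COMMUTATIVE:
     * every process contributes 'local' and receives the global sum in
     * 'global'. */
    void global_sum(double *local, double *global, MPI_Comm comm)
    {
        MPI_Allreduce(local, global, 1, MPI_DOUBLE, MPI_SUM, comm);
    }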


[Figure: "CAP_COMMUTATIVE (Cray T3D SHMEM)", time (µs) against number of processors for the PIPELINE, HYPERCUBE A and HYPERCUBE B methods, compared with MPI_ALLREDUCE.]

Figure 21 CAP_COMMUTATIVE latency on Cray T3D

    7 CAPLib Support Environment

    One of the major reasons that parallel environments are often difficult to use is the amount of

    configuration and details the user must know about the system in order to successfully

    compile and run their parallel programs. As part of the CAPTools parallelisation

    environment, a set of utilities is provided to aid users in compiling, running and debugging

their parallel programs. The main utilities are capf77 and capmake, which allow compilation

of the user's source code; caprun, which provides a mechanism for parallel execution of the

user's compiled executable; and capsub, which provides a simple generic method for

    submitting jobs to parallel batch queues. The characteristics of the utilities are:-

Simple to use: The utilities hide from the user as much as possible of the details of the

compilation and execution of parallel programs. Parallel compilation usually requires extra

flags on the compile line and special libraries linked in. Many parallel environments require a complex initialisation process to begin the execution of a parallel program. Parallel execution

often fails, not because the user's program is incorrectly coded, but because the parallel

environment has been wrongly configured in some way. By hiding the messy details of

    configuration from the user, execution becomes both quicker and more reliable. In many

    cases, the users do not need a detailed knowledge of the parallel environment they are

    utilising at all.

Generic interface: Each utility uses a set of common arguments across the domains of

parallel environment (e.g. MPI) and machine type (e.g. Cray T3D). This makes it easy for the

user to migrate from one machine or parallel environment to another. The main generic

arguments are:-

    -mach           Machine type, e.g. Sun, Paramid, T3D.
    -penv           Parallel environment type, e.g. PVM, MPI, i860toolset, shmem.
    -top            Parallel topology type, e.g. pipe2, ring4, full6, grid2x2.
    -debug n1 n2..  Execute in debug mode on processors n1, n2, etc.

    When a utility is executed it first checks for the existence of the environment variables

    CAPMACH and CAPPENV that provide default settings for the machine type and parallel

    environment type. These can be set manually by the user in their login script or by the

    execution of the usecaplib script, which attempts to determine these automatically from the

    host system. The command line argument versions of the environment variables can be used

to override any defaults.


    8 Parallel Debugging

    The debugging of parallel message passing code often requires the user to start up multiple

debuggers and trace and debug the execution on several processes. The main disadvantage

of having several debuggers running on the workstation screen is the large amount of

resource, both in computer time and physical memory, that this can require. Each debugger (with a graphical user interface) may require 40 Mbytes, and starting up several debuggers or

    attaching to several running processes can take minutes on a typical workstation. Recently

    computer vendors and third party software developers have begun to address this issue by

allowing the debugger to handle more than one process at a time and allowing the user to

    quickly switch from one process to another. This dramatically reduces the memory cost since

    only one debugger is now running and, if the same executable is running on all processors,

    only a single set of debugging information need be loaded. Examples of commercial

debuggers that provide such a facility are TotalView [15] and Sun Microsystems' Workshop

    development environment [16]. Cheng and Hood in [17] describe the design and

    implementation of a portable debugger for parallel and distributed programs. Their client-

server design allows the same debugger to be used on both PVM and MPI programs, and they suggest that the process abstractions used for debugging message-passing can be adopted to

    debug HPF programs at the source level. Recently the High Performance Debugging Forum

    [18] has been established to define a useful and appropriate set of standards relevant to

    debugging tools for High Performance Computers.

    The caprun script has a -debug argument that allows users to specify a set of

    processes that they wish to debug. On systems that do not yet provide a multi-process

debugger, but do provide some mechanism to debug parallel processes, using this option will

    result in a set of debuggers appearing on the screen attached to the chosen process set.

    CAPLib also provides a library routine called CAP_DEBUG_PROC that allows a debugger

to be attached to an already running process where this is possible, perhaps following some error condition. When a process calls CAP_INIT, one of the tasks undertaken is to check

    command line arguments and environment variables. If -debug is found then a call is made to

CAP_DEBUG_PROC, which calls a machine-dependent system routine to run the script

    capdebug. This script is passed a set of information such as the calling process-id, DISPLAY

    environment variable and executable pathname that allows a debugger to be started up,

attached to the calling process and displaying on the host machine's screen. The caprun script

    also has a capdbgscript argument that allows the user to specify a set of debugger

    commands to be executed by each debugger when starting up.

    As an example

    caprun -m sun -p pvm3 -top ring5 -debug 1-3 -dbxscript stopinsolve jac

This will start up three debuggers attached to processes 1-3 on the user's workstation; all

    debuggers will then execute the script stopinsolve which might contain

print cap_procnum
    stop in solve
    cont

    This would print the CAPTools processor number, set a break point in routine solve and

    continue program execution.

    9 Results

This section gives a series of results obtained for parallelisations, using CAPTools and

CAPLib, of two of the well-known NAS Parallel Benchmarks (NPB) [19]: APPLU (LU) and


    APPBT (BT). The LU code is a lower-upper diagonal (LU) CFD application benchmark. It

does not, however, perform an LU factorisation but instead implements a symmetric

    successive over-relaxation (SSOR) scheme to solve a regular-sparse, block lower and upper

    triangular system. BT is representative of computations associated with the implicit operators

    of CFD codes such as ARC3D at NASA Ames. BT solves multiple independent systems of

non-diagonally dominant, block tridiagonal equations. The codes are characterised in parallel form by pipeline algorithms, making all codes sensitive to communication latency.

    The results for the benchmarks refer to three different versions/revisions of the same code.

Rev 4.3 is a serial version of the benchmarks written in 1994 as a starting point for optimised

    implementations. Version NPB2.2 is a parallel version of the codes written by hand by

    NASA and using MPI communication calls. Version NPB2.3, the successor to NPB2.2, has

    both a serial and parallel version. The results presented here are for runs of CLASS A,

64x64x64 size problems. For each code, SPMD parallelisations using a 1-D and, in some

cases, a 2-D partitioning strategy were produced using CAPTools. The results for runs using

    these parallelisations on the Cray T3D, Transtech Paramid and the SGI Origin2000 are

presented in the following sections together with results for runs of the NPB2.2/2.3 parallel MPI versions.

    9.1 LU

    The results for LU runs on the Cray T3D, T3E, SGI Origin 2000 and Transtech Paramid are

    shown in Figure 22 to Figure 25 respectively. The T3D and T3E results compare the

    performance of 1-D and 2-D parallelisations of LU using CAPTools. The 1-D version can

    only be run on a maximum of 64 processors because of the size of problem being solved

    (64x64x64). The 2-D version was run up to 8x8 processors and gives very reasonable results.

    Figure 23 shows graphs of execution time for 1-D and 2-D parallelisations of LU using

CAPTools on the Cray T3E with different versions of CAPLib. The best results are given, as expected, by the SHMEM version of CAPLib, although for the 2-D runs the differences are

    quite small. These small differences are in part due to the pipelines present in LU code. The

    1-D version has pipelines with a much longer startup and shutdown period than the 2-D

    version and therefore performance is more dependent on the startup latency of the

    communications. Another factor is the memory access patterns required for communication

    in the 2nd dimension which use buffered CAPLib calls such as CAP_BSEND/BRECEIVE

    that gather data before sending and scatte