Run-time scheduling and execution of loops on message passing machines


Transcript of Run-time scheduling and execution of loops on message passing machines


JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING 8, 303-312 (1990)

Run-Time Scheduling and Execution of Loops on Message Passing Machines *

JOEL SALTZ, KATHLEEN CROWLEY, RAVI MIRCHANDANEY, AND HARRY BERRYMAN

Institute for Computer Applications in Science and Engineering, NASA Langley Research Center, Hampton, Virginia 23665; and Department of Computer Science, Yale University, New Haven, Connecticut

We examine the effectiveness of optimizations aimed at allowing a distributed machine to efficiently compute inner loops over globally defined data structures. Our optimizations are specifically targeted toward loops in which some array references are made through a level of indirection. Unstructured mesh codes and sparse matrix solvers are examples of programs with kernels of this sort. Experimental data that quantify the performance obtainable using the methods discussed here are included. © 1990 Academic Press, Inc.


1. INTRODUCTION

On shared memory machines, an important strategy for creating a program that will execute efficiently is to make sure that inner loops of the program can be parallelized. On a distributed memory machine, it is widely recognized that parallelism alone is not enough to ensure good performance.

We present an outline of some of the infrastructure required to efficiently compute, on distributed machines, a class of parallel loops we term start-time schedulable. A nest of loops is start-time schedulable if all data dependences are resolved before the program begins execution and if these dependences do not change during the course of the computation. Our optimizations are specifically targeted toward loops in which some array references are made through a level of indirection. Unstructured mesh codes and sparse matrix solvers are examples of programs with kernels of this sort. Experimental data that quantify the performance obtainable using the methods discussed here are included.

In order to access an array element, we need to know where the element is stored in the memory of the distributed machine. The expense associated with ascertaining the physical location in distributed memory of an element in a distributed array A obviously depends greatly on how A is partitioned. If we allow an arbitrary assignment of array elements to processors, the data structure used to describe how A is distributed will have the same number of elements as A. Scalability considerations force us to use a distributed data structure to store a description of how A is partitioned. Accessing such a distributed data structure will involve significant overheads. One fruitful optimization consequently involves precomputing the appropriate location in distributed memory for each reference to any element of A made within a parallel loop.

On distributed memory machines it is typically very expensive to fetch individual data elements. Instead, before a parallel loop executes, it is desirable to prefetch all nonlocal data required in the loop. In performing the required preprocessing, we examine array references in the parallel do loop, determine what array elements are to be fetched from other processors, and then decide where to store fetched data in the processor's memory.
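As a concrete, deliberately simplified illustration of this preprocessing step, the sketch below uses invented names and data, with a second array standing in for another processor's memory: one pass over the referenced indices records the nonlocal ones, and a single gather fills a local buffer before the loop runs.

      ! A minimal sketch of prefetching nonlocal data before a loop executes.
      ! All names and values here are illustrative, not taken from the paper;
      ! xremote stands in for array elements owned by another processor.
      program prefetch_sketch
        implicit none
        integer, parameter :: nref = 6, nlocal = 4
        integer :: ref(nref) = (/ 1, 3, 9, 2, 7, 4 /)   ! global indices the loop will touch
        real    :: xlocal(nlocal) = (/ 10., 20., 30., 40. /)      ! global elements 1..4
        real    :: xremote(5)     = (/ 50., 60., 70., 80., 90. /) ! global elements 5..9
        real    :: buffer(nref)
        integer :: fetch(nref), nfetch, i

        ! preprocessing: record which references are off-processor
        nfetch = 0
        do i = 1, nref
           if (ref(i) > nlocal) then
              nfetch = nfetch + 1
              fetch(nfetch) = ref(i)
           end if
        end do

        ! prefetch: a single gather (in practice, one message per processor)
        do i = 1, nfetch
           buffer(i) = xremote(fetch(i) - nlocal)
        end do

        print *, 'fetched', nfetch, 'nonlocal values:', buffer(1:nfetch)
      end program prefetch_sketch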

The single parallel do loop originally specified must be translated into multiple do loops, each of which executes sequentially on a separate processor. Each loop iteration is assigned to a processor. In distributed memory machines, data arrays need to be partitioned between the local memories of the processors. A datum needed for carrying out work in a loop iteration is fetched from the memory of the processor in which that datum is stored. Performance considerations dictate that we should partition array data and parallel loop iterations so that most array accesses do not require interprocessor communication. For this reason, it is useful to have the capability of assigning any set of array elements to a processor.

A command to fetch data can be implemented in a variety of ways in a distributed machine. Remote memory fetches are carried out in a rather roundabout manner. Processor A obtains the contents of a given memory location which is not on A by sending a message to processor B associated with the memory location. Processor B is programmed to anticipate a request of this type, to satisfy the request, and to return a responding message containing the contents of the specified memory locations. The cost of fetching array elements required in a repetitively executed parallel do loop is reduced by precomputing what data each processor needs to send and to receive.

* This work was supported by the U.S. Office of Naval Research under Grant N00014-86-K-0310, and under NASA Contract NAS1-18605.


The optimizations described here can be performed through procedure calls. The relevant procedures are supplied a list of distributed array elements referenced by a processor during execution of a parallel do loop.

In this paper, we do not address the question of deciding how to partition data structures and parallel loop iterations. Descriptions of problem partitioning methods can be found in [2, 6, 11].

In Section 2 we outline some of the other research efforts aimed at allowing a distributed machine to efficiently compute nests of loops over globally defined data structures. In Section 3, we present a set of example programs that demonstrate why seemingly straightforward parallel loops can perform poorly on distributed memory machines. In Section 4, we outline our set of proposed optimizations, in Section 5 we describe the run-time invocation sequences, and in Section 6 we present experimental data that examine the performance that can be obtained using the proposed optimizations.

2. RELATED RESEARCH

There are several research efforts whose aim is to allow a distributed machine to efficiently compute programs consisting of sets of loops over globally defined data structures.

The Linda system [1] provides an associative addressing scheme by which a reference to variables can be resolved at execution time; this in essence provides a shared name space for distributed memory machines. Callahan and Kennedy [5], Rogers and Pingali [12], and Rosing and Schnabel [13] suggest execution time resolution of communications on distributed machines. None of these proposals saves information on repeated patterns of communication. In all these proposed systems, given special mappings of a globally defined array to distributed memory, references by a processor to the array can be efficiently transformed so that the appropriate local memory references are made.

In contrast, Mehrotra and Van Rosendale [9, 8] do perform execution time resolution of the communications required for carrying out parallel do loops on distributed machines in situations where compile time resolution is not possible. A major difference in our current approaches is that Mehrotra and Van Rosendale store copies of array data off processor in a form that requires checking and searching for nonlocal references during loop execution. Since we store a list of local references that must be made by each processor to a distributed array, we are able to avoid this execution time overhead. In comparing the two approaches, we have a space-time trade-off.

3. PRESCHEDULING SPARSE LOOP STRUCTURES

In Section 3.1, we present a set of programs that carry out a simple regular computation. These examples demonstrate why seemingly straightforward parallel loops can cause performance problems on distributed memory machines. In Section 3.2, we give an example of a sequential loop with interiteration data dependencies that is transformed into a sequence of parallel do loops.

3.1. Sparse Matrix Vector Multiply

To provide a context for what follows, we present two programs that carry out a sequence of sparse matrix vector multiplies (Jacobi iterations) and what might be done to execute these programs on a distributed memory machine. In Section 6 we present experimental results illustrating the differences in performance that arise from the optimizations discussed here. The first version of the program is depicted in Fig. 1. Such a problem should be partitioned in a manner that (1) distributes the load between the processors roughly equally, (2) limits the number of communication start-ups, and (3) limits the size of the messages that need to be communicated between processors. Depending on the time required for communication start-ups, the typical strategy is to partition all arrays by strips or rectangular blocks [11]. Values of variables along each side of the periphery of the strips or blocks can be exchanged.

The program in Fig. 1 performs a sequence of Jacobi sweeps or point relaxations over an n by n square. In this program, the variable values at domain point i, j are represented by x(i,j) and xold(i,j). The values of a(i,j), b(i,j), c(i,j), and d(i,j) are used each iteration for the calculation of x(i,j). Because of the regular geometry of this program, the programmer can easily identify variables whose values must be sent to other processors, and values which must be received from other processors. When this program executes, a single message can be formed from all variable values corresponding to the side of a rectangle. If we partition the program in Fig. 1 between P processors using a vertical strip decomposition, Fig. 2 gives the pseudocode for the corresponding message passing program. We use a FORTRAN 8x type notation in our pseudocode for depicting the sending or receiving of subarrays of floating point numbers.

      do iter=1,num
        do i=1,n
          do j=1,n
            x(i,j) = a(i,j)*xold(i+1,j) + b(i,j)*xold(i-1,j) +
                     c(i,j)*xold(i,j-1) + d(i,j)*xold(i,j+1)
          end do
        end do
        do i=1,n
          do j=1,n
            xold(i,j) = x(i,j)
          end do
        end do
      end do

FIG. 1. Jacobi iteration.


      do iter=1,num
        do i=1,n/P
          do j=1,n
            x(i,j) = a(i,j)*xold(i+1,j) + b(i,j)*xold(i-1,j) +
                     c(i,j)*xold(i,j-1) + d(i,j)*xold(i,j+1)
          end do
        end do
        do i=1,n/P
          do j=1,n
            xold(i,j) = x(i,j)
          end do
        end do
        send xold(1,:) to proc p-1
        send xold(n/P,:) to proc p+1
        receive from proc p-1, put in xold(0,:)
        receive from proc p+1, put in xold(n/P+1,:)
      end do

FIG. 2. Message passing Jacobi.

In a distributed memory machine, array references must refer to memory locations on a particular processor. In Fig. 2, we see that it can be straightforward to translate loops so that all references are to a processor’s local memory.

Consider Fig. 3, which depicts a program that is somewhat more general than the program in Fig. 1 but for the appropriate array initializations represents the same problem as the program in Fig. 1. In the following figures we use a modification of a standard sparse matrix data structure where the nonzeros in matrix A are stored in a one-dimensional array a. Nonzero elements are taken from consecutive rows of A and are assigned to a beginning with the leftmost column of each row of A. For each row i, low(i) and high(i) represent the locations in array a of the left- and rightmost nonzero columns of the row in matrix A. The column of A corresponding to element j of a is given by column(j). Rather than sweeping over a two-dimensional array, we sweep over a one-dimensional array where dependences are given by the integer array column.

S1    do iter=1,num
S2      do i=1,n**2
          do j=low(i),high(i)
S3          x(i) = x(i) + a(j)*xold(column(j))
          end do
        end do
S4      do i=1,n**2
          xold(i) = x(i)
        end do
      end do

FIG. 3. Sparse mesh Jacobi.

A problem with the same pattern of dependences as that seen in Fig. 1 could be specified by the program in Fig. 3. Sparse matrix system solvers often use these types of data structures.
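For concreteness, the following small, self-contained example (our own, with made-up values) shows how a 3 by 3 matrix would be laid out in the arrays a, column, low, and high used in Fig. 3, and performs one sweep of statement S3.

      ! Illustrative layout of the sparse storage used in Fig. 3 for a tiny
      ! 3-row matrix: row 1 has nonzeros in columns 1-2, row 2 in columns 1-3,
      ! and row 3 in columns 2-3.  low(i) and high(i) bracket row i's entries
      ! in a, and column(j) gives the column index of a(j).
      program sparse_layout_example
        implicit none
        integer, parameter :: n = 3, nnz = 7
        real    :: a(nnz)      = (/ 2., 1.,  1., 2., 1.,  1., 2. /)
        integer :: column(nnz) = (/ 1,  2,   1,  2,  3,   2,  3 /)
        integer :: low(n)      = (/ 1, 3, 6 /)
        integer :: high(n)     = (/ 2, 5, 7 /)
        real    :: xold(n)     = (/ 1., 1., 1. /)
        real    :: x(n)
        integer :: i, j

        x = 0.0
        do i = 1, n                       ! the sweep of statements S2-S3 in Fig. 3
           do j = low(i), high(i)
              x(i) = x(i) + a(j)*xold(column(j))
           end do
        end do
        print *, x                        ! expected output: 3.0  4.0  3.0
      end program sparse_layout_example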

In Fig. 3 we use the arrays a and column to designate which array elements will be needed to compute the right hand side in statement S3. Unless we know how arrays low and high have been initialized, we do not know which elements of column and of a will be needed on each processor. In a naive implementation of the algorithm, we would have to partition column and a in some regular manner and would have to fetch the array values when they are needed, possibly generating high performance penalties. A naive multiprocessor implementation of the code in Fig. 3 also requires that a fetch from a remote memory be performed whenever xold(column(j)) in program statement S3 specifies a memory location not assigned to the processor on which the code executes. The experimental results in Section 6 quantify how costly this kind of implementation can be on a message passing machine.

We can systematically partition a problem to obtain a good balance between communication costs and load balance. In Fig. 4, we show an example of a partitioned version of the program shown in Fig. 3. Each iteration of the parallel do loop or doall loop S1 is assigned to a unique processor pe. Each processor pe loops over the indices assigned to it (statement S2); the indices assigned to pe are specified in statement S3 by the subarray schedule(:,pe). In this illustration, global names are still given to all array references and loop indices.

Optimizations that can be performed on this partitioned program involve the following: (1) the global index numbers used to access elements of x, xold, a, and column can be translated into local index numbers that represent storage locations in each processor; (2) data to be transmitted between processors can be formed into longer packets to amortize start-up costs; (3) communications to be carried out by each processor can be prescheduled so that each processor knows when to send and when to receive values of specific variables.

      do iter=1,num
S1      doall pe=1,num_processors
S2        do i=1,nlocal(pe)
S3          next = schedule(i,pe)
            do j=low(next),high(next)
S4            x(next) = x(next) + a(j)*xold(column(j))
            end do
          end do
        end doall
S5      doall pe=1,num_processors
          do i=1,nlocal(pe)
            next = schedule(i,pe)
            xold(next) = x(next)
          end do
        end doall
      end do

FIG. 4. Transformed sparse mesh Jacobi.


S1    do iter=1,n
S2      doall i=1,m
          ... = x(*,iter-*)
          x(i,iter) = ...
        end doall
      end do

FIG. 5. Sequence of parallel do loops.


3.2. Prescheduling Loop Structures

We can perform optimizations of the type described in Section 3.1 in more general cases where we have a sequence of parallel do loops that might arise from a sequential loop in which there are inter-iteration dependencies. For instance, in Fig. 5 we have a sequential outer loop S1 whose loop body contains a doall loop S2. S2 contains an expression with variables that may have been written to during earlier iterations of S1.

Consider the program for solving a lower triangular system in Fig. 6. In that program, we must assume that the outer loop S1 has to be executed in a sequential fashion. Sets of iterations of S1 (Fig. 6) that can be concurrently executed can be identified by performing a topological sort [7, 14, 3, 4] on the dependency graph relating the left hand side of S2 to the right hand side. This sort is performed by examining the integer array column. In this way the sequential construct in Fig. 6 can be transformed into a parallel construct consisting of a sequence of parallel do loops. Each parallel do loop represents a concurrently executable set of indices from S1 of Fig. 6.
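A sketch of such a topological sort is shown below; it is our own illustration (not the authors' inspector code) and uses a four-row example in which rows 2 and 3 depend only on row 1, while row 4 depends on rows 2 and 3. Rows that receive the same level can be executed in the same parallel do loop.

      ! Assigning a wavefront level to each row of a lower triangular system:
      ! a row's level is one more than the largest level among the rows it
      ! references through column(j).  Rows with equal levels form one phase.
      program wavefront_sketch
        implicit none
        integer, parameter :: n = 4, nnz = 4
        ! strictly lower triangular dependences; high(i) < low(i) means row i
        ! has no off-diagonal entries (the zero-trip loop handles this).
        integer :: low(n)      = (/ 1, 1, 2, 3 /)
        integer :: high(n)     = (/ 0, 1, 2, 4 /)
        integer :: column(nnz) = (/ 1, 1, 2, 3 /)
        integer :: level(n), i, j

        do i = 1, n
           level(i) = 1
           do j = low(i), high(i)
              level(i) = max(level(i), level(column(j)) + 1)
           end do
        end do
        print *, 'phase of each row:', level   ! expected: 1 2 2 3
      end program wavefront_sketch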

It is frequently possible to cluster work carried out in solving recursion equations so that the number of computational phases (and hence the number of communication start-ups) is reduced, but adequate parallelism is nevertheless preserved.

For example, consider what is required for efficiently solving a set of explicitly defined recursion equations

$$y_{i,j} = a_{i,j}\,y_{i,j-1} + b_{i,j}\,y_{i-1,j} \qquad (1)$$

S1    do i=1,n**2
        do j=low(i),high(i)
S2        x(i) = x(i) + a(j)*x(column(j))
        end do
      end do

FIG. 6. Sparse mesh lower triangular solve.

on an X by Y point square (the recursion equations are subject to some suitable boundary conditions). We can concurrently solve for variables along antidiagonals, i.e., variables $y_{i,j}$ satisfying i + j = k for positive k. The computation will be divided into X + Y - 1 phases. Assume that we instead partition the domain into a grid of rectangular blocks, each X/m by Y/n points, and schedule the work in each block as a single unit. This modified computation will require only m + n - 1 phases. In order to take advantage of the efficiencies that can be gained by clustering blocks of variables, operations assigned to a given processor and executed during a particular phase must be scheduled in a specific order.
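As a concrete check of these phase counts (our arithmetic, using for illustration the 192 by 192 point domain that appears in Section 6.4):

$$X + Y - 1 = 192 + 192 - 1 = 383 \ \text{phases for pointwise antidiagonal scheduling},$$
$$m + n - 1 = 32 + 32 - 1 = 63 \ \text{phases when the work is clustered into } 6 \times 6 \text{ point blocks } (m = n = 32).$$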

Figure 7 demonstrates how the computations described by Eq. (1) are carried out. Note that a single triangular solve requires the solution of a sequence of parallel do loops (statements S1 and S2); within each parallel do loop a set of row substitutions is scheduled (statement S3). As mentioned above, carrying out the operations in the order given by the array schedule in statement S4 may be essential. Methods for clustering work in sparse programs that solve recursion equations are discussed in detail in [14].

4. INDEX SET PREPROCESSING

In this section we outline how the optimizations alluded to above can be carried out. Recall that we assume that this parallel loop will be executed many times so that we can amortize the costs incurred in performing the optimizations described below.

In keeping with the terminology introduced in [8] we call the routines that carry out the execution time optimizations inspector routines. We provide an outline of how a distributed version of a set of inspector routines functions. Our distributed inspector routines are still under development and we will not present machine timings for these routines. The optimizations examined in the experimental data presented below used results from a set of sequential inspector routines.

S1    do phase=1,num_phases
S2      doall pe=1,num_processors
S3        do j=1,npoints(phase,pe)
S4          next = schedule(phase,pe,j)
            do k=low(next),high(next)
S5            x(next) = x(next) + a(k)*x(column(k))
            end do
          end do
        end doall
      end do

FIG. 7. Transformed lower triangular solve.


Assume that we are presented with a parallel loop that references an array A on the right hand side of one or more expressions. Also assume that loop iterations have been assigned to processors. Recall that we allow an arbitrary assignment of array elements to processors. We employ arrays IAPROC and IALOC to describe how an array A is to be distributed. IAPROC(i) stores the processor to which A(i) is assigned and IALOC(i) stores the memory location of A(i) on that processor. We use standard methods [15, 6, 9, 5] for distributing arrays IAPROC and IALOC between the memories of the processors.

For each processor P, collect in array A_ref the global indices corresponding to each consecutive distributed array reference. IAPROC and IALOC are accessed to obtain the processor and memory location corresponding to each global index of A_ref. The next step is to create an array A_ptrs that will store pointers to memory locations in P. Corresponding to each global index of A_ref, we store a pointer to the memory location in P that will, during loop execution, store the associated value of A. Some of the elements in A_ptrs will point to elements of A stored in P. Since copies of elements of A stored in other processors must also be accessed, other pointers will refer to the appropriate buffer locations in P. In conjunction with the analysis required to create A_ptrs, we also collect a list A_fetch of elements of A that must be fetched from the memories of other processors. Once each processor has a list of array elements to be fetched, we determine precisely which elements of A each processor P must send to other processors and which elements P must receive.

In the scheme described above, we must store a pointer to each element of A referenced in a processor's portion of a parallel loop. One can instead create an array A_enumerate that consecutively enumerates values of A required in a loop. A_enumerate will consequently have a size equal to the number of references to A in the loop by processor P. No indirection will be required when the parallel loop accesses elements of A_enumerate. On the other hand, values of A stored locally will have to be copied when A_enumerate is initialized. For descriptive purposes, we refer to the two storage schemes described as direct, when A_enumerate is used, and indirect, when a list of pointers is used, as in A_ptrs. We use this naming scheme in later references to the storage mechanisms.
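The sketch below illustrates the indirect scheme in a single address space: IAPROC and IALOC are as defined above, while the remaining names and the particular data distribution are ours. Local references receive pointers into local storage; off-processor references are appended to the fetch list and receive pointers into a buffer that follows the local slots. (Duplicate off-processor references are not coalesced here, though an inspector could coalesce them.)

      ! Building A_ptrs and A_fetch from the global reference list A_ref,
      ! using the translation tables IAPROC and IALOC of Section 4.  Here the
      ! whole table is held locally for illustration; in the distributed
      ! inspector the table itself is partitioned among processors.
      program inspector_sketch
        implicit none
        integer, parameter :: nglobal = 8, nref = 5, myproc = 0, nlocal = 4
        integer :: iaproc(nglobal), ialoc(nglobal)
        integer :: aref(nref) = (/ 2, 7, 1, 8, 2 /)  ! global indices, in order of use
        integer :: aptrs(nref), afetch(nref)
        integer :: i, nfetch

        ! assumed block distribution: elements 1..4 on processor 0, 5..8 on processor 1
        do i = 1, nglobal
           iaproc(i) = (i - 1)/4
           ialoc(i)  = mod(i - 1, 4) + 1
        end do

        nfetch = 0
        do i = 1, nref
           if (iaproc(aref(i)) == myproc) then
              aptrs(i) = ialoc(aref(i))            ! points into local storage
           else
              nfetch = nfetch + 1
              afetch(nfetch) = aref(i)             ! must be fetched before the loop
              aptrs(i) = nlocal + nfetch           ! points into the fetch buffer
           end if
        end do

        print *, 'A_ptrs  =', aptrs                ! expected: 2 5 1 6 2
        print *, 'A_fetch =', afetch(1:nfetch)     ! expected: 7 8
      end program inspector_sketch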

5. RUN-TIME INVOCATIONS

In this section, we provide high-level descriptions of the run-time execution sequence of the preprocessor and executor for the sparse matrix-vector multiply and triangular solve. The Fortran code for these computations is depicted in Figs. 3 and 6. It is assumed that the particular storage scheme is made known to the preprocessor by the user, for each distributed array in the program. This is currently provided as a parameter in the call to the preprocessor. We assume that array A is stored using the direct scheme whereas X is stored using the indirect scheme. It is assumed that the mapping of each array is also known to the preprocessor.

For the matrix-vector multiply, we have the following sequence:

• On each processor perform the following steps in a concurrent manner:

1. Call the preprocessor with A's list of indices used in the loop, in order of usage.

2. After exchanging information with other processors, the preprocessor initializes A_enumerate with local and nonlocal elements of A, in order of usage.

3. Call the preprocessor with X's list of indices used in the loop, in their order of usage.

4. After exchanging information with other processors, the preprocessor initializes X_ptrs with pointers to local and nonlocal elements of X. These pointers are stored in order of use of X's elements. Local elements of X had been assigned storage earlier, but nonlocal elements are placed in a separate buffer.

5. Initialize send-receive lists from the information gathered in step 4. All subsequent iterations use send-receive pairs for exchanging new values of X's elements.

6. Perform the following steps for each iteration (the loop of step (a) is sketched after this list):

(a) Execute the local loop using the X_ptrs and A_enumerate arrays. Always select the next element of the X_ptrs and A_enumerate arrays after the current one is used. The arrays have been initialized in order of use, in steps 2 and 4. Recall that stepping through X_ptrs gives the address in local memory of the element of X needed.

(b) Exchange new values of X's elements with other processors that need them.
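The following self-contained fragment sketches the executor loop of step 6(a) for one processor's two assigned rows. The array names follow the text (A_enumerate shortened to aenum, X_ptrs to xptrs), but the data values and the layout of xloc (local values of xold followed by the buffer of fetched copies) are invented for illustration.

      ! Executor loop of step 6(a): aenum holds the needed elements of a in
      ! order of use (direct scheme); xptrs holds, also in order of use, the
      ! local addresses of the needed elements of xold (indirect scheme).
      program executor_sketch
        implicit none
        integer, parameter :: nrows = 2, nnz = 4, nbuf = 4
        integer :: low(nrows)  = (/ 1, 3 /)
        integer :: high(nrows) = (/ 2, 4 /)
        real    :: aenum(nnz)  = (/ 2., 1., 1., 2. /)  ! filled by the preprocessor
        integer :: xptrs(nnz)  = (/ 1, 3, 4, 2 /)      ! filled by the preprocessor
        real    :: xloc(nbuf)  = (/ 1., 1., 5., 7. /)  ! local xold values, then fetched copies
        real    :: x(nrows)
        integer :: i, j, k

        k = 0
        do i = 1, nrows
           x(i) = 0.0
           do j = low(i), high(i)
              k = k + 1                               ! always step to the next preprocessed entry
              x(i) = x(i) + aenum(k)*xloc(xptrs(k))
           end do
        end do
        print *, x                                    ! expected: 7.0  9.0
      end program executor_sketch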

For the triangular solve, the execution sequence is as follows:

• On each processor perform the following steps in a concurrent manner:

1. Call the preprocessor with A's list of indices used in the loop, in order of usage.

2. After exchanging information with other processors, the preprocessor initializes A_enumerate with local and nonlocal elements of A, in order of usage.

3. Call the preprocessor with X's list of indices used in the loop, in their order of usage.

4. After exchanging information with other processors, the preprocessor initializes X_ptrs with pointers to local and nonlocal elements of X. These pointers are stored in order of use of X's elements. Local elements of X had been assigned storage earlier, but nonlocal elements are placed in a separate buffer.

5. Initialize send-receive lists from the information gathered in step 4. All subsequent iterations use send-receive pairs for exchanging new values of X's elements.


6. For phase = 1 to num_phases, perform the following steps for all iterations in the phase:

(a) Execute the local loop using the X_ptrs and A_enumerate arrays. Always select the next element of the X_ptrs and A_enumerate arrays after the current one is used. The arrays have been initialized in order of use, in steps 2 and 4. Recall that stepping through X_ptrs gives the address in local memory of the element of X needed.

(b) Exchange new values of X's elements with other processors that need them.

From the high-level descriptions given above, we can see that the only difference between the two computations is in step 6. This difference reflects the fact that the triangular solve must be computed as a set of parallel do loops with synchronizations between these loops, rather than as a single parallel do loop as in the matrix-vector multiply. Thus elements of X need to be fetched incrementally at the end of each parallel do loop.

TABLE I
Matrix-Vector Multiply, 100 by 100 Mesh, Five-Point Template

Processors   Total time (ms)   T_comp (ms)   T_comp - T_sequential/P (ms)
 1           409               409           37
 2           207               205           19
 4           107               103           10
 8            55                52            6
16            31                26            3
32            18                14            2

6. ANALYSIS OF EXECUTOR PERFORMANCE

6.1. Overview of Executor Performance

To obtain an experimental estimate of the efficiency of the executor on the Intel iPSC/2, we carried out a sequence of sparse matrix-vector multiplications using matrices generated from a square mesh with a five-point template. We expect this problem to parallelize well; the experiments are carried out not to demonstrate this obvious fact but to quantify the overhead caused by the extra fixed-point operations performed by the executor. In particular, we need to distinguish between inefficiencies that can be attributed to operations performed by the executor itself and inefficiencies arising from communication delays and load imbalance.

Single processor timings from an optimized sequential matrix-vector multiply program were compared with the parallel code run on a single processor. The optimized sequential code required T_sequential = 0.372 s while the parallel code on a single processor required 0.409 s. The overhead for using the executor in this case is approximately 9%.

We partitioned the loop indices evenly into blocks, assigning consecutive blocks of indices to physically adjacent processors of an iPSC/2. In Table I we depict the total time required to solve the problem on varying numbers of processors. A separate estimate of computation time T_comp was obtained by eliminating the communication calls. Because indices are partitioned evenly between processors in this very uniform problem, we expect the load to be almost perfectly balanced. We form an estimate of the overhead due to extra operations performed by the executor to be T_comp - (T_sequential/P) and depict this in Table I. It may be seen that this overhead is roughly 10% of the parallel execution time.

To demonstrate that the executor is capable of achieving high efficiencies in an absolute sense, we reduce the relative contribution of communication costs by increasing the size of the problem. We compared timings from matrix-vector multiplications using matrices generated from square meshes of sizes 100, 150, and 200 using a five-point template. The parallel efficiencies for 32 processors were 0.65, 0.75, and 0.81 for problems arising from 100, 150, and 200 point meshes, respectively. As usual, we define parallel efficiency as the execution time for the optimized sequential program divided by the product of the number of processors and the execution time of the multiprocessor code.

We then examine the performance of the executor in the more challenging problem of solving sparse triangular systems. The row substitutions that must be performed in carrying out the sparse triangular solve are partitioned in a manner that takes into account the underlying geometry of the discretized domain used to generate the triangular matrix.

6.2. Comparison of Executor Performance with That of a Shared Memory Simulator

The Linda system [1] provides an associative addressing scheme by which a reference to variables can be resolved at run-time. This in essence provides a shared name space for distributed memory machines. We used the Linda system (as it existed when this paper was written) to estimate what efficiencies one might expect if one were to fetch various required array values on a demand basis with no preprocessing.

We performed the same matrix-vector multiplication experiments described in Section 6.1 using a matrix generated from a 100 by 100 mesh.

Referring to Fig. 3, each element of array xold was fetched from the storage location assigned to it by the Linda system when a need for that element was encountered. All elements of a and column corresponding to a given row were stored contiguously and fetched as a single unit. Table II depicts the timings obtained using the Linda code along with a repetition of the timings obtained using the executor. The striking difference in timings can be understood when one considers that using the iPSC/2 as a shared memory machine requires having to pay several milliseconds per data fetch.



It should be emphasized that this experiment is not intended to be an evaluation of Linda. We are using Linda to allow us to emulate shared memory on the iPSC/2; we deliberately access this apparent shared memory in a way that does not take either data locality or message latency into account. Optimizations analogous to the ones discussed here could be fruitfully employed in conjunction with the Linda system so that Linda's apparent shared memory would be used much more sparingly.

6.3. Distributed Memory Model Problem Analysis

In Section 6.4, we experimentally examine the performance of the executor in the more challenging problem of solving sparse triangular systems. To properly interpret the results we obtain, we need to first examine load balance versus communication cost trade-offs in the context of solving the recursion equation in Section 3.2. The same dependency pattern is seen in a lower triangular system generated by the zero fill factorization of the matrix arising from an X by Y point rectangular mesh with a five-point template (this system might arise in preconditioned Krylov space iterative linear system solvers). We will utilize P processors and partition the domain into n horizontal strips where each strip is divided into m blocks. We assign horizontal strip s to processor s modulo P.

We derive an expression to estimate the total time required to solve the recurrence equations in the absence of communication delays. This expression will be used to interpret experimental results we obtained using the executor on a sparse representation of this problem.

We assume for convenience that m and n are multiples of P, and let S be the time required to perform the sequential computation. We define T_c to be the time taken to perform the computation in a block for a given m, n, and S, and assume that T_c = S/(mn).

TABLE II
Comparison with Shared Memory: Matrix-Vector Multiply

Processors   Linda time (s)   Executor time (s)   Time given 100% efficiency (s)
 8           17.936           0.055               0.047
16           12.679           0.031               0.023
32           11.724           0.018               0.012

Estimated total time without communications can be expressed as the sum of the time that would be required were the computation evenly distributed between processors in the absence of any load imbalance plus the time wasted due to load imbalances,

$$\frac{T_c\,mn}{P} + \frac{T_c\,\min(m,n)\,(P-1)}{P},$$

where T_c is the calculation time per block. We now derive and discuss the second part of the above expression, representing the time wasted due to load imbalances. The number of phases is equal to m + n - 1. We assume that m and n are multiples of P. Under this assumption, the term for idle time can be derived by noting that during any phase j ≤ min(m, n) - 1, when j is not a multiple of P, there are P - (j mod P) processors idle. When j is a multiple of P, no processors are idle. Thus the sum of the processor idle time for j ≤ min(m, n) - 1 is

$$\frac{T_c\,\min(m,n)}{P}\sum_{l=1}^{P}\frac{l-1}{P} = \frac{T_c\,\min(m,n)\,(P-1)}{2P}.$$

Through similar reasoning, the sum of the processor idle time for the last min(m, n) - 1 phases is the same. During the intermediate phases, the load is balanced, with min(m, n)/P blocks assigned to each processor in each phase. Thus the total idle time is

$$\frac{T_c\,\min(m,n)\,(P-1)}{P}.$$

If there are no communication costs, T_c = S/(mn), where S is the sequential time. Then the total estimated time equals

$$\frac{S}{P} + \frac{S\,(P-1)}{\max(m,n)\,P}. \qquad (2)$$

Thus in the absence of communication costs, all terms involve m and n in a symmetric manner.
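As a check (our arithmetic), evaluating Eq. (2) with the one-processor parallel time S = 885 ms from Table III, P = 32, and m = n = 32 gives

$$\frac{885}{32} + \frac{885 \cdot 31}{32 \cdot 32} \approx 27.7 + 26.8 \approx 54\ \text{ms},$$

which matches the communication-free estimate listed in Table III for that row.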

With very general assumptions, we show below that it is optimal to make the vertical strip size Y/n as large as possible and to decrease the horizontal block size X/m until the increased communication time becomes larger than the benefit of decreased idle time. First we show that in the presence of communication costs, we should choose m ≥ n. We calculate the size of the largest message that must be sent between two processors during each phase. We assume that the time required for communication is equal to the sum of the times required in each phase to send the largest messages.


This tacitly assumes that the system is essentially synchronous, and that computation and communication occur in alternating nonoverlapping periods of time.

The time required for communication can be safely assumed to be an increasing function of message size. For phases 1 through min(m, n) - 1, the maximum number of data values sent by any processor is ⌈p/P⌉·B_s, where p is the phase number, B_s = X/m, and X is the horizontal dimension of the matrix. For phases min(m, n) through m + n - min(m, n), the maximum number of data values sent by any processor is ⌈min(m, n)/P⌉·B_s, and for phases m + n - min(m, n) + 1 through m + n - 1, a maximum of ⌈(m + n - p)/P⌉·B_s data values is sent. If B_s were held fixed, the time required for communication would be symmetric in m and n. Since B_s is a decreasing function of m, it is always advantageous from the standpoint of communication cost to choose m ≥ n. Since Eq. (2) is also symmetric in m and n, the minimum total time always occurs when m ≥ n.

To minimize all terms involving n, we should choose n to be as small as possible, i.e., P. For m ≥ n, Eq. (2) has no dependence on n. For any given m, the communication cost does not increase with decreasing n. If dependency graph G_1 has m by n_1 points and dependency graph G_0 has m by n_0 points, with n_1 < n_0, G_1 can be embedded in G_0. Since the communication cost per block (B_s = X/m) is dependent only on m, G_1 need have a communication requirement no greater than G_0. We thus conclude that the vertical blocks chosen should be as large as possible.

6.4. Executor Performance

We show that we can account for the execution time in solving sparse triangular systems by estimating the time lost due to both load imbalance and communication delays. Referring back to Fig. 7, the array schedule dictates (1) how the row substitutions to be performed in this problem will be partitioned between processors; and (2) the number of parallel do loops that must be carried out, i.e., the computational granularity. Computational granularity is a crucial determinant of performance in message passing multiprocessors, which possess relatively high communication latencies. It is important to aggregate work in a way that leads to a controlled trade-off between load imbalance and communication requirements. Optimal partitioning (i.e., mapping of the unaggregated problem) cannot be achieved without taking into account the geometrical relationship between index elements. Aggregation methods used to generate input schedules arising from sparse forms of a wide variety of recursion equations are described in [14]. An aggregation strategy that can be used for the sparse versions of the recursion equations was discussed in Section 6.3. The granularity of parallelism is parameterized using the parameters m and n introduced in Section 6.3. Recall that as m and n increase, the size of the scheduled computational grains decreases.

In Table III we depict the sequential time, parallel time, time required by the parallel program run on one processor, estimated communication time, and estimated communication-free time on 32 nodes of an Intel iPSC/2. The communication time estimate is obtained by running problems in which computation is deleted but communication patterns are maintained. The estimated communication-free time indicates the computation time that would be expected in the absence of communication delays. This is given by Eq. (2); we set S in this equation equal to the one-processor parallel time. The problems solved were on a 192 by 192 point domain and on a 576 by 576 point domain. It was not possible to obtain sequential times or one-processor parallel times for the larger problem; the times shown were extrapolated from the 192 by 192 point problem.

We note that for the problems on a 192 point square, the estimated communication-free time added to the estimated communication time is close to the total time measured. Because these three quantities are derived from distinct experiments, this gives us some confidence that we are able to explain the timings observed. There is more of a discrepancy for the problem on a 576 point square. Since the one-processor parallel time used is just an estimate, we expect that our estimate of communication-free time here will be less accurate.

We note that when we employ a very fine-grained parallelism (m and n equal to problem size), we pay a very heavy communication penalty relative to the computation time.

TABLE III
Matrix from Square Meshes, Five-Point Template

Mesh size   n    m    Total time (ms)   Communication time (ms)   Communication-free time (ms)   Seq. time (ms)   1-proc. parallel time (ms)
192         192  192  295               250                        32                             591              885
192          32  192  184               149                        32                             591              885
192          32   64  105                63                        41                             591              885
192          32   32   99                52                        54                             591              885
576          32   64  492               216                       370                            5319             7965


TABLE IV
Matrix from 300 by 300 Mesh, Five-Point Template, Reduced System

m (= n)   Efficiency   Total time (s)   Phases   Communication time (s)
300       0.16         3.99             598      3.45
150       0.31         2.01             299      1.57
 75       0.53         1.21             149      0.88
 50       0.44         1.44              99      0.65
 38       0.33         1.95              75      0.45

The completion of this noncomputationally intensive problem requires 383 phases, each of which requires processors to both send and receive data. We can reduce the number of computational phases, and consequently the communication time, at the cost of increased load imbalances.

The overhead required for the operation of the executor appears to be captured well by the differences between the sequential time and the time required for the parallel code to execute on a single processor. Note that the overhead attributable to the executor, roughly 33%, is substantially larger here than it was for matrix-vector multiply. This is understandable because the executor is having to coordinate a long sequence of distinct, rather fine-grained computational phases.

While appropriate choice of computational granularity is essential for maximizing computational efficiency, the nature of the triangular solve limits the performance that can be obtained in problems that are not extremely large. The parallel efficiency obtained in the 576 by 576 point mesh was 34%. This is roughly half of the speed available once the overhead of the executor itself is accounted for.

Next, we present results from a somewhat less regular problem. The matrix for this problem is obtained in a rather involved manner discussed more fully in [10]. In brief, we begin with a matrix obtained from a 300 by 300 point mesh using a five-point template. A reduced system is obtained from this matrix by modifying the matrix in a way that increases the number of nonzeros per row and halves the number of rows. We observed that given appropriate clustering techniques, an extremely high degree of regularity is not essential for achieving the efficiencies possible in these types of problems.

The results we present in Table IV were from experiments conducted on the iPSC/1 [10]. On the iPSC/1, floating point operations are much more expensive than they are on our iPSC/2. Estimates performed on the iPSC/1 indicated that the overhead introduced by the executor was substantially smaller, and the parallel efficiencies were correspondingly larger. In this problem the best efficiency was 53%. The ability to aggregate work and control

the number of communication start-ups plays a central role in obtaining increased efficiency.

7. CONCLUSION

There exist many types of problems whose irregularity causes difficulties for distributed memory machines. Good methods for parallelizing and partitioning these types of problems require one to assign computations and data in rather arbitrary ways. Efficient implementations tend to involve considerable programming effort to get good performance, making system development unnecessarily time-consuming.

We have described a set of index set transformations that allow us to efficiently solve a variety of irregular problems that are start-time schedulable. These transformations are carried out by inspector and schedule executor procedures.

REFERENCES

1. Ahuja, S., Carriero, N., and Gelernter, D. Linda and friends. IEEE Comput. (Aug. 1986).

2. Allen, T., and Cybenko, G. Recursive binary partitions. Tech. Rep., Department of Computer Science, Tufts University, Oct. 1987.

3. Anderson, E. Solving sparse triangular linear systems on parallel computers. Report 794, UIUC, June 1988.

4. Baxter, D., Saltz, J., Schultz, M., Eisenstat, S., and Crowley, K. An experimental study of methods for parallel preconditioned Krylov methods. Proc. 1988 Hypercube Multiprocessor Conference, Pasadena, CA, Jan. 1988.

5. Callahan, D., and Kennedy, K. Compiling programs for distributed-memory multiprocessors. J. Supercomput. 2 (1988), 151-169.

6. Fox, G., Johnson, M., Lyzenga, G., Otto, S., Salmon, J., and Walker, D. Solving Problems on Concurrent Processors. Prentice-Hall, Englewood Cliffs, NJ, 1988.

7. Greenbaum, A. Solving sparse triangular linear systems using Fortran with parallel extensions on the NYU Ultracomputer prototype. Report 99, NYU Ultracomputer Note, Apr. 1986.

8. Mehrotra, P., and Van Rosendale, J. Compiling high level constructs to distributed memory architectures. Proc. Fourth Conference on Hypercube Concurrent Computers and Applications, Mar. 1989.

9. Koelbel, C., Mehrotra, P., and Van Rosendale, J. Supporting shared data structures on distributed memory architectures. PPoPP '90, Seattle, WA, Mar. 1990, to appear.

10. Mirchandaney, R., Saltz, J. H., Smith, R. M., Nicol, D. M., and Crowley, K. Principles of runtime support for parallel processors. Proc. 1988 ACM International Conference on Supercomputing, St. Malo, France, July 1988.

11. Nicol, D. M., and Saltz, J. H. Principles for problem aggregation and assignment in medium scale multiprocessors. Tech. Rep. 87-39, ICASE, July 1987. Submitted for publication.

12. Rogers, A., and Pingali, K. Process decomposition through locality of reference. Conference on Programming Language Design and Implementation, ACM SIGPLAN, June 1989.

13. Rosing, M., and Schnabel, R. An overview of DINO, a new language for numerical computation on distributed memory multiprocessors. Tech. Rep. CU-CS-385-88, University of Colorado, Boulder, 1988.


14. Saltz, J. Aggregation methods for solving sparse triangular systems on multiprocessors. SIAM J. Sci. Statist. Comput., to appear.

15. Scott, L. R., Boyle, J. M., and Bagheri, B. Distributed data structures for scientific computation. Proc. Hypercube Multiprocessors Conf., Knoxville, TN, Sept. 1986.

JOEL SALTZ graduated from Duke University with a Ph.D. and an M.D. in 1986. He was a staff scientist at ICASE in 1986 and an assistant professor at Yale from 1986 to 1989, and is currently a senior staff scientist at ICASE. Dr. Saltz continues his affiliation with Yale as a research scientist. His research interests include execution time optimizations in shared and distributed memory machines and in parallel and distributed sparse matrix and unstructured mesh algorithms.

Received February 15, 1989; revised September 21, 1989

KATHLEEN CROWLEY graduated from the University of Washington with an M.S. in 1985, and has spent the last 3 years in a research position in the Computer Science Department at Yale University. Her research interests include parallel programming environments, and execution time optimizations in shared memory and distributed machines.

RAVI MIRCHANDANEY graduated from the University of Massachusetts, Amherst, with a Ph.D. in 1987. He has held the position of Research Scientist at Yale University for the last 2 years. Dr. Mirchandaney is also a consultant at ICASE. His research interests include load balancing on distributed and parallel machines.

HARRY BERRYMAN graduated from Memphis State University with a B.S. in 1988. He spent 1 year as a researcher at Yale University and is now a staff scientist at ICASE. His research interests are in systems issues arising from parallel processing.