
SUPPLE: An efficient run-time support for non-uniform parallel loops

Salvatore Orlando a, Raffaele Perego b,*

a Dip. di Matematica Applicata ed Informatica, Università Ca' Foscari di Venezia, via Torino 155, 30173 Venezia Mestre, Italy
b CNUCE - C.N.R., Consiglio Nazionale delle Ricerche, via Santa Maria 36, 56126 Pisa, Italy

* Corresponding author. Tel.: 39 050 593111; fax: 39 050 904052; e-mail: [email protected].

Abstract

This paper presents SUPPLE (SUPport for Parallel Loop Execution), an innovative run-time support for the execution of parallel loops with regular stencil data references and non-uniform iteration costs. SUPPLE relies upon a static block data distribution to exploit locality, and combines static and dynamic policies for scheduling non-uniform iterations. It adopts, as far as possible, a static scheduling policy derived from the owner computes rule, and moves data and iterations among processors only if a load imbalance actually occurs. SUPPLE always tries to overlap communications with useful computations by reordering loop iterations and prefetching remote ones in the case of workload imbalance. The SUPPLE approach has been validated by many experimental results obtained by running a multi-dimensional flame simulation kernel on a 64-node Cray T3D. We have fed the benchmark code with several synthetic input data sets built on the basis of a load imbalance model. We have compared our results with those obtained with a CRAFT Fortran implementation of the benchmark. © 1999 Elsevier Science B.V. All rights reserved.

Keywords: Data parallelism; Parallel loop scheduling; Load balancing; Run-time supports; Compiler optimizations

1. Introduction

Data parallelism is the most common form of parallelism exploited to speed up scientific applications. In the last few years, high-level data parallel languages such as High Performance Fortran (HPF) [4], Vienna Fortran [25] and Fortran D [6] have received considerable interest because they facilitate the expression of data parallel computations by means of a simple programming abstraction. In particular HPF, whose definition is the fruit of several research proposals, has been recommended as a standard by a large forum of universities and industries.




Using HPF, programmers only have to provide a few directives to specify processor and data layouts, and the compiler translates the source program into an SPMD code with explicit interprocessor communications and synchronizations for the target multicomputer. Data parallelism can be expressed above all by using collective operations on arrays [22], e.g. collective Fortran 90 operators, or by means of parallel loop constructs, i.e. loops in which iterations are declared as independent and can be executed in parallel. In this paper we are interested in run-time supports for parallel loops with regular stencil data references. In particular, we consider non-uniform parallel loops, i.e. parallel loops in which the execution time of each iteration varies considerably and cannot be predicted statically.

The typical HPF run-time support for parallel loops exploits a static data layout of arrays onto the network of processing nodes, and a static scheduling of iterations which depends on the specific data layout. Arrays are distributed according to the directives supplied by programmers, and computations, i.e. the various loop iterations, are assigned to processors following a given rule which depends on the data layout (e.g. the owner computes rule). A BLOCK distribution is usually adopted to exploit data locality: the computations mapped on a given processor by the compiler mainly use the data block allocated to the corresponding local memory. Conversely, a CYCLIC distribution is usually adopted when load balancing issues are more important than locality exploitation. A combination of both distributions, where smaller array blocks are scattered over the processing nodes, can be adopted to find a trade-off between locality exploitation and load balancing. It is worth noting, however, that the choice of the best distribution is still up to programmers, and that the right choice depends on the features of the particular application. The general problem of finding an optimal data layout is in fact NP-complete [9].
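The index arithmetic behind these layouts is simple; the following C sketch (ours, not part of any HPF compiler) shows which processor owns global element i of an n-element array under the three classical distributions:

    /* Owner of global index i (0 <= i < n) on p processors. */
    int owner_block(int i, int n, int p) {
        int b = (n + p - 1) / p;        /* block size, rounded up */
        return i / b;
    }

    int owner_cyclic(int i, int p) {
        return i % p;                   /* elements dealt out round-robin */
    }

    int owner_block_cyclic(int i, int k, int p) {
        return (i / k) % p;             /* blocks of k elements dealt out cyclically */
    }

With BLOCK, neighboring elements share an owner, which favors stencil locality; with CYCLIC, consecutive elements land on different processors, which spreads non-uniform work.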

The adoption of a static policy to map data and computations reduces run-time overheads, because all mapping and scheduling decisions are taken at compile time. While it produces very efficient implementations for regular concurrent problems, the code produced for irregular problems (i.e. problems where some features cannot be predicted until run-time) may be characterized by poor performance. Much research has been conducted in the field of run-time supports and compilation methods to efficiently implement irregular concurrent problems, and this research is at the basis of the new proposal of the HPF Forum for HPF2 [5]. The techniques proposed are mainly based on run-time codes which collect information during the first phases of a computation, and then use this information to optimize the execution of the following phases of the same computation. An example of these techniques is the one adopted by the CHAOS support [20] to implement non-uniform parallel loops. The idea behind this feature of the CHAOS library is the run-time redistribution of arrays, and the consequent re-mapping of iterations. Redistribution is carried out synchronously between subsequent executions of parallel loops, and is decided on the basis of information (mainly, timing information) collected at run-time during a previous execution of the loop. If the original load distribution is not uniform, the data layout is thus modified in order to balance the processor loads. In Ref. [18] a dialect of HPF has been extended with new language constructs which are interfaced with the CHAOS library to support irregular computations.

In this paper we address non-uniform parallel loop implementations, introducing a truly innovative support called SUPPLE (SUPport for Parallel Loop Execution). SUPPLE is not a general support on which every HPF parallel loop can be compiled: it only targets non-uniform (as well as uniform) parallel loops with regular stencil data references.



Since stencil references are regular and known at compile time, optimizations such as message vectorization, coalescing and aggregation, as well as iteration reordering, can be carried out to reduce overheads and hide communication latencies [7]. SUPPLE is able to initiate loop computation using a statically chosen BLOCK distribution, thus starting iteration scheduling according to the owner computes rule. During the execution, however, if a load imbalance actually occurs, SUPPLE adopts a dynamic scheduling strategy that migrates iterations and data from overloaded to underloaded processors in order to increase processor utilization and improve performance. Decisions about migration are mainly local, in order not to introduce global communications/synchronizations. SUPPLE overlaps most of the overheads deriving from dynamic scheduling, i.e. messages to monitor loads and move data, with useful computations. The main features that allow us to overlap the otherwise large overheads are: (1) prefetching of remote loop iterations on underloaded processors, and (2) a completely asynchronous support that does not introduce barriers between consecutive iterations of loops. Below we describe our support in detail, and we show some results obtained by running the kernel of a flame simulation code on a Cray T3D. In this application, the characteristics of the input data set strongly influence the time needed to compute loop iterations and thus may cause workload imbalances. For this reason, we ran our tests on synthetic data sets, built on the basis of a simple load imbalance model. To compare SUPPLE with an HPF-style static approach, we used Cray CRAFT Fortran, whose language directives and compilers were partially adapted from Fortran-D [6] and Vienna Fortran [25]. Even if SUPPLE is presented here as a run-time support to implement non-uniform parallel loops, it can be profitably used [15] for a less narrow class of applications, i.e. to compile "uniform" parallel loops on heterogeneous and/or time-shared systems, where we can have a high load imbalance even if the specific application is regular.

The paper is organized as follows. Section 2 describes the flame simulation code we used as a benchmark to validate the SUPPLE approach. Section 3 presents the main features of SUPPLE and gives some details on its implementation. Assessments of the experimental results are reported and discussed in depth in Section 4, which also describes the load imbalance model used to build the synthetic data sets. Finally, Section 5 provides a summary of related work, and Section 6 draws the conclusions.

2. The benchmark

The benchmark we used to validate the SUPPLE approach is directly derived from a class of actual scientific codes used to carry out detailed flame simulations [21]. Flame simulations have attracted growing interest in the last few years: highly detailed simulations can in fact give scientists important information on fuel flammability limits and on the interactions between chemical and physical processes. One example of a code that performs a time-dependent multi-dimensional simulation of hydrocarbon flames is described in detail in Ref. [18], and is included in the HPF-2 draft [5] as one of its motivating applications. The code is split into two distinct computational phases executed at each time step. The first phase computes fluid convection over a structured mesh. The computation at each point has similar costs, and involves the point itself and the nearest neighboring elements of the simulation mesh. Stencil communications are thus required at each time step to carry out this phase on a distributed memory machine. The second phase simulates chemical reactions and the subsequent energy releases, represented by ordinary differential equations.



The solution at each grid point only depends on the value of the point itself, but its computational costs may vary significantly, since the chemical combustion proceeds at different rates across the simulation space. The reaction phase globally requires more than half of the total execution time, and most of this time is spent on a small fraction of the mesh points. Furthermore, the workload distribution evolves as the simulation progresses. The HPF-like code of a simplified two-dimensional flame modeling code is reported in Fig. 1.

Fig. 1. HPF-like code of a simplified two-dimensional flame modeling code.

The interesting characteristic of flame simulation codes is that, in order to efficiently exploit a highly parallel machine, two conflicting requirements have to be dealt with: the need to adopt a regular block partitioning to take advantage of data locality during the first convection phase, and, conversely, the need to balance the highly variable workloads during the reaction phase. This second point would entail a different distribution of the simulation grid on the processor local memories. Unfortunately, since the workload distribution evolves and depends on the specific features of the input data set, we cannot devise at compile time a distribution that is suitable for both phases of the simulation. The following are therefore possible solutions.
1. Adopting a CYCLIC distribution of the simulation grid, thus losing data locality in favor of a better balance of the processor workloads. This solution must be followed to obtain acceptable performance with HPF-like data-parallel languages, which currently do not support irregular applications [5]. A combination of BLOCK and CYCLIC distribution may also be used to find a trade-off between locality exploitation and load balancing. However, static optimizations like these may be unsuccessful when the workload distribution varies unpredictably at run time.
2. Using a BLOCK distribution for the first loop, and CYCLIC for the second. This solution would entail using an executable redistribution directive. Although several HPF implementations do not yet support this directive, we can simulate it by assigning one array distributed CYCLIC to another distributed BLOCK, and vice versa.
3. BLOCK partitioning the grid in the first convection phase, and then suitably redistributing data and iterations to balance the workloads before starting the reaction computation. This approach is different from the previous one, because data and computation reallocation is minimized, and is carried out at run time on the basis of information collected during previous phases of the computation. This implementation scheme is followed by Ponnusamy et al. [21,18]. They exploit the CHAOS run-time library [20], whose functionalities are used to build at run-time an irregular data redistribution plan resulting in a nearly balanced reaction computation. Experiments were conducted by the authors on up to 64 Intel iPSC/860 nodes [18]. Redistribution for achieving load balance takes about 15-20% of the total execution time, but results in execution times 4-10 times better than without load balancing actions. The authors do not compare the approach with the first, much simpler solutions.
4. BLOCK partitioning the grid to exploit data locality, and dynamically redistributing the expensive calculations of the reaction loop to balance the load. This is the approach discussed in this paper. Non-uniform loops are executed by adopting, as far as possible, a static scheduling policy, and by dynamically moving data and iterations among processors only if a load imbalance is actually detected. SUPPLE builds the balanced schedule from scratch at every loop execution: this is useful to efficiently implement highly adaptive application codes in which the workload varies at each iteration.
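To fix ideas, the overall structure of such a code can be sketched in plain C as below. The array roles follow the description given in Section 3 (the convection loop updates A reading a stencil on B and the local value of C; the reaction loop rewrites C in place); the kernel bodies are placeholders of ours, not the actual benchmark code:

    #define N 1024                      /* simulation grid of N x N points */
    double A[N][N], B[N][N], C[N][N];

    /* Placeholder kernels: convect() has a uniform cost, while the cost
       of react() depends on the value of the point itself. */
    extern double convect(double c, double n, double s, double w, double e);
    extern double react(double c);

    void time_step(void) {
        /* Convection phase: uniform cost, nearest-neighbor stencil;
           the boundary elements of the grid are not computed. */
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                A[i][j] = convect(C[i][j], B[i-1][j], B[i+1][j],
                                  B[i][j-1], B[i][j+1]);

        /* Reaction phase: pointwise and independent, non-uniform cost. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                C[i][j] = react(C[i][j]);
    }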

3. Overview of SUPPLE

SUPPLE is a run-time support for the efficient implementation of data parallel loops with regular stencil references [19], characterized by per-iteration computation times which may be non-uniform, thus generating a workload imbalance. SUPPLE only supports BLOCK array distributions, to exploit the locality deriving from stencil computations. Since stencil references are regular and known at compile time, several optimizations such as message vectorization, coalescing and aggregation [7] can be carried out to reduce communication overheads. These optimizations combine element messages directed to the same processor into one single message without redundant data. Another optimization technique implemented in SUPPLE is iteration reordering. The local loops assigned to every processor according to the BLOCK distribution and the owner computes rule are split into multiple localized loops: those accessing only local data, and those accessing remote data as well. The loops accessing only local data are scheduled between the asynchronous sends and the corresponding receives which are used to transmit remote data: in this way the execution of the iterations accessing only local data allows communication latencies to be hidden [7].

Besides uniform loops, SUPPLE also supports non-uniform loops through the adoption of an innovative dynamic load balancing technique. Dynamic scheduling is combined with the optimized static scheduling illustrated above. This hybrid (static + dynamic) strategy only applies to loop iterations that access local data. Loop iterations that need to fetch remote data are simply scheduled as in the static case. This limitation of the dynamic scheduling is not very strong, and does not affect the ability of our technique to balance the overall workload. Consider, in fact, that the number of iterations that access remote data is usually a small fraction of the total when data sets of interesting sizes are BLOCK distributed and stencil computations apply. One of the reasons for this limitation is to avoid introducing irregularities in the stencil data references, which would prevent the possibility of optimizing communications. The hybrid scheduling strategy works as follows: at the beginning iterations are scheduled statically according to the optimized ordering mentioned above, and they start being dynamically migrated only if a load imbalance is actually detected at run-time. The load balancing heuristic adopts local policies of imbalance detection to avoid global synchronizations, and prefetching of remote iterations to hide communication latencies.

This section presents SUPPLE in detail. In particular, we describe the optimized implementation of stencil communications, and both the static and hybrid (static + dynamic) scheduling strategies that can be adopted.



SUPPLE supports, however, different processor layouts, multi-dimensional arrays distributed in any dimension, and parallel loops with different stencil references. To this end, SUPPLE uses several descriptors to store information about the arrays and loops involved, and provides routines to fill the descriptor fields used at run-time by the implementation code.

Management of stencil references: SUPPLE provides the support to implement uniform loops by adopting the scheduling and communication optimizations that can be employed when data are accessed with regular stencils [7,19]. It allocates on each processing node enough memory to host the array partitions assigned according to the BLOCK distribution, plus a surrounding ghost (or overlap) area. The ghost area width depends on the shape of the stencil, and is used to buffer the perimeter areas of the adjacent partitions owned by neighboring processors, which are accessed through non-local references.
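A minimal sketch of this allocation scheme, assuming a two-dimensional block of bx x by elements and a stencil of maximum offset w (names and layout are ours, not SUPPLE's):

    #include <stdlib.h>

    /* Local array partition plus a ghost (overlap) area of width w on
       each side; w is the maximum offset appearing in the stencil. */
    typedef struct {
        int bx, by, w;
        double *buf;                /* (bx + 2w) * (by + 2w) elements */
    } block_t;

    block_t alloc_block(int bx, int by, int w) {
        block_t b = { bx, by, w, NULL };
        b.buf = calloc((size_t)(bx + 2*w) * (by + 2*w), sizeof(double));
        return b;
    }

    /* Element (i, j) in local coordinates; indices in -w .. bx+w-1
       (resp. -w .. by+w-1) reach into the ghost area. */
    double *elem(block_t *b, int i, int j) {
        return &b->buf[(size_t)(i + b->w) * (b->by + 2*b->w) + (j + b->w)];
    }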

In the final SPMD program which implements the loop, an equal fraction of the loop iterations is assigned to each processor according to the owner computes rule. Each processor executes a localized loop, whose boundaries and array references refer to the local array block. Before executing the loop, however, non-local data must be fetched from the nearest neighboring nodes. SUPPLE allows the communications required to implement stencil computations to be optimized, by avoiding sending several messages or replicated data to the same processor [7]. However, if all communications are scheduled before the localized loop execution, we do not hide communication latency. To overlap communications with useful computations, SUPPLE allows iterations to be reordered [7]. To this end, if one considers the iterations executed by a localized loop, one can distinguish those assigning the perimeter area from those assigning the remaining inner area of the array block. Loop iterations assigning the inner area only refer to local data, while the others, which also access the ghost area, need to wait for non-local data. Hence, we can reorder the loop iterations by executing the iterations that update the inner area between the asynchronous sending of messages and the corresponding receiving. Finally, in the SPMD code that implements the parallel loop we can distinguish four sequential steps:
1. ⟨pack and send perimeter areas to neighboring processors⟩
2. ⟨execute localized loops that update inner areas⟩
3. ⟨receive from neighboring processors and unpack ghost areas⟩
4. ⟨execute localized loops that update perimeter areas⟩
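In MPI terms the four steps can be sketched as follows; pack_perimeter(), unpack_ghost() and the two update routines stand for code that SUPPLE derives from its stencil descriptors, and the buffers and neighbor list are assumed to be set up elsewhere:

    #include <mpi.h>

    #define MAX_NB 8   /* at most 8 neighbors for a 2-D stencil with corners */

    extern void pack_perimeter(double *buf, int neighbor);
    extern void unpack_ghost(const double *buf, int neighbor);
    extern void update_inner_area(void);      /* step 2: local data only  */
    extern void update_perimeter_area(void);  /* step 4: needs ghost area */

    void stencil_loop(int nn, const int nb[], double *sbuf[], double *rbuf[],
                      const int len[]) {
        MPI_Request req[MAX_NB];
        for (int k = 0; k < nn; k++) {         /* step 1: send perimeters   */
            pack_perimeter(sbuf[k], nb[k]);
            MPI_Isend(sbuf[k], len[k], MPI_DOUBLE, nb[k], 0,
                      MPI_COMM_WORLD, &req[k]);
        }
        update_inner_area();                   /* step 2 overlaps the sends */
        for (int k = 0; k < nn; k++) {         /* step 3: receive ghosts    */
            MPI_Recv(rbuf[k], len[k], MPI_DOUBLE, nb[k], 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            unpack_ghost(rbuf[k], nb[k]);
        }
        update_perimeter_area();               /* step 4                    */
        MPI_Waitall(nn, req, MPI_STATUSES_IGNORE);
    }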

Steps 2 and 4 include the execution of the new parallel loops obtained by iteration reordering, while steps 1 and 3 correspond to the communications that implement the stencil data references. Note that, if a parallel loop accesses several arrays, steps 1 and 3 only exchange data for those arrays for which stencil references apply. While step 4 is straightforward, because it simply requires the execution of the iterations which modify data elements belonging to the perimeter area, step 2 may be implemented by exploiting either a static or a hybrid (static + dynamic) scheduling technique, where the right choice depends on whether the loop to be implemented is uniform or not. In order to support hybrid scheduling, the iterations that assign the elements lying in the inner area are statically grouped into chunks by tiling the iteration space. Tiling is necessary because SUPPLE migrates chunks of iterations on the basis of the workload distribution. Tiling is not strictly needed if static scheduling is exploited, but we adopted it in this case too because it has been proved that it may improve locality exploitation [11]. Another problem considered in SUPPLE is data coherence when hybrid scheduling is adopted.



In fact, some tiles of a BLOCK partition may be updated by processors other than the owner of the partition itself, and must be returned to the owner. To this end SUPPLE associates a full/empty-like flag [1] with each data tile. If a chunk that modifies a given tile is migrated, then the associated flag is set. The flag is unset when the updated data tile is received back from the remote processor. Flags are checked only when the corresponding tiles need to be read. This mechanism does not introduce large overheads, because flags are maintained for every tile and not for every array element. Moreover, the mechanism is useful to avoid introducing a global synchronization between the executions of distinct loops, where, for example, the first loop updates an array that is read by a following one. Note that if the first loop adopts hybrid scheduling, the following loop has to check the flags even if it is uniform and its chunks can thus be scheduled statically. Fig. 2 shows the SPMD pseudo-code that implements step 2 of the general technique described above.

Fig. 2. Pseudo-code of step 2 when static scheduling is adopted.

All the code is included in the routine Static_Scheduler( ), which statically schedules the chunks of iterations that access local data only. The queue Q contains this chunk sequence. Note the subroutine Wait_for_coherency( ), which guarantees the correct program semantics by waiting until the data tile associated with T becomes coherent, and the subroutine Execute(T), which is a simple localized loop that updates the associated data tile. The results discussed in Section 4 show that the routine Wait_for_coherency(T) does not actually block the computation because, when it is invoked, the corresponding coherence message has already been received. Note, however, that the while loop which waits for coherency can be completely removed if the data tiles of the arrays involved are not scheduled dynamically by other parallel loops. As far as the benchmark illustrated in Section 2 is concerned, we have used this static scheduling scheme to implement the uniform Convection loop (see Fig. 1). The width of the perimeter area is in this case equal to one, and a ghost area of the same width surrounds the blocks of array B. The resulting inner area is tiled, and the chunks are scheduled statically. Each chunk updates the array A and reads both arrays B and C. Before the execution of each chunk, according to the code in Fig. 2, we wait for the coherence of the corresponding tile of array C. Coherence is checked because the array C is written by the preceding non-uniform Reaction loop, which is implemented by adopting the hybrid scheduling scheme described below.
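Following this description of Fig. 2, step 2 under static scheduling might be rendered in C as below; the chunk, queue and flag types are ours, not SUPPLE's:

    typedef struct { int tile; /* plus iteration bounds of the chunk */ } chunk_t;
    typedef struct queue queue_t;

    extern chunk_t *extract(queue_t *q);    /* next chunk, NULL when empty   */
    extern volatile int tile_flag[];        /* full/empty-like flag per tile */
    extern void poll_messages(void);        /* handle pending coherence msgs */
    extern void execute(chunk_t *t);        /* localized loop over the tile  */

    void static_scheduler(queue_t *Q) {
        chunk_t *T;
        while ((T = extract(Q)) != NULL) {
            /* Wait_for_coherency(T): spin until the tile, possibly updated
               remotely by a previous dynamically scheduled loop, is back. */
            while (tile_flag[T->tile])
                poll_messages();
            execute(T);
        }
    }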

Non-uniform loop implementation: The innovative feature of SUPPLE is its support for non-uniform loops, which is attained by adopting a hybrid (static + dynamic) scheduling strategy to implement step 2 of the general technique described in the previous section. At the beginning, to reduce overheads, iterations are scheduled statically, by sequentially executing the chunks stored in the queue Q, hereinafter called the local queue. Once a processor realizes that its local queue is becoming empty, the dynamic part of our scheduling policy is initiated. Computations are migrated at run-time to balance the workload. Work migration requires both chunk identifiers and the corresponding data tiles to be transmitted over the interconnection network. If a loop accesses an array with a given stencil data reference, the tiles must be migrated together with a suitable surrounding area.



In order to overlap computation with communication, an aggressive prefetching policy is adopted to avoid processors becoming idle while waiting for further work. Migrated chunks are stored by each receiving processor in a queue RQ, called the remote queue. The size g of the chunks is fixed and is decided statically. This size gives only a lower bound on the amount of work migrated at run-time, because the support adopts a heuristic to decide at run-time how many chunks must be moved from an overloaded processor to an underloaded one. On the other hand, since SUPPLE uses a polling technique to probe message arrivals, g directly influences the time elapsed between two consecutive polls (see Section 4). The dynamic scheduling algorithm is fully distributed and asynchronous: chunk migration decisions are taken on the basis of local information only, and, since it is the receiver of remote chunks that asks overloaded processors for work migration, it can be classified as a receiver-initiated load balancing technique. According to the framework proposed by Willebeek-LeMair and Reeves [24], the load balancing technique can be characterized by considering the strategies exploited for processor load evaluation, load balancing profitability determination, task migration, and task selection.

Processor load evaluation: Each load balancing policy requires a reliable workload estimation to detect load imbalances and take, hopefully, correct decisions about workload migration. Our technique is based on the assumption that each processor can derive the expected, average execution cost of the chunks stored in its local queue Q. Since the first part of our technique is static, at the beginning each processor only executes chunks belonging to Q. During this phase, the average chunk cost is derived through a non-intrusive code instrumentation (the parallel system used, a Cray T3D, allows the clock cycle count to be read with a single, inexpensive machine instruction). Moreover, when chunks are migrated toward underloaded processors, an estimate of their cost is communicated as well. Each processor is thus aware of the average costs of the remote chunks stored in RQ. On the basis of the cost estimates of the chunks still stored in Q and RQ, each processor can therefore measure its own current load.
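The load estimate therefore reduces to a couple of multiply-adds; a sketch, with state fields of our own invention:

    /* Per-processor scheduling state (a sketch; the fields are ours). */
    typedef struct {
        int    P;                 /* number of processors                   */
        int    q_len, rq_len;     /* chunks left in Q and in RQ             */
        double avg_local_cost;    /* running mean of executed-chunk times   */
        double rq_cost_sum;       /* cost estimates received with RQ chunks */
        double threshold;         /* machine-dependent Threshold            */
        int    terminations_seen; /* termination messages received so far   */
    } sched_state_t;

    /* My_load( ): chunks still in Q times their measured average cost,
       plus the estimated cost of the chunks buffered in RQ. */
    double my_load(const sched_state_t *s) {
        return s->q_len * s->avg_local_cost + s->rq_cost_sum;
    }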

Load balancing profitability determination: This strategy is used to evaluate the potential speedup obtainable through load migration, weighted against the consequent overheads. In SUPPLE, each processor detects a possible load imbalance, and starts the dynamic part of the scheduling technique, on the basis of local information only. It compares the value of its current load with a machine-dependent Threshold. When the estimated local load becomes lower than Threshold, the processor begins to ask other processors for remote chunks. The same comparison of the current load with Threshold is used to decide whether a chunk migration request should be granted or not. A processor p_j, which receives a migration request from a processor p_i, will grant the request only if its current load is higher than Threshold. Note that the Threshold parameter is chosen high enough to prefetch remote chunks early, thus avoiding underloaded processors becoming idle while waiting for chunk migrations.

Task migration strategy: Sources and destinations for task migration must be determined. In our case, since the load balancing technique is receiver initiated, the source (sender) of a load migration is determined by the destination (receiver) of the chunks. The receiver selects the processor to be asked for further chunks by using a round-robin policy. This criterion was chosen for its simplicity, and also because tests showed its effectiveness. SUPPLE uses the same Threshold parameter mentioned above to reduce the overheads of the task migration strategy (overheads would derive from requests for remote chunks which cannot be served because the current load of the partner that has been asked is too low). Thus each processor, when its current load becomes lower than Threshold, broadcasts a so-called termination message. Our round-robin strategy can thus select a processor from those that have not yet communicated their termination. Termination messages are also needed to end the execution of a parallel loop. When a processor has received a termination message from all the other processors, and both its queues Q and RQ are empty, it can locally terminate the parallel loop, since no further chunks can be fetched either from its local queues or from the remote processors.
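The local termination test then needs no global synchronization; extending the sched_state_t sketch above:

    /* A processor may leave the parallel loop only when no further work
       can arrive: every other processor has broadcast its termination
       message, and both the local queue Q and the remote queue RQ are
       empty. */
    int terminated_and_empty(const sched_state_t *s) {
        return s->terminations_seen == s->P - 1
            && s->q_len == 0
            && s->rq_len == 0;
    }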

Task selection strategy: Source processors select the most suitable tasks for effectively balancing the load, and send them to the destinations that asked for load migration. In terms of our support, this means choosing the most appropriate number of chunks to be moved to grant a given migration request. We use a modified Factoring scheme to determine this number [8]. Factoring is a Self-Scheduling heuristic originally proposed to address the efficient implementation of parallel loops on shared-memory multiprocessors. It provides a way to determine the appropriate number of iterations that each processor must fetch, at each access, from a central queue storing the indexes of the unscheduled loop iterations. Clearly, the larger this number is, the lower the contention overheads for accessing the shared queue, and the higher the probability of introducing load imbalances when the fetched iterations are actually executed. Factoring requires that, if u is the number of remaining unscheduled iterations and P is the number of processors involved, the next P requests for new work are each granted a batch of u/(2P) iterations. Consequently, at the beginning large batches of iterations are scheduled, and the batch size is reduced as the shared queue empties. In our case, instead of having a single queue, we have multiple shared queues, one for each processor. When a loop is unbalanced, only some of these queues, i.e. those owned by overloaded processors, will be involved in granting remote requests for further work. The other difference regards our scheduling unit, which is a chunk of iterations instead of a single iteration. The modified heuristic we used is thus the following: an overloaded processor replies to a request for further work by sending k/(2P) chunks, where k is the number of chunks currently stored in Q, and P is the number of processors. Note that, in order to apply Factoring exactly, each processor should have global knowledge of all the chunks stored in the local queues of all the overloaded processors, and also of all the possibly underloaded processors that can ask for further work. If this knowledge were available and the original Factoring technique were applied, the number of chunks returned for each request would be larger than k/(2P). Unfortunately, this knowledge cannot be achieved without introducing global synchronizations, due to the asynchronous behavior of the processors involved. Moreover, given the multiple and concurrent requests from underloaded processors, if we moved more than k/(2P) chunks we would risk prefetching too many chunks toward underloaded processors.
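The grant-size rule is a one-liner; a sketch (the minimum of one chunk is our assumption, since an accepted request must be answered with a positive grant):

    /* Modified Factoring: a processor whose load is above Threshold
       answers a work request with k/(2P) of the k chunks still in its
       local queue Q (at least one, by our assumption). */
    int chunks_to_grant(int k, int P) {
        int n = k / (2 * P);
        return n > 0 ? n : 1;
    }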

3.1. SPMD code of the load balancer

Fig. 3 shows the SPMD pseudo-code that implements step 2 of the general technique above.

Fig. 3. Pseudo-code of step 2 when hybrid scheduling is adopted.

The code exploits the hybrid (static + dynamic) scheduling used by SUPPLE to implement non-uniform parallel loops. All the code is included in the routine Hybrid_Scheduler( ), which schedules the iteration chunks of the local queue Q either locally, according to a static sequential ordering, or remotely, according to dynamic scheduling decisions made only if a load imbalance occurs. The core part of the code is a while construct, which stops looping when the function Terminated&Empty_queues( ) returns a True value.



This function checks the termination condition: a processor that is executing this SPMD code terminates as soon as its queues Q and RQ are empty and it has received a termination message from all the other processors. The function My_load( ) estimates the current workload of a processor. It returns the number of chunks still stored in Q times the average local chunk execution cost, plus the cost estimated for any chunks stored in RQ. The average cost of the chunks in Q is determined by monitoring the execution of the local chunks, while the estimated costs of remote chunks are communicated by the sender processors along with the data tiles. At the beginning, when no chunk executions have been monitored yet, these costs are initialized with high values to avoid wrong chunk requests and migrations.

The prefetching policy adopted to reduce the overheads of chunk migration is driven by a machine-dependent Threshold parameter: a processor sends a request for remote chunks, by calling the subroutine Prefetch_Chunk( ), only if its current load is lower than Threshold. Likewise, the processor that has been asked grants the request only if its own load is greater than Threshold. Since the latencies of chunk migrations may be very large, we risk prefetching too much, because many requests might be sent before the first remote chunks arrive. To avoid this, we fix a limit on the number of requests sent by a processor and not yet granted: Prefetch_Chunk( ) sends a request message only if this limit has not yet been reached. Prefetch_Chunk( ) employs a simple local round-robin policy to select the next processor to be asked for a remote chunk: processor p_i, whose round-robin counter is initialized to p_{i+1} at the beginning, chooses the next partner from those which have not yet communicated their termination by means of a call to the subroutine Comm_termination( ). The function Extract( ) returns a chunk from a queue. The subroutine Execute( ) executes a chunk T and, if T is local, measures the time taken to complete it and updates the estimated average cost of local chunks accordingly. We adopt an asynchronous coherence protocol to migrate updated data from the processor that actually executed the chunk to the processor that owns the corresponding data partition. The updated data tiles are transmitted by the subroutine Send_Results_Coherence( ), which also implements message vectorization by buffering several data tiles to be sent to the same processor. The subroutine inserts data tiles into a buffer, and transmits the buffer when either it becomes full or termination is detected. A full/empty-like technique [1] is used by our coherence protocol to avoid processing invalid data. When processor p_i sends a chunk b to p_j, it sets a flag marking the data tile to be modified by p_j as invalid. The next time p_i needs to access the same data, for example during the execution of another loop accessing the same array, p_i checks the flag and, if the flag is still set, waits for the updated data tile from node p_j. We have already seen how this check is performed in the static scheduling code of Fig. 2. The same job is performed in the code of Fig. 3 by the function Extract( ) when applied to the local queue Q.




Such a control of data consistency does not entail stopping to wait for coherence messages at the end of the parallel loop execution: a new loop can be started, and useful computation can be overlapped with these communications. Finally, the subroutine Request_Handler( ) performs many of the above tasks. This subroutine, whose code is not shown in detail for the sake of simplicity, polls and handles all the messages that can arrive at a processor. For example:
· requests for work migration, which entail deciding about migration profitability and possibly extracting a batch of chunks from Q and sending it to the requesting processor;
· coherence messages transporting updated data tiles, which entail copying the received data into the local array partitions and unsetting the corresponding full/empty flags;
· batches of remote chunks sent by overloaded processors, which have to be inserted into RQ;
· termination messages, which modify the behavior of our task migration strategy and are needed to manage termination detection.
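Putting the pieces together, the body of Hybrid_Scheduler( ) can be sketched in C as follows. This is a sketch under the assumptions of the previous fragments; the helper names and the order in which Q and RQ are drained are ours:

    extern void request_handler(sched_state_t *s);    /* poll + serve msgs */
    extern void prefetch_chunk(sched_state_t *s);     /* bounded requests  */
    extern chunk_t *extract_q(sched_state_t *s);      /* waits on flags    */
    extern chunk_t *extract_rq(sched_state_t *s);
    extern void execute_timed(sched_state_t *s, chunk_t *t);
    extern int  is_remote(const sched_state_t *s, const chunk_t *t);
    extern void send_results_coherence(sched_state_t *s, chunk_t *t);

    void hybrid_scheduler(sched_state_t *s) {
        while (!terminated_and_empty(s)) {
            request_handler(s);           /* serve requests, coherence msgs,
                                             incoming chunks, terminations  */
            if (my_load(s) < s->threshold)
                prefetch_chunk(s);        /* round-robin ask, limited number
                                             of outstanding requests        */
            chunk_t *T = extract_q(s);    /* static part first              */
            if (T == NULL)
                T = extract_rq(s);        /* then prefetched remote chunks  */
            if (T != NULL) {
                execute_timed(s, T);      /* run chunk, update avg. cost    */
                if (is_remote(s, T))      /* send the updated tile back to  */
                    send_results_coherence(s, T);  /* its owner (buffered)  */
            }
        }
    }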

4. Experimental results

We have tested SUPPLE on a 64-node Cray T3D. The message passing layer used by SUPPLE is MPI, release CRI/EPCC 1.3, developed by the Edinburgh Parallel Computing Centre in collaboration with Cray Research Inc. The experiments concern a SUPPLE implementation of the benchmark illustrated in Section 2, fed with the set of synthetic data sets presented in Section 4.1. We have also compared the SUPPLE implementation with an HPF-style one. For the HPF implementation we used Cray CRAFT Fortran, whose language directives and compiler have been adapted in part from Rice University's Fortran-D project [6] and Vienna Fortran [25].

4.1. The synthetic input data sets

We have adopted a set of synthetic data sets because of the need to validate SUPPLE as a general support. Had we tested SUPPLE on a single problem instance, we might have come to wrong conclusions due to the specific features of the data set chosen for the test. According to the code skeleton shown in Fig. 1, the different data sets were built under the following assumptions:
· the simulation grid is 1024 × 1024 points. Due to the blocking data distribution, 128 × 128 contiguous points are assigned to each of the 64 processors of the target machine;
· the convection and reaction phases require 1/4 and 3/4 of the total execution time, respectively;
· μ is the average time needed to process a grid point during the reaction phase. Due to the previous assumption, the average per-point convection time is thus μ/3;
· each input data set is built according to a simple model of load imbalance [14], discussed in the following. Note that the imbalance features of the data set only matter for the behavior of the reaction phase, where the per-point execution time depends on the value of the point itself.

Model of load imbalance: Given μ, the average execution time, our model of load imbalance assumes that the workload is not distributed uniformly, and that there exists a region of the simulation grid, hereafter called the loaded region, in which the average per-point execution time is greater than μ. The load imbalance model is thus based on two parameters, d and t. The parameter d, 0 < d ≤ 1, is a fraction of the whole simulation grid, and determines the size of the loaded region. The parameter t, 0 < t ≤ 1, is a fraction of the total workload T, and determines the overall workload concentrated on the loaded region. It follows that t ≥ d.



Note that the imbalance is directly proportional to t for a given value of d, since larger values of t correspond to larger fractions of T concentrated on the loaded region. Similarly, the imbalance is inversely proportional to d for a given value of t. From these remarks, we can derive F, called the factor of imbalance, defined as F = t/d.
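From these definitions the per-point reaction costs follow directly. The short derivation below is ours, but it agrees, up to rounding in the data-set generator, with the entries of Table 1 in Section 4.2 (e.g., μ = 0.038 ms, d = 0.1 and F = 5 give t = 0.5, a loaded cost of about 0.19 ms and an unloaded cost of about 0.021 ms):

    t_{\mathrm{loaded}}   = \frac{t\,T}{d\,N}         = \mu\,\frac{t}{d} = \mu F,
    \qquad
    t_{\mathrm{unloaded}} = \frac{(1-t)\,T}{(1-d)\,N} = \mu\,\frac{1-t}{1-d},

where N is the number of grid points and T = μN is the total reaction workload.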

Sample data sets: Fig. 4 shows a representation of some of the different data sets we used to force an unbalanced workload during the reaction phase of our benchmark. We employed data sets characterized by values of F ranging from 1 to 9, where the size of the loaded region is kept fixed (d = 0.1). The grey levels used in Fig. 4 represent the workloads associated with the grid points, where a darker grey stands for a heavier workload. Note that the case F = 1 corresponds to a data set that does not introduce any workload imbalance, since the computational cost of each reaction point simulation is equal to μ. The figure also shows the blocking distribution of the data sets on a grid of 8 × 8 processors. Due to the position of the grey loaded region, processors 0, 1, 8, and 9 are overloaded, while processors 2, 10, 16, 17, and 18 are only partially loaded. We will refer to this figure to explain the overheads introduced by our support during the dynamic load balancing phase. All the results reported in this paper were obtained by keeping the position of the loaded region fixed during all the simulation iterations. We also carried out many tests in which the position of the loaded region is different, or is moved at each iteration to simulate the adaptivity of the code. The differences between the results of these experiments and those reported in the paper are not appreciable, because SUPPLE does not exploit any previous knowledge about which processors are overloaded or underloaded.

Fig. 4. Data layout of some data sets characterized by different F on an 8 × 8 processor grid.

4.2. The experimental parameters

The experiments were conducted on 64 processors for different values of F and μ, specifically for F ∈ {1, ..., 9} and μ ∈ {0.038 ms, 0.075 ms, 0.15 ms, 0.3 ms}. The external sequential loop of the flame simulation code was iterated for 10 time steps. The theoretical Optimal Completion Time (OCT) is thus 8.2, 16.4, 32.8, and 65.5 s for μ equal to 0.038 ms, 0.075 ms, 0.15 ms, and 0.3 ms, respectively. This time, which does not take into account any overheads, is computed by multiplying the average time required to process a single point of the simulation grid (μ + μ/3), times the number of points assigned to each processor (128 × 128), times the number of simulation time steps (10). Table 1 reports some examples of the imbalance introduced by our synthetic data sets. In particular, it shows the times needed to compute each point during the reaction phase, distinguishing between points belonging to the unloaded and the loaded region.

Table 1
Reaction phase: per-point execution time for some values of μ and F

μ (ms)    F    Loaded points (ms)    Unloaded points (ms)
0.038     2    0.075                 0.033
0.3       2    0.6                   0.267
0.038     5    0.187                 0.021
0.3       5    1.498                 0.167
0.038     9    0.337                 0.004
0.3       9    2.697                 0.033

Tuning the dynamic load balancing parameters:




The only parameters that can be modified and tuned in our support are the size g of a chunk of iterations, i.e. our scheduling unit, and the prefetching Threshold. The parameter g affects the scheduling overhead and the polling mechanism used to check for message arrivals. More specifically, a larger g reduces the scheduling overhead, since it reduces the number of scheduling units stored in each local queue. On the other hand, a larger g may reduce the ability of our support to effectively balance the workload, mainly because, due to the polling mechanism adopted, it slows down the response time of overloaded processors when they are asked for work by underloaded ones. A trade-off must be found, and it depends on both the architecture and the features of the problem. The tuning of the other parameter, Threshold, mainly depends on the specific architecture: it should be larger on architectures where communications are not very efficient. We found, however, that it is also related to the granularity of each chunk, which in turn depends on g but also on the features of the specific problem, i.e. on the parameters μ and F. From the remarks above it follows that the two parameters g and Threshold are highly interdependent. One strategy is to fix one and tune the other on the basis of the first setting. Hence, all the experiments were carried out by fixing g = 21, so that each chunk assigns the iterations which update a tile of 7 × 3 points of the simulation grid. The number of chunks scheduled by each parallel loop is thus 756 (with the perimeter of width one excluded, each 128 × 128 block has a 126 × 126 inner area of 15,876 points, which divided by the chunk size 21 gives 756 chunks).

We changed the granularity of each chunk by using data sets characterized by different μ. We observed that for larger μ, i.e. for chunks with a larger granularity, it is better to increase the value of Threshold. This is not surprising, since for large chunk granularities, overloaded processors become lazier in answering incoming requests, due to our polling implementation of input message checking. Thus, for larger μ, the prefetching of further work should begin earlier to avoid emptying both the local and the remote queues; to this end we have to increase the value of Threshold. Hence, we used values of Threshold ranging from 2 to 40 ms, where the larger Threshold values are used for the larger μ. We found, however, that the following heuristic results in only a small decrease in performance with respect to the case in which the correct value of Threshold is established before running the program:
· fix a chunk dimension g, which depends on the size of the problem. The number of chunks must be large enough to allow the load balancing algorithm to work properly: a small number would only allow a few very large batches of iterations to be moved, thus probably introducing wrong workload balances. Conversely, a very large number of chunks can increase the overhead introduced by our algorithm, although for some applications we may obtain cache performance benefits from scheduling loop iterations using small tiles [11];


· start the load balancing algorithm with a small value of Threshold, and increase its value at run-time if necessary. When an underloaded processor begins to receive remote chunks along with their expected execution times, on the basis of these values it can increase, if necessary, the current Threshold to take into account the possibly lazy behavior of remote overloaded processors. Both the initial value of Threshold and the amount of the increase depend on the specific architecture.

4.3. Result assessment

Fig. 5(a) shows several curves. Each curve is related to a distinct μ, and plots the difference between the completion time of the slowest processor and the theoretical OCT for various factors of load imbalance F. The difference shown in this and in the following plots can be considered as the sum of all the overheads due to communications, residual workload imbalance, and loop housekeeping. It therefore also includes the time to send/receive the messages implementing the stencil communications of the first convection loop. There is an overhead of almost 0.5 s even in the balanced case, and the overhead becomes larger when the input data set introduces a more extensive load imbalance. For F = 9 and μ = 0.15 ms, SUPPLE balances the load with a completion time which is very close to the OCT: the difference is nearly 1 s. Note that, in the absence of load balancing actions, the completion time is around 229 s, with a difference from the OCT of more than 196 s. If we look at the same results in terms of their percentage distance from the optimum time, then for μ = 0.038 ms this percentage ranges from 6% to 9.4%, while for μ = 0.3 ms it ranges between 0.88% and 1.69%.

Fig. 5. Differences from the OCT for various μ as a function of the factor of imbalance F: (a) SUPPLE results, (b) CRAFT Fortran results.

Execution profile: We have profiled the execution of some tests carried out with different values of F and μ. Fig. 6(a) shows, for all 64 processors, the time spent by each of them doing: (1) useful work on local or remote chunks, plus some small service computations requested by our load balancing algorithm; (2) sending, receiving, and checking for the arrival of messages. Fig. 4 helps to understand this graph, which is relative to the case F = 8 and μ = 0.15 ms, since it identifies the overloaded processors with respect to the input data set and the data layout.


Recall that the processors identified by 0, 1, 8, and 9 are overloaded, while processors 2, 10, 16, 17 and 18 are partially loaded. Going back to the graph of Fig. 6(a), one can note a larger send/receive/probe overhead for the processors above. The reason for this overhead is the large number of messages that they must dispatch to give away chunks, and to receive migration requests and coherence messages. Moreover, note that, due to these overheads, the processors that execute less useful work are the same processors that are overloaded at the beginning of the computation. The graph in Fig. 6(b) plots, for each processor, the expression (Chunks executed − Chunks assigned). Chunks assigned is a constant: it is the number of chunks assigned to the local queue Q of each processor during all the iterations of the reaction parallel loops. Considering that the simulation is iterated 10 times, this number is equal to 7560. Chunks executed is the interesting datum, and corresponds to the number of chunks (local + remote) that a processor actually executes. The expression is positive if a processor executes more chunks than those statically assigned: this is the case, as Fig. 6(b) shows, of the underloaded processors, which receive further remote chunks. The same expression is negative if a processor executes fewer chunks than the initial ones: this is the case of the overloaded processors, which give away a lot of chunks to underloaded processors.

Fig. 6. Case F = 8 and μ = 0.15 ms: (a) overheads due to send, receive, and probe MPI primitives, (b) chunks sent and received by each processor.

Fig. 7. Case F = 1 and μ = 0.15 ms: (a) overheads due to send, receive, and probe MPI primitives, (b) chunks sent and received by each processor.


Fig. 7(a) illustrates the moderate and uniform overheads introduced by SUPPLE in a balanced case, characterized by F = 1 and μ = 0.15 ms. Fig. 7(b) shows, for the same case but with a different scale, the expression (Chunks executed − Chunks assigned). We can note a sort of symmetry between the processors that export and import chunks. If we look at Fig. 4 (in this case, since F = 1, consider the workload to be uniformly distributed), we can discover that the processors that receive a few remote tasks are the same as those that are assigned blocks on the borders of the simulation grid, and that the processors on the corners are those that import the most chunks. This behavior is due to the first convection loop of our benchmark: this loop does not compute the elements on the boundaries of the simulation grid, and thus introduces a low load imbalance which is detected and managed by SUPPLE during the reaction phase of the computation.

Comparison with CRAFT: We implemented the same benchmark with CRAFT Fortran, feeding it with the same synthetic data sets. We tested several different data layouts. The best results for unbalanced problem instances were obtained by using a pure CYCLIC distribution. One of the reasons for these performance results is the computational weight of the reaction loop, which is three times that of the first convection loop. Thus, due to this feature of our benchmark, it is more important to balance the workload through a pure CYCLIC distribution than to increase locality exploitation with a BLOCK distribution. Fig. 5(b) shows the results attained by CRAFT Fortran when adopting a CYCLIC distribution. The differences from the OCT are larger than those obtained by SUPPLE, even for the completely balanced case (nearly half a second more than SUPPLE). These bad results for F = 1 are also due to the CYCLIC distribution, which should be turned into BLOCK to take advantage of data locality in the first convection loop. The results are worse for larger μ and F. For example, for F = 9, the CRAFT differences from the optimal times are 1.04, 1.22, 1.50, and 2.21 s for μ equal to 0.038, 0.075, 0.15, and 0.3 ms respectively, against the 0.77, 0.78, 0.90, and 1.11 s obtained on the same tests by SUPPLE. The CYCLIC distribution, besides losing data locality, is not able to perfectly balance the workload, so that, for larger μ, the workload differences between overloaded and underloaded processors increase. We also conducted other tests with CRAFT Fortran to evaluate combinations of BLOCK and CYCLIC distributions. By adopting a pure BLOCK distribution for the completely balanced case (F = 1) we obtained, as expected, the best CRAFT results. Also in this case, however, the SUPPLE results are better than the CRAFT ones. We believe that this is due to our very efficient implementation of stencil communications, the absence of barrier synchronizations, the small overheads introduced by our load balancing technique, and some migrations of chunks toward the processors holding the borders of the simulation grid (see Fig. 7(b)). Some of the results obtained by adopting a combination of BLOCK and CYCLIC distributions are shown in Fig. 8. All the curves, which refer to the case μ = 0.038 ms, plot the differences from the optimum time for different values of F. The scale of the ordinate axis is logarithmic, due to the very large differences from the optimal times when a BLOCK data distribution is adopted for the unbalanced cases.

Finally, we tested the redistribution of the simulation grid between consecutive parallel loop executions. CRAFT Fortran does not support run-time redistribution directives, so we used two different arrays, the former distributed BLOCK for use in the first convection loop, and the latter distributed CYCLIC for use in the second reaction phase. We assign one array to the other before the execution of each parallel loop.


The results, shown in Fig. 8(b), are worse than those obtained by using a pure CYCLIC distribution for both loops. This worse performance stems from the huge volume of data moved simultaneously between non-adjacent processors, which may cause network congestion problems. With regard to communication volumes, we measured the amount of data transferred during 10 time steps of the whole simulation for μ = 0.15 ms. SUPPLE transfers from 2.5 MB for F = 1 to 19.4 MB for F = 9. Independently of the imbalance factor, the communication volumes are 2.3, 320 and 159.8 MB for the BLOCK, CYCLIC and REDISTRIBUTION CRAFT versions of the benchmark, respectively. Note that for CRAFT CYCLIC most of the communications derive from the implementation of the first convection loop, while for CRAFT REDISTRIBUTION they derive from the assignment statements used to redistribute the arrays. The data volumes moved with a CYCLIC data layout are larger than those for REDISTRIBUTION. However, while in the CYCLIC implementation messages are exchanged only among nearest-neighbor processors, array redistribution requires each processor to communicate with all the others.

5. Related work

The parallel loop scheduling problem has been investigated in depth by researchers working on shared-memory multiprocessors. Most proposals address the efficient implementation of loops by defining dynamic Self-Scheduling policies which reduce synchronizations among processors by enlarging parallel task granularity. The main goal of these works is to determine the optimal size for the chunks fetched by each processor at each access to a shared queue which stores iteration indexes. Clearly, the larger the chunk size, the lower the contention overheads for accessing the shared queue, but the higher the probability of introducing load imbalances. Polychronopoulos and Kuck have proposed Guided Self-Scheduling, according to which u/P iterations, where u is the number of remaining unscheduled iterations and P is the number of processors involved, are fetched each time by an idle processor [17]. Trapezoid Self-Scheduling [23] has been proposed by Tzen and Ni to reduce the number of synchronizations by linearly decreasing the chunk size. Hummel, Schonberg and Flynn have presented Factoring [8], the policy also adopted in SUPPLE to implement the task selection strategy of our load balancer (see Section 3). A sketch of these chunk-size rules is given below.
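As a rough illustration, the chunk-size rules of the two policies most relevant here can be written as follows; this is our own sketch, not code taken from the cited papers.

/* R is the number of remaining unscheduled iterations, P the number
   of processors. */

/* Guided Self-Scheduling [17]: an idle processor fetches R/P
   iterations, so chunks shrink geometrically as the loop drains. */
static int gss_chunk(int R, int P) {
    int c = R / P;
    return c > 0 ? c : 1;
}

/* Factoring [8]: iterations are handed out in batches; each batch
   allocates half of the remaining work, split into P equal chunks. */
static int factoring_chunk(int R, int P) {
    int c = R / (2 * P);
    return c > 0 ? c : 1;
}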

Fig. 8. CRAFT Fortran differences from the OCT: (a) for different data layouts with μ = 0.038 ms, (b) for different μ with array redistribution.


The introduction of large caches on shared-memory multiprocessors makes the schemes above unsuitable, because they do not guarantee the exploitation of locality. Markatos and LeBlanc [13] have investigated locality and data reuse to obtain scalable and efficient implementations of non-uniform parallel loops. They have explored a scheduling strategy, based on a static partitioning of iterations, which initially assigns iterations according to their affinity with previously assigned ones, where affinity refers to the presence of the accessed data in the processor caches. The dynamic part of the technique, which performs run-time load balancing, is postponed until a load imbalance occurs. Liu and Saletore have worked on Self-Scheduling techniques for distributed-memory machines [12]. They have designed hierarchical and distributed implementations of a shared queue manager, and have investigated partial data replication as a way to solve the data locality problem. Plata and Rivera [16] present another centralized solution for distributed-memory systems. Differently from their proposal, the SUPPLE scheduling policy is fully distributed and does not introduce bottlenecks which may jeopardize efficiency when many processors are used.

Willebeek-LeMair and Reeves [24] have presented several general load balancing strategies for multicomputers. They have introduced the framework that is used in Section 3 to characterize our load balancer. Another interesting work presenting general load balancing techniques is the one by Kumar et al. [10]. They introduce a technique that adopts a global round-robin policy to select the processing node from which further work is requested. The technique does not assume any knowledge of the load, and thus may not be very accurate in its scheduling decisions, but it does not waste any time evaluating the best load balancing choices. Kumar et al. have shown that the technique is actually very scalable, provided that an efficient contention-free implementation of the global round-robin is adopted.

On the other hand, SUPPLE adopts a local round-robin technique to select the source of a task migration (Task Migration strategy), where the round-robin counter of each processor i is initialized to i + 1 at the beginning. This solution avoids contention problems, but it would be less accurate if the underloaded processors that select the partners to ask for further work had no knowledge of those partners' loads. To prevent underloaded processors from asking other underloaded ones for work, asynchronous termination signals are therefore broadcast by those processors whose local chunk queues are becoming empty: these signals state that the sender has made the transition from the overloaded state to the underloaded one. A sketch of this partner selection is given below.
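A minimal sketch of this selection rule follows; the names and the fixed bound on the machine size are ours, not taken from the SUPPLE sources.

#include <stdbool.h>

#define MAX_PROCS 64             /* e.g. the 64-node T3D used in our tests */

/* Per-processor load balancer state (hypothetical names). */
typedef struct {
    int  me;                     /* this processor's identifier          */
    int  P;                      /* total number of processors           */
    int  rr;                     /* round-robin counter, init me + 1     */
    bool underloaded[MAX_PROCS]; /* peers that broadcast queue emptiness */
} balancer_t;

/* Select the next partner to ask for spare chunks, skipping peers that
   already signalled the overloaded-to-underloaded transition; returns
   -1 when no overloaded partner is left. */
int next_partner(balancer_t *b) {
    for (int tried = 0; tried < b->P; tried++) {
        int v = b->rr % b->P;
        b->rr = v + 1;
        if (v != b->me && !b->underloaded[v])
            return v;
    }
    return -1;
}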


A lot of research has been carried out on run-time supports and compilation methods for distributed-memory multiprocessors. The Fortran D project at Rice and Syracuse universities [6], the Vienna Fortran project at the University of Vienna [25], the Fx Fortran compiler developed at Carnegie Mellon [3], and the CHAOS project at the University of Maryland [20] are only some of the most important projects addressing these topics. A number of compile-time optimizations have been designed to increase the performance obtained on distributed-memory multiprocessors [25,6,7]. Problems whose irregularities prevent compile-time optimizations are generally addressed by exploiting inspector–executor codes, which collect information at run-time and then use this information to optimize the subsequent computations [2]. The CHAOS/PARTI library [20], developed by the group headed by Joel Saltz at the University of Maryland and adopted in the Vienna Fortran Compilation System and in other HPF-like compilers, is the most notable example of this approach. It provides support for irregular array distribution, loop scheduling based upon the almost owner computes rule, run-time data redistribution, and data accessed through indirection arrays, thus offering general support for irregular/adaptive codes whose behavior cannot be anticipated until run-time. Non-uniform parallel loops can be implemented with the CHAOS library by collecting information about iteration execution times at run-time, and by consequently building an irregular data layout resulting in balanced processor workloads [21,18]. Data partitioning and redistribution have to be repeated several times if the computational costs associated with the data items involved vary as execution progresses.
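To make the inspector–executor idea concrete, here is a deliberately simplified sketch written in the spirit of, but not copied from, CHAOS/PARTI; every name is hypothetical.

#include <stdlib.h>

/* Communication schedule for a loop such as: for i: y[i] += x[idx[i]].
   The inspector records which iterations touch off-processor data. */
typedef struct {
    int  n;       /* number of local iterations                */
    int *remote;  /* remote[i] != 0 iff x[idx[i]] is not local */
} schedule_t;

/* Inspector: executed once before the loop; [lo, hi) is the locally
   owned index range of x. */
schedule_t *inspect(const int *idx, int n, int lo, int hi) {
    schedule_t *s = malloc(sizeof *s);
    s->n = n;
    s->remote = malloc(n * sizeof *s->remote);
    for (int i = 0; i < n; i++)
        s->remote[i] = (idx[i] < lo || idx[i] >= hi);
    return s;
}

/* The executor gathers the flagged elements into a local buffer once
   per step and then runs the loop without further address checks; the
   schedule is reused on every time step until idx changes. */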

6. Conclusions

We have described SUPPLE, an innovative run-time support which can be used to compile either uniform or non-uniform parallel loops with regular stencil data references. SUPPLE exploits general BLOCK distributions of the involved arrays to favor the exploitation of data locality, and uses the static knowledge of the data layout and of the regular stencil references to perform well-known optimizations such as message vectorization, coalescing and aggregation. Iteration reordering is exploited to hide the latencies of the communications needed to fetch remote data. SUPPLE supports two different loop scheduling strategies. Parallel loops whose iterations have uniform costs can be scheduled according to a pure static strategy based on the owner computes rule. A hybrid scheduling strategy is used instead if the load assigned to the processors according to the owner computes rule is unbalanced. In this case loop iterations are initially scheduled statically and, if a load imbalance is actually detected, an efficient dynamic scheduling, which requires data tiles and iteration indexes to be migrated, is started; the resulting structure is sketched below. We have shown that SUPPLE hides most of the dynamic scheduling overheads produced by the need to monitor the load and to move data, by overlapping them with useful computations. To this end, it exploits an aggressive prefetching technique that tries to prevent underloaded processors from becoming idle while waiting for workload migration requests to be satisfied.
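The overall per-processor control flow just summarized could be sketched as follows; the helper routines are invented for illustration and do not reproduce the SUPPLE API.

typedef struct worker worker_t;  /* opaque per-processor state (hypothetical) */
typedef struct chunk  chunk_t;

/* Helpers assumed to be provided by the run-time (all hypothetical). */
int      have_local_chunk(worker_t *w);
chunk_t *next_local_chunk(worker_t *w);
void     execute_chunk(worker_t *w, chunk_t *c);
int      terminated(worker_t *w);
int      next_partner_id(worker_t *w);                  /* local round-robin */
int      request_chunk(worker_t *w, int from, chunk_t **c);

void parallel_loop(worker_t *w) {
    /* Static phase: owner computes over the locally owned block;
       iterations touching remote stencil data are reordered last so
       that their communications overlap with local computation. */
    while (have_local_chunk(w))
        execute_chunk(w, next_local_chunk(w));

    /* Dynamic phase: entered only when the local queue drains while
       other processors are still overloaded. */
    while (!terminated(w)) {
        int from = next_partner_id(w);
        chunk_t *c;
        if (from < 0)
            break;                       /* everyone is underloaded     */
        if (request_chunk(w, from, &c))  /* chunk and data tile migrate */
            execute_chunk(w, c);
    }
}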

Since data may be migrated and updated remotely as a consequence of dynamic scheduling decisions, while the same data must be kept coherent on the original owner, SUPPLE adopts a fully asynchronous coherence protocol that allows useful computations to be executed while waiting for data to become coherent. To the best of our knowledge, SUPPLE is the first distributed-memory run-time support that exploits this kind of dynamic technique to compile data parallel languages. All the previous approaches to the compilation of such loops are more static and synchronous. A typical example is CHAOS, which collects performance information during previous executions of a loop and prepares an irregular data redistribution plan to obtain a more balanced execution. This solution requires synchronizations between distinct loop executions and, in addition, does not work when a non-uniform loop must be executed only once. Moreover, SUPPLE can be profitably used for a broader class of applications [15], i.e. to compile "uniform" parallel loops on heterogeneous and/or time-shared systems, where the load imbalance is due to the target architecture rather than to the specific application. The results of our experiments conducted on a Cray T3D machine have shown that the performances obtained are very close to the optimal ones. To validate SUPPLE as a general support for non-uniform loops we employed a simplified flame simulation application and different synthetic data sets built on the basis of a simple load imbalance model. These data sets, fed as input to the flame simulation code, introduce different workload imbalances. In most of the experiments we obtained completion times which differ by less than 1 s from the optimal completion times.


For data sets characterized by an average iteration execution time of 0.3 ms, the overheads due to communications and the residual load imbalance range between 0.88% and 1.69% of the optimal times. In addition, we have compared the SUPPLE results with those obtained by running HPF-like implementations of the same benchmark. In all the cases tested, SUPPLE obtained better performances.

References

[1] R. Alverson, The Tera computer system, in: Proceedings of the ACM ICS, 1990, pp. 1–6.
[2] R. Das, M. Uysal, J. Saltz, Y.-S. Hwang, Communication optimizations for irregular scientific computations on distributed memory architectures, JPDC 22 (3) (1994) 462–479.
[3] T. Gross, D. O'Hallaron, J. Subhlok, Task parallelism in a high performance Fortran framework, IEEE Parallel and Distributed Technology 2 (2) (1994) 16–26.
[4] High Performance Fortran Forum, HPF Language Specification, Ver. 1.1.
[5] High Performance Fortran Forum, HPF-2 Scope of Activities and Motivating Applications, Ver. 0.8.
[6] S. Hiranandani, K. Kennedy, C. Tseng, Compiling Fortran D for MIMD distributed-memory machines, CACM 35 (8) (1992) 67–80.
[7] S. Hiranandani, K. Kennedy, C. Tseng, Evaluating compiler optimizations for Fortran D, JPDC 21 (1) (1994) 27–45.
[8] S.F. Hummel, E. Schonberg, L.E. Flynn, Factoring: a method for scheduling parallel loops, CACM 35 (8) (1992) 90–101.
[9] C.W. Kessler, Pattern-driven automatic program transformation and parallelization, in: Proceedings of the 3rd EUROMICRO Workshop on Parallel and Distributed Processing, 1995, pp. 76–83.
[10] V. Kumar, A.Y. Grama, N. Rao Vempaty, Scalable load balancing techniques for parallel computers, JPDC 22 (1994) 60–79.
[11] M.S. Lam, E.E. Rothberg, M.E. Wolf, The cache performance and optimizations of blocked algorithms, in: Proceedings of ASPLOS IV, Santa Clara, CA, April 1991, pp. 63–74.
[12] J. Liu, V.A. Saletore, Self-Scheduling on distributed-memory machines, in: Proceedings of Supercomputing '93, 1993, pp. 814–823.
[13] E.P. Markatos, T.J. LeBlanc, Using processor affinity in loop scheduling on shared-memory multiprocessors, IEEE TPDS 5 (4) (1994) 379–400.
[14] S. Orlando, R. Perego, A template for non-uniform parallel loops based on dynamic scheduling and prefetching techniques, in: Proceedings of the ACM ICS, 1996, pp. 117–124.
[15] S. Orlando, R. Perego, Scheduling data-parallel computations on heterogeneous and time-shared environments, in: Proceedings of Euro-Par '98, LNCS 1470 (1998) 356–366.
[16] O. Plata, F.F. Rivera, Combining static and dynamic scheduling on distributed-memory multiprocessors, in: Proceedings of the ACM ICS, 1994, pp. 186–195.
[17] C. Polychronopoulos, D.J. Kuck, Guided self-scheduling: a practical scheduling scheme for parallel supercomputers, IEEE TOC 36 (12) (1987).
[18] R. Ponnusamy, J. Saltz, A. Choudhary, Y.-S. Hwang, G. Fox, Runtime support and compilation methods for user-specified irregular data distributions, IEEE TPDS 6 (8) (1995) 815–831.
[19] D.A. Reed, L.M. Adams, M.L. Patrick, Stencils and problem partitionings: their influence on the performance of multiple processor systems, IEEE TOC 36 (7) (1987) 845–858.
[20] J. Saltz, Runtime and language support for compiling adaptive irregular programs on distributed memory machines, Software Practice and Experience 25 (6) (1995) 597–621.
[21] J. Saltz et al., Runtime support and dynamic load balancing strategies for structured adaptive applications, in: Proceedings of the 1995 SIAM Conference on Parallel Processing for Scientific Computing, February 1995.
[22] J.M. Sipelstein, G.E. Blelloch, Collection-oriented languages, Proceedings of the IEEE 79 (4) (1991) 504–523.
[23] T.H. Tzen, L.M. Ni, Dynamic loop scheduling on shared-memory multiprocessors, in: Proceedings of the ICPP, vol. II, 1991, pp. 247–250.
[24] M.H. Willebeek-LeMair, A.P. Reeves, Strategies for dynamic load balancing on highly parallel computers, IEEE TPDS 4 (9) (1993) 979–993.
[25] H.S. Zima, B.M. Chapman, Compiling for distributed-memory systems, Proceedings of the IEEE, 1993, pp. 264–287.


Salvatore Orlando received a Laurea degree cum laude and a Ph.D. degree in Computer Science from the University of Pisa in 1985 and 1991, respectively. He is currently an Assistant Professor at the Dept. of Applied Math. and Computer Science of the Ca' Foscari University of Venice. His research interests include parallel languages, parallel computational models, optimizing and parallelizing tools, high-performance compiling techniques, parallel algorithm design and implementation.

Raffaele Perego received his Laurea degree in Computer Science from the University of Pisa in 1985. He is currently a Researcher at CNUCE, an institute of the Italian National Research Council (CNR). His research interests include parallel languages, tools and programming models, run-time supports for data-parallelism, scheduling and load balancing, high-performance computing, parallel algorithm design and implementation.
