
Microprocessing and Microprogramming 33 (1991/92) 173-189, North-Holland

Occam implementation of process-to-processor mapping on the Hathi-2 transputer system*

Hong Shen

Department of Computer Science, Åbo Akademi University, Lemminkäisenkatu 14, SF-20520 Turku, Finland

* This work was supported by the FINSOFT III Research Program.

Abstract

Shen, H., Occam implementation of process-to-processor mapping on the Hathi-2 transputer system, Microprocessing and Microprogramming 33 (1991/92) 173-189.

This paper presents a polynomial-time Occam program for automatically mapping parallel programs onto multiprocessor systems. Based on the heuristic strategy of self-adjusting mapping, our program consists of grouping, placement, routing and self-adjusting procedures. Grouping groups the user-defined processes in a parallel program into target tasks with possible load-balancing. Placement places the target tasks onto the processors in a transputer network. Routing produces edge-disjoint physical communication paths for the logical communication requirements among the placed tasks in the network. Self-adjusting adjusts first the placement scheme when the routing fails, and then the grouping scheme when the necessary adjustment of the placement is unable to make the routing succeed. These four procedures work co-operatively until a successful process-to-processor mapping is finally achieved after a series of progressive self-adjustments. For the problem of mapping n processes in an arbitrary task graph onto m processors in a transputer network configured into a torus, the program needs time O(max{n², m⁵}) in the worst case on one processor under full adjusting. The time is reduced to O(max{n², m⁴}) if the adjusting heuristic is degraded into semi-adjusting, and to O(max{n², m²}) when the adjusting heuristic is completely eliminated; the latter result holds only for transputer networks providing message routing and multiplexing. We demonstrate the implementation result and performance evaluation of the program on the Hathi-2 transputer system. The implementation shows that for both regular and irregular task graphs the program works very well and produces satisfactory results.

Keywords. Process-to-processor mapping; Occam program; transputer network; heuristic algorithm; task; processor; graph; grouping; placement; routing; self-adjusting.

1. Introduction

The process-to-processor mapping problem is the problem of allocating the processes and logical communication channels in a parallel program onto the processors and physical communication links in a parallel computer such that the program can be executed on the computer as efficiently as possible. The mapping problem has been regarded as a fundamental problem of great significance in parallel processing. In the case that the parallel programs are written in Occam [8] and the parallel computers are transputer-based networks [7], the knowledge of the detailed network configuration required for manual mapping has become a big obstacle to the programmer. Manual mapping brings not only great inconvenience to the programmer but also much inefficiency to program implementation and resource waste to the system. A mapping algorithm can automatically map parallel programs onto a transputer network and thus hide the network configuration from the programmer and solve the above problems.



Though great effort has been made by various researchers to solve the mapping problem during the past decade, the problem remains unsolved in the general case today. The mapping problem is known to be equivalent to the graph isomorphism problem, for which no polynomial-time solution in the general case has been found [5]. A variety of heuristic approaches to the mapping problem have appeared in the literature [4, 5, 9, 10, 12]. Most of them are based on local search [1] or simulated annealing [6] heuristics, which usually prevent those algorithms from being implemented efficiently in practice because of the huge data migrations implied by the heuristics. Aiming at a heuristic mapping algorithm that can be easily and efficiently realized in practice, we have proposed a new approach to the mapping problem, self-adjusting mapping [11], based on some easily implementable heuristic strategies obtained from extensive studies of the mapping problem.

In this paper, in the course of building a mapping tool for the Hathi-2 transputer system and demonstrating an application of Occam programming, we describe how to implement our self-adjusting mapping algorithm in practice and how to construct an Occam program for process-to-processor mapping. Moreover, we show the implementation results and the performance of our mapping program on the Hathi-2 system.

2. The self-adjusting mapping approach

2.1 Task graph and processor graph

A parallel program and a parallel computer can be represented by a task graph Gt(T, Et) and a processor graph Gp(P, Ep) respectively [5, 12], where a task is a set of user-defined processes (originally a single process) of the program. For simplicity and without loss of generality, we assume that both Gt and Gp are undirected and without self-loops. In Gt, the node set T and the edge set Et respectively represent tasks and communication channels between the tasks, while the node weight at node ti, denoted wi, and the edge weight between adjacent nodes ti and tj, denoted eij, respectively represent the known or estimated computation amount of ti and the communication amount between ti and tj. We can form different tasks and change the structure of Gt by grouping processes under different strategies. In Gp, the node set P and the edge set Ep respectively represent processors and physical communication links between the processors (we assume that all processors have the same computational power). Figure 1 gives an example of a Gt and a Gp.

2.2 The Hathi-2 transputer system

Hathi-2 is a general-purpose MIMD multiprocessor system developed by the Department of Computer Science at Åbo Akademi University and the Technical Research Centre of Finland in Oulu (VTT/TKO) [2, 3]. The system consists of 25 identical boards, 24 of which are connected in a 4 by 6 torus configuration as depicted in Fig. 2 (left), while the remaining one is used as a separate partition. Each board contains four 32-bit Inmos T800 transputers, one 16-bit Inmos T212 control transputer and one Inmos C004 crossbar switch [8].


Fig. 1. Task graph and processor graph: (left) task graph Gt, (right) processor graph Gp.


Fig. 2. Configuration of the Hathi-2 system: (left) intra-board connection, (right) inter-board connection.

The C004 switch realizes both inter-board connection, by statically connecting the board to its four neighbour boards in the torus, and intra-board connection, by dynamically connecting the four on-board T800 transputers, as shown in Fig. 2 (right). By changing the intra-board connection, the system can easily be reconfigured. The T212 control processors are connected to each other in a ring, where each T212 is also connected to the C004 switch on the board where it resides, thus forming a separate control system which controls the setting of the switches. The Hathi-2 system provides a multiuser environment by partitioning the system into several independent subsystems, each of which can be used by one user via a host (host computer system). The host provides both I/O to the multiprocessor and interaction with the user. The user's program is edited, compiled and linked on the host and executed on the transputer network.

2.3 Self-adjusting mapping

For the problem of mapping parallel programs onto transputer-based multiprocessor systems like Hathi-2, some properties of transputer networks may help to simplify the problem. Besides regular topologies, another common property of current transputer networks is that a user is usually allocated only a single I/O port (host). This naturally requires that the task containing the maximum amount of I/O be mapped onto the I/O port in the network, in order to reduce the communication delay caused by transmitting I/O messages.

The basic idea of self-adjusting mapping for mapping parallel programs onto transputer networks is the following:

The mapping can be realized by co-operating grouping, placement and routing under a self-adjusting strategy [11]. Grouping groups the tasks in the task graph into some target tasks under the criterion of load-balancing, which is necessary for a task-to-processor placement and a successful routing of physical communication paths. Placement places the tasks in the task graph onto the processors in the processor graph under the criterion of neighbour first, so as to minimize the total length of all paths among the placed tasks w.r.t. the task graph. Routing constructs edge-disjoint physical communication paths among the placed tasks in the processor graph according to the logical communication requirements. If routing fails, self-adjusting adjusts first the placement scheme by exchanging the placement of some pairs of tasks, and then the grouping scheme by merging some tasks when all necessary placement adjustments are unable to lead to a routing success, until a successful routing has been achieved [11]. Here we require all routed physical paths to be edge-disjoint so as to avoid the need for message routing and multiplexing. However, this makes our mapping more complex than mappings without this requirement, because efficient path-disjoint routing is itself an unsolved problem in general [13].


The sketch of our self-adjusting mapping is presented as follows, where Pf and Nf stand for the failure path set and the failure node set respectively, pi (1 ≤ i ≤ τ) and fj (1 ≤ j ≤ ζ) for the failure paths in Pf and the failure nodes in Nf respectively, Pi for the set of all paths passing through or reaching failure node fi, and n0 for the number of target tasks in Gt after grouping [11]:

1. Do initialization, routed = FALSE.
2. Do load-balanced grouping in Gt to form n0 ≤ m target tasks, where each task has a degree not greater than the degree of Gp and all tasks are cost-balanced.
3. Do neighbour-first placement to place the tasks in Gt onto processors in Gp, where neighbouring tasks in Gt are placed onto neighbouring processors in Gp whenever possible, so as to keep the total length of the shortest physical communication paths among the placed tasks in Gp as small as possible.
4. Do path-disjoint routing for the placed tasks in Gp. If routing is successful then routed = TRUE; otherwise output the failure path set Pf and the failure node set Nf.
5. If routed = FALSE then do the following self-adjusting until routed = TRUE (a compact sketch of this control loop is given after the list):
   (a) Do path-terminal exchanging until there is no terminal-exchangeable path left in Pf:
       i. If n0 < m, for p1, p2, ..., pτ in Pf, check whether pi is terminal-exchangeable with unoccupied nodes in Gp, by comparing the routing results before and after terminal exchanging, and do the exchanging if it is, 1 ≤ i ≤ τ.
       ii. For f1, f2, ..., fζ in Nf, check whether there is any pair of terminal-exchangeable paths in Pi, by comparing the routing results before and after terminal exchanging, and do the exchanging if there is, 1 ≤ i ≤ ζ.
   (b) For 1 ≤ i ≤ n0 − 1, do path-terminal merging as long as the number of failure paths in Pf after re-grouping does not increase by more than a constant:
       i. Re-group the n_{i−1} tasks in Gt into n_i (n_i ≤ n_{i−1} − 1) tasks by merging some path terminals.
       ii. Place the target tasks of the re-grouped Gt onto the processors in Gp.
       iii. Route the paths among the placed tasks and produce the new Pf and Nf.
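The control loop above can be rendered compactly as follows. This is an illustrative Python sketch, not the Occam program itself: the four procedures are passed in as callables, and the convention that route returns the flag routed together with Pf and Nf is an assumption made for the illustration.

```python
# Sketch of the self-adjusting mapping loop (assumption-laden illustration,
# not the author's Occam code).  group, place, route and adjust stand for the
# procedures described in Sections 4-7.

def self_adjusting_mapping(task_graph, proc_graph, group, place, route, adjust):
    groups = group(task_graph, proc_graph)          # step 2: load-balanced grouping
    placement = place(groups, proc_graph)           # step 3: neighbour-first placement
    routed, Pf, Nf = route(placement, proc_graph)   # step 4: path-disjoint routing
    while not routed:                               # step 5: self-adjusting
        groups, placement = adjust(groups, placement, Pf, Nf, proc_graph)
        routed, Pf, Nf = route(placement, proc_graph)
    return groups, placement
```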

3. The outline of the Occam program

Based on the self-adjusting mapping approach, we now show how to develop an Occam program that accomplishes the process-to-processor mapping, and how the program works when we implement it on the Hathi-2 system. Without loss of generality, we take a torus of size (number of nodes) m = m.height x m.width, the configuration of Hathi-2, as the processor graph Gp. The number of total edges (links) of each node in Gp is link.tal and the number of parallel edges between two adjacent nodes is link.par. The size of the task graph Gt, which is a direct representation of a user-defined arbitrary parallel program, is n.

3.1 The data structures

The main data structure we use in our program is an n x (3n+4) array, M, namely the map matrix, which keeps the necessary data on which the program works during execution. In array M, M[i][0] to M[i][3n+3] keep the data on task i, 0 ≤ i < n, as described below. M[i][0] and M[i][1] are the computation weight and communication weight [11] respectively. M[i][2] is the degree. M[i][2j+3] and M[i][2j+4] keep the number of logical communication channels and the communication cost between tasks i and j (if i and j are not adjacent, M[i][2j+3] = M[i][2j+4] = 0), 0 ≤ j < n. M[i][2n+3] to M[i][3n+1] are used as the neighbour list, keeping the indices of the neighbour tasks (at most n−1) of task i. M[i][3n+2] = −1 indicates that task i has not been placed onto a processor, and 0 that it has been placed. M[i][3n+3] = −1 indicates that task i is active, and 0 that it is inactive (deleted, i.e. merged into another task).

Other data structures used in our program are described below.


Array C of m.width x m.height and array T of n x 3 represent the relations between tasks and processors after placement, where the task with index C[x][y] (T[i][0]) is placed onto the processor with coordinates (x,y) ((T[i][1], T[i][2])). Array P of k x 5 keeps the path information (logical connection requirements) derived from the task graph, where k is the number of paths; P[i][0] to P[i][3] are the coordinates of the end-nodes of path i (path i: (P[i][0],P[i][1]) ↔ (P[i][2],P[i][3])), and P[i][4] = 0 means that path i has been successfully routed (physically connected), −1 that it has not been routed (a failure path). Array FR of m x 7 keeps the necessary routing information for all nodes in the processor graph, where FR[i][0] to FR[i][3] keep the indices of the latest four successfully routed paths via node i, FR[i][4] and FR[i][5] keep the indices of the latest two failure paths via node i, and FR[i][6] keeps the total number of failure paths via node i.
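For illustration, the column layout of the map matrix M can be captured in a few small Python helpers. This is a sketch only; the actual program works on fixed-bound Occam arrays, and the helper names below are hypothetical.

```python
# Minimal sketch of the map matrix M described above (hypothetical helpers).
# Row i of M holds the data of task i in the column layout given in the text.

def make_map_matrix(n):
    """Create an n x (3n+4) map matrix with every task unplaced and active."""
    M = [[0] * (3 * n + 4) for _ in range(n)]
    for row in M:
        row[3 * n + 2] = -1   # -1: task not yet placed onto a processor
        row[3 * n + 3] = -1   # -1: task is active (not merged into another task)
    return M

def set_channel(M, n, i, j, channels, cost):
    """Record 'channels' logical channels of total 'cost' between tasks i and j."""
    M[i][2 * j + 3], M[i][2 * j + 4] = channels, cost
    M[j][2 * i + 3], M[j][2 * i + 4] = channels, cost

def neighbours(M, n, i):
    """Return the neighbour list of task i; M[i][2] holds its degree."""
    return M[i][2 * n + 3 : 2 * n + 3 + M[i][2]]

# e.g. for n = 4 the matrix has 4 rows of 3*4+4 = 16 entries each
M = make_map_matrix(4)
assert len(M) == 4 and len(M[0]) == 16
```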

3.2 The program outline

Our Occam program for process-to-processor mapping is organized as one procedure, samap, containing the four subprocedures group, place, route and adjust. The outline of the program is as follows:

PROC samap(...)
  {*Data declaration*}
  PROC group(...)
  PROC place(...)
  PROC route(...)
  PROC adjust(...)
  SEQ
    [ Initialization and pre-calculation ]  {*routed = FALSE*}
    group(...)
    place(...)
    route(...)
    WHILE NOT routed
      adjust(...)
    {*Output the grouping scheme, placement scheme and routing layout*}
    write.text.line(scn, 'The process-to-processor mapping has been successfully completed')

The explanation of procedure samap is as follows. Since dynamically bounded arrays are not allowed in Occam, we have to fix the bounds of arrays M, C, T and FR.

Without loss of generality, we can do so by assigning each array a large enough upper bound. Block 'Initialization' does the following jobs:

Provide an interactive environment for the user to input data, including the size of the task graph (n) and of the processor graph (m.width, m.height), the number of total links and the number of parallel links of each processor (link.tal, link.par), the computation weight, communication weight and degree of each task (M[i][0] to M[i][2]), and the index, number of communication channels and communication cost of each adjacent task j of each task i (M[i][2n+3+j], M[i][2j+3] and M[i][2j+4]).

Calculate the summed cost of all tasks in the task graph. The total cost of one task is calculated by the following formula:

ω_tot = ω_comp + 0.5 · ω_comm.    (1)


This is because, for the communication time in a transputer network, we take into account only the communication link set-up time; message transfer is performed simultaneously with computation [7].
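As a concrete illustration of formula (1) (the helper name task_cost is an assumption, not part of the program):

```python
# Formula (1): total cost = computation weight + 0.5 * communication weight.

def task_cost(w_comp, w_comm):
    return w_comp + 0.5 * w_comm

# e.g. a task with computation weight 8 and communication weight 6 has total cost 11.0
assert task_cost(8, 6) == 11.0
```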

Sort the indices of the adjacent tasks of each task in cost non-decreasing order for later task grouping.

The four procedures of samap, group, place, route and adjust, are introduced separately in the following sections.

4. Procedure group

Procedure group groups tasks in Gt such that, after grouping, the number and maximum degree of the target tasks are not greater than the number and maximum degree of the processors in Gp respectively, and the (computation and communication) load of the target tasks is well balanced.

This procedure contains one procedure, merge, and three blocks, 'Grouping on number of nodes', 'Grouping on node-degree' and 'Data structure updating', as follows:

PROC group(...)  {*Group tasks in the task graph*}
  PROC merge(..., INT t1, t2)
  SEQ
    [ Grouping on number of nodes ]
    [ Grouping on node-degree ]
    [ Data structure updating ]

4.1 Procedure merge

Assume that t1' = min{t1, t2} and t2' = max{t1, t2}. Procedure merge merges task t2' into t1'. The procedure does the following jobs (a sketch of this operation in code is given after the list):
1. Add the weights of t2' to those of t1' and delete t2'.
2. If t1' and t2' are adjacent, delete t2' from the neighbour list of t1' (M[t1'][2n+3] to M[t1'][3n+1]).
3. Connect all neighbours of t2' to t1'.
4. Update the neighbour list of each of the new neighbours of t1' so that the list remains in cost non-decreasing order.
5. Update the connection data, number of channels and communication weight of t1' and its new neighbours.
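An illustrative Python rendering of the merge step is given below. It works on a simple adjacency-map representation (dictionaries instead of the map matrix M); the field layout and names are assumptions made for the illustration only.

```python
# Sketch of merging one task into another on a dictionary-based task graph.
# tasks maps task index -> [w_comp, w_comm];
# edges maps task index -> {neighbour index: communication cost} (symmetric).

def merge(tasks, edges, t1, t2):
    keep, gone = min(t1, t2), max(t1, t2)
    # 1. add the weights of the merged task to the surviving one, delete it
    tasks[keep][0] += tasks[gone][0]
    tasks[keep][1] += tasks[gone][1]
    del tasks[gone]
    # 2. drop the (now internal) edge between the two tasks, if any
    edges[keep].pop(gone, None)
    # 3./5. re-attach every neighbour of the merged task to the surviving task,
    #        accumulating the communication cost of parallel channels
    for nb, cost in edges.pop(gone).items():
        if nb == keep:
            continue
        edges[keep][nb] = edges[keep].get(nb, 0) + cost
        edges[nb].pop(gone, None)
        edges[nb][keep] = edges[nb].get(keep, 0) + cost
    # 4. (the real program keeps the neighbour lists sorted by cost; with a
    #     dictionary the sort is done only when a list is actually needed)
    return keep
```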

4.2 Grouping on number of nodes

Block 'Grouping on number of nodes' groups the tasks in Gt, for n > m, such that the number of target tasks is not greater than, and as close as possible to, the number of processors in Gp. The block does the following jobs (a rough sketch in code is given after the list):
1. Assuming that the n tasks in Gt have been grouped into m tasks, calculate the average weight of the m tasks (w.ave). Since the n tasks in Gt will finally be grouped into n0 (n0 ≤ m) tasks, the average weight of the target tasks in Gt after grouping is obviously not smaller than w.ave.
2. For the n tasks in Gt, search the tasks one by one and merge those whose summed weight is not greater than w.ave, until either no available task can be merged any more or n − m tasks have already been merged into other tasks.
3. If the number of target tasks in Gt after the above grouping is still greater than m, repeatedly choose the two tasks with least weight in Gt and merge them until the number of target tasks is not greater than m.
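A rough Python rendering of this block is sketched below. It is an assumption-laden illustration, not the Occam code: it works on a plain dictionary of task weights, ignores the adjacency structure of Gt for brevity, and represents a group simply as a set of task indices.

```python
# Sketch of 'Grouping on number of nodes' (illustrative simplification).

def group_on_number(weights, m):
    """weights: {task index: total cost}; returns a list of groups (sets of tasks)."""
    groups = {t: {t} for t in weights}
    w = dict(weights)
    w_ave = sum(w.values()) / m        # average weight if the n tasks formed exactly m tasks
    # step 2: merge pairs of tasks while their combined weight stays within w_ave
    merged = True
    while merged and len(groups) > m:
        merged = False
        light = sorted(w, key=w.get)   # task indices ordered by current weight
        for a, b in zip(light, light[1:]):
            if w[a] + w[b] <= w_ave:
                w[a] += w.pop(b)
                groups[a] |= groups.pop(b)
                merged = True
                break
    # step 3: if more than m groups remain, merge the two lightest repeatedly
    while len(groups) > m:
        a, b = sorted(w, key=w.get)[:2]
        w[a] += w.pop(b)
        groups[a] |= groups.pop(b)
    return list(groups.values())

# e.g. group_on_number({0: 5, 1: 1, 2: 1, 3: 2, 4: 6}, 3) -> [{0}, {1, 2, 3}, {4}]
```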


4.3 Grouping on node-degree

Block 'Grouping on node-degree' groups tasks, if there is any task whose degree is greater than the degree of Gp, so as to guarantee that the degree of Gt is not greater than that of Gp after grouping.

The block checks the degrees of all tasks in Gt. For each task with a degree greater than the degree of Gp, it repeatedly merges the first two neighbours with minimum weight in the neighbour list of the task until the degree of the task is not greater than the degree of Gp. The above procedure is continued until the degrees of all tasks in Gt are not greater than the degree of Gp.

4.4 Data structure updating

Block 'Data structure updating' updates the data structure, array M, of Gt after grouping. It re-indexes all active tasks in M with contiguous indices and deletes all inactive tasks that have been merged into other tasks. It also updates the connection data and neighbour list of each active task according to the new indices of all active tasks after re-indexing.

5. Procedure place

Procedure place places the tasks in Gt onto the processors in Gp in a neighbour-first manner, starting from placing task 0 (the I/O task) onto processor (0,0) (the host), such that the total length of the shortest physical paths in Gp w.r.t. the logical connection requirements in Gt is kept as small as possible. The following is the sketch of the procedure:

PROC place(...)  {*Place tasks in Gt onto processors in Gp*}
  SEQ
    Initialize arrays C, T and P.
    Place task 0 onto processor (0,0).
    For each placed task ti, assuming ti is placed onto processor (xi,yi), do the following until all tasks are placed:
      For each of the unplaced neighbours of task ti, if there is an unoccupied neighbour of (xi,yi) in Gp, do
        [ Neighbour to neighbour placement ]
        [ Shared neighbour to shared neighbour placement ]
      If there is any unplaced neighbour of task ti left, do
        [ Neighbour to nearest processor placement ]
        [ Shared neighbour to shared neighbour placement ]

5.1 Neighbour to neighbour placement

For task t placed onto processor (x,y), block 'Neighbour to neighbour placement' places the first unplaced neighbour in the neighbour list of t onto an unoccupied neighbour of (x,y). Two m.width x m.height local arrays, V and E, are used in this block: V keeps the information on whether node (i,j) in Gp has been visited (V[i][j] = TRUE) or not (V[i][j] = FALSE) during each phase of the search for an unoccupied processor, and E is used as a FIFO queue keeping the traces of unvisited occupied nodes during each phase of the search. The block works in the following way:


Check the neighbours of processor (x,y) in its four directions, 'down', 'left', 'up' and 'right', and place the unplaced neighbour of task t onto the first unoccupied neighbour of processor (x,y) that is met. For all occupied neighbours of (x,y) found during the search, keep those that are unvisited (according to the information provided by array V) in array E as traces for the later neighbour to nearest processor search.
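For illustration, the neighbour inspection on the torus can be written as the following small helper. It is a hypothetical function, not part of the program, and the (row, column) coordinate convention is an assumption.

```python
# Four neighbours of a processor on an m.height x m.width torus, in the search
# order used by the placement blocks: 'down', 'left', 'up', 'right'.
# Wrap-around models the torus edges.

def torus_neighbours(x, y, height, width):
    return [((x + 1) % height, y),            # down
            (x, (y - 1) % width),             # left
            ((x - 1) % height, y),            # up
            (x, (y + 1) % width)]             # right

# e.g. on a 4 x 6 torus the neighbours of (0, 0) are (1,0), (0,5), (3,0), (0,1)
assert torus_neighbours(0, 0, 4, 6) == [(1, 0), (0, 5), (3, 0), (0, 1)]
```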

5.2 Shared neighbour to shared neighbour placement

We call task ts a shared neighbour of two neighbours of task t, tb and tb', if ts connects both tb and tb' in Gt, where tb' is said to be a brother of tb and vice versa. The same definition also applies to processors in Gp.

Assume that tb and tb' are two neighbours of placed task t in Gt, and that they are placed onto processors (xb,yb) and (xb',yb') respectively. If there are both an unplaced shared neighbour, ts, of tb and tb' in Gt, and an unoccupied shared neighbour, (xs,ys), of (xb,yb) and (xb',yb') in Gp, as shown in Fig. 3, block 'Shared neighbour to shared neighbour placement' will place ts onto (xs,ys).

There are eight possible schemes for shared neighbour to shared neighbour placement w.r.t. the positions of tb and tb', as depicted in Fig. 3.

The block checks the two possible brothers of (xb,yb) w.r.t. the position of (xb,yb) in Gp. If a brother of (xb,yb), (xb',yb'), is occupied by a brother of tb, tb', where tb and tb' have an unplaced shared neighbour, ts, and (xb,yb) and (xb',yb') have an unoccupied shared neighbour, (xs,ys), simply place task ts onto processor (xs,ys). The block continues until there is either no unplaced shared neighbour of any pair of placed neighbours of t, or no unoccupied shared neighbour of the relevant pair of occupied processors (xb,yb) and (xb',yb').

5.3 Neighbour to nearest processor placement

Assume that task t has been placed onto processor (x,y). If the number of unplaced neighbours of t is greater than the number of unoccupied neighbours of (x,y), there will certainly be some unplaced neighbours of t that cannot be placed onto the unoccupied neighbours of processor (x,y) after the previous blocks have run. Block 'Neighbour to nearest processor placement' accomplishes the placement of these 'extra' neighbours of task t. The strategy for placing them is nearest processor placement, i.e. they should be placed onto processors as near to processor (x,y) as possible, so as to keep the total length of the shortest physical paths from them to (x,y) as small as possible. The block functions as follows:


Fig. 3. Schemes for shared neighbour to shared neighbour placement.


For each of the remaining unplaced neighbours of t (placed onto processor (x,y)) after neighbour to neighbour placement, repeatedly take an element from the front of E and check its neighbours in the four directions until an unoccupied processor has been found, and then place the unplaced task onto the unoccupied processor. Keep all unvisited occupied processors met during each phase of the search in E. Since E is a FIFO queue and processors are kept in E in a neighbour-first manner, the above procedure will clearly find an unoccupied processor that is as close as possible to (x,y), as described in Fig. 4, where the search proceeds in ascending order of the numbers on the edges with arrowheads.
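The search of Fig. 4 amounts to a breadth-first exploration of the torus. A Python sketch is given below (an illustration only, reusing the torus_neighbours helper from the sketch in Section 5.1; collections.deque plays the role of E and a visited set the role of V).

```python
# Sketch of the nearest unoccupied processor search (Fig. 4).
from collections import deque

def nearest_unoccupied(x, y, occupied, height, width):
    """Return the unoccupied processor closest to (x, y), or None if all are occupied."""
    visited = {(x, y)}                 # plays the role of array V
    queue = deque([(x, y)])            # plays the role of the FIFO queue E
    while queue:
        cx, cy = queue.popleft()       # take an element from the front of E
        for nx, ny in torus_neighbours(cx, cy, height, width):
            if (nx, ny) in visited:
                continue
            visited.add((nx, ny))
            if (nx, ny) not in occupied:
                return (nx, ny)        # first unoccupied processor met
            queue.append((nx, ny))     # keep the trace of an occupied node
    return None
```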

6. Procedure route

Procedure route constructs the edge-disjoint physical communication paths [13] for the logical communication requirements among the placed tasks in Gp. The procedure is sketched as follows:

PROC route(...)
  {*Data declaration*}
  PROC rout.order(...)
  PROC rout.find(..., INT x1, y1, x2, y2)
  PROC rout.collect(..., INT x1, y1, x2, y2)
  SEQ
    [ Initialization and pre-calculation ]
    routed := TRUE
    rout.order(...)
    For each pair of end-nodes to be connected, (x1i, y1i) ↔ (x2i, y2i), 0 ≤ i < k, do
      rout.find(..., x1i, y1i, x2i, y2i)
      rout.collect(..., x1i, y1i, x2i, y2i)
    IF
      routed
        [ Output the grouping and placement schemes and routing layout ]
      TRUE
        SKIP

In the procedure, subprocedure rout.order decides the routing order for all input end-node pairs to be connected, subprocedure rout.find finds a concrete path for each end-node pair, and subprocedure rout.collect collects the physical paths found by rout.find.


Fig. 4. Search for a nearest unoccupied processor to (x,y).



Procedure route is almost the same as the path-disjoint routing algorithm in a multigrid [13, 14], except that the processor graph here is a torus, so path seeking can cross the left, right, up and down borders, and the calculation of the shortest path union must also be modified to fit the torus structure. For simplicity, bend-weight consideration [14] has not yet been taken into account in the procedure here.

In procedure route, procedure rout.collect also organizes array FR in addition to collecting the routed paths: FR[i][0] to FR[i][3] keep the latest four successful paths via processor i, FR[i][4] and FR[i][5] keep the latest two failure paths at processor i, and FR[i][6] keeps the number of failure paths at processor i, 0 ≤ i < m.
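Because the actual path-disjoint routing algorithm is that of [13, 14], only a much simplified stand-in is sketched here as an illustration: each requested path is routed by plain breadth-first search on the torus, and the links it uses are removed from a shared pool so that all accepted paths are edge-disjoint. Failure paths are reported, as in the real procedure, for the adjusting step. The function name route_all and the return convention are assumptions.

```python
# Simplified edge-disjoint routing sketch (not the algorithm of [13, 14]).
from collections import deque

def route_all(pairs, height, width):
    """pairs: list of ((x1, y1), (x2, y2)) end-node pairs.  Returns (routed, failures)."""
    used = set()                                   # links already occupied by accepted paths

    def free_link(a, b):
        return (a, b) not in used and (b, a) not in used

    def bfs(src, dst):
        prev = {src: None}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            if node == dst:
                break
            x, y = node
            for nxt in [((x + 1) % height, y), ((x - 1) % height, y),
                        (x, (y + 1) % width), (x, (y - 1) % width)]:
                if nxt not in prev and free_link(node, nxt):
                    prev[nxt] = node
                    queue.append(nxt)
        if dst not in prev:
            return None
        path, node = [], dst
        while node is not None:
            path.append(node)
            node = prev[node]
        return path[::-1]

    failures = []
    for i, (src, dst) in enumerate(pairs):
        path = bfs(src, dst)
        if path is None:
            failures.append(i)                     # this pair could not be routed disjointly
        else:
            used.update(zip(path, path[1:]))       # occupy the links of the accepted path
        # (the real rout.collect would also update arrays P and FR here)
    return len(failures) == 0, failures
```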

7. Procedure adjust

Procedure adjust adjusts the task placement and grouping schemes when the routing fails. The procedure works by first adjusting the task placement scheme for the terminal-exchangeable paths [11] by path-terminal exchanging, and then the grouping scheme by path-terminal merging (task re-grouping). The procedure contains one procedure, exchange, and two blocks, 'Path-terminal exchanging' and 'Path-terminal merging', where procedure exchange exchanges the placement of two tasks in Gp. Procedure adjust is sketched as follows:

PROC adjust(...)  {*Adjust the placement and grouping schemes*}
  PROC exchange(..., INT x1, y1, x2, y2)
  SEQ
    [ Path-terminal exchanging ]  {*Adjust task placement*}
    [ Path-terminal merging ]     {*Adjust task grouping*}

7.1 Path-terminal exchanging

Block 'Path-terminal exchanging' adjusts the placement scheme. Below is the sketch of the block:

IF
  n0 < m
    {*There is an unoccupied processor in Gp*}
    SEQ
      swap := TRUE
      WHILE (NOT routed) AND swap
        SEQ
          [ Data duplicating ]
          swap := FALSE
          i := 0
          WHILE (i < k) AND (NOT swap)
            SEQ
              IF
                P[i][4] = -1
                  [ Exchanging with unoccupied processors ]
                  {*If any terminal exchanging has taken place, swap := TRUE*}
                TRUE
                  SKIP
              i := i + 1
  TRUE
    SKIP
adjust.place := TRUE
WHILE (NOT routed) AND adjust.place
  SEQ
    [ Data duplicating ]
    swap := FALSE
    i := 0
    WHILE (NOT routed) AND (i < m) AND (NOT swap)
      SEQ
        IF
          FR[i][6] > 0
            [ Exchanging with occupied processors ]
            {*If any terminal exchanging has taken place, swap := TRUE*}
          TRUE
            SKIP
        i := i + 1
    IF
      (i = m) AND (NOT swap)
        adjust.place := FALSE
      TRUE
        SKIP

Block 'Data duplicating' duplicates arrays T, FR, C and P into TE, FE, CE and PE respectively, so as to keep the data as they were before the adjusting took place, for later data restoration if the adjusting is not accepted.

For the end-nodes of each failure path in P, as well as their neighbours in Gp, block 'Exchanging with unoccupied processors' checks all unoccupied processors and selects up to c1 (a constant, here 4) unoccupied processors with minimum distance to each of these end-nodes. For each of these end-nodes, it then checks whether exchanging its placement with one of the selected unoccupied processors reduces the number of failure paths, by comparing the routing results before and after the exchange. If it does, the exchange is accepted; otherwise it is rejected and the block continues with the other end-nodes and unoccupied processors.

Block 'Exchanging with occupied processors' checks the failure path(s) stored in FR[i][r] (4 ≤ r ≤ 5) against the paths in FR[i][r−1], ..., FR[i][0], for 0 ≤ i < m. If there are two intersecting paths, it checks whether exchanging a pair of their end-nodes reduces the number of failure paths in P; the exchange is accepted if it does and rejected otherwise.



Fig. 5. Path-terminal exchanging for two intersecting paths: (left) p and q, (middle) exchanging for h + h' ≤ l + l', (right) exchanging for l + l' < h + h'.

The path-terminal exchanging for two intersecting paths, p and q, proceeds so as to minimize the total length of the two paths after the exchange, as described in Fig. 5.
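The accept/reject test used by both exchanging blocks can be illustrated with the following sketch. It is assumed Python reusing the simplified route_all of the routing sketch in Section 6, and it shows only the case of moving one terminal to an unoccupied processor; the real program compares the duplicated and adjusted data structures instead of whole path lists.

```python
# Tentative path-terminal exchange with acceptance test (illustration only).

def try_terminal_exchange(pairs, i, new_terminal, height, width):
    """Tentatively move end-node 0 of path i to new_terminal; keep the move
    only if it reduces the number of failure paths."""
    _, failures_before = route_all(pairs, height, width)
    trial = list(pairs)
    trial[i] = (new_terminal, pairs[i][1])          # exchange one terminal
    _, failures_after = route_all(trial, height, width)
    if len(failures_after) < len(failures_before):
        return trial, True                          # accept the exchange
    return pairs, False                             # reject and restore the old placement
```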

7.2 Path-terminal merging

Block 'Path-terminal merging' adjusts the grouping scheme and reduces the number of target tasks. Let n0 ≤ min{n,m} be the number of target tasks placed in Gp after grouping and placement. The block has the following sketch:

[ Data duplicating ]
n* := n0
adjust.group := FALSE
WHILE (NOT routed) AND (NOT adjust.group)
  SEQ
    n* := n* - 1
    group(...)  {*Merge some tasks in Gt to form at most n* target tasks*}
    place(...)
    route(...)
    IF the number of failure paths in P is not greater than that in PE plus a constant (c2), accept the re-grouping; otherwise reject the re-grouping.

The block checks the result of re-grouping the tasks in Gt into n* tasks by merging some tasks (path end-nodes), for n* = n0 − 1, n0 − 2, ..., 1. The re-grouping scheme is accepted if, after re-grouping, the number of failure paths in P does not increase by more than c2 (a constant, here fixed to 4) w.r.t. that before re-grouping, and rejected otherwise. Clearly, since the number of tasks decreases after each re-grouping, a successful routing will eventually become possible. In the extreme case, all tasks may be grouped into a single task and placed onto one processor.
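The acceptance logic of this block can be sketched as follows. This is an assumed Python rendering: remap stands in for re-running group, place and route with a smaller number of target tasks and is not part of the paper.

```python
C2 = 4   # the constant c2 used in the block described above

def path_terminal_merging(n0, failures_before, remap):
    """Decrease the number of target tasks until routing succeeds or a
    re-grouping adds more than C2 failure paths.
    remap(n_target) -> (routed, failure_count) after re-grouping, placing and
    routing with n_target target tasks."""
    for n_target in range(n0 - 1, 0, -1):
        routed, failures_after = remap(n_target)
        if routed:
            return True, n_target
        if failures_after > failures_before + C2:
            return False, n_target + 1     # reject this re-grouping, keep the previous one
        failures_before = failures_after   # accept and continue with fewer tasks
    return False, 1
```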

8. Time complexity analysis

Let n and m be the number of tasks in Gt and the number of processors in Gp respectively, and n0 (n0 ≤ min{n,m}) the number of target tasks placed in Gp after grouping and placement. Assume that the degrees of Gt and Gp are both constant.

Block 'Initialization' needs time O(n²). In procedure group, the blocks 'Grouping on number of nodes', 'Grouping on node-degree' and 'Data structure updating' all need time O(n²). Thus group needs time

Tgroup = O(n²).    (2)


In procedure place, the blocks 'Initialization' and 'Neighbour to neighbour placement' both need time O(n0 + m), and the blocks 'Shared neighbour to shared neighbour placement' and 'Neighbour to nearest processor placement' both need time O(n0), so place needs time

Tplace = O(n0² + n0·m) ≤ O(m²).    (3)

Procedure route needs time O(k² + k·m²) [14], where k is the number of paths to be routed among the placed tasks in Gp. Since k ≤ O(m), the time needed for route is

Troute = O(k² + k·m²) ≤ O(m³).    (4)

In the last part of procedure samap, assume that procedure adjust will be executed s times. Let ki and ni be the number of paths to be routed and the number of target tasks after the ith phase of adjusting respectively, and ki^u and ki^o be the number of failure paths removed by exchanging with unoccupied processors and by exchanging with occupied processors in the ith phase of adjusting respectively, 0 ≤ i ≤ s. Since each phase of path-terminal merging reduces the number of tasks (and paths) by at least one and increases the number of failure paths by at most c2, we have

s ≤ n0 − 1    (5)

ki ≤ k0 − i    (6)

k0 ≥ Σ_{i=1..s} (ki^u + ki^o) − c2·s.    (7)

Assume that the times needed for the blocks 'Exchanging with unoccupied processors', 'Exchanging with occupied processors' and 'Path-terminal merging' are Te^u, Te^o and Tm respectively. The ith phase of adjusting needs time Tadjust(i) of at most

Tadjust(i) = Σ_{j=1..ki^u} (k0 − i − j)·Te^u + m·ki^o·Te^o + Tm.    (8)

By inequations (5)-(8), and since n0 ≤ m and k0 = O(n0), all s phases of adjusting need time at most

Tadjust = Σ_{i=1..s} Σ_{j=1..ki^u} (k0 − i − j)·Te^u + m·Σ_{i=1..s} ki^o·Te^o + s·Tm
        ≤ O((k0 + c2·s)·k0)·Te^u + O(m·(k0 + c2·s))·Te^o + s·Tm
        ≤ O(m²)·(Te^u + Te^o) + O(m)·Tm.    (9)

Since, during the ith phase of adjusting, 1 ≤ i ≤ s,

Te^u = O(1) + O(1)·(O(m − ni) + c1·(O(ki) + O(ki² + ki·m²) + O(m) + O(ki))) ≤ O(m³),

Te^o = O(1)·(O(1) + O(ki) + O(ki² + ki·m²) + O(m) + O(ki)) ≤ O(m³)

and

Tm = O(n²) + O(m²) + O(ki² + ki·m²) ≤ O(m³),

inequation (9) becomes

Tadjust ≤ O(m⁵).    (10)

Thus procedure samap needs time

Tsamap = Tgroup + Tplace + Troute + Tadjust ≤ O(max{n², m⁵}).    (11)

Note that inequation (10) gives the worst-case time complexity of our program under 'full adjusting' as described in the previous section. By degrading the adjusting heuristic in procedure adjust from full adjusting into 'semi-adjusting', such that within block 'Path-terminal exchanging' during each phase of adjusting only c3 (constant) end-nodes of failure paths in Pf are checked against unoccupied processors and only c4 (constant) failure nodes in Nf are used to check against occupied processors for exchangeability, the time complexity of our program is considerably reduced.



Clearly, since the time needed for adjust, T'adjust, here becomes

T'adjust = α·c3·Σ_{i=1..s} Te^u + β·c4·Σ_{i=1..s} Te^o + s·Tm ≤ O(m)·(Te^u + Te^o + Tm) ≤ O(m⁴),    (12)

where α and β are constants, the time complexity of the program, T'samap, will be

T'samap ≤ O(max{n², m⁴}).    (13)

Furthermore, if the restriction of edge-disjointness of the paths to be routed is dropped, procedure adjust is no longer needed, since path routing will certainly succeed once the tasks have been placed. A usual algorithm of shortest-path routing for all pairs of processors takes only time O(m²) [1]. Thus the time complexity of our program when it works for transputer networks with a message routing and multiplexing mechanism, T''samap, will be

T''samap ≤ O(max{n², m²}).    (14)

9. Implementation result

The above Occam program for process-to-processor mapping has been implemented on one transputer in the Hathi-2 system. We have tested our mapping program with various problem instances. For any input user-defined task graph and a processor graph of torus configuration of any size, the program finds a satisfactory process-to-processor mapping. The experimental results show that our program works well for both regular and irregular task graphs. In many cases the program seems to reach 'human intelligence', in the sense that it can present an embedded layout that would be difficult to achieve even by hand-drawing. Figure 6 (a) and (b) show two examples of the implementation results of our program. For the task graph in (a), all tasks have the same computation weight and communication weight; for the task graph in (b), the underlined numbers are communication weights and the computation weights are implied by the node indices: node i has computation weight i + 10. In the routing layouts in both (a) and (b), bold lines indicate the links occupied by physical paths and plain lines the unoccupied links.




Fig. 6. Two examples of an implementation result: (a) mapping 4 x 4 mesh onto 4 x 4 torus.



Fig. 6. Two examples of an implementation result: (b) mapping an arbitrary task graph onto 3 x 3 torus.

10. Performance evaluation

The performance of our program has been measured on the Hathi-2 system. For mapping arbitrary task graphs onto a processor torus of arbitrary size, we measure the time elapsed during the whole mapping procedure as well as during its subprocedures of grouping, placement, routing and adjusting individually. Measuring the time elapsed for mapping a series of arbitrary task graphs of different sizes onto a processor torus of fixed size shows how the elapsed time varies with the number of tasks of an arbitrary task graph, giving a picture of the program performance w.r.t. the number of tasks. Likewise, measuring the time elapsed when the number of tasks is fixed but the size of the processor torus varies gives a picture of the program performance w.r.t. the number of processors. The combination of these two measurements provides an overall evaluation of the performance of our program.

The measured performances of the program under full adjusting are shown in Fig. 7 (a)-(e), where (a)-(d) respectively show the individual performances of the procedures of grouping, placement, routing and adjusting in the program, and (e) presents the overall performance of the program. In each of these figures, curve t(n,100) represents the relation between time and n, the number of tasks, when the size of the processor torus is fixed to 100 (10 x 10), while curve t(100,m) shows the relation between time and m, the size of the processor torus, when the number of tasks of an arbitrary task graph is fixed to 100 (the topology of the task graph is not fixed). The vertical axis, with a scaling unit of 10 seconds, is the axis of time; the horizontal axis, with a scaling unit of 10, is the axis of the task number (n) for curve t(n,100) and of the processor number (m) for curve t(100,m). All task graphs are generated over a set of random data, so their topologies are random. From Fig. 7 it is obvious that the execution time of the program is dominated mainly by the time spent in the adjusting procedure. Therefore, for a given processor torus, mapping often takes more time for task graphs with a complex topology than for those with a simple topology, since the former usually require more adjusting work.

As samples, in Table 1 we illustrate the performances of the program for some typical categories of task graphs.

Table 1

Samples of performances for different categories of task graphs

Task graph            Proc. torus   Measured performance (time in seconds)
Topology      n       m             Initial.  Grouping  Placement  Routing  Adjusting  Total
Mesh          12 x 12 12 x 12       0.62      1.25      0.90       5.22     0          7.99
              5 x 30  5 x 6         0.68      24.68     0.16       0.32     0          25.84
Binary tree   31      5 x 6         0.03      0.27      0.17       0.20     2.43       3.10
              127     5 x 10        0.48      21.84     0.10       0.31     2.06       24.79
Hypercube     16      4 x 4         0.01      0.02      0.10       0.12     0          0.25
              128     10 x 13       0.52      24.87     0.10       0.53     0          26.02
Random        100     10 x 10       0.30      6.74      0.41       1.27     356.35     365.07
              100     5 x 10        0.31      19.09     0.06       0.19     0.57       20.22


Fig. 7. The measured performances of the program (time, in units of 10 s, against n or m in units of 10): (a) the performance of grouping, (b) the performance of placement, (c) the performance of routing, (d) the performance of adjusting, (e) the overall performance of the program.

11. Concluding remarks

In developing a mapping tool to automatically map parallel programs onto transputer networks, we have described the Occam implementation of process-to-processor mapping on the Hathi-2 transputer system.


Rather than applying classic heuristics such as local search and simulated annealing, which usually require a series of batched data swappings, our program is based on self-adjusting mapping, a special heuristic strategy for the mapping problem that can be easily and efficiently realized in practice. For mapping n processes of an arbitrary user-defined parallel program onto m processors of a transputer network configured into a torus, the program has a worst-case time complexity of O(max{n², m⁵}) on a single processor under full adjusting. For the same problem, the program can also run in time O(max{n², m⁴}) by degrading the adjusting heuristic into semi-adjusting, and in O(max{n², m²}) by eliminating the adjusting heuristic, where the latter holds only for transputer networks providing message routing and multiplexing. The implementation of our program for various problem instances on Hathi-2 shows good performance. The program works well for both regular and irregular task graphs and produces satisfactory mapping results. Our program can also easily be modified to realize process-to-processor mapping on multiprocessor systems of other configurations.

Acknowledgement

The author wishes to thank Ralph-Johan Back for his guidance and Mats Aspnäs for his reading of the manuscript.

References

[1] A. Aho, J. Hopcroft and J. Ullman, Data Structures and Algorithms (Addison-Wesley, Reading, MA, 1983).

[2] M. Aspnäs and T.-E. Malén, Hathi-2 user's guide (version 1.0), Åbo Akademi, Dept. Comput. Sci., Res. Rep. B(6) (1989).

[3] M. Aspnäs and R.J.R. Back, A programming environment for a transputer-based multiprocessor system, Symp. on Programming Languages and Software Tools (Proc. First Finnish-Hungarian Workshop, T. Gyimóthy, ed.) (1989) 94-103.

[4] F. Berman, Experience with an automatic solution to the mapping problem, The Characteristics of Parallel Algorithms (MIT Press, Cambridge, MA, 1987) 307-334.

[5] S.H. Bokhari, On the mapping problem, IEEE Trans. Comput. C-30(3) (1981) 207-214.

[6] S. Kirkpatrick, C. Gelatt Jr. and M. Vecchi, Optimization by simulated annealing, Science (May 1983).

[7] Inmos Limited, Transputer Reference Manual (Prentice-Hall, Englewood Cliffs, NJ, 1988).

[8] Inmos Limited, Occam 2 Reference Manual (Prentice-Hall, Englewood Cliffs, NJ, 1988).

[9] O. Krämer and H. Mühlenbein, Mapping strategies in message-based multiprocessor systems, Proc. PARLE '87, Lecture Notes in Computer Science, Vol. 258 (Springer, Berlin, 1987) 213-225.

[10] P. Sadayappan and F. Ercal, Cluster-partitioning approaches to mapping parallel programs onto a hypercube, Proc. Supercomputing '87, Lecture Notes in Computer Science, Vol. 297 (Springer, Berlin, 1987) 475-497.

[11] H. Shen, Self-adjusting mapping: a heuristic mapping algorithm for mapping parallel programs onto transputer networks, Developing Transputer Applications (Proc. OUG-11, J. Wexler, ed.) (IOS, 1989) 89-98; to appear in Computer J.

[12] H. Shen, Mapping parallel programs onto transputer networks, in: J. Hulskamp, ed., Proc. 1989 Australian Transputer and Occam User Group Conference (RMIT, 1989) 85-94.

[13] H. Shen, Fast path-disjoint routing in transputer networks, Microprocessing and Microprogramming 33 (1991) 21-31.

[14] H. Shen, Occam implementation of path-disjoint routing on the Hathi-2 transputer system, Microprocessing and Microprogramming 30 (1990) 93-100.

Hong Shen is an Assistant Professor in the Department of Computer Science at Åbo Akademi University, Finland. He received the B.S. degree from Beijing University of Iron and Steel Technology, China, in 1982, the M.S. degree from the University of Science and Technology of China in 1987, and the Ph.Lic. and Ph.D. degrees from Åbo Akademi University, Finland, in 1990 and 1991 respectively, all in Computer Science. His main research interests include parallel algorithms, parallel and distributed computing, and parallel computer architectures.