
OORHS-A Reliable Adaptive Environment for Distributed Computing

Daniel S. Yeung and Allan K.Y. Wong Department of Computing

Hong Kong Polytechnic University

ABSTRACT

This paper presents a reliable and adaptive distributed computing environment, known as the object-oriented reciprocal hypercomputing system. In this environment a user is required to write only a high-level APPL program, which is, in effect, a specification. The system automatically parallelizes the APPL program into smaller independent program objects, instantiates the codes for these objects, distributes them, and performs load balancing and failsoft recovery during computation to produce the final result. Four sets of preliminary experimental results are shown to provide insight into the proposed system.

key words: object-oriented reciprocal hypercomputing, distributed computing, asynchronous parallel programming language, adaptive and reliable.

1. INTRODUCTION

The proposed object-oriented reciprocal hypercomputing system (OORHS) is a reliable and adaptive environment for distributed computing. The central idea of hypercomputing is to utilize idle CPU cycles in a network of processors that do not share memory. The OORHS research project is motivated by the wish to alleviate distributed computing problems[1] in the areas of adaptiveness, reliability, and transparency. A system is adaptive if it is reciprocal. It is reliable if it provides failsoft performance for partial failures. It is transparent if it requires only a high-level specification to perform all other distributed computing steps.

The OORHS requires from the user only a high-level APPL (asynchronous parallel programming language) program, which is, in effect, a specification. The system automatically performs all other distributed computing steps, which include partitioning of a program into smaller program objects (i.e. parallelization of a program), load balancing, and failsoft recovery. An APPL is formed by mixing a base language with the coordination framework developed in our work[2]. The unit of parallelism is a program object that possesses an identity, a collection of methods and communication capability. The precompiler instantiates the program code of an object from the class library that contains program templates defined by users. Every program object is executed by a dedicated process because the active object model[3] is adopted. A process terminates if the object being supported disappears.

A reciprocal system should have three features: (a) a sequential program which can be statically converted into a parallel one or vice versa, by the user, (b) partitioned program objects which can be run parallelly, pseudoparallelly (by context switching), or both, by the user, and (c) dynamic system adaptation from parallel executions to pseudoparallel ones or vice versa. The first two features help match execution strategies with physical constraints. The third feature supports failsoft operations.

The OORHS differs from other existing environments (e.g. [4]) by emphasizing three things: reciprocality, load balancing and local compilation. In local compilation, program objects are distributed at the source program level, then compiled and executed at the allocated sites. Load balancing consists of static (explicit) and dynamic (transparent and implicit) task placement (also known as location management) strategies. Static placement is good for high-security applications because the user can choose particular sites for the required security level. Dynamic task placement may involve timely migration of program objects for the purpose of evening out system workload, possibly with a security trade-off.

The present OORHS is implemented on a UNIX-based network of SUN workstations and personal computers. Preliminary results from four experiments are presented to provide insight into the proposed system.

2. THE SYSTEM ARCHITECTURE

The model in Figure 1 captures the essence of the object-oriented reciprocal hypercomputing system. A high-level APPL program is only a specification of the program objects to be instantiated from the class library during precompilation. The final program is made up of a master program object definition (MPOD) and many service program object definitions (SPODs), which may be coded in different programming languages. The final program, therefore, can be composite. If a user executes the command run parallel-program-name, full precompilation is invoked. It consists of: (1) the first pass (instantiation of program objects), (2) the second pass (separation of program objects into independent parallel units), and (3) transparent distribution and execution of SPODs. A distribution list is also produced from the different ALLOCATION specifications in the second pass. check parallel-program-name is the other command; it invokes only the first pass. SPODs are distributed by the kernel according to the SPOD-to-processor matches recorded in the distribution list. A match is created at this time for every named machine specification in the ALLOCATION statement. This statement is basic to the CREATE-OBJECT primitive construct, and it permits four choices (i.e. any machine (AUTO), a machine specialism, the same machine (PROCEDURE), or a named machine). In the first and second choices, the SPOD-to-processor matches are decided by the kernel. The decision is based on the current workload. If PROCEDURE is chosen, then the program object is a sequential subprogram.

[Figure 1 shows the OORHS model: an APPL program and the class library (composed of user-defined program object templates) feed precompilation (a. first pass: instantiation; b. second pass: parallelization), which yields the distribution list and the program objects (i.e. MPOD and SPODs); these pass to the kernel and the communication layer (CL), which share the network configuration table (NCT*).]

*NCT = Network Configuration Table shared by kernel and CL

Figure 1. The OORHS Model

A CREATE-OBJECT primitive construct may specify a set of methods. The following example specifies only a single method:

CREATE-OBJECT (object-class-name,
        method(parameters)),
    ALLOCATION {AUTO/machine-name/
        machine-capability/PROCEDURE},
    OBJECT-ID MPOD-id,    /* by kernel */
    SERVICE-ID SPOD-id    /* by kernel */
END

This primitive instructs the precompiler how to instantiate. In the first pass, the precompiler checks for syntactical errors and warns the user if object-class-name or method cannot be found in the class library. In the second pass, the precompiler generates the final program and the distribution list. The MPOD and SPODs in the final program are passed to the kernel. SPODs are distributed at the source program level. The underlying communication layer and the kernel share the information in the network configuration table (NCT). The NCT contains important parameters that reflect the current network operation. Examples of parameters include: (1) the shortest paths between the host and other nodes,


(2) workload in different nodes, and (3) compilers in different machines. A method in a SPOD can be invoked by the SERVICE-REQUEST primitive construct:

SERVICE-REQUEST (method(parameters)),
    OBJECT-ID MPOD-id,
    SERVICE-ID SPOD-id,
    SERVICE-NUMBER mm,    /* by precompiler */
END

Architecturally, the OORHS has three distributed layers. They are: (a) the precompiler (first layer), (b) the kernel (second layer), and (c) the communication layer (CL, third layer). The last two layers must be present in all participating machines. The absence of the precompiler does not stop hypercomputing activities in a host but disables its power to initiate program execution.

The kernel assigns identifications to the MPOD and SPODs for communication and synchronization. It keeps various time windows: for distribution acknowledgement, for MPOD activation, and for MPOD deactivation. Failed SPODs will be redistributed. The kernel activates a new MPOD only after all its cooperating SPODs have been successfully distributed within the activation window. Otherwise, the user is notified of program failure. Redistribution of SPODs must finish within the deactivation window to prevent endless redistribution. The association of MPOD and SPOD is atomic; only a successful SPOD service will advance the MPOD to the next consistent state of program execution.
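As an illustration only (the paper does not give the kernel's data structures or interfaces), a minimal C sketch of the activation-window check might look as follows; every name, field and policy detail here is an assumption:

    /* Hypothetical sketch of the kernel's activation-window check. */
    #include <stdbool.h>
    #include <time.h>

    #define MAX_SPODS 64

    typedef struct {
        int  spod_id;
        bool acked;                  /* distribution acknowledgement seen */
    } SpodStatus;

    typedef struct {
        SpodStatus spods[MAX_SPODS];
        int        n_spods;
        time_t     start;            /* when SPOD distribution began */
        double     activation_win;   /* seconds allowed for all acks */
    } MpodContext;

    /* The MPOD may be activated only when every cooperating SPOD has
     * acknowledged distribution inside the activation window; once the
     * window expires the kernel reports program failure to the user. */
    bool may_activate(const MpodContext *m, bool *window_expired)
    {
        *window_expired = difftime(time(NULL), m->start) > m->activation_win;
        if (*window_expired)
            return false;
        for (int i = 0; i < m->n_spods; i++)
            if (!m->spods[i].acked)
                return false;        /* keep waiting (or redistribute) */
        return true;
    }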

The CL assumes unreliable hardware. It provides failsoft network reconfiguration: the NCT is updated to reflect the addition of new nodes and the deletion of failed nodes. The NCT contains many entries (e.g. machine names and their specialisms) to be exchanged periodically. All nodes have the same NCT entries when the network is at the steady state. Reconfiguration in the CL and SPOD redistribution in the kernel together make the OORHS adaptive and reliable. The CL design is based on the model in [5], which proposed a prototypical, fault-tolerant software platform for an unlimited number of small autonomous interconnected machines. This model emphasizes scalability and potentially supports the massively parallel approach of using a large number of computers scattered across different geographical locations.
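To make the NCT description concrete, the following C sketch shows one possible layout of an NCT entry; the field names and sizes are illustrative assumptions based only on the parameters the paper lists (shortest paths, per-node workloads, compilers, machine names and specialisms):

    /* Illustrative layout of one network configuration table entry. */
    #define NCT_NAME_LEN  32
    #define NCT_MAX_NODES 128

    typedef struct {
        char   machine_name[NCT_NAME_LEN];
        char   specialism[NCT_NAME_LEN];   /* e.g. a machine capability */
        double workload;                   /* current load on the node  */
        char   compilers[NCT_NAME_LEN];    /* compilers on the node     */
        int    next_hop[NCT_MAX_NODES];    /* shortest paths to peers   */
    } NctEntry;

    /* The kernel and the CL share one table; periodic exchange drives
     * all nodes toward identical NCT contents at steady state. */
    typedef struct {
        NctEntry entries[NCT_MAX_NODES];
        int      n_entries;
    } Nct;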

3. PRELIMINARY RESULTS

The results from four experiments, among many performed in OORHS, are presented in this paper. These four experiments give some idea of the breadth of distributed parallel programming paradigms that the OORHS can support. More work will be needed to determine the bounds and limitations of OORHS for distributed programming. The four experiments are: fractal computation, lengths of all pairs' shortest paths (LOAPSP), travelling salesman problem, and quick sort. The APPL programs for these experiments are C-based (i.e. C-APPL); all the MPODs and SPODs are program modules in C.

Rabhi[6] proposed that parallel programming algorithms can be conceptually classified into four paradigms, namely, recursively partitioned, distributed independent, iterative transformation, and process network. The proposed classification method is simple and comprehensive. Based on Rabhi's idea and practical experience, we think that the process-network algorithms can be put under the iterative-transformation paradigm. In the recursively partitioned paradigm, a problem is recursively divided into smaller problems. The partial solutions from these smaller problems are recursively combined into the final solution. In the distributed independent paradigm, a problem is divided into smaller problems, which will not be split again. In the iterative transformation paradigm, a problem is solved by independent operations on a set of objects. These operations can happen successively and parallelly.

The four experiments presented in this paper cover the three paradigms: quick sort (recursively partitioned), travelling salesman problem (distributed independent), fractal computation (distributed independent), and LOAPSP (iterative transformation). The results from the four experiments demonstrate that the OORHS framework can support a good breadth of parallel programming algorithms. These results are presented in the form of "units of computation time versus speedup". The expression of 49/4.1 in Figure 2, for example, means that the computation needs 49 time units, and it is 4.1 times faster than the sequential program.
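In other words, each entry reports the parallel computation time together with the ratio of sequential to parallel time; written as a formula, using the 201-unit sequential time of Figure 2:

\[
\mathrm{speedup} = \frac{T_{\mathrm{sequential}}}{T_{\mathrm{parallel}}},
\qquad \text{e.g. } \frac{201}{49} \approx 4.1
\]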

The results (Figure 2) were obtained from a normal environment where SPODs shared CPUs with other users. Static load balancing was done by comparing CPU utilizations.

3.1 Fractal Computation

The fractal equation is z = z^2 + k. The algorithm, which can accommodate more subtasks than processors, is as follows:

(MPOD/MASTER)
    initialize()
    spawn(<worker-name>, number-of-workers)
    multicast(<worker-name>, subtask)
    /* receive-&-send */
    while (work-not-finished)
        receive(r)                        /* result */
        send(<idle-worker>, next-subtask)
        display-result()
    endwhile
    /* collect remaining results */
    for k := 0 to number-of-workers - 1
        receive(r)
        terminate-worker()
        display-result()
    endfor

(SPOD/SLAVE/WORKER)
    while (true)
        receive(subtask)
        r = compute-image()
        send-master(r)
    endwhile
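The body of compute-image() is not given in the paper. As a hedged sketch only, the per-point escape-time iteration for z = z^2 + k might be coded in C as follows (MAX_ITER and ESCAPE are assumed parameters):

    /* Classic escape-time iteration for z = z^2 + k. */
    #include <complex.h>

    #define MAX_ITER 256
    #define ESCAPE   2.0

    /* Returns the iteration at which |z| escapes, which determines
     * the colour of the image point for constant k. */
    int escape_time(double complex k)
    {
        double complex z = 0.0;
        for (int n = 0; n < MAX_ITER; n++) {
            z = z * z + k;
            if (cabs(z) > ESCAPE)
                return n;
        }
        return MAX_ITER;    /* treated as never escaping */
    }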

number of machines   computation time          computation time
                     units/speedup             units/speedup
                     (static load balancing)   (no load balancing)
-------------------------------------------------------------------
1 SPARC              201/1 (sequential execution time)
2 SPARCs             187/1.07                  15?/1.3
3 SPARCs             88/2.28                   83/2.42
4 SPARCs             66/3.05                   59/3.4
4 SPARCs + 1 PC      49/4.1                    46/4.37
4 SPARCs + 2 PCs     39/5.15                   37/5.43
4 SPARCs + 3 PCs     32/6.28                   32/6.28
(? = digit illegible in the transcript)

* PC = IBM 486-compatible personal computers

Figure 2. Results of fractal parallelization in OORHS

3.2 Lengths of All Pairs' Shortest Paths

The LOAPSP algorithm computes the lengths of the shortest paths among all pairs of network nodes. An X-by-X matrix of a network of X nodes needs IT = floor((log X)/(log 2)) + 1 iterations to compute all the shortest paths. The matrix is divided into S blocks for S SPODs. Each SPOD manipulates X/S rows of the matrix and sends the partial solution to the MPOD, which, based on barrier synchronization, reconstructs the matrix. The cycle of matrix division and reconstruction repeats IT times. The final matrix contains the result. The LOAPSP experiment was done on a dedicated platform of SPARC workstations. The results are shown in Figure 3.
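The paper does not list the SPOD code. A minimal C sketch of one iteration over a SPOD's slice of rows, assuming the usual min-plus formulation of repeated matrix "squaring" and a flat array layout (both assumptions), might be:

    /* One LOAPSP iteration over rows row_lo..row_hi-1 of an X-by-X
     * length matrix d, stored flat: out[i][j] = min over m of
     * (d[i][m] + d[m][j]).  Unreachable pairs are assumed to hold a
     * large sentinel well below LONG_MAX/2 so sums cannot overflow. */
    void loapsp_step(const long *d, long *out, int X, int row_lo, int row_hi)
    {
        for (int i = row_lo; i < row_hi; i++)
            for (int j = 0; j < X; j++) {
                long best = d[i * X + j];
                for (int m = 0; m < X; m++) {
                    long via = d[i * X + m] + d[m * X + j];
                    if (via < best)
                        best = via;
                }
                out[i * X + j] = best;   /* partial rows go to the MPOD */
            }
    }

Repeating this cycle IT = floor(log2 X) + 1 times leaves every entry at the length of the corresponding shortest path.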

             number of distributed subtasks
matrix size  1 (sequential)  2          3          5          7         9
                             (<------- distributed & parallel ------->)
45x45        26              21/1.2     19/1.4     16/1.6     12/2.2    8/3.3
90x90        104             60/1.7     40/2.6     31/3.4     36/3      33/3.2
135x135      512             290/1.8    185/2.8    116/4.4    144/3.6   180/2.8
180x180      608             520/1.2    467/1.3    222/2.7    412/1.5   434/1.4
225x225      1145            1066/1.1   95?/1.2    463/2.5    950/1.2   912/1.3
(? = digit illegible in the transcript)

Figure 3. Parallelized LOAPSP computation results

The 90x90 and 135x135 matrices produced better speedups than the smaller 45x45 and larger 225x225 matrices. Moreover, five subtasks produced the most noticeable speedups for them: 3.4 and 4.4 for the 90x90 and 135x135 matrices respectively. Obviously the matrix size and the number of distributed subtasks should be properly combined to produce the optimal computation-to-communication ratio needed for good speedup.

3.3 Travelling Salesman Problem

The travelling salesman problem is computationally intensive. At first, a C sequential program was run repeatedly using a set of N (number of cities) values, and the computation time for every value of N was recorded. A parallel version of the C program (C-APPL) was then run using (N-1) distributed SPODs. Each SPOD computed one subtree at the second level of the hierarchy of cities. To give some idea of the comparative performance of OORHS, the same algorithm was also parallelized in p4[7] on the same platform. The results are shown in Figure 4. p4 is one of the popular experimental distributed programming environments. It is a library of subroutines and macros developed by the Argonne National Laboratory for programming in C and Fortran.
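The subtree decomposition just described can be made concrete with a small C sketch; the recursion, the bitmask bookkeeping, and all names below are illustrative assumptions rather than the paper's code:

    /* One TSP SPOD's job: with city 0 fixed as the start, each of the
     * N-1 SPODs fixes a different second city and exhaustively searches
     * that subtree for the shortest tour. */
    #define MAXN 16

    static long best;                       /* shortest tour found so far */

    static void search(int dist[MAXN][MAXN], int n, int path[], int depth,
                       unsigned used, long len)
    {
        if (len >= best)                    /* simple branch-and-bound cut */
            return;
        if (depth == n) {                   /* close the tour back to 0 */
            long tour = len + dist[path[n - 1]][path[0]];
            if (tour < best)
                best = tour;
            return;
        }
        for (int c = 1; c < n; c++)
            if (!(used & (1u << c))) {
                path[depth] = c;
                search(dist, n, path, depth + 1, used | (1u << c),
                       len + dist[path[depth - 1]][c]);
            }
    }

    /* Entry point for the SPOD that owns the subtree rooted at `second`. */
    long tsp_subtree(int dist[MAXN][MAXN], int n, int second)
    {
        int path[MAXN];
        path[0] = 0;
        path[1] = second;
        best = 1000000000L;
        search(dist, n, path, 2, (1u << 0) | (1u << second), dist[0][second]);
        return best;
    }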

number of cities (N)   sequential   C-APPL parallel   p4 parallel
11                     200          32/6.3            32/6.3
13                     22000        2000/11           2000/11
(rows for other values of N are not recoverable from the transcript)

Figure 4. Travelling salesman problem computation times and speedups by C-APPL and p4 respectively

For the travelling salesman problem, both OORHS and p4 performed equally well, but OORHS also included the local compilation overhead. Both systems began to gain speedups only after eleven cities.

3.4 Quick Sort

The essential idea of quick sort is to choose a key from the list to be sorted. All items with values higher than the key are sorted by one process, and those with lower values are sorted by another process. The sublists can be split further. The partial results are then recursively combined to form the final solution. The experimental results for the parallelization of the quick sort algorithm are shown in Figure 5. When a list was split into two sublists for sorting, the computation time was longer than for the sequential program. A large list size (e.g. 28000 characters) together with more processors (e.g. eight processors and eight distributed SPODs) produced some speedup. The recursively partitioned parallel programming paradigm inherently requires high communication costs. It is only suitable for applications of high computation-to-communication ratios (i.e. large-grained applications).

                     list sizes in number of characters
number of machines   8000       18000      28000
1                    2/1        5.5/1      9/1
2                    5/0.40     10/0.55    12.5/0.72
(rows for more machines are not recoverable from the transcript)

Figure 5. Computation times and speedups for the parallelized quick sort
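For concreteness, a minimal sequential C sketch of the recursively partitioned quick sort is given below; the marked recursion is where OORHS could hand one sublist to another SPOD, and all names are illustrative assumptions:

    /* Recursively partitioned quick sort on a character list. */
    #include <stddef.h>

    static void swap_chars(char *a, char *b) { char t = *a; *a = *b; *b = t; }

    /* Partition around the last element as the key: lower values end
     * up left of the returned index, higher values to its right. */
    static size_t partition(char *list, size_t lo, size_t hi)
    {
        char key = list[hi];
        size_t i = lo;
        for (size_t j = lo; j < hi; j++)
            if (list[j] < key)
                swap_chars(&list[i++], &list[j]);
        swap_chars(&list[i], &list[hi]);
        return i;
    }

    void quick_sort(char *list, size_t lo, size_t hi)
    {
        if (lo >= hi)
            return;
        size_t p = partition(list, lo, hi);
        if (p > 0)
            quick_sort(list, lo, p - 1);   /* candidate for a remote SPOD */
        quick_sort(list, p + 1, hi);
    }

Because each handoff ships a whole sublist and collects it back, the remote recursion only pays off for large grains, which matches the measured slowdowns for two machines in Figure 5.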

4. CONCLUSION

The OORHS research project is motivated by the wish to alleviate distributed computing problems in the areas of adaptiveness, reliability, and transparency. The OORHS differs from other existing distributed programming environments by emphatically combining three essential elements: reciprocality, load balancing and local compilation in a single framework. In OORHS a user is required to provide only a high-level APPL program, which is, in effect, a specification. The system automatically parallelizes the APPL program into smaller program objects, instantiates the codes of these objects, distributes them, and performs load balancing and failsoft recovery during computation to produce the final result. The present OORHS is implemented on a UNIX-based network of SUN workstations and personal computers. The OORHS research is on-going. Many experiments of different programming

paradigms were performed. As demonstrated by the preliminary results of the four presented experiments, the OORHS can support a good breadth of parallel programming paradigms, namely, recursively partitioned, distributed independent, and iterative transformation. Our practical experience indicates that the OORHS is a realistic environment for reliable, adaptive and transparent distributed computing. The reconfiguration capability in the communication layer and SPOD redistribution in the kernel for failsoft and optimal operations make the OORHS adaptive and reliable. We have not completely demarcated the limitations of the OORHS. The immediate future work is to improve the automatic code generation in the OORHS.

REFERENCES

[1] C.M. Pancake, "Software Support for Parallel Computing: Where Are We Headed?", Communications of the ACM, Vol. 14, No. 11, 1992, pp. 53-56

[2] A.K.Y. Wong and D.S. Yeung, "RHS - A Framework for Exploiting Distributed Parallelism Efficiently", to appear in The International Journal of Computer Systems, Science and Engineering

[3] R.S. Chin and S.T. Chanson, "Distributed Object-Based Programming Systems", ACM Computing Surveys, Vol. 23, No. 1, 1991, pp. 91-124

[4] C.C. Douglas, T.G. Mattson and M.H. Schultz, "Parallel Programming Systems for Workstation Clusters", Technical Report YALE/DCS/772-975, Department of Computer Science, Yale University, 1993

[5] K.Y. Wong and K.M. Lo, "Efficient Routing in Small Machines: The V-Net Experience", Proceedings of the 3rd Pan Pacific Computer Conference, Beijing, PRC, 1989, pp. 151-157

[6] Fethi A. Rabhi, "Exploiting Parallelism in Functional Languages: A 'Paradigm-Oriented Approach'", to appear in Abstract Machine Models for Highly Parallel Computers, Oxford University Press

[7] R. Butler and E. Lusk, "User's Guide to the p4 Programming System", Argonne National Laboratory, USA, Distribution Category: Mathematics and Computer Science (UC-405), 1992