An efficient multilevel master-slave model for Hsin-Chu ...and Information Sciences, Clark Atlanta...

An efficient multilevel master-slave model for distributed parallel computation

Hsin-Chu Chen,(1) Alvin Lim,(2) and Nazir A. Warsi(1)

(1) Army Center of Research in Information Sciences, Dept. of Computer and Information Sciences, Clark Atlanta University, 223 James P. Brawley Dr. SW, Atlanta, GA 30314. Email: [email protected]

(2) Department of Computer Science and Engineering, Auburn University, 107 Dunstan Hall, Auburn, AL 36849. Email: [email protected]

Abstract

The master-slave (MS) parallel computing model is one of the most widely used models in networked computing environments because it is easy to implement. This model, however, suffers from the sequential generation of slave processes and from the heavy communication overhead imposed on the master processor. To overcome these problems, we present in this paper an efficient multilevel master-slave (MMS) scheme which is especially useful for solving decomposable large-scale problems, such as structural mechanics or dynamics problems with rotational symmetry, on networked workstations. Our MMS model implements the MS model at multiple levels and generates processes using a special class of tree structures, allowing parallel creation of slave processes. It also improves performance in the distribution of initial data to, and the merging of computed results from, slave processes. We describe how different MMS structures generate a prescribed number of processes and broadcast global data to all processes. We then present the implementation of the optimal MMS model via PVM on a networked computer system consisting of workstations for a plate-bending problem that is discretized using the finite strip method. The performance of our numerical experiments employing this MMS model is reported to demonstrate its efficiency. This scheme can be applied equally well to other types of problems that can be decomposed using the Fourier decomposition or circular decomposition, no matter whether the physical problem is discretized by the boundary element method or the finite element method.

Transactions on Modelling and Simulation vol 22, © 1999 WIT Press, www.witpress.com, ISSN 1743-355X

440 Boundary Element Technology

1 Introduction

The master-slave (MS) parallel programming model [Geis93, GeSS87] is currently one of the most widely used models in networked computing environments, in which a collection of heterogeneous or homogeneous computers is connected by a network to serve as a large parallel virtual machine. In this model the initiating process, referred to as the master process, is responsible for spawning (generating) all other processes, referred to as slave processes, to perform the tasks assigned to them. Although the tasks assigned to the slave processes can be different, in a typical application of this model all slaves perform a set of similar tasks (with different sets of data). In such cases, usually only two programs are necessary no matter how many slaves there are: one for the master and the other for all the slaves. Interprocess communication is usually handled through some form of message passing between the master and the slaves.

The advantage of the MS model for parallel computation is its simplicity of implementation. It suffers, however, from heavy communication overhead, since all slaves communicate only with the master; there is no direct communication among slaves. This disadvantage is especially apparent when the master needs to collect data from all slaves in order to compute an accumulated sum. To overcome the drawbacks of the MS model, an approach that implements the MS model at multiple levels as a complete tree structure is presented in [ChBy95]. In this paper, we develop a more efficient approach for the multilevel master-slave (MMS) scheme, which is useful for solving large-scale master-slave problems, e.g., plate-bending problems arising in structural analysis or groundwater modeling that can be handled by the finite strip method [PuSc90, Cheu68], or heat conduction and wave problems that are solved by a fast Poisson solver [Stra86, pp.457].

2 Multilevel Master-Slave Trees

The MMS scheme has been demonstrated to be more efficient than the MS model for the problem presented in [ChBy95] and seems to have great potential for other applications. In this paper, we further investigate the potential of this approach and present our new results. In particular, we present and analyze a special class of unbalanced m-ary tree structures for the MMS approach. It should be mentioned that our unbalanced tree structure is not arbitrary; it is generated according to certain rules to be described in this section.

Since there can be many master processes in an MMS model, we refer to the process at the root of the tree as the initial master process. We assume that each process can directly generate at most m slaves, and that no single master process can generate more than one process at any given time, which is the case for a machine with only one processing unit. All distinct processes are assumed to execute on different processors, one process per processor. Note also that processes are created as time advances; they are therefore dynamic, although processors are static. The time spent in generating a new process is assumed to be fixed and is taken to be unity in this paper for convenience. We shall refer to processes generated at time t = k, k = 0, 1, 2, ..., as step-k processes. We are now in a position to present our proposed unbalanced m-ary tree models for the MMS scheme. We first consider the special case m = 2 and then generalize it to an arbitrary m, m > 1. Note that when m = 1, the resulting configuration is simply a pipeline of processes, a degenerate tree with no branches, which is of no interest to us since the generation of processes in this case becomes strictly sequential. Because of the close relation between the Fibonacci sequence and the number of processes generated at each time step, this class of unbalanced m-ary trees will be referred to as Fibonacci m-ary trees in this paper.

Figure 1: Spatial representation of an unbalanced tree

2.1 Fibonacci Binary Tree

When m = 2, each process can generate at most two processes and the resulting configuration is usually an unbalanced binary tree for time t = k > 1. Figure 1 shows the spatial representation of such an example generated in five time steps, where the number next to each arrow represents the step number at which the process pointed to by the arrow is generated and the depth denotes the level number. A temporal representation of this tree is shown in Figure 2, where the depth denotes the step number instead of the level. It is not necessary to display the step numbers on the links in this representation; we show them in this figure only for the sake of emphasis.

Figure 2: Temporal representation of an unbalanced tree

We now use this example to explain the generation procedure. First, the initial master process (A) must enroll itself in the networked computing system and is considered as having been generated at time t = 0 (step 0). This process then spends one time unit (from time t = 0 to t = 1) to generate process B (step 1). Recall that A cannot generate B and C at the same time by our assumption. Once B has been generated, both A and B are ready to generate their own slaves, C and D, respectively. The generation of C and D can be completed in one time unit since the generation of C by A is entirely independent of the generation of D by B, by our assumption that different processes execute on different processors. This is done at step 2. Now, process A has already generated two slaves, B and C, and therefore stops generating new slaves. Accordingly, only three new slaves are generated at step 3: processes E, F, and G, generated by B, C, and D, respectively. By the same argument, it is not difficult to see that the five leaves in this figure are the only processes that can be generated at the next step. Like processes C and D, these five processes can be generated in parallel since their immediate masters are all distinct.
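The generation rule described above can be checked with a small simulation. The following is only a sketch under the stated assumptions (one new slave per process per time step, at most m slaves per process); the function name `simulate_generation` and the integer process numbering are illustrative and do not come from the paper.

```python
from typing import List, Optional


def simulate_generation(steps: int, m: Optional[int] = 2) -> List[int]:
    """Return the number of processes created at steps 0, 1, ..., `steps`.

    Rule simulated: in every time step, each process that already exists
    and has fewer than m slaves spawns exactly one new slave.
    m=None means there is no limit on the number of slaves.
    """
    slaves_of = [0]   # slaves_of[p] = number of slaves process p has spawned
    created = [1]     # step 0: the initial master enrolls itself
    for _ in range(steps):
        # Snapshot the processes allowed to spawn in this step, so that
        # processes created during the step only start spawning next step.
        ready = [p for p, count in enumerate(slaves_of)
                 if m is None or count < m]
        for p in ready:
            slaves_of[p] += 1      # p gains one slave ...
            slaves_of.append(0)    # ... which starts with no slaves itself
        created.append(len(ready))
    return created


if __name__ == "__main__":
    per_step = simulate_generation(4, m=2)
    print(per_step)        # [1, 1, 2, 3, 5]
    print(sum(per_step))   # 12 processes, i.e. A plus 11 slaves
```

With m = 2 and four steps the counts match the walk-through: one process at step 1 (B), two at step 2 (C, D), three at step 3 (E, F, G), and five at step 4.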

In general, the generation of processes continues until a prescribed number of time steps (or a fixed number of processes) has been reached. Let $f_n$ be the number of processes generated according to this procedure at time step n (from time t = n - 1 to t = n) and $F_n$ be the total number of processes in the tree at time t = n. It is easy to see that $f_n$ follows the Fibonacci sequence 1, 1, 2, 3, 5, 8, ..., i.e.,

$$f_n = f_{n-1} + f_{n-2}, \qquad n = 1, 2, \ldots, \tag{1}$$

with the initial conditions

$$f_0 = 1 \quad \text{and} \quad f_{-1} = 0. \tag{2}$$

The explicit expression for $f_n$, given (2), is well known [Tuck84, pp.282]:

$$f_n = \frac{1}{\sqrt{5}} \left[ \left( \frac{1+\sqrt{5}}{2} \right)^{n+1} - \left( \frac{1-\sqrt{5}}{2} \right)^{n+1} \right].$$
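As a quick numerical sanity check, the recurrence (1) with initial conditions (2) can be compared against the explicit expression. This is a sketch only; the function names are illustrative, and the closed form is evaluated in floating point and rounded, which is reliable for small n but not for arbitrarily large n.

```python
import math


def f_recurrence(n: int) -> int:
    """f_n from f_n = f_{n-1} + f_{n-2} with f_0 = 1 and f_{-1} = 0."""
    prev, cur = 0, 1               # (f_{-1}, f_0)
    for _ in range(n):
        prev, cur = cur, prev + cur
    return cur


def f_closed_form(n: int) -> int:
    """f_n from the explicit expression, rounded to the nearest integer."""
    root5 = math.sqrt(5.0)
    phi = (1.0 + root5) / 2.0
    psi = (1.0 - root5) / 2.0
    return round((phi ** (n + 1) - psi ** (n + 1)) / root5)


if __name__ == "__main__":
    print([f_recurrence(n) for n in range(6)])    # [1, 1, 2, 3, 5, 8]
    print([f_closed_form(n) for n in range(6)])   # [1, 1, 2, 3, 5, 8]
```

Note the exponent n + 1 rather than n: it reflects the shifted indexing implied by the initial condition f_0 = 1.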

As mentioned earlier, we refer to this type of unbalanced binary tree as a Fibonacci binary tree because of this recurrence relation, in order to distinguish it from other unbalanced binary trees that do not satisfy it. Since $F_n$ is simply the sum of all processes generated from step 0 up to step n, we have $F_n = \sum_{k=0}^{n} f_k$. Substituting $f_n = F_n - F_{n-1}$ into (1) shows that $F_n$ satisfies the recurrence relation

$$F_{n+1} = 2F_n - F_{n-2}, \qquad n = 0, 1, \ldots \tag{3}$$

with the initial conditions $F_0 = 1$ and $F_{-1} = F_{-2} = 0$. This relation enables us to determine iteratively the maximum number of steps required to generate a prescribed number of processes for our MMS scheme. To close this subsection, we stress that all the processes generated at a given time step can be generated in parallel since their immediate masters are all distinct, a clear advantage of this approach over the traditional MS model, which spawns processes sequentially. For example, to spawn 11 processes (excluding the initial master), this MMS model takes only four time steps, as clearly seen in Figure 2, instead of the 11 steps required by the MS model.

Figure 3: A Fibonacci tree of depth 4 with m = ∞ (temporal representation).
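One way to use recurrence (3) iteratively is sketched below: starting from the initial conditions, the total is advanced one step at a time until it reaches the prescribed number of processes. The function name `steps_for` is an illustrative assumption, and the total is taken to include the initial master.

```python
def steps_for(total_processes: int) -> int:
    """Smallest n such that F_n >= total_processes for the Fibonacci binary tree.

    Iterates F_{n+1} = 2*F_n - F_{n-2} with F_0 = 1 and F_{-1} = F_{-2} = 0.
    total_processes counts the initial master as well as all slaves.
    """
    if total_processes < 1:
        raise ValueError("at least the initial master must exist")
    f_nm2, f_nm1, f_n = 0, 0, 1    # (F_{n-2}, F_{n-1}, F_n) for n = 0
    n = 0
    while f_n < total_processes:
        # Shift the window one step forward and apply recurrence (3).
        f_nm2, f_nm1, f_n = f_nm1, f_n, 2 * f_n - f_nm2
        n += 1
    return n


if __name__ == "__main__":
    # 11 slaves plus the initial master: 12 processes in total.
    print(steps_for(12))   # 4
```

For the example in the text (11 slaves plus the initial master) the loop visits the totals 1, 2, 4, 7, 12 and stops after four steps.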

2.2 Minimum-generation-time Tree Model (m = ∞)

Obviously, as m increases, each process is allowed to have more slaves. In the extreme case of m = ∞, there is no constraint on the number of slaves a process can generate. Given a total number of processes to be generated, this extreme case takes the smallest number of steps (the smallest amount of time) to generate a Fibonacci tree, possibly incomplete. This model is thus referred to as the minimum-generation-time (MGT) tree model. Figure 3 shows an example of the Fibonacci tree with m = ∞, generated using 5 time steps, k = 0, 1, ..., 4. It is worth mentioning that every complete Fibonacci m-ary tree with N < m is an MGT tree.

From our generation rules, it is not difficult to see that $2^{k-1}$ additional slaves can be generated at time step k, k ≥ 1, making the total number of processes in the tree equal to $2^k$ in the MGT tree model. This appears, in some sense, to be similar to the creation of hypercube nodes [Leig92, 392]. However, the configuration of this model is different from the hypercube structure, as seen in Figure 3. The recurrence relation of $f_n$ for MGT can be expressed as

$$f_n = \sum_{k<n} f_k = \sum_{k=0}^{n-1} f_k, \qquad n = 1, 2, \ldots \tag{4}$$

with $f_0 = 1$, since every process generated before time t = n can generate an additional slave at t = n for all n > 0, and $f_k = 0$ for any k < 0. It is trivial to see that

$$F_{n+1} = 2F_n, \qquad n = 0, 1, \ldots \tag{5}$$

with $F_0 = 1$. The MGT model has some very desirable features. Not only does it spend the least time in generating a Fibonacci tree when the total number of processes is prescribed, but it also requires the least time (among all possible m) to flood data from the root to all other processes and to accumulate results computed by each process back to the root (the reduction operation), which occurs very often in scientific computations. Each of the flood and reduction operations takes only $\log_2(F_N)$ steps, where $F_N$ is the total number of processes in the tree, assuming $F_N$ is a power of 2.
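The step count for the flood operation can be illustrated with a schedule in which the set of processes holding the data doubles at every step. The sketch below assumes a particular numbering that is not given in the paper: processes are numbered 0, 1, 2, ... with 0 as the root, and at each step every process p that already holds the data forwards it to process p plus the current number of holders. The function name is likewise illustrative.

```python
from typing import List, Tuple


def mgt_flood_schedule(num_procs: int) -> List[List[Tuple[int, int]]]:
    """Return, for each step, the (sender, receiver) pairs of a flood from process 0.

    num_procs must be a power of 2. All transfers listed for one step have
    distinct senders and distinct receivers, so they can proceed in parallel.
    """
    if num_procs < 1 or num_procs & (num_procs - 1):
        raise ValueError("num_procs must be a power of 2")
    schedule = []
    holders = 1                     # processes 0 .. holders-1 hold the data
    while holders < num_procs:
        schedule.append([(p, p + holders) for p in range(holders)])
        holders *= 2                # every holder has served one new process
    return schedule


if __name__ == "__main__":
    for step, pairs in enumerate(mgt_flood_schedule(8), start=1):
        print(step, pairs)
    # 1 [(0, 1)]
    # 2 [(0, 2), (1, 3)]
    # 3 [(0, 4), (1, 5), (2, 6), (3, 7)]
```

Eight processes are reached in three steps, that is log2(8), whereas a single master sending to seven slaves one at a time would need seven transfers in sequence.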

3 Applications

The MMS programming paradigm presented in this paper not only saves time spent in spawning processes in a distributed computing environment, as compared with the standard MS model, but is also suitable for distributing global data to all processes and for performing reduction operations, e.g., the accumulated sum of a series of numerical values (scalars or vectors). This programming paradigm is useful for solving discretized physical problems (or partial differential equations) that use some form of Fourier decomposition technique, such as those involved in the fast Poisson solver, fast biharmonic solver, circular decomposition method, and finite strip method. Each of these schemes involves three stages. First, a single problem is decomposed into a number of smaller and independent subproblems. In this stage global data (data required by all processes), normally read from an input file or generated by the initial master process, need to be broadcast to all processes, and process-specific data must be distributed to each involved process, except those data that can be generated by the involved process itself. In the second stage, the subproblems are solved in parallel, one problem on one processor in principle, without the need for interprocess communication. Once the solutions to all subproblems are obtained, they are combined/accumulated to yield the final result. Parallel reduction operations are involved in this final stage.
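The final accumulation stage can be pictured with a sequential simulation of a tree-structured sum. This is a sketch under assumptions that are not taken from the paper: partial results are held in a list indexed by process number, the number of processes is a power of 2, and at each step process p + stride sends its partial result to process p. No real message passing is performed.

```python
from typing import List, Tuple


def tree_reduce_sum(partial_results: List[float]) -> Tuple[float, int]:
    """Accumulate per-process partial results at process 0.

    Returns (total, number_of_steps). Within one step every merge involves
    a different pair of processes, so the merges of a step are independent
    and could run in parallel on a real system.
    """
    n = len(partial_results)
    if n < 1 or n & (n - 1):
        raise ValueError("number of processes must be a power of 2")
    acc = list(partial_results)    # work on a copy; acc[p] lives on process p
    steps = 0
    stride = n // 2
    while stride >= 1:
        for p in range(stride):
            acc[p] += acc[p + stride]   # process p + stride reports to p
        stride //= 2
        steps += 1
    return acc[0], steps


if __name__ == "__main__":
    total, steps = tree_reduce_sum([1, 2, 3, 4, 5, 6, 7, 8])
    print(total, steps)   # 36 3
```

The three steps for eight partial results mirror the flood operation in reverse, and the same structure applies to vectors if the addition is replaced by an element-wise sum.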

In the following, we present an application of our proposed MGT MMS model to the finite strip method for solving a special class of plate-bending problems on a networked system consisting of SUN4 workstations. Before presenting the physical problem for our experiments, we briefly describe this approach. The finite strip method [Cheu68] is a highly parallel approach with large-grain parallelism. It is a special type of finite element analysis that approximates the true solution of the displacement field using a combination of continuous harmonic functions that satisfy the boundary conditions in one direction and piecewise interpolation polynomials in the other. The method decomposes a single problem into m subproblems if m harmonic functions are employed in the approximation. In other words, each harmonic term results in a linear subsystem to solve, and the three stages mentioned above are all involved in the analysis. With the MGT model, runtime reconfiguration is not necessary in this approach.

Figure 4: A rectangular plate discretized into (n - 1) strips

3.1 Problem Statement

The physical problem we consider for our experiments is the static analysis of a 48 ft by 32 ft rectangular Mindlin plate (0.5 ft thick), simply supported on all four edges. The plate is subject to a 20-kip concentrated load acting downward in the z-direction at (x, y) = (12 ft, 8 ft), as shown in Figure 4. The material of the plate is assumed to be isotropic with a Young's modulus of 432000 kips/ft^2 and a Poisson ratio of 0.17. This problem is taken from [Chen94]. The mathematical modeling of Mindlin plate problems and the formulation of the algebraic equations using the finite strip method can be found in [Mind51, BeHi76] or in [Chen94]. We discretize the plate into 64, 128, ..., 2048 strips. Eight harmonic (Fourier) terms are employed to approximate the true solution (displacement field) for each discretization.

3.2 Implementations and Experiments

Our experiments consist of three different implementations: one for sequential execution and two for parallel execution. The two implementations for parallel execution are the standard MS model and our new MGT MMS model, as shown in Figures 5 and 6, respectively. We use the software package PVM3 (Parallel Virtual Machine, Version 3.3.9) [Geis93] to implement interprocess communications. In the standard MS model, we need only two computer programs: one for the master and the other for all the slaves. In the MGT model, we employ four computer programs, which differ mainly in the way the interprocess communication is handled. All processes generated at the same time step share the same code. This, of course, is not a requirement, and many variations are possible. For example, one may develop a separate program for each process, or combine the two programs for the intermediate processes into a single program. Extensive code sharing among processes, however, usually adds a certain degree of difficulty to the development of the code and may not always be preferable.

Figure 5: MS model with eight processes (temporal representation)

Figure 6: MGT MMS model with eight processes (temporal representation)

Our preliminary experimental results are shown in Table 1, where the CPU time includes both the user and system CPU time in seconds spent in the entire analysis, including the generation of the strip stiffness matrices, assembly of the linear subsystems, solution of each subsystem, and calculation of displacements. Each timing result represents the best timing observed in a series of six consecutive executions of the same code on the same data. The speedups, defined as the ratio of the time spent by the sequential code run on one workstation to that spent by the parallel code executed on the networked system configured with eight workstations, are shown for each individual discretization in Table 2. As seen from these two tables, both the MS and MGT (MMS) models perform well compared with sequential execution. It is also clear that the MGT model improves on the performance obtained by the standard MS scheme, as expected. The improvement is more than 13% in all cases. This is mainly due to the better exploitation of parallelism by the MGT than by the MS model.

Table 1: Performance of MS and MMS models (CPU time in seconds)

No. of strips      64      128     256     512     1024    2048
Sequential         0.350   0.700   1.43    2.85    5.68    11.4
Traditional MS     0.083   0.133   0.25    0.45    0.90    1.72
MGT (MMS)          0.067   0.117   0.217   0.383   0.767   1.50

Table 2: Performance of MS and MMS models (individual speedup)

No. of strips      64      128     256     512     1024    2048
Sequential         1.00    1.00    1.00    1.00    1.00    1.00
Traditional MS     4.22    5.26    5.72    6.33    6.31    6.63
MGT (MMS)          5.22    5.98    6.59    7.44    7.41    7.60
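The speedup definition can be applied directly to the CPU times of Table 1. The sketch below transcribes those timings into a dictionary (the variable and function names are illustrative) and recomputes the individual speedups, together with the relative gain of MGT over MS taken here as the ratio of the two parallel times minus one. That reading of "improvement" is an assumption, since the text does not spell out how the 13% figure is measured.

```python
# CPU times in seconds, transcribed from Table 1 (64 ... 2048 strips).
STRIPS = [64, 128, 256, 512, 1024, 2048]
CPU_SECONDS = {
    "Sequential":     [0.350, 0.700, 1.43, 2.85, 5.68, 11.4],
    "Traditional MS": [0.083, 0.133, 0.25, 0.45, 0.90, 1.72],
    "MGT (MMS)":      [0.067, 0.117, 0.217, 0.383, 0.767, 1.50],
}


def speedups(model: str) -> list:
    """Sequential time divided by the model's time, rounded to two decimals."""
    sequential = CPU_SECONDS["Sequential"]
    return [round(s / t, 2) for s, t in zip(sequential, CPU_SECONDS[model])]


def gain_over_ms_percent() -> list:
    """Relative gain of MGT over MS in percent: (t_MS / t_MGT - 1) * 100."""
    ms = CPU_SECONDS["Traditional MS"]
    mgt = CPU_SECONDS["MGT (MMS)"]
    return [round(100.0 * (a / b - 1.0), 1) for a, b in zip(ms, mgt)]


if __name__ == "__main__":
    print("strips:", STRIPS)
    print("MS    :", speedups("Traditional MS"))
    print("MGT   :", speedups("MGT (MMS)"))
    print("gain %:", gain_over_ms_percent())
```

Under that reading, the recomputed speedups agree with Table 2 and the smallest gain, at 128 strips, stays above 13%.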

4 Conclusions

The master-slave parallel programming model has been widely used in networked computing environments. This model, however, has two disadvantages: the master alone is responsible for spawning all the slave processes, and heavy communication overhead and delays arise when all slaves attempt to communicate with the single master simultaneously. In this paper we have presented a more efficient approach for the multilevel master-slave (MMS) model that implements the MS model at multiple levels. The MMS model is useful for large-scale master-slave problems because it not only allows for parallel creation of slave processes but also improves performance in flooding global data to, and merging results from, slave processes.

To demonstrate the effectiveness of our proposed approach, we have presented an application of the MGT model to the finite strip method for solving a particular class of plate-bending problems that arise in the area of structural analysis. The results of our experiments show that the MGT model improves the performance of parallel computation.

5 Acknowledgment

This work was supported in part by the Army Research Laboratory under Grant No. DAAL01-98-2-D065 and in part by the National Science Foundation under grant CCR-9896086.

References

[BeHi76] P.R. Benson and E. Hinton, A thick finite strip solution for static, free vibration and stability problems, Int. J. for Numer. Meth. in Eng., 10 (1976), pp. 665-678.

[Chen94] H.-C. Chen, Increasing parallelism in the finite strip formulation: static analysis, International Journal on Neural, Parallel & Scientific Computations, Vol. 2, No. 3 (1994), pp. 273-298.

[ChBy95] H.-C. Chen and V. Byreddy, Solving plate bending problems using finite strips on network workstations, submitted to Computers & Structures.

[Cheu68] Y.K. Cheung, The finite strip method in the analysis of elastic plates with two opposite simply supported ends, Proc. Inst. Civ. Eng., 40 (1968), pp. 1-7.

[EGSM94] G. Eisenhauer, W. Gu, K. Schwan, and N. Mallavarupu, Falcon - Toward Interactive Parallel Programs: The On-line Steering of a Molecular Dynamics Application, Proceedings of High-Performance Distributed Computing, August 1994.

[Geis93] A. Geist et al., PVM3 User's Guide and Reference Manual, ORNL/TM-12187, Oak Ridge National Laboratory.

[GeSS87] E.F. Gehringer, D.P. Siewiorek, and Z. Segall, Parallel Processing: The CM Experience, Digital Press, MA, 1987.

[Leig92] F.T. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes, Morgan Kaufmann, CA, 1992.

[Mind51] R.D. Mindlin, Influence of rotatory inertia and shear on flexural motions of isotropic, elastic plates, J. of Applied Mechanics, 18 (1951), pp. 31-38.

[PuSc90] J.A. Puckett and R.J. Schmidt, Finite strip method for groundwater modeling in a parallel computing environment, Eng. Comput., 7 (1990), pp. 167-172.

[Stra86] G. Strang, Introduction to Applied Mathematics, Wellesley-Cambridge, MA, 1986.

[Tuck84] A. Tucker, Applied Combinatorics (2nd ed.), John Wiley & Sons, NY, 1984.

