Abstract of thesis entitled
A Distributed Object Model for Solving Irregularly Structured Problems on Distributed Systems
submitted by
Sun Yudong
for the degree of Doctor of Philosophy at The University of Hong Kong
in March 2001
This thesis presents a distributed object model, MOIDE (Multithreading Object-oriented
Infrastructure on Distributed Environment), for solving irregularly structured problems. The
primary appeal of MOIDE is its flexible, collaborative infrastructure, which adapts to
various system architectures and application patterns. The model integrates object-oriented
and multithreading methodologies to set up a unified computing environment on
heterogeneous systems. The kernel of the MOIDE model is the hierarchical collaborative
system (HiCS), constructed from the objects that execute an application on the hosts, namely
the compute coordinator and the compute engines. This integration of object-oriented and
multithreading methodologies makes the structure of HiCS adaptive to hybrid hosts.
Lightweight threads are generated in the compute engines residing on SMP nodes, which is
more efficient and stable than a purely distributed-object scheme. The structure and work
mode of HiCS adapt to the computation and communication patterns of applications as well
as to the architecture of the underlying hosts. This adaptability is particularly beneficial for
the high-performance computing of irregularly structured problems.
A unified communication interface is built into MOIDE on top of a two-layer
communication mechanism that integrates shared-data access and remote messaging for
inter-object communication. This flexible and efficient mechanism addresses the high
communication cost that arises in irregularly structured problems. Autonomous load
scheduling is proposed as a new approach to dynamic load balancing in irregular
computation based on the MOIDE model. A runtime support system, developed in Java and
RMI, implements MOIDE as a platform-independent infrastructure to support parallel and
distributed computation on varied systems.
Four irregularly structured applications are developed to demonstrate the advantages of
the MOIDE model. The N-body method demonstrates the capability of the object-based
methodologies of the MOIDE model in implementing adaptive task decomposition and
complex data structures. A distributed tree structure with a partial-subtree scheme is devised
in the N-body method as a communication-efficient data structure to support the highly
data-dependent computation. The autonomous load scheduling approach in ray tracing
realizes high parallelism in the MOIDE-based asynchronous computation. The MOIDE
model provides adaptability to the CG method for solving sparse linear systems: the CG
method can be dynamically mapped onto heterogeneous hosts and utilize the unified
communication interface to enhance communication efficiency. The radix sort verifies the
flexibility of MOIDE-based computation, in which the grouped communication can
outperform MPICH on an SMP node and on a cluster of SMP nodes.
A Distributed Object Model for Solving Irregularly Structured Problems on
Distributed Systems
by
Sun Yudong
孫 昱 東
A thesis submitted in partial fulfillment of the requirements for the Degree of Doctor of Philosophy at The University of Hong Kong
March 2001
Declaration
I declare that this thesis represents my own work, except where due acknowledgement is
made, and that it has not been previously included in a thesis, dissertation or report submitted
to this University or to any other institution for a degree, diploma or other qualification.
Signed ………………………………………….
Sun Yudong
Acknowledgements
First, I would like to express my deepest gratitude to my supervisor, Dr. Cho-Li Wang, for
all his guidance throughout my study in the past years. He has always given me invaluable
encouragement and advice in completing the thesis.
I am deeply thankful to Dr. Francis C. M. Lau for his constructive advice and help with my
study. I am highly grateful to Dr. P.F. Liu for his important comments and suggestions on
my thesis revision.
Special thanks to the members of the System Research Group, Anthony Tam, Benny
Cheung, Matchy Ma, David Lee, and all other colleagues, whose help and cooperation have
been a strong support to my research.
Finally, I thank all the people who have helped me in all aspects to finish the thesis,
especially the technical and office staff in our department.
Contents
Declaration ………………………………………………………………………………….. i
Acknowledgements ..……………………………………………………………………….... ii
Table of Contents …………………………………………………………………………… iii
List of Figures ……………………………………………………………………………... vii
List of Tables ……………………………………………………………………………….. x
1 Introduction 1
1.1 Irregularly Structured Problems ……………...…………………….…...………….. 2
1.1.1 Specification …………………………………..……………………………... 2
1.1.2 Sample Applications ………………………………….……………………… 5
1.2 Distributed System and Distributed Object Computing ………..…….…………….. 8
1.2.1 Distributed System …………..………….……………………………………... 8
1.2.2 Distributed Object Computing …………….…..………………………………. 9
1.2.3 Object-Oriented Programming Language ………..……….………………….. 11
1.3 Motivation ...………………….………………………..…………………………... 11
1.4 Contributions ….……………..………………...…………………………………... 12
1.5 Thesis Organization ….……………..……………………………………….……... 14
2 MOIDE: A Distributed Object Model 15
2.1 Introduction ……………...………………………………………………………… 15
2.2 Basic Collaborative System ……………..……………………….………...…….... 18
2.2.1 System Structure ………………….…………………..…………………….. 18
2.2.2 System Creation ……………….…………..………………………………... 19
2.2.3 System Work ……………….……………………………………..……….... 20
2.2.4 System Reconfiguration ………….………..………………………………... 21
2.3 Hierarchical Collaborative System …….……………………………..…………… 24
2.3.1 Heterogeneous System …………..…………………….…………………….. 25
2.3.2 Hierarchical Collaborative System ………..………………….……………... 26
2.3.3 Task Allocation …..………..………………………………………….….….. 29
2.3.4 Unified Communication Interface …...…………………………….………... 30
2.4 Implementation …………………………..……….…………...……………….…... 32
2.5 Summary ……………….……………..…………………………………….……… 33
3 Runtime Support System 34
3.1 Overview ……….……..………………………………………...…….………….... 34
3.2 Principal Objects ………………..…………………………………….…………… 36
3.3 System Creation ………………..…………………………………….……………. 37
3.3.1 class StartEngine …………..………………………….……………. 37
3.3.2 Initialization of Compute Engine ………..…………………….……………. 40
3.4 System Reconfiguration ……………….………….………………….…….……… 41
3.4.1 class ExpandEngine …………….….……..…………….…….……… 41
3.4.2 class RecfgEngine ………………..………..…………….…....……... 43
3.5 Unified Communication Interface ………………………..……………………..… 44
3.6 Synchronization ……………………..…………….………………………………. 46
3.6.1 barrier()for Local Synchronization …………….……………………… 47
3.6.2 remoteBarrier()for Global Synchronization …..…...………………… 48
3.7 Load Scheduling …………………………………………...……………………… 49
3.7.1 Autonomous Load Scheduling …………….………………………………… 49
3.7.2 getTask() and getSubtask()Methods …….……..….…….………… 51
3.8 System Termination ……………………...……….……………………………….. 51
3.9 Summary ……………….……………………...…………………………………... 52
4 Distributed N-body Method in MOIDE Model 53
4.1 Overview ……………………………..………………………….…………….…... 53
4.2 Distributed N-body Method ……………………..……………….…………….….. 55
4.2.1 Distributed Tree Structure …………..…………………….………….……... 56
4.2.2 Computing Procedure ………..…………………………….…………….….. 62
4.2.3 Load Balancing Strategy ……………...…………………………….………. 65
4.3 Runtime Tests and Performance Analysis …………………...…………….……… 65
4.3.1 Tests on Homogeneous Hosts …………………...……………………….…. 65
4.3.2 Tests on Heterogeneous Hosts ……………...……...…………………….…. 72
5 Ray Tracing with Autonomous Load Scheduling 75
5.1 Overview ……………………………………………..…………….…..……….… 75
5.2 Autonomous Load Scheduling ………..…………………………….……..…….... 76
5.2.1 Background …………………………………………………….………….... 76
5.2.2 Group Scheduling ………………..…………………………….………...…. 77
5.2.3 Individual Scheduling ……………………………..……...…….…………... 79
5.3 Runtime Tests and Performance Analysis ……..…………..…………...…...…….. 81
6 CG and Radix Sort on Two-layer Communication 89
6.1 Conjugate Gradient …………………………………..………………….……........ 89
6.1.1 Algorithm of CG ……………………..……………..………………………. 90
6.1.2 CG Method in MOIDE Model …………..…………………..……………… 92
6.1.3 Runtime Tests and Performance Analysis ………..………………………… 94
6.2 Radix Sort ………………………………..………………………………….….... 100
6.2.1 Parallel Radix Sort …………………..………………….………….………. 101
6.2.2 Radix Sort in MOIDE Model ……………..………………….……………. 102
6.2.3 Runtime Tests and Performance Analysis …………....…..………….…….. 103
7 Related Work 112
7.1 Software Infrastructures on Distributed Systems ………..…….…………….…… 112
7.2 Programming Models on Cluster of SMPs ……..………………….………….….. 117
7.3 Methodologies for Irregularly Structured Problems ………..…….…………….… 119
7.4 Irregularly Structured Applications ………..…………….……………………….. 120
8 Conclusions 127
8.1 Summary of Research …………………………...…….…………………...…….. 127
8.2 Achievements and Remaining Issues ……..………………………….…....……... 130
8.2.1 Main Achievements ……………………………………………….………… 130
8.2.2 Remaining Issues ………………………………………………….………… 132
8.3 Future Work …………………..………………………………………..……...…. 133
References 136
List of Figures
2.1 The collaborative system built on P hosts ....…………………….…...……………..... 18
2.2 Registration tables ……………………………………………………………………. 20
2.3 Horizontal system expansion …………………………………………………………. 22
2.4 Vertical system expansion ……………………………………………………………. 23
2.5 Host replacement in a collaborative system ………………………………………….. 24
2.6 Cluster of SMPs ………………………………………………………………………. 26
2.7 Hierarchical collaborative system built on heterogeneous hosts ………………….….. 27
2.8 Two-layer communication mechanism ……………………………………………….. 31
3.1 Organization of MOIDE runtime support system …………………………………….. 35
3.2 Class description and relation of compute coordinator and compute engine classes …. 36
3.3 run() method of StartEngine class ……………………………………………... 37
3.4 getHost() method ………………………………………………………………….. 38
3.5 createEngine() method ………………………………………………………….. 39
3.6 invokeEngine() method ……………………………...…………………………... 40
3.7 Generate threads in a compute engine ………………………………………………… 41
3.8 run() method in ExpandEngine class ……………………………………………. 42
3.9 addEngine() method in ExpandEngine class ………………………………..… 42
3.10 checkEngine() method in RecfgEngine class ..………………….…………… 43
3.11 replaceEngine() method in RecfgEngine class …………………………….. 44
3.12 A hierarchical collaborative system with 12 pseudo-engines ………………………… 45
3.13 The location table of the pseudo-engines ……………………………………………... 46
3.14 Execution flow on multiple threads with local and global synchronization ………….. 47
3.15 Global synchronization method remoteBarrier() ………………………………. 48
3.16 Two-level autonomous load scheduling ………………………………………………. 50
3.17 ceaseEngine() method …………………………………………………………… 51
4.1 Barnes-Hut tree for 2D space decomposition ………………………………………….. 54
4.2 Space decomposition and distributed tree structure on four processors ……………….. 58
4.3 Partial subtrees built from subtree B …………………………………………………... 60
4.4 Space decomposition and the subtrees on a four-SMP cluster ………………………… 62
4.5 Execution flow of distributed N-body method on hierarchical collaborative systems …. 64
4.6 Execution time of the N-body method on four quad-processor SMP machines ….…… 66
4.7 Speedups of the N-body method ………………………………………………………. 67
4.8 The execution time on the cluster of 32 PCs …………………………………………... 67
4.9 Computation and communication time breakdowns on the cluster of SMPs …………. 68
4.10 Computation and communication time breakdowns of the full tree method …………. 69
4.11 Adaptive partial subtree vs. cut-off partial subtree on cluster of PCs ………………… 70
4.12 Execution time of the N-body method on heterogeneous hosts ………………………. 73
4.13 Speedups of the N-body method on heterogeneous hosts …………………………….. 73
4.14 Computation and communication time breakdowns on heterogeneous hosts ………… 74
5.1 View plane partitioning and allocation ………………………………………………... 78
5.2 Ray tracing with group scheduling ………………………………………………… 78-79
5.3 Ray tracing with individual scheduling ……………………………………………..… 79
5.4 Flow of ray tracing in group scheduling and individual scheduling …………………... 80
5.5 Execution times of the ray tracing in three load scheduling schemes …………………. 82
5.6 Speedups of the ray tracing in three scheduling schemes ……………………………... 82
5.7 Execution time breakdowns of group scheduling and combined scheduling …………. 83
5.8 Execution time breakdowns of combined scheduling …………………………………. 84
5.9 Execution time breakdowns of individual scheduling ………………………………… 85
5.10 The comparison of autonomous load scheduling and master/slave scheduling in ray
tracing ………………………………………………………………………………… 87
6.1 Algorithm of conjugate gradient (CG) method ………………………………………... 90
6.2 Vector/scalar reduction and transposition operations in parallel CG method ……….… 92
6.3 The reduction operation on 2 quad-processor SMP nodes ………………………….…. 93
6.4 The reduction and transposition operations on heterogeneous hosts ………………….. 94
6.5 Execution time of the CG method on homogeneous hosts ……………………………. 95
6.6 Speedup of the CG method on homogeneous hosts ………………………………….... 95
6.7 Execution time breakdowns of the CG method on homogeneous hosts ………………. 96
6.8 Execution time of the CG method on heterogeneous hosts …………………………… 97
6.9 Speedup of the CG method on heterogeneous hosts …………………………………... 97
6.10 Execution time breakdowns of the CG method on heterogeneous hosts ……………... 98
6.11 Execution time of the single-threading CG method …………………………………... 99
6.12 Speedup of the single-threading CG method …………………………………………. 99
6.13 Execution time breakdowns of the single-threading CG method ………………….… 100
6.14 Parallel radix sort ………………………………………………………………….…. 101
6.15 All-to-all scattering of elements in parallel radix sort on four processors …………... 102
6.16 Execution time breakdowns of the MOIDE-based radix sort ……………………….. 104
6.17 Execution time breakdowns of the single-threading radix sort ……………………… 105
6.18 Execution time breakdowns of the C & MPI radix sort program ……………………. 107
6.19 Execution time breakdowns of three radix sort programs: Java MOIDE-based (Java-M),
Java single-threading (Java-S) and C & MPI (C-MPI) in sorting 10M elements …… 107
6.20 Execution time breakdowns of two radix sort programs on four quad-processor SMP
nodes: Java MOIDE-Based program and C& MPI (C-MPI) in sorting 10M elements 109
6.21 Communication costs of two radix sort programs on four quad-processor SMP nodes:
Java MOIDE-based program (Java-M) and C & MPI program (C-MPI) in sorting 10M
elements ……………………………………………………………………………… 110
List of Tables
1.1 Characteristics of four irregularly structured problems ..………….…...…………….... 7
3.1 Major classes and methods implemented in MOIDE runtime support system …… 34-35
4.1 The times of the N-body method with/without load balancing (seconds) …………… 71
6.1 Execution time of the MOIDE-based radix sort (seconds) ………………………….. 103
6.2 Execution time of the single-threading radix sort (seconds) ………………………… 104
6.3 Execution time of the C & MPI radix sort (seconds) ………………………………... 106
7.1 Comparison of the related work in supporting heterogeneous computing …………... 116
7.2 Comparison of the programming models on cluster of SMPs ……………………….. 118
Chapter 1
Introduction
Irregularly structured problems are applications that have unstructured and/or
dynamically changing patterns of computation and communication. Such problems exist
widely in scientific and engineering areas ranging from astrophysics, fluid dynamics, sparse
matrix computation, and system modeling and simulation to computer graphics and image
processing. Irregularly structured problems are usually computation-intensive applications
with high potential parallelism. However, this parallelism is difficult to exploit fully because
of the irregularity in computation and communication. The irregularity is aggravated when
these problems are solved on distributed systems. It is hard to partition the irregular
computation evenly among the processors. Moreover, complicated and unstructured
inter-process communication emerges in distributed computing and restrains the parallelism
of the computation. Irregularly structured problems are therefore also
communication-intensive in distributed computing.
It is challenging to develop flexible and efficient methodologies for solving
irregularly structured problems on distributed systems. The methodologies should cover data
structures, task decomposition and allocation schemes, load balancing strategies, and
communication mechanisms to support high-performance distributed computing of
irregularly structured applications. Meanwhile, the methodologies should also take into
account the architectural features of the platforms on which an application runs, in order to
create an efficient mapping of the computation onto the underlying platforms.
My research concentrates on distributed object-oriented methods for high-performance
solutions of irregularly structured problems on distributed systems. A distributed
object-oriented model, MOIDE, has been built. The model sets up a flexible and efficient
software infrastructure for developing and executing irregularly structured applications on
varied distributed systems. The MOIDE model supports techniques that are effective for
solving irregularly structured problems with different irregular characteristics. A runtime
support system is developed to implement computations based on the MOIDE model.
1.1 Irregularly Structured Problems
1.1.1 Specification
A lot of scientific and engineering applications in various fields, including scientific
computing, computer graphics, physics, and chemistry, can be classified as irregularly
structured problems: for instance, sparse matrix computations in solving sparse linear
systems; finite element methods for solving partial differential equations; ray tracing
and radiosity in computer graphics; N-body methods in particle simulations; and system
modeling and simulation in many scientific, engineering, and social disciplines. Despite
their distinct features, these problems share the common characteristic of irregular data
distribution, which generates irregular computation patterns and thus incurs irregular
communication patterns. Irregularly structured problems can be described from different
aspects such as data structure, computation and communication patterns, unpredictable data
distribution, and workload [1-6]. Generally, irregularly structured problems can be
characterized by the following definition.
Definition
An Irregularly Structured Problem is an application whose computation and communication
patterns are input-dependent, unstructured, and evolving with the computation procedure.
The irregularity of irregularly structured problems makes it difficult to design
efficient parallel and distributed algorithms for them. The distribution of data and computing
workload cannot be exactly determined a priori, and it changes dynamically during
the computation. When solving an irregularly structured problem in parallel, one must deal
with the following issues:
(1) Irregular Data Representation
The irregular data distribution requires irregular data structures, e.g., special forms of
trees and graphs, to represent data and their relations [54,55,60]. It is usually difficult to
exploit the parallelism in computations on irregular data structures.
(2) Non-predetermined Load Scheduling
The workload of irregularly structured problems depends on the input data and the
dynamic evolution of the data during the computation. Because of the irregular and dynamic
data distribution, irregularly structured problems cannot be evenly partitioned and allocated
onto multiprocessors before execution. It is impractical to accurately measure the
computation workload of an irregularly structured problem in advance, and high data
dependency may exist in the problem, which further complicates load scheduling [69,70].
(3) Complicated Communication Requirements
Due to the irregular data structures, computation patterns, and data dependencies,
irregularly structured problems also present irregular and complicated communication
requirements. The high data dependencies in some irregularly structured problems generate
complicated inter-process communication patterns and demand high communication
bandwidth [67,68]. The unstructured communication may severely restrict the performance
of irregularly structured applications.
(4) Adaptive Algorithmic Requirement
The unpredictable computation and communication patterns call for adaptive algorithms
for solving irregularly structured problems. The algorithms should generate a task
decomposition in accordance with the specific patterns that emerge in the problem, in order
to attain high performance in distributed computing. The algorithms should also map the
irregular computations onto the underlying platforms in an adaptive way that matches the
hardware architecture [71].
Irregularly structured problems are mostly large-scale, computation-intensive, and
communication-intensive applications. The unstructured, dynamically evolving patterns
aggravate the computation and communication costs. In addition to the general strategies for
designing efficient parallel or distributed algorithms, such as scheduling computing tasks
evenly to balance the workload and minimizing communication, special techniques should
be devised for irregularly structured problems. The fundamental techniques for solving
irregularly structured problems include:
(1) Flexible Data Structures
The data structures should facilitate the effective representation and efficient
computation of irregular problems. They should be flexible enough to be partitioned in task
decomposition and to be reconstructed to reflect the evolution of the computation patterns,
and they should efficiently satisfy the data sharing required in parallel and distributed
computing. Naturally, different irregular applications require special data structures to
represent the data and the related computations. For example, a distributed tree structure is
designed for the distributed N-body method in chapter 4.
(2) Dynamic Load Scheduling
The unpredictable and evolving data distribution, and therefore computation workload,
of irregularly structured problems requires dynamic load scheduling to allocate the workload
at run time in distributed computing. Computing tasks should be allocated, or data
redistributed, among multiple processes to ensure dynamic load balance. The load
scheduling approach depends on the characteristics of the application: global load
redistribution is probably required for applications with high data dependency, while
runtime task allocation is suitable for applications with light data dependency. The space
re-decomposition scheme in the distributed N-body method in chapter 4 and the autonomous
load scheduling scheme in chapter 5 are two examples of dynamic load scheduling.
(3) Efficient Communication Methodologies
The irregular communication patterns of irregular applications usually produce high
communication overhead that degrades the overall performance. The communication
overhead is even more critical in a distributed computing environment, where
communication goes through long-latency message passing. It is essential to reduce
inter-process communication by maintaining data locality in the computation and to build an
efficient communication mechanism for distributed computing. In the MOIDE model
described in chapter 2, a two-layer communication mechanism is created by integrating
shared-data access and message passing on heterogeneous systems.
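The two-layer idea can be sketched as a single send operation whose lower layer chooses shared-data access for engines on the same node and remote messaging otherwise. All class and method names below are hypothetical illustrations, not the actual MOIDE API (the real interface is described in chapters 2 and 3):

```java
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch of a two-layer channel: local destinations receive
// data through an in-memory shared queue; remote ones go through message
// passing (an RMI call in a real system). Not the MOIDE implementation.
public class TwoLayerChannel {
    private final Map<String, Queue<Object>> localQueues = new HashMap<>();

    public synchronized void registerLocal(String engineId) {
        localQueues.put(engineId, new LinkedList<>());
    }

    // Upper layer: one send() call; the lower layer picks the mechanism
    // and here reports which one it used.
    public synchronized String send(String engineId, Object data) {
        Queue<Object> q = localQueues.get(engineId);
        if (q != null) {          // same node: shared-data access
            q.add(data);
            return "shared";
        }
        return remoteSend(engineId, data);  // different node
    }

    // Stand-in for a remote-messaging call to another host.
    private String remoteSend(String engineId, Object data) {
        return "remote";
    }

    public synchronized Object receiveLocal(String engineId) {
        return localQueues.get(engineId).poll();
    }
}
```

The point of the single entry method is that application code is written once against the upper layer, while co-located engines avoid the cost of message serialization.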
(4) Adaptive Computing Infrastructure
The task decomposition depends on the computation and communication patterns of an
application. It should decompose the computation evenly and allocate workload fairly to
each processor while reducing inter-processor communication. An ideal task decomposition
scheme should also take into account the architecture of the underlying system. The
algorithms of the applications should therefore be developed on an adaptive computing
infrastructure, so that adaptive task decomposition and allocation strategies can be
implemented for varied hardware architectures to generate a task distribution that makes full
use of the architectural features to attain high performance on the hardware platforms.
1.1.2 Sample Applications
As previously indicated, a lot of applications in various fields can be viewed as
irregularly structured problems. These applications possess different irregularities that call
for specific strategies. Four sample irregularly structured problems are studied in this thesis.
(1) N-body Problem
The N-body problem simulates the evolution of a system containing a great number of
bodies (particles) [7, 16]. The bodies, distributed in a space, exert forces on one another.
System evolution is the consequence of the cumulative force influences of all bodies: the
forces are determined by the interactions of the bodies and impel the bodies to move to new
positions, and they keep changing because of the continuous body motion. Many physical
systems exhibit this behavior, in fields such as astrophysics, plasma physics, molecular
dynamics, fluid dynamics, and radiosity calculation in computer graphics. The common
feature of these systems is the large range of precision in the data that bodies require to
compute their force influences on each other: a body needs progressively coarser data, at
lower frequency, from bodies that are farther away. The system evolution is a dynamic
procedure. The N-body problem has high irregularity in the data distribution and in the
computation of force influences, and heavy irregular data communication occurs in the
distributed N-body method.
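As a concrete illustration of the force computation, the following is a minimal direct-summation sketch of one 2D force step. It is not the tree-based method used later in this thesis (a tree code replaces the inner loop over distant bodies with a single cell approximation); the class name and array layout are illustrative:

```java
// Direct O(N^2) force summation in 2D. Each body is {x, y, mass}.
public class NBodySketch {
    static final double G = 6.674e-11;   // gravitational constant

    // Returns {fx, fy} accumulated on each body from all other bodies.
    public static double[][] forces(double[][] bodies) {
        int n = bodies.length;
        double[][] f = new double[n][2];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                if (i == j) continue;
                double dx = bodies[j][0] - bodies[i][0];
                double dy = bodies[j][1] - bodies[i][1];
                double r2 = dx * dx + dy * dy;
                double r = Math.sqrt(r2);
                // Force magnitude, then projected onto the unit vector.
                double mag = G * bodies[i][2] * bodies[j][2] / r2;
                f[i][0] += mag * dx / r;
                f[i][1] += mag * dy / r;
            }
        }
        return f;
    }
}
```

The inner loop touches every other body, which is exactly the all-to-all data requirement that makes the distributed version communication-heavy and motivates the distributed tree structure of chapter 4.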
(2) Ray Tracing
Ray tracing is a rendering algorithm in computer graphics that synthesizes an image
from the mathematical description of the objects that constitute it [8,20]. It generates
a 2D rendered image of a 3D scene by calculating the color contributions of the objects to
each pixel on a view plane (screen). In ray tracing, primary rays are emitted from a
viewpoint, passing through the pixels on the view plane and entering the space that encloses
the objects. When it encounters an object, a ray is reflected toward each light source to
check whether it is shielded from that light source; if not, the light contribution from that
light source onto the view plane is computed. The ray is also reflected from and refracted
through the object to spawn new rays, and the ray tracing procedure is performed recursively
on the new rays. Thus each primary ray may generate a tree of new rays. The rays are
terminated when they leave the space or by some pre-defined criterion (e.g., the maximum
number of levels allowed in a ray tree). If a ray hits nothing, no further computation is taken,
while rays hitting complex objects generate a bundle of dispersed rays that require more
rendering computation. In ray tracing, the generation of the rays is non-deterministic and
depends on the objects and light sources in the scene. The workload of the rendering
operations is totally irregular: the rendering of each pixel on the view plane has highly
diverse workload, and it is difficult to partition the view plane evenly a priori for parallel
ray tracing.
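The recursive structure described above, and the source of the irregular per-pixel workload, can be sketched as follows. `Scene.hit()` and the ray parameters are hypothetical placeholders; a real tracer would also compute intersections, shadow rays, and shading:

```java
// Skeleton of recursive ray tracing showing the ray tree and its cutoff.
public class RayTraceSketch {
    static final int MAX_DEPTH = 5;   // pre-defined ray-tree depth limit

    interface Scene { boolean hit(double[] origin, double[] dir); }

    // Returns how many secondary rays a single ray spawns. The count
    // depends entirely on what the ray hits, so it varies per pixel --
    // this is the irregular workload discussed above.
    public static int trace(Scene scene, double[] origin, double[] dir,
                            int depth) {
        if (depth >= MAX_DEPTH) return 0;     // terminate the ray tree
        if (!scene.hit(origin, dir)) return 0; // ray leaves the space
        // One reflected and one refracted ray spawned at the hit point.
        int spawned = 2;
        spawned += trace(scene, origin, dir, depth + 1);  // reflected ray
        spawned += trace(scene, origin, dir, depth + 1);  // refracted ray
        return spawned;
    }
}
```

A pixel whose primary ray misses everything costs one intersection test, while a pixel over a reflective object costs a whole tree of rays; that disparity is what the autonomous load scheduling of chapter 5 addresses.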
(3) Sparse Matrix Computations
Sparse matrix computations appear broadly in scientific and engineering computing,
from basic operations such as sparse matrix-vector multiplication to complex computations
such as iterative methods for solving sparse linear systems. A sparse matrix is an
unstructured data structure. Parallel sparse matrix computations are irregular because the
unstructured data density leads to unbalanced matrix computation [62,63], and unstructured
communication arises in parallel sparse matrix computations. For example, the iterative
methods for solving linear systems of the form Ax = b generate a sequence of
approximations to the solution vector x by iterating matrix-vector multiplication on the
coefficient matrix A. The computation and communication costs depend on the data density
of the sparse matrix and the algorithmic operations on it. The conjugate gradient (CG)
method is one of the most powerful iterative methods for large sparse linear systems [22].
The parallel CG method is based on a mesh topology of multiprocessors, and large vectors
are exchanged among the processors in each iteration. This heavy communication may
restrict the performance of the parallel CG method, so an efficient communication
mechanism should be built to raise communication efficiency and enhance the overall
performance.
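The kernel repeated in every CG iteration is the sparse matrix-vector product. A common way to store only the nonzeros is the compressed sparse row (CSR) layout, sketched below; this is a standard format used for illustration, not necessarily the representation adopted in this thesis:

```java
// Sparse matrix-vector multiply y = A*x in CSR form. The work per row
// follows the number of nonzeros in that row, so an uneven data density
// directly produces the unbalanced computation described above.
public class CsrSpmv {
    final int[] rowPtr;   // rowPtr[i]..rowPtr[i+1] index row i's nonzeros
    final int[] cols;     // column index of each nonzero
    final double[] vals;  // value of each nonzero

    public CsrSpmv(int[] rowPtr, int[] cols, double[] vals) {
        this.rowPtr = rowPtr;
        this.cols = cols;
        this.vals = vals;
    }

    public double[] multiply(double[] x) {
        double[] y = new double[rowPtr.length - 1];
        for (int i = 0; i < y.length; i++) {
            double sum = 0.0;
            for (int k = rowPtr[i]; k < rowPtr[i + 1]; k++) {
                sum += vals[k] * x[cols[k]];
            }
            y[i] = sum;
        }
        return y;
    }
}
```

In the parallel CG method, rows are partitioned across processors, so each multiply also requires exchanging the needed entries of x; this is the vector communication that dominates the cost.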
(4) Sorting
Sorting transforms a random sequence of elements into an ordered one. It is one of the
most common operations in computing and is accomplished by repeated comparison or
non-comparison manipulations of the element sequence. A parallel sorting algorithm
involves the redistribution of the elements among multiprocessors in each sorting round.
This data exchange has an irregular communication pattern and produces heavy
communication among all processors, so sorting algorithms can also be recognized as
irregular problems [48,66]. For example, radix sort is a non-comparison sorting algorithm
[24] that reorders a sequence of elements based on the integer value of bit sets. Radix sort
examines the elements r bits at a time: it sorts the elements according to the ith least
significant block of r bits during iteration i, and all elements are redistributed across the
multiprocessors based on their new positions in the global sequence in each iteration. This
data redistribution is an irregular all-to-all communication. Since a sorting algorithm
performs only simple computation, the performance is mainly determined by the
communication operations, and an efficient communication mechanism is also needed to
speed up the sorting procedure.
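The r-bit passes described above can be sketched sequentially as follows, assuming non-negative keys; the method name is a hypothetical helper, not the MOIDE implementation. The parallel version additionally scatters each bucket across processes after every pass, which is the all-to-all exchange:

```java
// Least-significant-digit radix sort over r-bit digits.
public class RadixSortSketch {
    // a: non-negative keys; r: bits per pass; keyBits: total key width.
    public static int[] sort(int[] a, int r, int keyBits) {
        int buckets = 1 << r;
        int[] cur = a.clone();
        for (int shift = 0; shift < keyBits; shift += r) {
            int[] count = new int[buckets];
            for (int v : cur) {               // histogram of this digit
                count[(v >>> shift) & (buckets - 1)]++;
            }
            int[] start = new int[buckets];   // prefix sums give each
            for (int b = 1; b < buckets; b++) // element's new position
                start[b] = start[b - 1] + count[b - 1];
            int[] next = new int[cur.length];
            for (int v : cur) {               // stable scatter pass
                next[start[(v >>> shift) & (buckets - 1)]++] = v;
            }
            cur = next;                       // parallel version: all-to-all
        }
        return cur;
    }
}
```

Each pass does only a histogram, a prefix sum, and a copy, which is why the distributed version is dominated by the redistribution rather than the computation.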
The four irregularly structured applications discussed above have unique irregular
characteristics that demand specific techniques to achieve high-performance computing.
These characteristics are summarized in Table 1.1 as four attributes: computation
complexity, communication complexity, data dependency, and synchronization requirement.
Applications   Computation   Communication          Data         Synchronization   Key Techniques
               Complexity    Complexity             Dependency   Requirement
N-body         High          High (all-to-all)      High         Yes               Distributed tree structure
Ray Tracing    High          Low                    None         No                Autonomous load scheduling
CG             High          High (point-to-point)  Medium       Yes               Two-layer communication
Radix          Low           High (all-to-all)      Low          Yes               Two-layer communication
Table 1.1 Characteristics of four irregularly structured problems
In Table 1.1, computation complexity is the computation workload of the application.
Communication complexity is the communication requirement that the application exhibits
in distributed computing. Data dependency is the level of correlation of the data in the
computation. Synchronization requirement indicates whether synchronization must be
enforced on the parallel computing procedure: no synchronization requirement implies a
totally asynchronous computation that can attain the highest parallelism; otherwise,
synchronization must be imposed to coordinate the parallel computations on multiple
processes. The key techniques in Table 1.1 are used to handle the irregularities of the
individual irregularly structured problems and to achieve high performance in distributed
computing. These techniques are implemented in the distributed object model MOIDE (see
chapter 2), and their effects will be demonstrated by the sample applications in chapters 4, 5
and 6.
1.2 Distributed System and Distributed Object Computing
1.2.1 Distributed System
A distributed system is constructed from computer nodes linked across networks. The computer nodes may be geographically distributed over a wide area. To run on a distributed system, an application is decomposed into a group of concurrent computing tasks that are dispatched to the distributed computer nodes, while the tasks still cooperate during the computation. This is the basic model of distributed computing. Distributed computing technologies are developing rapidly with the proliferation of smaller computers, such as PCs and workstations, and the widespread deployment of networks. Networked computers can support large applications by distributed computing without high-end standalone computers.
A distributed system can be a homogeneous system consisting of machines of the same type. More generally, it is a heterogeneous system composed of hybrid computers that have different architectures and computing power. It may accommodate PCs, workstations and multiprocessors in the same system. With the fast advancement of high-speed networks and powerful microcomputers and workstations, networked computers have been providing a cost-effective environment for high-performance parallel and distributed computing. The scope of networked systems is expanding quickly; nowadays networked computers go beyond traditional LAN-linked systems. The local systems at different sites can form a wide-area distributed system, from a campus-wide to an area- or nation-wide system, which provides strong computing power. This innovative system architecture is called a Cluster of Clusters or a Computational Grid [64,65].
A distributed system can be concurrently accessed by many users at different sites for different applications. A software infrastructure is required to integrate the distributed resources and provide a uniform interface for developing and running applications on the distributed system. The infrastructure should offer sufficient flexibility on heterogeneous platforms. It should hide the architecture of the platforms and create a uniform environment for application developers.
In response to these requirements for an integrated computing infrastructure, object-oriented methodology is recognized as an appropriate technique for constructing a distributed computing infrastructure. Object-oriented techniques are flexible enough to create objects on varied platforms and organize the objects into a computing infrastructure for executing applications. The distributed object infrastructure is highly adaptive to heterogeneous platforms.
1.2.2 Distributed Object Computing
Object-oriented technology is appropriate for computation on distributed systems [9, 10]. An object is a software unit that encapsulates data and behavior. Object-oriented computing can be viewed as providing an interface that specifies the functions and arguments of an object while encapsulating the details of the internal implementation. The interface hides the hardware characteristics from the applications. Applications can therefore be developed in a uniform model regardless of what platforms they will run on. Applications based on the object-oriented model can be properly mapped onto the platforms, and high performance can be achieved.
Distributed object computing is the integration of object-oriented computing and networking. It provides high flexibility to computations on distributed systems. Objects are created on distributed computer nodes when an application is submitted to run. The object on one host can interact with objects on remote hosts. An object can also be transferred to another host for the sake of load balancing or fault tolerance, and the object on one host can create remote objects on other hosts. Distributed object computing also supports the multithreading methodology, which implements lightweight computation on SMP nodes.
Remote method invocation is a more powerful communication mechanism in a distributed object system than ordinary message passing. By remote method invocation, an object can transfer not only data but also control to remote objects. It has the character of one-sided communication, in which the communication operation can be started by the sender or the receiver alone. One-sided communication contributes to the high asynchrony of distributed object computing. The polymorphism of a distributed object system guarantees the flexibility of object-oriented computing. A distributed object system can expand by incorporating objects created on new hosts at runtime to raise its computing power. In summary, a distributed object system should possess the following capabilities:
(1) Runtime Host Selection
The first step in creating a distributed object system is the selection of available hosts in a distributed system. The computers in a distributed system are simultaneously accessible by many users and applications. A distributed system is also a loosely-coupled and non-permanent system: local systems or hosts can join or leave the distributed system. When building a distributed object system, the appropriate hosts need to be selected based on their current states. Objects are then created on the selected hosts to perform distributed computing.
(2) Adaptive Task Mapping
In distributed computing, an application is decomposed into a group of tasks that are allocated to the distributed objects. The tasks should be properly mapped onto the distributed objects in accordance with the computation pattern, data locality, and the architecture of the target hosts. The task mapping should exploit the computing power of the hosts and minimize the data communication between the objects.
(3) Multithreading Computation
Multithreading computation is a lightweight method for parallel computing within an object. An object residing on an SMP node can spawn a group of threads internally to execute parallel computation on the multiprocessors. The group of threads in an object consumes fewer system resources than multiple objects, and the threads can maintain higher data sharing and tighter cooperation in parallel computing.
(4) Efficient Communication Mechanism
Distributed object computing is usually accompanied by heavy inter-object communication. An efficient communication mechanism is demanded to support flexible and fast communication. The communication mechanism should integrate the inter-object communication methods based on the physical communication paths to achieve high communication efficiency.
(5) Dynamic Object Creation and Load Scheduling
A distributed object system should be adaptive to the computing environment. It should
be able to dynamically create objects on new hosts to utilize the available computing
resources and improve the performance. It needs to support dynamic load scheduling to
balance the workload on the objects.
These capabilities are especially useful in solving irregularly structured problems. The unstructured computation and communication patterns of these problems create a strong demand for adaptive task mapping and dynamic load scheduling to produce a balanced distribution of the computation workload. Dynamic object creation is also helpful for developing adaptive algorithms on distributed systems. The efficient communication mechanism can mitigate the overwhelming overhead of unstructured communication. Multithreading computation helps smooth the unstructured computation and increase performance on SMP nodes.
1.2.3 Object-Oriented Programming Language
A distributed system may contain different types of platforms. Object-oriented computing on such a system should be executable on the heterogeneous platforms. Java is an architecture-neutral object-oriented language that offers plenty of services for distributed computing on heterogeneous platforms [11]. The services support multithreading and distributed object computing as well as a remote method invocation mechanism, and also enable dynamic object creation and redistribution. Java provides a homogeneous, language-centric view over a heterogeneous environment. Platform-independent Java bytecode is executable on any host on which a Java Virtual Machine is installed. Therefore Java classes and objects are portable from one host to any other host without recompilation. RMI (Remote Method Invocation) is a Java-based interface for distributed object-oriented computing [12,77]. It provides a registry to register object references and allows an object running on one host to make remote method invocations on an object on another host. Distributed objects can thus transfer data through both the arguments and the return value of a method. A distributed object system implemented in Java can be flexibly built on any platform at runtime. In this thesis, Java and RMI are used to implement the MOIDE model and the irregularly structured applications.
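A minimal sketch of this RMI style of interface definition follows; ComputeTask, EngineImpl, and the sum method are illustrative names, not the actual MOIDE classes.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;

// In RMI, any method callable from a remote host must be declared in an
// interface extending Remote, and must declare RemoteException.
interface ComputeTask extends Remote {
    double sum(double[] input) throws RemoteException;
}

// A server-side implementation. In a real system this object would be
// exported (e.g. by extending UnicastRemoteObject) and bound in the RMI
// registry, so that clients on other hosts can obtain its reference via
// Naming.lookup() and invoke sum() remotely, passing the argument array
// and receiving the return value as serialized data.
public class EngineImpl implements ComputeTask {
    public double sum(double[] input) {
        double s = 0.0;
        for (double x : input) s += x;
        return s;
    }
}
```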
1.3 Motivation
The motivation of my research comes from the recognition of all the previously discussed requirements. The goal of this research is to develop a generic computing model that can support efficient and flexible computing on various distributed systems. With regard to the main target applications, the model should provide powerful support for solving irregularly structured problems. Runtime support software is needed to integrate the distributed, heterogeneous hosts and subsystems into a uniform computing environment for the applications. This thesis presents a distributed object model, MOIDE, for solving irregularly structured problems on distributed systems. MOIDE stands for Multithreading Object-oriented Infrastructure on Distributed Environment. It integrates all the capabilities discussed in 1.2.2 and sets up a flexible infrastructure for developing and executing irregularly structured applications. Applications implemented in the MOIDE model can achieve high performance on heterogeneous systems.
1.4 Contributions
The research work in this thesis focuses on the development of the MOIDE model and related mechanisms to support the solution of irregularly structured problems on distributed systems. The model provides the foundation for developing the solutions. Proprietary techniques are designed based on the MOIDE model for specific irregularly structured applications. The major contributions of the thesis are as follows:
A distributed object model MOIDE is developed to establish a flexible and architecture-independent computing infrastructure on distributed systems. The model integrates the object-oriented and multithreading methodologies to provide a unified infrastructure for developing and executing varied applications, especially irregularly structured problems, on heterogeneous systems. The MOIDE model utilizes the polymorphism and location-transparency properties of the object-oriented and multithreading technologies. These properties facilitate the dynamic, adaptive creation and reconfiguration of the distributed computing infrastructure on varied system architectures and resources. A hierarchical collaborative system is proposed as the fundamental software architecture of the model. The hierarchical collaborative system has a two-level structure that is adaptive to the architecture of the underlying hosts and the patterns of irregular applications. It supports dynamic system creation and reconfiguration to cope with the uncertainty of resource availability and the irregular computation. MOIDE-based computation can attain high performance on the available resources.
A unified communication interface is constructed by integrating local shared-data access and remote messaging to support architecture-transparent and efficient inter-object communication. The integrated two-layer communication based on the object-oriented technique provides a simple, flexible and extensible communication mechanism for transmitting complex data structures and control information between distributed objects. It can effectively improve the communication efficiency of HiCS in solving irregularly structured problems. MOIDE-based applications can be developed in an architecture-independent mode by calling the unified communication interface. The applications are adaptively mapped onto the underlying hosts at runtime by forming a hierarchical collaborative system and creating the communication mechanism that matches the underlying architecture.
Generic task allocation strategies are proposed for the task decomposition and allocation in different irregularly structured problems. For example, the strategy of initial task decomposition with runtime repartition can be applied to applications with high data dependency, while dynamic task allocation can be used for applications with low data dependency. The polymorphism and encapsulation of object-oriented methodologies enable adaptive task allocation that matches both the application pattern and the system architecture.
A runtime support system called MOIDE-runtime is developed to implement distributed computing based on the MOIDE model. It implements the functions and mechanisms required in MOIDE-based computing, including host selection, hierarchical collaborative system creation and reconfiguration, the unified communication interface, object synchronization, and autonomous load scheduling. Implemented in Java, MOIDE-runtime builds a platform-independent, extensible, unified computing environment on a wide range of distributed systems, e.g., clusters of SMP nodes, clusters of single-processor PCs, and clusters of hybrid hosts.
A distributed tree structure is designed for the N-body problem. It is a distributed variation of the Barnes-Hut tree. The partial subtree scheme is proposed as a communication-efficient solution to the data sharing of the distributed tree structure. It differs from the tree structures in other parallel N-body methods on shared-memory or distributed-memory systems. The distributed tree structure is supported by the object-oriented approach based on the MOIDE model. The object-oriented techniques facilitate the construction and transmission of the tree structures in a distributed environment.
Autonomous load scheduling is proposed as the dynamic task allocation approach for highly asynchronous computation. The autonomous load scheduling method is supported by the flexible, one-sided remote method invocation in object-based communication. It can exploit the high parallelism in some irregular applications, make full use of the computing power of the resources, and automatically achieve dynamic load balancing with low overhead. Autonomous load scheduling is used in the ray tracing method. One of the autonomous load scheduling schemes, individual scheduling, is recognized as the better scheme for the highly asynchronous ray tracing procedure.
Grouped communication is adopted in the unified communication interface to reduce the heavy and irregular communication cost of irregularly structured applications. It is developed with the integration of the object-oriented and multithreading methodologies in the MOIDE model. The grouped communication approach is used to fulfill the large-volume all-to-all scattering operation in the radix sort, where the grouped scatter outperforms the corresponding operation in MPICH.
1.5 Thesis Organization
In the following text, chapter 2 addresses the distributed object model MOIDE, with a focus on the infrastructure of the hierarchical collaborative system. Chapter 3 describes the runtime support system MOIDE-runtime. Chapter 4 presents the distributed N-body method based on the MOIDE model, with emphasis on the distributed tree structure. Chapter 5 discusses the distributed ray tracing methods based on autonomous load scheduling. Chapter 6 presents the MOIDE-based CG and radix sort methods to illustrate the architecture-independent feature of MOIDE-based computation and the efficiency of the two-layer communication mechanism. Chapter 7 covers the related work and its comparison with my work. Chapter 8 concludes the thesis.
Chapter 2
MOIDE: A Distributed Object Model
2.1 Introduction
As discussed in 1.2.2, object-oriented technology is suitable for implementing computations on distributed systems. A distributed system can be composed of geographically scattered hosts. The computing resources in the system are accessible to a large number of users, and applications can be submitted to run on any of the hosts in the system simultaneously. To attain fair utilization of the resources and effectively organize the computations on them, a software facility is demanded to support the various computing requirements on distributed systems.
MOIDE (Multithreading Object-oriented Infrastructure on Distributed Environment) is a distributed object model that supports high-performance computing, especially of irregularly structured problems, on distributed systems. It establishes a flexible infrastructure that combines distributed object and multithreading methodologies to support parallel and distributed computing on the varied platforms of a distributed system.
The basic components of the MOIDE model are a group of objects dynamically created at runtime on the hosts selected to run an application. The hosts may be scattered over a wide range of a distributed system. The objects on the hosts are organized into a cooperative working system, called the collaborative system, and execute the application together. The collaborative system is a runtime infrastructure, built when an application is submitted to run. It provides the mechanisms that allow the distributed objects to interact with one another, and it is responsible for coordinating the computing procedures on the distributed objects.
The construction of a collaborative system is associated with the resources in the underlying distributed system. It is always built on the most appropriate hosts, i.e., the hosts with higher computing power and lower workload. For a distributed system containing SMP nodes, multiple threads will be generated inside the object on each SMP node. In this case, the collaborative system possesses a hierarchical structure: the objects, one per SMP node, form the upper level of the collaborative system, and the threads in the objects form the lower level. Communication takes place on the two levels in different modes. The combination of distributed objects and threads gives the collaborative system high adaptability to heterogeneous system architectures.
The distributed object-oriented and multithreading features of the MOIDE model have the following advantages.
(1) Architectural Transparency
The MOIDE model erects a uniform computing infrastructure for developing and executing applications. The architecture of the underlying distributed system is transparent to the applications. An application does not need to know on what hosts it runs; it just requests a certain number of processors. The processors may be supplied by uniprocessor hosts, SMP nodes, or hybrid hosts. The application is developed on an identical infrastructure no matter what hosts it will be executed on. When running the application, the runtime support system creates distributed objects, and generates threads inside the objects, depending on the specific architecture of the underlying hosts.
(2) Combined Programming Methodology
MOIDE combines the distributed object computing and multithreading methodologies, making the model adaptive to heterogeneous architectures. A group of threads is generated within an object to support parallel computing on an SMP node. These threads cooperate tightly within the group and can share computation workload and data in the object. The combined programming methodology is more efficient and resource-saving than a purely distributed object approach, in which multiple objects would be created on an SMP node to perform parallel computation on its processors. The threads can work in different modes depending on the computation pattern of an application. An application is decomposed into tasks; the threads in an object can work in cooperative mode to cooperatively process one task, or in independent mode, in which each thread independently processes its own task.
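The independent work mode can be sketched with plain Java threads, as below; the class name and the per-task computation are illustrative, not the actual MOIDE implementation.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Independent work mode: each thread inside the object repeatedly claims a
// task index from a shared atomic counter and writes its result into the
// shared results array, so the threads share data directly in local memory.
public class ThreadedEngine {
    static int[] run(int numTasks, int numThreads) {
        int[] results = new int[numTasks];
        AtomicInteger next = new AtomicInteger(0);
        Thread[] workers = new Thread[numThreads];
        for (int t = 0; t < numThreads; t++) {
            workers[t] = new Thread(() -> {
                int i;
                while ((i = next.getAndIncrement()) < numTasks) {
                    results[i] = i * i;   // stand-in for the real task computation
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) w.join();   // wait for all threads to finish
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return results;
    }
}
```

The shared atomic counter also gives a simple form of load balancing: a faster thread automatically claims more tasks.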
(3) Integrated Communication Mechanism
The MOIDE model supports two communication methods. The distributed objects interact with each other by remote messaging, a data transmission approach between distributed objects based on remote method invocation; remote messaging is more powerful than message passing. The threads inside an object can access shared data through local memory; shared-data access has low communication expense. The MOIDE model integrates the two communication methods into a two-layer communication mechanism, and provides a unified communication interface on top of the mechanism to hide the two-layer communication paths from the application. At the application level, communication between objects or threads calls the same interface regardless of the physical path. A thread can communicate with a local thread in the same object or a remote thread in a remote object through the same communication interface. The runtime support system carries out the communication through the proper communication path, either shared-data access or remote messaging.
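A minimal sketch of such a unified interface is shown below; the class name, the send() method, and the destination identifiers of the form objectId:threadId are illustrative assumptions, not the actual MOIDE API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a unified communication interface: callers always invoke send(),
// and the runtime chooses the physical path -- shared-data access for threads
// in the same object, remote messaging for threads in remote objects.
public class UnifiedComm {
    private final String localObject;                        // id of this object
    private final Map<String, Object> sharedData = new ConcurrentHashMap<>();

    UnifiedComm(String localObject) { this.localObject = localObject; }

    // dest has the illustrative form "objectId:threadId".
    public String send(String dest, Object data) {
        String destObject = dest.split(":")[0];
        if (destObject.equals(localObject)) {
            sharedData.put(dest, data);                      // shared-data access
            return "local";
        } else {
            remoteMessage(dest, data);                       // remote messaging
            return "remote";
        }
    }

    // Stand-in for a remote method invocation to the destination object;
    // the real system would look up the reference and call it via RMI.
    private void remoteMessage(String dest, Object data) { }
}
```

The caller never names the path; the dispatch decision is entirely inside the interface, which is what makes the application code architecture-independent.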
(4) High Asynchrony
MOIDE supports highly asynchronous computation of the objects. The distributed objects can execute their computing tasks asynchronously unless interaction is required between them. MOIDE also implements asynchronous inter-object communication. Ordinary message-passing communication is fulfilled by the cooperation of a pair of sender and receiver: communication operations must be explicitly performed on both sides, with the sender issuing a send operation and the receiver issuing a receive operation. The MOIDE model allows one-sided communication in which only the sender or the receiver starts the communication. An object can send data to another object by writing the data directly to a variable in the destination object via remote method invocation. Similarly, a receiver can fetch data from the source object by directly reading the data value there. The communication can be conducted at any time without the explicit participation of the other side. One-sided communication contributes to the high asynchrony of the computation on distributed objects.
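The one-sided write described above can be sketched as a destination-side variable exposed through methods; in the real system put() would be invoked remotely via RMI, while here the class names and methods are illustrative and the call is local.

```java
// Sketch of one-sided communication: the sender writes data directly into a
// variable of the destination object via a (remote) method call, without the
// receiver ever issuing an explicit receive operation.
public class Mailbox {
    private volatile double[] data;   // variable written by remote senders

    // Invoked by the sender (via remote method invocation in the real system);
    // the receiver takes no part in this operation.
    public void put(double[] values) { data = values; }

    // The receiver reads the latest value whenever it needs it.
    public double[] read() { return data; }
}
```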
This chapter describes the fundamental structure and functionality of the MOIDE model. The runtime support system that implements computation in the MOIDE model is addressed in chapter 3.
2.2 Basic Collaborative System
The kernel of the MOIDE model is the collaborative system, a runtime software infrastructure whose fundamental components are a group of objects distributed on the hosts. The collaborative system is formed when an application is submitted to run, and it is terminated when the execution has finished. The collaborative system can also be reconfigured during the computation to match the runtime requirements of the computation and the states of the underlying hosts.
2.2.1 System Structure
A collaborative system consists of a group of distributed objects. Fig 2.1 shows the basic structure of a collaborative system built on P hosts. The object on host 0 is called the compute coordinator. It is the first object created in the system and acts as the manager of the system.
Fig 2.1 The collaborative system built on P hosts
The compute coordinator is the initiator of the collaborative system. It starts on the host where the application is submitted. It creates all remote objects on other hosts and allocates the computing tasks to them. It also coordinates the computing procedures on all objects and conducts the system-wide synchronization. The other objects accept and execute the assigned computing tasks; hence they are called compute engines. The collaborative system has a registration mechanism that contains the references to the distributed objects in the
system. The distributed objects can locate each other by referring to the registration mechanism: an object can find the reference of another object when it wants to communicate with that object. The interaction mechanism provides the communication interface to the distributed objects and implements all inter-object communication. The host selection mechanism handles the detection and selection of hosts when a collaborative system is to be created or reconfigured.
2.2.2 System Creation
A collaborative system is constructed through the cooperation of all objects. The compute coordinator is the first object, started on one host; it then starts the remote objects on other hosts. The creation of a collaborative system is accomplished in four steps.
(1) Compute coordinator start
The compute coordinator is started on the host where the application is submitted.
(2) Host selection
If the application requests more than one processor, the compute coordinator searches for available hosts in the underlying system to supply the required number of processors. There may be many available hosts. The compute coordinator follows the host selection policy to choose the most appropriate hosts, i.e., those that can provide high performance. It examines the computing power and current states of the hosts by referring to the information provided by the host selection mechanism. The host selection policy can be expressed by the following priority:

    priority_i = performance_i / workload_i

where priority_i is the priority of host i in the host selection, performance_i is the computing power of host i, and workload_i is the current workload on host i. Precedence is given to the hosts with higher priority.
(3) Compute engine creation
The compute coordinator starts the creation of the objects on the selected hosts, one object per host.
(4) Object registration
Each object registers itself with the registration mechanism. The registration mechanism maintains the registration table, which stores the reference of every object. An object can get the reference to another object by looking up the registration table, as Fig 2.2 shows. Thus an object can locate other objects and communicate with them.
As Fig 2.2 shows, the registration table is duplicated on each object. The figure only displays the registration tables on the compute coordinator and compute engine 1. The table contains the following items: the logical name of the compute engine (CE), the name of its residing host (HOST), and the reference to the compute engine (REF). All of the registration tables together constitute the registration mechanism. When an object wants to communicate with another object, it looks up the table and gets the reference to that object. Then the object can perform remote method invocation on the other object through that reference.
Fig 2.2 Registration tables
The collaborative system has been built up when all objects have registered themselves with the registration mechanism. Then the compute coordinator and the compute engines execute the application together.
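The host selection policy of step (2) can be sketched as a ranking by the priority performance_i / workload_i; the Host class and select method below are illustrative names, not the actual MOIDE data structures.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of the host selection policy: hosts are ranked by
//     priority_i = performance_i / workload_i
// and the n hosts with the highest priority are chosen.
public class HostSelection {
    static class Host {
        final String name;
        final double performance;   // computing power of the host
        final double workload;      // current workload on the host
        Host(String name, double performance, double workload) {
            this.name = name; this.performance = performance; this.workload = workload;
        }
        double priority() { return performance / workload; }
    }

    // Return the n most appropriate hosts, highest priority first.
    static List<Host> select(List<Host> candidates, int n) {
        List<Host> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble(Host::priority).reversed());
        return sorted.subList(0, Math.min(n, sorted.size()));
    }
}
```

Note that a host with modest computing power but an idle state can outrank a powerful but heavily loaded one, which is exactly the intent of the policy.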
2.2.3 System Work
To execute a computation on the collaborative system, the compute coordinator first decomposes the application into computing tasks. The computing tasks are allocated to all compute engines; the compute coordinator itself also works as a compute engine. At the
same time, it performs the required coordination of the computing procedures on all compute engines.
For a collaborative system having m compute engines, let power_i be the computing power of compute engine i, which is determined by the computing power of the underlying host. Assume the overall workload of an application is W; the task allocated to compute engine i should then have workload w_i, where

    w_i = W * power_i / Σ_k power_k
The compute coordinator starts the computing procedure on the compute engines by assigning the tasks to them. The compute engines process the tasks asynchronously except where data communication and synchronization are required. The compute coordinator is responsible for making the global synchronization of all compute engines at each synchronization point. When the computation has finished, the compute coordinator ceases all compute engines, and the collaborative system is thus terminated.
The computing procedure on a collaborative system is application-dependent. MOIDE is an infrastructure to support the implementation of applications with different computation and communication requirements.
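The proportional workload split w_i = W * power_i / Σ_k power_k can be sketched as follows; the class and method names are illustrative.

```java
// Sketch of proportional task allocation: compute engine i receives a share
// of the total workload W proportional to its computing power power_i.
public class TaskAllocation {
    static double[] allocate(double totalWorkload, double[] power) {
        double sum = 0.0;
        for (double p : power) sum += p;          // Σ_k power_k
        double[] w = new double[power.length];
        for (int i = 0; i < power.length; i++) {
            w[i] = totalWorkload * power[i] / sum;   // w_i = W * power_i / Σ_k power_k
        }
        return w;
    }
}
```

For example, with W = 100 and engine powers {1, 1, 2}, the allocated workloads are {25, 25, 50}.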
2.2.4 System Reconfiguration
The collaborative system is also flexible enough to conduct runtime system reconfiguration to improve the system performance. The computing power of the collaborative system can be enhanced by adding more compute engines on new hosts. The computing task on an overloaded host can be moved to another available host.
The MOIDE model has the flexibility to create a new compute engine at any time on any host. The distributed compute engines are linked together via the registration mechanism. The registration table contains the references to all compute engines and can be updated by inserting new references or removing old ones. Therefore a new compute engine can join the collaborative system, and an old compute engine can be removed from the system.
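Such an updatable registration table can be sketched as a hash map keyed by the logical engine name; the class, the Entry fields (HOST, REF), and the use of Object in place of the real RMI remote reference type are illustrative, not the actual MOIDE implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the registration mechanism: a table mapping the logical name of
// each compute engine (CE) to its residing host (HOST) and object reference
// (REF). The table is duplicated on every object in the collaborative system.
public class RegistrationTable {
    static class Entry {
        final String host;
        final Object ref;   // in the real system, an RMI remote reference
        Entry(String host, Object ref) { this.host = host; this.ref = ref; }
    }

    private final Map<String, Entry> table = new HashMap<>();

    // Each object registers itself when it joins the collaborative system.
    public void register(String ceName, String host, Object ref) {
        table.put(ceName, new Entry(host, ref));
    }

    // Look up the reference to another compute engine before communicating.
    public Object lookup(String ceName) {
        Entry e = table.get(ceName);
        return e == null ? null : e.ref;
    }

    // A departing engine's entry is removed from the table.
    public void remove(String ceName) { table.remove(ceName); }

    // Host replacement: the new engine's reference replaces the old entry
    // under the same logical name, so the system's logical structure is kept.
    public void replace(String ceName, String newHost, Object newRef) {
        table.put(ceName, new Entry(newHost, newRef));
    }
}
```

Keying the table by the logical name is what lets host replacement preserve the logical structure: the name stays, only HOST and REF change.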
2.2.4.1 System Expansion
Generally there are many hosts available in a distributed system, and a collaborative system is established on the selected hosts. Moreover, irregularly structured problems have non-predetermined computation patterns. For instance, an application initially runs on P compute engines but may generate an extremely high computation workload during the execution. If the distributed system has extra hosts available, the collaborative system can select additional hosts at runtime to share the computation workload. The collaborative system is expanded by incorporating new compute engines on new hosts. The new compute engines work in the same way as the old compute engines in the collaborative system. This is horizontal system expansion, shown in Fig 2.3.
Fig 2.3 Horizontal system expansion
During the execution of an irregular application, the runtime computation workload may become highly imbalanced among the compute engines, and a heavily loaded compute engine will become a bottleneck. To alleviate the bottleneck, an extra compute engine can be attached to the overloaded compute engine. The attached compute engine works under the overloaded one to share its workload; it is the assist-engine of the overloaded one and is visible only to its parent compute engine. This is vertical system expansion.
Fig 2.4 shows the vertical system expansion with an assist-engine attached to compute engine
2.
Fig 2.4 Vertical system expansion
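A minimal sketch of vertical expansion, assuming a hypothetical ComputeEngine class with a workload counter; in the real system the assist-engine would be a remote object created on another host, not a local one:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of vertical expansion: an overloaded compute engine
// attaches an assist-engine that is visible only to its parent.
// All class and field names are hypothetical, not from the thesis code.
public class VerticalExpansion {
    static class ComputeEngine {
        final String host;
        final List<ComputeEngine> assistEngines = new ArrayList<>();
        int workloadUnits;

        ComputeEngine(String host, int workloadUnits) {
            this.host = host;
            this.workloadUnits = workloadUnits;
        }

        // Attach an assist-engine and hand over half of the local workload.
        ComputeEngine attachAssistEngine(String newHost) {
            ComputeEngine assist = new ComputeEngine(newHost, workloadUnits / 2);
            workloadUnits -= assist.workloadUnits;
            assistEngines.add(assist); // registered only with this parent
            return assist;
        }
    }

    public static void main(String[] args) {
        ComputeEngine engine2 = new ComputeEngine("host2", 100);
        ComputeEngine assist = engine2.attachAssistEngine("newHost");
        System.out.println(engine2.workloadUnits + " + " + assist.workloadUnits);
    }
}
```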
2.2.4.2 Host Replacement
The hosts in a distributed system are shared resources accessible by many users and
applications, so the states of the hosts, such as their workload, keep changing over time.
Only idle or lightly-loaded hosts are suitable to accommodate a collaborative system.
At the system creation stage, the compute coordinator selects lightly-loaded hosts.
However, the workload on a host may later rise to a high level, caused not only by the
computation of this collaborative system but also by jobs from other users. Overloaded hosts
will slow down the collaborative computation. If idle or lightly-loaded hosts are available in
the distributed system, they can be used to replace the overloaded ones. In a host replacement,
a new compute engine is created on the newly-selected host. The new compute engine takes
over the computing task from the compute engine on the overloaded host and replaces the
role of the old compute engine in the collaborative system, and the old compute engine is
terminated. The registration table is updated accordingly to reflect the change of compute
engines: the reference to the new compute engine replaces
the entry of the old compute engine in the table. The logical structure of the collaborative
system remains unchanged after the replacement, and the computing procedure continues on
the collaborative system as before; the system size also stays the same. Fig 2.5 shows an
example of host replacement in which the compute engine on Host 2 is replaced by a new
compute engine on New Host 2.
Fig 2.5 Host replacement in a collaborative system
The system reconfiguration is managed by the compute coordinator, except for vertical
system expansion, which is conducted by the parent compute engine alone. The host selection
mechanism provides real-time information about the available hosts; the compute coordinator
reads this state information and performs the replacement operations when necessary.
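Host replacement can be sketched as an update of the registration table in which the new engine's reference takes over the old engine's entry. All names here are illustrative; the real operation also transfers the computing task to the new engine and terminates the old one:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of host replacement: a new engine on a lightly-loaded host takes
// over the table entry of the engine on an overloaded host, so the logical
// structure and the system size stay unchanged.
public class HostReplacement {
    public static Map<String, String> replace(Map<String, String> table,
                                              String overloadedHost,
                                              String newHost) {
        Map<String, String> updated = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : table.entrySet()) {
            if (e.getKey().equals(overloadedHost)) {
                // The new engine replaces the old engine's entry in place.
                updated.put(newHost, "rmi://" + newHost + "/engine");
            } else {
                updated.put(e.getKey(), e.getValue());
            }
        }
        return updated;
    }

    public static void main(String[] args) {
        Map<String, String> table = new LinkedHashMap<>();
        table.put("host1", "rmi://host1/engine");
        table.put("host2", "rmi://host2/engine");
        System.out.println(replace(table, "host2", "newHost2"));
    }
}
```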
2.3 Hierarchical Collaborative System
The basic collaborative system is built solely from compute engine objects on single-
processor hosts, and all inter-object communication is accomplished via remote messaging. For
a collaborative system on a heterogeneous system, the multithreading methodology can be
incorporated into the compute engines to suit the hierarchical architecture.
Nowadays symmetric multiprocessors (SMPs) built from off-the-shelf microprocessors
are widely used as cost-effective multiprocessor machines. Networked SMP machines
(clusters of SMP nodes) can provide high performance for large-scale computing, and a
cluster of SMP nodes is considered a low-cost substitute for high-end supercomputers [25]. A
distributed system may consist of both SMP and single-processor nodes, and the SMP nodes
may contain different numbers of processors with varied computing power. A heterogeneous
system like this therefore has a hierarchical architecture: system-wide loosely-coupled nodes,
and tightly-coupled processors inside each SMP node. This heterogeneity requires the
MOIDE model to be adaptive to the architecture of the underlying hosts. The basic
collaborative system of 2.2 is therefore expanded into the hierarchical collaborative
system (HiCS).
2.3.1 Heterogeneous System
Consider building a collaborative system on heterogeneous hosts. The basic collaborative
system structure is still applicable in this case: a P-processor SMP node can be treated as P
individual hosts, with P compute engines created on it, and all compute engines on the same
or different SMP nodes simply form a basic collaborative system. This, however, is not a
proper approach, because it does not exploit the advantages of the SMP node. For example,
the objects residing on an SMP node can communicate with one another via shared-data
access, a more efficient communication method than remote messaging. The communication
mechanism on a heterogeneous system should therefore integrate swift shared-data access
with widely-applicable message passing. Multithreading programming techniques can also be
used on an SMP node. A thread is a lightweight entity that occupies fewer system resources
than an object, so multithreading is a cost-effective methodology for parallel computation on
an SMP node.
Fig 2.6 shows the architecture of a heterogeneous system. It is a two-level hierarchical
structure: an SMP node is composed of tightly-coupled processors connected by shared
memory modules, and all nodes are linked across the network into a loosely-coupled cluster.
The relations between the objects residing on the nodes can likewise be treated at two levels.
The objects on the same SMP node are tightly related to each other; these sibling objects
can cooperate more tightly in the computation than objects on different nodes. The sibling
objects can be created as multiple threads to take advantage of multithreading techniques
such as efficient data sharing and low resource consumption.
Fig 2.6 Cluster of SMPs
Though message passing is the usual communication method on distributed systems,
the threads on the same SMP node can interact with each other by shared-data access, because
they are instantiated from the same object and can access the public data in that object.
Shared-data access is a fast communication path on an SMP node. A two-layer communication
mechanism can thus be built on heterogeneous hosts, integrating local shared-data access and
remote messaging at the two levels. With this two-layer communication mechanism, the
performance of communication-intensive applications can be improved.
To realize the adaptability of MOIDE model, the basic collaborative system in 2.2 is
modified to be a hierarchical structure that incorporates distributed object and multithreading
methodologies. This is the hierarchical collaborative system structure.
2.3.2 Hierarchical Collaborative System
The hierarchical collaborative system (HiCS) is an infrastructure expanded from the basic
collaborative system. In HiCS, the compute engine on an SMP node spawns a group of
threads, one per processor on the node, and the threads run on the processors in parallel.
Multithreading supplements the distributed object model to make the collaborative system
adaptive to the hierarchical architecture of hybrid hosts. The computation is more efficient
when performed
by the threads on an SMP node. The two-layer communication mechanism implements
efficient communication on HiCS.
Fig 2.7 shows the structure of the hierarchical collaborative system. The compute engine
on an SMP node can generate multiple threads. For example, if Host 1 in Fig 2.7 is an
SMP with k processors, the compute engine on it spawns k threads, as shown in the attached
box. The creation of a hierarchical collaborative system includes the following operations.
Fig 2.7 Hierarchical collaborative system built on heterogeneous hosts
(1) Host selection and compute engine creation
This is similar to the creation of the basic collaborative system described in 2.2. First the
compute coordinator is started on the node where the application is initiated. The compute
coordinator then selects the hosts to supply the processors required by the application. The
host selection policy on a heterogeneous system should take the size of an SMP node into
account, giving higher priority to SMP nodes with more processors:
priorityi = Pi ⋅ performancei / workloadi
where priorityi is the priority of host i to be selected, Pi is the number of processors on host i,
performancei is the computing power of each processor in host i, and workloadi is the
computation workload on host i. The number of processors in a host is thus one of the criteria
in host selection: an SMP node with more processors has a higher priority of being selected.
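The priority formula above can be sketched directly in code; rank() orders host indices by descending priority, so SMP nodes with more or faster processors and lighter loads come first. The helper names are hypothetical, not from the thesis implementation:

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch of the host selection priority:
//   priority_i = P_i * performance_i / workload_i
public class HostPriority {
    public static double priority(int processors, double performance, double workload) {
        return processors * performance / workload;
    }

    // Return host indices ordered by descending priority.
    public static Integer[] rank(int[] procs, double[] perf, double[] load) {
        Integer[] order = new Integer[procs.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(
                (Integer i) -> priority(procs[i], perf[i], load[i])).reversed());
        return order;
    }

    public static void main(String[] args) {
        int[] procs = {1, 4, 2};          // a single-processor host and two SMP nodes
        double[] perf = {1.0, 1.0, 1.0};  // equal per-processor power
        double[] load = {0.5, 1.0, 1.0};
        System.out.println(Arrays.toString(rank(procs, perf, load)));
    }
}
```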
The compute coordinator starts the compute engines on the selected hosts, one compute
engine per host regardless of how many processors it has, as shown for Host 1 to Host (P-1)
in Fig 2.7. The compute engines register themselves with the registration mechanism, and
each compute engine keeps a registration table that records the references to the remote
compute engines.
(2) Thread generation
Each compute engine on an SMP node generates a group of threads inside, which will
run on the local processors.
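Thread generation on an SMP node can be sketched with the standard Java threading API; the worker body below is only a placeholder for the real computing task:

```java
// Sketch of thread generation in a compute engine on an SMP node: spawn one
// thread per local processor and let the main thread join the group.
public class ThreadGeneration {
    public static int runWorkers() {
        int processors = Runtime.getRuntime().availableProcessors();
        Thread[] workers = new Thread[processors];
        final int[] done = new int[1]; // shared counter, guarded by a lock
        for (int i = 0; i < processors; i++) {
            workers[i] = new Thread(() -> {
                synchronized (done) { done[0]++; } // placeholder for real work
            });
            workers[i].start();
        }
        try {
            for (Thread t : workers) t.join(); // main thread synchronizes the group
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return done[0];
    }

    public static void main(String[] args) {
        System.out.println(runWorkers() + " worker threads completed");
    }
}
```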
The hierarchical collaborative system is a two-level infrastructure. All compute engines
form the upper level, which is directly managed by the compute coordinator; the lower level
contains the threads inside each compute engine. The main thread is the original thread in
each compute engine, and all other threads are instantiated from it. The group of threads can
jointly process one computing task or independently process their own tasks. The main thread
acts as the local coordinator and performs the necessary synchronization on the group of
threads. In HiCS, the threads in the same compute engine can access shared data, so inter-thread
communication can be accomplished via shared memory. The distributed compute engines,
however, are shared-nothing objects; they communicate with one another by remote
messaging.
The hierarchical collaborative system is an adaptive infrastructure on heterogeneous
systems. Owing to the conditional creation of threads in a compute engine, HiCS provides a
uniform infrastructure for developing and running applications on varied architectures;
multithreading and single-threading compute engines can coexist in one HiCS. For instance,
an application may request to run on P processors, and the processors may be supplied by
different hosts each time the application runs: on a group of SMP nodes of varied sizes, or on
hybrid hosts mixing SMP nodes and single-processor hosts. The underlying hosts are
transparent to an application developed on the HiCS infrastructure. A HiCS is created at
runtime with a structure that matches the architecture of the hosts, to achieve the best
performance in executing the application.
The group of threads generated in the compute coordinator or a compute engine can be
organized to work in two modes, according to the computation patterns of the application:
Cooperative mode: the main thread acts as the local coordinator inside a compute
engine and accepts the computing task from the compute coordinator. The group of
threads shares the computing task, i.e., each thread executes a part of it, so a thread
can be called a sub-engine of the compute engine. The main thread coordinates the
computing procedure across the group of threads. The threads conduct only local
communication among themselves, via shared-data access; the main thread is
responsible for the communication with other compute engines.
Independent mode: each thread works as an independent compute engine and
processes a computing task on its own. Even so, the threads in the same compute
engine can still access shared data, and any thread is allowed to communicate with
threads in other compute engines through remote messaging. There is no local
coordinator as in the cooperative mode; each thread performs an individual computing
task and can thus be called a pseudo-engine.
The work mode of the threads is determined by the computation pattern of an application.
The algorithm of an application can be designed around either mode, and the mode can also
be switched from one to the other within an application, depending on the computational
requirements of its different phases.
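The cooperative mode can be sketched as follows: the main thread partitions a computing task (here, summing an array) among sub-engine threads that write partial results into shared data, then combines the results. The class and method names are illustrative:

```java
// Sketch of cooperative mode: sub-engine threads share one computing task,
// communicating only through shared data; the main thread coordinates.
public class CooperativeMode {
    public static long cooperativeSum(long[] data, int numThreads) {
        long[] partial = new long[numThreads];   // shared result slots
        Thread[] subEngines = new Thread[numThreads];
        int chunk = (data.length + numThreads - 1) / numThreads;
        for (int t = 0; t < numThreads; t++) {
            final int id = t;
            subEngines[t] = new Thread(() -> {
                int from = id * chunk;
                int to = Math.min(from + chunk, data.length);
                for (int i = from; i < to; i++) partial[id] += data[i];
            });
            subEngines[t].start();
        }
        try {
            for (Thread t : subEngines) t.join(); // main thread coordinates
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        long sum = 0;
        for (long p : partial) sum += p; // combine shared partial results
        return sum;
    }

    public static void main(String[] args) {
        long[] data = new long[1000];
        for (int i = 0; i < data.length; i++) data[i] = i + 1;
        System.out.println(cooperativeSum(data, 4)); // sum of 1..1000 = 500500
    }
}
```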
2.3.3 Task Allocation
After a HiCS has been built, the compute coordinator allocates the computing tasks to the
compute engines. Due to the highly diverse characteristics of irregularly structured problems,
different task allocation strategies should be used for different applications; a good task
decomposition cannot be discovered by inspection before execution because of the
irregularity in computation and communication [73]. The following strategies can serve as
general approaches to task allocation in irregularly structured problems.
(1) Initial task decomposition with runtime repartition
The workload of an irregular computation is not pre-determined and evolves during
execution, so it is not possible to decompose the computation evenly a priori. Task allocation
can instead adopt the strategy of initial task decomposition with runtime repartition: an
application is initially divided into tasks based on an estimate of the workload, and if the
workload is found to be unbalanced across the processes during execution, the tasks are
re-decomposed at runtime based on the real workload. This strategy suits applications with
high data dependency.
(2) Dynamic task allocation
For applications with light data dependency, it is not necessary to allocate all tasks to
the processes before execution. The tasks can be allocated progressively, one at a time, in
accordance with the computing progress of each process. With dynamic task allocation, the
workload is automatically balanced across the processes without a specific load balancing
operation.
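Dynamic task allocation can be sketched as a global task pool from which processes fetch one task at a time, so faster processes simply fetch more tasks and the load balances itself. TaskPool and getTask() below are a simplified, local stand-in for the pool-based allocation described above:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of dynamic task allocation from a shared pool. Tasks are modeled
// as integer ids; the atomic counter makes concurrent fetches safe.
public class TaskPool {
    private final AtomicInteger next = new AtomicInteger(0);
    private final int numTasks;

    public TaskPool(int numTasks) {
        this.numTasks = numTasks;
    }

    // Returns the next task id, or -1 when the pool is exhausted.
    public int getTask() {
        int id = next.getAndIncrement();
        return id < numTasks ? id : -1;
    }

    public static void main(String[] args) {
        TaskPool pool = new TaskPool(3);
        int t;
        while ((t = pool.getTask()) >= 0) {
            System.out.println("processing task " + t);
        }
    }
}
```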
(3) Balance in both computation and communication
Generally, load balancing refers to balancing the computation workload. In irregularly
structured problems, however, load balancing should pay special attention to the
communication workload, because irregular communication patterns usually incur high
and diverse communication overhead that severely restricts performance. The task
allocation strategy for communication-intensive applications must therefore cover the
communication factor: the task decomposition should preserve data locality within the tasks
and generate a data distribution that reduces inter-process communication, and it should even
out the diverse communication requirements of the processes so that communication is
balanced among them and communication bottlenecks are alleviated.
The computation and communication patterns of irregularly structured problems are
extremely complicated, so task allocation strategies must be closely tied to the applications.
The strategies above are generic approaches; specific task allocation schemes should be
derived from them for different applications.
2.3.4 Unified Communication Interface
The communication in a hierarchical collaborative system takes place on two levels. The
threads inside a compute engine share public data and variables, so communication between
the threads can be realized through shared memory; this is the efficient path for local data
communication. The communication between compute engines is delivered by remote
messaging across the network, which incurs high communication latency. Fig 2.8 shows the
structure of the two-layer communication mechanism.
The communication between tasks is delivered either by local shared-memory access or
by message passing through the network, depending on the locations of the communication
partners. A unified communication interface is provided to all tasks at the application level:
the tasks call the same interface for communication no matter where the destination is, and
the interface implicitly decides the proper communication path.
Fig 2.8 Two-layer communication mechanism
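A minimal sketch of the path selection behind the unified interface, assuming a task-to-host map. The shared-data layer is modeled by a local queue, and the remote-messaging layer is stubbed out (in the real system it would go through RMI); all names are illustrative:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of the unified communication interface: the caller names only the
// destination task; the interface picks shared-data transfer for a local
// task and remote messaging otherwise.
public class UnifiedComm {
    private final int[] hostOfTask;   // task id -> host id
    private final int localHost;
    private final BlockingQueue<double[]> sharedBuffer = new ArrayBlockingQueue<>(16);
    public int remoteSends = 0;       // counts stubbed remote-messaging calls

    public UnifiedComm(int[] hostOfTask, int localHost) {
        this.hostOfTask = hostOfTask;
        this.localHost = localHost;
    }

    public void send(int destTask, double[] data) {
        if (hostOfTask[destTask] == localHost) {
            sharedBuffer.offer(data); // layer 1: shared-data access
        } else {
            remoteSends++;            // layer 2: remote messaging (stub)
        }
    }

    public double[] receiveLocal() {
        return sharedBuffer.poll();
    }

    public static void main(String[] args) {
        UnifiedComm comm = new UnifiedComm(new int[]{0, 0, 1}, 0);
        comm.send(1, new double[]{3.14}); // same host: shared buffer
        comm.send(2, new double[]{2.71}); // different host: remote path
        System.out.println(comm.receiveLocal()[0] + " / " + comm.remoteSends);
    }
}
```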
The two-layer communication is well suited to the hierarchical collaborative system built
on heterogeneous hosts. It helps reduce the heavy and unpredictable communication overhead
of irregularly structured problems. With the flexible computing modes (cooperative or
independent), the tasks (threads) can be organized into different modes to exploit the
efficiency of the two-layer communication. In chapter 6, the CG method will demonstrate the
transparent mapping of the communication paths, and the radix sort will illustrate the flexible
use of the computing modes to implement efficient global communication.
The pseudo-engines have two communication paths: pseudo-engines on the same SMP
node can use shared-data access, while pseudo-engines residing on different nodes must
communicate via remote messaging. Owing to the architecture-transparent feature of the
MOIDE model, the locations of the pseudo-engines are transparent to the applications. At the
application level, any communication between pseudo-engines goes through the unified
communication interface, and the MOIDE runtime support system (see chapter 3) decides the
exact communication path according to the locations of the pseudo-engines.
2.4 Implementation
The MOIDE model is implemented in Java with the RMI (Remote Method Invocation)
interface on distributed systems. The motivation for using Java and RMI has been discussed
in 1.2.3. Java's object-oriented, multithreading and platform-independent features are suitable
for implementing the hierarchical collaborative system infrastructure on heterogeneous
systems, and RMI's remote method invocation mechanism facilitates the flexible interaction
and communication between distributed objects. A brief introduction to the implementation
is given here; the details are covered in chapter 3.
(1) Compute Coordinator and Compute Engine
The compute coordinator and compute engines are the major components of the
hierarchical collaborative system. Two classes, for the compute coordinator and the compute
engine, are defined as the kernel of the implementation; the other components are built
around them.
(2) Object Registration and Interaction Interface
The registration mechanism of the collaborative system is implemented on the RMI
registry, rmiregistry. The registry runs on each selected host and provides a naming service to
the distributed objects; a compute engine registers itself with the registry when created. The
compute coordinator assigns computing tasks and other arguments to remote compute
engines and triggers their computation through remote method invocation. The compute
coordinator generates a name list of the compute engines and broadcasts it to them. Each
compute engine can then independently obtain the reference to any other remote compute
engine by looking up the registry on the remote host once. The references are stored in the
registration table of each compute engine for later use, as Fig 2.2 shows.
The interaction and communication between the compute coordinator and the compute
engines go through remote method invocation via the RMI interface. A class defines the
interface, and the implementation of the interface is defined in the compute engine class.
(3) Multithreading
A compute engine instantiates a group of threads on an SMP node before it starts to run
the computing task. Recent JDK versions, e.g. IBM JDK 1.1.8 and Blackdown JDK 1.2 and
up, support kernel-based threads (native threads), which the JVM can schedule onto the
multiprocessors of an SMP node.
(4) Two-layer Communication
The data sharing among a group of threads is accomplished by accessing public data
objects; in cooperative mode, the threads in a compute engine may work on the same data set
and share the computing task. Compute engines transmit data to each other by remote
messaging, i.e., by passing data through remote method invocation. RMI supports object
serialization, which enables direct transfer of complex data objects back and forth between
the compute engines. Applications call the unified communication interface to realize the
two-layer communication. The interface provides the communication methods to the
applications and seamlessly integrates shared-data access and remote messaging on the two
levels. The unified communication interface is extremely useful for computation in
independent mode, where each thread works independently as a compute engine. The
location of each thread is registered at the creation stage of the HiCS. When a thread wants to
communicate with another thread, it just specifies the ID of the target when calling a
communication method, and the two-layer communication mechanism chooses the proper
path and delivers the data according to the locations of the two threads.
2.5 Summary
This chapter has described the distributed object model MOIDE, which supports flexible
and efficient computing on distributed systems. The MOIDE model presents a computing
infrastructure that is adaptive to heterogeneous system architectures. It creates a collaborative
system on the available hosts to execute an application, and the collaborative system can be
reconfigured to adapt to changes in the states of the underlying hosts. By combining the
distributed object and multithreading methodologies, the hierarchical collaborative system
realizes high-performance computing on heterogeneous systems. The compute engines in a
hierarchical collaborative system can work in two modes, in accordance with the computation
patterns of the applications, and the two-layer communication mechanism built on the
hierarchical collaborative system supports efficient communication.
The MOIDE model is implemented by the runtime support system MOIDE-runtime;
chapter 3 gives the implementation details. The MOIDE model is suitable for various
applications, especially for solving irregularly structured problems on distributed systems. It
is used to implement four irregularly structured applications: the N-body problem, ray tracing,
CG, and radix sort. The utilization and advantages of the MOIDE model will be demonstrated
by these four applications in chapters 4, 5 and 6.
Chapter 3
Runtime Support System
A runtime support system has been developed to support distributed computing on the
MOIDE model. It provides the fundamental classes and APIs for implementing MOIDE-based
computation and is programmed in Java with RMI.
3.1 Overview
The runtime support system provides the components and functions of the MOIDE model.
Table 3.1 lists the major classes and methods implemented in it. The runtime support system
is called MOIDE-runtime in the following text for brevity.
Class          Method           Description
StartEngine                     build a collaborative system
               getHosts         select hosts
               createEngine     create compute engines on the hosts
               invokeEngine     assign tasks to remote compute engines
Codr                            compute coordinator
               main             create collaborative system
               run              run application
Engine                          object interface of compute engine
EngineImpl                      implementation of compute engine
               run              start compute engine
               ceaseEngine      terminate compute engine
ExpandEngine                    expand collaborative system
               addEngine        add new compute engine
RecfgEngine                     reconfigure collaborative system
               checkEngine      check the states of the hosts
               replaceEngine    replace a compute engine with a new one
CommLib                         unified communication interface
               exchDouble       exchange a double
               exchDoubleArray  exchange an array of double
               exchIntArray     exchange an integer array
               allReduce        global reduction
               scan             global scan
Util                            miscellaneous utilities
               barrier          synchronize a group of threads
               remoteBarrier    synchronize compute engines
               getTask          get a task from global task pool
               getSubtask       get a subtask from local subtask queue
Table 3.1 Major classes and methods in MOIDE runtime support system
The compute engine is specified as the interface Engine together with its implementation
EngineImpl. The compute coordinator class Codr is an application-dependent class that calls
the main method of the application.
Fig 3.1 Organization of MOIDE runtime support system
Fig 3.1 depicts the organization of MOIDE-runtime. The runtime support system
implements the creation and reconfiguration of the hierarchical collaborative system. It
defines the class StartEngine for system creation, ExpandEngine for system expansion,
and RecfgEngine for host replacement. These three classes call the same methods to
search for the available hosts in a distributed system and detect their states. The state
detection is implemented with the aid of ClusterProbe [13], a Java-based tool developed by
our research group for reporting the states of clustered computers. ClusterProbe runs on a
server, monitors the hosts in a distributed system, and periodically reports their states and
other information, including processor type, performance, memory size, workload, and the
number of processors in each host. The host selection is based on this state information.
The runtime support system provides the primitives for dynamic load scheduling and
object synchronization. It also defines a unified communication interface for the two-layer
communication.
3.2 Principal Objects
The principal objects in a collaborative system are the compute coordinator and the
compute engines. They are defined as two classes, and all other classes are built around them
to complete the system functions.
The compute coordinator is the first object started, on the host where an application
begins to run. It is responsible for host selection and remote compute engine creation, and it
coordinates the whole computing procedure of the collaborative system. A compute engine is
an object created on a remote host; it accepts and processes computing tasks. After
establishing the collaborative system, the compute coordinator also works as a compute
engine to execute computing tasks.
Fig 3.2 Class description and relation of compute coordinator and compute engine
The interactions between the objects (compute coordinator and compute engines) are
implemented on the RMI interface. Following the convention of the RMI-based distributed
object scheme, an interface and its implementation must be specified for remote objects. The
compute engine is therefore defined as an interface Engine and its implementation
EngineImpl. The interface contains the declarations of all methods open to remote
method invocation; the implementation specifies the details of the compute engine class,
including the bodies of the methods declared in the interface.
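The interface/implementation convention can be sketched as follows. A real EngineImpl would also be exported as a remote object (e.g. via UnicastRemoteObject) and bound in rmiregistry; that part is omitted here, and the invoke() method shown is a hypothetical stand-in:

```java
import java.rmi.Remote;
import java.rmi.RemoteException;

// Sketch of the RMI convention: remote methods are declared in an interface
// extending Remote and implemented in a separate class.
public class EngineSketch {
    // Interface: declares the remotely invocable methods.
    interface Engine extends Remote {
        String invoke(String task) throws RemoteException;
    }

    // Implementation: supplies the bodies of the declared methods.
    static class EngineImpl implements Engine {
        private final String host;

        EngineImpl(String host) {
            this.host = host;
        }

        @Override
        public String invoke(String task) {
            return "engine@" + host + " running " + task;
        }
    }

    public static void main(String[] args) {
        EngineImpl engine = new EngineImpl("host1");
        System.out.println(engine.invoke("cg-solver"));
    }
}
```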
From the description above, we can outline the structures of the compute coordinator and
compute engine classes as well as their relationship. As Fig 3.2 shows, the compute
coordinator class Codr is a comprehensive class that encapsulates the system creation and
coordination work. EngineImpl is the interface implementation of the compute engine
class; it includes the methods for remote object interaction such as remote messaging and
global synchronization (see 3.6.2). The constructor of the compute engine is invoked by the
compute coordinator through remote method invocation. When a compute engine is created
on an SMP node, it spawns a group of threads according to the number of local processors.
Appl is the class of the application program to be run; both the compute coordinator and the
compute engines instantiate the application class and run its main method as appl.run().
3.3 System Creation
The compute coordinator is the initiator of a collaborative system. It first runs the class
StartEngine to create the collaborative system; the creation of a compute engine on
another host is triggered by a remote method invocation from the compute coordinator.
3.3.1 class StartEngine
The class StartEngine builds a collaborative system based on the required number of
processors and the available system resources. It includes the methods to select the hosts,
create and invoke compute engines on them, and start the computation on the system. The
methods complete the three stages of system creation: host selection (getHosts()),
compute engine creation (createEngine()), and remote compute engine invocation
(invokeEngine()). These methods are called in the run() method of StartEngine,
executed by the compute coordinator, as Fig 3.3 shows.
public void run () {
String [] hostList = (String []) getHosts();
Engine [] engine = (Engine []) createEngine(hostList);
invokeEngine(engine, task);
}
Fig 3.3 run() method of the StartEngine class
(1) getHosts()
Fig 3.4 shows the getHosts() method for host selection. This is the first step in
system creation.
public String [] getHosts () {
    String rn = "CLUSTERPROBE.resources.Cluster_Workstation_CPUInfo";
    String server;   /* the server running ClusterProbe */
    int port = 7001;
    Vector hostList = (new GetHostStatus()).getHostStatus(rn, server, port);
    rn = "CLUSTERPROBE.resources.Cluster_Workstation_StatusTable";
    Vector hostStatus = (new GetHostStatus()).getHostStatus(rn, server, port);
    hosts = selectHosts(hostList, hostStatus);
    return hosts;
}

public String [] selectHosts (Vector hostList, Vector hostStatus) {
    /* select hosts according to pre-defined criteria */
    …………
}

Fig 3.4 getHosts() method
The method contacts ClusterProbe and retrieves the host information from it through the
GetHostStatus() interface. The host information includes all hosts available in the
distributed system, their performance and current states, their internal number of processors
and etc. By specifying the parameter rn of the method getHostStatus(), ClusterProbe
can supply various information that reflects different aspects of the distributed system. In Fig
3.4, the first string “CLUSTERPROBE.resources.Cluster_Workstation_CPUInfo”
indicates to get a full list of CPU information, including the number, type and speed of the
CPU(s) in each host. The second string “CLUSTERPROBE.resources.
Cluster_Workstation_StatusTable” means to get the state table of the hosts. The
status table contains machine states and current workload.
Method selectHosts() implements the host selection policy to select proper hosts
among all available hosts. The host on which the compute coordinator resides is the first
host to be selected.
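The selection policy left abstract in Fig 3.4 could be sketched as below. This is an illustrative guess rather than MOIDE-runtime's actual code: the class name HostSelector, the "Overload" status string, and the localHost parameter are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a selectHosts() policy: keep hosts whose reported
// state is not "Overload", and place the coordinator's own host first,
// as Section 3.3.1 requires.
public class HostSelector {
    public static String[] selectHosts(String[] hostList, String[] hostStatus,
                                       String localHost) {
        List<String> selected = new ArrayList<>();
        selected.add(localHost);   // the coordinator's host is chosen first
        for (int i = 0; i < hostList.length; i++) {
            if (!hostList[i].equals(localHost) && !"Overload".equals(hostStatus[i])) {
                selected.add(hostList[i]);
            }
        }
        return selected.toArray(new String[0]);
    }
}
```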
(2) createEngine()
Having chosen the hosts, the compute coordinator starts the creation of the compute
engines by Runtime.getRuntime().exec(command) (see Fig 3.5), where command
is a remote shell (rsh) invocation on a remote host. The command runs a script
create_engine on the remote host to create a compute engine there. The
create_engine script runs rmiregistry and starts the compute engine. The compute
engine registers itself with the registration mechanism. Then the compute coordinator needs to
get the references to remote compute engines for future interaction. This is accomplished by
running the RMI method Naming.lookup() which locates the remote objects and gets
the references to them. The returned references are stored in the registration table engine[].
Thereafter the compute coordinator can contact the compute engines through the references
in engine[].
public Engine [] createEngine(String [] hosts) {
    for host[i] in hosts[] {
        String command = "/usr/bin/rsh " + host[i] + " create_engine";
        Runtime.getRuntime().exec(command);
        engine[i] = (Engine) Naming.lookup("//" + host[i] + "/Engine" + i);
    }
    return engine;
}
Fig 3.5 createEngine() method
(3) invokeEngine()
When the compute coordinator has obtained the references to the remote compute engines, it
immediately assigns computing tasks to them by engine[i].invoke() in the method
invokeEngine() (see Fig 3.6). The call engine[i].invoke(task[i]) invokes
the remote method invoke() on compute engine i, where engine[i] is the
reference to it. The task allocated to the compute engine is passed through the argument
task[i]. In addition to the computing task assigned to that compute engine, task[i] also
contains the information about the construction of the collaborative system such as the name
list of all hosts and the compute engines on them. Having received the list, a compute engine
can get the references to all remote compute engines by calling the Naming.lookup() method
and create its own registration table. The creation of the collaborative system is finished
when all compute engines have obtained the references to the other engines.
As indicated above, remote method invocation is more than a simple message passing
operation. It also passes control to the remote object. The compute coordinator calls the
method invoke()on remote compute engine to activate its work. The invoke() method
not only transmits the compute task and related data to a compute engine but also starts the
execution of the computing task on the compute engine.
public void invokeEngine (Engine[] engine, Message[] task) {
    for engine[i] in engine[]
        engine[i].invoke(task[i]);
}
Fig 3.6 invokeEngine() method
After the execution of invokeEngine(), all compute engines have received their
computing tasks and begin executing them asynchronously.
3.3.2 Initialization of Compute Engine
As introduced in 3.2, the compute engine is defined by the interface Engine and its
implementing class EngineImpl.
(1) Starting Compute Engine
The remote shell command Runtime.getRuntime().exec(command) in Fig 3.5, issued by the
compute coordinator, starts the creation of a compute engine on a remote host. The creation
includes two operations:
1. run rmiregistry to start RMI registry on the host;
2. start EngineImpl.
The object of EngineImpl registers itself to the registry by name (e.g., Engine0), as
below:
EngineImpl computeEngine = new EngineImpl();
Naming.rebind("Engine0", computeEngine);
After the registration, the creation of compute engine has been completed.
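Pulling the pieces of this section together, a minimal sketch of the Engine interface and EngineImpl might read as follows. The invoke()/replace() signatures and the plain Object task argument are assumptions for illustration; the thesis does not give the exact declarations here.

```java
import java.rmi.Naming;
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.server.UnicastRemoteObject;

// Hedged sketch: a remote interface for compute engines and a skeleton
// implementation that registers itself by name, as in Section 3.3.2.
interface Engine extends Remote {
    void invoke(Object task) throws RemoteException;     // start a computing task
    void replace(String oldHost) throws RemoteException; // take over a task (Section 3.4.2)
}

class EngineImpl extends UnicastRemoteObject implements Engine {
    EngineImpl() throws RemoteException { super(); }

    public void invoke(Object task) throws RemoteException {
        /* unpack the task and start the computation */
    }

    public void replace(String oldHost) throws RemoteException {
        /* fetch the task from the engine on oldHost */
    }

    public static void main(String[] args) throws Exception {
        // After rmiregistry has been started by create_engine, the engine
        // registers itself under a well-known name:
        EngineImpl computeEngine = new EngineImpl();
        Naming.rebind("Engine0", computeEngine);
    }
}
```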
(2) Multithreading
On an SMP node, a group of threads will be generated in the compute engine. The code
in Fig 3.7 generates threads on host k depending on the number of its processors in use. The
number of processors on the SMP node has been obtained in StartEngine.
Appl [] subengine = new Appl[number_of_processors[k]];
for ( i = 1; i < number_of_processors[k]; i ++ ) {
    subengine[i] = new Appl();
    subengine[i].start();
}
subengine[0] = new Appl();
subengine[0].run();
Fig 3.7 Generate threads in a compute engine
Appl is the main class of the application to be run on the collaborative system. The
compute engine creates multiple threads by instantiating the class Appl. The start()
method causes the associated thread subengine[i] to run the code of Appl. The main
thread subengine[0] also instantiates Appl and runs the application code by calling the
run() method directly.
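The Appl class itself is application-specific. A minimal sketch, assuming Appl extends java.lang.Thread so that start() spawns a new thread while run() executes in the caller, exactly as Fig 3.7 requires:

```java
// Illustrative sketch of an application main class Appl; the computation
// body is a placeholder, and the use of availableProcessors() to size the
// thread group is an assumption for this self-contained example.
public class Appl extends Thread {
    @Override
    public void run() {
        // application code: fetch subtasks and compute (see Section 3.7)
    }

    public static void main(String[] args) throws InterruptedException {
        int processors = Runtime.getRuntime().availableProcessors();
        Appl[] subengine = new Appl[processors];
        for (int i = 1; i < processors; i++) {   // worker threads
            subengine[i] = new Appl();
            subengine[i].start();
        }
        subengine[0] = new Appl();
        subengine[0].run();                      // main thread runs in place
        for (int i = 1; i < processors; i++) subengine[i].join();
    }
}
```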
MOIDE-runtime is implemented on Blackdown JDK 1.2.2, which supports native threads.
Native threads are kernel-based threads that can be automatically scheduled by OS to run on
multiprocessors.
3.4 System Reconfiguration
MOIDE-runtime supports two types of system reconfiguration. One is system expansion
that adds extra compute engines to the collaborative system to enhance the computing power.
The other is host replacement that replaces the hosts with new hosts to improve the system
performance. The class ExpandEngine is specified for system expansion and the class
RecfgEngine for host replacement.
3.4.1 class ExpandEngine
The system expansion can be made in two directions: horizontal or vertical (see 2.2.4.1).
Fig 3.8 shows the run() method of the ExpandEngine class. To expand the system, the
compute coordinator or a compute engine calls the getHosts() method in class
StartEngine (see 3.3.1) to search for available hosts. Then the system expansion is
conducted by the method addEngine() in Fig 3.9.
public void run () {
    String [] newHost = (String []) StartEngine.getHosts();
    if (horizontal expansion)
        flag = horizontal;
    else /* vertical expansion */
        flag = vertical;
    engine = addEngine(newHost, engine, flag);   /* engine is the current registration table */
}
Fig 3.8 run() method in ExpandEngine class
In the arguments to the addEngine() method (Fig 3.9), newHost is the list of the
available hosts used for system expansion. oldEngine is the current registration table of
compute engines. flag indicates the horizontal or vertical expansion.
public Engine [] addEngine(String[] newHost, Engine[] oldEngine, boolean flag)
{
    Engine [] newEngine = (Engine []) createEngine(newHost);
    Engine [] engine = adjustEngine(oldEngine, newEngine, flag);
    invokeEngine(newEngine, task);
    if (flag == horizontal)
        broadcast engine[] to all compute engines;
    return engine;
}
Fig 3.9 addEngine() method in ExpandEngine class
The method createEngine() (see Fig 3.5 in 3.3.1) is called in addEngine() to
create a compute engine on each newly-selected host. The adjustEngine() method adds
the references of the new compute engines to the registration table. The registration table
needs to be updated only in horizontal system expansion. In vertical system expansion, the
new engine is totally under the control of its parent engine. It works as the assist-engine of
the parent compute engine to share its workload. Other compute engines as well as the
compute coordinator do not recognize the existence of the assist-engine. In the system-wide
view, the vertical expansion resembles adding more processors to the host where the parent
engine resides. Logically, no new compute engine is added to the HiCS.
Hence no change should be made to the registration table in vertical expansion.
3.4.2 class RecfgEngine
The host replacement is specified in the class RecfgEngine, which is executed by
the compute coordinator. The major methods in RecfgEngine are checkEngine() and
replaceEngine().
The method checkEngine() in Fig 3.10 calls the interface GetHostStatus() to
get the current states of the hosts from ClusterProbe. Then it decides the necessary host
replacement based on the states. If any host is overloaded, the method findHost() is
called to find a substitute host. The policy for the selection of substitute host is similar to that
for the host selection in the class StartEngine. If no appropriate substitute host can be
found, the host replacement will not continue.
public void checkEngine() {
    String rn = "CLUSTERPROBE.resources.Cluster_Workstation_CPUInfo";
    String server;   /* the server running ClusterProbe */
    int port = 7001;
    Vector hostList = (new GetHostStatus()).getHostStatus(rn, server, port);
    rn = "CLUSTERPROBE.resources.Cluster_Workstation_StatusTable";
    Vector hostStatus = (new GetHostStatus()).getHostStatus(rn, server, port);
    for engine[i] in engine[]
        if ( hostStatus[engine[i]].equals("Overload") )
            new_host[i] = findHost(hostList, hostStatus);
}
Fig 3.10 checkEngine() method in RecfgEngine class
After the host replacement has been decided in checkEngine(), the method
replaceEngine() in Fig 3.11 is called to create a compute engine on each new host and
transfer the computing task from the overloaded host to the compute engine on the new host.
The compute coordinator starts the creation of new compute engines on the substitute hosts
by calling the method createEngine() in Fig 3.5. Then the compute coordinator invokes each new
compute engine by calling the remote method replace() on that engine. The function of
replace() is similar to invoke() in Fig 3.6. However, the new compute engine obtains
the computing task from the compute engine on the replaced host rather than having it
assigned by the compute coordinator. Then the compute engines on the replaced hosts cease
to work. Their
references in the registration table are replaced by the references to the new compute engines.
Thus the collaborative system has been reconfigured with the new compute engines and all
remaining engines. The computation continues on the reconfigured system.
public void replaceEngine() {
    Engine [] new_engine = (Engine []) createEngine(replace_host);
    for new_engine[j] in new_engine[] {
        new_engine[j].replace(host[j]);
        engine[j] = new_engine[j];
    }
}
Fig 3.11 replaceEngine() method in RecfgEngine class
3.5 Unified Communication Interface
As discussed in 2.3.3, the threads in a hierarchical collaborative system can work in
independent mode. Each thread, called a pseudo-engine, works as a compute engine. All
pseudo-engines have equal positions in the hierarchical collaborative system no matter
whether they are in the same or different compute engines. A unified communication
interface is needed for
the communication between the pseudo-engines. The threads can communicate with each
other through the uniform interface at application level. The two-layer communication will
transparently complete the communication via shared-data access or remote messaging.
The runtime support system provides a unified communication interface. The following
exchDoubleArray() method is one of the methods provided in the interface. It specifies a
send-receive communication of double arrays between a pair of pseudo-engines:
static void exchDoubleArray ( int src, int dest,
                              double[] srcData, double[] destData,
                              int srcStart, int destStart,
                              int sendLength, int recvLength,
                              int status )

int src            ID of the pseudo-engine sending data;
int dest           ID of the pseudo-engine receiving data;
double[] srcData   buffer for sending data;
double[] destData  buffer for receiving data;
int srcStart       offset in the buffered sending data;
int destStart      offset in the buffered receiving data;
int sendLength     length of sending data;
int recvLength     length of receiving data;
int status         flag to identify the communication operation.
As exchDoubleArray() shows, only the IDs of the sender and receiver are required
when calling the method. The IDs are assigned to the pseudo-engines in StartEngine.
There is no indication about the locations of the sender and receiver in the method. The
runtime support system will decide the real communication path depending on their locations
found in the pseudo-engine location table (see Fig 3.13). If two pseudo-engines exist in the
same compute engine, the send-receive operation will be finished by direct data exchange between
the send and receive buffers in local compute engine. Otherwise, the sending data will be
transmitted to the receive buffer on the receiving pseudo-engine through remote method
invocation.
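The routing decision described above can be sketched as follows. The class name TwoLayerComm, the int[][] layout of the location table, and the simplified parameter list are illustrative assumptions; the remote-messaging branch is left as a stub since the actual RMI call is not specified here.

```java
// Sketch of the two-layer routing behind exchDoubleArray(), assuming a
// location table where loc[id] = {engineId, threadId} as in Fig 3.13.
// Same engine: copy between buffers (shared-data access). Different
// engines: a remote method invocation would carry the data (stubbed).
public class TwoLayerComm {
    static int[][] loc;   // pseudo-engine location table

    static boolean sameEngine(int src, int dest) {
        return loc[src][0] == loc[dest][0];
    }

    static void exchDoubleArray(int src, int dest,
                                double[] srcData, double[] destData,
                                int srcStart, int destStart, int length) {
        if (sameEngine(src, dest)) {
            // layer 1: shared-data access inside one compute engine
            System.arraycopy(srcData, srcStart, destData, destStart, length);
        } else {
            // layer 2: remote messaging via RMI, e.g.
            // engine[loc[dest][0]].recv(...) -- omitted in this sketch
        }
    }
}
```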
Fig 3.12 A hierarchical collaborative system with 12 pseudo-engines
(figure: the compute coordinator on SMP0 and compute engines 1–3 on SMP1–SMP3, each with a
shared-data area; pseudo-engines 0–3 run in the coordinator, 4–7 in engine 1, 8–9 in engine 2,
and 10–11 in engine 3)
Pseudo-engine   Compute Engine   Thread
      0                0            0
      1                0            1
      2                0            2
      3                0            3
      4                1            0
      5                1            1
      6                1            2
      7                1            3
      8                2            0
      9                2            1
     10                3            0
     11                3            1
Fig 3.13 The location table of pseudo-engines
The pseudo-engine location table is created in StartEngine (see 3.3.1). While
creating the compute engines, the compute coordinator records the locations of the pseudo-
engines in the table. The location includes the compute engine ID in which a pseudo-engine
exists and the thread ID of the pseudo-engine. Fig 3.12 shows an example of twelve pseudo-
engines on four SMP nodes. The pseudo-engines exist in four compute engines (including
compute coordinator). The location table is shown in Fig 3.13. For example, pseudo-engine 0
is the thread 0 in compute engine 0 (i.e., the compute coordinator) and pseudo-engine 11 is
the thread 1 in compute engine 3.
The location table is broadcast to all compute engines in invokeEngine()
(see 3.3.1). Consider two communication cases in Fig 3.12.
(1) Pseudo-engine 4 and 6 are both in compute engine 1 according to the location table.
The data communication between them will be finished directly by the shared-data
exchange through their own buffers.
(2) Pseudo-engine 2 and 8 reside in compute engine 0 and 2, respectively. They must use
remote messaging for communication.
3.6 Synchronization
The compute engines in the collaborative system execute their computing tasks
asynchronously. In a hierarchical collaborative system, multiple threads also run in parallel
inside each compute engine. Synchronization is needed among the group of threads and
among the compute engines to coordinate their computing progress. Fig 3.14 shows the
synchronization on two levels. The local synchronization is the synchronization among a
group of threads in the same compute engine. The global synchronization coordinates all
compute engines in the system. MOIDE-runtime implements two synchronization methods.
Fig 3.14 Execution flow on multiple threads with local and global synchronization
3.6.1 barrier() for Local Synchronization
A group of threads in cooperative mode share a computing task. The threads need to
synchronize their computing procedures at some point. For example, a compute engine may
exchange data with other compute engines. In cooperative mode, the data is the product of the
collaborative computation on the group of threads and the main thread is responsible for the
remote communication. The communication can be performed only after the group of threads
in a compute engine have generated the result data. Here local synchronization should be
imposed on the threads. The main thread is able to start the data communication only after the
group of threads have finished the preceding computation and reached the synchronization
point. Moreover, the other threads may need to wait for the data communication to finish
before proceeding to the following computation. Hence another local synchronization is
required at the end of the data communication.
MOIDE-runtime provides the local synchronization method barrier(). The
synchronization is controlled by a public integer Barlock accessible by local threads. The
threads each exclusively increment the value of Barlock when calling barrier(). The local
synchronization is accomplished when each thread has incremented Barlock once.
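A minimal sketch of such a counter-based barrier, assuming Java monitor locking (the thesis does not give barrier()'s implementation; the generation counter, which makes the barrier reusable, is an addition of this sketch):

```java
// Illustrative reusable barrier in the spirit of Section 3.6.1: each thread
// increments a shared counter ("Barlock") under the object monitor; the last
// arrival resets the counter, advances the generation, and wakes the rest.
public class LocalBarrier {
    private final int nThreads;
    private int barlock = 0;     // the shared counter
    private int generation = 0;  // episode number, for reuse

    public LocalBarrier(int nThreads) { this.nThreads = nThreads; }

    public synchronized void barrier() {
        int gen = generation;
        barlock++;
        if (barlock == nThreads) {   // last arrival releases the others
            barlock = 0;
            generation++;
            notifyAll();
        } else {
            while (gen == generation) {
                try { wait(); } catch (InterruptedException e) { /* retry */ }
            }
        }
    }
}
```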
3.6.2 remoteBarrier() for Global Synchronization
Global synchronization forces all compute engines to reach the synchronization point
before proceeding to the following operations. This is the global coordination on
collaborative system. It is required in cases such as system-wide collective data
communication and the synchronization of computing iterations on all compute engines. The global
synchronization is more complex than local synchronization. As the distributed compute
engines are shared-nothing objects, the global synchronization should be implemented by the
state transition on the compute engines. At the global synchronization point, a compute
engine enters a pre-defined state and waits there. The compute coordinator examines the
states of all compute engines by remote method invocation. If all of them have entered that
state, the compute coordinator signals them to transit into a new state and go on to execute
next operations.
public static void remoteBarrier(int status) {
    int my_state;
    boolean sync = false;
    barrier();               /* local synchronization */
    my_state = status;
    if (compute coordinator) {
        while (!sync)
            sync = (getRemoteStatus() == status);
        for ( each compute engine i )
            engine[i].remoteStart(new_state);
        my_state = new_state;
    } else {                 /* compute engines */
        while (my_state != new_state)
            Thread.yield();
    }
}
Fig 3.15 Global synchronization method remoteBarrier()
MOIDE-runtime provides the global synchronization method remoteBarrier(int
status) shown in Fig 3.15. The global synchronization includes the synchronization on
two levels. First, a local synchronization barrier() is called to synchronize the threads in
the same compute engine, because the state transition of the compute engine can be made only
after all local threads have reached the synchronization point. The compute coordinator keeps
on polling the states of remote compute engines until all compute engines have entered the
specific state status. The getRemoteStatus() method checks the states of remote
compute engines. When all compute engines have turned into the state status, the compute
coordinator informs all compute engines to continue the computation by setting new_state
on them via the remote method invocation remoteStart(). On the other side, the
compute engines wait in remoteBarrier() until the compute coordinator invokes them
to continue the computation.
3.7 Load Scheduling
Load scheduling is important to irregularly structured problems. Due to the irregular
computation patterns and dynamic features, it is difficult to measure the computing workload
and make a balanced task allocation in advance. Dynamic load scheduling is required to
balance the workload at run-time. The load scheduling schemes are tightly related to the
computation pattern of an application. MOIDE-runtime provides two dynamic load
scheduling methods for autonomous load scheduling on the two levels of hierarchical
collaborative system.
3.7.1 Autonomous Load Scheduling
An irregularly structured application is difficult to allocate evenly to the compute
engines before execution. Runtime task allocation is one of the dynamic load balancing
techniques, suitable for applications with light data-dependency. In runtime task
allocation, an application is divided into small pieces of computing tasks. At the beginning,
each compute engine is allocated one piece. Once it has finished that piece, the compute
engine fetches another. Therefore the
computation workload can be dynamically balanced on the compute engines. The task
allocation also depends on the computing power of the underlying hosts. The application is
partitioned into n task units, where

    n >> number of compute engines

If the computing power of host i is performance_i, the compute engine on it gets a computing
task whose workload is proportional to performance_i each time.
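The proportional sizing rule can be sketched as below. The class name TaskSizing, the grain parameter, and the rounding policy are assumptions for illustration; the thesis only states the proportionality.

```java
// Sketch of performance-proportional task sizing: engine i's fetch, out of
// `grain` task units per round, is proportional to its host's share of the
// total performance, with a minimum of one unit.
public class TaskSizing {
    static int chunkSize(double[] performance, int i, int grain) {
        double total = 0;
        for (double p : performance) total += p;
        return Math.max(1, (int) Math.round(grain * performance[i] / total));
    }
}
```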
The computation on the collaborative system is highly asynchronous. The compute engines
execute their computing tasks independently except for the necessary communication and
synchronization. In principle, the compute coordinator should be the global load scheduler.
With the one-sided communication feature of MOIDE model (see 2.1), however, a compute
engine is able to do the runtime load scheduling by itself. It can directly get a task from the
global task pool in the compute coordinator. Therefore the load scheduling can be performed
without a global load scheduler. This self-conducted dynamic load scheduling approach is
called autonomous load scheduling.
Generally the load scheduling in a hierarchical collaborative system happens on two
levels: the task allocation to the compute engines, and the workload sharing among the group
of threads inside compute engine. In cooperative mode, the threads in a compute engine share
the computing task. Fig 3.16 shows the autonomous load scheduling on the two levels. The
main thread in a compute engine fetches a task from the global task pool and puts it into the
local subtask queue. The size of the task is proportional to the computing power of the compute
engine. The local threads get the subtasks from the local subtask queue. In independent mode, a
thread works individually as a compute engine. Each thread can fetch its computing task
directly from the global task pool.
Fig 3.16 Two-level autonomous load scheduling
(figure: the compute coordinator holds the global task pool; the multithreads in each compute
engine draw subtasks from a local subtask queue)
3.7.2 getTask() and getSubtask() Methods
MOIDE-runtime provides two methods for autonomous load scheduling. The
getTask() method is executed by a thread to fetch a task from the global task pool. It is
usually accomplished by remote method invocation. The getSubtask() method is called
by a thread to get a subtask from local subtask queue.
The representation of workload varies across applications. An abstract unit, the task
unit, is used in the load scheduling methods here. Specific meanings can be given to the task
unit in different applications.
(1) getTask()
A pointer to the global task pool is defined in the compute coordinator. The pointer
indicates the next computing task in the global task pool. A compute engine calls the
getTask() method on the compute coordinator to fetch a task. The method returns a task
to the compute engine. The task can be divided into equal-size subtasks. The subtasks are
stored in the subtask queue that is indicated by a local pointer.
(2) getSubtask()
The getSubtask() method is called by a thread to get a subtask from the local subtask
queue. A thread gets a subtask by referring to the pointer of the queue.
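The pointer-based pool behind getTask() might look like the sketch below. In MOIDE-runtime getTask() is a remote method on the compute coordinator; here it is shown as a local synchronized method, and the class name, the size parameter, and the [start, end) return convention are assumptions.

```java
// Sketch of a global task pool with an allocation pointer: each call hands
// out the next `size` task units and advances the pointer; null signals
// that the pool is exhausted.
public class TaskPool {
    private final int nUnits;   // total task units in the pool
    private int next = 0;       // pointer to the next unallocated unit

    public TaskPool(int nUnits) { this.nUnits = nUnits; }

    // returns {start, end} of the allocated range, or null when exhausted
    public synchronized int[] getTask(int size) {
        if (next >= nUnits) return null;
        int start = next;
        next = Math.min(nUnits, next + size);
        return new int[] { start, next };
    }
}
```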
3.8 System Termination
public void ceaseEngine(String [] hosts) {
    for (host[i] in hosts[]) {
        String command = "/usr/bin/rsh " + host[i] + " cease_engine";
        Runtime.getRuntime().exec(command);
    }
    finalize();
}
Fig 3.17 ceaseEngine()method
When the execution of an application has finished, the collaborative system should be
terminated. The system termination is controlled by the compute coordinator. Fig 3.17 shows
the system termination method ceaseEngine(). The compute coordinator ceases all
compute engines by running the cease_engine script on the remote hosts. The script
stops the object of the compute engine. Eventually the compute coordinator terminates itself
by the Java API method finalize().
3.9 Summary
In this chapter, the runtime support system MOIDE-runtime is described in detail.
MOIDE-runtime specifies the main classes and methods to support the development and
execution of applications in MOIDE model. It defines the fundamental classes of compute
coordinator and compute engine. It provides the classes to implement the collaborative
system infrastructure including system creation, reconfiguration, termination and
synchronization. It implements the two-layer communication mechanism and dynamic load
scheduling methods.
MOIDE-runtime is used to implement the irregularly structured applications in the
following chapters. StartEngine is called to establish the hierarchical collaborative
system. The synchronization methods barrier() and remoteBarrier() are used in
the applications to enforce local and global synchronization. The ray tracing method in
chapter 5 uses the autonomous load scheduling methods getTask() and getSubtask().
The CG and radix sort methods in chapter 6 call the unified communication interface to
utilize the two-layer communication mechanism to achieve efficient collective
communications. The performance tests of those applications will demonstrate the usability
of the MOIDE runtime support system.
Chapter 4
Distributed N-body Method in MOIDE Model
The N-body problem is a typical irregularly structured problem. It simulates the evolution of
a physical system containing numerous bodies. The bodies exert forces on each other, and
these force influences sustain the continuous motion of the bodies in space. The
computation of the force influences is computation-intensive. When solved by a distributed
method, the N-body problem is also communication-intensive due to the globally tight
data-dependency in the force computation. This chapter presents a distributed N-body
method based on MOIDE model with the emphasis on the distributed tree structure designed
as the communication-efficient scheme for the data access in the computation.
4.1 Overview
The major operation in the N-body problem is the computation of the force influences on
every body. The body motion is the consequence of the force influences. The accumulation of
the force influences determines the velocity of each body and therefore the evolution of the
whole physical system. The straightforward method of force computation is to calculate the
force influence between each pair of bodies. This pair-wise calculation has time complexity
N·(N−1) = O(N²). Since hundreds of thousands of bodies may be involved in the
computation, the time cost is prohibitive in practice.
Time-efficient methods have been proposed [14-16] to reduce the high complexity. In
these methods, approximation is made in the force computation by grouping the bodies and
calculating the force influences based on body groups. The approximation relies on the fact
that force influences decrease with distance. The farther bodies impose lower force
influences on a body, so a body requires less information, at a lower frequency, from the
bodies that are farther away. If a body is far enough from a group of bodies, the force
influences from the group of bodies can be approximated by the force influence from a single
virtual body, the center of mass, which represents the group. The accumulated
force influence from the group of bodies can then be computed in one calculation of the force
influence from the center of mass. Thus the time complexity is greatly reduced.
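The center of mass of a group is its mass-weighted mean position. A minimal 2D sketch (the Body record here is illustrative, not the thesis's actual data structure):

```java
// Illustrative center-of-mass computation: a group of bodies is replaced by
// one virtual body whose mass is the total mass and whose position is the
// mass-weighted mean of the group's positions.
public class CenterOfMass {
    static class Body {
        double mass, x, y;
        Body(double mass, double x, double y) {
            this.mass = mass; this.x = x; this.y = y;
        }
    }

    static Body centerOfMass(Body[] group) {
        double m = 0, cx = 0, cy = 0;
        for (Body b : group) {
            m += b.mass;
            cx += b.mass * b.x;
            cy += b.mass * b.y;
        }
        return new Body(m, cx / m, cy / m);
    }
}
```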
The N-body methods with body grouping are usually tree-based hierarchical algorithms
in which a tree structure is used to represent the body distribution in physical space [16]. The
tree is constructed based on space decomposition and used in the force computation. The
Barnes-Hut method [14] is a well-known hierarchical N-body algorithm. In the Barnes-Hut method, a
physical space is recursively divided into sub-domains until each sub-domain contains at
most one body. The space decomposition is based on the spatial distribution of the bodies.
Fig. 4.1(a) gives an example of space decomposition in 2D space. At first, the space is
equally divided into four sub-domains. If there is more than one body in a sub-domain, the
sub-domain will be further decomposed into four smaller sub-domains. The Barnes-Hut tree is
built based on the space decomposition as Fig. 4.1(b) shows. This is a quadtree for 2D space.
(a) Space decomposition (b) Barnes-Hut tree
Fig. 4.1 Barnes-Hut tree for 2D space decomposition
For 3D space, cubical space decomposition is used. A sub-domain having more
than one body will be partitioned into eight sub-domains. The Barnes-Hut tree for 3D space
is an octree.
In the tree structure, the bodies reside on the leaves. An inner cell in the tree represents
the center of mass of the bodies beneath it. The force computation is performed by traversing the
tree. The simulation procedure of the N-body problem proceeds in iterations. The Barnes-Hut tree
is built at the beginning of each simulation loop. Every body traverses the tree, starting from
the root, and computes the force influences from the other bodies during the traversal. The body
compares its distance to each cell encountered on the path of traversal. If the body is far
enough from a cell, no further traversal is made beneath the cell. The force
influences from the bodies below the cell are computed as the force influence from the
cell, i.e., its center of mass, on the body. Otherwise, the body proceeds to traverse
the children of the cell. After the force computation, each body updates its position in the
space as the effect of force influences. That ends one simulation loop. The tree should be
rebuilt at the beginning of next iteration to reflect the new body distribution in space.
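The traversal decision above can be sketched as follows. The "far enough" test used here is the standard opening criterion s/d < θ (cell side length over distance below an accuracy parameter θ); the thesis does not state its exact test in this section, so the criterion, the toy 1D force m/d², and all names are assumptions of this sketch.

```java
// Illustrative Barnes-Hut traversal: use a cell's center of mass when the
// body is far enough (size/distance < THETA); otherwise open the cell and
// recurse into its children. 1D positions keep the sketch short.
public class BhTraversal {
    static final double THETA = 0.5;   // assumed accuracy parameter

    static class Cell {
        double mass, x, size;
        Cell[] children;               // null for a leaf (a single body)
        Cell(double mass, double x, double size, Cell[] children) {
            this.mass = mass; this.x = x; this.size = size; this.children = children;
        }
    }

    // accumulate a toy "force" m/d^2 on a body at position bx
    static double force(Cell c, double bx) {
        double d = Math.abs(c.x - bx);
        if (c.children == null || c.size / d < THETA) {
            return c.mass / (d * d);   // leaf, or far enough: one interaction
        }
        double f = 0;
        for (Cell child : c.children) f += force(child, bx);
        return f;
    }
}
```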
Parallel N-body methods [7,17,18] can be derived from the sequential Barnes-Hut
method. For example, Singh presented a parallel N-body method based on shared-address-
space model [18,61]. In Singh’s method, multiple processes cooperatively build a global
Barnes-Hut tree in the shared-memory section. All processes can access the tree in parallel.
Each process computes the force influences on a subset of bodies by traversing the global tree.
The shared-address-space method is applicable on shared-memory system like SMP
machines. In distributed memory systems, however, message passing is the general
communication methodology [17]. The global shared tree approach is unsuitable for
computation on a distributed system. If the global tree scheme were used, each process on a
different machine would have to duplicate the tree for local use, so the bodies
would have to be broadcast to all processes in each iteration. Inevitably the broadcast would cause
heavy communication overhead. One solution to this problem is to decompose the global tree
into subtrees that can be distributed to the processes. Each process obtains a body subset and
builds a subtree for it. It performs the force computation on the subset of the bodies. All
subtrees constitute a distributed tree structure.
4.2 Distributed N-body Method
A distributed N-body method is designed based on the MOIDE model. The basic
algorithm of the method comes from the Barnes-Hut method. However, the Barnes-Hut tree is
replaced by a distributed tree structure. Along with the distributed tree structure, a partial
subtree scheme is used for sharing the data of the subtrees among distributed compute
engines.
4.2.1 Distributed Tree Structure
The distributed tree structure is built by space decomposition. In the following text, the
distributed tree structure on the basic collaborative system is introduced first, where all
compute engines are single-threading objects. Then the basic distributed tree structure is
modified into the distributed shared tree structure for the computation on a hierarchical
collaborative system.
4.2.1.1 Space Decomposition
Given an N-body problem of a physical space with N bodies, to be processed on P
processors. To run on a basic collaborative system, the physical space should be partitioned
into P sub-spaces and thus the N bodies into P subsets. Each sub-space contains N/P bodies
on average and is allocated to one processor.
The distributed N-body method is designed based on Singh’s method. As explained for
Singh’s partitioning method in [18], the Barnes-Hut algorithm has a representation of the
spatial distribution encoded in its tree data structure. The bodies are inserted into the tree
according to the following scheme [4]:
Body insertion scheme
First, convert the floating-point spatial coordinate into an integer. In 2D space, the
coordinate pos[i] of a body is converted to the integer

    x[i] = IMAX · (pos[i] − rmin[i]) / rsize,  where i = 0, 1;

IMAX is a pre-defined integer; rmin[i] is the coordinate of the left corner of the space;
rsize is the integral side-length of the space.
Then, select the subcell from the root of the tree into which the body is inserted, i.e.,
find the path leading to the subcell of the tree beneath which the body is inserted. The index
i of the subcell is determined by the following code:
i = 0; yes = false;
if ((x[0] & level) != 0) {                /* level = the level of current cell */
    i += NSUB >> 1;                       /* NSUB = subcells per cell */
    yes = true;
}
for (k = 1; k < NDIM; k++) {              /* NDIM = number of dimensions */
    if ((((x[k] & level) != 0) && !yes)
            || (((x[k] & level) == 0) && yes)) {
        i += NSUB >> (k + 1);
        yes = true;
    }
    else
        yes = false;
}
The body will be inserted into the ith subcell of the current cell. The insertion is a
recursive procedure that continues until an empty leaf is found to hold the body.
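The subcell selection above can be cross-checked with a small self-contained sketch in Java, the implementation language of MOIDE. This is only an illustrative mirror of the subindex logic in the SPLASH-2 barnes code, not the thesis implementation; the class name, the 2D constants, and the test bit are our own choices.

```java
// Illustrative Java mirror of the subcell-selection logic (2D case).
public class SubcellIndex {
    static final int NSUB = 4;   // subcells per cell in a 2D quadtree
    static final int NDIM = 2;   // number of dimensions

    // Index of the subcell that integerized coordinate x falls into
    // at the tree level whose test bit is 'level'.
    static int subIndex(int[] x, int level) {
        int i = 0;
        boolean yes = false;
        if ((x[0] & level) != 0) {
            i += NSUB >> 1;
            yes = true;
        }
        for (int k = 1; k < NDIM; k++) {
            // XOR-like test produces the Gray-code child ordering
            // that underlies the Peano-Hilbert traversal.
            if ((((x[k] & level) != 0) && !yes)
                    || (((x[k] & level) == 0) && yes)) {
                i += NSUB >> (k + 1);
                yes = true;
            } else {
                yes = false;
            }
        }
        return i;
    }

    public static void main(String[] args) {
        int level = 4;  // test bit for the current level (assumed)
        // The four quadrants are visited in Gray-code order 0..3.
        System.out.println(subIndex(new int[]{0, 0}, level)); // 0
        System.out.println(subIndex(new int[]{0, 4}, level)); // 1
        System.out.println(subIndex(new int[]{4, 4}, level)); // 2
        System.out.println(subIndex(new int[]{4, 0}, level)); // 3
    }
}
```

Note that the four children of a cell are visited in a Gray-code order rather than plain Morton order, which is what yields the Peano-Hilbert-compatible traversal discussed below.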
The body insertion scheme and the tree construction algorithm of the Barnes-Hut tree
guarantee the Peano-Hilbert ordering of the bodies in an in-order traversal of the tree
[18,61,73]. The space decomposition can therefore be made by partitioning the global
Barnes-Hut tree, which preserves the spatial locality of the bodies in the sub-spaces. At the
beginning of the simulation in our N-body method, the compute coordinator is responsible
for generating the N bodies and decomposing them into P subsets. The space decomposition
algorithm is as follows.
Space decomposition algorithm
Build a global Barnes-Hut tree containing all N bodies. Starting from the root, do
in-order traversal on the tree and examine the number of leaves (i.e., the bodies)
beneath each cell encountered on the way of traversal. If the number of leaves under
a cell is less than N/P, put these leaves into the current subset of bodies. Otherwise
continue the traversal to the descendents of the cell. The leaves under the same direct
parent cell should be put into the same subset of bodies because these are
neighboring bodies in a sub-domain (a 2×2 grid in 2D space or a 2×2×2 grid in 3D
space). A subset is full when its size would exceed the upper bound if more leaves were
added to it. For 2D space, the upper bound of the subset size is N/P + 2. For 3D space,
the upper bound is N/P + 4. Thus one sub-space has been obtained which
encloses all bodies in the subset. The leaves encountered in the following traversal
will be put into a new subset to form another sub-space. When the traversal has
finished on the whole tree, all leaves in the tree have been partitioned into P subsets
and the bodies in one subset constitute a sub-space.
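The bookkeeping of the algorithm above can be sketched as follows, assuming for simplicity that the leaves already arrive in in-order (Peano-Hilbert) traversal order, grouped by their direct parent cell, and that the space is 2D so the bound is N/P + 2. Class and method names are illustrative, not from the thesis code.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the subset-filling step of the space decomposition.
// Leaves arrive in in-order (Peano-Hilbert) traversal order, grouped
// by their direct parent cell; each sibling group is kept whole.
public class SpaceDecomposition {
    static List<List<Integer>> partition(List<int[]> siblingGroups,
                                         int n, int p) {
        int bound = n / p + 2;                 // 2D upper bound N/P + 2
        List<List<Integer>> subsets = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        for (int[] group : siblingGroups) {
            // close the current subset if this group would overflow it
            if (!current.isEmpty()
                    && current.size() + group.length > bound) {
                subsets.add(current);
                current = new ArrayList<>();
            }
            for (int body : group) current.add(body);
        }
        if (!current.isEmpty()) subsets.add(current);
        return subsets;
    }

    public static void main(String[] args) {
        // 48 bodies arriving in sibling groups of 4, decomposed for P = 4
        List<int[]> groups = new ArrayList<>();
        for (int b = 0; b < 48; b += 4)
            groups.add(new int[]{b, b + 1, b + 2, b + 3});
        List<List<Integer>> subsets = partition(groups, 48, 4);
        System.out.println(subsets.size());         // 4 subsets
        System.out.println(subsets.get(0).size());  // 12 bodies each
    }
}
```

Because whole sibling groups are moved together, a subset may slightly exceed N/P, which is exactly why the algorithm allows the bound N/P + 2.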
The space decomposition algorithm generates the sub-spaces that enclose the neighboring
bodies. If running the N-body problem on four processors, the leaves in that tree should be
partitioned into four subsets. With the space decomposition algorithm, the physical space in
Fig. 4.1(a) can be partitioned into four sub-spaces as shown in Fig. 4.2(a), by partitioning the
leaves of the Barnes-Hut tree into four subsets as Fig 4.2(b) shows. The traversal visits the
leaves of the tree from left to right, which corresponds to the Peano-Hilbert traversal of the
bodies in the space, starting from the lower-left corner, as Fig 4.2(a) shows.
(a) Space decomposition (b) Four body subsets generated from partitioning the tree
(c) Distributed tree structure
Fig 4.2. Space decomposition and distributed tree structure on four processors
The three-dimensional N-body problem is handled in a similar way, except that the Barnes-
Hut tree is an octree. The code of the barnes application in SPLASH-2 [53,74] assures the
Peano-Hilbert ordering of the bodies in the Barnes-Hut tree.
4.2.1.2 Basic Distributed Tree Structure
The sub-spaces produced by the space decomposition are allocated to the compute
engines. Each compute engine will create a subtree for the sub-space it received. Thus a
distributed data structure is formed with P subtrees that are distributed on all compute
engines. For the example in Fig 4.2, the 48 bodies in the space are partitioned into four
subsets with 12, 12, 13, and 11 bodies respectively. A subset of bodies will be assigned to one
compute engine where a subtree will be built. The distributed tree structure is composed of
four subtrees as Fig. 4.2(c) shows.
4.2.1.3 Partial Subtree
Each compute engine carries out the force computation for the subset of bodies. In the
computation, the data on remote subtrees are needed to compute the force influences from
other sub-spaces. In our N-body method, a partial subtree scheme is designed to provide the
data of remote subtrees to each compute engine. This is a communication-efficient solution
for data sharing among all compute engines.
Considering the approximation made in the force computation of the Barnes-Hut method,
an inner cell in a subtree stores the center of mass that represents the total force influence of
the bodies beneath it, and this approximation may be applied to a remote body provided the
group of bodies is far enough away, i.e., the following distance condition is satisfied:

l / d < θ                                                        (4-1)

where l is the side length of the space domain represented by the cell; d is the distance
from the cell to the body; θ is a user-defined accuracy parameter, usually between 0.4 and 1.2
(1.0 is used in our method).
To satisfy the remote data access, a tradeoff is made to provide remote data sharing at
lower communication cost. Each compute engine builds a partial subtree, which is a
fraction of its subtree, for each remote compute engine. In total it builds (P−1) different
partial subtrees if there are P compute engines. The construction of a partial subtree is based
on the distance between two sub-spaces. A partial subtree is built by the following procedure.
Partial subtree construction
Assume two compute engines i and j. After building the subtree, compute engine
i creates the partial subtree ptreej for j, by traversing the subtree. The cell
encountered in the traversal is included into ptreej and the distance between the cell
and the root of the subtree on j (i.e., the center of mass of the sub-space on j) is
calculated. If it satisfies the distance condition (4-1), the traversal will not proceed to
the part beneath the cell. Otherwise, the traversal will continue to the children of the
cell. The partial subtree ptreej has been created when the traversal is completed.
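A minimal sketch of this pruned traversal, with a synthetic 2D cell geometry and illustrative names, might look like the following. Only the pruning rule, stopping the descent once l/d < θ already holds against the remote center of mass, is taken from the procedure above.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the pruned copy behind partial subtree construction.
// Cell geometry and names are illustrative, not from the thesis code.
public class PartialSubtree {
    static class Cell {
        double side;  // side length l of the cell's space domain
        double x, y;  // center of mass of the cell
        List<Cell> children = new ArrayList<>();
        Cell(double side, double x, double y) {
            this.side = side; this.x = x; this.y = y;
        }
    }

    // Pruned copy of 'cell' for a remote center of mass (rx, ry):
    // stop descending once l/d < theta already holds for a cell.
    static Cell build(Cell cell, double rx, double ry, double theta) {
        Cell copy = new Cell(cell.side, cell.x, cell.y);
        double d = Math.hypot(cell.x - rx, cell.y - ry);
        if (cell.side / d < theta) return copy;  // far enough: keep cell only
        for (Cell child : cell.children)         // otherwise open the cell
            copy.children.add(build(child, rx, ry, theta));
        return copy;
    }

    static int count(Cell c) {
        int n = 1;
        for (Cell child : c.children) n += count(child);
        return n;
    }

    public static void main(String[] args) {
        Cell root = new Cell(8, 4, 4);
        root.children.add(new Cell(4, 2, 2));
        root.children.add(new Cell(4, 6, 6));
        // Distant remote sub-space (center of mass at (20, 4)):
        // l/d = 8/16 < 1.0, so only the root cell is included.
        System.out.println(count(build(root, 20, 4, 1.0)));  // 1
        // Nearer remote sub-space at (10, 4): the root is opened.
        System.out.println(count(build(root, 10, 4, 1.0)));  // 3
    }
}
```

The two calls in main illustrate the adaptivity noted later in this section: the farther the remote sub-space, the smaller the partial subtree.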
A partial subtree is dedicated to the target compute engine. It is created based on the
distance from the cells in local sub-space to the other sub-space. The partial subtree will be
sent to the target compute engine to provide the data for the force computation there. Fig. 4.3
shows an example of the partial subtrees generated from subtree B in Fig 4.2. Three partial
subtrees are derived from subtree B. The outermost ring on subtree B encloses the partial
subtree for A. The partial subtree in the dashed ring is built for C, and the one in the inner
solid ring is the partial subtree for D.
Fig 4.3 Partial subtrees built from subtree B
Then the partial subtrees will be scattered to the remote compute engines. A compute
engine carries out the force computation by traversing its own subtree and all partial subtrees
it has received. In case the force computation on a body needs data not included in a partial
subtree, the body will be sent to the corresponding compute engine to access the complete
subtree there. The partial subtree is adaptive to the body distribution in the sub-spaces. It is
constructed based on the distance between two sub-spaces. The farther apart the sub-
spaces, the fewer cells will be included in the partial subtree. When two sub-spaces are far
enough apart, they only need to exchange the roots of their subtrees. Moreover, the cells in a
partial subtree satisfy the distance condition (in respect to the center of mass) for the force
computation. The partial subtree provides most of the data required in the force computation.
Therefore the partial subtree scheme is a communication-efficient method for the data sharing
of subtrees. The runtime test in 4.3.1 will verify this advantage.
4.2.1.4 Distributed Shared Tree Structure
In the hierarchical collaborative system, the group of threads in a compute engine can
cooperatively perform the force computation for a subset of bodies. The subtree and partial
subtrees on a multithreading compute engine are accessed by the group of threads. The subset
of bodies assigned to a compute engine should be proportional to its computing power. Hence
a space may be decomposed into sub-spaces of unequal size. The subtrees built on HiCS
form a distributed shared tree structure, because the subtrees and partial subtrees will be
shared among a group of threads.
Assume the computing power of a compute engine is decided by the processors in the
underlying host, and the N-body problem is processed on m hosts with P0, P1, …, Pm−1
processors each, where P = P0 + P1 + … + Pm−1. The space decomposition algorithm on
HiCS should be:
In the in-order traversal on the global Barnes-Hut tree, examine the number of
leaves (i.e., the bodies) beneath each cell on the path. If the number of leaves under a
cell is less than PiN/P, put these leaves into the ith subset of bodies, which is currently
under construction. The subset will be assigned to compute engine i on Pi processors.
Otherwise continue the traversal to the descendents of the cell. The leaves under the
same parent cell should be put into the same subset of bodies because these are
neighboring bodies (a 2×2 grid in 2D space or a 2×2×2 grid in 3D space). The ith
subset of bodies is full when its size would exceed the upper bound if any more leaves
were added to the subset. For 2D space, the upper bound of the subset size is
Pi·N/P + 2. For 3D space, the upper bound is Pi·N/P + 4. The leaves encountered
in the following traversal will be put into a new subset. When the in-order traversal
has finished on the tree, the leaves are partitioned into m subsets and the space has
been decomposed into m sub-spaces, one per compute engine.
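The per-engine quota implied by this algorithm, about Pi·N/P bodies for compute engine i, can be sketched as follows; the class and method names are illustrative.

```java
// Sketch of the workload quota implied by the HiCS decomposition:
// compute engine i on a host with Pi processors receives about
// Pi * N / P bodies, where P is the total processor count.
public class HicsQuota {
    static int[] quotas(int n, int[] procsPerHost) {
        int p = 0;
        for (int pi : procsPerHost) p += pi;   // P = P0 + ... + Pm-1
        int[] q = new int[procsPerHost.length];
        for (int i = 0; i < q.length; i++)
            q[i] = procsPerHost[i] * n / p;    // quota Pi * N / P
        return q;
    }

    public static void main(String[] args) {
        // 48 bodies on two quad-processor and two dual-processor
        // nodes: quotas of 16, 16, 8 and 8 bodies.
        int[] q = quotas(48, new int[]{4, 4, 2, 2});
        System.out.println(q[0] + " " + q[2]);  // 16 8
    }
}
```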
Consider the same N-body problem as in Fig 4.1(a) running on four heterogeneous SMP
nodes: two quad-processor nodes and two dual-processor nodes. The space decomposition
will generate four sub-spaces. The number of bodies in each sub-space is proportional to the
computing power of the target compute engine. As there are totally 48 bodies in the whole
space, the sub-space contains 16 bodies for a quad-processor node and 8 bodies for a dual-
processor node as shown in Fig. 4.4(a) and (b). The distributed shared tree structure contains
four subtrees as Fig. 4.4(c) shows. The partial subtrees in the distributed shared tree structure
are constructed in the same way, based on the distance between sub-spaces, as in the basic
distributed tree structure.
(a) Space decomposition (b) Four body subsets generated from partitioning the tree
(c) Distributed shared tree structure
Fig 4.4 Space decomposition and the subtrees on a four-SMP cluster
Compared with the basic distributed tree structure, higher data locality can be achieved in
the distributed shared tree structure on HiCS because the sub-spaces are larger on
multithreading compute engines residing on SMP nodes. The higher data locality can further
reduce the communication cost of solving the N-body problem.
4.2.2 Computing Procedure
The computing procedure of the N-body problem proceeds by iterating the simulation loop.
The computing procedure starts on all compute engines when the compute coordinator has
made the space decomposition and allocated the subsets of bodies to them. The compute
coordinator synchronizes the simulation loop. In each loop, the force influences on every
body are computed and the velocity and direction of the body’s motion are updated based on
the force influences, and then the bodies move to their new positions.
4.2.2.1 Simulation Loop
A simulation loop contains three steps.
Step 1: Subtree construction and partial subtree propagation
Each compute engine builds a subtree for the subset of bodies allocated to it. The root
information of each subtree, i.e., the center of mass in the sub-space, is broadcasted to all
compute engines. Partial subtrees are constructed from the subtree based on the root
information received. The partial subtrees are sent to other compute engines.
Step 2: Force computation
Each compute engine computes the force influences on every body in the subset by
traversing the local subtree and all partial subtrees. If the force calculation on a body
requires more information than a partial subtree can provide, the body will be sent to the
remote compute engine and traverse the complete subtree there to compute the force
influence from that sub-space. Then the result will be sent back to the source compute
engine. A body may be sent to more than one compute engine. The force influences on
each body from all sub-spaces are accumulated to get the total force influence on it. Finally
each body changes its state, including the velocity and the position, as the result of force
influence.
Step 3: Body redistribution
At the end of step 2, the new position of every body has been decided. Some bodies
may cross the border of the sub-space and enter a neighbor sub-space. To keep the body
locality in sub-space, the bodies moving to other sub-spaces should be transmitted to the
compute engine where the destination sub-space is located. Since the bodies advance in small
steps, only a few bodies move to other sub-spaces in each simulation loop.
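Steps 2 and 3 can be illustrated with a minimal one-dimensional sketch. The thesis does not specify the integrator, so a plain Euler step with unit mass and time step dt is assumed here; all names are illustrative.

```java
// Sketch of Step 2's per-body state update and Step 3's border test.
// The integrator is an assumption: a plain Euler step, unit mass, 1D.
public class BodyAdvance {
    static class Body {
        double x, vx;  // position and velocity
    }

    static void advance(Body b, double fx, double dt) {
        b.vx += fx * dt;   // velocity update from the accumulated force
        b.x += b.vx * dt;  // move the body to its new position
    }

    // Step 3: has the body left the sub-space [lo, hi)?
    static boolean leavesSubspace(Body b, double lo, double hi) {
        return b.x < lo || b.x >= hi;
    }

    public static void main(String[] args) {
        Body b = new Body();
        b.x = 0.9;
        advance(b, 1.0, 0.5);                            // vx = 0.5, x = 1.15
        System.out.println(leavesSubspace(b, 0.0, 1.0)); // true: redistribute
    }
}
```

A body for which leavesSubspace returns true is exactly one that Step 3 would transmit to the compute engine owning the neighboring sub-space.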
4.2.2.2 Execution Flow
On HiCS, the main thread in each compute engine builds the subtree and partial subtrees.
The force computation on the subset of bodies is shared among the group of threads in a
compute engine. All communications between the compute engines are carried out between
the main threads. Fig 4.5 is the execution flow of the distributed N-body method on HiCS.
for all compute engines
do
repeat
/* by main thread */
1. build subtree;
2. broadcast the root information;
3. build partial subtrees;
4. scatter partial subtrees;
/* by all threads */
for all threads
do
5. for each body processed by the thread
do
compute force influences;
if (the body needs to access any remote subtree)
insert the body into a send buffer;
end for
end for
/* by main thread */
6. send the bodies in send buffers to remote compute engines;
/* by all threads */
7. for all bodies from remote compute engines
traverse local subtree to compute force influence;
/* by main thread */
8. send the bodies back;
/* by all threads */
9. for all local bodies
do
advance bodies;
end for
/* by main thread */
10. for all bodies moving into other sub-spaces
do
transmit the bodies to destination compute engines;
end for
end repeat
end for
Fig 4.5 Execution flow of the distributed N-body method on hierarchical collaborative system
4.2.3 Load Balancing Strategy
All bodies change their positions in each simulation loop. Although each body moves only a
small distance per loop, the continuous motion will eventually cause a highly imbalanced
body distribution in the sub-spaces. That results in workload imbalance in the force
computation among the compute engines. In this case, load balancing is required to rebalance
the bodies among the sub-spaces.
The load balancing strategy in the distributed N-body method is to perform space re-
decomposition. The physical space is decomposed again based on the new spatial distribution
of the bodies. As described in section 4.2.1.1, the space decomposition algorithm can
generate sub-spaces that have an approximately equal number of bodies and preserve the
body locality. This is exactly the goal of load balancing, so the same space decomposition
algorithm can be used for it.
The space re-decomposition is conducted by the compute coordinator. When load
balancing is required, the compute coordinator collects all bodies from the compute engines
and performs the space decomposition for the bodies. The new sub-spaces are assigned to the
compute engines again as the task allocation at the beginning of the simulation. Then the
simulation proceeds on the new sub-spaces. The space re-decomposition can balance the
workload at reasonable cost. The runtime test in 4.3.1 shows that load balancing strategy
based on space re-decomposition can improve the performance of the N-body simulation.
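The thesis leaves the exact trigger for re-decomposition open; one plausible sketch, purely an assumption on our part, requests re-decomposition when the largest body subset exceeds the average subset size by a tolerance factor.

```java
// Hypothetical trigger for space re-decomposition: re-decompose when
// the largest body subset exceeds the average subset size by a
// tolerance factor. The condition is our assumption; the text only
// states that re-decomposition is invoked upon high imbalance.
public class LoadBalanceTrigger {
    static boolean needsRedecomposition(int[] subsetSizes,
                                        double tolerance) {
        int total = 0, max = 0;
        for (int s : subsetSizes) {
            total += s;
            if (s > max) max = s;
        }
        double avg = (double) total / subsetSizes.length;
        return max > tolerance * avg;
    }

    public static void main(String[] args) {
        // Near-balanced subsets of 12, 12, 13 and 11 bodies: no action.
        System.out.println(
            needsRedecomposition(new int[]{12, 12, 13, 11}, 1.25)); // false
        // After drift, 24, 10, 8 and 6 bodies: re-decompose.
        System.out.println(
            needsRedecomposition(new int[]{24, 10, 8, 6}, 1.25));   // true
    }
}
```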
4.3 Runtime Tests and Performance Analysis
The distributed N-body method has been tested on homogeneous and heterogeneous
clusters. The hosts are off-the-shelf Pentium III-based SMP machines and PCs. The OS is Red
Hat Linux 6.0 with kernel 2.2.12 and the JDK is Blackdown JDK 1.2.2. The hosts are linked by
a Fast Ethernet switch. The distributed N-body method simulates a galaxy in the Plummer
model [19] with a collection of 10,240 to 102,400 particles. The performance is reported in
execution time, speedup, computation-to-communication ratio and other metrics.
4.3.1 Tests on Homogeneous Hosts
First, the N-body method is tested on a cluster of four SMP machines. Each machine has
four processors. The cluster provides sixteen processors in total. One SMP machine is used in
the cases of one to four processors. Then two machines are used from five to eight processors,
and so on. The method is also executed on a cluster of PCs to test its behavior on multiple
single-processor machines.
1. Execution time and speedup on cluster of SMPs
Fig 4.6 Execution time of the N-body method on four quad-processor SMP machines
Fig 4.6 displays the execution time of the N-body method under different problem sizes N.
The execution time decreases in all cases as the number of processors increases, and
speedups are obtained in all cases as Fig 4.7 shows. The largest performance improvement
occurs on one to four processors, as the first four processors are provided by the same SMP
node. In this case, only one compute engine with threads inside executes the simulation loop,
without communication cost. Communication is required when more than one SMP node is
used: the compute engines need to scatter the partial subtrees and send bodies to and
receive them back from other compute engines during the force computation.
Fig 4.7 Speedups of the N-body method
2. Execution time on cluster of PCs
Fig 4.8 The execution time on the cluster of 32 PCs
In the previous test, at most four SMP nodes are used, so the inter-node
communication remains at a low level. To examine the performance on a system containing
more hosts, on which more communication is required, the N-body method has been tested on
a cluster of single-processor PCs as well. There are thirty-two PCs in the cluster, linked by a
Fast Ethernet switch. Fig. 4.8 shows the execution time on the cluster of PCs. Although the
communication overhead should be higher than in the previous test on the SMP cluster, Fig 4.8
shows that the execution time still decreases when increasing the number of PCs. The
distributed N-body method achieves similar performance on the cluster of single-processor
PCs as on the cluster of SMP nodes.
3. Efficiency of the distributed tree structure and partial subtree
The distributed tree structure and partial subtree scheme are proposed as a
communication-efficient solution for the data requirement in the force computation. The
efficiency can be demonstrated by the execution time breakdowns on the cluster of four SMP
nodes in Fig 4.9. The computation time in the figure is the cost of force computation. The
communication time includes the costs of sending partial subtrees, sending and receiving
bodies to and from remote compute engines for remote subtree access, and the body
redistribution at the end of simulation loop.
Fig 4.9 Computation and communication time breakdowns on the cluster of SMPs
As Fig 4.9 shows, most of the execution time is spent on computation. There is
no obvious growth in the communication time under large problem sizes. On the contrary, the
proportion of communication in the total execution time decreases as the problem size
increases.
(1) Comparison with full tree scheme
A full tree scheme has been tested as a comparison to the distributed tree structure. In
the full tree scheme, each compute engine builds a global Barnes-Hut tree so that all force
computation can be accomplished locally, without remote tree access. The scheme requires
broadcasting all N bodies to every compute engine in each simulation loop. Each compute
engine is still responsible for the force computation on a subset of bodies. Runtime tests show
that the broadcast of N bodies, instead of partial subtrees, produces a high communication
cost. Fig 4.10 shows the execution time breakdown of the full tree scheme on the cluster of
four SMP nodes under the problem size 20,480. The broadcast of all bodies induces a high
communication overhead when running on more than one SMP node (above four processors).
Communication takes up most of the execution time. As a result, the communication
overhead and the total execution time go up with the number of processors. The poor
performance of the full tree scheme affirms the communication efficiency of the distributed
tree structure.
Fig 4.10 Computation and communication time breakdowns of the full tree method
(2) Comparison of partial subtree schemes
As indicated in 4.2.1.3, the partial subtree is constructed based on the distance between
the corresponding sub-spaces. It is an adaptive data structure that provides most of the data
required in the force computation. The adaptive partial subtree scheme is in fact an
improvement on the cut-off partial subtree structure we proposed in [78]. Differing from
the adaptive structure in this thesis, the cut-off partial subtree simply duplicates the top
half of the depth of the subtree. Only one partial subtree is created from a subtree and it is
broadcast to all compute engines. Obviously the time cost for creating the cut-off partial
subtree is lower than for the adaptive partial subtrees, because multiple partial subtrees are
created from a subtree in the latter scheme. However, the cut-off partial subtree may not
contain sufficient information for the force computation on other compute engines; more
bodies then require remote subtree access and a higher communication cost is incurred.
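The cut-off scheme, duplicating the top half of the subtree's depth regardless of distance, can be sketched as follows; the node structure and names are illustrative, not taken from [78].

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the cut-off partial subtree: duplicate the top half of
// the subtree's depth, independently of the distance to any remote
// sub-space.
public class CutoffSubtree {
    static class Node {
        List<Node> children = new ArrayList<>();
    }

    static int depth(Node n) {
        int d = 0;
        for (Node c : n.children) d = Math.max(d, depth(c));
        return d + 1;
    }

    // Copy the top 'levels' levels of the tree, dropping the rest.
    static Node copyTop(Node n, int levels) {
        Node copy = new Node();
        if (levels > 1)
            for (Node c : n.children)
                copy.children.add(copyTop(c, levels - 1));
        return copy;
    }

    static Node cutoff(Node root) {
        // keep the top half of the depth (rounded up, assumed)
        return copyTop(root, (depth(root) + 1) / 2);
    }

    public static void main(String[] args) {
        // A chain of depth 4: the cut-off copy keeps the top 2 levels.
        Node root = new Node(), a = new Node(), b = new Node(), c = new Node();
        root.children.add(a); a.children.add(b); b.children.add(c);
        System.out.println(depth(cutoff(root)));  // 2
    }
}
```

Contrast this with the distance-based pruning of 4.2.1.3: the cut-off copy is cheap to build but ignores how far the receiving sub-space is, which is the source of the extra remote accesses discussed above.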
Fig 4.11 Adaptive partial subtree vs. cut-off partial subtree on cluster of PCs
These two partial subtree schemes are compared on the cluster of PCs. Fig 4.11 contrasts
the time costs of the adaptive partial subtree and the cut-off partial subtree schemes on 32
PCs under varied problem size N. At each problem size, the left column is the time
breakdown of the adaptive partial subtree scheme and the right column is that of the cut-off
scheme. The execution time is broken down into three portions: computation, the time for
force computation; communication, the time for sending the partial subtrees and bodies; and
tree build, the time for building partial subtrees. In the cut-off scheme on the right, only one
partial subtree is built from each subtree, so its tree build time is trivial and invisible in the
figure. For the adaptive partial subtree scheme, each compute engine builds (P−1)
different partial subtrees per simulation loop, where P is the total number of compute engines
(i.e., the number of PCs in this example). As a result, the time for tree building is apparent in
the execution time. On the other hand, the adaptive partial subtrees provide more data to the
force computation on the local compute engine, so that remote subtree access is reduced;
it spends less time on communication, as Fig 4.11 shows. From all of the test results, we can
conclude that the partial subtree scheme in this thesis satisfies most of the data requirements
of the force computation at low communication overhead, on both the cluster of SMPs and
the cluster of PCs. The distributed tree structure is a communication-efficient scheme for the
distributed N-body method.
4. Efficiency of load balancing strategy

N      |            | Computation | Communication | Tree build | Load balancing | Total time | Improvement ratio
-------|------------|-------------|---------------|------------|----------------|------------|------------------
10240  | Balanced   | 15.61       | 110.49        | 5.50       | 16.75          | 148.35     |
       | Imbalanced | 18.79       | 137.07        | 6.46       | 0              | 162.31     | 8.60%
20480  | Balanced   | 38.39       | 161.12        | 10.24      | 28.40          | 238.146    |
       | Imbalanced | 46.34       | 196.92        | 12.34      | 0              | 255.602    | 6.83%
40960  | Balanced   | 107.12      | 239.67        | 22.99      | 49.24          | 419.02     |
       | Imbalanced | 132.50      | 279.52        | 27.55      | 0              | 439.58     | 4.68%
61440  | Balanced   | 205.98      | 261.44        | 31.64      | 71.08          | 570.14     |
       | Imbalanced | 253.55      | 313.32        | 37.41      | 0              | 604.27     | 5.65%
81920  | Balanced   | 268.72      | 364.34        | 46.81      | 92.35          | 772.21     |
       | Imbalanced | 330.16      | 449.59        | 56.74      | 0              | 836.48     | 7.68%
102400 | Balanced   | 340.24      | 595.67        | 56.38      | 112.93         | 1105.22    |
       | Imbalanced | 415.73      | 728.34        | 70.20      | 0              | 1214.27    | 8.98%

Table 4.1 The times of the N-body method with/without load balancing (seconds)
The load balancing strategy in the distributed N-body method is described in 4.2.3. As
Singh indicated in [61,73], as well as Warren and Salmon in [75], the N-body problem
typically simulates physical systems that evolve slowly with time, and the distribution of
bodies changes little between two successive time-steps. Thus severe workload imbalance
will not occur frequently in the simulation procedure. Even so, load balancing is still required
when high workload imbalance does occur. Upon load imbalance, a global task repartition is
conducted by the compute coordinator. The compute coordinator collects all bodies from the
compute engines, makes the space re-decomposition and allocates the new body subsets to the
compute engines. Table 4.1 compares the time costs of the N-body method with and without
the load balancing strategy. The time costs in the table measure the elapsed time of
fifteen simulation loops.
Table 4.1 lists the times of four operations: computation, communication, tree building and
load balancing. Balanced is the time of the N-body simulation adopting the load balancing
strategy and Imbalanced is the time of the simulation without performing load
balancing. The improvement ratio indicates the performance improvement gained from the
load balancing strategy. As the results show, the load balancing strategy improves the
performance of the distributed N-body method.
4.3.2 Tests on Heterogeneous Hosts
The distributed N-body method has also been tested on heterogeneous hosts to exhibit the
flexibility of the MOIDE model. The test environment consists of four Pentium III-based
hosts: one tri-processor machine, two dual-processor machines, and one single-processor PC.
The system provides eight processors in total. The problem sizes are the same as in the tests
on the homogeneous SMP cluster. The N-body method starts to run on the tri-processor
machine. According to the host selection policy in 2.3.2, hosts with more processors have
higher priority in the selection, so the two dual-processor machines are used next. The single-
processor PC is used only in the case of eight processors. Multiple threads may be generated
in a compute engine according to the number of processors in the underlying host. The
computation workload is distributed to the compute engines based on their computing power.
Fig 4.12 shows the execution time of the distributed N-body method on the
heterogeneous hosts. The speedups are shown in Fig 4.13. The performance of the distributed
N-body method on the heterogeneous hosts is comparable to that on the homogeneous SMP
nodes. The method reaches around four-fold speedup on eight processors. Fig 4.14 shows the
computation and communication time breakdowns on the heterogeneous hosts. Comparing
the time breakdowns in Fig 4.14 with Fig 4.9, the communication cost on
the heterogeneous cluster is higher than that on the homogeneous SMP cluster under the same
problem size and number of processors, because more hosts are used and more remote
messaging is required.
Fig 4.12 Execution time of the N-body method on heterogeneous hosts
Fig 4.13 Speedups of the N-body method on heterogeneous hosts
Fig 4.14 Computation and communication time breakdowns on heterogeneous hosts
All the tests above manifest the efficiency of the MOIDE-based N-body method,
particularly the communication-efficient distributed tree structure. The tests also demonstrate
the flexibility of the MOIDE model and the usability of the MOIDE runtime support system.
The distributed N-body method can be mapped onto different hosts and achieves high
performance on different systems.
Chapter 5
Ray Tracing with Autonomous Load Scheduling
This chapter presents another irregularly structured application, ray tracing, based on the
MOIDE model. The ray tracing application uses autonomous load scheduling to achieve
highly asynchronous computation.
5.1 Overview
Ray tracing is a graphics rendering algorithm that renders an image on a view plane from
the mathematical descriptions of the objects in a scene. The view plane is divided into a grid
of pixels. The color visible at each pixel is computed by tracing the rays that emanate from a
viewpoint and pass through the pixels into the scene space. New rays are generated by
reflection and refraction when a ray hits an object along the way. The tracing
computations are performed recursively on the new rays. The rendering of the scene at a
pixel is a non-predetermined procedure; it depends on the objects a ray hits and the new rays
generated. The computation workload differs from pixel to pixel.
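The recursive structure of this procedure can be sketched as follows. The intersection test and shading terms are stand-in stubs with assumed constants; only the recursion pattern, a hit spawning a new traced ray down to a fixed depth, reflects the description above.

```java
// Skeleton of recursive ray tracing. Intersection and shading are
// stubs with assumed constants; only the recursion pattern matters.
public class RayTraceSketch {
    static final int MAX_DEPTH = 3;      // recursion bound (assumed)

    // Stub: whether a ray hits a reflective object. Here every ray
    // hits, so recursion is bounded only by the depth limit.
    static boolean hits(double[] origin, double[] dir) {
        return true;
    }

    // Returns a grey level in [0, 1]; a real tracer returns RGB.
    static double trace(double[] origin, double[] dir, int depth) {
        if (depth == 0 || !hits(origin, dir))
            return 0.1;                  // stub background intensity
        double local = 0.5;              // stub local shading term
        // a hit spawns a new ray; this stub reuses the same ray
        double reflected = trace(origin, dir, depth - 1);
        return local + 0.25 * reflected; // blend, reflectance 0.25
    }

    public static void main(String[] args) {
        double v = trace(new double[]{0, 0, 0},
                         new double[]{0, 0, 1}, MAX_DEPTH);
        System.out.println(v);           // about 0.658
    }
}
```

The unpredictable recursion depth per pixel is precisely what makes the per-pixel workload non-predetermined.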
Differing from the N-body problem, ray tracing has light data-dependency in its computation.
Each ray can be traced independently. The pixels on the view plane can be rendered in
parallel and in arbitrary order. Ray tracing is a compute-intensive irregular application
[8,20,21].
The distributed ray tracing method is developed based on the MOIDE model. It is a
completely asynchronous procedure due to the light data-dependency and the imbalanced
computation workload per pixel. Therefore the autonomous load scheduling discussed in
3.7 is suitable for exploiting the high parallelism in ray tracing.
5.2 Autonomous Load Scheduling
5.2.1 Background
In parallel ray tracing methods, the pixels on a view plane are partitioned into blocks. The
rendering of the blocks can be performed by multiple processes in parallel. The workload of
rendering a block is related to the part of the scene appearing in that block, so there is high
diversity in the workloads of the blocks, which cannot be determined a priori. Static block
allocation is unsuitable because it cannot produce an even distribution of the workload to the
processes; the total computation time would be constrained by the process with the highest
workload. Dynamic load scheduling should be applied to balance the rendering workload, in
which the blocks are gradually allocated to the processes at runtime based on the computation
progress on them.
In parallel ray tracing based on message passing, load balancing generally adopts a
centralized load scheduling approach in which a dedicated process works as the load
scheduler [21,56]. The load scheduler continuously monitors the rendering procedure on all
other processes. Once a process has finished rendering a block, the load scheduler collects
the rendered block from that process and allocates a new block to it. The load scheduler may
spend most of its time idle, waiting for the rendering on the other processes to complete,
which wastes computing power. An alternative is to let the load scheduler perform rendering
as well and carry out the load scheduling for the other processes only at specified moments;
for example, the scheduler checks for block allocation requests whenever it finishes rendering
a block of its own. In this approach the other processes may have to wait idle for block
allocation while the load scheduler is busy rendering.
MOIDE supports a more efficient approach to load scheduling in ray tracing. Given the
asynchronous rendering procedure, the autonomous load scheduling discussed in 3.7 is an
appropriate approach in which no dedicated load scheduler is required, and all processes can
concentrate on the rendering computation. Each process performs the dynamic block
allocation by itself, provided it can access the global block pool. The one-sided
communication feature of remote method invocation supports the implementation of the
autonomous load scheduling. As described in 3.7.1, a global block pool is maintained in the
compute coordinator, and any compute engine can fetch blocks directly from the global task
pool at any time via remote method invocation, without any intervention by the compute
coordinator. The autonomous load scheduling can produce a balanced distribution of the
rendering computation over all compute engines, including the compute coordinator, and
attain the highest parallelism.
The autonomous load scheduling strategy discussed in 3.7.1 is a two-level scheme on the
hierarchical collaborative system. In ray tracing it can be implemented as a two-level
allocation of rendering tasks. The main thread in a compute engine is responsible for getting
a block from the global task pool; the block is then divided into sub-blocks that are rendered
by the group of threads inside the compute engine. This two-level load scheduling approach
is called group scheduling.
There is another approach to autonomous load scheduling in ray tracing. As the rendering
operations are independent of one another, each thread can individually get and perform a
rendering task: it fetches a block directly from the global block pool and renders it
independently. This load scheduling approach is called individual scheduling.
5.2.2 Group Scheduling
Fig 5.1 shows the view plane of an image to be rendered. The view plane is partitioned
into row-oriented strips (blocks), each containing several rows of the image. A pointer is used
as the index to the strips still to be allocated. The pointer represents the global task pool and
is held in the compute coordinator.
The group scheduling in MOIDE-based ray tracing is a two-level autonomous load
scheduling approach. The threads work in cooperative mode (see 2.3.3) to share the rendering
task. The main thread in a compute engine reads the pointer to get a new rendering task,
which contains a certain number of successive strips. The number of strips allocated to a
compute engine depends on its computing power. For example, suppose four SMP nodes
(two quad-processor nodes and two dual-processor nodes) are used to render the image. As
Fig 5.1 shows, a compute engine on a quad-processor node gets four strips in each task
allocation and a compute engine on a dual-processor node gets two. The strips are rendered in
parallel by the group of threads in the compute engine. The strips allocated to a compute
engine are indicated by a local pointer; a thread gets one row from the strips at a time and
performs the rendering operations for that row. The getTask() and getSubtask()
methods provided in the MOIDE-runtime can be used to implement the group scheduling.
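A minimal sketch of the two-level allocation, with a global atomic counter standing in for the coordinator's strip pointer (cf. getTask()) and a local counter standing in for the per-engine row pool (cf. getSubtask()); the counters and the rendering body are illustrative stand-ins, not the MOIDE-runtime API.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of group scheduling: the main thread grabs a strip of rows from the
// global pool, then the engine's group of threads shares out the rows of
// that strip.
public class GroupScheduling {
    static final AtomicInteger globalPool = new AtomicInteger(0); // strip pointer

    public static int renderEngine(int totalRows, int stripRows, int nThreads) {
        globalPool.set(0);                     // reset for this single-engine sketch
        AtomicInteger rendered = new AtomicInteger(0);
        int first;
        // Main-thread role: fetch one strip at a time (cf. getTask()).
        while ((first = globalPool.getAndAdd(stripRows)) < totalRows) {
            int last = Math.min(first + stripRows, totalRows);
            AtomicInteger localPool = new AtomicInteger(first);   // cf. getSubtask()
            Thread[] group = new Thread[nThreads];
            for (int t = 0; t < nThreads; t++) {
                group[t] = new Thread(() -> {
                    int row;
                    while ((row = localPool.getAndIncrement()) < last) {
                        rendered.incrementAndGet();               // "render" one row
                    }
                });
                group[t].start();
            }
            for (Thread w : group) {
                try { w.join(); } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }
        return rendered.get();
    }

    public static void main(String[] args) {
        // Every row is rendered exactly once, regardless of thread interleaving.
        System.out.println(renderEngine(40, 4, 4));
    }
}
```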
Fig 5.1 View plane partitioning and allocation
The intention behind group scheduling is based on the prediction that the task allocation
operation has high communication overhead, so the occurrence of task allocation should be
reduced. Therefore the rendering tasks are allocated in strips and only the main thread
performs the task allocation, avoiding frequent task allocation. Based on the same prediction,
a rendered strip (i.e., the output of the rendering) is not sent back to the compute coordinator
immediately after it has been rendered; to reduce the communication overhead, the send-back
of all rendered strips is postponed to the end of the ray tracing. The compute coordinator
collects the rendered strips from the compute engines after all rendering operations have
finished, and finally organizes the rendered strips into an image. This operation is called data
reordering in the following text. The ray tracing method with group scheduling is
summarized in Fig 5.2.
for all compute engines
do
while (any strips to be rendered)
do
if (main thread in compute engine)
fetch next rendering task from global task pool;
for all threads
do
while (any rows to be rendered)
do
get a row;
render the row;
store the rendered row into local vector;
end while
end for
end while
if (main thread in compute coordinator)
do
collect the rendered strips from all compute engines;
reorder the strips to form the full image;
end if
end for
Fig 5.2 Ray tracing with group scheduling
5.2.3 Individual Scheduling
for all threads
do
fetch a strip from compute coordinator;
do
render the strip;
send the rendered strip back to and fetch next strip
from compute coordinator in one method invocation;
while (any strip to be rendered);
end for
Fig 5.3 Ray tracing with individual scheduling
In contrast to group scheduling, individual scheduling is characterized by individual task
fetching and immediate strip send-back. All threads work in independent mode (see 2.3.3)
and each thread performs task allocation by itself. When it has finished rendering a strip, a
thread immediately sends the rendered strip back to the compute coordinator and fetches the
next strip from the global task pool in one remote method invocation.
The getTask() method in the MOIDE-runtime system can be called by every thread to
perform the individual scheduling. The ray tracing method with individual scheduling is
summarized in Fig 5.3.
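The loop of Fig 5.3 can be sketched with ordinary Java threads; the shared map stands in for the coordinator's image buffer and the atomic counter for the global task pool, replacing the actual getTask() remote invocation, whose signature is not given in the text.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of individual scheduling: every thread fetches one strip (here one
// row) at a time from a shared pool and "sends back" the rendered row at once.
public class IndividualScheduling {
    static final AtomicInteger pool = new AtomicInteger(0);        // global task pool
    static final Map<Integer, int[]> image = new ConcurrentHashMap<>(); // coordinator buffer

    static int[] renderRow(int row, int width) {                   // placeholder rendering
        int[] pixels = new int[width];
        for (int x = 0; x < width; x++) pixels[x] = row * width + x;
        return pixels;
    }

    public static int run(int rows, int width, int nThreads) {
        pool.set(0);
        image.clear();
        Thread[] workers = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            workers[t] = new Thread(() -> {
                int row;
                while ((row = pool.getAndIncrement()) < rows) {    // fetch next strip
                    image.put(row, renderRow(row, width));         // immediate send-back
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return image.size();
    }

    public static void main(String[] args) {
        System.out.println(run(64, 8, 4)); // all 64 rows rendered exactly once
    }
}
```

Because each row lands directly in its own entry of the shared map, no data reordering step is needed afterwards, matching the observation about individual scheduling in 5.3.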
As Fig 5.3 shows, all threads perform the rendering operations in an entirely asynchronous
manner without cooperation or coordination. Each thread works as an independent compute
engine, while the threads in a group can still share some data objects, e.g., the object
description of the scene. The multiple threads consume fewer system resources and are more
stable than a group of single-threaded compute engines on an SMP node. The size of a strip
can be defined as one row. The effect of individual scheduling is that the strips are distributed
gradually to all threads depending on the rendering progress of each one, so individual
scheduling leads to an even workload distribution over all threads. However, individual
scheduling requires frequent remote method invocation for strip send-back and task fetching,
and the frequent communication may produce high communication overhead that affects the
overall performance.
Fig 5.4 Flow of ray tracing in group scheduling and individual scheduling
On the other hand, the immediate strip send-back has the advantage of overlapping the
system-wide computation and communication. The send-back communication of one thread
can be overlapped with the rendering computation on other threads. The overlapping can
resolve the potential communication bottleneck in the final strip send-back in the group
scheduling.
Fig 5.4 shows the execution flow of the group scheduling and the individual scheduling on
four threads in a compute engine. The execution flow consists of two operations: the
rendering computation and the strip send-back communication. As the runtime tests in 5.3
will reveal, the task fetching (task allocation) latency is a trivial fraction of the total execution
time, so the task fetching time is omitted in Fig 5.4. The group scheduling in Fig 5.4(a) defers
the send-back of all strips until the end of the rendering; the strip send-back phase at the end
may turn into the bottleneck of the execution flow. In contrast, the individual scheduling in
Fig 5.4(b) sends each strip back immediately once it has been rendered. Each thread performs
computation and communication alternately, so that computation and communication are
overlapped among the threads. The individual scheduling thereby eliminates the bottleneck of
the final send-back. However, every thread must perform communication operations
frequently, which may lead to communication contention on the side of the compute
coordinator and so become a new communication bottleneck.
The analysis above compares the advantages and disadvantages of the group scheduling
and the individual scheduling in ray tracing, which need to be verified by runtime tests. The
runtime tests will demonstrate that the individual scheduling is the better scheme for ray
tracing.
5.3 Runtime Tests and Performance Analysis
The tests of the ray tracing method focus on the performance of the different autonomous
load scheduling approaches. The group scheduling and the individual scheduling are
compared first; the individual scheduling is then also compared with the master/slave
scheduling approach.
1. Group scheduling and individual scheduling
To locate the communication bottleneck, a third autonomous scheduling scheme, called
combined scheduling, is designed for the test. It combines the main features of the group
scheduling and the individual scheduling.
Combined scheduling
(1) Task allocation: In combined scheduling, each thread independently gets one strip at a
time from the global task pool, in the same way as in individual scheduling. However, it
does not immediately send the rendered strip back.
(2) Strip send-back: The rendered strips are sent back to the compute coordinator at the end
of the rendering, and the compute coordinator reorders the strips in the same way as in
group scheduling.
Fig 5.5 Execution times of ray tracing in three load scheduling schemes
Fig 5.6 Speedups of ray tracing in three load scheduling schemes
[Charts: execution time (0 to 250 seconds) and speedup (0 to 8) against the number of
processors (1 to 16), with curves for the individual, group, and combined scheduling schemes]
The ray tracing method is tested on the cluster of four quad-processor SMP machines. Fig
5.1 is a sketch of the image to be rendered in the test. The execution times of the ray tracing
under the three scheduling schemes are displayed in Fig 5.5, and the speedups are shown in
Fig 5.6. In the test, the task allocation is one row at a time in the individual scheduling and
the combined scheduling. For the group scheduling, the image is partitioned into (20 ×
number of processors) strips.
Fig 5.7 Execution time breakdowns of group scheduling and combined scheduling
As the test results show, the individual scheduling has the best performance of the three
schemes, with smooth performance improvement as the number of processors increases. The
performance of the group scheduling is close to that of the combined scheduling, but slightly
better on fourteen and sixteen processors. The time breakdowns of the group scheduling and
the combined scheduling in Fig 5.7 reveal that the difference in their performance mainly
comes from the communication cost. Fig 5.7 shows only the execution time breakdown on
the main thread of the compute coordinator. The computation time is spent on rendering
operations and decreases with the number of processors. The data reordering is performed on
the compute coordinator to organize all rendered strips into the image; it is the smallest
portion of the execution time. The data reordering operation is also performed by the main
thread of each compute engine to collect the locally rendered strips together for send-back.
The communication time includes the time of task allocation and final strip send-back. There
is no communication cost on one to four processors because these four processors belong to
one SMP node; in this case no final send-back is required and the task allocation is made
locally. The communication overhead increases with the number of processors; eventually
the communication time exceeds the computation time and degrades the overall performance.
As the communication includes both task allocation and final strip send-back, it is
necessary to determine which of the two is the communication bottleneck. Fig 5.8 shows the
detailed time breakdowns of the combined scheduling, with the communication time divided
into task allocation time and strip send-back time. In the combined scheduling the strip send-
back takes place at the end, as in the group scheduling. The tests are made on four quad-
processor SMP nodes; each group of four threads are the sibling threads of one compute
engine. The threads perform the task allocation individually by fetching a row from the
global task pool. Fig 5.8 shows that the cost of task allocation on each thread is low in
contrast to the cost of the final strip send-back. It can be concluded with certainty that the
frequent individual task fetching does not produce heavy communication overhead, but the
final strip send-back does cause the communication bottleneck.
Fig 5.8 Execution time breakdowns of combined scheduling
[Chart: per-thread time breakdown (computation, task allocation, data reordering, strip
send-back) of the combined scheduling for P = 4, 8, 12, and 16 processors]
In the group scheduling, the group-based task allocation and the delayed strip send-back
aim to decrease the frequency of remote method invocation. These choices were made on the
prediction that frequent remote method invocation would produce high communication
overhead and should therefore be reduced during the computation. The runtime tests,
however, lead to the opposite conclusion: the row-based individual task allocation produces
low communication overhead and has little influence on the overall performance.
Now inspect the communication overhead in the individual scheduling. Fig 5.9 shows the
detailed time breakdowns of the individual scheduling. The communication time is the cost
of row fetching and send-back. In the individual scheduling, a rendered row is directly stored
into the corresponding entry of the image buffer in the compute coordinator via the remote
method invocation, so there is no strip reordering operation. In the first case in Fig 5.9 (P=4,
i.e., four processors), the rendering work is evenly performed by the four threads residing on
one SMP node, and there is no communication cost.
Fig 5.9 Execution time breakdowns of individual scheduling
The second case shows the execution times of eight threads on two SMP nodes (P=8).
The four threads on the left belong to the compute coordinator; their execution time is mostly spent
on computation because these threads do not invoke remote communication operations. On
the other hand, the threads on the remote compute engine need to perform communication
operations for the individual scheduling, so the four threads on the right spend communication
time in the execution. The remote method invocation operations on these threads are blocking
communication: a thread must wait for a round trip of data communication that sends the
rendered strip back and fetches the next strip to be rendered. The remote method invocation
is one-sided communication, performed by the remote threads only. The difference in
communication cost between the threads on the compute coordinator and those on the
compute engines also comes from the implementation of RMI. In the RMI mechanism, the
distributed objects do not communicate with each other directly but via their representatives,
the stub and the skeleton [12,77]. The communication between a remote compute
engine/thread and the compute coordinator is accomplished by the interaction between the
stub on the remote engine and the skeleton on the compute coordinator. Hence the
communication cost is not obvious on the compute coordinator for P=8, as Fig 5.9 shows.
Certainly the communication does influence the compute coordinator; the influence becomes
apparent in the cases of twelve and sixteen processors, where communication cost appears on
the compute coordinator. However, the communication time on the compute coordinator is
still lower than the total communication time on all compute engines.
Fig 5.9 also shows an interesting feature of the individual scheduling: the computation
and communication are nearly balanced on all threads. The balance is achieved automatically,
without any explicit load balancing operation. This also illustrates the advantage of the
system-wide overlapping of computation and communication produced by the individual
scheduling. Another encouraging observation is that the communication cost on each remote
thread decreases with the total number of processors, because each thread renders fewer rows
when running on more processors. This is completely different from the rising
communication overhead of the final strip send-back in the group scheduling (see Fig 5.7).
Therefore the individual scheduling is the better scheme for the distributed ray tracing.
Of course, the group scheduling scheme could be modified to adopt an earlier strip send-
back: the rendered strips could be sent back to the compute coordinator in the task allocation
operation performed by the main thread. That would overlap computation and communication
in the group scheduling and eliminate the bottleneck of the final strip send-back. However,
the strip reordering would still be required on both the compute coordinator and the compute
engines, so the total execution time should still be longer than with the individual scheduling.
2. Autonomous scheduling and master/slave scheduling
Parallel ray tracing methods based on message passing, e.g., the ray tracing applications
in [21,56], usually use a master/slave scheduling approach. One of the processes works as the
load scheduler and is dedicated to performing the runtime block allocation. This process
keeps watching the rendering procedures on the other processes and allocates one block at a
time to them when required; it often waits idle for the next task allocation request.
Considering the highly asynchronous rendering procedure, it is reasonable to remove the
dedicated load scheduler and let every process (thread) schedule its own rendering tasks. The
autonomous load scheduling is therefore more appropriate for parallel ray tracing than the
master/slave approach. Working in independent mode, all threads can perform the rendering
operations concurrently without a load scheduler. The individual scheduling can make full
use of the computing power, and high parallelism can be achieved in the ray tracing.
Fig 5.10 compares the performance of the autonomous load scheduling (individual
scheduling) and the master/slave load scheduling approaches on the same cluster of SMP
nodes as in the previous tests.
Fig 5.10 The comparison of autonomous load scheduling and master/slave scheduling in ray tracing
The execution times show that the autonomous load scheduling is superior to the
master/slave scheduling because one more thread performs the rendering. The ray tracing
based on the autonomous load scheduling is nearly P/(P−1) times faster than the master/slave
scheduling, where P is the number of processors.
From these tests we can conclude that the individual scheduling automatically balances
the rendering workload over the threads and exploits the highest parallelism in the MOIDE-
based ray tracing. It is the appropriate load scheduling approach for the distributed ray
tracing method.
Chapter 6
CG and Radix Sort on Two-layer
Communication
CG and radix sort are two irregularly structured applications with heavy all-to-all
communication requirements. The communication cost determines the performance of these
two applications in distributed computing, and the two-layer communication mechanism is
useful for improving their communication efficiency. The threads work in independent mode
to execute the CG and radix sort applications based on the MOIDE model.
6.1 Conjugate Gradient
Dense linear systems are usually solved by a direct method, for instance Gaussian
elimination. In the absence of rounding errors, a direct method leads to the exact solution of
the given linear system in a finite and fixed amount of work. However, a direct method may
cause many zero entries of a sparse matrix to become non-zero, and these new non-zero
entries require storage as well as CPU time [81]. Hence direct methods are not suitable for
solving sparse linear systems. Instead, sparse linear systems are solved by iterative methods,
which generate a sequence of approximations to the solution vector. The advantage of
iterative methods is the economy of CPU time and storage when solving sparse linear systems.
The conjugate gradient (CG) method is one of the most powerful methods for solving
large sparse linear systems Ax = b. CG is an iterative method that obtains the approximated
solution x by the iteration:
xk = xk−1 + αk pk    (6-1)
where αk is a scalar step size and pk is the direction vector. A detailed description of the CG
method can be found in [22].
The CG method performs floating-point computation on the large symmetric sparse
matrix A and the related vectors, and it communicates large vectors frequently during the
computation. An efficient communication mechanism is therefore required. When
implemented on the MOIDE model, the CG method can rely on the two-layer communication
mechanism to accelerate the vector communication and improve the overall performance.
The CG method on the MOIDE model is converted from the CG benchmark in the NAS
Parallel Benchmarks (NPB) [23]; the original CG benchmark is programmed in FORTRAN
and MPI (Message Passing Interface) [72]. In the following text, the algorithm of the CG
method is briefly described with emphasis on its communication pattern, followed by its
implementation on the MOIDE model.
6.1.1 Algorithm of CG
The CG method solves sparse linear systems by iterating on the approximated solution xk
in equation (6-1) until the residual || b − Axk || converges to a pre-specified accuracy. Fig 6.1
describes the algorithm of the CG method for solving Ax = b, where A is the sparse matrix;
zk is the intermediate result of x; pk, qk, and rk are vectors; and αk, βk, and ρk are scalars.
||x|| denotes the Euclidean norm of vector x, i.e., ||x|| = √(xTx).
1   x = (1,1,...,1);  /* initial value */
2   for (i = 0; i < niter; i++)
3   do
4       z0 = (0,0,...,0);
5       r0 = x;
6       p1 = r0;
7       ρ0 = r0T r0;  ③
8       for (k = 1; k < cgitmax; k++)
9       do
10          qk = A pk;  ①
11          αk = ρk−1 / (pkT qk);  ② ①
12          zk = zk−1 + αk pk;
13          rk = rk−1 − αk qk;
14          ρk = rkT rk;  ③
15          pk+1 = rk + (ρk / ρk−1) pk;
16      end
17      compute the residual norm ||r|| = ||x − Az||;  ① ② ③
18      x = z / ||z||;  /* z is the final zk of the inner loop */
19  end
Fig 6.1 Algorithm of conjugate gradient (CG) method
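The inner iteration of Fig 6.1 can be checked on a small dense example. This sequential sketch keeps the algorithm of Fig 6.1 but omits the mesh decomposition, so the reduction, transposition, and scalar-reduction steps collapse into ordinary local operations (marked in the comments); the test matrix is an assumption for illustration.

```java
// Sequential sketch of the CG inner iteration of Fig 6.1 for a small
// symmetric positive-definite system A z = b.
public class CgSketch {
    public static double[] matVec(double[][] a, double[] x) {
        double[] y = new double[x.length];
        for (int i = 0; i < x.length; i++)
            for (int j = 0; j < x.length; j++) y[i] += a[i][j] * x[j];
        return y;
    }

    static double dot(double[] u, double[] v) {
        double s = 0;
        for (int i = 0; i < u.length; i++) s += u[i] * v[i];
        return s;
    }

    // Returns z with ||b - Az|| below tol (or after maxIt iterations).
    public static double[] solve(double[][] a, double[] b, int maxIt, double tol) {
        int n = b.length;
        double[] z = new double[n];            // z0 = 0
        double[] r = b.clone();                // r0 = b - A*0 = b
        double[] p = r.clone();                // p1 = r0
        double rho = dot(r, r);                // rho0 = r'r   (scalar reduction)
        for (int k = 1; k <= maxIt && Math.sqrt(rho) > tol; k++) {
            double[] q = matVec(a, p);         // q = A p      (vector reduction)
            double alpha = rho / dot(p, q);    // (transposition + reduction)
            for (int i = 0; i < n; i++) { z[i] += alpha * p[i]; r[i] -= alpha * q[i]; }
            double rhoNew = dot(r, r);         // rho_k        (scalar reduction)
            double beta = rhoNew / rho;        // beta_k = rho_k / rho_{k-1}
            for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
            rho = rhoNew;
        }
        return z;
    }

    public static void main(String[] args) {
        double[][] a = {{4, 1}, {1, 3}};       // small SPD test matrix (assumed)
        double[] z = solve(a, new double[]{1, 2}, 25, 1e-10);
        double[] az = matVec(a, z);
        System.out.println(Math.abs(az[0] - 1) + " " + Math.abs(az[1] - 2));
    }
}
```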
The algorithm contains two nested loops. The predefined iteration counts niter and
cgitmax ensure that the solution vector x is obtained with the required precision. All
computations are vector operations. The parallel CG method is designed on a mesh topology
of multiprocessors: the sparse matrix A is decomposed into sub-matrices, and the vectors and
vector operations are handled over the mesh. To perform the vector multiplications in parallel,
the processes need to exchange partial products to obtain the final vector product. The data
communication operations in the parallel CG algorithm include vector reduction, vector
transposition, and scalar reduction. These communication operations are indicated in Fig 6.1
with the following signs:
sign  communication operation
①    vector reduction
②    vector transposition
③    scalar reduction
For example, line 7 in Fig 6.1 calculates the inner product of vector r0. The inner product
is calculated in sections by different processes; the partial inner products are then summed up
by a scalar reduction operation among the processes on the same row of the mesh to get the
total inner product ρ0.
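The sectioned inner product of line 7 can be illustrated directly: each process computes a partial dot product over its own section, and a scalar reduction sums the partials. The even sectioning below is a simplified stand-in for the mesh-row decomposition used in the thesis.

```java
// Sketch of line 7 of Fig 6.1: the inner product r'r computed in sections,
// followed by a scalar reduction that sums the partial products.
public class ScalarReduction {
    public static double sectionedDot(double[] r, int nProcs) {
        double[] partial = new double[nProcs];             // one partial per process
        int sec = (r.length + nProcs - 1) / nProcs;        // section length
        for (int p = 0; p < nProcs; p++)                   // local dot on each section
            for (int i = p * sec; i < Math.min((p + 1) * sec, r.length); i++)
                partial[p] += r[i] * r[i];
        double rho = 0;                                    // scalar reduction: sum partials
        for (double s : partial) rho += s;
        return rho;
    }

    public static void main(String[] args) {
        double[] r = {1, 2, 3, 4};
        System.out.println(sectionedDot(r, 2)); // equals 1 + 4 + 9 + 16 = 30.0
    }
}
```

The result is independent of the sectioning, which is why only the small partial scalars, not the vector sections, need to be communicated in this step.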
Line 10 computes the product of the sparse matrix A and vector p. The processes must
make a row-oriented vector reduction (see Fig 6.2(a)) to exchange and sum the sections of
the product vector and obtain the final product vector q. The computation in line 11
multiplies the two vectors p and q. However, the section of vector q generated in line 10 on a
process does not match the section of p on the same process; a process needs to exchange its
q section with the corresponding process by a vector transposition operation (see Fig 6.2(b)
and (c)) and then compute the local vector multiplication pkT qk. Thereafter a row-oriented
vector reduction is required to compute the sum of the vector product. Hence line 11 involves
one vector transposition and one vector reduction.
The floating-point computation workload of the CG method on multiprocessors is
determined by the sparse matrix A. The way to improve the performance is to provide an
efficient communication mechanism for the large-scale vector communication; the two-layer
communication mechanism provided in the MOIDE model can be utilized for this purpose.
(a) 2-step reduction on processor mesh (b) transposition on 2×4 processor mesh
(c) transposition on 4×4 processor mesh
Fig 6.2 Vector/scalar reduction and transposition operations in parallel CG method
6.1.2 CG Method in MOIDE Model
The hierarchical collaborative system in the MOIDE model is an appropriate infrastructure
for implementing the CG method on a heterogeneous system. The parallel CG method is
designed on a mesh; however, it should be able to run on distributed systems of varied
architecture. The MOIDE model supports the creation of a hierarchical collaborative system
that is adaptive to the heterogeneous hosts. Two features of the HiCS infrastructure are useful
to the CG method: the adaptability of the multithreaded computing and the two-layer
communication mechanism.
(1) Multithreaded computing
The threads working in independent mode can be logically organized into a mesh
structure to execute the CG method. Each thread (called a pseudo-engine, as in 2.3) runs
independently to simulate the work of a compute engine. Meanwhile the threads can make
efficient communication over the two-layer communication mechanism. To run the CG
method on heterogeneous hosts, multithreading compute engines are created according to the
architecture of the underlying hosts. The MOIDE-runtime maintains a location table (see 3.5)
that maps the pseudo-engines to their physical locations in the compute engines. The
communications between the pseudo-engines are invoked through the unified communication
interface and accomplished through the two-layer communication mechanism. The MOIDE
model thus gives the CG method high flexibility and adaptability on heterogeneous systems.
(2) Two-layer communication
The two-layer communication mechanism integrates shared-data access and remote
messaging on HiCS, and it can reduce the cost of the large-scale vector communication in
CG. The MOIDE-runtime delivers the data via the proper communication path between the
pseudo-engines: pseudo-engines in the same compute engine can perform the vector
reduction and transposition operations directly by shared-data exchange.
Fig 6.3 The reduction operation on 2 quad-processor SMP nodes
Consider the example in Fig 6.3, where the CG method runs on two quad-processor SMP
nodes. Two compute engines with four threads each are created on the SMP nodes, and the
eight pseudo-engines on the two nodes are organized into a 2×4 mesh as Fig 6.3 shows. All
vector/scalar reduction operations are then conducted within the compute engines by local
shared-data exchange; no remote messaging is required. Only the vector transposition must
call remote messaging across the two SMP nodes (see Fig 6.2(b)).
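The path selection of the two-layer mechanism can be sketched as a lookup in the location table: a message between pseudo-engines in the same compute engine goes through shared data, otherwise through remote messaging. The table layout and method names here are illustrative, not the MOIDE-runtime interface.

```java
// Sketch of two-layer communication: the location table maps each
// pseudo-engine to its compute engine; same engine -> shared-data exchange,
// different engines -> remote messaging (RMI in MOIDE).
public class TwoLayerComm {
    final int[] locationTable;      // pseudo-engine id -> compute engine id
    int sharedExchanges = 0, remoteMessages = 0;

    TwoLayerComm(int[] locationTable) { this.locationTable = locationTable; }

    // Returns the path taken so the choice can be observed.
    public String send(int fromPe, int toPe) {
        if (locationTable[fromPe] == locationTable[toPe]) {
            sharedExchanges++;      // layer 1: local shared-data access
            return "shared";
        } else {
            remoteMessages++;       // layer 2: remote messaging
            return "remote";
        }
    }

    public static void main(String[] args) {
        // The 2x4 mesh of Fig 6.3: pseudo-engines 0-3 on one SMP node,
        // 4-7 on the other.
        TwoLayerComm comm = new TwoLayerComm(new int[]{0, 0, 0, 0, 1, 1, 1, 1});
        System.out.println(comm.send(0, 1)); // row neighbours, same node
        System.out.println(comm.send(0, 4)); // across nodes
    }
}
```

With this table, a row-oriented reduction touches only same-node pairs (the shared layer), while a transposition pairs pseudo-engines across nodes (the remote layer), matching the Fig 6.3 discussion.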
Fig 6.4 is an example of running the CG method on heterogeneous hosts: two tri-processor
nodes and one dual-processor node, eight processors in total. Three
compute engines are created on the hosts, again eight pseudo-engines in total. The 2×4 mesh
of eight pseudo-engines is mapped onto the three hosts as in Fig 6.4. Fig 6.4(a) shows that
half of the reduction operations are performed inside a compute engine, and Fig 6.4(b) shows
that half of the vector transposition operations are likewise performed inside a compute
engine by shared-data exchange. Thus the communication in both examples benefits from the
two-layer communication.
(a) reduction on heterogeneous hosts (b) transpose on heterogeneous hosts
Fig 6.4 The reduction and transposition operations on heterogeneous hosts
The two examples also illustrate the flexibility of the MOIDE-based multithreading
computation on varied architectures, whether homogeneous or heterogeneous hosts. The CG
method runs uniformly on the 2×4 mesh in both examples regardless of the different
architectures of the underlying hosts, and in both cases the two-layer communication
mechanism can be used to improve the communication efficiency. The MOIDE-based CG
method is thus adaptive to the system architecture. The real performance of these two
configurations is measured in the following runtime experiments.
6.1.3 Runtime Tests and Performance Analysis
The CG method based on MOIDE model is tested on homogeneous and heterogeneous
hosts. The goal of the tests is to verify the adaptability of MOIDE-based computation and the
efficiency of the two-layer communication mechanism.
1. Tests on Homogeneous Hosts
The CG method is first tested on the cluster of four quad-processor SMP nodes under
different problem sizes n (A is an n×n matrix). The CG method requires the number of
processors to be a power of 2 in order to form a mesh structure. Fig 6.5 shows the execution
times of the CG method on one to sixteen processors, and the related speedups are drawn in
Fig 6.6. Higher speedup is obtained on larger problem sizes: about eight-fold speedup is
obtained on sixteen processors when n=90000.
Fig 6.5 Execution time of the CG method on homogeneous hosts
Fig 6.6 Speedup of the CG method on homogeneous hosts
The execution time breakdowns are depicted in Fig 6.7. The first four processors reside in
the same SMP node, so the communication between them is accomplished by local shared-data
access with low overhead. There is an obvious growth in the communication
cost from four to eight processors because two SMP nodes are used and remote
messaging is invoked for the vector transpose operations, whereas all reduction operations
can still be accomplished by shared-data exchange, as Fig 6.3 shows. Owing to the two-layer
communication mechanism, the communication overhead does not increase in proportion to
the problem size, and the performance improvement is more pronounced for large problem sizes.
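The saving from in-node reduction can be pictured with a minimal Java sketch. The class below is a hypothetical helper, not the actual MOIDE API: threads on one SMP node first combine their partial dot products through shared data, so only a single value per node has to travel by remote messaging.

```java
import java.util.concurrent.atomic.DoubleAdder;

// Sketch of in-node reduction over the shared-data layer (illustrative,
// not the MOIDE API). Threads on the same SMP node accumulate their
// partial results locally; only nodeSum() crosses the network.
public class NodeReduce {
    private final DoubleAdder local = new DoubleAdder();

    // called by each thread on this node (shared-data exchange)
    public void contribute(double partial) {
        local.add(partial);
    }

    // the single per-node value that is sent by remote messaging
    public double nodeSum() {
        return local.sum();
    }
}
```

Whatever the real buffer layout, the effect is the same: the number of remote messages per reduction drops from one per thread to one per node.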
Fig 6.7 Execution time breakdowns of the CG method on homogeneous hosts
2. Tests on Heterogeneous Hosts
The CG method has also been tested on a cluster of three heterogeneous hosts: two tri-processor
machines and one dual-processor machine, eight processors in total. The CG
method runs on the three hosts with the same problem sizes as in the previous test. One tri-processor
host is used in the tests on one and two processors, two hosts in the
case of four processors, and all three hosts for eight processors. The execution time and
speedup are displayed in Fig 6.8 and Fig 6.9. The results are similar to the test on
homogeneous SMP nodes. However, the performance on the heterogeneous hosts is a bit
lower than in the former test because more hosts are used for the same number of processors,
and the communication overhead is therefore higher, as the execution time breakdowns in Fig
6.10 show.
Fig 6.8 Execution time of the CG method on heterogeneous hosts
Fig 6.9 Speedup of the CG method on heterogeneous hosts
Fig 6.10 Execution time breakdowns of the CG method on heterogeneous hosts
3. Comparison with the single-threading method
To further verify the efficiency of the two-layer communication, the CG method is also
executed in single-threading mode, in which a single-threading compute engine is created on
each processor of the SMP nodes. In single-threading mode, remote messaging is the sole
communication method between compute engines, whether on the same or different SMP nodes.
This all-remote-messaging communication incurs heavy overhead. The single-threading
CG method is tested on two quad-processor SMP nodes. The execution time and
speedup in Fig 6.11 and Fig 6.12 show the poor performance of the single-threading method:
on eight processors it performs even worse than on four for problem sizes n=30000 and
n=50000 because of the heavy communication overhead.
Fig 6.11 Execution time of the single-threading CG method
Fig 6.12 Speedup of the single-threaded CG method
The single-threaded CG method can run on at most eight processors and a
maximum problem size of n=75000. The high resource consumption and heavy communication
overhead prohibit its execution on sixteen processors and with n=90000. The execution time
breakdowns in Fig 6.13 show the communication cost of the single-threading method. In
contrast to the breakdowns using the two-layer communication in Fig 6.7 and Fig 6.10,
the all-remote-messaging communication in the single-threading method results in high
communication overhead even on four processors within one SMP node; the communication time
exceeds the computation time on eight processors.
Fig 6.13 Execution time breakdowns of the single-threaded CG method
All runtime tests of the CG method demonstrate the communication efficiency of the
two-layer communication mechanism. The MOIDE model also provides the flexibility to
map the threads onto a logical mesh topology, which makes the CG method adaptive to
heterogeneous hosts. The MOIDE-based CG method can thus run efficiently in a uniform
form and achieve high performance on heterogeneous systems.
6.2 Radix Sort
Radix sort is a counting-based sorting algorithm that relies on the binary representation of
the elements. Let b be the bit length of the elements. Radix sort sorts a
sequence of elements in b/r rounds, where r < b. In each round it sorts the elements on an r-bit block,
starting from the least significant bits. It counts the number of elements having
the same value of the r bits (0 to 2^r−1) and decides each element's rank (new position) in the
sorted sequence. Then the elements are reordered according to their ranks. Radix sort must be
stable, i.e., the output ordering must preserve the input order of any two elements
having equal r-bit blocks. The computation in radix sort is simply the counting and
reordering of the elements. In parallel radix sort, every processor sorts an equal number of
elements, but an irregular all-to-all element exchange is performed in each round. Therefore
radix sort is a communication-intensive application.
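The per-round counting-and-ranking scheme can be sketched as a sequential LSD radix sort in Java. This is a minimal illustration of the algorithm described above, assuming non-negative integer keys; it is not the parallel MOIDE implementation.

```java
// Sequential LSD radix sort: b-bit non-negative keys sorted in b/r rounds
// of stable counting sort over r-bit blocks.
public class RadixSort {
    public static void sort(int[] a, int b, int r) {
        int buckets = 1 << r, mask = buckets - 1;
        int[] out = new int[a.length];
        for (int shift = 0; shift < b; shift += r) {
            int[] count = new int[buckets];
            for (int x : a)                         // count elements per r-bit value
                count[(x >>> shift) & mask]++;
            for (int i = 1; i < buckets; i++)       // prefix sums give the ranks
                count[i] += count[i - 1];
            for (int i = a.length - 1; i >= 0; i--) // backward scan keeps stability
                out[--count[(a[i] >>> shift) & mask]] = a[i];
            System.arraycopy(out, 0, a, 0, a.length);
        }
    }
}
```

The backward scan in the placement loop is what preserves stability: elements with equal r-bit blocks keep their relative order from one round to the next, which the correctness of LSD radix sort depends on.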
6.2.1 Parallel Radix Sort
In parallel radix sort [24], the rank of an element is its position in the entire
element sequence. A processor sends each element to its destination processor
according to the element's rank. Given n elements to be sorted on p processors, the
parallel algorithm is summarized in Fig 6.14.
for (i = 0; i < b/r; i++) do
    count local n/p elements on the i-th least significant r-bit block;
    sum count values;                      /* all-to-all reduction */
    rank the elements;
    scatter elements to target processors; /* all-to-all scatter */
    put the received elements into the local sequence;
end
Fig 6.14 Parallel radix sort
The parallel radix sort includes two global communication operations. One is the global
reduction that computes the sum of the counts over all processors. The reduction operation
broadcasts 2^r·p integers and has only a slight influence on performance. The other is the
all-to-all scattering that exchanges the elements among all processors to reorder them.
A processor sends local elements to other processors based on the ranks of the elements
and receives elements from the other processors. Fig 6.15 illustrates the scatter operation on four
processors. Each processor sends out and receives n/p elements (including those
elements that remain on the local processor) in the scatter operation. As the computation
workload is balanced over all processors, the irregularity of parallel radix sort lies in the
scatter operation, in which the number of elements exchanged between each pair of processors
varies widely. This scattering is the irregular communication occurring in every sorting round;
hence the scatter communication determines the performance of parallel radix sort.
Fig 6.15 All-to-all scattering of elements in parallel radix sort on four processors
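How unevenly the elements spread can be seen by tallying, for each local element, the processor that owns its rank. The sketch below is a hypothetical helper (not the MOIDE API); it assumes global ranks and n divisible by p.

```java
// Per-destination send counts for the scatter (illustrative helper):
// the element with global rank k goes to processor k / (n/p).
public class ScatterPlan {
    public static int[] sendCounts(int[] ranks, int n, int p) {
        int chunk = n / p;          // elements owned by each processor
        int[] counts = new int[p];
        for (int rank : ranks)
            counts[rank / chunk]++; // tally the destination of each element
        return counts;
    }
}
```

For typical inputs the resulting counts are far from uniform, which is exactly the irregularity discussed above: the cost of the scatter round is governed by the largest of these per-pair transfers.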
6.2.2 Radix Sort in MOIDE Model
The parallel radix sort is well suited to implementation on the MOIDE model. To preserve
stability in the sorting and reduce the communication overhead, the MOIDE-based radix sort
adopts different work modes of the threads in the sorting and scattering phases:
1. Independent sort
In the sorting phase, all threads (pseudo-engines) work in independent mode, each
thread sorting a subsequence of the elements. If, instead, a group of threads worked in cooperative
mode and cooperatively sorted a subsequence, synchronization would have to be
imposed on the element counts to ensure exclusive access to them. The synchronized
operations would produce high overhead and restrict the parallelism of the sorting.
2. Grouped scatter
The all-to-all scattering in each sorting round involves the exchange of
elements between P·(P−1) pairs of processors. Although the two-layer communication can
reduce the communication overhead, the scatter operation still produces high communication
overhead that affects the performance of parallel sorting, and this communication bottleneck
becomes more serious as the number of processors grows. The data sharing among the group
of threads in the same compute engine can be exploited to reduce the communication overhead
of the all-to-all scattering: the elements sent from all threads in one compute engine to the
threads in another compute engine are grouped together and sent to the destination in one
remote messaging operation. This is the grouped scatter operation. For example, if compute
engine i has pi threads and compute engine j has pj threads, the threads on engine i
would make pi·pj communication invocations to the threads on engine j if the scattering
were conducted separately between each pair of threads. With the grouped scatter, the
same communication is accomplished in one remote method invocation: the multiple
thread-to-thread invocations are replaced by a single communication between the pair of
compute engines. In the scattering phase, the threads work in cooperative mode, and the
grouped scatter is supported by the two-layer communication. The grouped scatter thus
effectively reduces the communication overhead of the radix sort.
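The grouping step itself can be sketched in a few lines of Java. The data layout below is illustrative, not the MOIDE API: the per-thread outgoing buffers of one compute engine are merged into a single payload per destination engine, so each engine pair needs one remote invocation instead of pi·pj.

```java
import java.util.*;

// Merge per-thread outgoing buffers into one payload per destination
// engine (hypothetical helper, not the MOIDE API). Input: for each local
// thread, a map from destination-engine id to the elements bound for it.
public class GroupedScatter {
    public static Map<Integer, List<Integer>> group(
            List<Map<Integer, List<Integer>>> perThreadOutgoing) {
        Map<Integer, List<Integer>> perEngine = new HashMap<>();
        for (Map<Integer, List<Integer>> threadBuf : perThreadOutgoing)
            for (Map.Entry<Integer, List<Integer>> e : threadBuf.entrySet())
                perEngine.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                         .addAll(e.getValue());
        return perEngine; // one remote method invocation per destination engine
    }
}
```

Because threads in one engine share an address space, this merge costs only local data movement, while the expensive remote messaging happens once per engine pair.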
6.2.3 Runtime Tests and Performance Analysis
The MOIDE-based radix sort is tested on clusters of SMP nodes, and its performance is compared
with a radix sort program implemented in C & MPI (MPICH 1.2.1).
1. Tests of the MOIDE-based Radix Sort
The MOIDE-based radix sort is first tested on a cluster of four dual-processor
nodes, eight processors in total. Table 6.1 lists the execution time of the radix sort with 1M to
10M elements on one to eight processors. As Table 6.1 shows, a two-fold performance
improvement occurs on two processors; this is the best improvement in the test.
The first two processors belong to one SMP node, so all data communication goes through
shared-data access.
n \ P      1        2        4        6        8
1M       8.547    4.668    4.604    4.494    3.49
2M      16.654    8.552    8.652    8.57     5.959
4M      32.593   16.919   17.275   16.1     10.971
6M      49.333   24.637   25.133   23.728   15.892
8M      66.05    33.243   33.191   32.137   21.341
10M     85.008   40.698   42.062   39.81    25.586
Table 6.1 Execution time of the MOIDE-based radix sort (seconds)
Fig 6.16 shows the execution time breakdowns of the radix sort. A burst of communication
cost appears when more than one SMP node is used. After that, the performance improves
only slowly on more processors because of the high communication cost; it even drops on
four processors due to the emergence of remote communication.
Fig 6.16 Execution time breakdowns of the MOIDE-based radix sort
Though the computation time decreases on more processors, the communication
cost rises with the number of processors. Moreover, the data exchange pattern in
radix sort is irregular, as Fig 6.15 shows: the number of elements sent from one processor
to another in the scatter operation varies greatly. The time cost of the scatter
operation is mainly determined by the largest data set being transmitted. The large-scale
irregular element scatter in each sorting round therefore limits the performance
enhancement. Nevertheless, the grouped scattering based on the two-layer
communication clearly benefits the performance of the MOIDE-based radix sort.
n \ P      1        2        4        6        8
1M       8.547    8.354    7.071    4.738    4.34
2M      16.654   16.923   11.784    8.712    7.44
4M      32.593   33.647   23.59    16.638   13.547
6M      49.333   51.315   37.072   24.214   24.112
8M      66.05    63.222   48.597   32.938   31.816
10M     85.008   76.567   60.412   40.897   37.373
Table 6.2 Execution time of the single-threading radix sort (seconds)
The advantage of the two-layer communication can also be demonstrated by comparison
with a single-threading radix sort: the MOIDE-based radix sort is run in
single-threading mode, in which two single-thread compute engines are created on each dual-processor
node and all communication is performed via remote messaging. Table 6.2 lists the
execution time of the single-threading radix sort.
Fig 6.17 Execution time breakdowns of the single-threading radix sort
The execution time of the single-threaded radix sort is clearly longer than that of the
MOIDE-based multithreading method because of the higher communication
overhead. Fig 6.17 shows the execution time breakdowns of the single-threading radix sort;
the all-remote-messaging communication produces high overhead in the
irregular scatter operation. There is no obvious performance improvement from one to two
processors, although these two processors reside on the same node. Beyond that, the method
sustains only a moderate performance enhancement on more processors and remains clearly
inferior to the multithreading method.
The comparison confirms the communication efficiency of the two-layer
communication mechanism. In fact, the two-layer communication mechanism implemented with
Java RMI can even outperform C & MPI in communication performance, as the following
test shows.
2. Comparison with MPI
A C & MPI radix sort program is used to compare the performance of the scatter
operations in MPI and in the MOIDE-supported two-layer communication. MPI is a widely
used communication library in parallel and distributed computing; MPICH 1.2.1 is used in
the test. The C & MPI program is tested on the same platform of four dual-processor
machines. The MPI package is compiled to use shared memory for fast message passing
within one node and TCP/IP for cluster communication. Table 6.3 lists the execution time of
the C & MPI program.
n \ P      1        2        4        6        8
1M       4.391    3.077    4.290    3.462    3.082
2M       8.774    6.082    8.216    6.779    5.812
4M      17.675   12.147   16.323   13.514   11.58
6M      26.635   17.975   24.646   19.901   17.138
8M      35.432   24.684   33.416   27.049   22.871
10M     43.380   30.763   41.051   33.655   27.594
Table 6.3 Execution time of the C & MPI radix sort (seconds)
Comparison with the execution time of the MOIDE-based radix sort in Table
6.1 shows that the MOIDE-based radix sort performs similarly to
the C & MPI program in all cases except on one and two processors. The execution time
breakdowns of the C & MPI program in Fig 6.18 reveal the cause of this
equivalence. Java programs are usually slower than C
programs; the test shows that the computation time of the MOIDE-based radix sort
in Java (see Fig 6.16) is nearly twice that of the C & MPI program (Fig 6.18). But the time
breakdowns in Fig 6.18 also show that MPI incurs higher communication time when running on two
or more nodes, so the performance improvement of the C & MPI program is poor on multiple
nodes. The MOIDE-based Java program, on the other hand, benefits from the efficient
two-layer communication: it has lower communication cost than the all-message-passing
MPI cluster communication and so achieves a higher performance improvement on more nodes.
The low-overhead communication enables the MOIDE-based Java program to reach
performance comparable to the C & MPI program, and even to outperform it on eight processors.
Fig 6.18 Execution time breakdowns of the C & MPI radix sort program
Fig 6.19 Execution time breakdowns of three radix sort programs: Java MOIDE-based (Java-M), Java
single-threading (Java-S) and C & MPI (C-MPI) in sorting 10M elements
One more radix sort program, the single-threading method, is tested as a contrast to the
MOIDE-based method. Fig 6.19 compares the execution time of the three radix sort
programs: the single-threading Java program (Java-S), the MOIDE-based Java program (Java-M),
and the C & MPI program (C-MPI) for problem size n = 10M. The C & MPI program is the
fastest in computation; its computation time is lower than that of the two Java programs. All
communication in the single-threading Java and C & MPI programs is by message
passing. The single-threading Java program has the highest communication overhead, which
indicates that the communication cost of RMI-based
remote messaging is higher than that of MPI. However, the two-layer communication of the
MOIDE model combines fast shared-data access with the slower remote messaging, yielding
an integrated communication performance as high as, or even higher than, that of MPI in C
on a cluster of SMP nodes. This is a significant achievement of the two-layer
communication mechanism and the MOIDE model in supporting irregular communication.
3. Test on a Larger System and Comparison with MPI
To examine the performance on a larger system, the MOIDE-based and C & MPI radix sort
programs are tested on the cluster of four quad-processor SMP nodes with problem size
n = 10M. As the execution time breakdowns in Fig 6.20 show, the MOIDE-based Java
program reaches a peak performance on the four processors of one SMP node, owing to the low
communication cost of all-shared-data access; this peak matches the performance
of the C & MPI program on fourteen processors. The growing communication overhead
determines the performance of the radix sort on multiple SMP nodes: although the
computation time continues to decrease on more processors, the communication cost
raises the overall execution time of the MOIDE-based radix sort. Its performance on
sixteen processors nevertheless eventually surpasses the result on four processors.
For the MPI program, MPICH can support fast message passing by using shared memory
on a single SMP node. However, the portable MPICH does not yet support multi-protocol
communication; it can use only one "device" at a time [76]. The communication between
the clustered SMP nodes cannot utilize both shared memory and message passing at the same
time. Thus the MPICH library must be compiled twice on a cluster of SMP nodes with
different device options: once to utilize shared memory for fast inter-process
communication within one SMP node, and once with the ch_p4 device option to support
cluster communication over TCP/IP sockets. Moreover, application executables are not
compatible across MPI devices; the applications need recompilation for each
MPI version (shared memory or message passing).
Fig 6.20 Execution time breakdowns of two radix sort programs on four quad-processor SMP nodes:
Java MOIDE-based program (Java-M) and C & MPI program (C-MPI) in sorting 10M elements
Fig 6.21 compares the communication costs of the two radix sort programs. The first four
processors belong to one SMP node. In the C & MPI program, the communication on two
and four processors goes through a single copy over shared memory; even so, its cost
is still higher than the local data exchange of the two-layer communication in the MOIDE model.
The software architecture of MPICH is multi-layered. To achieve portability,
all MPI functions are implemented in terms of the macros and functions that make up the
ADI (Abstract Device Interface). The ADI is in turn implemented in terms of a lower-level interface
called the channel interface, which has multiple implementations for different
hardware platforms, e.g., the implementation on shared-memory systems [76]. Data
communication between the processes on an SMP node thus crosses these multiple layers to reach the
buffer where the data read/write is carried out.
In the two-layer communication mechanism of the MOIDE model, data communication
within a local SMP node is fulfilled directly by data access via a runtime buffer created
in user space. When a thread calls the communication interface to communicate with a local
thread, the communication mechanism allocates a buffer in the user space, and the
communication operation is finished instantly by data exchange through the buffer,
which has lower overhead than the multi-layer buffering in MPI. Therefore, the
communication cost of the MOIDE-based Java program is much lower than that of the MPI program
on two and four processors in one SMP node. Moreover, the shared memory communication
in MPICH cannot cross SMP nodes. The user has to indicate which MPI library should
be used, the shared memory version or the message passing version; the two versions cannot be
used at the same time. When running on a cluster of SMP nodes, the other MPICH library, compiled
with the ch_p4 option, must be used to support the message passing; MPICH cannot automatically
switch from one device to another. The two-layer communication
mechanism in the MOIDE model, on the other hand, adaptively integrates shared memory
access and remote messaging on a cluster of SMPs, transparently deciding the proper
communication path. Despite the drastic growth of the communication time on multiple SMP
nodes, the two-layer communication implemented in Java still maintains a
communication cost comparable to C & MPI, and even lower than that of the C & MPI program on
eight, fourteen, and sixteen processors, as Fig 6.21 shows.
Fig 6.21 Communication costs of two radix sort programs on four quad-processor SMP nodes: Java
MOIDE-based program (Java-M) and C & MPI program (C-MPI) in sorting 10M elements
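The user-space buffer on the local level can be pictured as a simple in-process channel. The class below is a minimal Java sketch with illustrative names, not the MOIDE implementation: a send and a matching receive by two threads on the same node meet at a queue allocated in user space, with no remote messaging involved.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// In-process channel standing in for the shared-data layer (hypothetical
// helper, not the MOIDE API): two threads on the same node exchange data
// through a user-space buffer instead of a remote message.
public class LocalChannel<T> {
    private final BlockingQueue<T> buffer = new LinkedBlockingQueue<>();

    public void send(T data) {      // local write into the shared buffer
        buffer.add(data);
    }

    public T receive() throws InterruptedException {
        return buffer.take();       // local read; blocks until data arrives
    }
}
```

A two-layer mechanism would hand a pair of local threads such a channel and fall back to remote messaging only when the peer lives on another node, which is the transparent path selection described above.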
It can be observed in Fig 6.21 that the communication cost of the MOIDE-based radix
sort fluctuates as the number of processors increases. The fluctuation is caused by the
combination of communication paths on the two levels. The communication cost is reduced when
more processors are used on the same SMP node, such as when going from six to eight or
from ten to twelve processors. But the communication cost goes up from four to six and from eight to ten
processors because one more SMP node is involved. Nevertheless, the grouped scattering over
the two-layer communication sustains the improvement of communication efficiency
above ten processors.
This chapter has addressed two irregularly structured applications, CG and radix sort, both
characterized by high communication requirements. The two-layer
communication mechanism demonstrates its benefit in the CG method and radix sort developed on the
MOIDE model, achieving high performance in
large-size all-to-all communication. The CG method also demonstrates the flexibility of the
MOIDE model on heterogeneous systems: the threads in independent mode can be flexibly
organized into a proper structure and dynamically mapped onto the heterogeneous hosts
according to the architecture. Hence MOIDE-based applications can be developed and run in
an identical infrastructure on heterogeneous systems.
Chapter 7
Related Work
The widespread use of distributed systems for high-performance computing has
attracted substantial research effort in developing support software that integrates system-wide
computing resources and creates efficient computing infrastructures for developing and
executing applications in various fields. There are also research projects on algorithms and
software for solving irregularly structured problems. This chapter gives an overview of
recent developments in these two areas; the distinction of this research is clarified through
comparison with the related work.
7.1 Software Infrastructures on Distributed Systems
There is a trend to integrate computers at different sites into a large-scale, powerful
computing system to satisfy the demand for high-performance computation in different
application areas. The range of cluster computing has been expanding from a lab or
organization to a wide area spanning geographical distances, forming a huge clustered
system called the Grid [64,65]. Many projects develop computational models, programming
environments, and support tools for distributed computing on such large-scale
distributed systems.
7.1.1 Millennium
The Millennium project [25,26] aims to develop and deploy a hierarchical campus-wide
"cluster of clusters" at UC Berkeley, consisting of single-processor workstations, SMP
servers, and NOWs (Networks of Workstations) in different departments, to support advanced
applications in scientific computing, simulation, and modeling. The individual desktop and
departmental SMP server levels are incorporated into a local high-performance cluster of SMPs
(called a CLUMP), which utilizes extensions of the communication, system, and
programming technologies developed in the Berkeley NOW project [27]. The CLUMPs are
further organized into a large campus CLUMP. The entire collection of clusters will be
interconnected across campus with Gigabit Ethernet links to form a large cluster of clusters of
SMPs, called an intercluster.
A multiprotocol programming environment will be deployed on grouped CLUMPs to exploit
the SMP hardware for sharing data between processors on an SMP and to utilize the high-performance
interconnect between clusters. A Java-based version of Active Messages [28]
using the Java Native Interface is implemented to provide high-bandwidth I/O and
communication from a Java-based environment. The software system in Millennium is
constructed as a composition of services. A software platform for scalable, customizable
internet services is developed and deployed as the fundamental system infrastructure. Each
node in the cluster is provided with an iSpace execution environment on a JVM.
Communication is through their own Remote Method Invocation, which provides secure RMI,
multicast RMI, UDP-RMI, and fast RMI. Collections of iSpaces are grouped into a
MultiSpace; a service pushed into the MultiSpace is automatically scaled across the
cluster. A global software layer will be constructed to provide support for remote execution,
load balancing, and batch processing of parallel programs.
7.1.2 Globus
Globus [29,30] is a multi-institutional project centered at Argonne National Laboratory
and University of Southern California. It focuses on enabling the application of Grid concepts
to scientific and engineering computing. A set of services and software libraries, called
Globus Toolkit, is developed to support Grids and Grid applications. It includes software for
security, information infrastructure, resource management, data management, communication,
fault detection, and portability. The services can be used either independently or
together to develop grid applications and programming tools. The following are some
examples of the services provided in the Globus Toolkit:
(1) The Globus Resource Allocation Manager (GRAM) provides resource allocation and
process creation, monitoring, and management services.
(2) The Metacomputing Directory Service (MDS) provides a uniform framework for
providing and accessing system configuration and status information such as compute
server configuration, network status, or the locations of replicated datasets.
(3) Global Access to Secondary Storage (GASS) implements a variety of automatic and
programmer-managed data movement and data access strategies, enabling programs
running at remote locations to read and write local data.
(4) Nexus and globus_io provide communication services for heterogeneous environments,
supporting multimethod communication, multithreading, and single-sided operations.
7.1.3 AppLeS
AppLeS stands for Application Level Scheduler [31,32], a project at the University of
California, San Diego. The goal of AppLeS is to provide mechanisms and paradigms that
perform resource configuration and load scheduling to achieve a performance-efficient
implementation of an application on a distributed heterogeneous system. The AppLeS project has
two main parts:
(1) Application-Level Scheduling agents [33] are developed to provide a mechanism for
scheduling individual applications at machine speeds on production heterogeneous
systems. AppLeS agents utilize the Network Weather Service (NWS) [34] to monitor the
varying performance of resources potentially usable by applications. Each AppLeS uses
static and dynamic application and system information to select viable resource
configurations and evaluate their potential performance. Then AppLeS interacts with the
relevant resource management system to implement application tasks. Once it contains an
embedded AppLeS agent, the application becomes self-scheduling.
(2) AppLeS templates are stand-alone software projects that perform automatic scheduling
and deployment tasks for classes of structurally similar applications. Templates build on
the expertise gained while developing AppLeS agents, with a main focus on reusability
across several applications. There are currently two template projects: the Parameter Sweep
Template [35] and the Master Slave Template [21].
7.1.4 JavaPorts
JavaPorts [36,37] is an environment to facilitate distributed component computing on
clusters of workstations, developed at Northeastern University, USA. It is composed of an
application programming interface and a set of tools for the development of modular,
reusable, parallel and distributed component-based applications for cluster computing. The
JavaPorts project aims at providing the application developer with: (1) the capability to easily
create reusable Java software components for the concurrent tasks of an application; (2)
anonymous message passing among tasks while hiding the details of the communication and
coordination; (3) tools for the definition, assembly and reconfiguration of concurrent
applications using pre-existing and/or new software components.
The JavaPorts system exploits the advantages of workstation clusters which, in conjunction with
recent advances in networking technologies, have emerged as a cost-effective alternative to
expensive supercomputers for coarse-grain parallel computation. It also follows the
following principles: (1) cluster computing independent of the memory model
(shared vs. distributed); (2) separation of the coordination details from the computational
aspect of an application; (3) platform independence by using the Java technology as the
underlying implementation language; (4) object-oriented parallelism with modularity and
reusability.
7.1.5 Comparison with MOIDE Model
The capabilities of the projects above in supporting heterogeneous computing can be
assessed by the following criteria, which also allow MOIDE to be compared with these
projects. The comparison is summarized in Table 7.1.
Resource selection: manage and select resources from the collection of available
resources in a distributed system to execute an application.
Heterogeneity support: support platform-independent computing that hides but
utilizes the heterogeneous hosts.
Architecture adaptation: map the computation onto the hosts in adaptive mode to
match the specific architecture for high-performance computing.
Multi-level communication: support multi-method communication on different levels
of heterogeneous system to raise communication efficiency.
Dynamic reconfiguration: reorganize the computation to meet the change in the state
of the available resources.
Utilization scope: the range of distributed systems and applications to which the
project has been applied.
                           Millennium  Globus  AppLeS  JavaPorts  MOIDE
Resource selection         Yes         Yes     Yes     No         Yes
Heterogeneity support      Medium      High    High    Medium     High
Architecture adaptation    No          Yes     No      No         Yes
Multi-level communication  Some        Yes     No      No         Yes
Dynamic reconfiguration    No          Yes     Yes     No         Yes
Utilization scope          Wide        Wide    Wide    Small      Medium
Table 7.1 Comparison of the related work in supporting heterogeneous computing
Comparison with the related work shows that the MOIDE model covers many of the
same aspects as those projects. Like Millennium, Globus and AppLeS, MOIDE provides a
mechanism that integrates the resources in distributed systems and creates a flexible
infrastructure for high-performance computing on heterogeneous hosts. MOIDE provides
an efficient two-layer communication mechanism for interactions between a group of threads
and distributed objects, comparable to the communication services provided in Globus. It
also supports application-level load scheduling, as AppLeS does. As indicated in section
7.4.3, the autonomous load scheduling in MOIDE is more appropriate for applications such
as ray tracing than the master/slave scheduling in AppLeS. Implemented in Java, the MOIDE
model presents a flexible system infrastructure that facilitates object-oriented and
architecture-independent computing on various platforms, as JavaPorts does.
The MOIDE model emphasizes flexibility, adaptability and computational efficiency
on heterogeneous architectures. MOIDE incorporates the object-oriented and multithreading
methodologies to support efficient computing on hybrid platforms. Applications developed
in the MOIDE model can be adaptively mapped onto different hosts in a mode that realizes
high-performance computing on the given architecture. The model also integrates shared-data
access and remote messaging to implement efficient communication on heterogeneous
systems. The autonomous load scheduling requires no dedicated load scheduler or explicit
load scheduling operations; it automatically achieves load balancing across distributed
objects and threads. The MOIDE runtime support system has been developed as a
multi-functional support system that implements distributed computing based on the MOIDE
model. Although the runtime tests were conducted on small clusters, the MOIDE model is
also applicable, and the runtime support system also executable, on wide-area distributed
systems.
7.2 Programming Models on Cluster of SMPs
Several projects develop programming environments and models for clusters of SMPs.
These models combine varied programming methodologies suited to the hybrid architecture
of SMP clusters.
7.2.1 KeLP
KeLP (Kernel Lattice Parallelism) [38,39] is a programming model for implementing
portable scientific applications on distributed-memory parallel computers, developed at
UCSD. It provides a set of programming abstractions to represent the data layout and data
motion patterns in block-structured scientific calculations on SMP clusters, e.g., finite
difference methods and multiblock methods for solving partial differential equations. The
KeLP run-time system [40] implemented as a C++ library supports the general blocked data
decompositions and manages low-level implementation details such as message-passing,
processes, threads, synchronization, and memory allocation.
KeLP separates the description of communication patterns from the interpretation of the
patterns using a model known as communication orchestration. KeLP provides an easily
understood model of locality that ensures that applications achieve portable performance
across a diverse range of platforms. KeLP model reflects the multi-level interconnections on
SMP clusters. KeLP abstractions help manage each level independently and manage the
interaction between the levels.
7.2.2 SIMPLE
EXPAR (Experimental Parallel Algorithmics) concentrates on high-level, architecture
independent, algorithms that execute efficiently on general-purpose parallel machines [41,42].
It develops a methodology, called the SIMPLE model, for high-performance programming on
clusters of SMP nodes. The methodology is based on a small kernel of collective
communication primitives that make efficient use of the hybrid combination of the
shared-memory paradigm within an SMP node and the message-passing paradigm between
nodes. The communication primitives are grouped into three modules:
(1) Internode Communication Library (ICL) provides an MPI-like small kernel for internode
communication
(2) SMP Node Library contains three primitives for SMP node (barrier, broadcast, and
reduce)
(3) SIMPLE Communication Library is built on the ICL and SMP Node libraries. It includes
the primitives for the SIMPLE model: barrier, reduce, broadcast, allreduce, alltoall, gather,
and scatter.
Some portable parallel programs and data sets have been implemented in the SIMPLE
model, e.g., combinatorial computing and image processing.
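The hybrid idea behind SIMPLE, reducing within each SMP node through shared memory first and then combining the per-node partial results across nodes, can be sketched as follows. This is a single-process Java simulation written purely for illustration; it is not SIMPLE's actual library, and the class and method names are invented here.

```java
// Illustrative two-level reduction in the spirit of SIMPLE: an
// intra-node reduce over each node's values (shared memory), followed
// by an internode combine of the per-node partial sums (which SIMPLE
// would perform with message passing).
public class TwoLevelReduce {
    public static int reduce(int[][] perNodeValues) {
        int[] nodePartial = new int[perNodeValues.length];
        for (int n = 0; n < perNodeValues.length; n++) {
            int s = 0;
            for (int v : perNodeValues[n]) s += v;   // level 1: within the SMP node
            nodePartial[n] = s;
        }
        int total = 0;
        for (int s : nodePartial) total += s;        // level 2: across the nodes
        return total;
    }

    public static void main(String[] args) {
        // Two simulated nodes holding {1,2} and {3,4}.
        System.out.println(reduce(new int[][]{{1, 2}, {3, 4}}));  // 10
    }
}
```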
7.2.3 Comparison with MOIDE model
Focusing on support for high-performance computing on clusters of SMPs, the related
work can be compared using the following criteria:
Architectural transparency: the applications can be developed in identical model that
is independent from the architecture of the hosts.
Adaptive mapping: map the computation adaptively onto the hosts at run-time based
on the specific architecture—cluster of single-processors, SMP nodes, or mixed ones.
Combined communication: integrate the communication approaches on different
levels to accelerate the communication.
Unified interface: provide unified programming and communication interface to the
applications.
Application field: the kinds of applications whose development the model supports.
                            KeLP                           SIMPLE   MOIDE
Architectural transparency  No                             No       Yes
Adaptive mapping            Medium                         Low      High
Combined communication      Yes                            Yes      Yes
Unified interface           No                             Yes      Yes
Application field           Block-structured applications  Limited  Irregularly structured applications
Table 7.2 Comparison of the programming models on cluster of SMPs
Table 7.2 compares the three programming models. MOIDE is a widely usable distributed
computing model that supports the development and execution of various applications on
heterogeneous systems, not limited to SMP clusters. Unlike the KeLP model, which is
dedicated to block-structured computations, the MOIDE model is suitable for developing
different kinds of applications, in particular irregularly structured ones. The hierarchical
collaborative system in MOIDE is a runtime infrastructure that maps the computation and
communication patterns of an application onto the architecture of the underlying hosts so as
to achieve efficient execution on the hardware platform. The MOIDE model provides a
unified communication
interface at the application level based on the two-layer communication mechanism.
Regardless of the physical communication path, the threads in distributed objects can
interact with each other through the same communication interface. The MOIDE runtime
support system implicitly chooses the communication path, either shared-data access or
remote messaging, and completes the communication.
7.3 Methodologies for Irregular Structured Problems
There is also research work on developing programming languages, data structures, and
algorithms for solving irregularly structured problems. Due to the variety of irregularly
structured problems and the diversity of their characteristics, each individual project can
only study certain aspects of the techniques and specific irregularly structured applications.
7.3.1 IPA
IPA is the Irregular Parallel Algorithms project at the University of North Carolina. The
project proposes nested data parallelism to express the irregular computations and
investigates the incorporation of nested data parallelism in programming languages including
Fortran (Fortran 95/HPF) and Java [43,44]. Two approaches are used to cope with the load
imbalance in the nested data-parallel computations: using fine-grained threads and thread
migration to balance load, and flattening nested parallelism through compilation techniques
to create an unnested data-parallel computation that performs the correct amount of work,
optimally balanced over all processors. A runtime support library largely based on Fortran's
intrinsic functions and the routines in HPFLIB is created to support the data-parallel
computations on supercomputers. Parallel implementations of a Conjugate Gradient method
for unstructured sparse linear systems and a Barnes-Hut N-body simulation have been
constructed in Fortran 90 using the flattening transformations.
7.3.2 Scandal
The Scandal [45] project at CMU develops a portable, interactive environment for
programming a wide range of supercomputers. The two main goals of the project are:
(1) Developing a portable parallel language NESL and associated environment
NESL [46,47] is an applicative parallel language intended to be used as a portable
interface for programming a variety of parallel and vector supercomputers, and as a basis for
designing parallel algorithms. Parallelism is supplied through a set of data-parallel constructs
that manipulate collections of values.
(2) Developing fast implementations of parallel algorithms for irregular problems
Algorithms for various irregular problems are implemented on different parallel machines.
The algorithms are often written as prototypes in NESL and then machine-specific code is
written to study how algorithm and architecture interact. The existing theoretical algorithms
are studied to determine which ones can be mapped well onto existing parallel machines and
communication topologies, and what aspects are important in getting efficient
implementations. For example, the sorting problem has been studied extensively [24, 48].
Other algorithms are studied including the algorithms for finding the convex-hull of a set of
points [49], for finding the connected-components of a graph [50], for finding the union,
intersection, and difference of ordered sets [51], and for solving irregular linear systems [52].
7.3.3 Comparison with MOIDE model
MOIDE establishes an object-oriented computing infrastructure usable on different
systems such as multiprocessors, clusters of workstations, clusters of SMPs, and hybrid
systems of heterogeneous hosts. The hierarchical collaborative system and other related facilities
provide support for the development of efficient algorithms for solving various irregularly
structured problems. MOIDE model has demonstrated flexibility and efficiency in the
implementations of four irregularly structured applications described in the thesis. It also
supports the implementation of dedicated techniques for specific applications such as the
distributed tree structure in the N-body method and the grouped scatter in the radix sort.
7.4 Irregularly Structured Applications
Some benchmark packages include irregularly structured applications. More applications
are studied individually in different projects.
7.4.1 SPLASH-2 Programs
SPLASH-2 [53] is a suite of parallel programs released to study centralized and distributed
shared-address-space multiprocessors. It contains twelve applications that represent a wide
range of computations in the scientific, engineering and graphics domains. All programs are
based on the shared-memory model. Singh's N-body method introduced in Section 4.1 is
implemented as the Barnes application in SPLASH-2. Other irregular applications in
SPLASH-2 include:
FMM: Fast Multipole Method for N-body problem
Radiosity: compute the equilibrium distribution of light in a scene
Radix: radix sort
Raytrace: render a three-dimensional scene using ray tracing
Volrend: render a three-dimensional volume using a ray casting technique.
The MOIDE-based N-body method is developed with the distributed object approach.
An adaptive distributed tree structure is designed as a communication-efficient data
structure for solving the N-body problem on heterogeneous systems. Hence the MOIDE-based
method can run on both shared-address-space and distributed systems.
7.4.2 N-body Problem
The N-body problem has been broadly studied for a long time. Many N-body algorithms
have been proposed for different architectures and data structures. Here are some examples
of distributed N-body methods.
7.4.2.1 Salmon’s method
Salmon’s method is a parallel hierarchical N-body algorithm [7,75,79]. The method is
designed based on space decomposition and a hierarchical tree structure. A space is
recursively divided into domains. Local essential tree is built for each domain based on the
decomposition. As every body needs a fraction of the global tree for the force computation,
local essential tree is the union of the tree fractions required by all bodies in a domain.
Salmon’s method uses keys and hash table to describe the topology of a tree. Each cell in the
tree is assigned a key which is generated from the spatial coordinate of the cell. The
translation of keys into memory locations where the cell data is stored is achieved via hash
table lookup.
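The key scheme can be illustrated with a small sketch. Salmon's method derives each cell's key from its spatial coordinates; a common way to do this, assumed here purely for illustration and not taken from [7,75,79], is to interleave the bits of the integer cell coordinates into a Morton-style key and then use the key for hash-table lookup of the cell data.

```java
// Illustrative sketch: a 2-D Morton-style key interleaves the bits of a
// cell's integer coordinates so that each cell maps to a unique key; a
// hash table then maps keys to cell data.
import java.util.HashMap;
import java.util.Map;

public class TreeKeys {
    // Interleave the low 16 bits of x and y into a 32-bit Morton key.
    public static long mortonKey2D(int x, int y) {
        long key = 0;
        for (int i = 0; i < 16; i++) {
            key |= (long) ((x >> i) & 1) << (2 * i);      // x bits at even positions
            key |= (long) ((y >> i) & 1) << (2 * i + 1);  // y bits at odd positions
        }
        return key;
    }

    public static void main(String[] args) {
        Map<Long, String> cells = new HashMap<>();
        cells.put(mortonKey2D(3, 5), "cell data");        // store a cell under its key
        System.out.println(cells.get(mortonKey2D(3, 5)));  // cell data
    }
}
```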
In principle, the construction of the local essential tree follows a similar idea to the partial
subtree in this thesis. However, the local essential tree is a union subtree that contains all
information required in the force computation for the bodies in a domain. To decide which
data from one domain is essential to another, the method must inspect the distances from
each body to the other domains. This data inspection and the construction of the local
essential trees are more costly than the partial subtree scheme in this thesis. The heavy cost
may not be a serious problem since Salmon’s method was implemented on supercomputers.
Our N-body method, however, aims at running on distributed systems such as clusters,
where more attention must be paid to the communication overhead in the algorithm. The
adaptive partial subtree balances the data requirements against the communication cost,
making it an appropriate data structure for an N-body method in a distributed environment.
As stated in their papers, the motivation for using keys and a hash table in Salmon’s method
came from the difficulty of representing a distributed adaptive tree with pointers in
traditional languages such as Fortran 90 and HPF, especially when referring to cells in a
separate memory space on another processor. Our method, in contrast, is based on a
distributed object model implemented in Java. The object-oriented approach makes it
convenient to represent and reference complicated or remote data structures such as the
subtree and the partial subtree. Our method directly uses remote method invocation and
object references to transfer data objects, including the tree structure, between the compute
engines. Therefore the MOIDE-based N-body method is more flexible and adaptive on
distributed and heterogeneous systems.
7.4.2.2 Grama’s method
Grama presented a parallel implementation of the Barnes-Hut method on a message-passing
computer in [17]. In his method, a 2D physical domain is partitioned into subdomains.
The particles in each subdomain are assigned to one processor. A local tree is
constructed per processor and then all local trees are merged to form a global tree. All nodes
above a certain cut-off depth in the global tree are broadcast to all processors. Grama’s
method for 2D space ran on a 256-processor nCUBE2 parallel computer.
In the MOIDE-based N-body method, the distributed tree structure requires no global tree.
Instead, partial subtrees are built for each compute engine according to the distance between
two sub-spaces. This avoids the collective communication in building the global tree. It also
raises data availability in the partial subtrees during the force computation so as to
effectively reduce remote subtree access. The distributed tree structure in the MOIDE-based
N-body method is better suited to distributed-memory systems.
7.4.2.3 Methods based on general data structures
(1) PTREE
The object-oriented support for adaptive methods in [54] is based on a general-purpose
data structure layer implemented in C++. It provides a global data structure PTREE that is
implemented as a collection of local data structures on distributed-memory machine. The data
structure is distributed to multiple processors where computations are carried out and the
partial results are merged. The global data structure can support different applications. A
gravitational N-body simulation is implemented on the global data structure. The application
has been tested on a 64-node iPSC/860 machine.
(2) Liu’s method
Liu described a parallel C++ N-body framework that supports various scientific
simulations involving tree structures [55]. The framework consists of three layers: (1) the
generic tree layer supports simple tree construction and manipulation methods, and system
programmers can build special libraries using classes in this layer; (2) the Barnes-Hut tree
layer supports the tree operations required in most N-body tree algorithms; (3) the
application layer implements a gravitational N-body application upon the Barnes-Hut tree
layer. The communication library is implemented in MPI. The application was executed on
a cluster of four Ultra SPARC workstations connected by a fast Ethernet network.
Differing from the PTREE and Liu’s methods, the MOIDE-based N-body method is based on
a dedicated distributed tree structure, which is more efficient for the distributed N-body
method. Furthermore, the distributed tree structure is built on the hierarchical collaborative
system, so it is adaptive to heterogeneous system architectures. It is a more flexible data
structure suitable for various platforms.
7.4.3 Ray tracing
Usually the load scheduling scheme for parallel ray tracing based on message passing is a
centralized approach that uses a dedicated process as the load scheduler to dynamically
allocate the rendering tasks. There are also distributed load scheduling approaches that allow
the processes to balance the tasks among neighboring processes.
7.4.3.1 MPIPOV
MPIPOV [56] is an MPI parallel implementation of the public three-dimensional
rendering engine POV-Ray [57] on a cluster of PCs. It adopts two load scheduling
approaches: static task partitioning for homogeneous distributed architectures, and dynamic
load balancing for heterogeneous architectures. The dynamic load balancing approach is a
master/slave scheme: a dedicated master process, which performs no rendering itself, assigns
rendering tasks to the other processes in response to their requests. The master/slave
approach follows from the MPI programming methodology.
7.4.3.2 Ray-tracing in AppLeS
The parallel ray-tracing application in AppLeS is used to study application scheduling
policies under heterogeneous, time-varying resources and varied workload distributions [21].
The ray-tracing application is based on PVMPOV [58], a PVM implementation of POV-Ray.
Similar to MPIPOV, it uses two master/slave scheduling strategies. One is static fixed
distribution scheduling, which assigns all rendering blocks to the slaves at the beginning of
the computation. The other is dynamic work queue scheduling: the master process assigns
blocks one at a time to slaves that have finished processing a block and request more work.
7.4.3.3 Diffusion Load Balancing in Ray Tracing
A dynamic load balancing strategy based on a diffusion model is proposed for parallel ray
tracing in [59]. In the diffusion-based strategy, a load balancing operation is initiated by a
processor that runs out of work. The load balancing procedure is as follows:
If processor pi runs out of work, it sends a “request status” message to every immediate
neighbor processor pj. Then each pj sends a “status report” to pi that reveals its current
workload and number of pending rays. After receiving all status reports, pi sends a “transfer
work” message to every neighbor whose workload is above the average workload among the
neighbor processors. Each of the neighbors pj transfers a designated amount of work from its
queue of pending work to pi in a series of messages. After receiving all of the work, pi sends
an unsolicited amount of work to every neighbor that reported less than the average workload
among the set.
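A single diffusion step for the idle processor can be sketched as follows. The sketch only computes how much work each overloaded neighbor should transfer; the actual message exchange is omitted, and the names are illustrative rather than taken from [59].

```java
// Toy sketch of one diffusion load-balancing step: the idle processor
// has gathered its neighbors' reported loads, and each neighbor whose
// load is above the neighborhood average sends its excess.
public class DiffusionStep {
    // Returns how much work each neighbor transfers to the idle processor.
    public static int[] transfers(int[] neighborLoads) {
        int sum = 0;
        for (int l : neighborLoads) sum += l;
        int avg = sum / neighborLoads.length;            // average neighborhood load
        int[] t = new int[neighborLoads.length];
        for (int i = 0; i < neighborLoads.length; i++)
            t[i] = Math.max(0, neighborLoads[i] - avg);  // only overloaded neighbors send
        return t;
    }

    public static void main(String[] args) {
        // Neighbor loads 10, 2, 6 -> average 6 -> only the first sends 4 units.
        int[] t = transfers(new int[]{10, 2, 6});
        System.out.println(java.util.Arrays.toString(t));  // [4, 0, 0]
    }
}
```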
7.4.3.4 Comparison with MOIDE-based method
Compared with the parallel ray tracing methods above, the MOIDE-based ray tracing
method has two distinguishing features.
1. Free from dedicated load scheduler
The centralized master/slave load scheduling is the widely used dynamic approach for
ray tracing based on message-passing methodologies such as MPI. The diffusion load
balancing strategy requires complicated message exchange among processors. A processor
may be simultaneously involved in more than one load balancing operation, depending on its
interconnection with neighboring processors. The diffusion strategy may therefore produce
high intercommunication overhead and cause workload oscillation among the processors;
this approach is more useful for theoretical study than for practical use. MOIDE provides a
dynamic load scheduling scheme, autonomous load scheduling, which the MOIDE-based ray
tracing method adopts. The object-oriented and one-sided communication features of the
MOIDE model support the implementation of this scheme. The scheme requires no master:
each thread independently performs on-demand task fetching from the global task pool,
without interfering with other threads. The autonomous load scheduling automatically
achieves workload balance at runtime with low system overhead. In section 5.3, a test
compares the performance of autonomous load scheduling and master/slave scheduling and
verifies the advantage of autonomous scheduling.
2. Different from other load balancing strategies
The paper [80] summarized five representative dynamic load-balancing strategies on
networked systems. These strategies can be briefly described as follows:
1. Gradient Model: every processor interacts only with its immediate neighbors. Lightly
loaded processors inform other processors of their state, and overloaded processors
respond by sending a portion of their load to the nearest lightly loaded processor.
2. Sender-initiated Strategy: the load distribution is initiated by the overloaded processor
(sender), which tries to send a task to an underloaded processor (receiver).
3. Receiver-initiated Strategy: the underloaded processor (receiver) initiates load
balancing by requesting a certain amount of load from immediate overloaded neighbors.
4. Central Task Dispatcher: one of the network processors acts as a centralized job
dispatcher. The dispatcher keeps a table containing the number of waiting tasks in each
processor. Based on this table, the dispatcher notifies the most heavily loaded processor
to transfer tasks to a requesting processor.
5. Prediction-based strategy: use the predicted process requirements, e.g., CPU, memory
and I/O, to achieve load balancing.
The autonomous load scheduling is different from the strategies above. It can be viewed
as a receiver-initiated strategy, but without a specific sender or dispatcher. Every process
decides and performs the load fetching by itself. There is no pre-allocation of tasks before
execution, nor any dedicated load balancing operation. All load allocation happens during
the execution: the load is automatically and gradually allocated to the processors according
to the computation progress on them. The autonomous load scheduling can be implemented
based on the remote method invocation and one-sided communication features of MOIDE. It
is an efficient load scheduling scheme for data-independent computations such as ray
tracing; of course, it is not suitable for applications with high data dependency. It also
differs from the task-stealing strategy because it conducts neither pre-execution task
allocation nor task requests to neighboring processors: when a processor has completed a
task, it directly fetches the next task from the global pool.
Chapter 8
Conclusions
This thesis has presented MOIDE, a distributed object-oriented and multithreading model
for solving irregularly structured problems. A runtime support system has been developed to
implement computations in the MOIDE model on distributed systems, and the model has
been used to implement several irregularly structured applications. These applications have
demonstrated the flexibility and efficiency of the MOIDE model in supporting various
computation and communication patterns on heterogeneous systems.
8.1 Summary of Research
MOIDE is a distributed object model that responds to the broadly arising requirements for
high-performance computing methodologies on varied distributed systems. MOIDE
provides a flexible software infrastructure that is adaptive to the underlying system
architecture. It combines object-oriented and multithreading techniques to support efficient
computing on heterogeneous platforms such as clusters and large-scale Grids.
The kernel of MOIDE model is the collaborative system. It is a runtime infrastructure to
support object-oriented distributed computing. The basic collaborative system is constructed
with the objects created on the distributed hosts that have been selected to run an application.
The distributed objects in the collaborative system include one compute coordinator and a
group of compute engines. The compute coordinator is the initiator and manager of the
system. It starts the compute engines on remote hosts, assigning computing tasks to them and
coordinating the computing procedures on them. The communication in the collaborative
system mainly goes through remote messaging, implemented by remote method invocation
in the object-oriented methodology. Remote messaging is more powerful than ordinary
message passing: it can transfer not only data but also control between the objects.
The collaborative system is created on the most available hosts in a distributed system
based on the real-time states of the hosts, and it supports flexible runtime reconfiguration.
The collaborative system can be expanded by adding more hosts and creating new compute
engines on them to enhance the computing power, and the underlying hosts can be replaced
by new hosts in response to changes in the hosts’ states.
The hierarchical collaborative system (HiCS) incorporates the multithreading methodology
into the collaborative system in order to match the architecture of SMP nodes. Lightweight
threads can be generated in the compute engines residing on SMP nodes. The multithreading
technique creates an efficient computing infrastructure that consumes fewer system
resources and reaches higher stability than a sole distributed object system. HiCS is highly
adaptable on heterogeneous systems: multiple threads are generated based on the structure
of the SMP nodes, and the threads can work in cooperative mode or independent mode,
depending on the requirements of the computation.
The two-layer communication mechanism integrates the shared-data access among a
group of local threads and the remote messaging between distributed objects (compute
engines). It is a flexible and efficient communication mechanism on heterogeneous systems.
A unified communication interface is provided at application level. It provides the uniform
communication primitives to the applications and hides the underlying two-layer
communication paths. The remote messaging has the feature of one-sided communication,
which contributes to the flexibility of the MOIDE model: a compute engine can freely write
data objects to, or read them from, other compute engines without the explicit participation
of the other side. Relying on the one-sided communication, autonomous load scheduling is
provided to implement highly asynchronous computation.
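The idea of a single interface over the two-layer mechanism can be sketched as follows. The class and method names here are illustrative, not MOIDE's actual API: one send() call is dispatched to a shared in-memory queue when the peer is a local thread, and would fall back to remote messaging (RMI in MOIDE) when the peer is remote.

```java
// Sketch of a unified communication interface over two layers: the
// caller names only the peer, never the communication path.
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

public class UnifiedComm {
    // Peers on this SMP node, each with a shared in-memory message queue.
    private final Map<String, Queue<Object>> localPeers = new ConcurrentHashMap<>();

    public void registerLocal(String peer) {
        localPeers.put(peer, new ConcurrentLinkedQueue<>());
    }

    // Single entry point for both communication layers.
    public void send(String peer, Object data) {
        Queue<Object> q = localPeers.get(peer);
        if (q != null) {
            q.add(data);            // layer 1: shared-data access between local threads
        } else {
            sendRemote(peer, data); // layer 2: remote messaging (RMI in MOIDE)
        }
    }

    private void sendRemote(String peer, Object data) {
        // would invoke a remote method on the peer's compute engine
    }

    public Object receiveLocal(String peer) {
        return localPeers.get(peer).poll();
    }

    public static void main(String[] args) {
        UnifiedComm c = new UnifiedComm();
        c.registerLocal("thread-1");
        c.send("thread-1", "hello");                 // delivered via the shared queue
        System.out.println(c.receiveLocal("thread-1"));  // hello
    }
}
```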
The applications implemented in MOIDE model have high adaptability on heterogeneous
systems. An application can be developed on the architecture-independent infrastructure. It
will be mapped onto the underlying hosts at runtime and form a HiCS structure that matches
the specific architecture of the hosts to reach the best performance on the hosts.
A runtime support system, MOIDE-runtime, is developed to support the computation
based on MOIDE model. MOIDE-runtime implements all features of MOIDE model. It
performs the creation and reconfiguration of collaborative system. It establishes the two-layer
communication mechanism and provides the unified communication interface. It supports
local and remote synchronization on the threads and compute engines. It also supports
autonomous load scheduling. MOIDE-runtime is a cross-platform system implemented in
Java and RMI.
Four irregularly structured applications are implemented to demonstrate the advantages
of MOIDE model and verify the efficiency of the computation based on the model. These
applications have distinct computation and communication features. The N-body method is
an example of the MOIDE-based computation on hierarchical collaborative system. It
demonstrates the task decomposition and allocation strategies as well as the cooperation of
the compute engines and the multiple threads in the computation. Another feature of the N-
body method is the distributed tree structure that resolves the heavy communication problem
in distributed N-body method. The construction of the subtrees and partial subtrees illustrates
the design of complicated data structure and the related computation based on the distributed
object techniques in MOIDE model.
The ray tracing application is a practical application of autonomous load scheduling. Due
to the high asynchrony supported by the MOIDE model, all threads can perform the
rendering tasks fully in parallel. Different load scheduling schemes are tested as variations
of the autonomous load scheduling approach. The test results show that individual
scheduling is the best scheme, as it fully overlaps the system-wide computation and
communication to achieve high parallelism.
The CG and radix sort are communication-intensive applications. Both make use of the
two-layer communication mechanism to increase communication efficiency on clusters of
SMPs, and both are developed in the architecture-independent model. All communication
operations call the unified communication interface; the MOIDE-runtime implements the
two-layer communication at runtime to deliver the large amounts of data. The CG
application also confirms the adaptability of the MOIDE model on heterogeneous systems.
The two-layer communication mechanism implemented in Java and RMI can reach
performance comparable to MPI (Message Passing Interface) on heterogeneous systems,
and can even outperform MPI in some tests.
8.2 Achievements and Remaining Issues
8.2.1 Main Achievements
The thesis has covered many research aspects, including the computational model, the runtime
support system, and applications. The main achievements can be summarized as follows.
1. Design the MOIDE model as a distributed object computing infrastructure for solving
irregularly structured problems on heterogeneous distributed systems
The analysis of irregularly structured problems reveals the high diversity in their
computation and communication patterns. No common solution fits all of the varied
problems; each irregularly structured problem requires its own tailored method. What
is needed is a flexible model that facilitates the development of various
solutions for irregularly structured problems. Moreover, recognizing the hybrid
hosts existing in distributed systems, a flexible model is required to support the
architecture-independent development and dynamic mapping of applications on
heterogeneous systems. To meet these two requirements, the MOIDE model is designed as a
distributed object computing infrastructure general enough to implement different
applications on varied system architectures. The integration of the object-oriented and
multithreading methodologies in the model provides the polymorphism, encapsulation,
and location transparency that give the hierarchical collaborative system the
flexibility and adaptability to fit the architectural features and states of the underlying
hosts. The model provides the means to support high-performance computing for
solving irregularly structured problems on heterogeneous systems, namely the hierarchical
collaborative system, the two-layer communication, and the autonomous load scheduling.
2. Establish a unified communication interface on heterogeneous platforms
Many irregularly structured problems are communication-intensive applications.
Usually the communication requirements and costs are not predictable, owing to the
nonuniformity of these problems, and communication becomes the bottleneck to
overall performance. As a solution to this bottleneck, the MOIDE model
provides a two-layer communication mechanism on the hierarchical architecture of
heterogeneous systems. The mechanism integrates quick data sharing among a group
of threads with flexible remote messaging between distributed objects to
implement efficient communication. The integrated two-layer communication provides a
simple, flexible, and extensible communication mechanism for transmitting complex data
structures and control information between distributed objects. A unified communication
interface is provided to applications on top of the two-layer communication
mechanism; it preserves the architecture-independent feature of the MOIDE model
for the applications.
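The shape of such an interface can be sketched in Java as follows. The names Channel and SharedChannel are hypothetical, not the MOIDE API: the application always calls send/recv, while the runtime would select the shared-data layer for co-located threads and an RMI-based remote layer (omitted here) for objects on different hosts.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Unified interface the application programs against, regardless of
// whether the peer is a local thread or a remote object.
interface Channel<T> {
    void send(T msg) throws InterruptedException;
    T recv() throws InterruptedException;
}

// Intra-node layer: threads in one compute engine exchange data
// through a shared queue instead of serializing a remote message.
class SharedChannel<T> implements Channel<T> {
    private final BlockingQueue<T> q = new LinkedBlockingQueue<>();
    public void send(T msg) throws InterruptedException { q.put(msg); }
    public T recv() throws InterruptedException { return q.take(); }
}

public class TwoLayerDemo {
    public static void main(String[] args) throws Exception {
        Channel<double[]> ch = new SharedChannel<>();
        Thread producer = new Thread(() -> {
            try { ch.send(new double[]{1.5, 2.5}); } catch (InterruptedException e) {}
        });
        producer.start();
        double[] v = ch.recv();          // receiver blocks until data arrives
        producer.join();
        System.out.println(v[0] + v[1]); // 4.0
    }
}
```

A second implementation of Channel wrapping a remote method invocation would give remote objects the same calling convention, which is what keeps the application architecture-independent.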
3. Support the design of solutions for different applications
The object-oriented features of the MOIDE model enable the design of various
approaches to solve different irregularly structured problems. Autonomous load
scheduling is a technique to achieve high asynchrony in ray tracing. It is based on the
flexible, one-sided remote method invocation in object-based communication. For
applications with light data dependency, autonomous load scheduling can automatically
produce an even workload distribution over all threads without any explicit load
scheduling operation during execution.
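A minimal local sketch of the principle, assuming a hypothetical pool of independent rendering tasks (e.g., scanlines): in MOIDE the fetch would be a one-sided remote method invocation on the coordinator, but here an atomic counter stands in for that call, so each thread grabs the next task the moment it falls idle and no central scheduler intervenes.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class AutonomousScheduling {
    // Each worker pulls task indices until the pool is exhausted;
    // faster workers simply pull more often, so the load evens out.
    static int render(int tasks, int threads) throws InterruptedException {
        AtomicInteger next = new AtomicInteger(0); // shared task pointer
        AtomicInteger done = new AtomicInteger(0);
        Runnable worker = () -> {
            int t;
            while ((t = next.getAndIncrement()) < tasks) {
                // render scanline t ... (omitted)
                done.incrementAndGet();
            }
        };
        Thread[] pool = new Thread[threads];
        for (int i = 0; i < threads; i++) (pool[i] = new Thread(worker)).start();
        for (Thread th : pool) th.join();
        return done.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(render(100, 2)); // all 100 tasks complete, no scheduler
    }
}
```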
4. Implement a cross-platform runtime environment
The MOIDE runtime support system is developed to implement MOIDE-based
computation on heterogeneous systems. It implements the mechanisms and functions
required in MOIDE-based computation, and applications can be developed based on the
fundamental classes and methods provided in the MOIDE runtime. It also provides the
support to run the applications, including the creation of the hierarchical collaborative system
and the two-layer communication mechanism. Implemented in Java, it is a cross-platform
system executable on various systems, e.g., a cluster of single-processor nodes, a cluster of SMP
nodes, a cluster of mixed hosts, or a standalone multiprocessor, to support distributed object
computing.
5. Develop irregularly structured applications based on the MOIDE model
Four irregularly structured applications are developed in the MOIDE model. They are
typical irregularly structured problems whose different irregular characteristics call for
different approaches. These applications are used to demonstrate the generic
methodologies for developing applications in the MOIDE model as well as the approaches for
solving specific problems. The run-time tests of these applications validate the real
efficiency of MOIDE-based computation on different system architectures.
(1) The distributed N-body method is characterized by the distributed tree structure and
the collaborative computation of distributed objects and threads on the hierarchical
collaborative system infrastructure. The distributed tree structure with the partial
subtree scheme differs from the tree structures in other parallel N-body methods
on shared-memory or distributed-memory systems. The tree structure is supported by
the object-oriented method of the MOIDE model, which facilitates the construction and
transmission of tree structures in a distributed environment. It provides a
communication-efficient solution to the data-sharing requirement in N-body
simulation.
(2) The ray tracing method is well suited to autonomous load scheduling. With
the autonomous scheduling approach, the rendering workload is evenly
distributed over all compute engines and threads, and high parallelism is gained.
(3) The CG application involves a large amount of vector communication in each iteration,
which the two-layer communication mechanism can accelerate. The
application is used to show the performance of the two-layer communication and the
adaptive mapping of MOIDE-based computation onto heterogeneous architectures.
(4) Radix sort is another communication-intensive application. It features
efficient grouped communication in the scatter operation, which demonstrates both the
efficiency of the two-layer communication and the flexibility of the collaborative
computation in the MOIDE model. The MOIDE-based radix sort can outperform the
same application implemented in C and MPICH on a cluster of SMPs.
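For illustration, the counting-and-scatter step that the grouped communication accelerates can be sketched locally in Java (names are hypothetical; in the distributed version each bucket would travel to its owner engine in one grouped message rather than being permuted in place):

```java
import java.util.Arrays;

public class RadixPass {
    // Stable scatter of keys by the 8-bit digit at the given shift:
    // count digit occurrences, derive bucket offsets, then place each
    // key at its bucket's next free slot.
    static int[] scatter(int[] keys, int shift) {
        int[] count = new int[256];
        for (int k : keys) count[(k >>> shift) & 0xFF]++;
        int[] start = new int[256]; // bucket offsets (prefix sums)
        for (int d = 1; d < 256; d++) start[d] = start[d - 1] + count[d - 1];
        int[] out = new int[keys.length];
        for (int k : keys) out[start[(k >>> shift) & 0xFF]++] = k;
        return out;
    }

    public static void main(String[] args) {
        int[] keys = {0x0203, 0x0101, 0x0102, 0x0201};
        // Two 8-bit passes fully sort 16-bit keys.
        int[] sorted = scatter(scatter(keys, 0), 8);
        System.out.println(Arrays.toString(sorted)); // [257, 258, 513, 515]
    }
}
```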
8.2.2 Remaining Issues
Despite the achievements in this thesis, some aspects still require improvement.
1. Efficiency of system reconfiguration
As described in Section 2.2.4, the collaborative system is flexible enough to be reconfigured
dynamically. The system can easily be expanded by adding new compute engines, or
the underlying hosts can be replaced by new hosts in response to a change of system
state. The purpose of reconfiguration is to improve system performance. However,
reconfiguration needs a series of operations, including new host selection, new compute
engine creation, registration update, and the migration of data/task objects. This is a time-
consuming procedure, and the overhead often makes it impractical in real computation. To solve
this problem, we should specify in detail the conditions under which system reconfiguration is
necessary and can truly improve system performance. In addition to
the reconfiguration approaches in Section 2.2.3, alternative approaches should also be designed
to implement efficient reconfiguration.
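One illustrative form such a condition could take (not from the thesis) is a simple cost test: reconfigure only when the time saved over the remaining work exceeds the measured migration overhead.

```java
public class ReconfigPolicy {
    // remainingWork in abstract work units; rates in units per second;
    // overheadSec is the measured cost of selection, engine creation,
    // registration update, and object migration combined.
    static boolean worthReconfiguring(double remainingWork,
                                      double oldRate, double newRate,
                                      double overheadSec) {
        double stayTime = remainingWork / oldRate;
        double moveTime = remainingWork / newRate + overheadSec;
        return moveTime < stayTime; // migrate only if it finishes sooner
    }

    public static void main(String[] args) {
        // Early in a long run the faster host wins despite the overhead;
        // late in the run the overhead dominates and we stay put.
        System.out.println(worthReconfiguring(1000, 10, 20, 30)); // true
        System.out.println(worthReconfiguring(100, 10, 20, 30));  // false
    }
}
```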
2. System reliability
The execution of applications on the MOIDE runtime support system may encounter
occasional crashes. The causes include the shortage of memory space when
running large applications and deadlock during execution. Space-efficient strategies should
be designed for space-consuming applications such as the N-body problem. The two-layer
communication mechanism should improve its communication protocol to support reliable
data transfer and to resolve communication deadlock. The runtime support system should
improve the synchronization facilities that applications call to avoid deadlock.
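As a hedged sketch (not the MOIDE API) of what such a facility could offer, a timeout-guarded receive lets a blocked receiver fail fast instead of deadlocking when its peer never sends:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class TimedRecv {
    // Bounded wait: returns the message, or "timeout" if none arrives
    // within the given window, so the caller can recover or report.
    static String recvOrTimeout(BlockingQueue<String> q, long millis)
            throws InterruptedException {
        String msg = q.poll(millis, TimeUnit.MILLISECONDS); // null on timeout
        return msg == null ? "timeout" : msg;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> q = new LinkedBlockingQueue<>();
        System.out.println(recvOrTimeout(q, 100)); // prints "timeout"
    }
}
```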
3. Programming templates
At present, the MOIDE model lacks standard programming templates. Although the
runtime support system provides the fundamental classes and APIs, applications are
mostly programmed in their own manner. Standard programming templates are required to
formalize MOIDE-based application design and to ease programming. We need to
provide programming templates that suit both the application patterns and the MOIDE
model.
8.3 Future Work
The MOIDE model has been successfully implemented on a cluster of SMP nodes and a small
heterogeneous cluster. The model has been used to design the sample irregularly
structured applications, which demonstrated satisfactory performance in the tests.
Nevertheless, the model should improve its functionality and extend its use to a wide range of
distributed systems and applications. Future work will concentrate on the following
aspects.
1. Solve remaining problems
Firstly, the remaining issues discussed in Section 8.2.2 will be solved. The requirement for
system reconfiguration and its effect on overall system performance should be investigated,
and conditions should then be defined that ensure reconfiguration is both necessary and
beneficial. New system reconfiguration schemes will be designed to reduce the time cost and
improve usability. For example, migrating a portion of the threads on an overloaded host to a
new host, instead of replacing the host completely, would reduce the migration overhead and
avoid a new overload situation; replacing an overloaded host with more than one host would
increase the computing power of the entire collaborative system. The stability and reliability
of the MOIDE runtime support system and of MOIDE-based applications need to be enhanced
by improving the synchronization, coordination, and deadlock detection mechanisms in the
runtime support system and utilizing them in application design. Programming templates
should be provided for developing applications based on the computing modes, for example,
templates for hierarchical collaborative computation, for the threads' cooperative or
independent modes, and for computation based on autonomous load scheduling.
2. Test and enhance system scalability
So far, all tests have been conducted on a local cluster. Future work will emphasize the
scalability of the MOIDE model and the runtime support system in order to extend the system
coverage. Strong support for heterogeneity and autonomy is demanded on wide-area
network systems such as the Grid, where hundreds or thousands of computer nodes may
join the computation. To accommodate the computing model on a large system, the network
delay, the heterogeneous types of computing resources, and the system scalability must be
dealt with. High autonomy and scalability are the key merits for a computing infrastructure to
succeed in network computing. The MOIDE model needs modification to enhance its
scalability. Instead of the current two-level hierarchical structure, the collaborative system
will be expanded to a multilevel hierarchy that is more appropriate for organizing the
geographically distributed computer nodes. The collaborative system will then present a
multilevel adaptive infrastructure composed of grouped compute engines and local
coordinators. The groups will have higher autonomy in performing local computation, and
control will be distributed from the single compute coordinator of the current model to the
distributed local coordinators of the future system. The system coordination, task allocation,
and load scheduling strategies and the communication protocols should be improved to suit
the multilevel system structure. The current runtime support system should also be tested on
larger systems containing more platforms to examine its scalability and find its weaknesses.
3. Implement more applications
Four irregularly structured problems have been chosen as the sample applications in this
thesis, but many more irregularly structured applications exist in different fields. Other
applications in scientific computing, system simulation, combinatorial problems, etc., should
be implemented to study the patterns of these problems and the support required to cope with
their irregularities. The MOIDE model will thereby be improved to support the solutions for
various irregularly structured problems.