Abstract of thesis entitled
A Distributed Object Model for Solving Irregularly Structured Problems on Distributed Systems
submitted by
Sun Yudong
for the degree of Doctor of Philosophy at The University of Hong Kong
in March 2001
This thesis presents a distributed object model, MOIDE (Multithreading Object-oriented
Infrastructure on Distributed Environment), for solving irregularly structured problems. The
primary appeal of MOIDE is its flexible, collaborative infrastructure, which adapts to
various system architectures and application patterns. The model integrates object-oriented
and multithreading methodologies to set up a unified computing environment on
heterogeneous systems. The kernel of the MOIDE model is the hierarchical collaborative
system (HiCS), constructed from the objects that execute an application on the hosts, namely
the compute coordinator and the compute engines. This integration of object-oriented and
multithreading methodologies makes the structure of HiCS adaptive to hybrid hosts.
Lightweight threads are generated in the compute engines residing on SMP nodes, which is
more efficient and stable than a purely distributed-object scheme. The structure and work
mode of HiCS adapt to the computation and communication patterns of applications as well
as to the architecture of the underlying hosts. This adaptability is particularly beneficial for
the high-performance computing of irregularly structured problems.
A unified communication interface is built into MOIDE on top of a two-layer
communication mechanism that integrates shared-data access and remote messaging for
inter-object communication. This flexible and efficient mechanism addresses the high
communication cost that arises in irregularly structured problems. Autonomous load
scheduling is proposed as a new approach to dynamic load balancing in irregular
computation based on the MOIDE model. A runtime support system, developed in Java and
RMI, implements MOIDE as a platform-independent infrastructure to support parallel and
distributed computation on varied systems.
Four irregularly structured applications are developed to demonstrate the advantages of
the MOIDE model. The N-body method demonstrates the capability of the object-based
methodologies of the MOIDE model in implementing adaptive task decomposition and
complex data structures. A distributed tree structure with a partial-subtree scheme is devised
in the N-body method as a communication-efficient data structure to support the highly
data-dependent computation. The autonomous load scheduling approach in ray tracing
realizes high parallelism in the MOIDE-based asynchronous computation. The MOIDE
model provides adaptability to the CG method for solving sparse linear systems: the CG
method can be dynamically mapped onto heterogeneous hosts and utilize the unified
communication interface to enhance communication efficiency. The radix sort verifies the
flexibility of MOIDE-based computation, in which the grouped communication can
outperform MPICH on an SMP node and on a cluster of SMP nodes.
A Distributed Object Model for Solving Irregularly Structured Problems on
Distributed Systems
by
Sun Yudong
孫 昱 東
A thesis submitted in partial fulfillment of the requirements for the Degree of Doctor of Philosophy at The University of Hong Kong
March 2001
Declaration
I declare that this thesis represents my own work, except where due acknowledgement is
made, and that it has not been previously included in a thesis, dissertation or report submitted
to this University or to any other institution for a degree, diploma or other qualification.
Signed ………………………………………….
Sun Yudong
Acknowledgements
First, I would like to express my deepest gratitude to my supervisor, Dr. Cho-Li Wang, for
all his guidance throughout my study in the past years. He has always given me invaluable
encouragement and advice in completing the thesis.
I am deeply thankful to Dr. Francis C. M. Lau for his constructive advice and help with my
study. I am highly grateful to Dr. P.F. Liu for his important comments and suggestions on
my thesis revision.
Special thanks to the members of the System Research Group, Anthony Tam, Benny
Cheung, Matchy Ma, David Lee, and all other colleagues, whose help and cooperation have
been a strong support to my research.
Finally, I thank all the people who have helped me in all aspects to finish the thesis,
especially the technical and office staff in our department.
Contents
Declaration ………………………………………………………………………………….. i
Acknowledgements ..……………………………………………………………………….... ii
Table of Contents …………………………………………………………………………… iii
List of Figures ……………………………………………………………………………... vii
List of Tables ……………………………………………………………………………….. x
1 Introduction 1
1.1 Irregularly Structured Problems ……………...…………………….…...………….. 2
1.1.1 Specification …………………………………..……………………………... 2
1.1.2 Sample Applications ………………………………….……………………… 5
1.2 Distributed System and Distributed Object Computing ………..…….…………….. 8
1.2.1 Distributed System …………..………….……………………………………... 8
1.2.2 Distributed Object Computing …………….…..………………………………. 9
1.2.3 Object-Oriented Programming Language ………..……….………………….. 11
1.3 Motivation ...………………….………………………..…………………………... 11
1.4 Contributions ….……………..………………...…………………………………... 12
1.5 Thesis Organization ….……………..……………………………………….……... 14
2 MOIDE: A Distributed Object Model 15
2.1 Introduction ……………...………………………………………………………… 15
2.2 Basic Collaborative System ……………..……………………….………...…….... 18
2.2.1 System Structure ………………….…………………..…………………….. 18
2.2.2 System Creation ……………….…………..………………………………... 19
2.2.3 System Work ……………….……………………………………..……….... 20
2.2.4 System Reconfiguration ………….………..………………………………... 21
2.3 Hierarchical Collaborative System …….……………………………..…………… 24
2.3.1 Heterogeneous System …………..…………………….…………………….. 25
2.3.2 Hierarchical Collaborative System ………..………………….……………... 26
2.3.3 Task Allocation …..………..………………………………………….….….. 29
2.3.4 Unified Communication Interface …...…………………………….………... 30
2.4 Implementation …………………………..……….…………...……………….…... 32
2.5 Summary ……………….……………..…………………………………….……… 33
3 Runtime Support System 34
3.1 Overview ……….……..………………………………………...…….………….... 34
3.2 Principal Objects ………………..…………………………………….…………… 36
3.3 System Creation ………………..…………………………………….……………. 37
3.3.1 class StartEngine …………..………………………….……………. 37
3.3.2 Initialization of Compute Engine ………..…………………….……………. 40
3.4 System Reconfiguration ……………….………….………………….…….……… 41
3.4.1 class ExpandEngine …………….….……..…………….…….……… 41
3.4.2 class RecfgEngine ………………..………..…………….…....……... 43
3.5 Unified Communication Interface ………………………..……………………..… 44
3.6 Synchronization ……………………..…………….………………………………. 46
3.6.1 barrier()for Local Synchronization …………….……………………… 47
3.6.2 remoteBarrier()for Global Synchronization …..…...………………… 48
3.7 Load Scheduling …………………………………………...……………………… 49
3.7.1 Autonomous Load Scheduling …………….………………………………… 49
3.7.2 getTask() and getSubtask()Methods …….……..….…….………… 51
3.8 System Termination ……………………...……….……………………………….. 51
3.9 Summary ……………….……………………...…………………………………... 52
4 Distributed N-body Method in MOIDE Model 53
4.1 Overview ……………………………..………………………….…………….…... 53
4.2 Distributed N-body Method ……………………..……………….…………….….. 55
4.2.1 Distributed Tree Structure …………..…………………….………….……... 56
4.2.2 Computing Procedure ………..…………………………….…………….….. 62
4.2.3 Load Balancing Strategy ……………...…………………………….………. 65
4.3 Runtime Tests and Performance Analysis …………………...…………….……… 65
4.3.1 Tests on Homogeneous Hosts …………………...……………………….…. 65
4.3.2 Tests on Heterogeneous Hosts ……………...……...…………………….…. 72
5 Ray Tracing with Autonomous Load Scheduling 75
5.1 Overview ……………………………………………..…………….…..……….… 75
5.2 Autonomous Load Scheduling ………..…………………………….……..…….... 76
5.2.1 Background …………………………………………………….………….... 76
5.2.2 Group Scheduling ………………..…………………………….………...…. 77
5.2.3 Individual Scheduling ……………………………..……...…….…………... 79
5.3 Runtime Tests and Performance Analysis ……..…………..…………...…...…….. 81
6 CG and Radix Sort on Two-layer Communication 89
6.1 Conjugate Gradient …………………………………..………………….……........ 89
6.1.1 Algorithm of CG ……………………..……………..………………………. 90
6.1.2 CG Method in MOIDE Model …………..…………………..……………… 92
6.1.3 Runtime Tests and Performance Analysis ………..………………………… 94
6.2 Radix Sort ………………………………..………………………………….….... 100
6.2.1 Parallel Radix Sort …………………..………………….………….………. 101
6.2.2 Radix Sort in MOIDE Model ……………..………………….……………. 102
6.2.3 Runtime Tests and Performance Analysis …………....…..………….…….. 103
7 Related Work 112
7.1 Software Infrastructures on Distributed Systems ………..…….…………….…… 112
7.2 Programming Models on Cluster of SMPs ……..………………….………….….. 117
7.3 Methodologies for Irregularly Structured Problems ………..…….…………….… 119
7.4 Irregularly Structured Applications ………..…………….……………………….. 120
8 Conclusions 127
8.1 Summary of Research …………………………...…….…………………...…….. 127
8.2 Achievements and Remaining Issues ……..………………………….…....……... 130
8.2.1 Main Achievements ……………………………………………….………… 130
8.2.2 Remaining Issues ………………………………………………….………… 132
8.3 Future Work …………………..………………………………………..……...…. 133
References 136
List of Figures
2.1 The collaborative system built on P hosts ....…………………….…...……………..... 18
2.2 Registration tables ……………………………………………………………………. 20
2.3 Horizontal system expansion …………………………………………………………. 22
2.4 Vertical system expansion ……………………………………………………………. 23
2.5 Host replacement in a collaborative system ………………………………………….. 24
2.6 Cluster of SMPs ………………………………………………………………………. 26
2.7 Hierarchical collaborative system built on heterogeneous hosts ………………….….. 27
2.8 Two-layer communication mechanism ……………………………………………….. 31
3.1 Organization of MOIDE runtime support system …………………………………….. 35
3.2 Class description and relation of compute coordinator and compute engine classes …. 36
3.3 run() method of StartEngine class ……………………………………………... 37
3.4 getHost() method ………………………………………………………………….. 38
3.5 createEngine() method ………………………………………………………….. 39
3.6 invokeEngine() method ……………………………...…………………………... 40
3.7 Generate threads in a compute engine ………………………………………………… 41
3.8 run() method in ExpandEngine class ……………………………………………. 42
3.9 addEngine() method in ExpandEngine class ………………………………..… 42
3.10 checkEngine() method in RecfgEngine class ..………………….…………… 43
3.11 replaceEngine() method in RecfgEngine class …………………………….. 44
3.12 A hierarchical collaborative system with 12 pseudo-engines ………………………… 45
3.13 The location table of the pseudo-engines ……………………………………………... 46
3.14 Execution flow on multiple threads with local and global synchronization ………….. 47
3.15 Global synchronization method remoteBarrier() ………………………………. 48
3.16 Two-level autonomous load scheduling ………………………………………………. 50
3.17 ceaseEngine() method …………………………………………………………… 51
4.1 Barnes-Hut tree for 2D space decomposition ………………………………………….. 54
4.2 Space decomposition and distributed tree structure on four processors ……………….. 58
4.3 Partial subtrees built from subtree B …………………………………………………... 60
4.4 Space decomposition and the subtrees on a four-SMP cluster ………………………… 62
4.5 Execution flow of distributed N-body method on hierarchical collaborative systems …. 64
4.6 Execution time of the N-body method on four quad-processor SMP machines ….…… 66
4.7 Speedups of the N-body method ………………………………………………………. 67
4.8 The execution time on the cluster of 32 PCs …………………………………………... 67
4.9 Computation and communication time breakdowns on the cluster of SMPs …………. 68
4.10 Computation and communication time breakdowns of the full tree method …………. 69
4.11 Adaptive partial subtree vs. cut-off partial subtree on cluster of PCs ………………… 70
4.12 Execution time of the N-body method on heterogeneous hosts ………………………. 73
4.13 Speedups of the N-body method on heterogeneous hosts …………………………….. 73
4.14 Computation and communication time breakdowns on heterogeneous hosts ………… 74
5.1 View plane partitioning and allocation ………………………………………………... 78
5.2 Ray tracing with group scheduling ………………………………………………… 78-79
5.3 Ray tracing with individual scheduling ……………………………………………..… 79
5.4 Flow of ray tracing in group scheduling and individual scheduling …………………... 80
5.5 Execution times of the ray tracing in three load scheduling schemes …………………. 82
5.6 Speedups of the ray tracing in three scheduling schemes ……………………………... 82
5.7 Execution time breakdowns of group scheduling and combined scheduling …………. 83
5.8 Execution time breakdowns of combined scheduling …………………………………. 84
5.9 Execution time breakdowns of individual scheduling ………………………………… 85
5.10 The comparison of autonomous load scheduling and master/slave scheduling in ray
tracing ………………………………………………………………………………… 87
6.1 Algorithm of conjugate gradient (CG) method ………………………………………... 90
6.2 Vector/scalar reduction and transposition operations in parallel CG method ……….… 92
6.3 The reduction operation on 2 quad-processor SMP nodes ………………………….…. 93
6.4 The reduction and transposition operations on heterogeneous hosts ………………….. 94
6.5 Execution time of the CG method on homogeneous hosts ……………………………. 95
6.6 Speedup of the CG method on homogeneous hosts ………………………………….... 95
6.7 Execution time breakdowns of the CG method on homogeneous hosts ………………. 96
6.8 Execution time of the CG method on heterogeneous hosts …………………………… 97
6.9 Speedup of the CG method on heterogeneous hosts …………………………………... 97
6.10 Execution time breakdowns of the CG method on heterogeneous hosts ……………... 98
6.11 Execution time of the single-threading CG method …………………………………... 99
6.12 Speedup of the single-threading CG method …………………………………………. 99
6.13 Execution time breakdowns of the single-threading CG method ………………….… 100
6.14 Parallel radix sort ………………………………………………………………….…. 101
6.15 All-to-all scattering of elements in parallel radix sort on four processors …………... 102
6.16 Execution time breakdowns of the MOIDE-based radix sort ……………………….. 104
6.17 Execution time breakdowns of the single-threading radix sort ……………………… 105
6.18 Execution time breakdowns of the C & MPI radix sort program ……………………. 107
6.19 Execution time breakdowns of three radix sort programs: Java MOIDE-based (Java-M),
Java single-threading (Java-S) and C & MPI (C-MPI) in sorting 10M elements …… 107
6.20 Execution time breakdowns of two radix sort programs on four quad-processor SMP
nodes: Java MOIDE-Based program and C& MPI (C-MPI) in sorting 10M elements 109
6.21 Communication costs of two radix sort programs on four quad-processor SMP nodes:
Java MOIDE-based program (Java-M) and C & MPI program (C-MPI) in sorting 10M
elements ……………………………………………………………………………… 110
List of Tables
1.1 Characteristics of four irregularly structured problems ..………….…...…………….... 7
3.1 Major classes and methods implemented in MOIDE runtime support system …… 34-35
4.1 The times of the N-body method with/without load balancing (seconds) …………… 71
6.1 Execution time of the MOIDE-based radix sort (seconds) ………………………….. 103
6.2 Execution time of the single-threading radix sort (seconds) ………………………… 104
6.3 Execution time of the C & MPI radix sort (seconds) ………………………………... 106
7.1 Comparison of the related work in supporting heterogeneous computing …………... 116
7.2 Comparison of the programming models on cluster of SMPs ……………………….. 118
Chapter 1
Introduction
Irregularly structured problems are applications that have unstructured and/or
dynamically changing patterns of computation and communication. Such problems exist
widely in scientific and engineering areas ranging from astrophysics, fluid dynamics, sparse
matrix computation, and system modeling and simulation to computer graphics and image
processing. Irregularly structured problems are usually computation-intensive applications
with high potential parallelism. However, this parallelism is difficult to exploit fully because
of the irregularity in computation and communication. The irregularity is aggravated when
these problems are solved on distributed systems. It is hard to partition the irregular
computation evenly among the processors. Moreover, complicated and unstructured
inter-process communication emerges in distributed computing and restrains the parallelism
of the computation. Irregularly structured problems are therefore also
communication-intensive in distributed computing.
It is challenging to develop flexible and efficient methodologies for solving
irregularly structured problems on distributed systems. The methodologies should cover data
structures, task decomposition and allocation schemes, load balancing strategies, and
communication mechanisms to support high-performance distributed computing of
irregularly structured applications. Meanwhile, the methodologies should also take into
account the architectural features of the platforms on which an application runs, in order to
create an efficient mapping of the computation onto the underlying platforms.
My research concentrates on distributed object-oriented methods for high-performance
solutions of irregularly structured problems on distributed systems. A distributed
object-oriented model, MOIDE, has been built. The model sets up a flexible and efficient
software infrastructure for developing and executing irregularly structured applications on
varied distributed systems. The MOIDE model supports techniques that are effective for
solving irregularly structured problems with different irregular characteristics. A runtime
support system is developed to implement computations based on the MOIDE model.
1.1 Irregularly Structured Problems
1.1.1 Specification
A lot of scientific and engineering applications in various fields, including scientific
computing, computer graphics, physics, and chemistry, can be classified as irregularly
structured problems: for instance, sparse matrix computations in solving sparse linear
systems; finite element methods for solving partial differential equations; ray tracing
and radiosity in computer graphics; N-body methods in particle simulations; and system
modeling and simulation in many scientific, engineering, and social disciplines. Despite
their distinct features, these problems share the common characteristic of irregular data
distribution, which generates irregular computation patterns and thus incurs irregular
communication patterns. Irregularly structured problems can be described from different
aspects such as data structure, computation and communication patterns, unpredictable data
distribution, and workload [1-6]. Generally, irregularly structured problems can be
characterized by the following definition.
Definition
An Irregularly Structured Problem is an application whose computation and communication
patterns are input-dependent, unstructured, and evolving with the computation procedure.
The irregularity of irregularly structured problems makes it difficult to design
efficient parallel and distributed algorithms for them. The distribution of data and computing
workload cannot be exactly determined a priori, and it changes dynamically during
the computation. When solving an irregularly structured problem in parallel, one must deal
with the following issues:
(1) Irregular Data Representation
The irregular data distribution requires irregular data structures, e.g., special forms of
trees and graphs, to represent data and their relations [54,55,60]. It is usually difficult to
exploit the parallelism in computations on irregular data structures.
(2) Non-predetermined Load Scheduling
The workload of irregularly structured problems depends on the input data and the
dynamic evolution of the data during the computation. Because of the irregular and dynamic
data distribution, irregularly structured problems cannot be evenly partitioned and allocated
onto multiprocessors before execution. It is impractical to accurately measure the
computation workload of an irregularly structured problem in advance, and high data
dependency may exist in the problem, which further complicates load scheduling [69,70].
(3) Complicated Communication Requirements
Due to the irregular data structures, computation patterns, and data dependencies,
irregularly structured problems also present irregular and complicated communication
requirements. The high data dependencies in some irregularly structured problems generate
complicated inter-process communication patterns and demand high communication
bandwidth [67,68]. The unstructured communication may severely restrict the performance
of irregularly structured applications.
(4) Adaptive Algorithmic Requirement
The unpredictable computation and communication patterns call for adaptive algorithms
for solving irregularly structured problems. The algorithms should generate a task
decomposition in accordance with the specific patterns that emerge in the problem, in order
to attain high performance in distributed computing. The algorithms should also map the
irregular computations onto the underlying platforms in an adaptive way that matches the
hardware architecture [71].
Irregularly structured problems are mostly large-scale, computation-intensive, and
communication-intensive applications. The unstructured, dynamically evolving patterns
aggravate the computation and communication costs. In addition to the general strategies for
designing efficient parallel or distributed algorithms, such as scheduling computing tasks
evenly to balance the workload and minimizing communication, special techniques should
be devised for irregularly structured problems. The fundamental techniques for solving
irregularly structured problems include:
(1) Flexible Data Structures
The data structures should facilitate the effective representation and efficient
computation of irregular problems. They should be flexible enough to be partitioned in task
decomposition and to be reconstructed to reflect the evolution of the computation patterns,
and they should efficiently satisfy the data sharing required in parallel and distributed
computing. Naturally, different irregular applications require special data structures to
represent the data and the related computations. For example, a distributed tree structure is
designed for the distributed N-body method in chapter 4.
(2) Dynamic Load Scheduling
The unpredictable and evolving data distribution, and therefore computation workload,
of irregularly structured problems requires dynamic load scheduling to allocate the workload
at run time in distributed computing. Computing tasks should be allocated, or data
redistributed, among multiple processes to ensure dynamic load balance. The load
scheduling approach depends on the characteristics of the application: global load
redistribution is probably required for applications with high data dependency, while
runtime task allocation is suitable for applications with light data dependency. The space
re-decomposition scheme in the distributed N-body method in chapter 4 and the autonomous
load scheduling scheme in chapter 5 are two examples of dynamic load scheduling.
(3) Efficient Communication Methodologies
The irregular communication patterns of irregular applications usually produce high
communication overhead that degrades the overall performance. The communication
overhead is even more critical in a distributed computing environment, where
communication goes through long-latency message passing. It is essential to reduce
inter-process communication by maintaining data locality in the computation and to build an
efficient communication mechanism for distributed computing. In the MOIDE model
described in chapter 2, a two-layer communication mechanism is created by integrating
shared-data access and message passing on heterogeneous systems.
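The two-layer idea can be sketched as a single send operation whose lower layer chooses shared-data access for engines on the same node and remote messaging otherwise. All class and method names below are hypothetical illustrations, not the actual MOIDE API (the real interface is described in chapters 2 and 3):

```java
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch of a two-layer channel: local destinations receive
// data through an in-memory shared queue; remote ones go through message
// passing (an RMI call in a real system). Not the MOIDE implementation.
public class TwoLayerChannel {
    private final Map<String, Queue<Object>> localQueues = new HashMap<>();

    public synchronized void registerLocal(String engineId) {
        localQueues.put(engineId, new LinkedList<>());
    }

    // Upper layer: one send() call; the lower layer picks the mechanism
    // and here reports which one it used.
    public synchronized String send(String engineId, Object data) {
        Queue<Object> q = localQueues.get(engineId);
        if (q != null) {          // same node: shared-data access
            q.add(data);
            return "shared";
        }
        return remoteSend(engineId, data);  // different node
    }

    // Stand-in for a remote-messaging call to another host.
    private String remoteSend(String engineId, Object data) {
        return "remote";
    }

    public synchronized Object receiveLocal(String engineId) {
        return localQueues.get(engineId).poll();
    }
}
```

The point of the single entry method is that application code is written once against the upper layer, while co-located engines avoid the cost of message serialization.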
(4) Adaptive Computing Infrastructure
The task decomposition depends on the computation and communication patterns of an
application. It should decompose the computation evenly and allocate workload fairly to
each processor while reducing inter-processor communication. An ideal task decomposition
scheme should also take into account the architecture of the underlying system. The
algorithms of the applications should therefore be developed on an adaptive computing
infrastructure, so that adaptive task decomposition and allocation strategies can be
implemented for varied hardware architectures to generate a task distribution that makes full
use of the architectural features to attain high performance on the hardware platforms.
1.1.2 Sample Applications
As previously indicated, a lot of applications in various fields can be viewed as
irregularly structured problems. These applications possess different irregularities that call
for specific strategies. Four sample irregularly structured problems are studied in this thesis.
(1) N-body Problem
The N-body problem simulates the evolution of a system containing a great number of
bodies (particles) [7, 16]. The bodies, distributed in a space, exert forces on one another.
System evolution is the consequence of the cumulative force influences of all bodies: the
forces are determined by the interactions of the bodies and impel the bodies to move to new
positions, and they keep changing because of the continuous body motion. Many physical
systems exhibit this behavior, in fields such as astrophysics, plasma physics, molecular
dynamics, fluid dynamics, and radiosity calculation in computer graphics. The common
feature of these systems is the large range of precision in the data that bodies require to
compute their force influences on each other: a body needs progressively coarser data, at
lower frequency, from bodies that are farther away. The system evolution is a dynamic
procedure. The N-body problem has high irregularity in the data distribution and in the
computation of force influences, and heavy irregular data communication occurs in the
distributed N-body method.
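As a concrete illustration of the force computation, the following is a minimal direct-summation sketch of one 2D force step. It is not the tree-based method used later in this thesis (a tree code replaces the inner loop over distant bodies with a single cell approximation); the class name and array layout are illustrative:

```java
// Direct O(N^2) force summation in 2D. Each body is {x, y, mass}.
public class NBodySketch {
    static final double G = 6.674e-11;   // gravitational constant

    // Returns {fx, fy} accumulated on each body from all other bodies.
    public static double[][] forces(double[][] bodies) {
        int n = bodies.length;
        double[][] f = new double[n][2];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                if (i == j) continue;
                double dx = bodies[j][0] - bodies[i][0];
                double dy = bodies[j][1] - bodies[i][1];
                double r2 = dx * dx + dy * dy;
                double r = Math.sqrt(r2);
                // Force magnitude, then projected onto the unit vector.
                double mag = G * bodies[i][2] * bodies[j][2] / r2;
                f[i][0] += mag * dx / r;
                f[i][1] += mag * dy / r;
            }
        }
        return f;
    }
}
```

The inner loop touches every other body, which is exactly the all-to-all data requirement that makes the distributed version communication-heavy and motivates the distributed tree structure of chapter 4.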
(2) Ray Tracing
Ray tracing is a rendering algorithm in computer graphics that synthesizes an image
from the mathematical description of the objects that constitute it [8,20]. It generates
a 2D rendered image of a 3D scene by calculating the color contributions of the objects to
each pixel on a view plane (screen). In ray tracing, primary rays are emitted from a
viewpoint, passing through the pixels on the view plane and entering the space that encloses
the objects. When it encounters an object, a ray is reflected toward each light source to
check whether it is shielded from that light source; if not, the light contribution from that
light source onto the view plane is computed. The ray is also reflected from and refracted
through the object to spawn new rays, and the ray tracing procedure is performed recursively
on the new rays. Thus each primary ray may generate a tree of new rays. The rays are
terminated when they leave the space or by some pre-defined criterion (e.g., the maximum
number of levels allowed in a ray tree). If a ray hits nothing, no further computation is taken,
while rays hitting complex objects generate a bundle of dispersed rays that require more
rendering computation. In ray tracing, the generation of the rays is non-deterministic and
depends on the objects and light sources in the scene. The workload of the rendering
operations is totally irregular: the rendering of each pixel on the view plane has highly
diverse workload, and it is difficult to partition the view plane evenly a priori for parallel
ray tracing.
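The recursive structure described above, and the source of the irregular per-pixel workload, can be sketched as follows. `Scene.hit()` and the ray parameters are hypothetical placeholders; a real tracer would also compute intersections, shadow rays, and shading:

```java
// Skeleton of recursive ray tracing showing the ray tree and its cutoff.
public class RayTraceSketch {
    static final int MAX_DEPTH = 5;   // pre-defined ray-tree depth limit

    interface Scene { boolean hit(double[] origin, double[] dir); }

    // Returns how many secondary rays a single ray spawns. The count
    // depends entirely on what the ray hits, so it varies per pixel --
    // this is the irregular workload discussed above.
    public static int trace(Scene scene, double[] origin, double[] dir,
                            int depth) {
        if (depth >= MAX_DEPTH) return 0;     // terminate the ray tree
        if (!scene.hit(origin, dir)) return 0; // ray leaves the space
        // One reflected and one refracted ray spawned at the hit point.
        int spawned = 2;
        spawned += trace(scene, origin, dir, depth + 1);  // reflected ray
        spawned += trace(scene, origin, dir, depth + 1);  // refracted ray
        return spawned;
    }
}
```

A pixel whose primary ray misses everything costs one intersection test, while a pixel over a reflective object costs a whole tree of rays; that disparity is what the autonomous load scheduling of chapter 5 addresses.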
(3) Sparse Matrix Computations
Sparse matrix computations appear broadly in scientific and engineering computing,
from basic operations such as sparse matrix-vector multiplication to complex computations
such as iterative methods for solving sparse linear systems. A sparse matrix is an
unstructured data structure. Parallel sparse matrix computations are irregular because the
unstructured data density leads to unbalanced matrix computation [62,63], and unstructured
communication arises in parallel sparse matrix computations. For example, the iterative
methods for solving linear systems of the form Ax = b generate a sequence of
approximations to the solution vector x by iterating matrix-vector multiplication on the
coefficient matrix A. The computation and communication costs depend on the data density
of the sparse matrix and the algorithmic operations on it. The conjugate gradient (CG)
method is one of the most powerful iterative methods for large sparse linear systems [22].
The parallel CG method is based on a mesh topology of multiprocessors, and large vectors
are exchanged among the processors in each iteration. This heavy communication may
restrict the performance of the parallel CG method, so an efficient communication
mechanism should be built to raise communication efficiency and enhance the overall
performance.
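The kernel repeated in every CG iteration is the sparse matrix-vector product. A common way to store only the nonzeros is the compressed sparse row (CSR) layout, sketched below; this is a standard format used for illustration, not necessarily the representation adopted in this thesis:

```java
// Sparse matrix-vector multiply y = A*x in CSR form. The work per row
// follows the number of nonzeros in that row, so an uneven data density
// directly produces the unbalanced computation described above.
public class CsrSpmv {
    final int[] rowPtr;   // rowPtr[i]..rowPtr[i+1] index row i's nonzeros
    final int[] cols;     // column index of each nonzero
    final double[] vals;  // value of each nonzero

    public CsrSpmv(int[] rowPtr, int[] cols, double[] vals) {
        this.rowPtr = rowPtr;
        this.cols = cols;
        this.vals = vals;
    }

    public double[] multiply(double[] x) {
        double[] y = new double[rowPtr.length - 1];
        for (int i = 0; i < y.length; i++) {
            double sum = 0.0;
            for (int k = rowPtr[i]; k < rowPtr[i + 1]; k++) {
                sum += vals[k] * x[cols[k]];
            }
            y[i] = sum;
        }
        return y;
    }
}
```

In the parallel CG method, rows are partitioned across processors, so each multiply also requires exchanging the needed entries of x; this is the vector communication that dominates the cost.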
(4) Sorting
Sorting transforms a random sequence of elements into an ordered one. It is one of the
most common operations in computing and is accomplished by repeated comparison or
non-comparison manipulations of the element sequence. A parallel sorting algorithm
involves the redistribution of the elements among multiprocessors in each sorting round.
This data exchange has an irregular communication pattern and produces heavy
communication among all processors, so sorting algorithms can also be recognized as
irregular problems [48,66]. For example, radix sort is a non-comparison sorting algorithm
[24] that reorders a sequence of elements based on the integer value of bit sets. Radix sort
examines the elements r bits at a time: it sorts the elements according to the ith least
significant block of r bits during iteration i, and all elements are redistributed across the
multiprocessors based on their new positions in the global sequence in each iteration. This
data redistribution is an irregular all-to-all communication. Since a sorting algorithm
performs only simple computation, the performance is mainly determined by the
communication operations, and an efficient communication mechanism is also needed to
speed up the sorting procedure.
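The r-bit passes described above can be sketched sequentially as follows, assuming non-negative keys; the method name is a hypothetical helper, not the MOIDE implementation. The parallel version additionally scatters each bucket across processes after every pass, which is the all-to-all exchange:

```java
// Least-significant-digit radix sort over r-bit digits.
public class RadixSortSketch {
    // a: non-negative keys; r: bits per pass; keyBits: total key width.
    public static int[] sort(int[] a, int r, int keyBits) {
        int buckets = 1 << r;
        int[] cur = a.clone();
        for (int shift = 0; shift < keyBits; shift += r) {
            int[] count = new int[buckets];
            for (int v : cur) {               // histogram of this digit
                count[(v >>> shift) & (buckets - 1)]++;
            }
            int[] start = new int[buckets];   // prefix sums give each
            for (int b = 1; b < buckets; b++) // element's new position
                start[b] = start[b - 1] + count[b - 1];
            int[] next = new int[cur.length];
            for (int v : cur) {               // stable scatter pass
                next[start[(v >>> shift) & (buckets - 1)]++] = v;
            }
            cur = next;                       // parallel version: all-to-all
        }
        return cur;
    }
}
```

Each pass does only a histogram, a prefix sum, and a copy, which is why the distributed version is dominated by the redistribution rather than the computation.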
The four irregularly structured applications discussed above have unique irregular
characteristics that demand specific techniques to achieve high-performance computing.
These characteristics are summarized in Table 1.1 as four attributes: computation
complexity, communication complexity, data dependency, and synchronization requirement.
Applications   Computation   Communication          Data         Synchronization   Key Techniques
               Complexity    Complexity             Dependency   Requirement
N-body         High          High (all-to-all)      High         Yes               Distributed tree structure
Ray Tracing    High          Low                    None         No                Autonomous load scheduling
CG             High          High (point-to-point)  Medium       Yes               Two-layer communication
Radix          Low           High (all-to-all)      Low          Yes               Two-layer communication
Table 1.1 Characteristics of four irregularly structured problems
In Table 1.1, computation complexity is the computation workload of the application.
Communication complexity is the communication requirement that the application exhibits
in distributed computing. Data dependency is the level of correlation of the data in the
computation. Synchronization requirement indicates whether synchronization must be
enforced on the parallel computing procedure: no synchronization requirement implies a
totally asynchronous computation that can attain the highest parallelism; otherwise,
synchronization must be imposed to coordinate the parallel computations on multiple
processes. The key techniques in Table 1.1 are used to handle the irregularities of the
individual irregularly structured problems and to achieve high performance in distributed
computing. These techniques are implemented in the distributed object model MOIDE (see
chapter 2), and their effects will be demonstrated by the sample applications in chapters 4, 5
and 6.
1.2 Distributed System and Distributed Object Computing
1.2.1 Distributed System
A distributed system is constructed from computer nodes linked across networks. The computer nodes may be geographically distributed over a wide area. To run on a distributed system, an application is decomposed into a group of concurrent computing tasks that are dispatched to the distributed computer nodes, while the tasks still cooperate during the computation. This is the basic model of distributed computing. Distributed computing technologies are developing rapidly with the proliferation of smaller computers, such as PCs and workstations, and the widespread deployment of networks. Networked computers can support large applications by distributed computing without high-end standalone computers.
A distributed system can be a homogeneous system consisting of machines of the same type. More generally, it is a heterogeneous system composed of hybrid computers that have different architectures and computing power. It may accommodate PCs, workstations and multiprocessors in the same system. With the fast advancement of high-speed networks and powerful microcomputers and workstations, networked computers have been providing a cost-effective environment for high-performance parallel and distributed computing. The scope of networked systems is expanding quickly; nowadays networked computers go beyond traditional LAN-linked systems. The local systems at different sites can form a wide-area distributed system, from a campus-wide to an area- or nation-wide system, which provides strong computing power. This innovative system architecture is called a Cluster of Clusters or a Computational Grid [64,65].
A distributed system can be concurrently accessed by many users at different sites for different applications. A software infrastructure is required to integrate the distributed resources and provide a uniform interface for developing and running applications on the distributed system. The infrastructure should offer sufficient flexibility on heterogeneous platforms. It should hide the architecture of the platforms and create a uniform environment for application developers.
In response to these requirements for an integrated computing infrastructure, object-oriented methodology is recognized as an appropriate technique for constructing a distributed computing infrastructure. Object-oriented techniques are flexible enough to create objects on varied platforms and organize the objects into a computing infrastructure for executing applications. The distributed object infrastructure is highly adaptive to heterogeneous platforms.
1.2.2 Distributed Object Computing
Object-oriented technology is appropriate for computation on distributed systems [9, 10]. An object is a software unit that encapsulates data and behavior. Object-oriented computing can be viewed as providing an interface that specifies the functions and arguments of an object while encapsulating the details of the internal implementation. The interface hides the hardware characteristics from the applications. Applications can therefore be developed in a uniform model regardless of what platforms they will run on. Applications based on the object-oriented model can be properly mapped onto the platforms, and high performance can be achieved.
Distributed object computing is the integration of object-oriented computing and networking. It provides high flexibility to computations on distributed systems. Objects are created on distributed computer nodes when an application is submitted to run. The object on one host can interact with objects on remote hosts. An object can also be transferred to another host for the sake of load balancing or fault tolerance, and the object on one host can create remote objects on other hosts. Distributed object computing also supports the multithreading methodology, which implements lightweight computation on SMP nodes.
Remote method invocation is a more powerful communication mechanism in a distributed object system than ordinary message passing. By remote method invocation, an object can transfer not only data but also control to remote objects. It has the character of one-sided communication, in which the communication operation can be started by the sender or the receiver alone. One-sided communication contributes to the high asynchrony of distributed object computing. The polymorphism of a distributed object system guarantees the flexibility of object-oriented computing. A distributed object system can expand by incorporating objects created on new hosts at runtime to raise its computing power. In summary, a distributed object system should possess the following capabilities:
(1) Runtime Host Selection
The first step in creating a distributed object system is the selection of available hosts in a distributed system. The computers in a distributed system are simultaneously accessible by many users and applications. A distributed system is also a loosely-coupled and non-permanent system: local systems or hosts can join or leave the distributed system. When building a distributed object system, the appropriate hosts need to be selected based on their current states. Objects are then created on the selected hosts to perform distributed computing.
(2) Adaptive Task Mapping
In distributed computing, an application is decomposed into a group of tasks that are allocated to the distributed objects. The tasks should be properly mapped onto the distributed objects in accordance with the computation pattern, data locality, and the architecture of the target hosts. The task mapping should exploit the computing power of the hosts and minimize the data communication between the objects.
(3) Multithreading Computation
Multithreading computation is a lightweight method for parallel computing within an object. An object residing on an SMP node can spawn a group of threads internally to execute parallel computation on the multiprocessors. The group of threads in an object consumes fewer system resources than multiple objects, and the threads can maintain higher data sharing and tighter cooperation in parallel computing.
(4) Efficient Communication Mechanism
Distributed object computing is usually accompanied by heavy inter-object communication. An efficient communication mechanism is demanded to support flexible and fast communication. The communication mechanism should integrate the inter-object communication methods based on the physical communication paths to achieve high communication efficiency.
(5) Dynamic Object Creation and Load Scheduling
A distributed object system should be adaptive to the computing environment. It should
be able to dynamically create objects on new hosts to utilize the available computing
resources and improve the performance. It needs to support dynamic load scheduling to
balance the workload on the objects.
These capabilities are especially useful in solving irregularly structured problems. The unstructured computation and communication patterns of these problems create a strong demand for adaptive task mapping and dynamic load scheduling to produce a balanced distribution of the computation workload. Dynamic object creation is also helpful for developing adaptive algorithms on distributed systems. The efficient communication mechanism can mitigate the overwhelming overhead of unstructured communication. Multithreading computation helps smooth the unstructured computation and increase performance on SMP nodes.
1.2.3 Object-Oriented Programming Language
A distributed system may contain different types of platforms. Object-oriented computing on such a system should be executable on the heterogeneous platforms. Java is an architecture-neutral object-oriented language that offers plenty of services for distributed computing on heterogeneous platforms [11]. The services support multithreading and distributed object computing as well as a remote method invocation mechanism, and also enable dynamic object creation and redistribution. Java provides a homogeneous, language-centric view over a heterogeneous environment. Platform-independent Java bytecode is executable on any host on which a Java Virtual Machine is installed. Therefore Java classes and objects are portable from one host to any other host without recompilation. RMI (Remote Method Invocation) is a Java-based interface for distributed object-oriented computing [12,77]. It provides a registry to register object references and allows an object running on one host to make remote method invocations on an object on another host. Distributed objects can thus transfer data through both the arguments and the return value of a method. A distributed object system implemented in Java can be flexibly built on any platform at runtime. In this thesis, Java and RMI are used to implement the MOIDE model and the irregularly structured applications.
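A minimal sketch of this RMI style of interface definition follows; ComputeTask, EngineImpl, and the sum method are illustrative names, not the actual MOIDE classes.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;

// In RMI, any method callable from a remote host must be declared in an
// interface extending Remote, and must declare RemoteException.
interface ComputeTask extends Remote {
    double sum(double[] input) throws RemoteException;
}

// A server-side implementation. In a real system this object would be
// exported (e.g. by extending UnicastRemoteObject) and bound in the RMI
// registry, so that clients on other hosts can obtain its reference via
// Naming.lookup() and invoke sum() remotely, passing the argument array
// and receiving the return value as serialized data.
public class EngineImpl implements ComputeTask {
    public double sum(double[] input) {
        double s = 0.0;
        for (double x : input) s += x;
        return s;
    }
}
```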
1.3 Motivation
The motivation of my research comes from the recognition of all the previously discussed requirements. The goal of this research is to develop a generic computing model that can support efficient and flexible computing on various distributed systems. With regard to the main target applications, the model should provide powerful support for solving irregularly structured problems. Runtime support software is needed to integrate the distributed, heterogeneous hosts and subsystems into a uniform computing environment for the applications. This thesis presents a distributed object model, MOIDE, for solving irregularly structured problems on distributed systems. MOIDE stands for Multithreading Object-oriented Infrastructure on Distributed Environment. It integrates all the capabilities discussed in 1.2.2 and sets up a flexible infrastructure for developing and executing irregularly structured applications. Applications implemented in the MOIDE model can achieve high performance on heterogeneous systems.
1.4 Contributions
The research work in this thesis focuses on the development of the MOIDE model and related mechanisms to support the solution of irregularly structured problems on distributed systems. The model provides the foundation for developing the solutions. Proprietary techniques are designed based on the MOIDE model for specific irregularly structured applications. The major contributions of the thesis are as follows:
A distributed object model MOIDE is developed to establish a flexible and architecture-independent computing infrastructure on distributed systems. The model integrates the object-oriented and multithreading methodologies to provide a unified infrastructure for developing and executing varied applications, especially irregularly structured problems, on heterogeneous systems. The MOIDE model utilizes the polymorphism and location-transparency properties of the object-oriented and multithreading technologies. These properties facilitate the dynamic, adaptive creation and reconfiguration of the distributed computing infrastructure on varied system architectures and resources. A hierarchical collaborative system is proposed as the fundamental software architecture of the model. The hierarchical collaborative system has a two-level structure that is adaptive to the architecture of the underlying hosts and the patterns of irregular applications. It supports dynamic system creation and reconfiguration to cope with the uncertainty of resource availability and the irregular computation. MOIDE-based computation can attain high performance on the available resources.
A unified communication interface is constructed by integrating local shared-data access and remote messaging to support architecture-transparent and efficient inter-object communication. The integrated two-layer communication based on the object-oriented technique provides a simple, flexible and extensible communication mechanism for transmitting complex data structures and control information between distributed objects. It can effectively improve the communication efficiency of HiCS in solving irregularly structured problems. MOIDE-based applications can be developed in an architecture-independent mode by calling the unified communication interface. The applications are adaptively mapped onto the underlying hosts at runtime by forming a hierarchical collaborative system and creating the communication mechanism that matches the underlying architecture.
Generic task allocation strategies are proposed for the task decomposition and allocation in different irregularly structured problems. For example, the strategy of initial task decomposition with runtime repartition can be applied to applications with high data dependency, while dynamic task allocation can be used for applications with low data dependency. The polymorphism and encapsulation of object-oriented methodologies enable adaptive task allocation that matches both the application pattern and the system architecture.
A runtime support system called MOIDE-runtime is developed to implement distributed computing based on the MOIDE model. It implements the functions and mechanisms required in MOIDE-based computing, including host selection, hierarchical collaborative system creation and reconfiguration, the unified communication interface, object synchronization, and autonomous load scheduling. Implemented in Java, MOIDE-runtime builds a platform-independent, extensible, unified computing environment on a wide range of distributed systems, e.g., clusters of SMP nodes, clusters of single-processor PCs, and clusters of hybrid hosts.
A distributed tree structure is designed for the N-body problem. It is a distributed variation of the Barnes-Hut tree. The partial subtree scheme is proposed as a communication-efficient solution to the data sharing of the distributed tree structure. It differs from the tree structures in other parallel N-body methods on shared-memory or distributed-memory systems. The distributed tree structure is supported by the object-oriented approach based on the MOIDE model. The object-oriented techniques facilitate the construction and transmission of the tree structures in a distributed environment.
Autonomous load scheduling is proposed as the dynamic task allocation approach for highly asynchronous computation. The autonomous load scheduling method is supported by the flexible, one-sided remote method invocation in object-based communication. It can exploit the high parallelism in some irregular applications, make full use of the computing power of the resources, and automatically achieve dynamic load balancing with low overhead. Autonomous load scheduling is used in the ray tracing method. One of the autonomous load scheduling schemes, individual scheduling, is recognized as the better scheme for the highly asynchronous ray tracing procedure.
Grouped communication is adopted in the unified communication interface to reduce the heavy and irregular communication cost of irregularly structured applications. It is developed with the integration of the object-oriented and multithreading methodologies in the MOIDE model. The grouped communication approach is used to fulfill the large-volume all-to-all scattering operation in the radix sort, where the grouped scatter outperforms the corresponding operation in MPICH.
1.5 Thesis Organization
In the following text, chapter 2 addresses the distributed object model MOIDE, with a focus on the infrastructure of the hierarchical collaborative system. Chapter 3 describes the runtime support system MOIDE-runtime. Chapter 4 presents the distributed N-body method based on the MOIDE model, with emphasis on the distributed tree structure. Chapter 5 discusses the distributed ray tracing methods based on autonomous load scheduling. Chapter 6 presents the MOIDE-based CG and radix sort methods to illustrate the architecture-independent feature of MOIDE-based computation and the efficiency of the two-layer communication mechanism. Chapter 7 covers the related work and its comparison with my work. Chapter 8 concludes the thesis.
Chapter 2
MOIDE: A Distributed Object Model
2.1 Introduction
As discussed in 1.2.2, object-oriented technology is suitable for implementing computations on distributed systems. A distributed system can be composed of geographically scattered hosts. The computing resources in the system are accessible to a large number of users, and applications can be submitted to run on any of the hosts in the system simultaneously. To attain fair utilization of the resources and effectively organize the computations on them, a software facility is demanded to support the various computing requirements on distributed systems.
MOIDE (Multithreading Object-oriented Infrastructure on Distributed Environment) is a distributed object model that supports high-performance computing, especially of irregularly structured problems, on distributed systems. It establishes a flexible infrastructure that combines distributed object and multithreading methodologies to support parallel and distributed computing on the varied platforms of a distributed system.
The basic components of the MOIDE model are a group of objects dynamically created at runtime on the hosts selected to run an application. The hosts may be scattered over a wide range of a distributed system. The objects on the hosts are organized into a cooperative working system, called the collaborative system, and execute the application together. The collaborative system is a runtime infrastructure, built when an application is submitted to run. It provides the mechanisms that allow the distributed objects to interact with one another, and it is responsible for coordinating the computing procedures on the distributed objects.
The construction of a collaborative system is associated with the resources in the underlying distributed system. It is always built on the most appropriate hosts, i.e., the hosts with higher computing power and lower workload. For a distributed system containing SMP nodes, multiple threads will be generated inside the object on each SMP node. In this case, the collaborative system possesses a hierarchical structure: the objects, one per SMP node, form the upper level of the collaborative system, and the threads in the objects form the lower level. Communication takes place on the two levels in different modes. The combination of distributed objects and threads gives the collaborative system high adaptability to heterogeneous system architectures.
The distributed object-oriented and multithreading features of the MOIDE model have the following advantages.
(1) Architectural Transparency
The MOIDE model erects a uniform computing infrastructure for developing and executing applications. The architecture of the underlying distributed system is transparent to the applications. An application does not need to know on what hosts it runs; it just requests a certain number of processors. The processors may be supplied by uniprocessor hosts, SMP nodes, or hybrid hosts. The application is developed on an identical infrastructure no matter what hosts it will be executed on. When running the application, the runtime support system creates distributed objects, and generates threads inside the objects, depending on the specific architecture of the underlying hosts.
(2) Combined Programming Methodology
MOIDE combines the distributed object computing and multithreading methodologies, making the model adaptive to heterogeneous architectures. A group of threads is generated within an object to support parallel computing on an SMP node. These threads cooperate tightly within the group and can share computation workload and data in the object. The combined programming methodology is more efficient and resource-saving than a purely distributed object approach, in which multiple objects would be created on an SMP node to perform parallel computation on its processors. The threads can work in different modes depending on the computation pattern of an application. An application is decomposed into tasks; the threads in an object can work in cooperative mode to cooperatively process one task, or in independent mode, in which each thread independently processes its own task.
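The independent work mode can be sketched with plain Java threads, as below; the class name and the per-task computation are illustrative, not the actual MOIDE implementation.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Independent work mode: each thread inside the object repeatedly claims a
// task index from a shared atomic counter and writes its result into the
// shared results array, so the threads share data directly in local memory.
public class ThreadedEngine {
    static int[] run(int numTasks, int numThreads) {
        int[] results = new int[numTasks];
        AtomicInteger next = new AtomicInteger(0);
        Thread[] workers = new Thread[numThreads];
        for (int t = 0; t < numThreads; t++) {
            workers[t] = new Thread(() -> {
                int i;
                while ((i = next.getAndIncrement()) < numTasks) {
                    results[i] = i * i;   // stand-in for the real task computation
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) w.join();   // wait for all threads to finish
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return results;
    }
}
```

The shared atomic counter also gives a simple form of load balancing: a faster thread automatically claims more tasks.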
(3) Integrated Communication Mechanism
The MOIDE model supports two communication methods. The distributed objects interact with each other by remote messaging, a data transmission approach between distributed objects based on remote method invocation; remote messaging is more powerful than message passing. The threads inside an object can access shared data through local memory; shared-data access has low communication expense. The MOIDE model integrates the two communication methods into a two-layer communication mechanism, and provides a unified communication interface on top of the mechanism to hide the two-layer communication paths from the application. At the application level, communication between objects or threads calls the same interface regardless of the physical path. A thread can communicate with a local thread in the same object or a remote thread in a remote object through the same communication interface. The runtime support system carries out the communication through the proper communication path, either shared-data access or remote messaging.
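A minimal sketch of such a unified interface is shown below; the class name, the send() method, and the destination identifiers of the form objectId:threadId are illustrative assumptions, not the actual MOIDE API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a unified communication interface: callers always invoke send(),
// and the runtime chooses the physical path -- shared-data access for threads
// in the same object, remote messaging for threads in remote objects.
public class UnifiedComm {
    private final String localObject;                        // id of this object
    private final Map<String, Object> sharedData = new ConcurrentHashMap<>();

    UnifiedComm(String localObject) { this.localObject = localObject; }

    // dest has the illustrative form "objectId:threadId".
    public String send(String dest, Object data) {
        String destObject = dest.split(":")[0];
        if (destObject.equals(localObject)) {
            sharedData.put(dest, data);                      // shared-data access
            return "local";
        } else {
            remoteMessage(dest, data);                       // remote messaging
            return "remote";
        }
    }

    // Stand-in for a remote method invocation to the destination object;
    // the real system would look up the reference and call it via RMI.
    private void remoteMessage(String dest, Object data) { }
}
```

The caller never names the path; the dispatch decision is entirely inside the interface, which is what makes the application code architecture-independent.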
(4) High Asynchrony
MOIDE supports highly asynchronous computation of the objects. The distributed objects can execute their computing tasks asynchronously unless interaction is required between them. MOIDE also implements asynchronous inter-object communication. Ordinary message-passing communication is fulfilled by the cooperation of a pair of sender and receiver: communication operations must be explicitly performed on both sides, with the sender issuing a send operation and the receiver issuing a receive operation. The MOIDE model allows one-sided communication in which only the sender or the receiver starts the communication. An object can send data to another object by writing the data directly to a variable in the destination object via remote method invocation. Similarly, a receiver can fetch data from the source object by directly reading the data value there. The communication can be conducted at any time without the explicit participation of the other side. One-sided communication contributes to the high asynchrony of the computation on distributed objects.
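The one-sided write described above can be sketched as a destination-side variable exposed through methods; in the real system put() would be invoked remotely via RMI, while here the class names and methods are illustrative and the call is local.

```java
// Sketch of one-sided communication: the sender writes data directly into a
// variable of the destination object via a (remote) method call, without the
// receiver ever issuing an explicit receive operation.
public class Mailbox {
    private volatile double[] data;   // variable written by remote senders

    // Invoked by the sender (via remote method invocation in the real system);
    // the receiver takes no part in this operation.
    public void put(double[] values) { data = values; }

    // The receiver reads the latest value whenever it needs it.
    public double[] read() { return data; }
}
```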
This chapter describes the fundamental structure and functionality of the MOIDE model. The runtime support system that implements computation in the MOIDE model is addressed in chapter 3.
2.2 Basic Collaborative System
The kernel of the MOIDE model is the collaborative system, a runtime software infrastructure whose fundamental components are a group of objects distributed on the hosts. The collaborative system is formed when an application is submitted to run, and it is terminated when the execution has finished. The collaborative system can also be reconfigured during the computation to match the runtime requirements of the computation and the states of the underlying hosts.
2.2.1 System Structure
A collaborative system consists of a group of distributed objects. Fig 2.1 shows the basic structure of a collaborative system built on P hosts. The object on host 0 is called the compute coordinator. It is the first object created in the system and acts as the manager of the system.
Fig 2.1 The collaborative system built on P hosts
The compute coordinator is the initiator of the collaborative system. It starts on the host where the application is submitted. It creates all remote objects on other hosts and allocates the computing tasks to them. It also coordinates the computing procedures on all objects and conducts the system-wide synchronization. The other objects accept and execute the assigned computing tasks; hence they are called compute engines. The collaborative system has a registration mechanism that contains the references to the distributed objects in the
system. The distributed objects can locate each other by referring to the registration mechanism: an object can find the reference of another object when it wants to communicate with that object. The interaction mechanism provides the communication interface to the distributed objects and implements all inter-object communication. The host selection mechanism handles the detection and selection of hosts when a collaborative system is to be created or reconfigured.
2.2.2 System Creation
A collaborative system is constructed through the cooperation of all objects. The compute coordinator is the first object, started on one host; it then starts the remote objects on other hosts. The creation of a collaborative system is accomplished in four steps.
(1) Compute coordinator start
The compute coordinator is started on the host where the application is submitted.
(2) Host selection
If the application requests more than one processor, the compute coordinator searches for available hosts in the underlying system to supply the required number of processors. There may be many available hosts. The compute coordinator follows the host selection policy to choose the most appropriate hosts, i.e., those that can provide high performance. It examines the computing power and current states of the hosts by referring to the information provided by the host selection mechanism. The host selection policy can be expressed by the following priority:

    priority_i = performance_i / workload_i

where priority_i is the priority of host i in the host selection, performance_i is the computing power of host i, and workload_i is the current workload on host i. Precedence is given to the hosts with higher priority.
(3) Compute engine creation
The compute coordinator starts the creation of the objects on the selected hosts, one object per host.
(4) Object registration
Each object registers itself with the registration mechanism. The registration mechanism maintains the registration table, which stores the reference of every object. An object can get the reference to another object by looking up the registration table, as Fig 2.2 shows. Thus an object can locate other objects and communicate with them.
As Fig 2.2 shows, the registration table is duplicated on each object. The figure only displays the registration tables on the compute coordinator and compute engine 1. The table contains the following items: the logical name of the compute engine (CE), the name of its residing host (HOST), and the reference to the compute engine (REF). All of the registration tables together constitute the registration mechanism. When an object wants to communicate with another object, it looks up the table and gets the reference to that object. Then the object can perform remote method invocation on the other object through that reference.
Fig 2.2 Registration tables
The collaborative system has been built up when all objects have registered themselves with the registration mechanism. Then the compute coordinator and the compute engines execute the application together.
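The host selection policy of step (2) can be sketched as a ranking by the priority performance_i / workload_i; the Host class and select method below are illustrative names, not the actual MOIDE data structures.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of the host selection policy: hosts are ranked by
//     priority_i = performance_i / workload_i
// and the n hosts with the highest priority are chosen.
public class HostSelection {
    static class Host {
        final String name;
        final double performance;   // computing power of the host
        final double workload;      // current workload on the host
        Host(String name, double performance, double workload) {
            this.name = name; this.performance = performance; this.workload = workload;
        }
        double priority() { return performance / workload; }
    }

    // Return the n most appropriate hosts, highest priority first.
    static List<Host> select(List<Host> candidates, int n) {
        List<Host> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble(Host::priority).reversed());
        return sorted.subList(0, Math.min(n, sorted.size()));
    }
}
```

Note that a host with modest computing power but an idle state can outrank a powerful but heavily loaded one, which is exactly the intent of the policy.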
2.2.3 System Work
To execute a computation on the collaborative system, the compute coordinator first decomposes the application into computing tasks. The computing tasks are allocated to all compute engines; the compute coordinator itself also works as a compute engine. At the
same time, it performs the required coordination of the computing procedures on all compute engines.
For a collaborative system having m compute engines, let power_i be the computing power of compute engine i, which is determined by the computing power of the underlying host. Assume the overall workload of an application is W; the task allocated to compute engine i should then have workload w_i, where

    w_i = W * power_i / Σ_k power_k
The compute coordinator starts the computing procedure on the compute engines by assigning the tasks to them. The compute engines process the tasks asynchronously except where data communication and synchronization are required. The compute coordinator is responsible for making the global synchronization of all compute engines at each synchronization point. When the computation has finished, the compute coordinator ceases all compute engines, and the collaborative system is thus terminated.
The computing procedure on a collaborative system is application-dependent. MOIDE is an infrastructure to support the implementation of applications with different computation and communication requirements.
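The proportional workload split w_i = W * power_i / Σ_k power_k can be sketched as follows; the class and method names are illustrative.

```java
// Sketch of proportional task allocation: compute engine i receives a share
// of the total workload W proportional to its computing power power_i.
public class TaskAllocation {
    static double[] allocate(double totalWorkload, double[] power) {
        double sum = 0.0;
        for (double p : power) sum += p;          // Σ_k power_k
        double[] w = new double[power.length];
        for (int i = 0; i < power.length; i++) {
            w[i] = totalWorkload * power[i] / sum;   // w_i = W * power_i / Σ_k power_k
        }
        return w;
    }
}
```

For example, with W = 100 and engine powers {1, 1, 2}, the allocated workloads are {25, 25, 50}.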
2.2.4 System Reconfiguration
The collaborative system is also flexible enough to conduct runtime system reconfiguration to improve the system performance. The computing power of the collaborative system can be enhanced by adding more compute engines on new hosts. The computing task on an overloaded host can be moved to another available host.
The MOIDE model has the flexibility to create a new compute engine at any time on any host. The distributed compute engines are linked together via the registration mechanism. The registration table contains the references to all compute engines and can be updated by inserting new references or removing old ones. Therefore a new compute engine can join the collaborative system, and an old compute engine can be removed from the system.
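Such an updatable registration table can be sketched as a hash map keyed by the logical engine name; the class, the Entry fields (HOST, REF), and the use of Object in place of the real RMI remote reference type are illustrative, not the actual MOIDE implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the registration mechanism: a table mapping the logical name of
// each compute engine (CE) to its residing host (HOST) and object reference
// (REF). The table is duplicated on every object in the collaborative system.
public class RegistrationTable {
    static class Entry {
        final String host;
        final Object ref;   // in the real system, an RMI remote reference
        Entry(String host, Object ref) { this.host = host; this.ref = ref; }
    }

    private final Map<String, Entry> table = new HashMap<>();

    // Each object registers itself when it joins the collaborative system.
    public void register(String ceName, String host, Object ref) {
        table.put(ceName, new Entry(host, ref));
    }

    // Look up the reference to another compute engine before communicating.
    public Object lookup(String ceName) {
        Entry e = table.get(ceName);
        return e == null ? null : e.ref;
    }

    // A departing engine's entry is removed from the table.
    public void remove(String ceName) { table.remove(ceName); }

    // Host replacement: the new engine's reference replaces the old entry
    // under the same logical name, so the system's logical structure is kept.
    public void replace(String ceName, String newHost, Object newRef) {
        table.put(ceName, new Entry(newHost, newRef));
    }
}
```

Keying the table by the logical name is what lets host replacement preserve the logical structure: the name stays, only HOST and REF change.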
2.2.4.1 System Expansion
Generally there are many hosts available in a distributed system, and a collaborative system is established on the selected hosts. Moreover, irregularly structured problems have non-predetermined computation patterns. For instance, an application initially runs on P compute engines but may generate an extremely high computation workload during the execution. If the distributed system has extra hosts available, the collaborative system can select additional hosts at runtime to share the computation workload. The collaborative system is expanded by incorporating new compute engines on new hosts. The new compute engines work in the same way as the old compute engines in the collaborative system. This is horizontal system expansion, shown in Fig 2.3.
Fig 2.3 Horizontal system expansion
During the execution of an irregular application, the runtime computation workload may become highly imbalanced among the compute engines, and a heavily loaded compute engine will become a bottleneck. To alleviate the bottleneck, an extra compute engine can be attached to the overloaded compute engine. The attached compute engine works under the overloaded one to share its workload; it is the assist-engine of the overloaded one and is visible only to its parent compute engine. This is vertical system expansion.
Fig 2.4 shows the vertical system expansion with an assist-engine attached to compute engine
2.
Fig 2.4 Vertical system expansion
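A minimal sketch of vertical expansion, assuming a hypothetical ComputeEngine class with a workload counter; in the real system the assist-engine would be a remote object created on another host, not a local one:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of vertical expansion: an overloaded compute engine
// attaches an assist-engine that is visible only to its parent.
// All class and field names are hypothetical, not from the thesis code.
public class VerticalExpansion {
    static class ComputeEngine {
        final String host;
        final List<ComputeEngine> assistEngines = new ArrayList<>();
        int workloadUnits;

        ComputeEngine(String host, int workloadUnits) {
            this.host = host;
            this.workloadUnits = workloadUnits;
        }

        // Attach an assist-engine and hand over half of the local workload.
        ComputeEngine attachAssistEngine(String newHost) {
            ComputeEngine assist = new ComputeEngine(newHost, workloadUnits / 2);
            workloadUnits -= assist.workloadUnits;
            assistEngines.add(assist); // registered only with this parent
            return assist;
        }
    }

    public static void main(String[] args) {
        ComputeEngine engine2 = new ComputeEngine("host2", 100);
        ComputeEngine assist = engine2.attachAssistEngine("newHost");
        System.out.println(engine2.workloadUnits + " + " + assist.workloadUnits);
    }
}
```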
2.2.4.2 Host Replacement
The hosts in a distributed system are shared resources accessible by many users and
applications, so the states of the hosts, such as their workload, keep changing over time.
Only idle or lightly-loaded hosts are suitable to accommodate a collaborative system.
At the system creation stage, the compute coordinator selects lightly-loaded hosts.
However, the workload on a host may later rise to a high level, caused not only by the
computation of this collaborative system but also by jobs from other users. Overloaded hosts
will slow down the collaborative computation. If idle or lightly-loaded hosts are available in
the distributed system, they can be used to replace the overloaded ones. In a host replacement,
a new compute engine is created on the newly-selected host. The new compute engine takes
over the computing task from the compute engine on the overloaded host and replaces the
role of the old compute engine in the collaborative system, and the old compute engine is
terminated. The registration table is updated accordingly to reflect the change of compute
engines: the reference to the new compute engine replaces
the entry of the old compute engine in the table. The logical structure of the collaborative
system remains unchanged after the replacement, and the computing procedure continues on
the collaborative system as before; the system size also stays the same. Fig 2.5 shows an
example of host replacement in which the compute engine on Host 2 is replaced by a new
compute engine on New Host 2.
Fig 2.5 Host replacement in a collaborative system
The system reconfiguration is managed by the compute coordinator, except for vertical
system expansion, which is conducted by the parent compute engine alone. The host selection
mechanism provides real-time information about the available hosts; the compute coordinator
reads this state information and performs the replacement operations when necessary.
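Host replacement can be sketched as an update of the registration table in which the new engine's reference takes over the old engine's entry. All names here are illustrative; the real operation also transfers the computing task to the new engine and terminates the old one:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of host replacement: a new engine on a lightly-loaded host takes
// over the table entry of the engine on an overloaded host, so the logical
// structure and the system size stay unchanged.
public class HostReplacement {
    public static Map<String, String> replace(Map<String, String> table,
                                              String overloadedHost,
                                              String newHost) {
        Map<String, String> updated = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : table.entrySet()) {
            if (e.getKey().equals(overloadedHost)) {
                // The new engine replaces the old engine's entry in place.
                updated.put(newHost, "rmi://" + newHost + "/engine");
            } else {
                updated.put(e.getKey(), e.getValue());
            }
        }
        return updated;
    }

    public static void main(String[] args) {
        Map<String, String> table = new LinkedHashMap<>();
        table.put("host1", "rmi://host1/engine");
        table.put("host2", "rmi://host2/engine");
        System.out.println(replace(table, "host2", "newHost2"));
    }
}
```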
2.3 Hierarchical Collaborative System
The basic collaborative system is built solely from compute engine objects on single-
processor hosts, and all inter-object communication is accomplished via remote messaging. For
a collaborative system on a heterogeneous system, the multithreading methodology can be
incorporated into the compute engines to suit the hierarchical architecture.
Nowadays symmetric multiprocessors (SMPs) built from off-the-shelf microprocessors
are widely used as cost-effective multiprocessor machines. Networked SMP machines
(clusters of SMP nodes) can provide high performance for large-scale computing, and a
cluster of SMP nodes is considered a low-cost substitute for high-end supercomputers [25]. A
distributed system may consist of both SMP and single-processor nodes, and the SMP nodes
may contain different numbers of processors with varied computing power. A heterogeneous
system like this therefore has a hierarchical architecture: system-wide loosely-coupled nodes,
and tightly-coupled processors inside each SMP node. This heterogeneity requires the
MOIDE model to be adaptive to the architecture of the underlying hosts. The basic
collaborative system of 2.2 is therefore expanded into the hierarchical collaborative
system (HiCS).
2.3.1 Heterogeneous System
Consider building a collaborative system on heterogeneous hosts. The basic collaborative
system structure is still applicable in this case: a P-processor SMP node can be treated as P
individual hosts, with P compute engines created on it, and all compute engines on the same
or different SMP nodes simply form a basic collaborative system. This, however, is not a
proper approach, because it does not exploit the advantages of the SMP node. For example,
the objects residing on an SMP node can communicate with one another via shared-data
access, a more efficient communication method than remote messaging. The communication
mechanism on a heterogeneous system should therefore integrate swift shared-data access
with widely-applicable message passing. Multithreading programming techniques can also be
used on an SMP node. A thread is a lightweight entity that occupies fewer system resources
than an object, so multithreading is a cost-effective methodology for parallel computation on
an SMP node.
Fig 2.6 shows the architecture of a heterogeneous system. It is a two-level hierarchical
structure: an SMP node is composed of tightly-coupled processors connected by shared
memory modules, and all nodes are linked across the network into a loosely-coupled cluster.
The relations between the objects residing on the nodes can likewise be treated at two levels.
The objects on the same SMP node are tightly related to each other; these sibling objects
can cooperate more tightly in the computation than objects on different nodes. The sibling
objects can be created as multiple threads to take advantage of multithreading techniques
such as efficient data sharing and low resource consumption.
Fig 2.6 Cluster of SMPs
Though message passing is the usual communication method on distributed systems,
the threads on the same SMP node can interact with each other by shared-data access, because
they are instantiated from the same object and can access the public data in that object.
Shared-data access is a fast communication path on an SMP node. A two-layer communication
mechanism can thus be built on heterogeneous hosts, integrating local shared-data access and
remote messaging at the two levels. With this two-layer communication mechanism, the
performance of communication-intensive applications can be improved.
To realize the adaptability of MOIDE model, the basic collaborative system in 2.2 is
modified to be a hierarchical structure that incorporates distributed object and multithreading
methodologies. This is the hierarchical collaborative system structure.
2.3.2 Hierarchical Collaborative System
The hierarchical collaborative system (HiCS) is an infrastructure expanded from the basic
collaborative system. In HiCS, the compute engine on an SMP node spawns a group of
threads, one per processor on the node, and the threads run on the processors in parallel.
Multithreading supplements the distributed object model to make the collaborative system
adaptive to the hierarchical architecture of hybrid hosts. The computation is more efficient
when performed
by the threads on an SMP node. The two-layer communication mechanism implements
efficient communication on HiCS.
Fig 2.7 shows the structure of the hierarchical collaborative system. The compute engine
on an SMP node can generate multiple threads. For example, if Host 1 in Fig 2.7 is an
SMP with k processors, the compute engine on it spawns k threads, as shown in the attached
box. The creation of a hierarchical collaborative system includes the following operations.
Fig 2.7 Hierarchical collaborative system built on heterogeneous hosts
(1) Host selection and compute engine creation
This is similar to the creation of the basic collaborative system described in 2.2. First the
compute coordinator is started on the node where the application is initiated. The compute
coordinator then selects the hosts to supply the processors required by the application. The
host selection policy on a heterogeneous system should take the size of an SMP node into
account, giving higher priority to SMP nodes with more processors:
priorityi = Pi ⋅ performancei / workloadi
where priorityi is the priority of host i to be selected, Pi is the number of processors on host i,
performancei is the computing power of each processor in host i, and workloadi is the
computation workload on host i. The number of processors in a host is thus one of the criteria
in host selection: an SMP node with more processors has a higher priority of being selected.
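The priority formula above can be sketched directly in code; rank() orders host indices by descending priority, so SMP nodes with more or faster processors and lighter loads come first. The helper names are hypothetical, not from the thesis implementation:

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch of the host selection priority:
//   priority_i = P_i * performance_i / workload_i
public class HostPriority {
    public static double priority(int processors, double performance, double workload) {
        return processors * performance / workload;
    }

    // Return host indices ordered by descending priority.
    public static Integer[] rank(int[] procs, double[] perf, double[] load) {
        Integer[] order = new Integer[procs.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(
                (Integer i) -> priority(procs[i], perf[i], load[i])).reversed());
        return order;
    }

    public static void main(String[] args) {
        int[] procs = {1, 4, 2};          // a single-processor host and two SMP nodes
        double[] perf = {1.0, 1.0, 1.0};  // equal per-processor power
        double[] load = {0.5, 1.0, 1.0};
        System.out.println(Arrays.toString(rank(procs, perf, load)));
    }
}
```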
The compute coordinator starts the compute engines on the selected hosts, one compute
engine per host regardless of how many processors it has, as shown for Host 1 to Host (P-1)
in Fig 2.7. The compute engines register themselves with the registration mechanism, and
each compute engine keeps a registration table that records the references to the remote
compute engines.
(2) Thread generation
Each compute engine on an SMP node generates a group of threads inside, which will
run on the local processors.
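Thread generation on an SMP node can be sketched with the standard Java threading API; the worker body below is only a placeholder for the real computing task:

```java
// Sketch of thread generation in a compute engine on an SMP node: spawn one
// thread per local processor and let the main thread join the group.
public class ThreadGeneration {
    public static int runWorkers() {
        int processors = Runtime.getRuntime().availableProcessors();
        Thread[] workers = new Thread[processors];
        final int[] done = new int[1]; // shared counter, guarded by a lock
        for (int i = 0; i < processors; i++) {
            workers[i] = new Thread(() -> {
                synchronized (done) { done[0]++; } // placeholder for real work
            });
            workers[i].start();
        }
        try {
            for (Thread t : workers) t.join(); // main thread synchronizes the group
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return done[0];
    }

    public static void main(String[] args) {
        System.out.println(runWorkers() + " worker threads completed");
    }
}
```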
The hierarchical collaborative system is a two-level infrastructure. All compute engines
form the upper level, which is directly managed by the compute coordinator; the lower level
contains the threads inside each compute engine. The main thread is the original thread in
each compute engine, and all other threads are instantiated from it. The group of threads can
jointly process one computing task or independently process their own tasks. The main thread
acts as the local coordinator and performs the necessary synchronization on the group of
threads. In HiCS, the threads in the same compute engine can access shared data, so inter-thread
communication can be accomplished via shared memory. The distributed compute engines,
however, are shared-nothing objects; they communicate with one another by remote
messaging.
The hierarchical collaborative system is an adaptive infrastructure on heterogeneous
systems. Owing to the conditional creation of threads in a compute engine, HiCS provides a
uniform infrastructure for developing and running applications on varied architectures;
multithreading and single-threading compute engines can coexist in one HiCS. For instance,
an application may request to run on P processors, and the processors may be supplied by
different hosts each time the application runs: on a group of SMP nodes of varied sizes, or on
hybrid hosts mixing SMP nodes and single-processor hosts. The underlying hosts are
transparent to an application developed on the HiCS infrastructure. A HiCS is created at
runtime with a structure that matches the architecture of the hosts, to achieve the best
performance in executing the application.
The group of threads generated in the compute coordinator or a compute engine can be
organized to work in two modes, according to the computation patterns of the application:
Cooperative mode: the main thread acts as the local coordinator inside a compute
engine and accepts the computing task from the compute coordinator. The group of
threads shares the computing task, i.e., each thread executes a part of it, so a thread
can be called a sub-engine of the compute engine. The main thread coordinates the
computing procedure across the group of threads. The threads conduct only local
communication among themselves, via shared-data access; the main thread is
responsible for the communication with other compute engines.
Independent mode: each thread works as an independent compute engine and
processes a computing task on its own. Even so, the threads in the same compute
engine can still access shared data, and any thread is allowed to communicate with
threads in other compute engines through remote messaging. There is no local
coordinator as in the cooperative mode; each thread performs an individual computing
task and can thus be called a pseudo-engine.
The work mode of the threads is determined by the computation pattern of an application.
The algorithm of an application can be designed around either mode, and the mode can also
be switched from one to the other within an application, depending on the computational
requirements of its different phases.
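The cooperative mode can be sketched as follows: the main thread partitions a computing task (here, summing an array) among sub-engine threads that write partial results into shared data, then combines the results. The class and method names are illustrative:

```java
// Sketch of cooperative mode: sub-engine threads share one computing task,
// communicating only through shared data; the main thread coordinates.
public class CooperativeMode {
    public static long cooperativeSum(long[] data, int numThreads) {
        long[] partial = new long[numThreads];   // shared result slots
        Thread[] subEngines = new Thread[numThreads];
        int chunk = (data.length + numThreads - 1) / numThreads;
        for (int t = 0; t < numThreads; t++) {
            final int id = t;
            subEngines[t] = new Thread(() -> {
                int from = id * chunk;
                int to = Math.min(from + chunk, data.length);
                for (int i = from; i < to; i++) partial[id] += data[i];
            });
            subEngines[t].start();
        }
        try {
            for (Thread t : subEngines) t.join(); // main thread coordinates
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        long sum = 0;
        for (long p : partial) sum += p; // combine shared partial results
        return sum;
    }

    public static void main(String[] args) {
        long[] data = new long[1000];
        for (int i = 0; i < data.length; i++) data[i] = i + 1;
        System.out.println(cooperativeSum(data, 4)); // sum of 1..1000 = 500500
    }
}
```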
2.3.3 Task Allocation
After a HiCS has been built, the compute coordinator allocates the computing tasks to the
compute engines. Due to the highly diverse characteristics of irregularly structured problems,
different task allocation strategies should be used for different applications; a good task
decomposition cannot be discovered by inspection before execution because of the
irregularity in computation and communication [73]. The following strategies can serve as
general approaches to task allocation in irregularly structured problems.
(1) Initial task decomposition with runtime repartition
The workload of an irregular computation is not pre-determined and evolves during
execution, so it is not possible to decompose the computation evenly a priori. Task allocation
can instead adopt the strategy of initial task decomposition with runtime repartition: an
application is initially divided into tasks based on an estimate of the workload, and if the
workload is found to be unbalanced across the processes during execution, the tasks are
re-decomposed at runtime based on the real workload. This strategy suits applications with
high data dependency.
(2) Dynamic task allocation
For applications with light data dependency, it is not necessary to allocate all tasks to
the processes before execution. The tasks can be allocated progressively, one at a time, in
accordance with the computing progress of each process. With dynamic task allocation, the
workload is automatically balanced across the processes without a specific load balancing
operation.
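Dynamic task allocation can be sketched as a global task pool from which processes fetch one task at a time, so faster processes simply fetch more tasks and the load balances itself. TaskPool and getTask() below are a simplified, local stand-in for the pool-based allocation described above:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of dynamic task allocation from a shared pool. Tasks are modeled
// as integer ids; the atomic counter makes concurrent fetches safe.
public class TaskPool {
    private final AtomicInteger next = new AtomicInteger(0);
    private final int numTasks;

    public TaskPool(int numTasks) {
        this.numTasks = numTasks;
    }

    // Returns the next task id, or -1 when the pool is exhausted.
    public int getTask() {
        int id = next.getAndIncrement();
        return id < numTasks ? id : -1;
    }

    public static void main(String[] args) {
        TaskPool pool = new TaskPool(3);
        int t;
        while ((t = pool.getTask()) >= 0) {
            System.out.println("processing task " + t);
        }
    }
}
```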
(3) Balance in both computation and communication
Generally, load balancing refers to balancing the computation workload. In irregularly
structured problems, however, load balancing should pay special attention to the
communication workload, because irregular communication patterns usually incur high
and diverse communication overhead that severely restricts performance. The task
allocation strategy for communication-intensive applications must therefore cover the
communication factor: the task decomposition should preserve data locality within the tasks
and generate a data distribution that reduces inter-process communication, and it should even
out the diverse communication requirements of the processes so that communication is
balanced among them and communication bottlenecks are alleviated.
The computation and communication patterns of irregularly structured problems are
extremely complicated, so task allocation strategies must be closely tied to the applications.
The strategies above are generic approaches; specific task allocation schemes should be
derived from them for different applications.
2.3.4 Unified Communication Interface
The communication in a hierarchical collaborative system takes place on two levels. The
threads inside a compute engine share public data and variables, so communication between
the threads can be realized through shared memory; this is the efficient path for local data
communication. The communication between compute engines is delivered by remote
messaging across the network, which incurs high communication latency. Fig 2.8 shows the
structure of the two-layer communication mechanism.
The communication between tasks is delivered either by local shared-memory access or
by message passing through the network, depending on the locations of the communication
partners. A unified communication interface is provided to all tasks at the application level:
the tasks call the same interface for communication no matter where the destination is, and
the interface implicitly decides the proper communication path.
Fig 2.8 Two-layer communication mechanism
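A minimal sketch of the path selection behind the unified interface, assuming a task-to-host map. The shared-data layer is modeled by a local queue, and the remote-messaging layer is stubbed out (in the real system it would go through RMI); all names are illustrative:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of the unified communication interface: the caller names only the
// destination task; the interface picks shared-data transfer for a local
// task and remote messaging otherwise.
public class UnifiedComm {
    private final int[] hostOfTask;   // task id -> host id
    private final int localHost;
    private final BlockingQueue<double[]> sharedBuffer = new ArrayBlockingQueue<>(16);
    public int remoteSends = 0;       // counts stubbed remote-messaging calls

    public UnifiedComm(int[] hostOfTask, int localHost) {
        this.hostOfTask = hostOfTask;
        this.localHost = localHost;
    }

    public void send(int destTask, double[] data) {
        if (hostOfTask[destTask] == localHost) {
            sharedBuffer.offer(data); // layer 1: shared-data access
        } else {
            remoteSends++;            // layer 2: remote messaging (stub)
        }
    }

    public double[] receiveLocal() {
        return sharedBuffer.poll();
    }

    public static void main(String[] args) {
        UnifiedComm comm = new UnifiedComm(new int[]{0, 0, 1}, 0);
        comm.send(1, new double[]{3.14}); // same host: shared buffer
        comm.send(2, new double[]{2.71}); // different host: remote path
        System.out.println(comm.receiveLocal()[0] + " / " + comm.remoteSends);
    }
}
```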
The two-layer communication is well suited to the hierarchical collaborative system built
on heterogeneous hosts. It helps reduce the heavy and unpredictable communication overhead
of irregularly structured problems. With the flexible computing modes (cooperative or
independent), the tasks (threads) can be organized into different modes to exploit the
efficiency of the two-layer communication. In chapter 6, the CG method will demonstrate the
transparent mapping of the communication paths, and the radix sort will illustrate the flexible
use of the computing modes to implement efficient global communication.
The pseudo-engines have two communication paths: pseudo-engines on the same SMP
node can use shared-data access, while pseudo-engines residing on different nodes must
communicate via remote messaging. Owing to the architecture-transparent feature of the
MOIDE model, the locations of the pseudo-engines are transparent to the applications. At the
application level, any communication between pseudo-engines goes through the unified
communication interface, and the MOIDE runtime support system (see chapter 3) decides the
exact communication path according to the locations of the pseudo-engines.
2.4 Implementation
The MOIDE model is implemented in Java with the RMI (Remote Method Invocation)
interface on distributed systems. The motivation for using Java and RMI has been discussed
in 1.2.3. Java's object-oriented, multithreading and platform-independent features are suitable
for implementing the hierarchical collaborative system infrastructure on heterogeneous
systems, and RMI's remote method invocation mechanism facilitates the flexible interaction
and communication between distributed objects. A brief introduction to the implementation
is given here; the details are covered in chapter 3.
(1) Compute Coordinator and Compute Engine
The compute coordinator and compute engines are the major components of the
hierarchical collaborative system. Two classes, for the compute coordinator and the compute
engine, are defined as the kernel of the implementation; the other components are built
around them.
(2) Object Registration and Interaction Interface
The registration mechanism of the collaborative system is implemented on the RMI
registry, rmiregistry. The registry runs on each selected host and provides a naming service to
the distributed objects; a compute engine registers itself with the registry when created. The
compute coordinator assigns computing tasks and other arguments to remote compute
engines and triggers their computation through remote method invocation. The compute
coordinator generates a name list of the compute engines and broadcasts it to them. Each
compute engine can then independently obtain the reference to any other remote compute
engine by looking up the registry on the remote host once. The references are stored in the
registration table of each compute engine for later use, as Fig 2.2 shows.
The interaction and communication between the compute coordinator and the compute
engines go through remote method invocation via the RMI interface. A class defines the
interface, and the implementation of the interface is defined in the compute engine class.
(3) Multithreading
A compute engine instantiates a group of threads on an SMP node before it starts to run
the computing task. Recent JDK versions, e.g. IBM JDK 1.1.8 and Blackdown JDK 1.2 and
up, support kernel-based threads (native threads), which the JVM can schedule onto the
multiprocessors of an SMP node.
(4) Two-layer Communication
The data sharing among a group of threads is accomplished by accessing public data
objects; in cooperative mode, the threads in a compute engine may work on the same data set
and share the computing task. Compute engines transmit data to each other by remote
messaging, i.e., by passing data through remote method invocation. RMI supports object
serialization, which enables direct transfer of complex data objects back and forth between
the compute engines. Applications call the unified communication interface to realize the
two-layer communication. The interface provides the communication methods to the
applications and seamlessly integrates shared-data access and remote messaging on the two
levels. The unified communication interface is extremely useful for computation in
independent mode, where each thread works independently as a compute engine. The
location of each thread is registered at the creation stage of the HiCS. When a thread wants to
communicate with another thread, it just specifies the ID of the target when calling a
communication method, and the two-layer communication mechanism chooses the proper
path and delivers the data according to the locations of the two threads.
2.5 Summary
This chapter has described the distributed object model MOIDE, which supports flexible
and efficient computing on distributed systems. The MOIDE model presents a computing
infrastructure that is adaptive to heterogeneous system architectures. It creates a collaborative
system on the available hosts to execute an application, and the collaborative system can be
reconfigured to adapt to changes in the states of the underlying hosts. By combining the
distributed object and multithreading methodologies, the hierarchical collaborative system
realizes high-performance computing on heterogeneous systems. The compute engines in a
hierarchical collaborative system can work in two modes, in accordance with the computation
patterns of the applications, and the two-layer communication mechanism built on the
hierarchical collaborative system supports efficient communication.
The MOIDE model is implemented by the runtime support system MOIDE-runtime;
chapter 3 gives the implementation details. The MOIDE model is suitable for various
applications, especially for solving irregularly structured problems on distributed systems. It
is used to implement four irregularly structured applications: the N-body problem, ray tracing,
CG, and radix sort. The utilization and advantages of the MOIDE model will be demonstrated
by these four applications in chapters 4, 5 and 6.
Chapter 3
Runtime Support System
A runtime support system has been developed to support distributed computing on the
MOIDE model. It provides the fundamental classes and APIs for implementing MOIDE-based
computation and is programmed in Java with RMI.
3.1 Overview
The runtime support system provides the components and functions of the MOIDE model.
Table 3.1 lists the major classes and methods implemented in it. The runtime support system
is called MOIDE-runtime in the following text for brevity.
Class          Method           Description
StartEngine                     build a collaborative system
               getHosts         select hosts
               createEngine     create compute engines on the hosts
               invokeEngine     assign tasks to remote compute engines
Codr                            compute coordinator
               main             create collaborative system
               run              run application
Engine                          object interface of compute engine
EngineImpl                      implementation of compute engine
               run              start compute engine
               ceaseEngine      terminate compute engine
ExpandEngine                    expand collaborative system
               addEngine        add new compute engine
RecfgEngine                     reconfigure collaborative system
               checkEngine      check the states of the hosts
               replaceEngine    replace a compute engine with a new one
CommLib                         unified communication interface
               exchDouble       exchange a double
               exchDoubleArray  exchange an array of double
               exchIntArray     exchange an integer array
               allReduce        global reduction
               scan             global scan
Util                            miscellaneous utilities
               barrier          synchronize a group of threads
               remoteBarrier    synchronize compute engines
               getTask          get a task from global task pool
               getSubtask       get a subtask from local subtask queue
Table 3.1 Major classes and methods in MOIDE runtime support system
The compute engine is specified as the interface Engine together with its implementation
EngineImpl. The compute coordinator class Codr is an application-dependent class that calls
the main method of the application.
Fig 3.1 Organization of MOIDE runtime support system
Fig 3.1 depicts the organization of MOIDE-runtime. The runtime support system
implements the creation and reconfiguration of the hierarchical collaborative system. It
defines the class StartEngine for system creation, ExpandEngine for system expansion,
and RecfgEngine for host replacement. These three classes call the same methods to
search for the available hosts in a distributed system and detect their states. The state
detection is implemented with the aid of ClusterProbe [13], a Java-based tool developed by
our research group for reporting the states of clustered computers. ClusterProbe runs on a
server, monitors the hosts in a distributed system, and periodically reports their states and
other information, including processor type, performance, memory size, workload, and the
number of processors in each host. The host selection is based on this state information.
The runtime support system provides the primitives for dynamic load scheduling and
object synchronization. It also defines a unified communication interface for the two-layer
communication.
3.2 Principal Objects
The principal objects in a collaborative system are the compute coordinator and the
compute engines. They are defined as two classes, and all other classes are built around them
to complete the system functions.
The compute coordinator is the first object started, on the host where an application
begins to run. It is responsible for host selection and remote compute engine creation, and it
coordinates the whole computing procedure of the collaborative system. A compute engine is
an object created on a remote host; it accepts and processes computing tasks. After
establishing the collaborative system, the compute coordinator also works as a compute
engine to execute computing tasks.
Fig 3.2 Class description and relation of compute coordinator and compute engine
The interactions between the objects (compute coordinator and compute engines) are
implemented on the RMI interface. Following the convention of the RMI-based distributed
object scheme, an interface and its implementation must be specified for remote objects. The
compute engine is therefore defined as an interface Engine and its implementation
EngineImpl. The interface contains the declarations of all methods open to remote
method invocation; the implementation specifies the details of the compute engine class,
including the bodies of the methods declared in the interface.
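The interface/implementation convention can be sketched as follows. A real EngineImpl would also be exported as a remote object (e.g. via UnicastRemoteObject) and bound in rmiregistry; that part is omitted here, and the invoke() method shown is a hypothetical stand-in:

```java
import java.rmi.Remote;
import java.rmi.RemoteException;

// Sketch of the RMI convention: remote methods are declared in an interface
// extending Remote and implemented in a separate class.
public class EngineSketch {
    // Interface: declares the remotely invocable methods.
    interface Engine extends Remote {
        String invoke(String task) throws RemoteException;
    }

    // Implementation: supplies the bodies of the declared methods.
    static class EngineImpl implements Engine {
        private final String host;

        EngineImpl(String host) {
            this.host = host;
        }

        @Override
        public String invoke(String task) {
            return "engine@" + host + " running " + task;
        }
    }

    public static void main(String[] args) {
        EngineImpl engine = new EngineImpl("host1");
        System.out.println(engine.invoke("cg-solver"));
    }
}
```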
From the description above, we can outline the structures of the compute coordinator and
compute engine classes as well as their relationship. As Fig 3.2 shows, the compute
coordinator class Codr is a comprehensive class that encapsulates the system creation and
coordination work. EngineImpl is the interface implementation of the compute engine
class; it includes the methods for remote object interaction such as remote messaging and
global synchronization (see 3.6.2). The constructor of the compute engine is invoked by the
compute coordinator through remote method invocation. When a compute engine is created
on an SMP node, it spawns a group of threads according to the number of local processors.
Appl is the class of the application program to be run; both the compute coordinator and the
compute engines instantiate the application class and run its main method as appl.run().
3.3 System Creation
The compute coordinator is the initiator of a collaborative system. It first runs the class
StartEngine to create the collaborative system; the creation of a compute engine on
another host is triggered by a remote method invocation from the compute coordinator.
3.3.1 class StartEngine
The class StartEngine builds a collaborative system based on the required number of
processors and the available system resources. It includes the methods to select the hosts,
create and invoke compute engines on them, and start the computation on the system. The
methods complete the three stages of system creation: host selection (getHosts()),
compute engine creation (createEngine()), and remote compute engine invocation
(invokeEngine()). These methods are called in the run() method of StartEngine,
executed by the compute coordinator, as Fig 3.3 shows.
public void run () {
String [] hostList = (String []) getHosts();
Engine [] engine = (Engine []) createEngine(hostList);
invokeEngine(engine, task);
}
Fig 3.3 run() method of the StartEngine class
(1) getHosts()
Fig 3.4 shows the getHosts() method for host selection. This is the first step in
system creation.
public String [] getHosts () {
    String rn = "CLUSTERPROBE.resources.Cluster_Workstation_CPUInfo";
    String server;   /* the server running ClusterProbe */
    int port = 7001;
    Vector hostList = (new GetHostStatus()).getHostStatus(rn, server, port);
    rn = "CLUSTERPROBE.resources.Cluster_Workstation_StatusTable";
    Vector hostStatus = (new GetHostStatus()).getHostStatus(rn, server, port);
    hosts = selectHosts(hostList, hostStatus);
    return hosts;
}

public String [] selectHosts (Vector hostList, Vector hostStatus) {
    /* select hosts according to pre-defined criteria */
    …………
}

Fig 3.4 getHosts() method
The method contacts ClusterProbe and retrieves the host information from it through the
GetHostStatus() interface. The host information includes all hosts available in the
distributed system, their performance and current states, their internal number of processors
and etc. By specifying the parameter rn of the method getHostStatus(), ClusterProbe
can supply various information that reflects different aspects of the distributed system. In Fig
3.4, the first string “CLUSTERPROBE.resources.Cluster_Workstation_CPUInfo”
indicates to get a full list of CPU information, including the number, type and speed of the
CPU(s) in each host. The second string “CLUSTERPROBE.resources.
Cluster_Workstation_StatusTable” means to get the state table of the hosts. The
status table contains machine states and current workload.
Method selectHosts() implements the host selection policy to select proper hosts
among all available hosts. The host on which the compute coordinator resides is the first
host to be selected.
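The selection policy left abstract in Fig 3.4 could be sketched as below. This is an illustrative guess rather than MOIDE-runtime's actual code: the class name HostSelector, the "Overload" status string, and the localHost parameter are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a selectHosts() policy: keep hosts whose reported
// state is not "Overload", and place the coordinator's own host first,
// as Section 3.3.1 requires.
public class HostSelector {
    public static String[] selectHosts(String[] hostList, String[] hostStatus,
                                       String localHost) {
        List<String> selected = new ArrayList<>();
        selected.add(localHost);   // the coordinator's host is chosen first
        for (int i = 0; i < hostList.length; i++) {
            if (!hostList[i].equals(localHost) && !"Overload".equals(hostStatus[i])) {
                selected.add(hostList[i]);
            }
        }
        return selected.toArray(new String[0]);
    }
}
```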
(2) createEngine()
Having chosen the hosts, the compute coordinator starts the creation of the compute
engines by Runtime.getRuntime().exec(command) (see Fig 3.5), where command
is a remote shell (rsh) invocation on a remote host. The command runs a script
create_engine on the remote host to create a compute engine there. The
create_engine script runs rmiregistry and starts the compute engine. The compute
engine registers itself with the registration mechanism. Then the compute coordinator needs to
get the references to remote compute engines for future interaction. This is accomplished by
running the RMI method Naming.lookup() which locates the remote objects and gets
the references to them. The returned references are stored in the registration table engine[].
Thereafter the compute coordinator can contact the compute engines through the references
in engine[].
public Engine [] createEngine(String [] hosts) {
    for host[i] in hosts[] {
        String command = "/usr/bin/rsh " + host[i] + " create_engine";
        Runtime.getRuntime().exec(command);
        engine[i] = (Engine) Naming.lookup("//" + host[i] + "/Engine" + i);
    }
    return engine;
}
Fig 3.5 createEngine() method
(3) invokeEngine()
When the compute coordinator has obtained the references to the remote compute engines, it
immediately assigns computing tasks to them by engine[i].invoke() in the method
invokeEngine() (see Fig 3.6). The call engine[i].invoke(task[i]) invokes
the remote method invoke() on compute engine i, where engine[i] is the
reference to it. The task allocated to the compute engine is passed through the argument
task[i]. In addition to the computing task assigned to that compute engine, task[i] also
contains the information about the construction of the collaborative system such as the name
list of all hosts and the compute engines on them. Having received the list, a compute engine
can get the references to all remote compute engines by calling the Naming.lookup() method
and create its own registration table. The creation of the collaborative system is finished
when all compute engines have obtained the references to the other engines.
As indicated above, remote method invocation is more than a simple message passing
operation. It also passes control to the remote object. The compute coordinator calls the
method invoke()on remote compute engine to activate its work. The invoke() method
not only transmits the compute task and related data to a compute engine but also starts the
execution of the computing task on the compute engine.
public void invokeEngine (Engine[] engine, Message[] task) {
    for engine[i] in engine[]
        engine[i].invoke(task[i]);
}
Fig 3.6 invokeEngine() method
After the execution of invokeEngine(), all compute engines have received their
computing tasks and begin executing them asynchronously.
3.3.2 Initialization of Compute Engine
As introduced in 3.2, the compute engine is defined by the interface Engine and its
implementing class EngineImpl.
(1) Starting Compute Engine
The remote shell command Runtime.getRuntime().exec(command) in Fig 3.5, issued by the
compute coordinator, starts the creation of a compute engine on a remote host. The creation
includes two operations:
1. run rmiregistry to start RMI registry on the host;
2. start EngineImpl.
The object of EngineImpl registers itself to the registry by name (e.g., Engine0), as
below:
EngineImpl computeEngine = new EngineImpl();
Naming.rebind("Engine0", computeEngine);
After the registration, the creation of compute engine has been completed.
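Pulling the pieces of this section together, a minimal sketch of the Engine interface and EngineImpl might read as follows. The invoke()/replace() signatures and the plain Object task argument are assumptions for illustration; the thesis does not give the exact declarations here.

```java
import java.rmi.Naming;
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.server.UnicastRemoteObject;

// Hedged sketch: a remote interface for compute engines and a skeleton
// implementation that registers itself by name, as in Section 3.3.2.
interface Engine extends Remote {
    void invoke(Object task) throws RemoteException;     // start a computing task
    void replace(String oldHost) throws RemoteException; // take over a task (Section 3.4.2)
}

class EngineImpl extends UnicastRemoteObject implements Engine {
    EngineImpl() throws RemoteException { super(); }

    public void invoke(Object task) throws RemoteException {
        /* unpack the task and start the computation */
    }

    public void replace(String oldHost) throws RemoteException {
        /* fetch the task from the engine on oldHost */
    }

    public static void main(String[] args) throws Exception {
        // After rmiregistry has been started by create_engine, the engine
        // registers itself under a well-known name:
        EngineImpl computeEngine = new EngineImpl();
        Naming.rebind("Engine0", computeEngine);
    }
}
```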
(2) Multithreading
On an SMP node, a group of threads will be generated in the compute engine. The code
in Fig 3.7 generates threads on host k depending on the number of its processors in use. The
number of processors on the SMP node has been obtained in StartEngine.
Appl [] subengine = new Appl[number_of_processors[k]];
for ( i = 1; i < number_of_processors[k]; i ++ ) {
    subengine[i] = new Appl();
    subengine[i].start();
}
subengine[0] = new Appl();
subengine[0].run();
Fig 3.7 Generate threads in a compute engine
Appl is the main class of the application to be run on the collaborative system. The
compute engine creates multiple threads by instantiating the class Appl. The start()
method causes the associated thread subengine[i] to run the code of Appl. The main
thread subengine[0] also instantiates Appl and runs the application code by calling the
run() method directly.
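The Appl class itself is application-specific. A minimal sketch, assuming Appl extends java.lang.Thread so that start() spawns a new thread while run() executes in the caller, exactly as Fig 3.7 requires:

```java
// Illustrative sketch of an application main class Appl; the computation
// body is a placeholder, and the use of availableProcessors() to size the
// thread group is an assumption for this self-contained example.
public class Appl extends Thread {
    @Override
    public void run() {
        // application code: fetch subtasks and compute (see Section 3.7)
    }

    public static void main(String[] args) throws InterruptedException {
        int processors = Runtime.getRuntime().availableProcessors();
        Appl[] subengine = new Appl[processors];
        for (int i = 1; i < processors; i++) {   // worker threads
            subengine[i] = new Appl();
            subengine[i].start();
        }
        subengine[0] = new Appl();
        subengine[0].run();                      // main thread runs in place
        for (int i = 1; i < processors; i++) subengine[i].join();
    }
}
```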
MOIDE-runtime is implemented on Blackdown JDK 1.2.2, which supports native threads.
Native threads are kernel-based threads that can be automatically scheduled by OS to run on
multiprocessors.
3.4 System Reconfiguration
MOIDE-runtime supports two types of system reconfiguration. One is system expansion
that adds extra compute engines to the collaborative system to enhance the computing power.
The other is host replacement that replaces the hosts with new hosts to improve the system
performance. The class ExpandEngine is specified for system expansion and the class
RecfgEngine for host replacement.
3.4.1 class ExpandEngine
The system expansion can be made in two directions: horizontal or vertical (see 2.2.4.1).
Fig 3.8 shows the run() method of the ExpandEngine class. To expand the system, the
compute coordinator or a compute engine calls the getHosts() method in class
StartEngine (see 3.3.1) to search for available hosts. Then the system expansion is
conducted by the method addEngine() in Fig 3.9.
public void run () {
    String [] newHost = (String []) StartEngine.getHosts();
    if (horizontal expansion)
        flag = horizontal;
    else /* vertical expansion */
        flag = vertical;
    engine = addEngine(newHost, engine, flag);   /* engine is the current registration table */
}
Fig 3.8 run() method in ExpandEngine class
In the arguments to the addEngine() method (Fig 3.9), newHost is the list of the
available hosts used for system expansion. oldEngine is the current registration table of
compute engines. flag indicates the horizontal or vertical expansion.
public Engine [] addEngine(String[] newHost, Engine[] oldEngine, boolean flag)
{
    Engine [] newEngine = (Engine []) createEngine(newHost);
    Engine [] engine = adjustEngine(oldEngine, newEngine, flag);
    invokeEngine(newEngine, task);
    if (flag == horizontal)
        broadcast engine[] to all compute engines;
    return engine;
}
Fig 3.9 addEngine() method in ExpandEngine class
The method createEngine() (see Fig 3.5 in 3.3.1) is called in addEngine() to
create a compute engine on each newly-selected host. The adjustEngine() method adds
the references of the new compute engines to the registration table. The registration table
needs to be updated only in horizontal system expansion. In vertical system expansion, the
new engine is totally under the control of its parent engine. It works as the assist-engine of
the parent compute engine to share its workload. Other compute engines as well as the
compute coordinator do not recognize the existence of the assist-engine. In the system-wide
view, the vertical expansion resembles adding more processors to the host where the parent
engine resides. Logically, no new compute engine is added to the HiCS.
Hence no change should be made to the registration table in vertical expansion.
3.4.2 class RecfgEngine
The host replacement is specified in the class RecfgEngine, which is executed by
the compute coordinator. The major methods in RecfgEngine are checkEngine() and
replaceEngine().
The method checkEngine() in Fig 3.10 calls the interface GetHostStatus() to
get the current states of the hosts from ClusterProbe. Then it decides the necessary host
replacement based on the states. If any host is overloaded, the method findHost() is
called to find a substitute host. The policy for the selection of substitute host is similar to that
for the host selection in the class StartEngine. If no appropriate substitute host can be
found, the host replacement will not continue.
public void checkEngine() {
    String rn = "CLUSTERPROBE.resources.Cluster_Workstation_CPUInfo";
    String server;   /* the server running ClusterProbe */
    int port = 7001;
    Vector hostList = (new GetHostStatus()).getHostStatus(rn, server, port);
    rn = "CLUSTERPROBE.resources.Cluster_Workstation_StatusTable";
    Vector hostStatus = (new GetHostStatus()).getHostStatus(rn, server, port);
    for engine[i] in engine[]
        if ( hostStatus[engine[i]].equals("Overload") )
            new_host[i] = findHost(hostList, hostStatus);
}
Fig 3.10 checkEngine() method in RecfgEngine class
After the host replacement has been decided in checkEngine(), the method
replaceEngine() in Fig 3.11 is called to create a compute engine on each new host and
transfer the computing task from the overloaded host to the compute engine on the new host.
The compute coordinator starts the creation of new compute engines on the substitute hosts
by calling the method createEngine() in Fig 3.5. Then the compute coordinator invokes each new
compute engine by calling the remote method replace() on that engine. The function of
replace() is similar to invoke() in Fig 3.6. However, the new compute engine obtains
the computing task from the compute engine on the replaced host rather than having it
assigned by the compute coordinator. Then the compute engines on the replaced hosts cease
to work. Their
references in the registration table are replaced by the references to the new compute engines.
Thus the collaborative system has been reconfigured with the new compute engines and all
remaining engines. The computation continues on the reconfigured system.
public void replaceEngine() {
    Engine [] new_engine = (Engine []) createEngine(replace_host);
    for new_engine[j] in new_engine[] {
        new_engine[j].replace(host[j]);
        engine[j] = new_engine[j];
    }
}
Fig 3.11 replaceEngine() method in RecfgEngine class
3.5 Unified Communication Interface
As discussed in 2.3.3, the threads in a hierarchical collaborative system can work in
independent mode. Each thread, called a pseudo-engine, works as a compute engine. All
pseudo-engines have equal positions in the hierarchical collaborative system no matter
whether they are in the same or different compute engines. A unified communication
interface is needed for
the communication between the pseudo-engines. The threads can communicate with each
other through the uniform interface at application level. The two-layer communication will
transparently complete the communication via shared-data access or remote messaging.
The runtime support system provides a unified communication interface. The following
exchDoubleArray() method is one of the methods provided in the interface. It specifies a
send-receive communication of double arrays between a pair of pseudo-engines:
static void exchDoubleArray ( int src, int dest,
                              double[] srcData, double[] destData,
                              int srcStart, int destStart,
                              int sendLength, int recvLength,
                              int status )

int src            ID of the pseudo-engine sending data;
int dest           ID of the pseudo-engine receiving data;
double[] srcData   buffer for sending data;
double[] destData  buffer for receiving data;
int srcStart       offset in the buffered sending data;
int destStart      offset in the buffered receiving data;
int sendLength     length of sending data;
int recvLength     length of receiving data;
int status         flag to identify the communication operation.
As exchDoubleArray() shows, only the IDs of the sender and receiver are required
when calling the method. The IDs are assigned to the pseudo-engines in StartEngine.
There is no indication about the locations of the sender and receiver in the method. The
runtime support system will decide the real communication path depending on their locations
found in the pseudo-engine location table (see Fig 3.13). If two pseudo-engines exist in the
same compute engine, the send-receive operation will be finished by direct data exchange between
the send and receive buffers in local compute engine. Otherwise, the sending data will be
transmitted to the receive buffer on the receiving pseudo-engine through remote method
invocation.
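The routing decision described above can be sketched as follows. The class name TwoLayerComm, the int[][] layout of the location table, and the simplified parameter list are illustrative assumptions; the remote-messaging branch is left as a stub since the actual RMI call is not specified here.

```java
// Sketch of the two-layer routing behind exchDoubleArray(), assuming a
// location table where loc[id] = {engineId, threadId} as in Fig 3.13.
// Same engine: copy between buffers (shared-data access). Different
// engines: a remote method invocation would carry the data (stubbed).
public class TwoLayerComm {
    static int[][] loc;   // pseudo-engine location table

    static boolean sameEngine(int src, int dest) {
        return loc[src][0] == loc[dest][0];
    }

    static void exchDoubleArray(int src, int dest,
                                double[] srcData, double[] destData,
                                int srcStart, int destStart, int length) {
        if (sameEngine(src, dest)) {
            // layer 1: shared-data access inside one compute engine
            System.arraycopy(srcData, srcStart, destData, destStart, length);
        } else {
            // layer 2: remote messaging via RMI, e.g.
            // engine[loc[dest][0]].recv(...) -- omitted in this sketch
        }
    }
}
```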
Fig 3.12 A hierarchical collaborative system with 12 pseudo-engines
(figure: the compute coordinator on SMP0 and compute engines 1–3 on SMP1–SMP3, each with a
shared-data area; pseudo-engines 0–3 run in the coordinator, 4–7 in engine 1, 8–9 in engine 2,
and 10–11 in engine 3)
Pseudo-engine   Compute Engine   Thread
      0                0            0
      1                0            1
      2                0            2
      3                0            3
      4                1            0
      5                1            1
      6                1            2
      7                1            3
      8                2            0
      9                2            1
     10                3            0
     11                3            1
Fig 3.13 The location table of pseudo-engines
The pseudo-engine location table is created in StartEngine (see 3.3.1). While
creating the compute engines, the compute coordinator records the locations of the pseudo-
engines in the table. The location includes the compute engine ID in which a pseudo-engine
exists and the thread ID of the pseudo-engine. Fig 3.12 shows an example of twelve pseudo-
engines on four SMP nodes. The pseudo-engines exist in four compute engines (including
compute coordinator). The location table is shown in Fig 3.13. For example, pseudo-engine 0
is the thread 0 in compute engine 0 (i.e., the compute coordinator) and pseudo-engine 11 is
the thread 1 in compute engine 3.
The location table is broadcast to all compute engines in invokeEngine()
(see 3.3.1). Consider two communication cases in Fig 3.12.
(1) Pseudo-engine 4 and 6 are both in compute engine 1 according to the location table.
The data communication between them will be finished directly by the shared-data
exchange through their own buffers.
(2) Pseudo-engine 2 and 8 reside in compute engine 0 and 2, respectively. They must use
remote messaging for communication.
3.6 Synchronization
The compute engines in the collaborative system execute their computing tasks
asynchronously. In a hierarchical collaborative system, multiple threads also run in parallel
inside each compute engine. Synchronization is needed among the group of threads and
among the compute engines to coordinate their computing progress. Fig 3.14 shows the
synchronization on two levels. The local synchronization is the synchronization among a
group of threads in the same compute engine. The global synchronization coordinates all
compute engines in the system. MOIDE-runtime implements two synchronization methods.
Fig 3.14 Execution flow on multiple threads with local and global synchronization
3.6.1 barrier() for Local Synchronization
A group of threads in cooperative mode share a computing task. The threads need to
synchronize their computing procedures at some point. For example, a compute engine may
exchange data with other compute engines. In cooperative mode, the data is the product of the
collaborative computation on the group of threads and the main thread is responsible for the
remote communication. The communication can be performed only after the group of threads
in a compute engine have generated the result data. Here local synchronization should be
imposed on the threads. The main thread is able to start the data communication only after the
group of threads have finished the preceding computation and reached the synchronization
point. Moreover, the other threads may need to wait for the data communication to finish
before proceeding to the following computation. Hence another local synchronization is
required at the end of the data communication.
MOIDE-runtime provides the local synchronization method barrier(). The
synchronization is controlled by a public integer Barlock accessible by local threads. The
threads each exclusively increment the value of Barlock when calling barrier(). The local
synchronization is accomplished when each thread has incremented Barlock once.
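A minimal sketch of such a counter-based barrier, assuming Java monitor locking (the thesis does not give barrier()'s implementation; the generation counter, which makes the barrier reusable, is an addition of this sketch):

```java
// Illustrative reusable barrier in the spirit of Section 3.6.1: each thread
// increments a shared counter ("Barlock") under the object monitor; the last
// arrival resets the counter, advances the generation, and wakes the rest.
public class LocalBarrier {
    private final int nThreads;
    private int barlock = 0;     // the shared counter
    private int generation = 0;  // episode number, for reuse

    public LocalBarrier(int nThreads) { this.nThreads = nThreads; }

    public synchronized void barrier() {
        int gen = generation;
        barlock++;
        if (barlock == nThreads) {   // last arrival releases the others
            barlock = 0;
            generation++;
            notifyAll();
        } else {
            while (gen == generation) {
                try { wait(); } catch (InterruptedException e) { /* retry */ }
            }
        }
    }
}
```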
3.6.2 remoteBarrier() for Global Synchronization
Global synchronization forces all compute engines to reach the synchronization point
before proceeding to the following operations. This is the global coordination on
collaborative system. It is required in cases such as system-wide collective data
communication and the synchronization of computing iterations on all compute engines. The global
synchronization is more complex than local synchronization. As the distributed compute
engines are shared-nothing objects, the global synchronization should be implemented by the
state transition on the compute engines. At the global synchronization point, a compute
engine enters a pre-defined state and waits there. The compute coordinator examines the
states of all compute engines by remote method invocation. If all of them have entered that
state, the compute coordinator signals them to transit into a new state and go on to execute
next operations.
public static void remoteBarrier(int status) {
    int my_state;
    boolean sync = false;
    barrier();               /* local synchronization */
    my_state = status;
    if (compute coordinator) {
        while (!sync)
            sync = (getRemoteStatus() == status);
        for ( each compute engine i )
            engine[i].remoteStart(new_state);
        my_state = new_state;
    } else {                 /* compute engines */
        while (my_state != new_state)
            Thread.yield();
    }
}
Fig 3.15 Global synchronization method remoteBarrier()
MOIDE-runtime provides the global synchronization method remoteBarrier(int
status) shown in Fig 3.15. The global synchronization includes the synchronization on
two levels. First, a local synchronization barrier() is called to synchronize the threads in
the same compute engine, because the state transition of the compute engine can be made only
after all local threads have reached the synchronization point. The compute coordinator keeps
on polling the states of remote compute engines until all compute engines have entered the
specific state status. The getRemoteStatus() method checks the states of remote
compute engines. When all compute engines have turned into the state status, the compute
coordinator informs all compute engines to continue the computation by setting new_state
on them via the remote method invocation remoteStart(). On the other side, the
compute engines wait in remoteBarrier() until the compute coordinator invokes them
to continue the computation.
3.7 Load Scheduling
Load scheduling is important to irregularly structured problems. Due to the irregular
computation patterns and dynamic features, it is difficult to measure the computing workload
and make a balanced task allocation in advance. Dynamic load scheduling is required to
balance the workload at run-time. The load scheduling schemes are tightly related to the
computation pattern of an application. MOIDE-runtime provides two dynamic load
scheduling methods for autonomous load scheduling on the two levels of hierarchical
collaborative system.
3.7.1 Autonomous Load Scheduling
An irregularly structured application is difficult to allocate evenly to the compute
engines before execution. Runtime task allocation is one of the dynamic load balancing
techniques, suitable for applications with light data-dependency. In runtime task
allocation, an application is divided into small pieces of computing tasks. At the beginning,
each compute engine is allocated one piece. Once it has finished that piece, the compute
engine fetches another. Therefore the
computation workload can be dynamically balanced on the compute engines. The task
allocation also depends on the computing power of the underlying hosts. The application is
partitioned into n task units, where

    n >> number of compute engines

If the computing power of host i is performance_i, the compute engine on it gets a computing
task whose workload is proportional to performance_i each time.
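The proportional sizing rule can be sketched as below. The class name TaskSizing, the grain parameter, and the rounding policy are assumptions for illustration; the thesis only states the proportionality.

```java
// Sketch of performance-proportional task sizing: engine i's fetch, out of
// `grain` task units per round, is proportional to its host's share of the
// total performance, with a minimum of one unit.
public class TaskSizing {
    static int chunkSize(double[] performance, int i, int grain) {
        double total = 0;
        for (double p : performance) total += p;
        return Math.max(1, (int) Math.round(grain * performance[i] / total));
    }
}
```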
The computation on the collaborative system is highly asynchronous. The compute engines
execute their computing tasks independently except for the necessary communication and
synchronization. In principle, the compute coordinator should be the global load scheduler.
With the one-sided communication feature of MOIDE model (see 2.1), however, a compute
engine is able to do the runtime load scheduling by itself. It can directly get a task from the
global task pool in the compute coordinator. Therefore the load scheduling can be performed
without a global load scheduler. This self-conducted dynamic load scheduling approach is
called autonomous load scheduling.
Generally the load scheduling in a hierarchical collaborative system happens on two
levels: the task allocation to the compute engines, and the workload sharing among the group
of threads inside compute engine. In cooperative mode, the threads in a compute engine share
the computing task. Fig 3.16 shows the autonomous load scheduling on the two levels. The
main thread in a compute engine fetches a task from the global task pool and puts it into the
local subtask queue. The size of the task is proportional to the computing power of the compute
engine. The local threads get the subtasks from the local subtask queue. In independent mode, a
thread works individually as a compute engine. Each thread can fetch its computing task
directly from the global task pool.
Fig 3.16 Two-level autonomous load scheduling
(figure: the compute coordinator holds the global task pool; the multithreads in each compute
engine draw subtasks from a local subtask queue)
3.7.2 getTask() and getSubtask() Methods
MOIDE-runtime provides two methods for autonomous load scheduling. The
getTask() method is executed by a thread to fetch a task from the global task pool. It is
usually accomplished by remote method invocation. The getSubtask() method is called
by a thread to get a subtask from local subtask queue.
The representation of workload varies across applications. An abstract unit, the task
unit, is used in the load scheduling methods here. Specific meanings can be given to the task
unit in different applications.
(1) getTask()
A pointer to the global task pool is defined in the compute coordinator. The pointer
indicates the next computing task in the global task pool. A compute engine calls the
getTask() method on the compute coordinator to fetch a task. The method returns a task
to the compute engine. The task can be divided into equal-size subtasks. The subtasks are
stored in the subtask queue that is indicated by a local pointer.
(2) getSubtask()
The getSubtask() method is called by a thread to get a subtask from the local subtask
queue. A thread gets a subtask by referring to the pointer of the queue.
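The pointer-based pool behind getTask() might look like the sketch below. In MOIDE-runtime getTask() is a remote method on the compute coordinator; here it is shown as a local synchronized method, and the class name, the size parameter, and the [start, end) return convention are assumptions.

```java
// Sketch of a global task pool with an allocation pointer: each call hands
// out the next `size` task units and advances the pointer; null signals
// that the pool is exhausted.
public class TaskPool {
    private final int nUnits;   // total task units in the pool
    private int next = 0;       // pointer to the next unallocated unit

    public TaskPool(int nUnits) { this.nUnits = nUnits; }

    // returns {start, end} of the allocated range, or null when exhausted
    public synchronized int[] getTask(int size) {
        if (next >= nUnits) return null;
        int start = next;
        next = Math.min(nUnits, next + size);
        return new int[] { start, next };
    }
}
```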
3.8 System Termination
public void ceaseEngine(String [] hosts) {
    for (host[i] in hosts[]) {
        String command = "/usr/bin/rsh " + host[i] + " cease_engine";
        Runtime.getRuntime().exec(command);
    }
    finalize();
}
Fig 3.17 ceaseEngine()method
When the execution of an application has finished, the collaborative system should be
terminated. The system termination is controlled by the compute coordinator. Fig 3.17 shows
the system termination method ceaseEngine(). The compute coordinator ceases all
compute engines by running the cease_engine script on the remote hosts. The script
stops the object of the compute engine. Eventually the compute coordinator terminates itself
by the Java API method finalize().
3.9 Summary
In this chapter, the runtime support system MOIDE-runtime is described in detail.
MOIDE-runtime specifies the main classes and methods to support the development and
execution of applications in MOIDE model. It defines the fundamental classes of compute
coordinator and compute engine. It provides the classes to implement the collaborative
system infrastructure including system creation, reconfiguration, termination and
synchronization. It implements the two-layer communication mechanism and dynamic load
scheduling methods.
MOIDE-runtime is used to implement the irregularly structured applications in the
following chapters. StartEngine is called to establish the hierarchical collaborative
system. The synchronization methods barrier() and remoteBarrier() are used in
the applications to enforce local and global synchronization. The ray tracing method in
chapter 5 uses the autonomous load scheduling methods getTask() and getSubtask().
The CG and radix sort methods in chapter 6 call the unified communication interface to
utilize the two-layer communication mechanism to achieve efficient collective
communications. The performance tests of those applications will demonstrate the usability
of the MOIDE runtime support system.
Chapter 4
Distributed N-body Method in MOIDE Model
The N-body problem is a typical irregularly structured problem. It simulates the evolution of
a physical system containing numerous bodies. The bodies exert forces on each other, and
these force influences sustain the continuous motion of the bodies in space. The
computation of the force influences is computation-intensive. When solved by a distributed
method, the N-body problem is also communication-intensive due to the globally tight
data-dependency in the force computation. This chapter presents a distributed N-body
method based on MOIDE model with the emphasis on the distributed tree structure designed
as the communication-efficient scheme for the data access in the computation.
4.1 Overview
The major operation in the N-body problem is the computation of the force influences on
every body. The body motion is the consequence of the force influences. The accumulation of
the force influences determines the velocity of each body and therefore the evolution of the
whole physical system. The straightforward method of force computation is to calculate the
force influence between each pair of bodies. This pair-wise calculation has time complexity
N·(N−1) = O(N²). Since hundreds of thousands of bodies may be involved in the
computation, the time cost is prohibitive in practice.
Time-efficient methods have been proposed [14-16] to reduce the high complexity. In
these methods, approximation is made in the force computation by grouping the bodies and
calculating the force influences based on body groups. The approximation relies on the fact
that force influences decrease with distance. The farther bodies impose lower force
influences on a body, so a body requires less information, at a lower frequency, from the
bodies that are farther away. If a body is far enough from a group of bodies, the force
influences from the group of bodies can be approximated by the force influence from a single
virtual body, the center of mass, which represents the group. The accumulated
force influence from the group of bodies can then be computed in one calculation of the force
influence from the center of mass. Thus the time complexity is greatly reduced.
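The center of mass of a group is its mass-weighted mean position. A minimal 2D sketch (the Body record here is illustrative, not the thesis's actual data structure):

```java
// Illustrative center-of-mass computation: a group of bodies is replaced by
// one virtual body whose mass is the total mass and whose position is the
// mass-weighted mean of the group's positions.
public class CenterOfMass {
    static class Body {
        double mass, x, y;
        Body(double mass, double x, double y) {
            this.mass = mass; this.x = x; this.y = y;
        }
    }

    static Body centerOfMass(Body[] group) {
        double m = 0, cx = 0, cy = 0;
        for (Body b : group) {
            m += b.mass;
            cx += b.mass * b.x;
            cy += b.mass * b.y;
        }
        return new Body(m, cx / m, cy / m);
    }
}
```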
The N-body methods with body grouping are usually tree-based hierarchical algorithms
in which a tree structure is used to represent the body distribution in physical space [16]. The
tree is constructed based on space decomposition and used in the force computation. The
Barnes-Hut method [14] is a well-known hierarchical N-body algorithm. In the Barnes-Hut method, a
physical space is recursively divided into sub-domains until each sub-domain contains at
most one body. The space decomposition is based on the spatial distribution of the bodies.
Fig. 4.1(a) gives an example of space decomposition in 2D space. At first, the space is
equally divided into four sub-domains. If there is more than one body in a sub-domain, the
sub-domain will be further decomposed into four smaller sub-domains. The Barnes-Hut tree is
built based on the space decomposition as Fig. 4.1(b) shows. This is a quadtree for 2D space.
(a) Space decomposition (b) Barnes-Hut tree
Fig. 4.1 Barnes-Hut tree for 2D space decomposition
For 3D space, cubical space decomposition is used. A sub-domain having more
than one body will be partitioned into eight sub-domains. The Barnes-Hut tree for 3D space
is an octree.
In the tree structure, the bodies reside on the leaves. An inner cell in the tree represents
the center of mass of the bodies beneath it. The force computation is performed by traversing the
tree. The simulation procedure of the N-body problem proceeds in iterations. The Barnes-Hut tree
is built at the beginning of each simulation loop. Every body traverses the tree, starting from
the root, and computes the force influences from the other bodies during the traversal. The body
compares its distance to each cell encountered on the path of traversal. If the body is far
enough from a cell, no further traversal is made beneath the cell. The force
influences from the bodies below the cell are computed as the force influence from the
cell, i.e., its center of mass, on the body. Otherwise, the body proceeds to traverse
the children of the cell. After the force computation, each body updates its position in the
space as the effect of force influences. That ends one simulation loop. The tree should be
rebuilt at the beginning of next iteration to reflect the new body distribution in space.
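The traversal decision above can be sketched as follows. The "far enough" test used here is the standard opening criterion s/d < θ (cell side length over distance below an accuracy parameter θ); the thesis does not state its exact test in this section, so the criterion, the toy 1D force m/d², and all names are assumptions of this sketch.

```java
// Illustrative Barnes-Hut traversal: use a cell's center of mass when the
// body is far enough (size/distance < THETA); otherwise open the cell and
// recurse into its children. 1D positions keep the sketch short.
public class BhTraversal {
    static final double THETA = 0.5;   // assumed accuracy parameter

    static class Cell {
        double mass, x, size;
        Cell[] children;               // null for a leaf (a single body)
        Cell(double mass, double x, double size, Cell[] children) {
            this.mass = mass; this.x = x; this.size = size; this.children = children;
        }
    }

    // accumulate a toy "force" m/d^2 on a body at position bx
    static double force(Cell c, double bx) {
        double d = Math.abs(c.x - bx);
        if (c.children == null || c.size / d < THETA) {
            return c.mass / (d * d);   // leaf, or far enough: one interaction
        }
        double f = 0;
        for (Cell child : c.children) f += force(child, bx);
        return f;
    }
}
```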
Parallel N-body methods [7,17,18] can be derived from the sequential Barnes-Hut
method. For example, Singh presented a parallel N-body method based on shared-address-
space model [18,61]. In Singh’s method, multiple processes cooperatively build a global
Barnes-Hut tree in the shared-memory section. All processes can access the tree in parallel.
Each process computes the force influences on a subset of bodies by traversing the global tree.
The shared-address-space method is applicable on shared-memory system like SMP
machines. In distributed memory systems, however, message passing is the general
communication methodology [17]. The global shared tree approach is unsuitable for
computation on a distributed system. If the global tree scheme were used, each process on a
different machine would have to duplicate the tree for local use, so the bodies
would have to be broadcast to all processes in each iteration. Inevitably the broadcast would cause
heavy communication overhead. One solution to this problem is to decompose the global tree
into subtrees that can be distributed to the processes. Each process obtains a body subset and
builds a subtree for it. It performs the force computation on the subset of the bodies. All
subtrees constitute a distributed tree structure.
4.2 Distributed N-body Method
A distributed N-body method is designed based on the MOIDE model. The basic
algorithm of the method comes from the Barnes-Hut method. However, the Barnes-Hut tree is
replaced by a distributed tree structure. Along with the distributed tree structure, a partial
subtree scheme is used for sharing the data of the subtrees among distributed compute
engines.
4.2.1 Distributed Tree Structure
The distributed tree structure is built by space decomposition. In the following text, the
distributed tree structure on the basic collaborative system is introduced first, where all
compute engines are single-threading objects. Then the basic distributed tree structure is
modified into the distributed shared tree structure for the computation on a hierarchical
collaborative system.
4.2.1.1 Space Decomposition
Given an N-body problem of a physical space with N bodies, to be processed on P
processors. To run on a basic collaborative system, the physical space should be partitioned
into P sub-spaces and thus the N bodies into P subsets. Each sub-space contains N/P bodies
on average and is allocated to one processor.
The distributed N-body method is designed based on Singh’s method. As explained for
Singh’s partitioning method in [18], the Barnes-Hut algorithm has a representation of the
spatial distribution encoded in its tree data structure. The bodies are inserted into the tree
according to the following scheme [4]:
Body insertion scheme
First, convert the floating-point spatial coordinate into an integer. In 2D space, the
coordinate pos[i] of a body is converted to the integer

    x[i] = IMAX · (pos[i] − rmin[i]) / rsize,  where i = 0, 1;

IMAX is a pre-defined integer; rmin[i] is the coordinate of the left corner of the space;
rsize is the integral side-length of the space.
Then, select the subcell from the root of the tree into which the body is inserted, i.e.,
find the path leading to the subcell of the tree beneath which the body is inserted. The index
i of the subcell is determined by the following code:
i = 0; yes = false;
if ((x[0] & level) != 0) {                /* level = the level of current cell */
    i += NSUB >> 1;                       /* NSUB = subcells per cell */
    yes = true;
}
for (k = 1; k < NDIM; k++) {              /* NDIM = number of dimensions */
    if ((((x[k] & level) != 0) && !yes)
            || (((x[k] & level) == 0) && yes)) {
        i += NSUB >> (k + 1);
        yes = true;
    }
    else
        yes = false;
}
The body will be inserted into the ith subcell of the current cell. The insertion is a
recursive procedure that continues until an empty leaf is found to hold the body.
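The subcell selection above can be cross-checked with a small self-contained sketch in Java, the implementation language of MOIDE. This is only an illustrative mirror of the subindex logic in the SPLASH-2 barnes code, not the thesis implementation; the class name, the 2D constants, and the test bit are our own choices.

```java
// Illustrative Java mirror of the subcell-selection logic (2D case).
public class SubcellIndex {
    static final int NSUB = 4;   // subcells per cell in a 2D quadtree
    static final int NDIM = 2;   // number of dimensions

    // Index of the subcell that integerized coordinate x falls into
    // at the tree level whose test bit is 'level'.
    static int subIndex(int[] x, int level) {
        int i = 0;
        boolean yes = false;
        if ((x[0] & level) != 0) {
            i += NSUB >> 1;
            yes = true;
        }
        for (int k = 1; k < NDIM; k++) {
            // XOR-like test produces the Gray-code child ordering
            // that underlies the Peano-Hilbert traversal.
            if ((((x[k] & level) != 0) && !yes)
                    || (((x[k] & level) == 0) && yes)) {
                i += NSUB >> (k + 1);
                yes = true;
            } else {
                yes = false;
            }
        }
        return i;
    }

    public static void main(String[] args) {
        int level = 4;  // test bit for the current level (assumed)
        // The four quadrants are visited in Gray-code order 0..3.
        System.out.println(subIndex(new int[]{0, 0}, level)); // 0
        System.out.println(subIndex(new int[]{0, 4}, level)); // 1
        System.out.println(subIndex(new int[]{4, 4}, level)); // 2
        System.out.println(subIndex(new int[]{4, 0}, level)); // 3
    }
}
```

Note that the four children of a cell are visited in a Gray-code order rather than plain Morton order, which is what yields the Peano-Hilbert-compatible traversal discussed below.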
The body insertion scheme and the tree construction algorithm of the Barnes-Hut tree
guarantee the Peano-Hilbert ordering of the bodies in an in-order traversal of the tree
[18,61,73]. The space decomposition can therefore be made by partitioning the global
Barnes-Hut tree, which preserves the spatial locality of the bodies in the sub-spaces. At the
beginning of the simulation in our N-body method, the compute coordinator is responsible
for generating the N bodies and decomposing them into P subsets. The space decomposition
algorithm is as follows.
Space decomposition algorithm
Build a global Barnes-Hut tree containing all N bodies. Starting from the root, do
in-order traversal on the tree and examine the number of leaves (i.e., the bodies)
beneath each cell encountered on the way of traversal. If the number of leaves under
a cell is less than N/P, put these leaves into the current subset of bodies. Otherwise
continue the traversal to the descendents of the cell. The leaves under the same direct
parent cell should be put into the same subset of bodies because these are
neighboring bodies in a sub-domain (a 2×2 grid in 2D space or a 2×2×2 grid in 3D
space). A subset is full when its size would exceed the upper bound if more leaves were
added to it. For 2D space, the upper bound of the subset size is N/P + 2. For 3D space,
the upper bound is N/P + 4. Thus one sub-space has been obtained which
encloses all bodies in the subset. The leaves encountered in the following traversal
will be put into a new subset to form another sub-space. When the traversal has
finished on the whole tree, all leaves in the tree have been partitioned into P subsets
and the bodies in one subset constitute a sub-space.
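The bookkeeping of the algorithm above can be sketched as follows, assuming for simplicity that the leaves already arrive in in-order (Peano-Hilbert) traversal order, grouped by their direct parent cell, and that the space is 2D so the bound is N/P + 2. Class and method names are illustrative, not from the thesis code.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the subset-filling step of the space decomposition.
// Leaves arrive in in-order (Peano-Hilbert) traversal order, grouped
// by their direct parent cell; each sibling group is kept whole.
public class SpaceDecomposition {
    static List<List<Integer>> partition(List<int[]> siblingGroups,
                                         int n, int p) {
        int bound = n / p + 2;                 // 2D upper bound N/P + 2
        List<List<Integer>> subsets = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        for (int[] group : siblingGroups) {
            // close the current subset if this group would overflow it
            if (!current.isEmpty()
                    && current.size() + group.length > bound) {
                subsets.add(current);
                current = new ArrayList<>();
            }
            for (int body : group) current.add(body);
        }
        if (!current.isEmpty()) subsets.add(current);
        return subsets;
    }

    public static void main(String[] args) {
        // 48 bodies arriving in sibling groups of 4, decomposed for P = 4
        List<int[]> groups = new ArrayList<>();
        for (int b = 0; b < 48; b += 4)
            groups.add(new int[]{b, b + 1, b + 2, b + 3});
        List<List<Integer>> subsets = partition(groups, 48, 4);
        System.out.println(subsets.size());         // 4 subsets
        System.out.println(subsets.get(0).size());  // 12 bodies each
    }
}
```

Because whole sibling groups are moved together, a subset may slightly exceed N/P, which is exactly why the algorithm allows the bound N/P + 2.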
The space decomposition algorithm generates the sub-spaces that enclose the neighboring
bodies. If running the N-body problem on four processors, the leaves in that tree should be
partitioned into four subsets. With the space decomposition algorithm, the physical space in
Fig. 4.1(a) can be partitioned into four sub-spaces as shown in Fig. 4.2(a), by partitioning the
leaves of the Barnes-Hut tree into four subsets as Fig 4.2(b) shows. The traversal visits the
leaves of the tree from left to right, which corresponds to the Peano-Hilbert traversal of the
bodies in the space, starting from the lower-left corner, as Fig 4.2(a) shows.
(a) Space decomposition (b) Four body subsets generated from partitioning the tree
(c) Distributed tree structure
Fig 4.2. Space decomposition and distributed tree structure on four processors
The three-dimensional N-body problem is handled in a similar way, except that the Barnes-
Hut tree is an octree. The code of the barnes application in SPLASH-2 [53,74] assures the
Peano-Hilbert ordering of the bodies in the Barnes-Hut tree.
4.2.1.2 Basic Distributed Tree Structure
The sub-spaces produced by the space decomposition are allocated to the compute
engines. Each compute engine will create a subtree for the sub-space it received. Thus a
distributed data structure is formed with P subtrees that are distributed on all compute
engines. For the example in Fig 4.2, the 48 bodies in the space are partitioned into four
subsets with 12, 12, 13, and 11 bodies respectively. A subset of bodies will be assigned to one
compute engine where a subtree will be built. The distributed tree structure is composed of
four subtrees as Fig. 4.2(c) shows.
4.2.1.3 Partial Subtree
Each compute engine carries out the force computation for the subset of bodies. In the
computation, the data on remote subtrees are needed to compute the force influences from
other sub-spaces. In our N-body method, a partial subtree scheme is designed to provide the
data of remote subtrees to each compute engine. This is a communication-efficient solution
for data sharing among all compute engines.
Considering the approximation made in the force computation of the Barnes-Hut method,
an inner cell in a subtree stores the center of mass that represents the total force influence of
the bodies beneath it, and this approximation may be applied to a remote body provided the
group of bodies is far enough away, i.e., the following distance condition is satisfied:

l / d < θ                                                        (4-1)

where l is the side length of the space domain represented by the cell; d is the distance
from the cell to the body; θ is a user-defined accuracy parameter, usually between 0.4 and 1.2
(1.0 is used in our method).
To satisfy the remote data access, a tradeoff is made to provide remote data sharing at
lower communication cost. Each compute engine builds a partial subtree, which is a
fraction of its subtree, for each remote compute engine. In total it builds (P−1) different
partial subtrees if there are P compute engines. The construction of a partial subtree is based
on the distance between two sub-spaces. A partial subtree is built by the following procedure.
Partial subtree construction
Assume two compute engines i and j. After building the subtree, compute engine
i creates the partial subtree ptreej for j, by traversing the subtree. The cell
encountered in the traversal is included into ptreej and the distance between the cell
and the root of the subtree on j (i.e., the center of mass of the sub-space on j) is
calculated. If it satisfies the distance condition (4-1), the traversal will not proceed to
the part beneath the cell. Otherwise, the traversal will continue to the children of the
cell. The partial subtree ptreej has been created when the traversal is completed.
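A minimal sketch of this pruned traversal, with a synthetic 2D cell geometry and illustrative names, might look like the following. Only the pruning rule, stopping the descent once l/d < θ already holds against the remote center of mass, is taken from the procedure above.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the pruned copy behind partial subtree construction.
// Cell geometry and names are illustrative, not from the thesis code.
public class PartialSubtree {
    static class Cell {
        double side;  // side length l of the cell's space domain
        double x, y;  // center of mass of the cell
        List<Cell> children = new ArrayList<>();
        Cell(double side, double x, double y) {
            this.side = side; this.x = x; this.y = y;
        }
    }

    // Pruned copy of 'cell' for a remote center of mass (rx, ry):
    // stop descending once l/d < theta already holds for a cell.
    static Cell build(Cell cell, double rx, double ry, double theta) {
        Cell copy = new Cell(cell.side, cell.x, cell.y);
        double d = Math.hypot(cell.x - rx, cell.y - ry);
        if (cell.side / d < theta) return copy;  // far enough: keep cell only
        for (Cell child : cell.children)         // otherwise open the cell
            copy.children.add(build(child, rx, ry, theta));
        return copy;
    }

    static int count(Cell c) {
        int n = 1;
        for (Cell child : c.children) n += count(child);
        return n;
    }

    public static void main(String[] args) {
        Cell root = new Cell(8, 4, 4);
        root.children.add(new Cell(4, 2, 2));
        root.children.add(new Cell(4, 6, 6));
        // Distant remote sub-space (center of mass at (20, 4)):
        // l/d = 8/16 < 1.0, so only the root cell is included.
        System.out.println(count(build(root, 20, 4, 1.0)));  // 1
        // Nearer remote sub-space at (10, 4): the root is opened.
        System.out.println(count(build(root, 10, 4, 1.0)));  // 3
    }
}
```

The two calls in main illustrate the adaptivity noted later in this section: the farther the remote sub-space, the smaller the partial subtree.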
A partial subtree is dedicated to the target compute engine. It is created based on the
distance from the cells in local sub-space to the other sub-space. The partial subtree will be
sent to the target compute engine to provide the data for the force computation there. Fig. 4.3
shows an example of the partial subtrees generated from subtree B in Fig 4.2. Three partial
subtrees are derived from subtree B. The outermost ring on subtree B encloses the partial
subtree for A. The partial subtree in the dashed ring is built for C, and the one in the inner
solid ring is the partial subtree for D.
Fig 4.3 Partial subtrees built from subtree B
Then the partial subtrees will be scattered to the remote compute engines. A compute
engine carries out the force computation by traversing its own subtree and all partial subtrees
it has received. In case the force computation on a body needs data not included in a partial
subtree, the body will be sent to the corresponding compute engine to access the complete
subtree there. The partial subtree is adaptive to the body distribution in the sub-spaces. It is
constructed based on the distance between two sub-spaces. The farther apart the sub-
spaces, the fewer cells will be included in the partial subtree. When two sub-spaces are far
enough apart, they only need to exchange the roots of their subtrees. Moreover, the cells in a
partial subtree satisfy the distance condition (in respect to the center of mass) for the force
computation. The partial subtree provides most of the data required in the force computation.
Therefore the partial subtree scheme is a communication-efficient method for the data sharing
of subtrees. The runtime test in 4.3.1 will verify this advantage.
4.2.1.4 Distributed Shared Tree Structure
In the hierarchical collaborative system, the group of threads in a compute engine can
cooperatively perform the force computation for a subset of bodies. The subtree and partial
subtrees on a multithreading compute engine are accessed by the group of threads. The subset
of bodies assigned to a compute engine should be proportional to its computing power. Hence
a space may be decomposed into sub-spaces of unequal size. The subtrees built on HiCS
form a distributed shared tree structure, because the subtrees and partial subtrees will be
shared among a group of threads.
Assume the computing power of a compute engine is decided by the processors in the
underlying host, and the N-body problem is processed on m hosts with P0, P1, …, Pm−1
processors each, where P = P0 + P1 + … + Pm−1. The space decomposition algorithm on
HiCS should be:
In the in-order traversal on the global Barnes-Hut tree, examine the number of
leaves (i.e., the bodies) beneath each cell on the path. If the number of leaves under a
cell is less than PiN/P, put these leaves into the ith subset of bodies, which is currently
under construction. The subset will be assigned to compute engine i on Pi processors.
Otherwise continue the traversal to the descendents of the cell. The leaves under the
same parent cell should be put into the same subset of bodies because these are
neighboring bodies (a 2×2 grid in 2D space or a 2×2×2 grid in 3D space). The ith
subset of bodies is full when its size would exceed the upper bound if any more leaves
were added to the subset. For 2D space, the upper bound of the subset size is
Pi·N/P + 2. For 3D space, the upper bound is Pi·N/P + 4. The leaves encountered
in the following traversal will be put into a new subset. When the in-order traversal
has finished on the tree, the leaves are partitioned into m subsets and the space has
been decomposed into m sub-spaces, one per compute engine.
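The per-engine quota implied by this algorithm, about Pi·N/P bodies for compute engine i, can be sketched as follows; the class and method names are illustrative.

```java
// Sketch of the workload quota implied by the HiCS decomposition:
// compute engine i on a host with Pi processors receives about
// Pi * N / P bodies, where P is the total processor count.
public class HicsQuota {
    static int[] quotas(int n, int[] procsPerHost) {
        int p = 0;
        for (int pi : procsPerHost) p += pi;   // P = P0 + ... + Pm-1
        int[] q = new int[procsPerHost.length];
        for (int i = 0; i < q.length; i++)
            q[i] = procsPerHost[i] * n / p;    // quota Pi * N / P
        return q;
    }

    public static void main(String[] args) {
        // 48 bodies on two quad-processor and two dual-processor
        // nodes: quotas of 16, 16, 8 and 8 bodies.
        int[] q = quotas(48, new int[]{4, 4, 2, 2});
        System.out.println(q[0] + " " + q[2]);  // 16 8
    }
}
```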
Consider the same N-body problem as in Fig 4.1(a) running on four heterogeneous SMP
nodes: two quad-processor nodes and two dual-processor nodes. The space decomposition
will generate four sub-spaces. The number of bodies in each sub-space is proportional to the
computing power of the target compute engine. As there are totally 48 bodies in the whole
space, the sub-space contains 16 bodies for a quad-processor node and 8 bodies for a dual-
processor node as shown in Fig. 4.4(a) and (b). The distributed shared tree structure contains
four subtrees as Fig. 4.4(c) shows. The partial subtrees in the distributed shared tree structure
are constructed in the same way, based on the distance between sub-spaces, as in the basic
distributed tree structure.
(a) Space decomposition (b) Four body subsets generated from partitioning the tree
(c) Distributed shared tree structure
Fig 4.4 Space decomposition and the subtrees on a four-SMP cluster
Compared with the basic distributed tree structure, higher data locality can be achieved in
the distributed shared tree structure on HiCS because the sub-spaces are larger on
multithreading compute engines residing on SMP nodes. The higher data locality can further
reduce the communication cost of solving the N-body problem.
4.2.2 Computing Procedure
The computing procedure of the N-body problem proceeds by iterating the simulation loop.
The computing procedure starts on all compute engines when the compute coordinator has
made the space decomposition and allocated the subsets of bodies to them. The compute
coordinator synchronizes the simulation loop. In each loop, the force influences on every
body are computed and the velocity and direction of the body’s motion are updated based on
the force influences, and then the bodies move to their new positions.
4.2.2.1 Simulation Loop
A simulation loop contains three steps.
Step 1: Subtree construction and partial subtree propagation
Each compute engine builds a subtree for the subset of bodies allocated to it. The root
information of each subtree, i.e., the center of mass in the sub-space, is broadcasted to all
compute engines. Partial subtrees are constructed from the subtree based on the root
information received. The partial subtrees are sent to other compute engines.
Step 2: Force computation
Each compute engine computes the force influences on every body in the subset by
traversing the local subtree and all partial subtrees. If the force calculation on a body
requires more information than a partial subtree can provide, the body will be sent to the
remote compute engine and traverse the complete subtree there to compute the force
influence from that sub-space. Then the result will be sent back to the source compute
engine. A body may be sent to more than one compute engine. The force influences on
each body from all sub-spaces are accumulated to get the total force influence on it. Finally
each body changes its state, including the velocity and the position, as the result of force
influence.
Step 3: Body redistribution
At the end of step 2, the new position of every body has been decided. Some bodies
may cross the border of the sub-space and enter a neighbor sub-space. To keep the body
locality in sub-space, the bodies moving to other sub-spaces should be transmitted to the
compute engine where the destination sub-space is located. Since the bodies advance in small
steps, only a few bodies move to other sub-spaces in each simulation loop.
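Steps 2 and 3 can be illustrated with a minimal one-dimensional sketch. The thesis does not specify the integrator, so a plain Euler step with unit mass and time step dt is assumed here; all names are illustrative.

```java
// Sketch of Step 2's per-body state update and Step 3's border test.
// The integrator is an assumption: a plain Euler step, unit mass, 1D.
public class BodyAdvance {
    static class Body {
        double x, vx;  // position and velocity
    }

    static void advance(Body b, double fx, double dt) {
        b.vx += fx * dt;   // velocity update from the accumulated force
        b.x += b.vx * dt;  // move the body to its new position
    }

    // Step 3: has the body left the sub-space [lo, hi)?
    static boolean leavesSubspace(Body b, double lo, double hi) {
        return b.x < lo || b.x >= hi;
    }

    public static void main(String[] args) {
        Body b = new Body();
        b.x = 0.9;
        advance(b, 1.0, 0.5);                            // vx = 0.5, x = 1.15
        System.out.println(leavesSubspace(b, 0.0, 1.0)); // true: redistribute
    }
}
```

A body for which leavesSubspace returns true is exactly one that Step 3 would transmit to the compute engine owning the neighboring sub-space.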
4.2.2.2 Execution Flow
On HiCS, the main thread in each compute engine builds the subtree and partial subtrees.
The force computation on the subset of bodies is shared among the group of threads in a
compute engine. All communications between the compute engines are carried out between
the main threads. Fig 4.5 is the execution flow of the distributed N-body method on HiCS.
for all compute engines
do
repeat
/* by main thread */
1. build subtree;
2. broadcast the root information;
3. build partial subtrees;
4. scatter partial subtrees;
/* by all threads */
for all threads
do
5. for each body processed by the thread
do
compute force influences;
if (the body needs to access any remote subtree)
insert the body into a send buffer;
end for
end for
/* by main thread */
6. send the bodies in send buffers to remote compute engines;
/* by all threads */
7. for all bodies from remote compute engines
traverse local subtree to compute force influence;
/* by main thread */
8. send the bodies back;
/* by all threads */
9. for all local bodies
do
advance bodies;
end for
/* by main thread */
10. for all bodies moving into other sub-spaces
do
transmit the bodies to destination compute engines;
end for
end repeat
end for
Fig 4.5 Execution flow of the distributed N-body method on hierarchical collaborative system
4.2.3 Load Balancing Strategy
All bodies change their positions in each simulation loop. Although each body moves only a
small distance per loop, the continuous motion will eventually cause a highly imbalanced
body distribution in the sub-spaces. That results in workload imbalance in the force
computation among the compute engines. In this case, load balancing is required to rebalance
the bodies among the sub-spaces.
The load balancing strategy in the distributed N-body method is to perform space re-
decomposition. The physical space is decomposed again based on the new spatial distribution
of the bodies. As described in section 4.2.1.1, the space decomposition algorithm can
generate sub-spaces that have an approximately equal number of bodies and preserve the
body locality. This is exactly the goal of load balancing, so the same space decomposition
algorithm can be used for it.
The space re-decomposition is conducted by the compute coordinator. When load
balancing is required, the compute coordinator collects all bodies from the compute engines
and performs the space decomposition for the bodies. The new sub-spaces are assigned to the
compute engines again as the task allocation at the beginning of the simulation. Then the
simulation proceeds on the new sub-spaces. The space re-decomposition can balance the
workload at reasonable cost. The runtime test in 4.3.1 shows that load balancing strategy
based on space re-decomposition can improve the performance of the N-body simulation.
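The thesis leaves the exact trigger for re-decomposition open; one plausible sketch, purely an assumption on our part, requests re-decomposition when the largest body subset exceeds the average subset size by a tolerance factor.

```java
// Hypothetical trigger for space re-decomposition: re-decompose when
// the largest body subset exceeds the average subset size by a
// tolerance factor. The condition is our assumption; the text only
// states that re-decomposition is invoked upon high imbalance.
public class LoadBalanceTrigger {
    static boolean needsRedecomposition(int[] subsetSizes,
                                        double tolerance) {
        int total = 0, max = 0;
        for (int s : subsetSizes) {
            total += s;
            if (s > max) max = s;
        }
        double avg = (double) total / subsetSizes.length;
        return max > tolerance * avg;
    }

    public static void main(String[] args) {
        // Near-balanced subsets of 12, 12, 13 and 11 bodies: no action.
        System.out.println(
            needsRedecomposition(new int[]{12, 12, 13, 11}, 1.25)); // false
        // After drift, 24, 10, 8 and 6 bodies: re-decompose.
        System.out.println(
            needsRedecomposition(new int[]{24, 10, 8, 6}, 1.25));   // true
    }
}
```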
4.3 Runtime Tests and Performance Analysis
The distributed N-body method has been tested on homogeneous and heterogeneous
clusters. The hosts are off-the-shelf Pentium III-based SMP machines and PCs. The OS is Red
Hat Linux 6.0 with kernel 2.2.12 and the JDK is Blackdown JDK 1.2.2. The hosts are linked by
a Fast Ethernet switch. The distributed N-body method simulates a galaxy in the Plummer
model [19] with a collection of 10,240 to 102,400 particles. The performance is reported in
execution time, speedup, computation-to-communication ratio and other metrics.
4.3.1 Tests on Homogeneous Hosts
First, the N-body method is tested on a cluster of four SMP machines. Each machine has
four processors. The cluster provides sixteen processors in total. One SMP machine is used in
the cases of one to four processors. Then two machines are used from five to eight processors,
and so on. The method is also executed on a cluster of PCs to test its behavior on multiple
single-processor machines.
1. Execution time and speedup on cluster of SMPs
Fig 4.6 Execution time of the N-body method on four quad-processor SMP machines
Fig 4.6 displays the execution time of the N-body method under different problem sizes N.
The execution time decreases in all cases as the number of processors increases, and
speedups are obtained in all cases as Fig 4.7 shows. The largest performance improvement
occurs on one to four processors, as the first four processors are provided by the same SMP
node. In this case, only one compute engine with threads inside executes the simulation loop,
without communication cost. Communication is required when more than one SMP node is
used: the compute engines need to scatter the partial subtrees and send bodies to and
receive them back from other compute engines during the force computation.
Fig 4.7 Speedups of the N-body method
2. Execution time on cluster of PCs
Fig 4.8 The execution time on the cluster of 32 PCs
In the previous test, at most four SMP nodes are used, so the inter-node
communication remains at a low level. To examine the performance on a system containing
more hosts, on which more communication is required, the N-body method has been tested on
a cluster of single-processor PCs as well. There are thirty-two PCs in the cluster, linked by a
Fast Ethernet switch. Fig. 4.8 shows the execution time on the cluster of PCs. Although the
communication overhead should be higher than in the previous test on the SMP cluster, Fig 4.8
shows that the execution time still decreases when increasing the number of PCs. The
distributed N-body method achieves similar performance on the cluster of single-processor
PCs as on the cluster of SMP nodes.
3. Efficiency of the distributed tree structure and partial subtree
The distributed tree structure and partial subtree scheme are proposed as a
communication-efficient solution for the data requirement in the force computation. The
efficiency can be demonstrated by the execution time breakdowns on the cluster of four SMP
nodes in Fig 4.9. The computation time in the figure is the cost of force computation. The
communication time includes the costs of sending partial subtrees, sending and receiving
bodies to and from remote compute engines for remote subtree access, and the body
redistribution at the end of simulation loop.
Fig 4.9 Computation and communication time breakdowns on the cluster of SMPs
As Fig 4.9 shows, most of the execution time is spent on computation. There is
no obvious growth in the communication time under large problem sizes. On the contrary, the
proportion of communication in the total execution time decreases as the problem size
increases.
(1) Comparison with full tree scheme
A full tree scheme has been tested as a comparison to the distributed tree structure. In
the full tree scheme, each compute engine builds a global Barnes-Hut tree so that all force
computation can be accomplished locally, without remote tree access. The scheme requires
broadcasting all N bodies to every compute engine in each simulation loop. Each compute
engine is still responsible for the force computation on a subset of bodies. Runtime tests show
that the broadcast of N bodies, instead of partial subtrees, produces a high communication
cost. Fig 4.10 shows the execution time breakdown of the full tree scheme on the cluster of
four SMP nodes under the problem size 20,480. The broadcast of all bodies induces a high
communication overhead when running on more than one SMP node (above four processors).
Communication takes up most of the execution time. As a result, the communication
overhead and the total execution time go up with the number of processors. The poor
performance of the full tree scheme affirms the communication efficiency of the distributed
tree structure.
Fig 4.10 Computation and communication time breakdowns of the full tree method
(2) Comparison of partial subtree schemes
As indicated in 4.2.1.3, the partial subtree is constructed based on the distance between
the corresponding sub-spaces. It is an adaptive data structure that provides most of the data
required in the force computation. The adaptive partial subtree scheme is in fact an
improvement on the cut-off partial subtree structure we proposed in [78]. Differing from
the adaptive structure in this thesis, the cut-off partial subtree simply duplicates the top
half of the depth of the subtree. Only one partial subtree is created from a subtree and it is
broadcast to all compute engines. Obviously the time cost for creating the cut-off partial
subtree is lower than for the adaptive partial subtrees, because multiple partial subtrees are
created from a subtree in the latter scheme. However, the cut-off partial subtree may not
contain sufficient information for the force computation on other compute engines; more
bodies then require remote subtree access and a higher communication cost is incurred.
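The cut-off scheme, duplicating the top half of the subtree's depth regardless of distance, can be sketched as follows; the node structure and names are illustrative, not taken from [78].

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the cut-off partial subtree: duplicate the top half of
// the subtree's depth, independently of the distance to any remote
// sub-space.
public class CutoffSubtree {
    static class Node {
        List<Node> children = new ArrayList<>();
    }

    static int depth(Node n) {
        int d = 0;
        for (Node c : n.children) d = Math.max(d, depth(c));
        return d + 1;
    }

    // Copy the top 'levels' levels of the tree, dropping the rest.
    static Node copyTop(Node n, int levels) {
        Node copy = new Node();
        if (levels > 1)
            for (Node c : n.children)
                copy.children.add(copyTop(c, levels - 1));
        return copy;
    }

    static Node cutoff(Node root) {
        // keep the top half of the depth (rounded up, assumed)
        return copyTop(root, (depth(root) + 1) / 2);
    }

    public static void main(String[] args) {
        // A chain of depth 4: the cut-off copy keeps the top 2 levels.
        Node root = new Node(), a = new Node(), b = new Node(), c = new Node();
        root.children.add(a); a.children.add(b); b.children.add(c);
        System.out.println(depth(cutoff(root)));  // 2
    }
}
```

Contrast this with the distance-based pruning of 4.2.1.3: the cut-off copy is cheap to build but ignores how far the receiving sub-space is, which is the source of the extra remote accesses discussed above.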
Fig 4.11 Adaptive partial subtree vs. cut-off partial subtree on cluster of PCs
These two partial subtree schemes are compared on the cluster of PCs. Fig 4.11 contrasts
the time costs of the adaptive partial subtree and the cut-off partial subtree schemes on 32
PCs under varied problem size N. At each problem size, the left column is the time
breakdown of the adaptive partial subtree scheme and the right column is that of the cut-off
scheme. The execution time is broken down into three portions: computation, the time for
force computation; communication, the time for sending the partial subtrees and bodies; and
tree build, the time for building partial subtrees. In the cut-off scheme on the right, only one
partial subtree is built from each subtree, so its tree build time is trivial and invisible in the
figure. For the adaptive partial subtree scheme, each compute engine builds (P−1)
different partial subtrees per simulation loop, where P is the total number of compute engines
(i.e., the number of PCs in this example). As a result, the time for tree building is apparent in
the execution time. On the other hand, the adaptive partial subtrees provide more data to the
force computation on the local compute engine, so that remote subtree access is reduced;
it spends less time on communication, as Fig 4.11 shows. From all of the test results, we can
conclude that the partial subtree scheme in this thesis satisfies most of the data requirements
of the force computation at low communication overhead, on both the cluster of SMPs and
the cluster of PCs. The distributed tree structure is a communication-efficient scheme for the
distributed N-body method.
4. Efficiency of load balancing strategy

N      |            | Computation | Communication | Tree build | Load balancing | Total time | Improvement ratio
-------|------------|-------------|---------------|------------|----------------|------------|------------------
10240  | Balanced   | 15.61       | 110.49        | 5.50       | 16.75          | 148.35     |
       | Imbalanced | 18.79       | 137.07        | 6.46       | 0              | 162.31     | 8.60%
20480  | Balanced   | 38.39       | 161.12        | 10.24      | 28.40          | 238.146    |
       | Imbalanced | 46.34       | 196.92        | 12.34      | 0              | 255.602    | 6.83%
40960  | Balanced   | 107.12      | 239.67        | 22.99      | 49.24          | 419.02     |
       | Imbalanced | 132.50      | 279.52        | 27.55      | 0              | 439.58     | 4.68%
61440  | Balanced   | 205.98      | 261.44        | 31.64      | 71.08          | 570.14     |
       | Imbalanced | 253.55      | 313.32        | 37.41      | 0              | 604.27     | 5.65%
81920  | Balanced   | 268.72      | 364.34        | 46.81      | 92.35          | 772.21     |
       | Imbalanced | 330.16      | 449.59        | 56.74      | 0              | 836.48     | 7.68%
102400 | Balanced   | 340.24      | 595.67        | 56.38      | 112.93         | 1105.22    |
       | Imbalanced | 415.73      | 728.34        | 70.20      | 0              | 1214.27    | 8.98%

Table 4.1 The times of the N-body method with/without load balancing (seconds)
The load balancing strategy in the distributed N-body method is described in 4.2.3. As
Singh indicated in [61,73], as well as Warren and Salmon in [75], the N-body problem
typically simulates physical systems that evolve slowly with time, and the distribution of
bodies changes little between two successive time-steps. Thus severe workload imbalance
will not occur frequently in the simulation procedure. Even so, load balancing is still required
when high workload imbalance does occur. Upon load imbalance, a global task repartition is
conducted by the compute coordinator. The compute coordinator collects all bodies from the
compute engines, makes the space re-decomposition and allocates the new body subsets to the
compute engines. Table 4.1 compares the time costs of the N-body method with and without
the load balancing strategy. The time costs in the table measure the elapsed time of
fifteen simulation loops.
Table 4.1 lists the times of four operations: computation, communication, tree building and
load balancing. Balanced is the time of the N-body simulation adopting the load balancing
strategy and Imbalanced is the time of the simulation without performing load
balancing. The improvement ratio indicates the performance improvement gained from the
load balancing strategy. As the results show, the load balancing strategy improves the
performance of the distributed N-body method.
4.3.2 Tests on Heterogeneous Hosts
The distributed N-body method has also been tested on heterogeneous hosts to exhibit the
flexibility of the MOIDE model. The test environment consists of four Pentium III-based
hosts: one tri-processor machine, two dual-processor machines, and one single-processor PC.
The system provides eight processors in total. The problem sizes are the same as in the tests
on the homogeneous SMP cluster. The N-body method starts to run on the tri-processor
machine. According to the host selection policy in 2.3.2, hosts with more processors have
higher priority in the selection, so the two dual-processor machines are used next. The single-
processor PC is used only in the case of eight processors. Multiple threads may be generated
in a compute engine according to the number of processors in the underlying host. The
computation workload is distributed to the compute engines based on their computing power.
Fig 4.12 shows the execution time of the distributed N-body method on the
heterogeneous hosts. The speedups are shown in Fig 4.13. The performance of the distributed
N-body method on the heterogeneous hosts is comparable to that on the homogeneous SMP
nodes. The method reaches around four-fold speedup on eight processors. Fig 4.14 shows the
computation and communication time breakdowns on the heterogeneous hosts. Comparing
the time breakdowns in Fig 4.14 with Fig 4.9, the communication cost on
the heterogeneous cluster is higher than that on the homogeneous SMP cluster under the same
problem size and number of processors, because more hosts are used and more remote
messaging is required.
Fig 4.12 Execution time of the N-body method on heterogeneous hosts
Fig 4.13 Speedups of the N-body method on heterogeneous hosts
Fig 4.14 Computation and communication time breakdowns on heterogeneous hosts
All the tests above manifest the efficiency of the MOIDE-based N-body method,
particularly the communication-efficient distributed tree structure. The tests also demonstrate
the flexibility of the MOIDE model and the usability of the MOIDE runtime support system.
The distributed N-body method can be mapped onto different hosts and achieves high
performance on different systems.
Chapter 5
Ray Tracing with Autonomous Load Scheduling
This chapter presents another irregularly structured application, ray tracing, based on the
MOIDE model. The ray tracing application uses autonomous load scheduling to achieve
highly asynchronous computation.
5.1 Overview
Ray tracing is a graphics rendering algorithm that renders an image on a view plane from
the mathematical descriptions of the objects in a scene. The view plane is divided into a grid
of pixels. The color visible at each pixel is computed by tracing the rays that emanate from a
viewpoint and pass through the pixels into the scene space. New rays are generated by
reflection and refraction when a ray hits an object along the way. The tracing
computations are performed recursively on the new rays. The rendering of the scene at a
pixel is a non-predetermined procedure; it depends on the objects a ray hits and the new rays
generated. The computation workload differs from pixel to pixel.
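The recursive structure of this procedure can be sketched as follows. The intersection test and shading terms are stand-in stubs with assumed constants; only the recursion pattern, a hit spawning a new traced ray down to a fixed depth, reflects the description above.

```java
// Skeleton of recursive ray tracing. Intersection and shading are
// stubs with assumed constants; only the recursion pattern matters.
public class RayTraceSketch {
    static final int MAX_DEPTH = 3;      // recursion bound (assumed)

    // Stub: whether a ray hits a reflective object. Here every ray
    // hits, so recursion is bounded only by the depth limit.
    static boolean hits(double[] origin, double[] dir) {
        return true;
    }

    // Returns a grey level in [0, 1]; a real tracer returns RGB.
    static double trace(double[] origin, double[] dir, int depth) {
        if (depth == 0 || !hits(origin, dir))
            return 0.1;                  // stub background intensity
        double local = 0.5;              // stub local shading term
        // a hit spawns a new ray; this stub reuses the same ray
        double reflected = trace(origin, dir, depth - 1);
        return local + 0.25 * reflected; // blend, reflectance 0.25
    }

    public static void main(String[] args) {
        double v = trace(new double[]{0, 0, 0},
                         new double[]{0, 0, 1}, MAX_DEPTH);
        System.out.println(v);           // about 0.658
    }
}
```

The unpredictable recursion depth per pixel is precisely what makes the per-pixel workload non-predetermined.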
Differing from the N-body problem, ray tracing has light data-dependency in its computation.
Each ray can be traced independently. The pixels on the view plane can be rendered in
parallel and in arbitrary order. Ray tracing is a compute-intensive irregular application
[8,20,21].
The distributed ray tracing method is developed based on the MOIDE model. It is a
completely asynchronous procedure due to the light data-dependency and the imbalanced
computation workload per pixel. Therefore the autonomous load scheduling discussed in
3.7 is suitable for exploiting the high parallelism in ray tracing.
5.2 Autonomous Load Scheduling
5.2.1 Background
In parallel ray tracing methods, the pixels on a view plane are partitioned into blocks. The
rendering of the blocks can be performed by multiple processes in parallel. The workload of
rendering a block is related to the part of the scene appearing in that block, so there is high
diversity in the workloads of the blocks, which cannot be determined a priori. Static block
allocation is unsuitable because it cannot produce an even distribution of the workload to the
processes; the total computation time would be constrained by the process with the highest
workload. Dynamic load scheduling should be applied to balance the rendering workload, in
which the blocks are gradually allocated to the processes at runtime based on the computation
progress on them.
In parallel ray tracing based on message passing, load balancing generally adopts a
centralized load scheduling approach in which a dedicated process works as the load
scheduler [21,56]. The load scheduler continuously monitors the rendering procedure on all
other processes. Once a process has finished rendering a block, the load scheduler collects
the rendered block from that process and allocates a new block to it. The load scheduler may
spend most of its time idle, waiting for the rendering on the other processes to complete,
which wastes computing power. An alternative is to let the load scheduler perform rendering
as well and carry out the load scheduling for the other processes only at specified moments;
for example, the scheduler checks for block allocation requests whenever it finishes rendering
a block of its own. In this approach the other processes may have to wait idle for block
allocation while the load scheduler is busy rendering.
MOIDE supports a more efficient approach to load scheduling in ray tracing. Given the
asynchronous rendering procedure, the autonomous load scheduling discussed in 3.7 is an
appropriate approach in which no dedicated load scheduler is required, and all processes can
concentrate on the rendering computation. Each process performs the dynamic block
allocation by itself, provided it can access the global block pool. The one-sided
communication feature of remote method invocation supports the implementation of the
autonomous load scheduling. As described in 3.7.1, a global block pool is maintained in the
compute coordinator, and any compute engine can fetch blocks directly from the global task
pool at any time via remote method invocation, without any intervention by the compute
coordinator. The autonomous load scheduling can produce a balanced distribution of the
rendering computation over all compute engines, including the compute coordinator, and
attain the highest parallelism.
The autonomous load scheduling strategy discussed in 3.7.1 is a two-level scheme on the
hierarchical collaborative system. In ray tracing it can be implemented as a two-level
allocation of rendering tasks. The main thread in a compute engine is responsible for getting
a block from the global task pool; the block is then divided into sub-blocks that are rendered
by the group of threads inside the compute engine. This two-level load scheduling approach
is called group scheduling.
There is another approach to autonomous load scheduling in ray tracing. As the rendering
operations are independent of one another, each thread can individually get and perform a
rendering task: it fetches a block directly from the global block pool and renders it
independently. This load scheduling approach is called individual scheduling.
5.2.2 Group Scheduling
Fig 5.1 shows the view plane of an image to be rendered. The view plane is partitioned
into row-oriented strips (blocks), each containing several rows of the image. A pointer is used
as the index to the strips still to be allocated. The pointer represents the global task pool and
is held in the compute coordinator.
The group scheduling in MOIDE-based ray tracing is a two-level autonomous load
scheduling approach. The threads work in cooperative mode (see 2.3.3) to share the rendering
task. The main thread in a compute engine reads the pointer to get a new rendering task,
which contains a certain number of successive strips. The number of strips allocated to a
compute engine depends on its computing power. For example, suppose four SMP nodes
(two quad-processor nodes and two dual-processor nodes) are used to render the image. As
Fig 5.1 shows, a compute engine on a quad-processor node gets four strips in each task
allocation and a compute engine on a dual-processor node gets two. The strips are rendered in
parallel by the group of threads in the compute engine. The strips allocated to a compute
engine are indicated by a local pointer; a thread gets one row from the strips at a time and
performs the rendering operations for that row. The getTask() and getSubtask()
methods provided in the MOIDE-runtime can be used to implement the group scheduling.
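A minimal sketch of the two-level allocation, with a global atomic counter standing in for the coordinator's strip pointer (cf. getTask()) and a local counter standing in for the per-engine row pool (cf. getSubtask()); the counters and the rendering body are illustrative stand-ins, not the MOIDE-runtime API.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of group scheduling: the main thread grabs a strip of rows from the
// global pool, then the engine's group of threads shares out the rows of
// that strip.
public class GroupScheduling {
    static final AtomicInteger globalPool = new AtomicInteger(0); // strip pointer

    public static int renderEngine(int totalRows, int stripRows, int nThreads) {
        globalPool.set(0);                     // reset for this single-engine sketch
        AtomicInteger rendered = new AtomicInteger(0);
        int first;
        // Main-thread role: fetch one strip at a time (cf. getTask()).
        while ((first = globalPool.getAndAdd(stripRows)) < totalRows) {
            int last = Math.min(first + stripRows, totalRows);
            AtomicInteger localPool = new AtomicInteger(first);   // cf. getSubtask()
            Thread[] group = new Thread[nThreads];
            for (int t = 0; t < nThreads; t++) {
                group[t] = new Thread(() -> {
                    int row;
                    while ((row = localPool.getAndIncrement()) < last) {
                        rendered.incrementAndGet();               // "render" one row
                    }
                });
                group[t].start();
            }
            for (Thread w : group) {
                try { w.join(); } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }
        return rendered.get();
    }

    public static void main(String[] args) {
        // Every row is rendered exactly once, regardless of thread interleaving.
        System.out.println(renderEngine(40, 4, 4));
    }
}
```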
Fig 5.1 View plane partitioning and allocation
The intention behind group scheduling is based on the prediction that the task allocation
operation has high communication overhead, so the occurrence of task allocation should be
reduced. Therefore the rendering tasks are allocated in strips and only the main thread
performs the task allocation, avoiding frequent task allocation. Based on the same prediction,
a rendered strip (i.e., the output of the rendering) is not sent back to the compute coordinator
immediately after it has been rendered; to reduce the communication overhead, the send-back
of all rendered strips is postponed to the end of the ray tracing. The compute coordinator
collects the rendered strips from the compute engines after all rendering operations have
finished, and finally organizes the rendered strips into an image. This operation is called data
reordering in the following text. The ray tracing method with group scheduling is
summarized in Fig 5.2.
for all compute engines
do
while (any strips to be rendered)
do
if (main thread in compute engine)
fetch next rendering task from global task pool;
for all threads
do
while (any rows to be rendered)
do
get a row;
render the row;
store the rendered row into local vector;
end while
end for
end while
if (main thread in compute coordinator)
do
collect the rendered strips from all compute engines;
reorder the strips to form the full image;
end if
end for
Fig 5.2 Ray tracing with group scheduling
5.2.3 Individual Scheduling
for all threads
do
fetch a strip from compute coordinator;
do
render the strip;
send the rendered strip back to and fetch next strip
from compute coordinator in one method invocation;
while (any strip to be rendered);
end for
Fig 5.3 Ray tracing with individual scheduling
In contrast to group scheduling, individual scheduling is characterized by individual task
fetching and immediate strip send-back. All threads work in independent mode (see 2.3.3)
and each thread performs task allocation by itself. When it has finished rendering a strip, a
thread immediately sends the rendered strip back to the compute coordinator and fetches the
next strip from the global task pool in one remote method invocation.
The getTask() method in the MOIDE-runtime system can be called by every thread to
perform the individual scheduling. The ray tracing method with individual scheduling is
summarized in Fig 5.3.
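The loop of Fig 5.3 can be sketched with ordinary Java threads; the shared map stands in for the coordinator's image buffer and the atomic counter for the global task pool, replacing the actual getTask() remote invocation, whose signature is not given in the text.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of individual scheduling: every thread fetches one strip (here one
// row) at a time from a shared pool and "sends back" the rendered row at once.
public class IndividualScheduling {
    static final AtomicInteger pool = new AtomicInteger(0);        // global task pool
    static final Map<Integer, int[]> image = new ConcurrentHashMap<>(); // coordinator buffer

    static int[] renderRow(int row, int width) {                   // placeholder rendering
        int[] pixels = new int[width];
        for (int x = 0; x < width; x++) pixels[x] = row * width + x;
        return pixels;
    }

    public static int run(int rows, int width, int nThreads) {
        pool.set(0);
        image.clear();
        Thread[] workers = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            workers[t] = new Thread(() -> {
                int row;
                while ((row = pool.getAndIncrement()) < rows) {    // fetch next strip
                    image.put(row, renderRow(row, width));         // immediate send-back
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return image.size();
    }

    public static void main(String[] args) {
        System.out.println(run(64, 8, 4)); // all 64 rows rendered exactly once
    }
}
```

Because each row lands directly in its own entry of the shared map, no data reordering step is needed afterwards, matching the observation about individual scheduling in 5.3.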
As Fig 5.3 shows, all threads perform the rendering operations in an entirely asynchronous
manner without cooperation or coordination. Each thread works as an independent compute
engine, while the threads in a group can still share some data objects, e.g., the object
description of the scene. The multiple threads consume fewer system resources and are more
stable than a group of single-threaded compute engines on an SMP node. The size of a strip
can be defined as one row. The effect of individual scheduling is that the strips are distributed
gradually to all threads depending on the rendering progress of each one, so individual
scheduling leads to an even workload distribution over all threads. However, individual
scheduling requires frequent remote method invocation for strip send-back and task fetching,
and the frequent communication may produce high communication overhead that affects the
overall performance.
Fig 5.4 Flow of ray tracing in group scheduling and individual scheduling
On the other hand, the immediate strip send-back has the advantage of overlapping the
system-wide computation and communication. The send-back communication of one thread
can be overlapped with the rendering computation on other threads. The overlapping can
resolve the potential communication bottleneck in the final strip send-back in the group
scheduling.
Fig 5.4 shows the execution flow of the group scheduling and the individual scheduling on
four threads in a compute engine. The execution flow consists of two operations: the
rendering computation and the strip send-back communication. As the runtime tests in 5.3
will reveal, the task fetching (task allocation) latency is a trivial fraction of the total execution
time, so the task fetching time is omitted in Fig 5.4. The group scheduling in Fig 5.4(a) defers
the send-back of all strips until the end of the rendering; the strip send-back phase at the end
may turn into the bottleneck of the execution flow. In contrast, the individual scheduling in
Fig 5.4(b) sends each strip back immediately once it has been rendered. Each thread performs
computation and communication alternately, so that computation and communication are
overlapped among the threads. The individual scheduling thereby eliminates the bottleneck of
the final send-back. However, every thread must perform communication operations
frequently, which may lead to communication contention on the side of the compute
coordinator and so become a new communication bottleneck.
The analysis above compares the advantages and disadvantages of the group scheduling
and the individual scheduling in ray tracing, which need to be verified by runtime tests. The
runtime tests will demonstrate that the individual scheduling is the better scheme for ray
tracing.
5.3 Runtime Tests and Performance Analysis
The tests of the ray tracing method focus on the performance of the different autonomous
load scheduling approaches. The group scheduling and the individual scheduling are
compared first; the individual scheduling is then also compared with the master/slave
scheduling approach.
1. Group scheduling and individual scheduling
To locate the communication bottleneck, a third autonomous scheduling scheme, called
combined scheduling, is designed for the test. It combines the main features of the group
scheduling and the individual scheduling.
Combined scheduling
(1) Task allocation: In combined scheduling, each thread independently gets one strip at a
time from the global task pool, in the same way as in individual scheduling. However, it
does not immediately send the rendered strip back.
(2) Strip send-back: The rendered strips are sent back to the compute coordinator at the end
of the rendering, and the compute coordinator reorders the strips in the same way as in
group scheduling.
Fig 5.5 Execution times of ray tracing in three load scheduling schemes
Fig 5.6 Speedups of ray tracing in three load scheduling schemes
[Charts: execution time (0 to 250 seconds) and speedup (0 to 8) against the number of
processors (1 to 16), with curves for the individual, group, and combined scheduling schemes]
The ray tracing method is tested on the cluster of four quad-processor SMP machines. Fig
5.1 is a sketch of the image to be rendered in the test. The execution times of the ray tracing
under the three scheduling schemes are displayed in Fig 5.5, and the speedups are shown in
Fig 5.6. In the test, the task allocation is one row at a time in the individual scheduling and
the combined scheduling. For the group scheduling, the image is partitioned into (20 ×
number of processors) strips.
Fig 5.7 Execution time breakdowns of group scheduling and combined scheduling
As the test results show, the individual scheduling has the best performance of the three
schemes, with smooth performance improvement as the number of processors increases. The
performance of the group scheduling is close to that of the combined scheduling, but slightly
better on fourteen and sixteen processors. The time breakdowns of the group scheduling and
the combined scheduling in Fig 5.7 reveal that the difference in their performance mainly
comes from the communication cost. Fig 5.7 shows only the execution time breakdown on
the main thread of the compute coordinator. The computation time is spent on rendering
operations and decreases with the number of processors. The data reordering is performed on
the compute coordinator to organize all rendered strips into the image; it is the smallest
portion of the execution time. The data reordering operation is also performed by the main
thread of each compute engine to collect the locally rendered strips together for send-back.
The communication time includes the time of task allocation and final strip send-back. There
is no communication cost on one to four processors because these four processors belong to
one SMP node; in this case no final send-back is required and the task allocation is made
locally. The communication overhead increases with the number of processors; eventually
the communication time exceeds the computation time and degrades the overall performance.
As the communication includes both task allocation and final strip send-back, it is
necessary to determine which of the two is the communication bottleneck. Fig 5.8 shows the
detailed time breakdowns of the combined scheduling, with the communication time divided
into task allocation time and strip send-back time. In the combined scheduling the strip send-
back takes place at the end, as in the group scheduling. The tests are made on four quad-
processor SMP nodes; each group of four threads are the sibling threads of one compute
engine. The threads perform the task allocation individually by fetching a row from the
global task pool. Fig 5.8 shows that the cost of task allocation on each thread is low in
contrast to the cost of the final strip send-back. It can be concluded with certainty that the
frequent individual task fetching does not produce heavy communication overhead, but the
final strip send-back does cause the communication bottleneck.
Fig 5.8 Execution time breakdowns of combined scheduling
[Chart: per-thread time breakdown (computation, task allocation, data reordering, strip
send-back) of the combined scheduling for P = 4, 8, 12, and 16 processors]
In the group scheduling, the group-based task allocation and the delayed strip send-back
aim to decrease the frequency of remote method invocation. These choices were made on the
prediction that frequent remote method invocation would produce high communication
overhead and should therefore be reduced during the computation. The runtime tests,
however, lead to the opposite conclusion: the row-based individual task allocation produces
low communication overhead and has little influence on the overall performance.
Now inspect the communication overhead in the individual scheduling. Fig 5.9 shows the
detailed time breakdowns of the individual scheduling. The communication time is the cost
of row fetching and send-back. In the individual scheduling, a rendered row is directly stored
into the corresponding entry of the image buffer in the compute coordinator via the remote
method invocation, so there is no strip reordering operation. In the first case in Fig 5.9 (P=4,
i.e., four processors), the rendering work is evenly performed by the four threads residing on
one SMP node, and there is no communication cost.
Fig 5.9 Execution time breakdowns of individual scheduling
The second case shows the execution times of eight threads on two SMP nodes (P=8).
The four threads on the left belong to the compute coordinator; their execution time is mostly spent
on computation because these threads do not invoke remote communication operations. On
the other hand, the threads on the remote compute engine need to perform communication
operations for the individual scheduling, so the four threads on the right spend communication
time in the execution. The remote method invocation operations on these threads are blocking
communication: a thread must wait for a round trip of data communication that sends the
rendered strip back and fetches the next strip to be rendered. The remote method invocation
is one-sided communication, performed by the remote threads only. The difference in
communication cost between the threads on the compute coordinator and those on the
compute engines also comes from the implementation of RMI. In the RMI mechanism, the
distributed objects do not communicate with each other directly but via their representatives,
the stub and the skeleton [12,77]. The communication between a remote compute
engine/thread and the compute coordinator is accomplished by the interaction between the
stub on the remote engine and the skeleton on the compute coordinator. Hence the
communication cost is not obvious on the compute coordinator for P=8, as Fig 5.9 shows.
Certainly the communication does influence the compute coordinator; the influence becomes
apparent in the cases of twelve and sixteen processors, where communication cost appears on
the compute coordinator. However, the communication time on the compute coordinator is
still lower than the total communication time on all compute engines.
Fig 5.9 also shows an interesting feature of the individual scheduling: the computation
and communication are nearly balanced on all threads. The balance is achieved automatically,
without any explicit load balancing operation. This also illustrates the advantage of the
system-wide overlapping of computation and communication produced by the individual
scheduling. Another encouraging observation is that the communication cost on each remote
thread decreases with the total number of processors, because each thread renders fewer rows
when running on more processors. This is completely different from the rising
communication overhead of the final strip send-back in the group scheduling (see Fig 5.7).
Therefore the individual scheduling is the better scheme for the distributed ray tracing.
Of course, the group scheduling scheme could be modified to adopt an earlier strip send-
back: the rendered strips could be sent back to the compute coordinator in the task allocation
operation performed by the main thread. That would overlap computation and communication
in the group scheduling and eliminate the bottleneck of the final strip send-back. However,
the strip reordering would still be required on both the compute coordinator and the compute
engines, so the total execution time should still be longer than with the individual scheduling.
2. Autonomous scheduling and master/slave scheduling
Parallel ray tracing methods based on message passing, e.g., the ray tracing applications
in [21,56], usually use a master/slave scheduling approach. One of the processes works as the
load scheduler and is dedicated to performing the runtime block allocation. This process
keeps watching the rendering procedures on the other processes and allocates one block at a
time to them when required; it often waits idle for the next task allocation request.
Considering the highly asynchronous rendering procedure, it is reasonable to remove the
dedicated load scheduler and let every process (thread) schedule its own rendering tasks. The
autonomous load scheduling is therefore more appropriate for parallel ray tracing than the
master/slave approach. Working in independent mode, all threads can perform the rendering
operations concurrently without a load scheduler. The individual scheduling can make full
use of the computing power, and high parallelism can be achieved in the ray tracing.
Fig 5.10 compares the performance of the autonomous load scheduling (individual
scheduling) and the master/slave load scheduling approaches on the same cluster of SMP
nodes as in the previous tests.
Fig 5.10 The comparison of autonomous load scheduling and master/slave scheduling in ray tracing
The execution times show that the autonomous load scheduling is superior to the
master/slave scheduling because one more thread performs the rendering. The ray tracing
based on the autonomous load scheduling is nearly P/(P−1) times faster than the master/slave
scheduling, where P is the number of processors.
From these tests we can conclude that the individual scheduling automatically balances
the rendering workload over the threads and exploits the highest parallelism in the MOIDE-
based ray tracing. It is the appropriate load scheduling approach for the distributed ray
tracing method.
Chapter 6
CG and Radix Sort on Two-layer
Communication
CG and radix sort are two irregularly structured applications with heavy all-to-all
communication requirements. The communication cost determines the performance of these
two applications in distributed computing, and the two-layer communication mechanism is
useful for improving their communication efficiency. The threads work in independent mode
to execute the CG and radix sort applications based on the MOIDE model.
6.1 Conjugate Gradient
Dense linear systems are usually solved by a direct method, for instance Gaussian
elimination. In the absence of rounding errors, a direct method leads to the exact solution of
the given linear system in a finite and fixed amount of work. However, a direct method may
cause many zero entries of a sparse matrix to become non-zero, and these new non-zero
entries require storage as well as CPU time [81]. Hence direct methods are not suitable for
solving sparse linear systems. Instead, sparse linear systems are solved by iterative methods,
which generate a sequence of approximations to the solution vector. The advantage of
iterative methods is the economy of CPU time and storage when solving sparse linear systems.
The conjugate gradient (CG) method is one of the most powerful methods for solving
large sparse linear systems Ax = b. CG is an iterative method that obtains the approximated
solution x by the iteration:
xk = xk−1 + αk pk    (6-1)
where αk is a scalar step size and pk is the direction vector. A detailed description of the CG
method can be found in [22].
The CG method performs floating-point computation on the large symmetric sparse
matrix A and the related vectors, and it communicates large vectors frequently during the
computation. An efficient communication mechanism is therefore required. When
implemented on the MOIDE model, the CG method can rely on the two-layer communication
mechanism to accelerate the vector communication and improve the overall performance.
The CG method on the MOIDE model is converted from the CG benchmark in the NAS
Parallel Benchmarks (NPB) [23]; the original CG benchmark is programmed in FORTRAN
and MPI (Message Passing Interface) [72]. In the following text, the algorithm of the CG
method is briefly described with emphasis on its communication pattern, followed by its
implementation on the MOIDE model.
6.1.1 Algorithm of CG
The CG method solves sparse linear systems by iterating on the approximated solution xk
in equation (6-1) until the residual || b − Axk || converges to a pre-specified accuracy. Fig 6.1
describes the algorithm of the CG method for solving Ax = b, where A is the sparse matrix;
zk is the intermediate result of x; pk, qk, and rk are vectors; and αk, βk, and ρk are scalars.
||x|| denotes the Euclidean norm of vector x, i.e., ||x|| = √(xTx).
1   x = (1,1,...,1);  /* initial value */
2   for (i = 0; i < niter; i++)
3   do
4       z0 = (0,0,...,0);
5       r0 = x;
6       p1 = r0;
7       ρ0 = r0T r0;  ③
8       for (k = 1; k < cgitmax; k++)
9       do
10          qk = A pk;  ①
11          αk = ρk−1 / (pkT qk);  ② ①
12          zk = zk−1 + αk pk;
13          rk = rk−1 − αk qk;
14          ρk = rkT rk;  ③
15          pk+1 = rk + (ρk / ρk−1) pk;
16      end
17      compute the residual norm ||r|| = ||x − Az||;  ① ② ③
18      x = z / ||z||;  /* z is the final zk of the inner loop */
19  end
Fig 6.1 Algorithm of conjugate gradient (CG) method
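The inner iteration of Fig 6.1 can be checked on a small dense example. This sequential sketch keeps the algorithm of Fig 6.1 but omits the mesh decomposition, so the reduction, transposition, and scalar-reduction steps collapse into ordinary local operations (marked in the comments); the test matrix is an assumption for illustration.

```java
// Sequential sketch of the CG inner iteration of Fig 6.1 for a small
// symmetric positive-definite system A z = b.
public class CgSketch {
    public static double[] matVec(double[][] a, double[] x) {
        double[] y = new double[x.length];
        for (int i = 0; i < x.length; i++)
            for (int j = 0; j < x.length; j++) y[i] += a[i][j] * x[j];
        return y;
    }

    static double dot(double[] u, double[] v) {
        double s = 0;
        for (int i = 0; i < u.length; i++) s += u[i] * v[i];
        return s;
    }

    // Returns z with ||b - Az|| below tol (or after maxIt iterations).
    public static double[] solve(double[][] a, double[] b, int maxIt, double tol) {
        int n = b.length;
        double[] z = new double[n];            // z0 = 0
        double[] r = b.clone();                // r0 = b - A*0 = b
        double[] p = r.clone();                // p1 = r0
        double rho = dot(r, r);                // rho0 = r'r   (scalar reduction)
        for (int k = 1; k <= maxIt && Math.sqrt(rho) > tol; k++) {
            double[] q = matVec(a, p);         // q = A p      (vector reduction)
            double alpha = rho / dot(p, q);    // (transposition + reduction)
            for (int i = 0; i < n; i++) { z[i] += alpha * p[i]; r[i] -= alpha * q[i]; }
            double rhoNew = dot(r, r);         // rho_k        (scalar reduction)
            double beta = rhoNew / rho;        // beta_k = rho_k / rho_{k-1}
            for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
            rho = rhoNew;
        }
        return z;
    }

    public static void main(String[] args) {
        double[][] a = {{4, 1}, {1, 3}};       // small SPD test matrix (assumed)
        double[] z = solve(a, new double[]{1, 2}, 25, 1e-10);
        double[] az = matVec(a, z);
        System.out.println(Math.abs(az[0] - 1) + " " + Math.abs(az[1] - 2));
    }
}
```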
The algorithm contains two nested loops. The predefined iteration counts niter and
cgitmax ensure that the solution vector x is obtained with the required precision. All
computations are vector operations. The parallel CG method is designed on a mesh topology
of multiprocessors: the sparse matrix A is decomposed into sub-matrices, and the vectors and
vector operations are handled over the mesh. To perform the vector multiplications in parallel,
the processes need to exchange partial products to obtain the final vector product. The data
communication operations in the parallel CG algorithm include vector reduction, vector
transposition, and scalar reduction. These communication operations are indicated in Fig 6.1
with the following signs:
sign  communication operation
①    vector reduction
②    vector transposition
③    scalar reduction
For example, line 7 in Fig 6.1 calculates the inner product of vector r0. The inner product
is calculated in sections by different processes; the partial inner products are then summed up
by a scalar reduction operation among the processes on the same row of the mesh to get the
total inner product ρ0.
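The sectioned inner product of line 7 can be illustrated directly: each process computes a partial dot product over its own section, and a scalar reduction sums the partials. The even sectioning below is a simplified stand-in for the mesh-row decomposition used in the thesis.

```java
// Sketch of line 7 of Fig 6.1: the inner product r'r computed in sections,
// followed by a scalar reduction that sums the partial products.
public class ScalarReduction {
    public static double sectionedDot(double[] r, int nProcs) {
        double[] partial = new double[nProcs];             // one partial per process
        int sec = (r.length + nProcs - 1) / nProcs;        // section length
        for (int p = 0; p < nProcs; p++)                   // local dot on each section
            for (int i = p * sec; i < Math.min((p + 1) * sec, r.length); i++)
                partial[p] += r[i] * r[i];
        double rho = 0;                                    // scalar reduction: sum partials
        for (double s : partial) rho += s;
        return rho;
    }

    public static void main(String[] args) {
        double[] r = {1, 2, 3, 4};
        System.out.println(sectionedDot(r, 2)); // equals 1 + 4 + 9 + 16 = 30.0
    }
}
```

The result is independent of the sectioning, which is why only the small partial scalars, not the vector sections, need to be communicated in this step.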
Line 10 computes the product of the sparse matrix A and vector p. The processes must
make a row-oriented vector reduction (see Fig 6.2(a)) to exchange and sum the sections of
the product vector and obtain the final product vector q. The computation in line 11
multiplies the two vectors p and q. However, the section of vector q generated in line 10 on a
process does not match the section of p on the same process; a process needs to exchange its
q section with the corresponding process by a vector transposition operation (see Fig 6.2(b)
and (c)) and then compute the local vector multiplication pkT qk. Thereafter a row-oriented
vector reduction is required to compute the sum of the vector product. Hence line 11 involves
one vector transposition and one vector reduction.
The floating-point computation workload of the CG method on multiprocessors is
determined by the sparse matrix A. The way to improve the performance is to provide an
efficient communication mechanism for the large-scale vector communication; the two-layer
communication mechanism provided in the MOIDE model can be utilized for this purpose.
(a) 2-step reduction on processor mesh (b) transposition on 2×4 processor mesh
(c) transposition on 4×4 processor mesh
Fig 6.2 Vector/scalar reduction and transposition operations in parallel CG method
6.1.2 CG Method in MOIDE Model
The hierarchical collaborative system in the MOIDE model is an appropriate infrastructure
for implementing the CG method on a heterogeneous system. The parallel CG method is
designed on a mesh; however, it should be able to run on distributed systems of varied
architecture. The MOIDE model supports the creation of a hierarchical collaborative system
that is adaptive to the heterogeneous hosts. Two features of the HiCS infrastructure are useful
to the CG method: the adaptability of the multithreaded computing and the two-layer
communication mechanism.
(1) Multithreaded computing
The threads working in independent mode can be logically organized into a mesh
structure to execute the CG method. Each thread (called a pseudo-engine, as in 2.3) runs
independently to simulate the work of a compute engine. Meanwhile the threads can make
efficient communication over the two-layer communication mechanism. To run the CG
method on heterogeneous hosts, multithreading compute engines are created according to the
architecture of the underlying hosts. The MOIDE-runtime maintains a location table (see 3.5)
that maps the pseudo-engines to their physical locations in the compute engines. The
communications between the pseudo-engines are invoked through the unified communication
interface and accomplished through the two-layer communication mechanism. The MOIDE
model thus gives the CG method high flexibility and adaptability on heterogeneous systems.
(2) Two-layer communication
The two-layer communication mechanism integrates shared-data access and remote
messaging on HiCS, and it can reduce the cost of the large-scale vector communication in
CG. The MOIDE-runtime delivers the data via the proper communication path between the
pseudo-engines: pseudo-engines in the same compute engine can perform the vector
reduction and transposition operations directly by shared-data exchange.
Fig 6.3 The reduction operation on 2 quad-processor SMP nodes
Consider the example in Fig 6.3, where the CG method runs on two quad-processor SMP
nodes. Two compute engines with four threads each are created on the SMP nodes, and the
eight pseudo-engines on the two nodes are organized into a 2×4 mesh as Fig 6.3 shows. All
vector/scalar reduction operations are then conducted within the compute engines by local
shared-data exchange; no remote messaging is required. Only the vector transposition must
call remote messaging across the two SMP nodes (see Fig 6.2(b)).
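The path selection of the two-layer mechanism can be sketched as a lookup in the location table: a message between pseudo-engines in the same compute engine goes through shared data, otherwise through remote messaging. The table layout and method names here are illustrative, not the MOIDE-runtime interface.

```java
// Sketch of two-layer communication: the location table maps each
// pseudo-engine to its compute engine; same engine -> shared-data exchange,
// different engines -> remote messaging (RMI in MOIDE).
public class TwoLayerComm {
    final int[] locationTable;      // pseudo-engine id -> compute engine id
    int sharedExchanges = 0, remoteMessages = 0;

    TwoLayerComm(int[] locationTable) { this.locationTable = locationTable; }

    // Returns the path taken so the choice can be observed.
    public String send(int fromPe, int toPe) {
        if (locationTable[fromPe] == locationTable[toPe]) {
            sharedExchanges++;      // layer 1: local shared-data access
            return "shared";
        } else {
            remoteMessages++;       // layer 2: remote messaging
            return "remote";
        }
    }

    public static void main(String[] args) {
        // The 2x4 mesh of Fig 6.3: pseudo-engines 0-3 on one SMP node,
        // 4-7 on the other.
        TwoLayerComm comm = new TwoLayerComm(new int[]{0, 0, 0, 0, 1, 1, 1, 1});
        System.out.println(comm.send(0, 1)); // row neighbours, same node
        System.out.println(comm.send(0, 4)); // across nodes
    }
}
```

With this table, a row-oriented reduction touches only same-node pairs (the shared layer), while a transposition pairs pseudo-engines across nodes (the remote layer), matching the Fig 6.3 discussion.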
Fig 6.4 is an example of running the CG method on heterogeneous hosts: two tri-processor
nodes and one dual-processor node, eight processors in total. Three
compute engines are created on the hosts, again eight pseudo-engines in total. The 2×4 mesh
of eight pseudo-engines is mapped onto the three hosts as in Fig 6.4. Fig 6.4(a) shows that
half of the reduction operations are performed inside a compute engine, and Fig 6.4(b) shows
that half of the vector transposition operations are likewise performed inside a compute
engine by shared-data exchange. Thus the communication in both examples benefits from the
two-layer communication.
(a) reduction on heterogeneous hosts (b) transpose on heterogeneous hosts
Fig 6.4 The reduction and transposition operations on heterogeneous hosts
The two examples also illustrate the flexibility of the MOIDE-based multithreading
computation on varied architectures, whether homogeneous or heterogeneous hosts. The CG
method runs uniformly on the 2×4 mesh in both examples regardless of the different
architectures of the underlying hosts, and in both cases the two-layer communication
mechanism can be used to improve the communication efficiency. The MOIDE-based CG
method is thus adaptive to the system architecture. The real performance of these two
configurations is measured in the following runtime experiments.
6.1.3 Runtime Tests and Performance Analysis
The CG method based on MOIDE model is tested on homogeneous and heterogeneous
hosts. The goal of the tests is to verify the adaptability of MOIDE-based computation and the
efficiency of the two-layer communication mechanism.
1. Tests on Homogeneous Hosts
The CG method is first tested on the cluster of four quad-processor SMP nodes under
different problem sizes n (A is an n×n matrix). The CG method requires the number of
processors to be a power of 2 in order to form a mesh structure. Fig 6.5 shows the execution
times of the CG method on one to sixteen processors, and the related speedups are drawn in
Fig 6.6. Higher speedup is obtained on larger problem sizes: about eight-fold speedup is
obtained on sixteen processors when n=90000.
Fig 6.5 Execution time of the CG method on homogeneous hosts
Fig 6.6 Speedup of the CG method on homogeneous hosts
The execution time breakdowns are depicted in Fig 6.7. The first four processors reside in
the same SMP node, so the communication between them is accomplished by local shared-data
access with low overhead. There is an obvious growth in the communication
cost from four to eight processors because two SMP nodes are used and remote
messaging is invoked for the vector transpose operations, whereas all reduction operations
can still be accomplished by shared-data exchange, as Fig 6.3 shows. Owing to the two-layer
communication mechanism, the communication overhead does not increase in proportion to
the problem size, and the performance improvement is more pronounced for large problem sizes.
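The saving from in-node reduction can be pictured with a minimal Java sketch. The class below is a hypothetical helper, not the actual MOIDE API: threads on one SMP node first combine their partial dot products through shared data, so only a single value per node has to travel by remote messaging.

```java
import java.util.concurrent.atomic.DoubleAdder;

// Sketch of in-node reduction over the shared-data layer (illustrative,
// not the MOIDE API). Threads on the same SMP node accumulate their
// partial results locally; only nodeSum() crosses the network.
public class NodeReduce {
    private final DoubleAdder local = new DoubleAdder();

    // called by each thread on this node (shared-data exchange)
    public void contribute(double partial) {
        local.add(partial);
    }

    // the single per-node value that is sent by remote messaging
    public double nodeSum() {
        return local.sum();
    }
}
```

Whatever the real buffer layout, the effect is the same: the number of remote messages per reduction drops from one per thread to one per node.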
Fig 6.7 Execution time breakdowns of the CG method on homogeneous hosts
2. Tests on Heterogeneous Hosts
The CG method has also been tested on a cluster of three heterogeneous hosts: two tri-processor
machines and one dual-processor machine, eight processors in total. The CG
method runs on the three hosts with the same problem sizes as in the previous test. One tri-processor
host is used in the tests on one and two processors, two hosts in the
case of four processors, and all three hosts for eight processors. The execution time and
speedup are displayed in Fig 6.8 and Fig 6.9. The results are similar to the test on
homogeneous SMP nodes. However, the performance on the heterogeneous hosts is a bit
lower than in the former test because more hosts are used for the same number of processors,
and the communication overhead is therefore higher, as the execution time breakdowns in Fig
6.10 show.
Fig 6.8 Execution time of the CG method on heterogeneous hosts
Fig 6.9 Speedup of the CG method on heterogeneous hosts
Fig 6.10 Execution time breakdowns of the CG method on heterogeneous hosts
3. Comparison with the single-threading method
To further verify the efficiency of the two-layer communication, the CG method is also
executed in single-threading mode, in which a single-threading compute engine is created on
each processor of the SMP nodes. In single-threading mode, remote messaging is the sole
communication method between compute engines, whether on the same or different SMP nodes.
This all-remote-messaging communication incurs heavy overhead. The single-threading
CG method is tested on two quad-processor SMP nodes. The execution time and
speedup in Fig 6.11 and Fig 6.12 show the poor performance of the single-threading method:
on eight processors it performs even worse than on four for problem sizes n=30000 and
n=50000 because of the heavy communication overhead.
Fig 6.11 Execution time of the single-threading CG method
Fig 6.12 Speedup of the single-threaded CG method
The single-threaded CG method can run on at most eight processors and a
maximum problem size of n=75000. The high resource consumption and heavy communication
overhead prohibit its execution on sixteen processors and with n=90000. The execution time
breakdowns in Fig 6.13 show the communication cost of the single-threading method. In
contrast to the breakdowns using the two-layer communication in Fig 6.7 and Fig 6.10,
the all-remote-messaging communication in the single-threading method results in high
communication overhead even on four processors within one SMP node; the communication time
exceeds the computation time on eight processors.
Fig 6.13 Execution time breakdowns of the single-threaded CG method
All runtime tests of the CG method demonstrate the communication efficiency of the
two-layer communication mechanism. The MOIDE model also provides the flexibility to
map the threads onto a logical mesh topology, which makes the CG method adaptive to
heterogeneous hosts. The MOIDE-based CG method can thus run efficiently in a uniform
form and achieve high performance on heterogeneous systems.
6.2 Radix Sort
Radix sort is a counting-based sorting algorithm that relies on the binary representation of
the elements. Let b be the bit length of the elements. Radix sort sorts a
sequence of elements in b/r rounds, where r < b. In each round it sorts the elements on an r-bit block,
starting from the least significant bits. It counts the number of elements having
the same value of the r bits (0 to 2^r−1) and decides each element's rank (new position) in the
sorted sequence. Then the elements are reordered according to their ranks. Radix sort must be
stable, i.e., the output ordering must preserve the input order of any two elements
having equal r-bit blocks. The computation in radix sort is simply the counting and
reordering of the elements. In parallel radix sort, every processor sorts an equal number of
elements, but an irregular all-to-all element exchange is performed in each round. Therefore
radix sort is a communication-intensive application.
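The per-round counting-and-ranking scheme can be sketched as a sequential LSD radix sort in Java. This is a minimal illustration of the algorithm described above, assuming non-negative integer keys; it is not the parallel MOIDE implementation.

```java
// Sequential LSD radix sort: b-bit non-negative keys sorted in b/r rounds
// of stable counting sort over r-bit blocks.
public class RadixSort {
    public static void sort(int[] a, int b, int r) {
        int buckets = 1 << r, mask = buckets - 1;
        int[] out = new int[a.length];
        for (int shift = 0; shift < b; shift += r) {
            int[] count = new int[buckets];
            for (int x : a)                         // count elements per r-bit value
                count[(x >>> shift) & mask]++;
            for (int i = 1; i < buckets; i++)       // prefix sums give the ranks
                count[i] += count[i - 1];
            for (int i = a.length - 1; i >= 0; i--) // backward scan keeps stability
                out[--count[(a[i] >>> shift) & mask]] = a[i];
            System.arraycopy(out, 0, a, 0, a.length);
        }
    }
}
```

The backward scan in the placement loop is what preserves stability: elements with equal r-bit blocks keep their relative order from one round to the next, which the correctness of LSD radix sort depends on.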
6.2.1 Parallel Radix Sort
In parallel radix sort [24], the rank of an element is its position in the entire
element sequence. A processor sends each element to its destination processor
according to the element's rank. Given n elements to be sorted on p processors, the
parallel algorithm is summarized in Fig 6.14.
for (i = 0; i < b/r; i++) do
    count local n/p elements on the i-th least significant r-bit block;
    sum count values;                      /* all-to-all reduction */
    rank the elements;
    scatter elements to target processors; /* all-to-all scatter */
    put the received elements into the local sequence;
end
Fig 6.14 Parallel radix sort
The parallel radix sort includes two global communication operations. One is the global
reduction that computes the sum of the counts over all processors. The reduction operation
broadcasts 2^r·p integers and has only a slight influence on performance. The other is the
all-to-all scattering that exchanges the elements among all processors to reorder them.
A processor sends local elements to other processors based on the ranks of the elements
and receives elements from the other processors. Fig 6.15 illustrates the scatter operation on four
processors. Each processor sends out and receives n/p elements (including those
elements that remain on the local processor) in the scatter operation. As the computation
workload is balanced over all processors, the irregularity of parallel radix sort lies in the
scatter operation, in which the number of elements exchanged between each pair of processors
varies widely. This scattering is the irregular communication occurring in every sorting round;
hence the scatter communication determines the performance of parallel radix sort.
Fig 6.15 All-to-all scattering of elements in parallel radix sort on four processors
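How unevenly the elements spread can be seen by tallying, for each local element, the processor that owns its rank. The sketch below is a hypothetical helper (not the MOIDE API); it assumes global ranks and n divisible by p.

```java
// Per-destination send counts for the scatter (illustrative helper):
// the element with global rank k goes to processor k / (n/p).
public class ScatterPlan {
    public static int[] sendCounts(int[] ranks, int n, int p) {
        int chunk = n / p;          // elements owned by each processor
        int[] counts = new int[p];
        for (int rank : ranks)
            counts[rank / chunk]++; // tally the destination of each element
        return counts;
    }
}
```

For typical inputs the resulting counts are far from uniform, which is exactly the irregularity discussed above: the cost of the scatter round is governed by the largest of these per-pair transfers.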
6.2.2 Radix Sort in MOIDE Model
The parallel radix sort is well suited to implementation on the MOIDE model. To preserve
stability in the sorting and reduce the communication overhead, the MOIDE-based radix sort
adopts different work modes of the threads in the sorting and scattering phases:
1. Independent sort
In the sorting phase, all threads (pseudo-engines) work in independent mode, each
thread sorting a subsequence of the elements. If, instead, a group of threads worked in cooperative
mode and cooperatively sorted a subsequence, synchronization would have to be
imposed on the element counts to ensure exclusive access to them. The synchronized
operations would produce high overhead and restrict the parallelism of the sorting.
2. Grouped scatter
The all-to-all scattering in each sorting round involves the exchange of
elements between P·(P−1) pairs of processors. Although the two-layer communication can
reduce the communication overhead, the scatter operation still produces high communication
overhead that affects the performance of parallel sorting, and this communication bottleneck
becomes more serious as the number of processors grows. The data sharing among the group
of threads in the same compute engine can be exploited to reduce the communication overhead
of the all-to-all scattering: the elements sent from all threads in one compute engine to the
threads in another compute engine are grouped together and sent to the destination in one
remote messaging operation. This is the grouped scatter operation. For example, if compute
engine i has pi threads and compute engine j has pj threads, the threads on engine i
would make pi·pj communication invocations to the threads on engine j if the scattering
were conducted separately between each pair of threads. With the grouped scatter, the
same communication is accomplished in one remote method invocation: the multiple
thread-to-thread invocations are replaced by a single communication between the pair of
compute engines. In the scattering phase, the threads work in cooperative mode, and the
grouped scatter is supported by the two-layer communication. The grouped scatter thus
effectively reduces the communication overhead of the radix sort.
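The grouping step itself can be sketched in a few lines of Java. The data layout below is illustrative, not the MOIDE API: the per-thread outgoing buffers of one compute engine are merged into a single payload per destination engine, so each engine pair needs one remote invocation instead of pi·pj.

```java
import java.util.*;

// Merge per-thread outgoing buffers into one payload per destination
// engine (hypothetical helper, not the MOIDE API). Input: for each local
// thread, a map from destination-engine id to the elements bound for it.
public class GroupedScatter {
    public static Map<Integer, List<Integer>> group(
            List<Map<Integer, List<Integer>>> perThreadOutgoing) {
        Map<Integer, List<Integer>> perEngine = new HashMap<>();
        for (Map<Integer, List<Integer>> threadBuf : perThreadOutgoing)
            for (Map.Entry<Integer, List<Integer>> e : threadBuf.entrySet())
                perEngine.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                         .addAll(e.getValue());
        return perEngine; // one remote method invocation per destination engine
    }
}
```

Because threads in one engine share an address space, this merge costs only local data movement, while the expensive remote messaging happens once per engine pair.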
6.2.3 Runtime Tests and Performance Analysis
The MOIDE-based radix sort is tested on clusters of SMP nodes, and its performance is compared
with a radix sort program implemented in C & MPI (MPICH 1.2.1).
1. Tests of the MOIDE-based Radix Sort
The MOIDE-based radix sort is first tested on a cluster of four dual-processor
nodes, eight processors in total. Table 6.1 lists the execution time of the radix sort with 1M to
10M elements on one to eight processors. As Table 6.1 shows, a two-fold performance
improvement occurs on two processors; this is the best improvement in the test.
The first two processors belong to one SMP node, so all data communication goes through
shared-data access.
n \ P      1        2        4        6        8
1M       8.547    4.668    4.604    4.494    3.49
2M      16.654    8.552    8.652    8.57     5.959
4M      32.593   16.919   17.275   16.1     10.971
6M      49.333   24.637   25.133   23.728   15.892
8M      66.05    33.243   33.191   32.137   21.341
10M     85.008   40.698   42.062   39.81    25.586
Table 6.1 Execution time of the MOIDE-based radix sort (seconds)
Fig 6.16 shows the execution time breakdowns of the radix sort. A burst of communication
cost appears when more than one SMP node is used. After that, the performance improves
only slowly on more processors because of the high communication cost; it even drops on
four processors due to the emergence of remote communication.
Fig 6.16 Execution time breakdowns of the MOIDE-based radix sort
Though the computation time decreases on more processors, the communication
cost rises with the number of processors. Moreover, the data exchange pattern in
radix sort is irregular, as Fig 6.15 shows: the number of elements sent from one processor
to another in the scatter operation varies greatly. The time cost of the scatter
operation is mainly determined by the largest data set being transmitted. The large-scale
irregular element scatter in each sorting round therefore limits the performance
enhancement. Nevertheless, the grouped scattering based on the two-layer
communication clearly benefits the performance of the MOIDE-based radix sort.
n \ P      1        2        4        6        8
1M       8.547    8.354    7.071    4.738    4.34
2M      16.654   16.923   11.784    8.712    7.44
4M      32.593   33.647   23.59    16.638   13.547
6M      49.333   51.315   37.072   24.214   24.112
8M      66.05    63.222   48.597   32.938   31.816
10M     85.008   76.567   60.412   40.897   37.373
Table 6.2 Execution time of the single-threading radix sort (seconds)
The advantage of the two-layer communication can also be demonstrated by comparison
with a single-threading radix sort: the MOIDE-based radix sort is run in
single-threading mode, in which two single-thread compute engines are created on each dual-processor
node and all communication is performed via remote messaging. Table 6.2 lists the
execution time of the single-threading radix sort.
Fig 6.17 Execution time breakdowns of the single-threading radix sort
The execution time of the single-threaded radix sort is clearly longer than that of the
MOIDE-based multithreading method because of the higher communication
overhead. Fig 6.17 shows the execution time breakdowns of the single-threading radix sort;
the all-remote-messaging communication produces high overhead in the
irregular scatter operation. There is no obvious performance improvement from one to two
processors, although these two processors reside on the same node. Beyond that, the method
sustains only a moderate performance enhancement on more processors and remains clearly
inferior to the multithreading method.
The comparison confirms the communication efficiency of the two-layer
communication mechanism. In fact, the two-layer communication mechanism implemented with
Java RMI can even outperform C & MPI in communication performance, as the following
test shows.
2. Comparison with MPI
A C & MPI radix sort program is used to compare the performance of the scatter
operations in MPI and in the MOIDE-supported two-layer communication. MPI is a widely
used communication library in parallel and distributed computing; MPICH 1.2.1 is used in
the test. The C & MPI program is tested on the same platform of four dual-processor
machines. The MPI package is compiled to use shared memory for fast message passing
within one node and TCP/IP for cluster communication. Table 6.3 lists the execution time of
the C & MPI program.
n \ P      1        2        4        6        8
1M       4.391    3.077    4.290    3.462    3.082
2M       8.774    6.082    8.216    6.779    5.812
4M      17.675   12.147   16.323   13.514   11.58
6M      26.635   17.975   24.646   19.901   17.138
8M      35.432   24.684   33.416   27.049   22.871
10M     43.380   30.763   41.051   33.655   27.594
Table 6.3 Execution time of the C & MPI radix sort (seconds)
Comparison with the execution time of the MOIDE-based radix sort in Table
6.1 shows that the MOIDE-based radix sort performs similarly to
the C & MPI program in all cases except on one and two processors. The execution time
breakdowns of the C & MPI program in Fig 6.18 reveal the cause of this
equivalence. Java programs are usually slower than C
programs; the test shows that the computation time of the MOIDE-based radix sort
in Java (see Fig 6.16) is nearly twice that of the C & MPI program (Fig 6.18). But the time
breakdowns in Fig 6.18 also show that MPI incurs higher communication time when running on two
or more nodes, so the performance improvement of the C & MPI program is poor on multiple
nodes. The MOIDE-based Java program, on the other hand, benefits from the efficient
two-layer communication: it has lower communication cost than the all-message-passing
MPI cluster communication and so achieves a higher performance improvement on more nodes.
The low-overhead communication enables the MOIDE-based Java program to reach
performance comparable to the C & MPI program, and even to outperform it on eight processors.
Fig 6.18 Execution time breakdowns of the C & MPI radix sort program
Fig 6.19 Execution time breakdowns of three radix sort programs: Java MOIDE-based (Java-M), Java
single-threading (Java-S) and C & MPI (C-MPI) in sorting 10M elements
One more radix sort program, the single-threading method, is tested as a contrast to the
MOIDE-based method. Fig 6.19 compares the execution time of the three radix sort
programs: the single-threading Java program (Java-S), the MOIDE-based Java program (Java-M),
and the C & MPI program (C-MPI) for problem size n = 10M. The C & MPI program is the
fastest in computation; its computation time is lower than that of the two Java programs. All
communication in the single-threading Java and C & MPI programs is by message
passing. The single-threading Java program has the highest communication overhead, which
indicates that the communication cost of RMI-based
remote messaging is higher than that of MPI. However, the two-layer communication of the
MOIDE model combines fast shared-data access with the slower remote messaging, yielding
an integrated communication performance as high as, or even higher than, that of MPI in C
on a cluster of SMP nodes. This is a significant achievement of the two-layer
communication mechanism and the MOIDE model in supporting irregular communication.
3. Test on a Larger System and Comparison with MPI
To examine the performance on a larger system, the MOIDE-based and C & MPI radix sort
programs are tested on the cluster of four quad-processor SMP nodes with problem size
n = 10M. As the execution time breakdowns in Fig 6.20 show, the MOIDE-based Java
program reaches a peak performance on the four processors of one SMP node, owing to the low
communication cost of all-shared-data access; this peak matches the performance
of the C & MPI program on fourteen processors. The growing communication overhead
determines the performance of the radix sort on multiple SMP nodes: although the
computation time continues to decrease on more processors, the communication cost
raises the overall execution time of the MOIDE-based radix sort. Its performance on
sixteen processors nevertheless eventually surpasses the result on four processors.
For the MPI program, MPICH can support fast message passing by using shared memory
on a single SMP node. However, the portable MPICH does not yet support multi-protocol
communication; it can use only one "device" at a time [76]. The communication between
the clustered SMP nodes cannot utilize both shared memory and message passing at the same
time. Thus the MPICH library must be compiled twice on a cluster of SMP nodes with
different device options: once to utilize shared memory for fast inter-process
communication within one SMP node, and once with the ch_p4 device option to support
cluster communication over TCP/IP sockets. Moreover, application executables are not
compatible across MPI devices; the applications need recompilation for each
MPI version (shared memory or message passing).
Fig 6.20 Execution time breakdowns of two radix sort programs on four quad-processor SMP nodes:
Java MOIDE-based program (Java-M) and C & MPI program (C-MPI) in sorting 10M elements
Fig 6.21 compares the communication costs of the two radix sort programs. The first four
processors belong to one SMP node. In the C & MPI program, the communication on two
and four processors goes through a single copy over shared memory; even so, its cost
is still higher than the local data exchange of the two-layer communication in the MOIDE model.
The software architecture of MPICH is multi-layered. To achieve portability,
all MPI functions are implemented in terms of the macros and functions that make up the
ADI (Abstract Device Interface). The ADI is in turn implemented in terms of a lower-level interface
called the channel interface, which has multiple implementations for different
hardware platforms, e.g., the implementation on shared-memory systems [76]. Data
communication between the processes on an SMP node thus crosses these multiple layers to reach the
buffer where the data read/write is carried out.
In the two-layer communication mechanism of the MOIDE model, data communication
within a local SMP node is fulfilled directly by data access via a runtime buffer created
in user space. When a thread calls the communication interface to communicate with a local
thread, the communication mechanism allocates a buffer in the user space, and the
communication operation is finished instantly by data exchange through the buffer,
which has lower overhead than the multi-layer buffering in MPI. Therefore, the
communication cost of the MOIDE-based Java program is much lower than that of the MPI program
on two and four processors in one SMP node. Moreover, the shared memory communication
in MPICH cannot cross SMP nodes. The user has to indicate which MPI library should
be used, the shared memory version or the message passing version; the two versions cannot be
used at the same time. When running on a cluster of SMP nodes, the other MPICH library, compiled
with the ch_p4 option, must be used to support the message passing; MPICH cannot automatically
switch from one device to another. The two-layer communication
mechanism in the MOIDE model, on the other hand, adaptively integrates shared memory
access and remote messaging on a cluster of SMPs, transparently deciding the proper
communication path. Despite the drastic growth of the communication time on multiple SMP
nodes, the two-layer communication implemented in Java still maintains a
communication cost comparable to C & MPI, and even lower than that of the C & MPI program on
eight, fourteen, and sixteen processors, as Fig 6.21 shows.
Fig 6.21 Communication costs of two radix sort programs on four quad-processor SMP nodes: Java
MOIDE-based program (Java-M) and C & MPI program (C-MPI) in sorting 10M elements
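The user-space buffer on the local level can be pictured as a simple in-process channel. The class below is a minimal Java sketch with illustrative names, not the MOIDE implementation: a send and a matching receive by two threads on the same node meet at a queue allocated in user space, with no remote messaging involved.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// In-process channel standing in for the shared-data layer (hypothetical
// helper, not the MOIDE API): two threads on the same node exchange data
// through a user-space buffer instead of a remote message.
public class LocalChannel<T> {
    private final BlockingQueue<T> buffer = new LinkedBlockingQueue<>();

    public void send(T data) {      // local write into the shared buffer
        buffer.add(data);
    }

    public T receive() throws InterruptedException {
        return buffer.take();       // local read; blocks until data arrives
    }
}
```

A two-layer mechanism would hand a pair of local threads such a channel and fall back to remote messaging only when the peer lives on another node, which is the transparent path selection described above.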
It can be observed in Fig 6.21 that the communication cost of the MOIDE-based radix
sort fluctuates as the number of processors increases. The fluctuation is caused by the
combination of communication paths on the two levels. The communication cost is reduced when
more processors are used on the same SMP node, such as when going from six to eight or
from ten to twelve processors. But the communication cost goes up from four to six and from eight to ten
processors because one more SMP node is involved. Nevertheless, the grouped scattering over
the two-layer communication sustains the improvement of communication efficiency
above ten processors.
This chapter has addressed two irregularly structured applications, CG and radix sort, both
characterized by high communication requirements. The two-layer
communication mechanism demonstrates its benefit in the CG method and radix sort developed on the
MOIDE model, achieving high performance in
large-size all-to-all communication. The CG method also demonstrates the flexibility of the
MOIDE model on heterogeneous systems: the threads in independent mode can be flexibly
organized into a proper structure and dynamically mapped onto the heterogeneous hosts
according to the architecture. Hence MOIDE-based applications can be developed and run in
an identical infrastructure on heterogeneous systems.
Chapter 7
Related Work
The widespread use of distributed systems for high-performance computing has
attracted substantial research effort in developing support software that integrates system-wide
computing resources and creates efficient computing infrastructures for developing and
executing applications in various fields. There are also research projects on algorithms and
software for solving irregularly structured problems. This chapter gives an overview of
recent developments in these two areas; the distinction of this research is clarified through
comparison with the related work.
7.1 Software Infrastructures on Distributed Systems
There is a trend to integrate computers at different sites into a large-scale, powerful
computing system to satisfy the demand for high-performance computation in different
application areas. The range of cluster computing has been expanding from a lab or
organization to a wide area spanning geographical distances, forming a huge clustered
system called the Grid [64,65]. Many projects develop computational models, programming
environments, and support tools for distributed computing on such large-scale
distributed systems.
7.1.1 Millennium
The Millennium project [25,26] aims to develop and deploy a hierarchical campus-wide
"cluster of clusters" at UC Berkeley, consisting of single-processor workstations, SMP
servers, and NOWs (Networks of Workstations) in different departments, to support advanced
applications in scientific computing, simulation, and modeling. The individual desktop and
departmental SMP server levels are incorporated into a local high-performance cluster of SMPs
(called a CLUMP), which utilizes extensions of the communication, system, and
programming technologies developed in the Berkeley NOW project [27]. The CLUMPs are
further organized into a large campus CLUMP. The entire collection of clusters will be
interconnected across campus with Gigabit Ethernet links to form a large cluster of clusters of
SMPs, called an intercluster.
A multiprotocol programming environment will be deployed on grouped CLUMPs to exploit
the SMP hardware for sharing data between processors on an SMP and to utilize the high-performance
interconnect between clusters. A Java-based version of Active Messages [28]
using the Java Native Interface is implemented to provide high-bandwidth I/O and
communication from a Java-based environment. The software system in Millennium is
constructed as a composition of services. A software platform for scalable, customizable
internet services is developed and deployed as the fundamental system infrastructure. Each
node in the cluster is provided with an iSpace execution environment on a JVM.
Communication is through their own Remote Method Invocation, which provides secure RMI,
multicast RMI, UDP-RMI, and fast RMI. Collections of iSpaces are grouped into a
MultiSpace; a service pushed into the MultiSpace is automatically scaled across the
cluster. A global software layer will be constructed to provide support for remote execution,
load balancing, and batch processing of parallel programs.
7.1.2 Globus
Globus [29,30] is a multi-institutional project centered at Argonne National Laboratory
and University of Southern California. It focuses on enabling the application of Grid concepts
to scientific and engineering computing. A set of services and software libraries, called
Globus Toolkit, is developed to support Grids and Grid applications. It includes software for
security, information infrastructure, resource management, data management, communication,
fault detection, and portability. The services can be used either independently or
together to develop grid applications and programming tools. The following are some
examples of the services provided in the Globus Toolkit:
(1) The Globus Resource Allocation Manager (GRAM) provides resource allocation and
process creation, monitoring, and management services.
(2) The Metacomputing Directory Service (MDS) provides a uniform framework for
providing and accessing system configuration and status information such as compute
server configuration, network status, or the locations of replicated datasets.
(3) Global Access to Secondary Storage (GASS) implements a variety of automatic and
programmer-managed data movement and data access strategies, enabling programs
running at remote locations to read and write local data.
(4) Nexus and globus_io provide communication services for heterogeneous environments,
supporting multimethod communication, multithreading, and single-sided operations.
7.1.3 AppLeS
AppLeS stands for Application Level Scheduler [31,32], a project at the University of
California, San Diego. The goal of AppLeS is to provide mechanisms and paradigms that
perform resource configuration and load scheduling to achieve a performance-efficient
implementation of an application on a distributed heterogeneous system. The AppLeS project has
two main parts:
(1) Application-Level Scheduling agents [33] are developed to provide a mechanism for
scheduling individual applications at machine speeds on production heterogeneous
systems. AppLeS agents utilize the Network Weather Service (NWS) [34] to monitor the
varying performance of resources potentially usable by applications. Each AppLeS uses
static and dynamic application and system information to select viable resource
configurations and evaluate their potential performance. Then AppLeS interacts with the
relevant resource management system to implement application tasks. Once it contains an
embedded AppLeS agent, the application becomes self-scheduling.
(2) AppLeS templates are stand-alone software projects that perform automatic scheduling
and deployment tasks for classes of structurally similar applications. Templates build on
the expertise gained while developing AppLeS agents, with a main focus on reusability
across several applications. There are currently two template projects: the Parameter Sweep
Template [35] and the Master Slave Template [21].
7.1.4 JavaPorts
JavaPorts [36,37] is an environment to facilitate distributed component computing on
clusters of workstations, developed at Northeastern University, USA. It is composed of an
application programming interface and a set of tools for the development of modular,
reusable, parallel and distributed component-based applications for cluster computing. The
JavaPorts project aims at providing the application developer with: (1) the capability to easily
create reusable Java software components for the concurrent tasks of an application; (2)
anonymous message passing among tasks while hiding the details of the communication and
coordination; (3) tools for the definition, assembly and reconfiguration of concurrent
applications using pre-existing and/or new software components.
The JavaPorts system exploits the advantages of workstation clusters which, in conjunction with
recent advances in networking technologies, have emerged as a cost-effective alternative to
expensive supercomputers for coarse-grain parallel computation. It also follows the
following principles: (1) cluster computing independent of the memory model
(shared vs. distributed); (2) separation of the coordination details from the computational
aspect of an application; (3) platform independence by using the Java technology as the
underlying implementation language; (4) object-oriented parallelism with modularity and
reusability.
7.1.5 Comparison with MOIDE Model
The capabilities of the projects above in supporting heterogeneous computing can be
assessed by the following criteria, which also allow MOIDE to be compared with these
projects. The comparison is summarized in Table 7.1.
Resource selection: manage and select resources from the collection of available
resources in a distributed system to execute an application.
Heterogeneity support: support platform-independent computing that hides but
utilizes the heterogeneous hosts.
Architecture adaptation: map the computation onto the hosts in adaptive mode to
match the specific architecture for high-performance computing.
Multi-level communication: support multi-method communication on different levels
of heterogeneous system to raise communication efficiency.
Dynamic reconfiguration: reorganize the computation to meet the change in the state
of the available resources.
Utilization scope: the range of distributed systems and applications to which the
project has been applied.
                           Millennium  Globus  AppLeS  JavaPorts  MOIDE
Resource selection         Yes         Yes     Yes     No         Yes
Heterogeneity support      Medium      High    High    Medium     High
Architecture adaptation    No          Yes     No      No         Yes
Multi-level communication  Some        Yes     No      No         Yes
Dynamic reconfiguration    No          Yes     Yes     No         Yes
Utilization scope          Wide        Wide    Wide    Small      Medium
Table 7.1 Comparison of the related work in supporting heterogeneous computing
Comparison with the related work shows that the MOIDE model covers many of the
same aspects as those projects. Like Millennium, Globus and AppLeS, MOIDE provides a
mechanism that integrates the resources in distributed systems and creates a flexible
infrastructure for high-performance computing on heterogeneous hosts. MOIDE provides
an efficient two-layer communication mechanism for interactions between a group of threads
and distributed objects, comparable to the communication services provided in Globus. It
also supports application-level load scheduling, as AppLeS does. As indicated in section
7.4.3, the autonomous load scheduling in MOIDE is more appropriate for applications such
as ray tracing than the master/slave scheduling in AppLeS. Implemented in Java, the MOIDE
model presents a flexible system infrastructure that facilitates object-oriented and
architecture-independent computing on various platforms, as JavaPorts does.
The MOIDE model emphasizes flexibility, adaptability and computational efficiency
on heterogeneous architectures. MOIDE incorporates the object-oriented and multithreading
methodologies to support efficient computing on hybrid platforms. Applications developed
in the MOIDE model can be adaptively mapped onto different hosts in a mode that realizes
high-performance computing on the given architecture. The model also integrates shared-data
access and remote messaging to implement efficient communication on heterogeneous
systems. The autonomous load scheduling requires no dedicated load scheduler or explicit
load scheduling operations; it automatically achieves load balancing across distributed
objects and threads. The MOIDE runtime support system has been developed as a
multi-functional support system that implements distributed computing based on the MOIDE
model. Although the runtime tests were conducted on small clusters, the MOIDE model is
also applicable, and the runtime support system also executable, on wide-area distributed
systems.
7.2 Programming Models on Cluster of SMPs
Several projects develop programming environments and models for clusters of SMPs.
These models combine varied programming methodologies suited to the hybrid architecture
of SMP clusters.
7.2.1 KeLP
KeLP (Kernel Lattice Parallelism) [38,39] is a programming model for implementing
portable scientific applications on distributed-memory parallel computers, developed at
UCSD. It provides a set of programming abstractions to represent the data layout and data
motion patterns in block-structured scientific calculations on SMP clusters, e.g., finite
difference methods and multiblock methods for solving partial differential equations. The
KeLP run-time system [40] implemented as a C++ library supports the general blocked data
decompositions and manages low-level implementation details such as message-passing,
processes, threads, synchronization, and memory allocation.
KeLP separates the description of communication patterns from the interpretation of the
patterns using a model known as communication orchestration. KeLP provides an easily
understood model of locality that ensures that applications achieve portable performance
across a diverse range of platforms. KeLP model reflects the multi-level interconnections on
SMP clusters. KeLP abstractions help manage each level independently and manage the
interaction between the levels.
7.2.2 SIMPLE
EXPAR (Experimental Parallel Algorithmics) concentrates on high-level, architecture
independent, algorithms that execute efficiently on general-purpose parallel machines [41,42].
It develops a methodology, called the SIMPLE model, for high-performance programming on
clusters of SMP nodes. The methodology is based on a small kernel of collective
communication primitives that make efficient use of the hybrid combination of the
shared-memory paradigm within an SMP node and the message-passing paradigm between
nodes. The communication primitives are grouped into three modules:
(1) Internode Communication Library (ICL) provides an MPI-like small kernel for internode
communication
(2) SMP Node Library contains three primitives for SMP node (barrier, broadcast, and
reduce)
(3) SIMPLE Communication Library is built on the ICL and SMP Node libraries. It includes
the primitives for the SIMPLE model: barrier, reduce, broadcast, allreduce, alltoall, gather,
and scatter.
Some portable parallel programs and data sets have been implemented in the SIMPLE
model, e.g., combinatorial computing and image processing.
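The hybrid idea behind SIMPLE, reducing within each SMP node through shared memory first and then combining the per-node partial results across nodes, can be sketched as follows. This is a single-process Java simulation written purely for illustration; it is not SIMPLE's actual library, and the class and method names are invented here.

```java
// Illustrative two-level reduction in the spirit of SIMPLE: an
// intra-node reduce over each node's values (shared memory), followed
// by an internode combine of the per-node partial sums (which SIMPLE
// would perform with message passing).
public class TwoLevelReduce {
    public static int reduce(int[][] perNodeValues) {
        int[] nodePartial = new int[perNodeValues.length];
        for (int n = 0; n < perNodeValues.length; n++) {
            int s = 0;
            for (int v : perNodeValues[n]) s += v;   // level 1: within the SMP node
            nodePartial[n] = s;
        }
        int total = 0;
        for (int s : nodePartial) total += s;        // level 2: across the nodes
        return total;
    }

    public static void main(String[] args) {
        // Two simulated nodes holding {1,2} and {3,4}.
        System.out.println(reduce(new int[][]{{1, 2}, {3, 4}}));  // 10
    }
}
```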
7.2.3 Comparison with MOIDE model
Focusing on support for high-performance computing on clusters of SMPs, the related
work can be compared using the following criteria:
Architectural transparency: the applications can be developed in identical model that
is independent from the architecture of the hosts.
Adaptive mapping: map the computation adaptively onto the hosts at run-time based
on the specific architecture—cluster of single-processors, SMP nodes, or mixed ones.
Combined communication: integrate the communication approaches on different
levels to accelerate the communication.
Unified interface: provide unified programming and communication interface to the
applications.
Application field: the kinds of applications whose development the model supports.
                            KeLP                           SIMPLE   MOIDE
Architectural transparency  No                             No       Yes
Adaptive mapping            Medium                         Low      High
Combined communication      Yes                            Yes      Yes
Unified interface           No                             Yes      Yes
Application field           Block-structured applications  Limited  Irregularly structured applications
Table 7.2 Comparison of the programming models on cluster of SMPs
Table 7.2 compares the three programming models. MOIDE is a widely usable distributed
computing model that supports the development and execution of various applications on
heterogeneous systems, not limited to SMP clusters. Unlike the KeLP model, which is
dedicated to block-structured computations, the MOIDE model is suitable for developing
different kinds of applications, in particular irregularly structured ones. The hierarchical
collaborative system in MOIDE is a runtime infrastructure that maps the computation and
communication patterns of an application onto the architecture of the underlying hosts so as
to achieve efficient execution on the hardware platform. The MOIDE model provides a
unified communication
interface at the application level based on the two-layer communication mechanism.
Regardless of the physical communication path, the threads in distributed objects can
interact with each other through the same communication interface. The MOIDE runtime
support system implicitly chooses the communication path, either shared-data access or
remote messaging, and completes the communication.
7.3 Methodologies for Irregular Structured Problems
There is also research work on developing programming languages, data structures, and
algorithms for solving irregularly structured problems. Due to the variety of irregularly
structured problems and the diversity of their characteristics, each individual project can
only study certain aspects of the techniques and specific irregularly structured applications.
7.3.1 IPA
IPA is the Irregular Parallel Algorithms project at the University of North Carolina. The
project proposes nested data parallelism to express the irregular computations and
investigates the incorporation of nested data parallelism in programming languages including
Fortran (Fortran 95/HPF) and Java [43,44]. Two approaches are used to cope with the load
imbalance in the nested data-parallel computations: using fine-grained threads and thread
migration to balance load, and flattening nested parallelism through compilation techniques
to create an unnested data-parallel computation that performs the correct amount of work,
optimally balanced over all processors. A runtime support library largely based on Fortran's
intrinsic functions and the routines in HPFLIB is created to support the data-parallel
computations on supercomputers. Parallel implementations of a Conjugate Gradient method
for unstructured sparse linear systems and a Barnes-Hut N-body simulation have been
constructed in Fortran 90 using the flattening transformations.
7.3.2 Scandal
The Scandal [45] project at CMU develops a portable, interactive environment for
programming a wide range of supercomputers. The two main goals of the project are:
(1) Developing a portable parallel language NESL and associated environment
NESL [46,47] is an applicative parallel language intended to be used as a portable
interface for programming a variety of parallel and vector supercomputers, and as a basis for
designing parallel algorithms. Parallelism is supplied through a set of data-parallel constructs
that manipulate collections of values.
(2) Developing fast implementations of parallel algorithms for irregular problems
Algorithms for various irregular problems are implemented on different parallel machines.
The algorithms are often written as prototypes in NESL and then machine-specific code is
written to study how algorithm and architecture interact. The existing theoretical algorithms
are studied to determine which ones can be mapped well onto existing parallel machines and
communication topologies, and what aspects are important in getting efficient
implementations. For example, the sorting problem has been studied extensively [24, 48].
Other algorithms are studied including the algorithms for finding the convex-hull of a set of
points [49], for finding the connected-components of a graph [50], for finding the union,
intersection, and difference of ordered sets [51], and for solving irregular linear systems [52].
7.3.3 Comparison with MOIDE model
MOIDE establishes an object-oriented computing infrastructure usable on different
systems such as multiprocessors, clusters of workstations, clusters of SMPs, and hybrid
systems of heterogeneous hosts. The hierarchical collaborative system and other related facilities
provide support for the development of efficient algorithms for solving various irregularly
structured problems. MOIDE model has demonstrated flexibility and efficiency in the
implementations of four irregularly structured applications described in the thesis. It also
supports the implementation of dedicated techniques for specific applications such as the
distributed tree structure in the N-body method and the grouped scatter in the radix sort.
7.4 Irregularly Structured Applications
Some benchmark packages include irregularly structured applications. More applications
are studied individually in different projects.
7.4.1 SPLASH-2 Programs
SPLASH-2 [53] is a suite of parallel programs released to study centralized and distributed
shared-address-space multiprocessors. It contains twelve applications that represent a wide
range of computations in the scientific, engineering and graphics domains. All programs are
based on the shared-memory model. Singh's N-body method introduced in Section 4.1 is
implemented as the Barnes application in SPLASH-2. Other irregular applications in
SPLASH-2 include:
FMM: Fast Multipole Method for N-body problem
Radiosity: compute the equilibrium distribution of light in a scene
Radix: radix sort
Raytrace: render a three-dimensional scene using ray tracing
Volrend: render a three-dimensional volume using a ray casting technique.
The MOIDE-based N-body method is developed with the distributed object approach.
An adaptive distributed tree structure is designed as a communication-efficient data
structure for solving the N-body problem on heterogeneous systems. Hence the MOIDE-based
method can run on both shared-address-space and distributed systems.
7.4.2 N-body Problem
The N-body problem has been broadly studied for a long time. Many N-body algorithms
have been proposed for different architectures and data structures. Here are some examples
of distributed N-body methods.
7.4.2.1 Salmon’s method
Salmon’s method is a parallel hierarchical N-body algorithm [7,75,79]. The method is
designed based on space decomposition and a hierarchical tree structure. A space is
recursively divided into domains. Local essential tree is built for each domain based on the
decomposition. As every body needs a fraction of the global tree for the force computation,
local essential tree is the union of the tree fractions required by all bodies in a domain.
Salmon’s method uses keys and hash table to describe the topology of a tree. Each cell in the
tree is assigned a key which is generated from the spatial coordinate of the cell. The
translation of keys into memory locations where the cell data is stored is achieved via hash
table lookup.
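The key scheme can be illustrated with a small sketch. Salmon's method derives each cell's key from its spatial coordinates; a common way to do this, assumed here purely for illustration and not taken from [7,75,79], is to interleave the bits of the integer cell coordinates into a Morton-style key and then use the key for hash-table lookup of the cell data.

```java
// Illustrative sketch: a 2-D Morton-style key interleaves the bits of a
// cell's integer coordinates so that each cell maps to a unique key; a
// hash table then maps keys to cell data.
import java.util.HashMap;
import java.util.Map;

public class TreeKeys {
    // Interleave the low 16 bits of x and y into a 32-bit Morton key.
    public static long mortonKey2D(int x, int y) {
        long key = 0;
        for (int i = 0; i < 16; i++) {
            key |= (long) ((x >> i) & 1) << (2 * i);      // x bits at even positions
            key |= (long) ((y >> i) & 1) << (2 * i + 1);  // y bits at odd positions
        }
        return key;
    }

    public static void main(String[] args) {
        Map<Long, String> cells = new HashMap<>();
        cells.put(mortonKey2D(3, 5), "cell data");        // store a cell under its key
        System.out.println(cells.get(mortonKey2D(3, 5)));  // cell data
    }
}
```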
In principle, the construction of the local essential tree follows a similar idea to the partial
subtree in this thesis. However, the local essential tree is a union subtree that contains all
information required in the force computation for the bodies in a domain. To decide which
data from one domain is essential to another, the method must inspect the distances from
each body to the other domains. This data inspection and the construction of the local
essential trees are more costly than the partial subtree scheme in this thesis. The heavy cost
may not be a serious problem since Salmon’s method was implemented on supercomputers.
Our N-body method, however, aims at running on distributed systems such as clusters,
where more attention must be paid to the communication overhead in the algorithm. The
adaptive partial subtree balances the data requirements against the communication cost,
making it an appropriate data structure for an N-body method in a distributed environment.
As stated in their papers, the motivation for using keys and a hash table in Salmon’s method
came from the difficulty of representing a distributed adaptive tree with pointers in
traditional languages such as Fortran 90 and HPF, especially when referring to cells in a
separate memory space on another processor. Our method, in contrast, is based on a
distributed object model implemented in Java. The object-oriented approach makes it
convenient to represent and reference complicated or remote data structures such as the
subtree and the partial subtree. Our method directly uses remote method invocation and
object references to transfer data objects, including the tree structure, between the compute
engines. Therefore the MOIDE-based N-body method is more flexible and adaptive on
distributed and heterogeneous systems.
7.4.2.2 Grama’s method
Grama presented a parallel implementation of the Barnes-Hut method on a message-passing
computer in [17]. In his method, a 2D physical domain is partitioned into subdomains.
The particles in each subdomain are assigned to one processor. A local tree is
constructed per processor and then all local trees are merged to form a global tree. All nodes
above a certain cut-off depth in the global tree are broadcast to all processors. Grama’s
method for 2D space ran on a 256-processor nCUBE2 parallel computer.
In the MOIDE-based N-body method, the distributed tree structure requires no global tree.
Instead, partial subtrees are built for each compute engine according to the distance between
two sub-spaces. This avoids the collective communication in building the global tree. It also
raises data availability in the partial subtrees during the force computation so as to
effectively reduce remote subtree access. The distributed tree structure in the MOIDE-based
N-body method is better suited to distributed-memory systems.
7.4.2.3 Methods based on general data structures
(1) PTREE
The object-oriented support for adaptive methods in [54] is based on a general-purpose
data structure layer implemented in C++. It provides a global data structure PTREE that is
implemented as a collection of local data structures on distributed-memory machine. The data
structure is distributed to multiple processors where computations are carried out and the
partial results are merged. The global data structure can support different applications. A
gravitational N-body simulation is implemented on the global data structure. The application
has been tested on a 64-node iPSC/860 machine.
(2) Liu’s method
Liu described a parallel C++ N-body framework that supports various scientific
simulations involving tree structures [55]. The framework consists of three layers: (1) the
generic tree layer supports simple tree construction and manipulation methods, and system
programmers can build special libraries using classes in this layer; (2) the Barnes-Hut tree
layer supports the tree operations required in most N-body tree algorithms; (3) the
application layer implements a gravitational N-body application upon the Barnes-Hut tree
layer. The communication library is implemented in MPI. The application was executed on
a cluster of four Ultra SPARC workstations connected by a fast Ethernet network.
Differing from the PTREE and Liu’s methods, the MOIDE-based N-body method is based on
a dedicated distributed tree structure, which is more efficient for the distributed N-body
method. Furthermore, the distributed tree structure is built on the hierarchical collaborative
system, so it is adaptive to heterogeneous system architectures. It is a more flexible data
structure suitable for various platforms.
7.4.3 Ray tracing
Usually the load scheduling scheme for parallel ray tracing based on message passing is a
centralized approach that uses a dedicated process as the load scheduler to dynamically
allocate the rendering tasks. There are also distributed load scheduling approaches that allow
the processes to balance the tasks among neighboring processes.
7.4.3.1 MPIPOV
MPIPOV [56] is an MPI parallel implementation of the public three-dimensional
rendering engine POV-Ray [57] on a cluster of PCs. It adopts two load scheduling
approaches: static task partitioning for homogeneous distributed architectures, and dynamic
load balancing for heterogeneous architectures. The dynamic load balancing approach is a
master/slave scheme: a dedicated master process, which performs no rendering itself, assigns
rendering tasks to the other processes in response to their requests. The master/slave
approach follows from the MPI programming methodology.
7.4.3.2 Ray-tracing in AppLeS
The parallel ray-tracing application in AppLeS is used to study application scheduling
policies under heterogeneous, time-varying resources and varied workload distributions [21].
The ray-tracing application is based on PVMPOV [58], a PVM implementation of POV-Ray.
Similar to MPIPOV, it uses two master/slave scheduling strategies. One is static fixed
distribution scheduling, which assigns all rendering blocks to the slaves at the beginning of
the computation. The other is dynamic work queue scheduling: the master process assigns
blocks one at a time to slaves that have finished processing a block and request more work.
7.4.3.3 Diffusion Load Balancing in Ray Tracing
A dynamic load balancing strategy based on a diffusion model is proposed for parallel ray
tracing in [59]. In the diffusion-based strategy, a load balancing operation is initiated by a
processor that runs out of work. The load balancing procedure is as follows:
If processor pi runs out of work, it sends a “request status” message to every immediate
neighbor processor pj. Then each pj sends a “status report” to pi that reveals its current
workload and number of pending rays. After receiving all status reports, pi sends a “transfer
work” message to every neighbor whose workload is above the average workload among the
neighbor processors. Each of the neighbors pj transfers a designated amount of work from its
queue of pending work to pi in a series of messages. After receiving all of the work, pi sends
an unsolicited amount of work to every neighbor that reported less than the average workload
among the set.
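A single diffusion step for the idle processor can be sketched as follows. The sketch only computes how much work each overloaded neighbor should transfer; the actual message exchange is omitted, and the names are illustrative rather than taken from [59].

```java
// Toy sketch of one diffusion load-balancing step: the idle processor
// has gathered its neighbors' reported loads, and each neighbor whose
// load is above the neighborhood average sends its excess.
public class DiffusionStep {
    // Returns how much work each neighbor transfers to the idle processor.
    public static int[] transfers(int[] neighborLoads) {
        int sum = 0;
        for (int l : neighborLoads) sum += l;
        int avg = sum / neighborLoads.length;            // average neighborhood load
        int[] t = new int[neighborLoads.length];
        for (int i = 0; i < neighborLoads.length; i++)
            t[i] = Math.max(0, neighborLoads[i] - avg);  // only overloaded neighbors send
        return t;
    }

    public static void main(String[] args) {
        // Neighbor loads 10, 2, 6 -> average 6 -> only the first sends 4 units.
        int[] t = transfers(new int[]{10, 2, 6});
        System.out.println(java.util.Arrays.toString(t));  // [4, 0, 0]
    }
}
```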
7.4.3.4 Comparison with MOIDE-based method
Compared with the parallel ray tracing methods above, the MOIDE-based ray tracing
method has two distinguishing features.
1. Free from dedicated load scheduler
The centralized master/slave load scheduling is the widely used dynamic approach for
ray tracing based on message-passing methodologies such as MPI. The diffusion load
balancing strategy requires complicated message exchange among processors. A processor
may be simultaneously involved in more than one load balancing operation, depending on its
interconnection with neighboring processors. The diffusion strategy may therefore produce
high intercommunication overhead and cause workload oscillation among the processors;
this approach is more useful for theoretical study than for practical use. MOIDE provides a
dynamic load scheduling scheme, autonomous load scheduling, which the MOIDE-based ray
tracing method adopts. The object-oriented and one-sided communication features of the
MOIDE model support the implementation of this scheme. The scheme requires no master:
each thread independently performs on-demand task fetching from the global task pool,
without interfering with other threads. The autonomous load scheduling automatically
achieves workload balance at runtime with low system overhead. In section 5.3, a test
compares the performance of autonomous load scheduling and master/slave scheduling and
verifies the advantage of autonomous scheduling.
2. Different from other load balancing strategies
The paper [80] summarized five representative dynamic load-balancing strategies on
networked systems. These strategies can be briefly described as follows:
1. Gradient Model: every processor interacts only with its immediate neighbors. Lightly
loaded processors inform other processors of their state, and overloaded processors
respond by sending a portion of their load to the nearest lightly loaded processor.
2. Sender-initiated Strategy: the load distribution is initiated by the overloaded processor
(sender), which tries to send a task to an underloaded processor (receiver).
3. Receiver-initiated Strategy: the underloaded processor (receiver) initiates load
balancing by requesting a certain amount of load from immediate overloaded neighbors.
4. Central Task Dispatcher: one of the network processors acts as a centralized job
dispatcher. The dispatcher keeps a table containing the number of waiting tasks in each
processor. Based on this table, the dispatcher notifies the most heavily loaded processor
to transfer tasks to a requesting processor.
5. Prediction-based strategy: use the predicted process requirements, e.g., CPU, memory
and I/O, to achieve load balancing.
The autonomous load scheduling is different from the strategies above. It can be viewed
as a receiver-initiated strategy, but without a specific sender or dispatcher. Every process
decides and performs the load fetching by itself. There is no pre-allocation of tasks before
execution, nor any dedicated load balancing operation. All load allocation happens during
the execution: the load is automatically and gradually allocated to the processors according
to the computation progress on them. The autonomous load scheduling can be implemented
based on the remote method invocation and one-sided communication features of MOIDE. It
is an efficient load scheduling scheme for data-independent computations such as ray
tracing; of course, it is not suitable for applications with high data dependency. It also
differs from the task-stealing strategy because it conducts neither pre-execution task
allocation nor task requests to neighboring processors: when a processor has completed a
task, it directly fetches the next task from the global pool.
Chapter 8
Conclusions
This thesis has presented MOIDE, a distributed object-oriented and multithreading model
for solving irregularly structured problems. A runtime support system has been developed to
implement computations in the MOIDE model on distributed systems, and the model has
been used to implement several irregularly structured applications. These applications have
demonstrated the flexibility and efficiency of the MOIDE model in supporting various
computation and communication patterns on heterogeneous systems.
8.1 Summary of Research
MOIDE is a distributed object model that responds to the broadly arising requirements for
high-performance computing methodologies on varied distributed systems. MOIDE
provides a flexible software infrastructure that is adaptive to the underlying system
architecture. It combines object-oriented and multithreading techniques to support efficient
computing on heterogeneous platforms such as clusters and large-scale Grids.
The kernel of MOIDE model is the collaborative system. It is a runtime infrastructure to
support object-oriented distributed computing. The basic collaborative system is constructed
with the objects created on the distributed hosts that have been selected to run an application.
The distributed objects in the collaborative system include one compute coordinator and a
group of compute engines. The compute coordinator is the initiator and manager of the
system. It starts the compute engines on remote hosts, assigning computing tasks to them and
coordinating the computing procedures on them. The communication in the collaborative
system mainly goes through remote messaging, implemented by remote method invocation
in the object-oriented methodology. Remote messaging is more powerful than ordinary
message passing: it can transfer not only data but also control between the objects.
The collaborative system is created on the most available hosts in a distributed system
based on the real-time states of the hosts, and it supports flexible runtime reconfiguration.
The collaborative system can be expanded by adding more hosts and creating new compute
engines on them to enhance the computing power, and the underlying hosts can be replaced
by new hosts in response to changes in the hosts’ states.
The hierarchical collaborative system (HiCS) incorporates the multithreading methodology
into the collaborative system in order to match the architecture of SMP nodes. Lightweight
threads can be generated in the compute engines residing on SMP nodes. The multithreading
technique creates an efficient computing infrastructure that consumes fewer system
resources and reaches higher stability than a sole distributed object system. HiCS is highly
adaptable on heterogeneous systems: multiple threads are generated based on the structure
of the SMP nodes, and the threads can work in cooperative mode or independent mode,
depending on the requirements of the computation.
The two-layer communication mechanism integrates the shared-data access among a
group of local threads and the remote messaging between distributed objects (compute
engines). It is a flexible and efficient communication mechanism on heterogeneous systems.
A unified communication interface is provided at application level. It provides the uniform
communication primitives to the applications and hides the underlying two-layer
communication paths. The remote messaging has the feature of one-sided communication,
which contributes to the flexibility of the MOIDE model: a compute engine can freely write
data objects to, or read them from, other compute engines without the explicit participation
of the other side. Relying on the one-sided communication, autonomous load scheduling is
provided to implement highly asynchronous computation.
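The idea of a single interface over the two-layer mechanism can be sketched as follows. The class and method names here are illustrative, not MOIDE's actual API: one send() call is dispatched to a shared in-memory queue when the peer is a local thread, and would fall back to remote messaging (RMI in MOIDE) when the peer is remote.

```java
// Sketch of a unified communication interface over two layers: the
// caller names only the peer, never the communication path.
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

public class UnifiedComm {
    // Peers on this SMP node, each with a shared in-memory message queue.
    private final Map<String, Queue<Object>> localPeers = new ConcurrentHashMap<>();

    public void registerLocal(String peer) {
        localPeers.put(peer, new ConcurrentLinkedQueue<>());
    }

    // Single entry point for both communication layers.
    public void send(String peer, Object data) {
        Queue<Object> q = localPeers.get(peer);
        if (q != null) {
            q.add(data);            // layer 1: shared-data access between local threads
        } else {
            sendRemote(peer, data); // layer 2: remote messaging (RMI in MOIDE)
        }
    }

    private void sendRemote(String peer, Object data) {
        // would invoke a remote method on the peer's compute engine
    }

    public Object receiveLocal(String peer) {
        return localPeers.get(peer).poll();
    }

    public static void main(String[] args) {
        UnifiedComm c = new UnifiedComm();
        c.registerLocal("thread-1");
        c.send("thread-1", "hello");                 // delivered via the shared queue
        System.out.println(c.receiveLocal("thread-1"));  // hello
    }
}
```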
The applications implemented in MOIDE model have high adaptability on heterogeneous
systems. An application can be developed on the architecture-independent infrastructure. It
will be mapped onto the underlying hosts at runtime and form a HiCS structure that matches
the specific architecture of the hosts to reach the best performance on the hosts.
A runtime support system, MOIDE-runtime, is developed to support the computation
based on MOIDE model. MOIDE-runtime implements all features of MOIDE model. It
performs the creation and reconfiguration of collaborative system. It establishes the two-layer
communication mechanism and provides the unified communication interface. It supports
local and remote synchronization on the threads and compute engines. It also supports
autonomous load scheduling. MOIDE-runtime is a cross-platform system implemented in
Java and RMI.
Four irregularly structured applications are implemented to demonstrate the advantages
of MOIDE model and verify the efficiency of the computation based on the model. These
applications have distinct computation and communication features. The N-body method is
an example of the MOIDE-based computation on hierarchical collaborative system. It
demonstrates the task decomposition and allocation strategies as well as the cooperation of
the compute engines and the multiple threads in the computation. Another feature of the N-
body method is the distributed tree structure that resolves the heavy communication problem
in distributed N-body method. The construction of the subtrees and partial subtrees illustrates
the design of complicated data structure and the related computation based on the distributed
object techniques in MOIDE model.
The ray tracing application is a practical application of autonomous load scheduling. Due
to the high asynchrony supported by the MOIDE model, all threads can perform the
rendering tasks fully in parallel. Different load scheduling schemes are tested as variations
of the autonomous load scheduling approach. The test results show that individual
scheduling is the best scheme, as it fully overlaps the system-wide computation and
communication to achieve high parallelism.
The CG and radix sort are communication-intensive applications. Both make use of the
two-layer communication mechanism to increase communication efficiency on clusters of
SMPs, and both are developed in the architecture-independent model. All communication
operations call the unified communication interface; the MOIDE-runtime implements the
two-layer communication at runtime to deliver the large amounts of data. The CG
application also confirms the adaptability of the MOIDE model on heterogeneous systems.
The two-layer communication mechanism implemented in Java and RMI can reach
performance comparable to MPI (Message Passing Interface) on heterogeneous systems,
and can even outperform MPI in some tests.
8.2 Achievements and Remaining Issues
8.2.1 Main Achievements
The thesis has covered many research aspects, including the computational model, the runtime
support system, and applications. The main achievements can be summarized as follows.
1. Design the MOIDE model as a distributed object computing infrastructure for solving
irregularly structured problems on heterogeneous distributed systems
The analysis of irregularly structured problems reveals the high diversity in their
computation and communication patterns. No common solution fits all of the varied
problems; each irregularly structured problem requires its own tailored method. What
is needed is a flexible model that facilitates the development of various
solutions for irregularly structured problems. Moreover, recognizing the hybrid
hosts existing in distributed systems, a flexible model is required to support the
architecture-independent development and dynamic mapping of applications on
heterogeneous systems. To meet these two requirements, the MOIDE model is designed as a
distributed object computing infrastructure general enough to implement different
applications on varied system architectures. The integration of the object-oriented and
multithreading methodologies in the model provides the polymorphism, encapsulation,
and location transparency that give the hierarchical collaborative system the
flexibility and adaptability to fit the architectural features and states of the underlying
hosts. The model provides the means to support high-performance computing for
solving irregularly structured problems on heterogeneous systems, namely the hierarchical
collaborative system, the two-layer communication, and the autonomous load scheduling.
2. Establish a unified communication interface on heterogeneous platforms
Many irregularly structured problems are communication-intensive applications.
Usually the communication requirements and costs are not predictable, owing to the
nonuniformity of these problems, and communication becomes the bottleneck to
overall performance. As a solution to this bottleneck, the MOIDE model
provides a two-layer communication mechanism on the hierarchical architecture of
heterogeneous systems. The mechanism integrates quick data sharing among a group
of threads with flexible remote messaging between distributed objects to
implement efficient communication. The integrated two-layer communication provides a
simple, flexible, and extensible communication mechanism for transmitting complex data
structures and control information between distributed objects. A unified communication
interface is provided to applications on top of the two-layer communication
mechanism; it preserves the architecture-independent feature of the MOIDE model
for the applications.
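The shape of such an interface can be sketched in Java as follows. The names Channel and SharedChannel are hypothetical, not the MOIDE API: the application always calls send/recv, while the runtime would select the shared-data layer for co-located threads and an RMI-based remote layer (omitted here) for objects on different hosts.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Unified interface the application programs against, regardless of
// whether the peer is a local thread or a remote object.
interface Channel<T> {
    void send(T msg) throws InterruptedException;
    T recv() throws InterruptedException;
}

// Intra-node layer: threads in one compute engine exchange data
// through a shared queue instead of serializing a remote message.
class SharedChannel<T> implements Channel<T> {
    private final BlockingQueue<T> q = new LinkedBlockingQueue<>();
    public void send(T msg) throws InterruptedException { q.put(msg); }
    public T recv() throws InterruptedException { return q.take(); }
}

public class TwoLayerDemo {
    public static void main(String[] args) throws Exception {
        Channel<double[]> ch = new SharedChannel<>();
        Thread producer = new Thread(() -> {
            try { ch.send(new double[]{1.5, 2.5}); } catch (InterruptedException e) {}
        });
        producer.start();
        double[] v = ch.recv();          // receiver blocks until data arrives
        producer.join();
        System.out.println(v[0] + v[1]); // 4.0
    }
}
```

A second implementation of Channel wrapping a remote method invocation would give remote objects the same calling convention, which is what keeps the application architecture-independent.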
3. Support the design of solutions for different applications
The object-oriented features of the MOIDE model enable the design of various
approaches to solve different irregularly structured problems. Autonomous load
scheduling is a technique to achieve high asynchrony in ray tracing. It is based on the
flexible, one-sided remote method invocation in object-based communication. For
applications with light data dependency, autonomous load scheduling can automatically
produce an even workload distribution over all threads without any explicit load
scheduling operation during execution.
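A minimal local sketch of the principle, assuming a hypothetical pool of independent rendering tasks (e.g., scanlines): in MOIDE the fetch would be a one-sided remote method invocation on the coordinator, but here an atomic counter stands in for that call, so each thread grabs the next task the moment it falls idle and no central scheduler intervenes.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class AutonomousScheduling {
    // Each worker pulls task indices until the pool is exhausted;
    // faster workers simply pull more often, so the load evens out.
    static int render(int tasks, int threads) throws InterruptedException {
        AtomicInteger next = new AtomicInteger(0); // shared task pointer
        AtomicInteger done = new AtomicInteger(0);
        Runnable worker = () -> {
            int t;
            while ((t = next.getAndIncrement()) < tasks) {
                // render scanline t ... (omitted)
                done.incrementAndGet();
            }
        };
        Thread[] pool = new Thread[threads];
        for (int i = 0; i < threads; i++) (pool[i] = new Thread(worker)).start();
        for (Thread th : pool) th.join();
        return done.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(render(100, 2)); // all 100 tasks complete, no scheduler
    }
}
```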
4. Implement a cross-platform runtime environment
The MOIDE runtime support system is developed to implement MOIDE-based
computation on heterogeneous systems. It implements the mechanisms and functions
required in MOIDE-based computation, and applications can be developed based on the
fundamental classes and methods provided in the MOIDE runtime. It also provides the
support to run the applications, including the creation of the hierarchical collaborative system
and the two-layer communication mechanism. Implemented in Java, it is a cross-platform
system executable on various systems, e.g., a cluster of single-processor nodes, a cluster of SMP
nodes, a cluster of mixed hosts, or a standalone multiprocessor, to support distributed object
computing.
5. Develop irregularly structured applications based on the MOIDE model
Four irregularly structured applications are developed in the MOIDE model. They are
typical irregularly structured problems whose different irregular characteristics call for
different approaches. These applications are used to demonstrate the generic
methodologies for developing applications in the MOIDE model as well as the approaches for
solving specific problems. The run-time tests of these applications validate the real
efficiency of MOIDE-based computation on different system architectures.
(1) The distributed N-body method is characterized by the distributed tree structure and
the collaborative computation of distributed objects and threads on the hierarchical
collaborative system infrastructure. The distributed tree structure with the partial
subtree scheme differs from the tree structures in other parallel N-body methods
on shared-memory or distributed-memory systems. The tree structure is supported by
the object-oriented method of the MOIDE model, which facilitates the construction and
transmission of tree structures in a distributed environment. It provides a
communication-efficient solution to the data-sharing requirement in N-body
simulation.
(2) The ray tracing method is well suited to autonomous load scheduling. With
the autonomous scheduling approach, the rendering workload is evenly
distributed over all compute engines and threads, and high parallelism is gained.
(3) The CG application involves a large amount of vector communication in each iteration,
which the two-layer communication mechanism can accelerate. The
application is used to show the performance of the two-layer communication and the
adaptive mapping of MOIDE-based computation onto heterogeneous architectures.
(4) Radix sort is another communication-intensive application. It features
efficient grouped communication in the scatter operation, which demonstrates both the
efficiency of the two-layer communication and the flexibility of the collaborative
computation in the MOIDE model. The MOIDE-based radix sort can outperform the
same application implemented in C and MPICH on a cluster of SMPs.
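For illustration, the counting-and-scatter step that the grouped communication accelerates can be sketched locally in Java (names are hypothetical; in the distributed version each bucket would travel to its owner engine in one grouped message rather than being permuted in place):

```java
import java.util.Arrays;

public class RadixPass {
    // Stable scatter of keys by the 8-bit digit at the given shift:
    // count digit occurrences, derive bucket offsets, then place each
    // key at its bucket's next free slot.
    static int[] scatter(int[] keys, int shift) {
        int[] count = new int[256];
        for (int k : keys) count[(k >>> shift) & 0xFF]++;
        int[] start = new int[256]; // bucket offsets (prefix sums)
        for (int d = 1; d < 256; d++) start[d] = start[d - 1] + count[d - 1];
        int[] out = new int[keys.length];
        for (int k : keys) out[start[(k >>> shift) & 0xFF]++] = k;
        return out;
    }

    public static void main(String[] args) {
        int[] keys = {0x0203, 0x0101, 0x0102, 0x0201};
        // Two 8-bit passes fully sort 16-bit keys.
        int[] sorted = scatter(scatter(keys, 0), 8);
        System.out.println(Arrays.toString(sorted)); // [257, 258, 513, 515]
    }
}
```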
8.2.2 Remaining Issues
Despite the achievements in this thesis, some aspects still require improvement.
1. Efficiency of system reconfiguration
As described in Section 2.2.4, the collaborative system is flexible enough to be reconfigured
dynamically. The system can easily be expanded by adding new compute engines, or
the underlying hosts can be replaced by new hosts in response to a change of system
state. The purpose of reconfiguration is to improve system performance. However,
reconfiguration needs a series of operations, including new host selection, new compute
engine creation, registration update, and the migration of data/task objects. This is a time-
consuming procedure, and the overhead often makes it impractical in real computation. To solve
this problem, we should specify in detail the conditions under which system reconfiguration is
necessary and can truly improve system performance. In addition to
the reconfiguration approaches in Section 2.2.3, alternative approaches should also be designed
to implement efficient reconfiguration.
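One illustrative form such a condition could take (not from the thesis) is a simple cost test: reconfigure only when the time saved over the remaining work exceeds the measured migration overhead.

```java
public class ReconfigPolicy {
    // remainingWork in abstract work units; rates in units per second;
    // overheadSec is the measured cost of selection, engine creation,
    // registration update, and object migration combined.
    static boolean worthReconfiguring(double remainingWork,
                                      double oldRate, double newRate,
                                      double overheadSec) {
        double stayTime = remainingWork / oldRate;
        double moveTime = remainingWork / newRate + overheadSec;
        return moveTime < stayTime; // migrate only if it finishes sooner
    }

    public static void main(String[] args) {
        // Early in a long run the faster host wins despite the overhead;
        // late in the run the overhead dominates and we stay put.
        System.out.println(worthReconfiguring(1000, 10, 20, 30)); // true
        System.out.println(worthReconfiguring(100, 10, 20, 30));  // false
    }
}
```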
2. System reliability
The execution of applications on the MOIDE runtime support system may encounter
occasional crashes. The causes include the shortage of memory space when
running large applications and deadlock during execution. Space-efficient strategies should
be designed for space-consuming applications such as the N-body problem. The two-layer
communication mechanism should improve its communication protocol to support reliable
data transfer and to resolve communication deadlock. The runtime support system should
improve the synchronization facilities that applications call to avoid deadlock.
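As a hedged sketch (not the MOIDE API) of what such a facility could offer, a timeout-guarded receive lets a blocked receiver fail fast instead of deadlocking when its peer never sends:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class TimedRecv {
    // Bounded wait: returns the message, or "timeout" if none arrives
    // within the given window, so the caller can recover or report.
    static String recvOrTimeout(BlockingQueue<String> q, long millis)
            throws InterruptedException {
        String msg = q.poll(millis, TimeUnit.MILLISECONDS); // null on timeout
        return msg == null ? "timeout" : msg;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> q = new LinkedBlockingQueue<>();
        System.out.println(recvOrTimeout(q, 100)); // prints "timeout"
    }
}
```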
3. Programming templates
At present, the MOIDE model lacks standard programming templates. Although the
runtime support system provides the fundamental classes and APIs, applications are
mostly programmed in their own manner. Standard programming templates are required to
formalize MOIDE-based application design and to ease programming. We need to
provide programming templates that suit both the application patterns and the MOIDE
model.
8.3 Future Work
The MOIDE model has been successfully implemented on a cluster of SMP nodes and a small
heterogeneous cluster. The model has been used to design the sample irregularly
structured applications, which demonstrated satisfactory performance in the tests.
Nevertheless, the model should improve its functionality and extend its use to a wide range of
distributed systems and applications. Future work will concentrate on the following
aspects.
1. Solve remaining problems
Firstly, the remaining issues discussed in Section 8.2.2 will be solved. The requirement for
system reconfiguration and its effect on overall system performance should be investigated,
and conditions should then be defined that ensure reconfiguration is both necessary and
beneficial. New system reconfiguration schemes will be designed to reduce the time cost and
improve usability. For example, migrating a portion of the threads on an overloaded host to a
new host, instead of replacing the host completely, would reduce the migration overhead and
avoid a new overload situation; replacing an overloaded host with more than one host would
increase the computing power of the entire collaborative system. The stability and reliability
of the MOIDE runtime support system and of MOIDE-based applications need to be enhanced
by improving the synchronization, coordination, and deadlock detection mechanisms in the
runtime support system and utilizing them in application design. Programming templates
should be provided for developing applications based on the computing modes, for example,
templates for hierarchical collaborative computation, for the threads' cooperative or
independent modes, and for computation based on autonomous load scheduling.
2. Test and enhance system scalability
So far, all tests have been conducted on a local cluster. Future work will emphasize the
scalability of the MOIDE model and the runtime support system in order to extend the system
coverage. Strong support for heterogeneity and autonomy is demanded on wide-area
network systems such as the Grid, where hundreds or thousands of computer nodes may
join the computation. To accommodate the computing model on a large system, the network
delay, the heterogeneous types of computing resources, and the system scalability must be
dealt with. High autonomy and scalability are the key merits for a computing infrastructure to
succeed in network computing. The MOIDE model needs modification to enhance its
scalability. Instead of the current two-level hierarchical structure, the collaborative system
will be expanded to a multilevel hierarchy that is more appropriate for organizing the
geographically distributed computer nodes. The collaborative system will then present a
multilevel adaptive infrastructure composed of grouped compute engines and local
coordinators. The groups will have higher autonomy in performing local computation, and
control will be distributed from the single compute coordinator of the current model to the
distributed local coordinators of the future system. The system coordination, task allocation,
and load scheduling strategies and the communication protocols should be improved to suit
the multilevel system structure. The current runtime support system should also be tested on
larger systems containing more platforms to examine its scalability and find its weaknesses.
3. Implement more applications
Four irregularly structured problems have been chosen as the sample applications in this
thesis, but many more irregularly structured applications exist in different fields. Other
applications in scientific computing, system simulation, combinatorial problems, etc., should
be implemented to study the patterns of these problems and the support required to cope with
their irregularities. The MOIDE model will thereby be improved to support the solutions for
various irregularly structured problems.