
STUDIES ON TASK PARTITIONING STRATEGIES IN DISTRIBUTED SYSTEM

THESIS SUBMITTED FOR THE AWARD OF THE DEGREE OF

Doctor of Philosophy

IN COMPUTER SCIENCE

Submitted By

JAVED ALI

Under the Supervision of

Dr. RAFIQUL ZAMAN KHAN (Associate Professor)

DEPARTMENT OF COMPUTER SCIENCE ALIGARH MUSLIM UNIVERSITY

ALIGARH (INDIA) 2013



CANDIDATE'S DECLARATION

I, JAVED ALI, Department of Computer Science, certify that the work

embodied in this Ph.D. thesis is my own bonafide work carried out by me under the

supervision of Dr. RAFIQUL ZAMAN KHAN at Aligarh Muslim University,

Aligarh. The matter embodied in this Ph.D. thesis has not been submitted for the

award of any other degree.

I declare that I have faithfully acknowledged, given credit to and referred to

the research workers wherever their works have been cited in the text and the body of

the thesis. I further certify that I have not willfully lifted up some other's work, para,

text, data, result, etc. reported in the journals, books, magazines, reports, dissertations,

theses, etc., or available at web-sites and included them in this Ph.D. thesis and cited

as my own work.

Date: (Signature of the Candidate)

Certificate from the Supervisor

This is to certify that the above statement made by the candidate is correct to the best of my knowledge.

Signature of the Supervisor: Name & Designation: DR. RAFIQUL ZAMAN KHAN

(ASSOCIATE PROFESSOR) Department: COMPUTER SCIENCE

(Signature of the Chairman of the Department with seal) CHAIRMAN

Department of Computer Science, A.M.U., Aligarh

COPYRIGHT TRANSFER CERTIFICATE

Title of the Thesis: STUDIES ON TASK PARTITIONING

STRATEGIES IN DISTRIBUTED SYSTEM

Candidate's Name: JAVED ALI

Copyright Transfer

The undersigned hereby assigns to the Aligarh Muslim University,

Aligarh, copyright that may exist in and for the above thesis submitted for the

award of Ph.D. degree.

(Signature of the Candidate)

COURSE/ COMPREHENSIVE EXAMINATION/ PRE-SUBMISSION SEMINAR COMPLETION CERTIFICATE

This is to certify that MR JAVED ALI, Department of COMPUTER

SCIENCE has satisfactorily completed the course work/comprehensive

examination and pre-submission seminar requirement which is part of his Ph.D.

programme.

Date: (Signature of the Chairman of the Department)

ACKNOWLEDGEMENTS

First and foremost, I must bow in reverence to Almighty 'ALLAH', the cherisher and

the Sustainer to whom we entirely owe our all capabilities, strength, courage and

knowledge; without His blessings nothing can be done. Words and lexicons cannot do

full justice in expressing a deep sense of gratitude and indebtedness which flows from

the core of my heart towards my esteemed supervisor Dr. Rafiqul Zaman Khan,

(Associate Professor), Department of Computer Science. Under his benevolence and

able guidance, I learned and ventured to write. His inspiring benefaction bestowed

upon me so generously and his unsparing help and precious advice have gone into the

preparation of this thesis. He has been a beacon all long to enable me to present this

work. I shall never be able to repay for all that I have gained from him.

I wish to thank Mr. Suhail Mustajab (Chairman), Department of Computer Science,

for providing me various departmental facilities whenever it seemed feasible. I also

avail this opportunity to express my thanks to all my teachers specially, Mr. S.

Maheshwari (Associate Professor), Mrs. Priti Bala (Associate Professor), Dr. M. U.

Bokhari (Associate Professor), Dr. Jamshed Siddiqui (Associate Professor), Dr. Asim

Zafar (Associate Professor), Mr. Shahid Masood (Assistant Professor), Mrs. Sehba

Masood (Assistant Professor) and Mr. Arman Rasool Faridi (Assistant Professor) for

their encouragement and inspiration.

Thanks are also due to all my colleagues especially Firoz Ali, Noor Adnan, Haider Al

Khalaf Jabbar, Shadab Alam, Sadaf Ahmad, Faheem Masoodi, Faraz Hasan, Shahab

Saquib Sohail, Zaki Ahmad Khan, Nazir Ahmad and Suby Khan who are always

giving me their valuable suggestions.

I wish to express my gratitude to all my office workers, Mr. Sharif Ahmad Khan, Mr.

Azimur Rehman, Mohd. Irfan and Mr. Mazhar Moonis and several others who have

always stood beside me in my ups and downs.

At last but certainly the most, I am greatly indebted to my parents, Haji Idris Ahmad

(father), Mrs. Fahmida Khatoon (mother) for their continuous and constant

encouragement and above all for their moral support. I also avail this opportunity to

express my thanks to all my well wishers.

Finally, I owe a deep sense of gratitude to the authorities of Aligarh Muslim

University, my Alma Mater, for providing me adequate research facilities.

JAVED ALI

Department of Computer Science,

Aligarh Muslim University, Aligarh

ABSTRACT

Parallel computing specifies how different portions of a computation can be executed concurrently in a heterogeneous computing environment. High performance in existing systems may be achieved through a high degree of parallelism. It supports parallel as well as sequential execution of processes and automatic inter-process communication and synchronization. Many scientific problems (Statistical

Mechanics, Computational Fluid Dynamics, Modeling of Human Organs and Bones,

Genetic Engineering, Global Weather and Environmental Modeling etc.) are so

complex that solving them via simulation requires extraordinarily powerful computers.

Grand challenging scientific problems can be solved by using high performance

parallel computing architectures. Task partitioning of parallel applications is highly

critical for the prediction of the performance of distributed computing systems.

A well known representation of parallel applications is a DAG (Directed Acyclic

Graph) in which nodes represent application tasks. Directed arcs or edges represent

inter-task dependencies. The execution time of an algorithm is known as the schedule length

of that algorithm. The main problem in parallel computing is to find out the minimum

schedule length of DAG computations. The Normalized Schedule Length (NSL) of an algorithm shows how the variation of communication cost depends upon the overhead of

the existing computing system. Minimization of inter-process communication cost

and selection of network architecture are notable parameters to decide the degree of

efficiency. Various list scheduling strategies (Task Duplication Based Scheduling

(TDB), Bounded Number of Processor Scheduling (BNP), Unbounded Number of

Cluster Scheduling (UNC) and Arbitrary Network Topology Scheduling (ANP) etc)

are discussed in the literature. All scheduling strategies have their pros and cons.

HEFT (Heterogeneous Earliest Finish Time), MCP (Modified Critical Path) and FPS

(Fast Preemptive Scheduling) algorithms are recent dynamic scheduling algorithms.

These algorithms depict the best speedup and efficiency amongst existing parallel

computing scheduling algorithms.

We proposed task partitioning algorithms in heterogeneous parallel computing

systems. The underlying scheduling problem is NP-complete. An iterative list

scheduling algorithm has been proposed to improve the quality of the schedule in an

iterative fashion by using results from previous iterations. The results obtained by the

designed strategy show significant improvement in inter-process communication

cost. The proposed algorithm may simulate a large number of complex applications in high

performance computing systems.

Most scheduling algorithms do not provide a mechanism for handling communication overhead. The second proposed algorithm, the Optimal Task Partitioning Strategy with Duplication (OTPSD), minimizes schedule length as well as communication overhead. The communication overhead of the computing system has been reduced by duplication of tasks. It is a three-phase algorithm in which the first phase comprises GrainPackSubDAG

formation. The second phase is a priority assignment phase. In the third phase,

processors are grouped according to their processing capabilities. The proposed

algorithm (OTPSD) minimizes makespan and shows better performance in terms of

normalized schedule length and processor utilization over MCP and HEFT

algorithms. The makespan of OTPSD calculated in a Compute Unified Device Architecture (CUDA) environment is 184.4 sec, which is shorter than that of MCP (215.6 sec) and HEFT (196.3 sec).

The third proposed algorithm is the Task Partitioning Strategy with Minimum Time (TPSMT). In the designed algorithm, a communication cost improvement factor has been introduced. This factor is designed to reduce the inter-process communication cost in distributed computing systems. It is compared with the two well-

known HEFT and Critical Path Fast Duplication (CPFD) algorithms. TPSMT shows

lower complexity than compared algorithms. Communication to computation ratio

(CCR) is also taken into consideration to calculate the performance of the algorithms.

The average efficiency of TPSMT is 53.09% and 63.20% higher than that of the HEFT and CPFD algorithms respectively. Simulation results show that TPSMT gained 17.36% and 56.79% more speedup than HEFT and CPFD respectively.

The fourth proposed algorithm is the Preemptive Task Partitioning Strategy (PTPS). It has been compared with the preemptive MCP and FPS

algorithms. It increases CPU utilization by allowing recursive preemptions of

resources. In preemptive scheduling, idle time of processors decreases by assigning

jobs in a round robin system. The proposed algorithm assigns tasks in such a manner

that every participating processor remains engaged most of the time. PTPS is based on the

Earliest Start Time (EST) factor which provides optimum processing capability of the

existing machines. This algorithm contributes minimum finish time (minFT)

parameter which shows minimum Normalized Schedule Length (NSL). This result

predicts that if the communication cost amongst processes is large then scheduling of

these parallel applications is complex. On the basis of simulation results it has been

observed that the running time of PTPS is 11.91% less than that of FPS.


TABLE OF CONTENTS

List of Tables

List of Figures

List of Publications

CHAPTER-1 : INTRODUCTION 1-16

1.1 Parallel Computing Distributed System 2

1.2 Traditional Scheduling versus Parallel Computing Scheduling 2

1.2.1 Task partitioning constraints in parallel computing 2

1.2.2 Optimization criteria for parallel computing scheduling 2

1.2.3 Data scheduling 3

1.2.4 Methodologies 3

1.3 Types of Parallel Computing Architectures 3

1.3.1 Flynn's taxonomy 3

1.3.1.1 SISD (Single Instruction stream/Single 3

Data stream)

1.3.1.2 SIMD (Single Instruction stream/ Multiple 3

Data stream)

1.3.1.3 MISD (Multiple Instruction /Single Data) 4

1.3.1.4 MIMD (Multiple Instructions and Multiple Data) 4

1.4 Important Laws in Parallel Computing 4

1.4.1 Amdahl's law 4

1.4.1.1 Limitations of Amdahl's Law 4

1.4.2 Gustafson's law 6

1.5 Performance Parameters in Parallel Computing 7

1.5.1 Throughput 7

1.5.2 System utilization 7

1.5.3 Turnaround time 7

1.5.4 Waiting time 7

1.5.5 Response time 7

1.5.6 Reliability 7

1.6 Issues and Challenges in Parallel Computing 7

1.6.1 Role of server system 7

1.6.2 Allocation requirements 8

1.6.3 Resource management 9

1.6.4 Security 9

1.6.5 Reliability 9

1.6.6 Fault tolerance 9

1.7 NP Hard Scheduling 9

1.7.1 Classes P and NP in parallel computing 10

1.8 Thread in Parallel Processing Algorithms 10

1.8.1 Thread benefits 11

1.8.1.1 Responsiveness 11

1.8.1.2 Economy and resource sharing 11

1.8.1.3 Utilization of multiprocessor architectures 11

1.9 DAG in Different Scheduling Environments 11

Reference 15

CHAPTER-2 : COMPARATIVE STUDY OF PARALLEL 17-34

COMPUTING TOOLS

2.1 Introduction 18

2.2 PVM (Parallel Virtual Machine) 19

2.2.1 Evolution of PVM 20

2.2.2 Portability, interoperability and evolution of PVM 20

2.2.3 Computing model of PVM 22

2.3 MPI (Message Passing Interface) 23

2.3.1 Evolution of MPI 23

2.3.2 Portability, interoperability and fault tolerance of MPI 24

2.3.3 Implementation of MPI 25

2.4 BLACS 26

2.4.1 ID-Less communication in BLACS 26

2.5 ScaLAPACK 26

2.5.1 Portability and scalability of ScaLAPACK 27

2.5.2 Software component of ScaLAPACK 27

2.6 Parallel Virtual Message Passing Interface 27

2.7 Parallel MATLAB 28

2.8 Compute Unified Device Architecture 29

2.8.1 NVIDIA tesla GPU architecture 29

2.8.2 Thread organization and scheduling of CUDA 30

2.9 Comparison of parallel computing tools 30

Reference 32

CHAPTER 3 : CLASSIFICATION OF TASK PARTITIONING 35-49

STRATEGIES IN DISTRIBUTED PARALLEL

COMPUTING SYSTEMS

3.1 Introduction 36

3.2 Deterministic Task Partitioning Strategies 37

3.2.1 Papadimitriou and Yannakakis scheduling 38

3.2.2 Linear Clustering with Task Duplication Scheduling 38

3.2.3 Edge Zeroing Scheduling 38

3.2.4 Modified Critical Path Scheduling 39

3.2.5 Earliest Time First Scheduling 39

3.3 Dynamic Task Partitioning Strategies 39

3.3.1 Evolutionary Task Scheduling 40

3.3.2 Dynamic Priority Scheduling 40

3.4 Preemptive Task Partitioning Strategies 40

3.4.1 Rate Monotonic-Decrease Utilization-Next Fit Scheduling 41

3.4.2 Optimal Preemptive Scheduling 41

3.4.3 Preemption Threshold Scheduling 41

3.4.4 Fixed Preemption Point Scheduling 42

3.5 Non-Preemptive Partitioning Strategies 42

3.5.1 Multiple Strict Bound Constraints scheduling 42

3.5.2 Earliest Deadline First scheduling 42

Reference 46

CHAPTER-4 : OPTIMAL TASK PARTITIONING MODEL IN 50-63

DISTRIBUTED HETEROGENEOUS PARALLEL

COMPUTING ENVIRONMENT

4.1 Introduction 51

4.2 PRAM Model 52

4.3 Proposed Model for Task Partitioning in Distributed Environment 53

Scheduling

4.3.1 Proposed algorithm for inter-process communication 56

amongst tasks

4.3.2 Pseudo code for the proposed algorithm 56

4.3.3 Low communication overhead phase 57

4.3.4 Priority assignment and start time computing phase 57

4.3.5 Procedure for computing the ALAP 58

4.4 Experimental Phase 59

Reference 62

CHAPTER -5 : OPTIMAL TASK PARTITIONING STRATEGY 64-82

WITH DUPLICATION (OTPSD) IN PARALLEL

COMPUTING ENVIRONMENTS

5.1 Introduction 65

5.2 MCP Algorithm 66

5.2.1 Design of MCP algorithm 67

5.2.2 Efficiency of the algorithm 67

5.3 HEFT (Heterogeneous Earliest Finish Time) Algorithm 67

5.4 Performance Metric for Simulation 68

5.5 Proposed Model of Task Partitioning Strategies 69

5.5.1 Task and processors assignment phase 69

5.5.2 Pseudo Code for Grain Pack SubDAG 70

5.5.3 Pseudo code for proposed algorithm 73

5.6 OTPSD Implementation Phase 74

Reference 80

CHAPTER-6 : TASK PARTITIONING STRATEGY WITH 83-92

MINIMUM TIME IN DISTRIBUTED PARALLEL

COMPUTING ENVIRONMENT

6.1 Introduction 84

6.2 Critical Path Fast Duplication (CPFD) algorithm 85

6.2.1 Pseudo code for CPFD algorithm 86

6.3 Proposed algorithm 87

6.3.1 Task assignment phase 87

6.3.2 Pseudo code for proposed algorithm 88

6.3.3 Simulation results 88

6.3.4 Computational analysis 90

Reference 92

CHAPTER-7 : PREEMPTIVE TASK PARTITIONING STRATEGY 93-102

WITH FAST PREEMPTION IN HETEROGENEOUS

DISTRIBUTED ENVIRONMENT

7.1 Introduction 94

7.2 Resource sharing 95

7.3 Proposed Algorithm for Preemptive Scheduling 96

7.3.1 Description of the scheduling algorithm 97

7.3.2 Experimental phase 100

Reference 101

CONCLUSION 103-104

PUBLICATION

LIST OF FIGURES

Figure 1 Speedup representation by Amdahl's Law 6

Figure 2 Protocol hierarchy of PVM in ATM network 21

Figure 3 PVM Architectural View 22

Figure 4 PVM Computational Model 23

Figure 5 Two Queues used in a typical MPI 25

Figure 6 Hierarchical view of different task partitioning strategies in different platforms 38

Figure 7 PRAM model for shared memory 52

Figure 8 Proposed dynamic task partitioning model 54

Figure 9 Iteration improvement analysis of proposed model 59

Figure 10 Speedup analysis of the algorithm upon the number of processors 60

Figure 11 Serializability analysis of the algorithm upon the different 60

number of processors

Figure 12 Abstract model for task partitioning strategies 69

Figure 13 Makespan analysis of different algorithms 75

Figure 14 Analysis of NSL vs CCR 76

Figure 15 Efficiency vs CCR 77

Figure 16 DAG with communication costs 77

Figure 17 Average NSL vs DAG size 78

Figure 18 DAG with Communication Cost in TPSMT 89

Figure 19 Average SLR against CCR value 89

Figure 20 Average speed up against different number of processors 90

Figure 21 Average efficiency against different number of processors 90

Figure 22 Running time vs number of processors analysis 98

Figure 23 CCR vs average normalized schedule length analysis at 4 98

processors

Figure 24 CCR vs average normalized schedule length analysis at 8 98

processors

Figure 25 CCR vs average normalized schedule length analysis at 16 99

processors

Figure 26 CCR vs average normalized schedule length analysis at 32 99

processors

Figure 27 CCR vs average normalized schedule length analysis at 64 99

processors

Figure 28 DAG example for simulation 99

LIST OF TABLES

Table 1 Comparison of parallel computing tools 31

Table 2 Comparison of different task partitioning strategies 43

under various architectures

Table 3 Nomenclature for proposed task partitioning model 55

Table 4 Computational analysis of proposed model by using different 59

parameters

Table 5 Nomenclature for OTPSD Model 71

Table 6 Allocation of tasks upon different processors in OTPSD 75

model

Table 7 Average CPU utilization and makespan length of 78

compared algorithms

Table 8 Nomenclature for TPSMT model 87

Table 9 Nomenclature in PTPS model 98

Introduction


CHAPTER 1

INTRODUCTION

1.1 Parallel computing distributed system

Parallel computing is a form of computation in which many calculations are

carried out simultaneously. It operates on the principle that large problems can often

be divided into smaller ones, which can be solved concurrently. In a distributed

system, all processing elements are connected by a network. Parallel computing

has become the dominant paradigm in computer architecture, mainly in the form of multi-core processors. Data parallel programming is one in which each process executes the same action concurrently, but on different parts of shared data, while in the task-parallel approach, every process performs a different step of the computation on the same

data.

1.2 Traditional scheduling versus parallel computing scheduling

In traditional scheduling, scheduling is defined as the allocation of operations to

processing units. One of the primary differences between traditional and parallel

computing scheduling is that the parallel computing scheduler does not have full

control over parallel computing systems. More specifically, local resources are not controlled by the parallel computing scheduler but by local schedulers. Another difference is that the parallel computing scheduler cannot assume that it has a global view

of parallel computing. The difference occurs mainly due to the dynamic nature of

resources and constraints in a parallel computing environment. Issues on the basis of

which differences could be considered are as follows:

1.2.1 Task partitioning constraints in parallel computing: Scheduling constraints

can be hard or soft. Hard constraints are rigidly enforced. Soft constraints are those

that are desirable but not absolutely essential. Soft constraints are usually combined

into an objective function.

1.2.2 Optimization criteria for parallel computing scheduling: A variety of

optimization criteria for parallel computing scheduling is given below:


1. Minimization of inter-process communication cost, sleek time (processor

utilization time) and NSL (Normalized Schedule Length).

2. Maximization of personal or general utility, resource utilization, fairness etc.

1.2.3 Data scheduling: A variety of data is necessary for a scheduler to describe the

jobs and resources [7] e.g. Job length, resource requirement estimation, time profiles,

uncertain estimation etc.

1.2.4 Methodologies: It is obvious that there is no deterministic scheduling method

which shows ideal results for parallel computing scheduling. But in conventional

scheduling many appropriate methods are available.

1.3 Types of parallel computing architectures

1.3.1 Flynn's taxonomy: Michael J. Flynn created one of the earliest classification

systems for parallel and sequential computers and programs. It's the best known

classification scheme for parallel computers. He classified programs and computers

for different data partitioning streams. A process is a sequence of instructions (the

instruction stream) that manipulates a sequence of operands (the data stream).

Computer hardware may support a single instruction stream or multiple instruction

streams for manipulating different data streams.

1.3.1.1 SISD (Single Instruction stream/Single Data stream): It's equivalent to an

entire sequential program. In SISD, the machine fetches a single sequence of instructions, with the corresponding data and instruction addresses, from memory.

1.3.1.2 SIMD (Single Instruction stream/ Multiple Data stream): This category

refers to computers with a single instruction stream but multiple data streams. These

machines typically use processor arrays. A single processor fetches instructions and

broadcasts these instructions to a number of data units. These data units fetch data and

perform operations on them. An appropriate programming language for such

machines has a single flow of control, operating on an entire array rather than on

individual array elements. It's analogous to doing the same operation repeatedly over

a large data set. This is commonly used in signal processing applications.

There are some special features of the SIMD.


1. All processors do the same thing or are idle.

2. It consists of data partitioning and parallel processing phases.

3. It produces the best results for big and regular data sets.

Systolic array is the combination of SIMD and pipeline parallelism. It achieves

very high speed by circulating data among processors before returning to memory.

1.3.1.3 MISD (Multiple Instruction /Single Data): It is not entirely clear which type of machine fits in this category. One kind of MISD machine can be designed for fail-safe operation. Several processors perform the same operation on the same

data and check each other to be sure that any failure will be caught. The main issues

in MISD are: branch prediction, instruction installation and flushing a pipeline.

Another proposed MISD machine is a systolic array processor. In this category,

streams of data are fetched from memory and passed to the array of processors. The individual processors perform their operations on the accepted data streams, but they have no control over the fetching of data from memory. This combination is not very useful for practical implementations.

1.3.1.4 MIMD (Multiple Instructions and Multiple Data): In this stream several

processors fetch their own instructions. Multiple instructions are executed upon different sets of data for the computation of large tasks. To achieve maximum speedup, the processors must communicate in a synchronized manner. There are two types of

streams under this category which are as follows.

1. MIMD (Distributed Memory): In this stream, a separate memory is associated with each processor of the distributed computing system, so communication overhead

is high due to the exchange of data amongst the processors.

2. MIMD (Shared Memory): This structure shares a single physical memory

amongst the processors. Programs can share blocks of memory for the execution

of parallel programs. In this stream, the shared memory concept causes problems of exclusive access, race conditions, scalability and synchronization.

1.4 Important laws in parallel computing

1.4.1 Amdahl's law: This law was formulated by Gene Amdahl in 1967. It

provides an upper limit to the speedup which may be achieved by a number of parallel

processors to execute a particular task. Asymptotic speedup is increased as the


number of processors increases in high performance computing systems [1]. If the

number of parallel processors in a parallel computing system is fixed, then the speedup

is usually an increasing function of the problem size. This effect is known as

Amdahl's effect.

Suppose f is the sequential fraction of the computation, where 0 ≤ f ≤ 1. The maximum speedup achieved by the parallel processing environment using p processors is as follows:

    S(p) = 1 / (f + (1 - f)/p)

1.4.1.1 Limitations of Amdahl's Law:

1. It does not account for the computational overhead associated with the degree of

parallelism (DOP).

2. It assumes the problem size as a constant and depicts how increasing processors

can reduce computational overhead.

A small portion of the task which cannot be parallelized will limit the overall speedup

achieved by parallelization. Any large mathematical or engineering problem will

typically consist of several parallelizable parts and several sequential parts. This

relationship is given by the equation:

    S = 1 / ((1 - P_f) + P_f / N)

where S is the speedup of the task (as a factor of its original sequential runtime), P_f is the parallelizable fraction and N is the number of parallel processing units.
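As a minimal numerical sketch (not part of the original thesis), the following C fragment evaluates the Amdahl bound for an assumed sequential fraction f = 0.10 and increasing processor counts; as p grows, the bound approaches 1/f = 10, matching the 10x limit discussed after Figure 1.

#include <stdio.h>

/* Amdahl's law: upper bound on speedup for a sequential fraction f
   (0 <= f <= 1) when the remaining work is spread over p processors. */
static double amdahl_speedup(double f, int p)
{
    return 1.0 / (f + (1.0 - f) / p);
}

int main(void)
{
    double f = 0.10;                       /* illustrative value: 10% sequential */
    for (int p = 1; p <= 64; p *= 2)
        printf("p = %2d  speedup <= %.2f\n", p, amdahl_speedup(f, p));
    return 0;                              /* bound tends towards 1/f = 10       */
}

Compiled with any C99 compiler, this prints the bound for p = 1, 2, 4, ..., 64 and shows how quickly the curve flattens.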


Figure 1: Speedup represented by Amdahl's Law

If the sequential portion of a task is 10% of the runtime, we cannot get more than 10x

speedup regardless of how many processors are added. This rule puts an upper limit on the useful number of parallel execution units. When a task cannot be partitioned due to its sequential constraints, further effort on that application has no effect on the schedule: bearing a child takes nine months regardless of how many women are assigned to the task [9]. Amdahl's and Gustafson's laws are related to each other

because both laws give a speedup performance after partitioning given tasks into sub-

tasks.

1.4.2 Gustafson's law: It's used to estimate the degree of parallelism over a serial

execution. It allows the problem size to be an increasing function of the number

of processors. Speedup predicted by this law is termed as scaled speedup. According

to Gustafson, maximum scaled speedup is given by,

S(P) = P - α(P - 1).

where P, S and α denote the number of processors, the speedup and the non-parallelizable

part of the process respectively. Gustafson's law depends upon the sequential section of the program. On the other hand, Amdahl's law does not depend upon the sequential

section of parallel applications. Communication overhead is also ignored by this

performance metric.
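For comparison, here is a hedged C sketch of Gustafson's scaled speedup under the same assumed 10% non-parallelizable fraction (the value is illustrative, not taken from the thesis).

#include <stdio.h>

/* Gustafson's law: scaled speedup S(P) = P - a*(P - 1), where a is the
   non-parallelizable fraction of the scaled workload. */
static double gustafson_speedup(int P, double a)
{
    return P - a * (P - 1);
}

int main(void)
{
    double a = 0.10;   /* illustrative non-parallelizable fraction */
    for (int P = 1; P <= 64; P *= 2)
        printf("P = %2d  scaled speedup = %.1f\n", P, gustafson_speedup(P, a));
    return 0;          /* grows almost linearly in P, unlike the Amdahl bound */
}

Unlike the Amdahl bound above, the scaled speedup keeps growing almost linearly with P because the problem size is assumed to grow with the number of processors.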


1.5 Performance parameters in parallel computing

There are many performance parameters for parallel computing. Some of them

are listed as follows:

1.5.1 Throughput: It is measured in units of work accomplished per unit time. There

are many possible throughput metrics, which depend upon the definition of a unit of

work. For a long process throughput may be one process per hour while for short

processes it might be twenty processes per second. This totally depends upon the

underlying architecture and size of executing processes upon that architecture.

1.5.2 System utilization: It keeps the system as busy as possible. It may vary from

zero to 100 percent approximately.

1.5.3 Turnaround time: It is the time taken by the job from its submission to

completion. It's the sum of the periods spent waiting to get into memory, waiting in

ready queue, executing on the processor and spending time for input/output

operations.

1.5.4 Waiting time: It is the amount of time a particular job spends waiting in the ready queue to get a resource. In other words, the waiting time of a job is estimated as the time from its submission until it gets the system for execution. Waiting time depends upon parameters similar to those of turnaround time.

1.5.5 Response time: It is the amount of time taken to get the first response, not the time

that the process takes to output that response. This time can be limited by the output

devices of computing system.

1.5.6 Reliability: It is the ability of a system to perform failure free operation under

stated conditions for a specified period of time.

1.6 Issues and challenges in parallel computing

Parallel computing is emerging as a viable option for high performance computing. Sharing of resources in parallel computing provides improved performance at low cost. In parallel computing there are various issues, such as:

1.6.1 Role of server system: The server system plays a major part in a heterogeneous parallel computing system, as today's server systems are relatively small and


they are connected to networks by interfaces [14]. So these systems must support high performance computing architectures.

1.6.2 Allocation requirements: Since parallel computing may be used as a distributed computing environment, the following aspects play an important role in

the performance of allocated resources.

Services: Parallel computing is designed to address single as well as multiple issues, such as minimum turnaround time and real-time operation with fault

tolerance.

Topology: Are the job services centralized, distributed or hierarchical in nature? Selection of an appropriate topology is a challenging task for the achievement of

better results.

Nature of the job: It can be predicted on the basis of load balancing and

communication overhead of the tasks over the parallel computing architectures [8].

The effect of existing load: Existing load may cause poor results.

Load balancing: If the tasks are spread over the parallel computing systems then load

balancing strategy depends upon the nature and size of job and characteristics of

processing elements.

Parallelism: Parallelism of the jobs may be considered on the basis of fine grain level

or coarse grain level. This parameter should be kept in mind before as well as after job submission.

Redundant resource selection: What should be the degree of redundancy in the form

of task replication or resource replication? In case of failure what should be the

criterion for selecting the node holding a task replica? How is system performance affected by allocating the resources properly?

Efficiency: The efficiency of a scheduling algorithm is computed as:

    Efficiency = (Scheduling Length of Uniprocessor System / (Scheduling Length on Multiprocessor System × No. of Processors)) × 100

Normalized Schedule Length: If makespan of an algorithm is the completion time of

that algorithm, then the Normalized Schedule Length (NSL) of the scheduling algorithm is

defined as follows:


    NSL = Makespan of the Particular Algorithm / (Maximum Sum of Computational Costs Along a Critical Path)
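To make these two metrics concrete, the following C sketch plugs hypothetical numbers into the definitions above: a 400-time-unit uniprocessor schedule, a 60-time-unit schedule on 8 processors, and a 50-time-unit critical-path cost. None of these values come from the thesis.

#include <stdio.h>

/* Metrics as defined in Section 1.6.2; all inputs are illustrative. */
static double efficiency(double t_uni, double t_multi, int procs)
{
    return t_uni / (t_multi * procs) * 100.0;   /* percentage */
}

static double nsl(double makespan, double critical_path_cost)
{
    return makespan / critical_path_cost;
}

int main(void)
{
    printf("Efficiency = %.1f%%\n", efficiency(400.0, 60.0, 8));  /* 83.3%% */
    printf("NSL        = %.2f\n", nsl(60.0, 50.0));               /* 1.20   */
    return 0;
}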

1.6.3 Resource management: Resource management is the key factor in achieving maximum throughput. Management of resources generally includes resource

inventories, fault isolation, resource monitoring, a variety of autonomic capabilities

and service-level management activities. The most interesting aspect of the resource

management area is the selection of the correct resource from the parallel computing

resource alternatives.

1.6.4 Security: Prediction of the heterogeneous nature of resources and security

policies is complicated and complex in a parallel computing environment. These

computing resources are hosted in different security areas. So middleware solutions

must address local security integration, secure identity mapping, secure

access/authentication and trust management.

1.6.5 Reliability: Reliability is the ability of a system to perform its required functions under proposed conditions for a certain period of time. Heterogeneous

resources, different needs of the implemented applications and distribution of users at

the different places may generate insecure and unreliable circumstances. Due to these

problems it's not possible to generate ideal parallel computing architectures for the

execution of large and real time parallel processing problems.

1.6.6 Fault tolerance: A fault is a physical distortion, imperfection or fatal mistake

that occurs within some hardware or software packages. An error is the deviation of

the results from the ideal outputs. A fault may thus produce an error, and an error may lead to a system failure. Due to these possibilities, a parallel computing system must be fault tolerant; if any such problem occurs, the system must still be capable of generating proper results.

1.7 NP Hard scheduling

Most scheduling problems can be considered as optimization problems, as we

look for a schedule that optimizes a certain objective function. Computational

complexity provides a mathematical framework that is able to explain why some

problems are easier than others to solve. It is accepted that more computational


complexity means the problem is hard. Computational complexity depends upon the

input size and the constraints imposed on it.

Many scheduling algorithms involve sorting n jobs, which requires at most O(n log n) time. These types of problems can be solved by exact methods in polynomial time. The class of all polynomially solvable problems is called class P. Another class of optimization problems is known as NP-hard (NP-complete) problems. For NP-hard problems, no polynomial-time algorithms are known and it is generally assumed that these problems cannot be solved in polynomial time. Scheduling in a parallel computing environment is an NP-hard problem due to the large number of resources and jobs to be scheduled. Heterogeneity of resources and jobs makes the scheduling problem NP-hard.

1.7.1 Classes P and NP in parallel computing: An algorithm is a step-by-step

procedure for solving a computational problem. For a given input, it generates the

correct output after a finite number of steps. The time complexity or running time of an

algorithm expresses the total number of elementary operations such as additions,

multiplications and comparisons. An algorithm is said to be a polynomial or a polynomial-time algorithm if its running time is bounded by a polynomial in the input size. For scheduling problems, typical values of the running time are, e.g., O(n^2) and O(nm).

1. A problem is called a decision problem if its output range is {yes, no}.

2. P is the class of decision problems which are polynomially solvable.

3. NP is the class of decision problems for which a proposed solution can be verified in polynomial time; it is not known whether every problem in NP can be solved in polynomial time.

1.8 Threads in parallel processing algorithms

A program in execution is known as a process. Light weight processes are called

threads. The commonly used term thread is taken from "thread of control", which is just

a sequence of statements in a program. In shared memory architectures, a single

process may have multiple threads of control. Dynamic threads are used in shared

memory architectures. In dynamic threads, the master thread controls the collection of

worker threads. The master thread forks worker threads; these threads complete


their assigned requests and, after termination, rejoin the master thread. This facility

makes efficient use of system resources, because threads use resources only while in the running state. This phenomenon reduces idle time of the participating

processors. These facilities of threads maximize throughput and try to minimize the

communication overhead. Static threads are generated by the required setup which is

known in advance.
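A minimal POSIX-threads sketch of the master/worker pattern described above; the worker body and the thread count are illustrative, and the program would typically be compiled with -lpthread.

#include <pthread.h>
#include <stdio.h>

#define WORKERS 4

/* Each worker handles one slice of the work and then terminates,
   rejoining the master thread (the dynamic-thread model sketched above). */
static void *worker(void *arg)
{
    int id = *(int *)arg;
    printf("worker %d: processing its slice\n", id);
    return NULL;
}

int main(void)
{
    pthread_t tid[WORKERS];
    int id[WORKERS];

    for (int i = 0; i < WORKERS; i++) {        /* master forks workers  */
        id[i] = i;
        pthread_create(&tid[i], NULL, worker, &id[i]);
    }
    for (int i = 0; i < WORKERS; i++)          /* workers rejoin master */
        pthread_join(tid[i], NULL);
    return 0;
}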

1.8.1 Thread benefits: Multithreaded programming has the following benefits.

1.8.1.1 Responsiveness: If some threads are damaged or blocked in a multithreaded application, the program can still continue to run. It facilitates user interaction

even if a process is loaded in another thread. So it increases the responsiveness of

users.

1.8.1.2 Economy and resource sharing: During the execution of a process, it is very costly to allocate memory in traditional parallel computing environments. In

multithreading, all threads of a process can share allocated resources. So the creation

and context switching of the threads of a process are economical, because the creation and allocation of a process are much more time consuming than those of a thread. So a thread generates very low overhead in comparison with a process. Creating a process is

thirty times slower than creating a thread in Solaris 2 systems. Context switching is

five times slower for processes than for threads. The code sharing facility allows an application to have several different threads of activity within the same address

space. These threads may share resources effectively and efficiently.

1.8.1.3 Utilization of multiprocessor architectures: Each thread of all processes

may run simultaneously upon different processors. Kernel-level thread parallelization increases the usage of processors. Kernel-level threads are supported directly by the operating system without user-thread intervention, whereas in traditional multiprogramming, kernel-level processes cannot be generated by the operating system.

1.9 DAG in different scheduling environments

Any parallel program may be represented by a DAG. In the execution of parallel tasks, more than one computing unit is required concurrently [21]. DAG scheduling has

been investigated on a set of homogeneous and heterogeneous environments. In


heterogeneous systems, the processors have different speeds and execution capabilities. The

computing environments are connected by the networks of different topologies. These

topologies (ring, star, mesh, hypercube etc.) may be partially or fully connected.

Minimization of the execution time is the main objective of DAG scheduling. It allocates the

tasks to the participating processors by preserving precedence constraints. Scheduling

length of a schedule (program) is the overall finish time of the program. This

scheduling length is termed as the mean flow time [2, 17] or earliest finish time in

many proposed algorithms.

In a DAG, a parallel program is represented by G = (V, E), where V is the set of

nodes (n) and E is the set of directed edges (e). A set of instructions which can be executed sequentially on the same processor without preemption is known as a node of the DAG.

Every edge (ni, nj) represents the corresponding communication message between nodes, preserving the precedence constraints. The communication cost between the nodes is denoted by c(ni, nj). The sink node is called the child node and the source node is called the

parent node of a DAG. A node without a child node is known as an exit node, while a node without a parent node is called an entry node. Precedence constraints mean that a child node cannot be executed before the execution of its parent node. If two nodes are

assigned to the same processor, then the communication cost between them is assumed to be

zero.
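A possible C sketch of this DAG representation, assuming a small fixed node limit and a hypothetical processor mapping, with the communication cost set to zero when two nodes share a processor as stated above.

#include <stdio.h>

#define MAX_NODES 16

/* Minimal DAG representation sketched from the description above:
   w[i] is the computation cost of node i, and c[i][j] the communication
   cost of edge (ni, nj); c[i][j] == 0 means "no edge" in this sketch.   */
struct dag {
    int n;
    double w[MAX_NODES];
    double c[MAX_NODES][MAX_NODES];
};

/* Effective communication cost: zero when both nodes are mapped to the
   same processor, as stated in Section 1.9.                             */
static double comm_cost(const struct dag *g, int i, int j, const int proc[])
{
    return (proc[i] == proc[j]) ? 0.0 : g->c[i][j];
}

int main(void)
{
    struct dag g = { .n = 3 };
    int proc[MAX_NODES] = { 0, 0, 1 };    /* illustrative mapping */

    g.w[0] = 2; g.w[1] = 3; g.w[2] = 4;
    g.c[0][1] = 5;                        /* entry node 0 -> node 1 */
    g.c[1][2] = 6;                        /* node 1 -> exit node 2  */

    printf("c(0,1) = %.1f (same processor)\n", comm_cost(&g, 0, 1, proc));
    printf("c(1,2) = %.1f (different processors)\n", comm_cost(&g, 1, 2, proc));
    return 0;
}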

The loop structure of the program can't be explicitly modeled in the DAG. Loop

separation techniques [4, 18] divide the loop into many tasks. All iterations of the loop are started concurrently with the help of the DAG. This model can be used for large matrix

multiplications and data flow problems. For numerical applications such as the Fast Fourier Transform (FFT) or Gaussian Elimination, the loop bound must be known at the time of compilation, so many iterations of a loop can be encapsulated in a task. Message

passing primitives and memory access operations are means of calculation of node

weight and edge weight [15]. The granularity of tasks is also decided by the programmer [16]. The scheduling algorithms [22] are used to refine the granularity of the

DAG.

Preemptive and non-preemptive scheduling approaches were investigated by [3] in

homogeneous computing architectures. They used independent tasks without having

precedence constraints. In DAGs, the condition of precedence constraints was inserted by


Chung and Ranka [6] for preemptive and non-preemptive scheduling. They inserted an earliest-time factor into list scheduling strategies.

In preemptive scheduling, a partial portion of the task can be reallocated to the

different processors [11, 13, 19], while in the case of non-preemptive scheduling, allocated processors cannot be re-allocated until they finish their assigned tasks. Theoretically, the flexibility and resource utilization of preemptive scheduling are greater than those of non-preemptive scheduling. Practically, re-assigning a partial part of a task causes

extra overhead. Preemptive scheduling admits polynomial time solutions, while non-preemptive scheduling is NP-complete [5, 10]. Scheduling with duplication upon different

processors is also NP-complete. Communication delay amongst preemptive tasks is

greater due to preemptions of processors.

Problems of conditional branches may be analyzed by a DAG. The scheduling of probabilistic branches is reported in [20]. In this scheduling, each branch is associated

with some non-zero probability having some precedence constraints. Towsley [20]

used a two-step method to generate a schedule. In the first step, he tries to minimize the amount of indeterminism, while in the second step a reduced DAG may be generated by using pre-processor methods. The problem of conditional branches is also investigated in

[9] by merging schedules to generate a unified schedule.

With the above perspective, the following objectives were attempted in this thesis:

In the first chapter of the thesis, basic concepts and laws related to parallel computing

are discussed. DAG is a booming field in parallel computing, so a brief discussion

about DAG is also covered.

The selection of tools for high performance computing is a challenging problem. Therefore, a systematic and comparative study of parallel computing tools is presented in the second chapter. With the help of this comparative analysis, a researcher can choose an appropriate tool for high performance parallel computing.

In the third chapter, a classification of parallel computing scheduling is given.

These scheduling techniques are used in static, dynamic, preemptive and non-

preemptive environments. The best task partitioning strategy can be chosen with the help of a comparison table. Important parameters related to each scheduling strategy may be analyzed by using this comparison table.


In chapter four, an iterative task partitioning model has been proposed. This model contributes a low communication factor, which reduces the communication cost

amongst the tasks.

In chapter five, we proposed an Optimal Task Partitioning Strategy with

Duplication (OTPSD) in a parallel computing environment. The performance of the proposed strategy has been compared with two well-known algorithms, MCP and HEFT. The outperformance of the proposed algorithm over the compared algorithms has been discussed in terms of experimental results.

A proposed Task Partitioning Strategy with Minimum Time (TPSMT) is

discussed in chapter six. We compared this algorithm with the HEFT and CPFD algorithms. The efficiency and normalized schedule length of the proposed algorithm show better performance than HEFT and CPFD.

In chapter seven, we proposed a Preemptive Task Partitioning Strategy

(PTPS) with fast preemptions. This algorithm is a preemptive class algorithm. PTPS

has been compared with preemptive MCP and FPS. We compute the average NSL value upon different numbers of processors. The calculated NSL value shows better results than preemptive MCP and FPS.


REFERENCES

[1] G. M. Amdahl, "Validity of the Single-Processor Approach to Achieving

Large Scale Computing Capabilities," In AFIPS Conference Proceedings,

(1967), 480-481.

[2] J. Bruno, E. G. Coffman and R. Sethi, "Scheduling Independent Tasks to

Reduce Mean Finishing Time," Commun. ACM, 17, (1974), 380-390.

[3] J. Blazewicz, J. Weglarz and M. Drabowski, "Scheduling Independent 2-

Processor Tasks to Minimize Schedule Length," Inf. Process. Lett., 18(5),

(1984), 267-273.

[4] M. Beck, K. Pingali and A. Nicolau, "Static Scheduling for Dynamic Dataflow

Machines," J. Parallel Distrib. Comput., 10, (1990), 275-291.

[5] E. G. Coffman and R. L. Graham, "Optimal Scheduling for Two-Processor

Systems," Acta Inf., 1, (1972), 200-213.

[6] Y. C. Chung and S. Ranka, "Applications and Performance Analysis of a

Compile-Time Optimization Approach for List Scheduling Algorithm on

Distributed Memory Multiprocessors," In Proceedings of the 1992

Conference on Supercomputing (Supercomputing '92, Minneapolis, MN, Nov.

16-20), R. Werner, Ed. IEEE Computer Society Press, Los Alamitos, CA,

(1992), 515-525.

[7] R. Cheng, M. Gen and Y. Tsujimura, "A Tutorial Survey of Job-Shop

Scheduling Problems using Genetic Algorithms," Comput. Ind. Eng., 30 (4),

(1996), 983-997.

[8] A. W. David and D. H. Mark, "Cost-Effective Parallel Computing," IEEE

Computer, (1995), 67-11.

[9] El-Rewini and M. A. Hesham, "Static Scheduling of Conditional Branches in

Parallel Programs," J. Parallel Distrib. Comput., 24, (1995) 41-54.

[10] T. Gonzalez and S. Sahni, "Preemptive Scheduling of Uniform Processor

Systems," J. ACM, 25(1), (1978), 92-101.

[11] Gustafson, J. L., "Reevaluating Amdahl's Law," Ames lab web link

http://www.scl.ameslab.gov/Publications/AmdahlsLaw/Amdahls.html, (1996).


[12] E. C. Horvath, S. Lam and R. Sethi, "A Level Algorithm for Preemptive Scheduling," J. ACM, 24(1), (1977), 36-47.

[13] T. Haluk, H. Salim and Y. W. Min, "Performance-Effective and Low

Complexity Task Scheduling for Heterogeneous Computing," IEEE

Transactions on Parallel and Distributed Systems, 13(3), (2002), 262-279.

[14] H. Jiang, L. N. Bhuyan, and D. Ghosal, "Approximate Analysis of

Multiprocessing Task Graphs," In Proceedings of the International

Conference on Parallel Processing, (1990), 230-238.

[15] Y. K. Kwok and I. Ahmad, "Efficient Scheduling of Arbitrary Task Graphs to

Multiprocessors using a Parallel Genetic Algorithm," J. Parallel Distrib.

Comput., 47(1), (1991), 55-78.

[16] J. Y. T. Leung and G. H. Young, "Minimizing Schedule Length Subject to

Minimum Flow Time," SIAM J. Comput., 18(2), (1989), 310-333.

[17] B. Lee, A. R. Hurson and T. Y. Feng, "A Vertically Layered Allocation

Scheme for Data Flow Systems," J. Parallel Distrib. Comput., 11(3), (1991),

172-190.

[18] V. J. Rayward-Smith, "The Complexity of Preemptive Scheduling given

Interprocessor Communication Delays," Inf. Process. Lett., 25(2), (1987),

120-128.

[19] D. Towsley, "Allocating Programs Containing Branches and Loops within a

Multiple Processor System," IEEE Trans. Softw. Eng. SE-12, 10, (1986),

1018-1024.

[20] Q. Wang and K. H. Cheng, "List Scheduling of Parallel Tasks," Inf. Process. Lett., 37(5), (1991), 289-295.

[21] T. Yang and A. Gerasoulis, "PYRROS: Static Task Scheduling and Code

Generation for Message Passing Multiprocessors," In Proceedings of the 1992

International Conference on Supercomputing (ICS '92, Washington, DC), K.

Kennedy and C. D. Polychronopoulos, Eds. ACM Press, New York, NY,

(1992), 428-437.


Comparative Study of Parallel Computing Tools


CHAPTER 2

COMPARATIVE STUDY OF PARALLEL COMPUTING

TOOLS

2.1 Introduction

In this chapter, the performance of frequently utilized parallel processing tools is

discussed and compared. Each tool has its own features, which may be completely

different or bear some resemblance to others. Evaluation and prediction of parallel programming tools are very important because they are used to

identify the factors that influence application execution. Parallel computing without

an efficient, flexible, portable and interoperable tool is not effective in high performance

distributed computing systems. Portability is a measure of the ease with which

software can be moved between heterogeneous computing platforms. Software which

can run on diverse platforms increases longevity because it can be migrated from

older to newer platforms. Software tools for a high performance computing

environment require the knowledge of internal architecture and components of the

distributed systems [22]. In case of distributed systems, different products are

combined with different strengths to achieve a combination which is better than any

of the individual components. Therefore over the last few years, a number of

applications are designed to solve message passing problems in distributed

environments. Inter-process communication amongst various processes during the

computation is the major factor for high speed computing. MPI is a standardized

interface for inter-process communication in high performance massively parallel

processing (MPP) computer systems. It provides more highly optimized and efficient

implementations than PVM. Message passing interface (MPI) [3] and parallel virtual

machine (PVM) [5] are the most popular tools for message passing. PVM and MPI

have different nature in terms of architectures [12]. ScaLAPACK internally

implements parallelism with the use of the BLACS. BLACS provides a message

passing interface that may be implemented efficiently and uniformly across a large

range of parallel computing and distributed systems.


The BLACS are a message-passing tool designed for linear algebra which can be

used in ScaLAPACK. Message passing through BLACS is easy and convenient for

message passing environments. The portability issue is more important than performance for two reasons: communication and fault tolerance. PVMPI uses non-MPI

functions to join separately started MPI jobs. PVMPI shows transparency and

flexibility amongst different types of distributed computing systems. It combines the

best features of PVM and MPI. PVM and MPI effectively harness the capability of a

distributed network of UNIX workstations as well as heterogeneous network of UNIX

and Windows NT workstations.

Parallel Computing Toolbox and MATLAB are server oriented software, which

are used to solve data-intensive problems. How can one compare these programming

environments? Why are some of these programs easier to use than others? How do portability and interoperability play an important role in parallel computing distributed systems?

What are the advantages and disadvantages of the software? In this chapter, we shall

discuss each of these programming environments in detail.

2.2 PVM

PVM is specifically designed for heterogeneous networks of computers. It is de

facto standard for the message passing environment [11]. It offers good facilities for parallel

computing but poor scheduling facilities for distribution of workload. PVM supports

certain forms of error detection and recovery. It encapsulates the functions amongst

heterogeneous distributed computing systems. The PVM system has gained widespread acceptance in the high performance scientific computing community.

A virtual parallel machine may be constructed by using PVM due to its simplicity.

The virtual machine concept is the one in which a dynamic collection of heterogeneous computational resources is managed as a single parallel computer. It has been strongly supported by PVM. Dynamic process groups are layered above the core PVM routines.

Group members are uniquely numbered from zero to the number of group members

minus one.

Functions that logically deal with groups of tasks are known as broadcast and

barrier functions. A process can link to multiple groups, and groups can change

dynamically at any time during computations [18]. PVM essentially consists of a


collection of protocols to implement reliable and sequential data transfer. Mutual

exclusion enables preemption of deadlock in several situations [11]. PVM facilitates the ability to utilize shared memory and high speed networks like ATM to move data

amongst the clusters of shared memory multiprocessors.
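A hedged master-side sketch of this PVM message-passing style follows; the worker executable name ("worker"), the message tags and the payload are assumptions for illustration, and only standard PVM 3 calls (pvm_mytid, pvm_spawn, pvm_initsend, pvm_pkint, pvm_send, pvm_recv, pvm_upkint, pvm_exit) are used.

#include <stdio.h>
#include "pvm3.h"

/* Master-side sketch of the PVM message-passing style described above.
   The worker executable name "worker" and the message tags are
   illustrative assumptions, not taken from the thesis.                 */
int main(void)
{
    int mytid = pvm_mytid();          /* enrol this task in the PVM     */
    int tid, data = 42, reply;

    if (pvm_spawn("worker", NULL, PvmTaskDefault, "", 1, &tid) == 1) {
        pvm_initsend(PvmDataDefault); /* new send buffer (XDR encoding) */
        pvm_pkint(&data, 1, 1);
        pvm_send(tid, 1);             /* tag 1: work request            */

        pvm_recv(tid, 2);             /* tag 2: worker's reply          */
        pvm_upkint(&reply, 1, 1);
        printf("master %d received %d\n", mytid, reply);
    }
    pvm_exit();                       /* leave the virtual machine      */
    return 0;
}

Such a program would typically be linked against libpvm3 and started on an already running virtual machine.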

2.2.1 Evolution of PVM: The development of PVM started in 1989. Vaidy Sunderam, a professor at Emory University, visited Oak Ridge National Laboratory to do research with Al Geist on distributed computing [25]. Research associate Bob

Manchek joined the team and implemented a portable version, PVM 2.0, in 1991. PVM 2.0 became publicly available through the valuable efforts of Dongarra. On

the basis of the requirements of the users, some additional interfaces were designed by

the experts that allow third party debugging and resource management [4].

A PVM software framework for heterogeneous concurrent computing is available

in different varieties of disciplines [18]. It supports dynamic process management, while in other systems the processes are statically defined. The performance issue was dealt with primarily in terms of communication overhead at the time of message passing. In the

dynamic PVM, task migration is possible within the system without affecting the

operation of user or application programs.

2.2.2 Portability, interoperability and evolution of PVM: The advancement of

technology has been improving the computing speed of the workstations rapidly.

Software with a high degree of portability is highly valuable because it can be used by

the various users. Great diversity of computer hardware and software poses a

fundamental threat to software portability. Computational enhancement through the

network may be achieved by PVM due to its portable capabilities [19]. PVM

provides the facility of language interoperability also, i.e., a FORTRAN program can send a message that is received by a C program and vice-versa. A user defined

collection of serial, parallel and vector computers emulates a large distributed

memory computer under PVM. PVM facilitates automatic startup of tasks on the

virtual machine with message passing and synchronization. Multiple users can

configure overlapping virtual machines and each user can execute several PVM

applications simultaneously. PVM is highly effective over ATM networks [21].


Figure 2: Protocol hierarchy of PVM in ATM network (layers: application, PVM message passing interface, PVM internals, socket interface, UDP/IP, and ATM AAL5/API over an ATM switch)

A set of routines is given by the different standards of PVM. These routines enable a user to add and delete hosts from the virtual machine. Synchronization may

be achieved by sending a Unix signal to other tasks, or by using barriers. To manage

and create multiple send and receive buffers, some special routines have been provided by PVM for high speed computing over the network [21]. Message buffers

are allocated dynamically to save the data temporarily by native machines [14]. PVM

provides the set of functions to enable applications to communicate directly with each

other via TCP sockets. The successful execution of large simulations without fault detection and recovery is not possible. Failure of any workstation in a large

simulation will cause unexpected results. Hence there are several types of applications

that explicitly require a fault tolerant execution environment due to the safety level of

requirements. When the status of a virtual machine changes or a task falls under the

control of users, PVM gives a special event message which contains information about that particular event [17]. This message gives the opportunity to respond to the fault without

hanging or failing. This fault tolerance message is used to control the


computing resources. Many systems have been designed specifically for fault

tolerance purposes, which use Condor on top of the PVM environment. The fault tolerance facility of PVM boosts the performance of the message passing environment.

2.2.3 Computing model of PVM: The computing model of PVM provides the facility of asynchronous blocking (send, receive) and non-blocking functions. In point-to-point communication functions, the computing model supports multicast to a set of tasks and

broadcast to a user defined group of tasks. Tasks can communicate with each other

without the limitation of size and number of messages [18]. Figures 3 and 4 illustrate the PVM computing model and an architectural abstraction of the system.

Figure 3: PVM architectural view (input and partitioning across computers, inter-component and inter-instance communication and synchronization, and output and display)

Figure 4: PVM computational model (clusters, MPPs and vector supercomputers connected through bridges/routers are presented as a uniform view of a multi-programmed virtual machine)

2.3 Message Passing Interface

MPI, designed by the MPI Forum (a diverse collection of implementors, library writers and end users), is quite independent of any specific implementation. The MPI [24] specifications can be used for writing portable parallel programs. MPI is intended to be a library for writing application programs, not a distributed operating system. A fixed set of processes may be created at program initialization, and each process knows the rank of the others. MPI is designed to be modular in order to accelerate the development of portable libraries.

MPI has more than one freely available, high-quality implementation. MPI groups are solid, efficient and deterministic. MPI (Message Passing Interface) also defines a third-party profiling mechanism. It provides over 40 routines related to groups, communicators and virtual topologies, which are built upon MPI communicators and groups. Support for UNIX-specific features is not mandatory in MPI. Tasks are assigned to the processors according to the capabilities of the systems.
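A minimal sketch of this process model in C is shown below (a generic illustration, not code from the thesis): every process learns its own rank, and rank 1 sends a single integer to rank 0 using point-to-point calls.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, value = 0;

        MPI_Init(&argc, &argv);                 /* fixed set of processes is created */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* each process knows its own rank   */
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (size > 1) {
            if (rank == 1) {
                value = 99;
                MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
            } else if (rank == 0) {
                MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                printf("rank 0 received %d from rank 1\n", value);
            }
        }

        MPI_Finalize();
        return 0;
    }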

2.3.1 Evolution of MPI: In parallel computing, processes use a library to exchange messages. This message passing approach allows processes running on multiple processors to cooperate in the efficient solution of large problems. A full MPI implementation, MPICH [16], is portable to all parallel machines and workstations running Chameleon, p4 or PVM [2]. The MPI interface is meant to provide essential virtual topology, synchronization and communication functionality between a set of processes. Applications written for MPI-1 were not portable across a network of workstations because there was no standard method to start MPI tasks on separate hosts. In 1995, the MPI committee started efforts to design the MPI-2 specification to correct this portability problem and to add further communication functions. With MPI-2, one-sided communication and dynamic process management came into existence across a number of tasks. A key aspect of the dynamic process management feature is the ability of an MPI process to participate in the creation of new MPI processes. This facility also makes it possible to communicate with MPI processes that have been started separately.

MPI-1 and MPI-2 implementations show good results in overlapping communication and computation. MPI also specifies thread-safe interfaces, which have well-defined cohesion and coupling strategies. Writing multithreaded point-to-point MPI code is easier than writing the equivalent code supported by other implementations. MPI-2 defines three one-sided communication operations: put, get and accumulate, corresponding to a remote memory write, a remote memory read and a reduction operation on the same remote memory across a number of tasks. MPI-2 also defines methods for the synchronization of this communication.
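The following hedged sketch illustrates the put operation: rank 0 writes a value directly into a window of memory exposed by rank 1, with fences providing the synchronization just mentioned (the window size and transferred value are arbitrary choices for illustration).

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, buf = 0, value = 7;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Every process exposes one int of its memory as a window. */
        MPI_Win_create(&buf, sizeof(int), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);                  /* open the access epoch   */
        if (rank == 0 && size > 1)
            MPI_Put(&value, 1, MPI_INT,         /* origin buffer           */
                    1, 0, 1, MPI_INT, win);     /* target rank 1, disp 0   */
        MPI_Win_fence(0, win);                  /* complete the epoch      */

        if (rank == 1)
            printf("rank 1 window now holds %d\n", buf);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }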

2.3.2 Portability, interoperability and fault tolerance of MPI: MPI is highly portable; compilation and implementation with virtual topologies and efficient buffer management are among its excellent features. It provides an efficient and flexible standard for portable message passing. The MPI specification does not provide interoperability facilities, because a program written in C cannot communicate with a program written in FORTRAN, even when executing on the same architecture. Portability plays an important role in reusing software tools for communication [15].

Earlier versions of MPI do not support a fault tolerance mechanism. In the MPI-1 model, hosts and tasks are considered static in terms of fault tolerance. MPI-2 includes a specification for spawning new processes to expand the capabilities of the original static MPI-1 [10]. Message passing across tasks is very convenient because each task already knows about every other task, and all communication can be carried out without the explicit need for a special daemon. In the fault tolerance mechanism, the management of faults is handled as follows:

(1) For restart of an entire application, intermediate results should be saved periodically by the programmer on reliable media (a minimal checkpointing sketch is given below).

(2) The functions of the MPI implementation may be augmented to return information about faults and to accept communicator reconfiguration.

(3) A fully automatic fault detection and recovery facility of the MPI implementation hides the fault from programmers and users.
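As a minimal illustration of option (1) above (a generic sketch only; the file name, data and checkpoint interval are arbitrary and not part of any MPI standard), a long-running computation can periodically write its intermediate state to reliable storage so that the whole application can be restarted from the last checkpoint:

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical checkpoint: save the iteration counter and the partial
       result vector so a restarted run can resume from this point. */
    static int save_checkpoint(const char *path, int iter,
                               const double *partial, size_t n)
    {
        FILE *fp = fopen(path, "wb");
        if (fp == NULL)
            return -1;
        fwrite(&iter, sizeof iter, 1, fp);
        fwrite(partial, sizeof *partial, n, fp);
        fclose(fp);
        return 0;
    }

    int main(void)
    {
        enum { N = 1000, CHECK_EVERY = 100 };   /* assumed sizes */
        double partial[N];
        int i;

        for (i = 0; i < N; i++) {
            partial[i] = (double)i * i;          /* stand-in for real work */
            if ((i + 1) % CHECK_EVERY == 0)      /* periodic checkpoint    */
                save_checkpoint("state.ckpt", i + 1, partial, N);
        }
        return 0;
    }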

2.3.3 Implementation of MPI: MPI specifications can be used for writing portable and parallel programs [13]. The impetus for developing MPI, according to computing experts, is that each Massively Parallel Processor (MPP) vendor was creating its own proprietary message passing library; MPI is intended to be a standard message passing specification that each MPP vendor would implement on its systems. The model in Figure 5 depicts the implementation issues that arise when a socket is used for communication.

Figure 5: Two queues used in a typical MPI implementation (the application posts receive requests to the MPI library, which manages a received message queue, RMQ, and a receive request queue, RRQ, above the runtime, kernel and socket layers)

MPI is modular, which accelerates the development of portable parallel libraries, and it offers more than one freely available, high-quality implementation with full asynchronous communication.


2.4 BLACS

The BLACS (Basic Linear Algebra Communication Subprograms) are message passing routines that communicate matrices among processes arranged in two-dimensional virtual process topologies. They form the basic communication layer of ScaLAPACK. MPI provides the most suitable message passing layer for the BLACS: it is widely available and offers the high-level functionality needed to support the BLACS communication semantics, giving it several advantages over other communication libraries such as PVM [10]. The computational model of the BLACS consists of a one- or two-dimensional process grid in which each process stores pieces of the matrices and vectors. The BLACS permit a process to be a member of several overlapping or disjoint process grids. The goal of the BLACS was to support the implementation of a core set of linear algebra routines such as those provided in LAPACK [4]. The package attempts to provide the same ease of use and portability for distributed-memory linear algebra communication as the BLAS provide for linear algebra computation in ScaLAPACK [20]. Portability of the BLACS has recently been further increased by the development of a version that uses MPI; a PVM version of the BLACS is also available.

2.4.1 ID-less communication in BLACS: ID-less communication means that no message ID is carried across the processes during communication. In conventional message passing, every message has an ID (msgid) that uniquely identifies it during communication. However, the generation of these IDs can be problematic: if non-unique IDs are used in two sections of code that are not separated by an explicit barrier, messages from one section can be matched in the other. Such programming mistakes can lead to non-deterministic code that gives wrong results. Within a point-to-point communication scope, messages without IDs work properly in the BLACS, whereas in MPI messages require ordering for identification. BLACS users do not need to supply any msgid for broadcasts/combines.
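A hedged sketch of the usage pattern described above, using the C interface of the reference BLACS distributed with ScaLAPACK (the 2 x 3 grid shape is an arbitrary choice), sets up a two-dimensional process grid and queries each process's grid coordinates; note that no message IDs appear anywhere:

    #include <stdio.h>

    /* C interface of the reference BLACS (shipped with ScaLAPACK). */
    extern void Cblacs_pinfo(int *mypnum, int *nprocs);
    extern void Cblacs_get(int context, int request, int *value);
    extern void Cblacs_gridinit(int *context, char *order, int nprow, int npcol);
    extern void Cblacs_gridinfo(int context, int *nprow, int *npcol,
                                int *myrow, int *mycol);
    extern void Cblacs_gridexit(int context);
    extern void Cblacs_exit(int status);

    int main(void)
    {
        int iam, nprocs, ctxt, nprow = 2, npcol = 3, myrow, mycol;

        Cblacs_pinfo(&iam, &nprocs);                 /* my id and total process count */
        Cblacs_get(-1, 0, &ctxt);                    /* obtain the default system context */
        Cblacs_gridinit(&ctxt, "Row", nprow, npcol); /* build a 2 x 3 process grid */
        Cblacs_gridinfo(ctxt, &nprow, &npcol, &myrow, &mycol);

        if (myrow >= 0 && mycol >= 0) {              /* processes outside the grid get -1 */
            printf("process %d sits at grid position (%d,%d)\n", iam, myrow, mycol);
            Cblacs_gridexit(ctxt);
        }
        Cblacs_exit(0);
        return 0;
    }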

2.5 ScaLAPACK

The ScaLAPACK library includes core factorization routines that allow the factorization and solution of dense systems of linear equations via LU, QR and Cholesky. This scalable library for concurrent computers is designed to be fully compatible with the LAPACK library for vector and shared memory computers. ScaLAPACK also makes use of block-partitioned algorithms, and it can be used to solve grand challenge problems on massively parallel distributed memory concurrent computers. Heterogeneous ScaLAPACK is a software package that provides optimized parallel linear algebra programs for heterogeneous computational clusters (HCCs).

2.5.1 Portability and scalability of ScaLAPACK: The main goal in designing linear algebra algorithms is to minimize the movement of data between the different levels of memory of distributed computing systems. Block-partitioned algorithms pass blocks instead of vectors, which reduces startup costs because fewer messages are exchanged [6]. Block-partitioned algorithms with an adjustable block size use poly-algorithms (where the actual algorithm is selected at run time) to achieve the desired granularity of computation. Standard basic message passing as well as higher-level communication constructs (global summation, broadcast, etc.) are necessary to achieve a higher degree of portability.

2.5.2 Software components of ScaLAPACK: ScaLAPACK is the distributed-memory version of LAPACK. These libraries are used to provide portability, flexibility and ease of use of the software. ScaLAPACK can solve systems of linear equations, linear least squares problems, eigenvalue problems and matrix factorizations. The building blocks of the ScaLAPACK library are the distributed-memory version of the BLAS, the PBLAS [26], and a set of BLACS routines. The high performance of LAPACK is achieved by using algorithms that do most of their work in calls to the BLAS [13], with the emphasis on matrix-matrix multiplication and matrix factorization. The BLAS (Basic Linear Algebra Subprograms) are a de facto standard for the basic numerical linear algebra operations and are widely used to solve scientific and engineering problems. The BLAS library can be implemented on many supercomputers, which allows BLACS programs to be moved from one computer to another. The computational model of the BLACS consists of a one- or two-dimensional grid of processes where each process stores pieces of the matrices and vectors [6].
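To make the role of the BLAS concrete, the following minimal sketch (assuming a CBLAS implementation is linked; the matrix sizes and values are arbitrary) performs the matrix-matrix multiplication C = alpha*A*B + beta*C that ScaLAPACK-style algorithms spend most of their time in:

    #include <stdio.h>
    #include <cblas.h>          /* C interface to the BLAS */

    int main(void)
    {
        /* 2x2 matrices stored in row-major order. */
        double A[4] = {1.0, 2.0,
                       3.0, 4.0};
        double B[4] = {5.0, 6.0,
                       7.0, 8.0};
        double C[4] = {0.0, 0.0,
                       0.0, 0.0};

        /* C = 1.0 * A * B + 0.0 * C  (Level-3 BLAS matrix-matrix multiply). */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    2, 2, 2,      /* M, N, K        */
                    1.0, A, 2,    /* alpha, A, lda  */
                    B, 2,         /* B, ldb         */
                    0.0, C, 2);   /* beta, C, ldc   */

        printf("C = [%g %g; %g %g]\n", C[0], C[1], C[2], C[3]);
        return 0;
    }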

2.6 Parallel Virtual Message Passing Interface

Parallel Virtual Message Passing Interface (PVMPI) is a software system that allows independent MPI applications to intercommunicate, even if they are running under different MPI implementations using different language bindings. It allows interoperability between vendors' optimized MPI versions. Because of the lack of interoperability between different MPI implementations, distributed computing across different vendors' machines is problematic. The use of PVMPI is transparent to MPI applications and allows intercommunication via all MPI point-to-point calls [7]. It provides protection from rogue messages for MPI applications after they enroll in PVMPI. The second implementation of the PVMPI system uses some features of PVM 3.4. Under MPI-1, applications cannot communicate with processes outside their MPI_COMM_WORLD without using some other entities. Groups of processes that are registered in PVMPI improve security and performance through a locking attribute.

2.7 Parallel MATLAB

MATLAB started in the 1970s as an interactive interface to EISPACK [23], a set of eigenvalue and linear system solution routines. In 1995, the use of parallel MATLAB in high performance parallel computing was negligible. Its compact interactive environment now provides high performance parallel computing routines, which has generated significant interest in the field of high performance computing. MATLAB's interpreted built-in language, which is similar to C, and its flexible matrix indexing are innovative features for matrix programming problems [9]. The strong graphical capabilities of MATLAB make it a good data analysis tool, and it provides hooks to the Java programming language, making integration with compiled programs easy. With the help of parallel MATLAB, task-parallel and data-parallel algorithms can be implemented at a high level without programming for specific hardware and network architectures. A job is any large operation that needs to be performed in a MATLAB session, and a job is broken into segments called tasks. A job may be divided into identical tasks, though the tasks need not be identical. The MATLAB distributed computing server performs the execution of jobs by evaluating each of their tasks and returning the results to the client [12]. Using file-system-based communication, MatlabMPI [3] implements basic MPI functions in parallel MATLAB using cross-mounted directories. Parallelism may also be achieved through the polymorphism property of parallel MATLAB [1].

2.8 Compute Unified Device Architecture

Inherently parallel computations may be expressed with the Compute Unified Device Architecture (CUDA) programming model. In high performance computing, performance gains now come more from increased parallelism than from increased clock rates, and this high level of parallelism provides significant computational growth.

NVIDIA [28] developed the CUDA model as a forward extension of the C/C++ languages. This programming model improves scalability across a broad spectrum of real parallel computations in distributed computing systems. Programmers are guided by the CUDA model to use multiple threads to expose fine-grained parallelism [29], and it provides an effective abstraction of parallel computation. Two key features of the CUDA programming model are:

• It lets the programmer concentrate on crafting the needed parallel algorithms rather than grappling with the mechanics of an unfamiliar language.

• Highly scalable parallel code can be designed that runs across thousands of parallel threads with the help of the processor cores.

2.8.1 NVIDIA Tesla GPU architecture: The Tesla architecture executes programs written in the CUDA programming model. It consists of multithreaded multiprocessors, and its memory may be accessed by different types of instructions operating on integer as well as floating point data types. The Tesla GPU architecture has the capacity to execute up to 30,720 threads simultaneously, while its earlier versions were capable of executing 12,288 threads concurrently. Thread scheduling incurs very low overhead, and the cost of creating and destroying threads is also negligible. Chip-level parallelism may be achieved by modern Graphics Processing Units (GPUs) [27]. A GPU provides a large number of simple execution units which are capable of achieving speedups of several times. Unlike CPUs, GPUs lack sophisticated pipelines and speculation mechanisms, so GPUs are well suited for data-parallel, compute-intensive tasks. GPUs also offer very high memory bandwidth, delivering data at a fast enough rate. The architecture of GPUs is well suited for single-program, multiple-data (SPMD) parallel programming models.

A CUDA program can be partitioned into two parts: the first kind runs on the device (GPU) and the second is executed on the host (CPU). The compiler (such as NVIDIA's) separates these two kinds of code. There are many APIs available in CUDA for writing programs in high-level languages such as C and FORTRAN.

2.8.2 Thread organization and scheduling of CUDA: To utilize all available resources on a GPU, one must create many threads. These threads are created by specifying the dimensions of the grid and the blocks before calling the kernels. The coordinates of each lattice point in the grid and in a block represent a unique id of the corresponding block and the corresponding thread, respectively. Thread blocks are generated by the programmer for the execution of large applications. Threads are grouped into warps (32-thread units), and these warps are assigned to the streaming processors (SPs). While a warp's instructions are executing, the scheduler selects the next warp to be executed, so there is no overhead in selecting a warp during warp swapping. No time is required for saving context while swapping a warp, as each thread has already been allocated its registers. Shared memory is also shared among all threads in a block.

A scalar sequential program may be executed on a set of parallel threads of the kernel. Entirely different code can be executed by different thread blocks or different threads of a kernel. Task-parallel kernels expose less fine-grained parallelism than data-parallel kernels.
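A minimal CUDA sketch of this thread organization (the array size and launch configuration are arbitrary choices for illustration) computes each thread's global index from its block and thread coordinates and uses it to process one array element:

    #include <cstdio>
    #include <cuda_runtime.h>

    /* Each thread derives a unique global index from its block id and
       thread id and scales one array element. */
    __global__ void scale(float *data, int n, float factor)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)
            data[idx] *= factor;
    }

    int main()
    {
        const int n = 1 << 20;
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));
        cudaMemset(d_data, 0, n * sizeof(float));

        /* Grid and block dimensions are specified at kernel launch. */
        int threadsPerBlock = 256;
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        scale<<<blocks, threadsPerBlock>>>(d_data, n, 2.0f);
        cudaDeviceSynchronize();

        cudaFree(d_data);
        printf("launched %d blocks of %d threads\n", blocks, threadsPerBlock);
        return 0;
    }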

2.9 Comparison of parallel computing tools

An analysis of the different features of these tools is summarized in Table 1 below:

Table 1: Comparison of different features of the parallel computing tools discussed above (PVM, MPI, BLACS, ScaLAPACK, PVMPI, Parallel MATLAB and CUDA)

REFERENCES

[1] C. Ron and E. Alan, "Parallel MATLAB: Doing it Right," Massachusetts

Institute of Technology, Cambridge, (2003), 2-8.

[2] Beguelin et al., "PVM 3 User's Guide and Reference Manual," Technical Report TM-12187, Oak Ridge National Laboratory, (1993).

[3] K. Jeremy, "Parallel Programming With MATLABMPI," High Performance

Embedded Computing HPEC, Workshop, (2001), 23-26.

[4] J. Choi et al., "CRPC Research into Linear Algebra Software for High

Performance Computers," International Journal of Super Computing

Applications, 8, (1994), 99-118.

[5] Beguelin, "PVM: Experiences Current Status and Future Directions,"

Technical Report CS/E 94-015. Oregon Graduate Institute CS, (1994).

[6] L.S. Blackford et al., "ScaLAPACK: A Portable Linear Algebra Library for

Distributed Memory Computers-Design Issues and Performance

Supercomputing," Proceedings of the ACM/IEEE Conference. 0-89791-854-1,

(1996), 5-5.

[7] E. F. Graham and D. J. Jack, "PVMPI: An Integration of the PVM and MPI

Systems," Department of Computer Science Technical Report CS 96 328,

Knoxville, (1996), 8-15.

[8] J. Choi et al., "The Design and Implementation of the ScaLAPACK LU, QR,

and Cholesky Factorization Routines," Technical Report ORNL/TM-12470,

Oak Ridge National Laboratory, (1994), 10-15.

[9] C. Stephane and B. M. Francois, "An Environment for High Performance

MATLAB," In Languages, Compilers and Run-Time Systems for Scalable

Computers, (1998), 27-35.

[10] Geist et al., "PVM: A Users' Guide and Tutorial for Networked Parallel

Computing," MIT Press. (1994), 23-40.

[11] J. A. K. Geist and P. M. Papadopoulos, "PVM and MPI: A Comparison of Features," Calculateurs Parallèles, 8(2), (1996), 20-26.

[12] C. Ron and E. Alan, "Parallel MATLAB survey,"

http://theory.lcs.mit.edu/cly/survey.html, (2001).


[13] W. Gropp and E. Lusk, "PVM and MPI are Completely Different,"

Mathematics and Computer Science Division Argonne National Laboratory.

(1998), 1-4.

[14] C. L. Lawson et al., "Basic Linear Algebra Subprograms for FORTRAN Usage," ACM

Transactions on Mathematical Software, (1979), 308-323.

[15] J. Mooney et al., "Developing Portable Software. Information Technology:

Selected Tutorials," Springer Boston, ISBN 978-1-4020-8158-3, (2004), 55-

85.

[16] E. Nathan et al., "An Initial Implementation of MPI," Technical Report MCS-

P393-1193, Mathematics and Computer Science Division Argonne National

Laboratory, Argonne, IL 60439, (1993), 491-510.

[17] J. Pruyne and M. Livny, "Providing Resource Management Services to

Parallel Applications," In Proceedings of the Second Workshop on

Environments and Tools for Parallel Scientific Computing, (1994).

[18] V. S. Sundaram et al., "The PVM Concurrent Computing System: Evolution,

Experience and Trends," Compcon Spring., (1993), 21-29.

[19] L. C. Sheue et al., "Enhanced PVM Communications Over a High-Speed

LAN," IEEE Parallel and Distributed Technology, 3(3), (1995), 20-32.

[20] H. Salim et al., "A Problem Solving Environment for Network Computing,"

IEEE Computer Society, (1997), 22-28.

[21] C. Xuebin et al., "Developing High Performance BLAS, LAPACK and ScaLAPACK on the Hitachi SR8000," High Performance Computing in the Asia-Pacific Region, The Fourth International Conference/Exhibition, (2), (2000), 993-997.

[22] J. Ali and R. Z. Khan, "A Survey Paper on Task Partitioning Strategies in

Parallel and Distributed System," Globus-An International Journal of

Management & IT, 01, (2010), 5-9.

[23] T. S. Brian et al., "Matrix Eigensystem Routines - EISPACK Guide," Springer-

Verlag, 2nd edition, 5-7, (1976), 124-135.

[24] Message Passing Interface Forum. "Document for a message passing

interface," Technical Report No: CS-93-214 1994. Available on netlib.


[25] V. S. Sundaram et al., "The PVM Concurrent Computing System: Evolution,

Experience and Trends," Parallel Computing, 20, (1994), 10-15.

[26] L. L. Charles et al, "Basic Linear Algebra Subprograms for Fortran Usage,"

ACM Trans. Math. Softw., 5(3), (1979), 308-323.

[27] P. Micikevicius, "3D Finite Difference Computation on GPUs Using CUDA,"

In: GPGPU-2: Proceedings of 2nd Workshop on General Purpose Processing

on Graphics Processing Units, New York, NY, USA, (2009), 79-84.

[28] NVIDIA: NVIDIA CUDA reference manual 3.1, (2010).

[29] S. Ryoo et al., "Optimization Principles and Application Performance

Evaluation of a Multithreaded GPU Using CUDA," In: ACM SIGPLAN

PPoPP., (2008), 2-3.

[30] K. Fatahalian et al., "Understanding the Efficiency of GPU Algorithms for

Matrix-Matrix Multiplication," In Proceedings of the ACM

SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, (2004),

133-137.

[31] The Math Works, "Matlab Software Acknowledgements," Online,

http://www.mathworks.com/access/helpdesk/help/base/relnotes/software.html,

2006.



CHAPTER 3

CLASSIFICATION OF TASK PARTITIONING

STRATEGIES IN DISTRIBUTED PARALLEL

COMPUTING SYSTEMS

3.1 Introduction

Distribution of tasks amongst various computing nodes is a challenging problem in high performance distributed computing systems. Choosing the appropriate strategy for a given system is difficult without a meaningful comparison of existing task partitioning and load balancing strategies. The effectiveness of a strategy depends upon a number of factors, including efficiency, interconnection topology, communication mode, program structure, throughput and the computing capabilities of the high performance computing structures. A number of task partitioning and load balancing strategies have been proposed, each of which produces better results under different circumstances. The main goal of this chapter is to clarify these strategies and to identify when and where each strategy is appropriate. The chapter provides a common terminology and a classification mechanism. Scheduling is a function which is used to assign jobs to the different processors [6]. It is a two-step process consisting of processor allocation and assignment. A job is an autonomous program that executes in its own domain. Resource allocation is done by the scheduler over two dimensions, time and space, and at two levels, jobs and threads. In the running state, a job is organized into threads to reduce overhead.

If a software package executes parallel jobs instead of threads, it increases the load on the parallel computing system. Threads and the communication between them may be static or dynamic [10]. For example, in the MIMD architecture [1], the number of threads and the communication pattern between the threads can change dynamically during execution in a parallel computing system. If multiple executing entities are part of the same application, then the entities are treated as threads and the applications as jobs [30]. A parallel job is a collection of tasks having some precedence relationship. A task can be identified as an executable fragment that must be sequentially executed without partial parallel execution [2]. Not all parallel jobs can be fully parallelized. Effective task partitioning and load balancing of large tasks is required to achieve high performance in a parallel and distributed system. The increasing demand for high performance computing across various fields of science reflects a keen interest in parallel computing. Selection of appropriate strategies for a particular system is a deciding factor in the successful execution of the tasks. In a homogeneous architecture, the serial fraction of the computation executes at the speed of any of the identical processors. Sequential bottlenecks can be greatly reduced by executing tasks on heterogeneous parallel computers, running critical tasks on the faster processors. An efficiency analysis of heterogeneity is presented in [4, 5]. Efficient processor allocation deals with determining the number of processors allocated to a job [7]. A time complexity analysis tool for task partitioning has been developed by Pugh and Nirkhe [8]; it accurately estimates the execution time of a general real-time program employing high-level structures. Towsley and Nelson [9] proposed an analytical model for partitioning independent tasks on different processors. There is no ideal task partitioning strategy for all of the diverse computing architectures. Some important task partitioning strategies under the different architectures are as follows:

3.2 Deterministic task partitioning strategies

In deterministic task partitioning, the characteristics of an application such as communication cost, execution time, synchronization and data dependency are known in advance [11]. Dynamic scheduling, by contrast, performs scheduling at runtime, and the allocation of processes may change during the execution of tasks. Static scheduling cannot support load balancing and fault tolerance, while dynamic scheduling supports these parameters. Deterministic scheduling on a Network of Workstations (NOW) is NP-complete in the strong sense, while nondeterministic scheduling is not strongly NP-complete. Purely static task partitioning with OpenCL determines the best partitioning across the processors in a system; it divides the work into chunks based on the availability of processors [13] (a minimal chunking sketch is given below).
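The following is a minimal sketch of this chunking idea in plain C (it is not the OpenCL partitioner itself; the task count and processor count are arbitrary illustrative values). It divides N tasks as evenly as possible among the available processors:

    #include <stdio.h>

    /* Number of tasks assigned to processor p when n_tasks are divided
       as evenly as possible among n_procs processors. */
    static int chunk_size(int n_tasks, int n_procs, int p)
    {
        int base  = n_tasks / n_procs;   /* every processor gets at least this many  */
        int extra = n_tasks % n_procs;   /* the first 'extra' processors get one more */
        return base + (p < extra ? 1 : 0);
    }

    int main(void)
    {
        int n_tasks = 10, n_procs = 4, p, start = 0;

        for (p = 0; p < n_procs; p++) {
            int len = chunk_size(n_tasks, n_procs, p);
            printf("processor %d: tasks %d..%d\n", p, start, start + len - 1);
            start += len;
        }
        return 0;
    }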

Figure 6: Hierarchical view of different task partitioning strategies on different platforms (deterministic/static versus non-deterministic/dynamic, preemptive versus non-preemptive, each applied to homogeneous and heterogeneous architectures, with representative strategies such as OFT, LPT, PSTS, SRPT-FM and LRPT-FM)

3.2.1 Papadimitriou and Yannakakis scheduling: Papadimitriou and Yannakakis [34] scheduling approximates the absolutely achievable lower bound on the start time of a node by using an e-value attribute. From the entry node up to the exit node, e-values are computed recursively to assign priorities to the nodes. After its e-value is calculated, each node is inserted into a cluster in such a way that its ancestors have a data arrival time larger than the e-value of the node.

3.2.2 Linear Clustering with Task Duplication Scheduling: LCTD (Linear Clustering with Task Duplication) scheduling [35] identifies the edges among clusters that determine the completion time after linear clustering of the DAG. It then attempts to duplicate the parent nodes corresponding to these edges in order to reduce the start times of some nodes in the cluster.

3.2.3 Edge Zeroing Scheduling: Edge Zeroing (EZ) scheduling [33] tries to select a suitable clustering by merging clusters on the basis of edge weights. At every step this strategy finds the edge with the largest weight; if merging the two clusters it connects (zeroing the largest weight) does not increase the completion time, the two clusters are merged in order to reduce the completion time. The ordering of the nodes in the resulting cluster is defined by the static b-level of the nodes, and the DAG edges are sorted in descending order of edge weights.

3.2.4 Modified Critical Path Scheduling: Modified Critical Path (MCP) scheduling [31] decides the priorities of nodes on the basis of the ALAP (As Late As Possible) parameter. It computes the ALAP time of each node and constructs a list of nodes in ascending order of ALAP times. MCP then finds an idle time slot on a processor for a given node: it selects the first node from the list and inserts it into the processor that allows the earliest execution time [15]. It cannot guarantee an optimal schedule for fork and join structures.

3.2.5 Earliest Time First Scheduling: The ETF (Earliest Time First) scheduling technique [32] selects, at each step, the node with the smallest earliest start time from the ready queue. The earliest start time is computed by exhaustively examining the start time of the node on all processors. When two nodes have the same earliest start time, ETF breaks the tie by selecting the node with the higher static priority. Priorities are assigned to the nodes by computing the static b-level.

3.3 Dynamic task partitioning strategies

Position Scan Task Scheduling (PSTS) is a purely dynamic load balancing strategy that can be used in a centralized or decentralized manner [14]. It is based on the divide and conquer principle, in which a grid of dimension k is divided into grids of dimension (k-1), until the dimension is 1. The Position Scan Load Balancing (PSLB) technique is a two-phase technique. In the first phase, the system is modeled as a hypercube. In the second phase, load balancing is performed by obtaining the necessary information for each system node. On the basis of this prior information, the PSLB strategy calculates its future load; after that, load is migrated according to the capabilities of the nodes. An important aspect of this strategy is that, in the case of load migration, each node knows exactly where to send its extra workload, and communication takes place only between neighboring nodes.


3.3.1 Evolutionary task scheduling: Evolutionary Task Scheduling [26] depends upon previous scheduling stages. The use of a heuristic in the initialization phase and a specific mutation operator are two features that provide beneficial results for effective scheduling in a static environment. In a consistent environment, tasks move from higher-level processors to lower-level processors; this can be achieved by using a "rebalancing" operator. In the case of an inconsistent environment, less greedy operators are used to provide the best results. In this scheduling, information collected from the previous state is used by the scheduler to select the environment.

3.3.2 Dynamic Priority Scheduling: Dynamic Priority Scheduling (DPS) [27] assigns priorities to tasks based upon the difference between the bottom level (b-level) and the top level (t-level). This technique tries to achieve the minimum schedule length for Directed Acyclic Graphs (DAGs) [3]. In dynamic process assignment, the scheduler selects more important tasks before less important ones. The mapping and scheduling of strategies depend upon the processor scheduler, the network architecture and the DAG structure [28]. The t-level of a task is the length of the longest path (execution cost plus communication cost) between the entry task of the DAG and the target task, while the b-level is the length of the longest path between the target task and an exit task. In this way, the t-level determines the earliest start time, and this longest path is considered the critical path of the DAG [12].
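In the usual DAG notation, with w(n) the execution cost of node n and c(n, m) the communication cost of edge (n, m) (a generic formulation consistent with the description above, not a quotation from any single cited paper), the two levels can be written as:

\[
tlevel(n) = \max_{p \in pred(n)} \bigl\{\, tlevel(p) + w(p) + c(p,n) \,\bigr\},
\qquad tlevel(n) = 0 \ \text{for an entry node},
\]
\[
blevel(n) = w(n) + \max_{s \in succ(n)} \bigl\{\, c(n,s) + blevel(s) \,\bigr\},
\qquad blevel(n) = w(n) \ \text{for an exit node}.
\]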

Dynamic Level Scheduling (DLS) [29] uses a Generalized Dynamic Level (GDL) to determine dynamic priorities for all tasks in the ready queue. GDL gives priorities to the processors according to their speeds. Many other factors are taken into account because of the different execution costs on each processor: it considers factors such as the median of the execution cost over all processors, the average execution cost, and the maximum or minimum costs when computing the schedule levels.

3.4 Preemptive tasks partitioning strategies

Preemptive Deterministic Scheduling (PDS) ensures deterministic replica behavior while preserving concurrency. Replication is achieved by threads, and there is no communication between the threads. A performance improvement is achieved through multithreading by exploiting concurrency in thread execution. A multithreaded replica shows non-deterministic behavior; when only one physical thread can be scheduled at a given time, it shows poor scalability and performance. PDS schedules multiple threads at the same time, so it shows five times the throughput of Nondeterministic Preemptive Scheduling (NDPS) [11]. An optimal preemptive schedule subject to release dates has been used to minimize the execution time on homogeneous platforms. This strategy performs two steps: in the first step, it minimizes the maximum completion time on the desired number of machines; in the second step, it determines the maximum lateness with respect to the due dates of the jobs [12] on an arbitrary number of machines.

The Fast Preemptive Scheduling (FPS) [17] strategy simulates preemptive task execution at a very low overhead. It requires only small runtime support in heterogeneous and homogeneous parallel computing environments [22]. The Preemptive Task Scheduling (PTS) [18] strategy can also be used for homogeneous distributed memory systems.

3.4.1 Rate Monotonic-Decrease Utilization-Next Fit Scheduling: The RM (Rate Monotonic)-DU (Decrease Utilization)-NFS (Next Fit Scheduling) [19] strategy does not suffer from execution time anomalies. It is a scheduling strategy for periodically arriving tasks in a multiprocessor environment. A scheduling anomaly occurs when either of the following assumptions does not hold: (1) a task is assumed to meet its deadline if it does so when all tasks execute at their maximum execution times, or (2) a task is assumed to meet its deadline if it does so when all tasks arrive at their maximum frequency. RM-DU-NFS is the combination of DU and NFS and is based on a high system utilization bound.

3.4.2 Optimal Preemptive Scheduling: Optimal Preemptive Scheduling [20] is used to schedule different jobs on multiple parallel uniform machines by assigning the job with the shortest remaining processing time to the fastest available machine. Flow time is minimized by serving jobs, with preemptions, in increasing order of their remaining processing time. The Shortest Remaining Processing Time on Fastest Machine (SRPT-FM) rule is also optimal for the discounted flow time criterion: flow time is minimized by scheduling the job with the shortest remaining processing time on the fastest machine. Minimization of makespan is achieved by using the Longest Remaining Processing Time on Fastest Machine (LRPT-FM) rule [21].

3.4.3 Preemption Threshold Scheduling: Preemption Threshold Scheduling (PTS) [23] disables preemption up to a specific priority level, called the preemption threshold. Regular priority assignment to arriving tasks is performed by PTS, and a task can preempt only if the priority of the arriving task is higher than the threshold of the running task.


3.4.4 Fixed Preemption Point Scheduling: In the Fixed Preemption Point (FPP) scheduling strategy [24], a task runs in a non-preemptive mode and preemption can take place only at specific points of the code, known as preemption points. In this way, a task is divided into subtasks. If a higher priority task arrives, preemption is postponed until the next preemption point, so the tasks cooperate with each other through preemption points.

3.5 Non-Preemptive Partitioning Strategies

The Loose Synchronization Algorithm (LSA) uses a non-preemptive deterministic scheduler to maintain multithreaded replica consistency. The non-preemptive Optimal Finish Time (OFT) problem is known to be NP-complete for many uniform processors [14]. Largest Processing Time First (LPT) scheduling is near optimal for non-preemptive OFT.

3.5.1 Multiple Strict Bound Constraints scheduling: Multiple Strict Bound Constraints (MSBC) scheduling is a non-preemptive static strategy for heterogeneous computing systems. It performs an alternative task priority scheduling instead of the Heterogeneous Earliest Finish Time (HEFT) scheme. It is also used to enhance the performance of parallel computing applications on heterogeneous platforms because of its macro data flow graph implementation [16].

3.5.2 Earliest Deadline First scheduling: The EDF (Earliest Deadline First) strategy schedules periodic or sporadic tasks on a single processor without preemption [25]. Periodic tasks are invoked after a fixed time interval, while sporadic tasks are invoked at arbitrary times but within limited time constraints. EDF is universal for the set of periodic or sporadic tasks.

This chapter provides a suitable framework for comparing past work in the area of distributed parallel computing systems. The ideal performance of a strategy depends upon the requirements of the distributed parallel processing system. From this brief discussion of scheduling strategies, it is clear that there is no ideal strategy for all parallel computing systems. In this chapter, dynamic, preemptive and non-preemptive task partitioning and load balancing strategies have been briefly discussed. Advancement of technology is a key factor for the mapping heuristics of task partitioning strategies, and a researcher decides the direction of a research problem according to the existing strategies. A comparative analysis of recent scheduling strategies indicates the state of existing work in the field of parallel computing systems. In static scheduling, the most challenging direction is to extend DAG scheduling to the heterogeneous environment.

S.N. | Name of Strategy | Type of Strategy | Architecture | Remark | Ref.
1 | PY (Papadimitriou and Yannakakis) Scheduling | Static | Homogeneous | DAG-based scheduling; not optimal. | 34
2 | LCTD (Linear Clustering with Task Duplication) | Static | Homogeneous | DAG-based scheduling; not optimal. | 35
3 | EZ (Edge Zeroing) | Static | Homogeneous | DAG-based scheduling. | 33
4 | MCP (Modified Critical Path) | Static | Homogeneous | DAG-based scheduling; optimal scheduling. | 31
5 | ETF (Earliest Time First) | Static | Homogeneous | DAG-based scheduling; optimal scheduling. | 32
6 | PSTS (Position Scan Task Scheduling) | Dynamic | Heterogeneous | Can be used in centralized and decentralized manner. | 14
7 | Evolutionary Task Scheduling | Static and Dynamic | Heterogeneous | Previous state information decides the scheduling environment. | 26
8 | DPS (Dynamic Priority Scheduling) | Dynamic | Heterogeneous | Difference of top level and bottom level of nodes decides the priority of the tasks; DAG-based scheduling. | 27
9 | DLS (Dynamic Level Scheduling) | Dynamic | Heterogeneous | DAG-based scheduling; priorities are assigned to tasks on the basis of GDL (Generalized Dynamic Level) in the ready list at every scheduling step. | 29
10 | LMT (Level Min Time) Scheduling | Dynamic | Heterogeneous | DAG-based scheduling; level sorting in the first phase and a greedy heuristic for assigning priority in the second phase. | 30
11 | FPS (Fast Preemptive Scheduling) | Preemptive | Heterogeneous and Homogeneous | Scheduling cost is very low. | 17
12 | PTS (Preemptive Task Scheduling) | Preemptive | Homogeneous | Cannot be implemented on heterogeneous systems. | 18
13 | PDS (Preemptive Deterministic Scheduling) | Preemptive | Heterogeneous | Multiple threads; no inter-replica communication. | 11
14 | Optimal Preemptive Scheduling subject to release date | Preemptive | Homogeneous | Many-step process. | 12
15 | RM-DU-NFS Scheduling | Preemptive | Heterogeneous | Free from scheduling anomalies. | 19
16 | Optimal Preemptive Scheduling with discounted flow time objective | Preemptive | Homogeneous | SRPT-FM rule is also optimal for the discounted flow time criteria. | 20
17 | SRPT-FM Scheduling | Preemptive | Homogeneous | Minimizes flow time. | 21
18 | LRPT-FM Scheduling | Preemptive | Homogeneous | Minimizes makespan. | 21
19 | PTS (Preemption Threshold Scheduling) | Preemptive | Real-time system | Used to reduce unnecessary preemption. | 23
20 | FPP (Fixed Preemption Point) Scheduling | Preemptive | Real-time system | Non-preemptive tasks allow preemption only at preemption points in the code of the job. | 24
21 | LSA (Loose Synchronization Algorithm) Scheduling | Non-preemptive | Heterogeneous | Multithreaded replica strategy in which the leader thread controls the follower thread by using a mutex (locking/unlocking). | 13
22 | OFT (Optimal Finish Time) strategy | Non-preemptive | Homogeneous | NP-complete. | 14
23 | LPTF (Largest Processing Time First) strategy | Non-preemptive | Homogeneous | Near optimal for non-preemptive OFT. | 15
24 | MSBC (Multiple Strict Bound Constraints) strategy | Non-preemptive, static | Heterogeneous | Uses a macro data flow graph implementation. | 16
25 | EDF (Earliest Deadline First) Scheduling | Non-preemptive | Real-time system | Universal for the set of periodic or sporadic tasks. | 25

Table 2: Comparison of different task partitioning strategies under various architectures

A comparative analysis of various task partitioning and scheduling strategies has been presented above in detail. This analysis is highly beneficial for choosing the appropriate strategy under different requirements and architectures.


REFERENCES

[1] G. E. Christensen, "MIMD vs. SIMD Parallel Processing," A Case Study in

3D Medical Image Registration. Parallel Computing, 24 (9), (1998), 1369-

1383.

[2] D. G. Feitelson, "A Survey of Scheduling in Multi-programmed Parallel

Systems," Research Report RC 19790 (87657), IBM T.J. Watson Research

Center, (1997).

[3] X. Z. Jia and M. Z. Wei, "A DAG-Based Partitioning-Reconfiguring

Scheduling Algorithm in Network of Workstations," High Performance

Computing in the Asia-Pacific Region, 2000. Proceedings. The Fourth

International Conference/Exhibition, (1), (2000), 323-324.

[4] J. B. Andrews and C. D. Polychronopoulos, "An Analytical Approach to

Performance/ Cost Modeling of Parallel Computers," Journal of Parallel and

Distributed Computing, 12(4), (1991), 343-356.

[5] D. A. Menasce, D. Saha, Porto, S. C. D. S., V. A. F. Almeida and S. K.

Tripathi, "Static and Dynamic Processor Scheduling Disciplines in

Heterogeneous Parallel Architectures," J. Parallel Distrib. Comput., 28,

(1995), 3-6.

[6] D. Gajski and J. Peir, "Essential Issues in Multiprocessor," IEEE Computer,

18(6), (1985), 1-5.

[7] D. A. Menasce, S. C. Porto and S. K. Tripathi, "Static Heuristic Processor

Assignment in Heterogeneous Message Passing Architectures," Int. J. High

Speed Computing, (1994), 114-135.

[8] D. B. Shmoys, J. Wein and D.P. Williamson, "Scheduling Parallel Machines

On-line," SIAM Journal on Computing, (1995), 12-34.

[9] R. Nelson and D. Towsley, "Comparison of Threshold Scheduling Policies for

Multiple Server Systems," IBM, Research. Report, (1985), 45-57.

[10] D. G. Feitelson and L. Rudolph, "Parallel Job Scheduling: Issues and

Approaches," In IPPS'95 Workshop: Job Scheduling Strategies for Parallel

Processing, Springer-Verlag, Lecture Notes in Computer Science LNCS 949,

(1995), 1-3.


[11] Y. Kwok and I. Ahmad, "Static Scheduling Algorithms for Allocating

Directed Task Graphs to Multiprocessors," ACM Computing Surveys, 31(4),

(1999), 406-471.

[12] Y. Kwok and I. Ahmad, "Efficient Scheduling of Arbitrary Task Graphs to

Multiprocessors Using a Parallel Genetic Algorithm," Journal of Parallel and

Distributed Computing, (1997), 1-3.

[13] D. Grewe and M. F. O'Boyle, "A Static Task Partitioning Approach for

Heterogeneous Systems Using OpenCL," In CC'11, (2011), 14-17.

[14] K. Savas and M. Tahar Kechadi, "Dynamic Task Scheduling in Computing

Cluster Environments," Proceedings of the ISPDC/Heterogeneous Parallel

Computing, IEEE conference, (2004), 121-154.

[15] S. Ali, H. J. Siegel, M. Maheswaran and D. Hensgen, "Task Execution Time

Modeling for Heterogeneous Computing System," Proceedings of

Heterogeneous Computing Workshop, (2000), 184-199.

[16] H. Chen, "On the Design of Task Scheduling in the Heterogeneous

Computing Envkonments," IEEE Pacific Rim Conference on

Communications, Computers and Signal Processing, (2005), 23-27.

[17] M. Ahmed, S. M. H. Chowdhury and M. Hasan, "Fast Preemptive Task

Scheduling Algorithm for Homogeneous and Heterogeneous Distributed

Memory Systems," In Ninth ACIS International Conference on Software

Engineering, Artificial Intelligence, Networking and Parallel/Distributed

Computing, (2008), 721-726.

[18] A. Radulescu and A. J. C. Gemund, "On Complexity of List Scheduling

Algorithms for Distributed Memory Systems," ACM Int'l Conf on

Supercomputing, Rhodes Greece, (1999), 45-57.

[19] B. Andersson and J. Jonsson, "Preemptive Multiprocessor Scheduling

Anomalies," Proceedmgs of IPDPS, (2002), 12-19.

[20] D. G. Pandelis, "Optimal Preemptive Scheduling on Uniform Machines with

Discounted Flow Time Objectives," European Journal of Operational

Research, 177(1), (2007), 630-637.


[21] T. Gonzalez, "Optimal Mean Finish Time Preemptive Schedules," Technical

Report 220, Computer Science Department, Pennsylvania State University,

(1977), 90-102.

[22] A. S. Renan and S. Romulo, "A Heterogeneous Preemptive and Non-

preemptive Scheduling Approach for Real-Time Systems on

Multiprocessors," Second Brazilian Conference on Critical Embedded

Systems, (2012), 70-75.

[23] M. Desrochers, J. K. Lenstra, and M.W.P. Savelsbergh, "A Classification

Scheme for Vehicle Routing and Scheduling Problems," European Journal of

Operational Research, (1990), 320-331.

[24] A. Burns, "Preemptive Priority Based Scheduling: An Appropriate

Engineering Approach," Technical Report YCS 214, University of York,

(1993), 12-18.

[25] K. Jeffay, D. F. Stanat and C.U. Martel, "On Non-Preemptive Scheduling of

Period and Sporadic Tasks. Real-Time Systems Symposium, Proceedings,"

Twelfth, (1991), 129-139.

[26] F. Zamfirache, D. Zaharie and C. Craciun, "Evolutionary Task Scheduling in

Static and Dynamic Environments," Computational Cybernetics and

Technical Informatics, International Joint Conference, (2010), 619-625.

[27] I. Ahmad, M. K. Dhodhi and R. Ul-Mustafa, "DPS: Dynamic Priority

Scheduling Heuristic for Heterogeneous Computing Systems," Computers

and Digital Techniques, IEEE Proceedings, 145(6), (1998), 411-418.

[28] Norman M. G. and Thanisch P, "Models of Machines and Computation for

Mapping in Multicomputers," ACM Comput. Survey, (1993), 263-302.

[29] G. C. Sih and E.A. Lee, "A Compile Time Scheduling Heuristic for

Interconnection-Constrained Heterogeneous Processor Architectures,"

Parallel and Distributed Systems, IEEE Transactions, 4(2), (1993), 175-187.

[30] M. Iverson, F. Ozgumer and G. Follen, "Parallelizing Existing Applications

in a Distributed Heterogeneous Environment," In Proceedings of the 4th

Heterogeneous Computing Workshop (HCW'95), (1995), 91-99.


[31] M.Y. Wu and D.D. Gajski, "Hypertool: A Programming Aid for Message-

Passing Systems," IEEE Transaction Parallel Distributed Systems, (1990),

331-344.

[32] J. J. Hwang, Y. C. Chow, F. D. Anger and C.Y. Lee, "Scheduling Precedence

Graphs in Systems with Inter-processor Communication Times," SIAM J.

Computer, (1989), 245-256.

[33] V. Sarkar, "Partitioning and Scheduling Parallel Programs for

Multiprocessors," MIT Press, Cambridge, M.A., (1989), 15-23.

[34] C. H. Papadimitriou and M. Yannakakis, "Towards an Architecture-

Independent Analysis of Parallel Algorithms," SIAM J. Computer, (1990),

324-327.

[35] H. Chen, B. Shirazi and J. Marquis, "Performance Evaluation of a Novel

Scheduling Method: Linear Clustering with Task Duplication," In

Proceedings of the 2" International Conference on Parallel and Distributed

Systems, (1993), 271-276.



CHAPTER 4

OPTIMAL TASK PARTITIONING MODEL IN

DISTRIBUTED PARALLEL COMPUTING

ENVIRONMENT

4.1 Introduction

Parallel computing systems compose task partitioning strategies in a true multiprocessing manner. Such systems share the algorithm and the processing units as computing resources to achieve higher inter-process communication capability. A large variety of experiments have been conducted on the proposed algorithm. The goal of computational models is to provide a realistic representation of the costs of programming.

This chapter presents an optimal iterative task partitioning scheduling approach for a distributed heterogeneous environment. The main goal of the algorithm is to improve the performance of the schedule over a sequence of iterations. The algorithm first uses a b-level computation to calculate the initial schedule. The main characteristics of our method are optimal scheduling and a strong link between partitioning, scheduling and communication. Some important models for task partitioning are also discussed in this chapter. The proposed algorithm improves inter-process communication between tasks by using the resources of the system in an efficient manner.

Scheduling techniques might be used by an algorithm to optimize the code that comes out of parallelizing algorithms. A researcher is always keen to construct a parallel algorithm that runs in the shortest time. Another use of these techniques is in the design of high-performance computing systems. Threads can be used for dynamic task migration [1], and they are used to increase the efficiency of the algorithm. Parallel computing systems have been implemented upon heterogeneous platforms which comprise different kinds of units, such as CPUs, graphics co-processors, etc. An algorithm has been constructed to solve the problem according to the processing capabilities of the machines [10]. The communication factor is a highly important feature in solving the task partitioning problem in distributed systems. A computer cluster is a group of computers working together so closely that it is treated as a single computer. A cluster is always used to improve performance and availability over that of a single computer, and it improves the scientific calculation capabilities of the distributed system [2]. Task partitioning is achieved by linking the computers closely to each other as a single implicit computer. Large tasks are partitioned into subtasks by the algorithms to improve the productivity and adaptability of the system. Process division is a function that divides a process into a number of processes or threads. Thread distribution distributes threads proportionally among several machines in the cluster network [15]. A thread is a function which executes on the different nodes independently, so the communication cost problem is negligible [3]. Some important models [4] for task partitioning in a parallel computing system are PRAM, BSP, etc.

4.2 PRAM model

The PRAM is a robust design paradigm. A PRAM is composed of P processors, each with its own unmodifiable program, and a single shared memory composed of a sequence of words, each capable of containing an arbitrary integer [5]. The PRAM model is an extension of the familiar RAM model and is used in algorithm analysis. It includes a read-only input tape and a write-only output tape. Each instruction in the instruction stream is carried out by all processors simultaneously and requires unit time, regardless of the number of processors. The Parallel Random Access Machine (PRAM) model of computation thus consists of a number of processors operating in lock step [13]. In this model each processor has a flag that controls whether it is active in the execution of an instruction; inactive processors do not participate in the execution of instructions.

Figure 7: PRAM model for shared memory (P processors operating on a single shared memory)

The processor ID can be used to distinguish processor behavior while executing the common program. A synchronous PRAM allows multiple processors to produce results at the same location in shared memory. The highest processing power of this model is obtained by using the Concurrent Read Concurrent Write (CRCW) mode of operation. The PRAM is a baseline model of concurrency and an explicit model that specifies the operations performed at each step [11]. It allows both concurrent read and concurrent write instructions to shared memory locations, and many algorithms for other models (such as the network model) can be derived directly from PRAM algorithms [12]. The classification of the PRAM model is as follows:

1. In the Common CRCW PRAM, all the processors must write the same value.

2. In the Arbitrary CRCW PRAM, one of the processors arbitrarily succeeds in

writing.

3. In the Priority CRCW PRAM, processors have priorities associated with them

and the highest priority processor succeeds in writing.

4.3 Proposed model for task partitioning in distributed environment

scheduling

The task partitioning strategy in a parallel computing system is the key factor that decides the efficiency and speedup of the system. The process is partitioned into subtasks, where the size of each task is determined by the run-time performance of each server [9]. In this way, the number of tasks assigned is proportional to the performance of the servers participating in the distributed computing system. The inter-process communication cost amongst tasks is a very important factor used to improve the performance of the system [6], and the criteria for estimating inter-process communication cost are important for the enhancement of speedup and turnaround time [8]. A Call Procedure (C.P.) is used to dispatch tasks according to the capability of the machines. In this model the server machine can be used for large computations. Every processing element executes one task at a time, and any task can be assigned to any processing element. In the proposed model, subtasks communicate with each other by sharing data, so the execution time is reduced. These subtasks are assigned to the server, which dispatches the tasks to the different nodes.

The proposed scheduling algorithm is used to compute the execution cost and communication cost of the tasks. The server is therefore modeled as a system (P, [Pij], [Si], [Ti], [Gi], [Kij]) as follows:

a) P = [P1, ..., Pn], where Pi denotes the processing elements of the cluster
b) [Pij], where N x N is the processor topology
c) Si, 1 ≤ i ≤ N, specifies the speed of processor Pi
d) Ti, 1 ≤ i ≤ N, specifies the startup cost of initiating a message on Pi
e) Gi, 1 ≤ i ≤ N, specifies the startup cost of initiating a process on Pi
f) Kij is the transmission rate over the link connecting adjacent processors Pi and Pj

In designing the parallel algorithm, the main goal is to achieve a large degree of parallelism. For this purpose we use the C.P. (Call Procedure).

Figure 8: Proposed dynamic task partitioning model (a large task is split into subtasks; Call Procedures (C.P.) dispatch them between the input server and the output server, with inter-process communication between the subtasks)

This model splits both computation and data into small tasks [14]. The following basic requirements are fulfilled by the proposed model:

1. There are at least an order of magnitude more primitive tasks than processors on the target machine, to avoid later design constraints.

2. Redundant data structure storage and redundant computations are minimized, which helps achieve large scalability for high performance computations.

3. Primitive tasks are roughly of the same size, to maintain a balanced workload among the processors.

4. The number of tasks is an increasing function of the problem size; otherwise it would be impossible to use more processors to solve larger problem instances.

Symbol | Meaning
Total (t) | Total tasks
DAG (H) | DAG height
P | Number of processors
MinCT | Minimum computational time
MaxCT | Maximum computational time
MinCR | Minimum computational rate
MaxCR | Maximum computational rate
T | Speed-up
b-level | Bottom level of the DAG
E | Serial execution portion of the algorithm

Table 3: Nomenclature for the proposed task partitioning model

The model assumes the existence of an I/O (Input/Output) element associated with each processor in the system. Processing time can be calculated with the help of a Gantt chart. The connectivity of the processing elements can be represented by an undirected graph called the scheduler machine graph [7]. Program completion cost can be computed as:

Total Cost = communication cost + execution cost

where Execution cost = schedule length, and Communication cost = the number of node pairs (u, v) such that (u, v) ∈ E and proc(u) ≠ proc(v).


4.3.1 The proposed algorithm for inter-process communication amongst tasks:

It is an optimal algorithm for scheduling interval-ordered tasks on (m) processors. The algorithm generates a schedule f that maps each task v ∈ V to a processor Pi with a starting time t_v. The communication time between processors Pi and Pj may be defined as:

comm(i, j) = 0 for i = j, and 1 otherwise

task_ready(μ, i, f): the time when all the messages from all tasks in N(v) have been received by processor Pi in schedule f.

start_time(μ, i, f): the earliest time at which task v can start execution on processor Pi in schedule f.

proc(μ, f): the processor assigned to task μ in schedule f.

start(μ, f): the time at which task μ begins its actual execution in schedule f.

task(i, τ, f): the task scheduled on processor Pi at time τ in schedule f. If there is no task scheduled on processor Pi at time τ in schedule f, then task(i, τ, f) returns the empty task Φ. It is assumed that n2(Φ) < n2(ti).

In this algorithm an edge-cut gain parameter is considered to calculate the communication cost amongst the tasks [9]:

gain(i, j) = ε · gain_edge_cut + (1 − ε) · gain_workload_balance

gain_edge_cut = edge_cut_factor / old_edge_cut

edge_cut_factor = old_edge_cut − new_edge_cut

where ε sets the relative contribution of the edge-cut gain and the workload-balance gain to the total gain. A higher value of ε increases the contribution of the edge-cut gain to the total gain.
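To make the gain computation concrete, the following Python sketch evaluates the edge-cut gain defined above for a candidate task migration. The helper names (edge_cut, migration_gain, balance_gain) and the default value of ε are illustrative assumptions, not part of the thesis model; only the edge-cut terms follow the formulas given here.

    # Minimal sketch of the edge-cut gain computation described above.
    # The workload-balance term `balance_gain` and the weighting epsilon are
    # assumptions for illustration; the text only fixes the edge-cut part.

    def edge_cut(partition, edges):
        """Count edges whose endpoints lie in different partitions."""
        return sum(1 for u, v in edges if partition[u] != partition[v])

    def migration_gain(old_cut, new_cut, balance_gain, epsilon=0.7):
        """Combine edge-cut gain and workload-balance gain, weighted by epsilon."""
        edge_cut_factor = old_cut - new_cut          # edge cut factor
        gain_edge_cut = edge_cut_factor / old_cut    # relative edge-cut gain
        return epsilon * gain_edge_cut + (1 - epsilon) * balance_gain

    # Example: moving a task reduces the cut from 10 to 7 crossing edges.
    print(migration_gain(old_cut=10, new_cut=7, balance_gain=0.5))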

4.3.2 Pseudo code for the proposed algorithm:

1. start
2. task(i, τ, f) ← Φ, for all positive integers i, where 1 ≤ i ≤ P and τ ≥ 0
3. repeat
4. let μ be the unmarked task with the highest priority
5. for i = 1 to P do
6.     compute the b-level for all tasks
7.     schedule all tasks in non-increasing order of b-level
8.     compute ALAP and construct a list of tasks in ascending order of ALAP time
9.     task_ready(μ, i, f) ← max(start(μ, f) + comm(proc(μ, f), i) + 1) + gain(i, j) for each μ
10.    start_time(μ, i, f) ← min τ, where task(i, τ, f) = Φ and τ ≥ task_ready(μ, i, f)
11. endfor
12. f(μ) ← start_time(μ, i, f) if
13.    start_time(μ, i, f) < start_time(μ, j, f), 1 ≤ j ≤ P, i ≠ j, or
       start_time(μ, i, f) = start_time(μ, j, f) and n2(task(i, start_time(μ, i, f) − 1, f)) ≥ n2(task(j, start_time(μ, j, f) − 1, f)), 1 ≤ j ≤ P, i ≠ j
14. mark task μ; until all tasks are marked
15. endif

4.3.3 Low communication overhead phase: Optimality of the algorithm over the target machine can be achieved for the following reasons:

Fact (1): comm(i, τ1, j, τ2), where 1 ≤ i, j ≤ P, swaps the task scheduled on processor node (ni) at time (τ1) with the task scheduled on node (nj) at time (τ2), i.e., it swaps a task between different processors.

Fact (2): total_comm(i, j, τ), where 1 ≤ i, j ≤ P. The effect of this operation is to swap every task scheduled on node (ni) at time τ1 with the task scheduled on node (nj) at time τ2, where τ2 > τ.

The following operation is equivalent to more than one swap operation:

Fact (3): total_comm(i, j, τ) ≈ comm(i, τ1, j, τ2) ∀ τ1, τ2 > τ

4.3.4 Priority assignment and start time computing phase: Computation of the b-level of the DAG is used for the initial scheduling. The following instructions have been used to compute the initial scheduling cost of the task graph:

1. construct a list of nodes in reverse order (L1)
2. for each node ai ∈ L1 do
3.     M = 0
4.     for each child ac of ai do
5.         if c(ai, ac) + b-level(ac) > M then
6.             M = c(ai, ac) + b-level(ac)
7.         endif
8.     endfor
9.     b-level(ai) = weight(ai) + M
10. endfor
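The b-level recursion above can be transcribed directly into a short Python sketch. The DAG representation (adjacency lists of (child, communication cost) pairs plus a node-weight table) and all identifier names are assumptions made for illustration.

    # Sketch of the b-level computation, assuming the DAG is given as adjacency
    # lists of (child, communication_cost) pairs plus a node-weight table.

    def b_levels(children, weight):
        """b-level(a) = weight(a) + max over children of c(a, child) + b-level(child)."""
        levels = {}

        def b_level(node):
            if node in levels:
                return levels[node]
            best = 0
            for child, c in children.get(node, []):
                best = max(best, c + b_level(child))
            levels[node] = weight[node] + best
            return levels[node]

        for node in weight:
            b_level(node)
        return levels

    children = {"t1": [("t2", 4), ("t3", 1)], "t2": [("t4", 3)], "t3": [("t4", 2)], "t4": []}
    weight = {"t1": 2, "t2": 3, "t3": 4, "t4": 1}
    print(b_levels(children, weight))   # b-level(t1) = 2 + max(4+7, 1+7) = 13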

In the scheduling process the b-level is usually constant until the node has been scheduled. The procedure computes the b-level and sorts the list in descending order. The quantitative behaviour of the proposed strategy depends upon the topology used on the target system. This observation might lead to the conclusion that the b-level gives the best results for all experiments. The algorithm employs the ALAP (As Late As Possible) start time attribute, which measures how far a node's start time can be delayed without increasing the schedule length.

4.3.5 Procedure for computing the ALAP is as follows:

1. construct a ready list in reverse topological order (M1)
2. for each node ai ∈ M1 do
3.     min = k, where k is the call procedure (C.P.) length
4.     for each predecessor ac of ai do
5.         if alap(ac) − c(ac, ai) < k then
6.             k = alap(ac) − c(ac, ai)
7.         endif
8.     endfor
9.     alap(ai) = k − wgt(ai)
10. endfor
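A minimal Python sketch of the ALAP computation is given below. It uses the successor-based form of the same recurrence (a node's latest start is constrained by the latest starts of its successors), with the critical-path length supplied as the C.P. length of step 3; the data layout and names are illustrative assumptions.

    # Latest start times: alap(a) = min over successors s of (alap(s) - c(a, s)) - weight(a),
    # and alap(exit) = cp_length - weight(exit).

    def alap_times(successors, weight, cp_length):
        alap = {}

        def visit(node):
            if node in alap:
                return alap[node]
            k = cp_length                              # k starts at the critical-path length
            for succ, c in successors.get(node, []):
                k = min(k, visit(succ) - c)
            alap[node] = k - weight[node]
            return alap[node]

        for node in weight:
            visit(node)
        return alap

    successors = {"t1": [("t2", 4), ("t3", 1)], "t2": [("t4", 3)], "t3": [("t4", 2)], "t4": []}
    weight = {"t1": 2, "t2": 3, "t3": 4, "t4": 1}
    print(alap_times(successors, weight, cp_length=13))   # entry task t1 gets ALAP 0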

According to the priority of nodes, tasks are allocated to the processors in the distributed computing environment. The ALAP time is computed and then a list of tasks is constructed in ascending order of ALAP time. Ties are broken by considering the ALAP time of the predecessors of the tasks.

The following results from the above facts prove the optimality of the proposed model:

1. The operation comm(i, τ1, j, τ2) on the schedule f of the tasks preserves the feasibility of the schedule of any task (w) with f(w) = (p, τ1), where p ∈ {i, j} and τ1 = τ − 1.

2. The feasibility of the schedule f in the proposed model is increased for any task schedule f(w) = (p, τ1), where p ∈ {i, j} and ∀ τ1.

3. The operation comm(i, τ1, j, τ2) with n2(total_comm(i, j, τ)) > n2(task(i, τ1, f)) shows optimality on the schedule of any task (w) with f(w) = (p, τ3), where p ∉ {i, j} and ∀ τ3.

4. The operation comm(i, j, τ) preserves the feasibility of the schedule of any task (w) with f(w) = (p, τ1), where p ∈ {i, j} and τ1 ≤ τ − 1.

5. The operation comm(i, j, τ) also shows optimality of the schedule of any task (w) with f(w) = (p, τ1), where p ∉ {i, j} and ∀ τ1.


4.4 Experimental phase

In the experiments CCR is increased as (0.1, 0.5, 1.0, 1.5, 1.75, 2.0, 5, 5.5, 7, 12) and the load factor varies from 0 to 12 with an increment of 1.5. The number of tasks ranges over {5, 10, 15, 40, 70, 100, 130, 150}. In [16] the percentage of improved cases is calculated as 70-75%, while in the proposed model the percentage of improved cases is up to 85% with the increment of the iterations. The performance analysis of the algorithm is based upon the DAG example discussed in [16] with the following assumed parameters:

No. of Iterations   7
MaxCR               6
MinCR               1
MaxCT               150
MinCT               12
P                   5
DAG (H)             13
Total (t)           150

Table 4: Computational analysis of proposed model by using different parameters

Figure 5: Iteration improvement analysis of proposed model (average of improvement ratio per iteration)

The result is shown in figure (5), in which the simulation runs 100 times under each case. The experimental computation shows that when α is smaller, the percentage of improved cases is also small. This simulation result indicates that the initial schedule iteration contributes very little improvement to the computation of the algorithm. When α is high, the percentage of improved cases reaches a critical point, after which it becomes constant even if α is increased.


The ratio of serial execution time to parallel execution time is known as speedup. In our experiment the results are estimated upon a heterogeneous parallel computing platform with up to 30 processors. When the number of nodes is increased, the speedup increases up to a certain level; beyond this level the speedup factor does not increase even if the number of processors increases.

Figure 10: Speedup analysis of the algorithm upon the number of processors

Figure 11: Serializability analysis of the algorithm upon the different number of processors

Amdahl's Law and Gustafson's Law ignore the communication overhead, so they can overestimate speedup or scaled speedup. The serial fraction of a parallel computing algorithm can be determined by the Karp-Flatt metric, which is defined as:

e = (1/ψ − 1/P) / (1 − 1/P)


The serial fraction (e) is useful for computing the parallel overhead generated by the execution of the algorithm over the distributed computing environment, where (ψ) is the speedup factor and P is the number of processors used for the parallel execution in the distributed heterogeneous environment.
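The Karp-Flatt computation is a one-liner; the following sketch evaluates the reconstructed formula for a measured speedup ψ on P processors.

    # Karp-Flatt metric as reconstructed above: e = (1/psi - 1/P) / (1 - 1/P),
    # where psi is the measured speedup on P processors.

    def karp_flatt(speedup, processors):
        return (1.0 / speedup - 1.0 / processors) / (1.0 - 1.0 / processors)

    # Example: a speedup of 4.2 on 8 processors implies a serial fraction of ~0.13.
    print(round(karp_flatt(4.2, 8), 3))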

In this chapter, we proposed a new model for the estimation of communication cost amongst various nodes at execution time. The improvement ratio of iterations has also been discussed. Our contribution introduces an edge-cut inter-process communication factor, which is highly important for assigning tasks to heterogeneous systems according to the processing capabilities of the processors on the network. The model can also adapt to changing hardware constraints.


REFERENCES

[1] N. Islam, A. Prodromidis and M. S. Squillante, "Dynamic Partitioning in

Different Distributed-Memory Environments," Proceedings of the 2nd

Workshop on Job Scheduling Strategies for Parallel Processing, (1996), 155-

170.

[2] D. J. Lilja, "Experiments with a Task Partitioning Model for Heterogeneous

Computing," University of Minnesota AHPCRC Preprint no. 92-142,

Minneapolis, MN, (1992), 15-19.

[3] L. G. Valiant, "A Bridging Model for Parallel Computation,"

Communications of the ACM, 33(8), (1990), 103-111.

[4] B. H. Juurlink and H. A. G. Wijshoff, "Communication primitives for BSP

Computers," Information Processing Letters, 58, (1996), 303-310.

[5] H. El-Rewini and H. Ali, "The Scheduling Problem with Communication,"

Technical Report, University Of Nebraska at Omaha, (1993), 78-89.

[6] D. Menasce and V. Almeida, "Cost-Performance Analysis of Heterogeneity in

Supercomputer Architectures," Proc. Supercomputing, (1990), 169-177.

[7] T. L. Adam, K. M. Chandy and J. R. Dickson, "A Comparison of List

Schedules for Parallel Processing Systems," Comm. ACM, 17, (1974), 685-

689.

[8] L. G. Valiant, "A Bridging Model for Parallel Computation,"

Communications of the ACM, 33(8), (1990), 103-111.

[9] El-Rewini, T. G. Lewis, H. Ali, "Task Scheduling in Parallel and Distributed

Systems," Prentice Hall Series in Innovative Technology, (1994), 48-50.

[10] M. D. Ercegovac, "Heterogeneity in Supercomputer Architectures," Parallel

Computing, (7), (1987), 367-372.

[11] P. B. Gibbons, "A More Practical PRAM Model," In Proceedings of the

1989 Symposium on Parallel Algorithms and Architectures, Santa Fe,

NM,(1989), 158-168.

[12] Y. Aumann and M. O. Rabin, "Clock Construction in Fully Asynchronous

Parallel Systems and PRAM Simulation," In Proc. 33rd IEEE Symp. on

Foundations of Computer Science, (1992), 147-156.


[13] R. M. Karp and V. Ramachandran, "Parallel Algorithms for Shared-Memory

Machines," In J. van Leeuwen, editor. Handbook of Theoretical Computer

Science, (1990), 869-941.

[14] H. Topcuoglu, S. Hariri and M. Y. Wu, "Performance-Effective and Low-

Complexity Task Scheduling for Heterogeneous Computing," IEEE Trans.

Parallel and Distributed Systems, 13(3), (2002), 250-271.

[15] J. Ali and R. Z. Khan, "Dynamic Task Partitioning Model in Parallel

Computing Systems," Proceeding First International conference on Advanced

Information Technology (ICAIT), Coimbatore, Tamil Nadu, (2012), 7-10.



CHAPTER 5

OPTIMAL TASK PARTITIONING STRATEGY WITH

DUPLICATION IN PARALLEL COMPUTING

5.1 Introduction

Algorithms for scheduling tasks onto heterogeneous processors must achieve remarkable performance in terms of schedule length. Most scheduling algorithms do not provide a mechanism for minimum communication overhead. This chapter introduces the Optimal Task Partitioning Strategy with Duplication (OTPSD), which minimizes the schedule length as well as the communication overhead. The underlying scheduling problem is NP-complete. A three-phase algorithm has been introduced in which the first phase comprises grain_packSubDAG formation, the second phase is a priority assignment phase, and in the third phase processors are grouped according to their processing capabilities. The proposed algorithm minimizes makespan and shows better performance in terms of normalized schedule length and processor utilization over the MCP and HEFT algorithms.

Mapping and scheduling of dynamic tasks on multiprocessor architectures is a challenging problem. Every participating processor has its own memory to execute the segment of executable code for the dynamic tasks. A high-level description of a task partitioning application is represented by a DAG (Directed Acyclic Graph). A DAG [18] is a generic model of a parallel program consisting of a set of dependent processes. In a DAG, a set of tasks is connected by a set of directed edges. Outgoing edges associated with computational tasks depict the control behaviour of the task graph at the associated executable task. Each process is represented by a task in the graph. Each task may have one or more input/output tasks. A task is executed only when all its inputs become available. A task without a parent is called an entry task and a task without a child is called an exit task. The process execution time, denoted by wi, is called the weight of task (ti). The communication cost between two tasks, denoted by cij, is equal to the message transfer time from task (ti) to task (tj); this time becomes zero when the two tasks are assigned to the same processor. Communication and computation costs of the tasks are used to assign the priorities in


the DAG. High-priority tasks are assigned before low-priority tasks. Estimating the execution time of a task within a task graph during runtime is complex in a parallel computing architecture. The length of the makespan is given by the exit task of the DAG for a given schedule. A path from the entry task to the exit task for which the sum of communication and computation costs is maximum is known as the critical path [10]. The goal of a scheduling algorithm is to minimize the schedule length. Task partitioning strategies try to assign tasks to the appropriate processors. The execution order of the tasks is an important factor for efficient task partitioning. If execution times and dependencies are known in advance, the problem can be represented by a static model. A number of task partitioning strategies have been proposed in the literature [5, 4, 6, 8, 10, 11, 14, 17]. These task partitioning strategies are used for sets of homogeneous and heterogeneous systems. Some of them show duplication and guided-random behaviour. In the CBHD algorithm, execution of a task starts as soon as the dependent duplicated tasks are finished [20]. In the scheduling algorithms of [10, 12], priorities are assigned to each task to make an ordering list. The scheduling algorithms [2, 6, 7, 10, 11, 16] are used in heterogeneous computing environments. Duplication of tasks amongst different processors is useful to reduce the waiting time of ready tasks [1]. Duplication-based heuristics [10, 2, 3] show better efficiency for fine-grain task graphs. These duplication-based task graph algorithms are more effective for networks with high communication latencies. Communication overhead and waiting time may be reduced in duplication-based task partitioning strategies by redundant allocation on multiple processing elements [10, 1]. The turnaround time of the tasks partially depends upon the waiting time.

A DAG may have many critical paths. We used DFS (Depth First Strategy) to find the critical path in the different grain packs of the DAG. In the proposed algorithm, ties are broken on the basis of higher computation cost; remaining ties, if any, are broken on the number of immediate predecessors of critical-path tasks in the DAG.

5.2 MCP algorithm

MCP stands for the Modified Critical Path algorithm. It is one of the most popular algorithms in the category of critical-path-based list scheduling algorithms, e.g. MCP, ISH, HLFET, DLS and ETF, and MCP performed the best among these algorithms. The MCP algorithm is used for scheduling a DAG (Directed Acyclic Graph) with communication costs onto a bounded number of processors. The MCP algorithm possesses the features that a DAG scheduling algorithm should have: high solution quality and low complexity.

5.2.1 Design of MCP algorithm: The design of the MCP algorithm is based on the As Late As Possible (ALAP) start time to determine the task priority. The algorithm assigns higher priorities to tasks which have smaller possible start times. The ALAP start time of a task is a measure of how far the task's start time can be delayed without increasing the schedule length. The b-level stands for the bottom level of a task: for a task (ti), it is the length of the longest path from task (ti) to an exit task. The b-level is determined by traversing the task graph (DAG) upward recursively from the exit task to the entry task. The ALAP time of a task is computed by determining the length of the critical path and then subtracting the b-level of the task from it.

The MCP algorithm can thus be summarized in the following steps (a compressed sketch is given after the list):

Step 1: Compute the ALAP start time for each task in the DAG.

Step 2: For each task in the graph, construct a list of the ALAP start time of the task itself and the ALAP start times of all its children, in descending order.

Step 3: Construct a list of tasks in increasing lexicographical order of ALAP start time.

Step 4: Remove the first task from the list obtained in step 3 and schedule it on the processor that allows the earliest execution, using insertion.

Step 5: Repeat step 4 until the task list obtained in step 3 becomes empty.
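The following compressed Python sketch illustrates steps 1-5. For brevity it ignores communication costs and the insertion policy of step 4, and it assumes the ALAP table has already been computed (for example with the procedure of Section 4.3.5); all names and example values are illustrative.

    # Simplified MCP-style scheduler: sort by (own ALAP, children's ALAPs),
    # then place each task on the processor allowing the earliest start.

    def parents(t, children):
        return [p for p, cs in children.items() if t in cs]

    def mcp_schedule(tasks, alap, children, weight, num_procs):
        def key(t):
            return (alap[t], sorted(alap[c] for c in children.get(t, [])))
        order = sorted(tasks, key=key)                 # Steps 1-3

        proc_free = [0] * num_procs                    # earliest idle time per processor
        finish, schedule = {}, {}
        for t in order:                                # Steps 4-5
            ready = max((finish.get(p, 0) for p in parents(t, children)), default=0)
            proc = min(range(num_procs), key=lambda p: max(proc_free[p], ready))
            start = max(proc_free[proc], ready)
            schedule[t] = (proc, start)
            finish[t] = start + weight[t]
            proc_free[proc] = finish[t]
        return schedule

    children = {"t1": ["t2", "t3"], "t2": ["t4"], "t3": ["t4"], "t4": []}
    weight = {"t1": 2, "t2": 3, "t3": 4, "t4": 1}
    alap = {"t1": 0, "t2": 4, "t3": 2, "t4": 7}        # e.g. from the ALAP sketch
    print(mcp_schedule(weight.keys(), alap, children, weight, num_procs=2))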

5.2.2 Efficiency of the algorithm

1. There is a better utilization of processors in MCP algorithm.

2. This algorithm can be implemented efficiently with a number of methods

including the extended ALAP time and ready time calculation.

3. Makespan of MCP increases with increase in the number of tasks as compared to

other algorithms.

5.3 Heterogeneous Earliest Finish Time algorithm

The Heterogeneous Earliest Finish Time (HEFT) algorithm [3] is an application scheduling algorithm for a bounded number of heterogeneous processors. It consists of a task selection phase and a processor selection phase. In the first phase, priorities are assigned to the tasks on the basis of their ranks [3]. The second phase is the processor selection phase, which selects the processor that gives the minimum finish time for the task. An insertion-based policy is used in HEFT, in which a task is inserted in the earliest idle time slot between two already scheduled tasks on a processor [19]. Communication cost is also calculated for the heterogeneous computing processors. In the first phase the rank is calculated as follows:

rank_u(ti) = avg(wi) + max_{tj ∈ succ(ti)} ( avg(cij) + rank_u(tj) )     (5.1)

where (wi) is the average computation cost of task (ti) over all processors, succ(ti) is the set of child tasks of (ti), and avg(cij) represents the average communication cost between tasks (ti) and (tj) over all pairs of processors. The upward rank is the expected distance of a task from the end of the computation. The highest-priority task whose predecessor tasks have all finished is assigned to the processor which gives the earliest finish time for that task.
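A short Python sketch of the upward-rank recursion of equation (5.1) is given below; the per-processor execution-cost table and the per-edge average communication costs are assumed inputs, and the example numbers are purely illustrative.

    # Upward rank: rank_u(t) = avg exec cost of t + max over children (avg comm + rank_u(child)).

    def upward_ranks(succ, exec_cost, avg_comm):
        ranks = {}

        def rank(t):
            if t in ranks:
                return ranks[t]
            avg_w = sum(exec_cost[t]) / len(exec_cost[t])
            tail = max((avg_comm[(t, c)] + rank(c) for c in succ.get(t, [])), default=0.0)
            ranks[t] = avg_w + tail
            return ranks[t]

        for t in exec_cost:
            rank(t)
        return ranks

    succ = {"t1": ["t2", "t3"], "t2": [], "t3": []}
    exec_cost = {"t1": [4, 6], "t2": [2, 2], "t3": [3, 5]}      # cost on each processor
    avg_comm = {("t1", "t2"): 3.0, ("t1", "t3"): 1.0}
    print(upward_ranks(succ, exec_cost, avg_comm))   # rank_u(t1) = 5 + max(3+2, 1+4) = 10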

5.4 Performance metric for simulation

The performance of the three algorithms is simulated on DAGs. We generate ten graphs with task sizes {30, 60, 90, 120, 150, 180, 210, 240, 270, 300}. The performance of MCP, HEFT and OTPSD is compared on the basis of NSL (Normalized Schedule Length) and efficiency. The makespan of an algorithm is the completion time of that algorithm. The average communication cost divided by the average computation cost is termed the CCR value [9].


Figure 12: Abstract model for task partitioning strategies

5.5 Proposed model of task partitioning strategies

This model shows an abstract view of task partitioning in distributed computing. The makespan of each algorithm is calculated by the schedule-length count factor; the minimum makespan shows optimality over the other implemented algorithms. The total communication cost is given by the communication-cost count factor. The sum of communication cost and execution cost is merged by the aggregation phase. In the task partitioning strategy phase, the MCP algorithm assigns priorities to tasks on the basis of the b-level, whereas in HEFT the upward rank is calculated by equation (5.1). The proposed algorithm (OTPSD) clusters tasks into groups to make grain_packSubDAGs; the rank in each grain_packSubDAG is calculated according to the rank calculation in HEFT.

5.5.1 Task and processors assignment phase: The goal of this phase is to minimize the total execution time of the participating tasks running in the heterogeneous environment. An optimal partitioning of tasks is necessary to minimize the overall execution time. The partitioning problem consists of partitioning a number of heterogeneous tasks among data-parallel tasks in an optimal way. Mapping of the tasks onto the different processing elements is known to be NP-hard. Precedence constraints must be preserved during the assignment of tasks in heterogeneous environments. Efficient resource sharing is achieved by analysing the mutual exclusion amongst different processors. Selected tasks are assigned to the earliest available processors, and ties are broken amongst the tasks according to the processing capability of the participating processors. Combining some tasks of the DAG and assigning these grouped tasks onto processors minimizes communication overhead [15]. Processors of similar capability are grouped together; due to this, the heterogeneous computing environment becomes similar to a homogeneous one. In the proposed algorithm, we sort the processors in decreasing order of processing power.

5.5.2 Pseudo Code for Grain Pack SubDAG:

1. calculate comm_cost for all tasks
2. sort tasks in decreasing order of comm_cost
3. for each task do
4.     if comm_cost(ti) < comm_cost(tj) then
5.         merge_tasks
6.     endif
7.     repeat steps 3 to 5
8. endfor

According to the above steps, grain packs of tasks are formed in the DAG. These grouped tasks are mapped onto the grouped processors according to their ranks. If the load of a grouped task is more than the processing capability of the grouped processors, then that grouped task is assigned to the next group of processors. A small illustrative sketch of this grouping follows.
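The sketch below illustrates one possible reading of the grain-pack formation steps. The concrete merge rule (join two tasks when the edge between them carries a communication cost above a threshold) and the union-find bookkeeping are assumptions used to make the pseudocode runnable; the thesis itself only fixes the ordering by communication cost.

    # Group tasks into grain packs by merging across heavy communication edges.

    def grain_pack(tasks, comm_cost, threshold):
        """comm_cost: dict (ti, tj) -> cost. Edges with cost >= threshold merge packs."""
        parent = {t: t for t in tasks}

        def find(t):
            while parent[t] != t:
                parent[t] = parent[parent[t]]
                t = parent[t]
            return t

        # Visit edges in decreasing order of communication cost, as in step 2.
        for (a, b), cost in sorted(comm_cost.items(), key=lambda kv: -kv[1]):
            if cost >= threshold:
                parent[find(a)] = find(b)          # merge the two packs

        packs = {}
        for t in tasks:
            packs.setdefault(find(t), []).append(t)
        return list(packs.values())

    tasks = ["t1", "t2", "t3", "t4"]
    comm_cost = {("t1", "t2"): 9, ("t2", "t3"): 1, ("t3", "t4"): 8}
    print(grain_pack(tasks, comm_cost, threshold=5))   # [['t1', 't2'], ['t3', 't4']]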


S.N.  Symbol              Description
1     DAG                 Directed Acyclic Graph
2     giPj                i-th grain_packSubDAG executing on processor Pj
3     GpSD                grain_packSubDAG
4     Sp                  Selected Processor
5     ts                  Scheduled Task
6     tus                 Unscheduled Task
7     tk gi               Key task on grain_packSubDAG
8     tr                  Task rank
9     gi                  i-th grain_packSubDAG
10    par(tr)             Parent key task
11    ft                  Finish time of marked task
12    ft Sp               Finish time on a selected processor
13    minsch_length gi    Minimum schedule length of gi
14    ns                  Scheduled node
15    comm_cost(ti)       Communication cost of task (ti)
16    exec_cost(ti)       Execution cost of task (ti)
17    uij                 Start time of a task (ti) on a processor (Pj)
18    bij                 Finish time of a task (ti) on a processor (Pj)
19    cti                 Maximum-cost successor task on the critical path
20    α(ti)               Average execution cost of a task on the set of processors
21    β(ti)               Successors' communication cost
22    Min_costGP          Group of minimum computational cost processors

Table 5: Nomenclature for OTPSD Model


Theorem 1: If (t0, t1, t2, ..., tn) are the participating tasks of the DAG and (c0, c1, c2, ..., cm) are the communication costs of edges (e0, e1, e2, ..., em) respectively, then giPj optimizes the execution cost on the basis of a max-min (maximum parallelism and minimum cost) trade-off.

Proof. According to Babb [13], "What qualifies as large grain processing will vary somewhat depending upon the underlying architecture." In general, large grain means that the amount of processing a lowest-level program performs during an execution is large [12]. Grain packing is used to reduce communication overhead; it reduces the number of processors in comparison to list scheduling. The amount of concurrency is reduced due to the sequential execution of packed tasks, so the task size is increased by merging many smaller tasks into one larger task. Parallelism is reduced by increasing the grain size in the DAG, while if the grain size is too small the overhead increases due to context switching. The grain size must therefore be related to the max-min (maximum parallelism and minimum cost) problem: it is a trade-off between parallelism (fine grain) and communication (large grain).

Definition 1: A combination of tasks and related edges, called a grain_packSubDAG Task Graph (GpSTG), is a directed acyclic graph DAG (V, E, exec_cost, comm_cost) where:

1. E is the set of edges {(ei, ej) | ei, ej ∈ E} which represent the communication from ei to ej.

2. Grain_packSubDAG (GpSD) represents the set of grain packings, where (gi, gj) are different grain_packs (gi, gj ∈ DAG).

3. V represents the set of nodes, where every node (ni) is represented by a task (ti).

4. The positive label of each edge represents the communication cost (comm_cost) from task (ti) to task (tj).

5. The associated weight of each node depicts the execution cost of the particular task (ti).

Definition 2: If there are multiple edges amongst grain packs (gi, gj ∈ DAG), where 1 ≤ i, j ≤ n, then the arithmetic mean (average value) of the communication costs of these related edges is known as the communication cost between the two grain_packs of the DAG.

In a DAG, communication cost is associated with the interrelated edges, and there may be multiple edges between two grain packs because every grain pack contains many tasks. On the basis of the computation and communication costs, the communication cost between different grain packs is therefore taken as the mean of the communication costs amongst the associated tasks (ti). We consider the integer value of the average communication cost; fractional values are neglected.

comm_cost(gi, gj) = ⌊ Σ cij / n(Easso) ⌋     (5.2)

where the sum runs over the associated edges and n(Easso) is the number of associated edges between the two grain_packs.
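Equation (5.2) can be transcribed directly; the sketch below computes the integer mean of the communication costs on the edges crossing two grain packs. The dictionary-based edge representation is an assumption for illustration.

    # Integer (floor) mean of the communication costs on the edges joining two packs.

    def pack_comm_cost(gi, gj, comm_cost):
        """gi, gj: sets of task ids; comm_cost: dict (ta, tb) -> cost."""
        crossing = [c for (a, b), c in comm_cost.items()
                    if (a in gi and b in gj) or (a in gj and b in gi)]
        return sum(crossing) // len(crossing) if crossing else 0

    print(pack_comm_cost({"t1", "t2"}, {"t5", "t8"},
                         {("t2", "t5"): 7, ("t1", "t8"): 4, ("t1", "t2"): 9}))  # (7+4)//2 = 5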

Theorem 2: The highest-priority task over all processors in a subDAG is the entry task of the critical path of the grain_packSubDAG (GpSD), and cti is the maximum-cost successor task on the critical path.

Proof. If task (ni) of the subDAG has the highest value (priority) over all participating processors, then it is equal to the highest rank value of the OTPSD algorithm. So, task (ni) is selected as the entry task of at least one critical path of OTPSD. Hence, each critical path must have an entry task (ni) which is equal to the highest-priority task of the DAG.

α(ni) = ( Σ_{j=1}^{p} exec_cost(ni, Pj) ) / p,  where p is the number of processors     (5.3)

The communication cost β is the amount of cost required to transfer data to all successor tasks in the DAG. This communication cost may be computed as:

β(ni) = Σ_{j=i+1}^{m} cij,  where i < j, and β = 0 for the exit node     (5.4)

where m is the number of tasks in the unscheduled level of the DAG. The rank of the remaining tasks can be calculated according to the HEFT algorithm.

5.5.3 Pseudo code for proposed algorithm:

1. construct Pj ∀ Pi, where 1 ≤ i ≤ n
2. while unscheduled tasks tus remain do
3.     find tus gi, giPj, tr ∈ tus gi
4.     if tr is not unsch_par then
5.         mark t ∈ tus
6.     else mark ti ∈ par(tr)
7.     endif
8.     compute ft ∈ tus Pi
9.     find min ft Sp ∈ tus
10.    find minsch_length gi
11.    gi → Min_costGP
12.    update t ∈ marked tus ∀ giPj
13.    update comm_cost ∀ giPj
14.    update exec_cost ∀ giPj
15.    update temp_zero_cost_edge(eg) giPj ∈ Min_costGP
16.    update comm_cost ∀ giPj
17. endwhile

5.6 OTPSD implementation phase

The proposed algorithms are implemented in a CUDA environment, which allows the use of the C language [21], with the example DAG in figure (16). In the proposed algorithm, the DAG is divided into sub-DAGs called grain_packSubDAGs. After clustering the DAG, these grain-pack sub-DAGs are assigned to the selected group of processors to minimize the workload [7]. All tasks inside a grain_packSubDAG also execute on a group of selected processors; this type of allocation minimizes communication overhead. In the example, the tasks are grouped into four GpSDs. Tasks {t1, t2, t3, t4} are assigned to (P1P3). Grain_packSubDAG (g2 = {t5, t8, t11}) and (g3 = {t10, t12}) are assigned to (P1P2) for minimization of slack time; because (g2, g3) have been allocated to the group (P1P2), the communication cost is also reduced. Task (t9) is duplicated upon (P1P3) to execute the dependent tasks {t7, t10, t12} as soon as possible.

Table 6: Allocation of tasks upon different processors in OTPSD model. For each scheduling step MHOTPSD01-MHOTPSD12, the table lists the task-to-processor allocation produced by MCP (single processors P1-P3), HEFT (single processors P1-P3) and OTPSD (processor groups such as P1P2, P2P3, P3P1). The resulting total execution times are 215.6 for MCP, 196.3 for HEFT and 184.4 for OTPSD.

Figure 13: Makespan analysis of different algorithms (Gantt charts of the task schedules on P1-P3 under MCP, HEFT and OTPSD)

There is very low communication cost required amongst the tasks in gi, so the total computational cost reduces due to the reduction in communication costs amongst the tasks of the same GpSD. In the simulation results, the OTPS (Optimal Task Partitioning Strategy) algorithm without duplication shows 78.6% CPU utilization, while the average CPU utilizations of MCP, HEFT and OTPSD are 46.7%, 56.8% and 83% respectively. CPU utilization increases due to the allocation of grouped processors of similar capabilities; these groups of processors minimize heterogeneity in comparison to a random allocation of processors. The total execution time of the duplicated algorithm is (184.4) instead of (189.1), so most of the time all processors (P1, P2, P3) are busy during the execution of the proposed algorithm. The drawback of HEFT is that it cannot


perform efficient CPU utilization amongst the participating processors. The rank of each task in gi is calculated on the basis of equation (5.1). According to the ranks, tasks are mapped onto processors; this approach also minimizes the makespan. The calculated makespan of OTPSD is (184.4), which is shorter than those of MCP (215.6) and HEFT (196.3). The average value of the NSL is plotted in figure (14) as a function of the CCR. We calculate the NSL values of MCP, HEFT and OTPSD over the CCR range (1, 1.5, 3.0, 4.5, 6.0, 7.5, 9.0). The graph in figure (15) shows that the behaviour of the three algorithms is consistent with respect to the CCR values. The average NSL value increases slightly up to the medium range of CCR; above the medium range, OTPSD shows significantly better performance, and at the maximum value of CCR the proposed algorithm outperforms the compared algorithms. This result indicates that if the communication cost is large then task partitioning is difficult.

Figure 14: Analysis of NSL vs CCR


Figure 15: Analysis of Efficiency vs CCR

Figure 16: DAG with communication costs


Figure 17: Analysis of NSL vs DAG size (number of nodes)

Algorithm    Average CPU Utilization    Makespan Length
MCP          46.7%                      215.6
HEFT         56.8%                      196.3
OTPSD        83%                        184.4

Table 7: Average CPU utilization and makespan length of compared algorithms

Figure (17) shows the simulation results for different graph sizes ranging from 30 to 300 nodes. The performance of MCP, HEFT and OTPSD shows that the NSL value degrades slightly as the number of nodes in the DAG is increased. In the range of 100-250 nodes, the NSL of the proposed algorithm decreases in comparison with MCP and HEFT. Between 250 and 300 nodes, all three algorithms show an average NSL of 3.0-3.5. Because NSL is a normalized metric, no algorithm shows a large increase even for a large number of tasks.

In this chapter, we proposed a new task partitioning strategy (OTPSD) for heterogeneous parallel computers. A number of experiments were conducted, and the results show that OTPSD outperforms both MCP and HEFT in terms of efficiency and makespan length. The proposed work also demonstrates significant results in terms of better CPU utilization, whereas HEFT and MCP are not capable of minimizing the processors' slack time.


REFERENCES:

[1] G. L. Park, B. Shirazi and J. Marquis, "A New Approach for Duplication

Based Scheduling for Distributed Memory Multiprocessor Systems," In: Proc.

11th Int'l Parallel Processing Symp., (1997), 153-162.

[2] R. Bajaj and D. P. Agrawal, "Improving Scheduling of Tasks in a

Heterogeneous Environment," In: IEEE Transactions on Parallel and

Distributed Systems, 15 (2), (2004), 110-111.

[3] H. Topcuoglu, S. Hariri and M.Y. Wu, "Performance-Effective and Low-

Complexity Task Scheduling for Heterogeneous Computing," In: IEEE

Transactions on Parallel and Distributed Systems, 13 (2), (2002), 25-35.

[4] O. Sinnen and L. Sousa, "List Scheduling: Extension for Contention

Awareness and Evaluation of Node Priorities for Heterogeneous Cluster

Architectures," In: Parallel Computing, (30), (2004), 75-95.

[5] G. L. Park, "Performance Evaluation of a List Scheduling Algorithm in

Distributed Memory Multiprocessor Systems," In: Future Generation

Computer Systems, (20), (2004), 230- 251.

[6] J. Barbosa, C. Morals, R. Nobrega and A. P. Monteiro, "Static Scheduling of

Dependent Parallel Tasks on Heterogeneous Clusters," In: Heteropar , IEEE

Computer Society, Los Alamitos, (2005), 1-8.

[7] Gerasoulis and T. Yang, "On the Granularity and Clustering of Directed

Acyclic Task Graphs" In: IEEE Transactions on Parallel and Distributed

Systems, (1993), 680-693.

[8] M. K. Aguilera, A. Merchant, M. Shah, A. V. Karamanolis and C. Sinfonia,

"A New Paradigm for Building Scalable Distributed Systems," In:

Proceedings of the 21st ACM symposium on operating systems principles,

New York: ACM Press, (2007), 159-174.

[9] D. Mohammad and K. Nawwaf, "A High Performance Algorithm for Static

Task Scheduling in Heterogeneous Distributed Computing Systems," In: J.

Parallel Distrib. Comput., (68), (2008), 403-405.

[10] Y. Kwok and I. Ahmad, "Dynamic Critical-Path Scheduling: An Effective

Technique for Allocating Task Graphs onto Multiprocessors," In: IEEE

Transactions on Parallel and Distributed Systems, 7 (5), (1996), 503-519.


[11] Y. Kwok and I. Ahmad, "On Multiprocessor Task Scheduling using Efficient

State Space Search Approaches," In: Journal of Parallel and Distributed

Computing, (65), (2005), 1510-1530.

[12] B. Kruatrachue and T.G. Lewis, "Grain Size Determination for Parallel

Processing," In: IEEE Software, (1988), 20-31.

[13] R. G. Babb, "Parallel Processing with Large Grain Data Flow Techniques," In:

Computer, (17), (1984), 50-59.

[14] J. K. Kim et al., "Dynamically Mapping Tasks with Priorities And Multiple

Deadlines in a Heterogeneous Environment," In: Journal of Parallel and

Distributed Computing, (67), (2007), 154-169.

[15] S. M. Alaoui, O. Frieder, T. A. EL-Ghazawi, "Parallel Genetic Algorithm for

Task Mapping on Parallel Machine," In: Proc. of the 13th International

Parallel Processing Symposium & 10th Symp. Parallel and Distributed

Processing IPPS/SPDP Workshops, (1999), 22-28.

[16] S. Wu, H. Yu, S. Jin, K. C. Lin and G. Schiavone, "An Incremental Genetic

Algorithm Approach to Multiprocessor Scheduling," In: IEEE Trans. Parallel

Distrib. Syst., (2004), 812-835.

[17] Radulescu and A. J. C. Van Gemund, "Low Cost Task Scheduling for

Distributed Memory Machines," In: IEEE Trans. Parallel Distrib. Syst., (13),

(2002), 640-642.

[18] S. Darba and D. P. Agrawal, "Optimal Scheduling Algorithm for Distributed

Memory Machines," In: IEEE Trans. Parallel and Distributed Systems, (1),

(1998), 85-93.

[19] X. Yuming, L. Kenli, T. K. Tung and Q. Meikang, "A Multiple Priority

Queuing Genetic Algorithm for Task Scheduling on Heterogeneous

Computing Systems," In: High Performance Computing and Communication

& 2012 IEEE 9th International Conference on Embedded Software and

Systems (HPCC-ICESS), 2012 IEEE Nth International Conference, (2012),

640-648.

[20] M. A. Doaa and O. Fatma., "Dynamic Task Scheduling Algorithm with Load

Balancing for Heterogeneous Computing System" In: Egyptian Informatics

Journal, Cairo University, (2012), 136-139.


[21] D. Javier, M. C. Camelia and N. Alfonso, "A Survey of Parallel Programming

Models and Tools in The Multi-Core Era," In: IEEE Transaction on Parallel

and Distributed Systems, 23 (8), (2012), 1369-1375.



CHAPTER 6

TASK PARTITIONING STRATEGY WITH MINIMUM

TIME IN DISTRIBUTED PARALLEL COMPUTING

ENVIRONMENT

6.1 Introduction

Efficient task partitioning is necessary to achieve remarkable performance in multiprocessing systems. When the number of processors is large, scheduling is NP-complete. The complexity of the systems depends upon the execution time [1]. The objective function of any task partitioning strategy must consider the relevant computing factors, yet many parallel computing algorithms have been proposed which consider only a few factors concurrently. Task partitioning strategies are categorized as preemptive versus non-preemptive, adaptive versus non-adaptive, dynamic versus static, and distributed versus centralized [2, 3]. In static task partitioning strategies the runtime overhead is negligible because the nature of every task is known prior to execution; if all performance factors are known in advance, static scheduling shows better results [4]. Practically, however, it is impossible to estimate all communication factors prior to the execution of tasks. In round-robin systems every processor of the computing system executes the same number of tasks. The performance of the system is directly affected by dynamic task partitioning strategies [6]. Centralized and distributed scheduling show different behaviour in terms of overhead and the complexity of the scheduling algorithms. In distributed scheduling, scheduling information is distributed amongst storage devices and the related computing units. Dynamic scheduling balances the load by moving the scheduling decision from the compiler to the processor. Source-initiated scheduling offers tasks from highly loaded processors to lightly loaded processors [7], while in sink-initiated scheduling tasks are transferred from lightly loaded CPUs to highly loaded processors [5]; in symmetric scheduling load may be transferred in either direction. One or more processing units are used to share the global information in centralized strategies. Contention for access to the global information is the major drawback of centralized scheduling algorithms. This


contention tends to reduce the effective use of the existing resources. Some processors used in centralized scheduling cannot participate in the execution of tasks; they are used only as storage devices, which is another serious drawback of centralized scheduling architectures.

Dynamic scheduling plays an important role in the management of resources due to its run-time interaction with the underlying architecture [8]. In a large number of application areas, such as real-time computing, communication networks and query processing, dynamic scheduling shows better performance than static scheduling.

A new list-scheduling-based strategy called Task Partitioning Strategy with Minimum Time (TPSMT) has been proposed for heterogeneous computing environments. This algorithm contributes an average communication and computation cost factor to assign the priorities to the critical path nodes.

6.2 Critical path fast duplication (CPFD) algorithm

The final schedule length of any algorithm is determined by the critical path of the implemented schedule, so for a minimum schedule length the critical path value of the schedule must be optimum [2]. Due to precedence constraint limitations, the finish time of the Critical Path (CP) nodes is determined by the execution time of their parent nodes. Therefore the scheduling cost of the parent nodes of the critical path nodes must be determined prior to the scheduling cost of the CP nodes.

In CPFD, the DAG is partitioned into three categories:

• Critical Path Nodes (CPn) of the DAG

• In-Branch Nodes (IBn) of the DAG

• Out-Branch Nodes (OBn) of the DAG

In-branch nodes are those from which a path exists to a critical path node. Out-branch nodes belong neither to the in-branch set nor to the critical path. After partitioning the problem, priorities are assigned to the critical path nodes, and according to these priorities the total execution time is calculated by CPFD.

The start time of the CPn can be minimized by the IBn, so IBn nodes are preferred over OBn nodes. The critical path of the DAG is constructed by the following procedure:


1. assign the first node as the entry CPn; go to the next node; suppose ny is the next CPn; repeat
2. if all predecessors of ny are in the queue, then insert ny in the queue and increment the queue value, else
3. let nx be the predecessor of ny having the maximum value of b-level which is not in the queue
4. ties are broken by choosing the predecessor with the minimum t-level value
5. if all the predecessors of nx are in the queue then
6. include all parents of nx with higher communication costs in the queue
7. repeat all steps until all predecessors of ny are in the queue
8. endif
9. make ny the next CPn
10. append all IBn to the queue in decreasing order of b-level

The sequence generated by the above steps preserves the precedence constraints. The CPn are arranged in a logical manner so that the predecessors of the OBn come in front of the successors of the OBn. All nodes of the critical path are arranged in a sequence on the basis of static level, and CPFD then generates its own ordering of the nodes from this sequence.

The first node of the DAG is (n1), which is the first critical path node. The second node (n2) has only one parent, so it takes the second position in the sequence. After (n2) is appended to the critical path, the next candidate node is inserted into the sequence; ties between candidates with the same b-level are broken in favour of the one with the smaller t-level. The exit node is the last node of the critical path.

6.2.1 Pseudo code for CPFD algorithm: Determine the critical path and partition the DAG into the three categories CPn, IBn, OBn. Let the key node be the entry node of the CPn.

1. Let Pset be the set of processors.
2. for every processor in Pset do
   a. Determine the key start time on P.
   b. Let (m) be an unscheduled node on Pset.
   c. Assign (m) to an idle processor. If the assignment is done, then accept it as a new start time for the inserted node.
   d. If the assignment cannot be done, then return control to examine another processor.
3. Allot to the ready node the processor which gives the earliest start time.
4. Perform all required duplications.
5. Repeat the procedure from step 2 to step 5 for each OBn upon the set of processors until all CPn nodes are assigned.


The complexity of the CPFD algorithm is O(v⁴), where v is the number of nodes.

S.N.  Symbol   Description
1     CPn      Critical Path Node of DAG
2     IBn      In-Branch Node of DAG
3     OBn      Out-Branch Node of DAG
4     wi       Average computation cost of task (i)
5     cij      Communication cost between tasks (ti) and (tj)
6     Pset     Group of processors
7     ti       i-th task
8     ei       i-th edge

Table 8: Nomenclature for TPSMT model

6.3 Proposed algorithm

The proposed algorithm follows a two-step approach. In the first step, priorities are assigned to the nodes of the DAG. In the second step, processors having the earliest start time are selected for the assignment of the high-priority tasks. The performance of the task partitioning strategy depends upon the selected schedule.

6.3.1 Task assignment phase: In heterogeneous computing environments, a task shows different execution times upon different processors. This problem may be resolved by inserting some parameters into the computational costs of the tasks:

AvgComp(pi) = Σ_{i=1}^{n} c(ti) / ( No. of processors × Σ_z c(pz) )     (6.1)

The communication cost amongst the different processors in the heterogeneous system is computed as:

AvgComm(pi) = Σ_{i=1}^{m} c(ei) / ( No. of processors × Σ_z c(pz) )     (6.2)

The Earliest Completion Time of ti on pj can be calculated as follows:

ECT(ti, pj) = min( w(ti) + cij )     (6.3)


The Earliest Finish Time of ti on pj can be calculated as follows:

EFT(ti, pj) = ECT(ti, pj) + w(ti)     (6.4)
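The processor-selection rule built on equations (6.3) and (6.4) can be sketched as follows: each ready task is placed on the processor that yields the minimum earliest finish time. The data layout (per-processor execution-cost table w, per-edge communication table c, processor idle times) is an assumption for illustration.

    def select_processor(task, processors, w, c, proc_free, parent_finish):
        """Return (best_processor, est, eft) for `task`.

        w[task][p]        : execution cost of task on processor p
        c[(parent, task)] : communication cost, charged only if the parent ran elsewhere
        proc_free[p]      : time at which processor p becomes idle
        parent_finish     : list of (parent, parent_proc, finish_time) tuples
        """
        best = None
        for p in processors:
            arrival = max((ft + (0 if pp == p else c[(par, task)])
                           for par, pp, ft in parent_finish), default=0)
            est = max(proc_free[p], arrival)       # earliest start time on p
            eft = est + w[task][p]                 # equation (6.4)
            if best is None or eft < best[2]:
                best = (p, est, eft)
        return best

    procs = ["p1", "p2"]
    w = {"t3": {"p1": 5, "p2": 8}}
    c = {("t1", "t3"): 4}
    print(select_processor("t3", procs, w, c, {"p1": 10, "p2": 0},
                           [("t1", "p1", 6)]))     # -> ('p1', 10, 15): EFT 15 beats 18 on p2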

A critical path is a collection of nodes for which the sum of the computation costs along the path is maximal. The CP is used to determine the lower bound of the task partitioning strategies, so there must be an efficient scheduling strategy to assign priorities to the critical path nodes. When two tasks are assigned to the same processor, the communication cost between them is zero. The HEFT algorithm resolves this problem by assigning ranks to the tasks. CPFD is a static task partitioning strategy, so it uses the critical path method to assign tasks to processors.

In the proposed algorithm, a computational cost factor is inserted to compute the priority of the tasks. This factor considers the average execution and computation time along with the communication and computation capabilities of the processors, as captured by the average cost factors in equations (6.1) and (6.2).

6.3.2 Pseudo code for proposed algorithm:

1. assign wgt(ti) ← AvgComp(pi)
2. assign wgt(ei) ← AvgComm(pi)
3. arrange tasks in descending order of rank
4. while some tasks are not scheduled do
       begin
           select_first(ti) from the priority queue
           for every processor in Pset do
               begin
                   compute ECT(ti, pj);
                   assign task (ti) to the selected processor which minimizes EFT;
               end;
       end;
5. repeat steps 1 to 4 until ranks are assigned to all tasks.

6.3.3 Simulation results: The analysis of the proposed algorithm is carried out upon the DAG in figure (18). The tasks associated with each other have different sizes and communication costs.


Figure 18: DAG with Communication Cost in TPSMT

Figure 19: Average SLR against CCR values


Figure 20: Average speedup against different number of processors

Figure 21: Average efficiency against different number of processors

6.3.4 Computational analysis: The average efficiency produced by TPSMT, HEFT and CPFD for numbers of processors {12, 24, 36, 48, 60} is represented in figure (21). The plotted graph shows that the average efficiency of TPSMT is greater than that of HEFT and CPFD by 53.09% and 63.20% respectively, while the average efficiency of HEFT is 61.92% greater than the average efficiency of CPFD. These results show that the performance of TPSMT is better than both compared algorithms; a higher efficiency indicates a lower execution time.

Figure (20) depicts the average speedup obtained from the TPSMT, HEFT and CPFD algorithms for numbers of processors in the range (20, 50, 80, 110, 140, 170, 200, 230, 250). From the graph it is clear that the average speedup of all three scheduling algorithms


increases as the number of tasks increases. The proposed algorithm gained 17.36% and 56.79% higher average speedup than HEFT and CPFD respectively. For real-world practical problems the TPSMT algorithm outperforms the HEFT and CPFD algorithms.

The CCR values for the three algorithms are plotted against the average SLR value in figure (19). The plots show that the performance of the compared algorithms is consistent. If the CCR value is increased from the smaller range to the medium range, the average SLR increases slightly; when the CCR becomes larger, the average SLR value increases significantly. The average reduction in the NSL value of the proposed algorithm with respect to HEFT and CPFD is 8.08% and 28.40% respectively, and the average NSL value of HEFT is 25.30% less than the NSL value of CPFD. At the largest value of CCR = 10, the average NSL value of TPSMT is 7.69% and 26.92% less than that of HEFT and CPFD respectively. These results indicate that if the communication cost is large then the scheduling of parallel applications is complex.

The performance of the proposed algorithm TPSMT is compared with the best existing algorithms, HEFT and CPFD; one is the best-known dynamic algorithm and the other is a static algorithm (CPFD). TPSMT significantly outperforms the compared algorithms in terms of NSL and average speedup values. The average SLR value of the proposed algorithm is reduced by 10.14% and 39.20% in comparison to HEFT and CPFD respectively over the CCR range of 2-10. This shows that TPSMT is also efficient for applications with high communication costs.


REFERENCES

[1] A. Bampis and J. C. Konig, "Scheduling Algorithms for Parallel Gaussian

Elimination with Communication Costs," IEEE Transactions on Parallel and

Distributed Systems, 9(7), (1998), 679-686.

[2] Y. Kwok and I. Ahmad, "Static Scheduling Algorithms for Allocating

Directed Task Graphs to Multiprocessors," ACM Computing Surveys, 31 (4),

(1999), 446-471.

[3] J. K. Kim, et al., "Dynamically Mapping Tasks with Priorities and Multiple

DeadUnes in a Heterogeneous Environment," Journal of Parallel and

Distributed Computing, 67, (2007), 154-169.

[4] J. Barbosa, C. Morals, R. Nobrega, A. P. Monteiro, "Static Scheduling of

Dependent Parallel Tasks on Heterogeneous Clusters," In: Heteropar. IEEE

Computer Society, Los Alamitos, (2005), 1-8.

[5] J. Bruno, E. G. Coffman and R. Sethi, "Scheduling Independent Tasks to

Reduce Mean Finishing Time," Commun. ACM, 17, (1974), 380-390.

[6] M. Beck, K. Pingali and A. Nicolau, "Static Scheduling for Dynamic Dataflow

Machines," J. Parallel Distrib. Comput. 10, (1990), 275-291.

[7] E. G. Coffman and R. L. Graham, "Optimal Scheduling for Two-Processor

Systems," Acta Inf 1, (1972), 200-213.

[8] A. W. David and D. H. Mark, "Cost-Effective Parallel Computing," IEEE

Computer, (1995), 67-71.



CHAPTER 7

PREEMPTIVE TASK PARTITIONING STRATEGY

WITH FAST PREEMPTION IN HETEROGENEOUS

DISTRIBUTED SYSTEMS

7.1 Introduction

Non-preemptive scheduling strategies [1, 6, 9, 8, 3] have been discussed in the literature. Modified Critical Path (MCP), Highest Level First with Estimated Time (HLFET) [10], Earliest Task First (ETF) [2] and Dynamic Level Scheduling (DLS) are some well-known non-preemptive scheduling algorithms. The Preemptive Task Scheduling (PTS) algorithm shows lower scheduling cost and better load balance than the existing list scheduling algorithms, and its time complexity is better than the time complexities of the MCP, HLFET, ETF and DLS algorithms. In these scheduling algorithms, turnaround time and CPU utilization are the metrics used to evaluate the performance of the systems. FCFS scheduling shows worse average job slowdown than SJF [18]. A starvation problem may occur in FCFS because some long jobs having higher priority than short jobs may take a very long execution time; this leads to long delays and very low throughput. The job fairness parameter shows better results in non-preemptive scheduling. Non-FCFS scheduling policies are less fair than FCFS due to the starvation problems. The fairness of a job may be measured by the delay caused to it by a later arriving job: if the actual start time of a job is greater than its fair start time, then the job is treated as unfairly scheduled. Nevertheless, FCFS scheduling algorithms do not always show better results in terms of fairness than other scheduling algorithms [10].

The complexity of a scheduling algorithm plays an important role in the execution of the processes; a low time complexity gives good performance in all performance metrics [5, 4, 7]. Scheduling cost is proportional to the number of processors used for the scheduling of a large number of tasks, which is a major problem for future use as the number of processors increases. A scheduling algorithm with fast preemptions can speed up the compilation process. Processes must be started


as soon as possible to minimize the execution time. If the waiting time is shorter, the turnaround time also decreases, which improves the earliest start time of processing; hence throughput increases as the earliest starting time decreases. Preemptive scheduling may be categorized as follows:

a) Priority-based preemptive scheduling: tasks are allocated to the processors according to their priorities.

b) Resource sharing: all tasks are allocated to the processors simultaneously, and every task gets an equal time quantum for the execution of its threads.

c) Thread-based implementation of the preemptive approach, which incurs low context-switching overhead.

7.2 Resource sharing

In preemptive scheduling, utilization of the resources is increased by the following considerations, which also avoid deadlock due to the regular preemption of resources:

1. The designed algorithm must satisfy the requirements of parallel processing.

2. The algorithm should have a low Normalized Schedule Length (NSL) and low communication overhead.

3. The algorithm must achieve high efficiency and low response time on multi-core systems.

4. At every instant of time, a resource should hold at most one task.

5. In resource scheduling algorithms the deadlock condition must be avoided. To ensure the absence of deadlock, at least one of the conditions below must be satisfied:

a) Allocation of the resources should be in non-sharable mode; a resource in the multiprocessing environment should not be shared by multiple tasks at once.

b) If a resource is allocated to a particular task, then another task should wait until it is released by the allocating task.

c) The resources should not be allocated in a circular manner (the first allocated resource requested by the last, the second allocated resource requested by the first, and so on).

d) The no-preemption condition must be satisfied.

6. The proposed algorithm satisfies fixed-priority preemption and is capable of scheduling a large number of tasks simultaneously.

In DAG-based preemptive scheduling, a partial portion of a task can be reallocated to a different processor [12, 11, 17, 15], whereas in non-preemptive scheduling an allocated processor cannot be re-allocated until it finishes its assigned task. In theory, the flexibility and resource utilization of preemptive scheduling exceed those of non-preemptive scheduling; in practice, re-assigning a partial part of a task causes extra overhead. Preemptive scheduling admits polynomial-time solutions, while the non-preemptive problem is proved NP-complete [16]; scheduling with duplication upon different processors is also NP-complete. The communication delay amongst preemptive tasks is larger because of processor preemptions. Preemptive and non-preemptive scheduling approaches were investigated in [14] for homogeneous computing architectures, using independent tasks without precedence constraints. The condition of precedence constraints in a DAG was introduced in [13] for preemptive and non-preemptive scheduling, where an earliest-time factor is used to construct schedules in list scheduling strategies.
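As background for the algorithm in the next section, the sketch below represents a DAG with precedence constraints and computes an Earliest Start Time (EST) for each task as the maximum, over its predecessors, of the predecessor's finish time plus the communication cost of the connecting edge. The example graph, cost values and function names are assumptions for illustration, not the thesis model, and a single-machine view is taken in which communication cost is always charged.

# Tasks with execution costs and directed edges with communication costs.
exec_cost = {"A": 3, "B": 2, "C": 4, "D": 1}
edges = {("A", "B"): 2, ("A", "C"): 1, ("B", "D"): 3, ("C", "D"): 2}

def earliest_start_times(exec_cost, edges):
    """EST(t) = max over predecessors p of (EST(p) + exec_cost(p) + comm(p, t)).

    Entry tasks (no predecessors) start at time 0.
    """
    preds = {t: [] for t in exec_cost}
    for (u, v), c in edges.items():
        preds[v].append((u, c))

    est = {}
    def compute(t):
        if t not in est:
            est[t] = max((compute(p) + exec_cost[p] + c for p, c in preds[t]),
                         default=0)
        return est[t]

    for t in exec_cost:
        compute(t)
    return est

print(earliest_start_times(exec_cost, edges))
# {'A': 0, 'B': 5, 'C': 4, 'D': 10}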

7.3 Proposed algorithm for preemptive scheduling

1.  Assign each t_i → P_f
2.  for all processors
        generate priority queue(t_i) → P_j
        sort ascending EST(t_i) → P_all
    endfor
3.  Calculate AvgFT(t_i) for P_all
4.  while (ready-list not empty) do begin
5.      Select task t_i (highest priority) from the priority queue;
6.      Select the processor P_j for task t_i according to EST;
7.      Assign t_i → P_j;
8.      Delete task t_i from the ready-queue;
9.      Update the ready-queue;
10.     Compute FT of all unscheduled tasks;
11.     Schedule the smallest-FT tasks;
12.     Move task t_i ↔ P_j;
13. end while

7.3.1 Description of the scheduling algorithm: There are two phases in the proposed scheduling algorithm. In the first phase, priorities are assigned to the tasks according to their communication time and execution time; in the second phase, the Earliest Start Time (EST) is calculated and the tasks are scheduled upon the processors according to the processors' capabilities. Each task of the example is assigned to the fastest processor. Nodes are sorted in ascending order of Earliest Start Time: the first node, having the highest priority, is numbered (0), and the following nodes are numbered in ascending order of the natural numbers. After the priorities have been assigned, no dependency should exist amongst the nodes; if a dependency is found, a new group of tasks is generated to remove it. In this manner the dependencies among the tasks of the DAG are removed and the tasks can be assigned independently.
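The dependency-removal step can be pictured as grouping the DAG nodes into levels: within a level no node depends on another, so the whole group can be scheduled independently. The minimal sketch below, assuming a small example DAG and hypothetical helper names, performs this grouping by repeatedly peeling off the nodes whose predecessors have all been placed in earlier groups.

def dependency_free_groups(edges, nodes):
    """Split a DAG into groups (levels) such that no task in a group
    depends on another task in the same group."""
    preds = {n: set() for n in nodes}
    for u, v in edges:
        preds[v].add(u)

    placed, groups = set(), []
    while len(placed) < len(nodes):
        # all nodes whose predecessors are already placed form the next group
        group = [n for n in nodes if n not in placed and preds[n] <= placed]
        groups.append(group)
        placed.update(group)
    return groups

# Example DAG: A -> B, A -> C, B -> D, C -> D
print(dependency_free_groups([("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")],
                             ["A", "B", "C", "D"]))
# [['A'], ['B', 'C'], ['D']]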

After calculating the earliest start times, the tasks are assigned to the selected processors. When a task has finished it is removed from the ready queue, and the ready queue is updated after the deletion. The finish time of the remaining tasks is calculated as follows:

FT = t_exec + min(EST_{i,j} + FT_{i,j})    (7.1)

Tasks with the smallest finish time are scheduled before those with the next smallest finish time. Because of their preemptive nature, the tasks can move amongst the computing processors, so the idle time of the processors decreases due to optimal and fast preemption of the tasks over the group of heterogeneous processors.
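A minimal sketch of the scheduling loop described above is given below. It is only an illustration of the two-phase idea; the processor model, cost values and helper names are assumptions rather than the thesis implementation, and preemption and DAG dependencies are omitted for brevity, so equation (7.1) appears only in a simplified form.

import heapq

def ptps_like_schedule(tasks, processors):
    """tasks: dict name -> (priority, exec_time); processors: dict name -> speed.

    Phase 1: build a priority queue of tasks (lower value = higher priority).
    Phase 2: repeatedly take the highest-priority ready task and place it on the
    processor that gives the earliest start time.
    """
    # Phase 1: priority queue of (priority, task name).
    ready = [(prio, name) for name, (prio, _) in tasks.items()]
    heapq.heapify(ready)

    proc_free_at = {p: 0.0 for p in processors}    # when each processor becomes idle
    schedule = []

    # Phase 2: assign tasks, earliest start / smallest finish time first.
    while ready:
        _, name = heapq.heappop(ready)
        _, exec_time = tasks[name]
        # EST on each processor is simply the time it frees up in this
        # simplified (independent-task) model.
        best_p = min(processors, key=lambda p: proc_free_at[p])
        est = proc_free_at[best_p]
        # Finish time = EST + execution time on the chosen processor
        # (a simplified form of equation 7.1, which additionally accounts
        # for predecessor finish times).
        ft = est + exec_time / processors[best_p]  # faster processor => smaller FT
        proc_free_at[best_p] = ft
        schedule.append((name, best_p, est, ft))
    return schedule

tasks = {"T1": (0, 4.0), "T2": (1, 2.0), "T3": (2, 6.0)}
processors = {"P1": 2.0, "P2": 1.0}                # relative speeds
for entry in ptps_like_schedule(tasks, processors):
    print(entry)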


Symbol        Definition
t_i           i-th task
P_f           Fastest processor
P_s           Selected processor
P_all         All processors
EST_{i,j}     Earliest Start Time of the i-th task upon the j-th processor
FT_{i,j}      Finish Time of the i-th task upon the j-th processor

Table 9: Nomenclature in the PTPS model

[Figure: running time (sec) plotted against the number of processors (16 to 1024) for the compared algorithms.]

Figure 22: Running time vs number of processors analysis

[Figure: CCR vs average NSL at p = 4 for MCP, FPS and PTPS.]

Figure 23: CCR vs average normalized schedule length analysis at 4 processors

[Figure: CCR vs average NSL at p = 8 for MCP, FPS and PTPS.]

Figure 24: CCR vs average normalized schedule length analysis at 8 processors


Figure 25: CCR vs average normalized schedule length analysis at 16 processors

[Figure: CCR vs average NSL at p = 32 for MCP, FPS and PTPS.]

Figure 26: CCR vs average normalized schedule length analysis at 32 processors

[Figure: CCR vs average NSL at p = 64 for MCP, FPS and PTPS.]

Figure 27: CCR vs average normalized schedule length analysis at 64 processors

Figure 28: DAG Example for Simulation


7.3.2 Experimental phase: Simulation-based experiments have been conducted on the proposed algorithm against the existing well-known FPS (Fast Preemptive Scheduling) and preemptive MCP (Minimum Critical Path) algorithms. Figure (22) demonstrates the variation of running time (sec) with the number of processors. It can be observed that MCP shows a very large running time over the range of processors {4, 16, 64, 256, 512, 1024} in comparison with the FPS and PTPS scheduling algorithms, while the running time of PTPS is 11.91 % less than that of FPS.

Figure (23) shows the behaviour of the average NSL against the CCR range (0.2, 0.5, 1, 1.5, 2.5, 5, 10). As the CCR value increases, the average NSL value also increases. At the highest value CCR = 10 for p = 4, the NSL value of PTPS is 11.41 % and 33.56 % lower than that of FPS and MCP respectively. This indicates that as the communication cost increases, the overhead also increases. The performance of the proposed algorithm has been calculated for the range (P = 4, 8, 16, 32, 64) of processors. Figures (23-27) show that optimal results are obtained by the PTPS algorithm at the different numbers of processors. It may be observed that as the number of processors increases, the average NSL value continually decreases. Thus PTPS shows better performance than the compared FPS and preemptive MCP scheduling algorithms.
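The two metrics used in these plots can be computed as follows. The definitions below are the ones commonly used in DAG-scheduling studies and are assumed here: NSL is the schedule length (makespan) divided by the sum of the computation costs of the tasks on the critical path, and CCR is the average edge communication cost divided by the average node computation cost.

def nsl(makespan, critical_path_costs):
    """Normalized Schedule Length: makespan / sum of computation
    costs of the tasks lying on the critical path of the DAG."""
    return makespan / sum(critical_path_costs)

def ccr(comm_costs, comp_costs):
    """Communication-to-Computation Ratio: average edge communication
    cost divided by average node computation cost."""
    return (sum(comm_costs) / len(comm_costs)) / (sum(comp_costs) / len(comp_costs))

# Example: a 20-unit schedule whose critical path sums to 16 units of work
# gives NSL = 1.25; comparing two schedules at the same CCR isolates the
# effect of the scheduling strategy itself.
print(nsl(20.0, [3, 5, 4, 4]))                 # 1.25
print(ccr([2, 1, 3, 2], [3, 5, 4, 4, 2]))      # 0.555...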


REFERENCES

[1] L. George, N. Rivierre and M. Spuri, "Preemptive and Non-Preemptive Real-Time Uni-Processor Scheduling," Institut National de Recherche en Informatique et en Automatique, Tech. Rep., (1996), 45-50.

[2] R. Dobrin and G. Fohler, "Reducing the Number of Preemptions in Fixed Priority Scheduling," Proceedings of the 16th Euromicro Conference on Real-Time Systems (ECRTS), (2004), 144-152.

[3] C. L. Liu and J. W. Layland, "Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment," Journal of the ACM, 20(1), (1973), 46-61.

[4] A. Bastoni, "Cache-Related Preemption and Migration Delays: Empirical Approximation and Impact on Scheduling," In Sixth International Workshop on Operating Systems Platforms for Embedded Real-Time Applications, (2010), 33-34.

[5] R. H. Möhring, A. S. Schulz and M. Uetz, "Approximation in Stochastic Scheduling: The Power of LP-Based Priority Policies," Journal of the ACM, (46), (1999), 924-942.

[6] C. G. Lee, J. Hahn, Y. M. Seo, S. L. Min, R. Ha, S. Hong, C. Y. Park, M. Lee and C. S. Kim, "Analysis of Cache-Related Preemption Delay in Fixed-Priority Preemptive Scheduling," IEEE Trans. Comput., 47(6), (1998), 700-713.

[7] N. Megow, M. Uetz and T. Vredeveld, "Models and Algorithms for Stochastic Online Scheduling," Mathematics of Operations Research, (31), (2006), 513-525.

[8] M. Bertogna and S. Baruah, "Limited Preemption Scheduling of Sporadic Task Systems," IEEE Transactions on Industrial Informatics, (2010), 45-49.

[9] S. Altmeyer, R. Davis and C. Maiza, "Cache Related Pre-Emption Delay Aware Response Time Analysis for Fixed Priority Pre-Emptive Systems," In Proc. 32nd IEEE Real-Time Syst. Symp. (RTSS'11), Vienna, Austria, (2011), 23-28.

[10] M. Bertogna, O. Xhani, M. Marinoni, F. Esposito and G. Buttazzo, "Optimal Selection of Preemption Points to Minimize Preemption Overhead," In Proc. 23rd Euromicro Conf. Real-Time Syst. (ECRTS'11), Porto, Portugal, (2011), 217-227.

[11] R. Cheng, M. Gen and Y. Tsujimura, "A Tutorial Survey of Job-Shop Scheduling Problems Using Genetic Algorithms, Part I: Representation," Comput. Ind. Eng., 30(4), (1996), 983-997.

[12] M. J. Gonzalez, Jr., "Deterministic Processor Scheduling," ACM Comput. Surv., 9(3), (1977), 173-204.

[13] Y. C. Chung and S. Ranka, "Applications and Performance Analysis of a Compile-Time Optimization Approach for List Scheduling Algorithm on Distributed Memory Multiprocessors," In Proceedings of the 1992 Conference on Supercomputing (Supercomputing '92, Minneapolis, MN), R. Werner, Ed., IEEE Computer Society Press, Los Alamitos, CA, (1992), 515-525.

[14] J. Blazewicz, J. Weglarz and M. Drabowski, "Scheduling Independent 2-Processor Tasks to Minimize Schedule Length," Inf. Process. Lett., 18, (1984), 267-273.

[15] V. J. Rayward-Smith, "The Complexity of Preemptive Scheduling Given Interprocessor Communication Delays," Inf. Process. Lett., 25, (1987), 120-128.

[16] E. G. Coffman and R. L. Graham, "Optimal Scheduling for Two-Processor Systems," Acta Inf., 1, (1972), 200-213.

[17] E. C. Horvath, S. Lam and R. Sethi, "A Level Algorithm for Preemptive Scheduling," J. ACM, 24(1), (1977), 36-47.

[18] N. Guan, W. Yi, Q. Deng, Z. Gu and G. Yu, "Schedulability Analysis for Non-Preemptive Fixed-Priority Multiprocessor Scheduling," Journal of Systems Architecture, 57(5), (2011), 536-546.


CONCLUSION

This thesis presents the results of studies of different task partitioning strategies carried out on distributed parallel computing architectures. Scheduling in parallel computing is highly important for the efficient use of high performance computing systems. Important laws related to parallel computing are also summarized.

To fulfil the objective of the thesis, the study was completed in two groups. The first group has two parts: (1) a literature review of existing parallel computing tools, and (2) a comparison of existing task partitioning strategies in parallel computing systems.

In the first part, the tools were analysed on the basis of a comparative study. In this comparison we found that PVM and MPI are the best parallel computing tools for non-thread-based parallel computing, while for thread-based parallel computing CUDA is the best computing tool: it provides greater speedup and efficiency than the other existing parallel computing tools because of its strong thread support.

In the second part, an important classification of the existing task partitioning strategies has been presented. This comparative analysis shows that there is no ideal task partitioning strategy for all heterogeneous distributed computing systems; the performance of a scheduling strategy depends upon the underlying architecture.

In the second group, we conducted an experimental analysis of the proposed models and algorithms. All experiments were carried out on Directed Acyclic Graphs (DAGs). We proposed an iterative task partitioning model and three algorithms, two of which are of a dynamic nature. The first, OTPSD, was compared with the MCP and HEFT algorithms and shows lower execution time and NSL values than the compared algorithms. The second dynamic proposed algorithm is TPSMT; it was compared with the HEFT and CPFD algorithms, and the experimental results show better performance in terms of average SLR versus CCR values. The average efficiency shown by TPSMT is also better than that of HEFT and CPFD. The third proposed algorithm, PTPS, is preemptive and was compared with the preemptive MCP and FPS algorithms. We observed that its average NSL value over the different numbers of processors is lower than that of the compared algorithms, and that the NSL value steadily decreases as the number of processors increases.

The work proposed in this thesis may make a significant contribution to the field of parallel computing. The proposed algorithms may be extended in terms of CPU utilization analysis in heterogeneous architectures.


LIST OF PUBLICATIONS

International Journal/Conference Publications:

1. Javed Ali and Rafiqul Zaman Khan, "A Survey Paper on Task Partitioning

Strategies in Parallel and Distributed System," Globus -An International Journal

of Management & IT, ISSN: 0975-721X, Vol.2, No-1, pp 1-12, (2010).

2. Rafiqul Zaman Khan, Javed Ali and Firoz Ali, "Performance Analysis of Parallel

Computing Tools: PVM, MPI, ScaLAPACK, BLACS and PVMPI," Proceedings

of International Conference on Advances in Mathematical & Computational

Methods (AMCM-2011), ISBN: 978-81-908497-6-0, Vol. 1, pp. 215-222, (2011).

3. Javed Ali and Rafiqul Zaman Khan, "Performance Analysis of Matrix

Multiplication Algorithms using MPI," International Journal of Computer

Science and Information Technologies (IJCIT), ISSN: 0975-9646, Vol. 3, No.l,

pp. 3103-3106,(2011).

4. Rafiqul Zaman Khan and Javed Ali, "A Practical Performance Analysis of

Modem Software used in High Performance Computing," International Journal of

Advanced Research in Computer Science and Software Engineering, ISSN: 2277-

1287, Vol. 2, No. 3, pp.15-19, (2011).

5. Javed Ali and Rafiqul Zaman Khan, "Dynamic Task Partitioning Model in

Parallel Computing," First International Conference on Advanced Information

Technology (ICAIT-2012), CoNeCo, WiMo, NLP, CRYPSIS ICAIT, ICDIP.

ITCSE, CS & IT 07, pp. 279-284, (2012).

6. Rafiqul Zaman Khan and Javed Ali, "Classification of Task Partitioning and Load

Balancing Strategies in Distributed Parallel Computing Systems," International

Journal of Computer Applications (IJCA), ISSN: 0123-4560, Vol. 60, pp. 34-41,

(2012).

7. Javed Ali and Rafiqul Zaman Khan, "Optimal Task Partitioning Model in

Distributed Heterogeneous Environment," International Journal of Advanced

Information Technology (IJAIT), ISSN: 2231-1920, Vol. 19, pp. 23-40, (2012).

8. Rafiqul Zaman Khan and Javed Ali, "Preemptive Task Partitioning Strategy

(PTPS) with Fast Preemption in Heterogeneous Distributed Environment."

International Journal of Applied Information Systems (IJAIS), ISSN: 2249-0868,

Vol. 6, No. 1, pp. 25-31,(2013).

9. Javed Ali and Rafiqul Zaman Khan, "Optimal Task Partitioning Strategy with

Duplication (OTPSD) in Parallel Computing Environments," International

Journal of Computers & Distributed Systems (IJCDS), Vol. 4, No. 1, pp.7-15,

(2013).

10. Rafiqul Zaman Khan and Javed Ali, "Task Partitioning Strategy with Minimum

Time (TPSMT) in Distributed Parallel Computing Environment," International

Journal of Application or Innovation in Engineering & Management (IJAIEM),

ISSN: 2319 - 4847, Vol. 2, No. 7, pp. 522-528, (2013).