On Demand Check Pointing for Grid Application Reliability


INTRODUCTION

Grid and cluster architectures have gained popularity for computationally intensive parallel applications. However, the complexity of the infrastructure, consisting of computational nodes, mass storage, and interconnection networks, poses great challenges with respect to overall system reliability. Simple tools of reliability analysis show that as the complexity of the system increases, its reliability, and thus its Mean Time to Failure (MTTF), decreases. If one models the system as a series reliability block diagram, the reliability of the entire system is computed as the product of the reliabilities of all system components. For applications executing on large clusters or a Grid, e.g., Grid5000, the long execution times may exceed the MTTF of the infrastructure and thus render the execution infeasible. As an example, consider an execution lasting 10 days in a system that does not provide fault tolerance. Under the optimistic assumption that the MTTF of a single node is 2,000 days, the probability of failure of this long execution using 100, 200, or 500 nodes is 0.39, 0.63, or 0.91, respectively, fast approaching certain failure. The high failure probabilities are due to the fact that, in the absence of fault-tolerance mechanisms, the failure of a single node will cause the entire execution to fail. Note that this simple example does not even consider network failures, which are typically more likely than node failures. Fault tolerance is thus a necessity to avoid failure in large applications, such as those found in scientific computing, executing on a Grid or large cluster.
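The failure probabilities quoted above follow from the series-system model, assuming exponentially distributed node failures (an assumption not stated explicitly above, but consistent with the quoted numbers):

```latex
% Series system of n nodes, each with exponentially distributed failures:
R_{\mathrm{node}}(t) = e^{-t/\mathrm{MTTF}}, \qquad
R_{\mathrm{sys}}(t) = \prod_{i=1}^{n} R_{\mathrm{node}}(t) = e^{-nt/\mathrm{MTTF}}, \qquad
P_{\mathrm{fail}}(t) = 1 - e^{-nt/\mathrm{MTTF}}.

% With t = 10 days and MTTF = 2000 days:
n = 100:\ 1 - e^{-0.5} \approx 0.393, \qquad
n = 200:\ 1 - e^{-1} \approx 0.632, \qquad
n = 500:\ 1 - e^{-2.5} \approx 0.918.
```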

3. SYSTEM ANALYSIS

Existing System:

Communication-Induced Checkpointing (CIC) protocols usually make the assumption that any process can be checkpointed at any time. An alternative approach relaxes this constraint of always-checkpointable processes, without delaying any message reception or altering the message ordering enforced by the communication layer or by the application. This protocol has been implemented within ProActive, an open source Java middleware for asynchronous and distributed objects implementing the ASP (Asynchronous Sequential Processes) model.

Proposed System:

This paper presents two fault-tolerance mechanisms called Theft-Induced Checkpointing (TIC) and Systematic Event Logging (SEL). These are transparent protocols capable of overcoming problems associated with both benign faults, i.e., crash faults, and node or subnet volatility. Specifically, the protocols base the state of the execution on a dataflow graph, allowing for efficient recovery in dynamic heterogeneous systems as well as multithreaded applications.

Data Flow Diagram:

A data-flow diagram (DFD) is a graphical representation of the "flow" of data through an information system. DFDs can also be used for the visualization of data processing (structured design).

On a DFD, data items flow from an external data source or an internal data store to an internal data store or an external data sink, via an internal process.

A DFD provides no information about the timing or ordering of processes, or about whether processes will operate in sequence or in parallel. It is therefore quite different from a flowchart, which shows the flow of control through an algorithm, allowing a reader to determine what operations will be performed, in what order, and under what circumstances, but not what kinds of data will be input to and output from the system, nor where the data will come from and go to, nor where the data will be stored (all of which are shown on a DFD).

Dataflow Diagram / Architecture Diagram:

[Diagram: on a crash, the client process is stopped; crash recovery uses local/forced checkpoints under the TIC and SEL protocols, with received messages handled by a central server.]


Use Case Diagram:


A use case diagram in the Unified Modeling Language (UML) is a type of behavioral diagram defined by and created from a use-case analysis. Its purpose is to present a graphical overview of the functionality provided by a system in terms of actors, their goals (represented as use cases), and any dependencies between those use cases.

The main purpose of a use case diagram is to show which system functions are performed for which actor. Roles of the actors in the system can be depicted.

[Use case diagram: actors Client and Server; use cases Local/Forced checkpoint, Crash, and Transmission.]


Class Diagram:

Class diagrams are the mainstay of object-oriented analysis and design. Class diagrams show the classes of the system, their interrelationships (including inheritance, aggregation, and association), and the operations and attributes of the classes. Class diagrams are used for a wide variety of purposes, including both conceptual/domain modeling and detailed design modeling.


[Class diagram: Client1 and Client2 (attributes Message, Crash; operations Run(), ActionPerformed(), GetMessage()); Crash (attributes Client1, Client2; operations Crash(), ActionPerformed()); Recovery (attributes TIC, SEL; operations DB(), ActionPerformed()); Server (attributes ClientCount, Crash, Recovery; operations UpdateStatus(), Recovery()); the Server is associated with 1..n clients.]


Sequence Diagram:

A sequence diagram in the Unified Modeling Language (UML) is a kind of interaction diagram that shows how processes operate with one another and in what order. It is a construct of a Message Sequence Chart.

Sequence diagrams are sometimes called event-trace diagrams, event scenarios, or timing diagrams.

A sequence diagram shows, as parallel vertical lines (lifelines), different processes or objects that live simultaneously, and, as horizontal arrows, the messages exchanged between them in the order in which they occur. This allows the specification of simple runtime scenarios in a graphical manner.


[Sequence diagram: lifelines Client1, Client2, Server, and Crash; messages SendMsg, Ack, Crash, and Recovery, exchanged in that order.]


Collaboration Diagram:

Collaboration diagrams belong to a group of UML diagrams called interaction diagrams. Collaboration diagrams, like sequence diagrams, show how objects interact over the course of time. However, instead of showing the sequence of events by the layout on the diagram, collaboration diagrams show the sequence by numbering the messages on the diagram. This makes it easier to show how the objects are linked together, but harder to see the sequence at a glance.

[Collaboration diagram: objects Server, Client1, Client2, and Crash; numbered messages 1: SendMsg, 2: Ack, 3: Crash, 4: Recovery.]


Activity Diagram:

Activity diagrams are typically used for business process modeling, for modeling the logic captured by a single use case or usage scenario, or for modeling the detailed logic of a business rule. Although UML activity diagrams could potentially model the internal logic of a complex operation, it would be far better to simply rewrite the operation so that it is simple enough that an activity diagram is not required. In many ways, UML activity diagrams are the object-oriented equivalent of flowcharts and data flow diagrams (DFDs) from structured development.

[Activity diagram: activities Client, Sending, Crash, and Local/Forced Transmission.]

State Diagram:

[State diagram: a sender transmits a message to a receiver; a crash leads to a local/forced checkpoint.]

Modules:

1. Network Module
2. Logging Module
3. Check-pointing Module
4. Work Stealing Module
5. Fault and Fault Free Module

Module Description:

1. Network Module

Client-server computing or networking is a distributed application architecture that partitions tasks or workloads between service providers (servers) and service requesters, called clients. Often clients and servers operate over a computer network on separate hardware. A server machine is a high-performance host that runs one or more server programs which share its resources with clients. A client does not share any of its resources; clients therefore initiate communication sessions with servers, which await (listen for) incoming requests.
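As a minimal sketch of this module, the following Java snippet shows a server that awaits connections and a client that initiates a session. The port number and message names are illustrative assumptions, not values taken from the original design.

```java
import java.io.*;
import java.net.*;

// Minimal client-server sketch; port 5000 and the message text are illustrative.
public class EchoServer {
    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(5000)) {      // server awaits (listens for) requests
            while (true) {
                try (Socket client = server.accept();             // a client has initiated a session
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(client.getInputStream()));
                     PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
                    out.println("Ack: " + in.readLine());         // server's resources serve the client
                }
            }
        }
    }
}

class EchoClient {
    public static void main(String[] args) throws IOException {
        try (Socket socket = new Socket("localhost", 5000);       // client initiates the session
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream()))) {
            out.println("SendMsg");                               // message name taken from the diagrams
            System.out.println(in.readLine());                    // expect "Ack: SendMsg"
        }
    }
}
```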

2. Logging Module

Logging can be classified as pessimistic, optimistic, or causal. It is based on the fact that the execution of a process can be modeled as a sequence of state intervals. The execution during a state interval is deterministic; however, each state interval is initiated by a nondeterministic event. Now, assume that the system can capture and log sufficient information about the nondeterministic events that initiated the state interval. This is called the piecewise deterministic (PWD) assumption. Then, a crashed process can be recovered by 1) restoring it to the initial state and 2) replaying the logged events to it in the same order they appeared in the execution before the crash. To avoid a rollback to the initial state of a process and to limit the amount of nondeterministic events that need to be replayed, each process periodically saves its local state. Log-based mechanisms in which the only nondeterministic events in a system are the reception of messages are usually referred to as message logging.
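A minimal sketch of message logging under the PWD assumption might look as follows; the class and method names are hypothetical and only illustrate the log-before-deliver and replay-in-order ideas described above.

```java
import java.util.*;
import java.util.function.Consumer;

// Hypothetical sketch of pessimistic message logging under the PWD assumption:
// the reception of a message is the only nondeterministic event here, so logging
// each message before delivery makes the crashed process deterministically replayable.
public class MessageLog {
    private final List<String> log = new ArrayList<>();  // stands in for stable storage

    // Log the message first (pessimistic logging), then deliver it to the process.
    public void receive(String msg, Consumer<String> deliver) {
        log.add(msg);                                    // force the event to the log
        deliver.accept(msg);
    }

    // Recovery: restore the initial state, then replay logged events in order.
    public void replay(Consumer<String> deliver) {
        for (String msg : log) {
            deliver.accept(msg);
        }
    }

    public static void main(String[] args) {
        MessageLog logger = new MessageLog();
        StringBuilder state = new StringBuilder();       // the process state
        logger.receive("SendMsg", m -> state.append(m).append(';'));
        logger.receive("Ack", m -> state.append(m).append(';'));

        StringBuilder recovered = new StringBuilder();   // restart from the initial state
        logger.replay(m -> recovered.append(m).append(';'));
        System.out.println(state.toString().equals(recovered.toString())); // true
    }
}
```

Because the log is forced before delivery, replaying it from the initial state reproduces the same sequence of state intervals as the pre-crash execution.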

3. Check-pointing Module

Rather than logging events, checkpointing relies on periodically saving the state of the computation to stable storage. If a fault occurs, the computation is restarted from one of the previously saved states. Since the computation is distributed, one has to consider the tradeoff space of local and global checkpointing strategies and their resulting recovery cost. Thus, checkpointing-based methods differ in the way processes are coordinated and in the derivation of a consistent global state. The consistent global state can be achieved either at the time of checkpointing or at the time of rollback recovery. The two approaches are called coordinated and uncoordinated checkpointing, respectively.
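The following sketch illustrates the local variant of this idea: the state of a computation is periodically serialized to stable storage, and execution restarts from the last saved state after a fault. The state class and file name are illustrative assumptions.

```java
import java.io.*;

// Illustrative checkpointing sketch: the computation's state is periodically
// serialized to stable storage (a file here) and restored after a fault.
public class Checkpointer {
    // Hypothetical application state; anything Serializable works.
    static class State implements Serializable {
        long iteration;
        double partialResult;
    }

    static void checkpoint(State s, File store) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(store))) {
            out.writeObject(s);                    // save state to stable storage
        }
    }

    static State restore(File store) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(store))) {
            return (State) in.readObject();        // roll back to the saved state
        }
    }

    public static void main(String[] args) throws Exception {
        File store = new File("checkpoint.ser");   // illustrative file name
        State s = new State();
        for (s.iteration = 0; s.iteration < 1_000_000; s.iteration++) {
            s.partialResult += 1.0 / (s.iteration + 1);
            if (s.iteration % 100_000 == 0) {
                checkpoint(s, store);              // periodic local checkpoint
            }
        }
        State recovered = restore(store);          // after a crash: restart from here
        System.out.println("Rolled back to iteration " + recovered.iteration);
    }
}
```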

4. Work Stealing Module

The runtime environment's primary mechanism for load distribution is based on a scheduling algorithm called work-stealing. The principle is simple: when a process becomes idle, it tries to steal work from another process, called the victim; the initiating process is called the thief. Work-stealing is the only mechanism for distributing the workload constituting the application, i.e., an idle process seeks to steal work from another process. From a practical point of view, the application starts with the process executing main(), which creates tasks. Typically, some of these tasks are then stolen by idle processes, which are either local or on other processors. Thus, the principal mechanism for dispatching tasks in the distributed environment is task stealing.
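As an illustration of the thief/victim principle, Java's ForkJoinPool implements work-stealing within a single process: idle worker threads steal queued subtasks from busy workers. The recursive-sum task below is only an illustrative workload, not the application from the paper.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Illustrative work-stealing: ForkJoinPool workers that run out of tasks
// steal queued subtasks from the deques of busy workers (the "victims").
public class WorkStealingDemo extends RecursiveTask<Long> {
    private final long lo, hi;

    WorkStealingDemo(long lo, long hi) { this.lo = lo; this.hi = hi; }

    @Override
    protected Long compute() {
        if (hi - lo <= 10_000) {                   // small enough: compute directly
            long sum = 0;
            for (long i = lo; i < hi; i++) sum += i;
            return sum;
        }
        long mid = (lo + hi) / 2;
        WorkStealingDemo left = new WorkStealingDemo(lo, mid);
        left.fork();                               // queue the subtask; an idle thief may steal it
        long right = new WorkStealingDemo(mid, hi).compute();
        return right + left.join();
    }

    public static void main(String[] args) {
        long sum = new ForkJoinPool().invoke(new WorkStealingDemo(0, 10_000_000));
        System.out.println(sum);                   // 49999995000000
    }
}
```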

5. Fault and Fault Free Module

When a checkpointing mechanism is added, it is of special interest to analyze the overhead it introduces during fault-free execution, since the occurrence of faults is considered to be the rare exception rather than the norm.
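One simple way to quantify this fault-free overhead is to time the same workload with and without periodic checkpoints; the harness below is an illustrative sketch, not the paper's measurement methodology.

```java
import java.io.*;

// Illustrative harness for measuring fault-free checkpointing overhead:
// the same workload is timed with and without periodic checkpoints.
public class OverheadDemo {
    static double run(boolean checkpointing, File store) throws IOException {
        long start = System.nanoTime();
        double acc = 0;
        for (int i = 0; i < 10_000_000; i++) {
            acc += Math.sqrt(i);
            if (checkpointing && i % 1_000_000 == 0) {
                try (DataOutputStream out =
                         new DataOutputStream(new FileOutputStream(store))) {
                    out.writeInt(i);               // enough state to resume the loop
                    out.writeDouble(acc);
                }
            }
        }
        System.out.println(acc);                   // keep the work from being optimized away
        return (System.nanoTime() - start) / 1e9;
    }

    public static void main(String[] args) throws IOException {
        File store = new File("demo.ckpt");        // illustrative file name
        double base = run(false, store);
        double withCp = run(true, store);
        System.out.printf("fault-free overhead: %.1f%%%n",
                          100 * (withCp - base) / base);
    }
}
```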

Hardware Requirements:

• System : Pentium IV 2.4 GHz.

• Hard Disk : 40 GB.

• Floppy Drive : 1.44 MB.

• Monitor : 15" VGA Colour.

• Mouse : Logitech.

• RAM : 256 MB.

Software Requirements:

• Operating System : Windows XP Professional.

• Coding Language : Java.

• Tool Used : Eclipse.