APPLICATION-LEVEL CHECKPOINT-BASED
APPROACH FOR CRASH FAILURE IN
DISTRIBUTED SYSTEMS
Presented By
Moh Moh Khaing
OUTLINE
Abstract
Introduction
Objectives
Background Theory
Proposed System
System flow of proposed system
Two phases of proposed system
Implementation
Conclusion
ABSTRACT
Fault tolerance for computing node failures is an important and
critical issue in distributed and parallel processing systems.
As the number of computing nodes in a network grows concurrently and
dynamically, node failures occur more often.
This system proposes an application-level checkpoint-based fault
tolerance approach for distributed computing.
The proposed system uses coordinated checkpointing techniques
and systematic process logging as a global monitoring mechanism.
The proposed system is implemented on a distributed multiple
sequence alignment (MSA) application using a genetic algorithm
(GA).
DISTRIBUTED MULTIPLE SEQUENCE ALIGNMENT WITH
GENETIC ALGORITHM (MSAGA)
[Figure: The head node divides the input DNA sequences (2…n) among the worker nodes; each worker runs MSA with GA independently; the aligned sequence results are combined at the head node and the final result is displayed.]
SEQUENCE ALIGNMENT EXAMPLE
Input multiple DNA Sequences
>DNAseq1: AAGGAAGGAAGGAAGGAAGGAAGG
>DNAseq2: AAGGAAGGAATGGAAGGAAGGAAGG
>DNAseq3: AAGGAACGGAATGGTAGGAAGGAAGG
Output for aligned DNA Sequences
>DNAseq1: A-AGGA-AGGA-AGGAA-------GG-----AA-GGAAGG
>DNAseq2: ----------------AAGGAAGGAATGGAAGGAAGGAAGG
>DNAseq3: ----------------AAGGAACGGAATGGTAGGAAGGAAGG
NODE FAILURE CONDITION
A node failure condition occurs while the worker node connects to the
head node, accepts the input sequence, and sends the resulting
sequence back to the head node. The failure conditions are:
1. The worker node is denied as soon as it has connected to the head
node, without doing any job.
2. The worker node rejects the input sequence from the head node
after the head node and worker node have connected and the head
node has prepared the input sequence for the worker node.
3. The worker node sends a "No Send" message to the head node
instead of sending the result sequence back.
4. The worker node crashes when it cannot connect to the head node
with the correct address.
5. The worker node crashes when it disconnects from the head node.
COORDINATED CHECKPOINTING
Checkpointing is used as a fault tolerance mechanism in distributed
systems.
A checkpoint is a snapshot of the current state of a process and
assists in monitoring the process.
Coordinated checkpointing takes checkpoints periodically and
saves them in a log file.
This monitoring information is used when a node failure condition
occurs.
If a node failure occurs in distributed computing, another available
node can reconstruct the process state from the information saved
in the failed node's checkpoint.
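As a minimal sketch of this idea (not the authors' implementation; the function names and file name are assumptions), a process can periodically serialize its state, and after a failure a surviving node can reconstruct that state from the saved snapshot:

```python
import pickle
import time

def take_checkpoint(state, path):
    # Serialize the current process state together with a timestamp.
    snapshot = {"time": time.time(), "state": state}
    with open(path, "wb") as f:
        pickle.dump(snapshot, f)

def restore_checkpoint(path):
    # Reconstruct the process state saved by a (possibly failed) node.
    with open(path, "rb") as f:
        snapshot = pickle.load(f)
    return snapshot["state"]

# A surviving node resumes from the failed node's last snapshot.
take_checkpoint({"aligned": 12, "total": 40}, "worker1.ckpt")
state = restore_checkpoint("worker1.ckpt")
```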
SYSTEMATIC PROCESS LOGGING
Systematic Process Logging (SPL) is derived from a
log-based method.
The motivation for SPL is to reduce the amount of computation
that can be lost, which is bounded by the execution time of a
single failed task.
SPL saves the checkpoint information from the coordinated
checkpointing in log file format, with the exact time and the
contents.
Depending on the fault, it decides which node can accept
the job from the failed node using the stored log file.
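A hedged sketch of SPL-style logging (the record layout, function names, and file name are assumptions for illustration): each checkpoint event is appended with its exact time, and after a fault the log can be replayed to recover a worker's last known status:

```python
import json
import time

def log_event(logfile, worker_id, status, detail=""):
    # Append one checkpoint event with its exact time, SPL-style.
    record = {"time": time.strftime("%Y-%m-%d %H:%M:%S"),
              "worker": worker_id, "status": status, "detail": detail}
    with open(logfile, "a") as f:
        f.write(json.dumps(record) + "\n")

def last_status(logfile, worker_id):
    # Replay the log to recover a worker's most recent status after a fault.
    status = None
    with open(logfile) as f:
        for line in f:
            record = json.loads(line)
            if record["worker"] == worker_id:
                status = record["status"]
    return status

log_event("spl.log", "worker2", "Busy")
log_event("spl.log", "worker2", "Crash", "connection lost")
```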
PROPOSED FAULT TOLERANCE SYSTEM
The checkpoint-based fault tolerance approach is implemented at
the application layer without using any operating system support.
In the distributed multiple sequence alignment application, one head
node and one or more worker nodes are connected over a local
area network.
All worker nodes implement MSAGA and align the input
sequences from the head node independently.
The proposed fault tolerance system takes local checkpoints at
the MSA process of each computing worker node itself, and a
global checkpoint of all workers' condition events at the head node.
ARCHITECTURE OF PROPOSED FAULT TOLERANCE
SYSTEM
[Figure: The head node hosts the GRM and GCS; Workers 1–3 each hold an LC and LCS; all nodes are connected over a local area network.]
GRM – Global Resource Monitor
GCS – Global Checkpoint Storage
LCS – Local Checkpoint Storage
LC – Local Checkpoint
SYSTEM FLOW OF PROPOSED SYSTEM
[Figure: System flow — Start → Load Balancing Phase (GRM, HN, GCS) → Checkpointing Phase (WN, HN): coordinated checkpointing with the LC and GRM, then systematic process logging to the GCS and LCS → End.]
HN – Head Node
WN – Worker Node
IMPLEMENTATION OF HEAD NODE
Checkpointing Phase
The global resource monitor (GRM) plays the main role in
both the coordinated checkpointing phase and the systematic process
logging phase.
The GRM takes a global checkpoint of all worker nodes' events
in the coordinated checkpointing phase.
The GCS saves the global checkpoint information in log file
format in the systematic process logging phase.
GLOBAL CHECKPOINT
Global Resource Monitor (GRM)
Begin
1. Take global checkpoints of the current condition of each WN,
with the WN's IP, port, status, and time duration
2. Detect the failure condition of WNs
3. Find the available worker nodes and decide which node
is suitable for continuing the failed WN's jobs
End
TYPES OF CHECKPOINT
No | Checkpoint Name | Checkpoint Content
1  | Available       | Worker node is connected to the head node and waits for jobs from the head node
2  | Denied          | Worker node is disconnected from the head node
3  | Busy            | Worker node is processing the jobs
4  | Receive         | Worker node sends the result to the head node and exits, or sends an error message and exits
5  | Crash           | Worker node sends the crash message to the head node
CHECKPOINT INFORMATION
For each checkpoint, five fields are described:
Worker Type to show the worker number,
IP Address to identify the WN,
Checkpoint Name to show the worker node's condition,
Current Time to show the process's current time,
Time Duration to show the time from each worker's
running state to its accept/receive state, or from its
running state to its reject state.
Worker Type | IP Address | Checkpoint Name | Current Time | Time Duration
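The five fields above could be modeled as a simple record; this is a hypothetical sketch (the field names and types are assumptions, not the authors' code):

```python
from dataclasses import dataclass

@dataclass
class CheckpointRecord:
    worker_type: str       # worker number, e.g. "Worker 1"
    ip_address: str        # address identifying the WN
    checkpoint_name: str   # Available / Denied / Busy / Receive / Crash
    current_time: str      # time the checkpoint was taken
    time_duration: float   # seconds from running to accept/receive or reject

rec = CheckpointRecord("Worker 1", "192.168.1.10", "Available",
                       "10:15:02", 0.0)
```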
AVAILABLE CHECKPOINT OF ALL WORKERS
The GRM takes the checkpoint as Available when all worker nodes are
connected to the head node.
CHECKPOINT CHANGES FROM AVAILABLE
GlobalCheckpoint_Available ( )
Begin
1. IF HN and WNs are connected THEN
     GRM takes checkpoint as Available
   END IF
2. IF checkpoint is Available THEN
     IF WN is continuously connected to HN THEN
       HN selects a sequence and sends it to the WN
       IF WN does not accept the sequence THEN
         GRM takes checkpoint as Crash
         The sequence goes to the crash queue
       ELSE
         GRM takes checkpoint as Busy
         WN runs the MSA application
       END IF
     ELSE
       GRM takes checkpoint as Denied
     END IF
   END IF
End
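The transitions out of the Available state can be condensed into one function; this is an illustrative sketch of the pseudocode above (the function name and boolean parameters are assumptions):

```python
def next_checkpoint(connected, accepted_sequence):
    # From Available: Denied if the connection is lost, Crash if the WN
    # refuses the sequence (which is then re-queued on the crash queue),
    # otherwise Busy while the WN runs the MSA application.
    if not connected:
        return "Denied"
    if not accepted_sequence:
        return "Crash"
    return "Busy"
```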
DETECTING NODE FAILURE BY GRM
BUSY CHECKPOINT OF ALL WORKERS
CHECKPOINT CHANGES FROM BUSY
GlobalCheckpoint_Busy ( )
Begin
1. IF WN accepted the input sequence from HN THEN
     GRM takes checkpoint as Busy
   END IF
2. IF checkpoint is Busy THEN
     IF WN sends an error message to HN THEN
       GRM takes checkpoint as Receive for the error
     ELSE
       GRM takes checkpoint as Receive for the result
     END IF
   END IF
End
RECEIVE CHECKPOINT WITH RESULT
RECEIVE CHECKPOINT WITH NO SEND MESSAGE
GLOBAL CHECKPOINT STORAGE(GCS)
Global_Checkpoint_Storage ( )
Begin
1. GCS stores the current condition of all WNs in the network
   as checkpoints taken by the GRM
2. GCS records the detailed condition of each WN
3. GCS creates a log file for all checkpoints of the nodes
End
GCS LOG FILE
LOAD BALANCING PHASE
GRM_LoadBalancing( )
BEGIN
IF (GRM detects Denied, Crash, or Receive "No Send") THEN
1. It is assumed that this is a worker node failure.
2. The GRM finds an available node using the GCS and decides
   which node is suitable to receive the job.
3. If so, the HN sends the failed node's jobs to that available
   node.
4. Call the Available and Busy algorithms
END IF
END
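A hedged sketch of this reassignment step (the function name, the `gcs_view` dictionary shape, and the worker ids are assumptions): on a Denied, Crash, or "No Send" event, pick an Available worker from the GCS view and hand it the failed worker's job:

```python
FAILURE_STATES = {"Denied", "Crash", "No Send"}

def reassign(gcs_view, failed_worker):
    # gcs_view maps worker id -> latest checkpoint name from the GCS log.
    if gcs_view.get(failed_worker) not in FAILURE_STATES:
        return None  # not a failure condition; nothing to reassign
    for worker, status in gcs_view.items():
        if status == "Available":
            return worker  # this worker takes over the failed node's job
    return None  # no spare worker available

view = {"w1": "Busy", "w2": "Crash", "w3": "Available"}
target = reassign(view, "w2")
```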
LOAD BALANCING ACCORDING TO NODE FAILURE
AS DENIED CHECKPOINT
LOAD BALANCING ACCORDING TO NODE FAILURE
AS CRASH CHECKPOINT
LOAD BALANCING ACCORDING TO NODE FAILURE
AS RECEIVE CHECKPOINT(NO SEND)
IMPLEMENTATION OF WORKER NODE
The worker node executes the DNA sequences to form the aligned
sequence using the MSAGA application.
The worker node takes the local checkpoint at the application level
of MSAGA.
The worker node implements the checkpointing phase in the proposed
fault tolerance system.
The local checkpoint (LC) and the local checkpoint storage (LCS)
play the main role in that phase.
Every worker node makes its local checkpoint and has its own
local checkpoint storage.
The local checkpoint (LC) takes all checkpoints of each worker node.
The local checkpoint storage (LCS) stores the processing state of
one worker.
LOCAL CHECKPOINT
The local checkpoint (LC) is responsible for taking local checkpoints
of the worker process states.
The LC starts taking checkpoints of the worker's processing state
when the worker node (WN) connects to the head node.
The LC's responsibility lasts until all the worker's processes have
finished normally or the worker exits the local area network because
of node failure.
LOCAL CHECKPOINT OF EACH WORKER
LocalCheckpoint( )
BEGIN
1. Record the WN's starting time, ending time, and connection time
2. Record all process states of the MSA for the sequence
END
LOCAL CHECKPOINT STORAGE(LCS)
SPL produces the checkpoint log file and the processing log file for
the local condition of each node.
So all local checkpoint monitoring information is stored in the
local checkpoint storage (LCS).
The LCS is kept separately for each corresponding WN.

LocalCheckpointStorage( )
BEGIN
1. Store the WN's starting time, ending time, and
   connection time
2. Store all process states of the MSA for the sequence
END
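A minimal per-worker storage sketch (the class, method, and file names are assumptions, not the authors' code): each worker appends its start time, MSA process states, and end time to its own file:

```python
import time

class LocalCheckpointStorage:
    def __init__(self, worker_id):
        # One log file per worker, matching "stored by each corresponding WN".
        self.path = f"lcs_{worker_id}.log"

    def record(self, event, detail=""):
        # Append one timestamped local checkpoint entry.
        with open(self.path, "a") as f:
            f.write(f"{time.strftime('%H:%M:%S')} {event} {detail}\n")

lcs = LocalCheckpointStorage("worker1")
lcs.record("start", "connected to head node")
lcs.record("msa_state", "generation 50 of 200")
lcs.record("end", "result sent")
```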
LCS LOG FILE
CONCLUSION
The GRM does not record wrong checkpoints regardless of the number
of worker nodes.
The GRM can distinguish exactly between an old worker node and a
new worker node when a worker node reconnects to the head
node.
While the GRM takes the checkpoint for one worker node, the
remaining workers do not need to stop their operation; therefore,
worker nodes are never blocked.
This approach enables the distributed multiple sequence
alignment processing to continue and obtain the final
result when a node failure occurs within the network.
This system computes the exact time of each worker node and
the whole system's execution time. The system provides a portable
checkpoint feature and does not need any operating system
support.
THANK YOU!!