Civilian Worms: Ensuring Reliability in an Unreliable Environment
Transcript of Civilian Worms: Ensuring Reliability in an Unreliable Environment
Sanjeev R. Kulkarni, University of Wisconsin-Madison
[email protected]
Work with Sambavi Muthukrishnan
Outline
- Motivation and Goals
- Civilian Worms
- Master-Worker Model
- Leader Election
- Forward Progress
- Correctness
- Parallel Applications
What's happening today
- Move towards clusters
- Resource managers, e.g. Condor
- Dynamic environment
Motivation
- Large parallel/standalone applications
- Non-dedicated resources, e.g. a Condor environment: machines can disappear at any time
- Unreliable commodity clusters: hardware failures, network failures
- Security attacks!
What's available
- Parallel platforms
  - MPI: MPI-1, machines can't go away! MPI-2, any takers?
  - PVM: shoot the master!
- Condor: shoot the Central Manager!
Goal
- A bottleneck-free infrastructure in an unreliable environment
- Ensure "normal termination" of applications: users submit their jobs and get e-mail upon completion!
Focus of this talk
- Approaches for reliability
- Standalone applications: a monitor framework (worms!), replication
- Parallel applications: future work!
Worms are here again!
- Usual worms: self-replicating, hard to detect and kill
- Civilian worms: controlled replication, spread legally! Monitor applications
Desired Monitoring System
[Diagram: worm segments (W) on each node, monitoring computations (C)]
Issues
- Management of worms
- Distributed state detection: very hard
- Forward progress: checkpointing
- Correctness
Management Models
- Master-Worker: simple, effective; our choice!
- Symmetric: difficult to manage the model itself!
Our Implementation Model
[Diagram: one master worm (W) and several worker worms (W), each worm monitoring a computation (C)]
Worm States
- Master: maintains the state of all the worm segments, listens on a particular socket, respawns failed worm segments
- Worker: periodically pings the master, starts the encapsulated process if instructed
- Leader Election: invokes the LE algorithm to elect a new master
Note: independent of application state
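The ping/respawn protocol above can be sketched as follows (a minimal sketch: the `Master` class, method names, and the 3-second timeout are illustrative assumptions, not details from the real UDP-based system):

```python
import time

class Master:
    """Tracks when each worm segment last pinged and flags dead ones."""
    def __init__(self, timeout=3.0):
        self.timeout = timeout   # assumed value; the talk gives no number
        self.last_ping = {}      # segment id -> timestamp of last ping

    def on_ping(self, worker_id, now=None):
        """Record a periodic ping from a worker segment."""
        self.last_ping[worker_id] = time.monotonic() if now is None else now

    def failed_segments(self, now=None):
        """Segments whose ping has timed out; these must be respawned."""
        now = time.monotonic() if now is None else now
        return [w for w, t in self.last_ping.items() if now - t > self.timeout]

# Usage: worker 2 goes silent, so the master flags it for respawn.
m = Master(timeout=3.0)
m.on_ping(1, now=0.0)
m.on_ping(2, now=0.0)
m.on_ping(1, now=4.0)              # worker 1 pings again; worker 2 is silent
print(m.failed_segments(now=4.5))  # -> [2]
```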
Leader Election
The woes begin! The master goes down.
- Detection: a worker's ping times out (timeout value), or a worker gets an LE message
- Action: the worker goes into LE state
LE algorithm
- Each worm segment is given an ID (only the master assigns ids)
- Workers broadcast their ids; the worker with the lowest id wins

Brief skeleton:
  While in LE:
    bcast an LE message with your id
    set min = your id
    on getting an LE message with id i:
      if i >= min, ignore; else min = i
  min is the new master
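The lowest-id-wins skeleton above can be simulated in a few lines. This is a sketch assuming reliable, fully connected broadcast; message loss, timeouts, and the COORD/COORD_ACK handshake are not modeled:

```python
def elect(worker_ids):
    """Simulate the election: every live worker broadcasts its id; each
    worker keeps the minimum id it has seen; that minimum is the master."""
    mins = {w: w for w in worker_ids}      # "set min = your id"
    for sender in worker_ids:              # each worker's LE broadcast
        for receiver in worker_ids:
            if sender < mins[receiver]:    # "if i >= min, ignore; else min = i"
                mins[receiver] = sender
    winners = set(mins.values())
    assert len(winners) == 1, "all workers must agree on the master"
    return winners.pop()

# After master M0 dies, workers 1 and 2 run the election; 1 wins.
print(elect([2, 1]))  # -> 1
```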
LE in action (1)
[Diagram: master M0 with workers W1 and W2. Master M0 goes down!]

LE in action (2)
[Diagram: L1 and L2 send out LE messages ("LE, 1" and "LE, 2") to each other]

LE in action (3)
[Diagram: L1 gets "LE, 2" and ignores it; L2 gets "LE, 1" and sends COORD_ACK]

LE in action (4)
[Diagram: M1 sends COORD to W2 and spawns a new W0]
Implementation Problems
- Too many cases, many unclear cases
- Time to converge
- Timeout values
- Network partition
What happens if?
- The master is still up?
  - Incoming id < self id => master goes into LE mode
  - Else => master sends back a COORD message
- The next master in line goes down? Timeout on COORD message receipt.
- A late COORD_ACK? Send a KILL message.
More Bizarre Cases
- Multiple masters? The master bcasts its id periodically; the conflict is resolved using the lowest-id method.
- No master? Workers will time out soon!
Test-Bed
- 64 dual-processor 550 MHz P-III nodes, Linux 2.2.12, 2 GB RAM
- Fast interconnect: 100 Mbps
- Master-Worker comm. via UDP
A Stress Test for LE
- Worker pings every second
- Kill n/4 workers
- After 1 sec, kill the master
- After 0.5 sec, kill the master in line
- Kill n/4 workers again
Convergence
[Graph: convergence time in seconds (y-axis, 0-35) vs. cluster size (x-axis: 2, 4, 8, 16 nodes)]
Forward Progress
- Why? MTTF < application time
- Solutions: checkpointing, at the application level or the process level
- Start from the checkpoint image!
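Application-level checkpointing can be sketched like this (a minimal sketch: the file name, checkpoint interval, and pickled loop state are illustrative assumptions, not the Condor process-level mechanism the talk uses):

```python
import os
import pickle

CKPT = "app.ckpt"  # hypothetical checkpoint file name

def run(n_steps=10, ckpt_every=3):
    """Persist the loop state every few steps, so a restarted process
    resumes from the checkpoint image instead of recomputing from scratch."""
    step, total = 0, 0
    if os.path.exists(CKPT):            # restart path: load the image
        with open(CKPT, "rb") as f:
            step, total = pickle.load(f)
    while step < n_steps:
        total += step                   # stand-in for real work
        step += 1
        if step % ckpt_every == 0:      # periodic checkpoint
            with open(CKPT, "wb") as f:
                pickle.dump((step, total), f)
    return total

print(run())  # -> 45 on a fresh run (sum of 0..9)
```

If the process is killed mid-run, the next invocation of `run()` picks up at the last multiple of `ckpt_every` rather than at step 0.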
Checkpoint
- Address space: the Condor checkpoint library rewrites object files and writes a checkpoint to a file on SIGUSR2
- Files: assumption: a common file system
Correctness
File access:
- Read-only: no problems
- Writes: possible inconsistency if multiple processes access; inconsistency across checkpoints?
- Need a new file access algorithm
Solution: Individual Versions
File access algorithm:
- On open:
  - If first open: for a read, do nothing; for a write, create a local copy and set a mapping
  - Else: if mapped, access the mapped file; if a write, create a local copy and set a mapping
- On close: preserve the mapping
- Commit point: on completion of the computation
- Checkpoint: includes the mapped files
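The individual-versions algorithm can be sketched as follows (a sketch only: the `VersionedFiles` class, the `.v1` naming scheme, and commit-by-rename are illustrative assumptions, not the system's actual implementation):

```python
import os
import shutil

class VersionedFiles:
    """Reads go to the shared file until the first write, which creates a
    private local copy and records a mapping; later opens follow the mapping.
    At the commit point the private versions replace the shared files."""
    def __init__(self):
        self.mapping = {}  # shared path -> private local copy

    def open_path(self, path, mode):
        """Return the path this process should actually open."""
        if path in self.mapping:            # "if mapped, access the mapped file"
            return self.mapping[path]
        if "w" in mode or "a" in mode:      # first write: copy and map
            local = path + ".v1"            # hypothetical naming scheme
            if os.path.exists(path):
                shutil.copy(path, local)
            self.mapping[path] = local
            return local
        return path                         # first read: nothing to do

    def commit(self):
        """On completion of the computation, publish the private versions."""
        for shared, local in self.mapping.items():
            if os.path.exists(local):
                os.replace(local, shared)
        self.mapping.clear()
```

Because the mapping is preserved across close and included in the checkpoint, a restarted process keeps writing to the same private version, which avoids the cross-checkpoint inconsistency described above.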
Being More Fancy
- Security attacks: the civilian-to-military transition
- Hide yourself from ps; re-fork periodically to avoid detection
Conclusion
- LE is VERY HARD; don't take it for a course project!
- Does our system work? 16 nodes: YES; 32 nodes: NO
- Quite reliable
Future Direction
- Robustness
- Extension to parallel programs: re-write send/recv calls, routing issues
- Scalability issues? A hierarchical design?
References
- Cohen, F. B., "A Case for Benevolent Viruses", http://www.all.net/books/integ/goodvcase.html
- M. Litzkow and M. Solomon, "Supporting Checkpointing and Process Migration outside the UNIX Kernel", Usenix Conference Proceedings, San Francisco, CA, January 1992.
- Gurdip Singh, "Leader Election in Complete Networks", PODC '92.
Implementation Arch.
[Diagram: inside a worm, Communicator, Checkpointer, Dispatcher, and Dequeuer modules share a work queue (remove, prepend, append operations); the Checkpointer writes the checkpoint of the computation]
Parallel Programs
- Communication: connectivity across failures; re-write send/recv socket calls
- Limitations of the Master-Worker model? Not really!
- Communication: checkpoint markers; buffer all data between checkpoint markers; rerouting with the help of the master
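The marker-based buffering idea can be sketched like this (a sketch: the `BufferedChannel` class is illustrative, and rerouting through the master is not modeled). Every message sent since the last checkpoint marker is retained, so it can be replayed to a peer that rolls back to that checkpoint after a failure:

```python
class BufferedChannel:
    """Sender-side buffer keyed to checkpoint markers."""
    def __init__(self):
        self.delivered = []      # everything ever sent
        self.since_marker = []   # messages after the last checkpoint marker

    def send(self, msg):
        self.delivered.append(msg)
        self.since_marker.append(msg)

    def marker(self):
        """Checkpoint marker: the peer's checkpoint now covers earlier data,
        so the buffer before this point can be discarded."""
        self.since_marker.clear()

    def replay(self):
        """Messages to resend if the peer restarts from its last checkpoint."""
        return list(self.since_marker)

ch = BufferedChannel()
ch.send("a"); ch.send("b")
ch.marker()          # peer checkpoints here
ch.send("c")
print(ch.replay())   # -> ['c']
```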