Civilian Worms: Ensuring Reliability in an Unreliable Environment

Sanjeev R. Kulkarni, University of Wisconsin-Madison, [email protected]. Joint Work with Sambavi Muthukrishnan.


Page 1: Civilian Worms: Ensuring Reliability in an Unreliable Environment

Civilian Worms: Ensuring Reliability in an Unreliable Environment

Sanjeev R. Kulkarni
University of Wisconsin-Madison
[email protected]
Joint Work with Sambavi Muthukrishnan

Page 2: Civilian Worms: Ensuring Reliability in an Unreliable Environment

Outline
- Motivation and Goals
- Civilian Worms
- Master-Worker Model
- Leader Election
- Forward Progress
- Correctness
- Parallel Applications

Page 3: Civilian Worms: Ensuring Reliability in an Unreliable Environment

What’s happening today
- Move towards clusters
- Resource Managers, e.g. Condor
- Dynamic environment

Page 4: Civilian Worms: Ensuring Reliability in an Unreliable Environment

Motivation
- Large Parallel/Standalone Applications
- Non-Dedicated Resources
  - e.g. Condor environment
  - Machines can disappear at any time
- Unreliable commodity clusters
  - Hardware failures
  - Network failures
  - Security attacks!

Page 5: Civilian Worms: Ensuring Reliability in an Unreliable Environment

What’s available
- Parallel Platforms
  - MPI
    - MPI-1: machines can’t go away!
    - MPI-2: any takers?
  - PVM: shoot the master!
- Condor: shoot the Central Manager!

Page 6: Civilian Worms: Ensuring Reliability in an Unreliable Environment

Goal
- Bottleneck-free infrastructure in an unreliable environment
- Ensure “normal termination” of applications
- Users submit their jobs
- Get e-mail upon completion!

Page 7: Civilian Worms: Ensuring Reliability in an Unreliable Environment

Focus of this talk
- Approaches for Reliability
  - Standalone Applications
    - Monitor framework (worms!)
    - Replication
  - Parallel Applications
    - Future work!

Page 8: Civilian Worms: Ensuring Reliability in an Unreliable Environment

Worms are here again!
- Usual Worms
  - Self-replicating
  - Hard to detect and kill
- Civilian Worms
  - Controlled replication
  - Spread legally!
  - Monitor applications

Page 9: Civilian Worms: Ensuring Reliability in an Unreliable Environment

Desired Monitoring System

[Figure: worm segments (W) monitoring computations (C). W = worm, C = computation.]

Page 10: Civilian Worms: Ensuring Reliability in an Unreliable Environment

Issues
- Management of worms
  - Distributed state detection: very hard
- Forward Progress
  - Checkpointing
- Correctness

Page 11: Civilian Worms: Ensuring Reliability in an Unreliable Environment

Management Models
- Master-Worker
  - Simple
  - Effective
  - Our choice!
- Symmetric
  - Difficult to manage the model itself!

Page 12: Civilian Worms: Ensuring Reliability in an Unreliable Environment

Our Implementation Model

[Figure: a master worm and worker worms (W), with computations (C). W = worm, C = computation.]

Page 13: Civilian Worms: Ensuring Reliability in an Unreliable Environment

Worm States
- Master
  - Maintains the state of all the worm segments
  - Listens on a particular socket
  - Respawns failed worm segments
- Worker
  - Periodically pings the master
  - Starts the encapsulated process if instructed
- Leader Election
  - Invokes the LE algorithm to elect a new master
- Note: independent of application state
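The three worm states can be pictured as a small state machine. The sketch below is only an illustration: the event names are hypothetical (the slides name the states, not the events), and the transitions are inferred from the Leader Election slides of this talk.

```python
from enum import Enum, auto


class WormState(Enum):
    MASTER = auto()
    WORKER = auto()
    LEADER_ELECTION = auto()


# Hypothetical event names; only the three states come from the slides.
TRANSITIONS = {
    (WormState.WORKER, "ping_timeout"): WormState.LEADER_ELECTION,
    (WormState.WORKER, "le_message"): WormState.LEADER_ELECTION,
    (WormState.LEADER_ELECTION, "won_election"): WormState.MASTER,
    (WormState.LEADER_ELECTION, "coord_received"): WormState.WORKER,
    (WormState.MASTER, "lower_id_seen"): WormState.LEADER_ELECTION,
}


def next_state(state, event):
    """Return the next state; events that do not apply leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```

Keeping the table free of any application data mirrors the note above: the worm's state does not depend on the state of the computation it wraps.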

Page 14: Civilian Worms: Ensuring Reliability in an Unreliable Environment

Leader Election
- The woes begin! Master goes down
- Detection
  - Worker ping times out (timeout value)
  - Worker gets an LE message
- Action
  - Worker goes into LE state
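A worker-side detector for the two conditions above might look like the following. This is a sketch under assumptions: the class name, method names, and the injectable clock are all hypothetical, and the slides do not say what timeout value was used.

```python
import time


class MasterWatch:
    """Worker-side record of when the master last answered a ping."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s  # assumed tunable; the slides give no value
        self.clock = clock
        self.last_reply = clock()
        self.le_seen = False

    def ping_answered(self):
        # Called whenever the master replies to a ping.
        self.last_reply = self.clock()

    def le_message_received(self):
        # Another worm has already started an election.
        self.le_seen = True

    def should_enter_le(self):
        # Either detection condition moves the worker into the LE state.
        timed_out = self.clock() - self.last_reply > self.timeout_s
        return timed_out or self.le_seen
```

Passing the clock in as a parameter is a design choice that makes the timeout logic testable without sleeping.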

Page 15: Civilian Worms: Ensuring Reliability in an Unreliable Environment

LE algorithm
- Each worm segment is given an ID
  - Only the master gives out IDs
- Workers broadcast their IDs
- The worker with the lowest ID wins

Page 16: Civilian Worms: Ensuring Reliability in an Unreliable Environment

Brief Skeleton

While in LE
    bcast LE message with your id
    Set min = your id
    On getting an LE message with id i
        If i >= min, ignore
        else min = i
min is the new Master
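The skeleton above can be simulated in a few lines. This sketch assumes a lossless broadcast among the live worms and unique ids; the function names are illustrative and the real system exchanges these messages over the network.

```python
def elect(own_id, received_ids):
    """Return the id this worm believes is the new master."""
    minimum = own_id
    for i in received_ids:
        if i >= minimum:
            continue  # ignore ids that are not lower than the current minimum
        minimum = i
    return minimum


def run_election(live_ids):
    """Simulate every live worm broadcasting its id to all the others."""
    views = {}
    for me in live_ids:
        others = [i for i in live_ids if i != me]
        views[me] = elect(me, others)
    return views
```

With live worms 1 and 2, `run_election([1, 2])` gives both worms the same view: worm 1 is the new master.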

Page 17: Civilian Worms: Ensuring Reliability in an Unreliable Environment

LE in action (1)

[Figure: master M0 with workers W1 and W2.]

Master goes down!

Page 18: Civilian Worms: Ensuring Reliability in an Unreliable Environment

LE in action (2)

[Figure: L1 and L2 in the LE state; L1 broadcasts (LE, 1) and L2 broadcasts (LE, 2).]

L1 and L2 send out LE messages

Page 19: Civilian Worms: Ensuring Reliability in an Unreliable Environment

LE in action (3)

[Figure: L2 sends COORD_ACK to L1.]

L1 gets (LE, 2) and ignores it. L2 gets (LE, 1) and sends COORD_ACK.

Page 20: Civilian Worms: Ensuring Reliability in an Unreliable Environment

LE in action (4)

[Figure: M1, W2, and W3, with COORD and spawn arrows.]

M1 sends COORD to W2, spawns W0

Page 21: Civilian Worms: Ensuring Reliability in an Unreliable Environment

Implementation Problems
- Too many cases
- Many unclear cases
- Time to converge
  - Timeout values
- Network partition

Page 22: Civilian Worms: Ensuring Reliability in an Unreliable Environment

What happens if?
- Master still up?
  - Incoming id < self id => goes to LE mode
  - Else => sends back COORD message
- Next master in line goes down?
  - Timeout on COORD message receipt
- Late COORD_ACK?
  - Sends KILL message
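The three cases above can be written as small decision functions. This is one reading of the slide, not the actual implementation: the function names, the string action labels, and the timing parameters are hypothetical, and restarting the election after a COORD timeout is an assumption (the slide only says that a timeout exists).

```python
def master_on_le(self_id, incoming_id):
    """A master that is still up receives an LE message."""
    if incoming_id < self_id:
        return "ENTER_LE"
    return "SEND_COORD"


def worker_waiting_for_coord(waited_s, coord_timeout_s):
    """A worker waits for COORD from the next master in line."""
    if waited_s > coord_timeout_s:
        return "RESTART_LE"  # assumption: the slide does not name the follow-up action
    return "WAIT"


def master_on_coord_ack(arrived_s, ack_deadline_s):
    """The new master receives a COORD_ACK, possibly after its deadline."""
    if arrived_s > ack_deadline_s:
        return "SEND_KILL"
    return "ACCEPT"
```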

Page 23: Civilian Worms: Ensuring Reliability in an Unreliable Environment

More Bizarre cases
- Multiple masters?
  - Master bcasts its id periodically
  - Conflict is resolved using lowest id method
- No master?
  - Workers will time out soon!
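The multiple-master case reduces to the same lowest-id rule. The sketch below assumes unique ids and that every master hears every other master's periodic broadcast; the function name is illustrative.

```python
def surviving_masters(master_ids):
    """Return the masters that stay masters after hearing each other's ids."""
    survivors = []
    for me in master_ids:
        heard = [i for i in master_ids if i != me]
        # A master keeps its role only if no lower id was announced.
        if all(me < other for other in heard):
            survivors.append(me)
    return survivors
```

Under those assumptions exactly one master survives whenever at least one exists, which is the property the periodic broadcast is meant to restore.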

Page 24: Civilian Worms: Ensuring Reliability in an Unreliable Environment

Test-Bed
- 64 dual-processor 550 MHz P-III nodes
- Linux 2.2.12
- 2 GB RAM
- Fast interconnect, 100 Mbps
- Master-Worker communication via UDP

Page 25: Civilian Worms: Ensuring Reliability in an Unreliable Environment

A Stress Test for LE
- Test
  - Worker pings every second
  - Kill n/4 workers
  - After 1 sec, kill the master
  - After 0.5 sec, kill the next master in line
  - Kill n/4 workers again

Page 26: Civilian Worms: Ensuring Reliability in an Unreliable Environment

Convergence

[Figure: Convergence Graph. x-axis: Cluster Size (2, 4, 8, 16); y-axis: Converge time in secs (0 to 35).]

Page 27: Civilian Worms: Ensuring Reliability in an Unreliable Environment

Forward Progress
- Why?
  - MTTF < application time
- Solutions
  - Checkpointing
    - Application level
    - Process level
  - Start from checkpoint image!

Page 28: Civilian Worms: Ensuring Reliability in an Unreliable Environment

Checkpoint
- Address Space
  - Condor checkpoint library
    - Rewrites object files
    - Writes checkpoint to a file on SIGUSR2
- Files
  - Assumption: common file system
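The signal-triggered checkpoint can be illustrated at the application level. Note that this is a swapped-in technique: the Condor checkpoint library saves the whole process image, whereas the sketch below only pickles a state object the application hands over. The class and method names are hypothetical, and `signal.SIGUSR2` exists only on Unix.

```python
import os
import pickle
import signal


class Checkpointer:
    """Application-level stand-in for a process-level checkpoint library."""

    def __init__(self, path):
        self.path = path
        self.requested = False

    def request(self, signum, frame):
        # Signal handler: only set a flag; the write happens at a safe point.
        self.requested = True

    def install(self):
        signal.signal(signal.SIGUSR2, self.request)  # Unix only

    def maybe_checkpoint(self, state):
        """Write the state if a checkpoint was requested; return True if written."""
        if not self.requested:
            return False
        tmp = self.path + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp, self.path)  # atomic rename: no torn checkpoint image
        self.requested = False
        return True

    def restore(self, default):
        """Load the last checkpoint, or return the default if none exists."""
        try:
            with open(self.path, "rb") as f:
                return pickle.load(f)
        except FileNotFoundError:
            return default
```

A computation would call `install()` once, call `maybe_checkpoint(state)` inside its main loop, and start from `restore(initial_state)` after a respawn. The checkpoint path is assumed to sit on the common file system so that any node can restart the job.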

Page 29: Civilian Worms: Ensuring Reliability in an Unreliable Environment

Correctness
- File Access
  - Read only: no problems
  - Writes
    - Possible inconsistency if multiple processes access
    - Inconsistency across checkpoints?
- Need a new File Access Algorithm

Page 30: Civilian Worms: Ensuring Reliability in an Unreliable Environment

Solution: Individual Versions

File Access Algorithm
- On open
  - If first open
    - read: nothing
    - write: create a local copy and set a mapping
  - Else
    - If mapped: access mapped file
    - If write: create a local copy and set a mapping
- Close
  - Preserve the mapping

Page 31: Civilian Worms: Ensuring Reliability in an Unreliable Environment

File Access cont.
- Commit Point
  - On completion of the computation
- Checkpoint
  - Includes mapped files
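The individual-versions idea can be sketched as a thin wrapper around file opens. Everything here is illustrative: the class name, the private directory, and the naming of local copies are assumptions, and the real system intercepts the file calls of the monitored process rather than asking it to use a wrapper.

```python
import os
import shutil


class VersionedFiles:
    """Gives a computation its own copy of every file it writes."""

    def __init__(self, private_dir):
        self.private_dir = private_dir
        self.mapping = {}  # original path -> private local copy

    def open(self, path, mode="r"):
        writing = any(c in mode for c in "wax+")
        if path not in self.mapping and writing:
            # First write: create a local copy and set a mapping.
            name = f"{len(self.mapping)}_{os.path.basename(path)}"
            local = os.path.join(self.private_dir, name)
            if os.path.exists(path):
                shutil.copyfile(path, local)
            handle = open(local, mode)
            self.mapping[path] = local
            return handle
        # Mapped files go to the local copy; unmapped reads use the original.
        return open(self.mapping.get(path, path), mode)

    def commit(self):
        """Commit point: publish local copies once the computation completes."""
        for original, local in self.mapping.items():
            shutil.copyfile(local, original)
        self.mapping.clear()
```

Closing a handle leaves `mapping` untouched, which corresponds to "preserve the mapping". A checkpoint would need to save both `mapping` and the files it points to.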

Page 32: Civilian Worms: Ensuring Reliability in an Unreliable Environment

Being more Fancy
- Security Attacks
- Civilian to Military transition
  - Hide yourself from ps
  - Re-fork periodically to avoid detection

Page 33: Civilian Worms: Ensuring Reliability in an Unreliable Environment

Conclusion
- LE is VERY HARD
  - Don’t take it for a course project!
- Does our system work?
  - 16 nodes: YES
  - 32 nodes: NO
- Quite reliable

Page 34: Civilian Worms: Ensuring Reliability in an Unreliable Environment

Future Direction
- Robustness
- Extension to parallel programs
  - Re-write send/recv calls
  - Routing issues
- Scalability issues?
  - A hierarchical design?

Page 35: Civilian Worms: Ensuring Reliability in an Unreliable Environment

References
- Cohen, F. B., “A Case for Benevolent Viruses”, http://www.all.net/books/integ/goodvcase.html
- M. Litzkow and M. Solomon, “Supporting Checkpointing and Process Migration outside the UNIX kernel”, Usenix Conference Proceedings, San Francisco, CA, January 1992.
- Gurdip Singh, “Leader election in complete networks”, PODC 92

Page 36: Civilian Worms: Ensuring Reliability in an Unreliable Environment

Implementation Arch.

[Figure: Worm architecture. Component labels: Communicator, Dispatcher, Dequeuer, Checkpointer, Computation. Arrow labels: Remove, Prepend, Append, Checkpoint.]

Page 37: Civilian Worms: Ensuring Reliability in an Unreliable Environment

Parallel Programs
- Communication
  - Connectivity across failures
  - Re-write send/recv socket calls
- Limitations of Master-Worker Model?
  - Not really!

Page 38: Civilian Worms: Ensuring Reliability in an Unreliable Environment

Communication
- Checkpoint markers
  - Buffer all data between checkpoint markers
- Help of master in rerouting
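One possible reading of the marker scheme is a sender-side buffer that keeps everything sent since the peer's last checkpoint marker, so the data can be resent if the peer restarts from that checkpoint. This is an interpretation of a future-work slide, not a described implementation; the class and method names are hypothetical.

```python
class ReplayBuffer:
    """Keeps data sent since the last checkpoint marker (assumed semantics)."""

    def __init__(self):
        self.pending = []

    def record_send(self, data):
        # Remember every message until the peer is known to have checkpointed.
        self.pending.append(data)

    def on_checkpoint_marker(self):
        # The peer's checkpoint covers everything sent so far; drop it.
        self.pending.clear()

    def replay(self):
        # Data to resend if the peer restarts from its last checkpoint.
        return list(self.pending)
```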