
Tolerating Faults in Distributed Systems

Vijay K. Garg
Electrical and Computer Engineering

The University of Texas at Austin
Email: garg@ece.utexas.edu

(joint work with Bharath Balasubramanian and John Bridgman)

Fault Tolerance: Replication

2

Server 1 Server 2 Server 3

1 Fault Tolerance

2 Fault Tolerance

Fault Tolerance: Fusion

3

1 Fault Tolerance

Server 1 Server 2 Server 3

Fault Tolerance: Fusion

4

2 Fault Tolerance

'Fused' Servers: Fewer Backups than Replication

Server 1 Server 2 Server 3

Motivation

5

           Coding      Replication   Fusion
Space      Efficient   Wasteful      Efficient
Recovery   Expensive   Efficient     Expensive
Updates    Expensive   Efficient     Efficient

Since the probability of failure is low, expensive recovery is acceptable.

Outline
- Crash Faults
  - Space savings
  - Message savings
  - Complex data structures
- Byzantine Faults
  - Single fault (f = 1), O(1) data
  - Single fault, O(m) data
  - Multiple faults (f > 1), O(m) data
- Conclusions & Future Work

6

Example 1: Event Counter

7

n different counters count n different items: count(i) = entry(i) - exit(i)

What if one of the processes may crash?

Event Counter: Single Fault

8

fCount1 keeps the sum of all counts. Any crashed count can be recovered from the remaining counts.
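As a minimal Python sketch (our own illustration, not the authors' code), the single-fault counter fusion works as follows: the fused backup holds the sum of all primary counts, so a crashed count is recovered by subtracting the surviving counts from the sum. The specific values below are hypothetical.

```python
# Sum-based counter fusion: one fused backup tolerates one crash fault.
counts = [5, 3, 9, 2]          # hypothetical primary counts, count(i) = entry(i) - exit(i)
f_count1 = sum(counts)         # fused backup: the sum of all counts

# Suppose process 2 crashes and its count is lost.
crashed = 2
recovered = f_count1 - sum(c for i, c in enumerate(counts) if i != crashed)
assert recovered == counts[crashed]   # the lost count is recovered from the survivors
```

Updates stay cheap: when counter i changes by d, the fused backup is updated by the same d, so no decoding happens until a fault actually occurs.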

Event Counter: Multiple Faults

9

Event Counter: Theorem

10

Shared Events: Aggregation

11

Suppose all processes act on entry(0) and exit(0)

Aggregation of Events

12

Some Applications of Fusion
- Causal ordering of messages for n processes
  - O(n^2) matrix at each process
  - Replication to tolerate one fault: O(n^3) storage
  - Fusion to tolerate one fault: O(n^2) storage
- Ricart and Agrawala's algorithm
  - O(n) storage per process, 2(n-1) messages per mutex
  - Replication: n backup processes, each with O(n) storage, and 2(n-1) additional messages
  - Fusion: 1 fused process with O(n) storage; only n additional messages
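To make the causal-ordering storage figures concrete, here is a small Python sketch (our own illustration, with hypothetical matrix contents): the n per-process O(n^2) matrices are fused into a single backup matrix by element-wise addition, giving O(n^2) extra storage instead of the O(n^3) that full replication would need.

```python
# Fuse n matrices (one per process) into one backup by element-wise addition.
n = 3
# hypothetical per-process O(n^2) matrices
matrices = [[[p + i + j for j in range(n)] for i in range(n)] for p in range(n)]

# fused backup: element-wise sum over all processes -> O(n^2) extra storage
fused = [[sum(matrices[p][i][j] for p in range(n)) for j in range(n)]
         for i in range(n)]

# Recover process 1's matrix after a crash by subtracting the survivors.
crashed = 1
recovered = [[fused[i][j] - sum(matrices[p][i][j] for p in range(n) if p != crashed)
              for j in range(n)] for i in range(n)]
assert recovered == matrices[crashed]
```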

13

Outline
- Crash Faults
  - Space savings
  - Message savings
  - Complex data structures
- Byzantine Faults
  - Single fault (f = 1), O(1) data
  - Single fault, O(m) data
  - Multiple faults (f > 1), O(m) data
- Conclusions & Future Work

14

Example: Resource Allocation, P(i)

15

user: int initially 0;                 // 0 means the resource is idle
waiting: queue of int initially empty;

On receiving acquire from client pid:
    if (user == 0) {
        send(OK) to client pid;
        user = pid;
    } else {
        waiting.append(pid);
    }

On receiving release:
    if (waiting.isEmpty()) {
        user = 0;
    } else {
        user = waiting.head();
        send(OK) to user;
        waiting.removeHead();
    }

Complex Data Structures: Fused Queue

16

[Figure: (i) primary queue A holding a1..a8, (ii) primary queue B holding b1..b5, each with its own head and tail pointers; (iii) the fused queue F, a shared circular array whose slots hold a1, a2, and the sums a3 + b1, a4 + b2, a5 + b3, a6 + b4, a7 + b5, a8 + b6, with separate pointers headA, headB, tailA, tailB.]

Fused queue that can tolerate one crash fault
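A minimal Python sketch of the idea, under a deliberate simplification (our own, not the paper's construction): the fused queue is shown here as a plain list of element-wise sums, padding the shorter queue with zeros, whereas the actual fused queue tracks separate head and tail pointers per primary over a shared circular array.

```python
# Simplified fused queue: store element-wise sums of two integer queues.
from itertools import zip_longest

A = [1, 2, 3, 4, 5]            # hypothetical primary queue A
B = [10, 20, 30]               # hypothetical primary queue B

fused = [a + b for a, b in zip_longest(A, B, fillvalue=0)]

# Queue A crashes: recover it from the fused queue and the surviving queue B.
recovered_A = [f - b for f, b in zip_longest(fused, B, fillvalue=0)]
assert recovered_A == A
```

Enqueues and dequeues on a primary translate to O(1) updates on the fused structure, so the backup stays cheap to maintain until a fault occurs.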

Fused Queues: Circular Arrays

17

Resource Allocation: Fused Processes

18

Outline
- Crash Faults
  - Space savings
  - Message savings
  - Complex data structures
- Byzantine Faults
  - Single fault (f = 1), O(1) data
  - Single fault, O(m) data
  - Multiple faults (f > 1), O(m) data
- Conclusions & Future Work

19

Byzantine Fault Tolerance: Replication

20

[Figure: each primary's state (13, 8, 45) is replicated; replication requires (2f+1)*n processes to tolerate f Byzantine faults.]

Goals for Byzantine Fault Tolerance
- Efficient during error-free operation
- Efficient detection of faults (no need to decode for fault detection)
- Efficient space requirements

21

Byzantine Fault Tolerance: Fusion

22

[Figure: primaries P(i) with values 13, 8, 45; unfused copies Q(i) with the same values; and a fused backup F(1) holding their sum, 66.]

Byzantine Faults (f=1)

Assume n primary state machines P(1)..P(n), each with an O(1) data structure.

Theorem 2: There exists an algorithm with n+1 additional backup machines that has the same overhead as replication during normal operation and additional O(n) overhead during recovery.

23

Byzantine FT: O(m) data

24

[Figure: primaries P(i) hold queues a1..a8 and b1..b5; unfused copies Q(i) mirror them; the fused backup F(1) holds the fused queue (a1, a2, a3 + b1, ..., a8 + b6). When P(i) and Q(i) first disagree at some location x, that is the crucial location, and F(1) is consulted only there.]

Byzantine Faults (f = 1), O(m) data

Theorem 3: There exists an algorithm with n+1 additional backup machines such that normal operation costs the same as replication, with additional O(m+n) overhead during recovery.

No need to decode F(1)

25

Byzantine Fault Tolerance: Fusion

26

[Figure: the primaries hold values 3, 1, 4, and one unfused copy reports 3, 8, 4 (a single mismatched primary). The fused backups hold weighted sums: F(1) = 1*3 + 1*1 + 1*4 = 8, F(2) = 1*3 + 2*1 + 3*4 = 17, F(3) = 1*3 + 4*1 + 9*4 = 43. These checksums identify and correct the lying copy.]
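The worked example above can be sketched in Python. This is an illustrative simplification in our own notation: the actual construction uses Reed-Solomon-style codes, but small Vandermonde weights (1, i, i^2) are enough to show how the weighted checksums locate and correct a single lying copy.

```python
# Locate a single Byzantine lie using weighted (Vandermonde-style) checksums.
x_true = [3, 1, 4]                       # primary values from the slide's example
weights = [[1, 1, 1], [1, 2, 3], [1, 4, 9]]
F = [sum(w[i] * x_true[i] for i in range(3)) for w in weights]   # F == [8, 17, 43]

x_reported = [3, 8, 4]                   # one copy lies: position 1 reports 8
d0 = sum(x_reported) - F[0]              # magnitude of the error
d1 = sum(weights[1][i] * x_reported[i] for i in range(3)) - F[1]
pos = weights[1].index(d1 // d0)         # the weight ratio reveals the lying position
corrected = x_reported[pos] - d0         # subtract the error to correct the value
assert (pos, corrected) == (1, 1)
```

Note that detection only compares sums; the fused values are never fully decoded unless a mismatch is found.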

Byzantine Fault Tolerance: Fusion

27

[Figure: the primaries hold values 3, 1, 4, while unfused copies report 3, 7, 4 and 3, 8, 4 (multiple mismatched primaries). The fused backups F(1), F(2), F(3) hold the weighted sums 8, 17, 43, which are used to determine the correct copy.]

Byzantine Faults (f>1), O(1) data

Theorem 4: There exists an algorithm with fn + f additional state machines that tolerates f Byzantine faults with the same overhead as replication during normal operation.

28

Liar Detection (f > 1), O(m) data

Z := set of all f+1 unfused copies
while (not all copies in Z identical) do
    w := first location where the copies differ
    use the fused copies to find v, the correct value of state[w]
    delete unfused copies with state[w] != v

Invariant: Z contains a correct machine.

No need to decode the entire fused state machine!
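The liar-detection loop can be sketched in Python as follows. The function name and the helper `correct_value_at` are ours: the helper stands in for decoding the fused copies at a single location, which is the whole point (the full fused state machine is never decoded).

```python
# Liar detection for f > 1: repeatedly discard copies caught lying at some location.
def liar_detection(copies, correct_value_at):
    """copies: the f+1 unfused candidate states (lists of equal length).
    correct_value_at(w): decodes the fused copies at location w only."""
    Z = list(copies)                       # invariant: Z contains a correct machine
    while any(c != Z[0] for c in Z):       # not all copies in Z identical
        m = len(Z[0])
        w = next(i for i in range(m) if len({c[i] for c in Z}) > 1)
        v = correct_value_at(w)            # correct value of state[w], from fused copies
        Z = [c for c in Z if c[w] == v]    # delete copies that lied at w
    return Z[0]

true_state = [4, 7, 2]                     # hypothetical example
copies = [[4, 7, 2], [4, 9, 2], [4, 7, 5]]
assert liar_detection(copies, lambda w: true_state[w]) == true_state
```

Each iteration removes at least one lying copy, so at most f locations are ever decoded: O(m) time to scan for a differing location, times at most f rounds.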

29

Fusible Structures

Fusible Data Structures [Garg and Ogale, ICDCS 2007; Balasubramanian and Garg, ICDCS 2011]
- Linked lists, stacks, queues, hash tables
- Data-structure-specific algorithms
- Partial replication for efficient updates
- Multiple faults tolerated using Reed-Solomon coding

Fusible Finite State Machines [Ogale, Balasubramanian, and Garg, IPDPS 2009]
- Automatic generation of minimal fused state machines

30

Conclusions

31

                   Replication   Fusion
Crash Faults       n + nf        n + f
Byzantine Faults   n + 2nf       n + nf + f

n: the number of different servers

- Replication: recovery and updates are simple; tolerates f faults for each primary
- Fusion: space efficient
- The two approaches can be combined for tradeoffs

Future Work

- Optimal algorithms for complex data structures
- Different fusion operators
- Concurrent updates on backup structures

32

Thank You!

33

Event Counter: Proof Sketch

34

Model
- The servers (primary and backups) execute independently (in parallel)
- Primaries and backups do not operate in lock-step
- Events/updates are applied on all the servers
- All backups act on the same sequence of events

35

Model (contd.)

Faults:
- Fail-stop (crash): loss of current state
- Byzantine: servers can "lie" about their current state

For crash faults, we assume the presence of a failure detector.
For Byzantine faults, we provide detection algorithms.
Faults are assumed to be infrequent.

36

Byzantine Faults (f = 1), O(m) data

Theorem 3: There exists an algorithm with n+1 additional backup machines such that normal operation costs the same as replication, with additional O(m+n) overhead during recovery.

Proof sketch:
- Normal operation: responses by P(i) and Q(i) are identical
- Detection: P(i) and Q(i) differ on some response
- Correction: use liar detection
  - O(m) time to determine the crucial location
  - Use F(1) to determine who is correct
  - No need to decode F(1)

37

Byzantine Faults (f > 1)

Proof sketch:
- f copies of each primary state machine and f overall fused machines
- Normal operation: all f+1 unfused copies produce the same output
- Case 1 (single mismatched primary state machine): use liar detection
- Case 2 (multiple mismatched primary state machines): the unfused copy with the largest tally is correct

38

Resource Allocation Machine

39

[Figure: three lock servers, each with its own request queue. Lock Server 1: requests R1, R2, R3. Lock Server 2: requests R1, R2. Lock Server 3: requests R1, R2, R4, R3.]

Byzantine Fault Tolerance: Fusion

40

[Figure: primaries P(i) with values 13, 8, 45, unfused copies Q(i) with the same values, and a fused backup F(1) holding their sum, 66. Fusion requires (f+1)*n + f processes in total.]