Distributed Algorithms for Failure Detection in Crash Environments

R. Cortiñas, A. Lafuente, M. Larrea. Distributed Systems Group, University of the Basque Country UPV/EHU


Transcript of Distributed Algorithms for Failure Detection in Crash Environments

Page 1: Distributed Algorithms for Failure Detection in Crash Environments

UPV / EHU

Distributed Algorithms for Failure Detection in Crash Environments

R. Cortiñas, A. Lafuente, M. Larrea

Distributed Systems Group, University of the Basque Country UPV/EHU

Page 2: Distributed Algorithms for Failure Detection in Crash Environments

Master SIA – Distributed Systems

Guest Stars: ◊P, ◊S and Omega

• ◊P: strong completeness, eventual strong accuracy
– Eventually every process that crashes is permanently suspected by every correct process
– There is a time after which correct processes are not suspected by any correct process

• ◊S: strong completeness, eventual weak accuracy
– There is a time after which some correct process is never suspected by any correct process

• Omega: eventual leader election
– There is a time after which all the correct processes always trust the same correct process

Page 3: Distributed Algorithms for Failure Detection in Crash Environments


The First ◊P Algorithm [CT96]
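The slide only names the algorithm, so the following is a minimal sketch of the general idea usually attributed to [CT96], not a transcription of it: every process sends heartbeats to all others, suspects a peer after a timeout, and on a false suspicion removes the peer from its suspect list and lengthens that peer's timeout. The class name, the tick-based clock and the initial timeout of 2 ticks are assumptions made for illustration.

```python
from typing import Iterable


class HeartbeatDetector:
    """Sketch of an all-to-all heartbeat detector with adaptive timeouts.

    Time is modelled as discrete ticks (an assumption of this sketch).
    """

    def __init__(self, pid: int, peers: Iterable[int], initial_timeout: int = 2) -> None:
        self.pid = pid
        # One timeout and one silence counter per monitored peer.
        self.timeout = {q: initial_timeout for q in peers if q != pid}
        self.silent = {q: 0 for q in self.timeout}
        self.suspected = set()

    def on_heartbeat(self, q: int) -> None:
        """A heartbeat from q arrived."""
        self.silent[q] = 0
        if q in self.suspected:
            # The suspicion was premature: trust q again and wait longer next time.
            self.suspected.discard(q)
            self.timeout[q] += 1

    def on_tick(self) -> None:
        """One time unit elapsed; suspect every peer that stayed silent too long."""
        for q in self.silent:
            self.silent[q] += 1
            if self.silent[q] > self.timeout[q]:
                self.suspected.add(q)
```

Because every process monitors every other process, this pattern keeps n(n-1) directed links busy forever, which is the cost the communication-efficient and communication-optimal variants aim to reduce.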

Page 4: Distributed Algorithms for Failure Detection in Crash Environments


Communication Optimality

A ring arrangement of processes

[Figure: processes p1, ..., p6 arranged in a ring]

Page 5: Distributed Algorithms for Failure Detection in Crash Environments


Communication Optimality

Communication-efficient algorithms: n links are used forever

[Figure: processes p1, ..., p6 arranged in a ring]

Page 6: Distributed Algorithms for Failure Detection in Crash Environments


Communication Optimality

Communication-optimal algorithms: C links are used forever

[Figure: processes p1, ..., p6 arranged in a ring]
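To make the link counts concrete, here is a small illustrative sketch. It assumes that C denotes the number of correct processes (the slides do not define it) and that, in the optimal case, each correct process ends up monitoring only its nearest correct neighbour in the ring. The function name and the example crash set are hypothetical.

```python
from typing import List, Set, Tuple


def surviving_ring_links(n: int, crashed: Set[int]) -> List[Tuple[int, int]]:
    """Return the directed links of the ring restricted to correct processes.

    Processes are numbered 1..n; a link (a, b) means a keeps sending to b.
    """
    correct = [p for p in range(1, n + 1) if p not in crashed]
    if len(correct) < 2:
        return []  # a single survivor has nobody to exchange messages with
    return [(correct[i], correct[(i + 1) % len(correct)]) for i in range(len(correct))]


n = 6
print(n * (n - 1))                                # all-to-all heartbeats: 30 links
print(len(surviving_ring_links(n, set())))        # full ring: n = 6 links
print(len(surviving_ring_links(n, {3, 5})))       # ring of correct processes: C = 4 links
```

With six processes and two crashes, the ring over the survivors uses four links, one per correct process, instead of the six links of the full ring.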

Page 7: Distributed Algorithms for Failure Detection in Crash Environments


Communication-optimal ◊P

Page 8: Distributed Algorithms for Failure Detection in Crash Environments


Communication-optimal Omega

• We also propose an optimal implementation of ◊S, the weakest failure detector for solving Consensus:
– processes ordered: p1, ..., pn
– heartbeat strategy
– communication pattern: one-to-successors
– based on a trusted process (instead of a list of suspected processes)

Page 9: Distributed Algorithms for Failure Detection in Crash Environments


Communication-optimal Omega

i) Initially, p1 starts sending messages periodically to the rest of the processes, and all processes trust p1

[Figure: processes p1, ..., p5 in order. trusted1 = p1, trusted2 = p1, trusted3 = p1, trusted4 = p1, trusted5 = p1]

Page 10: Distributed Algorithms for Failure Detection in Crash Environments


Communication-optimal Omega

ii) If a process does not receive a message within some timeout period from its trusted process pi, then it suspects pi and takes the next process pi+1 as its new trusted process

[Figure: processes p1, ..., p5 in order; p4 times out on p1. trusted1 = p1, trusted2 = p1, trusted3 = p1, trusted4 = p2, trusted5 = p1]

Page 11: Distributed Algorithms for Failure Detection in Crash Environments


Communication-optimal Omega

iii) If a process trusts itself, then it starts sending messages periodically to its successors

[Figure: processes p1, ..., p5 in order; p2 times out on p1 and now trusts itself. trusted1 = p1, trusted2 = p2, trusted3 = p1, trusted4 = p2, trusted5 = p1]

Page 12: Distributed Algorithms for Failure Detection in Crash Environments


Communication-optimal Omega

iv) If a process receives a message from a process pi preceding its trusted process, then it will trust pi again, increasing its timeout period with respect to pi

[Figure: processes p1, ..., p5 in order; p2 and p4 each receive a message from p1. trusted1 = p1; trusted2 = p1, timeout_period21++; trusted3 = p2; trusted4 = p1, timeout_period41++; trusted5 = p1]
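Steps i) to iv) can be sketched as a small round-based simulation. This is an illustration under stated assumptions, not the authors' implementation: time is a discrete tick counter, messages are delivered within the tick they are sent, pids are 0-based (p1 is pid 0), and the names OmegaProcess, simulate, the muted hook and the initial timeout of 2 ticks are all hypothetical.

```python
from typing import Callable, List, Set


class OmegaProcess:
    """One process of the trusted-process scheme; pids are 0-based (p1 is pid 0)."""

    def __init__(self, pid: int, n: int, initial_timeout: int = 2) -> None:
        self.pid = pid
        self.trusted = 0                      # step i: everybody trusts p1
        self.timeout = [initial_timeout] * n  # timeout (in ticks) per candidate leader
        self.silent = 0                       # ticks since the trusted process was heard

    def is_leader(self) -> bool:
        return self.trusted == self.pid

    def on_heartbeat(self, sender: int) -> None:
        if sender < self.trusted:
            # step iv: a predecessor of the trusted process is alive after all
            self.trusted = sender
            self.timeout[sender] += 1
            self.silent = 0
        elif sender == self.trusted:
            self.silent = 0

    def on_tick(self) -> None:
        if self.is_leader():
            return  # a process never times out on itself
        self.silent += 1
        if self.silent > self.timeout[self.trusted]:
            # step ii: suspect the trusted process and move to the next one
            self.trusted += 1
            self.silent = 0


def simulate(
    n: int,
    crashed: Set[int],
    ticks: int,
    muted: Callable[[int, int], bool] = lambda pid, tick: False,
) -> List[OmegaProcess]:
    """Run the scheme for a number of ticks.

    muted(pid, tick) models a slow period in which pid's messages do not arrive.
    """
    procs = [OmegaProcess(pid, n) for pid in range(n)]
    for tick in range(ticks):
        for p in procs:
            if p.pid in crashed or not p.is_leader() or muted(p.pid, tick):
                continue
            # step iii: a process that trusts itself sends to its successors
            for q in procs[p.pid + 1:]:
                if q.pid not in crashed:
                    q.on_heartbeat(p.pid)
        for p in procs:
            if p.pid not in crashed:
                p.on_tick()
    return procs


if __name__ == "__main__":
    procs = simulate(n=5, crashed={0, 1}, ticks=20)
    print([p.trusted for p in procs if p.pid not in {0, 1}])  # [2, 2, 2]
```

In the run shown, p1 and p2 have crashed, so the three survivors move through the order until they all trust p3 (pid 2). Passing a muted function that silences p1 for a few ticks reproduces step iv: the other processes first switch to p2, then return to p1 with a longer timeout.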

Page 13: Distributed Algorithms for Failure Detection in Crash Environments


Communication-optimal Omega

• Lemma. With the previous algorithm, eventually all the correct processes will permanently trust the first correct process in p1, ..., pn

• This property trivially allows us to provide the properties of ◊S:
– Eventual weak accuracy: by not suspecting the trusted process
– Strong completeness: by suspecting all the processes except the trusted process
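The transformation from a trusted process to a list of suspects is short enough to write out. The sketch below assumes processes are numbered 1..n; the function name is hypothetical.

```python
from typing import Set


def suspected_from_trusted(trusted: int, n: int) -> Set[int]:
    """Suspect every process in 1..n except the currently trusted one."""
    return {p for p in range(1, n + 1) if p != trusted}


# Example: once every correct process trusts p3 out of p1..p5,
# each of them suspects p1, p2, p4 and p5.
print(sorted(suspected_from_trusted(3, 5)))  # [1, 2, 4, 5]
```

Once the lemma holds, the trusted process is correct and is suspected by nobody, which gives eventual weak accuracy. Every crashed process differs from the trusted one and is therefore suspected by all correct processes, which gives strong completeness. Correct processes other than the leader are also suspected, which ◊S permits.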

Page 14: Distributed Algorithms for Failure Detection in Crash Environments


Communication-optimal Omega