On Scalable and Efficient Distributed Failure Detectors
Presented By: Sindhu Karthikeyan.


Page 1: On Scalable and Efficient Distributed Failure Detectors Presented By : Sindhu Karthikeyan.


Page 2:

INTRODUCTION:

Failure detectors are a central component in fault-tolerant distributed systems based on process groups running over unreliable, asynchronous networks, e.g., group membership protocols, supercomputers, computer clusters, etc.

The ability of the failure detector to detect process failures completely and efficiently, in the presence of unreliable messaging as well as arbitrary process crashes and recoveries, can have a major impact on the performance of these systems.

"Completeness" is the guarantee that the failure of a group member is eventually detected by every non-faulty group member.

"Efficiency" means that failures are detected quickly, as well as accurately (i.e., without too many mistakes).

Page 3:

The first work to address these properties of failure detectors was by Chandra and Toueg. The authors showed that it is impossible for a failure detector algorithm to deterministically achieve both completeness and accuracy over an asynchronous unreliable network.

This result has led to a flurry of theoretical research on other ways of classifying failure detectors, but, more importantly, has served as a guide to designers of failure detector algorithms for real systems.

For example, most distributed applications have opted to circumvent the impossibility result by relying on failure detector algorithms that guarantee completeness deterministically while achieving efficiency only probabilistically.

In this paper the authors deal with complete failure detectors that satisfy application-defined efficiency constraints of:
1) (quickness) detection of any group member failure by some non-faulty member within a time bound, and
2) (accuracy) a guaranteed probability (within this time bound) that no other non-faulty member detects a given non-faulty member as having failed.

Page 4:

Of these, the first requirement (quickness) merits further discussion.

Consider a cluster that relies on a few central computers to aggregate failure detection information from across the system.

In such systems, efficient detection of a failure depends on the time the failure is first detected by a non-faulty member. Even in the absence of a central server, notification of a failure is typically communicated, by the first member who detected it, to the entire group via a broadcast.

Thus, although achieving completeness is important, efficient detection of a failure is more often related to the time to the first detection of the failure by another non-faulty member.

In this paper the authors discuss:

in Section 2, why the traditional and popular heartbeating failure detection schemes do not achieve the optimal scalability limits.

Page 5:

Finally, they present a randomized distributed failure detector in Section 5 that can be configured to meet the application-defined constraints of completeness and accuracy, and expected speed of detection.

With reasonable assumptions on the network unreliability (member and message failure rates of up to 15%), the worst-case network load imposed by this protocol has a sub-optimality factor that is much lower than that of traditional distributed heartbeat schemes.

This sub-optimality factor does not depend on group size (in large groups), but only on the application-specified efficiency constraints and the network unreliability probabilities.

Furthermore, the average load imposed per member is independent of the group size.

Page 6:

2. PREVIOUS WORK

In most real-life distributed systems, the failure detection service is implemented via variants of the "Heartbeat mechanism", which have been popular as they guarantee the completeness property.

However, all existing heartbeat approaches have shortcomings. Centralized heartbeat schemes create hot-spots that prevent them from scaling.

Distributed heartbeat schemes offer different levels of accuracy and scalability depending on the exact heartbeat dissemination mechanism used, but the paper shows that they are inherently not as efficient and scalable as claimed.

This work differs from all of this prior work.

Here they quantify the performance of a failure detector protocol as the network load it needs to impose on the network in order to satisfy the application-defined constraints of completeness and of quick and accurate detection. They also present an efficient and scalable distributed failure detector.

The new failure detector incurs a constant expected load per process, thus avoiding the hot-spot problem of centralized heartbeating schemes.

Page 7:

3. MODEL

We consider a large group of n (>> 1) members. This set of potential group members is fixed a priori. Group members have unique identifiers. Each group member maintains a list, called a view, containing the identities of all other group members (faulty or otherwise).

Members may suffer crash (non-Byzantine) failures, and may subsequently recover. Unlike other papers on failure detectors that consider a member as faulty if it is perturbed and sleeps for longer than some pre-specified duration, our notion of failure considers a member faulty if and only if it has really crashed. Perturbations at members that might lead to message losses are accounted for in the message loss rate pml.

Whenever a member recovers from a failure, it does so into a new incarnation that is distinguishable from all its earlier incarnations. At each member, an integer in non-volatile storage, incremented every time the member recovers, suffices to serve as the member's incarnation number. The members in our group model thus have crash-recovery semantics, with incarnation numbers distinguishing different failures and recoveries.
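The incarnation-number mechanism can be sketched in a few lines of Python; the file-backed counter and the class layout below are illustrative assumptions of this sketch, not anything specified in the paper:

```python
import json
import os
import tempfile

class Member:
    """Sketch of a group member with crash-recovery semantics.

    The incarnation number lives in non-volatile storage (a file here)
    and is incremented on every recovery, so messages from an earlier
    incarnation can be distinguished from the current one.
    """

    def __init__(self, member_id, storage_path):
        self.member_id = member_id
        self.storage_path = storage_path
        self.incarnation = self._load() + 1  # recovering bumps the counter
        self._store()

    def _load(self):
        if os.path.exists(self.storage_path):
            with open(self.storage_path) as f:
                return json.load(f)["incarnation"]
        return 0  # never started before

    def _store(self):
        with open(self.storage_path, "w") as f:
            json.dump({"incarnation": self.incarnation}, f)

# A crash followed by a recovery yields a strictly newer incarnation:
path = os.path.join(tempfile.mkdtemp(), "m1.state")
first = Member("M1", path)    # initial start
second = Member("M1", path)   # simulated restart of the same member
```

Because the counter survives crashes, every recovery produces an incarnation number larger than any the member has used before.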

Page 8:

We characterize the member failure probability by a parameter pf: pf is the probability that a random group member is faulty at a random time. Member crashes are assumed to be independent across members.

A message sent out on the network fails to be delivered at its recipient (due to network congestion, buffer overflow at the sender or receiver due to member perturbations, etc.) with probability pml ∈ (0, 1).

The worst-case message propagation delay (from sender to receiver through the network) for any delivered message is assumed to be so small compared to the application-specified detection time (typically O(several seconds)) that henceforth, for all practical purposes, we can assume that each message is either delivered immediately at the recipient with probability (1 - pml), or never reaches the recipient.

In the rest of the paper we use the shorthands qf = (1 - pf) and qml = (1 - pml).
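Under this model, each message delivery is an independent Bernoulli trial with success probability qml. A quick sketch, with parameter values that are illustrative rather than taken from the paper:

```python
import random

# Each message is delivered immediately with probability qml = 1 - pml,
# or never reaches its recipient; the values below are illustrative.
pf, pml = 0.10, 0.15
qf, qml = 1 - pf, 1 - pml

def delivered(rng):
    """One Bernoulli trial: True iff the message reaches its recipient."""
    return rng.random() < qml

rng = random.Random(42)
trials = 100_000
fraction = sum(delivered(rng) for _ in range(trials)) / trials
# fraction is close to qml = 0.85 over a large number of trials
```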

Page 9:

4. SCALABLE AND EFFICIENT FAILURE DETECTORS

The first formal characterization of the properties of failure detectors laid down the following properties for distributed failure detectors in process groups:

• {Strong/Weak} Completeness: the crash-failure of any group member is detected by {all/some} non-faulty members;

• Strong Accuracy: no non-faulty group member is declared as failed by any other non-faulty group member.

Subsequent work on designing efficient failure detectors has attempted to trade off the Completeness and Accuracy properties in several ways. However, the completeness properties required by most distributed applications have led to the popular use of failure detectors that guarantee Strong Completeness always, even if only eventually. This of course means that such failure detectors cannot guarantee Strong Accuracy always, but only with a probability less than 1.

Page 10:

For example, all-to-all (distributed) heartbeating schemes have been popular because they guarantee Strong Completeness (since a faulty member will stop sending heartbeats), while providing varying degrees of accuracy.

The requirements imposed by an application (or its designer) on a failure detector protocol can be formally specified and parameterized as follows:

1. COMPLETENESS: satisfy eventual Strong Completeness for member failures.

2. EFFICIENCY:
(a) SPEED: every member failure is detected by some non-faulty group member within T time units after its occurrence (T >> worst-case message round-trip time).
(b) ACCURACY: at any time instant, for every non-faulty member Mi not yet detected as failed, the probability that no other non-faulty group member will (mistakenly) detect Mi as faulty within the next T time units is at least (1 - PM(T)).

Page 11:

To measure the scalability of a failure detector algorithm, we use the worst-case network load it imposes, denoted L. Since several messages may be transmitted simultaneously even from one group member, we define:

Definition 1. The worst-case network load L of a failure detector protocol is the maximum number of messages transmitted by any run of the protocol within any time interval of length T, divided by T.

We also require that the failure detector impose a uniform expected send and receive load at each member due to this traffic.

The goal of a near-optimal failure detector algorithm is thus to satisfy the above requirements (COMPLETENESS, EFFICIENCY) while guaranteeing:

• Scale: the worst-case network load L imposed by the algorithm is close to the optimal possible, with equal expected load per member; i.e., L ≈ L*, where L* is the optimal worst-case network load.

Page 12:

THEOREM 1. Any distributed failure detector algorithm for a group of size n (>> 1) that deterministically satisfies the COMPLETENESS, SPEED (T), and ACCURACY (PM(T)) requirements (with PM(T) << pml) imposes a minimal worst-case network load (messages per time unit, as defined above) of:

L* = n · log(PM(T)) / (log(pml) · T)

L* is thus the optimal worst-case network load required to satisfy the COMPLETENESS, SPEED, and ACCURACY requirements.

PROOF. We prove the first part of the theorem by showing that each non-faulty group member could transmit up to log(PM(T)) / log(pml) messages in a time interval of length T.

Consider a group member Mi at a random point in time t. Let Mi not be detected as failed yet by any other group member, and stay non-faulty until at least time t + T. Let m be the maximum number of messages sent by Mi in the time interval [t, t + T], in any possible run of the failure detector protocol starting from time t.

Now, at time t, the event that "all messages sent by Mi in the time interval [t, t + T] are lost" happens with probability at least pml^m.

Occurrence of this event means that it is indistinguishable to the rest of the non-faulty group members (i.e., members other than Mi) whether Mi is faulty or not.

By the SPEED requirement, this event would then imply that Mi is detected as failed by some non-faulty group member between t and t + T.

Page 13:

Thus, the probability that at time t, a given non-faulty member Mi that is not yet detected as faulty is detected as failed by some other non-faulty group member within the next T time units is at least pml^m. By the ACCURACY requirement, we have pml^m < PM(T), which implies that

m ≥ log(PM(T)) / log(pml)

A failure detector that satisfies the COMPLETENESS, SPEED, and ACCURACY requirements and meets the L* bound works as follows.

It uses a highly available, non-faulty server as a group leader. Every other group member sends log(PM(T)) / log(pml) "I am alive" messages to this server every T time units. The server declares a member as failed if it does not receive an "I am alive" message from it for T time units.
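The per-member message count in this centralized scheme follows directly from the bound in the proof. A small sketch; rounding up to a whole number of messages is an assumption of this sketch:

```python
import math

def messages_per_period(PM_T, pml):
    """Messages each member must send per T time units so that the
    probability of all of them being lost is at most PM(T): the
    smallest integer m with pml**m <= PM_T, i.e. log(PM_T)/log(pml)
    rounded up (both logs are negative, so the ratio is positive)."""
    return math.ceil(math.log(PM_T) / math.log(pml))

# Example: with 20% message loss, keeping the mistaken-detection
# probability below 1e-6 takes 9 "I am alive" messages per period.
m = messages_per_period(1e-6, 0.2)
```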

Page 14:

Definition 2. The sub-optimality factor of a failure detector algorithm that imposes a worst-case network load L, while satisfying the COMPLETENESS and EFFICIENCY requirements, is defined as L / L*.

In the traditional distributed heartbeating failure detection algorithm, every group member periodically transmits a "heartbeat" message to all the other group members. A member Mj is declared as failed by another non-faulty member Mi when Mi does not receive heartbeats from Mj for some consecutive heartbeat periods.

The distributed heartbeat scheme guarantees COMPLETENESS; however, it cannot guarantee ACCURACY and SCALABILITY, because these depend entirely on the mechanism used to disseminate heartbeats.

The worst-case number of messages transmitted by each member per unit time is O(n), and the worst-case total network load L is O(n^2). The sub-optimality factor (i.e., L / L*) thus varies as O(n), for any values of pml, pf and PM(T).
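To see the O(n) gap concretely, one can compare L for all-to-all heartbeating against L* from Theorem 1. The parameter values below are illustrative, not from the paper:

```python
import math

def optimal_load(n, PM_T, pml, T):
    """L* from Theorem 1: n * log(PM(T)) / (log(pml) * T)."""
    return n * math.log(PM_T) / (math.log(pml) * T)

def heartbeat_load(n, T):
    """Worst-case all-to-all heartbeat load: each of the n members
    sends n - 1 heartbeats per period of length T."""
    return n * (n - 1) / T

n, PM_T, pml, T = 1000, 1e-6, 0.2, 10.0
L_star = optimal_load(n, PM_T, pml, T)   # about 858 messages/time unit
L_hb = heartbeat_load(n, T)              # 99,900 messages/time unit
suboptimality = L_hb / L_star            # grows linearly with n
```

Doubling n here doubles the sub-optimality factor, while L* itself only grows linearly in n.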

Page 15:

The distributed heartbeating schemes do not meet the optimality bound of Theorem 1 because they inherently attempt to communicate a failure notification to all group members.

Other heartbeating schemes, such as centralized heartbeating (as discussed in the proof of Theorem 1), can be configured to meet the optimal load L*, but have problems such as creating hot-spots.

Page 16:

5. A RANDOMIZED DISTRIBUTED FAILURE DETECTOR PROTOCOL

In this section, we relax the SPEED condition to detect a failure within an expected (rather than exact, as before) time bound of T time units after the failure. We then present a randomized distributed failure detector algorithm that guarantees COMPLETENESS with probability 1, detection of any member failure within an expected time T from the failure, and an ACCURACY probability of (1 - PM(T)). The protocol imposes an equal expected load per group member, and a worst-case (and average-case) network load L that differs from the optimal L* of Theorem 1 by a sub-optimality factor (i.e., L/L*) that is independent of group size n (>> 1). This sub-optimality factor is much lower than the sub-optimality factors of the traditional distributed heartbeating schemes discussed in the previous section.

5.1 New Failure Detector Algorithm

The failure detector algorithm uses two parameters: protocol period T' (in time units) and integer value k.

The algorithm is formally described in Figure 1. At each non-faulty member Mi, steps (1-3) are executed once every T' time units (which we call a protocol period), while steps (4, 5, 6) are executed whenever necessary.

The data contained in each message is shown in parentheses after the message.

Page 17:

Integer pr;  /* Local period number */

Every T' time units at Mi:
0. pr := pr + 1
1. Select random member Mj from view
   Send a ping(Mi, Mj, pr) message to Mj
   Wait for the worst-case message round-trip time for an ack(Mi, Mj, pr) message
2. If have not received an ack(Mi, Mj, pr) message yet
   Select k members randomly from view
   Send each of them a ping-req(Mi, Mj, pr) message
   Wait for an ack(Mi, Mj, pr) message until the end of period pr
3. If have not received an ack(Mi, Mj, pr) message yet
   Declare Mj as failed

Anytime at Mi:
4. On receipt of a ping-req(Mm, Mj, pr) (Mj != Mi)
   Send a ping(Mi, Mj, Mm, pr) message to Mj
   On receipt of an ack(Mi, Mj, Mm, pr) message from Mj
   Send an ack(Mm, Mj, pr) message to Mm

Anytime at Mi:
5. On receipt of a ping(Mm, Mi, Ml, pr) message from member Mm
   Reply with an ack(Mm, Mi, Ml, pr) message to Mm

Anytime at Mi:
6. On receipt of a ping(Mm, Mi, pr) message from member Mm
   Reply with an ack(Mm, Mi, pr) message to Mm

Figure 1: Protocol steps at a group member Mi. Data in each message is shown in parentheses after the message. Each message also contains the current incarnation number of the sender.
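The initiator's side of the protocol (steps 1-3 above) can be sketched in Python. This is a minimal simulation, not the paper's implementation: the hypothetical `ack_received(src, dst)` callback abstracts an entire ping/ack round trip, returning True iff `src` would receive an ack from `dst` within the timeout.

```python
import random

def protocol_period(Mi, view, k, ack_received):
    """One protocol period at member Mi (steps 1-3 of Figure 1).

    ack_received(src, dst) stands in for the whole ping/ack message
    exchange over the network. Returns (target, suspected).
    """
    others = [m for m in view if m != Mi]
    Mj = random.choice(others)                 # step 1: pick a random target
    if ack_received(Mi, Mj):                   # direct ping/ack succeeded
        return Mj, False
    # Step 2: ask k randomly chosen members to ping Mj on Mi's behalf.
    helpers = random.sample([m for m in others if m != Mj], k)
    for Mm in helpers:
        # Indirect path: Mi -> Mm (ping-req), Mm -> Mj (ping), acks back.
        if ack_received(Mm, Mj) and ack_received(Mi, Mm):
            return Mj, False
    return Mj, True                            # step 3: declare Mj as failed
```

With a lossless network the target is never suspected; if the target (and every indirect path to it) stays silent, Mi suspects it at the end of the period.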

Page 18: On Scalable and Efficient Distributed Failure Detectors Presented By : Sindhu Karthikeyan.

18

Figure 2: Example protocol period at Mi. This shows all the possible messages that a protocol period may initiate. Some message contents excluded for simplicity.

Page 19: On Scalable and Efficient Distributed Failure Detectors Presented By : Sindhu Karthikeyan.

19

Figure 2 illustrates the protocol steps initiated by a member Mi during one protocol period of length T' time units.

At the start of this protocol period at Mi, a random member is selected, in this case Mj, and a ping message is sent to it. If Mi does not receive a replying ack from Mj within some time-out (determined by the message round-trip time, which is << T'), it selects k members at random and sends each of them a ping-req message.

Each of the non-faulty members among these k which receives the ping-req message subsequently pings Mj and forwards the ack received from Mj, if any, back to Mi.

In the example of Figure 2, one of the k members manages to complete this cycle of events as Mj is up, and Mi does not suspect Mj as faulty at the end of this protocol period.

Page 20: On Scalable and Efficient Distributed Failure Detectors Presented By : Sindhu Karthikeyan.

20

The effect of using the randomly selected subgroup is to distribute the decision on failure detection across a subgroup of (k + 1) members.

So it can be shown that the new protocol's properties are preserved even in the presence of some degree of variation of message delivery loss probabilities across group members.

Sending k repeat ping messages may not satisfy this property. Our analysis in Section 5.2 shows that the cost (in terms of the sub-optimality factor of network load) of using a (k + 1)-sized subgroup is not too significant.

5.2 Analysis

In this section, we calculate, for the above protocol, the expected detection time of a member failure, as well as the probability of an inaccurate detection of a non-faulty member by some other (at least one) non-faulty member.

Page 21: On Scalable and Efficient Distributed Failure Detectors Presented By : Sindhu Karthikeyan.

21

For any group member Mj, faulty or otherwise:
Pr[at least one non-faulty member chooses to ping Mj (directly) in a time interval T']

= 1 - (1 - qf/n)^n ≈ 1 - e^(-qf) (since n >> 1)

Thus, the expected time between a failure of member Mj and its detection by some non-faulty member is

E[T] = T'·(1/(1 - e^(-qf))) = T'·(e^(qf)/(e^(qf) - 1)) ----------------(1).

Now, denote C(pf) = e^(qf)/(e^(qf) - 1).
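As a quick numeric check of equation (1), C(pf) and E[T] can be computed directly. The parameter values below are illustrative, not from the paper:

```python
import math

def expected_detection_time(T_prime, qf):
    """E[T] = T' * e^qf / (e^qf - 1), from equation (1).

    qf is the probability that a group member is non-faulty;
    T' is the protocol period length.
    """
    C = math.exp(qf) / (math.exp(qf) - 1.0)   # C(pf) as defined above
    return T_prime * C

# e.g. with qf = 0.95 and T' = 1 time unit, E[T] is about 1.63, i.e.
# a failure is detected in well under two protocol periods on average.
```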

Let PM(T) be the probability of inaccurate failure detection of a member within time T. A random group member Ml is non-faulty with probability qf, and the probability of such a member pinging Mj within a time interval T is:

(1/n)·C(pf).

Page 22: On Scalable and Efficient Distributed Failure Detectors Presented By : Sindhu Karthikeyan.

22

Now, the probability that Mi receives no acks, direct or indirect, according to the protocol of Section 5.1, is: (1 - qml^2)·(1 - qf·qml^4)^k.

Therefore,

PM(T) = 1 - [1 - (qf/n)·C(pf)·(1 - qml^2)·(1 - qf·qml^4)^k]^(n-1)

≈ 1 - e^(-qf·(1 - qml^2)·(1 - qf·qml^4)^k·C(pf)) --------(since n >> 1)

≈ qf·(1 - qml^2)·(1 - qf·qml^4)^k·C(pf) --------------(since PM(T) << 1)

So,

k = log[PM(T)/(qf·(1 - qml^2)·(e^(qf)/(e^(qf) - 1)))] / log(1 - qf·qml^4) ---------(2).

Thus, the new randomized failure detector protocol can be configured using equations (1) and (2) to satisfy the SPEED and ACCURACY requirements with parameters E[T], PM(T).
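For illustration, equation (2) can be solved for k with assumed parameter values (the numbers below are examples chosen by the editor, not figures from the paper):

```python
import math

def required_k(PM, qf, qml):
    """Smallest integer k meeting the ACCURACY target PM(T), per equation (2)."""
    C = math.exp(qf) / (math.exp(qf) - 1.0)              # C(pf), equation (1)
    k = math.log(PM / (qf * (1 - qml**2) * C)) \
        / math.log(1 - qf * qml**4)
    return math.ceil(k)

# e.g. a 1% false-detection target with qf = qml = 0.95 needs only
# k = 2 ping-req targets; tightening PM(T) to 0.1% needs k = 4.
```

The point of the example is that k grows only logarithmically as the accuracy requirement PM(T) tightens.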

Moreover, given a member Mj that has failed (and stays failed), every other non-faulty member Mi will eventually choose to ping Mj in some protocol period, and discover Mj as having failed. Hence,

Page 23: On Scalable and Efficient Distributed Failure Detectors Presented By : Sindhu Karthikeyan.

23

THEOREM 2. This randomized failure detector protocol:

(a) satisfies eventual Strong Completeness, i.e., the COMPLETENESS requirement;

(b) can be configured via equations (1) and (2) to meet the requirements of (expected) SPEED and ACCURACY; and

(c) has a uniform expected send/receive load at all group members.

Proof: From the above discussions and equations (1) and (2).

Page 24: On Scalable and Efficient Distributed Failure Detectors Presented By : Sindhu Karthikeyan.

24

For calculating L/L*:

Finally, we upper-bound the worst-case and expected network load (L and E[L], respectively) imposed by this failure detector protocol.

The worst-case network load occurs when, every T' time units, each member initiates steps (1-6) in the algorithm of Figure 1.

Steps (1, 6) involve at most 2 messages, while

Steps (2-5) involve at most 4 messages per ping-req target member.

Therefore, the worst-case network load imposed by this protocol (in messages/time unit) is:

L = n·[2 + 4·k]·(1/T')

From Theorem 1 and equations (1) and (2):

L/L* = (n·[2 + 4·k]·(1/T')) / (n·[log(PM(T))/(log(pml)·T)]) = [2 + 4·k]·(log(pml)·T)/(log(PM(T))·T'), with k as given by equation (2) ------------------------------------(3).

L thus differs from the optimal L* by a factor that is independent of the group size n. Equation (3) can be written as a linear function of 1/(-log(PM(T))) as:

Page 25: On Scalable and Efficient Distributed Failure Detectors Presented By : Sindhu Karthikeyan.

25

Page 26: On Scalable and Efficient Distributed Failure Detectors Presented By : Sindhu Karthikeyan.

26

Theorem 3: The sub-optimality factor L/L* of the protocol of Figure 1 is independent of group size n (>> 1). Furthermore,

Proof: From equations (4a) through (4c).
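The linearity behind Theorem 3 can be checked numerically by substituting equation (2) into equation (3) and sampling several accuracy targets. All parameter values below are illustrative assumptions, and the ratio T/T' is simply set to 1:

```python
import math

def load_ratio(PM, qf, qml, pml, T_over_Tprime=1.0):
    """L/L* from equation (3), with k left unrounded as in equation (2)."""
    C = math.exp(qf) / (math.exp(qf) - 1.0)
    k = math.log(PM / (qf * (1 - qml**2) * C)) / math.log(1 - qf * qml**4)
    return (2 + 4 * k) * (math.log(pml) * T_over_Tprime) / math.log(PM)

# Sampling three accuracy targets PM(T) and plotting L/L* against
# x = 1/(-log PM(T)) gives three collinear points: L/L* is affine in x,
# and in particular never depends on the group size n.
targets = [1e-2, 1e-4, 1e-6]
ratios = [load_ratio(PM, 0.95, 0.95, 0.05) for PM in targets]
```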

Page 27: On Scalable and Efficient Distributed Failure Detectors Presented By : Sindhu Karthikeyan.

27

Now we calculate the average network load E[L] imposed by the new failure detector algorithm.

Every T' time units, each non-faulty member (n·qf members on average) executes steps 1-3 in the algorithm of Figure 1.

Steps 1 and 6 involve at most 2 messages; this happens only if Mj sends an ack back to Mi.

Steps 2 - 5 are executed only if Mj does not send an ack back to Mi, which happens with probability (1 - qf·qml^2); Mi then sends a ping-req to k other members in the system, involving 4 messages per non-faulty ping-req target.

Therefore the average network load is given as:
E[L] < n·qf·[2 + (1 - qf·qml^2)·4·k]·(1/T')

So even E[L]/L* is independent of the group size n.
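Comparing the worst-case bound on L (from the step counts of Figure 1) with this bound on E[L] makes the gap concrete; n, k, qf, qml and T' below are illustrative values chosen by the editor:

```python
def worst_case_load(n, k, T_prime):
    """L = n * [2 + 4k] / T', worst-case messages per time unit."""
    return n * (2 + 4 * k) / T_prime

def expected_load_bound(n, k, qf, qml, T_prime):
    """Upper bound on E[L]: n * qf * [2 + (1 - qf*qml^2) * 4k] / T'."""
    return n * qf * (2 + (1 - qf * qml**2) * 4 * k) / T_prime

# With n = 100, k = 2, qf = qml = 0.95, T' = 1: L = 1000 messages per
# time unit, while the bound on E[L] is about 298, because the k extra
# ping-req exchanges are only paid when the direct ping/ack fails.
```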

Page 28: On Scalable and Efficient Distributed Failure Detectors Presented By : Sindhu Karthikeyan.

28

Figure 3(a) shows the variation of L/L* as in equation (3).

This plot shows that the sub-optimality factor L/L* of the network load imposed by the new failure detector rises as pml and pf increase, and as PM(T) decreases (i.e., L/L* increases when pml increases, and increases when PM(T) decreases), but is bounded by the function g(pf, pml).

(We can make out from the plot that L/L* < 26 for pf and pml below 15%.)

Now see figure 3(c): it can easily be seen from the graph that E[L]/L* stays very low, below 8, for values of pf and pml up to 15%. As PM(T) decreases, the bound on E[L]/L* also decreases.

(This curve reveals the advantage of using randomization in failure detection: unlike traditional distributed heartbeating algorithms, the average load stays within a small constant factor of the optimum, i.e., E[L] < 8·L* here.)

Page 29: On Scalable and Efficient Distributed Failure Detectors Presented By : Sindhu Karthikeyan.

29

Concluding Comments

We have thus quantified the load L* required by a complete failure detector algorithm in a process group over a simple, probabilistically lossy network model, derived from the application-specified constraints of:

Detection time of a group member failure by some non-faulty group member (E[T]).

Accuracy (1 - PM(T)).

So the randomized failure detection algorithm:

Imposes an equal load on all group members.

Is configured to satisfy the group-specified requirements of completeness, accuracy and (expected) speed of failure detection.

For very stringent accuracy requirements, and for pml and pf in the network up to 15% each, has a sub-optimality factor (L/L*) that is not as large as that of traditional distributed heartbeating protocols.

This sub-optimality factor (L/L*) does not vary with group size, when groups are large.