Practical Byzantine Group Communication∗

Vadim Drabkin Roy Friedman Alon Kama

Computer Science Department

Technion - Israel Institute of Technology

Haifa, 32000

Israel

Email: {dvadim,roy,alon}@cs.technion.ac.il

Abstract

This paper presents an adaptation of a group communication system called JazzEnsemble to tolerate Byzantine failures. The work described here emphasizes scalability and good performance in the normal case, i.e., when there are no failures, while providing strong semantics to the application. The paper presents the main concepts and protocols that enable the Byzantine tolerant version of JazzEnsemble to obtain these goals. In particular, this includes fuzzy mute and fuzzy verbose failure detectors, an efficient Byzantine vector consensus protocol, and a novel Byzantine uniform broadcast protocol, as well as modifications at each layer of the system to overcome potential Byzantine attacks. Additionally, high-level protocols only rely on the oral messages model, and thus messages need to be signed only once at a low level of the system. Finally, the paper presents an extensive performance evaluation, which demonstrates the system’s scalability and efficiency. This is also used to analyze the sources of performance degradation associated with various aspects of overcoming Byzantine failures.

Keywords: Group Communication, Byzantine Failures, Fault Tolerance, Byzantine Consensus, Byzantine Uniform Broadcast.

∗ This work was funded by the Israeli Science Foundation. Most of the hardware used for the performance measurements was donated to our lab by IBM.

Technion - Computer Science Department - Technical Report CS-2005-17 - 2005

1 Introduction

1.1 Context of this Study

Group communication systems have proven themselves as powerful middleware for building reliable networked applications in wired environments [10]. These systems relieve programmers from many of the tedious and highly complex issues involved in designing such applications, allowing them to focus on the essential aspects of the application being developed, resulting in faster development time and fewer bugs. During the last few years, group communication systems have become standard building blocks in many clustering and replication products in both industry and academia.

Despite the large body of work on group communication, most systems assumed a relatively benign failure model, which largely excludes Byzantine failures [39]. In particular, while some group communication systems support signatures and authentication of messages, the vast majority assume that all group members can be trusted. In contrast, under the Byzantine failure model, a process can deviate arbitrarily from its protocol. This can be either a result of a bug or hardware malfunction, or due to malicious behavior.

One of the main reasons why most systems ignore Byzantine faults is that group communication has been largely used to coordinate clustered applications, all running within the same LAN. It is often assumed that in such closed environments, all participants can be trusted. In particular, when combining this assumption with the performance hit and protocol complexity associated with accommodating Byzantine failures, most projects opted not to handle such failures.

However, given the rise in security attacks against computer systems, as well as the desire to utilize group communication in new application domains, such as in ad-hoc networks [52], the need for Byzantine tolerance re-emerges. This is because the likelihood that a node might be compromised is no longer negligible. Thus, if we want the system to remain robust in these situations, it must be able to tolerate Byzantine failures. On the other hand, we would like to ensure reasonable performance, since otherwise the system would also be useless. In particular, we assume that the occurrence of Byzantine faults is rare, and thus we believe that it is important to focus on the performance of the system during normal runs, when there are no Byzantine faults. Yet, of course, the system must still be able to recover from Byzantine faults and do useful work when they occur.

1.2 Contributions of this Work

Formal Definition: The first contribution of this work is a definition of Byzantine virtual synchrony. This definition is a direct extension of the strong virtual synchrony model of Horus and Ensemble [32]. A novel aspect of this definition is that it distinguishes between being disconnected, which is caused by the network, and a node being Byzantine.¹

Layered Architecture: Next, we address the main barriers to Byzantine resilience, namely protocol complexity and performance, using the following methodologies: Our starting point is the experimental JazzEnsemble system, a derivative of the Ensemble system [34], originally developed at Cornell University. JazzEnsemble (like Ensemble) has a layered architecture as illustrated in Figure 2. Each layer of JazzEnsemble/Ensemble implements a simple functionality, also called a micro-protocol. Consequently, as was done in ITUA [46], we can analyze the possible attacks against each micro-protocol at each layer. With this analysis, for most layers, we can modify the corresponding micro-protocol by adding a few checks to overcome the identified attacks. In a few cases, mostly those handling membership changes, we had to re-write the corresponding layers, yet since each layer is simple, this task was also manageable.

¹ Reiter has provided a formal definition of the membership protocol and delivery guarantees implemented in Rampart [47], but these are somewhat different from our definition.

Fuzzy Mute and Fuzzy Verbose Failure Detectors: As part of this effort, we introduce the notion of fuzzy mute and fuzzy verbose failure detectors. That is, many of the attacks we identified on the various layers involve either a member consistently neglecting to send an anticipated protocol message, which is known as a mute failure [21], or sending too many such messages, which we refer to as a verbose failure [24]. Interestingly, as elaborated below, inside a group communication system, mute and verbose failures can be identified entirely based on locally observed events, which motivates the use of such failure detectors [21, 24].

Moreover, as has been discussed in [9, 26], one of the main scalability problems for group communication systems is that a slow node can cause the entire system to pause. On the other hand, eliminating slow nodes from a group too eagerly results in too many view changes, which also reduces the system’s performance and stability. The fuzzy group membership model was proposed in [26] to overcome this problem in a benign failure model. In this work, we merge the notion of mute and verbose failure detectors with fuzzy group membership to obtain fuzzy mute and fuzzy verbose failure detectors.

Novel Protocols for View Management: As for membership, like in Ensemble, our protocol is coordinator driven. However, since the coordinator might be Byzantine, it cannot be trusted blindly. Thus, when computing a new view, a Byzantine consensus protocol is used to decide on which members have failed in the terminating view. Additionally, a uniform broadcast protocol is used to ensure that all members agree on the nodes that merge (or join) into the view and, in particular, for disseminating the new view. Our layered architecture allows us to use various Byzantine consensus and Byzantine uniform broadcast protocols here, in order to be flexible in addressing tradeoff issues like performance vs. resiliency. Specifically, our measurements were done with a vector-oriented optimized version of the Byzantine consensus protocol of [31], and our own novel Byzantine uniform broadcast protocol. Both these protocols utilize fuzzy mute and fuzzy verbose failure detectors. Both also enable very fast termination in good scenarios, but in exchange assume that the number of Byzantine nodes in a view is at most f < n/6 in the case of consensus and f < n/5 for uniform broadcast, where f is the number of faulty nodes and n is the number of nodes in the view. It is just as easy to use instead any other protocol that offers higher resiliency, yet higher latency, such as [11, 12, 13, 36, 42, 53].
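To make the two resilience bounds concrete, the following minimal sketch computes the largest number of Byzantine nodes that satisfies each strict inequality for a given view size. The function and constant names are ours and purely illustrative; they are not part of JazzEnsemble.

```python
def max_byzantine(n: int, divisor: int) -> int:
    """Largest integer f that satisfies the strict bound f < n / divisor."""
    if n < 1 or divisor < 1:
        raise ValueError("n and divisor must be positive")
    # f < n/d  <=>  f*d <= n - 1  <=>  f <= floor((n - 1) / d)
    return (n - 1) // divisor


CONSENSUS_DIVISOR = 6   # f < n/6 for the vector consensus protocol
BROADCAST_DIVISOR = 5   # f < n/5 for the uniform broadcast protocol


def view_tolerance(n: int) -> int:
    """Faults tolerated in a view of n nodes when both protocols are in use."""
    return min(max_byzantine(n, CONSENSUS_DIVISOR),
               max_byzantine(n, BROADCAST_DIVISOR))


if __name__ == "__main__":
    for n in (7, 13, 50):
        print(n, max_byzantine(n, 6), max_byzantine(n, 5), view_tolerance(n))
```

Under these bounds, a view needs at least 7 members before a single Byzantine node can be tolerated by the consensus protocol, since f = 1 requires n > 6.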

Cryptography is Kept at the Lowest Level of the System: We would like to mention that none of our membership or ordering protocols use protocol level signatures. The only place where we use authentication is the reliable multicast by retransmission layer, since if a node p is asked to retransmit a message originated at q, then p must be able to prove that it is indeed q’s original message that it is re-sending. This layer, however, is located at the bottom of our protocol stack. Therefore, all the cryptography requirements of our solution can be handled by low-level mechanisms, and each message needs to be signed/authenticated only once at a low level of the system.

Detailed Performance Evaluation: In this paper, we also present a performance evaluation of our system with various quality of service levels, ranging from benign fault-tolerance FIFO ordering to full Byzantine atomic broadcast. Interestingly, we have been able to demonstrate the scalability of our system with up to 50 nodes. In the past, the Spread system was demonstrated on 1,000 processes [5]. However, in that demonstration, Spread was run with 20 daemon processes on 20 CPUs, where on each CPU there were 50 client processes such that all client processes connect to the daemon on their machine. In particular, only the daemons run the true membership protocols, failure detection heartbeats, etc. In contrast, our measurements were done with 50 different daemons on 50 different CPUs. In that sense, to the best of our knowledge, this is the largest test-bed reported so far.²

Moreover, when discounting the cost of cryptography, which can be exchanged for a hardware level cryptographic accelerator card, we can obtain a throughput of between 40,000 and 50,000 messages per second (this was repeated with messages of sizes 1, 10, and 100 bytes).³ These results are significantly higher than previously reported results for group communication systems. We were able to obtain these results due to a combination of new hardware (an IBM BladeCenter), fixing some legacy performance bugs of Ensemble, and the novel techniques that we used, as described briefly above, and elaborated on in the rest of this paper.

In measuring the impact of performing cryptography in software, we used two options: (a) public/private key cryptography and (b) symmetric keys where a shared key is generated for each pair of nodes. In particular, in the latter scheme, each broadcast message is signed with n − 1 such symmetric keys, one for each receiver. This trick was also used in the past in [14]. Given the overhead of public key cryptography, for moderate sized groups of up to a few dozen nodes, it is worth using the multiple symmetric key-pairs, as our measurements show, which also echoes the results of [14] (yet, in our case, with much larger replication groups).
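The pairwise symmetric key scheme can be sketched as a vector of n − 1 authentication tags attached to each broadcast message, one per receiver. The sketch below is only an illustration under our own assumptions: it uses HMAC-SHA256 from the Python standard library as the tag function and generates keys locally, whereas the actual system relies on Ensemble's key management and cryptokit. All function names are hypothetical.

```python
import hashlib
import hmac
import secrets
from itertools import combinations


def make_pair_keys(nodes):
    """One random 32-byte key per unordered pair of nodes (assumed key setup)."""
    return {frozenset(pair): secrets.token_bytes(32)
            for pair in combinations(nodes, 2)}


def tag(key, payload):
    """Authentication tag for payload; HMAC-SHA256 is our choice, not the paper's."""
    return hmac.new(key, payload, hashlib.sha256).digest()


def make_authenticator(sender, nodes, pair_keys, payload):
    """Return {receiver: tag}: n - 1 tags, one for each receiver."""
    return {r: tag(pair_keys[frozenset((sender, r))], payload)
            for r in nodes if r != sender}


def verify(sender, receiver, pair_keys, payload, authenticator):
    """The receiver checks only its own entry, with the key it shares with sender."""
    expected = tag(pair_keys[frozenset((sender, receiver))], payload)
    return hmac.compare_digest(authenticator.get(receiver, b""), expected)
```

The cost for the sender grows linearly with the group size, which is consistent with the observation that the scheme pays off for groups of up to a few dozen nodes.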

Finally, one of the most important contributions of this paper is that we were able to measure the performance impact of various aspects of handling Byzantine failures separately. In contrast, most works tend to show the performance with full support for Byzantine failures vs. no support at all for Byzantine behavior. To the best of our knowledge, our work is the most fine-grained attempt to study the different sources of performance degradation in handling Byzantine failures. We expect that our results will help developers of future Byzantine tolerant systems focus on the performance bottlenecks of such systems.

Paper Road-map: Section 2 presents the model and basic assumptions we make, as well as the formal problem statement. The solution itself is described in Section 3 and the performance measurements appear in Section 4. We compare our work with related work in Section 5, and conclude with a discussion in Section 6.

² Notice that Spread supports a hierarchical architecture for daemons. In the experiment reported in [5], the daemons were arranged in a flat structure (or a single level), and thus potentially, Spread can be much more scalable than that. Also, the work in [8] reports scalability of up to 50 nodes on an IBM SP2. However, in that work, data was sent on a separate unreliable communication stack that was run on top of a user level interface to the proprietary fast communication network of the IBM SP2, and while using a structured overlay. In the work we report here, communication was done on the main reliable stack, which was run over UDP/IP with a standard gigabit Ethernet, and with a flat communication model.

³ We did not use a packing optimization [33] in this work, which can dramatically boost the performance, especially for small messages, as is common in control applications, stock quotes, and cluster management.

[Figure 1: A node’s architecture. Labels recovered from the diagram: application, Group Communication, Failure Detector, and network modules; events SEND, CAST, JOIN, LEAVE, SEND_DELIVER, CAST_DELIVER, VIEW, NET_SEND, NET_CAST, NET_RECEIVE.]

2 Model, Assumptions and Problem Statement

2.1 Basic Concepts

We assume the standard group-communication/middleware enhanced distributed computing model. That is, we assume a collection of n nodes (also called processes), each with an architecture similar to the one illustrated in Figure 1. In particular, a node includes an application module, a group communication module, and a network module.

Physically, nodes can only communicate by sending and receiving messages over the network. From a theoretical standpoint, the network itself can be modeled as being driven by a scheduler that controls the timing in which messages are received and is also allowed to drop messages. Furthermore, the scheduler may decide at each moment for any pair of nodes whether they are connected or disconnected. When a pair of nodes p and q are connected, most messages sent between p and q are delivered by the scheduler within a known bounded latency and while preserving their FIFO sending order. The few messages that are delayed, dropped, or reordered when p and q are connected are chosen randomly and in an oblivious manner to their content. Otherwise, the network is disconnected. We may treat the connected and disconnected properties as relations; we assume that the scheduler maintains symmetry and transitivity for these relations at all times. Thus, if p is connected to q, then q is connected to p, and if there is a third process r that p is connected to, then q and r are also connected.⁴ Finally, we assume that the scheduler makes its decisions about connection and disconnections, and about the timing and ordering of messages, without colluding with any process. This is also known in the literature as an oblivious scheduler.
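One consequence of assuming a symmetric and transitive connected relation is that, at any moment, the nodes are split into disjoint components, with every node connected to exactly the members of its own component. The following sketch illustrates this with a small union-find structure; the representation of links as pairs of node names is our own assumption.

```python
def components(nodes, links):
    """Partition nodes into components under the symmetric, transitive closure of links."""
    parent = {p: p for p in nodes}

    def find(p):
        # Follow parent pointers to the representative, halving the path on the way.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for p, q in links:
        parent[find(p)] = find(q)   # merging is symmetric: order of p, q is irrelevant

    groups = {}
    for p in nodes:
        groups.setdefault(find(p), set()).add(p)
    return sorted(groups.values(), key=sorted)


def connected(p, q, comps):
    """True if p and q lie in the same component."""
    return any(p in g and q in g for g in comps)
```

For example, with links p-q and q-r and an isolated node s, the nodes p and r are connected by transitivity, while s forms a component of its own.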

The application module executes a program that involves communicating with the application modules at other nodes by exchanging messages with them. The group communication module is responsible for providing an abstract communication model that has much stronger semantics than the network module. In particular, in this work we are interested in the Byzantine tolerant version of the strong virtual synchrony model, which is defined below.

⁴ In some networks, these relations may not be transitive, yet transitivity can be obtained by a peer-to-peer routing protocol.

As is typically done in distributed computing, each module of a node can be modeled as an automaton. In this work, we are mainly concerned with the group communication module. The automaton of this module accepts input events from the application module, the network module, or timer events. In turn, the group communication module performs some computation that may change its state, returns some output events to the application module and network module, or sets future timer events. The input events include send – a request by the application to send a message to a specific node, cast – a request by the application to send the same message to all nodes, net-receive – receiving a message from the network, as well as membership related events that will be introduced below. The output events are net-send and net-cast to the network, send-deliver and cast-deliver to the application, as well as membership related events, as reported below. The actions performed by the group communication layer in response to a given event are governed by its specification, which is also known as a transition function in automata theory.
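The automaton view can be illustrated by a toy transition function that maps each input event to a list of output events. This is a deliberately minimal sketch: it performs no reliability, ordering, or membership processing, and the class names, the fields, and the multicast flag are our own assumptions rather than JazzEnsemble's event types.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Event:
    kind: str                 # "send", "cast", "net-receive", "net-send", ...
    payload: bytes = b""
    peer: str = ""            # destination of a send, origin of a net-receive
    multicast: bool = False   # whether the message was addressed to all nodes


class GroupCommModule:
    """Toy automaton for the group communication module."""

    def __init__(self):
        self.delivered = 0    # example of local state updated by transitions

    def step(self, ev: Event) -> List[Event]:
        """Transition function: consume one input event, return the output events."""
        if ev.kind == "send":
            return [Event("net-send", ev.payload, ev.peer)]
        if ev.kind == "cast":
            return [Event("net-cast", ev.payload, multicast=True)]
        if ev.kind == "net-receive":
            self.delivered += 1
            out = "cast-deliver" if ev.multicast else "send-deliver"
            return [Event(out, ev.payload, ev.peer, ev.multicast)]
        raise ValueError(f"unknown input event: {ev.kind}")
```

A real stack would compose one such transition function per layer, and would also handle timer and membership events.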

With this model, for a process pi, we can define a process history hi to be the sequence of events occurring at pi. A collection of process histories, one for each process in the system, is called an execution. We assume in this work that all executions are well formed in the sense that in every execution σ, if the history of a process pi in σ includes a net-receive event with some message m sent by some process pj to pi (or to everyone in the case of a broadcast), then the history hj includes a net-send or net-cast event with the message m sent to pi by pj (or to everyone in the case of a broadcast).

Each process has a local clock. The local clocks are not synchronized. However, we assume the existence of a global clock that is known to external observers of the system. For a given history h and two events e1 and e2, we denote by e1 →h e2 the fact that e1 is ordered before e2 in h. Similarly, for a given execution σ, we denote by e1 →σ e2 the fact that e1 occurred in the global time of σ before e2.

2.2 Byzantine Failures

A node is said to incur a Byzantine failure if it deviates in any arbitrary manner from its specification [39]. A node that incurs a Byzantine failure is referred to as a Byzantine node, while the rest of the nodes are called correct. In particular, a Byzantine node can avoid generating messages that it is expected to, fail to deliver messages it received from the network, send different versions of a message that was generated using a cast event to different nodes, etc. A special type of failure is called crash, which means that the process stops incurring and generating events of any kind. Yet, we assume that nodes cannot impersonate other nodes. This can be realized using either cryptography [50], or through true private communication lines. In the formal parts of this paper we simply assume that the required cryptographic infrastructure exists, and do not elaborate on how it is obtained. In particular, we rely on the key management that was developed by Rodeh for Ensemble [48] and on Xavier Leroy’s cryptokit [1].

Recall that we assume that nodes do not control the network, i.e., the network scheduler is oblivious. This also means that Byzantine nodes cannot prevent non-Byzantine nodes from exchanging messages using the network. Also, notice that two processes p and q may not be able to exchange messages either if they are disconnected, or if one of them is Byzantine. Finally, we assume that the number of Byzantine nodes in the system is limited by some parameter f.


2.3 Byzantine Virtual Synchrony

We assume an abstract entity called a group. The application module of a node can invoke a join event, indicating that it wishes to join a group, or a leave event, indicating that it is no longer interested in the group. During the time interval between a join event and a subsequent leave event, the node is said to be a member of the group. The collection of correct nodes that are members of the group at a given time t is called the group membership at time t.

The virtual synchrony model presents an abstraction to the application, in which periodically the application receives a view event; such an event reports an estimate for the current membership. The view event includes a view ID and an ordered membership list; for a view event v, we denote its view identifier by v.vid and the view membership by v.mbrs. The Byzantine virtual synchrony model includes two aspects: The first relates the contents of views delivered to applications in different nodes to one another and to reality. The second places restrictions on message delivery within a given view. The formal definition appears below (the definition does not explicitly address join and leave events for simplicity). It is split into Byzantine view synchrony, which only addresses views, and Byzantine virtual synchrony, which adds certain requirements about message agreement and reliable delivery.

In the definitions below, we use the following notation for convenience: For any view events v1 and v2 and history hi, we denote by C(hi, v1, v2) the fact that v1 →hi v2 and there does not exist a third view event v3 such that v1 →hi v3 →hi v2 (this corresponds to saying that v1 and v2 are consecutive view events in hi).

Definition 2.1 (Byzantine View Synchrony) An execution σ is Byzantine view synchronous if it obeys the following restrictions:

1. For every view event v that is included in a history hi of a correct process, pi ∈ v.mbrs.

2. For every history hi of a correct process and every two view events v1 →hi v2, we have v1.vid < v2.vid.

3. For every two correct nodes pi and pj and any two view events vi ∈ hi and vj ∈ hj for which vi.vid = vj.vid, we also have vi.mbrs = vj.mbrs.

4. For every two correct nodes pi and pj that from some point on in σ are continuously connected, there is a point in hi from which all view events v in hi are such that pj ∈ v.mbrs.

5. For every correct node pi, if from some point on in σ there is another node pj that is always disconnected from pi, or pj crashes, then from some point on in hi, for every view event v in hi we have pj ∉ v.mbrs.

6. For any two correct nodes pi and pj and views v1 and v2, if C(hi, v1, v2) and pj ∈ v1.mbrs ∩ v2.mbrs, then the history hj (of pj) also includes v1.

Intuitively, Items 1 and 2 are sanity checks that ensure that a node is included in its own view and that view identifiers are monotonically increasing. Item 3 requires correct processes to agree on the membership of joint views. Items 4 and 5 require the membership lists in views to aspire to resemble the true list of connected correct processes. Finally, Item 6 requires that if a node pj appears in two consecutive views of another node pi, then pj has at least installed the first of these two views. Thus, each view also serves as a confirmation and synchronization point w.r.t. the previous view. The main difference between this definition and the benign version of View Synchrony that appears in [32] is that here we restrict the behavior of correct processes (and not of the alive processes) and we separate between being connected and being correct.
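The safety items of Definition 2.1 (Items 1, 2, 3, and 6) can be checked mechanically on recorded histories, whereas Items 4 and 5 are eventual properties that no finite trace can confirm. The sketch below assumes a history is a Python list of events in local order and that the dictionary passed in contains the histories of correct processes only; the View class and function names are ours.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class View:
    vid: int
    mbrs: FrozenSet[str]


def views_of(history: list) -> List[View]:
    """The view events of a history, in the order in which they occur."""
    return [e for e in history if isinstance(e, View)]


def consecutive_pairs(history: list) -> list:
    """All pairs (v1, v2) for which C(h, v1, v2) holds."""
    vs = views_of(history)
    return list(zip(vs, vs[1:]))


def check_view_synchrony(histories: Dict[str, list]) -> List[str]:
    """Return descriptions of violated safety items; an empty list means none found."""
    errors = []
    by_vid = {}
    for p, h in histories.items():
        for v in views_of(h):
            if p not in v.mbrs:                              # Item 1
                errors.append(f"item 1: {p} not in its own view {v.vid}")
            if by_vid.setdefault(v.vid, v.mbrs) != v.mbrs:   # Item 3
                errors.append(f"item 3: membership of view {v.vid} differs")
        for v1, v2 in consecutive_pairs(h):
            if v1.vid >= v2.vid:                             # Item 2
                errors.append(f"item 2: view ids not increasing at {p}")
            for q in v1.mbrs & v2.mbrs:                      # Item 6
                if q in histories and v1 not in views_of(histories[q]):
                    errors.append(f"item 6: {q} never installed view {v1.vid}")
    return errors
```

Checking Item 2 on consecutive pairs suffices, because the ordering of view identifiers is transitive.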

Notice that if a process pi is included in the membership list of some view v, i.e., pi ∈ v.mbrs, it does not automatically mean that pi has also installed this view, i.e., that v ∈ hi. Furthermore, without some strong synchronization assumptions, the 5th item in the definition of Byzantine view synchrony cannot be satisfied [15]. Rather than adding such explicit assumptions to our model, we assume that each node is equipped with a failure detector module, as in Figure 1. The failure detector at process pi may occasionally report some other processes as suspected. These reports may be erroneous, but it is assumed that the failure detector is bounded in some way in the mistakes it can make. It has been previously shown in [18] that in benign failure models, Items 4 and 5 in the definition above are equivalent to what is known as an eventually perfect failure detector, also denoted ◇P in the literature. At this point in the paper, we do not restrict the failure detector type, but rather relax the 4th requirement. Specifically, we only require that for some parameter k, if there exists a subset of nodes of size at least k such that the failure detectors of these nodes never suspect each other from some point on, then eventually these nodes continuously remain in each other’s views. The ratio between k and f (the number of Byzantine nodes) may depend on the exact failure detector used and possibly also on the protocols chosen. In most cases, one is likely to require that k is at least 3f + 1.

We would like to emphasize that the definition of Byzantine View Synchrony supports what is known as the partitionable membership model, in which there can be multiple concurrent views of the same group. In particular, a process can join the group, yet still be partitioned from the rest of the group, or in other words, have its own view of the group, at least for a while. If the members of two such views become connected for sufficiently long, they should merge and create a joint view (as called for by Item 4).

For our next and final definition, we introduce the following notation: We denote by I(hi, v1) the set of events ev such that ev appears in some history hi after a view v1 (v1 →hi ev) and there is no other view v2 for which v1 →hi v2 →hi ev (this corresponds to saying that ev occurred in view v1 in hi). Finally, for a given message m and process pi that sends m, we denote by si(m) the corresponding send event at pi; similarly, for a message m and process pi that receives m, we denote by ri(m) the corresponding receive event at pi.

Definition 2.2 (Byzantine Virtual Synchrony) An execution σ is Byzantine virtually synchronous if it obeys the following restrictions:

1. σ is Byzantine view synchronous.

2. Let m be a message and si(m) and rj(m) be corresponding send and receive events at correct processes pi and pj, respectively. If for some view v1 si(m) ∈ I(hi, v1), then rj(m) ∈ I(hj, v1).

3. Let m be a broadcast message such that si(m) ∈ I(hi, v1) for some correct process pi and view v1, and let v2 be a view such that C(hi, v1, v2). Then for each correct process pj such that both v1 ∈ hj and v2 ∈ hj, we have rj(m) ∈ hj.

4. Let m be a broadcast message such that ri(m) ∈ I(hi, v1) for some correct process pi and view v1, and let v2 be a view such that C(hi, v1, v2). Then for each correct process pj such that v1 ∈ hj and v2 ∈ hj, we have rj(m) ∈ hj.


5. Let m1 and m2 be two messages sent by a process pi that is either correct, or crashes during σ, but does not suffer any other Byzantine failure in σ. Furthermore, assume si(m1) →hi si(m2) and both si(m1) ∈ I(hi, v1) and si(m2) ∈ I(hi, v1). Then if for some correct process pj rj(m2) ∈ I(hj, v1), then rj(m1) ∈ I(hj, v1) as well.

Intuitively, Item 2 implies that a message can only be received in the same view in which it was sent; Item 3 implies reliable delivery of messages sent by correct members that remain in the same view; Item 4 implies agreement on which messages were received in a terminating view; Item 5 implies no message omissions (or no FIFO holes even from crashed processes). Here, again, the definition is similar to the benign case as it appears in [32], with the exception that we only restrict the behavior of correct processes. An interesting aspect of Item 4 is that a Byzantine process can send two distinct versions of the same message to two different correct processes. This situation cannot occur in the benign failure model. Ensuring that correct processes also agree on the content of a message is known as uniform broadcast [41].
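As an illustration of the notation I(h, v) and of Items 2 and 5 of Definition 2.2, the following sketch evaluates both items on small recorded histories. It assumes that a history is a list of View, Send, and Recv events, that message names are unique, and that the receiver's Recv events considered here all originate from the given sender; these classes and helpers are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class View:
    vid: int
    mbrs: FrozenSet[str]


@dataclass(frozen=True)
class Send:
    msg: str


@dataclass(frozen=True)
class Recv:
    msg: str


def events_in_view(history, vid):
    """I(h, v): events after view v and before the next view event in h."""
    inside, out = False, []
    for e in history:
        if isinstance(e, View):
            inside = (e.vid == vid)
        elif inside:
            out.append(e)
    return out


def same_view_delivery(h_sender, h_receiver, msg, vid):
    """Item 2 for one message: if it was sent in view vid, any receive is in vid too."""
    if Send(msg) not in events_in_view(h_sender, vid):
        return True   # the premise of Item 2 does not hold
    if Recv(msg) not in h_receiver:
        return True   # Item 2 does not by itself force delivery
    return Recv(msg) in events_in_view(h_receiver, vid)


def fifo_no_holes(h_sender, h_receiver, vid):
    """Item 5: if a message is received in vid, all earlier ones sent in vid are too."""
    sent = [e.msg for e in events_in_view(h_sender, vid) if isinstance(e, Send)]
    got = {e.msg for e in events_in_view(h_receiver, vid) if isinstance(e, Recv)}
    flags = [m in got for m in sent]
    # Once a message is missing, no later message may be present.
    return all(earlier or not later for earlier, later in zip(flags, flags[1:]))
```

Note that Item 5 as stated only constrains which messages are received within the view, not the order of the receive events, so the check compares sets against the sending order.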

3 Overview of the Solution

3.1 JazzEnsemble and Fuzzy Membership

JazzEnsemble is an experimental variant of Ensemble. JazzEnsemble implements the ideas of fuzzy group membership and also supports various optimizations and protocol layers that enable it to operate in ad-hoc networks, including, e.g., support for routing in ad-hoc networks. Both Ensemble and JazzEnsemble have the same general architecture and the same glue mechanism, and many of the layers of JazzEnsemble are simply taken as is from Ensemble. The main differences are in a few layers that are related to ad-hoc networking and to fuzzy failure detection, and to benefiting from fuzzy membership notifications.

The main architecture of Ensemble is nicely described in [34] while its security architecture is described in [48]. A detailed discussion of the adaptations done in JazzEnsemble to accommodate ad-hoc networks appears in [23]. The main aspects of JazzEnsemble that are relevant as background for this work are those related to fuzzy membership. We thus briefly repeat them here.

The idea of fuzzy membership is that rather than viewing membership as a binary property, the system should maintain a fuzziness level for each view member. This indicates the degree to which the corresponding member seems to be alive and responsive (i.e., low fuzziness level is a good thing while high fuzziness is bad). The fuzziness level of each member is made available to all the group communication system’s layers, and each can utilize it in order to optimize its behavior w.r.t. nodes with high fuzziness level. With this, it is possible to have long timeouts for failure detection (and view changes) without compromising the performance of the system. At the same time, the fuzziness level is hidden from the application, which continues to enjoy the relatively simple strong virtual synchrony model.

To better understand how fuzziness levels help, consider for example the issue of flow control [51]. Flow control restricts the number of messages (or bytes) that a sender can send without hearing an acknowledgement, which is known as a sending window. This prevents overflowing the network and the receivers’ buffers. The problem in multicast flow control is that until a sender receives acknowledgements from all intended receivers, it should not advance its sending window. With fuzzy membership, we modify this behavior to allow the sender to advance its sending window as soon as all nodes with low fuzziness level acknowledge the message. This way, we avoid pausing due to slow nodes, since the fuzziness level of slow nodes is high.
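The modified window rule can be sketched as follows. This is a simplified illustration rather than the JazzEnsemble flow control layer: the integer fuzziness scale, the threshold parameter, and the class and method names are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FuzzyWindow:
    """Sender-side window that ignores members whose fuzziness level is high."""
    size: int                    # maximum number of unacknowledged messages
    threshold: int               # fuzziness above this value counts as high
    fuzziness: Dict[str, int]    # member -> current fuzziness level
    acked: Dict[str, int] = field(default_factory=dict)  # member -> highest seqno acked
    next_seq: int = 0

    def base(self) -> int:
        """Lowest seqno not yet acknowledged by every low-fuzziness member."""
        responsive = [m for m, f in self.fuzziness.items() if f <= self.threshold]
        if not responsive:
            return self.next_seq   # vacuously acknowledged by all responsive members
        return min(self.acked.get(m, -1) for m in responsive) + 1

    def can_send(self) -> bool:
        return self.next_seq - self.base() < self.size

    def send(self) -> int:
        """Reserve the next sequence number, or fail if the window is full."""
        if not self.can_send():
            raise RuntimeError("window full")
        self.next_seq += 1
        return self.next_seq - 1

    def on_ack(self, member: str, seqno: int) -> None:
        self.acked[member] = max(self.acked.get(member, -1), seqno)
```

With a slow member r marked as highly fuzzy, an acknowledgement from the responsive member q alone reopens the window; if the fuzziness of r later drops, its missing acknowledgements constrain the window again.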

Similarly, in order to ensure reliable delivery, nodes must keep messages they receive for possible retransmission. In order to save buffer space, we can utilize fuzziness levels by compressing messages that were already acknowledged by all members with low fuzziness. As reported in [28], using similar principles, it is also possible to expedite view changes in some cases, while offering replicated state machine semantics. Finally, we utilize the fuzziness levels as unreliable failure detectors in our Byzantine consensus protocols, as presented later in this paper.

JazzEnsemble supports the notion of fuzzy membership by adding a special event for notifying about changes in fuzziness levels, by adding flags to existing events, and through modification to the failure detection, flow control, reliable broadcast, and membership management layers. These changes are discussed in more detail in [23].

3.2 Fuzzy Mute and Fuzzy Verbose Failure Detectors

Let us note that the standard heartbeat based failure detection mechanism of non-Byzantine tol-erant group communication systems is not sufficient for overcoming Byzantine failures. This isbecause a node can send heartbeats in a timely manner, yet otherwise behave in an arbitrarymanner. In fact, an inherent problem with Byzantine failures is that by definition they are tightlyrelated to the semantics of a given protocol. Thus, a pure general detector can never be used to de-tect them. On the other hand, modularity principles advocate the use of a failure detection modulerather than having an ad-hoc detection mechanism interleaved in the code of each micro-protocollayer. The solution we have chosen for this is described below.

Specifically, when considering the structure of messages sent and manipulated by layered group communication systems, it is clear that at each layer it is possible to identify a header part and a data part. In particular, the header part includes the information added, manipulated and verified by the layer. For example, the header for a layer that implements reliable FIFO delivery often includes a message type, a sequence number for the message, and possibly the sequence number of the last acknowledged message. Often, a given layer L is completely unaware of headers belonging to lower layers in the stack, whereas the application data (in application driven messages) as well as the headers added by higher layers are part of the data as far as L is concerned. See illustration in Figure 2.

Moreover, often such a layer L can expect to receive messages with known headers from other nodes in the group. For example, consider a reliable FIFO delivery layer L at process p that recently sent a message m to a node q. The layer L at p expects to see a message from q that includes an acknowledgement for m within a given timeout. A failure by layer L at p to see such a message from q is called a mute failure of q with respect to p. Another example of a mute failure is a coordinator of a membership maintenance layer that fails to generate a new view when expected by the other members. Additionally, often it is possible to assume that a correct layer should not generate messages with certain headers too frequently. For example, if the flow control restricts the rate of messages, then q should not send messages faster than this limit. Similarly, there are situations in which a layer L at p knows that a certain message header from q should not be received if q is correct. As an example, consider an acknowledgement for a message that was not sent in a reliable FIFO layer. We refer to such behavior as a verbose failure of q with respect to p.
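
As a rough illustration of the two examples above, the following Python sketch shows how a reliable FIFO layer at p might classify the behavior of a single peer q. The class, its method names, and the timeout value are hypothetical and are not taken from JazzEnsemble.

```python
class FifoLayerMonitor:
    """Tracks what a reliable FIFO layer at p expects from one peer q (illustrative)."""

    def __init__(self, ack_timeout: float):
        self.ack_timeout = ack_timeout  # hypothetical timeout, in seconds
        self.sent = set()               # sequence numbers p has ever sent to q
        self.pending = {}               # seq -> time at which it was sent
        self.mute_reports = 0
        self.verbose_reports = 0

    def on_send(self, seq: int, now: float) -> None:
        self.sent.add(seq)
        self.pending[seq] = now

    def on_ack(self, seq: int) -> None:
        if seq not in self.sent:
            # Acknowledgement for a message that was never sent: verbose behavior.
            self.verbose_reports += 1
        else:
            self.pending.pop(seq, None)  # expected header arrived

    def on_tick(self, now: float) -> None:
        # An expected acknowledgement that does not arrive in time: mute behavior.
        for seq, sent_at in list(self.pending.items()):
            if now - sent_at > self.ack_timeout:
                self.mute_reports += 1
                del self.pending[seq]
```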

[Figure: protocol stacks of SENDER and RECEIVER; labels: Message, Header, Event, Protocol Layer]

Figure 2: Message headers and data in layers (drawing taken from Ensemble’s reference manual)

Interestingly, a large percentage of Byzantine attacks against many layers are either mute failures or verbose failures.⁵ Moreover, with the above observations, layered group communication systems match perfectly the model proposed in [21] for mute failure detectors and extended in [24] to cover verbose failures. This suggests replacing the standard failure detection mechanism with mute and verbose failure detectors. That is, we add a component to the system that allows each layer to register statistics timers and counters. Whenever an inappropriate mute or verbose behavior is noticed by some layer L, the layer can invoke the corresponding method of the mute or verbose failure detector, instantiated with the corresponding counter or timer, to record this misbehavior.

To cope with mute processes, we consider the class of failure detectors, denoted ♦Pmute, that includes all failure detectors that satisfy the following properties:

• Muteness Strong Completeness: Eventually, every mute or permanently disconnected process is permanently suspected by every correct process.

• Eventual Strong Accuracy: There is a time after which no correct process that is not disconnected is suspected.

Notice that failure detectors of the class ♦Pmute can be implemented in partially synchronous systems prone to Byzantine failures [21]. Moreover, one can similarly define a ♦Pverbose failure detector for detecting verbose failures.

⁵ Of course, a node can send a corrupt message, or try to impersonate another node. However, such behavior can be trivially recognized by the cryptographic mechanism.

Yet, similarly to the detection of crash failures in the benign failure model, when running in a somewhat asynchronous system, it is hard, if not impossible, to find good timeouts for deciding that a node is truly faulty. Being too eager would result in eliminating many legitimate members from the view. On the other hand, being too lenient may result in serious performance degradations. Thus, the solution we adopt is in the form of fuzzy mute and fuzzy verbose failure detectors. That is, these failure detection modules maintain a fuzzy mute level and a fuzzy verbose level for each group member. These fuzziness levels are reported to all layers of the micro-protocol stack, and each layer can decide how to handle members with high levels of muteness or verbosity. In particular, there is a suspicion layer that initiates removal of nodes whose fuzzy mute or fuzzy verbose levels are above a given threshold. In order to handle false detection caused by network overloads and short-lived disconnections, we also reduce fuzziness levels using an aging mechanism.
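
A minimal Python sketch of fuzzy levels with a removal threshold and aging follows. The integer scale, the linear decay, and all names are assumptions chosen for illustration; the text does not specify how JazzEnsemble represents or ages the levels.

```python
class FuzzyDetector:
    """Per-member fuzzy mute and fuzzy verbose levels with aging (illustrative)."""

    def __init__(self, members, threshold: int, decay: int = 1):
        self.threshold = threshold  # hypothetical removal threshold
        self.decay = decay          # hypothetical amount removed on each aging step
        self.mute = {m: 0 for m in members}
        self.verbose = {m: 0 for m in members}

    def report_mute(self, member, weight: int = 1) -> None:
        self.mute[member] += weight

    def report_verbose(self, member, weight: int = 1) -> None:
        self.verbose[member] += weight

    def age(self) -> None:
        # Called periodically so that transient overloads are eventually forgiven.
        for levels in (self.mute, self.verbose):
            for m in levels:
                levels[m] = max(0, levels[m] - self.decay)

    def suspects(self) -> list:
        # Members whose mute or verbose level exceeds the threshold.
        return sorted(m for m in self.mute
                      if self.mute[m] > self.threshold
                      or self.verbose[m] > self.threshold)
```

A suspicion layer in this sketch would poll `suspects()` and start the removal procedure for the returned members, while `age()` runs on a timer.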

3.3 Intra-View Reliable Delivery

Intra-view reliable delivery involves issues like flow control to avoid congesting the network or overrunning receivers’ buffers, fragmenting and reassembling of messages that are larger than UDP’s MTU, and ensuring reliable FIFO delivery of both point-to-point and broadcast messages. There is also the issue of filtering messages sent from other views and preventing the admission of corrupted messages.⁶ Filtering bad messages (corrupt or from a different view) is done at the lowest part of the system. It is obtained by indicating the view ID on each message and by signing it. If the message is corrupt, its digest will not match its content, and it will be dropped. Similarly, if the message was sent in a different view by a correct process, it will be eliminated based on its view ID and will not even reach any layer.
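
The filtering step can be sketched in Python as follows. This is only an illustration: it uses an HMAC over a shared key as a stand-in for the signature scheme, which the text does not specify, and the packet layout (8-byte view ID, 32-byte tag, payload) and function names are assumptions.

```python
import hashlib
import hmac

TAG_LEN = 32   # length of an HMAC-SHA256 tag
HDR_LEN = 8    # assumed width of the view ID field


def seal(key: bytes, view_id: int, payload: bytes) -> bytes:
    """Prefix the payload with the view ID and an authentication tag."""
    header = view_id.to_bytes(HDR_LEN, "big")
    tag = hmac.new(key, header + payload, hashlib.sha256).digest()
    return header + tag + payload


def accept(key: bytes, current_view: int, packet: bytes):
    """Return the payload, or None if the packet is corrupt or from another view."""
    if len(packet) < HDR_LEN + TAG_LEN:
        return None
    header = packet[:HDR_LEN]
    tag = packet[HDR_LEN:HDR_LEN + TAG_LEN]
    payload = packet[HDR_LEN + TAG_LEN:]
    expected = hmac.new(key, header + payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return None   # digest does not match the content: drop
    if int.from_bytes(header, "big") != current_view:
        return None   # sent in a different view: drop before any layer sees it
    return payload
```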

As for the layers that handle flow control and reliable delivery of messages, including recovery of lost messages, these layers implement well known protocols. In particular, these layers are almost the same in the Byzantine protocol stack of JazzEnsemble, the benign protocol stack of JazzEnsemble, and Ensemble. The only differences are the ones related to fuzzy mute and fuzzy verbose failures. The differences are fairly technical, and are therefore omitted from this paper. For the rest of this work, we assume that the system provides reliable delivery of messages within views, i.e., it satisfies all intra-view requirements of Byzantine virtual synchrony. Below, we concentrate on the complementing protocols that handle view changes, which together provide the overall required Byzantine virtual synchrony semantics.

3.4 Byzantine Membership Maintenance

As in Ensemble (and in fact, this dates back to Horus [54]), a new node that tries to join the system first establishes a singleton view with only itself in it. From that point on, the membership protocols are responsible for merging concurrent views or eliminating faulty nodes by establishing new views that exclude them. In particular, the goal of the membership maintenance is to provide the Byzantine Virtual Synchrony model. This includes the following aspects:

3.4.1 Eliminating Suspected Nodes

Each node in JazzEnsemble employs a local failure detection mechanism in order to suspect nodes that seem to be faulty, and reports such suspicions to other nodes. When some nodes are suspected, JazzEnsemble tries to establish a new view without the suspected nodes. However, in order to prevent Byzantine nodes from removing correct nodes from the system, only nodes that are suspected by enough other nodes can be removed. Moreover, the correct view members must agree on which nodes are to be removed from the group by utilizing a Byzantine consensus protocol. This is in order to ensure that the next view to be established will only eliminate nodes that were agreed upon by the correct members of the current view that do not suspect each other. We obtain this with the mechanism described below.

⁶ We use the term broadcast to mean sending the same message to all members of the view in which the message was sent.

Whenever a node pi locally suspects another node pj, e.g., because the fuzzy muteness or fuzzy verbosity levels of pj surpass a certain threshold, or because pj was caught trying to send a forged message, pi marks pj as suspected. Whenever pi has some nodes marked as suspected, it slanders these nodes to all other view members. In turn, if a node pk receives at least f + 1 slanders about a node pj from distinct nodes, then pk also marks pj as suspected. Notice that if f + 1 nodes slander pj, then at least one correct node locally suspects pj, and so it is safe to adopt this suspicion.
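
The adoption rule can be sketched in Python as below. The sketch reads the rule as "f + 1 distinct accusers", following the justification in the last sentence of the paragraph; the class and method names are hypothetical.

```python
class SuspicionTracker:
    """Combines local suspicions with slanders received from other members (illustrative)."""

    def __init__(self, f: int):
        self.f = f            # assumed bound on the number of Byzantine nodes
        self.local = set()    # nodes this process suspects by itself
        self.slanders = {}    # suspect -> set of distinct accusers

    def suspect_locally(self, node) -> None:
        self.local.add(node)

    def on_slander(self, accuser, suspect) -> None:
        # A set is used so that repeated slanders from one accuser count once.
        self.slanders.setdefault(suspect, set()).add(accuser)

    def suspected(self) -> set:
        # f + 1 distinct accusers include at least one correct node.
        adopted = {s for s, accusers in self.slanders.items()
                   if len(accusers) >= self.f + 1}
        return self.local | adopted
```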

Additionally, the first time in a given view that a node pi marks another node pj as suspected, pi starts a timer and a counter for the number of nodes it suspects. Once the timer expires, or the number of nodes pi suspects goes beyond a predefined threshold, or the coordinator is suspected, pi starts a Byzantine consensus protocol in order to decide on the failed nodes. Once the Byzantine consensus protocol terminates, the lowest ranked node pl in the membership list of the current view that is not suspected is supposed to generate a new view. If pl fails to do so, then this is considered a mute failure on its behalf, which would result in re-execution of the view change protocol, and in particular of the Byzantine consensus protocol.

Note that the layered structure of JazzEnsemble allows us to utilize any known Byzantine consensus protocol. In particular, the layer implementing consensus already enjoys intra-view reliable delivery, and thus we can use any protocol that assumes this capability. Specifically, in this work we use an adaptation of the mute failure detector based protocol reported in [31] since this protocol is very simple, and since it terminates in one communication round in favorable circumstances, i.e., when there is no Byzantine behavior other than process crashes and network disconnections.

Notice, however, that we would like to decide on which nodes are faulty and which are not. In other words, we must decide on a binary vector of suspicions. One option is to use a Byzantine consensus protocol that works with any value domain in which the binary vector can be viewed as a binary encoding of some value. However, we claim that this is not adequate. The reason is that if all nodes think that some node pj is suspected, yet there is a disagreement about another node pk, then the result would be a disagreement about the suspicion vector, which means that any suspicion vector becomes a valid decision value for the consensus protocol (by the definition of Byzantine consensus). In particular, this could result in never eliminating pj from the view, despite the fact that all nodes suspect it!

Thus, instead, we do the equivalent of running the mute failure detector based protocol of [31] n times in parallel, once for each view member, as listed in Algorithm 1. Yet, rather than actually invoking the protocol n times, we invoke it once in a way that operates in parallel on each entry of the vector, providing an independent element-wise Byzantine consensus semantics for each of the vector’s bits.
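
The difference between the two formulations can be seen in a small Python sketch that checks which entries validity actually constrains. The function names are hypothetical, and the inputs stand for the suspicion vectors of the core processes only.

```python
def whole_vector_bound(input_vectors) -> bool:
    """Validity of single-value consensus binds only if all vectors are identical."""
    return all(vec == input_vectors[0] for vec in input_vectors)


def unanimous_entries(input_vectors) -> dict:
    """Entries k on which all inputs agree; element-wise validity binds each of them."""
    first = input_vectors[0]
    return {k: first[k]
            for k in range(len(first))
            if all(vec[k] == first[k] for vec in input_vectors)}
```

For the inputs `[[0, 1, 0], [0, 1, 1], [0, 1, 0]]`, the vectors differ as whole values, so a single-value protocol is free to decide any vector, whereas the element-wise formulation still fixes entries 0 and 1.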

Vector Byzantine Consensus Protocol: In the vector Byzantine consensus problem, we assume that each node starts with a vector of input bits of size n, known as the input vector. The goal is to have each correct process decide on an output vector of size n, also called a decision vector. Yet, notice that as this protocol is being run within a view, some otherwise correct nodes might become disconnected due to the network. These nodes cannot be required to terminate their computation. Thus, we introduce the notion of a core component.⁷ That is, we assume that among the set of n nodes participating in the computation, there is a subset of at least n − f correct nodes that are also connected, which we call the core component. With this definition, we say that a protocol solves the Vector Byzantine Consensus problem if it ensures the following requirements (these are simple extensions of standard Byzantine consensus requirements; we repeat them here for completeness):

Vector Byzantine Validity: Let Vi be the decision vector of some core process pi that decides. Then for each k, if the value of the entry Uj[k] in the input vector Uj of every core process pj is v, then Vi[k] = v.

Vector Byzantine Agreement: Let Vi be the decision vector of some core process pi that decides and Vj the decision vector of another core process pj that decides. Then for every k, Vi[k] = Vj[k].

Byzantine Termination: Eventually, every core process decides on some decision vector.

⁷ A similar approach was used in [20] for handling benign failures in partitionable networks.

Variables:

esti[ ] – a vector of current estimates of pi about the decision values
dominatingi[ ] – a vector of majority estimates of pi about the decision values
need_coord[ ] – a vector that contains the indexes of values that need to adopt the coordinator’s value
Vi[ ][ ] – a matrix that contains current estimates of other processes

Figure 3: Main variables held by each process pi

As indicated above, the Byzantine consensus protocol we employ is a simple extension of the protocol of [31] to vectors and is based on having a ♦Pmute failure detector. The protocol ensures safety even when the failure detector is not obeying the properties of ♦Pmute. The only requirement that depends on the properties of ♦Pmute to hold is termination. Practically, in the actual implementation inside JazzEnsemble we use the fuzziness levels of nodes as an approximation for ♦Pmute.

The pseudo-code is listed in Figure 3 and Algorithm 1. Intuitively, the protocol includes two phases. In the first phase, each process collects the current estimates regarding the value that should be decided on. If some value is overwhelmingly dominating, we can safely decide on it in the second phase. Otherwise, if there is a single value that was reported by a significant number of the nodes, but not enough for a safe decision, we adopt this value as the estimate for the next round, but do not decide on it yet. If even this does not happen, and we were able to obtain the value of the coordinator without suspecting it, then we adopt the value suggested to us by the coordinator. The idea is that if no value gets enough support, then it means that we are not bound by validity to decide on any specific value. In this case, if we are lucky and the round is controlled by a correct coordinator, everyone will adopt the coordinator’s value and will be able to decide in the next round. The fact that we replace the coordinator on each round ensures that eventually there will be such a coordinator. On the other hand, if the current coordinator is mute, the failure detector ensures that we will not wait for it forever.

Proof of the Vector Byzantine Consensus Protocol: In the following lemmas, we prove the correctness of the algorithm for an arbitrary entry k in the vector. Since the proof holds for each entry in the vector, it also holds for the entire vector. The proofs are adaptations of the corresponding ones given in [31] to incorporate the notions of vectors and core subsets; they are given here for completeness.

Algorithm 1 ♦Pmute-Based Vector Byzantine Consensus Protocol Executed by pi (n > 6f)

 1: procedure Byzantine_consensus(dominatingi[ ])
 2:   init: r ← 1; esti ← dominatingi; c ← hash(n, view_id);
 3:   loop
        ---------- Step 1 of round r ----------
 4:     Vi ← [⊥, . . . , ⊥] (n ∗ n times); c ← ((c + 1) mod n); r ← (r + 1);
 5:     broadcast val(r, esti);
 6:     wait until (val(r, −) or dec(−) messages have been received from all non-suspected processes and from at least (n − f) distinct processes)
 7:     /* We build here the matrix of estimates */
 8:     for all j: do
 9:       if (val(r, estj) or dec(estj) received from pj) then Vi[j] ← estj
10:       end if
11:     end for
        /* We are looking for the columns in which some value appears more than n/2 times */
12:     for all k: do
13:       if (∃v ≠ ⊥ : #v(Vi[ ][k]) > n/2) then
14:         dominatingi[k] ← v;
15:       else (dominatingi[k] ← esti[k]);
16:       end if
17:     end for
        ---------- Step 2 of round r ----------
18:     if (i = c) then broadcast coord(r, dominatingi)
19:     end if
20:     for all k: do
21:       if (#dominatingi[k](Vi[ ][k]) ≥ (n − 2f − #⊥(Vi[ ][k]))) then
22:         esti[k] ← dominatingi[k];
23:       else (need_coord[k] ← 1);
24:       end if
25:     end for
26:     if (∃ k s.t. need_coord[k] = 1) then
27:       wait until (coord(r, −) or dec(−) received from pc or pc is suspected)
28:       if (coord(r, x) or dec(x) received from pc) then
29:         coord_vali ← x;
30:       else (coord_vali ← dominatingi);
31:       end if
32:       for all k: do
33:         if (need_coord[k] = 1) then esti[k] ← coord_vali[k];
34:         end if
35:       end for
36:       goto 4
37:     else
38:       for all k: do
39:         if (#dominatingi[k](Vi[ ][k]) < (n − f)) then
40:           goto 4;
41:         end if
            /* if we haven’t jumped for any of the fields in the array, we can decide. */
42:         broadcast dec(esti); return (esti)
43:       end for
44:     end if
45:   end loop
46: end procedure
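
To make the local tests of a round concrete, the following Python sketch reproduces the computations of lines 12–25 and 38–41 on a matrix that has already been collected. It is a simplified illustration rather than the JazzEnsemble implementation: message exchange, the failure detector, and the coordinator step (lines 26–36) are omitted, `None` stands for ⊥, the function name is hypothetical, and a decision is reported only if every entry passes the test of line 39, as the comment before line 42 intends.

```python
from collections import Counter

BOTTOM = None  # stands for the "no value received" symbol


def local_round(V, est, n, f):
    """V[j] is the estimate vector received from p_j, or BOTTOM if none arrived.

    Returns (dominating, new_est, need_coord, can_decide).
    """
    size = len(est)
    dominating = list(est)
    new_est = list(est)
    need_coord = [False] * size
    can_decide = True
    for k in range(size):
        column = [BOTTOM if row is BOTTOM else row[k] for row in V]
        counts = Counter(v for v in column if v is not BOTTOM)
        missing = sum(1 for v in column if v is BOTTOM)
        # Lines 12-17: adopt a value that appears more than n/2 times, if any.
        for v, c in counts.items():
            if c > n / 2:
                dominating[k] = v
        support = counts.get(dominating[k], 0)
        # Lines 20-25: keep the dominating value or defer to the coordinator.
        if support >= n - 2 * f - missing:
            new_est[k] = dominating[k]
        else:
            need_coord[k] = True
        # Lines 38-41: deciding requires n - f supporters for every entry.
        if support < n - f:
            can_decide = False
    return dominating, new_est, need_coord, can_decide and not any(need_coord)
```

With n = 7 and f = 1, six identical rows plus one missing row already allow a decision, whereas a 4-to-3 split on some entry marks that entry as needing the coordinator's value.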

Lemma 3.1 Let us assume n > 4f, and consider the situation where, at the beginning of a round r, all core processes pi have the same estimate value v[k] (i.e., esti[k] = v[k]). They will never change their estimates thereafter.

Proof Note that in every round, each core process collects at least (n − f) estimates. Since at the beginning of round r, all core processes have v[k] as their initial estimate and as there are at most f Byzantine processes, then every core process pi will collect at least n − 2f estimates equal to its own estimate v[k]. As n > 4f, it follows that v[k] is a majority value in Vi[ ][k] and therefore dominatingi[k] is set to v[k] (line 14). Hence, esti[k] is set to dominatingi[k] = v[k] (line 22).

□ Lemma 3.1

Lemma 3.2 [Validity] If all the core processes propose the same value v[k], then no value v′[k] ≠ v[k] can be decided.

Proof This lemma is an immediate consequence of Lemma 3.1 when we consider r = 1. As all estimates of core processes remain equal to v[k], it follows from line 42 that no value v′[k] ≠ v[k] can be returned by a core process. □ Lemma 3.2

Lemma 3.3 [Agreement] Let n > 6f. No two core processes decide different values.

Proof Let r be the first round during which a core process pi decides, and let v[k] be the value of entry k that it decides. Due to lines 14 and 42, it follows that dominatingi[k] = v[k] and #v(Vi[ ][k]) ≥ n − f. Due to the fact that at most f processes are not in the core component, it follows that, in the worst case, any other core process pj sees the same values as pi except for #⊥(Vj[ ][k]) entries that are equal to ⊥ in Vj[ ][k] (those being equal to v in Vi[ ][k]), and at most f other entries (those possibly corresponding to Byzantine processes that sent v[k] to pi and v′[k] ≠ v[k] to pj). It follows that #v(Vj[ ][k]) ≥ n − f − (f + #⊥(Vj[ ][k])), i.e., #v(Vj[ ][k]) ≥ n − 2f − #⊥(Vj[ ][k]) for any core process pj. As #⊥(Vj[ ][k]) ≤ f, we get #v(Vj[ ][k]) ≥ n − 3f and, as n > 6f, it follows that v[k] is a majority value in Vj[ ][k]. Hence, dominatingj[k] = v[k] (line 14). Moreover, as #v(Vj[ ][k]) ≥ n − 2f − #⊥(Vj[ ][k]), the test at line 21 is satisfied for any core process pj and, accordingly, any core pj sets estj[k] to v[k] at line 22. If pj decides at line 42, it decides v[k]. If pj proceeds to the next round, due to Lemma 3.1, no value v′[k] ≠ v[k] can be decided.

□ Lemma 3.3

Lemma 3.4 No core process can block forever in a round.

Proof The lemma follows immediately from the following observations. At each round r: (a) as there are at most f non-core processes, no core process can block forever at line 6, and (b) as the failure detector satisfies the Muteness Strong Completeness property, no core process can block forever at line 27. □ Lemma 3.4

Lemma 3.5 [Termination] Let n > 6f . Each core process eventually decides.

Proof Let t be the time after which the failure detector is accurate, i.e., no core process is suspected (due to the Eventual Strong Accuracy of the failure detector, such a time t does exist). Let r be the first round that starts after t and is coordinated by a core process pc. Let us observe that, due to Lemma 3.4 and the use of dec() messages (if any), any core process pi that has not yet decided starts round r. During r, let dominatingc[k] = v[k].

Claim. At the end of r (where dominatingc[k] = v[k]), all core processes pi have esti[k] = v[k]. End of the claim.

Due to the claim, it follows that all the core processes (that have not yet decided) start the round r + 1 with the same estimate value v[k]. Moreover, due to (1) the fact that there are at least (n − f) core processes, (2) the fact that the failure detector is accurate (i.e., no core process is suspected), (3) the dec() messages sent by the processes that have already decided (if any), and (4) the waiting statement of line 6 (messages are received from all core processes), it follows that all the core processes pi are such that #v(Vi[ ][k]) ≥ n − f, and v[k] is the only such value (because n − f > f). So, for any core pi, we have dominatingi[k] = v[k] at line 14. Consequently, the test of line 21 is satisfied (for every entry in the vector esti) and the test of line 39 is not satisfied for any column in the matrix Vi, and each core process pi decides accordingly by the end of r + 1.

Proof of the claim. Let us first observe that if each core process executes line 33 for entry k, it adopts v[k] as its new estimate and the claim trivially follows.

Let us consider the case where a process pi executes line 22, namely, esti[k] ← dominatingi[k]. Let dominatingi[k] = w. We have to show that v[k] = w. As pi executes line 22, the test of line 21 is satisfied and we have #w(Vi[ ][k]) ≥ n − 2f − #⊥(Vi[ ][k]). Moreover, as (1) pi is in the core, (2) there are at most f non-core processes, (3) we are after the time t (and consequently, each core process receives a message from each core process), we can conclude that the entries m such that Vi[m][k] = ⊥ correspond to faulty processes. Consequently, for any core process pj, we have #w(Vj[ ][k]) ≥ n − 2f − #⊥(Vi[ ][k]) − (f − #⊥(Vi[ ][k])), i.e., #w(Vj[ ][k]) ≥ n − 3f. So, when we consider the coordinator pc, we get #w(Vc[ ][k]) ≥ n − 3f. As n > 6f, we have #w(Vc[ ][k]) ≥ n − 3f > n/2, and so w is a majority value in the vector Vc[ ][k]. It then follows from line 14 that dominatingc[k] = w. Hence w = v[k]. It follows that all core processes pi have esti[k] = v[k] at the end of r. End of proof of the claim. □ Lemma 3.5

Theorem 3.6 Let n > 6f. The protocol described in Algorithm 1 solves the vector Byzantine consensus problem.

Proof The proof follows from Lemmas 3.2, 3.3 and 3.5. □ Theorem 3.6

Handling Verbose Nodes: As mentioned before, a simple attack that Byzantine nodes can mount at all layers of JazzEnsemble is sending spurious messages in order to slow down the entire group. In particular, in the case of membership, this means initiating too many view changes that in fact do not result in eliminating or incorporating any node. Such behavior is captured by the verbose failure detector, which will eventually trigger a suspicion that such a node is Byzantine.

3.4.2 Merging Views

In order for concurrent views to locate each other, we employ an IP multicast based discovery mechanism. That is, the coordinator of each view is supposed to periodically multicast a message announcing its existence and the view it represents. This message is called a gossip message. All nodes in the system are supposed to listen for gossip messages (this is in contrast to Ensemble and the non-Byzantine version of JazzEnsemble, in which only coordinators listen for these messages). If correct nodes of a view do not see gossip messages sent by their own coordinator, then they consider it a mute failure on behalf of the coordinator.

When a coordinator of a view receives such a gossip message, it checks whether it should try to merge with the reported view. In particular, it checks that the view identifier of the gossiped view is not older than its own view identifier, that the membership lists of the two views do not intersect, and that both agree on the same protocol stack. If these conditions hold, then the coordinator is supposed to try merging with the gossiped view using a merge request message, which is again sent using IP multicast.

Notice that the checks performed by the coordinator are deterministic, and can be done by any group member based on its local knowledge. In order to save bandwidth, only the coordinator sends a merge request. However, in order to protect against Byzantine coordinators, all other nodes execute the same checks, and if the coordinator was supposed to send a merge request, then they notify their fuzzy mute detector to expect it. Thus, if the coordinator does not send the merge request, it will eventually be suspected as mute. Moreover, the view members verify the contents of the merge request, and if it is bogus, they will also suspect the coordinator as being Byzantine.
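
Because the checks are deterministic, every member can evaluate them on the same gossip message. The Python sketch below illustrates this under stated assumptions: view identifiers are modeled as integers where a larger value means a newer view, the protocol stack is summarized by a string, a merge is considered warranted when all three checks pass, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ViewInfo:
    view_id: int        # assumed: larger means newer
    members: frozenset  # membership list of the view
    stack: str          # assumed: a label identifying the protocol stack


def should_merge(own: ViewInfo, gossiped: ViewInfo) -> bool:
    """Deterministic checks that any member can run on a gossip message."""
    return (gossiped.view_id >= own.view_id
            and own.members.isdisjoint(gossiped.members)
            and own.stack == gossiped.stack)
```

A non-coordinator member that evaluates `should_merge` to true would, in this sketch, tell its fuzzy mute detector to expect a merge request from the coordinator.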

Similarly, when a node pi receives a merge request message from a coordinator, pi performs similar sanity checks on it. If the message is good and pi is the coordinator, it starts a merging procedure that eventually leads to a new view. If the checks pass but pi is not a coordinator, then pi notifies its fuzzy mute detector to expect the coordinator to initiate a corresponding merged view.

3.4.3 Forming a New View

In order to reduce the performance impact of a Byzantine coordinator, we replace the coordinator on each view change. The new coordinator is chosen as the ith non-faulty node, where i is the old view identifier modulo the number of members. Clearly, one can use other methods. However, it is preferable that the chosen method be locally computable, so that each node can locally verify who should act as coordinator.
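
A locally computable selection rule of this kind might look as follows in Python. The function name is hypothetical, view identifiers are assumed to be integers, and the wrap-around used when i exceeds the number of non-faulty nodes is an assumption, since the text does not say how that case is handled.

```python
def next_coordinator(members, suspected, old_view_id: int):
    """Pick the i-th non-faulty member, with i = old_view_id mod len(members)."""
    candidates = [m for m in members if m not in suspected]
    if not candidates:
        raise ValueError("no non-faulty member available")
    i = old_view_id % len(members)
    # Assumption: wrap around if fewer than i + 1 non-faulty members remain.
    return candidates[i % len(candidates)]
```

Every member that holds the same membership list, suspicion set, and view identifier computes the same result, so each can verify which node is expected to send the new view message.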

When the coordinator of the new view sends a new view message, we must ensure that all correct view members receive the same view message. This is easily obtained by employing a uniform Byzantine delivery protocol. Here again, in principle, we could use any existing protocol, such as the one by Bracha [11]. Practically, we have chosen to develop an optimized protocol that obtains uniform broadcast with only two communication steps (instead of three in [11]), at the price of f < n/5. The protocol is described below.

An Efficient Byzantine Uniform Broadcast Protocol: In the formal problem of uniform broadcast, a process is trying to send a message v to all other processes such that all of them will deliver the same message. As in the case of Byzantine consensus, in this work we assume that the view includes n processes, out of which there is a core component of at least n − f processes that are correct and connected. A protocol implements Uniform Byzantine Broadcast if it obeys the following requirements:

Broadcast Uniform Delivery: If a correct process p delivers a message v, then all other core processes also deliver the value v. In particular, if two core processes deliver values v and u respectively, then v = u.

Broadcast Termination: If a core process sends a message v, then every core process delivers v.

The optimized protocol for implementing uniform broadcast appears in Figure 4. Intuitively, all messages that are sent in the k’th broadcast by p are tagged with (p, k), thereby eliminating possible interference between broadcasts. There are two types of messages in the protocol: initial and echo. The algorithm starts when the originator of the message p sends an (initial, v, k) message, where v is the content of the actual message p wishes to disseminate. Following this, the processes report to each other the value they received via (echo, v, k) messages. If at least (n/2 + f + 1) (echo, v, k) messages (or the (initial, v, k) message) are received by a process, it sends an (echo, v, k) message to the other processes (if it has not done so yet), and if the process receives (n/2 + 2f + 1) (echo, v, k) messages, it delivers v. As is shown in the proof, this is enough to ensure uniform broadcast.

Function Uniform_broadcast(vi, k)

step 0 (only by the originator):
    Send (initial, vi, k) to all the processes;

step 1:
    Wait until Receive one (initial, v, k) message or (n/2 + f + 1) (echo, v, k) messages for some v;
    Send (echo, v, k) to all the processes;

step 2:
    Wait until Receive (n/2 + 2f + 1) (echo, v, k) messages for some v;
    // The node accumulates echo messages it received from Step 1:
    // if the node gets at least (n/2 + 2f + 1) (echo, v, k) messages in both steps, it can decide
    Deliver(v);

Figure 4: Uniform broadcast protocol executed by pi (n > 5f)
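
The steps of Figure 4 can be exercised with a small Python simulation. It is a sketch under simplifying assumptions, not the JazzEnsemble layer: a single broadcast instance (so the tag k is dropped), reliable in-order delivery, a correct originator (node 0), faulty nodes that simply stay silent, and each echo also delivered to its own sender. All class and function names are hypothetical.

```python
from collections import defaultdict


class BroadcastNode:
    """One process running steps 1 and 2 of the protocol (illustrative)."""

    def __init__(self, n: int, f: int):
        self.echo_threshold = n / 2 + f + 1
        self.deliver_threshold = n / 2 + 2 * f + 1
        self.echoes = defaultdict(set)  # value -> senders whose echo was received
        self.echoed = False
        self.delivered = None

    def on_initial(self, value):
        return self._maybe_echo(value)

    def on_echo(self, sender, value):
        self.echoes[value].add(sender)
        out = None
        if len(self.echoes[value]) >= self.echo_threshold:
            out = self._maybe_echo(value)
        if self.delivered is None and len(self.echoes[value]) >= self.deliver_threshold:
            self.delivered = value
        return out

    def _maybe_echo(self, value):
        # A correct process echoes at most one value per broadcast.
        if self.echoed:
            return None
        self.echoed = True
        return value


def run_broadcast(n: int, f: int, value, silent=()):
    """Node 0 originates `value`; nodes listed in `silent` never send anything."""
    nodes = {i: BroadcastNode(n, f) for i in range(n) if i not in silent}
    queue = []  # echo messages in flight, as (sender, value) pairs
    for i, node in nodes.items():
        echoed = node.on_initial(value)
        if echoed is not None:
            queue.append((i, echoed))
    while queue:
        sender, v = queue.pop(0)
        for i, node in nodes.items():
            echoed = node.on_echo(sender, v)
            if echoed is not None:
                queue.append((i, echoed))
    return {i: node.delivered for i, node in nodes.items()}
```

For example, `run_broadcast(14, 2, "view-42", silent={12, 13})` lets the twelve remaining nodes deliver the value; the illustrative values n = 14 and f = 2 are chosen so that the n − f echoes sent by the remaining nodes reach the delivery threshold n/2 + 2f + 1.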

Correctness proof: As in the case of the proof of Byzantine consensus, we assume that the terminating view includes a core component of n − f nodes, where n is the number of nodes in the view. We show that if f < n/5, then the protocol in Figure 4 indeed implements Uniform Byzantine Broadcast.

Lemma 3.7 For any given k, if two core processes p and q deliver values v and u respectively, then u = v.

Proof: Assume by way of contradiction that the lemma does not hold. In order for p to deliver v it must have received (n/2 + 2f + 1) (echo, v, k) messages, and therefore at least n/2 + f + 1 (echo, v, k) messages from core processes. Similarly, q must have received at least n/2 + f + 1 (echo, u, k) messages from core processes. Therefore, some core process r must have sent both (echo, v, k) and (echo, u, k) messages. But core processes, which by definition are also correct, can send only one version of each message during a broadcast. A contradiction. Therefore, u = v.
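
The counting step behind "some core process r must have sent both" can be spelled out as follows. The set names A and B are introduced here for illustration only: A is the set of core processes whose (echo, v, k) message reached p, and B is the set of core processes whose (echo, u, k) message reached q. Since both are subsets of at most n processes:

```latex
\[
|A \cap B| \;\ge\; |A| + |B| - n
\;\ge\; 2\left(\tfrac{n}{2} + f + 1\right) - n
\;=\; 2f + 2 \;>\; 0 .
\]
```

Hence A and B share at least one core process, which is the process r used in the proof.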

Lemma 3.8 For any given k, if a core process p delivers the value v, then every other core process will eventually deliver v.

Proof: If p delivers v, then p received (n/2 + 2f + 1) (echo, v, k) messages. At least n/2 + f + 1 of these messages were sent by core processes. Therefore, every other core process receives at least n/2 + f + 1 (echo, v, k) messages and sends its own (echo, v, k) message. Thus, at least (n − f) processes will send an (echo, v, k) message. Every core process will eventually receive at least (n − f) ≥ (n/2 + 2f + 1) (echo, v, k) messages and will deliver v.

Lemma 3.9 For any k, if a core process p sends v, then all the core processes will deliver v.

Proof: Suppose a core process p sends v; every other core process will receive an (initial, v, k) message and will send an (echo, v, k) message. Therefore, every core process q will receive (n − f) ≥ (n/2 + 2f + 1) (echo, v, k) messages from core processes, and at most f < (n/2 + 2f + 1) different messages from non-core processes. Therefore, q will deliver v.

3.4.4 Message Agreement Inside a View

Recall that at this point in the paper, we already rely on the fact that we have a mechanism that provides detection for lost messages and for recovery of such messages (if needed) by retransmission. Thus, the only two things we still need to worry about are as follows:

1. Whenever two correct nodes deliver two versions of the same message to their respective application module, then these two versions are the same.

2. If a correct node pi delivers a message m that was sent by another node that was eliminated from a view V1, then any other correct node pj that continues with pi to its consecutive view V2 will also deliver m during V1.

In order to overcome the first problem, i.e., ensuring that every pair of correct nodes agree on the content of a message they deliver to their respective application modules, we use a Byzantine uniform broadcast layer, as described above. Yet, if the message is large, it is possible to optimize and broadcast uniformly just the digest of the message. This is because here we only need to ensure that one version of the same message is delivered to all correct nodes. Once a correct message digest is received, the rest is taken care of in any case by the reliable retransmission mechanism.
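
The digest optimization can be sketched as follows in Python. The choice of SHA-256 and the function names are assumptions made for illustration; the text does not say which digest function is used. Only the short digest goes through the uniform broadcast layer, and a receiver accepts a payload obtained by ordinary (re)transmission only if it matches the agreed digest.

```python
import hashlib
import hmac


def digest(payload: bytes) -> bytes:
    """Digest that would be sent through uniform broadcast (SHA-256 is an assumption)."""
    return hashlib.sha256(payload).digest()


def accept_payload(agreed_digest: bytes, payload: bytes) -> bool:
    """Accept a payload only if it matches the digest that all correct nodes agreed on."""
    return hmac.compare_digest(hashlib.sha256(payload).digest(), agreed_digest)
```

A payload that fails this check is simply treated as not yet received, so the reliable retransmission mechanism keeps requesting it until a matching copy arrives.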

The second problem is solved using what is known in the literature as a flush protocol. Specifically, we say that a broadcast message is stable if it was acknowledged by every member that is not considered faulty. Thus, the coordinator does not send the new view message until all the messages of the terminating view are stable. Moreover, as part of the uniform broadcast mechanism of new views, a process does not echo the view message until it knows that all messages it is aware of from the terminating view are stable.
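
The stability condition used by the flush protocol can be written down directly. The Python sketch below uses hypothetical data structures (sets of member identifiers and a map from message identifiers to the members that acknowledged them); it only illustrates the definition of stability and the coordinator's precondition for sending the new view.

```python
def is_stable(acks, members, suspected):
    """A message is stable if every member not considered faulty acknowledged it."""
    required = set(members) - set(suspected)
    return required <= set(acks)


def can_send_new_view(pending, members, suspected):
    """The coordinator may send the new view only when all known messages are stable.

    `pending` maps a message id to the set of members that acknowledged it.
    """
    return all(is_stable(acks, members, suspected) for acks in pending.values())
```

The same predicate could serve a non-coordinator process deciding whether to echo the view message, applied to the messages that process is aware of.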

3.4.5 Small Views

If the membership size n of a view is small, we can use a Byzantine consensus protocol and a uniform broadcast protocol that work with f < n/3. If n drops well below that, then there is not much that can be done due to the theoretical lower bounds. However, by distinguishing between the number of Byzantine nodes and the number of disconnected and crashed nodes, we may be able to employ somewhat more resilient protocols that still work efficiently, along the lines of [38].

Finally, a member of a small view that is unhappy with its view members, but does not have enough supporters to establish the view it believes in, can always establish a singleton view and try to gradually merge with nodes it trusts. Handling this is left for future work.

3.5 Total Ordering

While total ordering is not strictly required by virtual synchrony, it is a common option in most group communication systems. Adding total ordering to virtual synchrony enables obtaining atomic delivery, which is a basic mechanism for implementing replicated state machine semantics [49].

We have implemented total ordering as follows: Nodes accumulate all the messages they receive. Each node picks a subset of these messages, chosen by some deterministic and fair rule, and proposes them in the Consensus protocol. Once a batch of such messages is decided on, these messages are delivered, and the process moves on to pick the next subset to be proposed in the Consensus protocol, and so forth.
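
The batching loop described above can be sketched in Python as follows. The concrete selection rule (the smallest message identifiers, ordered by sequence number and then by sender, so that senders are served in turn) is an assumption, since the text only requires the rule to be deterministic and fair. The `consensus` argument is a hypothetical stand-in for the Byzantine consensus layer, and the sketch assumes that every decided batch consists of messages this node already holds.

```python
def pick_batch(pending, batch_size):
    """Deterministic selection rule (an assumption): the smallest message ids.

    Message ids are (sequence_number, sender) pairs, so sorting interleaves senders.
    """
    return tuple(sorted(pending)[:batch_size])


def total_order(pending, batch_size, consensus):
    """Deliver messages batch by batch, in the order decided by consensus."""
    remaining = set(pending)
    delivered = []
    while remaining:
        decided = list(consensus(pick_batch(remaining, batch_size)))
        before = len(remaining)
        delivered.extend(decided)
        remaining.difference_update(decided)
        if len(remaining) == before:
            break  # no progress in this round; stop instead of looping forever
    return delivered
```

When all nodes hold the same pending set, they all compute the same proposal, which corresponds to the good scenario in which consensus can decide quickly.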

As for the Byzantine consensus protocol, we have utilized the mute failure detector based protocol of [31], which has the nice property that it terminates in a single communication step in good scenarios (no failures and all processes initially propose the same message). Interestingly, as we have discovered during our experiments, if the size of the subset of messages to decide on is sufficiently large, and when there is a continuous load, or bursty traffic, the amortized cost of deciding on each message becomes one communication step. Specifically, in the first invocation of Consensus in each burst, there might be disagreement regarding the proposals and therefore multiple communication steps are required to decide. However, during this time, all nodes continue to accumulate messages. Thus, given that the subsets of messages to be proposed to Consensus are chosen using a deterministic rule, the subsequent invocations of Consensus terminate in one communication round!

Notice also that when the application messages are small, then the values proposed to the Byzantine consensus protocol are the messages themselves. Thus, this implements atomic broadcast without needing to run a separate uniform broadcast protocol. On the other hand, when messages are large, it makes more sense to run the Consensus protocol only on messages’ unique identifiers. However, in this case, we do need a separate uniform broadcast mechanism, similar to the one we described in Section 3.4.3, to ensure that indeed all correct nodes receive the same copy (content-wise) of a given message.

4 Performance Evaluation

Our measurements were carried out on an IBM Blade Center cluster, comprising 25 dual-processor 2.2GHz PowerPC blades (JS20), each with 4GB of RAM, interconnected via gigabit ethernet switches and running SuSE Linux Enterprise Server 9. Every blade has only one NIC, and thus all applications running on the same blade share the same NIC, even if they run on different CPUs. The blades were otherwise unloaded. We have run our tests with groups ranging from 8 to 50 processes. In all tests we had only one process per CPU. Additionally, in tests of up to 24 nodes, each process was run on a different blade, while with larger groups we had two processes on each blade (so in large groups every two processes shared a NIC, but were run on different CPUs).

[Plot. x-axis: group size (0 to 50); y-axis: 16-byte messages / second (0 to 4.5 x 10^4). Series: JazzEns, ByzEns+NoCrypto, ByzEns+SymCrypto, ByzEns+NoCrypto+Total, ByzEns+PubCrypto (512 bits).]

Figure 5: Throughput measurements (the line for public key cryptography is hardly visible, as it is so close to 0 compared with the other lines)

[Plot. x-axis: group size (0 to 50); y-axis: average latency of 1-byte messages in ms (0 to 10). Series: JazzEns, ByzEns+NoCrypto, ByzEns+SymCrypto, ByzEns+NoCrypto+Total.]

Figure 6: Latency measurements (the line for public key cryptography is dropped since it is orders of magnitude higher than the others)

Also, due to the configuration of our Blade Center, when the group size was above 12, part of the communication had to cross two internal switches. Last, JazzEnsemble is implemented in OCaml. Therefore, we relied on the OCaml CryptoKit for handling cryptography.

We have used the Ensemble Ring demo application to measure the performance of the system. In this demo, the application advances in rounds. In each round, a node sends a burst of k messages and waits until it receives k messages from all other nodes, at which point it moves to the next round. Thus, assigning k = 1 allows measuring the network latency. Throughput is measured as the number of broadcast messages successfully delivered per second (if a message is delivered to n nodes, we count it as one message for throughput calculations).

As can be seen in Figure 5 and Figure 6, the performance is fairly scalable with up to 50 members. We attribute some of the minor dip in throughput above 12 nodes to the extra switch that some messages need to cross. Similarly, part of the minor dip above 24 nodes is due to the fact that each pair of processes shared a NIC in such large groups (yet, each process was run on a separate CPU). Moreover, the OS kernel runs only on one of the two processors; we discovered that any process that runs on the same processor as the kernel enjoys better performance than processes that run on the second processor!

Additionally, we can see that without cryptography and uniform broadcast (footnote 8), the performance is about 85-90% of the performance of the non-Byzantine version of our system. In other words, handling all attacks on reliable delivery, flow control, and membership maintenance reduces the throughput by about 10%-15%.

Symmetric key cryptography (AES with a 128-bit key) reduces the performance by about half. This includes signing each message n − 1 times with a symmetric key. On the other hand, the throughput with public key cryptography with a 512-bit key drops to a few dozen messages per second, making it almost useless.
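
Signing each broadcast n − 1 times with symmetric keys amounts to computing one authenticator per receiver. The Python sketch below illustrates the idea with HMAC-SHA256 from the standard library as a stand-in; the measured system uses AES with a 128-bit key through the OCaml CryptoKit, so the primitive, the function names, and the key layout (one key shared between the sender and each receiver) are all assumptions made for illustration.

```python
import hashlib
import hmac


def authenticate_broadcast(sender, payload, pairwise_keys):
    """Compute one tag per receiver, i.e. n - 1 tags for a group of n members.

    `pairwise_keys` maps each other member to the key the sender shares with it.
    """
    return {
        receiver: hmac.new(key, payload, hashlib.sha256).digest()
        for receiver, key in pairwise_keys.items()
        if receiver != sender
    }


def verify(shared_key, payload, tag):
    """A receiver checks only the tag computed with the key it shares with the sender."""
    expected = hmac.new(shared_key, payload, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)
```

The per-message cost on the sending side therefore grows linearly with the group size, while each receiver verifies a single tag.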

The line labelled “ByzEns+NoCrypto+Total” in Figure 5 illustrates the performance of atomic delivery, obtained by placing a Byzantine consensus layer to order messages in a total order (as described in Section 3.5; see footnote 9). As can be seen, the performance is lower than without total ordering with up to 24 nodes, with a significant drop above 24 nodes. The drop in performance above 24 nodes is largely attributed to the fact that when we utilize two processes on the same blade server, they both share the same NIC (but use separate CPUs). This means that when running Byzantine consensus, we are limited by the NIC's capacity due to the extra messages injected by this protocol.

Footnote 8: For a discussion of uniform broadcast, see Section 3.4.4 and the discussion after Definition 2.2.

[Plot. x-axis: group size (0 to 45); y-axis: 16-byte messages / second (0 to 16000). Series: NoCrypto+Total, NoCrypto+Uniform, NoCrypto+Total+Uniform, SymCrypto+Total, SymCrypto+Uniform, SymCrypto+Total+Uniform.]

Figure 7: Additional Throughput Measurements (the cost of total ordering and uniform broadcast with and without symmetric-key cryptography)

[Plot. x-axis: group size (0 to 50); y-axis: seconds to stability (0 to 0.4). Series: ByzEns+NoCrypto merge->init, ByzEns+NoCrypto leave->init.]

Figure 8: Time to establish a new view

Figure 7 focuses on the attainable throughput of the Byzantine version of JazzEnsemble while also ensuring total ordering and uniform broadcast. As can be seen, symmetric key cryptography roughly halves the throughput for both total ordering and uniform delivery (recall that due to our use of consensus in the implementation of total ordering, total ordering already satisfies uniform broadcast). The reason why uniform delivery is worse than total ordering is that the implementation of consensus can decide on multiple messages in one instance. Thus, the cost of the consensus protocol is averaged over multiple messages. Due to a bug in JazzEnsemble, we were not able to implement a similar optimization for uniform delivery. In general, both these protocols deliver reasonable performance for small clusters. However, the performance decays as the cluster grows, due to the fact that both protocols require O(n^2) messages, or to be precise, O(n) broadcasts (with consensus averaging out this cost over multiple messages). Interestingly, the performance decay looks linear rather than quadratic. The reason is that the network is switched. Thus, the extra load imposed on each link and each group member grows only according to O(n)!
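
The difference between the O(n^2) total message count and the O(n) load per link can be made concrete with a small calculation. The Python sketch below assumes, for illustration only, a single protocol round in which each of the n members broadcasts one message to the n − 1 others; the function name is hypothetical.

```python
def echo_round_load(n):
    """Message counts for one round in which every member broadcasts once.

    Returns (broadcasts, point_to_point_messages, messages_received_per_member).
    """
    broadcasts = n                    # O(n) broadcasts
    point_to_point = n * (n - 1)      # O(n^2) messages in the whole network
    received_per_member = n - 1       # O(n) load on each member's own link
    return broadcasts, point_to_point, received_per_member
```

For n = 50 this gives 50 broadcasts and 2450 point-to-point messages overall, yet each member's link to the switch carries only 49 incoming messages, which is consistent with the roughly linear decay observed.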

In any event, we would like to emphasize once again that this is without packing/batching optimizations [33]. When incorporating such optimizations, from sporadic testing, we believe that for small messages we can get a performance boost of at least a factor of 10, and as much as a factor of 90 for 1-byte messages.

Figure 8 shows the latency to establish a new view for both merging a new node and recovering from a failed or departed node (once the failure was detected). As can be seen, this latency grows with the view size, and is roughly the same in both cases. However, even with 50 nodes, it takes only about 0.35 seconds to establish the view. On the other hand, the exponential nature of the graph suggests that in order to grow to much larger groups, a more scalable overlay based solution might be needed. However, overlays tend to be vulnerable to Byzantine failures, so finding a practical solution to this is not a trivial task.

Footnote 9: Also, the graph ends at 44 nodes rather than 50, since 6 nodes were lost due to a UPS malfunction during a power outage.

Name              Meaning                                                           Recovery Time
ByzLeave          A node sends a leave message and then leaves                      0.013 sec
ByzMuteNode       A node is mute and not sending anything                           0.015 sec
ByzMuteCoord      The coordinator is mute                                           0.018 sec
ByzVerboseNode    A Byzantine node is too verbose and suspects nodes all the time   0.016 sec
CoordBadView      The coordinator sends a wrong view                                0.014 sec

Table 1: Recovery time from several problematic scenarios

Finally, Table 1 details the time to recover from several scenarios. The scenarios include a member leaving the group after sending a leave message, a node that becomes mute and does not send anything, a coordinator that becomes mute, a node that becomes too verbose and suspects other nodes all the time, and a coordinator that sends a view that is different from the one expected. In all cases, the time presented is from the detection of the failure until a new view is installed (but in the case of muteness and verbosity, it does not include the failure detection time itself, as this is a tunable parameter). As can be seen, in all cases, the recovery time is similar and is always less than 20 milliseconds. The main difference seems to be in whether all nodes start the consensus protocol roughly at the same time and with the same value or not. These numbers were obtained with a group of 12 nodes. In groups of 50 nodes, the latency may grow up to 350 milliseconds; in those cases the view latency is dominated by the synchronization time of the view, as appears in Figure 8.

5 Related Work

Byzantine failures were introduced by Lamport, Shostak, and Pease in the context of synchronous systems in [39]. The first randomized protocols to solve consensus in asynchronous Byzantine systems were proposed by Ben-Or [7] and Rabin [42]. Both Toueg and Bracha presented randomized asynchronous consensus protocols that are optimal with respect to the number of processes that can exhibit Byzantine behavior in [53] and [11], respectively. Since then, many protocols have been published including, e.g., [12, 13, 25] (this is far from being an exhaustive list).

Group communication has a noted research history [10, 18]. Yet, most of the systems developed ignore Byzantine failures. The few group communication systems that focus on security include SecureRing [37], Ensemble [34, 48], Rampart [47], Antigone [40], ITUA [46], Cactus [35], and Secure Spread [4]. We elaborate on these systems below.

Ensemble’s architecture addresses security [48], yet it only protects the system from external attacks, and does not handle Byzantine failures. Antigone is a framework that enables specifying flexible application security policies [40]. The framework allows controlling various quality of service issues, including security, but does not handle Byzantine failures either.

Secure Spread [4] is a group communication system that was designed to provide group communication over WANs. Spread integrates two low-level protocols: the Ring protocol in each site and the hop protocol connecting the sites. Secure Spread relies on strong synchronization guarantees to assure that no member can receive and decrypt messages after it left the group and no new member can receive and decrypt messages sent before it joined the group. Secure Spread also ignores Byzantine failures.

Rampart is the first group communication system that handled Byzantine attacks [47]. Rampart allows dynamic group membership and it must exclude faulty replicas from the group to make progress (e.g., to remove a faulty primary and elect a new one). The SecureRing system consists of a reliable delivery protocol, a group membership protocol, and a Byzantine fault detector [37]. The system protects a low-level ring by authenticating each transmission of the token and data message received. Both Rampart and SecureRing can guarantee safety if fewer than 1/3 of the replicas are faulty. Additionally, Rampart and SecureRing provide group membership protocols that can be used to implement recovery, but only in the presence of benign faults.

The ITUA project has the goal of developing a middleware based intrusion tolerance solution that helps build survivable distributed applications [46]. They have also taken the approach of extending an existing layered group communication system, in their case, C-Ensemble, and making it resilient to Byzantine faults. ITUA uses an adaptive and unpredictable response as a major technique to cope with an attacker, and its architecture separates the role of detection from replication management. The Cactus project also enjoys a layered micro-protocol architecture that allows adaptability and flexibility [35], and also has a Byzantine tolerant protocol stack. Interestingly, the programming model of Cactus is not virtually synchronous.

Rampart, SecureRing, Cactus, and ITUA all suffer from limited performance since they use costly protocols and rely intensively on public key cryptography. On the other hand, the BFT system, by Castro and Liskov [14], provides state-machine replication for an (almost) asynchronous network, where fewer than a third of the replicas may fail. BFT operates in epochs, where each epoch is made up of two phases, an optimistic phase and a recovery phase. In the optimistic phase, one node is designated as a leader; this node decides on the ordering of messages and notifies all other nodes about it using Bracha’s uniform broadcast protocol [11]. If the leader becomes suspected, then it is replaced using an agreement protocol. BFT uses MACs to authenticate all messages, and public-key cryptography is used only to exchange the symmetric keys used to compute the MACs. In [14], they show performance measurements with up to 7 nodes. We adopted this approach and demonstrated its feasibility even for groups of up to 50 nodes (footnote 10).

The Query/Update (a.k.a. Q/U) protocol offers an optimistic quorum approach for providing efficient Byzantine tolerant replication [2]. According to the Q/U protocol, clients access servers trying to obtain a quorum that ensures atomic execution of queries and updates. However, unlike traditional quorum based approaches, the Q/U protocol enables a one pass execution of operations in “good scenarios”, i.e., when there are no conflicting accesses to the same objects. On the other hand, concurrent accesses are resolved using a probabilistic back-off protocol. This same technique is used to eliminate the need for locks while supporting read-modify-write semantics as well as operations accessing multiple objects atomically. The benefit of this approach, compared to consensus based approaches like BFT, is in its improved fault-scalability, or in other words, the performance degradation involved in tolerating an increasing number of faults.

The BAR (Byzantine, Altruist, Rational) framework was recently introduced in order to handle systems in which some nodes are Byzantine, some are rational, and the rest are altruist [3]. That is, Byzantine nodes can deviate arbitrarily from their protocol, rational nodes only deviate from the protocol if they can gain something by that, and altruists always obey the protocol. A generic set of services that accommodates this generalized failure model was also developed, as well as a specific collaborative storage service, nicknamed BAR-B [3]. The BAR model is more suitable for collaborative systems, in which services are hosted on the participants' machines, and therefore it is likely that many of these nodes will only participate if they are given an incentive to do so. On the other hand, this comes at a substantial performance cost, and is thus not suitable for a system based on dedicated servers.

Footnote 10: Let us emphasize that BFT implements a replicated state-machine solution while we implement a generic group communication system with various levels of reliable and atomic broadcast.

Another optimized Byzantine atomic broadcast protocol is the Parsimonious Broadcast protocol [44]. This protocol also includes a leader based optimistic phase and a recovery phase. Yet, it utilizes a consistent broadcast service rather than uniform broadcast, in order to reduce the message complexity from O(n^2) to O(n). However, the protocol requires public-key cryptography. Based on our results and the results of BFT [14], this might be a limiting factor in its practical applicability (there are no performance measurements in [44]), in particular with respect to throughput [27].

When the execution of client requests is computation-intensive, it is worth splitting the decision on the execution order from the execution itself [56]. For example, in [43], they use an agreement cluster of 3f + 1 nodes to decide only on the ordering of executions, and then pass the execution itself to a set, called primary committee, of only f + 1 nodes. The generated replies of the primary committee are then compared by the agreement cluster. If all replies are the same, then they are returned to the client. Otherwise, if there is a mismatch, the request is sent to additional f servers; a reply that repeats at least f + 1 times is declared correct and is sent to the client. In this case, a new primary committee is also elected by the agreement cluster. The savings for compute-intensive requests come from the fact that on average, each request is executed by only f + 1 nodes. This is in contrast with having a request executed on all 3f + 1 nodes required to decide on the ordering. However, this only makes sense when the requests are indeed compute-intensive, since the mechanism described above involves non-negligible overheads. Our results are applicable to splitting approaches since they can help optimize the performance of the agreement cluster.
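
The reply comparison performed by the agreement cluster in such a split design can be sketched in Python as follows. This is a simplified reading of the description above, not the implementation of [43]; the function name, its parameters, and the callback that obtains the f additional replies are hypothetical.

```python
from collections import Counter


def resolve_reply(primary_replies, fetch_extra_replies, f):
    """Return (reply, new_committee_needed) from the f + 1 primary committee replies.

    `fetch_extra_replies` is a callable that runs the request on f additional
    servers and returns their replies; it is invoked only on a mismatch.
    """
    if len(set(primary_replies)) == 1:
        return primary_replies[0], False  # unanimous: only f + 1 executions
    replies = list(primary_replies) + list(fetch_extra_replies())
    reply, count = Counter(replies).most_common(1)[0]
    if count >= f + 1:
        return reply, True  # mismatch: a new primary committee is elected
    raise ValueError("no reply was repeated at least f + 1 times")
```

With at most f faulty servers among the 2f + 1 that executed the request, at least f + 1 correct servers return the same reply, so the mismatch branch finds a reply with the required multiplicity.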

The notion of a failure detector, which captures the required functional properties of failure detection without specifying explicit timing assumptions, was initiated by Chandra and Toueg in the context of the Consensus problem [16]. Mute failure detectors were initially proposed in [21, 22] in order to solve Byzantine Consensus in otherwise asynchronous systems. They were later used also in [6, 30, 36].

The MAFTIA [55] project has explored two different approaches to building intrusion-tolerant group communication protocols. The first approach is to use a linear secret sharing scheme based on a generalized adversary structure that can model a more realistic set of fault assumptions. The second approach is based on the use of a Trusted Timely Computing Base (TTCB). Moreover, the use of TTCB was explored as another means of solving Byzantine Consensus efficiently in [19].

Another interesting approach to affordable fault-tolerance by adaptation has been proposed in [17]. In that work, the authors propose running a non-fault-tolerant protocol most of the time, in parallel with a failure detector. When the failure detector notices some failure, benign or Byzantine, the system can switch to a fault masking protocol. This way, the cost of masking failures is only paid when failures occur. Notice that the work reported in [17] only describes in detail a reliable dissemination mechanism, without any ordering guarantees or membership maintenance. Also, once the masking protocol is turned on, the work in [17] does not describe any method of switching back to the cheaper protocol if the failures have been resolved.

In general, when it comes to benign failures, the approach of having a simple and efficient protocol most of the time and only fixing things when needed has been practiced in the area of group communication for a long time. For example, the Horus system included 4 optional total ordering protocols [33], two of which were leader based and two were token based. In all of them, during normal computation, the protocol is very simple and proceeds very efficiently, while a failure of the leader or token holder is compensated for during the computation of the new view.

6 Discussion

In this paper we presented a scalable Byzantine group communication system. Our system enjoys several interesting properties: It is a generic group communication system and therefore can be used as a building block for various distributed applications. The system is designed to perform well in the normal case, i.e., when no Byzantine failures occur, yet be resilient to them if they do, as validated by our performance measurements. Also, our protocols do not rely on protocol level signatures, and only sign (and authenticate) each message once before sending it to (or receiving from) the network. The only exception to this is retransmitting by a third node, which requires signatures at a low level of the system.

By examining our performance measurements, and in particular when focusing on the sources of overhead in handling Byzantine failures, one can make the following observations: The cost of handling Byzantine failures other than cryptography, total ordering, and uniform broadcast (to be precise, ensuring that a Byzantine node does not send different versions of its application messages to different nodes) is relatively small; about 10-15% in our measurements. Also, public key cryptography is extremely expensive to perform in software. Yet, using symmetric key cryptography while signing each broadcast message n − 1 times results in acceptable performance degradation even in groups of up to 50 nodes (footnote 11). Furthermore, as security becomes increasingly important, it is conceivable that in the future most computers will include hardware accelerators for it, which will reduce its cost even further.

Moreover, even with total ordering (by Byzantine consensus), the performance of the system is still quite reasonable. Only a decade ago, a throughput of 4,000 messages per second on a cluster of 44 nodes would have been considered excellent for a non-Byzantine tolerant system. However, the performance does degrade as the cluster size increases.

Overall, our work joins the results of [2] and [14] in showing that it is possible to attain reasonable performance in Byzantine tolerant systems (each work uses a different approach). In particular, our layered architecture enables us to perform fine grain measurements regarding the cost of various performance limiting factors in fending off Byzantine faults. When considering the scalability problems of total ordering and uniform delivery, one is faced with the following tradeoffs: First, public-key cryptography is too expensive. Moreover, due to its CPU intensive behavior, its detrimental effect on throughput is even more considerable than its impact on latency, as highlighted in [27]. Second, existing techniques for implementing Byzantine resilient total ordering and uniform delivery that do not rely on public-key cryptography are not very scalable. In contrast, the approach of [2] replaces the use of Byzantine consensus and uniform delivery with a quorum approach. However, it only ensures probabilistic termination, and requires increased clients to servers communication. This makes it less attractive when the communication between clients and servers is worse than the communication among the servers themselves, e.g., when all the servers are in the same farm. Thus, an interesting open problem is to develop more scalable Byzantine consensus and uniform broadcast protocols that, unlike [44], lend themselves easily to symmetric key authentication.

Footnote 11: This echoes the results of [14], in which they have measured a replicated state machine with groups of up to 7 nodes, while we tested various levels of reliable and totally ordered broadcast with up to 50 nodes.

Another direction in which we are extending our work is ad-hoc networks [52]. In ad-hoc networks, nodes cannot necessarily communicate directly with each other. Instead, some nodes act as forwarders for the entire group. The two places where this affects our work are that we need a Byzantine routing mechanism and that the stability protocol must become gossip based [29]. We have already developed one secure routing protocol in [24], and still need to develop an efficient Byzantine tolerant gossip based stability protocol.

Finally, the ITUA project included work on formal specification and verification of their group membership protocol [45]. It would be interesting to try to adapt their approach to formally verify our work as well.

Acknowledgements: We would like to thank Elad Barkan and Eli Biham for their advice on the use of cryptography, and Ohad Rodeh for his advice on security in Ensemble. Also, we would like to thank Eran Issler and Maxim Kovgan for their technical help.

References

[1] The OCaml Home Page. http://pauillac.inria.fr/ocaml.

[2] M. Abd-El-Malek, G. Ganger, G. Goodson, M. Reiter, and J. Wylie. Fault-Scalable Byzantine Fault-Tolerant Services. In Proc. 20th ACM SIGOPS Symposium on Operating Systems Principles (SOSP), pages 59–74, October 2005.

[3] A. Aiyer, L. Alvisi, A. Clement, M. Dahlin, J.-P. Martin, and C. Porth. BAR Fault-Tolerance for Cooperative Services. In Proc. 20th ACM SIGOPS Symposium on Operating Systems Principles (SOSP), pages 45–58, October 2005.

[4] Y. Amir, G. Ateniese, D. Hasse, Y. Kim, C. Nita-Rotaru, T. Schlossnagle, J. Schultz, J. Stanton, and G. Tsudik. Secure Group Communication in Asynchronous Networks with Failures: Integration and Experiments. In Proc. of the 20th International Conference on Distributed Computing Systems, pages 330–343, 2000.

[5] Y. Amir and J. Stanton. The Spread Wide Area Group Communication System. Technical Report CNDS-98-2, The Center for Networking and Distributed Systems, Computer Science Department, Johns Hopkins University, 1998.

[6] R. Baldoni, J.M. Helary, and M. Raynal. From Crash-Fault Tolerance to Arbitrary-Fault Tolerance: Towards a Modular Approach. In Proc. of the IEEE International Conference on Dependable Systems and Networks (DSN), pages 273–282, June 2000.

[7] M. Ben-Or. Another Advantage of Free Choice: Completely Asynchronous Agreement Protocols. In Proc. 2nd ACM Symposium on Principles of Distributed Computing, pages 27–30, 1983.

[8] K. Birman, R. Friedman, S. Keshav, and W. Vogels. Reliable Time-Delay Constrained Cluster Computing. U.S. Patent No. 6,393,581, issued May 21, 2002.

[9] K. Birman, M. Hayden, O. Ozkasap, Z. Xiao, M. Budiu, and Y. Minsky. Bimodal Multicast. ACM Transactions on Computer Systems, 17(2):41–88, May 1999.

[10] K. P. Birman. Building Secure and Reliable Network Applications. Manning Publishing Company andPrentice Hall, December 1996.

[11] G. Bracha. An Asynchronous (n − 1)/3-Resilient Consensus Protocol. In Proc. 3rd ACM Symposiumon Principles of Distributed Computing, pages 154–162, 1984.

[12] C. Cachin, K. Kursawe, and V. Shoup. Random Oracles in Constantinople: Practical AsynchronousByzantine Agreement Using Cryptography. In Proc. 19th ACM Symposium on Principles of DistributedComputing, pages 123–132, 2000.

[13] R. Canetti and T. Rabin. Fast Asynchronous Byzantine Agreement with Optimal Resilience. In Proc.25th Annual ACM Symposium on Theory of Computing, pages 42–51, 1993.

[14] M. Castro and B. Liskov. Practical Byzantine Fault Tolerance and Proactive Recovery. ACM Transac-tions on Computer Systems, 20(4):398–461, 2002.

[15] T. Chandra, V. Hadzilacos, S. Toueg, and B. Charron-Bost. On the Impossibility of Group Membership.In Proc. of the 15th ACM Symposium of Principles of Distributed Computing, pages 322–330, May 1996.

[16] T. Chandra and S. Toueg. Unreliable Failure Detectors for Asynchronous Systems. Journal of the ACM,43(4):685–722, July 1996.

[17] I. Chang, M.A. Hiltunen, and R.D. Schlichting. Affordable Fault Tolerance Through Adaptation. InProc. of Workshop on Fault-Tolerant Parallel and Distributed Systems (LNCS 1388), pages 585–603,April 1998.

[18] G.V. Chockler, I. Keidar, and R. Vitenberg. Group Communication Specifications: A ComprehensiveStudy. ACM Computing Surveys, 33(4):427–469, 2001.

[19] M. Correia, N.F. Neves, L.C. Lung, and P. Veríssimo. Low Complexity Byzantine-Resilient Consensus. Distributed Computing, 17(3):237–249, March 2005.

[20] D. Dolev, R. Friedman, I. Keidar, and D. Malki. Failure Detectors in Omission Failure Environments. Technical Report TR96–1608, Department of Computer Science, Cornell University, 1996.

[21] A. Doudou, B. Garbinato, R. Guerraoui, and A. Schiper. Muteness Failure Detectors: Specification and Implementation. In Proc. 3rd European Dependable Computing Conference, pages 71–87, 1999.

[22] A. Doudou and A. Schiper. Muteness Detectors for Consensus with Byzantine Processes (Brief Announcement). In Proc. 17th ACM Symposium on Principles of Distributed Computing (PODC), page 315, 1998.

[23] V. Drabkin, R. Friedman, A. Kama, and B. Mudrik. JazzEnsemble: a Group Communication System for MANET. Technical report, Computer Science, Technion, 2005.

[24] V. Drabkin, R. Friedman, and M. Segal. Efficient Byzantine Broadcast in Wireless Ad-Hoc Networks. In Proc. of the 6th IEEE Conference on Dependable Systems and Networks, pages 160–169, June 2005.

[25] P. Feldman and S. Micali. Optimal Algorithms for Byzantine Agreement. In Proc. 20th Annual ACM Symposium on Theory of Computing, pages 148–161, 1988.

[26] R. Friedman. Fuzzy Group Membership. In Proc. of FuDiCo 2002: International Workshop on Future Directions of Distributed Computing, pages 60–63, Bertinoro, Italy, June 2002.

[27] R. Friedman and E. Hadad. On the Significance of Latency vs. Throughput in Analyzing the Performance of Distributed Systems. IEEE Distributed Systems Online: “Distributed Wisdom” Column, January 2006.

[28] R. Friedman and A. Kama. Strong Replication Semantics in Mobile Ad-Hoc Networks. Technical report, Computer Science, Technion, 2005.

[29] R. Friedman, S. Manor, and K. Guo. Scalable Hypercube Based Stability Detection. IEEE Transactions on Parallel and Distributed Systems, 13(8), August 2002.

[30] R. Friedman, A. Mostefaoui, and M. Raynal. Simple and Efficient Oracle-Based Consensus Protocols for Asynchronous Byzantine Systems. In Proc. of the 23rd IEEE International Symposium on Reliable Distributed Systems (SRDS), pages 228–237, October 2004.

[31] R. Friedman, A. Mostefaoui, and M. Raynal. Simple and Efficient Oracle-Based Consensus Protocols for Asynchronous Byzantine Systems. IEEE Transactions on Dependable and Secure Computing, 2(1):46–56, March 2005.

[32] R. Friedman and R. van Renesse. Strong and Weak Virtual Synchrony in Horus. In Proc. of the 15th Symposium on Reliable Distributed Systems, pages 140–149, October 1996.

[33] R. Friedman and R. van Renesse. Packing Messages as a Tool for Boosting the Performance of Total Ordering Protocols. In Proc. of the Sixth IEEE International Symposium on High Performance Distributed Computing, pages 233–242, August 1997.

[34] M. Hayden. The Ensemble System. Technical Report TR98-1662, Department of Computer Science, Cornell University, January 1998.

[35] M.A. Hiltunen, R.D. Schlichting, and C.A. Ugarte. Survivability Issues in Cactus. In Proc. of the IEEE Information Survivability Workshop, October 1998.

[36] K.P. Kihlstrom, L.E. Moser, and P.M. Melliar-Smith. Solving Consensus in a Byzantine Environment Using an Unreliable Fault Detector. In Proc. of the Int. Conference on Principles of Distributed Systems, pages 61–75, 1997.

[37] K.P. Kihlstrom, L.E. Moser, and P.M. Melliar-Smith. The SecureRing Group Communication System. ACM Transactions on Information and System Security, 4(4):371–406, 2001.

[38] L. Lamport. Lower Bounds for Asynchronous Consensus. In A. Schiper, A.A. Shvartsman, H. Weatherspoon, and B.Y. Zhao, editors, Future Directions in Distributed Computing: Research and Position Papers, number 2584 in LNCS, pages 22–23. Springer, 2003.

[39] L. Lamport, R. Shostak, and M. Pease. The Byzantine Generals Problem. ACM Transactions on Programming Languages and Systems, 3(4):382–401, July 1982.

[40] P.D. McDaniel, A. Prakash, and P. Honeyman. Antigone: A Flexible Framework for Secure Group Communication. Technical Report CITI TR 99-2, University of Michigan, Ann Arbor, MI, USA, September 1999.

[41] G. Neiger and S. Toueg. Automatically Increasing the Fault-Tolerance of Distributed Algorithms. Journal of Algorithms, 11(3):374–419, September 1990.

[42] M. Rabin. Randomized Byzantine Generals. In Proc. 24th IEEE Symposium on Foundations of Computer Science, pages 403–409, 1983.

[43] H. Ramasamy, A. Agbaria, and W.H. Sanders. Parsimony-Based Approach for Obtaining Resource-Efficient and Trustworthy Execution. In Proc. 2nd Latin-American Dependable Computing Symposium (LADC), pages 206–225, October 2005.

[44] H. Ramasamy and C. Cachin. Parsimonious Asynchronous Byzantine-Fault-Tolerant Atomic Broadcast. In Proc. 9th International Conference on Principles of Distributed Systems, December 2005.

[45] H.V. Ramasamy, M. Cukier, and W.H. Sanders. Formal Specification and Verification of a Group Membership Protocol for an Intrusion-Tolerant Group Communication System. In Proc. of the IEEE Pacific Rim International Symposium on Dependable Computing, pages 9–18, 2002.

[46] H.V. Ramasamy, P. Pandey, J. Lyons, M. Cukier, and W.H. Sanders. Quantifying the Cost of Providing Intrusion Tolerance in Group Communication Systems. In Proc. of the IEEE Conference on Dependable Systems and Networks, pages 229–238, 2002.

[47] M. Reiter. Distributed Trust with the Rampart Toolkit. Communications of the ACM, 39(4):70–74, April 1996.

[48] O. Rodeh. Secure Group Communication. PhD thesis, School of Computer Science and Engineering, The Hebrew University of Jerusalem, 2001.

[49] F.B. Schneider. The State Machine Approach: A Tutorial. Technical Report TR 86-800, Department of Computer Science, Cornell University, December 1986. Revised June 1987.

[50] B. Schneier. Applied Cryptography. Wiley, 1996.

[51] A.S. Tanenbaum. Computer Networks (4th edition). Prentice Hall PTR, 2003.

[52] C.K. Toh. Ad Hoc Mobile Wireless Networks. Prentice Hall, 2002.

[53] S. Toueg. Randomized Byzantine Agreement. In Proc. 3rd ACM Symposium on Principles of Distributed Computing, pages 163–178, 1984.

[54] R. van Renesse, K. Birman, and S. Maffeis. Horus: A Flexible Group Communication System. Communications of the ACM, 39(4):76–83, April 1996.

[55] P. Veríssimo, N.F. Neves, and M. Correia. The Middleware Architecture of MAFTIA: A Blueprint. In Proc. of the IEEE Third Information Survivability Workshop, October 2000.

[56] J. Yin, J.-P. Martin, A. Venkataramani, L. Alvisi, and M. Dahlin. Separating Agreement from Execution for Byzantine Fault Tolerant Services. In Proc. of the 19th ACM Symposium on Operating Systems Principles, pages 253–267, 2003.
