
An Algorithm for Tolerating Crash Failures in Distributed Systems

Vincenzo De Florio and Rudy Lauwereins

Katholieke Universiteit Leuven, Electrical Engineering Department, ACCA Group, Kard. Mercierlaan 94, B-3001 Heverlee, Belgium.

E-mail: {deflorio|lauwerin}@esat.kuleuven.ac.be

Abstract

In the framework of the ESPRIT project 28620 “TIRAN” (tailorable fault tolerance frameworks for embedded applications), a toolset of error detection, isolation, and recovery components is being designed to serve as a basic means for orchestrating application-level fault tolerance. These tools will be used either as stand-alone components or as the peripheral components of a distributed application, which we call “the backbone”. The backbone is to run in the background of the user application. Its objectives include (1) gathering and maintaining error detection information produced by TIRAN components like watchdog timers, trap handlers, or by external detection services working at kernel or driver level, and (2) using this information at error recovery time. In particular, those TIRAN tools related to error detection and fault masking will forward their deductions to the backbone that, in turn, will make use of this information to orchestrate error recovery, requesting recovery and reconfiguration actions from those tools related to error isolation and recovery. Clearly a key point in this approach is guaranteeing that the backbone itself tolerates internal and external faults. In this article we describe one of the means that are used within the TIRAN backbone to fulfill this goal: a distributed algorithm for tolerating crash failures triggered by faults affecting at most all but one of the components of the backbone or at most all but one of the nodes of the system. We call this the algorithm of mutual suspicion.

Proc. of the 7th IEEE Int. Conf. and Workshop on the Engineering of Computer Based Systems (ECBS 2000), Edinburgh, UK, April 3–5, 2000, pages 9–17.

1. Introduction

In the framework of the ESPRIT project 28620 “TIRAN” [4], a toolset of error detection, isolation, and recovery components is being developed to serve as a basic means for orchestrating application-level software fault tolerance. The basic components of this toolset can be considered as ready-made software tools that developers embed into their applications so as to enhance dependability. These tools include, e.g., watchdog timers, trap handlers, and local and distributed voting tools.

The main difference between TIRAN and other libraries with similar purposes, e.g., ISIS [3] or HATS [12], is the adoption of a special component, located between the basic toolset and the user application. This entity is transparently replicated on each node of the system, to keep track of events originating in the basic layer (e.g., a missing heartbeat from a task guarded by a watchdog) or in the user application (e.g., the spawning of a new task), and to allow the orchestration of system-wide error recovery and reconfiguration. We call this component “the backbone”.

In order to perform these tasks, the backbone hooks to each active instance of the basic tools and is transparently informed of any error detection or fault masking event taking place in the system. Similarly, it also hooks to a library of basic services. This library includes, among others, functions for remote communication, for task creation and management, and for access to the local hardware clock. These functions are instrumented so as to transparently forward to the backbone notifications of events like the creation or the termination of a thread. Special low-level services at kernel or driver level are also hooked to the backbone—for example, on a custom board based on the Analog Devices ADSP-21020 DSP and on IEEE 1355-compliant communication chips [1], communication faults are transparently notified to the backbone by driver-level tools. In this way, an information stream flows on each node from the different abstraction layers to the local component of the backbone. This component maintains and updates this information in the form of a system database, also replicating it on different nodes.

Whenever an error is detected or a fault is masked, those TIRAN tools related to error detection and fault masking forward their deductions to the backbone that, in turn, makes use of this information to manage error recovery, requesting recovery and reconfiguration actions from those tools


Figure 1. Each processing node of the system hosts one agent of the backbone. This component is the intermediary of the backbone on that node. In particular, it gathers information from the TIRAN tools for error detection and fault containment (grey circles) and forwards requests to those tools managing error containment and recovery (light grey circles). These latter execute recovery actions and possibly, at test time, fault injection requests.

related to error isolation and recovery (see Fig. 1).

The specification of which actions to take is to be supplied by the user in the form of a “recovery script”, a sort of ancillary application context devoted to error recovery concerns, consisting of a number of guarded commands: the execution of blocks of basic recovery actions, e.g., restarting a group of tasks, rebooting a node, and so forth, is then subject to the evaluation of boolean clauses based on the current contents of the system database—for further information on this subject see [9, 11].

Clearly a key point in this scheme is guaranteeing that the backbone itself tolerates internal as well as external faults. In this article we describe one of the means that have been designed within the TIRAN backbone to fulfill this goal: a distributed algorithm for tolerating crash failures triggered by faults affecting at most all but one of the components of the backbone. We call this the algorithm of mutual suspicion. As in [6], we assume a timed asynchronous distributed system model [5]. Furthermore, we assume the availability of asynchronous communication means. No atomic broadcast primitive is required.

2. Basic assumptions

The target system is a distributed system consisting of n nodes (n ≥ 1). Nodes are assumed to be labeled with a unique number in {0, . . . , n − 1}. The backbone is a distributed application consisting of n triplets of tasks:

(task D, task I, task R).

Basically, task D bears its name from the fact that it deals with the system database of the backbone, while task I manages “I’m alive” signals, and task R deals with error recovery (see further on). We call “agent” any such triplet. At initialisation time there is exactly one agent on each node, identified through the label of the node on which it runs. For any 0 ≤ k < n, we call “task t[k]” task t of agent k. Agents play two roles: coordinator (also called manager) and assistant (also called backup agent or simply “backup”). In the initial, correct state there is just one coordinator. The choice of which node should host the coordinator, as well as system configuration and node labeling, is done at compile time through a configuration script.
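To make this structure concrete, the following is a minimal C sketch of how an agent and its role might be represented; all names (agent_t, role_t, init_agent) are illustrative assumptions of ours, not the actual TIRAN data structures.

    typedef enum { ROLE_COORDINATOR, ROLE_ASSISTANT } role_t;

    typedef struct {
        int    node;     /* label of the hosting node, in {0, ..., n-1}     */
        role_t role;     /* coordinator ("manager") or assistant ("backup") */
        int    task_d;   /* task D: interfaces tools, keeps the database    */
        int    task_i;   /* task I: manages the "I'm alive" flag            */
        int    task_r;   /* task R: carries out error recovery              */
    } agent_t;

    /* Node m, chosen at compile time via the configuration script, hosts
     * the coordinator; every other node starts as an assistant. */
    void init_agent(agent_t *a, int node, int m)
    {
        a->node = node;
        a->role = (node == m) ? ROLE_COORDINATOR : ROLE_ASSISTANT;
        a->task_d = a->task_i = a->task_r = -1;  /* task ids, spawned later */
    }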

Figure 2 displays a backbone running on a four-node system (a Parsytec Xplorer MIMD engine based on PowerPC microprocessors). In this case, node 0 hosts the coordinator while nodes 1–3 execute as assistants. In the portrayed situation no errors have been detected; this is rendered with green circles and a status label equal to “OK”.

Each agent, be it a coordinator or an assistant, executes a number of common tasks. Among these tasks we have:

1. Interfacing the instances of the basic tools of the framework—this is to be carried out by task D.

2. Organizing and maintaining the data gathered from the instances—also a responsibility of task D.

3. Error recovery and reconfiguration management (task R).

4. Self-check. This takes place through a distributed algorithm that is executed by task D and task I in all agents when n > 1.

This article does not cover points 1–3; in particular, points 1 and 2 are dealt with as reported in [10], while the issue of error recovery and reconfiguration management is described in [9]. In the following, we describe the algorithm mentioned at point 4. As the key point of this algorithm is the fact that the coordinator and the assistants mutually question each other’s state, we call it the algorithm of mutual suspicion (AMS).


Figure 2. A Netscape browser offers a global view of the TIRAN backbone: the number, role, and state of each component are displayed. “DIR net” is a nickname for the backbone. “OK” means that no faults have been detected. When the user selects a component, further information related to that component is displayed. The small icon at the bottom links to a page with information related to error recovery.

3. The algorithm of mutual suspicion

This section describes AMS. Simulation and testing show that AMS is capable of tolerating crash failures of up to n − 1 agents (some or all of which may be caused by a node crash). This implies fail-stop behaviour, which can be achieved in hardware, e.g., by architectures based on duplication and comparison [14], coupled with other techniques like, e.g., control flow monitoring [15] or diversification and comparison [13]. If the coordinator or its node crashes, a non-crashed assistant becomes coordinator. Whenever a crashed agent is restarted, possibly after a node reboot, it is inserted again in the backbone as an assistant. Let us first sketch the structure of AMS. Let m be the node on which the coordinator initially runs. In short:

• The coordinator periodically broadcasts a MIA (“manager is alive”) message as part of its task D; the assistants periodically send the coordinator a TAIA (“this assistant is alive”) message as part of their task D.

Figure 3. A representation of the algorithm of the manager.

• For each 0 ≤ k < n, task D[k] periodically sets a flag. This flag is periodically cleared by task I[k].

• If, for any valid j, task I[j] finds that the flag has not been set during the period just elapsed, task I[j] broadcasts a TEIF (“this entity is faulty”) message.

• When any agent, say the agent on node g, does not receive a timely message from another agent, say the agent on node f, be it a MIA or a TAIA message, then agent g enters a so-called suspicion period by setting flag sus[f]. This state leads to three possible next states, corresponding to these events:

1. Agent g receives a TEIF from task I[f] within a specific time period, say t clock ticks.

2. Agent g receives a (late) TAIA from task D[f] within t clock ticks.

3. No message is received by agent g from either task I[f] or task D[f] within t clock ticks.


Figure 4. A representation of the algorithm of the assistant.

State 1 corresponds to the deduction “the agent on node f has crashed, though node f is still operational”. State 2 translates into the deduction “both the agent on node f and node f are operational, though for some reason agent f or its communication means have been slowed down”. State 3 is the detection of a crash failure of node f. These deductions lead to actions aiming at recovering agent f or (if possible) the whole node f, possibly electing a new coordinator. In the present version of AMS, the election algorithm simply assumes the next coordinator to be the assistant on node (m + 1) mod n. A sketch of this decision logic is given below.
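The following C fragment renders this three-way verdict and the election rule; it is a reading aid under the assumptions stated in the text, not the TIRAN implementation.

    typedef enum {
        SUSPECTING,     /* sus[f] set: waiting at most t clock ticks         */
        AGENT_CRASHED,  /* case 1: TEIF from task I[f], node f still alive   */
        AGENT_SLOW,     /* case 2: late TAIA/MIA, agent f merely slowed down */
        NODE_CRASHED    /* case 3: silence from both task I[f] and task D[f] */
    } verdict_t;

    /* Verdict of agent g about agent f once the t-tick window has closed. */
    verdict_t judge(int got_teif, int got_late_message)
    {
        if (got_teif)         return AGENT_CRASHED;  /* revive agent f      */
        if (got_late_message) return AGENT_SLOW;     /* clear sus[f], go on */
        return NODE_CRASHED;                         /* recover node f      */
    }

    /* Election rule used when the crashed party was the coordinator on
     * node m: the assistant on the next node, modulo n, takes over. */
    int next_coordinator(int m, int n) { return (m + 1) % n; }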

In compliance with the timed asynchronous distributed system model, we assume the presence of an alarm manager task (task A) on each node of the system, spawned at initialization time by the agent. Task A is used to translate time-related clauses, e.g., “t clock ticks have elapsed”, into message arrivals. Let us call task A[j] the task A running on node j, 0 ≤ j < n. Task A may be represented as a function

a : C → M

Figure 5. A representation of the conjoint action of task I and task D on the “I’m alive” flag.

such that, for any time-related clause c ∈ C,

a(c) = message “clause c has elapsed” ∈ M.

Task A[j] monitors a set of time-related clauses and, each time one of them occurs, say clause c, it sends task D[j] message a(c). Messages are sent via asynchronous primitives based on mailboxes.
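As an illustration of the a : C → M translation, the sketch below assumes a hypothetical mailbox primitive (mbox_send); the paper does not show the actual basic-services API, so these names are ours.

    typedef struct {
        int type;       /* clause type, e.g. TAIA_RECV                  */
        int subject;    /* node/agent the clause refers to              */
        int deadline;   /* expiry, in clock ticks                       */
    } clause_t;

    typedef struct { int type; int subject; } msg_t;

    extern void mbox_send(int task_id, const msg_t *m); /* assumed primitive */

    /* Whenever task A[j] finds that clause c has occurred, it sends task
     * D[j] the corresponding message a(c); the send is asynchronous. */
    void alarm_fired(int task_d_id, const clause_t *c)
    {
        msg_t m = { c->type, c->subject };   /* m = a(c) */
        mbox_send(task_d_id, &m);
    }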

In particular, the coordinator, which we assume to run initially on node m, instructs its task A so that the following clauses are managed:

• (MIA SEND, j, MIA SEND TIMEOUT), j different from m: every MIA SEND TIMEOUT clock ticks, a message of type (MIA SEND, j) should be sent to task D[m], i.e., task D on the current node. The latter responds to this event by sending each task D[j] a MIA (“manager is alive”) message.

• (TAIA RECV, j, TAIA RECV TIMEOUT), for each j different from m: every TAIA RECV TIMEOUT clock ticks at most, a message of type TAIA is to be received from task D[j]. The arrival of such a message, or of any other “sign of life” from task D[j], also renews the corresponding alarm. On the other hand, the arrival of a message of type (TAIA RECV, k), k different from m, sent by task A[m], warns task D[m] that the assistant on node k sent no sign of life throughout the TAIA RECV TIMEOUT-clock-tick period just elapsed. This makes task D[m] set its flag sus[k].


Figure 6. Pseudo-code of the coordinator.

• (I’M ALIVE SET, m, I’M ALIVE SET TIMEOUT): every I’M ALIVE SET TIMEOUT clock ticks, task A[m] sends task D[m] an I’M ALIVE SET message. As a response to this, task D[m] sets the “I’m alive” flag, a memory variable that is shared between tasks of type D and I.

Furthermore, whenever flag sus[k] is set, for any k different from m, the following clause is sent to task A[m] to be managed:

• (TEIF RECV, k, TEIF RECV TIMEOUT): this clause simply asks task A[m] to schedule the sending of message (TEIF RECV, k) to task D[m] after TEIF RECV TIMEOUT clock ticks. This action is canceled should a late TAIA message arrive at task D[m] from task D[k], or should a TEIF message from task I[k] arrive instead. In the first case, sus[k] is cleared and (possibly empty) actions corresponding to a slowed-down task D[k] are taken. In the latter case, task D[k] is assumed to have crashed, its clauses are removed from the list of those managed by task A[m], and flag sus[k] is cleared. It is assumed that task I[k] will take care in this case of reviving task D[k]. Any future sign of life from task D[k] is assumed to mean that task D[k] is back in operation. In such a case task D[k] would then be re-entered in the list of operational assistants, and task D[m] would then request task A[m] to include again an alarm of type (MIA SEND, k, MIA SEND TIMEOUT) and an alarm of type (TAIA RECV, k, TAIA RECV TIMEOUT) in its list. If a (TEIF RECV, k) message reaches task D[m], the entire node k is assumed to have crashed. Node recovery may start at this point, if available, or a warning message should be sent to an external operator so that, e.g., node k can be rebooted. A sketch of this registration logic follows the list.
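The sketch below renders the coordinator’s clause registration in C; register_alarm and the enum constants are hypothetical stand-ins for the interface described above, under the reading that one clause per assistant is registered.

    enum { MIA_SEND, TAIA_RECV, IM_ALIVE_SET, TEIF_RECV }; /* clause types */

    extern void register_alarm(int type, int subject, int timeout); /* assumed */

    /* Registration performed on node m at start-up. */
    void coordinator_setup(int m, int n,
                           int mia_send_to, int taia_recv_to, int im_alive_to)
    {
        for (int j = 0; j < n; j++) {
            if (j == m) continue;
            register_alarm(MIA_SEND, j, mia_send_to);   /* push MIA to j      */
            register_alarm(TAIA_RECV, j, taia_recv_to); /* expect TAIA from j */
        }
        register_alarm(IM_ALIVE_SET, m, im_alive_to);   /* set "I'm alive"    */
    }

    /* Entered when alarm (TAIA_RECV, k) fires: assistant k fell silent. */
    void on_silent_assistant(int k, int teif_recv_to)
    {
        /* sus[k] is set; now wait TEIF RECV TIMEOUT ticks for a verdict. */
        register_alarm(TEIF_RECV, k, teif_recv_to);
    }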

Similarly, any assistant, say the one on node k, instructs its task A so that the following clauses are managed:

• (TAIA SEND, m, TAIA SEND TIMEOUT): every TAIA SEND TIMEOUT clock ticks, a message of type (TAIA SEND, m) is to be sent to task D[k], i.e., task D on the current node. The latter responds to this event by sending task D[m] (i.e., the manager) a TAIA (“this assistant is alive”) message. Should a data message be sent to the manager in the middle of a TAIA SEND TIMEOUT-clock-tick period, alarm (TAIA SEND, m, TAIA SEND TIMEOUT) is renewed. This may happen, for instance, because one of the basic TIRAN tools for error detection reports an event to task D[k]. Such an event must be sent to the manager for it to update its database. In this case we say that the TAIA message is piggybacked on the event notification message.


Figure 7. Pseudo-code of the assistant.

• (MIA RECV, m, MIA RECV TIMEOUT): every MIA RECV TIMEOUT clock ticks at most, a message of type MIA is to be received from task D[m], i.e., the manager. The arrival of such a message, or of any other “sign of life” from the manager, also renews the corresponding alarm. If a message of type (MIA RECV, m), sent by task A[k], is received by task D[k], this means that no sign of life has been received from the manager throughout the MIA RECV TIMEOUT-clock-tick period just elapsed. This makes task D[k] set flag sus[m].

• (I’M ALIVE SET, k, I’M ALIVE SET TIMEOUT): every I’M ALIVE SET TIMEOUT clock ticks, task A[k] sends task D[k] an I’M ALIVE SET message. As a response to this, task D[k] sets the “I’m alive” flag.

Furthermore, whenever flag sus[m] is set, the following clause is sent to task A[k] to be managed:

• (TEIF RECV, m, TEIF RECV TIMEOUT): this clause simply asks task A[k] to schedule the sending of message (TEIF RECV, m) to task D[k] after TEIF RECV TIMEOUT clock ticks. This action is canceled should a late MIA message arrive at task D[k] from the manager, or should a TEIF message from task I[m] arrive instead. In the first case, sus[m] is cleared and (possibly empty) actions corresponding to a slowed-down manager are taken. In the latter case, task D[m] is assumed to have crashed, its clause is removed from the list of those managed by task A[k], and flag sus[m] is cleared. It is assumed that task I[m] will take care in this case of reviving task D[m]. Any future sign of life from task D[m] is assumed to mean that task D[m] is back in operation. In such a case task D[m] would be demoted to the role of assistant and entered in the list of operational assistants, the role of coordinator having meanwhile been assigned, via an election, to an agent formerly running as assistant.

If a (TEIF RECV, m) message reaches task D[k], the entire node of the manager is assumed to have crashed. Node recovery may start at this point. An election takes place—the next assistant (modulo n) is elected as the new manager.
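The assistant’s three possible verdicts on a suspected manager, including this election rule, could look as follows in C; again, all helper names are our assumptions, not the TIRAN API.

    extern void become_coordinator(void);  /* assumed helpers               */
    extern void clear_suspicion(int node);
    extern void await_revival(int node);   /* task I[m] will revive D[m]    */

    /* Verdict reached by task D[k] at most TEIF RECV TIMEOUT ticks after
     * sus[m] was set. */
    void assistant_verdict(int m, int n, int k,
                           int late_mia, int teif_from_i_m)
    {
        if (late_mia) {                /* manager merely slowed down        */
            clear_suspicion(m);
        } else if (teif_from_i_m) {    /* task D[m] crashed, node m alive   */
            clear_suspicion(m);
            await_revival(m);          /* D[m] re-enters as an assistant    */
        } else {                       /* (TEIF RECV, m) fired: node crash  */
            if ((m + 1) % n == k)      /* election: next assistant, mod n   */
                become_coordinator();
        }
    }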

Task I on each node, say node k, also instructs its task A so that the following clause is managed:

• (I’M ALIVE CLEAR, k, I’M ALIVE CLEAR TIMEOUT): every I’M ALIVE CLEAR TIMEOUT clock ticks, task A[k] sends task I[k] an I’M ALIVE CLEAR message. As a response to this, task I[k] clears the “I’m alive” flag, as sketched below.
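The conjoint action on the shared flag (cf. Fig. 5) reduces to a set/test-and-clear protocol. The sketch below assumes plain shared memory and a hypothetical broadcast_teif helper; it is not the TIRAN source.

    extern void broadcast_teif(void); /* assumed: "this entity is faulty"  */

    static volatile int im_alive = 0; /* variable shared by tasks D and I  */

    /* task D[k], on each I'M ALIVE SET alarm received from task A[k] */
    void on_im_alive_set(void) { im_alive = 1; }

    /* task I[k], on each I'M ALIVE CLEAR alarm received from task A[k] */
    void on_im_alive_clear(void)
    {
        if (!im_alive)        /* task D[k] never set the flag this period: */
            broadcast_teif(); /* report it as faulty                       */
        im_alive = 0;         /* clear the flag and start a new period     */
    }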

Figures 3, 4, and 5 supply a pictorial representation of this algorithm. Figures 6 and 7 show the pseudo-code of the coordinator and of the assistant, respectively.


3.1. The alarm manager class

This section briefly describes task A. This task makes use of a special class to manage lists of alarms [8]. The class allows the client to “register” alarms, specifying alarm ids and deadlines.

Once the first alarm is entered, the task managing alarms creates a linked list of alarms and polls the top of the list. For each new alarm to be inserted, an entry in the list is found and the list is modified accordingly. If the top entry expires, a user-defined alarm function is invoked. This is a general mechanism that allows any event to be associated with the expiry of an alarm. In the case of the backbone, task A on node k sends a message to task D[k]—the same result may also be achieved by sending a UNIX signal to task D[k]. Special alarms are defined as “cyclic”, i.e., they are automatically renewed at each expiration, after invoking the alarm function. A special function restarts an alarm, i.e., it deletes and re-enters an entry. It is also possible to temporarily suspend an alarm and re-enable it afterwards.
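A minimal C sketch of such an alarm list follows: a singly linked list kept sorted by deadline, with cyclic alarms re-inserted on expiry. It mirrors the description above, not the actual class of [8].

    #include <stdlib.h>

    typedef struct alarm {
        int deadline;            /* absolute expiry time, in clock ticks */
        int period;              /* > 0 for cyclic alarms, 0 otherwise   */
        void (*fire)(int id);    /* user-defined alarm function          */
        int id;
        struct alarm *next;
    } alarm_t;

    static alarm_t *top = NULL;  /* list head: earliest deadline first   */

    static void insert_sorted(alarm_t *a)
    {
        alarm_t **p = &top;
        while (*p && (*p)->deadline <= a->deadline)
            p = &(*p)->next;
        a->next = *p;
        *p = a;
    }

    /* "Register" a new alarm, as described in the text. */
    alarm_t *alarm_register(int deadline, int period,
                            void (*fire)(int), int id)
    {
        alarm_t *a = malloc(sizeof *a);
        if (!a) return NULL;
        a->deadline = deadline; a->period = period; a->fire = fire; a->id = id;
        insert_sorted(a);
        return a;
    }

    /* Called from the task's main loop: poll the top of the list and fire
     * every expired entry; cyclic alarms are renewed after firing. */
    void alarm_poll(int now)
    {
        while (top && top->deadline <= now) {
            alarm_t *a = top;
            top = a->next;
            a->fire(a->id);              /* e.g. send a message to task D */
            if (a->period > 0) {
                a->deadline = now + a->period;
                insert_sorted(a);        /* cyclic: automatically renewed */
            } else {
                free(a);
            }
        }
    }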

4. Current status and future directions

A prototype implementation of the TIRAN backbone is running on a Parsytec Xplorer, a MIMD engine using 4 PowerPC nodes. The system has been tested and proved able to tolerate a number of software-injected faults, e.g., component and node crashes (see Fig. 8 and Fig. 9). Faults are scheduled as another class of alarms that, when triggered, send a fault injection message to the local task D or task I. The specification of which faults to inject is read by the backbone at initialisation time from a file called “.faultrc”. The user can specify fault injections by editing this file, e.g., as follows:

INJECT CRASH ON COMPONENT 1
AFTER 5000000 TICKS

INJECT CRASH ON NODE 0
AFTER 10000000 TICKS

The first two lines inject a crash on task D[1] five seconds after the initialisation of the backbone; the second two lines inject a system reboot of node 0 after ten seconds. Via fault injection it is also possible to artificially slow down a component for a given period. Slowed-down components are temporarily and automatically disconnected and then accepted again in the application when their performance returns to normal values. Scenarios are rendered in a Netscape window, where a monitoring application displays the structure of the user application, maps the backbone roles onto the processing nodes of the system, and constantly reports on the events taking place in the system [7].
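For illustration, a reader for directives in this format might look as follows; the grammar is inferred from the example above, and schedule_fault_alarm is a hypothetical hook into the fault-injection alarms just described.

    #include <stdio.h>
    #include <string.h>

    extern void schedule_fault_alarm(int on_node, int id, long ticks); /* assumed */

    void read_faultrc(const char *path)
    {
        FILE *f = fopen(path, "r");
        char target[16];
        int id;
        long ticks;

        if (!f) return;
        /* Each directive spans two lines:
         *   INJECT CRASH ON <COMPONENT|NODE> <id>
         *   AFTER <ticks> TICKS                                        */
        while (fscanf(f, " INJECT CRASH ON %15s %d AFTER %ld TICKS",
                      target, &id, &ticks) == 3)
            schedule_fault_alarm(strcmp(target, "NODE") == 0, id, ticks);
        fclose(f);
    }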

This system has recently been redeveloped so as to enhance its portability and performance and to improve its resilience. The backbone is currently being implemented for target platforms based on Windows CE, VxWorks, and TEX [2]. A GSPN model of the algorithm of mutual suspicion has been developed by the University of Turin, Italy, and has been used to validate and evaluate the system. Simulations of this model proved the absence of deadlocks and livelocks. Measurements of the overheads in fault-free scenarios and when faults occur will also be collected and analysed.

Acknowledgments. This project is partly supported by an FWO Krediet aan Navorsers and by the ESPRIT-IV project 28620 “TIRAN”.

References

[1] Anonymous. IEEE standard for Heterogeneous InterConnect (HIC) (low-cost, low-latency scalable serial interconnect for parallel system construction). Technical Report 1355-1995 (ISO/IEC 14575), IEEE, 1995.

[2] Anonymous. TEX User Manual. TXT Ingegneria Informatica, Milano, Italy, 1997.

[3] K. P. Birman. Replication and fault tolerance in the Isis system. ACM Operating Systems Review, 19(5):79–86, 1985.

[4] Oliver Botti, Vincenzo De Florio, Geert Deconinck, Susanna Donatelli, Andrea Bobbio, Axel Klein, H. Kufner, Rudy Lauwereins, E. Thurner, and E. Verhulst. TIRAN: Flexible and portable fault tolerance solutions for cost effective dependable applications. In P. Amestoy et al., editors, Proc. of the 5th Int. Euro-Par Conference (EuroPar’99), Lecture Notes in Computer Science, volume 1685, pages 1166–1170, Toulouse, France, August/September 1999. Springer-Verlag, Berlin.

[5] Flaviu Cristian and Christof Fetzer. The timed asynchronous distributed system model. IEEE Trans. on Parallel and Distributed Systems, 10(6):642–657, June 1999.

[6] Flaviu Cristian and Frank Schmuck. Agreeing on processor group membership in asynchronous distributed systems. Technical Report CSE95-428, UCSD, 1995.

[7] Vincenzo De Florio, Geert Deconinck, Mario Truyens, Wim Rosseel, and Rudy Lauwereins. A hypermedia distributed application for monitoring and fault-injection in embedded fault-tolerant parallel programs. In Proc. of the 6th Euromicro Workshop on Parallel and Distributed Processing (PDP’98), pages 349–355, Madrid, Spain, January 1998. IEEE Comp. Soc. Press.


Figure 8. A fault is injected on node 0 of a system of four nodes. Node 0 hosts the manager of the backbone. In the top left picture the user selects the fault to be injected and connects to a remotely controllable Netscape browser. The top right picture shows this latter as it renders the shape and state of the system. The textual window reports the current contents of the list of alarms used by task A[0]. In the bottom left picture the crash of node 0 has been detected and a new manager has been elected. On election, the manager expects node 0 to be back in operation after a recovery step. This recovery step is not performed in this case. As a consequence, node 0 is detected as inactive and labeled as “KILLED”.

[8] Vincenzo De Florio, Geert Deconinck, and Rudy Lauwereins. A time-out management system for real-time distributed applications. Submitted for publication in IEEE Trans. on Computers.

[9] Vincenzo De Florio, Geert Deconinck, and Rudy Lauwereins. The recovery language approach for software-implemented fault tolerance. Submitted for publication in ACM Transactions on Computer Systems.

[10] Geert Deconinck, Vincenzo De Florio, Rudy Lauwereins, and Ronnie Belmans. A software library, a control backbone and user-specified recovery strategies to enhance the dependability of embedded systems. In Proc. of the 25th Euromicro Conference (Euromicro ’99), Workshop on Dependable Computing Systems, volume 2, pages 98–104, Milan, Italy, September 1999. IEEE Comp. Soc. Press.

[11] Geert Deconinck, Mario Truyens, Vincenzo De Florio, Wim Rosseel, Rudy Lauwereins, and Ronnie Belmans. A framework backbone for software fault tolerance in embedded parallel applications. In Proc. of the 7th Euromicro Workshop on Parallel and Distributed Processing (PDP’99), pages 189–195, Funchal, Portugal, February 1999. IEEE Comp. Soc. Press.

[12] Yennun Huang and Chandra M. R. Kintala. Software fault tolerance in the application layer. In Michael Lyu, editor, Software Fault Tolerance, chapter 10, pages 231–248. John Wiley & Sons, New York, 1995.

[13] D. Powell. Preliminary definition of the GUARDS architecture. Technical Report 96277, LAAS-CNRS, January 1997.


Figure 9. When the user selects a circular icon in the Web page of Fig. 2, the browser replies with a listing of all the events that took place on the corresponding node. Here, a list of the events that occurred on task D[1] during the experiment shown in Fig. 8 is displayed. Events are labeled with an event id and with the time of occurrence (in seconds). Note in particular event 15, corresponding to the deduction “a node has crashed”, and events 16–18, in which the election of the manager takes place and task D[1] takes over the role of the former manager, task D[0]. Restarting as manager, task D[1] resets the local clock.

[14] D. K. Pradhan. Fault-Tolerant Computer Systems Design. Prentice-Hall, Upper Saddle River, NJ, 1996.

[15] M. Schuette and J. P. Shen. Processor control flow monitoring using signatured instruction streams. IEEE Trans. on Computers, 36(3):264–276, March 1987.