Buys2011a


Towards Context-Aware Adaptive Fault Tolerance in SOA Applications

Jonas Buys
Performance Analysis of Telecommunication Systems
University of Antwerp
B-2020, Antwerp, Belgium
[email protected]

Vincenzo De Florio
Performance Analysis of Telecommunication Systems
University of Antwerp
B-2020, Antwerp, Belgium
[email protected]

Chris Blondia
Performance Analysis of Telecommunication Systems
University of Antwerp
B-2020, Antwerp, Belgium
[email protected]

ABSTRACT
Software components are expected to exhibit highly dependable characteristics in mission-critical applications, particularly in the areas of reliability and timeliness. Redundancy-based fault-tolerant strategies have long been used as a means to avoid a disruption in the service provided by the system in spite of the occurrence of failures in the underlying components. Adopting these fault-tolerance strategies in highly dynamic distributed computing systems, in which components often suffer from long response times or temporary unavailability, does not necessarily result in the anticipated improvement in dependability.

In fact, as these dependability strategies are usually statically predefined and immutable, a change in the operational status (context) of any of the components involved may very well jeopardise the schemes’ overall effectiveness. In this paper, a novel dependability strategy is introduced supporting advanced redundancy management, aiming to autonomously tune its internal configuration in view of changes in context. It is apparent from our preliminary experimentation that this strategy can effectively achieve an optimal trade-off between service reliability and performance-related factors such as timeliness and the degree of redundancy employed.

A prototypical service-oriented implementation of the proposed adaptive fault tolerant strategy is presented thereafter, leveraging WS-* specifications to gather and disseminate contextual information.

Categories and Subject Descriptors
C.2.4 [Distributed Systems]: Distributed applications; C.4 [Performance of Systems]: Fault tolerance; Reliability, availability, and serviceability; Measurement techniques; D.2.0 [Software Engineering]: Standards; D.2.8 [Metrics]: Performance measures

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
DEBS’11, July 11–15, 2011, New York, New York, USA.
Copyright 2011 ACM 978-1-4503-0423-8/11/07 ...$10.00.

General Terms
Measurement, Reliability, Performance, Algorithms, Design

Keywords
dependability, service-oriented architecture (SOA), context-awareness, adaptive fault tolerance, quality of service (QoS), distance-to-failure (dtof), WS-* specifications

1. INTRODUCTION
There is a growing move to transform legacy distributed systems into service-oriented architectures (SOA), mainly driven by the prospects of interoperability, agility and legacy leverage. The widespread adherence to the service-oriented computing paradigm can be justified as it comprises the best practices in distributed computing of, roughly estimated, the past twenty years, and by the numerous standardisation initiatives backed by major industry consortia. Among the available technological solutions to SOA, XML-based web services, which have become the predominant implementation technology for encapsulating and deploying software components, are now being used in a diversity of application domains, ranging from enterprise software to embedded systems.

Business- and mission-critical applications are increasingly expected to exhibit highly dependable characteristics, particularly in the areas of availability and QoS-related factors such as timeliness. For this type of applications, a complete cessation or a subnormal performance of the service they provide, as well as late or invalid results, are likely to result in significant monetary penalties, environmental disaster or human injury. However, software components deployed within distributed computing systems may inherently suffer from long response times or temporary unavailability, the latter due to failures having occurred. Considering the compositional nature of many service-oriented applications, it is easily foreseeable that failures in the constituent components not properly dealt with can propagate and may subsequently perturb the service provided by the application. Moreover, each of the web services used within a given application introduces a potential point of failure.

There exist numerous web services (WS-*) specifications related to the dependability of XML-based web services, mainly in the areas of reliable messaging, transactional support and end-to-end-security [1]. Although rudimentary syntactical constructs for dealing with the previously


described deficiencies have been provided in orchestration tools such as WS-BPEL, XML-based SOA does not, in itself, necessarily contribute to the construction of dependable web services.

Redundancy-based fault-tolerant strategies have long been used as a means to avoid disruptions in the service provided by the system in spite of failures having occurred or occurring in the underlying software components or hardware. Various approaches to achieve fault tolerance have appeared in the literature [2, 3, 4, 5]. Common to all these approaches is a certain amount of redundancy aiming to guarantee high availability and increased reliability of the functional service provided by the redundant system components. Deploying multiple instances of a particular software component in a distributed system has proved successful in improving the scalability, and may as well lower the risk of a complete system failure as the result of hardware failures [6, pp. 345–349].

It is however estimated that the vast majority of computer errors originate from software faults, estimations ranging from 60 up to 90 percent [7, 8]. Within distributed SOA applications, the bulk of the complexity is situated in the application layer, and there always remain design faults which eluded detection despite rigorous and extensive testing and debugging. Hence, traditional replication schemes, which were conceived to tolerate permanent hardware faults primarily and transient faults caused by external disturbances secondarily, do not offer sufficient protection for tolerating software faults (often referred to as design or specification faults) [3, 4].

Current software fault tolerant techniques attempt to leverage the experience of hardware redundancy schemes, and require diversity in the designs of redundant components in order to withstand design faults. The rationale is that redundantly deploying multiple functionally-equivalent but independently implemented software components will hopefully reduce the probability of a specific software fault affecting multiple implementations simultaneously, thereby keeping the system operational. This fundamental conjecture would guarantee that correlated failures do not translate into the immediate exhaustion of the available redundancy, as it would happen, e.g., by using identical replicas of the same software component [5, Chap. 8]. Replicating software would obviously incur replicating any residual dormant software fault.

The n-version programming (NVP) mechanism, a well proven design pattern for software fault tolerance, was first introduced in 1985 as “the independent generation of n > 1 functionally-equivalent programs from the same initial specification” [9]. An n-version module constitutes a fault-tolerant software unit: a client-transparent replication layer in which all n programs, called versions, receive a copy of the user input and are orchestrated to independently perform their computations in parallel. It depends on a generic decision algorithm to determine a result from the individual outputs of the versions employed within the unit. Many different types of decision algorithms have been developed, which are usually implemented as generic voters. Examples include, amongst others, majority, plurality and consensus voting [10][5, Chap. 4].
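The behaviour of an n-version module with a majority voter can be illustrated with a minimal Python sketch. The function name `nvp_majority`, the use of a thread pool, and the convention of treating an exception as a missing vote are assumptions made for illustration only; they are not taken from the paper. Outputs are assumed to be hashable, and `None` is returned when no absolute majority exists.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def nvp_majority(versions, user_input):
    """Run every version on the same input and vote on the outputs.

    Returns the value agreed on by an absolute majority of the n versions,
    or None when no such majority exists (illustrative convention).
    """
    n = len(versions)
    if n < 2:
        raise ValueError("NVP requires n > 1 versions")

    def call(version):
        # Assumption: a version that raises is treated as producing no vote.
        try:
            return ("ok", version(user_input))
        except Exception:
            return ("failed", None)

    # All versions receive a copy of the input and run in parallel.
    with ThreadPoolExecutor(max_workers=n) as pool:
        outcomes = list(pool.map(call, versions))

    votes = Counter(value for status, value in outcomes if status == "ok")
    if not votes:
        return None
    value, count = votes.most_common(1)[0]
    # The majority is measured against n, not against the votes received.
    return value if count > n // 2 else None
```

With three versions of which one returns a deviating value, the two agreeing versions still form an absolute majority; with one deviating and one crashing version, no result is adjudicated.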

Regardless of the various controversies and debate to which design diversity has been subjected ever since its inception, the application of redundant implementations does have the potential to improve the reliability and scalability of software systems. It clearly brings with it some tangible impacts, the foremost of which are a significantly higher development cost and associated, increased infrastructural requirements. This additional investment could be justified for judiciously selected key components providing critical functions or with high reuse potential. Furthermore, even though the architectural complexity of the voting mechanism within an NVP module is of minimal magnitude compared to the complexity of the application logic, critics would state that it can become a single point of failure. This issue is traditionally overcome by conducting extensive testing to determine its reliability.

SOA systems exhibit highly dynamic characteristics, and changes in the operational status of web services, in particular their availability and response time, are likely to occur frequently. Conversely, classic fault-tolerant design patterns, including NVP, have traditionally been applied on an immutable set of resources (i.e. replicas), and are context-agnostic, i.e. they do not take account of changes in the operational status of any of the components contained within the redundancy scheme, which may jeopardise the effectiveness of the overall fault-tolerant unit.

Firstly, web services may often suffer from temporary unavailability. On the one hand, this may be the result of a failure, e.g. originating from the manifestation of a design fault or hardware malfunction. On the other hand the web service may become unreachable because of a network failure. The temporary unavailability of any specific web service comprised within a service-oriented application may cause the whole application to fail. Whereas such a point of failure may be addressed by applying, e.g. NVP with redundant web service implementations, such redundancy-based fault tolerance schemes will not necessarily result in an increase in availability. “Whether or not the availability is improved depends on the amount of redundancy employed and the availability of the software components used to construct the system” [4, 11]. From that point of view, it is therefore apparent that the effectiveness of any fault-tolerant redundancy scheme depends on how frequently its comprised resources become (temporarily) unavailable. Indeed, an NVP scheme, e.g. one based on majority voting, would fail to guarantee the availability of the service it seeks to provide if a majority of the resources employed have simultaneously become unavailable.
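A small calculation can make the quoted dependence on component availability concrete. The sketch below is not part of the paper: it assumes, purely for illustration, that every version is available independently with the same probability `a`, and it computes the probability that an absolute majority of `n` versions is available at the same time. The function name `majority_availability` is hypothetical.

```python
from math import comb


def majority_availability(n, a):
    """Probability that at least floor(n/2) + 1 of n versions are available.

    Assumption (not from the paper): versions are available independently,
    each with the same probability a.
    """
    need = n // 2 + 1
    return sum(comb(n, k) * a**k * (1 - a) ** (n - k) for k in range(need, n + 1))
```

Under these assumptions, three versions with a = 0.9 yield roughly 0.972, an improvement over a single version, whereas three versions with a = 0.4 yield roughly 0.352, which is worse than a single version.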

Secondly, the use of remotely deployed web service components may occasionally suffer from long response times, which is mainly to be attributed to any network latency as the result of message exchanges and, to a lesser extent, to excessive concurrency demands. For time-critical applications in which the timely availability of results is of paramount importance, any additional delay in the response time of a web service involved in an NVP scheme may impact the scheme’s effectiveness to deliver an outcome within the imposed time constraints [12, 13].

There is thence an urgent need for adaptive software fault tolerant solutions, encompassing sophisticated context-aware redundancy management. The characteristic of context-awareness, referring to the fact that a redundancy scheme is aware of the environment (i.e. the context) in which it operates, is of considerable importance to support comprehensive redundancy management. Examples of contextual information include, but are not limited to, the


amount of redundancy currently employed, the evolution of voting outcomes, and the operational status of each of the available resources such as dependability, load, execution time etc. Triggered by changes in the context, such adaptive fault tolerant strategies may autonomously tune the amount of redundancy or dynamically alter the selection of resources currently employed in the redundancy scheme so as to maintain the effectiveness of the dependability strategy, mitigating the adverse effect of employing inapt resources. In this paper, a novel dependability strategy is introduced supporting advanced context-aware redundancy management, aiming to autonomously and transparently tune its internal configuration. Designed to dynamically find the optimal redundancy configuration in order to preserve the intended dependability, the objective of this parameterised redundancy model is twofold.

Firstly, it is responsible for continuously monitoring any changes in the operational status of the available resources and other contextual information. Its purpose is to make sure that resources that may threaten the effectiveness of the overall redundancy scheme are excluded. It is worth noting that this resource selection procedure may target an optimal trade-off between dependability attributes as well as performance-related factors such as timeliness.

Secondly, the degree of redundancy employed is highly dependent on the current status of the context. On the one hand, in the absence of exceptional disturbances, the scheme should scale down its use of redundant resources so as to avoid unnecessary expenditure of resources. On the other hand, when the foreseen amount of redundancy is not enough to compensate for the currently experienced disturbances, it would be beneficial to dynamically revise that amount including additional resources, if available.

It is apparent from our preliminary experimentation that the proposed adaptive strategy enhances the overall effectiveness of proven fault-tolerance strategies, with little overhead incurred, attaining optimal performance, economical resource allocation notwithstanding.

The remainder of this paper is structured as follows: A set of application-agnostic context properties is presented in Sect. 2. Next, a property is introduced to capture the suitability of a particular software component within an NVP-based redundancy scheme with majority voting (NVP/MV). We then move on to elaborate on the internals of the proposed adaptive fault tolerance strategy in Sect. 4. A prototypical service-oriented implementation of this strategy is presented thereafter, leveraging established WS-* specifications. Furthermore, an illustrative example is given in Sect. 6 to clarify the measures and algorithm defined in the previous sections. Finally, related work is referred to in Sect. 7.

2. APPLICATION-AGNOSTIC CONTEXT PROPERTIES

The effectiveness of a fault-tolerant redundancy scheme such as NVP is largely determined by its redundancy configuration, i.e. the amount of redundancy used and, accordingly, a selection of functionally-equivalent software components. On the one hand, the amount of redundancy, in conjunction with the voting algorithm, controls how many simultaneously failing versions the NVP composite can tolerate whilst continuing to provide the user with the expected service. For instance, an NVP/MV scheme can mask failures affecting the availability of up to a minority of its versions, which is indeed a function of the amount of redundancy. On the other hand, the dependability of any NVP composite is determined by the dependability of the versions employed in its redundancy scheme. As elucidated in [4, Sect. 4.3.3], the use of replicas of poor reliability can result in a system tolerant of faults but with poor reliability. Likewise, versions exhibiting low availability may result in a failure of the scheme when the amount of redundancy becomes insufficient to mask the ensuing failures. It is therefore of paramount importance to construct fault-tolerant systems using highly dependable software components.

Redundancy configurations of NVP schemes have traditionally been defined with a fixed amount of redundancy and an immutable set of versions. Having motivated the deficiencies of this approach in Sect. 1, this paper will introduce an adaptive NVP-based algorithm (A-NVP) in which the redundancy configuration is dynamically constructed as a function of the context in which it operates. A context property will be introduced in Sect. 3, which will make it possible to obtain information regarding the reliability of a single version involved in an NVP/MV scheme.

Whether or not a particular version contributes to the success of a redundancy scheme may also depend on other aspects of a version’s operational status. For instance, some mission-critical systems may require timely results. NVP voting schemes may be designed to return a reply within a guaranteed time slot. Any version failing to produce its response within the time constraints imposed by the voting system would translate into a performance failure and, as such, have a detrimental impact on the effectiveness of the redundancy configuration [2, 13]. The response time of a version will therefore also be considered as a context property in Sect. 4. Finally, another property of interest is the ability of the A-NVP scheme to optimally balance the load between the available resources. In order to achieve this, a further context property to be considered is the number of pending requests, i.e. the number of requests currently being processed by a software component.

3. APPROXIMATING VERSION RELIABILITY IN NVP/MV

It was already pointed out that the dependability of any NVP composite is affected by the dependability of the components integrated within. Controversial opinions exist on whether it is meaningful to use probabilistic measures of dependability, most of which are based on an analogy of traditional hardware dependability, to evaluate the quality of software. In particular, many people have questioned the adequacy of software reliability to quantify the operational profile of a software system.

A first major objection that has frequently been put forth is that, in spite of the proliferation of software reliability models that have been developed since the early 1970s, only few of these models seem to be able to capture and quantify a satisfying amount of complexity without excessive limitations [15]. Failing to adequately quantify the reliability of a software component inhibits the application of commonly used analytical combinatorial


techniques for reliability analysis of hardware redundancy schemes to equivalent schemes involving diversely designed functionally-equivalent software components [4, Chap. 4].

Moreover, it is hard to determine a quantitative approximation of the overall failure rate for a given software component. Apart from residual design faults, in SOA applications, the failure rate of a web service may be influenced as a consequence of a failure in the underlying deployment platform or hardware, in any required external web service or network connectivity failures [14, 13].

As an alternative to a probabilistic measure for the reliability of a software component, we now define a generic property to capture the suitability of a particular software component within an NVP/MV redundancy scheme.

3.1 Capturing the Effectiveness of the Current Redundancy Configuration

The distance-to-failure (dtof) metric, first introduced in [16], was meant to provide a quantitative estimation of how closely the currently allocated amount and selection of resources within an NVP/MV composite matched the observed disturbances, whether by shortcoming or excess. More specifically, dtof can be used to deduce a measure of how well the currently employed redundancy configuration is capable of ensuring the availability of the composite’s service.

We define the set $V$ containing all functionally-equivalent versions available in the system. Let $L$ be a set of monotonically increasing, strictly positive integer indices, such that each single voting round is uniquely identified. For a given round $l \in L$, the amount of redundancy used within the NVP/MV scheme is denoted as $n^{(l)} > 1$, such that the versions employed for round $l$ are contained within $V^{(l)} \subseteq V$ and $n^{(l)} = |V^{(l)}|$. An indicator random variable $E^{(l)}(v)$ is defined for all $v \in V$

\[
E^{(l)}(v) =
\begin{cases}
0 & v \in V \setminus V^{(l)} \quad \text{(1a)} \\
1 & v \in V^{(l)} \quad \text{(1b)}
\end{cases}
\]

and can be used to discriminate between idling versions and versions that are engaged in the current voting round $l$.

The essential part of any voting procedure is the construction of a partition $\wp^{(l)} = \{P_1, \ldots, P_{k^{(l)}}\} \cup P_F^{(l)}$ of the set of versions $V^{(l)}$. This partitioning procedure is heavily influenced by the disturbances that affected any of the versions involved during the voting round $l$. Throughout this paper, the notion of disturbance is used to denote the event of a single version struck by a failure perturbing the service it is expected to provide. We will now elaborate on several types of disturbances relevant to NVP/MV schemes and their effect on the generated partition. More specifically, disturbances will be categorised using the comprehensive list of failure classes for software components as presented in [2, Sect. 1.2.1].

A first category of disturbances comprises different types of failures resulting in the (temporary) unavailability of replicas. Having failed to obtain a response from these faulty versions, the voting algorithm will classify them in $P_F^{(l)}$. Examples include performance, omission and crash failures such as network connectivity failures, hardware malfunctions, design faults etc. [13].

Whereas it would be expected that functionally-equivalent versions sharing a common specification would return the same response when provided with identical input, discrepancies between their response values may arise due to response value failures [13]. The partition will subsequently hold equivalence classes $P_1, \ldots, P_{k^{(l)}}$, such that each of these sets contains those versions which reported identical results (see footnote 1). Ideally, in a situation without disturbances of any kind, i.e. unanimous consensus, only one class $P_1$ would need to be created. Contrarily, dissenting versions require the creation of additional equivalence classes.
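The partitioning step described above can be sketched in a few lines of Python. This is an illustrative sketch only: the marker `NO_RESPONSE`, the function name `partition_votes` and the dictionary-based input are hypothetical choices, and responses are assumed to be hashable values (for instance serialised XML strings) compared by strict equality.

```python
from collections import defaultdict

# Hypothetical marker for a version from which no response was obtained.
NO_RESPONSE = object()


def partition_votes(responses):
    """Partition the engaged versions of one voting round.

    responses maps a version identifier to its response, or to NO_RESPONSE.
    Returns (classes, failed): classes is the list of equivalence classes
    P_1..P_k of versions that reported identical results, and failed is
    the set P_F of versions that did not respond.
    """
    by_value = defaultdict(set)
    failed = set()
    for version, response in responses.items():
        if response is NO_RESPONSE:
            failed.add(version)
        else:
            by_value[response].add(version)  # strict (exact) comparison
    return list(by_value.values()), failed
```

With unanimous consensus the returned list holds a single class; every dissenting response value adds one more class.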

Let $P^{(l)}$ be the set in the generated partition $\wp^{(l)} \setminus P_F^{(l)}$ of largest cardinality and $c_{\max}^{(l)} = |P^{(l)}|$. In other words, $c_{\max}^{(l)}$ represents the largest consent found between the $n^{(l)}$ replicas at the end of voting round $l$. Then, in order for the majority voting procedure to be able to adjudicate the result of the scheme, there should be a consensus amongst an absolute majority of the $n^{(l)}$ versions, i.e. $c_{\max}^{(l)} \geq M^{(l)}$. Conversely, if $c_{\max}^{(l)} \leq m^{(l)}$, the voting procedure will not be able to determine a correct result.

\[
M^{(l)} = \left\lceil \frac{n^{(l)} + 1}{2} \right\rceil \qquad (2)
\]

\[
m^{(l)} = \left\lfloor \frac{n^{(l)}}{2} \right\rfloor \qquad (3)
\]

It logically follows from (2) and (3) that $M^{(l)} = m^{(l)} + 1$. Given these ancillary variables, the distance-to-failure for a specific voting round $l$ is defined as

\[
\mathit{dtof}^{(l)} =
\begin{cases}
0 & c_{\max}^{(l)} < M^{(l)} \quad \text{(4a)} \\
M^{(l)} - d^{(l)} & c_{\max}^{(l)} \geq M^{(l)} \wedge n^{(l)} \text{ odd} \quad \text{(4b)} \\
m^{(l)} - d^{(l)} & c_{\max}^{(l)} \geq M^{(l)} \wedge n^{(l)} \text{ even} \quad \text{(4c)}
\end{cases}
\]

where $d^{(l)}$ in (4b) and (4c) represents $n^{(l)} - c_{\max}^{(l)}$, i.e. the number of versions that are either faulty or that returned a vote that differs from the majority, if any such majority exists (see footnote 2). If no majority can be found, dtof returns 0. As can be easily seen, dtof returns an integer in $[0, M^{(l)}]$ for any odd $n^{(l)}$ or in $[0, m^{(l)}]$ for any even $n^{(l)}$. This integer represents how close we were to failure at the end of voting round $l$. The maximum distance is reached when there is full consensus among the replicas, i.e. $\wp^{(l)} \setminus P^{(l)} = \emptyset$, therefore $V^{(l)} = P^{(l)}$ and accordingly $c_{\max}^{(l)} = n^{(l)}$. Conversely, the larger the dissent, the smaller is the value returned by dtof, and the closer we are to the failure of the voting scheme. In other words, a large dissent (that is, small values for dtof) is interpreted as a symptom that the current redundancy configuration is not able to counterbalance the currently experienced disturbances. Figure 1 depicts some examples when the number of replicas is 7.
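Definitions (2), (3) and (4) translate directly into a short function. The following Python sketch is offered as an illustration; the function names `thresholds` and `dtof` and the input validation are assumptions, while the arithmetic follows the equations above.

```python
def thresholds(n):
    """Return (M, m) for n versions, as in equations (2) and (3)."""
    m = n // 2   # floor(n / 2)
    M = m + 1    # equals ceil((n + 1) / 2)
    return M, m


def dtof(n, c_max):
    """Distance-to-failure of one voting round, following equation (4).

    n     -- amount of redundancy used in the round (n > 1)
    c_max -- size of the largest equivalence class found by the voter
    """
    if n <= 1 or not 0 <= c_max <= n:
        raise ValueError("expected n > 1 and 0 <= c_max <= n")
    M, m = thresholds(n)
    if c_max < M:                      # (4a): no absolute majority
        return 0
    d = n - c_max                      # versions that failed or dissented
    return (M if n % 2 else m) - d     # (4b) for odd n, (4c) for even n
```

For n = 7 this reproduces the dtof values listed for Figure 1: `dtof(7, 7)` is 4, `dtof(7, 6)` is 3, `dtof(7, 5)` is 2, and any `c_max` below 4 gives 0.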

Intuitively, $\mathit{dtof}^{(l)} = 1$ corresponds to the existence of a consent between precisely $M^{(l)}$ versions, given that $n^{(l)}$ versions were involved during the current voting round $l$. Accordingly, for any $\mathit{dtof}^{(l)} > 0$, one can observe that

\[
c_{\max}^{(l)} = M^{(l)} + \left(\mathit{dtof}^{(l)} - 1\right) \qquad (5)
\]

Footnote 1: Note that in this paper, strict voting will be applied in view of the exchange of XML messages in SOA applications.
Footnote 2: For the sake of brevity, we say that the faulty versions in $P_F^{(l)}$ are in dissent with the responses returned by versions in $P_1, \ldots, P_{k^{(l)}}$.


Figure 1: Distance-to-failure (dtof) in an NVP/MV scheme with n = 7 replicas. Panels: (a) dtof = 4, (b) dtof = 3, (c) dtof = 2, (d) dtof = 0. In (a), unanimous consensus is reached, which corresponds to the farthest "distance" to failure. For scenarios (b) and (c), more and more votes dissent from the majority (red and yellow circles) and correspondingly the distance shrinks. In (d), no majority can be found; thus, failure is reached.

In other words, $\mathit{dtof}^{(l)} - 1$ essentially quantifies how many versions there exist in excess of the mandatory $M^{(l)}$ versions that collectively constitute the majority for round $l$ (see footnote 3).
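As a worked check of (4) and (5), consider a round with seven versions in which six agree, which gives the dtof value shown in panel (b) of Figure 1. The numbers are chosen here for illustration:

```latex
\begin{align*}
n^{(l)} &= 7, \qquad M^{(l)} = \left\lceil \tfrac{7 + 1}{2} \right\rceil = 4, \qquad c_{\max}^{(l)} = 6, \\
d^{(l)} &= n^{(l)} - c_{\max}^{(l)} = 1, \qquad \mathit{dtof}^{(l)} = M^{(l)} - d^{(l)} = 3, \\
M^{(l)} + \left(\mathit{dtof}^{(l)} - 1\right) &= 4 + 2 = 6 = c_{\max}^{(l)}.
\end{align*}
```

The majority of four is thus reconfirmed by two surplus versions, so the scheme could have withstood two further disturbances in that round.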

3.2 Quantifying the historical impact of a version on an NVP/MV scheme

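Whereas the dtof context property is a valuable metric for capturing the instantaneous impact of a given redundancy configuration on the effectiveness of an NVP/MV scheme, it fails to assess the impact of a particular version on the scheme over time. We therefore define a measure to quantify the historical and relative impact of any version $v \in V$ on the redundancy scheme: the normalised dissent

\[
D^{(l)}(v) =
\begin{cases}
0 & \#\mathit{rounds}(v) = 0 \quad \text{(6a)} \\
D^{(l-1)}(v) + p^{(l)}(v) & E^{(l)}(v) = 1 \wedge \mathit{dtof}^{(l)} > 0 \wedge v \notin P^{(l)} \quad \text{(6b)} \\
D^{(l-1)}(v) + p^{(l)}(v) & E^{(l)}(v) = 1 \wedge \mathit{dtof}^{(l)} = 0 \quad \text{(6c)} \\
D^{(l-1)}(v) \times r^{(l)}(v) & E^{(l)}(v) = 0 \quad \text{(6d)} \\
D^{(l-1)}(v) \times r^{(l)}(v) & E^{(l)}(v) = 1 \wedge \mathit{dtof}^{(l)} > 0 \wedge v \in P^{(l)} \quad \text{(6e)}
\end{cases}
\]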

The number of voting rounds in which a version $v \in V$ was actively engaged, up until and including the current voting round $l$, is denoted by $\#\mathit{rounds}(v)$. The value of the normalised dissent is initialised to 0; see (6a). After that, it is updated at the end of each successive voting round for all versions $v \in V$.

The rationale is that a penalty $p^{(l)}(v) \in \, ]0, 1]$ is imposed on any engaged version in dissent with the majority that resulted at the end of voting round $l$, or when simply no majority was found, which corresponds respectively to (6b) and (6c). A version $v$ that repeatedly failed to provide a useful contribution to the voting procedure will therefore exhibit a higher value $D^{(l)}(v)$. Inversely, a reward $r^{(l)}(v) \in \, ]0, 1[$ will weigh down previously accumulated penalties as they get older, cf. (6d) and (6e). Both penalisation and reward mechanisms are presented in greater detail hereafter.

Footnote 3: $P^{(l)}$ was previously used to denote the set of all versions in $V^{(l)}$ that contributed to the majority found at the end of round $l$. Let $P_M^{(l)} \subseteq P^{(l)}$ such that $|P_M^{(l)}| = M^{(l)}$. If $\mathit{dtof}^{(l)} = 1$, $P_M^{(l)} = P^{(l)}$. For $\mathit{dtof}^{(l)} > 1$, the majority is reconfirmed by $\mathit{dtof}^{(l)} - 1$ additional versions, i.e. $|P^{(l)} \setminus P_M^{(l)}| = \mathit{dtof}^{(l)} - 1$.
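The case analysis of (6) can be illustrated with a short Python sketch. It is not taken from the prototype: the function name and its arguments are assumptions made for illustration, and the penalty and reward values are treated as given inputs.

```python
def update_dissent(prev, engaged, dtof, in_majority, penalty, reward):
    """Return D^(l)(v) given D^(l-1)(v), following the cases (6b) to (6e).

    prev        -- D^(l-1)(v); 0 for a version that was never engaged, cf. (6a)
    engaged     -- True if E^(l)(v) = 1
    dtof        -- distance-to-failure of round l (0 means no majority)
    in_majority -- True if v belongs to P^(l)
    penalty     -- p^(l)(v), only used in cases (6b) and (6c)
    reward      -- r^(l)(v), only used in cases (6d) and (6e)
    """
    if engaged and (dtof == 0 or not in_majority):
        # (6b) dissent with the majority, or (6c) no majority at all
        return prev + penalty
    # (6d) idle version, or (6e) engaged version supporting the majority
    return prev * reward
```

Starting from the initial value 0 of (6a), applying this function at the end of each voting round yields the running value of the normalised dissent for a version.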

3.2.1 Acquiring Context Information

A substantial characteristic of both models is that the penalty addends and the reward factors they generate aim to capture the current context of the NVP/MV voting scheme. For a given voting round $l$ during which a majority could be found, i.e. $\mathit{dtof}^{(l)} > 0$, let

$$
w_e^{(l)} = 1 - \frac{\mathit{dtof}^{(l)} - 1}{n^{(l)} - M^{(l)}} \qquad (7)
$$

The above definition takes advantage of the dtof metric as defined in (4) to acquire information on the effectiveness of the redundancy configuration employed during the voting round $l$. The fraction involved in (7) was designed so as to provide insight into the robustness of the redundancy configuration in the face of the disturbances encountered. Specifically, the numerator can be regarded as the extent to which the majority is reconfirmed by $\mathit{dtof}^{(l)} - 1$ surplus replicas, cf. (5). This also expresses how many additional disturbances the redundancy configuration could have withstood during round $l$. Conversely, the denominator represents the maximum number of disturbances that the scheme can withstand, given the available amount of redundancy, $n^{(l)}$. As such, $w_e^{(l)}$ provides an estimation of how close a given redundancy configuration was to exhausting the available amount of redundancy whilst it tried to counterbalance the disturbances experienced during round $l$. Considering the premise that $\mathit{dtof}^{(l)} > 0$, it can be seen from (7) that $w_e^{(l)}$ is a real number contained within the interval $[0, 1]$. A critically low value $\mathit{dtof}^{(l)} = 1$, i.e. $w_e^{(l)} = 1$, represents a situation for which the majority was attained by only $M^{(l)}$ versions. During this voting round $l$, the available redundancy $n^{(l)}$ was completely exhausted to counterbalance the maximal number of disturbances the scheme could tolerate, i.e. $n^{(l)} - M^{(l)}$. Had the scheme been subjected to additional disturbances affecting any of the versions $v \in P^{(l)}$, it would have failed to reach a majority. Similarly, a value $w_e^{(l)} = 0$ corresponds to a voting round with full unanimity, i.e. $c_{max}^{(l)} = n^{(l)}$. Such additional consent contributes to the robustness of the scheme and its redundancy configuration, for it is able to withstand up to $n^{(l)} - M^{(l)}$ disturbances.

Furthermore, for $v \in V^{(l)}$, we define an ancillary function $c^{(l)}(v) = |P_j|$ for $P_j \in \wp^{(l)}$ such that $v \in P_j$, which yields the number of versions that reported the same result as $v$ at the end of round $l$. It can easily be seen that the range of this function is $[1, n^{(l)}]$.
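The two quantities introduced in this subsection can be sketched in Python as follows. The function names are hypothetical. The majority threshold $M^{(l)}$ is passed in explicitly because its definition lies outside this section, and the handling of the degenerate case $n^{(l)} = M^{(l)}$ is an assumption of this sketch rather than something the text prescribes.

```python
def effectiveness_weight(dtof, n, majority):
    """w_e of (7); only defined for rounds in which a majority was found."""
    if dtof <= 0:
        raise ValueError("w_e is only defined for dtof > 0")
    if n == majority:
        # Assumption: with no spare redundancy at all, treat the round as critical.
        return 1.0
    return 1.0 - (dtof - 1) / (n - majority)


def class_size(partition, v):
    """c(v): size of the equivalence class of the partition that contains v."""
    for block in partition:
        if v in block:
            return len(block)
    raise KeyError(v)
```

For a scheme with n = 7 replicas and an assumed majority threshold of 4, `effectiveness_weight(4, 7, 4)` corresponds to the unanimous case and `effectiveness_weight(1, 7, 4)` to a bare majority.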

3.2.2 Penalisation Mechanism

We now characterise the penalisation mechanism used in (6b) and (6c) for a subset of engaged versions $V^{(l)} \subseteq V$, that is, a set of versions $v \in V^{(l)}$ for which $E^{(l)}(v) = 1$:


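$$
p^{(l)}(v) = \left\{
\begin{array}{lll}
s^{(l)}(v) \times w_e^{(l)} & v \notin P_F^{(l)} \wedge \mathit{dtof}^{(l)} > 0 & \text{(8a)} \\
\dfrac{m^{(l)} - \left(c^{(l)}(v) - 1\right)}{m^{(l)}} & v \notin P_F^{(l)} \wedge \mathit{dtof}^{(l)} = 0 & \text{(8b)} \\
1 & v \in P_F^{(l)} & \text{(8c)}
\end{array}
\right.
$$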

The penalty $p^{(l)}(v)$ incurred by an engaged version $v \in V^{(l)}$ in dissent with the majority found at the end of round $l$ is given by (8a). The idea behind the multiplier $w_e^{(l)}$ is that a replica disagreeing with the majority during round $l$ should be penalised relative to the detrimental impact it may have on the robustness of the currently selected redundancy configuration, cf. (7). The closer round $l$ was to failure (that is, the closer to $\mathit{dtof}^{(l)} = 0$), the more strongly the multiplier penalises the dissentient replica. The further away from failure, the less we penalise, as the excess degree of consent enhances the robustness of the redundancy configuration such that it is capable of tolerating additional disturbances. Note that the above multiplier cannot evaluate to 0: at least $v$ is in dissent for round $l$, and therefore full consensus, i.e. the maximum value for $\mathit{dtof}^{(l)}$ as defined in (4), can never be reached, cf. (6b). The range of $w_e^{(l)}$, which was previously defined as $[0, 1]$ in (7), will therefore be confined to the interval $]0, 1]$.

The multiplicand $s^{(l)}(v)$ will then scale the intermediate penalty obtained using $w_e^{(l)}$ in inverse proportion to the amount of consent between a minority of engaged versions, including $v$:

$$
s^{(l)}(v) = 1 - \frac{c^{(l)}(v)}{M^{(l)}} \qquad (9)
$$

Indeed, any version $v$ in dissent with the majority found is part of a minority equivalence class in $\wp^{(l)} \setminus \{P^{(l)}, P_F^{(l)}\}$. As the range of the previously defined function $c^{(l)}(v)$ will consequently narrow to $[1, m^{(l)}]$, one can observe from (9) that the values obtained for $s^{(l)}(v)$ lie in $]0, 1[$.

Having defined the maximum plurality that is not an absolute majority in (3), the penalty for any of the versions involved in voting round $l$ for which no majority could be determined can be found using (8b). A version $v$ will be attributed the maximum penalty if its result is unique and in dissent with all the other versions, i.e. $c^{(l)}(v) = 1$. On the contrary, should there exist a minority of consentient active versions with cardinality equal to $m^{(l)}$, each of the versions would be penalised in the most gentle way. In other words, the more isolated the case, the heavier the penalty; the larger the cardinality of the minority to which a given version belongs, the less each of the versions that constitute the minority will be penalised.

Finally, faulty replicas that did not return a meaningful response are assigned the maximum penalty 1, cf. (8c).
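A compact way to read (8) and (9) together is the following Python sketch. The function and parameter names are assumptions; in particular, `majority` stands for $M^{(l)}$ and `max_minority` for $m^{(l)}$, both of which are defined outside this section and are therefore supplied by the caller.

```python
def penalty(c_v, faulty, dtof, w_e, majority, max_minority):
    """Penalty p(v) for an engaged version, following (8a), (8b) and (8c).

    c_v          -- c(v), number of versions that reported the same result as v
    faulty       -- True if v is in P_F (no meaningful response)
    dtof         -- distance-to-failure of the round (0 means no majority)
    w_e          -- effectiveness weight of (7); ignored when dtof == 0
    majority     -- M, the absolute majority threshold
    max_minority -- m, the largest plurality that is not an absolute majority
    """
    if faulty:
        return 1.0                                   # (8c)
    if dtof > 0:
        s_v = 1.0 - c_v / majority                   # (9)
        return s_v * w_e                             # (8a)
    return (max_minority - (c_v - 1)) / max_minority  # (8b)
```

With `max_minority = 2`, for instance, an isolated dissenter in a round without majority receives the full penalty 1, whereas each member of a two-version minority receives 0.5.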

3.2.3 Reward Model

Whenever a version $v \in V^{(l)}$ produces a response that complies with the majority determined at the end of voting round $l$, a reward should compensate for any penalties that may have been imposed in previous voting rounds and consequently result in the gradual decline of the normalised dissent $D^{(l)}(v)$, cf. (6e). Unlike the penalisation mechanism, which is only applicable to engaged versions, the reward model is also used for idle replicas that are not currently involved in the redundancy configuration for a given voting round $l$ but that may have been used in previous voting rounds, cf. (6d).

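Let $0 < k_2 < k_1 < k_{max} < 1$. We now define the reward factor $r^{(l)}(v)$ for a version $v \in V$ as:

$$
r^{(l)}(v) = \left\{
\begin{array}{lll}
k_1 + \left((k_{max} - k_1) \times w_i^{(l)}(v)\right) & E^{(l)}(v) = 0 & \text{(10a)} \\
k_2 + \left((k_1 - k_2) \times w_e^{(l)}\right) & E^{(l)}(v) = 1 & \text{(10b)}
\end{array}
\right.
$$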

For any version $v$, a smaller reward factor $r^{(l)}(v)$ will result in a steeper decline of its normalised dissent $D^{(l)}(v)$, whereas a larger factor would result in a more gradual decline. We now define $\#\mathit{consent}(v)$ as the number of those voting rounds which were accounted for in $\#\mathit{rounds}(v)$ for which $v$ contributed to the majority. Consequently, $\#\mathit{rounds}(v) - \#\mathit{consent}(v)$ corresponds to those voting rounds in which $v$ has been engaged, such that either $v$ was in dissent with the majority, or no majority was found at all.

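$$
w_i^{(l)}(v) = \left\{
\begin{array}{lll}
0 & \#\mathit{rounds}(v) = 0 & \text{(11a)} \\
\dfrac{\#\mathit{rounds}(v) - \#\mathit{consent}(v)}{\#\mathit{rounds}(v)} & \#\mathit{rounds}(v) > 0 & \text{(11b)}
\end{array}
\right.
$$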

With $w_i^{(l)}(v)$ defined as a real number in $[0, 1]$, (10a) shows how the reward factor is determined for an idle version $v \in V \setminus V^{(l)}$ that is not involved in the current voting round $l$. It follows that $r^{(l)}(v)$ is contained within $[k_1, k_{max}]$. The upper endpoint of the range, $k_{max}$, is defined to be close to, but less than 1. This is motivated by the fact that, if $k_{max}$ were equal to 1, a value $r^{(l)}(v) = 1$ would not be able to ensure that penalties accumulated during previous voting rounds are weighed down over time, cf. (6d) and (6e). The smallest reward $r^{(l)}(v)$ is equal to $k_1$ and corresponds to the case when $v$ did not participate in any voting round so far, i.e. (11a), or when the replica contributed to the majority for every voting round it was previously engaged in, i.e. (11b) when $\#\mathit{rounds}(v) = \#\mathit{consent}(v)$. Larger reward values will be obtained for versions $v$, up to a maximum of $k_{max}$, proportional to the relative amount of voting rounds for which an engaged version $v$ previously failed to support the voting scheme and was subsequently penalised, i.e. $w_i^{(l)}(v)$.

The reward procedure for engaged versions that were in consent with the outcome of the current voting round $l$ is described in (10b). Having defined $w_e^{(l)}$ as a real number contained within the interval $[0, 1]$ in (7), it can be seen that the range of $r^{(l)}(v)$ is delimited by $[k_2, k_1]$ for any version $v$ engaged during round $l$. As can be seen in (10b), larger values for $w_e^{(l)}$, i.e. as $\mathit{dtof}^{(l)} - 1$ approaches 0, lead to a larger reward factor $r^{(l)}(v)$, up to the maximum value $k_1$. Contrariwise, more robust redundancy configurations translate into smaller values for $w_e^{(l)}$ and will be allotted smaller values for $r^{(l)}(v)$ accordingly. This makes it possible to counterbalance and rectify a situation where $v$ was undeservedly penalised in any preceding voting rounds it participated in, i.e. $v$ did produce a correct result, but it was penalised because of an inadequate selection $V^{(l)}$.

As a final remark, we would like to point out that it was a deliberate design decision to define the reward model for idle versions in a separate range $[k_1, k_{max}]$, resulting in reward factors of comparatively greater magnitude, so as to ensure a more gradual decline in normalised dissent when compared to engaged replicas.
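The reward model of (10) and (11) can be summarised in a few lines of Python. This is a sketch under the assumption that the counters $\#\mathit{rounds}(v)$ and $\#\mathit{consent}(v)$ are available as plain integers; the function names and the keyword defaults are chosen for illustration only.

```python
def idle_weight(rounds, consent):
    """w_i(v) of (11): share of engaged rounds in which v did not support a majority."""
    if rounds == 0:
        return 0.0                                   # (11a)
    return (rounds - consent) / rounds               # (11b)


def reward(engaged, k1, k2, k_max, w_e=None, rounds=0, consent=0):
    """Reward factor r(v) of (10) for an idle or a consenting engaged version."""
    if not 0 < k2 < k1 < k_max < 1:
        raise ValueError("expected 0 < k2 < k1 < k_max < 1")
    if engaged:
        return k2 + (k1 - k2) * w_e                  # (10b), result in [k2, k1]
    return k1 + (k_max - k1) * idle_weight(rounds, consent)  # (10a), in [k1, k_max]
```

Since every factor returned is strictly smaller than 1, multiplying the normalised dissent by it, as in (6d) and (6e), always weighs down previously accumulated penalties.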

4. A NOVEL ADAPTIVE FAULT-TOLERANT STRATEGY

In this section, we introduce our adaptive NVP-based fault-tolerant strategy and elaborate on the advanced redundancy management it supports. Aiming to autonomously tune its internal configuration in view of changes in context, it was designed to dynamically find the optimal redundancy configuration. Our context-aware reformulation of the classical NVP/MV system structure encompasses two complementary parameterised models that jointly determine the redundancy configuration to be used throughout the next voting round $l$. During the first stage of this procedure, the redundancy dimensioning model, which will be explained shortly in Sect. 4.2, will select the appropriate degree of redundancy $n^{(l)}$ to be employed for a newly initiated voting round $l$ as a function of the disturbances experienced in previous voting rounds. Next, the replica selection model will establish which replicas $v \in V$ are most appropriate to constitute $V^{(l)}$. This second stage, which will be elaborated upon in Sect. 4.3, was designed to enrol those replicas targeting an optimal trade-off between the context properties introduced in Sect. 3.2 and 2, i.e. normalised dissent, response time and pending load, respectively.

4.1 Application-Specific Requirements

The optimal redundancy configuration is, however, not only determined by the quantitative assessment in terms of the context properties introduced in Sect. 2 and 3, but also by the characteristics of the application itself, or the environment in which it operates. For instance, some applications may be latency-sensitive, whereas others may operate in a resource-constrained environment. The A-NVP/MV algorithm was conceived to take these application-specific intricacies into account, in that the redundancy dimensioning and replica selection models can be configured by means of a set of user-defined parameters.

Our A-NVP/MV algorithm has been designed primarily to maximise the redundancy scheme's dependability, and secondarily, it may be configured to target other application objectives such as time constraints as well as load balancing. User-defined weights $w_D$, $w_T$ and $w_L$ for each of the three respective application objectives listed can be used to configure the replica selection model such that it will engage the most appropriate replicas so as to maximise the overall effectiveness of the voting scheme. It is assumed that

$$
\sum_{i \in \{D, T, L\}} w_i = 1 \qquad (12)
$$

Furthermore, an optional user-defined parameter $t_{max}$ represents the largest response time that the application can afford. A smaller value represents more stringent requirements on the scheme's response time, implicitly indicating that the application is more latency-sensitive. The $t_{max}$ parameter is of particular interest as it is used to detect performance and omission failures: if a replica $v \in V^{(l)}$ failed to return its response to the NVP composite before the $t_{max}$ time-out has lapsed since the execution of the voting procedure for round $l$ was initiated, $v$ will be classified in $P_F^{(l)}$ and penalised accordingly as described in Sect. 3.2.2. Consequently, the response latency of the A-NVP composite is guaranteed not to exceed $t_{max}$. More specifically, if no absolute majority could be established before $t_{max}$, an exception will be issued to signal that consensus could not be found.

Finally, applications deployed in resource-constrained environments may benefit from the parameter $n_{max}$ to set an upper bound on the number of replicas to be used in parallel, which may result in the utilisation of fewer computing and networking resources. This parameter may affect the degree of redundancy $n^{(l)}$ as determined by the redundancy dimensioning model, possibly at the expense of a significantly higher risk of failure of the voting scheme.
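The user-defined parameters of this subsection could be grouped and validated as in the Python sketch below. The class and field names are hypothetical, and the requirement that weights be non-negative is an assumption added here; the text only states the sum constraint (12).

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AnvpConfig:
    """Hypothetical container for the application-specific parameters."""
    w_d: float                      # weight of the dependability objective
    w_t: float                      # weight of the timeliness objective
    w_l: float                      # weight of the load balancing objective
    t_max: Optional[float] = None   # largest affordable response time, if any
    n_max: Optional[int] = None     # upper bound on parallel replicas, if any

    def __post_init__(self):
        weights = (self.w_d, self.w_t, self.w_l)
        if any(w < 0 for w in weights):      # assumption, not stated in the text
            raise ValueError("weights must be non-negative")
        if not math.isclose(sum(weights), 1.0):
            raise ValueError("weights must sum to 1, cf. (12)")
        if self.t_max is not None and self.t_max <= 0:
            raise ValueError("t_max must be positive")
        if self.n_max is not None and self.n_max < 1:
            raise ValueError("n_max must be at least 1")

    def redundancy_cap(self, available):
        """Upper bound min(|V|, n_max) on the degree of redundancy."""
        return available if self.n_max is None else min(available, self.n_max)
```

A configuration favouring dependability over timeliness, without load balancing, would then read `AnvpConfig(w_d=0.8, w_t=0.2, w_l=0.0)`.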

4.2 Redundancy Dimensioning Model

Given the set $V$ of available functionally-equivalent versions in the system, our redundancy dimensioning model is responsible for autonomously adjusting the degree of redundancy employed such that it closely follows the evolution of the observed disturbances. In the absence of exceptional disturbances, the scheme should scale down its use of redundant replicas so as to avoid the unnecessary expenditure of resources. Contrarily, when the foreseen amount of redundancy is not enough to compensate for the currently experienced disturbances, it would be beneficial to dynamically revise that amount and enrol additional resources, if available.

[Figure 2 plot: number of failures (vertical axis, 0 to 14) against window length r (horizontal axis, 50 to 400).]

Figure 2: Number of voting scheme failures experienced while injecting faults in a simulation model for dynamically redundant data structures encompassing $10^7$ rounds [16]. Abscissae represent how many consecutive voting rounds must have completed with $\mathit{dtof}^{(i)} \geq 1$ for $i \in \{l-r, \ldots, l-1\}$ before the employed degree of redundancy will be downshifted.

The redundancy dimensioning model is expected to determine $n^{(l)}$ upon initialisation of the voting round $l$, abiding by the premise that $n^{(l)} \leq \min(|V|, n_{max})$. Note that the behaviour of the system is undefined when the optimal degree of redundancy as inferred by the model exceeds $n^{(l)}$. Depending on the application domain, the A-NVP scheme could simply report failure, or it could proceed with the suboptimal redundancy currently supported.

We will now briefly discuss a simplistic strategy that was originally published in [16], as a possible implementation for the redundancy dimensioning model. This strategy assumes a set $V$ such that $|V| = 9$, and will only report an odd degree of redundancy, i.e. $n^{(l)} \in \{3, 5, 7, 9\}$. Moreover, the redundancy scheme is initialised such that it is capable of tolerating up to one failure, hence $n^{(0)} = 3$. If the voting scheme failed to find consensus amongst a majority of the replicas involved during the round $l - 1$, the model will increase the number of redundant replicas to be used in the next voting round, to the extent that $n^{(l)} = n^{(l-1)} + 2$, provided that $n^{(l-1)} < |V|$. Conversely, when the scheme was able to produce an outcome for a certain number $r$ of consecutive voting rounds, which can be observed by values $\mathit{dtof}^{(i)} \geq 1$ for $i \in \{l-r, \ldots, l-1\}$, a lower degree of redundancy shall be used for the next voting round $l$, involving $n^{(l)} = \max(3, n^{(l-1)} - 2)$ replicas. In other words, the model will maintain a sliding window so as to monitor the dtof value obtained for the last $r$ completed voting rounds.
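The window-based strategy just described can be sketched in Python as follows. The function name, the representation of the history as a list and the default arguments are assumptions of this sketch. Whether the window is restarted after a change in redundancy is not specified in the description above, so the sketch leaves it to the caller to truncate the history if desired.

```python
def next_redundancy(n_prev, dtof_history, r, n_available=9, n_min=3):
    """Degree of redundancy for the next round, given past dtof values.

    n_prev       -- degree of redundancy used in the previous round
    dtof_history -- dtof values of completed rounds, oldest first
    r            -- window length (number of consecutive successful rounds)
    """
    if r < 1:
        raise ValueError("window length r must be at least 1")
    if dtof_history and dtof_history[-1] == 0:
        # Previous round failed: enrol two more replicas, if available.
        return min(n_available, n_prev + 2) if n_prev < n_available else n_prev
    window = dtof_history[-r:]
    if len(window) == r and all(d >= 1 for d in window):
        # r consecutive rounds produced an outcome: relinquish two replicas.
        return max(n_min, n_prev - 2)
    return n_prev
```

Starting from three replicas, a single failed round leads to five replicas for the next round, and only a full window of successful rounds brings the degree of redundancy back down.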

As can be seen from Fig. 2, shorter window lengths may result in an incautious downscaling of the redundancy, which in itself might lead to failure of the voting scheme in subsequent voting rounds. The general trend shows that the redundancy scheme is less likely to fail due to the downscaling of the employed degree of redundancy for larger values of $r$, at the expense of postponing the relinquishment of excess redundancy. Unfortunately, even though the strategy is capable of scaling down the utilisation of system resources, it occasionally results in redundancy undershooting, even for relatively large values of $r$, as one may observe from the spikes in the graph shown above. It can be argued that the strategy, in its simplicity, does not take full account of the dtof formalism as it was presented in Sect. 3.1. Had it been designed to consider, for instance, the actual number of replicas that contributed to the majorities found during the last $r$ voting rounds, it would have been able to determine to what extent the current level of redundancy could be decreased, and assess the risk of doing so, cf. (5) and (7).

4.3 Replica Selection Model

Having established the degree of redundancy $n^{(l)}$ to be employed throughout round $l$, the replica selection model will then determine a selection of versions $v \in V^{(l)}$ to be used by the redundancy scheme, such that $|V^{(l)}| = n^{(l)}$. The proposed model has been designed so as to achieve an optimal trade-off between dependability as well as performance-related objectives such as load balancing and timeliness, respectively represented as the $w_D$, $w_L$ and $w_T$ application-specific configuration parameters.

The suitability of a particular version $v \in V$ within an NVP/MV scheme can now be assessed quantitatively, leveraging the context properties introduced in Sect. 2 and 3. Let us now denote the last known values (see footnote 4) of the normalised dissent, the number of pending requests and the average response time for a version $v \in V$ by $D(v)$, $L(v)$ and $T(v)$ respectively. If no such value was previously reported, all variables will hold the value 0. The process of determining a trade-off between the different application objectives can now be facilitated by scaling the context properties, which were defined without any upper bounds, to the same range. We therefore define $\delta_D$ as the maximum value $D(v)$ for all versions $v \in V$. $\delta_L$ and $\delta_T$ are defined analogously as the maximum value of $L(v)$ and $T(v)$, respectively. Whereas $D(v)$, $L(v)$ and $T(v)$ are initialised to 0, the thresholds $\delta_D$, $\delta_L$ and $\delta_T$ will be initialised to 1. Subsequently, the values for these context properties can now be scaled to a real number over the interval $[0, 1]$:

Footnote 4: The motivation for this weak definition is twofold. Firstly, the estimations of the pending load may be externally provided to the scheme and may therefore be subjected to delays. Secondly, even though the dependability metric $D^{(l)}(v)$ is harvested at the end of each voting round for versions $v \in V^{(l)}$, one cannot reasonably expect round $l - 1$ to have completed by the time round $l$ is initialised.

$$
X_S(v) = \frac{\delta_X - X(v)}{\delta_X} \quad \text{for } v \in V \qquad (13)
$$

where $X \in \{D, L, T\}$ stands for any of the three context properties normalised dissent, pending load and response time. Practically speaking, a larger value $X(v)$ for any of the three properties under consideration is representative of a worse impact of the replica $v$ on the redundancy scheme. Accordingly, larger values of the scaled value $X_S(v)$ signal versions more suitable to support the redundancy scheme. After the context property values were scaled onto a common range $[0, 1]$, one can now determine the score $s(v)$ for each version $v \in V$ as follows:

$$
s(v) = w_D \times D_S(v) + w_L \times L_S(v) + w_T \times T_S(v) \qquad (14)
$$

The replica selection procedure is then reduced to a mere sorting problem, in which the versions are ranked by descending values of $s(v)$. At this stage, all information regarding the redundancy configuration is available, and the execution of the voting round $l$ can proceed using the first $n^{(l)}$ versions.
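The scaling of (13), the scoring of (14) and the final ranking can be sketched in Python. The function names are hypothetical. Two points are assumptions of this sketch: each threshold is taken as the larger of 1 and the largest reported value, and ties in score are broken by version name. The sample data are the dissent values, response times and weights used in the example of Sect. 6.

```python
def scale(values):
    """Scaled values X_S(v) of (13) for one context property."""
    # Assumption: delta starts at 1 and otherwise tracks the largest value.
    delta = max(1.0, max(values.values(), default=0.0))
    return {v: (delta - x) / delta for v, x in values.items()}


def select_replicas(dissent, load, resp_time, w_d, w_l, w_t, n):
    """Rank all versions by the score of (14) and return the first n."""
    d_s, l_s, t_s = scale(dissent), scale(load), scale(resp_time)
    scores = {v: w_d * d_s[v] + w_l * l_s[v] + w_t * t_s[v] for v in dissent}
    # Descending score; the tie-break on the name is an assumption.
    ranked = sorted(scores, key=lambda v: (-scores[v], v))
    return ranked[:n], scores


dissent = {"A": 0.425, "B": 0.0, "C": 0.95, "D": 2.0, "E": 0.425}
load = dict.fromkeys(dissent, 0.0)
resp_time = {"A": 10, "B": 12, "C": 10, "D": 8, "E": 8}
chosen, scores = select_replicas(dissent, load, resp_time,
                                 w_d=0.8, w_l=0.0, w_t=0.2, n=4)
```

With these inputs, version D, whose normalised dissent equals the threshold, obtains the lowest score and is the one left out of the selection.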

[Figure 3 diagram. Labels: Generic A-NVP Composite WS-Resource; A-NVP; eWS-SG; WSN Consumer; context; message handlers; port type A; port type B; SOAP interaction with selected member services.]

Figure 3: WSDM-enabled A-NVP WS-Resource aggregating several manageability capabilities. It can be seen from the message handlers that port type A exposes 3 operations and port type B exposes 2. All versions implementing port type A are assumed to be unreachable. When detected, the service group disables the corresponding message handlers.

Obviously, an important prerequisite to obtain an accurate resource selection $V^{(l)}$ is to have the required contextual information instantly available. As shown in Fig. 3, the A-NVP composite contains a context manager component that is responsible for continuously monitoring any changes in the operational status of the available resources, in this case the context properties introduced in Sect. 2 and 3 for each of the functionally-equivalent versions $v \in V$ available in the system. When new information regarding one or more context properties is reported, the context manager will update its internal data structures accordingly, enforcing appropriate synchronisation mechanisms so as to ensure data consistency. As such, any update of a context property $X(v)$ for a version $v$ will instantaneously be reflected in the value of the corresponding $\delta_X$. Property updates may account for internally deduced information, e.g. the dtof, normalised dissent and response time metrics, which are harvested by the A-NVP/MV scheme at the end of each voting round. Other metrics such as pending load may, however, be externally provided.

It was already pointed out in Sect. 4.1 that the optional user-defined parameter $t_{max}$ is used to enable the detection of performance and omission failures. Whenever a replica $v$ is detected to be affected by such a type of failure throughout the course of a voting round $l$, the stalled invocation request should promptly be abandoned, and a predefined internal failure message will be issued as the response message. Version $v$ will consequently be classified in $P_F^{(l)}$, and penalised as described in Sect. 3.2.2, directly affecting the version's normalised dissent value.

The use of the $t_{max}$ configuration parameter will also have repercussions on the $T(v)$ context property. As one can see in (14), if some context property value $X(v)$ for a specific replica $v$ was not updated after its initialisation, i.e. $X(v) = 0$ and therefore $X_S(v) = 1$, the version is tacitly assumed to contribute to the success of the scheme in terms of the application objective associated with that property. We have therefore chosen to report $t_{max}$ as the response time of versions that fail to return their response within the imposed time constraint, such that the system can guarantee that $T(v) \leq t_{max}$.
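The behaviour of the context manager described in this subsection can be approximated by the following Python sketch. It is not the prototype's implementation: the class, its methods and the single-letter property keys are assumptions, a plain lock stands in for whatever synchronisation the prototype uses, and each threshold is taken as the larger of 1 and the largest stored value.

```python
import threading


class ContextManager:
    """Hypothetical store for the context properties D, L and T per version."""

    def __init__(self, t_max=None):
        self._lock = threading.Lock()
        self._values = {prop: {} for prop in "DLT"}
        self._delta = {prop: 1.0 for prop in "DLT"}   # thresholds start at 1
        self._t_max = t_max

    def update(self, prop, version, value):
        """Store X(v) and refresh the corresponding threshold delta_X."""
        if prop == "T" and self._t_max is not None:
            value = min(value, self._t_max)           # ensures T(v) <= t_max
        with self._lock:
            self._values[prop][version] = value
            self._delta[prop] = max(1.0, max(self._values[prop].values()))

    def scaled(self, prop, version):
        """Scaled value X_S(v) of (13); unreported values count as 0."""
        with self._lock:
            x = self._values[prop].get(version, 0.0)
            delta = self._delta[prop]
        return (delta - x) / delta
```

A version for which no value was ever reported thus scales to 1, which matches the tacit assumption discussed above, and a timed-out version reported with $t_{max}$ scales to 0 whenever $t_{max}$ is the largest response time on record.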

5. A-NVP WS-* SOA PROTOTYPE

In this section, we present a prototypical service-oriented implementation of the adaptive fault tolerant strategy as proposed in Sect. 4. The framework was conceived leveraging a set of ratified WS-* specifications, mainly capitalising on the features offered by the Web Services Resource Framework (WSRF), Web Services Distributed Management (WSDM) and WS-Notification (WSN) families of OASIS-published standards. Figure 4 shows a layered representation of the specifications relevant to our A-NVP implementation (see footnote 5). The framework was developed using the latest version of Apache MUSE to date, supplemented by our own implementation of the Management of Web Services (MOWS) specification (see footnote 6).

A WSDM-enabled WS-Resource is essentially an aggregation of several manageability capabilities that are collectively exposed through a cohesive Web Services Description Language (WSDL) interface. A manageability capability defines a set of resource properties, operations, events, metadata and other semantics supporting a particular management aspect of a WS-Resource service. Apart from a set of predefined foundational manageability capabilities, WSDM was designed for extensibility, allowing the development of domain-specific capabilities comprising customised manageability logic or that extend any of the foundational capabilities as appropriate. Having implemented the A-NVP composite as a WSDM-enabled web service, the core of its implementation consists of two capabilities, as can be observed in Fig. 3.

Footnote 5: Due to space restrictions, introductory explanations of specific features of the specifications referred to cannot be provided. The reader may wish to consult www.oasis-open.org/committees/ for more information regarding the WSRF, WSDM and WSN specifications. Detailed information on W3C-driven specifications, particularly XML-related standards and first-generation WS-* standards such as WSDL, WS-Addressing and SOAP, may be retrieved from http://www.w3.org/TR/.

Footnote 6: For more information, refer to http://ws.apache.org/muse. The source code for the MOWS implementation is publicly available via http://pats.ua.ac.be/svn/muse.

[Figure 4 diagram. Layer labels: Transports; XML; Messaging; Metadata; Publish & Subscribe; WS-Policy; WS-ResourceFramework; WSDM; Discovery & Federation. Specifications shown: URI, HTTP, TCP/IP; XPath, XML, XML Schema; SOAP, WS-Addressing, WSDL; WS-Eventing, WS-MEX; WS-Resource, WS-ResourceProperties, WS-ResourceLifetime, WS-ServiceGroup; WS-Notification, WS-Topics; WS-PolicyFramework, WS-PolicyAttachment, WS-RMD; MUWS, MOWS; UDDI.]

Figure 4: Layered overview of WS-* specifications illustrating WSRF and WSDM and their interdependencies relative to other industry standards.

5.1 Enhanced WS-ServiceGroup Capability

The composite A-NVP web service leverages the WS-ServiceGroup (WSSG) specification and the notion of membership content rules defined therein to manage federations of functionally-equivalent web services. The entries of the group represent locally or remotely hosted member web services, and membership content rules can be used to express constraints on the member services. Such rules can impose limitations on the WSDL port types that services in the service group must implement, as well as the resource properties the member services are expected to expose. The rationale behind the mandatory use of membership content rules is that web services implementing a common WSDL port type and exposing the same set of resource properties can be considered as functionally-equivalent.

We have crafted an enhanced WSSG capability supporting advanced replica management, including facilities to compensate for the occasional emergence and disappearance of web services in the system. A freshly discovered service may be added to the group as the result of an incoming Management Using Web Services (MUWS) advertisement notification, provided the reported service complies with the membership content rules. Upon addition of a replica member web service, its metadata will be validated, and the service group will automatically issue a WSN subscription request so as to be notified of changes in any additional mandatory resource properties that were declared in the membership content rules set on the service group, cf. Sect. 5.3. Conversely, the receipt of a WS-ResourceLifetime destruction event will trigger the removal of the member from the service group.

The A-NVP composite has been explicitly designed as a generic WSDM-enabled utility WS-Resource so as to support a diversity of applications, without the need to generate application-specific proxy classes at design time. When assembling the deployment artefact, the user is expected to supply the WSDL interface definitions containing the port type descriptions for admissible service group members. During the initialisation of the composite WS-Resource, the provided WSDL definitions will be inspected, and for each non-standardised, request-response operation declared within, a new message handler will be registered. Furthermore, the system will automatically initialise the membership content rules, given the port types that were found whilst scanning the user-supplied interface definitions. Note that the WSDL interface advertised for the A-NVP composite itself is predefined and exposes a single port type combining only the standardised operations defined for the WSSG and WSN Consumer capabilities.

Figure 3 shows how message handlers enable the A-NVP composite to accept application-specific SOAP request messages and hand these over to the A-NVP capability for execution. Should there remain no active member services in the group for a particular port type, the respective handlers will be disabled, such that they will dismiss any incoming SOAP request by reporting a WS-Addressing ActionNotSupported fault message.

5.2 Domain-Agnostic A-NVP Capability

Context information for any of the member web services within the federation is managed at operation level (Fig. 3). Specifically, for each operation for which a dynamic message handler was registered, the context manager provides adequate data structures for storing the values $D(v)$, $L(v)$, $T(v)$ and the respectively corresponding maxima $\delta_D$, $\delta_L$ and $\delta_T$ as defined in Sect. 4.3, as well as the counters $\#\mathit{rounds}(v)$ and $\#\mathit{consent}(v)$ that were introduced in Sect. 3.2.3. Furthermore, application-specific configuration parameters can be specified for individual operations, thereby overriding the system defaults. One may do so by editing a deployment descriptor, in which a service operation can be uniquely identified by the service port type name and the WS-Addressing action URI.

The capability provides a single operation to accept NVP service requests. Upon invocation of the A-NVP composite, the system first determines the set of eligible functionally-equivalent member services in the service group, i.e. $V$. In order to do so, the payload of the incoming SOAP request as well as its WS-Addressing message headers are inspected so as to establish which of the registered port types exposes the targeted service operation. After acquiring all registered member services that implement the given port type, the capability proceeds by applying the algorithm introduced in Sect. 4 so as to determine an adequate selection of versions $V^{(l)}$. Such selection is carried out referring to the context information pertaining to the targeted operation, as stored in the context manager. The SOAP request is then simultaneously forwarded to each of the selected versions. As soon as an absolute majority $M^{(l)}$ of the selected $n^{(l)}$ versions have returned their response, the voting scheme will determine and return the outcome of the current voting round $l$, without waiting for the remaining replicas to return. At the same time, the $n^{(l)} - M^{(l)}$ pending results will be collected after the response was sent to the client such that the dtof and normalised dissent can be computed at the end of the voting procedure and subsequently reported to the context manager.
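One possible reading of this early-return voting procedure is sketched below in Python. The sketch processes responses in arrival order and records the point at which some result is backed by an absolute majority; the remaining responses are still tallied so that dtof can be computed afterwards. The function names are hypothetical, the majority threshold is assumed to be $\lfloor n/2 \rfloor + 1$, and dtof is derived as $|P^{(l)}| - M^{(l)} + 1$ in line with footnote 3.

```python
from collections import Counter


def absolute_majority(n):
    """Assumed majority threshold M for n engaged versions."""
    return n // 2 + 1


def vote(responses, n):
    """Tally (version, result) pairs in arrival order.

    A result of None marks a version that failed to return a meaningful
    response. Returns (outcome, decided_after, dtof), where decided_after
    is the number of responses seen when the majority was first reached.
    """
    m = absolute_majority(n)
    tally = Counter()
    outcome, decided_after = None, None
    for seen, (_version, result) in enumerate(responses, start=1):
        if result is None:
            continue
        tally[result] += 1
        if outcome is None and tally[result] >= m:
            outcome, decided_after = result, seen   # reply to the client here
    c_max = max(tally.values(), default=0)
    dtof = c_max - m + 1 if c_max >= m else 0
    return outcome, decided_after, dtof
```

In a deployment, the reply to the client would be sent at the point marked in the loop, while the tally of late responses continues in the background.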

It is worth pointing out that the voting procedure will assign any two versions to the same equivalence class of the partition $\wp^{(l)} \setminus P_F^{(l)}$ if the XML fragments enclosed within the body of their SOAP response messages are found to be syntactically equivalent, given the XSD schema definitions included in the WSDL interface. Special attention is paid to SOAP faults, however, which are typically used to convey error condition information when an exceptional situation occurs. In particular, one needs to clearly distinguish between application-specific and application-agnostic fault messages. Whereas the former type of fault messages are expected to carry domain-specific fault data and are processed like ordinary SOAP response messages, application-agnostic fault messages will directly be classified in $P_F^{(l)}$. Examples of this second category of messages include standardised fault messages from various WS-* specifications, or SOAP faults reported for versions that were detected to be affected by performance or omission failures (cf. Sect. 4.1 and 4.3).
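The grouping of responses into equivalence classes can be approximated with the standard library, as in the Python sketch below. It uses XML canonicalisation (C14N 2.0, available in Python 3.8 and later) as a stand-in for the schema-aware comparison described above, which is a simplification. The function name is hypothetical, and the caller is assumed to pass `None` for any version whose response was an application-agnostic fault or a time-out.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def partition_responses(bodies):
    """Group versions by canonicalised response payload.

    bodies -- {version: XML string of the SOAP body payload, or None}
    Returns (classes, failed): a list of sets of agreeing versions, and
    the set of versions classified as faulty.
    """
    classes = defaultdict(set)
    failed = set()
    for version, xml_text in bodies.items():
        if xml_text is None:
            failed.add(version)
            continue
        try:
            # Canonical form ignores attribute order and, with strip_text,
            # whitespace surrounding text content.
            key = ET.canonicalize(xml_text, strip_text=True)
        except ET.ParseError:
            failed.add(version)   # assumption: malformed payloads count as faulty
            continue
        classes[key].add(version)
    return list(classes.values()), failed
```

Application-specific fault payloads would simply be passed in as ordinary XML strings, so that versions reporting the same domain-specific fault end up in the same class.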

5.3 Externally Supplied Context Information

As pointed out in Sect. 4.3, the vast majority of the metrics and counters stored in the context manager is updated using information that was collected within the A-NVP composite itself, upon completion of a voting round. An exception to this approach, though, is the number of pending requests $L(v)$, which needs to be supplied externally as it is conceivable that a member replica may concurrently be used by services other than the A-NVP composite. Specifically, we require any member web service to expose the metrics defined by the MOWS operation metrics manageability capability. As such, the resource property OperationMetrics is supposed to be included in the membership content rules of the A-NVP composite. Upon addition of a new member service, the enhanced service group capability will consequently issue a WSN subscription request in order to be notified of changes in the values of this resource property. Any valid value for the OperationMetrics resource property is defined to hold three direct XML child elements, namely NumberOfRequests, NumberOfFailedRequests and NumberOfSuccessfulRequests. Considering the non-negative integer values of these metrics, the context manager can easily determine the number of pending requests as NumberOfRequests − (NumberOfFailedRequests + NumberOfSuccessfulRequests). The estimation of the load on any of the registered member services is always a rough approximation, due to potential latency in the issuance and processing of the WSN notification messages.
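The computation of the pending load from an OperationMetrics value can be sketched in Python as follows. The function name is hypothetical, and the sketch assumes that each of the three child elements carries its counter as plain text content; element names are matched by local name so that no particular namespace URI needs to be assumed.

```python
import xml.etree.ElementTree as ET

FIELDS = ("NumberOfRequests", "NumberOfFailedRequests",
          "NumberOfSuccessfulRequests")


def pending_requests(operation_metrics_xml):
    """Pending load L(v) derived from one OperationMetrics property value."""
    root = ET.fromstring(operation_metrics_xml)
    counts = {}
    for child in root:
        name = child.tag.rsplit("}", 1)[-1]      # drop a "{namespace}" prefix
        if name in FIELDS:
            counts[name] = int((child.text or "").strip())
    missing = [field for field in FIELDS if field not in counts]
    if missing:
        raise ValueError("missing metrics: " + ", ".join(missing))
    total, failed, succeeded = (counts[field] for field in FIELDS)
    return total - (failed + succeeded)
```

The value returned would then be stored as $L(v)$ for the member service that issued the notification.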


6. EXPERIMENTS AND ANALYSIS

To illustrate the A-NVP strategy, we now present an example considering a set $V = \{A, B, C, D, E\}$ of 5 replicas and a fixed number of versions to be used in each round, $n^{(i)} = 4$ for $i \in \{1, \dots, 5\}$. Table 1 summarises the first 5 voting rounds, showing the disturbances the voting scheme encountered during each round.

| $l$ | $\wp^{(l)} \setminus P_F^{(l)}$ | $P_F^{(l)}$ | $dtof^{(l)}$ | $w_e^{(l)}$ |
|---|---|---|---|---|
| 1 | {A,E}, {C} | {D} | 0 | - |
| 2 | {A,B,E} | {D} | 1 | 1 |
| 3 | {A,B,C,E} | ∅ | 2 | 0.5 |
| 4 | {A,C}, {B,E} | ∅ | 0 | - |
| 5 | {A,B,E}, {C} | ∅ | 1 | 1 |

Table 1: Overview of the disturbances and their impact on the voting procedure for the first 5 rounds of an A-NVP/MV composite. The displayed values have been computed at the end of round $l$.
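The dtof column of Table 1 can be reproduced with a short sketch. The definition of the distance-to-failure is given earlier in the paper and is not restated in this section; the code below assumes the form ⌈n/2⌉ − m when m < ⌈n/2⌉ and 0 otherwise, where m counts the versions that either failed or disagreed with the largest class of agreeing responses. This assumed form matches all five rows of Table 1, but the function name and signature are illustrative only.

```python
from math import ceil


def dtof(n: int, partitions: list[set[str]]) -> int:
    """Distance-to-failure for one round of majority voting (assumed form).

    n          -- number of versions invoked in the round
    partitions -- classes of agreeing responses, failed versions excluded
    """
    largest = max((len(p) for p in partitions), default=0)
    m = n - largest  # versions that failed or dissented from the largest class
    threshold = ceil(n / 2)
    return threshold - m if m < threshold else 0


# Partitions taken from Table 1; n = 4 versions are invoked in every round.
rounds = [
    [{"A", "E"}, {"C"}],       # round 1, D failed
    [{"A", "B", "E"}],         # round 2, D failed
    [{"A", "B", "C", "E"}],    # round 3
    [{"A", "C"}, {"B", "E"}],  # round 4
    [{"A", "B", "E"}, {"C"}],  # round 5
]

print([dtof(4, r) for r in rounds])  # [0, 1, 2, 0, 1]
```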

It is assumed that the voting rounds do not overlap, that is, round $l + 1$ does not commence before round $l$ has completed. The example displayed in Table 2 was constructed under the assumption that replica D was affected by a permanent fault, either because of a design fault, a broken network link or a malfunction of the underlying deployment platform. The reward model as defined in Sect. 3.2.3 was configured with parameters $k_1 = 0.85$, $k_2 = 0.75$ and $k_{max} = 0.95$. The example targets a trade-off between dependability and timeliness, and does not consider load balancing, i.e. $w_D = 0.8$, $w_T = 0.2$ and $w_L = 0$. Moreover, the response times are assumed to be constant throughout the experiment such that $T(A) = T(C) = 10$, $T(B) = 12$ and $T(D) = T(E) = 8$, expressed in seconds. All values were computed using fixed decimal numbers with four significant digits. Table 2 illustrates how the normalised dissent $D^{(l)}(v)$ is updated at the end of each voting round $l$, the last column referring to the applicable formulae from Sect. 3.2.

We will now show how the replica selection model presented in Sect. 4.3 will select $V^{(3)} = \{A, B, C, E\}$. Considering $\delta_D = 2$ at the completion of round 2, the scaled normalised dissent is given by:

$$
\begin{aligned}
D_s(A) = D_s(E) &= \frac{2 - 0.425}{2} = 0.7875 \\
D_s(B) &= \frac{2 - 0}{2} = 1 \\
D_s(C) &= \frac{2 - 0.95}{2} = 0.525 \\
D_s(D) &= \frac{2 - 2}{2} = 0
\end{aligned}
\tag{15}
$$

One can now observe from the calculations in (15) that version D has been found to perform the poorest in terms of reliability. In contrast, $D_s(B)$ holds the maximum value 1, since B was not previously affected by disturbances of any kind. Given $\delta_T = 12$ seconds, the normalisation of the aforementioned response times yields:

$$
\begin{aligned}
T_s(A) = T_s(C) &= \frac{12 - 10}{12} = 0.1667 \\
T_s(D) = T_s(E) &= \frac{12 - 8}{12} = 0.3333 \\
T_s(B) &= \frac{12 - 12}{12} = 0
\end{aligned}
\tag{16}
$$

Since the configuration parameters for the replica selection model did not target load balancing, i.e. $w_L = 0$, one can now easily compute the score value for each of the replicas in $V$ using (14):

$$
\begin{aligned}
s(B) &= 0.8 \times 1 + 0.2 \times 0 = 0.8 \\
s(E) &= 0.8 \times 0.7875 + 0.2 \times 0.3333 = 0.6967 \\
s(A) &= 0.8 \times 0.7875 + 0.2 \times 0.1667 = 0.6633 \\
s(C) &= 0.8 \times 0.525 + 0.2 \times 0.1667 = 0.4533 \\
s(D) &= 0.8 \times 0 + 0.2 \times 0.3333 = 0.0667
\end{aligned}
\tag{17}
$$

The above score values have already been sorted in descending order. Given the fixed redundancy degree $n^{(3)} = 4$, we select the first 4 replicas from the above list, i.e. $V^{(3)} = \{B, E, A, C\}$, after which the scheme will invoke the selected versions and await their responses in order to complete the voting procedure. The faulty version D is excluded from $V^{(3)}$ due to the accumulation of the penalties that were added to its normalised dissent throughout the initial 2 voting rounds, as can be seen from Tables 1 and 2d.
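The selection of $V^{(3)}$ can be retraced with a minimal sketch. It assumes that $\delta_D$ and $\delta_T$ are the largest dissent and response time observed across $V$ (which agrees with the values 2 and 12 used above), and it omits the load term because $w_L = 0$ in this example. The function names and the handling of a zero scaling factor are illustrative assumptions, not part of the prototype.

```python
def scaled(value: float, delta: float) -> float:
    """Map a cost-like metric onto [0, 1] so that lower costs score higher."""
    if delta == 0:
        return 1.0  # assumption: no replica can be told apart, treat all as ideal
    return (delta - value) / delta


def select_replicas(dissent, response_time, n, w_d=0.8, w_t=0.2):
    """Rank replicas by weighted score; return the top n and all scores."""
    delta_d = max(dissent.values())        # assumed meaning of delta_D
    delta_t = max(response_time.values())  # assumed meaning of delta_T
    scores = {
        v: w_d * scaled(dissent[v], delta_d) + w_t * scaled(response_time[v], delta_t)
        for v in dissent
    }
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking[:n], scores


# Normalised dissent at the end of round 2 (Table 2) and response times in seconds.
dissent = {"A": 0.425, "B": 0.0, "C": 0.95, "D": 2.0, "E": 0.425}
response_time = {"A": 10, "B": 12, "C": 10, "D": 8, "E": 8}

selected, scores = select_replicas(dissent, response_time, n=4)
print(selected)                                      # ['B', 'E', 'A', 'C']
print({v: round(s, 4) for v, s in scores.items()})
```

Rounded to four decimals, the scores agree with (17), and D is the only replica left out of the selection.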

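| $l$ | status | #rounds | #consent | $D^{(l)}$ | $p^{(l)}$ | $r^{(l)}$ | Eq. |
|---|---|---|---|---|---|---|---|
| 1 | active | 1 | 0 | 0.5000 | 0.5000 | - | (8b) |
| 2 | active | 2 | 1 | 0.4250 | - | 0.85 | (10b) |
| 3 | active | 3 | 2 | 0.3400 | - | 0.80 | (10b) |
| 4 | active | 4 | 2 | 0.8400 | 0.5000 | - | (8b) |
| 5 | active | 5 | 3 | 0.7140 | - | 0.85 | (10b) |

(a) Version A

| $l$ | status | #rounds | #consent | $D^{(l)}$ | $p^{(l)}$ | $r^{(l)}$ | Eq. |
|---|---|---|---|---|---|---|---|
| 1 | idle | 0 | 0 | 0 | - | 0.85 | (10a) |
| 2 | active | 1 | 1 | 0 | - | 0.85 | (10b) |
| 3 | active | 2 | 2 | 0 | - | 0.80 | (10b) |
| 4 | active | 3 | 2 | 0.5000 | 0.5000 | - | (8b) |
| 5 | active | 4 | 3 | 0.4250 | - | 0.85 | (10b) |

(b) Version B

| $l$ | status | #rounds | #consent | $D^{(l)}$ | $p^{(l)}$ | $r^{(l)}$ | Eq. |
|---|---|---|---|---|---|---|---|
| 1 | active | 1 | 0 | 1.0000 | 1.0000 | - | (8b) |
| 2 | idle | 1 | 0 | 0.9500 | - | 0.95 | (10a) |
| 3 | active | 2 | 1 | 0.7600 | - | 0.80 | (10b) |
| 4 | active | 3 | 1 | 1.2600 | 0.5000 | - | (8b) |
| 5 | active | 4 | 1 | 1.9267 | 0.6667 | - | (8a) |

(c) Version C

| $l$ | status | #rounds | #consent | $D^{(l)}$ | $p^{(l)}$ | $r^{(l)}$ | Eq. |
|---|---|---|---|---|---|---|---|
| 1 | active | 1 | 0 | 1.0000 | 1.0000 | - | (8c) |
| 2 | active | 2 | 0 | 2.0000 | 1.0000 | - | (8c) |
| 3 | idle | 2 | 0 | 1.9000 | - | 0.95 | (10a) |
| 4 | idle | 2 | 0 | 1.8050 | - | 0.95 | (10a) |
| 5 | idle | 2 | 0 | 1.7148 | - | 0.95 | (10a) |

(d) Version D

| $l$ | status | #rounds | #consent | $D^{(l)}$ | $p^{(l)}$ | $r^{(l)}$ | Eq. |
|---|---|---|---|---|---|---|---|
| 1 | active | 1 | 0 | 0.5000 | 0.5000 | - | (8b) |
| 2 | active | 2 | 1 | 0.4250 | - | 0.85 | (10b) |
| 3 | active | 3 | 2 | 0.3400 | - | 0.80 | (10b) |
| 4 | active | 4 | 2 | 0.8400 | 0.5000 | - | (8b) |
| 5 | active | 5 | 3 | 0.7140 | - | 0.85 | (10b) |

(e) Version E

Table 2: Evolution of the normalised dissent value.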
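The $D^{(l)}$ columns of Table 2 can be replayed from the listed penalties and rewards. The sketch below assumes an update rule inferred from the table itself: a penalty $p^{(l)}$ is added to the previous dissent value, whereas a reward $r^{(l)}$ multiplies it. How $p^{(l)}$ and $r^{(l)}$ are derived is defined in Sect. 3.2 and is not modelled here; the function name and the four-decimal, half-up rounding for display are assumptions of this illustration.

```python
from decimal import Decimal, ROUND_HALF_UP


def replay(updates, initial="0"):
    """Apply ('p', value) penalties additively and ('r', value) rewards multiplicatively."""
    d = Decimal(initial)
    history = []
    for kind, value in updates:
        d = d + Decimal(value) if kind == "p" else d * Decimal(value)
        # Round only for display; the running value keeps full precision.
        history.append(d.quantize(Decimal("0.0001"), rounding=ROUND_HALF_UP))
    return history


# Penalty and reward sequences copied from Tables 2c and 2d.
version_c = [("p", "1.0000"), ("r", "0.95"), ("r", "0.80"), ("p", "0.5000"), ("p", "0.6667")]
version_d = [("p", "1.0000"), ("p", "1.0000"), ("r", "0.95"), ("r", "0.95"), ("r", "0.95")]

print([str(x) for x in replay(version_c)])  # ['1.0000', '0.9500', '0.7600', '1.2600', '1.9267']
print([str(x) for x in replay(version_d)])  # ['1.0000', '2.0000', '1.9000', '1.8050', '1.7148']
```

Decimal arithmetic is used so that the rounded values match the table exactly rather than depending on binary floating-point rounding.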


7. RELATED WORK

A number of techniques for service reliability engineering have appeared in the recent literature. The approach presented in [14] aims to enhance the dependability of the system by combining multiple functionally-equivalent services. It defines three classic decision algorithms, which are named service operators, including majority voting. However, the described model requires manual orchestration of the versions, while acknowledging the need for dynamic redundancy configurations. An interesting contribution of [14] is the quantitative modelling of the reliability of a service request to evaluate the effectiveness of NVP-based service composites.

Applying NVP within SOA has been suggested in [17] as well, which also ranks the versions available in the system using a composite metric involving e.g. reliability and response time. Both [14] and [17] differ from our approach in that they work with given reliability estimates, whereas we use the dtof and normalised dissent $D^{(l)}(v)$ to capture all types of disturbances. Neither of the two papers deals with the issue of redundancy dimensioning.

The dynamic parallel fault-tolerant selection algorithm described in [18] supports the automatic selection of the voting procedure as a function of the system context. The redundancy configuration for the NVP scheme is determined by repeatedly predicting the response time of the versions and computing the dependability and execution time of the candidate configurations, which may incur a significant overhead as the number of replicas in the system increases.

8. CONCLUSION

In this paper, a novel dependability strategy was introduced supporting advanced redundancy management, aiming to autonomously tune its internal configuration in view of changes in context. Given a set of functionally-equivalent stateless web services, our A-NVP/MV strategy will dynamically select the most appropriate versions depending on the contextual information gathered during the runtime of the system. The principal contribution of this paper is the resource selection algorithm. We have defined two new metrics for capturing the reliability of a software component. The primary advantage of the distance-to-failure and normalised dissent metrics is that the dependability strategy need not rely on assumptions regarding the failure rates of software components. We have implemented the presented solution as a WSDM-enabled web service using the latest Apache MUSE distribution to date. As such, the adequacy of established WS-* specifications for fault-tolerant manageability purposes was illustrated.

As future work, we plan to enhance the proposed algorithm to autonomously tune the degree of employed redundancy as a function of the experienced disturbances. Furthermore, we are currently working on an evaluation of the performance overhead and efficiency of the selection algorithm.

9. REFERENCES

[1] Erl, T.: Service-Oriented Architecture: Concepts, Technology, and Design. Prentice Hall PTR, Upper Saddle River, NJ, USA (2005)

[2] De Florio, V.: Application-layer Fault-tolerance Protocols. IGI Global (2009)

[3] Dubrova, E.: Fault Tolerant Design: an Introduction (draft). Kluwer Academic Publishers (2002)

[4] Johnson, B.W.: Design and analysis of fault tolerant digital systems. Addison-Wesley Series in Electrical and Computer Engineering, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA (1989)

[5] Diab, H.B., Zomaya, A.Y. (eds.): Dependable Computing Systems: Paradigms, Performance Issues, and Applications. Wiley Series on Parallel and Distributed Computing, Wiley-Interscience (2005)

[6] Erl, T.: SOA Design Patterns. Prentice Hall PTR, Upper Saddle River, NJ, USA (2008)

[7] Gray, J., Siewiorek, D.P.: High-availability computer systems. Computer 24(9), pp. 39–48 (1991)

[8] Dependable Embedded Systems: Software Fault Tolerance, http://www.ece.cmu.edu/~koopman/des_s99/sw_fault_tolerance/

[9] Avizienis, A.: The n-version approach to fault-tolerant software. IEEE Transactions on Software Engineering SE-11(12), pp. 1491–1501 (1985)

[10] Lorczak, P., Caglayan, A., Eckhardt, D.: A theoretical investigation of generalized voters for redundant systems. In: IEEE Digest of Papers on the 19th International Symposium on Fault-Tolerant Computing (FTCS-19), pp. 444–451. IEEE Computer Society Press, New York (1989)

[11] De Florio, V., Deconinck, G., Lauwereins, R.: Software tool combining fault masking with user-defined recovery strategies. IEE Proceedings Software 145(6), pp. 203–211. IEEE Computer Society Press, New York (1998)

[12] Cardoso, J., Miller, J., Sheth, A., Arnold, J.: Modeling quality of service for workflows and web service processes. Technical report TR-02-002, LSDIS Lab, Computer Science Department, University of Georgia (2002)

[13] Cristian, F.: Understanding fault-tolerant distributed systems. Communications of the ACM 34(2), pp. 56–78 (1991)

[14] Gotze, J., Muller, J., Muller, P.: Iterative service orchestration based on dependability attributes. In: Proceedings of the 34th Euromicro Conference on Software Engineering and Advanced Applications (SEAA2008), pp. 353–360. IEEE Computer Society Press, New York (2008)

[15] Lyu, M.R. (ed.): Handbook of software reliability engineering. McGraw-Hill, Inc., Hightstown, NJ, USA (1996)

[16] De Florio, V.: Software Assumptions Failure Tolerance: Role, Strategies, and Visions. In: Casimiro, A., de Lemos, R., Gacek, C. (eds.) Architecting Dependable Systems VII. LNCS, vol. 6420, pp. 249–272. Springer, Heidelberg (2010)

[17] Laranjeiro, N., Vieira, M.: Towards fault tolerance in web services compositions. In: Proceedings of the 2007 workshop on Engineering fault tolerant systems (EFTS ’07). Association for Computing Machinery, Inc. (ACM), New York (2007)

[18] Zheng, Z., Lyu, M.R.: An adaptive QoS-aware fault tolerance strategy for web services. In: Empirical Software Engineering 2010(15), pp. 323–345 (2010)