Making Services Fault Tolerant


Pat Chan, Michael R. Lyu
Department of Computer Science and Engineering, The Chinese University of Hong Kong

Miroslaw Malek
Department of Computer Science and Engineering, Humboldt University Berlin


Outline

Introduction
Problem Statement
Methodologies for Web Service Reliability
New Reliable Web Service Paradigm
Road Map for Experiment
Experimental Results and Discussion
Conclusion


Introduction

Service-oriented computing is becoming a reality.

Service-oriented Architectures (SOA) are based on a simple model of roles.

The problems of service dependability, security and timeliness are becoming critical.

We propose experimental settings and offer a roadmap to dependable Web services.


Problem Statement

Fault-tolerant techniques: replication and diversity.

Replication is one of the efficient ways of providing reliable systems through time or space redundancy. It increases the availability of distributed systems: key components are re-executed or replicated to protect against hardware malfunctions or transient system faults.

Another efficient technique is design diversity. Software systems or services are designed independently by different programming teams in order to defend against permanent software design faults.

We focus on the analysis of the replication techniques when applied to Web services.

A generic Web service system with spatial as well as temporal replication is proposed and investigated.


Methodologies for reliable Web services -- Redundancy

Spatial redundancy

Static redundancy: all replicas are active at the same time, and voting takes place to obtain a correct result.

Dynamic redundancy: one replica is active at a time while the others are kept in an active or standby state.

Temporal redundancy

Redundancy in time.
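
The redundancy schemes above can be illustrated with a minimal Python sketch. It is not taken from the slides: the function names `majority_vote` and `retry`, the integer request type, and the default of three attempts are all assumptions made for illustration. Replicas are modelled as plain callables rather than real Web service endpoints.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def majority_vote(replicas: Sequence[Callable[[int], int]], request: int) -> int:
    """Static spatial redundancy: call every replica, return the majority result."""
    results = []
    for replica in replicas:
        try:
            results.append(replica(request))
        except Exception:
            # A crashed replica simply contributes no vote.
            continue
    if not results:
        raise RuntimeError("all replicas failed")
    value, votes = Counter(results).most_common(1)[0]
    # Require a strict majority of all replicas, not just of those that answered.
    if votes <= len(replicas) // 2:
        raise RuntimeError("no majority among replicas")
    return value


def retry(service: Callable[[int], int], request: int, attempts: int = 3) -> int:
    """Temporal redundancy: re-execute the same service when a call fails."""
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        try:
            return service(request)
        except Exception as error:
            last_error = error
    raise RuntimeError("service failed after retries") from last_error
```

With three replicas of which one returns a wrong value or crashes, `majority_vote` still returns the value agreed on by the other two. `retry` only helps against transient faults, since it calls the same service again.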


Methodologies for reliable Web services -- Diversity

Protect redundant systems against common-mode failures

With different designs and implementations, common failure modes will probably cause different error effects.

N-version programming, recovery blocks…
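
As an illustration of design diversity, the following Python sketch shows the recovery block idea: independently written alternates are tried in order until one passes an acceptance test. The sorting task, the function names, and the deliberately faulty first alternate are assumptions chosen for the example and do not come from the slides.

```python
from collections import Counter
from typing import Callable, List, Sequence


def recovery_block(
    alternates: Sequence[Callable[[List[int]], List[int]]],
    acceptance_test: Callable[[List[int], List[int]], bool],
    data: List[int],
) -> List[int]:
    """Try diverse alternates in order until one passes the acceptance test."""
    for alternate in alternates:
        try:
            result = alternate(list(data))  # each alternate gets its own copy
        except Exception:
            continue  # a crash counts as a failed alternate
        if acceptance_test(data, result):
            return result
    raise RuntimeError("every alternate failed the acceptance test")


def is_sorted_permutation(original: List[int], result: List[int]) -> bool:
    """Acceptance test: result is ordered and holds the same elements."""
    ordered = all(a <= b for a, b in zip(result, result[1:]))
    return ordered and Counter(original) == Counter(result)


def faulty_sort(values: List[int]) -> List[int]:
    """Hypothetical primary version with a design fault: it never sorts."""
    return values


def insertion_sort(values: List[int]) -> List[int]:
    """Hypothetical alternate written with a different algorithm."""
    for i in range(1, len(values)):
        key, j = values[i], i - 1
        while j >= 0 and values[j] > key:
            values[j + 1] = values[j]
            j -= 1
        values[j + 1] = key
    return values
```

Calling `recovery_block([faulty_sort, insertion_sort], is_sorted_permutation, [3, 1, 2])` rejects the output of the first alternate and returns the sorted list from the second. N-version programming differs in that all versions run and their outputs are compared by voting rather than checked one at a time.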


Failure Response Stages of Web Services

Fault confinement
Fault detection
Diagnosis
Fail-over
Reconfiguration
Recovery
Restart
Repair
Reintegration


[Figure: diagram of the failure response stages: fault confinement, fault detection, failover, diagnosis (online, offline), reconfiguration, recovery, restart, repair, reintegration]


Proposed Paradigm

[Figure: architecture of the proposed paradigm, showing the Replication Manager (Web service selection algorithm, WatchDog), the UDDI registry, the WSDL, replicated Web services (IIS, application, database), the client and the port]

1. Create web services
2. Select primary web service (PWS)
3. Register
4. Look up
5. Get WSDL
6. Invoke web service
7. Keep checking the availability of the PWS
8. If the PWS has failed, reselect the PWS
9. Update the WSDL


Work Flow of the Replication Manager

The RM sends a message to the Web service.
Get reply: the RM keeps checking.
Do not get reply: reselect a primary Web service, then map the new address to the WSDL.
All services failed: system fail.
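
The work flow of the Replication Manager can be sketched in Python as follows. This is a simplified model under stated assumptions, not the authors' implementation: each replica is represented by a hypothetical health probe (a callable returning `True` when the service replies), the first replica is taken as the initial primary, and the attribute `published_address` stands in for the address mapped into the WSDL.

```python
from typing import Callable, Dict, Optional


class ReplicationManager:
    """Watches the primary Web service and fails over to another replica."""

    def __init__(self, replicas: Dict[str, Callable[[], bool]]) -> None:
        # Maps each replica address to a hypothetical health probe.
        self.replicas = replicas
        # Assumption: the first registered replica starts as the primary.
        self.primary = next(iter(replicas))
        # Stand-in for the service address published in the WSDL.
        self.published_address = self.primary

    @staticmethod
    def _replies(probe: Callable[[], bool]) -> bool:
        try:
            return bool(probe())
        except Exception:
            return False  # an error is treated the same as no reply

    def _reselect(self) -> Optional[str]:
        """Return the address of another replica that replies, if any."""
        for address, probe in self.replicas.items():
            if address != self.primary and self._replies(probe):
                return address
        return None

    def check_once(self) -> str:
        """One watchdog cycle; returns the address of the current primary."""
        if self._replies(self.replicas[self.primary]):
            return self.primary  # got a reply, nothing to do
        new_primary = self._reselect()
        if new_primary is None:
            raise RuntimeError("all Web services failed")
        self.primary = new_primary
        self.published_address = new_primary  # map the new address
        return self.primary
```

In a running system `check_once` would be called periodically at the polling interval. The selection rule used here (first replica that replies) is only a placeholder for the Web service selection algorithm named in the slides.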


Road Map for Experiment Research

Redundancy in time
Redundancy in space
Sequentially
Parallel
Majority voting using N-modular redundancy
Diversified versions of different services


Experiments

A series of experiments is designed and performed to evaluate the reliability of the Web service in three configurations: a single service without replication, a single service with retry or reboot, and a service with spatial replication.

We will also perform retry or failover when the Web service is down.


Summary of the experiments

Configuration                 None   Retry/Reboot   Failover   Both (hybrid)
Single service, no retry      0      --             --         --
Single service with retry     --     1              --         --
Single service with reboot    --     2              --         --
Spatial replication           --     --             3          4


Parameters of the Experiments

Parameter                          Current setting/metric
Request frequency                  1 req/min
Polling frequency                  5 ms
Number of replicas                 5
Client timeout period for retry    10 s
Failure rate λ                     # failures/hour
Load (profile of the program)      % or load function
Reboot time                        10 min
Failover time                      1 s


Experimental Results

Experiments over 360-hour periods (43200 reqs)

Number of failures

         Normal   Server busy   Server reboots periodically
Exp 0    4928     6130          6492
Exp 1    2210     2327          2658
Exp 2    2561     3160          3323
Exp 3    1324     1711          1658
Exp 4    1089     1148          1325

Retry: 11.97% to 4.93%
Reboot: 11.97% to 6.44%
Failover: 11.97% to 3.56%
Retry and failover: 11.97% to 2.59%

Number of failures when the server is in a normal situation

Number of failures when the server is busy

Number of failures when the server reboots periodically

Reliability of the system over time

$$\lambda(t) = \lim_{\Delta t \to 0} \frac{F(t + \Delta t) - F(t)}{\Delta t} = 0.025$$

$$R(t) = e^{-\lambda(t)\, t}$$
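
As a worked illustration of the exponential reliability formula R(t) = e^(-λ(t) t) with the constant rate 0.025 given on this slide, the Python sketch below evaluates the reliability at a given time. The function names are hypothetical, and the time unit is an assumption: it is whatever unit the rate is expressed in, which the slide does not state.

```python
import math

# Constant failure rate taken from the slide; its time unit is not stated there.
FAILURE_RATE = 0.025


def reliability(t: float, failure_rate: float = FAILURE_RATE) -> float:
    """R(t) = exp(-lambda * t) for a constant failure rate lambda."""
    return math.exp(-failure_rate * t)


def mean_time_to_failure(failure_rate: float = FAILURE_RATE) -> float:
    """For a constant rate, the mean time to failure is 1 / lambda."""
    return 1.0 / failure_rate


if __name__ == "__main__":
    for t in (0, 10, 40, 100):
        print(f"R({t}) = {reliability(t):.4f}")
    print(f"MTTF = {mean_time_to_failure():.1f} time units")
```

Under this model the reliability starts at 1 and decays to e^(-1), about 0.37, after one mean time to failure, which is 40 time units for a rate of 0.025.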


Reliability Model

Reliability Model Parameters

ID    Description                                        Value
λn    Network failure rate                               0.02
λ*    Web service failure rate                           0.228
λ1    Resource problem rate                              0.142
λ2    Entry point failure rate                           0.150
μ*    Web service repair rate                            0.286
μ1    Resource problem repair rate                       0.979
μ2    Entry point failure repair rate                    0.979
C1    Probability that the RM responds on time           0.9
C2    Probability that the server reboots successfully   0.9


Outcome (SHARPE)

Failure rate: 0.228, 0.114, 0.057

Reliability of the proposed system


Conclusion

Surveyed replication and design diversity techniques for reliable services.

Proposed a hybrid approach to improving the availability of Web services.

Carried out a series of experiments to evaluate the availability and reliability of the proposed Web service system.

N-Version Programming may finally become commercially viable in a service environment.
