ATIF MEHMOOD MALIK KASHIF SIDDIQUE Improving dependability of Cloud Computing with Fault Tolerance...

ATIF MEHMOOD MALIK, KASHIF SIDDIQUE

Improving dependability of Cloud Computing with Fault Tolerance and High Availability

Dependability

In systems engineering, dependability is a measure of a system’s availability, reliability and maintainability

It is the ability of a system to deliver services that can justifiably be trusted

Often considered the third axis of system quality

Dependability ontology

Dependability challenges in cloud computing

Lack of trust in shared virtualized infrastructures

Management of a cloud computing service by a single provider or vendor is, in effect, a single point of failure

APIs are proprietary

Virtualization increases complexity

Higher resource utilization

Common mode outages

Multiple administrative domains

Legal and privacy implications

Threats to dependability

Faults, Errors and Failures

A fault in a system is a deviation from its expected behavior

Faults may arise due to hardware failure, software bugs, user error and network problems

Fault Tolerance

Ability of a system to continue providing services to its users when some of its components fail

Faults can be introduced at:

Application level

Virtual machine level

Physical resource level

Fault Tolerance

Application Fault Tolerance:

Application health is continuously monitored by special software components called sensors

A sensor may trigger specific procedures to start repairing an application that is malfunctioning

Example: VMware App HA
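The sensor idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Sensor` class, the `health_check` and `repair` callbacks and the `max_failures` threshold are hypothetical names, not part of VMware App HA or any other product.

```python
from typing import Callable


class Sensor:
    """Hypothetical application sensor: polls a health check and triggers repair."""

    def __init__(self, health_check: Callable[[], bool],
                 repair: Callable[[], None], max_failures: int = 3):
        self.health_check = health_check
        self.repair = repair
        self.max_failures = max_failures      # assumed threshold before repairing
        self.consecutive_failures = 0
        self.repairs_triggered = 0

    def poll(self) -> bool:
        """Run one monitoring cycle; return True if the application looked healthy."""
        try:
            healthy = self.health_check()
        except Exception:
            # A check that raises is treated the same as an unhealthy result.
            healthy = False

        if healthy:
            self.consecutive_failures = 0
            return True

        self.consecutive_failures += 1
        if self.consecutive_failures >= self.max_failures:
            # Start the repair procedure and reset the failure counter.
            self.repair()
            self.repairs_triggered += 1
            self.consecutive_failures = 0
        return False
```

Requiring several consecutive failed checks before repairing is one possible design choice; it avoids restarting an application because of a single transient glitch.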

Fault Tolerance

Virtual Machine Fault Tolerance:

Virtual machine failures can be detected by both the customer and the service provider

Customers can detect a virtual machine failure by monitoring its state with the help of sensors deployed in the cloud

The cloud service provider can provide VM fault tolerance by installing a single sensor per physical server that monitors all virtual machines hosted on that server

Fault Tolerance

Physical Machine Fault Tolerance:

Can be implemented by the cloud service provider by monitoring the state of physical server machines and, in case of hardware failure, resuming all virtual machines on a new server
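As a rough sketch of the recovery step described above, the function below moves every virtual machine from a failed host to a spare one. The placement map, the host and VM names and the `fail_over` function are assumptions made for illustration; a real provider would also have to restore VM state and check capacity on the target server.

```python
from typing import Dict, List


def fail_over(placement: Dict[str, List[str]], failed_host: str,
              spare_host: str) -> Dict[str, List[str]]:
    """Return a new placement with every VM from failed_host moved to spare_host."""
    if failed_host not in placement:
        raise KeyError(f"unknown host: {failed_host}")

    # Copy the surviving hosts so the caller's placement is left untouched.
    new_placement = {host: list(vms)
                     for host, vms in placement.items()
                     if host != failed_host}

    # Resume all VMs of the failed host on the spare server.
    new_placement.setdefault(spare_host, []).extend(placement[failed_host])
    return new_placement
```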

Fault Tolerance Techniques

Reactive Fault Tolerance

In case of failure, these techniques reduce the effect of failure on application execution

Proactive Fault Tolerance

These techniques work by predicting faults and proactively replacing the suspected components with working ones

Reactive Fault Tolerance

Checkpointing

Replication

Job migration

SGuard

Retry

Task resubmission

User-defined exception handling

Rescue workflow
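Checkpointing, the first technique in the list above, can be illustrated with a minimal sketch: a job saves its progress after each step so that a rerun resumes from the last checkpoint instead of starting over. The function name, the JSON checkpoint file and the `fail_at` parameter (used only to simulate a fault) are assumptions for this example.

```python
import json
import os
import tempfile


def run_with_checkpoints(items, checkpoint_path, fail_at=None):
    """Sum items, saving progress after each one so a rerun can resume."""
    state = {"next_index": 0, "total": 0}

    # Resume from the last checkpoint if one exists.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)

    for i in range(state["next_index"], len(items)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated fault")  # illustration only

        state["total"] += items[i]
        state["next_index"] = i + 1

        # Write to a temporary file first, then swap it into place, so a
        # crash during the write never leaves a half-written checkpoint.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)

    return state["total"]
```

Calling the function again with the same checkpoint path after a failure is also a simple form of task resubmission: the work already completed is not repeated.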

Proactive Fault Tolerance

Software rejuvenation

Self-healing

Pre-emptive migration
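Software rejuvenation can be pictured as replacing a component with a fresh instance before software aging causes a failure. The sketch below assumes a simple request-count policy; the `RejuvenatingService` class, the `factory` callback and the `max_requests` limit are hypothetical, and real systems may use time-based or metric-based triggers instead.

```python
from typing import Callable


class RejuvenatingService:
    """Hypothetical wrapper that replaces its handler after a fixed number of requests."""

    def __init__(self, factory: Callable[[], Callable[[str], str]],
                 max_requests: int):
        if max_requests < 1:
            raise ValueError("max_requests must be at least 1")
        self._factory = factory
        self._max_requests = max_requests
        self._instance = factory()
        self._served = 0
        self.rejuvenations = 0

    def handle(self, request: str) -> str:
        # Proactively swap in a fresh instance once the current one has
        # served its quota, rather than waiting for it to fail.
        if self._served >= self._max_requests:
            self._instance = self._factory()
            self._served = 0
            self.rejuvenations += 1
        self._served += 1
        return self._instance(request)
```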

Tools for implementing fault tolerance

HAProxy:

Open source high availability and load balancing solution for TCP and HTTP based applications

De facto standard open source load balancer

ASSURE:

Automatic Software Self-healing Using REscue points

Uses rescue points to detect, tolerate and recover from software faults


SHelp:

Upgraded version of ASSURE

Assigns weighted values to rescue points and uses error virtualization techniques so that applications bypass the faulty path

High Availability

Can be achieved by having redundant failover servers

Can be achieved at the application, infrastructure and data center levels
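The benefit of redundant failover servers can be estimated with a standard back-of-the-envelope calculation. The sketch below assumes that replicas fail independently and that failover is instantaneous and always succeeds; real deployments, especially ones exposed to common mode outages, will do worse. The function names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def redundant_availability(single: float, replicas: int) -> float:
    """Availability of `replicas` servers when any one of them can serve requests.

    Assumes independent failures: the service is down only when every
    replica is down at the same time.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - (1.0 - single) ** replicas


def downtime_minutes_per_year(availability: float) -> float:
    """Expected yearly downtime in minutes for a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR
```

Under these assumptions, a single server that is 99% available is down for about 5,256 minutes a year, while two such servers in a failover pair reach 99.99%, or roughly 53 minutes a year.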

Types of Virtual Machine High Availability

Load sharing

Both replicas are active

Service requests are equally distributed between both of them

Updated dedicated hot standby

Two identical virtual machines execute on two different physical servers

Both virtual machines are fully synchronized with state information

VMware Fault Tolerance is an example


Not dedicated hot standby

Standby VM runs in parallel with the active VM

Standby is not fully synchronized

VMware HA and Symantec’s Veritas Cluster Server are examples


Shared hot standby

Uses a checkpointing mechanism to update the standby replica

Requires fewer resources for the standby replica
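The practical difference between a fully synchronized standby and one updated through checkpoints is how much state survives a failover. The toy model below is an assumption-laden illustration, not a description of any product: `ReplicatedCounter` and its `checkpoint_every` parameter are hypothetical, and a single integer stands in for the VM state.

```python
from typing import Optional


class ReplicatedCounter:
    """Toy active/standby pair whose 'state' is a single counter.

    checkpoint_every=None models a fully synchronized standby;
    checkpoint_every=N models a standby updated only every N changes.
    """

    def __init__(self, checkpoint_every: Optional[int] = None):
        self.active = 0
        self.standby = 0
        self._checkpoint_every = checkpoint_every
        self._since_checkpoint = 0

    def increment(self) -> None:
        self.active += 1
        if self._checkpoint_every is None:
            # Fully synchronized: every update reaches the standby at once.
            self.standby = self.active
            return
        self._since_checkpoint += 1
        if self._since_checkpoint >= self._checkpoint_every:
            # Checkpoint: copy the active state to the standby.
            self.standby = self.active
            self._since_checkpoint = 0

    def fail_over(self) -> int:
        """Promote the standby; updates made after the last checkpoint are lost."""
        self.active = self.standby
        self._since_checkpoint = 0
        return self.active
```

With full synchronization no update is lost on failover; with checkpoints the standby needs fewer resources but resumes from the last checkpoint, so recent updates have to be replayed or are lost.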

Cold standby

Standby replica is powered off and lies on storage media

Brought to service when the active VM fails

Useful for situations where availability requirements are low

Conclusion

Dependability is one of the major challenges in cloud computing

Adoption of cloud computing can be increased by addressing the dependability challenges
