Availability Analysis for Deployment of In-Cloud Applications

Post on 27-Jan-2015


International Symposium on Architecting Critical Systems (ISARCS) 2013 talk slides. June 19th, 2013. Full paper at http://www.nicta.com.au/pub?doc=6431


Availability Analysis for Deployment of In-Cloud Applications

Xiwei Xu, Qinghua Lu, Liming Zhu, Jim (Zhanwen) Li, Sherif Sakr, Hiroshi Wada, Ingo Weber

Software Systems Research Group, NICTA

ISARCS13, Vancouver

Slides at: http://www.slideshare.net/LimingZhu/

NICTA Copyright 2010 From imagination to impact 2

Motivation

• Uncertainties in the cloud make it challenging to architect critical applications and understand their availability
  – Shared resources, weak SLA guarantees and limited visibility
  – Rare but high-consequence events
  – Sporadic activities: upgrade, backup, recovery…
  – Subjective uncertainties: impact of configuration choices
• We want to explicitly model the above uncertainties in the availability analysis of cloud-deployed applications
  – From a cloud consumer's perspective
  – Focusing on the mechanisms most relevant to critical applications: auto-scaling, over-provisioning, backup, recovery and maintenance


Contributions

• SRN (Stochastic Reward Net)-based availability models
• which allow you to specify:
  – Deployment architecture (application placement in VMs)
  – Node-level or aggregation-level SLAs from infrastructure providers
  – Auto-scaling policies and recovery strategies
  – Rare events: an availability zone or region going down
• which give you the application availability levels of different options under different scenarios
• Model evaluation by analysing existing industry best practices in cloud application deployment
  – Quantifying rule-of-thumb best practices
  – Comparing different (best) practices


Deployment Architecture Assumption

– Stateless VMs: auto-scaling groups
– Stateful VMs: hot standbys
– Backup in a separate region for recovery


Availability Analysis Overview

• SRN-based models
• Architecture model and recovery model in this paper
• One SRN architecture model per availability zone


Availability Analysis Overview

• Deployment decisions and patterns
  – Stateless/stateful application placement within VMs
  – Auto-scaling policies
  – Multi-zone configurations


Availability Analysis Overview

• SLA from the cloud providers
  – Node level (Rackspace) or zone level (Amazon)


Availability Analysis Overview

• Recovery strategy
  – Auto-regeneration of stateless VMs and different recovery mechanisms for stateful VMs
  – Different Recovery Time/Point Objectives (RTO/RPO)


Availability Analysis Overview

• Application-specific data
  – Stateless VM start-up time…
  – Stateful VM replication…


Stochastic Reward Net

• Stochastic Reward Net (SRN)
  – Stochastic Petri Net variant
  – Firing delays
  – Reward function
• Constructs
  – Places: VM states (Full, Running, Stopped, Failed)
  – Tokens: VMs
  – Transitions
    • Guard function
    • Transition rate: 1) frequency of events, 2) delay before the transition fires
• Reward function: if (#Running1 > 0) 1 else 0
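As a rough illustration of how a reward function turns a stochastic model into an availability figure, the sketch below solves a much simpler model than the SRNs in the talk: N identical VMs that fail and are repaired independently, treated as a birth-death Markov chain. The rates `fail_rate` and `repair_rate`, the function names and the independence assumption are illustrative and not taken from the paper; only the reward function mirrors the slide.

```python
from math import comb


def steady_state(n_vms, fail_rate, repair_rate):
    """Steady-state distribution over k = number of failed VMs (0..n_vms).

    Assumed birth-death chain: k -> k+1 at rate (n_vms - k) * fail_rate,
    k+1 -> k at rate (k + 1) * repair_rate (each failed VM is repaired
    independently). Detailed balance then gives
    pi_k proportional to C(n_vms, k) * (fail_rate / repair_rate) ** k.
    """
    ratio = fail_rate / repair_rate
    weights = [comb(n_vms, k) * ratio ** k for k in range(n_vms + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def reward(running):
    """Reward function from the slide: 1 if at least one VM is running."""
    return 1 if running > 0 else 0


def availability(n_vms, fail_rate, repair_rate):
    """Expected steady-state reward, i.e. the probability the service is up."""
    pi = steady_state(n_vms, fail_rate, repair_rate)
    return sum(p * reward(n_vms - k) for k, p in enumerate(pi))


if __name__ == "__main__":
    # Hypothetical rates per hour: one failure per 1000 h, repair in 2 h.
    for n in (1, 2, 3):
        print(n, availability(n, fail_rate=0.001, repair_rate=0.5))
```

A real SRN adds guards, multiple places per zone and non-trivial recovery paths, so it is normally solved with a dedicated tool rather than a closed form like this one.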


SRN-based Availability Models



Availability Models: Auto-scaling

gScaleSelf1: if(#Running1<=#Running2 && #Stopped1>0) 1 else 0

gScaleOther1: if(#Running1>#Running2 && #Stopped2>0) 1 else 0
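The two guards above can be read as predicates over the current marking (the token count in each place). The following is a minimal sketch of that reading, assuming the marking is held in a plain dictionary keyed by place name; the dictionary representation and the Python function names are illustrative, not part of the original model.

```python
def g_scale_self_1(marking):
    """gScaleSelf1: enabled when group 1 has no more running VMs than
    group 2 and still has a stopped VM it can start."""
    enabled = marking["Running1"] <= marking["Running2"] and marking["Stopped1"] > 0
    return 1 if enabled else 0


def g_scale_other_1(marking):
    """gScaleOther1: enabled when group 1 has more running VMs than
    group 2 and group 2 has a stopped VM that can be started instead."""
    enabled = marking["Running1"] > marking["Running2"] and marking["Stopped2"] > 0
    return 1 if enabled else 0


if __name__ == "__main__":
    # Hypothetical marking: token counts per place.
    marking = {"Running1": 2, "Running2": 3, "Stopped1": 1, "Stopped2": 0}
    print(g_scale_self_1(marking), g_scale_other_1(marking))
```

Because the first conditions are complementary (`<=` versus `>`), at most one of the two guards is enabled for any marking.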


Availability Models: Stateful VM


Availability Models: Disaster Recovery

• Availability zone life cycle
  – Interacts with the big architecture model
• Stateless VM recovery
  – Backup/AMI
• Stateful VM recovery
  – Backup
  – Replica
  – Hot standby


Case 1: Multi-zone Deployment

• Parameters
  – Amazon EC2 SLA of 99.95% availability
  – Zone fail rate: 0.00011, MTTR: 4.38 hours per year
  – Application-specific measurement of transitions

Chart annotations:
  – 0.01% = 52.56 mins downtime per year
  – 0.4% diff = 35 hours
  – 0.76% diff = 66 hours
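The downtime figures quoted on this slide follow from multiplying an unavailability percentage by the length of a year. A small sketch of that conversion, assuming a 365-day year; the function name is illustrative:

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(unavailability_percent):
    """Convert an unavailability (or availability difference) given in
    percent into hours of downtime per year."""
    return unavailability_percent / 100 * HOURS_PER_YEAR


if __name__ == "__main__":
    # 99.95% SLA -> 0.05% unavailability
    print(f"{downtime_hours_per_year(0.05):.2f} hours")
    # 0.01% expressed in minutes
    print(f"{downtime_hours_per_year(0.01) * 60:.2f} minutes")
    # Availability differences annotated on the chart
    for diff in (0.4, 0.76):
        print(f"{diff}% -> {downtime_hours_per_year(diff):.1f} hours")
```

Under this assumption, 0.05% corresponds to about 4.38 hours per year, 0.01% to about 52.56 minutes, 0.4% to about 35 hours and 0.76% to about 66.6 hours, which the slide quotes as 66.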


Case 2: Recovery across Availability Zones

• Industry rule of thumb: “Target auto-scale 30-60% until you have 50% headroom for load spikes. Lose an AZ leads to 90% utilisation.”
• Impact on overall availability?
• 30-60% vs. traditional 70-90%?
• Over-provisioning vs. auto-scaling?

0.29% diff = 25 hours
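One way to read the 90% figure in the rule of thumb: if load is spread evenly over three equally sized availability zones running at 60% utilisation, losing one zone pushes the remaining two to 90%. The three-zone count and the even load split are assumptions made here for illustration; the slide does not state them. A minimal sketch:

```python
def utilisation_after_zone_loss(utilisation, zones, lost=1):
    """Utilisation of the surviving zones after `lost` zones fail,
    assuming equal capacity per zone and evenly redistributed load."""
    if not 0 < lost < zones:
        raise ValueError("lost must be between 1 and zones - 1")
    return utilisation * zones / (zones - lost)


if __name__ == "__main__":
    # Upper end of the 30-60% target and of the traditional 70-90% target,
    # both with three zones (assumed).
    for target in (0.6, 0.9):
        after = utilisation_after_zone_loss(target, zones=3)
        status = "overloaded" if after > 1 else "within capacity"
        print(f"{target:.0%} -> {after:.0%} ({status})")
```

Under the same assumptions, a deployment already running at the traditional 70-90% would exceed the capacity of the surviving zones.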


Case 3: Disaster Recovery across Regions

• Trade-off between RPO and RTO
  – RPO: Recovery Point Objective
  – RTO: Recovery Time Objective

Yuruware — http://www.yuruware.com/

0.2% diff = 17 hours


Conclusion and Future Work

• SRN-based availability models
  – Application-level availability
  – Highly configurable for different deployment architectures
  – Model different uncertainties and scenarios for critical systems
  – Quantify and compare choices and enable what-if analysis
  – Evaluated using industry best practices
• Future work
  – Better evaluation!
  – Integrated models on the impact of upgrade, live migration, backup and subjective uncertainties (in IEEE Cloud 13)

Q. Lu, X. Xu, L. Zhu, L. Bass, et al., "Incorporating Uncertainty into in-Cloud Application Deployment Decisions for Availability," in IEEE Cloud 2013

Liming.Zhu@nicta.com.au
Slides available at http://www.slideshare.net/LimingZhu/
