Availability Analysis for Deployment of In-Cloud Applications

NICTA Copyright 2010. From imagination to impact.
Availability Analysis for Deployment of In-Cloud Applications
Xiwei Xu, Qinghua Lu, Liming Zhu, Jim (Zhanwen) Li, Sherif Sakr, Hiroshi Wada, Ingo Weber
Software Systems Research Group, NICTA
ISARCS13, Vancouver
Slides at: http://www.slideshare.net/LimingZhu/


International Symposium on Architecting Critical Systems (ISARCS) 2013 talk slides. June 19th, 2013. Full paper at http://www.nicta.com.au/pub?doc=6431

Transcript of Availability Analysis for Deployment of In-Cloud Applications

Page 1: Availability Analysis for Deployment of In-Cloud Applications

Availability Analysis for Deployment of In-Cloud Applications

Xiwei Xu, Qinghua Lu, Liming Zhu, Jim (Zhanwen) Li, Sherif Sakr, Hiroshi Wada, Ingo Weber

Software Systems Research Group, NICTA

ISARCS13, Vancouver

Slides at: http://www.slideshare.net/LimingZhu/

Page 2: Availability Analysis for Deployment of In-Cloud Applications


Motivation

• Uncertainties in the cloud make it hard to architect critical applications and to understand their availability
– Shared resources, weak SLA guarantees and limited visibility
– Rare but high-consequence events
– Sporadic activities: upgrade, backup, recovery…
– Subjective uncertainties: impact of configuration choices

• We want to explicitly model the above uncertainties in the availability analysis of applications deployed in the cloud
– from a cloud consumer perspective
– focusing on mechanisms most relevant to critical applications: auto-scaling, over-provisioning, backup, recovery and maintenance

Page 3: Availability Analysis for Deployment of In-Cloud Applications


Contributions

• SRN (Stochastic Reward Net)-based availability models
• which allow you to specify:
– Deployment architecture (application placement in VMs)
– Node/aggregation-level SLAs from infrastructure providers
– Auto-scaling policies and recovery strategies
– Rare events: availability zone or region down

• which give you application availability levels of different options under different scenarios

• Model evaluation by analysing existing industry best practices in cloud application deployment
– Quantifying the rule-of-thumb best practices
– Comparing different (best) practices

Page 4: Availability Analysis for Deployment of In-Cloud Applications


Deployment Architecture Assumption

– Stateless VMs: auto-scaling groups
– Stateful VMs: hot standbys
– Backup in a separate region for recovery

Page 5: Availability Analysis for Deployment of In-Cloud Applications


Availability Analysis Overview

• SRN-based models
• Architecture model and recovery model in this paper
• One SRN architecture model per availability zone

Page 6: Availability Analysis for Deployment of In-Cloud Applications


Availability Analysis Overview

• Deployment decisions and patterns
– stateless/stateful application placement within VMs
– auto-scaling policies
– multi-zone configurations

Page 7: Availability Analysis for Deployment of In-Cloud Applications


Availability Analysis Overview

• SLA from the cloud providers
• Node level (Rackspace) or zone level (Amazon)

Page 8: Availability Analysis for Deployment of In-Cloud Applications


Availability Analysis Overview

• Recovery strategy
• Auto-regeneration of stateless VMs and different recovery mechanisms for stateful VMs
• Different Recovery-Time/Point-Objective (RTO/RPO)

Page 9: Availability Analysis for Deployment of In-Cloud Applications


Availability Analysis Overview

• Application-specific data
– Stateless VM start-up time…
– Stateful VM replication…

Page 10: Availability Analysis for Deployment of In-Cloud Applications


Stochastic Reward Net

• Stochastic Reward Net (SRN)
– Stochastic Petri Net variant
– Firing delays
– Reward function

• Constructs
• Places: VM states (Full, Running, Stopped, Failed)
• Token: VMs
• Transition
• Guard function
• Transition rate: 1) frequency of events, 2) delay before the transition fires

• Reward function: if (#Running1>0) 1 else 0
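To make the role of the reward function concrete, here is a minimal sketch. It is not the SRN model from the talk (which the slides show only as figures) but a much simpler stand-in: a hypothetical group of `n_vms` VMs, each failing independently at rate `fail_rate` and being repaired independently at rate `repair_rate`. Under those assumptions the number of running VMs forms a birth-death chain, and the expected steady-state value of the slide's reward, 1 if at least one VM is running and 0 otherwise, is the availability. All function names and rates below are illustrative.

```python
def steady_state(n_vms, fail_rate, repair_rate):
    """Steady-state distribution over k = number of running VMs (0..n_vms).

    Assumed birth-death chain (illustrative, not the paper's SRN):
      k -> k-1 at rate k * fail_rate
      k -> k+1 at rate (n_vms - k) * repair_rate
    """
    weights = [1.0]  # unnormalised probability of k = 0
    for k in range(n_vms):
        up = (n_vms - k) * repair_rate   # rate of k -> k+1
        down = (k + 1) * fail_rate       # rate of k+1 -> k
        weights.append(weights[-1] * up / down)  # detailed balance
    total = sum(weights)
    return [w / total for w in weights]


def reward(running):
    """Reward function from the slide: 1 if at least one VM is running."""
    return 1 if running > 0 else 0


def availability(n_vms, fail_rate, repair_rate):
    """Expected steady-state reward, i.e. probability the service is up."""
    probs = steady_state(n_vms, fail_rate, repair_rate)
    return sum(p * reward(k) for k, p in enumerate(probs))
```

With one VM, `fail_rate=1.0` and `repair_rate=3.0` this gives 0.75; adding a second VM raises it to 0.9375, which is the kind of what-if comparison the reward function enables.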

Page 11: Availability Analysis for Deployment of In-Cloud Applications


SRN-based Availability Models

Page 12: Availability Analysis for Deployment of In-Cloud Applications


Availability Models: Auto-scaling

Page 13: Availability Analysis for Deployment of In-Cloud Applications


Availability Models: Auto-scaling

gScaleSelf1: if(#Running1<=#Running2 && #Stopped1>0) 1 else 0

gScaleOther1: if(#Running1>#Running2 && #Stopped2>0) 1 else 0
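The two guards above can be read as plain predicates over the current marking (the token count in each place). The sketch below transcribes them literally into Python; the dictionary representation of a marking and the function names are assumptions made for illustration, and the slides do not spell out which transitions the guards are attached to.

```python
def g_scale_self1(marking):
    """gScaleSelf1: 1 if #Running1 <= #Running2 and #Stopped1 > 0, else 0."""
    enabled = (marking["Running1"] <= marking["Running2"]
               and marking["Stopped1"] > 0)
    return 1 if enabled else 0


def g_scale_other1(marking):
    """gScaleOther1: 1 if #Running1 > #Running2 and #Stopped2 > 0, else 0."""
    enabled = (marking["Running1"] > marking["Running2"]
               and marking["Stopped2"] > 0)
    return 1 if enabled else 0


# Hypothetical marking: token counts per place in two zones.
example = {"Running1": 2, "Running2": 3, "Stopped1": 1, "Stopped2": 0}
print(g_scale_self1(example), g_scale_other1(example))
```

The two `#Running` comparisons are complementary, so at most one guard holds for any marking. One plausible reading is that this keeps the number of running VMs balanced across the two zones.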

Page 14: Availability Analysis for Deployment of In-Cloud Applications


Availability Models: Stateful VM

Page 15: Availability Analysis for Deployment of In-Cloud Applications


Availability Models: Disaster Recovery

• Availability zone life cycle
– Interact with the big architecture model

• Stateless VM recovery
– Backup/AMI

• Stateful VM recovery
– Backup
– Replica
– Hot standby

Page 16: Availability Analysis for Deployment of In-Cloud Applications


Case 1: Multi-zone Deployment

• Parameters
– Amazon EC2 SLA of 99.95% availability
– Zone fail rate: 0.00011, MTTR: 4.38 hours per year
– Application-specific measurement of transitions

0.01% = 52.56 mins downtime per year

0.4% diff = 35 hours

0.76% diff = 66 hours
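The downtime figures on this slide follow from converting an availability difference into hours per year. The sketch below reproduces that arithmetic, assuming a 365-day year; the helper name is illustrative. It also shows that the 4.38 hours quoted above is the yearly downtime implied by a 99.95% SLA. The zone fail rate of 0.00011 is close to 1/8760, which would correspond to about one failure per year if the rate is per hour; the slide does not state the unit, so that reading is an assumption.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(unavailability_percent):
    """Convert an unavailability (or availability difference) in percent
    into hours of downtime per year."""
    return unavailability_percent / 100.0 * HOURS_PER_YEAR


# 0.01% -> about 52.56 minutes per year
print(f"{downtime_hours_per_year(0.01) * 60:.2f} minutes")

# 99.95% SLA -> 0.05% unavailability -> about 4.38 hours per year
print(f"{downtime_hours_per_year(100 - 99.95):.2f} hours")

# Differences between deployment options quoted on the slide
for diff in (0.4, 0.76):
    print(f"{diff}% -> {downtime_hours_per_year(diff):.1f} hours")

# Assumed reading of the fail rate: one zone failure per year, per hour
print(f"{1 / HOURS_PER_YEAR:.5f} failures per hour")
```

The 0.76% difference works out to roughly 66.6 hours, which the slide reports as 66 hours.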

Page 17: Availability Analysis for Deployment of In-Cloud Applications


Case 2: Recovery across Availability Zone

• Industry rule of thumb: “Target auto-scale 30-60% until you have 50% headroom for load spikes. Lose an AZ leads to 90% utilisation.”
• Impact on overall availability?
• 30-60% vs. traditional 70-90%?
• Over-provisioning vs. auto-scaling?

0.29% diff = 25 hours
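The "90% utilisation" in the rule of thumb can be checked with simple arithmetic. The sketch below assumes load is spread evenly over equally sized zones and that the surviving zones absorb all the load of a lost one. The quote does not say how many zones are involved; three is an inference, since 60% spread over two remaining zones gives 90%. The function name is illustrative.

```python
def utilisation_after_zone_loss(utilisation, zones, zones_lost=1):
    """Per-zone utilisation after the load of the lost zones is spread
    evenly over the surviving zones (assumes equal-sized zones)."""
    remaining = zones - zones_lost
    if remaining <= 0:
        raise ValueError("no zones left to absorb the load")
    return utilisation * zones / remaining


# Upper end of the 30-60% target and of the traditional 70-90% target,
# for an assumed three-zone deployment
for target in (0.60, 0.90):
    after = utilisation_after_zone_loss(target, zones=3)
    print(f"{target:.0%} before -> {after:.0%} after losing one zone")
```

Under these assumptions a 60% target leaves the surviving zones at about 90%, while a traditional 90% target would push them to about 135%, beyond their capacity.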

Page 18: Availability Analysis for Deployment of In-Cloud Applications


Case 3: Disaster Recovery across Regions

• Trade-off between RPO and RTO
• RPO: Recovery Point Objective
• RTO: Recovery Time Objective

Yuruware — http://www.yuruware.com/

0.2% diff = 17 hours

Page 19: Availability Analysis for Deployment of In-Cloud Applications


Conclusion and Future Work

• SRN-based availability models
– Application-level availability
– Highly configurable for different deployment architectures
– Model different uncertainties and scenarios for critical systems
– Quantify and compare choices and enable what-if analysis
– Evaluated using industry best practices

• Future work
– Better evaluation!
– Integrated models on impact of upgrade, live migration, backup and subjective uncertainties (in IEEE Cloud 13)

Q. Lu, X. Xu, L. Zhu, L. Bass, et al., "Incorporating Uncertainty into in-Cloud Application Deployment Decisions for Availability," in IEEE Cloud 2013

Slides available at http://www.slideshare.net/LimingZhu/
