Challenges in Practicing High Frequency Releases in Cloud Environments

NICTA Copyright 2012 From imagination to impact Challenges in Practicing High Frequency Releases in Cloud Environments Liming Zhu, Donna Xu, Xiwei Xu, An Binh Tran, Ingo Weber, Len Bass NICTA/UNSW http://slideshare.net/limingzhu

description

Talk at RELENG 2014. Full paper: http://www.nicta.com.au/pub?doc=7925 The continuous delivery trend is dramatically shortening release cycles from months to hours. Applications with high frequency releases often rely heavily on automated deployment tools that use cloud infrastructure APIs. We report some results from experiments on reliability issues of cloud infrastructure and on the trade-offs between heavily-baked and lightly-baked images. Our experiments were based on the Amazon Web Services (AWS) OpsWorks APIs and the configuration management tool Chef. Based on these experiments, we propose error handling practices that can be included in tailor-made continuous deployment facilities. More related information is available in our DevOps book: http://www.ssrg.nicta.com.au/projects/devops_book/

Transcript of Challenges in Practicing High Frequency Releases in Cloud Environments

Page 1: Challenges in Practicing High Frequency Releases in Cloud Environments

Challenges in Practicing High Frequency Releases in Cloud Environments

Liming Zhu, Donna Xu, Xiwei Xu, An Binh Tran, Ingo Weber, Len Bass

NICTA/UNSW

http://slideshare.net/limingzhu

Page 2: Challenges in Practicing High Frequency Releases in Cloud Environments

NICTA (National ICT Australia)

• Australia’s National Centre of Excellence in Information and Communication Technology

• Five Research Labs:
  – ATP: Australian Technology Park, Sydney
  – NRL: UNSW, Sydney
  – CRL: ANU, Canberra
  – VRL: Uni. Melbourne
  – QRL: Uni. Queensland and QUT

• 700 staff including 270 PhD students

• Research Groups
  – Software Systems Research Group (SSRG): ssrg.nicta.com.au
  – Machine Learning, Optimisation, Networks, Computer Vision

Page 3: Challenges in Practicing High Frequency Releases in Cloud Environments

Challenge: High Frequency Releases/Changes

• Significantly shorter release cycles and DevOps
  – Continuous delivery/deployment
    • From months (at scheduled downtime) to hours (at all times)

• Cloud uncertainty during provisioning/deployment
  – Heavy reliance on cloud APIs; indirect control
  – Other “sporadic” operations: cron jobs/backup/reconfig...
  – Our focus: error detection/diagnosis during continuous “changes”
    • Anomaly detection/monitoring designed for normal operation does not work

• One solution: machine images as build artifacts?
  – Heavily-baked vs. lightly-baked? Immutable server?

Page 4: Challenges in Practicing High Frequency Releases in Cloud Environments

Heavily-Baked vs. Lightly-Baked

• Heavily-baked approach
  + No server drift, consistent, more reliable?
  – Image preparation time for any minor release
  – Image sprawl
  – Image consistency among teams
    • coordination, golden image, image inheritance...

• Lightly-baked approach
  + Highly dynamic, config-as-service, less restarting…
  – Less reliable due to runtime dependence on external services (e.g., repo, configuration services)?
  – Drifting, outcome validation, race conditions...

Page 5: Challenges in Practicing High Frequency Releases in Cloud Environments

Motivating Example: Rolling Upgrade

• Used in large-scale web operations
  – Have 100+ servers in cloud with version 1 software
  – Upgrade 10 servers at a time to version 2 software

• Can take a long time to complete, with errors during the operation
  – Provisioning failures, logical failures, instance failures
  – Other interfering operations

• Heavily-baked vs. lightly-baked
  – Past experience: Netflix Asgard with heavily-baked images
  – AWS OpsWorks:
    • DevOps automation + life cycle events + abstraction
    • Heavily-baked + built-in recipe vs. lightly-baked + custom recipe
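
The rolling upgrade described on this slide can be sketched roughly as follows. This is only an illustration under stated assumptions, not the tooling used in the experiments: `rolling_upgrade`, `upgrade_one`, and `max_failures` are hypothetical names, and `upgrade_one` stands in for whatever cloud API calls actually replace or reconfigure a server.

```python
from typing import Callable, Dict, List


def rolling_upgrade(
    servers: List[str],
    upgrade_one: Callable[[str], bool],
    batch_size: int = 10,
    max_failures: int = 0,
) -> Dict[str, List[str]]:
    """Upgrade servers batch by batch; halt once too many upgrades failed."""
    upgraded: List[str] = []
    failed: List[str] = []
    for start in range(0, len(servers), batch_size):
        batch = servers[start:start + batch_size]
        for server in batch:
            # upgrade_one is a hypothetical callback that returns True
            # when the server ends up running the new version.
            if upgrade_one(server):
                upgraded.append(server)
            else:
                failed.append(server)
        if len(failed) > max_failures:
            # Stop before touching later batches so that most of the
            # fleet keeps serving the old, known-good version.
            pending = servers[start + batch_size:]
            return {"upgraded": upgraded, "failed": failed, "pending": pending}
    return {"upgraded": upgraded, "failed": failed, "pending": []}
```

Halting after a failed batch is one possible policy; a real deployment tool might instead retry the failed servers or roll back the batch.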

Page 6: Challenges in Practicing High Frequency Releases in Cloud Environments

Observations 1/3

Page 7: Challenges in Practicing High Frequency Releases in Cloud Environments

Observations 2/3

Page 8: Challenges in Practicing High Frequency Releases in Cloud Environments

Observations 3/3

Page 9: Challenges in Practicing High Frequency Releases in Cloud Environments

Solutions for Better Reliability/Predictability

• Ad hoc tactics to reduce tails
  – Inspired by Jeff Dean’s “Tail at Scale” CACM article
  – Retry with alternative options
    • stop-restart, replace, deploy without restart
  – Fail fast
    • Track status time and use the 95th percentile to fail fast
  – Asynchronous waves for upgrading granularity >1

• Validate intermediary outcomes
  – Inside machine:
    • Chef Mini-test; test cases in production monitoring
  – Outside machine:
    • Process-Oriented Dependability (POD)
    • Assertion checking and conformance checking
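
The "fail fast" and "retry with alternative options" tactics can be combined in a small sketch like the one below. It assumes that durations of past runs of a step are available as a list of numbers, and that each alternative action accepts a timeout and reports whether it finished in time. The names `percentile_95` and `run_with_alternatives` are hypothetical, and the nearest-rank percentile is just one reasonable choice.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def percentile_95(samples: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of previously observed durations."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer form of ceil(0.95 * n), avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def run_with_alternatives(
    actions: List[Tuple[str, Callable[[float], bool]]],
    history: Sequence[float],
) -> Optional[str]:
    """Try each alternative in order, giving up on one after the timeout.

    Each action is a hypothetical callable that takes a timeout and
    returns True only if it completed successfully within that time.
    """
    timeout = percentile_95(history)
    for name, action in actions:
        if action(timeout):
            return name  # the first alternative that succeeded
    return None  # every alternative failed or timed out
```

A caller might list the alternatives from the slide in order of cost, for example stop-restart first, then replace, then deploy without restart.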

Page 10: Challenges in Practicing High Frequency Releases in Cloud Environments

Process-Oriented Dependability (POD)

• Offline: treat operations as processes
  – Process discovered automatically from logs/scripts
    • Log line clustering and process mining
  – Expected step outcomes specified as assertions

• Online: use process context
  – Process context: process/instance/step ids, expected states
  – Errors are detected by examining logs and monitoring data
    • Assertion evaluation using monitoring facilities or directly
    • Compliance checking against expected processes
  – Detected errors are further diagnosed for (root) causes
    • Examining a fault tree to locate potential root causes
    • Performing more diagnostic tests and on-demand assertions

X. Xu, L. Zhu, et al., “POD-Diagnosis: Error Diagnosis of Sporadic Operations on Cloud Applications,” 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2014.
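
To make "expected step outcomes specified as assertions" more concrete, here is a minimal sketch of assertion evaluation keyed by step id. It is not the POD implementation: `StepAssertion`, `evaluate_assertions`, the step id, and the `in_service` metric are all hypothetical, and the observed state is assumed to be a plain dictionary of monitoring values.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class StepAssertion:
    step_id: str                            # process step the assertion belongs to
    description: str                        # human-readable expected outcome
    check: Callable[[Dict[str, int]], bool]  # predicate over observed state


def evaluate_assertions(
    step_id: str,
    observed_state: Dict[str, int],
    assertions: List[StepAssertion],
) -> List[str]:
    """Return the descriptions of assertions violated after a given step."""
    return [
        a.description
        for a in assertions
        if a.step_id == step_id and not a.check(observed_state)
    ]


# Hypothetical assertion: after a batch is upgraded, at least 90
# instances should still be in service.
ASSERTIONS = [
    StepAssertion(
        step_id="upgrade-batch",
        description="at least 90 instances in service",
        check=lambda state: state.get("in_service", 0) >= 90,
    ),
]
```

The step id plays the role of the process context: it selects which expectations apply at the point in the operation where the check runs.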

Page 11: Challenges in Practicing High Frequency Releases in Cloud Environments

Example: Rolling Upgrade Using Asgard

[Figure: rolling upgrade using Asgard, split into Offline and Online parts. Labelled elements: Operator, Asgard, Log data, Process Mining Service, Discovered Model. Edge labels: Read by, Controls, Outputs, Generates. Steps shown in the discovered model: Create Snapshot, Check AZs, Create instance from snapshot, Create AMI from instance, Evaluate AMI.]

Error Detection Service has two methods for detecting errors:
• Assertion Checking
• Conformance Checking
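
Conformance checking can be illustrated with a small sketch that compares the steps observed in a log against an expected process model. The step names come from the figure, but their ordering here is an assumption made only for illustration, since the original diagram is not recoverable from the transcript. `EXPECTED_NEXT` and `first_deviation` are hypothetical names.

```python
from typing import Dict, List, Optional, Set

# Assumed ordering of the discovered steps (illustrative only): each key
# maps to the set of steps that may legally follow it.
EXPECTED_NEXT: Dict[str, Set[str]] = {
    "START": {"Create Snapshot"},
    "Create Snapshot": {"Check AZs"},
    "Check AZs": {"Create instance from snapshot"},
    "Create instance from snapshot": {"Create AMI from instance"},
    "Create AMI from instance": {"Evaluate AMI"},
    "Evaluate AMI": set(),
}


def first_deviation(
    trace: List[str],
    model: Dict[str, Set[str]],
) -> Optional[str]:
    """Return a message for the first step that the model does not allow."""
    current = "START"
    for step in trace:
        if step not in model.get(current, set()):
            return f"unexpected step '{step}' after '{current}'"
        current = step
    return None  # every observed step conformed to the model
```

A trace that skips or reorders a step is reported at the point where it first leaves the model, which gives error diagnosis a concrete step to start from.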

Page 12: Challenges in Practicing High Frequency Releases in Cloud Environments

Summary

• Lightly vs. heavily-baked for high frequency releases

• Solutions for unreliable processes
  – Some tactics to reduce long tails
    • fail fast, alternative actions, asynchronous waves…
  – Validate intermediary outcomes
    • Inside machine: Chef Mini-test; test cases in production monitoring
    • Outside machine: Process-Oriented Dependability (POD)
  – Assertion checking and conformance checking
    • Currently integrating with monitoring and alerting

• We need industry help and collaboration
  – Logs, trials, feedback, case studies as book chapters

Book: http://www.ssrg.nicta.com.au/projects/devops_book/

Contact: [email protected]