The Challenges of Operating a Computing Cloud and Charging for its Use

Marvin Theimer
VP/Distinguished Engineer
Amazon Web Services

©2017, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Customers Want It All

• Lots of features and all the “ilities”
• Pay as little as possible
• Get it as soon as possible

Trade-offs Must Be Made

• Inherent tension between customers’ desires
• MUST work backwards from the customer
• It’s not always obvious what each customer really wants

Scaling Challenges

• A big compute cloud has at least a million physical servers world-wide
• Amazon S3 stores trillions of objects, contains exabytes of data, and fields millions of requests/second
• A service-oriented architecture (SOA) implies there are many services
• Amazon has tens of thousands of services
• A big service like Amazon S3 may require tens of thousands of servers

You Need Availability Too

• Example definition of availability:

“The number of 5-minute intervals during which the ratio of error returns (HTTP 500’s) to total system requests is less than 5% over the total number of 5-minute intervals.”
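
The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Interval` record, the `availability` function, and the choice to count idle intervals as good are all hypothetical and not taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    """Request counts observed during one 5-minute interval (hypothetical record)."""
    total_requests: int
    error_responses: int  # HTTP 5xx returns


def availability(intervals, error_threshold=0.05):
    """Fraction of intervals whose error ratio is below the threshold."""
    if not intervals:
        raise ValueError("need at least one interval")
    good = 0
    for iv in intervals:
        # Assumption: an interval with no traffic has no errors, so it counts as good.
        ratio = iv.error_responses / iv.total_requests if iv.total_requests else 0.0
        if ratio < error_threshold:
            good += 1
    return good / len(intervals)


intervals = [
    Interval(total_requests=1000, error_responses=3),   # 0.3% errors: good
    Interval(total_requests=1200, error_responses=90),  # 7.5% errors: bad
    Interval(total_requests=0, error_responses=0),      # idle: treated as good
    Interval(total_requests=800, error_responses=39),   # 4.875% errors: good
]
print(availability(intervals))  # 0.75
```

Note that under this definition an interval with 4.9% errors still counts as fully available, which is one reason the per-customer view discussed later matters.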

Levels of Availability

• Availability: amount of downtime per year
  • 99.8%: 17.5 hours
  • 99.9% (3x9’s): 8.8 hours
  • 99.99% (4x9’s): 52.6 minutes
  • 99.999% (5x9’s): 5.26 minutes
  • 99.9999% (6x9’s): 31.5 seconds
  • 99.99999% (7x9’s): 3.15 seconds
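
The downtime figures follow directly from the availability percentage. A minimal sketch of the arithmetic, assuming a 365-day year (the function name is illustrative):

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year


def downtime_seconds_per_year(availability_percent):
    """Allowed downtime per year, in seconds, for a given availability percentage."""
    return (1 - availability_percent / 100) * SECONDS_PER_YEAR


for pct in (99.8, 99.9, 99.99, 99.999, 99.9999, 99.99999):
    secs = downtime_seconds_per_year(pct)
    print(f"{pct}%: {secs:.2f} s = {secs / 60:.2f} min = {secs / 3600:.2f} h")
```

For example, 99.99% leaves 0.0001 × 31,536,000 s ≈ 3,154 s, or about 52.6 minutes per year.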

Some Implications of Various Levels of Availability

• 99.8%: 17.5 hours
  • You might cripple your business (e.g. Intuit on Apr 15th)
• 3x9’s: 8.8 hours
  • You can afford to do occasional small scheduled downtimes
• 4x9’s: 52.6 minutes
  • Can’t do scheduled downtimes of any significance
  • Paged human has about 30-40 minutes to correct/restart things

Some Implications of Various Levels of Availability

• 5x9’s: 5.26 minutes
  • Paged human won’t be on-line before you’ve exceeded your yearly SLA
  • De-facto need fully automated failure response system
  • Humans can only be involved with longer-term trends management
• 6x9’s: 31.5 seconds
  • Have to redefine what you mean by availability (5 min. intervals too coarse)
  • In the range of throttling delays
• 7x9’s: 3.15 seconds
  • Below the practical threshold for distributed leased locks
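
A short calculation shows why 5-minute intervals are too coarse at 6x9’s. The sketch below assumes a 365-day year and the interval-based definition of availability; the function name is illustrative.

```python
INTERVALS_PER_YEAR = 365 * 24 * 12  # 105,120 five-minute intervals in a 365-day year


def max_bad_intervals(target_availability):
    """Whole 5-minute intervals that may be 'bad' per year without missing the target."""
    budget = (1 - target_availability) * INTERVALS_PER_YEAR
    return int(budget)  # round down to whole intervals


print(max_bad_intervals(0.9999))    # 10
print(max_bad_intervals(0.99999))   # 1
print(max_bad_intervals(0.999999))  # 0
```

A single bad interval already costs 1/105,120 ≈ 0.00095% of the year, which is more than the entire 0.0001% budget of a 6x9’s target, so the measurement itself has to become finer-grained.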

Reality Bites

• Developers are fallible
• Cloud services evolve quickly
• Near-perfect automation/fault-tolerance is expensive
• Current state-of-the-art requires humans in the loop to deal with unforeseen circumstances
• You can build 6x9’s available services, but it may not represent the right cost/benefit trade-off for most customers

You Need Logging

• Volume of log traffic is measured in TB/hour
• For a large service the log volume is still of that order of magnitude
• Can’t just grep it: you need a full-blown search capability
• Richer queries imply even more technology
• Need for timely answers pushes you towards near-real-time support
• It’s all technologically feasible
• It just costs a lot
• How much cost is justifiable?

You Need Metrics as Well as Logging

• Log queries are for debugging
• To determine whether a service/system is behaving properly you need metrics
• Ideally you track “everything”
• Far too expensive
  • Cost of gathering
  • Attention cost
• You have to figure out the metrics “working set” you need
• What are your “leading” metrics?

Metrics Challenges

• The working set may change in unobvious ways as your service – or its workloads – evolve
• Important to have “tripwire” metrics
• Important to have automated alarms
• Also need to have alarm deduplication/squelching
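
One simple form of alarm deduplication is to suppress repeats of the same alarm within a quiet period. The sketch below is a hypothetical illustration of that idea, not a description of any AWS system; the class name, alarm keys, and 300-second default are all assumptions.

```python
class AlarmSquelcher:
    """Suppress repeats of the same alarm within a quiet period (hypothetical helper)."""

    def __init__(self, quiet_period_seconds=300):
        self.quiet_period_seconds = quiet_period_seconds
        self._last_notified = {}  # alarm key -> time of the last notification sent

    def should_notify(self, alarm_key, now):
        """Return True if this alarm should page someone, False if it is squelched."""
        last = self._last_notified.get(alarm_key)
        if last is not None and now - last < self.quiet_period_seconds:
            return False
        self._last_notified[alarm_key] = now
        return True


squelcher = AlarmSquelcher(quiet_period_seconds=300)
events = [
    ("frontend/5xx-rate", 0),
    ("frontend/5xx-rate", 60),   # duplicate inside the quiet period
    ("index/latency-p99", 90),   # different alarm, so it goes through
    ("frontend/5xx-rate", 400),  # quiet period has expired
]
notified = [(key, t) for key, t in events if squelcher.should_notify(key, t)]
print(notified)
```

Only three of the four events reach the on-call engineer; the repeat at t=60 is dropped. Real systems typically also correlate different alarms that share a cause, which this sketch does not attempt.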

Availability as Seen by Individual Customers

• A service can be 99.99% available and an individual customer can still have a really bad day
• Ideally, want near-real-time “top-N” metrics
• These are not cheap
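
The “bad day” effect is easy to reproduce with a toy data set. The sketch below uses a request-level error rate rather than the interval-based definition, and every name and number in it is hypothetical. It computes a batch top-N over an in-memory list; doing the same in near real time over millions of requests per second is what makes such metrics expensive.

```python
import heapq
from collections import defaultdict


def top_n_by_error_rate(requests, n=3, min_requests=1):
    """Return the n customers with the highest error rate as (customer, rate) pairs."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for customer_id, status_code in requests:
        totals[customer_id] += 1
        if status_code >= 500:
            errors[customer_id] += 1
    rates = {
        c: errors[c] / totals[c] for c in totals if totals[c] >= min_requests
    }
    return heapq.nlargest(n, rates.items(), key=lambda item: item[1])


# 99,990 successful requests spread over 100 customers,
# plus one customer whose 10 requests all fail.
requests = [(f"cust-{i % 100}", 200) for i in range(99_990)]
requests += [("cust-unlucky", 500)] * 10

overall_error_rate = sum(1 for _, s in requests if s >= 500) / len(requests)
print(overall_error_rate)                   # 0.0001, i.e. 99.99% of requests succeed
print(top_n_by_error_rate(requests, n=1))   # [('cust-unlucky', 1.0)]
```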

Latency as Seen by Individual Customers

• One definition of a latency SLA:

“The number of 5-minute intervals during which the ratio of returns with latency higher than the latency SLA to total system requests is less than 5% over the total number of 5-minute intervals.”

• Same “bad day” problem exists for latency as for availability
• Need to monitor p90, p99, p99.99, and even p100
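
To see why the high percentiles matter, here is a minimal nearest-rank percentile sketch over made-up latencies. The data, the function name, and the choice of the nearest-rank method are assumptions for illustration; production systems usually rely on streaming approximations instead of sorting every sample.

```python
import math
from fractions import Fraction


def percentile(sorted_values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not sorted_values:
        raise ValueError("need at least one sample")
    n = len(sorted_values)
    # Fraction(str(p)) keeps values such as 99.99 exact; plain floats can land one rank off.
    rank = math.ceil(Fraction(str(p)) * n / 100)
    rank = min(max(rank, 1), n)
    return sorted_values[rank - 1]


# Hypothetical latencies in milliseconds: most requests are fast, a few are very slow.
latencies_ms = sorted([20] * 9_000 + [50] * 900 + [200] * 90 + [900] * 9 + [5_000])
for p in (50, 90, 99, 99.99, 100):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")
```

With this data p50 and p90 are both 20 ms, p99 is 50 ms, p99.99 is 900 ms, and p100 is 5,000 ms: the median says nothing about the requests that give a customer a bad day.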

Developing, Testing, Deploying, Operating at Scale

• Amazon Web Services (AWS) launched O(1000) features last year. Customers are impatient for more
• The variety of workloads and exception scenarios (failures, distributed denial-of-service attacks, customer load spikes, etc.) is huge
• Increasing emphasis/demand for platform-wide features, such as tagging, policy enforcement, etc.

Things can change out from under you quickly

• Sudden load quantum leaps
  • Capacity challenge
  • Ramp-up challenge
• New features that have unintended scaling side effects
• New feature in one place may accelerate the rate of load growth in another
• Non-linear effects
• Unintended consequences due to unexpected uses

Testing is Crucial

• Avoiding the death spiral of mean-time-to-failure > mean-time-to-repair
• Testing: the only “truth” you have is what you test regularly
  • Regression tests
  • Scaling/performance tests
  • Fault tolerance tests
• The importance of testing to failure
  • Load testing to the breaking point along all relevant dimensions
  • Chaos monkeys
  • LSE tests (chaos armies and game days)

You Can’t Anticipate Everything

• Need rolling deployments
• Need (automated) rollback capability
• Root-cause analysis is challenging when “everything” is constantly in flux
• Use CI/CD: lots of small, incremental changes are easier to deal with than a few “big bangs”
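
A rolling deployment with automated rollback can be sketched as a loop over batches of hosts with a health check after each batch. Everything below is a toy model: the `rolling_deploy` function, the `deploy` and `is_healthy` callables, and the in-memory fleet are hypothetical stand-ins for real deployment tooling.

```python
def rolling_deploy(hosts, new_version, deploy, is_healthy, batch_size=2):
    """Deploy batch by batch; roll back every updated host if a batch is unhealthy.

    `deploy(host, version)` must install the version and return the previous one.
    `is_healthy(host)` reports whether the host is serving correctly.
    Returns True if the rollout completed, False if it was rolled back.
    """
    previous = {}  # host -> version it ran before this rollout
    for start in range(0, len(hosts), batch_size):
        batch = hosts[start:start + batch_size]
        for host in batch:
            previous[host] = deploy(host, new_version)
        if not all(is_healthy(host) for host in batch):
            # Automated rollback: restore every host touched so far.
            for host, old_version in previous.items():
                deploy(host, old_version)
            return False
    return True


# Toy fleet: version "v2" is broken on host "h3" only.
fleet = {f"h{i}": "v1" for i in range(1, 7)}


def deploy(host, version):
    old, fleet[host] = fleet[host], version
    return old


def is_healthy(host):
    return not (host == "h3" and fleet[host] == "v2")


ok = rolling_deploy(list(fleet), "v2", deploy, is_healthy)
print(ok, fleet)
```

Here the second batch fails its health check, so the first four hosts are restored to "v1" and the last two are never touched. Small batches limit how many hosts a bad change can reach before it is caught.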

Operational Readiness is Crucial

• Modeling your system
  • Security threat model
  • Failure model, including LSE analysis
• Operational readiness review (ORR) checklist
• On-call rotation
  • Primary personnel
  • Well-defined escalation paths, including to other services
• On-call runbooks
  • Have to be easy to understand and use
  • Must practice using them

Humans in the Loop

• Humans are necessary because systems
  • Are extremely complex
  • Evolve at a ferocious rate
  • Exhibit difficult-to-anticipate emergent behaviors
  • Behave in non-linear ways
• Humans are a huge problem because they are imperfect – especially at repetitive tasks
  • Multi-step standard operating procedures (SOPs) are a good source of errors
  • Ditto for cut-and-paste tasks
  • Ditto for complex, difficult-to-parse, text commands

→ Need canned procedures that have simple invocation semantics

The Tension Between Power/Efficiency and Safety

• Tools and APIs should be safe to use:
  • Projected outcome of an action should be clearly discernible
  • Ideally actions can be undone if necessary
• Safety adds friction
  • Danger of people inventing short-cuts
  • What’s the “right” amount of safety friction to impose?
• Sometimes you need a power tool that will let you do “heart surgery”
  • How often do you use it?
  • How often do you practice with it?

“Correction of Error” Reports

• The importance of recording and propagating things learned
• Root cause analysis: “The 5 Whys”
• Example: 2011 Amazon EBS outage
  • Network misconfigured during an upgrade
  • Re-mirroring storm
  • What’s the root cause?
  • Service control plane problems → Service data plane problems → network traffic problems → network misconfiguration → difficult-to-use tools for configuring network routers
• The importance of closed-loop action mechanisms vs. good intentions

[Figure: Abstract Representation of EBS in a Region]

[Figure: Initial Failure Event]

[Figure: Follow-on Problems]

Charging for Use

• You have to build in support from the beginning (like with security)
• Have to be able to track customers’ usage along all relevant dimensions and across all backend systems and services
• Metering volumes (at the edge) are measured in millions of records/sec and TB/hour.
• What’s the right pricing model?
  • Fully cost-following models are very complicated
  • Simpler models may have unintended consequences
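
At its core, metering means rolling raw usage records up per customer and per usage dimension. The sketch below shows that aggregation on a handful of in-memory records; the record layout, dimension names, and function name are assumptions for illustration, and a real pipeline would have to do this as a distributed streaming job at the volumes quoted above.

```python
from collections import Counter


def aggregate_usage(records):
    """Sum metered quantities per (customer, dimension) pair."""
    usage = Counter()
    for customer_id, dimension, quantity in records:
        usage[(customer_id, dimension)] += quantity
    return usage


# Hypothetical metering records: (customer, usage dimension, quantity).
records = [
    ("cust-a", "requests", 1),
    ("cust-a", "bytes_out", 4096),
    ("cust-b", "requests", 1),
    ("cust-a", "requests", 1),
    ("cust-a", "bytes_out", 1024),
]
usage = aggregate_usage(records)
print(usage[("cust-a", "requests")])   # 2
print(usage[("cust-a", "bytes_out")])  # 5120
```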

Some Pricing Nuances

• Free tiers
  • Invitation to use
  • Also a simpler pricing model for “glue” resources
• Derivative usage
  • Resource usage enabled by other resource usage
  • Example: cheaper data ingestion leads to more compute
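
A free tier can be expressed as an allowance that is subtracted before pricing is applied. The sketch below uses made-up numbers and integer micro-dollars to avoid floating-point rounding; none of the prices or allowances correspond to actual AWS pricing.

```python
def monthly_charge_microdollars(usage_units, free_units, price_per_unit_microdollars):
    """Charge only for usage above the free-tier allowance (illustrative pricing rule)."""
    billable = max(0, usage_units - free_units)
    return billable * price_per_unit_microdollars


# Hypothetical tier: the first 1,000,000 requests per month are free,
# then each request costs 4 micro-dollars.
print(monthly_charge_microdollars(800_000, 1_000_000, 4))    # 0
print(monthly_charge_microdollars(1_500_000, 1_000_000, 4))  # 2000000, i.e. $2.00
```

Usage below the allowance costs nothing, which is what makes the tier an invitation to try the service.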

Limiting Mistakes and Fraud

• High elasticity makes it possible to do a lot of damage quickly
• How do you distinguish legitimate requests for more resources from mistakes and fraud?
• Have to put dynamic limits on what can be used
• Simplest – and least customer friendly – solution is universal soft quotas
• Can make a quota customer-specific
  • Trusted customers get higher default limits
  • Past history used as predictor of future behavior
  • Differing payment strategies
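
A customer-specific soft quota might combine a universal default with trust and usage history. The policy below is purely illustrative: the multiplier, the headroom factor, and both function names are assumptions, not an actual AWS rule.

```python
def effective_quota(base_quota, trusted=False, peak_past_usage=0, headroom=2.0):
    """Customer-specific soft quota (illustrative policy only)."""
    quota = base_quota
    if trusted:
        quota *= 5  # assumption: trusted customers get a higher default limit
    # Past history as a predictor: allow some headroom over the observed peak.
    return max(quota, int(peak_past_usage * headroom))


def check_request(current_usage, requested, quota):
    """Allow the request only if it keeps the customer within quota."""
    return current_usage + requested <= quota


new_customer = effective_quota(base_quota=20)
established = effective_quota(base_quota=20, trusted=True, peak_past_usage=400)
print(new_customer, established)             # 20 800
print(check_request(5, 100, new_customer))   # False: held for review
print(check_request(350, 100, established))  # True
```

A rejected request is not necessarily fraud; with a soft quota it simply triggers a review or a limit-increase request instead of being granted automatically.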

Summary and Conclusions

• A fundamental tension: customers want
  • rich feature set and capabilities
  • relentless cost reduction
  • everything as soon as possible
• At scale it’s all about the tail
• Testing and automation are crucial, but the human (so far) still has to be in the loop – for better and worse
• What to charge has many nuances and requires support from the beginning
