The Challenges of Operating a Computing Cloud and Charging ... · The Challenges of Operating a...

Post on 28-May-2020

3 views 0 download

Transcript of The Challenges of Operating a Computing Cloud and Charging ... · The Challenges of Operating a...

TheChallengesofOperatingaComputingCloudandCharging

foritsUseMarvinTheimer

VP/DistinguishedEngineer

AmazonWebServices

1©2017,AmazonWebServices,Inc.oritsaffiliates.Allrights

reserved.

CustomersWantItAll

• Lotsoffeaturesandallthe“ilities”• Payaslittleaspossible• Getitassoonaspossible

2©2017,AmazonWebServices,Inc.oritsaffiliates.Allrights

reserved.

Trade-offsMustBeMade

• Inherenttensionbetweencustomers’desires

• MUSTworkbackwardsfromthecustomer

• It’snotalwaysobviouswhateachcustomerreallywants

3©2017,AmazonWebServices,Inc.oritsaffiliates.Allrights

reserved.

ScalingChallenges

• Abigcomputecloudhasatleastamillionphysicalserversworld-wide

• AmazonS3storestrillionsofobjects,containsexabytes ofdata,andfieldsmillionsofrequests/second

• Aservice-orientedarchitecture(SOA)impliestherearemanyservices

• Amazonhastensofthousandsofservices

• AbigservicelikeAmazonS3mayrequiretensofthousandsofservers

4©2017,AmazonWebServices,Inc.oritsaffiliates.Allrights

reserved.

YouNeedAvailabilityToo

• Exampledefinitionofavailability:

“Thenumberof5minuteintervalsduringwhichtheratiooferrorreturns(http500’s)tototalsystemrequestsislessthan5%overthetotalnumberof5minuteintervals.”

©2017,AmazonWebServices,Inc.oritsaffiliates.Allrights

reserved.5

LevelsofAvailability

• Availability Amountofdowntimeperyear

• 99.8%: 17.5hours

• 99.9%(3x9’s): 8.8hours

• 99.99%(4x9’s): 52.6minutes

• 99.999%(5x9’s): 5.26minutes

• 99.9999%(6x9’s): 31.5seconds

• 99.99999%(7x9’s): 3.15seconds

6©2017,AmazonWebServices,Inc.oritsaffiliates.Allrights

reserved.

SomeImplicationsofVariousLevelsofAvailability

• 99.8%: 17.5hours

• Youmightcrippleyourbusiness(e.g.IntuitonApr15th)

• 3x9’s: 8.8hours

• Youcanaffordtodooccasionalsmallscheduleddowntimes

• 4x9’s: 52.6minutes

• Can’tdoscheduleddowntimesofanysignificance

• Pagedhumanhasabout30-40minutestocorrect/restartthings

7©2017,AmazonWebServices,Inc.oritsaffiliates.Allrights

reserved.

SomeImplicationsofVariousLevelsofAvailability

• 5x9’s: 5.26minutes

• Pagedhumanwon’tbeon-linebeforeyou’veexceededyouryearlySLA

• De-factoneedfullyautomatedfailureresponsesystem

• Humanscanonlybeinvolvedwithlonger-termtrendsmanagement

• 6x9’s: 31.5seconds

• Havetoredefinewhatyoumeanbyavailability(5min.intervalstoocoarse)

• Intherangeofthrottlingdelays• 7x9’s: 3.15seconds

• Belowthepracticalthresholdfordistributedleasedlocks

8©2017,AmazonWebServices,Inc.oritsaffiliates.Allrights

reserved.

RealityBites

• Developersarefallible• Cloudservicesevolvequickly• Near-perfectautomation/fault-toleranceisexpensive

• Currentstate-of-the-artrequireshumansinthelooptodealwithunforeseencircumstances

• Youcanbuild6x9’savailableservices,butitmaynotrepresenttherightcost/benefittrade-offformostcustomers

9©2017,AmazonWebServices,Inc.oritsaffiliates.Allrights

reserved.

YouNeedLogging

• VolumeoflogtrafficismeasuredinTB/hour• Foralargeservicethelogvolumeisstillofthatorderofmagnitude

• Can’tjustgrepit:youneedafull-blownsearchcapability• Richerqueriesimplyevenmoretechnology

• Needfortimelyanswerspushesyoutowardsnear-real-timesupport

• It’salltechnologyfeasible• Itjustcostsalot• Howmuchcostisjustifiable?

10©2017,AmazonWebServices,Inc.oritsaffiliates.Allrights

reserved.

YouNeedMetricsasWellasLogging

• Logqueriesarefordebugging• Todeterminewhetheraservice/systemisbehavingproperlyyouneedmetrics

• Ideallyyoutrack“everything”• Fartooexpensive

• Costofgathering• Attentioncost

• Youhavetofigureoutthemetrics“workingset”youneed

• Whatareyour“leading”metrics?

11©2017,AmazonWebServices,Inc.oritsaffiliates.Allrights

reserved.

MetricsChallenges

• Theworkingsetmaychangeinunobviouswaysasyourservice– oritsworkloads– evolve

• Importanttohave“tripwire”metrics

• Importanttohaveautomatedalarms

• Alsoneedtohavealarmdeduplication/squelching

12©2017,AmazonWebServices,Inc.oritsaffiliates.Allrights

reserved.

AvailabilityasSeenbyIndividualCustomers

• Aservicecanbe99.99%availableandanindividualcustomercanstillhaveareallybadday

• Ideally,wantnear-real-time“top-N”metrics

• Thesearenotcheap

©2017,AmazonWebServices,Inc.oritsaffiliates.Allrights

reserved.13

LatencyasSeenbyIndividualCustomers

• OnedefinitionofalatencySLA:

“Thenumberof5minuteintervalsduringwhichtheratioofreturnswithlatencyhigherthanthelatencySLAtototalsystemrequestsislessthan5%overthetotalnumberof5minuteintervals.”

• Same“badday”problemexistsforlatencyasforavailability

• Needtomonitorp90,p99,p99.99,andevenp100

14©2017,AmazonWebServices,Inc.oritsaffiliates.Allrights

reserved.

Developing,Testing,Deploying,OperatingatScale• AmazonWebServices(AWS)launchedO(1000)featureslastyear.Customersareimpatientformore

• Thevarietyofworkloadsandexceptionscenarios(failures,distributeddenial-of-serviceattacks,customerloadspikes,etc.)ishuge

• Increasingemphasis/demandforplatform-widefeatures,suchtagging,policyenforcement,etc.

15©2017,AmazonWebServices,Inc.oritsaffiliates.Allrights

reserved.

Thingscanchangeoutfromunderyouquickly

• Suddenloadquantumleaps

• Capacitychallenge• Ramp-upchallenge

• Newfeaturesthathaveunintendedscalingsideeffects• Newfeatureinoneplacemayacceleratetherateofloadgrowthinanother

• Non-lineareffects• Unintendedconsequencesduetounexpecteduses

©2017,AmazonWebServices,Inc.oritsaffiliates.Allrights

reserved.16

TestingisCrucial

• Avoidingthedeathspiralofmean-time-to-failure>mean-time-to-repair

• Testing:theonly“truth”youhaveiswhatyoutestregularly• Regressiontests• Scaling/performancetests

• Faulttolerancetests• Theimportanceoftestingtofailure

• Loadtestingtothebreakingpointalongallrelevantdimensions

• Chaosmonkeys

• LSEtests(chaosarmiesandgamedays)

17©2017,AmazonWebServices,Inc.oritsaffiliates.Allrights

reserved.

YouCan’tAnticipateEverything

• Needrollingdeployments

• Need(automated)rollbackcapability

• Root-causeanalysisischallengingwhen“everything”isconstantlyinflux

• UseCI/CD:lotsofsmall,incrementalchangesareeasiertodealwiththanafew“bigbangs”

18©2017,AmazonWebServices,Inc.oritsaffiliates.Allrights

reserved.

OperationalReadinessisCrucial

• Modelingyoursystem• Securitythreatmodel

• Failuremodel,includingLSEanalysis

• Operationalreadinessreview(ORR)checklist• On-callrotation

• Primarypersonnel

• Well-definedescalationpaths,includingtootherservices

• On-callrunbooks• Havetobeeasytounderstandanduse• Mustpracticeusingthem

19©2017,AmazonWebServices,Inc.oritsaffiliates.Allrights

reserved.

HumansintheLoop

• Humansarenecessarybecausesystemsare• Extremelycomplex• Evolveataferociousrate• Exhibitdifficult-to-anticipateemergentbehaviors• Behaveinnon-linearways

• Humansareahugeproblembecausetheyareimperfect– especiallyatrepetitivetasks• Multi-stepstandardoperatingprocedures(SOPs)areagoodsourceoferrors• Dittoforcut-and-pastetasks• Dittoforcomplex,difficult-to-parse,textcommands

è Needcannedproceduresthathavesimpleinvocationsemantics

20©2017,AmazonWebServices,Inc.oritsaffiliates.Allrights

reserved.

TheTensionBetweenPower/EfficiencyandSafety• ToolsandAPIsshouldbesafetouse:

• Projectedoutcomeofanactionshouldbeclearlydiscernible

• Ideallyactionscanbeundoneifnecessary

• Safetyaddsfriction• Dangerofpeopleinventingshort-cuts

• What’sthe“right”amountofsafetyfrictiontoimpose?

• Sometimesyouneedapowertoolthatwillletyoudo“heartsurgery”• Howoftendoyouuseit?

• Howoftendoyoupracticewithit?

21©2017,AmazonWebServices,Inc.oritsaffiliates.Allrights

reserved.

“CorrectionofError”Reports

• Theimportanceofrecordingandpropagatingthingslearned

• Rootcauseanalysis:“The5Whys”

• Example:2011AmazonEBSoutage

• Networkmisconfiguredduringanupgrade

• Re-mirroringstorm

• What’stherootcause?

• ServicecontrolplaneproblemsàServicedataplaneproblemsà networktrafficproblemsànetworkmisconfigurationà difficult-to-usetoolsforconfiguringnetworkrouters

• Theimportanceofclosed-loopactionmechanismsvs.goodintentions

22©2017,AmazonWebServices,Inc.oritsaffiliates.Allrights

reserved.

AbstractRepresentationofEBSinaRegion

©2017,AmazonWebServices,Inc.oritsaffiliates.Allrights

reserved.23

InitialFailureEvent

©2017,AmazonWebServices,Inc.oritsaffiliates.Allrights

reserved.24

Follow-onProblems

©2017,AmazonWebServices,Inc.oritsaffiliates.Allrights

reserved.25

ChargingforUse

• Youhavetobuildinsupportfromthebeginning(likewithsecurity)

• Havetobeabletotrackcustomers’usagealongallrelevantdimensionsandacrossallbackendsystemsandservices

• Meteringvolumes(attheedge)aremeasuredinmillionsofrecords/secandTB/hour.

• What’stherightpricingmodel?

• Fullycost-followingmodelsareverycomplicated

• Simplermodelsmayhaveunintendedconsequences

26©2017,AmazonWebServices,Inc.oritsaffiliates.Allrights

reserved.

SomePricingNuances

• Freetiers• Invitationtouse• Alsoasimplerpricingmodelfor“glue”resources

• Derivativeusage• Resourceusageenabledbyotherresourceusage• Example:cheaperdataingestionleadstomorecompute

27©2017,AmazonWebServices,Inc.oritsaffiliates.Allrights

reserved.

LimitingMistakesandFraud

• Highelasticityenablestheabilitytodoalotofdamagequickly

• Howdoyoudistinguishlegitimaterequestsformoreresourcesfrommistakesandfraud?

• Havetoputdynamiclimitsonwhatcanbeused

• Simplest– andleastcustomerfriendly– solutionisuniversalsoftquotas

• Canmakeaquotacustomer-specific

• Trustedcustomersgethigherdefaultlimits

• Pasthistoryusedaspredictoroffuturebehavior

• Differingpaymentstrategies

28©2017,AmazonWebServices,Inc.oritsaffiliates.Allrights

reserved.

SummaryandConclusions

• Afundamentaltension:customerswant• richfeaturesetandcapabilities• relentlesscostreduction• everythingassoonaspossible

• Atscaleit’sallaboutthetail

• Testingandautomationarecrucial,butthehuman(sofar)stillhastobeintheloop– forbetterandworse

• Whattochargehasmanynuancesandrequiressupportfromthebeginning

29©2017,AmazonWebServices,Inc.oritsaffiliates.Allrights

reserved.