Give Me the Bad News Straight:  Why Models are a Broken Approach to Alerting

Tech Talk: Give Me the Bad News Straight: Why Models are a Broken Approach to Alerting David B. Martin DevOps : Agile Ops CA Technologies APM Product Manager DO5T41T #CAWorld

Transcript of Give Me the Bad News Straight:  Why Models are a Broken Approach to Alerting

Page 1: Give Me the Bad News Straight:  Why Models are a Broken Approach to Alerting

Tech Talk: Give Me the Bad News Straight: Why Models are a Broken Approach to Alerting

David B. Martin

DevOps: Agile Ops

CA Technologies APM Product Manager DO5T41T

#CAWorld

Page 2

2 © 2015 CA. ALL RIGHTS RESERVED. @CAWORLD #CAWORLD

Give Me the Bad News Straight: Why Models are a Broken Approach to Alerting

The industry-standard approach to automatic alerts is to create models by baselining application latencies. But when something goes wrong, is it because something is really broken or because the model was incorrect? Training the model to avoid mistakes is complex and time-intensive. CA Application Performance Management (CA APM) 10 replaces the whole approach with a brand new one: react to changes in application stability as they occur. Outliers are automatically ignored, while tremors in latency register progressively bigger values for the intensity of an event, a little like the Richter scale for earthquakes. Join the discussion and learn how CA APM transforms automatic alerting.

David B. Martin, CA Technologies Product Manager

Page 3

© 2015 CA. All rights reserved. All trademarks referenced herein belong to their respective companies.

The content provided in this CA World 2015 presentation is intended for informational purposes only and does not form any type of warranty. The information provided by a CA partner and/or CA customer has not been reviewed for accuracy by CA.

For Informational Purposes Only

Terms of this Presentation

Page 4

Agenda

WHY MODELS ARE FAILING

A BRIEF HISTORY OF APM ALERTING

CA TECHNOLOGIES DIFFERENTIAL ANALYSIS

MODELS ARE MADE TO BE BROKEN

DATA-DRIVEN DIVE INTO AUTOMATIC ALERTING MODELS

SHEWHART SAVES THE DAY

Page 5

Keeping my promise!

§ I will begin this session by making a detailed, data-centric case for why CA Technologies' new differential analysis feature is a superior, market-leading approach to automatic alerting.

§ No, I will not then pull a rabbit out of a hat. ‘Cuz this ain’t magic, people… even if it looks like magic.

§ “Any sufficiently advanced technology is indistinguishable from magic.” - A. C. Clarke

Page 6

What was CA’s last answer?

§ In the early 90s, Wily implemented Holt’s Linear Exponential Smoothing (HLES) to calculate baselines for metrics.

§ Baselines were fooled by regular production events; many were more about regular patterns in load than about maintenance events. Seasonality debuts to address it.

§ This leads to rules, and rules engines, to address edge cases that seasonality does not address (e.g. “+3 std dev from baseline” to deaden the sensitivity of triggers).

And what are our competitors doing?
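For readers who have not met HLES, here is a minimal sketch of Holt's linear exponential smoothing used as a baseline. It is not Wily's or CA's implementation: the function name `holt_linear_baseline` and the smoothing constants `alpha` and `beta` are assumptions chosen only for illustration.

```python
def holt_linear_baseline(values, alpha=0.5, beta=0.3):
    """One-step-ahead baseline using Holt's linear exponential smoothing.

    forecasts[i] is the value predicted for values[i] from earlier points.
    alpha (level) and beta (trend) are illustrative constants, not product
    defaults.
    """
    if len(values) < 2:
        raise ValueError("need at least two observations")
    level = values[0]
    trend = values[1] - values[0]
    forecasts = [values[0]]  # no history exists for the first point
    for x in values[1:]:
        # Forecast first, then fold the new observation into level and trend.
        forecasts.append(level + trend)
        new_level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return forecasts


if __name__ == "__main__":
    latencies = [500, 505, 510, 900, 515, 520]  # made-up response times
    for actual, expected in zip(latencies, holt_linear_baseline(latencies)):
        print(actual, round(expected, 1))
```

In this sketch the baseline chases every recent value, so a single burst shifts the forecast for the intervals that follow it. That is the kind of behavior the seasonality and rules layers described above were added to patch.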

Page 7

What’s the problem with the state-of-the-art?

§ As the following slides will explain, seasonal baselines miss problems that you don’t want to miss.

§ Inevitably, they also report too often.

§ When they do, you have to write rules to resolve the issue with your issues.

§ Now you’ve failed to find the automatic alerting grail.

§ It may actually be more efficient to go back to writing static thresholds for your key components.

Or, a good reason for teaching you some interesting math.

Page 8

[Chart: Average Response Time with +1 Std Dev, +2 Std Dev and +3 Std Dev bands; vertical axis from 440 to 620]

This is a stable application response time, with bands of standard deviation. Most baselines are fancy forms of standard deviation that take into account things like seasonality.
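As a rough illustration of the bands in the chart, the sketch below computes a mean and the +1, +2 and +3 standard deviation lines for a list of samples. The function name `deviation_bands` and the sample values are invented for the example; a real baseline would also account for things like seasonality.

```python
from statistics import mean, pstdev


def deviation_bands(samples, max_k=3):
    """Return (center, bands), where bands[k] = mean + k * standard deviation.

    Uses the population standard deviation of the samples supplied.
    """
    center = mean(samples)
    sigma = pstdev(samples)
    bands = {k: center + k * sigma for k in range(1, max_k + 1)}
    return center, bands


if __name__ == "__main__":
    response_times = [500, 510, 490, 505, 495]  # made-up, stable signal
    center, bands = deviation_bands(response_times)
    print("average:", center)
    for k, value in bands.items():
        print(f"+{k} std dev:", round(value, 1))
```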

Page 9

[Chart: the signal with a single outlier; vertical axis from 0 to 1800]

An outlier… What to do? If it’s in a seasonal window, it has to be a bigger outlier, but the problem of “To Alert or Not to Alert” remains the same.

You must either send an alert for this single spike or write a rule to say that the spike has to be “so big” before you care (which is usually done with a manually written rule like “>3 std dev”).

“Mr. Ops won’t even put down his sandwich for a single failed transaction.”

Page 10

[Chart: the signal with a sustained spike; vertical axis from 0 to 2500]

What about the situation of a sustained spike?

Supposedly, seasonality cancels out the normal operations. But how many of you have apps in which a single user logs in and starts running expensive (e.g. reporting) transactions?

The traditional approach again has to decide: when to alert? If app users log in at irregular intervals and perform this type of transaction, do you trigger alerts on their normal (non-seasonal) activity?

“cat alerts > /dev/null”.

But how long do you wait then? Once again, a decision you have to make and configure for each of your apps.
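To make the "how long do you wait?" decision concrete, here is a small sketch of the kind of hand-written rule the traditional approach needs. The function name `sustained_breach_alerts` and its `threshold` and `wait` parameters are hypothetical; the point is only that both numbers have to be picked per application.

```python
def sustained_breach_alerts(samples, threshold, wait):
    """Return the indices at which an alert fires.

    An alert fires once per run, at the moment the signal has stayed above
    `threshold` for `wait` consecutive intervals.
    """
    alerts = []
    run = 0
    for i, value in enumerate(samples):
        run = run + 1 if value > threshold else 0
        if run == wait:
            alerts.append(i)
    return alerts


if __name__ == "__main__":
    signal = [400, 2100, 2200, 2150, 450, 2300]  # made-up response times
    # wait=1 alerts on every spike; wait=3 stays quiet until the third interval.
    print(sustained_breach_alerts(signal, threshold=1000, wait=1))
    print(sustained_breach_alerts(signal, threshold=1000, wait=3))
```

A short `wait` pages the operator for every reporting user; a long one delays the alert for a real problem by the same number of intervals.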

Page 11

[Chart: the signal with a sustained change in response time; vertical axis from 0 to 3000]

Better hope that sustained, normal changes in response time are seasonal when they happen… If not, you must write rules!

And if you write rules, you might accidentally deaden the threshold to actual problems. Dang gum!

Page 12

Our Hero: Walter Shewhart

§ In the 1920s, Walter Shewhart et al. worked on quality control for buried telephone lines.

§ Shewhart observed that while every line displays variation, some lines occasionally display uncontrolled variation. Like a seismometer, there are normal fluctuations and then there are earthquakes.

§ Shewhart invented control charts and the Western Electric Rules to identify uncontrolled variance, earning himself the title: “Father of Statistical Quality Control.”

Page 13

Translation please!

§ Shewhart taught us to favor real-time observation over mathematical models of a signal’s behavior.

§ We still baseline the signal, but the Western Electric Rules define the situations in which the signal should be considered in a bad state, and not a simple delta from the baseline model.

§ Shewhart’s method of characterizing the quality of a signal mirrors the behavior of a human observer.

Trust us, you will understand this math.

Page 14

Shewhart’s Western Electric Rules

Straight off Wikipedia…

The canonical Western Electric Rules use plain old standard deviation as their real-time measure. Each rule identifies a pattern in the signal:

Rule #1 – A statistically interesting outlier.

Rule #2 – Two somewhat interesting outliers out of three measurements.

Rule #3 – Four smaller outliers out of five measurements.

Rule #4 – Many small outliers over many measurements.

This much we flat out stole from math history!

See Comments to the right
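The four patterns above can be sketched in a few lines. This is an illustration, not CA's code: it uses the commonly cited thresholds for the Western Electric zone rules (one point beyond 3 standard deviations; two of three beyond 2; four of five beyond 1; eight in a row on the same side of the center line), and the function name `western_electric_breaches` is an assumption.

```python
def western_electric_breaches(window, center, sigma):
    """Return the set of rule numbers (1-4) breached by the newest point.

    `window` holds recent measurements, oldest first. `center` and `sigma`
    are the baseline mean and standard deviation of the signal.
    """
    z = [(x - center) / sigma for x in window]
    breached = set()
    if abs(z[-1]) > 3:
        breached.add(1)  # a single point beyond 3 sigma
    for side in (1, -1):  # check above the center line, then below it
        s = [side * v for v in z]
        # Requiring the newest point to take part in the pattern is a choice
        # made here so that each breach is attributed to one interval.
        if len(s) >= 3 and s[-1] > 2 and sum(v > 2 for v in s[-3:]) >= 2:
            breached.add(2)  # two of three beyond 2 sigma, same side
        if len(s) >= 5 and s[-1] > 1 and sum(v > 1 for v in s[-5:]) >= 4:
            breached.add(3)  # four of five beyond 1 sigma, same side
        if len(s) >= 8 and all(v > 0 for v in s[-8:]):
            breached.add(4)  # eight in a row on the same side
    return breached


if __name__ == "__main__":
    # Made-up measurements against a baseline of 500 with sigma 10.
    print(western_electric_breaches([500, 535], 500, 10))
    print(western_electric_breaches([500, 525, 502, 526], 500, 10))
```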

Page 15

CA Technologies Innovation

§ Western Electric Rules are brilliant for both real-time analysis of telephone signals and application signals.

§ A single rule breach, however, is too dull a blade for slicing through this tough problem.

§ By assigning weights to each rule breach, keeping a running sum and aging out old breaches, we can produce a single, normalized value for variance intensity.

CA APM 10 has several patents pending.
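The weighting idea in the last bullet can be sketched as follows. Everything specific here is hypothetical: the class name `VarianceIntensity`, the per-rule weights, the aging window `max_age` and the cap `scale` are placeholders, not the values or the method CA APM 10 uses.

```python
from collections import deque


class VarianceIntensity:
    """Running, aged sum of weighted rule breaches, clamped to 0..scale."""

    def __init__(self, weights=None, max_age=10, scale=30):
        # Hypothetical weights: rule number -> contribution per breach.
        self.weights = weights or {1: 1, 2: 2, 3: 3, 4: 4}
        self.max_age = max_age  # intervals a breach stays in the sum
        self.scale = scale      # upper bound of the reported value
        self._breaches = deque()  # (interval, weight), oldest first
        self._interval = 0

    def update(self, breached_rules):
        """Record this interval's breached rules and return the intensity."""
        self._interval += 1
        for rule in breached_rules:
            self._breaches.append((self._interval, self.weights[rule]))
        # Age out breaches that are max_age or more intervals old.
        while self._breaches and self._interval - self._breaches[0][0] >= self.max_age:
            self._breaches.popleft()
        total = sum(weight for _, weight in self._breaches)
        return min(total, self.scale)


if __name__ == "__main__":
    intensity = VarianceIntensity(max_age=3)
    for rules in [{1}, set(), {2, 3}, set(), set(), set()]:
        print(sorted(rules), intensity.update(rules))
```

With weights like these, a lone outlier barely moves the value and fades after a few intervals, while repeated breaches stack up, which is the behavior the slide describes.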

Page 16

In a busy system, there are always varying levels of stability.

In this picture, can you tell which signals are least stable?

Page 17

This signal experienced an outlier, but it didn’t turn blue.

A single rule breach isn’t enough for “Pete to put down his sandwich.”

Page 18

In this case, the change in stability was sustained over about forty minutes.

What happened? Click to find out…

Page 19

This application experienced a remarkable degradation in performance over a forty-minute period of time.

Both the old approach and our new approach would alert here, but CA’s alert would happen early in the event and trigger trace collection automatically.

The old approach might not have let an operator know for thirty minutes or more, based on the rules they configured.

Page 20

Triage is a battlefield medicine term: where are the wounded soldiers?

CA’s approach means identifying chronic problems as well as acute ones. Which of these lines are more stable, but still having chronic stability events at regular intervals?

Page 21

Page 22

Page 23

Differential Analysis Default Configuration

Page 24

Page 25

CA TECHNOLOGIES TEAM PEGASUS

Clockwise from left: Prashant Pathak, Mark LoSacco, Weini Yu, Prasanna Ram Venkatachalam, Naresh Chippada, Carey Feldstein, Paul Callahan and Sai Krishna Rayanapati. [not pictured: me]

Page 26

Recommended Sessions

SESSION # | TITLE | DATE/TIME

DO5X189S | How to Achieve a Customer-Centric View in an Omni-Channel World | 11/18/2015 at 1:00 pm

DO5X194S | Monitor Microservices, Containers, Cloud Foundry and Node with CA Application Performance Management | 11/18/2015 at 4:30 pm

DO5X193S | Customize CA Application Performance Management with Tips for Using the CA Application Performance Management Open APIs | 11/19/2015 at 4:30 pm

Page 27

Must See Demos

Application Performance Management and DevOps, featuring APM use in preproduction scenarios

Application Performance Management Theater 5

Application Performance Management, Modern Monitoring, featuring the new APM Team Center

Application Performance Management Theater 5

Ensuring a “5 star” mobile app experience with CA Mobile App Analytics

Mobile App Analytics Theater 5

Unified Monitoring: APM Integrations including UIM

Application Performance Management Theater 5

Page 28

Follow On Conversations At…

Smart Bar Application Performance Management Theater 5

Tech Talks Application Performance Management Theater 5