Production testing through monitoring

Transcript of Production testing through monitoring

Page 1: Production testing through monitoring

@papa_fire

Troubleshooting with monitoring

Testing in production

DevOps monitoring

[something] testing [something] monitoring [something] in production

Leon Fayer

Page 2: Production testing through monitoring

❖ @papa_fire ❖ [email protected] ❖ fayerplay.com ❖ slideshare.net/LeonFayer1

THAT’S ME

WHO AM I?

๏ engineer for 20+ years

๏ professional cynic

๏ @ OmniTI

๏ build and operate big systems

๏ we are hiring!

๏ omniti.com/is/hiring

Page 3: Production testing through monitoring

I HATE TESTING

Page 4: Production testing through monitoring

testing is required

Page 5: Production testing through monitoring

testing is not enough

Page 6: Production testing through monitoring

> unit testing
> functional testing
> resilience testing
> performance testing
> …

Page 7: Production testing through monitoring

testing can give a false sense of security

Page 8: Production testing through monitoring

testing is deterministic

Page 9: Production testing through monitoring

data problem

Page 10: Production testing through monitoring

> quantity of data
> frequency of data
> quality of data

Page 11: Production testing through monitoring

example

Wolfe+585

Page 12: Production testing through monitoring

example

Hubert Blaine Wolfeschlegelsteinhausenbergerdorffwelchevoralternwaren-gewissenhaftschaferswessenschafewarenwohlgepflegeundsorgfaltigkeitbeschutzenvorangreifendurchihrraubgierigfeindewelchevoralternzwolfhunderttausendjahresvorandieerscheinenvonderersteerdemenschderraumschiffgenachtmittungsteinundsiebeniridiumelektrischmotorsgebrauchlichtalsseinursprungvonkraftgestartseinlangefahrthinzwischensternartigraumaufdersuchennachbarschaftdersternwelchegehabtbewohnbarplanetenkreisedrehensichundwohinderneuerassevonverstandigmenschlichkeitkonntefortpflanzenundsicherfreuenanlebenslanglichfreudeundruhemitnichteinfurchtvorangreifenvorandererintelligentgeschopfsvonhinzwischensternartigraum, Sr.
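
The data problem can be sketched in a few lines of Python. This is only an illustration: the 64-character limit, the function name, and the fixture names are assumptions rather than anything shown in the slides, and the production value is a stand-in with the length the "Wolfe+585" shorthand implies, not the real surname.

```python
# Hypothetical limit: the slides do not say which limit any real system used.
MAX_SURNAME_LENGTH = 64


def can_store_surname(surname: str) -> bool:
    """Return True if the surname fits the (assumed) fixed-width field."""
    return len(surname) <= MAX_SURNAME_LENGTH


# Fixture data of the kind a test suite typically contains.
test_fixture_surnames = ["Smith", "Garcia", "Nguyen", "Fayer"]

# Stand-in with the length implied by "Wolfe+585":
# the 5 letters of "Wolfe" plus 585 more characters.
production_surname = "Wolfe" + "x" * 585

# Every fixture fits, so the test suite is green.
tests_pass = all(can_store_surname(name) for name in test_fixture_surnames)

# The value that only ever shows up in production does not fit.
production_ok = can_store_surname(production_surname)
```

The tests are deterministic and pass every time; the failure only appears once real data arrives.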

Page 13: Production testing through monitoring

user problem

Page 14: Production testing through monitoring

“Users (n) - distributed fault injection test suite for production”

Page 15: Production testing through monitoring

example

Corrupted Blood bug

Page 16: Production testing through monitoring

example

Page 17: Production testing through monitoring

other factors

Page 18: Production testing through monitoring

> lack of foresight (Y2K bug)
> too many use-cases (female Tauren bug)
> change to assumptions

Page 19: Production testing through monitoring

testing is great for “known knowns”

Page 20: Production testing through monitoring

testing is ok for “known unknowns”

Page 21: Production testing through monitoring

testing is bad for “unknown unknowns”

Page 22: Production testing through monitoring

enter monitoring

Page 23: Production testing through monitoring

why monitor?

Page 24: Production testing through monitoring

because testing isn’t enough

Page 25: Production testing through monitoring

> software is never perfect
> systems are complex
> external dependency worry
> proactive is better than reactive
> …

Page 26: Production testing through monitoring

because things change

Page 27: Production testing through monitoring

because things change in production

Page 28: Production testing through monitoring

what to monitor?

Page 29: Production testing through monitoring

“in God we trust, all others we monitor”

Page 30: Production testing through monitoring

> systems
> databases
> applications
> integration points
> performance
> user behavior
> …

Page 31: Production testing through monitoring

is it enough?

Page 32: Production testing through monitoring

is it too much?

Page 33: Production testing through monitoring

what is important?

Page 34: Production testing through monitoring

what is important?

(i.e. what to alert on)

Page 35: Production testing through monitoring

example

> servers up and running
> HTTP checks return 200
> tweets are lost
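
A minimal sketch of why the system checks in this example stay green while the business is failing. The field names, the alert wording, and the idea of comparing accepted tweets against stored tweets are assumptions made for illustration; the slides do not describe how the loss was actually detected.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One monitoring sample; all field names are hypothetical."""
    servers_up: bool
    http_status: int
    tweets_accepted: int  # tweets the front end acknowledged
    tweets_stored: int    # tweets that actually reached storage


def system_checks_pass(s: Snapshot) -> bool:
    """The classic infrastructure view: hosts are up and HTTP returns 200."""
    return s.servers_up and s.http_status == 200


def lost_tweets(s: Snapshot) -> int:
    """The business view: how many acknowledged tweets never got stored."""
    return max(s.tweets_accepted - s.tweets_stored, 0)


def alerts(s: Snapshot, max_lost: int = 0) -> list[str]:
    """Collect alerts from both the system view and the business view."""
    found = []
    if not system_checks_pass(s):
        found.append("system check failed")
    if lost_tweets(s) > max_lost:
        found.append(f"{lost_tweets(s)} tweets lost")
    return found


sample = Snapshot(servers_up=True, http_status=200,
                  tweets_accepted=1000, tweets_stored=940)
sample_alerts = alerts(sample)
```

With only the system checks in place, this sample would raise nothing; the business-level check is the one that fires.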

Page 36: Production testing through monitoring

s/system checks/unit tests/

Page 37: Production testing through monitoring

“I don’t give a **** if the datacenter is on fire as long as I am still making money”

— CEO

Page 38: Production testing through monitoring

we monitor because things change

Page 39: Production testing through monitoring

changes affect business

Page 40: Production testing through monitoring

top-down approach

> understand business
> define baseline
> correlate data
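
One way the "define baseline" and "correlate data" steps might look in code. This is a sketch under stated assumptions: the metric names, the sample values, and the three-standard-deviation tolerance are all made up for illustration, and a real setup would likely account for seasonality rather than use a flat mean.

```python
from statistics import mean, stdev


def define_baseline(history: list[float]) -> tuple[float, float]:
    """Summarize past values of a metric as (average, spread)."""
    return mean(history), stdev(history)


def deviates(value: float, baseline: tuple[float, float],
             tolerance: float = 3.0) -> bool:
    """True if the value is more than `tolerance` spreads from the average."""
    average, spread = baseline
    return abs(value - average) > tolerance * spread


def correlated_deviations(current: dict[str, float],
                          histories: dict[str, list[float]],
                          tolerance: float = 3.0) -> list[str]:
    """List every metric that is off its baseline at the same moment."""
    return sorted(
        name for name, value in current.items()
        if deviates(value, define_baseline(histories[name]), tolerance)
    )


# Hypothetical metrics: start from the business number, then look down the stack.
histories = {
    "revenue": [100, 102, 98, 101, 99],
    "traffic": [5000, 5100, 4900, 5050, 4950],
    "email_sent": [300, 310, 290, 305, 295],
}
current = {"revenue": 60, "traffic": 5020, "email_sent": 0}

suspects = correlated_deviations(current, histories)
```

Here the business metric (revenue) is off its baseline, traffic is normal, and one technical metric is off at the same time, which narrows where to look first.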

Page 41: Production testing through monitoring

example

๏ online marketing company
๏ major e-commerce component
๏ ~100 million users
๏ 1 billion emails/month
๏ 300,000 lines of code
๏ 5600+ metrics collected

Page 42: Production testing through monitoring

it all starts with a call …

Page 43: Production testing through monitoring

revenue

Page 44: Production testing through monitoring

revenue + traffic

Page 45: Production testing through monitoring

revenue + traffic + load time

Page 46: Production testing through monitoring

revenue + traffic + load time + db

Page 47: Production testing through monitoring

revenue + traffic + load time + db + email

Page 48: Production testing through monitoring

what if …

… email wasn’t monitored?

Page 49: Production testing through monitoring

what if …

… email wasn’t monitored?

(it would be after this)

Page 50: Production testing through monitoring

instrumentation is never done

Page 51: Production testing through monitoring

example

> same symptoms
> higher decline rates
> all metrics are within norm

Page 52: Production testing through monitoring

example

> same symptoms
> higher decline rates
> all metrics are within norm

AmEx blocked
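
A sketch of the kind of added instrumentation that would surface this case: breaking the decline rate down by card type instead of watching only the aggregate. The transaction shape, the card labels, and the numbers are assumptions for illustration; the slides do not show how the block was actually found.

```python
from collections import defaultdict


def decline_rate_by_card(transactions: list[tuple[str, bool]]) -> dict[str, float]:
    """Compute the share of declined transactions per card type.

    Each transaction is a hypothetical (card_type, approved) pair.
    """
    totals: dict[str, int] = defaultdict(int)
    declined: dict[str, int] = defaultdict(int)
    for card_type, approved in transactions:
        totals[card_type] += 1
        if not approved:
            declined[card_type] += 1
    return {card: declined[card] / totals[card] for card in totals}


# Made-up sample: the aggregate decline rate moves only a little,
# while one card type is failing completely.
transactions = (
    [("visa", True)] * 90
    + [("visa", False)] * 10
    + [("amex", False)] * 5
)

rates = decline_rate_by_card(transactions)
overall_rate = sum(1 for _, approved in transactions if not approved) / len(transactions)
```

The aggregate stays close to normal, while the per-card view makes the fully declined card type stand out.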

Page 53: Production testing through monitoring

tl;dr

Page 54: Production testing through monitoring

testing and monitoring

not testing or monitoring

Page 55: Production testing through monitoring

understand the business

Page 56: Production testing through monitoring

continuous improvement

Page 57: Production testing through monitoring

{also bad at conclusions}

Page 58: Production testing through monitoring

THANK YOU

questions?