Pitfalls in Measuring SLOs
Danyel Fisher  @fisherdanyel
Liz Fong-Jones  @lizthegrey
What do you do when things break?
How bad was this break?
Management: “Build new features!”
Engineering: “We need to improve quality!”
Clients and Users: How broken is “too broken”? What does “good enough” mean?
Combatting alert fatigue
A telemetry system produces events that correspond to real world use
We can describe some of these events as eligible
We can describe some of them as good
Given an event, is it eligible? Is it good?
Eligible: “Had an HTTP status code”
Good: “... that was a 200, and was served in under 500 ms”
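A minimal sketch of that check (the field names status_code and duration_ms are hypothetical, not Honeycomb's event schema):

```python
# Sketch: classifying one event for an SLI. Field names are hypothetical.
def is_eligible(event: dict) -> bool:
    # Eligible: the event carried an HTTP status code at all.
    return event.get("status_code") is not None

def is_good(event: dict) -> bool:
    # Good: it was a 200 and was served in under 500 ms.
    return (is_eligible(event)
            and event["status_code"] == 200
            and event.get("duration_ms", float("inf")) < 500)
```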
An SLO: a minimum quality ratio over a period of time
An error budget: the number of bad events allowed
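A rough sketch of that arithmetic:

```python
# Sketch: an SLO target plus a count of eligible events yields the error budget.
def error_budget_events(slo_target: float, eligible_events: int) -> int:
    # e.g. 99.9% over 1,000,000 eligible events -> ~1,000 bad events allowed
    return round((1 - slo_target) * eligible_events)
```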
Deploy faster
Room for experimentation
Opportunity to tighten SLO
SLO                                         Target    Budget per month
We always store incoming user data          99.99%    ~4.3 minutes
Default dashboards usually load in < 1 s    99.9%     45 minutes
Queries often return in < 10 s              99%       7.3 hours
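The budget column is just arithmetic on the window; a sketch assuming a 30-day month:

```python
# Sketch: allowed full-outage minutes per window for a given SLO target.
def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    return (1 - slo_target) * window_days * 24 * 60

print(allowed_downtime_minutes(0.9999))  # ~4.3 minutes
print(allowed_downtime_minutes(0.999))   # ~43 minutes (~45 over a 31-day month)
print(allowed_downtime_minutes(0.99))    # ~432 minutes, i.e. just over 7 hours
```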
User Data Throughput
We blew through three months’ budget in those 12 minutes.
(At 99.99%, a month’s error budget is only ~4.3 minutes of dropped data.)
We dropped customer data
We rolled it back (manually)
We communicated to customers
We halted deploys
We checked in code that didn’t build.
We had experimental CI build wiring.
Our scripts deployed empty binaries.
There was no health check and rollback.
We stopped writing new features
We prioritized stability
We mitigated risks
SLOs allowed us to characterize
what went wrong, how badly it went wrong,
and how to prioritize repair
Learning from SLOs
SLOs, as espoused at Google vs in practice
We told you "SLOs are magical".
But not everyone has Google's tooling.
And SLO practice at Google is uneven.
Debugging SLOs is often divorced from reporting SLOs.
There had to be a better way, leveraging Honeycomb...
Using a Design Thinking perspective:
● Expressing and Viewing SLOs
● Burndown Alerts and Responding
● Learning from our Experiences
Displays and Views
See where the burndown was happening, explain why, and remediate
Expressing SLOs
Time-based: “How many 15-minute periods had a P99(duration) < 50 ms?”
Event-based: “How many events had a duration < 500 ms?”
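A sketch of how the two styles count, using numpy's percentile as a stand-in for P99:

```python
import numpy as np

def event_based_good(durations_ms):
    # Event-based: count each event that finished in under 500 ms.
    return sum(d < 500 for d in durations_ms)

def time_based_good(periods):
    # Time-based: count each 15-minute period whose P99 latency stayed under 50 ms.
    # `periods` is a list of per-period duration lists.
    return sum(np.percentile(p, 99) < 50 for p in periods)
```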
Status of an SLO
How have we done?
Where did it go?
When did the errors happen?
What went wrong?
High-dimensional data
High-cardinality data
Why did it happen?
See where the burndown was happening, explain why, and remediate
User Feedback
“I’d love to drive alerts off our SLOs. Right now we don’t have anything to draw us in and have some alerts on the average error rate but they’re a little spiky to be useful.”
Burndown Alerts
How is my system doing? Am I over budget?
When will my alarm fail?
When will I fail?
User goal: get alerts tied to budget exhaustion time
Human-digestible units
24 hours: “I’ll take a look in the morning”
4 hours: “All hands on deck!”
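A sketch of the projection behind those tiers; the burn-rate math here is an assumption, not Honeycomb's exact implementation:

```python
# Sketch: project time-to-exhaustion from the current burn rate,
# then map it to the human-digestible tiers above.
def hours_to_exhaustion(budget_remaining: float, burn_per_hour: float) -> float:
    if burn_per_hour <= 0:
        return float("inf")  # budget isn't being burned right now
    return budget_remaining / burn_per_hour

def alert_level(hours_left: float) -> str:
    if hours_left <= 4:
        return "All hands on deck!"
    if hours_left <= 24:
        return "I'll take a look in the morning"
    return "No alert"
```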
Implementing Burn Alerts
Run a 30-day query
at a 5-minute resolution
every minute
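A naive sketch of that loop; run_slo_query and check_exhaustion are hypothetical stand-ins for the real query engine and alerting hook:

```python
import time

WINDOW_S = 30 * 24 * 60 * 60   # 30-day window
RESOLUTION_S = 5 * 60          # 5-minute buckets

def burn_alert_loop(run_slo_query, check_exhaustion):
    while True:
        # Run the 30-day SLO query at 5-minute resolution...
        buckets = run_slo_query(window_s=WINDOW_S, resolution_s=RESOLUTION_S)
        # ...project budget exhaustion from the result and alert if needed...
        check_exhaustion(buckets)
        # ...and repeat every minute.
        time.sleep(60)
```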
Learning from Experience
Volume is important
Tolerate at least dozens of bad events per day
Faults
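A back-of-the-envelope illustration, with hypothetical traffic numbers, of why low volume makes an SLO twitchy:

```python
# With little traffic, one bad event can eat an entire day's budget.
daily_events = 1_000            # hypothetical low-traffic service
slo_target = 0.999
budget_per_day = (1 - slo_target) * daily_events
print(budget_per_day)           # ~1.0 bad event allowed per day: a single fluke pages you
```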
SLOs for Customer Service
Remember that user having a bad day?
Blackouts are easy… but brownouts are much more interesting
Timeline
1:29 am   SLO alerts. “Maybe it’s just a blip” (1.5% brownout for 20 minutes)
4:21 am   Minor incident. “It might be an AWS problem”
6:25 am   SLO alerts again. “Could it be ALB compat?”
9:55 am   “Why is our system uptime dropping to zero?” It’s out of memory. We aren’t alerting on that crash.
10:32 am  Fixed
Cultural Change
It’s hard to replace alerts with SLOs
But a clear incident can help
Reduce Alarm Fatigue
Focus on user-affecting SLOs
Focus on actionable alarms
Conclusion
SLOs allowed us to characterize
what went wrong, how badly it went wrong,
and how to prioritize repair
You can do it too
And maybe avoid our mistakes
Pitfalls in Measuring SLOs
Email: [email protected] / [email protected]
Twitter: @fisherdanyel & @lizthegrey
Come talk to us in Slack!
hny.co/danyel