Mean Time to Sleep: Quantifying the On-Call Experience

Post on 23-Aug-2014

Starting an on-call rotation can be like opening a door into the unknown. You don’t know if it will be a bad week or if it will be an especially bad week. You don’t know what to expect. Thinking that historical information from past on-call rotations might yield useful insights, Etsy’s Operations team set out to quantify the on-call experience, identify what made it difficult, and use those data to reduce the incidence of pain points in an attempt to make being on call more bearable.

Transcript of Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

Mean Time to Sleep: Quantifying the on-call experience

Laurie Denness (@lozzd)

Ryan Frantz (@ryan_frantz)

Who is in an on-call rotation?

Who is on call right now?

Who feels like on-call sucks?

Welcome. How is on call?

Let’s help our people sleep

Make on-call more bearable

Incremental Changes

Email to Acknowledge

• Replying “ack” with some context makes it appear in IRC too
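
As a rough illustration of the email-to-acknowledge idea, the sketch below parses a reply and, if its first non-empty line starts with “ack”, builds a line that could be relayed to IRC. The function names (`parse_ack`, `irc_message`), the message wording, and the assumption of a single-part plain-text reply are all hypothetical; the slides do not show how this was actually implemented.

```python
import re
from email import message_from_string
from typing import Optional

# "ack", optionally followed by punctuation and free-form context.
ACK_PATTERN = re.compile(r"^\s*ack\b[\s:,-]*(.*)$", re.IGNORECASE)


def parse_ack(raw_email: str) -> Optional[dict]:
    """Return ack details if the first non-empty body line starts with 'ack'."""
    msg = message_from_string(raw_email)
    body = msg.get_payload()
    if not isinstance(body, str):
        # Multipart replies are out of scope for this sketch.
        return None
    for line in body.splitlines():
        if not line.strip():
            continue
        match = ACK_PATTERN.match(line)
        if match is None:
            return None
        return {
            "sender": msg.get("From", ""),
            "subject": msg.get("Subject", ""),
            "context": match.group(1).strip(),
        }
    return None


def irc_message(ack: dict) -> str:
    """Format the acknowledgement as a single line for an IRC channel."""
    context = ack["context"] or "(no context given)"
    return f"{ack['sender']} acknowledged '{ack['subject']}': {context}"
```

Only the first non-empty line is inspected, so quoted alert text further down the reply cannot trigger an acknowledgement by accident.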

Email Only Alerts

• Do you care if RAID becomes degraded in the middle of the night?

• Do you care if one of your web/hadoop/X boxes dies in the middle of the night?

• Can it wait until the morning?

Added Context

• Previous service state

• Duration in that state

• Alert recipients

• Notes

• Link to runbook

Alert Storms

• Reduce noise when 200 things go wrong by aggregating

• Trigger an alert when the percentage of a pool that is failing goes over a threshold
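
The aggregation idea can be sketched as a single check over a whole pool instead of one alert per host. This is a minimal sketch under assumptions: the function name `aggregate_pool_alert`, the 10% default threshold, and the message format are invented for illustration and are not taken from the talk.

```python
from typing import Mapping, Optional


def aggregate_pool_alert(pool: str, host_ok: Mapping[str, bool],
                         threshold_pct: float = 10.0) -> Optional[str]:
    """Return one alert message if the failing share of the pool is over the threshold.

    host_ok maps each host name to True (healthy) or False (failing).
    """
    if not host_ok:
        return None
    failing = [host for host, ok in host_ok.items() if not ok]
    failing_pct = 100.0 * len(failing) / len(host_ok)
    if failing_pct <= threshold_pct:
        # A few dead boxes in a large pool can wait; no alert.
        return None
    return (f"CRITICAL: {len(failing)}/{len(host_ok)} hosts in pool "
            f"{pool} are failing ({failing_pct:.0f}%)")
```

With this shape, 200 simultaneous host failures produce one message rather than 200, and a single failed host in a large pool produces none.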

Low friction downtime

• IRC commands to downtime hosts/sets of hosts

Downtime Reminders

• Help prevent false notifications

Event Handlers

• Teach Nagios to augment the team

• Restarting services (nscd)

• Re-running jobs (transient errors)

• Duplicate crons (Chef)
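
A Nagios event handler is a command that runs on service state changes and can be passed macros such as `$SERVICESTATE$`, `$SERVICESTATETYPE$`, and `$SERVICEATTEMPT$` as arguments. The sketch below shows one plausible decision rule for the nscd case: restart on the third soft CRITICAL, or on a hard CRITICAL. The function names, the choice of attempt number, and the `service nscd restart` command are assumptions, not details from the talk.

```python
import subprocess
from typing import Callable


def restart_nscd() -> None:
    # Assumed restart command; substitute whatever your init system uses.
    subprocess.run(["service", "nscd", "restart"], check=False)


def should_restart(state: str, state_type: str, attempt: int,
                   soft_attempt: int = 3) -> bool:
    """Decide whether this state change warrants an automatic restart."""
    if state != "CRITICAL":
        return False
    if state_type == "HARD":
        return True
    # Act once, on a chosen soft retry, before Nagios notifies a human.
    return state_type == "SOFT" and attempt == soft_attempt


def handle_event(state: str, state_type: str, attempt: int,
                 restart: Callable[[], None] = restart_nscd) -> bool:
    """Run the restart action if warranted; return True if it was run."""
    if should_restart(state, state_type, attempt):
        restart()
        return True
    return False
```

Passing the restart action in as a parameter keeps the decision logic testable without touching a real service.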

Incremental Improvements?

• Maybe

• More ideas; hoped they’d stick

• We didn’t know because we didn’t measure

Measure Everything

• “You can’t manage what you can’t measure.” - Deming (not really)

• But we weren’t measuring anything

What should we measure?

• Volume of alerts (total, by severity)

• Alert categorization (actionable vs not)

• Alert times: Off-hours?

• Noisy hosts/services
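
The four measurements listed above can be computed from a plain list of alert records. The following is a sketch only: the `Alert` fields, the 09:00 to 18:00 definition of working hours, and the `summarize` function are assumptions made for illustration and do not reflect Opsweekly's actual data model.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable


@dataclass
class Alert:
    host: str
    service: str
    severity: str        # e.g. "WARNING" or "CRITICAL"
    sent_at: datetime
    actionable: bool     # as categorized by the on-call engineer


def summarize(alerts: Iterable[Alert], day_start_hour: int = 9,
              day_end_hour: int = 18, top_n: int = 3) -> dict:
    """Summarize alert volume, categorization, timing, and noisiest sources."""
    alerts = list(alerts)
    off_hours = sum(
        1 for a in alerts
        if not (day_start_hour <= a.sent_at.hour < day_end_hour)
    )
    noisiest = Counter((a.host, a.service) for a in alerts).most_common(top_n)
    return {
        "total": len(alerts),
        "by_severity": dict(Counter(a.severity for a in alerts)),
        "actionable": sum(1 for a in alerts if a.actionable),
        "off_hours": off_hours,
        "noisiest": noisiest,
    }
```

Weekends are not treated as off-hours here; a real report would likely need that, plus time zone handling for each on-call engineer.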

Opsweekly

We have data.

Aggregate alerts

1. Look at reports

2. Wow, look at all those alerts for the same thing

3. Aggregate alerts

4. Profit

Parent relationships

• Prevent alerts due to upstream issues (downed switch)

• Standard Nagios feature

• Computers can do this for us!

Parent relationships

• signalvnoise.com

• LLDP on host shows switch info

• Put switch info into Chef using ohai

• Create Nagios host configs based on data
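
The last step, turning per-host switch data into Nagios host definitions, might look roughly like the sketch below. The `parents` directive is the standard Nagios way to declare an upstream device. Everything else here is assumed: the input shape (a dict with `address` and `lldp_switch` keys), the `generic-host` template name, and the function names. The real ohai attribute layout is not shown in the slides.

```python
from typing import Mapping


def render_host(host_name: str, address: str, parent_switch: str,
                template: str = "generic-host") -> str:
    """Render one Nagios host definition whose parent is its upstream switch."""
    return "\n".join([
        "define host {",
        f"    use        {template}",
        f"    host_name  {host_name}",
        f"    address    {address}",
        f"    parents    {parent_switch}",
        "}",
    ])


def render_hosts(nodes: Mapping[str, Mapping[str, str]]) -> str:
    """Render definitions for all nodes, sorted by name for stable output."""
    blocks = [
        render_host(name, attrs["address"], attrs["lldp_switch"])
        for name, attrs in sorted(nodes.items())
    ]
    return "\n\n".join(blocks) + "\n"
```

For the parent relationship to take effect, the switch named in `parents` must itself be defined as a Nagios host; when it is down, hosts behind it are treated as unreachable rather than down.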

Service Dependencies

• Hundreds of Graphite-sourced checks

• Created new template that sets a servicegroup that depends on the Graphite service.

Keep on analyzing

• It’s okay to just identify and delete alerts that don’t mean anything!

• Or move them to email only

More Quantification!

Reviewing the Year

• Use reports

• Use search

• Identify noisiest alerts

Reviewing the Year

[Yearly report screenshots]

Nagios Hack Day/Week

• Great time to look at this data and make improvements

• If disk space is the worst, can we rethink that?

Outsource Your Alerts

• Etsy’s Search Team has on-call rotation

• A whole subset of alerts that don’t go to Ops

• More teams starting this but Search Team is at 100%

Sleep Tracking

“Track your life!” - @ph

Did it work?

• Yes.

• Signal to noise ratio is much better

• Okay, so it’s a little more complicated than that

• Adding alerts all the time means new “annoying” things

• Keep monitoring

What’s next?

The Effect of Sleep

• We focus on people’s sleep

• But not the effect on the person when they come to work the next day

• How do we measure the impact of sleep loss/deprivation?

• Subjective: Pittsburgh Sleepiness Scale

• Objective: Psychomotor vigilance task (PVT) to measure alertness

Beyond Opsweekly

• Employee wellness program

• Security have started using past sleep data to check for weird logins to systems

More context: nagios-herald

More reports

• We have a bunch of data, we can build better reports, drill down to analyze alerting trends

• Can we attribute particular actions to reduced noise volume?

• Aggregate alerts

• Non-downtimed alerts
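
One simple starting point for the attribution question is to compare mean weekly alert volume before and after the date a change was made. This is only a sketch: the function name and input shape are hypothetical, and a before/after difference shows correlation with the change, not proof that the change caused it.

```python
from datetime import date
from statistics import mean
from typing import Mapping, Optional, Tuple


def mean_alerts_before_after(
    weekly_counts: Mapping[date, int], change_date: date
) -> Optional[Tuple[float, float]]:
    """Return (mean before, mean after) weekly alert counts around change_date.

    weekly_counts maps the first day of each week to that week's alert count.
    Returns None if either side of the change has no data.
    """
    before = [n for week, n in weekly_counts.items() if week < change_date]
    after = [n for week, n in weekly_counts.items() if week >= change_date]
    if not before or not after:
        return None
    return float(mean(before)), float(mean(after))
```

Running the same comparison per action (for example, the week aggregation was introduced versus the week downtime reminders were added) would give a rough ranking of which changes coincided with the largest drops.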

Thanks

Etsy Ops Team

SewMona

Open Source/Links

• http://ryanfrantz.com/mtts

• https://github.com/etsy/opsweekly

• https://github.com/etsy/nagios-herald

• https://github.com/jonlives/jawboneup_to_graphite

• http://codeascraft.com

Questions?