Life On-Call, Availa-liberty, and the Pursuit of Happiness

Post on 14-Feb-2017

Life On-Call, Availa-liberty, & the Pursuit of Happiness

Runbooks

Dave Cliffe

@CliffeHangers

Incident #1: Oct 27, 2011

Incident #2: May 1, 2013

Incident #3: Nov 2, 2015

Collaboration/Resolution

MICROSERVICES

APPS & SERVICES

CONTAINERS

CLOUD

NETWORK

DATABASE

SERVERS

Developer

NOC

Helpdesk

IT Ops

System and User Efficiency

ALERT 1 ALERT 2 ALERT 3

Correlate, Cluster and Manage

EVENTS

People Data Process

Deployment Tools

Monitoring Tools

Ticketing Tools

APP

SYSTEM

LOG

WEB

MOBILE APP

Automatic Escalations

On-Call Scheduling

Your Fastest Path to Incident Resolution

Availability

Every software powered company experiences downtime

http://www.evolven.com/blog/downtime-outages-and-failures-understanding-their-true-costs.html

Cost of outages:

$7,400,000 annual cost @ 175 hours of downtime (Gartner)
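To put the annual figure in on-call terms, it helps to break it down per hour and per minute. The sketch below uses only the two numbers quoted on the slide; the constant and variable names are illustrative, and the result assumes cost accrues evenly across all downtime, which real outages rarely do.

```python
# Figures quoted on the slide (attributed to Gartner).
ANNUAL_OUTAGE_COST_USD = 7_400_000
ANNUAL_DOWNTIME_HOURS = 175

# Assumption: cost is spread evenly over every hour of downtime.
cost_per_hour = ANNUAL_OUTAGE_COST_USD / ANNUAL_DOWNTIME_HOURS
cost_per_minute = cost_per_hour / 60

print(f"Cost per hour of downtime:   ${cost_per_hour:,.0f}")
print(f"Cost per minute of downtime: ${cost_per_minute:,.0f}")
```

Under that assumption, an hour of downtime costs roughly $42,000.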

“The most important ability is availability.”

All CEOs everywhere

Why is Availability a terrible metric?

The Tyranny of the SLA

credit: J. Paul Reed (@jpaulreed)

“System Availability” means the percentage of total time during which the Hosted Service network is available to Client and Client is able to access the Hosted Service system interface.

______ warrants the following minimum levels of Hosted Service System Availability during each calendar month: 99.95%

The following definitions will apply to the calculation of “availability”: “Hosted Service System Availability” means the percentage of total time during each calendar month during which the Hosted Service is available to Client, excluding Scheduled Downtime and Emergency Maintenance.

An actual SaaS SLA
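A rough sketch of what the 99.95% monthly figure allows, and of how the maintenance exclusions can change the reported number. The function names are hypothetical, and the treatment of excluded time (removed from the measured period entirely) is one possible reading of the SLA wording, not something the contract text confirms.

```python
import calendar


def allowed_downtime_minutes(availability_pct: float, year: int, month: int) -> float:
    """Minutes of downtime a monthly availability target permits."""
    days_in_month = calendar.monthrange(year, month)[1]
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


def reported_availability(total_minutes: float,
                          unplanned_downtime_minutes: float,
                          excluded_minutes: float = 0.0) -> float:
    """Availability percentage after dropping excluded time.

    Assumption: Scheduled Downtime and Emergency Maintenance are
    removed from the measured period, so they never count against
    the target however long they last.
    """
    counted_minutes = total_minutes - excluded_minutes
    return 100 * (counted_minutes - unplanned_downtime_minutes) / counted_minutes


if __name__ == "__main__":
    budget = allowed_downtime_minutes(99.95, 2017, 4)  # April has 30 days
    print(f"Downtime budget for a 30-day month: {budget:.1f} minutes")

    month_minutes = 30 * 24 * 60
    # 15 minutes of unplanned downtime plus a 4-hour maintenance window.
    pct = reported_availability(month_minutes, 15, excluded_minutes=240)
    print(f"Reported availability: {pct:.3f}%")
```

A 30-day month at 99.95% leaves about 21.6 minutes of downtime. In the second call the customer was without service for 255 minutes, yet the reported figure still clears the target, which is one way the metric can diverge from what users experience.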

Are you Available?

Happiness

Measuring (Un)Happiness

Responsiveness

Pain

Health Checks

https://labs.spotify.com/2014/09/16/squad-health-check-model/

Happiness++

http://www.activestate.com/blog/2014/01/devops-hero-culture

Beware the ‘Hero Culture’

Eliminate Single Points of Dependence

Reduce Alert Fatigue

https://www.pinterest.com/pin/497929302524908289

On a regular basis, for every alert, ask …

1) Is it actionable?
2) Is it urgent?
3) Could we consolidate?
4) Did the right person get it?
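The four questions above can be turned into a lightweight review pass over an alert inventory. This is a minimal sketch: the `AlertReview` fields, the order in which the questions are checked, and the recommendation strings are all assumptions for illustration, not part of the talk.

```python
from dataclasses import dataclass


@dataclass
class AlertReview:
    """Answers to the four review questions for one alert (hypothetical schema)."""
    name: str
    actionable: bool
    urgent: bool
    could_consolidate: bool
    reached_right_person: bool


def recommend(review: AlertReview) -> str:
    """Map the review answers to a suggested follow-up (illustrative policy)."""
    if not review.actionable:
        return "remove the alert or turn it into a dashboard metric"
    if review.could_consolidate:
        return "merge with related alerts"
    if not review.reached_right_person:
        return "fix routing or the escalation policy"
    if not review.urgent:
        return "downgrade to a low-urgency notification"
    return "keep as is"


if __name__ == "__main__":
    reviews = [
        AlertReview("disk 70% full", actionable=False, urgent=False,
                    could_consolidate=False, reached_right_person=True),
        AlertReview("checkout API 5xx spike", actionable=True, urgent=True,
                    could_consolidate=False, reached_right_person=True),
    ]
    for r in reviews:
        print(f"{r.name}: {recommend(r)}")
```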

“The most important on-call responsibility is to understand customer impact.”

Anonymous Customer (who I didn’t verify I could quote)

Sharing Operational Responsibility

“Giving developers operational responsibilities has greatly enhanced the QUALITY of the services, both from a customer and a technology point of view. The TRADITIONAL model is that you take your software to the wall that separates development and operations, and throw it over and then forget about it. … You build it, you run it.”

-Werner Vogels, CTO Amazon

SHARED OPERATIONAL RESPONSIBILITY

“For developers to take responsibility for the systems they create, they need support from operations to understand how to build ‘reliable software that can be continuous deployed to an unreliable platform that scales horizontally’.”

-Jez Humble, quoting Jesse Robbins (Chef)

Thanks!

Runbooks

Dave Cliffe

dcliffe@pagerduty.com

@CliffeHangers