War Games - Flight training for DevOps @ TechSummit Amsterdam

Post on 26-Jan-2017

Jorge Salamero Sanz <jsalamero@serverdensity.com>

TechSummit Amsterdam 2 June 2016

War Games - Flight training for DevOps

https://joind.in/talk/2e223

Jorge Salamero

@bencerillo

@serverdensity

blog.serverdensity.com

How to Monitor MySQL

● Infrastructure automation

● Configuration automation

● Continuous testing

● Continuous deployment / delivery

● Monitoring

● Logs, error handling

● Feedback

● Human Ops

DevOps lifecycle

● Humans are part of any system

● Initial design, ongoing improvements

● Maintenance

● Upgrades

● Issues, Incident response

Humans in DevOps

● System issues = error rates + SLA + ...

● Human issues = alerts out of hours + interruptions + ...

● System issues = Human issues

Human issues = system issues

● Downtime = loss of users, reputation, revenue

● Downtime caused by unreliable systems

● Unhealthy teams reduce reliability

● Unhealthy teams = loss of users, reputation, revenue

Humans impact business

● Slip

● Lapse

● Mistake

● Violation

● (Always, again, again)

Human risk

What can we do?

● Prepare and practice

● Respond

● Postmortem

Expect downtime

Real example

(small war story, won’t be long)

● Power failure to half of our servers

● Automated failover unavailable (known failure condition)

● Manual DNS switch required

● Expected impact: 20 min

● Actual impact: 43 min

Incident example
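The gap between expected and actual impact in this incident can be put into numbers. The following is a minimal sketch: the function name and the returned fields are illustrative assumptions, and only the 20 and 43 minute figures come from the slide.

```python
def impact_overrun(expected_min: float, actual_min: float) -> dict:
    """Compare planned and observed downtime for one incident."""
    return {
        # Minutes of downtime beyond what the runbook predicted.
        "overrun_min": actual_min - expected_min,
        # How many times longer the incident lasted than expected.
        "ratio": actual_min / expected_min,
    }


# Figures from the incident on this slide.
result = impact_overrun(expected_min=20, actual_min=43)
print(result)
```

With these inputs the overrun is 23 minutes, a little more than twice the expected impact.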

Lessons learned?

● Unfamiliarity with the process

● Pressure of a time-sensitive event (panic effect)

● Escalation introduces delays

The Human Factor

Handling the Human factor

● First responder, acknowledge alert

● Load incident response checklist

● Log into #ops-war-room in Slack

● Log incident into JIRA

● Begin investigation

General response process
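The response steps above could be kept as data rather than in someone's memory. This is a rough sketch, not the speaker's tooling: the class `IncidentChecklist` and its methods are hypothetical names, and only the step texts are taken from the slide.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Step texts taken from the slide; everything else is illustrative.
RESPONSE_STEPS = [
    "Acknowledge alert (first responder)",
    "Load incident response checklist",
    "Log into #ops-war-room in Slack",
    "Log incident into JIRA",
    "Begin investigation",
]


@dataclass
class IncidentChecklist:
    steps: list
    done: dict = field(default_factory=dict)  # step -> completion time (UTC)

    def complete(self, step: str) -> None:
        """Mark a step as done and remember when it happened."""
        if step not in self.steps:
            raise ValueError(f"unknown step: {step}")
        self.done[step] = datetime.now(timezone.utc)

    def next_step(self):
        """Return the first step not yet completed, or None if all are done."""
        for step in self.steps:
            if step not in self.done:
                return step
        return None


checklist = IncidentChecklist(RESPONSE_STEPS)
checklist.complete("Acknowledge alert (first responder)")
print(checklist.next_step())
```

The sketch deliberately does not force steps to be completed in order, so the responder can still apply judgment; the timestamps give raw material for a later postmortem.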

1. Extended use of checklists

Documented procedures

● The “limits of human memory and attention”

○ Complexity

○ Stress and fatigue

○ Ego

● Pilots, doctors, divers: “Bruce Willis Ruins All Films” (BCD, weights, releases, air, final)

Pre-flight checklists

1. Extended use of checklists

2. Not to be followed blindly; use knowledge and experience

3. Independent system

4. Searchable

5. List of known issues and documented workarounds/fixes

Documented procedures

● Realistic replica environment

● or mock command line

● Record actions and timing

● Multiple failures

● Unexpected results

War Games
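"Record actions and timing" can be as simple as a timestamped log kept during the drill. Below is a minimal sketch under stated assumptions: `DrillRecorder` is a hypothetical helper, and the clock is injectable so the example can be run with fake timings instead of a real exercise.

```python
import time


class DrillRecorder:
    """Collects (seconds since drill start, action description) pairs."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._start = clock()
        self.actions = []

    def record(self, description: str) -> float:
        """Log an action and return the elapsed time in seconds."""
        elapsed = self._clock() - self._start
        self.actions.append((elapsed, description))
        return elapsed

    def total_seconds(self) -> float:
        """Elapsed time at the last recorded action (0.0 if none)."""
        return self.actions[-1][0] if self.actions else 0.0


# Fake clock for illustration: drill starts at 0 s, actions at 30 s and 95 s.
ticks = iter([0.0, 30.0, 95.0])
rec = DrillRecorder(clock=lambda: next(ticks))
rec.record("acknowledged alert")
rec.record("switched DNS manually")
print(rec.actions)
```

In a real drill the default `time.monotonic` clock would be used; comparing the recorded timings across repeated drills shows whether the team is actually getting faster.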

Results

● Team and individual test of response

● Run real commands

● Training the people

● Training the procedures

● Training the tools

Results

● Increase confidence

● Reduce panic

● Better coordination

● Trust relationships

● Improves time to resolution

Human results

● Review

● Suggestions for improvements

● Do it again

● Scenario evolves

● People forget

loop(): review and repeat

What else?

● Pressure of just waiting to be paged

● Trouble sleeping: 7.8 days per year productivity cost

● Prevent burnout

On call

● Half are self-interruptions

● Avg 23 minutes to resume task

● Only alerts that are actionable right now

Alert notifications
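One way to read the last point is as a paging rule: interrupt a human only when the alert is both actionable and needs action now. The sketch below assumes a hypothetical `Alert` record with two boolean flags; neither the class nor the example alert names come from the talk.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    actionable: bool  # can a human do something about it?
    urgent: bool      # does it need doing right now?


def should_page(alert: Alert) -> bool:
    """Page only for alerts that are actionable right now."""
    return alert.actionable and alert.urgent


# Made-up alerts for illustration.
alerts = [
    Alert("disk 95% full on db1", actionable=True, urgent=True),
    Alert("nightly backup slower than usual", actionable=True, urgent=False),
    Alert("upstream provider degraded", actionable=False, urgent=True),
]

pages = [a.name for a in alerts if should_page(a)]
print(pages)
```

Alerts that fail the check are not discarded; they could go to a queue reviewed during working hours, which avoids the out-of-hours interruptions described earlier.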

www.humanops.com

meetup.com/humanops-london/

meetup.com/humanops-sanfrancisco/

Human Ops Meetup

serverdensity.com/conferences

TECHSUMMIT

Shh! Free monitoring!

www.CloudStatusApp.com
