War Games - Flight training for DevOps @ TechSummit Amsterdam

Jorge Salamero Sanz <[email protected]> TechSummit Amsterdam 2 June 2016 War Games - Flight training for DevOps https://joind.in/talk/2e223

Transcript of War Games - Flight training for DevOps @ TechSummit Amsterdam

Page 1: War Games - Flight training for DevOps @ TechSummit Amsterdam

Jorge Salamero Sanz <[email protected]>

TechSummit Amsterdam 2 June 2016

War Games - Flight training for DevOps

https://joind.in/talk/2e223

Page 2: War Games - Flight training for DevOps @ TechSummit Amsterdam

Jorge Salamero

@bencerillo

@serverdensity

blog.serverdensity.com

Page 3: War Games - Flight training for DevOps @ TechSummit Amsterdam

How to Monitor MySQL

Page 4: War Games - Flight training for DevOps @ TechSummit Amsterdam

● Infrastructure automation

● Configuration automation

● Continuous testing

● Continuous deployment / delivery

● Monitoring

● Logs, error handling

● Feedback

● Human Ops

DevOps lifecycle

Page 5: War Games - Flight training for DevOps @ TechSummit Amsterdam

● Humans are part of any system

● Initial design, ongoing improvements

● Maintenance

● Upgrades

● Issues, Incident response

Humans in DevOps

Page 6: War Games - Flight training for DevOps @ TechSummit Amsterdam

● System issues = error rates + SLA + ...

● Human issues = alerts out of hours + interruptions + ...

● System issues = Human issues

Human issues = system issues

Page 7: War Games - Flight training for DevOps @ TechSummit Amsterdam

● Downtime = loss of users, reputation, revenue

● Downtime caused by unreliable systems

● Unhealthy teams reduce reliability

● Unhealthy teams = loss of users, reputation, revenue

Humans impact business

Page 8: War Games - Flight training for DevOps @ TechSummit Amsterdam

● Slip

● Lapse

● Mistake

● Violation

● (Always, again, again)

Human risk

Page 9: War Games - Flight training for DevOps @ TechSummit Amsterdam

What can we do?

Page 10: War Games - Flight training for DevOps @ TechSummit Amsterdam

● Prepare and practice

● Respond

● Postmortem

Expect downtime

Page 11: War Games - Flight training for DevOps @ TechSummit Amsterdam

Real example

(small war story, won’t be long)

Page 12: War Games - Flight training for DevOps @ TechSummit Amsterdam

● Power failure to half of our servers

● Automated failover unavailable (known failure condition)

● Manual DNS switch required

● Expected impact: 20 min

● Actual impact: 43 min

Incident example
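
The gap between expected and actual impact on this slide can be worked out directly. A minimal sketch using only the two figures from the slide; the function name is illustrative, not from the talk.

```python
def overrun(expected_min: float, actual_min: float) -> tuple[float, float]:
    """Return (extra minutes, extra time as a percentage of the estimate)."""
    extra = actual_min - expected_min
    return extra, 100.0 * extra / expected_min


# Figures from the incident example: 20 min expected, 43 min actual.
extra_min, extra_pct = overrun(expected_min=20, actual_min=43)
print(f"Overrun: {extra_min} min ({extra_pct:.0f}% over the estimate)")
```

The incident took 23 minutes longer than planned, more than double the estimate.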

Page 13: War Games - Flight training for DevOps @ TechSummit Amsterdam
Page 14: War Games - Flight training for DevOps @ TechSummit Amsterdam

Lessons learned?

Page 15: War Games - Flight training for DevOps @ TechSummit Amsterdam

● Unfamiliarity with the process

● Pressure of a time-sensitive event (panic effect)

● Escalation introduces delays

The Human Factor

Page 16: War Games - Flight training for DevOps @ TechSummit Amsterdam

Handling the Human factor

Page 17: War Games - Flight training for DevOps @ TechSummit Amsterdam

● First responder, acknowledge alert

● Load incident response checklist

● Log into #ops-war-room in Slack

● Log incident into JIRA

● Begin investigation

General response process
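
The response steps above could be tracked with a small checklist helper. This is a sketch only: the step wording comes from the slide, while `ChecklistRun` and its methods are hypothetical names, not a tool described in the talk.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

# Steps taken from the "General response process" slide.
RESPONSE_STEPS = [
    "First responder acknowledges the alert",
    "Load incident response checklist",
    "Log into #ops-war-room in Slack",
    "Log incident into JIRA",
    "Begin investigation",
]


@dataclass
class ChecklistRun:
    """Tracks which checklist steps are done and when (hypothetical helper)."""
    steps: list[str]
    done_at: dict[str, datetime] = field(default_factory=dict)

    def complete(self, step: str) -> None:
        """Mark a step as done, recording the UTC time."""
        if step not in self.steps:
            raise ValueError(f"unknown step: {step}")
        self.done_at[step] = datetime.now(timezone.utc)

    def next_step(self) -> Optional[str]:
        """Return the first step not yet completed, or None when finished."""
        for step in self.steps:
            if step not in self.done_at:
                return step
        return None


run = ChecklistRun(RESPONSE_STEPS)
run.complete(RESPONSE_STEPS[0])
print("Next:", run.next_step())
```

Recording a timestamp per step also gives the postmortem a timeline for free.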

Page 18: War Games - Flight training for DevOps @ TechSummit Amsterdam

1. Extended use of checklists

Documented procedures

Page 19: War Games - Flight training for DevOps @ TechSummit Amsterdam

● The “limits of human memory and attention”

  ○ Complexity

  ○ Stress and fatigue

  ○ Ego

● Pilots, doctors, divers: Bruce Willis Ruins All Films (BCD, weights, releases, air, final)

Pre-flight checklists

Page 20: War Games - Flight training for DevOps @ TechSummit Amsterdam

1. Extended use of checklists

2. Do not follow blindly; use knowledge and experience

3. Independent system

4. Searchable

5. List of known issues and documented workarounds/fixes

Documented procedures
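
A searchable list of known issues can be very small. The sketch below assumes a plain in-memory list; the first entry paraphrases the incident example from this talk, the second is a hypothetical placeholder, and `search` is an illustrative name.

```python
# Each entry pairs a known failure condition with its documented workaround.
KNOWN_ISSUES = [
    {
        # Paraphrased from the incident example in this talk.
        "issue": "Automated failover unavailable after power failure",
        "workaround": "Manual DNS switch",
    },
    {
        # Hypothetical entry, for illustration only.
        "issue": "Disk full on log volume",
        "workaround": "Rotate and compress old logs",
    },
]


def search(query: str, issues: list[dict[str, str]]) -> list[dict[str, str]]:
    """Case-insensitive substring search over issue and workaround text."""
    needle = query.lower()
    return [
        entry
        for entry in issues
        if needle in entry["issue"].lower() or needle in entry["workaround"].lower()
    ]


for hit in search("failover", KNOWN_ISSUES):
    print(f'{hit["issue"]} -> {hit["workaround"]}')
```

In practice the list would live in a system independent of production (point 3), so it stays reachable during an outage.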

Page 21: War Games - Flight training for DevOps @ TechSummit Amsterdam

● Realistic replica environment

● or mock command line

● Record actions and timing

● Multiple failures

● Unexpected results

War Games
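
"Record actions and timing" could be as simple as the following sketch. `DrillLog` is a hypothetical helper, and the recorded actions are examples borrowed from earlier slides rather than a real drill.

```python
import time
from dataclasses import dataclass, field


@dataclass
class DrillLog:
    """Records what was done during a war game and how long after the start."""
    started: float = field(default_factory=time.monotonic)
    entries: list[tuple[float, str]] = field(default_factory=list)

    def record(self, action: str) -> None:
        """Store the action with seconds elapsed since the drill started."""
        self.entries.append((time.monotonic() - self.started, action))

    def report(self) -> list[str]:
        """Format the timeline for the review session."""
        return [f"+{elapsed:6.1f}s  {action}" for elapsed, action in self.entries]


log = DrillLog()
log.record("Acknowledged alert")
log.record("Switched DNS manually")
for line in log.report():
    print(line)
```

A monotonic clock is used so the timings are unaffected by system clock adjustments during the exercise.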

Page 22: War Games - Flight training for DevOps @ TechSummit Amsterdam

Results

Page 23: War Games - Flight training for DevOps @ TechSummit Amsterdam

● Team and individual test of response

● Run real commands

● Training the people

● Training the procedures

● Training the tools

Results

Page 24: War Games - Flight training for DevOps @ TechSummit Amsterdam

● Increase confidence

● Reduce panic

● Better coordination

● Trust relationships

● Improves time to resolution

Human results

Page 25: War Games - Flight training for DevOps @ TechSummit Amsterdam

● Review

● Suggestions for improvements

● Do it again

● Scenario evolves

● People forget

loop(): review and repeat

Page 26: War Games - Flight training for DevOps @ TechSummit Amsterdam

What else?

Page 27: War Games - Flight training for DevOps @ TechSummit Amsterdam

● Pressure of just waiting to be paged

● Trouble sleeping: 7.8 days a year productivity cost

● Prevent burnout

On call

Page 28: War Games - Flight training for DevOps @ TechSummit Amsterdam

● Half of interruptions are self-interruptions

● Avg 23 minutes to resume task

● Only alerts that are actionable now

Alert notifications
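
One way to read this slide is as a paging rule: notify a human only when an alert can be acted on immediately. The sketch below assumes that reading. The `actionable` and `urgent` flags and the sample alerts are hypothetical; the 23-minute figure is the average from the slide.

```python
from dataclasses import dataclass

RESUME_MINUTES = 23  # average time to resume a task, from the slide


@dataclass
class Alert:
    name: str
    actionable: bool  # hypothetical flag: can a human do something about it?
    urgent: bool      # hypothetical flag: does it need action right now?


def should_page(alert: Alert) -> bool:
    """Page only for alerts that are both actionable and urgent."""
    return alert.actionable and alert.urgent


def resume_cost_minutes(interruptions: int) -> int:
    """Estimated focus time lost to a number of interruptions."""
    return interruptions * RESUME_MINUTES


# Sample alerts, invented for illustration.
alerts = [
    Alert("Primary database down", actionable=True, urgent=True),
    Alert("Disk 70% full", actionable=True, urgent=False),
    Alert("Upstream provider degraded", actionable=False, urgent=True),
]
paged = [a.name for a in alerts if should_page(a)]
avoided = len(alerts) - len(paged)
print("Page for:", paged)
print("Focus time saved:", resume_cost_minutes(avoided), "minutes")
```

Alerts that fail the rule can still be logged or sent to a non-interrupting channel for review during working hours.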

Page 29: War Games - Flight training for DevOps @ TechSummit Amsterdam
Page 30: War Games - Flight training for DevOps @ TechSummit Amsterdam

www.humanops.com

meetup.com/humanops-london/

meetup.com/humanops-sanfrancisco/

Human Ops Meetup

Page 31: War Games - Flight training for DevOps @ TechSummit Amsterdam

serverdensity.com/conferences

TECHSUMMIT

Shh! Free monitoring!

Page 32: War Games - Flight training for DevOps @ TechSummit Amsterdam

www.CloudStatusApp.com

https://joind.in/talk/2e223 @bencerillo