Jorge Salamero Sanz <[email protected]>
TechSummit Amsterdam 2 June 2016
War Games - Flight training for DevOps
https://joind.in/talk/2e223
Jorge Salamero
@bencerillo
@serverdensity
blog.serverdensity.com
How to Monitor MySQL
● Infrastructure automation
● Configuration automation
● Continuous testing
● Continuous deployment / delivery
● Monitoring
● Logs, error handling
● Feedback
● Human Ops
DevOps lifecycle
● Humans are part of any system
● Initial design, ongoing improvements
● Maintenance
● Upgrades
● Issues, Incident response
Humans in DevOps
● System issues = error rates + SLA + ...
● Human issues = alerts out of hours + interruptions + ...
● System issues = Human issues
Human issues = system issues
● Downtime = loss of users, reputation, revenue
● Downtime caused by unreliable systems
● Unhealthy teams reduce reliability
● Unhealthy teams = loss of users, reputation, revenue
Humans impact business
● Slip
● Lapse
● Mistake
● Violation
● (Always, again, again)
Human risk
What can we do?
● Prepare and practice
● Respond
● Postmortem
Expect downtime
Real example
(small war story, won’t be long)
● Power failure to half of our servers
● Automated failover unavailable (known failure condition)
● Manual DNS switch required
● Expected impact: 20 min
● Actual impact: 43 min
Incident example
Lessons learned?
● Unfamiliarity with the process
● Pressure of a time-sensitive event (panic effect)
● Escalation introduces delays
The Human Factor
Handling the Human factor
● First responder, acknowledge alert
● Load incident response checklist
● Log into #ops-war-room in Slack
● Log incident into JIRA
● Begin investigation
General response process
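A minimal sketch of the response process above as an ordered checklist that timestamps each step as it is ticked off. The step texts come from the slide; the `Step` and `Checklist` names are hypothetical, and nothing here talks to Slack or JIRA.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional

# Steps taken from the slide, in order.
RESPONSE_STEPS = [
    "Acknowledge the alert (first responder)",
    "Load the incident response checklist",
    "Log into #ops-war-room in Slack",
    "Log the incident in JIRA",
    "Begin investigation",
]


@dataclass
class Step:
    text: str
    done_at: Optional[datetime] = None  # set when the step is ticked off


@dataclass
class Checklist:
    steps: List[Step] = field(default_factory=list)

    @classmethod
    def from_texts(cls, texts):
        return cls([Step(t) for t in texts])

    def next_step(self):
        """Return the first step not yet ticked off, or None when finished."""
        for step in self.steps:
            if step.done_at is None:
                return step
        return None

    def tick(self):
        """Mark the next open step as done, record the time, and return it."""
        step = self.next_step()
        if step is None:
            raise RuntimeError("checklist already complete")
        step.done_at = datetime.now(timezone.utc)
        return step

    def complete(self):
        return self.next_step() is None
```

The recorded timestamps give a per-step timeline that can feed the postmortem.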
1. Extended use of checklists
Documented procedures
● The “limits of human memory and attention”
○ Complexity
○ Stress and fatigue
○ Ego
● Pilots, doctors, divers: Bruce Willis Ruins All Films (BCD, weights, releases, air, final)
Pre-flight checklists
1. Extended use of checklists
2. Not to follow blindly, use knowledge and experience
3. Independent system
4. Searchable
5. List of known issues and documented workarounds/fixes
Documented procedures
● Realistic replica environment
● or mock command line
● Record actions and timing
● Multiple failures
● Unexpected results
War Games
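One way the "mock command line" and "record actions and timing" ideas could fit together, as a rough sketch: a pretend shell that returns canned output and logs every command with the elapsed time. The `MockShell` class and the command names in `SCENARIO` are invented for illustration and are not real tools.

```python
import time


class MockShell:
    """Pretend command line for a war game: canned replies, full action log."""

    def __init__(self, responses, clock=time.monotonic):
        self.responses = responses  # command -> canned output
        self.clock = clock          # injectable so drills can be replayed in tests
        self.started = clock()
        self.log = []               # (seconds since start, command, output)

    def run(self, command):
        """Return the canned output for a command and record the action."""
        output = self.responses.get(command, "command not found: " + command)
        self.log.append((self.clock() - self.started, command, output))
        return output

    def time_to_resolution(self, fix_command):
        """Seconds until the fixing command was first run, or None if never run."""
        for elapsed, command, _ in self.log:
            if command == fix_command:
                return elapsed
        return None


# Hypothetical scenario loosely based on the incident example:
# failover is unavailable, so the responder must switch DNS by hand.
SCENARIO = {
    "check-failover": "failover: unavailable",
    "switch-dns secondary": "dns: now pointing at secondary",
}
```

Reviewing `shell.log` after the drill shows which commands were tried, in what order, and how long the team took to reach the fix.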
Results
● Team and individual test of response
● Run real commands
● Training the people
● Training the procedures
● Training the tools
Results
● Increase confidence
● Reduce panic
● Better coordination
● Trust relationships
● Improves time to resolution
Humans results
● Review
● Suggestions for improvements
● Do it again
● Scenario evolves
● People forget
loop(): review and repeat
What else?
● Pressure of just waiting to be paged
● Trouble sleeping: 7.8 days per year in productivity cost
● Prevent burnout
On call
● Half self-interruptions
● Avg 23 minutes to resume task
● Only alerts that are actionable now
Alerts notifications
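A small sketch of what "only actionable alerts" might look like as a routing rule, assuming each alert carries a flag saying whether a human can act on it right away. The `Alert` class, the `actionable_now` flag and the "page"/"digest" labels are hypothetical; the 23 minutes is the resume time from the slide.

```python
from dataclasses import dataclass

RESUME_MINUTES = 23  # average time to resume a task, from the slide


@dataclass
class Alert:
    name: str
    actionable_now: bool  # hypothetical flag: can a human do something right away?


def route(alert):
    """Page a human only for alerts that need action now; batch the rest."""
    return "page" if alert.actionable_now else "digest"


def focus_minutes_saved(alerts):
    """Rough interruption cost avoided by not paging for non-actionable alerts."""
    return RESUME_MINUTES * sum(1 for a in alerts if not a.actionable_now)
```

For example, with one actionable alert and two informational ones, only the first would page anyone, and the estimate of focus time saved is 2 × 23 = 46 minutes.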
www.humanops.com
meetup.com/humanops-london/
meetup.com/humanops-sanfrancisco/
Human Ops Meetup
serverdensity.com/conferences
TECHSUMMIT
Shh! Free monitoring!
www.CloudStatusApp.com
https://joind.in/talk/2e223 @bencerillo