It's Not Your Fault - Blameless Post-mortems
jason-hand
(Blameless) post-mortems
@jasonhand
It’s Not Your Fault
A little about me…
Dir. of Platform Support - AppDirect
Dir. of Technical Support - Standing Cloud
Dir. of Operational Systems - American Fasteners, Inc.
Hiker, climber, brewer, runner, biker, boarder, surfer, painter, singer, reader, writer, picker, coder, racer, camper, volunteer … all the usual “Colorado 1-upper” crap.
Alternative names
Also known as (note: public & internal):
Project Retrospectives
Post-mortem Analysis
Post-project Review
Project Analysis Review
Quality Improvement Review
Autopsy Review
Santayana Review
After Action Review
Touchdown Meeting
Post-mortem Defined
What?
A process intended to inform improvements by determining which aspects were successful or unsuccessful.
Post-mortem Defined
When?
As soon as feasible after the incident is resolved.
Post-mortem Defined
Who?
Everybody
Post-mortem Defined
Why?
To communicate with your team
To understand what happened for learning and improving
Post-mortem Defined
How?
Talk about the incident timeline
Escalation steps
What was done to resolve the problem
Create a remediation plan
Make it available
The Three R’s
Regret
Acknowledgement and apology
Reason
Initial incident detection to resolution, including the so-called “root causes.”
Remedy
Actionable remediation items
– Dave Zwieback, VP Engineering - Next Big Sound
Remedy (simple format)
Use SMART recommendations
Specific
Measurable
Agreed Upon/Agreeable
Realistic
Timebound
Moving from Reaction to Action
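One way to make the SMART checklist above concrete is to record each remediation item with one field per criterion and flag whatever is still blank. The sketch below is only an illustration: the `RemediationItem` class, its field names, and the example items are hypothetical and do not come from the talk or from any particular tool. It checks only that each criterion has been filled in, not whether the content is any good.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class RemediationItem:
    """One remediation action from a post-mortem (hypothetical structure)."""
    action: str                       # Specific: what exactly will be done
    success_metric: str = ""          # Measurable: how we know it is done
    agreed_by: Optional[str] = None   # Agreed Upon: who accepted the work
    realistic: bool = False           # Realistic: the team judged it achievable
    due: Optional[date] = None        # Timebound: a concrete deadline

    def missing_criteria(self) -> List[str]:
        """Return the SMART criteria that have not been filled in yet."""
        missing = []
        if not self.action.strip():
            missing.append("Specific")
        if not self.success_metric.strip():
            missing.append("Measurable")
        if not self.agreed_by:
            missing.append("Agreed Upon")
        if not self.realistic:
            missing.append("Realistic")
        if self.due is None:
            missing.append("Timebound")
        return missing


# A reaction ("improve monitoring") versus an action that fills in every field.
vague = RemediationItem(action="Improve monitoring")
smart = RemediationItem(
    action="Add a disk-usage alert for the database hosts",
    success_metric="Alert fires in a test when usage exceeds 80%",
    agreed_by="on-call team",
    realistic=True,
    due=date(2015, 1, 31),
)

print(vague.missing_criteria())
print(smart.missing_criteria())
```

An item with an empty `missing_criteria()` list is ready to go into the remediation plan; anything else goes back to the group for discussion rather than to an individual for blame.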
Blameless
image from “Across the Universe”
Cool story, bro
2011 - Hired at Standing Cloud
Cloud marketplace & automated deployment of apps
Build Support team
Provide Managed services
“Reprimanding bad apples may seem like a quick and rewarding fix, but it’s like peeing in your pants. You feel relieved and perhaps even nice and warm for a little while, but then it gets cold and uncomfortable. And you look like a fool.”
– Sidney Dekker
Quote first seen in J. Paul Reed’s “A Look at Looking in the Mirror”
What is a blameless post-mortem?
Team members are accountable but not responsible
Complete Transparency
Deeper look at circumstances
What happened and how to improve it (specific details)
Real conditions of failure in complex systems
“Your organization must continually affirm that individuals are NEVER the ‘root cause’ of outages.”
– Dave Zwieback
Paraphrased from “Fallible Humans” by Ian Malpass - DevOpsDays - Minneapolis
source: http://www.indecorous.com/fallible_humans/
ETTO (Efficiency Thoroughness Trade Off)
The trade off between being efficient and being thorough
“We can be thorough and really dig into the task at hand and understand it well, but this takes time: it is inefficient.”
– Ian Malpass
Cause & Effect
There are many factors that played a part in the problem
source: http://xkcd.com
“may be”
Stress & Cognitive Bias
Yerkes-Dodson Model
source: The Human Side of Postmortems
Reduce Stress?
… build muscle memory
Simulate many types of problems and outages as “practice” …
Evaluative Threat
Being negatively judged plays a big role in stress
What is stress surface?
Variables of a situation
Novel or unusual
Unpredictable
Controllable situation
Negative judgement (evaluative threats)
ALSO
Lack of sleep
Problems at home
Health
Relationships
Etc…
Capturing the Human-side
Ask questions
Stress Questionnaire
During the outage, how often have you felt or thought that:
The situation was novel or unusual?
The situation was unpredictable?
You were unable to control the situation?
Others could judge your actions negatively?
0 = Never, 1 = Almost Never, 2 = Sometimes, 3 = Fairly Often, 4 = Very Often
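The questionnaire above can be tallied with very little code. The following is a minimal sketch, assuming answers are collected as the scale labels from the slide. The function names, the anonymous responder IDs, and the sample answers are all made up for illustration. The slide gives no thresholds for interpreting the totals, so none are applied here.

```python
from typing import Dict, List

# Scale and questions as given on the slide.
SCALE = {
    "Never": 0,
    "Almost Never": 1,
    "Sometimes": 2,
    "Fairly Often": 3,
    "Very Often": 4,
}
QUESTIONS = [
    "The situation was novel or unusual?",
    "The situation was unpredictable?",
    "You were unable to control the situation?",
    "Others could judge your actions negatively?",
]


def score(answers: List[str]) -> int:
    """Total score for one responder (0 to 16 with four questions)."""
    if len(answers) != len(QUESTIONS):
        raise ValueError("expected one answer per question")
    return sum(SCALE[answer] for answer in answers)


def per_question_mean(responses: Dict[str, List[str]]) -> List[float]:
    """Average score for each question across all responders."""
    count = len(responses)
    return [
        sum(SCALE[answers[i]] for answers in responses.values()) / count
        for i in range(len(QUESTIONS))
    ]


# Hypothetical answers, keyed by anonymous responder ID.
responses = {
    "responder-1": ["Very Often", "Sometimes", "Fairly Often", "Never"],
    "responder-2": ["Sometimes", "Sometimes", "Almost Never", "Fairly Often"],
}

totals = {who: score(answers) for who, answers in responses.items()}
means = per_question_mean(responses)

print(totals)
for question, mean in zip(QUESTIONS, means):
    print(f"{mean:.1f}  {question}")
```

Keeping responders anonymous and reporting per-question averages puts the focus on the conditions of the outage (novelty, unpredictability, control, judgement) rather than on any one person.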
Why we don’t punish
Punishment de-incentivizes people from giving the details
It practically guarantees a repeat of the problem
Understand why actions made sense (at the time)
Create safety AND accountability
Move away from idea of “individuals are problems”
Create new “experts”
Promoting from within
Where do we start?
The basics:
• Document your timeline or log data
• Document conversations
• Leave room for notes
• Mean time to resolution / Time calculations
• Level of severity
• Archive it for historical retrieval
• Remediation. Make it actionable
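Several of the basics above (a timeline, time calculations, a severity level) fit naturally into a small record type. The sketch below is one possible shape, not the format of any tool named in this talk. The `Incident` class, the "1 = most severe" convention, and the sample incidents are assumptions made for the example. It also assumes the earliest timeline entry is the detection and the latest is the resolution.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass
class Incident:
    """A minimal incident record (hypothetical structure)."""
    title: str
    severity: int  # assumed convention: 1 = most severe
    timeline: List[Tuple[datetime, str]] = field(default_factory=list)

    def log(self, when: datetime, note: str) -> None:
        """Add a timeline entry and keep the timeline in time order."""
        self.timeline.append((when, note))
        self.timeline.sort(key=lambda entry: entry[0])

    def time_to_resolution(self) -> timedelta:
        """Time from the first entry (detection) to the last (resolution)."""
        if len(self.timeline) < 2:
            raise ValueError("need at least a detection and a resolution entry")
        return self.timeline[-1][0] - self.timeline[0][0]


def mean_time_to_resolution(incidents: List[Incident]) -> timedelta:
    """Average time to resolution across a list of incidents."""
    total = sum((i.time_to_resolution() for i in incidents), timedelta())
    return total / len(incidents)


db = Incident("Database disk full", severity=1)
db.log(datetime(2015, 1, 5, 10, 0), "Alert fired: disk usage high")
db.log(datetime(2015, 1, 5, 10, 20), "Escalated to database team")
db.log(datetime(2015, 1, 5, 10, 45), "Old logs purged, service restored")

cache = Incident("Cache node restart", severity=3)
cache.log(datetime(2015, 1, 9, 14, 0), "Latency alert fired")
cache.log(datetime(2015, 1, 9, 14, 15), "Node restarted, latency normal")

print(db.time_to_resolution())                 # 0:45:00
print(mean_time_to_resolution([db, cache]))    # 0:30:00
```

The timeline entries describe what happened and when, with no names attached, which keeps the record useful for later retrieval without turning it into a blame log.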
Tools
Etsy’s Morgue
VictorOps Post-mortem Report
Internal Wiki
Seek the truth
Don’t blame others …
Don’t blame yourself
Thank You
Questions?
Resources
“The Human Side of Postmortems” - Dave Zwieback
“The Field Guide to Understanding Human Error” - Sidney Dekker
“A Look at Looking in the Mirror” - J. Paul Reed
“Fallible Humans” - Ian Malpass (http://www.indecorous.com/fallible_humans/)
“4 Questions to ask for an effective Technical Post Mortem” - Jeffrey O’Brien (http://www.maintenanceassistant.com/blog/4-questions-effective-technical-post-mortem/)
“Nine steps to IT post-mortem excellence” - Michael Krigsman (http://www.zdnet.com/blog/projectfailures/nine-steps-to-it-post-mortem-excellence/1069)
“Postmortem reviews: purpose and approaches in software engineering” - Torgeir Dingsøyr (http://www.uio.no/studier/emner/matnat/ifi/INF5180/v10/undervisningsmateriale/reading-materials/p08/post-mortems.pdf)
“Blameless PostMortems and a Just Culture” - John Allspaw (http://codeascraft.com/2012/05/22/blameless-postmortems/)
“What blameless really means” - Jessica Harllee (http://www.jessicaharllee.com/notes/what-blameless-really-means/)
“Each necessary, but only jointly sufficient” - John Allspaw (http://www.kitchensoap.com/2012/02/10/each-necessary-but-only-jointly-sufficient/)