3days september
Transcript of 3days september
THREE DAYS IN SEPTEMBER “Houston, We Have a Problem.”
by Steve Feldman, @PerfForensics
The Agenda
I. This is a True Story...It Really Did Happen
II. Houston, We Have a Problem.
III. The Really Good Vendors Care
IV. Getting to Zero
V. The Damage Was Already Done
VI. Where We Are Today
Houston, We Have a Problem...
Our Outage Affected Our Most Important Asset
Our Outage Was Caused By Human Error
NEVER REBOOT A UNIX MACHINE!
The Monitoring “Cameras” Should Always Be On
24
25
26
Keep Everyone Informed
Who Wants Their Users to Report the Problem First?
Not All of the Data is Believable
Crises Are the Best Time to Determine the Strength of the Team
Keep Your Boss Informed
Keep Your Users Informed
Keep Your Users Updated
Continue to Keep Your Users Updated
Getting to Zero
Log Consolidation
Continue to Keep Your Users Updated
It is Not Just About Restoring Service
It is OK to Admit Mistakes
Let Your Boss Take Credit
Your Boss Did Not Build a Fragile System
Do a Post-Mortem
The Problem Started Long Before
Where We Are Today
Practice Really Matters
Practice Failure
Look at Your Manuals
Practice Routines and Roles
Practice Every Day
NEVER REBOOT A UNIX MACHINE!
Thanks for Listening
@PerfForensics