Anatomy of a real-life incident -Alex Solomon, CTO and Co-Founder of PagerDuty
![Page 1: Anatomy of a real-life incident -Alex Solomon, CTO and Co-Founder of PagerDuty](https://reader031.fdocuments.us/reader031/viewer/2022030307/58ec1a391a28ab4c508b4751/html5/thumbnails/1.jpg)
Anatomy of a real-life incident
Alex Solomon
CTO & Co-Founder @ PagerDuty
![Page 2: Anatomy of a real-life incident -Alex Solomon, CTO and Co-Founder of PagerDuty](https://reader031.fdocuments.us/reader031/viewer/2022030307/58ec1a391a28ab4c508b4751/html5/thumbnails/2.jpg)
THIS IS A TRUE STORY
The events in this presentation took place in San Francisco and Toronto on January 6, 2017
In the interest of brevity, some details have been omitted
![Page 3: Anatomy of a real-life incident -Alex Solomon, CTO and Co-Founder of PagerDuty](https://reader031.fdocuments.us/reader031/viewer/2022030307/58ec1a391a28ab4c508b4751/html5/thumbnails/3.jpg)
The Services
• Web2Kafka Service: publishes change events from the web monolith to Kafka for other services to consume
• Incident Log Entries Service: stores log entries for incidents
• Docker
• Mesos / Marathon
• Linux Kernel
![Page 4: Anatomy of a real-life incident -Alex Solomon, CTO and Co-Founder of PagerDuty](https://reader031.fdocuments.us/reader031/viewer/2022030307/58ec1a391a28ab4c508b4751/html5/thumbnails/4.jpg)
The People
Major incident response principal roles
• Eric: Incident Commander
• Peter: Scribe
• Ken: Deputy
• Luke: Communications Liaison
Subject Matter Experts (SMEs)
• David: Core on-call
• Cees: Core eng
• Evan: SRE on-call
• Renee: IM People on-call
• Zayna: Mobile on-call
• JD: IM Data on-call
• Priyam: EM on-call
![Page 5: Anatomy of a real-life incident -Alex Solomon, CTO and Co-Founder of PagerDuty](https://reader031.fdocuments.us/reader031/viewer/2022030307/58ec1a391a28ab4c508b4751/html5/thumbnails/5.jpg)
The Incident
![Page 6: Anatomy of a real-life incident -Alex Solomon, CTO and Co-Founder of PagerDuty](https://reader031.fdocuments.us/reader031/viewer/2022030307/58ec1a391a28ab4c508b4751/html5/thumbnails/6.jpg)
[3:21 PM] David:SME
!ic page
Officer URL:Chat BOT
🚨 Paging Incident Commander(s)
✔ Eric has been paged.
✔ Ken has been paged.
✔ Peter has been paged.
Incident triggered in the following service: https://pd.pagerduty.com/services/PERDDFI
David:SME
web2kafka is down, and I'm not sure what's going on
David kicked off the major incident process
[3:21 PM] Eric:IC
Taking IC
Eric took the IC role (he was IC primary on-call)
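The `!ic page` command above belongs to PagerDuty's internal chat bot, and the slides do not show how it works. As a rough sketch only, the Python below builds a trigger event in the shape accepted by PagerDuty's public Events API v2. Mapping the chat command to that API is an assumption, and the routing key, summary text, and `source` value are hypothetical placeholders.

```python
import json

# Public PagerDuty Events API v2 endpoint (not necessarily what the bot uses).
EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"


def build_ic_page_event(routing_key: str, summary: str, source: str) -> dict:
    """Build a 'trigger' event asking PagerDuty to open an incident."""
    return {
        "routing_key": routing_key,  # integration key of the target service
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": "critical",
        },
    }


def handle_chat_command(text: str, user: str, routing_key: str):
    """Return an event for '!ic page', or None for any other message."""
    if text.strip() != "!ic page":
        return None
    return build_ic_page_event(
        routing_key=routing_key,
        summary=f"Incident Commander paged by {user}",  # hypothetical wording
        source="chat",                                   # hypothetical source
    )


if __name__ == "__main__":
    event = handle_chat_command("!ic page", "David", "HYPOTHETICAL_ROUTING_KEY")
    # A real bot would POST this JSON to EVENTS_API_URL; here it is only printed.
    print(json.dumps(event, indent=2))
```

Building the payload separately from sending it keeps the sketch runnable without credentials and makes the command handling easy to test.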
![Page 7: Anatomy of a real-life incident -Alex Solomon, CTO and Co-Founder of PagerDuty](https://reader031.fdocuments.us/reader031/viewer/2022030307/58ec1a391a28ab4c508b4751/html5/thumbnails/7.jpg)
The Incident Commander
• The Wartime General: decision maker during a major incident
• GOAL: drive the incident to resolution quickly and effectively
• Gather data: ask subject matter experts to diagnose and debug various aspects of the system
• Listen: collect proposed repair actions from SMEs
• Decide: decide on a course of action
• Act (via delegation): once a decision is made, ask the team to act on it. IC should always delegate all diagnosis and repair actions to the rest of the team.
![Page 8: Anatomy of a real-life incident -Alex Solomon, CTO and Co-Founder of PagerDuty](https://reader031.fdocuments.us/reader031/viewer/2022030307/58ec1a391a28ab4c508b4751/html5/thumbnails/8.jpg)
Priyam:SME
I’m here from EM
Evan:SME
lmk if you need SRE
sounds like IHM might be down too
Ken:DEPUTY
@renee, please join the call
[3:22 PM]
Ken took the deputy role
Other SMEs joined
![Page 9: Anatomy of a real-life incident -Alex Solomon, CTO and Co-Founder of PagerDuty](https://reader031.fdocuments.us/reader031/viewer/2022030307/58ec1a391a28ab4c508b4751/html5/thumbnails/9.jpg)
The Deputy (backup IC)
• The Sidekick: right-hand person for the IC
• Monitor the status of the incident
• Be prepared to page other people
• Provide regular updates to business and/or exec stakeholders
![Page 10: Anatomy of a real-life incident -Alex Solomon, CTO and Co-Founder of PagerDuty](https://reader031.fdocuments.us/reader031/viewer/2022030307/58ec1a391a28ab4c508b4751/html5/thumbnails/10.jpg)
Peter:SCRIBE
I am now the scribe
Eric: Looking to find Mesos experts
Evan: Looking for logs & dashboards
Zayna:SME
seeing a steady rise in crashes in Android app around trigger incident log entries
[3:24 PM]
JD:SME
No ILEs will be generated due to LES not being able to query web2kafka
[3:25 PM]
Eric: David, what have you looked at?
David: trolling logs, see errors
David: tried restarting, doesn’t help
[3:23 PM] Ken:DEPUTY
Notifications are still going out, subject lines are filled in but not email bodies (they use ILEs)
Renee:SME
Peter becomes the scribe
Discussing customer-visible impact of the incident
Ken is both deputy and scribe
![Page 11: Anatomy of a real-life incident -Alex Solomon, CTO and Co-Founder of PagerDuty](https://reader031.fdocuments.us/reader031/viewer/2022030307/58ec1a391a28ab4c508b4751/html5/thumbnails/11.jpg)
The Scribe
• The Record-keeper
• Add notes to the chatroom when findings are determined or significant actions are taken
• Add TODOs to the room that indicate follow-ups for later (generally after the incident)
• Monitor tasks assigned by the IC to other team members, remind the IC to follow up
![Page 12: Anatomy of a real-life incident -Alex Solomon, CTO and Co-Founder of PagerDuty](https://reader031.fdocuments.us/reader031/viewer/2022030307/58ec1a391a28ab4c508b4751/html5/thumbnails/12.jpg)
Renee:SME
Can’t expand incident details
Luke:CUST LIAISON
suggested tweet: `There is currently an issue affecting the incident log entries component of our web application causing the application to display errors. We are actively investigating.`
[3:29 PM]
David: No ILEs can be created
Renee: no incident details, error msg in the UI
[3:27 PM] Peter:SCRIBE
Eric: Comms rep on the phone? Luke
Eric to Luke: Please compose a tweet
Peter:SCRIBE
Eric: What’s the customer impact?
[3:26 PM] Peter:SCRIBE
Luke to tweet
Peter:SCRIBE
IC asked the customer liaison to write a message to customers
The message was sent out to customers
![Page 13: Anatomy of a real-life incident -Alex Solomon, CTO and Co-Founder of PagerDuty](https://reader031.fdocuments.us/reader031/viewer/2022030307/58ec1a391a28ab4c508b4751/html5/thumbnails/13.jpg)
The Communications Liaison
• The link to the customer
• Monitor customer and business impact
• Provide regular updates to customers (and/or to customer-facing folks in the business)
• (Optional) Provide regular updates to stakeholders
![Page 14: Anatomy of a real-life incident -Alex Solomon, CTO and Co-Founder of PagerDuty](https://reader031.fdocuments.us/reader031/viewer/2022030307/58ec1a391a28ab4c508b4751/html5/thumbnails/14.jpg)
Cees:SME
I’m away from any laptops, just arrived at a pub for dinner.
[3:36 PM]
@cees Would you join us on the bridge? We have a few Mesos questions
Eric:IC
Evan: might need to kick new hardware if system is actually unreachable.
Evan: slave01 is reachable
David: slave02 is not reachable.
David: slave03 is not reachable.
David: only 3 slaves for mesos
Eric: We are down to only one host
Evan: Seeing some stuff. Memory exhaustion.
[3:37 PM] Peter:SCRIBE
TODO: Create a runbook for mesos to stop the world and start again
Peter:SCRIBE
David added Cees to the incident
Eric: Is there a runbook for mesos?
David: Yes, but not for this issue.
[3:34 PM] Peter:SCRIBE
Scribe captured a TODO to record & remember a follow-up that should happen after the incident is resolved
We paged a Mesos expert who is not on-call
The Mesos expert joined the chat
![Page 15: Anatomy of a real-life incident -Alex Solomon, CTO and Co-Founder of PagerDuty](https://reader031.fdocuments.us/reader031/viewer/2022030307/58ec1a391a28ab4c508b4751/html5/thumbnails/15.jpg)
David: Only 3 slaves in that cluster, we have another cluster in us-west-1
Eric: Two options: kick more slaves or restart marathon
[3:38 PM] Peter:SCRIBE
Evan: OOM killer has kicked in on slave01
[3:39 PM] Peter:SCRIBE
Eric: Stop slaves in west2, startup web2kafka in west1
Evan: slave02 is alive!
Eric: Waiting 2 minutes
[3:47 PM] Peter:SCRIBE
David: Consider bringing up another cluster?
Cees: Should be trivial
[3:44 PM] Peter:SCRIBE
Eric to evan: please reboot slave02 and slave03
[3:41 PM] Peter:SCRIBE
Restart slaves first
Cees:SME
slave01 is now down
[3:42 PM] Evan:SME
They are considering bringing up another Mesos cluster in west1
slave02 is back up after reboot, so they hold off on flipping to west1
Noticed that oom-killer killed the docker process on slave01
![Page 16: Anatomy of a real-life incident -Alex Solomon, CTO and Co-Founder of PagerDuty](https://reader031.fdocuments.us/reader031/viewer/2022030307/58ec1a391a28ab4c508b4751/html5/thumbnails/16.jpg)
Evan: Slave02 is quiet.
Evan: Slave02 is trying to start, exiting with code 137
[3:49 PM] Peter:SCRIBE
Evan: Slave02 is quiet.
Evan: 137 means it’s being killed by OOM, OOM is killing docker containers continuously
Peter:SCRIBE
[3:53 PM] Proposed Action: David is going to configure marathon to allow more memory
Peter:SCRIBE
[3:54 PM] Proposed Action: Evan to force reboot slave01
Peter:SCRIBE
[3:56 PM] David: Web2kafka appears to be running
Eric: Looks like all things are running
Renee: Things are fine with notifications
JD: LES is seeing progress
Peter:SCRIBE
[3:55 PM] Customer impact: there are 4 tickets so far and 2 customers chatting with us, which is another 2 tickets
Luke:CUST LIAISON
They realized the problem: oom-killer is killing the docker containers over and over
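The exit code 137 that Evan reported follows the common convention that a process killed by a signal exits with 128 plus the signal number: 137 - 128 = 9, which is SIGKILL, the signal the oom-killer sends. A minimal sketch of that arithmetic follows, assuming a POSIX system. The code alone only shows that SIGKILL was delivered; confirming that the oom-killer sent it still requires the kernel log.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret a container or shell exit code using the 128+N convention."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {code}"


if __name__ == "__main__":
    print(describe_exit_code(137))
```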
The resolution action was to redeploy web2kafka with a higher cgroup/Docker memory limit: 2 GB (vs. 512 MB before)
The customer liaison provided an update on the customer impact
The system is recovering
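The resolution noted above (redeploying web2kafka with a 2 GB memory limit instead of 512 MB) would, in Marathon, typically come down to changing the `mem` field of the app definition, which Marathon expresses in MB. The sketch below is illustrative only: the app id, instance count, CPU share, and image name are hypothetical, and the slides do not show PagerDuty's actual definition.

```python
import json


def web2kafka_app(mem_mb: int) -> dict:
    """Return a minimal Marathon app definition with the given memory limit."""
    return {
        "id": "/web2kafka",   # hypothetical app id
        "instances": 1,       # hypothetical
        "cpus": 1.0,          # hypothetical
        "mem": mem_mb,        # memory limit in MB, enforced via the container cgroup
        "container": {
            "type": "DOCKER",
            "docker": {"image": "example/web2kafka:latest"},  # hypothetical image
        },
    }


if __name__ == "__main__":
    before = web2kafka_app(512)
    after = web2kafka_app(2048)
    print(f"mem: {before['mem']} MB -> {after['mem']} MB")
    # The updated definition would then be submitted to Marathon,
    # for example with PUT /v2/apps/web2kafka.
    print(json.dumps(after, indent=2))
```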
![Page 17: Anatomy of a real-life incident -Alex Solomon, CTO and Co-Founder of PagerDuty](https://reader031.fdocuments.us/reader031/viewer/2022030307/58ec1a391a28ab4c508b4751/html5/thumbnails/17.jpg)
The Punchline
• Root cause
• Increase in traffic caused web2kafka to increase its memory usage
• This caused the Linux oom-killer to kill the process
• Then Mesos / Marathon immediately restarted it, its memory ramped up again, oom-killer killed it again, and so on.
• After doing this restart-kill cycle multiple times, we hit a race-condition bug in the Linux kernel causing a kernel panic and killing the host
• Other services running on the host were impacted, notably LES
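The restart-kill cycle described above can be illustrated with a toy model. This is a sketch under stated assumptions, not a reconstruction of the real workload: only the 512 MB and 2 GB limits come from the talk, while the steady-state memory, growth rate, and tick count are made-up numbers. The kernel panic that eventually took down the host is not modelled.

```python
def count_oom_kills(limit_mb: int, steady_state_mb: int, growth_mb: int, ticks: int) -> int:
    """Count how often a process is OOM-killed and restarted within `ticks` steps.

    The process starts at 0 MB, grows by `growth_mb` per tick until it reaches
    `steady_state_mb`, and is killed (then restarted from 0 MB) whenever its
    usage exceeds `limit_mb`.
    """
    kills = 0
    used = 0
    for _ in range(ticks):
        used = min(used + growth_mb, steady_state_mb)
        if used > limit_mb:
            kills += 1  # oom-killer terminates the process
            used = 0    # the scheduler restarts it immediately
    return kills


if __name__ == "__main__":
    # Hypothetical workload: needs 1500 MB at steady state, grows 200 MB per tick.
    for limit in (512, 2048):
        kills = count_oom_kills(limit, steady_state_mb=1500, growth_mb=200, ticks=30)
        print(f"limit {limit} MB: {kills} kills")
```

With a limit below the steady-state need, the process never stabilises and is killed repeatedly; with a limit above it, the loop stops. Marathon app definitions also have launch backoff settings (`backoffSeconds`, `backoffFactor`) that slow such loops down; the slides do not say how these were configured.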
![Page 18: Anatomy of a real-life incident -Alex Solomon, CTO and Co-Founder of PagerDuty](https://reader031.fdocuments.us/reader031/viewer/2022030307/58ec1a391a28ab4c508b4751/html5/thumbnails/18.jpg)
Summary
• Incident Command
• The most important role, crucial to fast decision making and action!
• Takes practice and experience
• Deputy
• The right-hand person for the IC, can step in and take over Incident Command for long-running incidents
• Responsible for business & exec stakeholder communications, allowing the technical team to focus on incident resolution
• Scribe
• Essential for providing context in the chatroom and tracking follow-ups & action items (for example, the IC saying “Evan, do X, report back in 5 min”)
• Produces step-by-step documentation, which is very helpful for constructing the timeline later (in the post-mortem)
• Communications liaison
• Essential for tracking customer impact and communicating status to customers
![Page 19: Anatomy of a real-life incident -Alex Solomon, CTO and Co-Founder of PagerDuty](https://reader031.fdocuments.us/reader031/viewer/2022030307/58ec1a391a28ab4c508b4751/html5/thumbnails/19.jpg)
The End
Alex Solomon
CTO & Co-Founder @ PagerDuty
The PagerDuty Incident Response process and training is open-source: https://response.pagerduty.com