post-mortem best practices - yow! conferences... · the post-mortem process 1. pick an owner for...

Post-Mortem Best Practices

Alex Solomon, CTO & Co-Founder @ PagerDuty

THIS IS A TRUE STORY

The following events took place in San Francisco and Toronto in January, 2017

In the interest of brevity, some details have been omitted

The Services

Incident Log Entries Service: stores log entries for incidents

Web2Kafka Service

Kafka message bus

Web (monolith)

Other services

Platform: Docker, Mesos, marathon, Linux

Hosts: slave01, slave02, slave03

The Incident

[2:57 PM] Infra team gets paged: web2kafka service is down

Incident Log Entries service also impacted (cannot generate log entries without web2kafka)

Notifications are still going out, not impacted

[3:20 PM] Infra on-call escalates the issue to a major incident

This pages the Incident Commander primary, Incident Commander backup, plus the on-calls for several other teams

[3:22 PM] Discussion: team discusses impact of incident and affected services

[3:36 PM] Team loops in a mesos subject-matter expert for additional help (this person is not actually on-call)

[3:37 PM] Assessment: 2 out of 3 hosts in the mesos cluster are down (slave02 and slave03 are down, slave01 is up)

[3:41 PM] Action: Reboot the 2 slaves that are down

[3:42 PM] Assessment: Now slave01 is down

[3:44 PM] Discussion of potential remediation:

Proposal is to bring up a new mesos cluster in US-west1, then flip to this new cluster

Mesos subject-matter expert is optimistic this will work [even though we’ve never practiced it before]

[3:47 PM] Slave02 is back up

Decision: wait before attempting to flip to secondary cluster

[3:49 PM] SRE sees on slave02 that web2kafka was trying to start but exited with code 137 (137 = 128 + 9, meaning the process was killed with SIGKILL; here the sender was the OOM-killer, the Linux kernel's out-of-memory handler)

[3:53 PM] Action: configure marathon to allow more memory for docker containers - limit is increased from 512MB to 2GB

[3:56 PM] Web2kafka is now running and processing through its backlog

[4:05 PM] Web2kafka has cleared the backlog; the Log Entries Service is caught up as well

[4:11 PM] The incident is resolved
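The exit code 137 reported at 3:49 PM follows the common shell and container convention of 128 + signal number. A minimal sketch of decoding such a code is below; the function name is made up for illustration, and a SIGKILL on its own does not prove an OOM kill, so the kernel log still needs to confirm it.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain an exit status using the 128 + N convention for signals."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited on its own with status {code}"


print(describe_exit_code(137))
```

SIGKILL can also come from an operator or an orchestrator, which is why the team checked for OOM-killer activity on the host.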

The Punchline

• Root cause

• Increase in traffic caused web2kafka to increase its memory usage

• This caused the Linux oom-killer to kill the process

• Then, mesos / marathon immediately restarted it, it ramped up memory again, oom-killer killed it, and so on.

• After doing this restart-kill cycle multiple times, we hit a race-condition bug in the Linux kernel causing a kernel panic and killing the host

• Other services running on the host were impacted, notably the Log Entries Service

The Post-Mortem

The Post-Mortem Process

1. Pick an owner for the post-mortem

The on-call responder on the Infra team became the owner

2. Schedule the post-mortem meeting

Scheduled for Jan 11 (incident happened on Jan 6)

3. Conduct a thorough analysis of the incident

4. Create a draft of the post-mortem report

5. Conduct the post-mortem meeting

6. Finalize the post-mortem report based on the discussion in the meeting

Recommended: aim for a fast turnaround on the post-mortem, with an SLA of 3-5 days

One owner only; the owner can pull in additional resources as needed

What’s Included in a Post-Mortem Report

These components are written by the owner ahead of the PM meeting:

• Description of what happened + timeline

• Customer and business impact

• Root cause / contributing factors

• How we resolved the incident

These components are discussed and created in the PM meeting:

• What went well, what didn’t go so well

• Action items

The Timeline

Time (UTC)  Event
22:50  Spike in ile_groups_requests traffic
22:50:26  First instance of OOM-killer killing web2kafka
22:57  First page to Core on-call (David)
22:57  David acknowledges the page
23:09  slave03 freezes
23:12  Incident manually escalated to Core Secondary (Joseph)
23:18  Incident manually assigned to Cees
23:20  SEV-2 incident manually triggered
23:25  Ken starts trying to reach Cees (not on-call)
23:26  Discussion about customer impact
23:29  'Investigating issue' tweet from support account
23:29  Ken Rose starts trying to reach Div (not on-call)
23:30:39  slave02 freezes
23:34  Evan says that slave03 is unreachable
23:35  Cees joins call
23:37:00  Evan says slave01 is reachable, 02 and 03 are not. David confirms.
23:39  Evan says OOM killer has kicked in on slave01
23:39  Rich says slave02 is failing status check in AWS

Time-To-Detect of 7min

Time-To-Acknowledge <1min

Incident escalated to SEV-2 in 23min

David tried to get help from secondary and from Cees
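The response metrics called out on the timeline can be recomputed from its timestamps. The sketch below assumes time-to-detect runs from the first OOM kill (22:50) to the first page (22:57), and that the SEV-2 escalation time is measured from the first page to the manual trigger (23:20); the helper name is hypothetical, and it only handles times on the same day.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> float:
    """Minutes from start to end, both given as HH:MM on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 60


first_oom_kill = "22:50"
first_page = "22:57"
sev2_triggered = "23:20"

time_to_detect = minutes_between(first_oom_kill, first_page)
time_to_escalate = minutes_between(first_page, sev2_triggered)

print(f"Time-To-Detect: {time_to_detect:.0f} min")
print(f"Escalated to SEV-2 after: {time_to_escalate:.0f} min")
```

Deriving the numbers from the timeline rather than from memory keeps the report consistent with its own evidence.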

Time (UTC)  Event
23:41:00  Eric requests Evan to reboot slave02 and slave03
23:41  slave01 is frozen. All slaves now frozen.
23:42  Evan says slave01 is now unreachable as well
23:43  Discussion about Chef changes affecting Mesos, and discussion about bringing web2kafka up in another Mesos
23:46:28  slave02 is back up
23:46:37  slave03 is back up
23:49  Evan reports slave02 failing to start containers, they are exiting with code 137
23:51  Evan reports code 137 is "killed by OOM"
23:53  start web2kafka with 2 GB allowed memory, David takes action
23:53  force reboot slave01, Evan takes action
23:53:43  Last instance of OOM-killer killing web2kafka
23:56  David confirms web2kafka running. Discussion about backlog, other services catching up.
23:59:08  slave01 is back up
0:05  David reports web2kafka has cleared backlog
0:10  Recovery tweet posted
0:11  Other services have caught up / are catching up fine. Call closed.

Issue is diagnosed

Remediation actions are taken

Issue is recovering

Issue is fully recovered

Jan 6 22:50:58 prod-mesosusw2-slave01 kernel: [5462348.745281] beam.smp invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
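When building a timeline, kernel log lines like the one above can be picked apart mechanically to find which process triggered the OOM-killer on which host. The following is a rough sketch using a regular expression written for this one line's layout; the pattern and function name are illustrative assumptions, and other syslog formats would need a different pattern.

```python
import re

# Pattern fitted to the syslog-style line shown above; not a general parser.
OOM_LINE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) kernel: \[\s*(?P<uptime>\d+\.\d+)\] "
    r"(?P<process>\S+) invoked oom-killer"
)


def parse_oom_line(line: str):
    """Return timestamp, host, uptime and process, or None if no match."""
    match = OOM_LINE.match(line)
    return match.groupdict() if match else None


line = (
    "Jan 6 22:50:58 prod-mesosusw2-slave01 kernel: [5462348.745281] "
    "beam.smp invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0"
)
event = parse_oom_line(line)
print(event["host"], event["process"], event["timestamp"])
```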

What Happened, Root Cause, Resolution

Root Cause / Contributing Factors

Determined to be an increase in traffic, or possibly larger records, causing web2kafka to increase its memory usage. It was exacerbated by the Linux kernel bug causing Mesos slaves to freeze up.

What happened

Resolution

The temporary resolution was to increase web2kafka's cgroup/Docker memory limit to 2 GB from 512 MB. Longer-term, we should investigate why this memory usage increased.
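Investigating memory growth starts with seeing how close a container runs to its limit. A minimal sketch is below, assuming cgroup v1, where the memory controller exposes `memory.usage_in_bytes` and `memory.limit_in_bytes` as files in the container's cgroup directory. The directory location depends on the deployment, so the demo uses a temporary directory with made-up values rather than a real path.

```python
import tempfile
from pathlib import Path


def memory_usage_ratio(cgroup_dir: str) -> float:
    """Fraction of the cgroup memory limit currently in use (cgroup v1 files)."""
    base = Path(cgroup_dir)
    usage = int((base / "memory.usage_in_bytes").read_text().strip())
    limit = int((base / "memory.limit_in_bytes").read_text().strip())
    return usage / limit


# Demo: a throwaway directory stands in for a real cgroup path.
# The values are invented: 480 MB in use against a 512 MB limit.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "memory.usage_in_bytes").write_text(str(480 * 1024 * 1024))
    Path(tmp, "memory.limit_in_bytes").write_text(str(512 * 1024 * 1024))
    ratio = memory_usage_ratio(tmp)

print(f"{ratio:.0%} of the memory limit in use")
```

Alerting on this ratio before it reaches 100% would give a warning ahead of the first OOM kill rather than after it.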

Impact

Time in SEV-2: 1h 9m

Time in SEV-1: 0

Notifications delivered out of SLA: 0

Customers experiencing 500s on incident details page: 1%

Customers experiencing error message on /incidents page when clicking “Show Details” link: 2%

Customers experiencing errors in Android app: 1%

Support requests raised: 8

The Post-Mortem Meeting

• Goal is shared learning

• Provides the learning feedback loop

• Can lead to improvements in a variety of areas:

• Tech improvements to production software

• Tooling improvements

• People improvements - training, knowledge sharing

• Process improvements

The Post-Mortem Meeting Agenda

1. Go through the timeline

2. Ensure everyone agrees the timeline is accurate and comprehensive

3. Talk about what the group has learned from going through the timeline (brainstorming format). This can include but is not limited to action items.

4. Talk about what went well and what didn’t go so well; capture this in the post-mortem report

5. Create the list of action items

What Went Well, What Didn’t Go So Well

What went well:

• Non on-call Infra team members were able to help resolve issue

• SRE on-calls were very helpful in identifying issues and restarting boxes

• Notifications still went out, with less detailed bodies

What didn’t go so well:

• Infra primary on-call was unsure about when to initiate SEV-2 incident response due to lack of web2kafka SLA

• Failing web2kafka caused cascading failures in other Mesos-hosted applications

• Various UI components and the Android app did not handle missing log entries well (lack of graceful degradation)

Action Items

• Action items need to be: specific, measurable, achievable, relevant, time-bound

• DO NOT boil the ocean

• Should try to create a variety of action items, as needed

• Tech improvements to production systems

• Tooling improvements: monitoring, deploy, testing, etc.

• Process improvements (ex. make sure you always have a 2nd level on-call person, make sure each new service has a Service Operations Guide and well-defined SLAs)

• Improvements in knowledge sharing: write or update docs, create runbooks, do on-call shadowing, etc.

The Action Items

Summary | Created | Status | Resolution
Android app shouldn't crash when no log entries are viewable. | 7-Jan-17 | DONE | Done
Ensure all UI/API components degrade gracefully when ILE / ALEs are missing | 11-Jan-17 | DONE | Done
Reproduce cgroup oom-killer race condition in Mesos, verify does not exist with 3.14 | 11-Jan-17 | DONE | Done
Update web2kafka ops guide with how-to-start-without-mesos instructions | 11-Jan-17 | DONE | Done
Exponentially backoff web2kafka restart time | 11-Jan-17 | DONE | Done
Investigate and fix possible unbounded memory usage in web2kafka | 11-Jan-17 | DONE | Done
Emit Mesos apps cgroup's memory.usage_in_bytes in Datadog | 11-Jan-17 | OPEN | Unresolved
Upgrade Mesos slave boxes to kernel >= 3.14 | 11-Jan-17 | DONE | Won't Do
Update operations guide and define SLA for Web2Kafka | 11-Jan-17 | DONE | Done
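The "exponentially backoff web2kafka restart time" item targets the immediate restart-kill loop that preceded the kernel panic. A sketch of the delay calculation is below; the function and parameter names and their default values are hypothetical illustrations, not the settings the team actually applied.

```python
def restart_delay(
    attempt: int,
    base_seconds: float = 1.0,
    factor: float = 2.0,
    max_seconds: float = 3600.0,
) -> float:
    """Seconds to wait before restart number `attempt` (0-based), capped."""
    return min(base_seconds * factor ** attempt, max_seconds)


# The first few restarts are quick; repeated failures wait progressively longer.
for attempt in range(6):
    print(attempt, restart_delay(attempt))
```

With a cap in place, a persistently failing service settles into infrequent retries instead of being killed and relaunched many times per minute.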

Blameless Post-Mortems

• Conservatism: reduced agility and velocity

• Lack of transparency

• More failures

• Best engineers leave

“You can’t fire your way to reliability”

The blame cycle (each step leads to the next, then repeats):

1. Eng does something that results in an outage

2. Management blames eng

3. Reduced trust b/w eng and mgmt

4. Eng won’t speak up (CYA)

5. Mgmt is less informed on how work is done. Eng becomes less educated on lurking failure conditions

6. Errors and outages become more likely

The Public Post-Mortem

Posted on our status page: https://status.pagerduty.com/incidents/510k1bnvwv6g

• Create a higher-level version of the PM for public consumption

• 3 sections

• Summary

• What happened?

• What are we doing about this?

• Apologize sincerely


The PagerDuty Incident Response process and training materials are open-source


https://response.pagerduty.com

Thank You

Alex Solomon

CTO & Co-Founder @ PagerDuty
