Incident Management

A good incident management process is fast and predictable. It quickly turns detection into response, escalates to the right people, makes communications clear, and keeps customers in the loop.

Table of contents

  1. A system crash story
  2. What is an incident
  3. Who are the stakeholders
  4. The importance of having a reliable system
  5. Detecting an incident
  6. Delivery channels
  7. Post-mortems
  8. Keeping track of the solution

1. A system crash story

2. What is an incident

3. Who are the stakeholders

Hypothetical scenario

  • High power, highly interested people — Manage Closely
  • High power, less interested people — Keep Satisfied
  • Low power, highly interested people — Keep Informed
  • Low power, less interested people — Monitor

  • Management
  • Product Team
  • Customers
  • Developers
  • Support Analysts
  • Other folks at the same company
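
As a rough illustration, the four quadrants of the power/interest grid above can be written as a small lookup. This is only a sketch: the names `Strategy`, `engagement_strategy`, and `hypothetical_positions` are made up for this example, and where each stakeholder group sits on the grid is an assumption that will differ between companies and incidents.

```python
from enum import Enum


class Strategy(Enum):
    """The four engagement strategies from the power/interest grid."""
    MANAGE_CLOSELY = "Manage Closely"
    KEEP_SATISFIED = "Keep Satisfied"
    KEEP_INFORMED = "Keep Informed"
    MONITOR = "Monitor"


def engagement_strategy(high_power: bool, high_interest: bool) -> Strategy:
    """Map a position on the power/interest grid to a strategy."""
    if high_power and high_interest:
        return Strategy.MANAGE_CLOSELY
    if high_power:
        return Strategy.KEEP_SATISFIED
    if high_interest:
        return Strategy.KEEP_INFORMED
    return Strategy.MONITOR


# Hypothetical (high_power, high_interest) placement for one imaginary incident.
# These values are illustrative assumptions, not taken from the article.
hypothetical_positions = {
    "Management": (True, True),
    "Product Team": (True, False),
    "Customers": (False, True),
    "Other folks at the same company": (False, False),
}

# Build a simple communication plan: stakeholder name -> strategy label.
plan = {
    name: engagement_strategy(*position).value
    for name, position in hypothetical_positions.items()
}
```

Writing the grid down like this, even informally, makes it easier to agree before an incident on who gets which kind of update.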

4. The importance of having a reliable system

Monitoring and Alerting

  • Dynatrace
  • New Relic APM
  • Prometheus
  • Zabbix
  • Nagios XI
  • Grafana
  • Datadog

  • Opsgenie
  • xMatters
  • PagerDuty
  • New Relic Alerts

Good coding process

Commit messages

Simplify serialize.h's exception handling

Remove the 'state' and 'exceptmask' from serialize.h's stream implementations, as well as related methods.

As exceptmask always included 'failbit', and setstate was always called with bits = failbit, all it did was immediately raise an exception. Get rid of those variables, and replace the setstate with direct exception throwing (which also removes some dead code).

As a result, good() is never reached after a failure (there are only 2 calls, one of which is in tests), and can just be replaced by !eof().

fail(), clear(n) and exceptions() are just never called. Delete them.

Closes #123
See also #456

Staging environment

Infrastructure as code (IaC)

5. Detecting an incident

6. Delivery channels

Keep customers (stakeholders) informed about what’s going on

  • A dedicated status page
  • Embedded status
  • Email
  • Workplace chat tool
  • Social media
  • SMS

Incident Manager

  1. Acknowledge the problem — Check with engineering whether the problem is real and what its impact is;
  2. Update the status page to keep customers informed. Your customers should already have access to this page. If they don't, send them an email. Explain that you know something is wrong and that the team is working on a fix;
  3. If necessary, send internal emails with the status page link to any managers who need to know about the issue;
  4. Keep the page updated with every finding — Use Slack channels to keep communication flowing between teams;
  5. When the incident is resolved, start writing the post-mortem.
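
The steps above can be sketched as a tiny in-memory model of an incident and its status updates. This is a minimal sketch under stated assumptions: `Incident`, `StatusUpdate`, and their methods are hypothetical names invented for this example, and a real setup would publish to an actual status page tool instead of appending to a list.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class StatusUpdate:
    """One entry that would be published on the status page."""
    timestamp: datetime
    message: str


@dataclass
class Incident:
    """Hypothetical in-memory incident record kept by the incident manager."""
    title: str
    acknowledged: bool = False
    resolved: bool = False
    updates: list[StatusUpdate] = field(default_factory=list)

    def acknowledge(self, impact: str) -> None:
        # Step 1 and 2: confirm the problem, then tell customers you know about it.
        self.acknowledged = True
        self.post_update(
            f"We are aware of an issue: {impact}. The team is working on a fix."
        )

    def post_update(self, message: str) -> None:
        # Step 4: every new finding becomes a timestamped status update.
        if not self.acknowledged:
            raise RuntimeError("Acknowledge the incident before posting updates")
        if self.resolved:
            raise RuntimeError("Incident is already resolved")
        self.updates.append(StatusUpdate(datetime.now(timezone.utc), message))

    def resolve(self, message: str) -> list[StatusUpdate]:
        # Step 5: close the incident and hand the updates over as raw
        # material for the post-mortem timeline.
        self.post_update(message)
        self.resolved = True
        return list(self.updates)
```

A usage example: call `acknowledge("some customers cannot check out")`, then `post_update(...)` for each finding, and finally `resolve(...)`, which returns the collected updates in the order they were posted.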

7. Post-mortems

  • Acknowledge the problem, empathize with those affected and apologize
  • Explain what went wrong and why
  • Explain what was done to fix the incident and what was done to prevent repeat incidents
  • Acknowledge, empathize, and apologize once again

Post-mortems should have a special place

Documenting facts

Merging to a final document

  • Summary — Overview of what happened
  • Impact — Who was impacted and how much they were impacted
  • Root Cause — Description of the root cause
  • Resolution — Description of what solved the problem. If it was a temporary fix, describe the long-term solution
  • Timeline — The chronological record of facts, in the format described in the previous subsection
  • Action Items — List of what should be done to prevent it from happening again. Mostly related to the Root Cause and the Resolution.
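
One way to keep every post-mortem in the same shape is to treat the six sections above as a small template. The sketch below assumes a plain-text output; the `PostMortem` class and its `render` method are hypothetical names for this example, not part of any specific tool.

```python
from dataclasses import dataclass, field


@dataclass
class PostMortem:
    """Hypothetical template holding the six sections of the final document."""
    summary: str
    impact: str
    root_cause: str
    resolution: str
    timeline: list[str] = field(default_factory=list)
    action_items: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Return the document as plain text, one titled section per block."""
        sections = [
            ("Summary", self.summary),
            ("Impact", self.impact),
            ("Root Cause", self.root_cause),
            ("Resolution", self.resolution),
            ("Timeline", "\n".join(f"- {entry}" for entry in self.timeline)),
            ("Action Items", "\n".join(f"- {item}" for item in self.action_items)),
        ]
        # Sections are separated by a blank line, in a fixed order.
        return "\n\n".join(f"{title}\n{body}" for title, body in sections)
```

Because the section order is fixed in one place, readers always find the impact and the action items in the same spot, whoever wrote the document.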

8. Keeping track of the solution

Recap

  1. Have monitoring and alerts for your systems
  2. Have CI/CD
  3. Have staging and, if possible, preview environments
  4. Have a code review process
  5. Have a status page
  6. Know your customers
  7. Have an incident manager responsible for communications during incidents — Note that this is a role, not a job title. Anyone can act as incident manager alongside their regular job.
  8. Write post-mortems
  9. Communicate more
