
We are using incident.io to manage incidents and on call. It is simple to operate and is driven primarily through the /inc command in Slack, which is used to trigger and manage incidents. The web UI offers the same functionality, as well as settings and reporting.

You can log in to incident.io via Slack and browse the existing incidents, or trigger a test incident to understand the workflow and the tool.

What is an incident

An incident is defined as an unplanned event that reduces the quality of service or interrupts service delivery to the end users of our product. An incident can also include events that have the potential to cause these disruptions.

Incidents may be caused by outages, which are periods of service unavailability. However, incidents are not limited to outages and can also result from:

  • Software bugs or errors
  • Network issues
  • Human errors
  • Security breaches
  • External factors such as natural disasters or third-party service failures

Not every event qualifies as an incident, so it is important to understand when to raise one.

Characteristics of an incident

  • Each incident must have an Incident Lead to manage it
  • Incidents require an immediate and organized response
  • Incidents can be escalated if they are too big to handle alone
  • Incidents can have multiple root causes
  • Incidents can cause revenue loss, data damage, security breaches, and more
  • Incidents should be short-lived and not last more than a working day (unless there's a disaster)

Incident Severities

We are using the standard out-of-the-box severities in incident.io. We categorize them as follows:

  • Critical - Service-impairing and requires immediate attention
  • Major - Affects only part of the system or a small subset of users
  • Minor - Affects a non-critical part or a small amount of the system

When to declare an incident

Incidents require an immediate and organized response: a short, intense period that takes people's full attention. This makes them costly to declare and manage. When considering declaring an incident, first consider its impact and whether you need to consult others before declaring it.

However, if you are concerned and not sure, it's always better to err on the side of caution and declare an incident.

Incident lifecycle

  • Incident Declared
  • Triage - Triage incidents give you a space to investigate potential issues before either accepting them as active incidents, or declining them as false positives.
  • Active - An incident is active when it has an ongoing impact and responders are working towards a resolution.
  • Post Incident - Use a post-incident flow to help responders learn from incidents, and do other clean-up tasks once the incident is resolved.
  • Closed
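
As a rough illustration of the lifecycle above, the stages can be modelled as a small state machine. This is only a sketch and not how incident.io implements it: the names `Status`, `ALLOWED` and `can_transition` are hypothetical, the `DECLINED` state is an assumption drawn from the "declining them as false positives" wording, and the strictly linear ordering of the other stages is also assumed.

```python
from enum import Enum


class Status(Enum):
    DECLARED = "Incident Declared"
    TRIAGE = "Triage"
    ACTIVE = "Active"
    POST_INCIDENT = "Post Incident"
    CLOSED = "Closed"
    # Assumption: a triage incident declined as a false positive ends here.
    DECLINED = "Declined"


# Assumed transitions, following the order of the list above.
ALLOWED = {
    Status.DECLARED: {Status.TRIAGE},
    Status.TRIAGE: {Status.ACTIVE, Status.DECLINED},
    Status.ACTIVE: {Status.POST_INCIDENT},
    Status.POST_INCIDENT: {Status.CLOSED},
    Status.CLOSED: set(),
    Status.DECLINED: set(),
}


def can_transition(current: Status, target: Status) -> bool:
    """Return True if moving from `current` to `target` follows the lifecycle."""
    return target in ALLOWED[current]
```

For example, `can_transition(Status.TRIAGE, Status.ACTIVE)` is allowed, while moving a closed incident back to active is not.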

Post Mortems

We follow the principles of a blameless postmortem, and roughly follow the Google SRE book's definition of a postmortem.

Our template is defined in incident.io but is based on Google's SRE template.

Followups

These are defined in incident.io and turned into Jira tickets.

On call

Currently we have on call set up to alert Will & Plato if any incident is marked as critical. There is no SLA on this; it is a best-effort service, but they will most likely respond very quickly.
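
The paging rule described above amounts to a single condition on severity. The sketch below is purely illustrative: the real rule is configured inside incident.io, and `ON_CALL` and `responders_to_page` are hypothetical names, not part of any incident.io API.

```python
# Hypothetical illustration of the on-call rule; the actual rule lives in
# the incident.io configuration, not in code we maintain.
ON_CALL = ("Will", "Plato")


def responders_to_page(severity: str) -> tuple:
    """Return who is alerted for an incident of the given severity."""
    # Only critical incidents page the on-call responders (best effort, no SLA).
    if severity.strip().lower() == "critical":
        return ON_CALL
    return ()
```

Major and minor incidents return an empty tuple, meaning nobody is paged automatically.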

Recommended reading

O'Reilly: Anatomy of an Incident

Atlassian's resources

Google SRE book