Is it really an incident?

Kurt Andersen, SRE Architect, Blameless
 

When you declare an event to be an incident, you’re making a judgment call that carries trade-offs. Declaring an incident starts your organization’s incident response processes, which help restore your services effectively and ensure learning and systemic change. At the same time, these processes commit time, effort, and resources that may or may not be necessary. To get the most out of your response processes, it’s important to understand what an incident actually is and what it means to declare one.

Not all incidents are created equal

But how do you decide what qualifies as an incident? At first glance, people tend to think of incidents as cut-and-dried, relatively objective occurrences. Look closely, though, and you find that incidents are highly varied, frequently require unique handling, and can defy clear answers to questions as seemingly simple as when they started.

Everybody wants systems that are clearly and objectively either working or broken. That delineation makes systems easy to understand and talk about. But such systems are a vanishingly small fraction of what reliability engineers have to manage.

Not only do we have to deal with the ambiguity of gray failure in distributed systems, where a service is neither fully working nor fully broken, but we also have organizationally defined constraints that bound the expected use cases for our systems. If a user is attempting to do something to (or with) your system using an unsupported browser, is that a problem? What if the user is trying to do something “too much” or “too fast”? What defines these boundaries, such as the supported browser versions or the rate and performance limits? They might be policies within your organization, agreed-upon standards in the industry, or the limits of what service you can realistically provide. Arriving at a single definition of “normal usage” can be very complex.

So, what is an incident?

ITIL defines an incident as an “unplanned interruption to or quality reduction of an IT service.” This definition isn’t very helpful on its own, because we are now in an always-on, 24/7 world of gray failure, where there is always some degree of quality reduction. The key for service owners is to know when that degradation is too much. It’s more helpful to think about how incidents function in your organization.

Declaring an incident is a call for help. It’s a recognition that business-as-usual plans need to be changed because of some significant deviation from normal. This call for help might be just on an individual level, with the on-call engineer having to divert their attention from other planned work or a previously scheduled meeting. It could involve a few people for a slightly more involved incident, or it could involve multiple teams to respond to a major incident.

The other main purpose of declaring an incident is for public awareness. Sometimes the incident responder does not necessarily need additional help, but it’s important that other people avoid falling into the same trap or refrain from actions that could make a situation worse. If the deployment system has begun misbehaving, then it may be very important to avoid attempting additional deployments until the system can be corrected or caught up with a backlog of work.

There is another aspect of incident declaration related to the idea of public awareness; an incident declaration can act as a signpost for the future—so that others can learn from it later. One of the tenets of the Safety-II community is that the work to maintain normal operational performance is critical to creating a system’s resilience to disruptions. If incident retrospectives help to inform an organization about weaknesses and risks in their services, then they serve a role similar to that of “Danger” signs posted near the edge of a cliff—even if nobody (or no system) has fallen off of the cliff yet. From this point of view, it can even be beneficial to declare incidents retroactively so that the occurrences can be cataloged for future benefit.

When you are next considering declaring an incident, think about how these aspects can help your team and your customers respond most effectively. At the declaration stage, it’s not worth getting stuck on precisely defining when an incident began. Instead, consider the purposes of your system and whether the users, the support teams, or both will benefit from moving forward with the declaration. In any event, it is generally better to err on the side of caution and future learning value.
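The three purposes described above (a call for help, public awareness, and a signpost for the future) can be sketched as a small checklist. This is only an illustrative sketch, not a prescribed process: the `Event` fields and both function names are hypothetical, and a real organization would substitute its own criteria.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical summary of what a responder knows about an event."""
    needs_help: bool          # business-as-usual plans must change to respond
    others_should_know: bool  # others should avoid the trap or hold risky actions
    learning_value: bool      # worth cataloging for a later retrospective
    ongoing: bool = True      # False when the event is already over


def declaration_reasons(event: Event) -> list[str]:
    """Return which of the three purposes a declaration would serve."""
    reasons = []
    # Help and awareness only matter while the event is still in progress.
    if event.ongoing and event.needs_help:
        reasons.append("call for help")
    if event.ongoing and event.others_should_know:
        reasons.append("public awareness")
    # Learning value applies even retroactively.
    if event.learning_value:
        reasons.append("signpost for the future")
    return reasons


def should_declare(event: Event) -> bool:
    """Err on the side of declaring: any single purpose is enough."""
    return bool(declaration_reasons(event))
```

For instance, a misbehaving deployment system that the on-call engineer can handle alone would still yield `["public awareness"]` when `others_should_know` is true, and an event that is already over but worth cataloging would yield `["signpost for the future"]`, matching the case for retroactive declaration.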

An example incident with Fastly

For an example, let’s look at the incident with Fastly on June 8, 2021, using its own outage report. On May 12, Fastly deployed software that contained a bug: if a customer made certain configurations under certain circumstances, Fastly’s network could start returning errors to a large number of users. On June 8, the bug was triggered. A mere 10 minutes after the errors began, Fastly posted a notice on its status page declaring them to be an incident. Let’s look at how this declaration affected the course of the outage.

With the public declaration of an incident at 9:58 am, Fastly was able to let its customers (the service providers) and the customers of their customers (the service users) know that there was an ongoing problem. This notice served like a yellow “caution” flag at a race while the Fastly engineers investigated and mitigated the problem condition. This public awareness helped customers maintain reasonable expectations for when the problem would be fixed.

Internally, declaring an incident would have helped get the attention of both engineering and management, speeding the repair and guiding public messaging. It appears to have worked: Fastly recovered some of the impacted services within an hour, fully mitigated the incident by 12:35 pm, and posted its public statement that same day. By that evening, it had started deploying a more permanent fix. This call for help ensured that the right people were able to resolve the issue quickly.

Ultimately, the team at Fastly needed to do more than just restore the services; they needed to make systemic changes to prevent similar bugs from occurring again. The start of this process can be seen in the “Where do we go from here?” section of Fastly’s incident report. This signpost for the future will help its system become more resilient, and publicly sharing these plans will improve customer trust.
