I’ve seen some great discussions recently about moving away from a culture where incidents are a four-letter word. Some of the most prevalent — and best — advice on the subject encourages teams to declare more incidents and democratize who can declare incidents.
While that advice seems fairly straightforward, I’ve found that there are often cultural barriers that make it hard for teams to put it into practice.
Telling people to declare incidents doesn’t erase the fear that often comes with it. In my experience, a lot of small things, done well and consistently over time, are what ultimately amount to a positive incident management culture.
Our internal engineering team at FireHydrant has been building steps into our incident management program intended to increase psychological safety around incident declaration and management.
One small and rather simple step we took recently was to expand the scope of what we consider an incident and provide a safe, predictable place to investigate. It’s had an outsized, positive impact on the team, so I wanted to share the details in hopes it might help others.
We tend to think of incidents as those really, really bad — and embarrassingly public — moments where customers are frustrated, the organization is losing money and everything is, well, on fire. But in reality, there’s a spectrum of activities that we can classify as incidents. And the more we normalize lower-impact incidents, the more confidence and experience we build for Sev1 situations.
Our first step toward solving some of that fear around incidents was to ask the team to start thinking differently about how they define incidents. At its core, this was a cultural shift more than a process change. And as is the case with many cultural changes, it has been a gradual shift that folks on the team move toward as they continue to see the behavior modeled.
There wasn’t a grand plan to kick this off; it just kind of happened. Someone on the team declared an incident for a sharp increase in transient failures in our test suite. This doesn’t fit the classic definition of an incident for most teams, but it was a great way for us to capture context that had spread across multiple communication channels, understand the issue, implement a fix and prioritize work in the future to improve the dependability of our test suite. Over the course of the incident, we realized how much of this work is happening already — teams are just avoiding the label.
In the end, it came down to a memory leak in a specific version of Node.js resulting in test timeouts. Five people were involved over a few days, but despite being labeled an “incident,” it didn’t derail their daily work. If anything, it provided the kind of structure and space that lowers cognitive load rather than raising it. We used FireHydrant to provide structure and pull context about our system (recent changes, dashboards, etc.) into a shared space.
It felt good to give a name to the work that so often happens as a hard-to-prioritize distraction, so we started talking about it — a lot. We’re a small enough company that other teams wanted to try out the if-it-seems-weird approach to incident declaration. When our marketing team was getting error notices while trying to deploy changes to the site, someone declared an incident. Ten minutes of poking around in Netlify, Gatsby, GitHub and Contentful, and we discovered a permissions issue that was easy to fix and unblocked a full day’s worth of work.
We wanted to reinforce the behavioral change with technical foundations. How could we give people a safe way to investigate whether something actually was an incident without worrying about the implications that usually come with incident declaration, like alerting and distracting their coworkers?
We created a new severity type of “triage” with the simplest possible runbook condition: Create a Slack channel. This ensures that if something simply feels off, the engineer who spots the problem has a place to write down stream-of-consciousness or play-by-play notes and see what happens next. We can add charts we looked at, alerts we saw, a running history of everything we thought contributed to the problem (even red herrings), and then reference it later. If it becomes clear that there’s a larger issue at play, it’s easy to escalate and get the right people up to speed because the information is already documented in the channel.
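In our case, FireHydrant’s runbook step handles the channel creation, but the underlying automation is deliberately tiny. As a rough sketch, here’s what a “give me a scratch channel” helper could look like against Slack’s Web API; the token handling, naming convention, topic text and the `open_triage_channel` helper itself are illustrative assumptions, not how our product works under the hood.

```python
# A minimal sketch of the "simplest possible runbook" idea: when someone
# declares a triage-severity incident, all they really need is a Slack
# channel to write notes in. FireHydrant's runbook step does this for us;
# this code only illustrates the shape of that automation.
# Assumptions: a bot token in SLACK_BOT_TOKEN with permission to create
# channels, and a hypothetical triage-YYYYMMDD-HHMM naming convention.
import os
import time

from slack_sdk import WebClient
from slack_sdk.errors import SlackApiError

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])


def open_triage_channel(summary: str) -> str:
    """Create a scratch channel for stream-of-consciousness triage notes."""
    name = f"triage-{time.strftime('%Y%m%d-%H%M')}"
    try:
        resp = client.conversations_create(name=name)
        channel_id = resp["channel"]["id"]
        # The topic reminds everyone this may never become a "real" incident.
        client.conversations_setTopic(
            channel=channel_id,
            topic=f"Triage: {summary}. Escalate if it turns out to be bigger.",
        )
        return channel_id
    except SlackApiError as err:
        raise RuntimeError(f"Could not open triage channel: {err.response['error']}")
```

The point is that the only automation triage really needs is a place to write things down; anything heavier starts to reintroduce the friction we were trying to remove.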
If the severity doesn’t evolve, the record still provides valuable insight into the health of our systems. At a previous company, my CTO approached every problem with a notebook. He’d write down all his notes on an incident, then look back later with more context. Those notes might tell him that a major problem started six weeks back with a minor incident that took a back seat to work deemed higher priority at the time.
As more “triage” severity incidents are declared and resolved, it’s becoming clear that our team’s shared definition of an incident is changing. And with that redefinition, we’re seeing evidence that incidents are becoming less scary for everyone involved.
This redefinition has me thinking a lot about where we go next. It’s become more obvious to me lately that incidents are in the eye of the beholder. An engineer’s definition of an incident might be very different from that of someone on the customer support, sales or marketing team who is blocked from doing an aspect of their job, or from doing it without a lot of friction.
As we continue to evolve our incident definition, it makes sense for us to develop a deeper understanding of how our customers, internal and external, use our services. The truth is, every incident matters to someone. More on that to come as we continue to build and refine our internal incident management program at FireHydrant. And if you liked (or hated) what you read here, I’d love to hear from you. I’m dan@firehydrant.com.