Are You A Single Point of Failure?
I remember the first time I heard of the “hit by the bus” rule. I was talking to a colleague who mentioned a need to document something in case they were “hit by a bus.” This was all so that we — his co-workers — would know how to carry on without him. And also, presumably, so that he could have the peace of mind in knowing that we wouldn’t be bothering him in his hospital bed as he was recovering.
I was struck by a few things when I heard this. The first was that I hoped he followed this rule in his personal life and had his finances in order in case that unfortunate event were to occur. The second was that we’d probably get along just fine without him. Yes, this person did have a huge amount of knowledge about this thing. But it wasn’t so insurmountable that someone couldn’t learn how to do it on their own.
Whether you’re fond of using the “hit by the bus” language or the more optimistic “when they win the lottery” — both speak to workplaces where single points of failure exist.
Single Points of Failure
Let’s get a few definitional terms out of the way before we go further. A single point of failure is generally described as a thing in a system that, if it stops working, will cause the whole system to fail. The opposite of a single point of failure is redundancy. If the primary thing in the system goes down, the secondary thing will pick up and continue on. Redundancy is equated with resiliency. Single points of failure are equated with fragility.
Here’s a pop culture example. I was recently re-watching Rogue One: A Star Wars Story. One of the characters, Galen Erso, is hunted by the Empire because he is the only person in the galaxy who can complete the Death Star. In this example, Erso is a single point of failure to getting the Empire what they want: a weapon capable of blowing up planets. The Empire clearly learned from this mistake, although they made a few single point of failure design flaws that were exposed in future movies. Regardless, when the Death Star was recreated in films in the Star Wars timeline post-Rogue One, a single point of failure like Galen Erso did not exist.
In the context of product development, single points of failure aren’t always things — servers, manufacturing plants, economies. They’re oftentimes people. Here’s an example. John is a programmer. He designed and built a specific area of a code base. He’s been maintaining it for twenty years. Because of this, he’s viewed by his peers as the only one that can maintain it. In this same context and example, redundancy would be a team of people capable of maintaining this specific area of the code base.
How Single Points of Failure Are Created
Single points of failure aren’t often created deliberately. They just happen. Let’s take the example of John further. Let’s imagine that John was part of a team of people who designed and built a specific area of a code base. But over the years, the team’s workload shifted to include some additional areas of the code base. And some of those members who designed and built that one area with John left. As the years went by, new team members weren’t asked to focus on John’s area. John had it covered. Instead, they were asked to do work in new areas.
In this example, John became a single point of failure almost by accident. He hung around longer than his former colleagues, and the nature of the team’s work shifted. Neither of those is something that John could ultimately control.
The (Ir)Responsibility of Single Points of Failure
Single points of failure have a significant responsibility — share their knowledge with others before it’s too late. Because let’s face it — life happens. Things happen. Maybe someone actually will get hit by a bus or win the lottery. Or maybe they’ll be diagnosed with a medical challenge or determine that retirement doesn’t sound all that bad. It’s better for companies and the people they employ if single points of failure don’t exist.
Unfortunately, not everyone views being a single point of failure as something to be consciously remediated. Some view it as a status symbol, one that often warrants fawning respect from their peers. “John is the only one that can touch that area. We’d better ask him about the change we’re going to make before we do anything that might even remotely touch it.” Others view being a single point of failure as some kind of job security. “I’m the only one at the company who knows anything about this, so I can pretty much do whatever I want. They’ll never get rid of me.” This ego trip can be exacerbated when single points of failure are rewarded financially.
Identifying Single Points of Failure
Let’s look at a few ways to identify whether a single point of failure exists in your context.
Listening to how people talk about their work is probably the easiest way to identify single points of failure. When people talk about making changes, do they mention needing permission to make changes, or getting feedback from specific people? Do those same names keep coming up time and time again? Does work grind to a halt when those people aren’t available? Those folks might very well be single points of failure.
A more methodical way of identifying single points of failure within a team or small group is to assemble a list of skills in the group, and then document the skill level of each person in the group. I’ve heard this referred to as a “team skills heatmap.” The outcome of this activity is a nice visual: a chart that shows who has expertise in which areas, and who doesn’t. Single points of failure in a team will jump off the page (or screen).
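A team skills heatmap can also be checked mechanically. The following is only a rough sketch in Python, not a standard tool: the people, skill areas, the 0 to 2 rating scale, and the function name are all made up for illustration. It flags any skill area where exactly one person is rated as able to work independently.

```python
from collections import defaultdict

# Hypothetical self-assessed skill levels (an assumption for this sketch):
# 0 = no experience, 1 = can help with guidance, 2 = can work independently
SKILLS = {
    "John":  {"billing": 2, "search": 0, "mobile": 1},
    "Maria": {"billing": 0, "search": 2, "mobile": 2},
    "Wei":   {"billing": 0, "search": 2, "mobile": 2},
}


def find_single_points_of_failure(skills, min_level=2):
    """Return {skill: person} for every skill that exactly one person
    holds at min_level or above."""
    experts = defaultdict(list)
    for person, levels in skills.items():
        for skill, level in levels.items():
            if level >= min_level:
                experts[skill].append(person)
    # A skill with two or more experts has redundancy; one expert does not.
    return {
        skill: people[0]
        for skill, people in experts.items()
        if len(people) == 1
    }


if __name__ == "__main__":
    for skill, person in find_single_points_of_failure(SKILLS).items():
        print(f"{skill}: only {person} can work on this independently")
```

With this invented data, only the billing area is flagged, because John is the only person rated at level 2 there. Skill areas where nobody reaches the threshold are not reported by this sketch; a real heatmap would probably want to surface those as well.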
Mitigating Single Points of Failure
It’s not enough to recognize the problem. Single points of failure are an impediment to delivering value to customers, as well as the overall success of a firm. Knowing where single points of failure exist allows conversations around how to mitigate them to take place.
Examples of questions to ask include:
Is our company OK with having a single point of failure in this area?
How much work does this person do in a given area?
What would it take for someone else to ramp up in this area?
How many customers are using the functionality that this person touches?
How much work comes in to this area? How urgent is it?
What would our strategy be if the single point of failure stopped coming to work?
The answers to these questions should give you enough information to get a good sense of the risk involved with the single point of failure. The next step is then to build a plan to mitigate the single point of failure.
Single Points of Failure and Subject Matter Experts
Let’s not confuse single points of failure with subject matter experts (SMEs). While single points of failure are always SMEs, SMEs aren’t always single points of failure.
The John example above is one where a single point of failure is a SME. Let’s look at an example where SMEs aren’t always a single point of failure.
An organization that develops a mobile application has ten teams. Each team is composed of engineers that can do front end work (the application that’s surfaced on mobile devices), and back end work (the server that the mobile devices connect to). In this example, each team has one engineer that’s a server SME. From each team’s perspective, they have a single point of failure: the server SME on their team. If that person disappears from the team, they will not have that skill set. The reality, though, is that the server engineering skill set is sprinkled around the organization. If the server SME from one team leaves the company, the other nine server SMEs will figure out a way to continue getting work done. In this example, redundancy has been built into the system.
Single Points of Failure Can Be OK
I’ve come across a few scenarios in my career where I think having single points of failure is ultimately OK.
End of Life Products
In this scenario, the single point of failure is someone who knows a lot about a product that’s either been deprecated, or is in the process of being deprecated. There is a major caveat to this pattern being OK, though: the product in question cannot be in use by a meaningful percentage of the company’s customer base. The more customers use a product, the less acceptable it is to have a single point of failure in the development pipeline for that product. I’ve seen this “end of life” scenario coupled with development of a “next gen” version of a product. A person (or two) hangs around to maintain the “end of life” version of the product, while a team of new people moves forward with development of the “next gen” version of the product.
Low Learning Curve
In this scenario, the single point of failure operates in an area where the learning curve to get someone new up to speed is low. This could be because the technology is common, the scope of the single point of failure is isolated, or also because the single point of failure has done a great job documenting how the code works.
Agile Prevents Single Points of Failure
Whether companies are doing anything Agile or not, many jobs come with expectations that individuals will be part of a group of people — a team — working together towards common goals. Whether they’re project managers, sales people, or engineers, this team-centric model of working goes against single points of failure. Yes, one team member might know more than the others, but the onus is on the team to figure out how to share that knowledge across the whole group.
A development team that practices Agile stands a good chance of avoiding single points of failure. This is because many Agile methods preach things like collective code ownership, pair programming, and even mob programming. These practices promote groups of people working together towards common goals. The outcome is likely to be knowledge that’s well distributed across a group, instead of centralized in one or two brains.
Are You A Single Point of Failure?
We’ve reached the portion of this post where it’s time for you to do a little reflection on whether or not you’re a single point of failure. Whether you’re a Director of a department, or a member of a development team, here are some questions to get you started:
What happens when I’m out of the office? Are decisions made without me? Does “life” go on? Or do I return to the office with an inbox full of e-mails with decisions I’m being asked to make?
How often do people seek my opinion vs. expecting me to make decisions?
How much work do I need to do to prepare to be out of the office? Do I have to work overtime in order to be out of the office?
How often do people call or text me outside of my normal business hours?
Single points of failure can be a challenge for any product development organization. But it's a challenge that working together — whether using Agile or not — can help mitigate. Just don’t wait until someone gets hit by a bus to get started.