Calculating the Cost of Failure

Let’s consider that terrible moment when you recognize that you have a showstopper bug in production. Your customer’s business can’t run. The line is stopped.

What is the cost to your business of that outage?

You can probably come up with a cost measured in thousands of dollars per second of downtime. It might be a curve—you might be okay for the first minute, then the next few hours have a fixed cost per hour—and then at some point, you start to have permanent reputation damage.
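To make that curve concrete, here is a minimal sketch in Python. The grace period, burn rate, and reputation penalty below are hypothetical numbers, not anyone's real figures; the point is the shape of the curve.

    def outage_cost(minutes_down: float) -> float:
        """Illustrative outage cost curve; every constant here is a made-up example."""
        GRACE_MINUTES = 1              # the first minute or so is survivable
        BURN_RATE = 5_000              # dollars per minute once the line is stopped
        REPUTATION_THRESHOLD = 180     # after a few hours, reputation damage kicks in
        REPUTATION_PENALTY = 250_000   # one-time hit for a headline-worthy outage

        cost = max(0.0, minutes_down - GRACE_MINUTES) * BURN_RATE
        if minutes_down > REPUTATION_THRESHOLD:
            cost += REPUTATION_PENALTY
        return cost

    for duration in (0.5, 10, 60, 240):
        print(f"{duration:>6} minutes down -> ${outage_cost(duration):,.0f}")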

Change in Cost of Failure Over Time

Some of you will remember when we used to ship software on disks. If you messed up a delivery, it meant a physical recall, with your company on the front page of the newspaper until new disks hit the shelves. The length of the outage was huge.

The Internet changed all that. Now we can push a deployment in seconds. Of course, we still have to worry about regressions, the chance that our “fix” broke something else, so we do some testing before a deployment. But the deploy technology itself makes it possible to roll out much more quickly. Suddenly (cost per minute) × (number of minutes) can be a lot less, making the value of testing go down, or at least appear to go down.

“Just let the customer find the bugs,” they say. “It’s working for Facebook.”

Maybe, but it’s actually more nuanced than that.

Another Way to Look at It

Impact = how often it happens multiplied by how long it takes to recover.

The overall cost of downtime is proportional to how often failures occur (the inverse of mean time between failures, or MTBF) multiplied by mean time to recovery (MTTR). Without insight into the process, testing tends to focus on increasing MTBF, making failures rarer. That’s good; I’d rather have one failure per ten thousand deploys than one failure per ten.
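As a quick back-of-the-envelope sketch (the failure rates and recovery times are invented for illustration), the expected downtime each deploy contributes is simply how often failures happen times how long they take to fix:

    def expected_downtime_minutes(failures_per_deploy: float, mttr_minutes: float) -> float:
        """Expected downtime per deploy = (how often it happens) * (how long it takes to recover)."""
        return failures_per_deploy * mttr_minutes

    scenarios = [
        ("1 failure per 10 deploys, 8-hour recovery",     1 / 10,     8 * 60),
        ("1 failure per 10,000 deploys, 8-hour recovery", 1 / 10_000, 8 * 60),
        ("1 failure per 10 deploys, 5-minute recovery",   1 / 10,     5),
    ]

    for label, rate, mttr in scenarios:
        print(f"{label}: {expected_downtime_minutes(rate, mttr):.3f} expected minutes of downtime per deploy")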

Focusing on increasing MTBF is fine, to a point. Going from 90 percent to 99 percent uptime is valuable, and so is adding the next .9 to get to 99.9. Every nine after that, though, tends to come with an incredible spike in cost.
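To see where that spike comes from, translate each level of uptime into the downtime budget it allows per year; the arithmetic is standard, sketched here in a few lines:

    MINUTES_PER_YEAR = 365 * 24 * 60

    for uptime in (0.90, 0.99, 0.999, 0.9999, 0.99999):
        downtime_minutes = (1 - uptime) * MINUTES_PER_YEAR
        print(f"{uptime:.3%} uptime allows about {downtime_minutes:,.0f} minutes of downtime per year")

Each added nine cuts the remaining downtime budget by a factor of ten, which is why the cost of chasing it climbs so steeply.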

Going Beyond

The folks at Facebook, Google, and a number of other companies have added a third variable: deploying a change to a limited number of users and monitoring heavily. New Zealand, for example, is a popular deployment location for code. If MTTR is low enough, issues in New Zealand can be found and fixed while America is still sleeping. And if you don’t want to use the entire population of New Zealand, you could deploy to power users or run an extended beta program. That makes:

Impact = number of users affected multiplied by how often they are affected multiplied by how long it takes to recover.
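Here is a small sketch of that three-variable model, comparing a full rollout against a limited one. The user counts, failure rate, and recovery times are hypothetical:

    def impact_user_minutes(users_affected: int, failures_per_deploy: float, mttr_minutes: float) -> float:
        """Impact = users affected * how often they are affected * how long it takes to recover."""
        return users_affected * failures_per_deploy * mttr_minutes

    FAILURE_RATE = 1 / 100  # hypothetical: one deploy in a hundred goes bad

    full_rollout = impact_user_minutes(10_000_000, FAILURE_RATE, 60)   # everyone at once, hour-long fix
    limited_rollout = impact_user_minutes(100_000, FAILURE_RATE, 10)   # one small region, fast rollback

    print(f"Full rollout:    {full_rollout:,.0f} expected user-minutes of impact per deploy")
    print(f"Limited rollout: {limited_rollout:,.0f} expected user-minutes of impact per deploy")

Notice that the limited rollout does not reduce the failure rate at all; the savings come entirely from shrinking the other two variables.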

When it comes to reducing risk, we used to only have one variable to adjust; now we have at least three.

Caveat

In the examples above we treated all failures and all customers as equal, and I made some broad generalizations. These examples are illustrative. As soon as we go from talking about MTBF to measuring it and trying to control it, a number of risks appear in our path. We also talked only about showstoppers, and as my friend Tim Western has pointed out, death by a thousand tiny usability cuts can cause the same loss of reputation in the marketplace.

For today, it’s enough. But it’s worth thinking about your strategy: Which mean time are you optimizing for, MTBF or MTTR, and what should change next?
