The business case of resilient software design

Balancing costs and benefits

Uwe Friedrichsen

Footpath lined with trees (seen in Fulda, Germany)

The business case of resilience is a bit tricky. You find quite disparate forces at work: While some people tend to greatly underrate the need for and value of resilience, other people find it hard to stop adding resilience measures. As so often, the sweet spot is somewhere in the middle.

Underspending

Let us start with people who tend to pay too little attention to resilience. Often, those people can be found at a not-so-technical decision maker level. Let us assume you go to your manager asking for budget to make your application more resilient. A counterquestion in such a situation could be: “How much money will we earn with it?”

You find this kind of question in all companies that insist on an ROI calculation before signing off on a project. The underlying assumption is that it is only worth spending money on topics that (very soon) will return more money than you spend on them.

You also find this type of question in disguise. E.g., your Product Owner might ask you in the same situation whether making the application more resilient would improve team velocity 1. And so on. So, this type of question is not limited to managers.

And, to be fair, making sure that investments make economic sense is a sensible idea because this greatly increases the probability that all employees – including us – get their salary paid at the end of the month. Neglecting the economic implications of your actions is a great way to go bankrupt quickly. Hence, questioning the economic feasibility of investments is entirely reasonable.

Additionally, in software development departments asking this kind of question is a good litmus test. Most software engineers are attracted by the “new and shiny”: New is “cool”, everything else is not. I discussed this habit of the IT community and its implications in more detail in my “continuous amnesia” post.

A side effect of this strong attraction of software engineers towards everything new is that managers in IT departments are often flooded with requests to introduce new technology X or shiny tool Y. In such a situation, asking for the business case can be a good litmus test. It helps the manager to differentiate rash desires from well-considered thoughts.

So, there is nothing wrong per se with asking about the economic feasibility of an investment. The problem only starts if you overdo it, if you turn it into a dogma where each Euro spent must immediately create a return higher than the Euro spent:

  • Sometimes things only pay off indirectly. E.g., billions of Euros are spent on “brand awareness” every year. Yet, most if not all of the measures do not have an immediate business case. The business case of brand awareness is indirect.
  • Sometimes, probabilities are involved and a simple cause-effect relationship as we see it in most business case calculations cannot be established. E.g., in my post why we need resilience, I discussed that unexpected events become more and more likely. Still, regular business cases only plan for the expected. Anything unexpected is not covered in there, making them mostly useless in a world filled with uncertainty.
  • Sometimes, the appropriate question is not how much money you make if you do something, but how much money you will lose if you do not do it. E.g., investments in security will not make any money. They only cost money. But if you do not spend the money and get hacked, it tends to cost you a lot more money than it would have cost you to implement the security measures that would have avoided the hack. 2

Resilience is a bit like security: It helps you avoid losing money if something does not work as expected. It also has a probabilistic component because unexpected events and failures only happen with a certain probability. And it has an indirect component, too: If your IT fails too often, it will lead to secondary effects like annoyed customers leaving. This means you do not only lose money as a direct effect of the failure, but also as a secondary effect detached from the incident itself.

The key message for the business case advocates is:

Resilience is not about making money. Resilience is about not losing money.

If we understand this, we have the required basis to evaluate the economic feasibility of resilience better – not by chasing the short-term ROI, but by using a more appropriate and sustainable business case calculation schema.

Overspending

Let us move on to the opposite problem: people who tend to pay too much attention to resilience. Often, those people can be found at the engineering level. Once you start realizing what can go wrong in distributed IT system landscapes, you find more and more places in your application that you think you need to augment with resilience measures.

And then you figure out more and more sophisticated measures you could apply. The possibilities are endless and everything feels important.

But following this road would turn resilient software engineering into a bottomless pit, into an end in itself. You could sink all of the company's money into it and would still not be done.

The key message here is:

Resilience is not an end in itself. Resilience is a means to an end.

Additionally, adding more and more resilience measures to your application makes it more and more complex. But the more complex your code and your infrastructure become, the harder it becomes to understand and run your application – which makes your application less resilient in the long run.

You need to balance resilience measures and solution understandability to optimize the overall resilience gain of your solution. I will discuss this topic in more detail in an upcoming post.

Finding the sweet spot

This leaves us with the question of where the sweet spot of resilience engineering lies and how to avoid under- and overspending. How can we calculate a sensible business case for resilient software design, and determine how much money we should spend on it and when to stop?

It is probably possible to come up with a very complicated and sophisticated approach to answer this question. Yet, I prefer simple solutions whenever they are applicable. They might not be perfect, but as long as they are sensible and “close enough”, I prefer them over any complicated solution.

Regarding resilient software design, my recommendation consists of two steps:

  • Understand how much money and risk are at stake
  • Determine in which resilience measures to invest

Defining the resilience budget

The first step is about understanding how much money is at stake if your IT systems fail. This sum typically consists of two components:

  • Direct, immediate losses
  • Indirect, delayed losses

Direct losses are the money you do not make while a failure is ongoing. E.g., your customers cannot place orders because a system required for placing orders has failed. Depending on the time to recovery, you lose more or less money.

You may also lose money due to data loss caused by a failure. E.g., you save the orders placed in a database. Due to a failure, the database crashes and needs to be restored from a backup. This means you lose all updates, including the orders, made between the last database backup and the time the failure occurred, i.e., you lose money due to data loss.

If your company does business continuity planning, there is a good chance that these numbers are already known. Typically, they are calculated during the business impact analysis. Search for Recovery Time Objective (RTO) and Recovery Point Objective (RPO) (see, e.g., this Wikipedia article describing disaster recovery for more information).
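To make the two kinds of direct loss concrete, here is a minimal sketch in Python. It rests on simplifying assumptions that are not part of the text above: revenue accrues evenly over time, and every order placed since the last backup is lost for good. The function name `direct_loss` and all figures are hypothetical.

```python
def direct_loss(revenue_per_hour, downtime_hours, data_loss_hours):
    """Estimate the direct loss of a single failure.

    Assumption: revenue accrues evenly over time, so the money lost is
    proportional to the length of the outage plus the length of the
    window between the last backup and the failure.
    """
    lost_during_outage = revenue_per_hour * downtime_hours
    lost_with_data = revenue_per_hour * data_loss_hours
    return lost_during_outage + lost_with_data


# Hypothetical figures: 5,000 EUR revenue per hour, 2 hours to recover,
# last backup taken 6 hours before the failure.
print(direct_loss(5_000, 2, 6))  # 40000
```

In practice the revenue rate will vary by time of day and season, so a number like this is only a starting point for the discussion with the business owners.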

If your company does not do business continuity planning, you can approach your product managers or other owners of lines of business. They should be able to estimate at least how much money they would lose per unit of time if a failure occurred.

You may additionally want to augment these numbers with some risk figures. E.g., a downtime of an hour or less might be expensive. A downtime longer than an hour might be considered critical and any downtime that takes longer than four hours might put the company’s survival at stake. 3

Such risk figures provide useful additional information as they point out unbearable risks that must be mitigated by all means.

The indirect losses are a delayed, secondary effect. E.g., if your e-commerce site often suffers from availability issues, your customers get annoyed. Some of them decide to buy from your competitor. Your churn rate goes up. Fewer customers are buying from you. You lose money.

These losses are usually not covered by approaches like business continuity planning. Such approaches tend to focus on immediate risks and losses. Still, your product managers or other owners of lines of business should be able to make some sensible estimations regarding indirect losses.

Take the direct and indirect loss estimations, weight them by their probability of occurrence and you have a first approximation of a resilience budget. Additionally, the unbearable risks you have identified along the way must be mitigated by all means, no matter if they fit in the budget or not.
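As a rough illustration of this calculation, here is a minimal Python sketch. The `Risk` class, its field names, the per-year probability and all figures are hypothetical; the sketch simply multiplies each loss estimate by its probability and sums the results, which is one straightforward reading of the approach, not a definitive formula.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Risk:
    """One failure scenario with loss estimates (all values are assumptions)."""
    name: str
    direct_loss: float        # EUR lost while the failure is ongoing
    indirect_loss: float      # EUR lost later, e.g., through customer churn
    probability: float        # estimated probability of occurrence per year
    unbearable: bool = False  # must be mitigated regardless of the budget

    def weighted_loss(self) -> float:
        """Direct plus indirect loss, multiplied by the probability."""
        return (self.direct_loss + self.indirect_loss) * self.probability


def resilience_budget(risks: Iterable[Risk]) -> float:
    """First approximation of the budget: the sum of all weighted losses."""
    return sum(risk.weighted_loss() for risk in risks)


risks = [
    Risk("shop checkout unavailable", 40_000, 20_000, 0.5),
    Risk("order database lost", 200_000, 100_000, 0.25, unbearable=True),
]

print(resilience_budget(risks))  # 105000.0
# Unbearable risks are listed separately because they need to be
# mitigated whether or not they fit in the budget.
print([risk.name for risk in risks if risk.unbearable])
```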

As with all business case and budget calculations, this is not an exact science 4. A lot of psychology is involved, whether we like (or admit) it or not. The key point is that all parties involved agree on the numbers and consider them sensible.

Defining where to spend the budget

Now that you have a budget, how do you best spend it? The budget is not limitless, even if you have business owners who have understood the value and necessity of resilience. But you have endless possibilities to spend the budget.

I will keep this section short because I will come back to this topic in more detail in some later posts.

The budget calculation already contains everything needed for a simple approach to define which resilience measures to address in which order:

  1. Take the unbearable risks and address them first (no matter if they fit in the budget or not).
  2. Then order the remaining risks by weighted loss.
  3. Address them in that order until you run out of budget.
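The three steps above can be sketched in Python as follows. This is a minimal sketch under assumptions that go beyond the text: each risk has an estimated `mitigation_cost`, unbearable risks are funded first and their cost is deducted from the budget (even if that overdraws it), and the planning stops at the first remaining measure that no longer fits. All names and figures are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Risk:
    name: str
    weighted_loss: float      # probability-weighted loss estimate in EUR
    mitigation_cost: float    # assumed cost of the resilience measure in EUR
    unbearable: bool = False


def plan_measures(risks: List[Risk], budget: float) -> List[Risk]:
    """Return the risks to address, in the order they should be addressed."""
    # Step 1: unbearable risks come first, whether they fit the budget or not.
    plan = [risk for risk in risks if risk.unbearable]
    remaining = budget - sum(risk.mitigation_cost for risk in plan)

    # Step 2: order the remaining risks by weighted loss, highest first.
    rest = sorted(
        (risk for risk in risks if not risk.unbearable),
        key=lambda risk: risk.weighted_loss,
        reverse=True,
    )

    # Step 3: address them in that order until the budget runs out.
    for risk in rest:
        if risk.mitigation_cost > remaining:
            break
        plan.append(risk)
        remaining -= risk.mitigation_cost
    return plan


risks = [
    Risk("search unavailable", 30_000, 20_000),
    Risk("payment outage", 500_000, 60_000, unbearable=True),
    Risk("recommendations unavailable", 5_000, 15_000),
    Risk("checkout slow", 75_000, 40_000),
]

for risk in plan_measures(risks, budget=130_000):
    print(risk.name)
```

With these made-up numbers, the plan covers the payment outage, the slow checkout and the search outage; the recommendations measure no longer fits. Skipping a measure that does not fit and continuing with cheaper ones would be an equally defensible variant.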

This is a very simple, but IMO useful approach to decide where and in which order to invest in resilience measures. Of course, you can tweak and fine-tune the plan if there are some additional factors involved that point towards a different, better order. E.g., you might decide to implement an infrastructure level measure to address one of the high-risk areas which you additionally can apply to many low-risk places at almost no cost and effort.

You can also use a method like Failure Mode and Effects Analysis (FMEA) to understand better which non-covered potential breaking points in your application may be good candidates to secure.

Or you can leverage empirical methods like Chaos Engineering to understand the failure modes of your application better and how they affect your business. This can also help to prioritize how to use your resilience budget.

In the end, I do not think there is a single best way to decide how to spend the budget. I think there are several sensible ways and it is up to you to decide which way works best for you.

Still, be aware that you will need to prioritize. You will not have the budget to implement all measures that come to your mind at once – and often it would even be nonsense from an economic perspective to do so. Hence, spend your budget wisely.

Summing up

The business case for resilient software design is a bit tricky because resilience is not about making money, but about not losing money. You also do not lose the money immediately if you do not do anything, but only if something unforeseen happens, which introduces a probabilistic element. This often makes it hard to negotiate a budget for resilience, especially if you are confronted with people who tend to think in direct, short-term cause-effect relationships only.

Unfortunately, like with security, the often catastrophic effects of not caring about resilience only become obvious when it is too late. Hence, underspending can be life-threatening for companies.

Overspending, on the other hand, also does not make sense from an economic point of view. There are countless options to improve the resilience of your application, but implementing all of them is not useful, either.

Therefore, you should calculate a sensible resilience budget based on the immediate and secondary, delayed losses that you will suffer if failures occur. And then you need to decide where to spend the budget, using business risks and weighted, expected losses as guidance.

In a future blog post, I will discuss in more detail why too many measures are not only problematic from an economic point of view and how to find a good balance.

I hope this gave you a few ideas to ponder … and some arguments for future discussions regarding resilience budgets … ;)


  1. Team velocity is a measure that determines how much output you get based on a fixed input. Your input (or investment) is the cost of the team. For the sake of simplicity, these costs can be considered constant over time (i.e., not considering that team members may change over time, that different team members might have different salaries, that some team members may get a raise or the like). Under that assumption, an increase of velocity means that you get more output, i.e., work done per Euro invested. Hence, measures that improve your velocity improve how much output you get per Euro invested. This does not take the outcome of the work done into account, i.e., how much value (the actual return on investment) your work creates (I discussed the difference between work done and value created in detail in my little blog series discussing uncertainty). It only takes the amount of work accomplished, i.e., output, into account. The underlying naive assumption is that all work done (output) creates as much value (outcome) as work was done – a simple, direct relationship between work and value.

  2. The fact that security does not make money, combined with the fact that for most people breaches felt like something “that happens to other people”, has led to massive security underspending over the last 30 years (i.e., since the Internet became a thing and security breaches were not limited to physical access to computers anymore). Security could be found in the “CIO top 10 topics” lists all these years but actual investments did not reflect this alleged importance at all. Only in recent years, after cybercrime exploded, do you find a – still reluctant – willingness to actually invest in security. We can observe similar behavioral patterns regarding resilience and sustainability. We will see if people will learn faster this time …

  3. Note that it is not unusual these days for IT downtimes of just a few hours to become life-threatening for companies.

  4. Business case and budget calculations are never an exact science. Of course, I met several people who insisted their calculations were an exact science because they used very elaborate and complicated calculation schemes. Still, all those elaborate schemes were based on inputs that were rough estimations at best. To me, it always felt a lot like fake precision, like calculating gravity to 10 digits of precision with a stone in one hand and a stopwatch in the other.