Why we need resilient software design - Part 5

The need for resilient software design

Uwe Friedrichsen


Ancient overgrown burial mound (seen in northern Germany)


In the previous post, we discussed why the imponderabilities of distributed systems will hit us at the application level and we cannot leave their handling to the operations teams as we did in the past.

In this final post of this little blog post series, we will first recapitulate what we have learned so far and then discuss what it all means with respect to resilient software design.

Let us begin with what we have discussed so far.

The story so far

In the first post of this blog series, we discussed the stepwise journey from isolated monolithic applications to distributed system landscapes where applications continually communicate with each other. We saw that the number of peers involved grew continually (and continues to grow), while at the same time the expected time for propagating updates became shorter and the availability expectations moved toward “never down”.

As a result of these three parallel evolutions, today we are faced with complex, highly distributed, continually communicating systems landscapes that have to be up and running all the time. Additionally, we are faced with many peers (and the processing running on them) that live outside the boundaries of the infrastructure we control, e.g., mobile apps, SPAs or IoT devices. They run on user-controlled (or uncontrolled) devices and communicate over (sometimes quite unreliable) public network connections.

Being confronted with such complex, highly distributed, continually communicating systems landscapes that are expected to propagate updates almost instantaneously and to be never down, we need to have a look at the failure modes of such systems and how they affect the application behavior.

This was the topic of the second post. In that post, we have seen several failure modes, namely

  • Crash failures
  • Omission failures
  • Timing failures
  • Response failures
  • Byzantine failures

that are unique to distributed systems and do not occur inside process boundaries. We also learned that failures of these types can have very nasty effects at the application level. Order and consensus in particular (and everything building on those two fundamental properties) are hard problems in distributed systems. Unfortunately, most application logic relies on always knowing the exact order of events as well as on always being sure about the current state.

Both properties are very hard to achieve in distributed systems, and under unfortunate circumstances they are even impossible to achieve. Additionally, the notion of a global shared state does not exist in distributed systems. Each party involved has its own, potentially conflicting partial knowledge of the global state. This creates another level of complexity that does not exist inside process boundaries.

In the third post, we then discussed what these failure modes mean in practice and how they affect us at the application level. We looked at availability in general, understanding the “nines”, and how adding nodes and the imponderabilities of network communication affect availability.

We also briefly discussed the widespread 100% availability trap that many people inside and outside IT still fall for: the implicit assumption that all remote peers, including infrastructure tools like databases, message brokers and the like, are available 100% of the time. We have seen how multiple peers and network imponderabilities reduce the overall system availability. This means that with the 100% availability trap in mind, you will almost inevitably create brittle systems that exhibit very poor availability.
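To make the effect of multiple peers a bit more tangible, here is a minimal sketch of the arithmetic. It assumes that a request only succeeds if all involved parts are up and that the parts fail independently of each other, which is a simplification. The function names and the example landscape (ten parts with 99.9% availability each) are made up for illustration.

```python
import math


def serial_availability(availabilities):
    """Availability of a request that needs every listed part to be up.

    Assumption: the parts fail independently of each other.
    """
    return math.prod(availabilities)


def downtime_hours_per_year(availability):
    """Expected downtime per (non-leap) year for a given availability."""
    return (1.0 - availability) * 365 * 24


# Hypothetical landscape: our own service plus nine remote peers,
# each of them available 99.9% of the time ("three nines").
parts = [0.999] * 10
overall = serial_availability(parts)

print(f"single part: {downtime_hours_per_year(0.999):.1f} hours of downtime per year")
print(f"all parts:   {overall:.4%} available, "
      f"{downtime_hours_per_year(overall):.1f} hours of downtime per year")
```

Under these assumptions, ten parts with three nines each end up at roughly 99.0% overall, i.e., about ten times the downtime of a single part. Real landscapes are messier because failures are often correlated, but the direction of the effect stays the same.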

In the fourth post, we then debunked the “let ops take care of resilience” habit, which is still widespread in the software development community. We have seen that while the infrastructure tools commonly used today can help us implement highly available and robust systems, they cannot solve all the problems we face.

Some failure modes are only rudimentarily supported. Many important resilience patterns are not supported at all or cannot be supported in a generic way due to the nature of the respective pattern. Not all infrastructure means are available everywhere. Some of the infrastructure means require explicit support from the application level to work, i.e., supporting code in the application code base. And usually, only relatively coarse-grained actions are possible, often not differentiated enough regarding the application’s needs.

Additionally, all infrastructure means can only support us at a technical level. They cannot relieve us from fixing any business-level issues that arose due to unexpected technical failures.

What we can learn from it

The bottom line of the previous four posts of this blog series is:

  1. Systems are highly complex and distributed today.
  2. This leads to new kinds of failure modes.
  3. We can partially delegate the handling of these failures to the infrastructure level.
  4. But these failures will also hit us at the application level for sure.

What can we learn from all that?

In a single sentence:

We need to take resilient software design into account when designing and implementing software-based solutions.

As software engineers, we cannot ignore resilience and fault-tolerance any longer, hoping (or expecting) that operations will take care of all availability and reliability concerns.

This means that, as software engineers, we need to retire the “single process, single thread” mindset we typically use while developing software and adopt a distributed systems mindset.

We need to understand that every single call leaving our process boundaries (which happens surprisingly often) is not guaranteed to complete.

Instead, we need to accept that all remote calls are predetermined breaking points of our applications and system landscapes.

And we need to accept that our applications will break at these predetermined breaking points, whether we ignore this fact or not. But if we ignore it, our applications will become brittle and exhibit a lot of downtime.
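As a small illustration of what treating a remote call as a predetermined breaking point can look like in application code, here is a minimal sketch. It is not a complete resilience pattern implementation. All names (`call_with_fallback`, the three `*_peer` functions, the "cached price" fallback) are hypothetical, and the remote peers are simulated with local functions so that the example runs without a network.

```python
import concurrent.futures
import time


def call_with_fallback(remote_call, timeout_s, fallback):
    """Run remote_call, but never wait longer than timeout_s and never
    let a failure of the remote peer propagate unhandled."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(remote_call)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The peer did not answer in time (e.g., omission or timing failure).
        return fallback
    except Exception:
        # The call failed outright (e.g., the peer crashed or is unreachable).
        return fallback
    finally:
        # Do not block the caller; a late call finishes in the background.
        pool.shutdown(wait=False)


# Hypothetical stand-ins for remote peers.
def healthy_peer():
    return "fresh price"


def slow_peer():
    time.sleep(0.5)  # answers far too late
    return "fresh price"


def broken_peer():
    raise ConnectionError("peer unreachable")


for peer in (healthy_peer, slow_peer, broken_peer):
    result = call_with_fallback(peer, timeout_s=0.1, fallback="cached price")
    print(f"{peer.__name__}: {result}")
```

Whether a cached value, a default, or an explicit error message is an acceptable fallback is a business decision, not a technical one. This is one of the reasons why this kind of handling needs to live in the application code.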

To respond to these omnipresent sources of failures, we need to understand at least the basic concepts of distributed systems, their failure modes and how to respond to them.

We need to learn how the available infrastructure means can support us and which parts we need to take care of ourselves – writing the required application code.

Only then will we be able to create highly available and reliable applications.

And if we take into account that, because of the ongoing digital transformation, software becomes more and more indispensable for our business and private lives every single day, I think this is the least we need to do. 1

Moving on

At this very moment, you might ask yourself what exactly you need to learn to create fault-tolerant and resilient software systems and how to apply those concepts. Maybe you even hope that I will answer these questions in the remainder of this post.

Sorry, but it is not that easy. Resilient software design is a big topic and answering these questions – even only superficially – would require a lot more than just the remainder of a single blog post.

The goal of this series was to explain why resilient software design has become mandatory and that we cannot simply leave the issue to the operations team anymore as we did in the past.

I hope I achieved this goal.

Of course, I will come back to the question of how to create resilient applications. I will definitely write quite a few blog posts discussing specific concepts and applications of resilient software design. Teaser: The next few blog posts will dive deeper into the what and how of selected aspects of resilient software design.

I am also in the process of launching a new website that explains resilience patterns (and some other topics). There you will also find a lot of information. I will let you know when the website is live (and probably also add the link here).

Resilience is a huge, vital and exciting area to explore, and I hope I have aroused your curiosity to explore this topic further.

I invite you to explore it together with me, e.g., via future posts or the new website I will launch. If you are interested, stay tuned and I will keep you informed.

And if you should decide to explore the topic on your own, I also wish you great travels in the world of resilient software design! So many exciting and fascinating things to be discovered …


  1. We should also consider extending the topic to the concept of dependability, which provides a more comprehensive view of what we need to take into account to create systems we can rely on. But that would be the second step. Hence, I left it out here. For a conceptual introduction to dependability, you may want to read, e.g., “Basic Concepts and Taxonomy of Dependable and Secure Computing” by A. Avizienis et al. Note that depending on the authors, dependability and security are treated as two distinct domains or security is treated as a subdomain of dependability. Personally, I prefer the latter definition. Still, IMO both points of view are legitimate depending on the aspects you try to emphasize.