Resilience vs. Fault tolerance
In this post, I will discuss whether there is a difference between resilience and fault tolerance when talking about IT systems.
In my previous post, I discussed why I think resilience has become probably the most important paradigm of the 21st century. Still, the examples I used were mostly about people and organizations, and you might ask yourself whether this “resilience thing” also applies to IT systems – or whether it is just fault tolerance in disguise, using a new, fancier term.
Let us dive into that question …
Is it not just fault tolerance?
A few times, I have heard statements like “Resilience is nothing new. It is just fault tolerance, decorated with a shiny new term”. Admittedly, those people were not completely wrong. When we talk about resilience in the context of IT systems, most of the time we actually talk about fault tolerance.
For example, if you look at the definition of “fault tolerance” on Wikipedia, it looks a lot like what we usually mean when we talk about resilience in IT systems:
Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of (or one or more faults within) some of its components. If its operating quality decreases at all, the decrease is proportional to the severity of the failure […]. The ability of maintaining functionality when portions of a system break down is referred to as graceful degradation.
Continue operating in the face of partial failure. Graceful degradation of service. That sounds a lot like what we usually mean when we talk about the resilience of IT systems.
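To make graceful degradation a bit more tangible, here is a minimal sketch in Python. The scenario (a recommendation service with a static bestseller list as fallback) and all names in it are my own assumptions for illustration; they do not come from the Wikipedia definition.

```python
from typing import Callable


def recommendations_with_fallback(
    fetch_personalized: Callable[[str], list[str]],
    user_id: str,
    bestsellers: list[str],
) -> tuple[list[str], bool]:
    """Return (items, degraded).

    If the (hypothetical) personalization service fails, keep serving
    with a static list instead of failing the whole request.
    """
    try:
        return fetch_personalized(user_id), False
    except Exception:
        # Partial failure: the answer is of lower quality, but there is one.
        return list(bestsellers), True


def broken_service(user_id: str) -> list[str]:
    # Stand-in for a downstream call that is currently unavailable.
    raise ConnectionError("recommendation service unreachable")


items, degraded = recommendations_with_fallback(
    broken_service, "u42", ["book-a", "book-b"]
)
print(items, degraded)  # ['book-a', 'book-b'] True
```

The caller still gets a usable result and additionally learns that it is a degraded one, which it may want to log or show to the user.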
Additionally, if you look into the traditional resilience literature, you will not only read about withstanding adverse events (and maybe gracefully degrading your operations for a while) or recovering from them in a timely manner. You will also read about systems that gracefully extend the adverse event surface they can respond to, that develop emergent resilient behavior, that continually adapt to changing adverse events or that even transform themselves over time. 1
In terms of IT systems, that sounds a lot like systems that incorporate some kind of advanced AI which modifies the system and its code at runtime. Whether we personally find that idea intriguing or horrifying, the systems we usually develop today and the resilience measures we typically talk about are far from that idea.
So, is this whole resilience thing just old wine in new bottles?
The overlap between resilience and fault tolerance
Personally, I would not say so. Of course, there is a significant overlap between resilience and fault tolerance with respect to IT systems. Both disciplines are about dealing with adverse events that may negatively impact the ability of a system to function correctly.
Some simpler resilience concepts can also be mapped 1:1 to known fault tolerance measures. For example, monitoring latency, checking the correctness of request parameters from an upstream call, and checking the return values received from a downstream call are typical fault tolerance measures.
Still, they are also valid resilience measures if we consider the aspect of resilience that is about withstanding adverse external events. Detecting such an event is the first step towards withstanding it. Thus, those are also relevant resilience patterns.
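The three detection measures mentioned above could be sketched in Python roughly as follows. This is only an illustration under my own assumptions: the function `checked_call`, the exception `DetectedFault` and the pricing example are hypothetical. Note that it only measures latency after the call has returned. A real system would typically also enforce a timeout.

```python
import time
from typing import Any, Callable


class DetectedFault(Exception):
    """Raised when one of the checks detects an adverse event."""


def checked_call(
    downstream: Callable[[dict], Any],
    request: dict,
    max_latency_s: float,
    is_valid_request: Callable[[dict], bool],
    is_valid_response: Callable[[Any], bool],
) -> Any:
    # 1. Check the parameters received from the upstream caller.
    if not is_valid_request(request):
        raise DetectedFault("invalid request parameters from upstream")

    # 2. Monitor the latency of the downstream call.
    start = time.monotonic()
    response = downstream(request)
    elapsed = time.monotonic() - start
    if elapsed > max_latency_s:
        raise DetectedFault(f"downstream call too slow: {elapsed:.3f}s")

    # 3. Check the return value received from the downstream call.
    if not is_valid_response(response):
        raise DetectedFault("invalid return value from downstream")
    return response


def price_service(request: dict) -> dict:
    # Stand-in for a downstream service that returns nonsense.
    return {"price": -5}


detected = None
try:
    checked_call(
        price_service,
        {"sku": "A1"},
        max_latency_s=0.5,
        is_valid_request=lambda r: "sku" in r,
        is_valid_response=lambda resp: resp.get("price", -1) >= 0,
    )
except DetectedFault as fault:
    detected = str(fault)

print(detected)  # invalid return value from downstream
```

Detection alone does not make the system withstand anything. It only creates the opportunity to respond, e.g., with a fallback or a retry.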
So, fault tolerance measures are usually also valid resilience measures.
Fault tolerance but not resilience?
Before moving to the aspects of resilience that are not covered by fault tolerance, let us first ask if there are fault tolerance measures that are not resilience measures.
To be honest, I am not totally sure. When I go through fault tolerance literature, I sometimes stumble upon “exotic” fault tolerance measures, measures that you would not implement in the context of enterprise IT systems. For example, in very safety-critical environments like space travel, you may want to have heterogeneous redundancy.
This means that you implement a solution multiple times, using a different design, a different programming language and a different runtime environment for each implementation, and that you run each of them on a different hardware platform. While this may make the difference between life and death in an environment like manned space travel, it would be economic nonsense in most enterprise software contexts.
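The core idea can be hinted at in a few lines of Python: run several independently designed variants of the same function and let a majority vote decide. This is a heavily simplified sketch under my own assumptions. Real heterogeneous redundancy would use different languages, runtimes and hardware, while here all variants are just different algorithms in a single process. The function names and the integer square root example are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Sequence


def vote(variants: Sequence[Callable[[int], int]], x: int) -> int:
    """Run all variants and return the result a strict majority agrees on."""
    results = []
    for variant in variants:
        try:
            results.append(variant(x))
        except Exception:
            continue  # a crashed variant simply does not get a vote
    if not results:
        raise RuntimeError("all variants failed")
    winner, count = Counter(results).most_common(1)[0]
    if count <= len(variants) // 2:
        raise RuntimeError("no majority among variants")
    return winner


def isqrt_loop(n: int) -> int:
    # Variant with a different design: simple linear search.
    r = 0
    while (r + 1) * (r + 1) <= n:
        r += 1
    return r


def isqrt_buggy(n: int) -> int:
    # Deliberately faulty variant to show that the vote masks it.
    return int(n ** 0.5) + 1


print(vote([math.isqrt, isqrt_loop, isqrt_buggy], 17))  # 4
```

The faulty variant is outvoted by the two correct ones. The price is that every piece of functionality has to be designed, built and operated several times.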
Does that mean that such measures do not belong to the domain of resilience while belonging to the domain of fault tolerance? I do not know. I think you might find arguments for both positions. Personally, I think they may belong to the domain of resilience, but they are not measures I would mention in most situations.
Overall, I think that boundary is blurry, if it exists at all.
Beyond fault tolerance
On the other hand, resilience contains aspects that for sure do not belong in the domain of fault tolerance. Whenever we talk about responding gracefully to not yet known failure modes, emergent resilient behavior of the system parts involved 2, adapting to changing failure surfaces or even transforming based on them, we have left the domain of traditional fault tolerance.
You might argue that gracefully responding to not yet known failure modes is still part of fault tolerance – and again, there may be arguments for both positions. Yet, traditional fault tolerance usually goes like: “What are the failure modes we can think of and how shall the system respond to them?”. So, typically it is more about known failure modes (while the boundaries may be blurry again).
Yet, a resilient system should also be able to respond gracefully to not yet known and thus unanticipated failure modes. As I wrote at the beginning, we are still in the early days of creating such systems, systems that are smart enough to detect that something really unexpected happened and are capable of responding to it in better ways than just crying for help or shutting themselves down.
Nevertheless, that is where I think we are heading with resilient software design. Our system landscapes become more complex every day, which means that more and more unexpected failure modes will emerge. At the same time, IT becomes more indispensable for our business and private lives every day, which means that it is crucial that those systems continue functioning even if some unexpected adverse events hit them.
At the moment, we compensate for the lack of graceful behavior and adaptability of our IT systems by involving humans. Whenever the IT systems do not know how to respond to an unexpected event, they request human help: They send an alert to a system’s operator or to an on-call member of a DevOps team. This means that if we talk about resilience in the context of IT systems today, we actually talk about sociotechnical systems, the IT systems and the humans running and changing them.
At the same time, we continually shift the boundary between the errors IT systems can take care of on their own (“self-healing”) and the errors they cannot fix themselves, where humans need to be involved. This way, we continually extend the capabilities of our IT systems away from traditional fault tolerance towards a more complete notion of resilience.
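That boundary can be sketched as a simple dispatch in Python: known failure modes are mapped to automated remediations, and everything else is escalated to a human. This is a minimal sketch under my own assumptions. The function `handle_failure`, the remediation table and the `alert_operator` callback are hypothetical names, not part of any real alerting tool.

```python
from typing import Callable

Remediation = Callable[[Exception], str]


def handle_failure(
    error: Exception,
    remediations: dict[type, Remediation],
    alert_operator: Callable[[str], None],
) -> str:
    # Known failure mode: the system takes care of it on its own.
    for error_type, remediate in remediations.items():
        if isinstance(error, error_type):
            return remediate(error)
    # Unknown failure mode: hand over to the human part of the
    # sociotechnical system.
    alert_operator(f"unhandled failure: {error!r}")
    return "escalated"


alerts: list[str] = []
remediations: dict[type, Remediation] = {
    TimeoutError: lambda e: "retried",
    ConnectionError: lambda e: "switched to replica",
}

first = handle_failure(TimeoutError("slow"), remediations, alerts.append)
second = handle_failure(ValueError("corrupt record"), remediations, alerts.append)
print(first, second, len(alerts))  # retried escalated 1
```

Shifting the boundary then means adding entries to the remediation table over time, so that fewer failures end up as alerts for a human.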
For me, this is an exciting journey. Of course, depending on the turns we take on our journey, there is a chance not only to travel exciting scenic roads but also to run into some dreadful nightmare valleys – actually, I am convinced we will run into the latter several times and hopefully learn from them. But still, I am really curious where it will lead.
While we need to be careful not to create systems that completely get out of control due to deficient self-healing or even self-adaptation capabilities, we also cannot leave the ever-growing complexity of our system landscapes to some poor human operators without improving how the IT systems themselves support them in their task.
In this post, we discussed whether resilience in the context of IT systems is the same thing as fault tolerance, just using a fancier term. While there is a big overlap for sure, especially when discussing basic resilience measures (which is what we do most of the time), resilience goes way beyond traditional fault tolerance.
Resilience points towards systems that will respond gracefully to not yet known failure modes, develop emergent resilient behavior, adapt to changing failure surfaces or even transform based on them. At the moment, we accomplish these advanced resilience properties by involving humans, by extending the technical systems to sociotechnical systems, including the systems themselves and the humans who run and change them.
At the same time, we continually shift the boundaries of which errors the systems can take care of on their own, continually making the (technical) IT systems a bit more resilient – and not only more fault tolerant.
I hope that helped a bit to clarify the distinction between fault tolerance and resilience. Having done that, we are ready to move deeper into the domain of resilience. Stay tuned … ;)