The right dose of resilience
We discussed the business case for resilient software design in my previous post. Let us assume you have a budget and you know which business processes, capabilities or interactions (whichever term suits you best) are the most critical ones you need to secure, i.e., make more resilient.
Getting functional design right
Based on my experience, you should first revisit your functional design. You should assess how you sliced and decoupled the different application parts at a functional level because that determines how resilient your application can become. If your coupling at the functional level is tight, no technical measure, resilience patterns included, will help.
But this is a long and challenging story about a problem that we as the software engineering community have not solved in the last 50 years. I will discuss this topic in more detail in future posts. 1
For this post, let us assume we have got the functional design right. Just be aware that usually this is not a given.
Again, now what?
Understanding your options
The next thing is to understand your options – which resilience patterns exist and how they can help. Unfortunately, a good and comprehensive collection does not (yet) exist – at least none that I am aware of 2. Still, if you look around, you will find some patterns here and some patterns there, and soon you will find yourself with a surprisingly big pattern collection.
The image above shows a resilience pattern collection I sometimes used in past resilience workshops 3. Even though that collection is still far from complete, the sheer number of patterns is overwhelming. So, there are many options, and it will probably take you some time to collect and understand all those patterns.
While it is quite challenging to understand all those patterns and their trade-offs (or at least a reasonable subset), the question that immediately follows is: What are we going to do with all these patterns?
How many patterns should we implement? Which of them? And how should we combine them?
This is the moment the engineer in us takes over: We learn all this new exciting stuff, all these fascinating patterns and understand how they can make our applications more resilient. We see what they can do. Now, we want to do it. We want to apply all those patterns. We want to see how they improve the application’s resilience. We want to implement them all!
I once experienced firsthand what happens if you let the curious and adventurous engineer take over. It was in 1997 or 1998. I handed a copy of the “Gang of Four” design patterns book, which was still quite new back then, to one of the senior developers in my project. He was really excited and wanted to use it to improve the code base. As this seemed like a useful idea, I did not think anything more of it.
I worked as architect and project manager in that project (hint based on personal experience: Never, ever combine these two roles! It was a PITA). Because managing the project and the client took quite some time, I did not have as much time for architectural work as I would have liked (and needed). I still did some conceptual architecture work and also discussed it with the lead developers, but I only worked in the code base myself once in a while.
A few weeks later, the colleague came to me and proudly told me that he had implemented all patterns from the book in our code base. This was the moment when my alarm bells went off. I mean, 3 or maybe 5 patterns certainly could have helped to improve our code base. But all 23 of them? Especially because I knew the book contained some patterns that did not make any sense in our code base. Our code base just did not provide the required context for those patterns.
So, I checked the code base … and it was a mess! It had become impossible to understand what was going on in the code. The application of way too many patterns – often in the wrong places – did not improve the code base. It resulted in the opposite effect: The code had become incomprehensible and brittle.
Not surprisingly, the team started to run into massive problems stabilizing the code at the same time. The number of bugs exploded and it felt like fixing one bug resulted in three new bugs. Nobody knew how the code would react if they applied a change. In short: The code base had become unmaintainable.
In the end, we needed to rewrite the whole code base more or less from scratch, as this was quicker and safer than reverting all the pattern-related changes while trying to keep the rest of the application logic we had added in the meantime.
There would be a lot more to say about the example and the project context it took place in. A lot of details are missing. But the key message here is:
Applying too many patterns is a bad idea.
This is true for all kinds of patterns, including resilience patterns.
Finding the right dose
This still leaves us with the question: How many patterns should we implement?
To be frank: I do not think this question has a simple answer – at least as far as I understand it. Still, I would like to offer you some ideas that, based on my experience, lead in the right direction:
- Patterns are options, not obligations – First of all, always keep in mind that you will never need all the patterns you know. They are only options you can pick from to make your application more resilient, and that is how they should be seen.
- Each pattern increases the application’s complexity – If you pick too many patterns, it will compromise your initial goal to make the application more reliable as each pattern increases the complexity of the solution. Added complexity makes the solution’s behavior harder to understand, leading to more unexpected problems at runtime. Added complexity also makes the code harder to understand, making it harder to maintain and evolve the code base, resulting in more bugs, also compromising reliability. Or as Sir Tony Hoare phrased it in his 1980 Turing award lecture: “The price of reliability is the pursuit of the utmost simplicity.” 4
- Each pattern costs money in development & operations – You also need to keep in mind that each pattern costs money in development as well as in production. Think of redundancy, for example: Running several instances of the same application part multiplies the runtime costs of that part. These development and runtime costs need to be balanced with the resilience budget you have.
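To make the redundancy example a bit more tangible, here is a minimal Python sketch. It assumes that instances fail independently of each other (which is rarely fully true in practice) and that the group is available as long as one instance is up. The function names and all numbers are purely illustrative and not taken from a real system.

```python
def redundant_availability(single_availability: float, instances: int) -> float:
    """Availability of a group of redundant instances.

    Assumption: instances fail independently, and the group is available
    as long as at least one instance is up.
    """
    return 1.0 - (1.0 - single_availability) ** instances


def runtime_cost(cost_per_instance: float, instances: int) -> float:
    """Runtime cost grows linearly with the number of instances."""
    return cost_per_instance * instances


if __name__ == "__main__":
    # Hypothetical numbers: 99% availability and 100 cost units per instance.
    for n in range(1, 5):
        availability = redundant_availability(0.99, n)
        cost = runtime_cost(100.0, n)
        print(f"{n} instance(s): availability={availability:.8f}, cost={cost:.0f}")
```

Under these assumptions, every additional instance adds the same cost, while the availability gained shrinks with each step.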
This leads to the core recommendation:
Do not use too many patterns.
You end up with two competing forces: On the one hand, you need to implement resilience patterns to improve the reliability of your application. On the other hand, if you implement too many patterns, the complexity of your application will explode, resulting in reduced reliability:
- Not implementing any resilience measures costs you money due to the reduced availability of your application and the resulting effects at the business level (see “The business case for resilient software design” blog post for more details).
- Implementing too many patterns also costs you money because you then spend more on resilience measures than they save you. On top of that, as discussed before, the excess complexity will compromise your availability and cost you even more.
Hence, the sweet spot – as always – is somewhere between the extremes.
“Just enough patterns” is what we should strive for.
You should look for complementary patterns to maximize impact while keeping the overall number of patterns used as small as possible. Often, small groups of patterns complement each other and their effects reinforce one another.
As far as I understand it, useful pattern combinations always depend on the given context. A set of patterns that performs very well in one context can perform poorly in a different context. Therefore, I cannot offer you any pre-packaged pattern collections that always form a great combo. Sorry about that.
But to give you a rough idea of what this could mean in practice, I will briefly sketch two quite well-known and successful pattern combination examples.
Example 1: Erlang/OTP
- The basic building blocks are actors that communicate via asynchronous messages, implementing the actor model.
- Then Erlang implements the let it crash pattern, which is a special form of a worker-supervisor pattern, consisting of the basic patterns escalation, monitor and restart.
- Additionally, the Erlang Virtual Machines (VMs) use a heartbeat protocol to detect failing remote VMs and thus enable Erlang clusters to span across multiple compute nodes.
- Erlang also implements hot deployments as a standard platform feature to maximize availability. 6
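The supervision idea from the list above (monitor, restart, escalation) can be sketched in a few lines of Python. This only illustrates the concept and is not how Erlang/OTP implements it: real OTP supervisors are separate processes with configurable restart strategies. The names `supervise`, `EscalatedFailure` and `make_flaky_worker` are made up for this sketch.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class EscalatedFailure(Exception):
    """Raised when the supervisor gives up and escalates to its own parent."""


def supervise(worker: Callable[[], T], max_restarts: int = 3) -> T:
    """Run a worker and restart it when it crashes, up to max_restarts times."""
    restarts = 0
    while True:
        try:
            # The worker contains no defensive error handling ("let it crash").
            return worker()
        except Exception as crash:
            # Monitor: the supervisor notices the crash.
            if restarts >= max_restarts:
                # Escalation: hand the problem to the next level.
                raise EscalatedFailure(f"gave up after {restarts} restarts") from crash
            # Restart: run a fresh attempt.
            restarts += 1


def make_flaky_worker(failures: int) -> Callable[[], str]:
    """Hypothetical worker that crashes a given number of times, then succeeds."""
    remaining = {"count": failures}

    def worker() -> str:
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise RuntimeError("simulated crash")
        return "done"

    return worker


if __name__ == "__main__":
    print(supervise(make_flaky_worker(2)))
```

The worker code stays free of error handling; all recovery logic lives in one place, the supervisor.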
These few resilience patterns enabled Ericsson to build ATM switches running Erlang that reached “9 nines” of availability which is a more than impressive number. These switches had a downtime of less than 1 second over the course of 20 years.
Admittedly, the underlying switch hardware was highly available (redundant hardware components) and the use case implemented was not overly complex. But still the availability achieved is impressive – software-wise based on a combination of a handful of resilience patterns that support the use case nicely.
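The relation between “9 nines” and the downtime mentioned above can be checked with a little arithmetic. The following Python sketch assumes an average year of 365.25 days; the function name is made up for this illustration.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # assumption: average year length


def allowed_downtime_seconds(nines: int, years: float) -> float:
    """Downtime budget for an availability of the given number of nines."""
    unavailability = 10.0 ** -nines  # e.g. 9 nines -> 0.000000001
    return unavailability * years * SECONDS_PER_YEAR


if __name__ == "__main__":
    # 9 nines over 20 years: roughly 0.63 seconds of downtime
    print(f"{allowed_downtime_seconds(9, 20):.2f} s")
```

For comparison, the same calculation for “5 nines” yields a bit more than 5 minutes of downtime per year.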
Example 2: Netflix (~2015)
The second example is Netflix, more precisely the patterns they were well known for around 2015, when Hystrix and Netflix OSS were highly popular. I am sure I missed some of the resilience measures Netflix implemented, but these are the ones that stood out back then.
- Netflix chose (micro)services and request/response communication based on HTTP as their basic building blocks.
- Due to the synchronous nature of HTTP, they put a lot of effort into latency management, bundled in their Hystrix library (meanwhile in maintenance mode): timeout, circuit breaker, retry, bounded queue, confinement and fallback to degrade their quality of service gracefully.
- They use redundancy in many places and several variants.
- Autoscaling using the share load pattern is just one of their uses of redundancy, but one of the best-known ones: Netflix was famous for causing up to one third of the US Internet downstream traffic at peak times, which only works with an excellent autoscaling implementation.
- They put a lot of effort into implementing great monitoring to always understand the status of their huge distributed system landscape.
- They implemented zero downtime deployments based on canary releases and rolling deployments.
- Finally, they were (and still are) famous for their (meanwhile retired) Simian Army, which injected errors of various kinds at runtime into their production environment to make sure that their resilience measures work not only in theory but also in practice.
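To give an impression of how two of the latency management patterns from the list work together, here is a minimal Python sketch of a circuit breaker combined with a fallback. It is not the Hystrix API and leaves out timeout, retry and bounded queues; the class name, its parameters and the injectable `clock` are assumptions made for this illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Fails fast and serves a fallback while a dependency looks broken."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock  # injectable to make the behavior testable
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        # After the reset interval, let a trial call through ("half-open").
        return self.clock() - self.opened_at < self.reset_after_s

    def call(self, action: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open():
            return fallback()  # fail fast, do not touch the dependency
        try:
            result = action()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()  # degrade gracefully
        # Success: close the circuit and forget earlier failures.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each remote call, e.g. `breaker.call(fetch_recommendations, lambda: [])`, where `fetch_recommendations` stands for a hypothetical remote call and the empty list is the degraded answer.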
Netflix’s pattern list is a bit longer than the Erlang/OTP list. But the use cases Netflix implements are also more complex than those of the Ericsson ATM switch. Also, Netflix built on top of synchronous HTTP-based communication, which requires comprehensive latency management if your aim is to achieve really high availability.
In the meantime, Netflix has moved on and implemented different, more sophisticated types of resilience measures. This is not surprising as the patterns I have shown here are the ones Netflix implemented about a decade ago. They had a decade to continuously learn and improve.
Nevertheless, I think this list is still very useful because what Netflix did a decade ago is still a lot more than what most companies do today regarding application resilience. So, this list can still serve as a good starting point for many companies.
Finding the right dose of resilience for your application is not easy. We know that putting too little effort into resilient software design is a bad idea – especially in the highly distributed system landscapes of today. But adding too many patterns is also a bad idea. It not only costs too much money, it also makes the system landscape overly complex and thereby less reliable and more fragile.
Hence, finding the right balance is key:
Implement as few patterns as possible, but not less.
The right number of patterns depends on your context and your use case. A one-size-fits-all recommendation does not exist. But as the two examples of Erlang/OTP and Netflix have shown, you do not need to implement dozens of patterns to create highly available and reliable systems.
I hope this post gave you a few ideas on how to find the right dose of resilience patterns in your context.
And if you have any good recommendations to add: Please share them with the community. We still have a lot to learn regarding resilience in IT.
1. If you would like to understand the problem a bit better, you might want to have a look at my “Resilient functional service design” slide deck, which discusses the problem and also includes links to other slide decks that dive a bit deeper into certain aspects.
2. I am currently working on a web site that (among other patterns) collects resilience patterns. But this is still work in progress. And it is a lot of work. Thus, please bear with me …
3. Note that I never discussed all those patterns in my workshops as that would have resulted in massive cognitive overload for the participants. I only used the collection as a map to illustrate the context of the patterns I discussed.
4. Sir Tony Hoare’s Turing award lecture IMO is a timeless masterpiece and has become famous as “The Emperor’s Old Clothes”. The lecture was originally published in the February 1981 edition of the Communications of the ACM (Volume 24, Number 2). Today, you can find several versions of the lecture in several places on the Internet.
5. Akka implements most of the Erlang/OTP patterns for the JVM. Just note that because the JVM and the Erlang/OTP runtime have some fundamental differences, Akka does not implement all the patterns the same way Erlang does.
6. For the sake of conciseness, I do not explain the patterns mentioned here in detail. The resilience (and more) patterns web site I am currently building will eventually explain all the mentioned patterns and many more. If you do not want to wait that long, just search for those patterns on the Internet.