The 100% availability trap
I briefly mentioned the 100% availability trap in a prior post. As this misconception is so widespread, I decided to discuss it in more detail in this post.
A typical discussion
Let me start with two familiar situations. In the first situation, a developer, aware of the fact that she might experience some kind of failure accessing a remote system, reaches out to her product manager:
Developer: “How should the application respond if it cannot reach that other system due to a technical problem?”
The product manager seems to be confused by the question at first, but then answers:
Product manager: “Such a failure must not happen! If it can happen, it is a bug and you need to fix it.”
The product manager fell into the 100% availability trap. He assumed it to be normal that all systems are available all the time, and thus that it must be a software bug if they are not. This kind of perception can be observed quite often among people who are not deep into IT. For them, IT is something that always works as intended unless someone does something wrong.
But you can also observe the same fallacy among people who are deep into IT which brings me to the second situation. I experienced this situation almost exactly as I describe it here:
I reviewed a design and noticed that basically all parts of the application depended on a single, central service. So, I asked the developers:
Me: “How do you handle the situation if that one service does not respond or does not respond in time?”
The developers looked at me with a confused look. They seemed to be genuinely surprised by the question. After a moment, a developer answered:
Developer 1: “We did not implement any extra measures. This service is so important for everything and thus needs to be highly available. Hence, it is not worth the effort.”
And just to emphasize the statement of the colleague, a second developer added:
Developer 2: “Actually, if that service should be down, the other services would not be able to do anything useful anyway. Thus, it just needs to be up.”
Those developers also fell into the 100% availability trap. Their design blindly relied on the availability of that central service. To be clear: these were highly skilled developers with many years of software development experience. Still, they fell for the trap. So, the 100% availability trap is not limited to people who are unfamiliar with IT.
Reasons to fall for the trap
This raises the question: Why do so many people, even people who know IT very well, fall into the 100% availability trap?
Why do they not realize that the other systems they use, whether other applications, services or infrastructure components, will not be available all the time?
I am not perfectly sure why it is this way. But based on my experience, the following drivers contribute to it:
- The IMO biggest driver is that most people, including most people in IT, do not understand the effects of distributed systems. To be fair: distributed systems and their failure modes are really hard to understand, and our regular computer science education does not teach them (or at least not in enough depth).
- An aggravating driver is the tendency of our industry to “shield” developers from the intricacies of distribution. From DCE via CORBA, J(2)EE/DCOM and SOA to the containers and service meshes of today, the industry comes up with ever newer waves of standards and products that “hide” the complexity of distributed systems behind tools and interfaces and give developers the deceptive illusion of still working inside a single process. 1
- The fact that in many organizations developers are still shielded from production does not make it any better. Even if the operations team knows exactly about the imponderables of distributed systems, the feedback channel to the development team does not exist, making it very hard for developers to understand the consequences of their decisions in production.
- And finally, math is hard and probability theory is even harder – even the simple parts. An availability of 99.9% or better sounds like a lot and entices many developers not to care about potential downtimes. Somewhere in the back of their minds the 99.9% gets rounded to 100% – and we fall for the trap. We also tend to forget that probabilities multiply, i.e., availability goes down with every process, hardware component or network connection added. And we forget that probabilities are non-deterministic beasts: we never know when they will strike and how hard they will strike.
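To make the multiplication effect concrete, here is a tiny back-of-the-envelope sketch in Python. The numbers are purely illustrative:

```python
def combined_availability(availabilities):
    """Availability of a call chain where every component must be up:
    the individual availabilities simply multiply."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

# Ten serially dependent components with "three nines" each:
total = combined_availability([0.999] * 10)
print(f"combined availability: {total:.4%}")
print(f"expected downtime per year: {(1 - total) * 365 * 24:.0f} hours")
```

Ten “three nines” dependencies in a row already drag the combined availability down to roughly 99.0%, i.e., dozens of hours of expected downtime per year.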
Probably, there are more drivers than the ones I listed here. But even those drivers are sufficient for most people to fall into the 100% availability trap. They are either not aware of or underestimate the probability that parts of the system landscape they use may fail – the more parts involved, the more likely. And more or less unconsciously they start doing their work neglecting that anything might fail at all.
Consequences of falling into the trap
The consequence of falling into the trap is that systems are designed and implemented without considering what to do if some non-local resource is not available or latent. 2
Just look at some arbitrary enterprise application code. Look at database access. How does the system respond if the request takes too long (and thus blocks the connection pool)? Usually not taken care of.
Or how does the system respond if the IAM solution does not respond or is latent (and thus blocks the connection pool)? Usually not taken care of.
And so on.
If the programming language used throws an exception when it cannot reach the remote peer (e.g., Java or C#), the exception will be caught somewhere – unless it is a runtime exception that does not force you to catch it. And then? Usually, the exception is re-thrown, wrapped in a runtime exception to make sure it does not need to be caught at each level of the call stack. At the top level, some catch-all construct catches all exceptions to make sure the application does not crash, logs the exception and continues – because it does not have any idea how to handle the exception.
Timeouts, on the other hand, often are not handled at all. Instead, the remote call just blocks until either the call returns or the TCP timeout (300 seconds by default) strikes, returning an error or exception (again, depending on the programming language used).
Or some kind of reactive, “non-blocking” programming is used where the remote call is delegated to some hidden thread that calls a callback we provided or sends an event we subscribed to after completion. At first sight, this looks better because our control flow is not blocked by some remote call. But under the hood the blocking remote call still takes place, blocking one of the threads of some underlying worker thread pool. Thus: same problem, just not immediately visible anymore.
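For contrast, here is a minimal sketch in Python of a remote call guarded by an explicit timeout and a defined fallback. `slow_lookup` is a made-up stand-in for a latent remote peer (think of an IAM lookup), not a real API:

```python
import concurrent.futures
import time

def slow_lookup(user_id):
    """Stand-in for a remote call that has gone latent."""
    time.sleep(2)  # simulates the latent remote peer
    return {"user": user_id, "roles": ["admin"]}

pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def lookup_with_timeout(user_id, timeout=0.2):
    """Fail fast with a defined fallback instead of blocking for the
    full TCP timeout."""
    future = pool.submit(slow_lookup, user_id)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        return {"user": user_id, "roles": []}  # degraded but defined behavior

print(lookup_with_timeout("alice"))  # falls back after 0.2 s instead of hanging
```

Note that this only makes the caller’s behavior defined: the worker thread still blocks until `slow_lookup` returns, which is exactly the hidden blocking described above. Preventing pool exhaustion additionally requires bounding the pool and shedding load.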
This absence of a sensible handling of remote call failures leads to brittle systems at runtime that in the worst case exhibit undefined behavior if something “unexpectedly” goes wrong, resulting in long outages, data inconsistencies, data loss or worse.
The latent directory server
I once had a client that experienced a large system outage due to something as seemingly harmless as a latent directory server. The directory server, used by many applications to authenticate and authorize users and requests, went latent for some unknown reason.
Due to the added latency, the thread pools of the programs using the directory server were quickly exhausted, with all threads waiting for the directory server. This in turn blocked other programs that tried to access the blocked programs. And so on. Within a few minutes, the whole system landscape stood still.
And if that were not bad enough, the real “fun” started when they tried to restart their system landscape. They needed to stop and restart all blocked applications, i.e., their whole system landscape.
But when trying to restart the applications, most of them immediately stopped again because they could not establish a connection to some other system they needed to access at runtime. The poor system administrators were flooded with log messages of the type “Could not start system <X> because could not establish connection to system <Y>. Shutting down system <X>.”.
When their developers implemented the startup sequences of the applications, they implicitly expected all other systems to be available all the time. Otherwise, they would have implemented some kind of connection retry with exponential back-off or the like. Hello again, 100% availability trap!
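Such a startup-time retry with exponential back-off could look like the following sketch in Python. `flaky_connect` is a made-up stand-in for a dependency that only comes up after a while:

```python
import time

def connect_with_backoff(connect, max_attempts=8, base_delay=0.5, max_delay=30.0):
    """Retry a connection attempt with exponential back-off instead of
    shutting the whole application down on the first failure."""
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up only after all attempts are exhausted
            delay = min(base_delay * 2 ** attempt, max_delay)
            time.sleep(delay)

# Stand-in dependency that only becomes reachable on the third attempt:
attempts = {"count": 0}
def flaky_connect():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("system <Y> not reachable yet")
    return "connected"

print(connect_with_backoff(flaky_connect, base_delay=0.01))
```

With a few lines like these, a working startup order no longer needs to be figured out manually – the applications simply wait for their dependencies.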
As a consequence, the poor administrators had to figure out a working system startup order manually. They basically figured out the hard way, and with lots of stress, the order in which the systems had been developed over the years. It took them several hours to bring their system landscape back up.
Luckily, the company survived the long outage. But there are enough companies where the financial loss due to such a long system outage can be life-threatening. Nevertheless, I feel sorry for those poor administrators who probably had the stress of their lifetime – just because the developers and all other people involved in the development of their systems fell into the 100% availability trap.
Doing it better
This raises the question: How can we keep people from falling into the 100% availability trap?
Of course, better education comes to mind. Most computer science education, no matter if at university or later, alongside work, mostly neglects distributed systems. Most education is still based on the “single process, single thread” context.
Just look at the average code retreat and see what is taught there. Do not get me wrong: Things taught at code retreats are usually quite valuable for becoming a better software engineer. But still, they almost never deal with distributed systems.
At most universities, you can also get through a computer science education without needing to learn anything about distributed systems.
In times of microservices and massively distributed system landscapes, this does not feel contemporary anymore. So, it would be great if distributed systems education were given a higher priority.
Still, you cannot force people to learn about distributed systems. You cannot say: “Hey, you first need to do a 3-week (or better: 3-month) training on the foundations of distributed systems before you are allowed to design your first microservice.” Even if it may feel appealing sometimes, this is not the way to go.
So, what else could help?
Here are three ideas that come to my mind.
The first idea is establishing what I call the “ops-dev feedback loop”. As long as developers are not aware of the consequences of their actions in production, they will easily fall into the 100% availability trap – simply because they do not realize that they fell into it. I will describe the ops-dev feedback loop in more detail in the next post. Therefore, I will not detail it here.
The second idea is to introduce chaos engineering. For those who do not know what chaos engineering is about and do not want to click on the link:
Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production. – Taken from Principles of Chaos Engineering
Chaos Engineering is about controlled experiments – usually in a production-like environment – that help you to learn how resilient your application is and to detect not yet known failure modes.
From a developer’s perspective, chaos engineering – if done right – feels like a cooperative game where the developers try to “beat” the chaos experiments. This gamification aspect of chaos engineering can be used to greatly raise developers’ awareness of not yet considered failure modes and can help to overcome the tendency to fall into the 100% availability trap. 3
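Just to illustrate the idea (real chaos engineering tools inject failures at the infrastructure level, not in application code), a toy failure-injection wrapper in Python could look like the following sketch. All names are made up:

```python
import random

def with_chaos(func, failure_rate=0.3, rng=None):
    """Wrap a callable so it sometimes raises ConnectionError – a toy
    failure-injection experiment to exercise the caller's error handling."""
    rng = rng or random.Random()
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ConnectionError("injected failure")
        return func(*args, **kwargs)
    return wrapped

# A resilient caller should survive the injected failures with a fallback:
fetch = with_chaos(lambda: "fresh data", failure_rate=0.5, rng=random.Random(42))

results = []
for _ in range(10):
    try:
        results.append(fetch())
    except ConnectionError:
        results.append("cached data")  # defined fallback behavior

print(results)  # a mix of "fresh data" and "cached data", never a crash
```

The “game” for the developers is to make sure the caller keeps producing a defined result no matter when the injected failures strike.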
The third idea is to copy what Google did when they realized that some of their developers fell into the 100% availability trap regarding their distributed lock service Chubby. I already mentioned it in a prior post. But I think it is important to repeat the story because what Google did was very powerful while being totally simple at the same time – and the opposite of what most enterprises tend to do.
Quite a few Google engineers relied on the Chubby service always being available. Chubby was highly available. Actually, the availability of Chubby was so good that quite a few engineers at Google fell into the 100% availability trap, assuming that Chubby would never be unavailable. The consequence: whenever Chubby had one of its rare problems, a lot of other services went down with it.
The typical enterprise response in that situation would be: “We need to improve the availability of Chubby.” In the worst (and unfortunately not unlikely) case, they would blame the Chubby team for their outages and require them to “solve the problem”.
This typical enterprise response somehow reflects the reaction of the developers in the second example at the beginning of this post: “We rely on the central service being highly available instead of adding resilience measures in the services that use the central service.” The underlying assumption is that the availability of that other service will be improved if they should encounter any problems, because the failure of that other service would be the “root cause” of their problems.
Google did something completely different. They started to shut down Chubby regularly for short periods of time, thereby reducing its availability. This move forced all the other engineers to escape their mental 100% availability trap. They could no longer rely on Chubby being available all the time. Actually, they knew for sure that Chubby would not be available all the time. So, they added provisions in their code to handle a potential unavailability of Chubby.
The consequence: the overall availability of the Google system landscape went up significantly. This measure, counterintuitive at first sight, helped everyone escape the 100% availability trap.
Probably there are a lot more useful measures. But those were the ones that came to my mind.
We have seen that many people, including software engineers, fall into the 100% availability trap: the assumption that the whole IT landscape around them is always up and running without any issues. This fallacy leads to brittle system designs where small issues in one system can bring down many other systems, if not the whole system landscape.
We have looked at some drivers like people not understanding distributed systems or developers being shielded from operations that reinforce the 100% availability trap. And finally, we have looked at some ideas like establishing the ops-dev feedback loop (which I will discuss in more depth in the next post), introducing chaos engineering or repeatedly shutting systems down for short periods of time to make explicit that systems are not available all the time and thereby breaking the 100% availability trap.
I will leave it there and hope that from now on you will identify the 100% availability trap whenever you see it and have some ideas how to overcome it.
Jim Waldo, Geoff Wyant, Ann Wollrath and Sam Kendall discussed as early as 1994, in their famous paper “A Note on Distributed Computing”, that you cannot treat distributed computing like local computing and vice versa. Now, almost 30 years later, we keep making the same mistake over and over again. ↩︎
I deliberately left out second-order effects of systems being down or latent, like lost or duplicate messages, out-of-order message arrival or out-of-sync state information in different parts of the system landscape, with all their consequences. If I added those second-order effects, things would become a lot more complicated than just detecting error messages and timeouts. Just be aware that all those things are also consequences of falling for the 100% availability trap and that most of the time you do not find any provisions in the code to handle them appropriately, i.e., the system behavior usually is undefined if such an error occurs. ↩︎
There are a lot of places to learn more about chaos engineering. A personal recommendation is the Steadybit blog. Steadybit is a chaos engineering company founded by some former colleagues of mine and I trust them to do a good job. So yes, I am biased. But at least I told you so … ;) ↩︎