Why we need resilient software design - Part 2
In the previous, introductory post why we need resilient software design, we discussed the stepwise journey from isolated monolithic applications to distributed system landscapes where applications continually communicate with each other. We also discussed that the number of peers involved continually grew (and still grows), while the update propagation duration expectations became shorter and the availability expectations went to “never down” at the same time.
In this blog post, we will discuss what this kind of distribution means in term of failure modes that can occur and what their concrete consequences are regarding application behavior.
Everything fails, all the time
When discussing the failure modes of distributed systems, I quite often start with a famous quote from Werner Vogels, CTO at Amazon in which he describes distributed systems in a nutshell:
Everything fails, all the time. – Werner Vogels
But why did he say something like this?
The core problem is that distributed systems exhibit failure modes that simply do not exist inside a single process context:
- Crash failure – A node or process responds nicely all the time and then it stops working by crashing. This is by far the simplest failure mode, but it still can result in loss of data.
- Omission failure – The communication between two or more nodes becomes “brittle” or “flaky”, i.e., you sometimes get a response for a request and sometimes you do not, resulting in message loss and all its effects.
- Timing failure – Timing failures are the “arch-enemy” of robustness in distributed systems: You get a response to your request, but it takes too long with respect to your specified normal system behavior (which you typically write down in a SLA, SLO or alike) 1. Be aware that latency can kill the availability of your system landscape extremely fast by exhausting your application’s network connection thread pools (or its worker threads if you missed to create explicit network connection thread pools).
- Response failure – A response failure means that you get a response, it just is wrong. In times of data replication, massive use of caches and eventual consistency this happens way more often than you might expect.
- Byzantine failure – Byzantine failures mean that a node or process went bad, either deliberately (e.g., due to a malicious attack) or not, showing erratic behavior in unpredictable ways. There are special algorithms needed to deal with them. 2
Failure modes in more detail
While these failure modes basically cover all of the effects you can only observe in distributed systems, they tend to be too abstract for many developers to grasp their effects at the application level. Hence, let me describe the most important effects resulting form these failure modes at the application level:
- Lost messages – The message containing important information is lost. Either your peer has outdated knowledge or you do, depending on the direction of the lost message. It is also possible that the acknowledgement of some remote processing is lost and thus you think your processing order was not executed and you resend it – even if it already has been completed. Without additional measures (i.e., making your messages and their processing idempotent) this leads to duplicate processing which depending on the context can be anything between annoying and harmful.
- Incomplete messages – Only parts of a message arrive because parts of it are lost in communication. While this usually gets noticed by the underlying network protocols like, e.g. TCP, it can happen under unfortunate circumstances that this goes unnoticed and you or your peer miss relevant parts of the information exchanged – and again, nodes are out of sync.
- Duplicate messages – Always sending messages exactly once is almost impossible in distributed systems. There is always a chance of either losing messages (“at most once” message delivery) or messages sent more than once (“at least once” message delivery) if some of the aforementioned failure modes occurs. Without additional measures, message duplication leads to duplicate processing which depending on the context can be anything between annoying and harmful.
- Distorted messages – As with incomplete messages, distorted messages are usually noticed by the underlying network protocols like, e.g. TCP. Still, under unfortunate circumstances it can happen that parts of a message go distorted unnoticed and you or your peer may process wrong information. The likelihood of such failures increases if you implement your own protocols because you think that the existing standards are not sufficient for your needs.
- Delayed messages – Delayed messages easily let you miss your response time promises and lead to annoyed system users. Additionally, delayed messages can easily exhaust your thread pools waiting for responses and those of your callers, and their callers, etc., until the whole system landscape is standing still in no time. Finally, they can lead to the remaining two effects described next.
- Out-of-order message arrival – Due to latency a message A that was created before another message B reaches a given destination after message B. If those two message are causally unrelated, usually everything is fine. But if they are causally related, it can lead to problems. Just think about a withdrawal message for a given account that arrives before a deposit message for the same account. If the account does not have enough money on it for the withdrawal, you might reject the withdrawal. Now assume that in reality the deposit took place before the withdrawal which means in reality there is enough money on the account. Just because of the delayed deposit message delivery, you assumed otherwise. The question is: Is your application prepared to detect and automatically correct such an error on the business logic level due to out-of-order message arrival?
- Partial, out-of-sync local memory – More generally, it can be said that each node always has its own partial, incomplete view of the global system state. Due to lost messages, message delays, etc. there is a big likelihood that two different nodes do not share the same view on the system state. This means that whenever you reach out to a different node, you must expect that the information it sends you back may conflict with your current state. You need to detect and handle such situations in your application.
And so on. These were some of the effects you may experience on the application level inside distributed applications.
Simple problems becoming very hard ones
Overall, going from a single process to a distributed system turns simple problems into very hard ones. Sometimes, they even become impossible to solve. E.g.:
- The order of events – Inside a process, it is obvious which event happened before another event (unless you naively use threads inside a process). Across process boundaries, it is not only hard, sometimes it is not possible at all to determine the order of two concurrent events. If the two events are not causally related, usually this is not a big issue. But if the two events are causally related, you may be in big trouble. E.g., the one event tells you, you still have 10 items in stock, the other event tells you have only have 3 items in stock left. You need to reorder if your supply falls below 5 items. Which information is correct? Which one is the recent one? Or are they both wrong because the actions leading to the events were executed concurrently without being aware of each other and modified the same original stock item number? Do you need to fill up your stock items or not?
- Consensus – Inside a process, you always know the values of your data items. Across process boundaries it can be impossible to agree upon the value of an arbitrary data item, if just one process fails (see, e.g., “Impossibility of distributed consensus with one faulty process”) by Michael J. Fischer, Nancy A. Lynch and Michael S. Paterson, also know as “FLP proof”, based on the first letters of the surnames of the authors).
These are just two examples of things that are hard problems or sometimes even impossible in distributed systems. Be aware that most of our reasoning in developing software relies on knowing the order in which things happen and having a clear understanding of the current state. In distributed systems, you can rely on neither of them.
I would like to conclude this discussion of failure modes in distributed systems and their effects on the application level by referencing another famous paper by Nancy A. Lynch: ”A hundred impossibility proofs for distributed computing”. In this paper, Nancy Lynch collected 100 proofs about things that are impossible in distributed systems (and often are no brainer problems or even do not exist at all inside a process).
Distributed systems are hard.
Or as Leslie Lamport once phrased it a bit sloppily:
A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable. – Leslie Lamport
In this blog post, we discussed what distributed systems mean in terms of failure modes that can occur and what their concrete consequences are regarding application behavior. We have seen that failure modes that are unique to distributed system and do not occur inside process boundaries can have very nasty effects at the application level.
There would be a lot more to say about distributed systems, their special properties and the effects of their failure modes. But this post is long enough already. Maybe I will write more about it in some future post(s).
In the next post, we will discuss that these failure modes are not just things that may happen in theory, but they are very real and will make your systems fail for sure without good countermeasures (and even with good countermeasures they will get you eventually). Stay tuned … ;)
Quite often, for various types of reasons people do not specify the normal system behavior including expected response times. But this will not make timing failures go away magically. If you do not specify your expected system behavior explicitly, implicit assumptions will be made by the users of your system. Violating those implicit assumptions in the end still means a failure, no matter if you are willing to accept it or not. And usually, you are better off making the expected system behavior explicit than waiting for your users making their own assumptions. Thus, I can only recommend to specify your expected system behavior explicitly. ↩︎
We will not discuss Byzantine failures in more detail in this blog post series. Maybe I will spend a post or two on them in the future. Just be aware that this failure modes exists in practice because in most places they are ignored. Most distributed systems science paper exclude Byzantine failure in their prerequisites and product vendor whitepapers that “prove” some desirable behavior of their products basically always ignore the existence of this failure mode. Hence, be attentive when reading such papers. They may be correct in the context they assume. But reality does not care about the context assumed in a paper and Byzantine failures exist. For an introduction in Byzantine failures, best start with the original paper “The Byzantine generals problem” by Leslie Lamport, Robert Shostak and Marshall Pease. ↩︎