The non-existence of ACID consistency - Part 1
Well, of course ACID consistency exists – and it is a good thing that it exists. Thus, feel free to call the post title clickbait … ;)
My point here is that it should not exist as functional requirement. But putting all these nuances in the title would have resulted in an awfully long title. So, I decided to keep the title short and risk the accusation of clickbait in return.
But what do I mean with ACID consistency should not exist as functional requirement?
I quite often see the implicit or explicit business requirement that updates should become visible atomically in all systems affected. E.g., you update an insurance policy in your policy system and it is expected that at the very same moment the claims systems, the accounting systems, the CRM systems, etc. all see the updated policy information – no delays, no two systems temporarily showing diverging information.
This is not necessarily written down as a requirement. But if you ask the people involved if it would be okay to update any of the affected systems a bit later, you typically get a determined shake of the head and the answer that this must not happen.
So, strong consistency with ACID properties demanded across the whole IT system landscape.
If you then ask why it is not possible to relax the consistency requirements to some form of BASE consistency, you often end up in odd discussions, not only with the people from the business departments but also with your own peers in IT. Typically, these discussions either end in some variant of “because that is how we do things here” or “because it is needed for business reasons”.
Personally, I think the “because this is how we do things here” argument is a dangerous one in most situations, not only the one described here. But even if this is an argument you hear more often than one would expect (sometimes in disguise, yet being the same argument), I will not discuss it here any further.
Instead, I will focus on the second argument “because it is needed for business reasons”.
As the whole discussion is too long for a single post, I split it up in 4 posts:
- The consistency trap in distributed systems (this post)
- ACID consistency cannot be a valid business requirement
- The real value of ACID consistency
- Be aware that “ACID” usually is not what you think it is
This first post discusses the consistency trap in distributed systems.
Distributed systems everywhere
I will start here with a discussion why I think the requirement of strong consistency across multiple systems is problematic per se.
If you would only need to update a single data store, everything would be fine. Use strong consistency for updating the data store and you are done. But usually that is not the case. (Not only) Chas Emerick already stated in 2014 in his presentation “Distributed systems and the end of the API”:
(Almost) every system is a distributed system. – Chas Emerick
Even the system landscapes of medium sized companies tend to consist of more than 100 applications. Big corporates sometimes have more than 10.000 applications. Most of these applications communicate with each other – these days preferably online.
I remember doing an architectural assessment in a private bank some years ago. We identified around 40 systems relevant for our assessment with more than 600 (!) communication paths between them. Now consider the number of communication paths between 10.000 applications. Yes, there will be a lot. 1
With the rise of microservices and mobile computing the number of communication peers grew massively in the recent years and developments like edge computing give it another boost. Overall, we can say that distribution is ubiquitous in IT system landscapes and the number of communications paths between the distributed system parts is huge.
10 or 15 years ago, many of these connections were still batch-type integrations where updates were collected on one side and then periodically fetched by (or pushed to) the other side. This kind of communication is quite tolerant regarding unexpected communication failures. No data is lost and data transfer can be repeated until it eventually succeeds. As batch communication by definition excludes strong consistency across application boundaries, this whole discussion did not exist.
But today most communication paths are online, i.e., updates are typically forwarded to the receiving systems when they occur. With the rise of online communication between systems the demand for strong consistency between those systems grew, as I described it in the beginning of this post.
The consistency trap
But if you demand strong consistency between all parts of the system landscape that are affected by an update, you run into an availability problem that grows disproportionately with the number of parts involved.
Not rarely a policy system of a medium sized insurance company delivers its updates to 50 or more systems. Or a CRM system of a telecommunication provider that delivers its updates to more than 100 connected systems.
I once had a project in an insurance company where we had the requirement that updates to the policy system need to be propagated to the ~50 connected systems in an ACID fashion. I.e., the update should be written to all systems in a single distributed transaction. I asked them if they were sure that they wanted this. Their answer was: Most definitely.
Okay, let us do the math what this requirement means in practice:
- Let us assume an availability of 99,5% including planned downtime for all our applications. Be aware that this is quite a high availability for on-premises computing. You will find very few ops departments who will give you such a high availability SLA for your applications. 2
- Now assume 10 application that need to be updated in a single transaction.
- All applications need to be available, i.e., you need to multiply the individual availabilities to calculate the availability of all 10 systems, which determines the likelihood that your update will be successful.
- For 10 system involved in a transaction, you need to calculate 99,5% ^ 10 which results in ~95% success probability for your transaction. This means if you need to update 10 systems together in an ACID fashion, 1 out of 20 transactions will fail in average over time.
- For 50 systems involved, the success probability goes down to ~75%, i.e., 1 out of 4 transactions will fail in average over time. 3
This is why I think a requirement demanding strong consistency across process and data store boundaries is problematic. You will inevitably run into the availability trap sketched before. The more systems are involved, the higher the chance of a failure. And in today’s times of highly distributed and interconnected system landscapes, typically there are more systems and services involved than you can afford to keep in sync using strong consistency. 4
As you have seen, even with only 10 peers involved and a really high availability of 99,5% including planned downtime, 1 out of 20 transactions will fail in average over time. Usually the availability that your ops team will guarantee for an application via SLA is quite a bit lower – more in the range of 99% excluding planned downtime. This means you need to expect a much higher update failure rate if you insist in strong cross-application consistency.
If only 2 or 3 systems were involved in an update as it still often was the case in the 1990s, distributed transactions that spanned these application were an option. The overall availability penalty was small enough that it did not matter.
But in today’s highly distributed and interconnected IT system landscapes with many systems and services involved in updates, 24/7 uptime expectations and very high end-to-end availability demands, this is not an option anymore. This is the road to fragile and failure-prone systems and processes.
As a rule of thumb I would recommend: Do not try to keep more than 3, max. 4 systems in sync using distributed transactions (or other means that guarantee strong consistency between those systems). Everything else should be updated in an eventual consistent way.
This first post discussed why it is not a good idea to demand strong consistency with ACID properties across multiple distributed data nodes. Already with a relatively small number of nodes involved, the update success probability and thus the overall system availability would be compromised so much that it would break user expectations significantly.
In the next post, I will discuss why the “strong consistency is needed for business reasons” requirement is void per se. Stay tuned … ;)
If you think that some integration hub will solve that issue elegantly, please think again. Whole generations of CIOs, architects and developers tried to get there with their EAI, ESB, etc. solutions and it never worked the way they imagined it – which is obvious if you honestly think it through: Even if you are able to hide all your communication paths behind a nice rectangle on your architecture diagram, you still have all the logical communication paths you had before with all their intricacies. Usually, you also still have the same variety of interface technologies with all their quirks you need to support. The only difference is that you now also have that integration behemoth in the middle. In the beginning it might speed you up a bit because it offers built-in support for connecting to several kinds of interface technologies. But over time, changes typically become more and more effortful and slower and slower that eventually parties will start to work around it. Again, all the quirks and intricacies of the systems to integrate are not gone. Ideas like a “canonical data model” that were quite popular some years ago, just add another level of indirection and complication which almost always outweighs the theoretical benefits in practice. Adding business logic in the integration hub (what most companies start doing after a moment) will make everyone waiting for the integration team. And so on. Maybe I will discuss that in more detail in some future past. The point for this post is that your integration efforts will still be closer to O(n^2) than O(n), with n being the number of applications in your system landscape. ↩︎
Even for mainframe computing, you usually only get a SLA of 99,7% – but only while the transaction monitors are up! The transaction monitors typically are up around 14 hours per day. The rest of the time is reserved for batch processing and maintenance. I.e., the overall availability on a mainframe is typically around 50% because the transaction monitors are usually also down the whole weekend. ↩︎
Even if you assume all your applications run on the mainframe and you only trigger updates while the transaction monitors are up (which is not suitable for 24/7 scenarios), using the SLA described in the prior footnote you can expect a success probability of 99,7% ^ 50 which is ~86% only, i.e., 1 out of 7 transactions will fail in average over time. ↩︎
If you should ask yourself how the story with the client ended: I presented the calculation to the client. It made them rethink their requirement and we found a sensible compromise. ↩︎