The limits of the Saga pattern
The Saga pattern has become quite popular in the context of microservices. While it is basically described correctly in the referenced source, most often I see a grave misunderstanding.
The typical narrative goes like: “If you have a transaction that spans multiple services and their databases, do not use distributed transactions. Use the Saga pattern instead: Call the affected services in a row – typically by using events for activity propagation. If a service fails to update its database, roll back the transaction by triggering compensating actions in the services that already ran their updates.”
In short: Avoid the coordination overheads of distributed transactions by accepting eventual consistency and actively undoing partial updates if something goes wrong.
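The narrative above can be sketched in a few lines. This is a minimal, illustrative orchestrator with made-up service steps (reserving items, charging a card), not a production implementation:

```python
class BusinessError(Exception):
    """A business rule was violated (e.g., an expired credit card)."""

def run_saga(steps, ctx):
    """Run each saga step; on a business error, compensate in reverse order."""
    done = []
    try:
        for action, compensate in steps:
            action(ctx)
            done.append(compensate)
    except BusinessError:
        for compensate in reversed(done):
            compensate(ctx)  # logically undo the partial updates
        raise

# Hypothetical example steps:
def reserve_items(ctx): ctx["reserved"] = True
def release_items(ctx): ctx["reserved"] = False

def charge_card(ctx):
    if ctx.get("card_expired"):
        raise BusinessError("card expired")

def refund_card(ctx): pass

saga = [(reserve_items, release_items), (charge_card, refund_card)]
```

If `charge_card` raises the business error, only the compensations of the steps that already ran are triggered – exactly the “roll back by compensating” idea.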
While this is a valid strategy in certain contexts, it misses an essential point – and from what I have seen so far most advocates and users of the Saga pattern also miss that point.
The point is:
The Saga pattern can only be used to logically roll back transactions due to business errors.
The Saga pattern cannot be used to respond to technical errors.
A business error occurs if an existing business rule is violated. E.g., if you try to pay with an expired credit card or if a pending payment would exceed your credit limit: That is a business error. From a technical point of view, everything is fine.
A technical error, on the other hand, occurs if something goes wrong on the technology or infrastructure level. E.g., if your database does not respond or throws an unexpected technical error (the dreaded “IOException”), if a service is down or latent, if a message is lost or corrupted, if your replicated and eventually consistent database is out of sync: All of those are technical errors. From a business point of view, everything is fine.
Why does it make a huge difference if you encounter a business or a technical error?
Let us say you get an unexpected error message when trying to write to your database – a technical error. Then you decide to abort the business transaction using the Saga pattern and trigger a compensating action. Now assume that the compensating action fails due to another technical error (keep in mind that technical errors often occur in clusters).
What now? A compensating compensating action?
And if that one fails? A compensating compensating compensating action?
And so on …
We could go on for a long time, but the problem should be clear:
Technical errors in a distributed system landscape are non-deterministic 1. You cannot predict when they will occur or how they will manifest. The only thing you know for sure is that they will happen, i.e., their probability is greater than 0.
Thus, if you try to respond to technical errors with a one-shot, deterministic approach like the Saga pattern, chances are that your compensating action will also fall prey to another unexpected technical error.
In practice this means if you naively apply the Saga pattern also to respond to technical errors, you will eventually end up with an inconsistent database – guaranteed!
Dealing with technical errors
This leaves the question: How can we deal with technical errors?
In general, the only way is to strive for eventual completion. If you face a technical error, you need to retry in some (non-naive) way until you eventually overcome the error and your activity succeeds. This means retrying in combination with waiting, escalation strategies, etc. 2
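Such a non-naive retry might look roughly like this; the function name, retry budget, and escalation hook are all illustrative assumptions:

```python
import random
import time

def retry_eventually(op, max_attempts=8, base_delay=0.1, escalate=None):
    """Retry op with exponential backoff and jitter; escalate when exhausted."""
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            # Back off exponentially, with jitter to avoid thundering herds.
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
            time.sleep(delay)
    if escalate:
        escalate()  # e.g., park the change in a dead-letter store, alert a human
    raise RuntimeError("operation did not complete within the retry budget")
```

The escalation hook is the important part: when retrying alone does not help, the pending change must not be silently dropped but handed over to some higher-level strategy.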
The point is that you cannot simply roll back a partial distributed update if a technical error occurs. There is a reason for the coordination overhead of distributed transaction protocols like 2PC (two-phase commit) and the performance and availability penalty you get from it.
If you are not willing to pay that price (which often is a sensible choice), you need to strive for completion, because you cannot roll back a distributed transaction after a non-deterministic technical error without implementing your own variant of a 2PC protocol.
It is also not simple to implement eventual completion. Arbitrary errors can happen at any time and it can take a while until the cause for the error is fixed. This means you must make sure that you do not lose the pending changes before you can complete them. We need to store the information about the desired change in a reliable way (outside the place where we finally want to persist it) until we have completed the update. We also need to make sure that accidental duplicate updates do not lead to undesired effects, i.e., that updates are idempotent.
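The two safeguards from the last paragraph – storing pending changes reliably and making updates idempotent – can be sketched as follows, with in-memory dictionaries standing in for durable storage:

```python
outbox = {}          # change_id -> payload (stands in for a durable table)
applied_ids = set()  # remembered ids make accidental re-delivery harmless

def record_change(change_id, payload):
    """Persist the desired change *before* trying to apply it."""
    outbox[change_id] = payload

def apply_change(change_id, target):
    """Apply a pending change at most once (idempotent update)."""
    payload = outbox.get(change_id)
    if payload is None or change_id in applied_ids:
        return  # unknown or already applied: a duplicate delivery is a no-op
    target.update(payload)
    applied_ids.add(change_id)  # in a real system: same transaction as the update
    del outbox[change_id]
```

Because the change survives in the outbox until it is applied, and applying it twice has no effect, the retry strategy above can hammer away at `apply_change` without risking lost or duplicated updates.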
Again: One-shot strategies do not work and will lead to eventual data corruption.
Up to now we assumed that an error condition will not persist eternally, i.e., that the system will eventually recover from the error and the pending change can be applied. But sometimes error conditions are persistent.
E.g., assume a replicated, BASE-style database with 3 replicas. You use a write quorum of 2 and a read quorum of 2 (so that read and write quorums always overlap). Now, you have two concurrent writes, one being written to replicas 1 and 2, the other one written to replicas 2 and 3. Before the database can propagate and reconcile the concurrent writes, the hard disk underlying replica 2 crashes for good.
You are now left with two conflicting states. You will not get a read quorum of 2, i.e., the database will report that it is not able to retrieve a valid state. Depending on its implementation and/or configuration, it either returns an error or both states. Additionally, it usually will not be able to resolve that situation on its own, and you are left with a persistent error situation. 3
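For illustration, the scenario can be modeled in a few lines of toy code (the replica layout and quorum logic are deliberately simplified):

```python
# Three replicas, write quorum 2, read quorum 2. Two concurrent writes land
# on replicas {1, 2} and {2, 3}; then replica 2 (the only one that saw both)
# is lost for good.

replicas = {1: None, 2: None, 3: None}

def write(value, targets):
    for r in targets:
        replicas[r] = value

def read_quorum():
    # With replica 2 gone, the two surviving replicas form the read quorum.
    values = set(replicas.values())
    if len(values) > 1:
        raise RuntimeError("conflicting states, cannot determine a valid value")
    return values.pop()

write("A", [1, 2])  # first concurrent write
write("B", [2, 3])  # second concurrent write
del replicas[2]     # the disk under replica 2 crashes before reconciliation
```

After the crash, replica 1 holds “A” and replica 3 holds “B”, and no read quorum can decide between them – the persistent error situation described above.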
This is an example of a potential persistent error situation. There are many more such situations. Basically, whenever a stateful node crashes, there is a chance that you end up in a persistent, non-recoverable error state unless you implement the required countermeasures.
If you encounter such a persistent technical error, the strategy to eventually complete pending updates we discussed before will fail. Now what?
In general, you have to restore a valid state first and then apply the pending changes. In smaller installations (“smaller” meaning all contexts that do not deal with millions of updates per day), human intervention in such a situation might be a valid strategy (assuming that such situations only happen rarely).
Still, you must make sure that this procedure can be executed by a human quite easily and without a high chance of introducing new errors. You do not want to have an administrator execute queries on the database level to figure out the state, fix it manually and then apply missing changes by hand.
Such a procedure is basically a guarantee for an inconsistent database. The administrator has to work under high pressure – remember: The system is down and time is money. Thus, everything tricky and error-prone must be avoided. This means that even for manual error recovery strategies, you need to implement good tooling on the technical level to restore a valid state.
Concepts like event sourcing combined with snapshots come in handy in such situations: You replace the broken parts of the system, reload the latest snapshot and then replay all events that occurred after the snapshot (of course this requires idempotent event processing to work reliably).
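A minimal sketch of snapshot-plus-replay recovery might look like this; the event shape (id, version, key, value) is an assumption for illustration:

```python
def recover(snapshot, snapshot_version, event_log):
    """Rebuild state from the latest snapshot plus all later events.

    Events carry ids so that accidental duplicates in the log are
    applied at most once (idempotent event processing).
    """
    state = dict(snapshot)
    seen = set()
    for event in event_log:
        if event["version"] <= snapshot_version or event["id"] in seen:
            continue  # already contained in the snapshot, or a duplicate
        state[event["key"]] = event["value"]
        seen.add(event["id"])
    return state
```

The key property is that running the recovery twice, or replaying a partially duplicated log, yields the same state – which is exactly what makes the procedure safe to execute under pressure.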
For bigger installations, you definitely want to automate this recovery procedure as it will occur more regularly.
Summing up, this leads to two conclusions:
- It is not trivial to implement a system that is able to deal with non-deterministic technical errors in a dependable way.
- The Saga pattern is not the way to do it.
What is left for the Saga pattern
This raises the question: Is the Saga pattern useless?
No, of course it is not. The Saga pattern is a great way to recover from business level errors. You just cannot use it to recover from technical errors.
What you need to do is build a sort of technical layer that the Saga pattern can run on. That layer needs to make sure that changes it receives are eventually completed. This way it provides a reliable basis for the business layer on top of it by hiding the imponderabilities of distribution from the business layer.
The business layer now can assume that changes it sends to the technical layer will be executed reliably. It does not need to care about potential technical errors. But it needs to take care of potential business level errors.
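A rough sketch of this layering, with all class and method names hypothetical:

```python
class BusinessError(Exception):
    pass

class TechnicalLayer:
    """Accepts changes and guarantees eventual completion. In a real system,
    durable storage, retries, and idempotency would be hidden in here."""
    def __init__(self):
        self.committed = []

    def submit(self, change):
        self.committed.append(change)  # stand-in for "will eventually complete"

class BusinessLayer:
    """Only deals with business rules; technical errors never surface here."""
    def __init__(self, tech):
        self.tech = tech

    def place_order(self, order):
        if order["amount"] > order["credit_limit"]:
            raise BusinessError("credit limit exceeded")  # handled via a saga
        self.tech.submit(("debit", order["amount"]))
```

The business layer raises only business errors; everything it hands to `submit` is treated as guaranteed to complete eventually.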
This is where the Saga pattern shines: Handling business level errors in a distributed context on top of a reliable technical layer.
This two-layer approach resembles the ideas that Pat Helland described in 2007 in his famous paper “Life beyond Distributed Transactions: An Apostate’s Opinion”.
Pat Helland went a lot further in his paper and sketched an architecture for systems with virtually unlimited scalability, but the core ideas are the same: Build a reliable lower layer that handles the imponderabilities of distribution and place a business layer on top of it that does not need to know about the intricacies of the lower layer.
A final consideration before wrapping up: If you need the Saga pattern too often, it hints at a design smell.
If you need to update several services in the scope of a single use case, very often it means that you organized your services around entities. This is a widespread approach inside process boundaries (e.g., most of the OO design literature is based on this approach). While this is okay-ish inside process boundaries, it is a bad idea in a distributed context like microservices from an availability point of view.
All the services handling different parts of your data – plus typically some more services like a BFF, some “process services” and more – need to be up and running without problems so that an external request (the trigger for the respective use case or user interaction) can be completed. If any of these services is not available, the request cannot be completed.
You can decouple things on a technical level using a message bus or the like, but that is not always acceptable to the users. Quite often they do not just want the message: “We received your request and are trying to process it. Come back later to see if it succeeded.” They want the immediate confirmation that their request was successfully completed.
You could still use a message bus even if the users expect an immediate completion response. But then you implement synchronous communication on a business level on top of asynchronous communication on a technical level. In other words: Nothing has been gained. You just made the implementation more complicated.
Additionally, asynchronous communication comes at a price. E.g., messages can be lost or arrive multiple times at the receiver, you can run into message ordering problems, other parts of the system are more likely to work on stale state, and even deadlocks can occur in the worst case.
Overall: While asynchronous communication can have a significant value, it comes at a price and does not magically fix the shortcomings of a suboptimal design.
To avoid such problems and maximize availability, it is a good design practice to organize services around use cases/user interactions instead of entities. I discuss this approach in more detail in the presentation “Getting (service) design right”.
Still, even then there can be situations where you need to update more than one service to complete a given use case or user interaction. Here the Saga pattern is a really useful concept.
But if you need the Saga pattern basically everywhere, if you cannot complete any non-trivial user interaction without requiring the Saga pattern, you should revisit your service design. Very likely you will find out that your service design has a lot of bad design smells.
The Saga pattern is sold quite often as the panacea to handle all types of errors in microservices landscapes – which is wrong.
The Saga pattern is useful to handle business level errors like a credit limit violation. But it cannot be used to handle technical errors based on the non-deterministic nature of distributed systems. You need different measures to deal with those types of errors.
A good way of handling them is to implement a technical layer that handles the technical errors with all their intricacies. This layer provides an interface that guarantees that updates sent to it will eventually complete. On top of that layer, the business level logic can be implemented, which takes care of business level errors.
The Saga pattern is a good way to handle business level errors if the required updates span multiple services.
Still, be aware that a rampant need for the Saga pattern usually is a design smell. It points to an entity-based organization of services which compromises the availability and reliability of a service-based landscape.
I hope this post helped to clarify some of the oversimplifications that I often see around the Saga pattern. I am sorry that things are not that easy. But I think it is better that you groan realizing that things are harder than you thought than ending up with a corrupted database and no idea how to fix it.
PS: To avoid potential misinterpretations: Even if things are not that easy, distributed transactions are still a bad idea if you want to exploit the original microservices independence ideas and not suffer from an availability penalty. So, this post was not a silent advocacy for distributed transactions. If you really want high availability and service independence with a growing number of services, there is no alternative to the hard way I described.
Based on my observations, most people inside and outside of IT have not understood the non-deterministic nature of errors in distributed systems. While most of them at least nod if you tell them, they act differently. Usually, systems are designed and implemented as if everything behaved deterministically. While this way of thinking and acting is mostly okay inside a process boundary, it is not across process boundaries. The problem is that most of our computer science education, especially for software development, implicitly assumes in-process software development. It is always “If X then Y”, not “If X then maybe Y”, which would be more appropriate for distributed systems. I will dive deeper into this topic in future posts. ↩︎
There is a lot of literature about how to implement such strategies. A starting point could be my presentations “Patterns of resilience” and “Resilience reloaded”. The latter presentation contains references to additional literature on slide 99. ↩︎
While most databases provide ways to resolve such situations, not all of them necessarily lead to the desired results from a functional perspective. You also often need to configure the database to make sure that it executes the desired behavior to resolve such error situations, as many databases have different default settings. A few offer CRDTs, a special type of data representation that allows the database to resolve such issues. Unfortunately, CRDTs have a significant overhead and cannot represent all data structures. This is a quite tricky topic, and unless you really understand what behavior your database implements and how you can influence it, chances are that you will be left with a corrupted database in such a situation. ↩︎