The microservices fallacy - Part 3
This post discusses the widespread fallacy that solutions become simpler with microservices.
In the previous post we discussed the fallacy that microservices are needed for scalability purposes. In this post we take a closer look at the second fallacy – that microservices lead to simpler solutions.
Another widespread justification for microservices is that they are simpler than monoliths. This idea is rooted in the famous quote by James Lewis that a microservice should always be of a size that you can wrap your head around. In other words: a single person should be able to understand a single microservice as a whole.
If you naively compare that to the multi-million line behemoths that quite a few monoliths have grown into over time, this sounds great, of course. A long-held dream of many developers: working with a code base that can be understood completely. So great! Such a relief! So much simpler! Give me microservices!
Yet, this is comparing apples and oranges.
Essential complexity persists
To start with, essential complexity does not disappear just because you organize your code in a different way. Essential complexity is, as Moseley and Marks put it in their paper “Out of the Tar Pit”:
Essential Complexity is inherent in, and the essence of, the problem (as seen by the users).
To complete the picture, let us add the definition of accidental complexity from the same paper:
Accidental Complexity is all the rest – complexity with which the development team would not have to deal in the ideal world.
Essential complexity is everything you need to do to solve the given problem. A solution can never be simpler than the problem it needs to solve. It only can be more complicated – by adding accidental complexity, complexity not necessarily needed to solve the given problem.
For enterprise applications, essential complexity is defined for the most part by the business and non-functional requirements the solution needs to implement. This complexity does not go away by choosing a different solution architecture. 1
I have often seen a similar fallacy in the context of “clean code” discussions. There the recommendation is to write only (very) short methods to create simpler code that is easier to understand. Admittedly, a single method then usually is easier to understand.
Some people took this to the extreme and advocate methods of 2 or 3 lines as the ideal. The reasoning is: the shorter the method, the better, i.e., the easier the code base can be understood.
Let us put the extreme idea to the test: If methods are just 2 or 3 lines long, most methods need to call other methods. This leaves between 1 and 2 lines per method that actually implement business logic. 2
Now take some not-too-complex business problem that requires 10.000 lines of code to implement, assuming a solid implementation (not too chatty, not compressed to incomprehensibility). A single program of 10.000 lines? Not a very smart idea, obviously. But up to 10.000 methods, each one 2 or 3 lines long? That is far worse.
You just moved the essential complexity from the parts into the structure. That makes things even harder to understand because the whole functionality is now spread across 10.000 places, and you usually need to jump back and forth along the call tree dozens of times before you figure out how even the simplest piece of business logic works.
From an understandability stance, this extremely scattered business logic is far worse than the single huge program of 10.000 lines.
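A tiny, hypothetical illustration of this effect (the fee calculation and all names are invented for the example): the same few lines of business logic, written once as a single readable method and once in the “2 or 3 lines per method” style. The logic is identical, but in the second version you already have to walk a small call tree to understand it.

```python
# Version 1: one method, the logic reads top to bottom.
def fee_v1(amount: float) -> float:
    base = amount * 0.01                    # 1% base fee
    surcharge = 5.0 if amount > 1000 else 0.0  # large-amount surcharge
    return max(base + surcharge, 2.0)       # minimum fee of 2.0

# Version 2: the same logic in 2-3 line methods. Each method is trivial,
# but the complexity has moved into the call structure.
def fee_v2(amount: float) -> float:
    return _apply_minimum(_total(amount))

def _total(amount: float) -> float:
    return _base(amount) + _surcharge(amount)

def _base(amount: float) -> float:
    return amount * 0.01

def _surcharge(amount: float) -> float:
    return 5.0 if amount > 1000 else 0.0

def _apply_minimum(fee: float) -> float:
    return max(fee, 2.0)
```

Scale this factor-of-five method count up to 10.000 lines of business logic and you get the scattering described above.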
I had to work with both. I had the 10.000 lines in a single main method. I had the 2- and 3-liners. Both sucked. But in the end, the 2-liners sucked more because there you not only had to deal with a bulk of business logic and a bit of control logic, but additionally with more than 10.000 calls to other methods, call stacks often 15 layers deep, and so on. While each method in itself was trivial, the structural complexity of the code exploded. 3
Overall, essential complexity does not go away by using smaller units. It needs to be addressed in the solution. Smaller units mean more of them, more structural complexity, more interactions and dependencies between the units:
Principle of conservation of essential complexity
The amount of essential complexity a solution must implement for a given problem is constant. If you simplify the building blocks used, the structure becomes more complex, and vice versa.
Putting all the complexity into either the building blocks or the structure impedes the understandability of the solution. The understandability sweet spot is a balance between structural and building block complexity.
This principle holds true at all design levels, including the level of microservices. The solution does not get any simpler by using smaller and smaller building blocks. It just means you need more of them and your structural complexity grows.
The very high price of distribution
But unlike with small methods, structural complexity has a different twist in the context of microservices. Calls between services are remote, i.e., they go across the network. Going microservices means that your applications become distributed systems – with all their intricacies.
We could insert hundreds of computer science papers and a huge shelf of books here that all describe the challenges and intricacies of distributed systems, why and how they are very different from non-distributed systems, what can (and will) go wrong in distributed systems, what is simply not possible with them, what you need to take care of, and a lot more challenges.
I will write more about distributed systems and their challenges in future posts. Here I would like to keep it short and just briefly sketch the problem:
In our computer science education we have learned deterministic reasoning: “If X then Y”. We need that kind of reasoning to design solutions, to derive the algorithms and code needed to solve the given problem. Inside a process boundary this kind of reasoning works nicely 4.
But as soon as you leave your process context, you run into a problem. Network communication is by definition non-deterministic 5. Simply put, your reasoning would need to change to: “If X then maybe Y”, which is a completely different story. Distributed systems theory distinguishes the following failure modes:
- Crash failures (remote peers fail)
- Omission failures (remote peers sometimes reply and sometimes do not)
- Timing failures (remote peers respond too late, i.e., outside an acceptable time span)
- Response failures (remote peers give a wrong response due to consistency issues in the system)
- Byzantine failures (remote peers “go wild”, i.e., do not exhibit any predictable behavior)
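These failure modes force the caller to program “maybe”, not “then”. A minimal, hypothetical sketch of what that means for the calling code: the remote endpoint below is faked with an injected failure sequence so the example stays deterministic, but the caller has to handle the crash, omission, and timing cases explicitly either way.

```python
class RemoteCallFailed(Exception):
    """Raised when a (simulated) remote call does not return a reply."""
    pass

def flaky_remote(outcomes):
    """Returns a fake remote endpoint that fails according to `outcomes`."""
    it = iter(outcomes)
    def call(request):
        outcome = next(it, "ok")  # behaves normally once outcomes run out
        if outcome == "ok":
            return f"reply:{request}"
        raise RemoteCallFailed(outcome)  # crash, omission or timeout
    return call

def call_with_retries(remote, request, attempts=3):
    """'If X then maybe Y': retry a bounded number of times, then give up."""
    last = None
    for _ in range(attempts):
        try:
            return remote(request)
        except RemoteCallFailed as exc:
            last = exc  # a real client would also back off before retrying
    raise RemoteCallFailed(f"gave up after {attempts} attempts: {last}")

# Two failures, then success: the third attempt gets through.
remote = flaky_remote(["omission", "timeout", "ok"])
print(call_with_retries(remote, "get-balance"))  # reply:get-balance
```

Note that even this sketch only narrows the uncertainty: after the last retry fails, the caller still does not know whether the request was processed remotely or not.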
None of these failure modes exist within a single process context; they only occur across process boundaries. They lead to unpredictable, hard-to-handle effects at the application level, like:
- Lost messages (message never arrives at recipient) 6
- Incomplete messages (parts of the message are missing)
- Duplicate messages (message arrives twice or more often)
- Distorted messages (parts of the message are corrupted)
- Delayed messages (message arrives only after a long delay)
- Out-of-order message arrival (causally ordered messages arrive in reverse order)
- Partial, out-of-sync local memory (different nodes have a different “truth”, global truth does not exist)
More concretely, this leads to effects at the application level that you need to detect and handle, e.g.:
- A lost message may go unnoticed if no party takes responsibility for checking whether all messages arrive at their destination.
- Incomplete or distorted messages can be broken in such a way that they look normal, and the actually wrong message gets processed.
- Duplicate messages can trigger the same action several times if duplicates are not detected.
- Delayed messages can lead to a complete application stall if all threads get blocked waiting for messages (which in practice can happen a lot faster than most people would ever imagine).
- Out-of-order messages can trigger wrong behavior. E.g., if you receive the withdrawal message before the corresponding deposit message even if the deposit message was sent before the withdrawal message, you might decline the withdrawal even if it was covered.
- Out-of-sync memory may also lead to wrong local decisions which you only detect later, after you have already sent out messages yourself that you cannot simply undo, especially if they have already left the boundaries of your control (typically company boundaries).
- Simply reaching consensus between several nodes can be impossible in certain constellations due to these effects. 7
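Two of the effects above, duplicates and out-of-order arrival, can at least be handled at the receiving side. A hypothetical sketch of such an idempotent, ordering-aware consumer (all names are invented; real systems would also persist the state and bound the buffers):

```python
class Consumer:
    def __init__(self):
        self.seen = set()    # ids of already processed messages
        self.buffer = {}     # seq -> payload, for messages that arrive early
        self.next_seq = 1    # next sequence number we are allowed to apply
        self.log = []        # what was actually applied, in order

    def receive(self, msg_id, seq, payload):
        if msg_id in self.seen:
            return           # duplicate: applying it twice would be wrong
        self.seen.add(msg_id)
        self.buffer[seq] = payload
        # Apply all consecutive messages that are now available.
        while self.next_seq in self.buffer:
            self.log.append(self.buffer.pop(self.next_seq))
            self.next_seq += 1

c = Consumer()
c.receive("b", 2, "withdraw 50")   # arrives before the deposit it depends on
c.receive("a", 1, "deposit 100")
c.receive("a", 1, "deposit 100")   # duplicate is ignored
print(c.log)  # ['deposit 100', 'withdraw 50']
```

Note what this does not solve: lost messages (sequence 1 may never arrive, stalling the buffer forever) still need timeouts and resend protocols on top.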
And we have not even talked about Byzantine failures, where other systems behave in completely unpredictable ways, e.g., due to a malicious attack or other unexpected situations.
All this happens due to the non-deterministic behavior of remote communication. In other words: each remote call basically is a predetermined breaking point of your application. Thus, you should try to minimize the number of remote calls inside your applications.
But this conflicts with the idea of using microservices to simplify your applications by splitting them up into many smaller parts, each of which is easy to understand. If you use microservices to split up your applications into smaller parts, you are confronted with the full-blown complexity of distributed systems.
Alternatively, you could try to design your application in such a way that if you split it up into smaller parts, the parts do not need to call each other while processing an external call by a user or another system. This would allow you to have smaller application parts at runtime (i.e., services) without being confronted with all the imponderabilities of distributed systems.
But this requires a very different approach to application design than we usually use. The common divide-and-conquer decomposition techniques that most of us apply do not work, as they all lead to lots of remote calls between services. The approach also stands in stark contrast to the idea of reusable microservices (see the next fallacy). While this different design approach offers a sort of compromise, in practice it tends not to work due to the widespread unfamiliarity with the required design technique.
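To make the idea tangible, here is one hypothetical shape such a design can take (one option among several, with invented names): instead of calling a customer service during request processing, an order service maintains a local read model that is fed asynchronously by events. The user-facing request is then answered entirely locally, with zero remote calls on the request path.

```python
class OrderService:
    """Answers requests from local state only; remote data arrives via events."""

    def __init__(self):
        self.customer_status = {}  # local replica of the data we need

    def on_customer_event(self, event):
        # Applied asynchronously, outside any user request. If an event is
        # delayed, the replica is briefly stale, but requests never block
        # on the network.
        self.customer_status[event["customer_id"]] = event["status"]

    def place_order(self, customer_id, amount):
        # Purely local decision: no remote call during request processing.
        status = self.customer_status.get(customer_id, "unknown")
        if status != "active":
            return "rejected"
        return "accepted"

orders = OrderService()
orders.on_customer_event({"customer_id": "c1", "status": "active"})
print(orders.place_order("c1", 42))  # accepted
print(orders.place_order("c2", 42))  # rejected (customer unknown locally)
```

The trade-off is visible in the sketch: the request path is simple and local, but the replica can lag behind, so decisions are made on possibly stale data, exactly the out-of-sync-memory effect described earlier.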
Overall, it can be said that microservices do not make anything easier. Instead, everything becomes a lot more complex:
- You need to learn a different design technique to minimize the number of internal remote calls in order to reduce the likelihood of distributed failures hitting you at the application level.
- You need to reliably detect and handle failures which can be really challenging.
- You need to augment microservices with far more complex monitoring and observability instrumentation at runtime to detect errors early, before they turn into failures. 8
- You need to level up your infrastructure to be able to run microservices reliably in production, leading to an explosion of infrastructure complexity. 9
If you compare all the measures needed to design, implement and run microservices to the ones needed with monoliths, you end up with an overall increase of complexity by at least an order of magnitude – fueled by the idea that a single service would be easier to understand than a traditional monolith.
If you take all the consequences into account, it is a very high price to pay for a simpler development unit of work – which, funnily enough, you could also have inside a monolith. What you need for simpler development units of work is modularization, which has been ubiquitously available for more than 60 years. You do not need a specific runtime architecture style like microservices for it.
If you just cared enough about proper modularization, and did not break your designs because it feels simpler now while ignoring the price you will have to pay for it later, you would actually achieve what you are looking for without having to deal with the much higher complexity of distributed systems. I will come back to this when I discuss the “better design” fallacy.
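A hypothetical sketch of what such modularization looks like inside a monolith (all names invented): the same boundary you would draw between two microservices, drawn as an in-process interface instead. The billing module depends only on the narrow interface, not on the internals of the customer module, yet every call across the boundary is a cheap, reliable local call.

```python
from typing import Protocol

class CustomerDirectory(Protocol):
    """The only thing other modules are allowed to see of the customer module."""
    def is_active(self, customer_id: str) -> bool: ...

class CustomerModule:
    """Owns customer data; internals stay hidden behind the interface."""
    def __init__(self):
        self._active = {"c1"}  # internal representation, free to change

    def is_active(self, customer_id: str) -> bool:
        return customer_id in self._active

class BillingModule:
    def __init__(self, customers: CustomerDirectory):
        self._customers = customers  # depends on the interface only

    def bill(self, customer_id: str, amount: int) -> str:
        # A plain in-process call: no network, no partial failure modes.
        if not self._customers.is_active(customer_id):
            return "skipped"
        return f"billed {customer_id}: {amount}"

billing = BillingModule(CustomerModule())
print(billing.bill("c1", 100))  # billed c1: 100
```

If the boundaries are kept clean like this, extracting a module into a separate service later remains an option; the point is that the decoupling itself never required a network in between.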
Additionally, it is important to remember that the essential problem complexity does not go away because you split the solution into smaller parts. The smaller the parts become, the more complexity moves into the structure, i.e., into the interactions and dependencies between the parts, which also has its challenges – not only if the collaboration happens via remote calls.
Overall, regarding the claim that microservices lead to simpler solutions it can be said:
Microservices increase the overall solution complexity at least by an order of magnitude.
Microservices do not lead to simpler solutions. Instead, they lead to way more complex solutions.
I will leave it here regarding the discussion of whether microservices make solutions simpler. The post has become longer than I intended, but this is a particularly hard-to-grasp topic that needed a bit more explanation. I hope I got the ideas across.
In the next post, I will discuss the next two fallacies:
- Microservices improve reusability (and thus pay for themselves after a short while).
- Microservices improve team autonomy.
Stay tuned …
As I discussed in this post, requirements can also add a lot of accidental complexity (and often they do). Nevertheless, for the sake of this post this is not relevant because here we want to understand how much complexity different solution approaches add to a given, fixed set of requirements. In this context, it is irrelevant if the requirements contain accidental complexity. We are only interested in the added complexity due to the solution architecture chosen. ↩︎
In practice, it is a bit more complicated. Your implementation resembles a big method tree. You will distribute your business logic to a lot of leaf methods. On top of the leaf methods you create a tree of branch methods that contain nothing but delegations to the next tree level. Assuming 10.000 lines of business logic perfectly split into 5.000 leaf methods of 2 lines each (neglecting potential return statements), you would additionally get another ~5.000 branch methods that implement the binary delegation tree on top of the leaf methods. While the actual structure is a bit different from this simple description, the resulting numbers are the same. ↩︎
By the book, you organize all the small methods in layers of abstraction. This is a useful principle (and you should adhere to it whenever you can). Yet, based on my experience, this only works well for the small examples that are shown in the books. In a big code base with 10.000 or more methods, you either end up with hundreds of methods that have the same name because they do basically the same thing – yet they are all a tad different, which makes it impossible to unify them into a single method. Or you end up with hundreds or thousands of meaningless method names. I have seen both quite often: lots of methods with the same name, and lots of methods with long unique names where you still had no idea what they do until you read the code of the method and all the methods it called, etc. ↩︎
If you work with threads inside a process boundary, you may also hit the limits of traditional deterministic reasoning because concurrent behavior often is very hard to grasp. But for the sake of simplicity, I ignore the issue here, especially as a lot of really good abstractions exist that basically solve the issue (e.g., CSP, actors, etc.). ↩︎
In theory, it could be completely deterministic, assuming perfect hardware and software, total protection from any kind of radiation, no spikes in power supply, etc. But as these conditions are not realistic (and economically undesirable), we can safely assume non-deterministic behavior from remote communication. ↩︎
Note that “message” here only describes something that is sent from one process to another. It does not say anything about the enclosing transport mechanism, if it is request-response, if it is events, or whatever. It can hit you with any transport mechanism. ↩︎
Theoretically, you can always guarantee consensus in distributed systems, but then you risk that the system stands still for an indefinite amount of time. In distributed systems theory the challenge always is to balance “safety” and “progress”. A bit simplified: “safety” means that the system always gives correct responses, no matter which node you access; “progress” means that it – well – makes progress, i.e., does not get stuck. You cannot maximize both of them. Whenever you decide for one of them, you compromise the other one to a certain degree. As zero progress for a longer period of time is not an option for most applications, “safety”, i.e., the correctness of state across all nodes at all points in time (and thus no undesired behavior), always needs to be weakened to a certain degree in practice. ↩︎
The literature on distributed systems distinguishes three related concepts. Simply put: “faults” are deficiencies in the system that could trigger an error (e.g., a hard disk sector close to EOL, a bug lurking in a rarely used branch of the software). “Errors” are incorrect internal behavior of the application that cannot yet be observed from the outside. “Failures” occur if errors lead to incorrect application behavior that can be observed from the outside, i.e., violates the specified behavior of the application. ↩︎
Starting with Docker and Kubernetes, the rise of microservices led to an explosion of infrastructure complexity that does not end at service meshes. While these measures surely improve the robustness of the service landscape at runtime, they still cannot guarantee that nothing will fail. All this infrastructure upgrading is needed to keep the failure rate at bay, not to make it go away, as quite a few people naively think. You still need all the measures at the application level to detect and handle errors. ↩︎