Why we need resilient software design - Part 4

Why failures will hit you at the application level

Uwe Friedrichsen

9 minute read

Meadow (seen in northern Germany)


In the previous post, we discussed availability, and how more nodes and the effects of remote communication affect it negatively. We learned that failures in today’s distributed, highly interconnected system landscapes are unavoidable and that we need to embrace them if we want to create highly available solutions.

In this post, we will discuss why these failures will hit us at the application level for sure and why we cannot leave their handling to the operations teams as we did in the past.

The “let ops take care of resilience” habit

As we already briefly touched on in the first post of this blog series, the infrastructure we typically use has become better in recent years with respect to detecting and handling failures.

As a consequence, we see more and more developers “relax” and return to their “let ops take care of resilience” habit of the past. That was the habit they had before the microservices movement began and brought that dreadful period in which the infrastructure did not shield developers from these pesky resilience issues.

We also see them falling for the 100% availability trap, especially (though not only) regarding the infrastructure components they use. Their code indicates that they expect their databases, event or message busses, service meshes, API gateways, etc. to never, ever be down.

Well, infrastructure components are not immune to failure either. If you write your code as if your infrastructure components would never, ever fail, your company is in deep trouble if any of them fails, which they inevitably will. It is just a matter of time. The probability that infrastructure components fail is larger than 0, which means that the only unknown factors left are when and how badly they will fail.

But even if infrastructure components never failed (which is not true, as we have just discussed), not taking care of resilient software design at the application level would still be a bad idea.

Why is it this way?

To answer this question, let us first look at how contemporary infrastructure components can support us with respect to detecting and handling failures.

What the infrastructure can do for us

Today’s infrastructure components like container schedulers, service meshes, API gateways, various cloud IaaS services and the like can support us in a variety of ways with respect to detecting and handling failures:

  • Timeout & circuit breaker – Detect via health checks and configurable request timeouts if a peer does not respond (in time), tripping a breaker if needed
  • Retry – Retry a failed request to a peer up to a configurable number of times
  • Failover – Try to access a different instance from a failover group if a health check fails or a timeout expires
  • Restart & Autoscale – Automatically fire up new instances after an instance loss is detected or if the load exceeds a certain level
  • Rate limiting & quotas – Throttle incoming requests if a given request limit per unit of time is exceeded. Set usage quotas for different service users
  • Smart updates – Updates without externally perceptible downtime using rolling updates, canary releases and the like
  • Escalation – Notify administrators if some additional action is required that the infrastructure cannot handle on its own

Admittedly, that is quite a lot. When we started writing microservices roughly a decade ago, we needed to take care of all that on our own. The infrastructure has been upgraded quite a bit since then – also with respect to failure detection and handling.

And, to be clear, if we can use these features in a sensible way to improve the availability and robustness of our applications, we most definitely should do so.

The limits of infrastructure failure handling

But as nice and useful as all these infrastructure-based features are, they have their limitations. E.g.:

  • Not all failure modes are supported – Response failures (think of eventual consistency, for example) are typically not detected at all by the usual infrastructure means 1. Byzantine failures are basically not detected by any infrastructure tool either.
  • Not all required patterns are supported – Many crucial or at least very useful resiliency patterns are not supported by infrastructure components, e.g., idempotency, backup request or fallback in general. You must implement these patterns at the application level if you need them.
  • Not all means are available everywhere – Some of the means are only available in public cloud environments (e.g., load-based autoscaling). Other means, such as rate limiting, need products that might not be available in your system landscape.
  • Often support from the application level is needed – Supportive actions from the application are often needed to make the infrastructure-level means work. E.g., many problems can only be detected via good monitoring, which typically requires good metrics. While the infrastructure tools can collect infrastructure-level metrics on their own, application- and business-level metrics usually must be provided by the applications. Or you need to implement callbacks that the infrastructure tooling can access to do its work and trigger the required actions (e.g., health checks or synthetic transactions).
  • Often only undifferentiated, coarse-grained actions are possible – It often feels like doing a goldsmith’s work with a sledgehammer because you can only trigger quite undifferentiated, coarse-grained actions at the infrastructure level. E.g., you often can only define a single timeout threshold for a service. But depending on the caller and the API method called, you may need different timeout thresholds and/or numbers of retries. You can only implement such differentiated thresholds at the application level because the support the infrastructure tools offer is too coarse-grained. 2
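
To make the point about missing patterns a bit more tangible, here is a minimal sketch of idempotent request handling at the application level. It assumes that the caller sends a unique key with every request and resends the same key when it retries. The names `PaymentService`, `charge` and `idempotency_key` are made up for illustration, and the in-memory dictionary merely stands in for the durable store a real service would need.

```python
from typing import Dict


class PaymentService:
    """Hypothetical service that tolerates retried requests."""

    def __init__(self) -> None:
        # Results of already processed requests, keyed by idempotency key.
        # Assumption: a real service would keep this in durable storage.
        self._results: Dict[str, dict] = {}
        self.charges_made = 0

    def charge(self, idempotency_key: str, amount: int) -> dict:
        # A retry carries the same key, so we return the stored result
        # instead of charging the customer a second time.
        if idempotency_key in self._results:
            return self._results[idempotency_key]

        self.charges_made += 1  # stands in for the real side effect
        result = {"status": "charged", "amount": amount}
        self._results[idempotency_key] = result
        return result
```

With this in place, an infrastructure-level retry that reaches the service twice does no harm: the second call with the same key simply gets the first result back.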

Additionally, the infrastructure tools cannot implement any business-level reactions. In most situations, it is a business-level decision what exactly to do if a remote request fails or some other effect based on the non-deterministic remote communication behavior hits. This typically means implementing business logic – at the application level.

Some examples

Let me just provide three little examples. This is the first one:

  1. You access a service.
  2. The request times out.
  3. Your infrastructure retries the request.
  4. The request times out again.
  5. Your infrastructure reports back to you that it could not reach the other service.

It would be the same if a circuit breaker had been tripped and the request had returned immediately with the same error response. The infrastructure did what it was able to do. Now it is up to you to deal with this response at the application level.

Actually, you usually cannot decide on your own what to do. You need to reach out to your product manager and ask her how to respond to such a situation. Depending on the criticality and business value of the request, she might decide anything from simply ignoring the error up to very elaborate recovery or mitigation measures. 3
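
A minimal sketch of what this could look like in code, assuming the business decision has been boiled down to one of three options per request type. `FailurePolicy`, `RemoteUnavailable` and `call_with_policy` are hypothetical names introduced only for this illustration. In practice, the error would be whatever your infrastructure or client library hands back to you.

```python
from enum import Enum
from typing import Any, Callable


class FailurePolicy(Enum):
    """Hypothetical business-level choices for a failed remote call."""
    IGNORE = "ignore"          # carry on without the result
    USE_FALLBACK = "fallback"  # serve a default or cached value
    ESCALATE = "escalate"      # fail the whole operation


class RemoteUnavailable(Exception):
    """Stands in for the error the infrastructure reports back."""


def call_with_policy(remote_call: Callable[[], Any],
                     policy: FailurePolicy,
                     fallback: Any = None) -> Any:
    try:
        return remote_call()
    except RemoteUnavailable:
        # The infrastructure has given up. What happens now is the
        # business decision encoded in the policy.
        if policy is FailurePolicy.IGNORE:
            return None
        if policy is FailurePolicy.USE_FALLBACK:
            return fallback
        raise  # ESCALATE: let the caller deal with the failure
```

Which policy applies to which request is exactly the question for the product manager. The code only makes sure that her answer is actually carried out.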

Here is the second example:

  1. You interact with a service via several methods it exposes.
  2. One of the method calls must always return in < 100 ms (business requirement).
  3. All other method calls are not time-critical (and usually take longer than 100 ms).
  4. Your infrastructure tool allows you to configure exactly one timeout threshold.

This is the problem that the infrastructure often only allows coarse-grained actions. If you set the infrastructure timeout to 100 ms, all non-time-critical calls would fail. If you set the timeout to a larger value, you would not detect if the time-critical call took too long. In the end, you need to resolve the situation at the application level, configuring the required timeout for each call.
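
One possible way to sketch per-call timeouts at the application level, using only the Python standard library. The method name `get_quote`, the 5 second default and the helper `call_with_timeout` are assumptions made for this example. Note that the sketch only stops waiting after the timeout. The abandoned call keeps running in its worker thread, which a real implementation would have to deal with.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as CallTimeout
from typing import Any, Callable

# Hypothetical per-method timeouts in seconds. Only the time-critical
# method gets the strict 100 ms budget from the business requirement.
TIMEOUTS_S = {"get_quote": 0.1}
DEFAULT_TIMEOUT_S = 5.0  # assumed budget for all other methods

_executor = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(method_name: str,
                      func: Callable[..., Any],
                      *args: Any) -> Any:
    """Run func(*args), waiting no longer than the method's timeout."""
    timeout = TIMEOUTS_S.get(method_name, DEFAULT_TIMEOUT_S)
    future = _executor.submit(func, *args)
    # Raises CallTimeout if the result is not available in time.
    return future.result(timeout=timeout)
```

A call that takes 300 ms would thus fail with `CallTimeout` when made as `get_quote`, but succeed under any other method name.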

Here is the third example:

  1. You receive a withdrawal request for an account you manage.
  2. It is not covered by the current account’s balance.
  3. You reject the withdrawal request based on your business rules.
  4. A moment later you receive a deposit that took place before the withdrawal request. It only reaches you now due to some network lag and retries.

Some non-trivial compensation logic is needed to fix this problem. Usually, your prior response has already left your area of control, so you cannot simply fix the problem inside your application. Instead, you need to trigger some compensating actions.
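
A heavily simplified sketch of how such a late arrival could be detected, assuming every request carries the time at which it originally took place. The `Account` class, the `occurred_at` field and the `review_rejected_withdrawal` action are invented for this illustration. The sketch does not fix anything itself. It only records that a compensating action is needed, and it uses the current balance as a rough approximation of whether the withdrawal would have been covered.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Account:
    balance: int = 0
    # Rejected withdrawals as (occurred_at, amount) pairs.
    rejected: List[Tuple[int, int]] = field(default_factory=list)
    # Compensating actions to be picked up by business-level logic.
    compensations: List[Tuple[str, int, int]] = field(default_factory=list)

    def withdraw(self, occurred_at: int, amount: int) -> bool:
        if amount > self.balance:
            # Remember the rejection so a late deposit can revisit it.
            self.rejected.append((occurred_at, amount))
            return False
        self.balance -= amount
        return True

    def deposit(self, occurred_at: int, amount: int) -> None:
        self.balance += amount
        for rejection in list(self.rejected):
            rejected_at, rejected_amount = rejection
            # The deposit took place before the rejected withdrawal and
            # the balance now covers it: the rejection needs a review.
            if occurred_at < rejected_at and rejected_amount <= self.balance:
                self.rejected.remove(rejection)
                self.compensations.append(
                    ("review_rejected_withdrawal", rejected_at, rejected_amount)
                )
```

What the review then leads to (an apology, a refund of fees, re-executing the withdrawal) is again a business-level decision.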

Note that the infrastructure worked perfectly fine. Without interpreting the payload at a business-logic level, it cannot detect such a failure. 4

Also note that most applications do not implement the aforementioned compensation logic. Most applications are written as if such problems would never occur, which is an incorrect assumption, as we know from the prior posts of this blog series.

Summing up

In this post, we discussed why the imponderabilities of distributed systems will hit us at the application level and why we cannot leave their handling to the operations teams as we did in the past.

We have seen that while today’s infrastructure tooling can support us nicely in quite a few situations, it cannot address all problems we will face. In particular, it can only support us at a technical level. It cannot relieve us from detecting and handling any business-level issues that arise due to unexpected technical failures.

In the next and final post of this series (link will follow), we will recapitulate what we have learned so far and discuss what all this means with respect to resilient software design. Stay tuned … ;)


  1. The simple case of replicated, out-of-sync data is detected by some NoSQL databases that use data replication (at least if you configure the consistency level of that solution correctly). But, e.g., deviating transient node states are usually not detected at all without explicitly checking for it at the application level. ↩︎

  2. Some tools allow you to configure timeouts for connections between two services. But depending on your situation, this can still be too coarse-grained (see the second example in this post). ↩︎

  3. I will discuss the business case of resilient software design and how to calculate a sensible budget for resilience measures in a future blog post. ↩︎

  4. To be clear: You should by all means avoid implementing business logic inside infrastructure tools! I will probably write a blog post in the future to dig deeper into this issue. If it is not obvious why putting business logic into infrastructure tools is a very bad idea, please wait for that post or just ask any experienced and trustworthy software engineer … ;) ↩︎