The importance of resilience engineering

What we can learn from the latest AWS outage

Uwe Friedrichsen

There was a major outage at AWS this week, and of course the media coverage was big again. For example, the Washington Post ran the headline “Amazon Web Services outage hobbles businesses”, to name just one.

You can find a lot more media coverage. Yet, for me the interesting part was not the fact that AWS had one of its rare outages. It was the bottom line that most of the articles had: AWS had a partial outage and therefore companies using AWS were hobbled.

In other words: AWS is guilty. The companies are victims.

Personally, I think it is not that simple. Actually, I think AWS delivered as promised, and I am sure that some companies, e.g., Netflix, were not hobbled by that outage at all. So, what did the other companies do wrong?

Not understanding distributed systems

The first problem IMO is that the hobbled companies did not understand the nature of distributed systems well enough to assess the risks correctly.

Interestingly enough, I have just described the effects of distributed systems in my previous post:

  • Things go wrong across process boundaries.
  • You cannot predict what will go wrong when.
  • It will hit you at the application level.

That is basically how you sum up distributed systems. Or as Werner Vogels, CTO of Amazon, sometimes puts it:

“Everything fails, all the time.” – Werner Vogels

He does not say it because the people at AWS do a sloppy job. Looking at the complexity and size of the system landscape they operate and how rarely they experience bigger outages, they do a terrific job 1. He says it because he really understands the nature of distributed systems.

This “things will go wrong and hit you at the application level” is not only valid for the parts of your application. You also need to take the infrastructure you use into account. In any distributed system, i.e., where different processes – often running on different machines – interact, you have to look at all “moving” parts and their interactions.

The infrastructure you use is also part of your distributed application landscape. This is what the hobbled companies did not take (enough) into account.

The 100% availability trap

If software engineers design a distributed application, e.g., using microservices, at least sometimes they consider potential failures of the application parts they implement themselves.

But as soon as other parts the application interacts with are involved, be it other applications or infrastructure components, the 100% availability trap strikes. This trap is a widespread, implicit mental model, especially common in enterprise contexts. The trap goes like this:

The 100% availability trap (fallacy)

Everything my application interacts with is available 100% of the time.

If you ask people explicitly whether they think these other parts could fail, they will say yes. But their decisions and designs do not reflect it in any way. 2

Just think about application code you know that, e.g., tries to write to a database. What happens if the access fails? Is there an alternative action coded for that situation, e.g., first retrying the write and, if it still fails, putting the write request in a queue and processing it later, including the logic to watch and process the queue?

I am quite sure there is no such code. Most likely the respective exception is caught, logged and then … well, the code moves on as if nothing happened. In resilience engineering, this is called “fail silently”. You detect the error, decide to ignore it and move on.

This behavior can be fine in some places, but usually it is not. Most of the time, it is just a result of the 100% availability trap: The failure scenario is never discussed, the desired behavior remains undefined and thus the implementing developers do not know how to handle the situation. So, they log that something went wrong and move on. What else should they do?

Still, this is usually not what you want from a business perspective. Assume that this is about writing orders. Orders are what you make money with. Orders are what you live from. So, this is the single most important write of your whole application.

Now assume that if the write fails for whatever reason, the orders are simply, silently not written. You can find it in the log if you search for it, but that is it. This is not what you want.

Or the customer gets a generic message like: “There was a problem processing your request. Please try again later.” – which is a bit better, but also not what you want.

What you want from a business perspective is not losing any order at all because orders are the basis of your existence.

Thus, you would expect logic along these lines:

  • First, retry the write.
  • If that fails, buffer the order in a queue or some other secondary storage medium.
  • Send the customer a message like “Thanks a lot for your order. Due to temporary technical problems we cannot immediately process your order. But we will process it as soon as the problem is resolved. Here is a URL where you can track your order processing status”.
  • Have the queue processor implemented and running, as well as the order processing status page.

This is what you probably would get if you had reasoned about the desired behavior of the order write process from a business perspective. But usually this discussion never takes place due to the 100% availability trap. 3
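The retry-then-buffer behavior described above can be sketched in a few lines. This is only a minimal illustration under assumptions: `OrderStoreUnavailable`, `place_order`, `drain_queue` and the `write_to_db` callback are hypothetical names, and the in-memory `deque` merely stands in for a durable queue or other secondary storage.

```python
import time
from collections import deque


class OrderStoreUnavailable(Exception):
    """Hypothetical error raised when the primary order store cannot be written."""


def place_order(order, write_to_db, fallback_queue, retries=3, delay_s=0.0):
    """Retry the primary write, then buffer the order instead of losing it.

    Returns "stored" if the primary write succeeded and "deferred" if the
    order was buffered for later processing.
    """
    for attempt in range(retries):
        try:
            write_to_db(order)
            return "stored"
        except OrderStoreUnavailable:
            # Wait a bit before the next attempt (linear backoff, for brevity).
            if attempt < retries - 1:
                time.sleep(delay_s * (attempt + 1))
    # All attempts failed: keep the order in secondary storage.
    # Assumption: a real system would use a durable buffer, not process memory.
    fallback_queue.append(order)
    return "deferred"


def drain_queue(fallback_queue, write_to_db):
    """Queue processor: replay buffered orders once the primary store is back."""
    processed = 0
    while fallback_queue:
        try:
            write_to_db(fallback_queue[0])
        except OrderStoreUnavailable:
            break  # still down; try again on the next run
        fallback_queue.popleft()
        processed += 1
    return processed
```

A caller would map the return value "deferred" to the friendly customer message and the status page, instead of showing a generic error.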

The companies hobbled by the AWS outage most likely never had such a discussion.

Not assessing SLAs

But even if companies do not understand distributed systems well enough and fall for the 100% availability trap, there are still the SLAs that AWS provides, which should be used as a source for risk management.

Reading the SLAs that AWS provides, my first thought is that I need to take extra measures on my own if I want to minimize my risk of downtimes. My impression is that the hobbled companies did not have such thoughts.

The post-mortem analysis of the event shows that Kinesis was at the center of the outage. This also impacted some other AWS services that use Kinesis.

Usually a company like AWS tries to minimize the likelihood of cascading failures. But in a complex system landscape like AWS you tend to have subtle, unknown cross-dependencies that you only realize when a major failure happens, even if you are as experienced as AWS regarding distributed system design.

But let us focus on Kinesis alone for a moment. Kinesis offers an SLA that states:

AWS will use commercially reasonable efforts to make each Included Service 4 available with a Monthly Uptime Percentage of at least 99.9% for each AWS region during any monthly billing cycle (the “Service Commitment”).

Please note the “commercially reasonable” in that sentence. This is not a guarantee that will be kept at any cost, but only as long as it makes commercial sense for AWS. In case they cannot keep the promise of the SLA, they offer compensation:

In the event an Included Service does not meet the Service Commitment, you will be eligible to receive a Service Credit as described below.

A table follows that lists the amount of compensation depending on the actual availability of the Included Services during the billing cycle:

Monthly Uptime Percentage | Service Credit Percentage
Less than 99.9% but equal to or greater than 99.0% | 10%
Less than 99.0% but equal to or greater than 95.0% | 25%
Less than 95.0% | 100%

IMO this is a bold promise. I do not know any IT department that provides SLAs for their on-premises infrastructure that even come close – and that on-premises infrastructure is by orders of magnitude less complex than the infrastructure AWS provides. So, nothing to complain about regarding the SLA from my perspective.

But, and this is the point: It is not a 100% availability guarantee.

99.9% availability within a month means that you can still have a bit more than 43 minutes of non-availability per month. This means you can easily lose tens of thousands of messages per month without AWS violating its 99.9% availability promise. 5
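The arithmetic behind these numbers can be checked quickly. The sketch below assumes a 30-day month; the function names and the message rate of 10 messages per second are made up purely for illustration and are not figures from AWS or from the outage.

```python
def allowed_downtime_minutes(availability_percent, days_in_month=30):
    """Minutes of downtime per month that still satisfy the availability promise."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (100 - availability_percent) / 100


def messages_at_risk(availability_percent, messages_per_second, days_in_month=30):
    """Messages arriving during the allowed downtime (hypothetical constant rate)."""
    minutes = allowed_downtime_minutes(availability_percent, days_in_month)
    return round(minutes * 60 * messages_per_second)


# 99.9% of a 30-day month leaves about 43.2 minutes of allowed downtime.
print(allowed_downtime_minutes(99.9))
# At an assumed 10 messages per second, that is roughly 26,000 messages.
print(messages_at_risk(99.9, messages_per_second=10))
```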

You will pay 10% less on your Kinesis bill if Kinesis is down for more than about 43 minutes but no more than 7.2 hours within a month, and you will pay 25% less for downtimes of up to 1.5 days per month. Only if the downtime exceeds 1.5 days in that month do you pay nothing. With respect to the recent AWS outage, this means that the affected companies will probably get a credit of 10% or 25% on their next monthly Kinesis bill.
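The credit tiers quoted from the SLA can be expressed as a small lookup. This is a sketch for reasoning about the numbers, not how AWS computes credits: the function names are hypothetical, and the conversion from downtime to uptime assumes a 30-day month.

```python
def uptime_percent_from_downtime(downtime_hours, days_in_month=30):
    """Monthly uptime percentage for a given total downtime (30-day month assumed)."""
    total_hours = days_in_month * 24
    return 100 * (1 - downtime_hours / total_hours)


def service_credit_percent(monthly_uptime_percent):
    """Service credit tier according to the table quoted above."""
    if monthly_uptime_percent >= 99.9:
        return 0    # service commitment met, no credit
    if monthly_uptime_percent >= 99.0:
        return 10
    if monthly_uptime_percent >= 95.0:
        return 25
    return 100


# Example: an outage of 5 hours in a month lands in the 10% tier.
print(service_credit_percent(uptime_percent_from_downtime(5)))
```

Note that the result is a percentage of the service bill only; it says nothing about lost revenue.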

Be aware that this does not include compensation for lost revenue or other problems due to the outage.

Again: I think the Kinesis SLA is a really good one. But if you confuse this SLA with a 100% availability guarantee and bet on continuous uptime, you are doing it wrong.

And this is only about using Kinesis. Any non-trivial cloud-based application uses multiple services, often a dozen or more. If you browse through their SLAs, you will get similar availability promises 6.

Now let us assume you use 10 services. For the sake of simplicity, let us additionally assume that all services used offer the same (good) SLA as Kinesis.

Here is the point: The availabilities of all parts used multiply up!

This means: If you use 10 services with a 99.9% availability promise each, you are down to an overall availability promise of roughly 99% for the combined services you use – if they are completely uncorrelated 7. If they have dependencies (as we have seen throughout the outage), the availability you can expect is lower. This means you need to be prepared for roughly 7.2 hours of non-availability within a month, even if all services keep their 99.9% availability promise.
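The multiplication is easy to reproduce. The sketch below assumes fully independent services and a 30-day month; the function names are chosen for illustration only.

```python
import math


def combined_availability(availabilities):
    """Overall availability if all parts are needed and fail independently."""
    return math.prod(availabilities)


def expected_downtime_hours(availability, days_in_month=30):
    """Hours per month during which the combined system may be unavailable."""
    return days_in_month * 24 * (1 - availability)


overall = combined_availability([0.999] * 10)
print(f"{overall:.4%}")                           # about 99.0%
print(f"{expected_downtime_hours(overall):.1f}")  # roughly 7.2 hours
```

With correlated failures the independence assumption no longer holds, so this product should be read as an optimistic estimate.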

In other words: Without additional measures, you need to be prepared for a workday (from 9 to 5) of non-available infrastructure per month for an average cloud application. If you look at it from this perspective, the outage was nothing special. It was just something you needed to expect anyway. The surprising part of the story is that such outages happen so rarely.

Maybe AWS does too good a job by usually exceeding their promises by far, so that customers take the availability for granted, become sloppy and complain big time when the expected happens. And the media joins in.

But in the end, the hobbled companies most likely did not assess the SLAs carefully enough. It is all in there. You just need to do the math.

Same pattern in on-premises contexts

Summing up, it seems as if the hobbled companies

  • did not understand the effects of distributed systems well enough
  • fell for the 100% availability trap
  • did not assess the provided SLAs carefully enough.

But to be fair: This mindset and the resulting behavior are not limited to public cloud usage. We can see them everywhere. 8

We see it even more in on-premises scenarios where the databases, message queues, event buses, container schedulers, VM hypervisors, etc. are all treated as if they were 100% available – which they are not. Their SLAs (usually worse than the Kinesis SLA we have seen) are also rarely assessed carefully enough before designing applications that run on this infrastructure.

Remote applications are also usually considered to be 100% available, guaranteed, in on-premises environments. The same effect can be observed in many microservices implementations where all other services are considered to be up 100% of the time, guaranteed.

And when people learn the hard way that this assumption is wrong, they currently call for service meshes, Apache Kafka, and the like, expecting that these additional pieces of infrastructure will solve their problems for good – falling for the 100% availability trap again.

Resilience and chaos engineering

As I wrote at the beginning of this post, no matter how you look at it:

In distributed systems, things go wrong across process boundaries.

You cannot predict what will go wrong when.

It will hit you at the application level.

Especially the last sentence is important: It will hit you at the application level. This is why you need resilience engineering. 9

Resilience engineering lets you systematically assess the criticality of your business use cases, examine potential failure modes and decide about countermeasures. Done correctly, it is a powerful risk management tool that also takes the economic consequences of your actions into account.

Regarding the AWS outage, resilience engineering would have asked what the consequences of a longer cloud infrastructure outage would be. If the impact is too high, the next step is to identify the most critical use cases. Then the potential failure scenarios are identified and countermeasures are defined.

In the end, it is plain risk management: How much downtime can you afford and what do you want to do about it? 10

Often, resilience engineering is complemented by chaos engineering. While resilience engineering helps you to address known failure modes, chaos engineering helps you to detect unknown failure modes (and validate the effectiveness of your resilience measures). 11

Similar to exploratory testing, you simulate arbitrary failure situations and observe how the system responds to them. Note that you always carefully control the “blast radius” of your failure simulations (called “experiments”), i.e., you limit the potential impact of the experiments.

Usually, you bundle a whole set of experiments in so-called “game days”. In more advanced organizations you continuously run failure simulations and observe the results.
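To make the idea of an experiment with a limited blast radius more tangible, here is a toy sketch at the application level. It does not represent any particular chaos engineering tool: `with_fault_injection`, `InjectedFailure` and the `in_blast_radius` predicate are hypothetical names, and real experiments typically inject failures at the infrastructure or network level.

```python
import random


class InjectedFailure(Exception):
    """Hypothetical error used to simulate a failing dependency."""


def with_fault_injection(call, failure_rate, in_blast_radius, rng=random.random):
    """Wrap a dependency call so that it fails artificially for some requests.

    Only requests for which in_blast_radius(request) is true can be affected,
    which keeps the impact of the experiment limited.
    """
    def wrapped(request):
        if in_blast_radius(request) and rng() < failure_rate:
            raise InjectedFailure("simulated dependency failure")
        return call(request)
    return wrapped


# Example: only requests of internal test users may be hit by the experiment.
lookup = with_fault_injection(
    call=lambda request: {"user": request["user"], "status": "ok"},
    failure_rate=0.5,
    in_blast_radius=lambda request: request["user"].startswith("test-"),
)
print(lookup({"user": "alice"}))  # outside the blast radius, never affected
```

While the experiment runs, you would observe whether the calling code degrades gracefully for the affected requests.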

Even if the terminology might sound a bit playful, chaos engineering is anything but a game. It is a vital part of risk management. If you have critical use cases that must be up and running, you need to explore yet unknown failure sources. In distributed systems this is not optional, but can make the difference between success and demise.

You might argue that you do not have the time or the budget for that. Well, the hobbled companies probably acted the same way. The question is: How much money and reputation did these companies lose during the outage not having done their resilience and chaos engineering homework? Chances are that some of them will not survive it.

Thus, the question you need to ask yourself: Can you afford it?

Personally, I think most companies cannot afford to ignore resilience and chaos engineering. Quite a few of them still ignore it, hoping that things will go well. But hope is a bad advisor when it comes to risk management.

Summing up

This blog post has become a lot longer than I expected and admittedly I only scratched the surface. My key messages are:

  • Saying that it is solely the fault of AWS that applications running on their infrastructure were down during the outage is IMO not correct.
  • The hobbled companies did not understand the effects of distributed systems well enough and fell for the 100% availability trap.
  • The hobbled companies did not assess the SLAs well enough. From the perspective of the SLAs, the outage was nothing unexpected.
  • Resilience engineering helps to assess and mitigate availability and thus economic (or worse) risks. This is not an IT-only show but a joint effort of business and IT.
  • Chaos engineering augments resilience engineering by uncovering yet unknown failure modes.
  • Resilience and chaos engineering are vital, non-optional instruments of contemporary risk management. Ignoring them means hobbling yourself – or worse.

There would be so much more to say, but the post is already too long. I will pick up the discussed topics in more detail in future posts. But for now, I will leave it there and hope I was able to give you some ideas to ponder.

  1. Maybe they do too good a job, as their users obviously rely blindly on the continuous availability of all their services. And to be fair: Most of the competitors of AWS also do a terrific job avoiding failures that affect their users. I only explicitly name AWS because they experienced the outage this week. ↩︎

  2. If you then ask the engineers why their designs do not include any measures to compensate for potential failures of the other parts their application interacts with, you often get responses like: “That part is so important. It must not fail. If it fails, we have much bigger problems. Thus, we do not need to implement any measures.” ↩︎

  3. (Not only) non-IT people often try to suppress such discussions with statements like “Such technical errors must not happen”, “Why do you bother me with that? That is your job, not mine” or other unhelpful statements. The problem is that those statements miss the point. As soon as we deal with distributed systems (and these days every non-trivial system is distributed), the question is not if things go wrong, but when things will go wrong and how hard it will hit us at the application level. Thus, we need to discuss how to handle the technical errors from a business perspective. ↩︎

  4. Prior to the service availability commitment of 99.9%, the SLA defines the “Included Services”. As Kinesis is not a single service but consists of several services, i.e., Amazon Kinesis Data Analytics (“Amazon KDA”), Amazon Kinesis Data Firehose (“Amazon KDF”), Amazon Kinesis Data Streams (“Amazon KDS”) and Amazon Kinesis Video Streams (“Amazon KVS”), they are all listed explicitly in the SLA under “Included Services”. ↩︎

  5. Note the subtle difference in my wording. I do not use the word “guarantee” but “promise” because the SLA does not guarantee the uptime. It just states that AWS will do their best and if they should fail they reimburse parts of the costs you pay for their service. ↩︎

  6. Sometimes the promised availabilities in the SLAs are a lot lower. E.g., even if AWS promises to try to provide an overall availability of 99.99% for their compute services, the promised availability for a single EC2 instance is a lot lower: “AWS will use commercially reasonable efforts to ensure that each individual Amazon EC2 instance (“Single EC2 Instance”) has an Hourly Uptime Percentage of at least 90% of the time in which that Single EC2 Instance is deployed during each clock hour (the “Hourly Commitment”). In the event any Single EC2 Instance does not meet the Hourly Commitment, you will not be charged for that instance hour of Single EC2 Instance usage.” (copied from the Amazon Compute SLA) ↩︎

  7. Elias Strehle pointed me to this small but crucial detail (thx a lot for that!): Availabilities of parts only multiply up if they are uncorrelated, i.e., mutually independent. If they have dependencies, the overall availability goes down a lot more. You can see it in the AWS outage: Even though only Kinesis had a problem, other services were impacted, i.e., also became less available. And AWS goes to great lengths to keep their services as independent as possible. Typical enterprise landscapes tend to be highly interdependent, i.e., it is not rare that a failure in a single part takes down the whole application landscape. ↩︎

  8. Probably this widespread mindset and the resulting behavior are the reason why those companies were hobbled – not by AWS, but by their own mindset. ↩︎

  9. Resilience engineering is a broad topic with many facets. Describing it as a whole is way beyond the scope of this post (and also would not add anything to the storyline). I will pick up resilience engineering in more detail in future posts. ↩︎

  10. Note that these decisions cannot be made by software engineers alone. These are usually economic decisions which require additional parties at the table: Business experts, often operations experts, sometimes financial experts and often higher level decision makers. ↩︎

  11. As with resilience engineering, it is beyond the scope of the article to describe chaos engineering in detail. ↩︎