It is your fault if your application is down
Do not blame the infrastructure provider
Recently, AWS experienced one of its rare partial outages. Its DynamoDB service suffered a disruption in the US-East-1 region that could be traced back to a latent race condition in the DynamoDB DNS management system. A comprehensive post-event summary describing the outage, its cause and the resulting effects can be found here.
I do not want to add yet another opinion on what AWS should have done differently. Quite the opposite: The trigger 1 was an extremely unlikely race condition that is exceptionally hard to spot when looking at the design and basically impossible to detect via testing. It was one of those surprises that nobody could have anticipated and that only became clear in hindsight.
It is your fault and your fault only
Instead, I would like to focus on the companies that complained that their applications failed due to the partial outage. I would like to focus on the thousands of self-proclaimed “experts” who self-importantly had nothing better to do than lecturing AWS on the Internet about what they did wrong and how they need to design their services more reliably. I would like to focus on the media that had nothing better to do than to blame AWS for the partial outage.
I would like to shout out to all those blamers, complainers and lecturers:
It is your fault if your application is down and your fault only!
I can already hear the furious retorts to my statement: But AWS! But they should have …! But they must …! But …! But …! But …!
Well, there are companies that also run their applications on AWS and also use DynamoDB or other affected services in the US-East-1 region. But their applications did not fail – not by lucky coincidence but by design. The big difference: Those companies did not complain afterwards but designed their applications upfront in a way that lets them cope with a failure of the underlying components.
Therefore: It is your responsibility to make sure that your system is up and running even if some of the underlying components do not work. It is your responsibility and nobody else’s.
Reliable systems …
All the blaming, complaining and lecturing after the partial AWS outage is just an example of a fundamentally strange habit in IT: We tend to blame the infrastructure and middleware providers if our applications fail due to a failure of an infrastructure or middleware component.
Pat Helland and Dave Campbell started their seminal paper “Building on Quicksand” with the words:
“Reliable systems have always been built out of unreliable components.”
This is a foundational engineering principle – also in IT. E.g., John von Neumann, one of the founding figures of computing, already discussed this topic in the early 1950s in his lectures and the resulting paper “Probabilistic logics and the synthesis of reliable organisms from unreliable components”. In other words: This principle has been known in IT for at least 70 years.
… and how not to build them
However, we seem to have forgotten this foundational principle of reliable system design. In modern IT, we tend to do the opposite:
In modern IT, we tend to build unreliable systems and expect the underlying components to make them reliable.
This is the exact opposite of the foundational principle. And to make things worse, we blame the providers of the underlying components for the failures of our unreliable designs.
We see this perversion of the robustness principle not only when AWS has one of its rare partial outages. We see it everywhere in software development. E.g., I recently discussed a software design with a development team. As the system was expected to be highly available from a business perspective, I asked the team about the availability promises they put in their SLA. Leaving aside that (again) an official SLA did not exist (sigh!), the answer of the lead architect (whom I know to be a very competent person) was: “We cannot offer more than 99% availability because this is the availability promise of our compute nodes.”
Same story. Different context.
The conviction of the lead architect (and the rest of the team) was that the components they use limit their possible availability. In other words: When it comes to availability and reliability, they expected the underlying components to take care of it. Their implicit credo: “We build our application and expect the underlying components to ensure availability. Therefore, the application’s availability is limited by the least available underlying component.”
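A quick back-of-the-envelope calculation shows why this credo only holds for a strictly serial design. The numbers below are purely illustrative and assume two 99% compute nodes that fail independently, ignoring the failover mechanism itself:

```latex
% Serial design: every node must be up, so the availabilities multiply
A_{\text{serial}} = A_1 \cdot A_2 = 0.99 \cdot 0.99 = 0.9801 \approx 98\%

% Redundant design: the system is only down if both nodes are down at once
A_{\text{redundant}} = 1 - (1 - A_1)(1 - A_2) = 1 - 0.01 \cdot 0.01 = 0.9999 = 99.99\%
```

In other words: With redundancy and isolation designed in at the application level, the application's availability can exceed the availability of its least available component – which is exactly the point of the principle quoted above.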
Outdated thinking patterns
But where does this thinking pattern, which is still very widespread, come from?
In the end, it is a blast from the past. As I discussed in the 3rd part of my “The long and winding road towards resilience” blog series, this misconception is a leftover from long gone times. Until the early 2000s, we primarily ran monolithic, isolated applications on on-premises hardware, communicating via offline file- or database-based batch updates. In such system landscapes, the two biggest sources of failure for an application were:
- Software bugs
- Failures of the underlying hardware, infrastructure and middleware components
Taking care of software bugs was the task of the QA (quality assurance) department. Taking care of failures of the underlying hardware, infrastructure and middleware components was the task of the Ops (operations) department. The software development department (Dev) was not involved in ensuring availability. Their responsibility was solely to implement business features. Of course, Dev complained all the time about QA and Ops slowing them down with all their bug findings and runtime constraints. But basically, software development lived in a (more or less) blissful bubble where availability was a non-concern.
IT is not what it used to be
Since those days, IT changed a lot:
- Updates between systems were increasingly propagated online. With this, we moved from isolated systems to distributed system landscapes, introducing several new classes of failures and making the underlying fabric of our IT systems non-deterministic.
- The system landscapes became continuously more complex with the ongoing digital transformation.
- Additionally, IT became increasingly indispensable with the ongoing digital transformation. Required uptimes went to 24/7. Downtimes became increasingly intolerable.
- The move towards service-oriented system architectures, often poorly understood regarding their needs, constraints and consequences, reinforced the prior developments.
- Post-industrial markets required much faster feedback cycles and different modes of developing and running IT systems. Especially, the boundaries between Dev and Ops started to dissolve.
- Agile and DevOps including CI/CD changed the collaboration modes between the business departments, software development and operations further while QA basically became part of software development.
- Public cloud infrastructure changed how we design and run IT systems.
- Etcetera …
But while IT changed so fundamentally in the last 20+ years, we failed to adapt our thinking patterns regarding availability and building reliable systems. Our minds are still stuck in the late 1990s. As a consequence, especially in Dev we are caught in the 100% availability trap, still thinking that we only need to take care of software bugs (as QA dissolved into software development) while everything else related to availability and reliability is an S.E.P. (somebody else's problem), i.e., not our problem but the problem of the parties whose stuff we use at runtime.
Avoid cascading failures
But as we can see, this train of thought creates systems that are vulnerable to failures of the underlying components. In the words of resilient software design (RSD): We introduced cascading failures. A cascading failure is a failure that spreads from one component to another. The availability of one component depends on the availability of other components it interacts with.
This is exactly what we have seen with the partial AWS outage: A failure in a small part of AWS’ services in a single region brought down many applications that blindly relied on the availability of those AWS services. Such an application design is a violation of the core principle of resilient software design (based on the principles of fault-tolerant software design):
A system must not fail as a whole. Therefore, split it up into parts and isolate the parts against cascading failures.
This way we create systems that may experience partial failures without failing completely. They may run at a reduced service level while the partial failure persists, but the working parts are still able to do their job as specified (note that a potentially reduced service level must be part of the service level specification).
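To make this a bit more tangible, here is a minimal sketch of one such isolation measure: a hand-rolled circuit breaker that stops calling a failing dependency for a while and serves a fallback instead of propagating the failure. All names are hypothetical, and in a real system you would typically reach for an established resilience library rather than rolling your own:

```python
import time


class CircuitBreaker:
    """Stops calling a failing dependency for a cool-down period and serves a
    fallback instead, so the dependency's failure does not cascade further."""

    def __init__(self, max_failures=3, reset_after_s=30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None  # set when the breaker trips ("open" state)

    def call(self, dependency_call, fallback):
        # While the breaker is open, skip the dependency entirely.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                return fallback()
            # Cool-down over: allow one trial call ("half-open" state).
            self.opened_at = None
            self.failures = 0
        try:
            result = dependency_call()
            self.failures = 0  # a success resets the failure counter
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback()


# Hypothetical usage: degrade gracefully instead of failing with the dependency.
# breaker = CircuitBreaker()
# profile = breaker.call(lambda: user_service.load_profile(user_id),
#                        fallback=lambda: ANONYMOUS_PROFILE)
```

The specific pattern matters less than the design decision behind it: the caller has an explicit answer to the question of what to do when the dependency is not available.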
Resilient software design (RSD)
This raises the question: How do we get there? How do we need to design our systems differently to make them robust against partial failures?
The answer is a foundational design principle of RSD:
For every dependency, we need to ask:
“What am I going to do if this dependency fails?” 2
If the answer is “Then I am going to fail, too”, we have found a missing isolation, a latent cascading failure.
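One lightweight way to run this exercise is to write down the answer explicitly for every dependency. The sketch below is purely illustrative – the dependencies and plans are made up – but it captures the spirit: every entry whose answer is “then I fail, too” marks a missing isolation:

```python
# Hypothetical dependency audit: the recorded answer to the question
# "What am I going to do if this dependency fails?" per dependency.
FAIL_TOO = "then I fail, too"

plan_b = {
    "payment provider": "queue the order and confirm asynchronously",
    "recommendation engine": "render the page without recommendations",
    "session store": FAIL_TOO,  # latent cascading failure found here
}

missing_isolations = [dep for dep, plan in plan_b.items() if plan == FAIL_TOO]
print("Missing isolations:", missing_isolations)
```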
This is very different from the usual 100% availability trap approach where you expect everything you depend upon – especially all underlying infrastructure and middleware parts – to work reliably all the time and thus believe you do not need to care about potential failures of your dependencies.
In contrast to that mindset, the RSD mindset is:
Everything eventually fails.
Therefore, let us ponder the consequences of such a failure and how to respond to it.
The probability that any piece of IT will fail is higher than zero (because availability is always lower than 100%). This means:
The question is not if something fails but only when it fails.
The business case of RSD
Let us assume we did the RSD exercise and decided how to respond to a potential failure of a component we depend upon – our “plan B” for the failure of that component. This immediately raises the question: But what if my plan B fails? Do I need a plan C for that situation? And if my plan C fails? And so on. How can I be absolutely sure my application will not fail?
The point is that a countermeasure against a failing dependency might also fail, typically due to another failing dependency. Do we need another countermeasure for this situation? And what if that countermeasure also fails? Where should we stop? We could follow this path for a very long time.
First of all: You cannot guarantee that your application will never fail. With enough effort and money, you can get very close to 100% availability, but you will never reach it. There will always be a residual risk of failure.
But more importantly, this is where the business case of resilient software design comes into play. The short version (the long one can be found in the aforementioned blog post): Usually, we lose money while our system is down, directly and indirectly, immediately and delayed, depending on the failure frequency and duration 3. We can use this information (using existing past data plus assumptions where we lack dependable data) to assess the financial loss we expect from system failures. This defines our RSD budget.
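A deliberately simplified example of such an assessment, with purely made-up numbers: assume past data suggests roughly two relevant outages per year of about two hours each, and the business side estimates a loss of about 50,000 EUR per hour of downtime. Then:

```latex
\text{expected annual loss} \approx 2\ \text{outages} \times 2\,\text{h} \times 50\,000\ \tfrac{\text{EUR}}{\text{h}} = 200\,000\ \text{EUR}
```

This figure – and not more – is roughly the upper bound for what it makes economic sense to spend per year on countermeasures against these failures.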
If the expected impact is big, the available budget will be big (or at least should be big). If the expected impact is small, the budget will be small 4. But the RSD budget should never be bigger than the potential financial loss caused by failures because anything else would not make sense from an economic perspective.
This basic approach needs to be broken down for the different places where a dependency may fail because different kinds of potential failures typically have a different impact. E.g., if the recommendation engine of an e-commerce system should fail, it would be a nuisance, but the financial impact is limited. Thus, the RSD budget for dealing with a failing recommendation engine will be rather small. A typical countermeasure would be simply not showing any recommendations in such a situation.
If, however, writing the order at checkout should fail, we have a big problem. Checkout is where the money is made. The financial impact of not being able to complete checkout is big (at least if this e-commerce system is a relevant revenue source). Thus, the RSD budget for making sure that checkout works will be big. Countermeasures could, e.g., include a backup storage system for the orders and multiple, independent payment providers (assuming we use external payment providers) connected via redundant and independent network connections, etcetera.
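As a sketch of the kind of countermeasure meant here – the names and store clients are made up, and a real design would additionally need idempotent writes, monitoring and a reconciliation job:

```python
class StoreUnavailable(Exception):
    """Raised by a store client when the underlying service is unreachable."""


def persist_order(order, primary_store, backup_store):
    """Accept the order even if the primary order store is down by writing it
    to an independent backup store and reconciling later."""
    try:
        primary_store.write(order)
        return "accepted"
    except StoreUnavailable:
        # Degraded but not broken: checkout still completes, the order is not
        # lost, and a background job moves it to the primary store later.
        backup_store.write(order)
        return "accepted (pending reconciliation)"
```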
A different kind of discussion
While the basic idea of the business case of RSD is quite straightforward, defining RSD budgets in practice can be challenging.
The key issue is that failures are technical problems. Failures are not business problems like an expired credit card or the like. Failures are about failing processes, failing networks, failing middleware, failing infrastructure, etcetera. But the decision what to do in case of such a technical failure and how much budget and effort to invest in countermeasures is a business-level decision.
This means we need a person with the technical expertise who knows what can go wrong and what the impact would be at a technical level, and we need someone with the business expertise who can assess what such a failure would mean in terms of lost money and other kinds of business risk. And in case the business expert is not the budget owner, we also need a person who can allocate the required RSD budgets. These people need to sit down together and discuss possible failure scenarios and their business impact to decide how much money to allocate for countermeasures, i.e., for resilient software design.
This typically poses two problems.
The first one is that the business experts (and budget owners) are usually also still stuck in the 100% availability trap. Their usual first response to the question of how to respond to a technical failure and how much budget to allocate for countermeasures is that such a failure must not happen and that it is our responsibility to make sure it does not happen. This means we first need to teach them (in a friendly way) that failures are inevitable and thus their response does not make any sense.
But this is not the only problem. In practice, establishing such joint discussions is very hard for most companies because neither their organization nor their processes are set up for them. Typically, the business experts are solely focused on business-level requirements and the technical experts are solely focused on technical topics. They do not share any common ground, which makes it very hard for them to find a shared basis for their conversation.
Additionally, this kind of conversation would break the one-way communication from business department to IT department that is still predominant in most companies and laid down in their processes. Feedback loops do not exist and joint conversations are not intended. Therefore, most companies still have to learn how to have such joint conversations between business and IT at eye level.
So close, and yet so far
To sum up:
- We need to adapt our way of thinking to today’s IT reality and avoid the 100% availability trap.
- We need to ask ourselves what to do if a dependency of our application fails and assess how much budget to invest in countermeasures.
- We need to learn to have the conversations between business and IT to determine the right budgets.
As so few companies have yet adopted this RSD way of thinking and acting (let alone full-fledged resilience), this raises the question: Is that too much to expect?
Personally, I think this is the bare minimum of sensible and responsible software design. However, it often seems to be too much for many companies, especially those that are efficiency-obsessed (which most companies are, especially those that have existed for a longer period of time).
As I discussed in the 6th part of my “The long and winding road towards resilience” blog series, efficiency obsession and resilience are antagonists. Solely focusing on efficiency will necessarily compromise resilience and with it the reliability of the systems. If a company acts efficiency-obsessed, their applications fail when AWS experiences a partial outage. And then they demand that AWS become more reliable because investing in the reliability of their own systems is beyond their imagination as it would “jeopardize” their efficiency.
Ensuring RSD
Google once did it the other way round. They had a distributed locking service called “Chubby”. Chubby was very reliable, but like all systems, it failed once in a while. And Google noticed that whenever Chubby failed, a lot of their other applications failed, too. Obviously, those applications were blindly relying on the uninterrupted availability of Chubby.
If Google had acted the way all the blamers, complainers and lecturers I mentioned at the beginning of this post demand, they would have put a lot of effort into making Chubby more reliable.
Interestingly, Google did the opposite. They announced that they would deliberately shut down Chubby once in a while and that they expected the other services not to fail while Chubby was down. They knew that it is not possible to guarantee 100% uptime, and thus the only option to increase the overall availability of their applications was to isolate them against failures of Chubby. And the best way to achieve this is to make sure the consumed component (here: Chubby) actually is unavailable once in a while.
This is how you build reliable systems – not by demanding 100% availability from all dependencies. Again:
“Reliable systems have always been built out of unreliable components.”
Moving on
Besides all that, everything I wrote after the last bigger partial AWS outage – which, by the way, was 5 years ago, an incredibly long time for a system landscape as complex and fast-moving as AWS’ – is still valid.
If you are interested in actual resilience (including dependable systems), you may also be interested in my 10-part blog series “The long and winding road towards resilience”.
And if you need more (shameless plug): I am for hire … ;)
We can look at your system design together and evaluate how robust it is. We can design the countermeasures together. We can figure out sensible RSD budgets and learn to have the required conversations in your company. We can even look at your organization and make it more resilient. And I can train and mentor your people, from developer to C-level. (end of shameless plug)
But whatever your decision regarding RSD is, never forget:
It is your fault if your application is down and your fault only!
No one else is to blame …
1. I prefer not to use the term “root cause” in such a context because in the end, things were not that simple. There are so many pieces working together in very intricate ways – including various well-designed fault-tolerance measures that were included to reduce the probability of a failure (which they did multiple times in the past and will continue to do in the future) – that it does not make too much sense to reduce this complex web of interactions and mutually reinforcing errors to a single root cause. There may have been an initial trigger, but after it went off, it was a complex web of mutual interactions that eventually led to the failure. ↩︎
2. To be fair: It is a bit more complicated than that. However, this is a very good starting point if you want to design an application that is robust with respect to failures of other system parts. Probably most of the applications that went down during the partial AWS outage would have survived the outage if their software engineers had consistently asked this question during the design of their applications. ↩︎
3. The alternative to losing money would be a safety threat caused by the failure of a system. In the end, this leads to a similar calculation scheme. The only difference is that we cannot immediately calculate an expected financial loss that defines our RSD budget. Instead, we need to assess the safety threat risk in terms of likelihood and impact of occurrence and derive the RSD budget from that. ↩︎
4. If there were no impact of the system failing, we could safely switch off the system because then it is obviously irrelevant. ↩︎