Recently, I had two experiences within a few days that made me think about system dependability. In both situations, the systems acted as if detached from the reality surrounding them and thus became confusing or even annoying, even though it would have been easy for them to detect that detachment.
About a decade ago, Jeffrey Dean and Luiz André Barroso published “The Tail at Scale” in the Communications of the ACM [1], an article I consider great. It dives into the topic of latency tail-tolerance.
We discussed the business case for resilient software design in my previous post. Let us assume you have a budget and you know which business processes, capabilities, or interactions (whatever term suits your needs best) are the most critical ones you need to secure, i.e., make more resilient.