The tail at scale
About a decade ago, Jeffrey Dean and Luiz André Barroso published their – in my opinion great – article “The tail at scale” in the Communications of the ACM 1. The article dives into the topic of latency tail-tolerance.
I read the article several years ago and found it very interesting. Recently, I reread it and was surprised how many of the ideas and concepts described in the article I had forgotten since the first reading. Hence, I decided to capture the key ideas of the article, hoping that I will not forget them again – or at least to have a quick way to look them up.
I have tried capturing the key ideas of papers and articles in a number of ways before. At least for me, few of them worked well. Most of them fell short in one way or the other.
One format that did work – for its author, at least – is Adrian Colyer’s The Morning Paper: short summaries of important, influential, topical or otherwise interesting papers in the field of computer science.
His blog is a great source of information. I found so many interesting computer science papers just by following his blog and reading his digests. Just be warned: His blog can be a real time sink. You can easily spend hours reading one digest after the other.
I do not want to step into Adrian’s shoes – they won’t fit. But I will try to adopt the core idea: Capture the ideas and concepts of a paper by writing a summary. And by sharing it via my blog, maybe it will not only be useful for me, but also for some of you. We will see …
So, let me get started with my summary of “The tail at scale”. 2
Dean and Barroso start with the observation
Systems that respond to user actions quickly (within 100ms) feel more fluid and natural to users than those that take longer.
I think that 100ms is quite a bold target as most companies would be happy to reach 500ms at the 99th percentile. But then again, the authors worked for Google at the time they wrote this article, a company quite obsessed with service response times.
But even if we can argue about the numbers: Latency matters! 3
Next, the authors observe that with growing service complexity and use, it becomes harder to satisfy such response-time targets:
It is challenging for service providers to keep the tail of latency distribution short for interactive services as the size and complexity of the system scales up or as overall use increases. Temporary high-latency episodes (unimportant in moderate-size systems) may come to dominate overall service performance at large scale.
The basic observation is that with growing size and usage of a service, the latency outliers start to dominate the response times. Even if the authors write that this effect is not relevant for moderate-size systems (which are the types of systems most of us work with), I think we can still learn a lot from the large-scale experiences at Google.
These days, many companies turn to microservices architectures (be it a reasonable decision or not). This typically means that lots of services need to respond timely to create a smooth user experience for the callers of such systems. Additionally, most systems and services need to access a lot of other runtime components to complete their jobs.
Overall, most contemporary enterprise system landscapes have many moving parts that need to interact as latency-free as possible to provide a smooth usage experience. And the higher the load on those systems, the higher the impact of some tail latency glitches on the usage experience.
But what triggers tail latency?
Dean and Barroso mention several typical triggers of tail latency. Most of them have to do with resource contention or the temporary unavailability of resources. Here is the list they provided:
- Shared resources (CPU, processor caches, memory or network bandwidth, etc.)
- Global resource sharing (network switches, shared file systems, etc.)
- Maintenance activities (of the systems itself, e.g., some background reconciliation or cleanup jobs)
- Power limits
- Garbage collection
- Energy management (power-saving modes in devices)
So, there are many reasons why some system part can exhibit a temporary response-time peak, often outside our immediate control.
Reducing response-time variability
The authors then briefly talk about some generic measures to reduce response-time variability, like:
- Differentiating service classes (assigning different priorities)
- Keeping low-level queues (close to the OS) short so they do not interfere with higher-level queue management
- Reducing head-of-line blocking by breaking down long-running requests
- Managing the load of background activities with respect to interactive request latency
While the authors state that careful engineering for reducing response-time variability is essential for building highly responsive systems, they also state that it will not be able to eliminate tail latency completely. Tail latency might not be as harmful as it would be without those measures, but it is still there.
And again: The more complex and the more successful your service is, the more it will be affected by the (remaining) tail latency.
Hence, more specific measures to mitigate tail latency are needed.
Dean and Barroso differentiate two types of measures:
- Short-term measures that mitigate the effects of tail latency within the scope of a single request
- Longer-term measures that take tens of seconds up to minutes to mitigate longer-term latency phenomena
The first short-term technique the authors discuss is hedged requests:
In its simplest form, you send the same request to several replicas, wait for the first response and try to cancel all other requests. While this technique works, it usually adds a prohibitive amount of extra load: first of all, you need to provision all the replicas, and then the cancellation messages often arrive only after processing has already started, especially if the queues on all replicas are short. So, the lower the overall load, the more extra load this approach induces.
Therefore, it makes sense to add a delay between sending the first and the second request. This still shortens the tail latency a lot while reducing the extra load significantly:
For example, in a Google benchmark that reads the values for 1,000 keys stored in a BigTable table distributed across 100 different servers, sending a hedging request after a 10ms delay reduces the 99.9th-percentile latency for retrieving all 1,000 values from 1,800ms to 74ms while sending just 2% more requests. The overhead of hedged requests can be further reduced by tagging them as lower priority than the primary requests.
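A delayed hedged request can be sketched in a few lines of asyncio. This is a minimal illustration, not Google’s implementation – the replica names, the simulated latencies and the 10ms hedge delay are all made up for the example:

```python
import asyncio
import random

# Hypothetical replica call: latency is usually low, but a small
# fraction of calls lands in the tail (illustrative numbers).
async def call_replica(name: str, request: str) -> str:
    latency = 0.005 if random.random() < 0.95 else 0.200
    await asyncio.sleep(latency)
    return f"{name}:{request}"

async def hedged_request(request: str, hedge_delay: float = 0.010) -> str:
    """Send to a primary replica; if it has not answered after
    hedge_delay, hedge to a secondary. Return the first response
    and cancel the slower request (best effort)."""
    primary = asyncio.create_task(call_replica("primary", request))
    try:
        # If the primary answers within the hedge delay, no extra load at all.
        return await asyncio.wait_for(asyncio.shield(primary), hedge_delay)
    except asyncio.TimeoutError:
        secondary = asyncio.create_task(call_replica("secondary", request))
        done, pending = await asyncio.wait(
            {primary, secondary}, return_when=asyncio.FIRST_COMPLETED
        )
        for task in pending:  # best-effort cancellation of the loser
            task.cancel()
        return done.pop().result()

result = asyncio.run(hedged_request("key-42"))
print(result)
```

Most of the time the primary answers within the delay and no hedge is sent at all – which is exactly why the extra load stays around a few percent.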
Still, all variants of hedged requests have what the authors call a “window of vulnerability”, i.e., a certain probability that requests are executed multiple times unnecessarily. This means there is a limit on how much you can improve: Either you wait relatively long before sending a second request (e.g., until the 95th-percentile response time is reached) or you accept more duplicate work.
While this extra work may be acceptable for most “normal” usage scenarios (especially if you did not reduce response-time variation as aggressively as Google did), it was not good enough for Google, considering the sheer size and load of their services.
Part of the problem is that with hedged requests, request cancellations are triggered by the client: The client will only cancel the second request after it has received a response. If you want to optimize even more aggressively, you need to find a way to cancel extra requests earlier.
This leads to the second short-term technique Dean and Barroso discuss: tied requests.
This technique is based on the observation that the biggest source of response-time variability is often the time requests spend in queues in front of the actual processing. Once request processing starts, the response-time variability goes down significantly. Hence, if you design your system in a way that a pair of duplicate requests can cancel each other once one of them reaches processing, you can optimize your tail-latency/waste-load ratio even further.
As with hedged requests, this technique breaks down if you send the requests simultaneously and the queues are (mostly) empty. In that situation, network latency tends to prevent timely request cancellation:
A common case where this situation can occur is if both server queues are completely empty. It is useful therefore for the client to introduce a small delay of two times the average network message delay (1ms or less in modern data-center networks) between sending the first request and sending the second request.
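The core mechanics of tied requests – enqueue the same request on two servers, and let the first server that starts processing tell the other to drop its copy – can be illustrated with a toy in-process sketch. Here, a shared claim set stands in for the cross-server cancellation message; all names and numbers are invented:

```python
import threading
import queue

# Shared state standing in for the peer-to-peer cancellation message:
# the first server to dequeue a request "claims" it, so the tied copy
# on the other server is dropped before any work is done.
claimed: set = set()
claim_lock = threading.Lock()
results: dict = {}

def server(name: str, q: "queue.Queue") -> None:
    while True:
        req_id, payload = q.get()
        if req_id == "STOP":
            break
        with claim_lock:
            if req_id in claimed:   # peer already started processing:
                continue            # cancel our tied copy, do no work
            claimed.add(req_id)
        results[req_id] = f"{name} processed {payload}"

q_a: "queue.Queue" = queue.Queue()
q_b: "queue.Queue" = queue.Queue()
t_a = threading.Thread(target=server, args=("server-a", q_a))
t_b = threading.Thread(target=server, args=("server-b", q_b))
t_a.start(); t_b.start()

for i in range(5):                  # tie each request to both queues
    q_a.put((f"req-{i}", f"payload-{i}"))
    q_b.put((f"req-{i}", f"payload-{i}"))
q_a.put(("STOP", "")); q_b.put(("STOP", ""))
t_a.join(); t_b.join()

print(len(results))  # each request was processed exactly once
```

In a real system the claim would of course be a network message between the two servers, which is exactly why the small send delay quoted above matters.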
Interestingly, the authors do not recommend probing queue lengths first and then sending the request to the replica with the shortest queue. While this technique usually works quite well in low- to medium-load scenarios, it seems to hit its limits at a certain load threshold.
Longer-term mitigation techniques
Besides the within-request, tail-latency mitigation techniques, Dean and Barroso also discuss several longer-term mitigation techniques that address coarser-grained response-time variations like, e.g., load imbalance. The techniques they discuss are:
- Micro-partitions: Partition your service into many more parts than there are servers needed for it. Each server then handles multiple of those micro-partitions, and if a load imbalance between servers starts to emerge, only single partitions need to be shifted between servers.
- Selective replication: Create more replicas of partitions that are more frequently accessed than others to avoid hotspots and imbalances.
- Latency-induced probation: Temporarily remove particularly slow machines from the service until they recover. As the authors mention, this measure feels a bit counterintuitive at first – removing a machine from a service under high load makes the service faster. Of course, this only works if there are replicas for the service parts of the removed machine.
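The micro-partition idea can be made concrete with a tiny rebalancing sketch: because each server owns many small partitions, correcting an imbalance only requires moving one of them. Server names, loads and the 1.5x threshold below are invented for illustration:

```python
# Each server owns several micro-partitions, each with an observed load.
def rebalance(servers: dict, threshold: float = 1.5):
    """Move the hottest micro-partition from the most loaded server to
    the least loaded one if their total loads differ by more than
    `threshold` times. Returns the moved partition, or None."""
    load = {s: sum(parts.values()) for s, parts in servers.items()}
    hot = max(load, key=load.get)
    cold = min(load, key=load.get)
    if load[hot] <= threshold * load[cold]:
        return None  # balanced enough, nothing to move
    part = max(servers[hot], key=servers[hot].get)  # hottest partition
    servers[cold][part] = servers[hot].pop(part)    # shift just this one
    return part

servers = {
    "s1": {"p1": 8.0, "p2": 7.0, "p3": 1.0},  # overloaded server
    "s2": {"p4": 2.0, "p5": 1.0},
}
moved = rebalance(servers)
print(moved)  # "p1" moves from s1 to s2
```

The point is the granularity: with only one big partition per server, the same correction would mean moving half a server's data.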
Finally, the authors add a few additional thoughts and considerations:
- Good enough: If you have a service that fans out to many (micro-)partitions, it can be okay not to wait until all results from all partitions have returned but only until a given fraction has returned. It can be better to return good-enough results fast (maybe missing a few hits or details) than to return a perfect result way too late. This technique can also be applied to non-essential subsystems of a service (e.g., a spell-checker).
- Canary requests: Sometimes, requests hit an untested code path that results in very long response times or – even worse – in crashes. To avoid such scenarios, requests should be sent to a few partitions first before being sent to all partitions. This is also very useful to detect and prevent malicious denial-of-service attacks. Of course, this measure is most effective if you need to fan out to many (micro-)partitions.
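The good-enough idea maps nicely onto a fan-out loop that stops once a fixed fraction of partitions has answered. A minimal sketch, assuming simulated partitions with random latencies and a made-up 90% threshold:

```python
import asyncio
import random

# Hypothetical partition query with randomly varying latency.
async def query_partition(i: int) -> str:
    await asyncio.sleep(random.uniform(0.001, 0.050))
    return f"hit-from-partition-{i}"

async def fan_out(n_partitions: int = 20, fraction: float = 0.9) -> list:
    """Query all partitions, but return as soon as `fraction` of them
    have answered instead of waiting for the stragglers."""
    tasks = [asyncio.create_task(query_partition(i)) for i in range(n_partitions)]
    needed = int(n_partitions * fraction)
    results = []
    for fut in asyncio.as_completed(tasks):
        results.append(await fut)
        if len(results) >= needed:  # good enough: skip the latency tail
            break
    for t in tasks:                 # cancel the remaining stragglers
        t.cancel()
    return results

hits = asyncio.run(fan_out())
print(len(hits))  # 18 of 20 partitions answered
```

The caller trades a slightly incomplete result set for a response time bounded by the 90th-percentile partition latency instead of the maximum.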
Dean and Barroso also briefly discuss mutating requests, because all measures presented up to this point were targeted at minimizing the tail latency of read requests. Their basic take on this topic is that write requests are not as latency-critical as read requests:
Tolerating latency variability for operations that mutate state is somewhat easier for a number of reasons: First, the scale of latency-critical modifications in these services is generally small. Second, updates can often be performed off the critical path, after responding to the user. Third, many services can be structured to tolerate inconsistent update models for (inherently more latency-tolerant) mutations. And, finally, for those services that require consistent updates, the most commonly used techniques are quorum-based algorithms (such as Lamport’s Paxos); since these algorithms must commit to only three to five replicas, they are inherently tail-tolerant.
The authors close with the expectation that latency tail-tolerance will become a more important topic over time because the drivers for tail-latency will accumulate and are not necessarily predictable, and the size, complexity and load of services will rather grow than shrink. Therefore, they advocate for general tail-tolerance techniques that do not focus on specific sources of latency:
Fault-tolerant techniques were developed because guaranteeing fault-free operation became infeasible beyond certain levels of system complexity. Similarly, tail-tolerant techniques are being developed for large-scale services because eliminating all sources of variability is also infeasible. Although approaches that address particular sources of latency variability are useful, the most powerful tail-tolerant techniques reduce latency hiccups regardless of root cause. These tail-tolerant techniques allow designers to continue to optimize for the common case while providing resilience against uncommon cases.
As I wrote in the beginning: Even if this article focuses on the very specific problems that you will only face if you run Google-size services (which only very, very few of us do), I think it contains a lot of very useful ideas and considerations that also help us to improve our medium-sized services.
Adrian Colyer also wrote a summary of “The tail at scale” in “the morning paper”. I did not (re)read it before writing this post to avoid just repeating what he wrote. If you like, you can read both summaries and pick the best parts of each … ;)