Tail Latency Dominates User Experience
In fan-out architectures, the overall request latency is determined by the slowest component. As the number of parallel calls grows, the probability of hitting a tail latency outlier approaches certainty.
"You can horizontally scale your way toward any throughput target, but there is no easy fix for latency spikes." (Yao Yue)
When a single user request fans out to 100 backend services in parallel, the user waits for the slowest one. If each backend has a 1% chance of being slow, the probability that at least one is slow is 1 − 0.99^100 ≈ 63%. At 1000 services, it is 99.996%. Tail latency that is invisible at the component level becomes the dominant factor in user-perceived performance at scale. This is why Twitter maintains a strict SLO of p999 under 5 milliseconds for cache operations: at their scale, even 99.9th-percentile outliers are encountered constantly.
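The arithmetic above follows directly from the independence assumption; a minimal sketch (function name and example probabilities are illustrative, not from the original):

```python
def p_any_slow(p_slow: float, n: int) -> float:
    """Probability that at least one of n independent parallel calls
    lands in the slow tail, given each has probability p_slow."""
    return 1 - (1 - p_slow) ** n

# Each backend is slow 1% of the time:
print(round(p_any_slow(0.01, 100), 2))   # → 0.63
print(round(p_any_slow(0.01, 1000), 5))  # → 0.99996
```

The assumption of independence is conservative in practice: correlated causes (a shared GC pause, a noisy neighbor) can make the tail even heavier.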
The sources of tail latency are varied and often surprising. Shared resources create contention: a background garbage collection pause, an OS scheduling decision, a burst of network interrupts, cache pollution from a co-located workload. These are not bugs in your code; they are emergent properties of running on shared infrastructure. The tail is often determined by system factors outside the application's control, including suboptimal CPU scheduling, interrupt bursts, and CPU migration between cores.
Effective strategies work at multiple levels. At the infrastructure level, separate control-plane from data-plane processing, use dedicated queues and CPU affinity to isolate latency-sensitive work, and batch operations to reduce context-switch overhead. At the application level, prefer partial order over total order (quorum reads instead of waiting for all replicas) to get best-K-of-N performance. Hedge requests by issuing redundant calls and taking the first response. And recognize that predictability is more valuable than peak throughput: a system that reliably serves p999 under 5 ms is more useful than one that averages 1 ms but occasionally spikes to 500 ms.
Takeaway: Design for the tail, not the median. At scale, your slowest component is your user's experience.
See also: Latency Sneaks Up On You | Goodput Matters More Than Throughput | The Fundamental Mechanism of Scaling Is Partitioning