Correlated Failures Are the Real Threat

January 31, 2022

Redundancy only works if failures are independent. Correlated failures, where multiple components fail from the same underlying cause, destroy the exponential reliability gains that redundancy promises.

"Whether failures happen independently or at the same time matters a lot for availability." (Marc Brooker)

The math of redundancy is seductive. One server with 92% availability becomes about 99.4% with two independent replicas and 99.95% with three, because the system is down only when every replica is down at once (0.08² = 0.64% and 0.08³ ≈ 0.05% of the time). But if the replicas always fail together, you are back to 92% regardless of how many you have. Correlated failures take that entire exponential curve and flatten it. And they are far more common than people assume:

Shared power and cooling
Common DNS dependencies
Identical software on every node
Operator actions that touch the entire fleet simultaneously
Hardware batches with the same latent manufacturing defect
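
The redundancy arithmetic above can be checked with a few lines of Python. This is a minimal sketch under two idealized assumptions that are not claims about any real system: replicas are either fully independent or perfectly correlated, and the service is "up" as long as at least one replica is up. The function name `availability` and its parameters are illustrative only.

```python
def availability(single: float, replicas: int, correlated: bool = False) -> float:
    """Probability that at least one of `replicas` copies is up.

    `single` is the availability of one copy, e.g. 0.92.
    """
    if correlated:
        # Perfectly correlated: every replica fails at the same moments,
        # so extra copies add nothing.
        return single
    # Independent: the system is down only when every replica is down at once.
    return 1 - (1 - single) ** replicas


for n in (1, 2, 3):
    independent = availability(0.92, n)
    together = availability(0.92, n, correlated=True)
    print(f"{n} replica(s): independent={independent:.4%} correlated={together:.4%}")
```

Real fleets sit somewhere between the two extremes, which is why each shared dependency in the list above pulls the result back toward the single-server number.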

Software deployments are the dominant source of correlated failure in well-run systems. Deployments cut across redundancy boundaries in ways that requests, data, and infrastructure generally cannot. Every server in a fleet runs the same code with the same limits. If a load-related bug causes one server to fail, an evenly distributed workload will cause every other server to hit the same bug simultaneously, like lemmings walking off a cliff, each following the same algorithm.

The countermeasures are varied but share a theme of introducing heterogeneity:

Availability Zones provide infrastructure isolation
Cellular architectures limit blast radius
Shuffle sharding ensures no two servers see the exact same workload
Jitter (small randomness added to retry delays, cache TTLs, credential expiration, and housekeeping schedules) prevents synchronized behavior across fleets

Even seeding jitter with a per-server value (like hostname) gives you randomness that is reproducible per server while ensuring fleet-wide diversity. The goal is always the same: make it structurally difficult for many things to break the same way at the same time.
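
Hostname-seeded jitter might look like the sketch below. The helper names (`per_host_rng`, `jittered`), the `purpose` label, the 10% spread, and the example hostnames are all assumptions made for illustration, not something prescribed by the text. The seed comes from a SHA-256 digest rather than Python's built-in `hash()`, because string hashes are salted per process and would not be reproducible across restarts.

```python
import hashlib
import random


def per_host_rng(hostname: str, purpose: str) -> random.Random:
    """Build a random generator whose seed depends only on host and purpose."""
    digest = hashlib.sha256(f"{hostname}:{purpose}".encode()).digest()
    # Use the first 8 bytes of the digest as a stable integer seed.
    return random.Random(int.from_bytes(digest[:8], "big"))


def jittered(base_seconds: float, rng: random.Random, spread: float = 0.1) -> float:
    """Scale base_seconds by a random factor in [1 - spread, 1 + spread]."""
    return base_seconds * (1 + rng.uniform(-spread, spread))


# Hypothetical fleet: each host gets its own stable cache TTL near 300 seconds.
for host in ("web-01", "web-02", "web-03"):
    ttl = jittered(300, per_host_rng(host, "cache-ttl"))
    print(f"{host}: cache TTL {ttl:.1f}s")
```

In a real service the hostname would come from something like `socket.gethostname()`. Including a purpose label in the seed keeps a host's retry jitter from lining up with its own cache-expiry jitter.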

Assume failures will be correlated and design against it. Redundancy without independence is an illusion.
