Correlated Failures Are the Real Threat
Redundancy only works if failures are independent. Correlated failures, where multiple components fail from the same underlying cause, destroy the exponential reliability gains that redundancy promises.
"Whether failures happen independently or at the same time matters a lot for availability." (Marc Brooker)
The math of redundancy is seductive. One server with 92% availability becomes 99.36% with two independent replicas and 99.95% with three, because the system is down only when every replica is down at once (0.08 squared, then 0.08 cubed). But if the replicas fail together, you are back to 92% regardless of how many you have. Correlated failures take that entire exponential curve and flatten it. And they are far more common than people assume: shared power and cooling, common DNS dependencies, identical software on every node, operator actions that touch the entire fleet simultaneously, and hardware batches with the same latent manufacturing defect.
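The arithmetic above can be checked with a short sketch. The `correlated_fraction` parameter and the mixture model behind it are assumptions made for illustration, not something the source defines: a fraction of each server's downtime is attributed to a shared cause that takes every replica down at once, and the remainder is treated as independent.

```python
def availability(single: float, replicas: int, correlated_fraction: float = 0.0) -> float:
    """Availability of a group that is up whenever at least one replica is up.

    Assumed model (illustrative only): `correlated_fraction` of each node's
    downtime comes from a shared cause that takes every node down together;
    the rest of the downtime is independent per node.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    if not 0.0 <= correlated_fraction <= 1.0:
        raise ValueError("correlated_fraction must be in [0, 1]")

    down = 1.0 - single                  # per-node unavailability
    shared = correlated_fraction * down  # probability of the shared outage
    if shared >= 1.0:
        return 0.0
    # Independent per-node unavailability, chosen so that each node's total
    # unavailability still equals `down`.
    independent = (down - shared) / (1.0 - shared)
    # The group is down during the shared outage, or when every replica is
    # independently down at the same moment.
    system_down = shared + (1.0 - shared) * independent ** replicas
    return 1.0 - system_down


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(
            n,
            f"independent={availability(0.92, n):.4%}",
            f"fully correlated={availability(0.92, n, 1.0):.4%}",
        )
```

Under this assumed model, availability can never exceed 1 - correlated_fraction * (1 - single), however many replicas are added. For a 92% server with a tenth of its downtime shared, that ceiling is 99.2%.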
Software deployments are the dominant source of correlated failure in well-run systems. Deployments cut across redundancy boundaries in ways that requests, data, and infrastructure generally cannot. Every server in a fleet runs the same code with the same limits. If a load-related bug causes one server to fail, an evenly distributed workload will cause every other server to hit the same bug simultaneously, like lemmings walking off a cliff, each following the same algorithm.
The countermeasures are varied but share a theme: introducing heterogeneity. Availability Zones provide infrastructure isolation. Cellular architectures limit blast radius. Shuffle sharding ensures no two servers see the exact same workload. Jitter (small randomness added to retry delays, cache TTLs, credential expiration, and housekeeping schedules) prevents synchronized behavior across fleets. Seeding the jitter with a per-server value, such as the hostname, keeps each server's randomness reproducible while still ensuring fleet-wide diversity. The goal is always the same: make it structurally difficult for many things to break the same way at the same time.
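A minimal sketch of hostname-seeded jitter follows. The function names, the 10% spread, the 300-second base interval, and the `web-0N` hostnames are all hypothetical choices for illustration. The seed is derived with `hashlib` rather than the built-in `hash()`, because Python randomizes string hashes per process, which would defeat reproducibility.

```python
import hashlib
import random


def server_rng(hostname: str) -> random.Random:
    """Return a random generator whose seed is derived from the hostname.

    The same hostname always yields the same sequence; different hostnames
    yield different sequences.
    """
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    return random.Random(seed)


def jittered(base_seconds: float, rng: random.Random, spread: float = 0.1) -> float:
    """Scale base_seconds by a uniform factor in [1 - spread, 1 + spread]."""
    return base_seconds * rng.uniform(1.0 - spread, 1.0 + spread)


if __name__ == "__main__":
    # Hypothetical fleet: each host gets its own stable schedule of TTLs.
    for host in ("web-01", "web-02", "web-03"):
        rng = server_rng(host)
        print(host, [round(jittered(300.0, rng), 1) for _ in range(3)])
```

Each server can replay its own schedule when debugging, yet no two servers expire caches or run housekeeping at the same instant. For retry delays specifically, a fresh unseeded source of randomness may be preferable, since reproducibility matters less there than decorrelation.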
Takeaway: Assume failures will be correlated and design against it. Redundancy without independence is an illusion.
See also: Static Stability Over Dynamic Failover | Metastable Failures Are the Hardest to Prevent | Efficiency Is The Enemy of Resilience