Circuit Breakers Are Not Enough
Circuit breakers sound like the obvious answer to cascading failures, but they are fundamentally at odds with how modern distributed systems are designed to fail partially.
"Modern distributed systems are designed to partially fail, continuing to provide service to some clients even if they can't please everybody. Circuit breakers are designed to turn partial failures into complete failures. One mechanism will likely defeat the other." (Marc Brooker)
Consider a sharded database where the A-H shard is overwhelmed because a conference full of Aarons is signing up. The I-R and S-Z shards are fine. Should the circuit breaker trip? If yes, Jane and Tracy lose access to a service that was working perfectly for them. If no, the breaker is not doing its job. The circuit breaker faces a binary decision (is the system "down" or not?), but real distributed systems fail in gradients, not absolutes.
The same problem applies to cell-based architectures, which are explicitly designed to contain blast radius. A circuit breaker that trips on one cell's failure makes the whole system look down to the client, defeating the purpose of cells entirely. For a circuit breaker to do the right thing, it would need to predict whether this specific call with these specific parameters will succeed, which requires the client to understand internal sharding, partitioning, and routing details of the service. That level of coupling makes systems nearly impossible to change independently.
The proposed fixes (tight coupling between client and service internals, server-side overload annotations, or ML-based inference) each carry their own significant costs. The deeper lesson is not that circuit breakers are useless but that the framing of "detect failure, stop calling" oversimplifies the problem. Better approaches include per-key or per-shard health tracking, server-driven backpressure, adaptive load shedding, and retry budgets that limit work amplification without binary on/off decisions.
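An added example (not from the original article) that simulates the A-H / I-R / S-Z scenario above with one client-wide breaker versus one breaker per shard. The `Breaker` class, `shard_of`, `backend` and the traffic are constructed for illustration, and the sketch assumes the client can learn the shard label, which is the coupling cost the article describes.

```python
from collections import defaultdict


class Breaker:
    """Consecutive-failure breaker. No half-open/reset timer, to keep the sketch minimal."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    def allow(self):
        return self.failures < self.threshold

    def record(self, ok):
        self.failures = 0 if ok else self.failures + 1


def shard_of(name):
    # Assumption: three shards split by first letter, as in the text's example.
    c = name[0].upper()
    return "A-H" if c <= "H" else "I-R" if c <= "R" else "S-Z"


def backend(name, overloaded=("A-H",)):
    # Constructed stand-in for the service: only the overloaded shard fails.
    return shard_of(name) not in overloaded


def simulate(requests, per_shard):
    """Return {name: successful_calls} under a global or a per-shard breaker."""
    breakers = defaultdict(Breaker)
    served = defaultdict(int)
    for name in requests:
        breaker = breakers[shard_of(name) if per_shard else "global"]
        if not breaker.allow():
            continue  # rejected locally, backend never called
        ok = backend(name)
        breaker.record(ok)
        served[name] += int(ok)
    return dict(served)


# Constructed traffic: a burst of Aarons, then Jane and Tracy on healthy shards.
REQUESTS = ["Aaron"] * 5 + ["Jane", "Tracy"] * 3

if __name__ == "__main__":
    print("global breaker:   ", simulate(REQUESTS, per_shard=False))
    print("per-shard breaker:", simulate(REQUESTS, per_shard=True))
```

With the global breaker, the three failed Aaron calls open the circuit and Jane and Tracy are rejected although their shards are healthy; with per-shard tracking only A-H traffic is cut off.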
Takeaway: Do not rely on circuit breakers as your primary defense against cascading failure: they impose binary decisions on systems designed for partial failure.
See also: Goodput Matters More Than Throughput | Metastable Failures Are the Hardest to Prevent | Correlated Failures Are the Real Threat