Distributed systems design is about making informed trade‑offs. Whether it’s balancing consistency with latency, managing tail latencies, or designing for graceful degradation, every decision impacts the whole. The key is to be explicit about assumptions, measure rigorously, and remain agile enough to adapt when those assumptions fail. This is a collection of learnings I’ve gathered over the last decade or so, and I update it frequently to keep reminding myself of these lessons.
Embrace Trade‑Offs and Avoid Simplistic Models
Distributed systems are all about navigating fundamental trade‑offs – there’s no free lunch. Successful engineers learn to recognize and explicitly choose trade-offs rather than chasing silver bullets:
- CAP isn’t “pick any two” – you can’t ignore partitions. In real systems, network partitions will happen. The CAP theorem’s popular form is misleading; you must design for partition tolerance, then balance consistency vs availability when a split occurs. PACELC extends this: if the network partitions (P), you must choose between availability (A) and consistency (C), but even when the network is fine (else, E), you still have to choose between low latency (L) and strong consistency (C) – so even without partitions, there’s a trade-off between consistency and latency in normal operation. In short, always expect failures and decide what you’re willing to sacrifice (consistency, availability, latency) when they occur.
- “Exactly once” delivery is often not worth the cost. Guaranteeing exactly-once message processing in a distributed system is hard and usually unnecessary. Embrace at-least-once delivery with idempotent operations instead. For example, give each operation a unique ID and make the handler idempotent, so processing a message twice has the same effect as processing it once (see the first sketch after this list). This shifts complexity out of the messaging layer and into the application logic, which often simplifies the overall system. Focus on end-to-end correctness rather than trying to make the network do magic. Simple, robust patterns (like unique IDs for de-duplication) make real-world systems more predictable and easier to build and reason about.
- Optimize for the common case, plan for the worst case. Distributed designs often face a choice between optimistic assumptions (everything will go fine) and pessimistic ones (actively prevent issues). Optimistic techniques avoid upfront coordination and can vastly improve throughput and latency – until an assumption is wrong. Pessimistic techniques (like locks or periodic coordination) ensure safety but add overhead or reduce availability. For example, a cache with a fixed TTL is a pessimistic design – it assumes data might change and forces a refresh. This bounds staleness, but if the database is unreachable when the TTL expires, your cache delivers nothing (availability drops to zero). An optimistic cache might serve slightly stale data during a partition, trading off freshness for availability (see the second sketch after this list). Caches are wonderful because they can give super-linear scaling, but think carefully about their modality: they are bimodal systems – either hitting (fast) or missing (slow) – which creates challenges like latency spikes, uneven load distribution, and cascading failures when cache misses overload the backend. These pose particular problems if control-plane recovery depends on warm caches; that topic deserves a separate post. Be explicit about your assumptions. Choose optimistic approaches when you can tolerate occasional inconsistencies and recover quickly, and pessimistic ones when the cost of a mistake is too high. Always ask “what happens if this assumption fails?” and decide if that failure mode is acceptable.
- No abstraction is perfect – watch out for leaky simplifications. High-level models (CAP, PACELC, etc.) are useful to frame trade-offs, but real systems have nuances those models don’t capture. Sometimes a “mostly CP” system still provides useful guarantees under partition, or a “mostly AP” system offers more consistency than you’d think with clever client logic. Use these frameworks as guides, not gospel. Likewise, be wary of catchy one-liners – e.g. “the mean is useless” – they contain truth (averages can mislead) but taken too literally can misguide you. The reality is more nuanced: averages do hide outliers and variability over time, so don’t rely on them for latency or availability SLAs, but for capacity estimation you do need averages. Use the right metric for the question at hand. In general, avoid all-or-nothing thinking; understand the continuum and choose the point that best fits your needs.
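To make the at-least-once-plus-idempotency pattern concrete, here is a minimal Python sketch. The names (`PaymentProcessor`, `op_id`) are illustrative, and the deduplication set would live in a durable store in a real system; the core idea is that replaying a message is harmless:

```python
import uuid

class PaymentProcessor:
    """Illustrative consumer: at-least-once delivery plus an idempotent handler."""

    def __init__(self):
        self.processed_ids = set()   # in production: a durable store, not memory
        self.balances = {}

    def handle(self, message):
        op_id = message["op_id"]            # unique ID assigned by the producer
        if op_id in self.processed_ids:     # duplicate delivery: safe to ignore
            return "duplicate-ignored"
        account = message["account"]
        self.balances[account] = self.balances.get(account, 0) + message["amount"]
        self.processed_ids.add(op_id)       # record it only after the effect is applied
        return "applied"

processor = PaymentProcessor()
msg = {"op_id": str(uuid.uuid4()), "account": "alice", "amount": 25}
processor.handle(msg)   # "applied"
processor.handle(msg)   # broker redelivers the same message: "duplicate-ignored"
```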
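And a sketch of the optimistic cache trade-off discussed above: a strict-TTL cache refuses to serve anything when the backend is down, while this version falls back to the last known value. `fetch` and the TTL are placeholders; the point is the fallback path, not the API:

```python
import time

class ServeStaleCache:
    """Sketch: prefer slightly stale data over no data when the backend is unreachable."""

    def __init__(self, fetch, ttl_seconds=60):
        self.fetch = fetch        # callable that reads the key from the backing store
        self.ttl = ttl_seconds
        self.entries = {}         # key -> (value, fetched_at)

    def get(self, key):
        entry = self.entries.get(key)
        if entry is not None and (time.time() - entry[1]) < self.ttl:
            return entry[0]                          # fresh hit: the common, fast case
        try:
            value = self.fetch(key)                  # pessimistic path: refresh from source
            self.entries[key] = (value, time.time())
            return value
        except Exception:
            if entry is not None:
                return entry[0]    # optimistic fallback: serve stale during the outage
            raise                  # no cached copy at all, so there is nothing to degrade to
```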
Design for Failures, Tails, and Uncertainty
Distributed systems live in an unpredictable world – parts fail, messages get delayed, workloads spike. Resilient systems make failure handling a first-class design concern and pay special attention to “tail” behaviors (rare, worst-case events):
- Tail latency matters more than averages. It’s not enough that most requests finish quickly – a few slow ones can ruin the user experience, and they tend to hit your largest customers hardest. Modern applications often involve fan-out (one request triggers calls to many services). In those cases, the user’s latency is the slowest of the bunch. What was a 1-in-1000 outlier in a single service becomes commonplace when you call 100 services in parallel – and nearly everybody will have a bad time. Engineer for the 99.9th percentile latency, not just the median. Use techniques like timeouts, hedging (retrying slow calls in parallel – see the first sketch after this list), or response budgets for sub-operations to curb tail latency. And always measure the distribution – don’t just trust the average or you’ll miss the real pain points.
- Beware of positive feedback loops and metastable failures. A metastable failure is when a system gets into a bad state and stays there, even after the initial trigger is gone. Classic example: a brief outage or spike causes a backlog of requests. Latency rises, clients start timing out and retrying aggressively. Those retries amplify the load just when the system is least able to handle it. The system stays overwhelmed doing work that gets thrown away (since clients already gave up), and it never catches up – it’s up, but effectively down. Metastable failures have caused major outages at big companies. Importantly, they often stem from features meant to improve reliability or throughput, like retries, caching, or aggressive concurrency. The cure is breaking the feedback loop: implement backpressure, limit retries, shed load when overloaded, and design graceful degradation modes. For instance, use exponential backoff with jitter to slow down retry storms. Ensure that when the system is struggling, clients back off and give it air to recover, rather than hammering it harder. Monitor not just basic metrics but indicators of saturation (queue lengths, timeouts) – you need to detect the onset of the spiral and intervene (drop lower-priority work, throttle input, etc.) before it locks in.
- Use backoff and jitter – but know their limits. In the short term, exponential backoff (especially with random jitter added) is extremely effective at smoothing bursts. It spreads out traffic spikes (say, after a partition heals and many clients retry) and prevents synchronized retry “storms.” However, backoff doesn’t solve sustained overload. If you have many independent clients each with work to do, having them all wait and retry later might just shift the surge in time without reducing work – each user will still hit “refresh” eventually. Backoff truly helps only when clients would otherwise do redundant work (like polling loops or aggressive retries). So apply backoff for transient blips, but for prolonged heavy load you must shed load or add capacity – deferring it isn’t enough. One technique is an adaptive retry budget (e.g. a token bucket) to cap total retries and fail fast if the system is in trouble; a sketch of backoff with jitter plus a retry budget follows this list. Circuit breakers are another: if a service is repeatedly timing out, stop hammering it for a while. (Circuit breakers prevent repeated calls to failing services by temporarily blocking requests, but they shouldn’t be used for intermittent or self-healing failures, as they can make things worse. IMO, if ever used they should only be applied at the partition level, since using them higher up can cause more harm than good due to the difficulty of predicting which partition a particular request will hit.) In short, stabilize overload by reducing input – impose limits or random drops instead of letting queues grow without bound.
- Design for graceful degradation. A robust distributed system should prefer a partial or degraded service over a full outage. That means thinking through fallback behaviors ahead of time. If a backend is slow, maybe return cached or slightly stale data. If one datacenter is offline, shed non-critical load and serve critical requests from a backup region with higher latency. Avoid binary all-or-nothing states. For example, if your system is geo-distributed, can it continue operating in a partitioned region in “read-only” mode rather than going completely down? Users often prefer limited functionality over none. The Harvest vs Yield concept is useful here: sometimes it’s acceptable to “harvest” less data (return an incomplete response) in order to increase “yield” (the fraction of requests served). A classic approach is feature flags or knobs that trade quality for availability under stress (like turning off expensive optional features when load is high). By planning these modes in advance, the system can “fail well” – giving something instead of nothing.
- Redundancy is not a panacea (and can hurt if done wrong). It sounds obvious: add more replicas or links for fault tolerance. But naïvely piling on redundancy can reduce reliability if it adds complexity or subtle failure modes. There are four common-sense rules for redundancy: (1) Don’t introduce more complexity than the availability you gain – every extra moving part is another thing that can break. (2) The system must be able to run in degraded mode – if one component fails, the others must handle the full load (e.g. a read replica must cope if it becomes primary). (3) You need reliable failure detection – it’s no good having two servers if you can’t tell which one is bad. (4) You must be able to restore redundancy – e.g. when a node recovers or you add a new replica, it should sync up safely and return to standby. If any of these is missing, redundancy might lull you into a false sense of security or even make things worse. Ensure your failover processes are well-tested and that switching to backups doesn’t introduce unpredictable behavior. A simpler, well-understood system with one reliable node might beat a Rube Goldberg cluster of five. Use redundancy judiciously, and design it as a cohesive part of the system (with health checks, failover procedures, etc.), not a bolt-on.
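Here is a minimal sketch of the hedging idea from the tail-latency point above. It assumes `make_request` is an idempotent read that is safe to issue twice, and that the hedge delay (50 ms here) sits near the service’s observed p95–p99 latency:

```python
import concurrent.futures

def hedged_call(make_request, hedge_after_s=0.05):
    """Issue one request; if it is slower than the hedge delay, race a second copy."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
    try:
        first = pool.submit(make_request)
        done, _ = concurrent.futures.wait([first], timeout=hedge_after_s)
        if not done:
            # Tail case: the first call is slow, so hedge with one parallel retry.
            second = pool.submit(make_request)
            done, _ = concurrent.futures.wait(
                [first, second], return_when=concurrent.futures.FIRST_COMPLETED
            )
        return done.pop().result()   # use whichever copy finished first
    finally:
        pool.shutdown(wait=False)    # don't block on the slower duplicate
```

In the common case only one request goes out, so the extra load is confined to the slowest few percent of calls.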
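And a sketch of exponential backoff with “full” jitter combined with a retry budget (a token bucket), as discussed above. The numbers are arbitrary; the point is that retries are both spread out in time and capped in aggregate, so they cannot become their own source of overload:

```python
import random
import time

class RetryBudget:
    """Token bucket that caps total retries so retry storms can't amplify an overload."""

    def __init__(self, capacity=10, refill_per_second=1.0):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_per_second = refill_per_second
        self.last_refill = time.monotonic()

    def try_acquire(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # budget exhausted: fail fast instead of piling on more retries

def call_with_backoff(operation, budget, max_attempts=4, base_s=0.1, cap_s=5.0):
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1 or not budget.try_acquire():
                raise   # give up and shed the work rather than keep hammering
            # Full jitter: sleep a random amount up to the exponential bound,
            # which de-synchronizes clients that all failed at the same moment.
            time.sleep(random.uniform(0, min(cap_s, base_s * (2 ** attempt))))
```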
Scale Out via Coordination Avoidance
One of the most fundamental truths in distributed systems is that coordination has a cost. If every operation requires a cluster-wide agreement, adding more machines won’t increase throughput (and often makes it worse). The path to scalable systems is to do less coordination, both in the data plane and control plane:
- The fastest distributed operations are the ones that don’t require distributed consensus. People often assume technologies like Paxos or Raft are what make systems “web-scale.” In reality, consensus protocols give reliability, not unlimited performance. If you try to run every single user request through Paxos, your throughput will be bottlenecked by the consensus group. The fundamental mechanism of scaling is partitioning work so that tasks can happen independently on different nodes, without waiting on each other. Avoiding coordination is a fundamental tool for building scalable systems. Design your architecture such that most operations are local. Save the heavy coordination (distributed transactions, global locks, etc.) for the rare cases or background tasks. For example, a sharded database routes each query to one shard whenever possible, rather than doing multi-shard transactions each time. By avoiding cross-shard coordination, you can get near-linear scaling with the number of shards. In contrast, if every operation needs a quorum of 5 replicas, adding more machines doesn’t help single-request latency or throughput (it only helps availability). Minimize the portion of your system that must be synchronized; everything else can then grow.
- Beware of “accidental” coordination through shared resources. Even if your high-level design is embarrassingly parallel, watch out for hidden chokepoints. Examples: a distributed workload all hitting one hot key or hot shard (suddenly that shard becomes a serial bottleneck), or a batch job that, say, grabs a global lock in a poorly designed library. Use metrics and tracing to find out whether one node or region is handling a disproportionate load. If so, you may need to re-shard by a different key, add caching layers, or otherwise break up the hotspot (e.g. consistent hashing with “power of two choices” can avoid overloading one cache server – see the first sketch after this list). The Zipf distribution tells us hot keys are common in real datasets, so plan for skew – don’t assume perfect uniformity. Amdahl’s Law lurks in distributed systems too: if 10% of your operations are serialized or hitting one node, that sets a hard limit on speedup. Strive to eliminate single-threaded flows and serial dependencies in protocols.
- Cross-node transactions are costly; try to partition the problem instead. Sometimes you really do need an atomic operation that spans servers – e.g. transferring money between accounts on different shards. Approaches like two-phase commit (2PC) exist to do this, but they have “weird scaling behavior.” In fact, a fully distributed atomic commit can make a sharded system no faster than a single node for those multi-shard operations. If every transaction touches k shards on average, your throughput is essentially divided by k (the back-of-envelope model after this list makes this concrete). In the worst case, if each transaction hits all shards, you gain nothing by sharding – 10 machines can still only do the work of one, because each operation has to run on all 10. The lesson: limit the scope of transactions. Design your data model so that the most common operations are single-partition (or can be done with only light coordination). Use techniques like entity grouping or separation of concerns to avoid multi-shard updates. If truly needed, consider weaker consistency (eventual consistency or saga patterns) for cross-partition updates instead of strict 2PC. Often you can redesign the feature so that it doesn’t require a perfectly atomic cross-shard commit, or you can tolerate a tiny window of inconsistency. Reserve the heavy artillery (global transactions) for the rare cases and isolate them so they don’t overload the whole system.
- Leverage optimism and asynchrony for throughput. Many scalable designs use an optimistic approach: do the work without global coordination, but be prepared to resolve conflicts or roll back if something goes wrong. For instance, optimistic concurrency control lets transactions proceed in parallel and only checks for conflicts at the end, aborting if needed – which in low-contention scenarios is far more efficient than locking upfront (see the last sketch after this list). Similarly, distributed systems might perform updates asynchronously (send updates to replicas without waiting for all acknowledgments immediately, trading a bit of inconsistency for lower latency). The key is to minimize time spent holding locks or waiting on others. If you can safely do something later or only if needed, don’t do it now for every operation. This can dramatically improve scalability, as long as you handle the cleanup path correctly. In practice, this means building reliable compensating actions or conflict resolution logic. For example, two users might update the same data in parallel; an optimistic system might accept both writes and later merge them or apply last-write-wins. This avoids making them wait on each other, at the cost of possibly doing a merge later. Always consider the failure mode: what if our optimism was wrong? As long as you can detect it and recover (with a retry, a merge, or an apology to the user), optimistic concurrency can be a huge win for throughput.
- Use back-of-the-envelope math and empirical data to guide scaling. It’s valuable to quantify how your system scales. Calculate theoretical limits (e.g. “with 3 replicas, each write costs at least 2 hops, so here’s our max throughput before the network becomes the bottleneck”). Keep an eye on how adding hardware changes performance: do you get 100% more, 50% more, or no improvement at all? If it’s the latter, find out what global bottleneck is capping you. Often, designing a scalable system is about removing one bottleneck after another – but you need to identify them first. Use load tests, simulations, or models to predict how the system behaves as N grows. For example, simulate how increasing the number of shards affects throughput if each transaction touches N shards on average. This can reveal non-linear behaviors. A key insight is to treat distributed systems more like queuing networks or control systems in analysis, not just algorithms. Understand the queueing delays, the arrival rates, the service rates – basic Little’s Law and queueing theory can predict when you’ll saturate. By combining formal thinking for correctness with quantitative reasoning for performance, you cover both halves of the problem. In short: scaling takes both design and data – use brainpower and metrics together to get there.
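A toy illustration of the “power of two choices” trick mentioned above: instead of hashing a key to exactly one cache server, sample two candidates and pick the less-loaded one. The server names and load counters here are hypothetical; a real system would use live load estimates such as in-flight requests:

```python
import random

def pick_server(servers, current_load):
    """Power of two choices: sample two servers and send the request to the less-loaded one."""
    a, b = random.sample(servers, 2)
    return a if current_load[a] <= current_load[b] else b

servers = ["cache-1", "cache-2", "cache-3", "cache-4"]
load = {s: 0 for s in servers}
for _ in range(10_000):
    chosen = pick_server(servers, load)
    load[chosen] += 1   # in reality: in-flight requests or queue depth per server

# Compared with a single random (or single hashed) choice, the gap between the
# busiest and least-busy server stays small, so one hot spot is far less likely.
```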
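The claim that cross-shard transactions divide throughput by the number of shards touched is easy to check with a back-of-envelope model. This deliberately ignores the extra round trips and lock hold times of 2PC itself, which only make things worse; the capacity numbers are illustrative:

```python
def effective_throughput(num_shards, per_shard_capacity, avg_shards_touched):
    # Each transaction consumes capacity on every shard it touches, so the
    # aggregate capacity (num_shards * per_shard_capacity) is split k ways.
    return (num_shards * per_shard_capacity) / avg_shards_touched

for k in (1, 2, 5, 10):
    tx_per_s = effective_throughput(num_shards=10, per_shard_capacity=1_000,
                                    avg_shards_touched=k)
    print(f"touching {k} shard(s) on average -> ~{tx_per_s:,.0f} tx/s")

# touching 1 shard(s) on average  -> ~10,000 tx/s  (perfect partitioning: 10x one node)
# touching 2 shard(s) on average  -> ~5,000 tx/s
# touching 10 shard(s) on average -> ~1,000 tx/s   (every transaction hits every shard:
#                                                   no better than a single node)
```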
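And a minimal sketch of optimistic concurrency control: readers take no locks, and a write commits only if the version it read is still current. `VersionedStore` and its methods are invented for illustration; real databases expose the same idea as conditional writes or compare-and-set:

```python
class VersionedStore:
    """Toy optimistic concurrency: a write succeeds only if the version is unchanged."""

    def __init__(self):
        self.data = {}   # key -> (value, version)

    def read(self, key):
        return self.data.get(key, (None, 0))

    def compare_and_set(self, key, expected_version, new_value):
        _, version = self.read(key)
        if version != expected_version:
            return False                          # someone else committed first
        self.data[key] = (new_value, version + 1)
        return True

store = VersionedStore()
value, version = store.read("profile:42")
# ... compute the new value without holding any lock ...
if not store.compare_and_set("profile:42", version, {"name": "Ada"}):
    # Conflict detected only at commit time: re-read, then retry or merge.
    pass
```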
Keep Systems Understandable with Simplicity and Invariants
Distributed systems are inherently complex – multiple processes, partial failures, non-deterministic timings. To manage this complexity, great engineers enforce simplicity wherever possible and use rigorous thinking to keep systems understandable and correct:
- Simple systems are easier to reason about and operate. “Simple” doesn’t mean “minimal features” – it means avoiding needless complexity in architecture and code. Every extra feature, layer, or special-case mode multiplies the state space your system can be in. Strive for design clarity: well-defined components, few moving parts, and straightforward failure modes. As an example, prefer a design that crash-stops on errors and restarts cleanly (crash-only software) over one that tries to handle every exception and limp along. The crash-only approach simplifies thinking: each component has two states (working or crashed) and recovery is uniform (just restart). Similarly, avoid systems that behave very differently in different “modes” (normal vs failure mode vs recovery mode) – if you must have modes, try hard to automate and test the transitions between them. A golden rule is “don’t be weird”, meaning don’t introduce exotic mechanisms that only kick in on rare events. Operators find simple, predictable systems far easier to manage and less prone to human error. A design with clear invariants and few surprises will beat an over-engineered “clever” system in the long run.
- Invariants are your friends. When designing or debugging a distributed algorithm, identify the key invariants – conditions that must always hold true (or at least at clear synchronization points). Examples: “at most one leader at a time,” “each message ID is processed once,” “replicas never diverge by more than N updates.” These invariants are the safety rails that keep your system correct. Write tests or use formal specs to continuously check invariants after each operation (see the sketch after this list). This approach flushes out bugs much faster than ad-hoc debugging. Instead of tracing through millions of log lines, you assert the conditions that should hold, and let violations pinpoint the issue. It’s essentially putting the spec into the code. Formulating invariants will help you find in minutes tricky bugs that you’d otherwise struggle with for days using print statements. In distributed systems, good invariants might relate to consensus states (e.g. “no two different values can both be decided”) or data replication (“copies of a record differ by at most one version”). Use assertions in code for things that “should never happen.” Use tools like TLA+ to model and verify invariants at the design level. Thinking in terms of invariants also forces you to clarify the system’s behavior – if you can’t state a fundamental property (“all transactions either commit on all nodes or none”), you probably don’t fully understand the algorithm.
- Formal methods catch the bugs your tests miss. Use formal methods not to prove everything is correct, but to hit the subtle corner cases you might overlook. At Azure, we successfully applied TLA+ to critical protocols to find bugs before writing code. The payoff is huge for tricky parts of the system (e.g. ensuring “exactly-once” message semantics or checkpointing logic). However, formal verification typically focuses on safety and liveness – ensuring nothing bad ever happens (safety) and something good eventually happens (liveness). This leaves a whole class of concerns untouched: performance, scalability, cost, etc. Don’t assume a formally verified protocol will meet your performance SLAs or stay efficient under high load – you have to model performance separately. In practice, combine formal specs for correctness-critical logic with empirical testing for performance. Use model checking to solidify the design invariants, then use load tests and simulations to see how the design behaves with 1000 nodes, 1 million requests, unreliable networks, etc. Formal tools give you high confidence in correctness, and that frees you to iterate on other dimensions. Remember, they solve “at most half” your problems – but it’s an important half. The result is a system that both works right and works well.
- Invest in observability and learning from incidents. Even with all the right design principles, things will still go wrong in production in ways you didn’t anticipate. Cultivate a culture of observability: detailed metrics, distributed tracing, and careful logging. When something unexpected happens (and it will), treat it as a lesson to make the system better. Was there an invariant that was violated? Strengthen it or add a check for it. Was there a scenario not covered by your failure testing? Add a test or tweak the design to handle it next time. Over years, a mature distributed system is one that has assimilated countless hard lessons from the real world. Do the same with your systems: perform blameless post-mortems, extract the core issues (be it a mis-estimated load, an unhandled partition case, or an operator error due to complexity) and address them in the design. Over time, this iterative hardening, guided by data and experience, is what makes distributed systems truly reliable.
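As a small illustration of checking invariants continuously (referenced in the invariants point above), here is a skeleton in Python. The replica structure and the specific invariants are made up for the example; the pattern is simply: after every simulated protocol step, assert the properties that must always hold, and let the first violation point at the buggy step:

```python
def check_invariants(replicas, max_divergence=1):
    """Assert the properties that must always hold; a violation pinpoints the buggy step."""
    leaders = [r for r in replicas if r["role"] == "leader"]
    assert len(leaders) <= 1, f"invariant violated: {len(leaders)} leaders at once"
    versions = [r["version"] for r in replicas]
    assert max(versions) - min(versions) <= max_divergence, \
        f"replicas diverged by {max(versions) - min(versions)} versions"

replicas = [
    {"role": "leader", "version": 7},
    {"role": "follower", "version": 7},
    {"role": "follower", "version": 6},
]

for step in range(10_000):
    # ... apply one randomly chosen protocol step here (election, replication, crash) ...
    check_invariants(replicas)   # assert after every step instead of digging through logs
```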
Conclusion
Building and running distributed systems is part science, part art. The science lies in understanding fundamental principles – the math of why coordination doesn’t scale, the inevitability of failures (and the need to handle them gracefully), the importance of invariants and feedback loops. The art is in applying those principles in balance, weighing trade-offs against real-world constraints and business needs. If you take away a single lesson from this, let it be this: respect the trade-offs and make them explicit. Know what you’re trading for what: consistency for latency, complexity for capability, optimism for the risk of retries. Prioritize simplicity and clarity – a simple system with well-understood behavior will beat a complex “smart” system that no one truly groks. Measure everything that matters (and choose the right metrics), from tail latencies to failure rates, so you can see the effect of your design decisions. And finally, remain humble and curious: the field is full of “unknown unknowns,” but each incident or odd result is a chance to deepen your understanding. Distributed systems can be unforgiving, but by internalizing these core lessons on trade-offs, failures, and design fundamentals, you’ll greatly stack the odds in favor of systems that not only work, but keep working reliably at scale.