25.11.01 · computer-science / distributed-systems

Distributed systems and consensus

shipped3 tiersLean: none

Anchor (Master): Lamport, The Part-Time Parliament (1998); Ongaro and Ousterhout, In Search of an Understandable Consensus Algorithm (2014); Gray and Lorie, RAID

Intuition Beginner

A distributed system is a collection of independent computers that work together to appear as a single coherent system. When you use Google, your request does not go to a single computer. It goes to thousands of servers in data centers around the world, coordinated so that the response appears to come from one source. The challenges of building such systems are fundamentally different from building software that runs on a single machine.

The defining challenge of distributed systems is partial failure. In a single machine, either the whole system works or the whole system crashes. In a distributed system, some components may fail while others continue running. A network cable gets unplugged, a server overheats, a disk corrupts data, or a software bug causes one node to crash. The system must continue providing service despite these failures, because in a system with thousands of machines, some component is always failing.

Google's infrastructure, described in several research papers, illustrates the scale of partial failure. A single Google cluster contains thousands of machines, with an expected machine failure rate of approximately 1-5% per year. At this scale, machine failures are not exceptional events but regular occurrences that the system must handle routinely.

Google's approach is to design software that treats hardware failures as normal operating conditions, using redundancy, graceful degradation, and fast recovery to maintain service. This philosophy is reflected in systems like Bigtable, Spanner, and Borg, all of which assume that individual machine failures are routine.

Another challenge is the lack of a shared clock. In a single machine, all components share the same clock, making it easy to determine which event happened first. In a distributed system, clocks on different machines drift, network delays vary unpredictably, and there is no universal notion of "now." This makes ordering events across machines a fundamental problem.

The speed of light imposes a physical lower bound on network latency. A signal traveling between New York and London takes at least 28 milliseconds (the speed of light in fiber optic cable is approximately two-thirds the speed of light in vacuum). In practice, latency is higher due to routing hops, processing delays, and queuing.

This physical constraint means that distributed systems can never match the latency of local computation, and algorithms must be designed to minimize the number of communication rounds.

The CAP theorem, discussed in the databases unit, states that a distributed system can provide at most two of three guarantees: consistency (all nodes see the same data), availability (every request receives a response), and partition tolerance (the system works despite network failures). Since network partitions are inevitable in any real distributed system, the choice is between consistency and availability during partitions.

Replication is the primary technique for achieving fault tolerance. Data is copied across multiple machines (replicas) so that if one fails, others can serve requests. But replication introduces the consistency problem: if a user writes a value to one replica and another user reads from a different replica before the write has propagated, they may see stale data.

State machine replication is the most common approach to building fault-tolerant distributed systems. Every replica runs the same deterministic state machine, and consensus ensures that all replicas process the same sequence of commands in the same order. If the state machine is deterministic and all replicas start from the same initial state, they will all reach the same final state.

This is the principle behind Raft: the replicated log provides the same sequence of commands, and each server applies those commands to its state machine in order. If any server crashes, it can recover by replaying the log from the last checkpoint, eventually catching up to the current state. The state machine approach is powerful because it can implement any deterministic service: databases, configuration stores, lock services, or message queues.

Strong consistency guarantees that all replicas agree on every operation before acknowledging it. This requires coordination: every write must be confirmed by a majority of replicas before it is considered committed. This coordination comes at the cost of latency (waiting for multiple machines to agree) and availability (if enough replicas are unreachable, writes cannot proceed).

Eventual consistency accepts that replicas may temporarily disagree, with the guarantee that they will eventually converge if no new updates are made. This approach provides lower latency and higher availability but allows users to see stale data. DNS is an eventually consistent system: when you update a DNS record, it takes hours to propagate worldwide.

The tension between consistency and availability is fundamental. During a network partition, the system must choose: refuse some requests to maintain consistency, or accept all requests and risk inconsistency. Different applications make different choices. A banking system should refuse transactions during a partition rather than risk inconsistent balances. A social media feed can accept posts during a partition and reconcile later.

The key insight is that there is no universally correct answer: the right trade-off depends on the application's requirements. This is why modern distributed databases offer configurable consistency levels, allowing developers to choose the appropriate guarantee for each operation.

Replication strategies vary in their approach to consistency. Primary-backup (leader-follower) replication routes all writes through a single leader, which propagates changes to followers. This provides strong consistency because the leader serializes all writes, but the leader is a single point of failure. Multi-leader replication allows writes to any replica, improving write availability but introducing conflict resolution complexity.

Leaderless replication (used by Cassandra and DynamoDB) allows any replica to accept writes, using quorum-based consistency levels: a read that contacts $R$ replicas and a write that contacts $W$ replicas guarantees seeing the latest write if $R + W > N$ , where $N$ is the total number of replicas. Setting $R = 2, W = 2, N = 3$ provides strong consistency with tolerance for one failure.

Consensus is the fundamental problem of getting distributed nodes to agree on a value. Despite partial failures and network delays, the nodes must all decide on the same outcome. This problem arises in many forms: distributed databases agreeing on the order of transactions, distributed file systems agreeing on which version of a file is current, and distributed coordination services agreeing on which service instances are healthy. Leslie Lamport's Paxos algorithm (1998) solved this problem, and Diego Ongaro's Raft algorithm (2014) provided a more understandable alternative that is now widely used.

Raft elects a leader who manages the replicated log. All client requests go through the leader. The leader appends each request to its log and replicates it to followers. Once a majority of followers acknowledge the entry, the leader commits it and applies it to the state machine. If the leader fails, the remaining nodes hold an election using randomized timeouts to select a new leader. The randomized timeouts prevent split votes.

The Raft consensus process operates through several subproblems. Leader election uses term numbers (logical clocks that increase monotonically) to distinguish between current and stale leaders. A candidate increments the term, votes for itself, and requests votes from other nodes. If it receives votes from a majority, it becomes leader. If it discovers a higher term, it reverts to follower.

Log replication ensures that all logs are identical across committed entries. The leader sends AppendEntries RPCs containing the log entries to replicate. If a follower's log is inconsistent, the leader decrements the index and retries until it finds agreement. Safety is guaranteed by the election restriction: a candidate must have all committed entries to receive votes, ensuring committed entries are never lost.

Paxos, Lamport's original consensus algorithm, is more general but harder to understand. Basic Paxos allows a group of nodes to agree on a single value. Multi-Paxos extends this to agree on a sequence of values (a log), which is what practical systems need. Google's Chubby lock service, Apache ZooKeeper, and etcd all implement variants of Multi-Paxos.

The key insight of Paxos is that a value is chosen when a majority of acceptors accept it. Because any two majorities overlap by at least one acceptor, a chosen value cannot be changed. Lamport originally presented Paxos as a parable about a fictional parliament on the island of Paxos, a choice that made the paper memorable but difficult to extract the algorithm from the narrative.

Visual Beginner

Challenge	Description	Example
Partial failure	Some nodes fail while others continue	A server crashes but others keep serving
Network partitions	Communication between nodes is disrupted	A network link fails, splitting the cluster
Clock synchronization	No shared global clock	Two servers disagree on event ordering
Consistency vs availability	Cannot have both during partitions	Choose between stale reads and rejected writes
Message ordering	Messages may arrive in different order than sent	Network reorders packets

Worked example Beginner

A distributed key-value store uses Raft with 5 nodes. A client sends a write request to set key "x" to value "42".

The leader receives the request and appends "set x = 42" to its log as entry 7. It sends this entry to all 4 followers. Followers 1, 2, and 3 append the entry to their logs and send acknowledgment. Follower 4 is slow and has not yet responded.

The leader has received acknowledgments from 3 out of 4 followers (plus itself, totaling 4 out of 5). This is a majority (3 out of 5 is the minimum majority). The leader commits entry 7, applies "x = 42" to its state machine, and responds to the client with success.

If the leader now crashes, any of the followers can become the new leader. The committed entry "set x = 42" is guaranteed to be present in the new leader's log because it was replicated to a majority of nodes. The system continues operating without data loss.

If only 2 out of 5 nodes acknowledge (no majority), the leader does not commit the entry. It keeps trying to replicate until a majority is reached or it steps down. This ensures that uncommitted entries are never lost in a way that would cause inconsistency.

The majority quorum requirement has a direct operational implication: in a 5-node cluster, the system can tolerate 2 failures and still make progress. If 3 or more nodes fail, the remaining nodes cannot form a majority, and the system becomes unavailable for writes. This is why production deployments of consensus-based systems like etcd and Consul typically use 5 or 7 nodes: the system can survive the loss of 2 or 3 nodes respectively while maintaining availability.

Consider a more complex scenario involving leader election. The cluster has been running with Node A as leader. Node A's network connection becomes unreliable: it can communicate with Node B but not with Nodes C, D, and E. Node A continues sending AppendEntries to Node B, but Nodes C, D, and E stop receiving heartbeats. Their election timers expire (typically after 150-300ms of silence), and they become candidates.

Node C increments its term to 8 and requests votes. Node D receives the request, sees term 8 is higher than its current term 7, grants its vote, and resets its election timer. Node E does the same. Node C now has votes from itself, D, and E: a majority of 3. Node C becomes the new leader for term 8 and begins sending heartbeats.

Node A, still believing it is leader for term 7, eventually sends an AppendEntries to Node B. Node B responds with a rejection because term 7 is less than the current term 8. Node A sees the higher term, steps down to follower, and begins following Node C. The partition containing the majority (C, D, E) elected a new leader, and the minority (A, B) stepped down when they learned of the new term.

This scenario illustrates a critical Raft property: safety is never violated, even during partitions. A committed entry from term 7 is present on at least one node in the majority partition, because any two majorities overlap by at least one node. So the new leader for term 8 will include that entry. The overlapping node property ensures that no committed data is lost during leader transitions.

Check your understanding Beginner

Formal definition Intermediate+

Distributed system. A collection of autonomous computing elements that appears to its users as a single coherent system. Each node has its own local memory and communicates with other nodes through message passing.

Consensus problem. Given $n$ processes, each proposing a value $v_{i}$ , agree on a single value $v$ such that: (1) Agreement: all non-faulty processes decide the same value. (2) Validity: the decided value must have been proposed by some process. (3) Termination: all non-faulty processes eventually decide.

These three properties capture the essential requirements of consensus. Agreement ensures consistency: no two processes disagree on the outcome. Validity ensures non-triviality: the system does not decide on a value that nobody proposed. Termination ensures liveness: the system does not deadlock or loop forever. The tension between agreement and termination in the presence of failures is what makes consensus difficult, as formalized by the FLP impossibility result.

FLP impossibility. Fischer, Lynch, and Paterson (1985) proved that deterministic consensus is impossible in an asynchronous system with even one faulty process. The proof constructs a scenario where the algorithm cannot distinguish between a slow process and a crashed one, so it cannot safely decide. Practical systems circumvent FLP by using timeouts and randomization, which provide probabilistic liveness.

Consistency models

Linearizability: every operation appears to take effect atomically at some point between its invocation and response. All operations can be placed on a single timeline consistent with real-time ordering.

Sequential consistency: all processes see the same order of operations, but this order need not correspond to real time.

Causal consistency: operations that are causally related are seen by all processes in the same order. Concurrent operations may be seen in different orders.

Eventual consistency: if no new updates are made, all replicas eventually converge to the same state. No guarantees about the order in which operations become visible.

The relationship between these models forms a strict hierarchy. Linearizability implies sequential consistency, which implies causal consistency, which implies eventual consistency. Each weakening of the consistency model allows higher availability and lower latency during network partitions, at the cost of providing weaker guarantees to applications. The choice of consistency model is a fundamental design decision for any distributed system, and it depends on the application's requirements. Financial systems require linearizability or strong consistency. Social media feeds can tolerate causal or eventual consistency. Collaborative editing requires a consistency model that handles concurrent updates gracefully, making CRDTs or operational transform appropriate.

Logical clocks and happened-before

Lamport's happened-before relation ( $\to$ ) defines a partial order on events: $a \to b$ if (1) $a$ and $b$ are in the same process and $a$ occurs before $b$ , (2) $a$ is the sending of a message and $b$ is its receipt, or (3) $a \to c$ and $c \to b$ (transitivity). Events not ordered by $\to$ are concurrent.

Lamport timestamps assign a logical time $L (e)$ to each event $e$ such that $a \to b$ implies $L (a) < L (b)$ . Each process maintains a counter, incrementing it for each local event and setting it to $max (L_{local}, L_{received}) + 1$ upon receiving a message.

Vector clocks extend Lamport timestamps by assigning each process its own counter, forming a vector. For $n$ processes, a vector clock is a vector $V = (v_{1}, v_{2}, \dots, v_{n})$ where $v_{i}$ is process $i$ 's counter. When process $i$ sends a message, it increments $v_{i}$ and includes the vector in the message. Upon receiving a message with vector $V^{'}$ , process $i$ updates its vector: $V [j] = max (V [j], V^{'} [j])$ for all $j$ , then increments $v_{i}$ . Vector clocks capture more information than Lamport timestamps: $a \to b$ if and only if $V (a) < V (b)$ (component-wise). This equivalence (if-and-only-if) is stronger than Lamport timestamps, which only provide one direction ( $a \to b$ implies $L (a) < L (b)$ , but $L (a) < L (b)$ does not imply $a \to b$ ).

Failure models

Distributed systems must be designed for specific failure models, each making different assumptions about how components can fail.

Crash failure: a process stops executing and never recovers. This is the simplest failure model, assumed by Raft and Paxos. A crashed process simply halts; it does not send incorrect messages.

Crash-recovery failure: a process stops but may restart with its stable storage intact. This is more realistic than pure crash failure because servers often reboot after hardware or software failures. Raft handles crash-recovery by persisting the log and current term to stable storage.

Omission failure: a process drops some messages (send omission or receive omission). This models network failures where messages are lost. Crash failure is a special case of omission failure where all messages are lost.

Byzantine failure: a process may behave arbitrarily, including sending conflicting messages to different processes. This is the most general failure model, covering both hardware faults and malicious attacks. Byzantine fault tolerance requires more nodes ( $3 f + 1$ instead of $2 f + 1$ ) and more message exchanges than crash fault tolerance.

The cost of fault tolerance increases with the failure model's generality. Crash fault tolerance requires $2 f + 1$ nodes and $O (n)$ messages per consensus round. Byzantine fault tolerance requires $3 f + 1$ nodes and $O (n^{2})$ messages (or $O (n)$ with threshold signatures). Most distributed databases use crash fault tolerance, which is sufficient for data center environments where nodes are physically secure. Byzantine fault tolerance is necessary for open networks like blockchain, where any participant may be adversarial.

Key result: FLP impossibility Intermediate+

Theorem (Fischer, Lynch, Paterson, 1985). There is no deterministic algorithm that solves consensus in an asynchronous distributed system with even one crash failure.

Proof sketch. Assume a deterministic consensus algorithm exists. The proof uses a bivalent state: a configuration where both decision values (0 and 1) are still possible. The algorithm must reach a univalent state (where only one decision is possible) before deciding.

Step 1: Show that there exists an initial bivalent configuration. If all initial configurations were univalent, there must be two adjacent configurations (differing in one input) with different decision values. An adversary can delay the process whose input differs, making the algorithm unable to distinguish the two configurations. Contradiction.

Step 2: Show that from any bivalent configuration, there exists an admissible execution that remains bivalent forever. Consider any event $e$ applicable to bivalent configuration $C$ . Let the adversary schedule $e$ and then delay the process executing it. The resulting configuration $C^{'}$ can be shown to remain bivalent by careful construction: if $C^{'}$ were univalent, the adversary could have achieved the same result by delaying $e$ , contradicting the determinism of the algorithm.

Since the algorithm can be kept in a bivalent state indefinitely, it cannot guarantee termination. $□$

The FLP result is one of the most influential impossibility results in computer science. It does not mean consensus is impossible in practice. It means deterministic algorithms cannot guarantee liveness in a purely asynchronous model. Real systems circumvent FLP by using partially synchronous models (Dwork, Lynch, Stockmeyer, 1988) where messages are eventually delivered within bounded but unknown time, enabling practical algorithms like Paxos and Raft. The key insight is that FLP requires an adversarial scheduler that can delay messages indefinitely. In real networks, messages eventually arrive, and timeouts provide a practical way to detect and handle failures, restoring liveness at the cost of a theoretical guarantee.

Exercises Intermediate+

Exercise 3 (hard, short answer).

Explain how vector clocks detect concurrent modifications and why Lamport timestamps alone cannot.

Hint

Consider what additional information vector clocks capture compared to scalar Lamport timestamps.

Answer

A vector clock assigns each process its own counter. Event $a$ happened-before $b$ iff every component of $a$ 's vector is $\leq$ the corresponding component of $b$ 's vector and at least one is strictly less. Events are concurrent if neither vector is $\leq$ the other. Lamport timestamps are scalar and lose this information: two events with the same timestamp could be concurrent or ordered. Vector clocks can detect concurrent writes (indicating a conflict that must be resolved), while Lamport timestamps can only establish a partial order through the happened-before relation.

Exercise 5 (easy, multiple choice).

What is the key advantage of CRDTs over traditional conflict resolution in distributed databases?

A. CRDTs provide stronger consistency guarantees than serializable isolation

B. CRDTs guarantee convergence without coordination between replicas

C. CRDTs eliminate the need for any network communication

D. CRDTs reduce storage requirements by compressing data

Hint

CRDTs are designed for eventual consistency. What property ensures all replicas eventually agree?

Answer

Option B. CRDTs mathematically guarantee that all replicas converge to the same state without requiring coordination (synchronization) between replicas. Each replica can update independently, and the merge operation is designed to be commutative, associative, and idempotent, ensuring convergence regardless of message ordering.

Exercise 6 (medium, short answer).

A distributed key-value store uses consistent hashing with 8 nodes on a hash ring. Node 5 fails. Describe which data is affected and how the system recovers.

Hint

In consistent hashing, each node owns the keys between its position and the previous node's position on the ring.

Answer

Node 5 is responsible for keys between its position and node 4's position on the ring. When node 5 fails, these keys become unreachable. The system recovers by having node 6 (the next node clockwise on the ring) take over responsibility for node 5's key range. If data is replicated (typically 3 replicas), the system can serve reads from the remaining replicas while re-replicating node 5's data to a new node or to node 6 to maintain the desired replication factor. No other key ranges are affected, which is the key advantage of consistent hashing: only the data owned by the failed node needs to be redistributed.

Advanced results Master

Byzantine fault tolerance

Byzantine faults are the most general failure model: faulty nodes may behave arbitrarily, including sending conflicting messages to different nodes. The Byzantine Generals Problem (Lamport, Shostak, Pease, 1982) showed that consensus with $f$ Byzantine faults requires at least $3 f + 1$ nodes.

PBFT (Practical Byzantine Fault Tolerance, Castro and Liskov, 1999) was the first practical BFT protocol, achieving consensus with $O (n^{2})$ message complexity. HotStuff (used by Meta's Diem/Libra) reduced this to $O (n)$ using threshold signatures and pipelined consensus.

The Byzantine Generals analogy illustrates the problem precisely. A group of generals surround a city and must agree on whether to attack or retreat. They can only communicate via messengers, and some generals may be traitors who send conflicting messages. The loyal generals must all agree on the same plan. Lamport, Shostak, and Pease proved that consensus is possible if and only if fewer than one-third of the generals are traitors: $n \geq 3 f + 1$ . The proof shows that with $3 f$ or fewer total nodes, the traitors can prevent the loyal nodes from distinguishing between two different commands, violating agreement.

PBFT operates in three phases. In the pre-prepare phase, the primary (leader) sends the proposed request to all replicas. In the prepare phase, replicas exchange prepare messages and wait for $2 f$ matching prepares from different replicas. In the commit phase, replicas exchange commit messages and execute the request after receiving $2 f + 1$ matching commits. The protocol guarantees safety (no two honest nodes commit different values) under any number of faulty nodes and liveness (eventual commitment) when fewer than $n /3$ nodes are faulty. The message complexity of $O (n^{2})$ per consensus round limits PBFT to relatively small groups of nodes (typically fewer than 100).

HotStuff, introduced in 2018, improved BFT performance by reducing communication complexity to $O (n)$ per consensus round. It achieves this through threshold signatures, which allow $n$ partial signatures to be combined into a single compact signature, and through a pipelined design where multiple consensus instances run concurrently. HotStuff's three-chain commitment rule provides safety even under leader rotation, making it suitable for blockchain applications where the leader is selected through a cryptographic lottery. The Diem (formerly Libra) project adopted HotStuff for its consensus protocol.

CRDTs (Conflict-free Replicated Data Types)

CRDTs are data structures that can be replicated across multiple nodes and updated independently, with a mathematical guarantee that all replicas will converge. There are two types: state-based (CvRDTs, where replicas exchange and merge full states) and operation-based (CmRDTs, where replicas exchange operations).

Examples include G-counters (grow-only counters), PN-counters (increment and decrement), G-sets (grow-only sets), and OR-sets (observed-remove sets that support adding and removing). CRDTs power collaborative editing (Google Docs uses a variant), distributed caching, and offline-first mobile applications.

The mathematical foundation of CRDTs rests on semilattices. A join semilattice is a partially ordered set where any two elements have a least upper bound (join). For state-based CRDTs, the merge operation must be a semilattice join: commutative ( $a \lor b = b \lor a$ ), associative ( $(a \lor b) \lor c = a \lor (b \lor c)$ ), and idempotent ( $a \lor a = a$ ). These properties guarantee that replicas converge regardless of the order in which they receive updates, because the final state depends only on the set of updates applied, not their order. The OR-set (observed-remove set) is one of the most useful CRDTs. It supports both add and remove operations by tagging each element with a unique identifier. When an element is removed, the remove operation records which adds it has observed, and only those observed adds are canceled. Subsequent adds of the same element (with new unique tags) succeed. This resolves the concurrent add-remove conflict in favor of the add, a policy that works well for collaborative applications where users may add back an item that was removed by another user.

Distributed transactions

Two-phase commit (2PC) provides atomic transactions across distributed databases. A coordinator sends a prepare message to all participants. If all vote yes, the coordinator sends commit. If any vote no, the coordinator sends abort. 2PC is blocking: if the coordinator fails after participants vote yes but before sending commit, participants are blocked indefinitely.

Three-phase commit (3PC) adds a pre-commit phase to avoid blocking, but it assumes a synchronous network and fail-stop model. In practice, most distributed databases use Raft or Paxos for replication and avoid cross-shard distributed transactions.

Google Spanner pioneered the use of TrueTime for distributed transactions. TrueTime provides a globally synchronized clock with bounded uncertainty, enabling Spanner to assign globally meaningful timestamps to transactions. External consistency, Spanner's strongest isolation level, guarantees that if transaction T1 commits before transaction T2 starts, then T1's timestamp is less than T2's. This is equivalent to linearizability across the entire database, a property that no other distributed database provides without specialized hardware. Spanner achieves this using GPS receivers and atomic clocks at each data center, maintaining clock synchronization within approximately 10 milliseconds.

Consistent hashing

Consistent hashing distributes data across nodes in a way that minimizes redistribution when nodes join or leave. Each data item and each node is mapped to a point on a hash ring. Each node is responsible for the data items between its position and the previous node's position. When a node joins, it takes responsibility for only the data in its segment, leaving all other data in place.

This technique, originally developed for Akamai's CDN by David Karger and colleagues at MIT in 1997, is used by DynamoDB, Cassandra, and many other distributed databases. The number of virtual nodes per physical node controls the balance of data distribution.

Amazon's Dynamo paper (DeCandia et al., 2007) described how consistent hashing, combined with vector clocks for conflict detection and read repair for conflict resolution, enabled a highly available key-value store serving millions of requests per second. Dynamo chose eventual consistency with always-writable semantics: writes never fail due to network partitions, but reads may return stale or conflicting data that must be reconciled by the application. This design prioritized availability over consistency, a trade-off appropriate for Amazon's shopping cart (where a temporarily stale cart is preferable to a failed checkout) but not for financial transactions.

Sharding and partitioning

Sharding distributes data across multiple database instances (shards) based on a partition key. Range-based sharding assigns key ranges to shards. Hash-based sharding uses a hash function to distribute keys uniformly. Directory-based sharding maintains a lookup table mapping keys to shards.

Each approach has trade-offs. Range-based sharding supports efficient range queries but can create hotspots if keys are not uniformly distributed. Hash-based sharding distributes load evenly but does not support range queries. Directory-based sharding is flexible but adds a lookup overhead.

MongoDB uses range-based sharding by default, with support for hash-based sharding as an option. The balancer process monitors chunk distribution and migrates chunks between shards to maintain even data distribution. CockroachDB uses range-based sharding with automatic splitting: when a range exceeds a size threshold, it is split into two ranges, and the new ranges are distributed to different nodes. This automatic range splitting and rebalancing eliminates the need for manual shard management.

The fallacies of distributed computing

Peter Deutsch and James Gosling at Sun Microsystems identified eight fallacies that programmers new to distributed systems often assume: the network is reliable, latency is zero, bandwidth is infinite, the network is secure, topology does not change, there is one administrator, transport cost is zero, and the network is homogeneous. Every one of these assumptions is false in practice, and building systems that are robust to these failures requires explicit engineering for failure modes. Netflix's Chaos Engineering practice, which deliberately injects failures into production systems to verify resilience, is a direct response to the fallacy that the network is reliable.

Clock synchronization and TrueTime

Distributed systems rely on timestamps for ordering events, but physical clocks are unreliable. Clock skew between machines can be milliseconds or even seconds. Network Time Protocol (NTP) synchronizes clocks to within a few milliseconds over the public internet, but this is insufficient for applications that require precise ordering.

Google's TrueTime service, introduced with Spanner, provides a globally synchronized clock with bounded uncertainty. TrueTime returns an interval $[e a r l i es t, l a t es t]$ representing the range within which the current time is guaranteed to fall. The uncertainty is typically less than 10 milliseconds, maintained by GPS receivers and atomic clocks at each data center. This bounded uncertainty enables Spanner to assign globally meaningful timestamps to transactions, providing external consistency without the overhead of distributed consensus on every timestamp.

For systems without TrueTime, hybrid logical clocks (HLCs) combine physical timestamps with logical counters to provide causal ordering with physical timestamp semantics. An HLC timestamp consists of a physical component (close to the actual time) and a logical component (a counter that increments when events happen at the same physical time). HLCs are used by CockroachDB and MongoDB to order events across nodes without requiring specialized clock hardware.

Distributed tracing and observability

Understanding the behavior of distributed systems requires observing requests as they traverse multiple services. Distributed tracing, pioneered by Google's Dapper system (2010), assigns each request a unique trace ID that propagates through all services involved in handling it. Each service records a span representing its contribution to the trace, including timing information and metadata. The collected spans are assembled into a trace that shows the complete lifecycle of the request.

OpenTelemetry, a CNCF project, provides a standardized API for generating and collecting traces, metrics, and logs. Jaeger and Zipkin are popular distributed tracing backends that store and visualize traces. The ability to trace a request across service boundaries is essential for debugging performance problems in microservices architectures, where a single user request may involve dozens of service calls, any of which could be the source of latency.

The Raft protocol in production

Raft's understandability advantage over Paxos has led to widespread adoption. etcd (used by Kubernetes for storing cluster state), Consul (HashiCorp's service discovery tool), TiKV (the storage engine for TiDB), and CockroachDB all use Raft for consensus. The protocol's separation into subproblems (leader election, log replication, safety) makes it easier to implement correctly, which is critical given that consensus bugs are notoriously difficult to detect through testing.

Production deployments of Raft face several practical challenges. Log compaction is necessary because the log grows indefinitely. Raft supports snapshotting, where the current state machine state is saved and all preceding log entries are discarded. Some implementations support log truncation, where only entries up to a certain index are retained. The leader transfer mechanism allows controlled leadership changes for maintenance or load balancing, avoiding the random timeouts of election-based transfers.

Read-only requests present a subtlety in Raft. Because the leader might be partitioned from the majority and not know it, a naive read from the leader could return stale data. Raft addresses this by requiring the leader to verify its leadership before serving a read, either by communicating with a majority (committing a no-op entry) or by using lease-based leadership with bounded clock drift. etcd uses a linearizable read implementation that verifies leadership through the quorum, adding one round-trip to read latency but guaranteeing freshness.

Connections Master

Connections to operating systems

Distributed systems extend many OS concepts across machine boundaries. Process scheduling becomes distributed task placement. File systems become distributed file systems (GFS, HDFS). Memory management becomes distributed shared memory. The operating system's abstractions (files, processes, sockets) are the building blocks of distributed systems.

The Google File System (GFS), described in a 2003 paper by Ghemawat, Gobioff, and Leung, illustrates how distributed systems extend OS abstractions. GFS stores files as chunks (64 MB each), replicated across three chunkservers. A single master manages metadata (file-to-chunk mappings) while clients read and write chunks directly. This separation of metadata from data allows the master to serve metadata requests from memory, handling thousands of client requests per second while keeping the data path short. The Hadoop Distributed File System (HDFS), an open-source implementation of GFS, stores files as blocks (128 MB by default) and is the storage foundation for the Hadoop ecosystem.

Connections to cryptography

Byzantine fault tolerance relies on cryptographic primitives for authentication and integrity. Digital signatures prevent impersonation. Message authentication codes detect tampering. Threshold signatures enable efficient BFT consensus. The intersection of distributed systems and cryptography is essential for building trustworthy distributed systems.

Blockchain systems combine distributed consensus with cryptographic hashing and Merkle trees to create append-only, tamper-evident ledgers. Bitcoin's Nakamoto consensus uses proof-of-work to make block creation expensive, preventing Sybil attacks where an adversary creates many virtual nodes to subvert consensus. Ethereum's transition to proof-of-stake (2022) replaced computational puzzles with economic stakes, where validators lock up cryptocurrency as collateral that is slashed (destroyed) if they behave dishonestly. The Casper FFG (Friendly Finality Gadget) provides finality guarantees: once a block is finalized, it cannot be reverted without destroying at least one-third of the total staked ether.

Connections to game theory

Distributed systems with rational (self-interested) participants require mechanism design. Blockchain consensus (proof of work, proof of stake) uses economic incentives to ensure honest behavior. Game-theoretic analysis determines whether the Nash equilibrium of a protocol corresponds to honest behavior.

TheBAR model (Byzantine, Altruistic, Rational) extends traditional Byzantine fault tolerance by modeling three types of participants: Byzantine nodes that behave arbitrarily maliciously, altruistic nodes that follow the protocol exactly, and rational nodes that deviate from the protocol only if deviation increases their utility. Designing protocols that are incentive-compatible, where rational nodes have no incentive to deviate, is crucial for open distributed systems like blockchain networks where participants are unknown and potentially adversarial.

Connections to databases

Distributed databases face all the challenges of distributed systems and must additionally provide database guarantees like ACID transactions. The CAP theorem directly affects database design: PostgreSQL provides strong consistency on a single node. Cassandra partitions data across nodes and provides high availability but only eventual consistency by default. CockroachDB and Google Spanner attempt to provide both strong consistency and partition tolerance through sophisticated consensus protocols.

The interaction between consensus and database transactions creates interesting design choices. In CockroachDB, each piece of data (range) is replicated using Raft. A transaction that spans multiple ranges requires a two-phase commit coordinated across the Raft groups, similar to how 2PC coordinates across database shards. Spanner avoids cross-shard 2PC by using TrueTime to assign timestamps that order all transactions globally, even those spanning multiple Paxos groups. This eliminates the coordination overhead of distributed transactions at the cost of requiring specialized clock hardware.

Connections to networks

The properties of the underlying network fundamentally constrain distributed system design. Network latency determines the minimum time for a single communication round, which in turn determines the latency of consensus protocols. Raft requires at least one round-trip for each write operation, so the write latency is approximately twice the network latency. In wide-area deployments spanning continents, this can add hundreds of milliseconds to every write.

Network bandwidth limits the throughput of consensus protocols. The leader must replicate log entries to all followers, consuming bandwidth proportional to the write rate multiplied by the entry size. Batching multiple entries into a single AppendEntries RPC amortizes the per-request overhead and increases throughput at the cost of increased latency for individual entries. Modern systems like etcd batch up to hundreds of entries per RPC to achieve throughputs exceeding 100,000 writes per second on a single Raft cluster.

Historical and philosophical context Master

From mainframes to microservices

The history of distributed systems follows the evolution of computing hardware. Mainframes were centralized. Minicomputers enabled departmental computing. Workstations and local area networks created the first distributed systems. The internet connected everything. Cloud computing provided elastic distributed infrastructure. Each generation made distribution more pervasive and more complex.

The transition from monolithic applications to microservices accelerated in the 2010s, driven by the need for independent deployment, technology diversity, and organizational scalability. Netflix, Amazon, and Google pioneered patterns for building reliable distributed systems at scale, including circuit breakers (preventing cascading failures by failing fast when a downstream service is unhealthy), bulkheads (isolating failures to prevent them from spreading), and service discovery (enabling services to find each other dynamically without hardcoded addresses). The service mesh pattern, implemented by Istio and Linkerd, provides these capabilities as infrastructure rather than application code, allowing developers to focus on business logic while the mesh handles reliability, observability, and security concerns.

Lamport and the formalization of distributed computing

Leslie Lamport's contributions to distributed computing are foundational. His 1978 paper on logical clocks defined the happened-before relation and Lamport timestamps. His Paxos algorithm (submitted 1990, published 1998) solved distributed consensus. His LaTeX typesetting system (1984) became the standard for scientific documents. Lamport received the Turing Award in 2013 for these contributions.

Lamport's approach emphasizes the use of mathematics to specify and reason about distributed algorithms. He advocates writing specifications before implementations, arguing that many distributed system bugs arise from incomplete or ambiguous specifications rather than implementation errors. His TLA+ specification language has been used to find bugs in distributed systems at Amazon (Zetalambda paper, 2014), Microsoft (Azure storage), and Intel (hardware protocols). The Zetalambda paper reported that TLA+ found bugs in DynamoDB that would have been extremely difficult to detect through testing, including subtle race conditions that only manifested under specific interleavings of concurrent operations.

The philosophical significance of consensus

Consensus is not merely a technical problem. It raises questions about the nature of agreement and truth in systems where no single node has complete information. In a distributed system, there is no objective "now" and no single source of truth. The system must construct a shared reality through communication, a process that has parallels in social epistemology and the philosophy of science.

The FLP result has philosophical implications: it shows that perfect consensus is impossible in an asynchronous world. This is a computational analogue of philosophical arguments about the impossibility of certain knowledge in a world with imperfect communication.

The concept of eventual consistency raises its own philosophical questions. If two replicas temporarily disagree, which one is "true"? In an eventually consistent system, truth is not absolute but provisional: the current state is the best available approximation of the true state, subject to revision as new information arrives. This pragmatist view of truth, where correctness is determined by convergence rather than by correspondence to a fixed reality, resonates with philosophical pragmatism as articulated by Charles Sanders Peirce and William James.

The CAP theorem in practice

The CAP theorem has been both influential and controversial. Martin Kleppmann argued in a 2015 blog post and subsequent paper that the CAP theorem's definition of "availability" is narrower than what practitioners mean by availability, and that the real trade-off in distributed systems is not between consistency and availability but between consistency and latency. During a network partition, a system can choose to serve stale data with low latency (favoring latency over consistency) or to wait for consistency at the cost of increased latency or unavailability (favoring consistency over latency).

Kleppmann's critique led to the PACELC theorem (Abadi, 2012), which extends CAP to consider the normal-case trade-off as well. PACELC states: if there is a Partition (P), the system trades Availability (A) against Consistency (C); else (E), when running normally, it trades Latency (L) against Consistency (C). This framework better captures the design decisions faced by distributed database engineers, who must optimize for both partition and non-partition scenarios.

Bibliography Master

Primary sources

Lamport, L. (1978). "Time, clocks, and the ordering of events in a distributed system." Communications of the ACM, 21(7), 558-565.
Lamport, L. (1998). "The part-time parliament." ACM Transactions on Computer Systems, 16(2), 133-169.
Fischer, M.J., Lynch, N.A., and Paterson, M.S. (1985). "Impossibility of distributed consensus with one faulty process." Journal of the ACM, 32(2), 374-382.
Lamport, L., Shostak, R., and Pease, M. (1982). "The Byzantine generals problem." ACM Transactions on Programming Languages and Systems, 4(3), 382-401.
Ongaro, D. and Ousterhout, J. (2014). "In search of an understandable consensus algorithm." USENIX ATC, 305-319.
Castro, M. and Liskov, B. (1999). "Practical Byzantine fault tolerance." OSDI, 173-186.
DeCandia, G. et al. (2007). "Dynamo: Amazon's highly available key-value store." SOSP, 205-220.

Secondary sources

Kleppmann, M. (2017). Designing Data-Intensive Applications. O'Reilly.
Tanenbaum, A.S. and Van Steen, M. (2017). Distributed Systems (3rd ed.). Pearson.
Cachin, C., Guerraoui, R., and Rodrigues, L. (2011). Reliable Distributed Systems. Springer.
Shapiro, M., Preguica, N., Baquero, C., and Zawirski, M. (2011). "Conflict-free replicated data types." SSS, 386-400.
Karger, D. et al. (1997). "Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web." STOC, 654-663.
Dwork, C., Lynch, N., and Stockmeyer, L. (1988). "Consensus in the presence of partial synchrony." Journal of the ACM, 35(2), 288-323.

Prerequisites

25.06.01
25.07.01
25.08.01

Tier anchors

beginner: Tanenbaum and Van Steen, Distributed Systems (3e), Ch. 1-3; Kleppmann, Designing Data-Intensive Applications, Ch. 5-7
intermediate: Kleppmann, Designing Data-Intensive Applications, Ch. 8-12; Cachin, Guerraoui, and Rodrigues, Reliable Distributed Systems, Ch. 4-6
master: Lamport, The Part-Time Parliament (1998); Ongaro and Ousterhout, In Search of an Understandable Consensus Algorithm (2014); Gray and Lorie, RAID

References

computer-science · Ch. 0
Kleppmann, M., Designing Data-Intensive Applications (O'Reilly, 2017) · Ch. 5-12 · source being verified
Lamport, L., The Part-Time Parliament, ACM Transactions on Computer Systems 16(2), 1998 · pp. 133-169 · source being verified
Ongaro, D. and Ousterhout, J., In Search of an Understandable Consensus Algorithm, USENIX ATC 2014 · pp. 305-319 · source being verified

Estimated time

beginner: 30m
intermediate: 55m
master: 80m