System Design Interview — Study Notes XI — Designing Data Intensive Applications Book Notes

20 min readNov 15, 2023

Notes on Concepts & Components to be used in a System Design Interview

https://www.kobo.com/en/ebook/designing-data-intensive-applications

Chapter 9 — Consistency and Consensus

Consistency Guarantees

Inconsistencies occur no matter what replication method the database uses (single-leader, multi-leader, or leaderless replication).
Most replicated databases provide at least eventual consistency. A better name for eventual consistency may be convergence — as we expect all replicas to eventually converge to the same value.
transaction isolation is about avoiding race conditions due to concurrently executing transactions, where as distributed consistency is about coordinating the state of replicas in the face of delays and faults.
- linearizability — one of the strongest consistency models.
- the issue of ordering events in a distributed system — particularly around causality and total ordering.
- how to atomically commit a distributed transaction — the consensus problem

Linearizability

linearizability (atomic consistency, strong consistency, immediate consistency, external consistency) — the basic idea is to make a system appear as if they were only one copy of the data, and all operations on it are atomic. It is a recency guarantee.

What Makes a System Linearizable

key x — x is called a register — one key in a key-value store, one row in a relational database, or one document in a document database.
regular register — a register in which reads may return either the old or the new value if they are concurrent with a write is known.
In the absence of other requests, it would be okay for B’s read to return 2. However, client A has already read the new value 4 before B’s read started, so B is not allowed to read an older value than A. That is the intuition behind linearizability.
Test whether a system’s behaviour is linearizable by recording the timings of all requests and responses, and checking whether they can be arranged into a valid sequential order.

Linearizability Versus Serializability

linearizability — often confused with serializability.
serializability — an isolation property of transactions. It is okay for that serial order to be different from the order in which transactions were actually run.
linearizability — a recency guarantee on reads and writes of a register (an individual object). It doesn’t group operations together into transactions.
A database may provide both serializability and linearizability — strict serializability or strong one-copy serializability (strong-ISR). Implementations of serializability based on two-phase locking or actual serial execution are typically linearizable.
Serializable snapshot isolation is not linearizable — it does not include writes that are more recent than the snapshot, so reads from the snapshot are not linearizable.

Relying on Linearizability

Locking and leader election
Single leader replication — ensure there is indeed only one leader, not several (split brain).
Coordination services like Apache ZooKeeper and etcd are often used to implement distributed locks and leader election. ZooKeeper and etcd provide linearizable writes, but reads may be stale. You can optionally request a linearizable read: etcd calls this a quorum read, and in ZooKeeper you need to call sync() before the read.
Constraints and uniqueness guarantees
Uniqueness constraints are common in databases: for example, a username or email address must uniquely identify one user, no two files with the same path and filename. You need linearizability.
If a flight is overbooked — you can move customers to a different flight and offer them compensation for the inconvenience — loosely interpreted constraints.
hard uniqueness constraint — in relational databases, requires linearizability; but foreign key or attribute constraints can be implemented without requiring linearizability.

Implementing Linearizable Systems

Single-leader replication (potentially linearizable)
If you make reads from the leader, or from synchronously updated followers, they have the potential to be linearizable. Not every single-leader database is actually linearizable, either by design (e.g. because it uses snapshot isolation) or due to concurrency bugs.
Consensus algorithms (linearizable)
Consensus protocols contain measures to prevent split brain and stale replicas. This is how ZooKeeper and etcd work.
Multi-leader replication (not linearizable)
Systems with multi-leader replication can produce conflicting writes that require resolution.
Leaderless replication (probably not linearizable)
For systems with leaderless replication, “strong consistency” can be obtained by requiring quorum reads and writes (w + r > n).
“Last write wins” conflict resolution methods based on time-of-day clocks (e.g. in Cassandra) are almost certainly nonlinearizable, because clock timestamps cannot be guaranteed to be consistent with actual event ordering due to clock skew.
Linearizability and quorums
When we have variable network delays, it is possible to have race conditions.
At the cost of reduced performance, a reader must perform read repair synchronously, before returning results to the application, and a writer must read the latest state of a quorum of nodes before sending its writes. Cassandra does wait for read repair to complete on quorum reads, but it loses linearizability if there are multiple concurrent writes to the same key, due to its use of last-write-wins conflict resolution.
Only linearizable read and write operations can be implemented in this way but a linearizable compare-and-set operation cannot, because it requires a consensus algorithm.

The Cost of Linearizability

Multi-leader replication is often a good choice for multi-datacenter replication.
The CAP theorem
This issue is not just a consequence of single-leader and multi-leader replication. It can occur on any unreliable network, even with one datacenter.
Two choices — CP (consistent but not available under network partitions) and AP (available but not consistent under network partitions).
Applications that don’t require linearizability can be more tolerant of network problems. CAP deserves credit for this culture swift — witness the explosion of the new database technologies since the mid-2000s (known as NoSQL).
The Unhelpful CAP Theorem
CAP is sometimes presented as Consistency, Availability, Partition Tolerance: pick 2 out of 3. This is misleading — network partitions are kind of fault, so they aren’t something about which you have a choice.
At times when the network is working correctly, a system can provide both consistency (linearizability) and total availability. When a network fault occurs, you have to choose between either Consistent or Available when Partitioned.
It doesn’t say anything about network delays, dead nodes, or other trade-offs.
Linearizability and network delays
Even RAM on a modern multi-core CPU is not linearizable — unless a memory barrier or fence is used.
Every CPU core has its own memory cache and store buffer. Memory access first goes to the cache by default, and any changes are asynchronously written out to main memory.

Ordering Guarantees

- The main purpose of the leader in single-leader replication is to determine the order of writes in the replication log.
- Serializability is about ensuring the transactions behave as if they were executed in some sequential order — achieved by literally executing transactions in that serial order, or by allowing concurrent execution while preventing serialization conflicts — by locking or aborting.
- The use of timestamps and clocks
Deep connections between ordering, linearizability, and consensus.

Ordering and Causality

- There is a causal dependency between a question and its answer.
- Some writes could “overtake” others due to network delays. Causality here means that a row must first be created before it can be applied.
- happened-before relationship — if A happened before B, that means B might have known about A, or built upon A, or depended on A. If A and B are concurrent, there is no causal link between them; in other words, we are sure that neither knew about the other
- A transaction reads from a consistent snapshot. consistent with causality — if the snapshot contains an answer, it must also contain the question being answered. The effects of all operations that happened causally before that point in time are visible, but no operations that happened causally afterward can be seen. Read skew (non-repeatable reads) means reading data in a state that violates causality.
- Write skew between transactions also demonstrates causal dependencies. Serializable snapshot isolation detects write skew by tracking the causal dependencies between transactions.
Causality imposes an ordering on events: cause comes before effect — what happened before what.
causally consistent — if a system obeys the ordering imposed by causality. Snapshot isolation provides causal consistency. Causal consistency does not have the coordination overhead of linearizability and is much less sensitive to network problems.
The causal order is not a total order
total order — allows any two elements to be compared; which one is greater and which one is smaller.
linearizability — In a linearizable system, we have a total order of operations. Linearizability implies causality.
causality — two events are incomparable if they are concurrent. some operations are ordered with respect to each other, but some are incomparable.
In Git, version histories are very much like the graph of causal dependencies.
Capturing causal dependencies
Causality in a leaderless datastore, where we need to detect concurrent writes to the same key in order to prevent lost updates. Causal consistency goes further: it needs to track causal dependencies across the entire database, not just for a single key. Version vectors can be generalized to do this.
To determine the causal ordering, the database needs to know which version of the data was read by the application. The version number from the prior operation is passed back to the database on a write. A similar idea appears in the conflict detection of SSI (Serializable Snapshot Isolation). The database keeps track of which data has been read by which transaction.

Sequence Number Ordering

Client read lots of data; it is not clear whether the write is causally dependent on all or some of those prior reads. Tracking would be a large overhead.
Sequence numbers or timestamps can be used. A timestamp doesn’t need to come from a time-of-day clock; instead can come from a logical clock, which is an algorithm to generate a sequence of numbers to identify operations, typically using counters that are incremented for every operation.
In a database with single-leader replication, the replication log defines a total order of write operations that is consistent with causality.
Noncausal sequence number generators
- Each node can generate its own independent set of sequence numbers. You could reserve some bits to contain a unique node identifier.
- You can attach a timestamp from a time-of-day clock (physical clock) — having sufficiently high resolution.
- Preallocate blocks of sequence numbers.
- The sequence numbers they generate are not consistent with causality — the causality problems occur because these sequence number generators do not correctly capture the ordering of operations across different nodes. Each node may process a different number of operations per second.
- Timestamps from physical clocks are subject to clock skew. Though, it is possible to make physical clock timestamps consistent with causality — Google’s Spanner, which estimates the expected clock skew.
Lamport timestamps
Lamport timestamp — sequence numbers that is consistent with causality. Each node has a unique identifier, and each node keeps a counter of the number of operations it has processed. The Lamport timestamp is simply a pair of (counter, node ID) — every timestamp is unique.
If the counter values are the same, the one with the greater node ID is the greater timestamp.
When a node receives a request or response with a maximum counter value greater than its own counter value, it immediately increases its own counter to that maximum.
As long as the maximum counter value is carried along with every operation, this schema ensures that the ordering from the Lambert timestamps is consistent with causality, because every causal dependency results in an increased timestamp.
Lamport timestamps are sometimes confused with version vectors. Version vectors can distinguish whether two operations are concurrent or whether one is causally dependent on the other, whereas Lamport timestamps always enforce a total ordering. From the total ordering of Lamport timestamps, you cannot tell whether two operations are concurrent or whether they are causally dependent. Lamport timestamps are more compact.
Timestamp ordering is not sufficient
If two accounts with the same username are created, pick the one with the lower timestamp as the winner. Since timestamps are totally ordered, this comparison is always valid.
It is not sufficient when a node needs to decide right now whether the request should succeed or fail. The node does not know whether another node is concurrently in the process of creating an account with the same username, and what timestamp that other node may assign to the operation.
You would have to check with every other node to see what it is doing.
The total ordering of operations only emerges after you have collected all of the operations. If another node has generated some operations, but you don’t yet know what they are, you cannot construct the final ordering of operations.
For uniqueness constraint, it’s not sufficient to have a total ordering of operations — you also need to know whether that order is finalized.

Total Order Broadcast (Atomic Broadcast)

Knowing when your total order is finalized.
Partitioned databases with a single leader per partition often maintain ordering only per partition, which means they cannot offer consistency guarantees (e.g. consistent snapshots, foreign key references) across partitions.
Total order broadcast — a protocol for exchanging messages between nodes. Two safety properties: reliable delivery, totally ordered delivery
Using total order broadcast
Consensus services such as ZooKeeper and etcd actually implement total order broadcast. A strong connection between total order broadcast and consensus.
Total order broadcast is exactly what you need for database replication — state machine replication.
It is a way of creating a log (as in a replication log, transaction log, write-ahead log): delivering messages is like appending to the log.
It is also useful for a lock service that provides fencing tokens. The sequence number can then serve as a fencing token, because it is monotonically increasing. In ZooKeeper, this sequence number is called zxid.
Implementing linearizable storage using total order broadcast
Total order broadcast is asynchronous: messages are guaranteed to be delivered reliably in a fixed order, but there is no guarantee about when a message will be delivered (so one recipient may lag behind the others). By contrast, linearizability is a recency guarantee: a read is guaranteed to see the latest value written.
For every possible username, you can have a linearizable register with an atomic compare-and-set operation. If the first message for your desired username is your own message, you can commit the username claim (perhaps by appending another message to the log) and acknowledge it to the client.
It doesn’t guarantee linearizable reads; this procedure provides sequential consistency (timeline consistency), a slightly weaker guarantee than linearizability.
You can sequence reads through the log and performing the actual read when the message is delivered back to you.
You can wait for all entries up to that position to be delivered to you, and then perform the read (ZooKeeper’s sync() operation).
You can read from a replica that is synchronously updated on writes (used in chain replication).
Implementing total order broadcast using linearizable storage
A linearizable compare-and-set operation from total order broadcast, assuming that we have linearizable storage. A linearizable register that stores an an integer and that has an atomic increment-and-get operation or alternatively an atomic compare-and-set operation.
The algorithm: for every message you want to send through total order value broadcast, you increment-and-get the linearizable integer, and then attach the value you got from the register as a sequence number to the message. You can then send the message to all nodes (resending any lost messages), and the recipients will deliver the messages consecutively by sequence number.
Unlike Lamport timestamps, the numbers you get from incrementing the linearizable register form a sequence with no gaps. If a node has delivered message 4 and receives an incoming message with a sequence number of 6, it knows that it must wait for message 5 before it can deliver message 6.
A linearizable compare-and-set (or increment-and-get) register and total order broadcast are both equivalent to consensus.

Distributed Transactions and Consensus

Consensus — get several nodes to agree on something.
Situations for nodes to agree:
- Leader election — in a database with single-leader replication.
- Atomic commit — we have to get all nodes to agree on the outcome of the transaction: either they all abort/roll back (if anything goes wrong) or they all commit (if nothing goes wrong).
A wide range of problems are actually reducible to consensus:
- Linearizable compare-and-set operation — the register needs to atomically decide whether to set its value.
- Atomic transaction commit — a database must decide whether to commit or abort a distributed transaction.
- Total order broadcast — the messaging system must decide on the order in which to deliver messages.
- Locks and leases — the lock decides which client successfully acquired when several client are racing to grab the lock or the lease.
- Membership/coordination services — the system must decide which nodes are alive and which should be considered dead.
- Uniqueness constraint — the constraint must decide which transaction to allow and which should fail when several concurrently try to create conflicting records with the same key.
Leaderless and multi-leader replication systems typically do not use global consensus.

The Impossibility of Consensus

FLP (Fischer, Lynch, and Paterson) result — there is no algorithm that is always able to reach consensus if there is a risk that a node may crash. If the algorithm is allowed to use timeouts, or some other way of identifying suspected crashed nodes, then consensus becomes solvable.

Atomic Commit and Two-Phase Commit (2PC)

Atomicity is especially important for multi-object transactions and databases that maintain secondary indexes. Each secondary index is a separate data structure from the primary data. Atomicity ensures that the secondary index stays consistent with the primary data.
From single-node to distributed atomic commit
On a single node; transaction commitment crucially depends on the order: first the data, then the commit record. The key is deciding moment for whether the transaction commits or aborts is the moment at which the disk finished writing the commit record. After that moment, the transaction is committed.
And once the transaction is committed on one node, it cannot be retracted again if it later turns out that it was aborted on another node. A node must only commit once it is certain that all other nodes in the transaction are also going to commit. A transaction commit must be irrevocable — the basis of read committed isolation.
The effects of a committed transaction can later be undone by another, compensating transaction — this is the application’s problem.
Introduction to two-phase commit
Two-phase commit — an algorithm for achieving atomic transactions commit across multiple nodes; to ensure that either all nodes commit or all nodes abort. 2PC is also made available to applications in the form of XA transactions (also supported by Java Transaction API).
The commit/abort process in 2PC is split into two phases.
2PC uses a coordinator (known as transaction manager), and it is often implemented as a library within the same application process that is requesting the transaction (e.g. embedded in a Java EE container).
Database nodes are called participants in the transaction.
Then the application is ready to commit, the coordinator begins phase 1: it sends a prepare request to each of the nodes, asking them whether they are able to commit. The coordinator then tracks the responses from the participants:
- If all participants reply “yes”, the coordinator sends out a commit request.
- If any of the participants replies “no”, the coordinator sends an abort request to all nodes.
A system of promises
Surely the prepare and commit requests can just as easily be lost in the two-phase case. What makes 2PC different?
- When the application wants to begin a distributed transaction, it requests a transaction ID from the coordinator. This transaction ID is globally unique.
- The globally unique transaction ID is attached to the single-node transaction.
- When the application is ready to commit, the coordinator sends a prepare request to all participants, tagged with the global transaction ID. If any of these requests fails or times out, the coordinator sends an abort request for that transaction ID to all participants.
- By replying “yes” to the coordinator, the node promises to commit the transaction without error if requested.
- The coordinator must write that decision to its transaction log on disk — commit point.
- Once the coordinator’s decision has been written to disk, the commit or abort request is sent to all participants. If this request fails or times out, the coordinator must retry forever until it succeeds. No more going back.
Two crucial “points of no return” — when a participant votes “yes”; once the coordinator decides. Those promises ensure the atomicity of 2PC. Single-node atomic commit lumps these two events into one: writing the commit record to the transaction log.
Coordinator failure
-Once the participant voted “yes”, it must wait to hear back from the coordinator whether the transaction was committed or aborted. If the coordinator crashes of the network fails, the participant can do nothing but wait — in doubt or uncertain transaction.
Database2 received the commit request but the coordinator crashed before it could send the commit request to Database1. Database1 will end up inconsistent with Database2.
The only way 2PC can complete is by waiting for the coordinator to recover. This is why the coordinator must write its commit or abort decision to a transaction log on disk before sending commit or abort requests to participants. Thus, the commit point of 2PC comes down to a regular single-node atomic commit on the coordinator.
Three-phase commit
Two-phase commit is called a blocking atomic commit protocol.
Three-phase commit (3PC) assumes a network with bounded delay and nodes with bounded response times.
Nonblocking atomic commit requires a perfect failure detector. In a network with unbounded delay a timeout is not a reliable failure detector. Due to a network problem even if no node has crashed. For this reason, 2PC continues to be used.

Distributed Transactions in Practice

Many cloud services choose not to implement distributed transactions due to the operational problems they engender.
Distributed transactions in MySQL are reported to be over 10 times slower than single-node transactions. Much of the performance cost inherent in two-phase commit is due to the additional disk forcing (fsync) that is required for crash recovery, and the additional network round-trips.
Database-internal distributed transactions — all the nodes participating in the transaction are running the same database software.
Heterogenous distributed transactions — the participants are two or more different technologies (two databases from different vendors, message brokers).
Exactly-once message processing
We can ensure that the message is effectively processed exactly once, even if it required a few retries before it succeeded.
XA Transactions
X/Open XA (eXtended Architecture) — a standard for implementing two-phase commit across heterogeneous technologies. XA is supported by many traditional databases (PostgreSQL, MySQL, DB2, SQL Server, Oracle) and messaging brokers (ActiveMQ).
XA is not a network protocol — it is merely a C API for interfacing with a transaction coordinator.
XA assumes that your application uses a network driver or client library to communicate with the participant databases or messaging services. If the driver supports XA, that means it calls the XA API to find out whether an operation should be part of a distributed transaction. The driver also exposes callbacks through which the coordinator can ask the participant to prepare, commit, or abort.
The transaction coordinator implements the XA API. The coordinator is often simply a library that is loaded into the same process as the application issuing the transaction (not a separate service). It keeps track of the commit/abort decision for each transaction.
The coordinator can use the database driver’s XA callbacks to ask participants to commit or abort, as appropriate.
Holding locks while in doubt
row-level exclusive lock on any rows modified, to prevent dirty writes.
serializable isolation, two-phase locking also have to take a shared lock on any rows read by the transaction.
The database cannot release the locks until the transaction commits or aborts. two-phase commit — a transaction must hold onto the locks throughout the time it is in doubt.
Recovering from coordinator failure
In practice, orphaned in-doubt transactions do occur. The coordinator cannot decide the outcome.
Many XA implementations have an emergency escape hatch called heuristic decisions, probably breaking atomicity, allowing a participant to unilaterally decide to abort or commit an in-doubt transaction without a definitive decision from the coordinator, violating the promises in two-phase commit.
Limitations of distributed transactions
When the coordinator is part of the application server, such application servers are no longer stateless.
XA cannot detect deadlocks across different systems and it does not work with SSI.

Fault-Tolerant Consensus

A (uniform) consensus algorithm must satisfy:
- Uniform agreement — no two nodes decide differently. It is a safety property.
- Integrity — no node decides twice. It is a safety property.
- Validity — if a node decides value v, then v was proposed by some node. It is a safety property.
- Termination — every node that does not crash eventually decides some value. This property formalizes the idea of fault tolerance. It is a liveliness property.
If all nodes crash, it is not possible for any algorithm to decide anything. There is a limit to the number of failures that an algorithm can tolerate. Consensus algorithms require at least a majority of nodes to be functioning correctly — the majority forms a quorum.
Most consensus algorithms assume that there are no Byzantine faults.
Consensus algorithms and total order broadcast
Best-known fault-tolerant consensus algorithms — Viewstamped Replication (VSR), Paxos, Raft (used in etcd), Zab (used in ZooKeeper).
They decide on a sequence of values, which makes them total order broadcast algorithms.
Total order broadcast requires messages to be delivered exactly once, in the same order, to all nodes. Equivalent to repeated rounds of consensus, each consensus decision corresponding to one message delivery — in each round, nodes propose the message that they want to send next, and then decide on the next message to be delivered in the total order.
Viewstamped Replication, Raft, Zab implement total order broadcast directly which is more efficient than doing repeated rounds of one-value-at-a-time consensus.
Single-leader replication and consensus
Single leader replication — the leader — total order broadcast.
Automatic leader election in a failover — consensus
Problem of split brain
Epoch numbering and quorums
epoch number (ballot number, view number, term number) — guarantee that within each epoch, the leader is unique.
If there is a conflict between two different leaders in two different epochs (perhaps because the previous leader actually wasn’t dead after all), then the leader with the higher epoch number prevails.
It must collect votes from a quorum of nodes. For every decision that a leader wants to make, it must send the proposed value to the other nodes and wait for quorum of nodes to respond in favor of the proposal.
We have two rounds of voting: once to choose a leader, and a second time to vote on a leader’s proposal. If the vote on a proposal does not reveal any higher-numbered epoch, the current leader can conclude that no leader election with a higher epoch number that has happened.
In 2PC, the coordinator is not elected. Fault-tolerant consensus algorithm only require votes from a majority of nodes, whereas 2PC requires a “yes” vote from every participant. Consensus algorithms also define a recovery process.
Limitations of consensus
They provide total order broadcast, and therefore they can also implement linearizable atomic operations in a fault-tolerant way.
The process by which nodes vote on proposal before they are decided is a kind of synchronous replication.
Sometimes, consensus algorithms are particularly sensitive to network problems. Raft has been shown to have unpleasant edge cases in one of which leadership continually bounces between two nodes.

Membership and Coordination Services

ZooKeeper, etcd — distributed key-value store, coordination and configuration services.
HBase, Hadoop YARN, OpenStack Nova, Kafka all rely on ZooKeeper.
ZooKeeper and etcd are designed to hold small amounts of data which is replicated across all the nodes using a fault-tolerant total order broadcast algorithm — it is just what you need for database replication: if each message represents a write to the database, applying the same writes in the same order keeps replicas consistent with each other.
ZooKeeper is modeled after Google’s Chubby lock service, having not only total order broadcast (hence consensus) but also other features that are listed below. Only the linearizable atomic operations really require consensus.
- Linearizable atomic operations
A distributed lock is usually implemented as a lease, which has an expiry time.
- Total ordering of operations
a fencing token to prevent clients from conflicting with each other, it is a number that monotonically increases every time the lock is acquired. ZooKeeper provides this by totally ordering all operations and giving each operation a monotonically increasing transaction ID (zxid) and version number (cversion).
- Failure detection
Clients maintain a long-lived session on ZooKeeper servers, and the client and server periodically exchange heartbeats to check that the other node is still alive. The session remains active even if the connection is temporarily interrupted or a ZooKeeper node fails. The session is declared dead by ZooKeeper if the heartbeat cease for a duration longer than the session timeout. Any locks held by a session can be configured to be automatically released when the session times out — ephemeral nodes.
- Change notifications
Not only can one client read locks and values that were created by another clients, but it can also watch them for changes. A client can find out when another client joins the cluster or if another client fails.
Allocating work to nodes
Apache Curator has sprung up to provide higher-level tools on top of the ZooKeeper client API.
ZooKeeper runs on a fixed number of nodes (usually 3 or 5) and performs its majority votes among those nodes while supporting a potentially large number of clients.
An example for the data managed by ZooKeeper (quite slow-changing) — “the node running on 10.1.1.23 is the leader for partition 7); it is not intended for storing the runtime state of the application.
Service discovery
ZooKeeper, etcd, and Consul are also often used for service discovery.
DNS is the traditional way of looking up the IP address for a service name, and it uses multiple layers of caching to achieve good performance and availability.
Service discovery does not require consensus, leader election does. Some consensus systems support read-only caching replicas. These replicas asynchronously receive the log of all decisions of the consensus algorithm, but do not actively participate in voting. They are therefore able to serve read requests that do not need to be linearizable.
Membership services
A membership service determines which nodes are currently active and live members of a cluster.

Happy Coding!

References:
- Designing Data-Intensive Applications
The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
by Martin Kleppmann