System Design Interview — Study Notes VII — Designing Data Intensive Applications Book Notes
Notes on Concepts & Components to be used in a System Design Interview
Chapter 5 — Replication
Replication: keeping a copy of the same data on multiple machines that are connected via a network.
- To reduce latency: keep data geographically close to users
- To increase availability: allow the system to keep working even if some of its parts fail
- To increase read throughput: scale out the number of machines that can serve read queries
Partitioning (Sharding) of datasets that are too big for a single machine.
Leaders and Followers
Replica: node that stores a copy of the database.
Leader-based replication (active/passive or master/slave replication):
Single-leader replication — Clients send all writes to a single node (the leader), which sends a stream of data change events to other replicas (followers).
- leader — master — primary is one of the replicas. When clients want to write to the database, they must send their requests to the leader, which first writes the new data to its local storage.
- followers — read replicas — slaves — secondaries — hot standbys are the other replicas. Whenever the leader writes new data to its local storage, it also sends the data change to all of its followers as part of replication log or change stream. Followers apply all writes in the same order as they were processed on the leader.
- When a client wants to read from the database, it can query either the leader or any of the followers. Writes are only accepted on the leader; followers are read-only.
Hot, warm, cold standby servers. PostgreSQL — hot standby: a replica that accepts reads, warm standby: processes changes from the leader but doesn’t process any queries.
This mode of replication is a built-in feature of many relational databases such as PostgreSQL and MySQL, some nonrelational databases such as MongoDB, and distributed message brokers such as Kafka and RabbitMQ (highly available queues).
Synchronous Versus Asynchronous Replication
Replication happens synchronously or asynchronously. In relational databases, this is often a configurable option.
In practice, "synchronous replication" on a database usually means that one of the followers is synchronous and the others are asynchronous.
The advantage of synchronous replication: if the leader suddenly fails, we can be sure that the data is still available on the follower.
The disadvantage of synchronous replication; if the synchronous follower doesn’t respond (it has crashed, a network fault, etc.), the write cannot be processed. The leader must block all writes and wait until the synchronous replica is available again.
Impractical for all followers to be synchronous.
Semi-synchronous: if the synchronous follower becomes unavailable or slow, one of the asynchronous followers is made synchronous.
Often, leader-based replication is configured to be completely asynchronous.
Chain replication — a variant of synchronous replication.
There is a strong connection between consistency of replication and consensus — getting several nodes to agree on a value.
Setting Up New Followers
Setting up a follower can usually be done without downtime.
- Take a consistent snapshot of the leader’s database at some point in time (innobackupex for MySQL).
- Copy the snapshot to the new follower node.
- The follower connects to the leader and requests all the data changes that have happened since the snapshot was taken. Exact position in the leader’s replication log — PostgreSQL log sequence number, MySQL binlog coordinates.
- Process the backlog of changes until caught up. It can then continue to process data changes from the leader as they happen.
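The four steps above can be sketched as follows. This is a hypothetical in-memory model (the classes and positions are illustrative, not a real database API): the follower starts from a snapshot paired with its exact log position, then replays every change recorded after that position.

```python
# Illustrative sketch of follower setup: snapshot + log position,
# then catch-up replay of the backlog. All names are assumptions.

class Leader:
    def __init__(self):
        self.log = []          # append-only list of (position, key, value)
        self.data = {}

    def write(self, key, value):
        self.data[key] = value
        self.log.append((len(self.log), key, value))

    def snapshot(self):
        # A consistent snapshot paired with its exact log position
        return dict(self.data), len(self.log)

    def changes_since(self, position):
        return self.log[position:]

class Follower:
    def __init__(self, snapshot_data, snapshot_pos):
        self.data = dict(snapshot_data)
        self.position = snapshot_pos

    def catch_up(self, leader):
        for pos, key, value in leader.changes_since(self.position):
            self.data[key] = value
            self.position = pos + 1

leader = Leader()
leader.write("a", 1)
snap, pos = leader.snapshot()      # step 1: snapshot at a log position
leader.write("b", 2)               # writes keep flowing meanwhile
follower = Follower(snap, pos)     # step 2: copy snapshot to follower
follower.catch_up(leader)          # steps 3-4: replay the backlog
assert follower.data == {"a": 1, "b": 2}
```

The key design point is that the snapshot must be associated with an exact position in the replication log (PostgreSQL's log sequence number, MySQL's binlog coordinates), so the follower knows precisely where to resume.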
Handling Node Outages
Follower failure — Catch-up recovery
On its local disk, each follower keeps a log of the data changes it has received from the leader.
It knows the last transaction that was processed before the fault occurred. The follower requests all the data changes that occurred during the time when the follower was disconnected.
Leader failure — Failover
Failover — One of the followers needs to be promoted to be the new leader, clients need to be reconfigured to send their writes to the new leader, and the other followers need to start consuming data changes from the new leader.
- Determining that the leader has failed — crashes, power outages, network issues, etc. Most systems simply use a timeout — nodes frequently bounce messages back and forth between each other.
- Choosing a new leader — through an election process (where the leader is chosen by a majority of the remaining replicas), or a new leader could be appointed by a previously elected controller node. The best candidate is usually the replica with the most up-to-date data changes from the old leader.
- Reconfiguring the system to use the new leader — ensure that the old leader becomes a follower and recognizes the new leader.
Things can go wrong — Trade-offs around Consistency, Durability, Availability, Latency
- The new leader may not have received all the writes from the old leader. What should happen to those writes if the former leader rejoins the cluster after the new leader has been chosen? Most common solution — they are discarded.
- Discarding is dangerous if other storage systems outside of the database need to be coordinated with the database contents.
- Two nodes may both believe they are the leader — split brain. With no process for resolving conflicts, data is likely to be lost or corrupted. Some systems have a shutdown mechanism if this case is detected — fencing — Shoot The Other Node In The Head (STONITH).
- What is the right timeout to declare the leader dead?
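Timeout-based failure detection can be sketched minimally as below. This is an assumption of these notes, not a specific product's algorithm: each node tracks when it last heard from the leader and declares it dead once a configured timeout elapses. The timeout choice is the trade-off the question above points at — too short causes unnecessary failovers, too long delays recovery.

```python
import time

# Minimal sketch of timeout-based failure detection: record the last
# heartbeat, declare the leader dead once the timeout has elapsed.

class FailureDetector:
    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_heartbeat = time.monotonic()

    def heartbeat(self):
        # Called whenever any message arrives from the leader
        self.last_heartbeat = time.monotonic()

    def leader_is_dead(self):
        return time.monotonic() - self.last_heartbeat > self.timeout_s

detector = FailureDetector(timeout_s=0.05)
detector.heartbeat()
assert not detector.leader_is_dead()
time.sleep(0.06)                  # no heartbeat within the window
assert detector.leader_is_dead()  # so the leader is declared dead
```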
Implementation of Replication Logs
Statement-based replication
The leader logs every write request (statement) that it executes and sends that statement log to its followers.
Each follower parses and executes that SQL statement.
- Any statement calling a nondeterministic function such as NOW() or RAND() generates a different value on each replica.
- Statements that use an auto-incrementing column, or that depend on existing data (UPDATE … WHERE), must be executed in exactly the same order on each replica. This is limiting when there are multiple concurrently executing transactions.
- triggers, stored procedures, user-defined functions, etc. may have nondeterministic side effects.
To overcome those, the leader can replace any nondeterministic functions with a fixed return value.
Statement-based replication was used in MySQL, but by default MySQL now switches to row-based replication if there is any nondeterminism in a statement.
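The fixed-return-value workaround can be sketched as a simple text substitution. This is illustrative only (real databases rewrite statements inside the parser, not with regexes): before shipping a statement to followers, the leader substitutes a nondeterministic call like NOW() with the concrete value it actually computed, so every replica executes the same deterministic statement.

```python
import re

# Sketch: replace a nondeterministic NOW() call with the fixed value
# the leader computed, so followers re-execute it deterministically.

def make_deterministic(statement, now_value):
    return re.sub(r"NOW\(\)", repr(now_value), statement)

stmt = "UPDATE sessions SET last_seen = NOW() WHERE user_id = 42"
fixed = make_deterministic(stmt, "2024-01-01 12:00:00")
assert "NOW()" not in fixed
print(fixed)
```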
Write-ahead log (WAL) shipping
SSTables and LSM-Trees, B-trees were discussed in System Design Interview — Study Notes IV — Designing Data Intensive Applications Book Notes. We can use the exact same log to build a replica on another node. This method of replication is used in PostgreSQL and Oracle. The main disadvantage is that the log describes the data on a very low level: a WAL contains details of which bytes were changed in which disk blocks — the replication is closely coupled to the storage engine. This may be a problem if the database changes its storage format to a different version — not possible to run different versions of the database software on the nodes.
Trigger-based replication
If you want to replicate only a subset of the data, replicate from one kind of database to another, or need conflict resolution logic, you may need to move replication up to the application layer.
Oracle GoldenGate can make data changes available to an application by reading the database log. An alternative way is to use triggers and stored procedures. A trigger has the opportunity to log a change to a separate table. An external process can then apply any necessary application logic and replicate the data change to another system; Databus for Oracle, Bucardo for Postgres.
Problems with Replication Lag
Tolerate node failures, scalability, latency
Leader-based replication requires all writes to go through a single node, but read-only queries can go to any replica. The more nodes you have, the likelier it is that one will be down, so a fully synchronous configuration would be very unreliable.
In the case of eventual consistency — the battle cry of many NoSQL projects, but also true of asynchronously replicated relational databases — there is no limit to how far a replica can fall behind: the replication lag.
read-after-write consistency (read-your-writes consistency) — a guarantee that users will always see any updates they submitted themselves. It makes no promises about other users: other users' updates may not be visible until some time later. It reassures the user that their own input has been saved correctly.
- When reading something that the user may have modified, read it from the leader; otherwise, read it from a follower — always read the user’s own profile from the leader, and any other users’ profiles from a follower.
- You can track the time of the last update and, for one minute after the last update, make all reads from the leader. You can also monitor the replication lag on followers and prevent queries on any follower that is more than one minute behind the leader.
- The client can remember the timestamp of its most recent write. The replica serving any reads for that user reflects updates at least until that timestamp — a logical timestamp (something that indicates ordering of writes, such as the log sequence number) or the actual system clock (makes clock synchronization critical).
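The last technique above can be sketched as a replica-selection routine. This is a hypothetical sketch (names like `applied_lsn` are assumptions, loosely modeled on log sequence numbers): the client presents the logical timestamp of its most recent write, and any replica that has not replayed the log at least that far is skipped for this user's reads.

```python
# Sketch of read-your-writes via a client-remembered logical timestamp:
# only serve the read from a replica that has replayed past that point.

def choose_replica(replicas, min_lsn):
    """Return any replica whose replayed position covers min_lsn."""
    for replica in replicas:
        if replica["applied_lsn"] >= min_lsn:
            return replica["name"]
    return "leader"  # fall back: the leader is always up to date

replicas = [{"name": "f1", "applied_lsn": 90},
            {"name": "f2", "applied_lsn": 120}]
assert choose_replica(replicas, min_lsn=100) == "f2"
assert choose_replica(replicas, min_lsn=500) == "leader"
```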
cross-device read-after-write consistency — if your approach requires reading from the leader, you may first need to route requests from all of a user’s devices to the same datacenter.
moving backward in time — can happen if a user makes several reads from different replicas; quite likely if the user refreshes a web page.
monotonic reads — a guarantee that this kind of anomaly does not happen; a stronger guarantee than eventual consistency. If one user makes several reads in sequence, they will not see time go backward — they will not read older data after having previously read newer data.
- reads from the same replica — the replica can be chosen based on a hash of the user ID.
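A minimal sketch of hash-based replica pinning, assuming a fixed replica count: hashing the user ID makes every read from the same user land on the same replica, so successive reads cannot move backward in time (as long as that replica stays up).

```python
import hashlib

# Sketch of monotonic reads: pick the replica by hashing the user ID,
# so one user's reads are always served by the same replica.

def replica_for_user(user_id, num_replicas):
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    return int(digest, 16) % num_replicas

# The same user always lands on the same replica:
assert replica_for_user("alice", 3) == replica_for_user("alice", 3)
assert 0 <= replica_for_user("bob", 3) < 3
```

If that replica fails, the user's reads must be rerouted to another replica, which reintroduces the anomaly unless the new replica is at least as up to date.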
consistent prefix reads — if a sequence of writes happens in a certain order then anyone reading those writes will see them appear in the same order.
different partitions operate independently, so there is no global ordering of writes.
- You can make sure that any writes that are causally related to each other are written to the same partition. There are algorithms to keep track of causal dependencies — the "happens-before" relationship and concurrency.
Solutions for Replication Lag
transactions — a way for a database to provide stronger guarantees so that the application can be simpler.
distributed (replicated and partitioned) databases — abandoning transactions for performance and availability.
Leader-based replication — if the database is partitioned, each partition has one leader. Different partitions may have their leaders on different nodes, but each partition must nevertheless have one leader node.
It may happen that due to a network interruption between you and the leader, you can’t write to the database.
Allow more than one node to accept writes.
multi-leader configuration — master-master or active/active replication — each leader simultaneously acts as a follower to the other leaders. Clients send each write to one of several leader nodes. The leaders send streams of data change events to each other and to any follower nodes.
Use Cases for Multi-Leader Replication
It rarely makes sense to use a multi-leader setup within a single datacenter.
In a multi-leader configuration, you can have a leader in each datacenter. Each datacenter's leader replicates its changes to the leaders in other datacenters.
A multi-leader configuration with asynchronous replication
- performs better since every write can be processed in the local datacenter and replicated asynchronously to the other datacenters
- tolerates datacenter outages better since each datacenter operates independently of others.
- tolerates network problems better, since it is asynchronous.
- Tungsten Replicator for MySQL, BDR for PostgreSQL, GoldenGate for Oracle — external tools for multi-leader support.
- Autoincrementing keys, triggers, and other integrity constraints can be problematic, so multi-leader replication is often considered dangerous territory that should be avoided if possible.
Clients with offline operation
Such as calendar apps. Every device has a local database that acts as a leader (it accepts write requests), and there is an asynchronous multi-leader replication process (sync) between the replicas of your calendar on all of your devices. The replication lag may be hours or even days, depending on when you have internet access available.
CouchDB is designed for this kind of multi-leader operation.
Real-time collaborative editing applications — Google Docs.
Handling Write Conflicts
Synchronous versus asynchronous conflict detection
In a multi-leader setup, both writes are successful, and the conflict is only detected asynchronously at some later point in time. At that time, it may be too late to ask the user to resolve the conflict.
If you want synchronous conflict detection, you might as well just use single-leader replication.
Many implementations of multi-leader replication handle conflicts quite poorly, so avoiding conflicts is a frequently recommended approach.
In an application where a user can edit their own data, you can ensure that requests from a particular user are always routed to the same datacenter and use the leader in that datacenter for reading and writing. different users — different home datacenters — essentially single-leader from user’s point of view.
Converging toward a consistent state
In a multi-leader configuration, there is no defined ordering of writes, so it’s not clear what the final value should be.
- Give each write a unique ID (e.g., a timestamp, a long random number, a UUID, or a hash of the key and value), pick the write with the highest ID as the winner, and throw away the other writes. If a timestamp is used, this technique is known as last write wins (LWW).
- Give each replica a unique ID, and let writes that originated at a higher-numbered replica always take precedence over writes that originated at a lower-numbered replica.
- Merge the values together — e.g., order them alphabetically and then concatenate them; merged title such as “B/C”.
- Record the conflict in an explicit data structure that preserves all information, and write application code that resolves the conflict at some later time (perhaps by prompting the user).
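The first convergence strategy above — pick the write with the highest unique ID — can be sketched in a few lines. This is illustrative, not a specific database's implementation; note that the losing writes are silently discarded, which is the durability cost of LWW.

```python
# Sketch of "highest unique ID wins" (last write wins when the ID is a
# timestamp). Concurrent versions collapse to one winner; losers are
# discarded.

def resolve_lww(writes):
    """writes: list of (unique_id, value); highest ID wins."""
    return max(writes, key=lambda w: w[0])[1]

conflicting = [(1700000001, "B"), (1700000005, "C"), (1700000003, "A")]
assert resolve_lww(conflicting) == "C"
```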
Custom conflict resolution logic
Write conflict resolution logic using application code — executed on write or on read.
On write — a conflict handler (a snippet of code that executes quickly, no prompting) is called.
On read — all the conflicting writes are stored. The application may prompt the user or automatically resolve the conflict and write the result back to the database.
Conflict resolution usually applies at the level of an individual row or document, not for an entire transaction.
Automatic Conflict Resolution
Amazon — the conflict resolution logic preserves items added to the cart, but not items removed from the cart. Thus, customers would sometimes see items reappearing in their carts even though they had previously been removed.
Conflict-free replicated datatypes (CRDTs) use two-way merges.
Mergeable persistent data structures track history explicitly, similarly to the Git version control system, and use a three-way merge function.
Operational transformation algorithm behind collaborative editing applications — concurrently editing an ordered list of items, such as the list of characters.
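To make the CRDT idea concrete, here is a sketch of one of the simplest CRDTs, a grow-only counter (G-Counter) — an example chosen for these notes, not one the book walks through. Each replica increments only its own slot, and the two-way merge takes the element-wise maximum, so merging in any order converges to the same state without losing increments.

```python
# Sketch of a G-Counter CRDT with a two-way merge: per-replica counts,
# merge = element-wise max, value = sum of all counts.

class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}                      # replica_id -> count

    def increment(self):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + 1

    def merge(self, other):
        # Two-way merge: take the maximum count seen for each replica
        for rid, c in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), c)

    def value(self):
        return sum(self.counts.values())

a, b = GCounter("a"), GCounter("b")
a.increment(); a.increment()
b.increment()
a.merge(b); b.merge(a)                        # merge both ways
assert a.value() == b.value() == 3            # replicas converge
```

Merging is commutative, associative, and idempotent, which is what lets CRDTs converge automatically regardless of the order in which replicas exchange state.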
Multi-Leader Replication Topologies
A replication topology — communication paths along which writes are propagated from one node to another.
all-to-all topology — the most general topology
circular topology — each node receives writes from one node and forwards those writes (plus any writes of its own) to another node. MySQL supports only this topology by default.
star topology — one designated root node forwards writes to all of the other nodes; can be generalized to a tree.
- In circular and star topologies, to prevent infinite replication loops, each node is given a unique identifier, and in the replication log each write is tagged with the identifiers of all the nodes it has passed through. When a node receives a data change tagged with its own identifier, it ignores it, because it knows it has already been processed.
- A problem with circular and star topologies; if just one node fails, it can interrupt the flow of replication messages between other nodes.
- The all-to-all topology is better: messages travel along different paths, avoiding a single point of failure. It has issues too: some replication messages may "overtake" others, so leader 2 may receive writes in a different order than leader 1 — for example, an update to a row may arrive before the insert that created it. Version vectors can be used to order these events correctly.
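The loop-prevention rule for circular and star topologies can be sketched as follows (the dict shape of a "write" is an assumption of these notes): each write carries the set of node IDs it has passed through, and a node drops any change already tagged with its own ID.

```python
# Sketch of replication-loop prevention: tag each write with the IDs of
# nodes it has passed through; a node ignores writes bearing its own ID.

def forward(write, node_id):
    """Return the write to pass on, or None if it has looped back."""
    if node_id in write["seen_by"]:
        return None                       # already passed through here
    return dict(write, seen_by=write["seen_by"] | {node_id})

w = {"key": "x", "value": 1, "seen_by": {"n1"}}  # originated at n1
w = forward(w, "n2")                      # n2 applies and forwards it
assert w["seen_by"] == {"n1", "n2"}
assert forward(w, "n1") is None           # n1 ignores its own write
```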
A leader determines the order in which writes should be processed and followers apply the leader’s writes in the same order.
Some data storage systems take a different approach, abandoning the concept of a leader and allowing any replica to directly accept writes from clients. Leaderless replication — clients send each write to several nodes, and read from several nodes in parallel (to detect and correct stale data). Cassandra is a datastore with a leaderless replication model inspired by Dynamo (Amazon's in-house system), so this kind of database is also known as Dynamo-style.
In some implementations, the client directly sends its writes to several replicas. In others, a coordinator node does this on behalf of the client. The coordinator does not enforce a particular ordering of writes.
Cross-datacenter replication — seen above as a use case for multi-leader replication
Leaderless replication is also suitable for multi-center operation, since it is designed to tolerate conflicting concurrent writes, network interruptions, latency spikes.
Cassandra implements multi-datacenter support within the leaderless model: the number of replicas n includes nodes in all datacenters. Each write is sent to all replicas, but the client only waits for acknowledgment from a quorum of nodes within its local datacenter; replication to the other datacenters happens asynchronously.
Writing to the Database When a Node Is Down
In a leader-based configuration, you may need to perform failover.
In a leaderless configuration, failover does not exist. To handle stale (outdated) values, when a client reads from the database, it doesn't just send its request to one replica: read requests are sent to several nodes in parallel. Version numbers are used to determine which value is newer.
Read repair and anti-entropy
Read repair: The client sees that replica with a lower version value has a stale value and writes the newer value back to that replica.
Anti-entropy process: some datastores have a background process that constantly looks for differences between replicas and copies any missing data from one replica to another. Without it, durability is reduced, since values that are rarely read may be missing from some replicas.
Not all systems implement both of these.
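Read repair can be sketched as below. This is an illustrative model, not a real datastore's API: replicas are dicts mapping a key to a (version, value) pair; the read returns the value with the highest version and writes it back to any replica that returned a stale one.

```python
# Sketch of read repair: read from several replicas in parallel, take
# the highest-versioned value, and write it back to stale replicas.

def read_with_repair(replicas, key):
    responses = [r[key] for r in replicas]        # (version, value) pairs
    version, value = max(responses)               # highest version wins
    for r in replicas:                            # repair stale replicas
        if r[key][0] < version:
            r[key] = (version, value)
    return value

r1 = {"k": (7, "new")}
r2 = {"k": (6, "old")}                            # fell behind
assert read_with_repair([r1, r2], "k") == "new"
assert r2["k"] == (7, "new")                      # stale replica repaired
```

This works well for values that are frequently read; values that are rarely read rely on the anti-entropy process instead.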
Quorums for reading and writing
r and w values — quorum reads and writes; the minimum number of votes required for the read or write to be valid.
If there are n replicas, every write must be confirmed by w nodes to be considered successful, and we must query at least r nodes for each read.
strict quorum — w + r > n — we expect to get an up-to-date value when reading; at least one of the r replicas you read from must have seen the most recent successful write (some must overlap).
- w < n — still process writes if a node is unavailable
- r < n — still process reads if a node is unavailable
- n = 3, w = 2, r = 2 — tolerate one unavailable node
- n = 5, w = 3, r = 3 — tolerate two unavailable nodes
- reads and writes are always sent to all n replicas in parallel. w and r determine how many nodes we wait for.
n, r, w — configurable — common choice is to make n an odd number (3 or 5) and set w = r = (n+1) / 2 (rounded up).
A workload with few writes and many reads; w = n and r = 1 — reads are faster but one failed node causes all writes to fail.
If r and w are chosen to be a majority — more than n/2 — tolerates up to n/2 node failures.
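The quorum arithmetic above can be captured in a couple of helper functions (illustrative, with hypothetical names): the strict-quorum condition w + r > n guarantees at least one node overlaps between the write set and the read set, and the common majority choice is w = r = (n + 1) / 2 rounded up.

```python
# Sketch of the quorum condition: with n replicas, w write acks and r
# read responses overlap in at least one node whenever w + r > n.

def is_strict_quorum(n, w, r):
    return w + r > n

def majority_quorum(n):
    """Common choice: w = r = majority of n (n usually odd, 3 or 5)."""
    w = r = n // 2 + 1                 # (n + 1) / 2 rounded up
    return w, r

assert is_strict_quorum(n=3, w=2, r=2)      # tolerates 1 node down
assert is_strict_quorum(n=5, w=3, r=3)      # tolerates 2 nodes down
assert not is_strict_quorum(n=3, w=1, r=1)  # reads may miss new writes
assert majority_quorum(5) == (3, 3)
```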
- sloppy quorum — writes and reads still require w and r successful responses, but those may include nodes outside the designated n "home" nodes for a value, so the w writes may end up on different nodes than the r reads. This increases write availability. Sloppy quorums are optional in all common Dynamo implementations; in Cassandra they are disabled by default.
hinted handoff — once the network interruption is fixed, any writes that one node temporarily accepted on behalf of another node are sent to the appropriate “home” nodes.
Limitations of Quorum Consistency
even with w + r > n, there are likely to be edge cases where stale values are returned.
- With a sloppy quorum, there is no longer a guaranteed overlap between the r nodes and the w nodes; it is an assurance of durability, not a traditional quorum.
- merging the concurrent writes — writes can be lost due to clock skew if based on timestamp (last write wins)
- write happening concurrently with a read.
- If a write succeeds on fewer than w replicas and is not rolled back on the replicas where it succeeded, subsequent reads may or may not return the value from that write.
- a node carrying a new value fails; restored back from a stale replica; the number of replicas storing the new value may fall below w; quorum condition is broken.
w, r — it’s wise not to take them as absolute guarantees. Stronger guarantees generally require transactions or consensus.
For leader-based replication, the database typically exposes metrics for the replication lag, which you can feed into a monitoring system. This is possible because writes are applied to the leader and followers in the same order.
With leaderless replication, there is no fixed order in which writes are applied, which makes monitoring more difficult.
If only read repair (no anti-entropy) is used, there is no limit to how old a value might be.
Detecting Concurrent Writes
Events may arrive in a different order at different nodes. Eventual consistency.
Last write wins (discarding concurrent writes) — LWW
Concurrent — their order is undefined. Writes don't have a natural ordering, but we can force an arbitrary order on them — e.g., attach a timestamp to each write and apply last write wins (LWW). This is the only supported conflict resolution method in Cassandra. LWW achieves eventual convergence at the cost of durability: with several concurrent writes to the same key, even if all were reported as successful to w replicas, only one of the writes will survive. If losing data is not acceptable, LWW is a poor choice for conflict resolution. The only safe way to use LWW is to ensure that a key is written only once and is thereafter immutable. In Cassandra, using a UUID as the key is recommended to give each write a unique key.
The “happens-before” relationship and concurrency
How to decide whether two operations are concurrent or not?
Causally dependent operations (for example, incrementing a counter depends on having read the previous value).
Two operations are concurrent if neither happens before the other (neither knows about the other). They are both unaware of each other, regardless of the physical time they occurred.
The clients are never fully up to date with the data on the server, since there is always another operation going on concurrently. But old versions of the value do get overwritten eventually, and no writes are lost.
Two operations are determined as concurrent by looking at the version numbers.
- The server maintains a version number for every key, increments the version number every time that key is written.
- When a client reads a key, the server returns all values that have not been overwritten, as well as the latest version number. A client must read a key before writing.
- When a client writes a key, it must include the version number from the prior read and must merge together all values it received in the prior read. (The response from a write request can be like a read, returning all current values, which allows us to chain several writes like in the shopping cart example.)
- When the server receives a write with a particular version number, it can overwrite all values with that version number or below (since it knows that they have been merged into the new value), but it must keep all values with a higher version number (because those values are concurrent with the incoming write).
If you make a write without including a version number, it is concurrent with all other writes, so it will not overwrite anything — it will just be returned as one of the values on subsequent reads.
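The version-number algorithm above can be sketched for a single replica as follows. This is an illustrative model of the scheme the notes describe (the class and method names are assumptions): the server keeps (version, value) pairs, a write overwrites everything at or below the version it was based on, and keeps higher-versioned values as concurrent siblings.

```python
# Sketch of the single-replica version-number scheme: a write based on
# version v overwrites values with version <= v and keeps concurrent
# (higher-versioned) values as siblings.

class VersionedStore:
    def __init__(self):
        self.version = 0
        self.values = []                   # list of (version, value)

    def read(self):
        """Return the latest version number and all current values."""
        return self.version, [val for _, val in self.values]

    def write(self, value, based_on_version):
        self.version += 1
        # Values at or below based_on_version were merged into this
        # write, so overwrite them; higher versions are concurrent.
        self.values = [(ver, val) for ver, val in self.values
                       if ver > based_on_version]
        self.values.append((self.version, value))
        return self.version

store = VersionedStore()
store.write(["milk"], based_on_version=0)
store.write(["eggs"], based_on_version=0)      # concurrent: hadn't seen v1
version, siblings = store.read()
assert siblings == [["milk"], ["eggs"]]        # both kept as siblings
store.write(["milk", "eggs"], based_on_version=version)  # merged write
assert store.read()[1] == [["milk", "eggs"]]   # siblings collapsed
```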
Merging concurrently written values
Unfortunately requires that the clients do some extra work.
Concurrent values — siblings
A simple approach is to just pick one of the values based on a version number or timestamp (last write wins), but that implies losing data. Do something more intelligent in application code.
A simple way to merge siblings is to just take the union.
If you want to allow people to also remove things from their carts, and not just add things, then taking the union of siblings may not yield the right result.
The system must leave a marker with an appropriate version number to indicate that the item has been removed when merging siblings — deletion marker — tombstone.
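A minimal tombstone-aware merge might look like the sketch below. The sibling structure is an assumption of these notes; real systems attach version numbers to tombstones so that a later re-add can win over an earlier removal, which this simplified version does not handle.

```python
# Sketch of sibling merging with tombstones: a plain union would
# resurrect removed items, so deletions leave markers the merge respects.

def merge_siblings(siblings):
    """Each sibling is {'items': set, 'tombstones': set} (illustrative)."""
    items, tombstones = set(), set()
    for s in siblings:
        items |= s["items"]
        tombstones |= s["tombstones"]
    return {"items": items - tombstones, "tombstones": tombstones}

s1 = {"items": {"milk", "eggs"}, "tombstones": set()}
s2 = {"items": {"milk"}, "tombstones": {"eggs"}}   # eggs was removed
merged = merge_siblings([s1, s2])
assert merged["items"] == {"milk"}                 # eggs stays deleted
```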
Merging siblings in application code is complex and error-prone.
We need to use a version number per replica as well as per key.
The collection of version numbers from all the replicas is called a version vector. a variant — the dotted version vector.
The version vector allows the database to distinguish between overwrites and concurrent writes.