System Design Interview — Study Notes IX— Designing Data Intensive Applications Book Notes

Nil Seri
20 min readNov 5, 2023

Notes on Concepts & Components to be used in a System Design Interview

https://www.kobo.com/en/ebook/designing-data-intensive-applications

Chapter 7 — Transactions

Transaction — a way for an application to group several reads and writes together into a logical unit. Either the entire transaction succeeds (commit) or it fails (abort, rollback). If it fails, the application can safely retry — no need to worry about partial failure.
Safety guarantees — the database takes care of error scenarios and concurrency issues rather than the application; transactions are an abstraction layer.
Isolation levels — read committed, snapshot isolation, serializability.

The Slippery Concept of a Transaction

The transaction support in MySQL, PostgreSQL, Oracle, SQL Server.
Nonrelational (NoSQL) databases — replication and partitioning; abandoning transactions entirely, for higher performance or higher availability.
Transactions — the antithesis of scalability, good performance and high availability.
Transactional guarantees — for serious applications with valuable data.

The Meaning of ACID

Safety guarantees provided by transactions — ACID (Atomicity, Consistency, Isolation, Durability).
Systems that do not meet the ACID criteria — BASE (Basically Available, Soft state, Eventual consistency).
Atomicity:
cannot be broken down into smaller parts — atomic.
ACID’s atomicity describes what happens if a client wants to make several writes, but a fault occurs after some of the writes have been processed.
abortability — if the writes are grouped together into an atomic transaction, and the transaction cannot be completed (committed) due to a fault, then the transaction is aborted and the database must discard or undo any writes it has made so far in that transaction. With atomicity, it can be safely retried.
Consistency:
At least 4 different meanings;
- replica consistency, eventual consistency in asynchronously replicated systems.
- consistent hashing — an approach to partitioning for rebalancing.
- CAP theorem consistency — linearizability.
- ACID consistency — database being in a “good state”.
ACID consistency — certain statements about your data (invariants) in your application; in an accounting system, credits and debits across all accounts must always be balanced. It is the application’s responsibility to define its correctly to preserve consistency — not something that the database can guarantee. Using foreign key constraints or uniqueness constraints may be helpful.
Atomicity, isolation and durability are properties of the database, whereas consistency in the ACID sense is a property of the application. The application may rely on the database’s atomicity and isolation properties in order to achieve consistency, but it’s not up to the database alone.
Isolation:
Most databases are accessed by several clients at the same time. Accessing the same database records — concurrency problems (race conditions).
Simultaneously incrementing a counter that is stored in a database.
Isolation in the sense of ACID means that concurrently executing transactions are isolated from each other: they cannot step on each other’s toes.
Serializability — each transaction can pretend that it is the only transaction running on the entire database. When the transactions have committed, the result is the same as if they had run serially (one after another), even though in reality they may have run concurrently.
In practice, serializable isolation is rarely used because it carries a performance penalty. Some popular databases (Oracle 11g) don’t even implement it. Oracle’s isolation level is called “serializable” — actually snapshot isolation.
Durability:
Durability — the promise that any data that has been written will not be forgotten; provides a safer place where data can be stored without fear of losing it.
In a single-node database, it means data being written to nonvolatile storage such as a hard drive or SSD; involving a write-ahead log or similar (B-trees).
In a replicated database, it means data being successfully copied to some number of nodes. A database must wait until these writes or replications are complete before reporting a transaction as successfully committed.
Perfect durability does not exist.

Single-Object and Multi-Object Operations

atomicity and isolation — describe what the database should do if a client makes several writes within the same transaction.
Atomicity — all-or-nothing guarantee.
Isolation — concurrently running transactions shouldn’t interfere with each other. If one transaction makes several writes, then another transaction should see either all or none of those writes.
multi-object transaction are needed if several pieces of data need to be kept in sync.
Violating isolation; one transaction needs another transaction’s uncommitted writes — dirty read.
In relational databases, everything between a BEGIN TRANSACTION and a COMMIT statement is considered to be part of the same transaction, based on the client’s TCP connection to the database server. If the TCP connection is interrupted, the transaction must be aborted. A transaction manager can group operations by a unique transaction identifier that is not bound to a particular TCP connection. Many nonrelational databases don’t have such a way of grouping operations together.
Single-object writes
Atomicity and isolation also apply when a single object is being changed.
- If the network connection is interrupted after the first 10 KB have been sent, does the database store that unparseable 10 KB fragment of JSON?
- Overwriting the previous value on disk and the power fails?
- If another client reads that document while the write is in progress?
Storage engines almost universally aim to provide atomicity and isolation on the level of a single object (such as a key-value pair) on one node. Atomicity can be implemented using a log for crash recovery and isolation can be implemented using a lock on each object (allowing only one thread to access an object at one time).
Some databases also provide more complex atomic operations (in the sense of multi-threaded programming) such as an increment operation, which removes the need for a read-modify-write cycle.
Compare-and-set and other single-object operations are dubbed as “light-weight transactions” or even “ACID” but a transaction is a mechanism for grouping multiple operations on multiple objects into one unit of execution.
The need for multi-object transactions
Many distributed datastores have abandoned multi-object transactions; they get in the way where very high availability or performance is required.
- In a relational data model; when inserting several records that refer to one another, the foreign keys have to be correct and up to date.
- In a document data model; the fields that need to be updated together are often within the same document, which is treated as a single object. Document databases lack joining functionality. When denormalized information needs to be updated, you need to update several documents in one go. Denormalized data can easily go out of sync with the source data.
- In databases with secondary indexes, different database objects from a transaction point of view, it’s possible for a record to appear in one index but not another, because the update to the second index hasn’t happened yet.
Handling errors and aborts
A transaction can be aborted and safely retried. ACID databases are based on this philosophy.
Datastores with leaderless replication — “best effort” basis. It’s the application’s responsibility to recover from errors.
Popular object-relational mapping (ORM) frameworks don’t retry aborted transactions, although the whole point of aborts is to enable safe retries.
- retrying the transaction causes it to be performed twice — deduplication mechanism is required.
- due to overload, retrying makes it worse. Limit the number of retries, use exponential backoff and handle overload-related errors differently.
- if the transaction also has side effects outside of the database and you want to make sure several different systems either commit or abort together — two-phase commit.
- if the client process fails while retrying, data is lost.

Weak (Nonserializable) Isolation Levels

Concurrency issues (race conditions) only come to play when one transaction reads data that is concurrently modified by another transaction, or when two transactions try to simultaneously modify the same data.
Databases provide transaction isolation. Serializable isolation means that the database guarantees that transactions have the same effect as if they ran serially. Serializable isolation has a performance cost.
“Use an ACID database if you’re handling financial data!” — but even many popular relational database systems use weak isolation.

Read Committed

- When reading from the database, you will only see data that has been committed (no dirty reads).
- When writing to the database, you will only overwrite data that has been committed (no dirty writes).
Some databases support an even weaker isolation level called read uncommitted — it prevents dirty writes, but does not prevent dirty reads.
No dirty reads
Dirty read
— Another transaction sees that uncommitted data.
Seeing the database in a partially updated state is confusing to users and may cause other transactions to take incorrect decisions.
If a transaction aborts, any writes it has made need to be rolled back — a transaction may see data that is later rolled back.
The read committed isolation level and stronger levels prevent dirty reads.
No dirty writes
Dirty write — If the earlier write is part of a transaction that has not yet committed, so the later write overwrites an uncommitted value.
Delay the second write until the first write’s transaction has committed or aborted.
Read committed does not prevent the race condition between two counter increments — it is not a dirty write but a lost update.
Almost all transaction implementations prevent dirty writes.
Implementing read committed
The default setting in Oracle 11g, PostgreSQL, SQL Server 2012 and many databases.
To prevent dirty writes; row-level locks: (row or document). It must first acquire a lock on that object. Hold that lock until the transaction is committed or aborted. Another transaction must wait. This locking is done automatically by databases in read committed mode (or stronger isolation levels).
To prevent dirty reads; use the same lock, and to require any transaction that wants to read an object to briefly acquires the lock and then releases it again immediately after reading.
One long-running write transaction can force many read-only transactions to wait until the long-running transaction has completed — harms the response time. That’s why, for every object that is written, the database remembers both the old committed value and the new value set by the transaction that currently holds the write lock. Only when the new value is committed, transactions switch over to reading the new value.

Snapshot Isolation and Repeatable Data

A nonrepeatable read — read skew is considered acceptable under read committed isolation.
Backups — During the time that the backup process is running writes will continue to be made to the database; inconsistencies.
Snapshot isolation — each transaction reads from a consistent snapshot of the database; boon for long-running, read-only queries such as backups and analytics. Supported by PostgreSQL, MySQL, Oracle, SQL Server, etc.
Implementing snapshot isolation
Typically use write locks to prevent dirty writes. Reads no not require any locks. Snapshot isolation’s key principle — readers never block writers, and writers never block readers, allowing long-running read queries on a consistent snapshot.
The database must potentially keep several different committed versions of an object, because various in-progress transactions may need to see the state of the database at different points in time — multi-version concurrency control (MVCC).
To provide only read-committed isolation (not snapshot isolation); it is sufficient to keep two versions of an object: the committed version and the overwritten-but-not-yet-committed version. read-committed uses a separate snapshot for each query, while snapshot isolation uses the same snapshot for an entire transaction.
MVVC implemented in PostgreSQL:
- When a transaction is started, it is given a unique, always-increasing transaction ID (txid).
- Whenever a transaction writes anything to the database, the data it writes is tagged with the transaction ID of the writer. Each row has a created_by field — the ID of the transaction that caused the insertion.
- Each row also has a deleted_by field — initially empty, marked for deletion. At some later time, when it is certain that no transaction can any longer access the deleted data, a garbage collection process in the database removes any rows marked for deletion and frees their space.
- An update is internally translated into into a delete and a create.
Visibility rules for observing a consistent snapshot
- At the start of each transaction, the database makes a list of all the other transactions in progress. Any writes that those transactions have made are ignored.
- Any writes with a later transaction ID are ignored.
- All other writes are visible to the application’s queries.
An object is visible if:
- At the time when the reader’s transaction started, the transaction that created the object had already committed.
- The object is not marked for deletion, or if it is, the transaction that required deletion had not yet committed at the time when the reader’s transaction started.
By never updating values in place but instead creating a new version every time a value is changed, the database can provide a consistent snapshot.
Indexes and snapshot isolation — How do indexes work in a multi-version database?
- For inserts: Have the index simply point to all versions of an object and require an index query to filter out any object versions that are not visible to the current transaction.
- For deletes: When garbage collection removes old object versions that are no longer visible to any transaction, the corresponding index entries can also be removed.
- For updates:
PostgreSQL has optimizations for avoiding index updates if different versions of the same object can fit on the same page.
In another approach — Although B-trees are used, an append-only / copy-on-write variant is used — that does not overwrite pages of the tree when they are updated, but instead creates a new copy of each modified page. Parent pages, up to the root of the tree, are copied and updated to point to the new versions of their pages. With append-only B-trees, every write transaction (or batch of transactions) creates a new B-tree root. A particular root is a consistent snapshot of the database at the point in time when it was created — no need to filter out objects based on transaction IDs. This requires a background process for compaction and garbage collection.
Repeatable read and naming confusion
Snapshot isolation is useful for read-only transactions. In Oracle, it is called serializable; in PostgreSQL and MySQL it is called repeatable read.

Preventing Lost Updates

lost update problem: a read-modify-write cycle; the second concurrent write does not include the first modification.
- Incrementing a counter or updating an account balance.
- Making a local change to a complex value.
- Two users editing a wiki page at the same time.
Atomic write operations
Many databases provide atomic update — which remove the need to implement read-modify-write cycles in the application code.
MongoDB provide atomic operations for making local modifications to a part of a JSON document.
Redis provides atomic operations for modifying data structures such as priority queues.
Atomic operations are usually implemented by taking an exclusive lock on the object when it is read so that no other transaction can read it until the update has been applied — cursor stability. Another option is to simply force all atomic operations to be executed on a single thread.
Explicit locking
The application explicitly locks objects that are going to be updated (SELECT FOR UPDATE clause) — a manual lock.
Auto detecting lost updates
If the transaction manager detects a lost update, abort the transaction and force it to retry its read-modify-write cycle. This is less error-prone.
PostgreSQL — repeatable read, Oracle — serializable, SQL Server — snapshot isolation levels automatically detect when a lost update has occurred and abort the offending transaction. MySQL / InnoDB’s repeatable read does not detect lost updates.
Compare-and-set
In databases that don’t provide transactions, you sometimes find an atomic compare-and-set operation.
To avoid lost updates by allowing an update to happen only if the value has not changed since you last read it.
If no match, the read-modify-write cycle must be retried.
Check whether your database’s compare-and-set operation is safe before relying on it.
Conflict resolution and replication
Multi-leader or leaderless replication usually allows several writes to happen concurrently and replicate them asynchronously, so they cannot guarantee that there is a single up-to-date copy of the data. Thus, techniques based on locks or compare-and-set do not apply in this context.
Replicated databases allow concurrent writes to create several conflicting versions of a value — siblings. To use application code or special data structures to resolve and merge these versions after the fact.
Atomic operations can work well in a replicated context, especially if they are commutative — you can apply them in a different order on different replicas, and still get the same result; incrementing a counter or adding an element to a set.
Last write wins (LWW) is prone to lost updates.

Write Skew and Phantoms

Write skew — A transaction reads something, makes a decision based on the value it saw. By the time the write is made, the premise of the decision is no longer true. It is neither a dirty write nor a lost update because the two transactions are updating two different objects — it is definitely a race condition.
It is a generalization of the lost update problem; it can occur if two transactions read the same objects, and then update some of those objects (different transactions may update different objects). In the special case where different transactions update the same object, you get a dirty write or lost update anomaly (depending on the timing).
- Snapshot isolation doesn’t help either: write skew is not automatically detected in PostgreSQL’s repeatable read, MySQL/InnoDB’s repeatable read, Oracle’s serializable, or SQL Server’s snapshot isolation level. Automatic prevention requires true serializable isolation.
- Databases’ constraints — uniqueness, foreign key constraints, restrictions on a particular value. For uniqueness that involves multiple objects — you would need a different type of constraint. Not build in support in databases but you may use triggers or materialized views.
- Explicitly lock the rows that the transaction depends on (FOR UPDATE clause).
More examples of write skew
- Meeting room booking system — you first check for any conflicting bookings (for the same room with an overlapping time range). If none are found, you create the meeting. In PostgreSQL, you can do this more elegantly using range types.
- Multiplayer game — two users trying to move the same figure at the same time; depending on the kind of rule, you may use a unique constraint.
- Claiming a username — you may use a transaction to check whether a name is taken and, if not, create an account with that name. A unique constraint is a simple solution here.
- Preventing double-spending — you may insert a tentative spending item into a user’s account, listing all the items in the account and checking the sum is positive.
Phantoms causing write skew
Checking for the absence of rows matching some search condition; and the write adds a row matching the same condition. SELECT FOR UPDATE can’t attach locks to anything.
phantom — the effect where a write in one transaction changes the result of a search query in another transaction. Snapshot isolation avoids phantoms in read-only queries, but in read-write transactions like the examples we discussed, phantoms can lead to particular tricky cases of write skew.
Materializing conflicts
If the problem of phantoms is that there is no object to which we can attach the locks, perhaps we can artificially introduce a lock object into the database? You may create rows for all possible combinations of rooms and time periods ahead of time, e.g. for the next six months. This approach is called materializing conflicts — takes a phantom and turns it into a lock conflict on a concrete set of rows that exist in the database. This can be hard and error-prone. A serializable isolation level is much preferable in most cases.

Serializability

Serializable isolation is usually regarded as the strongest isolation level. It guarantees that even though transactions may execute in parallel, the end result is the same as if they had executed one at a time, serially, without any concurrency.
Most databases that provide serializability today use one of three techniques:
- Literally executing transactions in a serial order.
- Two-phase locking (2PL)
- Optimistic concurrency control techniques such as serializable snapshot isolation (SSI).
Actual Serial Execution
Execute only one transaction at a time, in serial order, on a single thread. It avoids the coordination overhead of locking but the throughput is limited to that of a single CPU core. This approach is implemented in Redis.
- RAM became cheap enough to keep the entire active database in memory; transactions execute much faster.
anti-caching — If a transaction needs to access data that’s not in memory, the best solution may be to abort the transaction, asynchronously fetch the data into memory while continuing to process other transactions, and then restart the transaction when the data has been loaded.
- Database designers realized OLTP transactions are short and make a small number of reads and writes; they can be run on a consistent snapshot (using snapshot isolation) outside of the serial execution loop.
Encapsulating transactions in stored procedures
Almost all OLTP applications keep transactions short by avoiding interactively waiting for a user within a transaction. On the web, this means that a transaction is committed within the same HTTP request — a transaction does not span multiple requests.
An application makes a query, reads the result, perhaps makes another query depending on the result of the first query, and so on.
The database would spend most of its time waiting for the application to issue the next query for the current transaction. It’s necessary to process multiple transactions concurrently in order to get reasonable performance.
Systems with single-threaded serial transaction processing don’t allow interactive multi-statement transactions. Instead, the application must submit the entire transaction code to the database ahead of time, as a stored procedure. Provided that all data required by a transaction is in memory, the stored procedure can execute very fast.
Oracle has PL/SQL, PostgreSQL has PL/pgSQL. Modern implementations of stored procedures have abandoned PL/SQL and use existing general-purpose programming languages; Redis uses Lua.
With stored procedures and in-memory data, executing all transactions on a single thread becomes feasible.
Partitioning
Each partition can have its own transaction processing thread running independently from the others. You can give each CPU core its own partition.
For any transaction that needs to access multiple partitions, the database must coordinate. The stored procedure needs to be performed in lock-step across all partitions to ensure serializability across the whole system.
Cross-partition transactions have additional coordination overhead and are vastly slower than single-partition transactions.
Whether transactions can be single-partition depends very much on the structure of the data used by the application. Data with multiple secondary indexes is likely to require a lot of cross-partition coordination.

Two-Phase Locking (2PL)

Two-phase locking (2PL) and two-phase commit (2PC) are completely different things. 2PL is sometimes called strong strict two-phase locking (SS2PL).
Several transactions are allowed to concurrently read the same object as long as nobody is writing to it. But as soon as anyone wants to write (modify or delete) an object, exclusive access is required.
In 2PL, writers don’t just block other writers; they also block readers and vice versa.
Snapshot isolation — readers never block writers, and writers never block readers. This captures the difference between snapshot isolation and two-phase locking.
2PL provides serializability — it protects against all the race conditions, including lost updates and write skew.
Implementation of two-phase locking
2PL is used by the serializable isolation level in MySQL (InnoDB)
The lock can either be in shared mode or in exclusive mode.
- To read an object, acquire the lock in shared mode. Several transactions are allowed to hold the lock in shared mode simultaneously, but if another transaction already has an exclusive lock on the object, these transactions must wait.
- To write to an object, acquire the lock in exclusive mode.
- If a transaction first reads and then writes an object, it may upgrade its shared lock to an exclusive lock.
- Continue to hold the lock until the end of the transaction (commit or abort) — two phase commit: the first phase (while the transaction is executing) is when the locks are acquired, the second phase (at the end of the transaction) is when all the locks are released.
Deadlocks may happen. The database automatically detects deadlocks between transactions and aborts one of them so that the others can make progress. The aborted transaction needs to be retried by the application.
Performance of two-phase locking
Performance is worse under two-phase locking than under weak isolation.
Overhead of acquiring and releasing those locks and mostly due to reduced concurrency.
Traditional relational databases don’t limit the duration of a transaction, because they are designed for interactive applications.
Predicate locks
Phantoms causing write skew — phantoms: one transaction changing the results of another transaction’s search query.
predicate lock: works similarly to the shared/exclusive lock but rather than belonging to a particular object (e.g. one row in a table), it belongs to all objects that match some search condition.
- TransactionA wants to read objects matching some condition, acquire a shared-mode predicate lock on the conditions of the query.
- TransactionB currently has an exlusive lock on any objects matching those conditions, TransactionA must wait until TransactionB releases its lock.
- TransactionA wants to insert, update, or delete any object, first checks whether either the old or the new value matches any existing predicate lock.
- TransactionB holds a matching predicate lock, TransactionA must wait.
A predicate lock applies even to objects that do not yet exist in the database, but which might be added in the future (phantoms). If two-phase locking includes predicate locks, the database prevents all forms of write skew and other race conditions, and so its isolation becomes serializable.
Index-range locks
Predicate locks do not perform well; checking for matching locks becomes time-consuming. Most databases with 2PL actually implement index-range locking (next-key locking).
You can approximate it by;
- locking bookings for room 123 at any time
-
locking all rooms (not just room 123) between noon and 1 p.m.
An approximation of the search condition is attached to one of the indexes. When it encounters the shared lock, it will be forced to wait.
They may lock a bigger range of objects than is strictly necessary to maintain serializability, but since they have much lower overheads, they are a good compromise.
If there is no suitable index where a range lock can be attached, the database can fall back to a shared lock on the entire table — this will not be good for performance.

Serializable Snapshot Isolation (SSI)

SSI is used both in single-node databases (the serializable isolation level in PostgreSQL) and distributed databases.
Pessimistic versus optimistic concurrency control
Two-phase locking — a pessimistic concurrency control mechanism. It is like mutual exclusion which is used to protect data structures in multi-threaded programming.
Serializable snapshot isolation — an optimistic concurrency control technique. Instead of blocking if something potentially dangerous happens, transactions continue anyway. When a transaction wants to commit, the database checks whether anything bad happened (whether isolation was violated); if so, the transaction is aborted and has to be retried.
The additional transaction load from retried transactions can make performance worse.
Commutative atomic operations — if several transactions concurrently want to increment a counter, it doesn’t matter in which order the increments are applied (as long as the counter is not read in the same direction), so the concurrent increments can all be applied without conflicting.
SSI is based on snapshot isolation — all reads within a transaction are made from a consistent snapshot of the database; this is the main difference. On top of snapshot isolation, SSI adds an algorithm for detecting serializable conflicts among writes and determining which transactions to abort.
Decisions based on an outdated premise
Under snapshot isolation, the result from the original query may no longer be up-to-date by the time the transaction commits, because the data may have been modified in the meantime.
premise — a fact that was true at the beginning of the transaction. when the transaction want to commit, the premise may no longer be true (the data may have changed).
- Detecting reads of a stale MVCC object version (uncommitted write occurred before the read)
- Detecting writes that affect prior reads (the write occurs after the read)
Detecting stale MVCC reads
Snapshot isolation is usually implemented by multi-version concurrency control (MVCC). When a transaction reads from a consistent snapshot that hadn’t yet committed at the same time when the snapshot was taken. When the transaction wants to commit, the database checks whether any of the ignored writes have now been committed. If so, the transaction must be aborted.
Why wait until committing? By avoiding unnecessary aborts, SSI preserves snapshot isolation’s support for long-running reads from a consistent snapshot.
We can use a technique similar to index-range locks here, except that SSI locks don’t block other transactions.
Performance of serializable snapshot isolation
In some cases, it’s okay for a transaction to read information that was overwritten by another transaction. PostgreSQL uses this theory to reduce the number of unnecessary aborts.
The detection of serialization conflicts can be distributed across multiple machines.
SSI requires that read-write transactions be fairly short. However, SSI is probably less sensitive to slow transactions than two-phase locking or serial execution.

Happy Coding!

References:
-
Designing Data-Intensive Applications
The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
by Martin Kleppmann

--

--

Nil Seri

I would love to change the world, but they won’t give me the source code | coding 👩🏻‍💻 | coffee ☕️ | jazz 🎷 | anime 🐲 | books 📚 | drawing 🎨