System Design Interview — Study Notes IV — Designing Data-Intensive Applications Book Notes
Notes on Concepts & Components to be used in a System Design Interview
Chapter 3 — Storage and Retrieval
Two families of storage engines will be examined: log-structured storage engines and page-oriented storage engines such as B-trees.
Data Structures That Power Your Database
Imagine the simplest possible key-value store: a text file where each line contains a key-value pair separated by a comma. Every call to “set” appends to the end of the file, so if you update a key several times the old versions of the value are not overwritten; you need to look at the last occurrence of the key in the file to find the latest value. Many databases internally use such a log, which is an append-only data file.
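A minimal sketch of such an append-only store in Python (the file name and function names are illustrative, mirroring the book's tiny shell example, not a real database):

```python
# Append-only, comma-separated key-value log (illustrative sketch).
DB_FILE = "database.log"  # hypothetical file name

def db_set(key, value):
    # every write appends; old versions of the key are never overwritten
    with open(DB_FILE, "a") as f:
        f.write(f"{key},{value}\n")

def db_get(key):
    # scan the whole file; the last occurrence of the key wins
    result = None
    try:
        with open(DB_FILE) as f:
            for line in f:
                k, _, v = line.rstrip("\n").partition(",")
                if k == key:
                    result = v
    except FileNotFoundError:
        pass
    return result

db_set("42", "San Francisco")
print(db_get("42"))
```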
To efficiently find the value for a particular key in the database, we need an index: additional metadata kept on the side that acts as a signpost and helps you locate the data you want.
An index is an additional structure derived from the primary data; it only affects the performance of queries, not the contents of the database. Maintaining additional structures incurs overhead, especially on writes, because the index also needs to be updated every time data is written. This is an important trade-off in storage systems: well-chosen indexes speed up read queries, but every index slows down writes.
Hash Indexes
Key-value stores are quite similar to the dictionary type that you can find in most programming languages, and which is usually implemented as a hash map (hash table).
The simplest possible indexing strategy: an in-memory hash map where every key is mapped to a byte offset in the data file.
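A hedged sketch of that strategy, extending the append-only log with an in-memory hash map from key to byte offset (class and file names are made up for illustration):

```python
import os

class HashIndexedLog:
    """Append-only log plus an in-memory hash index (sketch)."""

    def __init__(self, path="segment.log"):
        self.index = {}               # key -> byte offset of latest record
        self.f = open(path, "a+b")    # append-only data file

    def set(self, key, value):
        offset = self.f.seek(0, os.SEEK_END)   # record will start here
        self.f.write(f"{key},{value}\n".encode())
        self.f.flush()
        self.index[key] = offset               # index updated on every write

    def get(self, key):
        if key not in self.index:
            return None
        self.f.seek(self.index[key])           # one seek, one read
        line = self.f.readline().decode()
        _, _, value = line.rstrip("\n").partition(",")
        return value
```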
This works well when the value for each key is updated frequently and there are not too many distinct keys, i.e. there is a large number of writes per key. To avoid running out of disk space, break the log into segments of a certain size by closing a segment file when it reaches that size and directing subsequent writes to a new segment. We can then perform compaction on these segments. Compaction means throwing away duplicate keys in the log and keeping only the most recent update for each key. We can also merge several segments together. Segments are never modified after they have been written, so the merged segment is written to a new file by a background thread. Read requests are then switched to the new merged segment, and the old segment files can simply be deleted.
Each segment has its own in-memory hash table, mapping keys to file offsets. To look up a key, check the most recent segment's hash map first; if the key is not present, check the next older segment, and so on. Because of the merging process, lookups don't need to check many hash maps.
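A simplified sketch of compaction and merging, with each segment represented as an in-memory list of (key, value) records in write order (a real engine works on files and does this in a background thread):

```python
def compact_and_merge(segments):
    """Merge several segments into one compacted segment.

    `segments` is ordered oldest first; later writes override earlier
    ones, so replaying everything into a dict keeps only the most
    recent value for each key.
    """
    latest = {}
    for segment in segments:
        for key, value in segment:
            latest[key] = value
    return list(latest.items())   # written out as a new segment file

old = [("cat", "1"), ("dog", "2"), ("cat", "3")]
new = [("dog", "4")]
print(compact_and_merge([old, new]))   # [('cat', '3'), ('dog', '4')]
```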
File format: Use a binary format that first encodes the length of a string in bytes, followed by the raw string (no need for escaping).
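A sketch of such a length-prefixed encoding using Python's struct module (the exact layout here, two 4-byte big-endian lengths, is an assumption for illustration):

```python
import struct

def encode_record(key: bytes, value: bytes) -> bytes:
    # lengths first, then the raw bytes; no escaping of commas or newlines
    return struct.pack(">II", len(key), len(value)) + key + value

def decode_record(buf: bytes, offset: int = 0):
    key_len, val_len = struct.unpack_from(">II", buf, offset)
    start = offset + 8
    key = buf[start:start + key_len]
    value = buf[start + key_len:start + key_len + val_len]
    return key, value, start + key_len + val_len   # offset of next record

rec = encode_record(b"name", b"Ada, Lovelace")
print(decode_record(rec))   # (b'name', b'Ada, Lovelace', 25)
```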
Deleting records: Append a special deletion record to the data file (a tombstone). During the merging process, the tombstone tells the merge to discard any previous values for the deleted key.
Crash recovery: After a restart, the in-memory hash maps are lost. You can restore each segment's hash map by reading the entire segment file from beginning to end, but storing a snapshot of each segment's hash map on disk lets it be loaded into memory more quickly.
Partially written records: The database may crash halfway through appending a record to the log. Include checksums so that such corrupted parts of the log can be detected and ignored.
Concurrency control: Writes are in a strictly sequential order — one writer thread is a common choice. Segment files are append-only and immutable — reading concurrently by multiple threads is possible.
The append-only design turns out to be good: sequential appends are much faster than random writes (especially on magnetic spinning-disk hard drives), and concurrency and crash recovery are much simpler.
The hash table index has limitations: the hash table must fit in memory, and range queries are not efficient. An indexing structure without those limitations: SSTables and LSM-Trees.
SSTables and LSM-Trees
In the log-structured segments above, each segment is a sequence of key-value pairs in the order they were written, and the order of the pairs does not matter. We now add a requirement: the sequence of key-value pairs must be sorted by key.
Sorted String Table (SSTable) — has several advantages over log segments with hash indexes:
- Merging segments is simple and efficient — mergesort algorithm. When multiple segments contain the same key, we can keep the value from the most recent segment and discard the values in older segments.
- Because of the sorting, you no longer need to keep an index of all the keys in memory. The index can be sparse: one key for every few kilobytes of segment file is sufficient (see the lookup sketch after this list).
- Records between two entries of the sparse index can be grouped into a block and compressed before being written to disk. Each entry of the sparse in-memory index then points at the start of a compressed block.
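A hedged sketch of a lookup through a sparse in-memory index over a sorted segment, with a Python list standing in for the on-disk SSTable file:

```python
import bisect

# Sorted segment standing in for an SSTable on disk.
segment = [("apple", "1"), ("banana", "2"), ("cherry", "3"),
           ("grape", "4"), ("mango", "5"), ("peach", "6")]

# Sparse index: only some keys, mapping to positions in the segment
# (a real engine would store byte offsets into the file).
sparse_index = [("apple", 0), ("grape", 3)]

def sstable_get(key):
    keys = [k for k, _ in sparse_index]
    i = bisect.bisect_right(keys, key) - 1     # last indexed key <= key
    if i < 0:
        return None
    start = sparse_index[i][1]
    for k, v in segment[start:]:               # short scan within one "block"
        if k == key:
            return v
        if k > key:
            return None
    return None

print(sstable_get("mango"))   # '5'
print(sstable_get("kiwi"))    # None
```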
Constructing and maintaining SSTables
Maintaining a sorted structure on disk is possible (B-Trees), but maintaining it in memory is much easier (red-black trees or AVL trees — you can insert keys in any order and read them back in sorted order).
- A write comes in; add it to an in-memory balanced tree data structure (a red-black tree). This in-memory tree is called a memtable.
- When the memtable gets bigger than some threshold (a few MBs), write it out to disk as an SSTable file — it already maintains the key-value pairs sorted by key. While the SSTable is being written out to disk, writes can continue to a new memtable instance.
- To serve a read request — find the key in the memtable, then in the most recent on-disk segment, then in the next older segment, etc.
- Run a merging and compaction process in the background.
If the server crashes, the most recent writes (the ones in the memtable but not yet written out to disk) are lost. To avoid this, keep a separate log on disk to which every write is immediately appended, just as before. The log is not in sorted order; its only purpose is to restore the memtable after a crash. Every time the memtable is written out to an SSTable, the corresponding log can be discarded.
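A minimal sketch of this write and read path, using a plain Python dict as the memtable and sorting only when flushing (a real engine uses a balanced tree, an on-disk file format, and a write-ahead log):

```python
MEMTABLE_LIMIT = 4   # illustrative threshold; real engines flush at a few MB

class TinyLSM:
    def __init__(self):
        self.memtable = {}    # recent writes, held in memory
        self.segments = []    # flushed SSTables (sorted lists), newest last

    def set(self, key, value):
        # a real engine would also append the write to a log on disk here
        self.memtable[key] = value
        if len(self.memtable) >= MEMTABLE_LIMIT:
            self._flush()

    def _flush(self):
        self.segments.append(sorted(self.memtable.items()))  # sorted by key
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:                   # memtable first
            return self.memtable[key]
        for sstable in reversed(self.segments):    # then newest segment first
            for k, v in sstable:
                if k == key:
                    return v
        return None
```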
Making an LSM-tree out of SSTables
Similar storage engines are used in Cassandra, inspired by Google’s Bigtable paper (which introduced the terms SSTable and memtable).
This indexing structure is known as a Log-Structured Merge-Tree (LSM-tree), building on earlier work on log-structured filesystems; storage engines based on this principle of merging and compacting sorted files are called LSM storage engines.
Lucene, an indexing engine for full-text search used by ElasticSearch and Solr, uses a similar method for storing its term dictionary. For a full-text index, the key is a word (a term) and the value is the list of IDs of all the documents that contain the word (the postings list). In Lucene, this mapping is kept in SSTable-like sorted files.
Performance optimizations
The LSM-tree algorithm can be slow when looking up keys that do not exist in the database: you have to check the memtable and every segment before concluding the key is absent. To optimize this kind of access, storage engines often use Bloom filters. A Bloom filter is a memory-efficient data structure for approximating the contents of a set; it can tell you that a key definitely does not appear in the database, saving unnecessary disk reads.
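A toy Bloom filter sketch (sizes and the hashing scheme are arbitrary choices for illustration; real engines tune the bit-array size and number of hash functions to a target false-positive rate):

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False: the key is definitely absent, so skip the segment lookups.
        # True: the key may be present (false positives are possible).
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("mango")
print(bf.might_contain("mango"))   # True
print(bf.might_contain("kiwi"))    # almost certainly False
```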
Size-tiered and leveled compaction — Strategies to determine the order and timing of how SSTables are compacted and merged. Cassandra supports both.
In size-tiered compaction, newer and smaller SSTables are successively merged into older and larger SSTables.
In leveled compaction, the key range is split up into smaller SSTables and older data is moved into separate “levels”, which allows the compaction to proceed more incrementally and use less disk space.
The basic idea of LSM-trees — keeping a cascade of SSTables that are merged in the background. Even when the dataset is much bigger than the available memory it continues to work well. Since data is stored in sorted order, you can efficiently perform range queries. Because the disk writes are sequential, the LSM-tree can support remarkably high write throughput.
B-Trees
The most widely used indexing structure is quite different: the B-tree.
B-trees remain the standard index implementation in almost all relational databases, and many non-relational databases use them too.
The log-structured indexes we saw earlier break the database down into variable-size segments and always write a segment sequentially. B-trees instead break the database down into fixed-size blocks or pages, traditionally 4 KB in size, and read or write one page at a time; this corresponds closely to the underlying hardware, as disks are also arranged in fixed-size blocks. Each page can be identified using an address or location, which allows one page to refer to another, similar to a pointer, but on disk instead of in memory. One page is designated as the root of the B-tree; a lookup starts there and follows page references downward until it reaches a page containing individual keys (a leaf page), which either contains the value for each key inline or contains references to the pages where the values can be found.
The number of references to child pages in one page of the B-tree is called the branching factor — depends on the amount of space required to store the page references and the range boundaries, but typically it is several hundred.
To update the value for an existing key, search for the page, change the value and write the page back to disk (any references to that page remain valid).
To add a new key, find the page whose range encompasses the new key, add it to that page. If there isn’t enough space in the page, it is split into two half-full pages and the parent page is updated for the new key ranges.
The algorithm ensures that the tree remains balanced: a B-tree with n keys always has a depth of O(log n). Most databases can fit into a B-tree that is three or four levels deep, so you don't need to follow many page references. A four-level tree of 4 KB pages with a branching factor of 500 can store up to 256 TB.
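A hedged sketch of the lookup path described above, with B-tree pages kept in memory as plain objects (a real B-tree reads pages from disk and also handles inserts, splits, and balancing):

```python
class Page:
    """One B-tree page: boundary keys plus child pages, or keys plus values."""
    def __init__(self, keys, children=None, values=None):
        self.keys = keys            # sorted keys
        self.children = children    # len(keys) + 1 children (internal page)
        self.values = values        # values aligned with keys (leaf page)

def btree_lookup(page, key):
    if page.children is None:                    # reached a leaf page
        for k, v in zip(page.keys, page.values):
            if k == key:
                return v
        return None
    i = 0                                        # pick the child whose
    while i < len(page.keys) and key >= page.keys[i]:  # key range contains key
        i += 1
    return btree_lookup(page.children[i], key)

leaf1 = Page(keys=[100, 150], values=["a", "b"])
leaf2 = Page(keys=[200, 250], values=["c", "d"])
root = Page(keys=[200], children=[leaf1, leaf2])
print(btree_lookup(root, 250))   # 'd'
```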
Making B-trees reliable
The basic write operation of a B-tree is to overwrite a page on disk with new data, in contrast to log-structured indexes such as LSM-trees, which only append to files and never modify them in place.
On a magnetic hard drive, overwriting a page means waiting for the right position on the spinning platter and then overwriting the appropriate sector with new data.
On SSDs — an SSD must erase and rewrite fairly large blocks of a storage chip at a time.
A database may crash during the page split operations. To make it resilient to crashes, it is common for B-tree implementations to include an additional data structure on disk: a write-ahead log (WAL), also known as a redo log. This is an append-only file to which every B-tree modification must be written before it can be applied to the pages of the tree itself.
Careful concurrency control is required if multiple threads are going to access the B-tree at the same time — a thread may see it in an inconsistent state. Latches (lightweight locks) are used for that.
B-tree optimizations
- Instead of overwriting pages and maintaining a WAL for crash recovery, some databases use a copy-on-write scheme: a modified page is written to a different location, and new versions of the parent pages in the tree are created, pointing at the new location.
- Save space in pages by not storing the entire key but abbreviating it (sometimes called a B+ tree). This allows the tree to have a higher branching factor, and thus fewer levels.
- For range queries, many B-tree implementations try to lay out the tree so that leaf pages appear in sequential order on disk. This is difficult to maintain as the tree grows. In contrast, LSM-trees rewrite large segments in one go during merging, so it is easier for them to keep sequential keys close to each other on disk.
- Additional pointers: each leaf page may have references to its siblings to the left and right. This allows scanning without jumping back to parents.
- B-tree variants such as fractal trees borrow some log-structured ideas to reduce disk seeks.
Comparing B-Trees and LSM-Trees
Advantages of LSM-trees:
LSM-trees are typically faster for writes, whereas B-trees are thought to be faster for reads.
A B-tree index must write every piece of data at least twice: once to the write-ahead log and once to the tree page itself (and again as pages are split).
Log-structured indexes also rewrite data multiple times — repeated compaction and merging of SSTables.
Write amplification: one write to the database results in multiple writes to disk over the database's lifetime. This is of particular concern on SSDs, which can only overwrite blocks a limited number of times before wearing out.
In write-heavy applications, the performance bottleneck might be the rate at which the database can write to disk.
LSM-trees are typically able to sustain higher write throughput than B-trees, partly because they sometimes have lower write amplification and partly because they write compact SSTable files sequentially rather than overwriting pages in place.
LSM-trees can be compressed better — producing smaller files on disk than B-trees. B-tree storage engines leave some disk space unused due to fragmentation.
Lower write amplification and reduced fragmentation are still advantageous on SSDs.
Downsides of LSM-trees:
Compaction process can sometimes interfere with the performance of ongoing reads and writes.
Compaction at high write throughput: the disk's finite write bandwidth is shared between the initial write (logging and flushing a memtable to disk) and the compaction threads running in the background. The bigger the database gets, the more disk bandwidth compaction requires.
If write throughput is high, it can happen that compaction cannot keep up with the rate of incoming writes: the number of unmerged segments on disk keeps growing, and reads slow down because they have to check more segment files. Since SSTable-based storage engines typically do not throttle the rate of incoming writes, you need explicit monitoring to detect this situation.
An advantage of B-trees: each key exists in exactly one place in the index, whereas a log-structured storage engine may have multiple copies of the same key in different segments. In many relational databases, transaction isolation is implemented using locks on ranges of keys, and in a B-tree index those locks can be directly attached to the tree.
B-trees are very ingrained in the architecture of databases and provide consistently good performance for many workloads, so it’s unlikely that they will go away anytime soon. In new datastores, log-structured indexes are becoming increasingly popular.
Other Indexing Structures
It is very common to have secondary indexes.
The value in the key-value pairs can be a reference to the row stored somewhere else. The place where rows are stored is known as a heap file; it stores data in no particular order (it may be append-only, or it may keep track of deleted rows in order to overwrite them with new data later). This approach avoids duplicating data when multiple secondary indexes are present: each index just references a location in the heap file, and the actual data is kept in one place.
When updating a value without changing its key, if the new value is larger than the old one, the record may need to be moved to a new location in the heap: either all indexes are updated to point at the new heap location of the record, or a forwarding pointer is left behind in the old heap location.
Clustered index: In some situations, the extra hop from the index to the heap file is too much of a performance penalty for reads — so store the indexed rows directly within an index. In MySQL’s InnoDB storage engine, the primary key of a table is always a clustered index, and secondary indexes refer to the primary key rather than a heap file location.
Clustered Index — storing all row data within the index.
Non-clustered Index — storing only references to the data within the index.
Covering Index (index with included columns) — storing some of a table’s columns within the index — a compromise between those indexes. This allows some queries to be answered by using the index alone (the index covers the query).
Clustered and covering indexes can speed up reads, but they require additional storage and can add overhead on writes.
Multi-column indexes
Multi-column indexes are needed when querying multiple columns of a table (or multiple fields in a document) simultaneously.
A concatenated index: combines several fields into one key by appending one column to another. The index definition specifies in which order the fields are concatenated. Just like (lastname, firstname) in old-fashioned phone books.
This type of index is useless if you want to find all the people with a particular first name.
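A hedged sketch of a concatenated index as sorted tuple keys: lookups by the leading column (lastname) are an efficient range scan, while lookups by firstname alone have to scan the whole index.

```python
import bisect

# Concatenated index sorted by (lastname, firstname), phone-book style.
index = sorted([
    (("Adams", "Ann"), "row-17"),
    (("Adams", "Zoe"), "row-42"),
    (("Baker", "Bob"), "row-08"),
    (("Clark", "Ann"), "row-33"),
])
keys = [k for k, _ in index]

def find_by_lastname(lastname):
    lo = bisect.bisect_left(keys, (lastname, ""))            # range scan
    hi = bisect.bisect_left(keys, (lastname + "\uffff", ""))
    return [row for _, row in index[lo:hi]]

def find_by_firstname(firstname):
    # the sort order does not help here: full scan of the index
    return [row for (last, first), row in index if first == firstname]

print(find_by_lastname("Adams"))    # ['row-17', 'row-42']
print(find_by_firstname("Ann"))     # ['row-17', 'row-33'], via full scan
```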
Multi-dimensional indexes are particularly important for geospatial data, e.g. searching for restaurants within a rectangular map area. A standard B-tree or LSM-tree index is not able to answer that kind of query efficiently.
One option is to translate a two-dimensional location into a single number using a space-filling curve and then use a regular B-tree index. More commonly, specialized spatial indexes such as R-trees are used.
Full-text search and fuzzy indexes
All the indexes discussed so far assume that you query for exact data.
To search for similar keys, such as misspelled words — fuzzy querying requires different techniques.
Full-text search engines allow a search for one word to be expanded to include synonyms of the word, ignore grammatical variations of words, search for occurrences of words near each other in the same document, and apply linguistic analysis of the text.
To cope with typos, Lucene is able to search text for words within a certain edit distance (an edit distance of 1 means that one letter has been added, removed or replaced).
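For concreteness, a standard dynamic-programming edit-distance (Levenshtein) sketch; this is not how Lucene implements fuzzy matching internally (it builds an automaton, as noted next), it just illustrates the metric:

```python
def edit_distance(a, b):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string `a` into string `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]

print(edit_distance("storage", "storege"))   # 1 (one substitution)
```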
Lucene uses an SSTable-like structure for its term dictionary. The in-memory index is a finite state automaton over the characters in the keys, similar to a trie, which can be transformed into a Levenshtein automaton to support efficient search for words within a given edit distance.
Keeping everything in memory
In-memory databases: RAM has become cheaper, and many datasets are simply not that big, so it is feasible to keep them entirely in memory, potentially distributed across several machines.
Memcached is intended for caching use only, where it is acceptable for data to be lost if a machine is restarted.
For durability: using battery-powered RAM, writing a log of changes to disk, writing periodic snapshots to disk, or replicating the in-memory state to other machines. When an in-memory database is restarted, it needs to reload its state, either from disk or from a replica over the network.
Redis and Couchbase provide weak durability by writing to disk asynchronously.
In-memory databases can be faster because they avoid the overheads of encoding in-memory data structures in a form that can be written to disk.
Redis offers a database-like interface to various data structures such as priority queues and sets.
Anti-caching: evicting the least recently used data from memory to disk when there is not enough memory, and loading it back into memory when it is accessed again. This lets in-memory database architectures support datasets larger than the available memory.
Column-Oriented Storage
Column-Oriented Storage: store all the values from each column together instead of all the values from one row together. If each column is stored in a separate file, a query only needs to read and parse the columns that are used in that query. The column-oriented storage layout relies on each column file containing the rows in the same order.
For compression of data, bitmap encoding can be used.
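A hedged sketch of bitmap encoding for a low-cardinality column: one bitmap per distinct value (a real engine would additionally run-length encode the bitmaps, since they are mostly zeros).

```python
def bitmap_encode(column):
    """Return {value: bitmap}, where bitmap[i] == 1 iff column[i] == value."""
    bitmaps = {}
    for i, value in enumerate(column):
        bitmaps.setdefault(value, [0] * len(column))[i] = 1
    return bitmaps

product_sk = [69, 69, 69, 74, 31, 31, 31, 74]     # example column values
bitmaps = bitmap_encode(product_sk)
print(bitmaps[69])    # [1, 1, 1, 0, 0, 0, 0, 0]

# A predicate like "product_sk IN (31, 74)" becomes a bitwise OR of bitmaps.
matching_rows = [a | b for a, b in zip(bitmaps[31], bitmaps[74])]
print(matching_rows)  # [0, 0, 0, 1, 1, 1, 1, 1]
```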
Cassandra has a concept of column families; however, within each column family it stores all columns from a row together, along with a row key, so this model is still mostly row-oriented.
Log-structured: SSTables, LSM-trees, Cassandra, Lucene.
Update-in-place: B-trees, in all major relational databases and many non-relational ones.
Happy Coding!
References:
- Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems, by Martin Kleppmann
- https://twitter.com/alexxubyte/status/1583119489318518786