System Design Interview — Study Notes VIII — Designing Data Intensive Applications Book Notes
Notes on Concepts & Components to be used in a System Design Interview
Chapter 6— Partitioning
Sharding — or very large datasets, or very high query throughput; need to break the data up into partitions.
Different partitions can be placed on different nodes in a shared-nothing cluster.
Partitioned databases are rediscovered by NoSQL databases and Hadoop-based data warehouses.
Partitioning and Replication
Replication of databases applies equally to replication of partitions.
Partitioning of Key-Value Data
Spread the data and the query load evenly across nodes.
Some partitions have more data or queries than others — skewed.
A partition with disproportionately high load — hot spot.
The simplest approach for avoiding hot spots would be to assign records to nodes randomly; you have to query all nodes in parallel.
Partitioning by Key Range (Key-range partitioning)
Continuous range of keys.
Make your request directly to the appropriate node.
Partitions are typically rebalanced dynamically by splitting the range into two subranges when a partition gets too big.
To distribute the data evenly, the partition boundaries need to adapt to the data.
Within each partition, we can keep keys in sorted order. You can treat the key as a concatenated index in order to fetch several related records in one query. For example, the key may be the timestamp as year-month-day-hour-minute-second format — range scans are very useful in this case.
Certain access patterns can lead to hot spots. For example, all the writes end up going to the same partition (the one for today); may be overloaded with writes while other partitions sit idle. For a sensor measurements database, you may partition first by sensor name and then by time.
Partitioning by Hash of Key (Hash partitioning)
Because of the risk of skew and hot spots, many distributed datastores use a hash function to determine the partition for a given key. This destroys the ordering of keys, making range queries inefficient, but may distribute load more evenly.
Cassandra and MongoDB use MD5.
You can assign each partition a range of hashes and every key whose hash falls within a partition’s range will be stored in that partition.
The partition boundaries can be evenly spaced, or they can be chosen pseudorandomly — consistent hashing.
Consistent Hashing
Randomly chosen partition boundaries. Actually doesn’t work very well for databases, so it is rarely used in practice (the documentation of some databases still refers to consistent hashing, but it is often inaccurate). It’s best to avoid the term consistent hashing and just call it hash partitioning instead.
Keys that were one adjacent are not scattered across all the partitions, so their sort order is lost. In MongoDB, if hash-based sharding is enabled, any range query has to be sent to all partitions.
Cassandra achieves a compromise between the two partitioning strategies. A table in Cassandra can be declared with a compound primary key (a hybrid approach) consisting of several columns. Only the first part of that key is hashed to determine the partition, but the other columns are used as a concatenated index for sorting the data in Cassandra’s SSTables.
Concatenated index — an elegant data model for one-to-many relationships.
(user_id, update_timestamp); within each user, the updates are stored ordered by timestamp on a single partition.
Skewed Workloads and Relieving Hot Spots
A celebrity user with millions of followers — may result in a large volume of writes to the same key.
If one key is known to be very hot, a simple technique is to add a random number to the beginning or end of the key (for example, a two-digit number splitting the key evenly across 100 different keys). This requires additional bookkeeping — any reads now have to read the data from all the keys (for two-digit, all 100 keys) and combine it.
Partitioning and Secondary Indexes
A secondary index usually doesn’t identify a record uniquely but rather is a way of searching of occurrences of a particular of a particular value.
Secondary indexes are the raison d’être of search servers such as Solr and Elasticsearch.
The problem with secondary indexes is that they don’t map neatly to partitions. Partitioning a database with secondary indexes document-based partitioning and term-based partitioning.
Partitioning Secondary Indexes by Document
A car listing website; each listing has a unique ID — the document ID; secondary index on color and make.
Document-partitioned indexes (local indexes) — each partition is completely separate; you only need to deal with the partition that contains the document ID that you are writing. For this reason, a document-partitioned index is also known as a local index.
scatter gather — for searching, you need to send the query to all partitions, and combine all the results you get back. This makes read queries on secondary indexes quite expensive. Nevertheless, it is widely used; MongoDB, Cassandra, ElasticSearch.
Partitioning Secondary Indexes by Term
Rather than each partition having its own secondary index (a local index), we can construct a global index that covers data in all partitions. A global index can be partitioned differently from the primary key index.
colors starting with a to r may appear in partition0 and colors starting with s to z may appear in partition1.
Term-partitioned indexes (global indexes) — an entry in the secondary index may include records from all partitions of the primary key.
The term we’re looking for determines the partition of the index; makes reads more efficient, writes are slower and more complicated. A write to a single document may now affect multiple partitions of the index — every term in the document might be on a different partition on a different node, requiring a distributed transaction which is not supported in all databases. Updates are often asynchronous.
We can partition the index by the term itself, or using a hash of the term. Partitioning by the term itself can be useful for range scans, whereas partitioning on a hash of the term gives a more even distribution of load.
Rebalancing Partitions
The query throughput increases, the dataset size increases, a machine fails.
rebalancing — moving load from one node in the cluster to another.
After rebalancing — the load should be shared fairly between the nodes in the cluster.
During rebalancing — the database should continue accepting reads and writes.
No more data than necessary should be moved between nodes, to make rebalancing fast and to minimize the network and disk I/O load.
Strategies for Rebalancing
How not to do it: hash mod N approach
if the number of nodes N changes, most of the keys will need to be moved from one node to another.
Fixed number of partitions
Only entire partitions are moved between nodes. The only thing that changes is the assignment of partitions to nodes. This approach is used in ElasticSearch and Couchbase.
- A fixed number of partitions is operationally simpler.
- If partitions are very large, rebalancing and recovery from node failures become expensive.
- If partitions are too small, they incur too much overhead.
- Neither too big or too small is just right, which can be hard to achieve if the number of partitions is fixed but the dataset size varies.
Dynamic partitioning
Key range-partitioned datasets create partitions dynamically. When a partition grows to exceed a configured size, it is split into two partitions so that approximately half of the data ends up on each side of the split. Conversely, if lots of data is deleted and a partition shrinks below some threshold, it can be merged with an adjacent partition. This is similar to what happens at the top level of a B-tree.
After a large partition has been split, one of its two halves can be transferred to another node in order to balance the load.
An empty database starts with a single partition — all writes processed by a single node while the others sit idle.
pre-splitting — allowing an initial set of partitions to be configured on an empty database. MongoDB can do this.
Dynamic partitioning is not only suitable for key range-partitioned data but also hash-partitioned data. MongoDB supports both and splits partitions dynamically in either case.
Partitioning proportionally to nodes
In dynamic partitioning, the number of partitions is proportional to the size of the dataset.
With a fixed number of partitions, the size of each partition is proportional to the size of the dataset.
Make the number of partitions proportional to the number of nodes — a fixed number of partitions per node (such as in Cassandra). The size of each partition grows proportionally to the dataset size while the number of nodes remain unchanged, but when you increase the number of nodes, the partitions become smaller again.
When a new node joins the cluster, it randomly chooses a fixed number of existing partitions to split, and then takes ownership of one half of each of those split partitions while leaving the other half of each partition in place.
In Cassandra, 256 partitions per node by default.
Picking partition boundaries randomly requires that hash-based partitioning is used. This corresponds most closely to original definition of consistent hashing.
Operations: Automatic or Manual Rebalancing
Automatic — may put additional load on the already overloaded node.
Manual — a good thing to have a human in the loop for rebalancing
Request Routing
Service Discovery
- connecting randomly or via a round-robin load balancer — if that node coincidentally owns the partition to which the request applies, it can handle the request directly; otherwise, it forwards the request to the appropriate node, receives the reply, and passes the reply along the client.
- a routing tier
- clients be aware of the partitioning and assignment info, can connect directly to the appropriate node without an intermediary.
Many distributed data systems rely on a separate coordination service such as Zoo-Keeper to keep track of this cluster metadata. Each node registers itself in ZooKeeper, and ZooKeeper maintains the authoritative mapping of partitions to nodes. ZooKeeper notifies the routing tier (when a partition changes ownership, a node is added or removed, etc.) so it keeps its routing information up to date.
Kafka uses ZooKeeper.
MongoDB has a similar architecture; it relies on its own config server implementation and mongos daemons as the routing tier.
Cassandra does not rebalance automatically.
Happy Coding!
References:
- Designing Data-Intensive Applications
The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
by Martin Kleppmann