System Design Interview — Study Notes III— Designing Data Intensive Applications Book Notes
Notes on Concepts & Components to be used in a System Design Interview
Chapter 2 — Data Models and Query Languages
Models: document, relational, graph
Historically, data started out being represented as one big tree (the hierarchical model); that wasn’t good for representing many-to-many relationships so the relational model was invented.
SQL data model — relational model (proposed by Edgar Codd in 1970): data organized into relations (called tables in SQL), where each relation is an unordered collection of tuples (rows in SQL). Relational database management systems (RDMSes). The roots of relational databases lie in business data processing. Its use cases: transaction processing (entering sales or banking transactions, airline reservations, stock-keeping in warehouses) and batch processing (customer invoicing, payroll, reporting).
NoSQL — distributed, nonrelational databases. a need for greater scalability than relational databases can achieve, including very large datasets or very high write throughput. restrictiveness of relational schemas, more dynamic and expressive data model. NoSQL datastores have diverged in two main directions; document databases and graph databases.
Document databases target use cases where data comes in self-contained documents and relationships between one document and another are rare. A document is usually stored as a single continuous string, encoded as JSON, XML or a binary variant (BSON of MongoDB).
Graph databases go in the opposite direction, targeting use cases where anything is potentially related to everything.
In complex architectures, each layer hides the complexity of the layers below it by providing a clean data model; i.e. abstractions.
impedence mismatch: the disconnect between the models (the objects in the application code and the database model of tables, rows, columns).
Object-relational mapping (ORM) frameworks like Hibernate reduce the amount of boilerplate code required for the translation layer.
Many-to-One and Many-to-Many Relationships
Normalization in databases — removing duplication is the key idea. consistent style and spelling, ease of updating, localization support, better search.
In document databases, joins are not needed for one-to-many tree structures, and support for joins is often weak.
Maybe, you can keep them in memory for joins if not supported by the database — the work of making the join is shifted from the database to the application code.
In a relational database, the query optimizer (not the application developer) automatically decides which parts of the query to execute in which order, and which indexes to use; i.e. the access path.
Relational Versus Document Databases Today
The main arguments in favor of the document data model are schema flexibility, better performance due to locality, data structures being closer to the ones used by the application.
The relational model counters by providing better support for joins and many-to-one and many-to-many relationships.
The document model has limitations; you cannot refer directly to a nested item within a document, etc.
Many-to-many relationships may never be needed in an analytics application that uses a document database to record which events occurred at which time. Meanwhile, for highly interconnected data, the document model is awkward, the relational model is acceptable, and graph models are the most natural.
Document databases are sometimes called schemaless; there is an implicit schema but it is not enforced by the database — schema-on-read (the structure of the data is implicit, and only interpreted when the data is read)
schema-on-write — the traditional approach of relational databases, where the schema is explicit and the database ensures all written data conforms to it.
Schema-on-read is advantageous where there are many different types of objects and it is not practical to put each type of object in its own table and where the structure of the data is determined by external systems over which you have no control.
Data locality for queries
If data is split across multiple tables, multiple index lookups are required, requiring more disk seeks and taking more time. If not, there is a performance advantage with this storage locality. If only a small portion is required, this may be wasteful though.
On updates to a document, the entire document usually needs to be rewritten — only modifications that don’t change the encoded size of a document can easily be performed in place. For this reason, it is recommended to keep them fairly small.
For locality, Oracle has a feature called “multi-table index cluster tables”. The column-family concept used in Cassandra has a similar purpose.
Query Languages for Data
An imperative language tells the computer to perform certain operations in a certain order.
In declarative query language, like SQL or relational algebra, you just specify the pattern of the data you want — what conditions the results must meet, and how you want the data to be transformed (e.g., sorted, grouped, and aggregated) — but not how to achieve the goal. It is up to the query optimizer. It also hides implementation details of the database engine. Declarative languages often lend themselves to parallel execution.
CSS and XSL are both declarative languages for specifying the styling of a document.
MapReduce Querying
MapReduce is a programming model for processing large amounts of data in bulk across many machines, popularized by Google. map — collect, reduce — fold or inject. Some NoSQL datastores including MongoDB use a limited form of it as a mechanism for performing read-only queries across many documents.
Graph-Like Data Models
A graph consists of two kinds of objects: vertices (also known as nodes or entities) and edges (also known as relationships or arcs).
Social graphs — vertices: people, edges: which people know each other.
The web graph — vertices: web pages, edges: HTML links to other pages.
Road or rail networks — vertices: junctions, edges: the roads or railway lines between them.
Well-known algorithms can operate on graphs. PageRank can be used on the web graph to determine the popularity of a web page and thus its ranking in search results.
Graphs are not limited to homogeneous data; can be used to store completely different types of objects in a single datastore.
Facebook — vertices: people, locations, events, checkins, comments made by users; edges: which people are friends with each other, which checkin happened in which location, who commented on which post, who attended which event, etc.
Property Graphs
Each vertex consists of:
- A unique identifier
- A set of outgoing edges
- A set of incoming edges
- A collection of properties (key-value pairs)
Each edge consists of:
- A unique identifier
- The vertex at which the edge starts (the tail vertex)
- The vertex at which the edge ends (the head vertex)
- A label to describe the kind of relationship between the two vertices
- A collection of properties (key-value pairs)
Graph store: consisting of two relational tables, one for vertices and one for edges.
Graphs are good for evolvability: as you add features to your applications, a graph can easily be extended to accommodate changes in your application’s data structure.
Graph Queries in SQL
In a relational database, you usually know in advance which joins you need in your query. In a graph query, you may need to traverse a variable number of edges before you find the vertex you’re looking for — the number of joins is not fixed in advance.
Triple-Stores
The triple store model is mostly equivalent to the property graph model, using different words to describe the same ideas. All information is stored in the form of three-part statements: subject, predicate, object. (Jim — subject, likes — predicate (verb), bananas — object)
Happy Coding!
References:
Designing Data-Intensive Applications
The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
by Martin Kleppmann