System Design Interview — Study Notes V — Designing Data Intensive Applications Book Notes
Notes on Concepts & Components to be used in a System Design Interview
Chapter 4— Encoding and Evolution
With server-side applications, rolling upgrade — staged rollout: deploying the new version to a few nodes at a time, checking whether it is running smoothly, and gradually deploying to all the nodes.
With client-side applications, you are at the mercy of the user.
Backward compatibility: Newer code can read data that was written by older code.
Forward compatibility: Older code can read data that was written by newer code.
Turning data structures into bytes on the network or disk. When you want to write data to a file or send it over the network, you have to encode it as some kind of self-contained sequence of bytes. Thus, translation between representations is required. The translation from the in-memory representation to a byte sequence is called encoding (serialization, marshalling). The reverse is called decoding (parsing, deserialization, unmarshalling).
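The encode/decode round trip above can be sketched with the standard library's JSON support (a language-independent choice; the record fields here are just illustrative):

```python
# The in-memory object must be turned into bytes before it can be written
# to a file or sent over the network.
import json

record = {"userName": "Martin", "favoriteNumber": 1337}
encoded = json.dumps(record).encode("utf-8")   # encoding (serialization)
decoded = json.loads(encoded.decode("utf-8"))  # decoding (parsing)
assert decoded == record                       # round trip preserves the data
```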
Java => java.io.Serializable
The encoding is often tied to a particular programming language.
Security problems; if an attacker can get your application to decode an arbitrary byte sequence, they can instantiate arbitrary classes, allowing them to execute arbitrary code remotely.
Versioning data — for forward and backward compatibility.
Efficiency: Java’s built-in serialization is notorious for its bad performance and bloated encoding.
JSON, XML, and Binary Variants
Standardized encodings, written and read by many programming languages.
Used as data interchange formats (for sending data from one organization to another).
XML — criticized for being verbose and unnecessarily complicated.
JSON — popular due to its built-in support in web browsers and simplicity.
JSON, XML, CSV — human-readable
Problems:
Encoding of numbers and binary strings: XML and CSV cannot distinguish between a number and a string. JSON distinguishes strings and numbers but not integers and floating-point numbers (and it does not specify a precision).
Twitter uses a 64-bit number to identify each tweet. Because JavaScript parses JSON numbers as IEEE 754 doubles, which cannot exactly represent integers above 2^53, Twitter’s API includes each tweet ID in JSON responses twice: as a JSON number and as a decimal string.
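The precision problem can be demonstrated directly: a double-based parser (like JavaScript's) rounds large integers, while the string copy survives intact. The ID value below is just an illustrative 64-bit-sized number:

```python
import json

# A tweet-sized ID; JSON itself allows it, but parsers that map numbers to
# IEEE 754 doubles only have 53 bits of mantissa, so large IDs get rounded.
tweet_id = 1234567890123456789

as_double = float(tweet_id)          # what a double-based parser would store
assert int(as_double) != tweet_id    # precision was lost

# Twitter's workaround: include the ID a second time as a string
payload = json.dumps({"id": tweet_id, "id_str": str(tweet_id)})
restored = json.loads(payload)
assert int(restored["id_str"]) == tweet_id   # the string survives intact
```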
XML and JSON have optional schema support, but CSV does not have any schema. CSV’s escaping rules are also vague (what happens if a value contains a comma or a newline?).
Binary Encoding
JSON is less verbose than XML, but both use a lot of space compared to binary formats => binary encodings for JSON: MessagePack, BSON, BJSON, UBJSON, BISON, Smile.
Since they don’t prescribe a schema, they need to include all the object field names within the encoded data.
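A minimal MessagePack-style encoder (a sketch handling only small maps, short strings, and small positive ints) makes this visible: the field name itself travels inside every encoded record.

```python
def msgpack_encode(obj):
    """Toy MessagePack-style encoder to show field names embedded in the bytes."""
    if isinstance(obj, dict):
        out = bytes([0x80 | len(obj)])            # fixmap, up to 15 entries
        for k, v in obj.items():
            out += msgpack_encode(k) + msgpack_encode(v)
        return out
    if isinstance(obj, str):
        data = obj.encode("utf-8")
        return bytes([0xA0 | len(data)]) + data   # fixstr, up to 31 bytes
    if isinstance(obj, int) and 0 <= obj < 128:
        return bytes([obj])                       # positive fixint
    raise NotImplementedError(type(obj))

encoded = msgpack_encode({"userName": "Martin"})
assert b"userName" in encoded    # the field name is part of the payload
```

Schema-driven formats like Thrift and Protocol Buffers avoid this overhead by replacing field names with small numeric tags.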
Thrift and Protocol Buffers
Binary schema-driven formats — the data needs to be decoded before it is human-readable.
Apache Thrift — Facebook, Protocol Buffers (protobuf) — Google, Apache Avro
Both require a schema.
A code generation tool — generates code that encodes and decodes records of the schema.
In dynamically typed programming languages such as JavaScript, Ruby, or Python, there is not much point in generating code since there is no compile-time type checker.
Thrift has two different binary encoding formats; BinaryProtocol, CompactProtocol.
BinaryProtocol:
Each field has a type annotation (to indicate whether it is a string, integer, list, etc.) and a length indication where required (length of a string, number of items in a list). No field names; instead field tags (which are numbers). These numbers appear in the schema definition, they are like aliases for fields.
CompactProtocol:
Packs the field type and tag number into a single byte and uses variable-length integers, so the same information fits in fewer bytes.
Numbers are encoded as variable-length integers: e.g. 1337 fits in two bytes, with the top bit of each byte indicating whether more bytes follow.
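The variable-length scheme can be sketched as follows: 7 data bits per byte, with the top bit set whenever another byte follows (this sketch omits the zigzag step real encoders apply to signed numbers):

```python
def varint_encode(n: int) -> bytes:
    """Encode a non-negative integer as a varint: 7 data bits per byte,
    top bit set on all bytes except the last."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # more bytes to come
        else:
            out.append(byte)          # final byte: top bit clear
            return bytes(out)

def varint_decode(data: bytes) -> int:
    n, shift = 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        shift += 7
    return n

assert varint_encode(1337) == b"\xb9\n"          # 1337 fits in two bytes
assert varint_decode(varint_encode(1337)) == 1337
```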
Thrift has a dedicated list datatype, but you cannot change a single-valued field into a multi-valued field. It does, however, support nested lists.
Protocol Buffers has only one binary encoding format, similar to Thrift’s CompactProtocol with a slightly different bit packaging. It does not have a list or array datatype; instead repeated marker is used as a third option alongside required and optional. It is okay to change a single optional field into a repeated field; old code sees only the last element of the list.
In the schemas, each field is marked required or optional, but nothing in the binary data indicates this; required merely enables a runtime check.
Apache Avro also uses a schema. It has two schema languages; Avro IDL for human editing, and a machine-readable one based on JSON. There are no tag numbers in the schema. There is nothing to identify fields or their datatypes; encoding simply consists of values concatenated together. An integer is encoded using a variable-length encoding (same as Thrift’s CompactProtocol). The binary data can only be decoded correctly if the code reading the data is using the exact same schema as the code that wrote the data.
writer’s schema — encoding, reader’s schema — decoding.
- The writer’s schema and the reader’s schema don’t have to be the same, only need to be compatible.
- The fields can be in a different order.
- If a field is in the writer’s schema but not in the reader’s schema, it is ignored.
- If the writer’s schema does not contain a field of a given name, it is filled in with a default value declared in the reader’s schema when being read.
- Changing a field name is possible via aliases in the reader’s schema — backward compatible but not forward compatible.
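The resolution rules above can be sketched as a toy resolver (this is not the real Avro API; the writer's schema is reduced to an ordered field list and the reader's to a name → default mapping, which is enough to show the behaviour):

```python
def resolve(values, writer_fields, reader_defaults):
    """Decode with the writer's field order, then reconcile with the reader's schema."""
    record = dict(zip(writer_fields, values))   # writer's schema gives field order
    resolved = {}
    for name, default in reader_defaults.items():
        resolved[name] = record.get(name, default)  # missing fields get defaults
    return resolved                                 # unknown writer fields are ignored

writer_schema = ["userName", "favoriteNumber"]       # what the data was written with
reader_schema = {"userName": None, "interests": []}  # what the reader expects

row = resolve(["Martin", 1337], writer_schema, reader_schema)
assert row == {"userName": "Martin", "interests": []}
```

`favoriteNumber` (writer-only) is dropped, and `interests` (reader-only) is filled with its declared default — exactly the two rules in the list above.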
Field tags and schema evolution
Schema evolution: schemas inevitably need to change over time.
- Adding new fields to the schema; a new tag number is given, old code can ignore that field. You cannot make this new field required, though. It would fail if new code read data written by old code. To maintain backward compatibility, every field you add after the initial deployment of the schema must be optional or have a default value.
- Removing fields; you can only remove a field that is optional. You can never use the same tag number again.
Modes of Dataflow
- Via databases
- Via service calls
- Via asynchronous message passing
Dataflow Through Databases
Both backward and forward compatibility are often required for databases: newer code may read values written long ago by older code, and older code may read values written by newer code. Data outlives code.
Suppose you add a field to a record schema. An older version of the code reads the record, updates it, and writes it back — the desirable behaviour is that the old code keeps the new field intact. Encoding formats can support this, but sometimes you need to take care at the application level (e.g. decoding into a model object that lacks the field can silently drop it on the write-back).
Most relational databases allow simple schema changes, such as adding a new column with a null default value, without rewriting existing data.
Dataflow Through Services: REST and RPC
Two roles: clients and servers. The API exposed by the server is known as a service.
A client-side JavaScript application running inside a web browser can use XMLHttpRequest to become an HTTP client — Ajax.
HTTP — used as the transport protocol.
A server can itself be a client to another service (app server acting as a client to a database).
Service-oriented architecture (SOA), microservices architecture.
Web services: When HTTP is used as the underlying protocol for talking to the service.
Middleware: one service making requests to another service located within the same datacenter, owned by the same organization.
Service requests between different organization’s backend systems such as credit card processing systems, OAuth for shared access to user data.
REST and SOAP — two popular approaches to web services.
HATEOAS — hypermedia as the engine of application state.
REST — not a protocol but a design philosophy built upon the principles of HTTP. URLs are used for identifying resources. Uses HTTP features for cache control, authentication, and content type negotiation. Often associated with microservices. RESTful API — an API designed according to the principles of REST. A definition format such as OpenAPI (also known as Swagger) can be used to describe RESTful APIs and produce documentation.
SOAP — an XML-based protocol for making network API requests. Most commonly used over HTTP but avoids using most HTTP features. Its standards — a web service framework known as WS-*. The API of a SOAP web service is described using an XML-based language called the Web Services Description Language (WSDL). WSDL enables code generation.
The problems with remote procedure calls (RPCs)
Location transparency — The RPC model tries to make a request to a remote network service look the same as calling a function or method in your programming language, within the same process.
- The request or response may be lost due to a network problem; machine may be slow or unavailable; timeout -> retry a failed request.
- If only the responses are getting lost and you retry a failed network request, the action will be performed multiple times -> build a mechanism for deduplication (idempotence).
- A network request is much slower than a function call. When you call a local function, you can pass pointers to objects in local memory; for a network request, parameters need to be encoded into a sequence of bytes -> problematic with larger objects.
- The client and the service may be implemented in different programming languages.
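The deduplication idea from the retry bullet can be sketched server-side: the client attaches a unique request ID, and the server caches the result so a retried request is not executed twice. All names here (`handle_payment`, the in-memory dict) are illustrative:

```python
import uuid

processed = {}   # request_id -> result (stands in for server-side durable state)

def handle_payment(request_id, amount):
    """Idempotent handler sketch: duplicate deliveries return the cached result."""
    if request_id in processed:
        return processed[request_id]     # already performed: don't charge again
    result = f"charged {amount}"         # perform the action exactly once
    processed[request_id] = result
    return result

rid = str(uuid.uuid4())
first = handle_payment(rid, 42)
retry = handle_payment(rid, 42)          # network retry of the same request
assert first == retry and len(processed) == 1
```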
Current directions for RPC
Thrift and Avro support it, gRPC is an RPC implementation using Protocol Buffers. Finagle and Rest.li use futures (promises). gRPC supports streams. Some of the frameworks also provide service discovery.
Custom RPC protocols with a binary encoding format can achieve better performance than JSON over REST. The main focus of RPC frameworks is on requests between services owned by the same organization, typically within the same datacenter.
A RESTful API has advantages — good for experimentation and debugging. Easy to make requests using a browser or the command-line tool curl. Supported by all mainstream programming languages and platforms, with a vast ecosystem of tools: servers, caches, load balancers, proxies, firewalls, monitoring, debugging, and testing tools.
Data encoding and evolution for RPC
For evolvability, it is important that RPC clients and servers can be changed and deployed independently.
It is reasonable to assume that all the servers will be updated first and all the clients second; you then only need backward compatibility on requests and forward compatibility on responses. If a compatibility-breaking change is required, the service provider often ends up maintaining multiple versions of the service API side by side. Use a version number in the URL or in the HTTP Accept header.
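Version selection via the Accept header can be sketched as below; the vendor media type `application/vnd.example.v2+json` is a common convention, and the names here are illustrative:

```python
def route_request(headers):
    """Dispatch to an API version based on the HTTP Accept header (sketch)."""
    accept = headers.get("Accept", "")
    if "application/vnd.example.v2+json" in accept:
        return "v2 handler"
    return "v1 handler"      # default: the oldest still-supported version

assert route_request({"Accept": "application/vnd.example.v2+json"}) == "v2 handler"
assert route_request({}) == "v1 handler"
```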
Message-Passing Dataflow
Asynchronous message-passing systems: a client’s request (a message) is delivered to another process with low latency. The message is not sent via a direct network connection but via a message broker (also called a message queue or message-oriented middleware).
- act as a buffer — improves system reliability.
- automatically redeliver messages — prevents messages from being lost.
- avoids the sender needing to know the IP address and port number of the recipient.
- one message to be sent to several clients.
- decouples the sender from the consumer. the sender just publishes messages, doesn’t care who consumes them, doesn’t wait for the message to be delivered — asynchronous communication.
- the message-passing communication is usually one-way (unlike RPC): a sender normally doesn’t expect to receive a reply to its messages. Sending a response is possible, but it is usually done on a separate channel.
Message brokers
Open-source implementations such as RabbitMQ, ActiveMQ, Apache Kafka.
- One process sends a message to a named queue or topic.
- The broker ensures that the message is delivered to one or more consumers of or subscribers to that queue or topic.
- A topic provides only one-way dataflow. A consumer may itself publish messages — you can chain them together — or to a reply queue that is consumed by the sender of the original message.
- Message brokers don’t force any particular data model — you can use any encoding format.
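The topic/subscriber model in the bullets above can be sketched as a toy in-memory broker (illustrative only — real brokers add persistence, acknowledgements, and redelivery):

```python
from collections import defaultdict

class Broker:
    """Toy pub/sub broker: publishers send to a named topic, and the broker
    fans each message out to every subscriber of that topic."""
    def __init__(self):
        self.subscribers = defaultdict(list)   # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in self.subscribers[topic]:
            callback(message)                  # deliver to each consumer

broker = Broker()
received = []
broker.subscribe("user-events", received.append)
broker.subscribe("user-events", lambda m: received.append(m.upper()))
broker.publish("user-events", "signup")        # one message, two consumers
assert received == ["signup", "SIGNUP"]
```

Note the sender never learns who consumed the message — that is the decoupling the bullets describe.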
Distributed actor frameworks
Akka uses Java’s built-in serialization by default, which does not provide forward or backward compatibility; you can replace it with something like Protocol Buffers to gain it.
The actor model is a programming model for concurrency in a single process.
Rather than dealing with threads, logic is encapsulated in actors.
Each actor represents one client or entity. It may have some local state. It communicates with other actors by sending and receiving asynchronous messages. Message delivery is not guaranteed.
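A minimal single-process actor can be sketched with a thread and a mailbox queue (a sketch, not any particular framework's API): state is private to the actor, and messages are processed one at a time, so no locks are needed.

```python
import queue
import threading

class CounterActor:
    """Toy actor: local state plus a mailbox drained by one dedicated thread."""
    def __init__(self):
        self.mailbox = queue.Queue()
        self.count = 0                        # local state, touched by one thread only
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def _run(self):
        while True:
            msg = self.mailbox.get()
            if msg is None:                   # poison pill: shut the actor down
                return
            self.count += msg                 # messages processed sequentially

    def send(self, msg):
        self.mailbox.put(msg)                 # asynchronous: caller does not wait

    def stop(self):
        self.mailbox.put(None)
        self._thread.join()

actor = CounterActor()
for _ in range(3):
    actor.send(1)
actor.stop()
assert actor.count == 3
```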
Distributed actor frameworks — each actor can be scheduled independently by the framework, scaling an application across multiple nodes. They integrate a message broker and the actor programming model into a single framework.
Happy Coding!
References:
- Designing Data-Intensive Applications
The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
by Martin Kleppmann