System Design Interview — Study Notes X — Designing Data-Intensive Applications Book Notes
Notes on Concepts & Components to be used in a System Design Interview
Chapter 8 — The Trouble with Distributed Systems
Faults and Partial Failures
partial failures — some parts of the system are broken in some unpredictable way, even though other parts of the system are working fine. Partial failures are nondeterministic.
Cloud Computing and Supercomputing
high-performance computing (HPC) — for computationally intensive scientific computing tasks (weather forecasting, molecular dynamics)
cloud computing — multi-tenant datacenters, commodity computers connected with an IP network (often Ethernet), elastic/on-demand resource allocation, and metered billing.
traditional enterprise datacenters — somewhere between these two extremes.
Building a Reliable System from Unreliable Components
IP (the Internet Protocol) is unreliable; it may drop, delay, duplicate, or reorder packets.
TCP (the Transmission Control Protocol) provides a more reliable transport layer on top of IP: it ensures that missing packets are retransmitted, duplicates are eliminated, and packets are reassembled into the order in which they were sent.
Unreliable Networks
shared-nothing systems — the network is the only way those machines can communicate. They are cheap and they can achieve high reliability through redundancy across multiple geographically distributed datacenters.
The internet and most internal networks in datacenters (often Ethernet) are asynchronous packet networks.
The sender can’t even tell whether the packet was delivered: the only option is for the recipient to send a response message, which may in turn be lost or delayed.
The usual way of handling this issue is a timeout. However, when a timeout occurs, you still don’t know whether the remote node got your request or not (and if the request is still queued somewhere, it may still be delivered to the recipient, even if the sender has given up on it).
Network Faults in Practice
Public cloud services such as EC2 are notorious for frequent transient network glitches; well-managed private datacenter networks can be more stable environments.
A network interface sometimes drops all inbound packets but sends outbound packets successfully; just because a network link works in one direction doesn’t guarantee it’s also working in the opposite direction.
network partition (netsplit) — one part of the network is cut off from the rest due to a network fault.
Deliberately trigger network problems and test the system’s response (the idea behind Chaos Monkey)
Detecting Faults
- no process is listening on the destination port (e.g., because the process has crashed) — the operating system will helpfully close or refuse TCP connections by sending a RST or FIN packet in reply.
- a node process crashed (or was killed by an administrator) — a script can notify other nodes about the crash so that another node takes over without having to wait for a timeout.
- a router is sure the IP address is unreachable; it may reply to you with an ICMP Destination Unreachable packet.
You can’t count on rapid feedback. If you want to be sure that a request was successful, you need a positive response from the application itself.
You can retry a few times (TCP retries transparently, but you may also retry at the application level), wait for a timeout to elapse, and eventually declare the node dead if you don’t hear back within the timeout.
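As a minimal sketch of this pattern (the helper names, timeout, and retry count are illustrative assumptions, not from the book), an application-level call with a timeout and retries might look like this in Java:

```java
import java.util.concurrent.*;

public class RetryWithTimeout {
    // Submit the request as a Callable so we can bound how long we wait for it.
    static String callRemoteNode(ExecutorService pool, Callable<String> request,
                                 long timeoutMs, int maxRetries) throws Exception {
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            Future<String> future = pool.submit(request);
            try {
                return future.get(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                // Give up on this attempt. Note the caveat above: the request
                // may still reach the recipient even though we stopped waiting.
                future.cancel(true);
            }
        }
        throw new Exception("node declared dead after " + maxRetries + " timeouts");
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            String reply = callRemoteNode(pool, () -> {
                Thread.sleep(50); // simulated network + processing delay
                return "pong";
            }, 200, 3);
            System.out.println(reply);
        } finally {
            pool.shutdownNow();
        }
    }
}
```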
Timeouts and Unbounded Delays
A long timeout means a long wait — users may have to wait or see error messages.
Prematurely declaring a node dead is problematic: if the node is actually alive, its action may end up being performed twice when another node takes over.
Transferring a dead node’s responsibilities puts extra load on the remaining nodes, which can trigger a cascading failure (in the extreme case, all nodes declare each other dead, and everything stops working).
If the network and the nodes gave hard guarantees, a safe timeout could be derived:
- Every packet is either delivered within some time d, or it is lost.
- A non-failed node always handles a request within some time r.
- Then every successful request receives a response within time 2d + r — a reasonable timeout.
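For example, with d = 100 ms and r = 200 ms, every successful request would get a response within 2 × 100 + 200 = 400 ms. Real systems offer neither guarantee.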
Asynchronous networks have unbounded delays — there is no upper limit on the time it may take for a packet to arrive.
Network congestion and queueing
- If several nodes simultaneously send packets to the same destination, the network switch must queue them. On a busy network link, a packet may have to wait a while until it can get a slot — network congestion. If the queue fills up, the packet is dropped and needs to be resent.
- The operating system queues incoming requests until the application is ready to handle them.
- In virtualized environments, a running operating system is often paused for tens of milliseconds while another virtual machine uses a CPU core. Incoming data is queued (buffered) by the virtual machine monitor.
- TCP performs flow control (also known as congestion avoidance or backpressure), in which a node limits its own rate of sending in order to avoid overloading a network link or the receiving node. This means additional queueing at the sender.
TCP considers a packet to be lost if it is not acknowledged within some timeout (which is calculated from observed round-trip times), and lost packets are automatically retransmitted. Although the application does not see the packet loss and retransmission, it does see the resulting delay (waiting for the timeout to expire, and then waiting for the retransmitted packet to be acknowledged).
TCP versus UDP
Some latency-sensitive applications, such as videoconferencing and Voice over IP (VoIP), use UDP rather than TCP.
UDP does not perform flow control and does not retransmit lost packets. It is still susceptible to switch queues and scheduling delays.
UDP is a good choice in situations where delayed data is worthless. There isn’t enough time to retransmit a lost packet — the application must instead fill the missing packet’s time slot with silence and move on in the stream.
queueing delays have an especially wide range when a system is close to its maximum capacity. In public clouds and multi-tenant datacenters, resources are shared among many customers: the network links and switches, and even each machine’s network interface and CPUs (when running on virtual machines), are shared. a noisy neighbor — you may experience highly variable network delays if someone near you is using a lot of resources.
Synchronous Versus Asynchronous Networks
an ISDN network runs at a fixed rate of 4000 frames per second. When a call is established, it is allocated 16 bits of space within each frame (in each direction).
This kind of network is synchronous because the 16 bits of space for the call have already been reserved in the next hop of the network.
There is no queueing; the maximum end-to-end latency of the network is fixed — bounded delay.
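For example, 4,000 frames per second × 16 bits per frame = 64,000 bits per second in each direction, the classic 64 kbit/s rate of a digital telephone circuit.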
Can we not simply make network delays predictable?
A circuit in a telephone network is very different from a TCP connection: a circuit is a fixed amount of reserved bandwidth which nobody else can use while the circuit is established, whereas the packets of a TCP connection opportunistically use whatever network bandwidth is available. While a TCP connection is idle, it doesn’t use any bandwidth (except for an occasional keepalive packet, if TCP keepalive is enabled).
If datacenter networks and the internet were circuit-switched networks, a maximum round-trip time could be guaranteed. But they are not: Ethernet and IP are packet-switched protocols, which suffer from queueing and thus unbounded delays in the network.
Datacenter networks and the Internet use packet switching — they are optimized for bursty traffic.
To transfer a file over a circuit, you would have to guess a bandwidth allocation in advance. Using circuits for bursty data transfers wastes network capacity. By contrast, TCP dynamically adapts the rate of data transfer to the available network capacity.
There have been some attempts to build hybrid networks that support both circuit switching and packet switching, such as ATM — Asynchronous Transfer Mode was a competitor to Ethernet in the 1980s.
Variable delays in networks are simply the result of a cost/benefit trade-off.
Network congestion, queueing, and unbounded delays will happen.
No “correct” value for timeouts — they need to be determined experimentally.
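One experimental approach is to derive the timeout from observed response times and their variability. This sketch is not from the book; the constants follow the smoothed round-trip-time estimator TCP uses for its retransmission timer (RFC 6298):

```java
// Adaptive timeout from an exponentially weighted moving average (EWMA) of
// observed round-trip times plus a multiple of their variability.
public class AdaptiveTimeout {
    private double srtt = -1;   // smoothed round-trip time (ms)
    private double rttvar = 0;  // round-trip time variability (ms)

    public void observe(double rttMs) {
        if (srtt < 0) {          // first sample initializes both estimators
            srtt = rttMs;
            rttvar = rttMs / 2;
        } else {
            rttvar = 0.75 * rttvar + 0.25 * Math.abs(srtt - rttMs);
            srtt = 0.875 * srtt + 0.125 * rttMs;
        }
    }

    public double timeoutMs() {
        return srtt + 4 * rttvar; // headroom for jitter
    }

    public static void main(String[] args) {
        AdaptiveTimeout t = new AdaptiveTimeout();
        for (double rtt : new double[] {100, 120, 90, 300, 110}) t.observe(rtt);
        System.out.printf("timeout = %.1f ms%n", t.timeoutMs());
    }
}
```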
Unreliable Clocks
Clocks and time are important. Applications depend on clocks in various ways:
- the request timeout (duration)
- 99th percentile response time of a service (duration)
- queries per second the service handled on average in the last five minutes (duration)
- how long did the user spend on our site? (duration)
- when was this article published? (points in time)
- at what date and time should the reminder email be sent? (points in time)
- when does the cache entry expire? (points in time)
- timestamp on this error message in the log file. (points in time)
Each machine has its own clock — a quartz crystal oscillator. These devices are not perfectly accurate, so each machine has its own notion of time.
To synchronize clocks, the Network Time Protocol (NTP) is used: it allows the computer clock to be adjusted according to the time reported by a group of servers. The servers in turn get their time from a more accurate time source, such as a GPS receiver. (The Precision Time Protocol (PTP) is a related protocol that can achieve even higher accuracy.)
Monotonic Versus Time-of-Day Clocks
Modern computers have at least two different kinds of clocks: a time-of-day clock and a monotonic clock.
Time-of-day clocks
- It returns the current date and time according to some calendar (wall-clock time).
- clock_gettime(CLOCK_REALTIME) on Linux, System.currentTimeMillis() in Java return the number of seconds (or milliseconds) since the epoch — midnight UTC on January 1, 1970, according to the Gregorian calendar, not counting leap seconds.
- Time-of-day clocks are usually synchronized with NTP. If the local clock is too far ahead of the NTP server, it may be forcibly reset and appear to jump back to a previous point in time. These jumps, as well as the fact that such clocks often ignore leap seconds, make time-of-day clocks unsuitable for measuring elapsed time.
Monotonic clocks
- It is suitable for measuring a duration (time interval), such as a timeout or a service’s response time.
- clock_gettime(CLOCK_MONOTONIC) on Linux, System.nanoTime() in Java are monotonic clocks.
- They are guaranteed to always move forward (whereas a time-of-day clock may jump back in time).
- The absolute value of the clock is meaningless: it might be the number of nanoseconds since the computer was started, or something similarly arbitrary.
- It makes no sense to compare monotonic clock values from two different computers, because they don’t mean the same thing.
- NTP may adjust the frequency at which the monotonic clock moves forward (slewing the clock) if it detects that the computer’s local quartz is moving faster or slower than the NTP server. NTP allows the clock rate to be sped up or slowed down by up to 0.05%, but NTP cannot cause the monotonic clock to jump forward or backward. Monotonic clocks can measure time intervals in microseconds or less.
- In a distributed system, using a monotonic clock for measuring elapsed time (timeouts) is usually fine, because it doesn’t assume any synchronization between different nodes’ clocks and is not sensitive to slight inaccuracies of measurement.
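A minimal illustration of the two kinds of clocks, using the Java APIs named above:

```java
public class ClockKinds {
    public static void main(String[] args) throws InterruptedException {
        // Time-of-day clock: meaningful absolute value, but NTP may step it.
        long wallStart = System.currentTimeMillis();

        // Monotonic clock: arbitrary absolute value, guaranteed to move
        // forward, so it is the right tool for measuring elapsed time.
        long monoStart = System.nanoTime();

        Thread.sleep(100); // the work being timed

        long elapsedMs = (System.nanoTime() - monoStart) / 1_000_000;
        long wallDiff = System.currentTimeMillis() - wallStart;
        System.out.println("elapsed (monotonic):  " + elapsedMs + " ms");
        System.out.println("elapsed (wall clock): " + wallDiff
                + " ms (could be distorted if NTP stepped the clock)");
    }
}
```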
Clock Synchronization and Accuracy
Monotonic clocks don’t need synchronization, but time-of-day clocks need to be set according to an NTP server or other external time source in order to be useful.
- Clock drift varies depending on the temperature of the machine.
- If a computer’s clock differs too much from an NTP server, it may refuse to synchronize, or the local clock will be forcibly reset.
- If your NTP daemon is misconfigured or if a node is accidentally firewalled off from NTP servers, the misconfiguration may go unnoticed for some time.
- NTP synchronization can only be as good as the network delay allows, so its accuracy suffers when you’re on a congested network with variable packet delays.
- Some NTP servers are wrong or misconfigured, reporting time that is off by hours. It is worrying to bet the correctness of your systems on the time that you were told by a stranger on the internet.
- Leap seconds can be handled by making NTP servers “lie”, performing the leap second adjustment gradually over the course of a day — smearing.
- In virtual machines, the hardware clock is virtualized.
- On mobile or embedded devices, you probably cannot trust the device’s hardware clock at all.
Relying on Synchronized Clocks
Robust software needs to be prepared to deal with incorrect clocks.
Any node whose clock drifts too far from the others should be declared dead and removed from the cluster.
Timestamps for ordering events
It is dangerous to rely on clocks for ordering of events across multiple nodes.
last write wins (LWW) — widely used in both multi-leader replication and leaderless databases such as Cassandra.
A node with a lagging clock is unable to overwrite values previously written by a node with a fast clock until the clock skew between the nodes has elapsed. Additional causality tracking mechanisms, such as version vectors, are needed.
logical clocks — based on incrementing counters rather than an oscillating quartz crystal, they are a safer alternative for ordering events. They capture only the relative ordering of events (whether one event happened before or after another). In contrast, time-of-day and monotonic clocks are known as physical clocks.
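A common construction is a Lamport clock (a minimal sketch; the book discusses logical clocks in general rather than this exact code). Each node keeps a counter that it increments on every local event, and on receiving a message it bumps the counter past the message’s timestamp, so timestamps respect causality:

```java
// Minimal Lamport clock: a per-node counter whose timestamps respect causality.
public class LamportClock {
    private long counter = 0;

    public synchronized long tick() {              // local event or message send
        return ++counter;
    }

    public synchronized long receive(long msgTs) { // message receipt
        counter = Math.max(counter, msgTs) + 1;
        return counter;
    }

    public static void main(String[] args) {
        LamportClock a = new LamportClock(), b = new LamportClock();
        long send = a.tick();        // A stamps an outgoing message with 1
        long recv = b.receive(send); // B's clock jumps past it, to 2
        System.out.println("send=" + send + ", recv=" + recv);
    }
}
```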
Clock readings have a confidence interval
Google’s TrueTime API in Spanner — reports the confidence interval on the local clock. It returns two values [earliest, latest], which are the earliest possible and the latest possible timestamp.
Synchronized clocks for global snapshots
If two confidence intervals do not overlap (A-earliest < A-latest < B-earliest < B-latest), then B definitely happened after A — there can be no doubt.
Spanner deliberately waits for the length of the confidence interval before committing a read-write transaction, so it needs to keep the clock uncertainty as small as possible.
Google deploys a GPS receiver or atomic clock in each datacenter, allowing clocks to be synchronized to within about 7 ms.
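The interval-overlap rule above is straightforward to express in code. TrueTime itself is internal to Google; this sketch only shows the comparison logic, with hypothetical types:

```java
// Order two events by their clock-confidence intervals: -1 if a definitely
// happened before b, 1 if definitely after, 0 if the intervals overlap and
// the order is uncertain.
public class IntervalOrder {
    record Interval(long earliest, long latest) {}

    static int compare(Interval a, Interval b) {
        if (a.latest() < b.earliest()) return -1;
        if (b.latest() < a.earliest()) return 1;
        return 0; // overlapping intervals: cannot tell which came first
    }

    public static void main(String[] args) {
        System.out.println(compare(new Interval(10, 17), new Interval(20, 27))); // -1
        System.out.println(compare(new Interval(10, 25), new Interval(20, 27))); // 0
    }
}
```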
There are distributed sequence number generators, such as Twitter’s Snowflake, that generate approximately monotonically increasing unique IDs in a scalable way (e.g. by allocating blocks of the ID space to different nodes). Typically, they cannot guarantee an ordering that is consistent with causality.
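A sketch of a Snowflake-style generator (the 41-bit timestamp / 10-bit worker ID / 12-bit sequence layout is the commonly described one; Twitter’s actual implementation details may differ):

```java
// Snowflake-style ID: [41-bit ms timestamp | 10-bit worker ID | 12-bit sequence].
// IDs from one worker are strictly increasing; across workers they are only
// approximately ordered and not guaranteed to be consistent with causality.
public class SnowflakeSketch {
    private final long workerId;      // 0..1023, assigned per node
    private long lastTimestamp = -1;
    private long sequence = 0;

    public SnowflakeSketch(long workerId) { this.workerId = workerId; }

    public synchronized long nextId() {
        long ts = System.currentTimeMillis(); // real versions subtract a custom epoch
        if (ts < lastTimestamp) {             // an unreliable clock strikes again
            throw new IllegalStateException("clock moved backwards");
        }
        if (ts == lastTimestamp) {
            sequence = (sequence + 1) & 0xFFF;  // 12-bit sequence within one ms
            if (sequence == 0) {                // exhausted: spin until the next ms
                while (ts <= lastTimestamp) ts = System.currentTimeMillis();
            }
        } else {
            sequence = 0;
        }
        lastTimestamp = ts;
        return (ts << 22) | (workerId << 12) | sequence;
    }

    public static void main(String[] args) {
        SnowflakeSketch gen = new SnowflakeSketch(1);
        System.out.println(gen.nextId());
        System.out.println(gen.nextId()); // strictly greater than the first
    }
}
```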
Process Pauses
lease — when a node obtains a lease, it knows that it is the leader for some amount of time, until the lease expires. In order to remain leader, the node must periodically renew the lease.
- Many programming language runtimes (such as the Java Virtual Machine) have a garbage collector (GC) that occasionally needs to stop all running threads — “stop the world” GC pauses.
- In virtualized environments, a virtual machine can be suspended and resumed.
- The operating system context-switches to another thread. The hypervisor switches to a different virtual machine. In the case of a virtual machine, the CPU time spent in other virtual machines is known as steal time. If the machine is under heavy load — there is a long queue of threads waiting to run — it may take some time before the paused thread gets to run again.
- A thread may be paused waiting for a slow disk I/O operation to complete.
- swapping to disk (paging) — The operating system may spend most of its time swapping pages in and out of memory and getting little actual work done (thrashing).
- A Unix process can be paused by sending it the SIGSTOP signal (Ctrl-Z) and can be resumed with SIGCONT.
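The chapter illustrates the danger of such pauses with a lease-renewal loop along these lines (a paraphrased sketch with stub types so it compiles; the helpers are hypothetical). If the thread pauses longer than the remaining lease time between the expiry check and process(), the node keeps acting as leader even though another node may already have taken over:

```java
public class LeaseLoop {
    // Stub lease with a time-of-day expiry; in a real system the expiry was
    // set by another machine's clock, which adds its own hazards.
    record Lease(long expiryTimeMillis) {
        boolean isValid() { return System.currentTimeMillis() < expiryTimeMillis; }
        Lease renew()     { return new Lease(System.currentTimeMillis() + 30_000); }
    }

    static String getIncomingRequest() { return "req"; } // hypothetical stub
    static void process(String request) {}               // hypothetical stub

    public static void main(String[] args) {
        Lease lease = new Lease(System.currentTimeMillis() + 30_000);
        while (true) { // runs forever, mirroring the book's sketch
            String request = getIncomingRequest();
            if (lease.expiryTimeMillis() - System.currentTimeMillis() < 10_000) {
                lease = lease.renew(); // renew when <10 s remain
            }
            if (lease.isValid()) {
                process(request); // <-- a long GC pause right here goes unnoticed
            }
        }
    }
}
```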
When writing multi-threaded code on a single machine, we have fairly good tools for making it thread-safe: mutexes, semaphores, atomic counters, lock-free data structures, blocking queues, and so on. Unfortunately, these tools don’t directly translate to distributed systems, because a distributed system has no shared memory — only messages sent over an unreliable network.
Response time guarantees
hard real-time systems — if the software doesn’t respond within a specified deadline, that may cause a failure of the entire system.
A real-time operating system (RTOS) that allows processes to be scheduled with a guaranteed allocation of CPU time in specific intervals is needed.
Limiting the impact of garbage collection
Treat GC pauses like brief planned outages of a node. The application can stop sending new requests to that node. Some latency-sensitive financial trading systems use this approach.
The leader and the lock
If a node continues acting as the chosen one even though the majority of nodes have declared it dead, it could cause problems in a system that is not carefully designed.
Fencing tokens
When using a lock or lease to protect access to some resource, such as file storage, we need to ensure that a node that is under a false belief of being “the chosen one” cannot disrupt the rest of the system.
Every time the lock server grants a lock or a lease, it also returns a fencing token, which is a number that increases every time a lock is granted (e.g. incremented by the lock service). Every write request to the storage service must include the current fencing token.
If ZooKeeper is used as the lock service, the transaction ID zxid or the node version cversion can be used as the fencing token. Since they are guaranteed to be monotonically increasing, they have the required properties.
This requires the resource itself to take an active role in checking tokens by rejecting any writes with an older token than the one that has already been processed — it is not sufficient to rely on clients checking their lock status themselves. A file storage service could include the fencing token in the filename.
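A sketch of the check on the resource side (a hypothetical in-memory stand-in for a real storage service; the tokens 33 and 34 mirror the book’s example):

```java
import java.util.HashMap;
import java.util.Map;

// The resource itself rejects writes carrying a token older than one it has
// already seen, so a client holding a stale lease cannot corrupt the data.
public class FencedStorage {
    private final Map<String, String> data = new HashMap<>();
    private long highestTokenSeen = -1;

    public synchronized boolean write(long fencingToken, String key, String value) {
        if (fencingToken < highestTokenSeen) {
            return false; // stale token: this client's lease has been superseded
        }
        highestTokenSeen = fencingToken;
        data.put(key, value);
        return true;
    }

    public static void main(String[] args) {
        FencedStorage storage = new FencedStorage();
        System.out.println(storage.write(33, "file", "old"));  // true
        System.out.println(storage.write(34, "file", "new"));  // true
        System.out.println(storage.write(33, "file", "late")); // false: rejected
    }
}
```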
Byzantine Faults
Byzantine fault — a node may claim to have received a particular message when in fact it didn’t. The problem of reaching consensus in this untrusting environment is known as the Byzantine Generals Problem.
The Byzantine Generals Problem
It is a generalization of the so-called Two Generals Problem, which imagines a situation in which two army generals need to agree on a battle plan.
Most of the generals are loyal, and thus send truthful messages, but the traitors may try to deceive and confuse the others by sending fake or untrue messages (while trying to remain undiscovered).
If a system continues to operate correctly even if some of the nodes are malfunctioning and not obeying the protocol, or if malicious attackers are interfering with the network — Byzantine fault-tolerant.
Flight control systems must tolerate Byzantine faults.
Peer-to-peer networks like Bitcoin and other blockchains can be considered to be a way of getting mutually untrusting parties to agree whether a transaction happened or not, without relying on a central authority.
Fault-tolerant embedded systems rely on support from the hardware level. In most server-side data systems, the cost of deploying Byzantine fault-tolerant solutions makes them impractical.
Web applications — input validation, sanitization, and output escaping are so important: to prevent SQL injection and cross-site scripting. However, we typically don’t use Byzantine fault-tolerant protocols here.
Traditional mechanisms (authentication, access control, encryption, firewalls, and so on) continue to be the main protection against attackers.
Weak forms of lying
Corrupted packets are usually caught by the checksums built into TCP and UDP, but sometimes they evade detection. A simple measure against this weak form of lying is to add checksums in the application-level protocol.
Another simple measure: check that a value is within a reasonable range, and limit the size of strings, to prevent denial of service through large memory allocations.
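A small sketch of such sanity checks (the limits here are illustrative):

```java
// Illustrative sanity checks on untrusted input: bound numeric ranges and
// string sizes before doing anything expensive with them.
public class InputChecks {
    static final int MAX_NAME_LENGTH = 256;

    static void validate(String name, int quantity) {
        if (name == null || name.isEmpty() || name.length() > MAX_NAME_LENGTH) {
            throw new IllegalArgumentException("name missing or too long");
        }
        if (quantity < 1 || quantity > 10_000) {
            throw new IllegalArgumentException("quantity out of range");
        }
    }

    public static void main(String[] args) {
        validate("widget", 3); // passes
        try {
            validate("x".repeat(1_000_000), 3);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```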
System Model and Reality
Formalize the kinds of faults that we expect to happen in a system by defining a system model — an abstraction that describes what things an algorithm may assume.
With regard to timing assumptions, three system models are in common use:
- Synchronous model — assumes bounded network delay, bounded process pauses, and bounded clock error; these will never exceed some fixed upper bound. Not a realistic model.
- Partially synchronous model — the system behaves like a synchronous system most of the time, but it sometimes exceeds the bounds for network delay, process pauses, and clock drift. This is a realistic model of many systems.
- Asynchronous model — an algorithm is not allowed to make any timing assumptions; it does not even have a clock (so it cannot use timeouts). It is very restrictive.
We have to consider node failures.
- Crash-stop faults — the node is gone forever, it never comes back.
- Crash-recovery faults — nodes are assumed to have stable storage (i.e., nonvolatile disk storage) that is preserved across crashes, while the in-memory state is assumed to be lost.
- Byzantine (arbitrary) faults
The partially synchronous model with crash-recovery faults is generally the most useful model.
Correctness of an algorithm
To generate fencing tokens for a lock, an algorithm needs to have these properties: uniqueness, monotonic sequence, and availability.
Safety and liveness
Safety — nothing bad happens.
Liveness — something good eventually happens.
uniqueness and monotonic sequence — safety properties
availability, eventual consistency — liveness properties
For distributed algorithms, it is common to require that safety properties always hold.
Safety and liveliness properties and system models are very useful for reasoning about the correctness of a distributed algorithm.
Happy Coding!
References:
- Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems, by Martin Kleppmann