System Design Interview — Study Notes II — Designing Data Intensive Applications Book Notes
Notes on Concepts & Components to be used in a System Design Interview
Chapter 1 — Reliable, Scalable, and Maintainable Applications
Reliability: working correctly even in the face of adversity. fault-tolerant, resilient.
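- a fault is not the same as a failure: a fault is one component deviating from its spec, a failure is the system as a whole no longer providing the required service. design fault-tolerance mechanisms that prevent faults from causing failures.
- see Netflix Chaos Monkey, which deliberately triggers faults (e.g. randomly killing processes) so that the fault-tolerance machinery is continually exercised.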
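A minimal sketch of the fault vs. failure distinction, assuming a retry loop as the fault-tolerance mechanism. The names `call_with_retries` and `make_flaky_replica` are hypothetical and only for illustration; real systems would usually add backoff, timeouts and failover on top of this.

```python
def call_with_retries(operation, attempts=3):
    """Retry a flaky operation so that a single fault does not become a failure."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except ConnectionError as err:  # the fault: one component misbehaves
            last_error = err
    # Only when every attempt faults does the caller see a failure.
    raise RuntimeError("service unavailable") from last_error


def make_flaky_replica(faults_before_success):
    """Hypothetical replica that faults a fixed number of times, then answers."""
    state = {"remaining": faults_before_success}

    def read():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise ConnectionError("replica timed out")
        return "value"

    return read


# Two faults are absorbed by the retries; the caller still gets an answer.
result = call_with_retries(make_flaky_replica(2))
```

With more faults than attempts (e.g. `make_flaky_replica(5)`), the faults surface as a failure and `RuntimeError` is raised.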
Scalability: a system’s ability to cope with increased load (data volume, traffic volume, complexity).
Scaling up: vertical scaling. moving to a more powerful machine.
Scaling out: horizontal scaling. distributing the load across multiple smaller machines. shared-nothing architecture.
load parameters: depends on your architecture; requests per second to a web server, the ratio of reads to writes in a database, the hit rate on a cache.
throughput: the number of records we can process per second, or the total time it takes to run a job on a dataset of a certain size.
skew: data not being spread evenly across worker processes, so the job has to wait for the slowest task to complete.
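A rough illustration of how skew hurts throughput, assuming every worker processes records at the same fixed rate and all workers start together. The record counts and the rate are made-up numbers.

```python
def job_time(records_per_worker, records_per_second):
    """A parallel job finishes only when its slowest (most loaded) worker finishes."""
    return max(records_per_worker) / records_per_second


rate = 50                          # records per second per worker (assumed)
balanced = [250, 250, 250, 250]    # 1000 records spread evenly
skewed = [700, 100, 100, 100]      # same 1000 records, one hot worker

balanced_time = job_time(balanced, rate)   # 5.0 seconds
skewed_time = job_time(skewed, rate)       # 14.0 seconds

# Same data, same hardware, but the skewed job has much lower throughput.
balanced_throughput = sum(balanced) / balanced_time   # 200.0 records/s
skewed_throughput = sum(skewed) / skewed_time         # about 71.4 records/s
```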
response time: the time between a client sending a request and receiving a response.
- latency and response time are not the same;
latency: the duration a request is waiting to be handled, awaiting service.
response time: what the client sees. the service time (the actual time to process the request) + network delays + queueing delays. it varies even between identical requests, e.g. because of a context switch to a background process, the loss of a network packet and TCP retransmission, a garbage collection pause, a page fault forcing a read from disk, mechanical vibrations in the server rack, etc.
- the average (arithmetic mean) response time is commonly reported, but it is not a good metric for the typical experience because it doesn't tell you how many users actually experienced that delay.
- use percentiles instead. beware that averaging percentiles is mathematically meaningless; the right way of aggregating response time data is to add the histograms.
- median (50th percentile; p50) is a good metric if you want to know how long users typically have to wait.
- to figure out how bad your outliers are, look at tail latencies, i.e. the higher percentiles; p95, p99 and p999 (99.9th percentile) are common.
- percentiles are often used in service level objectives (SLOs) and service level agreements (SLAs); contracts defining the expected performance and availability of service.
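A small sketch of why percentiles from different machines should be aggregated by adding histograms rather than by averaging. It assumes the nearest-rank definition of a percentile and two hypothetical servers with made-up response times in milliseconds; `histogram_percentile` is an illustrative helper, not a library function.

```python
import math
from collections import Counter


def histogram_percentile(histogram, p):
    """Nearest-rank percentile read from a {response_time_ms: count} histogram."""
    total = sum(histogram.values())
    rank = max(1, math.ceil(p * total / 100))
    seen = 0
    for value in sorted(histogram):
        seen += histogram[value]
        if seen >= rank:
            return value
    raise ValueError("empty histogram")


# Hypothetical data: a lightly loaded server with two slow outliers,
# and a busy server that is uniformly fast.
server_a = Counter([10] * 98 + [1000] * 2)   # 100 requests
server_b = Counter([20] * 1000)              # 1000 requests

p99_a = histogram_percentile(server_a, 99)   # 1000
p99_b = histogram_percentile(server_b, 99)   # 20

# Averaging the two p99 values ignores how many requests each server handled.
averaged_p99 = (p99_a + p99_b) / 2           # 510.0

# Adding the histograms first gives the p99 over all 1100 requests.
merged = server_a + server_b
true_p99 = histogram_percentile(merged, 99)  # 20
```

The averaged figure (510 ms) describes neither server nor the combined traffic, while the merged histogram shows that 99% of all requests finished within 20 ms.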
Maintainability: Operability + Simplicity + Evolvability
Operability: making it easy for operations teams to keep the system running smoothly. monitoring, restoring, security patches, understanding how different systems affect each other, capacity planning, tools for deployment and configuration management.
Simplicity: making it easy for new engineers to understand the system. use good abstractions; avoid tight coupling of modules, tangled dependencies, inconsistent naming and terminology, and special cases added to work around issues.
Evolvability: making it easy for engineers to make changes. extensibility, modifiability, plasticity. test-driven development, refactoring, agility.
An application has functional and nonfunctional requirements.
Functional Requirements: what it should do, such as allowing data to be stored / retrieved / searched / processed in various ways.
Nonfunctional Requirements: security, reliability, compliance, scalability, compatibility, maintainability.
Happy Coding!
References:
Martin Kleppmann, Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems