distributed systems for fun and profit
Tags: distributed systems, articles
Chapter 1
- Scalability can be divided into 3:
- Size: can you add more nodes? can you grow the dataset w/o increasing latency?
- Geographic: can you add more DC’s? What about cross dc latency?
- Administrative: as you add more nodes, is it harder to maintain the system?
- Performance & latency
- Latency is not the difference between you and getting the data, it’s between you and seeing the effects of the data
- Availability & fault tolerance
- n nines?
- Distributed systems are constrained by # of nodes and distance between nodes
- Typically means:
- Increasing the # of independent nodes increases the probability of failure
- Increasing # of independent nodes increases the need for communication
- Increasing the geographic distance increases the lower latency bound
- Often simple abstractions don’t work for distributed systems
- Performance is often gained by exposing more details about the internals of the system
- Columnar storage gives data locality information in exchange for performance
- Performance is often gained by exposing more details about the internals of the system
- Design techniques: partition and replicate
- partition divides the dataset into smaller distinct independent sets
- limits the impact of dataset growth
- increases availability by allowing partitions to fail independently
- replication:
- replication improves performance by making additional computing power available over new copy of data
- increases availability by creating additional copies of data
- sync issues
- partition divides the dataset into smaller distinct independent sets
Chapter 2: CAP and FLP
- All abstractions are ultimately fake
- abstractions make the world more managable
- basic model
- programs run:
- concurrently on independent nodes
- connected by a network that can introduce nondeterimism and message loss
- have no shared memory or shared lock
- implies:
- each node runs programs concurrently
- knowledge is local, fast access to local state, global state is slow and potentially out of date
- nodes can fail and recover independenly
- messages can be delayed or lost
- clocks are not synchronized across nodes
- nodes are:
- can execute a program
- have access to volatile memory (will be lost upon failure) and stable state (can be read after failure)
- a clock - may or may not be accurate
- communication links
- considered to be FIFO
- generally considered to be unreliable
- network partition is when the network fails but the nodes are up
- nodes may be accessed by different clients
- time ordering assumptions
- sync and async
- sync timing model imposes that each node experiences things in lockstep
- async
- timing cannot be relied on and time sensors are unreliable
- sync and async
- consensus problems
- agreement
- integrity
- termination
- validity
- programs run:
- impossbility results
- FLP impossbility result aka impossibility of consensus with one faulty process
- algorithms must give up safety or liveniess
- cap theorm
- FLP impossbility result aka impossibility of consensus with one faulty process
Chapter 3: Time and Order
- any system that can only do one thing at a time will create an order
- Partial and total order