distributed systems
Tags: computers
- http://muratbuffalo.blogspot.com/2021/02/foundational-distributed-systems-papers.html
- https://github.com/stevana/armstrong-distributed-systems/tree/main
Short Introduction
- http://book.mixu.net/distsys/single-page.html
- http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.41.7628
System Design
High Level Methodology
- https://static.googleusercontent.com/media/sre.google/en//static/pdf/nalsd-workbook-letter.pdf
- Do one thing and do it well
- Services should be split up into ones that each do one thing; this makes performance analysis easier and makes it easier for the system to evolve over time
- Prefer simple, stateless services where possible
- State must be kept, but try to reduce the surface area of the state. Stateless components are easy to manage because they’re trivial to scale horizontally
- Assumptions, constraints, and SLOs
- Before you start designing, make sure to look at the constraints and non-functional requirements for your system. Have they been stated? Are they comprehensive?
- At minimum you should understand
- What the system is supposed to do (business logic)
- What the SLOs are around performance, error rates/availability, and data loss. If there are no SLOs, you may need to define them with the business owners
- Where should it run? What is the budget or what hardware is available?
- Expected workload (this may be difficult to gauge)
- Getting to an initial high level design (or analysing an existing one)
- API
- You need to figure out the API for the system. What operations are available externally? What data goes in, and what results come out? Do operations change the system state, or are they read-only?
- Each set of closely related operations is likely to be a service in the system. Many systems have a stateless API layer that coordinates with other parts of the system to serve results
- Follow the data
- After the API, figure out the data flow. What state does the system need to read and/or write in order to do the work? What do we know about the state?
- Data consistency - critical data
- Some applications require very consistent views of critical state
- This should be tackled with distributed consensus algorithms to manage state changes, using 3, 5, or more replicas in different failure domains
- FastPaxos uses a stable leader process to improve performance. One round trip between the leader and a quorum of the closest replicas is enough to commit an update or serve a consistent read. The client must communicate with the leader, so the latency the client perceives is the client-to-leader RTT plus the leader-to-nearest-quorum RTT. Stale reads only need a round trip to the nearest healthy replica, but the data may not be consistent. Transactions can also be batched
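- A minimal back-of-the-envelope sketch of that perceived commit latency under a leader-based scheme (the RTT numbers below are made-up examples, not measurements):
```python
# Rough latency model for a leader-based consensus commit (illustrative numbers only).

def commit_latency_ms(client_to_leader_rtt, leader_to_replica_rtts, quorum_size):
    """Perceived latency = client->leader RTT + leader->nearest-quorum RTT."""
    # The leader counts as one vote, so it only needs acks from the nearest
    # (quorum_size - 1) other replicas; the commit is gated by the slowest of those.
    nearest = sorted(leader_to_replica_rtts)[:quorum_size - 1]
    return client_to_leader_rtt + max(nearest)

# Example: 5 replicas, quorum of 3, client is 20ms from the leader.
print(commit_latency_ms(20, [1, 5, 30, 45], 3))  # -> 20 + 5 = 25ms
```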
- Data consistency - less critical data
- Not all data needs to be highly consistent; in these cases replication + backups is often enough
- Data storage considerations
- There are lots of things to think about when storing data; the most fundamental is how you are going to organize the data in RAM and on disk
- Think about how data is read and written. Which is the most common use case? Usually it’s reading.
- Think about storing data to match the most common use case. If you’re reading in a chronological order by user, consider storing it like that
- Sharding data
- Data obviously needs to be sharded, but watch out for hot shards
- Easiest way to shard is m shards on n servers, where m > 100x n
  - shard = hash(key) modulo m
  - keep a map of shard -> server that is looked up on each request (needs STRONG CONSISTENCY)
  - the system can dynamically move shards and update the map to rebalance based on some QPS count
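- A minimal sketch of this scheme; a plain dict stands in for the shard -> server map, which in a real system would live in a strongly consistent store:
```python
import hashlib

M_SHARDS = 1024                                  # m: fixed number of logical shards, m >> n
SERVERS = ["server-0", "server-1", "server-2"]   # n servers (example names)

# shard -> server map; in practice kept in a strongly consistent store so
# every client sees the same assignment.
shard_map = {shard: SERVERS[shard % len(SERVERS)] for shard in range(M_SHARDS)}

def shard_for(key: str) -> int:
    # shard = hash(key) modulo m; use a stable hash, not Python's randomized hash()
    digest = hashlib.sha1(key.encode()).hexdigest()
    return int(digest, 16) % M_SHARDS

def server_for(key: str) -> str:
    return shard_map[shard_for(key)]

def move_shard(shard: int, new_server: str) -> None:
    # Rebalancing = moving a hot shard and updating the map, driven by per-shard QPS.
    shard_map[shard] = new_server

print(server_for("user:42"))
```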
- Sharded and replicated datastores should have automatic mechanisms that cross-check state with other replicas and load lost state from peers; this is useful after downtime and when new replicas are added. Note that this resync should also be rate-limited to avoid thundering herd problems when many replicas are out of sync at once
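- One way to sketch that rate limit is a fixed number of concurrent resync slots; the cap of 2 below is arbitrary and the resync body is a placeholder:
```python
import threading, time, random

# At most this many replicas resync from peers concurrently, so a mass restart
# doesn't turn into a thundering herd against the healthy peers.
RESYNC_SLOTS = threading.Semaphore(2)

def resync_replica(replica_id: int) -> None:
    with RESYNC_SLOTS:
        # Placeholder for "cross-check state with peers and copy anything missing".
        time.sleep(random.uniform(0.1, 0.3))
        print(f"replica {replica_id} resynced")

threads = [threading.Thread(target=resync_replica, args=(i,)) for i in range(6)]
for t in threads: t.start()
for t in threads: t.join()
```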
- Scaling and Performance
- Make sure to do some back of the envelope calculations to estimate the potential performance of the system, any scaling limits created by the bottlenecks in the system, and how many machines are likely to be needed to service the given workload
- Calculations to use
- Disk - sanity check the volume of data read/written, disk seeks over time
- RAM - how much can we cache? Are we storing an index of some kind?
- Bandwidth - size of requests inwards and outwards, do we have enough bandwidth?
- Bandwidth between datacenters and bandwidth between machines in the same DC
- CPU - is this a computationally intensive service? - this can be hard to gauge
- What are the estimated concurrent transactions for a service (compute based on throughput per second and latency, e.g. 500 requests per second, 200ms average latency -> 500/(1000/200) -> 100 transactions in flight) - is this a reasonable number? Each request needs some RAM to manage state and some CPU to process
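- A quick sketch of that concurrency estimate (Little's law: in-flight = throughput x latency), plus the same style of arithmetic for bandwidth; all inputs are example numbers:
```python
# Back-of-the-envelope sizing; all numbers below are example inputs, not measurements.

def in_flight_requests(qps: float, avg_latency_ms: float) -> float:
    # Little's law: concurrent requests = arrival rate * time each request is in the system.
    return qps * (avg_latency_ms / 1000.0)

def bandwidth_mbps(qps: float, avg_response_bytes: float) -> float:
    return qps * avg_response_bytes * 8 / 1e6

print(in_flight_requests(500, 200))     # 500 QPS at 200ms -> 100 requests in flight
print(bandwidth_mbps(500, 64 * 1024))   # 500 QPS of 64KB responses -> ~262 Mb/s outbound
```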
- Iterating and refining
- After a complete design, time to take another look. You might’ve unearthed new constraints
- Consider
- Does it work correctly?
- Are you meeting your SLOs and constraints?
- Is it as simple as it could be?
- Is it reliable enough?
- Is it making efficient use of hardware?
- What assumptions did you make and are there risks?
- Are there places you can improve caching, indexing, lookup filters like Bloom filters? Can you degrade gracefully when overloaded by doing cheaper operations?
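- For example, a Bloom filter can cheaply answer "definitely not present" before paying for a disk or network lookup. A minimal sketch; the bit-array size and hash count are picked by hand here, whereas a real deployment sizes them from the expected element count and target false-positive rate:
```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: answers 'maybe present' or 'definitely not present'."""

    def __init__(self, size_bits: int = 8192, num_hashes: int = 4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key: str):
        # Derive num_hashes positions from salted SHA-256 digests of the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))   # True
print(bf.might_contain("user:43"))   # almost certainly False -> skip the expensive lookup
```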
- Monitoring
- Monitoring should be included in the design phase of any distributed system
- It needs to
- Detect outages (and alert)
- Provide information on whether you’re meeting your SLOs
- Give indications of growth and system capacity
- Assist in troubleshooting outages and less critical problems
- Design metrics for your system and indicate how you will collect data. What are your alert thresholds/conditions?
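- A sketch of writing those metric and alert definitions down alongside the design; the metric names and thresholds are invented placeholders, not a real monitoring config:
```python
# Hypothetical metric/alert definitions, kept as data so they can be reviewed with the design.
METRICS = {
    "requests_total":      "counter, labeled by endpoint and status code",
    "request_latency_ms":  "histogram, per endpoint",
    "replica_lag_seconds": "gauge, per replica",
}

ALERTS = [
    # (name, condition, rationale)
    ("HighErrorRate",  "5xx / total > 1% over 5 min",             "detect outages; ties to availability SLO"),
    ("SlowRequests",   "p99 request_latency_ms > 100 for 10 min", "ties to latency SLO"),
    ("CapacityGrowth", "peak QPS > 70% of provisioned capacity",  "growth / capacity planning signal"),
]

for name, condition, why in ALERTS:
    print(f"{name}: alert when {condition} ({why})")
```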
- What could go wrong?
- Answer these questions
- What is the scariest thing that can happen to your system?
- How does your system defend against it?
- Will it continue to meet its SLO? How can you recover, if not?
- Can improvements be made to the system to prevent the problem, and what would they cost?
- Consider
- Loss of a data center
- Loss of important state/data
- Bugs in critical software component
- Load spikes
- Queries/requests of death that can crash the process
- 10x growth in workload - will it scale?
- Change in data access patterns changes cache hit rate
Pub Sub
Theory
Verification
Fallacies
- https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing
- https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf
Errors and Examples
- https://github.com/dranov/protocol-bugs-list
- https://covemountainsoftware.com/2021/03/01/a-survey-of-concurrency-bugs/
Lamport writings
SRE Classroom: Designing a distributed systems in 3 hours
- https://www.usenix.org/conference/srecon19americas/presentation/thomas
- 4 key areas
- Requirements and scaling
- Identify service level indicators (SLIs and SLOs)
- Data freshness
- Availability
- Latency
- Sample SLO: 99% of queries return a valid result within 100ms (a check against this SLO is sketched after this section)
- An SLA is the contract containing all relevant SLOs and the penalties if they are violated
- Make sure to consider future scale. What about 10x?
- Scaling
- Microservices are about decomposing binaries into separate services, but the practice existed long before the term
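- A minimal sketch of checking measured latencies against the sample SLO above; the latency samples here are synthetic:
```python
import random

def slo_compliance(latencies_ms, threshold_ms=100.0):
    """Fraction of queries that returned a valid result within the latency threshold."""
    within = sum(1 for l in latencies_ms if l <= threshold_ms)
    return within / len(latencies_ms)

# Synthetic latencies (exponential, 40ms mean) just to exercise the check.
samples = [random.expovariate(1 / 40.0) for _ in range(10_000)]
print(f"within 100ms: {slo_compliance(samples):.2%} (target: 99%)")
```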
- Dealing with loss
- Possible failure domains
- single process
- machine
- rack
- datacenter
- servers with the same binary
- global config
- If you’re using cloud, you still need to think about how these failure domains decompose
- How do we guard?
- Decouple responsibilities
- Avoid global changes: use a multi-tiered canary
- Spread risk: don’t depend on one backend
- Degrade gradually: keep serving if configs are corrupt or fail to push
- Achieving reliability
- Generally run N+2
- N = deployment large enough to deal with standard load
- If it takes a single DC, run 3
- If it takes a single server, run 3
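- A tiny sketch of the N+2 rule; the capacity numbers are illustrative:
```python
import math

def replicas_needed(peak_qps: float, qps_per_replica: float, spares: int = 2) -> int:
    """N+2: enough replicas for peak load (N), plus one for upgrades and one for failure."""
    n = math.ceil(peak_qps / qps_per_replica)
    return n + spares

print(replicas_needed(peak_qps=900, qps_per_replica=400))  # N=3 -> run 5
# If a single server (or DC) can carry the whole load, N=1 -> run 3.
print(replicas_needed(peak_qps=300, qps_per_replica=400))
```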
- Keeping state and data
- Regardless of the amount of data involved, it all comes down to global consensus
- Find authoritative instances of other services
- Quick discovery of sharded data
- Various algorithmic implementations for this: paxos, raft
- The CAP theorem applies because networks are unreliable, but partitions are actually pretty rare in practice
- But you do need to converge after a partition
- Caches!
- Caches reduce load on the backend, but distinguish a performance cache from a capacity cache; a capacity cache can be dangerous
- A performance cache is put in place to improve performance
- A capacity cache is used to handle additional traffic, but this is dangerous
- What happens if you push a new release? Cold cache? What happens when new keys show up?
- Generally do not include caches into SLOs
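- A sketch of why a capacity cache is dangerous: with a cold cache after a release, the backend suddenly sees the full request rate. The numbers below are examples:
```python
def backend_qps(total_qps: float, cache_hit_rate: float) -> float:
    # Only cache misses reach the backend.
    return total_qps * (1 - cache_hit_rate)

TOTAL_QPS = 10_000
BACKEND_CAPACITY_QPS = 4_000   # sized assuming the cache absorbs most reads

print(backend_qps(TOTAL_QPS, 0.90))  # warm cache: 1,000 QPS, comfortably under capacity
print(backend_qps(TOTAL_QPS, 0.0))   # cold cache after a release: 10,000 QPS -> overload
```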
- Non-abstract design
- Google likes to come out with a BOM (bill of materials)
- For this, look at what the bottleneck is
- Disk IO/QPS/Network bandwidth
- Round so the arithmetic is easy: 1 day has 25 hours
- Each month has 30 days
- Each year has 300 days or 400
- For each service you come up with, think about the following questions:
- Can it be cached?
- Can it scale horizontally?
- What’s the latency?
- Is it sharded?
- Figure out metadata size for files!
- How much can the NICs handle?
- Assume maybe 1 Gb/s per NIC
- Usually you’re dominated by 3 things
- network
- storage
- timings (latency)
- Is there an SLO for consistency?
- Something to keep in mind: different parts of the overall system will have different bottlenecks
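- A sketch of the bill-of-materials style arithmetic; the workload and per-machine specs are invented for illustration:
```python
import math

# Hypothetical workload: 5 PB dataset, 200 Gb/s peak egress.
PEAK_EGRESS_GBPS = 200
DATASET_TB = 5_000

# Hypothetical per-machine specs.
NIC_GBPS_PER_MACHINE = 1        # ~1 Gb/s per NIC, as assumed above
DISK_TB_PER_MACHINE = 16
REPLICATION_FACTOR = 3

machines_for_network = math.ceil(PEAK_EGRESS_GBPS / NIC_GBPS_PER_MACHINE)
machines_for_storage = math.ceil(DATASET_TB * REPLICATION_FACTOR / DISK_TB_PER_MACHINE)

# Whichever dimension is the bottleneck dictates the machine count.
print(f"network-bound: {machines_for_network}, storage-bound: {machines_for_storage}")
print(f"BOM: {max(machines_for_network, machines_for_storage)} machines")
```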
Examples
storage
GFS (Google File System)
Colossus
Tectonic
Databases
BigTable
Megastore
Spanner
DynamoDB
Links to this note
- 2021 goals
- actor model
- broadcasting and multicast
- bully algorithm
- byzantine fault tolerance (BFT)
- cap theorm (brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services
- Cassandra
- clocks (distributed systems)
- crdts
- dataflow
- distributed systems for fun and profit
- distributed transactions
- Eldeeb et al: Chablis: Fast and General Transactions in Geo-distributed Systems
- epaxos
- gossip protocols
- gray failures
- Homogenous Time: Ibn Khaldun and Transaction Logs
- impossibility of consensus with one faulty process
- kubernetes (k8s)
- load balancing
- metastable failures blog post
- paxos
- physalia - amazon ebs
- raft
- windows failover clusters (hci)