distributed transactions
Tags: distributed systems
- early 2000s
- sharding sucks
- no cross-shard transactions
- NoSQL after 2004
- NewSQL after 2010
- Spanner in 2012
- CockroachDB in 2014
- tradeoff is increased latency
keyspace and sharding
- two main questions
- how do I locate a piece of data?
- at what granularity is it distributed?
- options (see the sketch after this list)
- hashing
- order preserving
- needed when you want an index over a range
- i.e. for range scans
- shards
- small enough to be moved
- large enough to amortize indexing overhead
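A minimal sketch of the two options, assuming a made-up four-shard cluster and invented split points; hashing spreads keys evenly but scatters adjacent keys, while order-preserving (range) sharding keeps them contiguous:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// hashShard spreads keys evenly, but adjacent keys land on unrelated
// shards, so a range scan has to touch every shard.
func hashShard(key string, numShards int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(numShards))
}

// rangeShard keeps keys ordered: splits are sorted upper bounds, shard i
// owns keys below splits[i], and the final shard is unbounded above.
func rangeShard(key string, splits []string) int {
	return sort.Search(len(splits), func(i int) bool { return key < splits[i] })
}

func main() {
	splits := []string{"g", "n", "t"} // hypothetical split points: 4 range shards
	for _, k := range []string{"apple", "mango", "zebra"} {
		fmt.Printf("%-6s  hash shard %d   range shard %d\n",
			k, hashShard(k, 4), rangeShard(k, splits))
	}
}
```

A scan over a key interval hits every hash shard but only the contiguous range shards, which is the "for scanning" point above.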
replication and fault tolerance
- primary secondary replication
- async replication: the primary acks the client before replicating, so a primary failure can lose acked writes
- primary secondary is very tough to get correct
- almost always need a third party to arbitrate (to avoid split brain)
- how do we replicate to N nodes?
- use a consensus protocol
- how do we get stuff onto the followers?
- a write is usually considered written once a quorum of replicas has acked it (sketch below)
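A sketch of that quorum rule; the follower "append" here is a stand-in that randomly fails to simulate crashes and partitions:

```go
package main

import (
	"fmt"
	"math/rand"
)

// replicateToAll sends the entry to every replica and returns true as soon
// as a majority acknowledges; slow or dead replicas can't block the write.
func replicateToAll(numReplicas int, entry string) bool {
	acks := make(chan bool, numReplicas)
	for i := 0; i < numReplicas; i++ {
		go func(id int) {
			// hypothetical follower append: random failure stands in
			// for a crashed or partitioned node
			acks <- rand.Intn(10) > 1
		}(i)
	}
	quorum := numReplicas/2 + 1
	got := 0
	for i := 0; i < numReplicas; i++ {
		if <-acks {
			got++
		}
		if got >= quorum {
			return true // durable: a majority holds the entry
		}
	}
	return false // fewer than a majority acked, e.g. during a partition
}

func main() {
	fmt.Println("write committed:", replicateToAll(5, "set x=1"))
}
```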
- failure
- ranges are the unit of replication
- consensus algorithm
- multi-paxos or raft
- generally has a “replication factor” (typically an odd number so a clean majority quorum exists)
- raft
- raft provides atomic replication
- a “leaseholder” allows for non-quorum reads
- range leases
- for a time-boxed interval, the leaseholder can serve reads locally with the same guarantee as a quorum read (sketch below)
- if the leaseholder fails, nothing else can become the leaseholder until the lease expires
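A sketch of the lease idea with invented field names; real systems (CockroachDB included) also have to bound clock skew for this to be safe:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// lease grants one replica the exclusive right to serve a range's reads
// until expiry; since no other replica can hold it concurrently, the
// holder can answer from local state without a quorum round trip.
type lease struct {
	holder string
	expiry time.Time
}

func (l lease) read(replica, key string, local map[string]string) (string, error) {
	if replica != l.holder {
		return "", errors.New("not the leaseholder")
	}
	if time.Now().After(l.expiry) {
		// lapsed: re-acquiring the lease is itself a quorum operation,
		// and until then nobody can serve, which is the failure mode above
		return "", errors.New("lease expired")
	}
	return local[key], nil // local read, as consistent as a quorum read
}

func main() {
	l := lease{holder: "n1", expiry: time.Now().Add(9 * time.Second)}
	data := map[string]string{"x": "1"}
	fmt.Println(l.read("n1", "x", data)) // served locally by the leaseholder
	fmt.Println(l.read("n2", "x", data)) // refused: n1 still holds the lease
}
```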
replica placement
- how do we choose where replicas go?
- hundreds of nodes
- choose where they lie based on
- space, diversity, load, latency
- diversity
- spread replicas across failure domains (rack, datacenter, region)
- the tradeoff here is always latency
- load
- keep moving things around based on load
- avoid hot ranges
- balance as best you can (scoring sketch below)
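A toy scoring function over those four criteria; the fields and weights are invented for illustration, not taken from any real allocator:

```go
package main

import "fmt"

type node struct {
	name      string
	freeSpace float64 // fraction of disk free, 0..1
	load      float64 // relative load, 0..1
	sameRack  bool    // shares a failure domain with an existing replica?
	latencyMs float64 // latency to the range's main clients
}

// score prefers space and load headroom, strongly penalizes sharing a
// failure domain with an existing replica, and charges for distance.
func score(n node) float64 {
	s := 2*n.freeSpace - 1.5*n.load - 0.01*n.latencyMs
	if n.sameRack {
		s -= 3 // diversity: avoid co-locating replicas
	}
	return s
}

func main() {
	candidates := []node{
		{"n1", 0.8, 0.2, false, 5},
		{"n2", 0.9, 0.1, true, 5},   // best stats, worst diversity
		{"n3", 0.5, 0.4, false, 40}, // diverse but far away: the latency tradeoff
	}
	best := candidates[0]
	for _, n := range candidates[1:] {
		if score(n) > score(best) {
			best = n
		}
	}
	fmt.Println("place replica on", best.name)
}
```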
- range splitting
- split a range when it grows too large or too hot (sketch below)
- move replicas around after range splitting
- ranges can also be split and placed by geography to reduce latency
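A sketch of the split decision; the size and QPS thresholds are hypothetical:

```go
package main

import "fmt"

type rng struct {
	startKey, endKey string
	sizeMB, qps      float64
}

// maybeSplit splits a range that has grown too big or too hot; after the
// split, the two halves can be rebalanced onto different nodes independently.
func maybeSplit(r rng, splitKey string) []rng {
	const maxSizeMB, maxQPS = 512, 2500 // hypothetical thresholds
	if r.sizeMB < maxSizeMB && r.qps < maxQPS {
		return []rng{r} // small and cool: leave it alone
	}
	return []rng{
		{r.startKey, splitKey, r.sizeMB / 2, r.qps / 2},
		{splitKey, r.endKey, r.sizeMB / 2, r.qps / 2},
	}
}

func main() {
	fmt.Println(maybeSplit(rng{"a", "n", 600, 900}, "g"))
}
```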
- rebalancing
- adding a new node: any system should be able to handle this
- rebalance replicas onto it (sketch below)
- automate repair when a node dies
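A toy rebalancer: when the replica-count spread gets too wide (for example after a node joins with zero replicas), move replicas from the fullest node to the emptiest. Names and the one-at-a-time policy are invented for illustration:

```go
package main

import "fmt"

// rebalance moves replicas one at a time from the most loaded node to the
// least loaded until the spread is within one replica; a fresh node (count 0)
// or a dead one just changes the map, and the same loop repairs the balance.
func rebalance(replicaCount map[string]int) {
	for {
		max, min := "", ""
		for n, c := range replicaCount {
			if max == "" || c > replicaCount[max] {
				max = n
			}
			if min == "" || c < replicaCount[min] {
				min = n
			}
		}
		if replicaCount[max]-replicaCount[min] <= 1 {
			return // balanced
		}
		replicaCount[max]--
		replicaCount[min]++ // in a real system: copy the data, then switch over
	}
}

func main() {
	counts := map[string]int{"n1": 40, "n2": 38, "n3": 39, "n4": 0} // n4 just joined
	rebalance(counts)
	fmt.Println(counts)
}
```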
cockroach API
- originally did not use SQL
- used its own (somewhat weird) key-value transaction API
- SQL support came later: convert SQL statements into key-value operations
- map SQL rows onto the key-value store (sketch below)
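A sketch of the SQL-to-KV mapping idea, loosely in the spirit of CockroachDB's documented key layout (table, then index, then primary key); the string keys here are a simplification of the real binary, order-preserving encoding:

```go
package main

import "fmt"

// rowToKV maps one SQL row to one KV pair: the key encodes table, index,
// and primary key, so all of a table's rows are adjacent in the keyspace.
func rowToKV(table, pk string, cols map[string]string) (string, string) {
	key := fmt.Sprintf("/%s/primary/%s", table, pk)
	return key, fmt.Sprint(cols)
}

func main() {
	// INSERT INTO users (id, name, city) VALUES (42, 'ada', 'london')
	k, v := rowToKV("users", "42", map[string]string{"name": "ada", "city": "london"})
	fmt.Println(k, "=>", v)
	// because keys sort by table then primary key, a SQL table scan
	// becomes one contiguous KV range scan over the /users/primary/ prefix
}
```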
transactions
- acid and isolation levels
- a - all or nothing
- c - database state is internally consistent
- i - isolation: concurrent txns don’t see each other’s partial state
- d - durability don’t lose committed data
- what isolation levels?
- read uncommitted (could read ongoing but uncommitted writes) - no locks
- read committed (could read different values for the same row in the same txn)
- repeatable read (row reads are stable, but the same range read can return different results in the same txn)
- snapshot isolation - no phantom reads, every read is guaranteed consistent with a single snapshot, but txns can read the same data and independently write to different keys, so decisions can be based on stale reads (write skew; demo below)
- serializable - none of the above anomalies, as if txns ran one by one
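A toy demonstration of the write-skew anomaly from the snapshot isolation bullet: both txns read the same consistent snapshot, write disjoint keys, and both commit, breaking an invariant that serializable execution would have preserved:

```go
package main

import "fmt"

// snapshot gives each txn a private, consistent copy of the db to read.
func snapshot(db map[string]int) map[string]int {
	s := make(map[string]int, len(db))
	for k, v := range db {
		s[k] = v
	}
	return s
}

func main() {
	db := map[string]int{"x": 1, "y": 1} // invariant: x + y >= 1

	t1, t2 := snapshot(db), snapshot(db) // both txns start on the same snapshot

	// txn 1: "y is still 1, so I can safely zero x"
	if t1["x"]+t1["y"] > 1 {
		db["x"] = 0
	}
	// txn 2: "x is still 1 (in my snapshot!), so I can safely zero y"
	if t2["x"]+t2["y"] > 1 {
		db["y"] = 0
	}

	// the txns wrote disjoint keys, so snapshot isolation lets both commit
	fmt.Println(db, "invariant held:", db["x"]+db["y"] >= 1) // broken
}
```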
- single node db transactions
- if a single node comes back up after failing, it’s guaranteed not to have lost committed data (typically via a write-ahead log; sketch below)
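A minimal sketch of single-node durability via a write-ahead log (the file name and record format are invented): append and fsync before acknowledging, replay the log after a crash:

```go
package main

import (
	"fmt"
	"os"
)

// write appends the mutation to the log and fsyncs BEFORE acking the client;
// replaying the log on restart reconstructs every acknowledged write.
func write(log *os.File, entry string) error {
	if _, err := log.WriteString(entry + "\n"); err != nil {
		return err
	}
	return log.Sync() // durable on disk before we ack
}

func main() {
	log, err := os.OpenFile("wal.log", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		panic(err)
	}
	defer log.Close()
	if err := write(log, "set x=1"); err != nil {
		panic(err)
	}
	fmt.Println("acked") // only after the entry is on disk
}
```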
- multi-node db transactions
- atomicity and durability are achieved by bootstrapping off of a lower-level atomic write
- a quorum of disk writes is good enough
- need explicit txn states - the txn spans multiple machines, so commit can’t be a single lock
- pending
- committed
- aborted
- 2PC
- group the write of the transaction record with the txn’s first write (saves a round trip; sketch below)
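A sketch of the transaction-record idea with simplified states and invented names: every write is staged as an intent pointing at a single record, and commit is one atomic flip of that record, which makes a multi-machine transaction all-or-nothing:

```go
package main

import "fmt"

type txnState int

const (
	pending txnState = iota
	committed
	aborted
)

// txnRecord is the single source of truth for the transaction's fate;
// it lives at one key, so flipping it is one atomic (quorum) write.
type txnRecord struct{ state txnState }

// intent is a staged write; readers that find it chase the pointer to
// the txn record to learn whether the value is visible yet.
type intent struct {
	key, val string
	txn      *txnRecord
}

func main() {
	// the record can be written in the same batch as the first intent,
	// which is the "group writing" optimization above
	rec := &txnRecord{state: pending}
	intents := []intent{
		{"a", "1", rec}, // lives on range 1
		{"b", "2", rec}, // lives on range 2, a different machine
	}

	// commit = one atomic state flip: every intent becomes visible at
	// once, even though the intents span machines
	rec.state = committed

	for _, in := range intents {
		if in.txn.state == committed {
			fmt.Println(in.key, "=", in.val)
		}
	}
}
```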
- questions
- quorum while doing replication
- what happens if 2/3 agree? two out of three is a majority, so the write commits
- 3 replicas - tolerant to one node failure
- 5 replicas - tolerant to 2 node failures (arithmetic below)
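The arithmetic behind those numbers: a quorum is any majority, so n replicas tolerate floor((n-1)/2) failures:

```go
package main

import "fmt"

func main() {
	for _, n := range []int{3, 5, 7} {
		quorum := n/2 + 1        // smallest majority
		tolerated := (n - 1) / 2 // failures that still leave a majority
		fmt.Printf("replicas=%d quorum=%d tolerates=%d failures\n", n, quorum, tolerated)
	}
}
```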
- how to do range scans?
- requires an order-preserving index
- think about the system as a whole when asking whether a record is committed, not any single replica’s state