![Page 1: Designing large scale distributed systems](https://reader034.vdocuments.net/reader034/viewer/2022042813/548775aab47959f60c8b5418/html5/thumbnails/1.jpg)
Designing Large-Scale Distributed Systems
Ashwani Priyedarshi
![Page 2: Designing large scale distributed systems](https://reader034.vdocuments.net/reader034/viewer/2022042813/548775aab47959f60c8b5418/html5/thumbnails/2.jpg)
“The network is the computer.”
John Gage, Sun Microsystems
![Page 3: Designing large scale distributed systems](https://reader034.vdocuments.net/reader034/viewer/2022042813/548775aab47959f60c8b5418/html5/thumbnails/3.jpg)
“A distributed system is one in which the failure of a computer you didn’t even know existed can
render your own computer unusable.”
Leslie Lamport
![Page 4: Designing large scale distributed systems](https://reader034.vdocuments.net/reader034/viewer/2022042813/548775aab47959f60c8b5418/html5/thumbnails/4.jpg)
“Of three properties of distributed data systems (consistency, availability, partition tolerance), choose two.”
Eric Brewer, CAP Theorem, PODC 2000
![Page 5: Designing large scale distributed systems](https://reader034.vdocuments.net/reader034/viewer/2022042813/548775aab47959f60c8b5418/html5/thumbnails/5.jpg)
Agenda
● Consistency Models
● Transactions
● Why to distribute?
● Decentralized Architecture
● Design Techniques & Tradeoffs
● Few Real World Examples
● Conclusions
![Page 6: Designing large scale distributed systems](https://reader034.vdocuments.net/reader034/viewer/2022042813/548775aab47959f60c8b5418/html5/thumbnails/6.jpg)
Consistency Model
• Restricts the possible values that a read operation on an item can return
– Some models are very restrictive, others less so
– The less restrictive ones are easier to implement
• The most natural semantic for a storage system is "a read should return the last written value"
– With concurrent accesses and multiple replicas, it is not easy to identify what "last write" means
![Page 7: Designing large scale distributed systems](https://reader034.vdocuments.net/reader034/viewer/2022042813/548775aab47959f60c8b5418/html5/thumbnails/7.jpg)
Strict Consistency
● Assumes the existence of absolute global time
● It is impossible to implement on a large distributed system
● No two operations (in different clients) allowed at the same time
● Example: Sequence (a) satisfies strict consistency, but sequence (b) does not
![Page 8: Designing large scale distributed systems](https://reader034.vdocuments.net/reader034/viewer/2022042813/548775aab47959f60c8b5418/html5/thumbnails/8.jpg)
Sequential Consistency
● The result of any execution is the same as if:
– the read and write operations by all processes on the data store were executed in some sequential order, and
– the operations of each individual process appear in this sequence in the order specified by its program
● All processes see the same interleaving of operations
● Many interleavings are valid
● Different runs of a program might act differently
● Example: Sequence (a) satisfies sequential consistency, but sequence (b) does not
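The two conditions can be checked mechanically for one candidate ordering. The sketch below is illustrative only: the operation tuples, the `programs` layout, and the assumption that every variable starts at 0 are not from the slides. It validates a single proposed interleaving; it does not search for one.

```python
from typing import Dict, List, Tuple

# An operation is (process, kind, variable, value); kind is "w" or "r".
Op = Tuple[str, str, str, int]


def is_sequentially_consistent(order: List[Op], programs: Dict[str, List[Op]]) -> bool:
    """Return True if `order` is a legal interleaving of `programs`.

    `programs` maps each process to its operations in program order.
    Variables are assumed to start at 0 (an assumption of this sketch).
    """
    # Condition 1: each process's operations appear in program order.
    for proc, ops in programs.items():
        if [op for op in order if op[0] == proc] != ops:
            return False
    # Condition 2: every read returns the latest write before it in `order`.
    store: Dict[str, int] = {}
    for _, kind, var, value in order:
        if kind == "w":
            store[var] = value
        elif store.get(var, 0) != value:
            return False
    return True


programs = {
    "P1": [("P1", "w", "x", 1)],
    "P2": [("P2", "r", "x", 0), ("P2", "r", "x", 1)],
}
# P2 reads the old value, then P1 writes, then P2 reads the new value.
valid = [("P2", "r", "x", 0), ("P1", "w", "x", 1), ("P2", "r", "x", 1)]
# P2 reads a stale 0 after the write is already ordered before it.
invalid = [("P1", "w", "x", 1), ("P2", "r", "x", 0), ("P2", "r", "x", 1)]
```

Many different orderings can pass this check for the same programs, which is why different runs may behave differently.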
![Page 9: Designing large scale distributed systems](https://reader034.vdocuments.net/reader034/viewer/2022042813/548775aab47959f60c8b5418/html5/thumbnails/9.jpg)
Consistency vs Availability
• In large shared-data distributed systems, network partitions are a given
• During a partition, the system must choose between consistency and availability
• Both options require the client developer to be aware of what the system is offering
![Page 10: Designing large scale distributed systems](https://reader034.vdocuments.net/reader034/viewer/2022042813/548775aab47959f60c8b5418/html5/thumbnails/10.jpg)
Eventual Consistency
• An eventually consistent storage system guarantees that if no new updates are made to the object, eventually all accesses will return the last updated value
• If no failures occur, the maximum size of the inconsistency window can be determined based on factors such as:
– load on the system
– communication delays
– number of replicas
• The most popular system that implements eventual consistency is DNS
![Page 11: Designing large scale distributed systems](https://reader034.vdocuments.net/reader034/viewer/2022042813/548775aab47959f60c8b5418/html5/thumbnails/11.jpg)
Quorum-based Technique
• Used to enforce consistent operation in a distributed system
• Consider the following parameters:
– N = Total number of replicas
– W = Replicas to wait for acknowledgement during writes
– R = Replicas to access during reads
• If W+R > N
– the read set and the write set always overlap and one can guarantee strong consistency
• If W+R <= N
– the read and write set might not overlap and consistency cannot be guaranteed
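The overlap rule can be confirmed by brute force for small N. This is a minimal sketch; the function name `quorums_always_overlap` is hypothetical and the enumeration is only practical for a handful of replicas.

```python
from itertools import combinations


def quorums_always_overlap(n: int, w: int, r: int) -> bool:
    """True if every possible write set of size w shares at least
    one replica with every possible read set of size r."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)  # empty intersection is falsy
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


# N=3, W=2, R=2: W+R > N, so any read touches a replica that saw the write.
strong = quorums_always_overlap(3, 2, 2)
# N=3, W=1, R=2: W+R <= N, so a read can miss the only replica written.
weak = quorums_always_overlap(3, 1, 2)
```

Lowering W makes writes faster at the price of a larger R for the same guarantee, and vice versa.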
![Page 12: Designing large scale distributed systems](https://reader034.vdocuments.net/reader034/viewer/2022042813/548775aab47959f60c8b5418/html5/thumbnails/12.jpg)
Agenda
● Consistency Models
● Transactions
● Why to distribute?
● Decentralized Architecture
● Design Techniques & Tradeoffs
● Few Real World Examples
● Conclusions
![Page 13: Designing large scale distributed systems](https://reader034.vdocuments.net/reader034/viewer/2022042813/548775aab47959f60c8b5418/html5/thumbnails/13.jpg)
Transactions
● Extended form of consistency across multiple operations
● Example: Transfer money from A to B
– Subtract from A
– Add to B
● What if something happens in between?
– Another transaction on A or B
– Machine crashes
– ...
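On a single machine, the transfer example can be sketched with Python's standard `sqlite3` module. The `accounts` table, the `transfer` helper, and the overdraft check are assumptions made for illustration; the point is only that a failure between the two updates leaves no partial result.

```python
import sqlite3


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> None:
    """Move `amount` from src to dst atomically."""
    # The connection context manager commits on success and rolls
    # back if an exception escapes the block.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src),
        )
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)
        ).fetchone()
        if balance < 0:
            # Something went wrong "in between": undo the subtraction.
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst),
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100), ("B", 0)])
conn.commit()

transfer(conn, "A", "B", 30)       # succeeds
try:
    transfer(conn, "A", "B", 500)  # fails after the subtraction, rolled back
except ValueError:
    pass

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

The invariant (total money stays at 100) holds after both calls, even though the second one failed halfway.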
![Page 14: Designing large scale distributed systems](https://reader034.vdocuments.net/reader034/viewer/2022042813/548775aab47959f60c8b5418/html5/thumbnails/14.jpg)
Why Transactions?
● Correctness
● Consistency
● Enforce Invariants
● ACID
![Page 15: Designing large scale distributed systems](https://reader034.vdocuments.net/reader034/viewer/2022042813/548775aab47959f60c8b5418/html5/thumbnails/15.jpg)
Agenda
● Consistency Models
● Transactions
● Why to distribute?
● Decentralized Architecture
● Design Techniques & Tradeoffs
● Few Real World Examples
● Conclusions
![Page 16: Designing large scale distributed systems](https://reader034.vdocuments.net/reader034/viewer/2022042813/548775aab47959f60c8b5418/html5/thumbnails/16.jpg)
Why to distribute?
● Catastrophic Failures
● Expected Failures
● Routine Maintenance
● Geolocality
– CDN, edge caching
![Page 17: Designing large scale distributed systems](https://reader034.vdocuments.net/reader034/viewer/2022042813/548775aab47959f60c8b5418/html5/thumbnails/17.jpg)
Why NOT to distribute?
● Within a Datacenter
– High bandwidth: 1-100 Gbps interconnects
– Low latency: < 1 ms within a rack, < 5 ms across
– Little to no cost
● Between Datacenters
– Low bandwidth: 10 Mbps-1 Gbps
– High latency: expect 100s of ms
– High cost for fiber
![Page 18: Designing large scale distributed systems](https://reader034.vdocuments.net/reader034/viewer/2022042813/548775aab47959f60c8b5418/html5/thumbnails/18.jpg)
Agenda
● Consistency Models
● Transactions
● Why to distribute?
● Decentralized Architecture
● Design Techniques & Tradeoffs
● Few Real World Examples
● Conclusions
![Page 19: Designing large scale distributed systems](https://reader034.vdocuments.net/reader034/viewer/2022042813/548775aab47959f60c8b5418/html5/thumbnails/19.jpg)
Decentralized Architecture
● Operating from multiple datacenters simultaneously
● Hard problem
● Maintaining consistency? Harder
● Transactions? Hardest
![Page 20: Designing large scale distributed systems](https://reader034.vdocuments.net/reader034/viewer/2022042813/548775aab47959f60c8b5418/html5/thumbnails/20.jpg)
Option 1: Don't
● Most common
● Make sure datacenter never goes down
● Bad at catastrophic failure
– Large scale data loss
● Not great for serving
– No geolocation
![Page 21: Designing large scale distributed systems](https://reader034.vdocuments.net/reader034/viewer/2022042813/548775aab47959f60c8b5418/html5/thumbnails/21.jpg)
Option 2: Primary with hot failover(s)
● Better, but not ideal
● Mediocre at catastrophic failure
– Window of lost data
– Failover data may be inconsistent
● Geolocated for reads, not for writes
![Page 22: Designing large scale distributed systems](https://reader034.vdocuments.net/reader034/viewer/2022042813/548775aab47959f60c8b5418/html5/thumbnails/22.jpg)
Option 3: Truly Distributed
● Simultaneous writes in different DCs, maintaining consistency
– Two-way: Hard
– N-way: Harder
● Handles catastrophic failure, geolocality
● But high latency
![Page 23: Designing large scale distributed systems](https://reader034.vdocuments.net/reader034/viewer/2022042813/548775aab47959f60c8b5418/html5/thumbnails/23.jpg)
Agenda
● Consistency Models
● Transactions
● Why to distribute?
● Decentralized Architecture
● Design Techniques & Tradeoffs
● Few Real World Examples
● Conclusions
![Page 24: Designing large scale distributed systems](https://reader034.vdocuments.net/reader034/viewer/2022042813/548775aab47959f60c8b5418/html5/thumbnails/24.jpg)
Tradeoffs
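| | Backups | M/S | MM | 2PC | Paxos |
|---|---|---|---|---|---|
| Consistency | | | | | |
| Transactions | | | | | |
| Latency | | | | | |
| Throughput | | | | | |
| Data Loss | | | | | |
| Failover | | | | | |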
![Page 25: Designing large scale distributed systems](https://reader034.vdocuments.net/reader034/viewer/2022042813/548775aab47959f60c8b5418/html5/thumbnails/25.jpg)
Backups
● Make a copy
● Weak consistency
● Usually no transactions
![Page 26: Designing large scale distributed systems](https://reader034.vdocuments.net/reader034/viewer/2022042813/548775aab47959f60c8b5418/html5/thumbnails/26.jpg)
Tradeoffs – Backups
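| | Backups | M/S | MM | 2PC | Paxos |
|---|---|---|---|---|---|
| Consistency | Weak | | | | |
| Transactions | No | | | | |
| Latency | Low | | | | |
| Throughput | High | | | | |
| Data Loss | High | | | | |
| Failover | Down | | | | |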
![Page 27: Designing large scale distributed systems](https://reader034.vdocuments.net/reader034/viewer/2022042813/548775aab47959f60c8b5418/html5/thumbnails/27.jpg)
Master/slave replication
● Usually asynchronous
● Good for throughput, latency
● Weak/eventual consistency
● Supports transactions
![Page 28: Designing large scale distributed systems](https://reader034.vdocuments.net/reader034/viewer/2022042813/548775aab47959f60c8b5418/html5/thumbnails/28.jpg)
Tradeoffs – Master/Slave
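| | Backups | M/S | MM | 2PC | Paxos |
|---|---|---|---|---|---|
| Consistency | Weak | Eventual | | | |
| Transactions | No | Full | | | |
| Latency | Low | Low | | | |
| Throughput | High | High | | | |
| Data Loss | High | Some | | | |
| Failover | Down | Read only | | | |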
![Page 29: Designing large scale distributed systems](https://reader034.vdocuments.net/reader034/viewer/2022042813/548775aab47959f60c8b5418/html5/thumbnails/29.jpg)
Multi-master replication
● Asynchronous, eventual consistency
● Concurrent writes
● Need serialization protocol
– e.g. monotonically increasing timestamps
– Either with master election or distributed consensus protocol
● No strong consistency
● No global transactions
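One simple way to serialize concurrent writes is last-writer-wins on a timestamp. The sketch below is a simplification, not the protocol of any particular system: the `merge` function, the `(counter, master_id)` stamp, and the sample data are all assumed for illustration.

```python
def merge(local: dict, remote: dict) -> dict:
    """Last-writer-wins merge of two masters' states.

    Each state maps key -> (stamp, value), where stamp is a
    (counter, master_id) tuple; the master id breaks ties so that
    every master picks the same winner.
    """
    merged = dict(local)
    for key, (stamp, value) in remote.items():
        if key not in merged or stamp > merged[key][0]:
            merged[key] = (stamp, value)
    return merged


# Two datacenters accepted writes to the same keys concurrently.
dc_east = {"cart": ((1, "east"), ["book"]), "name": ((3, "east"), "Ann")}
dc_west = {"cart": ((2, "west"), ["pen"]), "name": ((1, "west"), "Anne")}

converged = merge(dc_east, dc_west)
```

Both masters converge to the same state regardless of merge direction, but the east datacenter's `["book"]` write is silently discarded, which is why this scheme offers neither strong consistency nor global transactions.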
![Page 30: Designing large scale distributed systems](https://reader034.vdocuments.net/reader034/viewer/2022042813/548775aab47959f60c8b5418/html5/thumbnails/30.jpg)
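Tradeoffs – Multi-master
| | Backups | M/S | MM | 2PC | Paxos |
|---|---|---|---|---|---|
| Consistency | Weak | Eventual | Eventual | | |
| Transactions | No | Full | Local | | |
| Latency | Low | Low | Low | | |
| Throughput | High | High | High | | |
| Data Loss | High | Some | Some | | |
| Failover | Down | Read only | Read/write | | |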
![Page 31: Designing large scale distributed systems](https://reader034.vdocuments.net/reader034/viewer/2022042813/548775aab47959f60c8b5418/html5/thumbnails/31.jpg)
Two Phase Commit
● Semi-distributed consensus protocol
– Deterministic coordinator
● Phase 1: Request; Phase 2: Commit/Abort
● Heavyweight, synchronous, high latency
● 3PC: non-blocking variant (one extra round trip)
● Poor Throughput
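The two phases can be sketched in a few lines. This is an in-memory illustration only: the `Participant` class and `two_phase_commit` function are hypothetical names, and a real coordinator would also need durable logging, timeouts, and crash recovery.

```python
class Participant:
    """A resource manager that votes in phase 1 and obeys in phase 2."""

    def __init__(self, name: str, can_commit: bool = True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self) -> bool:
        # Phase 1: vote yes only if the local work can be committed.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants) -> bool:
    # Phase 1 (request): the coordinator collects a vote from everyone.
    votes = [p.prepare() for p in participants]
    decision = all(votes)
    # Phase 2 (commit/abort): the coordinator broadcasts its decision.
    for p in participants:
        if decision:
            p.commit()
        else:
            p.abort()
    return decision


happy = [Participant("db1"), Participant("db2")]
committed = two_phase_commit(happy)

unhappy = [Participant("db1"), Participant("db2", can_commit=False)]
aborted = not two_phase_commit(unhappy)
```

Every transaction waits for the slowest participant in both phases, which is where the high latency and poor throughput come from.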
![Page 32: Designing large scale distributed systems](https://reader034.vdocuments.net/reader034/viewer/2022042813/548775aab47959f60c8b5418/html5/thumbnails/32.jpg)
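Tradeoffs – 2PC
| | Backups | M/S | MM | 2PC | Paxos |
|---|---|---|---|---|---|
| Consistency | Weak | Eventual | Eventual | Strong | |
| Transactions | No | Full | Local | Full | |
| Latency | Low | Low | Low | High | |
| Throughput | High | High | High | Low | |
| Data Loss | High | Some | Some | None | |
| Failover | Down | Read only | Read/write | Read/write | |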
![Page 33: Designing large scale distributed systems](https://reader034.vdocuments.net/reader034/viewer/2022042813/548775aab47959f60c8b5418/html5/thumbnails/33.jpg)
Paxos
● Decentralized, distributed consensus protocol
● Protocol similar to 2PC/3PC
● Lighter, but still high latency
● Three classes of agents: proposers, acceptors, learners
● Phase 1: a) prepare, b) promise; Phase 2: a) accept, b) accepted
● Survives failure of a minority of nodes
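The prepare/promise and accept/accepted exchange can be sketched for a single value (single-decree Paxos). This is a simplified in-process illustration: the `Acceptor` class, the `up` flag used to simulate a failed node, and the `propose` function are assumptions of the sketch, and learners, networking, and retries are omitted.

```python
class Acceptor:
    def __init__(self):
        self.up = True             # set to False to simulate a crash
        self.promised = 0          # highest proposal number promised
        self.accepted = (0, None)  # (number, value) last accepted

    def prepare(self, n):
        # Phase 1b: promise to ignore proposals numbered below n and
        # report any proposal already accepted.
        if self.up and n > self.promised:
            self.promised = n
            return self.accepted
        return None  # no reply or rejection

    def accept(self, n, value):
        # Phase 2b: accept unless a higher-numbered promise was made.
        if self.up and n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False


def propose(acceptors, n, value):
    """Run one proposer round; return the chosen value, or None."""
    majority = len(acceptors) // 2 + 1
    # Phase 1a: send prepare(n) and collect promises.
    promises = [p for p in (a.prepare(n) for a in acceptors) if p is not None]
    if len(promises) < majority:
        return None
    # Adopt the value of the highest-numbered proposal already accepted.
    prior_n, prior_value = max(promises, key=lambda p: p[0])
    if prior_n > 0:
        value = prior_value
    # Phase 2a: send accept(n, value) and count acknowledgements.
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= majority else None


cluster = [Acceptor(), Acceptor(), Acceptor()]
cluster[2].up = False                 # a minority of nodes has failed

first = propose(cluster, 1, "x")      # chosen with 2 of 3 acceptors
second = propose(cluster, 2, "y")     # must adopt the already chosen "x"
stale = propose(cluster, 1, "z")      # old proposal number is rejected
```

Once a value is chosen, later proposers can only re-propose that same value, which is what gives the protocol its consistency guarantee.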
![Page 34: Designing large scale distributed systems](https://reader034.vdocuments.net/reader034/viewer/2022042813/548775aab47959f60c8b5418/html5/thumbnails/34.jpg)
Tradeoffs
| | Backups | M/S | MM | 2PC | Paxos |
|---|---|---|---|---|---|
| Consistency | Weak | Eventual | Eventual | Strong | Strong |
| Transactions | No | Full | Local | Full | Full |
| Latency | Low | Low | Low | High | High |
| Throughput | High | High | High | Low | Medium |
| Data Loss | High | Some | Some | None | None |
| Failover | Down | Read only | Read/write | Read/write | Read/write |
![Page 35: Designing large scale distributed systems](https://reader034.vdocuments.net/reader034/viewer/2022042813/548775aab47959f60c8b5418/html5/thumbnails/35.jpg)
Agenda
● Consistency Models
● Transactions
● Why to distribute?
● Decentralized Architecture
● Design Techniques & Tradeoffs
● Few Real World Examples
● Conclusions
![Page 36: Designing large scale distributed systems](https://reader034.vdocuments.net/reader034/viewer/2022042813/548775aab47959f60c8b5418/html5/thumbnails/36.jpg)
Examples
● Megastore
– Google's Scalable, Highly Available Datastore
– Strong Consistency, Paxos
– Optimized for reads
● Dynamo
– Amazon’s Highly Available Key-value Store
– Eventual Consistency, Consistent Hashing, Vector Clocks
– Optimized for writes
● PNUTS
– Yahoo's Massively Parallel & Distributed Database System
– Timeline Consistency
– Optimized for reads
![Page 37: Designing large scale distributed systems](https://reader034.vdocuments.net/reader034/viewer/2022042813/548775aab47959f60c8b5418/html5/thumbnails/37.jpg)
Conclusions
● No silver bullet
● There are no simple solutions
● Design systems based on application needs
![Page 38: Designing large scale distributed systems](https://reader034.vdocuments.net/reader034/viewer/2022042813/548775aab47959f60c8b5418/html5/thumbnails/38.jpg)
The End
![Page 39: Designing large scale distributed systems](https://reader034.vdocuments.net/reader034/viewer/2022042813/548775aab47959f60c8b5418/html5/thumbnails/39.jpg)
![Page 40: Designing large scale distributed systems](https://reader034.vdocuments.net/reader034/viewer/2022042813/548775aab47959f60c8b5418/html5/thumbnails/40.jpg)
Backup Slides
![Page 41: Designing large scale distributed systems](https://reader034.vdocuments.net/reader034/viewer/2022042813/548775aab47959f60c8b5418/html5/thumbnails/41.jpg)
Vector Clocks
• Used to capture causality between different versions of the same object.
• A vector clock is a list of (node, counter) pairs.
• Every version of every object is associated with one vector clock.
• If every counter in the first object’s clock is less than or equal to the corresponding node's counter in the second clock, then the first is an ancestor of the second and can be forgotten.
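The ancestor rule can be sketched as a comparison function. Clocks are represented here as dicts from node to counter, which is equivalent to the list of pairs; the function names and the sample versions `d1` to `d4` are hypothetical.

```python
def descends(a: dict, b: dict) -> bool:
    """True if clock `a` is equal to or a descendant of clock `b`:
    every counter in `b` is matched or exceeded in `a`."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: dict, b: dict) -> str:
    """Classify the causal relationship between two versions."""
    a_after, b_after = descends(a, b), descends(b, a)
    if a_after and b_after:
        return "equal"
    if a_after:
        return "after"       # b is an ancestor of a and can be forgotten
    if b_after:
        return "before"      # a is an ancestor of b and can be forgotten
    return "concurrent"      # conflicting versions, need reconciliation


d1 = {"Sx": 1}
d2 = {"Sx": 2}               # written again through node Sx
d3 = {"Sx": 2, "Sy": 1}      # d2 updated through node Sy
d4 = {"Sx": 2, "Sz": 1}      # d2 updated through node Sz, in parallel
```

Here `d2` supersedes `d1`, while `d3` and `d4` are concurrent: neither clock dominates the other, so both versions must be kept until a client reconciles them.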
![Page 42: Designing large scale distributed systems](https://reader034.vdocuments.net/reader034/viewer/2022042813/548775aab47959f60c8b5418/html5/thumbnails/42.jpg)
Vector Clock Example
![Page 43: Designing large scale distributed systems](https://reader034.vdocuments.net/reader034/viewer/2022042813/548775aab47959f60c8b5418/html5/thumbnails/43.jpg)
Partitioning Algorithm
• Consistent hashing:
– The output range of a hash function is treated as a fixed circular space or “ring”.
• Virtual Nodes
– Each node can be responsible for more than one virtual node.
– When a new node is added, it is assigned multiple positions.
– Advantages: load spreads more evenly when nodes join or leave, and the number of virtual nodes can reflect a node's capacity.
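A ring with virtual nodes can be sketched with a sorted list and binary search. The `HashRing` class, the choice of SHA-256, and the `node#i` naming of virtual nodes are assumptions of this sketch; it also assumes the ring is non-empty when `lookup` is called.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Position on the ring; SHA-256 is an arbitrary choice here.
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, vnodes: int = 8):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node)

    def add(self, node: str) -> None:
        # Each physical node is assigned several positions on the ring.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def lookup(self, key: str) -> str:
        # First virtual node clockwise from the key, wrapping around.
        idx = bisect.bisect_right(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing()
for name in ("A", "B", "C"):
    ring.add(name)

keys = [f"key{i}" for i in range(200)]
before = {k: ring.lookup(k) for k in keys}
ring.add("D")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Only keys that now fall just before one of D's positions change owner; every other key stays where it was, which is the property that makes adding or removing nodes cheap.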