coordination and agreement, multicast, and message ordering
TRANSCRIPT
![Page 1: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/1.jpg)
![Page 2: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/2.jpg)
Coordination and Agreement, Multicast, and Message Ordering
![Page 3: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/3.jpg)
Fall 2010 5DV020 3
Outline
• Coordination– Distributed mutual exclusion– Election algorithms
• Multicast• Message ordering• Agreement
– Byzantine generals problem– Paxos algorithm
• Summary
![Page 4: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/4.jpg)
Coordination and Agreement
• Processes often need to coordinate their actions and/or reach an agreement– Which process gets to access a shared resource?
– Has the master crashed? Elect a new one!• Failure detection – how can we know that a node has failed (e.g. crashed)?
Fall 2010 5DV020 4
![Page 5: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/5.jpg)
Failure detection
• Keepalive messages every X time units– No message: the process has failed– Crashing immediately after sending?
– Message delays, re-routing, etc.
• Unreliable failure detection– Unsuspected vs. Suspected
• Reliable failure detection– Unsuspected vs. Failed– Synchronous systems
Fall 2010 5DV020 5
![Page 6: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/6.jpg)
Distributed mutual exclusion
• ME1 (safety): at most 1 process may enter the critical section at a time
• ME2 (liveness): requests to enter the critical section eventually succeed
• ME3 (→ ordering): if a request to enter the critical section happened-before another, then access is granted according to that order
Fall 2010 5DV020 6
![Page 7: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/7.jpg)
Mutex algorithms
• Central server– Simple and error-prone!
• Ring-based algorithm– Also simple, but not single point of failure
• Ricart and Agrawala– Multicast request, access when all have replied (ordered using logical clocks)
• Maekawa's voting algorithm– Ask only a subset for access: works if subsets are overlapping
Fall 2010 5DV020 7
![Page 8: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/8.jpg)
Mutex: Central server
• Send request to server, oldest process in queue gets access, return token when done
• Safety?– Yes!
• Liveness?– Yes (as long as server does not crash)
• → ordering?– No! Why not?
Fall 2010 5DV020 8
![Page 9: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/9.jpg)
Mutex: Ring-based
• Token is passed around a ring of processes– Want access? Wait until token comes, and claim it (then pass the token along)
• Safety? Liveness?– Yes (assuming no crashes)
• → ordering?– Not even close
Fall 2010 5DV020 9
![Page 10: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/10.jpg)
Mutex: Ricart and Agrawala
• Each process– Has unique process ID – Maintains a logical (Lamport) clock– Has state ∈ {wanted, held, released}
• Requests are multicasted to group– Contain process ID and clock value
• Lowest clock value gets access first– Equal values? Check process ID!
Fall 2010 5DV020 10
![Page 11: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/11.jpg)
Mutex: Ricart and Agrawala
• Have access or want it yourself, and your <id, value> is lower than incoming request? Queue the request! Else, reply to it
• Safety? Liveness? → ordering?– Yes!
• Improve performance– If process wants to re-enter critical section, and no new requests have been made, just do it!
Fall 2010 5DV020 11
![Page 12: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/12.jpg)
Mutex: Maekawa’s voting
• Optimization: only ask subset of processes for entry– Works as long as subsets overlap– Vote only in one electionat a time!
• Safety?– Yes!
• Liveness? → ordering?– No, deadlocks can happen! But, enhance using vector clocks, and we get both!
Fall 2010 5DV020 12
![Page 13: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/13.jpg)
Comparison of mutex algos
• Study the algorithms yourselves, including their performance– This is an excellent question for exams (just sayin')
• Message loss?– None of the algorithms handle this
• Crashing processes?– Ring? No! Others? Only if the process is unimportant (not server, in voting set)
Fall 2010 5DV020 13
![Page 14: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/14.jpg)
Election algorithms
• Ring-based– You may be familiar with this one!
• Bully– Very specific requirements, not generally applicable
Fall 2010 5DV020 14
![Page 15: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/15.jpg)
Election requirements
• E1 (safety): A participant has elected = False or elected = P, where P is chosen as the non-crashed process with the highest identifier
• E2 (liveness): All processes participate and eventually set elected to non-False or crash
Fall 2010 5DV020 15
![Page 16: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/16.jpg)
Election: Ring-based
• Processes have unique identifiers
• During election, pass max(own ID, incoming ID) to next process– If a process receives own ID, it must have been highest and may send that it has been elected
• Safety? Liveness?– Yes!
• Tolerates no failures (limited use)
Fall 2010 5DV020 16
![Page 17: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/17.jpg)
Election: Bully algorithm
• Requires– Synchronous system– All processes know of each other– Reliable failure detectors– Reliable message delivery
• Allows– Crashing processes
• Safety? Not if process IDs can be reused!
• Liveness? Yes (if message delivery is reliable)
Fall 2010 5DV020 17
![Page 18: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/18.jpg)
Election algorithms
• Study these algorithms on your own, and their properties– Again, excellent exam material...
• Want to read more about non-trivial election algorithms?– http://www.sics.se/~ali/teaching/dalg/l06.ppt
Fall 2010 5DV020 18
![Page 19: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/19.jpg)
Multicast
• Receive vs. Deliver– Receive: message has arrived and will be processed
– Deliver: message is allowed to reach upper layer
• Closed vs. open groups– (Dis-)allow external processes to send to group
• Unreliable (basic) multicast– Send (unicast) to each other process!• What if sender fails halfway through?
Fall 2010 5DV020 19
![Page 20: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/20.jpg)
Reliable multicast
• Integrity: messages delivered at most once
• Validity: if a correct process multicasts message m, it will eventually deliver m
• Agreement: if a correct process delivers m, then all correct processes will eventually deliver m
Fall 2010 5DV020 20
![Page 21: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/21.jpg)
Reliable multicast
• Use basic multicast to send to all (including self)– When basic multicast delivers, check if message has been received before• If it has, do nothing further• If not, and sender is not own process:– Basic multicast message to others
• Deliver message to upper layer
Fall 2010 5DV020 21
![Page 22: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/22.jpg)
Reliable multicast
• Integrity? Validity? Agreement?– Yes!
• Insane amounts of traffic?– Yes! Every message is sent sizeof(group) to each process!– A single message will be sent 100 times if we just have 10 processes
Fall 2010 5DV020 22
![Page 23: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/23.jpg)
Reliable multicast over IP multicast
• IP multicast requires hardware support– IP, however, is unreliable
• Study algorithm on your own– Perfect exam material...
Fall 2010 5DV020 23
![Page 24: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/24.jpg)
Message ordering
• FIFO• Total• Causal• Hybrids such as Causal-Total
Fall 2010 5DV020 24
![Page 25: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/25.jpg)
FIFO ordering
• Intuition: messages from a process should be delivered in the order in which they were sent
• Solution: let sending process number the messages and hold back those that have been received out of order
Fall 2010 5DV020 25
![Page 26: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/26.jpg)
Total ordering
• Intuition: messages from all processes should get a groupwide ordering number, so all processes can deliver in a single order!– Mental pitfall: the order itself does not have to make any sense, as long as all processes abide by it!
Fall 2010 5DV020 26
![Page 27: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/27.jpg)
Implementing total ordering
• Sequencer– Simple– Central server (= single point of failure)
• ISIS-algorithm– Not as simple– Distributed– Study on your own!
Fall 2010 5DV020 27
![Page 28: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/28.jpg)
Total ordering via sequencer
• Sequencer is logically external to the group
• Messages are sent to all members and the Sequencer– Initially, have no “ordering” number
• Sequencer maps message identifiers to ordering numbers– Multicasts mapping to group
• Once a message has an ordering number, it can be delivered according to that number
Fall 2010 5DV020 28
![Page 29: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/29.jpg)
Total ordering contd.
• Note, again, that the ordering is completely up to the sequencer– It could collect all messages for half an hour and then assign numbers according to how many “a”:s there are in the message• While annoying to use, this is still a total order, and all processes will have to follow it!
Fall 2010 5DV020 29
![Page 30: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/30.jpg)
Causal ordering
• Intuition: captures causal (cause and effect) relationships via happened-before ordering
• Using vector clocks, ensures that replies are delivered after the message that they are replying to
• Example: USENET replication
Fall 2010 5DV020 30
![Page 31: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/31.jpg)
Hybrid orderings
• Causal order is not unique– Concurrent messages
• ...neither is FIFO– FIFO only guarantees per process, not inter-process
• Total order only guarantees a unique order– Combine with others to get stronger delivery semantics!
Fall 2010 5DV020 31
![Page 32: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/32.jpg)
Message ordering
• You will need to get familiar with these orderings during the assignment– Read the book carefully
• Avoid mistakes• Adhere to definitions• Get a few very nice hints...
Fall 2010 5DV020 32
![Page 33: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/33.jpg)
Agreement
• Consensus– Any process can suggest a value, all have to reach consensus on which is correct
• Byzantine generals problem– One process can suggest a value, others have to agree on what that value was
Fall 2010 5DV020 33
![Page 34: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/34.jpg)
Byzantine Generals problem
• Commander– Issues order to attack or retreat
• Lieutenants– Decide what to do– Coordinate their actions by repeating the order from the commander to each other
• One or more processes (commander or lieutenants) may be “treacherous” and report wrong/inconsistent values!
Fall 2010 5DV020 34
![Page 35: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/35.jpg)
Byzantine Generals problem
• Synchronous system!• Impossible with three processes– See Figure 12.19 in the book, © Pearson Education 2005
• Possible with N ≥ 3f + 1 processes, where f is amount of treacherous ones– See Figure 12.20 in the book,© Pearson Education 2005
Fall 2010 5DV020 35
![Page 36: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/36.jpg)
Byzantine Generals problem
• Asynchronous system – bad!– No timing constraints
• Fischer's impossibility result– Even with just one crashing process, we cannot reach consensus in an asynchronous system
– This also applies to total ordering of messages and reliable multicast...
– Still, we manage to do quite well in practice – how can that be?
Fall 2010 5DV020 36
![Page 37: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/37.jpg)
Impossibility work-arounds
• Mask the faults– Use persistent storage and allow process restarts
• Use failure detectors– No reliable detectors: but good enough (agree that process is crashed if too long time has passed)
– Eventually weak failure detector
• Randomization– Great paper (which is required reading) on the literature page!
Fall 2010 5DV020 37
![Page 38: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/38.jpg)
Paxos algorithm
• Properties:– Non-triviality: only proposed values can be learned
– Consistency: at most one value can be learned (two learners cannot learn different values)
– Liveness (C;L): if a value C has been proposed, then eventually learner L will learn some value
• Asynchronous
Fall 2010 5DV020 38
![Page 39: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/39.jpg)
Paxos algorithm contd.
• Client– Initiates with a request, expects response
• Acceptor– Collected into Quorums (groups)
• Messages must be sent to a whole quorum• Messages are ignored unless they come from the whole quorum
• Proposer– Handles request on behalf of Client
• Learner– Acts if Acceptors agree on a value
• Leader– Distinguished Proposer, handles protocol
Fall 2010 5DV020 39
![Page 40: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/40.jpg)
Paxos Quorums
• A quorum contains a majority of available Acceptors
• Several subsets of Acceptors such that any pair of Quorums always have some overlap
Fall 2010 5DV020 40
![Page 41: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/41.jpg)
Paxos deployment
• Each Client reqest starts a new instance
• Each instance has two successful phases per round– If a phase is unsuccessful, start a new round
• Output from an instance is a single agreed-upon value
Fall 2010 5DV020 41
![Page 42: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/42.jpg)
Paxos phases
• Prepare (1a)– Leader proposes number N
• Promise (1b)– If N is greater than any previous proposal, Acceptors in some quorum promise to reject proposals less than N
– Acceptors respond with the last (highest) value they have accepted
• Accept (2a)– Leader chooses value from response set, sends value to some quorum of Acceptors
• Accepted (2b)– If Acceptors have not promised to do otherwise (promised to another concurrent proposal > N), they accept the value to Learners and the Leader
Fall 2010 5DV020 42
![Page 43: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/43.jpg)
Paxos further reading
• Leslie Lamport Paxos Made SimpleACM SIGACT News (Distributed Computing Column) 32, 4 (Whole Number 121, December 2001) 51-58. – PDF here– Well worth your time…
Fall 2010 5DV020 43
![Page 44: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/44.jpg)
Summary
• Failure detection– Hard to get right in async. systems
• Mutual exclusion• Election algorithms
– Seems like a simple problem, but non-trivial solutions are... non-trivial
• Multicast, reliable and unreliable
• Message ordering– Very important for the assignment
Fall 2010 5DV020 44
![Page 45: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/45.jpg)
Summary
• Byzantine generals problem• Fisher's impossibility result
– Avoiding it by various techniques
• Paxos algorithm
Fall 2010 5DV020 45
![Page 46: Coordination and Agreement, Multicast, and Message Ordering](https://reader036.vdocuments.net/reader036/viewer/2022062422/56649e695503460f94b65f49/html5/thumbnails/46.jpg)
Next lecture
• Replication– Group communication– Fault-tolerant services
• Active and Passive replication
Fall 2010 5DV020 46