DC7: More Coordination Chapter 11 and 14.2 Consensus Group Communication


Page 1: DC7: More Coordination Chapter 11 and 14.2

DC7: More Coordination (Chapter 11 and 14.2)

• Consensus

• Group Communication

Page 2: DC7: More Coordination Chapter 11 and 14.2

Topics

• Agreement and Consensus
  – No fault, fail-stop, Byzantine
• Group Communication
  – Order and delivery guarantees
• Virtual Synchrony

Page 3: DC7: More Coordination Chapter 11 and 14.2

Consensus (Section 11.5)

• Distributed agreement, or "distributed consensus", is the fundamental problem in DS.
  – Distributed mutual exclusion and election are basically getting processes to agree on something.
  – Agreeing on time or on the update of replicated data are special cases of the distributed consensus problem.
• Agreement sometimes means one process proposes a value and the others agree on it, while consensus means all processes propose values and all agree on some function of those values.

Page 4: DC7: More Coordination Chapter 11 and 14.2

Consensus (Agreement)

• There are M processes, P1, P2, …, PM in a DS that are trying to reach agreement. A subset F of the processes is faulty. Each process Pi is initially undecided and proposes a value Vi. During agreement, the processes each calculate a value Ai. At the end of the algorithm:
  – All non-faulty processes reach a decision.
  – For every pair of non-faulty processes Pi and Pj, Ai = Aj. This is the agreement value.
  – The agreement value is a function of the initial values {Vi} of the non-faulty processes.
• The function is often max (as in the case of election), or average, or one of the Vi. If all non-faulty processes have the same Vi, then that must be the agreement value.
• Communications are reliable and synchronous.

Page 5: DC7: More Coordination Chapter 11 and 14.2

Consensus: Easy Case: No Failures

• No failures, synchronous, M processes.
• If there can be no failures, reaching consensus is easy: every process sends its value to every other process. All processes now have identical information.
• All processes do the same calculation and come up with the same value. Each process needs to maintain an array of M values.

P1 has {1,2,3,4}
P2 has {1,2,3,4}
P3 has {1,2,3,4}
P4 has {1,2,3,4}

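As a concrete illustration of the no-failure case (a sketch, not from the slides): every process sends its proposal to every other process, each ends up with the same array of M values, and each applies the same deterministic function, here max.

```python
# Sketch only: consensus with no failures in a synchronous system,
# simulated in-memory. The function `max` and the process names are
# illustrative; any deterministic function of the M values would do.

def consensus_no_failures(proposals):
    """proposals: dict mapping process id -> proposed value Vi."""
    # Round 1: every process sends its value to every other process,
    # so afterwards each process holds an identical array of M values.
    received = {p: dict(proposals) for p in proposals}

    # Every process applies the same function to the same data,
    # so all decisions Ai are necessarily equal.
    return {p: max(vals.values()) for p, vals in received.items()}

if __name__ == "__main__":
    print(consensus_no_failures({"P1": 1, "P2": 2, "P3": 3, "P4": 4}))
    # every process decides 4
```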

Page 6: DC7: More Coordination Chapter 11 and 14.2

Consensus: Fail-stop

• Fairly Easy case: fail-stop, synchronous

• If faulty processes are fail-stop, reaching consensus is reasonably easy: all non-faulty processes send their values to all others. However, F of them may fail at some time during the process...

P1 has {1,2,3,4}
P2 has {1,2,3,4}
P3 has {x,2,3,4}
P4 has {x,2,3,4}


Page 7: DC7: More Coordination Chapter 11 and 14.2

Consensus: Fail-stop

• The solution: after all processes send their values to all others, all processes then broadcast all the values they received (and who they received them from).
• This continues for f+1 rounds, where f = |F|. Processes maintain a tree of values.
• P3 and P4 have:

1st round: {x,2,3,4}

2nd round:
  from P2: {1,2,3,4}
  from P3: {x,2,3,4}

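A minimal simulation of this exchange (a sketch, not the slides' exact algorithm): for f+1 rounds every still-running process relays everything it has learned so far. Crashes are modelled crudely here, as a process that simply stops sending from its crash round on, so the partial-send scenario discussed on the next slides is not reproduced.

```python
# FloodSet-style sketch of fail-stop consensus, assuming synchronous rounds
# and reliable channels. `crash_round` says in which round a process stops
# sending; processes not listed never crash.

def flood_set(proposals, crash_round, f):
    known = {p: {p: v} for p, v in proposals.items()}    # values known per process
    for rnd in range(1, f + 2):                          # f+1 rounds
        senders = [p for p in proposals if crash_round.get(p, f + 2) > rnd]
        snapshot = {p: dict(known[p]) for p in senders}  # state at start of round
        for src in senders:
            for dst in senders:
                known[dst].update(snapshot[src])         # relay everything learned
    survivors = [p for p in proposals if crash_round.get(p, f + 2) > f + 1]
    # All survivors know the same set of values, so the same rule (min) agrees.
    return {p: min(known[p].values()) for p in survivors}

print(flood_set({"P1": 1, "P2": 2, "P3": 3, "P4": 4},
                crash_round={"P1": 1}, f=1))
# {'P2': 2, 'P3': 2, 'P4': 2} -- P1's value is lost, but all survivors agree
```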

Page 8: DC7: More Coordination Chapter 11 and 14.2

Consensus: Fail-stop

• If M=4 and f=1, then we need f+1=2 rounds to reach consensus (previous example).
• Do we really need f+1 rounds? Consider M=4, f=2.
• P1 crashes during the 1st round after sending to P2. P2 crashes during the 2nd round after sending to P3.

P2: {1,2,3,4}
P3: {x,2,3,4}
P4: {x,2,3,4}

Page 9: DC7: More Coordination Chapter 11 and 14.2

Consensus: Fail-stop

What do P3 and P4 see?

          P2                     P3            P4
Round 1   {1,2,3,4}              {X,2,3,4}     {X,2,3,4}
Round 2   sends to P3 and dies   {1,2,3,4}     {X,2,3,4}
Round 3                          {1,2,3,4}     {1,2,3,4}

If processes are fail-stop, we can tolerate any number of faulty processes; however, we need f+1 rounds.

Page 10: DC7: More Coordination Chapter 11 and 14.2

Difficult Case: Agreement with Byzantine Failures

• Similar problems: agreement (single proposer) and consensus (all propose values).

• A faulty process may respond like a non-faulty one, so the non-faulty processes do not know who is faulty. A faulty process can send a fake value to throw off the calculation, and can send one value to some processes and a different value to others.

• A faulty process is an adversary that can see the global state: it has more information than the non-faulty nodes. But it can only influence the behavior of the faulty processes.

Page 11: DC7: More Coordination Chapter 11 and 14.2

Variations on Byzantine Agreement

• A process always knows who sent the message it received.
• Default value - some algorithms assume a default value (retreat) when there is no agreement.
• Oral messages - message content is controlled by the latest sender (relayer), so the receiver doesn't know whether or not it was tampered with.

• Signed messages - messages can be authenticated with digital signatures. Assume faulty processes can send arbitrary messages but they cannot forge signatures.

Page 12: DC7: More Coordination Chapter 11 and 14.2

BA with Oral Messages(1)

The commanding general coordinates the other generals.
If all loyal generals attack, victory is certain.
If none attack, the Empire survives.
If some attack, the Empire is lost.
A gong keeps time.

Attack!

Page 13: DC7: More Coordination Chapter 11 and 14.2

BA with Oral Messages(2)

How it works:
• Disloyal generals have corrupt soldiers.
• Orders are distributed by exchange of messages; corrupt soldiers violate the protocol at will.
• But corrupt soldiers can't intercept and modify messages between loyal generals.
• The gong sounds slowly: there is ample time for exchange of messages.
• The commanding general sends his order. Then all other generals relay to all what they received.

Page 14: DC7: More Coordination Chapter 11 and 14.2

BA with Oral Messages(3)

• Limitations

• Let t be the maximum number of faulty processes (disloyal generals).

• Byzantine agreement is not possible with fewer than 3t+1 processes

• Same result holds for consensus in the Byzantine model

• Requires t+1 rounds of messages

Page 15: DC7: More Coordination Chapter 11 and 14.2

Byzantine Consensus Oral Messages(1)

The Byzantine generals problem for 3 loyal generals and 1 traitor.
a) The generals announce their troop strengths (in units of 1 kilosoldier) to all other generals.
b) The vectors that each general assembles based on (a).
c) Additional vectors that each general receives in the next round (all send what they received to all).
Decide the others' values by majority of the 3 reports; if there is no majority, use the default value.
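The following is a sketch (not from the slides) of the vector-exchange-and-majority rule just described, assuming 4 generals and 1 traitor so that n ≥ 3t+1 holds. The names, troop strengths, the traitor's lies, and the DEFAULT fallback are all illustrative; the decision rule used here (one's own observation plus the two relayed reports, i.e. "majority of the 3") is one standard way to instantiate the rule.

```python
from collections import Counter

DEFAULT = 0                           # fallback ("retreat") when there is no majority
GENERALS = ["G1", "G2", "G3", "G4"]
TRAITOR = "G4"
TRUE_VALUE = {"G1": 1, "G2": 2, "G3": 3}             # loyal troop strengths
TRAITOR_LIES = {"G1": 7, "G2": 8, "G3": 9, "G4": 0}  # a different value per receiver

def announce(sender, receiver):
    """Round 1: the value `sender` claims to `receiver`."""
    return TRAITOR_LIES[receiver] if sender == TRAITOR else TRUE_VALUE[sender]

def relay(relayer, vector):
    """Round 2: the relayer forwards its round-1 vector (the traitor garbles it)."""
    return {g: 9 - v for g, v in vector.items()} if relayer == TRAITOR else dict(vector)

def majority(values):
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) // 2 else DEFAULT

# Round 1: each general records what every general announced to it.
round1 = {g: {s: announce(s, g) for s in GENERALS} for g in GENERALS}

# Round 2: each general receives the other generals' round-1 vectors.
round2 = {g: {r: relay(r, round1[r]) for r in GENERALS if r != g} for g in GENERALS}

def decide(g):
    """For each other general j: majority of g's own observation of j plus
    the reports about j relayed by the remaining two generals (3 values)."""
    decided = {}
    for j in GENERALS:
        if j == g:
            continue
        reports = [round1[g][j]] + [round2[g][r][j] for r in round2[g] if r != j]
        decided[j] = majority(reports)
    return decided

for g in ["G1", "G2", "G3"]:
    print(g, decide(g))   # loyal generals agree on every component; the traitor's
                          # slot has no majority and falls back to DEFAULT
```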

Page 16: DC7: More Coordination Chapter 11 and 14.2

Byzantine Consensus Oral Messages (2)

The same as the previous slide, except now with 2 loyal generals and 1 traitor. A majority decision does not guarantee consensus.

Page 17: DC7: More Coordination Chapter 11 and 14.2

BA with Signed Messages (1)

• A faulty process can send arbitrary messages, but cannot forge signatures. All messages are digitally signed for authentication.

• Assume at most f faulty nodes. At the start, each node broadcasts its value in a signed message.

• Each process, at round i:
  – endorses (authenticates) and forwards all messages received in round i-1
  – signatures help locate the faulty process

Page 18: DC7: More Coordination Chapter 11 and 14.2

BA with Signed Messages (2)

• At round f+1, either:
  – there is 1 value per coordinate endorsed by at least f+1 nodes: decide by majority
  – else, decide the default value

• f+1 rounds are proven to be necessary and sufficient. Must have at least f+2 processes.

• Example: in round 1, node p sent me value x. In round 2, node p sent a signed vector with its own component = y. I conclude that node p is faulty.
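A toy sketch of the detection example above. HMAC (a shared-key MAC) stands in for real digital signatures purely for illustration, so the keys, message formats, and node names are all assumptions; the point is only that two conflicting claims, both verifiably signed by p, prove p faulty.

```python
# Sketch: detecting a two-faced node from its own "signed" messages.
import hmac, hashlib

KEYS = {"p": b"p-secret"}     # per-node signing key (illustrative, not a real PKI)

def sign(node, payload: bytes) -> bytes:
    return hmac.new(KEYS[node], payload, hashlib.sha256).digest()

def verify(node, payload: bytes, sig: bytes) -> bool:
    return hmac.compare_digest(sign(node, payload), sig)

# Round 1: p signs and broadcasts its value.
round1_msg = b"round=1;from=p;value=3"
round1_sig = sign("p", round1_msg)

# Round 2: p signs a vector whose own component now claims a different value.
round2_msg = b"round=2;from=p;vector[p]=5"
round2_sig = sign("p", round2_msg)

# Any receiver holding both signed messages can prove p is faulty.
assert verify("p", round1_msg, round1_sig) and verify("p", round2_msg, round2_sig)
claim1 = round1_msg.split(b"value=")[1]
claim2 = round2_msg.split(b"vector[p]=")[1]
if claim1 != claim2:
    print("node p signed two conflicting claims about its own value -> faulty")
```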

Page 19: DC7: More Coordination Chapter 11 and 14.2

Summary

Consensus            Required Number   Required Rounds
fail-stop            N = f+1           R = f+1
byzantine (oral)     N = 3f+1          R = f+1
byzantine (signed)   N = f+2           R = f+1

Page 20: DC7: More Coordination Chapter 11 and 14.2

Consensus in Asynchronous Systems

• All of the preceding agreement and consensus algorithms are for synchronous systems; that is, the algorithm works by sending messages in rounds or phases.

• What about Byzantine consensus in an asynchronous system?
• Provably impossible if any node is faulty [FLP 1985], but practical algorithms do exist using failure detectors.

Page 21: DC7: More Coordination Chapter 11 and 14.2

Reliable Group Communication

• We would like a message sent to a group of processes to be delivered to every member of that group.

• Problems: processes join and leave the group; processes crash (that's a leave); the sender crashes (after sending to some members or doing only part of the send operation).

• What about: Efficiency? Message delivery order? Timeliness?

Page 22: DC7: More Coordination Chapter 11 and 14.2

Group Communication

• Multicast communication requires coordination and agreement. Members of a group receive copies of messages sent to the group.
• Many different delivery guarantees are possible
  – e.g. agree on the set of messages received or on delivery ordering
• A process can multicast by the use of a single operation instead of a send to each member
  – For example, in IP multicast: aSocket.send(aMessage) (see the sketch after this slide)
  – The single operation allows for:
    • efficiency: send once on each link, using hardware multicast when available
    • delivery guarantees: e.g. can't make a guarantee if multicast is implemented as multiple sends and the sender fails. Also ordering.
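Here is a minimal sketch of the single-operation send mentioned above, using Python's standard socket API for IP multicast. The group address and port are arbitrary examples.

```python
import socket
import struct

MCAST_GRP, MCAST_PORT = "224.1.1.1", 5007     # illustrative group address and port

# Receiver: join the multicast group, then receive as usual.
recv_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
recv_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
recv_sock.bind(("", MCAST_PORT))
membership = struct.pack("4sl", socket.inet_aton(MCAST_GRP), socket.INADDR_ANY)
recv_sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, membership)

# Sender: one send() per message reaches every joined receiver.
send_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
send_sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
send_sock.sendto(b"aMessage", (MCAST_GRP, MCAST_PORT))

# data, addr = recv_sock.recvfrom(1024)       # would block until the multicast arrives
```

Note that plain IP multicast gives no reliability or ordering guarantees; that is exactly what the following slides add on top.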

Page 23: DC7: More Coordination Chapter 11 and 14.2

Reliable Group Communication

• Revised definition: A message sent to a group of processes should be delivered to all non-faulty members of that group.

• How to implement reliability: message sequencing and ACKs.


Page 24: DC7: More Coordination Chapter 11 and 14.2

Reliable Group Communication

• For efficiency, many algorithms form a tree structure to handle message multiplication.

• Should interior nodes store the message? If not, all ACKs must be sent to the originator.


Page 25: DC7: More Coordination Chapter 11 and 14.2

RGC: Handling Ack/Nacks

• Problem: ACK implosion - this does not scale well.

• Solution attempt: don't ACK; instead, NACK missing messages. However, a receiver may never NACK, because it doesn't know it missed a message if it isn't receiving anything at all. Thus the sender has to buffer outgoing messages forever. Also, a message dropped high in the multicast tree creates a NACK implosion.


Page 26: DC7: More Coordination Chapter 11 and 14.2

RGC: Handling Nacks

• If processes see all messages from the others, they can use Scalable Reliable Multicast (SRM) [Floyd 1997].

• There are no ACKs in SRM; only missing messages are NACKed. When a receiver detects a missed message, it waits for a random delay, then multicasts its NACK to everyone in the group. This feedback allows other group members who missed the same message to suppress their NACK.

• Assumption: the retransmission of the NACKed message will be a multicast. This is called feedback suppression. Problems: still lots of NACK traffic.
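A sketch of the feedback-suppression idea (not the actual SRM implementation): a receiver schedules its NACK after a random delay and cancels it if it overhears another receiver's NACK for the same message. The class name, delays, and callback are illustrative.

```python
import random
import threading

class NackScheduler:
    def __init__(self, send_nack, min_delay=0.1, max_delay=0.5):
        self.send_nack = send_nack            # callback that multicasts a NACK
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.pending = {}                     # seq -> timer for a scheduled NACK

    def missing(self, seq):
        """Called when this receiver notices message `seq` is missing."""
        if seq in self.pending:
            return
        delay = random.uniform(self.min_delay, self.max_delay)
        timer = threading.Timer(delay, self._fire, args=(seq,))
        self.pending[seq] = timer
        timer.start()

    def nack_overheard(self, seq):
        """Another receiver multicast a NACK for `seq`: suppress ours."""
        timer = self.pending.pop(seq, None)
        if timer:
            timer.cancel()

    def repaired(self, seq):
        """The retransmission arrived; no NACK is needed any more."""
        self.nack_overheard(seq)

    def _fire(self, seq):
        # Our random timer expired first: we are the one that multicasts the NACK.
        if self.pending.pop(seq, None) is not None:
            self.send_nack(seq)
```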

Page 27: DC7: More Coordination Chapter 11 and 14.2

Nonhierarchical Feedback Control

Several receivers have scheduled a request for retransmission, but the first retransmission request leads to the suppression of the others.

Page 28: DC7: More Coordination Chapter 11 and 14.2

Hierarchical Feedback Control

• Hierarchies or trees are frequently formed for multicast, so why not use them for feedback control? Better scalability.

• This works if there is a single sender (or a local group of senders) and the group membership is fairly stable.

• A rooted tree is formed with the sender at the root. Every other node is a group of receivers. Each group of receivers has a coordinator that buffers the message, collects NACKs or ACKs from its group, and sends one on up the tree toward the sender.

• Hard to handle group membership changes.

Page 29: DC7: More Coordination Chapter 11 and 14.2

Hierarchical Feedback Control

The essence of hierarchical reliable multicasting:
a) Each local coordinator forwards the message to its children.
b) A local coordinator handles retransmission requests.
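A sketch of what a local coordinator in such a tree might do, under the assumptions of the two slides above: it buffers the message for its own group, serves local retransmission requests itself, and sends a single aggregated ACK up toward the root. All names and the callback interfaces are illustrative.

```python
class LocalCoordinator:
    def __init__(self, group_members, send_up, send_down):
        self.members = set(group_members)
        self.send_up = send_up        # forward one aggregated ACK toward the root
        self.send_down = send_down    # retransmit to a member of this local group
        self.buffer = {}              # msg_id -> message (kept for local repairs)
        self.acked = {}               # msg_id -> set of members that have ACKed

    def on_message(self, msg_id, message):
        """A multicast arrives from the parent: buffer it and forward to children."""
        self.buffer[msg_id] = message
        self.acked[msg_id] = set()
        for member in self.members:
            self.send_down(member, message)

    def on_member_ack(self, member, msg_id):
        self.acked[msg_id].add(member)
        if self.acked[msg_id] == self.members and msg_id in self.buffer:
            self.send_up(msg_id)      # the whole subtree has the message
            del self.buffer[msg_id]   # safe to drop the local copy

    def on_member_nack(self, member, msg_id):
        # Handle the retransmission locally instead of bothering the root.
        self.send_down(member, self.buffer[msg_id])
```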

Page 30: DC7: More Coordination Chapter 11 and 14.2

Multicast Terminology

A message is received by the OS and communication layer, but it is not delivered to the application until it has been verifiably received by all other processes in the group.

Page 31: DC7: More Coordination Chapter 11 and 14.2

The Meaning of Delivered

[Diagram: incoming messages enter a hold-back queue; when the delivery guarantees are met, a message moves to the delivery queue and is delivered to message processing.]

Page 32: DC7: More Coordination Chapter 11 and 14.2

Message Ordering

• Unordered - P1 may be delivered the messages in an arbitrary order, which might be different from the order in which P2 gets them.

• FIFO - all messages from a single source will be delivered in the order in which they were sent (see the sketch after this list).

• Causally ordered - recall Lamport's definition of causality. Potential causality must be preserved. Causally related messages from multiple sources are delivered in causal order.

• Total order - all processes deliver the messages in the same order. Frequently causal as well. "All messages multicast to a group are delivered to all members of the group in the same order."
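As a concrete illustration of FIFO ordering, here is a sketch using per-sender sequence numbers and a hold-back queue, matching the hold-back/delivery-queue picture a few slides back. The class and method names are assumptions, not from the slides.

```python
class FifoDelivery:
    def __init__(self):
        self.next_expected = {}   # sender -> next sequence number to deliver
        self.hold_back = {}       # sender -> {seq: message} held-back messages

    def receive(self, sender, seq, message):
        """Called when a multicast is received from the network."""
        self.hold_back.setdefault(sender, {})[seq] = message
        expected = self.next_expected.setdefault(sender, 0)
        delivered = []
        # Deliver every consecutive message from this sender that is now ready.
        while expected in self.hold_back[sender]:
            delivered.append(self.hold_back[sender].pop(expected))
            expected += 1
        self.next_expected[sender] = expected
        return delivered          # messages handed to the application, in FIFO order

# Example: m2 (seq 1) arrives before m1 (seq 0) and is held back.
fifo = FifoDelivery()
print(fifo.receive("P1", 1, "m2"))    # [] -- held back
print(fifo.receive("P1", 0, "m1"))    # ['m1', 'm2'] -- delivered in FIFO order
```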

Page 33: DC7: More Coordination Chapter 11 and 14.2

Unordered Messages

Three communicating processes in the same group. The ordering of events per process is shown along the vertical axis.

Process P1    Process P2     Process P3
sends m1      receives m1    receives m2
sends m2      receives m2    receives m1

Page 34: DC7: More Coordination Chapter 11 and 14.2

FIFO Ordering

Four processes in the same group with two different senders, and a possible delivery order of messages under FIFO-ordered multicasting:

Process P1    Process P2     Process P3     Process P4
sends m1      delivers m1    delivers m3    sends m3
sends m2      delivers m3    delivers m1    sends m4
              delivers m2    delivers m2
              delivers m4    delivers m4

Page 35: DC7: More Coordination Chapter 11 and 14.2

Atomic Multicast

• Totally ordered group communication.
• Atomic = the message is delivered to all or none.
• A view (also called a group view) is the group membership at any given time, that is, the set of processes belonging to the group.
• The concept of a view is needed to handle membership changes.

Page 36: DC7: More Coordination Chapter 11 and 14.2

Total Order

If P1 and P4 are in the multicast group, they also deliver the messages in this order. So P4 may send m3 at t=1 but not deliver it until t=2.

Process P1    Process P2     Process P3     Process P4
sends m1      delivers m1    delivers m1    sends m3
sends m2      delivers m4    delivers m4    sends m4
              delivers m2    delivers m2
              delivers m3    delivers m3

Page 37: DC7: More Coordination Chapter 11 and 14.2

Virtual (View) Synchrony

• How to define atomic multicast in the presence of failures? How can we guarantee delivery to all group members?

• Example: 50 members in a group. I multicast a message m1; P10 fails before getting the message, but the others got it, and I assume P10 got it too.

• Control membership changes with a view change.
• Virtual synchrony says something about the order of message delivery with respect to a view change, since messages must be ordered with respect to the view-change message.

Page 38: DC7: More Coordination Chapter 11 and 14.2

Properties of Virtual Synchrony

• Each process in the view has the same view. That is, they all agree on the group membership.

• When a process joins or leaves (including crash), this is announced to all (non-crashed) processes in the (old) group with a view change message VC.

• If one process, P1, in view v delivers message m, then all processes belonging to view v deliver message m in view v. (Recall difference between receive and deliver)

Page 39: DC7: More Coordination Chapter 11 and 14.2

Virtual Synchrony

The principle of virtual synchronous multicast.

Page 40: DC7: More Coordination Chapter 11 and 14.2

Figure 14.3

[Diagram: four executions of processes p, q, and r. In each, the group starts in view (p, q, r), p crashes, and the view changes to (q, r). Executions (a) and (b) are allowed under view synchrony; executions (c) and (d) are disallowed.]

Page 41: DC7: More Coordination Chapter 11 and 14.2

Multicast Summary

Six different kinds of reliable multicasting.

Multicast                  Basic Message Ordering      Total-ordered Delivery?
Reliable multicast         None                        No
FIFO multicast             FIFO-ordered delivery       No
Causal multicast           Causal-ordered delivery     No
Atomic multicast           None                        Yes
FIFO atomic multicast      FIFO-ordered delivery       Yes
Causal atomic multicast    Causal-ordered delivery     Yes

Page 42: DC7: More Coordination Chapter 11 and 14.2

Atomic Multicast: Amoeba, etc

• One node is the coordinator.
• Either everyone sends its messages to the coordinator, and the coordinator chooses the order and sends each message to everyone, or
• Everyone sends its messages to the coordinator and to all nodes, and the coordinator chooses the order and sends only the message number to everyone
  – (msg 5 from P4: global order 33)

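A sketch of the second variant above (a coordinator of this kind is often called a sequencer): the coordinator assigns global order numbers, and each receiver holds a message back until its order number is the next one in the global sequence. The class names and the in-memory "network" are illustrative.

```python
import itertools

class Sequencer:
    def __init__(self):
        self._next = itertools.count(1)

    def order(self, sender, msg_id):
        """Assign the next global order number (the ordering multicast is simulated)."""
        return (next(self._next), sender, msg_id)

class Receiver:
    def __init__(self):
        self.next_to_deliver = 1
        self.pending = {}             # order number -> (sender, msg_id)

    def on_order(self, order_no, sender, msg_id):
        self.pending[order_no] = (sender, msg_id)
        delivered = []
        while self.next_to_deliver in self.pending:
            delivered.append(self.pending.pop(self.next_to_deliver))
            self.next_to_deliver += 1
        return delivered              # the same total order at every receiver

seq = Sequencer()
r1, r2 = Receiver(), Receiver()
o1 = seq.order("P4", "msg 5")         # e.g. (1, 'P4', 'msg 5')
o2 = seq.order("P2", "msg 9")
for r in (r1, r2):
    r.on_order(*o2)                   # arrives "out of order": held back
    print(r.on_order(*o1))            # both receivers deliver in the same order
```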

Page 43: DC7: More Coordination Chapter 11 and 14.2

Atomic Multicast: Totem

• Developed at UCSB.
• Processes are organized into a logical ring.
• A token is passed around the ring. The token carries the message number of the next message to be multicast.
• Only the token holder can multicast a message. This easily establishes total order. Retransmissions of missed messages are the responsibility of the original sender.
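A sketch of the token-passing idea (not the real Totem protocol, which also handles membership changes and recovery): only the token holder multicasts, and the sequence number carried on the token directly yields a total order. Names and structure are illustrative.

```python
class Token:
    def __init__(self):
        self.next_seq = 1             # message number of the next multicast

class RingNode:
    def __init__(self, name, outbox):
        self.name = name
        self.outbox = list(outbox)    # messages this node wants to multicast

    def on_token(self, token, network):
        """Multicast pending messages while holding the token, then pass it on."""
        while self.outbox:
            msg = self.outbox.pop(0)
            network.append((token.next_seq, self.name, msg))  # totally ordered
            token.next_seq += 1
        return token

ring = [RingNode("A", ["a1"]), RingNode("B", ["b1", "b2"]), RingNode("C", [])]
network, token = [], Token()
for node in ring:                     # one full rotation of the token
    token = node.on_token(token, network)
print(network)   # [(1, 'A', 'a1'), (2, 'B', 'b1'), (3, 'B', 'b2')]
```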

Page 44: DC7: More Coordination Chapter 11 and 14.2

End