DC7: More Coordination Chapter 11 and 14.2 Consensus Group Communication


Page 1: DC7: More Coordination Chapter 11 and 14.2

DC7: More Coordination (Chapter 11 and 14.2)

• Consensus

• Group Communication

Page 2: DC7: More Coordination Chapter 11 and 14.2

Topics

• Agreement and Consensus
  – No fault, fail-stop, Byzantine
• Group Communication
  – Order and delivery guarantees
• Virtual Synchrony

Page 3: DC7: More Coordination Chapter 11 and 14.2

Consensus (Section 11.5)

• Distributed agreement, or "distributed consensus", is the fundamental problem in DS.
  – Distributed mutual exclusion and election are basically getting processes to agree on something.
  – Agreeing on time or on the update of replicated data are special cases of the distributed consensus problem.
• Agreement sometimes means one process proposes a value and the others agree on it, while consensus means all processes propose values and all agree on some function of those values.

Page 4: DC7: More Coordination Chapter 11 and 14.2

Consensus (Agreement)

• There are M processes, P1, P2, …, PM in a DS that are trying to reach agreement. A subset F of the processes is faulty. Each process Pi is initially undecided and proposes a value Vi. During agreement, the processes each calculate a value Ai. At the end of the algorithm:
  – All non-faulty processes reach a decision.
  – For every pair of non-faulty processes Pi and Pj, Ai = Aj. This is the agreement value.
  – The agreement value is a function of the initial values {Vi} of the non-faulty processes.
• The function is often max (as in the case of election), or average, or one of the Vi. If all non-faulty processes have the same Vi, then that must be the agreement value.
• Communications are reliable and synchronous.

Page 5: DC7: More Coordination Chapter 11 and 14.2

Consensus: Easy Case: No Failures

• No failures, synchronous, M processes.
• If there can be no failures, reaching consensus is easy: every process sends its value to every other process. All processes now have identical information.
• All processes do the same calculation and come up with the same value. Each process needs to maintain an array of M values.

P1 has {1,2,3,4}
P2 has {1,2,3,4}
P3 has {1,2,3,4}
P4 has {1,2,3,4}

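As a concrete illustration of the no-failure case (a sketch, not from the slides): every process sends its proposal to every other process, each ends up with the same array of M values, and each applies the same deterministic function, here max.

```python
# Sketch only: consensus with no failures in a synchronous system,
# simulated in-memory. The function `max` and the process names are
# illustrative; any deterministic function of the M values would do.

def consensus_no_failures(proposals):
    """proposals: dict mapping process id -> proposed value Vi."""
    # Round 1: every process sends its value to every other process,
    # so afterwards each process holds an identical array of M values.
    received = {p: dict(proposals) for p in proposals}

    # Every process applies the same function to the same data,
    # so all decisions Ai are necessarily equal.
    return {p: max(vals.values()) for p, vals in received.items()}

if __name__ == "__main__":
    print(consensus_no_failures({"P1": 1, "P2": 2, "P3": 3, "P4": 4}))
    # every process decides 4
```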

Page 6: DC7: More Coordination Chapter 11 and 14.2

Consensus: Fail-stop

• Fairly Easy case: fail-stop, synchronous

• If faulty processes are fail-stop, reaching consensus is reasonably easy: all non-faulty processes send their values to all others. However, F of them may fail at some time during the process...

P1 has {1,2,3,4}
P2 has {1,2,3,4}
P3 has {x,2,3,4}
P4 has {x,2,3,4}


Page 7: DC7: More Coordination Chapter 11 and 14.2

Consensus: Fail-stop

• The solution: after all processes send their values to all others, all processes then broadcast all the values they received (and who they received them from).
• This continues for f+1 rounds, where f = |F|. Processes maintain a tree of values.
• P3 and P4 have:

1st round: {x,2,3,4}

2nd round:
  from P2: {1,2,3,4}
  from P3: {x,2,3,4}

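A minimal simulation of this exchange (a sketch, not the slides' exact algorithm): for f+1 rounds every still-running process relays everything it has learned so far. Crashes are modelled crudely here, as a process that simply stops sending from its crash round on, so the partial-send scenario discussed on the next slides is not reproduced.

```python
# FloodSet-style sketch of fail-stop consensus, assuming synchronous rounds
# and reliable channels. `crash_round` says in which round a process stops
# sending; processes not listed never crash.

def flood_set(proposals, crash_round, f):
    known = {p: {p: v} for p, v in proposals.items()}    # values known per process
    for rnd in range(1, f + 2):                          # f+1 rounds
        senders = [p for p in proposals if crash_round.get(p, f + 2) > rnd]
        snapshot = {p: dict(known[p]) for p in senders}  # state at start of round
        for src in senders:
            for dst in senders:
                known[dst].update(snapshot[src])         # relay everything learned
    survivors = [p for p in proposals if crash_round.get(p, f + 2) > f + 1]
    # All survivors know the same set of values, so the same rule (min) agrees.
    return {p: min(known[p].values()) for p in survivors}

print(flood_set({"P1": 1, "P2": 2, "P3": 3, "P4": 4},
                crash_round={"P1": 1}, f=1))
# {'P2': 2, 'P3': 2, 'P4': 2} -- P1's value is lost, but all survivors agree
```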

Page 8: DC7: More Coordination Chapter 11 and 14.2

Consensus: Fail-stop

• If M=4 and f=1, then we need f+1=2 rounds to reach consensus (previous example).
• Do we really need f+1 rounds? Consider M=4, f=2.
• P1 crashes during the 1st round after sending to P2. P2 crashes during the 2nd round after sending to P3.

P2: {1,2,3,4}
P3: {x,2,3,4}
P4: {x,2,3,4}

Page 9: DC7: More Coordination Chapter 11 and 14.2

Consensus: Fail-stop

What do P3 and P4 see?

          P2                     P3            P4
Round 1   {1,2,3,4}              {X,2,3,4}     {X,2,3,4}
Round 2   sends to P3 and dies   {1,2,3,4}     {X,2,3,4}
Round 3                          {1,2,3,4}     {1,2,3,4}

If processes are fail-stop, we can tolerate any number of faulty processes; however, we need f+1 rounds.

Page 10: DC7: More Coordination Chapter 11 and 14.2

Difficult Case: Agreement with Byzantine Failures

• Similar problems: agreement (single proposer) and consensus (all propose values).

• A faulty process may respond like a non-faulty one, so the non-faulty processes do not know who is faulty. A faulty process can send a fake value to throw off the calculation, and can send one value to some processes and a different value to others.

• A faulty process is an adversary that can see the global state: it has more information than the non-faulty nodes. But it can only influence the behavior of the faulty processes.

Page 11: DC7: More Coordination Chapter 11 and 14.2

Variations on Byzantine Agreement

• A process always knows who sent the message it received.
• Default value - some algorithms assume a default value (retreat) when there is no agreement.
• Oral messages - message content is controlled by the latest sender (relayer), so the receiver doesn't know whether or not it was tampered with.

• Signed messages - messages can be authenticated with digital signatures. Assume faulty processes can send arbitrary messages but they cannot forge signatures.

Page 12: DC7: More Coordination Chapter 11 and 14.2

BA with Oral Messages(1)

The commanding general coordinates the other generals.
If all loyal generals attack, victory is certain.
If none attack, the Empire survives.
If some attack, the Empire is lost.
A gong keeps time.

Attack!

Page 13: DC7: More Coordination Chapter 11 and 14.2

BA with Oral Messages(2)

How it works:
• Disloyal generals have corrupt soldiers.
• Orders are distributed by exchange of messages; corrupt soldiers violate the protocol at will.
• But corrupt soldiers can't intercept and modify messages between loyal generals.
• The gong sounds slowly: there is ample time for exchange of messages.
• The commanding general sends his order. Then all other generals relay to all what they received.

Page 14: DC7: More Coordination Chapter 11 and 14.2

BA with Oral Messages(3)

• Limitations

• Let t be the maximum number of faulty processes (disloyal generals).

• Byzantine agreement is not possible with fewer than 3t+1 processes

• Same result holds for consensus in the Byzantine model

• Requires t+1 rounds of messages

Page 15: DC7: More Coordination Chapter 11 and 14.2

Byzantine Consensus Oral Messages(1)

The Byzantine generals problem for 3 loyal generals and 1 traitor.
a) The generals announce their troop strengths (in units of 1 kilosoldier) to all other generals.
b) The vectors that each general assembles based on (a).
c) Additional vectors that each general receives in the next round (all send what they received to all).
Decide the others' values by majority of the 3 reports; if there is no majority, use the default value.
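The following is a sketch (not from the slides) of the vector-exchange-and-majority rule just described, assuming 4 generals and 1 traitor so that n ≥ 3t+1 holds. The names, troop strengths, the traitor's lies, and the DEFAULT fallback are all illustrative; the decision rule used here (one's own observation plus the two relayed reports, i.e. "majority of the 3") is one standard way to instantiate the rule.

```python
from collections import Counter

DEFAULT = 0                           # fallback ("retreat") when there is no majority
GENERALS = ["G1", "G2", "G3", "G4"]
TRAITOR = "G4"
TRUE_VALUE = {"G1": 1, "G2": 2, "G3": 3}             # loyal troop strengths
TRAITOR_LIES = {"G1": 7, "G2": 8, "G3": 9, "G4": 0}  # a different value per receiver

def announce(sender, receiver):
    """Round 1: the value `sender` claims to `receiver`."""
    return TRAITOR_LIES[receiver] if sender == TRAITOR else TRUE_VALUE[sender]

def relay(relayer, vector):
    """Round 2: the relayer forwards its round-1 vector (the traitor garbles it)."""
    return {g: 9 - v for g, v in vector.items()} if relayer == TRAITOR else dict(vector)

def majority(values):
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) // 2 else DEFAULT

# Round 1: each general records what every general announced to it.
round1 = {g: {s: announce(s, g) for s in GENERALS} for g in GENERALS}

# Round 2: each general receives the other generals' round-1 vectors.
round2 = {g: {r: relay(r, round1[r]) for r in GENERALS if r != g} for g in GENERALS}

def decide(g):
    """For each other general j: majority of g's own observation of j plus
    the reports about j relayed by the remaining two generals (3 values)."""
    decided = {}
    for j in GENERALS:
        if j == g:
            continue
        reports = [round1[g][j]] + [round2[g][r][j] for r in round2[g] if r != j]
        decided[j] = majority(reports)
    return decided

for g in ["G1", "G2", "G3"]:
    print(g, decide(g))   # loyal generals agree on every component; the traitor's
                          # slot has no majority and falls back to DEFAULT
```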

Page 16: DC7: More Coordination Chapter 11 and 14.2

Byzantine Consensus Oral Messages (2)

The same as the previous slide, except now with 2 loyal generals and 1 traitor. A majority decision does not guarantee consensus.

Page 17: DC7: More Coordination Chapter 11 and 14.2

BA with Signed Messages (1)

• A faulty process can send arbitrary messages, but cannot forge signatures. All messages are digitally signed for authentication.

• Assume at most f faulty nodes. At the start, each node broadcasts its value in a signed message.

• Each process, at round i:
  – endorses (authenticates) and forwards all messages received in round i-1
  – signatures help locate the faulty process

Page 18: DC7: More Coordination Chapter 11 and 14.2

BA with Signed Messages (2)

• At round f+1, either:
  – there is 1 value per coordinate endorsed by at least f+1 nodes: decide by majority
  – else, decide the default value

• f+1 rounds are proven to be necessary and sufficient. Must have at least f+2 processes.

• Example: in round 1, node p sent me value x. In round 2, node p sent a signed vector with its own component = y. I conclude that node p is faulty.
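A toy sketch of the detection example above. HMAC (a shared-key MAC) stands in for real digital signatures purely for illustration, so the keys, message formats, and node names are all assumptions; the point is only that two conflicting claims, both verifiably signed by p, prove p faulty.

```python
# Sketch: detecting a two-faced node from its own "signed" messages.
import hmac, hashlib

KEYS = {"p": b"p-secret"}     # per-node signing key (illustrative, not a real PKI)

def sign(node, payload: bytes) -> bytes:
    return hmac.new(KEYS[node], payload, hashlib.sha256).digest()

def verify(node, payload: bytes, sig: bytes) -> bool:
    return hmac.compare_digest(sign(node, payload), sig)

# Round 1: p signs and broadcasts its value.
round1_msg = b"round=1;from=p;value=3"
round1_sig = sign("p", round1_msg)

# Round 2: p signs a vector whose own component now claims a different value.
round2_msg = b"round=2;from=p;vector[p]=5"
round2_sig = sign("p", round2_msg)

# Any receiver holding both signed messages can prove p is faulty.
assert verify("p", round1_msg, round1_sig) and verify("p", round2_msg, round2_sig)
claim1 = round1_msg.split(b"value=")[1]
claim2 = round2_msg.split(b"vector[p]=")[1]
if claim1 != claim2:
    print("node p signed two conflicting claims about its own value -> faulty")
```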

Page 19: DC7: More Coordination Chapter 11 and 14.2

Summary

Consensus            Required Number   Required Rounds
fail-stop            N = f+1           R = f+1
byzantine (oral)     N = 3f+1          R = f+1
byzantine (signed)   N = f+2           R = f+1

Page 20: DC7: More Coordination Chapter 11 and 14.2

Consensus in Asynchronous Systems

• All of the preceding agreement and consensus algorithms are for synchronous systems; that is, the algorithm works by sending messages in rounds or phases.

• What about Byzantine consensus in an asynchronous system?
• Provably impossible if any node is faulty [FLP 1985], but practical algorithms do exist using failure detectors.

Page 21: DC7: More Coordination Chapter 11 and 14.2

Reliable Group Communication

• We would like a message sent to a group of processes to be delivered to every member of that group.

• Problems: processes join and leave the group; processes crash (that's a leave); the sender crashes (after sending to some members or doing only part of the send operation).

• What about: Efficiency? Message delivery order? Timeliness?

Page 22: DC7: More Coordination Chapter 11 and 14.2

Group Communication

• Multicast communication requires coordination and agreement. Members of a group receive copies of messages sent to the group.
• Many different delivery guarantees are possible
  – e.g. agree on the set of messages received or on delivery ordering
• A process can multicast by the use of a single operation instead of a send to each member
  – For example, in IP multicast: aSocket.send(aMessage) (see the sketch after this slide)
  – The single operation allows for:
    • efficiency: send once on each link, using hardware multicast when available
    • delivery guarantees: e.g. can't make a guarantee if multicast is implemented as multiple sends and the sender fails. Also ordering.
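Here is a minimal sketch of the single-operation send mentioned above, using Python's standard socket API for IP multicast. The group address and port are arbitrary examples.

```python
import socket
import struct

MCAST_GRP, MCAST_PORT = "224.1.1.1", 5007     # illustrative group address and port

# Receiver: join the multicast group, then receive as usual.
recv_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
recv_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
recv_sock.bind(("", MCAST_PORT))
membership = struct.pack("4sl", socket.inet_aton(MCAST_GRP), socket.INADDR_ANY)
recv_sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, membership)

# Sender: one send() per message reaches every joined receiver.
send_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
send_sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
send_sock.sendto(b"aMessage", (MCAST_GRP, MCAST_PORT))

# data, addr = recv_sock.recvfrom(1024)       # would block until the multicast arrives
```

Note that plain IP multicast gives no reliability or ordering guarantees; that is exactly what the following slides add on top.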

Page 23: DC7: More Coordination Chapter 11 and 14.2

Reliable Group Communication

• Revised definition: A message sent to a group of processes should be delivered to all non-faulty members of that group.

• How to implement reliability: message sequencing and ACKs.


Page 24: DC7: More Coordination Chapter 11 and 14.2

Reliable Group Communication

• For efficiency, many algorithms form a tree structure to handle message multiplication.

• Should interior nodes store the message? If not, all ACKs must be sent to the originator.


Page 25: DC7: More Coordination Chapter 11 and 14.2

RGC: Handling Ack/Nacks

• Problem: ACK implosion - this does not scale well.

• Solution attempt: don't ACK; instead, NACK missing messages. However, a receiver may never NACK, because it doesn't know it missed a message if it isn't receiving anything at all. Thus the sender has to buffer outgoing messages forever. Also, a message dropped high in the multicast tree creates a NACK implosion.


Page 26: DC7: More Coordination Chapter 11 and 14.2

RGC: Handling Nacks

• If processes see all messages from the others, they can use Scalable Reliable Multicast (SRM) [Floyd 1997].

• There are no ACKs in SRM; only missing messages are NACKed. When a receiver detects a missed message, it waits for a random delay, then multicasts its NACK to everyone in the group. This feedback allows other group members who missed the same message to suppress their NACK.

• Assumption: the retransmission of the NACKed message will be a multicast. This is called feedback suppression. Problems: still lots of NACK traffic.
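A sketch of the feedback-suppression idea (not the actual SRM implementation): a receiver schedules its NACK after a random delay and cancels it if it overhears another receiver's NACK for the same message. The class name, delays, and callback are illustrative.

```python
import random
import threading

class NackScheduler:
    def __init__(self, send_nack, min_delay=0.1, max_delay=0.5):
        self.send_nack = send_nack            # callback that multicasts a NACK
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.pending = {}                     # seq -> timer for a scheduled NACK

    def missing(self, seq):
        """Called when this receiver notices message `seq` is missing."""
        if seq in self.pending:
            return
        delay = random.uniform(self.min_delay, self.max_delay)
        timer = threading.Timer(delay, self._fire, args=(seq,))
        self.pending[seq] = timer
        timer.start()

    def nack_overheard(self, seq):
        """Another receiver multicast a NACK for `seq`: suppress ours."""
        timer = self.pending.pop(seq, None)
        if timer:
            timer.cancel()

    def repaired(self, seq):
        """The retransmission arrived; no NACK is needed any more."""
        self.nack_overheard(seq)

    def _fire(self, seq):
        # Our random timer expired first: we are the one that multicasts the NACK.
        if self.pending.pop(seq, None) is not None:
            self.send_nack(seq)
```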

Page 27: DC7: More Coordination Chapter 11 and 14.2

Nonhierarchical Feedback Control

Several receivers have scheduled a request for retransmission, but the first retransmission request leads to the suppression of the others.

Page 28: DC7: More Coordination Chapter 11 and 14.2

Hierarchical Feedback Control

• Hierarchies or trees are frequently formed for multicast, so why not use them for feedback control? Better scalability.

• This works if there is a single sender (or a local group of senders) and the group membership is fairly stable.

• A rooted tree is formed with the sender at the root. Every other node is a group of receivers. Each group of receivers has a coordinator that buffers the message, collects NACKs or ACKs from its group, and sends one on up the tree toward the sender.

• Hard to handle group membership changes.

Page 29: DC7: More Coordination Chapter 11 and 14.2

Hierarchical Feedback Control

The essence of hierarchical reliable multicasting:
a) Each local coordinator forwards the message to its children.
b) A local coordinator handles retransmission requests.
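A sketch of what a local coordinator in such a tree might do, under the assumptions of the two slides above: it buffers the message for its own group, serves local retransmission requests itself, and sends a single aggregated ACK up toward the root. All names and the callback interfaces are illustrative.

```python
class LocalCoordinator:
    def __init__(self, group_members, send_up, send_down):
        self.members = set(group_members)
        self.send_up = send_up        # forward one aggregated ACK toward the root
        self.send_down = send_down    # retransmit to a member of this local group
        self.buffer = {}              # msg_id -> message (kept for local repairs)
        self.acked = {}               # msg_id -> set of members that have ACKed

    def on_message(self, msg_id, message):
        """A multicast arrives from the parent: buffer it and forward to children."""
        self.buffer[msg_id] = message
        self.acked[msg_id] = set()
        for member in self.members:
            self.send_down(member, message)

    def on_member_ack(self, member, msg_id):
        self.acked[msg_id].add(member)
        if self.acked[msg_id] == self.members and msg_id in self.buffer:
            self.send_up(msg_id)      # the whole subtree has the message
            del self.buffer[msg_id]   # safe to drop the local copy

    def on_member_nack(self, member, msg_id):
        # Handle the retransmission locally instead of bothering the root.
        self.send_down(member, self.buffer[msg_id])
```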

Page 30: DC7: More Coordination Chapter 11 and 14.2

Multicast Terminology

A message is received by the OS and communication layer, but it is not delivered to the application until it has been verifiably received by all other processes in the group.

Page 31: DC7: More Coordination Chapter 11 and 14.2

The Meaning of Delivered

[Diagram: incoming messages enter a hold-back queue; when the delivery guarantees are met, a message moves to the delivery queue and is delivered to message processing.]

Page 32: DC7: More Coordination Chapter 11 and 14.2

Message Ordering

• Unordered - P1 may be delivered the messages in an arbitrary order, which might be different from the order in which P2 gets them.

• FIFO - all messages from a single source will be delivered in the order in which they were sent (see the sketch after this list).

• Causally ordered - recall Lamport's definition of causality. Potential causality must be preserved. Causally related messages from multiple sources are delivered in causal order.

• Total order - all processes deliver the messages in the same order. Frequently causal as well. "All messages multicast to a group are delivered to all members of the group in the same order."
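As a concrete illustration of FIFO ordering, here is a sketch using per-sender sequence numbers and a hold-back queue, matching the hold-back/delivery-queue picture a few slides back. The class and method names are assumptions, not from the slides.

```python
class FifoDelivery:
    def __init__(self):
        self.next_expected = {}   # sender -> next sequence number to deliver
        self.hold_back = {}       # sender -> {seq: message} held-back messages

    def receive(self, sender, seq, message):
        """Called when a multicast is received from the network."""
        self.hold_back.setdefault(sender, {})[seq] = message
        expected = self.next_expected.setdefault(sender, 0)
        delivered = []
        # Deliver every consecutive message from this sender that is now ready.
        while expected in self.hold_back[sender]:
            delivered.append(self.hold_back[sender].pop(expected))
            expected += 1
        self.next_expected[sender] = expected
        return delivered          # messages handed to the application, in FIFO order

# Example: m2 (seq 1) arrives before m1 (seq 0) and is held back.
fifo = FifoDelivery()
print(fifo.receive("P1", 1, "m2"))    # [] -- held back
print(fifo.receive("P1", 0, "m1"))    # ['m1', 'm2'] -- delivered in FIFO order
```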

Page 33: DC7: More Coordination Chapter 11 and 14.2

Unordered Messages

Three communicating processes in the same group. The ordering of events per process is shown along the vertical axis.

Process P1    Process P2     Process P3
sends m1      receives m1    receives m2
sends m2      receives m2    receives m1

Page 34: DC7: More Coordination Chapter 11 and 14.2

FIFO Ordering

Four processes in the same group with two different senders, and a possible delivery order of messages under FIFO-ordered multicasting:

Process P1    Process P2     Process P3     Process P4
sends m1      delivers m1    delivers m3    sends m3
sends m2      delivers m3    delivers m1    sends m4
              delivers m2    delivers m2
              delivers m4    delivers m4

Page 35: DC7: More Coordination Chapter 11 and 14.2

Atomic Multicast

• Totally ordered group communication.
• Atomic = the message is delivered to all or none.
• A view (also called a group view) is the group membership at any given time, that is, the set of processes belonging to the group.
• The concept of a view is needed to handle membership changes.

Page 36: DC7: More Coordination Chapter 11 and 14.2

Total Order

If P1 and P4 are in the multicast group, they also deliver the messages in this order. So P4 may send m3 at t=1 but not deliver it until t=2.

Process P1    Process P2     Process P3     Process P4
sends m1      delivers m1    delivers m1    sends m3
sends m2      delivers m4    delivers m4    sends m4
              delivers m2    delivers m2
              delivers m3    delivers m3

Page 37: DC7: More Coordination Chapter 11 and 14.2

Virtual (View) Synchrony

• How to define atomic multicast in the presence of failures? How can we guarantee delivery to all group members?

• Example: 50 members in a group. I multicast a message m1; P10 fails before getting the message, but the others got it, and I assume P10 got it too.

• Control membership changes with a view change.
• Virtual synchrony says something about the order of message delivery with respect to a view change, since messages must be ordered with respect to the view-change message.

Page 38: DC7: More Coordination Chapter 11 and 14.2

Properties of Virtual Synchrony

• Each process in the view has the same view. That is, they all agree on the group membership.

• When a process joins or leaves (including crash), this is announced to all (non-crashed) processes in the (old) group with a view change message VC.

• If one process, P1, in view v delivers message m, then all processes belonging to view v deliver message m in view v. (Recall difference between receive and deliver)

Page 39: DC7: More Coordination Chapter 11 and 14.2

Virtual Synchrony

The principle of virtual synchronous multicast.

Page 40: DC7: More Coordination Chapter 11 and 14.2

Figure 14.3

[Diagram: four executions of processes p, q, and r. In each, the group starts in view (p, q, r), p crashes, and the view changes to (q, r). Executions (a) and (b) are allowed under view synchrony; executions (c) and (d) are disallowed.]

Page 41: DC7: More Coordination Chapter 11 and 14.2

Multicast Summary

Six different kinds of reliable multicasting.

Multicast                  Basic Message Ordering      Total-ordered Delivery?
Reliable multicast         None                        No
FIFO multicast             FIFO-ordered delivery       No
Causal multicast           Causal-ordered delivery     No
Atomic multicast           None                        Yes
FIFO atomic multicast      FIFO-ordered delivery       Yes
Causal atomic multicast    Causal-ordered delivery     Yes

Page 42: DC7: More Coordination Chapter 11 and 14.2

Atomic Multicast: Amoeba, etc

• One node is the coordinator.
• Either everyone sends its messages to the coordinator, and the coordinator chooses the order and sends each message to everyone, or
• Everyone sends its messages to the coordinator and to all nodes, and the coordinator chooses the order and sends only the message number to everyone
  – (msg 5 from P4: global order 33)

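A sketch of the second variant above (a coordinator of this kind is often called a sequencer): the coordinator assigns global order numbers, and each receiver holds a message back until its order number is the next one in the global sequence. The class names and the in-memory "network" are illustrative.

```python
import itertools

class Sequencer:
    def __init__(self):
        self._next = itertools.count(1)

    def order(self, sender, msg_id):
        """Assign the next global order number (the ordering multicast is simulated)."""
        return (next(self._next), sender, msg_id)

class Receiver:
    def __init__(self):
        self.next_to_deliver = 1
        self.pending = {}             # order number -> (sender, msg_id)

    def on_order(self, order_no, sender, msg_id):
        self.pending[order_no] = (sender, msg_id)
        delivered = []
        while self.next_to_deliver in self.pending:
            delivered.append(self.pending.pop(self.next_to_deliver))
            self.next_to_deliver += 1
        return delivered              # the same total order at every receiver

seq = Sequencer()
r1, r2 = Receiver(), Receiver()
o1 = seq.order("P4", "msg 5")         # e.g. (1, 'P4', 'msg 5')
o2 = seq.order("P2", "msg 9")
for r in (r1, r2):
    r.on_order(*o2)                   # arrives "out of order": held back
    print(r.on_order(*o1))            # both receivers deliver in the same order
```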

Page 43: DC7: More Coordination Chapter 11 and 14.2

Atomic Multicast: Totem

• Developed at UCSB.
• Processes are organized into a logical ring.
• A token is passed around the ring. The token carries the message number of the next message to be multicast.
• Only the token holder can multicast a message. This easily establishes total order. Retransmissions of missed messages are the responsibility of the original sender.
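A sketch of the token-passing idea (not the real Totem protocol, which also handles membership changes and recovery): only the token holder multicasts, and the sequence number carried on the token directly yields a total order. Names and structure are illustrative.

```python
class Token:
    def __init__(self):
        self.next_seq = 1             # message number of the next multicast

class RingNode:
    def __init__(self, name, outbox):
        self.name = name
        self.outbox = list(outbox)    # messages this node wants to multicast

    def on_token(self, token, network):
        """Multicast pending messages while holding the token, then pass it on."""
        while self.outbox:
            msg = self.outbox.pop(0)
            network.append((token.next_seq, self.name, msg))  # totally ordered
            token.next_seq += 1
        return token

ring = [RingNode("A", ["a1"]), RingNode("B", ["b1", "b2"]), RingNode("C", [])]
network, token = [], Token()
for node in ring:                     # one full rotation of the token
    token = node.on_token(token, network)
print(network)   # [(1, 'A', 'a1'), (2, 'B', 'b1'), (3, 'B', 'b2')]
```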

Page 44: DC7: More Coordination Chapter 11 and 14.2

End