cs 194: distributed systems process resilience, reliable group communication

34
1 CS 194: Distributed Systems Process resilience, Reliable Group Communication Scott Shenker and Ion Stoica Computer Science Division artment of Electrical Engineering and Computer Scie University of California, Berkeley Berkeley, CA 94720-1776

Upload: ianthe

Post on 10-Feb-2016

56 views

Category:

Documents


4 download

DESCRIPTION

CS 194: Distributed Systems Process resilience, Reliable Group Communication. Scott Shenker and Ion Stoica Computer Science Division Department of Electrical Engineering and Computer Sciences University of California, Berkeley Berkeley, CA 94720-1776. Some definitions…. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: CS 194: Distributed Systems Process resilience, Reliable Group Communication

1

CS 194: Distributed SystemsProcess resilience, Reliable Group

Communication

Scott Shenker and Ion Stoica Computer Science Division

Department of Electrical Engineering and Computer SciencesUniversity of California, Berkeley

Berkeley, CA 94720-1776

Page 2: CS 194: Distributed Systems Process resilience, Reliable Group Communication

2

Some definitions…

Availability: probability the system operates correctly at any given moment

Reliability: ability to run correctly for a long interval of time

Safety: failure to operate correctly does not lead to catastrophic failures

Maintainability: ability to “easily” repair a failed system

Page 3: CS 194: Distributed Systems Process resilience, Reliable Group Communication

3

… and Some More Definitions (Failure Models)

Crash failure: a server halts, but works correctly until it halts

Omission failure: a server fails to respond to a request

Timing failure: a server response exceeds specified time interval

Response failure: server’s response is incorrect

Arbitrary (Byzantine) failure: server produces arbitrary response at arbitrary times

Page 4: CS 194: Distributed Systems Process resilience, Reliable Group Communication

4

Masking Failures: Redundancy

How many failures can this design tolerate?

Page 5: CS 194: Distributed Systems Process resilience, Reliable Group Communication

5

Example: Open Shortest Path First (OSPF) over Broadcast Networks

1) Each node sends an route advertisements to multicast group DR-rtrs- Both designated router (DR) and backup designated router (BDR)

subscribe to this group

2) DR floods route advertisements back to all routers - Send to all-rtrs multicast group to which all nodes subscribe

DR BDR

DR BDR

Page 6: CS 194: Distributed Systems Process resilience, Reliable Group Communication

6

Agreement in Faulty Systems

Many things can go wrong…

Communication - Message transmission can be unreliable- Time taken to deliver a message is unbounded- Adversary can intercept messages

Processes- Can fail or team up to produce wrong results

Agreement very hard, sometime impossible, to achieve!

Page 7: CS 194: Distributed Systems Process resilience, Reliable Group Communication

7

Two-Army Problem

“Two blue armies need to simultaneously attack the white army to win; otherwise they will be defeated. The blue army can communicate only across the area controlled by the white army which can intercept the messengers.”

What is the solution?

Page 8: CS 194: Distributed Systems Process resilience, Reliable Group Communication

8

Byzantine Agreement [Lamport et al. (1982)]

Goal:- Each process learn the true values sent by correct processes

Assumptions:- Every message that is sent is delivered correctly- The receiver knows who sent the message- Message delivery time is bounded

Page 9: CS 194: Distributed Systems Process resilience, Reliable Group Communication

9

Byzantine Agreement Result

In a system with m faulty processes agreement can be achieved only if there are 2m+1 functioning correctly

Note: This result only guarantees that each process receives the true values sent by correct processors, but it does not identify the correct processes!

Page 10: CS 194: Distributed Systems Process resilience, Reliable Group Communication

10

Byzantine General Problem: Example

Phase 1: Generals announce their troop strengths to each other

P1 P2

P3 P4

1

11

Page 11: CS 194: Distributed Systems Process resilience, Reliable Group Communication

11

Byzantine General Problem: Example

Phase 1: Generals announce their troop strengths to each other

P1 P2

P3 P4

2

2 2

Page 12: CS 194: Distributed Systems Process resilience, Reliable Group Communication

12

Byzantine General Problem: Example

Phase 1: Generals announce their troop strengths to each other

P1 P2

P3 P4

4 4

4

Page 13: CS 194: Distributed Systems Process resilience, Reliable Group Communication

13

Byzantine General Problem: Example

Phase 2: Each general construct a vector with all troops

P1 P2 P3 P41 2 x 4

P1 P2

P3 P4

yx

z

P1 P2 P3 P41 2 y 4

P1 P2 P3 P41 2 z 4

Page 14: CS 194: Distributed Systems Process resilience, Reliable Group Communication

14

Byzantine General Problem: Example

Phase 3: Generals send their vectors to each other and compute majority voting

P1 P2 P3 P41 2 y 4a b c d1 2 z 4

P1 P2

P3 P4

(e, f, g, h)

(a, b, c, d)

(h, i, j, k)

P1 P2 P3 P41 2 x 4e f g h1 2 z 4

P1 P2 P3 P41 2 x 41 2 y 4h i j k

P2P3P4

P1P3P4

P1P2P3

(1, 2, ?, 4) (1, 2, ?, 4)

(1, 2, ?, 4)

Page 15: CS 194: Distributed Systems Process resilience, Reliable Group Communication

15

Reliable Group Communication

Reliable multicast: all nonfaulty processes which do not join/leave during communication receive the message

Atomic multicast: all messages are delivered in the same order to all processes

Page 16: CS 194: Distributed Systems Process resilience, Reliable Group Communication

16

Reliable multicast: (N)ACK Implosion

(Positive) acknowledgements- Ack every n received packets- What happens for multicast?

Negative acknowledgements- Only ack when data is lost- Assume packet 2 is lost

S

R1

R2

R3

123

Page 17: CS 194: Distributed Systems Process resilience, Reliable Group Communication

17

Reliable multicast: (N)ACK Implosion

When a packet is lost all receivers in the sub-tree originated at the link where the packet is lost send NACKs

S

R1

R2

R3

3

3

3

2?

2?

2?

Page 18: CS 194: Distributed Systems Process resilience, Reliable Group Communication

18

Scalable Reliable Multicast (SRM)[Floyd et al ’95]

Receivers use timers to send NACKS and retransmissions- Randomized: prevent implosion- Uses latency estimates

• Short timer cause duplicates when there is reordering• Long timer causes excess delay

Any node retransmits- Sender can use its bandwidth more efficiently- Overall group throughput is higher

Duplicate NACK/retransmission suppression

Page 19: CS 194: Distributed Systems Process resilience, Reliable Group Communication

19

Inter-node Latency Estimation

Every node estimates latency to every other node

Uses session reports- Assume symmetric latency- What happens when group

becomes very large?

A B

t1

t2

d d

dA,B = (t2 – t1 – d)/2

Page 20: CS 194: Distributed Systems Process resilience, Reliable Group Communication

20

Chosen from the uniform distribution on

- A – node that lost the packet- S – source- C1, C2 – constants- dS,A – latency between source (S) and A- i – iteration of repair request tries seen

Algorithm- Detect loss set timer- Receive request for same data cancel timer, set new timer- Timer expires send repair request

Repair Request Timer Randomization

])(,[2 ,21,1 ASASi dCCdC

Page 21: CS 194: Distributed Systems Process resilience, Reliable Group Communication

21

Timer Randomization

Repair timer similar- Every node that receives repair request sets repair timer- Latency estimate is between node and node requesting repair- Use following formula:

• D1, D2 – constants• dR,A – latency between node requesting repair (R) and A

Timer properties – minimize probability of duplicate packets- Reduce likelihood of implosion (duplicates still possible)- Reduce delay to repair

])(,[2 ,21,1 ARARi dDDdD

Page 22: CS 194: Distributed Systems Process resilience, Reliable Group Communication

22

Chain Topology

C1 = D1 = 1, C2 = D2 = 0 All link distances are 1

L2 L1 R1 R2 R3

source

data outof order

data/repairrequest repairrequest TOrepair TO

request

repair

Page 23: CS 194: Distributed Systems Process resilience, Reliable Group Communication

23

Star Topology

C1 = D1 = 0, Tradeoff between (1) number of requests

and (2) time to receive the repair C2 <= 1

- E(# of requests) = g –1 C2 > 1

- E(# of requests) = 1 + (g-2)/C2

- E(time until first timer expires) = 2C2/g

- E(# of requests) = - E(time until first timer expires) =

N1

N2

N3 N4

Ng

source

gC 2g

g/1

Page 24: CS 194: Distributed Systems Process resilience, Reliable Group Communication

24

Bounded Degree Tree

Use both- Deterministic suppression (chain topology)- Probabilistic suppression (star topology)

Large C2/C1 fewer duplicate requests, but larger repair time

Large C1 fewer duplicate requests Small C1 smaller repair time

Page 25: CS 194: Distributed Systems Process resilience, Reliable Group Communication

25

Adaptive Timers

C and D parameters depends on topology and congestion choose adaptively

After sending a request: - Decrease start of request timer interval

Before each new request timer is set:- If requests sent in previous rounds, and any dup requests were from

further away:• Decrease request timer interval

- Else if average dup requests high:• Increase request timer interval

- Else if average dup requests low and average request delay too high:• Decrease request timer interval

Page 26: CS 194: Distributed Systems Process resilience, Reliable Group Communication

26

Atomic Multicast

All messages are delivered in the same order to “all” processes

Group view: the set of processes known by the sender when it multicast the message

Virtual synchronous multicast: a message multicast to a group view G is delivered to all nonfaulty processes in G

- If sender fails after sending the message, the message may be delivered to no one

Page 27: CS 194: Distributed Systems Process resilience, Reliable Group Communication

27

Virtual Synchronous Multicast

Page 28: CS 194: Distributed Systems Process resilience, Reliable Group Communication

28

Virtual Synchrony Implementation [Birman et al., 1991]

The logical organization of a distributed system to distinguish between message receipt and message delivery

Page 29: CS 194: Distributed Systems Process resilience, Reliable Group Communication

29

Virtual Synchrony Implementation: [Birman et al., 1991]

Only stable messages are delivered

Stable message: a message received by all processes in the message’s group view

Assumptions (can be ensured by using TCP): - Point-to-point communication is reliable- Point-to-point communication ensures FIFO-ordering

Page 30: CS 194: Distributed Systems Process resilience, Reliable Group Communication

30

Virtual Synchrony Implementation: Example

Gi = {P1, P2, P3, P4, P5} P5 fails P1 detects that P5 has

failed P1 send a “view change”

message to every process in Gi+1 = {P1, P2, P3, P4}

P1

P2 P3

P4

P5

change view

Page 31: CS 194: Distributed Systems Process resilience, Reliable Group Communication

31

Virtual Synchrony Implementation: Example

Every process - Send each unstable message m

from Gi to members in Gi+1

- Marks m as being stable- Send a flush message to mark that

all unstable messages have been sent P1

P2 P3

P4

P5

unstable message

flush message

Page 32: CS 194: Distributed Systems Process resilience, Reliable Group Communication

32

Virtual Synchrony Implementation: Example

Every process - After receiving a flush message

from any process in Gi+1 installs Gi+1

P1

P2 P3

P4

P5

Page 33: CS 194: Distributed Systems Process resilience, Reliable Group Communication

33

Message Ordering

FIFO-order: messages from the same process are delivered in the same order they were sent

Causal-order: potential causality between different messages is preserved

Total-order: all processes receive messages in the same order

Total ordering does not imply causality or FIFO! Atomicity is orthogonal to ordering

Page 34: CS 194: Distributed Systems Process resilience, Reliable Group Communication

34

Message Ordering and Atomicity

Multicast Basic Message Ordering Total-ordered Delivery?

Reliable multicast None No

FIFO multicast FIFO-ordered delivery No

Causal multicast Causal-ordered delivery No

Atomic multicast None Yes

FIFO atomic multicast FIFO-ordered delivery Yes

Causal atomic multicast Causal-ordered delivery Yes