verification of the chord protocol, a with a · extended successor list (esl) of 29 (with l = 2):...

31
VERIFICATION OF THE CHORD PROTOCOL , WITH A SURPRISING INVARIANT Pamela Zave AT&T Laboratories—Research Bedminster, New Jersey, USA

Upload: others

Post on 26-Mar-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

AAA

A

A

A

VERIFICATION OF THE CHORD PROTOCOL,

WITH A

SURPRISING INVARIANT

Pamela Zave

AT&T Laboratories—Research

Bedminster, New Jersey, USA

1

8

14

21

32

42

51

THE PROTOCOL ISINTERESTING

no centraladministration(almost)

communicationin the networkis fast

protocoloperations aresimple and fast:

no timingconstraints(almost)

no multi-nodeatomicoperations

m = 6

THE CHORD PROTOCOL MAINTAINS A PEER-TO-PEER NETWORK identifier of a node (assumedunique) is an m-bit hashof its IP address

nodes are arranged ina ring, each nodehaving a successorpointer to the nextnode (in integer orderwith wraparound at 0)

the protocol preservesthe ring structure asnodes join, leave silently,or fail

redundant pointerssupport fault-tolerance(extra successors,predecessors)

successor

successor2

predecessor

WHY IS CHORD IMPORTANT?

the 2001 SIGCOMM paper introducing Chordis one of the most-referenced

papers in computer science, . . .

. . . and won SIGCOMM’s 2011 Test of Time Award

APPLICATIONS

“Three features that distinguish Chordfrom many other peer-to-peer lookupprotocols are . . .

. . . its simplicity,

. . . provable correctness,

. . . and provable performance.”

RESEARCH ON PROPERTIES ANDEXTENSIONS

allows millions of ad hoc peers tocooperate

often used to build distributedkey-value stores (where the keyspace is the same as the Chordidentifier space)

the best-known application isBitTorrent

protection against malicious peers

key consistency (all nodes agreeon which node owns which key),replicated data consistency

used as a building block in fault-tolerant applications

OPERATIONS OF THE PROTOCOL

710

16

10JOINS

10STABILIZES

710

16

16RECTIFIES

10RECTIFIES

7

10

16

710

16

7STABILIZES

7

10

16

an operation changesthe state of one member

Join and Stabilize are scheduled autonomously,Rectify is caused by another member’s Stabilize

in addition, a membercan Fail (or leave) silently

there is perfect failuredetection

Stabilizing memberdetects a dead successorand promotes its nextsuccessor

THE CLAIMS

THE REALITY

even with simple bugs fixed andoptimistic assumptions aboutatomicity, the original protocol isnot correct

of the seven properties claimedinvariant of the original version, notone is actually an invariant

Correctness Property:

In any execution state, IF there areno subsequent Join or Fail events, . . .

. . . THEN eventually . . .

. . . all pointers in the network will beglobally correct, and remain so.

not surprisingly, all due to sloppyinformal specification and proof

I found these problems byanalyzing a small Alloy model

Chris Newcombe and others atAWS credit this work withovercoming their bias againstformal methods, which they nowuse to find bugs.

[CACM, April 2015]

16

3 has no successor2yet (it is not requiredto have all successorsfilled in)

16

16

A TYPICAL BUG IN ORIGINAL CHORD

16FAILS

3STABILIZES

3

20 3

20

3

20

3 has replaced a pointerto a live node with apointer to a dead one

now 3 is disconnectedfrom the ring, and thering may be broken

BASIC CORRECTNESS STRATEGY 1

7

19

16

13

263

29

29

55

55

41

41

6

appendagesbest successor

(first livesuccessor)

ring

dead

29 2932

extended successor list (ESL)of 29 (with L = 2):

memberitself

Original operating assumption:

No failure leaves a member without a live successor.

But if an ESL with L = 2 is . . .

. . . then 32 cannot fail!Clearly, with L = 2, Chord isintended to tolerate one failurein a neighborhood, but not two.

BASIC CORRECTNESS STRATEGY 2

7

19

16

13

263

29

29

55

55

41

41

6

appendages

best successor(first live

successor)

ring

dead

extended successor list (ESL)of 29 (with L = 2):

memberitself

Definition of FullSuccessorLists:The extended successor list of eachmember has L+1 distinct entries.

New operating assumption:

If a Chord network has the propertyFullSuccessorLists, then no failure leavesa member without a live successor.

if not satisfied for the real failure rate,

. . . increase rate of stabilization,

. . . or increase redundancy

BASIC CORRECTNESS STRATEGY 3

7

19

16

13

263

29

55

41

6

appendages

ring

TO MAKE ORIGINAL CHORD CORRECT:

alter the initialization to satisfy FullSuccessorListswith all members live

alter the operations to populate successor lists moreeagerly, so that they always have L entries

now it isroughly correct(in hindsight)

but how do weprove it

without aninvariant?

requires L+1 members

w

x

y

u

z

ringx

fails

w

y

u

z

ring

WHY IS FINDING AN INVARIANT SO DIFFICULT?

there is a ring of best successors

there is no more than one ring

on the unique ring, the membersare in identifier order

from each appendage member, thering is reachable through bestsuccessors

THE KNOWN, NECESSARY PROPERTIES ARE STATED IN TERMS OF THE RING . . .

about the ring ofbest successors

about the appendages

. . . BUT “RING VERSUS APPENDAGE” IS CONTEXT-DEPENDENT AND FLUID:

AN INTERMEDIATE RESULT

THE INDUCTIVE INVARIANT:

AtLeastOneRing

AtMostOneRing

OrderedRing

ConnectedAppendages

BaseNotSkipped

and

and

and

and

ANOTHER OPERATING ASSUMPTION:

A chord network has a stable baseof L+1 nodes that are alwaysmembers.

no successor list skips overa member of the stable base

THE PROOF OF CORRECTNESS:

by exhaustive enumeration, in Alloy,for all model instances up to N = 9, L = 3

expensive to implement thesehigh-availability nodes!

a stable base would have 3-6members, while a Chord networkcan have millions of members—what is the base doing?

I believe it is just preventinganomalies in small networks,

but how can we know for sure?

THE FINAL RESULT

THE INDUCTIVE INVARIANT:

OneLiveSuccessor

and SufficientPrincipals

ANOTHER OPERATING ASSUMPTION:

THE PROOF OF CORRECTNESS:

None

informal and intuitive, but

. . . a real proof (no size limits)

. . . backed up by an Alloy model checked up to N = 9, L = 3 (as a protection against human error)

. . .

this is just a formalizationof the original operatingassumption

Definition of a principal member:A member that is not skipped by anymember’s successor list.

Definition of SufficientPrincipals: There are at least L+1 principal nodes.

the “stable base” has become something we can prove, rather thanan assumption!

identifierspace

hypothesize adisordered extended

successor list[. . . x, . . . y, . . . z, . . .]

the only identifier in theentire space that is notskipped by [x, . . . y, . . . z]is y

principal nodes cannotbe skipped by asuccessor list

Proof of OrderedSuccessorLists

picture

definition

SufficientPrincipals

CONTRADICTION!

there must be at leastL + 1 principal nodes,where L is greater thanor equal to 1

ORDERED (AND FULL) SUCCESSOR LISTS . . .. . . ARE IMPLIED BY THE INVARIANT

x

y z

Definition of OrderedSuccessorLists:For all sublists [x, y, z] of an ESL(whether the sublist is contiguousor not) . . . between [x, y, z].

every member has a bestsuccessor (first live successor)

there are sufficientprincipal nodes

here is a graph of bestsuccessors:

7

7

7

23

23

23

46

46

460

0

0

15

15

9

9

12

12

37

37

24

24

35

35

33

33

51

51

ssss

s

d d d d

these paths . . .

. . . do not skipprincipal nodes

. . . are acyclic

. . . are ordered by identifiers

each tree has exactly one p , whichis unique to it

so the re-arrangedgraph mustlook likethis

automaticallysatisfyingOneOrderedRingandConnected-Appendages

THE PRINCIPAL NODES MAKE THE SHAPE OF THE RING

CONCURRENCY AND COMMUNICATION

node X node Y

atomic step at X

X changes state

step can beassumed to occur

at this instant

AN OPERATION IS A SEQUENCE OFATOMIC STEPS

EACH ATOMIC STEP IS AN INTERNALSTATE CHANGE OR THIS:

querymessage

replymessage

formal model has ashared-stateabstraction

while waiting for a reply (or timeout),X cannot answer queries about its state

because of the structure of operations,queries cannot form circular waits

extended successor list ofstabilizing node, before Stabilize

extended successor listof its new successor

some Chord operations need multiple atomic steps

in the new, provably correct, specification,every intermediate operation state is alsoconstructed in this safe way

N N N NS S SS

S

0

0 0

00

01

1

12 23 3

N N N0 0 1 2

because invariant holds, nocurrent principal nodes are skippedhere or here

precondition guaranteesthat between [S , N , S ],so no current principalnodes are skipped fromS to N

therefore, no former principalnodes are skipped by this newsuccessor list, and the numberof principal nodes has notdecreased

HOW STABILIZE PRESERVES THE INVARIANT JOIN ANDRECTIFY

ARE SIMILAR

HOW FAIL PRESERVES THE INVARIANT

PRESERVATION OFOneLiveSuccessor

PRESERVATION OFSufficientPrincipals

The operatingassumption is thatno failure leaves amember with no livesuccessor, . . .

. . . so the invariantis assumed to bepreserved.

Why can’t failure of a principal node leave the networkwith fewer than L+1 principals?

Lemma: The only operation that can cause a node tochange from principal to non-principal is its own failure.

The life history of a long-lived member:

Joinbecome principal because all neighbors know youenjoy life as a principal nodeFail

1234

Therefore the number of principal nodes isproportional to the number of nodes.

Once the network has grown (especially to millions ofmembers!) it is overwhelmingly improbable that it willhave fewer than 3-6 principal nodes.

PROVING PROGRESS

IF THERE ARE NO MOREJOIN OR FAIL EVENTS . . .

dead successors areremoved, so that everymember’s firstsuccessor is live

. . . WHILE MEMBERSCONTINUE TO STABILIZE . . .

every member’s firstsuccessor andpredecessor becomeglobally correct

tails of all successorlists become correct

1

2

3

as with construction of intermediatesuccessor lists, operations must bespecified precisely to ensurecorrectness

here preconditions must ensurethat no operation reverses theprogress of a past or current phase

CONCLUSIONS

THE PRODUCT THE PROCESS

initialization is more difficult thanoriginal Chord, but a simpleprotocol will get networks off toa safe start

otherwise correct Chord is justas efficient as original Chord

these peer-to-peer protocolshave a (justified) reputation forunreliability

it is an impressivepattern for fault-tolerance

a correct specification couldpave the way for a new generation of reliable, moreuseful implementations

it also provides a firm foundation for work on better failure detection

and security

Chord is a very interesting protocol—note that the invariantlooks nothing like the propertieswe care about!

results would have been impossibleto find without model-checking toexplore bizarre cases and get ideasfrom them

the best result was impossible tofind without the insights that camefrom the proof process

that is where the idea of astable base came from

www.research.att.com/~pamela> How to Make Chord Correct

AAA

A

A

A

PATTERNS IN NETWORK ARCHITECTURE:

PRINCIPLES FOR LAYERING

PERFORMANCE SERVICES

SECURITY SERVICES

Simplifying Assumption:

There is no multihoming.

QoS

reliability

mobility

data integrity

packet filtering

AAA

A

A

A

an overlay link is implemented by an underlay session

links of a network can be implemented by sessions in oneor several networks

BASIC LAYERING

session

link

AAA

A

A

A

BEST-EFFORT POINT-TO-POINT SERVICE (QUANTIFIED)

session

link

packets are sent at the rate of B

with probability P, a packet arrives within time L

in addition to loss or delay, packets can bere-ordered or duplicated in transit

PROPERTIES OF THE LINK

THE UNDERLAY SESSION IMPLEMENTS THIS SERVICE

AAA

A

A

A

PRINCIPLES FOR LAYERING

session

link

As we have seen in this class, there areMANY services a network could implement.

How do these services composein a network architecture?

service session

Where in a network architectureshould a service be implemented?

AAA

A

A

A

for each link in thepath of the session,let Sj be the fractionof the link’s bandwidthallocated to the session

minimum bandwidth = minimum ( Sj (Bj) )j

jminimum success probability = product ( Pj )

j jmaximum latency = sum ( Lj ) + sum ( Dj )

(QUANTIFIED SERVICE WITH GUARANTEES)

D is delay here

COMPOSITION OF “QUALITY OF SERVICE” SERVICE

min. bandwidth B1min. success prob. P1maximum latency L1

min. bandwidth B2min. success prob. P2maximum latency L2

min. bandwidth B3min. success prob. P3maximum latency L3

AAA

A

A

A

COMPOSITION OF RELIABILITY SERVICE(FOR EXAMPLE, TCP)

packets are deliveredreliably and in theorder they were sent

packets are deliveredreliably and in theorder they were sent

packets are deliveredreliably and in theorder they were sent

this session is reliable if . . .

. . . all the links in its path are reliable, and

. . . all the network nodes in its path are reliable

AAA

A

A

A

COMPOSITION OF DATA-INTEGRITY SERVICE(FOR EXAMPLE, IPsec)

this session has data integrity if . . .

. . . all the links in the path have data integrity, and

. . . all the network nodes in the path can be trusted to preserve integrity

no node except thesession endpointscan read or changethe data

no node except thesession endpointscan read or changethe data

no node except thesession endpointscan read or changethe data

AAA

A

A

A

SERVICE WHERENEEDED?

WHEREIMPLEMEN-TABLE?

HOWPROPA-GATED?

RESULT?

QoS topsession

above anyuntrustedlinks ornodes

where a network can provide sessions withdedicated paths, shorterpaths, higher-qualitypaths, etc.

piecewise

piecewise

piecewise

mustpropagate upto BE2EUS

reliability top session

any session protocol

dataintegrity

any session protocol

bottom end-to-endunshared session(BE2EUS)

dynamic link

all properties of thissession propagatepiecewise to the top

there can also be abottom end-to-endsession (BE2ES)

put in BE2ESor above

put in bottomsession over

untrusted span, or above

AAA

A

A

A

network implementssession-locationmobility

session preserveddespite mobility

session preserveddespite mobility

these sessions are preserveddespite mobility if . . .

. . . the links at their endpointsare preserved despite mobility

networkimplementsdynamic-routingmobility

COMPOSITION OF MOBILITY SERVICE

AAA

A

A

A

COMPOSITION OF PACKET-FILTERING SERVICE

member receives nosession-initiation packetsthat are undesirable for X

member receives no undesirablesession-initiation packets if . . .

. . . it does not receive any undesirablesession-initiation packets on its inlinks

networkX

networkY

implemented by filteringall the inlinks of E

E

E

x

y

y

how does Y know what isundesirable for X?

it does not matterwhether inlinks arestatic or dynamic

AAA

A

A

A

packetfiltering

all networkmembersin amachine

at any level whereit is possible todiscriminate

filter high-volumeattacks at a lowlevel, low-volumeattacks wherediscrimination isbetter

SERVICE WHERENEEDED?

WHEREIMPLEMEN-TABLE?

HOWPROPA-GATED?

RESULT?

topsession

only where thechange ofattachment occurs

endpointattachment

endpointattachment

propagationupward is automatic

mobility

AAA

A

A

A

PACKETFILTERING

QoS

RELIABILITY

RELIABILITY

INTEGRITY(ENCRYPTION)

INTEGRITY

reliability convertsbandwidth,probability, latencyto goodput, whichalso propagatespiecewise

encryptiondecreasesbandwidth andincreaseslatency, both bad

filteringincreaseslatency (bad)and bandwidth(good)

reliabilityaboveencryption

reliabilityabovemobility

encryptionabovemobility

filtering andencryption areindependentonly if sessioninitiations arenot encrypted

MOBILITY

MOBILITY

SERVICEINTERACTIONS

filtering abovemobility orfar fromendpoint