Distributed Algorithms 2014
Igor Zarivach
A Distributed Algorithm for Minimum Weight Spanning Trees
By Gallager, Humblet, Spira (GHS)
Agenda
• Introduction
• Review of spanning trees
• Description of GHS algorithm
• Algorithm execution on ring topology
• Complexity analysis
Dijkstra prize in 2004
• An elegant and efficient distributed algorithm for finding a minimum spanning tree in an asynchronous network.
• The problem is important, both theoretically and practically.
• Major algorithmic breakthrough on many fronts:
  • It solved the fundamental problem of symmetry breaking (or leader election) in the setting of a general graph.
  • The algorithm has a surprisingly low message complexity for this important problem.
  • Techniques for multicasting, and for query and reply.
• Beauty and elegance of the algorithm and its presentation.
• An exceptional degree of asynchrony among the nodes.
• Its structure is very intuitive and is easy to comprehend.
• The algorithm is sufficiently complicated and interesting to be a challenge problem for formal verification methods.
• Finding a proof is still very much an open problem in protocol verification and formal methods.
• In summary, this paper is a genuine milestone in the area of asynchronous network algorithms; it has changed this field completely, in terms of both algorithmics and analysis techniques.
Problem Statement
• Given: the input graph G(V,E) is a connected undirected graph with N nodes and E edges, where every edge has a distinct finite weight.
• Needed: an asynchronous distributed algorithm that determines the minimum spanning tree (MST) of the graph.
Minimum (Weight) Spanning Tree
[Figure: an example weighted graph and its minimum weight spanning tree]
Applications
• Efficient broadcasting in networks
• Establishing connectivity after node failures
• Leader election
Model
• Communication
  • Asynchronous communication
  • Message passing
  • Messages can pass on an edge in both directions concurrently
• Computation
  • Processors are represented by nodes
  • Assumption: distinct weights on edges (we will see why)
  • A processor knows the weights of the edges incident to it
  • A processor knows its unique ID
  • One or more nodes can start the algorithm
• Failures
  • Messages arrive in order with no errors
  • No processor faults
Definitions
• Fragment: a subtree of the MST
• Branch: an edge in the MST (an edge in a fragment)
• Outgoing edge: an edge between different fragments
• Fragment’s MWOE: minimum weight outgoing edge
[Figure: Fragments 1, 2 and 3, with a root, a branch, an outgoing edge and an MWOE labeled]
Two properties of MSTs
• Property 1: Given a fragment F of an MST, let e be a minimum-weight outgoing edge of F. Then joining e and its adjacent non-fragment node to F yields another fragment F’ of an MST.
• Property 2: If all the edges of a connected graph have different weights, then the MST of the graph is unique.
[Figure: fragment F with MWOE e becomes fragment F’ with branch e]
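Property 1 can be exercised on its own with a small sequential sketch: grow a single fragment by repeatedly adding its MWOE. This is not the distributed algorithm, only a centralized illustration; the function name `grow_mst` and the four-node graph are made up for the example, and the graph is assumed connected with distinct weights.

```python
def grow_mst(nodes, edges, start):
    """Grow one fragment from `start` by repeatedly adding its MWOE (Property 1).

    `edges` is a list of (weight, u, v) tuples; the graph is assumed connected.
    """
    fragment = {start}
    branches = []
    while len(fragment) < len(nodes):
        # Outgoing edges have exactly one endpoint inside the fragment.
        outgoing = [(w, u, v) for (w, u, v) in edges
                    if (u in fragment) != (v in fragment)]
        w, u, v = min(outgoing)  # distinct weights make the MWOE unique
        branches.append((w, u, v))
        fragment |= {u, v}
    return branches


# Hypothetical example graph, not taken from the slides.
nodes = {1, 2, 3, 4}
edges = [(1, 1, 2), (4, 2, 3), (3, 3, 4), (2, 1, 4), (5, 1, 3)]
mst = grow_mst(nodes, edges, start=1)
```

By Property 2 the result does not depend on the start node when weights are distinct.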
Algorithm GHS High Level
• Each fragment finds its MWOE asynchronously
• When the MWOE is found, the fragment attempts to combine with the fragment on the other end
• We will show how and when to combine the fragments so that the algorithm is correct and has good message complexity
[Figure: fragments F and F’]
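Distinct weight edges
Will the algorithm work for equal weight edges?
• If edge weights are not distinct, but nodes have distinct identities, then:
  • Let e = (u, v) be an edge with weight w(e) and endpoint identities ID(u) < ID(v)
  • We get distinct edge weights by using the triple (w(e), ID(u), ID(v)), compared lexicographically: ties in w(e) are broken by the IDs
• If neither edge weights nor node identities are distinct, there is no distributed algorithm to find the MST
  • Any two edges form an MST, but there is no way to break the symmetry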
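The tie-breaking idea can be sketched with tuples, which Python compares lexicographically. The helper name `distinct_weight` is an assumption for illustration; the construction (weight first, then the smaller and larger endpoint ID) is the standard way to make equal weights distinct when node IDs are unique.

```python
def distinct_weight(w, id_u, id_v):
    """Extended weight for edge (u, v): ties in w are broken by endpoint IDs."""
    return (w, min(id_u, id_v), max(id_u, id_v))


# Two different edges with the same raw weight 5 become comparable.
a = distinct_weight(5, 1, 2)
b = distinct_weight(5, 2, 3)
```

Both endpoints of an edge compute the same extended weight, so no extra messages are needed to agree on it.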
Design - Fragment
• Each fragment behaves asynchronously and independently
• Initially, every fragment consists of a single node
• Upon termination, there will be only one fragment
• Each fragment will have a leader, which initiates fragment operations
• The leader starts an operation by broadcast
• Every node replies to the leader by convergecast
• The spanning tree is used for communication
• When two fragments are merged, the spanning tree is updated
Design - Node
• A node has a pointer to the next node on the path to the leader (father)
• A node knows to which fragment it belongs
Design - Union
• Fragment F finds its MWOE (x, y), with x in F and y in the neighbor fragment G
• F merges into neighbor fragment G
• F becomes a subtree of the bigger tree
• x becomes the new root of F
• Nodes of F update their father accordingly
• x sets its father to y
[Figure: fragments F and G joined over the edge (x, y)]
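The father-pointer update in a union can be sketched sequentially: re-root the joining tree at x by reversing the pointers on the path from x to the old root, then point x at y. The dictionary representation and the names `absorb_into` and `root_of` are assumptions for illustration; in the distributed setting the same reversal happens message by message.

```python
def absorb_into(father, x, y):
    """Re-root x's tree at x, then hang it under y (a node of the other fragment)."""
    prev, node = None, x
    while node is not None:
        nxt = father[node]    # old father, one step closer to the old root
        father[node] = prev   # reverse the pointer
        prev, node = node, nxt
    father[x] = y


def root_of(father, node):
    """Follow father pointers up to the root (father is None)."""
    while father[node] is not None:
        node = father[node]
    return node


# Hypothetical fragments: F is a <- b <- x (root a); G is r <- y (root r).
father = {"a": None, "b": "a", "x": "b", "r": None, "y": "r"}
absorb_into(father, "x", "y")
```

Only the nodes on the path from x to the old root of F change their pointers; nodes of G are untouched.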
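Problem 1 - Cycle
• F and G might merge concurrently over their common MWOE (x, y)
• We get a cycle of length two: x sets its father to y while y sets its father to x
[Figure: fragments F and G with father pointers x → y and y → x]
Solution
• Both x and y become leaders of the combined fragment G
• If we need one leader, we can break the symmetry by unique IDs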
Problem 2 – Unbalanced fragments
• Choosing the MWOE and updating father pointers in a fragment F costs O(|F|) messages
• Worst case:
  • Size of F is O(N) nodes
  • F merges with other fragments of size 1
• We get O(N^2) message complexity, but can get O(N log N) if sizes are equal
Solution
• Merge only a smaller fragment into a larger fragment
• Update father pointers of the smaller fragment only
• We need to estimate the size of the fragment!
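The gap between unbalanced and balanced unions can be made concrete with a toy cost model, sketched below. The model is an assumption for illustration only: each union costs one message per node whose father pointer is updated, and n is a power of two in the balanced case.

```python
def unbalanced_cost(n):
    """A growing fragment of size k re-points all k nodes at each of n-1 unions."""
    return sum(k for k in range(1, n))


def balanced_cost(n):
    """Equal-size fragments pair up; half of all nodes update pointers per round."""
    cost, size = 0, 1
    while size < n:
        cost += n // 2  # n / (2 * size) unions, each re-pointing `size` nodes
        size *= 2
    return cost


quadratic = unbalanced_cost(1024)
n_log_n = balanced_cost(1024)
```

Under this model the unbalanced schedule costs n(n-1)/2 updates, while the balanced one costs (n/2) log2 n.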
Fragment size estimation (Level)
• It is hard to estimate the size of a distributed tree
• Use Level as the estimate: a fragment of Level L has at least 2^L nodes
• Each fragment has a Level
  • Level 0 – only one node
  • Level k > 0 – at least 2^k nodes
• Lemma: If the Level of fragment F is L, then F has at least 2^L nodes
• We want to guarantee the Lemma for all fragments
• Level doesn’t represent the size exactly: a Level L fragment can have many more than 2^L nodes!
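A short induction sketch shows why a Level can serve as a lower bound on size. It assumes the GHS union rules: a Level increases only when two fragments of equal Level merge, and an absorb leaves the Level of the larger fragment unchanged. The notation |F| for the number of nodes of F is introduced here for the sketch.

```latex
\textbf{Claim.} If $\mathrm{Level}(F) = L$ then $|F| \ge 2^{L}$.

\textbf{Base.} A Level-0 fragment is a single node: $|F| = 1 = 2^{0}$.

\textbf{Step.} A fragment $G$ of Level $L+1$ is created only by merging
two fragments $F_1, F_2$ with $\mathrm{Level}(F_1) = \mathrm{Level}(F_2) = L$, so
\[
  |G| = |F_1| + |F_2| \ge 2^{L} + 2^{L} = 2^{L+1}.
\]
An absorb only adds nodes and keeps the Level, so the bound is preserved.

\textbf{Consequence.} Since $|F| \le N$, every Level satisfies $L \le \log_2 N$.
```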
Design - Union
• The algorithm will guarantee that the MWOE of every fragment F leads to a fragment F’ such that
  • Level(F) ≤ Level(F’)
• Otherwise, if Level(F) > Level(F’):
  • F will wait for F’ to grow in Level
  • Waiting can lead to deadlocks!!
• Smaller fragments never wait for larger ones; they are immediately absorbed into the larger neighbor
[Figure: F (Level 3) and F’ (Level 2); F can’t find its MWOE, and waits]
Fragments union operations
Operation | Absorb (F1 into F2) | Merge (F1 and F2)
Condition | Level(F1) < Level(F2) && MWOE e of F1 leads to F2 | Level(F1) == Level(F2) && common MWOE e
Result | Level(G) = Level(F2) | Level(G) = Level(F1) + 1
Message Complexity | O(size(F1)) | O(size(F1) + size(F2))
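The two union conditions can be summarized as a small decision function. This is a sketch under assumptions: the name `union_action` and the string results are made up, and fragments are described only by their Level and the weight of their MWOE.

```python
def union_action(level_f, level_g, mwoe_f, mwoe_g):
    """What fragment F may do when its MWOE leads to fragment G."""
    if level_f < level_g:
        return "absorb"  # F joins G; G keeps its Level
    if level_f == level_g and mwoe_f == mwoe_g:
        return "merge"   # new fragment of Level level_f + 1
    return "wait"        # G must grow, or pick the same edge, first


examples = [
    union_action(0, 1, mwoe_f=3, mwoe_g=4),  # lower Level joins higher Level
    union_action(1, 1, mwoe_f=7, mwoe_g=7),  # equal Levels, common MWOE
    union_action(1, 1, mwoe_f=7, mwoe_g=4),  # equal Levels, different MWOEs
    union_action(2, 1, mwoe_f=7, mwoe_g=7),  # higher Level never joins lower
]
```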
Design - Union
• Define the “core” edge of the fragment’s tree: “The edge along which the most recent Merge occurred.”
• Lemma: the core changes once per Level
• Example of a Level 1 fragment:
  • Nodes 1 and 2 merged on their common MWOE
  • Node 3 was then absorbed
• Example of a Level 2 fragment:
  • Two Level 1 fragments merged on their common MWOE
  • Node 4 was then absorbed
Fragment Names and Leaders
• We need to distinguish between fragments
• Levels are not unique
• Use the core for fragment identification
• Fragment name: (core, Level)
• Leaders: the two nodes adjacent to the core
Example
[Figure: path of five nodes ID1 to ID5 with edge weights 1, 3, 4, 2; each event below is paired with a network snapshot]
• All nodes wake up, choose their MWOE and try to Connect to the neighbor
• ID1 and ID2 Merge and become a Level-1 fragment, and the same for ID4 and ID5. ID3 waits…
• ID3 tries to Connect again. The two fragments look for their MWOE (Test messages)
Example (continued)
• The fragments are blocked because ID3’s Level is smaller than 1
• ID3 is absorbed… Only now does ID3 reply to both Test messages (one Accept, one Reject)
• The two fragments merge into a new Level-2 fragment (Connect in both directions)
• And the algorithm ends with the MST
Fragment Lifecycle
1. New Fragment (Level L ≥ 1)
  • Set core
  • Set leaders
  • Set Level
  • Broadcast (core, Level)
2. Find MWOE
  • Level 0: minimal adjacent edge; otherwise Test, Accept, Reject
  • Can wait…
  • Convergecast (MWOE)
3. Union
  • Merge: a new fragment is created (Level L+1)
  • Or Absorb
  • Or Wait
Async: smaller fragments are absorbed
Node state machine
• States: Sleeping, Find, Found
Specific messages:
• Initiate: Broadcast from leader to find MWOE; contains fragment identity.
• Report: Convergecast MWOE responses back to leader.
• Test: Asks whether an edge is outgoing.
• Accept/Reject: Answers to Test.
• Change-core: Sent from leader to endpoint of MWOE.
• Connect: Sent across the MWOE, to connect fragments.
• We say a merge occurs when a Connect message has been sent both ways on the edge (the 2 nodes must have the same Level).
• We say an absorb occurs when a Connect message has been sent on the edge from a lower-Level to a higher-Level node.
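The node states and message types listed above could be modeled as plain data types, as in the sketch below. The class and field names are assumptions for illustration; in particular, identifying a fragment by a numeric core weight is one possible choice, not something the slides prescribe.

```python
from dataclasses import dataclass
from enum import Enum


class NodeState(Enum):
    SLEEPING = "Sleeping"
    FIND = "Find"
    FOUND = "Found"


@dataclass(frozen=True)
class Connect:
    level: int


@dataclass(frozen=True)
class Initiate:
    level: int
    fragment: float      # fragment identity (assumed: weight of the core edge)
    state: NodeState


@dataclass(frozen=True)
class EdgeTest:          # the Test message
    level: int
    fragment: float


@dataclass(frozen=True)
class Accept:
    pass


@dataclass(frozen=True)
class Reject:
    pass


@dataclass(frozen=True)
class Report:
    weight: float        # best outgoing weight in the subtree, inf if none


@dataclass(frozen=True)
class ChangeCore:
    pass


msg = Initiate(level=1, fragment=1, state=NodeState.FIND)
```

Frozen dataclasses compare by value, which is convenient when checking message traces in a simulation.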
Description: Find MWOE – Level 0
• A single node fragment
• The node is in state Sleeping
• The node awakens or receives a message
• The node chooses its MWOE from all adjacent edges
• Sends Connect(Level=0) over the MWOE
• Sets state to Found
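The Level 0 step can be sketched as a single function executed when a sleeping node wakes up. The dictionary-based node record, the function name `wakeup` and the outbox format are assumptions for illustration.

```python
def wakeup(edge_weights):
    """Wake a Sleeping single-node fragment.

    edge_weights maps neighbor ID -> weight of the connecting edge.
    Returns the new node record and the messages to send.
    """
    mwoe = min(edge_weights, key=edge_weights.get)  # lightest adjacent edge
    edge_state = {nbr: "Basic" for nbr in edge_weights}
    edge_state[mwoe] = "Branch"
    outbox = [(mwoe, ("Connect", 0))]               # Connect(Level=0) over the MWOE
    node = {"level": 0, "state": "Found", "edges": edge_state}
    return node, outbox


# A node with two neighbors, reached over edges of weight 3 and 1.
node, outbox = wakeup({"ID2": 3, "ID3": 1})
```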
Description: Find MWOE – Level L
• Two Level (L-1) fragments merge over their common MWOE e
• MWOE e is the new core
• The new Level L fragment G has identity (core, Level)
• Leaders broadcast Initiate() to all nodes
• Initiate() contains identity, Level and state Find
• Initiate() is passed to all Level (L-1) fragments waiting to connect to nodes in G
• G nodes start the Test-Accept-Reject protocol to find the MWOE
• When a node finds its MWOE, Report is convergecast to the leaders
[Figure: fragment F, node x and MWOE e]
Description: Find MWOE – Level L (continued)
• Convergecast of Report(W) on fragment inbound edges
• W(u) is defined as follows:
  • u is a leaf: W is the weight of the MWOE adjacent to u, or infinity
  • u is an internal node: W is min(MWOE(v)), where v is a node in the subtree rooted at u
• Every G node remembers the edge leading to the MWOE in its subtree (best edge)
• Best edges create a path from the leaders to the node x adjacent to the MWOE e
• Leaders send Report messages on the core; one of them sends Change-core on its best edge
• Every node on the path updates its inbound edge to point toward x
• x sends Connect(L) over e
[Figure: fragment F, node x and MWOE e]
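The convergecast of W(u) and the best-edge bookkeeping can be sketched as a recursive function over one fragment tree. This is a sequential stand-in for the Report messages; the names `report`, `children`, `local_mwoe` and `best_edge` are assumptions for illustration.

```python
import math


def report(u, children, local_mwoe, best_edge):
    """Return W(u) and record in best_edge[u] where the minimum came from."""
    # Start with u's own lightest outgoing edge, if it has one.
    best_w, best_via = local_mwoe.get(u, math.inf), u
    for c in children.get(u, []):
        w = report(c, children, local_mwoe, best_edge)
        if w < best_w:
            best_w, best_via = w, c
    # best_via == u means the MWOE of the subtree is adjacent to u itself.
    best_edge[u] = best_via if best_w < math.inf else None
    return best_w


# Hypothetical fragment tree rooted at r; weights are local outgoing edges.
children = {"r": ["a", "b"], "a": ["c"]}
local_mwoe = {"a": 9, "b": 7, "c": 4}
best_edge = {}
w = report("r", children, local_mwoe, best_edge)
```

Following `best_edge` from the root (r, then a, then c) traces the path along which Change-core travels.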
Test-Accept-Reject Protocol
• Bookkeeping: Each node keeps a list of incident edges in order of weight, classified as:
  • Branch (in the MST),
  • Rejected (leads to same fragment), or
  • Basic (not yet classified).
• A node n tests only Basic edges, sequentially in order of weight:
  • n sends a Test message with (core, Level) to its neighbor n’; the recipient compares.
  • If same (core, Level), n’ sends Reject (same fragment), and the edge is reclassified as Rejected.
  • If the (core, Level) pairs are unequal and Level(n’) ≥ Level(n), then n’ sends Accept (different fragment). The edge is not reclassified.
  • If Level(n’) < Level(n), then n’ delays responding until Level(n’) ≥ Level(n).
  • This is the Waiting… which can lead to Deadlocks
[Figure: fragments F and F’ with nodes n and n’]
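The recipient's side of the Test-Accept-Reject protocol reduces to a three-way comparison, sketched below. The function name `on_test` and the string results are assumptions; a real node would queue a delayed Test and re-run the check whenever its own Level increases.

```python
def on_test(my_core, my_level, sender_core, sender_level):
    """Recipient's answer to an incoming Test(core, Level)."""
    if (my_core, my_level) == (sender_core, sender_level):
        return "Reject"  # same fragment
    if my_level >= sender_level:
        return "Accept"  # different fragment
    return "Wait"        # delay the reply until own Level catches up


answers = [
    on_test(1, 1, 1, 1),     # same (core, Level)
    on_test(2, 1, 1, 1),     # different core, equal Level
    on_test(5, 2, 1, 1),     # different core, higher Level
    on_test(None, 0, 1, 1),  # Level-0 node tested by a Level-1 fragment
]
```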
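Merge
• Suppose F and F’ have the same MWOE e = (n, n’) and the same Level L:
  • Level(F) = Level(F’) = L
• Both n and n’ send Connect(L) over e, one in each direction
• e becomes the new core of a Level L+1 fragment
• Nodes n and n’ send Initiate(L+1, e, Find)
[Figure: fragments F and F’ with nodes n and n’ joined by e]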
Absorb
• Suppose F absorbs into fragment F’ via an edge e = (n, n’), while F’ is working on determining its MWOE.
• Level(F) < Level(F’)
• Node n sends Connect(Level(F))
• Node n’ immediately sends Initiate(Level(F’), core(F’), state)
• The state depends on the progress of n’:
  • If n’ has not yet reported its local MWOE, send Initiate(Find)
  • Otherwise, send Initiate(Found). We will see why the new fragment’s MWOE can’t be from F.
[Figure: fragments F and F’ with nodes n and n’ joined by e]
Correctness
Given Properties 1 and 2, it is sufficient to verify:
1. MWOE is correctly chosen by every fragment
2. No deadlocks due to Waits
MWOE Correctness (Async Absorb)
Case: F absorbs into F’ after n’ reported MWOE(n’).
We need to prove that MWOE(n’) is valid after Absorb.
Claim 1: Reported MWOE(n’) cannot be the edge (n, n’).
Proof:
• Since MWOE(n’) has already been reported, it must lead to a node with Level ≥ Level(F’).
• But the Level of F is still < Level(F’) when the absorb occurs.
• So MWOE(n’) is a different edge, one whose weight < weight(n, n’).
Claim 2: MWOE for the combined component is not outgoing from a node in F.
Proof:
• (n, n’) is the MWOE of F, so there are no edges outgoing from F with weight < weight(n, n’).
• So there are no edges outgoing from F with weight < already-reported MWOE(n’).
• So the MWOE of the combined fragment isn’t outgoing from F.
[Figure: fragments F and F’ with nodes n and n’]
Liveness
Lemma: After any finite sequence of merges and absorbs, either the forest consists of one tree (so we’re done), or some merge or absorb is enabled.
Proof:
• Consider the current “fragment digraph”:
  • Nodes represent fragments
  • Directed edges represent MWOEs
• There is an edge with minimal weight not yet in the forest, so there must be some pair F, F’ whose MWOEs point to each other.
• We can combine these fragments, using either merge or absorb:
  • If same Level, merge, else absorb.
• So, merging and absorbing are enough to proceed.
• If one of F, F’ Waits, it Waits for a smaller-Level fragment only
• But the lowest-Level fragment is NEVER blocked and can grow by Merge or Absorb
[Figure: Fragment Digraph]
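The key step of the liveness argument, that some pair of fragments always points at each other, can be checked on a small fragment digraph. The sketch below is centralized and illustrative only; the function names and the example (a five-node path with weights 1, 3, 4, 2, split into three fragments) are assumptions.

```python
def mwoe_digraph(fragment_of, edges):
    """Map each fragment to (weight, target fragment) of its MWOE."""
    mwoe = {}
    for w, u, v in edges:
        fu, fv = fragment_of[u], fragment_of[v]
        if fu == fv:
            continue  # internal edge, not outgoing
        for src, dst in ((fu, fv), (fv, fu)):
            if src not in mwoe or w < mwoe[src][0]:
                mwoe[src] = (w, dst)
    return mwoe


def mutual_pairs(mwoe):
    """Pairs of fragments whose MWOEs are the same edge, pointing at each other."""
    return {frozenset((f, g)) for f, (w, g) in mwoe.items()
            if mwoe[g] == (w, f)}


fragment_of = {1: "A", 2: "A", 3: "B", 4: "C", 5: "C"}
edges = [(1, 1, 2), (3, 2, 3), (4, 3, 4), (2, 4, 5)]
mwoe = mwoe_digraph(fragment_of, edges)
pairs = mutual_pairs(mwoe)
```

The globally lightest outgoing edge is the MWOE of both fragments it touches, so `mutual_pairs` is non-empty whenever more than one fragment remains in a connected graph.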
The Algorithm (As Executed at Each Node)
Simulation – Ring
• Communication
  • Odd link – 1 cycle
  • Even link – 2 cycles
[Figure: ring of three nodes ID1, ID2 and ID3 with links 1 (ID1–ID3), 2 (ID2–ID3) and 3 (ID1–ID2)]
Initialization (Processor/Time: N/A)
Event | Time
(no pending events)
Edges | i-b | LN | FN | SN | ID
1: Basic, 3: Basic | - | - | - | Sleeping | 1
2: Basic, 3: Basic | - | - | - | Sleeping | 2
1: Basic, 2: Basic | - | - | - | Sleeping | 3
Step 1 (Processor 1, Time 1)
Event | Time
ID1 wakes up | 1
ID1 send Connect(0) to ID3 | 1
ID3 recv Connect(0) on 1 | 2
Edges | i-b | LN | FN | SN | ID
1: Branch, 3: Basic | - | 0 | - | Found | 1
2: Basic, 3: Basic | - | - | - | Sleeping | 2
1: Basic, 2: Basic | - | - | - | Sleeping | 3
Step 2 (Processor 3, Time 2)
Event | Time
ID3 recv Connect(0) on 1 | 2
ID3 wakes up | 2
ID3 send Connect(0) to ID1 | 2
ID3 send Initiate(1,1,Find) to ID1 | 2
ID1 recv Connect(0) on 1 | 3
ID1 recv Initiate(1,1,Find) on 1 | 3
Edges | i-b | LN | FN | SN | ID
1: Branch, 3: Basic | - | 0 | - | Found | 1
2: Basic, 3: Basic | - | - | - | Sleeping | 2
1: Branch, 2: Basic | - | 0 | - | Found | 3
Step 3 (Processor 1, Time 3)
Event | Time
ID1 recv Connect(0) on 1 | 3
ID1 recv Initiate(1,1,Find) on 1 | 3
ID1 send Initiate(1,1,Find) to ID3 | 3
ID3 recv Initiate(1,1,Find) on 1 | 4
Edges | i-b | LN | FN | SN | ID
1: Branch, 3: Basic | - | 0 | - | Found | 1
2: Basic, 3: Basic | - | - | - | Sleeping | 2
1: Branch, 2: Basic | - | 0 | - | Found | 3
Step 4 (Processor 1, Time 3)
Event | Time
ID1 recv Initiate(1,1,Find) on 1 | 3
ID3 recv Initiate(1,1,Find) on 1 | 4
Edges | i-b | LN | FN | SN | ID
1: Branch, 3: Basic | 1 | 1 | 1 | Find | 1
2: Basic, 3: Basic | - | - | - | Sleeping | 2
1: Branch, 2: Basic | - | 0 | - | Found | 3
Step 5 (Processor 1, Time 3)
Event | Time
ID1 recv Initiate(1,1,Find) on 1 | 3
ID3 recv Initiate(1,1,Find) on 1 | 4
ID1 send Test(1,1) to ID2 | 3
ID2 recv Test(1,1) on 3 | 4
Edges | i-b | LN | FN | SN | ID
1: Branch, 3: Basic | 1 | 1 | 1 | Find | 1
2: Basic, 3: Basic | - | - | - | Sleeping | 2
1: Branch, 2: Basic | - | 0 | - | Found | 3
Step 6 (Processor 3, Time 4)
Event | Time
ID3 recv Initiate(1,1,Find) on 1 | 4
ID2 recv Test(1,1) on 3 | 4
ID3 send Test(1,1) on 2 | 4
ID2 recv Test(1,1) on 2 | 6
Edges | i-b | LN | FN | SN | ID
1: Branch, 3: Basic | 1 | 1 | 1 | Find | 1
2: Basic, 3: Basic | - | - | - | Sleeping | 2
1: Branch, 2: Basic | 1 | 1 | 1 | Find | 3
Step 7 (Processor 2, Time 4)
Event | Time
ID2 recv Test(1,1) on 3 | 4
ID2 recv Test(1,1) on 2 | 6
ID2 send Connect(0) to ID3 | 4
ID3 recv Connect(0) on 2 | 6
ID2 recv Test(1,1) on 3 (Wait) | 5
Edges | i-b | LN | FN | SN | ID
1: Branch, 3: Basic | 1 | 1 | 1 | Find | 1
2: Branch, 3: Basic | - | 0 | - | Found | 2
1: Branch, 2: Basic | 1 | 1 | 1 | Find | 3
Step 8 (Processor 2, Time 5)
Event | Time
ID2 recv Test(1,1) on 3 (Wait) | 5
ID2 recv Test(1,1) on 2 | 6
ID3 recv Connect(0) on 2 | 6
ID2 recv Test(1,1) on 3 (Wait) | 6
Edges | i-b | LN | FN | SN | ID
1: Branch, 3: Basic | 1 | 1 | 1 | Find | 1
2: Branch, 3: Basic | - | 0 | - | Found | 2
1: Branch, 2: Basic | 1 | 1 | 1 | Find | 3
Step 9 (Processor 2, Time 6)
Event | Time
ID2 recv Test(1,1) on 2 | 6
ID3 recv Connect(0) on 2 | 6
ID2 recv Test(1,1) on 3 (Wait) | 6
ID2 recv Test(1,1) on 2 (Wait) | 7
Edges | i-b | LN | FN | SN | ID
1: Branch, 3: Basic | 1 | 1 | 1 | Find | 1
2: Branch, 3: Basic | - | 0 | - | Found | 2
1: Branch, 2: Basic | 1 | 1 | 1 | Find | 3
Step 10 (Processor 3, Time 6)
Event | Time
ID3 recv Connect(0) on 2 | 6
ID2 recv Test(1,1) on 3 (Wait) | 6
ID2 recv Test(1,1) on 2 (Wait) | 7
ID3 send Initiate(1,1,Find) to ID2 | 6
ID2 recv Initiate(1,1,Find) on 2 | 8
Edges | i-b | LN | FN | SN | ID
1: Branch, 3: Basic | 1 | 1 | 1 | Find | 1
2: Branch, 3: Basic | - | 0 | - | Found | 2
1: Branch, 2: Branch | 1 | 1 | 1 | Find | 3
Step 11 – we go to time 8 (Processor 2, Time 8)
Event | Time
ID2 recv Initiate(1,1,Find) on 2 | 8
ID2 recv Test(1,1) on 3 (Wait) | 8
ID2 recv Test(1,1) on 2 (Wait) | 9
ID2 send Test(1,1) to ID1 | 8
ID1 recv Test(1,1) on 3 | 9
Edges | i-b | LN | FN | SN | ID
1: Branch, 3: Basic | 1 | 1 | 1 | Find | 1
2: Branch, 3: Basic | 2 | 1 | 1 | Find | 2
1: Branch, 2: Branch | 1 | 1 | 1 | Find | 3
Step 12 (Processor 2, Time 8)
Event | Time
ID2 recv Test(1,1) on 3 (Wait) | 8
ID2 recv Test(1,1) on 2 (Wait) | 9
ID1 recv Test(1,1) on 3 | 9
Edges | i-b | LN | FN | SN | ID
1: Branch, 3: Basic | 1 | 1 | 1 | Find | 1
2: Branch, 3: Rejected | 2 | 1 | 1 | Find | 2
1: Branch, 2: Branch | 1 | 1 | 1 | Find | 3
Note: Test edge is 3
Step 13 (Processor 2, Time 8)
Event | Time
ID2 recv Test(1,1) on 3 (Wait) | 8
ID2 recv Test(1,1) on 2 (Wait) | 9
ID1 recv Test(1,1) on 3 | 9
Edges | i-b | LN | FN | SN | ID
1: Branch, 3: Basic | 1 | 1 | 1 | Find | 1
2: Branch, 3: Rejected | 2 | 1 | 1 | Find | 2
1: Branch, 2: Branch | 1 | 1 | 1 | Find | 3
Step 13, continued (Processor 2, Time 8)
Event | Time
ID2 recv Test(1,1) on 3 (Wait) | 8
ID2 recv Test(1,1) on 2 (Wait) | 9
ID1 recv Test(1,1) on 3 | 9
ID2 send Report() to ID3 | 8
ID3 recv Report() on 2 | 10
Edges | i-b | LN | FN | SN | ID
1: Branch, 3: Basic | 1 | 1 | 1 | Find | 1
2: Branch, 3: Rejected | 2 | 1 | 1 | Found | 2
1: Branch, 2: Branch | 1 | 1 | 1 | Find | 3
Step 14 (Processor 2, Time 9)
Event | Time
ID2 recv Test(1,1) on 2 (Wait) | 9
ID1 recv Test(1,1) on 3 | 9
ID3 recv Report() on 2 | 10
ID2 send Reject to ID3 | 9
ID3 recv Reject on 2 | 11
Edges | i-b | LN | FN | SN | ID
1: Branch, 3: Basic | 1 | 1 | 1 | Find | 1
2: Branch, 3: Rejected | 2 | 1 | 1 | Found | 2
1: Branch, 2: Branch | 1 | 1 | 1 | Find | 3
Note: Test edge is nil
Step 15 (Processor 1, Time 9)
Event | Time
ID1 recv Test(1,1) on 3 | 9
ID3 recv Report() on 2 | 10
ID3 recv Reject on 2 | 11
ID1 send Report() to ID3 | 9
ID3 recv Report() on 1 | 10
Edges | i-b | LN | FN | SN | ID
1: Branch, 3: Rejected | 1 | 1 | 1 | Found | 1
2: Branch, 3: Rejected | 2 | 1 | 1 | Found | 2
1: Branch, 2: Branch | 1 | 1 | 1 | Find | 3
Note: Test edge is 3
Step 16 (Processor 3, Time 10)
Event | Time
ID3 recv Report() on 2 | 10
ID3 recv Report() on 1 | 10
ID3 recv Reject on 2 | 11
ID3 send Report() to ID1 | 10
ID1 recv Report() on 1 | 11
ID3 halts | 10
Edges | i-b | LN | FN | SN | ID
1: Branch, 3: Rejected | 1 | 1 | 1 | Found | 1
2: Branch, 3: Rejected | 2 | 1 | 1 | Found | 2
1: Branch, 2: Branch | 1 | 1 | 1 | Found | 3
Step 17 (Processor 1, Time 11)
Event | Time
ID3 recv Reject on 2 | 11
ID1 recv Report() on 1 | 11
ID1 halts | 11
Edges | i-b | LN | FN | SN | ID
1: Branch, 3: Rejected | 1 | 1 | 1 | Found | 1
2: Branch, 3: Rejected | 2 | 1 | 1 | Found | 2
1: Branch, 2: Branch | 1 | 1 | 1 | Found | 3
Communication Complexity
• Max number of Test\Reject messages is 2|E|: every edge is rejected at most once
• Map the remaining messages to the MWOE-finding task of some fragment F
• Max number of messages sent during the task by nodes of F:
  • Initiate – at most |F|
  • Report – at most |F|
  • Connect – 1
  • Change-core – at most 1 per node
  • Test\Accept – at most 2 per node
• Aggregating tasks by the Level of F, we get a total of O(N) messages per Level L
  • This is because no node appears in more than one MWOE task with Level = L.
• The number of Levels is bounded by log N
• So the total message complexity is O(|E| + N log N)
• The bound is optimal for rings
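The asymptotic bound can be turned into a concrete number with a small helper. The constants used here, 2E + 5N log2 N, are the ones stated in the original Gallager, Humblet and Spira paper rather than on these slides; the function name `ghs_message_bound` is an assumption for illustration.

```python
import math


def ghs_message_bound(n, e):
    """Upper bound 2E + 5N log2(N) on the number of messages, per the GHS paper."""
    return 2 * e + 5 * n * math.log2(n)


# A ring has as many edges as nodes.
ring = ghs_message_bound(n=8, e=8)
```

For a ring of 8 nodes this gives 16 messages for Test\Reject traffic plus 120 for the per-Level work.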
Questions?
Thank you!