Multicore Programming in the Face of Metamorphosis: Union-Find as an Example


Tel-Aviv University

The Raymond and Beverly Sackler Faculty of Exact Sciences

The Blavatnik School of Computer Science

Multicore Programming in the Face of Metamorphosis: Union-Find as an Example

This thesis is submitted in partial fulfillment of the requirements towards the M.Sc. degree

Tel-Aviv University
School of Computer Science

by

Igor Berman

The research work in this thesis has been carried out under the supervision of Prof. Nir Shavit

July 2010


CONTENTS

1. Introduction

2. The Union Find Delete Problem

3. A Concurrent Implementation of Union Find
   3.1 Coarse Grained Locking
   3.2 Software Transactional Memory
   3.3 Fine Grained Locking with Optimistic Traversals
   3.4 A Fully Wait-free Implementation
   3.5 A Performance and Programmability Evaluation

4. Metamorphosis: Adding a Delete to Union Find
   4.1 Coarse Grained Locking
   4.2 Long and Short Transactions
   4.3 Fine Grained Locking with Optimistic Traversals
   4.4 A Fully Wait-free Implementation
   4.5 A Performance and Programmability Evaluation

5. Conclusions


LIST OF FIGURES

2.1 Sequential union-find-delete object.

3.1 The optimistic fine-grained locking scheme for the union-find.
3.2 The fully wait-free implementation of find.
3.3 The fully wait-free implementation of union.
3.4 The latency of parallel union-find implementations.
3.5 The cache miss penalty for the various union-find implementations (logarithmic scale).
3.6 The CAS failures and successes for the various union-find implementations (logarithmic scale).
3.7 The performance and implementation complexity tradeoffs for union and find.

4.1 The optimistic fine-grained locking scheme for union-find-delete.
4.2 The latency of the parallel union-find-delete implementations.
4.3 The cache miss penalty for the various union-find-delete implementations (logarithmic scale).
4.4 The CAS failures and successes for the various union-find-delete implementations (logarithmic scale).
4.5 The latency of the optimistic fine-grained lock version of the union-find-delete object on SPARC.
4.6 The comparison of the fully wait-free union-find object vs. the optimistic fine-grained lock version of the union-find-delete object.
4.7 The performance and implementation complexity tradeoffs for implementing union, find, and delete.


ACKNOWLEDGEMENTS

I would like to thank all those who supported me during my studies. It was an honor for me to work with my thesis advisor, Prof. Nir Shavit.

I am grateful for his guidance, expertise and support throughout my research. Prof. Nir Shavit succeeded in showing me the depth and the aesthetics of the field of multiprocessors, which encouraged me to continue in this field.

Special gratitude to my parents, Mark Berman and Marina Dontsova, and to all my friends for encouraging me towards these difficult studies and supporting me all along the way.


ABSTRACT

A crucial question facing today's multicore programmers is which programming methodology to use for coordination and data structure design: fine grained locking, lock-free or wait-free synchronization, or perhaps transactional memory.

One aspect of this question that has received little attention is the tradeoff between performance and flexibility. In other words, given a data structure implemented using methodology X, delivering a given level of scalability, how costly is it, in terms of both performance and ease of programming, to use methodology X to add new features to this existing algorithm?

This work studies the flexibility question in the context of the union-find problem, a problem that is a unique fit for our quest in that it has a known efficient wait-free concurrent solution for implementing the union and find methods, but only a complex sequential solution if one wishes to allow delete methods. Based on union-find, we are able to make interesting observations about the relative benefits of using locks, non-blocking algorithms, and state-of-the-art software transactional memory systems.

Moreover, based on our new understandings, we present highly efficient and flexible algorithms for the union-find and union-find-delete problems that we believe are of independent interest.


1. INTRODUCTION

As multicore programming moves into the mainstream, increasing numbers of programmers face the dilemma of choosing a synchronization approach in implementing the data structures they use for inter-thread communication. This is not a small matter, since data structure access is most often the place where bottlenecks occur in parallel programs, and based on Amdahl's law, we know that reducing these bottlenecks can provide a benefit far beyond their relative fraction of the execution time.

DISC, SPAA, and PODC have over the years been venues for publishing new concurrent data structure implementations, some lock-based, some non-blocking, and in recent years, transactional (see books by Attiya and Welch [4], Herlihy and Shavit [13], Lynch [18], and Taubenfeld [23] that cover much of the past work). Many of the presented papers discuss the performance/robustness tradeoffs, that is, how we pay in performance for more fault-tolerance.¹ However, as far as we know, it is uncommon to find a comparison in terms of the tradeoff between performance and implementation effort, perhaps because quantifying the implementation effort is a tricky and somewhat inaccurate business. Moreover, no one has in the past attempted to compare the techniques in terms of their flexibility: the ability to add properties and new functionality to an existing well-tuned algorithm.

We believe this test, the flexibility test for design approaches under mutation, is a new and important dimension for evaluating concurrent data structures. The reason is clear: a major direction in which both research and industry are suggesting to make multicore programming simpler is by offering libraries of popular structures that users can then tailor to their specific needs. Take for example the Java Concurrency Library, distributed to over 10 million desktops. It contains many lock-based structures, but also non-blocking ones such as the implementations of Lea's lock-free skiplist [16] and Michael and Scott's lock-free queue [20]. The reason: these algorithms are known to outperform their lock-based counterparts.

The question we wish to raise, then, is whether the choice of picking the best performing lock-based or lock-free algorithm, based on performance, for a known functionality, is always justified... or can it become an Achilles' heel when we come to change or extend it? Obviously this is a big question, which has both applied and theoretical aspects. This work only attempts to expose it, raising a few interesting points for consideration.

The example problem we chose is the set union-find problem. The unique properties that make this problem a great candidate for our exposition are that it is a search structure with an elegant and highly scalable wait-free implementation of the union and find methods, due to Anderson and Woll [3], but no known implementation of a delete method. The only known implementation of delete is a sequential algorithm, which is rather complex.

¹ Which on multicore machines translates to making more or less demands of the operating system.

We therefore set out, in Section 3, to compare implementing the union-find algorithm using a variety of state of the art programming approaches combining lock-based and lock-free techniques. We then, in Section 4, attempted to extend the various implementations to include a delete method, asking how much of a programming effort was needed and what the performance return was on our investment.

The results of our efforts, apart from a hopefully better understanding of the advantages and disadvantages of the various techniques, are new efficient and flexible fine-grained lock-based solutions to the concurrent union-find and union-find-delete problems, ones that improve on the Anderson and Woll [3] solution used to date.


2. THE UNION FIND DELETE PROBLEM

A sequential union-find object allows performing a sequence of union and find operations, starting from a collection of m singleton sets {1}, {2}, . . . , {m}. The initial name of set {i} is i. Due to the definition of the union and find operations, there are two invariants that hold at any time. First, the sets are always disjoint and define a partition of the set {1, 2, . . . , m}. Second, the name of each set corresponds to one of the items contained in the set itself (a requirement that is relaxed in the union-find-delete object).

• union(x, y): Combine two sets A and B into one set, where x ∈ A and y ∈ B and A ≠ B.

• find(x): Return the name of the unique set containing the element x.

The union-find-delete object supports a new delete operation:

• delete(x): Delete element x from the set A if x ∈ A.

The concurrent union-find object is one that is linearizable [14] to the sequentially specified union-find object.

The union-find-delete object, as depicted in Figure 2.1, is implemented as a forest of trees, each tree representing a set, where the root of a tree represents the name of the unique set. The find operation is implemented by following a path from the given input element x to the root of its associated tree (traverse from x up to r). A path halving or bypassing technique [22] can be applied to make a tree shallower, thus making subsequent finds faster. While following the path to the root, each node becomes a child of its grandparent (x becomes a child of z, and z becomes a child of r). The union operation is implemented by finding two sets' roots (b and c), and making the root of one tree become a child of the other

Fig. 2.1: Sequential union-find-delete object.


tree (b becomes a child of c). The delete operation starts by reading the element to be deleted from the associated tree node, then marking the node as vacant, thus deleting it logically. After that, subject to a "tidiness property" of the union-find-delete tree, the wrapping node may be deleted physically (n is marked as vacant; it then becomes a leaf, since m bypasses it, and it is then deleted physically and freed). Our concurrent implementation will have the same basic tree structure with certain added synchronization elements, and with its traversal and modification operations performed concurrently by multiple threads.
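The sequential object just described can be sketched in a few lines of C (an illustrative version with path halving and union by rank; the array-based layout and the names `forest_t` and `union_sets` are ours, not the thesis code):

```c
#include <stdlib.h>

typedef struct {
    int *parent;   /* parent[i] == i means i is a root (the set's name) */
    int *rank;
    int  size;
} forest_t;

forest_t *forest_create(int m) {
    forest_t *f = malloc(sizeof *f);
    f->parent = malloc(m * sizeof(int));
    f->rank   = malloc(m * sizeof(int));
    f->size   = m;
    for (int i = 0; i < m; i++) { f->parent[i] = i; f->rank[i] = 0; }
    return f;
}

/* find with path halving: each visited node is re-pointed to its
 * grandparent (bypassing), making subsequent finds cheaper. */
int find(forest_t *f, int x) {
    while (f->parent[x] != x) {
        f->parent[x] = f->parent[f->parent[x]];  /* bypass */
        x = f->parent[x];
    }
    return x;
}

/* union by rank: the shallower tree's root becomes a child of the
 * deeper tree's root. Returns the name of the merged set. */
int union_sets(forest_t *f, int x, int y) {
    int rx = find(f, x), ry = find(f, y);
    if (rx == ry) return rx;
    if (f->rank[rx] < f->rank[ry]) { int t = rx; rx = ry; ry = t; }
    f->parent[ry] = rx;
    if (f->rank[rx] == f->rank[ry]) f->rank[rx]++;
    return rx;
}
```

The concurrent versions in the following chapters keep exactly this tree shape and add synchronization around it.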


3. A CONCURRENT IMPLEMENTATION OF UNION FIND

We now describe and evaluate implementations of union-find based on severalstate-of-the-art methodologies. In the next section we will describe and evaluatethe extension of the problem to union-find-delete.

3.1 Coarse grained locking

The coarse grained lock implementation wraps a sequential implementation of the union-find data structure with a single mutual exclusion lock. This is the standard pthreads library mutual exclusion lock [17]. The linearization points are the global lock acquisition events, and the progress properties are those of the pthreads lock implementation.
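Concretely, the coarse grained version has roughly this shape (a minimal self-contained sketch; the thesis wraps its own sequential routines, which we inline here as a plain parent array purely for illustration):

```c
#include <pthread.h>

#define M 1024

typedef struct {
    pthread_mutex_t lock;   /* single global lock guarding the whole forest */
    int parent[M];
} cg_forest_t;

void cg_init(cg_forest_t *f) {
    pthread_mutex_init(&f->lock, NULL);
    for (int i = 0; i < M; i++) f->parent[i] = i;
}

/* Sequential find, only ever called with the lock held. */
static int find_locked(cg_forest_t *f, int x) {
    while (f->parent[x] != x) x = f->parent[x];
    return x;
}

int cg_find(cg_forest_t *f, int x) {
    pthread_mutex_lock(&f->lock);   /* lock acquisition: the linearization point */
    int r = find_locked(f, x);
    pthread_mutex_unlock(&f->lock);
    return r;
}

void cg_union(cg_forest_t *f, int x, int y) {
    pthread_mutex_lock(&f->lock);
    int rx = find_locked(f, x), ry = find_locked(f, y);
    if (rx != ry) f->parent[ry] = rx;
    pthread_mutex_unlock(&f->lock);
}
```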

3.2 Software Transactional Memory

We used DSTMC (the Dresden STM Compiler) [9] to implement several compiler-instrumented software transactional memory versions. DSTMC offers freedom of choice of the underlying STM [21] implementation and is fairly stable to work with. Its installation process wasn't simple, but can be managed by an experienced system administrator. We integrated the TL2 STM [7, 8] to work with DSTMC and also tested the built-in TinySTM++ [10] algorithm, eventually settling on TinySTM++ as being the most effective when used with DSTMC. The union-find data structure implementation code remained the same for all the STMs used, which is a benefit to the programmer.

The implementation of the union-find data structure using TinySTM++/DSTMC was easy and took almost no time. It was as easy as developing the coarse grained lock version: we simply wrapped the code of the sequential union and find methods with coarse grained transactions using the begin and commit idioms of DSTMC, e.g.:

tanger_begin();

result = disjoint_find_aux();

tanger_commit();

The correctness of these implementations follows from the atomicity and progress properties of DSTMC and TinySTM++ and may be proved by contradiction.

3.3 Fine Grained Locking with Optimistic Traversals

There is no point in testing data structures implemented using hand-over-hand locking [13], or even ones in which locks are taken while traversing paths in the


do {
    // locking
    x = disjoint_find_aux();
    y = disjoint_find_aux();
    if (x == y) {
        return x;
    }
    lock_roots(forest, x, y);
    // validation
    if (forest->A[x]->parent == x &&
        forest->A[y]->parent == y) {
        break;  // x & y are roots
    } else {    // we've locked inner nodes
        unlock_roots(forest, x, y);
    }
} while (1);

Fig. 3.1: The optimistic fine-grained locking scheme for the union-find.

tree. Those are archaic techniques, used too often in the literature, perhaps because they make "newer" algorithms look good. Instead, we used a state-of-the-art "optimistic" locking approach introduced by Heller et al. [11]. Their approach, currently used in several other search structure implementations (see examples in [12, 16] and also a survey in [13]), is to use fine grained locks to coordinate among mutating operations that have optimistically picked locations to mutate, but design non-mutating traversal operations that proceed in a wait-free manner. We found that applying this approach to our union and find operations, though not trivial, was not difficult.

In our implementation we assign a lock per tree, that is, per disjoint set. The lock is located at the tree's root node, a node that points to itself. When the tree contains one element, the node that contains it holds the lock that must be acquired to change the tree. After merging several sets into one, the locks at the leaves and at the inner nodes become inactive, while the lock at the root becomes active. Since a lock can be implemented using 1 bit only, the lock at each tree node doesn't introduce considerable memory overhead.
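One way to realize such a single-bit lock is to steal the low bit of an existing per-node word and toggle it with CAS, along these lines (a sketch using C11 atomics; the thesis does not spell out the exact encoding, so the names and layout here are assumptions):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* The low bit of a per-node word serves as the lock flag; the
 * remaining bits stay free for other per-node state. */
#define LOCK_BIT ((uintptr_t)1)

typedef struct { _Atomic uintptr_t word; } node_lock_t;

bool try_lock_node(node_lock_t *n) {
    uintptr_t old = atomic_load(&n->word);
    if (old & LOCK_BIT) return false;  /* already held */
    /* CAS in the lock bit; fails if the word changed underneath us */
    return atomic_compare_exchange_strong(&n->word, &old, old | LOCK_BIT);
}

void unlock_node(node_lock_t *n) {
    atomic_fetch_and(&n->word, ~LOCK_BIT);  /* clear the lock bit */
}
```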

To merge two trees a thread acquires the two locks of these sets. To make the critical section as short as possible during merging, we use an "optimistic" programming technique [13]: traverse to find the two tree roots in a wait-free manner (Figure 3.1). Then lock the potential roots, using element-id order to avoid deadlocks. After validating that the locked nodes are real roots (they point to themselves), we merge them. It can happen that between finding the roots and locking them, other threads merge the set with another, turning the found root into an inner node. In this case, the thread must unlock the locked nodes and retry finding the real roots (beware of optimizations that start traversing from the found roots; this can become a problem once deletes are allowed). The code in Figure 3.1 describes this simple locking process.

To implement a simple wait-free find operation, we relaxed it so it does not perform path compression on the tree (because changing a parent pointer may interfere with the merging of two sets). Instead, union operations can compress


while (x != forest->A[x]->parent) {
    uintptr_t t = forest->A[x]->parent;
    CAS(
        &(forest->A[x]->parent),
        t,
        forest->A[t]->parent);
    x = forest->A[t]->parent;
}

Fig. 3.2: The fully Wait-free implementation of find.

paths after taking the appropriate root lock/locks, while climbing up to the tree root. This makes the find a read-only operation. It is not hard to argue that if a thread traverses up the tree from a given node, it will find a root that was a valid root during its execution: if several threads execute find concurrently with some union, we linearize all finds that read the old root and validated it to happen before the parent pointer change of the root, and all finds that read the new root to happen after the parent pointer change.

As a point of comparison, in order to understand which part of the performance gain comes from the fine grained locking and which from the wait-free find operation, we also implemented a version of the algorithm that uses a single coarse grained lock for all trees in the forest, but maintains the wait-free non-compressing find operations.

3.4 A Fully Wait-free Implementation

The fully wait-free algorithm is due to Anderson and Woll [3]. The wait-free implementation maintains an array of pointers to union-find records. Each record contains parent and rank fields. The additional level of indirection allows performing atomic updates on both fields. The main tool used to arbitrate between threads is the CompareAndSwap (CAS) primitive. The find operations are implemented in a manner very similar to the sequential version, except for the fact that the path compression operations are implemented using CAS to allow threads to agree on consistent tree states. The linearization point of the find operation occurs when the root is found, i.e., a node that points to itself. The union operation is implemented using an optimistic approach. At the beginning it finds both x's and y's roots (Figure 3.3, Lines 2 and 3). Then, using the union-by-rank heuristic [22], it decides that root x will become the ancestor of root y (Line 10). The wait-free implementation creates a new record and fills it with new values of the parent and rank fields (x and yr), making x an ancestor of y (in the update_root function). Then it tries to update the array entry of the element y by switching the record pointer using a CAS. If the CAS fails, another thread must have changed y's parent pointer or its rank (i.e., changed its record), so the thread restarts (Line 12). After changing the parent pointers, the wait-free implementation has an additional step that maintains the rank ordering property (Lines 14 and 15). It guarantees that while traversing a path from a node to the root x, the ranks are non-decreasing, but not necessarily strictly increasing. In the end (Line 17) the implementation prevents some pathological situations by optimistically compressing the path from y to the new root x and


 1: tryagain:
 2:   x = disjoint_find(forest, x);
 3:   y = disjoint_find(forest, y);
 4:   if (x == y) {
 5:     return x;
 6:   }
 7:   uintptr_t xr = forest->A[x]->rank;
 8:   uintptr_t yr = forest->A[y]->rank;
 9:   // link y to x
10:   if (xr > yr || (xr == yr && x > y)) {
11:     if (update_root(forest, y, yr, x, yr) == -1) {
12:       goto tryagain;
13:     }
14:     if (xr == yr) {
15:       update_root(forest, x, xr, x, xr + 1);
16:     }
17:     set_root(forest, y);
18:     return x;
19:   }
20:   else { // link x to y
21:     ...
22:   }

Fig. 3.3: The fully Wait-free implementation of union.

by making sure that the rank of x will be greater than the rank of y. The linearization point of the union operation is when the CAS that updates the record of y succeeds, or at the point at which the union operation discovers that the elements are in the same set (Line 4).

The correctness of this algorithm was proved by Anderson and Woll [3].
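The record-swapping step at the heart of this scheme can be sketched as follows (our reconstruction of what an update_root along the lines described above might look like, not Anderson and Woll's actual code; reclamation of replaced records is deliberately left out):

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

/* One level of indirection: A[i] points to an immutable record, so a
 * single pointer CAS updates parent and rank atomically together. */
typedef struct { uintptr_t parent, rank; } uf_record_t;

typedef struct {
    _Atomic(uf_record_t *) *A;  /* array of record pointers */
    uintptr_t m;
} wf_forest_t;

/* Replace x's record by (new_parent, new_rank), but only if x still
 * looks like a root with rank old_rank; returns -1 if a concurrent
 * update won the race, 0 on success. */
int update_root(wf_forest_t *f, uintptr_t x,
                uintptr_t old_rank, uintptr_t new_parent, uintptr_t new_rank) {
    uf_record_t *old = atomic_load(&f->A[x]);
    if (old->parent != x || old->rank != old_rank)
        return -1;                      /* x no longer a root as observed */
    uf_record_t *rec = malloc(sizeof *rec);
    rec->parent = new_parent;
    rec->rank   = new_rank;
    if (!atomic_compare_exchange_strong(&f->A[x], &old, rec)) {
        free(rec);
        return -1;                      /* another thread changed the record */
    }
    return 0;                           /* old record leaks here: safe memory
                                           reclamation is out of scope */
}
```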

3.5 A Performance and Programmability Evaluation

In our experiments we used two machines. The main machine is an Intel Nehalem processor with 4 cores that multiplex 2 hardware threads each and share an on-chip L3 cache. The second is a 64-way Niagara multicore machine with 8 cores that multiplex 8 hardware threads each and share an on-chip L2 cache. The latter was used mainly to check the scalability with higher numbers of threads.

Our benchmark implements a solution to the connected components problem for a given input graph using our parallel union-find data structure. Initially, each node is represented by a separate disjoint set. For each new edge, one merges (a union) the two sets that contain the edge's source and target nodes. The amount of work is measured by the number of edges to be processed, which is constant for the given input ssca2 graph. Each thread executes union or find with respect to the configured ratio of each operation. The working threads process disjoint sets of edges in parallel, to prevent contention on the edges data structure. A thread finishes work when it has processed its portion of edges. After all edges are processed, the resulting disjoint forest represents the connected components of the input graph. We measured the time spent to process all


[Four panels: duration (ms) vs. number of threads, for 90% unions / 10% finds and 20% unions / 80% finds workloads, plus detailed views; curves: coarse, wait-free, STM, coarse-wf-find, fine-wf-find.]

Fig. 3.4: The latency of parallel union-find implementations.

[Two panels: cache miss penalty (#cpu penalty, logarithmic scale) vs. number of threads, for 90% unions / 10% finds and 20% unions / 80% finds; same five implementations.]

Fig. 3.5: The cache miss penalty for the various union-find implementa-tions(logarithmic scale).

graph's edges in milliseconds. When the ratio of unions is higher, the processing time will be lower, since more of the time is spent processing edges (find operations do not contribute towards this goal).

Input graphs were generated using the GTGraph package [19]. The graphs were created using the ssca2 generator and are large enough that they do not fit into the Niagara or Nehalem processor's cache (they have 524288 nodes).
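The benchmark's worker loop can be sketched as follows (a simplified, hypothetical version: here the shared forest is stood in for by a single-lock parent array, whereas the real benchmark plugs in each of the implementations under test; each thread owns a disjoint slice of the edge list, as described above):

```c
#include <pthread.h>

typedef struct { int src, dst; } edge_t;

/* Stand-in concurrent forest: a parent array guarded by one lock.
 * Any of the thesis's implementations could be substituted here. */
typedef struct { pthread_mutex_t lock; int *parent; } forest_t;

static int root(forest_t *f, int x) {
    while (f->parent[x] != x) x = f->parent[x];
    return x;
}

typedef struct { edge_t *edges; int count; forest_t *f; } worker_arg_t;

/* Each worker unions the endpoints of its own edge slice; after all
 * workers join, the forest's trees are the connected components. */
void *worker(void *p) {
    worker_arg_t *a = p;
    for (int i = 0; i < a->count; i++) {
        pthread_mutex_lock(&a->f->lock);
        int rx = root(a->f, a->edges[i].src);
        int ry = root(a->f, a->edges[i].dst);
        if (rx != ry) a->f->parent[ry] = rx;
        pthread_mutex_unlock(&a->f->lock);
    }
    return NULL;
}
```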

Perhaps surprisingly, as can be seen in Figure 3.4, the fine-grained optimistic algorithm with wait-free find (denoted as fine-wf-find) is superior to all other methods, in particular to the fully wait-free implementation (denoted as wait-free). This is true both when there is a high and a low fraction of find operations. Notice that we provide a version of the algorithm with wait-free finds and a single coarse lock (denoted as coarse-wf-find), which does not scale well when there are many union operations but does great when the fraction of finds is high. This shows the importance of the parallelism obtained by using a lock per tree in the union and path compression operations. Moreover, performing union and path compression while using a lock per tree is significantly more effective than the sequence of CAS operations generated by the wait-free algorithm.


[Four panels: CAS failures (#failures) and CAS successes (#successes), logarithmic scale, vs. number of threads, for 90% unions / 10% finds and 20% unions / 80% finds; curves: coarse, wait-free, coarse-wf-find, fine-wf-find.]

Fig. 3.6: The CAS failures and successes for the various union-find implementa-tions(logarithmic scale).

This is confirmed by the cache miss rates in Figure 3.5: the wait-free algorithm has a significantly higher cache miss rate than the fine-grained algorithm, which actually has the lowest rate of all methods (the estimation of the cache miss penalties in CPU cycles was made using a formula found in Intel's support forum [25]).

The number of CAS failures and successes of the fine-grained algorithm, shown in Figure 3.6, is the lowest too, which gives it an additional benefit compared to the wait-free implementation when using the fine-grained object on platforms where CAS latency is relatively high. The number of CAS failures at 80% finds is close to 0. This emphasizes the highly parallel nature of the union-find data structure and supports our fine-grained lock-based design.

The STM's performance (denoted as STM) improves on that of the coarse grained lock, but is 3 to 4 times slower than the fine grained and wait-free versions. This has to do with its higher cache miss rates, CAS operations per memory location, and other validation overheads.

Though it is only our subjective experience in programming the various approaches, we found that making the wait-free algorithm work, even though it was already a known algorithm, was not trivial. On the other hand, programming with the STM was as trivial as with the coarse grained lock. The fine grained optimistic locking scheme was easier, but only slightly easier, to put together and reason about than the wait-free one. The combination of performance and implementation complexity is summarized in Figure 3.7 and in Table 3.1. The STM is the approach that offered a middle of the road solution, providing reasonable performance and programmability. However, if we are interested in performance, the fine-grained locking methodology seems to be slightly superior to the wait-free one.


[Scatter plot placing the four implementations on axes hard–easy (implementation effort) vs. slow–fast (performance): coarse, STM, fine-wf-find, wait-free.]

Fig. 3.7: The performance and implementation complexity tradeoffs for union andfind.

Version               Implementation (days)   Additional overheads (days)
Coarse-grained lock   1                       0
STM                   1                       20
Fine-grained locks    10                      5
Wait-free             15                      5

Tab. 3.1: Time spent for implementing various union-find objects.


4. METAMORPHOSIS: ADDING A DELETE TO UNION FIND

So what happens when we add a delete operation to union and find? The basis for our parallel union-find-delete data structure is a sequential version due to Alstrup et al. [2]. Their algorithm supports unions and deletes in constant time, and finds in O(lg m) worst case time and O(α(m)) amortized time, where m is the number of elements in the set returned by the find operation, and α(m) is the inverse of Ackermann's function.

The sequential union-find-delete data structure is more complex than the union-find data structure. It introduces inner auxiliary data structures (doubly linked lists) in each tree node for supporting delete operations in constant time. In addition, it uses a new technique, called value-regain (in addition to union-by-rank and path-compression or bypassing), to make the tree space complexity linear in the number of elements present in the union-find-delete forest and to improve find complexity.

The find and union operations are almost the same as in the union-find data structure. The union operation has an additional maintenance step for the union-find-delete tree, preserving its "tidiness" property, which allows one to free internal vacant tree nodes. The delete operation starts by removing the element to be deleted from the associated tree node, then marking the node as vacant, thus deleting it logically. After that, subject to the "tidiness property" of the union-find-delete tree and the local rebuilding technique, the associated node may be deleted physically. The "tidiness property" and local rebuilding help to perform delete in constant worst case time, while preserving the time complexities of the find operation. In addition, they ensure that at most half of the nodes in the tree are vacant. To achieve this, the local rebuilding technique defines a value function on the union-find-delete set. The delete operation reduces the value of the node by half, while bypassing increases the value of the node. Each node in the union-find-delete tree maintains the list of its children and the list of the children that have children, so removing a node may affect the parent and grandparent nodes by changing these lists. The value-regain is done by inspecting the highest node that was affected by the deletion and by performing different kinds of bypassing operations. The bypassing operations may physically delete vacant nodes. In addition, each bypassing changes the children and children-with-children lists. If the value of the set is regained, the delete finishes; otherwise the algorithm inspects the parent node, and so on, until it regains the value or meets the root of the tree.
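The "at most half vacant" invariant can be illustrated with simple per-tree counters (illustrative only; the actual value function of Alstrup et al. operates locally per node rather than per tree, and all the names here are ours):

```c
#include <stdbool.h>

/* Illustrative per-tree bookkeeping for the "at most half of the
 * nodes are vacant" invariant discussed above. */
typedef struct {
    int total;   /* nodes currently in the tree (occupied + vacant) */
    int vacant;  /* logically deleted nodes not yet freed */
} tree_stats_t;

/* Logical delete: mark one more node vacant and report whether the
 * tree now needs rebuilding to restore the invariant. */
bool delete_needs_rebuild(tree_stats_t *s) {
    s->vacant++;
    return 2 * s->vacant > s->total;  /* more than half vacant */
}

/* After rebuilding: vacant nodes have been bypassed and freed. */
void rebuilt(tree_stats_t *s) {
    s->total -= s->vacant;
    s->vacant = 0;
}
```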

4.1 Coarse Grained Locking

The coarse grained lock version of the union-find-delete wraps the sequentialimplementation of the union-find-delete data structure with a single lock. The


delete operation didn’t introduce any additional programming effort.

4.2 Long and Short Transactions

The regular implementation of the union-find-delete data structure using TinySTM++, implemented within DSTMC, was easy and took almost no time. The delete operation was wrapped in an effortless way. The underlying TinySTM++ implementation takes care to execute memory free operations once delete transactions commit. This is done in a way that is transparent to the user.

In order to reduce the read-set and write-set sizes of the transactions (a source of overhead in most STM designs [7]), we tried to divide each union and find transaction into two shorter transactions. To do so we introduced a new kind of "compression" transaction that takes a path from an inner node to the root of the tree and optimistically performs path compression by changing the parent pointers of the nodes along the path to point to their grandparents. All regular transactions became compression-less, and the algorithm itself works by interleaving calls to regular transactions with calls to compressing transactions. For example, each union consists of a merging transaction followed by compression transactions:

//merge the trees
tanger_begin();
//union code for x and y omitted
tanger_commit();

//perform path compression
disjoint_compress_tx(on x_element tree);
disjoint_compress_tx(on y_element tree);
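The body of such a compression transaction can be sketched as follows. This is a minimal illustration with an assumed, simplified node type; in the real code the loop body would execute between tanger_begin() and tanger_commit().

```c
#include <stddef.h>

/* Simplified node: only the parent pointer matters for compression. */
typedef struct cnode {
    struct cnode *parent;   /* NULL at the root */
} cnode;

/* Redirect each visited node's parent pointer to its grandparent,
 * halving the path from n to the root. */
static void compress_path(cnode *n) {
    while (n->parent != NULL && n->parent->parent != NULL) {
        cnode *grandparent = n->parent->parent;
        n->parent = grandparent;   /* bypass one level */
        n = grandparent;
    }
}
```

Because the compression runs in its own short transaction, a conflict aborts only the compression, not the union or find that preceded it.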

4.3 Fine Grained Locking with Optimistic Traversals

Implementing an optimistic fine-grained locking scheme with deletes was a more challenging undertaking. As in the union-find version, we made the find operation wait-free and compression-less. Union and delete operations acquire locks before merging trees, and compress paths (after taking the appropriate lock or locks) while climbing up to the trees' roots.

The optimistic fine-grained locking scheme introduces additional problems we have to handle. A find operation may refer to tree nodes that are concurrently taken out and deallocated by a thread performing a delete, or by a thread that tidies the tree after a union. This means that one must "privatize" the node, that is, make sure no reader can access the node while it is being freed. To handle this problem we implemented two mechanisms:

• A quiescence mechanism similar to the explicit privatization barrier used in the TL2P STM [6]. This barrier makes sure that no thread is performing a find operation that could still access the given node. The quiescence mechanism requires every find operation to inform other threads that it is in progress by changing a find-status variable; after completing the find, the thread updates the find-status variable and increments a find-counter variable.

queiscence_register_thread();
uintptr_t result = disjoint_find_aux(.);
queiscence_unregister_thread();

Before being removed from the tree, a node becomes an empty leaf, so the problem can occur only for find operations concurrent with the thread performing the union or delete. New find operations will not refer to these nodes, since their elements are already marked as removed from the data structure. To determine quiescence, a thread takes a snapshot [1] of the find-counters and the find-statuses of the other threads. It then spins until all concurrent finders finish, and can thus be sure that no concurrent finder can reference a node that is about to be freed.

• When a thread removes a node from the tree, it does not change the parent pointer of the node. Preserving the parent field allows concurrent finders to travel along the path to the root. Once no concurrent find operation can access the node, the thread frees the memory and, if needed, nullifies the parent node pointer.
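Putting the two mechanisms together, the deleter's side of the quiescence wait can be sketched as follows. The layout of the find-status and find-counter variables is an assumption; the thesis only shows the finder-side calls queiscence_register_thread()/queiscence_unregister_thread().

```c
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_THREADS 128

/* Hypothetical per-finder state: each finder publishes whether it is
 * inside a find (find-status) and a counter incremented on completion
 * (find-counter). */
typedef struct {
    atomic_bool  in_find;
    atomic_ulong find_count;
} finder_state;

static finder_state finders[MAX_THREADS];

/* Called by a deleter after the node has been logically unlinked (it is
 * an empty leaf by then): snapshot the finders' state, then spin until
 * every finder that was active at the snapshot has moved on.  Afterwards
 * no concurrent find can still reference the node, and it may be freed. */
static void quiesce(int nthreads) {
    bool          active[MAX_THREADS];
    unsigned long snap[MAX_THREADS];
    for (int i = 0; i < nthreads; i++) {        /* take the snapshot */
        active[i] = atomic_load(&finders[i].in_find);
        snap[i]   = atomic_load(&finders[i].find_count);
    }
    for (int i = 0; i < nthreads; i++) {        /* wait per finder */
        while (active[i] &&
               atomic_load(&finders[i].in_find) &&
               atomic_load(&finders[i].find_count) == snap[i])
            ;  /* finder i is still inside its pre-removal find */
    }
}
```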

As in the union-find data structure, to exploit the inherent parallelism of union and delete operations on different trees, we introduced a lock for each disjoint set and optimistically find the roots that need to be locked (Lines 2 and 3). Compared to the union-find data structure, finding roots optimistically introduces additional problems because of the delete operation. The algorithm needs to handle several problematic cases, which may also occur in combination. For example, during a union operation the following problems can happen:

• One of the elements is deleted after the roots are found.

• One of the root nodes is physically deleted.

• One of the sets is merged with another set.

After finding the potential roots and before locking them (Figure 4.1), other threads may cause all of the above problems (they cannot free the associated nodes, because the quiescence mechanism will detect that the nodes are still accessible). This would make the lock acquisition useless. After locking the roots (Line 11), the thread must therefore validate that it locked the current roots of the trees. It does so by traversing the trees and finding the roots once again after locking (Lines 12 and 13); if it finds the same roots (Lines 19 and 20), it may proceed, since they are locked. If the thread fails to validate, it unlocks the locked nodes and restarts (Line 22).

To prevent inconsistency in the tree structure, we do not perform path compression in the find operation (changing a parent pointer might interfere with a concurrent union or delete). Find operations concurrent with a union operation are linearized as before, while find operations that run concurrently with a delete are linearized when they find a root or when they find a deleted node (by checking the delete status of the id and the status of the tree node).
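Such a compression-less find can be sketched as follows, with an assumed, simplified node layout; the only fixed point is that, as in Figure 4.1, a deleted element yields (uintptr_t)-1.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified node for illustration; field names are assumptions. */
typedef struct dnode {
    struct dnode *parent;  /* preserved even after logical removal */
    bool          vacant;  /* set when the element is logically deleted */
    uintptr_t     id;
} dnode;

/* Wait-free, compression-less find: climbs parent pointers without any
 * writes, so it cannot interfere with a concurrent union or delete. */
static uintptr_t find_root(const dnode *n) {
    /* Linearization point of a find racing with a delete: observing the
     * deleted status of the starting element. */
    if (n->vacant) return (uintptr_t)-1;
    /* Climbing is safe because removed nodes keep their parent pointer
     * until quiescence is established. */
    while (n->parent != NULL)
        n = n->parent;
    return n->id;
}
```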


 1: do{ //locking
 2:   x_root = disjoint_find_tx(forest, x_el);
 3:   y_root = disjoint_find_tx(forest, y_el);

      //x_el or y_el is deleted
 4:   if(x_root == (uintptr_t)-1 ||
 5:      y_root == (uintptr_t)-1){
 6:     return -1;
 7:   }

      //x_el and y_el are in the same set
 8:   if(x_root == y_root){
 9:     return x_root;
10:   }
11:   lock_roots(forest, x_root, y_root);
12:   uintptr_t x_root_val = disjoint_find_tx(forest, x_el);
13:   uintptr_t y_root_val = disjoint_find_tx(forest, y_el);

      //x_el or y_el is deleted
14:   if(x_root_val == (uintptr_t)-1 ||
15:      y_root_val == (uintptr_t)-1){
16:     unlock_roots(forest, x_root, y_root);
17:     return -1;
18:   }

      //roots deleted or changed
19:   if(x_root != x_root_val ||
20:      y_root != y_root_val){
21:     unlock_roots(forest, x_root, y_root);
22:     continue;
23:   }
24:   break;
25: }while(1);

Fig. 4.1: The optimistic fine-grained locking scheme for union-find-delete.


[Figure 4.2 plots: duration (ms) vs. number of threads for the coarse, STM, STM-fine, coarse-wf-find, and fine-wf-find implementations, under the 80% unions/10% finds/10% deletes and 10% unions/80% finds/10% deletes mixes, each also shown in a detailed view.]

Fig. 4.2: The latency of the parallel union-find-delete implementations.

4.4 A Fully Wait-free Implementation

The problems described for the optimistic fine-grained lock-based implementation are exacerbated when we wish to make the union and delete operations wait-free. This prevented us from designing a wait-free version of the union-find-delete data structure.

4.5 A Performance and Programmability Evaluation

The benchmark for the union-find-delete problem evolved from the union-find benchmark by adding a delete operation, at a configured ratio, on a randomly chosen node of the graph. While this extension has no meaning in the connected-components problem, it could be applicable when designing concurrent Fibonacci heaps, which can use the union-find-delete data structure as an auxiliary building block [15].
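The ratio-driven choice of operations can be sketched as a simple driver step; the harness itself is not shown in the thesis, so the helper below (its name, signature, and random-number generator) is entirely an assumption.

```c
/* Pick the next benchmark operation according to configured percentages,
 * e.g. 80/10/10; deletes take whatever remains. */
typedef enum { OP_UNION, OP_FIND, OP_DELETE } op_t;

static op_t pick_op(int union_pct, int find_pct, unsigned int *seed) {
    *seed = *seed * 1103515245u + 12345u;   /* simple per-thread LCG */
    int r = (int)((*seed >> 16) % 100);     /* value in 0..99 */
    if (r < union_pct) return OP_UNION;
    if (r < union_pct + find_pct) return OP_FIND;
    return OP_DELETE;
}
```

Keeping the seed per thread avoids any shared state in the driver itself, so the benchmark measures only the data structure's synchronization.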

The big news in this benchmark is that we did not know how to create a wait-free version of the delete algorithm. That is, in the time we allotted ourselves, providing a provably correct wait-free version was simply too complex a task. On the other hand, implementing the fine-grained lock-based version was not trivial, but doable. As we noted earlier, we had to add a traversal-based validation to check that a root is still valid, which adds overhead relative to the union-find scheme. Among the other algorithms, the situation is quite similar to that of union-find: the fine-grained lock-based version is again the clear winner.

We tried to improve the STM version by splitting one coarse transaction of each kind (union/find/delete) into two smaller transactions, one of which performs the actual operation while the other does path compression. Unfortunately, the added overhead of making two passes along each path up to the root (in the case of a union or find operation) was more of a performance killer than we had expected. Short transactions have comparable performance to the


[Figure 4.3 plots: cache-miss penalty (logarithmic scale) vs. number of threads for the coarse, STM, STM-fine, coarse-wf-find, and fine-wf-find implementations, under the 80% unions/10% finds/10% deletes and 10% unions/80% finds/10% deletes mixes.]

Fig. 4.3: The cache miss penalty for the various union-find-delete implementations (logarithmic scale).

[Figure 4.4 plots: CAS failures and CAS successes (logarithmic scale) vs. number of threads for the coarse, coarse-wf-find, and fine-wf-find implementations, under the 80% unions/10% finds/10% deletes and 10% unions/80% finds/10% deletes mixes.]

Fig. 4.4: The CAS failures and successes for the various union-find-delete implementations (logarithmic scale).

simple long transactions without path compression, even with a high rate of unions (Figure 4.2).

The coarse-grained lock with wait-free finds performs as badly as the simple coarse-grained lock when the union rate is high and the find rate is very low, due to global lock contention. However, when the find ratio is high, it has very good scalability.

The cache-miss penalties in Figure 4.3 and the CAS failures in Figure 4.4 support the observation that the fine-grained algorithm is the best in terms of performance, and that the STM delivers the best performance-versus-programmability tradeoff as long as one is willing to tolerate a 3-fold slowdown. The number of CAS failures of the fine-grained algorithm is also the lowest, suggesting a near-optimal number of lock-acquisition attempts.

The fine-grained lock-based version scales well beyond 8 threads (up to 128) on the Niagara machine in Figure 4.5. Since each thread executes only a disjoint portion of the work, it is expected that most of the unions are empty ones, i.e., the thread discovers, in a wait-free manner, that the edge does not lead to the


[Figure 4.5 plots: duration (ms) vs. number of threads (up to 128) for the fine-wf-find implementation on SPARC, under the 80% unions/10% finds/10% deletes and 10% unions/80% finds/10% deletes mixes, each also shown in a detailed view.]

Fig. 4.5: The latency of the optimistic fine-grained lock version of the union-find-delete object on SPARC.

Version               Implementation (days)   Additional overheads (days)
Coarse-grained lock   1                       0
STM                   1                       20
Fine-grained locks    20                      5
Wait-free             ?                       ?

Tab. 4.1: Time spent for implementing various union-find-delete objects.

actual merge of two disjoint sets, but instead the edge's source and target nodes are already contained in the same disjoint set.

The complexity introduced into the union-find-delete data structure appears when comparing the fine-grained lock-based union-find-delete version against the wait-free union-find data structure with a delete ratio of 0 in Figure 4.6. Surprisingly, on the SPARC architecture the fine-grained lock-based version performs better than the wait-free implementation. This behavior can be explained by the high latency of the CAS instruction on the Niagara machines [5], which invalidates a cache line each time a CAS is executed.

Matching our subjective experience of programming the union-find data structures, programming union-find-delete with the STM was as trivial as with the coarse-grained lock, while the fine-grained optimistic locking scheme was not trivial at all. The combination of performance and implementation complexity is summarized in Figure 4.7 and in Table 4.1.


[Figure 4.6 plots: duration (ms) vs. number of threads for the fine-wf-find and wait-free implementations, on Nehalem (up to 12 threads) and SPARC (up to 128 threads), under the 80% unions/20% finds/0% deletes and 20% unions/80% finds/0% deletes mixes.]

Fig. 4.6: The comparison of the fully wait-free union-find object vs. the optimistic fine-grained lock version of the union-find-delete object.

[Figure 4.7: scatter plot placing coarse, STM, fine-wf-find, and wait-free (?) on a hard-to-easy (x) vs. slow-to-fast (y) plane.]

Fig. 4.7: The performance and implementation complexity tradeoffs for implementing union, find, and delete.


5. CONCLUSIONS

The results of our efforts are new, efficient, and flexible fine-grained lock-based solutions to the concurrent union-find and union-find-delete problems, ones that improve on the Anderson and Woll [3] solution used to date by using locks for mutations and wait-free methods for traversals. Perhaps the implication of our work, when considered together with past lock-based algorithms, is that there is little benefit to using a wait-free algorithm for methods that modify the data structure, from the point of view of both performance and flexibility.

We conclude that the STM is the approach that offers the best middle-of-the-road solution, providing reasonable performance, programmability, and extensibility. If we are interested in performance, despite the complexities, and are willing to allocate enough time, the fine-grained locking methodology seems to be the choice.

All code is available from The MultiCore Algorithmics group site[24].


BIBLIOGRAPHY

[1] Afek, Y., Attiya, H., Dolev, D., Gafni, E., Merritt, M., and Shavit, N. Atomic snapshots of shared memory. In Proc. of the 9th Annual ACM Symposium on Principles of Distributed Computing (PODC) (1990), pp. 1–14.

[2] Alstrup, S., Gørtz, I. L., Rauhe, T., Thorup, M., and Zwick, U. Union-find with constant time deletions. In ICALP (2005), pp. 78–89.

[3] Anderson, R. J., and Woll, H. Wait-free parallel algorithms for the union-find problem. In STOC (1991), pp. 370–380.

[4] Attiya, H., and Welch, J. Distributed Computing: Fundamentals, Simulations and Advanced Topics (2nd edition). John Wiley Interscience, March 2004.

[5] Dice, D. CAS and cache trivia - invalidate or update in-place. http://blogs.sun.com/dave/entry/cas and cache trivia invalidate.

[6] Dice, D., Matveev, A., and Shavit, N. Implicit privatization using private transactions, April 2010.

[7] Dice, D., Shalev, O., and Shavit, N. Transactional locking II. In DISC (2006), pp. 194–208.

[8] Dice, D., and Shavit, N. Understanding tradeoffs in software transactional memory. In CGO (2007), pp. 21–33.

[9] Felber, P., Fetzer, C., Muller, U., Riegel, T., Sußkraut, M., and Sturzrehm, H. Transactifying applications using an open compiler framework. In TRANSACT (August 2007).

[10] Felber, P., Fetzer, C., and Riegel, T. Dynamic performance tuning of word-based software transactional memory. In PPOPP (2008), pp. 237–246.

[11] Heller, S., Herlihy, M., Luchangco, V., Moir, M., Scherer III, W. N., and Shavit, N. A lazy concurrent list-based set algorithm, 2005.

[12] Herlihy, M., Lev, Y., Luchangco, V., and Shavit, N. A simple optimistic skiplist algorithm. In SIROCCO (2007), pp. 124–138.

[13] Herlihy, M., and Shavit, N. The Art of Multiprocessor Programming. Morgan Kaufmann, NY, USA, 2008.

[14] Herlihy, M. P., and Wing, J. M. Linearizability: a correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems (TOPLAS) 12, 3 (1990), 463–492.

[15] Kaplan, H., Shafrir, N., and Tarjan, R. E. Union-find with deletions, 2002.

[16] Lea, D. Concurrent lock free skiplist. http://java.sun.com/javase/6/docs/api/java/util/concurrent/ConcurrentSkipListMap.html, 2007.

[17] Lewis, B., and Berg, D. J. Multithreaded Programming with Pthreads. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1998.

[18] Lynch, N. A. Distributed Algorithms. Morgan Kaufmann, 1996.

[19] Madduri, K., and Bader, D. A. GTgraph: A synthetic graph generator suite. https://sdm.lbl.gov/~kamesh/software/GTgraph/.

[20] Michael, M. M., and Scott, M. L. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In Proc. 15th ACM Symp. on Principles of Distributed Computing (1996), pp. 267–275.

[21] Shavit, N., and Touitou, D. Software transactional memory. Distributed Computing 10, 2 (February 1997), 99–116.

[22] Tarjan, R. E. Data Structures and Network Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1983.

[23] Taubenfeld, G. Synchronization Algorithms and Concurrent Programming. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 2006.

[24] The Multicore Algorithmics group at the TAU School of Computer Science. http://mcg.cs.tau.ac.il/projects/union-find-with-delete.

[25] VTune reference: cache miss ratio on Nehalem. http://software.intel.com/en-us/forums/showthread.php?t=69583.