
RAYMOND AND BEVERLY SACKLER

FACULTY OF EXACT SCIENCES

BLAVATNIK SCHOOL OF COMPUTER SCIENCE

Combining Techniques Application for Tree Search Structures

Thesis submitted in partial fulfillment of requirements

for the M.Sc. degree in the School of Computer Science, Tel-Aviv University

by

Vladimir Budovsky

The research work for this thesis has been carried out

at Tel-Aviv University under the supervision of Prof. Yehuda Afek and Prof. Nir Shavit

June 2010

CONTENTS

1. Introduction
   1.1 Flat Combining
   1.2 Skip Lists
2. The Flat Combined Skip Lists
   2.1 Naive Flat Combined Skip List
   2.2 Flat Combined Skip List with Multiple Combiners
   2.3 Flat Combined Skip List with "Hints"
3. Performance
   3.1 Performance Comparison of Flat Combined Skip Lists vs JDK ConcurrentSkipListSet
   3.2 Flat Combining Mechanism Experimental Verifications
4. Conclusions

LIST OF FIGURES

1.1 Skip list of height 4. May be considered either as a collection of "fat" nodes or as a 2-d list
1.2 Skip list traversal with key 12. Traversed predecessors are shown. Start level is 3.
2.1 Multi-combiner skip list. Every node with height ≥ 3 is a combiner node
3.1 Naive FC skip list implementation vs JDK lock-free ConcurrentSkipListSet, uniform keys distribution
3.2 Naive FC skip list implementation vs JDK lock-free ConcurrentSkipListSet, high access locality
3.3 Hints FC skip list implementation vs JDK lock-free ConcurrentSkipListSet, uniform keys distribution
3.4 Hints FC skip list implementation vs JDK lock-free ConcurrentSkipListSet, high access locality
3.5 FC skip list implementation vs multi-lock one, naive implementations, uniform keys distribution
3.6 FC skip list implementation vs multi-lock one, naive implementations, high access locality
3.7 FC skip list implementation vs multi-lock one, hints implementations, uniform keys distribution
3.8 FC skip list implementation vs multi-lock one, hints implementations, high access locality
3.9 Ideal hints FC skip list implementation vs JDK lock-free ConcurrentSkipListSet, uniform keys distribution
3.10 Ideal hints FC skip list implementation vs JDK lock-free ConcurrentSkipListSet, high access locality
3.11 Hints mechanism success rate for pure update workloads
3.12 The connection between FC intensity and throughput per thread for pure update workloads
3.13 Lock-free skip list CAS per update, CAS success rate and throughput per thread for pure update workloads

LISTINGS

2.1 Set of Integers Interface
2.2 Flat combining definitions
2.3 Node definition
2.4 Wait free contains is the same for all skip lists
2.5 add Naive implementation
2.6 scanAndCombine common implementation
2.7 Physical add and remove Naive implementation
2.8 Multi-combiner remove implementation
2.9 Optimistic (hinted) FCrequest and add implementation
2.10 Optimistic (hinted) doAdd and verify implementation
3.1 Optimistic (hinted) multi-lock add method implementation

ACKNOWLEDGEMENTS

I would like to thank all those who made this thesis possible.

I am extremely grateful to my advisors Prof. Yehuda Afek and Prof. Nir Shavit, who introduced me to the world of multiprocessors and distributed algorithms and whose supervision and support enabled me to advance my understanding of the subject.

My sincere thanks to Ms. Moran Tzafrir for teaching me what a researcher's everyday work is about and for supplying me with the arsenal of essential tools for my work.

Finally, I am grateful to my family and especially to my sister Elena for their patience and encouragement.

ABSTRACT

Flat combining (FC) is a new synchronization paradigm that dramatically reduces synchronization costs. As was recently shown, this technique brings a significant performance gain for several popular parallel data structures, such as stacks, queues, and shared counters. Moreover, applying the combining paradigm keeps the code as simple as code synchronized via a single global lock. However, the question of its applicability to other classes of parallel data structures has not been answered yet.

This work deals with the application of the FC paradigm to binary tree-like data structures. As shown below, combining is hardly suitable for these cases. The limits of FC use have been studied, and a criterion for its applicability has been justified.

1. INTRODUCTION

Multi-core and many-core computers are becoming more and more common these days. We witness recent developments of computer chips with tens of cores that consume no more space and energy than a desktop processor. In light of this trend, the development of scalable and correct data structures becomes extremely important. The simplest and most straightforward solution is to derive a concurrent data structure from a sequential one, using a global lock as the synchronization primitive. Unfortunately, this solution does not scale even for a relatively small number of cores. Another approach is to design fine-grained synchronization schemes using multiple locks or non-blocking read-modify-write atomic operations. This method usually requires a full redesign and reimplementation of the algorithm. An additional drawback of fine-grained and, especially, lock-free synchronization is its high complexity. It is very difficult to formally prove the correctness of such data structures (see, for example, the proofs in [3] and [4]).

1.1 Flat Combining

The flat combining [7] programming paradigm achieves a high level of concurrency while preserving code simplicity. The main idea behind flat combining is to attach a public registry of actions to an existing sequential data structure. Each thread, before accessing the shared data, publishes its action request in the registry and then tries to acquire the global lock. The winning thread becomes the "combiner": it scans the registry and performs all the requests it finds. The other threads simply wait for the results of their actions, spinning on a thread-local Done flag.

There are several benefits of this strategy:

• Low synchronization cost compared to a global lock, since there is only one round of competition for acquiring the shared lock, and every thread, whether it wins or loses, returns with its request performed.

• The combiner can use its knowledge of all the requests and fulfill some of them without accessing the data structure. For a stack, for example, the combiner may collect push/pop pairs and return the results to the appropriate callers. This well-known technique is called elimination. For a shared counter, the combiner can calculate the total counter change and update the data structure only once. This technique, called combining, is also widely used.
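The shared counter case can be illustrated with a minimal Java sketch. It is not taken from the thesis: the class FlatCombiningCounter, its Slot registry entries and the demo helper are hypothetical names, and the sketch assumes that every thread owns a fixed slot index. A thread publishes its delta, competes for a test-and-test-and-set lock, and the winner sums all pending deltas and updates the counter once.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration; the class and member names are not from the thesis.
class FlatCombiningCounter {
    // One publication slot per thread.
    private static final class Slot {
        int delta;                 // requested change, written before `pending`
        volatile boolean pending;  // volatile write publishes the request
    }

    private final Slot[] slots;
    private final AtomicInteger lock = new AtomicInteger(0);
    private volatile long value;   // written only by the current combiner

    FlatCombiningCounter(int maxThreads) {
        slots = new Slot[maxThreads];
        for (int i = 0; i < maxThreads; i++) slots[i] = new Slot();
    }

    void add(int threadId, int delta) {
        Slot mine = slots[threadId];
        mine.delta = delta;
        mine.pending = true;                       // publish the request
        while (mine.pending) {                     // until some combiner serves it
            if (lock.get() == 0 && lock.compareAndSet(0, 1)) {   // TTAS lock
                try {
                    combine();                     // this thread is the combiner
                } finally {
                    lock.set(0);
                }
            } else {
                Thread.yield();
            }
        }
    }

    // Called only while holding the lock.
    private void combine() {
        boolean[] seen = new boolean[slots.length];
        long total = 0;
        for (int i = 0; i < slots.length; i++) {
            if (slots[i].pending) {                // volatile read makes delta visible
                seen[i] = true;
                total += slots[i].delta;
            }
        }
        value += total;                            // one update for the whole batch
        for (int i = 0; i < slots.length; i++) {
            if (seen[i]) slots[i].pending = false; // release the waiting thread
        }
    }

    long get() { return value; }

    // Runs `threads` workers, each adding 1 `perThread` times; returns the final value.
    static long demo(int threads, int perThread) {
        FlatCombiningCounter c = new FlatCombiningCounter(threads);
        Thread[] ts = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            ts[t] = new Thread(() -> { for (int i = 0; i < perThread; i++) c.add(id, 1); });
            ts[t].start();
        }
        for (Thread t : ts) {
            try { t.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        }
        return c.get();
    }
}
```

The second pass releases only the slots that were counted in the first pass, so a request published in the middle of a combining session is left for the next combiner instead of being dropped.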

The variants of the FC algorithm are described in detail in Chapter 2 (The Flat Combined Skip Lists). Flat combining has proven very efficient for data structures with "hot spots", such as a stack head, queue ends, or a priority queue head. It also shows good results when synchronization costs are high. For example, lock-free synchronous queues [16] demonstrate good throughput but moderate scalability, which can be improved using elimination or FC techniques. However, the question of the usefulness of FC for data structures without pronounced bottlenecks and high synchronization costs remains open. This work studies the applicability of flat combining to binary tree-like data structures, the ones with O(log n) access time that allow range operations.

Fig. 1.1: Skip list of height 4. It may be considered either as a collection of "fat" nodes or as a 2-d list.

1.2 Skip Lists

Tree search structures are probably the most popular and widespread ones. It is hard to find an area of computer science or software programming that does not use them. Their practical applications start with the red-black tree [6], which is used in nearly every algorithms library, including the C++ STL [17] and the Java SDK [18], and the AVL tree [1], which is very popular for search-dominated workloads. They continue with various B-trees [2], which are useful for block-organized memories, and finish with specialized suffix tries, splay trees, spatial search trees, persistent trees, etc. Since all of the above algorithms deal with large amounts of data, and many of them run inside operating systems or are used as search indexes inside databases, distributed and multi-threaded designs for search trees are the focus of much research and of many commercial projects. A comprehensive survey of concurrent binary search trees is given in [13]. The common problem with all the search trees mentioned above is that they are either static (they do not allow add/remove without a full rebuild) or need a re-balancing mechanism after updates in order to preserve logarithmic access time. In most cases, the scope of the re-balancing is unknown prior to the update, which makes the design of fine-grained synchronization for binary search trees a very complicated task. That is why skip lists were chosen as the basic data structure for this research. There were several reasons for the decision:

• The skip list is simple and has no re-balancing overhead, which simplifies measurements.

• The skip list is the only known concurrent lock-free binary search structure.

Fig. 1.2: Skip list traversal with key 12. Traversed predecessors are shown. The start level is 3.

The skip list was invented in 1990 [15] as a probabilistic alternative to binary search trees. A skip list is a linked list of "fat" nodes (Figure 1.1), each of which has a randomly chosen height (number of levels). Every node has a unique key, and the nodes appear in the list in key order. Each node is connected at each of its levels to its successor at the same level. The random level is chosen using a geometric distribution: the probability that a node has layer $i$, $i \ge 0$, is $1/p^i$, where $p > 1$. So every node has layer 0, and a node that has layer $i$ also has layer $i+1$ with probability $1/p$. In practice, $p$ is usually chosen between 2 and 4. Such a distribution gives an expected maximal node height of $O(\log N)$, and between every two nodes of height $k$, $p-1$ nodes of height $k-1$ are expected to appear. It is useful to add two immutable nodes, head and tail, with the highest possible level and to maintain the real highest level (start level) on every add or delete. Alternatively, the skip list may be represented as a collection of sorted lists with unique keys $L_1, L_2, \ldots, L_k$ such that $i > j \Rightarrow L_i \subseteq L_j$, where all nodes with equal keys form "vertical" lists. The latter representation is especially convenient for lock-free implementations, where all updates are implemented through atomic read-and-update operations.
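The geometric height distribution can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the names SkipListLevels and randomHeight and the maxHeight cap are hypothetical, p is taken to be an integer, and java.util.Random stands in for whatever generator a real implementation uses.

```java
import java.util.Random;

// Hypothetical helper; not part of the thesis code.
class SkipListLevels {
    // Returns a height in [1, maxHeight]; a node of height h occupies layers 0..h-1.
    static int randomHeight(Random rnd, int p, int maxHeight) {
        int height = 1;                                   // every node has layer 0
        while (height < maxHeight && rnd.nextInt(p) == 0) {
            height++;                                     // next layer with probability 1/p
        }
        return height;
    }
}
```

With p = 2 roughly half of the nodes get a second layer, a quarter get a third one, and so on, which is what keeps the expected maximal height logarithmic in the number of nodes.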

Denote the node that follows node $n$ at level $l$ as $next_l(n)$, and the key of $n$ as $key(n)$. The simple sequential skip list works in the following way:

• Initially, the empty list contains head and tail with keys −∞ and +∞ respectively. The head node is connected to tail at every possible level, and the actual start level is 0.

• A list traversal with key $k$ starts from node $n = head$ at level $l$ equal to the start level and proceeds along this level, searching for the pair of nodes (pred, succ) such that $next_l(pred) = succ$ and $key(pred) < k \le key(succ)$. Then set $l = l - 1$ and $n = pred$ and repeat the search. The process continues until level 0 has been searched. Figure 1.2 illustrates the pred nodes observed during a traversal with key 12.

• contains(k) simply calls the traversal with key k. It is unnecessary to proceed to the bottom: once the desired key is found, the traversal is interrupted and the found node is returned.

• add(k) starts by generating a random height h, as described above. After that, the traversal algorithm is performed, collecting the h bottom pred and succ nodes. If the key is not found (for a pure set implementation), a new node of height h with key k is linked to the collected nodes.

• remove(k) starts with a traversal run. Once a node succ with $key(succ) = k$ is observed at its highest level h, all the pred nodes traversed from that level down are collected. After reaching the bottom level, every collected reference $next_i(pred)$ is set to $next_i(succ)$, and the memory of the succ node is freed.

• After every update operation, the start level is verified and updated if needed. There are two cases: when adding a node with height h > start level, the start level is set to h; and when removing a node of start level height, the highest level h such that $next_h(head) \ne tail$ is found and the start level is set to h.
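The sequential scheme listed above can be sketched as follows. This is a simplified single-threaded illustration, not the thesis code: the class SeqSkipList, the array-based node layout, the fixed MAX_HEIGHT, the choice p = 2 and the use of java.util.Random are all assumptions made for brevity. Keys must lie strictly between Integer.MIN_VALUE and Integer.MAX_VALUE, which play the roles of −∞ and +∞, and the start level is stored as a level index (height minus one).

```java
import java.util.Random;

// Hypothetical single-threaded sketch; names and layout are not from the thesis.
class SeqSkipList {
    static final int MAX_HEIGHT = 16;

    static final class Node {
        final int key;
        final Node[] next;                     // next[l] is the successor at level l
        Node(int key, int height) { this.key = key; this.next = new Node[height]; }
    }

    private final Node head = new Node(Integer.MIN_VALUE, MAX_HEIGHT);  // stands for -infinity
    private final Node tail = new Node(Integer.MAX_VALUE, MAX_HEIGHT);  // stands for +infinity
    private final Random rnd = new Random(1);
    private int startLevel = 0;                // highest level that may contain real nodes

    SeqSkipList() {
        for (int l = 0; l < MAX_HEIGHT; l++) head.next[l] = tail;
    }

    // Collects the predecessor of `key` at every level and returns the level-0 successor.
    private Node find(int key, Node[] preds) {
        for (int l = MAX_HEIGHT - 1; l > startLevel; l--) preds[l] = head;
        Node pred = head;
        for (int l = startLevel; l >= 0; l--) {
            Node curr = pred.next[l];
            while (curr.key < key) { pred = curr; curr = pred.next[l]; }
            preds[l] = pred;                   // key(pred) < key <= key(curr)
        }
        return pred.next[0];
    }

    boolean contains(int key) {
        Node pred = head;
        for (int l = startLevel; l >= 0; l--) {
            Node curr = pred.next[l];
            while (curr.key < key) { pred = curr; curr = pred.next[l]; }
            if (curr.key == key) return true;  // found early, no need to reach the bottom
        }
        return false;
    }

    boolean add(int key) {
        int height = 1;
        while (height < MAX_HEIGHT && rnd.nextBoolean()) height++;   // p = 2
        Node[] preds = new Node[MAX_HEIGHT];
        if (find(key, preds).key == key) return false;               // already present
        Node node = new Node(key, height);
        for (int l = 0; l < height; l++) {     // link the new node bottom-up
            node.next[l] = preds[l].next[l];
            preds[l].next[l] = node;
        }
        if (height - 1 > startLevel) startLevel = height - 1;
        return true;
    }

    boolean remove(int key) {
        Node[] preds = new Node[MAX_HEIGHT];
        Node victim = find(key, preds);
        if (victim.key != key) return false;
        for (int l = 0; l < victim.next.length; l++) preds[l].next[l] = victim.next[l];
        while (startLevel > 0 && head.next[startLevel] == tail) startLevel--;
        return true;
    }
}
```

For example, after add(4), add(8) and remove(4), contains(8) is true and contains(4) is false, regardless of the heights that were drawn for the two nodes.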

Note that the traversal algorithm performs O(1) expected steps at each level and that the number of levels is expected to be logarithmic in the number of nodes; therefore, the skip list has expected logarithmic access time. The above scheme, short of some small variations, is used in most lock-based concurrent skip lists, and our implementations use it as well. The implementations ([14], [8], [11]) differ in their locking schemes and in the state flags devised to preserve consistency, linearizability [10] and the skip list invariants. Lock-free skip lists, in contrast, cannot maintain the skip list invariants: this would require atomic read-and-update operations on multiple locations, which are unsupported on most existing platforms. The lock-free implementations ([5], [9]) use relaxed skip list algorithms, where the question of node existence is answered only at the bottom list level, and the other levels are regarded as a sort of index that allows reaching the bottom level in expected logarithmic time. The skip list structure can therefore be violated at particular moments of the execution.

2. THE FLAT COMBINED SKIP LISTS

All our FC skip list variants are implemented both in Java and in C++ with minimal differences. The C++ implementations require memory management and explicit memory barriers, while in the Java implementations the memory barriers are introduced implicitly through store/load operations on volatile flags. We have chosen to present only the Java implementations in order to avoid memory management issues and to have a clear and standard competitor: all performance comparisons use the Java SDK lock-free ConcurrentSkipListSet [18]. The flat combined skip list implements the simplest set-of-integers interface:

Listing 2.1: Set of Integers Interface

```java
public interface Simple_IntSet {
    /**
     * Add item to the set
     * @param key - key to add;
     * @return true if added,
     * false if the key already exists in the set
     */
    boolean add(int key);
    /**
     * Removes item from the set
     * @param key - key to remove;
     * @return true if removed,
     * false if the key does not exist in the set
     */
    boolean remove(int key);
    /**
     * Verify if the item is in the set
     * @param key
     * @return true if item exists, false otherwise
     */
    boolean contains(int key);
}
```

The add and remove methods use the flat combining paradigm, while the contains method is implemented wait-free. The coexistence of flat combining and wait-free methods requires special treatment of linearization points, since the flat combining data is invisible to the lock-free contains.

Define FCData and FCRequest:


Listing 2.2: Flat combining definitions

```java
import java.util.concurrent.atomic.AtomicInteger;

class FCRequest {
    int key;                     // Key
    boolean response;            // Operation result
    volatile int opcode = NONE;  // Action
}

class FCData {
    public FCRequest[] requests; // Submitted requests
    public AtomicInteger lock;   // FC node lock
}
```

The FCData may be attached to one or several skip list nodes. The skip list node class is:

Listing 2.3: Node definition

```java
class Link {
    ...
    public Link next;
    public Node node;
    public Link up;
    public Link down;
}
class Node {
    ...
    public int numLevels() { // Node height
        return links.length;
    }
    // Node is FC when it has FC data
    public boolean isFCNode() {
        return fc_data != null;
    }
    public Link at(int index) { // Get link at level
        return links[index];
    }
    public Link bottom() { // The bottom link
        return links[0];
    }
    public Link top() { // The top link
        return links[links.length - 1];
    }
    public final int key;
    public volatile boolean deleted = false;
    public volatile boolean fully_connected = false;
    public FCData fc_data;
    // 2D list of links with random access
    // Link contains reference to next, up and down links
    private Link[] links;
}
```


So far, the skip list is a regular single-threaded one, save for two details: the deleted and fully_connected flags and the FCData reference (which is not null for flat combining nodes). The contains method is also very similar to the single-threaded implementation:

Listing 2.4: Wait free contains is the same for all skip lists

```java
public boolean contains(int inKey) {
    int level = start_level; // Adaptable start level
    Link pred = head.at(level);
    Link curr = null;

    for (; level >= 0; --level, pred = pred.down) {
        curr = pred.next;
        while (inKey > curr.node.key) {
            pred = curr;
            curr = pred.next;
        }

        if (inKey == curr.node.key)
            return (!curr.node.deleted &&
                    curr.node.fully_connected);
    }
    return false;
}
```

The only distinguishing detail is the check of the deleted and fully_connected flags. The difference comes with the add and remove implementations. We present implementations for several flat combined list variants.

2.1 Naive Flat Combined Skip List

The first and simplest implementation is the Naive FC list. It has exactly one combiner node (the head). A thread performing an add or remove action:

1. Puts its FCRequest into the head node's FCData.

2. Tries to acquire the lock.

3. If it succeeds, scans and fulfills the requests.

4. Otherwise, the thread spins on its own request completion flag and checks the lock state. If the request has been fulfilled, the thread returns with the result; otherwise, if the lock is unlocked, it continues from step 2.

Listing 2.5 presents the add method implementation.


Listing 2.5: add Naive implementation

```java
public boolean add(int key) {
    // Put my request to node's fc_data
    FCRequest my_request =
        head.fc_data.requests[ThreadId.getThreaId()];
    my_request.key = key;
    // Volatile write, from here combiner sees it
    my_request.opcode = ADD;
    AtomicInteger lock = head.fc_data.lock;
    do {
        if (0 == lock.get() && // TTAS lock
            lock.compareAndSet(0, 0xFF)) {
            // Perform all found requests
            scanAndCombine(head);
            lock.set(0); // Unlock
            return my_request.response;
        } else {
            do {
                Thread.yield(); // Give up processor
                // Somebody did my work
                if (my_request.opcode == NONE)
                    return my_request.response;
            } while (0 != lock.get());
        }
    } while (true);
}
```

The remove method differs from the above one only by the REMOVE opcode. All the work is performed within the scanAndCombine method, which is the same for all the following implementations:

Listing 2.6: scanAndCombine common implementation

```java
protected void scanAndCombine(Node fc_node) {
    for (FCRequest curr_req : fc_node.fc_data.requests) {
        switch (curr_req.opcode) {
        case ADD:
            curr_req.response = doAdd(fc_node, curr_req.key,
                curr_req.pred_ary, curr_req.succ_ary);
            curr_req.opcode = NONE; // Release waiting thread
            break;
        case REMOVE:
            curr_req.response = doRemove(fc_node, curr_req.key,
                curr_req.pred_ary, curr_req.succ_ary);
            curr_req.opcode = NONE; // Release waiting thread
            break;
        }
    }
}
```


Here, the combiner thread scans all requests and performs the modifications. Both doAdd and doRemove receive containers for the predecessor and successor nodes, a technical detail that allows reusing memory in the Naive list but is used in a different way in the other implementations. Besides this, the fc_node parameter indicates the start node for the search. It is not relevant for the single-combiner list but is important for the multi-combiner one described below. The doAdd/doRemove methods act exactly as in a single-threaded skip list:

Listing 2.7: Physical add and remove Naive implementation

```java
 1  private boolean doAdd(Node fc_node, int key,
 2          RandomAccessList<Link> pred_ary,
 3          RandomAccessList<Link> succ_ary) {
 4      // New node height has to be known in advance
 5      // in order to restrict nodes' collection.
 6      int top_level = randomLevel();
 7      // Find placement and nodes to connect.
 8      Node found_node = find(fc_node, key, pred_ary,
 9              succ_ary, top_level, true);
10      if (found_node == null) { // Node not on map
11          Node new_node = new Node(key,
12                  top_level, false);
13          Link new_link = new_node.bottom();
14          RandomAccessList<Link>.BiDirIterator pred_iter =
15                  pred_ary.begin();
16          RandomAccessList<Link>.BiDirIterator succ_iter =
17                  succ_ary.begin();
18          // Connect new node
19          for (int level = 0; level < top_level; ++level,
20                  new_link = new_link.up) {
21              new_link.next = succ_iter.data;
22              pred_iter.data.next = new_link;
23              pred_iter = pred_iter.next();
24              succ_iter = succ_iter.next();
25          }
26          // Linearization point
27          new_node.fully_connected = true;
28          return true;
29      }
30      return false;
31  }
32  private boolean doRemove(Node fc_node, int key,
33          RandomAccessList<Link> pred_ary,
34          RandomAccessList<Link> succ_ary) {
35      // Find node to delete and its predecessors.
36      Node found_node = find(fc_node, key, pred_ary,
37              succ_ary, fc_node.numLevels(), false);
38
39      if (found_node != null) {
40          int top_level = found_node.numLevels();
41          // Get link on top level
42          Link lnk = found_node.top();
43          // Topmost predecessor
44          RandomAccessList<Link>.BiDirIterator pred_iter =
45                  pred_ary.rbegin();
46          found_node.deleted = true; // Logical delete
47          for (int level = 0; level < top_level; ++level,
48                  lnk = lnk.down, pred_iter = pred_iter.prev()) {
49              // Physical delete
50              pred_iter.data.next = lnk.next;
51          }
52          return true;
53      }
54      return false;
55  }
```

In this implementation we use the fast random number generator described in [12]; a similar one is adopted in the JDK's lock-free list.

Consider the properties of the above skip list implementation.

Property 2.1.1. Naive skip list is deadlock free.

Proof. The implementation uses only one lock. Therefore, a deadlock-free implementation of the lock implies deadlock freedom of the data structure.

Property 2.1.2. Naive skip list update operations do not overlap each other and have a strict total order.

Proof. Consider two arbitrary update operations on the list. All modifications are performed by the combiner thread during a combining session (Listing 2.6). The combining sessions are strictly ordered by the single lock and do not overlap, so if the operations belong to different sessions, the order is defined by the lock acquisition order. Otherwise, if the updates belong to the same session, the order is defined by the combining algorithm: the combiner performs updates sequentially, and no two modifications overlap.

Proposition 2.1.1. Naive skip list is linearizable.

Proof. Select linearization points for skip list updates:

• For add: line 27 (Listing 2.7), where the fully_connected flag is set to true.

• For remove: line 46 (Listing 2.7), where the deleted flag is set to true.

Use the linearizability of OptimisticSkipList proved in [8]. Note that, by Property 2.1.2, all updates performed on our skip list may be regarded as performed by a single dedicated thread. Therefore, since the initial preconditions are identical for both OptimisticSkipList and the Naive list, and modifications of the next references and of the deleted and fully_connected flags appear in program order exactly as in OptimisticSkipList, the Naive skip list state may be considered exactly equal to that of an OptimisticSkipList where all modifications of the list are performed by a single thread. Then, for each possible concurrent run on the Naive skip list, there is a run on OptimisticSkipList where both skip lists' states, defined by the next references and flags, are identical at every point in time, and so the OptimisticSkipList linearization order is applicable to the Naive skip list.

Fig. 2.1: Multi-combiner skip list. Every node with height ≥ 3 is a combiner node.

As expected, flat combining in this implementation exposes a sequential bottleneck, very comparable to that of a global lock. This estimation is verified in Chapter 3 (Performance).

2.2 Flat Combined Skip List with Multiple Combiners

The second attempt is the introduction of several combiners, which allows several modifications to be made simultaneously and, therefore, improves scalability. The multi-combiner skip list is implemented with statically distributed immutable combiners. The idea is to divide the skip list into non-intersecting parts such that every part is managed by some combiner node. The multi-combiner skip list is shown in Figure 2.1. Suppose that we start from an initially filled skip list of size N and have to add c < N combiners. We choose some height $h_c$ such that the number of nodes with height $h \ge h_c$ is at least c, and make them combiner nodes by adding FCData to each one. In this work, only static multi-combiner skip lists are studied. Dynamic lists may be devised by altering the value of $h_c$; the process requires consecutive locking of all FC node layers, converting the needed layer to combiners or non-combiners, and re-scheduling all pending combining requests. Since, by its essence, flat combining has to use a very small number of combiners (otherwise it does not differ from a sort of fine-grained synchronization), the process is rare and not expensive.
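The choice of the combiner height can be sketched as a small counting routine. The text only requires that at least c nodes have height at least $h_c$; picking the largest such height, which keeps the number of combiners as small as possible, is an assumption of this sketch, and the names CombinerHeight and choose are hypothetical.

```java
// Hypothetical helper; not part of the thesis code.
class CombinerHeight {
    // Returns the largest h such that at least c nodes have height >= h.
    // Assumes 1 <= c <= heights.length and that every height is >= 1.
    static int choose(int[] heights, int c) {
        int max = 0;
        for (int h : heights) max = Math.max(max, h);
        int[] countAtHeight = new int[max + 1];
        for (int h : heights) countAtHeight[h]++;
        int atLeast = 0;                       // number of nodes with height >= h
        for (int h = max; h >= 1; h--) {
            atLeast += countAtHeight[h];
            if (atLeast >= c) return h;
        }
        throw new IllegalArgumentException("c exceeds the number of nodes");
    }
}
```

For the node heights {1, 1, 2, 1, 3, 1, 2, 4}, asking for two combiners gives height 3 (the nodes of heights 3 and 4), while asking for three gives height 2, which actually produces four combiner nodes.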

The multi-combiner skip list acts very similarly to the single-combiner one. As mentioned earlier, the contains method is exactly the same, while the only difference in add/remove is that the requests are placed into the appropriate combiner nodes instead of the head. The updating thread:

1. Finds the combiner node fc_node responsible for the modification area.

2. Puts its FCRequest into fc_node's FCData.

3. Tries to acquire the FCData lock.

4. If it succeeds, scans and fulfills the requests.

5. Otherwise, spins on its own request completion flag and checks the lock state. If the request has been fulfilled, returns with the result; otherwise, if the lock is unlocked, continues from step 3.

Listing 2.8: Multi-combiner remove implementation

```java
public boolean remove(int key) {
    // Get responsible combiner
    Node fc_node = findCombiner(key);
    // Put my request to node's fc_data
    FCRequest my_request =
        fc_node.fc_data.requests[ThreadId.getThreaId()];
    my_request.key = key;
    // Volatile write, from here combiner sees it
    my_request.opcode = REMOVE;
    AtomicInteger lock = fc_node.fc_data.lock;
    do {
        // TTAS lock
        if (0 == lock.get() &&
            lock.compareAndSet(0, 0xFF)) {
            // Perform all found requests
            scanAndCombine(fc_node);
            // Unlock
            lock.set(0);
            return my_request.response;
        } else {
            do {
                Thread.yield();
                // Somebody did my work
                if (my_request.opcode == NONE)
                    return my_request.response;
            } while (0 != lock.get());
        }
    } while (true);
}
```

The method findCombiner is wait-free and is implemented similarly to contains. It has three differences:

1. The search goes down to the lowest combiner level and does not proceed to the bottom.

2. The search returns the lowest combiner predecessor of the key.

3. Since combiners are immutable, there is no need to check their deleted flag.
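The routing rule of findCombiner can be illustrated without the skip list itself. The sketch below is a stand-in, not the thesis implementation: it keeps the immutable combiner keys in a sorted array with the head sentinel at index 0 and uses a binary search in place of the traversal of the upper skip list levels. The class CombinerIndex is hypothetical, and the strict inequality (a combiner owns the keys greater than its own key) is an assumption that follows the traversal rule key(pred) < k of Section 1.2.

```java
import java.util.Arrays;

// Hypothetical stand-in for the upper skip list levels; not part of the thesis code.
class CombinerIndex {
    private final int[] combinerKeys;   // sorted, combinerKeys[0] is the head sentinel

    // `keys` are the distinct keys of the combiner nodes, all above Integer.MIN_VALUE.
    CombinerIndex(int[] keys) {
        combinerKeys = new int[keys.length + 1];
        combinerKeys[0] = Integer.MIN_VALUE;
        System.arraycopy(keys, 0, combinerKeys, 1, keys.length);
        Arrays.sort(combinerKeys);
    }

    // Index of the combiner with the greatest key strictly below `key`.
    // Assumes key > Integer.MIN_VALUE, so that the head always qualifies.
    int findCombiner(int key) {
        int pos = Arrays.binarySearch(combinerKeys, key);
        int smallerKeys = pos >= 0 ? pos : -pos - 1;   // number of combiner keys < key
        return smallerKeys - 1;
    }
}
```

With combiner keys 10, 20 and 30, the keys 5 and 10 are routed to the head (index 0), 11 to the combiner with key 10 (index 1), and 100 to the combiner with key 30 (index 3). Because the combiner set never changes, the lookup needs no synchronization.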

The multi-combiner skip list properties are similar to Naive list ones.


Property 2.2.1. Multi-combiner skip list is deadlock free.

Proof. As follows from the algorithm, no thread tries to hold more than one lock simultaneously. Hence, deadlock is impossible.

In practice, the multi-combiner design divides the data structure into a disjoint set of single-combiner Naive lists. Call these lists combining clusters, and call the combiner responsible for a cluster the cluster head. Then the properties of Naive FC lists are applicable to every combining cluster. Instead of a strict total order, all update operations of the multi-combiner list form a strict partial order, where operations on different clusters are commutative: they can be reordered without affecting the final state of the data structure.

Proposition 2.2.1. Multi-combiner skip list is linearizable.

Proof. Follows from the linearizability of each cluster and the fact that linearizability is compositional (Theorem 1 from [10]).

The multi-combiner skip list scales much better than the single-combiner one but still performs a lot of work sequentially. The next attempt is to reduce this part of the execution with a hints mechanism.

2.3 Flat Combined Skip List with ”Hints”

The hints mechanism is inspired by the optimistic skip list [8]. The idea is to collect, in a wait-free "optimistic" manner, the links that have to be updated, then to acquire the lock, verify (and re-find, if needed) the links, and perform the update. Listing 2.9 shows the FCRequest structure supplemented with "hints", together with the add method.

Listing 2.9: Optimistic (hinted) FCrequest and add implementation

```java
class FCRequest {
    int key;                          // Key
    boolean response;                 // Operation result
    volatile int opcode = NONE;       // Action
    int top_level;                    // Hints size
    RandomAccessList<Link> pred_ary;  // Collected hints
    RandomAccessList<Link> succ_ary;  // Collected hints
}

public boolean add(int key) {
    // Get responsible combiner
    Node fc_node = findCombiner(key);
    FCRequest my_request =
        fc_node.fc_data.requests[ThreadId.getThreaId()];
    // We have to know level prior to find in order
    // to restrict hints size
    int top_level = randomLevel();
    Node found_node;
    do {
        // Find placement and fill hints data
        found_node = find(fc_node, key, my_request.pred_ary,
            my_request.succ_ary, top_level, true, true);
    } while (found_node != null && found_node.deleted);
    // Node already exists
    if (found_node != null)
        return false;
    // Put my request to node's fc_data
    my_request.top_level = top_level;
    my_request.key = key;
    // Volatile write, from here combiner sees it
    my_request.opcode = ADD;
    AtomicInteger lock = fc_node.fc_data.lock;
    do {
        // TTAS lock
        if (0 == lock.get() &&
            lock.compareAndSet(0, 0xFF)) {
            // Perform all found requests
            scanAndCombine(fc_node);
            // Unlock
            lock.set(0);
            return my_request.response;
        } else {
            do {
                Thread.yield();
                // Somebody did my work
                if (my_request.opcode == NONE)
                    return my_request.response;
            } while (0 != lock.get());
        }
    } while (true);
}
```

The internal doAdd and doDelete methods (Listing 2.10) are also slightly modified, since we have to verify and, if needed, re-fill the collections of predecessors and successors. The verify method checks that all collected nodes are correct, i.e. that they are non-deleted and fully connected, that each predecessor's next reference points to the appropriate successor, and that the keys of the collected nodes suit the requested key.

Listing 2.10: Optimistic (hinted) doAdd and verify implementation

private boolean doAdd(Node fc_node, int key, int top_level,
                      RandomAccessList<Link> pred_ary,
                      RandomAccessList<Link> succ_ary) {
    Node found_node = null;
    // Verify data and re-fill if needed
    if (!verify(key, pred_ary, succ_ary, top_level)) {
        found_node = find(fc_node, key, pred_ary,
                          succ_ary, top_level, true, false);
    }
    // From here, as in the Naive list
    ...
}

protected boolean verify(int key,
                         RandomAccessList<Link> predAry,
                         RandomAccessList<Link> succAry,
                         int top_level)
{
    RandomAccessList<Link>.BiDirIterator predIter
        = predAry.begin();
    RandomAccessList<Link>.BiDirIterator succIter
        = succAry.begin();
    // Check every collected (predecessor, successor) pair, level by level
    for (int iLevel = 0; iLevel < top_level; ++iLevel,
         predIter = predIter.next(), succIter = succIter.next()) {
        Link pred = predIter.data;
        Link next = succIter.data;
        if (pred.node.deleted || next.node.deleted ||
            !pred.node.fully_connected ||
            !next.node.fully_connected ||
            pred.next != next ||
            pred.node.key >= key || next.node.key < key)
            return false;
    }
    return true;
}

Like its predecessors, the hinted skip list is deadlock-free and linearizable. Deadlock freedom is obvious, since this implementation uses exactly the same locking scheme as the previous ones. Linearizability can be derived as follows. If verify fails, the hinted skip list algorithm is identical to the naive one. Otherwise, the success of verify guarantees that the state of all memory that has to be updated is identical to its state when the data was collected; therefore all preconditions mentioned in the linearizability proof for OptimisticSkipList hold, and the proof is applicable to the hinted skip list as well.

The hints mechanism is applied to both single- and multi-combiner lists. As shown in Chapter 3 (Performance), the optimistic approach is very efficient, especially when the update rate is not high.

3. PERFORMANCE

For the performance measurements, we use the skip lists described above and several additional data structures designed to verify the impact of flat combining. The JDK ConcurrentSkipListSet by Doug Lea is used as the main competitor; at present it is one of the most efficient and scalable skip list implementations. Computations were performed on a Sun SPARC Enterprise T5140 server powered by two UltraSPARC T2 Plus processors. Each processor contains eight cores, each running eight hardware threads, which gives 128 hardware threads per system.

The benchmarked algorithms are denoted as follows:

FC-Naive-0 - "Naive" FC list with 0 non-head combiners.

FC-Hints-64 - "Hinted" FC list with at least 64 non-head combiners; the combiner distribution algorithm was described in Section 2.2.

JDK - JDK ConcurrentSkipListSet (based on ConcurrentSkipListMap).

ML-0, ML-64 - "Multi-lock" skip lists with 0 and 64 non-head locks, respectively. This data structure is designed to isolate the effect of combining from the effect of combiner distribution. In essence, it is a multi-combiner skip list in which the FCData structures are replaced by simple locks. The updating thread locks the appropriate "locking" node, makes the update, and releases the lock, instead of running the whole combining algorithm.

ML-hints-0, ML-hints-64 - "Multi-lock" optimistic skip lists with 0 and 64 non-head locks, respectively, using the hints mechanism exactly as the flat combining lists do.

FC-Ideal-64 - An artificial FC list derived from the FC list with hints. Here we assume that hints are always successful, so the combiner's only work is to update the next references. This data structure indicates the maximal FC skip list performance attainable when the combiner fulfills all its requests sequentially.

Experiments were performed on data structures with an initial size of about 20000 keys. Before selecting this size, the base skip list implementations were roughly benchmarked for a wide range of sizes, from one hundred to a few million keys. The relations between the run times of the different skip list implementations were very similar across sizes, and therefore any initial size was representative enough to show the qualitative differences between the algorithms. The access locality factor was introduced to simulate different workloads. Suppose that the experiment is performed for the key space S = {1, 2, ..., N}. The access locality factor k, 1 ≤ k ≤ N, is defined in the following way: the keys in the benchmark are uniformly selected from S_k = {t, t + 1, ..., t + N/k}, where t is selected uniformly from S at the start of the run and is changed slowly during the execution. An access locality factor of 1 therefore corresponds to keys uniformly distributed over S. Increasing the factor means that the keys are selected from a smaller interval, and so the contention increases.
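The key selection described above can be sketched as a small generator. This is only an illustration under stated assumptions, not the benchmark code used for the measurements: the class name LocalityKeyGenerator, the driftPeriod parameter (how many draws pass before t moves by one), and the wrap-around of the window past N are all hypothetical, since the text only says that t "is changed slowly".

```java
import java.util.Random;

// Hypothetical sketch of access-locality key selection; not the thesis benchmark code.
final class LocalityKeyGenerator {
    private final int n;            // key space is S = {1, ..., n}
    private final int windowSize;   // number of distinct keys in S_k
    private final int driftPeriod;  // assumption: t moves by one every driftPeriod draws
    private final Random rnd;
    private int t;                  // current window start, 1 <= t <= n
    private long calls = 0;

    LocalityKeyGenerator(int n, int k, int driftPeriod, long seed) {
        if (n < 1 || k < 1 || k > n || driftPeriod < 1) {
            throw new IllegalArgumentException("need 1 <= k <= n and driftPeriod >= 1");
        }
        this.n = n;
        // S_k = {t, ..., t + n/k} has n/k + 1 elements; it is capped at n so that
        // k = 1 degenerates to a uniform choice over the whole key space.
        this.windowSize = Math.min(n / k + 1, n);
        this.driftPeriod = driftPeriod;
        this.rnd = new Random(seed);
        this.t = 1 + rnd.nextInt(n);   // t is selected uniformly from S
    }

    int nextKey() {
        if (++calls % driftPeriod == 0) {
            t = t % n + 1;             // slow drift, wrapping from n back to 1
        }
        int offset = rnd.nextInt(windowSize);
        return (t - 1 + offset) % n + 1;   // assumption: the window wraps around past n
    }
}
```

With n = 1000 and k = 100, for example, every draw falls into a window of 11 consecutive keys, so concurrent threads collide on the same few nodes far more often than with k = 1.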

3.1 Performance Comparison of Flat Combined Skip Lists vs JDK ConcurrentSkipListSet

The first group of benchmarks compares the throughput of the flat combining skip list implementations with that of the JDK ConcurrentSkipListSet.

Figure 3.1 presents the benchmark results for "Naive" flat combining using uniformly distributed values. The graphs show that the single-combiner implementation fails to compete with the JDK list even for read-dominated loads, while the implementation with 64 combiners shows scalability even for write-only loads. The picture changes dramatically when workload locality increases. Figure 3.2 depicts the same data structures when all requests are selected from 1/128 of the total key space. In this case, the naive FC skip list loses to the JDK one even for read-dominated workloads once the number of running threads is large enough, and multiple combiners do not help.

The next group of runs deals with the improved optimistic skip list, which uses the "hints" mechanism described above in Chapter 2 (The Flat Combined Skip Lists). Figure 3.3 shows the benchmark results for uniformly distributed requests, while Figure 3.4 depicts the runs with high access locality. The graphs show a significant performance gain due to the optimistic approach. For read-dominated workloads, both single- and multi-combiner lists perform better than the JDK list for all workload localities. For higher update rates, the multi-combiner list competes well with the JDK data structure, while the single-combiner list shows a lack of scalability, especially for high access locality.

So far, we can conclude that at least the "hinted" variant of the combining skip list is a simple and effective alternative to the JDK implementation. It is clear enough that for read-dominated workloads the lock-free list performs worse than lists with lock-protected updates and a lock-free contains. The first reason for the more effective read is that the FC lists' contains (Listing 2.4) performs only two volatile reads, while lock-free implementations require all next references to be volatile and therefore need about log N volatile reads. The second reason is that all known lock-free skip list implementations decide on node presence only after reaching the bottom level of the skip list, whereas our implementation stops as soon as a node with the desired key is found on any level. However, it is not yet clear what impact the combining mechanism itself has on the presented results.
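The second reason, stopping as soon as the key is seen on any level, can be illustrated with a minimal sequential search. This is a sketch only and not the contains of Listing 2.4: the Node layout and the findLevel helper are hypothetical, and deletion marks, volatile fields and concurrency are left out entirely.

```java
// Hypothetical sequential sketch of an early-exit skip list search.
final class EarlyExitSkipSearch {
    static final class Node {
        final int key;
        final Node[] next;   // next[i] is the successor on level i
        Node(int key, int height) {
            this.key = key;
            this.next = new Node[height];
        }
    }

    private EarlyExitSkipSearch() {}

    /** Returns the level on which the key was first met, or -1 if it is absent. */
    static int findLevel(Node head, int key) {
        Node pred = head;
        for (int level = head.next.length - 1; level >= 0; level--) {
            Node curr = pred.next[level];
            while (curr != null && curr.key < key) {
                pred = curr;
                curr = pred.next[level];
            }
            if (curr != null && curr.key == key) {
                return level;   // found: no need to descend to the bottom level
            }
        }
        return -1;
    }
}
```

A search that insists on reaching level 0 would perform the remaining descents even after a tall node with the requested key has already been met.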


Fig. 3.1: Naive FC skip list implementation vs JDK lock-free ConcurrentSkipListSet, uniform keys distribution


Fig. 3.2: Naive FC skip list implementation vs JDK lock-free ConcurrentSkipListSet, high access locality


Fig. 3.3: Hints FC skip list implementation vs JDK lock-free ConcurrentSkipListSet, uniform keys distribution


Fig. 3.4: Hints FC skip list implementation vs JDK lock-free ConcurrentSkipListSet, high access locality


3.2 Flat Combining Mechanism Experimental Verifications.

In this section we experimentally verify in depth the impact of FC on skip list behavior.

The first experiments compare the flat combining implementations with a specially designed multi-lock skip list. The multi-lock skip list is derived from the flat combining one by replacing FCData with a simple lock. It has single- and multi-lock implementations, exactly as the FC skip list has, and may be extended with the "hints" mechanism as well. The add method of the multi-lock skip list with hints is shown in Listing 3.1. The doAdd method called at row 26 corresponds to the flat combining one presented in Listing 2.10.

Listing 3.1: Optimistic (hinted) multi-lock add method implementation

 1  public boolean add(int key) {
 2      // Get responsible lock node
 3      Node lock_node = findLockNode(key);
 4
 5      // We have to know level prior to find in order
 6      // to restrict hints size
 7      int top_level = randomLevel();
 8      // Thread local hints lists
 9      int thread_id = ThreadId.getThreadId();
10      RandomAccessList<fc_java.MultiLockSkipListFH.Link>
11          succ_ary = this.succ_ary[thread_id];
12      RandomAccessList<fc_java.MultiLockSkipListFH.Link>
13          pred_ary = this.pred_ary[thread_id];
14
15      Node found_node;
16      do {
17          found_node = find(lock_node, key, pred_ary,
18                            succ_ary, top_level, true, true);
19      } while (found_node != null && found_node.deleted);
20      if (found_node != null)
21          return false;
22      // Acquire lock and perform modification
23      AtomicInteger lock = lock_node.node_lock;
24      do { // TTAS lock
25          if (0 == lock.get() && lock.compareAndSet(0, 0xFF)) {
26              doAdd(thread_id, lock_node, key, pred_ary,
27                    succ_ary, top_level);
28              // Release lock
29              lock.set(0);
30
31              return true;
32          } else // Give up processor
33              Thread.yield();
34      } while (true);
35  }

Instead of placing the request and running the flat combining algorithm, the updating thread finds the appropriate lock node, acquires the lock, and performs the change.
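The acquire, update, release cycle used by the multi-lock list can be isolated into a tiny test-and-test-and-set (TTAS) lock. The following is a minimal sketch assuming only java.util.concurrent.atomic.AtomicInteger; the class name TtasLock and its methods are hypothetical and do not appear in the benchmarked code, which inlines the same loop into add.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the TTAS locking loop used inline in the listings.
final class TtasLock {
    private final AtomicInteger state = new AtomicInteger(0);   // 0 = free, 1 = held

    void lock() {
        while (true) {
            // Test first with a plain volatile read, and attempt the CAS
            // only when the lock looks free.
            if (state.get() == 0 && state.compareAndSet(0, 1)) {
                return;
            }
            Thread.yield();   // give up the processor, as the listings do
        }
    }

    void unlock() {
        state.set(0);   // volatile write publishes the protected updates
    }
}
```

A caller wraps the update in lock() and unlock(), preferably with try/finally so that the lock is released even if the update throws.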

The following graphs compare the multi-lock and FC Naive skip lists. We can see that for both low (Figure 3.5) and high (Figure 3.6) localities, and for any update rate, both lists behave very similarly. The multi-lock skip list even tends to perform slightly better than its FC counterpart for low access locality. This may be explained by the additional overhead that flat combining introduces: the combiner thread has to read and maintain the FC registry and to write back the operation results. All this, if not compensated by the FC gains described above, leads to a performance decrease.

The benchmarks of the Hints versions of the multi-lock and FC skip lists are shown in Figures 3.7 and 3.8 for low and high access locality, respectively. Introducing the hints mechanism improves the performance of both lists but does not change the ratio between the algorithms: both behave very similarly, with a slight advantage for the multi-lock skip list at low access locality.

As mentioned before, flat combining, besides relieving the contention bottleneck, allows using the knowledge about all pending requests to optimize data structure updates. For tree-like data structures, and for skip lists in particular, the elimination and combining techniques can be applied to optimize the data structure traversal, but it is very hard to use them to optimize the data structure update. For the next group of experiments, we assumed that the traversal is perfectly optimized, i.e. that our hints mechanism never fails. In practice, we replaced the verify method in Listing 2.10 with one that always returns true, and supplied every node with additional dummy next references. The combiner, instead of writing to the real next references, updates an equal number of dummy ones. These benchmarks are presented in Figures 3.9 and 3.10. They show that the FC skip list with an ideal hints mechanism competes well with the lock-free one and fails only for high access locality combined with an update rate above 50%; hence verifying and improving the hints mechanism makes sense.

The next graph (Figure 3.11) shows the efficiency of our hints mechanism. As follows from the graph, the hints are very close to ideal for uniform access, and fall to about 50% failures when the number of threads grows to 64. This result explains the scalability turning point between 16 and 32 threads for high access locality and high update rate. Note that for the ideal hints list the turning point also exists, but it appears slightly later and is not as sharp. So the problematic scalability of the FC list is probably caused by flat combining itself.


Fig. 3.5: FC skip list implementation vs multi-lock one, naive implementations, uniform keys distribution


Fig. 3.6: FC skip list implementation vs multi-lock one, naive implementations, high access locality


Fig. 3.7: FC skip list implementation vs multi-lock one, hints implementations, uniform keys distribution


Fig. 3.8: FC skip list implementation vs multi-lock one, hints implementations, high access locality


Fig. 3.9: Ideal hints FC skip list implementation vs JDK lock-free ConcurrentSkipListSet, uniform keys distribution


Fig. 3.10: Ideal hints FC skip list implementation vs JDK lock-free ConcurrentSkipListSet, high access locality


Fig. 3.11: Hints mechanism success rate for pure update workloads

Fig. 3.12: The connection between FC intensity and throughput per thread for pure update workloads

The next two benchmarks, performed for a pure update workload, are intended to answer the question of why the lock-free list scales better than the FC one. To estimate the flat combining load we introduce the FC intensity, a factor showing the additional work of the combiner. It is calculated in the following way:

\[
\text{FC intensity} = \frac{\text{fulfilled requests per FC session} - 1}{\text{number of threads}}
\]

This number is 0 for single-threaded execution and tends to 1 for a large number of threads, when one combiner fulfills the requests of all other threads. Figure 3.12 shows the FC intensity together with the throughput per thread for different numbers of combiners and workload localities; an increase in FC intensity is followed by a decrease in throughput (note that ideal scalability corresponds to a horizontal line). The jump in intensity between 16 and 32 threads corresponds well with the graphs in Figures 3.3 and 3.4 for the 50% add / 50% remove workload. The jump may be explained in the following way: starting from some number of threads, the combiner has no time to complete all the requests during the period in which a released thread prepares its new request, and so the competition for the lock never stops. On the other hand, for 64 combiners and low locality the jump does not happen, and the algorithm is scalable. Figure 3.13 shows the lock-free list statistics for the pure update workload. As follows from the graphs, the CAS success rate never drops below 75% and the number of CAS operations is as small as 1.5 to 2.5 per update, which explains the good scalability of the algorithm.
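The FC intensity defined above can be computed from two counters kept by the combiners. The sketch below is illustrative only: the class FcIntensityMeter and its methods are hypothetical names, and it assumes that every FC session fulfills at least the combiner's own request and that the average is taken over all sessions of a run.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch: accumulates per-session counts and reports FC intensity.
final class FcIntensityMeter {
    private final LongAdder sessions = new LongAdder();
    private final LongAdder fulfilled = new LongAdder();
    private final int threads;

    FcIntensityMeter(int threads) {
        if (threads < 1) {
            throw new IllegalArgumentException("threads must be positive");
        }
        this.threads = threads;
    }

    /** Called by a combiner after one combining pass with the number of requests it served. */
    void recordSession(int requestsFulfilled) {
        sessions.increment();
        fulfilled.add(requestsFulfilled);
    }

    /** (average fulfilled requests per FC session - 1) / number of threads. */
    double intensity() {
        long s = sessions.sum();
        if (s == 0) {
            return 0.0;   // nothing recorded yet
        }
        double perSession = (double) fulfilled.sum() / s;
        return (perSession - 1.0) / threads;
    }
}
```

For a single thread that always serves only its own request the value is (1 - 1) / 1 = 0; for 64 threads with one combiner serving all 64 requests in each session it is 63/64, which is close to 1.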

Fig. 3.13: Lock-free skip list CAS per update, CAS success rate and throughput per thread for pure update workloads

4. CONCLUSIONS

We studied several approaches to applying the flat combining technique to skip list based maps. As the skip list example shows, for structures that allow concurrent updates, fine-grained and especially lock-free synchronization is preferable to FC. This conclusion does not completely deny the usefulness of applying FC to such structures, since for read-dominated workloads and for several update request distributions flat combining behaves better than lock-free synchronization. It is also possible that on different hardware the FC approach will show better scalability. A breakthrough can also come from improvements to the FC algorithm. It is possible, for example, to transform FC into a kind of job dispatcher: having all the requests, it can form mutually non-conflicting groups, so that the waiting threads can execute them without synchronization. Such a design faces the problem of additional FC overhead for sorting and analyzing the requests, but may be applicable to NUMA or client-server architectures.

It is also interesting to study FC implementations for other popular data structures, such as B-trees or Red-Black trees, where lock-free alternatives do not exist and fine-grained locking requires complicated read-write locks. FC's benefits of simplicity and proven linearizability may be valuable in these cases.

Another, albeit auxiliary, data structure, the multi-lock skip list, may be interesting in itself. It showed characteristics as good as the FC skip list, but it is simpler, needs less memory, and gives more uniform latency for update requests. The idea of building a small index protected by locks (locked or FC layers) on top of an entirely wait-free data structure body can replace hand-over-hand fine-grained synchronization schemes for tree-like structures.

BIBLIOGRAPHY

[1] Adelson-Velskii, G. M., and Landis, E. M. An algorithm for the organization of information. Soviet Math. Doklady, 3 (1962), 1259–1263.

[2] Bayer, R., and McCreight, E. Organization and maintenance of large ordered indices. In SIGFIDET '70: Proceedings of the 1970 ACM SIGFIDET (now SIGMOD) Workshop on Data Description, Access and Control (New York, NY, USA, 1970), ACM, pp. 107–141.

[3] Colvin, R., Groves, L., Luchangco, V., and Moir, M. Formal verification of a lazy concurrent list-based set algorithm. In CAV (2006), pp. 475–488.

[4] Doherty, S., Groves, L., Luchangco, V., and Moir, M. Formal verification of a practical lock-free queue algorithm. In FORTE (2004), Springer, pp. 97–114.

[5] Fraser, K. Practical lock freedom. PhD thesis, Cambridge University Computer Laboratory, 2003. Also available as Technical Report UCAM-CL-TR-579.

[6] Guibas, L. J., and Sedgewick, R. A dichromatic framework for balanced trees. In SFCS '78: Proceedings of the 19th Annual Symposium on Foundations of Computer Science (Washington, DC, USA, 1978), IEEE Computer Society, pp. 8–21.

[7] Hendler, D., Incze, I., Shavit, N., and Tzafrir, M. Flat combining and the synchronization-parallelism tradeoff. In SPAA (2010), pp. 355–364.

[8] Herlihy, M., Lev, Y., Luchangco, V., and Shavit, N. A simple optimistic skiplist algorithm. In SIROCCO'07: Proceedings of the 14th international conference on Structural information and communication complexity (Berlin, Heidelberg, 2007), Springer-Verlag, pp. 124–138.

[9] Herlihy, M., and Shavit, N. The art of multiprocessor programming. Morgan Kaufmann, 2008.

[10] Herlihy, M. P., and Wing, J. M. Linearizability: a correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems 12 (1990), 463–492.

[11] Lotan, I., and Shavit, N. Skiplist-based concurrent priority queues. In Proc. of the 14th International Parallel and Distributed Processing Symposium (IPDPS) (2000), pp. 263–268.


[12] Marsaglia, G. Xorshift RNGs. Journal of Statistical Software 8, 14 (July 2003), 1–6.

[13] Moir, M., and Shavit, N. Concurrent data structures. In Handbook of Data Structures and Applications, D. Mehta and S. Sahni Editors (2007), pp. 47–14 47–30. Chapman and Hall/CRC Press.

[14] Pugh, W. Concurrent maintenance of skip lists. Tech. rep., University of Maryland at College Park, College Park, MD, USA, 1990.

[15] Pugh, W. Skip lists: a probabilistic alternative to balanced trees. Commun. ACM 33 (June 1990), 668–676.

[16] Scherer, III, W. N., Lea, D., and Scott, M. L. Scalable synchronous queues. Commun. ACM 52, 5 (2009), 100–111.

[17] Stepanov, A., and Lee, M. The standard template library. Tech. rep., WG21/N0482, ISO Programming Language C++ Project, 1995.

[18] Sun Microsystems, Inc. Java Platform, Standard Edition, Version 6. 4150 Network Circle, Santa Clara, CA 95054, U.S.A., 2006.
