the gnu libstdc++ parallel mode: benefit from multi-core using …ls11- · 2008-12-08 ·...
Introduction Library Overview Algorithms SE Aspects Applications Conclusion1/36
The GNU libstdc++ parallel mode: Benefit from Multi-Core using the STL
Johannes Singler, Peter Sanders
{singler,sanders}@ira.uka.de
Institute for Theoretical Computer Science
University of Karlsruhe
Johannes Singler, Peter Sanders GNU libstdc++ parallel mode: Multi-Core using the STL
Talk Outline
Introduction
Library Overview
Algorithms
Software Engineering Aspects
Applications
Conclusion
Motivation
How to Benefit from Multi-Core Systems?
- automatic parallelization is not sufficient
- manual/explicit parallelization is needed, but it is expensive, complicated, error-prone, and not easy to try out

Our Approach
- provide a data-parallelized library of basic algorithms for shared-memory systems
- libraries are an important aspect of Algorithm Engineering
- make the usage of (data-)parallel algorithms very easy
- actual parallelism is not visible to the user, but encapsulated
Basic Approach
Starting Point
- provide the functionality of the C++ Standard Template Library
- run the algorithms in parallel
- included with GCC as of version 4.3: libstdc++ parallel mode
- formerly known as the Multi-Core Standard Template Library

Why STL?
- many useful algorithms and data structures included
- simple interface, very well-known among developers
- recompilation of existing programs may suffice
- C++ is an accepted, efficient, and standardized language
Goals
Ease of Use
- no new language, no language extension
- just a few compiler options to activate

Good Performance
- some speedup already for small inputs ⇒ scale down
- full speedup for larger inputs ⇒ scale up
- co-exist with other forms of parallelization
- respect machine load ⇒ dynamic load-balancing
Competitors
STAPL
- must incorporate distributed-memory issues
- no code publicly available
- interface only similar to STL

Intel Threading Building Blocks
- mostly on a more abstract level, a parallel programming framework
- the only combinatorial generic algorithm is a parallel sorter
Platform Support
- based on OpenMP (fork-join parallelism)
- low-level issues are left to the OpenMP runtime, e.g. thread pooling, environment, synchronization primitives
- GCC's upcoming implementation improved by us
- platform-independent, supported by all major C++ compilers
- task construct upcoming
- + atomic operations (platform-dependent)

[Diagram: layers labeled Application; STL Interface; Serial STL Algorithms; parallel mode (Parallel STL Algorithms, Extensions); OpenMP; Atomic Ops; OS Thread Scheduling; Multi-Core Hardware]
Overview of Important Algorithms

Strictly STL (mostly <algorithm>)
- for_each and friends (embarrassingly parallel)
- find
- partial_sum (prefix sum)
- partition, partial_sort
- merge
- sort
- random_shuffle
- bulk construction and bulk insert for set and map (not part of parallel mode yet)

Extension to STL
- multiway_merge
- multiseq_partition (helper)
Parallel Mode Development Status

Algorithm Class | Function Call(s) | Status | w/ LB | w/o LB
Embarrassingly Parallel | for_each, generate(_n), fill(_n), count(_if), transform, replace(_if), min_element, max_element, adjacent_difference, unique_copy | impl | yes | yes
Find | find(_if), find_first_of, adjacent_find, mismatch, equal, lexicographical_compare | impl | yes | not worthwhile
Search | search(_n) | impl | yes | not worthwhile
Numerical Algorithms | accumulate, partial_sum, inner_product | impl | planned | yes
Partition | partition, stable_partition | impl | yes | not worthwhile
Merge | merge, multiway_merge, inplace_merge | impl | tbi | yes
Partial Sort | nth_element, partial_sort | impl | yes | planned
Sort | sort, stable_sort | impl | yes | yes
Random Permutation | random_shuffle | impl | yes | not worthwhile
Dictionaries | (multi_)map/set bulk op | tbi | tbi |
Complex Set Operations | set_union, set_intersection, set_(symmetric_)difference, ... | impl | no | yes
Vector Arithmetic | valarray operations | ongoing | yes | yes
Heap Construction | make_heap, sort_heap | tbi | |
Priority Queues | amortized update operations | ongoing | |
Filtering | remove(_copy)(_if) | ongoing | |
for_each Implementation

Definition
- execute a certain function on a range of elements
- many similar functions like transform, generate
- parallelization is easy only for uniform execution time per element and an exclusive machine

Static Load-Balancing
- divide work into parts of almost equal size
- used for accumulate, since ends of chunks can easily be spliced (not commutative)

Dynamic Load-Balancing
- initially divide work into parts of almost equal size
- allow "unemployed" threads to take work from others ⇒ work-stealing
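
The static variant can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: `chunked_accumulate` and `num_parts` are hypothetical names, and the parts are processed in a plain loop where a parallel version would assign one part to each thread. Because the partial results are combined in chunk order, an associative but non-commutative operation such as string concatenation still yields the sequential result.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: accumulate `data` with an associative operation `op`
// by splitting the range into `num_parts` parts of almost equal size.
// Each part could be handled by its own thread; here they run in a loop.
template <typename T, typename Op>
T chunked_accumulate(const std::vector<T>& data, T init, Op op,
                     std::size_t num_parts) {
    assert(num_parts > 0);
    const std::size_t n = data.size();
    std::vector<T> partial;  // one result per non-empty part, in order
    for (std::size_t part = 0; part < num_parts; ++part) {
        // Part sizes differ by at most one element.
        const std::size_t begin = n * part / num_parts;
        const std::size_t end = n * (part + 1) / num_parts;
        if (begin == end) continue;
        T acc = data[begin];
        for (std::size_t i = begin + 1; i < end; ++i) acc = op(acc, data[i]);
        partial.push_back(acc);
    }
    // Splice the partial results together in chunk order, so `op` only
    // needs to be associative, not commutative.
    T result = init;
    for (const T& p : partial) result = op(result, p);
    return result;
}
```

Dynamic load-balancing would start from the same initial division but let idle threads take over part of another thread's remaining range.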
merge, multiway_merge

Problem Definitions
- merge: combine two sorted sequences into one sorted sequence
- multiway_merge: combine k > 2 sorted sequences into one sorted sequence
- important for (external memory) sorting

How to divide the input?
- find slabs, i.e. consistent sets of ranges from the sequences
- two possibilities:
  - (randomized) splitting by sampling
  - exact partitioning into slabs of equal size (using multi-sequence selection) [6]
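
Finding a consistent cut through all sequences can be sketched with one binary search per sequence. This is a minimal illustration that assumes a single splitter value has already been chosen (for example from a sample); `split_at` is a hypothetical helper, not a function of the parallel mode. The elements before the cuts and the elements after them form two slabs that can be merged independently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: for each sorted sequence, return the index of the
// first element that is not less than `splitter`.
std::vector<std::size_t> split_at(const std::vector<std::vector<int>>& seqs,
                                  int splitter) {
    std::vector<std::size_t> cuts;
    for (const std::vector<int>& s : seqs) {
        assert(std::is_sorted(s.begin(), s.end()));
        cuts.push_back(static_cast<std::size_t>(
            std::lower_bound(s.begin(), s.end(), splitter) - s.begin()));
    }
    return cuts;
}
```

Repeating the search with several splitters gives one slab per thread. Sampled splitters only give slabs of roughly equal size; slabs of exactly equal size require multi-sequence selection, which this sketch does not implement.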
merge, multiway_merge: Diagram
[Diagram: k sequences, divided among threads t0, t1, t2, t3]
Parallel Multiway Mergesort

Procedure
1. divide sequence into p parts of equal size
2. in parallel, sort the parts locally
3. use parallel p-way merging to compute the final sequence
4. copy result back to original position

[Diagram: threads t0, t1, t2, t3]
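
The four steps can be sketched sequentially as follows. This is an illustration under stated assumptions rather than the parallel mode's code: `multiway_mergesort_sketch` is a hypothetical name, the loops run on one thread, and a `std::priority_queue` stands in for the parallel p-way merge.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical sketch of the four steps, executed sequentially.
void multiway_mergesort_sketch(std::vector<int>& v, std::size_t p) {
    assert(p > 0);
    const std::size_t n = v.size();
    // 1. divide the sequence into p parts of (almost) equal size
    std::vector<std::size_t> bound(p + 1);
    for (std::size_t i = 0; i <= p; ++i) bound[i] = n * i / p;
    // 2. sort the parts locally (one thread per part in a parallel version)
    for (std::size_t i = 0; i < p; ++i)
        std::sort(v.begin() + bound[i], v.begin() + bound[i + 1]);
    // 3. p-way merge into a temporary buffer, using a min-heap of
    //    (value, part index) pairs
    typedef std::pair<int, std::size_t> Entry;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    std::vector<std::size_t> pos(bound.begin(), bound.begin() + p);
    for (std::size_t i = 0; i < p; ++i)
        if (pos[i] < bound[i + 1]) heap.push(Entry(v[pos[i]], i));
    std::vector<int> merged;
    merged.reserve(n);
    while (!heap.empty()) {
        const Entry e = heap.top();
        heap.pop();
        merged.push_back(e.first);
        const std::size_t i = e.second;
        if (++pos[i] < bound[i + 1]) heap.push(Entry(v[pos[i]], i));
    }
    // 4. copy the result back to the original position
    std::copy(merged.begin(), merged.end(), v.begin());
}
```

The temporary buffer in step 3 holds a full copy of the input, which is where the extra space of this approach comes from.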
sort, stable_sort

Parallel Multiway Mergesort
+ less communication necessary
+ stable variant easy to derive
– needs twice the space

Parallel Load-Balanced Quicksort
+ in-place
± dynamic load-balancing to compensate for unequal splitting
– concurrent access to memory
– not stable

Both variants are implemented in the parallel mode; the choice is the user's.
Parallel Partitioning [Tsigas, Zhang 2003]

1. scan blocks of size B from both ends
   1.1 claim new blocks when running out of data
2. swap the unfinished blocks to the "middle"
3. recurse on the middle

[Diagram: p0, p1, p2 work on the input; unfinished blocks are swapped in parallel; the rest is handled recursively or sequentially]

- time complexity O(n/p + B log p)
[Plot: Partitioning of 32-bit integers on Sun T1. Speedup (0 to 16) vs. n (10000 to 10^8); curves: sequential, 1, 2, 3, 4, 8, 16, and 32 threads]
Parallel Balanced Quicksort

Procedure [5]
1. split the sequence using parallel partition, descend recursively with the appropriate number of threads
2. as soon as there is only one processor left per partition: start local sorting
3. each processor sorts locally, pushes parts to process later into a lock-free deque
4. other processors can steal parts when out of work

[Diagram: p0, p1, p2 partition the input in parallel; p0 and p1 partition in parallel; p0, p1, p2 sort sequentially, with stealing]
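
Steps 3 and 4 of the procedure can be illustrated with a single-threaded sketch. The assumptions are significant: a plain `std::deque` replaces the lock-free deque, there is only one worker, and `deque_quicksort` and `local_threshold` are hypothetical names. The point is only the shape of the work queue, where the owner takes work from the back and a thief would take it from the front.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

// Hypothetical single-threaded sketch: quicksort that keeps the parts to
// process later in a double-ended queue of half-open index ranges.
void deque_quicksort(std::vector<int>& v, std::size_t local_threshold) {
    assert(local_threshold >= 2);
    std::deque<std::pair<std::size_t, std::size_t>> pending;
    pending.push_back(std::make_pair(std::size_t(0), v.size()));
    while (!pending.empty()) {
        // The owning thread pops from the back; an idle thread would
        // steal from the front (not shown here).
        const std::pair<std::size_t, std::size_t> r = pending.back();
        pending.pop_back();
        const std::size_t lo = r.first, hi = r.second;
        if (hi - lo < local_threshold) {
            std::sort(v.begin() + lo, v.begin() + hi);  // small part: sort directly
            continue;
        }
        const int pivot = v[lo + (hi - lo) / 2];
        // Three-way split: [lo, mid1) < pivot, [mid1, mid2) == pivot, rest > pivot.
        auto mid1 = std::partition(v.begin() + lo, v.begin() + hi,
                                   [pivot](int x) { return x < pivot; });
        auto mid2 = std::partition(mid1, v.begin() + hi,
                                   [pivot](int x) { return x == pivot; });
        // Push both outer parts for later; the equal part is already in place.
        pending.push_back(std::make_pair(lo, std::size_t(mid1 - v.begin())));
        pending.push_back(std::make_pair(std::size_t(mid2 - v.begin()), hi));
    }
}
```

Since the pivot is always an element of the range, both pushed parts are strictly smaller than the range they came from, so the loop terminates.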
Sorting Performance Results
[Plot: Sorting pairs of 64-bit integers on the Sun T1. Speedup (0 to 25) vs. number of elements (100 to 10^7); curves: sequential; mwms with 2, 4, 8, 16, and 32 threads; bqs with 2, 4, 8, 16, and 32 threads]
Sorting Performance Results
[Plot: Multiway mergesort for 32-bit integers on Opteron 2.0 GHz (2p, 8c). Speedup (0 to 8) vs. input size (100 to 10^7); curves: 1 to 8 threads]
Sorting Performance Results
[Plot: Multiway mergesort for pairs of 64-bit integers on Opteron 2.0 GHz (2p, 8c). Speedup (0 to 8) vs. input size (100 to 10^7); curves: 1 to 8 threads]
Detail Timings
[Plot: Multiway mergesort for 10M 32-bit integers. Time [ms] (0 to 1800) vs. number of threads (seq, 1 to 8); phases: cleanup, multiway merge, split, local sort, copy, enter, seq]
Dictionary Bulk Operations
Algorithmic Problem
- construct/insert into red-black tree
- complicated splitting and balancing of work
- bulk algorithm already brings sequential speedup
- not yet in parallel mode, but only in MCSTL

Memory Management
- memory allocation takes a considerable share of the time
- C++ does not allow asymmetric allocation/deallocation, i.e. allocate several nodes at once, later deallocate one by one
Dictionary Bulk Operations Performance
Insertion, n=0.1k, 2-way Quadcore Xeon
[Plot: Speedup (0 to 10) vs. number of inserted elements k (100 to 10^7); curves: seq, 1 to 8 threads]
Effect of Core Mapping (Dictionary Bulk Construction)
[Plot: Speedup (0 to 10) vs. number of inserted elements k (100 to 10^7); curves: 2 threads on different sockets; 2 threads on the same socket, different dice; 2 threads on the same die; 1 thread; sequential]
General Speedup Insight
- memory bandwidth is usually shared
- the more computation per memory access, the better
- depends highly on user-defined functors
- a large shared cache improves the situation
- cannot compete with computation in L1 cache
- thread starting overhead is constant (1 microsecond) after the first time
- the more input per algorithm call, the better
- simple heuristic: stay sequential for too small inputs, so as not to worsen performance in bad cases
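
The heuristic in the last point might look like the following sketch. The function name, its parameters, and any concrete threshold passed to it are assumptions for illustration; the slides do not state which thresholds the library uses.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical heuristic: how many threads to use for an input of size n.
// Below `min_parallel_n` the call stays sequential, so that thread start-up
// and shared memory bandwidth cannot make small inputs slower.
std::size_t threads_for_input(std::size_t n, std::size_t min_parallel_n,
                              std::size_t max_threads) {
    assert(max_threads >= 1);
    if (n < min_parallel_n || n < 2) return 1;  // too small: stay sequential
    // Never use more threads than there are elements.
    return n < max_threads ? n : max_threads;
}
```

For example, with a threshold of 1000 elements and 8 available threads, an input of 100 elements is handled by one thread and an input of 5000 elements by all eight.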
Software-Engineering-Related Goals
- transparent integration of parallel algorithms
- compile in sequential or in parallel mode
- pragmatic balance between standard adherence and benefits
- sequential semantics, exceptions, space requirements
- selection and tuning of algorithms
- maintainability: little code duplication, much code reuse, build on existing infrastructure
- limited increase in compilation time and executable size for the user application
Usage
Example Code
#include <algorithm>
#include <vector>
int main() {
    std::vector<double> v(1000000);
    std::random_shuffle(v.begin(), v.end());
}

g++-4.3.1 -D_GLIBCXX_PARALLEL -fopenmp random_shuffle.cpp
Sequence Access through Iterators
Most STL algorithms take one or more sequences as main argument(s).

- iterators might have restrictions, e.g. no random access
- information gets lost, e.g. the length of a linked list, which is inefficient to recompute
- data parallelism makes efficient splitting absolutely necessary
- "best-effort solution": single-pass splitting without copying [2]
- whether random access is available is decided at compile time
- iterator traits must be available, also for custom iterators
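
A compile-time decision of this kind can be sketched with standard tag dispatch on `std::iterator_traits`. The names `is_random_access` and `splits_in_constant_time` are hypothetical and only illustrate the mechanism; they are not taken from the library.

```cpp
#include <iterator>
#include <list>
#include <type_traits>
#include <vector>

// Hypothetical trait: true if the iterator category reported by
// std::iterator_traits is (derived from) the random access tag.
template <typename Iterator>
struct is_random_access
    : std::is_base_of<std::random_access_iterator_tag,
                      typename std::iterator_traits<Iterator>::iterator_category> {};

// Tag dispatch: one of these overloads is selected at compile time.
template <typename Iterator>
bool splits_in_constant_time(Iterator, Iterator, std::true_type) {
    return true;   // a range can be split by index arithmetic
}
template <typename Iterator>
bool splits_in_constant_time(Iterator, Iterator, std::false_type) {
    return false;  // needs a pass over the sequence, or a sequential fallback
}
template <typename Iterator>
bool splits_in_constant_time(Iterator first, Iterator last) {
    return splits_in_constant_time(first, last,
                                   typename is_random_access<Iterator>::type());
}
```

A custom iterator that does not provide the nested types expected by `std::iterator_traits` (or a specialization of it) would fail to compile here, which is why the traits must be available.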
Binary Size and Compilation Time

Provide compile-time switches to the user, in order to limit the increase in executable size and compilation time.

#include <vector>
#include <algorithm>
int main() {
    std::vector<int> vi(100000000);
    std::sort(vi.begin(), vi.end(), TAG);
}

Executable size and compilation time for different variants

Algorithm Variant(s) | Size (B) | Time (s)
Sequential | 15 479 | 0.74
Quicksort | 22 387 | 1.49
Balanced Quicksort | 26 989 | 1.84
Multiway-Mergesort Sampling | 36 002 | 3.49
Default (Multiway-Mergesort Exact) | 41 229 | 4.68
Multiway-Mergesort Exact | 41 237 | 4.78
Multiway-Mergesort (splitting choice at run-time) | 46 003 | 5.48
All Parallel Variants (run-time choice) | 61 543 | 6.50
Combination with Other Libraries
STXXL: external memory STL
- parallelization of internal computation, e.g. sorting, multi-way merging
- + task-parallelization framework

CGAL: computational geometry
- mostly preprocessing: sorting, random shuffling
- + manually parallelized geometric algorithms

Upcoming: distributed external memory sorting
- add shared-memory parallelism to distributed-memory algorithm easily
- library?
Use Case 1: Suffix Array Construction
parallel mode + manually parallelized integer sorter
[Plot: Speedup (0 to 3) vs. input length (2^10 to 2^24); curves: sequential, 1, 2, 3, and 4 threads]
Use Case 2: Minimum Spanning Tree Construction
Use STL algorithms from libstdc++ parallel mode

Procedure filterKruskal(E, T : Sequence of Edge, P : UnionFind)
    if m ≤ kruskalThreshold(n, |E|, |T|)
        then kruskal(E, T, P)                                [parallel sort]
    else
        pick a pivot p ∈ E
        E≤ := ⟨e ∈ E : e ≤ p⟩; E> := ⟨e ∈ E : e > p⟩         [parallel partition]
        filterKruskal(E≤, T, P)
        E> := filter(E>, P)                                  [parallel remove_if]
        filterKruskal(E>, T, P)

Function filter(E)
    return ⟨{u, v} ∈ E : u, v are in different components of P⟩
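
The pseudocode can be turned into a short sequential C++ sketch built on `std::sort`, `std::partition`, and `std::remove_if`, the three calls marked as parallel above. Several parts are assumptions made for illustration: the `Edge` and `UnionFind` types, a fixed `threshold` parameter in place of kruskalThreshold, the middle edge as pivot, and the shortcut taken when the pivot is the maximum weight.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

struct Edge { int u, v; double w; };
bool by_weight(const Edge& a, const Edge& b) { return a.w < b.w; }

// Union-find with path halving (no rank, to keep the sketch short).
struct UnionFind {
    std::vector<int> parent;
    explicit UnionFind(int n) : parent(n) { std::iota(parent.begin(), parent.end(), 0); }
    int find(int x) {
        while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; }
        return x;
    }
    bool unite(int a, int b) {
        a = find(a); b = find(b);
        if (a == b) return false;
        parent[a] = b;
        return true;
    }
};

// Plain Kruskal on a small edge set: sort, then add edges joining components.
void kruskal(std::vector<Edge>& E, std::vector<Edge>& T, UnionFind& P) {
    std::sort(E.begin(), E.end(), by_weight);  // "parallel sort" in the slides
    for (const Edge& e : E)
        if (P.unite(e.u, e.v)) T.push_back(e);
}

void filter_kruskal(std::vector<Edge>& E, std::vector<Edge>& T, UnionFind& P,
                    std::size_t threshold) {
    if (E.size() <= threshold) { kruskal(E, T, P); return; }
    const double pivot = E[E.size() / 2].w;
    // Light edges (weight <= pivot) first; "parallel partition" in the slides.
    auto mid = std::partition(E.begin(), E.end(),
                              [pivot](const Edge& e) { return e.w <= pivot; });
    std::vector<Edge> light(E.begin(), mid), heavy(mid, E.end());
    if (heavy.empty()) { kruskal(light, T, P); return; }  // pivot was the maximum
    filter_kruskal(light, T, P, threshold);
    // Drop heavy edges whose endpoints are already connected;
    // "parallel remove_if" in the slides.
    heavy.erase(std::remove_if(heavy.begin(), heavy.end(),
                               [&P](const Edge& e) { return P.find(e.u) == P.find(e.v); }),
                heavy.end());
    filter_kruskal(heavy, T, P, threshold);
}
```

Calling `filter_kruskal(edges, tree, uf, threshold)` with an empty `tree` and a `UnionFind` over all vertices appends the spanning tree (or forest) edges to `tree`.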
Use Case: Minimum Spanning Tree Construction
[Plot: Speedup (0 to 6) vs. number of edges (100000 to 10^8); curves: Kruskal with 8 threads, QuickMST with 8 threads]
QuickMST is still faster in absolute time [3] (ALENEX 2009).
Conclusions
Benefits
- parallel mode provides an easy way to incorporate data parallelism into programs on an algorithmic level
- fully generic
- performance is good for large inputs
- speedup at hand for small inputs as well, depending on circumstances
- could transparently support new paradigms, e.g. transactional memory
- repository for parallel algorithm implementations
Future Work
- integration of missing algorithms
- performance estimation ⇒ automatic switching point detection

[Plot: Switching Number of Threads for Balanced Quicksort: Preliminary Results. Speedup (0.5 to 2.5) vs. input size (0 to 20000); curves: seq, 1 to 8 threads]

- working affinity
Thank you for your attention. Questions?

References
[1] L. Frias and J. Singler. Parallelization of bulk operations for STL dictionaries. In Workshop on Highly Parallel Processing on a Chip (HPPC), 2007.
[2] L. Frias, J. Singler, and P. Sanders. Single-pass list partitioning. Scalable Computing: Practice and Experience, 9(3):179–184, 2008.
[3] V. Osipov, P. Sanders, and J. Singler. The filter-Kruskal minimum spanning tree algorithm. In 11th Workshop on Algorithm Engineering and Experiments (ALENEX), 2009.
[4] J. Singler, P. Sanders, and F. Putze. The Multi-Core Standard Template Library. In Euro-Par 2007: Parallel Processing, volume 4641 of LNCS, pages 682–694. Springer-Verlag.
[5] P. Tsigas and Y. Zhang. A simple, fast parallel implementation of quicksort and its performance evaluation on SUN Enterprise 10000. In 11th Euromicro Conference on Parallel, Distributed and Network-Based Processing, page 372, 2003.
[6] P. J. Varman, S. D. Scheufler, B. R. Iyer, and G. R. Ricard. Merging Multiple Lists on Hierarchical-Memory Multiprocessors. Journal of Parallel and Distributed Computing, 12(2):171–177, 1991.