the gnu libstdc++ parallel mode: benefit from multi-core using …ls11- · 2008-12-08 ·...


The GNU libstdc++ parallel mode: Benefit from Multi-Core using the STL

Johannes Singler, Peter Sanders
{singler,sanders}@ira.uka.de

Institute for Theoretical Computer Science, University of Karlsruhe

Johannes Singler, Peter Sanders GNU libstdc++ parallel mode: Multi-Core using the STL


Talk Outline

Introduction

Library Overview

Algorithms

Software Engineering Aspects

Applications

Conclusion


Motivation

How to Benefit from Multi-Core Systems?
- automatic parallelization is not sufficient
- manual/explicit parallelization is needed, but it is expensive, complicated, error-prone, and not easy to try out

Our Approach
- provide a data-parallelized library of basic algorithms for shared-memory systems
- libraries are an important aspect of Algorithm Engineering
- make the usage of (data-)parallel algorithms very easy
- actual parallelism is not visible to the user, but encapsulated


Basic Approach

Starting Point
- provide the functionality of the C++ Standard Template Library
- run the algorithms in parallel
- included with GCC as of version 4.3: libstdc++ parallel mode
- formerly known as the Multi-Core Standard Template Library

Why STL?
- many useful algorithms and data structures included
- simple interface, very well-known among developers
- recompilation of existing programs may suffice
- C++ is an accepted, efficient, and standardized language


Goals

Ease of Use
- no new language, no language extension
- just a few compiler options to activate

Good Performance
- some speedup already for small inputs ⇒ scale down
- full speedup for larger inputs ⇒ scale up
- co-exist with other forms of parallelization
- respect machine load ⇒ dynamic load-balancing


Competitors

STAPL
- must incorporate distributed-memory issues
- no code publicly available
- interface only similar to STL

Intel Threading Building Blocks
- mostly on a more abstract level, a parallel programming framework
- the only combinatorial generic algorithm is a parallel sorter


Platform Support

- based on OpenMP (fork-join parallelism)

[Figure: layer diagram with the labels Application, STL Interface, Serial STL Algorithms, Parallel STL Algorithms, Extensions, OpenMP, Atomic Ops, OS Thread Scheduling, and Multi-Core Hardware; the parallel algorithms and extensions are marked as the parallel mode.]

- low-level issues are left to the OpenMP runtime, e.g. thread pooling, environment, synchronization primitives
- GCC’s upcoming implementation improved by us
- platform-independent, supported by all major C++ compilers
- task construct upcoming
- + atomic operations (platform-dependent)
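
As a hedged illustration of the fork-join model named above, the following sketch sums a vector with an OpenMP parallel loop. It is not code from the parallel mode; the function name `parallel_sum` is made up for this example. Built with `-fopenmp`, the loop iterations are distributed among threads; without that flag the pragma is ignored and the loop runs sequentially.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of libstdc++): sums the elements of v.
// With -fopenmp the iterations are split among a team of threads (fork)
// and the per-thread partial sums are combined when the loop ends (join).
long long parallel_sum(const std::vector<int>& v) {
  long long total = 0;
  const long long n = static_cast<long long>(v.size());
#pragma omp parallel for reduction(+ : total)
  for (long long i = 0; i < n; ++i) {
    total += v[static_cast<std::size_t>(i)];
  }
  return total;
}
```

Thread creation, pooling, and the final combination step are all handled by the OpenMP runtime, which is the division of labour the slide describes.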


Overview of Important Algorithms

Strictly STL (mostly <algorithm>)
- for_each and friends (embarrassingly parallel)
- find
- partial_sum (prefix sum)
- partition, partial_sort
- merge
- sort
- random_shuffle
- bulk construction and bulk insert for set and map (not part of the parallel mode yet)

Extension to STL
- multiway_merge
- multiseq_partition (helper)


Parallel Mode Development Status

(w/ LB = with dynamic load-balancing, w/o LB = without)

| Algorithm Class | Function Call(s) | Status | w/ LB | w/o LB |
|---|---|---|---|---|
| Embarrassingly Parallel | for_each, generate(_n), fill(_n), count(_if), transform, replace(_if), min_element, max_element, adjacent_difference, unique_copy | impl | yes | yes |
| Find | find(_if), find_first_of, adjacent_find, mismatch, equal, lexicographical_compare | impl | yes | not worthwhile |
| Search | search(_n) | impl | yes | not worthwhile |
| Numerical Algorithms | accumulate, partial_sum, inner_product | impl | planned | yes |
| Partition | partition, stable_partition | impl | yes | not worthwhile |
| Merge | merge, multiway_merge, inplace_merge | impl | tbi | yes |
| Partial Sort | nth_element, partial_sort | impl | yes | planned |
| Sort | sort, stable_sort | impl | yes | yes |
| Random Permutation | random_shuffle | impl | yes | not worthwhile |
| Dictionaries | (multi_)map/set bulk op | tbi | tbi | |
| Complex Set Operations | set_union, set_intersection, set_(symmetric_)difference, ... | impl | no | yes |
| Vector Arithmetic | valarray operations | ongoing | yes | yes |
| Heap Construction | make_heap, sort_heap | tbi | | |
| Priority Queues | amortized update operations | ongoing | | |
| Filtering | remove(_copy)(_if) | ongoing | | |


for_each Implementation

Definition
- execute a certain function on a range of elements
- many similar functions like transform, generate
- parallelization is easy only for uniform execution time per element and an exclusive machine

Static Load-Balancing
- divide work into parts of almost equal size
- used for accumulate, since ends of chunks can easily be spliced (the operation need not be commutative)

Dynamic Load-Balancing
- initially divide work into parts of almost equal size
- allow “unemployed” threads to take work from others ⇒ work-stealing


merge, multiway_merge

Problem Definitions
- merge: combine two sorted sequences into one sorted sequence
- multiway_merge: combine k > 2 sorted sequences into one sorted sequence
- important for (external memory) sorting

How to divide the input?
- find slabs, i.e. consistent sets of ranges from the sequences
- two possibilities:
  - (randomized) splitting by sampling
  - exact partitioning into slabs of equal size (using multi-sequence selection) [6]
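
For reference, a sequential k-way merge can be sketched as below. This is a plain heap-based merge chosen for brevity, not necessarily the technique the parallel mode uses internally; `kway_merge` is a hypothetical name. In the parallel setting described above, each thread would run such a merge on its own slab.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical helper: merges k sorted sequences into one sorted sequence
// using a min-heap of (current value, source sequence index) pairs.
std::vector<int> kway_merge(const std::vector<std::vector<int> >& seqs) {
  typedef std::pair<int, std::size_t> Entry;
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry> > heap;
  std::vector<std::size_t> next(seqs.size(), 0);  // read position per sequence
  std::size_t total = 0;
  for (std::size_t s = 0; s < seqs.size(); ++s) {
    total += seqs[s].size();
    if (!seqs[s].empty()) {
      heap.push(Entry(seqs[s][0], s));
      next[s] = 1;
    }
  }
  std::vector<int> out;
  out.reserve(total);
  while (!heap.empty()) {
    const Entry e = heap.top();  // smallest head among all sequences
    heap.pop();
    out.push_back(e.first);
    const std::size_t s = e.second;
    if (next[s] < seqs[s].size()) {
      heap.push(Entry(seqs[s][next[s]++], s));  // refill from the same sequence
    }
  }
  return out;
}
```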


merge, multiway_merge: Diagram

[Figure: k sorted sequences divided into slabs that are assigned to the threads t0, t1, t2, and t3.]


Parallel Multiway Mergesort

Procedure
1. divide sequence into p parts of equal size
2. in parallel, sort the parts locally
3. use parallel p-way merging to compute the final sequence
4. copy result back to original position

[Figure: the sequence split among the threads t0, t1, t2, and t3.]
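
The four steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's code: `multiway_mergesort_sketch` is a hypothetical name, `p` must be at least 1, only step 2 runs in parallel (and only when built with `-fopenmp`), and step 3 uses a sequential heap-based merge as a stand-in for parallel p-way merging.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical sketch of the four steps; p is the number of parts (p >= 1).
void multiway_mergesort_sketch(std::vector<int>& data, std::size_t p) {
  const std::size_t n = data.size();

  // Step 1: boundaries of p contiguous parts of almost equal size.
  std::vector<std::size_t> b(p + 1);
  for (std::size_t i = 0; i <= p; ++i) b[i] = i * n / p;

  // Step 2: sort the parts locally (in parallel when built with -fopenmp).
  const long long parts = static_cast<long long>(p);
#pragma omp parallel for
  for (long long i = 0; i < parts; ++i) {
    std::sort(data.begin() + static_cast<std::ptrdiff_t>(b[i]),
              data.begin() + static_cast<std::ptrdiff_t>(b[i + 1]));
  }

  // Step 3: merge the p sorted parts into a temporary buffer
  // (sequential stand-in for parallel p-way merging).
  typedef std::pair<int, std::size_t> Entry;  // (value, part index)
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry> > heap;
  std::vector<std::size_t> pos(b.begin(), b.end() - 1);  // read position per part
  for (std::size_t i = 0; i < p; ++i) {
    if (pos[i] < b[i + 1]) heap.push(Entry(data[pos[i]++], i));
  }
  std::vector<int> merged;
  merged.reserve(n);
  while (!heap.empty()) {
    const Entry e = heap.top();
    heap.pop();
    merged.push_back(e.first);
    const std::size_t i = e.second;
    if (pos[i] < b[i + 1]) heap.push(Entry(data[pos[i]++], i));
  }

  // Step 4: copy the result back to the original position.
  std::copy(merged.begin(), merged.end(), data.begin());
}
```

The temporary buffer in step 3 is the reason this approach needs twice the space of the input.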


sort, stable_sort

Parallel Multiway Mergesort
+ less communication necessary
+ stable variant easy to derive
- needs twice the space

Parallel Load-Balanced Quicksort
+ in-place
± dynamic load-balancing to compensate for unequal splitting
- concurrent access to memory
- not stable

Both variants are implemented in the parallel mode; the choice is left to the user.


Parallel Partitioning [Tsigas, Zhang 2003]

1. scan blocks of size B from both ends
   1.1 claim new blocks when running out of data
2. swap the unfinished blocks to the “middle”
3. recurse on the middle

[Figure: the input is scanned by p0, p1, and p2; the unfinished blocks are swapped in parallel; the rest is handled recursively or sequentially.]

- time complexity O(n/p + B log p)


[Figure: Partitioning of 32-bit integers on Sun T1. Speedup (0 to 16) versus input size n (10000 to 10^8) for the sequential version and for 1, 2, 3, 4, 8, 16, and 32 threads.]


Parallel Balanced Quicksort

Procedure [5]
1. split the sequence using parallel partition, descend recursively with the appropriate number of threads
2. as soon as there is only one processor left per partition: start local sorting
3. each processor sorts locally and pushes parts to process later into a lock-free deque
4. other processors can steal parts when out of work

[Figure: the input is partitioned in parallel by p0, p1, and p2; one side is partitioned again in parallel by p0 and p1; p0, p1, and p2 then sort sequentially and steal work from each other.]


Sorting Performance Results

[Figure: Sorting Pairs of 64-bit Integers on the Sun T1. Speedup (0 to 25) versus number of elements (100 to 10^7) for the sequential version and for 2, 4, 8, 16, and 32 threads, each with multiway mergesort (mwms) and balanced quicksort (bqs).]


Sorting Performance Results

[Figure: Multiway Mergesort for 32-bit integers, Opteron 2.0 GHz (2p, 8c). Speedup (0 to 8) versus input size (100 to 10^7) for 1 to 8 threads.]


Sorting Performance Results

[Figure: Multiway Mergesort for pairs of 64-bit integers, Opteron 2.0 GHz (2p, 8c). Speedup (0 to 8) versus input size (100 to 10^7) for 1 to 8 threads.]


Detail Timings

[Figure: Multiway Mergesort for 10M 32-bit integers. Time in ms (0 to 1800) versus number of threads (seq, 1 to 8), broken down into the phases cleanup, multiway merge, split, local sort, copy, enter, and seq.]


Dictionary Bulk Operations

Algorithmic Problem
- construct/insert into red-black tree
- complicated splitting and balancing of work
- bulk algorithm already brings sequential speedup
- not yet in parallel mode, but only in MCSTL

Memory Management
- memory allocation takes a considerable share of the time
- C++ does not allow asymmetric allocation/deallocation, i.e. allocating several nodes at once and later deallocating them one by one


Dictionary Bulk Operations Performance

Insertion, n=0.1k, 2-way Quadcore Xeon

[Figure: speedup (0 to 10) versus number of inserted elements k (100 to 10^7) for the sequential version and for 1 to 8 threads.]


Effect of Core Mapping (Dictionary Bulk Construction)

[Figure: speedup (0 to 10) versus number of inserted elements k (100 to 10^7) for: 2 threads on different sockets; 2 threads on the same socket but different dice; 2 threads on the same die; 1 thread; sequential.]


General Speedup Insight

- memory bandwidth is usually shared
  - the more computation per memory access, the better
  - highly dependent on user-defined functors
  - a large shared cache improves the situation
  - cannot compete with computation in L1 cache
- thread starting overhead is constant (1 microsecond) after the first time
  - the more input per algorithm call, the better
  - simple heuristic: stay sequential for inputs that are too small, so as not to worsen performance in bad cases
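
The "stay sequential for small inputs" heuristic can be sketched with the OpenMP `if` clause. The cutoff value and the names `kMinParallelSize`, `worth_parallelizing`, and `sum_with_cutoff` are assumptions made for this example, not values or identifiers from the library; a real threshold would have to be tuned per algorithm and machine.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical cutoff, chosen arbitrarily for illustration.
const std::size_t kMinParallelSize = 1000;

// Decides whether an input of n elements is large enough to pay for
// the thread start-up overhead.
bool worth_parallelizing(std::size_t n) { return n >= kMinParallelSize; }

// Sums v; small inputs stay on a single thread.
long long sum_with_cutoff(const std::vector<int>& v) {
  long long total = 0;
  const long long n = static_cast<long long>(v.size());
  const bool go_parallel = worth_parallelizing(v.size());
  // When go_parallel is false, the `if` clause makes OpenMP execute the
  // loop with one thread. Without -fopenmp the pragma is ignored entirely.
#pragma omp parallel for reduction(+ : total) if (go_parallel)
  for (long long i = 0; i < n; ++i) {
    total += v[static_cast<std::size_t>(i)];
  }
  return total;
}
```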


Software-Engineering-Related Goals

- transparent integration of parallel algorithms
  - compile in sequential or in parallel mode
  - pragmatic balance between standard adherence and benefits
  - sequential semantics, exceptions, space requirements
- selection and tuning of algorithms
- maintainability: little code duplication, much code reuse, build on existing infrastructure
- limited increase in compilation time and executable size for the user application


Usage

Example Code

    #include <algorithm>
    #include <vector>

    std::vector<double> v(1000000);
    std::random_shuffle(v.begin(), v.end());

Compile with:

    g++-4.3.1 -D_GLIBCXX_PARALLEL -fopenmp random_shuffle.cpp
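
Besides switching a whole translation unit with `-D_GLIBCXX_PARALLEL`, libstdc++ also exposes the parallel algorithms explicitly in the namespace `__gnu_parallel` through the header `<parallel/algorithm>`. The sketch below calls `__gnu_parallel::sort` when OpenMP and libstdc++ are both available and otherwise falls back to `std::sort`. The wrapper name `sort_maybe_parallel` and the macro `HAVE_GNU_PARALLEL` are made up for this example, and the guard is an assumption about a reasonable way to keep the code portable.

```cpp
#include <algorithm>
#include <vector>

// The parallel entry points need OpenMP (-fopenmp) and libstdc++.
#if defined(_OPENMP) && defined(__GLIBCXX__)
#include <parallel/algorithm>
#define HAVE_GNU_PARALLEL 1  // hypothetical macro for this example
#else
#define HAVE_GNU_PARALLEL 0
#endif

// Hypothetical wrapper: sorts v with the explicitly parallel algorithm
// if it is available, and with the sequential std::sort otherwise.
void sort_maybe_parallel(std::vector<double>& v) {
#if HAVE_GNU_PARALLEL
  __gnu_parallel::sort(v.begin(), v.end());
#else
  std::sort(v.begin(), v.end());
#endif
}
```

Calling the algorithm explicitly limits the parallel code to selected call sites instead of every STL algorithm in the file.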


Sequence Access through Iterators

Most STL algorithms take one or more sequences as main argument(s).

- iterators might have restrictions, e.g. no random access
- information gets lost, e.g. the length of a linked list, which is inefficient to recompute
- for data parallelism, efficient splitting is absolutely necessary
  - “best-effort solution”: single-pass splitting without copying [2]
- the decision on whether random access is available is made at compile time
  - iterator traits must be available, also for custom iterators
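
The compile-time decision can be illustrated with tag dispatch on `std::iterator_traits`. This is a generic sketch of the technique, not the library's code; `middle` and `middle_impl` are hypothetical names. The fallback for non-random-access iterators is a plain count-then-advance traversal, not the single-pass splitting of [2].

```cpp
#include <iterator>
#include <list>
#include <vector>

// Random-access iterators: the midpoint is computed in constant time.
template <typename It>
It middle_impl(It first, It last, std::random_access_iterator_tag) {
  return first + (last - first) / 2;
}

// Weaker (forward or bidirectional) iterators: count the elements first,
// then walk to the midpoint. This takes linear time.
template <typename It>
It middle_impl(It first, It last, std::forward_iterator_tag) {
  typename std::iterator_traits<It>::difference_type n =
      std::distance(first, last);
  std::advance(first, n / 2);
  return first;
}

// Selects the overload at compile time from the iterator category.
// This only works if std::iterator_traits is defined for It, which is
// why custom iterators must provide their traits.
template <typename It>
It middle(It first, It last) {
  typedef typename std::iterator_traits<It>::iterator_category Category;
  return middle_impl(first, last, Category());
}
```

For a `std::vector` the first overload is chosen, for a `std::list` the second, with no run-time check in either case.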


Binary Size and Compilation Time

Provide compile-time switches to the user, in order to limit the increase in executable size and compilation time.

    #include <vector>
    #include <algorithm>

    int main() {
      std::vector<int> vi(100000000);
      std::sort(vi.begin(), vi.end(), TAG);
    }

Executable size and compilation time for different variants

| Algorithm Variant(s) | Size (B) | Time (s) |
|---|---|---|
| Sequential | 15 479 | 0.74 |
| Quicksort | 22 387 | 1.49 |
| Balanced Quicksort | 26 989 | 1.84 |
| Multiway-Mergesort Sampling | 36 002 | 3.49 |
| Default (Multiway-Mergesort Exact) | 41 229 | 4.68 |
| Multiway-Mergesort Exact | 41 237 | 4.78 |
| Multiway-Mergesort (splitting choice at run-time) | 46 003 | 5.48 |
| All Parallel Variants (run-time choice) | 61 543 | 6.50 |


Combination with Other Libraries

STXXL: external memory STL
- parallelization of internal computation, e.g. sorting, multi-way merging
- + task-parallelization framework

CGAL: computational geometry
- mostly preprocessing: sorting, random shuffling
- + manually parallelized geometric algorithms

Upcoming: distributed external memory sorting
- easily add shared-memory parallelism to a distributed-memory algorithm
- library?


Use Case 1: Suffix Array Construction

parallel mode + manually parallelized integer sorter

[Figure: speedup (0 to 3) versus input length (2^10 to 2^24) for the sequential version and for 1, 2, 3, and 4 threads.]


Use Case 2: Minimum Spanning Tree Construction

Use STL algorithms from libstdc++ parallel mode

    Procedure filterKruskal(E, T : Sequence of Edge, P : UnionFind)
      if m ≤ kruskalThreshold(n, |E|, |T|) then
        kruskal(E, T, P)                                  (parallel sort)
      else
        pick a pivot p ∈ E
        E≤ := 〈e ∈ E : e ≤ p〉; E> := 〈e ∈ E : e > p〉      (parallel partition)
        filterKruskal(E≤, T, P)
        E> := filter(E>, P)                               (parallel remove_if)
        filterKruskal(E>, T, P)

    Function filter(E)
      return 〈{u, v} ∈ E : u, v are in different components of P〉
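
The pseudocode maps onto STL algorithms roughly as in the following sketch. It is an illustration under stated assumptions, not the authors' implementation: the `Edge` and `UnionFind` types, the fixed cutoff `kThreshold` (a stand-in for `kruskalThreshold`), and the middle-element pivot choice are all made up for this example. Compiled in parallel mode, the calls to `std::sort`, `std::partition`, and `std::remove_if` are the places where a parallel variant may be substituted, depending on the library's development status.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

struct Edge { int u, v; double w; };
typedef std::vector<Edge>::iterator EdgeIt;

// Minimal union-find with path halving (no union by rank, for brevity).
struct UnionFind {
  std::vector<int> parent;
  explicit UnionFind(int n) : parent(n) {
    std::iota(parent.begin(), parent.end(), 0);
  }
  int find(int x) {
    while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; }
    return x;
  }
  bool unite(int a, int b) {
    a = find(a); b = find(b);
    if (a == b) return false;
    parent[a] = b;
    return true;
  }
};

bool lighter(const Edge& a, const Edge& b) { return a.w < b.w; }

// Plain Kruskal on [first, last): sort, then keep edges joining components.
void kruskal(EdgeIt first, EdgeIt last, std::vector<Edge>& T, UnionFind& P) {
  std::sort(first, last, lighter);
  for (EdgeIt e = first; e != last; ++e) {
    if (P.unite(e->u, e->v)) T.push_back(*e);
  }
}

const std::ptrdiff_t kThreshold = 2;  // hypothetical stand-in for kruskalThreshold

void filter_kruskal(EdgeIt first, EdgeIt last, std::vector<Edge>& T,
                    UnionFind& P) {
  if (last - first <= kThreshold) { kruskal(first, last, T, P); return; }
  const double pivot = (first + (last - first) / 2)->w;  // pick a pivot
  EdgeIt mid = std::partition(
      first, last, [pivot](const Edge& e) { return e.w <= pivot; });
  // If no edge is heavier than the pivot, partitioning made no progress.
  if (mid == last) { kruskal(first, last, T, P); return; }
  filter_kruskal(first, mid, T, P);  // light edges first
  // Drop heavy edges whose endpoints are already connected.
  last = std::remove_if(mid, last, [&P](const Edge& e) {
    return P.find(e.u) == P.find(e.v);
  });
  filter_kruskal(mid, last, T, P);
}
```

The filtering step is what saves work: heavy edges that lie inside an already connected component are discarded before they are ever sorted.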


Use Case 2: Minimum Spanning Tree Construction

[Figure: speedup (0 to 6) versus number of edges (100000 to 10^8) for Kruskal with 8 threads and QuickMST with 8 threads.]

QuickMST is still faster in absolute time [3] (ALENEX 2009).


Conclusions

Benefits
- parallel mode provides an easy way to incorporate data parallelism into programs on an algorithmic level
- fully generic
- performance is good for large inputs
- speedup at hand for small inputs as well, depending on circumstances
- could transparently support new paradigms, e.g. transactional memory
- repository for parallel algorithm implementations


Future Work

- integration of missing algorithms
- performance estimation ⇒ automatic switching point detection

[Figure: Switching Number of Threads for Balanced Quicksort: Preliminary Results. Speedup (0.5 to 2.5) versus input size (0 to 20000) for the sequential version and for 1 to 8 threads.]

- working affinity


Thank you for your attention. Questions?

References

[1] L. Frias and J. Singler. Parallelization of bulk operations for STL dictionaries. In Workshop on Highly Parallel Processing on a Chip (HPPC), 2007.

[2] L. Frias, J. Singler, and P. Sanders. Single-pass list partitioning. Scalable Computing: Practice and Experience, 9(3):179–184, 2008.

[3] V. Osipov, P. Sanders, and J. Singler. The filter-kruskal minimum spanning tree algorithm. In 11th Workshop on Algorithm Engineering and Experiments (ALENEX), 2009.

[4] J. Singler, P. Sanders, and F. Putze. The Multi-Core Standard Template Library. In Euro-Par 2007: Parallel Processing, volume 4641 of LNCS, pages 682–694. Springer-Verlag.

[5] P. Tsigas and Y. Zhang. A simple, fast parallel implementation of quicksort and its performance evaluation on SUN enterprise 10000. In 11th Euromicro Conference on Parallel, Distributed and Network-Based Processing, page 372, 2003.

[6] P. J. Varman, S. D. Scheufler, B. R. Iyer, and G. R. Ricard. Merging Multiple Lists on Hierarchical-Memory Multiprocessors. Journal of Parallel and Distributed Computing, 12(2):171–177, 1991.
