insertion tree phasers: efficient and scalable barrier synchronization for fine-grained parallelism
DESCRIPTION
This paper presents an algorithm and a data structure for scalable dynamic synchronization in fine-grained parallelism. The algorithm supports the full generality of phasers with dynamic, two-phase, and point-to-point synchronization. It retains the scalability of classical tree barriers, but provides unbounded dynamicity by employing a tailor-made insertion tree data structure.It is the first completely documented implementation strategy for a scalable phaser synchronization construct. Our evaluation shows that it can be used as a drop-in replacement for classic barriers without harming performance, despite its additional complexity and potential for performance optimizations. Furthermore, our approach overcomes performance and scalability limitations which have been present in other phaser proposals.TRANSCRIPT
Insertion Tree Phasers
Efficient and Scalable Barrier Synchronization for Fine-grained
Parallelism
Stefan Marr
S. Verhaegen, B. De Fraine, T. D’Hondt, W. De Meuter
Software Languages Lab
VrijeUniversiteitBrussel
Agenda
• Introduction
– Barriers, Phasers
• Insertion Tree Phasers
– Insertion Tree
– Phaser Algorithm
• Evaluation
• Summary
9/26/2010 2
Barriers
• Synchronizing parallel activities
– High productivity: easy to get right
• Mostly for scientific computing
• Many-core evolution
– Synchronizing dynamic and irregular problems
– Requires low-overhead dynamic hierarchical barriers
9/26/2010 3Introduction
t1
p1
t2
p1
t3
p1
t1
p2
t2
p2
t3
p2
t1
p3
t2
p3
t3
p3
Phasers
9/26/2010 4Introduction
• Extension of X10 clocks
– Clocks: dynamic two-phase barrier for fork/join parallelism
• Registration modes for barrier
– Enables expression of producer/consumer relation
• Single statements
– Executed only by single thread, avoids duplicated barrier operations
t1
p1
t1
p2
t2
p2
t3
p2
t2
p2
t3
p2
t2
p3
t3
p3
Hierarchical Phasers
9/26/2010 5Introduction
• First scalable implementation strategy
• Predefined tree structure– Degree, i.e., tree arity
– Max. number of tiers, i.e., height
– Composed from phasers
• Problematic– None dynamic structure
– Two-phase support incomplete
– Leaves design decisions open
Shirako&Sarkar in Proc. of IEEE IPDPS 2010 [1]
sig sig sig sig sig sig sig sig
A1 A2 A3 A4 A5 A6 A7 A8
sub sub
Phaser
sub sub sub sub
Tier 0
Tier 1
Tier 2(leafs)
Array accessList access
INSERTION TREE PHASERS
9/26/2010 7
Design Goal
• Support for full generality of Phaser properties
– Two-phase support
– Signal-only/wait-only for producers/consumers
– Single statement
– Full dynamicity: fine-grained hierarchical fork/join
• Adaptation of existing, scalable approaches
– Dissemination barrier not adaptable
– Remaining are tree-based approaches
9/26/2010 8Insertion TreePhaserAlgorithm
Insertion Tree
• Goals
– Stable, i.e., minimized tree modifications
• Avoid inconsistent synchronization information
– Maximizing parallel operations
• Solution: Insertion Tree
– Inverted tree
– No removal
– Complete smallest subtree first
9/26/2010 9Insertion TreePhaserAlgorithm
1/2
Insertion Tree
9/26/2010 10Insertion TreePhaserAlgorithm
2/2
Insertion Tree
9/26/2010 11Insertion TreePhaserAlgorithm
2/2
1
Insertion Tree
9/26/2010 12Insertion TreePhaserAlgorithm
2/2
1 2
h1
Insertion Tree
9/26/2010 13Insertion TreePhaserAlgorithm
2/2
1 2
h1
h2
3
Insertion Tree
9/26/2010 14Insertion TreePhaserAlgorithm
2/2
1 2
h1
h2
3
h3
4
Insertion Tree
9/26/2010 15Insertion TreePhaserAlgorithm
2/2
1 2
h1
h2
3
h3
4 5
h4
Insertion Tree
9/26/2010 16Insertion TreePhaserAlgorithm
2/2
1 2
h1
h2
3
h3
4 5 6 7 8
h5 h7
h6
h4
Synchronization Tree*
9/26/2010 18Insertion TreePhaserAlgorithm
Phaserphase: 0
A1 A2 A3 A4
0 0 0 0wo
0 0
*) is simplified, leaves out registration modes
0 0 0 0rsm
d
Participant nodes
Helper nodes
Phase counter
Resume flag
Phase counter
Wait-only flag
Announcing Synchronization
9/26/2010 19Insertion Tree Phaser Algorithm
Phaserphase: 0
A1 A2 A3 A4
0 0 0 0
0 0
0 0 0 0
Announcing Synchronization
9/26/2010 20Insertion Tree Phaser Algorithm
Phaserphase: 0
A1 A2 A3 A4
0 1 1 0
0 0
0 01rsm
d
1rsm
d
Announcing Synchronization
9/26/2010 21Insertion Tree Phaser Algorithm
Phaserphase: 0
A1 A2 A3 A4
1 1 1 1
0 0
1rsm
d
1rsm
d
1rsm
d
1rsm
d
Announcing Synchronization
9/26/2010 22Insertion Tree Phaser Algorithm
Phaserphase: 0
A1 A2 A3 A4
1 1 1 1
0 1
1rsm
d
1rsm
d
1rsm
d
1rsm
d
Announcing Synchronization
9/26/2010 23Insertion Tree Phaser Algorithm
Phaserphase: 0
A1 A2 A3 A4
1 1 1 1
1 1
1rsm
d
1rsm
d
1rsm
d
1rsm
d
Announcing Synchronization
9/26/2010 24Insertion Tree Phaser Algorithm
Phaserphase: 1
A1 A2 A3 A4
1 1 1 1
1 1
1rsm
d
1rsm
d
1rsm
d
1rsm
d
Synchronization reached.Continue to next phase.
Dropping Participants
9/26/2010 25Insertion TreePhaserAlgorithm
Phaserphase: 0
A1 A2 A3 A4
0 0
0
0 0 1rsm
d
1 1
1
1rsm
d
Dropping Participants
9/26/2010 26Insertion TreePhaserAlgorithm
Phaserphase: 0
A1 A2 A3 A4
0
0
0 1rsm
d
wo1 1
1
1rsm
d
h1:R
Dropping Participants
9/26/2010 27Insertion TreePhaserAlgorithm
Phaserphase: 0
A1 A2 A3 A4
0
1rsm
d
wo1
wo1
1
1rsm
d
h1:R
Dropping Participants
9/26/2010 28Insertion TreePhaserAlgorithm
Phaserphase: 0
A1 A2 A3 A4
1rsm
d
wo1
wo1
wo1
1rsm
d
h1:R
Dropping Participants
9/26/2010 29Insertion TreePhaserAlgorithm
Phaserphase: 1
A1 A2 A3 A4
1rsm
d
wo1
wo1
wo1
1rsm
d
Synchronization reached.Continue to next phase.
h1:L
h1:R
Dropping Participants
9/26/2010 30Insertion TreePhaserAlgorithm
Phaserphase: 1
A1 A2 A3 A4
1rsm
d
wo1
wo1
wo1
1rsm
d
Adding New Participants
9/26/2010 31Insertion TreePhaserAlgorithm
Phaserphase: 8
A1 A2 A3 A4
9 8
8 9
9rsm
d
8 9rsm
d
8
Adding New Participants
9/26/2010 32Insertion TreePhaserAlgorithm
Phaserphase: 8
A1 A2 A3 A4
9 8 8 8
8 9
9rsm
d
8 9rsm
d
8
Adding New Participants
9/26/2010 33Insertion TreePhaserAlgorithm
Phaserphase: 8
A1 A2 A3 A4
9 8 9 8
8 8
9rsm
d
8 9rsm
d
8
-1
+1
Adding New Participants
9/26/2010 34Insertion TreePhaserAlgorithm
Phaserphase: 8
A1 A2 A3 A4
9 8 9 8
8 8
9rsm
d
8 9rsm
d
8
propagate phase count
EVALUATION
9/26/2010 35
Two-Phaser Barrier Operation
9/26/2010 36Evaluation
80
85
90
95
100
105
110
115
120
0 8 16 24 32 40 48 56
µs
TreeBarrier HabaneroPhaser Central
Dissemination InsertionTreePhaser TmcSpinBarrier
Overhead: Two-Phase vs. Classic
9/26/2010 37Evaluation
30%
40%
50%
60%
70%
80%
90%
100%
110%
120%
130%
0 8 16 24 32 40 48 56TreeBarrier HabaneroPhaser CentralDissemination TreePhaser TmcSpinBarrier
Use as Drop-In Replacement for SPLASH-2Speedup compared to TmcSpinBarrier
9/26/2010 38Evaluation
-20%
-15%
-10%
-5%
0%
5%
10%
15%
20%
Barnes58 cores
Cholesky58 cores
LU58 cores
Radiosity58 cores
Water-Spacial58 cores
FFT32 cores
Ocean32 cores
Radix32 cores
TreeBarrier HabaneroPhaser Central Dissemination InsertionTreePhaser
Summary
• Scalable and efficient approach to Phasers
– Documents implementation
– Based on fully dynamic insertion tree
– Overcomes limitations of existing approaches
– Usable as drop-in replacement
• Future work
– Scalability beyond 59 cores
– Optimization for other memory architectures
9/26/2010 39Stefan Marr, IEEE HPCC 2010, Insertion TreePhasers
9/26/2010 40Stefan Marr, IEEE HPCC 2010, Insertion TreePhasers
h1:L
h1:R
Phaserphase: 1
A1 A2 A3 A4
wo wo
wo
1rsm
d
1 1
1
1rsm
d
• Implementation
– http://barriers.googlecode.com/
– MIT license
Questions?
References
[1] Shirako, Jun &Sarkar, Vivek: Hierarchical Phasers for Scalable Synchronization and Reductions in Dynamic Parallelism In: Proc. of IEEE IPDPS (2010).
9/26/2010 41