insertion tree phasers: efficient and scalable barrier synchronization for fine-grained parallelism

39
Insertion Tree Phasers Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism Stefan Marr S. Verhaegen, B. De Fraine, T. D’Hondt, W. De Meuter Software Languages Lab VrijeUniversiteitBrussel

Upload: stefan-marr

Post on 02-Jul-2015

635 views

Category:

Documents


4 download

DESCRIPTION

This paper presents an algorithm and a data structure for scalable dynamic synchronization in fine-grained parallelism. The algorithm supports the full generality of phasers with dynamic, two-phase, and point-to-point synchronization. It retains the scalability of classical tree barriers, but provides unbounded dynamicity by employing a tailor-made insertion tree data structure.It is the first completely documented implementation strategy for a scalable phaser synchronization construct. Our evaluation shows that it can be used as a drop-in replacement for classic barriers without harming performance, despite its additional complexity and potential for performance optimizations. Furthermore, our approach overcomes performance and scalability limitations which have been present in other phaser proposals.

TRANSCRIPT

Page 1: Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism

Insertion Tree Phasers

Efficient and Scalable Barrier Synchronization for Fine-grained

Parallelism

Stefan Marr

S. Verhaegen, B. De Fraine, T. D’Hondt, W. De Meuter

Software Languages Lab

VrijeUniversiteitBrussel

Page 2: Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism

Agenda

• Introduction

– Barriers, Phasers

• Insertion Tree Phasers

– Insertion Tree

– Phaser Algorithm

• Evaluation

• Summary

9/26/2010 2

Page 3: Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism

Barriers

• Synchronizing parallel activities

– High productivity: easy to get right

• Mostly for scientific computing

• Many-core evolution

– Synchronizing dynamic and irregular problems

– Requires low-overhead dynamic hierarchical barriers

9/26/2010 3Introduction

t1

p1

t2

p1

t3

p1

t1

p2

t2

p2

t3

p2

t1

p3

t2

p3

t3

p3

Page 4: Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism

Phasers

9/26/2010 4Introduction

• Extension of X10 clocks

– Clocks: dynamic two-phase barrier for fork/join parallelism

• Registration modes for barrier

– Enables expression of producer/consumer relation

• Single statements

– Executed only by single thread, avoids duplicated barrier operations

t1

p1

t1

p2

t2

p2

t3

p2

t2

p2

t3

p2

t2

p3

t3

p3

Page 5: Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism

Hierarchical Phasers

9/26/2010 5Introduction

• First scalable implementation strategy

• Predefined tree structure– Degree, i.e., tree arity

– Max. number of tiers, i.e., height

– Composed from phasers

• Problematic– None dynamic structure

– Two-phase support incomplete

– Leaves design decisions open

Shirako&Sarkar in Proc. of IEEE IPDPS 2010 [1]

sig sig sig sig sig sig sig sig

A1 A2 A3 A4 A5 A6 A7 A8

sub sub

Phaser

sub sub sub sub

Tier 0

Tier 1

Tier 2(leafs)

Array accessList access

Page 6: Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism

INSERTION TREE PHASERS

9/26/2010 7

Page 7: Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism

Design Goal

• Support for full generality of Phaser properties

– Two-phase support

– Signal-only/wait-only for producers/consumers

– Single statement

– Full dynamicity: fine-grained hierarchical fork/join

• Adaptation of existing, scalable approaches

– Dissemination barrier not adaptable

– Remaining are tree-based approaches

9/26/2010 8Insertion TreePhaserAlgorithm

Page 8: Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism

Insertion Tree

• Goals

– Stable, i.e., minimized tree modifications

• Avoid inconsistent synchronization information

– Maximizing parallel operations

• Solution: Insertion Tree

– Inverted tree

– No removal

– Complete smallest subtree first

9/26/2010 9Insertion TreePhaserAlgorithm

1/2

Page 9: Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism

Insertion Tree

9/26/2010 10Insertion TreePhaserAlgorithm

2/2

Page 10: Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism

Insertion Tree

9/26/2010 11Insertion TreePhaserAlgorithm

2/2

1

Page 11: Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism

Insertion Tree

9/26/2010 12Insertion TreePhaserAlgorithm

2/2

1 2

h1

Page 12: Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism

Insertion Tree

9/26/2010 13Insertion TreePhaserAlgorithm

2/2

1 2

h1

h2

3

Page 13: Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism

Insertion Tree

9/26/2010 14Insertion TreePhaserAlgorithm

2/2

1 2

h1

h2

3

h3

4

Page 14: Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism

Insertion Tree

9/26/2010 15Insertion TreePhaserAlgorithm

2/2

1 2

h1

h2

3

h3

4 5

h4

Page 15: Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism

Insertion Tree

9/26/2010 16Insertion TreePhaserAlgorithm

2/2

1 2

h1

h2

3

h3

4 5 6 7 8

h5 h7

h6

h4

Page 16: Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism

Synchronization Tree*

9/26/2010 18Insertion TreePhaserAlgorithm

Phaserphase: 0

A1 A2 A3 A4

0 0 0 0wo

0 0

*) is simplified, leaves out registration modes

0 0 0 0rsm

d

Participant nodes

Helper nodes

Phase counter

Resume flag

Phase counter

Wait-only flag

Page 17: Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism

Announcing Synchronization

9/26/2010 19Insertion Tree Phaser Algorithm

Phaserphase: 0

A1 A2 A3 A4

0 0 0 0

0 0

0 0 0 0

Page 18: Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism

Announcing Synchronization

9/26/2010 20Insertion Tree Phaser Algorithm

Phaserphase: 0

A1 A2 A3 A4

0 1 1 0

0 0

0 01rsm

d

1rsm

d

Page 19: Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism

Announcing Synchronization

9/26/2010 21Insertion Tree Phaser Algorithm

Phaserphase: 0

A1 A2 A3 A4

1 1 1 1

0 0

1rsm

d

1rsm

d

1rsm

d

1rsm

d

Page 20: Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism

Announcing Synchronization

9/26/2010 22Insertion Tree Phaser Algorithm

Phaserphase: 0

A1 A2 A3 A4

1 1 1 1

0 1

1rsm

d

1rsm

d

1rsm

d

1rsm

d

Page 21: Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism

Announcing Synchronization

9/26/2010 23Insertion Tree Phaser Algorithm

Phaserphase: 0

A1 A2 A3 A4

1 1 1 1

1 1

1rsm

d

1rsm

d

1rsm

d

1rsm

d

Page 22: Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism

Announcing Synchronization

9/26/2010 24Insertion Tree Phaser Algorithm

Phaserphase: 1

A1 A2 A3 A4

1 1 1 1

1 1

1rsm

d

1rsm

d

1rsm

d

1rsm

d

Synchronization reached.Continue to next phase.

Page 23: Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism

Dropping Participants

9/26/2010 25Insertion TreePhaserAlgorithm

Phaserphase: 0

A1 A2 A3 A4

0 0

0

0 0 1rsm

d

1 1

1

1rsm

d

Page 24: Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism

Dropping Participants

9/26/2010 26Insertion TreePhaserAlgorithm

Phaserphase: 0

A1 A2 A3 A4

0

0

0 1rsm

d

wo1 1

1

1rsm

d

Page 25: Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism

h1:R

Dropping Participants

9/26/2010 27Insertion TreePhaserAlgorithm

Phaserphase: 0

A1 A2 A3 A4

0

1rsm

d

wo1

wo1

1

1rsm

d

Page 26: Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism

h1:R

Dropping Participants

9/26/2010 28Insertion TreePhaserAlgorithm

Phaserphase: 0

A1 A2 A3 A4

1rsm

d

wo1

wo1

wo1

1rsm

d

Page 27: Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism

h1:R

Dropping Participants

9/26/2010 29Insertion TreePhaserAlgorithm

Phaserphase: 1

A1 A2 A3 A4

1rsm

d

wo1

wo1

wo1

1rsm

d

Synchronization reached.Continue to next phase.

Page 28: Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism

h1:L

h1:R

Dropping Participants

9/26/2010 30Insertion TreePhaserAlgorithm

Phaserphase: 1

A1 A2 A3 A4

1rsm

d

wo1

wo1

wo1

1rsm

d

Page 29: Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism

Adding New Participants

9/26/2010 31Insertion TreePhaserAlgorithm

Phaserphase: 8

A1 A2 A3 A4

9 8

8 9

9rsm

d

8 9rsm

d

8

Page 30: Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism

Adding New Participants

9/26/2010 32Insertion TreePhaserAlgorithm

Phaserphase: 8

A1 A2 A3 A4

9 8 8 8

8 9

9rsm

d

8 9rsm

d

8

Page 31: Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism

Adding New Participants

9/26/2010 33Insertion TreePhaserAlgorithm

Phaserphase: 8

A1 A2 A3 A4

9 8 9 8

8 8

9rsm

d

8 9rsm

d

8

-1

+1

Page 32: Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism

Adding New Participants

9/26/2010 34Insertion TreePhaserAlgorithm

Phaserphase: 8

A1 A2 A3 A4

9 8 9 8

8 8

9rsm

d

8 9rsm

d

8

propagate phase count

Page 33: Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism

EVALUATION

9/26/2010 35

Page 34: Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism

Two-Phaser Barrier Operation

9/26/2010 36Evaluation

80

85

90

95

100

105

110

115

120

0 8 16 24 32 40 48 56

µs

TreeBarrier HabaneroPhaser Central

Dissemination InsertionTreePhaser TmcSpinBarrier

Page 35: Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism

Overhead: Two-Phase vs. Classic

9/26/2010 37Evaluation

30%

40%

50%

60%

70%

80%

90%

100%

110%

120%

130%

0 8 16 24 32 40 48 56TreeBarrier HabaneroPhaser CentralDissemination TreePhaser TmcSpinBarrier

Page 36: Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism

Use as Drop-In Replacement for SPLASH-2Speedup compared to TmcSpinBarrier

9/26/2010 38Evaluation

-20%

-15%

-10%

-5%

0%

5%

10%

15%

20%

Barnes58 cores

Cholesky58 cores

LU58 cores

Radiosity58 cores

Water-Spacial58 cores

FFT32 cores

Ocean32 cores

Radix32 cores

TreeBarrier HabaneroPhaser Central Dissemination InsertionTreePhaser

Page 37: Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism

Summary

• Scalable and efficient approach to Phasers

– Documents implementation

– Based on fully dynamic insertion tree

– Overcomes limitations of existing approaches

– Usable as drop-in replacement

• Future work

– Scalability beyond 59 cores

– Optimization for other memory architectures

9/26/2010 39Stefan Marr, IEEE HPCC 2010, Insertion TreePhasers

Page 38: Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism

9/26/2010 40Stefan Marr, IEEE HPCC 2010, Insertion TreePhasers

h1:L

h1:R

Phaserphase: 1

A1 A2 A3 A4

wo wo

wo

1rsm

d

1 1

1

1rsm

d

• Implementation

– http://barriers.googlecode.com/

– MIT license

Questions?

Page 39: Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism

References

[1] Shirako, Jun &Sarkar, Vivek: Hierarchical Phasers for Scalable Synchronization and Reductions in Dynamic Parallelism In: Proc. of IEEE IPDPS (2010).

9/26/2010 41