revisiting loop transformations with x10 clocks tomofumi yuki inria / lip / ens lyon x10 workshop...

39
Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015

Upload: clemence-hunt

Post on 18-Jan-2018

223 views

Category:

Documents


0 download

DESCRIPTION

Programming with X10 Small set of parallel constructs finish/async clocks at (places), atomic, when Can be composed freely Interesting for both programmer and compilers also challenging X10 Workshop But, it seems to be under-utilized

TRANSCRIPT

Page 1: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015

Revisiting Loop Transformations with X10 Clocks

Tomofumi YukiInria / LIP / ENS LyonX10 Workshop 2015

Page 2: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015

X10 Workshop 2015

2

The Problem The Parallelism Challenge

cannot escape going parallel parallel programming is hard automatic parallelization is limited

There won’t be any Silver Bullet X10 as a partial answer

high-level language with parallelism in mind

features to control parallelism/locality

Page 3: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015

X10 Workshop 2015

3

Programming with X10 Small set of parallel constructs

finish/async clocks at (places), atomic, when

Can be composed freely Interesting for both programmer and

compilers also challenging

But, it seems to be under-utilized

Page 4: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015

X10 Workshop 2015

4

This Paper Exploring how to use X10 clocks

performance

“usual” way

expressivity

alternative

performance

“usual” way

Page 5: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015

X10 Workshop 2015

5

Context: Loop Transformations Key to expose parallelism

some times it’s easy

but not always

for i for j X[i] += ...

for i forall j X[i] += ...

for i = 1 .. 2N+M forall j = /*complex bounds*/ X[j] = foo(X[2*j-i-1], X[2*j-i+1]);

for i = 0 .. N for j = 1 .. M X[j] = foo(X[j-1], X[j+1]);

Page 6: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015

X10 Workshop 2015

6

Automatic Parallelization Very sensitive to inputsfor (i=1; i<N; i++) for (j=1; j<M; j++) x[i][j] = x[i-1][j] + x[i][j-1];

for (i=1; i <N-1; i++) for (j=1; j<M-1; j++) y[i][j] = y[i-1][j] + y[i][j-1] + x[i+1][j+1];

for (t1=2;t1<=3;t1++) { lbp=1; ubp=t1-1;#pragma omp parallel for private(lbv,ubv,t3) for (t2=lbp;t2<=ubp;t2++) { S1((t1-t2),t2); } }

for (t1=4;t1<=min(M,N);t1++) { S1((t1-1),1); lbp=2; ubp=t1-2;

#pragma omp parallel for private(lbv,ubv,t3) for (t2=lbp;t2<=ubp;t2++) { S1((t1-t2),t2); S2((t1-t2-1),(t2-1)); } S1(1,(t1-1)); }

for (t1=M+1;t1<=N;t1++) { S1((t1-1),1); lbp=2; ubp=M-1;#pragma omp parallel for private(lbv,ubv,t3) for (t2=lbp;t2<=ubp;t2++) { S1((t1-t2),t2); S2((t1-t2-1),(t2-1)); } } for (t1=N+1;t1<=M;t1++) { lbp=t1-N+1; ubp=t1-2;#pragma omp parallel for private(lbv,ubv,t3) for (t2=lbp;t2<=ubp;t2++) { S1((t1-t2),t2); S2((t1-t2-1),(t2-1)); } S1(1,(t1-1)); } }

for (t1=max(M+1,N+1);t1<=N+M-2;t1++) { lbp=t1-N+1; ubp=M-1;

#pragma omp parallel for private(lbv,ubv,t3) for (t2=lbp;t2<=ubp;t2++) { S1((t1-t2),t2); S2((t1-t2-1),(t2-1)); } }

very difficult to understandtrust it or not use it

Page 7: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015

X10 Workshop 2015

7

Goal: retain the original structure

Expressing with Clocks

async for (i=1; i<N; i++) advance; async for (j=1; j<M; j++) x[i][j] = x[i-1][j] + x[i][j-1]; advance;advance;async for (i=1; i <N-1; i++) advance; async for (j=1; j<M-1; j++) y[i][j] = y[i-1][j] + y[i][j-1] + x[i+1][j+1]; advance;

for (i=1; i<N; i++) for (j=1; j<M; j++) x[i][j] = x[i-1][j] + x[i][j-1];

for (i=1; i <N-1; i++) for (j=1; j<M-1; j++) y[i][j] = y[i-1][j] + y[i][j-1] + x[i+1][j+1];

Page 8: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015

X10 Workshop 2015

8

Goal: retain the original structure

Expressing with Clocks

async for (i=1; i<N; i++) advance; async for (j=1; j<M; j++) x[i][j] = x[i-1][j] + x[i][j-1]; advance;advance;async for (i=1; i <N-1; i++) advance; async for (j=1; j<M-1; j++) y[i][j] = y[i-1][j] + y[i][j-1] + x[i+1][j+1]; advance;

Page 9: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015

X10 Workshop 2015

9

Goal: retain the original structure

Expressing with Clocks

async for (i=1; i<N; i++) advance; async for (j=1; j<M; j++) x[i][j] = x[i-1][j] + x[i][j-1]; advance;advance;async for (i=1; i <N-1; i++) advance; async for (j=1; j<M-1; j++) y[i][j] = y[i-1][j] + y[i][j-1] + x[i+1][j+1]; advance;

1. make many iterations parallel

Page 10: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015

X10 Workshop 2015

10

Goal: retain the original structure

Expressing with Clocks

async for (i=1; i<N; i++) advance; async for (j=1; j<M; j++) x[i][j] = x[i-1][j] + x[i][j-1]; advance;advance;async for (i=1; i <N-1; i++) advance; async for (j=1; j<M-1; j++) y[i][j] = y[i-1][j] + y[i][j-1] + x[i+1][j+1]; advance;

1. make many iterations parallel

2. order them by synchronizations

Page 11: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015

X10 Workshop 2015

11

Goal: retain the original structure

Expressing with Clocks

async for (i=1; i<N; i++) advance; async for (j=1; j<M; j++) x[i][j] = x[i-1][j] + x[i][j-1]; advance;advance;async for (i=1; i <N-1; i++) advance; async for (j=1; j<M-1; j++) y[i][j] = y[i-1][j] + y[i][j-1] + x[i+1][j+1]; advance;

1. make many iterations parallel

2. order them by synchronizationscompound effect: parallelism similar to those with loop trans.

Page 12: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015

X10 Workshop 2015

12

Outline Introduction X10 Clocks Examples Discussion

Page 13: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015

13

clocks vs barriers Barriers can easily deadlock

Clocks are more dynamic

//P2barrier;S1;

//P1barrier;S0;barrier;

//P2advance;S1;

//P1advance;S0;advance;

Page 14: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015

14

clocks vs barriers Barriers can easily deadlock

Clocks are more dynamic

//P2barrier;S1;

//P1barrier;S0;barrier;

deadlock

//P2advance;S1;

//P1advance;S0;advance;

Page 15: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015

15

clocks vs barriers Barriers can easily deadlock

Clocks are more dynamic

//P2barrier;S1;

//P1barrier;S0;barrier;

deadlock

//P2advance;S1;

//P1advance;S0;advance;

OK

Page 16: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015

16

Dynamicity of Clocks Implicit Syntax

The process creating a clock is also registered

clocked finish for (i=1:N) clocked async { for (j=i:N) advance; S0; }

Creation of a clock Each process is registered

Each process is un-registered Sync registered processes

Page 17: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015

17

Dynamicity of Clocks Implicit Syntax

Each process waits until all processes starts The primary process has to terminate

first

clocked finish for (i=1:N) clocked async { for (j=i:N) advance; S0; }

Page 18: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015

18

Dynamicity of Clocks Implicit Syntax

Each process waits until all processes starts The primary process has to terminate

first

clocked finish for (i=1:N) clocked async { for (j=i:N) advance; S0; }

activ

ity 1

Page 19: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015

19

Dynamicity of Clocks Implicit Syntax

Each process waits until all processes starts The primary process has to terminate

first

clocked finish for (i=1:N) clocked async { for (j=i:N) advance; S0; }

activ

ity 1

activ

ity 2

Page 20: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015

20

Dynamicity of Clocks Implicit Syntax

Each process waits until all processes starts The primary process has to terminate

first

clocked finish for (i=1:N) clocked async { for (j=i:N) advance; S0; }

activ

ity 1

activ

ity 2

activ

ity 3

activ

ity 4

activ

ity 5

activ

ity 6

Page 21: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015

21

Dynamicity of Clocks Implicit Syntax

The primary process calls advance each time Different synchronization pattern

clocked finish for (i=1:N) { clocked async { for (j=i:N) advance; S0; }

advance; }

Page 22: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015

22

Dynamicity of Clocks Implicit Syntax

The primary process calls advance each time Different synchronization pattern

clocked finish for (i=1:N) { clocked async { for (j=i:N) advance; S0; }

advance; } advance by primary activity

Page 23: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015

23

Dynamicity of Clocks Implicit Syntax

The primary process calls advance each time Different synchronization pattern

clocked finish for (i=1:N) { clocked async { for (j=i:N) advance; S0; }

advance; } advance by primary activity

activ

ity 1

activ

ity 2

activ

ity 3

activ

ity 4

activ

ity 5

activ

ity 6

Page 24: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015

X10 Workshop 2015

24

Outline Introduction X10 Clocks Examples Discussion

Page 25: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015

25

Example: Skewing Skewing the loops is not easy

for (i=1:N) for (j=1:N) h[i][j] = foo(h[i-1][j], h[i-1][j-1], h[i][j-1])

skewing

Page 26: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015

26

Example: Skewing Skewing the loops is not easy

for (i=1:2N-1) forall (j=max(1,i-N):min(N,i-1)) h[i][j] = foo(h[(i-j)-1][j], h[(i-j)-1][j-1], h[(i-j)][j-1])

skewing

Page 27: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015

27

Example: Skewing Skewing the loops is not easy

for (i=1:2N-1) forall (j=max(1,i-N):min(N,i-1)) h[i][j] = foo(h[(i-j)-1][j], h[(i-j)-1][j-1], h[(i-j)][j-1])

skewing

changes to loop bounds and

indexing

Page 28: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015

28

Example: Skewing Equivalent parallelism without changing

loopsclocked finish for (i=1:N) { clocked async for (j=1:N) { h[i][j] = foo(h[i-1][j], h[i-1][j-1], h[i][j-1]); advance; } advance; }

Page 29: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015

29

Example: Skewing Equivalent parallelism without changing

loopsclocked finish for (i=1:N) { clocked async for (j=1:N) { h[i][j] = foo(h[i-1][j], h[i-1][j-1], h[i][j-1]); advance; } advance; }

locally sequential

the launch of the entire block is deferred

Page 30: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015

30

Example: Skewing You can have the same skewing

clocked finish for (j=1:N) { clocked async for (i=1:N) { h[i][j] = foo(h[i-1][j], h[i-1][j-1], h[i][j-1]); advance; } advance; }

skewing

Page 31: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015

31

Example: Skewing You can have the same skewing

clocked finish for (j=1:N) { clocked async for (i=1:N) { h[i][j] = foo(h[i-1][j], h[i-1][j-1], h[i][j-1]); advance; } advance; }

skewing

note: interchangeouter parallel loop with

clocks

Page 32: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015

32

Example: Skewing You can have the same skewing

clocked finish for (j=1:N) { clocked async for (i=1:N) { h[i][j] = foo(h[i-1][j], h[i-1][j-1], h[i][j-1]); advance; } advance; }

skewing

advance;

Page 33: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015

X10 Workshop 2015

33

Example: Loop Fission Common use of barriers

forall (i=1:N) S1; S2;

forall (i=1:N) S1;

forall (i=1:N) S2;

for (i=1:N) async { S1; S2; }

for (i=1:N) async { S1; advance; S2; }

Page 34: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015

X10 Workshop 2015

34

Example: Loop Fusion Removes all the parallelism

for (i=1:N) S1;for (i=1:N) S2;

for (i=1:N) S1; S2;

async for (i=1:N) S1; advance; advance;advance;async for (i=1:N) S2; advance; advance;

Page 35: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015

X10 Workshop 2015

35

Example: Loop Fusion Sometimes fusion is not too simple

for (i=1:N-1) S1(i);for (i=2:N) S2(i);

S1(1); for (i=2:N-1) S1(i); S2(i);S2(N);

async for (i=1:N-1) S1; advance; advance;advance;async for (i=2:N) S2; advance; advance;

code structure stays# of advance control

Page 36: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015

X10 Workshop 2015

36

What can be expressed? Limiting factor: parallelism

difficult to use for sequential loop nests works for wave-front parallelism

Intuition clocks defer execution deferring parent activity has cumulative

effect

Page 37: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015

X10 Workshop 2015

37

Discussion Learning curve

behavior of clock takes time to understand

How much can you express? 1D affine schedules for sure loop permutation is not possible what if we use multiple clocks?

Page 38: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015

X10 Workshop 2015

38

Potential Applications It might be easier for some people

have multiple ways to write code

Detect X10 fragments with such property convert to forall for performance

Page 39: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015

X10 Workshop 2015

39