more definite results from the pluto scheduling...

24
More Definite Results From the Pluto Scheduling Algorithm By Athanasios Konstantinidis Supervisor Paul H. J. Kelly

Upload: others

Post on 03-Sep-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: More Definite Results From the Pluto Scheduling Algorithmperso.ens-lyon.fr/christophe.alias/impact2011/impact-02... · 2011. 3. 30. · Ordering Sensitivity (example) Cost = 1 •

More Definite Results From the Pluto Scheduling Algorithm

By

Athanasios Konstantinidis

Supervisor

Paul H. J. Kelly

Page 2: More Definite Results From the Pluto Scheduling Algorithmperso.ens-lyon.fr/christophe.alias/impact2011/impact-02... · 2011. 3. 30. · Ordering Sensitivity (example) Cost = 1 •

About Me

• PhD student at Imperial College London supervised by Paul H. J. Kelly. • Compiler and Language support for Heterogeneous parallel architectures (e.g.

GPGPUs, Cell BE, Multicore etc.). • Developing our own source-to-source polyhedral compiler (CUDA back-end).

• Sponsored by EPSRC and Codeplay Software Ltd.

Page 3: More Definite Results From the Pluto Scheduling Algorithmperso.ens-lyon.fr/christophe.alias/impact2011/impact-02... · 2011. 3. 30. · Ordering Sensitivity (example) Cost = 1 •

Our Polyhedral Framework

Control Graph

Extraction

ROSE AST Polyhedral Model

extraction

Main graph Polyhedral framework

poly graph

PLuTo Scheduling

Dependence Analysis

Constraints Polyhedral scanning algorithm (CLooG)

Affine Transformations

CLooG graph

extraction

CLooG IR CUDA graph

extraction

Graph to AST

CLooG graph

CUDA graph

ROSE AST

Page 4: More Definite Results From the Pluto Scheduling Algorithmperso.ens-lyon.fr/christophe.alias/impact2011/impact-02... · 2011. 3. 30. · Ordering Sensitivity (example) Cost = 1 •

Our Polyhedral Framework

Control Graph

Extraction

ROSE AST Polyhedral Model

extraction

Main graph Polyhedral framework

poly graph

PLuTo Scheduling

Dependence Analysis

Constraints Polyhedral scanning algorithm (CLooG)

Affine Transformations

CLooG graph

extraction

CLooG IR CUDA graph

extraction

Graph to AST

CLooG graph

CUDA graph

ROSE AST

Does not require file I/O for syntactic post-processing

Page 5: More Definite Results From the Pluto Scheduling Algorithmperso.ens-lyon.fr/christophe.alias/impact2011/impact-02... · 2011. 3. 30. · Ordering Sensitivity (example) Cost = 1 •

The layout of the constraints can affect the scheduling solutions

Our Polyhedral Framework

Control Graph

Extraction

ROSE AST Polyhedral Model

extraction

Main graph Polyhedral framework

poly graph

PLuTo Scheduling

Dependence Analysis

Constraints Polyhedral scanning algorithm (CLooG)

Affine Transformations

CLooG graph

extraction

CLooG IR CUDA graph

extraction

Graph to AST

CLooG graph

CUDA graph

ROSE AST

Page 6: More Definite Results From the Pluto Scheduling Algorithmperso.ens-lyon.fr/christophe.alias/impact2011/impact-02... · 2011. 3. 30. · Ordering Sensitivity (example) Cost = 1 •

PLuTo Scheduling Algorithm 1 • Iteratively looks for a maximal set of linearly independent affine transforms of

the original iteration space.

• An affine transform is a hyperplane representing a loop in the transformed iteration space.

• Each hyperplane needs to respect a set of constraints that guarantee legality and minimum communication between hyperplane instances (i.e. between different loop iterations).

space time

space

time time

space

Page 7: More Definite Results From the Pluto Scheduling Algorithmperso.ens-lyon.fr/christophe.alias/impact2011/impact-02... · 2011. 3. 30. · Ordering Sensitivity (example) Cost = 1 •

PLuTo Scheduling Algorithm 2

MAX + scalar dimensions

• Solve(M) Uses a Parametric Integer Programming Library (PIP) to find the lexicographic minimum solution.

Page 8: More Definite Results From the Pluto Scheduling Algorithmperso.ens-lyon.fr/christophe.alias/impact2011/impact-02... · 2011. 3. 30. · Ordering Sensitivity (example) Cost = 1 •

Global Constraint Matrix M Empty M Legality M Communication Bounding M Non-Trivial solution While ( Solve(M) ) { M Linear Independence }

PLuTo Scheduling Algorithm 3

• Iteratively find as many linearly independent solution as possible

Page 9: More Definite Results From the Pluto Scheduling Algorithmperso.ens-lyon.fr/christophe.alias/impact2011/impact-02... · 2011. 3. 30. · Ordering Sensitivity (example) Cost = 1 •

Global Constraint Matrix M Empty M Legality M Communication Bounding M Non-Trivial solution While ( Solve(M) ) { M Linear Independence } Cut in SCC If NO solution is found Remove Killed dependences

PLuTo Scheduling Algorithm 4

• If NO MORE solutions can be found remove any killed dependences • If NO solution was found cut the dependence graph into Strongly Connected

Components (SCC) – loop distribution – and remove the killed dependences

Page 10: More Definite Results From the Pluto Scheduling Algorithmperso.ens-lyon.fr/christophe.alias/impact2011/impact-02... · 2011. 3. 30. · Ordering Sensitivity (example) Cost = 1 •

do { Global Constraint Matrix M Empty M Legality M Communication Bounding M Non-Trivial solution While ( Solve(M) ) { M Linear Independence } Cut in SCC If NO solution is found Remove Killed dependences } While ( (total_sols < MAX) OR (deps ≠ 0) )

PLuTo Scheduling Algorithm 5

• Iteratively find bands of fully permutable loop nests

Page 11: More Definite Results From the Pluto Scheduling Algorithmperso.ens-lyon.fr/christophe.alias/impact2011/impact-02... · 2011. 3. 30. · Ordering Sensitivity (example) Cost = 1 •

Communication Bounding Constraints

Affine Form on

Structure Parameters

• For every dependence edge e :

Farkas Lemma

Parameters Unknown schedule coefficients

Constant

Identification

h-transformation

Page 12: More Definite Results From the Pluto Scheduling Algorithmperso.ens-lyon.fr/christophe.alias/impact2011/impact-02... · 2011. 3. 30. · Ordering Sensitivity (example) Cost = 1 •

Ordering Sensitivity 1 Cost

• For the same Cost the solution we will get from the PIP solver will eventually depend on the ordering of the transformation coefficients.

Ordering of Transformation Coefficients

Page 13: More Definite Results From the Pluto Scheduling Algorithmperso.ens-lyon.fr/christophe.alias/impact2011/impact-02... · 2011. 3. 30. · Ordering Sensitivity (example) Cost = 1 •

Ordering Sensitivity (example)

Cost = 1

• Minimum Cost is 1. • No outer parallel loop.

i

j

for i = 0,N for j = 0, N A[i][j] = A[i-1][j]*A[i-1][j-1]

0 1

Page 14: More Definite Results From the Pluto Scheduling Algorithmperso.ens-lyon.fr/christophe.alias/impact2011/impact-02... · 2011. 3. 30. · Ordering Sensitivity (example) Cost = 1 •

Ordering Sensitivity (example) • By changing the order of the transformation coefficients we get two different

solutions both having Cost = 1.

i

j

Order 1 :

Cost = 1

0 1 0

Order 2 :

Cost = 1

0 1 0

i

j

Page 15: More Definite Results From the Pluto Scheduling Algorithmperso.ens-lyon.fr/christophe.alias/impact2011/impact-02... · 2011. 3. 30. · Ordering Sensitivity (example) Cost = 1 •

Ordering Sensitivity (example) • By adding the linear independence constraints we get a second solution. • Order 2 yields an inner loop that is fully parallel. • Which solution/order is better ?

i

j

Order 1 :

Cost = 1

1 0 0

Order 2 :

Cost = 1

1 0 0

i

j

Fully Parallel Inner Loop Pipeline/Wavefront

Page 16: More Definite Results From the Pluto Scheduling Algorithmperso.ens-lyon.fr/christophe.alias/impact2011/impact-02... · 2011. 3. 30. · Ordering Sensitivity (example) Cost = 1 •

Pipeline Degrees of Parallelism

• N Non-parallel loops can be transformed into a wavefront/pipeline consisted of one sequential and N-1 parallel loops i.e. degrees of parallelism.

Wavefront/pipeline Non-parallel loops

i

j

i

j

Page 17: More Definite Results From the Pluto Scheduling Algorithmperso.ens-lyon.fr/christophe.alias/impact2011/impact-02... · 2011. 3. 30. · Ordering Sensitivity (example) Cost = 1 •

Pipeline Degrees of Parallelism

i

j

Start-up Cost Drain Cost

i

j

Better spatial/temporal Locality along a wavefront

Page 18: More Definite Results From the Pluto Scheduling Algorithmperso.ens-lyon.fr/christophe.alias/impact2011/impact-02... · 2011. 3. 30. · Ordering Sensitivity (example) Cost = 1 •

Pipeline Degrees of Parallelism

i

j

Start-up Cost Drain Cost

i

j

Better spatial/temporal Locality along a wavefront

Depend on structure parameters

Page 19: More Definite Results From the Pluto Scheduling Algorithmperso.ens-lyon.fr/christophe.alias/impact2011/impact-02... · 2011. 3. 30. · Ordering Sensitivity (example) Cost = 1 •

Pipeline Degrees of Parallelism

i

j

Start-up Cost Drain Cost

i

j

Better spatial/temporal Locality along a wavefront

Depend on structure parameters

Number of Read-after-Read dependences that lie within the wavefront

Page 20: More Definite Results From the Pluto Scheduling Algorithmperso.ens-lyon.fr/christophe.alias/impact2011/impact-02... · 2011. 3. 30. · Ordering Sensitivity (example) Cost = 1 •

Fully Parallel vs Pipeline Degrees of Parallelism 1

• We propose a way of distinguishing between fully parallel and pipeline degrees of parallelism.

• We use dependence direction vectors in order to expose inner fully parallel degrees of parallelism.

Direction Information :

bit vector If e extends along i If e does not extend along i

Boolean If e extends in only 1 dimension If e extends in more than 1 dimensions

Page 21: More Definite Results From the Pluto Scheduling Algorithmperso.ens-lyon.fr/christophe.alias/impact2011/impact-02... · 2011. 3. 30. · Ordering Sensitivity (example) Cost = 1 •

Fully Parallel vs Pipeline Degrees of Parallelism 2

i

j

1e

2e

Page 22: More Definite Results From the Pluto Scheduling Algorithmperso.ens-lyon.fr/christophe.alias/impact2011/impact-02... · 2011. 3. 30. · Ordering Sensitivity (example) Cost = 1 •

Fully Parallel vs Pipeline Degrees of Parallelism 3

i

j

i j

Fully parallel dimension

• By placing the coefficients of fully parallel dimensions in leading minimization positions we are effectively pushing them towards inner nest levels.

• As a result fully parallel degrees of parallelism can be recovered.

Page 23: More Definite Results From the Pluto Scheduling Algorithmperso.ens-lyon.fr/christophe.alias/impact2011/impact-02... · 2011. 3. 30. · Ordering Sensitivity (example) Cost = 1 •

Conclusions

• The PLuTo scheduling algorithm iteratively finds affine transformations that minimize communication.

• For the same minimum communication the solution might be sensitive to the ordering of the affine transformation coefficients in the global constraint matrix.

• We might have to choose between fully parallel and pipeline degrees of parallelism.

• We propose a method for distinguishing between fully parallel and pipeline degrees of parallelism.

• We use dependence direction information in order to expose inner fully parallel loops.

Page 24: More Definite Results From the Pluto Scheduling Algorithmperso.ens-lyon.fr/christophe.alias/impact2011/impact-02... · 2011. 3. 30. · Ordering Sensitivity (example) Cost = 1 •

Thank You !

Any Questions ?