UNIVERSITAT POLITÈCNICA DE CATALUNYA
Departament d’Arquitectura de Computadors

Exploiting Pseudo-schedules to Guide Data Dependence Graph Partitioning

Alex Aletà, Josep M. Codina, Jesús Sánchez, Antonio González, David Kaeli
{aaleta, jmcodina, fran, antonio}@ac.upc.es [email protected]

PACT 2002, Charlottesville, Virginia – September 2002



Page 2: Clustered Architectures

Current/future challenges in processor design
  Delay in the transmission of signals
  Power consumption
  Architecture complexity

Clustering: divide the system into semi-independent units
  Each unit is a cluster
  Fast interconnects intra-cluster
  Slow interconnects inter-cluster

Common trend in commercial VLIW processors
  TI’s C6x
  Analog’s TigerSHARC
  HP’s LX
  Equator’s MAP1000

Page 3: Architecture Overview

[Figure: block diagram with an L1 cache and clusters 1 to n. Each cluster has a local register file, two FUs and a MEM unit; the clusters are connected by register buses.]

Page 4: Instruction Scheduling

For non-clustered architectures
  Resources
  Dependences

For clustered architectures
  Cluster assignment
    Minimize inter-cluster communication delays
    Exploit communication locality

This work focuses on modulo scheduling for clustered VLIW architectures
  Technique to schedule loops

Page 5: Talk Outline

Previous work

Proposed algorithm
  Overview
  Graph partitioning
  Pseudo-scheduling

Performance evaluation

Conclusions

Page 6: MS for Clustered Architectures

In previous work, two different approaches were proposed:

Two steps
  Data Dependence Graph partitioning: each instruction is assigned to a cluster
  Scheduling: instructions are scheduled in a suitable slot, but only in the preassigned cluster
  [Diagram: Cluster Assignment followed by Scheduling, with an II++ loop]

One step
  There is no initial cluster assignment
  The scheduler is free to choose any cluster
  [Diagram: Cluster Assignment + Scheduling as a single phase, with an II++ loop]

Page 7: Goal of the Work

Both approaches have benefits
  Two steps
    Global vision of the Data Dependence Graph
    Workload is better split among different clusters
    Number of communications is reduced
  One step
    Local vision of partial scheduling
    Cluster assignment is performed with information from the partial schedule

Goal: obtain an algorithm taking advantage of the benefits of both approaches

Page 8: Baseline

Baseline scheme: GP [Aletà et al., Micro34]
  Cluster assignment performed with a graph partitioning algorithm
  Feedback between the partitioning and the scheduler
  Results outperformed previous approaches
  Still little information available for cluster assignment

New algorithm: better partition
  Pseudo-schedules are used to guide the partition
    Global vision of the Data Dependence Graph
    More information to perform cluster assignment

Page 9: Algorithm Overview

[Flowchart]
  II := MII
  Compute initial partition
  Start scheduling
  Schedule Op_j based on the current partition
  Able to schedule?
    YES: select next operation (j++)
    NO: move Op_j to another cluster
      Able to schedule?
        YES: select next operation (j++)
        NO: II++, refine partition, start scheduling again
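One way to read the flowchart is as a driver loop around the scheduler. The sketch below is a minimal Python illustration, not the authors' implementation: `modulo_schedule`, `initial_partition`, `refine_partition`, `try_schedule` and the `max_ii` cap are hypothetical names introduced here, and the real partitioner and scheduler are assumed to be passed in as callbacks.

```python
from typing import Callable, Dict, Hashable, List, Optional

Op = Hashable
Partition = Dict[Op, int]  # operation -> cluster index


def modulo_schedule(
    ops: List[Op],
    mii: int,
    initial_partition: Callable[[int], Partition],
    refine_partition: Callable[[Partition, int], Partition],
    try_schedule: Callable[[Op, int, int, dict], bool],
    num_clusters: int,
    max_ii: int = 64,  # hypothetical safety cap, not from the slides
) -> Optional[dict]:
    """Raise II until every operation can be placed in some cluster."""
    ii = mii                                   # II := MII
    partition = initial_partition(ii)          # compute initial partition
    while ii <= max_ii:
        schedule: dict = {}                    # start scheduling
        for op in ops:                         # select next operation (j++)
            home = partition[op]
            # Try the cluster given by the partition first, then the others.
            candidates = [home] + [c for c in range(num_clusters) if c != home]
            for cluster in candidates:
                if try_schedule(op, cluster, ii, schedule):
                    partition[op] = cluster    # op may have moved
                    break
            else:
                break                          # op fits in no cluster
        else:
            return {"ii": ii, "partition": partition, "schedule": schedule}
        ii += 1                                # II++
        partition = refine_partition(partition, ii)
    return None
```

Passing the partitioner and the scheduler as callbacks keeps the control flow of the flowchart visible without committing to any particular partitioning or scheduling algorithm.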

Page 10: Algorithm Overview

[Flowchart repeated from Page 9.]

Page 11: Graph Partitioning Background

Problem statement
  Split the nodes into a pre-determined number of sets while optimizing some functions

Multilevel strategy
  Coarsen the graph
    Iteratively, fuse pairs of nodes into new macro-nodes
  Enhancing heuristics
    Avoid excess load in any one set
    Reduce execution time of the loops

Page 12: Graph Coarsening

Previous definitions
  Matching
  Slack

Iterate until there are as many nodes as clusters:
  The edges are weighted according to
    Impact on execution time of adding a bus delay to the edge
    Slack of the edge
  Then, select the maximum weight matching
    Nodes linked by edges in the matching are fused into a single macro-node
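The coarsening loop can be sketched as follows. This is a simplified illustration under stated assumptions: edge weights are taken as given inputs (the slide derives them from bus-delay impact and slack), a greedy heaviest-edge-first matching stands in for a true maximum weight matching, and weights of parallel edges are simply summed after fusing. The names `greedy_matching` and `coarsen` are introduced here for illustration only.

```python
from typing import Dict, FrozenSet, List, Tuple

Node = FrozenSet[str]                  # a macro-node is a set of original ops
Edges = Dict[Tuple[Node, Node], float]


def greedy_matching(edges: Edges) -> List[Tuple[Node, Node]]:
    """Pick heavy edges first; no node is used twice (approximation only)."""
    matched, picked = set(), []
    for (u, v), _w in sorted(edges.items(), key=lambda kv: -kv[1]):
        if u not in matched and v not in matched:
            picked.append((u, v))
            matched.update((u, v))
    return picked


def coarsen(nodes: List[Node], edges: Edges, num_clusters: int) -> List[Node]:
    """Fuse matched pairs until only num_clusters macro-nodes remain."""
    nodes = list(nodes)
    while len(nodes) > num_clusters:
        matching = greedy_matching(edges)
        if not matching:
            break                      # no edges left: nothing more to fuse
        # Do not fuse past the target number of macro-nodes.
        matching = matching[: len(nodes) - num_clusters]
        fused = {}
        for u, v in matching:
            fused[u] = fused[v] = u | v
        nodes = list(dict.fromkeys(fused.get(n, n) for n in nodes))
        new_edges: Edges = {}
        for (u, v), w in edges.items():
            a, b = fused.get(u, u), fused.get(v, v)
            if a != b:                 # edges inside a macro-node disappear
                key = (a, b) if sorted(a) <= sorted(b) else (b, a)
                new_edges[key] = new_edges.get(key, 0.0) + w
        edges = new_edges
    return nodes
```

Each surviving macro-node lists the original operations it contains, so the final macro-nodes directly give the partition induced on the original graph.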

Page 13: Coarsening Example

[Figure: an initial graph with weighted edges is reduced by two "Find matching" steps to the final graph.]

Page 14: Example (II)

1st STEP: Partition induced in the original graph

[Figure: the final graph obtained by coarsening the initial graph induces a partition on the initial graph.]

Page 15: Reducing Execution Time

Estimation of execution time needed
  Pseudo-schedules

Information obtained
  II
  SC (stage count)
  Lifetimes
  Spills

Page 16: Building pseudo-schedules

Dependences
  Respected if possible
  Else a penalty on register pressure and/or execution time is assessed

Cluster assignment
  Partition strictly followed

Page 17: Pseudo-schedule: example

2 clusters, 1 FU/cluster, 1 bus of latency 1, II = 2
Instruction latency = 3

[Figure: induced partition of a DDG with nodes A, B, C and D over Cluster 1 and Cluster 2.]

Pseudo-schedule
  Cycle  Operation
  0      A
  1
  2
  3      B
  4      D
  5
  6      C? NO
  7      C? NO

Resource usage per cluster (one row per slot, II = 2)
  Cluster 1   Cluster 2
  A           D
  B

Page 18: Pseudo-schedule: example

[Figure: induced partition of a DDG with nodes A, B, C and D over Cluster 1 and Cluster 2.]

Pseudo-schedule
  Cycle  Operation
  0      A
  1
  2
  3      B
  4      D
  5
  6
  7
  8      C

Resource usage per cluster (one row per slot, II = 2)
  Cluster 1   Cluster 2
  A,C         D
  B
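The failed attempts to place C at cycles 6 and 7 follow from modulo slot arithmetic: with II = 2, an operation at cycle t occupies slot t mod 2 of its cluster. The sketch below replays the example under one reading of the figure, namely that A, B and C belong to Cluster 1 and D to Cluster 2; that assignment is an assumption, and `ReservationTable` and `try_place` are names made up for this illustration.

```python
from typing import Dict, Tuple


class ReservationTable:
    """One FU per cluster; a slot is identified by (cluster, cycle mod II)."""

    def __init__(self, ii: int) -> None:
        self.ii = ii
        self.slots: Dict[Tuple[int, int], str] = {}

    def try_place(self, op: str, cluster: int, cycle: int) -> bool:
        key = (cluster, cycle % self.ii)
        if key in self.slots:
            return False               # slot already taken in this cluster
        self.slots[key] = op
        return True


table = ReservationTable(ii=2)
placed = [
    table.try_place("A", cluster=1, cycle=0),
    table.try_place("B", cluster=1, cycle=3),   # A's latency of 3, same cluster
    table.try_place("D", cluster=2, cycle=4),   # latency 3 plus 1 bus cycle
    table.try_place("C", cluster=1, cycle=6),   # collides with A (6 mod 2 == 0)
    table.try_place("C", cluster=1, cycle=7),   # collides with B (7 mod 2 == 1)
]
print(placed)  # [True, True, True, False, False]
```

A real scheduler would stop at this point. The pseudo-schedule instead places C at cycle 8 in the slot already held by A, as the second table shows, so that an estimate is still available for this partition.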

Page 19: Heuristic description

While there is improvement, iterate:
  Different partitions are obtained by moving nodes among clusters
  Partitions that overload resources in any of the clusters are discarded
  The partition minimizing execution time is chosen
  In case of a tie, the one that minimizes register pressure is selected
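The heuristic can be pictured as hill climbing over single-node moves. The sketch below is an illustration under assumptions, not the authors' code: `refine`, `overloaded` and `estimate` are hypothetical names, the cost is a `(time, register pressure)` tuple so that Python's tuple ordering applies the tie-break, and `estimated_cycles` uses the common textbook estimate for a modulo-scheduled loop, which the slides do not spell out.

```python
from typing import Callable, Dict, Hashable, Tuple

Partition = Dict[Hashable, int]
Cost = Tuple[float, float]   # (estimated execution time, register pressure)


def estimated_cycles(iterations: int, ii: int, stage_count: int) -> int:
    """Textbook estimate: the kernel runs (iterations + SC - 1) times."""
    return (iterations + stage_count - 1) * ii


def refine(
    partition: Partition,
    num_clusters: int,
    overloaded: Callable[[Partition], bool],
    estimate: Callable[[Partition], Cost],
) -> Partition:
    """Hill-climb over single-node moves until no move improves the cost."""
    best, best_cost = dict(partition), estimate(partition)
    improved = True
    while improved:
        improved = False
        current = best                 # candidates derive from this round's start
        for node, home in current.items():
            for cluster in range(num_clusters):
                if cluster == home:
                    continue
                candidate = dict(current)
                candidate[node] = cluster
                if overloaded(candidate):
                    continue           # discard partitions that overload a cluster
                cost = estimate(candidate)
                if cost < best_cost:   # time first, register pressure on ties
                    best, best_cost, improved = candidate, cost, True
    return best
```

In the talk's scheme the `estimate` callback would be backed by a pseudo-schedule of the candidate partition; here it is left abstract.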

Page 20: Algorithm Overview

[Flowchart repeated from Page 9.]

Page 21: The Scheduling Step

To schedule the partition we use URACAM [Codina et al., PACT’01]
  Figure of merit
  Uses dynamic transformations to improve the partial schedule
    Register communications
      • Bus memory
    Spill code on-the-fly
      • Register pressure memory

If an instruction cannot be scheduled in the cluster assigned by the partition
  Try all other clusters
  Select the best one according to a figure of merit
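The fallback rule above can be sketched as a small selection function. This is only an illustration: `choose_cluster`, `can_schedule` and `merit` are hypothetical names, and the assumption that a higher figure of merit is better is made here and is not stated on the slide.

```python
from typing import Callable, Iterable, Optional


def choose_cluster(
    assigned: int,
    clusters: Iterable[int],
    can_schedule: Callable[[int], bool],
    merit: Callable[[int], float],
) -> Optional[int]:
    """Prefer the partition's cluster; else the feasible cluster with best merit."""
    if can_schedule(assigned):
        return assigned
    feasible = [c for c in clusters if c != assigned and can_schedule(c)]
    if not feasible:
        return None                    # no cluster works: caller raises II
    return max(feasible, key=merit)    # assumption: higher merit is better
```

Returning `None` when no cluster is feasible corresponds to the second "NO" branch of the flowchart, where II is increased and the partition is refined.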

Page 22: Algorithm Overview

[Flowchart repeated from Page 9.]

Page 23: Partition Refinement

II has increased
  A better partition can be found for the new II
    New slots have been generated in each cluster
    More lifetimes are available
    A larger number of bus communications is allowed

Coarsening process is repeated
  Only edges between nodes in the same set can appear in the matching
  After coarsening, the induced partition will be the last partition that could not be scheduled

The reducing execution time heuristic is reapplied
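The restriction on the matching amounts to filtering the edge set before coarsening again. The sketch below illustrates that filter only; `intra_set_edges` is a name invented here, and the edge weights are assumed to be given.

```python
from typing import Dict, Hashable, Tuple

Edge = Tuple[Hashable, Hashable]


def intra_set_edges(
    edges: Dict[Edge, float], partition: Dict[Hashable, int]
) -> Dict[Edge, float]:
    """Keep only edges whose endpoints sit in the same set of the old partition."""
    return {
        (u, v): w
        for (u, v), w in edges.items()
        if partition[u] == partition[v]
    }
```

Because nodes from different sets are never fused, no macro-node straddles two sets, which is why the partition induced after coarsening equals the last partition that could not be scheduled.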

Page 24: Benchmarks and Configurations

Benchmarks: all of SPECfp95, using the ref input set

Two schedulers evaluated:
  GP (previous work)
  Pseudo-schedule (PSP)

Resources      Unified   2-cluster   4-cluster
INT/cluster    4         2           1
FP/cluster     4         2           1
MEM/cluster    4         2           1

Latencies      INT   FP
MEM            2     2
ARITH          1     3
MUL/ABS        2     6
DIV/SQR/TRG    6     18

Page 25: GP vs PSP

[Two bar charts of instructions per cycle (axis 0 to 8), baseline vs PSP:
  32 registers split into 2 clusters, 1 bus (L=1)
  32 registers split into 4 clusters, 1 bus (L=1)]

Page 26: GP vs PSP

[Two bar charts of instructions per cycle (axis 0 to 8), baseline vs PSP:
  64 registers split into 4 clusters, 1 bus (L=2)
  32 registers split into 4 clusters, 1 bus (L=2)]

Page 27: Conclusions

A new algorithm to perform MS for clustered VLIW architectures
  Cluster assignment based on multilevel graph partitioning

The partition algorithm is improved
  Based on pseudo-schedules
  Reliable information available to guide the partition

Outperforms previous work
  38.5% speedup for some configurations

Page 28: Any questions?

Page 29: GP vs PSP

[Two bar charts of instructions per cycle (axis 0 to 8), baseline vs PSP:
  64 registers split into 2 clusters, 1 bus (L=1)
  64 registers split into 4 clusters, 1 bus (L=1)]

Page 30: Different Alternatives

Cluster Assignment followed by Scheduling (II++ loop)
  • Global vision when assigning clusters
  • Schedule follows the assignment exactly
  • Re-scheduling does not take into account the additional resources available

Cluster Assignment + Scheduling in a single step (II++ loop)
  • Local vision when assigning and scheduling
  • Assignment is based on current resource usage
  • No global view of the graph

Cluster Assignment and Scheduling with feedback (II++ ??)
  • Global and local views of the graph
  • If an operation cannot be scheduled, depending on the reason
    • Re-schedule
    • Re-compute cluster assignment

Page 31: Clustered Architectures

Current/future challenges in processor design
  Delay in the transmission of signals
  Power consumption
  Architecture complexity

Solutions:
  VLIW architectures
  Clustering: divide the system into semi-independent units
    Fast interconnects intra-cluster
    Slow interconnects inter-cluster

Common trend in commercial VLIW processors
  • TI’s C6x
  • Analog’s TigerSHARC
  • HP’s LX
  • Equator’s MAP1000

Page 32: Example (I)

1st STEP: Coarsening the graph

[Figure: the initial graph is reduced by two "Find matching" steps, through an intermediate new graph, to the final graph.]

Page 33: Example (I)

1st STEP: Partition induced in the original graph

[Figure: the coarsened graph induces a partition on the initial graph.]

Page 34: Reducing Execution Time

Heuristic description
  Different partitions are obtained by moving nodes among clusters
  Partitions overloading resources in any of the clusters are discarded
  The partition minimizing execution time is chosen
  In case of a tie, the one that minimizes register pressure is chosen

Estimation of execution time needed
  Pseudo-schedules

Page 35: Pseudo-schedules

Building pseudo-schedules
  Dependences
    Respected if possible
    Else a penalty on register pressure and/or execution time is assumed
  Cluster assignment
    Partition strictly followed

Valuable information can be estimated
  II
  Length of the pseudo-schedule
  Register pressure
  Execution time