eecs 583 – class 20 research topic 2: stream compilation, stream graph modulo scheduling

30
EECS 583 – Class 20 Research Topic 2: Stream Compilation, Stream Graph Modulo Scheduling University of Michigan November 30, 2011 Guest Speaker Today: Daya Khudia

Upload: winter

Post on 31-Jan-2016

37 views

Category:

Documents


0 download

DESCRIPTION

EECS 583 – Class 20 Research Topic 2: Stream Compilation, Stream Graph Modulo Scheduling. University of Michigan November 30, 2011 Guest Speaker Today: Daya Khudia. Announcements & Reading Material. This class - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: EECS 583 – Class 20 Research Topic 2: Stream Compilation, Stream Graph Modulo Scheduling

EECS 583 – Class 20Research Topic 2: Stream Compilation,Stream Graph Modulo Scheduling

University of Michigan

November 30, 2011

Guest Speaker Today: Daya Khudia

Page 2: EECS 583 – Class 20 Research Topic 2: Stream Compilation, Stream Graph Modulo Scheduling

- 2 -

Announcements & Reading Material

This class» “Orchestrating the Execution of Stream Programs on Multicore

Platforms,” M. Kudlur and S. Mahlke, Proc. ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation, Jun. 2008.

Next class – GPU compilation» “Program optimization space pruning for a multithreaded GPU,”

S. Ryoo, C. Rodrigues, S. Stone, S. Baghsorkhi, S. Ueng, J. Straton, and W. Hwu, Proc. Intl. Sym. on Code Generation and Optimization, Mar. 2008.

Page 3: EECS 583 – Class 20 Research Topic 2: Stream Compilation, Stream Graph Modulo Scheduling

- 3 -

Stream Graph Modulo Scheduling (SGMS)

Coarse grain software pipelining» Equal work distribution» Communication/computation

overlap» Synchronization costs

Target : Cell processor» Cores with disjoint address spaces» Explicit copy to access remote data

DMA engine independent of PEs

Filters = operations, cores = function units

SPU

256 KB LS

MFC(DMA)

SPU

256 KB LS

MFC(DMA)

SPU

256 KB LS

MFC(DMA)

EIB

PPE(Power PC)

DRAM

SPE0 SPE1 SPE7

Page 4: EECS 583 – Class 20 Research Topic 2: Stream Compilation, Stream Graph Modulo Scheduling

- 4 -

Preliminaries Synchronous Data Flow (SDF) [Lee ’87] StreamIt [Thies ’02]

int->int filter FIR(int N, int wgts[N]) {

work pop 1 push 1 { int i, sum = 0;

for(i=0; i<N; i++) sum += peek(i)*wgts[i]; push(sum); pop(); }}

Push and pop items from input/output FIFOs

Stateless

int wgts[N];

wgts = adapt(wgts);

Stateful

Page 5: EECS 583 – Class 20 Research Topic 2: Stream Compilation, Stream Graph Modulo Scheduling

- 5 -

SGMS Overview

PE0 PE1 PE2 PE3

PE0

T1

T4

T4

T1 ≈ 4

DMA

DMA

DMA

DMA

DMA

DMA

DMA

DMA

Prologue

Epilogue

Page 6: EECS 583 – Class 20 Research Topic 2: Stream Compilation, Stream Graph Modulo Scheduling

- 6 -

SGMS Phases

Fission +Processor

assignment

Stageassignment

Codegeneration

Load balance CausalityDMA overlap

Page 7: EECS 583 – Class 20 Research Topic 2: Stream Compilation, Stream Graph Modulo Scheduling

- 7 -

Processor Assignment: Maximizing Throughputs

iij

jij

IIaiw

a

)(

1 for all filter i = 1, …, N

for all PE j = 1,…,P

Minimize II

B C

E

D

F

W: 20

W: 20 W: 20

W: 30

W: 50

W: 30

A

B C

E

D

F

A

A

DB

CE

F

Minimum II: 50Balanced workload!Maximum throughput

T2 = 50

A

B

C

D

E

F

PE0

T1 = 170

T1/T2 = 3. 4

PE0 PE1 PE2 PE3Four Processing Elements

• Assigns each filter to a processor

PE1PE0 PE2 PE3

W: workload

Page 8: EECS 583 – Class 20 Research Topic 2: Stream Compilation, Stream Graph Modulo Scheduling

- 8 -

Need More Than Just Processor Assignment

Assign filters to processors» Goal : Equal work distribution

Graph partitioning? Bin packing?

A

B

D

C

A

B C

D

5

5

40 10

Original stream program

PE0 PE1

B

A

C

D Speedup = 60/40 = 1.5

A

B1

D

C

B2

J

S

Modified stream program

B2

CJ

B1

AS

D Speedup = 60/32 ~ 2

Page 9: EECS 583 – Class 20 Research Topic 2: Stream Compilation, Stream Graph Modulo Scheduling

- 9 -

Filter Fission Choices

PE0 PE1 PE2 PE3

Speedup ~ 4 ?

Page 10: EECS 583 – Class 20 Research Topic 2: Stream Compilation, Stream Graph Modulo Scheduling

- 10 -

Integrated Fission + PE Assign Exact solution based on Integer Linear Programming (ILP)

Split/Join overheadfactored in

Objective function- Maximal load on any PE» Minimize

Result» Number of times to “split” each filter

» Filter → processor mapping

Page 11: EECS 583 – Class 20 Research Topic 2: Stream Compilation, Stream Graph Modulo Scheduling

- 11 -

Step 2: Forming the Software Pipeline To achieve speedup

» All chunks should execute concurrently» Communication should be overlapped

Processor assignment alone is insufficient information

A

B

C

A

CB

PE0 PE1

PE0 PE1

Tim

e AB

A1

B1

A2

A1

B1

A2A→B

A1

B1

A2A→B

A3A→B

Overlap Ai+2 with Bi

X

Page 12: EECS 583 – Class 20 Research Topic 2: Stream Compilation, Stream Graph Modulo Scheduling

- 12 -

Stage Assignment

i

j

PE 1

Sj ≥ Si

i

j

DMA

PE 1

PE 2

Si

SDMA > Si

Sj = SDMA+1

Preserve causality(producer-consumer dependence)

Communication-computationoverlap

Data flow traversal of the stream graph» Assign stages using above two rules

Page 13: EECS 583 – Class 20 Research Topic 2: Stream Compilation, Stream Graph Modulo Scheduling

- 13 -

Stage Assignment Example

A

B1

D

C

B2

J

S

AS

B1Stage 0

DMA DMA DMA Stage 1

CB2

J

Stage 2

D

DMA Stage 3

Stage 4

DMA

PE 0

PE 1

Page 14: EECS 583 – Class 20 Research Topic 2: Stream Compilation, Stream Graph Modulo Scheduling

- 14 -

Step 3: Code Generation for Cell

Target the Synergistic Processing Elements (SPEs)» PS3 – up to 6 SPEs

» QS20 – up to 16 SPEs

One thread / SPE Challenge

» Making a collection of independent threads implement a software pipeline

» Adapt kernel-only code schema of a modulo schedule

Page 15: EECS 583 – Class 20 Research Topic 2: Stream Compilation, Stream Graph Modulo Scheduling

- 15 -

Complete Example

void spe1_work(){ char stage[5] = {0}; stage[0] = 1; for(i=0; i<MAX; i++) { if (stage[0]) { A(); S(); B1(); } if (stage[1]) { } if (stage[2]) { JtoD(); CtoD(); } if (stage[3]) { } if (stage[4]) { D(); } barrier(); }}

A

S

B1

DMA DMA DMA

CB2

J

D

DMA DMA

AS

B1

AS

B1

B1toJ

StoB2

AtoC

AS

B1

B2

J

C

B1toJ

StoB2

AtoC

AS

B1

JtoD

CtoD B2

J

C

B1toJ

StoB2

AtoC

AS

B1

JtoD

D

CtoD B2

J

C

B1toJ

StoB2

AtoC

SPE1 DMA1 SPE2 DMA2T

ime

Page 16: EECS 583 – Class 20 Research Topic 2: Stream Compilation, Stream Graph Modulo Scheduling

- 16 -

SGMS(ILP) vs. Greedy

0

1

2

3

4

5

6

7

8

9

bitonic channel dct des fft filterbank fmradio tde mpeg2 vocoder radar

Benchmarks

Rel

ativ

e S

peed

up

ILP Partitioning Greedy Partitioning Exposed DMA

(MIT method, ASPLOS’06)

• Solver time < 30 seconds for 16 processors

Page 17: EECS 583 – Class 20 Research Topic 2: Stream Compilation, Stream Graph Modulo Scheduling

- 17 -

SGMS Conclusions

Streamroller» Efficient mapping of stream programs to multicore

» Coarse grain software pipelining

Performance summary» 14.7x speedup on 16 cores

» Up to 35% better than greedy solution (11% on average)

Scheduling framework» Tradeoff memory space vs. load balance

Memory constrained (embedded) systems Cache based system

Page 18: EECS 583 – Class 20 Research Topic 2: Stream Compilation, Stream Graph Modulo Scheduling

- 18 -

Discussion Points

Is it possible to convert stateful filters into stateless? What if the application does not behave as you expect?

» Filters change execution time?

» Memory faster/slower than expected?

Could this be adapted for a more conventional multiprocessor with caches?

Can C code be automatically streamized? Now you have seen 3 forms of software pipelining:

» 1) Instruction level modulo scheduling, 2) Decoupled software pipelining, 3) Stream graph modulo scheduling

» Where else can it be used?

Page 19: EECS 583 – Class 20 Research Topic 2: Stream Compilation, Stream Graph Modulo Scheduling

“Flextream: Adaptive Compilation of Streaming Applications for Heterogeneous Architectures,”

Page 20: EECS 583 – Class 20 Research Topic 2: Stream Compilation, Stream Graph Modulo Scheduling

- 20 -

Static Verses Dynamic Scheduling

CoreCore

AA

B1B1 B3B3

FF

EE

CC

SplitterSplitter

JoinerJoiner

B2B2 B4B4

D1D1 D2D2

SplitterSplitter

JoinerJoiner

MemoryMemory

CoreCore

MemoryMemory

CoreCore

MemoryMemory

CoreCore

MemoryMemory

? Performing graph modulo scheduling on a

stream graph statically.

What happens in case of dynamic resource changes?

Page 21: EECS 583 – Class 20 Research Topic 2: Stream Compilation, Stream Graph Modulo Scheduling

- 21 -

Overview of Flextream

Prepass ReplicationPrepass Replication

Work PartitioningWork Partitioning

Partition RefinementPartition Refinement

Stage AssignmentStage Assignment

Buffer AllocationBuffer Allocation

Static

Dynam

ic

Streaming Application

MSL Commands

Adjust the amount of parallelism for the target system by replicating actors.Adjust the amount of parallelism for the target system by replicating actors.

Performs light-weight adaptation of the schedule for the current configuration of the target hardware.Performs light-weight adaptation of the schedule for the current configuration of the target hardware.

Tunes actor-processor mapping to the real configuration of the underlying hardware. (Load balance)Tunes actor-processor mapping to the real configuration of the underlying hardware. (Load balance)

Find an optimal schedule for a virtualized member of a family of processors.Find an optimal schedule for a virtualized member of a family of processors.

Specifies how actors execute in time in the new actor-processor mapping.Specifies how actors execute in time in the new actor-processor mapping.

Find optimal modulo schedule for a virtualized member of a family of processors.Find optimal modulo schedule for a virtualized member of a family of processors.

Tries to efficiently allocate the storage requirements of the new schedule into available memory units.Tries to efficiently allocate the storage requirements of the new schedule into available memory units.

Goal: To perform Adaptive Stream Graph Modulo Scheduling. Goal: To perform Adaptive Stream Graph Modulo Scheduling.

Page 22: EECS 583 – Class 20 Research Topic 2: Stream Compilation, Stream Graph Modulo Scheduling

- 22 -

Overall Execution Flow

For every application may see multiple iterations of:

Resou

rce ch

ange

Req

uest

Resou

rce ch

ange

Gran

ted

Page 23: EECS 583 – Class 20 Research Topic 2: Stream Compilation, Stream Graph Modulo Scheduling

- 23 -

Prepass Replication [static]

AA BB CC DD

EE FF E0E0E1E1

P0 : 10 P1 : 86 P2 : 246 P3 : 326

P4 : 566 P5 : 10 P6 : 0 P7 : 0 P4 : 283

D0D0

D1D1

P6 : 283 P7 : 163

P3 : 163P0 : 151.5 P1 : 147.5 P2 : 184.5 P3 : 163

P4 : 141.5 P5 : 151.5 P6 : 141.5 P7 : 163

E0E0

E1E1 E2E2 E3E3

C0C0 C1C1 C2C2

C3C3

AA

FF

EE

DD

CC

1010

246246

326326

566566

1010

BB8686

C0C0 C2C2

S0S0

J0J0

61.561.5

66

66

C1C1 C3C3

D0D0

S1S1

J1J1

163163

66

66

D1D1

E0E0 E2E2

S2S2

J2J2

66

66

E1E1 E3E3 141.5141.5

21

21

22

22

22

22

Page 24: EECS 583 – Class 20 Research Topic 2: Stream Compilation, Stream Graph Modulo Scheduling

- 24 -

Partition Refinement [dynamic 1]

Available resources at runtime can be more limited than resources in static target architecture.

Partition refinement tunes actor to processor mapping for the active configuration.

A greedy iterative algorithm is used to achieve this goal.

Page 25: EECS 583 – Class 20 Research Topic 2: Stream Compilation, Stream Graph Modulo Scheduling

- 25 -

Partition Refinement Example

Pick processors with most number of actors.

Sort the actors

Find processor with max work

Assign min actors until threshold

AA BB

P0 : 184.5 P1 : 141.5 P2 : 171.5 P3 : 141.5

P4 : 151.5 P5 : 173 P7 : 159.5

D0D0D1D1

P6 : 140

E0E0

E1E1

E2E2E3E3C0C0 C1C1

C2C2

C3C3

S1S1S2S2

J0J0S0S0

J1J1 J2J2

BBE2E2 C2C2C0C0 C3C3C1C1 S1S1S2S2 J0J0S0S0J1J1 J2J2

P5 : 183

S0S0C3C3

S1S1S2S2

J1J1J2J2

C2C2

C1C1

BB

C0C0

E2E2

FFJ0J0

P5 : 193P5 : 270.5P4 : 274.5

P1 : 283 P3 : 289

Page 26: EECS 583 – Class 20 Research Topic 2: Stream Compilation, Stream Graph Modulo Scheduling

- 26 -

Stage Assignment [dynamic 2]

Processor assignment only specifies how actors are overlapped across processors.

Stage assignment finds how actors are overlapped in time.

Relative start time of the actors is based on stage numbers.

DMA operations will have a separate stage.

Page 27: EECS 583 – Class 20 Research Topic 2: Stream Compilation, Stream Graph Modulo Scheduling

- 27 -

Stage Assignment Example

AA

FF

BB

C0C0 C2C2

S0S0

J0J0

C1C1 C3C3

D0D0

S1S1

J1J1

D1D1

E0E0 E2E2

S2S2

J2J2

E1E1 E3E3

AA D0D0D1D1

E0E0

E1E1

E3E3

S0S0C3C3

S1S1S2S2

J1J1J2J2

C2C2

C1C1

BB

C0C0

E2E2

FFJ0J0

00

22

44

66

10

10

88

12

12

16

1618

18

14

14

Page 28: EECS 583 – Class 20 Research Topic 2: Stream Compilation, Stream Graph Modulo Scheduling

- 28 -

Performance Comparison

Page 29: EECS 583 – Class 20 Research Topic 2: Stream Compilation, Stream Graph Modulo Scheduling

- 29 -

Overhead Comparison

Page 30: EECS 583 – Class 20 Research Topic 2: Stream Compilation, Stream Graph Modulo Scheduling

- 30 -

Flextream Conclusions

Static scheduling approaches are promising but not enough.

Dynamic adaptation is necessary for future systems.

Flextream provides a hybrid static/dynamic approach to improve efficiency.