EECS 583 – Class 20

Research Topic 2: Stream Compilation, Stream Graph Modulo Scheduling

University of Michigan

November 30, 2011

Guest Speaker Today: Daya Khudia

- 2 -

Announcements & Reading Material

This class
» “Orchestrating the Execution of Stream Programs on Multicore Platforms,” M. Kudlur and S. Mahlke, Proc. ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation, Jun. 2008.

Next class – GPU compilation
» “Program optimization space pruning for a multithreaded GPU,” S. Ryoo, C. Rodrigues, S. Stone, S. Baghsorkhi, S. Ueng, J. Straton, and W. Hwu, Proc. Intl. Sym. on Code Generation and Optimization, Mar. 2008.

- 3 -

Stream Graph Modulo Scheduling (SGMS)

Coarse grain software pipelining
» Equal work distribution
» Communication/computation overlap
» Synchronization costs

Target: Cell processor
» Cores with disjoint address spaces
» Explicit copy to access remote data
» DMA engine independent of PEs

Filters = operations, cores = function units

[Figure: Cell processor. SPE0 through SPE7, each an SPU with a 256 KB local store (LS) and an MFC (DMA), connected over the EIB to the PPE (PowerPC) and DRAM.]

- 4 -

Preliminaries

Synchronous Data Flow (SDF) [Lee ’87]

StreamIt [Thies ’02]

    int->int filter FIR(int N, int wgts[N]) {
      work pop 1 push 1 peek N {
        int i, sum = 0;
        for (i = 0; i < N; i++)
          sum += peek(i) * wgts[i];
        push(sum);
        pop();
      }
    }

Push and pop items from input/output FIFOs

Stateless: the filter above carries no data from one firing to the next.

Stateful: a filter that updates its own fields between firings, for example

    int wgts[N];
    wgts = adapt(wgts);
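The peek/pop/push behavior of the FIR filter can be sketched in plain C. This is an illustrative sketch only, not StreamIt-generated code: the `Fifo` type and the names `fifo_peek`, `fifo_pop`, `fir_work`, and `fir_run` are assumptions made for this example.

```c
#include <stddef.h>

/* Hypothetical input FIFO: a flat buffer with a read cursor. */
typedef struct {
    const int *data;
    size_t head;   /* index of the next item to pop */
    size_t len;    /* total number of items in the buffer */
} Fifo;

/* peek(i): read the i-th item from the head without consuming it. */
static int fifo_peek(const Fifo *f, size_t i) { return f->data[f->head + i]; }

/* pop(): consume one item from the head. */
static int fifo_pop(Fifo *f) { return f->data[f->head++]; }

/* One firing of FIR: peeks n items, pops one, and returns the value
   that the StreamIt filter would push. Caller guarantees n items. */
int fir_work(Fifo *in, const int *wgts, size_t n) {
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += fifo_peek(in, i) * wgts[i];
    fifo_pop(in);
    return sum;
}

/* Fire the filter while at least n items remain; returns the output count. */
size_t fir_run(const int *input, size_t len, const int *wgts, size_t n, int *out) {
    Fifo in = { input, 0, len };
    size_t fired = 0;
    if (n == 0) return 0;
    while (in.len - in.head >= n)
        out[fired++] = fir_work(&in, wgts, n);
    return fired;
}
```

Because `fir_work` reads only its input window and the constant weights, any two firings are independent, which is the property that makes a stateless filter safe to replicate.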

- 5 -

SGMS Overview

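[Figure: the stream graph run on a single PE (PE0) takes time T1 per iteration. Software-pipelined across PE0 to PE3, with DMA transfers between PEs, a prologue, and an epilogue, the steady-state time is T4, with T1/T4 ≈ 4.]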

- 6 -

SGMS Phases

Fission + processor assignment
» Load balance

Stage assignment
» Causality
» DMA overlap

Code generation

- 7 -

Processor Assignment: Maximizing Throughputs

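$\sum_{j=1}^{P} a_{ij} = 1$ for all filters i = 1, …, N

$\sum_{i=1}^{N} w(i)\, a_{ij} \le II$ for all PEs j = 1, …, P

Minimize II

Here $a_{ij}$ is 1 when filter i is assigned to PE j and 0 otherwise, and $w(i)$ is the workload of filter i.

• Assigns each filter to a processor

[Figure: a stream graph of six filters A to F with workloads (W) of 20, 20, 20, 30, 50, and 30, mapped onto four processing elements PE0 to PE3. On a single PE, T1 = 170. The balanced assignment reaches the minimum II of 50 (balanced workload, maximum throughput), so T2 = 50 and T1/T2 = 3.4.]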
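For a graph this small the minimum II can be checked by exhaustive search. The sketch below is not the ILP formulation used in SGMS; it swaps in brute-force enumeration, and the names `max_pe_load` and `min_ii_bruteforce` are assumptions for this example. The workloads are the six values from the slide, taken in the order they appear.

```c
#include <limits.h>

/* Load of the busiest PE for a given filter -> PE assignment. */
static int max_pe_load(const int *work, const int *pe_of, int n_filters, int n_pes) {
    int worst = 0;
    for (int j = 0; j < n_pes; j++) {
        int load = 0;
        for (int i = 0; i < n_filters; i++)
            if (pe_of[i] == j) load += work[i];
        if (load > worst) worst = load;
    }
    return worst;
}

/* Tries all n_pes^n_filters assignments (small inputs only).
   Returns the minimum II, or -1 on bad arguments; best_pe_of
   receives one assignment that achieves it. */
int min_ii_bruteforce(const int *work, int n_filters, int n_pes, int *best_pe_of) {
    int pe_of[16] = {0};
    int best = INT_MAX;
    if (n_filters < 1 || n_filters > 16 || n_pes < 1) return -1;
    for (;;) {
        int ii = max_pe_load(work, pe_of, n_filters, n_pes);
        if (ii < best) {
            best = ii;
            for (int i = 0; i < n_filters; i++) best_pe_of[i] = pe_of[i];
        }
        /* Advance pe_of like an odometer in base n_pes. */
        int k = 0;
        while (k < n_filters && ++pe_of[k] == n_pes) pe_of[k++] = 0;
        if (k == n_filters) break;   /* every digit wrapped: done */
    }
    return best;
}
```

With workloads {20, 20, 20, 30, 50, 30} and four PEs the search returns 50: the filter of weight 50 sets a lower bound that the other five filters can be packed around.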

- 8 -

Need More Than Just Processor Assignment

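Assign filters to processors
» Goal: Equal work distribution

Graph partitioning? Bin packing?

[Figure: original stream program A → (B, C) → D with workloads A: 5, B: 40, C: 10, D: 5. On two PEs, B runs alone on one PE and A, C, D share the other, so the busiest PE carries 40. Speedup = 60/40 = 1.5.]

[Figure: modified stream program. B is split into B1 and B2, with a splitter S and a joiner J added. One PE runs A, S, B1, D and the other runs B2, C, J, so the busiest PE carries 32. Speedup = 60/32 ≈ 1.9.]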
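The two speedup figures can be reproduced with a few lines of C. This is a sketch under stated assumptions: the function name `mapping_speedup` is hypothetical, and the splitter and joiner are each given a workload of 2, a value inferred from the 32 on the slide rather than stated there.

```c
#define MAX_PES 16

/* Speedup of a mapping = single-PE time / load of the busiest PE.
   Returns 0.0 on bad arguments. */
double mapping_speedup(const double *work, const int *pe_of, int n_filters,
                       int n_pes, double single_pe_time) {
    double load[MAX_PES] = {0};
    double worst = 0.0;
    if (n_pes < 1 || n_pes > MAX_PES) return 0.0;
    for (int i = 0; i < n_filters; i++)
        load[pe_of[i]] += work[i];
    for (int j = 0; j < n_pes; j++)
        if (load[j] > worst) worst = load[j];
    return worst > 0.0 ? single_pe_time / worst : 0.0;
}
```

For the original program (A: 5, B: 40, C: 10, D: 5, with B alone on one PE) the result is 1.5. For the fissioned program (B1 and B2 at 20 each, S and J at an assumed 2 each) both PEs carry 32 and the result is 1.875.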

- 9 -

Filter Fission Choices

[Figure: alternative fission choices mapped onto PE0 to PE3.]

Speedup ~ 4?

- 10 -

Integrated Fission + PE Assign

Exact solution based on Integer Linear Programming (ILP)

…

» Split/join overhead factored in

Objective function
» Minimize the maximal load on any PE

Result
» Number of times to “split” each filter
» Filter → processor mapping

- 11 -

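Step 2: Forming the Software Pipeline

To achieve speedup
» All chunks should execute concurrently
» Communication should be overlapped

Processor assignment alone is insufficient information

[Figure: filters A, B, C mapped onto PE0 and PE1, with timelines for both PEs. Running A1, B1, A2 back to back is marked X. Inserting the A→B transfer between producer and consumer lets later iterations of A (A2, A3) run while earlier data is in flight: overlap A(i+2) with B(i).]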

- 12 -

Stage Assignment

Preserve causality (producer-consumer dependence)
» Producer i and consumer j on the same PE (PE 1): $S_j \ge S_i$

Communication-computation overlap
» Producer i on PE 1, consumer j on PE 2, DMA in between: $S_{DMA} > S_i$ and $S_j = S_{DMA} + 1$

Data flow traversal of the stream graph
» Assign stages using above two rules
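The two rules can be applied in a single data flow pass. The sketch below is illustrative only: `Edge` and `assign_stages` are hypothetical names, filters are assumed to be numbered in topological order, and each DMA is assumed to take the earliest legal stage ($S_{DMA} = S_i + 1$), so a consumer on another PE lands two stages after its producer.

```c
/* A producer -> consumer edge between two filters. */
typedef struct { int src, dst; } Edge;

/* Assigns a pipeline stage to every filter.
   Requires src < dst for every edge (topological numbering). */
void assign_stages(const int *pe_of, int n_filters,
                   const Edge *edges, int n_edges, int *stage) {
    for (int v = 0; v < n_filters; v++) stage[v] = 0;
    for (int v = 0; v < n_filters; v++) {          /* data flow order */
        for (int e = 0; e < n_edges; e++) {
            if (edges[e].dst != v) continue;
            int u = edges[e].src;
            /* Same PE: S_j >= S_i. Different PE: DMA at S_i + 1,
               consumer at S_DMA + 1 = S_i + 2. */
            int need = (pe_of[u] == pe_of[v]) ? stage[u] : stage[u] + 2;
            if (need > stage[v]) stage[v] = need;
        }
    }
}
```

Applied to the fissioned graph used in these slides (A, S, B1, D on one PE and B2, C, J on the other), this yields stage 0 for A, S, B1, stage 2 for B2, C, J, and stage 4 for D, with the DMA transfers in stages 1 and 3.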

- 13 -

Stage Assignment Example

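[Figure: the fissioned stream graph (A, S, B1, B2, C, J, D) mapped onto PE 0 and PE 1 with stage numbers. Stage 0: A, S, B1. Stage 1: three DMA transfers. Stage 2: C, B2, J. Stage 3: two DMA transfers. Stage 4: D.]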

- 14 -

Step 3: Code Generation for Cell

Target the Synergistic Processing Elements (SPEs)
» PS3 – up to 6 SPEs
» QS20 – up to 16 SPEs

One thread / SPE

Challenge
» Making a collection of independent threads implement a software pipeline
» Adapt kernel-only code schema of a modulo schedule

- 15 -

Complete Example

    void spe1_work()
    {
        int i;
        char stage[5] = {0};    /* one predicate per pipeline stage */

        stage[0] = 1;
        for (i = 0; i < MAX; i++) {
            if (stage[0]) { A(); S(); B1(); }
            if (stage[1]) { }
            if (stage[2]) { JtoD(); CtoD(); }
            if (stage[3]) { }
            if (stage[4]) { D(); }
            barrier();
        }
    }

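[Figure: the stage assignment (A, S, B1 → DMA → C, B2, J → DMA → D) next to the resulting execution timeline, with columns SPE1, DMA1, SPE2, DMA2 and time running downward. The pipeline fills over the first iterations. In steady state SPE1 runs A, S, B1 and D, DMA1 performs JtoD and CtoD, SPE2 runs B2, J, C, and DMA2 performs B1toJ, StoB2, AtoC.]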
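The `spe1_work` loop shows the stage predicates but not how they change from one trip to the next. One common way to drive a kernel-only schema is to shift the predicates each trip. The sketch below assumes that scheme. It is not the code SGMS emits, and `PipelineTrace` and `run_kernel_only` are hypothetical names. Counters stand in for the filter and DMA calls.

```c
#include <string.h>

#define NUM_STAGES 5

/* How often each stage body ran, plus the number of loop trips. */
typedef struct {
    int runs[NUM_STAGES];
    int loop_trips;
} PipelineTrace;

/* Runs `iterations` logical iterations through a NUM_STAGES-deep
   software pipeline using only the kernel loop and stage predicates. */
PipelineTrace run_kernel_only(int iterations) {
    PipelineTrace t;
    char stage[NUM_STAGES] = {0};
    int trips = iterations + NUM_STAGES - 1;   /* fill + steady state + drain */

    memset(&t, 0, sizeof t);
    for (int i = 0; i < trips; i++) {
        /* Stage s is active now if stage s-1 was active on the last trip. */
        for (int s = NUM_STAGES - 1; s > 0; s--)
            stage[s] = stage[s - 1];
        stage[0] = (i < iterations);           /* stop feeding new iterations */

        for (int s = 0; s < NUM_STAGES; s++)
            if (stage[s]) t.runs[s]++;         /* stand-in for the stage body */
        t.loop_trips++;                        /* barrier() would go here */
    }
    return t;
}
```

The early trips, where only the low-numbered stages are active, play the role of the prologue. The final trips, where stage 0 is off, play the role of the epilogue. Every stage body runs exactly `iterations` times.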

- 16 -

SGMS(ILP) vs. Greedy

[Figure: bar chart of relative speedup (y-axis, 0 to 9) per benchmark: bitonic, channel, dct, des, fft, filterbank, fmradio, tde, mpeg2, vocoder, radar. Series: ILP Partitioning, Greedy Partitioning (MIT method, ASPLOS’06), Exposed DMA.]

• Solver time < 30 seconds for 16 processors

- 17 -

SGMS Conclusions

Streamroller
» Efficient mapping of stream programs to multicore
» Coarse grain software pipelining

Performance summary
» 14.7x speedup on 16 cores
» Up to 35% better than greedy solution (11% on average)

Scheduling framework
» Tradeoff memory space vs. load balance
» Memory constrained (embedded) systems
» Cache based system

- 18 -

Discussion Points

Is it possible to convert stateful filters into stateless?

What if the application does not behave as you expect?
» Filters change execution time?
» Memory faster/slower than expected?

Could this be adapted for a more conventional multiprocessor with caches?

Can C code be automatically streamized?

Now you have seen 3 forms of software pipelining:
» 1) Instruction level modulo scheduling, 2) Decoupled software pipelining, 3) Stream graph modulo scheduling
» Where else can it be used?

“Flextream: Adaptive Compilation of Streaming Applications for Heterogeneous Architectures,”

- 20 -

Static Versus Dynamic Scheduling

[Figure: a stream graph (A, F, E, C, two splitter/joiner pairs, replicas B1 to B4, and D1, D2) to be mapped onto four cores, each with its own memory.]

Performing graph modulo scheduling on a stream graph statically.

What happens in case of dynamic resource changes?

- 21 -

Overview of Flextream

Goal: To perform Adaptive Stream Graph Modulo Scheduling.

Input: Streaming Application

Static: Find an optimal schedule for a virtualized member of a family of processors.
» Prepass Replication: Adjust the amount of parallelism for the target system by replicating actors.
» Work Partitioning: Find optimal modulo schedule for a virtualized member of a family of processors.

Dynamic: Performs light-weight adaptation of the schedule for the current configuration of the target hardware.
» Partition Refinement: Tunes actor-processor mapping to the real configuration of the underlying hardware. (Load balance)
» Stage Assignment: Specifies how actors execute in time in the new actor-processor mapping.
» Buffer Allocation: Tries to efficiently allocate the storage requirements of the new schedule into available memory units.

Output: MSL Commands

- 22 -

Overall Execution Flow

Every application may see multiple iterations of:
» Resource change request
» Resource change granted

- 23 -

Prepass Replication [static]

[Figure: prepass replication on eight processors P0 to P7. Actor workloads: A: 10, B: 86, C: 246, D: 326, E: 566, F: 10.

One actor per processor: P0: 10, P1: 86, P2: 246, P3: 326, P4: 566, P5: 10, P6: 0, P7: 0.

Replicating E into E0 and E1 gives P4: 283 and P6: 283. Replicating D into D0 and D1 gives P3: 163 and P7: 163.

Final graph: C replicated four ways (C0 to C3, 61.5 each, splitter S0 and joiner J0), D two ways (D0, D1, 163 each, S1 and J1), E four ways (E0 to E3, 141.5 each, S2 and J2). Each splitter and joiner is labeled 6.

Final loads: P0: 151.5, P1: 147.5, P2: 184.5, P3: 163, P4: 141.5, P5: 151.5, P6: 141.5, P7: 163.]

- 24 -

Partition Refinement [dynamic 1]

Available resources at runtime can be more limited than those of the static target architecture.

Partition refinement tunes the actor-to-processor mapping for the active configuration.

A greedy iterative algorithm is used to achieve this goal.
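To give a feel for what a greedy remapping step looks like, the sketch below moves every actor stranded on an unavailable processor to the least-loaded available one, heaviest actor first. This is a generic greedy heuristic swapped in for illustration, not Flextream's refinement algorithm. The name `refine_partition` and the workloads in the usage note are assumptions.

```c
#define MAX_PROCS 16

/* Reassigns actors on inactive processors to active ones.
   work[a]    : workload of actor a
   proc_of[a] : processor of actor a (updated in place)
   active[p]  : nonzero if processor p is available at runtime
   Returns the largest load on any active processor, or -1.0 on bad input. */
double refine_partition(const double *work, int *proc_of, int n_actors,
                        const int *active, int n_procs) {
    double load[MAX_PROCS] = {0};
    double worst = 0.0;
    int any_active = 0;

    if (n_procs < 1 || n_procs > MAX_PROCS) return -1.0;
    for (int p = 0; p < n_procs; p++)
        if (active[p]) any_active = 1;
    if (!any_active) return -1.0;

    /* Loads contributed by actors that are already on active processors. */
    for (int a = 0; a < n_actors; a++)
        if (active[proc_of[a]]) load[proc_of[a]] += work[a];

    for (;;) {
        int heavy = -1, target = -1;
        /* Heaviest actor still stranded on an inactive processor. */
        for (int a = 0; a < n_actors; a++)
            if (!active[proc_of[a]] && (heavy < 0 || work[a] > work[heavy]))
                heavy = a;
        if (heavy < 0) break;
        /* Least-loaded active processor. */
        for (int p = 0; p < n_procs; p++)
            if (active[p] && (target < 0 || load[p] < load[target]))
                target = p;
        proc_of[heavy] = target;
        load[target] += work[heavy];
    }

    for (int p = 0; p < n_procs; p++)
        if (active[p] && load[p] > worst) worst = load[p];
    return worst;
}
```

With hypothetical workloads {10, 20, 30, 40} placed one per processor and only the first two processors still available, the actor of weight 40 joins the processor holding 10 and the actor of weight 30 joins the one holding 20, leaving both at 50.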

- 25 -

Partition Refinement Example

Pick the processors with the most actors.

Sort the actors.

Find the processor with the maximum work.

Assign the minimum-work actors until the threshold is reached.

[Figure: partition refinement example on the replicated graph (A, B, C0 to C3, D0, D1, E0 to E3, F, splitters S0 to S2, joiners J0 to J2). Starting loads: P0: 184.5, P1: 141.5, P2: 171.5, P3: 141.5, P4: 151.5, P5: 173, P6: 140, P7: 159.5. Loads shown as actors are moved: P5: 183, P5: 193, P5: 270.5, P4: 274.5, P1: 283, P3: 289.]

- 26 -

Stage Assignment [dynamic 2]

Processor assignment only specifies how actors are overlapped across processors.

Stage assignment finds how actors are overlapped in time.

Relative start time of the actors is based on stage numbers.

DMA operations will have a separate stage.

- 27 -

Stage Assignment Example

[Figure: the replicated stream graph (A, B, C0 to C3, D0, D1, E0 to E3, F, with splitters S0 to S2 and joiners J0 to J2) and its refined actor-to-processor mapping, annotated with the even stage numbers 0, 2, 4, …, 18.]

- 28 -

Performance Comparison

- 29 -

Overhead Comparison

- 30 -

Flextream Conclusions

Static scheduling approaches are promising but not enough.

Dynamic adaptation is necessary for future systems.

Flextream provides a hybrid static/dynamic approach to improve efficiency.
