Posted on 01-Feb-2016


Communication Overhead Estimation on Multicores

S. M. Farhad

The University of Sydney

Joint work with Yousun Ko, Bernd Burgstaller, and Bernhard Scholz


Outline

Motivation: multicore trend, stream programming

Profiling communication overhead

Related works


Motivation

[Figure: number of cores per chip (log scale, 1 to 512) versus year (1975 to 2010). Chips shown include the 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Athlon, Itanium, Itanium2, Power4, PA8800, Opteron, Core, CoreDuo, Core2Duo, Core2Quad, Opteron 4P, Xeon, Power6, Xbox 360, BCM 1480, Niagara, Cell, RAW, RAZA XLR, Cavium, CISCO CSR1, Larrabee, PicoChip, AMBRIC, AMD Fusion, and NVIDIA G80, grouped as unicore, homogeneous multicore, and heterogeneous multicore. Courtesy: Scott'08]

Programming models shown: C/C++/Java, CUDA, X10, Peakstream, Fortress, Accelerator, Ct, CTM, Rstream, Rapidmind, and Stream Programming


Stream Programming Paradigm

Programs are expressed as stream graphs

Streams: infinite sequences of data elements

Actors: functions applied to streams
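A minimal Python sketch of the paradigm, assuming streams are modelled as lazy (possibly infinite) iterators and actors as generator functions that consume one stream and produce another. The actors `source`, `scale`, and `window_sum` are hypothetical illustrations, not part of the slides.

```python
from collections import deque
from itertools import count, islice
from typing import Iterator


def source() -> Iterator[int]:
    """Producer actor: emits the infinite stream 0, 1, 2, ..."""
    yield from count()


def scale(stream: Iterator[int], factor: int) -> Iterator[int]:
    """Stateless actor: applies the same function to every element."""
    for x in stream:
        yield x * factor


def window_sum(stream: Iterator[int], width: int) -> Iterator[int]:
    """Stateful actor: emits the sum of the last `width` elements once it has seen that many."""
    window: deque = deque(maxlen=width)
    for x in stream:
        window.append(x)
        if len(window) == width:
            yield sum(window)


# Connecting actors builds the stream graph (here a three-actor pipeline);
# every connection is a producer/consumer dependency.
graph = window_sum(scale(source(), 2), 3)
first_four = list(islice(graph, 4))
```

Because the generators are lazy, `islice` pulls only as many elements through the graph as it needs, so the infinite source is never exhausted.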


[Figure: an actor consuming an input stream and producing an output stream]


Properties of Stream Programs

Regular and repeating computation

Independent actors with explicit communication

Producer/consumer dependencies


[Figure: FM radio stream graph with actors AtoD, FMDemod, Splitter, LPF1, LPF2, LPF3, HPF1, HPF2, HPF3, Joiner, Adder, and Speaker]


StreamIt Language

An implementation of stream programming

Hierarchical structure

Each construct has a single input stream and a single output stream

[Figure: the StreamIt constructs filter, pipeline, splitjoin (splitter, parallel computation, joiner), and feedback loop (joiner, splitter). Each child stream may be any StreamIt language construct.]
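The hierarchy can be sketched in Python by treating every construct as a function from a batch of input items to output items, so any construct can be nested inside a pipeline or splitjoin. This is not StreamIt syntax: `filter_`, `pipeline`, and `splitjoin` are hypothetical helpers, splitting and joining are assumed to be round-robin (StreamIt also offers duplicate splitters), and the feedback loop is omitted.

```python
from typing import Callable, List

# A construct consumes a finite batch of input items and returns its output items
# (single input stream, single output stream).
Construct = Callable[[List[int]], List[int]]


def filter_(work: Callable[[int], int]) -> Construct:
    """Leaf construct: applies `work` to every item."""
    return lambda items: [work(x) for x in items]


def pipeline(*stages: Construct) -> Construct:
    """Runs child constructs in sequence; each feeds the next."""
    def run(items: List[int]) -> List[int]:
        for stage in stages:
            items = stage(items)
        return items
    return run


def splitjoin(*branches: Construct) -> Construct:
    """Round-robin splitter, parallel branches, round-robin joiner.

    Assumes one output per input in every branch and a batch size that is a
    multiple of the number of branches (zip drops any incomplete last round).
    """
    def run(items: List[int]) -> List[int]:
        n = len(branches)
        outs = [branch(items[i::n]) for i, branch in enumerate(branches)]
        return [x for group in zip(*outs) for x in group]
    return run


# Children may themselves be any construct, giving the hierarchical structure.
program = pipeline(
    filter_(lambda x: x + 1),
    splitjoin(filter_(lambda x: 10 * x), pipeline(filter_(lambda x: -x))),
)
result = program([1, 2, 3, 4])
```

Here the splitter sends items 2 and 4 to the first branch and 3 and 5 to the second, and the joiner interleaves the branch outputs again.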


How to Estimate the Communication Overhead?


Problems in Measuring Communication Overhead

Reasons:

Multicores are not communication-exposed architectures

Complex cache hierarchies

Cache coherence protocols

Consequence:

The communication cost cannot be measured directly

Instead, it is estimated by measuring the execution times of actors


Measuring the Communication Overhead of an Edge


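[Figure: edge (i, k). Without communication cost, actors i and k both run on Processor 1, with execution times t_i and t_k. With communication cost, i runs on Processor 1 and k on Processor 2, with execution times t'_i and t'_k.]

$$C_{(i,k)} = (t'_i - t_i) + (t'_k - t_k)$$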
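A direct transcription of the formula as a small sketch; the parameter names are hypothetical and the timings are made-up numbers used only to show the arithmetic.

```python
def edge_comm_cost(t_i: float, t_k: float, t_i_split: float, t_k_split: float) -> float:
    """Estimated communication cost C(i, k) of edge (i, k).

    t_i, t_k: execution times with i and k on the same processor.
    t_i_split, t_k_split: execution times with i and k on different processors.
    The cost is the total slowdown of both endpoint actors.
    """
    return (t_i_split - t_i) + (t_k_split - t_k)


# Hypothetical measurements (microseconds), for illustration only.
cost = edge_comm_cost(t_i=120.0, t_k=80.0, t_i_split=135.0, t_k_split=95.0)
```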

How to Minimize the Required Number of Experiments


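[Figure: a pipeline A → B → C with edges 1 and 2; graph coloring requires 2 + 1 experiments. For a stream graph with actors A to F and edges 1 to 5, one mapping onto Processor 1 and Processor 2 places the even-numbered edges across the partition, and another places the odd-numbered edges across the partition.]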
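For a pipeline, the two-color scheme can be sketched as follows. We read the "+1" as a baseline run with every actor on one processor (our interpretation of the slide); each colored experiment then flips processors at every edge of that color, so no actor touches two cut edges. The function name `pipeline_experiments` is hypothetical.

```python
from typing import Dict, List


def pipeline_experiments(actors: List[str]) -> List[Dict[str, int]]:
    """Processor assignments (0 or 1) for profiling every edge of a pipeline.

    Edge j (1-based) connects actors[j-1] and actors[j]. Experiment 0 is the
    assumed baseline with all actors on processor 0; experiments 1 and 2 cut
    exactly the odd-numbered and the even-numbered edges, respectively.
    """
    experiments = [{a: 0 for a in actors}]  # baseline: no communication cost
    for parity in (1, 0):  # odd-numbered edges, then even-numbered edges
        proc, assign = 0, {actors[0]: 0}
        for j in range(1, len(actors)):
            if j % 2 == parity:
                proc ^= 1  # cut edge j: switch to the other processor
            assign[actors[j]] = proc
        experiments.append(assign)
    return experiments


exps = pipeline_experiments(["A", "B", "C"])
```

For A → B → C this gives three experiments: the baseline, one cutting only edge 1 (A | B, C), and one cutting only edge 2 (A, B | C).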

Obs. 1: There is no loop of three actors in a stream graph


[Figure: actors i, k, and l mapped onto Processor 1 and Processor 2]

Obs. 2: There is no interference between edges through adjacent nodes


[Figure: stream graph with actors A to F; for the blue-colored edges, processor assignments labeled P-1 to P-4]

Remove Interference

Convert the stream graph into a line graph

Add interference edges

Apply a vertex coloring algorithm


[Figure: stream graph with actors A to F and its line graph, whose vertices are the stream edges AB, BC, BD, CE, DE, and EF]
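A sketch of the three steps in Python. The slides do not define interference edges; here we assume two stream edges interfere when a third stream edge joins one of their endpoints, which makes every color class an induced matching (a strong edge coloring). Greedy coloring stands in for the unspecified vertex coloring algorithm, and the edge list is read off the line-graph figure. Under these assumptions, AB and EF end up in the same color class, consistent with the blue edges on the following slides.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def conflict_graph(edges: List[Edge]) -> Dict[Edge, Set[Edge]]:
    """Line graph of the stream graph plus (assumed) interference edges.

    Two stream edges conflict if they share an actor (line-graph edge) or if a
    third stream edge joins one of their endpoints (assumed interference edge).
    """
    adj: Dict[str, Set[str]] = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    conflicts: Dict[Edge, Set[Edge]] = {e: set() for e in edges}
    for e, f in combinations(edges, 2):
        ends_e, ends_f = set(e), set(f)
        shares_actor = bool(ends_e & ends_f)
        linked = any(adj[x] & ends_f for x in ends_e)
        if shares_actor or linked:
            conflicts[e].add(f)
            conflicts[f].add(e)
    return conflicts


def greedy_coloring(conflicts: Dict[Edge, Set[Edge]]) -> Dict[Edge, int]:
    """Greedy vertex coloring: each vertex takes the smallest color unused by its neighbours."""
    colors: Dict[Edge, int] = {}
    for vertex in conflicts:  # visits vertices in insertion order
        used = {colors[n] for n in conflicts[vertex] if n in colors}
        colors[vertex] = next(c for c in range(len(conflicts) + 1) if c not in used)
    return colors


stream_edges = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "E"), ("D", "E"), ("E", "F")]
colors = greedy_coloring(conflict_graph(stream_edges))
same_color_as_ab = sorted(e for e, c in colors.items() if c == colors[("A", "B")])
```

Each color class then corresponds to one profiling experiment; a different interference rule or coloring order could yield different classes.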

Processor Labelling Graph


[Figure: for the blue-colored edges, the stream graph with actors A to F is reduced to a processor labelling graph with vertices A; B, C, D, E; and F]

Coloring the Processor Labelling Graph


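[Figure: the processor labelling graph (A; B, C, D, E; F) colored with two colors: A and F are assigned to one processor and B, C, D, and E to the other (Processor 1 and Processor 2)]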
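The construction and coloring of the processor labelling graph can be sketched as below, assuming the labelling graph is obtained by contracting every stream edge outside the current color class (with union-find) and is then 2-colored by breadth-first search. This is our reading of the figures, not code from the authors; `processor_labelling` is a hypothetical name.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def processor_labelling(edges: List[Edge], color_class: Set[Edge]) -> Dict[str, int]:
    """Assign actors to processor 1 or 2 so that every edge in `color_class` is cut.

    Edges in `color_class` must use the same (u, v) orientation as in `edges`.
    Raises ValueError if the class cannot be cut by a two-processor mapping.
    """
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Step 1: contract edges outside the class; each group must share a processor.
    for u, v in edges:
        ru, rv = find(u), find(v)  # also registers new actors
        if (u, v) not in color_class:
            parent[ru] = rv

    # The processor labelling graph: one vertex per group, one edge per class edge.
    graph: Dict[str, Set[str]] = {find(a): set() for a in parent}
    for u, v in color_class:
        gu, gv = find(u), find(v)
        if gu == gv:
            raise ValueError(f"edge {(u, v)} cannot cross the partition")
        graph[gu].add(gv)
        graph[gv].add(gu)

    # Step 2: 2-color the labelling graph with BFS.
    side: Dict[str, int] = {}
    for start in graph:
        if start in side:
            continue
        side[start] = 1
        queue = deque([start])
        while queue:
            g = queue.popleft()
            for h in graph[g]:
                if h not in side:
                    side[h] = 3 - side[g]  # alternate between processor 1 and 2
                    queue.append(h)
                elif side[h] == side[g]:
                    raise ValueError("processor labelling graph is not bipartite")
    return {actor: side[find(actor)] for actor in parent}


stream_edges = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "E"), ("D", "E"), ("E", "F")]
assignment = processor_labelling(stream_edges, {("A", "B"), ("E", "F")})
```

For the blue edges this groups B, C, D, and E together and places A and F on the other processor, so both blue edges cross the partition.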

Measuring the Communication Cost


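[Figure: for the blue-colored edges, A and F are mapped to one processor and B, C, D, and E to the other; the execution times of A, B, E, and F are measured.]

$$C_{(A,B)} = (t'_A - t_A) + (t'_B - t_B)$$

$$C_{(E,F)} = (t'_E - t_E) + (t'_F - t_F)$$

where t_X is the execution time of actor X without communication cost and t'_X is its execution time under this mapping.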
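Because the edges of one color class share no actor, a single experiment yields the cost of every edge in the class. A sketch with hypothetical timings for the blue edges (A, B) and (E, F); `color_class_costs` is an illustrative name, not from the slides.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def color_class_costs(
    baseline: Dict[str, float],
    experiment: Dict[str, float],
    color_class: List[Edge],
) -> Dict[Edge, float]:
    """Communication cost of every edge cut in one experiment.

    baseline: actor execution times without communication cost.
    experiment: actor execution times under the color class's mapping.
    Because edges of a class share no actor, each actor's slowdown is
    attributed to exactly one edge.
    """
    costs: Dict[Edge, float] = {}
    for i, k in color_class:
        costs[(i, k)] = (experiment[i] - baseline[i]) + (experiment[k] - baseline[k])
    return costs


# Hypothetical timings (microseconds), for illustration only.
baseline = {"A": 50.0, "B": 70.0, "E": 40.0, "F": 30.0}
blue_run = {"A": 58.0, "B": 81.0, "E": 46.0, "F": 37.0}
costs = color_class_costs(baseline, blue_run, [("A", "B"), ("E", "F")])
```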

Profiling Performance

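| Benchmark  | Total edges | Profiling steps | Steps/edge (%) | Error (%) |
|------------|-------------|-----------------|----------------|-----------|
| SAR        | 44          | 3               | 7              | 10        |
| MatrixMult | 88          | 21              | 24             | 17        |
| MergeSort  | 37          | 4               | 11             | 31        |
| FMRadio    | 21          | 3               | 14             | 24        |
| DCT        | 28          | 9               | 32             | 14        |
| RadixSort  | 12          | 2               | 17             | 5         |
| FFT        | 26          | 3               | 12             | 27        |
| MPEG       | 56          | 17              | 30             | 15        |
| Channel    | 22          | 6               | 27             | 11        |
| BeamFormer | 39          | 5               | 13             | 13        |
| GM         |             |                 | 17             | 15        |

GM: geometric mean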
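The GM row matches the geometric mean of each percentage column. This quick check recomputes it from the rounded per-benchmark values in the table:

```python
from math import exp, log
from typing import List


def geometric_mean(values: List[float]) -> float:
    """Geometric mean: exp of the mean of the logs (all values must be > 0)."""
    return exp(sum(log(v) for v in values) / len(values))


# Steps/edge (%) and Error (%) columns, in table order (SAR ... BeamFormer).
steps_per_edge = [7, 24, 11, 14, 32, 17, 12, 30, 27, 13]
error = [10, 17, 31, 24, 14, 5, 27, 15, 11, 13]

gm_steps = geometric_mean(steps_per_edge)
gm_error = geometric_mean(error)
print(f"GM steps/edge: {gm_steps:.0f}%, GM error: {gm_error:.0f}%")
```

The geometric mean is the usual summary for ratios such as these; the arithmetic mean of the Steps/edge column would be noticeably higher (about 18.7%).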


Related Works

[1] Static Scheduling of SDF Programs for DSP [Lee ’87]

[2] StreamIt: A language for streaming applications [Thies ’02]

[3] Phased Scheduling of Stream Programs [Thies ’03]

[4] Exploiting Coarse Grained Task, Data, and Pipeline Parallelism in Stream Programs [Thies ’06]

[5] Orchestrating the Execution of Stream Programs on Cell [Scott ’08]

[6] Software Pipelined Execution of Stream Programs on GPUs [Udupa ’09]

[7] Synergistic Execution of Stream Programs on Multicores with Accelerators [Udupa ’09]

[8] Orchestration by approximation [Farhad ’11]


Questions?
