
Post on 29-Mar-2015


Page 1

Mapping the FDTD Application to Many-Core Chip Architectures

Computer Architecture and Parallel Systems Laboratory
Electrical and Computer Engineering Department
University of Delaware

Daniel Orozco
Guang Gao

Page 2

Mapping FDTD to Many-Cores ------- Daniel Orozco 2

Outline

Time Spent Explaining Stuff

What is the problem?
What did others do?
What did we do?
Is it really better?
So, what's next?
Questions

Page 3

What is FDTD?

FDTD = Finite Difference Time Domain

FDTD simulates the propagation of electromagnetic waves through materials.

Scientific Formulation

[Equations: Maxwell's equations, relating the fields E, B, and D and the current density J.]

Discretization

The fields E(x, y, z) and H(x, y, z) are sampled on a grid, and derivatives are replaced by finite differences: $\partial E / \partial x \approx \Delta E / \Delta x$.

Iteration

E(0,0) E(0,1) E(0,2) E(0,3)

E(1,0) E(1,1) E(1,2) E(1,3)

E(2,0) E(2,1) E(2,2) E(2,3)

E(3,0) E(3,1) E(3,2) E(3,3)

Page 4

A Simple FDTD Computation
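As a rough illustration of what such a computation can look like, here is a minimal 1D sketch in Python. It assumes normalized units, a fixed update coefficient `c`, an initial pulse in the middle of the grid, and end cells held at zero. The names `fdtd_1d`, `e`, and `h` are illustrative and do not come from the slides.

```python
def fdtd_1d(n_cells, n_steps, c=0.5):
    """Leapfrog update of a 1D electric field e and magnetic field h.

    Assumptions (not from the slides): normalized units, constant
    coefficient c, end cells of e held at zero.
    """
    e = [0.0] * n_cells        # electric field samples
    h = [0.0] * (n_cells - 1)  # magnetic field samples, staggered between e cells
    e[n_cells // 2] = 1.0      # assumed excitation: a single pulse in the middle
    for _ in range(n_steps):
        # Each h value is updated from the two neighbouring e values.
        for i in range(n_cells - 1):
            h[i] += c * (e[i + 1] - e[i])
        # Each interior e value is updated from the two neighbouring h values.
        for i in range(1, n_cells - 1):
            e[i] += c * (h[i] - h[i - 1])
    return e, h


if __name__ == "__main__":
    e, h = fdtd_1d(n_cells=9, n_steps=1)
    print(e)
```

Every time step reads and writes both arrays in full, which is why the memory traffic per computed element is high when the arrays do not fit in on-chip memory.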

Page 5

Memory Wall and Many Core Architectures

[Figure: a many-core chip drawn as a grid of processors (P), memories (M), and floating-point units (FPU); labels: On Chip, Off Chip.]

Page 6

What is the point of this presentation?

What can be done about the off-chip memory bandwidth?

Use the on-chip memory!

[Figure: processors (P) and floating-point units (FPU) with on-chip memories (M).]

Page 7

Background: What are Data Dependencies?

Data dependencies show the values needed to calculate a particular value.

This is a Data Dependency Graph, or DDG.

DDGs are useful for checking whether code transformations are valid. If a particular transformation computes E(1,1) before E(0,2), which E(1,1) depends on, we know that it is not a valid transformation.

E(0,0) E(0,1) E(0,2) E(0,3)

E(1,0) E(1,1) E(1,2) E(1,3)
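The validity check described above can be sketched in a few lines of Python. The sketch assumes that E(t, i) depends on its three neighbours E(t-1, i-1), E(t-1, i), and E(t-1, i+1); this pattern is inferred from the E(1,1) and E(0,2) example, since the arrows of the original figure are not part of the transcript. The function names are illustrative.

```python
def depends_on(t, i, width):
    """Points that E(t, i) needs: its three neighbours in row t - 1 (assumed pattern)."""
    if t == 0:
        return []  # row 0 holds the initial values
    return [(t - 1, j) for j in (i - 1, i, i + 1) if 0 <= j < width]


def is_valid_order(order, width):
    """Return True if every point is computed after all the points it depends on."""
    done = set()
    for t, i in order:
        if any(dep not in done for dep in depends_on(t, i, width)):
            return False
        done.add((t, i))
    return True


if __name__ == "__main__":
    row_by_row = [(t, i) for t in range(2) for i in range(4)]
    # Same points, but E(1,1) is computed before E(0,2).
    reordered = [(0, 0), (0, 1), (1, 1), (0, 2), (0, 3), (1, 0), (1, 2), (1, 3)]
    print(is_valid_order(row_by_row, 4))
    print(is_valid_order(reordered, 4))
```

Any tiling scheme is just a particular execution order, so a check of this kind is one way to test whether a proposed tile shape respects the dependencies.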

Page 8

Stencil Computations

Image Processing

Solution of Partial Differential Equations

What do they have in common?

Read, Create New, Overwrite

What do their Data Dependency Graphs look like?

A lot of Memory Bandwidth is required!

E(0,0) E(0,1) E(0,2) E(0,3)

E(1,0) E(1,1) E(1,2) E(1,3)

Page 9

Tiling

Tiling is the process of calculating only a part of the problem at a time to reduce the memory limitations.

No Tiling: Memory Loads Per Element Computed: 9

Tiling: Memory Loads Per Element Computed: 1.44

[Figure: processors (P) and floating-point units (FPU) with on-chip memories (M).]
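The two load counts above can be reproduced with simple arithmetic under assumptions that the slide does not state: a 2D stencil that reads a 3 x 3 neighbourhood for every element, and a 10 x 10 tile that is loaded once together with a one-cell border. The function names below are illustrative.

```python
def loads_per_element_untiled(stencil_width):
    """Without tiling, every element reloads its whole neighbourhood."""
    return stencil_width ** 2


def loads_per_element_tiled(tile_side, stencil_width):
    """With tiling, the tile and its border are loaded once and then reused."""
    border = stencil_width // 2
    loaded = (tile_side + 2 * border) ** 2  # tile plus border cells
    computed = tile_side ** 2               # elements produced by the tile
    return loaded / computed


if __name__ == "__main__":
    print(loads_per_element_untiled(3))    # assumed 3 x 3 stencil
    print(loads_per_element_tiled(10, 3))  # assumed 10 x 10 tile
```

Under these assumptions the untiled count is 9 and the tiled count is 144 / 100 = 1.44. Larger tiles push the tiled count closer to 1.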

Page 10

Tiling and Parallel Execution

Tiling in a 1-Dimensional Algorithm

Rows represent successive loads and stores to memory.

Tiles cannot span more than one row because of mutual data dependence between neighboring tiles.

[Figure: tiles T1 and T2; labels: Tile Computed, Invalid Tiles.]

Page 11

Time Skewing

Tiling after Skewing

Tiles are parallel AND bigger.

The DDG has been redrawn to show how tiles can span several rows.

This kind of parallelism is called Wavefront Parallelism and is harder to program than regular tiles.

Page 12

Other Parallel Tiling Approaches: Overlapped Tiling

Better Tiling, but There are Redundant Computations

Tiles are fully parallel.

Only 50% of the computations are used!

[Figure: panels Logical View (lost computations not shown) and Tile Shape; legend: Lost Computations, Useful Computations, Memory Load, Memory Store.]

Page 13

Other Parallel Tiling Approaches: Split Tiling

No Lost Computations

Tiles are fully parallel.
No lost computations.

This is the state of the art.

[Figure: panels Logical View and Tile Shape; legend: Useful Computations, Memory Load, Memory Store.]

Page 14

Our Contribution: Diamond Tiling

No Lost Computations

Tiles are fully parallel.
No lost computations.
Maximum Reuse.

[Figure: panels Logical View and Tile Shape; legend: Useful Computations, Memory Load, Memory Store.]
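One common way to obtain diamond-shaped tiles for a 1D three-point stencil is to group points by the rotated coordinates t + i and t - i. The Python sketch below follows that idea and checks it against a plain row-by-row computation. It is an illustration under stated assumptions, not the authors' implementation: the averaging stencil, the fixed boundary values, the full space-time grid kept for clarity, and the names `naive` and `diamond_tiled` are all assumed here.

```python
def naive(u0, steps):
    """Reference: compute one full row per time step."""
    prev = list(u0)
    for _ in range(steps):
        cur = list(prev)  # boundary values stay fixed
        for i in range(1, len(prev) - 1):
            cur[i] = (prev[i - 1] + prev[i] + prev[i + 1]) / 3
        prev = cur
    return prev


def diamond_tiled(u0, steps, size):
    """Same computation, executed one diamond tile at a time."""
    n = len(u0)
    # Full space-time grid for clarity; every row starts as a copy of u0 so
    # that the fixed boundary values are present in all rows.
    grid = [list(u0) for _ in range(steps + 1)]

    # A point (t, i) belongs to the tile ((t + i) // size, (t - i) // size).
    # In the (i, t) plane each such tile is a diamond.
    tiles = {}
    for t in range(1, steps + 1):
        for i in range(1, n - 1):
            key = ((t + i) // size, (t - i) // size)
            tiles.setdefault(key, []).append((t, i))  # appended in increasing t

    # Tiles with the same a + b do not depend on each other, so each wave
    # could run in parallel; here they simply run one after another.
    for wave in sorted({a + b for a, b in tiles}):
        for (a, b), points in tiles.items():
            if a + b != wave:
                continue
            for t, i in points:
                grid[t][i] = (grid[t - 1][i - 1] + grid[t - 1][i] + grid[t - 1][i + 1]) / 3
    return grid[steps]


if __name__ == "__main__":
    start = [float(x * x % 7) for x in range(12)]
    print(diamond_tiled(start, steps=7, size=3) == naive(start, steps=7))
```

Every point is computed exactly once, so no computations are lost, and each tile covers several time steps, which is where the reuse comes from.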

Page 15

Is there a Trick?

Well, we have tile borders across time iterations…

And we do have to load and store TWO arrays to meet the dependencies.

But it’s all for a good cause.

[Figure: panels a) and b), each with axes i and t, marking the Start of Tile and the End of Tile.]

Page 16

We also tried: Triangle Tiling

No Lost Computations

Tiles are fully parallel.
No lost computations.
Very simple programming.

[Figure: panels Logical View and Tile Shape; legend: Useful Computations, Memory Load, Memory Store.]

Page 17

We also tried: Parametric Tiling

Tiles are fully parallel.
No lost computations.
Useful to understand the problem.

[Figure: Logical View for p = 0.16, p = 0.5, and p = 1; legend: Useful Computations, Memory Load, Memory Store.]

Page 18

Reuse

Reuse is “the key concept” for on-chip memory.

$$\text{Reuse} = \frac{\text{Number of elements computed}}{\text{Number of memory operations}}$$

Why is reuse important?

[Figure: 20 cores like this (P, M, FPU) need a connection like this, shown for Reuse = 40 and for Reuse = 5.]
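To make the link between reuse and the size of the off-chip connection concrete, here is a small Python sketch. The per-core compute rate and the bytes moved per memory operation are hypothetical values chosen only for illustration; only the 20 cores and the two reuse values come from the slide.

```python
def required_bandwidth(cores, elements_per_second, bytes_per_operation, reuse):
    """Off-chip bytes per second needed to keep all cores busy.

    Follows from Reuse = elements computed / memory operations:
    memory operations per second = elements per second / reuse.
    """
    memory_ops_per_second = cores * elements_per_second / reuse
    return memory_ops_per_second * bytes_per_operation


if __name__ == "__main__":
    # Hypothetical machine: 1e8 elements per second per core, 8 bytes per operation.
    low_reuse = required_bandwidth(20, 1e8, 8, reuse=5)
    high_reuse = required_bandwidth(20, 1e8, 8, reuse=40)
    print(low_reuse / high_reuse)
```

Whatever the actual rates are, going from a reuse of 5 to a reuse of 40 divides the required bandwidth by 8, because the bandwidth is inversely proportional to the reuse.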

Page 19

How good are Tiles at Reuse?

Reuse for a tile of size 100

No Tiling: 0.25
Simple Tiling: 0.49
Skewed Tiling: 8.33
Overlapped Tiling: 5.45
Split Tiling: 6.67
Triangle Tiling: 6.25
Parametric Tiling (p = 0.5): 9.38
Diamond Tiling: 12.50

Chart annotations: Not Embarrassingly Parallel; Developed at CAPSL.

The Fine Print: Values are for a tile size of 100. Reuse values change with the size of the tile. Results apply to a 1-dimensional stencil computation with dependencies similar to those of the examples.

Page 20

But, Does it Really Work?

Speedup

No Tiling: 1.00
Triangle, Size = 16: 3.19
Triangle, Size = 64: 7.94
Diamond, Size = 16: 6.04
Diamond, Size = 64: 13.51

The Fine Print: Simulated speedup results for FDTD 1D running on Cyclops-64 using the FAST simulator. Problem size varies for each test and was chosen to be as large as possible. Only the computation time was measured. Problem data located in DRAM. Tiling done manually. GCC 3.4 with -O3 used.

Page 21

Other Considerations

If two tiles have the same width, the one with the MOST AREA has the best reuse.

$$\text{Reuse} = \frac{\text{Number of elements computed}}{\text{Number of memory operations}} = \frac{\text{Area}}{\text{Perimeter}} = \frac{O(N^2)}{O(N)}$$

The Reuse is O(N).

The best tile is the BIGGEST tile.

[Figure: a Diamond tile of Size = N and a Parametric tile of Size = N, each labeled Low Reuse and High Reuse.]
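The area over perimeter argument can be checked numerically with a short Python sketch. The element and memory operation counts below are assumptions made here, not formulas stated on the slides: a diamond of width N is taken to cover about N²/2 elements, a triangle about N²/4, and both are taken to need 4N memory operations (loading and storing two arrays of N elements). These counts were chosen because they reproduce the 12.50 and 6.25 reported for a tile size of 100.

```python
def diamond_reuse(n):
    """Reuse of a diamond tile of width n (assumed counts)."""
    elements = n * n / 2   # assumed area of the diamond
    memory_ops = 4 * n     # assumed: load and store two arrays of n elements
    return elements / memory_ops


def triangle_reuse(n):
    """Reuse of a triangle tile of width n (assumed counts)."""
    elements = n * n / 4   # assumed area of the triangle
    memory_ops = 4 * n     # assumed: same memory traffic as the diamond
    return elements / memory_ops


if __name__ == "__main__":
    for n in (16, 64, 100):
        print(n, triangle_reuse(n), diamond_reuse(n))
```

With these counts the reuse grows linearly with N, and for the same width the diamond has twice the reuse of the triangle because it has twice the area.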

Page 22

So, Lead Us!

Bandwidth is the Limiting Factor for FDTD.

Reuse lowers the required Bandwidth.

Compute several TIMESTEPS at the same time.

And get better performance!

Page 23

Future Work: Multidimensional Diamonds?

How are we going to partition THAT???

Page 24

Future Work: Dataflow Diamonds

[Figure: tiles separated by a Barrier.]

It’s bad waiting for the slow tile…

And then they all compete for Bandwidth at the same time…

Dataflow will solve that.

Implementation is still a research topic.

Page 25

Multiple Diamond Hierarchies

Diamonds work… They use little Bandwidth.

But we still send the memory back after each Diamond…

We have a strong On-Chip Bus. Maybe we can work with a Super Diamond!

[Figure: on-chip memories (M) connected by an On-Chip Bus.]

Page 26

Questions?