
Mapping the FDTD Application to Many-Core Chip Architectures

Computer Architecture and Parallel Systems Laboratory

Electrical and Computer Engineering Department

University of Delaware

Daniel Orozco, Guang Gao

Mapping FDTD to Many-Cores ------- Daniel Orozco 2

Outline

Time Spent Explaining Stuff

What is the problem?

What did others do?

What did we do?

Is it really better?

So, What's Next?

Questions


What is FDTD?

FDTD = Finite Difference Time Domain

FDTD simulates the propagation of electromagnetic waves through materials.

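Scientific Formulation:

$$\nabla \times \mathbf{H} = \mathbf{J}_f + \frac{\partial \mathbf{D}}{\partial t}, \qquad \nabla \times \mathbf{E} = -\frac{\partial \mathbf{B}}{\partial t}, \qquad \nabla \cdot \mathbf{B} = 0, \qquad \nabla \cdot \mathbf{D} = \rho_f$$

Discretization: the fields $\mathbf{E}(x,y,z)$ and $\mathbf{H}(x,y,z)$ are sampled on a grid, and derivatives are replaced by finite differences:

$$\frac{\partial E}{\partial x} \approx \frac{\Delta E}{\Delta x}$$

Iteration: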

E(0,0) E(0,1) E(0,2) E(0,3)

E(1,0) E(1,1) E(1,2) E(1,3)

E(2,0) E(2,1) E(2,2) E(2,3)

E(3,0) E(3,1) E(3,2) E(3,3)


A Simple FDTD Computation
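As a rough illustration of what a simple FDTD computation can look like in one dimension, the sketch below advances an electric field array and a magnetic field array in alternation. The function name `fdtd_1d`, the normalized update coefficient `0.5`, the initial pulse and the boundary handling are all assumptions made for this example; none of them is taken from the slides.

```python
def fdtd_1d(n, steps, coeff=0.5):
    """Advance a 1D E/H field pair for `steps` time steps.

    Hypothetical, normalized sketch: `coeff` stands in for the
    physical constants and the time-step/grid-spacing ratio.
    """
    e = [0.0] * n
    h = [0.0] * n
    e[n // 2] = 1.0  # assumed initial condition: a single pulse

    for _ in range(steps):
        # H at point i is updated from the two neighbouring E values.
        for i in range(n - 1):
            h[i] += coeff * (e[i + 1] - e[i])
        # E at point i is updated from the two neighbouring H values.
        # e[0] is never written, so it acts as a fixed boundary.
        for i in range(1, n):
            e[i] += coeff * (h[i] - h[i - 1])
    return e, h


if __name__ == "__main__":
    e, h = fdtd_1d(9, 1)
    print(e)
```

Every time step sweeps both arrays completely, which is why a computation like this leans so heavily on memory bandwidth once the arrays no longer fit on chip.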


Memory Wall and Many Core Architectures

Figure: a many-core chip made of many repeated groups of processors (P), on-chip memories (M) and a shared FPU, connected to off-chip memory (M). Region labels: On Chip, Off Chip.


What is the point of this presentation?

What can be done about the off-chip memory bandwidth?

Use the on-chip memory!

Figure: processors (P) and FPUs working with on-chip memory (M).


Background: What are Data Dependencies?

Data dependencies show the values needed to calculate a particular value.

This is a Data Dependency Graph, or DDG.

DDGs are useful for checking whether code transformations are valid. If a particular transformation computes E(1,1) before E(0,2), we know that it is not a valid transformation.

E(0,0) E(0,1) E(0,2) E(0,3)

E(1,0) E(1,1) E(1,2) E(1,3)
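The validity check described above can be sketched in a few lines. The code assumes a three-point dependence pattern, in which E(t, i) needs E(t-1, i-1), E(t-1, i) and E(t-1, i+1); that pattern is an assumption chosen to match the E(1,1)/E(0,2) example, and the function names are hypothetical.

```python
def dependencies(t, i, width):
    """Values needed to compute E(t, i) under the assumed 3-point pattern."""
    if t == 0:
        return []  # row 0 holds the initial values
    return [(t - 1, j) for j in (i - 1, i, i + 1) if 0 <= j < width]


def is_valid_order(order, width):
    """Return True if every element is computed after all its dependencies."""
    done = set()
    for t, i in order:
        if any(dep not in done for dep in dependencies(t, i, width)):
            return False
        done.add((t, i))
    return True


if __name__ == "__main__":
    row_by_row = [(t, i) for t in range(2) for i in range(4)]
    print(is_valid_order(row_by_row, 4))

    # E(1,1) is scheduled before E(0,2), which it depends on.
    reordered = [(0, 0), (0, 1), (1, 1), (0, 2), (0, 3), (1, 0), (1, 2), (1, 3)]
    print(is_valid_order(reordered, 4))
```

Any tiling scheme can be tested the same way: list the order in which it visits the elements and check that order against the DDG.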


Stencil Computations

Image Processing

Solution of Partial Differential Equations

What do they have in common?

Read, Create New, Overwrite

What do their Data Dependency Graphs look like?

A lot of Memory Bandwidth is required!

E(0,0) E(0,1) E(0,2) E(0,3)

E(1,0) E(1,1) E(1,2) E(1,3)


Tiling

Tiling is the process of computing only a part of the problem at a time to ease memory limitations.

No Tiling: Memory Loads Per Element Computed: 9

Tiling: Memory Loads Per Element Computed: 1.44
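The slide does not say how the figures of 9 and 1.44 memory loads per element were obtained. One set of assumptions that happens to reproduce them is a two-dimensional 3x3 stencil and a 10x10 tile loaded once together with a one-element border. The sketch below encodes that assumed model; the function names and the tile size are hypothetical.

```python
def untiled_loads_per_element(radius=1):
    """Without tiling, every input of the stencil is loaded from memory."""
    side = 2 * radius + 1
    return side * side


def tiled_loads_per_element(tile, radius=1):
    """With tiling, a tile plus its border is loaded once and then reused."""
    loaded = (tile + 2 * radius) ** 2   # tile and surrounding border
    computed = tile * tile              # elements produced from that data
    return loaded / computed


if __name__ == "__main__":
    print(untiled_loads_per_element())   # 3x3 neighbourhood
    print(tiled_loads_per_element(10))   # 144 loads for 100 elements
```

Under this model the cost per element approaches one load as the tile grows, because the border becomes a smaller share of the data loaded.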


Tiling and Parallel Execution

Tiling in a 1-Dimensional Algorithm

Rows represent successive loads and stores to memory.

Invalid Tiles

Tiles cannot span more than one row, because neighboring tiles would then depend on each other.

Tile Computed

T1, T2


Tiles are parallel AND bigger

Tiling after Skewing

Time Skewing

The DDG has been redrawn to show how tiles can extend vertically across several rows.

This kind of parallelism is called Wavefront Parallelism and is harder to program than regular tiles.


Other Parallel Tiling Approaches: Overlapped Tiling

Better Tiling, but There are Redundant Computations

Only 50% of the computations are used!

Tiles are fully parallel. Lost computations not shown.

Figure panels: Logical View, Tile Shape. Legend: Lost Computations, Useful Computations, Memory Load, Memory Store.



Other Parallel Tiling Approaches: Split Tiling

No Lost Computations

Tiles are fully parallel. No lost computations.

This is the state of the art.

Figure legend: Useful Computations, Memory Load, Memory Store.



Our Contribution: Diamond Tiling


Tiles are fully parallel. No lost computations. Maximum reuse.

Figure legend: Useful Computations, Memory Load, Memory Store.
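To make the idea of diamond-shaped tiles concrete, here is a minimal sketch of one generic way to visit a 1D three-point stencil diamond by diamond. It is not the authors' implementation: the tile coordinates `(t + i) // size` and `(t - i) // size`, the averaging stencil, the fixed boundary cells and the full time-space history array are all assumptions chosen to keep the example short and easy to check.

```python
def stencil_naive(e0, steps):
    """Reference version: sweep the whole array once per time step."""
    prev = list(e0)
    for _ in range(steps):
        cur = list(prev)  # boundary cells keep their values
        for i in range(1, len(prev) - 1):
            cur[i] = (prev[i - 1] + prev[i] + prev[i + 1]) / 3.0
        prev = cur
    return prev


def stencil_diamond(e0, steps, size):
    """Same computation, visiting the (t, i) points diamond by diamond."""
    n = len(e0)
    # Full history of every time step: wasteful, but it keeps the sketch
    # simple. A real implementation would hold only tile-local data.
    hist = [list(e0) for _ in range(steps + 1)]

    # Cutting the (t, i) plane along t + i and t - i gives diamond tiles.
    tiles = {}
    for t in range(1, steps + 1):
        for i in range(1, n - 1):
            key = ((t + i) // size, (t - i) // size)
            tiles.setdefault(key, []).append((t, i))

    # A tile only needs data from tiles whose coordinate sum is smaller,
    # so tiles sharing the same sum are independent of each other.
    for key in sorted(tiles, key=lambda k: k[0] + k[1]):
        for t, i in tiles[key]:  # points were appended in increasing t
            p = hist[t - 1]
            hist[t][i] = (p[i - 1] + p[i] + p[i + 1]) / 3.0
    return hist[steps]


if __name__ == "__main__":
    start = [float((7 * k) % 11) for k in range(40)]
    print(stencil_diamond(start, 12, 5) == stencil_naive(start, 12))
```

In this sketch the tiles are processed one after another; the point of the ordering is that all tiles with the same coordinate sum could be handed to different cores at once.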


Is there a Trick?

Well, we have tile borders across time iterations…

And we do have to load and store TWO arrays to meet the dependencies.

But it's all for a good cause.

Figure: panels a) and b), each with axes i and t. Labels: Start of Tile, End of Tile.



We also tried: Triangle Tiling


Tiles are fully parallel. No lost computations. Very simple programming.

Figure legend: Useful Computations, Memory Load, Memory Store.


We also tried: Parametric Tiling

Tiles are fully parallel. No lost computations. Useful to understand the problem.

Figure: Logical View, with tiles shown for p = 0.5, p = 1 and p = 0.16. Legend: Useful Computations, Memory Load, Memory Store.


Reuse

Reuse is the key concept for on-chip memory.

$$\text{Reuse} = \frac{\text{Number of elements computed}}{\text{Number of memory operations}}$$

Why is reuse important?

20 cores like this need a connection like this.

Figure: the off-chip memory connection needed by the cores, drawn for Reuse = 40 and for Reuse = 5.

How good are Tiles at Reuse?

Reuse for a tile of size 100

Tiling scheme                  Reuse
No Tiling                       0.25
Simple Tiling                   0.49
Skewed Tiling                   8.33
Overlapped Tiling               5.45
Split Tiling                    6.67
Triangle Tiling                 6.25
Parametric Tiling (p = 0.5)     9.38
Diamond Tiling                 12.50

Chart annotations: Not Embarrassingly Parallel; Developed at CAPSL.

The Fine Print: Values are for a tile size of 100. Reuse values change with the size of the tile. Results apply to 1-dimensional stencil computations with dependencies similar to those of the examples.
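The slides do not spell out how the reuse values were counted. The sketch below is one counting model, inferred here and not confirmed by the source, that happens to reproduce the chart's values of 0.25, 6.25, 9.38 and 12.50: an untiled element costs three loads and one store, and a tile of width N loads and stores two arrays of N elements (4N memory operations) while computing N²(1 + p)/4 elements, with p = 0 standing for a triangle and p = 1 for a diamond. All function names are hypothetical.

```python
def reuse_no_tiling():
    """Assumed: each element needs 3 loads and 1 store."""
    return 1 / (3 + 1)


def reuse_parametric(n, p):
    """Assumed model: area N^2 (1 + p) / 4 over 4N memory operations."""
    elements_computed = n * n * (1 + p) / 4
    memory_operations = 4 * n  # load and store two arrays of n elements
    return elements_computed / memory_operations


def reuse_triangle(n):
    return reuse_parametric(n, 0.0)  # assumed: triangle is the p = 0 case


def reuse_diamond(n):
    return reuse_parametric(n, 1.0)  # assumed: diamond is the p = 1 case


if __name__ == "__main__":
    print(reuse_no_tiling())
    print(reuse_triangle(100), reuse_parametric(100, 0.5), reuse_diamond(100))
    print(reuse_diamond(200))  # reuse grows linearly with the tile size
```

The model does not attempt to cover the simple, skewed, overlapped or split schemes, whose counts cannot be recovered from the slide text.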


But, Does it Really Work?

Speedup

Configuration            Speedup
No Tiling                  1.00
Triangle, Size = 16        3.19
Triangle, Size = 64        7.94
Diamond, Size = 16         6.04
Diamond, Size = 64        13.51

The Fine Print: Simulated speedup results for FDTD 1D running on Cyclops-64 using the FAST simulator. The problem size varies for each test and was chosen to be as large as possible. Only the computation time was measured. Problem data was located in DRAM. Tiling was done manually. Compiled with GCC 3.4 using -O3.


Other Considerations

If two tiles have the same width, the one with the MOST AREA has the best reuse.

$$\text{Reuse} = \frac{\text{Number of elements computed}}{\text{Number of memory operations}} = \frac{\text{Area}}{\text{Perimeter}} = \frac{O(N^2)}{O(N)} = O(N)$$

The Reuse is O(N).

The best tile is the BIGGEST tile.

Figure labels: Diamond, Size = N; Parametric, Size = N; Low Reuse; High Reuse.


So, Lead Us!

Bandwidth is the Limiting Factor for FDTD.

Reuse lowers the required Bandwidth.

Compute several TIMESTEPS at the same time.

And get better performance!


Future Work: Multidimensional Diamonds?

????

How are we going to partition THAT???


Future Work: Dataflow Diamonds

Barrier

It's bad waiting for the slow tile… And then they all compete for Bandwidth at the same time…

Dataflow will solve that. Implementation is still a research topic.


Multiple Diamond Hierarchies

Diamonds work… They use little Bandwidth.

But we still send the memory back after each Diamond…

We have a strong On-Chip Bus. Maybe we can work with a Super Diamond!

Figure: on-chip memories (M) connected by the On-Chip Bus.


Questions?

