Post on 23-Apr-2020


A Parallel Out-of-Core Algorithm for Time-Domain Adaptive Integral Method

Guneet KAUR and Ali E. YILMAZ

Department of Electrical & Computer Engineering
University of Texas at Austin

IEEE International Conference on Computational Electromagnetics (ICCEM)
Hong Kong, February 2-5, 2015

Motivation
- Integral Equation Methods (Historical Perspective)
- TD-AIM vs. FD-AIM
- Memory Hierarchy
- Out of Core Algorithms

(High-Frequency) IE Methods: Historical Progress since RWG*

* http://web.corral.tacc.utexas.edu/BioEM-Benchmarks/historicalProgress/index.html

[1] S. M. Rao, D. R. Wilton, A. W. Glisson, IEEE TAP, May 1982.
[2] A. Taflove, K. Umashankar, IEEE TEMC, Nov. 1983.
[3] D. H. Schaubert, D. R. Wilton, A. W. Glisson, IEEE TAP, Jan. 1984.
[4] M. F. Catedra, J. G. Cuevas, L. Nuno, IEEE TAP, Dec. 1988.
[5] M. F. Catedra, E. Gago, L. Nuno, IEEE TAP, May 1989.
[6] T. Cwik, J. Patterson, D. Scott, Proc. Supercomp., Nov. 1992.
[7] H. Gan, W. C. Chew, J EM Waves Appl., 1995.
[8] J. M. Song and W. C. Chew, MOTL, Sep. 1995.
[9] E. Bleszynski et al., Radio Sci., Sep.-Oct. 1996.
[10] C. F. Wang and J. M. Jin, IEEE TMTT, May 1998.
[11] J. Song et al., IEEE TAP, June 1998.
[12] S. Velamparambil, W. C. Chew, J. Song, IEEE AP Mag., Apr. 2003.
[13] A. Rubinstein et al., IEEE TEMC, May 2003.
[14] Ö. Ergül, L. Gürel, IEEE TAP, Aug. 2008.
[15] J. M. Taboada et al., IEEE AP Mag., Dec. 2009.
[16] A. Heldring et al., IEEE TAP, Jan. 2013.

TDIE papers lagging:
(1) Late-time instability: implicit solvers, better temporal basis functions, well-conditioned IE formulations, exact/semi-analytical integration techniques, ...
(2) High computation time: fast algorithms
(3) High(er) memory requirement
(4) ...

[17] F. Wei, A. E. Yılmaz, IEEE TAP, Feb. 2014.
[18] C. L. Bennett, H. Mieras, Radio Sci., Nov.-Dec. 1981.
[19] S. M. Rao, D. R. Wilton, IEEE TAP, Jan. 1991.
[20] D. A. Vechinski and S. M. Rao, IEEE TAP, June 1992.
[21] M. J. Bluck, S. P. Walker, IEEE TAP, May 1997.
[22] S. J. Dodson, S. P. Walker, M. J. Bluck, IEEE AP Mag., Aug. 1998.
[23] B. Shanker et al., IEEE TAP, Apr. 2000.
[24] J. L. Hu, C. H. Chan, and Y. Xu, MOTL, May 2000.
[25] A. E. Yılmaz et al., IEEE TAP, July 2002.
[26] B. Shanker et al., IEEE TAP, Mar. 2003.
[27] A. E. Yılmaz, J. M. Jin, E. Michielssen, IEEE TAP, Oct. 2004.
[28] H. Bağcı et al., IEEE TEMC, May 2007.
[29] H. Bağcı et al., IEEE TEMC, Feb. 2010.
[30] G. Kaur and A. E. Yılmaz, Proc. ACES, Apr. 2012.
[31] G. Kaur and A. E. Yılmaz, submitted IEEE TAP, Sep. 2014.
[32] G. Kaur and A. E. Yılmaz, ICCEM, Feb. 2015. (This talk)

TD-AIM vs. FD-AIM

[Figure: three log-log panels vs. $N_s$ ($10^2$ to $10^7$) comparing FD-AIM Plate, TD-AIM Plate, FD-AIM Sphere, TD-AIM Sphere]
- Matrix-fill time (s); reference curve $O(N_s)$
- Marching/solution time per time/freq. step (s); reference curves $O(N_s^{3/2}\log N_s)$, $O(N_s^{5/4}\log^2 N_s)$, $O(N_s^{3/2}\log^2 N_s)$, $O(N_s\log^2 N_s)$
- Memory (GB); reference curves $O(N_s)$, $O(N_s^2)$, $O(N_s^{3/2})$

Number of frequency samples $\propto$ bandwidth of fields
Number of time samples $\propto$ maximum frequency content of the fields

Envelope-tracking methods [1-2] require a number of time samples $\propto$ bandwidth and store a smaller time history than traditional time-domain methods.

[1] A. Mohan and D. S. Weile, IEEE TAP, 2005.
[2] G. Kaur and A. E. Yılmaz, IEEE TAP, to appear in 2015.

Memory Hierarchy

[Figure: memory hierarchy, from the processor (control unit, registers) through on-chip cache, off-chip cache, main memory and local secondary storage to remote secondary storage. Arrows: increasing capacity and access times; decreasing costs and frequency of access.]

Values in parentheses: Stampede@TACC*

Level                      Access time             Transfer time/byte    Size
Registers                  300 ps                  1-10 ps               1000 bytes
On-chip cache              1-10 ns                 25-50 ps              64-256 kB (L1: 32 kB, L2: 256 kB)
Off-chip cache             10-20 ns                50-100 ps             2-4 MB (L3: 20 MB)
Main memory                50-100 ns (~20-75 ns)   50-200 ps (~20 ps)    4-16 GB (32 GB, 2 GB/core)
Local secondary storage    5-10 ms                 0.2-2 ns              4-16 TB (80-430 GB)
Remote secondary storage   ?                       ? (~2 ns)             O(PB) (14 PB)

* Stampede@TACC: #7 supercomputer in top500 list, Nov. 2014

http://www.edn.com/design/systems-design/4397051/Memory-Hierarchy-Design-part-1
John Hennessy and David Patterson, Computer Architecture, Fifth Edition: A Quantitative Approach, Morgan Kaufmann, San Francisco, CA, 1996.

Memory Hierarchy

Access time: 50 ns vs. 5 ms
Transfer time: 20 ps vs. 0.2-2 ns

F/A-18 Hornet
Top speed: 1915 km/h

Wells (2014 world snail racing champ)
Sustained speed: ~1.66 mm/s
http://www.scase.co.uk/snailracing/

Speed ratio: $3.2 \times 10^5$ (~access time ratio)

McLaren 12c (fastest stock car 0-200 kph)
Highway/top speed: 65 mph/210 mph
http://www.dragtimes.ru/en/blogs/view/2659 29-
(~transfer time ratio)

Inspiration for this analogy: http://blog.scoutapp.com/articles/2011/02/10/understanding-disk-i-o-when-should-you-be-worried

Out of Core Algorithms

• Use disk space in addition to core memory
+ Lots of space: $10^{3\text{-}6}\times$ (or more)
- I/O to disk much slower than memory

To read/write $N$ bytes: $t^{\mathrm{rw}}_{N} = t^{\mathrm{lat}}_{\mathrm{IO}} + N\,t^{\mathrm{bw}}_{\mathrm{IO}}$
Access time (latency): $t^{\mathrm{lat}}_{\mathrm{IO}} = 10^{5} \times t^{\mathrm{lat}}_{\mathrm{core}}$
Transfer time (bandwidth): $t^{\mathrm{bw}}_{\mathrm{IO}} = 10^{1\text{-}2} \times t^{\mathrm{bw}}_{\mathrm{core}}$

[Figure: time vs. I/O volume (bytes); intercept $t^{\mathrm{lat}}_{\mathrm{IO}}$, slope $t^{\mathrm{bw}}_{\mathrm{IO}}$]

+ Possible to amortize I/O costs
Latency: Read/write large chunks of data (simple)
Bandwidth: Many FLOPs per byte read/written (not so simple)

Example: Dense-matrix-vector product (double-precision) on Stampede
- Store 100 kB in memory
- Read data in 100 kB chunks & multiply

[Figure: time per MVM (s) vs. number of unknowns ($10^2$ to $10^5$), out of core vs. in core, reference curve $O(N_s^2)$; annotation: x 50]
[Figure: memory (GB) vs. number of unknowns ($10^2$ to $10^5$), out of core vs. in core, reference curve $O(N_s^2)$]

• Goal: Reduce memory requirement without (significantly) increasing run time
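The latency/bandwidth cost model on this slide can be tried out numerically. The following is a minimal sketch, not the authors' benchmark: the function name `io_time`, the chunk sizes, and the disk parameters (5 ms latency, 2 ns per byte, taken from the ranges quoted on the memory hierarchy slide) are assumptions chosen for illustration.

```python
def io_time(n_bytes, chunk_bytes, t_lat, t_bw):
    """Time to move n_bytes in chunks: one latency per chunk plus transfer time per byte."""
    n_chunks = -(-n_bytes // chunk_bytes)  # ceiling division
    return n_chunks * t_lat + n_bytes * t_bw

# Assumed disk-like parameters (illustrative values only)
T_LAT = 5e-3   # access time per request, seconds
T_BW = 2e-9    # transfer time per byte, seconds

# A 10^4 x 10^4 double-precision matrix occupies 8e8 bytes
total_bytes = 8 * 10**8

time_small_chunks = io_time(total_bytes, 100 * 10**3, T_LAT, T_BW)  # 100 kB chunks
time_large_chunks = io_time(total_bytes, 100 * 10**6, T_LAT, T_BW)  # 100 MB chunks

print(f"100 kB chunks: {time_small_chunks:.2f} s")
print(f"100 MB chunks: {time_large_chunks:.2f} s")
```

With these assumed numbers the latency term dominates for small chunks and becomes negligible for large ones, which is the "read/write large chunks" amortization; the bandwidth term is unchanged and can only be hidden by doing more arithmetic per byte.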

Background
- Time Domain Integral Equations Basics
- Method of Moments
- Time Marching
- Alternative Ways to Calculate Right-Hand Side

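TDIE Basics

[Figure: PEC scatterer with surface $S$ and unit normal $\hat{n}(\mathbf{r})$ in a medium with $\epsilon, \mu$, illuminated by $\mathbf{E}^{\mathrm{inc}}, \mathbf{H}^{\mathrm{inc}}$; induced current $\mathbf{J}(\mathbf{r},t)$]

- Frequency interval: $f_{\min} \le f \le f_{\max}$
- Time interval: $0 \le t \le T_{\max}$

Boundary conditions, $\forall\, \mathbf{r} \in S$:
$$-\hat{n}\times\hat{n}\times\partial_t \mathbf{E}^{\mathrm{sca}}(\mathbf{r},t) = \hat{n}\times\hat{n}\times\partial_t \mathbf{E}^{\mathrm{inc}}(\mathbf{r},t)$$
$$\partial_t \mathbf{J}(\mathbf{r},t) - \hat{n}\times\partial_t \mathbf{H}^{\mathrm{sca}}(\mathbf{r},t) = \hat{n}\times\partial_t \mathbf{H}^{\mathrm{inc}}(\mathbf{r},t)$$

Scattered fields:
$$\begin{Bmatrix} \partial_t \mathbf{E}^{\mathrm{sca}}(\mathbf{r},t) \\ \partial_t \mathbf{H}^{\mathrm{sca}}(\mathbf{r},t) \end{Bmatrix} = \begin{Bmatrix} -\partial_t^{2} \mathbf{A}(\mathbf{r},t) - \partial_t \nabla\phi(\mathbf{r},t) \\ \mu^{-1}\, \nabla\times\partial_t \mathbf{A}(\mathbf{r},t) \end{Bmatrix}$$

Potentials:
$$\begin{Bmatrix} \mathbf{A}(\mathbf{r},t) \\ \partial_t \phi(\mathbf{r},t) \end{Bmatrix} = \iint_S \begin{Bmatrix} \mu\, \mathbf{J}(\mathbf{r}',t) \\ -\epsilon^{-1}\, \nabla'\cdot\mathbf{J}(\mathbf{r}',t) \end{Bmatrix} * g(\mathbf{r},\mathbf{r}',t)\, ds'$$

$$g(\mathbf{r},\mathbf{r}',t) = \frac{\delta(t - |\mathbf{r}-\mathbf{r}'|/c)}{4\pi|\mathbf{r}-\mathbf{r}'|}$$

• EFIE, MFIE, CFIE
$$-\hat{n}\times\hat{n}\times\big(\partial_t^{2} \mathbf{A}(\mathbf{r},t) + \partial_t \nabla\phi(\mathbf{r},t)\big) = -\hat{n}\times\hat{n}\times\partial_t \mathbf{E}^{\mathrm{inc}}(\mathbf{r},t) \quad \text{(TD-EFIE)}$$
$$\partial_t \mathbf{J}(\mathbf{r},t) - \hat{n}\times\mu^{-1}\nabla\times\partial_t \mathbf{A}(\mathbf{r},t) = \hat{n}\times\partial_t \mathbf{H}^{\mathrm{inc}}(\mathbf{r},t) \quad \text{(TD-MFIE)}$$
$$\text{TD-CFIE} = \eta_0^{-1}\alpha\, \text{TD-EFIE} + (1-\alpha)\, \text{TD-MFIE}$$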

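Numerical Solution

• Method of moments

- Discretize geometry, expand unknowns
$$\mathbf{J}(\mathbf{r},t) \approx \sum_{k'=1}^{N_S}\sum_{l'=1}^{N_T} I_{k',l'}\, \mathbf{S}_{k'}(\mathbf{r})\, T(t - l'\Delta t)$$
Rao, Wilton, Glisson, IEEE Trans. AP, May 1982.

Must resolve fastest variations:
$$\Delta s \approx \lambda_{\min}/10, \quad N_S \propto A_S f_{\max}^{2}/c^{2} \ \text{(HF)}$$
$$\Delta t \approx 1/(10 f_{\max}), \quad N_T \propto T_{\max} f_{\max}$$

- Testing
$$\int_{0}^{\infty} dt \iint_S ds\; \delta(t - l\Delta t)\, \mathbf{S}_k(\mathbf{r}) \cdot \text{TD-CFIE} \quad \text{for } l = 1,\dots,N_T$$

- System of equations
$$\sum_{l'=1}^{l} \mathbf{Z}_{l-l'} \mathbf{I}_{l'} = \mathbf{V}^{\mathrm{inc}}_{l} \quad \text{for } l = 1,\dots,N_T$$

G. Manara et al., IEEE Trans. AP, Mar. 1997.
D. S. Weile et al., IEEE Trans. AP, Jan. 2004.
G. Kaur and A. E. Yilmaz, MOTL, June 2011.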

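Marching on in Time

$$\begin{bmatrix} \mathbf{Z}_0 & & & & & \\ \mathbf{Z}_1 & \mathbf{Z}_0 & & & & \\ \mathbf{Z}_2 & \mathbf{Z}_1 & \mathbf{Z}_0 & & & \\ \vdots & \ddots & \ddots & \ddots & & \\ \mathbf{Z}_{N_g} & \cdots & \mathbf{Z}_2 & \mathbf{Z}_1 & \mathbf{Z}_0 & \\ & \ddots & & \ddots & \ddots & \ddots \\ \mathbf{0} & & \mathbf{Z}_{N_g} & \cdots & \mathbf{Z}_1 & \mathbf{Z}_0 \end{bmatrix}_{N_T N_S \times N_T N_S} \begin{bmatrix} \mathbf{I}_1 \\ \mathbf{I}_2 \\ \mathbf{I}_3 \\ \vdots \\ \mathbf{I}_{N_g+1} \\ \vdots \\ \mathbf{I}_{N_T} \end{bmatrix} = \begin{bmatrix} \mathbf{V}^{\mathrm{inc}}_1 \\ \mathbf{V}^{\mathrm{inc}}_2 \\ \mathbf{V}^{\mathrm{inc}}_3 \\ \vdots \\ \mathbf{V}^{\mathrm{inc}}_{N_g+1} \\ \vdots \\ \mathbf{V}^{\mathrm{inc}}_{N_T} \end{bmatrix}$$

Matrix properties:
- Lower triangular, Toeplitz, sparse blocks
- Non-zero entries: $\Theta(N_T N_S^2)$
- Unique entries: $\Theta(N_S^2)$
- Unique blocks: $N_g + 1$
- Due to causality, time invariance, uniform time-step size, (discretely) causal sub-domain space-time basis functions

- Frequency domain:
$$\begin{bmatrix} \mathbf{Z}_{f_1} & & & \mathbf{0} \\ & \mathbf{Z}_{f_2} & & \\ & & \ddots & \\ \mathbf{0} & & & \mathbf{Z}_{f_{N_F}} \end{bmatrix}_{N_F N_S \times N_F N_S} \begin{bmatrix} \mathbf{I}_{f_1} \\ \mathbf{I}_{f_2} \\ \vdots \\ \mathbf{I}_{f_{N_F}} \end{bmatrix} = \begin{bmatrix} \mathbf{V}^{\mathrm{inc},f_1} \\ \mathbf{V}^{\mathrm{inc},f_2} \\ \vdots \\ \mathbf{V}^{\mathrm{inc},f_{N_F}} \end{bmatrix}$$
- Diagonal, unique, dense blocks
- Non-zero/unique entries: $N_F N_S^2$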

Marching on in Time

$$\begin{bmatrix} \mathbf{Z}_0 & & & \mathbf{0} \\ & \mathbf{Z}_0 & & \\ & & \ddots & \\ \mathbf{0} & & & \mathbf{Z}_0 \end{bmatrix} \begin{bmatrix} \mathbf{I}_1 \\ \mathbf{I}_2 \\ \vdots \\ \mathbf{I}_{N_T} \end{bmatrix} = \begin{bmatrix} \mathbf{V}^{\mathrm{inc}}_1 \\ \mathbf{V}^{\mathrm{inc}}_2 \\ \vdots \\ \mathbf{V}^{\mathrm{inc}}_{N_T} \end{bmatrix} - \begin{bmatrix} \mathbf{0} & & & & \\ \mathbf{Z}_1 & \mathbf{0} & & & \\ \mathbf{Z}_2 & \mathbf{Z}_1 & \mathbf{0} & & \\ \vdots & \ddots & \ddots & \ddots & \\ \mathbf{0} & \mathbf{Z}_{N_g} & \cdots & \mathbf{Z}_1 & \mathbf{0} \end{bmatrix} \begin{bmatrix} \mathbf{I}_1 \\ \mathbf{I}_2 \\ \mathbf{I}_3 \\ \vdots \\ \mathbf{I}_{N_T} \end{bmatrix}$$

• Computational Costs

Storage: $\mathbf{Z}_0$ matrix $\Theta(N_0 N_S)$ + $\mathbf{Z}_1$-$\mathbf{Z}_{N_g}$ matrices $\Theta(N_S^2)$ + current/field vectors $\Theta(N_g N_S)$

FLOPs: matrix fill $O(N_S^2)$ + iterative solution $O(N_T N_I N_0 N_S)$ + right-hand side $O(N_T N_S^2)$
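The time-marching recursion implied by the block lower-triangular Toeplitz system can be sketched in a few lines of NumPy. This is a minimal dense sketch under stated assumptions: the function name `march_on_in_time` is hypothetical, the blocks are stored as small dense arrays, and a direct solve with `numpy.linalg.solve` stands in for the iterative solver mentioned on the slide.

```python
import numpy as np

def march_on_in_time(Z, V_inc):
    """Solve Z[0] I[l] = V_inc[l] - sum_{d=1..Ng} Z[d] I[l-d], one time step at a time.

    Z     : list of (NS, NS) arrays, Z[0] ... Z[Ng] (the unique Toeplitz blocks)
    V_inc : (NT, NS) array of incident-field vectors, one row per time step
    """
    n_g = len(Z) - 1
    n_t, n_s = V_inc.shape
    I = np.zeros((n_t, n_s))
    for l in range(n_t):                      # 0-based time-step index
        rhs = V_inc[l].copy()
        for d in range(1, min(n_g, l) + 1):   # only the last Ng currents contribute
            rhs -= Z[d] @ I[l - d]
        I[l] = np.linalg.solve(Z[0], rhs)     # stand-in for the iterative solve
    return I

# Small random example (sizes are arbitrary illustration values)
rng = np.random.default_rng(0)
NS, NT, NG = 3, 6, 2
Z_blocks = [4.0 * np.eye(NS) + 0.1 * rng.standard_normal((NS, NS))]
Z_blocks += [0.3 * rng.standard_normal((NS, NS)) for _ in range(NG)]
V = rng.standard_normal((NT, NS))
I_mot = march_on_in_time(Z_blocks, V)
```

Only the last $N_g$ current vectors are read at each step, which is where the $\Theta(N_g N_S)$ vector storage in the cost summary comes from.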

Marching on in Time: High-Frequency

- HF: $c\Delta t \sim \Delta s$ and unknowns distributed such that $N_0 \ll N_S$ => $N_g = O(N_S^{1/2})$

[Figure*: time axis marked $t$, $\Delta t$, $2\Delta t$, $(N_g+1)\Delta t$; distances marked $c\Delta t$]
* Graph describes radiation by a point source that is a pulse of width $\Delta t$ turned on at time 0

• Computational Costs

Storage: $\mathbf{Z}_0$ matrix $\Theta(N_0 N_S)$ + $\mathbf{Z}_1$-$\mathbf{Z}_{N_g}$ matrices $\Theta(N_S^2)$ + current/field vectors $\Theta(N_g N_S)$

FLOPs: matrix fill $O(N_S^2)$ + iterative solution $O(N_T N_I N_0 N_S)$ + right-hand side $O(N_T N_S^2)$

Scattered-Field (Right-Hand-Side) Computation

$$\begin{bmatrix} \mathbf{V}^{\mathrm{sca}}_1 \\ \mathbf{V}^{\mathrm{sca}}_2 \\ \mathbf{V}^{\mathrm{sca}}_3 \\ \vdots \\ \mathbf{V}^{\mathrm{sca}}_{N_T} \end{bmatrix} = \begin{bmatrix} \mathbf{0} & & & & \\ \mathbf{Z}_1 & \mathbf{0} & & & \\ \mathbf{Z}_2 & \mathbf{Z}_1 & \mathbf{0} & & \\ \vdots & \ddots & \ddots & \ddots & \\ \mathbf{0} & \mathbf{Z}_{N_g} & \cdots & \mathbf{Z}_1 & \mathbf{0} \end{bmatrix} \begin{bmatrix} \mathbf{I}_1 \\ \mathbf{I}_2 \\ \mathbf{I}_3 \\ \vdots \\ \mathbf{I}_{N_T} \end{bmatrix}$$

[Figure: time-step axis $l = 1, 2, \dots, 8, \dots, N_g, N_g+1, \dots, N_T$]

• Computational Costs

Storage: $\mathbf{Z}_0$ matrix $\Theta(N_0 N_S)$ + $\mathbf{Z}_1$-$\mathbf{Z}_{N_g}$ matrices $\Theta(N_S^2)$ + current/field vectors $\Theta(N_g N_S)$

FLOPs: matrix fill $O(N_S^2)$ + iterative solution $O(N_T N_I N_0 N_S)$ + right-hand side $O(N_T N_S^2)$

Scattered-Field (Right-Hand-Side) Computation

• Row multiplication: Store current history
- Must access all of the stored data at every time step $l \ge N_g$

Storage needed: $\mathbf{Z}_1$-$\mathbf{Z}_{N_g}$ matrices $\Theta(N_S^2)$ + current/field vectors $\Theta(N_g N_S)$

[Figure: block matrix equation $\mathbf{V}^{\mathrm{sca}} = \mathbf{Z}\,\mathbf{I}$ and plot of memory accessed vs. time step $l$, with level $\Theta(N_S^2) + \Theta(N_g N_S)$ marked]

S. P. Walker, IEEE AP Mag., Oct. 1997.

Scattered-Field (Right-Hand-Side) Computation

• Row multiplication: Store current history
• Column multiplication: Store future fields
- Must access all of the stored data at every time step $l \le N_T - N_g$

Storage needed: $\mathbf{Z}_1$-$\mathbf{Z}_{N_g}$ matrices $\Theta(N_S^2)$ + current/field vectors $\Theta(N_g N_S)$

[Figure: block matrix equation $\mathbf{V}^{\mathrm{sca}} = \mathbf{Z}\,\mathbf{I}$ and plot of memory accessed vs. time step $l = 1, 2, 3, \dots, N_T - N_g$, with level $\Theta(N_S^2) + \Theta(N_g N_S)$ marked]

S. P. Walker, IEEE AP Mag., Oct. 1997.
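The two orderings above compute the same scattered-field vectors and differ only in what is kept in memory. A minimal NumPy sketch follows, assuming small dense blocks; the function names `rhs_row` and `scatter_column` are hypothetical, and for simplicity the column version keeps all future field vectors rather than only the next $N_g$.

```python
import numpy as np

def rhs_row(Z, I_hist, l):
    """Row multiplication: at step l, gather contributions from the stored past currents."""
    n_g = len(Z) - 1
    v = np.zeros(I_hist.shape[1])
    for d in range(1, min(n_g, l) + 1):
        v += Z[d] @ I_hist[l - d]
    return v

def scatter_column(Z, V_future, I_l, l):
    """Column multiplication: once I_l is known, add its effect to the stored future fields."""
    n_g = len(Z) - 1
    n_t = V_future.shape[0]
    for d in range(1, n_g + 1):
        if l + d < n_t:
            V_future[l + d] += Z[d] @ I_l

# Illustration with random data (Z[0] is unused by the right-hand side)
rng = np.random.default_rng(2)
Z_blocks = [None] + [rng.standard_normal((3, 3)) for _ in range(4)]
I_all = rng.standard_normal((10, 3))

V_row = np.array([rhs_row(Z_blocks, I_all, l) for l in range(10)])
V_col = np.zeros((10, 3))
for l in range(10):
    scatter_column(Z_blocks, V_col, I_all[l], l)
```

In both versions every retained block and vector is touched at each step, which matches the "must access all of the stored data at every time step" remark on the slides.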

Scattered-Field (Right-Hand-Side) Computation

• Row multiplication: Store current history
• Column multiplication: Store future fields
• Multilevel block multiplication: Store both

Storage needed: $\mathbf{Z}_1$-$\mathbf{Z}_{N_g}$ matrices $\Theta(N_S^2)$ + current/field vectors $\Theta(N_g N_S)$

[Figure: 16-step example. Strictly lower block-triangular Toeplitz matrix with blocks $\mathbf{Z}_1, \dots, \mathbf{Z}_{15}$ multiplying $[\mathbf{I}_1; \dots; \mathbf{I}_{16}]$ to give $[\mathbf{V}^{\mathrm{sca}}_1; \dots; \mathbf{V}^{\mathrm{sca}}_{16}]$. Plot of memory accessed vs. time step $l$ (shown for $l = 1$); eventually $\Theta(N_S^2) + \Theta(N_g N_S)$.]

[Animation frames: the same 16-step multilevel block multiplication slide is repeated for time steps $l = 2, \dots, 14$, extending the memory-accessed plot by one step per frame; eventually $\Theta(N_S^2) + \Theta(N_g N_S)$.]

Scattered-Field (Right-Hand-Side) Computation

• Row multiplication: Store current history
• Column multiplication: Store future fields
• Multilevel block multiplication: Store both
+ Only access all of the stored data once every $N_g/2$ time steps

Storage needed: $\mathbf{Z}_1$-$\mathbf{Z}_{N_g}$ matrices $\Theta(N_S^2)$ + current/field vectors $\Theta(N_g N_S)$

[Figure: 16-step example. Strictly lower block-triangular Toeplitz matrix with blocks $\mathbf{Z}_1, \dots, \mathbf{Z}_{15}$ multiplying $[\mathbf{I}_1; \dots; \mathbf{I}_{16}]$ to give $[\mathbf{V}^{\mathrm{sca}}_1; \dots; \mathbf{V}^{\mathrm{sca}}_{16}]$. Plot of memory accessed vs. time step $l = 1, \dots, 15$; eventually $\Theta(N_S^2) + \Theta(N_g N_S)$.]
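One way to see why a multilevel schedule touches the long-lag data only occasionally is the following sketch. It uses a standard power-of-two partition of a causal convolution, which is an assumption here and not necessarily the exact block partition drawn on the slides: after step $l$, for each $k = 1, 2, 4, \dots$ dividing $l$, the last $k$ currents are combined with lags $k, \dots, 2k-1$ and added to future fields. The function names are hypothetical and the blocks are small dense arrays.

```python
import numpy as np

def rhs_direct(Z, I):
    """Reference: V[l] = sum_{d=1..min(Ng, l)} Z[d] @ I[l-d], with 0-based l."""
    n_g = len(Z) - 1
    n_t, n_s = I.shape
    V = np.zeros((n_t, n_s))
    for l in range(n_t):
        for d in range(1, min(n_g, l) + 1):
            V[l] += Z[d] @ I[l - d]
    return V

def rhs_multilevel(Z, I):
    """Same sums, reordered so that lags k..2k-1 are used only at steps divisible by k."""
    n_g = len(Z) - 1
    n_t, n_s = I.shape
    V = np.zeros((n_t, n_s))
    for l in range(1, n_t + 1):                 # 1-based step; I[l-1] has just been found
        k = 1
        while k <= n_g and l % k == 0:          # levels k = 1, 2, 4, ... that divide l
            for d in range(k, min(2 * k - 1, n_g) + 1):
                for src in range(l - k + 1, l + 1):   # the k most recent steps (1-based)
                    out = src + d                     # always a future step (> l)
                    if out <= n_t:
                        V[out - 1] += Z[d] @ I[src - 1]
            k *= 2
    return V

rng = np.random.default_rng(1)
Z_blocks = [None] + [rng.standard_normal((2, 2)) for _ in range(5)]  # Z[0] unused
I_all = rng.standard_normal((16, 2))
V_ml = rhs_multilevel(Z_blocks, I_all)
```

Each (source step, lag) pair is handled exactly once, and every update lands on a future time step, so the schedule is compatible with time marching. In this partition the largest level, with $k$ close to $N_g/2$, is visited only once every $k$ steps; on the slides the block products are additionally accelerated with FFTs, which this sketch does not attempt.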

Time Domain Adaptive Integral Method

TD-AIM

Embed $S$ in a regular grid with $N_C$ nodes.

- Inter/anterpolation order: $m$, with $p = (m+1)^3$
- Grid spacing: $\Delta_c$
- Correction region size: $\gamma = 2$

Step 1: Anterpolate ($\boldsymbol{\Lambda}$)
Step 2: Propagate ($\mathbf{G}$)
Step 3: Interpolate ($\boldsymbol{\Lambda}^{\dagger}$)
Step 4: Correct ($\mathbf{Z}^{\mathrm{corr}}$)

$$\mathbf{Z}_{l-l'}\mathbf{I}_{l'} \approx \mathbf{Z}^{\mathrm{corr}}_{l-l'}\mathbf{I}_{l'} + \sum_{k\in\{x,y,z\}} \boldsymbol{\Lambda}_k^{\dagger}\mathbf{G}^{A}_{l-l'}\boldsymbol{\Lambda}_k\mathbf{I}_{l'} + \boldsymbol{\Lambda}^{\dagger}\mathbf{G}^{\phi}_{l-l'}\boldsymbol{\Lambda}\mathbf{I}_{l'}$$

• Computational Costs (HF, while time marching)

Storage:
- Inter/anterpolation matrices $\boldsymbol{\Lambda}$: $\Theta(pN)$
- $\mathbf{G}_{l-l'}$ and current/field vectors on grid: $\Theta(N_g N_C)$
- $\mathbf{Z}^{\mathrm{corr}}_{l-l'}$: $\Theta(N^{\mathrm{near}})$

FLOPs:
- Inter/anterpolation: $O(N_T\,pN)$
- 4-D blocked space-time FFTs: $O(N_T N_C[\log N_C + \log^2 N_g])$
- Correct: $O(N_T N^{\mathrm{near}})$
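The four TD-AIM steps can be illustrated with a small dense sketch. This is only a toy: the function name `aim_matvec` is hypothetical, the matrices are real-valued so that the dagger reduces to a transpose, and the propagation is a plain matrix product rather than the FFT-accelerated version used in practice.

```python
import numpy as np

def aim_matvec(z_corr, lam_list, g_list, current):
    """Approximate Z @ I as Z_corr @ I + sum_k Lam_k^T @ (G_k @ (Lam_k @ I)).

    lam_list and g_list hold one (Lam_k, G_k) pair per potential component.
    Real matrices are assumed, so the dagger is a plain transpose.
    """
    result = z_corr @ current                 # Step 4: correct (near-field part)
    for lam, g in zip(lam_list, g_list):
        on_grid = lam @ current               # Step 1: anterpolate onto the grid
        propagated = g @ on_grid              # Step 2: propagate on the grid
        result = result + lam.T @ propagated  # Step 3: interpolate back
    return result
```

The point of the factored form is that the dense product of $\boldsymbol{\Lambda}^{\dagger}\mathbf{G}\boldsymbol{\Lambda}$ is never assembled; only grid-sized vectors pass between the steps.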

4-D Blocked Space-Time FFTs

$$\mathbf{Z}_0\mathbf{I}_l = \mathbf{V}_l^{\mathrm{inc}} - \sum_{l'=\max(1,\,l-N_g)}^{l-1}\Big[\mathbf{Z}^{\mathrm{corr}}_{l-l'} + \sum_{k\in\{x,y,z\}}\boldsymbol{\Lambda}_k^{\dagger}\mathbf{G}^{A}_{l-l'}\boldsymbol{\Lambda}_k + \boldsymbol{\Lambda}^{\dagger}\mathbf{G}^{\phi}_{l-l'}\boldsymbol{\Lambda}\Big]\mathbf{I}_{l'}$$

[Figure: block lower-triangular space-time system for 16 time steps. The potentials $\phi_1, \ldots, \phi_{16}$ are obtained from the propagators $\mathbf{G}_1, \ldots, \mathbf{G}_{15}$ acting on the anterpolated currents $\boldsymbol{\Lambda}\mathbf{I}_1, \ldots, \boldsymbol{\Lambda}\mathbf{I}_{16}$; the off-diagonal blocks are grouped into levels $i = 1, 2, 3, 4$.]

Unique Blocks: $i \in \{1, \ldots, \lfloor \log_2 N_g \rfloor + 1\}$

- Block $i$ is of size $2^{i-1}N_C \times 2^{i-1}N_C$
- Storage space needed to store block $i$: $\Theta(2^i N_C)$
- Storage space needed to store largest block: $\Theta(N_g N_C)$

A. E. Yilmaz, J. M. Jin, and E. Michielssen, IEEE Trans. AP, Oct. 2004.
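The block bookkeeping on this slide can be tabulated with a short sketch. The helper name `unique_blocks` is hypothetical, and the asymptotic $\Theta(\cdot)$ estimates are evaluated with unit constants purely for illustration.

```python
def unique_blocks(n_g, n_c):
    """Return (i, block_dimension, storage) for i = 1 .. floor(log2(n_g)) + 1.

    block_dimension is 2**(i-1) * n_c (the block is block_dimension squared),
    storage is 2**i * n_c, both with unit constants.
    """
    levels = n_g.bit_length()  # equals floor(log2(n_g)) + 1 for n_g >= 1
    return [(i, 2 ** (i - 1) * n_c, 2 ** i * n_c) for i in range(1, levels + 1)]

for level, dim, storage in unique_blocks(15, 10):
    print(f"i={level}: block {dim} x {dim}, storage ~ {storage}")
```

With $N_g = 15$ this gives four levels, matching the $i = 1, \ldots, 4$ grouping in the figure, and the storage of the largest block grows in proportion to $N_g N_C$.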

[Figure: blocked space-time system for 16 time steps, shown with a plot of the memory accessed at each time step $l = 1, \ldots, 15$. MOT: $\Theta(N_S N_g) + \Theta(N_S^2)$; TD-AIM: $\Theta(N_g N_C)$.]

Out-of-Core Algorithm: Synopsis

- Block $i$ requires $\Theta(2^i N_C)$ bytes of memory
- Introduce an integer threshold parameter $IO$
- Limit the memory use to $\Theta(2^{IO} N_C)$
- Modify TD-AIM propagation stage for blocks $i > IO$
- Minimize effect of latency by managing data layout in disk

[Figure: memory use, in units of $2^i N_C$, versus time step $l = 1, \ldots, 15$, with the thresholds $IO = 2$ and $IO = 4$ marked.]
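The effect of the threshold can be sketched as a per-time-step plan. The sketch assumes that the block level visited at time step $l$ follows the pattern 1, 2, 1, 3, 1, 2, 1, 4, ... that the animation frames in this deck step through; the names `block_level` and `plan` are hypothetical, and memory is counted with unit constants.

```python
def block_level(l):
    """Level i of the block handled at time step l (1, 2, 1, 3, 1, 2, 1, 4, ...)."""
    # l & -l isolates the lowest set bit 2**t; its bit length is t + 1.
    return (l & -l).bit_length()

def plan(n_steps, io, n_c):
    """Return (l, i, where, in_core_memory) for l = 1 .. n_steps.

    Blocks with i > io are handled out of core, and the in-core memory
    is capped at 2**io * n_c.
    """
    rows = []
    for l in range(1, n_steps + 1):
        i = block_level(l)
        where = "disk" if i > io else "core"
        rows.append((l, i, where, 2 ** min(i, io) * n_c))
    return rows

for row in plan(8, io=2, n_c=10):
    print(row)
```

Raising `io` moves more block levels back in core at the price of a higher memory cap, which is the trade-off the threshold parameter exposes.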

Out-of-Core Algorithm

[Figure: animation of the blocked space-time system for 16 time steps with $IO = 2$. Successive frames process blocks $i = 1, 2, 1, 3, 1, 2, 1, 4$; the legend marks which data are stored in core memory and which are stored in disk.]

For blocks with $i > IO$, for each node $u$, $1 \le u \le N_C$:

(1) Fetch $2^{IO}$ anterpolated current samples from the memory and the remaining $2^{i-1} - 2^{IO}$ samples from disk.
(2) Compute the contribution of these $2^{i-1}$ current samples to the potentials at $2^{i-1}$ future time steps.
(3) Read (previously computed) partial potentials from the disk and add to the contribution computed in step (2).
(4) Write the currents at the $2^{IO}$ previous time steps and the updated potentials beyond $2^{IO}$ time steps into the future to the disk.
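The sample counts in the fetch step can be checked with a few lines of arithmetic. This sketch assumes the counts shown on the slide ($2^{IO}$ samples taken from memory and $2^{i-1} - 2^{IO}$ from disk for a block of level $i$); the function name `samples_per_node` is hypothetical.

```python
def samples_per_node(i, io):
    """Return (from_memory, from_disk) current samples fetched per grid node.

    A level-i block uses 2**(i-1) current samples per node; at most 2**io of
    them are resident in core memory and the remainder is read from disk.
    """
    total = 2 ** (i - 1)
    from_memory = min(2 ** io, total)
    from_disk = total - from_memory
    return from_memory, from_disk

for level in (3, 4, 5):
    print(level, samples_per_node(level, io=2))
```

For $IO = 2$ the level-3 block still fits in core, while levels 4 and 5 read 4 and 12 samples per node from disk, so the disk traffic grows with the block level.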


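Data Layout: In Core vs. Out-of-Core

[Figure: in-core layout, with the data for time steps $l = 1, \ldots, N_g$ at grid nodes $1, 2, \ldots, N_C$ stored in core memory, versus out-of-core layout, with only $l = 1, \ldots, 2^{IO}$ stored in core memory and the rest stored in disk.]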

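Cost Analysis (No Buffering)

- $t_{\mathrm{fl}}$: Cost of one floating point operation
- $t_{\mathrm{rw}}$: Average cost of read/write of one byte from/to disk, with $N_{IO}\,t_{\mathrm{rw}} = t_{\mathrm{lat}} + N_{IO}\,t_{\mathrm{bw}}$ (must minimize effect of latency/access time)

Memory ($\mathbf{G}_{l-l'}$, current/field vectors on grid):
- In-Core TD-AIM: $\Theta(N_g N_C)$
- Out-of-Core TD-AIM: $\Theta(2^{IO} N_C)$

Disk space:
- In-Core TD-AIM: none
- Out-of-Core TD-AIM: $\Theta([N_g - 2^{IO}] N_C)$

Time for FLOPs (4-D blocked space-time FFTs):
- In-Core TD-AIM: $O(t_{\mathrm{fl}} N_T N_C[\log N_C + \log^2 N_g])$
- Out-of-Core TD-AIM: $O(t_{\mathrm{fl}} N_T N_C[\log N_C + \log^2 N_g])$

Time for I/O:
- In-Core TD-AIM: none
- Out-of-Core TD-AIM: $O(t_{\mathrm{rw}} N_T N_C \log N_g)$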
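The estimates in this cost analysis can be evaluated with a small model. This is a rough sketch only: the function names are hypothetical, all hidden constants are set to one, and logarithms are taken base 2, none of which is specified on the slide.

```python
import math

def t_rw(t_lat, t_bw, n_io):
    """Average per-byte I/O cost from N_IO * t_rw = t_lat + N_IO * t_bw."""
    return t_lat / n_io + t_bw

def cost_estimates(n_t, n_c, n_g, io, t_fl, t_rw_value):
    """Evaluate the in-core and out-of-core estimates with unit constants."""
    log_term = math.log2(n_c) + math.log2(n_g) ** 2
    return {
        "memory_in_core": n_g * n_c,
        "memory_out_of_core": 2 ** io * n_c,
        "disk_space": (n_g - 2 ** io) * n_c,
        "flop_time": t_fl * n_t * n_c * log_term,
        "io_time": t_rw_value * n_t * n_c * math.log2(n_g),
    }

# Larger transfers amortize the latency term of the per-byte cost.
print(t_rw(t_lat=1e-3, t_bw=1e-9, n_io=10**3))
print(t_rw(t_lat=1e-3, t_bw=1e-9, n_io=10**6))
```

The first helper makes the latency argument concrete: the per-byte cost approaches $t_{\mathrm{bw}}$ only when each disk access moves many bytes, which motivates the record-based layout that follows.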

Data Layout in Disk

- Diagram: time axis $l = 1, 2, \ldots, 15$; spatial axis $1, 2, \ldots, N_C$, marked at $N_{buf}$, $2N_{buf}$, ..., $N_C - N_{buf} + 1$; a region labeled "Memory" ($2^{i} N_C$); a region labeled "Stored in disk" ($l = 2^{IO}+1, \ldots, N_g$).
- Divide into blocks of $2^{IO}$ time steps.
- Store in disk using fixed-length-record, unformatted, direct-access files.
- Each record contains $B$ bytes, formed from $2^{IO}$ temporal data for $N_{buf}$ spatial points (so $N_{buf} \propto B / 2^{IO}$).
- Record numbering: $rec = 1$, $rec = 2$, ... run along time within the first spatial block, up to $rec = N_g / 2^{IO}$; the last spatial block starts at $rec = \left(\dfrac{N_C}{N_{buf}} - 1\right)\dfrac{N_g}{2^{IO}} + 1$.

Step: Write the currents at the previous $2^{IO}$ time steps (shown at $l = 4$, $l = 8$, ..., $l = N_g$).
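One way to picture the record numbering is the sketch below, which mimics fixed-length direct-access records with byte offsets in an ordinary binary file. It assumes 8-byte samples, $N_g$ divisible by $2^{IO}$, and $N_C$ divisible by $N_{buf}$; the helper names (`record_number`, `write_record`, `read_record`) are hypothetical and not taken from the authors' code.

```python
import os
import tempfile
from array import array

BYTES_PER_SAMPLE = 8  # assumption: one 8-byte float per current sample


def record_number(space_block, time_block, n_g, io):
    """1-based record index; time blocks are consecutive within a spatial block."""
    recs_per_space_block = n_g // 2 ** io
    return (space_block - 1) * recs_per_space_block + time_block


def record_bytes(n_buf, io):
    """Length of one record: 2**io time samples for n_buf spatial points."""
    return BYTES_PER_SAMPLE * n_buf * 2 ** io


def write_record(f, rec, samples, n_buf, io):
    """Write one fixed-length record at its 1-based position in the file."""
    data = array("d", samples)
    if len(data) != n_buf * 2 ** io:
        raise ValueError("record must hold 2**IO * N_buf samples")
    f.seek((rec - 1) * record_bytes(n_buf, io))
    f.write(data.tobytes())


def read_record(f, rec, n_buf, io):
    """Read one fixed-length record back as a list of floats."""
    size = record_bytes(n_buf, io)
    f.seek((rec - 1) * size)
    out = array("d")
    out.frombytes(f.read(size))
    return out.tolist()


if __name__ == "__main__":
    n_g, n_c, n_buf, io = 16, 12, 4, 2  # toy sizes, assumed
    first_of_last_block = record_number(n_c // n_buf, 1, n_g, io)
    path = os.path.join(tempfile.mkdtemp(), "currents.bin")
    with open(path, "w+b") as f:
        block = [float(k) for k in range(n_buf * 2 ** io)]
        write_record(f, first_of_last_block, block, n_buf, io)
        print(read_record(f, first_of_last_block, n_buf, io) == block)
```

Because records for successive time blocks of one spatial block sit next to each other, a read of several time blocks for that spatial block touches one contiguous stretch of the file.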

Data Layout in Disk

Step: Fetch $2^{i-1} - 2^{IO}$ current samples from disk, using the same fixed-length records.

Cost Analysis (with Buffer)

- $t_{fl}$: Cost of one floating point operation
- $t_{rw}$: Average cost of read/write of one byte from/to disk, with $N\,t_{rw} = t_{lat}^{IO} + N\,t_{bw}^{IO}$ (must minimize effect of latency/access time)

Memory (current/field vectors on grid, $\mathbf{G}_{l-l'}$):
- In-Core TD-AIM: $\Theta(N_g N_C)$
- Out-of-Core TD-AIM: $\Theta(2^{IO} N_C + N_{buf}[N_g - 2^{IO}])$

Disk space:
- In-Core TD-AIM: none
- Out-of-Core TD-AIM: $\Theta([N_g - 2^{IO}]\,N_C)$

Time for FLOPs (4-D blocked space-time FFTs), in-core and out-of-core alike:
$t_{fl}\,O(N_T N_C [\log N_C + \log^2 N_g])$

Time for I/O:
- In-Core TD-AIM: none
- Out-of-Core TD-AIM: $t_{lat}^{IO}\,O(N_T [N_C / N_{buf}] \log N_g) + t_{bw}^{IO}\,O(N_T N_C \log N_g)$
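The effect of the buffer on the I/O estimate can be illustrated with a small sketch that evaluates the two leading-order terms separately. The function name `io_time`, the base-2 logarithm, the unit constants, and the numerical values of `t_lat` and `t_bw` are all assumptions chosen for illustration; they are not measurements from the talk.

```python
import math


def io_time(n_t, n_c, n_g, n_buf, t_lat, t_bw):
    """Return (latency term, bandwidth term) of the leading-order I/O time.

    latency   ~ t_lat * N_T * (N_C / N_buf) * log N_g
    bandwidth ~ t_bw  * N_T * N_C * log N_g
    Constants hidden by O(.) are set to one.
    """
    levels = math.log2(n_g)
    latency = t_lat * n_t * (n_c / n_buf) * levels
    bandwidth = t_bw * n_t * n_c * levels
    return latency, bandwidth


if __name__ == "__main__":
    # Hypothetical sizes and device constants (seconds per access / per sample).
    n_t, n_c, n_g = 1000, 10 ** 6, 512
    t_lat, t_bw = 5e-3, 1e-8
    for n_buf in (1, 100, 10_000):
        lat, bw = io_time(n_t, n_c, n_g, n_buf, t_lat, t_bw)
        print(f"N_buf={n_buf}: latency {lat:.3g} s, bandwidth {bw:.3g} s")
```

In this model only the latency term shrinks as $N_{buf}$ grows, while the bandwidth term is unchanged, which is why larger records help until the bandwidth term dominates.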

Cost Analysis (with Buffer + Parallelization)

Memory (current/field vectors on grid, $\mathbf{G}_{l-l'}$):
- In-Core TD-AIM: $\Theta(N_g N_C / P)$
- Out-of-Core TD-AIM: $\Theta(2^{IO} N_C / P + N_{buf}[N_g - 2^{IO}])$

Disk space:
- In-Core TD-AIM: none
- Out-of-Core TD-AIM: $\Theta([N_g - 2^{IO}]\,N_C / P)$

- Every process needs memory space for the buffer
- Buffer limits parallel scalability of memory requirement: $P_{max} \sim \dfrac{2^{IO} N_C}{N_{buf}[N_g - 2^{IO}]}$
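The scalability limit can be checked numerically with the sketch below: the distributed part of the memory shrinks as $1/P$ while the buffer part does not, and the two are equal at $P_{max}$. The function names and the toy sizes are assumptions, and constants hidden by $\Theta(\cdot)$ are set to one.

```python
def memory_per_process(n_g, n_c, n_buf, io, p):
    """Leading-order per-process memory: distributed part plus buffer part."""
    distributed = 2 ** io * n_c / p          # shrinks with the process count
    buffer = n_buf * (n_g - 2 ** io)         # needed by every process
    return distributed + buffer


def p_max(n_g, n_c, n_buf, io):
    """Process count at which the buffer equals the distributed part."""
    return 2 ** io * n_c / (n_buf * (n_g - 2 ** io))


if __name__ == "__main__":
    n_g, n_c, n_buf, io = 20, 1600, 10, 2    # toy sizes, assumed
    limit = p_max(n_g, n_c, n_buf, io)
    for p in (1, int(limit), 10 * int(limit)):
        print(p, memory_per_process(n_g, n_c, n_buf, io, p))
```

Beyond $P_{max}$ the per-process memory in this model flattens toward the buffer size, so adding processes no longer reduces it appreciably.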

Results
- Sphere
- Model Airplane

Stampede

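Sphere

- $\lambda_{min} = 1$ m, $\gamma = 3$
- $\Delta t = 0.24$ ns
- $IO = 2$, $B = 100$ kB

$\mathbf{E}^{inc}(\mathbf{r}, t) = \hat{x}\, e^{-(t + \hat{z}\cdot\mathbf{r}/c - 8\sigma)^2 / (2\sigma^2)} \cos(2\pi f_c [t + \hat{z}\cdot\mathbf{r}/c - 8\sigma])$

$f_c = 200$ MHz, $\sigma = 3/(2\pi f_{bw})$; $f_{bw} = 100$ MHz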
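As a worked check of the sphere excitation, the sketch below evaluates the $\hat{x}$ component of the incident field, assuming it is a Gaussian-modulated cosine with delay $8\sigma$ and retarded argument $t + \hat{z}\cdot\mathbf{r}/c - 8\sigma$, with $f_c = 200$ MHz, $f_{bw} = 100$ MHz, and $\sigma = 3/(2\pi f_{bw})$. The function name `e_inc_x`, the restriction to points on the $z$ axis, and the use of the vacuum speed of light are assumptions.

```python
import math

F_C = 200e6                               # carrier frequency (Hz)
F_BW = 100e6                              # bandwidth parameter (Hz)
SIGMA = 3.0 / (2.0 * math.pi * F_BW)      # Gaussian width (s)
C0 = 299_792_458.0                        # speed of light in vacuum (m/s), assumed medium


def e_inc_x(t, z=0.0):
    """x-component of the incident pulse at time t (s) and height z (m)."""
    tau = t + z / C0 - 8.0 * SIGMA        # delayed, retarded time
    envelope = math.exp(-tau ** 2 / (2.0 * SIGMA ** 2))
    return envelope * math.cos(2.0 * math.pi * F_C * tau)


if __name__ == "__main__":
    dt = 0.24e-9                          # time step quoted for the sphere runs
    for n in (0, 100, 159, 200, 318):
        print(n, e_inc_x(n * dt))
```

The $8\sigma$ delay keeps the field negligible at $t = 0$, so the marching can start from zero initial currents without a visible turn-on transient.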

$L_s$ (m)   $N_s$        $N_C$     $N_t$   $N_g$
0.5         684          $18^3$    260     32
1           3384         $27^3$    285     44
2           10 947       $48^3$    300     74
4           44 595       $80^3$    420     132
8           179 130      $128^3$   545     247
16          742 059      $256^3$   760     471
32          2 903 916    $512^3$   1375    919

Sphere

[Figure: core memory requirement (GB) vs. $N_s$ for in-core TD-AIM, out-of-core TD-AIM, buffer memory, and FD-AIM; reference trends $O(N_s^{3/2})$, $O(N_s^{2})$, $O(N_s^{1/2})$.]

[Figure: marching/solution time per time/frequency step (s) vs. $N_s$ for in-core TD-AIM, out-of-core TD-AIM, and FD-AIM; reference trends $O(N_s^{3/2}\log^2 N_s)$, $O(N_s^{3/2}\log N_s)$.]

Model Airplane

[Figure: model airplane with incident field $\mathbf{E}^{inc}$, wave vector $\mathbf{k}$, and $x$, $y$, $z$ axes; range profile $P$ (V) vs. Range $- ct_d/2$ (m) for FD-AIM and TD-AIM.]

$f_{max}$ (GHz)   $N_s$        $N_C$                $N_t$   $N_g$
2.5               23 217       $72^2 \times 27$     500     119
5                 92 868       $144^2 \times 45$    1000    217
10                371 472      $288^2 \times 80$    1600    416
20                1 485 888    $576^2 \times 160$   2500    813

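$\mathbf{E}^{inc}(\mathbf{r}, t) = \hat{z}\, e^{-(t + \hat{y}\cdot\mathbf{r}/c - 8\sigma)^2 / (2\sigma^2)} \cos(2\pi f_c [t + \hat{y}\cdot\mathbf{r}/c - 8\sigma])$

$f_{bw} = f_{max}/2.5$, $\sigma = 3/(2\pi f_{bw})$

Range profile $P(\theta, \phi, t)$: obtained from the far scattered field $\lim_{r\to\infty} r\,\mathbf{E}^{sca}$ by integrating against $e^{j\omega t}$ over $\omega$ from $\omega_c - \omega_{bw}$ to $\omega_c + \omega_{bw}$, with a $1/2\pi$ prefactor.

- $\gamma = 3$
- $\Delta t = 1/(14 f_{max})$
- $IO = 2$, $B = 100$ kB
- Range profile shown for $f_{max} = 20$ GHz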

Model Airplane

[Figure: core memory requirement (GB) vs. $N_s$ for in-core TD-AIM, out-of-core TD-AIM, buffer memory, and FD-AIM; reference trends $O(N_s^{3/2})$, $O(N_s^{2})$, $O(N_s^{1/2})$.]

[Figure: marching/solution time per time/frequency step (s) vs. $N_s$ for in-core TD-AIM, out-of-core TD-AIM, and FD-AIM; reference trends $O(N_s^{3/2}\log^2 N_s)$, $O(N_s^{3/2}\log N_s)$.]

Model Airplane @ 10 GHz

[Figure: core memory requirement (GB) vs. IO, in core and out of core.]

[Figure: marching time per time step (s) vs. IO, in core and out of core; out-of-core trend $\sim(\log N_g - IO)$.]

Conclusions

• A parallel multilevel out-of-core algorithm for TD-AIM
+ Core-memory requirement only ~3-4x FD-AIM when IO = 2
+ Buffering => only ~8-10x slower on Stampede when IO = 2
+ Control knob to trade off memory for speed: $1 \le IO \le \lfloor \log_2 N_g \rfloor$
- Buffer limits parallel scalability (can be ameliorated by using hybrid parallelism on multi-core clusters)

• Acknowledgments
