aparallel out-of-core algorithm for time-domain adaptive integral...
TRANSCRIPT
A Parallel Out-of-Core Algorithm for Time-Domain Adaptive Integral Method
Guneet KAUR and Ali E. YILMAZ
Department of Electrical & Computer EngineeringUniversity of Texas at Austin
IEEE International Conference on Computational Electromagnetics (ICCEM)Hong Kong, February 2-5, 2015
Motivation- Integral Equation Methods (Historical Perspective)- TD-AIM vs. FD-AIM- Memory Hierarchy- Out of Core Algorithms
(High-Frequency) IE Methods: Historical Progress since RWG
*
* http://web.corral.tacc.utexas.edu/BioEM-Benchmarks/historicalProgress/index.html
[1] S. M. Rao, D. R. Wilton, A. W. Glisson, IEEE TAP, May 1982.[2] A. Taflove, K. Umashankar, IEEE TEMC, Nov. 1983.[3] D. H. Schaubert, D. R. Wilton, A. W. Glisson, IEEE TAP, Jan. 1984.[4] M. F. Catedra, J. G. Cuevas, L. Nuno, IEEE TAP, Dec. 1988.[5] M. F. Catedra, E. Gago, L. Nuno, IEEE TAP, May 1989.[6] T. Cwik, J. Patterson, D. Scott, Proc. Supercomp., Nov. 1992.[7] H. Gan, W. C. Chew, J EM Waves Appl., 1995.[8] J. M. Song and W. C. Chew, MOTL, Sep. 1995.[9] E. Bleszynski et al., Radio Sci., Sep.-Oct. 1996.[10] C. F. Wang and J. M. Jin, IEEE TMTT, May 1998.[11] J. Song et al., IEEE TAP, June 1998.[12] S. Velamparambil, W. C. Chew, J. Song, IEEE AP Mag., Apr. 2003.[13] A. Rubinstein, et al., IEEE TEMC, May 2003.[14] Ö. Ergül, L. Gürel, IEEE TAP, Aug. 2008.[15] J. M. Taboada et al., IEEE AP Mag., Dec. 2009.[16] A. Heldring et al., IEEE TAP, Jan. 2013.
TDIE papers lagging:(1) Late-time instability: implicit solvers, better temporal basis functions, well-conditioned IE formulations, exact/semi-analytical integration techniques, ….(2) High computation time: fast algorithms(3) High(er) memory requirement(4) …
[17] F. Wei, A. E. Yılmaz, IEEE TAP, Feb. 2014.[18] C.L. Bennett , H. Mieras, Radio Sci., Nov.-Dec.1981.[19] S. M. Rao, D. R. Wilton, IEEE TAP, Jan. 1991.[20] D.A. Vechinski and S. M. Rao, IEEE TAP, June 1992.[21] M. J. Bluck, S. P. Walker, IEEE TAP, May 1997.[22] S. J. Dodson, S. P. Walker, M. J. Bluck, IEEE AP Mag., Aug. 1998.[23] B. Shanker, et al., IEEE TAP, Apr. 2000.[24] J. L. Hu, C. H. Chan, and Y. Xu, MOTL, May 2000.[25] A. E. Yılmaz et al., IEEE TAP, July 2002.[26] B. Shanker,et al., IEEE TAP, Mar. 2003.[27] A. E. Yılmaz, J.M. Jin, E. Michielssen, IEEE TAP, Oct. 2004.[28] H. Bağcı et al., IEEE TEMC, May 2007.[29] H. Bağcı et al., IEEE TEMC, Feb 2010.[30] G. Kaur and A.E. Yılmaz, Proc. ACES., Apr. 2012.[31] G. Kaur and A.E. Yılmaz, submitted IEEE TAP, Sep. 2014.[32] G. Kaur and A.E. Yılmaz, ICCEM, Feb. 2015.
This talk
TD-AIM vs. FD-AIM
102
103
104
105
106
10710
0
101
102
103
104
105
106
107
Ns
Mat
rix-F
ill Ti
me
(s)
O(Ns)
102
103
104
105
106
10710
-3
10-2
10-1
100
101
102
103
104
105
Ns
Mar
chin
g/So
lutio
n Ti
me
per t
ime/
freq.
ste
p(s)
O(Ns3/2logN
s)
O(Ns5/4log2N
s)
O(Ns3/2log2N
s)
O(Nslog2N
s)
102
103
104
105
106
10710
-3
10-2
10-1
100
101
102
103
Ns
Mem
ory
(GB)
O(Ns)
O(Ns2)
O(Ns3/2)
FD‐AIM Plate
TD‐AIM Plate
FD‐AIM Sphere
TD‐AIM Sphere
Number of frequency samples bandwidth of fieldsNumber of time samples maximum frequency content of the fields
Envelope-tracking methods[1-2] requires time samples bandwidth, and stores smaller time history than traditional time-domain methods
[1] A. Mohan and D. S. Weile, IEEE TAP, 2005.[2] G. Kaur and A. E. Yılmaz, IEEE TAP, to appear in 2015.
Memory Hierarchy
Registers
Processor
Control Unit
On-chip Cache
Main Memory
Local Secondary Storage
Remote Secondary StorageIncr
easi
ng c
apac
ity a
nd a
cces
s tim
esD
ecre
asin
g co
sts
and
frequ
ency
of a
cces
s Access Time
300 ps
1-10 ns
10-20 ns
50-100 ns(~20-75 ns)
5-10 ms
?
Off-Chip Cache
http://www.edn.com/design/systems-design/4397051/Memory-Hierarchy-Design-part-1John Hennessy and David Patterson, Computer Architecture, Fifth Edition: A Quantitative Approach, Morgan Kaufmann, San Francisco, CA, 1996.
TransferTime/byte
1-10 ps
25-50 ps
50-100 ps
50-200 ps(~20 ps)
0.2-2 ns
?(~2 ns)
Size
1000 bytes
64-256 kB(L1: 32 kBL2: 256 kB)
2-4 MB(L3: 20 MB)
4-16 GB(32 GB
2GB/core)
4-16 TB(80-430 GB)
O(PB)(14 PB)
* Stampede@TACC#7 supercomputer in
top500 list, Nov. 2014
Memory Hierarchy
F/A-18 HornetTop speed: 1915 km/h
Sustained speed:
~1.66 mm/s
http://www.scase.co.uk/snailracing/
Inspiration for this analogy:http://blog.scoutapp.com/articles/2011/02/10/understanding-disk-i-o-when-should-you-be-worried
Access time:50 ns vs. 5 msTransfer time:20 ps vs. 0.2-2 ns
53.2 10´
Wells (2014 world snail racing champ)
McLaren 12c (fasteststock car 0-200 kph)
http://www.dragtimes.ru/en/blogs/view/2659 29-
Highway/top speed:
65 mph/210 mph
( ~access time ratio)
(~transfer time ratio)
102 103 104 10510-5
10-4
10-3
10-2
10-1
100
101
102
103
Number of Unknowns
Tim
e pe
r MV
M (s
)
Out of CoreIn Core
O(Ns2)
• Use disk space in addition to core memory+ Lots of space: (or more)- I/O to disk much slower than memory
To read/write N bytes:Access time (latency): Transfer time (bandwidth):
+ Possible to amortize I/O costsLatency: Read/write large chunks of data (simple)Bandwidth: Many FLOPs per byte read/written (not so simple)Example: Dense-matrix-vector product (double-precision) on Stampede
• Goal: Reduce memory requirement without (significantly) increasing run time
Out of Core Algorithms
rw lat bw5
lat lat1 2
bw bw
10
10
IO IO
IO core
IO core
Nt t Nt
t t
t t-
= +
= ´
= ´
Tim
e
I/O volume (bytes)
latIOt
bwSlope: IOt
3 610 - ´
x 50- Store 100 kBin memory
- Read data in100 kB chunks & multiply
102 103 104 10510-4
10-3
10-2
10-1
100
101
102
Number of Unknowns
Mem
ory
(GB
)
Out of CoreIn Core
O(Ns2)
Background- Time Domain Integral Equations Basics- Method of Moments- Time Marching- Alternative Ways to Calculate Right-Hand Side
- Frequency interval:
- Time interval:
1
( , )( , )( , , )
( , ) ( , )t S
ttg t ds
t t
mf e-
ì üì ü ¢ï ïï ï ï ïï ï ¢ ¢= *í ý í ý¢ ¢ï ï ï ï¶ - ⋅ï ï ï ïî þ î þòò
J rA rr r
r J r
sca 2
sca 1
( , ) ( , ) ( , )
( , ) ( , )t t t
t t
t t t
t t
f
m-
ì ü ì üï ï ï ï¶ -¶ -¶ï ï ï ïï ï ï ï=í ý í ýï ï ï ï¶ ´¶ï ï ï ïï ï ï ïî þ î þ
E r A r r
H r A r
inc inc,E HS
min maxf f f
( , )tJ r
• EFIE, MFIE, CFIE2 inc
1 inc
10
ˆ ˆ ˆ ˆ( ( , ) ( , )) ( , ) (TD-EFIE)
ˆ ˆ( , ) ( , ) ( , ) (TD-MFIE)
TD-CFIE TD-EFIE (1 )TD-MFIE
t t t
t t t
n n t t n n t
t n t n t
f
m
h a a
-
-
- ´ ´ ¶ +¶ =- ´ ´¶
¶ - ´´¶ = ´¶
= + -
A r r E r
J r A r H r
TDIE Basics
,e m
PEC
( )n r
sca inc
sca inc
( , ) ( , ) ˆ ˆ ˆ ˆ
( , ) ( , ) ( , )ˆ ˆt t
t t t
n n t n n t S
t n t n t
- ´ ´¶ = ´ ´¶ " Î
¶ - ´¶ = ´¶
E r E r r
J r H r H r
S,e m ,e m
inc inc,E H
( )/( , , )
4
t cg t
d
p
¢- -¢ =
¢-
r rr r
r r
max0 t T
Numerical Solution
- Discretize geometry, expand unknowns
• Method of moments
S T
,1 1
( , ) ( ) ( )N N
k l kk l
t I T t l t¢ ¢ ¢¢ ¢= =
¢» - DååJ r S r
Rao, Wilton, Glisson, IEEE Trans. AP, May 1982.
Must resolve fastest variations:
- Testing
( ) ( ) TD-CFIEk
S
dt ds t l td¥
-¥
- D ⋅ò òò S r
Tfor 1,...,l N=
2max
min S 2
max T max max
/ 10,
1/ 10 ,
(HF)A
fs N S
c
t f N T f
lD
D
- System of equations
incT
1
for 1,...,l
l l l ll
l N¢ ¢-¢=
= =åZ I V G. Manara et al., IEEE Trans. AP, Mar. 1997.D. S. Weile et al., IEEE Trans. AP, Jan. 2004.G. Kaur and A. E. Yilmaz, MOTL, June 2011.
Marching on in Time
g g g
g g g
g g T
i10 1
1 0 2
2 1 0 3
3 2 1 0 4
1 2 1 0
1 2 1 0 1
1 2 1 0
0
0
0 0
N N N
N N N
N N N
-
- +
-
é ù é ùê ú ê úê ú ê úê ú ê úê ú ê úê ú ê úê ú ê úê ú ê úê ú ê ú
=ê ú ê úê ú ê ú⋅ ⋅ ⋅ê ú ê úê ú ê ú
⋅ ⋅ ⋅ê ú ê úê ú ê úê ú ê úê ú ê úê ú ê ú⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ê úê ú ë ûë û
VZ IZ Z IZ Z Z IZ Z Z Z I
Z Z Z Z Z I
Z Z Z Z Z I
Z Z Z Z Z I
g
g
TT S
nc
inc2inc3inc4
inc
inc1
inc
1
N
N
N N N
+
´
é ùê úê úê úê úê úê úê úê úê úê úê úê úê úê úê úê úê úê úë û
V
V
V
V
V
V
- Due to causality, time invariance, uniform time-step size,(discretely) causal sub-domain space-time basis functions
- Frequency domain: 11 1 1
2 2 2 2
F F F FF S
inc,
inc,
inc,
1
0 0
0
0 N N N N
ff f f
f f f f
f f f fN N ´
é ùé ù é ùê úê ú ê úê úê ú ê úê úê ú ê ú = ê úê ú ê úê úê ú ê úê úê ú ê úê úê ú ê ú ê úë û ë û ë û
VZ I
Z I V
Z I V
Matrix properties:
- Lower triangular, Toeplitz, sparse blocks
- Non-zero entries:
- Unique entries:
- Unique blocks:
2T S
( )N NQ
2S
( )NQ
- Diagonal, unique,dense blocks
- Non-zero/uniqueentries:
2F SN N
g1N +
Marching on in Time
g gg g
g g
T T
inc10 1inc2 10 2inc
2 130 3inc
3 2 10 4 4
inc1 2 10
inc0 1 1
inc0
0
0
0 0
0
0
N NN N
N N
N N
-
+ +
é ùé ù ê úê ú ê úê ú ê úê ú ê úê ú ê úê ú ê úê ú ê úê ú ê úê ú ê ú= -ê ú ê úê ú ⋅ ⋅ ⋅ê úê ú ê úê ú ê úê ú ê úê ú ê úê ú ê úê ú ê úê ú ê úê úë û ë û
VZ IV ZZ I
Z ZVZ IZ Z ZZ I V
Z Z Z ZZ I V
Z I V
Z I V
g
g g g
g g T
1
2
3
4
1 2 1 1
1 2 1
0 0
0 0 0
N
N N N
N N N
- +
-
é ù é ùê ú ê úê ú ê úê ú ê úê ú ê úê ú ê úê ú ê úê ú ê úê ú ê úê ú ê úê ú ê úê ú ê úê ú ê ú
⋅ ⋅ ⋅ê ú ê úê ú ê úê ú ê úê ú ê úê ú ê ú⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ê úê ú ë ûë û
IIII
I
Z Z Z Z I
Z Z Z Z I
• Computational Costs
Storage: +0 S
( )N NQ
matrix Current/Field vectors
g S( )N NQ
0Z
g1 N-Z Z matrices
2S
( )NQ +
Iterative solution Right-hand sideFLOPs: +
T I 0 S( )O N N N N 2
T S( )O N N
Matrix fill2S
( )O N +
Marching on in Time: High-Frequency- HF: and unknowns distributed such that =>
, , c t
c t
c t
c t sD D 0 SN N
ttD 2 tDg
( 1)N t+ D
• Computational Costs
* Graph describes radiation by a point source that is a pulse of width turned on at time 0
tD
Storage:
matrix Current/Field vectors0
Zg1 N
-Z Z matrices
+
Iterative solution Right-hand sideFLOPs:
T I 0 S( )O N N N N 2
T S( )O N N
Matrix fill2S
( )O N +
1/2g S
( )N O N=
0 S( )N NQ
g S( )N NQ2
S( )NQ
Scattered-Field (Right-Hand-Side) Computation
Storage:
matrix Current/Field vectors0
Zg1 N
-Z Z matrices
+
Iterative solution Right-hand sideFLOPs:
T I 0 S( )O N N N N 2
T S( )O N N
Matrix fill2S
( )O N +
• Computational Costs
g
T
1
2
3
4
5
6
7
8
N
N
IIIIIIII
I
I
g
1
2 1
3 2 1
4 3 2 1
5 4 3 2 1
6 5 4 3 2 1
7 6 5 4 3 2 1
8 7 6 5 4 3 2 1
7 6 5 4 3 2 1
7 6 5 4 3 2 1
7 6 5 4 3 2 1
7 6 5 4 3 2 1
7 6 5 4 3 2 1
7 6 5 4 3 2 1
0 0
0
0
0
0 0
0
0
0
0
0
0
0
0 0
0
0
0 0
N
Z
Z Z
Z Z Z
Z Z Z Z
Z Z Z Z Z
Z Z Z Z Z Z
Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z
Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z
Z Z Z Z Z Z Z
Z Z Z Z Z Z Z
g 8 7 6 5 4 3 2 1
0NZ Z Z Z Z Z Z Z Z
g
T
sca1sca2sca3sca4sca5sca6sca7sca8
sca
sca
N
N
V
V
V
V
V
V
V
V
V
V
=
0 S( )N NQ
g S( )N NQ2
S( )NQ
g g 1 T1 2 3 4 5 6 7 8 N N N+
g
T
1
2
3
4
5
6
7
8
N
N
IIIIIIII
I
I
g
1
2 1
3 2 1
4 3 2 1
5 4 3 2 1
6 5 4 3 2 1
7 6 5 4 3 2 1
8 7 6 5 4 3 2 1
7 6 5 4 3 2 1
7 6 5 4 3 2 1
7 6 5 4 3 2 1
7 6 5 4 3 2 1
7 6 5 4 3 2 1
7 6 5 4 3 2 1
0 0
0
0
0
0 0
0
0
0
0
0
0
0
0 0
0
0
0 0
N
Z
Z Z
Z Z Z
Z Z Z Z
Z Z Z Z Z
Z Z Z Z Z Z
Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z
Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z
Z Z Z Z Z Z Z
Z Z Z Z Z Z Z
g 8 7 6 5 4 3 2 1
0NZ Z Z Z Z Z Z Z Z
g
T
sca1sca2sca3sca4sca5sca6sca7sca8
sca
sca
N
N
V
V
V
V
V
V
V
V
V
V
=
Scattered-Field (Right-Hand-Side) Computation
• Row multiplication: Store current history
- Must access all of the stored data at every time step g
l N³
Storage needed:
Current/Field vectorsg1 N
-Z Z matrices
+
l
MemoryAccessed
S. P. Walker, IEEE AP Mag., Oct. 1997. g S( )N NQ2
S( )NQ
2S g S
( ) ( )N N NQ +Q
g
T
1
2
3
4
5
6
7
8
N
N
IIIIIIII
I
I
g
1
2 1
3 2 1
4 3 2 1
5 4 3 2 1
6 5 4 3 2 1
7 6 5 4 3 2 1
8 7 6 5 4 3 2 1
7 6 5 4 3 2 1
7 6 5 4 3 2 1
7 6 5 4 3 2 1
7 6 5 4 3 2 1
7 6 5 4 3 2 1
7 6 5 4 3 2 1
0 0
0
0
0
0 0
0
0
0
0
0
0
0
0 0
0
0
0 0
N
Z
Z Z
Z Z Z
Z Z Z Z
Z Z Z Z Z
Z Z Z Z Z Z
Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z
Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z
Z Z Z Z Z Z Z
Z Z Z Z Z Z Z
g 8 7 6 5 4 3 2 1
0NZ Z Z Z Z Z Z Z Z
g
T
sca1sca2sca3sca4sca5sca6sca7sca8
sca
sca
N
N
V
V
V
V
V
V
V
V
V
V
=
Scattered-Field (Right-Hand-Side) Computation
• Row multiplication: Store current history
• Column multiplication: Store future fields
Storage needed:
Current/Field vectorsg1 N
-Z Z matrices
+
1l
MemoryAccessed
S. P. Walker, IEEE AP Mag., Oct. 1997. g S( )N NQ2
S( )NQ
2S g S
( ) ( )N N NQ +Q
g
T
1
2
3
4
5
6
7
8
N
N
IIIIIIII
I
I
g
1
2 1
3 2 1
4 3 2 1
5 4 3 2 1
6 5 4 3 2 1
7 6 5 4 3 2 1
8 7 6 5 4 3 2 1
7 6 5 4 3 2 1
7 6 5 4 3 2 1
7 6 5 4 3 2 1
7 6 5 4 3 2 1
7 6 5 4 3 2 1
7 6 5 4 3 2 1
0 0
0
0
0
0 0
0
0
0
0
0
0
0
0 0
0
0
0 0
N
Z
Z Z
Z Z Z
Z Z Z Z
Z Z Z Z Z
Z Z Z Z Z Z
Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z
Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z
Z Z Z Z Z Z Z
Z Z Z Z Z Z Z
g 8 7 6 5 4 3 2 1
0NZ Z Z Z Z Z Z Z Z
g
T
sca1sca2sca3sca4sca5sca6sca7sca8
sca
sca
N
N
V
V
V
V
V
V
V
V
V
V
=
Scattered-Field (Right-Hand-Side) Computation
• Row multiplication: Store current history
• Column multiplication: Store future fields
- Must access all of the stored data at every time step T g
l N N£ -
Storage needed:
Current/Field vectorsg1 N
-Z Z matrices
+
l
MemoryAccessed
T g1 2 3 -N N
S. P. Walker, IEEE AP Mag., Oct. 1997. g S( )N NQ2
S( )NQ
2S g S
( ) ( )N N NQ +Q
1
2 1
3 2 1
4 3 2 1
5 4 3 2 1
6 5 4 3 2 1
7 6 5 4 3 2 1
8 7 6 5 4 3 2 1
9 8 7 6 5 4 3 2 1
10 9 8 7 6 5 4 3 2 1
11 10 9 8 7 6 5 4 3 2 1
12 11 10 9 8 7 6 5 4 3 2 1
13 12 11 10 9 8 7 6 5 4
0
0
0
0
0
0
0
0
0
0
0
0
0
Z
Z Z
Z Z Z
Z Z Z Z
Z Z Z Z Z
Z Z Z Z Z Z
Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z3 2 1
14 13 12 11 10 9 8 7 6 5 4 3 2 1
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
0
0
0
Z Z
Z Z Z Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z Z Z Z Z
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
IIIIIIIIIIIIIIII
=
sca1sca2sca3sca4sca5sca6sca7sca8sca9sca10sca11sca12sca13sca14sca15sca16
VVVVVVVVVVVVVVVV
1l
MemoryAccessed
eventually
Scattered-Field (Right-Hand-Side) Computation
• Row multiplication: Store current history
• Column multiplication: Store future fields
• Multilevel block multiplication: Store both
Storage needed:
Current/Field vectorsg1 N
-Z Z matrices
+g S
( )N NQ2S
( )NQ
2S g S
( ) ( )N N NQ +Q
1
2 1
3 2 1
4 3 2 1
5 4 3 2 1
6 5 4 3 2 1
7 6 5 4 3 2 1
8 7 6 5 4 3 2 1
9 8 7 6 5 4 3 2 1
10 9 8 7 6 5 4 3 2 1
11 10 9 8 7 6 5 4 3 2 1
12 11 10 9 8 7 6 5 4 3 2 1
13 12 11 10 9 8 7 6 5 4
0
0
0
0
0
0
0
0
0
0
0
0
0
Z
Z Z
Z Z Z
Z Z Z Z
Z Z Z Z Z
Z Z Z Z Z Z
Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z3 2 1
14 13 12 11 10 9 8 7 6 5 4 3 2 1
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
0
0
0
Z Z
Z Z Z Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z Z Z Z Z
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
IIIIIIIIIIIIIIII
=
sca1sca2sca3sca4sca5sca6sca7sca8sca9sca10sca11sca12sca13sca14sca15sca16
VVVVVVVVVVVVVVVV
1 2l
MemoryAccessed
Scattered-Field (Right-Hand-Side) Computation
• Row multiplication: Store current history
• Column multiplication: Store future fields
• Multilevel block multiplication: Store both
Storage needed:
Current/Field vectorsg1 N
-Z Z matrices
+g S
( )N NQ2S
( )NQ
eventually 2S g S
( ) ( )N N NQ +Q
1
2 1
3 2 1
4 3 2 1
5 4 3 2 1
6 5 4 3 2 1
7 6 5 4 3 2 1
8 7 6 5 4 3 2 1
9 8 7 6 5 4 3 2 1
10 9 8 7 6 5 4 3 2 1
11 10 9 8 7 6 5 4 3 2 1
12 11 10 9 8 7 6 5 4 3 2 1
13 12 11 10 9 8 7 6 5 4
0
0
0
0
0
0
0
0
0
0
0
0
0
Z
Z Z
Z Z Z
Z Z Z Z
Z Z Z Z Z
Z Z Z Z Z Z
Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z3 2 1
14 13 12 11 10 9 8 7 6 5 4 3 2 1
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
0
0
0
Z Z
Z Z Z Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z Z Z Z Z
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
IIIIIIIIIIIIIIII
=
sca1sca2sca3sca4sca5sca6sca7sca8sca9sca10sca11sca12sca13sca14sca15sca16
VVVVVVVVVVVVVVVV
1 2 3l
MemoryAccessed
Scattered-Field (Right-Hand-Side) Computation
• Row multiplication: Store current history
• Column multiplication: Store future fields
• Multilevel block multiplication: Store both
Storage needed:
Current/Field vectorsg1 N
-Z Z matrices
+g S
( )N NQ2S
( )NQ
eventually 2S g S
( ) ( )N N NQ +Q
1
2 1
3 2 1
4 3 2 1
5 4 3 2 1
6 5 4 3 2 1
7 6 5 4 3 2 1
8 7 6 5 4 3 2 1
9 8 7 6 5 4 3 2 1
10 9 8 7 6 5 4 3 2 1
11 10 9 8 7 6 5 4 3 2 1
12 11 10 9 8 7 6 5 4 3 2 1
13 12 11 10 9 8 7 6 5 4
0
0
0
0
0
0
0
0
0
0
0
0
0
Z
Z Z
Z Z Z
Z Z Z Z
Z Z Z Z Z
Z Z Z Z Z Z
Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z3 2 1
14 13 12 11 10 9 8 7 6 5 4 3 2 1
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
0
0
0
Z Z
Z Z Z Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z Z Z Z Z
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
IIIIIIIIIIIIIIII
=
sca1sca2sca3sca4sca5sca6sca7sca8sca9sca10sca11sca12sca13sca14sca15sca16
VVVVVVVVVVVVVVVV
1 2 3 4l
MemoryAccessed
Scattered-Field (Right-Hand-Side) Computation
• Row multiplication: Store current history
• Column multiplication: Store future fields
• Multilevel block multiplication: Store both
Storage needed:
Current/Field vectorsg1 N
-Z Z matrices
+g S
( )N NQ2S
( )NQ
eventually 2S g S
( ) ( )N N NQ +Q
1
2 1
3 2 1
4 3 2 1
5 4 3 2 1
6 5 4 3 2 1
7 6 5 4 3 2 1
8 7 6 5 4 3 2 1
9 8 7 6 5 4 3 2 1
10 9 8 7 6 5 4 3 2 1
11 10 9 8 7 6 5 4 3 2 1
12 11 10 9 8 7 6 5 4 3 2 1
13 12 11 10 9 8 7 6 5 4
0
0
0
0
0
0
0
0
0
0
0
0
0
Z
Z Z
Z Z Z
Z Z Z Z
Z Z Z Z Z
Z Z Z Z Z Z
Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z3 2 1
14 13 12 11 10 9 8 7 6 5 4 3 2 1
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
0
0
0
Z Z
Z Z Z Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z Z Z Z Z
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
IIIIIIIIIIIIIIII
=
sca1sca2sca3sca4sca5sca6sca7sca8sca9sca10sca11sca12sca13sca14sca15sca16
VVVVVVVVVVVVVVVV
1 2 3 4 5l
MemoryAccessed
Scattered-Field (Right-Hand-Side) Computation
• Row multiplication: Store current history
• Column multiplication: Store future fields
• Multilevel block multiplication: Store both
Storage needed:
Current/Field vectorsg1 N
-Z Z matrices
+g S
( )N NQ2S
( )NQ
eventually 2S g S
( ) ( )N N NQ +Q
1
2 1
3 2 1
4 3 2 1
5 4 3 2 1
6 5 4 3 2 1
7 6 5 4 3 2 1
8 7 6 5 4 3 2 1
9 8 7 6 5 4 3 2 1
10 9 8 7 6 5 4 3 2 1
11 10 9 8 7 6 5 4 3 2 1
12 11 10 9 8 7 6 5 4 3 2 1
13 12 11 10 9 8 7 6 5 4
0
0
0
0
0
0
0
0
0
0
0
0
0
Z
Z Z
Z Z Z
Z Z Z Z
Z Z Z Z Z
Z Z Z Z Z Z
Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z3 2 1
14 13 12 11 10 9 8 7 6 5 4 3 2 1
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
0
0
0
Z Z
Z Z Z Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z Z Z Z Z
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
IIIIIIIIIIIIIIII
=
sca1sca2sca3sca4sca5sca6sca7sca8sca9sca10sca11sca12sca13sca14sca15sca16
VVVVVVVVVVVVVVVV
1 2 3 4 5 6l
MemoryAccessed
Scattered-Field (Right-Hand-Side) Computation
• Row multiplication: Store current history
• Column multiplication: Store future fields
• Multilevel block multiplication: Store both
Storage needed:
Current/Field vectorsg1 N
-Z Z matrices
+g S
( )N NQ2S
( )NQ
eventually 2S g S
( ) ( )N N NQ +Q
1
2 1
3 2 1
4 3 2 1
5 4 3 2 1
6 5 4 3 2 1
7 6 5 4 3 2 1
8 7 6 5 4 3 2 1
9 8 7 6 5 4 3 2 1
10 9 8 7 6 5 4 3 2 1
11 10 9 8 7 6 5 4 3 2 1
12 11 10 9 8 7 6 5 4 3 2 1
13 12 11 10 9 8 7 6 5 4
0
0
0
0
0
0
0
0
0
0
0
0
0
Z
Z Z
Z Z Z
Z Z Z Z
Z Z Z Z Z
Z Z Z Z Z Z
Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z3 2 1
14 13 12 11 10 9 8 7 6 5 4 3 2 1
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
0
0
0
Z Z
Z Z Z Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z Z Z Z Z
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
IIIIIIIIIIIIIIII
=
sca1sca2sca3sca4sca5sca6sca7sca8sca9sca10sca11sca12sca13sca14sca15sca16
VVVVVVVVVVVVVVVV
1 2 3 4 5 6 7l
MemoryAccessed
Scattered-Field (Right-Hand-Side) Computation
• Row multiplication: Store current history
• Column multiplication: Store future fields
• Multilevel block multiplication: Store both
Storage needed:
Current/Field vectorsg1 N
-Z Z matrices
+g S
( )N NQ2S
( )NQ
eventually 2S g S
( ) ( )N N NQ +Q
1
2 1
3 2 1
4 3 2 1
5 4 3 2 1
6 5 4 3 2 1
7 6 5 4 3 2 1
8 7 6 5 4 3 2 1
9 8 7 6 5 4 3 2 1
10 9 8 7 6 5 4 3 2 1
11 10 9 8 7 6 5 4 3 2 1
12 11 10 9 8 7 6 5 4 3 2 1
13 12 11 10 9 8 7 6 5 4
0
0
0
0
0
0
0
0
0
0
0
0
0
Z
Z Z
Z Z Z
Z Z Z Z
Z Z Z Z Z
Z Z Z Z Z Z
Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z3 2 1
14 13 12 11 10 9 8 7 6 5 4 3 2 1
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
0
0
0
Z Z
Z Z Z Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z Z Z Z Z
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
IIIIIIIIIIIIIIII
=
sca1sca2sca3sca4sca5sca6sca7sca8sca9sca10sca11sca12sca13sca14sca15sca16
VVVVVVVVVVVVVVVV
1 2 3 4 5 6 7 8l
MemoryAccessed
Scattered-Field (Right-Hand-Side) Computation
• Row multiplication: Store current history
• Column multiplication: Store future fields
• Multilevel block multiplication: Store both
Storage needed:
Current/Field vectorsg1 N
-Z Z matrices
+g S
( )N NQ2S
( )NQ
eventually 2S g S
( ) ( )N N NQ +Q
1
2 1
3 2 1
4 3 2 1
5 4 3 2 1
6 5 4 3 2 1
7 6 5 4 3 2 1
8 7 6 5 4 3 2 1
9 8 7 6 5 4 3 2 1
10 9 8 7 6 5 4 3 2 1
11 10 9 8 7 6 5 4 3 2 1
12 11 10 9 8 7 6 5 4 3 2 1
13 12 11 10 9 8 7 6 5 4
0
0
0
0
0
0
0
0
0
0
0
0
0
Z
Z Z
Z Z Z
Z Z Z Z
Z Z Z Z Z
Z Z Z Z Z Z
Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z3 2 1
14 13 12 11 10 9 8 7 6 5 4 3 2 1
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
0
0
0
Z Z
Z Z Z Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z Z Z Z Z
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
IIIIIIIIIIIIIIII
=
sca1sca2sca3sca4sca5sca6sca7sca8sca9sca10sca11sca12sca13sca14sca15sca16
VVVVVVVVVVVVVVVV
1 2 3 4 5 6 7 8 9l
MemoryAccessed
Scattered-Field (Right-Hand-Side) Computation
• Row multiplication: Store current history
• Column multiplication: Store future fields
• Multilevel block multiplication: Store both
Storage needed:
Current/Field vectorsg1 N
-Z Z matrices
+g S
( )N NQ2S
( )NQ
eventually 2S g S
( ) ( )N N NQ +Q
1
2 1
3 2 1
4 3 2 1
5 4 3 2 1
6 5 4 3 2 1
7 6 5 4 3 2 1
8 7 6 5 4 3 2 1
9 8 7 6 5 4 3 2 1
10 9 8 7 6 5 4 3 2 1
11 10 9 8 7 6 5 4 3 2 1
12 11 10 9 8 7 6 5 4 3 2 1
13 12 11 10 9 8 7 6 5 4
0
0
0
0
0
0
0
0
0
0
0
0
0
Z
Z Z
Z Z Z
Z Z Z Z
Z Z Z Z Z
Z Z Z Z Z Z
Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z3 2 1
14 13 12 11 10 9 8 7 6 5 4 3 2 1
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
0
0
0
Z Z
Z Z Z Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z Z Z Z Z
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
IIIIIIIIIIIIIIII
=
sca1sca2sca3sca4sca5sca6sca7sca8sca9sca10sca11sca12sca13sca14sca15sca16
VVVVVVVVVVVVVVVV
1 2 3 4 5 6 7 8 9 10l
MemoryAccessed
Scattered-Field (Right-Hand-Side) Computation
• Row multiplication: Store current history
• Column multiplication: Store future fields
• Multilevel block multiplication: Store both
Storage needed:
Current/Field vectorsg1 N
-Z Z matrices
+g S
( )N NQ2S
( )NQ
eventually 2S g S
( ) ( )N N NQ +Q
1
2 1
3 2 1
4 3 2 1
5 4 3 2 1
6 5 4 3 2 1
7 6 5 4 3 2 1
8 7 6 5 4 3 2 1
9 8 7 6 5 4 3 2 1
10 9 8 7 6 5 4 3 2 1
11 10 9 8 7 6 5 4 3 2 1
12 11 10 9 8 7 6 5 4 3 2 1
13 12 11 10 9 8 7 6 5 4
0
0
0
0
0
0
0
0
0
0
0
0
0
Z
Z Z
Z Z Z
Z Z Z Z
Z Z Z Z Z
Z Z Z Z Z Z
Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z3 2 1
14 13 12 11 10 9 8 7 6 5 4 3 2 1
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
0
0
0
Z Z
Z Z Z Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z Z Z Z Z
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
IIIIIIIIIIIIIIII
=
sca1sca2sca3sca4sca5sca6sca7sca8sca9sca10sca11sca12sca13sca14sca15sca16
VVVVVVVVVVVVVVVV
1 2 3 4 5 6 7 8 9 10 11l
MemoryAccessed
Scattered-Field (Right-Hand-Side) Computation
• Row multiplication: Store current history
• Column multiplication: Store future fields
• Multilevel block multiplication: Store both
Storage needed:
Current/Field vectorsg1 N
-Z Z matrices
+g S
( )N NQ2S
( )NQ
eventually 2S g S
( ) ( )N N NQ +Q
1
2 1
3 2 1
4 3 2 1
5 4 3 2 1
6 5 4 3 2 1
7 6 5 4 3 2 1
8 7 6 5 4 3 2 1
9 8 7 6 5 4 3 2 1
10 9 8 7 6 5 4 3 2 1
11 10 9 8 7 6 5 4 3 2 1
12 11 10 9 8 7 6 5 4 3 2 1
13 12 11 10 9 8 7 6 5 4
0
0
0
0
0
0
0
0
0
0
0
0
0
Z
Z Z
Z Z Z
Z Z Z Z
Z Z Z Z Z
Z Z Z Z Z Z
Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z3 2 1
14 13 12 11 10 9 8 7 6 5 4 3 2 1
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
0
0
0
Z Z
Z Z Z Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z Z Z Z Z
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
IIIIIIIIIIIIIIII
sca1sca2sca3sca4sca5sca6sca7sca8sca9sca10sca11sca12sca13sca14sca15sca16
VVVVVVVVVVVVVVVV
=1 2 3 4 5 6 7 8 9 10 11 12
l
MemoryAccessed
Scattered-Field (Right-Hand-Side) Computation
• Row multiplication: Store current history
• Column multiplication: Store future fields
• Multilevel block multiplication: Store both
Storage needed:
Current/Field vectorsg1 N
-Z Z matrices
+g S
( )N NQ2S
( )NQ
eventually 2S g S
( ) ( )N N NQ +Q
1
2 1
3 2 1
4 3 2 1
5 4 3 2 1
6 5 4 3 2 1
7 6 5 4 3 2 1
8 7 6 5 4 3 2 1
9 8 7 6 5 4 3 2 1
10 9 8 7 6 5 4 3 2 1
11 10 9 8 7 6 5 4 3 2 1
12 11 10 9 8 7 6 5 4 3 2 1
13 12 11 10 9 8 7 6 5 4
0
0
0
0
0
0
0
0
0
0
0
0
0
Z
Z Z
Z Z Z
Z Z Z Z
Z Z Z Z Z
Z Z Z Z Z Z
Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z3 2 1
14 13 12 11 10 9 8 7 6 5 4 3 2 1
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
0
0
0
Z Z
Z Z Z Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z Z Z Z Z
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
IIIIIIIIIIIIIIII
sca1sca2sca3sca4sca5sca6sca7sca8sca9sca10sca11sca12sca13sca14sca15sca16
VVVVVVVVVVVVVVVV
=1 2 3 4 5 6 7 8 9 10 11 12 13
l
MemoryAccessed
Scattered-Field (Right-Hand-Side) Computation
• Row multiplication: Store current history
• Column multiplication: Store future fields
• Multilevel block multiplication: Store both
Storage needed:
Current/Field vectorsg1 N
-Z Z matrices
+g S
( )N NQ2S
( )NQ
eventually 2S g S
( ) ( )N N NQ +Q
1
2 1
3 2 1
4 3 2 1
5 4 3 2 1
6 5 4 3 2 1
7 6 5 4 3 2 1
8 7 6 5 4 3 2 1
9 8 7 6 5 4 3 2 1
10 9 8 7 6 5 4 3 2 1
11 10 9 8 7 6 5 4 3 2 1
12 11 10 9 8 7 6 5 4 3 2 1
13 12 11 10 9 8 7 6 5 4
0
0
0
0
0
0
0
0
0
0
0
0
0
Z
Z Z
Z Z Z
Z Z Z Z
Z Z Z Z Z
Z Z Z Z Z Z
Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z3 2 1
14 13 12 11 10 9 8 7 6 5 4 3 2 1
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
0
0
0
Z Z
Z Z Z Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z Z Z Z Z
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
IIIIIIIIIIIIIIII
sca1sca2sca3sca4sca5sca6sca7sca8sca9sca10sca11sca12sca13sca14sca15sca16
VVVVVVVVVVVVVVVV
=1 2 3 4 5 6 7 8 9 10 11 12 13 14
l
MemoryAccessed
Scattered-Field (Right-Hand-Side) Computation
• Row multiplication: Store current history
• Column multiplication: Store future fields
• Multilevel block multiplication: Store both
Storage needed:
Current/Field vectorsg1 N
-Z Z matrices
+g S
( )N NQ2S
( )NQ
eventually 2S g S
( ) ( )N N NQ +Q
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
• Row multiplication: Store current history
• Column multiplication: Store future fields
• Multilevel block multiplication: Store both
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
IIIIIIIIIIIIIIII
sca1sca2sca3sca4sca5sca6sca7sca8sca9sca10sca11sca12sca13sca14sca15sca16
VVVVVVVVVVVVVVVV
=
Storage needed:
Current/Field vectorsg1 N
-Z Z matrices
+
l
MemoryAccessed
Scattered-Field (Right-Hand-Side) Computation
+ Only access all of the stored data once every time steps
g/ 2N
g S( )N NQ2
S( )NQ
eventually 2S g S
( ) ( )N N NQ +Q
1
2 1
3 2 1
4 3 2 1
5 4 3 2 1
6 5 4 3 2 1
7 6 5 4 3 2 1
8 7 6 5 4 3 2 1
9 8 7 6 5 4 3 2 1
10 9 8 7 6 5 4 3 2 1
11 10 9 8 7 6 5 4 3 2 1
12 11 10 9 8 7 6 5 4 3 2 1
13 12 11 10 9 8 7 6 5 4
0
0
0
0
0
0
0
0
0
0
0
0
0
Z
Z Z
Z Z Z
Z Z Z Z
Z Z Z Z Z
Z Z Z Z Z Z
Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z3 2 1
14 13 12 11 10 9 8 7 6 5 4 3 2 1
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
0
0
0
Z Z
Z Z Z Z Z Z Z Z Z Z Z Z Z Z
Z Z Z Z Z Z Z Z Z Z Z Z Z Z Z
Time Domain Adaptive Integral Method
TD-AIM
Step 1: Anterpolate
L
cDEmbed in regular grid with nodesSCN
{ , , }k x y zÎ
corr † A †kl l l l
kl ll l ll l l
f¢¢ ¢ ¢ ¢ ¢- ¢-
- -¢+ L LL» + LåZ I GZ I IGI
• Computational Costs (HF, while time marching)
Storage:Inter/anter
FLOPs: T( )O N pN
( )pNQ
L
Inter/anterpolation order: 3, ( 1)m p m= +
Grid spacing: cD
TD-AIM
Step 1: AnterpolateStep 2: Propagate
L
G
Embed in regular grid with nodesSCN
c †Aorr †ll l l ll l l
kl l l
kl
f¢ ¢ ¢ ¢-
¢ ¢ ¢- -
- ¢» +L L+ L Lå G GZ I I I IZ
Storage:
current/field vectors on grid,
l l ¢-G
g C( )N NQ +
Inter/anterFLOPs: +
4-D blocked space-time FFTs2
T C C g[log log ]( )N N N NO +
T( )O N pN
( )pNQ
L
• Computational Costs (HF, while time marching)
Inter/anterpolation order: 3, ( 1)m p m= +
Grid spacing: cD
{ , , }k x y zÎ
TD-AIM
Step 1: AnterpolateStep 2: Propagate
Step 3: Interpolate L
G†L
Embed in regular grid with nodesSCN
c †A†orrll l l ll l ll l l
kl
k f¢ ¢ ¢ ¢-
¢ ¢ ¢- -
- ¢» +L L+ L Lå G GZ I I I IZ
Storage:
current/field vectors on grid,
l l ¢-G
g C( )N NQ +
Inter/anterFLOPs: +
4-D blocked space-time FFTs2
T C C g[log log ]( )N N N NO +
T( )O N pN
( )pNQ
L
• Computational Costs (HF, while time marching)
Inter/anterpolation order: 3, ( 1)m p m= +
Grid spacing: cD
{ , , }k x y zÎ
TD-AIM
Step 1: AnterpolateStep 2: Propagate
Step 3: InterpolateStep 4: Correct
L
G†L
corrZ
Embed in regular grid with nodesSCN
c †A†orrll l l ll
kl
klll l l
f¢ ¢ ¢ ¢ ¢-
¢- -¢ ¢-» + L+L L LåZ I IGI IGZ
• Computational Costs (HF, while time marching)
Storage: +near( )NQ
current/field vectors on grid
corrl l ¢-Z
,l l ¢-G
g C( )N NQ +
Inter/anterFLOPs: ++
Correct 4-D blocked space-time FFTs2
T C C g[log log ]( )N N N NO +near
T( )O N N
T( )O N pN
( )pNQ
L
Inter/anterpolation order: 3, ( 1)m p m= +
Grid spacing: cD
Correction region size: 2g =
{ , , }k x y zÎ
TD-AIM
Step 1: AnterpolateEmbed in regular grid with nodesS
CN
Inter/anterpolation order: 3, ( 1)m p m= +
Grid spacing: cD
LStep 2: Propagate
G
Step 3: Interpolate†L
Step 4: CorrectcorrZ
Correction region size: 2g =
c †A†orrll l l ll
kl
klll l l
f¢ ¢ ¢ ¢ ¢-
¢- -¢ ¢-» + L+L L LåZ I IGI IGZ
• Computational Costs (HF, while time marching)
Storage: +near( )NQ
current/field vectors on grid
corrl l ¢-Z
,l l ¢-G
g C( )N NQ +
Inter/anterFLOPs: ++
Correct 4-D blocked space-time FFTs2
T C C g[log log ]( )N N N NO +near
T( )O N N
T( )O N pN
( )pNQ
L
{ , , }k x y zÎ
4-D Blocked Space-Time FFTs
g
1inc
0max(1,
c
)
A† †orrll
l
ll l ll
k
N
kl
ll l
G GZ ZI V I
A. E. Yilmaz, J. M. Jin, and E. Michielssen, IEEE Trans. AP, Oct. 2004.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
1
2 1
3 2 1
4 3 2 1
5 4 3 2 1
6 5 4 3 2 1
7 6 5 4 3 2 1
8 7 6 5 4 3 2 1
9 8 7 6 5 4 3 2 1
10 9 8 7 6 5 4 3 2 1
11 10 9 8 7 6 5
0
0
0
0
0
0
0
0
0
0
0
f
f f
f f f
f f f f
f f f f f
f f f f f f
f f f f f f f
f f f f f f f f
f f f f f f f f f
f f f f f f f f f f
f f f f f f f
G
G G
G G G
G G G G
G G G G G
G G G G G G
G G G G G G G
G G G G G G G G
G G G G G G G G G
G G G G G G G G G G
G G G G G G G 4 3 2 1
12 11 10 9 8 7 6 5 4 3 2 1
13 12 11 10 9 8 7 6 5 4 3 2 1
14 13 12 11 10 9 8 7 6 5 4 3 2 1
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
0
0
0
0
0
f f f f
f f f f f f f f f f f f
f f f f f f f f f f f f f
f f f f f f f f f f f f f f
f f f f f f f f f f f f f f f
G G G G
G G G G G G G G G G G G
G G G G G G G G G G G G G
G G G G G G G G G G G G G G
G G G G G G G G G G G G G G G
1i =2i =
3i =
4i =
ìïïïïïïïïíïïïïïïïïî
ìïïïíïïïî
{{
{ , , }k x y zÎ
Unique Blocks:
Block is of size
Storage space needed to store block
to store largest block:
2 glog{1, , 1}Ni Î +ê úê úë û1 1
C C2 2i iN N- -´i
C: (2 )ii NQ
g C( )N NQ
A. E. Yilmaz, J. M. Jin, and E. Michielssen, IEEE Trans. AP, Oct. 2004.
MOTTD-AIM
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15l
MemoryAccessed 2
S g S( ) ( )N N NQ +Q
g C( )N NQ
{ , , }k x y zÎ
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
1
2 1
3 2 1
4 3 2 1
5 4 3 2 1
6 5 4 3 2 1
7 6 5 4 3 2 1
8 7 6 5 4 3 2 1
9 8 7 6 5 4 3 2 1
10 9 8 7 6 5 4 3 2 1
11 10 9 8 7 6 5
0
0
0
0
0
0
0
0
0
0
0
f
f f
f f f
f f f f
f f f f f
f f f f f f
f f f f f f f
f f f f f f f f
f f f f f f f f f
f f f f f f f f f f
f f f f f f f
G
G G
G G G
G G G G
G G G G G
G G G G G G
G G G G G G G
G G G G G G G G
G G G G G G G G G
G G G G G G G G G G
G G G G G G G 4 3 2 1
12 11 10 9 8 7 6 5 4 3 2 1
13 12 11 10 9 8 7 6 5 4 3 2 1
14 13 12 11 10 9 8 7 6 5 4 3 2 1
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
0
0
0
0
0
f f f f
f f f f f f f f f f f f
f f f f f f f f f f f f f
f f f f f f f f f f f f f f
f f f f f f f f f f f f f f f
G G G G
G G G G G G G G G G G G
G G G G G G G G G G G G G
G G G G G G G G G G G G G G
G G G G G G G G G G G G G G G
4-D Blocked Space-Time FFTs
g
1inc
0max(1,
c
)
A† †orrll
l
ll l ll
k
N
kl
ll l
G GZ ZI V I
Out-of-Core Algorithm
l1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
CMemory 2iN
- Block requires bytes of memory - Introduce an integer threshold parameter- Limit the memory use to
- Modify TD-AIM propagation stage for blocks
- Minimize effect of latency by managing data layout in disk
Out-of-Core Algorithm: Synopsisi
C(2 )iNQ
4IO =
2IO =
IO
C(2 )IONQ
i IO>
Out-of-Core Algorithm
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
=
f
f
f
f
f
f
f
f
f
f
f
f
f
f
f
f
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
1
2 1
3 2 1
4 3 2 1
5 4 3 2 1
6 5 4 3 2 1
7 6 5 4 3 2 1
8 7 6 5 4 3 2 1
9 8 7 6 5 4 3 2 1
10 9 8 7 6 5 4 3 2 1
11 10 9 8 7 6 5
0
0
0
0
0
0
0
0
0
0
0
f
f f
f f f
f f f f
f f f f f
f f f f f f
f f f f f f f
f f f f f f f f
f f f f f f f f f
f f f f f f f f f f
f f f f f f f
G
G G
G G G
G G G G
G G G G G
G G G G G G
G G G G G G G
G G G G G G G G
G G G G G G G G G
G G G G G G G G G G
G G G G G G G 4 3 2 1
12 11 10 9 8 7 6 5 4 3 2 1
13 12 11 10 9 8 7 6 5 4 3 2 1
14 13 12 11 10 9 8 7 6 5 4 3 2 1
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
0
0
0
0
0
f f f f
f f f f f f f f f f f f
f f f f f f f f f f f f f
f f f f f f f f f f f f f f
f f f f f f f f f f f f f f f
G G G G
G G G G G G G G G G G G
G G G G G G G G G G G G G
G G G G G G G G G G G G G G
G G G G G G G G G G G G G G G
1i =2i =3i =4i =
2IO =
1i =1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
Out-of-Core Algorithm
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
=
f
f
f
f
f
f
f
f
f
f
f
f
f
f
f
f
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
1
2 1
3 2 1
4 3 2 1
5 4 3 2 1
6 5 4 3 2 1
7 6 5 4 3 2 1
8 7 6 5 4 3 2 1
9 8 7 6 5 4 3 2 1
10 9 8 7 6 5 4 3 2 1
11 10 9 8 7 6 5
0
0
0
0
0
0
0
0
0
0
0
f
f f
f f f
f f f f
f f f f f
f f f f f f
f f f f f f f
f f f f f f f f
f f f f f f f f f
f f f f f f f f f f
f f f f f f f
G
G G
G G G
G G G G
G G G G G
G G G G G G
G G G G G G G
G G G G G G G G
G G G G G G G G G
G G G G G G G G G G
G G G G G G G 4 3 2 1
12 11 10 9 8 7 6 5 4 3 2 1
13 12 11 10 9 8 7 6 5 4 3 2 1
14 13 12 11 10 9 8 7 6 5 4 3 2 1
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
0
0
0
0
0
f f f f
f f f f f f f f f f f f
f f f f f f f f f f f f f
f f f f f f f f f f f f f f
f f f f f f f f f f f f f f f
G G G G
G G G G G G G G G G G G
G G G G G G G G G G G G G
G G G G G G G G G G G G G G
G G G G G G G G G G G G G G G
1i =2i =3i =4i =
2IO =
2i =
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
Out-of-Core Algorithm
1
2
3
5
6
7
8
9
10
11
12
13
14
15
16
4
=
f
f
f
f
f
f
f
f
f
f
f
f
f
f
f
f
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
1
2 1
3 2 1
4 3 2 1
5 4 3 2 1
6 5 4 3 2 1
7 6 5 4 3 2 1
8 7 6 5 4 3 2 1
9 8 7 6 5 4 3 2 1
10 9 8 7 6 5 4 3 2 1
11 10 9 8 7 6 5
0
0
0
0
0
0
0
0
0
0
0
f
f f
f f f
f f f f
f f f f f
f f f f f f
f f f f f f f
f f f f f f f f
f f f f f f f f f
f f f f f f f f f f
f f f f f f f
G
G G
G G G
G G G G
G G G G G
G G G G G G
G G G G G G G
G G G G G G G G
G G G G G G G G G
G G G G G G G G G G
G G G G G G G 4 3 2 1
12 11 10 9 8 7 6 5 4 3 2 1
13 12 11 10 9 8 7 6 5 4 3 2 1
14 13 12 11 10 9 8 7 6 5 4 3 2 1
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
0
0
0
0
0
f f f f
f f f f f f f f f f f f
f f f f f f f f f f f f f
f f f f f f f f f f f f f f
f f f f f f f f f f f f f f f
G G G G
G G G G G G G G G G G G
G G G G G G G G G G G G G
G G G G G G G G G G G G G G
G G G G G G G G G G G G G G G
1i =2i =3i =4i =
Stored in core memory Stored in disk
2IO =
4
5
6
7
8
9
10
11
12
13
14
1
1
1
5
3
6
2
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
1i =
1
2 1
3 2 1
4 3 2 1
5 4 3 2 1
6 5 4 3 2 1
7 6 5 4 3 2 1
8 7 6 5 4 3 2 1
9 8 7 6 5 4 3 2 1
10 9 8 7 6 5 4 3 2 1
11 10 9 8 7 6 5
0
0
0
0
0
0
0
0
0
0
0
f
f f
f f f
f f f f
f f f f f
f f f f f f
f f f f f f f
f f f f f f f f
f f f f f f f f f
f f f f f f f f f f
f f f f f f f
G
G G
G G G
G G G G
G G G G G
G G G G G G
G G G G G G G
G G G G G G G G
G G G G G G G G G
G G G G G G G G G G
G G G G G G G 4 3 2 1
12 11 10 9 8 7 6 5 4 3 2 1
13 12 11 10 9 8 7 6 5 4 3 2 1
14 13 12 11 10 9 8 7 6 5 4 3 2 1
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
0
0
0
0
0
f f f f
f f f f f f f f f f f f
f f f f f f f f f f f f f
f f f f f f f f f f f f f f
f f f f f f f f f f f f f f f
G G G G
G G G G G G G G G G G G
G G G G G G G G G G G G G
G G G G G G G G G G G G G G
G G G G G G G G G G G G G G G
1i =2i =3i =4i =
Stored in disk
2IO =
Out-of-Core Algorithm
1i =
1
2
3
5
6
7
8
9
10
11
12
13
14
15
16
4
=
f
f
f
f
f
f
f
f
f
f
f
f
f
f
f
f
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
4
5
6
7
8
9
10
11
12
13
14
1
1
1
5
3
6
2
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
Stored in core memory
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
=
f
f
f
f
f
f
f
f
f
f
f
f
f
f
f
f
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
1
2 1
3 2 1
4 3 2 1
5 4 3 2 1
6 5 4 3 2 1
7 6 5 4 3 2 1
8 7 6 5 4 3 2 1
9 8 7 6 5 4 3 2 1
10 9 8 7 6 5 4 3 2 1
11 10 9 8 7 6 5
0
0
0
0
0
0
0
0
0
0
0
f
f f
f f f
f f f f
f f f f f
f f f f f f
f f f f f f f
f f f f f f f f
f f f f f f f f f
f f f f f f f f f f
f f f f f f f
G
G G
G G G
G G G G
G G G G G
G G G G G G
G G G G G G G
G G G G G G G G
G G G G G G G G G
G G G G G G G G G G
G G G G G G G 4 3 2 1
12 11 10 9 8 7 6 5 4 3 2 1
13 12 11 10 9 8 7 6 5 4 3 2 1
14 13 12 11 10 9 8 7 6 5 4 3 2 1
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
0
0
0
0
0
f f f f
f f f f f f f f f f f f
f f f f f f f f f f f f f
f f f f f f f f f f f f f f
f f f f f f f f f f f f f f f
G G G G
G G G G G G G G G G G G
G G G G G G G G G G G G G
G G G G G G G G G G G G G G
G G G G G G G G G G G G G G G
1i =2i =3i =4i =
Stored in disk
2IO =
Out-of-Core Algorithm
1i =5
6
7
8
9
1
10
11
12
13
14
1
6
2
3
4
5
1
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
Stored in core memory
2IO
2IO
12i-
12i-
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
=
f
f
f
f
f
f
f
f
f
f
f
f
f
f
f
f
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
1
2 1
3 2 1
4 3 2 1
5 4 3 2 1
6 5 4 3 2 1
7 6 5 4 3 2 1
8 7 6 5 4 3 2 1
9 8 7 6 5 4 3 2 1
10 9 8 7 6 5 4 3 2 1
11 10 9 8 7 6 5
0
0
0
0
0
0
0
0
0
0
0
f
f f
f f f
f f f f
f f f f f
f f f f f f
f f f f f f f
f f f f f f f f
f f f f f f f f f
f f f f f f f f f f
f f f f f f f
G
G G
G G G
G G G G
G G G G G
G G G G G G
G G G G G G G
G G G G G G G G
G G G G G G G G G
G G G G G G G G G G
G G G G G G G 4 3 2 1
12 11 10 9 8 7 6 5 4 3 2 1
13 12 11 10 9 8 7 6 5 4 3 2 1
14 13 12 11 10 9 8 7 6 5 4 3 2 1
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
0
0
0
0
0
f f f f
f f f f f f f f f f f f
f f f f f f f f f f f f f
f f f f f f f f f f f f f f
f f f f f f f f f f f f f f f
G G G G
G G G G G G G G G G G G
G G G G G G G G G G G G G
G G G G G G G G G G G G G G
G G G G G G G G G G G G G G G
1i =2i =3i =4i =
Stored in disk
2IO =
Out-of-Core Algorithm
3i =
For each node :
Fetch anterpolated currentsamples from the memoryand the remaining
samples from diskCompute the contribution of
these current samples to thepotentials at future timesteps
Read (previously computed)partial potentials from thedisk and add to thecontribution computed instep (2)
Write the currents at theprevious time steps andthe updated potentialsbeyond time steps intothe future to the disk.
, 1 Cu u N£ £
5
6
7
8
9
1
10
11
12
13
14
1
6
2
3
4
5
1
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
Stored in core memory
2IO
12 2i IO- -
2IO
2IO
12i-
12i-
1
2 1
3 2 1
4 3 2 1
5 4 3 2 1
6 5 4 3 2 1
7 6 5 4 3 2 1
8 7 6 5 4 3 2 1
9 8 7 6 5 4 3 2 1
10 9 8 7 6 5 4 3 2 1
11 10 9 8 7 6 5
0
0
0
0
0
0
0
0
0
0
0
f
f f
f f f
f f f f
f f f f f
f f f f f f
f f f f f f f
f f f f f f f f
f f f f f f f f f
f f f f f f f f f f
f f f f f f f
G
G G
G G G
G G G G
G G G G G
G G G G G G
G G G G G G G
G G G G G G G G
G G G G G G G G G
G G G G G G G G G G
G G G G G G G 4 3 2 1
12 11 10 9 8 7 6 5 4 3 2 1
13 12 11 10 9 8 7 6 5 4 3 2 1
14 13 12 11 10 9 8 7 6 5 4 3 2 1
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
0
0
0
0
0
f f f f
f f f f f f f f f f f f
f f f f f f f f f f f f f
f f f f f f f f f f f f f f
f f f f f f f f f f f f f f f
G G G G
G G G G G G G G G G G G
G G G G G G G G G G G G G
G G G G G G G G G G G G G G
G G G G G G G G G G G G G G G
1i =2i =3i =4i =
Stored in disk
2IO =
Out-of-Core Algorithm
3i =
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
=
f
f
f
f
f
f
f
f
f
f
f
f
f
f
f
f
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
For each node :
Fetch anterpolated currentsamples from the memoryand the remaining
samples from diskCompute the contribution of
these current samples to thepotentials at future timesteps
Read (previously computed)partial potentials from thedisk and add to thecontribution computed instep (2)
Write the currents at theprevious time steps andthe updated potentialsbeyond time steps intothe future to the disk.
2IO
12 2i IO- -
, 1 Cu u N£ £
5
6
7
8
9
1
10
11
12
13
14
1
6
2
3
4
5
1
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
Stored in core memory
2IO
2IO
12i-
12i-
2IO
12 2i IO- -1
2 1
3 2 1
4 3 2 1
5 4 3 2 1
6 5 4 3 2 1
7 6 5 4 3 2 1
8 7 6 5 4 3 2 1
9 8 7 6 5 4 3 2 1
10 9 8 7 6 5 4 3 2 1
11 10 9 8 7 6 5
0
0
0
0
0
0
0
0
0
0
0
f
f f
f f f
f f f f
f f f f f
f f f f f f
f f f f f f f
f f f f f f f f
f f f f f f f f f
f f f f f f f f f f
f f f f f f f
G
G G
G G G
G G G G
G G G G G
G G G G G G
G G G G G G G
G G G G G G G G
G G G G G G G G G
G G G G G G G G G G
G G G G G G G 4 3 2 1
12 11 10 9 8 7 6 5 4 3 2 1
13 12 11 10 9 8 7 6 5 4 3 2 1
14 13 12 11 10 9 8 7 6 5 4 3 2 1
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
0
0
0
0
0
f f f f
f f f f f f f f f f f f
f f f f f f f f f f f f f
f f f f f f f f f f f f f f
f f f f f f f f f f f f f f f
G G G G
G G G G G G G G G G G G
G G G G G G G G G G G G G
G G G G G G G G G G G G G G
G G G G G G G G G G G G G G G
1i =2i =3i =4i =
Stored in disk
2IO =
Out-of-Core Algorithm
3i =
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
=
f
f
f
f
f
f
f
f
f
f
f
f
f
f
f
f
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
For each node :
Fetch anterpolated currentsamples from the memoryand the remaining
samples from diskCompute the contribution of
these current samples to thepotentials at future timesteps
Read (previously computed)partial potentials from thedisk and add to thecontribution computed instep (2)
Write the currents at theprevious time steps andthe updated potentialsbeyond time steps intothe future to the disk.
, 1 Cu u N£ £
5
6
7
8
9
1
10
11
12
13
14
1
6
2
3
4
5
1
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
Stored in core memory
12i-
12i-
2IO
12 2i IO- -1
2 1
3 2 1
4 3 2 1
5 4 3 2 1
6 5 4 3 2 1
7 6 5 4 3 2 1
8 7 6 5 4 3 2 1
9 8 7 6 5 4 3 2 1
10 9 8 7 6 5 4 3 2 1
11 10 9 8 7 6 5
0
0
0
0
0
0
0
0
0
0
0
f
f f
f f f
f f f f
f f f f f
f f f f f f
f f f f f f f
f f f f f f f f
f f f f f f f f f
f f f f f f f f f f
f f f f f f f
G
G G
G G G
G G G G
G G G G G
G G G G G G
G G G G G G G
G G G G G G G G
G G G G G G G G G
G G G G G G G G G G
G G G G G G G 4 3 2 1
12 11 10 9 8 7 6 5 4 3 2 1
13 12 11 10 9 8 7 6 5 4 3 2 1
14 13 12 11 10 9 8 7 6 5 4 3 2 1
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
0
0
0
0
0
f f f f
f f f f f f f f f f f f
f f f f f f f f f f f f f
f f f f f f f f f f f f f f
f f f f f f f f f f f f f f f
G G G G
G G G G G G G G G G G G
G G G G G G G G G G G G G
G G G G G G G G G G G G G G
G G G G G G G G G G G G G G G
1i =2i =3i =4i =
Stored in disk
2IO =
Out-of-Core Algorithm
3i =
1
2
3
4
5
9
10
11
12
13
14
15
16
6
7
8 =
f
f
f
f
f
f
f
f
f
f
f
f
f
f
f
f
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
For each node :
Fetch anterpolated currentsamples from the memoryand the remaining
samples from diskCompute the contribution of
these current samples to thepotentials at future timesteps
Read (previously computed)partial potentials from thedisk and add to thecontribution computed instep (2)
Write the currents at theprevious time steps andthe updated potentialsbeyond time steps intothe future to the disk.
2IO
2IO
, 1 Cu u N£ £
6
7
8
9
1
2
3
4
10
11
12
13
14
15
16
5
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
Stored in core memory
1
2 1
3 2 1
4 3 2 1
5 4 3 2 1
6 5 4 3 2 1
7 6 5 4 3 2 1
8 7 6 5 4 3 2 1
9 8 7 6 5 4 3 2 1
10 9 8 7 6 5 4 3 2 1
11 10 9 8 7 6 5
0
0
0
0
0
0
0
0
0
0
0
f
f f
f f f
f f f f
f f f f f
f f f f f f
f f f f f f f
f f f f f f f f
f f f f f f f f f
f f f f f f f f f f
f f f f f f f
G
G G
G G G
G G G G
G G G G G
G G G G G G
G G G G G G G
G G G G G G G G
G G G G G G G G G
G G G G G G G G G G
G G G G G G G 4 3 2 1
12 11 10 9 8 7 6 5 4 3 2 1
13 12 11 10 9 8 7 6 5 4 3 2 1
14 13 12 11 10 9 8 7 6 5 4 3 2 1
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
0
0
0
0
0
f f f f
f f f f f f f f f f f f
f f f f f f f f f f f f f
f f f f f f f f f f f f f f
f f f f f f f f f f f f f f f
G G G G
G G G G G G G G G G G G
G G G G G G G G G G G G G
G G G G G G G G G G G G G G
G G G G G G G G G G G G G G G
1i =2i =3i =4i =
Stored in disk
2IO =
Out-of-Core Algorithm
1i =
1
2
3
4
5
9
10
11
12
13
14
15
16
6
7
8 =
f
f
f
f
f
f
f
f
f
f
f
f
f
f
f
f
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
6
7
8
9
1
2
3
4
10
11
12
13
14
15
16
5
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
Stored in core memory
1
2 1
3 2 1
4 3 2 1
5 4 3 2 1
6 5 4 3 2 1
7 6 5 4 3 2 1
8 7 6 5 4 3 2 1
9 8 7 6 5 4 3 2 1
10 9 8 7 6 5 4 3 2 1
11 10 9 8 7 6 5
0
0
0
0
0
0
0
0
0
0
0
f
f f
f f f
f f f f
f f f f f
f f f f f f
f f f f f f f
f f f f f f f f
f f f f f f f f f
f f f f f f f f f f
f f f f f f f
G
G G
G G G
G G G G
G G G G G
G G G G G G
G G G G G G G
G G G G G G G G
G G G G G G G G G
G G G G G G G G G G
G G G G G G G 4 3 2 1
12 11 10 9 8 7 6 5 4 3 2 1
13 12 11 10 9 8 7 6 5 4 3 2 1
14 13 12 11 10 9 8 7 6 5 4 3 2 1
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
0
0
0
0
0
f f f f
f f f f f f f f f f f f
f f f f f f f f f f f f f
f f f f f f f f f f f f f f
f f f f f f f f f f f f f f f
G G G G
G G G G G G G G G G G G
G G G G G G G G G G G G G
G G G G G G G G G G G G G G
G G G G G G G G G G G G G G G
1i =2i =3i =4i =
Stored in disk
2IO =
Out-of-Core Algorithm
1i =
1
2
3
4
5
6
9
10
11
12
13
14
15
16
7
8 =
f
f
f
f
f
f
f
f
f
f
f
f
f
f
f
f
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
7
8
9
10
11
12
1
2
13
14
15
16
5
3
4
6
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
Stored in core memory
1
2 1
3 2 1
4 3 2 1
5 4 3 2 1
6 5 4 3 2 1
7 6 5 4 3 2 1
8 7 6 5 4 3 2 1
9 8 7 6 5 4 3 2 1
10 9 8 7 6 5 4 3 2 1
11 10 9 8 7 6 5
0
0
0
0
0
0
0
0
0
0
0
f
f f
f f f
f f f f
f f f f f
f f f f f f
f f f f f f f
f f f f f f f f
f f f f f f f f f
f f f f f f f f f f
f f f f f f f
G
G G
G G G
G G G G
G G G G G
G G G G G G
G G G G G G G
G G G G G G G G
G G G G G G G G G
G G G G G G G G G G
G G G G G G G 4 3 2 1
12 11 10 9 8 7 6 5 4 3 2 1
13 12 11 10 9 8 7 6 5 4 3 2 1
14 13 12 11 10 9 8 7 6 5 4 3 2 1
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
0
0
0
0
0
f f f f
f f f f f f f f f f f f
f f f f f f f f f f f f f
f f f f f f f f f f f f f f
f f f f f f f f f f f f f f f
G G G G
G G G G G G G G G G G G
G G G G G G G G G G G G G
G G G G G G G G G G G G G G
G G G G G G G G G G G G G G G
1i =2i =3i =4i =
Stored in disk
2IO =
Out-of-Core Algorithm
2i =
1
2
3
4
5
6
9
10
11
12
13
14
15
16
7
8 =
f
f
f
f
f
f
f
f
f
f
f
f
f
f
f
f
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
7
8
9
10
11
12
1
2
13
14
15
16
5
3
4
6
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
Stored in core memory
1
2 1
3 2 1
4 3 2 1
5 4 3 2 1
6 5 4 3 2 1
7 6 5 4 3 2 1
8 7 6 5 4 3 2 1
9 8 7 6 5 4 3 2 1
10 9 8 7 6 5 4 3 2 1
11 10 9 8 7 6 5
0
0
0
0
0
0
0
0
0
0
0
f
f f
f f f
f f f f
f f f f f
f f f f f f
f f f f f f f
f f f f f f f f
f f f f f f f f f
f f f f f f f f f f
f f f f f f f
G
G G
G G G
G G G G
G G G G G
G G G G G G
G G G G G G G
G G G G G G G G
G G G G G G G G G
G G G G G G G G G G
G G G G G G G 4 3 2 1
12 11 10 9 8 7 6 5 4 3 2 1
13 12 11 10 9 8 7 6 5 4 3 2 1
14 13 12 11 10 9 8 7 6 5 4 3 2 1
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
0
0
0
0
0
f f f f
f f f f f f f f f f f f
f f f f f f f f f f f f f
f f f f f f f f f f f f f f
f f f f f f f f f f f f f f f
G G G G
G G G G G G G G G G G G
G G G G G G G G G G G G G
G G G G G G G G G G G G G G
G G G G G G G G G G G G G G G
1i =2i =3i =4i =
Stored in disk
2IO =
Out-of-Core Algorithm
2i =
1
2
3
4
5
6
7
9
10
11
12
13
14
15
16
8 = f
f
f
f
f
f
f
f
f
f
f
f
f
f
f
f
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
8
9
10
11
1
2
3
4
12
13
14
6
5
1
5
1
7
6
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
Stored in core memory
1
2 1
3 2 1
4 3 2 1
5 4 3 2 1
6 5 4 3 2 1
7 6 5 4 3 2 1
8 7 6 5 4 3 2 1
9 8 7 6 5 4 3 2 1
10 9 8 7 6 5 4 3 2 1
11 10 9 8 7 6 5
0
0
0
0
0
0
0
0
0
0
0
f
f f
f f f
f f f f
f f f f f
f f f f f f
f f f f f f f
f f f f f f f f
f f f f f f f f f
f f f f f f f f f f
f f f f f f f
G
G G
G G G
G G G G
G G G G G
G G G G G G
G G G G G G G
G G G G G G G G
G G G G G G G G G
G G G G G G G G G G
G G G G G G G 4 3 2 1
12 11 10 9 8 7 6 5 4 3 2 1
13 12 11 10 9 8 7 6 5 4 3 2 1
14 13 12 11 10 9 8 7 6 5 4 3 2 1
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
0
0
0
0
0
f f f f
f f f f f f f f f f f f
f f f f f f f f f f f f f
f f f f f f f f f f f f f f
f f f f f f f f f f f f f f f
G G G G
G G G G G G G G G G G G
G G G G G G G G G G G G G
G G G G G G G G G G G G G G
G G G G G G G G G G G G G G G
1i =2i =3i =4i =
Stored in disk
2IO =
Out-of-Core Algorithm
1i =
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
=
f
f
f
f
f
f
f
f
f
f
f
f
f
f
f
f
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
9
10
11
12
13
1
5
6
7
14
15
2
3
6
8
1
4
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
Stored in core memory
1
2 1
3 2 1
4 3 2 1
5 4 3 2 1
6 5 4 3 2 1
7 6 5 4 3 2 1
8 7 6 5 4 3 2 1
9 8 7 6 5 4 3 2 1
10 9 8 7 6 5 4 3 2 1
11 10 9 8 7 6 5
0
0
0
0
0
0
0
0
0
0
0
f
f f
f f f
f f f f
f f f f f
f f f f f f
f f f f f f f
f f f f f f f f
f f f f f f f f f
f f f f f f f f f f
f f f f f f f
G
G G
G G G
G G G G
G G G G G
G G G G G G
G G G G G G G
G G G G G G G G
G G G G G G G G G
G G G G G G G G G G
G G G G G G G 4 3 2 1
12 11 10 9 8 7 6 5 4 3 2 1
13 12 11 10 9 8 7 6 5 4 3 2 1
14 13 12 11 10 9 8 7 6 5 4 3 2 1
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
0
0
0
0
0
f f f f
f f f f f f f f f f f f
f f f f f f f f f f f f f
f f f f f f f f f f f f f f
f f f f f f f f f f f f f f f
G G G G
G G G G G G G G G G G G
G G G G G G G G G G G G G
G G G G G G G G G G G G G G
G G G G G G G G G G G G G G G
1i =2i =3i =4i =
Stored in disk
2IO =
Out-of-Core Algorithm
4i =
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
=
f
f
f
f
f
f
f
f
f
f
f
f
f
f
f
f
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
For each node :
Fetch anterpolated currentsamples from the memoryand the remaining
samples from diskCompute the contribution of
these current samples to thepotentials at future timesteps
Read (previously computed)partial potentials from thedisk and add to thecontribution computed instep (2)
Write the currents at theprevious time steps andthe updated potentialsbeyond time steps intothe future to the disk.
12i-
12i-
2IO
2IO
, 1 Cu u N£ £
9
10
11
12
13
1
5
6
7
14
15
2
3
6
8
1
4
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
Stored in core memory
2IO
12 2i IO- -
2IO
2IO
12i-
12i-
2IO
12 2i IO- -1
2 1
3 2 1
4 3 2 1
5 4 3 2 1
6 5 4 3 2 1
7 6 5 4 3 2 1
8 7 6 5 4 3 2 1
9 8 7 6 5 4 3 2 1
10 9 8 7 6 5 4 3 2 1
11 10 9 8 7 6 5
0
0
0
0
0
0
0
0
0
0
0
f
f f
f f f
f f f f
f f f f f
f f f f f f
f f f f f f f
f f f f f f f f
f f f f f f f f f
f f f f f f f f f f
f f f f f f f
G
G G
G G G
G G G G
G G G G G
G G G G G G
G G G G G G G
G G G G G G G G
G G G G G G G G G
G G G G G G G G G G
G G G G G G G 4 3 2 1
12 11 10 9 8 7 6 5 4 3 2 1
13 12 11 10 9 8 7 6 5 4 3 2 1
14 13 12 11 10 9 8 7 6 5 4 3 2 1
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
0
0
0
0
0
f f f f
f f f f f f f f f f f f
f f f f f f f f f f f f f
f f f f f f f f f f f f f f
f f f f f f f f f f f f f f f
G G G G
G G G G G G G G G G G G
G G G G G G G G G G G G G
G G G G G G G G G G G G G G
G G G G G G G G G G G G G G G
1i =2i =3i =4i =
Stored in disk
2IO =
Out-of-Core Algorithm
4i =
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
=
f
f
f
f
f
f
f
f
f
f
f
f
f
f
f
f
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
For each node :
Fetch anterpolated currentsamples from the memoryand the remaining
samples from diskCompute the contribution of
these current samples to thepotentials at future timesteps
Read (previously computed)partial potentials from thedisk and add to thecontribution computed instep (2)
Write the currents at theprevious time steps andthe updated potentialsbeyond time steps intothe future to the disk.
, 1 Cu u N£ £
9
10
11
12
13
1
5
6
7
14
15
2
3
6
8
1
4
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
Stored in core memory
2IO
2IO
12i-
12i-
2IO
12 2i IO- -1
2 1
3 2 1
4 3 2 1
5 4 3 2 1
6 5 4 3 2 1
7 6 5 4 3 2 1
8 7 6 5 4 3 2 1
9 8 7 6 5 4 3 2 1
10 9 8 7 6 5 4 3 2 1
11 10 9 8 7 6 5
0
0
0
0
0
0
0
0
0
0
0
f
f f
f f f
f f f f
f f f f f
f f f f f f
f f f f f f f
f f f f f f f f
f f f f f f f f f
f f f f f f f f f f
f f f f f f f
G
G G
G G G
G G G G
G G G G G
G G G G G G
G G G G G G G
G G G G G G G G
G G G G G G G G G
G G G G G G G G G G
G G G G G G G 4 3 2 1
12 11 10 9 8 7 6 5 4 3 2 1
13 12 11 10 9 8 7 6 5 4 3 2 1
14 13 12 11 10 9 8 7 6 5 4 3 2 1
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
0
0
0
0
0
f f f f
f f f f f f f f f f f f
f f f f f f f f f f f f f
f f f f f f f f f f f f f f
f f f f f f f f f f f f f f f
G G G G
G G G G G G G G G G G G
G G G G G G G G G G G G G
G G G G G G G G G G G G G G
G G G G G G G G G G G G G G G
1i =2i =3i =4i =
Stored in disk
2IO =
4i =
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
=
f
f
f
f
f
f
f
f
f
f
f
f
f
f
f
f
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
Out-of-Core AlgorithmFor each node :
Fetch anterpolated currentsamples from the memoryand the remaining
samples from diskCompute the contribution of
these current samples to thepotentials at future timesteps
Read (previously computed)partial potentials from thedisk and add to thecontribution computed instep (2)
Write the currents at theprevious time steps andthe updated potentialsbeyond time steps intothe future to the disk.
, 1 Cu u N£ £
9
10
11
12
13
1
5
6
7
14
15
2
3
6
8
1
4
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
Stored in core memory
2IO
2IO
12i-
12i-
2IO
12 2i IO- -1
2 1
3 2 1
4 3 2 1
5 4 3 2 1
6 5 4 3 2 1
7 6 5 4 3 2 1
8 7 6 5 4 3 2 1
9 8 7 6 5 4 3 2 1
10 9 8 7 6 5 4 3 2 1
11 10 9 8 7 6 5
0
0
0
0
0
0
0
0
0
0
0
f
f f
f f f
f f f f
f f f f f
f f f f f f
f f f f f f f
f f f f f f f f
f f f f f f f f f
f f f f f f f f f f
f f f f f f f
G
G G
G G G
G G G G
G G G G G
G G G G G G
G G G G G G G
G G G G G G G G
G G G G G G G G G
G G G G G G G G G G
G G G G G G G 4 3 2 1
12 11 10 9 8 7 6 5 4 3 2 1
13 12 11 10 9 8 7 6 5 4 3 2 1
14 13 12 11 10 9 8 7 6 5 4 3 2 1
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
0
0
0
0
0
f f f f
f f f f f f f f f f f f
f f f f f f f f f f f f f
f f f f f f f f f f f f f f
f f f f f f f f f f f f f f f
G G G G
G G G G G G G G G G G G
G G G G G G G G G G G G G
G G G G G G G G G G G G G G
G G G G G G G G G G G G G G G
1i =2i =3i =4i =
Stored in disk
2IO =
Out-of-Core Algorithm
4i =
1
2
3
4
5
6
7
8
9
10
13
14
15
16
11
12
=
f
f
f
f
f
f
f
f
f
f
f
f
f
f
f
f
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
For each node :
Fetch anterpolated currentsamples from the memoryand the remaining
samples from diskCompute the contribution of
these current samples to thepotentials at future timesteps
Read (previously computed)partial potentials from thedisk and add to thecontribution computed instep (2)
Write the currents at theprevious time steps andthe updated potentialsbeyond time steps intothe future to the disk.
, 1 Cu u N£ £
9
10
11
12
13
1
1
2
3
4
5
6
7
1
8
4
5
16
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
Stored in core memory
Data Layout: In Core vs. Out-of-Core
g1,
Nf fG G
g1,...,l N=
12
CN
g1,...,l N=
12
1,...,2 IOl=
Stored in core memory
Stored in disk
CN
Stored in core memory
Cost Analysis (No Buffering)
fl
rw
rw lat bw
: Cost of one floating point operation
: Average cost of read/write of one byte from/to disk
(must minimize effect of latency/access time)IO IO
t
t
Nt t Nt= +
Memory:
current/field vectors on grid,l l ¢-G
g C( )N NQ
In-Core TD-AIM Out-of-Core TD-AIM
C( )2IONQ
Disk space: - g C[ 2 ]( )ION NQ -
Time for FLOPs:
4-D blocked space-time FFTs2
T C C gfl[log log( ])N Nt NO N +
Time for I/O: -
2T C C gfl
[log log( ])N Nt NO N +
grw T Clog( )N N Nt O
Data Layout in Disk
g
2 1,...,IOl N+=
Stored in disk
12
CN
- Divide into blocks of 2IO
Data Layout in Disk
g2 1,...,IOl N+=
bufN
buf2N
Store in disk using fixed-length record unformatted direct access files
C buf 1N N- +
Stored in disk
12
CN
bufEach record contains bytes, formed from 2 temporal data for 2 spatial points
IO IOB N B
Data Layout in Disk
bufEach record contains bytes, formed from 2 temporal data for 2 spatial points
IO IOB N B
g
2 1,...,IOl N+=
bufN
buf2N
C buf 1N N- +
Stored in disk
12
CN
1=rec2=rec
l1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
CMemory 2iN
Store in disk using fixed-length record unformatted direct access files
Step: Write the currents at the previous time steps2IO
Data Layout in Disk
bufEach record contains bytes, formed from 2 temporal data for 2 spatial points
IO IOB N B
g
2 1,...,IOl N+=
bufN
buf2N
C buf 1N N- +
Stored in disk
12
CN
l1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
CMemory 2iN
1=rec2=rec
4l=
Store in disk using fixed-length record unformatted direct access files
Step: Write the currents at the previous time steps2IO
Data Layout in Disk
bufEach record contains bytes, formed from 2 temporal data for 2 spatial points
IO IOB N B
g2 1,...,IOl N+=
bufN
buf2N
C buf 1N N- +
Stored in disk
12
CN
l1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
CMemory 2iN
1=rec2=rec
g
2IO
N=rec
4l=
Store in disk using fixed-length record unformatted direct access files
Step: Write the currents at the previous time steps2IO
Data Layout in Disk
bufEach record contains bytes, formed from 2 temporal data for 2 spatial points
IO IOB N B
g2 1,...,IOl N+=
bufN
buf2N
C buf 1N N- +
Stored in disk
12
CN
l1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
CMemory 2iN
4l=
1=rec2=rec
g
2IO
N=rec
g C
buf
1 12IO
N N
N
æ ö÷ç ÷ç ÷= - +ç ÷ç ÷çè ørec
Step: Write the currents at the previous time steps2IO
Store in disk using fixed-length record unformatted direct access files
Data Layout in Disk
bufEach record contains bytes, formed from 2 temporal data for 2 spatial points
IO IOB N B
g2 1,...,IOl N+=
bufN
buf2N
C buf 1N N- +
Stored in disk
12
CN
l1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
CMemory 2iN
1=rec2=rec
g
2IO
N=rec
g C
buf
1 12IO
N N
N
æ ö÷ç ÷ç ÷= - +ç ÷ç ÷çè ørec
8l=
Step: Write the currents at the previous time steps2IO
Store in disk using fixed-length record unformatted direct access files
Data Layout in Disk
bufEach record contains bytes, formed from 2 temporal data for 2 spatial points
IO IOB N B
g
2 1,...,IOl N+=
bufN
buf2N
C buf 1N N- +
Stored in disk
12
CN
l1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
CMemory 2iN
1=rec2=rec
g
2IO
N=rec
g C
buf
1 12IO
N N
N
æ ö÷ç ÷ç ÷= - +ç ÷ç ÷çè ørec
8l=
Step: Write the currents at the previous time steps2IO
Store in disk using fixed-length record unformatted direct access files
Data Layout in Disk
bufEach record contains bytes, formed from 2 temporal data for 2 spatial points
IO IOB N B
g
2 1,...,IOl N+=
bufN
buf2N
C buf 1N N- +
Stored in disk
12
CN
l1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
CMemory 2iN
1=rec2=rec
g
2IO
N=rec
g C
buf
1 12IO
N N
N
æ ö÷ç ÷ç ÷= - +ç ÷ç ÷çè ørec
8l=
Step: Write the currents at the previous time steps2IO
Store in disk using fixed-length record unformatted direct access files
Data Layout in Disk
bufEach record contains bytes, formed from 2 temporal data for 2 spatial points
IO IOB N B
g
2 1,...,IOl N+=
bufN
buf2N
C buf 1N N- +
Stored in disk
12
CN
l1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
CMemory 2iN
1=rec2=rec
g
2IO
N=rec
g C
buf
1 12IO
N N
N
æ ö÷ç ÷ç ÷= - +ç ÷ç ÷çè ørec
gl N=
Step: Write the currents at the previous time steps2IO
Store in disk using fixed-length record unformatted direct access files
Data Layout in Disk
bufEach record contains bytes, formed from 2 temporal data for 2 spatial points
IO IOB N B
g
2 1,...,IOl N+=
bufN
buf2N
C buf 1N N- +
Stored in disk
12
CN
l1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
CMemory 2iN
1=rec2=rec
g
2IO
N=rec
g C
buf
1 12IO
N N
N
æ ö÷ç ÷ç ÷= - +ç ÷ç ÷çè ørec
gl N=
Step: Write the currents at the previous time steps2IO
Store in disk using fixed-length record unformatted direct access files
Data Layout in Disk
bufEach record contains bytes, formed from 2 temporal data for 2 spatial points
IO IOB N B
g
2 1,...,IOl N+=
bufN
buf2N
C buf 1N N- +
Stored in disk
12
CN
l1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
CMemory 2iN
1=rec2=rec
g
2IO
N=rec
g C
buf
1 12IO
N N
N
æ ö÷ç ÷ç ÷= - +ç ÷ç ÷çè ørec
gl N=
Step: Write the currents at the previous time steps2IO
Store in disk using fixed-length record unformatted direct access files
Data Layout in Disk
bufEach record contains bytes, formed from 2 temporal data for 2 spatial points
IO IOB N B
g
2 1,...,IOl N+=
bufN
buf2N
C buf 1N N- +
Stored in disk
12
CN
l1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
CMemory 2iN
1=rec2=rec
g
2IO
N=rec
g C
buf
1 12IO
N N
N
æ ö÷ç ÷ç ÷= - +ç ÷ç ÷çè ørec
gl N=
Store in disk using fixed-length record unformatted direct access files
Step: Fetch current samples from disk
12 2i IO- -
Cost Analysis (with Buffer)
fl
rw
rw lat bw
: Cost of one floating point operation
: Average cost of read/write of one byte from/to disk
(must minimize effect of latency/access time)IO IO
t
t
Nt t Nt= +
Memory:
current/field vectors on grid,l l ¢-G
g C( )N NQ
In-Core TD-AIM Out-of-Core TD-AIM
C g buf( )2 [ 2 ]IO ION N N+ -Q
Disk space: - g C[ 2 ]( )ION NQ -
Time for FLOPs:
4-D blocked space-time FFTs2
T C C gfl[log log( ])N Nt NO N +
Time for I/O:-
2T C C gfl
[log log( ])N Nt NO N +
Tlat
bw
C buf g
T C g
( / ) log
log
( )
+ ( )
IO
IO
t O
t O
N N N N
N N N
Cost Analysis (with Buffer + Parallelization)
Memory:
current/field vectors on grid,l l ¢-G
g C/( )N N PQ
In-Core TD-AIM Out-of-Core TD-AIM
C g buf( )2 / [ 2 ]IO ION P N N-Q +
Disk space: - g C]([ )2 /ION N PQ -
- Every process needs memory space for the buffer- Buffer limits parallel scalability of memory requirement: max C
bufg
2~
[ 2 ]
IO
IO
NP
NN -
Results- Sphere - Model Airplane
Stampede
Sphere
min 1 m, 3
0.24 ns
2, 100 kB
t
IO B
l g= =D =
= =
2
2
ˆ( / 8 )inc 2
c
c bw bw
ˆˆ( , ) cos(2 [ 8 ])
200 MHz, 3 2 ; 100 MHz
t z cz
t xe f tc
f f f
s
s p s
s p
+ ⋅ -- ⋅
= + -
= = =
rr
E r
s(m)L sN CN tN
0.5
1
2
4
8
16
684
3384
10 947
44 595
179 130
742 059
32 2 903 916 3512
3256
3128
380
348
327
318
gN
260
285
300
420
545
760
1375
32
44
74
132
247
471
919
Sphere
102 103 104 105 106 10710-4
10-3
10-2
10-1
100
101
102
103
104
105
Ns
Cor
e M
emor
y R
equi
rem
ent(G
B)
In-core TD-AIMOut-of-core TD-AIMBuffer MemoryFD-AIM
O(Ns3/2)
O(Ns2)
O(Ns1/2)
102 103 104 105 106 10710-2
10-1
100
101
102
103
104
105
Ns
Mar
chin
g/S
olut
ion
Tim
e pe
r tim
e/fre
quen
cy s
tep
(s)
In-core TD-AIMOut-of-core TD-AIMFD-AIM
O(Ns3/2log2Ns)
O(Ns3/2logNs)
Model Airplane
-0.1 0 0.2 0.4 0.6 0.8 1 1.20
0.005
0.01
Range-ctd/2 (m)
P (V
)
FD-AIMTD-AIM
incE
y
z x
k
max
(GHz)
fsN CN tN
2.5
5
10
20
23 217
92 868
371 472
1 485 888 2576 160´
2288 80´
2144 45´
272 27´
gN
500
1000
1600
2500
119
217
416
813
2
2
c bwc
c bw
ˆ( / 8 )inc 2
c
bw max bw
( )sca
ˆˆ( , ) cos(2 [ 8 ])
/ 2.5, 3 2
1 ˆ( , , ) lim ( , )
t y c
j t
r
yt ze f t
cf f f
P t r e d
s
s
w ww w
qw w
p s
s p
q f q w wp
+ ⋅ --
+-
¥-
⋅= + -
= =
= ⋅ò
rr
E r
E r
max
3
1 / 14
2, 100 kB
t f
IO B
g =D =
= =max 20 GHzf =
Model Airplane
104 105 106 10710-3
10-2
10-1
100
101
102
103
104
105
Ns
Cor
e M
emor
y R
equi
rem
ent(G
B)
In-core TD-AIMOut-of-core TD-AIMBuffer MemoryFD-AIM
O(Ns3/2)
O(Ns2)
O(Ns1/2)
104 105 106 107100
101
102
103
104
105
106
Ns
Mar
chin
g/S
olut
ion
Tim
e pe
r tim
e/fre
quen
cy s
tep
(s)
In-core TD-AIMOut-of-core TD-AIMFD-AIM
O(Ns3/2log2Ns)
O(Ns3/2logNs)
Model Airplane @ 10GHz
2 3 4 5 6 7 8102
103
104
IO
Cor
e M
emor
y R
equi
rem
ent (
GB
)
In Core
Out of Core
2 3 4 5 6 7 80
100
200
300
400
500
600
700
800
900
1000
IO
Mar
chin
g Ti
me
per t
ime
step
(s)
In Core
Out of Core~(logNg-IO)
Conclusions
• A parallel multilevel out-of-core algorithm for TD-AIM+ Core-memory requirement only ~ 3-4x FD-AIM when IO = 2 + Buffering => only ~8-10x slower on Stampede when IO = 2
+ Control knob to trade-off memory for speed:
- Buffer limits parallel scalability (can be ameliorated by using hybrid
parallelism on multi-core clusters)
• Acknowledgments
21 log
gIO Nê ú£ £ ê úë û