![Page 1: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/1.jpg)
The basis and perspectives of an exascale algorithm: our ExaFMM project.
Lorena A Barba, Boston University
IMA Annual Program Year Workshop: High Performance Computing and Emerging ArchitecturesMinneapolis, January 10-14, 2011
![Page 2: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/2.jpg)
Acknowledgements:
work in Barba’s group done in collaboration with Jaydeep Bardhan (Rush), Mathew Knepley (UChicago), Tsuyoshi Hamada (Nagasaki Advanced Computing Center),Rio Yokota (postdoc at BU) and graduate students Felipe Cruz, Christopher Cooper, Anush Krishnan, Simon Layton
![Page 3: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/3.jpg)
Acknowledgements:
work in Barba’s group done in collaboration with Jaydeep Bardhan (Rush), Mathew Knepley (UChicago), Tsuyoshi Hamada (Nagasaki Advanced Computing Center),Rio Yokota (postdoc at BU) and graduate students Felipe Cruz, Christopher Cooper, Anush Krishnan, Simon Layton
![Page 4: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/4.jpg)
in Nagasaki Advanced Computing Center
![Page 5: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/5.jpg)
“... the fundamental law of computer science [is]: the faster the computer, the greater the importance of speed of algorithms”
Trefethen & Bau “Numerical Linear Algebra” SIAM
![Page 6: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/6.jpg)
100
101
102
103
104
105
106
102
104
106
108
1010
1012
1014
1016
1018
O(N3)
O(N2)
The curious story of conjugate gradient (CG) algorithms
‣ Iterative methods:
‣ sequence of iterates converging to the solution
‣ CG matrix iterations bring the O(N3) cost to O(N2)
‣ 1950s — N too small for CG to be competitive
‣ 1970s — renewed attention
Gauss ian e
l imin
at ion
CG i te ra t i ve methods
![Page 7: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/7.jpg)
100
101
102
103
104
105
106
102
104
106
108
1010
1012
1014
1016
1018
O(N3)
O(N2)
100
101
102
103
104
105
106
102
104
106
108
1010
1012
1014
1016
1018
O(N2)
O(N3)
The curious story of conjugate gradient (CG) algorithms
‣ Iterative methods:
‣ sequence of iterates converging to the solution
‣ CG matrix iterations bring the O(N3) cost to O(N2)
‣ 1950s — N too small for CG to be competitive
‣ 1970s — renewed attention
Gauss ian e
l imin
at ion
CG i te ra t i ve methods
![Page 8: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/8.jpg)
‣ 1946 — The Monte Carlo method.
‣ 1947 — Simplex Method for Linear Programming.
‣ 1950 — Krylov Subspace Iteration Method.
‣ 1951 — The Decompositional Approach to Matrix Computations.
‣ 1957 — The Fortran Compiler.
‣ 1959 — QR Algorithm for Computing Eigenvalues.
‣ 1962 — Quicksort Algorithms for Sorting.
‣ 1965 — Fast Fourier Transform.
‣ 1977 — Integer Relation Detection.
‣ 1987 — Fast Multipole Method Dongarra& Sullivan, IEEE Comput. Sci. Eng.,Vol. 2(1):22-- 23 ( 2000)
![Page 9: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/9.jpg)
‣ Solves N-body problems
๏ e.g. astrophysical gravity interactions
๏ reduces operation count from O(N2) to O(N)
Fast multipole method
f(y) =N∑
i=1
ciK(y − xi) y ∈ [1...N ]
![Page 10: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/10.jpg)
‣ Solves N-body problems
๏ e.g. astrophysical gravity interactions
๏ reduces operation count from O(N2) to O(N)
Fast multipole method
f(y) =N∑
i=1
ciK(y − xi) y ∈ [1...N ]
![Page 11: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/11.jpg)
‣ Solves N-body problems
๏ e.g. astrophysical gravity interactions
๏ reduces operation count from O(N2) to O(N)
Fast multipole method
f(y) =N∑
i=1
ciK(y − xi) y ∈ [1...N ]
![Page 12: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/12.jpg)
O(N) advantage
‣ Hierarchical methods:
‣ sequence of refinements converging (or contributing) to the solution
‣ FMM brings the O(N2) cost to O(N)
‣ 1990s — MD codes dropped FMM, as N too small to be competitive
‣ Now — renewed attention100
101
102
103
104
105
106
102
104
106
108
1010
1012
O(N2)
O(N)
![Page 13: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/13.jpg)
‣ space subdivision tree structure
‣ to find “near” and “far” bodies
![Page 14: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/14.jpg)
‣ space subdivision tree structure
‣ to find “near” and “far” bodies
![Page 15: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/15.jpg)
Flow of FMM calculation
M2Mmultipole to multipole
treecode & FMM
M2Lmultipole to local
FMM
L2Llocal to local
FMM
L2Plocal to particle
FMM
P2Pparticle to particle
treecode & FMM
M2Pmultipole to particle
treecode
source particlestarget particles
information moves from red to blue
P2Mparticle to multipole
treecode & FMM
![Page 16: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/16.jpg)
‣ The whole algorithm in a sketch
Downward SweepUpward Sweep
Create Multipole Expansions. Evaluate Local Expansions.
P2M M2M M2L L2L L2P
![Page 17: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/17.jpg)
‣ Contributions from Barba group:
![Page 18: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/18.jpg)
‣ Parallelization strategy:
M2M and L2L translations M2L transformation Local domain
Level k
Root tree
Sub-tree 1 Sub-tree 2 Sub-tree 3 Sub-tree 4 Sub-tree 5 Sub-tree 6 Sub-tree 7 Sub-tree 8
![Page 19: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/19.jpg)
‣ Graph representation:
cijwi
wj
Ref. — F. A Cruz, M. G. Knepley, L. A. Barba, PetFMM—A dynamically load-balancing parallel fast multipole library, Int. J. Num. Meth. Eng., Vol. 85(4): 403–428 (Jan. 2011)
![Page 20: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/20.jpg)
The algorithmic and hardware speed-ups properly multiply
GPU implementation of FMM kernels
![Page 21: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/21.jpg)
GPU Gems, Volume IV
In press, to appear February 2011 (?)
Codes in http://code.google.com/p/gemsfmm/
![Page 22: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/22.jpg)
GPU Gems, Volume III
![Page 23: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/23.jpg)
FMM on GPU
103
104
105
106
107
10!3
10!2
10!1
100
101
102
103
104
105
N
tim
e [
s]
!
Direct (CPU)
Direct (GPU)
FMM (CPU)
FMM (GPU)
“Treecode and fast multipole method for N-body simulation with CUDA”, chapter in GPU Gems IV, in press
![Page 24: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/24.jpg)
FMM on GPU
103
104
105
106
107
10!3
10!2
10!1
100
101
102
103
104
105
N
tim
e [
s]
!
Direct (CPU)
Direct (GPU)
FMM (CPU)
FMM (GPU)
200x
“Treecode and fast multipole method for N-body simulation with CUDA”, chapter in GPU Gems IV, in press
![Page 25: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/25.jpg)
FMM on GPU
103
104
105
106
107
10!3
10!2
10!1
100
101
102
103
104
105
N
tim
e [
s]
!
Direct (CPU)
Direct (GPU)
FMM (CPU)
FMM (GPU)
40x
“Treecode and fast multipole method for N-body simulation with CUDA”, chapter in GPU Gems IV, in press
![Page 26: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/26.jpg)
FMM on GPU
103
104
105
106
107
10!3
10!2
10!1
100
101
102
103
104
105
N
tim
e [
s]
!
Direct (CPU)
Direct (GPU)
FMM (CPU)
FMM (GPU)
40x
“Treecode and fast multipole method for N-body simulation with CUDA”, chapter in GPU Gems IV, in press
![Page 27: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/27.jpg)
‣ the right methods and algorithms can provide leaps in capability many times that of Moore’s law would in a given period
‣ open source & open data enables tackling large, complex computational projects
![Page 28: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/28.jpg)
‣ the right methods and algorithms can provide leaps in capability many times that of Moore’s law would in a given period
‣ new hardware for HPC adds to the mix for a new era of discovery via computation
‣ open source & open data enables tackling large, complex computational projects
![Page 29: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/29.jpg)
Parallel FMM on multi-GPUs
Strong Scaling:
parallel efficiency of 80% at 256, and 50% at 512 nodes
N=108
p=10
Degima cluster at NACC, with Infiniband comm
! " # $ !% &" %# !"$ "'% '!"(
'(
!((
!'(
"((
"'(
&((
&'(
#((
)*+,-.
/012343)*+,-.35.6
3
3
/+223-,7./+8-/0,71*0.279*"*1*0.2791":;";<2+72:;"=<2+72:="=<2+72:="><2+72:>"><2+72:>";<2+72:
![Page 30: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/30.jpg)
GPU breakdown
‣ N=108, on one node
!
"!
#!!
#"!
$!!
%&'()*+,-./
,
,
(0++,%12.(0'%()12*&).+23&$&*&).+23*$45$56+02+45$76+02+47$76+02+47$86+02+48$86+02+48$56+02+4
!
"!
#!!
#"!
$!!
,
,
(0++,%12.(0'%()12*&).+23&$&*&).+23*$4%9'26)2:,(;.6<'==+0)2:,3;(;%'3;>+(?+@)%+%'3;7;441%%'3;7+*%&A%'3;B+02+4
![Page 31: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/31.jpg)
Under revision for Comput. Phys. Comm.
See also http://barbagroup.bu.edu/
![Page 32: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/32.jpg)
Suitability of the FMM for achieving exascale
FMM is a particularly favorable algorithm for the emerging heterogeneous, many-core architectural landscape.
![Page 33: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/33.jpg)
Spatial and temporal locality
![Page 34: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/34.jpg)
Spatial and temporal locality
‣ Algorithm has intrinsic geometric locality
![Page 35: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/35.jpg)
Spatial and temporal locality
‣ Algorithm has intrinsic geometric locality
![Page 36: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/36.jpg)
Spatial and temporal locality
‣ Algorithm has intrinsic geometric locality
‣ Acces patterns could be non-local
![Page 37: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/37.jpg)
Spatial and temporal locality
‣ Algorithm has intrinsic geometric locality
‣ Acces patterns could be non-local
๏ work with sorted particle indices, access via a start-offset combination
![Page 38: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/38.jpg)
Spatial and temporal locality
‣ Algorithm has intrinsic geometric locality
‣ Acces patterns could be non-local
๏ work with sorted particle indices, access via a start-offset combination
‣ Temporal locality:
![Page 39: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/39.jpg)
Spatial and temporal locality
‣ Algorithm has intrinsic geometric locality
‣ Acces patterns could be non-local
๏ work with sorted particle indices, access via a start-offset combination
‣ Temporal locality:
๏ queue GPU tasks before execution, buffer the input and output of data making memory access contiguous
![Page 40: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/40.jpg)
Spatial and temporal locality
‣ Algorithm has intrinsic geometric locality
‣ Acces patterns could be non-local
๏ work with sorted particle indices, access via a start-offset combination
‣ Temporal locality:
๏ queue GPU tasks before execution, buffer the input and output of data making memory access contiguous
➡ The FMM is not a locallity-sensitive application
![Page 41: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/41.jpg)
Global data comunications and synchronization
‣ Two most time-consuming in the FMM:
๏ p2p — purely local
๏ m2l — exhibits “hierarchical synchronization”
P2P at the leaf level
L2P evaluation
M2M
M2L
L2L
Upward sweep Downward sweep
P2M
![Page 42: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/42.jpg)
Load balancing
‣ FMM load-balanced
๏ space-filling curves: Morton, Hilbert
๏ work-only (no comm)
‣ PetFMM:
๏ graph-partitioning
๏ will it scale?
๏ hierarchical partition?
![Page 43: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/43.jpg)
plan for an “ExaFMM”1) our present FMM technology is state-of-the-art;2) we possess the potential for a substantial performance hike
AND all our codes are always open!
![Page 44: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/44.jpg)
Present FMM state-of-the-art
103
104
105
106
10−3
10−2
10−1
100
101
102
103
O(N)
N
tim
e [s
]
published KIFMM codeSalmon&Warren treecodeour published FMM code
Single-node performance:
timings of published kifmm code (2006) , S&W treecode (2000) and our code
‣equal performance
‣same accuracy, measured L2-norm error 10-3
Single CPU core, Intel Core i7 2.67 (no SSE)
![Page 45: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/45.jpg)
New experimental FMM with higher performance
103
104
105
106
107
10−3
10−2
10−1
100
101
102
103
O(N)
N
tim
e [s
]
Our published FMMWith new optimizationsWith algorithmic improvements
Optimized code:
explicit inline assembly within the p2p kernel, implementing SIMD
‣ 5x speed-up, single precision
Algorithmic improvements:
i)hybridize FMM with treecode
ii)dynamic error-control
![Page 46: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/46.jpg)
‣ other recent work
In IEEE International Symposium on Parallel Distributed Processing (IPDPS), IEEE, pp. 1–12 (Atlanta, GA; April 2010)
![Page 47: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/47.jpg)
‣ PetFMM — open library, dynamic load balancing, comm minimizing
๏ open question: will strategy scale to 1000s procs? hierarchical partition?
‣ Performance on single node:
๏ matching other s.o.t.a. codes
‣ Algorithmic innovations:
๏ hybrid treecode/FMM
๏ variable order/variable box-opening for minimum work to achieve target accuracy
Summary so far ...
![Page 48: The basis and perspectives of an exascale algorithm: our](https://reader030.vdocuments.net/reader030/viewer/2022012915/61c642130a020c448b089961/html5/thumbnails/48.jpg)
But there is more ...
‣ Fault-tolerance:
๏ traditional checkpointing no longer adequate by itself
‣ instead: replicate threads, correctness checks on-the-fly
๏ FMM allows natural correctness checks at the time of selecting p
‣ Autotuning the FMM:
๏ natural: use tests/work estimats to select particles per box, p, and box-
opening parameters.
๏ parameter selection for load-balancing