TRANSCRIPT
Optimization and Parallelization of the Boundary Element Method for the Wave Equation
in Time Domain
Berenger Bramas
Inria, Bordeaux - Sud-Ouest
PhD defense - Feb. 15th 2016
Advisor: Olivier Coulaud (Inria)
Industrial co-advisor: Guillaume Sylvand (Airbus)
Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives
(2)
Context
Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas
(3)
Wave Equation Problems
• Study the wave propagation in acoustics or electromagnetism
• Critical in several industrial fields (design, robustness study)
In our case: antenna placement, electromagnetic compatibility, stealth, lightning, ...
Image from Airbus Group.
(4)
Wave Equation Simulations
Boundary element method (BEM): integral equation over a discretized mesh
Interest of BEM compared to other approaches
• Better accuracy
• Surface mesh (easier to produce)
Disadvantages of BEM
• Dense matrices (specific solvers)
(5)
BEM
Boundary Element Method (BEM) for the Wave Equation
In Time Domain (TD):
• One solve = a range of frequencies (Fourier Transform)
• Dense BEM/Matrix Approach
• Accelerated by FMM (Fast-BEM)
• Less studied and used
In Frequency Domain (FD):
• One solve = one frequency
• Dense BEM
• Accelerated by H-Matrix
• Accelerated by FMM
• Widely used (academia and industry)
Advantages/disadvantages depend on the application/configuration
(6)
Industrial Context
• In partnership with Airbus Group Innovation (financed jointly with Region Aquitaine)
• Airbus solvers:
  • FD-BEM
    • Accelerated by FMM or H-Matrix techniques
  • TD-BEM (experimental)
    • No stability problem (formulation based on a full Galerkin discretization, unconditionally stable, from [Terrasse, 1993])
    • With FMM [Ergin et al., 2000] (trial)
Objective:
• Reduce the performance gap between FD and TD approaches
(7)
HPC
Supercomputers are mandatory to solve large problems
• Shared/Distributed memory
• Heterogeneous (one or more GPU per node)
[Figure: a cluster of nodes, each node drawn with two CPUs and one GPU]
Some of the challenges
• Efficient computational algorithm/kernel
• Parallelization
• Balancing
• Hardware abstraction, portable implementation, long-term development, ...
(8)
Outline
• Problem Formulation
• BEM Solver (Matrix Approach)
• Fast-Multipole Method Approach
  • FMM Algorithm & Parallelization
  • FMM BEM Solver (Experimental Implementation)
• Conclusion & Perspectives
(9)
TD-BEM Application Stages
User inputs, simulation parameters
↓
Mesh generator, configuration
↓
Solver
↓
Post-processing (TD → FD)
TD-BEM Formulation (10)
Linear Formulation
Notations:
• ∂Ω discretized in N unknowns/degrees of freedom
• Mk: the convolution matrices (dimension N × N) - input
• ln: the incident wave emitted by a source on the unknowns at time step n - input
• an: the state of the system at time step n - to compute
Convolution system:
$$M_0 \cdot a_n + \sum_{k=1}^{K_{max}} M_k \cdot a_{n-k} = l_n \qquad (1)$$
[Figure: the domain Ω and the incident wave ln]
Solve at each time step:
$$a_n = (M_0)^{-1} \left( l_n - \sum_{k=1}^{K_{max}} M_k \cdot a_{n-k} \right) \qquad (2)$$
TD-BEM Formulation (11)
Interaction/Convolution Matrices (Mk)
• Interactions between unknowns
• Symmetric and sparse, Mk(i, j) ≠ 0 if distance(i, j) ≈ k·c·∆t
• Pre-computed (external tool)
[Figure: the matrices M0, M1, M2, ..., MKmax]
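The sparsity rule on this slide can be illustrated with a small Python sketch. It lists the time-step indices k at which a pair of unknowns (i, j) can interact. The helper name `active_steps`, the point-like unknowns, and the support width `p` are assumptions made for the illustration; the actual matrices are produced by the external pre-computation tool.

```python
import math

def active_steps(pos_i, pos_j, c, dt, p=1):
    """Time-step indices k for which Mk(i, j) may be nonzero.

    Assumption (not from the slides): point-like unknowns, and an
    interaction that starts at ks = floor(d / (c * dt)) and lasts p more steps.
    """
    d = math.dist(pos_i, pos_j)          # Euclidean distance between i and j
    ks = int(d // (c * dt))              # first step at which the wave arrives
    return list(range(ks, ks + p + 1))

# Two unknowns 5 length units apart, wave speed c = 1, time step dt = 0.5.
steps = active_steps((0.0, 0.0, 0.0), (3.0, 4.0, 0.0), c=1.0, dt=0.5)
print(steps)
```

All other matrices Mk hold a zero at position (i, j), which is why each Mk is sparse even though the sum over k couples every pair of unknowns.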
TD-BEM Formulation (12)
Solve (Schematic View)
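$$a_n = (M_0)^{-1} \left( l_n - \sum_{k=1}^{K_{max}} M_k \cdot a_{n-k} \right) \qquad (3)$$
[Schematic, drawn for Kmax = 5:
 sn = M1 · an-1 + M2 · an-2 + M3 · an-3 + M4 · an-4 + M5 · an-5 (summation)
 s̃n = ln − sn
 an obtained from M0 and s̃n by the linear solver]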
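The time-stepping loop behind equation (3) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: dense lists instead of sparse storage, states before the first step taken as zero, and a small Gaussian elimination (`solve`) standing in for the external linear solver. The function names are hypothetical.

```python
def matvec(M, x):
    """Dense matrix/vector product."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

def solve(A, b):
    """Gaussian elimination with partial pivoting (small dense systems only)."""
    n = len(b)
    aug = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x

def td_bem_solve(M, l):
    """M = [M0, M1, ..., MKmax], l = [l_0, ..., l_{T-1}]; returns all a_n."""
    n_unknowns = len(l[0])
    kmax = len(M) - 1
    a = []
    for n in range(len(l)):
        s = [0.0] * n_unknowns                       # summation stage
        for k in range(1, kmax + 1):
            if n - k >= 0:                           # assumption: a_m = 0 for m < 0
                s = [si + yi for si, yi in zip(s, matvec(M[k], a[n - k]))]
        rhs = [li - si for li, si in zip(l[n], s)]   # s~_n = l_n - s_n
        a.append(solve(M[0], rhs))                   # M0 . a_n = s~_n
    return a

M0 = [[2.0, 0.0], [0.0, 4.0]]
M1 = [[0.0, 1.0], [1.0, 0.0]]
l = [[2.0, 4.0], [2.0, 4.0], [0.0, 0.0]]
a = td_bem_solve([M0, M1], l)
print(a)
```

Each iteration performs Kmax matrix/vector products and one solve with the same matrix M0, which is why the summation stage dominates when Kmax is large.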
TD-BEM Formulation (13)
SpMV (sparse matrix/vector product)
Summation stage → Kmax SpMVs
• Permutation, advanced storages/kernels, blocking [White III and Sadayappan, 1997, Pinar and Heath, 1999, Pichel et al., 2005, Vuduc and Moon, 2005]
• Auto-tuning [Im and Yelick, 2001, Vuduc et al., 2005]
Low Flop-rate:
• Memory-bound operation
• Flop/Word hardware limit
• Irregular/non-contiguous memory accesses
• Instruction (pipelining, vectorization)
• Not appropriate for GPUs [Garland, 2008, Baskaran and Bordawekar, 2008, Bell and Garland, 2009]
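To make the memory-access argument concrete, here is a minimal SpMV over the CSR (also called CRS) storage, written in plain Python. It is a generic textbook kernel, not the MKL or cuSparse implementation; the array names `values`, `col_idx` and `row_ptr` are the usual CSR conventions and are chosen here for illustration.

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """y = A . x for a matrix A stored in CSR format."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        acc = 0.0
        for p in range(row_ptr[i], row_ptr[i + 1]):
            # Two flops per nonzero, but three reads: the value, its column
            # index, and an entry of x reached through that index.
            acc += values[p] * x[col_idx[p]]
        y[i] = acc
    return y

# A = [[4, 0, 1],
#      [0, 3, 0],
#      [2, 0, 5]]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
y = csr_spmv(values, col_idx, row_ptr, [1.0, 2.0, 3.0])
print(y)
```

The indirect read `x[col_idx[p]]` is the irregular access mentioned on the slide: consecutive nonzeros can touch distant entries of `x`, which limits vectorization and cache reuse.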
TD-BEM Formulation (14)
SpMV (Performance)
[Figure: SpMV performance in GFlop/s (axis from 0 to 30) for five test matrices: Dense 5000, Diagonal 15/500000, Random 4/20000, Block Random (5/80000), Dense 200 (x10000). Series: COO MKL, CRS MKL, DIA MKL, BCSR MKL, CRS cuSparse, BCSR cuSparse.]
SpMVs MKL/cuSparse (double precision)
Peak performance: one core of a Haswell Intel Xeon E5-2680 CPU at 2.50 GHz, 20 GFlop/s; K40-M GPU, 1.43 TFlop/s.
TD-BEM Formulation (15)
TD-BEM Application Stages
User inputs, simulation parameters
↓
Mesh generator, configuration, interaction matrices pre-computation
↓
Solver
· Summation stage
· M0 Linear Solver (external tool)
↓
Post-processing (TD → FD)
Improving the Summation (17)
Computational Ordering
[Figure: the summation drawn as a block of work (past states A0 ... A5 paired with matrices M6 ... M1, result S6), shown three times with a different outer loop: Front (k) / SpMV, Top (i), Side (j)]
$$s_n(i) = \sum_{k=1}^{K_{max}} \sum_{j=1}^{N} M_k(i,j) \times a_{n-k}(j), \quad 1 \le i \le N \qquad (4)$$
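The three orderings of equation (4) can be compared with a short Python sketch. It uses dense nested lists and hypothetical names (`summation`, `order`); `M[k]` and `a_past[k]` stand for M(k+1) and a(n-k-1) because Python lists are 0-based. The only point is that the outer loop can be k, i or j without changing the result.

```python
def summation(M, a_past, order):
    """s(i) = sum over k and j of M[k][i][j] * a_past[k][j]."""
    K, N = len(M), len(M[0])
    s = [0.0] * N
    if order == "front":      # outer loop on k: one SpMV per matrix Mk
        loops = ((k, i, j) for k in range(K) for i in range(N) for j in range(N))
    elif order == "top":      # outer loop on i: one result entry at a time
        loops = ((k, i, j) for i in range(N) for k in range(K) for j in range(N))
    elif order == "side":     # outer loop on j: one slice at a time
        loops = ((k, i, j) for j in range(N) for i in range(N) for k in range(K))
    else:
        raise ValueError(order)
    for k, i, j in loops:
        s[i] += M[k][i][j] * a_past[k][j]
    return s

M1 = [[1.0, 2.0], [3.0, 4.0]]
M2 = [[0.0, 1.0], [1.0, 0.0]]
a_past = [[1.0, 1.0], [2.0, 3.0]]      # a_{n-1}, a_{n-2}
results = {o: summation([M1, M2], a_past, o) for o in ("front", "top", "side")}
print(results)
```

With the "side" ordering, the innermost loop runs over k for a fixed pair (i, j), so it reads values that belong to the same unknown j at consecutive time steps.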
Improving the Summation (18)
Structure of a Slice Matrix
A Slice j:
• When outer loop index is j
• The concatenation of column j of the interaction matrices Mk (except M0)
• Size (N × (Kmax − 1))
• There is one dense vector per row
• Slice j(i, k) = Mk(i, j) ≠ 0 with ks = d(i, j)/(c∆t) and ks ≤ k ≤ ks + p
[Figure: Slice j built from the columns M1(*,j), M2(*,j), ..., M12(*,j)]
Improving the Summation (19)
Computing with a Slice Matrix
[Figure: Slice j (columns M1(*,j) ... M12(*,j)) multiplied by a*<n(j), the past values of unknown j, to contribute to sn]
Computation with N vector/vector products (one per row):
• Regular memory access (vectorization, pipelining)
• Low Flop/word ratio (same as SpMV)
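A minimal Python sketch of the slice-based summation is given here. It assumes dense lists, so each slice row stores every k, whereas the slides describe one short dense vector per row. `build_slice` and `summation_by_slices` are hypothetical names, and `M` holds M1 ... MKmax without M0.

```python
def build_slice(M, j):
    """Slice j: row i holds [M1(i,j), M2(i,j), ..., MKmax(i,j)]."""
    N = len(M[0])
    return [[Mk[i][j] for Mk in M] for i in range(N)]

def summation_by_slices(M, a_past):
    """s_n computed slice by slice; a_past = [a_{n-1}, a_{n-2}, ...]."""
    N = len(M[0])
    s = [0.0] * N
    for j in range(N):
        slice_j = build_slice(M, j)              # would be built once and stored
        past_j = [a_k[j] for a_k in a_past]      # a_{n-1}(j), a_{n-2}(j), ...
        for i in range(N):
            # One vector/vector product per row, on contiguous data.
            s[i] += sum(m * a for m, a in zip(slice_j[i], past_j))
    return s

M1 = [[1.0, 2.0], [3.0, 4.0]]
M2 = [[0.0, 1.0], [1.0, 0.0]]
a_past = [[1.0, 1.0], [2.0, 3.0]]
slice0 = build_slice([M1, M2], 0)
s = summation_by_slices([M1, M2], a_past)
print(slice0, s)
```

Both operands of the inner product are contiguous, which gives the regular memory access listed above, but each matrix value is still used only once per time step, hence the unchanged Flop/word ratio.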
Improving the Summation (20)
Improving the Flop/Word Ratio
[Figure, built up over several steps: a row of Slice j (M1(*,j) ... M9(*,j)) is multiplied with the state of unknown j, an-1 ... an-9, to contribute to sn. The same row is then reused, shifted by one time step each time, to contribute to sn+1 and sn+2. The values an, an+1, an+2 are not known yet (shown as "?") and are replaced by 0. Here the group size is ng = 3.]
Multi-Vectors/Vector Product (21)
Flop/Word Ratio
Vector length v = 4, group size ng = 4 (v × ng × 2 Flops):
[Figure: the vector (a b c d) combined with (0 1 2 3) gives one result r (vector/vector product); combined with the rows (0 1 2 3), (1 2 3 4), (2 3 4 5), (3 4 5 6) it gives four results (vector/matrix product); the same four results are obtained from the single vector (0 1 2 3 4 5 6) read with a sliding window (multi-vectors/vector product)]
• Vector/vector products (≈ SpMV): ng (2v + 1)
• Vector/matrix product: v + ng (v + 1)
• Multi-vectors/vector product: (v + ng − 1) + (v) + (ng)
[Plot: Flop/Word ratio (F/W) as a function of v, for ng = 8]
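The three expressions can be evaluated for the slide's example (v = 4, ng = 4). The Python sketch below reads them as numbers of words moved and divides the 2 × v × ng flops by each; the function names are chosen for the illustration.

```python
def flops(v, ng):
    return 2 * v * ng                     # one multiply and one add per pair

def words_vector_vector(v, ng):
    return ng * (2 * v + 1)               # ng independent dot products

def words_vector_matrix(v, ng):
    return v + ng * (v + 1)               # the vector (a b c d) is read once

def words_multi_vectors(v, ng):
    return (v + ng - 1) + v + ng          # one long vector, one short, ng results

v, ng = 4, 4
ratios = {
    "vector/vector": flops(v, ng) / words_vector_vector(v, ng),
    "vector/matrix": flops(v, ng) / words_vector_matrix(v, ng),
    "multi-vectors/vector": flops(v, ng) / words_multi_vectors(v, ng),
}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} flop/word")
```

For this example the word counts are 36, 24 and 15 for the same 32 flops, so the multi-vectors/vector product more than doubles the ratio of the plain vector/vector products.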
Multi-Vectors/Vector Product (22)
Multi-vectors/vector Product (CPU)
[Two plots: speed (GFlop/s) versus length of vectors (v), one for Nr = 1024 and one for Nr = 20 480. Series: AVX-Asm, AVX-Intrinsic, AVX-Template, SSE-Intrinsic, Compiler Version.]
Plots show the GFlop/s with ng = 8 for test cases of dimension Nr × v (in double precision).
Haswell Intel Xeon E5-2680 at 2.50 GHz (20 GFlop/s)
Multi-Vectors/Vector Product on GPUs (23)
GPUs Slice Storages
• Blocking scheme (small conversion overhead)
• Data access appropriate for SIMT/SIMD
• Memory accesses (coalesced, low bank conflicts)
• Data re-use (shared memory)
• CPU/GPU Balancing
[Figure, panels (a), (b), (c): Slice j with a*<n(j), and the blocked storage of Slice j]
Parallelization (24)
Parallelization
Sequential algorithm:
[Schematic, drawn for Kmax = 5: sn = M1 · an-1 + ... + M5 · an-5 (summation); s̃n = ln − sn; an obtained from M0 and s̃n by the linear solver]
Parallelization (25)
Parallel Solver (Schematic View)
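[Schematic, drawn for Kmax = 5 and three processes P0, P1, P2: each process holds a part of M1 ... M5 and of the past states an-1 ... an-5 and computes a partial sn; the partial results are added (+) into sn; then s̃n = ln − sn and the parallel linear solver computes an from M0 and s̃n]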
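The structure of this parallel summation can be sketched in Python: partial sums over disjoint sets of columns j, followed by a reduction. This sketch assumes that the work is split by columns (slices) and uses a thread pool only to show the structure; Python threads give no real speedup here, and the slides' processes would rather be separate ranks with a reduction between them. All names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(M, a_past, cols):
    """Contribution of the columns in `cols` to s_n (M = [M1, ..., MKmax])."""
    N = len(M[0])
    s = [0.0] * N
    for j in cols:
        for i in range(N):
            for k in range(len(M)):
                s[i] += M[k][i][j] * a_past[k][j]
    return s

def parallel_summation(M, a_past, n_workers=3):
    N = len(M[0])
    # Worker w owns the columns w, w + n_workers, w + 2 * n_workers, ...
    chunks = [range(w, N, n_workers) for w in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(lambda cols: partial_sum(M, a_past, cols), chunks))
    # Reduction: add the partial vectors entry by entry.
    return [sum(vals) for vals in zip(*partials)]

M1 = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]
M2 = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
a_past = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]     # a_{n-1}, a_{n-2}
s = parallel_summation([M1, M2], a_past)
print(s)
```

Each worker needs only its own columns of the matrices and the matching entries of the past states, so the reduction of the partial vectors is the only exchange in the summation stage of this sketch.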
Parallelization (26)
Parallel Solver with ng > 1 (Schematic View)
[Schematic, drawn for ng = 3, Kmax = 5 and three processes P0, P1, P2. Outer loop, T/ng iterations: each process computes partial sums for sn, sn+1, sn+2 at once from M1 ... M5 and an-1 ... an-5, and the partial results are added (+). Inner loop, ng iterations on f: s̃n+f = ln+f − sn+f; the parallel linear solver computes an+f from M0 and s̃n+f; a "Radiation" step uses M1, M2 and an+f to update sn+f+1 and sn+f+2.]
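The grouped scheme can be checked against the plain time stepping with a small Python sketch. It is only an arithmetic check under stated assumptions: dense lists, a diagonal M0 so that the solve is a division (the slides use a parallel linear solver), zero states before the first step, and hypothetical function names. The grouped summation is written with separate products here, whereas its purpose in the slides is to read each slice once for ng results.

```python
def matvec(M, x):
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

def solve_diag(M0, rhs):
    # Assumption for brevity: M0 is diagonal.
    return [r / M0[i][i] for i, r in enumerate(rhs)]

def step_by_step(M, l):
    a = []
    for n in range(len(l)):
        s = [0.0] * len(l[0])
        for k in range(1, len(M)):
            if n - k >= 0:
                s = [x + y for x, y in zip(s, matvec(M[k], a[n - k]))]
        a.append(solve_diag(M[0], [li - si for li, si in zip(l[n], s)]))
    return a

def grouped(M, l, ng):
    N, T, kmax = len(l[0]), len(l), len(M) - 1
    a = []
    for n0 in range(0, T, ng):                        # T/ng outer loops
        size = min(ng, T - n0)
        # Grouped summation: only states with index < n0 are known, so the
        # states of the current group count as 0 for now.
        s = [[0.0] * N for _ in range(size)]
        for f in range(size):
            for k in range(f + 1, kmax + 1):
                m = n0 + f - k
                if m >= 0:
                    s[f] = [x + y for x, y in zip(s[f], matvec(M[k], a[m]))]
        for f in range(size):                         # ng inner loops
            rhs = [li - si for li, si in zip(l[n0 + f], s[f])]
            a.append(solve_diag(M[0], rhs))
            # Radiation: add the new state to the pending sums of the group.
            for g in range(f + 1, size):
                k = g - f
                if k <= kmax:
                    s[g] = [x + y for x, y in zip(s[g], matvec(M[k], a[-1]))]
    return a

M = [
    [[2.0, 0.0], [0.0, 2.0]],      # M0 (diagonal by assumption)
    [[0.0, 1.0], [1.0, 0.0]],      # M1
    [[1.0, 0.0], [0.0, 1.0]],      # M2
    [[0.5, 0.0], [0.0, 0.5]],      # M3
]
l = [[2.0, 4.0] for _ in range(7)]
reference = step_by_step(M, l)
result = grouped(M, l, ng=3)
print(max(abs(x - y) for r, q in zip(reference, result) for x, y in zip(r, q)))
```

The radiation step only involves M1 ... M(ng−1), which matches the M1 and M2 drawn on the slide for ng = 3.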
Results (27)
Airplane Simulation
• Acoustics
• N = 23 962
• 10 823 time iterations
• Kmax = 341 interaction matrices Mk
• ng = 8
• 70GB of data
• double precision
• Homogeneous node: 24 CPU cores (128GB memory)
• Heterogeneous node: 24 CPU cores (128GB memory) and 4 K40M GPUs (12GB memory each)
Results (28)
Parallel Efficiency/Percentage (Homogeneous)
[Plot, left: parallel efficiency (0 to 1) versus number of nodes (1 to 20), Full-MPI]
[Plot, right: percentage of the time (0 to 100 %) versus number of nodes (1 to 20), split into Summation, Direct solver M0, and Idle]
Summation stage M0 Solve →
Results (29)
With GPUs
[Figure: Execution time. Time in seconds (ticks at 300, 1,000, 3,000 and 20,000) versus number of nodes (1 to 5). Series: CPU-Only, 1GPU, 2GPU, 3GPU, 4GPU.]
[Figure: Speedup against CPU-Only. Speedup (ticks at 1.0, 11.0 and 21.0) versus number of nodes (1 to 5). Series: 1GPU, 2GPU, 3GPU, 4GPU.]
Problem ≈ 70GB; GPU memory: 12GB
Summary (30)
Summary:
• New computational ordering [Bramas et al., 2014]
• Solver with few communication points
Additional contributions:
• Permutations/SpMV
• Efficient SIMD kernel CPU
• Efficient blocking scheme/kernel for GPU [Bramas et al., 2015]
• Dynamic balancing (CPU/GPU)
Limits:
• M0 Linear solver
• GPUs’ memory
• Interaction matrices construction
• Complexity → O(N²) for each iteration
(31)
Outline
• Problem Formulation
• BEM Solver (Matrix Approach)
• Fast-Multipole Method Approach
- FMM Algorithm & Parallelization
- FMM BEM Solver (Experimental Implementation)
• Conclusion & Perspectives
FMM Algorithm (32)
FMM Operators (1D)
• Spatial decomposition → Potential decomposition
$f_i = f_i^{near} + f_i^{far}$
• Near field by direct interactions (leaves)
• Far field with FMM operators (tree)
[Figure: 1D tree with levels l = 0 to l = 3, showing the operators P2P, M2M, M2L and L2L]
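The near/far split on a 1D tree can be sketched as follows. This is a minimal illustration, not the ScalFMM implementation: the function names `near_cells`, `m2l_list` and `far_field_cover` are hypothetical, and the sketch assumes the usual convention that two cells of the same level are near when they are adjacent, while the M2L list of a cell holds the children of its parent's neighbours that are not adjacent to the cell itself.

```python
def near_cells(level, index):
    """Cells adjacent to `index` (itself included) at a level with 2**level cells."""
    n_cells = 2 ** level
    return [i for i in (index - 1, index, index + 1) if 0 <= i < n_cells]


def m2l_list(level, index):
    """Well-separated cells whose multipole is translated to `index` by M2L."""
    if level == 0:
        return []  # the root has no parent, hence no far field
    parent = index // 2
    candidates = []
    for p in near_cells(level - 1, parent):
        candidates.extend((2 * p, 2 * p + 1))  # children of the parent's neighbours
    near = set(near_cells(level, index))
    return [c for c in candidates if c not in near]


def far_field_cover(height, leaf):
    """Leaves covered by the M2L lists of `leaf` and of all its ancestors."""
    covered = set()
    index = leaf
    for level in range(height, 0, -1):
        width = 2 ** (height - level)  # number of leaves under one cell of this level
        for c in m2l_list(level, index):
            covered.update(range(c * width, (c + 1) * width))
        index //= 2  # move up to the parent cell
    return covered


height = 3  # leaves at level l = 3, as in the figure
leaf = 5
near = set(near_cells(height, leaf))
far = far_field_cover(height, leaf)
print(sorted(near), sorted(far))
```

For every leaf, the near cells (handled by P2P) and the leaves reached through the M2L lists of its ancestors (handled by the tree operators) are disjoint and together cover all leaves, which is the decomposition of f_i into near and far parts.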
FMM Parallelization (33)
Related work:
• Multicore study [Chandramowlishwaran et al., 2010]
• NVidia GPU [Yokota and Barba, 2011]
• Distributed GPU [Hamada et al., 2009]
• Distributed CPU/GPU [Hu et al., 2011, Lashuk et al., 2012, Malhotra and Biros, 2015]
• Using a runtime system (multicore) [Ltaief and Yokota, 2014]
FMM Parallelization (34)
Paradigms
• Fork-join
- Parallel-for (OpenMP)
• Task-based
- Tasks pool (OpenMP 3.1) [Agullo et al., 2014]¹
- Tasks-and-dependencies (runtime systems, OpenMP 4)
¹ Agullo, E., Bramas, B., Coulaud, O., Darve, E., Messner, M., and Takahashi, T. (2014). Task-based FMM for multicore architectures. SIAM Journal on Scientific Computing, 36(1):C66–C93.
FMM Parallelization (35)
Tasks-and-Dependencies Model (OpenMP 4, StarPU)
Challenges
• Granularity
• Computational kernels
• Scheduling
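To make the tasks-and-dependencies model concrete, the sketch below declares the FMM operators of one tiny tree as tasks with explicit dependencies and releases each task as soon as its inputs are done. It uses Python's standard `graphlib.TopologicalSorter` as a stand-in for a runtime system such as StarPU or OpenMP 4 tasks; the dependency graph is an illustrative assumption, not the actual ScalFMM task graph.

```python
from graphlib import TopologicalSorter

# task -> set of tasks it depends on (hypothetical graph for a very small tree)
dependencies = {
    "P2P": set(),           # near field: independent of the tree traversal
    "P2M": set(),           # leaves: particles to multipole
    "M2M": {"P2M"},         # upward pass
    "M2L": {"M2M"},         # transfer pass
    "L2L": {"M2L"},         # downward pass
    "L2P": {"L2L", "P2P"},  # assumed here: L2P and P2P update the same particles
}


def run(dependencies):
    """Return the successive sets of ready tasks released by the dependency graph."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose dependencies are all done
        waves.append(ready)                 # a runtime would hand these to idle workers
        for task in ready:
            sorter.done(task)               # completing a task may release its successors
    return waves


waves = run(dependencies)
for step, ready in enumerate(waves):
    print(step, ready)
```

The three challenges listed above show up even in this toy: the size of each task (granularity), the cost of the function behind each name (computational kernels), and the choice of which ready task goes to which worker (scheduling), which this sketch leaves out entirely.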
FMM Parallelization (36)
Scheduling
[Figure: task graph (tasks A to F), a scheduler holding the ready tasks, and three workers (CPU0, CPU1, GPU0)]
• Priority
• Work stealing [Blumofe and Leiserson, 1999]
• Heterogeneous Earliest Finish Time (HEFT) [Topcuoglu et al., 2002]
Drawbacks:
• Calibration
• Overhead
• Ready-tasks view
FMM Parallelization (37)
Heteroprio
• Heteroprio [Agullo et al., 2015]¹
• Steady-state: execute tasks where they have the best acceleration factor
• Critical-state: execute a task on a worker only if it does not delay the hypothetical end
[Figure: tasks A to F, a scheduler with separate CPUPrio and GPUPrio orderings of the ready tasks (B, C, D), and the workers CPU0, CPU1 and GPU0]
¹ Agullo, E., Bramas, B., Coulaud, O., Darve, E., Messner, M., and Takahashi, T. (2015). Task-based FMM for heterogeneous architectures. Concurrency and Computation: Practice and Experience.
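The steady-state idea can be sketched with one bucket per task type and one bucket ordering per worker type. This is a simplified illustration under stated assumptions, not the Heteroprio implementation: the class `BucketScheduler`, the priority lists and the mapping of tasks B, C, D to operator types are all hypothetical, and the critical-state rule is not modelled.

```python
from collections import deque


class BucketScheduler:
    """Steady-state part only: one FIFO bucket per task type, one bucket order per worker type."""

    def __init__(self, priorities):
        # priorities: worker type -> bucket names, most preferred first
        self.priorities = priorities
        self.buckets = {}

    def push(self, task_type, task):
        self.buckets.setdefault(task_type, deque()).append(task)

    def pop(self, worker_type):
        """Return the next (task_type, task) for this worker type, or None if nothing fits."""
        for task_type in self.priorities[worker_type]:
            bucket = self.buckets.get(task_type)
            if bucket:  # skips missing and empty buckets
                return task_type, bucket.popleft()
        return None


# Hypothetical priorities: GPUs start with the kernel assumed to accelerate best (P2P);
# CPUs start with the kernels assumed to have no GPU version in this sketch.
priorities = {
    "cpu": ["P2M", "M2M", "L2L", "L2P", "M2L", "P2P"],
    "gpu": ["P2P", "M2L"],
}
scheduler = BucketScheduler(priorities)
for task_type, task in [("P2P", "B"), ("M2L", "C"), ("M2M", "D")]:
    scheduler.push(task_type, task)

gpu_first = scheduler.pop("gpu")
cpu_first = scheduler.pop("cpu")
cpu_second = scheduler.pop("cpu")
print(gpu_first, cpu_first, cpu_second)
```

Because each worker type scans the buckets in its own order, a GPU and a CPU asking for work at the same moment pick different tasks whenever both kinds are available, without any per-task cost model.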
FMM Particles Interaction Simulations (38)
Test Case
[Figure: one node with 24 CPU cores and four GPUs (GPU 1 to GPU 4)]
• N = 30 million particles
• Spherical Expansion/Rotation Kernel
• Acc = 10⁻³, h = 7 and Granularity = 1500
FMM Particles Interaction Simulations (39)
Traces - Heterogeneous
• 24CPUs (0GPU): 15.5 s
• 1GPU/23CPUs: 13.4 s
• 2GPUs/22CPUs: 10.9 s
• 3GPUs/21CPUs: 9.4 s
• 4GPUs/20CPUs: 8.7 s
Legend: P2P, P2M, M2M, M2L, L2L, L2P and Idle
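The execution times reported with the traces (15.5 s with no GPU, then 13.4 s, 10.9 s, 9.4 s and 8.7 s with 1 to 4 GPUs) can be turned into speedups against the CPU-only run. The short script below only redoes that arithmetic; the variable names are arbitrary.

```python
# number of GPUs -> execution time in seconds, as reported with the traces
times = {0: 15.5, 1: 13.4, 2: 10.9, 3: 9.4, 4: 8.7}

baseline = times[0]  # CPU-only run (24 CPU cores)
speedups = {gpus: baseline / t for gpus, t in times.items()}

for gpus, s in speedups.items():
    print(f"{gpus} GPU(s): x{s:.2f}")
```

With 4 GPUs the speedup is about 1.78 over the CPU-only execution, keeping in mind that each GPU added takes one CPU core away from the computation.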
FMM Particles Interaction Simulations (40)
Test Case
[Figure: one node with 24 CPU cores and four GPUs (GPU 1 to GPU 4)]
• N = 30 million particles
• Uniform/Lagrange kernel
• Acc = 10⁻⁵, 10⁻⁷, h = 7 and Granularity = 1500
FMM Particles Interaction Simulations (41)
Traces - Heterogeneous (4GPUs)
• Acc = 10⁻⁵: 7.9 s
• Acc = 10⁻⁷: 17 s
Legend: P2P, P2M, M2M, M2L, L2L, L2P and Idle
FMM Particles Interaction Simulations (42)
Test Cases
[Figure: 7 nodes (Node 1 to Node 7), 24 cores each]
• N = 200 million particles
• Spherical Expansion/Rotation Kernel
• Acc = 10⁻³, h = 8 and Granularity = 2000
FMM Particles Interaction Simulations (43)
Trace - 7 nodes × 24CPUs
Legend: P2P, P2M, M2M, M2L, L2L, L2P and Idle
Summary (44)
Summary:
• Generic
• Kernel independent
• Architecture independent
• Performance portability
Additional contributions:
• Commutativity expression in FMM
• MPI/OpenMP implementation
All included in ScalFMM (C++/HPC library)
(45)
Outline
• Problem Formulation
• BEM Solver (Matrix Approach)
• Fast-Multipole Method Approach
- FMM Algorithm & Parallelization
- FMM BEM Solver (Experimental Implementation)
• Conclusion & Perspectives
(46)
Propagation of the Current State to the Future
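[Figure: two orderings of one time step, illustrated with five matrices M_1 to M_5]
Summation over the past: $s_n = \sum_{k=1}^{5} M_k\, a_{n-k}$, $\tilde{s}_n = l_n - s_n$, then $M_0\, a_n = \tilde{s}_n$ (linear solver)
Propagation to the future: $\tilde{s}_n = l_n - s_n$, $M_0\, a_n = \tilde{s}_n$ (linear solver), then $s_{n+k} \leftarrow s_{n+k} + M_k\, a_n$ for $k = 1, \dots, 5$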
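The two orderings shown on this slide compute the same unknowns: either the past solutions are gathered at each step, or each new solution is scattered into the future sums. The sketch below checks this on made-up data, assuming the sign convention read from the figure (s~_n = l_n - s_n, then M0 a_n = s~_n). The function names, the tiny 2×2 matrices and the exact-fraction solver are illustrative choices, not the solver of the thesis.

```python
from fractions import Fraction


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def solve(M, b):
    """Gauss-Jordan elimination on exact fractions (assumes M is invertible)."""
    n = len(b)
    A = [[Fraction(x) for x in row] + [Fraction(y)] for row, y in zip(M, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [x / A[col][col] for x in A[col]]
        for r in range(n):
            if r != col:
                factor = A[r][col]
                A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
    return [row[n] for row in A]


def past_summation(M, l):
    """At step n, gather s_n = sum_k M_k a_{n-k} from the stored past solutions."""
    a = []
    for n, l_n in enumerate(l):
        s_n = [Fraction(0)] * len(l_n)
        for k in range(1, len(M)):
            if n - k >= 0:
                s_n = [x + y for x, y in zip(s_n, matvec(M[k], a[n - k]))]
        a.append(solve(M[0], [x - y for x, y in zip(l_n, s_n)]))
    return a


def future_propagation(M, l):
    """At step n, solve for a_n, then scatter M_k a_n into the future sums s_{n+k}."""
    T, size = len(l), len(l[0])
    s = [[Fraction(0)] * size for _ in range(T)]
    a = []
    for n, l_n in enumerate(l):
        a_n = solve(M[0], [x - y for x, y in zip(l_n, s[n])])
        a.append(a_n)
        for k in range(1, len(M)):
            if n + k < T:
                s[n + k] = [x + y for x, y in zip(s[n + k], matvec(M[k], a_n))]
    return a


# Made-up data: M[0] is M_0, M[1..3] are M_1..M_3, l[n] is the right-hand side l_n.
M = [[[4, 1], [1, 3]], [[1, 0], [2, 1]], [[0, 1], [1, 1]], [[1, 2], [0, 1]]]
l = [[1, 0], [0, 1], [2, 1], [1, 1], [0, 0], [3, 2]]
same = past_summation(M, l) == future_propagation(M, l)
print(same)
```

The second ordering needs only the current a_n at each step, at the price of keeping the partial sums s_{n+1}, ..., s_{n+Kmax} in memory.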
(47)
With FMM
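[Figure: the propagation scheme with $\tilde{s}_n = l_n - s_n$ and $M_0\, a_n = \tilde{s}_n$ (linear solver), where only M_1 and M_2 are kept as matrices and an FMM block contributes to the future sums s_{n+1}, ..., s_{n+5}]
• Far interactions in time (between far elements in space) are computed by the FMM
• The spatial decomposition is given by the octree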
(48)
Overview
• The octree is over a mesh (integration points)
• Interaction matrices between leaves
• Approximation/FMM
- development in the time domain
- multipole: what a cell emits to the outside
- local: what a cell receives from the outside
- operators in FD or TD
- accurate up to a chosen frequency
- the results in the TD of the matrix approach = FMM
Figure: Complete unit sphere
Figure: Truncated unit sphere
(49)
Operators (Overview)
• P2M
- compute what is emitted by a leaf to the outside
• M2M/L2L
- Extrapolation + time shift
• M2L
- Convolution product in TD (term-by-term multiplication in FD)
• L2P
- Integration
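The remark on M2L rests on the convolution theorem: a convolution product in the time domain becomes a term-by-term multiplication in the frequency domain. The sketch below is a generic check of that identity on made-up signals, using a naive discrete Fourier transform. It is not the M2L operator of the thesis, and the names `multipole` and `transfer` are only suggestive labels.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (enough for a small demo)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[j] * cmath.exp(sign * 2j * cmath.pi * j * k / n) for j in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def convolve_td(signal, kernel):
    """Linear convolution computed directly in the time domain."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, g in enumerate(kernel):
            out[i + j] += s * g
    return out


def convolve_fd(signal, kernel):
    """Same convolution: zero-pad, transform, multiply term by term, transform back."""
    n = len(signal) + len(kernel) - 1
    S = dft(list(signal) + [0.0] * (n - len(signal)))
    G = dft(list(kernel) + [0.0] * (n - len(kernel)))
    product = [a * b for a, b in zip(S, G)]
    return [v.real for v in dft(product, inverse=True)]


multipole = [1.0, 2.0, 0.5, -1.0]  # made-up time samples of an emitted signal
transfer = [0.0, 0.25, 0.5]        # made-up samples of a transfer function
td = convolve_td(multipole, transfer)
fd = convolve_fd(multipole, transfer)
max_diff = max(abs(a - b) for a, b in zip(td, fd))
print(max_diff)
```

The two results agree up to floating-point rounding. With a fast transform instead of the naive one used here, the frequency-domain route replaces the double loop of the time-domain product by transforms and a single pass of multiplications.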
Results (50)
Cone-Sphere Test Cases
Case                             C-927   C-4269   C-10012
Number of unknowns                 927     4269     10012
FMM tree height                      3        4         5
Number of leaves                    16       64       234
Number of Mk matrices (Kmax)       117      244       370
Number of Mk matrices (leaves)      60       64        49
Number of time steps (T)          2033     4345      6647
Results (51)
Sequential Executions
TD vs. FD operators:
Stages            FMM (TD)    FMM (TD + FD-M2L)   FMM (FD)    Matrix approach
Mk construction   76 s        76 s                76 s        242 s
Solve             58,122 s    53,241 s            97,861 s    7.8 s (*)
Total             58,198 s    53,317 s            97,937 s    249.8 s
Execution time of the TD-FMM vs. the matrix approach to solve the case C-927 in double precision.
(*) Our optimized BEM solver.
Results (52)
Parallel Executions (FMM Vs. Matrix Approach)
Figure: bar charts of the execution time (s), split into matrix generation and solve, for the FMM and the matrix approach.
• C-927 (×3.8): bar labels 883 and 237
• C-4269 (×1): bar labels 9,426 and 10,080
• C-10012 (×1.4): bar labels 33,408 and 25,256
The captions of the different cases show the overhead of the FMM TD-BEM against the matrix approach.
Summary (53)
Summary:
• Preliminary results
• Best configuration: TD + FD M2L
• Not competitive against the direct approach (maybe on larger test cases)
• Any improvement of the matrix creation will make the FMM less competitive
Additional contributions:
• Incomplete/4D FMM
• Sphere discretization/length APS signal
(54)
Conclusion & Perspectives
(55)
Conclusion
Dense BEM/Matrix Approach
• Based on a new computational order
• Remove the bottleneck of the SpMV
• Implemented efficiently on modern architectures
• Complete BEM solver
(56)
Conclusion
FMM
• Generic and state-of-the-art library
• Several parallelization strategies
• Robust OpenMP/MPI implementation (10 billion particles)
• Modern task-based approach
• ScalFMM
(57)
Conclusion
FMM BEM Solver (Preliminary)
• Parallelized using ScalFMM
• Best configuration TD operators + FD M2L
• Our implementation is not faster than the direct approach
(58)
Perspectives
TD-BEM
• Improve the construction of the interaction matrices
• M0 linear solver: small matrix, lots of nodes
• Compare existing solvers (TD vs. FD)
(59)
Perspectives
FMM parallelization
• Task-based with implicit MPI communications
• Group-Tree update
(60)
Perspectives
FMM BEM
• Study the cost of the solve compared to the direct approach (complexity for some cases)
• Lots of remaining optimizations to test
Agullo, E., Bramas, B., Coulaud, O., Darve, E., Messner, M., and Takahashi, T. (2014). Task-based fmm for multicore architectures. SIAM Journal on Scientific Computing, 36(1):C66–C93.
Agullo, E., Bramas, B., Coulaud, O., Darve, E., Messner, M., and Takahashi, T. (2015). Task-based fmm for heterogeneous architectures. Concurrency and Computation: Practice and Experience, pages n/a–n/a. cpe.3723.
Baskaran, M. M. and Bordawekar, R. (2008). Optimizing sparse matrix-vector multiplication on gpus using compile-time and run-time strategies. IBM Research Report, RC24704 (W0812-047).
Bell, N. and Garland, M. (2009). Implementing sparse matrix-vector multiplication on throughput-oriented processors. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, page 18. ACM.
Blumofe, R. D. and Leiserson, C. E. (1999). Scheduling multithreaded computations by work stealing. J. ACM, 46(5):720–748.
Bramas, B., Coulaud, O., and Sylvand, G. (2014). Euro-Par 2014 Parallel Processing: 20th International Conference, Porto, Portugal, August 25-29, 2014. Proceedings, chapter Time-Domain BEM for the Wave Equation: Optimization and Hybrid Parallelization, pages 511–523. Springer International Publishing, Cham.
Bramas, B., Coulaud, O., and Sylvand, G. (2015). Time-domain bem for the wave equation on distributed-heterogeneous architectures. Parallel Comput., 49(C):66–82.
Chandramowlishwaran, A., Williams, S., Oliker, L., Lashuk, I., Biros, G., and Vuduc, R. (2010). Optimizing and tuning the fast multipole method for state-of-the-art multicore architectures. In Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on, pages 1–12.
Ergin, A. A., Shanker, B., and Michielssen, E. (2000). Fast analysis of transient acoustic wave scattering from rigid bodies using the multilevel plane wave time domain algorithm. The Journal of the Acoustical Society of America, 107(3).
Garland, M. (2008). Sparse matrix computations on manycore gpu’s. In Proceedings of the 45th annual Design Automation Conference, pages 2–6. ACM.
Hamada, T., Narumi, T., Yokota, R., Yasuoka, K., Nitadori, K., and Taiji, M. (2009). 42 tflops hierarchical n-body simulations on gpus with applications in both astrophysics and turbulence. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC ’09, pages 62:1–62:12, New York, NY, USA. ACM.
Hu, Q., Gumerov, N. A., and Duraiswami, R. (2011). Scalable fast multipole methods on distributed heterogeneous architectures. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’11, pages 36:1–36:12, New York, NY, USA. ACM.
Im, E.-J. and Yelick, K. (2001). Optimizing sparse matrix computations for register reuse in sparsity. In Computational Science - ICCS 2001, pages 127–136. Springer.
Lashuk, I., Chandramowlishwaran, A., Langston, H., Nguyen, T.-A., Sampath, R., Shringarpure, A., Vuduc, R., Ying, L., Zorin, D., and Biros, G. (2012). A massively parallel adaptive fast multipole method on heterogeneous architectures. Commun. ACM, 55(5):101–109.
Ltaief, H. and Yokota, R. (2014). Data-driven execution of fast multipole methods. Concurrency and Computation: Practice and Experience, 26(11):1935–1946.
Malhotra, D. and Biros, G. (2015). Pvfmm: A parallel kernel independent fmm for particle and volume potentials. Communications in Computational Physics, 18(03):808–830.
Pichel, J. C., Heras, D. B., Cabaleiro, J. C., and Rivera, F. F. (2005). Performance optimization of irregular codes based on the combination of reordering and blocking techniques. Parallel Computing, 31(8):858–876.
Pinar, A. and Heath, M. T. (1999). Improving performance of sparse matrix-vector multiplication. In Proceedings of the 1999 ACM/IEEE conference on Supercomputing, page 30. ACM.
Terrasse, I. (1993). Resolution mathematique et numerique des equations de Maxwell instationnaires par une methode de potentiels retardes. PhD thesis.
Topcuoglu, H., Hariri, S., and Wu, M.-y. (2002). Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Trans. Parallel Distrib. Syst., 13(3):260–274.
Vuduc, R., Demmel, J. W., and Yelick, K. A. (2005). Oski: A library of automatically tuned sparse matrix kernels. In Journal of Physics: Conference Series, volume 16, page 521. IOP Publishing.
Vuduc, R. W. and Moon, H.-J. (2005). Fast sparse matrix-vector multiplication by exploiting variable block structure. In High Performance Computing and Communications, pages 807–816. Springer.
White III, J. B. and Sadayappan, P. (1997). On improving the performance of sparse matrix-vector multiplication. In High-Performance Computing, 1997. Proceedings. Fourth International Conference on, pages 66–71. IEEE.
Yokota, R. and Barba, L. A. (2011). Treecode and fast multipole method for n-body simulation with cuda. GPU Computing Gems Emerald Edition, page 113.
Vectorized Implementation (Overview)
The algorithm is tied to the vectorization type size lSIMD
Example with ng = 4, v = 4, lSIMD = 2:
[Figure: the row-vector (i) of Slicej holds the values a, b, c, d; the vector a*<n(j), indexed 0 to 6, is read through a buffer of lSIMD values to update sn, sn+1, sn+2 and sn+3; r marks the results.]
• Each value from the vectors is read only once (and maybe copied into the buffer)
• The values in the buffer are shifted to avoid reloading
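The buffer shift can be emulated in a few lines of Python. This is only a sketch under assumptions: `windowed_products`, `row`, `past` and `loads` are hypothetical names, the SIMD register is emulated with a plain list of `l_simd` values, and `results[g]` is the product of the row with the window of `past` starting at offset `g`. In this simplified version a value is fetched once per row chunk, whereas the slide states that the actual kernel reads each value only once.

```python
def windowed_products(row, past, ng, l_simd):
    """Return (results, loads) with results[g] = dot(row, past[g:g + len(row)])."""
    v = len(row)
    assert len(past) == v + ng - 1 and v % l_simd == 0
    results = [0.0] * ng
    loads = 0                                # values fetched from `past`
    for c in range(0, v, l_simd):            # one row chunk = one emulated SIMD register
        coeffs = row[c:c + l_simd]
        buffer = list(past[c:c + l_simd])    # initial load, used by window 0
        loads += l_simd
        for g in range(ng):
            results[g] += sum(x * y for x, y in zip(coeffs, buffer))
            if g + 1 < ng:
                # shift the buffer by one lane and fetch a single new value
                buffer = buffer[1:] + [past[c + l_simd + g]]
                loads += 1
    return results, loads


# Same sizes as the slide example: ng = 4, v = 4, lSIMD = 2
results, loads = windowed_products(
    [1.0, 2.0, 3.0, 4.0], [float(x) for x in range(7)], ng=4, l_simd=2
)
```

Without the shift, each of the ng windows would reload l_simd values per chunk (16 fetches here); with it, each chunk needs l_simd + ng - 1 fetches (10 here).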
In Time or Frequency Domain
Propagation of the wave for several time steps on a target discretized sphere. The different spheres represent the values that will be applied on the included mesh elements.
Contiguous-Blocking Computational Kernel
[Figure, panels (a), (b), (c): Slicej (axis labels Kmax-1 and N) is stored as blocks of width bc and processed by Thread-1 to Thread-9, across Global Memory, Shared Memory and Local Memory.]
I) Coalesced copy of the vector a*<n(j) into shared memory
II) Coalesced access in global memory (column per column); uncoalesced read from shared memory
III) Partial results in registers (sn, sn+1, sn+2); coalesced add to global memory
(a) the original slice is transformed into a block during the pre-computation stage (ng = 3, bc = 11)
(b) the blocks are moved to the device memory for the summation stage
(c) a thread-block (nb-threads = 9) is in charge of the blocks from a slice interval and computes several summation vectors at the same time
Multi-vectors/vector Product
[Figure, panels (a) and (b): one slice row applied to the vectors sn (i), sn+1 (i) and sn+2 (i); the vector values are indexed 0 to 6.]
Computing one slice-row with 3 vectors (ng = 3)
(a) using 3 scalar products
(b) using the multi-vectors/vector product
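The two variants can be compared with a small Python sketch. It assumes, as the index ranges in the figure suggest, that the ng results use windows of the same vector shifted by one value; `scalar_products`, `multi_vector_product`, `row` and `past` are hypothetical names, not the thesis code.

```python
def scalar_products(row, past, ng):
    """Variant (a): ng independent scalar products, the row is traversed ng times."""
    return [sum(row[k] * past[g + k] for k in range(len(row))) for g in range(ng)]


def multi_vector_product(row, past, ng):
    """Variant (b): the row is traversed once, each coefficient feeds ng accumulators."""
    acc = [0.0] * ng
    for k, coeff in enumerate(row):      # each matrix coefficient is loaded once
        for g in range(ng):
            acc[g] += coeff * past[g + k]
    return acc


row = [1.0, 2.0, 3.0, 4.0, 5.0]          # one slice row of 5 coefficients
past = [float(x) for x in range(7)]      # values indexed 0 to 6, as in the figure
a = scalar_products(row, past, ng=3)
b = multi_vector_product(row, past, ng=3)
```

Both variants perform the same Flop; the second one only changes the loop order so that the matrix coefficients are reused while they are at hand.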
In Shared Memory
• OpenMP parallelization
• Summation divided/balanced between the threads
• No communication during the summation
• Multi-threaded M0 linear solver if possible
• NUMA effects are not handled
On Heterogeneous Nodes
• OpenMP parallelization
• One thread/core per GPU
• An interval of the slices is moved on each GPU
• Intervals are rebalanced between iterations with a greedy algorithm
• The memory limit of the GPUs may reduce their performance
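The slides do not give the greedy rule itself. As an illustration only, one possible greedy step moves a few slices from the slowest worker to the fastest one after each iteration; `rebalance`, `intervals`, `sizes`, `times` and `step` are hypothetical names, and a real implementation would also have to respect the GPU memory limit.

```python
from itertools import accumulate


def rebalance(sizes, times, step):
    """One greedy step: move up to `step` slices from the slowest to the fastest worker."""
    slowest = max(range(len(times)), key=lambda w: times[w])
    fastest = min(range(len(times)), key=lambda w: times[w])
    moved = min(step, sizes[slowest])    # never move more slices than the worker owns
    new_sizes = list(sizes)
    new_sizes[slowest] -= moved
    new_sizes[fastest] += moved
    return new_sizes


def intervals(sizes):
    """Turn per-worker slice counts into contiguous [start, end) slice intervals."""
    ends = list(accumulate(sizes))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


# Three workers with 100 slices each; the second one was the slowest last iteration
new_sizes = rebalance([100, 100, 100], times=[1.0, 2.0, 1.5], step=10)
new_intervals = intervals(new_sizes)
```

Keeping only the sizes and rebuilding the intervals by prefix sums preserves the contiguity of the interval owned by each worker.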
Application Example
FD-BEM + FMM (antenna at 1 GHz)
Image from Airbus Group.
Cone-Sphere Test Cases
Case                                              C-927   C-4269   C-10012   C-22468
Number of unknowns                                927     4269     10012     22468
FMM tree height                                   3       4        5         6
Number of leaves in the FMM tree                  16      64       234       936
Number of NNZ interaction matrices (Kmax)         117     244      370       551
Number of NNZ matrices between FMM leaves         60      64       49        37
Number of time steps (T)                          2033    4345     6647      9957
Size of the simulation box                        3.3     7.3      11        16
Fmax                                              348     337      335       334
Incomplete FMM coefficient l = h − 1              16      18       13        10
Incomplete FMM coefficient l = 2                  16      36       52        80
Multi-vectors/vector Product (GPU)
For the Contiguous-Blocking scheme:
             GPU                            CPU
Width (bc)   16    32    64    128          16    32    64    128
Single       243   338   431   496 (11%)    4.3   5.5   7.8   6.8 (17%)
Double       143   199   248   286 (20%)    3.9   5.6   4.2   4.3 (21%)
GFlop/s for 420 slices (6400 rows and bc columns)
(%) percentage of the peak performance
Fork-join (OpenMP)
• Level by level
• Critical balancing
• Possible bottleneck (top of the tree)
• Difficult to mix near/far fields
Fork-join+Message-passing (Hybrid OpenMP/MPI)
P0 P1 P2 P3
• Distribute the tree between nodes
• Progress level by level
• Communication between all stages
Poor parallelism expression
Trace - Heterogeneous (4GPUs)
h = 7, ng = 1500, Acc = 10^-7: 17 s
h = 6, ng = 300, Acc = 10^-7: 39 s (not on the same scale)
24 threads, N = 30 million, uniform distribution, Uniform/Lagrange kernel.
Legend: P2P, P2M, M2M, M2L, L2L, L2P and Idle.
Flop/Cost Estimation
[Figure: Matrix generation cost estimation. x-axis: Number of unknowns (1,000 to 20,000); y-axis: Unit of Cost (10^6 to 10^9). Series: Between FMM leaves; Complete interaction matrices. Factors shown: 1, 6, 23, 92.]
[Figure: Summation stage Flop estimation. x-axis: Number of unknowns (1,000 to 20,000); y-axis: Number of Flop (10^10 to 10^16). Series: TD-BEM FMM; TD-BEM FMM (FD M2L); Matrix Approach. Factors shown: 1340, 388, 203, 154 and 1320, 353, 119, 57.]
The numbers above the slower plot represent the slow-down factors against the faster method.
Reducing the Complexity
Direct computation $O(N^2)$ → FMM $O(N)$
[Figure: interactions between two sets X and Y, computed directly in $O(N^2)$ and with the FMM in $O(N)$.]
• Spatial decomposition → Potential decomposition
$f_i = f_i^{near} + f_i^{far}$
• Near field is computed by direct interactions
• The far field is done using different operators
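The split of the potential can be checked numerically on a toy case. The sketch below assumes a 1/r kernel, points in [0, 1) and a uniform 1D grid where "near" means the same or an adjacent cell; all names are hypothetical. The far part is still summed directly here, so the cost remains quadratic: an FMM would replace that branch by its operators to reach O(N).

```python
def cell_of(x, ncells):
    """Index of the uniform cell of [0, 1) that contains x."""
    return min(int(x * ncells), ncells - 1)


def split_potential(points, charges, ncells):
    """Return (near, far) such that near[i] + far[i] = sum over j != i of q_j / |x_i - x_j|."""
    n = len(points)
    near, far = [0.0] * n, [0.0] * n
    for i in range(n):
        ci = cell_of(points[i], ncells)
        for j in range(n):
            if i == j:
                continue
            term = charges[j] / abs(points[i] - points[j])
            if abs(cell_of(points[j], ncells) - ci) <= 1:
                near[i] += term    # same or adjacent cell: direct interaction
            else:
                far[i] += term     # well separated: an FMM would use its operators here
    return near, far


def direct_potential(points, charges):
    """Reference computation without any spatial decomposition (points are distinct)."""
    return [sum(q / abs(x - y) for y, q in zip(points, charges) if y != x) for x in points]


points = [0.05, 0.12, 0.31, 0.48, 0.77, 0.93]
charges = [1.0, -2.0, 0.5, 1.5, -1.0, 2.0]
near, far = split_potential(points, charges, ncells=4)
total = direct_potential(points, charges)
```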
Airplane Simulation
• Acoustics
• N = 23 962
• 10 823 time iterations
• Kmax = 341 interaction matrices Mk (≈ 5.5 × 10^9 NNZ)
• Computing sn ≈ 11 GFlop
• Total ≈ 130 651 GFlop
• ng = 8
• 70 GB of data
• CPU node: 2 dodeca-core Haswell Intel Xeon E5-2680 at 2.50 GHz and 128 GB (DDR4) of shared memory
• GPUs per node: 4 NVIDIA Kepler K40M GPUs (745 MHz), 2880 cores, 12 GB of dedicated memory
Hybrid MPI/OpenMP
[Figure: efficiency (0 to 1) versus number of nodes (1 to 50). Series: Hybrid MPI/OpenMP.]
Efficiency: uniform distribution, Spherical Expansion/Rotation kernel, N = 200 million, h = 8, Acc = 10^-3, from 1 to 50 nodes (24 threads per node); for np = 50 the execution time is 2.24 s.
Group-tree
• Granularity G
• A group → G cells/leaves
• Good locality
• Low iteration complexity
• Dependencies between cells = between groups
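How cell dependencies collapse into group dependencies can be sketched as follows. This is an illustration under assumptions, not the ScalFMM data structure: cells are identified by a Morton index, the tree is an octree (the parent index is the child index shifted right by 3 bits), only the child-to-parent dependencies of the upward pass are considered, and all function names are hypothetical.

```python
def make_groups(cell_indices, G):
    """Sort the cells of one level and cut them into groups of at most G cells."""
    cells = sorted(cell_indices)
    return [cells[k:k + G] for k in range(0, len(cells), G)]


def owner_map(groups):
    """Map each cell index to the index of the group that stores it."""
    return {cell: g for g, group in enumerate(groups) for cell in group}


def upward_group_dependencies(child_groups, parent_groups):
    """Pairs (child group, parent group) induced by the child-to-parent cell dependencies."""
    owner = owner_map(parent_groups)
    deps = set()
    for g, group in enumerate(child_groups):
        for cell in group:
            deps.add((g, owner[cell >> 3]))   # octree: parent index = child index >> 3
    return sorted(deps)


child_groups = make_groups(range(16), G=4)    # 16 cells at the child level
parent_groups = make_groups({c >> 3 for c in range(16)}, G=4)
deps = upward_group_dependencies(child_groups, parent_groups)
```

In this example the 16 cell-level dependencies reduce to 4 group-level ones, which is what keeps the number of tasks and dependencies manageable for a runtime system.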
Trace - Shared Memory - 24CPUs
N = 20 million, ellipsoid distribution, Spherical Expansion/Rotation kernel, Acc = 10^-3, h = 11 and ng = 8000, in 5.2 s.
Legend: P2P, P2M, M2M, M2L, L2L, L2P and Idle.
Trace - 24CPUs
N = 30 million, uniform distribution, Spherical Expansion/Rotation kernel, Acc = 10^-3, h = 7 and ng = 1500, in 15.5 s.
Legend: P2P, P2M, M2M, M2L, L2L, L2P and Idle.
Test Cases
Interactions between N particles for two distributions:
[Figures: Uniform distribution; Ellipsoid distribution]
The height of the tree (h) is chosen such that the sequential execution time is minimal.
Parallel Strategies for FMM BEM
Three strategies (Fork-join OpenMP)
• Threaded FMM: divide each level between threads (ScalFMM classic)
• Threaded kernel: divide the work inside the kernel
• Mix FMM/Kernel: two layers of parallelism, one in the FMM and a second in the kernel
Parallel Executions
FMM Vs. Matrix Approach
[Figures: bar charts of the time (s), split into matrix generation and solve. Bars: Threaded FMM, Threaded Kernel, Mix FMM/Kernel, Matrix Approach.]
Figure: C-927. Time labels: 3,752; 1,897; 883; 237.
Figure: C-4269 (×1). Time labels: 15,732; 15,371; 9,426; 10,080.
Figure: C-10012. Time labels: 33,408; 76,931; 39,227; 25,256.
The captions of the different cases show the overhead of the FMM TD-BEM against the matrix approach.