HPC and CFD. Main Contributions up to Now (An Overview) and Future Work
Pedro Valero-Lara
Basque Center for Applied Mathematics
October 2, 2014
Motivation
Accelerating CFD problems using heterogeneous platforms (multicore CPUs + hardware accelerators)
HPC cluster
◮ MPI
Multicore CPU
◮ ICC, OpenMP, PGI
NVIDIA GPUs
◮ CUDA, PGI
Intel Xeon Phi
◮ ICC, OpenMP
CFD (fluid dynamics)
◮ Pressure Poisson Equation
◮ Lattice-Boltzmann Method
◮ Solid-Fluid Interaction: Immersed-Boundary Method
◮ Grid refinement: Multi-Domain Lattice-Boltzmann Method
Architectures
[Figure: HPC cluster — nodes (CPU, memory, discs) connected through link connectors and a router/switch]
[Figure: GPU architecture — multiprocessors with cores, shared memory and a control unit, attached to global memory]
Heterogeneous Architectures
[Figure: heterogeneous node — Intel multicore CPUs with DDR3 memory (51.2 GB/s) connected by QPI (25.6 GB/s), IOH/ICH and disc, plus a GPU/Phi accelerator attached via PCIe 16x (8 GB/s) with its own GDDR5 memory (208 GB/s GPU bus, 320 GB/s Phi bus)]
Intel Xeon Phi Architecture
[Figure: Intel Xeon Phi — vector cores with coherent caches connected by an interprocessor network, plus memory and I/O interfaces and fixed-function logic]
Outline
Platforms
Pressure Poisson Equation
Lattice-Boltzmann Method
Solid Fluid Interaction
Multi-Domain Grid Refinement
Future Work
Platforms
Platform         Intel Xeon           NVIDIA GPU            Intel Xeon Phi
Model            E5-520/E5-2670       Kepler K20c           5110P
Cores            8/16                 2496                  60
On-chip mem.     L1 32KB (core)       SM 16/48KB (MP)       L1 32KB (core)
                 L2 512KB (unified)   L1 48/16KB (MP)       L2 256KB (core)
                 L3 20MB (unified)    L2 768KB (unified)    L2 30MB (coherent)
Memory           64/32GB DDR3         5GB GDDR5             8GB GDDR5
Bandwidth        51.2 GB/s            208 GB/s              320 GB/s
Poisson          ✓                    ✓                     ✗
LBM              ✓                    ✓                     ✓
Solid-Fluid      ✓                    ✓                     ✓
Mesh Refinement  ✓                    ✓                     ✗
Outline
Platforms
Pressure Poisson Equation
Lattice-Boltzmann Method
Solid Fluid Interaction
Multi-Domain Grid Refinement
Future Work
Pressure Poisson Equation (2D Separable Elliptic Equation)
\[
\frac{\partial}{\partial u}\left(a(u)\frac{\partial x}{\partial u}\right) + b(u)\frac{\partial x}{\partial u} + c(u)\,x + \frac{\partial}{\partial v}\left(d(v)\frac{\partial x}{\partial v}\right) + e(v)\frac{\partial x}{\partial v} + f(v)\,x = g(u,v)
\]
Using Dirichlet or Neumann boundary conditions we obtain a linear system Ax = g:
\[
A = \begin{pmatrix}
B_1 & C_1 &        &         &         \\
A_2 & B_2 & C_2    &         &         \\
    & \ddots & \ddots & \ddots &       \\
    &     & A_{n-1} & B_{n-1} & C_{n-1} \\
    &     &        & A_n     & B_n
\end{pmatrix},
\qquad A_i = a_i I,\quad B_i = B + b_i I,\quad C_i = c_i I
\]
where I is the identity matrix, a_i, b_i, c_i are scalars and B is a tridiagonal matrix.
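Since every block is either a scalar multiple of the identity or the shared tridiagonal matrix B shifted by a scalar, the whole system can be stored compactly. A minimal host-side sketch of one possible container (illustrative names, not the layout of the original code):

#include <vector>

// Compact storage for the block-tridiagonal system above:
// A_i = a_i*I, B_i = B + b_i*I, C_i = c_i*I, with B a shared m x m tridiagonal matrix.
struct BlockTridiagonalSystem {
    int n;                        // number of block rows
    int m;                        // size of each block
    std::vector<double> a, b, c;  // n scalars each (a_1 and c_n are not used)
    std::vector<double> B_sub;    // m-1 entries: sub-diagonal of B
    std::vector<double> B_diag;   // m   entries: main diagonal of B
    std::vector<double> B_sup;    // m-1 entries: super-diagonal of B
    std::vector<double> g;        // right-hand side, n*m entries
};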
Block Tridiagonal Method (BLKTRI-FISHPACK)
It is divided into three stages:
1. Preprocessing (this step is required to stabilize the method)
◮ A set of intermediate results (roots) is obtained
◮ The right-hand side is divided into two terms, q and p
2. Reduction
◮ The following equations are solved:
\[
q_i^{(r)} = (B_i^{r})^{-1}\, B_{i-2^{r-1}}^{r-1}\, B_{i+2^{r-1}}^{r-1}\, p_i^{(r)}
\]
\[
p_i^{(r+1)} = \alpha_i^{r}\,(B_{i-2^{r-1}}^{r-1})^{-1} q_{i-2^{r}}^{(r)} + \gamma_i^{r}\,(B_{i+2^{r-1}}^{r-1})^{-1} q_{i+2^{r}}^{(r)} - p_i^{(r)}
\]
3. Substitution
◮ The following equation is solved:
\[
x_i = (B_i^{r})^{-1}\, B_{i-2^{r-1}}^{r-1}\, B_{i+2^{r-1}}^{r-1}
\left[\, p_i^{(r)} - \alpha_i^{r}\,(B_{i-2^{r-1}}^{r-1})^{-1} x_{i-2^{r}} - \gamma_i^{r}\,(B_{i+2^{r-1}}^{r-1})^{-1} x_{i+2^{r}} \right]
\]
B stores the roots, and α and γ follow the expressions:
\[
\alpha_i^{(r)} = \prod_{j=i-2^{r}+1}^{i} a_j
\qquad\text{and}\qquad
\gamma_i^{(r)} = \prod_{j=i}^{i+2^{r}-1} c_j
\]
The highest computational cost is found in computing:
◮ Tridiagonal systems, scalar-vector multiplications and vector additions
Parallel Implementation
Degree of parallelism:
Reduction
◮ The number of independent terms is halved at each step
Substitution
◮ The number of independent terms is doubled at each step
Approaches:
Coarse grain
◮ One thread per multiple terms (Thomas algorithm); see the sketch below
Fine grain
◮ One thread per element (CR, PCR, CR-PCR)
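As a minimal sketch of the coarse-grain approach (illustrative names and data layout, not the original code), each CUDA thread solves one complete tridiagonal system with the Thomas algorithm:

// One thread per tridiagonal system. Layout assumption: system s stores its n entries
// of each diagonal contiguously (lower, diag, upper, rhs have num_systems*n entries).
__global__ void thomas_per_thread(const double *lower, const double *diag,
                                  const double *upper, const double *rhs,
                                  double *x, double *cp, double *dp,
                                  int n, int num_systems)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= num_systems) return;

    const double *a = lower + s * n;   // sub-diagonal (a[0] unused)
    const double *b = diag  + s * n;   // main diagonal
    const double *c = upper + s * n;   // super-diagonal (c[n-1] unused)
    const double *d = rhs   + s * n;
    double *cprime = cp + s * n;       // scratch space
    double *dprime = dp + s * n;
    double *xs = x + s * n;

    // Forward elimination
    cprime[0] = c[0] / b[0];
    dprime[0] = d[0] / b[0];
    for (int i = 1; i < n; ++i) {
        double m = b[i] - a[i] * cprime[i - 1];
        cprime[i] = c[i] / m;
        dprime[i] = (d[i] - a[i] * dprime[i - 1]) / m;
    }
    // Backward substitution
    xs[n - 1] = dprime[n - 1];
    for (int i = n - 2; i >= 0; --i)
        xs[i] = dprime[i] - cprime[i] * xs[i + 1];
}

This per-system layout is simple but not coalesced on the GPU; in practice the systems would be interleaved so that consecutive threads access consecutive addresses, while the fine-grain CR/PCR kernels assign one thread per row instead.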
[Figure: degree of parallelism — high to low along the reduction, low to high along the substitution]
[Figure: mapping of the independent systems onto CUDA blocks and threads for the coarse-grain and fine-grain approaches]
Performance
[Figure: speedup (up to 10x) per step (1-10) for 1 CPU (8 threads), 2 CPUs (16/32 threads) and 1 GPU with Thomas (TA), CR, PCR and CR-PCR kernels; two panels]
Heterogeneous Approach
[Figure: distribution of the work between CPU and GPU in the heterogeneous approach]
[Figure: speedup (up to 3.5x) over problem sizes 128x128 to 1024x1024 for 1CPU-8Th, 1CPU(8Th)+1GPU(PCR), 2CPUs-16Th and 2CPUs(16Th)+1GPU(PCR)]
◮ Block Tridiagonal Solvers on Heterogeneous Architectures. ISPA 2012
3D Separable Elliptic Equation
\[
\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + \frac{\partial^2 u}{\partial z^2} = f(x, y, z)
\]
A periodic condition is applied in one of the directions (FFT):
◮ $u_{n,j,k} = \frac{1}{N}\sum_{l=1}^{N} u_{l,j,k}\, e^{-i\alpha(n-1)}$ with $\alpha = \frac{2\pi(l-1)}{N}$
\[
\frac{u_{l,j+1,k} + u_{l,j-1,k}}{\Delta y^2} + \frac{u_{l,j,k+1} + u_{l,j,k-1}}{\Delta z^2} + \beta_l\, u_{l,j,k} = F_{l,j,k}, \quad l = 1 \cdots N
\]
with
\[
\beta_l / 2 = \frac{\cos(\alpha) - 1}{\Delta x^2} - \frac{1}{\Delta y^2} - \frac{1}{\Delta z^2}
\]
[Figure: the FFT along the periodic direction turns the 3D problem with N discretized points into N uncoupled 2D problems, each solved by reduction and substitution]
Parallel Implementation (MPI cluster, OpenMP multicore CPUs, CUDA GPUs)
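A minimal sketch of the decoupling step on the GPU, assuming the right-hand side is stored with x as the slowest-varying index (illustrative names; the original implementation may organize this differently). Batched 1D FFTs along the periodic direction produce the N independent 2D problems, each with its own coefficient β_l:

#include <cufft.h>

void poisson3d_fft_decouple(cufftDoubleComplex *d_rhs, // device array, size N*Ny*Nz
                            int N, int Ny, int Nz)
{
    // One 1D transform of length N for every (j,k) line: Ny*Nz batched FFTs.
    // With x slowest-varying, consecutive elements of one transform are strided by Ny*Nz
    // and consecutive batches start one element apart.
    cufftHandle plan;
    int n[1] = { N };
    cufftPlanMany(&plan, 1, n,
                  n, Ny * Nz, 1,    // inembed, istride, idist
                  n, Ny * Nz, 1,    // onembed, ostride, odist
                  CUFFT_Z2Z, Ny * Nz);
    cufftExecZ2Z(plan, d_rhs, d_rhs, CUFFT_FORWARD);

    // ... for each l = 0..N-1: solve the decoupled 2D problem with coefficient
    //     beta_l = 2*((cos(alpha)-1)/dx^2 - 1/dy^2 - 1/dz^2)
    //     (hypothetical per-wavenumber 2D solver, one problem per worker) ...

    cufftExecZ2Z(plan, d_rhs, d_rhs, CUFFT_INVERSE);  // transform back; scale by 1/N afterwards
    cufftDestroy(plan);
}

Each of the N transformed planes is independent, so they can be distributed over MPI ranks, OpenMP threads or GPUs, which is what makes the method attractive on heterogeneous platforms.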
[Figure: speedup (up to 16x) per step (1-9) for 1 CPU (8 threads), 2 CPUs (16/32 threads) and 1, 2 and 4 GPUs; two panels]
[Figure: timeline of the heterogeneous execution, distributing the uncoupled problems between CPUs and GPUs]
[Figure: speedup (up to 16x) over problem sizes 128x128x128, 256x256x256 and 512x512x512 for 1CPU-8Th, 2CPUs-16Th, 1GPU, 1CPU-1GPU, 2GPUs, 2CPUs-2GPUs, 4GPUs and 2CPUs-4GPUs]
◮ Fast finite difference Poisson solvers on heterogeneous architectures.
Computer Physics Communications 2014
Outline
Platforms
Pressure Poisson Equation
Lattice-Boltzmann Method
Solid Fluid Interaction
Multi-Domain Grid Refinement
Future Work
Lattice-Boltzmann Method (LBM) Formulation
BGK formulation
\[
f_i(\mathbf{x} + \mathbf{e}_i \Delta t, t + \Delta t) - f_i(\mathbf{x}, t) = -\frac{\Delta t}{\tau}\left( f_i(\mathbf{x}, t) - f_i^{eq}(\mathbf{x}, t) \right)
\]
Equilibrium distribution (Maxwell-Boltzmann)
\[
f_i^{eq} = \rho\, \omega_i \left[ 1 + \frac{\mathbf{c}_i \cdot \mathbf{u}}{c_s^2} + \frac{(\mathbf{c}_i \cdot \mathbf{u})^2}{2 c_s^4} - \frac{\mathbf{u}^2}{2 c_s^2} \right]
\]
$\omega_0 = 4/9$; $\omega_i = 1/9,\; i = 1\cdots4$; $\omega_i = 1/36,\; i = 5\cdots8$
$\mathbf{c}_0 = (0,0)$; $\mathbf{c}_i = (\pm1, 0), (0, \pm1),\; i = 1\cdots4$; $\mathbf{c}_i = (\pm1, \pm1),\; i = 5\cdots8$
[Figure: D2Q9 lattice — discrete velocities c_0 … c_8 and weights ω_0 … ω_8]
Given $f_i(\mathbf{x}, t)$ compute
◮ $\rho = \sum_i f_i(\mathbf{x}, t)$
◮ $\rho \mathbf{u} = \sum_i \mathbf{e}_i f_i(\mathbf{x}, t)$
Collision
◮ $f_i^{*}(\mathbf{x}, t + \Delta t) = f_i(\mathbf{x}, t) - \frac{\Delta t}{\tau}\left( f_i(\mathbf{x}, t) - f_i^{eq}(\mathbf{x}, t) \right)$
Streaming
◮ $f_i(\mathbf{x} + \mathbf{c}_i \Delta t, t + \Delta t) = f_i^{*}(\mathbf{x}, t + \Delta t)$
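The macroscopic moments and the equilibrium above translate almost directly into code. A minimal D2Q9 sketch in CUDA (illustrative velocity ordering, lattice units with c_s² = 1/3; not the original code):

// D2Q9 weights and discrete velocities (ordering is an assumption of this sketch)
__constant__ double w[9]  = { 4.0/9.0, 1.0/9.0, 1.0/9.0, 1.0/9.0, 1.0/9.0,
                              1.0/36.0, 1.0/36.0, 1.0/36.0, 1.0/36.0 };
__constant__ int    cx[9] = { 0, 1, -1, 0,  0, 1, -1, -1,  1 };
__constant__ int    cy[9] = { 0, 0,  0, 1, -1, 1, -1,  1, -1 };

__device__ void d2q9_equilibrium(const double f[9], double feq[9])
{
    // Macroscopic moments: rho = sum_i f_i, rho*u = sum_i c_i f_i
    double rho = 0.0, ux = 0.0, uy = 0.0;
    for (int i = 0; i < 9; ++i) {
        rho += f[i];
        ux  += cx[i] * f[i];
        uy  += cy[i] * f[i];
    }
    ux /= rho;  uy /= rho;

    double usq = ux * ux + uy * uy;                 // u^2
    for (int i = 0; i < 9; ++i) {
        double cu = cx[i] * ux + cy[i] * uy;        // c_i . u
        // f_i^eq = rho*w_i*(1 + 3 c.u + 4.5 (c.u)^2 - 1.5 u^2), i.e. c_s^2 = 1/3
        feq[i] = rho * w[i] * (1.0 + 3.0 * cu + 4.5 * cu * cu - 1.5 * usq);
    }
}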
LBM Implementations
Memory Management (f_i)
◮ Uncoalesced: one vector of Nx*Ny nodes, each node storing its 9 lattice velocities contiguously (array of structures)
◮ Coalesced: 9 vectors of size Nx*Ny, one per lattice velocity (structure of arrays)
◮ Blended: a combination of both layouts
Approaches
◮ push: macro − collide − stream, or collide − stream − macro (requires synchronization points)
◮ pull: stream − macro − collide (no synchronization points)
Granularity
◮ Fine: one thread per fluid node (GPU); see the kernel sketch below
◮ Coarse: one thread per multiple fluid nodes (multicore and Phi)
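A minimal sketch of the fine-grain pull approach on the coalesced layout (one CUDA thread per node), reusing the cx/cy tables and d2q9_equilibrium from the sketch above. The periodic wrap and the names f_src/f_dst are illustrative assumptions, and boundary conditions are omitted:

__global__ void lbm_pull_step(const double *f_src, double *f_dst,
                              int Nx, int Ny, double omega)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= Nx || y >= Ny) return;
    int idx  = y * Nx + x;
    int size = Nx * Ny;                       // SoA: population i of node idx at f[i*size + idx]

    // Pull: gather population i from the neighbour the particles come from
    double f[9];
    for (int i = 0; i < 9; ++i) {
        int xn = (x - cx[i] + Nx) % Nx;       // periodic wrap (boundary handling omitted)
        int yn = (y - cy[i] + Ny) % Ny;
        f[i] = f_src[i * size + yn * Nx + xn];
    }

    // Macroscopic values + BGK collision, written to the destination lattice
    double feq[9];
    d2q9_equilibrium(f, feq);
    for (int i = 0; i < 9; ++i)
        f_dst[i * size + idx] = f[i] - omega * (f[i] - feq[i]);
}

The pull form reads the neighbours' populations from the previous time level and writes only the local node, which is why it needs no synchronization points between streaming and collision.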
Performance
GPU, coalesced layout, fine-grain approach
[Figure: MFLUPS (up to ~550) vs number of nodes (×10⁶) for pull (stream-macro-collide), Sailfish push (macro-collide-stream) and push (collide-stream-macro)]
Multicore, coarse-grain approach
[Figure: LBM performance on the Intel Xeon — MFLUPS (up to ~140) vs number of nodes (×10⁶) for the uncoalesced, coalesced and hybrid layouts]
Xeon Phi, blended layout, coarse-grain approach
[Figure: LBM performance on the Intel Xeon Phi (240 threads) — MFLUPS (up to ~500) vs number of nodes (×10⁶) for the uncoalesced, coalesced and hybrid layouts]
Outline
Platforms
Pressure Poisson Equation
Lattice-Boltzmann Method
Solid Fluid Interaction
Multi-Domain Grid Refinement
Future Work
Solid-Fluid Interaction based on coupling LBM-IBM
1. Interpolating velocities (fluid → solid)
◮ $V^{Lg}_i \mathrel{+}= I(U^{*}[C^{Sp}_j])$
2. Computing the $F^{ib}$ on the Lagrangian points
◮ $F^{ib}_i = U^{d} - V^{Lg}_i$
3. Spreading the force (solid → fluid)
◮ $f^{ib}(C^{Sp}_j) \mathrel{+}= S(F^{ib}_i)$
4. Including the $f^{ib}$ in $f_i$ (IB → LBM)
◮ $F_i = \left(1 - \frac{1}{2\tau}\right)\omega_i \left[ \frac{\mathbf{c}_i - \mathbf{u}}{c_s^2} + \frac{\mathbf{c}_i \cdot \mathbf{u}}{c_s^4}\,\mathbf{c}_i \right] \cdot \mathbf{f}^{ib}$
◮ $f_i^{*}(\mathbf{x}, t + \Delta t) = f_i(\mathbf{x}, t) - \frac{\Delta t}{\tau}\left( f_i(\mathbf{x}, t) - f_i^{eq}(\mathbf{x}, t) \right) + F_i$
◮ $f_i(\mathbf{x} + \mathbf{c}_i \Delta t, t + \Delta t) = f_i^{*}(\mathbf{x}, t + \Delta t)$
◮ i = 0 … #Lagrangian points; j = 0 … #support (Cartesian) points per Lagrangian point i
IBM Implementation and Performance
Multicore
◮ Coarse grain
◮ A set of Lagrangian nodes per thread/core
GPU
◮ Fine grain (2 kernels)
◮ One thread per Lagrangian node
◮ Atomic functions (see the sketch below)
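A minimal sketch of the fine-grain GPU version (illustrative data layout and names; the original uses two kernels, folded into one here for brevity). One thread per Lagrangian node interpolates the fluid velocity from its support points, computes the IB force against the desired velocity, and spreads it back with atomic additions, since neighbouring Lagrangian nodes may share support points. Single precision is used so that atomicAdd is natively available on Kepler-class GPUs:

__global__ void ibm_force(const int   *support_idx, // [nLag*nSup] Cartesian indices
                          const float *support_w,   // [nLag*nSup] interpolation weights
                          const float *ux, const float *uy,     // fluid velocity
                          const float *udx, const float *udy,   // desired (solid) velocity
                          float *fibx, float *fiby,             // IB force on the fluid
                          int nLag, int nSup)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nLag) return;

    // 1. Interpolate the fluid velocity onto Lagrangian node i
    float vx = 0.f, vy = 0.f;
    for (int j = 0; j < nSup; ++j) {
        int   c = support_idx[i * nSup + j];
        float w = support_w[i * nSup + j];
        vx += w * ux[c];
        vy += w * uy[c];
    }
    // 2. IB force: F_ib = Ud - V
    float Fx = udx[i] - vx;
    float Fy = udy[i] - vy;
    // 3. Spread the force back to the support points (atomics: supports may be shared)
    for (int j = 0; j < nSup; ++j) {
        int   c = support_idx[i * nSup + j];
        float w = support_w[i * nSup + j];
        atomicAdd(&fibx[c], w * Fx);
        atomicAdd(&fiby[c], w * Fy);
    }
}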
[Figure: mapping of the Lagrangian points of the solid onto CUDA blocks and threads]
[Figure: speedup (up to ~15x) vs number of Lagrangian nodes (up to 2.5×10⁵) for the GPU and multicore implementations]
◮ Accelerating Solid-Fluid Interaction based on the Immersed Boundary
method on multicore and GPU architectures. The Journal of
Supercomputing 2014.
LBM-IBM (Multicore-GPU) Implementation
[Figure: GPU scheduler and heterogeneous scheduler — at each step the GPU runs the LBM prediction over the fluid while the support points are transferred to the CPU, the CPU applies the IB correction, the corrected supports and fluid data are transferred back, and results are optionally written; in the heterogeneous scheduler the CPU additionally advances a local LBM subdomain, overlapping steps t, t+1 and t+2]
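A minimal host-side sketch of how such an overlap can be organised with CUDA streams (hypothetical buffer and function names; the actual scheduler, illustrated above, transfers the support points and the fluid separately and also supports a CPU-local LBM block). Here the CPU-side IB correction runs while the GPU computes the LBM prediction, using the support values transferred at the end of the previous step:

// Hypothetical CPU-side IB correction (e.g. the coarse-grain multicore version).
void ibm_force_cpu(const float *support_u, float *fib, int nLag, int nSup);

void run_coupled_lbm_ib(double *d_f0, double *d_f1,             // LBM populations (device)
                        float *d_support_u, float *h_support_u, // support velocities
                        float *d_fib, float *h_fib,             // IB forces
                        size_t supBytes, size_t fibBytes,
                        int Nx, int Ny, double omega,
                        int nLag, int nSup, int nsteps,
                        dim3 grid, dim3 block)
{
    cudaStream_t s_lbm, s_copy;          // pinned host memory assumed for true async copies
    cudaStreamCreate(&s_lbm);
    cudaStreamCreate(&s_copy);

    for (int t = 0; t < nsteps; ++t) {
        // GPU: LBM prediction for step t (asynchronous on its own stream)
        lbm_pull_step<<<grid, block, 0, s_lbm>>>(t % 2 ? d_f1 : d_f0,
                                                 t % 2 ? d_f0 : d_f1, Nx, Ny, omega);

        // CPU, overlapped with the kernel: IB correction using the support
        // velocities transferred at the end of the previous step
        ibm_force_cpu(h_support_u, h_fib, nLag, nSup);

        // Wait for the GPU, then exchange: forces to the device, new supports to the host
        // (the device-side gather of support-point velocities is omitted in this sketch)
        cudaStreamSynchronize(s_lbm);
        cudaMemcpyAsync(d_fib, h_fib, fibBytes, cudaMemcpyHostToDevice, s_copy);
        cudaMemcpyAsync(h_support_u, d_support_u, supBytes, cudaMemcpyDeviceToHost, s_copy);
        cudaStreamSynchronize(s_copy);
    }
    cudaStreamDestroy(s_lbm);
    cudaStreamDestroy(s_copy);
}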
[Figure: MFLUPS (up to ~500) vs number of nodes (×10⁶) for LBM alone and for coupled LBM-IB with 0.5% and 1% Lagrangian nodes; two panels]
◮ Accelerating Solid-Fluid Interaction using Lattice-Boltzmann and
Immersed Boundary Coupled Simulations on Heterogeneous Platforms.
ICCS 2014.
LBM-IBM (Multicore-Phi) Implementation
[Figure: Xeon-Phi co-execution with double buffering — each step reads one population array and writes the other (F1/F2 alternate every step); the written array is copied from the Xeon to the Phi before the step (F_PHI = F_Xeon) and back afterwards (F_Xeon = F_PHI), with LBM running on one device and LBM+IB on the other over steps T = 0, 1, 2, …]
[Figure: LBM performance on the Xeon Phi (240 threads) — MFLUPS (up to ~500) vs number of nodes (×10⁶) for the uncoalesced, coalesced and hybrid layouts]
[Figure: LBM-IB performance with Xeon/Phi overlap — MFLUPS (up to ~500) vs Lx_Xeon for domain sizes 2560×800, 3200×1000, 3840×1200, 4480×1300 and 5120×1400]
◮ Fluid-Solid (Lattice-Boltzmann & Immersed-Boundary) Simulations over
Heterogeneous Platforms (Multicore-GPU & Multicore-Phi). Journal of
Computational Science (Submitted)
Outline
Platforms
Pressure Poisson Equation
Lattice-Boltzmann Method
Solid Fluid Interaction
Multi-Domain Grid Refinement
Future Work
Multi-Domain Grid Refinement over LBM
Rescaling
◮ $\delta x(t)_f = \delta x(t)_c / 2$ (the fine-grid space and time steps are half the coarse ones)
◮ $\omega_f = \frac{2\,\omega_c}{4 - \omega_c}$
Coarse → Fine Interpolation
◮ $f_{i,f}(x_{c \to f}) = f_i^{eq}(\rho(x_{c \to f}), \mathbf{u}(x_{c \to f})) + \frac{\omega_c}{2\,\omega_f}\, f_{i,c}^{neq}(x_{c \to f})$
Fine → Coarse Interpolation
◮ $f_{i,c}(x_{f \to c}) = f_i^{eq}(\rho(x_{f \to c}), \mathbf{u}(x_{f \to c})) + \frac{2\,\omega_f}{\omega_c}\, f_{i,f}^{neq}(x_{f \to c})$
[Figure: coarse and fine grids overlapping at the refinement interface, with fine-to-coarse and coarse-to-fine transfers]
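The rescaling formulas above map directly onto small device helpers. A minimal sketch (illustrative names; the equilibrium and non-equilibrium parts are assumed to be computed beforehand, e.g. with the d2q9_equilibrium sketch):

// omega_f = 2*omega_c / (4 - omega_c)
__host__ __device__ double rescale_omega(double omega_c)
{
    return 2.0 * omega_c / (4.0 - omega_c);
}

// Coarse -> fine: f_{i,f} = f_i^eq + (omega_c / (2*omega_f)) * f_{i,c}^neq
__device__ double rescale_coarse_to_fine(double feq, double fneq_c,
                                         double omega_c, double omega_f)
{
    return feq + (omega_c / (2.0 * omega_f)) * fneq_c;
}

// Fine -> coarse: f_{i,c} = f_i^eq + (2*omega_f / omega_c) * f_{i,f}^neq
__device__ double rescale_fine_to_coarse(double feq, double fneq_f,
                                         double omega_c, double omega_f)
{
    return feq + (2.0 * omega_f / omega_c) * fneq_f;
}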
Implementation and Performance
GPU approach
[Figure: sequence of one coarse step — stream-collide-macroscopic on the coarse grid, two stream-collide-macroscopic sweeps on the fine grid, coarse-to-fine communication (temporal and spatial interpolation on the fine side) and fine-to-coarse communication (spatial interpolation on the coarse side)]
Multicore-GPU approach
[Figure: heterogeneous approach — the fine grid and the coarse grid advance concurrently from t to t+1, exchanging fine-to-coarse (spatial, coarse side) and coarse-to-fine (spatial/temporal, fine side) interface data through communication steps]
Performance
[Figure: MFLUPS (up to ~500) vs number of nodes (×10⁶) for the GPU-only and heterogeneous (Top-Mul, Top-GPU) versions with fine/coarse size ratios 0.25x, 0.5x, 1x and 2x]
[Figure: percentage (up to ~40%) for the Top-Mul and Top-GPU versions across ratios 0.25, 0.5, 1 and 2]
◮ A Fast Multi-Domain Heterogeneous Implementation for Lattice-Boltzmann
Simulations. Journal of Supercomputing (Major Revisions)
Outline
Platforms
Pressure Poisson Equation
Lattice-Boltzmann Method
Solid Fluid Interaction
Multi-Domain Grid Refinement
Future Work