HPC and CFD. Main Contributions up to Now (An Overview) and Future Work
Pedro Valero-Lara
Basque Center for Applied Mathematics
October 2, 2014
Motivation
Accelerating CFD problems using heterogeneous platforms (multicore CPUs + hardware accelerators)
HPC cluster
◮ MPI
Multicore CPU
◮ ICC, OpenMP, PGI
NVIDIA GPUs
◮ CUDA, PGI
Intel Xeon Phi
◮ ICC, OpenMP
CFD (fluid dynamics)
◮ Pressure Poisson Equation
◮ Lattice-Boltzmann Method
◮ Solid-Fluid Interaction: Immersed-Boundary Method
◮ Grid refinement: Multi-Domain Lattice-Boltzmann Method
Architectures
[Figure: HPC cluster — nodes (CPU, memory, discs) connected through link connectors and a router/switch]
[Figure: GPU architecture — multiprocessors with cores, shared memory and a control unit, attached to global memory]
Heterogeneous Architectures
[Figure: heterogeneous node — Intel multicore CPUs with DDR3 memory (51.2 GB/s) connected by QPI (25.6 GB/s), IOH/ICH and disc, plus a GPU/Phi accelerator attached via PCIe 16x (8 GB/s) with its own GDDR5 memory (208 GB/s GPU bus, 320 GB/s Phi bus)]
Intel Xeon Phi Architecture
[Figure: Intel Xeon Phi — vector cores with coherent caches connected by an interprocessor network, plus memory and I/O interfaces and fixed-function logic]
Outline
Platforms
Pressure Poisson Equation
Lattice-Boltzmann Method
Solid Fluid Interaction
Multi-Domain Grid Refinement
Future Work
Platforms
Platform         Intel Xeon           NVIDIA GPU            Intel Xeon Phi
Model            E5-520/E5-2670       Kepler K20c           5110P
Cores            8/16                 2496                  60
On-chip mem.     L1 32KB (core)       SM 16/48KB (MP)       L1 32KB (core)
                 L2 512KB (unified)   L1 48/16KB (MP)       L2 256KB (core)
                 L3 20MB (unified)    L2 768KB (unified)    L2 30MB (coherent)
Memory           64/32GB DDR3         5GB GDDR5             8GB GDDR5
Bandwidth        51.2 GB/s            208 GB/s              320 GB/s
Poisson          ✓                    ✓                     ✗
LBM              ✓                    ✓                     ✓
Solid-Fluid      ✓                    ✓                     ✓
Mesh Refinement  ✓                    ✓                     ✗
Outline
Platforms
Pressure Poisson Equation
Lattice-Boltzmann Method
Solid Fluid Interaction
Multi-Domain Grid Refinement
Future Work
Pressure Poisson Equation (2D Separable Elliptic Equation)
\[
\frac{\partial}{\partial u}\left(a(u)\frac{\partial x}{\partial u}\right) + b(u)\frac{\partial x}{\partial u} + c(u)\,x + \frac{\partial}{\partial v}\left(d(v)\frac{\partial x}{\partial v}\right) + e(v)\frac{\partial x}{\partial v} + f(v)\,x = g(u,v)
\]
Using Dirichlet or Neumann boundary conditions we obtain a linear system Ax = g:
\[
A = \begin{pmatrix}
B_1 & C_1 &        &         &         \\
A_2 & B_2 & C_2    &         &         \\
    & \ddots & \ddots & \ddots &       \\
    &     & A_{n-1} & B_{n-1} & C_{n-1} \\
    &     &        & A_n     & B_n
\end{pmatrix},
\qquad A_i = a_i I,\quad B_i = B + b_i I,\quad C_i = c_i I
\]
where I is the identity matrix, a_i, b_i, c_i are scalars and B is a tridiagonal matrix.
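Since every block is either a scalar multiple of the identity or the shared tridiagonal matrix B shifted by a scalar, the whole system can be stored compactly. A minimal host-side sketch of one possible container (illustrative names, not the layout of the original code):

#include <vector>

// Compact storage for the block-tridiagonal system above:
// A_i = a_i*I, B_i = B + b_i*I, C_i = c_i*I, with B a shared m x m tridiagonal matrix.
struct BlockTridiagonalSystem {
    int n;                        // number of block rows
    int m;                        // size of each block
    std::vector<double> a, b, c;  // n scalars each (a_1 and c_n are not used)
    std::vector<double> B_sub;    // m-1 entries: sub-diagonal of B
    std::vector<double> B_diag;   // m   entries: main diagonal of B
    std::vector<double> B_sup;    // m-1 entries: super-diagonal of B
    std::vector<double> g;        // right-hand side, n*m entries
};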
Block Tridiagonal Method (BLKTRI-FISHPACK)
It is divided into three stages:
1. Preprocessing (this step is required to stabilize the method)
◮ A set of intermediate results (roots) is obtained
◮ The right-hand side is divided into two terms, q and p
2. Reduction
◮ The following equations are solved:
\[
q_i^{(r)} = (B_i^{r})^{-1}\, B_{i-2^{r-1}}^{r-1}\, B_{i+2^{r-1}}^{r-1}\, p_i^{(r)}
\]
\[
p_i^{(r+1)} = \alpha_i^{r}\,(B_{i-2^{r-1}}^{r-1})^{-1} q_{i-2^{r}}^{(r)} + \gamma_i^{r}\,(B_{i+2^{r-1}}^{r-1})^{-1} q_{i+2^{r}}^{(r)} - p_i^{(r)}
\]
3. Substitution
◮ The following equation is solved:
\[
x_i = (B_i^{r})^{-1}\, B_{i-2^{r-1}}^{r-1}\, B_{i+2^{r-1}}^{r-1}
\left[\, p_i^{(r)} - \alpha_i^{r}\,(B_{i-2^{r-1}}^{r-1})^{-1} x_{i-2^{r}} - \gamma_i^{r}\,(B_{i+2^{r-1}}^{r-1})^{-1} x_{i+2^{r}} \right]
\]
B stores the roots, and α and γ follow the expressions:
\[
\alpha_i^{(r)} = \prod_{j=i-2^{r}+1}^{i} a_j
\qquad\text{and}\qquad
\gamma_i^{(r)} = \prod_{j=i}^{i+2^{r}-1} c_j
\]
The highest computational cost is found in computing:
◮ Tridiagonal systems, scalar-vector multiplications and vector additions
Parallel Implementation
Degree of parallelism:
Reduction
◮ The number of independent terms is halved at each step
Substitution
◮ The number of independent terms is doubled at each step
Approaches:
Coarse grain
◮ One thread per multiple terms (Thomas algorithm); see the sketch below
Fine grain
◮ One thread per element (CR, PCR, CR-PCR)
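As a minimal sketch of the coarse-grain approach (illustrative names and data layout, not the original code), each CUDA thread solves one complete tridiagonal system with the Thomas algorithm:

// One thread per tridiagonal system. Layout assumption: system s stores its n entries
// of each diagonal contiguously (lower, diag, upper, rhs have num_systems*n entries).
__global__ void thomas_per_thread(const double *lower, const double *diag,
                                  const double *upper, const double *rhs,
                                  double *x, double *cp, double *dp,
                                  int n, int num_systems)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= num_systems) return;

    const double *a = lower + s * n;   // sub-diagonal (a[0] unused)
    const double *b = diag  + s * n;   // main diagonal
    const double *c = upper + s * n;   // super-diagonal (c[n-1] unused)
    const double *d = rhs   + s * n;
    double *cprime = cp + s * n;       // scratch space
    double *dprime = dp + s * n;
    double *xs = x + s * n;

    // Forward elimination
    cprime[0] = c[0] / b[0];
    dprime[0] = d[0] / b[0];
    for (int i = 1; i < n; ++i) {
        double m = b[i] - a[i] * cprime[i - 1];
        cprime[i] = c[i] / m;
        dprime[i] = (d[i] - a[i] * dprime[i - 1]) / m;
    }
    // Backward substitution
    xs[n - 1] = dprime[n - 1];
    for (int i = n - 2; i >= 0; --i)
        xs[i] = dprime[i] - cprime[i] * xs[i + 1];
}

This per-system layout is simple but not coalesced on the GPU; in practice the systems would be interleaved so that consecutive threads access consecutive addresses, while the fine-grain CR/PCR kernels assign one thread per row instead.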
[Figure: degree of parallelism — high to low along the reduction, low to high along the substitution]
[Figure: mapping of the independent systems onto CUDA blocks and threads for the coarse-grain and fine-grain approaches]
Performance
[Figure: speedup (up to 10x) per step (1-10) for 1 CPU (8 threads), 2 CPUs (16/32 threads) and 1 GPU with Thomas (TA), CR, PCR and CR-PCR kernels; two panels]
Heterogeneous Approach
[Figure: distribution of the work between CPU and GPU in the heterogeneous approach]
[Figure: speedup (up to 3.5x) over problem sizes 128x128 to 1024x1024 for 1CPU-8Th, 1CPU(8Th)+1GPU(PCR), 2CPUs-16Th and 2CPUs(16Th)+1GPU(PCR)]
◮ Block Tridiagonal Solvers on Heterogeneous Architectures. ISPA 2012
3D Separable Elliptic Equation
\[
\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + \frac{\partial^2 u}{\partial z^2} = f(x, y, z)
\]
A periodic condition is applied in one of the directions (FFT):
◮ $u_{n,j,k} = \frac{1}{N}\sum_{l=1}^{N} u_{l,j,k}\, e^{-i\alpha(n-1)}$ with $\alpha = \frac{2\pi(l-1)}{N}$
\[
\frac{u_{l,j+1,k} + u_{l,j-1,k}}{\Delta y^2} + \frac{u_{l,j,k+1} + u_{l,j,k-1}}{\Delta z^2} + \beta_l\, u_{l,j,k} = F_{l,j,k}, \quad l = 1 \cdots N
\]
with
\[
\beta_l / 2 = \frac{\cos(\alpha) - 1}{\Delta x^2} - \frac{1}{\Delta y^2} - \frac{1}{\Delta z^2}
\]
[Figure: the FFT along the periodic direction turns the 3D problem with N discretized points into N uncoupled 2D problems, each solved by reduction and substitution]
Parallel Implementation (MPI cluster, OpenMP multicore CPUs, CUDA GPUs)
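A minimal sketch of the decoupling step on the GPU, assuming the right-hand side is stored with x as the slowest-varying index (illustrative names; the original implementation may organize this differently). Batched 1D FFTs along the periodic direction produce the N independent 2D problems, each with its own coefficient β_l:

#include <cufft.h>

void poisson3d_fft_decouple(cufftDoubleComplex *d_rhs, // device array, size N*Ny*Nz
                            int N, int Ny, int Nz)
{
    // One 1D transform of length N for every (j,k) line: Ny*Nz batched FFTs.
    // With x slowest-varying, consecutive elements of one transform are strided by Ny*Nz
    // and consecutive batches start one element apart.
    cufftHandle plan;
    int n[1] = { N };
    cufftPlanMany(&plan, 1, n,
                  n, Ny * Nz, 1,    // inembed, istride, idist
                  n, Ny * Nz, 1,    // onembed, ostride, odist
                  CUFFT_Z2Z, Ny * Nz);
    cufftExecZ2Z(plan, d_rhs, d_rhs, CUFFT_FORWARD);

    // ... for each l = 0..N-1: solve the decoupled 2D problem with coefficient
    //     beta_l = 2*((cos(alpha)-1)/dx^2 - 1/dy^2 - 1/dz^2)
    //     (hypothetical per-wavenumber 2D solver, one problem per worker) ...

    cufftExecZ2Z(plan, d_rhs, d_rhs, CUFFT_INVERSE);  // transform back; scale by 1/N afterwards
    cufftDestroy(plan);
}

Each of the N transformed planes is independent, so they can be distributed over MPI ranks, OpenMP threads or GPUs, which is what makes the method attractive on heterogeneous platforms.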
[Figure: speedup (up to 16x) per step (1-9) for 1 CPU (8 threads), 2 CPUs (16/32 threads) and 1, 2 and 4 GPUs; two panels]
[Figure: timeline of the heterogeneous execution, distributing the uncoupled problems between CPUs and GPUs]
[Figure: speedup (up to 16x) over problem sizes 128x128x128, 256x256x256 and 512x512x512 for 1CPU-8Th, 2CPUs-16Th, 1GPU, 1CPU-1GPU, 2GPUs, 2CPUs-2GPUs, 4GPUs and 2CPUs-4GPUs]
◮ Fast finite difference Poisson solvers on heterogeneous architectures.
Computer Physics Communications 2014
Outline
Platforms
Pressure Poisson Equation
Lattice-Boltzmann Method
Solid Fluid Interaction
Multi-Domain Grid Refinement
Future Work
Lattice-Boltzmann Method (LBM) Formulation
BGK formulation
\[
f_i(\mathbf{x} + \mathbf{e}_i \Delta t, t + \Delta t) - f_i(\mathbf{x}, t) = -\frac{\Delta t}{\tau}\left( f_i(\mathbf{x}, t) - f_i^{eq}(\mathbf{x}, t) \right)
\]
Equilibrium distribution (Maxwell-Boltzmann)
\[
f_i^{eq} = \rho\, \omega_i \left[ 1 + \frac{\mathbf{c}_i \cdot \mathbf{u}}{c_s^2} + \frac{(\mathbf{c}_i \cdot \mathbf{u})^2}{2 c_s^4} - \frac{\mathbf{u}^2}{2 c_s^2} \right]
\]
$\omega_0 = 4/9$; $\omega_i = 1/9,\; i = 1\cdots4$; $\omega_i = 1/36,\; i = 5\cdots8$
$\mathbf{c}_0 = (0,0)$; $\mathbf{c}_i = (\pm1, 0), (0, \pm1),\; i = 1\cdots4$; $\mathbf{c}_i = (\pm1, \pm1),\; i = 5\cdots8$
[Figure: D2Q9 lattice — discrete velocities c_0 … c_8 and weights ω_0 … ω_8]
Given $f_i(\mathbf{x}, t)$ compute
◮ $\rho = \sum_i f_i(\mathbf{x}, t)$
◮ $\rho \mathbf{u} = \sum_i \mathbf{e}_i f_i(\mathbf{x}, t)$
Collision
◮ $f_i^{*}(\mathbf{x}, t + \Delta t) = f_i(\mathbf{x}, t) - \frac{\Delta t}{\tau}\left( f_i(\mathbf{x}, t) - f_i^{eq}(\mathbf{x}, t) \right)$
Streaming
◮ $f_i(\mathbf{x} + \mathbf{c}_i \Delta t, t + \Delta t) = f_i^{*}(\mathbf{x}, t + \Delta t)$
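The macroscopic moments and the equilibrium above translate almost directly into code. A minimal D2Q9 sketch in CUDA (illustrative velocity ordering, lattice units with c_s² = 1/3; not the original code):

// D2Q9 weights and discrete velocities (ordering is an assumption of this sketch)
__constant__ double w[9]  = { 4.0/9.0, 1.0/9.0, 1.0/9.0, 1.0/9.0, 1.0/9.0,
                              1.0/36.0, 1.0/36.0, 1.0/36.0, 1.0/36.0 };
__constant__ int    cx[9] = { 0, 1, -1, 0,  0, 1, -1, -1,  1 };
__constant__ int    cy[9] = { 0, 0,  0, 1, -1, 1, -1,  1, -1 };

__device__ void d2q9_equilibrium(const double f[9], double feq[9])
{
    // Macroscopic moments: rho = sum_i f_i, rho*u = sum_i c_i f_i
    double rho = 0.0, ux = 0.0, uy = 0.0;
    for (int i = 0; i < 9; ++i) {
        rho += f[i];
        ux  += cx[i] * f[i];
        uy  += cy[i] * f[i];
    }
    ux /= rho;  uy /= rho;

    double usq = ux * ux + uy * uy;                 // u^2
    for (int i = 0; i < 9; ++i) {
        double cu = cx[i] * ux + cy[i] * uy;        // c_i . u
        // f_i^eq = rho*w_i*(1 + 3 c.u + 4.5 (c.u)^2 - 1.5 u^2), i.e. c_s^2 = 1/3
        feq[i] = rho * w[i] * (1.0 + 3.0 * cu + 4.5 * cu * cu - 1.5 * usq);
    }
}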
LBM Implementations
Memory Management (f_i)
◮ Uncoalesced: one vector of Nx*Ny nodes, each node storing its 9 lattice velocities contiguously (array of structures)
◮ Coalesced: 9 vectors of size Nx*Ny, one per lattice velocity (structure of arrays)
◮ Blended: a combination of both layouts
Approaches
◮ push: macro − collide − stream, or collide − stream − macro (requires synchronization points)
◮ pull: stream − macro − collide (no synchronization points)
Granularity
◮ Fine: one thread per fluid node (GPU); see the kernel sketch below
◮ Coarse: one thread per multiple fluid nodes (multicore and Phi)
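A minimal sketch of the fine-grain pull approach on the coalesced layout (one CUDA thread per node), reusing the cx/cy tables and d2q9_equilibrium from the sketch above. The periodic wrap and the names f_src/f_dst are illustrative assumptions, and boundary conditions are omitted:

__global__ void lbm_pull_step(const double *f_src, double *f_dst,
                              int Nx, int Ny, double omega)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= Nx || y >= Ny) return;
    int idx  = y * Nx + x;
    int size = Nx * Ny;                       // SoA: population i of node idx at f[i*size + idx]

    // Pull: gather population i from the neighbour the particles come from
    double f[9];
    for (int i = 0; i < 9; ++i) {
        int xn = (x - cx[i] + Nx) % Nx;       // periodic wrap (boundary handling omitted)
        int yn = (y - cy[i] + Ny) % Ny;
        f[i] = f_src[i * size + yn * Nx + xn];
    }

    // Macroscopic values + BGK collision, written to the destination lattice
    double feq[9];
    d2q9_equilibrium(f, feq);
    for (int i = 0; i < 9; ++i)
        f_dst[i * size + idx] = f[i] - omega * (f[i] - feq[i]);
}

The pull form reads the neighbours' populations from the previous time level and writes only the local node, which is why it needs no synchronization points between streaming and collision.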
Performance
GPU, coalesced layout, fine-grain approach
[Figure: MFLUPS (up to ~550) vs number of nodes (×10⁶) for pull (stream-macro-collide), Sailfish push (macro-collide-stream) and push (collide-stream-macro)]
Multicore, coarse-grain approach
[Figure: LBM performance on the Intel Xeon — MFLUPS (up to ~140) vs number of nodes (×10⁶) for the uncoalesced, coalesced and hybrid layouts]
Xeon Phi, blended layout, coarse-grain approach
[Figure: LBM performance on the Intel Xeon Phi (240 threads) — MFLUPS (up to ~500) vs number of nodes (×10⁶) for the uncoalesced, coalesced and hybrid layouts]
Outline
Platforms
Pressure Poisson Equation
Lattice-Boltzmann Method
Solid Fluid Interaction
Multi-Domain Grid Refinement
Future Work
Solid-Fluid Interaction based on coupling LBM-IBM
1. Interpolating velocities (fluid → solid)
◮ $V^{Lg}_i \mathrel{+}= I(U^{*}[C^{Sp}_j])$
2. Computing the $F^{ib}$ on the Lagrangian points
◮ $F^{ib}_i = U^{d} - V^{Lg}_i$
3. Spreading the force (solid → fluid)
◮ $f^{ib}(C^{Sp}_j) \mathrel{+}= S(F^{ib}_i)$
4. Including the $f^{ib}$ in $f_i$ (IB → LBM)
◮ $F_i = \left(1 - \frac{1}{2\tau}\right)\omega_i \left[ \frac{\mathbf{c}_i - \mathbf{u}}{c_s^2} + \frac{\mathbf{c}_i \cdot \mathbf{u}}{c_s^4}\,\mathbf{c}_i \right] \cdot \mathbf{f}^{ib}$
◮ $f_i^{*}(\mathbf{x}, t + \Delta t) = f_i(\mathbf{x}, t) - \frac{\Delta t}{\tau}\left( f_i(\mathbf{x}, t) - f_i^{eq}(\mathbf{x}, t) \right) + F_i$
◮ $f_i(\mathbf{x} + \mathbf{c}_i \Delta t, t + \Delta t) = f_i^{*}(\mathbf{x}, t + \Delta t)$
◮ i = 0 … #Lagrangian points; j = 0 … #support (Cartesian) points per Lagrangian point i
IBM Implementation and Performance
Multicore
◮ Coarse grain
◮ A set of Lagrangian nodes per thread/core
GPU
◮ Fine grain (2 kernels)
◮ One thread per Lagrangian node
◮ Atomic functions (see the sketch below)
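A minimal sketch of the fine-grain GPU version (illustrative data layout and names; the original uses two kernels, folded into one here for brevity). One thread per Lagrangian node interpolates the fluid velocity from its support points, computes the IB force against the desired velocity, and spreads it back with atomic additions, since neighbouring Lagrangian nodes may share support points. Single precision is used so that atomicAdd is natively available on Kepler-class GPUs:

__global__ void ibm_force(const int   *support_idx, // [nLag*nSup] Cartesian indices
                          const float *support_w,   // [nLag*nSup] interpolation weights
                          const float *ux, const float *uy,     // fluid velocity
                          const float *udx, const float *udy,   // desired (solid) velocity
                          float *fibx, float *fiby,             // IB force on the fluid
                          int nLag, int nSup)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nLag) return;

    // 1. Interpolate the fluid velocity onto Lagrangian node i
    float vx = 0.f, vy = 0.f;
    for (int j = 0; j < nSup; ++j) {
        int   c = support_idx[i * nSup + j];
        float w = support_w[i * nSup + j];
        vx += w * ux[c];
        vy += w * uy[c];
    }
    // 2. IB force: F_ib = Ud - V
    float Fx = udx[i] - vx;
    float Fy = udy[i] - vy;
    // 3. Spread the force back to the support points (atomics: supports may be shared)
    for (int j = 0; j < nSup; ++j) {
        int   c = support_idx[i * nSup + j];
        float w = support_w[i * nSup + j];
        atomicAdd(&fibx[c], w * Fx);
        atomicAdd(&fiby[c], w * Fy);
    }
}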
[Figure: mapping of the Lagrangian points of the solid onto CUDA blocks and threads]
[Figure: speedup (up to ~15x) vs number of Lagrangian nodes (up to 2.5×10⁵) for the GPU and multicore implementations]
◮ Accelerating Solid-Fluid Interaction based on the Immersed Boundary
method on multicore and GPU architectures. The Journal of
Supercomputing 2014.
LBM-IBM (Multicore-GPU) Implementation
[Figure: GPU scheduler and heterogeneous scheduler — at each step the GPU runs the LBM prediction over the fluid while the support points are transferred to the CPU, the CPU applies the IB correction, the corrected supports and fluid data are transferred back, and results are optionally written; in the heterogeneous scheduler the CPU additionally advances a local LBM subdomain, overlapping steps t, t+1 and t+2]
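A minimal host-side sketch of how such an overlap can be organised with CUDA streams (hypothetical buffer and function names; the actual scheduler, illustrated above, transfers the support points and the fluid separately and also supports a CPU-local LBM block). Here the CPU-side IB correction runs while the GPU computes the LBM prediction, using the support values transferred at the end of the previous step:

// Hypothetical CPU-side IB correction (e.g. the coarse-grain multicore version).
void ibm_force_cpu(const float *support_u, float *fib, int nLag, int nSup);

void run_coupled_lbm_ib(double *d_f0, double *d_f1,             // LBM populations (device)
                        float *d_support_u, float *h_support_u, // support velocities
                        float *d_fib, float *h_fib,             // IB forces
                        size_t supBytes, size_t fibBytes,
                        int Nx, int Ny, double omega,
                        int nLag, int nSup, int nsteps,
                        dim3 grid, dim3 block)
{
    cudaStream_t s_lbm, s_copy;          // pinned host memory assumed for true async copies
    cudaStreamCreate(&s_lbm);
    cudaStreamCreate(&s_copy);

    for (int t = 0; t < nsteps; ++t) {
        // GPU: LBM prediction for step t (asynchronous on its own stream)
        lbm_pull_step<<<grid, block, 0, s_lbm>>>(t % 2 ? d_f1 : d_f0,
                                                 t % 2 ? d_f0 : d_f1, Nx, Ny, omega);

        // CPU, overlapped with the kernel: IB correction using the support
        // velocities transferred at the end of the previous step
        ibm_force_cpu(h_support_u, h_fib, nLag, nSup);

        // Wait for the GPU, then exchange: forces to the device, new supports to the host
        // (the device-side gather of support-point velocities is omitted in this sketch)
        cudaStreamSynchronize(s_lbm);
        cudaMemcpyAsync(d_fib, h_fib, fibBytes, cudaMemcpyHostToDevice, s_copy);
        cudaMemcpyAsync(h_support_u, d_support_u, supBytes, cudaMemcpyDeviceToHost, s_copy);
        cudaStreamSynchronize(s_copy);
    }
    cudaStreamDestroy(s_lbm);
    cudaStreamDestroy(s_copy);
}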
[Figure: MFLUPS (up to ~500) vs number of nodes (×10⁶) for LBM alone and for coupled LBM-IB with 0.5% and 1% Lagrangian nodes; two panels]
◮ Accelerating Solid-Fluid Interaction using Lattice-Boltzmann and
Immersed Boundary Coupled Simulations on Heterogeneous Platforms.
ICCS 2014.
LBM-IBM (Multicore-Phi) Implementation
[Figure: Xeon-Phi co-execution with double buffering — each step reads one population array and writes the other (F1/F2 alternate every step); the written array is copied from the Xeon to the Phi before the step (F_PHI = F_Xeon) and back afterwards (F_Xeon = F_PHI), with LBM running on one device and LBM+IB on the other over steps T = 0, 1, 2, …]
[Figure: LBM performance on the Xeon Phi (240 threads) — MFLUPS (up to ~500) vs number of nodes (×10⁶) for the uncoalesced, coalesced and hybrid layouts]
[Figure: LBM-IB performance with Xeon/Phi overlap — MFLUPS (up to ~500) vs Lx_Xeon for domain sizes 2560×800, 3200×1000, 3840×1200, 4480×1300 and 5120×1400]
◮ Fluid-Solid (Lattice-Boltzmann & Immersed-Boundary) Simulations over
Heterogeneous Platforms (Multicore-GPU & Multicore-Phi). Journal of
Computational Science (Submitted)
Outline
Platforms
Pressure Poisson Equation
Lattice-Boltzmann Method
Solid Fluid Interaction
Multi-Domain Grid Refinement
Future Work
Multi-Domain Grid Refinement over LBM
Rescaling
◮ $\delta x(t)_f = \delta x(t)_c / 2$ (the fine-grid space and time steps are half the coarse ones)
◮ $\omega_f = \frac{2\,\omega_c}{4 - \omega_c}$
Coarse → Fine Interpolation
◮ $f_{i,f}(x_{c \to f}) = f_i^{eq}(\rho(x_{c \to f}), \mathbf{u}(x_{c \to f})) + \frac{\omega_c}{2\,\omega_f}\, f_{i,c}^{neq}(x_{c \to f})$
Fine → Coarse Interpolation
◮ $f_{i,c}(x_{f \to c}) = f_i^{eq}(\rho(x_{f \to c}), \mathbf{u}(x_{f \to c})) + \frac{2\,\omega_f}{\omega_c}\, f_{i,f}^{neq}(x_{f \to c})$
[Figure: coarse and fine grids overlapping at the refinement interface, with fine-to-coarse and coarse-to-fine transfers]
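The rescaling formulas above map directly onto small device helpers. A minimal sketch (illustrative names; the equilibrium and non-equilibrium parts are assumed to be computed beforehand, e.g. with the d2q9_equilibrium sketch):

// omega_f = 2*omega_c / (4 - omega_c)
__host__ __device__ double rescale_omega(double omega_c)
{
    return 2.0 * omega_c / (4.0 - omega_c);
}

// Coarse -> fine: f_{i,f} = f_i^eq + (omega_c / (2*omega_f)) * f_{i,c}^neq
__device__ double rescale_coarse_to_fine(double feq, double fneq_c,
                                         double omega_c, double omega_f)
{
    return feq + (omega_c / (2.0 * omega_f)) * fneq_c;
}

// Fine -> coarse: f_{i,c} = f_i^eq + (2*omega_f / omega_c) * f_{i,f}^neq
__device__ double rescale_fine_to_coarse(double feq, double fneq_f,
                                         double omega_c, double omega_f)
{
    return feq + (2.0 * omega_f / omega_c) * fneq_f;
}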
Implementation and Performance
GPU approach
[Figure: sequence of one coarse step — stream-collide-macroscopic on the coarse grid, two stream-collide-macroscopic sweeps on the fine grid, coarse-to-fine communication (temporal and spatial interpolation on the fine side) and fine-to-coarse communication (spatial interpolation on the coarse side)]
Multicore-GPU approach
[Figure: heterogeneous approach — the fine grid and the coarse grid advance concurrently from t to t+1, exchanging fine-to-coarse (spatial, coarse side) and coarse-to-fine (spatial/temporal, fine side) interface data through communication steps]
Performance
[Figure: MFLUPS (up to ~500) vs number of nodes (×10⁶) for the GPU-only and heterogeneous (Top-Mul, Top-GPU) versions with fine/coarse size ratios 0.25x, 0.5x, 1x and 2x]
[Figure: percentage (up to ~40%) for the Top-Mul and Top-GPU versions across ratios 0.25, 0.5, 1 and 2]
◮ A Fast Multi-Domain Heterogeneous Implementation for Lattice-Boltzmann
Simulations. Journal of Supercomputing (Major Revisions)
Outline
Platforms
Pressure Poisson Equation
Lattice-Boltzmann Method
Solid Fluid Interaction
Multi-Domain Grid Refinement
Future Work