urbulence t - on-demand.gputechconf.com

High-resolution GPU codes for direct numerical simulation of turbulence

Alberto Vela-Martin, Jose I. Cardesa, Miguel P. Encinar and Javier Jiménez

E.T.S.I. Aeronáutica, UPM

Turbulence and high performance computing

Turbulence is a common phenomenon in fluid mechanics.

Practical importance: industrial processes, energy and aeronautics.

Related to energy saving and efficiency in transportation.

5% of world total energy spent in turbulent friction.

Turbulence and high performance computing

Highly complex and chaotic phenomena: high level of detail required.

Direct numerical simulations (DNS): turbulence simulated with all its relevant details, $N_{DOF} \sim Re^{9/4}$ degrees of freedom.

DNS simulation with $N \sim 10^9$ degrees of freedom and 30 million CPU hours.

Large DNSs computed in our group:

Hoyas and Jiménez 2006, channel at $Re_\tau = 2000$ ($6 \times 10^6$ CPU-hours, MareNostrum).

Sillero et al. 2013, boundary layer at $Re_\theta = 6600$.

Lozano-Durán et al. 2014, time-resolved channel at $Re_\tau = 4000$.
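
As a rough illustration of the $N_{DOF} \sim Re^{9/4}$ scaling quoted above, the sketch below compares the degrees of freedom needed at two Reynolds numbers. The proportionality constant is not given in the slides and is set to 1 here as an assumption, so only ratios are meaningful. The function name `dof_estimate` is hypothetical.

```python
def dof_estimate(reynolds, prefactor=1.0):
    """Order-of-magnitude DNS degrees of freedom, N_DOF ~ prefactor * Re**(9/4).

    The prefactor is an assumption (not given in the slides); use ratios only.
    """
    return prefactor * reynolds ** 2.25


# Doubling the Reynolds number multiplies the degrees of freedom by 2**(9/4).
ratio = dof_estimate(2.0e4) / dof_estimate(1.0e4)
print(f"DoF ratio when Re doubles: {ratio:.2f}")
```

Doubling the Reynolds number therefore costs almost a factor of five in degrees of freedom, before accounting for the shorter time step.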

DNS on GPUs

Regular domains and boundary conditions.

Simple but highly efficient and scalable codes.

CFD on single GPUs: simple homogeneous isotropic turbulence code (3 periodic directions) and channel code (2 periodic directions).

Outstanding performance, no device-to-host communications.

Include MPI on single-GPU codes: MPI all-to-all communications from host memory. Penalization caused by D2H and H2D memory transfers.

Optimization: asynchronous GPU-CPU execution. Overlapping.
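
The all-to-all communication mentioned above exchanges a decomposition along one direction for a decomposition along another. The sketch below emulates that data movement in a single process with plain NumPy; it is not MPI code and not the authors' implementation. The function `alltoall_transpose` and the array sizes are hypothetical.

```python
import numpy as np


def alltoall_transpose(slabs):
    """Emulate an all-to-all transpose among len(slabs) ranks in one process.

    slabs[r] has shape (nx_local, ny, nz): rank r owns a slab split along x.
    Entry s of the result has shape (nx, ny_local, nz): a slab split along y.
    """
    p = len(slabs)
    # Each rank cuts its slab into p blocks along y; block s is sent to rank s.
    send = [np.array_split(slab, p, axis=1) for slab in slabs]
    # Rank s stacks along x the blocks it receives from every rank r.
    return [np.concatenate([send[r][s] for r in range(p)], axis=0) for s in range(p)]


p, nx, ny, nz = 4, 8, 12, 3
field = np.arange(nx * ny * nz, dtype=np.float32).reshape(nx, ny, nz)
x_slabs = np.array_split(field, p, axis=0)  # decomposition along x
y_slabs = alltoall_transpose(x_slabs)       # decomposition along y
print(y_slabs[0].shape)  # (8, 3, 3)
```

In a multi-GPU run each block would first be copied from device to host memory, which is the D2H/H2D penalty the slide refers to.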

DNS on GPUs

Two basic configurations:

Isotropic turbulence; turbulent channel

Turbulent channel flow

Flow between two parallel walls.

Periodic boundary conditions in x and z, no-slip condition at both walls.

Mean pressure gradient in x → mean velocity profile U.

Turbulent channel flow

Formulation in $\omega_y$ and $\nabla^2 v$ (Kim, Moin and Moser 1987).

Fully dealiased pseudospectral method in x and z (cuFFT library).

High-resolution compact finite differences in y (7-point stencil on non-uniform grid): inversion of heptadiagonal matrices (custom CUDA kernels).

Temporal integration with 3rd-order low-storage Runge-Kutta.

Domain decomposition: y-z planes. MPI transpose to x-z planes.

Core of the code in single precision (some parts in double).
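
The slides do not say which third-order low-storage Runge-Kutta scheme is used. As an illustration only, the sketch below uses Williamson's classical three-stage coefficients as a stand-in, applied to a scalar test equation. It is fully explicit and omits the implicit treatment that the code applies to some terms. The function names are hypothetical.

```python
import math

# Williamson's three-stage, third-order, low-storage coefficients
# (an assumed stand-in; the slides do not name the scheme).
A = (0.0, -5.0 / 9.0, -153.0 / 128.0)
B = (1.0 / 3.0, 15.0 / 16.0, 8.0 / 15.0)


def rk3_low_storage(rhs, u, dt, n_steps):
    """Advance du/dt = rhs(u) using only two registers: u and q."""
    q = 0.0
    for _ in range(n_steps):
        for a, b in zip(A, B):
            q = a * q + dt * rhs(u)  # A[0] = 0 discards q from the previous step
            u = u + b * q
    return u


def error(n_steps):
    """Error at t = 1 for du/dt = -u, u(0) = 1, whose exact solution is exp(-t)."""
    u = rk3_low_storage(lambda v: -v, 1.0, 1.0 / n_steps, n_steps)
    return abs(u - math.exp(-1.0))


# Halving the time step should reduce the error by roughly 2**3.
print(error(10) / error(20))
```

Only `u` and `q` are stored per variable, which is what makes this family of schemes attractive when GPU memory limits the problem size.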

Channel code

Non-linear terms are the most expensive:

$\partial_t u = -u \cdot \nabla u - \nabla p + \nu \nabla^2 u$

6 complex-to-real FFTs + 5 MPI transposes (global communications)

3 real-to-complex FFTs + 3 MPI transposes (global communications)

Optimization: asynchronous GPU-CPU execution. Overlapping.
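
The FFT counts above come from evaluating the non-linear terms pseudospectrally: fields are transformed to real space, multiplied, and transformed back. The sketch below shows a dealiased product in one dimension. It assumes 3/2 zero-padding (the slides say "fully dealiased" without naming the rule) and uses NumPy's FFT rather than cuFFT. The function `dealiased_product` is hypothetical.

```python
import numpy as np


def dealiased_product(u_hat, v_hat, n):
    """Product of two real periodic fields given by their rfft on n points.

    Uses 3/2 zero-padding (an assumption) so that the retained modes
    0 .. n/2 - 1 of the product are free of aliasing errors.
    """
    m = 3 * n // 2   # size of the padded grid
    k = n // 2       # number of retained modes (Nyquist mode dropped)
    up = np.zeros(m // 2 + 1, dtype=complex)
    vp = np.zeros(m // 2 + 1, dtype=complex)
    up[:k] = u_hat[:k]
    vp[:k] = v_hat[:k]
    # The m / n factor undoes the 1/m normalization of irfft on the finer grid.
    u = np.fft.irfft(up, m) * (m / n)
    v = np.fft.irfft(vp, m) * (m / n)
    w_hat = np.fft.rfft(u * v) * (n / m)
    out = np.zeros(n // 2 + 1, dtype=complex)
    out[:k] = w_hat[:k]
    return out


n = 16
x = 2.0 * np.pi * np.arange(n) / n
u = np.cos(6.0 * x)
u_hat = np.fft.rfft(u)

naive = np.fft.rfft(u * u)                  # mode 12 aliases onto mode 4
clean = dealiased_product(u_hat, u_hat, n)  # only the mean survives
print(abs(naive[4]), abs(clean[4]))
```

Here `cos(6x)**2 = 1/2 + cos(12x)/2`. Mode 12 cannot be represented on 16 points, so the naive product folds it onto mode 4, whereas the padded product keeps only the mean.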

Non-linear convolution: overlapping

Compute stream | D2H stream | H2D stream | Host stream
calculate $u$ and $w$ | copy $v$ to host | |
calculate $\partial_y u$ | copy $u$ to host | | MPI transp. $v$
calculate $\partial_y w$ | copy $w$ to host | copy $v$ to device | MPI transp. $u$
calculate $\partial_{yy} \nabla^2 v$ | copy $\partial_y u$ to host | copy $u$ to device | MPI transp. $w$
calculate $\partial_{yy} \omega_y$ | copy $\partial_y w$ to host | copy $w$ to device | MPI transp. $\partial_y u$
FFT to real $v$ | | copy $\partial_y u$ to device | MPI transp. $\partial_y w$
FFT to real $u$ | | copy $\partial_y w$ to device |
FFT to real $w$ | | |
calculate $\omega_y$ and FFT to real | | |
calculate $\omega_x$ and FFT to real | | |
calculate $\omega_z$ and FFT to real | | | calculate statistics
calculate H1 and FFT to complex H1 | | |
calculate H3 and FFT to complex H3 | copy H1 to host | |
calculate H2 and FFT to complex H2 | copy H3 to host | | MPI transp. H1
1st RK step for $\nabla^2 v$ | copy H2 to host | copy H1 to device | MPI transp. H3
1st RK step for $\omega$ | | copy H3 to device | MPI transp. H2
non-linear RHS for $\omega_y$ and 2nd RK step | | copy H2 to device |
implicit step for $\omega_y$ | | |
non-linear RHS for $\nabla^2 v$ and 2nd RK step | | |
implicit step for $\nabla^2 v$ | | |
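
The benefit of the staggered schedule above can be estimated with a simple pipeline model in which each stream handles one field at a time and fields pass through the streams in order. The sketch below is such a model, not a profile of the actual code. The stage durations are placeholders in arbitrary units and the function `pipeline_makespan` is hypothetical.

```python
def pipeline_makespan(items):
    """Finish time when each stage is a serial resource and items flow in order.

    items[i][s] is the (hypothetical) duration of stage s for item i.
    """
    n_stages = len(items[0])
    stage_free = [0.0] * n_stages  # time at which each stream becomes idle
    for durations in items:
        ready = 0.0  # time at which this item has left the previous stage
        for s, d in enumerate(durations):
            start = max(ready, stage_free[s])
            ready = start + d
            stage_free[s] = ready
    return stage_free[-1]


# Eight fields, each passing through compute, D2H copy, MPI transpose, H2D copy.
# Durations are placeholders in arbitrary units, not measurements.
fields = [(1.0, 1.0, 1.0, 1.0)] * 8
serial = sum(sum(f) for f in fields)
overlapped = pipeline_makespan(fields)
print(serial, overlapped)  # 32.0 11.0
```

With equal stage durations, the overlapped schedule costs one pass through the four stages plus one stage time per additional field, instead of four stage times per field.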

Non-linear convolution: overlapping

8 MPI transposes, 8 D2H copies and 8 H2D copies

[Execution timeline with rows CPU, H2D/D2H and GPU; legend: MPI transpose, H2D transfer, GPU execution, D2H transfer]

Figure: Execution profile on 32 M2090 in MinoTauro at BSC.

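Scaling in Piz Daint

$N_x \times N_y \times N_z$ | $N_{gpus}$ (min–max) | Efficiency (min) | Time per DoF and GPU (min, ns)
1024 × 1024 × 256 | 16–256 | 67% | 60
2048 × 2048 × 512 | 64–512 | 82% | 63
4096 × 4096 × 1024 | 512–1024 | 100% | 65
6144 × 4096 × 1024 | 512–1024 | 100% | 64

Time per degree of freedom and GPU: $\text{time} \times N_{gpus} / \text{DoF}$

Efficiency: $(\text{time}_0 \times N^0_{gpus}) / (\text{time} \times N_{gpus}) \times 100$, where $\text{time}_0$ and $N^0_{gpus}$ refer to the run with the fewest GPUs.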
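
The two scaling metrics can be written as small helper functions. This is a minimal sketch that assumes the usual strong-scaling convention, in which 100% means the run time falls in proportion to the number of GPUs relative to a reference run. The function names and the timings in the example are hypothetical, not measurements from the slides.

```python
def time_per_dof_ns(time_s, n_gpus, dof):
    """Wall time per step, multiplied by GPU count, per degree of freedom (ns)."""
    return time_s * n_gpus / dof * 1e9


def efficiency_percent(time_s, n_gpus, time0_s, n_gpus0):
    """Strong-scaling efficiency relative to a reference run (time0_s, n_gpus0)."""
    return 100.0 * (time0_s * n_gpus0) / (time_s * n_gpus)


# Hypothetical timings for a 1024 x 1024 x 256 mesh (placeholders only).
dof = 1024 * 1024 * 256
tau = time_per_dof_ns(1.0, 16, dof)
eta = efficiency_percent(0.75, 32, 1.0, 16)
print(f"{tau:.1f} ns per DoF and GPU, efficiency {eta:.1f}%")
```

Values above 100% are possible in practice, for instance when a smaller per-GPU working set improves memory behaviour.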

Scaling in Piz Daint

[Figure: time (sec) versus $N_{gpus}$ on logarithmic axes, $N_{gpus}$ from 16 to 1024. Points are labelled with efficiencies: 100%, 96%, 92%, 80%, 67%, 100%, 109%, 100%, 82%, 100%, 101%, 100%, 105%.]

Scaling in Piz Daint

[Figure: time (sec) × $N_{gpus}$ / DoF versus $N_{gpus}$, for $N_{gpus}$ from 16 to 1024. Vertical axis from 3 to 10, in units of $10^{-8}$.]

Future projects: what we would like to do

Future goal for next generation GPUs (Pascal and Volta):

$Re_\tau = 10,000$ in a large box $8\pi \times 3\pi$.

Mesh: $N_x \times N_y \times N_z = 20,480 \times 2048 \times 15,360$.

Total GPU memory: ~10–15 TB.

~500 hours per eddy-turnover time on 2048 GPUs (Piz Daint).

~10,000,000 node-hours for a 15 eddy-turnover time simulation.

Generate on-the-fly compressed time-resolved data.
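
To put the memory figure in context, the sketch below counts the grid points of the proposed mesh and the storage per point implied by 10 to 15 TB. It assumes 1 TB = 10^12 bytes and 4-byte single-precision words; how many fields the code actually stores per point is not stated in the slides.

```python
# Mesh size quoted on the slide.
nx, ny, nz = 20480, 2048, 15360
points = nx * ny * nz
print(f"grid points: {points:.3e}")

# Storage per grid point implied by the quoted total GPU memory.
# Assumptions: 1 TB = 1e12 bytes, single precision = 4 bytes per value.
bytes_per_point = {}
for total_tb in (10, 15):
    bytes_per_point[total_tb] = total_tb * 1e12 / points
    words = bytes_per_point[total_tb] / 4
    print(f"{total_tb} TB -> {bytes_per_point[total_tb]:.1f} bytes per point "
          f"(about {words:.1f} single-precision values)")
```

Under these assumptions the quoted memory corresponds to roughly four to six single-precision values per grid point.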

Present projects: what we can do now

Current project at Piz Daint (Pascal):

$Re_\tau = 5,000$ in a large box $8\pi \times 3\pi$ (low resolution).

Mesh: $N_x \times N_y \times N_z = 6140 \times 1024 \times 4196$.

Total GPU memory: ~1–1.5 TB.

~22 hours per eddy-turnover time on 1048 GPUs (Tesla).

~1,600,000 node-hours for a 50 eddy-turnover time simulation.

Generate on-the-fly compressed time-resolved data.

Homogeneous isotropic turbulence

Flow in a triply periodic box.

Optimization strategy similar to the channel flow.

Good scalability up to 64 GPUs at MinoTauro (BSC).

The turbulence cascade in 5D

DECI-13 COSIT project in MinoTauro

~500,000 cpu-hours on M2090 NVIDIA GPUs

Long run (~60 ETT)

High temporal resolution (Kolmogorov time-scale)

~26,000 snapshots / ~100 Tb

The turbulence cascade in 5D

Project to study the turbulence cascade in 3 spatial coordinates, scale and time (5D).

Time tracking algorithms at different scales.

Results in Cardesa, Vela-Martin & Jiménez 2017, Science.

Database and GPU code available at https://torroja.dmt.upm.es

Questions
