Porting Telemac–Mascaret to OpenPower and experimenting GPU offloading to accelerate the Tomawac module. TUC 2019, 16-17th October, CERFACS, Toulouse, France. Judicaël Grasset(1), Stephen Longshaw(1), Charles Moulinec(1), David R. Emerson(1), Yoann Audouin(2), Pablo Tassi(2). October 17, 2019. (1) STFC, Daresbury Laboratory, Warrington, United Kingdom; (2) EDF R&D, Chatou, France.

Porting Telemac–Mascaret to OpenPower and experimenting GPU offloading to accelerate the Tomawac module

TUC 2019, 16-17th October, CERFACS, Toulouse, France

Judicaël Grasset(1), Stephen Longshaw(1), Charles Moulinec(1), David R. Emerson(1), Yoann Audouin(2), Pablo Tassi(2)

October 17, 2019

(1) STFC, Daresbury Laboratory, Warrington, United Kingdom
(2) EDF R&D, Chatou, France

Computing resources used

OpenPower architecture in a nutshell:

• IBM POWER processors
• NVIDIA GPUs
• NVIDIA NVLink

[Figure: the machine used for this work, Paragon]

Each node of the machine used consists of:

• 2 IBM POWER8 processors, with 8 cores each
• Each core has simultaneous multithreading (SMT) capability
• Here, each core can run 1 thread (SMT1), 2 threads (SMT2), 4 threads (SMT4) or 8 threads (SMT8) at the same time
• 4 NVIDIA P100 GPUs
• NVIDIA NVLink for GPU–GPU and GPU–CPU interconnections

Porting to OpenPower

• Why? Summit and Sierra, the 2 most powerful clusters in the world, are based on an OpenPower architecture (Top500, June 2019)
• Porting to a different architecture might reveal some bugs in the code (increased robustness)

Porting to OpenPower

Status of the port:

Version             PGI 18.10            GCC 9.1     XL 16.1.1.1
v8p0r2              compiles             compiles    does not compile*
trunk (Oct. 2019)   does not compile*    compiles    does not compile*

* Problem known and solved: it compiles when a small patch is applied.

All tests were done with the Spectrum MPI library.

Experimenting with GPUs

Or trying to port Telemac to the architecture of the present (formerly the future)

The test case

Test case used: tomawac/fetch_limited/tom_test6.cas

• This is a limited test with a small mesh: 75k elements, 32k points.
• It spends all of its time in a single Fortran subroutine: qnlin3.f
• This subroutine was reported to be a bottleneck by some users during the annual TELEMAC User Conference (2018).

qnlin3.f

In a nutshell:

• do loop
  • init some variables
  • do loop
    • init some variables
    • do loop
      • init some variables
      • do loop
        • tmp_array(x,y,z) = tmp_array(x,y,z) + k
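The loop structure above can be sketched in Fortran as follows. This is only an illustration of the shape of the nest, not the actual qnlin3.f: the array name tmp_array, the loop bounds, the intermediate variables a, b, c, the index computation and the increment k are all hypothetical placeholders.

```fortran
program loop_nest_sketch
  implicit none
  ! Hypothetical sizes: the real bounds in qnlin3.f are not given in the slides.
  integer, parameter :: n1 = 4, n2 = 3, n3 = 5, n4 = 2
  double precision :: tmp_array(n1, n2, n3)
  double precision :: a, b, c, k
  integer :: i1, i2, i3, i4, x, y, z

  tmp_array = 0.0d0

  do i1 = 1, n1
    a = dble(i1)                    ! "init some variables" (placeholder)
    do i2 = 1, n2
      b = a + dble(i2)              ! "init some variables" (placeholder)
      do i3 = 1, n3
        c = b + dble(i3)            ! "init some variables" (placeholder)
        x = i1; y = i2; z = i3      ! placeholder index computation
        do i4 = 1, n4
          k = c * dble(i4)          ! placeholder increment
          ! Several innermost iterations accumulate into the same element.
          tmp_array(x, y, z) = tmp_array(x, y, z) + k
        end do
      end do
    end do
  end do

  print *, 'sum of tmp_array =', sum(tmp_array)
end program loop_nest_sketch
```

The point to notice is that different iterations add to the same array element, which is why a parallel version of such a loop needs an atomic update or a reduction.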

Porting to GPUs, methods

Different solutions exist:

• Pragma-based: OpenMP, OpenACC
• Library-based: MAGMA, cuBLAS...
• Language extensions: CUDA, OpenCL

MPI+OpenACC (PGI compiler) on GPU

Move the data to the GPU and execute the loop on it.

• !$acc data copy(array)
  • !$acc parallel loop collapse(4)
  • do loop
    • do loop
      • do loop
        • do loop
          • !$acc atomic
          • array(x,y,z) = array(x,y,z) + k
  • ...
• !$acc end data
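As a minimal, self-contained sketch of the directive pattern above (not the actual Tomawac code), the following program applies the same OpenACC directives to a toy four-level loop nest. The array sizes, the loop variables and the value of k are assumptions made for illustration. Compiled without OpenACC support, the directives are treated as comments and the program runs serially with the same result.

```fortran
program acc_offload_sketch
  implicit none
  ! Hypothetical sizes: the real loop bounds are not given in the slides.
  integer, parameter :: nx = 8, ny = 6, nz = 4, nk = 10
  double precision :: array(nx, ny, nz)
  double precision :: k
  integer :: x, y, z, i

  array = 0.0d0

  ! Copy the array to the GPU on entry and back to the host on exit.
  !$acc data copy(array)

  ! Merge the four tightly nested loops into one parallel iteration space.
  !$acc parallel loop collapse(4) private(k)
  do z = 1, nz
    do y = 1, ny
      do x = 1, nx
        do i = 1, nk
          k = dble(i)
          ! Iterations that differ only in i update the same element,
          ! so the update is made atomic (no clause means "update").
          !$acc atomic
          array(x, y, z) = array(x, y, z) + k
        end do
      end do
    end do
  end do

  !$acc end data

  ! Each element should now hold 1 + 2 + ... + nk.
  if (all(array == dble(nk * (nk + 1) / 2))) then
    print *, 'OK'
  else
    print *, 'MISMATCH'
  end if
end program acc_offload_sketch
```

Note that collapse(4) requires the loops to be tightly nested, so any per-level initialisation has to be moved into the innermost loop body, as done here for k.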

Elsewhere, during the initialisation of the code, each MPI task is linked to a specific GPU.
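The slides do not say how this link is made. One common approach, shown here purely as an assumption, is a round-robin mapping of node-local MPI ranks onto the GPUs of the node. The function name device_for_rank is hypothetical, and the sketch loops over rank values instead of calling MPI so that it runs on its own.

```fortran
program gpu_binding_sketch
  implicit none
  ! 4 GPUs per node, as on the machine described in this talk.
  integer, parameter :: gpus_per_node = 4
  integer :: local_rank

  ! In an MPI run, local_rank would be the rank of the task within its node.
  ! Here it is simply looped over so that the sketch runs without MPI.
  do local_rank = 0, 7
    print '(a,i0,a,i0)', 'local rank ', local_rank, ' -> GPU ', &
          device_for_rank(local_rank, gpus_per_node)
  end do

contains

  ! Round-robin mapping of node-local ranks onto GPU indices 0..ngpus-1.
  ! With OpenACC the result would typically be passed to acc_set_device_num,
  ! and with OpenMP to omp_set_default_device.
  pure integer function device_for_rank(irank, ngpus) result(dev)
    integer, intent(in) :: irank, ngpus
    dev = mod(irank, ngpus)
  end function device_for_rank

end program gpu_binding_sketch
```

With more MPI tasks than GPUs on a node, this mapping lets several tasks share one GPU.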

MPI+OpenACC (PGI compiler) on GPU

MPI+OpenMP (IBM compiler) on GPU

Move the data to the GPU and execute the loop on it.

• !$omp target data map(array)
  • !$omp target teams distribute parallel do collapse(4)
  • do loop
    • do loop
      • do loop
        • do loop
          • !$omp atomic
          • array(x,y,z) = array(x,y,z) + k
  • ...
• !$omp end target data
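The same pattern can be sketched with OpenMP target directives. As before, this is a toy loop nest under assumed sizes and an assumed value of k, not the actual Tomawac code. Without OpenMP offloading support the program still runs on the host and gives the same result.

```fortran
program omp_offload_sketch
  implicit none
  ! Hypothetical sizes: the real loop bounds are not given in the slides.
  integer, parameter :: nx = 8, ny = 6, nz = 4, nk = 10
  double precision :: array(nx, ny, nz)
  double precision :: k
  integer :: x, y, z, i

  array = 0.0d0

  ! Map the array to the device on entry and back on exit
  ! (tofrom is the default map type, written out here for clarity).
  !$omp target data map(tofrom: array)

  ! Distribute the collapsed iteration space over teams and threads.
  !$omp target teams distribute parallel do collapse(4) private(k)
  do z = 1, nz
    do y = 1, ny
      do x = 1, nx
        do i = 1, nk
          k = dble(i)
          ! Iterations that differ only in i update the same element,
          ! so the update is made atomic (no clause means "update").
          !$omp atomic
          array(x, y, z) = array(x, y, z) + k
        end do
      end do
    end do
  end do
  !$omp end target teams distribute parallel do

  !$omp end target data

  ! Each element should now hold 1 + 2 + ... + nk.
  if (all(array == dble(nk * (nk + 1) / 2))) then
    print *, 'OK'
  else
    print *, 'MISMATCH'
  end if
end program omp_offload_sketch
```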

Elsewhere, during the initialisation of the code, each MPI task is linked to a specific GPU.

MPI+OpenMP (IBM compiler) on GPU

Somme test-case

• Somme 7 days

• Telemac2d-Tomawac-Sisyphe

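[Figure: pie chart of shares per subroutine. Legend, in extracted order: other subroutines, semimp, qwind1, propa, fremoy, schar41_per_4d, log, qnlin1, bief_interp. Values, in extracted order: 20.8%, 6%, 6%, 6.6%, 6.9%, 9.6%, 11.4%, 11.6%, 21.1%.]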

Inclusion in the codebase

• OpenACC and OpenMP redundancy
  • Could be solved with pragmas in this case
  • But might not always be possible
• Usage of the optional directory
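One possible reading of the first point, given here as an assumption and not as the authors' solution, is that a single source file can carry both sets of directives: a compiler with only OpenACC enabled ignores the `!$omp` lines as comments, and a compiler with only OpenMP enabled ignores the `!$acc` lines. The loop below is a hypothetical example, and the sketch assumes that only one of the two models is enabled at compile time.

```fortran
program dual_directive_sketch
  implicit none
  integer, parameter :: n = 1000
  double precision :: x(n), y(n)
  integer :: i

  do i = 1, n
    x(i) = dble(i)
    y(i) = 1.0d0
  end do

  ! Both directive sets annotate the same loop. Whichever programming
  ! model is enabled by the compiler flags is used; the other sentinel
  ! is read as an ordinary comment.
  !$acc parallel loop copyin(x) copy(y)
  !$omp target teams distribute parallel do map(to: x) map(tofrom: y)
  do i = 1, n
    y(i) = y(i) + 2.0d0 * x(i)
  end do

  ! Check the result against the closed form 1 + 2*i.
  if (all(y == [(1.0d0 + 2.0d0 * dble(i), i = 1, n)])) then
    print *, 'OK'
  else
    print *, 'MISMATCH'
  end if
end program dual_directive_sketch
```

This works when the two models need the same loop structure, as here. When they need different code, the duplication cannot be avoided this way, which matches the caveat on the slide.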

Conclusion

Results achieved:

• Telemac-Mascaret ported to OpenPower
• The port revealed bugs in Telemac-Mascaret and in some compilers
• Good improvement when using GPUs for the qnlin3 subroutine
• Work is still ongoing, but will be more difficult for real-world test cases

Acknowledgements

• This work is supported by the Hartree Centre through the Innovation Return on Research (IROR) programme.

Thank you for your attention

If you think the code is too slow, or uses too much memory for you (partel, Telemac, Tomawac...), please contact us.

Contact:

[email protected] [email protected]
