
Page 1:

INTRODUCTION TO OPENACC

Davide Vanzo
Mar 30 - Apr 01 2016

Page 2:

Why another programming model?

Page 3:

Why another programming model?

Page 4:

What is OpenACC?

Three ways to program accelerators, ordered from highest portability to highest performance:

• Libraries: use portable libraries with accelerator support (cuBLAS, cuFFT, Thrust, etc.)

• Compiler directives: use compiler directives to optimize the code for accelerators

• Programming languages: use lower-level languages to fine-tune the performance of individual kernels (CUDA, OpenCL)

Page 5:

What is OpenACC?

• OpenACC is a directives-based programming model for expressing parallelism in heterogeneous systems.

• Aims to be performance portable to a wide range of accelerators.

• A single programming specification covers a wide variety of platforms, including host-GPU, multi-core and many-core processors.

• OpenACC directives are complementary and interoperate with existing HPC programming models like OpenMP, MPI and CUDA.

Specification history: Ver 1.0 (Nov 2011), Ver 2.0 (Jun 2013), Ver 2.5 (Oct 2015).

IT IS NOT GPU PROGRAMMING! ONLY EXPRESS PARALLELISM!

Page 6:

OpenACC directives

C/C++:

    ...serial code...

    #pragma acc parallel loop    /* compiler directive */
    for (int i = 0; i < n; i++) {
        ...parallel code...
    }

    ...serial code...

Fortran:

    ...serial code...

    !$acc parallel loop    ! compiler directive
    do i = 1, n
        ...parallel code...
    end do
    !$acc end parallel loop

    ...serial code...

Porting cycle: Identify parallelism -> Express parallelism -> Express data locality -> Optimize.
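As a self-contained illustration of the C directive above, here is a minimal SAXPY-style sketch (the program and the file name saxpy.c are my example, not from the slides):

    /* saxpy.c - compile e.g. with: pgcc -acc -Minfo=accel saxpy.c */
    #include <stdio.h>

    int main(void) {
        const int n = 1 << 20;
        static float x[1 << 20], y[1 << 20];

        /* serial code: initialize inputs on the host */
        for (int i = 0; i < n; i++) {
            x[i] = 1.0f;
            y[i] = 2.0f;
        }

        /* the compiler generates an accelerator kernel for this loop
           and manages the data transfers implicitly */
        #pragma acc parallel loop
        for (int i = 0; i < n; i++)
            y[i] = 2.0f * x[i] + y[i];

        /* serial code: check one element on the host */
        printf("y[0] = %f\n", y[0]);    /* expect 4.000000 */
        return 0;
    }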

Page 7:

Kernels directive

It defines a code region that may contain loops; the compiler decides which of them can be parallelized and generates one kernel for each loop.

    #pragma acc kernels
    {
        /* kernel 1 */
        for (int i = 0; i < n; i++) {
            A[i] = i;
            B[i] = 2 * i;
        }

        /* kernel 2 */
        for (int i = 0; i < n; i++) {
            C[i] = A[i] + B[i];
        }
    }

When operating on pointers, the OpenACC compiler cannot prove that the pointer targets do not overlap (aliasing), so it will not generate parallel code, to avoid the risk of concurrent writes to the same memory allocation.

Solution: #pragma acc kernels loop independent, or declare the pointers with restrict (e.g. float *restrict A):

    #pragma acc kernels loop
    for (int i = 0; i < n; i++) {
        A[i] = i;
        B[i] = 2 * i;
    }

    #pragma acc kernels loop
    for (int i = 0; i < n; i++) {
        C[i] = A[i] + B[i];
    }
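For completeness, a sketch of the restrict variant wrapped in a function (the function name and signature are mine):

    /* With restrict the programmer promises that A, B and C do not
       overlap, so the compiler is free to parallelize both loops. */
    void init_and_add(int n, float *restrict A,
                      float *restrict B, float *restrict C)
    {
        #pragma acc kernels loop
        for (int i = 0; i < n; i++) {
            A[i] = i;
            B[i] = 2 * i;
        }

        #pragma acc kernels loop
        for (int i = 0; i < n; i++) {
            C[i] = A[i] + B[i];
        }
    }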

Page 8:

Compiler flags

$ pgcc -acc -ta=nvidia:maxwell -Minfo mysource.c

Compilers supporting OpenACC:

• PGI Accelerator: http://www.pgroup.com/resources/accel.htm
• OpenUH: http://web.cs.uh.edu/~openuh/
• OpenARC: http://ft.ornl.gov/research/openarc

Targets for the -ta flag:

• nvidia: Nvidia GPUs
• radeon: AMD GPUs
• multicore: parallel execution across host cores (OpenMP-like)
• host: serial execution on the host

Page 9:

Data regions

They allow the programmer to express data locality explicitly by identifying regions of code in which arrays remain on the GPU; data is transferred only at the boundaries of the region.

    #pragma acc data clauses(array)
    {
        #pragma acc kernels
        ...

        #pragma acc kernels
        ...
    }

Data clauses:

• copyin: allocates memory on the GPU and copies the data from host to GPU on entry
• copyout: allocates memory on the GPU and copies the data from GPU to host on exit
• copy: allocates memory on the GPU, copies host to GPU on entry and GPU to host on exit
• create: allocates memory on the GPU without any copy
• present: the data is already present on the GPU
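A sketch of how a data region could wrap the two kernels from the previous page (float arrays of length n assumed; the clauses shown are one reasonable choice):

    /* A, B and C stay resident on the GPU for the whole region,
       so they are not copied back and forth between the kernels. */
    #pragma acc data create(A[0:n], B[0:n]) copyout(C[0:n])
    {
        #pragma acc kernels loop
        for (int i = 0; i < n; i++) {
            A[i] = i;
            B[i] = 2 * i;
        }

        #pragma acc kernels loop
        for (int i = 0; i < n; i++) {
            C[i] = A[i] + B[i];
        }
    }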

Page 10:

Parallel loop directive

Explicitly identifies a specific loop to the compiler as safe to parallelize (i.e. free of pointer aliasing).

The programmer is responsible for identifying parallelism, while the compiler will take care of mapping parallelism to the specific accelerator.

    #pragma acc parallel
    {
        #pragma acc loop
        for (int i = 0; i < n; i++) {
            A[i] = i;
            B[i] = 2 * i;
        }

        #pragma acc loop
        for (int i = 0; i < n; i++) {
            C[i] = A[i] + B[i];
        }
    }

The same loops with the combined parallel loop directive:

    #pragma acc parallel loop
    for (int i = 0; i < n; i++) {
        A[i] = i;
        B[i] = 2 * i;
    }

    #pragma acc parallel loop
    for (int i = 0; i < n; i++) {
        C[i] = A[i] + B[i];
    }

Page 11:

Case study: 2D-Laplace with Jacobi solver

Solve the 2D Laplace partial differential equation with the iterative Jacobi solver.

[Figure: 5-point stencil. Each interior point A(i,j) is updated from its four neighbors A(i-1,j), A(i+1,j), A(i,j-1) and A(i,j+1).]
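Written out (my transcription, matching the code on the next page), the Jacobi update and the stopping criterion are:

    A^{k+1}_{i,j} = \frac{1}{4}\left( A^{k}_{i-1,j} + A^{k}_{i+1,j} + A^{k}_{i,j-1} + A^{k}_{i,j+1} \right),
    \qquad
    \max_{i,j} \left| A^{k+1}_{i,j} - A^{k}_{i,j} \right| < \mathrm{tol}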

Page 12:

Case study: 2D-Laplace with Jacobi solver

    while ( error > tol && iter < iter_max ) {    /* loop until convergence */
        error = 0.f;

        /* iterate across matrix elements */
        for ( int j = 1; j < n-1; j++ ) {
            for ( int i = 1; i < m-1; i++ ) {
                Anew[j*m+i] = 0.25f * ( A[j*m+i+1] + A[j*m+i-1]
                                      + A[(j-1)*m+i] + A[(j+1)*m+i] );
                /* compute maximum error */
                error = fmaxf( error, fabsf( Anew[j*m+i] - A[j*m+i] ) );
            }
        }

        /* swap new/old arrays */
        for ( int j = 1; j < n-1; j++ ) {
            for ( int i = 1; i < m-1; i++ ) {
                A[j*m+i] = Anew[j*m+i];
            }
        }

        if ( iter % 100 == 0 ) printf("%5d, %0.6f\n", iter, error);
        iter++;
    }
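A sketch of the accelerated version, applying the parallel loop directive from Page 10 together with a max reduction for error (the reduction clause is standard OpenACC; the exact directives used in the course materials may differ):

    while ( error > tol && iter < iter_max ) {
        error = 0.f;

        /* rows are independent; error needs a max reduction */
        #pragma acc parallel loop reduction(max:error)
        for ( int j = 1; j < n-1; j++ ) {
            for ( int i = 1; i < m-1; i++ ) {
                Anew[j*m+i] = 0.25f * ( A[j*m+i+1] + A[j*m+i-1]
                                      + A[(j-1)*m+i] + A[(j+1)*m+i] );
                error = fmaxf( error, fabsf( Anew[j*m+i] - A[j*m+i] ) );
            }
        }

        #pragma acc parallel loop
        for ( int j = 1; j < n-1; j++ )
            for ( int i = 1; i < m-1; i++ )
                A[j*m+i] = Anew[j*m+i];

        if ( iter % 100 == 0 ) printf("%5d, %0.6f\n", iter, error);
        iter++;
    }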

Page 13:

Case study: 2D-Laplace with Jacobi solver

(The same Jacobi solver code as on Page 12.)

[Figure: Host/GPU timeline. A and Anew are copied from the host to the GPU before each kernel and back after it, and the transfers repeat at the next cycle of the while loop, bracketing every stretch of GPU processing.]
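A sketch of how the data region from Page 9 removes these per-iteration transfers (the n*m array bounds follow the layout above; the exact clauses are my choice, not necessarily the course's solution):

    /* keep A and Anew resident on the GPU across all iterations:
       A is copied in once at entry and out once at exit,
       Anew is only ever allocated on the GPU */
    #pragma acc data copy(A[0:n*m]) create(Anew[0:n*m])
    {
        while ( error > tol && iter < iter_max ) {
            error = 0.f;

            #pragma acc parallel loop reduction(max:error)
            for ( int j = 1; j < n-1; j++ )
                for ( int i = 1; i < m-1; i++ ) {
                    Anew[j*m+i] = 0.25f * ( A[j*m+i+1] + A[j*m+i-1]
                                          + A[(j-1)*m+i] + A[(j+1)*m+i] );
                    error = fmaxf( error, fabsf( Anew[j*m+i] - A[j*m+i] ) );
                }

            #pragma acc parallel loop
            for ( int j = 1; j < n-1; j++ )
                for ( int i = 1; i < m-1; i++ )
                    A[j*m+i] = Anew[j*m+i];

            if ( iter % 100 == 0 ) printf("%5d, %0.6f\n", iter, error);
            iter++;
        }
    }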

Page 14:

Additional resources

Official OpenACC programming and best practices guide:

http://www.openacc.org/sites/default/files/OpenACC_Programming_Guide_0.pdf

Nvidia OpenACC course:

https://developer.nvidia.com/openacc-overview-course

PGI Accelerator Compilers – OpenACC getting started guide:

http://www.pgroup.com/doc/openacc_gs.pdf