INTRODUCTION TO OPENACC
ACCRE - Vanderbilt University
Davide Vanzo
Mar 30 - Apr 01 2016
Why another programming model?
What is OpenACC?

Three complementary approaches to accelerator programming:
• Libraries: use portable libraries with accelerator support (cuBLAS, cuFFT, Thrust, etc.)
• Compiler directives: annotate the code so the compiler can optimize it for accelerators
• Programming languages: use lower-level languages (CUDA, OpenCL) to fine-tune the performance of individual kernels

From libraries to languages, portability decreases while achievable performance increases; directives balance the two.
What is OpenACC?
• OpenACC is a directives-based programming model for expressing parallelism in heterogeneous systems.
• Aims to be performance portable to a wide range of accelerators.
• A single programming specification covers a wide variety of platforms, including host+GPU, multi-core and many-core processors.
• OpenACC directives are complementary and interoperate with existing HPC programming models like OpenMP, MPI and CUDA.
Specification history: Ver 1.0 (Nov 2011) → Ver 2.0 (Jun 2013) → Ver 2.5 (Oct 2015)
IT IS NOT GPU PROGRAMMING! YOU ONLY EXPRESS PARALLELISM.
OpenACC directives
C/C++:

    ...serial code...
    #pragma acc parallel loop        <- compiler directive
    for (int i = 0; i < n; i++) {
        ...parallel code...
    }
    ...serial code...

Fortran:

    ...serial code...
    !$acc parallel loop              <- compiler directive
    do i = 1, n
        ...parallel code...
    end do
    !$acc end parallel loop
    ...serial code...
OpenACC porting cycle: Identify parallelism → Express parallelism → Express data locality → Optimize
Kernels directive
It defines a code region that may contain loops the compiler can parallelize, generating one kernel for each loop.
    #pragma acc kernels
    {
        for (int i = 0; i < n; i++) {   // kernel 1
            A[i] = i;
            B[i] = 2 * i;
        }

        for (int i = 0; i < n; i++) {   // kernel 2
            C[i] = A[i] + B[i];
        }
    }
When operating with pointers, the OpenACC compiler cannot prove that the pointer targets do not overlap (aliasing), so it will not generate parallel code rather than risk writes to the same memory.

Solution: assert loop independence with #pragma acc kernels loop independent, or declare the pointers with the C99 restrict qualifier (float *restrict A).

    #pragma acc kernels loop
    for (int i = 0; i < n; i++) {
        A[i] = i;
        B[i] = 2 * i;
    }

    #pragma acc kernels loop
    for (int i = 0; i < n; i++) {
        C[i] = A[i] + B[i];
    }
Compiler flags
$ pgcc -acc -ta=nvidia:maxwell -Minfo mysource.c
Compilers supporting OpenACC:
• PGI Accelerator: http://www.pgroup.com/resources/accel.htm
• OpenUH: http://web.cs.uh.edu/~openuh/
• OpenARC: http://ft.ornl.gov/research/openarc

-ta targets:
• nvidia: Nvidia GPUs
• radeon: AMD GPUs
• multicore: parallel execution across host cores (OpenMP-like)
• host: serial execution on the host
Data regions
They allow the programmer to define specific data locality by identifying regions of code where arrays will remain on the GPU until the end of the region.
    #pragma acc data <clause>(array)
    {
        #pragma acc kernels
        ...
        #pragma acc kernels
        ...
    }

Clauses:
• copyin: allocates memory on the GPU and copies host → GPU at region entry
• copyout: allocates memory on the GPU and copies GPU → host at region exit
• copy: copyin + copyout
• create: allocates memory on the GPU, no copies
• present: the data is already present on the GPU
Parallel loop directive
Explicitly identifies a specific loop as safe to parallelize (i.e. no pointer aliasing), so the compiler does not need to prove it.
The programmer is responsible for identifying parallelism, while the compiler will take care of mapping parallelism to the specific accelerator.
    #pragma acc parallel
    {
        #pragma acc loop
        for (int i = 0; i < n; i++) {
            A[i] = i;
            B[i] = 2 * i;
        }

        #pragma acc loop
        for (int i = 0; i < n; i++) {
            C[i] = A[i] + B[i];
        }
    }

Equivalent combined form:

    #pragma acc parallel loop
    for (int i = 0; i < n; i++) {
        A[i] = i;
        B[i] = 2 * i;
    }

    #pragma acc parallel loop
    for (int i = 0; i < n; i++) {
        C[i] = A[i] + B[i];
    }
Case study: 2D-Laplace with Jacobi solver
Solve 2D-Laplace partial differential equation by using the iterative Jacobi solver.
[Five-point stencil: each interior point A(i,j) is updated from its four neighbors A(i-1,j), A(i+1,j), A(i,j-1) and A(i,j+1).]
Case study: 2D-Laplace with Jacobi solver
    while ( error > tol && iter < iter_max ) {       // loop until convergence
        error = 0.f;

        for ( int j = 1; j < n-1; j++ ) {            // iterate across matrix elements
            for ( int i = 1; i < m-1; i++ ) {
                Anew[j*m+i] = 0.25f * ( A[j*m+i+1] + A[j*m+i-1]
                                      + A[(j-1)*m+i] + A[(j+1)*m+i] );
                error = fmaxf( error, fabsf(Anew[j*m+i] - A[j*m+i]) );  // compute maximum error
            }
        }

        for ( int j = 1; j < n-1; j++ ) {            // swap new/old arrays
            for ( int i = 1; i < m-1; i++ ) {
                A[j*m+i] = Anew[j*m+i];
            }
        }

        if (iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);
        iter++;
    }
Case study: 2D-Laplace with Jacobi solver
The same Jacobi loop, viewed from the data movement side: at every iteration A and Anew are copied from the host to the GPU before the kernels run, processed on the GPU, then copied back to the host, and the next cycle repeats the same transfers. Without explicit data locality, this host-GPU traffic is paid on every iteration of the while loop.
Additional resources
Official OpenACC programming and best practices guide:
http://www.openacc.org/sites/default/files/OpenACC_Programming_Guide_0.pdf
Nvidia OpenACC course:
https://developer.nvidia.com/openacc-overview-course
PGI Accelerator Compilers – OpenACC getting started guide:
http://www.pgroup.com/doc/openacc_gs.pdf