
Introduction to OpenACC

Jeff Larkin

What is OpenACC?

A common directive programming model for today's GPUs

Announced at the SC11 conference

Offers portability between compilers

Drawn up by NVIDIA, Cray, PGI, and CAPS

Multiple compilers offer portability, debugging, and permanence

Works for Fortran, C, and (maybe) C++

Standard available at www.OpenACC-standard.org

Initial implementations targeted at NVIDIA GPUs

Current version: 1.0 (November 2011); Version 2.0 RFC released at SC12 (expected 1Q13)

Compiler support:

PGI Accelerator: released product in 2012

Cray CCE: released product in 2012

CAPS: released product in Q1 2012


OpenACC Portability Goals

Compiler Portability: Different compilers should support the same directives/pragmas and runtime library. Work is currently underway to standardize a compliance test suite.

Device Portability: Designed to be high level enough to support any of today's or tomorrow's accelerators, eliminating the need for separate code branches for the CPU and GPUs.

Performance Portability: Since OpenACC only annotates the code, well-written code should perform well on either the CPU or the GPU.


OPENACC BASICS


Directive Syntax

Fortran:
!$acc directive [clause [,] clause] …]
Often paired with a matching end directive surrounding a structured code block:
!$acc end directive

C:
#pragma acc directive [clause [,] clause] …]
Often followed by a structured code block

Important Directives to Know

!$acc parallel

Much like !$omp parallel, defines a region where loop iterations may be run in parallel

Compiler has the freedom to decompose this however it believes is profitable

!$acc kernels

Similar to parallel, but loops within the kernels region will be independent kernels, rather than one large kernel.

Independent kernels and associated data transfers may be overlapped with other kernels
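For illustration, a minimal sketch of the combined parallel loop form (array names are illustrative and assumed to be declared and allocated; the kernels form is shown on the next slide):

!$acc parallel loop
do i = 1, n
   a(i) = b(i) + c(i)
end do
!$acc end parallel loop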


kernels: Your First OpenACC Directive

Each loop is executed as a separate kernel on the GPU.

!$acc kernels
do i=1,n
   a(i) = 0.0
   b(i) = 1.0
   c(i) = 2.0
end do
do i=1,n
   a(i) = b(i) + c(i)
end do
!$acc end kernels

The first loop becomes kernel 1 and the second loop becomes kernel 2. (A kernel is a parallel function that runs on the GPU.)

Important Directives to Know

!$acc data

Defines regions where data may be left on the device

Useful for reducing PCIe transfers by creating temporary arrays or leaving data on the device until needed

!$acc host_data

Defines a region in which host (CPU) arrays will be used, unless specified with use_device()

The use_device() clause exposes the device pointer to the CPU

Useful for overlapping with CPU computation or calling library routines that expect device memory
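As a hedged sketch of host_data with use_device (the routine my_cuda_daxpy is hypothetical and stands in for any CUDA kernel or library call that expects device pointers):

!$acc data copy(x(1:n), y(1:n))
!$acc host_data use_device(x, y)
      ! x and y are passed as device addresses here
      call my_cuda_daxpy(n, 2.0d0, x, y)
!$acc end host_data
!$acc end data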


Important Directives to Know

!$acc wait

Synchronize with asynchronous activities

May wait on specific asynchronous activities (by handle) or on all outstanding requests

!$acc update

Update a host or device array within a data region

Allows updating parts of arrays

Frequently used around MPI
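A minimal sketch of updating only part of an array inside a data region (the array a, the sizes n and m, and the host routine write_checkpoint are illustrative assumptions):

!$acc data copy(a(1:n))
!$acc parallel loop
do i = 1, n
   a(i) = 2.0 * a(i)
end do
!$acc update host(a(1:m))      ! bring only the first m elements back to the CPU
call write_checkpoint(a, m)    ! hypothetical host-side routine
!$acc update device(a(1:m))    ! push any host-side changes back to the device
!$acc end data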


Important Directives to Know

!$acc loop

Useful for optimizing how the compiler treats specific loops

May be used to specify the decomposition of the work

May be used to collapse loop nests for additional parallelism

May be used to declare loop iterations as independent of each other
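A small sketch combining these clauses (array names are illustrative): collapse(2) merges the nest into one parallel iteration space, and independent asserts there are no cross-iteration dependences.

!$acc kernels
!$acc loop independent collapse(2)
do j = 1, m
   do i = 1, n
      a(i,j) = b(i,j) + c(i,j)
   end do
end do
!$acc end kernels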


Important Terminology

Gang: The highest level of parallelism, equivalent to a CUDA threadblock (num_gangs => number of threadblocks in the grid). A "gang" loop affects the CUDA grid.

Worker: A member of the gang, equivalent to a CUDA thread within a threadblock (num_workers => threadblock size). A "worker" loop affects the CUDA threadblock.

Vector: The tightest level of SIMT/SIMD/vector parallelism, roughly equivalent to a CUDA warp or the SIMD vector length (vector_length should be a multiple of the warp size). A "vector" loop affects the SIMT parallelism.

Declaring these on particular loops in your loop nest will affect how the problem is decomposed onto the hardware.
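A hedged sketch of mapping a loop nest onto these levels (the sizes and array names are illustrative, not tuned values):

!$acc parallel num_gangs(1024) vector_length(128)
!$acc loop gang
do j = 1, m
!$acc loop vector
   do i = 1, n
      a(i,j) = b(i,j) + c(i,j)
   end do
end do
!$acc end parallel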


Other Directives

async clause: Declares that control should return to the CPU immediately. If an integer is passed to async, that integer can be passed as a handle to wait.

cache construct: Cache data in the software-managed data cache (CUDA shared memory).

declare directive: Specify that data is to be allocated in device memory for the duration of the implicit data region created during the execution of a subprogram.
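A minimal sketch of async with handles (queue numbers and arrays are illustrative): the two loops and their transfers may overlap, and wait blocks on each handle.

!$acc parallel loop async(1)
do i = 1, n
   a(i) = a(i) + 1.0
end do

!$acc parallel loop async(2)
do i = 1, n
   b(i) = 2.0 * b(i)
end do

!$acc wait(1)
!$acc wait(2)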

Runtime Library Routines

Fortran: use openacc or #include "openacc_lib.h"

C: #include "openacc.h"

acc_get_num_devices, acc_set_device_type, acc_get_device_type, acc_set_device_num, acc_get_device_num, acc_async_test, acc_async_test_all, acc_async_wait, acc_async_wait_all, acc_shutdown, acc_on_device, acc_malloc, acc_free, acc_init
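A short sketch using a few of these routines from Fortran (assumes an NVIDIA device type; the program name is arbitrary):

program device_query
   use openacc
   implicit none
   integer :: ndev
   ! how many NVIDIA devices does the runtime see?
   ndev = acc_get_num_devices(acc_device_nvidia)
   print *, 'NVIDIA devices visible:', ndev
   ! initialize the device up front so startup cost is not timed later
   call acc_init(acc_device_nvidia)
   print *, 'using device number:', acc_get_device_num(acc_device_nvidia)
end program device_query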

Environment and Conditional Compilation

ACC_DEVICE device: Specifies which device type to connect to.

ACC_DEVICE_NUM num: Specifies which device number to connect to.

_OPENACC: Preprocessor macro for conditional compilation, set to the OpenACC version.

USING OPENACC


Identify High-level, Rich Loop Nests

Use your favorite profiling tool to identify hotspots at the highest level possible.

If there's not enough concurrency to warrant CUDA, there's not enough to warrant OpenACC either.


CrayPAT Loop-level profile


100.0% | 117.646170 | 13549032.0 |Total
|-------------------------------------------------------
| 75.4% | 88.723495 | 13542013.0 |USER
||------------------------------------------------------
|| 10.7% | 12.589734 | 2592000.0 |parabola_
|||-----------------------------------------------------
3|| 7.1% | 8.360290 | 1728000.0 |remap_.LOOPS
4|| | | | remap_
5|| | | | ppmlr_
||||||--------------------------------------------------
6||||| 3.2% | 3.708452 | 768000.0 |sweepx2_.LOOP.2.li.35
7||||| | | | sweepx2_.LOOP.1.li.34
8||||| | | | sweepx2_.LOOPS
9||||| | | | sweepx2_
10|||| | | | vhone_
6||||| 3.1% | 3.663423 | 768000.0 |sweepx1_.LOOP.2.li.35
7||||| | | | sweepx1_.LOOP.1.li.34
8||||| | | | sweepx1_.LOOPS
9||||| | | | sweepx1_
10|||| | | | vhone_
||||||==================================================
3|| 3.6% | 4.229443 | 864000.0 |ppmlr_
||||----------------------------------------------------
4||| 1.6% | 1.880874 | 384000.0 |sweepx2_.LOOP.2.li.35
5||| | | | sweepx2_.LOOP.1.li.34
6||| | | | sweepx2_.LOOPS
7||| | | | sweepx2_
8||| | | | vhone_
4||| 1.6% | 1.852820 | 384000.0 |sweepx1_.LOOP.2.li.35
5||| | | | sweepx1_.LOOP.1.li.34
6||| | | | sweepx1_.LOOPS
7||| | | | sweepx1_
8||| | | | vhone_
|||=====================================================

Place OpenMP On High-level Loops

Using OpenMP allows debugging issues of variable scoping, reductions, dependencies, etc. easily on the CPU.

The CPU toolset is more mature, and you can test anywhere.

Cray will soon be releasing Reveal, a product for scoping high-level loop structures.

Who knows, this may actually speed up your CPU code!


Focus on Vectorizing Low-Level Loops

Although GPUs are not strictly vector processors, vector inner loops will benefit both CPUs and GPUs.

Eliminate dependencies

Reduce striding

Remove invariant logic

Compiler feedback is critical in this process.
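For example, a small illustrative rewrite (variable names are hypothetical): the loop-invariant test is hoisted out of the inner loop so the compiler is free to vectorize it.

! Before: loop-invariant branch inside the vector loop
do i = 1, n
   if (use_gamma) then
      p(i) = gamma * e(i) * r(i)
   else
      p(i) = e(i) * r(i)
   end if
end do

! After: invariant test hoisted out, leaving clean, vectorizable loops
if (use_gamma) then
   do i = 1, n
      p(i) = gamma * e(i) * r(i)
   end do
else
   do i = 1, n
      p(i) = e(i) * r(i)
   end do
end if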


Finally, Add OpenACC

Once high-level parallelism with OpenMP and low-level vector parallelism is exposed and debugged, OpenACC is easy.

#ifdef _OPENACC
!$acc parallel loop private( k,j,i,n,r, p, e, q, u, v, w,&
!$acc& svel0,xa, xa0, dx, dx0, dvol, f, flat,&
!$acc& para,radius, theta, stheta) reduction(max:svel)
#else
!$omp parallel do private( k,j,i,n,r, p, e, q, u, v, w,&
!$omp& svel0,xa, xa0, dx, dx0, dvol, f, flat,&
!$omp& para,radius, theta, stheta) reduction(max:svel)
#endif


Differences between OpenMP and OpenACC

Things that are different between OpenMP and OpenACC:

Cannot have a CRITICAL region down the call chain

Cannot have THREADPRIVATE

Vectorization is much more important

Cache/memory optimization is much more important

No EQUIVALENCE

Private variables are not necessarily initialized to zero


#ifdef _OPENACC
!$acc parallel loop private( k,j,i,n,r, p, e, q, u, v, w,&
!$acc& svel0,xa, xa0, dx, dx0, dvol, f, flat, para,radius,&
!$acc& theta, stheta) reduction(max:svel)
#else
!$omp parallel do private( k,j,i,n,r, p, e, q, u, v, w, svel0,&
!$omp& xa, xa0, dx, dx0, dvol, f, flat, para,radius, &
!$omp& theta, stheta) reduction(max:svel)
#endif

But Now It Runs Slower!

Every time I've gone through this process, the code is slower at this step than when I started.

OpenACC is not automatic; you've still got work to do…

Improve data movement

Adjust loop decomposition

File bugs?


Optimizing Data Movement

Compilers will be cautious with data movement and will likely move more data than necessary.

If a variable appears on the left of '=', it will probably be copied back from the device.

If it appears on the right of '=', it will probably be copied to the device.

The CUDA Profiler can be used to measure data movement.

The Cray compiler also has the CRAY_ACC_DEBUG runtime environment variable, which will print useful information. See man intro_openacc for details.


Optimizing Data Movement

Step 1: place a data region around the simulation loop.

Use this directive to declare data that needs to be copied in, copied out, or created resident on the device.

Use the present clause to declare places where the compiler may not realize the data is already on the device (within function calls, for example).

Step 2: use an update directive to copy data between the GPU and CPU inside the data region as necessary.
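As a hedged sketch of how the present clause works with a data region (routine and variable names are illustrative): the caller keeps a on the device for all time steps, and the callee asserts the array is already there rather than copying it again.

!$acc data copy(a(1:n))
do step = 1, nsteps
   call advance(a, n)            ! hypothetical routine shown below
end do
!$acc end data

subroutine advance(a, n)
   integer, intent(in) :: n
   real(8) :: a(n)
   integer :: i
!$acc parallel loop present(a)   ! a is already resident; do not copy it again
   do i = 1, n
      a(i) = a(i) + 1.0d0
   end do
end subroutine advance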


Keep data on the accelerator with an !$acc data region


!$acc data copyin(cix,ci1,ci2,ci3,ci4,ci5,ci6,ci7,ci8,ci9,ci10,ci11,&
!$acc& ci12,ci13,ci14,r,b,uxyz,cell,rho,grad,index_max,index,&
!$acc& ciy,ciz,wet,np,&
!$acc& streaming_sbuf1,streaming_sbuf2,streaming_sbuf4,streaming_sbuf5,&
!$acc& streaming_sbuf7s,streaming_sbuf8s,streaming_sbuf9n,streaming_sbuf10s,&
!$acc& streaming_sbuf11n,streaming_sbuf12n,streaming_sbuf13s,streaming_sbuf14n,&
!$acc& streaming_sbuf7e,streaming_sbuf8w,streaming_sbuf9e,streaming_sbuf10e,&
!$acc& streaming_sbuf11w,streaming_sbuf12e,streaming_sbuf13w,streaming_sbuf14w, &
!$acc& streaming_rbuf1,streaming_rbuf2,streaming_rbuf4,streaming_rbuf5,&
!$acc& streaming_rbuf7n,streaming_rbuf8n,streaming_rbuf9s,streaming_rbuf10n,&
!$acc& streaming_rbuf11s,streaming_rbuf12s,streaming_rbuf13n,streaming_rbuf14s,&
!$acc& streaming_rbuf7w,streaming_rbuf8e,streaming_rbuf9w,streaming_rbuf10w,&
!$acc& streaming_rbuf11e,streaming_rbuf12w,streaming_rbuf13e,streaming_rbuf14e, &
!$acc& send_e,send_w,send_n,send_s,recv_e,recv_w,recv_n,recv_s)
do ii=1,ntimes
   o o o
   call set_boundary_macro_press2
   call set_boundary_micro_press
   call collisiona
   call collisionb
   call recolor

Update the Host for Communication


!$acc parallel loop private(k,j,i)
do j=0,local_ly-1
   do i=0,local_lx-1
      if (cell(i,j,0)==1) then
         grad(i,j,-1) = (1.0d0-wet)*db*press
      else
         grad(i,j,-1) = db*press
      end if
      grad(i,j,lz) = grad(i,j,lz-1)
   end do
end do
!$acc end parallel loop

!$acc update host(grad)
call mpi_barrier(mpi_comm_world,ierr)
call grad_exchange
!$acc update device(grad)

But we would rather not send the entire grad array back – how about…

Packing the buffers on the accelerator


!$acc data present(grad,recv_w,recv_e,send_e,send_w,recv_n,&
!$acc& recv_s,send_n,send_s)
!$acc parallel loop
do k=-1,lz
   do j=-1,local_ly
      send_e(j,k) = grad(local_lx-1,j ,k)
      send_w(j,k) = grad(0 ,j ,k)
   end do
end do
!$acc end parallel loop
!$acc update host(send_e,send_w)

call mpi_irecv(recv_w, bufsize(2),mpi_double_precision,w_id, &
     tag(25),mpi_comm_world,irequest_in(25),ierr)
o o o
call mpi_isend(send_w, bufsize(2),mpi_double_precision,w_id, &
     tag(26),mpi_comm_world,irequest_out(26),ierr)
call mpi_waitall(2,irequest_in(25),istatus_req,ierr)
call mpi_waitall(2,irequest_out(25),istatus_req,ierr)

!$acc update device(recv_e,recv_w)
!$acc parallel
!$acc loop
do k=-1,lz
   do j=-1,local_ly
      grad(local_lx ,j ,k) = recv_e(j,k)
      grad(-1 ,j ,k) = recv_w(j,k)

Optimizing Kernels

The compiler has freedom to schedule loops and kernels as it thinks is best, but the programmer can override this.

First you must know how the work was decomposed:

Feedback from the compiler at build time

Feedback from the executable at runtime

The CUDA Profiler


Adjusting Decomposition

Adjust the number of gangs, workers, and/or the vector length on your parallel or kernels region: num_gangs, num_workers, vector_length.

Add loop directives to individual loops, declaring them as gang, worker, or vector parallelism.


Further Optimizing Kernels

Use loop collapse() to merge loops and increase parallelism at particular levels.

Use the compiler's existing directives regarding loop optimizations:

Loop unrolling

Loop fusion/fission

Loop blocking

Ensure appropriate data access patterns: memory coalescing, bank conflicts, and striding are just as important with OpenACC as with CUDA/OpenCL. This will likely help when using the CPU as well.
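A hedged sketch of the data-access point (arrays and loop bounds are illustrative): putting the vector loop on the contiguous (first) Fortran dimension gives coalesced accesses on the GPU and stride-1 accesses on the CPU.

!$acc parallel loop gang
do j = 1, m
!$acc loop vector
   do i = 1, n            ! i indexes the contiguous dimension of a and b
      a(i,j) = 2.0 * b(i,j)
   end do
end do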


Interoperability

OpenACC plays well with others: CUDA C, CUDA Fortran, and libraries.

If adding OpenACC to an existing CUDA code, the deviceptr data clause allows using existing data structures.

If adding CUDA or a library call to an OpenACC code, use host_data and use_device to declare CPU or GPU memory use.
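A hedged sketch of the deviceptr clause (the routine and how the device array was allocated are assumptions): the dummy argument x_d already refers to device memory allocated by existing CUDA code, so deviceptr tells OpenACC to use that address directly instead of managing a copy.

subroutine scale_on_device(x_d, n)
   integer, intent(in) :: n
   real(8) :: x_d(n)              ! holds a device address passed in from CUDA code
   integer :: i
!$acc parallel loop deviceptr(x_d)
   do i = 1, n
      x_d(i) = 2.0d0 * x_d(i)
   end do
end subroutine scale_on_device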


Interoperability Advice

OpenACC provides a very straightforward way to manage data structures without needing two pointers (host & device), so use it at the top level.

CUDA provides very close-to-the-metal control, so it can be used for very highly tuned kernels that may be called from OpenACC.

Compilers do complex tasks such as reductions very well, so let them.


OpenACC Homework

I have provided a starting point at http://users.nccs.gov/~larkin/OpenACC_Exercises.zip

A simple Fortran matrix multiplication kernel has been provided; your assignment is to parallelize the kernel via OpenACC.

Instructions have been provided for the Fortran kernel to add the directives in stages.

Although the instructions use PGI, please feel free to try CCE as well. If you do, consider using the parallel directive in place of kernels.

Please consider doing the optional step 0 first.