
PGI Accelerator™ OpenACC

This article examines architectural features common to the three most popular types of compute accelerators being used or evaluated today. It describes how these features can be distilled into a canonical CPU+Accelerator abstract machine architecture which can be effectively targeted to create programs that are portable and performance portable across the different types of Accelerators. The key to this is selecting a programming model that unleashes the core strengths of all three types of accelerators, without compromising the unique features offered by each. We hope you'll find this article illuminating as you explore how Accelerators can enable you to reach new heights in performance and power efficiency.

Tesla vs Xeon Phi vs Radeon: A Compiler Writer's Perspective

by Brent Leback, Douglas Miles, and Michael Wolfe

Today's high performance systems are trending towards using highly parallel accelerators to meet performance goals within power and price limits. The most popular compute accelerators today are NVIDIA GPUs. Intel Xeon Phi co-processors and AMD Radeon GPUs are competing for that same market, meaning we will soon be programming and tuning for a wider variety of host + accelerator systems.

We want to avoid writing a different program for each type of accelerator. There are at least three current options for writing a single program that targets multiple accelerator types. One is to use a library, which works really well if the library contains all the primitives your application needs. Solutions built on class libraries with managed data structures are really another method to implement libraries, and again work well if the primitives suit your application. The potential downside is that you depend on the library being implemented and optimized on each of your targets now and in the future.

A second option is to dive into a low-level, target-independent solution such as OpenCL. There are significant up-front costs to refactor your application, as well as additional and continuing costs to tune for each accelerator. OpenCL, in particular, is quite low-level, and requires several levels of tuning to achieve high performance for each device.

A third option is to use a high-level, target-independent programming model, such as OpenMP or OpenACC directives. OpenMP has had great success in its target domain, shared-memory multiprocessor and multicore systems. OpenACC is just now being targeted for multiple accelerator types, and initial evidence is that it can be used to achieve high performance for a single program across a range of accelerator systems.

Using directives for parallel and accelerator programming divides the programming challenges between the application writer and the compiler. Defining an abstract view of the target architecture used by the compiler allows a program to be optimized for very different targets, and allows programmers to understand how to express a program that can be optimized across accelerators.

PGI Fortran, C and C++ compilers support OpenACC directives on Linux, OS X and Windows, targeting NVIDIA and AMD GPUs. Initial functionality has been demonstrated on Intel Xeon Phi co-processors. The PGI compilers are designed to allow a program to be compiled for a single accelerator target, or to be compiled into a single binary that will run on any of multiple accelerator targets depending on which is available at runtime. They are also designed to generate binaries that utilize multiple accelerators at runtime, or even multiple accelerators from different vendors. To make all these cases truly useful, the program must be written in a way that allows the compiler to optimize for different architectures that have common themes but fundamentally different parallelism structures.

Accelerator Architectures Today

Here we present important architectural aspects of today's multi-core CPUs and the three most common accelerators coupled to those CPUs for technical computing: NVIDIA Tesla, Intel Xeon Phi and AMD Radeon. There are other accelerators in use, such as FPGAs and DSPs, but these have not yet achieved a significant presence in the technical computing market and we do not cover them here.

Common Multi-core CPU

For comparison, we present an architectural summary of a common multi-core CPU. An Intel Sandy Bridge processor has up to eight cores. Each core stores the state of up to two threads. Each core can issue up to five instructions in a single clock cycle; the five instructions issued in a single cycle can be a mix from the two different threads. This is Intel's Hyperthreading, which is an implementation of Simultaneous Multithreading. In addition to scalar instructions, each core has SSE and AVX SIMD registers and instructions. An AVX instruction has 256-bit operands, meaning it can process eight single-precision floating point operations or four double-precision floating point operations with a single instruction. Each core has a 32KB level-1 data cache and 32KB level-1 instruction cache, and a 256KB L2 cache. There is an L3 cache shared across all cores which ranges up to 20MB. The CPU clock rate ranges between 2.3GHz and 4.0GHz (in Turbo mode). The core is deeply pipelined, with about 14 stages.

The control unit for a modern CPU can issue instructions even beyond the point where an instruction stalls. For instance, if a memory load instruction misses in the L1 and L2 cache, its result won't be ready for tens or hundreds of cycles until the operand is fetched from main memory. If a subsequent add instruction, say, needs the result of that load instruction, the control unit can still issue the add instruction, which will wait at a reservation station until the result from the load becomes available. Meanwhile, instructions beyond that add can be issued and can even execute while the memory load is being processed. Thus, instructions are executed out-of-order relative to their appearance in the program.

The deep, two- or three-level cache hierarchy reduces the average latency for memory operations. If a memory data load instruction hits in the L1 cache, the result will be ready in 4 to 5 clock cycles. If it misses in the L1 cache but hits in the L2 cache, the result will take about 12 cycles. An L3 cache hit will take 35 to 45 clock cycles, while missing completely in the cache may take 250 or more cycles. There is additional overhead to manage multi-core cache coherence for the L1 and L2 caches.

GPU marketing literature often uses the term "GPU core". This is essentially a single-precision (32-bit) ALU or equivalent. By this definition, an 8-core Sandy Bridge with AVX instructions has 64 GPU core-equivalents.

NVIDIA Kepler GPU

An NVIDIA Kepler GPU has up to 15 SMX units, where each SMX unit corresponds roughly to a highly parallel CPU core. The SMX units execute warp-instructions, where each warp-instruction is roughly a 32-wide SIMD instruction, with predication. The warp is a fundamental unit for NVIDIA GPUs; all operations are processed as warps. The CUDA programming model uses CUDA threads, which are implemented as predicated, smart lanes in a SIMD instruction stream. In this section, we use the term "thread" to mean a sequence of warp-instructions, not a CUDA thread. The SMX units can store the state of up to 64 threads. Each SMX unit can issue instructions from up to four threads in a single cycle, and can issue one or two instructions from each thread. There are obvious structural and data hazards that can limit the actual number of instructions issued in a single cycle. Instructions from a single thread are issued in-order. When a thread stalls while waiting for a result from memory, the SMX control unit will select instructions from another active thread. The SMX clock is about 750 MHz. The SMX is also pipelined, with 8 stages.

Each execution engine of the Kepler SMX has an array of execution units, including 192 single-precision GPU cores (for a max of 2880 GPU cores on chip), plus additional units for double-precision, memory load/store and transcendental operations. The execution units are organized into several SIMD functional units, with a hardware SIMD width of 16. This means it takes two cycles to initiate a 32-wide warp instruction on the 16-wide hardware SIMD function unit.

There is a two-level data cache hierarchy. The L1 data cache is 64KB per SMX unit, and is used only for thread-private data. The L1 cache is actually split between a normal associative data cache and a scratchpad memory, with 25%, 50% or 75% of the 64KB selectively used as scratchpad memory. There is an additional 48KB read-only L1 cache, which is accessed through special instructions. The 1.5MB L2 data cache is shared across all SMX units. The caches are small, relative to the large number of SIMD operations typically running on all 15 SMX units. The device main memory ranges up to 6GB in the Kepler K20x, and larger memories are expected in the near future.

AMD Radeon GPU

An AMD Radeon GPU has up to 32 compute units, where each compute unit corresponds roughly to a simple CPU core with attached vector units. Each compute unit has four SIMD units and a scalar unit, where the scalar unit is used mainly for control flow. The SIMD units operate on wavefronts, which roughly correspond to NVIDIA warps, except wavefronts are 64-wide. The SIMD units are physically 16-wide, so it takes four clock cycles to initiate a 64-wide wavefront instruction. With 32 compute units, each with four 16-wide SIMD units, the AMD Radeon has up to 2048 single-precision GPU cores. In this section, the term "thread" is used to mean a sequence of scalar and wavefront instructions.

A thread is statically assigned to one of the four SIMD units. As with NVIDIA Kepler, the AMD Radeon issues instructions in-order per thread. Each SIMD unit stores the state for up to 10 threads, and uses multithreading to tolerate long latency operations. The compute units operate at about 1GHz.

There is a 64KB scratchpad memory for each compute unit, shared across all SIMD units. There is an additional 16KB L1 data cache per compute unit, and a 16KB L1 read-only scalar data cache shared across the four SIMD units. A 768KB L2 data cache is shared across all compute units. The device main memory is up to 3GB in current devices, and larger memories are expected in the near future.

Intel Xeon Phi Coprocessor

The Intel Xeon Phi Coprocessor (IXPC) is designed to appear and operate more like a common multi-core processor, though with many more cores. An IXPC has up to 61 cores, where each core implements most of the 64-bit x86 scalar instruction set. Instead of SSE and AVX instructions, an IXPC core has 512-bit SIMD instructions, which perform 16 single-precision operations or eight double-precision operations in a single instruction. Each core has a 32KB L1 instruction cache and 32KB L1 data cache, and a 512KB L2 cache. There is no cache shared across cores. The CPU clock is about 1.1GHz. With 61 cores and 16-wide SIMD units, the IXPC has the equivalent of 976 GPU cores.

The control unit can issue two instructions per clock, though instructions are issued in-order and those two instructions must be from the same thread (unlike Intel Hyperthreading). Each core stores the state of up to four threads, using multithreading to tolerate cache misses.

The IXPC is packaged as an IO device, similar to a GPU. It has 8GB memory in current devices. The biggest potential advantage of the IXPC is its programmability using existing multi-core programming models. In many if not most cases, you really can just recompile your program and run it natively on an IXPC using MPI and/or OpenMP. However, to get good performance, you will likely have to tune or refactor your application.

Common Accelerator Architecture Themes

There are several common themes across the Tesla, Radeon and Xeon Phi accelerators.

Separate device memory: Current accelerators are all packaged as IO devices, with separate memory from the host. This makes memory management key to achieving high performance. Partly this is because the transfer rates across the IO bus are so slow relative to memory speeds, but partly this is because the device memory is not paged and is small relative to host memory. When a workstation or single cluster node typically has 64GB or more of physical memory, having less memory on the high throughput accelerator presents a challenge. The limited physical memory sizes and relatively slow individual core speeds on today's accelerators limit their ability to be viewed or used as stand-alone compute nodes. They are only viable when coupled to and used in concert with very fast CPUs.

Many-core: Our definition of core here is a processor that issues instructions across one or more functional units, and shares resources closely amongst those units. An NVIDIA SMX unit or Radeon compute unit is a core, by this definition. With 15, 32 or 61 cores, and a reliance on SIMD processing, your program needs enough parallelism to keep these units busy.

Somewhat unique to each device is how it organizes the cores and functional units within each core. The Radeon GPU has up to 32 compute units where each core has four SIMD units. One could argue that this should be treated as 128 cores. We have chosen to treat this like 32 cores, where each core can issue four instructions per clock, one to each SIMD unit. Similarly, we treat the Kepler SMX as a single core which can issue many instructions in a single cycle to many functional units.

Multithreading: All three devices use multithreading to tolerate long latency operations, typically memory operations. This is a trade-off: storing more thread state on board instead of implementing a more complex control unit. This means your application will have to expose an additional degree of parallelism to fill the multithreading slots, in addition to filling the cores.

The degree of multithreading differs across the devices. The IXPC has a lower degree (4) than Kepler (64) or Radeon (40), largely because the IXPC depends more on caches to reduce average memory latency.

Vectors: All three devices have wide vector operations. The physical SIMD parallelism in all three designs happens to be the same (16), though the instruction sets for Kepler and Radeon are designed around 32- and 64-wide vectors. Your application must be amenable to both SIMD and MIMD parallelism to achieve peak performance.

In-order instruction issue: This is really one of several common design themes that highlight the simpler control units on these devices relative to commodity CPU cores. If a goal is higher performance per transistor or per square nanometer, then making the control unit smaller allows the designer to use more transistors for the functional units, registers or caches. However, commodity cores benefit from more complex control units, so clearly there is value to them. As a result, the cores used in accelerators may prove less than adequate on workloads where a commodity CPU thrives, or on serial portions of workloads that overall are well-suited to an accelerator. This is the primary challenge facing so-called self-hosted accelerators, and why systems that successfully couple latency-optimized CPUs with stream-optimized accelerators from the hardware through the programming model are likely to be most effective on a broad range of applications.

Memory strides matter: Accelerator device memories are optimized for accesses to superwords, long sequences of adjacent words; a superword essentially corresponds to a cache line. If the data accesses are all stride-1, the application can take advantage of very high hardware memory bandwidth. With indexed accesses or high strides, the memory still responds with 256- or 384- or 512-bit superwords, but only a fraction of that data bandwidth is used.
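
As a rough illustration (the function and array names here are ours, not from the article), compare a stride-1 loop with a strided one; on any of these devices the second loop wastes most of each superword the memory system delivers:

float sum_stride1( const float *x, int n )
{
    float sum = 0.0f;
    for( int i = 0; i < n; ++i )        /* adjacent words: each superword is fully used */
        sum += x[i];
    return sum;
}

float sum_strided( const float *x, int n, int stride )
{
    float sum = 0.0f;
    for( int i = 0; i < n; ++i )        /* with a large stride, only one word per superword is used */
        sum += x[i*stride];
    return sum;
}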

Smaller caches: The cache memories are smaller per core than those of today's high performance CPUs. The applications where accelerators are used effectively generally operate on very large matrices or data structures, where cache memories are less effective. The trade-off between smaller cache and higher memory bandwidth targets this application space directly.

At PGI, for purposes of programming model and compiler design, we represent these common elements in an abstracted CPU+Accelerator architecture that looks as follows:

Figure 1: CPU+Accelerator Abstract Machine

Maximizing Accelerator Performance

Here we review and discuss the various means to program for today's accelerators, focusing on how the various important architectural aspects are managed by the programming model. A programming model has several options for exploiting each architectural feature: hide it, virtualize it, or expose it.

The model can hide a feature by automatically managing it. This is the route taken for register management for classical CPUs, for instance. While the C language has a register keyword, modern compilers ignore that when performing register allocation. Hiding a feature relieves the programmer from worrying about it, but also prevents the programmer from optimizing or tuning for it.

The model can virtualize a feature by exposing its presence but virtualizing the details. This is how vectorization is classically handled. The presence of vector or SIMD instructions is explicit, and programmers are expected to write vectorizable loops. However, the native vector length, the number of vector registers and the vector instructions themselves are all virtualized by the compiler. Compare this to using SSE intrinsics and how that affects portability to a future architecture (with, say, AVX instructions). Virtualizing a feature often incurs some performance penalty, but also improves productivity and portability.
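
To make the contrast concrete, here is a small sketch of our own (not from the article): the same loop written portably, and written with SSE intrinsics that hard-code a 4-wide vector length and would have to be rewritten to exploit AVX:

#include <xmmintrin.h>   /* SSE intrinsics */

/* Virtualized form: the compiler chooses vector length, registers and instructions. */
void scale_portable( float *restrict y, const float *restrict x, int n )
{
    for( int i = 0; i < n; ++i )
        y[i] = 2.0f * x[i];
}

/* Exposed form: 4-wide SSE is wired into the source (assumes n is a multiple of 4). */
void scale_sse( float *y, const float *x, int n )
{
    __m128 two = _mm_set1_ps( 2.0f );
    for( int i = 0; i < n; i += 4 ){
        __m128 v = _mm_loadu_ps( &x[i] );
        _mm_storeu_ps( &y[i], _mm_mul_ps( v, two ) );
    }
}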

Finally, a model can expose a feature fully. The classical example is MPI, which exposes the separate compute nodes, distributed memory and network data transfers within a program. The C language exposes linear data layout, allowing pointer arithmetic to navigate a data structure. Exposing a feature allows the programmer to tune and optimize for it, but also requires a programmer to do the tuning and optimization with little or no help from the compiler.

The programming models we will discuss are OpenCL, CUDA C, CUDA Fortran, Microsoft’s C++AMP, Intel’s offload directives (which are turning into the OpenMP 4.0 target directives), and OpenACC.


Memory Management

Memory management is the highest hurdle and biggest obstacle to high performance on today's accelerators. Memory has to be allocated on the device, and data has to be moved between the host and the device. This is essentially identical across all current devices. Attempts to automatically manage data movement between devices, or nodes of a cluster, have so far resulted in impressive performance for some simple stunt cases but disappointing results in general. The costs of a bad decision are very high, which makes a fully automatic approach to memory management extremely challenging to implement. For that reason, all of these programming models assume at least some aspects of memory management must be delegated to the programmer.

CUDA C and CUDA Fortran typically require the programmer to allocate device memory and explicitly copy data between host and device. CUDA Fortran allows the use of standard allocate and array assignment statements to manage the device data. While CUDA will soon support a logically unified view of host and device memory, it is likely that explicit data management will still be required to extract maximum performance from NVIDIA GPUs.
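
For readers who have not seen it, the host-side pattern looks roughly like the following sketch; the function and variable names are ours, the kernel launch is elided, and only standard CUDA runtime calls (cudaMalloc, cudaMemcpy, cudaFree) are shown:

#include <stddef.h>
#include <cuda_runtime.h>

/* Sketch: explicit data management as CUDA C requires it. */
void double_on_device( float *r, const float *a, int n )
{
    float *d_a, *d_r;
    size_t bytes = n * sizeof(float);

    cudaMalloc( (void**)&d_a, bytes );                     /* allocate device memory */
    cudaMalloc( (void**)&d_r, bytes );
    cudaMemcpy( d_a, a, bytes, cudaMemcpyHostToDevice );   /* copy input host -> device */

    /* ... launch a kernel that reads d_a and writes d_r ... */

    cudaMemcpy( r, d_r, bytes, cudaMemcpyDeviceToHost );   /* copy results device -> host */
    cudaFree( d_a );
    cudaFree( d_r );
}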

Similarly, OpenCL requires the programmer to allocate buffers and copy data to the buffers. It hides some important details, however, in that it doesn't expose exactly where the buffer lives at various points during the program execution. A compliant implementation may (and may be required to) move a buffer to host memory, then back to device memory, with no notification to the programmer.

C++AMP uses views and arrays of data. The programmer can create a view of the data that will be processed on the accelerator. As with OpenCL, the actual point at which the memory is allocated on the device and data transferred to or from the device is hidden. Alternatively, the programmer may take full control by creating an explicit new array, or create an array with data copied from the host.

The Intel offload directives and OpenACC allow the programmer to rely entirely on the compiler for memory management to get started, but offer optional data constructs and clauses to control and optimize when data is allocated on the device and moved between host and device. The language, compiler and runtime cooperate to virtualize the separate memory spaces by using the same names for host and device copies of data. This is important to allow efficient implementations on current or future systems where data movement is not necessary, and to allow the same program to be used on systems with or without an accelerator.
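
For example, a structured OpenACC data region (a sketch of ours, with illustrative names) lets two compute regions share one round trip over the IO bus, while the host and device copies of the arrays keep the same names:

void double_then_add( float *restrict r, const float *restrict a, int n )
{
    #pragma acc data copyin(a[0:n]) copyout(r[0:n])   /* a and r move once for the whole region */
    {
        #pragma acc kernels loop
        for( int i = 0; i < n; ++i )
            r[i] = a[i] * 2.0f;

        #pragma acc kernels loop
        for( int i = 0; i < n; ++i )
            r[i] = r[i] + a[i];
    }
}

Without the data construct, each kernels region would be free to move a and r across the bus separately.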

Parallelism Scheduling

Accelerator devices have three levels of compute parallelism: many cores, multithreading on a core, and SIMD execution.

CUDA and OpenCL expose many-core parallelism with a three-level hierarchy. A kernel is written as a scalar thread; many threads are executed in parallel in a rectangular block or workgroup on a single core; many blocks or workgroups are executed in parallel in a rectangular grid across cores. The number of scalar threads in a block or workgroup, and the number of blocks or workgroups, are chosen explicitly by the programmer. The fact that consecutive scalar threads are actually executed in SIMD fashion in warps or wavefronts on a GPU is hidden, though tuning guidelines give advice on how to avoid performance cliffs. The fact that some threads in a block or workgroup may execute in parallel using multithreading on the same hardware, or that multiple blocks or workgroups may execute in parallel using multithreading on the same hardware, is also hidden. The programming model virtualizes the multithreading aspect, using multithreading for large blocks or workgroups, or many blocks or workgroups, until the resources are exhausted.

C++AMP does not expose the multiple levels of parallelism, instead presenting a flat parallelism model. The programmer can impose some hierarchy by explicitly using tiles in the code, but the mapping to the eventual execution mode is still implicit.

Intel's offload directives use OpenMP parallelism across the cores and threads on a core. The number of threads to launch can be set by the programmer using OpenMP clauses or environment variables, or set by default by the implementation. SIMD execution on a core is implemented with loop vectorization.

OpenACC exposes the three levels of parallelism as gang, worker and vector parallelism. A programmer can use OpenACC loop directives to explicitly control which loop indices are mapped to each level. However, a compiler has some freedom to automatically map or remap parallelism for a target device. For instance, the compiler may choose to automatically vectorize an inner loop. Or, if there is only a single loop, the compiler may remap the programmer-specified gang parallelism across gangs, workers and vector lanes, to exploit all the available parallelism. This allows the programmer to concentrate on abstract parallelism, leaving most device-specific scheduling details to the compiler.

OpenACC also allows the programmer to specify how much parallelism to instantiate, similar to the num_threads clause in OpenMP, with num_gangs, num_workers and vector_length clauses on the parallel directive, or with explicit sizes in the loop directives in a kernels region. The optimal values to use are likely to be dependent on the device, so it's best for cross-platform and forward performance portability if these are left to the OpenACC compiler or runtime system.
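
A short sketch (ours, using a hypothetical stencil) of how the loop directives express gang, worker and vector mapping while leaving the sizes to the compiler; the comments mark where num_gangs, num_workers or vector_length values could be pinned down if a particular target demanded it:

void smooth_rows( float *restrict b, const float *restrict a, int n, int m )
{
    #pragma acc parallel loop gang worker     /* num_gangs(...)/num_workers(...) could be added on the parallel directive */
    for( int i = 1; i < n-1; ++i ){
        #pragma acc loop vector               /* vector_length(...) could be added on the parallel directive */
        for( int j = 0; j < m; ++j )
            b[i*m + j] = 0.5f * ( a[(i-1)*m + j] + a[(i+1)*m + j] );
    }
}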

Multithreading

All three types of Accelerators we consider here require oversubscription to keep the compute units busy; that is, the program must expose extra (slack) parallelism so a compute unit can swap in another active thread when a thread stalls on memory or other long latency operations. This slack parallelism can come from several sources. Here we explore how to create enough parallelism to keep the devices busy.

In CUDA and OpenCL, slack parallelism comes from creating blocks or workgroups larger than one warp or wavefront, or from creating more blocks or workgroups than the number of cores. Whether the larger blocks or workgroups are executed in parallel across SIMD units or time-sliced using multithreading is hidden by the hardware.

C++AMP completely hides how parallelism is scheduled. The programmer simply trusts that if there is enough parallelism, the implementation will manage it well on the target device.

Intel's offload model doesn't expose multithreading explicitly. Multithreading is used as another mechanism to implement OpenMP threads on the device. The programmer must create enough OpenMP threads to populate the cores, then create another factor to take advantage of multithreading for latency tolerance.

OpenACC worker-level parallelism is intended to address this issue directly. On the GPUs, iterations of a worker-parallel loop will run on the same core, either simultaneously using multiple SIMD units in the same core, or sharing the same resources using multithreading. As mentioned earlier, an OpenACC implementation can also use longer vectors or more gangs and implement these using multithreading, similar to OpenCL or CUDA.

SIMD operations

All three current architectures have vector or SIMD operations, but they are exposed differently. The IXPC has a classical core, with scalar and vector instructions. The Kepler is programmed in what NVIDIA calls SIMT fashion, that is, as a number of independent threads that happen to execute in lock-step as a warp; the SIMD implementation is virtualized at the hardware level. The Radeon also has scalar and vector units, but the scalar unit is limited; the Radeon is actually programmed much like the Kepler, with thread-independent branching relegated to the scalar unit.

As mentioned above, CUDA and OpenCL hide the SIMD aspect of the target architecture, except for tuning guidelines. C++AMP also completely hides how the parallelism of the algorithm is mapped onto the parallelism of the hardware.

Intel's offload model uses classical SIMD vectorization within a thread. New directives also allow more control over which loops are vectorized. The SIMD length is (obviously) limited to the 512-bit instruction set for the current IXPC, but any specification of vector length is not part of the model.

An OpenACC compiler can treat all three machines as cores with scalar and vector units. For Kepler, vector operations are spread over the CUDA threads of a warp or thread block; scalar operations are either executed redundantly by each CUDA thread, or, where redundant execution is invalid or inefficient, executed in a protected region only by CUDA thread zero. Code generation for Radeon works much the same way. The native SIMD length for Kepler is 32 and for Radeon is 64, in both single and double precision. The compiler has the freedom to choose a longer vector length, in multiples of 32 or 64, by synchronizing the warps or wavefronts as necessary to preserve vector dependences. Thus the OpenACC programming model for all three devices is really multi-core plus vector parallelism. The compiler and runtime manage the implementation differences between the device types.

Memory Strides

All three devices are quite sensitive to memory strides. The memories are designed to deliver very high bandwidth when a program accesses blocks of contiguous memory.

In OpenCL and CUDA, the programmer must write code so that consecutive OpenCL or CUDA threads access consecutive or contiguous memory locations. Because these threads execute in SIMD fashion, they will issue memory operations for consecutive locations in the same cycle, to maximize memory performance.

In C++AMP, the mapping of parallel loop iterations to hardware parallelism is virtualized. If the index space is one-dimensional, the programmer can guess that adjacent iterations will be executed together and should then access consecutive memory locations.

The Intel offload model has no need to deal with strides. The programmer should write the vector or SIMD loops to access memory in stride-1 order.

Similarly, using OpenACC, the programmer should write vector loops with stride-1 accesses. When mapping a vector loop, the natural mapping will assign consecutive vector iterations to adjacent SIMD lanes, warp threads or wavefront threads. This will result in stride-1 memory accesses, or so-called coalesced memory accesses where adjacent GPU cores access adjacent memory locations.
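
The following sketch (ours, with hypothetical names) shows the same row-major array update written both ways; in the first form adjacent vector lanes touch adjacent words, while in the second they touch memory m floats apart and most of each superword is wasted:

void scale_coalesced( float *restrict a, int n, int m, float s )
{
    #pragma acc parallel loop gang
    for( int i = 0; i < n; ++i ){
        #pragma acc loop vector               /* j is the contiguous, stride-1 index */
        for( int j = 0; j < m; ++j )
            a[i*m + j] *= s;
    }
}

void scale_strided( float *restrict a, int n, int m, float s )
{
    #pragma acc parallel loop gang
    for( int j = 0; j < m; ++j ){
        #pragma acc loop vector               /* i steps by m floats: strided, uncoalesced */
        for( int i = 0; i < n; ++i )
            a[i*m + j] *= s;
    }
}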

Caching and Scratchpad Memories

All three devices have classical cache memories. For the IXPC, classical cache optimizations should be as effective as for commodity multi-core systems, except there is no large shared L3 cache. For the GPUs, the small hardware cache coupled with the amount of parallelism in each core makes it much less likely that the caches will be very effective for temporal locality. These devices have small scratchpad memories, which operate with the same latency as the L1 cache.

In CUDA and OpenCL, the programmer can and must manage the scratchpad memory explicitly, using CUDA __shared__ or OpenCL __local memory. The size of this memory is quite small, and it quickly becomes a critical resource that can limit the amount of parallelism that can be exploited on the core.

C++AMP has no explicit scratchpad memory management, but it does expose a tile optimization to take advantage of memory locality. The programmer depends on the implementation or the hardware to actually exploit the locality.

Intel's offload model has no scratchpad memory management, since the target (the IXPC) has no such feature.

OpenACC has a cache directive to allow the programmer to give hints to the compiler on which data has enough reuse to cache locally. An OpenACC compiler can use this directive to store the specified data in the scratchpad memory. The PGI OpenACC compilers can also manage the scratchpad memory automatically, analyzing a program to find temporal or spatial data reuse within and across workers and vector lanes in the same gang to choose which data to store in the scratchpad memory.
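
As a sketch (our own 3-point smoother, not from the article), the cache directive names the window of data with reuse; whether that window actually lands in the scratchpad is left to the compiler:

void smooth3( float *restrict b, const float *restrict a, int n )
{
    #pragma acc parallel loop gang vector
    for( int i = 1; i < n-1; ++i ){
        #pragma acc cache(a[i-1:3])           /* hint: this 3-element window is reused within the loop body */
        b[i] = ( a[i-1] + a[i] + a[i+1] ) / 3.0f;
    }
}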

Portability

There are three levels of portability. First is language portability, meaning a programmer can use the same language to write a program for different targets, even if the programs must be different. Second is functional portability, meaning a programmer can write one program that will run on different targets, though not all targets will get the best performance. Third is performance portability, meaning a programmer can write one program that gives good performance across many targets. For standard multi-core processors, C, C++ and Fortran do a pretty good job of delivering performance portability, as long as the program is written to allow vectorization and with appropriate memory access patterns. For portability on accelerators, the problems are significantly more difficult.

CUDA C and Fortran provide reasonable portability across NVIDIA devices. Each new architecture generation adds new features and extends the programming model, so while old programs will likely run well, retuning or rewriting a program will often give even better performance. PGI has implementations of CUDA C and Fortran that target multicore x86 as an accelerator device, but there is no pretense that these provide cross-vendor portability or even necessarily performance portability of CUDA source code.

OpenCL is designed to provide language and functionality portability. Research has demonstrated that even across similar devices, like NVIDIA and AMD GPUs, retuning or rewriting a program can have a significant impact on performance. In both CUDA and OpenCL, since the memory management is explicit in the program, data will be allocated and copied even when targeting a device like a multi-core CPU which has no separate device memory.

C++AMP is intended to provide performance portability across devices. It is designed to do this by hiding or virtualizing aspects of the program that a programmer would tune for a particular target.

Intel's offload model makes no attempt to provide portability to other accelerator targets. Even when targeting a multi-core, the mechanism is to ignore the offload directives entirely, using the embedded OpenMP for parallelism on the device. The successor to Intel's offload model, the OpenMP 4.0 target directives, is promoted as portable across accelerator devices. However, these are unlikely to be efficient except on targets similar to those for which OpenMP was originally intended, and by definition they are designed around the OpenMP concept of long-lived, relatively heavyweight threads used on classical CPU cores.

OpenACC is also intended to provide performance portability across devices, and there is some initial evidence to support this claim. The OpenACC parallelism model is more abstract than any of the others discussed here, except C++AMP, yet allows the programmer to expose different aspects of parallel execution. An implementation can use this to exploit the different parallel features of a wide variety of target architectures.

Accelerator Architectures Tomorrow

Today's devices will be replaced in the next 12-24 months by a new generation, and we can speculate or predict what some of these future devices will look like.

AMD already has APUs (Accelerated Processing Units) with a CPU core and GPU on the same die, sharing the same memory interface. Future APUs will allow the GPU to use the same virtual address space as the CPU core, allowing true shared memory. Current CPU memory interfaces do not provide the memory bandwidth demanded for high performance GPUs. Future APUs will have to improve this memory interface, or the on-chip GPU performance will be limited by the available memory bandwidth. This improvement could use new memory technology, such as stacked memory, to give both CPU and GPU very high memory bandwidth, or use separate but coherent memory controllers for CPU-side and GPU-side memory.

NVIDIA has already announced plans to use stacked memory on future GPU dies. This should improve the memory bandwidth significantly. NVIDIA has also announced plans to allow the GPU to access system memory directly. If the GPU sits on an IO bus, such memory access will be quite slow relative to on-board memory, so this will not be a very broad solution. However, NVIDIA also has plans to package an NVIDIA GPU with ARM cores, which may give it many of the same capabilities as the AMD APU.

While Intel has not published a roadmap, one can imagine a convergence of the Intel Xeon Phi Coprocessor with the Intel Xeon processor line. Current IXPC products sit on an IO bus, but Intel presentations have shown diagrams with future IXPC and Xeon chips as peer processors on a single node.

In all three cases, it is possible or likely that the need to manage memory allocation and movement will diminish somewhat. However, we don't know how the accelerator architecture itself will evolve. The core count may increase, vectors may get longer (or not), other capabilities may be added. The clock speed may vary within a defined envelope as we prove out the most successful accelerator architectures. Other vendors may become viable candidates, such as the Texas Instruments or Tilera products. A successful programming model and implementation must effectively support the important accelerators available today, and be ready to support the accelerators we are likely to see tomorrow.

Our Conclusion

NVIDIA Tesla, Intel Xeon Phi and AMD Radeon have many common features that can be leveraged to create an Accelerator programming model that delivers language, functional and performance portability across devices: separated and relatively small device memories, stream-oriented memory systems, many cores, multi-threading, and SIMD/SIMT processing. Features we would all like to see in such a model are many:

• Addresses a suitable range of applications
• Allows programming of CPU+Accelerator systems as an integrated platform
• Is built around standard C/C++/Fortran without sacrificing too much performance
• Integrates with the existing, proven, widely-used parallel programming models - MPI, OpenMP
• Allows incremental porting and optimization of existing applications
• Imposes little or no disruption on the standard code/compile/run development cycle
• Is interoperable with low-level accelerator programming models
• Allows for efficient implementations on each accelerator

Admittedly, any programming model that meets most or all of these criteria is likely to short-change certain features of each of the important targets. However, in deciding which models to implement, compiler writers like those at PGI have to maximize those goals across all targets rather than maximizing them for a particular target. Likewise, when deciding which models to support and use, HPC users need to consider the costs associated with re-writing applications for successive generations of hardware and of potentially limiting the diversity of platforms from which they can choose.

To be successful, an Accelerator programming model must be low-level enough to express the important aspects of parallelism outlined above, but high-level and portable enough to allow an implementation to create efficient mappings of these aspects to a variety of Accelerator hardware targets.

Libraries are one solution, and can be a very good solution for applications dominated by widely-used algorithms. CUDA and OpenCL give control over all aspects of accelerator programming, but are relatively low level, not very incremental, and come with portability issues and a relatively steep learning curve. C++ AMP may be a good solution for pure Windows programmers, but relies heavily on compiler technology to overcome abstraction, is not standardized or portable, and for HPC programmers offers no Fortran solution. The OpenMP 4.0 target and SIMD extensions are attractive on the surface, but come with the Achilles heel of requiring support for all of OpenMP 3.1 on devices which are ill-suited to support it.

In our view, OpenACC meets the criteria for long-term success. It is orthogonal to and interoperable with MPI and OpenMP. It virtualizes into existing languages the concept of an accelerator for a general-purpose CPU where these have potentially separated memories. It exposes and gives the programmer control over the biggest current performance bottleneck: data movement over a PCI bus. It virtualizes the mapping of parallelism from the program onto the hardware, giving both the compiler and the user the freedom to create optimal mappings for specific targets. And finally, it was designed from the outset to be portable and performance portable across CPU and Accelerator targets of a canonical form: a MIMD parallel dimension, a SIMD/SIMT parallel dimension, and a memory hierarchy that is exposed to a greater or lesser degree and which must be accommodated by the programmer or compiler.

OpenACC Features in PGI Accelerator Compilers (Part 1)

by Michael Wolfe, PGI Compiler Engineer

GPUs have a very high compute capacity, and the latest designs have made them even more programmable and useful for tasks other than just graphics. Research on using GPUs for general purpose computing has gone on for several years, but it was only when NVIDIA introduced the CUDA Toolkit in 2007, including a compiler with extensions to C, that GPU computing became useful without heroic effort. Yet, while CUDA is a big step forward in GPU programming, it requires significant rewriting and restructuring of your program, and if you want to retain the option of running on an x64 host, you must maintain both the GPU and CPU program versions separately.

The OpenACC API is a higher-level, directive-based method for CPU+Accelerator programming, allowing you to treat a GPU or coprocessor as an accelerator. This article describes OpenACC features for those of you who have little or no experience with GPU programming, and no experience with the OpenACC programming model. The OpenACC API uses directives and compiler analysis to compile natural Fortran, C and C++ for an accelerator; this often allows you to maintain a single source version, because by ignoring the directives a compiler can compile the same program for execution on a multi-core CPU. Future compiler releases will even use the OpenACC directives to generate parallel code on a host multi-core CPU.


This article uses example programs to introduce the basic concepts of compiling, executing and tuning standard C, C++ and Fortran programs for CPU+Accelerator platforms using standard OpenACC compiler directives.

It will deepen your insight and understanding of how OpenACC works, how OpenACC can be used to effectively target NVIDIA Tesla and AMD Radeon GPUs, and how to make effective use of PGI compiler feedback to minimize data movement and optimize loops and kernels for maximum performance.

GPU Architecture

Let's start by looking at accelerators in general, and GPUs in particular. An accelerator is typically implemented as a coprocessor to the host; it has its own instruction set and usually (but not always) its own memory. To the hardware, the accelerator looks like another IO unit; it communicates with the CPU using IO commands and DMA memory transfers. To the software, the accelerator is another computer to which your program sends data and routines to be executed. Many accelerators have been produced over the years; with current technology, an accelerator fits on a single chip, like a CPU. Besides GPUs, other accelerators being considered today are the Intel Xeon Phi Coprocessor, the AMD APU, DSPs, and FPGAs.

Here we focus on GPUs from NVIDIA and AMD; a picture of the relevant architectural features is shown below.

OpenACC Abstract Machine Architecture

The key features are the processors, the memory, and the interconnect. The NVIDIA Kepler has up to 15 multiprocessors (PEs); each multiprocessor has 12 single-precision SIMD units and 4 double-precision SIMD units, each unit with 16 parallel GPU cores. An AMD Radeon 7970 has up to 32 compute units (PEs); each compute unit has 4 SIMD units, each with 16 parallel GPU cores. The GPU cores in a SIMD unit run synchronously, meaning all GPU cores in a SIMD unit execute the same instruction at the same time. Different SIMD units and different PEs run asynchronously, much like commodity multi-core processors.

These GPUs have their own memory, usually called device memory; this can range up to 6GB today on some devices, and will likely be larger in the near future. As with CPUs, access time to the memory is quite slow. CPUs use caches to try to reduce the effect of the long memory latency, by caching recently accessed data in the hopes that future accesses will hit in the cache. Caches can be quite effective, but they don't solve the basic memory bandwidth problem. Graphics programs typically require streaming access to large data sets that would overflow the size of a reasonable cache. To solve this problem, GPUs use multithreading. When the GPU processor issues a fetch from the device memory, that GPU thread goes to sleep until the memory returns the value. In the meantime, the GPU cores switch over very quickly, in hardware, to another GPU thread, and continue executing that thread. In this way, the GPU exploits program parallelism to keep busy while the slow device memory is responding.

While the device memory has long latency, the interconnect between the memory and the GPU processors supports very high bandwidth. In contrast to a CPU, the memory can keep up with the demands of data-intensive programs; instead of suffering from cache stalls, the GPU can keep busy, as long as there is enough parallelism to keep the processors busy.

Programming

Current approaches to programming GPUs include NVIDIA's CUDA and the open standard language OpenCL. Over the past several years, there have been many success stories using CUDA to port programs to NVIDIA GPUs. The goal of OpenCL is to provide a portable mechanism to program different GPUs and other parallel systems. Is there need or room for another programming strategy?

The initial cost of programming using CUDA or OpenCL is the programming effort to convert your program into the host part and the accelerator part. Each routine to run on the accelerator must be extracted to a separate kernel function, and the host code must manage device memory allocation, data movement, and kernel invocation. The kernel itself may have to be carefully optimized for the GPU or accelerator, including unrolling loops and orchestrating device memory fetches and stores. There are additional continuing costs of maintaining host-only and host+accelerator versions of the program and retuning when new hardware becomes available. While CUDA and OpenCL are much, much easier programming environments than what was available before 2007, and both allow very detailed low-level optimization for GPUs, they are a long way from making it easy to program and experiment.


To address this, the OpenACC consortium has defined the OpenACC API, implemented as a set of directives and runtime routines. Using OpenACC, you can more easily start and experiment with porting programs to GPUs, letting the compiler do much of the bookkeeping. The resulting program is more portable, and in fact can run unmodified on the CPU itself. In this article, we will show some initial programs to get you started, and explore some of the features of the model. As we will see, this model doesn't make parallel programming easy, but it does reduce the cost of entry; we will also explore using OpenACC directives to tune performance.

Setting Up

You need the right hardware and software to start using the PGI Accelerator OpenACC compilers. First, you need a 64-bit x86 system with an operating system supported both by PGI and the target GPU toolkit; these include recent versions of Linux, Apple OS X and Microsoft Windows. See the PGI release support page for currently supported versions and distributions. You need to install the PGI compilers, which include bundled versions of the necessary NVIDIA CUDA and AMD OpenCL toolkit components. For NVIDIA GPUs, check the Download CUDA page on the NVIDIA website for operating system support. Your system will need a CUDA-enabled NVIDIA graphics or Tesla card; see the CUDA-Enabled Products page on the NVIDIA website for a list of appropriate cards. You'll want a recent CUDA driver, available from the Download CUDA page on the NVIDIA site. For AMD GPUs, find the AMD APP SDK download page for operating system support. Your system will need a supported AMD Radeon GPU; you'll want to install a recent AMD Catalyst driver as well.

For this discussion, we'll assume you've installed the PGI Linux compilers in the default location /opt/pgi. The PGI installation has two additional directory levels. The first corresponds to the target (linux86 for 32-bit, linux86-64 for 64-bit). The second level has a directory for the PGI version (14.1 for example) and another PGI release directory containing common components (2014). If the compilers are installed, you're ready to test your accelerator connection. Try running the PGI-supplied tool pgaccelinfo. If you have everything set up properly, you should see output similar to this for an NVIDIA GPU:

CUDA Driver Version:           5000
NVRM version:                  NVIDIA UNIX x86_64 Kernel Module  310.19  Thu Nov 8 00:52:03 PST 2012

CUDA Device Number:            0
Device Name:                   GeForce GTX 690
Device Revision Number:        3.0
Global Memory Size:            2147287040
Number of Multiprocessors:     8
Number of SP Cores:            1536
Number of DP Cores:            512
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           65536
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       2147483647 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1019 MHz
Execution Timeout:             No
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   No
Memory Clock Rate:             3004 MHz
Memory Bus Width:              256 bits
L2 Cache Size:                 524288 bytes
Max Threads Per SMP:           2048
Async Engines:                 1
Unified Addressing:            Yes
Initialization time:           531645 microseconds
Current free memory:           2096226304
Upload time (4MB):             911 microseconds (671 ms pinned)
Download time:                 963 microseconds (656 ms pinned)
Upload bandwidth:              4604 MB/sec (6250 MB/sec pinned)
Download bandwidth:            4355 MB/sec (6393 MB/sec pinned)
Clock Rate:                    1147 MHz
PGI Compiler Option:           -ta=nvidia,cc30

The output first tells you the CUDA driver version information. It tells you that there is a single device, number zero; it’s an NVIDIA GeForce GTX 690, it has compute capability 3.0, 2GB memory and eight multiprocessors. You might have more than one GPU installed; perhaps you have a small GPU on the motherboard and a larger GPU or Tesla card in a PCI slot. The pgaccelinfo tool will give you information about each one it can find.

The output for an AMD Radeon GPU will look like this:

CAL version: 1.4-1848

Device Number:                0
Target ISA:                   Tahiti
Global Memory Size:           3072MB
Cached Remote Memory Size:    508MB
Uncached Remote Memory Size:  2044MB
Wavefront Size:               64
Number of SIMD Units:         28
Number of GPU Cores:          448
Pitch Alignment:              256B
Surface Alignment:            256B
Double Precision Support:     Yes
Local Data Share Support:     Yes
Global Data Share Support:    Yes
Global Register Support:      Yes
Compute Shader Model Support: Yes
Memory Export Support:        Yes
Revision:                     5
Max 1D image size:            16384
Max 2D image size:            16384 x 16384

This gives the CAL version information, and tells you that there is a single device, number zero; it’s an AMD Radeon Tahiti GPU, with 3GB memory and 28 SIMD units.

If you see the message:

No accelerators found.
Try pgaccelinfo -v for more information

then you probably don’t have the right hardware or drivers installed. Presuming everything is working, you’re ready to start your first program.

First Program

We're going to show several simple example programs; we encourage you to try each one yourself. These example programs are all available for download from the PGI website at www.pgroup.com/lit/samples/pgi_accelerator_examples.tar.

We'll start with a very simple program; it will send a vector of floats to the GPU, double it, and bring the results back. In C, the whole program is:

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

int main( int argc, char* argv[] )
{
    int n;               /* size of the vector */
    float *a;            /* the vector */
    float *restrict r;   /* the results */
    float *e;            /* expected results */
    int i;
    if( argc > 1 ) n = atoi( argv[1] );
    else           n = 100000;
    if( n <= 0 ) n = 100000;

    a = (float*)malloc(n*sizeof(float));
    r = (float*)malloc(n*sizeof(float));
    e = (float*)malloc(n*sizeof(float));
    /* initialize */
    for( i = 0; i < n; ++i ) a[i] = (float)(i+1);

    #pragma acc kernels loop
    for( i = 0; i < n; ++i ) r[i] = a[i]*2.0f;

    /* compute on the host to compare */
    for( i = 0; i < n; ++i ) e[i] = a[i]*2.0f;
    /* check the results */
    for( i = 0; i < n; ++i ) assert( r[i] == e[i] );
    printf( "%d iterations completed\n", n );
    return 0;
}

Note the restrict keyword in the declarations of the pointer r assigned in the loop; we’ll see why shortly. Note also the explicit float constant 2.0f instead of 2.0. By default, C floating point constants are double precision. The expression a[i]*2.0 is computed in double precision, as (float)((double)a[i] * 2.0). To avoid this, use explicit float constants, or use the PGI command line flag -Mfcon, which treats floating point constants as type float by default.
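
A two-line illustration of the difference (ours, not part of the example program):

float scale_double_math( float x ) { return x * 2.0;  /* promoted: computed as (float)((double)x * 2.0) */ }
float scale_float_math ( float x ) { return x * 2.0f; /* stays in single precision */ }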

We prefixed the loop we want offloaded to the GPU by a kernels loop directive. This tells the compiler to find the parallelism in the loop, move the data over to the GPU, launch the operation on the GPU, then bring the results back. For this program, it’s as simple as that.

For an NVIDIA GPU, build this with the command:

% pgcc -o acc_c1.exe acc_c1.c -acc -Minfo -ta=nvidia

Note the -acc and -Minfo and -ta flags. The -acc flag enables OpenACC directives in the compiler; by default, the PGI compilers will target the accelerator regions for an NVIDIA GPU. We'll show other options below. The -Minfo flag enables informational messages from the compiler; we'll enable this on all our builds, and explain what the messages mean. You're going to want to understand these messages when you start to tune for performance. The -ta flag chooses the target accelerator, NVIDIA in this case.

If everything is installed and licensed correctly you should see the following informational messages from pgcc:

main:
     21, Generating present or copyout(r[0:n])
         Generating present or copyin(a[0:n])
         Generating NVIDIA code
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
     22, Loop is parallelizable
         Accelerator kernel generated
         22, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

For an AMD GPU, build with the command:

% pgcc -o acc_c1.exe acc_c1.c -acc -Minfo -ta=radeon

The -ta option selects an AMD Radeon target. The informational messages for this build are similar:

main:
     21, Generating present or copyout(r[0:n])
         Generating present or copyin(a[0:n])
         Generating Radeon code
     22, Loop is parallelizable
         Accelerator kernel generated
         22, #pragma acc loop gang, vector(128) /* global dim(0) local dim(0) */

Let's explain a few of these messages. The first:

Generating present or copyout(r[0:n])

tells you that the array r is assigned, but never read inside the loop; the values from the CPU memory don't need to be sent to the GPU before kernel execution, but the modified values need to be copied back. This is a copy-out from the device memory. It starts at element r[0] and continues for n elements. It's called "present or copy" because if the array has already been moved to GPU memory using a previous directive, that copy will be used instead of copying the data over again. The second message

Generating present or copyin(a[0:n])

tells you that the compiler determined that the array a is used only as input to the loop, so those n elements of array a need to be copied over from the CPU memory to the GPU device memory; this is a copy-in to the device memory. Because these elements aren't modified, they don't need to be copied back. Farther below that is the message:

Loop is parallelizable

This tells you that the compiler analyzed the references in the loop and determined that all iterations could be executed in parallel. We added the restrict keyword to the declaration of the pointer r to allow this; otherwise, the compiler couldn't safely determine that a and r pointed to different memory. The next message is the key one:

Accelerator kernel generated

This tells you that the compiler successfully converted the body of that loop to a kernel for the GPU. The kernel is the GPU function itself, created by the compiler, that will be called by the program and executed in parallel on the GPU.

So now you’re ready to run the program. Assuming you’re on the machine with the GPU, just type the name of the executable, acc_c1.exe. If you get a message

libcuda.so not found, exiting

or similarly

No accelerator matches the profile for which this program was compiled

then you may be running this program on a system without a GPU corresponding to the -ta target you specified on the build line, or you may not have installed the device driver in its default location. In the latter case, you may have to set the environment variable LD_LIBRARY_PATH. What you should see is just the final output:

100000 iterations completed

How do you know that anything executed on the GPU? You can set the environment variable ACC_NOTIFY to 1.

csh:  setenv ACC_NOTIFY 1
bash: export ACC_NOTIFY=1

Then re-run the program. This environment variable tells the OpenACC runtime to load a small dynamic library that prints out a line each time a GPU kernel is launched. In this case, you'll see something like:

launch CUDA kernel file=acc_c1.c function=main line=22 device=0 grid=782 block=128

which tells you the file, function, and line number of the kernel, and the CUDA grid and thread block dimensions. You probably don't want to leave this environment variable set for all your programs, but it's instructive and useful during program development and testing.

Second Program

Our first program was pretty trivial, just enough to get a test run. Let's take on a slightly more interesting program, one that has more computational intensity on each iteration. In C, the program is:

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <sys/time.h>
#include <math.h>
#include <accel.h>
#include <accelmath.h>

int main( int argc, char* argv[] )
{
    int n;                     /* size of the vector */
    float *a;                  /* the vector */
    float *restrict r;         /* the results */
    float *e;                  /* expected results */
    float s, c;
    struct timeval t1, t2, t3;
    long cgpu, chost;
    int i;

    if( argc > 1 ) n = atoi( argv[1] );
    else n = 100000;
    if( n <= 0 ) n = 100000;

    a = (float*)malloc(n*sizeof(float));
    r = (float*)malloc(n*sizeof(float));
    e = (float*)malloc(n*sizeof(float));

    /* initialize the input vector */
    for( i = 0; i < n; ++i )
        a[i] = (float)(i+1) * 2.0f;

    /*acc_init( acc_device_nvidia );*/

    gettimeofday( &t1, NULL );
    #pragma acc kernels loop
    for( i = 0; i < n; ++i ){
        s = sinf(a[i]);
        c = cosf(a[i]);
        r[i] = s*s + c*c;
    }
    gettimeofday( &t2, NULL );
    cgpu = (t2.tv_sec - t1.tv_sec)*1000000 +
           (t2.tv_usec - t1.tv_usec);

    /* compute on the host to compare */
    for( i = 0; i < n; ++i ){
        s = sinf(a[i]);
        c = cosf(a[i]);
        e[i] = s*s + c*c;
    }
    gettimeofday( &t3, NULL );
    chost = (t3.tv_sec - t2.tv_sec)*1000000 +
            (t3.tv_usec - t2.tv_usec);

    /* check the results */
    for( i = 0; i < n; ++i )
        assert( fabsf(r[i] - e[i]) < 0.000001f );

    printf( "%13d iterations completed\n", n );
    printf( "%13ld microseconds on GPU\n", cgpu );
    printf( "%13ld microseconds on host\n", chost );
    return 0;
}

The program reads the first command line argument as the number of elements to compute, allocates arrays, runs a kernel to compute floating point sine and cosine (note sinf and cosf function names here), and compares to the host. Some details to note:

• There is a call to acc_init() that is commented out; you'll uncomment that shortly
• There are calls to gettimeofday() to measure wall clock time on the GPU and host loops
• The program doesn't compare for equality, it compares against a tolerance. We'll discuss that as well.

Now let’s build and run the program; you’ll compare the speed of your GPU to the speed of the host. Build as you did before, and you should see messages much like you did before. You can view just the accelerator messages by replacing -Minfo by -Minfo=accel on the compile line.

The first time you run this program, you’ll see output something like:

 1000000 iterations completed
 2034132 microseconds on GPU
   39954 microseconds on host

So what’s this? Over two seconds on the GPU? Only 39 milliseconds on the host? What’s the deal?

Let's explore this a little. If I enclose the program, from the first call to the timer to the last print statement, in another loop that iterates three times, I'll see something more like the following:

 1000000 iterations completed
 2548137 microseconds on GPU
   35802 microseconds on host
 1000000 iterations completed
    3131 microseconds on GPU
   35475 microseconds on host
 1000000 iterations completed
    2881 microseconds on GPU
   34735 microseconds on host

The time on the GPU is very long for the first iteration, then is much faster after that. The reason is the overhead of connecting to the GPU. On Linux, if you run on a GPU without a display connected, it can take 0.5 to 1.5 seconds to power up the GPU from its power-saving state when the first kernel executes. You can set the environment variable PGI_ACC_TIME to 1 before running your program. This directs the runtime to collect the time spent in GPU initialization, data movement, and kernel execution.
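Following the same convention as ACC_NOTIFY above:

csh:  setenv PGI_ACC_TIME 1
bash: export PGI_ACC_TIME=1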

If we set this environment variable, we’ll get additional profile information:

 1000000 iterations completed
 1737566 microseconds on GPU
   38877 microseconds on host

Accelerator Kernel Timing data
acc_c2.c
  main  NVIDIA  devicenum=0
    time(us): 1,685
    43: compute region reached 1 time
        43: data copyin reached 1 times
            device time(us): total=668 max=668 min=668 avg=668
        44: kernel launched 1 time
            grid: [7813]  block: [128]
            device time(us): total=379 max=379 min=379 avg=379
            elapsed time(us): total=391 max=391 min=391 avg=391
        49: data copyout reached 1 time
            device time(us): total=638 max=638 min=638 avg=638

The timing data tells us that 1.7 milliseconds were spent on the GPU, with 668 microseconds spent copying data to the device and 638 microseconds copying data back to the host. Only about 390 microseconds were spent executing the kernel. The rest of the 1.7 seconds reported by the host timer was spent initializing the device.

Let’s take the initialization out of the timing code altogether. Uncomment the call to acc_init() in your program, rebuild and then run the program. You should see output more like:

 1000000 iterations completed
   38050 microseconds on GPU
   36325 microseconds on host

The GPU time still includes the overhead of moving data between the GPU and host. Your times may differ, even substantially, depending on the GPU you have installed (particularly the Number of Multiprocessors reported by pgaccelinfo for an NVIDIA GPU) and the host processor. These runs were made on a 3.1GHz Intel Sandy Bridge CPU. You will see similar behavior building for and running on an AMD Radeon 7970 GPU.

To see some more interesting performance numbers, try increasing the number of loop iterations. The program defaults to 1,000,000; increase this to 5,000,000. On my machine, I see:

 5000000 iterations completed
   42934 microseconds on GPU
  270039 microseconds on host

Note the GPU time increases hardly at all, whereas the host time increases by quite a bit more. That’s because the host is sensitive to cache locality. The GPU has a very high bandwidth device memory; it uses the extra parallelism that comes from the 5,000,000 parallel iterations to tolerate the long memory latency, so there’s no performance cliff.

So, let's go back and look at the reason for the tolerance test, instead of an equality test, for correctness. The fact is that the GPU doesn't always compute to exactly the same precision as the host. In particular, some transcendental and trigonometric functions may differ in the low-order bit. You, the programmer, have to be aware of the potential for these differences, and if they are not acceptable, you may need to wait until GPUs implement full host bit-exactness. Before the adoption of the IEEE floating point arithmetic standard, every computer used a different floating-point format and delivered different precision, so this is not a new problem, just a new manifestation.

Your next assignment is to convert this second program to double precision; remember to change the sinf and cosf calls. You might compare the results to find the maximum difference. We're computing sin^2 + cos^2, which, if I remember my high school geometry, should equal 1.0 for all angles. So, you can compare the GPU and host computed values against the actual correct answer as well.
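As a sketch of that kind of check for the single-precision version above (reusing its r, e, n, and i variables; this is not part of the original listing), you might track the largest deviation from 1.0 on each side:

    /* sketch: largest deviation of the GPU and host results from 1.0 */
    double maxgpu = 0.0, maxhost = 0.0;
    for( i = 0; i < n; ++i ){
        if( fabs((double)r[i] - 1.0) > maxgpu )  maxgpu  = fabs((double)r[i] - 1.0);
        if( fabs((double)e[i] - 1.0) > maxhost ) maxhost = fabs((double)e[i] - 1.0);
    }
    printf( "max |gpu-1.0| = %g, max |host-1.0| = %g\n", maxgpu, maxhost );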

Third Program

Here, we’ll explore writing a slightly more complex program, and try some other options, such as building it to run on either the GPU, or on the host if you don’t have a GPU installed. We’ll look at a simple Jacobi relaxation on a two-dimensional rectangular mesh. In C, the relaxation routine we’ll use is:

void
smooth( float* restrict a, float* restrict b,
        float w0, float w1, float w2,
        int n, int m, int niters )
{
    int i, j, iter;
    float* tmp;
    for( iter = 1; iter < niters; ++iter ){
        #pragma acc kernels loop copyin(b[0:n*m]) copy(a[0:n*m]) independent
        for( i = 1; i < n-1; ++i )
            for( j = 1; j < m-1; ++j )
                a[i*m+j] = w0 *  b[i*m+j] +
                           w1 *( b[(i-1)*m+j]   + b[(i+1)*m+j] +
                                 b[i*m+j-1]     + b[i*m+j+1] ) +
                           w2 *( b[(i-1)*m+j-1] + b[(i-1)*m+j+1] +
                                 b[(i+1)*m+j-1] + b[(i+1)*m+j+1] );
        tmp = a;  a = b;  b = tmp;
    }
}

Again, note the use of the restrict keyword on the pointers. We can build this routine as before, and we'll get a set of messages as before. This particular implementation executes a fixed number of iterations. We added some clauses to the kernels directive. The first is a copyin data clause to tell the compiler to copy the array b, starting at b[0] and continuing for n*m elements, into the GPU. The second is a copy clause to tell the compiler to copy the array a, starting at a[0] and continuing for n*m elements, into the GPU, and back out from the GPU at the end of the loop. We also added the independent clause to tell the compiler that the iterations of the i loop are data-independent as well, and can be run in parallel.

We can build this program as before for an NVIDIA GPU, and we’ll see the informational messages:

smooth:
     42, Generating present or copyin(b[0:m*n])
         Generating present or copy(a[0:m*n])
         Generating NVIDIA code
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
     43, Loop is parallelizable
     44, Loop is parallelizable
         Accelerator kernel generated
         43, #pragma acc loop gang /* blockIdx.y */
         44, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

The compiler produces messages about the data copied into and out of the GPU. The next lines tell us that the compiler generated three GPU binaries: one suitable for Tesla devices (NVIDIA compute capability 1.0-1.3), a second for Fermi devices (compute capability 2.0) and a third for Kepler devices (compute capability 3.0 and 3.5). After the line specifying that an accelerator kernel was generated, the compiler gives us the execution schedule, the mapping of the loop iterations to the device parallelism. We'll see similar messages if we compile with the -ta=radeon command line option.

The compiler prints the schedule using the directives that you might use if you were going to specify this schedule manually. In this case, the compiler chooses to execute the inner loop (with the stride-1 array references) in vector mode (vector(128)) and executes the vectors in parallel across the NVIDIA multiprocessors or Radeon vector units (gang for both loops). A useful exercise you might try is to remove the independent clause, recompile the program, and look at the schedule generated by the compiler.
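If you did want to write a schedule by hand, a sketch in that directive style might look like the following for this loop nest (reusing the a, b, w0, w1, w2, n and m names from the smooth routine); the clause choices here are illustrative, not necessarily the schedule the compiler would pick for every device:

    #pragma acc kernels loop independent gang copyin(b[0:n*m]) copy(a[0:n*m])
    for( i = 1; i < n-1; ++i ){
        #pragma acc loop independent vector(128)
        for( j = 1; j < m-1; ++j )
            a[i*m+j] = w0 *  b[i*m+j] +
                       w1 *( b[(i-1)*m+j]   + b[(i+1)*m+j] +
                             b[i*m+j-1]     + b[i*m+j+1] ) +
                       w2 *( b[(i-1)*m+j-1] + b[(i-1)*m+j+1] +
                             b[(i+1)*m+j-1] + b[(i+1)*m+j+1] );
    }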

We’ve provided the source for this routine along with a driver routine to call it. Try running the program with a 1000 x 1000 matrix and 50 iterations, using a call to acc_init to isolate any device initialization. On my NVIDIA machine, setting the environment variable PGI_ACC_TIME, I see the following output:

Accelerator Kernel Timing data
acc_c3.c
  smooth  NVIDIA  devicenum=1
    time(us): 140,987
  smooth  NVIDIA  devicenum=0
    time(us): 210,808
    41: compute region reached 100 times
        41: data copyin reached 200 times
            device time(us): total=130,493 max=664 min=649 avg=652
        44: kernel launched 100 times
            grid: [8x998]  block: [128]
            device time(us): total=16,824 max=403 min=164 avg=168
            elapsed time(us): total=17,889 max=415 min=174 avg=178
        48: data copyout reached 100 times
            device time(us): total=63,491 max=638 min=633 avg=634

On my Radeon machine, the timings are similar:

Accelerator Kernel Timing data
acc_c3.c
  smooth  AMD RADEON  devicenum=1
    time(us): 313,522
    41: compute region reached 100 times
        41: data copyin reached 200 times
            device time(us): total=211,745 max=1,188 min=866 avg=1,058
        44: kernel launched 100 times
            grid: [16x499]  block: [64x2]
            device time(us): total=6,314 max=183 min=57 avg=63
            elapsed time(us): total=17,368 max=388 min=130 avg=173
        48: data copyout reached 100 times
            device time(us): total=95,463 max=1,050 min=745 avg=954

As before, we see that over 90% of the GPU time is spent moving data to and from the GPU memory, especially for the Radeon. What if, instead of moving the data back and forth for each iteration, we could move the data once and leave it there? OpenACC gives us a way to do just that. Let's modify the program so that, in the routine that calls smooth, we surround the call with a data construct:

#pragma acc data copy(b[0:n*m], a[0:n*m])
{
    smooth( a, b, w0, w1, w2, n, m, iters );
}

This tells the compiler to copy both arrays to the GPU before the call, and bring the results back to the host memory after the call. Inside the function, replace the copyin and copy data clauses with a present clause:

#pragma acc kernels loop present(b[0:n*m], a[0:n*m]) independent

This tells the compiler that the data is already present on the GPU, so rather than copying the data it should just use the copy that is already there. In the output from the modified program below, we see no data movement in the smooth function, just the 15 milliseconds for kernel execution. The data movement only happens in the main routine, and only once, taking roughly 2.5 milliseconds, instead of roughly 300 as before, for the NVIDIA GPU:

Accelerator Kernel Timing data
acc_c3a.c
  smooth  NVIDIA  devicenum=0
    time(us): 15,823
    44: kernel launched 100 times
        grid: [8x998]  block: [128]
        device time(us): total=15,823 max=251 min=156 avg=158
        elapsed time(us): total=16,885 max=263
acc_c3a.c
  main  NVIDIA  devicenum=0
    time(us): 2,589
    137: data copyin reached 2 times
         device time(us): total=1,315 max=662 min=653 avg=657
    141: data copyout reached 2 times
         device time(us): total=1,274 max=639 min=635 avg=637

For the AMD Radeon, the difference is even more dramatic, reflecting the relatively higher cost of data movement:

Accelerator Kernel Timing data
acc_c3a.c
  smooth  AMD RADEON  devicenum=1
    time(us): 9,017
    44: kernel launched 100 times
        grid: [16x499]  block: [64x2]
        device time(us): total=9,017 max=182 min=51 avg=90
        elapsed time(us): total=23,267 max=3,362 min=118 avg=232
acc_c3a.c
  main  AMD RADEON  devicenum=1
    time(us): 4,360
    137: data copyin reached 2 times
         device time(us): total=2,316 max=1,324 min=992 avg=1,158
    141: data copyout reached 2 times
         device time(us): total=2,044 max=1,060 min=984 avg=984

But suppose we want a program that will run on the GPU when it's available, or on the host when it's not. The PGI compilers provide that functionality as well. To enable it, build with the target accelerator option -ta=nvidia,host or -ta=radeon,host. This generates two versions of the smooth function, one that can run on the host and one that can run on the GPU. At run time, the program will determine whether there is a GPU attached and run that version if there is, or run the host version if there is not.

By default, the runtime system will use the GPU version if a GPU is available, and will use the host version if it is not. You can manually select the host version in two ways. One way is to set the environment variable ACC_DEVICE_TYPE before running the program:

csh:  setenv ACC_DEVICE_TYPE host
bash: export ACC_DEVICE_TYPE=host

You can go back to the default by unsetting this variable. Alternatively, the program can select the device by calling an API routine:

#include "openacc.h"
...
acc_set_device_type( acc_device_host );
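For example, a program built with -ta=nvidia,host might choose its device at startup along these lines; this is a sketch using the standard OpenACC acc_get_num_devices() query, which is not used elsewhere in this article:

#include "openacc.h"

/* Sketch: prefer an NVIDIA GPU if one is present, otherwise fall back
   to the host version generated by -ta=nvidia,host. */
void choose_device( void )
{
    if( acc_get_num_devices( acc_device_nvidia ) > 0 )
        acc_set_device_type( acc_device_nvidia );
    else
        acc_set_device_type( acc_device_host );
}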

Summary

This article introduced OpenACC features in the PGI Accelerator compilers for NVIDIA Tesla and AMD Radeon GPUs, and presented three simple programs in C. We looked at issues you may encounter with float versus double and unrestricted pointers. We enabled OpenACC directives with the -acc flag, used the simple accelerator profiling utility with the PGI_ACC_TIME environment variable, and used -ta=nvidia,host and -ta=radeon,host to generate a unified CPU+Accelerator binary. We also briefly introduced the data construct and the present clause. We encourage you to try these examples to get started programming your GPU accelerators with OpenACC.


The enclosed articles were originally published in the PGInsider newsletter. These and many more like them are designed to help you effectively use the PGI compilers and tools. The PGInsider is free to subscribers. Register on the PGI website at http://www.pgroup.com/register

PGFORTRAN, PGI Unified Binary and PGI Accelerator are trademarks and PGI, PGI CDK, PGCC, PGC++, PGI Visual Fortran, PVF, Cluster Development Kit, PGPROF, PGDBG and The Portland Group are registered trademarks of NVIDIA Corporation

© 2009-2013 The Portland Group, Inc. All rights reserved.