
Signal Processing and General Purpose Computing on GPU

Muge Guher

5996283, University of Ottawa

Abstract

Graphics processing units (GPUs) have been growing in popularity due to their impressive processing capabilities and, with general-purpose programming languages such as NVIDIA's CUDA interface, are becoming the platform of choice in the scientific computing community. Today the research community successfully uses GPUs to solve a broad range of computationally demanding, complex problems. This effort in general-purpose computing on the GPU, also known as GPU computing, has positioned the GPU as a compelling alternative to traditional microprocessors in high-performance computer systems and DSP applications. GPUs have evolved a general-purpose programmable architecture and supporting ecosystem that make their use possible in a wide range of non-graphics tasks, including many applications in signal processing. Commercial, high-performance signal processing applications that use GPUs as accelerators for general-purpose tasks are still emerging, but many aspects of the architecture of GPUs and their wide availability make them interesting options for implementing and deploying such applications, even though there are memory bottleneck challenges that have to be overcome in real-time processing. This report briefly describes the history, hardware architecture, and programming model for GPU computing, as well as the modern tools and methods in recent use. The application space of the GPGPU is explored with examples of signal processing applications and with results of recently conducted evaluations and benchmarks.

1. Introduction

High-performance embedded computing applications pose significant challenges for processing throughput. Traditionally, such applications have been implemented on application-specific integrated circuits (ASICs) and/or digital signal processors (DSPs). However, ASICs' advantage in performance and power often could not justify the fast-increasing fabrication cost, while DSPs offer limited processing throughput, usually lower than 100 GFLOPS. On the other hand, current multi-core processors, especially graphics processing units (GPUs), deliver very high computing throughput while maintaining high flexibility and programmability.

The potential of GPUs for high-performance embedded computing has been explored, and there are many examples of comprehensive performance evaluations on GPUs with the High Performance Embedded Computing (HPEC) benchmark suite and with signal processing benchmarks such as radar processing applications. Some have compared the performance and power efficiency of GPUs and DSPs with the HPEC benchmarks. The comparison reveals that the major hurdle for GPU applications in embedded computing is the GPU's relatively low power efficiency. Traditionally, the abovementioned high-performance embedded computing applications depend on application-specific integrated circuits (ASICs) to realize the functionality. ASICs offer the highest performance and best power efficiency, but they lack flexibility in areas of customization and future upgrades. With the fast-increasing fabrication cost as well as the ever-growing requirements for service differentiation, DSPs have gradually become the workhorse for performance-demanding embedded systems. However, DSPs were initially designed for low-power and low-cost applications. Thus, current single-chip DSPs usually deliver relatively limited computing power. For example, TI's high-performance embedded processor, the TMS320C6472, has six cores and achieves a peak performance of 67.2 GFLOPS. Oftentimes, to provide a sufficient level of performance, a high-performance embedded system has to be assembled from multiple boards, each carrying one or more DSPs. Such multi-board solutions tend to be bulky and costly due to the extra levels of packaging. Moreover, the distribution of program data over multiple DSPs also incurs performance overhead. On the other hand, modern multi-core general-purpose processors have become very powerful computing machines. In particular, current GPUs can attain a peak floating-point throughput of up to 1536 giga-floating-point operations per second (GFLOPS), which is higher than that of the fastest DSP by a factor of over 20. In addition, GPUs follow a single program multiple data (SPMD) programming model, which matches well the underlying computing patterns of most high-performance embedded applications. GPUs have evolved a general-purpose programmable architecture and supporting ecosystem that make their use possible in a wide range of non-graphics tasks, including many applications in signal processing.

Commercial, high-performance signal processing applications that use GPUs as accelerators for general purpose (GP) tasks (referred to as GPGPU processing) are still emerging, but many aspects of the architecture of GPUs and their wide availability make them interesting options for implementing and deploying such applications, even though there are challenges, such as the memory bottleneck, that have to be overcome in real-time processing. Data stream processing applications such as stock exchange data analysis, VoIP streaming, and sensor data processing pose two conflicting challenges: short per-stream latency, to satisfy the milliseconds-long, hard real-time constraints of each stream, and high throughput, to enable efficient processing of as many streams as possible. High-throughput programmable accelerators such as modern GPUs hold high potential to speed up the computations. However, their use for hard real-time stream processing is complicated by slow communication with CPUs, variable throughput changing non-linearly with the input size, and weak consistency of their local memory with respect to CPU accesses. Furthermore, their coarse-grained hardware scheduler renders them unsuitable for unbalanced multi-stream workloads.

2. Overview of GPU Computing

The graphics processing unit (GPU) has become an integral part of today's mainstream computing systems. Over recent years, there has been a marked increase in the performance and capabilities of GPUs. The modern GPU is not only a powerful graphics engine but also a highly parallel programmable processor featuring peak arithmetic and memory bandwidth that substantially outpaces its CPU counterpart. The GPU's rapid increase in both programmability and capability has spawned a research community that has successfully mapped a broad range of computationally demanding, complex problems to the GPU. This effort in general-purpose computing on the GPU, also known as GPU computing, has positioned the GPU as a compelling alternative to traditional microprocessors in high-performance computer systems of the future. The GPU is designed for a particular class of applications with the following characteristics; over the past few years, a growing community has identified other applications with similar characteristics and successfully mapped these applications onto the GPU. Computational requirements are large. Real-time rendering requires billions of pixels per second, and each pixel requires hundreds or more operations. GPUs must deliver an enormous amount of compute performance to satisfy the demand of complex real-time applications. Parallelism is substantial. Fortunately, the graphics pipeline is well suited for parallelism. Operations on vertices and fragments are well matched to fine-grained

closely coupled programmable parallel compute units, which in turn are applicable to many other computational domains. Throughput is more important than latency. GPU implementations of the graphics pipeline prioritize throughput over latency. The human visual system operates on millisecond time scales, while operations within a modern processor take nanoseconds. This six-order-of-magnitude gap means that the latency of any individual operation is unimportant. As a consequence, the graphics pipeline is quite deep, perhaps hundreds to thousands of cycles, with thousands of primitives in flight at any given time. The pipeline is also feed-forward, removing the penalty of control hazards, further allowing optimal throughput of primitives through the pipeline. This emphasis on throughput is characteristic of applications in other areas as well. In the early 2000s, the GPU was a fixed-function processor, built around the graphics pipeline, that excelled at three-dimensional (3-D) graphics but not much more than that. Since that time, the GPU has evolved into a powerful programmable processor, with both application programming interfaces (APIs) and hardware increasingly focusing on the programmable aspects of the GPU. The result is a processor with enormous arithmetic capability (a single NVIDIA GeForce 8800 GTX can sustain over 330 GFLOPS) and streaming memory bandwidth (80+ GB/s), both substantially greater than those of a high-end CPU. The GPU has a distinctive architecture centered around a large number of fine-grained parallel processors. Just as important in the development of the GPU as a general-purpose computing engine has been the advancement of the programming model and programming tools. The challenge to GPU vendors and researchers has been to strike the right balance between low-level access to the hardware to enable performance and high-level programming languages and tools that allow programmer flexibility and productivity, all in the face of rapidly advancing hardware. Until recently, GPU computing could best be described as an academic exercise. Because of the primitive nature of the tools and techniques, the first generations of applications were notable for simply working at all. As the field matured, the techniques became more sophisticated and the comparisons with non-GPU work more rigorous. As games have become increasingly limited by CPU performance, offloading complex CPU tasks to the GPU yields better overall performance. GPUs are also playing an increasing role in scientific computing applications. Three such applications are in computational biophysics: protein folding simulation, scalable molecular dynamics, and calculating electrostatic potential maps. These applications demonstrate the potential of the GPU for delivering real performance gains on computationally complex, large problems.

Looking to the future, one of the most important challenges for GPU computing is to connect with the mainstream fields of processor architecture and programming systems, as well as to learn from the parallel computing experts of the past.

3. GPU Architecture

The GPU has always been a processor with ample computational resources. The most important recent trend, however, has been exposing that computation to the programmer. Over the past few years, the GPU has evolved from a fixed-function special-purpose processor into a full-fledged parallel programmable processor with additional fixed-function special-purpose functionality. More than ever, the programmable aspects of the processor have taken center stage.

3.1 The Hardware Graphics Pipeline

The hardware architecture of a graphics processing unit differs from that of a normal CPU in several key aspects. These differences originate in the special conditions in the field of real-time computer graphics:

• Many objects like pixels and vertices can be handled in isolation and are not interdependent.

• There are many independent objects (millions of pixels, thousands of vertices, …).

• Many objects require expensive computations.

GPU architectures evolved to meet these requirements. The input to the GPU is a list of geometric primitives, typically triangles, in a 3-D world coordinate system. Through many steps, those primitives are shaded and mapped onto the screen, where they are assembled to create a final picture. It is instructive to first explain the specific steps in the canonical pipeline before showing how the pipeline has become programmable. Vertex Operations: The input primitives are formed from individual vertices. Each vertex must be transformed into screen space and shaded, typically by computing its interaction with the lights in the scene. Because typical scenes have tens to hundreds of thousands of vertices, and each vertex can be computed independently, this stage is well suited for parallel hardware. Primitive Assembly: The vertices are assembled into triangles, the fundamental hardware-supported primitive in today's GPUs. Rasterization: Rasterization is the process of determining which screen-space pixel locations are covered by each triangle. Each triangle generates a primitive called a fragment at each screen-space pixel location that it covers. Because many triangles may overlap at any pixel location, each pixel's color value may be computed from several fragments.

Fragment Operations: Using color information from the vertices and possibly fetching additional data from global memory in the form of textures (images that are mapped onto surfaces), each fragment is shaded to determine its final color. Just as in the vertex stage, each fragment can be computed in parallel. This stage is typically the most computationally demanding stage in the graphics pipeline. Composition: Fragments are assembled into a final image with one color per pixel, usually by keeping the closest fragment to the camera for each pixel location. Historically, the operations available at the vertex and fragment stages were configurable but not programmable.

3.2 GPU Architecture Evolution

The fixed-function pipeline lacked the generality to efficiently express more complicated but essential shading and lighting operations. The key step was replacing the fixed-function per-vertex and per-fragment operations with user-specified programs run on each vertex and fragment. Over recent generations, these vertex programs and fragment programs have become increasingly capable, with larger limits on their size and resource consumption, with more fully featured instruction sets, and with more flexible control-flow operations. Indeed, while previous generations of GPUs could best be described as additions of programmability to a fixed-function pipeline, today's GPUs are better characterized as a programmable engine surrounded by supporting fixed-function units.

3.3 Modern GPU Architecture

The GPU is built for different application demands than the CPU: large, parallel computation requirements with an emphasis on throughput rather than latency. Therefore, the architecture of the GPU has progressed in a different direction than that of the CPU. Consider a pipeline of tasks, such as is seen in most graphics APIs (and many other applications), that must process a large number of input elements. In such a pipeline, the output of each successive task is fed into the input of the next task. The pipeline exposes the task parallelism of the application, as data in multiple pipeline stages can be computed at the same time; within each stage, computing more than one element at the same time is data parallelism. To execute such a pipeline, a CPU would take a single element (or group of elements) and process the first stage in the pipeline, then the next stage, and so on. The CPU divides the pipeline in time, applying all resources in the processor to each stage in turn. GPUs have historically taken a different approach. The GPU divides the resources of the processor among the different stages, such that the pipeline is divided in space, not time. The part of the processor working on one stage feeds its output directly into a different part that works on the next stage.

This machine organization was highly successful in fixed-function GPUs for two reasons. First, the hardware in any given stage could exploit data parallelism within that stage, processing multiple elements at the same time. Because many task-parallel stages were running at any time, the GPU could meet the large compute needs of the graphics pipeline. Secondly, each stage’s hardware could be customized with special-purpose hardware for its given task, allowing substantially greater compute and area efficiency over a general-purpose solution. As programmable stages replaced fixed-function stages, the special-purpose fixed function components were simply replaced by programmable components, but the task-parallel organization did not change.

Figure 3-1. Both AMD and NVIDIA build architectures with unified, massively parallel programmable units at their cores. (a) The NVIDIA GeForce 8800 GTX (top) features 16 streaming multiprocessors of 8 thread (stream) processors each. One pair of streaming multiprocessors is shown below; each contains shared instruction and data caches, control logic, a 16 kB shared memory, eight stream processors, and two special function units. (Courtesy of NVIDIA.)

The result was a lengthy, feed-forward GPU pipeline with many stages, each typically accelerated by special-purpose parallel hardware. In a CPU, any given operation may take on the order of 20 cycles between entering and leaving the CPU pipeline. On a GPU, a graphics operation may take thousands of cycles from start to finish. The latency of any given operation is long. However, the task and data parallelism across and between stages delivers high throughput. The major disadvantage of the GPU task-parallel pipeline is load balancing. Like any pipeline, the performance of the GPU pipeline is dependent on its slowest stage. In the early days of programmable stages, the instruction sets of the vertex and fragment programs

were quite different, so these stages were separate. However, as both the vertex and fragment programs became more fully featured, and as the instruction sets converged, GPU architects reconsidered a strict task-parallel pipeline in favor of a unified shader architecture, in which all programmable units in the pipeline share a single programmable hardware unit. While much of the pipeline is still task-parallel, the programmable units now divide their time among vertex work, fragment work, and geometry work. These units can exploit both task and data parallelism. As the programmable parts of the pipeline are responsible for more and more computation within the graphics pipeline, the architecture of the GPU is migrating from a strict pipelined task-parallel architecture to one that is increasingly built around a single unified data-parallel programmable unit. AMD introduced the first unified shader architecture for modern GPUs in its Xenos GPU in the Xbox 360 in 2005. Today, both AMD's and NVIDIA's flagship GPUs feature unified shaders. The benefit for GPU users is better load balancing at the cost of more complex hardware. The benefit for GPGPU users is clear: with all the programmable power in a single hardware unit, GPGPU programmers can now target that programmable unit directly, rather than the previous approach of dividing work across multiple hardware units.

3.4 Overview of NVIDIA Tesla and Fermi GPUs

General-purpose parallel programming on GPUs is a relatively recent phenomenon. GPUs were originally hardware blocks optimized for a small set of graphics operations. As demand arose for more flexibility, GPUs became ever more programmable. Early approaches to computing on GPUs cast computations into a graphics framework, allocating buffers (arrays) and writing shaders (kernel functions). Several research projects looked at designing languages to simplify this task; in late 2006, NVIDIA introduced its CUDA architecture and tools to make data-parallel computing on a GPU more straightforward. The data-parallel features of CUDA map well to the data parallelism available on NVIDIA GPUs.

Figure 3-2 NVIDIA Tesla Block Diagram [2]

Figure 3-3 NVIDIA Fermi Block Diagram [2]

3.4.1 GPU Hardware

A GPU is connected to a host through a high-speed I/O bus slot, typically PCI Express in current high-performance systems. The GPU has its own device memory, up to several gigabytes in current configurations. Data is usually transferred between the GPU and host memories using programmed DMA, which can operate concurrently with both the host and GPU compute units, though there is some support for direct access to host memory from the GPU under certain restrictions. As a GPU is designed for stream or throughput computing, it does not depend on a deep cache memory hierarchy for memory performance. The device memory supports very high data bandwidth using a wide data path. NVIDIA GPUs have a number of multiprocessors, each of which executes in parallel with the others. On Tesla, each multiprocessor has a group of 8 stream processors; a Fermi multiprocessor has two groups of 16 stream processors. The high-end Tesla accelerators have 30 multiprocessors, for a total of 240 cores; a high-end Fermi has 16 multiprocessors, for 512 cores. Each core can execute a sequential thread, but the cores execute in what NVIDIA calls SIMT (Single Instruction, Multiple Thread) fashion; all cores in the same group execute the same instruction at the same time, much like classical SIMD processors. SIMT handles conditionals somewhat differently than SIMD, though the effect is much the same, where some cores are disabled for conditional operations. The code is executed in groups of 32 threads, which NVIDIA calls a warp. When the threads in a warp issue a device memory operation, that instruction will take a very long time, perhaps hundreds of clock cycles, due to the long memory latency. Mainstream architectures would add a cache memory hierarchy to reduce the latency, and Fermi does include some hardware caches, but mostly GPUs are designed for stream or throughput computing, where cache memories are ineffective. Instead, these GPUs tolerate memory latency by using a high degree of multithreading. When one warp stalls on a memory operation, the multiprocessor selects another ready warp and switches to that one. In this way, the cores can be productive as long as there is enough parallelism to keep them busy.
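The hardware parameters mentioned above (multiprocessor count, warp size, memory sizes) can be queried at runtime through the CUDA runtime API. The following minimal sketch, which assumes device 0 is the GPU of interest, simply prints them:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    // Query the properties of device 0 (assumed here to be the GPU of interest).
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Device name:             %s\n", prop.name);
    printf("Multiprocessors:         %d\n", prop.multiProcessorCount);
    printf("Warp size:               %d threads\n", prop.warpSize);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Global (device) memory:  %zu bytes\n", prop.totalGlobalMem);
    return 0;
}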

4. GPU Programming Model

The programmable units of the GPU follow a single program multiple data (SPMD) programming model. For efficiency, the GPU processes many elements (vertices or fragments) in parallel using the same program. Each element is independent of the other elements, and in the base programming model, elements cannot communicate with each other. All GPU programs must be structured in this way: many parallel elements, each processed in parallel by a single program. Each element can operate on 32-bit integer or floating-point data with a reasonably complete general-purpose instruction set. Elements can read data from a shared global memory and, with the newest GPUs, also write back to arbitrary locations in shared global memory. This programming model is well suited to straight-line programs, as many elements can be processed in lockstep running the exact same code. Code written in this manner is single instruction, multiple data (SIMD). As shader programs have become more complex, programmers prefer to allow different elements to take different paths through the same program, leading to the more general SPMD model. Allowing a different execution path for each element requires a substantial amount of control hardware. Elements are grouped together into blocks, and blocks are processed in parallel. If elements branch in different directions within a block, the hardware computes both sides of the branch for all elements in the block. The size of the block is known as the branch granularity and has been decreasing with recent GPU generations; it is on the order of 16 elements. In writing GPU programs, then, branches are permitted but not free. Programmers who structure their code such that blocks have coherent branches will make the best use of the hardware.

4.1 General-Purpose Computing on the GPU

Mapping general-purpose computation onto the GPU uses the graphics hardware in much the same way as any standard graphics application. Because of this similarity, it is both easier and more difficult to explain the process. On one hand, the actual operations are the same and are easy to follow; on the other hand, the terminology is different between graphics and general-purpose use. One of the historical difficulties in programming GPGPU applications has been that despite their general-purpose tasks' having nothing to do with graphics, the applications still had to be programmed using graphics APIs. In addition, the program had to be structured in terms of the graphics pipeline, with the programmable units only accessible as an intermediate step in that pipeline, when the programmer would almost certainly prefer to access the programmable units directly.

Today, GPU computing applications are structured in the following way:

1. The programmer directly defines the computation domain of interest as a structured grid of threads.

2. An SPMD general-purpose program computes the value of each thread.

3. The value for each thread is computed by a combination of math operations and both gather (read) accesses from and scatter (write) accesses to global memory.

4. The resulting buffer in global memory can then be used as an input in future computation.

This programming model is a powerful one for several reasons. First, it allows the hardware to fully exploit the application's data parallelism by explicitly specifying that parallelism in the program. Next, it strikes a careful balance between generality and restrictions to ensure good performance. Finally, its direct access to the programmable units eliminates much of the complexity faced by previous GPGPU programmers in co-opting the graphics interface for general-purpose programming. As a result, programs are more often expressed in a familiar programming language and are simpler and easier to build and debug. The result is a programming model that allows its users to take full advantage of the GPU's powerful hardware but also permits an increasingly high-level programming model that enables productive authoring of complex applications.

4.1.2 NVIDIA CUDA

GPUs are programmed as a sequence of kernels; typically, each kernel completes execution before the next kernel begins, with implicit barrier synchronization between kernels. Fermi has some support for multiple independent kernels to execute simultaneously, but most kernels are large enough to fill the entire machine. As mentioned, the multiprocessors execute in parallel, asynchronously. However, GPUs do not support a fully coherent memory model that allows the multiprocessors to synchronize with each other. Classical parallel programming techniques cannot be used here. Threads cannot spawn more threads; threads on one multiprocessor cannot send results to threads on another multiprocessor; there is no facility for a critical section among all the threads across the whole system. CUDA offers a data-parallel programming model that is supported on NVIDIA GPUs. In this model, the host program launches a sequence of kernels. A kernel is organized as a hierarchy of threads. Threads are grouped into blocks, and blocks are grouped into a grid. Each thread has a unique local index in its block, and each block has a unique index in the grid. Kernels can use these indices to compute array subscripts, for instance.
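As a minimal sketch of this hierarchy, the hypothetical kernel below scales an array; the array size, block size, and the scaling operation are illustrative assumptions rather than anything prescribed by CUDA itself. Each thread combines its block index and local thread index into a global array subscript, exactly as described above.

#include <cuda_runtime.h>

// Each thread computes one output element from its unique global index.
__global__ void scaleKernel(const float *in, float *out, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // block index * block size + local index
    if (i < n)                                      // guard the last, partially filled block
        out[i] = alpha * in[i];                     // gather, compute, write
}

int main() {
    const int N = 1 << 20;                          // illustrative problem size
    float *d_in, *d_out;
    cudaMalloc((void **)&d_in,  N * sizeof(float));
    cudaMalloc((void **)&d_out, N * sizeof(float));
    // ... fill d_in with input data via cudaMemcpy ...

    // A grid of blocks, 256 threads per block, enough blocks to cover all N elements.
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    scaleKernel<<<blocksPerGrid, threadsPerBlock>>>(d_in, d_out, 2.0f, N);
    cudaDeviceSynchronize();                        // kernels launch asynchronously

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}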

Threads in a single block will be executed on a single multiprocessor, sharing the software data cache, and can synchronize and share data with threads in the same block; a warp will always be a subset of threads from a single block. Threads in different blocks may be assigned to different multiprocessors concurrently, to the same multiprocessor concurrently (using multithreading), or may be assigned to the same or different multiprocessors at different times, depending on how the blocks are scheduled dynamically. Performance tuning on the GPU requires optimizing all these architectural features:

• Finding and exposing enough parallelism to populate all the multiprocessors.

• Finding and exposing enough additional parallelism to allow multithreading to keep the cores busy.

• Optimizing device memory accesses for contiguous data, essentially optimizing for stride-1 memory accesses.

• Utilizing the software data cache to store intermediate results or to reorganize data that would otherwise require non-stride-1 device memory accesses.
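The last two points are often addressed together by staging data in the on-chip shared memory so that both the read from and the write to device memory are stride-1. The kernel below is a sketch of the standard tiled matrix transpose; the tile size of 16 and the extra padding column (to avoid shared-memory bank conflicts) are illustrative choices, not values taken from this report. It would be launched with one 16x16 thread block per tile.

#define TILE 16   // illustrative tile size; a 16x16 block fits both Tesla- and Fermi-class limits

// Transpose a width x height matrix so that both the global-memory read and write are coalesced.
__global__ void transposeTiled(const float *in, float *out, int width, int height) {
    __shared__ float tile[TILE][TILE + 1];          // +1 column avoids shared-memory bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;        // column in the input
    int y = blockIdx.y * TILE + threadIdx.y;        // row in the input
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // stride-1 (coalesced) read

    __syncthreads();                                // wait until the whole tile is loaded

    x = blockIdx.y * TILE + threadIdx.x;            // column in the transposed output
    y = blockIdx.x * TILE + threadIdx.y;            // row in the transposed output
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];  // stride-1 (coalesced) write
}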

4.2 Software Environments

In the past, the majority of GPGPU programming was done directly through graphics APIs. Although many researchers were successful in getting applications to work through these graphics APIs, there is a fundamental mismatch between the traditional programming models people were using and the goals of the graphics APIs. Originally, people used fixed-function, graphics-specific units (e.g., texture filters, blending, and stencil buffer operations) to perform GPGPU operations. This quickly got better with fully programmable fragment processors, which provided pseudo-assembly languages, but this was still unapproachable by all but the most ardent researchers. With DirectX 9, higher-level shader programming was made possible through the high-level shading language (HLSL), presenting a C-like interface for programming shaders. NVIDIA's Cg provided capabilities similar to HLSL but was able to compile to multiple targets and provided the first high-level language for OpenGL. The OpenGL Shading Language (GLSL) is now the standard shading language for OpenGL. However, the main issue with Cg/HLSL/GLSL for GPGPU is that they are inherently shading languages. Computation must still be expressed in graphics terms like vertices, textures, fragments, and blending. So, although one could do more general computation with graphics APIs and shading languages, they were still largely unapproachable by the common programmer. The stream programming model structures programs to express parallelism and allows for efficient

communication and data transfer, matching the parallel processing resources and memory system available on GPUs. A stream program comprises a set of streams (ordered sets of data) and kernels (the functions applied to each element in a set of streams, producing one or more streams as output). RapidMind commercialized Sh and now targets multiple platforms including GPUs, the STI Cell Broadband Engine, and multicore CPUs, and the new system is much more focused on computation than Sh, which included many graphics-centric operations. Similar to Accelerator, RapidMind uses delayed evaluation and online compilation to capture and optimize the user's application code, along with operator and type extensions to C++ to provide direct support for arrays. PeakStream is a new system, inspired by Brook, designed around operations on arrays. Similar to RapidMind and Accelerator, PeakStream uses just-in-time compilation but is much more aggressive about vectorizing the user's code to maximize performance on SIMD architectures. PeakStream is also the first platform to provide profiling and debugging support, the latter continuing to be a serious problem in GPGPU development. Both AMD and NVIDIA now also have their own GPGPU programming systems. AMD announced and released their system to researchers in late 2006. CTM, or "close to the metal," provides a low-level hardware abstraction layer (HAL) for the R5XX and R6XX series of ATI GPUs. The CTM HAL provides raw assembly-level access to the fragment engines (stream processors), along with an assembler and command buffers to control execution on the hardware. No graphics-specific features are exported through this interface. Computation is performed by binding memory as inputs and outputs to the stream processors, loading an ELF binary, and defining a domain over the outputs on which to execute the binary. AMD also offers the compute abstraction layer (CAL), which adds higher-level constructs. NVIDIA's CUDA is a higher-level interface than AMD's HAL and CAL. Similar to Brook, CUDA provides a C-like syntax for executing on the GPU and compiles offline. However, unlike Brook, which only exposed one dimension of parallelism (data parallelism via streaming), CUDA exposes two levels of parallelism: data parallelism and multithreading. CUDA also exposes much more of the hardware resources than Brook, exposing multiple levels of the memory hierarchy: per-thread registers, fast shared memory between threads in a block, board memory, and host memory. Kernels in CUDA are also more flexible than those in Brook, allowing the use of pointers (although data must be on board), general load/store to memory allowing the user to scatter data from within a kernel, and synchronization between threads in a thread block. However, all of this flexibility

and potential performance gain comes with the cost of requiring the user to understand more of the low-level details of the hardware, notably register usage, thread and thread block scheduling, and the behavior of access patterns through memory. All of these systems allow developers to more easily build large applications. CUDA provides tuned and optimized basic linear algebra subprograms (BLAS) and fast Fourier transform (FFT) libraries to use as building blocks for large applications. Low-level access to hardware, such as that provided by CTM, or GPGPU-specific systems like CUDA, allows developers to effectively bypass the graphics drivers and maintain stable performance and correctness.

5. GPU Applications

There are four data-parallel operations central to GPU computing: performing scatter/gather memory operations, mapping a function onto many elements in parallel, reducing a collection of elements to a single element or value, and computing prefix reductions of an array in parallel. Researchers have studied many algorithms on GPUs, including scan, sort, search, data queries, differential equations, and linear algebra. These algorithms enable a wide range of applications, ranging from databases and data mining to scientific simulations such as fluid dynamics and heat transfer.

5.1 Algorithms

Researchers have demonstrated many higher-level algorithms and applications that exploit the computational strengths of the GPU. Sort: GPUs have come to excel at sorting as the GPU computing community has rediscovered, adapted, and improved seminal sorting algorithms, notably bitonic merge sort. This sorting algorithm is intrinsically parallel and oblivious, meaning the same steps are performed regardless of input. Search and database queries: Researchers have also implemented several forms of search on the GPU, such as binary search and nearest neighbor search, as well as high-performance database operations that build on special-purpose graphics hardware (called the depth and stencil buffers) and the fast sorting algorithms. Differential equations: The earliest attempts to use GPUs for non-graphics computation focused on solving large sets of differential equations. Particle tracing is a common GPU application for ordinary differential equations, used heavily in scientific visualization (e.g., in scientific flow exploration systems) and in visual effects for computer games. GPUs have been heavily used to solve problems in partial differential equations (PDEs) such as the Navier–Stokes equations for incompressible fluid flow. Particularly successful applications of GPU PDE solvers include fluid dynamics and level set equations for volume segmentation.

Linear algebra: Sparse and dense linear algebra routines are the core building blocks for a huge class of numeric algorithms, including many of the PDE solvers mentioned above. Applications include simulation of physical effects such as fluids, heat, and radiation, optical effects such as depth of field, and so on. Accordingly, the topic of linear algebra on GPUs has received a great deal of attention. The use of direct-compute layers such as CUDA and CTM both simplifies and improves the performance of linear algebra on the GPU. For example, NVIDIA provides cuBLAS, a dense linear algebra package implemented in CUDA and following the popular BLAS conventions. Sparse linear algebraic algorithms, which are more varied and complicated than dense codes, are an open and active area of research; researchers expect sparse codes to realize benefits similar to or greater than those of the new GPU computing layers.
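To illustrate how such a library hides the hardware details discussed in earlier sections, a dense single-precision matrix multiply through cuBLAS reduces to creating a handle and issuing one call. The sketch below uses the cublas_v2 interface and assumes the matrices are already in device memory in column-major layout; the function name and the choice of dimensions are illustrative, and error checking is omitted for brevity.

#include <cublas_v2.h>
#include <cuda_runtime.h>

// C = alpha * A * B + beta * C for column-major matrices already resident on the device:
// A is m x k, B is k x n, C is m x n.
void gemmOnGpu(const float *d_A, const float *d_B, float *d_C, int m, int n, int k) {
    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,   // no transposes
                m, n, k,
                &alpha, d_A, m,                     // leading dimension of A is m
                        d_B, k,                     // leading dimension of B is k
                &beta,  d_C, m);                    // leading dimension of C is m

    cublasDestroy(handle);
}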

Several recurring themes emerge throughout the algorithms and applications explored in GPU computing to date. Successful GPU computing applications do the following. Emphasize parallelism: GPUs are fundamentally parallel machines, and their efficient utilization depends on a high degree of parallelism in the workload. For example, NVIDIA's CUDA prefers to run thousands of threads at one time to maximize opportunities to mask memory latency using multithreading. Emphasizing parallelism requires choosing algorithms that divide the computational domain into as many independent pieces as possible. To maximize the number of simultaneously running threads, GPU programmers should also seek to minimize thread usage of shared resources and should use synchronization between threads sparingly. Minimize SIMD divergence: GPUs provide an SPMD programming model: multiple threads run the same program but access different data and thus may diverge in their execution. At some granularity, however, GPUs perform SIMD execution on batches of threads (such as CUDA warps). If threads within a batch diverge, the entire batch will execute both code paths until the threads reconverge. High-performance GPU computing thus requires structuring code to minimize divergence within batches. Maximize arithmetic intensity: In today's computing landscape, actual computation is relatively cheap but bandwidth is precious. This is dramatically true for GPUs with their abundant floating-point horsepower. To obtain maximum utilization of that power requires structuring the algorithm to maximize the arithmetic intensity, or number of numeric computations performed per memory transaction. Coherent data accesses by individual threads help, since these can be coalesced into fewer total memory transactions. Use of CUDA shared memory on NVIDIA GPUs also helps, reducing overfetch. Exploit streaming bandwidth: Despite the importance of arithmetic intensity, it is worth noting that GPUs do have very high peak bandwidth to their onboard memory, on the order of 10× the CPU-memory bandwidth on typical PC platforms. This is why GPUs can outperform CPUs at tasks such as sort, which have a low computation/bandwidth ratio. To achieve high performance on such applications requires streaming memory access patterns in which threads read from and write to large coherent blocks (maximizing bandwidth per transaction) located in separate regions of memory (avoiding data hazards). It has been shown that when algorithms and applications follow these design principles for GPU computing, as in the PDE solvers, linear algebra packages, database systems, game physics, and molecular dynamics applications mentioned above, they can achieve 10–100 times speedups over even mature, optimized CPU codes.

5.4 The Future of the GPU

With the rising importance of GPU computing, GPU hardware and software are changing at a remarkable pace. In the upcoming years, it is expected that there will be several changes to allow more flexibility and performance from future GPU computing systems:

• Both AMD and NVIDIA announced future support for double-precision floating-point hardware. The addition of double-precision support removes one of the major obstacles for the adoption of the GPU in many scientific computing applications.

• Another upcoming trend is a higher-bandwidth path between CPU and GPU. The PCI Express bus between CPU and GPU is a bottleneck in many applications, so future support for PCI Express 2, HyperTransport, or other high-bandwidth connections is a welcome trend. One open question is the fate of the GPU's dedicated high-bandwidth memory system in a computer with a more tightly coupled CPU and GPU.
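Until such tighter coupling arrives, the usual way to soften the PCI Express bottleneck on current hardware is to use page-locked (pinned) host memory and to overlap transfers with kernel execution through CUDA streams. The following is a hedged sketch; the buffer size, the trivial kernel, and the single-stream structure are illustrative assumptions only.

#include <cstring>
#include <cuda_runtime.h>

// Illustrative kernel: doubles every element in place.
__global__ void process(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void pipelined(const float *host_src, float *host_dst, int n) {
    float *h_pinned, *d_buf;
    cudaMallocHost((void **)&h_pinned, n * sizeof(float));  // page-locked host memory enables fast, asynchronous DMA
    cudaMalloc((void **)&d_buf, n * sizeof(float));
    memcpy(h_pinned, host_src, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Copies and the kernel launch are queued asynchronously in one stream,
    // so the CPU is free to do other work while the GPU pipeline drains.
    cudaMemcpyAsync(d_buf, h_pinned, n * sizeof(float), cudaMemcpyHostToDevice, stream);
    int threads = 256, blocks = (n + threads - 1) / threads;
    process<<<blocks, threads, 0, stream>>>(d_buf, n);
    cudaMemcpyAsync(h_pinned, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);                          // wait for the whole pipeline to finish
    memcpy(host_dst, h_pinned, n * sizeof(float));

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
}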

6. GPUs in Real-Time Signal Processing Applications

Several signal and image processing algorithms have been implemented on the GPU, including basic operations (such as the fast Fourier transform and convolution); differential equation-based algorithms; and pattern recognition and vision algorithms (such as those for sequence alignment, optical flow estimation, tracking, and stereo reconstruction). The GPU is well-suited for convolution and warping algorithms in image processing due to its hardware support for image interpolation and filtering. The GPU is also well-suited for robust edge detection in the presence of noise, image segmentation, and other applications that may require differential-equation-based approaches. In such cases, implementing explicit differential equation solvers is straightforward.
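To make that claim concrete: an explicit solver maps one thread to each grid point and applies the discretized update rule directly. The kernel below is a hedged sketch of a single forward-Euler step of the 2-D heat equation u_t = alpha * laplacian(u); the parameter names, the fixed (unchanged) boundary treatment, and the row-major grid layout are illustrative assumptions.

// One explicit finite-difference time step of u_t = alpha * laplacian(u) on an nx x ny grid.
// Each thread updates one interior point; boundary points are left unchanged in this sketch.
__global__ void heatStep(const float *u, float *u_new, int nx, int ny,
                         float alpha, float dt, float h) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // column index
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // row index
    if (i > 0 && i < nx - 1 && j > 0 && j < ny - 1) {
        int idx = j * nx + i;
        float lap = (u[idx - 1] + u[idx + 1] + u[idx - nx] + u[idx + nx] - 4.0f * u[idx]) / (h * h);
        u_new[idx] = u[idx] + alpha * dt * lap;      // forward-Euler update
    }
}

The host simply swaps the u and u_new buffers between successive kernel launches, one launch per time step.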

Implementing implicit solvers requires the solution of large sparse systems of linear equations. In this case, it has been demonstrated that the GPU can be used to efficiently implement iterative methods such as the conjugate gradient algorithm. The GPU is useful in implementations of both dynamic programming and hidden Markov model approaches to sequence alignment and pattern recognition. It should also be possible, especially on newer GPUs, to build implementations of computationally intensive applications such as video coding that could significantly outperform implementations on traditional central processing units (CPUs). Key components of such codecs, such as motion compensation and wavelet transformation, have already been implemented on GPUs. What makes GPUs interesting for signal processing applications is that their peak computational performance can be more than an order of magnitude higher than that of traditional CPUs. They also support very high bandwidth to a relatively large amount of dedicated memory. GPUs use massively parallel architectures with a very large number of floating-point functional units and extensive support for efficient data parallelism. Their architecture is built around the need to access data efficiently and schedule large amounts of parallel computation. Extensive use is made of single-instruction multiple-data (SIMD) parallelism, in which one instruction causes a single operation to take place on more than one value at the same time. SIMD parallelism in GPUs can be explicit (using vector registers and vector operations) or implicit (performing computations over blocks of pixels). However, GPUs are not pure SIMD machines and do support a coarse-grained, structured version of multithreading; they are actually arrays of data-parallel processors. GPUs are integrated into a memory hierarchy consisting of registers, cache, video memory, and host memory, with various capacities and latencies. Registers provide a small amount of high-bandwidth, low-latency memory, while the DRAM in video and host memory provides higher capacities at higher latencies and lower bandwidth. The amount of high-bandwidth local memory available, including registers and cache, is especially important for algorithms that require a large "working set" (simultaneously accessed state). Until recently, registers were the only form of on-chip memory on GPUs supporting read-write operations; caches were read-only and relatively small. However, the latest generation of GPUs now includes support for a small amount of randomly accessible read/write high-speed on-chip local memory. This will greatly expand the class of algorithms that can be mapped efficiently onto GPUs. However, there are still many problems associated with efficient implementation of scientific algorithms on GPUs that researchers are working toward solving. The GPUs

have been used for solving sparse and dense linear systems, eigen-decomposition, matrix multiplication, fluid flow simulation, FFT, sorting, and finite-element simulations. The GPU-based algorithms exploit the capabilities of multiple vertex and fragment processors along with high memory bandwidth to achieve high performance. Current GPUs support 32-bit floating-point arithmetic, and GPU-based implementations of some of the scientific algorithms have outperformed optimized CPU-based implementations available as part of ATLAS or the Intel Math Kernel Library (MKL). Research has been conducted on developing appropriate memory models to analyze performance and on designing cache-efficient GPU-based algorithms. Current GPU architectures are designed to perform vector computations on input data that is represented as 2D arrays or textures. They achieve high memory bandwidth using a 256-bit memory interface to the video memory. Moreover, the GPUs consist of multiple fragment processors, and each fragment processor has small L1 and L2 SRAM caches. Compared to conventional CPUs, the GPUs have fewer pipeline stages and smaller cache sizes. The GPUs perform block transfers between the caches and DRAM-based video memory. As a result, the performance of GPU-based scientific algorithms depends on their cache efficiency. The researchers in [10] have developed a model to analyze the performance of GPU-based scientific algorithms and use this model to improve their cache efficiency. Based on this model, they present efficient tiling algorithms to improve the performance of three applications: sorting, fast Fourier transform, and dense matrix multiplication. Some examples of algorithms and research conducted to increase the performance of real-time signal processing applications on GPUs are discussed in sections 6.2, 6.3, and 6.4.

6.1 GPU Memory Access

Memory access operations on GPUs take four forms: random access reads (gathers); writes to predetermined array locations (structured writes); writes to random memory locations (scatters); and finally streaming reads (and writes, on the latest GPUs). Random memory read operations can be implemented on GPUs with texture fetches, which involve much more than a simple "load" operation. During the use of a GPU for image synthesis (rendering), a texturing operation is available to paste images onto surfaces. Since pixels in a frame buffer (the target of a rendering operation) can be drawn in an order different from the storage order of image samples in textures, supporting texturing requires both random access into textures and hardware support for interpolation and antialiasing. This functionality can be used directly to implement image processing operations such as warping and

perspective correction. Structured write operations to predetermined locations in an output image are supported on GPUs. Structured writes support a number of read-modify-write operations on their destination frame buffers, including predicated writes and summation. The structured write operation can be very fast, as GPUs have numerous hardware optimizations for this mode of memory access. Write operations to random locations in an output array and streaming reads and writes can also be implemented. Scatters, however, are relatively slow and may have to be implemented using multiple passes through the rasterizer or other GPU-specific mechanisms. Streaming reads and writes access arrays in a predictable, coherent order that allows several optimizations such as prefetching and block transfer. Implementation of streaming access maximizes bandwidth.

6.2 Processing Data Streams with Hard Real-Time Constraints on Heterogeneous Systems [11]

Research has been conducted in [11] to develop an efficient and practical algorithm for hard real-time stream scheduling in heterogeneous systems. The algorithm assigns incoming streams of different rates and deadlines to CPUs and accelerators. By employing novel stream schedulability criteria for accelerators, the algorithm finds an assignment that simultaneously satisfies the aggregate throughput requirements of all the streams and the deadline constraint of each stream alone. Using the AES-CBC encryption kernel, they experimented extensively on thousands of streams with realistic rate and deadline distributions. They have shown that their framework outperformed the alternative methods by allowing 50% more streams to be processed with provably deadline-compliant execution, even for deadlines as short as tens of milliseconds. Overall, the combined GPU-CPU execution allows for up to a 4-fold throughput increase over highly optimized multi-threaded CPU-only implementations.

6.3 GPU-CPU Multi-Core for Real-Time Signal Processing [12]

Since GPUs process elements independently, there is no way to have shared or static data. For each element one can only read from the input, perform operations on it, and write to the output. However, recent improvements in graphics cards have made it possible to have multiple inputs and multiple outputs, but never a piece of memory that is both readable and writable. It is important for general-purpose applications to have high arithmetic intensity; otherwise, the memory access latency will limit computation speed. Recent improvements in the data transfer rate of the PCI Express interface have, to an extent, addressed this issue. The researchers in [12] have demonstrated GPU-CPU multi-core real-time signal processing with an Internet Protocol Television (IP-TV) application scenario. A

direct application of this work is to perform the DCT and IDCT on the GPU in real-time signal processing to be used for broadcasting live events such as sports. In this application setup, a server containing a GPU and a CPU performs tasks such as MPEG compression and digital rights management (DRM). The video data, which arrives at shared memory, is directed by the CPU and sent to the GPU, where it is mapped into the GPU's memory. The GPU processes the video data. After the GPU finishes, the CPU initiates copying the data back, and the video is stored in a database. This approach ensures faster video processing for vast amounts of data and does not add extra hardware cost to either video providers or receivers. Data is sent to the GPU as textures, which means that data is treated as groups of pictures. [12] In applications such as JPEG and MPEG, the Discrete Cosine Transform (DCT) is a resource-intensive task. The authors of [12] performed the computation of the DCT on the GPU and carried out experiments for the following cases: (1) using the CPU only, (2) using the GPU only, and (3) using both the GPU and the CPU. Although the computational power of the GPU is much higher than that of the CPU in terms of throughput, the fastest DCT technique on the GPU is still slower than the implementation on the CPU. The reason is that the speed of the GPU is limited by memory access. However, they predict that this scenario will change with the use of more advanced GPUs and interfaces. They have concluded that the GPU is still an excellent candidate for performing data-intensive signal processing algorithms. The disadvantage of the GPU is that the data transfer rate from the graphics hardware to the main memory is very slow. This bottleneck degrades the performance, as the results must be brought back to the main memory in the context of general-purpose computation. [12]

6.4 Cache-Aware GPU Memory Scheduling Scheme for CT Back-Projection [13]

As previous research shows, the bottleneck of GPU implementations is not the computational power but the memory bandwidth. The researchers in [13] propose a cache-aware memory-scheduling scheme for back-projection, which can ensure better load balancing between the GPU processors and the GPU memory. The proposed reshuffling method can be directly applied to existing GPU-accelerated CT reconstruction pipelines. The experimental results show that their optimization can achieve speedups ranging from 1.18× to 1.48×. [13] This cache-optimization method is particularly effective for low-resolution volumes with high-resolution projections. Many previous GPU-acceleration works identified that the memory bandwidth was the bottleneck of the GPU-based back-projection problem. This problem is rooted in the limited amount of cache and the insufficient bandwidth of device memory in GPU architectures. Previous works increased the effective memory bandwidth by reordering loops, unrolling loops, and multithreading. In [13] they focus on

The work in [13] focuses specifically on how to improve the cache hit rate. The authors propose a volume reshuffling scheme that optimizes the locality of the memory read pattern in GPU-based back-projection, ensuring better cache behavior across different projection angles. They claim the scheme yields a better cache hit rate and can be added to a variety of existing methods with different back-projection orderings. Their results show the method is particularly useful for reconstructions combining a low-resolution volume with high-resolution projections.

7. Case Studies

7.1 3-D Audio Processing with the GPU

Audio processing applications are among the most compute-intensive and often rely on additional DSP resources for real-time performance. However, programmable audio DSPs are generally available only to product developers. Professional audio boards with multiple DSPs usually support specific effects and products, while consumer "game-audio" hardware still implements only fixed-function pipelines, which evolve at a rather slow pace. The widespread availability and increasing processing power of GPUs could offer an alternative. GPU features such as multiply-accumulate instructions and multiple execution units are similar to those of most DSPs, and 3D audio rendering requires a significant number of geometric calculations, which are a perfect fit for the GPU.

A feasibility study conducted by [5] investigates the use of GPUs for efficient audio processing. It considers a combination of two simple operations commonly used for 3D audio rendering: a variable delay line and filtering. The authors compared optimized SSE (Intel's Streaming SIMD Extensions) assembly code running on a 3 GHz Pentium 4 against an equivalent Cg/OpenGL implementation running on an NVIDIA GeForce FX 5950 Ultra graphics board over AGP 8x. Audio was processed at 44.1 kHz using 1024-sample frames, with all processing in 32-bit floating point. The SSE implementation achieves real-time binaural rendering of 700 sound sources, while the GPU renders up to 580 in one time frame. They showed that although the GPU implementation was on average about 20% slower than the SSE implementation, it would become 50% faster if floating-point texture resampling were supported in hardware. The latest graphics architectures are likely to improve GPU performance significantly thanks to their increased number of pipelines and better floating-point texture support.

Can the GPU be a good audio DSP? Experiments suggest that GPUs can be used for 3D audio processing with similar or better performance than optimized software implementations running on top-of-the-line CPUs.
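As an illustration of the variable delay-line operation described above, the sketch below resamples a delay buffer with linear interpolation. It is written as a CUDA kernel rather than the Cg fragment program used in [5], and all names and the buffer layout are hypothetical.

__global__ void variableDelayLine(const float* __restrict__ buf,    // circular delay buffer
                                  const float* __restrict__ delay,  // per-sample delay, in samples
                                  float* __restrict__ out,
                                  int frameLen, int bufLen, int writeHead) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= frameLen) return;
    // The read position lags the write head by a (possibly fractional) number of samples.
    float pos  = (float)(writeHead + i) - delay[i];
    int   i0   = (int)floorf(pos);
    float frac = pos - (float)i0;
    int   a    = ((i0 % bufLen) + bufLen) % bufLen;        // wrap into the circular buffer
    int   b    = (((i0 + 1) % bufLen) + bufLen) % bufLen;
    // Linear interpolation stands in for the hardware texture resampling discussed above.
    out[i] = buf[a] + frac * (buf[b] - buf[a]);
}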

The latest GPUs have been shown to outperform CPUs for a number of other tasks, including the Fast Fourier Transform, a tool widely used in audio processing [5]. However, several shortcomings still prevent efficient use of GPUs for mainstream audio processing applications. Because of limitations in texture-access modes and texture size, long 1D textures cannot be indexed easily. Infinite impulse response (recursive) filtering cannot be implemented efficiently, since past values are usually unavailable when rendering a given pixel in fragment programs; persistent registers that accumulate results across fragments might solve this problem. Slow AGP read-backs can also become an issue when large amounts of audio data must be transferred from graphics memory to the audio hardware for playback, although PCI Express support should solve this for most applications. Finally, the results demonstrate that game-audio hardware, borrowing from graphics architectures and shading languages, may benefit from including programmable "voice shaders" that enable per-sample processing before the main effects processor.

7.2 High Performance Embedded Computing with GPU

Many signal and image processing algorithms used in embedded applications have been implemented on GPUs, such as the fast Fourier transform, convolution, pattern recognition, and vision algorithms (e.g., tracking and stereo reconstruction). These works focus on specific problems, but high performance embedded systems generally involve a set of applications, and accelerating one application does not necessarily imply a performance advantage for the whole system. The work of [4] provides a systematic investigation by implementing the whole HPEC benchmark suite on the latest NVIDIA GPU, together with a comprehensive characterization of GPU execution of the HPEC benchmarks. A performance and power comparison with a DSP is also performed; the comparison reveals that the major hurdle for GPU use in embedded computing is its relatively low power efficiency.

HPEC Benchmarks

The HPEC benchmark suite defines nine kernel programs identified through a survey of a broad range of signal processing application areas such as radar and sonar processing, infrared sensing, signal intelligence, communication, and data fusion. The nine kernels are listed in Table 7-1.

Benchmark   Description
TDFIR       Time-domain finite impulse response filtering
FDFIR       Frequency-domain finite impulse response filtering
CT          Corner turn via matrix operations
PM          Pattern matching
CFAR        Constant false-alarm rate detection
GA          Genetic algorithm
QR          QR factorization
SVD         Singular value decomposition
DB          Database operations

Table 7-1 Benchmark Descriptions

In this work, the GPU programs were implemented on a Fermi GPU, NVIDIA's latest-generation graphics processor at the time. Delivering 1536 GFLOPS of single-precision floating-point performance, the Fermi GPU contains 14 streaming multiprocessors (SMs), each with 32 streaming processors (SPs). The GPU program follows an SPMD model with the thread as the basic unit of parallel computation: tens of thousands of threads can be launched, all executing the same program but on different data. Overall, the GPU implementations are superior for almost all the benchmarks. For the DSP experiments, the authors used an ADSP-TS101S TigerSHARC 6U cPCI board containing eight DSPs, each with a peak floating-point performance of 1.5 GFLOPS. The platform can handle multi-channel input data, but only one channel was used. Due to the limitations of the DSP architecture, only five benchmarks were implemented on the DSP: TDFIR, FDFIR, CFAR, QR, and SVD. For all of these benchmarks, the GPU outperformed the DSP by two orders of magnitude.
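As a minimal sketch of the SPMD style described above (not the benchmark code from [4]), the TDFIR kernel can be expressed with one thread per output sample, all threads running the same program on different data:

__global__ void tdfir(const float* __restrict__ x,   // input, assumed padded to nOut + nTaps - 1 samples
                      const float* __restrict__ h,   // filter coefficients
                      float* __restrict__ y,
                      int nOut, int nTaps) {
    int n = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per output sample
    if (n >= nOut) return;
    float acc = 0.0f;
    for (int k = 0; k < nTaps; ++k)
        acc += h[k] * x[n + k];
    y[n] = acc;
}

// Launch with enough threads to cover every output sample, e.g.:
//   tdfir<<<(nOut + 255) / 256, 256>>>(d_x, d_h, d_y, nOut, nTaps);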

Figure 7-1 Power Efficiency [4]

On the other hand, it is interesting to compare the power efficiency of the GPU and the DSP, because power is becoming an increasingly important issue for embedded applications. The measured power consumptions of the ADSP-TS101 and the Tesla C2050 are 10 W and 238 W, respectively, and the results reveal that DSPs do offer considerably better power efficiency. Nevertheless, the encouraging throughput results demonstrate the potential of GPUs for high performance embedded computing applications.

The analysis provides key insight for optimizing data-parallel algorithms on the GPU architecture. The power efficiency comparison between the GPU and the DSP suggests that the big obstacle to applying GPUs in embedded applications is their low power efficiency. There is ongoing research on micro-architectures that could combine the advantages of GPUs and DSPs. Another important research direction is reducing GPU power consumption by optimizing the memory hardware and using adaptive techniques that automatically match the required computing intensity.

8. Challenges faced by GPGPU

Programming models and tools: With the new programming systems, the state of the art has improved substantially over the past year. Much of the difficulty of early GPGPU programming has dissipated with the capabilities of these systems, though support for debugging and profiling on the hardware is still primitive. One remaining concern is the proprietary nature of the tools; standard languages, tools, and APIs that work across GPUs from multiple vendors would advance the field.

GPU functionality (onto the chipset or onto the processor die itself): While it is fairly clear that graphics capability is a vital part of future computing systems, it is wholly unclear which part of a future computer will provide that capability, or even whether an increasingly important GPU with parallel computing capabilities could absorb a CPU.

Design tradeoffs and their impact on the programming model: GPU vendors constantly weigh decisions regarding flexibility and features for their programmers: how do those decisions affect the programming model and the ability to build hardware capable of top performance? An illustrative example is data-dependent conditionals. Today, GPUs support conditionals at thread granularity, but conditionals are not free: every GPU pays a performance penalty for incoherent branches. Programmers want a small branch granularity so each thread can be independent; architects want a large branch granularity to build more efficient hardware. Another important design decision is thread granularity: many lightweight, less powerful threads versus fewer, more powerful heavyweight threads. As the programmable aspects of the hardware take center stage and the GPU matures as a general-purpose platform, these tradeoffs only grow in importance.

Performance evaluation: The science of program optimization for CPUs is reasonably well understood; profilers and optimizing compilers are effective in allowing programmers to make the most of their hardware. Tools on GPUs are much more primitive, and making code run fast on the GPU remains something of a black art.
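As an example of the kind of hardware behavior behind both the branch-granularity tradeoff and the opacity of GPU performance, the hypothetical kernel below contains a data-dependent branch; when the threads of a warp disagree on which side to take, the hardware executes both paths with parts of the warp masked off:

__global__ void threshold(const float* __restrict__ x, float* __restrict__ y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Incoherent when neighboring elements straddle the threshold: a single warp
    // then serializes both the sqrtf path and the zero path.
    if (x[i] > 0.0f)
        y[i] = sqrtf(x[i]);
    else
        y[i] = 0.0f;
}

// A common mitigation is to partition or reorder the input so that the threads of a
// warp tend to take the same path, trading preprocessing cost for coherent branches.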

One of the most difficult ordeals for the GPU programmer is the performance cliff, where small changes to the code, or the use of one feature rather than another, make large and surprising differences in performance. The challenge going forward is for vendors and users to build tools that provide better visibility into the hardware and better feedback to the programmer about performance characteristics.

Lack of precision: The hardware graphics pipeline embodies many architectural decisions that favored performance over correctness. For output to a display these tradeoffs were quite sensible, since the difference between perfectly correct output and the actual output is likely indistinguishable. The most notable tradeoff is the precision of 32-bit floating-point values in the graphics pipeline. Though precision has improved, it is still not IEEE compliant, and features such as denormals are not supported. As this hardware is used for general-purpose computation, noncompliance with standards becomes much more important, and handling faults such as exceptions from division by zero, which GPUs do not currently support, also becomes an issue.

Broader toolbox for computation and data structures: On CPUs, any given application is likely to have only a small fraction of its code written by its author; most of the code comes from libraries, and the application developer concentrates on high-level coding, relying on established APIs such as the STL, Boost, or BLAS for lower-level functionality. In contrast, in general-purpose GPU computing the programmer writes nearly all the code that goes into the program, from the lowest level to the highest. Libraries of fundamental data structures and algorithms applicable to a wide range of GPU computing applications (such as NVIDIA's FFT and dense matrix algebra libraries) are only now being developed, but they are vital for the future growth of GPU computing.
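To illustrate the library-based style argued for above, the short host program below performs a single 1024-point complex FFT with NVIDIA's cuFFT library instead of a hand-written kernel; the transform length and the in-place layout are arbitrary choices for the example.

#include <cuda_runtime.h>
#include <cufft.h>

int main() {
    const int N = 1024;                                       // transform length (illustrative)
    cufftComplex* d_signal = nullptr;
    cudaMalloc((void**)&d_signal, N * sizeof(cufftComplex));
    // ... fill d_signal with samples from the application (omitted) ...

    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);                      // one 1024-point complex-to-complex FFT
    cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);    // in-place forward transform

    cufftDestroy(plan);
    cudaFree(d_signal);
    return 0;
}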

9. References

[1] Owens, J.D.; Houston, M.; Luebke, D.; Green, S.; Stone, J.E.; Phillips, J.C., "GPU Computing," Proceedings of the IEEE, vol. 96, no. 5, pp. 879-899, May 2008.
[2] Wolfe, M., "Understanding the CUDA Data Parallel Threading Model: A Primer," PGI Insider, February 2010.
[3] Keckler, S.W.; Dally, W.J.; Khailany, B.; Garland, M.; Glasco, D., "GPUs and the Future of Parallel Computing," IEEE Micro, vol. 31, no. 5, pp. 7-17, Sept.-Oct. 2011.
[4] Mu, S.; Wang, C.; Liu, M.; Li, D.; Zhu, M.; Chen, X.; Xie, X.; Deng, Y., "Evaluating the potential of graphics processors for high performance embedded computing," Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1-6, 14-18 March 2011.
[5] Gallo, E.; Tsingos, N., "Efficient 3D Audio Processing with the GPU," GP2: ACM Workshop on General Purpose Computing on Graphics Processors, 2004.
[6] Schaa, D.; Kaeli, D., "Exploring the multiple-GPU design space," IEEE International Symposium on Parallel & Distributed Processing (IPDPS 2009), pp. 1-12, 23-29 May 2009.
[7] Luebke, D., "CUDA: Scalable parallel programming for high-performance scientific computing," 5th IEEE International Symposium on Biomedical Imaging: From Nano to Macro (ISBI 2008), pp. 836-838, 14-17 May 2008.
[8] Russo, L.M.; Pedrino, E.C.; Kato, E.; Roda, V.O., "Image convolution processing: A GPU versus FPGA comparison," VIII Southern Conference on Programmable Logic (SPL), pp. 1-6, 20-23 March 2012.
[9] Ujaldon, M., "High performance computing and simulations on the GPU using CUDA," International Conference on High Performance Computing and Simulation (HPCS), pp. 1-7, 2-6 July 2012.
[10] Govindaraju, N.K.; Larsen, S.; Gray, J.; Manocha, D., "A Memory Model for Scientific Algorithms on Graphics Processors," Proceedings of the ACM/IEEE SC 2006 Conference, p. 6, Nov. 2006.
[11] Verner, U.; Schuster, A.; Silberstein, M., "Processing data streams with hard real-time constraints on heterogeneous systems," International Conference on Supercomputing (ICS 2011), pp. 120-129, 2011.
[12] Mohanty, S.P., "GPU-CPU multi-core for real-time signal processing," Digest of Technical Papers, International Conference on Consumer Electronics (ICCE '09), pp. 1-2, 10-14 Jan. 2009.
[13] Zheng, Z.; Mueller, K., "Cache-aware GPU memory scheduling scheme for CT back-projection," IEEE Nuclear Science Symposium Conference Record (NSS/MIC), pp. 2248-2251, Oct.-Nov. 2010.