
Parallel Computing Experiences with CUDA


The CUDA programming model provides a straightforward means of describing inherently parallel computations, and NVIDIA's Tesla GPU architecture delivers high computational throughput on massively parallel problems. This article surveys experiences gained in applying CUDA to a diverse set of problems and the parallel speedups over sequential codes running on traditional CPU architectures attained by executing key computations on the GPU.

With the transition from single-core to multicore processors essentially complete, virtually all commodity CPUs are now parallel processors. Increasing parallelism, rather than increasing clock rate, has become the primary engine of processor performance growth, and this trend is likely to continue. This raises many important questions about how to productively develop efficient parallel programs that will scale well across increasingly parallel processors.

Modern graphics processing units (GPUs) have been at the leading edge of increasing chip-level parallelism for some time. Current NVIDIA GPUs are many-core processor chips, scaling from 8 to 240 cores. This degree of hardware parallelism reflects the fact that GPU architectures evolved to fit the needs of real-time computer graphics, a problem domain with tremendous inherent parallelism. With the advent of the GeForce 8800—the first GPU based on NVIDIA's Tesla unified architecture—it has become possible to program GPU processors directly, as massively parallel processors rather than simply as graphics API accelerators.

NVIDIA developed the CUDA programming model and software environment to let programmers write scalable parallel programs using a straightforward extension of the C language. The CUDA programming model guides the programmer to expose substantial fine-grained parallelism sufficient for utilizing massively multithreaded GPUs, while at the same time providing scalability across the broad spectrum of physical parallelism available in the range of GPU devices. Because it provides a fairly simple, minimalist abstraction of parallelism and inherits all the well-known semantics of C, it lets programmers develop massively parallel programs with relative ease.

In the year since its release, many developers have used CUDA to parallelize and accelerate computations across various problem domains. In this article, we survey some experiences gained in applying CUDA to a diverse set of problems and the parallel speedups attained by executing key computations on the GPU.

Michael Garland, Scott Le Grand, and John Nickolls, NVIDIA
Joshua Anderson, Iowa State University and Ames Laboratory
Jim Hardwick, TechniScan Medical Systems
Scott Morton, Hess
Everett Phillips and Yao Zhang, University of California, Davis
Vasily Volkov, University of California, Berkeley


CUDA parallel programming model

The CUDA parallel programming model emphasizes two key design goals. First, it aims to extend a standard sequential programming language, specifically C/C++, with a minimalist set of abstractions for expressing parallelism. Broadly speaking, this lets the programmer focus on the important issues of parallelism—how to craft efficient parallel algorithms—rather than grappling with the mechanics of an unfamiliar and complicated language. Second, it is designed for writing highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores. This is essential because the physical parallelism of current NVIDIA GPUs ranges from eight processor cores and 768 thread contexts up to 240 processor cores and 30,720 thread contexts. The CUDA model naturally guides the programmer to write parallel programs that transparently and efficiently scale across these different levels of parallelism.

A CUDA program is organized into a host program, consisting of one or more sequential threads running on the host CPU, and one or more parallel kernels that are suitable for execution on a parallel processing device like the GPU. A kernel executes a scalar sequential program on a set of parallel threads. The programmer organizes these threads into a grid of thread blocks. The threads of a single thread block are allowed to synchronize with each other via barriers and have access to a high-speed, per-block shared on-chip memory for interthread communication. Threads from different blocks in the same grid can coordinate only via operations in a shared global memory space visible to all threads. CUDA requires that thread blocks be independent, meaning that a kernel must execute correctly no matter the order in which blocks are run, even if all blocks are executed sequentially in arbitrary order without preemption. This restriction on the dependencies between blocks of a kernel provides scalability. It also implies that the need for global communication or synchronization amongst threads is the main consideration in decomposing parallel work into separate kernels.
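As a concrete but purely illustrative sketch of these mechanics (this example does not appear in the article), the kernel below has each thread block accumulate a partial sum over a 256-element slice of an array in per-block shared memory, synchronizing at barriers, and then publish one result per block through global memory; the kernel name and the fixed block size are assumptions of the sketch.

// Each block sums 256 consecutive elements in on-chip shared memory and writes
// one partial result to global memory; blocks never synchronize with each other.
// Launch with 256 threads per block.
__global__ void blockSum(const float *in, float *blockResults, int n)
{
    __shared__ float partial[256];           // per-block shared memory

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    partial[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                         // barrier among the block's threads

    // Tree reduction within the block; every step ends at a barrier.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        blockResults[blockIdx.x] = partial[0];   // other blocks or kernels see this via global memory
}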

The details of the CUDA programming model are available in NVIDIA's CUDA Programming Guide (www.nvidia.com/CUDA) and related literature [1]. Figure 1 shows some basic features of parallel programming with CUDA. It contains straightforward implementations, both sequential and parallel, of the SAXPY routine defined by the BLAS linear algebra library. Given vectors x and y containing n floating-point numbers, it performs the update y ← ax + y. The serial implementation is a simple loop that computes one element of y in each iteration. The parallel kernel effectively executes each of these independent iterations in parallel, assigning a separate thread to compute each element of y. The __global__ modifier indicates that the procedure is a kernel entry point, and the extended function-call syntax saxpy<<<B, T>>>(…) is used to launch the kernel saxpy() in parallel across B blocks of T threads each. Each thread of the kernel determines which element it should process from its integer thread block index (blockIdx.x), its index within its block (threadIdx.x), and the total number of threads per block (blockDim.x).

Figure 1. Parallel programming with CUDA: serial (a) and parallel (b) kernels for computing y ← ax + y.
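Figure 1 itself is not reproduced here; the sketch below shows what such serial and parallel SAXPY implementations look like under the conventions just described. The helper names saxpy_serial and saxpy_parallel and the choice of 256 threads per block are illustrative, not necessarily those of the original figure.

// Serial SAXPY: y <- a*x + y, one element per loop iteration.
void saxpy_serial(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// Parallel SAXPY kernel: one thread computes one element of y.
// x and y are assumed to be device pointers (for example, from cudaMalloc).
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                        // guard the final, possibly partial, block
        y[i] = a * x[i] + y[i];
}

// Launch across B blocks of T threads each, enough to cover all n elements.
void saxpy_parallel(int n, float a, const float *x, float *y)
{
    int T = 256;
    int B = (n + T - 1) / T;
    saxpy<<<B, T>>>(n, a, x, y);
}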

This example demonstrates a common parallelization pattern, where a serial loop with independent iterations can be executed in parallel across many threads. In the CUDA paradigm, the programmer writes a scalar program—the parallel saxpy() kernel—that specifies the behavior of a single thread of the kernel. This lets CUDA leverage the underlying C language, with only a few small additions, such as the built-in thread and block index variables.

The SAXPY kernel is also a simple example of data parallelism, where parallel work is decomposed to match result data elements. Although different thread blocks or different threads of a kernel can potentially execute entirely different code, such task parallelism does not generally scale as well as data parallelism. Moreover, data-parallel kernels typically expose substantially more fine-grained parallelism than task-parallel kernels and, therefore, generally can take best advantage of the GPU architecture.

NVIDIA Tesla GPU architecture

NVIDIA designed its Tesla unified graphics and computing architecture [2] to accelerate parallel programs written in the CUDA programming model. It is built around a fully programmable processor array, organized into a number of SM multithreaded multiprocessors, each of which contains eight scalar SP processor cores. In contrast to earlier generations of GPUs, threads executing on the SP cores have access to the full range of instructions programmers expect in any general-purpose processor core. In particular, memory is accessed through load/store instructions supporting arbitrary address arithmetic, and it fully supports both floating-point and integer data types.

Figure 2 shows the Tesla architecture of a GeForce GTX 280 or Tesla T10 GPU with 240 SP streaming processor cores, organized in 30 SM streaming multiprocessors. Each multithreaded SP core executes up to 128 concurrent coresident threads sharing a register file of 2,048 entries; in total, the GPU executes up to 30,720 concurrent threads. The earlier GeForce 8800 GTX GPU provides 16 SMs with 1,024 registers per SP core and supports a maximum of 12,288 concurrent threads. Thread creation, scheduling, and resource management are performed entirely in hardware. The cost of creating and destroying threads is negligible, and there is effectively no overhead involved in thread scheduling. Each thread of a CUDA program is mapped to a physical thread resident in the GPU, and each running thread block is physically resident on a single SM. The SM multiprocessor supports efficient communication among threads in a block by providing an extremely lightweight barrier synchronization facility—CUDA's barrier intrinsic translates into a single instruction—and 16 Kbytes per SM of on-chip shared memory that has low access latency and high bandwidth, similar to an L1 cache.

Figure 2. Tesla unified graphics and computing architecture of a GeForce GTX 280 or Tesla T10 GPU with 240 SP streaming processor cores, organized in 30 SM multithreaded multiprocessors. Each multithreaded SP core executes up to 128 concurrent threads; the GPU executes up to 30,720 concurrent threads.

The Tesla architecture is designed to operate on massively parallel problems. The deep multithreading provides substantial latency tolerance, allowing resources to be dedicated toward throughput rather than large caches. As we discussed earlier, experience shows that data-parallel software design is frequently the best approach for managing this level of fine-grained parallelism. The threads of a data-parallel kernel will often be following substantially similar paths of execution. Consequently, the Tesla architecture is optimized for this case. The Tesla SM employs a single instruction, multiple thread (SIMT) architecture [1,2], where a block's threads are grouped into warps containing 32 threads each. A warp's threads are free to follow arbitrary and independent execution paths, involving any number of branches, but can collectively execute only a single instruction at any particular instant. Thus, threads in a warp that execute different code paths (that is, that diverge) must wait in turn while other threads of the warp execute the instructions they wish. Divergence and reconvergence are managed in hardware. On the other hand, in the common case where all threads of a warp are executing the same instruction, all the warp's threads can execute that instruction at the same time. This lets the hardware achieve substantial efficiencies when executing typical data-parallel programs, while at the same time making the SIMT execution of threads transparent to the programmer and providing the flexibility of a scalar computational model.
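For illustration only (these kernels are not from the article), the first kernel below branches on a per-thread value, so the even and odd lanes of every warp diverge and the two paths are serialized, whereas the second branches on a value that is uniform across the entire launch, so no warp ever diverges.

// Divergent branch: within each warp, even and odd lanes take different paths,
// so the hardware executes the two paths one after the other.
__global__ void divergentUpdate(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (threadIdx.x % 2 == 0)
            data[i] *= 2.0f;          // even lanes
        else
            data[i] += 1.0f;          // odd lanes, executed after the even lanes
    }
}

// Uniform branch: every thread evaluates the same condition the same way, so
// all 32 threads of a warp issue the same instruction and nothing is serialized.
__global__ void uniformUpdate(float *data, int n, int mode)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (mode == 0)
            data[i] *= 2.0f;
        else
            data[i] += 1.0f;
    }
}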

Application experience with CUDA

Many applications consist of a mixture of fundamentally serial control logic and inherently parallel computation. Furthermore, these parallel computations are frequently data-parallel in nature. This directly matches the program model that CUDA adopts, namely a sequential control thread capable of launching a series of parallel kernels. The use of parallel kernels launched from a sequential program also makes it relatively easy to parallelize an application's individual components, rather than requiring a wholesale rewriting of the entire application.

Many applications—both academic research and industrial products—have been accelerated using CUDA to achieve significant parallel speedups. Such applications fall into a variety of problem domains, including machine learning [3], database processing [4], bioinformatics [5,6], financial modeling, numerical linear algebra, medical imaging [7], and physical simulation, among others. Of the many available examples, we survey a few representative cases.

Molecular dynamics

Molecular dynamics is a simulation technique widely used in physics, chemistry, biology, and related fields. Its goal is to compute the movement of a number of atoms, beginning in some initial configuration, and track their trajectory over specified time intervals. Molecular dynamics is a versatile tool that scientists can use to calculate materials' static and dynamic properties, study methods of crystal growth, or examine mechanisms by which proteins perform their functions, to name only a few of its uses [8]. Figure 3 shows an example snapshot from a bead-spring polymer simulation with each of 64,017 particles displayed as a sphere [9].

Figure 3. Snapshot of a bead-spring polymer simulation with 64,017 particles.

Molecular dynamics simulations are inherently parallel computations and are ideally suited to the CUDA programming model. During each time step of the simulation, the program must calculate the forces acting on all atoms and use the resulting forces to integrate atom positions and velocities forward to the next step. The different tasks required in a time step can each be implemented in a separate kernel. Because each atom can be processed independently of the others during a single time step, it is natural to map each atom to a single thread. Thus, the application naturally provides the large amount of fine-grained parallelism for which the GPU is designed, and several molecular dynamics [10] and molecular modeling [11,12] codes have been successfully accelerated with CUDA.

A typical molecular dynamics simulation of the polymer model in Figure 3 would be run for 18 million time steps, and during each time step, every atom can be processed independently of the others. Performing all this work sequentially can be expensive; a serial calculation using a single core of a 2.4-GHz Opteron 280 CPU can calculate about 6.46 time steps per second (tps) and would take about 32 days to complete the entire run. Usually, scientists perform jobs of this size using parallel software such as LAMMPS [13], possibly completing it in a single day if executed on 32 processor cores in a cluster of Opteron 280 nodes connected by InfiniBand. Implemented in CUDA and executing on a single GeForce 8800 GTX, Highly Optimized Object-Oriented Molecular Dynamics (HOOMD; www.ameslab.gov/hoomd) can execute the same simulation with a performance of 203 tps, equivalent to that of LAMMPS using 32 CPU cores [10]. A multi-GPU implementation is currently under development, and the performance is expected to scale well with the number of GPUs in the workstation.

Due to the excellent scalability of Tesla-architecture GPUs, HOOMD's performance scales linearly with particle count, from systems as small as 5,000 particles to those consuming all available memory. Figure 4 shows the time taken to execute just the pair force summation kernel for the polymer system with varying numbers of particles (N) and the speedup relative to an optimized serial CPU implementation running on a single-core 3.0-GHz Xeon 80546K processor [10]. The GPU is very efficient at the pair force calculation kernel, offering speedup factors of approximately 60 times for large systems. Furthermore, it is able to accelerate the entire application-level simulation for the problem shown in Figure 3 by a total of 32 times.

Figure 4. Scaling and parallel speedup of GPU implementation over single-CPU core for the pair force calculation kernel, across many problem sizes.

In HOOMD, the different tasks required in a time step are each implemented in a separate kernel. The host program performs all memory management and other setup tasks and then calls the GPU kernels to perform the simulation. Each kernel breaks its task down further in a data-parallel fashion, with each thread mapped to a single particle. This is a natural choice for molecular dynamics, where all particles are updated independently from each other during a single time step, and is a natural fit for CUDA. When executing the polymer model simulations, completing each time step in HOOMD requires calling a number of kernels in sequence. First, the neighbor list data structure must be updated, although this doesn't strictly need to be done at every step. Then it can be used to compute pair forces. After that, bond forces are computed, and the system is integrated forward to the next time step. Each of these tasks requires calling one or more kernels, totaling five to eight for each time step, depending on whether the neighbor lists have been updated.

The GeForce 8800 GTX offers a peak multiply-add rate of 345 Gflops and a memory throughput of 86 Gbytes/s. Making the most of these resources is key in creating high-performance GPU kernels. For most calculations in molecular dynamics, this means optimizing the memory accesses first because there are relatively few floating-point operations for each memory read. With CUDA, it is straightforward to write a kernel with a simple memory access pattern that achieves near-peak performance. Such kernels for performing the integration step are nearly as simple as the previous SAXPY example and achieve near 70 Gbytes/s. The pair force computation involves more complicated and random memory access patterns. Utilizing the 1D texture cache together with a space-filling curve-based data-reordering technique, the pair force computation can execute at close to peak bandwidth, utilizing 76 Gbytes/s and 130 Gflops [10]. Achieving these performance levels requires no assembly language programming or complicated code transformations, just writing simple and readable data-parallel C code.
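A sketch of such a memory-bound update kernel is shown below. It is a generic Euler-style position and velocity update with one thread per particle; the float4 data layout and kernel name are assumptions of the sketch, not HOOMD's actual integrator.

// One thread advances one particle; a handful of flops per memory transaction,
// so performance is set almost entirely by memory bandwidth.
__global__ void integrateStep(float4 *pos, float4 *vel, const float4 *accel,
                              float dt, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 v = vel[i];
        float4 a = accel[i];
        v.x += a.x * dt;  v.y += a.y * dt;  v.z += a.z * dt;

        float4 p = pos[i];
        p.x += v.x * dt;  p.y += v.y * dt;  p.z += v.z * dt;

        vel[i] = v;
        pos[i] = p;
    }
}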

Folding@Home is a popular application for molecular dynamics that runs on many hardware platforms [14]. The performance bottleneck for this application involves three N-body calculations over all O(N²) unique pairs of atoms. The first two of these calculations are independent and can be merged into one calculation, but the third calculation depends on results generated by the second. This leaves us with two such loops that consume approximately 75 percent of this application's computation. Previous GPU-based implementations have calculated all pair-wise interactions due to the complexity of tracking only unique pairs in a streaming parallel implementation. In contrast, the CUDA implementation operates only on unique pairs. It blocks the calculation on N bodies into a series of p × p swaths, where p is the warp width. This lets the p threads within each warp each read one atom's worth of data into their register space and then operate on p atoms that were read into on-chip shared memory. Furthermore, because all the threads in a warp are guaranteed to execute synchronously together, each thread can interact with one atom's data in shared memory at a time over p iterations without any need for additional synchronization.
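The sketch below illustrates the shape of one such p × p swath. To keep it short it uses one warp per thread block, sweeps every swath rather than only the unique pairs, and substitutes a generic softened Coulomb-style pair term for the actual Folding@Home force evaluation; the atom count is assumed to be a multiple of the warp width.

#define WARP 32   // p, the warp width

// One warp per block: each thread keeps one atom (a row of the swath) in
// registers, the warp stages WARP atoms (the columns) in shared memory, and
// the threads sweep the swath in WARP steps.
__global__ void tileForces(const float4 *atoms, float3 *accel, int nTiles)
{
    __shared__ float4 tile[WARP];

    int lane = threadIdx.x;                   // blockDim.x == WARP
    int row  = blockIdx.x * WARP + lane;      // atom i, held in registers
    float4 mine = atoms[row];
    float3 acc  = make_float3(0.0f, 0.0f, 0.0f);

    for (int t = 0; t < nTiles; ++t) {
        tile[lane] = atoms[t * WARP + lane];  // one coalesced load per thread
        // No barrier: only this warp touches tile[], and its threads execute
        // synchronously together, as described in the text.
        for (int k = 0; k < WARP; ++k) {
            float4 other = tile[(lane + k) % WARP];   // staggered access pattern
            float dx = other.x - mine.x, dy = other.y - mine.y, dz = other.z - mine.z;
            float r2 = dx * dx + dy * dy + dz * dz + 1e-6f;   // softened distance
            float invR = rsqrtf(r2);
            float s = mine.w * other.w * invR * invR * invR;  // placeholder pair term
            acc.x += s * dx;  acc.y += s * dy;  acc.z += s * dz;
        }
    }
    accel[row] = acc;
}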

Folding@Home presents an excellent opportunity to compare the performance of several different architectures on the same application. Figure 5 plots the performance of the CUDA implementation of the Folding@Home energy kernel on a GeForce 8800M GTS (a laptop GPU with 64 SPs), a GeForce 8800 GTS (a desktop GPU with 128 SPs), and the latest GeForce GTX 280 (a high-end GPU with 240 SPs). The CUDA programming model was specifically designed to enable the design of scalable parallel computations, and here we see clear evidence of strong scaling across a range of processors with substantially different levels of physical parallelism. To place their performance in context, the graph also shows the performance of this simulation on a Core2 Duo CPU, the Cell processor of a Sony PlayStation 3, and an ATI Radeon 3870 GPU. The algorithmic flexibility that CUDA provides, and the architectural features that it exploits—specifically shared memory—allow the CUDA implementation on the GTX 280 to run 3.6 times faster than the Brook-based ATI implementation and 6.7 times faster than the Cell implementation.

Figure 5. Performance of Folding@Home energy kernel on various platforms.

Numerical linear algebra

Dense matrix-matrix multiplication, particularly as provided by the BLAS library GEMM routines, is one of the fundamental building blocks of numerical linear algebra algorithms. It is also a natural fit for CUDA and the GPU because it is inherently parallel and can naturally be expressed as a blocked computation.

Volkov and Demmel [15] describe the design of a custom SGEMM matrix-matrix multiplication kernel. Figure 6 summarizes the performance of their algorithm running on a GeForce 8800 GTX and of Intel's Math Kernel Library (MKL) 10.0 running on a 2.4-GHz Core2 Quad Q6600 ("Kentsfield") processor. This algorithm, which operates on single-precision floating-point numbers, achieves up to 206 Gflops—a 2.9 times improvement over the highest rate achieved by the Core2 Quad—and roughly 60 percent of the GeForce 8800 GTX peak multiply-add rate.

Figure 6. Performance on dense matrix-matrix multiplication (SGEMM).

Writing a basic dense matrix-matrix multiplication kernel is a fairly simple exercise (see the CUDA Programming Guide for detail). Achieving this high level of performance, on the other hand, requires more careful optimization. Volkov and Demmel use a block algorithm similar to those used for vector computers, using GPU registers and per-block shared memory to store the data blocks. As the GPU has an unusually large register file, registers can be used as the primary scratch space for the computation. Furthermore, assigning small blocks of elements to each thread, rather than a single element to each thread, boosts efficiency much as strip-mining boosts efficiency on vector machines. Finally, the nonblocking nature of loads on the GPU makes it possible to do software prefetching, which is useful for hiding memory latency. Ryoo and colleagues explored the benefits of blocking and loop unrolling, producing a matrix multiplication kernel executing at 91 Gflops [16]. The gap with the performance achieved by Volkov and Demmel clearly demonstrates the utility of the additional optimizations they employed, namely storing as much data as possible in registers, using larger data blocks, and strip-mining multiple elements onto each thread.
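For reference, the kind of basic shared-memory tiled kernel that the Programming Guide exercise produces might look like the sketch below. It omits the register blocking, larger per-thread data blocks, and software prefetching described above, which is exactly why such kernels fall well short of the performance Volkov and Demmel report; row-major storage and a matrix dimension that is a multiple of the tile width are assumptions of the sketch.

#define TILE 16

// C = A * B for n x n row-major matrices, with n a multiple of TILE.
// Launch with dim3 block(TILE, TILE) and dim3 grid(n / TILE, n / TILE).
__global__ void sgemmTiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n; t += TILE) {
        // Each thread stages one element of the A tile and one of the B tile.
        As[threadIdx.y][threadIdx.x] = A[row * n + (t + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * n + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}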

Matrix factorizations are widely used to solve systems of linear equations and linear least-square problems, and among the many available factorization schemes the LU, Cholesky, and QR factorizations are most common. Like matrix-matrix multiplication, these factorizations can be implemented using blocked algorithms that do most of their work with bulk matrix-matrix multiplies that have high arithmetic intensity and expose substantial amounts of data and thread-level parallelism. The remaining work lies in factorizing tall and skinny matrices (called panels) and pivoting. Panel factorizations involve many small fine-grained operations that might not offer sufficient parallelism to run efficiently on the GPU. Therefore, it is often more efficient to perform panel factorization on the CPU while using the GPU to perform the inherently parallel parts of the computation, particularly since these computations can then be overlapped. For efficient partial pivoting in LU factorization, it is necessary to arrange accesses to memory to have a minimal stride, which can be done by keeping matrices transposed in the GPU memory.

Figure 7 compares the performance of these factorization routines running on a GeForce 8800 GTX in a 2.67-GHz Core2 Duo system, versus performing the factorization entirely on a 2.4-GHz Core2 Quad with MKL. The CUDA factorization code achieves rates of up to 190 Gflops, which approaches the peak rate sustained by matrix-matrix multiplication itself. For matrix dimensions of around 300, the problem becomes large enough to leverage the GPU parallelism and overcome the overhead of CPU-GPU coordination and data movement. For larger matrices, the CUDA factorization running on the GPU and Core2 Duo is up to 5.5 times faster than the MKL factorization code running on the Core2 Quad.

Figure 7. Performance on dense LU, Cholesky, and QR factorization for N × N matrices.

Medical imaging

TechniScan Medical Systems has been developing advanced inverse-scattering algorithms to generate 3D volumetric images of the breast with ultrasound for several years. Unlike conventional ultrasound imaging, in which reflected ultrasound is used to form images, inverse scattering uses ultrasound transmitted through, refracted by, and scattered by breast tissue to generate high-resolution images of the speed and attenuation of sound [17]. Pending FDA clearance, these images are intended to provide additional information to improve care and outcome in breast disease treatment (see Figure 8).

Figure 8. Example scans produced from ultrasound by TechniScan Medical Systems.

In the last five years, CPUs have become fast enough to run these algorithms in several hours on a small high-performance computing cluster inside the TechniScan Whole Breast Ultrasound (WBU) scanner. Radiologists and women's health providers, however, would like images in minutes so that they can conduct the exam and discuss results with their patients in the same visit. TechniScan has investigated porting the inverse-scattering algorithm to FPGAs, digital signal processors (DSPs), and IBM's Cell processor to meet this goal, but each has disadvantages. FPGAs can be a huge engineering undertaking, and DSPs don't provide the floating-point performance necessary to compute images in minutes. The Cell processor showed potential, but the development workstation was expensive for a small startup.

For approximately $600—the cost of one GeForce 8800 GTX and an ATX power supply—TechniScan was able to start a proof-of-concept project to port the inverse-scattering algorithm to the GPU. The scanner collects ultrasound signals at regular rotational positions and at regular coronal slice positions. The inverse-scattering algorithm is a modified nonlinear conjugate gradient method that tries to match simulated scattered ultrasound to the collected signals. It uses 2D convolution by fast Fourier transform (FFT) to simulate ultrasound propagation [17]. This simulation is used to compute the residual, gradient, and step length for each iteration. Approximately 63 million 2D FFTs are performed during the algorithm's run, which accounts for over 75 percent of the computation time.

A 2D GPU convolution routine can be implemented easily using CUFFT—the fast Fourier transform library supplied with CUDA—and a simple kernel to perform point-wise multiplication. This approach is approximately eight times faster than a CPU version using an optimized FFT and running on one core of a 2.4-GHz Core2 Quad Q6600 processor. However, because the FFT size is fairly small (256 × 64), using the entire GPU to perform a single FFT does not produce enough work to efficiently utilize the GPU. Instead, creating a batched FFT that assigns multiple FFTs to different thread blocks is a much more effective way of utilizing the hardware. Implementing a batched 2D FFT kernel nearly doubled convolution performance and made the GPU convolution almost 16 times faster than the CPU implementation.
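The TechniScan code is not shown in the article, and its batched 2D FFT was implemented as a custom kernel; purely to illustrate the batched-convolution structure, the sketch below uses the batched-plan interface of current CUFFT releases (cufftPlanMany), which may differ from what was available at the time, together with a simple point-wise multiplication kernel. The filter is assumed to be already transformed and replicated for every slice in the batch.

#include <cufft.h>

// Point-wise complex multiply of the transformed data with the transformed
// filter; one thread per element across the entire batch.
__global__ void pointwiseMul(cufftComplex *a, const cufftComplex *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        cufftComplex x = a[i], y = b[i], r;
        r.x = x.x * y.x - x.y * y.y;
        r.y = x.x * y.y + x.y * y.x;
        a[i] = r;
    }
}

// One batched plan transforms `batch` independent 256 x 64 slices in a single
// call, so each FFT maps onto its own thread blocks.
void batchedConvolve(cufftComplex *d_data, const cufftComplex *d_filterFT, int batch)
{
    int n[2] = {256, 64};                      // FFT dimensions from the article
    int total = 256 * 64 * batch;

    cufftHandle plan;
    cufftPlanMany(&plan, 2, n, NULL, 1, 256 * 64, NULL, 1, 256 * 64,
                  CUFFT_C2C, batch);

    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    pointwiseMul<<<(total + 255) / 256, 256>>>(d_data, d_filterFT, total);
    cufftExecC2C(plan, d_data, d_data, CUFFT_INVERSE);   // note: unnormalized inverse
    cufftDestroy(plan);
}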

Other parts of the algorithm showed more dramatic speed improvements running on the GPU. The most impressive improvements were seen in the CAXPBY routine, which multiplies two complex-valued vectors by two different scalars and sums the resulting vectors; computation of the L2 norm of a complex vector; and computation of the complex exponent of each value of a large, complex vector. Figure 9 shows the runtime of each routine running on a GeForce 8800 GTX and one core of a 2.4-GHz Q6600 CPU.

Figure 9. GPU speedup of individual computational routines.
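As an illustration of how simple such routines are to express in CUDA, a CAXPBY kernel along the lines described above might look like the following sketch; the kernel name and the use of the cuComplex helper functions are illustrative choices.

#include <cuComplex.h>

// z[i] = a*x[i] + b*y[i] for single-precision complex vectors; one thread per element.
__global__ void caxpby(int n, cuComplex a, const cuComplex *x,
                       cuComplex b, const cuComplex *y, cuComplex *z)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        z[i] = cuCaddf(cuCmulf(a, x[i]), cuCmulf(b, y[i]));
}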

The algorithm was also adapted to run on multiple GPUs. Running the algorithm on two Tesla D870s provided the performance necessary to generate images in approximately 16 minutes—fast enough to meet the same-visit requirement of TechniScan's customers. This level of performance is over twice as fast as a 16-core Intel Core2 CPU cluster.

Fluid dynamics

Physical simulations based on finite-element, finite-difference, finite-volume, and similar methods are not as trivially parallelized as molecular dynamics. However, by adopting a blocking strategy similar to those used in matrix multiplication and image processing, algorithms of this sort can also be transformed into highly parallel computations.

As an example, we consider the 2D compressible Euler equations, which are often used in the design of aerospace vehicle components such as rocket nozzles and supersonic airfoils. The equations are solved on an irregular structured grid using the finite-volume method and an integration scheme developed by Ni [18]. The solver uses local time stepping and multigrid techniques to accelerate convergence. This procedure can serve as the pseudo-time iteration in a more complex solver for the unsteady compressible Navier-Stokes equations with turbulence modeling.

Phillips and colleagues developed this CUDA-based solver [19], which makes each thread block responsible for updating a 16 × 5 tile of the domain (see Figure 10). Each block requires access to a 20 × 9 area because the computational stencil extends by two nodes past the tile boundary in each direction. These specific dimensions are chosen to best fit Tesla-architecture GPUs: the memory subsystem can deliver much higher bandwidth by coalescing accesses by threads of a warp to 16 contiguous values, and a height of five is the largest that fits within available register and per-block shared memory limits. Memory bandwidth can become the bottleneck when solving the Euler equations on graphics hardware [20]. Consequently, reducing memory bandwidth by calculating intermediate variables on the fly, rather than storing them, can improve efficiency.

Figure 10. Decomposition of a 33 × 17 subdomain. A single thread block computes a result for the light gray tile in the center and also accesses neighboring information for integration (white boxes surrounding the center tile) and smoothing (dark gray boxes surrounding the center tile) overlap.
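The solver's kernels are not listed in the article; purely as a sketch of the tile-plus-halo staging pattern that such a decomposition implies, the kernel below sizes each thread block to cover the full 20 × 9 region, stages it in shared memory, and applies a placeholder averaging stencil in place of the actual integration and smoothing computations.

#define TILE_W 16
#define TILE_H 5
#define HALO   2   // the stencil reaches two nodes past the tile in each direction

// Launch with blockDim = (TILE_W + 2*HALO, TILE_H + 2*HALO) = (20, 9) so every
// element of the staged region has a thread to load it; interior threads then
// compute and store one node of the 16 x 5 tile.
__global__ void tileStencil(const float *in, float *out, int nx, int ny)
{
    __shared__ float patch[TILE_H + 2 * HALO][TILE_W + 2 * HALO];

    int gx = blockIdx.x * TILE_W + threadIdx.x - HALO;   // global column (may lie in the halo)
    int gy = blockIdx.y * TILE_H + threadIdx.y - HALO;   // global row

    bool inside = (gx >= 0 && gx < nx && gy >= 0 && gy < ny);
    patch[threadIdx.y][threadIdx.x] = inside ? in[gy * nx + gx] : 0.0f;
    __syncthreads();

    if (inside &&
        threadIdx.x >= HALO && threadIdx.x < TILE_W + HALO &&
        threadIdx.y >= HALO && threadIdx.y < TILE_H + HALO) {
        // Placeholder four-neighbor average standing in for the real scheme.
        out[gy * nx + gx] = 0.25f * (patch[threadIdx.y][threadIdx.x - 1] +
                                     patch[threadIdx.y][threadIdx.x + 1] +
                                     patch[threadIdx.y - 1][threadIdx.x] +
                                     patch[threadIdx.y + 1][threadIdx.x]);
    }
}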

Figure 11 shows example simulations of a rocket nozzle and supersonic airfoil performed on a Quadro FX 5600. Figure 12 shows the performance of the CUDA-based solver running on a GPU cluster in comparison to a serial reference solver running on a 2.4-GHz Core2 Duo. The cluster consists of four nodes, each with two Quadro FX 5600 GPUs and dual Opteron 2216 CPUs, connected by gigabit Ethernet. For the solution process, the domain is decomposed across GPUs, and each GPU performs one iteration of the solver, after which the boundary elements are communicated with its neighbors. On the coarsest grid of 1,600 nodes, the amount of parallel work is small enough that the overhead of moving work onto the GPU is a comparatively high cost, and a single GPU is only able to deliver about four times the performance of the serial CPU solver. With 25,000 nodes, the subdomains remain small and communication time dominates; a single GPU, which solves the problem at roughly 18 times the speed of the CPU, is faster than the entire eight-GPU cluster. As the grids become denser, communication cost is an ever-decreasing component of the solution time. At the densest grid resolution, the solver scales nearly linearly as more GPUs are used, delivering 22 times the serial performance on one GPU and 160 times the serial performance on eight GPUs. All GPU simulations were performed in single precision, and the results were verified to agree with a reference CPU implementation. The use of single precision required a priori calculation of the inlet and exit properties to compensate for numerical drift due to the exponentiation and square-root functions' limited accuracy.

Figure 11. Simulation results showing pressure distribution in a rocket nozzle (a) and over a supersonic airfoil (b).

Figure 12. Performance speedup for GPU clusters of varying size over a single CPU for solving 2D Euler equations.

Seismic imaging

The petroleum industry makes heavy use of seismic data to construct images of the Earth's subsurface structure in its search for oil and gas. A seismic survey of a prospective region will typically consist of hundreds of thousands of seismic experiments. Each experiment involves an impulsive acoustic source that generates a signal that propagates up to tens of kilometers through the subsurface, reflects off the interfaces between rock layers, and is recorded by a few thousand pressure-sensitive receivers at the surface. This acquisition process produces terabytes of seismic data.

Reconstructing a subsurface image from this data typically requires weeks of computations on thousands of CPUs, with the exact effort depending on the fidelity of the approximations used to simulate the wave propagation.

Given the large amount of parallel computation involved in the seismic imaging process, it is natural to consider mapping the process onto the GPU with CUDA. We consider a specific imaging method, where the wave equation is solved in the frequency domain using a paraxial approximation implemented with an ADI (alternating-direction implicit) finite-difference method. This method's computational cost is dominated by the evaluation and solution of complex-valued tridiagonal linear systems, where there are on the order of a billion systems for each seismic experiment and the dimension of each system is in the hundreds.

Although the solution of tridiagonal systems can be parallelized directly [21,22], this is not an effective way to solve such a large number of relatively small systems. A much more efficient approach is to assign each linear system to a single thread. In this particular application, it also happens that many systems to be solved involve the same coefficient matrix. This leads to a natural design in CUDA where each thread block is given one tridiagonal matrix and each thread of that block solves a linear system involving that matrix and a specific right-hand side. Furthermore, keeping these shared coefficients in shared memory avoids redundant accesses to memory. A thread block's threads collectively read their coefficients into shared memory and synchronize, and then each thread computes its system's solution.
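A sketch of this design is shown below. It uses the standard Thomas algorithm, real rather than complex arithmetic for brevity, a compile-time system size, and an assumed layout in which all right-hand sides sharing a coefficient matrix are stored contiguously; it also lets a single thread perform the inherently sequential elimination of the shared coefficients. All of these are illustrative choices rather than details given in the article.

#define M 128   // hypothetical system dimension ("in the hundreds" in the application)

// One thread block per coefficient matrix, one thread per right-hand side.
__global__ void batchedTridiagonal(const float *a,   // sub-diagonals, one set of M per block
                                   const float *b,   // diagonals
                                   const float *c,   // super-diagonals
                                   float *d,         // right-hand sides, solved in place;
                                                     // d[(block * numRhs + rhs) * M + i]
                                   int numRhs)
{
    __shared__ float cp[M];    // modified super-diagonal (Thomas algorithm)
    __shared__ float inv[M];   // reciprocals of the pivots

    const float *ab = a + blockIdx.x * M;
    const float *bb = b + blockIdx.x * M;
    const float *cb = c + blockIdx.x * M;

    // Forward elimination of the shared coefficients is sequential, so one
    // thread prepares it; every thread in the block then reuses the result.
    if (threadIdx.x == 0) {
        inv[0] = 1.0f / bb[0];
        cp[0]  = cb[0] * inv[0];
        for (int i = 1; i < M; ++i) {
            inv[i] = 1.0f / (bb[i] - ab[i] * cp[i - 1]);
            cp[i]  = cb[i] * inv[i];
        }
    }
    __syncthreads();

    // Each thread solves one or more systems against its own right-hand sides.
    for (int r = threadIdx.x; r < numRhs; r += blockDim.x) {
        float *x = d + (size_t)(blockIdx.x * numRhs + r) * M;
        x[0] *= inv[0];
        for (int i = 1; i < M; ++i)                 // forward sweep
            x[i] = (x[i] - ab[i] * x[i - 1]) * inv[i];
        for (int i = M - 2; i >= 0; --i)            // back substitution
            x[i] -= cp[i] * x[i + 1];
    }
}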

Figure 13 examines the speedup of the prototype CUDA implementation over CPU production code with varying numbers of processors. The CUDA code was executed on a Tesla C870 GPU. CPU performance was measured on two systems, a dual 3.6-GHz Intel Xeon and a dual 3.0-GHz quad-core Xeon (Harpertown). The CPU code was optimized to use SSE (streaming SIMD extension) vector instructions and compiled with the Intel Fortran compiler. Even though eight CPU cores were available on the Harpertown-based system, performance did not scale beyond four processes because of memory bandwidth limitations. This resulted in a single GPU outperforming an entire Harpertown system by a factor of six on this seismic imaging code.

Figure 13. Speedup of a CUDA prototype wave-equation solver compared with various CPU configurations.

In considering the experiences of a number of applications parallelized with CUDA, we believe that a few basic trends are apparent. CUDA provides a straightforward means of describing inherently parallel computations and is particularly well suited for data-parallel algorithms. The NVIDIA Tesla GPU architecture delivers high computational throughput on such kernels. Furthermore, unlike the massively parallel machines of the past, which were large, expensive, and available to a select few, these GPUs capable of running CUDA are ubiquitous. By the end of summer 2008, NVIDIA will have shipped roughly 80 million CUDA-capable GPUs, transforming acceleration with massively parallel hardware from a rarity into an everyday commodity.

Reviewing the many CUDA-enabled applications now available, we encounter a few important design techniques. First, and foremost, is the fundamental importance of exposing sufficient amounts of fine-grained parallelism to exploit hardware like the Tesla-architecture GPU. Second is the importance of blocking computations, a process that naturally fits the CUDA thread block abstraction and encourages data layout and access patterns with high locality. Third is the efficiency of data-parallel programs where threads of a warp follow the same execution path, thus fully utilizing the GPU's processor cores. Finally is the benefit of the on-chip, per-block shared memory provided by the Tesla architecture, which provides high-speed, low-latency scratchpad space that is critical to the performance of many efficient algorithms.

The latest CUDA toolkit, documentation, and code examples, as well as a directory of some of the many available CUDA-based applications and research projects, are available at www.nvidia.com/CUDA/. A course on parallel programming using CUDA is also available online (http://courses.ece.uiuc.edu/ece498/al).

References

1. J. Nickolls et al., "Scalable Parallel Programming with CUDA," ACM Queue, vol. 6, no. 2, Mar./Apr. 2008, pp. 40-53.

2. E. Lindholm et al., "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro, vol. 28, no. 2, Mar./Apr. 2008, pp. 39-55.

3. B. Catanzaro, N. Sundaram, and K. Keutzer, "Fast Support Vector Machine Training and Classification on Graphics Processors," Proc. 25th Ann. Int'l Conf. Machine Learning, Omnipress, 2008, pp. 104-111.

4. B. He et al., "Relational Joins on Graphics Processors," Proc. ACM SIGMOD 2008, ACM Press, 2008; www.cse.ust.hk/catalac/papers/gpujoin_sigmod08.pdf.

5. M. Schatz et al., "High-Throughput Sequence Alignment Using Graphics Processing Units," BMC Bioinformatics, vol. 8, no. 1, 2007, p. 474; http://dx.doi.org/10.1186/1471-2105-8-474.

6. S. Manavski and G. Valle, "CUDA Compatible GPU Cards as Efficient Hardware Accelerators for Smith-Waterman Sequence Alignment," BMC Bioinformatics, vol. 9, suppl. 2, 2008, p. S10; http://dx.doi.org/10.1186/1471-2105-9-S2-S10.

7. S.S. Stone et al., "How GPUs Can Improve the Quality of Magnetic Resonance Imaging," Proc. 1st Workshop General Purpose Processing on Graphics Processing Units, 2007.

8. D. Frenkel and B. Smit, Understanding Molecular Simulations, Academic Press, 2002.

9. J.A. Anderson, C.D. Lorenz, and A. Travesset, "Micellar Crystals in Solution from Molecular Dynamics Simulations," J. Chemical Physics, vol. 128, 2008, pp. 184906-184916.

10. J.A. Anderson, C.D. Lorenz, and A. Travesset, "General Purpose Molecular Dynamics Simulations Fully Implemented on Graphics Processing Units," J. Computational Physics, vol. 227, no. 10, May 2008, pp. 5342-5359.

11. J.E. Stone et al., "Accelerating Molecular Modeling Applications with Graphics Processors," J. Computational Chemistry, vol. 28, no. 16, 2007, pp. 2618-2640.

12. C.I. Rodrigues et al., "GPU Acceleration of Cutoff Pair Potentials for Molecular Modeling Applications," Proc. 2008 Conf. Computing Frontiers (CF 08), ACM Press, 2008, pp. 273-282.

13. S. Plimpton, "Fast Parallel Algorithms for Short-Range Molecular Dynamics," J. Computational Physics, vol. 117, no. 1, 1995, pp. 1-19.

14. M. Shirts and V.S. Pande, "Screen Savers of The World Unite," Science, vol. 290, no. 5498, 2000, pp. 1903-1904.

15. V. Volkov and J.W. Demmel, "LU, QR and Cholesky Factorizations Using Vector Capabilities of GPUs," tech. report UCB/EECS-2008-49, EECS Dept., Univ. of Calif., Berkeley, 2008.

16. S. Ryoo et al., "Optimization Principles and Application Performance Evaluation of a Multithreaded GPU using CUDA," Proc. 13th ACM SIGPLAN Symp. Principles and Practice of Parallel Programming, ACM Press, 2008, pp. 73-82.

17. S.A. Johnson et al., Apparatus and Method for Imaging Objects with Wavefields, US patent 6,636,584, Patent and Trademark Office, 2003.

18. R.H. Ni, "A Multiple Grid Scheme for Solving the Euler Equations," Proc. AIAA 5th Computational Fluid Dynamics Conf., AIAA Press, 1981, pp. 257-264.

19. E.H. Phillips et al., "A Multi-Grid Solver for the 2D Compressible Euler Equations on a GPU Cluster," tech. report ECE-CE-2008-2, Computer Eng. Research Lab., Univ. of California, Davis, 2008; www.ece.ucdavis.edu/cerl/techreports/2008-2.

20. T. Brandvik and G. Pullan, "Acceleration of a 3D Euler Solver Using Commodity Graphics Hardware," Proc. 48th AIAA Aerospace Sciences Meeting and Exhibit, AIAA Press, 2008, p. 607.

21. H.S. Stone, "Parallel Tridiagonal Equation Solvers," ACM Trans. Mathematical Software, vol. 1, no. 4, Dec. 1975, pp. 289-307; http://doi.acm.org/10.1145/355656.355657.

22. M. Kass, A. Lefohn, and J. Owens, "Interactive Depth of Field Using Simulated Diffusion on a GPU," tech. report 06-01, Pixar Animation Studios, 2006; http://graphics.pixar.com/DepthOfField/.

Michael Garland is a research scientist at NVIDIA. His research interests include computer graphics and visualization, geometric algorithms, and parallel algorithms and programming models. Garland received his PhD in computer science from Carnegie Mellon University.

Scott Le Grand is a senior engineer on the CUDA software team at NVIDIA. Le Grand received a BS in biology from Siena College and a PhD in biochemistry from Pennsylvania State University.

John Nickolls is director of architecture at NVIDIA for GPU computing. His interests include parallel processing systems, languages, and architectures. Nickolls received his PhD in electrical engineering from Stanford University.

Joshua Anderson is a graduate student in the Department of Physics and Astronomy at Iowa State University and the Ames Laboratory. His PhD work focuses on the design and implementation of a general-purpose molecular dynamics simulation code on the GPU and the theoretical design of new polymeric materials by computational methods. Anderson received his BS in physics and computer science from Michigan Tech.

Jim Hardwick is the senior software engineer at TechniScan Medical Systems in Salt Lake City, Utah. His expertise is in medical device development, including software and hardware development and human-computer interaction. Hardwick received his BS in computer science from Westminster College.

Scott Morton leads a geophysical technology group at Hess Corporation. His research interests include geophysical technologies with a focus on seismic imaging. Morton received his PhD in astrophysics from the University of Illinois at Urbana-Champaign.

Everett Phillips is a graduate student at the University of California, Davis, working in mechanical and aeronautical engineering and electrical and computer engineering. His research interests include computational fluid dynamics, parallel computing, and graphics hardware. Phillips received his BS in mechanical engineering from the University of California, Davis.

Yao Zhang is a PhD student in the Department of Electrical and Computer Engineering at the University of California, Davis. Zhang received his BS in electrical engineering from the Beijing Institute of Technology. His research interests are in the area of GPU computing, especially in solving linear algebra and fluid problems on GPUs.

Vasily Volkov is a PhD student in the Department of Electrical Engineering and Computer Science at the University of California, Berkeley. His research interests include high-performance computing and numerical linear algebra. Volkov received his MS from the Moscow Institute of Physics and Technology and MEng from the Nanyang Technological University.

Direct questions and comments about this article to Michael Garland, NVIDIA, 2701 San Tomas Expressway, Santa Clara, CA 95050; [email protected].

For more information on this or any other computing topic, please visit our Digital Library at http://computer.org/csdl.