
CUDA: Parallel Compute Architecture

Christian Sigg

GPU Outpacing CPU in Raw Processing

GPU: NVIDIA GTX 295, 480 cores, 1,788 GFLOPS

CPU: Intel Core i7 965, 4 cores, 102 GFLOPS

NVIDIA CUDA 38th SPEEDUP Workshop on High-Performance Computing

The Brick Wall

Power Wall: growing demand for electrical power
Memory Wall: bandwidth improves sub-linearly with compute flops
ILP Wall: diminishing return on more ILP area

Power Wall + Memory Wall + ILP Wall = Brick Wall


David Patterson, UC Berkeley

GPU initially a 3D Accelerator

© Id Software

CUDA: Parallel Compute Architecture

Massive parallelism
• 100s of processors

Memory latency tolerant
• 1000s of parallel threads
• If one thread stalls, switch to another

Teraflops of compute power
Very energy efficient

GPU Computing Today

• Over 140M installed CUDA-Architecture GPUs
• Windows, Linux and MacOS platforms supported
• GPU Computing spans consumer applications to HPC
• Over 60,000 GPU Computing developers
• 200+ universities teaching the CUDA Architecture and GPU Computing

GPU Computing Applications
(on NVIDIA GPU with the CUDA Parallel Computing Architecture)

C with CUDA Extensions
• Over 60,000 developers
• Running in production since July 2007
• SDK + libs + Visual Profiler and debugger

OpenCL
• 1st GPU demo
• Shipped 1st OpenCL beta driver
• Strategic developers using NV SW today

Direct Compute
• 1st GPU demo
• Shipped 1st driver to Microsoft's Win7 developers
• Supports all CUDA-Architecture GPUs

FORTRAN
• SW supplied by: The Portland Group
• NCSA release

Java and Python
• Compute kernels
• Driver API bindings

OpenCL is trademark of Apple Inc. used under license to the Khronos Group Inc.

CUDA Available on All Modern NVIDIA GPUs

Tesla®: High-Performance Computing
Quadro®: Design & Creation
GeForce®: Entertainment


Tesla GPU Computing Products

                        Tesla S1070       Tesla C1060        Tesla Personal Supercomputer
                        1U System         Computing Board    (4 Tesla C1060s)
GPUs                    4 Tesla GPUs      1 Tesla GPU        4 Tesla GPUs
Single Precision Perf   4.14 Teraflops    933 Gigaflops      3.7 Teraflops
Double Precision Perf   346 Gigaflops     78 Gigaflops       312 Gigaflops
Memory                  4 GB / GPU        4 GB               4 GB / GPU

CUDA Software

C Runtime: The C Runtime for CUDA provides support for executing code on the GPU and allows native bindings for languages such as Fortran, Java, and Python

Libraries: Advanced libraries that include BLAS, FFT, and other functions optimized for the CUDA Architecture

Tools: NVIDIA C Compiler (nvcc), CUDA Debugger (cuda-gdb), CUDA Visual Profiler (cudaprof), and other helpful tools

Documentation: Includes the CUDA Programming Guide, API specifications, and other helpful documentation

Samples: Code samples and documentation that demonstrate best practices for a wide variety of GPU Computing algorithms and applications
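As an illustration of what executing code on the GPU through the C Runtime can look like, here is a minimal sketch of a vector addition. The kernel name vecAdd and the helper vecAddMismatches are hypothetical names chosen for this example; only the cudaMalloc, cudaMemcpy, cudaGetLastError and cudaFree calls belong to the CUDA runtime.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each thread adds one pair of elements.
__global__ void vecAdd(const float* a, const float* b, float* c, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        c[idx] = a[idx] + b[idx];
}

// Hypothetical helper (not part of the CUDA toolkit): adds two n-element
// vectors on the GPU and returns the number of elements that differ from a
// CPU reference. Returns 0 on success and -1 on a CUDA error.
int vecAddMismatches(int n)
{
    size_t bytes = (size_t)n * sizeof(float);
    float* hA = (float*)malloc(bytes);
    float* hB = (float*)malloc(bytes);
    float* hC = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { hA[i] = (float)i; hB[i] = 2.0f * i; }

    // Allocate device memory and copy the inputs to the GPU.
    float *dA = NULL, *dB = NULL, *dC = NULL;
    cudaError_t err = cudaMalloc((void**)&dA, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void**)&dB, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void**)&dC, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        int threads = 256;                          // threads per block
        int blocks = (n + threads - 1) / threads;   // enough blocks to cover n
        vecAdd<<<blocks, threads>>>(dA, dB, dC, n);
        err = cudaGetLastError();
    }
    // The blocking copy back also waits for the kernel to finish.
    if (err == cudaSuccess) err = cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    int bad = (err == cudaSuccess) ? 0 : -1;
    if (err == cudaSuccess)
        for (int i = 0; i < n; ++i)
            if (hC[i] != hA[i] + hB[i]) ++bad;

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return bad;
}

int main()
{
    int bad = vecAddMismatches(1 << 16);
    printf("mismatches: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

Such a file would typically be compiled with nvcc, the compiler listed under Tools.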

CUDA SDK Roadmap

Timeline figure: CUDA releases 2.0, 2.1, 2.2, 2.3 and 3.0 plotted against quarters Q1 to Q4 of 2008 and 2009.

Example: LIBOR Monte Carlo

Original LIBOR "C" Code: CPU execution time = 26.9 sec (Intel Xeon Quad; double precision)
CUDA LIBOR "C" Code: GPU execution time = 0.2 sec (Tesla C870; single precision)

134x faster

Source: Prof. Mike Giles, Oxford Univ

CUDA Parallel Computing Hardware

Independent multiprocessors
• 8 ALUs (32-wide SIMD)
• Special function unit (plus dp)
• Shared memory (16kB)

Hardware thread management
• 1000s of concurrent threads
• No switching overhead

Global device memory
• Random access
• Atomics

Block diagram labels: Thread Scheduler, Core, Special Func Unit, Shared Memory, Crossbar, Memory, Atomic.
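To make the random access and atomics of global device memory concrete, the following is a minimal sketch of a byte histogram in which many threads update shared counters with atomicAdd. The kernel name histogram, the helper histogramMismatches, the launch configuration of 64 blocks of 256 threads and the input pattern are assumptions made for this example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define NUM_BINS 256

// Every thread walks the input with a grid-wide stride and increments the
// bin of each byte it sees. atomicAdd keeps concurrent updates consistent.
__global__ void histogram(const unsigned char* data, int n, unsigned int* bins)
{
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        atomicAdd(&bins[data[i]], 1u);
}

// Hypothetical helper: builds the histogram on the GPU and returns the number
// of bins that differ from a CPU reference (0 = success, -1 = CUDA error).
int histogramMismatches(int n)
{
    unsigned char* hData = (unsigned char*)malloc(n);
    unsigned int hBins[NUM_BINS];
    unsigned int ref[NUM_BINS] = {0};
    for (int i = 0; i < n; ++i) {
        hData[i] = (unsigned char)((i * 7) % NUM_BINS);
        ++ref[hData[i]];
    }

    unsigned char* dData = NULL;
    unsigned int* dBins = NULL;
    cudaError_t err = cudaMalloc((void**)&dData, n);
    if (err == cudaSuccess) err = cudaMalloc((void**)&dBins, sizeof(hBins));
    if (err == cudaSuccess) err = cudaMemcpy(dData, hData, n, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemset(dBins, 0, sizeof(hBins));
    if (err == cudaSuccess) {
        histogram<<<64, 256>>>(dData, n, dBins);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(hBins, dBins, sizeof(hBins), cudaMemcpyDeviceToHost);

    int bad = (err == cudaSuccess) ? 0 : -1;
    if (err == cudaSuccess)
        for (int b = 0; b < NUM_BINS; ++b)
            if (hBins[b] != ref[b]) ++bad;

    cudaFree(dData); cudaFree(dBins);
    free(hData);
    return bad;
}

int main()
{
    int bad = histogramMismatches(100000);
    printf("mismatching bins: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```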

CUDA Programming Model

Parallel code (kernel) is written for a thread
• Each thread instance is free to execute a different code path

Threads are grouped into thread blocks
• Threads of a block synchronize their execution and communicate via shared memory
• Several concurrent thread blocks execute on one multiprocessor, but don't migrate

Kernel is run on a grid of thread blocks
• Built-in thread and block ID variables
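The programming model described above (a kernel written for one thread, threads grouped into blocks that cooperate through shared memory, and a grid of blocks addressed by built-in IDs) can be sketched with a block-wise sum. The names blockSum, gpuSumOfPattern and cpuSumOfPattern, the block size of 256 and the input pattern i % 7 are assumptions made for this example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define THREADS 256  // threads per block; must be a power of two here

// Each block sums THREADS input elements. Threads of a block exchange data
// through shared memory and use __syncthreads() between reduction steps.
__global__ void blockSum(const int* in, int* out, int n)
{
    __shared__ int s[THREADS];
    int tid = threadIdx.x;                        // built-in thread ID
    int idx = blockIdx.x * blockDim.x + tid;      // built-in block ID
    s[tid] = (idx < n) ? in[idx] : 0;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = s[0];                   // one partial sum per block
}

// CPU reference for the assumed input pattern in[i] = i % 7.
long long cpuSumOfPattern(int n)
{
    long long total = 0;
    for (int i = 0; i < n; ++i) total += i % 7;
    return total;
}

// Hypothetical helper: sums the same pattern on the GPU, returns -1 on error.
long long gpuSumOfPattern(int n)
{
    int blocks = (n + THREADS - 1) / THREADS;
    int* hIn = (int*)malloc(n * sizeof(int));
    int* hOut = (int*)malloc(blocks * sizeof(int));
    for (int i = 0; i < n; ++i) hIn[i] = i % 7;

    int *dIn = NULL, *dOut = NULL;
    cudaError_t err = cudaMalloc((void**)&dIn, n * sizeof(int));
    if (err == cudaSuccess) err = cudaMalloc((void**)&dOut, blocks * sizeof(int));
    if (err == cudaSuccess)
        err = cudaMemcpy(dIn, hIn, n * sizeof(int), cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        blockSum<<<blocks, THREADS>>>(dIn, dOut, n);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(hOut, dOut, blocks * sizeof(int), cudaMemcpyDeviceToHost);

    long long total = (err == cudaSuccess) ? 0 : -1;
    if (err == cudaSuccess)
        for (int b = 0; b < blocks; ++b) total += hOut[b];  // finish on the CPU

    cudaFree(dIn); cudaFree(dOut);
    free(hIn); free(hOut);
    return total;
}

int main()
{
    printf("gpu=%lld cpu=%lld\n", gpuSumOfPattern(1000), cpuSumOfPattern(1000));
    return 0;
}
```

The final accumulation of per-block results is done on the host to keep the sketch short; blocks do not synchronize with each other inside a kernel.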

Example: Dense Matrix Multiplication

//! Matrix multiplication on the device: C = A * B
//! gA, gB, gC are pointers to device memory
//! wA, wB are matrix widths
__global__ void matrixMul(float* gC, float* gA, float* gB, int wA, int wB)
{
    int i = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int j = blockIdx.x * BLOCK_SIZE + threadIdx.x;

    __shared__ float sA[BLOCK_SIZE*BLOCK_SIZE];
    __shared__ float sB[BLOCK_SIZE*BLOCK_SIZE];

    // row and column start pointers
    float* rA = sA + threadIdx.y * BLOCK_SIZE;
    float* cB = sB + threadIdx.x;

    float c = 0, *bIt = gB + wB * threadIdx.y + j;
    for (float *aIt = gA + wA * i + threadIdx.x, *aEnd = aIt + wA;
         aEnd > aIt; aIt += BLOCK_SIZE, bIt += BLOCK_SIZE * wB)
    {
        // fetch block into shared memory, 1 element per thread
        rA[threadIdx.x] = *aIt;
        cB[threadIdx.y * BLOCK_SIZE] = *bIt;

        __syncthreads(); // synchronize threads

        // compute block multiply, 1 element per thread
        for (int k = 0; k < BLOCK_SIZE; ++k)
            c += rA[k] * cB[BLOCK_SIZE * k];

        __syncthreads(); // synchronize threads
    }
    gC[wB * i + j] = c; // write back to device memory
}

...
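The slide shows only the device code. A host-side driver might look like the following sketch, which repeats the kernel so that the example is self-contained. BLOCK_SIZE = 16, the helper matrixMulMismatches, the test data and the requirement that all dimensions be multiples of BLOCK_SIZE are assumptions of this example, not details given in the slides.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <cuda_runtime.h>

#define BLOCK_SIZE 16  // assumed tile width; the slides do not state its value

// Kernel from the slide, repeated here so the example compiles on its own.
__global__ void matrixMul(float* gC, float* gA, float* gB, int wA, int wB)
{
    int i = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int j = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    __shared__ float sA[BLOCK_SIZE*BLOCK_SIZE];
    __shared__ float sB[BLOCK_SIZE*BLOCK_SIZE];
    float* rA = sA + threadIdx.y * BLOCK_SIZE;
    float* cB = sB + threadIdx.x;
    float c = 0, *bIt = gB + wB * threadIdx.y + j;
    for (float *aIt = gA + wA * i + threadIdx.x, *aEnd = aIt + wA;
         aEnd > aIt; aIt += BLOCK_SIZE, bIt += BLOCK_SIZE * wB)
    {
        rA[threadIdx.x] = *aIt;
        cB[threadIdx.y * BLOCK_SIZE] = *bIt;
        __syncthreads();
        for (int k = 0; k < BLOCK_SIZE; ++k)
            c += rA[k] * cB[BLOCK_SIZE * k];
        __syncthreads();
    }
    gC[wB * i + j] = c;
}

// Hypothetical helper: multiplies an hA x wA matrix by a wA x wB matrix on
// the GPU and returns the number of entries that differ from a CPU reference.
// Returns -1 on a CUDA error or if a dimension is not a multiple of BLOCK_SIZE.
int matrixMulMismatches(int hA, int wA, int wB)
{
    if (hA % BLOCK_SIZE || wA % BLOCK_SIZE || wB % BLOCK_SIZE) return -1;
    size_t nA = (size_t)hA * wA, nB = (size_t)wA * wB, nC = (size_t)hA * wB;
    float* A = (float*)malloc(nA * sizeof(float));
    float* B = (float*)malloc(nB * sizeof(float));
    float* C = (float*)malloc(nC * sizeof(float));
    for (size_t k = 0; k < nA; ++k) A[k] = (float)(k % 5);
    for (size_t k = 0; k < nB; ++k) B[k] = (float)(k % 3);

    float *dA = NULL, *dB = NULL, *dC = NULL;
    cudaError_t err = cudaMalloc((void**)&dA, nA * sizeof(float));
    if (err == cudaSuccess) err = cudaMalloc((void**)&dB, nB * sizeof(float));
    if (err == cudaSuccess) err = cudaMalloc((void**)&dC, nC * sizeof(float));
    if (err == cudaSuccess)
        err = cudaMemcpy(dA, A, nA * sizeof(float), cudaMemcpyHostToDevice);
    if (err == cudaSuccess)
        err = cudaMemcpy(dB, B, nB * sizeof(float), cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        dim3 block(BLOCK_SIZE, BLOCK_SIZE);             // one thread per C entry
        dim3 grid(wB / BLOCK_SIZE, hA / BLOCK_SIZE);    // one block per C tile
        matrixMul<<<grid, block>>>(dC, dA, dB, wA, wB);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(C, dC, nC * sizeof(float), cudaMemcpyDeviceToHost);

    int bad = (err == cudaSuccess) ? 0 : -1;
    if (err == cudaSuccess)
        for (int r = 0; r < hA; ++r)
            for (int col = 0; col < wB; ++col) {
                float ref = 0;
                for (int k = 0; k < wA; ++k) ref += A[r * wA + k] * B[k * wB + col];
                if (fabsf(C[r * wB + col] - ref) > 1e-3f) ++bad;
            }

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(A); free(B); free(C);
    return bad;
}

int main()
{
    int bad = matrixMulMismatches(32, 48, 64);
    printf("mismatches: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

The grid is sized so that each thread block computes one BLOCK_SIZE x BLOCK_SIZE tile of C, which matches how the kernel derives i and j from the block and thread IDs.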


CUDA Developer Resources

SDK
• Code Samples
• Documentation
• Emulator, HW Debugger
• Visual Profiler
• Visual Studio integration

Libraries
• BLAS, FFT
• CUDPP, THRUST

Online Forums

CUDA Applications

Profiler output shown on the slide:
Read throughput = 43.42 GB/s, Write throughput = 0.68 GB/s
Kernel details: Grid size: 32 x 32, Block size: 16 x 16 x 1
Register Ratio = 0.75 ( 12288 / 16384 ) [15 registers per thread]
Shared Mem Ratio = 0.84375 ( 13824 / 16384 ) [4132 bytes per Block]
Active Blocks per SM = 3 : 8
Active threads per SM = 768 : 1024
Occupancy = 0.75 ( 24 / 32 )
Occupancy limiting factor = Shared-memory
Warning: Grid Size (1024) is not a multiple of available SMs (27).

CUDA Momentum in Every HPC Field

Oil & Gas, Finance, Medical, Life sciences, Manufacturing

CUDA is Accelerating Time to Discovery

Figure: time to result, CPU only vs. with GPU, for examples from UIUC, Evolved Machines, Nokia/Motorola and Techniscan. CPU only: 4.6 days, 2.7 days, 8 hours, 3 hours. With GPU: 27 minutes, 30 minutes, 13 minutes, 16 minutes.

NVIDIA in Oil & Gas Workflow

Seismic Interpretation

Quadro Value
• Large Scale Visualization
• Transparent Scalability
• Virtualization with Full Acceleration
• Secure Collaboration

Quadro Additives
• SLI Mosaic Mode
• SLI MultiOS
• NVScale Multi-GPU
• 3D Immersive support
• QuadroPlex

Public References
Schlumberger, Halliburton, Paradigm, Seismic MicroTechnologies, Global Exploration Companies

Reservoir Simulation

Tesla/Quadro Value
• Reduce Cycle Time
• More Iterations
• Improved Scalability
• Streamlined Operations
• Better Oil/Gas Recovery

Tesla/CUDA Additives
• Scalable Iterative Solvers
• Enhanced Pre-conditioning
• Double Precision Support
• Sparse Matrix Vector Multiply Support

Public References
ConocoPhillips, Polyhedron, French Institute for Petroleum, Elegant Mathematics

Seismic Processing

Tesla/CUDA Value
• Improve Throughput
• Increase Revenue
• Reduce Operating Costs
• Enhance Subsurface Image
• Optimize Acquisition Parms

Tesla/CUDA Additives
• Kirchhoff Migration
• Wave Equation Migration
• Reverse Time Migration
• Spectral Decomposition

Public References
Hess, Chevron, TOTAL, CGGVeritas, Petrobras, Seismic City, Acceleware, OpenGeoSolutions

Well Planning, Drilling

Tesla/Quadro Value
• Improve Simulation
• Add Gravity Calculations
• Reduce Non-Productive Time, Increase Revenue
• Reduce Operating Costs

Tesla/CUDA Additives
• Enhanced Computation Fluid Dynamic Simulation
• Dense Matrix Acceleration
• Multi-GPU Scalability

Public References
Ansys, Acceleware, Accelereyes (MATLAB), UCLA Institute of Geophysics, Rice/Brown Collaboration

Computing Attributes on Seismic Volume

Mercury Computer Systems using CUDA and Quadro
• Toolkit used by Oil/Gas ISVs
• CUDA enhancements available transparently with library upgrade
• Compare single CPU to multiple GPUs
• Showing attribute computation on an area of interest

1 CPU: 50 MB/s
1 GPU: 480 MB/s
2 GPU: 650 MB/s
3 GPU: 750 MB/s

GPU computation can be performed on any other trace-based attribute as well (such as phase computation)

Medical Equipment

GE Healthcare: CT
• 40% increase in CT resolution
• 2x increase in frame rate

Techniscan: Ultrasound
• High resolution ultrasound
• 2x increase in acquisition

Digisens: Tomography
• Tomography reconstruction

Several others on X-Ray, Flow Cytometry, MRI, etc

Source: Stone et al, UIUC
Source: Batenburg, Sijbers, et al

Computational Finance

Financial Computing Software vendors
• SciComp: Derivatives pricing modeling
• Hanweck: Options pricing & risk analysis
• Aqumin: 3D visualization of market data
• Exegy: High-volume Tickers & Risk Analysis
• QuantCatalyst: Pricing & Hedging Engine
• Oneye: Algorithmic Trading
• Arbitragis Trading: Trinomial Options Pricing

Ongoing work
• LIBOR Monte Carlo market model
• Callable Swaps and Continuous Time Finance

Figure: Derivative Pricing using SciFinance, time (secs). Intel Xeon (2.6 GHz): 31.1 secs; 1 Tesla C1060: 0.4 secs; 2 Tesla C1060s: 0.25 secs. Source: SciComp

Figure: Random Number Generators for Monte Carlo Simulations, MSamples per sec, Intel Xeon Quad-Core (3.0 GHz) vs. Tesla C1060. Series labels: Mersenne Twister DR + Box-Mueller (MKL), LRAND48. Bar values: 164, 491, 2116, 5132. Source: CUDA SDK

Derivative Pricing using SciFinance®

Typical complex derivative: Basket Equity-Linked Structured Note

(*Basket Equity-Linked Structured Note - Heston SV model*)
MonteCarlo; Sobol; CUDA;
SDE[delta[S] = (r-q) S delta[t] + Sqrt[v] S dW1;
    delta[v] = kappa(theta-v) delta[t] + sigma Sqrt[v] dW2];
StdWiener[{dW1,dW2}];
Discretize[QuadraticExponential];
Initial[S=S0; v=v0; HasRedeemed=0; AccruedBonus=0];
Payoff[if[HasRedeemed=True, EarlyPayout,
    if[Sum[KI]>=Trigger, Min[S/SRef], 1 + MaturityCoupon]]];

Std. Dev. of PV    Serial (sec)    1 GPU (sec)     2 GPU (sec)
0.02%              31.05           0.41 (76X)      0.24 (125X)

All timings on Intel Quad-Core 2.6GHz + NVIDIA Tesla C1060

BNP Paribas Equity Pricing


                   2 Tesla S1070s    500 CPU Cores
16x Less Space
13x Lower Power    2.8 kWatts        37.5 kWatts
10x Lower Cost     $24K              $250K

Trend Towards GPU Computing in CAE

“CUDA allows us to leapfrog in a very compute intensive part of ANSYS.”

Gene Poole, Principal SW Developer, ANSYS

Source: Super Computing 2008, Ansys GPU acceleration of LDLT factorization

BCSLIB-GPU

BCSLIB-EXT is the computational workhorse in many FEA products
• Displacements given loads: Ax=b
• Modes and mode shapes: Kx=λMx

CUDA accelerated solvers (4x speedup)
• Direct complete matrix factorization (sparse symmetric indefinite multifrontal factorization)
• Lanczos

Seamless integration
• Same numerical stability, accuracy

Molecular Dynamics Programs using CUDA

AMBER, OpenMM, GROMACS, NAMD, HOOMD, VMD, GAMESS

CODE       CUDA version   Release Date   Overview
NAMD       v2.7 Beta      5/09           Download source now from CVS repo, beta binary builds by mid Jan
VMD        v1.8.7 Beta    4/09           Available for download
CHARMM     v c36a2        2/09           CUDA version using MD-GRAPE-III
HOOMD      v0.7.1         9/08           v0.8.0 (12/08) adds multiple GPU support
GROMACS    tbd            3/09           CUDA client based on OpenMM
AMBER      v10.x          6/09           OpenMM CUDA implementation from NVIDIA
Autodock   v0.9           6/08           DockStar beta from Silicon Informatics in use at NCI
HMMER      v0.9           12/08          GPU-HMMER available now
GAMESS     tbd            7/09           Funded project at ISU for CUDA port

NAMD: Nanoscale Molecular Dynamics

Parallel Molecular Dynamics code

Runs on laptops and big clusters

NAMD v2.7 (beta) with CUDA
• Direct non-bonded interactions
• Multi-GPU support

~10x speedup

Source: “Adapting a message-driven parallel application to GPU accelerated clusters” – James Phillips, John Stone, Klaus Schulten

VMD: Visual Molecular Dynamics

VMD 1.8.7 with CUDA
• New binaries include new CUDA kernels
• Massive speedup for displaying electron orbitals
• Fast algorithm for computing electrostatic fields

Electrostatics, ion placement:
• 120x faster Direct Coulomb summation
• 7x faster Multilevel Coulomb summation

CUDA Libraries

BLAS: CPU vs GPU (10-series)

Figure: Single Precision BLAS: SGEMM, GFLOPS vs. matrix size. Series: CUDA, ATLAS 1 Thread, ATLAS 4 Threads.

Figure: Double Precision BLAS: DGEMM, GFLOPS vs. matrix size. Series: CUBLAS, ATLAS Parallel, ATLAS Single.

CUBLAS: CUDA 2.0, Tesla C1060 (10-series GPU)
ATLAS 3.81 on Dual 2.8GHz Opteron Dual-Core

CULA Tools

GPU accelerated LAPACK
• LU, QR, SVD
• Single and double precision
• Real and complex

40x to 200x speedup over Netlib LAPACK
3x to 10x speedup over Intel Math Kernel Library

Free single precision version, beta available now
Commercial release Sept. 30

FFT Performance: CPU vs GPU (8-Series)

1D Fast Fourier Transform on CUDA

Figure: GFLOPS vs. transform size (power of 2). Series: CUFFT 2.x, CUFFT 1.1, Intel MKL 10.0.

NVIDIA Tesla C870 GPU (8-series GPU)
Quad-Core Intel Xeon CPU 5400 Series 3.0GHz, in-place, complex, single precision

• Intel FFT numbers calculated by repeating same FFT plan
• Real FFT performance is ~10 GFlops

Source for Intel data : http://www.intel.com/cd/software/products/asmo-na/eng/266852.htm

CUDPP and Thrust

Open Source available on Google Code

CUDPP: data-parallel algorithm primitives
• Prefix-sum (scan), reduction
• Sort (radix and merge)
• Pseudorandom number generator
• Sparse matrix vector multiply

Thrust: parallel algorithms with C++ template interface
• Similar to STL: containers, iterators, algorithms
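To show what the STL-like interface of Thrust can look like in practice, here is a minimal sketch that sorts, reduces and scans a vector on the GPU. The helper thrustDemoOk and the test data are assumptions made for this example; the containers and algorithms (host_vector, device_vector, sort, reduce, inclusive_scan) are Thrust's own.

```cuda
#include <cstdio>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>

// Hypothetical helper: runs a few Thrust algorithms on the values n, n-1, ..., 1
// and checks the results against closed-form expectations. Requires n > 0.
bool thrustDemoOk(int n)
{
    thrust::host_vector<int> h(n);
    for (int i = 0; i < n; ++i) h[i] = n - i;        // descending input

    thrust::device_vector<int> d = h;                // host-to-device copy
    thrust::sort(d.begin(), d.end());                // ascending sort on the GPU

    int total = thrust::reduce(d.begin(), d.end(), 0);   // sum of all elements

    thrust::device_vector<int> prefix(n);
    thrust::inclusive_scan(d.begin(), d.end(), prefix.begin());  // prefix sums

    thrust::host_vector<int> sorted = d;             // device-to-host copy
    int first = sorted[0];
    int last = sorted[n - 1];
    int scanLast = prefix[n - 1];                    // last prefix sum = total

    return first == 1 && last == n &&
           total == n * (n + 1) / 2 && scanLast == total;
}

int main()
{
    printf("%s\n", thrustDemoOk(1000) ? "ok" : "failed");
    return 0;
}
```

Containers manage device memory and transfers, so no explicit allocation or copy calls appear in the sketch.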

Sparse Matrix-Vector Multiplication (SpMV) on CUDA

CPU Results from “Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms”, Williams et al, Supercomputing 2007

Jacket MATLAB plugin

GPU accelerated dense matrix operations
15-day trial version available from http://www.accelereyes.com

143x Speedup

86x Speedup

44x Speedup


Momentum of CUDA GPU computing

• 200+ Apps on CUDA Zone
• 115+ Universities Teaching CUDA
• 900+ research papers
• 30+ CUDA GPU clusters
• 140 M CUDA enabled GPUs
• 60,000+ active developers
• 150K CUDA compiler downloads

Releases: CUDA 1.0, CUDA 1.1, CUDA 2.0

Universities shown: Duke, Erlangen, ETH Zurich, Georgia Tech, Grove City College, Harvard, IISc Bangalore, IIIT Hyderabad, IIT, Illinois, INRIA, Iowa, ITESM, Johns Hopkins, Kent State, Kyoto, Lund, Maryland, McGill, MIT, North Carolina, Northeastern, Oregon State, Pennsylvania, Polimi, Purdue, Santa Clara, Stanford, Stuttgart, Suny, Tokyo, TU-Vienna, USC, Utah, Virginia, Washington, Waterloo, Western Australia, Williams College, Wisconsin, Yonsei

GPU Technology Conference
Sept 30 – Oct 2, 2009 – The Fairmont San Jose, California

Keynotes:
• Jen-Hsun Huang, CEO, NVIDIA
• Hanspeter Pfister, Director Visual Comp, Harvard University
• Richard Kerris, CTO, Lucasfilm Ltd

Summit Session Topics:
• Hot Trends in Visual Computing (Augmented Reality, Visual Analytics, Interactive Ray Tracing)
• The GPU Computing Revolution
• Breakthroughs in Energy, Medical Science, Supercomputing and Research
• CUDA, OpenCL, Direct Compute

Emerging Companies Summit
- For investors, venture capitalists and entrepreneurs
- Recognized as a premier private company showcase

GPU Developers Summit
- For developers, programmers and engineers
- In-depth look at tools and techniques to impact mission-critical work NOW

NVIDIA Research Summit
- For researchers and academics
- Showcase findings and learn about ways to reduce time-to-discovery

Additional Resources

Extensive self help material
http://www.nvidia.com/object/cuda_education.html
• Pod Casts of lectures and tutorials and course materials
• CUDA Documentation and Research Papers

Regular Webinars given by NVIDIA Engineering
http://www.nvidia.com/object/cuda_education.html

University Courses
http://www.nvidia.com/object/cuda_university_courses.html

Third Party Consultants / Partners

Active Technical Support Community:
http://forums.nvidia.com/index.php?showforum=62

Registered CUDA Developer Program:
http://nvdeveloper.nvidia.com/content/CDUDeveloperApplication/frmDeveloperRegistration.asp
