CUDA: Parallel Compute Architecture
Christian Sigg
GPU Outpacing CPU in Raw Processing
GPU: NVIDIA GTX 295, 480 cores, 1,788 GFLOPS
CPU: Intel Core i7 965, 4 cores, 102 GFLOPS
NVIDIA CUDA 38th SPEEDUP Workshop on High-Performance Computing
The Brick Wall
Power Wall: growing demand for electrical power
Memory Wall: bandwidth improves sub-linearly with compute FLOPS
ILP Wall: diminishing returns on more ILP area
Power Wall + Memory Wall + ILP Wall = Brick Wall
David Patterson, UC Berkeley
GPU initially a 3D Accelerator
© Id Software
CUDA: Parallel Compute Architecture
Massive parallelism
  - 100s of processors
Memory latency tolerant
  - 1000s of parallel threads
  - If one thread stalls, switch to another
Teraflops of compute power
Very energy efficient
GPU Computing Today
Over 140M installed CUDA-Architecture GPUs
Windows, Linux and MacOS platforms supported
GPU Computing spans consumer applications to HPC
Over 60,000 GPU Computing developers
200+ universities teaching the CUDA Architecture and GPU Computing

GPU Computing Applications (on NVIDIA GPUs with the CUDA Parallel Computing Architecture)
C with CUDA Extensions
  - Over 60,000 developers
  - Running in production since July 2007
  - SDK + libraries + Visual Profiler and debugger
OpenCL
  - 1st GPU demo
  - Shipped 1st OpenCL beta driver
  - Strategic developers using NV SW today
DirectCompute
  - 1st GPU demo
  - Shipped 1st driver to Microsoft's Win7 developers
  - Supports all CUDA-Architecture GPUs
FORTRAN
  - SW supplied by: The Portland Group
  - NCSA release
Java and Python
  - Compute kernels
  - Driver API bindings
OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.
CUDA Available on All Modern NVIDIA GPUs
Tesla®: High-Performance Computing
Quadro®: Design & Creation
GeForce®: Entertainment
Tesla GPU Computing Products
                       Tesla S1070 1U System   Tesla C1060 Computing Board   Tesla Personal Supercomputer (4 Tesla C1060s)
GPUs                   4 Tesla GPUs            1 Tesla GPU                   4 Tesla GPUs
Single Precision Perf  4.14 Teraflops          933 Gigaflops                 3.7 Teraflops
Double Precision Perf  346 Gigaflops           78 Gigaflops                  312 Gigaflops
Memory                 4 GB / GPU              4 GB                          4 GB / GPU
CUDA Software
C Runtime: The C Runtime for CUDA provides support for executing code on the GPU and allows native bindings for languages such as Fortran, Java, and Python
Libraries: Advanced libraries that include BLAS, FFT, and other functions optimized for the CUDA Architecture
Tools: NVIDIA C Compiler (nvcc), CUDA Debugger (cuda-gdb), CUDA Visual Profiler (cudaprof), and other helpful tools
Documentation: Includes the CUDA Programming Guide, API specifications, and other helpful documentation
Samples: Code samples and documentation that demonstrate best practices for a wide variety of GPU Computing algorithms and applications
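As a minimal sketch of what "executing code on the GPU" through the C Runtime can look like, the program below adds two vectors. The kernel name `vecAdd`, the helper `runVecAdd`, the block size of 256 and the vector length are illustrative assumptions, not taken from the slides; only standard runtime calls (`cudaMalloc`, `cudaMemcpy`, `cudaFree`) are used.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: one thread adds one pair of elements.
__global__ void vecAdd(const float* a, const float* b, float* c, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)                       // the last block may be partially filled
        c[idx] = a[idx] + b[idx];
}

// Hypothetical helper: returns true if the GPU result matches the CPU sum.
bool runVecAdd(int n)
{
    std::vector<float> hA(n), hB(n), hC(n);
    for (int k = 0; k < n; ++k) { hA[k] = float(k); hB[k] = 2.0f * k; }

    size_t bytes = n * sizeof(float);
    float *dA = 0, *dB = 0, *dC = 0;
    bool ok = cudaMalloc(&dA, bytes) == cudaSuccess
           && cudaMalloc(&dB, bytes) == cudaSuccess
           && cudaMalloc(&dC, bytes) == cudaSuccess;
    if (ok) {
        // Copy inputs to device memory, launch, copy the result back.
        cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);
        int threads = 256;                          // assumed block size
        int blocks = (n + threads - 1) / threads;   // round up to cover n
        vecAdd<<<blocks, threads>>>(dA, dB, dC, n);
        ok = cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
        for (int k = 0; ok && k < n; ++k)
            ok = (hC[k] == hA[k] + hB[k]);          // small integers, exact in float
    }
    cudaFree(dA); cudaFree(dB); cudaFree(dC);       // cudaFree(0) is a no-op
    return ok;
}

int main()
{
    puts(runVecAdd(1000) ? "vecAdd: results match" : "vecAdd: failed");
    return 0;
}
```

Compiled with `nvcc`, this needs a CUDA-capable GPU at run time; without one the helper simply reports failure.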
CUDA SDK Roadmap
[Figure: release timeline spanning 2008 Q1 to 2009 Q4, showing CUDA 2.0, 2.1, 2.2, 2.3 and 3.0]
Example: LIBOR Monte Carlo
Original LIBOR “C” Code: CPU execution time = 26.9 sec (Intel Xeon Quad; double precision)
CUDA LIBOR “C” Code: GPU execution time = 0.2 sec (Tesla C870; single precision)
134x faster
Source: Prof. Mike Giles, Oxford Univ
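The quoted speedup can be checked from the two execution times. The ratio below uses the times as printed; the small gap from the quoted 134x is presumably rounding in the displayed values, which is an assumption.

```latex
\[
  S = \frac{t_{\text{CPU}}}{t_{\text{GPU}}} = \frac{26.9\ \text{s}}{0.2\ \text{s}} \approx 134.5
\]
```

Note that the two runs use different precisions (double on the CPU, single on the GPU), so the ratio compares the two setups as run rather than identical arithmetic.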
CUDA Parallel Computing Hardware
Independent multiprocessors
  - 8 ALUs (32-wide SIMD)
  - Special function unit (plus dp)
  - Shared memory (16kB)
Hardware thread management
  - 1000s of concurrent threads
  - No switching overhead
Global device memory
  - Random access
  - Atomics
[Figure: block diagram in which a thread scheduler feeds multiprocessors (cores, special function unit, shared memory) that connect through a crossbar to memory partitions with atomic units]
CUDA Programming Model
Parallel code (kernel) is written for a thread
  - Each thread instance is free to execute a different code path
Threads are grouped into thread blocks
  - Threads of a block synchronize their execution and communicate via shared memory
  - Several concurrent thread blocks execute on one multiprocessor, but don’t migrate
Kernel is run on a grid of thread blocks
  - Built-in thread and block ID variables
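The points of the programming model (a kernel written for one thread, blocks that cooperate through shared memory, a grid of blocks, built-in ID variables) can be sketched in a few lines. The example below is illustrative only: the kernel `reverseTiles`, the helper `runReverseTiles` and the tile size of 128 are assumptions, not part of the talk. Each block reverses its own segment of an array, which forces threads to read values staged by other threads of the same block.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define TILE 128  // threads per block (illustrative choice)

// Hypothetical kernel: each block reverses its own TILE-element segment in place.
__global__ void reverseTiles(int* data)
{
    __shared__ int tile[TILE];                           // visible to one block only
    int global = blockIdx.x * blockDim.x + threadIdx.x;  // built-in ID variables

    tile[threadIdx.x] = data[global];                    // each thread stages one element
    __syncthreads();                                     // wait until the whole tile is loaded
    data[global] = tile[blockDim.x - 1 - threadIdx.x];   // read an element staged by another thread
}

// Hypothetical helper: runs the kernel on `blocks` tiles and checks the result.
bool runReverseTiles(int blocks)
{
    int n = blocks * TILE;
    std::vector<int> h(n);
    for (int k = 0; k < n; ++k) h[k] = k;

    int* d = 0;
    size_t bytes = n * sizeof(int);
    if (cudaMalloc(&d, bytes) != cudaSuccess) return false;
    cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice);
    reverseTiles<<<blocks, TILE>>>(d);                   // grid of `blocks` thread blocks
    bool ok = cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d);

    for (int k = 0; ok && k < n; ++k) {
        int b = k / TILE, t = k % TILE;                  // block index and position in block
        ok = (h[k] == b * TILE + (TILE - 1 - t));
    }
    return ok;
}

int main()
{
    puts(runReverseTiles(8) ? "reverseTiles: ok" : "reverseTiles: failed");
    return 0;
}
```

Blocks never communicate with each other here, which is why the scheduler is free to place them on any multiprocessor in any order.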
Example: Dense Matrix Multiplication
//! Matrix multiplication on the device: C = A * B
//! gA, gB, gC are pointers to device memory
//! wA, wB are matrix widths
//! BLOCK_SIZE is a compile-time constant (tile width); the loop below
//! assumes the matrix dimensions are multiples of BLOCK_SIZE
__global__ void matrixMul(float* gC, float* gA, float* gB, int wA, int wB)
{
    int i = blockIdx.y * BLOCK_SIZE + threadIdx.y;   // row of C
    int j = blockIdx.x * BLOCK_SIZE + threadIdx.x;   // column of C
    __shared__ float sA[BLOCK_SIZE*BLOCK_SIZE];
    __shared__ float sB[BLOCK_SIZE*BLOCK_SIZE];
    // row and column start pointers
    float* rA = sA + threadIdx.y * BLOCK_SIZE;
    float* cB = sB + threadIdx.x;
    float c = 0, *bIt = gB + wB * threadIdx.y + j;
    for (float *aIt = gA + wA * i + threadIdx.x, *aEnd = aIt + wA;
         aEnd > aIt; aIt += BLOCK_SIZE, bIt += BLOCK_SIZE * wB)
    {
        // fetch block into shared memory, 1 element per thread
        rA[threadIdx.x] = *aIt;
        cB[threadIdx.y * BLOCK_SIZE] = *bIt;
        __syncthreads(); // synchronize threads
        // compute block multiply, 1 element per thread
        for (int k = 0; k < BLOCK_SIZE; ++k)
            c += rA[k] * cB[BLOCK_SIZE * k];
        __syncthreads(); // synchronize threads
    }
    gC[wB * i + j] = c; // write back to device memory
}
...
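The slide shows only the kernel. A host-side driver might look like the sketch below, which repeats the kernel so that the file compiles on its own. The helper `runMatrixMul`, the value `BLOCK_SIZE = 16`, the matrix sizes and the test data are assumptions made for illustration; the grid is sized so that each block computes one BLOCK_SIZE x BLOCK_SIZE tile of C, which requires all dimensions to be multiples of BLOCK_SIZE.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 16  // assumed tile width; the slide leaves it undefined

// Kernel as on the slide: C = A * B, row-major, one element of C per thread.
__global__ void matrixMul(float* gC, float* gA, float* gB, int wA, int wB)
{
    int i = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int j = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    __shared__ float sA[BLOCK_SIZE * BLOCK_SIZE];
    __shared__ float sB[BLOCK_SIZE * BLOCK_SIZE];
    float* rA = sA + threadIdx.y * BLOCK_SIZE;
    float* cB = sB + threadIdx.x;
    float c = 0, *bIt = gB + wB * threadIdx.y + j;
    for (float *aIt = gA + wA * i + threadIdx.x, *aEnd = aIt + wA;
         aEnd > aIt; aIt += BLOCK_SIZE, bIt += BLOCK_SIZE * wB)
    {
        rA[threadIdx.x] = *aIt;
        cB[threadIdx.y * BLOCK_SIZE] = *bIt;
        __syncthreads();
        for (int k = 0; k < BLOCK_SIZE; ++k)
            c += rA[k] * cB[BLOCK_SIZE * k];
        __syncthreads();
    }
    gC[wB * i + j] = c;
}

// Hypothetical helper: multiplies an hA x wA matrix by a wA x wB matrix on the
// GPU and compares with a CPU reference. All sizes must be multiples of BLOCK_SIZE.
bool runMatrixMul(int hA, int wA, int wB)
{
    std::vector<float> A(hA * wA), B(wA * wB), C(hA * wB);
    for (int i = 0; i < hA; ++i)
        for (int k = 0; k < wA; ++k) A[i * wA + k] = float((i + k) % 3);
    for (int k = 0; k < wA; ++k)
        for (int j = 0; j < wB; ++j) B[k * wB + j] = float((k + 2 * j) % 5);

    size_t bytesA = A.size() * sizeof(float);
    size_t bytesB = B.size() * sizeof(float);
    size_t bytesC = C.size() * sizeof(float);
    float *dA = 0, *dB = 0, *dC = 0;
    bool ok = cudaMalloc(&dA, bytesA) == cudaSuccess
           && cudaMalloc(&dB, bytesB) == cudaSuccess
           && cudaMalloc(&dC, bytesC) == cudaSuccess;
    if (ok) {
        cudaMemcpy(dA, A.data(), bytesA, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, B.data(), bytesB, cudaMemcpyHostToDevice);
        dim3 block(BLOCK_SIZE, BLOCK_SIZE);           // one thread per element of a tile
        dim3 grid(wB / BLOCK_SIZE, hA / BLOCK_SIZE);  // one block per tile of C
        matrixMul<<<grid, block>>>(dC, dA, dB, wA, wB);
        ok = cudaMemcpy(C.data(), dC, bytesC, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dA); cudaFree(dB); cudaFree(dC);

    // CPU reference; inputs are small integers, so float sums are exact.
    for (int i = 0; ok && i < hA; ++i)
        for (int j = 0; ok && j < wB; ++j) {
            float ref = 0;
            for (int k = 0; k < wA; ++k) ref += A[i * wA + k] * B[k * wB + j];
            ok = (C[i * wB + j] == ref);
        }
    return ok;
}

int main()
{
    puts(runMatrixMul(64, 48, 32) ? "matrixMul: match" : "matrixMul: mismatch");
    return 0;
}
```

The test data is kept to small integers so that the GPU and CPU results can be compared for exact equality; with general floating-point inputs a tolerance would be needed.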
CUDA Developer Resources
SDK
  - Code samples
  - Documentation
  - Emulator, HW debugger
  - Visual Profiler
  - Visual Studio integration
Libraries
  - BLAS, FFT
  - CUDPP, Thrust
Online Forums

Profiler output:
Read throughput = 43.42 GB/s, Write throughput = 0.68 GB/s
Kernel details: Grid size: 32 x 32, Block size: 16 x 16 x 1
Register Ratio = 0.75 ( 12288 / 16384 ) [15 registers per thread]
Shared Mem Ratio = 0.84375 ( 13824 / 16384 ) [4132 bytes per Block]
Active Blocks per SM = 3 : 8
Active threads per SM = 768 : 1024
Occupancy = 0.75 ( 24 / 32 )
Occupancy limiting factor = Shared-memory
Warning: Grid Size (1024) is not a multiple of available SMs (27).
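The occupancy figures in the profiler output can be reproduced by hand. The working below assumes a warp size of 32 threads and assumes that the 4132 bytes of shared memory per block are rounded up to 4608 bytes by an allocation granularity of 512 bytes; that granularity is inferred from the reported total of 13824 bytes, not stated on the slide.

```latex
\begin{align*}
\text{threads per block}     &= 16 \times 16 \times 1 = 256 \\
\text{active threads per SM} &= 3 \times 256 = 768 \\
\text{active warps per SM}   &= 768 / 32 = 24 \\
\text{occupancy}             &= 24 / 32 = 0.75 \\
\text{shared memory in use}  &= 3 \times 4608 = 13824 \text{ bytes}, \qquad 13824 / 16384 = 0.84375 \\
\text{a fourth block would need} &\quad 4 \times 4608 = 18432 \text{ bytes} > 16384 \text{ bytes}
\end{align*}
```

Under these assumptions a fourth block does not fit in shared memory, which matches the reported limiting factor. The grid-size warning follows from $32 \times 32 = 1024$ blocks not dividing evenly across 27 multiprocessors.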
CUDA Applications
CUDA Momentum in Every HPC Field
Oil & Gas, Finance, Medical, Life sciences, Manufacturing
CUDA is Accelerating Time to Discovery
[Figure: bar chart of run time, CPU only vs. with GPU, for work from (UIUC), (Evolved Machines), (Nokia, Motorola) and (Techniscan). CPU only: 4.6 days, 2.7 days, 8 hours, 3 hours. With GPU: 27 minutes, 30 minutes, 13 minutes, 16 minutes.]
NVIDIA in Oil & Gas Workflow
Seismic Processing
  Tesla/CUDA Value
  •Improve Throughput
  •Increase Revenue
  •Reduce Operating Costs
  •Enhance Subsurface Image
  •Optimize Acquisition Parms
  Tesla/CUDA Additives
  •Kirchhoff Migration
  •Wave Equation Migration
  •Reverse Time Migration
  •Spectral Decomposition
  Public References
  Hess, Chevron, TOTAL, CGGVeritas, Petrobras, Seismic City, Acceleware, OpenGeoSolutions

Seismic Interpretation
  Quadro Value
  •Large Scale Visualization
  •Transparent Scalability
  •Virtualization with Full Acceleration
  •Secure Collaboration
  Quadro Additives
  •SLI Mosaic Mode
  •SLI MultiOS
  •NVScale Multi-GPU
  •3D Immersive support
  •QuadroPlex
  Public References
  Schlumberger, Halliburton, Paradigm, Seismic MicroTechnologies, Global Exploration Companies

Reservoir Simulation
  Tesla/Quadro Value
  •Reduce Cycle Time
  •More Iterations
  •Improved Scalability
  •Streamlined Operations
  •Better Oil/Gas Recovery
  Tesla/CUDA Additives
  •Scalable Iterative Solvers
  •Enhanced Pre-conditioning
  •Double Precision Support
  •Sparse Matrix Vector Multiply Support
  Public References
  ConocoPhillips, Polyhedron, French Institute for Petroleum, Elegant Mathematics

Well Planning / Drilling
  Tesla/Quadro Value
  •Improve Simulation
  •Add Gravity Calculations
  •Reduce Non-Productive Time, Increase Revenue
  •Reduce Operating Costs
  Tesla/CUDA Additives
  •Enhanced Computation Fluid Dynamic Simulation
  •Dense Matrix Acceleration
  •Multi-GPU Scalability
  Public References
  Ansys, Acceleware, Accelereyes (MATLAB), UCLA Institute of Geophysics, Rice/Brown Collaboration
Computing Attributes on Seismic Volume
Mercury Computer Systems using CUDA and Quadro
  - Toolkit used by Oil/Gas ISVs
  - CUDA enhancements available transparently with library upgrade
Compare single CPU to multiple GPUs
  - Showing attribute computation on an area of interest
  - 1 CPU: 50 MB/s
  - 1 GPU: 480 MB/s
  - 2 GPU: 650 MB/s
  - 3 GPU: 750 MB/s
GPU computation can be performed on any other trace-based attribute as well (such as phase computation)
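The seismic attribute throughput figures (1 CPU: 50 MB/s; 1, 2 and 3 GPUs: 480, 650 and 750 MB/s) can be turned into speedups. The ratios below are simple arithmetic on the quoted numbers; the reading that something other than GPU compute limits the multi-GPU runs is an interpretation, not a claim from the slide.

```latex
\begin{align*}
\text{1 GPU vs. 1 CPU:} \quad & 480 / 50 = 9.6\times \\
\text{2 GPUs vs. 1 CPU:} \quad & 650 / 50 = 13\times \\
\text{3 GPUs vs. 1 CPU:} \quad & 750 / 50 = 15\times \\
\text{2 GPUs vs. 1 GPU:} \quad & 650 / 480 \approx 1.35\times \\
\text{3 GPUs vs. 1 GPU:} \quad & 750 / 480 \approx 1.56\times
\end{align*}
```

Scaling from one to three GPUs is clearly sub-linear, which suggests a shared bottleneck such as data transfer.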
Medical Equipment
GE Healthcare: CT
  - 40% increase in CT resolution
  - 2x increase in frame rate
Techniscan: Ultra-sound
  - High resolution ultra-sound
  - 2x increase in acquisition
Digisens: Tomography
  - Tomography reconstruction
Several others on X-Ray, Flow Cytometry, MRI, etc
Source: Stone et al, UIUC
Source: Batenburg, Sijbers, et al
Computational Finance
Financial Computing Software vendors
  - SciComp: Derivatives pricing modeling
  - Hanweck: Options pricing & risk analysis
  - Aqumin: 3D visualization of market data
  - Exegy: High-volume Tickers & Risk Analysis
  - QuantCatalyst: Pricing & Hedging Engine
  - Oneye: Algorithmic Trading
  - Arbitragis Trading: Trinomial Options Pricing
Ongoing work
  - LIBOR Monte Carlo market model
  - Callable Swaps and Continuous Time Finance
[Figure: Derivative Pricing using SciFinance, time in seconds. Intel Xeon (2.6 GHz): 31.1 secs; 1 Tesla C1060: 0.4 secs; 2 Tesla C1060s: 0.25 secs. Source: SciComp]
[Figure: Random Number Generators for Monte Carlo Simulations, in MSamples per sec, comparing Intel Xeon Quad-Core (3.0 GHz) and Tesla C1060 for Mersenne Twister, DR + Box-Muller (MKL) and LRAND48; data labels 164, 491, 2116 and 5132. Source: CUDA SDK]
Derivative Pricing using SciFinance®
Typical complex derivative: Basket Equity-Linked Structured Note

(*Basket Equity-Linked Structured Note - Heston SV model*)
MonteCarlo; Sobol; CUDA;
SDE[delta[S] = (r-q) S delta[t] + Sqrt[v] S dW1;
    delta[v] = kappa(theta-v) delta[t] + sigma Sqrt[v] dW2];
StdWiener[{dW1,dW2}];
Discretize[QuadraticExponential];
Initial[S=S0; v=v0; HasRedeemed=0; AccruedBonus=0];
Payoff[if[HasRedeemed=True, EarlyPayout,
    if[Sum[KI]>=Trigger, Min[S/SRef], 1 + MaturityCoupon]]];

Std. Dev. of PV   Serial (sec)   1 GPU (sec)    2 GPU (sec)
0.02%             31.05          0.41 (76X)     0.24 (125X)

All timings on Intel Quad-Core 2.6GHz + NVIDIA Tesla C1060
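Written in conventional notation, the two `SDE[...]` lines of the SciFinance specification correspond to the following pair of stochastic differential equations. This is a direct transcription of the specification; the symbols $S$, $v$, $r$, $q$, $\kappa$, $\theta$ and $\sigma$ mirror its identifiers, and nothing about the relationship between $W_1$ and $W_2$ is assumed beyond what the specification states.

```latex
\begin{align*}
dS &= (r - q)\, S \, dt + \sqrt{v}\, S \, dW_1 \\
dv &= \kappa\, (\theta - v)\, dt + \sigma \sqrt{v}\, dW_2
\end{align*}
```

The first equation drives the asset price with stochastic variance $v$; the second makes $v$ revert toward $\theta$ at rate $\kappa$.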
BNP Paribas Equity Pricing
2 Tesla S1070s vs. 500 CPU Cores
  - 16x Less Space
  - 13x Lower Power: 2.8 kWatts vs. 37.5 kWatts
  - 10x Lower Cost: $24K vs. $250K
Trend Towards GPU Computing in CAE
“CUDA allows us to leapfrog in a very compute intensive part of ANSYS.”
Gene Poole, Principal SW Developer, ANSYS
Source: Super Computing 2008, Ansys GPU acceleration of LDLT factorization
BCSLIB-GPU
BCSLIB-EXT is the computational workhorse in many FEA products
  - Displacements given loads: Ax=b
  - Modes and mode shapes: Kx=λMx
CUDA accelerated solvers (4x speedup)
  - Direct complete matrix factorization (sparse symmetric indefinite multifrontal factorization)
  - Lanczos
Seamless integration
  - Same numerical stability, accuracy
Molecular Dynamics Programs using CUDA
AMBER, OpenMM, GROMACS, NAMD, HOOMD, VMD, GAMESS

CODE       CUDA version   Release Date   Overview
NAMD       v2.7 Beta      5/09           Download source now from CVS repo, beta binary builds by mid Jan
VMD        v1.8.7 Beta    4/09           Available for download
CHARMM     v c36a2        2/09           CUDA version using MD-GRAPE-III
HOOMD      v0.7.1         9/08           v0.8.0 (12/08) adds multiple GPU support
GROMACS    tbd            3/09           CUDA client based on OpenMM
AMBER      v10.x          6/09           OpenMM CUDA implementation from NVIDIA
Autodock   v0.9           6/08           DockStar beta from Silicon Informatics in use at NCI
HMMER      v0.9           12/08          GPU-HMMER available now
GAMESS     tbd            7/09           Funded project at ISU for CUDA port
NAMD: Nanoscale Molecular Dynamics
Parallel Molecular Dynamics code
Runs on laptops and big clusters
NAMD v2.7 (beta) with CUDA
  - Direct non-bonded interactions
  - Multi-GPU support
~10x speedup
Source: “Adapting a message-driven parallel application to GPU accelerated clusters”, James Phillips, John Stone, Klaus Schulten
VMD: Visual Molecular Dynamics
VMD 1.8.7 with CUDA
New binaries include new CUDA kernels
  - Massive speedup for displaying electron orbitals
  - Fast algorithm for computing electrostatic fields
Electrostatics, ion placement
  - 120x faster direct Coulomb summation
  - 7x faster multilevel Coulomb summation
CUDA Libraries
BLAS: CPU vs GPU (10-series)
[Figure: two plots of GFLOPS vs. matrix size. Single Precision BLAS: SGEMM (series: CUDA, ATLAS 1 Thread, ATLAS 4 Threads). Double Precision BLAS: DGEMM (series: CUBLAS, ATLAS Parallel, ATLAS Single).]
CUBLAS: CUDA 2.0, Tesla C1060 (10-series GPU)
ATLAS 3.81 on Dual 2.8GHz Opteron Dual-Core
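For readers who want to try an SGEMM call like the one benchmarked here, the sketch below multiplies two square matrices with cuBLAS. It is written against the handle-based `cublas_v2.h` interface of later toolkits, which is an assumption on my part and differs from the interface that shipped with CUDA 2.0; the helper `runSgemm` and the test matrices are illustrative. cuBLAS stores matrices in column-major order, which does not matter for the constant matrices used here.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical helper: computes C = A * B for n x n matrices with cublasSgemm.
// A is all ones and B is all twos, so every entry of C should equal 2 * n.
bool runSgemm(int n)
{
    std::vector<float> hA(n * n, 1.0f), hB(n * n, 2.0f), hC(n * n, 0.0f);
    size_t bytes = hA.size() * sizeof(float);

    cublasHandle_t handle;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) return false;

    float *dA = 0, *dB = 0, *dC = 0;
    bool ok = cudaMalloc(&dA, bytes) == cudaSuccess
           && cudaMalloc(&dB, bytes) == cudaSuccess
           && cudaMalloc(&dC, bytes) == cudaSuccess;
    if (ok) {
        cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

        // C = alpha * A * B + beta * C, no transposes, leading dimensions n.
        const float alpha = 1.0f, beta = 0.0f;
        ok = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                         &alpha, dA, n, dB, n, &beta, dC, n) == CUBLAS_STATUS_SUCCESS;
        ok = ok && cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

        for (size_t k = 0; ok && k < hC.size(); ++k)
            ok = (hC[k] == 2.0f * n);   // sums of 2.0 are exact in float for small n
    }
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    cublasDestroy(handle);
    return ok;
}

int main()
{
    puts(runSgemm(256) ? "sgemm: ok" : "sgemm: failed");
    return 0;
}
```

The program has to be linked against the cuBLAS library, for example with `nvcc sgemm.cu -lcublas` (the file name is hypothetical).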
CULA Tools
GPU accelerated LAPACK
  - LU, QR, SVD
  - Single and double precision
  - Real and complex
40x to 200x speedup over Netlib LAPACK
3x to 10x speedup over Intel Math Kernel Library
Free single precision version, beta available now
Commercial release Sept. 30
FFT Performance: CPU vs GPU (8-Series)
[Figure: 1D Fast Fourier Transform on CUDA; GFLOPS vs. transform size (power of 2); series: CUFFT 2.x, CUFFT 1.1, Intel MKL 10.0]
NVIDIA Tesla C870 GPU (8-series GPU)
Quad-Core Intel Xeon CPU 5400 Series 3.0GHz, in-place, complex, single precision
• Intel FFT numbers calculated by repeating same FFT plan
• Real FFT performance is ~10 GFlops
Source for Intel data: http://www.intel.com/cd/software/products/asmo-na/eng/266852.htm
CUDPP and Thrust
Open Source available on Google Code
CUDPP: data-parallel algorithm primitives
  - Prefix-sum (scan), reduction
  - Sort (radix and merge)
  - Pseudorandom number generator
  - Sparse matrix vector multiply
Thrust: parallel algorithms with C++ template interface
  - Similar to STL: containers, iterators, algorithms
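The STL-like flavour of Thrust is easiest to see in code. The sketch below sorts, reduces and scans a small vector on the GPU; the helper name `runThrustDemo` and the input values are illustrative assumptions, while the containers and algorithms (`thrust::device_vector`, `thrust::sort`, `thrust::reduce`, `thrust::inclusive_scan`) are part of Thrust as distributed with current CUDA toolkits.

```cuda
#include <cstdio>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>

// Hypothetical helper: sort, reduce and prefix-sum four integers on the GPU.
bool runThrustDemo()
{
    thrust::host_vector<int> h(4);
    h[0] = 3; h[1] = 1; h[2] = 4; h[3] = 2;

    thrust::device_vector<int> d = h;                  // copy host -> device
    thrust::sort(d.begin(), d.end());                  // d becomes 1 2 3 4
    int sum = thrust::reduce(d.begin(), d.end(), 0);   // 1 + 2 + 3 + 4

    thrust::device_vector<int> prefix(d.size());
    thrust::inclusive_scan(d.begin(), d.end(), prefix.begin());  // 1 3 6 10

    thrust::host_vector<int> out = prefix;             // copy device -> host
    return sum == 10 && out[0] == 1 && out[1] == 3 && out[2] == 6 && out[3] == 10;
}

int main()
{
    puts(runThrustDemo() ? "thrust: ok" : "thrust: failed");
    return 0;
}
```

Assigning between `host_vector` and `device_vector` performs the memory transfers, so no explicit `cudaMemcpy` calls are needed.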
Sparse Matrix-Vector Multiplication (SpMV) on CUDA
CPU Results from “Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms”, Williams et al, Supercomputing 2007
Jacket MATLAB plugin
GPU accelerated dense matrix operations
15-day trial version available from http://www.accelereyes.com
143x Speedup
86x Speedup
44x Speedup
Momentum of CUDA GPU computing
  - 140 M CUDA enabled GPUs
  - 60,000+ active developers
  - 150K CUDA compiler downloads
  - 200+ Apps on CUDA Zone
  - 115+ Universities Teaching CUDA
  - 900+ research papers
  - 30+ CUDA GPU clusters
Releases shown: CUDA 1.0, CUDA 1.1, CUDA 2.0
Universities: Duke, Erlangen, ETH Zurich, Georgia Tech, Grove City College, Harvard, IISc Bangalore, IIIT Hyderabad, IIT Illinois, INRIA, Iowa, ITESM, Johns Hopkins, Kent State, Kyoto, Lund, Maryland, McGill, MIT, North Carolina, Northeastern, Oregon State, Pennsylvania, Polimi, Purdue, Santa Clara, Stanford, Stuttgart, SUNY, Tokyo, TU-Vienna, USC, Utah, Virginia, Washington, Waterloo, Western Australia, Williams College, Wisconsin, Yonsei
GPU Technology Conference
Sept 30 – Oct 2, 2009 – The Fairmont San Jose, California

Keynotes:
  - Jen-Hsun Huang, CEO, NVIDIA
  - Hanspeter Pfister, Director Visual Comp, Harvard University
  - Richard Kerris, CTO, Lucasfilm Ltd

Summit Session Topics:
  - Hot Trends in Visual Computing (Augmented Reality, Visual Analytics, Interactive Ray Tracing)
  - The GPU Computing Revolution
  - Breakthroughs in Energy, Medical Science, Supercomputing and Research
  - CUDA, OpenCL, Direct Compute

Emerging Companies Summit
  - For investors, venture capitalists and entrepreneurs
  - Recognized as a premier private company showcase

GPU Developers Summit
  - For developers, programmers and engineers
  - In-depth look at tools and techniques to impact mission-critical work NOW

NVIDIA Research Summit
  - For researchers and academics
  - Showcase findings and learn about ways to reduce time-to-discovery
Additional Resources
Extensive self help material
  - http://www.nvidia.com/object/cuda_education.html
  - Pod Casts of lectures and tutorials and course materials
  - CUDA Documentation and Research Papers
Regular Webinars given by NVIDIA Engineering
  - http://www.nvidia.com/object/cuda_education.html
University Courses
  - http://www.nvidia.com/object/cuda_university_courses.html
Third Party Consultants / Partners
Active Technical Support Community:
  - http://forums.nvidia.com/index.php?showforum=62
Registered CUDA Developer Program:
  - http://nvdeveloper.nvidia.com/content/CDUDeveloperApplication/frmDeveloperRegistration.asp