© NVIDIA Corporation 2008
HPC Computing with CUDA and Tesla Hardware
Tim Lanfear, NVIDIA
Parallel Computing’s Golden Age
• 1980s, early 1990s
• Particularly data-parallel computing
• Architectures
• Connection Machine, MasPar, Cray
• True Supercomputers: incredibly exotic, powerful, expensive
• Algorithms, languages, & programming models
• Solved a wide variety of problems
• Various parallel algorithmic models developed
• P-RAM, V-RAM, circuit, hypercube, etc.
[Images: Thinking Machines CM-1 (1984), Cray X-MP (1982), MasPar MP-1 (1990)]
Parallel Computing’s Dark Age
• But … impact of data-parallel computing limited
• Thinking Machines sold 7 CM-1s (100s of systems total)
• MasPar sold ~200 systems
• Commercial and research activity subsided
• Massively-parallel machines replaced by clusters of ever more powerful commodity microprocessors
• Beowulf, Legion, grid computing, …
Massively parallel computing lost momentum to the advance of commodity technology
Enter the GPU
• GPU = Graphics Processing Unit
• Processor in computer video cards, PlayStation 3, etc.
• Computer games caused “evolution pressure”
• GPUs are massively multithreaded many-core chips
• NVIDIA Tesla products have 240 scalar processors
• Over 1 TERAFLOPS sustained performance
• Over 30,000 concurrent threads
• Multi-GPU scales beyond
Parallelism is Scaling Rapidly
• CPUs and GPUs are parallel processors
• CPUs now have 2, 4, 8, … processors
• GPUs now have 32, 64, 128, 240, … processors
• Parallelism is increasing rapidly with Moore’s Law
• Processor count is doubling every 18 – 24 months
• Individual processor cores no longer getting faster
• Challenge: Develop parallel application software
• Scale software parallelism to use more and more processors
• Same source for parallel GPUs and CPUs
GPUs: Turning Point in Supercomputing
Desktop beats Cluster
Digital Tomography Reconstruction Time
• Tesla Personal Supercomputer (4 Tesla C1060 GPUs), $10,000: 59.9 secs
• CalcUA (256 AMD dual-core Opterons), $5 Million: 67.4 secs
Source: University of Antwerp, Belgium
NVIDIA GPU Product Families
• Tesla™: High-Performance Computing
• Quadro®: Design & Creation
• GeForce®: Entertainment
What is GPU Computing?
Computing with CPU + GPU
Heterogeneous Computing
GPUs: Many-Core High Performance Computing
• NVIDIA’s 10-series GPU has 240 cores
• Each core has a
• Floating point / integer unit
• Logic unit
• Move, compare unit
• Branch unit
• Cores managed by thread manager
• Thread manager can spawn and manage 30,000+ threads
• Zero overhead thread switching
NVIDIA 10-Series GPU: 1.4 billion transistors, 1 Teraflop of processing power, 240 processing cores
NVIDIA’s 2nd Generation CUDA Processor
Tesla 8-Series vs 10-Series
• Double the Performance: 500 Gigaflops (Tesla 8) vs 1 Teraflop (Tesla 10)
• > Double the Memory: 1.5 Gigabytes (Tesla 8) vs 4 Gigabytes (Tesla 10)
• Double the Precision: Finance, Science, Design
Tesla GPU Computing Products
                        Tesla S1070 1U System   Tesla C1060 Computing Board   Tesla Personal Supercomputer (4 Tesla C1060s)
GPUs                    4 Tesla GPUs            1 Tesla GPU                   4 Tesla GPUs
Single Precision Perf   4.14 Teraflops          933 Gigaflops                 3.7 Teraflops
Double Precision Perf   346 Gigaflops           78 Gigaflops                  312 Gigaflops
Memory                  4 GB / GPU              4 GB                          4 GB / GPU
Tesla C1060 Computing Processor
Processor          1 x Tesla T10
Number of cores    240
Core Clock         1.296 GHz
On-board memory    4.0 GB
Memory bandwidth   102 GB/sec peak
Memory I/O         512-bit, 800 MHz GDDR3
Form factor        Full ATX: 4.736” x 10.5”, dual slot wide
System I/O         PCIe x16 Gen2
Typical power      160 W
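The headline figures for the C1060 can be cross-checked with a little arithmetic. The sketch below is an illustration, not NVIDIA's published derivation. It assumes that each core can issue a multiply-add plus a multiply (3 flops) per clock in single precision, that the T10 has 30 double-precision units (one per SM, i.e. 240 cores / 8 SPs per SM) each completing one fused multiply-add (2 flops) per clock, and that GDDR3 transfers data twice per memory clock. The function names are made up for this example.

```c
#include <assert.h>

/* Peak GFLOPS = units * clock (GHz) * flops issued per unit per clock. */
double peak_gflops(int units, double clock_ghz, int flops_per_clock)
{
    assert(units > 0 && clock_ghz > 0.0 && flops_per_clock > 0);
    return units * clock_ghz * flops_per_clock;
}

/* Peak GB/s = bus width in bytes * memory clock (GHz) * transfers per clock. */
double peak_bandwidth_gbs(int bus_bits, double mem_clock_ghz,
                          int transfers_per_clock)
{
    assert(bus_bits > 0 && bus_bits % 8 == 0);
    return (bus_bits / 8) * mem_clock_ghz * transfers_per_clock;
}
```

Under those assumptions, peak_gflops(240, 1.296, 3) comes to about 933, peak_gflops(30, 1.296, 2) to about 78, and peak_bandwidth_gbs(512, 0.8, 2) to about 102, in line with the single-precision, double-precision and bandwidth figures quoted for the board.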
Tesla Personal Supercomputer
Supercomputing Performance
• Massively parallel CUDA Architecture
• 960 cores, 4 TeraFlops
• 250x the performance of a desktop
Personal
• One researcher, one supercomputer
• Plugs into standard power strip
Accessible
• Program in C for Windows, Linux
• Available now worldwide under $10,000
Tesla S1070 1U System
Processors            4 x Tesla T10
Number of cores       960
Core Clock            1.44 GHz
Performance           4 Teraflops
Total system memory   16.0 GB (4.0 GB per T10)
Memory bandwidth      408 GB/sec peak (102 GB/sec per T10)
Memory I/O            2048-bit, 800 MHz GDDR3 (512-bit per T10)
Form factor           1U (EIA 19” rack)
System I/O            2 PCIe x16 Gen2
Typical power         700 W
Connecting Tesla S1070 to Host Servers
[Diagram: a host server connects to the Tesla S1070 through PCIe Gen2 host interface cards and PCIe Gen2 cables (0.5m length)]
Tesla GPU Computing System
[Diagram: each CPU server (CPU, FSB, core logic) connects through a PCI-E Gen2 x16 adapter card and PCI-Express cables to the Tesla GPU system, where NVIDIA switches link the NVIDIA Tesla GPUs over PCI-E x16 Gen 2; the Tesla GPU system has its own power supply]
Enter CUDA
• CUDA is a scalable parallel programming model and software environment for parallel computing
• NVIDIA GPU architecture accelerates CUDA
• Hardware and software designed together for computing
• Expose the computational horsepower of NVIDIA GPUs
• Enable general-purpose GPU computing
Parallel Computing on All GPUs
Over 100 million CUDA GPUs Deployed
• Tesla™: High-Performance Computing
• Quadro®: Design & Creation
• GeForce®: Entertainment
Pervasive CUDA Parallel Computing
• CUDA brings data-parallel computing to the masses
• Over 100 M CUDA-capable GPUs deployed since Nov 2006
• Wide developer acceptance (www.nvidia.com/CUDA)
• Over 100K CUDA developer downloads (CUDA is free!)
• A GPU “developer kit” costs ~$200 (GeForce board price) for 500 GFLOPS
• Data-parallel supercomputers are everywhere!
• CUDA makes this power readily accessible
• Enables rapid innovations in data-parallel computing
Massively parallel computing has become a commodity technology!
CUDA is C for Parallel Processors
• CUDA is industry-standard C with minimal extensions
• Write a program for one thread
• Instantiate it on many parallel threads
• Familiar programming model and language
• CUDA is a scalable parallel programming model
• Program runs on any number of processors without recompiling
CUDA Uses Extensive Multithreading
• CUDA threads express fine-grained data parallelism
• Map threads to GPU threads
• Virtualize the processors
• You must rethink your algorithms to be aggressively parallel
• CUDA thread blocks express coarse-grained parallelism
• Blocks hold arrays of GPU threads, define shared memory boundaries
• Allow scaling between smaller and larger GPUs
• GPUs execute thousands of lightweight threads
• (In graphics, each thread computes one pixel)
• One CUDA thread computes one result (or several results)
• Hardware multithreading & zero-overhead scheduling
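To illustrate why a program built from independent thread blocks can scale between smaller and larger GPUs, here is a plain C emulation. It is not CUDA and not a description of how the hardware scheduler works. It assumes blocks never depend on each other and hands them round-robin to a hypothetical number of processors `nprocs`; the names `run_grid` and `block_body` and the block size of 4 are invented for this sketch.

```c
#include <assert.h>

#define BLOCK_SIZE 4  /* threads per block in this toy example */

/* Work done by one block: each emulated thread squares one element. */
static void block_body(int block_id, int n, const int *in, int *out)
{
    for (int t = 0; t < BLOCK_SIZE; ++t) {
        int i = block_id * BLOCK_SIZE + t;  /* global element index */
        if (i < n)
            out[i] = in[i] * in[i];
    }
}

/* Emulate a GPU with nprocs processors: processor p runs blocks
   p, p + nprocs, p + 2*nprocs, ...  Blocks never depend on each other,
   so the result is the same for any nprocs >= 1. */
void run_grid(int nprocs, int n, const int *in, int *out)
{
    assert(nprocs >= 1 && n >= 0);
    int nblocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    for (int p = 0; p < nprocs; ++p)
        for (int b = p; b < nblocks; b += nprocs)
            block_body(b, n, in, out);
}
```

Calling run_grid with 1 processor or with 30 produces the same output array; only the order in which blocks are visited changes.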
Data Parallel Levels
• Thread
• Computes result elements
• Thread id number
• Thread Block
• Runs on one SM, shared mem
• 1 to 512 threads per block
• Block id number
• Grid of Blocks
• Holds complete computation task
• One to many blocks per Grid
• Sequential Grids
• Compute sequential problem steps
[Diagram: a Thread; a Block of threads t0, t1, t2, … tm; a Grid of blocks Bl. 0, Bl. 1, Bl. 2, …, Bl. n]
Parallel Memory Sharing
• Thread: Local Memory
• Block: Shared Memory, local barrier
• Grid 0, Grid 1, … (sequential grids in time): Global Memory, global barrier
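These memory levels suggest a common pattern: threads cooperate through a block's shared memory behind a local barrier, and blocks exchange results through global memory across a grid boundary, which acts as the global barrier. The following is a sequential C emulation of that pattern for a sum, offered as a sketch only. The function names and the block size of 8 are assumptions, and real CUDA code would use `__shared__` storage and `__syncthreads()` rather than plain loops.

```c
#include <assert.h>

#define THREADS 8  /* threads per block (assumed for illustration) */

/* Grid 0: each block sums its slice of `in` into partial[block]. */
static void grid0_block_sums(int n, const float *in, int nblocks,
                             float *partial)
{
    for (int b = 0; b < nblocks; ++b) {
        float shared[THREADS];              /* stands in for shared memory */
        for (int t = 0; t < THREADS; ++t) { /* each thread loads one value */
            int i = b * THREADS + t;
            shared[t] = (i < n) ? in[i] : 0.0f;
        }
        /* local barrier: all loads finish before shared[] is read */
        float sum = 0.0f;
        for (int t = 0; t < THREADS; ++t)
            sum += shared[t];
        partial[b] = sum;                   /* written to global memory */
    }
}

/* Grid 1 starts only after grid 0 has finished (the global barrier),
   so every partial sum is visible in global memory. */
float two_level_sum(int n, const float *in, float *partial, int max_blocks)
{
    int nblocks = (n + THREADS - 1) / THREADS;
    assert(nblocks <= max_blocks);
    grid0_block_sums(n, in, nblocks, partial);
    float total = 0.0f;
    for (int b = 0; b < nblocks; ++b)
        total += partial[b];
    return total;
}
```

The caller supplies the `partial` array, which plays the role of global memory shared between the two grids.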
Simple “C” Description For Parallelism
Standard C Code

void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);

Parallel C Code

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}

// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
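As a way to check the index arithmetic of the parallel SAXPY without a GPU, the launch can be emulated on the host in plain C. This is only a sketch: `emulated_launch` and the `Dim` struct are hypothetical stand-ins for the `<<<nblocks, 256>>>` launch and the built-in `blockIdx`, `blockDim` and `threadIdx` variables, and the loops run sequentially rather than in parallel.

```c
#include <assert.h>

typedef struct { int x; } Dim;  /* stand-in for CUDA's built-in index types */

/* Body of saxpy_parallel, with the built-in variables passed explicitly. */
static void saxpy_thread(Dim blockIdx, Dim blockDim, Dim threadIdx,
                         int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)               /* guard: the last block may be partly empty */
        y[i] = a * x[i] + y[i];
}

/* Sequentially run every (block, thread) pair of a 1-D grid. */
void emulated_launch(int nblocks, int threads_per_block,
                     int n, float a, const float *x, float *y)
{
    assert(nblocks >= 0 && threads_per_block > 0);
    Dim blockDim = { threads_per_block };
    for (int b = 0; b < nblocks; ++b)
        for (int t = 0; t < threads_per_block; ++t) {
            Dim blockIdx = { b }, threadIdx = { t };
            saxpy_thread(blockIdx, blockDim, threadIdx, n, a, x, y);
        }
}
```

With n = 1000, (n + 255) / 256 gives 4 blocks, i.e. 1024 emulated threads, of which the last 24 fail the `i < n` test and do nothing.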
[Diagram: CUDA software stack]
• Application Software (written in C)
• CUDA Libraries: cuFFT, cuBLAS, cuDPP
• CUDA Compiler: C, Fortran
• CUDA Tools: Debugger, Profiler
• Hardware: CPU (4 cores), GPU (240 cores), PCI-E Switch, 1U
CUDA Zone: www.nvidia.com/CUDA
• CUDA Toolkit
• Compiler
• Libraries
• CUDA SDK
• Code samples
• CUDA Profiler
• Forums
• Resources for CUDA developers
CUDA Computing Sweet Spots
• Parallel Applications
• High bandwidth: Sequencing (virus scanning, genomics), sorting, database, …
• Visual computing: Graphics, image processing, tomography, machine vision, …
• High arithmetic intensity: Dense linear algebra, PDEs, n-body, finite difference, …
• Applications in finance
Wide Developer Acceptance and Success
• 146X: Interactive visualization of volumetric white matter connectivity
• 36X: Ion placement for molecular dynamics simulation
• 19X: Transcoding HD video stream to H.264
• 17X: Simulation in Matlab using .mex file CUDA function
• 100X: Astrophysics N-body simulation
• 149X: Financial simulation of LIBOR model with swaptions
• 47X: GLAME@lab: An M-script API for linear Algebra operations on GPU
• 20X: Ultrasound medical imaging for cancer diagnostics
• 24X: Highly optimized object oriented molecular dynamics
• 30X: Cmatch exact string matching to find similar proteins and gene sequences
Tesla SM Multiprocessor
• SM has 8 SP Thread Processors
• IEEE 754 32-bit floating point
• 32-bit and 64-bit integer
• 16K 32-bit registers
• SM has 2 SFU Special Function Units
• SM has DP Double Precision Unit
• IEEE 754 64-bit floating point
• Fused multiply-add
• Multithreaded Instruction Unit
• 1024 threads, hardware multithreaded
• 32 SIMT warps of 32 threads
• Independent thread execution
• Hardware thread scheduling
• 16KB Shared Memory
• Concurrent threads share data
• Low latency load/store
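The per-SM limits listed for the multiprocessor translate directly into per-thread budgets. The sketch below works through that arithmetic in C. The 30-SM count is derived from 240 cores divided by 8 SPs per SM, and the constant and helper names are assumptions made for this example.

```c
#include <assert.h>

enum {
    SM_REGISTERS   = 16 * 1024,  /* 16K 32-bit registers per SM */
    SM_MAX_THREADS = 1024,       /* hardware threads per SM     */
    WARP_SIZE      = 32,         /* threads per SIMT warp       */
    SP_PER_SM      = 8,          /* SP thread processors per SM */
    TOTAL_CORES    = 240         /* SP cores on the whole chip  */
};

/* Warps needed to hold `threads` threads (rounded up). */
int warps_for(int threads)
{
    assert(threads >= 0);
    return (threads + WARP_SIZE - 1) / WARP_SIZE;
}

/* Registers available to each thread when `threads` are resident on one SM. */
int registers_per_thread(int threads)
{
    assert(threads > 0 && threads <= SM_MAX_THREADS);
    return SM_REGISTERS / threads;
}

/* Threads resident across the whole chip when every SM is full. */
int max_resident_threads(void)
{
    return (TOTAL_CORES / SP_PER_SM) * SM_MAX_THREADS;
}
```

A fully occupied SM (1024 threads, 32 warps) leaves 16 registers per thread, while halving the thread count doubles that budget; 30 SMs of 1024 threads give the 30,720 concurrent threads quoted for the T10.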
Thread Processor Datapath
• Executes 32-bit IEEE floating point instructions:
• FADD, FMUL, FMAD, FMIN, FMAX, FSET, F2I, I2F
• Performs 32-bit integer instructions:
• IADD, IMUL24, IMAD24, IMIN, IMAX, ISET, I2I
• SHR, SHL, AND, OR, XOR
• Fully pipelined
• Latency and area optimized
• IEEE 754 compliant FADD, FMUL
• Round to nearest even, round toward zero
• Handles special numbers, NaNs, infinities properly
• Flushes denormal operands and results to zero
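Two of the datapath properties have visible consequences for numerical code: denormal operands and results behave as zero, and the 24-bit integer multiply uses only the low 24 bits of each operand. The host-side C sketch below emulates both so they can be explored without a GPU. It is an approximation written for illustration, not NVIDIA's implementation; `flush_denormal` and `mul24` are hypothetical names, and `mul24` is modeled on the unsigned CUDA intrinsic `__umul24` (low 32 bits of the product of the low 24 bits).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Return x with denormal (subnormal) values replaced by a same-signed zero. */
float flush_denormal(float x)
{
    uint32_t bits;
    assert(sizeof bits == sizeof x);
    memcpy(&bits, &x, sizeof bits);          /* inspect the IEEE 754 encoding */
    uint32_t exponent = bits & 0x7F800000u;
    uint32_t mantissa = bits & 0x007FFFFFu;
    if (exponent == 0 && mantissa != 0) {    /* subnormal: keep only the sign */
        bits &= 0x80000000u;
        memcpy(&x, &bits, sizeof x);
    }
    return x;
}

/* Unsigned 24-bit multiply: the top 8 bits of each operand are ignored and
   the low 32 bits of the product are returned. */
uint32_t mul24(uint32_t a, uint32_t b)
{
    uint64_t product = (uint64_t)(a & 0x00FFFFFFu) * (b & 0x00FFFFFFu);
    return (uint32_t)product;
}
```

Code that relies on values below the smallest normal float (about 1.18e-38), or on integer operands wider than 24 bits, will therefore see different results than on a CPU with full denormal support and 32-bit multiplies.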
Special Function Unit (SFU)
• Executes transcendental function instructions
• RCP, RSQRT, EXP2, LOG2, SIN, COS
• 2 SFUs per SM yields ¼ instruction throughput
• Evaluates function approximations
• Quadratic interpolation with Enhanced Minimax Approximation
• Interpolates pixel attributes
• Accuracy ranges from 22.5 to 24.0 bits
• 1/x in the interval [1,2) is 24 bits, 1 ulp
• CUDA uses SFUs for initial estimates and then refines them
• Final accuracy: see CUDA programming manual, appendix B
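The idea of starting from a fast approximate result and refining it can be illustrated with Newton-Raphson iteration for the reciprocal, where each step roughly doubles the number of correct bits. This is a generic numerical sketch in plain C, not the refinement CUDA actually performs (the slides do not specify it). The crude linear starting estimate merely stands in for a hardware estimate, and the function names are invented here.

```c
#include <assert.h>

/* Crude starting estimate of 1/a for a in [1, 2): a straight line through
   the endpoints (1, 1) and (2, 0.5). Stands in for a hardware estimate. */
static float rcp_estimate(float a)
{
    return 1.5f - 0.5f * a;
}

/* One Newton-Raphson step for f(x) = 1/x - a:  x' = x * (2 - a*x). */
static float rcp_refine(float a, float x)
{
    return x * (2.0f - a * x);
}

/* Reciprocal of a in [1, 2) using `steps` refinement steps. */
float rcp_newton(float a, int steps)
{
    assert(a >= 1.0f && a < 2.0f && steps >= 0);
    float x = rcp_estimate(a);
    for (int s = 0; s < steps; ++s)
        x = rcp_refine(a, x);
    return x;
}
```

For a = 1.5 the starting estimate is 0.75, one step gives 0.65625, and a few more steps converge on 2/3 to single-precision accuracy. A better starting estimate, such as one accurate to more than 22 bits, would need only a single step.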
GPU Architecture
GeForce GTX 280 / Tesla T10
[Diagram: 240 scalar cores, on-chip memory, and texture units, together with fixed function acceleration, memory & I/O, and the communication fabric]
CUDA Computing with Tesla T10
• 240 SP processors at 1.5 GHz: 1 TFLOPS peak
• 128 threads per processor: 30,720 threads total
• Tesla PCI-e board: C1060 (1 GPU)
• 1U Server: S1070 (4 GPUs)
[Diagram: Tesla T10 block diagram. The host CPU and system memory connect through a bridge to work distribution; SMs (I-Cache, MT Issue, C-Cache, SP, DP and SFU units, Shared Memory) connect through an interconnection network to eight ROP/L2 partitions, each with its own DRAM]
CUDA Case Studies
Single Precision BLAS: CPU vs GPU (10-series)
BLAS (SGEMM) on CUDA
[Chart: GFLOPS (0 to 350) versus matrix size (256x256 to 8192x8192) for CUDA, ATLAS 1 Thread, and ATLAS 4 Threads]
CUBLAS: CUDA 2.0b2, Tesla C1060 (10-series GPU)
ATLAS 3.81 on Dual 2.8GHz Opteron Dual-Core
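For readers reproducing a chart like this, GFLOPS for a matrix multiply is conventionally computed from the operation count rather than measured directly: an (m x k) by (k x n) product takes about 2·m·n·k floating-point operations. The sketch below shows that conversion alongside a naive reference SGEMM in plain C. It is not the CUBLAS or ATLAS code behind the chart, and the function names are assumptions.

```c
#include <assert.h>

/* Naive row-major C = A * B, with A (m x k), B (k x n), C (m x n). */
void sgemm_naive(int m, int n, int k,
                 const float *A, const float *B, float *C)
{
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += A[i * k + p] * B[p * n + j];  /* 1 multiply + 1 add */
            C[i * n + j] = acc;
        }
}

/* GFLOPS achieved by a GEMM of the given shape that took `seconds`. */
double gemm_gflops(int m, int n, int k, double seconds)
{
    assert(seconds > 0.0);
    double flops = 2.0 * m * n * k;
    return flops / seconds / 1e9;
}
```

The elapsed time is left to the caller to measure, so the same conversion applies whichever library performs the multiply.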
Double Precision BLAS: CPU vs GPU (10-series)
BLAS (DGEMM) on CUDA
[Chart: GFLOPS (0 to 70) versus matrix size (256x256 to 8192x8192) for CUBLAS, ATLAS Parallel, and ATLAS Single]
CUBLAS CUDA 2.0b2 on Tesla C1060 (10-series)
ATLAS 3.81 on Intel Xeon E5440 Quad-core, 2.83 GHz
GPU + CPU DGEMM Performance
[Chart: DGEMM GFLOPs (0 to 120) versus matrix size (128 to 6080) for GPU + CPU, GPU only, and CPU only]
Xeon Quad-core 2.8 GHz, MKL 10.3
Tesla C1060 GPU (1.296 GHz)
AccelerEyes Jacket
• Who is AccelerEyes?
• AccelerEyes is a MathWorks partner
• Simple software for visual computing
• What is Jacket?
• GPU engine for MATLAB
• CUDA-powered language extension
• Why Jacket?
• Challenges in technical computing
• Low-cost speed, high-value graphics
• Increased productivity