HPC Computing with CUDA and Tesla Hardware

Tim Lanfear, NVIDIA

© NVIDIA Corporation 2008

Parallel Computing’s Golden Age

• 1980s, early 1990s
  • Particularly data-parallel computing
• Architectures
  • Connection Machine, MasPar, Cray
  • True Supercomputers: incredibly exotic, powerful, expensive
• Algorithms, languages, & programming models
  • Solved a wide variety of problems
  • Various parallel algorithmic models developed
  • P-RAM, V-RAM, circuit, hypercube, etc.

[Images: Thinking Machines CM-1 (1984), Cray X-MP (1982), MasPar MP-1 (1990)]

Parallel Computing’s Dark Age

• But … impact of data-parallel computing limited
  • Thinking Machines sold 7 CM-1s (100s of systems total)
  • MasPar sold ~200 systems
• Commercial and research activity subsided
  • Massively-parallel machines replaced by clusters of ever more powerful commodity microprocessors
  • Beowulf, Legion, grid computing, …

Massively parallel computing lost momentum to the advance of commodity technology

Enter the GPU

• GPU = Graphics Processing Unit
• Processor in computer video cards, PlayStation 3, etc.
• Computer games caused “evolution pressure”
• GPUs are massively multithreaded many-core chips
• NVIDIA Tesla products have 240 scalar processors
• Over 1 TERAFLOPS sustained performance
• Over 30,000 concurrent threads
• Multi-GPU scales beyond

Parallelism is Scaling Rapidly

• CPUs and GPUs are parallel processors
• CPUs now have 2, 4, 8, … processors
• GPUs now have 32, 64, 128, 240, … processors
• Parallelism is increasing rapidly with Moore’s Law
• Processor count is doubling every 18 – 24 months
• Individual processor cores no longer getting faster
• Challenge: Develop parallel application software
• Scale software parallelism to use more and more processors
• Same source for parallel GPUs and CPUs

GPUs: Turning Point in Supercomputing

Desktop beats Cluster

Digital Tomography Reconstruction Time
• Tesla Personal Supercomputer (4 Tesla C1060 GPUs, $10,000): 59.9 secs
• CalcUA (256 AMD dual-core Opterons, $5 Million): 67.4 secs

Source: University of Antwerp, Belgium

NVIDIA GPU Product Families

• Tesla™: High-Performance Computing
• Quadro®: Design & Creation
• GeForce®: Entertainment

What is GPU Computing?

Computing with CPU + GPU
Heterogeneous Computing

GPUs: Many-Core High Performance Computing

• NVIDIA’s 10-series GPU has 240 cores
• Each core has a
  • Floating point / integer unit
  • Logic unit
  • Move, compare unit
  • Branch unit
• Cores managed by thread manager
  • Thread manager can spawn and manage 30,000+ threads
  • Zero overhead thread switching

NVIDIA 10-Series GPU: NVIDIA’s 2nd Generation CUDA Processor
• 1.4 billion transistors
• 1 Teraflop of processing power
• 240 processing cores

Tesla 8-Series vs 10-Series

• Double the Performance: 500 Gigaflops (Tesla 8) vs 1 Teraflop (Tesla 10)
• > Double the Memory: 1.5 Gigabytes (Tesla 8) vs 4 Gigabytes (Tesla 10)
• Double the Precision: Finance, Science, Design

Tesla GPU Computing Products

                        Tesla S1070      Tesla C1060        Tesla Personal Supercomputer
                        1U System        Computing Board    (4 Tesla C1060s)
GPUs                    4 Tesla GPUs     1 Tesla GPU        4 Tesla GPUs
Single Precision Perf   4.14 Teraflops   933 Gigaflops      3.7 Teraflops
Double Precision Perf   346 Gigaflops    78 Gigaflops       312 Gigaflops
Memory                  4 GB / GPU       4 GB               4 GB / GPU

Tesla C1060 Computing Processor

Processor          1 x Tesla T10
Number of cores    240
Core Clock         1.296 GHz
On-board memory    4.0 GB
Memory bandwidth   102 GB/sec peak
Memory I/O         512-bit, 800MHz GDDR3
Form factor        Full ATX: 4.736” x 10.5”, dual slot wide
System I/O         PCIe x16 Gen2
Typical power      160 W
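The headline figures in the specification tables can be reproduced with a little arithmetic. The sketch below is an illustration, not NVIDIA's published method. It assumes that each core can retire 3 single-precision flops per clock (a multiply-add plus a multiply), that every group of 8 cores shares one double-precision unit issuing one fused multiply-add (2 flops) per clock, and that GDDR3 makes two transfers per memory clock. None of these assumptions is stated on the slide, and the function names are hypothetical.

```c
#include <stdio.h>

/* Peak single-precision GFLOPS: cores x clock (GHz) x flops per clock.
   ASSUMPTION: 3 flops per core per clock (multiply-add + multiply). */
double peak_sp_gflops(int cores, double clock_ghz)
{
    return cores * clock_ghz * 3.0;
}

/* Peak double-precision GFLOPS.
   ASSUMPTION: one DP unit per 8 cores, each issuing one fused
   multiply-add (2 flops) per clock. */
double peak_dp_gflops(int cores, double clock_ghz)
{
    return (cores / 8) * clock_ghz * 2.0;
}

/* Peak memory bandwidth in GB/s.
   ASSUMPTION: double data rate, i.e. 2 transfers per memory clock. */
double peak_bandwidth_gbs(int bus_bits, double mem_clock_mhz)
{
    return (bus_bits / 8.0) * mem_clock_mhz * 2.0 / 1000.0;
}

/* Print the derived figures for one product. */
void print_peaks(const char *name, int cores, double clock_ghz,
                 int bus_bits, double mem_clock_mhz)
{
    printf("%s: %.1f SP GFLOPS, %.1f DP GFLOPS, %.1f GB/s\n", name,
           peak_sp_gflops(cores, clock_ghz),
           peak_dp_gflops(cores, clock_ghz),
           peak_bandwidth_gbs(bus_bits, mem_clock_mhz));
}
```

Under these assumptions, 240 cores at 1.296 GHz on a 512-bit, 800 MHz interface give about 933 and 78 GFLOPS and 102 GB/s, in line with the figures quoted for the C1060. The same formulas with 960 cores at 1.44 GHz give about 4147 and 346 GFLOPS, in line with the S1070 figures.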

Tesla Personal Supercomputer

Supercomputing Performance
• Massively parallel CUDA Architecture
• 960 cores. 4 TeraFlops
• 250x the performance of a desktop

Personal
• One researcher, one supercomputer
• Plugs into standard power strip

Accessible
• Program in C for Windows, Linux
• Available now worldwide under $10,000

Tesla S1070 1U System

Processors            4 x Tesla T10
Number of cores       960
Core Clock            1.44 GHz
Performance           4 Teraflops
Total system memory   16.0 GB (4.0 GB per T10)
Memory bandwidth      408 GB/sec peak (102 GB/sec per T10)
Memory I/O            2048-bit, 800MHz GDDR3 (512-bit per T10)
Form factor           1U (EIA 19” rack)
System I/O            2 PCIe x16 Gen2
Typical power         700 W

Connecting Tesla S1070 to Host Servers

[Diagram: a host server fitted with PCIe Gen2 host interface cards, connected to the Tesla S1070 by PCIe Gen2 cables (0.5m length)]

Tesla GPU Computing System

[Diagram: CPU servers (CPU, FSB, core logic), each with a PCI-E Gen2 x16 adapter card, connected by PCI-Express cables to a Tesla GPU system. The Tesla GPU system contains a power supply and NVIDIA switches, each linking NVIDIA Tesla GPUs over PCI-E x16 Gen 2]

A scalable parallel programming model and software environment for parallel computing

Enter CUDA

• CUDA is a scalable parallel programming model and software environment for parallel computing
• NVIDIA GPU architecture accelerates CUDA
• Hardware and software designed together for computing
• Expose the computational horsepower of NVIDIA GPUs
• Enable general-purpose GPU computing

Parallel Computing on All GPUs
Over 100 million CUDA GPUs Deployed

• Tesla™: High-Performance Computing
• Quadro®: Design & Creation
• GeForce®: Entertainment

Pervasive CUDA Parallel Computing

• CUDA brings data-parallel computing to the masses
  • Over 100 M CUDA-capable GPUs deployed since Nov 2006
• Wide developer acceptance (www.nvidia.com/CUDA)
  • Over 100K CUDA developer downloads (CUDA is free!)
  • A GPU “developer kit” costs ~$200 (GeForce board price) for 500 GFLOPS
• Data-parallel supercomputers are everywhere!
  • CUDA makes this power readily accessible
  • Enables rapid innovations in data-parallel computing

Massively parallel computing has become a commodity technology!

CUDA is C for Parallel Processors

• CUDA is industry-standard C with minimal extensions
  • Write a program for one thread
  • Instantiate it on many parallel threads
  • Familiar programming model and language
• CUDA is a scalable parallel programming model
  • Program runs on any number of processors without recompiling

CUDA Uses Extensive Multithreading

• CUDA threads express fine-grained data parallelism
  • Map threads to GPU threads
  • Virtualize the processors
  • You must rethink your algorithms to be aggressively parallel
• CUDA thread blocks express coarse-grained parallelism
  • Blocks hold arrays of GPU threads, define shared memory boundaries
  • Allow scaling between smaller and larger GPUs
• GPUs execute thousands of lightweight threads
  • (In graphics, each thread computes one pixel)
  • One CUDA thread computes one result (or several results)
  • Hardware multithreading & zero-overhead scheduling

Data Parallel Levels

• Thread
  • Computes result elements
  • Thread id number
• Thread Block
  • Runs on one SM, shared mem
  • 1 to 512 threads per block
  • Block id number
• Grid of Blocks
  • Holds complete computation task
  • One to many blocks per Grid
• Sequential Grids
  • Compute sequential problem steps

[Diagram: a thread; a block of threads t0 t1 t2 … tm; a grid of blocks Bl. 0, Bl. 1, Bl. 2, …, Bl. n]

Parallel Memory Sharing

• Thread: Local Memory
• Block: Shared Memory, local barrier
• Sequential Grids in Time (Grid 0, Grid 1, …): Global Memory, global barrier
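The memory levels above suggest a common pattern. Threads in a block cooperate through shared memory with a local barrier. Results that must cross blocks go through global memory, with the boundary between two sequential grids acting as the global barrier. The sketch below emulates that pattern for a sum in plain serial C. It illustrates the idea and is not GPU code; the names BLOCK_SIZE, block_sum and grid_sum are hypothetical.

```c
#include <stdlib.h>

#define BLOCK_SIZE 256  /* threads per block (hypothetical choice, <= 512) */

/* "Grid 0": one block sums its own slice of the input.
   The shared[] array stands in for per-block shared memory. */
static float block_sum(const float *in, int n, int block)
{
    float shared[BLOCK_SIZE];

    /* Each emulated thread loads one element (0 if past the end). */
    for (int t = 0; t < BLOCK_SIZE; ++t) {
        int i = block * BLOCK_SIZE + t;
        shared[t] = (i < n) ? in[i] : 0.0f;
    }
    /* Tree reduction; on a GPU a local barrier separates each step. */
    for (int stride = BLOCK_SIZE / 2; stride > 0; stride /= 2)
        for (int t = 0; t < stride; ++t)
            shared[t] += shared[t + stride];

    return shared[0];
}

/* Runs "Grid 0" then "Grid 1" and returns the sum of in[0..n-1]. */
float grid_sum(const float *in, int n)
{
    int nblocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    /* partial[] stands in for global memory shared between the grids. */
    float *partial = malloc((size_t)nblocks * sizeof *partial);
    float total = 0.0f;

    if (partial == NULL)
        return 0.0f;

    /* Grid 0: one partial sum per block. */
    for (int b = 0; b < nblocks; ++b)
        partial[b] = block_sum(in, n, b);

    /* Global barrier: Grid 1 starts only after Grid 0 has finished. */
    for (int b = 0; b < nblocks; ++b)
        total += partial[b];

    free(partial);
    return total;
}
```

Splitting the work into two grids is what lets the second step safely read every block's partial result.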

Simple “C” Description For Parallelism

Standard C Code

    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

    // Invoke serial SAXPY kernel
    saxpy_serial(n, 2.0, x, y);

Parallel C Code

    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];
    }

    // Invoke parallel SAXPY kernel with 256 threads/block
    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
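To see why the launch uses (n + 255) / 256 blocks and why the kernel needs the if (i < n) guard, the grid can be emulated in plain serial C. This is a sketch for illustration only. The two loops stand in for blockIdx.x and threadIdx.x, and the names blocks_for and saxpy_emulated are hypothetical.

```c
#define THREADS_PER_BLOCK 256

/* Number of blocks needed to cover n elements (rounds up). */
int blocks_for(int n)
{
    return (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
}

/* Serial emulation of the parallel SAXPY launch. The two loops play the
   roles of blockIdx.x and threadIdx.x; on a GPU the iterations would run
   concurrently. Returns how many emulated threads were launched. */
int saxpy_emulated(int n, float a, const float *x, float *y)
{
    int launched = 0;
    int nblocks = blocks_for(n);

    for (int block = 0; block < nblocks; ++block) {
        for (int thread = 0; thread < THREADS_PER_BLOCK; ++thread) {
            int i = block * THREADS_PER_BLOCK + thread;
            ++launched;
            if (i < n)  /* guard: the last block may overhang the array */
                y[i] = a * x[i] + y[i];
        }
    }
    return launched;
}
```

For n = 1000 this launches 4 blocks, that is 1024 emulated threads, and the guard makes the last 24 of them do nothing.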

[Diagram: CUDA software and hardware stack]
• Application Software (written in C)
• CUDA Libraries: cuFFT, cuBLAS, cuDPP
• CUDA Compiler: C, Fortran
• CUDA Tools: Debugger, Profiler
• Hardware: CPU Hardware, PCI-E Switch, 1U; 4 cores, 240 cores

CUDA Zone: www.nvidia.com/CUDA

• CUDA Toolkit
  • Compiler
  • Libraries
• CUDA SDK
  • Code samples
• CUDA Profiler
• Forums
• Resources for CUDA developers

CUDA Computing Sweet Spots

• Parallel Applications
• High bandwidth: Sequencing (virus scanning, genomics), sorting, database, …
• Visual computing: Graphics, image processing, tomography, machine vision, …
• High arithmetic intensity: Dense linear algebra, PDEs, n-body, finite difference, …
• Applications in finance

Wide Developer Acceptance and Success

• 146X: Interactive visualization of volumetric white matter connectivity
• 36X: Ion placement for molecular dynamics simulation
• 19X: Transcoding HD video stream to H.264
• 17X: Simulation in Matlab using .mex file CUDA function
• 100X: Astrophysics N-body simulation
• 149X: Financial simulation of LIBOR model with swaptions
• 47X: GLAME@lab: An M-script API for linear Algebra operations on GPU
• 20X: Ultrasound medical imaging for cancer diagnostics
• 24X: Highly optimized object oriented molecular dynamics
• 30X: Cmatch exact string matching to find similar proteins and gene sequences

Tesla SM Multiprocessor

• SM has 8 SP Thread Processors
  • IEEE 754 32-bit floating point
  • 32-bit and 64-bit integer
  • 16K 32-bit registers
• SM has 2 SFU Special Function Units
• SM has DP Double Precision Unit
  • IEEE 754 64-bit floating point
  • Fused multiply-add
• Multithreaded Instruction Unit
  • 1024 threads, hardware multithreaded
  • 32 SIMT warps of 32 threads
  • Independent thread execution
  • Hardware thread scheduling
• 16KB Shared Memory
  • Concurrent threads share data
  • Low latency load/store
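The per-SM figures above imply some useful budgets. The sketch below works them out. It is a simplification: real hardware allocates registers with a granularity this ignores. The helper names are hypothetical, and the SM count is an assumption derived by dividing the 240 cores quoted elsewhere in the talk by 8 SPs per SM.

```c
/* Per-SM figures quoted on the slide. */
enum {
    SP_PER_SM        = 8,          /* SP thread processors per SM */
    REGISTERS_PER_SM = 16 * 1024,  /* 16K 32-bit registers */
    THREADS_PER_SM   = 1024        /* 32 warps of 32 threads */
};

/* Number of SMs on a chip with the given core count. */
int sm_count(int cores)
{
    return cores / SP_PER_SM;
}

/* Threads that can be resident across the whole chip. */
int resident_threads(int cores)
{
    return sm_count(cores) * THREADS_PER_SM;
}

/* Registers available to each thread when 'threads' are resident on one SM. */
int registers_per_thread(int threads)
{
    return REGISTERS_PER_SM / threads;
}

/* Most threads one SM can hold if each thread needs 'regs' registers,
   capped by the 1024-thread hardware limit. */
int max_threads_for(int regs)
{
    int by_regs = REGISTERS_PER_SM / regs;
    return by_regs < THREADS_PER_SM ? by_regs : THREADS_PER_SM;
}
```

With 240 cores this gives 30 SMs and 30,720 resident threads, or 128 threads per SP. A fully occupied SM leaves 16 registers per thread, so a kernel that needs 32 registers per thread can keep at most 512 threads resident per SM under this simplified model.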

Thread Processor Datapath

• Executes 32-bit IEEE floating point instructions:
• FADD, FMUL, FMAD, FMIN, FMAX, FSET, F2I, I2F
• Performs 32-bit integer instructions:
• IADD, IMUL24, IMAD24, IMIN, IMAX, ISET, I2I
• SHR, SHL, AND, OR, XOR
• Fully pipelined
• Latency and area optimized
• IEEE 754 compliant FADD, FMUL
• Round to nearest even, round toward zero
• Handles special numbers, NaNs, infinities properly
• Flushes denormal operands and results to zero
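Flushing denormals to zero is the one behaviour in this list that can change numerical results compared with a CPU, so it is worth seeing concretely. The sketch below emulates the rule in portable C. It models the stated behaviour and is not the hardware implementation; the function names are hypothetical.

```c
#include <float.h>

/* Emulates flush-to-zero: a denormal (nonzero, magnitude below FLT_MIN)
   is replaced by a zero of the same sign. Every other value, including
   NaN and infinity, passes through unchanged. */
float flush_to_zero(float v)
{
    if (v != 0.0f && v > -FLT_MIN && v < FLT_MIN)
        return (v < 0.0f) ? -0.0f : 0.0f;
    return v;
}

/* Multiply with denormal operands and denormal results flushed to zero. */
float fmul_ftz(float a, float b)
{
    return flush_to_zero(flush_to_zero(a) * flush_to_zero(b));
}
```

For example, fmul_ftz(1e-40f, 1e30f) returns 0 because the first operand is a denormal, whereas full IEEE 754 arithmetic would return a value near 1e-10.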

Special Function Unit (SFU)

• Executes transcendental function instructions
• RCP, RSQRT, EXP2, LOG2, SIN, COS
• 2 SFUs per SM yields ¼ instruction throughput
• Evaluates function approximations
• Quadratic interpolation with Enhanced Minimax Approximation
• Interpolates pixel attributes
• Accuracy ranges from 22.5 to 24.0 bits
• 1/x in the interval [1,2) is 24 bits, 1 ulp
• CUDA uses SFUs for initial estimates, and refines them
• Final accuracy: see CUDA programming manual, appendix B
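The point about initial estimates can be illustrated with Newton-Raphson refinement of a reciprocal, a standard way to sharpen a low-precision estimate. This is a sketch of the general technique, not a description of what the CUDA libraries actually do. The function name and the choice of step count are assumptions.

```c
/* Refine an initial estimate x0 of 1/a with Newton-Raphson steps:
       x_next = x * (2 - a * x)
   Each step squares the relative error, so the number of correct bits
   roughly doubles per step. */
double refine_reciprocal(double a, double x0, int steps)
{
    double x = x0;
    for (int k = 0; k < steps; ++k)
        x = x * (2.0 - a * x);
    return x;
}
```

Starting from a deliberately poor estimate of 0.6 for 1/1.5, the relative error goes from 0.1 to 0.01 after one step and to about 1e-8 after three.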

GPU Architecture

GeForce GTX 280 / Tesla T10

[Diagram: 240 scalar cores, on-chip memory, and texture units, linked by a communication fabric to memory & I/O and fixed function acceleration]

CUDA Computing with Tesla T10

• 240 SP processors at 1.5 GHz: 1 TFLOPS peak
• 128 threads per processor: 30,720 threads total
• Tesla PCI-e board: C1060 (1 GPU)
• 1U Server: S1070 (4 GPUs)

[Diagram: Tesla T10 block diagram. The host CPU and system memory connect through a bridge to the work distribution unit. Each SM contains an I-Cache, MT Issue, C-Cache, SPs, a DP unit, SFUs, and shared memory. An interconnection network links the SMs to the ROP/L2 units and DRAM]

CUDA Case Studies

Single Precision BLAS: CPU vs GPU (10-series)

[Chart: BLAS (SGEMM) on CUDA. GFLOPS (0 to 350) against matrix size (256x256 to 8192x8192) for CUDA, ATLAS 1 Thread, and ATLAS 4 Threads]

CUBLAS: CUDA 2.0b2, Tesla C1060 (10-series GPU)
ATLAS 3.81 on Dual 2.8GHz Opteron Dual-Core

Double Precision BLAS: CPU vs GPU (10-series)

[Chart: BLAS (DGEMM) on CUDA. GFLOPS (0 to 70) against matrix size (256x256 to 8192x8192) for CUBLAS, ATLAS Parallel, and ATLAS Single]

CUBLAS CUDA 2.0b2 on Tesla C1060 (10-series)
ATLAS 3.81 on Intel Xeon E5440 Quad-core, 2.83 GHz

GPU + CPU DGEMM Performance

[Chart: GFLOPs (0 to 120) against matrix size (128 to 6080) for GPU + CPU, GPU only, and CPU only]

Xeon Quad-core 2.8 GHz, MKL 10.3
Tesla C1060 GPU (1.296 GHz)

AccelerEyes Jacket

• Who is AccelerEyes?
  • AccelerEyes is a MathWorks partner
  • Simple software for visual computing
• What is Jacket?
  • GPU engine for MATLAB
  • CUDA-powered language extension
• Why Jacket?
  • Challenges in technical computing
  • Low-cost speed, high-value graphics
  • Increased productivity
