
© NVIDIA Corporation 2008

HPC Computing with CUDA and Tesla Hardware

Tim Lanfear, NVIDIA

Parallel Computing’s Golden Age

• 1980s, early 1990s
  • Particularly data-parallel computing
• Architectures
  • Connection Machine, MasPar, Cray
  • True Supercomputers: incredibly exotic, powerful, expensive
• Algorithms, languages, & programming models
  • Solved a wide variety of problems
  • Various parallel algorithmic models developed
  • P-RAM, V-RAM, circuit, hypercube, etc.

[Figures: Thinking Machines CM-1 (1984), Cray X-MP (1982), MasPar MP-1 (1990)]

Parallel Computing’s Dark Age

• But … impact of data-parallel computing limited
  • Thinking Machines sold 7 CM-1s (100s of systems total)
  • MasPar sold ~200 systems
• Commercial and research activity subsided
  • Massively-parallel machines replaced by clusters of ever more powerful commodity microprocessors
  • Beowulf, Legion, grid computing, …

Massively parallel computing lost momentum to the advance of commodity technology

Enter the GPU

• GPU = Graphics Processing Unit

• Processor in computer video cards, PlayStation 3, etc.

• Computer games caused “evolution pressure”

• GPUs are massively multithreaded many-core chips



• NVIDIA Tesla products have 240 scalar processors

• Over 1 TERAFLOPS sustained performance

• Over 30,000 concurrent threads

• Multi-GPU scales beyond

Parallelism is Scaling Rapidly

• CPUs and GPUs are parallel processors

• CPUs now have 2, 4, 8, … processors

• GPUs now have 32, 64, 128, 240, … processors

• Parallelism is increasing rapidly with Moore’s Law



• Processor count is doubling every 18 – 24 months

• Individual processor cores no longer getting faster

• Challenge: Develop parallel application software

• Scale software parallelism to use more and more processors

• Same source for parallel GPUs and CPUs

GPUs: Turning Point in Supercomputing

Desktop beats Cluster

[Chart: Digital Tomography Reconstruction Time, in seconds. Source: University of Antwerp, Belgium]

• Tesla Personal Supercomputer (4 Tesla C1060 GPUs), $10,000: 59.9 secs
• CalcUA (256 AMD dual-core Opterons), $5 Million: 67.4 secs

NVIDIA GPU Product Families

• Tesla™: High-Performance Computing
• Quadro®: Design & Creation
• GeForce®: Entertainment


What is GPU Computing?


Computing with CPU + GPU

Heterogeneous Computing

GPUs: Many-Core High Performance Computing

• NVIDIA’s 10-series GPU has 240 cores
• Each core has a
  • Floating point / integer unit
  • Logic unit
  • Move, compare unit
  • Branch unit
• Cores managed by thread manager
  • Thread manager can spawn and manage 30,000+ threads
  • Zero overhead thread switching

NVIDIA 10-Series GPU: NVIDIA’s 2nd Generation CUDA Processor

• 1.4 billion transistors
• 1 Teraflop of processing power
• 240 processing cores

Tesla 8-Series vs 10-Series

• Double the Performance: 500 Gigaflops (Tesla 8) vs 1 Teraflop (Tesla 10)
• More than Double the Memory: 1.5 Gigabytes (Tesla 8) vs 4 Gigabytes (Tesla 10)
• Double the Precision: Finance, Science, Design

Tesla GPU Computing Products


                        Tesla S1070      Tesla C1060        Tesla Personal Supercomputer
                        1U System        Computing Board    (4 Tesla C1060s)
GPUs                    4 Tesla GPUs     1 Tesla GPU        4 Tesla GPUs
Single Precision Perf   4.14 Teraflops   933 Gigaflops      3.7 Teraflops
Double Precision Perf   346 Gigaflops    78 Gigaflops       312 Gigaflops
Memory                  4 GB / GPU       4 GB               4 GB / GPU

Tesla C1060 Computing Processor

Processor          1 x Tesla T10
Number of cores    240
Core Clock         1.296 GHz
On-board memory    4.0 GB
Memory bandwidth   102 GB/sec peak
Memory I/O         512-bit, 800MHz GDDR3
Form factor        Full ATX: 4.736” x 10.5”, dual slot wide
System I/O         PCIe x16 Gen2
Typical power      160 W
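The headline figures in the product tables can be reproduced from the listed core counts, clocks, and memory interface. The following is a minimal worked calculation in C, not an NVIDIA formula. The per-clock factors are assumptions that the slides do not state: 3 single-precision flops per core per clock (a multiply-add plus a multiply), one double-precision unit per 8-core SM issuing a fused multiply-add (2 flops) per clock, and GDDR3 transferring data twice per memory clock. The function names are illustrative.

```c
/* Peak arithmetic rate in GFLOPS:
   units * clock (GHz) * flops per unit per clock (assumed factor). */
double peak_gflops(int units, double clock_ghz, int flops_per_clock)
{
    return units * clock_ghz * flops_per_clock;
}

/* Peak memory bandwidth in GB/s:
   bus width in bytes * memory clock (Hz) * transfers per clock (assumed 2 for GDDR3). */
double peak_bandwidth_gbs(int bus_bits, double mem_clock_mhz, int transfers_per_clock)
{
    return (bus_bits / 8.0) * mem_clock_mhz * 1e6 * transfers_per_clock / 1e9;
}
```

Under these assumptions, `peak_gflops(240, 1.296, 3)` gives about 933 GFLOPS and `peak_gflops(30, 1.296, 2)` about 78 GFLOPS, in line with the C1060 single- and double-precision figures, and `peak_bandwidth_gbs(512, 800, 2)` gives about 102 GB/sec. With 960 cores at 1.44 GHz the same formula gives about 4147 GFLOPS, consistent with the 4.14 Teraflops quoted for the S1070.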

Tesla Personal Supercomputer

Supercomputing Performance

• Massively parallel CUDA Architecture

• 960 cores. 4 TeraFlops

• 250x the performance of a desktop

Personal



• One researcher, one supercomputer

• Plugs into standard power strip

Accessible

• Program in C for Windows, Linux

• Available now worldwide under $10,000

Tesla S1070 1U System

Processors            4 x Tesla T10
Number of cores       960
Core Clock            1.44 GHz
Performance           4 Teraflops
Total system memory   16.0 GB (4.0 GB per T10)
Memory bandwidth      408 GB/sec peak (102 GB/sec per T10)
Memory I/O            2048-bit, 800MHz GDDR3 (512-bit per T10)
Form factor           1U (EIA 19” rack)
System I/O            2 PCIe x16 Gen2
Typical power         700 W

Connecting Tesla S1070 to Host Servers

[Diagram: a host server connected to the Tesla S1070 through PCIe Gen2 host interface cards and PCIe Gen2 cables (0.5m length)]

Tesla GPU Computing System

[Diagram: CPU servers (CPU, FSB, core logic) connect through PCI-E Gen2 x16 adapter cards and PCI-Express cables to the Tesla GPU system, which contains a power supply and NVIDIA switches that link the NVIDIA Tesla GPUs over PCI-E x16 Gen 2]


Enter CUDA

• CUDA is a scalable parallel programming model and software environment for parallel computing
• NVIDIA GPU architecture accelerates CUDA
  • Hardware and software designed together for computing
  • Expose the computational horsepower of NVIDIA GPUs
  • Enable general-purpose GPU computing

Parallel Computing on All GPUs

Over 100 million CUDA GPUs Deployed

• Tesla™: High-Performance Computing
• Quadro®: Design & Creation
• GeForce®: Entertainment


Pervasive CUDA Parallel Computing

• CUDA brings data-parallel computing to the masses
  • Over 100 M CUDA-capable GPUs deployed since Nov 2006
• Wide developer acceptance (www.nvidia.com/CUDA)
  • Over 100K CUDA developer downloads (CUDA is free!)
  • A GPU “developer kit” costs ~$200 (GeForce board price) for 500 GFLOPS
• Data-parallel supercomputers are everywhere!
  • CUDA makes this power readily accessible
  • Enables rapid innovations in data-parallel computing

Massively parallel computing has become a commodity technology!

CUDA is C for Parallel Processors

• CUDA is industry-standard C with minimal extensions
  • Write a program for one thread
  • Instantiate it on many parallel threads
  • Familiar programming model and language
• CUDA is a scalable parallel programming model
  • Program runs on any number of processors without recompiling

CUDA Uses Extensive Multithreading

• CUDA threads express fine-grained data parallelism
  • Map threads to GPU threads
  • Virtualize the processors
  • You must rethink your algorithms to be aggressively parallel
• CUDA thread blocks express coarse-grained parallelism
  • Blocks hold arrays of GPU threads, define shared memory boundaries
  • Allow scaling between smaller and larger GPUs
• GPUs execute thousands of lightweight threads
  • (In graphics, each thread computes one pixel)
  • One CUDA thread computes one result (or several results)
  • Hardware multithreading & zero-overhead scheduling

Data Parallel Levels

• Thread
  • Computes result elements
  • Thread id number
• Thread Block
  • Runs on one SM, shared mem
  • 1 to 512 threads per block
  • Block id number
• Grid of Blocks
  • Holds complete computation task
  • One to many blocks per Grid
• Sequential Grids
  • Compute sequential problem steps

[Diagram: a Block holds threads t0, t1, t2, … tm; a Grid holds blocks Bl. 0, Bl. 1, Bl. 2, … Bl. n]
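The thread, block, and grid levels above imply two small calculations: how many blocks a grid needs to cover a problem, and which result element a given thread owns. The following is a minimal sketch in plain C. The function names are hypothetical, and the 512-thread limit is the one quoted on this slide for this hardware generation.

```c
#define MAX_THREADS_PER_BLOCK 512  /* upper limit quoted on the slide */

/* Number of blocks needed to cover n elements.
   Returns -1 if the block size is outside 1..512 or n is negative. */
int blocks_needed(int n, int threads_per_block)
{
    if (n < 0 || threads_per_block < 1 || threads_per_block > MAX_THREADS_PER_BLOCK)
        return -1;
    return (n + threads_per_block - 1) / threads_per_block;  /* round up */
}

/* Index of the result element owned by one thread,
   given its block id number and thread id number. */
int global_index(int block_id, int threads_per_block, int thread_id)
{
    return block_id * threads_per_block + thread_id;
}
```

Rounding up means the last block may contain threads with no element to compute, so each thread has to compare its index against the problem size before doing any work.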

Parallel Memory Sharing

[Diagram: each Thread has Local Memory; each Block has Shared Memory and a local barrier; sequential Grids in time (Grid 0, Grid 1, …) share Global Memory and are separated by a global barrier]
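A sum reduction is one common way to see the three memory levels working together. The sketch below is a sequential emulation in plain C, not CUDA code and not taken from the slides. Each simulated block reduces its slice in a small array standing in for shared memory, with comments marking where a local barrier would be needed on a GPU. The per-block results go to an array standing in for global memory, and a second step, standing in for the next grid, combines them. `BLOCK_DIM` and all function names are hypothetical.

```c
#define BLOCK_DIM 8  /* hypothetical block size; must be a power of two */

/* One block: each "thread" loads one element into shared memory, then the
   block repeatedly halves the active range until shared[0] holds its sum. */
static float block_sum(const float *in, int n, int block_idx)
{
    float shared[BLOCK_DIM];
    for (int t = 0; t < BLOCK_DIM; ++t) {
        int i = block_idx * BLOCK_DIM + t;
        shared[t] = (i < n) ? in[i] : 0.0f;  /* pad past the end with zeros */
    }
    /* a local barrier would go here on a GPU */
    for (int stride = BLOCK_DIM / 2; stride > 0; stride /= 2) {
        for (int t = 0; t < stride; ++t)
            shared[t] += shared[t + stride];
        /* a local barrier would go here on a GPU */
    }
    return shared[0];
}

/* Grid 0 writes one partial sum per block to global memory (partials).
   The boundary between grids acts as the global barrier; the second loop
   stands in for grid 1, which combines the partial sums. */
float grid_sum(const float *in, int n, float *partials)
{
    int nblocks = (n + BLOCK_DIM - 1) / BLOCK_DIM;
    for (int b = 0; b < nblocks; ++b)   /* grid 0 */
        partials[b] = block_sum(in, n, b);
    float total = 0.0f;                 /* grid 1 */
    for (int b = 0; b < nblocks; ++b)
        total += partials[b];
    return total;
}
```

The caller supplies `partials` with room for one float per block. Splitting the work into two grids reflects the diagram: threads in different blocks cannot synchronize through shared memory, so results cross block boundaries only through global memory.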

Simple “C” Description For Parallelism

Standard C Code

void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);

Parallel C Code

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}

// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
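To see why the parallel version computes the same result as the serial loop, it can help to emulate the kernel launch in plain C. This is a sketch only: the block and thread indices that CUDA supplies as built-in variables are passed as ordinary parameters, the two loops stand in for the hardware scheduling of blocks and threads, and the names `saxpy_thread` and `saxpy_emulated` are hypothetical.

```c
/* Work of one "thread": the same statement as the parallel kernel body. */
static void saxpy_thread(int block_idx, int block_dim, int thread_idx,
                         int n, float a, const float *x, float *y)
{
    int i = block_idx * block_dim + thread_idx;
    if (i < n)                 /* threads past the end of the array do nothing */
        y[i] = a * x[i] + y[i];
}

/* Sequential stand-in for launching nblocks blocks of block_dim threads. */
void saxpy_emulated(int n, float a, const float *x, float *y, int block_dim)
{
    int nblocks = (n + block_dim - 1) / block_dim;  /* round up */
    for (int b = 0; b < nblocks; ++b)
        for (int t = 0; t < block_dim; ++t)
            saxpy_thread(b, block_dim, t, n, a, x, y);
}
```

Every index from 0 to n-1 is visited exactly once, and because each thread touches only its own element the order of the two loops does not matter. That independence is what allows a GPU to run the threads concurrently.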

[Diagram: CUDA software stack. Application software (written in C) sits on the CUDA Libraries (cuFFT, cuBLAS, cuDPP), the CUDA Compiler (C, Fortran), and the CUDA Tools (Debugger, Profiler), running on CPU hardware (4 cores) and, through a PCI-E switch, a 1U system (240 cores)]

CUDA Zone: www.nvidia.com/CUDA

• CUDA Toolkit
  • Compiler
  • Libraries
• CUDA SDK
  • Code samples
• CUDA Profiler
• Forums
• Resources for CUDA developers

CUDA Computing Sweet Spots

• Parallel Applications

• High bandwidth: Sequencing (virus scanning, genomics), sorting, database, …
• Visual computing: Graphics, image processing, tomography, machine vision, …
• High arithmetic intensity: Dense linear algebra, PDEs, n-body, finite difference, …

• Applications in finance

Wide Developer Acceptance and Success

• 146X: Interactive visualization of volumetric white matter connectivity
• 36X: Ion placement for molecular dynamics simulation
• 19X: Transcoding HD video stream to H.264
• 17X: Simulation in Matlab using .mex file CUDA function
• 100X: Astrophysics N-body simulation
• 149X: Financial simulation of LIBOR model with swaptions
• 47X: GLAME@lab: An M-script API for linear Algebra operations on GPU
• 20X: Ultrasound medical imaging for cancer diagnostics
• 24X: Highly optimized object oriented molecular dynamics
• 30X: Cmatch exact string matching to find similar proteins and gene sequences

Tesla SM Multiprocessor

• SM has 8 SP Thread Processors
  • IEEE 754 32-bit floating point
  • 32-bit and 64-bit integer
  • 16K 32-bit registers
• SM has 2 SFU Special Function Units
• SM has DP Double Precision Unit
  • IEEE 754 64-bit floating point
  • Fused multiply-add
• Multithreaded Instruction Unit
  • 1024 threads, hardware multithreaded
  • 32 SIMT warps of 32 threads
  • Independent thread execution
  • Hardware thread scheduling
• 16KB Shared Memory
  • Concurrent threads share data
  • Low latency load/store
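The SM figures above fit together arithmetically, and working them through shows how tight the per-thread resources become when an SM is fully loaded. The following is a minimal sketch in C that uses only numbers quoted in this deck; the constant and function names are illustrative.

```c
/* Figures quoted on the slides for a Tesla T10 and one of its SMs. */
enum {
    TOTAL_SP         = 240,        /* scalar cores per GPU */
    SP_PER_SM        = 8,
    WARPS_PER_SM     = 32,
    THREADS_PER_WARP = 32,
    REGISTERS_PER_SM = 16 * 1024,  /* 16K 32-bit registers */
    SHARED_BYTES     = 16 * 1024   /* 16KB shared memory */
};

int sm_count(void)        { return TOTAL_SP / SP_PER_SM; }
int threads_per_sm(void)  { return WARPS_PER_SM * THREADS_PER_WARP; }
int threads_per_gpu(void) { return sm_count() * threads_per_sm(); }

/* Registers available to each thread when `resident` threads share one SM. */
int registers_per_thread(int resident) { return REGISTERS_PER_SM / resident; }

/* Bytes of shared memory per thread when `resident` threads share one SM. */
int shared_bytes_per_thread(int resident) { return SHARED_BYTES / resident; }
```

The numbers give 30 SMs of 1024 threads each, or 30,720 threads per GPU. With all 1024 threads resident, each thread has 16 registers and 16 bytes of shared memory, so a kernel that needs more per thread has to run with fewer resident threads.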

Thread Processor Datapath

• Executes 32-bit IEEE floating point instructions:

• FADD, FMUL, FMAD, FMIN, FMAX, FSET, F2I, I2F

• Performs 32-bit integer instructions:

• IADD, IMUL24, IMAD24, IMIN, IMAX, ISET, I2I

• SHR, SHL, AND, OR, XOR

• Fully pipelined



• Latency and area optimized

• IEEE 754 compliant FADD, FMUL

• Round to nearest even, round toward zero

• Handles special numbers, NaNs, infinities properly

• Flushes denormal operands and results to zero
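Flushing denormals to zero means that any operand or result too small to be represented as a normalized float is replaced by zero. The following is a minimal software sketch of that behavior in C, using the standard `fpclassify` and `signbit` macros from `<math.h>`. It only illustrates the rule; the hardware does this inside the datapath, keeping the sign of the zero is an assumption here, and the function names are hypothetical.

```c
#include <math.h>

/* Return x unchanged unless it is a denormal (subnormal) value,
   in which case return a zero with the same sign (assumed behavior). */
float flush_denormal(float x)
{
    if (fpclassify(x) == FP_SUBNORMAL)
        return signbit(x) ? -0.0f : 0.0f;
    return x;
}

/* A multiply with flush-to-zero applied to both operands and to the result. */
float fmul_ftz(float a, float b)
{
    return flush_denormal(flush_denormal(a) * flush_denormal(b));
}
```

For example, 1e-20f times 1e-20f would be a denormal in full IEEE 754 arithmetic; with flushing, the result is zero. Code that depends on gradual underflow can therefore behave differently on this hardware than on a CPU.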

Special Function Unit (SFU)

• Executes transcendental function instructions

• RCP, RSQRT, EXP2, LOG2, SIN, COS

• 2 SFUs per SM yields ¼ instruction throughput

• Evaluates function approximations

• Quadratic interpolation with Enhanced Minimax Approximation


• Interpolates pixel attributes

• Accuracy ranges from 22.5 to 24.0 bits

• 1/x in the interval [1,2) is 24 bits, 1 ulp

• CUDA uses the SFUs for initial estimates and then refines them

• Final accuracy: see CUDA programming manual, appendix B
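One standard way to refine an approximate reciprocal is Newton-Raphson iteration, which roughly doubles the number of correct bits per step. The sketch below shows that general technique in C. It is not a description of what the CUDA math library actually does, and the function name and the caller-supplied starting estimate are hypothetical.

```c
/* Refine an approximate reciprocal x0 of d using Newton-Raphson steps:
   x_next = x * (2 - d * x). The estimate must be close enough that
   |1 - d * x0| < 1 for the iteration to converge. */
float refine_reciprocal(float d, float x0, int steps)
{
    float x = x0;
    for (int k = 0; k < steps; ++k)
        x = x * (2.0f - d * x);
    return x;
}
```

If the relative error of the estimate is e, one step leaves an error of about e squared. Starting from 0.6 as an estimate of 1/1.5 (10% error), three steps bring the error down to roughly the limit of single precision.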

GPU Architecture

GeForce GTX 280 / Tesla T10

[Diagram: 240 scalar cores, on-chip memory, and texture units, linked by a communication fabric to memory & I/O and fixed function acceleration]

CUDA Computing with Tesla T10

• 240 SP processors at 1.5 GHz: 1 TFLOPS peak

• 128 threads per processor: 30,720 threads total

• Tesla PCI-e board: C1060 (1 GPU)

• 1U Server: S1070 (4 GPUs)

[Diagram: Tesla T10 block diagram. The host CPU and system memory connect through a bridge to the work distribution unit. Each SM contains an I-Cache, MT Issue, C-Cache, SPs, SFUs, a DP unit, and Shared Memory. An interconnection network links the SMs to eight ROP/L2 partitions, each with its own DRAM]

CUDA Case Studies

Single Precision BLAS: CPU vs GPU (10-series)

[Chart: BLAS (SGEMM) on CUDA. GFLOPS (axis 0 to 350) versus matrix size (256x256 to 8192x8192) for CUDA, ATLAS 1 Thread, and ATLAS 4 Threads]

CUBLAS: CUDA 2.0b2, Tesla C1060 (10-series GPU)

ATLAS 3.81 on Dual 2.8GHz Opteron Dual-Core

Double Precision BLAS: CPU vs GPU (10-series)

[Chart: BLAS (DGEMM) on CUDA. GFLOPS (axis 0 to 70) versus matrix size (256x256 to 8192x8192) for CUBLAS, ATLAS Parallel, and ATLAS Single]

CUBLAS CUDA 2.0b2 on Tesla C1060 (10-series)

ATLAS 3.81 on Intel Xeon E5440 Quad-core, 2.83 GHz

GPU + CPU DGEMM Performance

[Chart: GFLOPs (axis 0 to 120) versus size (128 to 6080) for GPU + CPU, GPU only, and CPU only]

Xeon Quad-core 2.8 GHz, MKL 10.3

Tesla C1060 GPU (1.296 GHz)

AccelerEyes Jacket

• Who is AccelerEyes?
  • AccelerEyes is a MathWorks partner
  • Simple software for visual computing
• What is Jacket?
  • GPU engine for MATLAB
  • CUDA-powered language extension
• Why Jacket?
  • Challenges in technical computing
  • Low-cost speed, high-value graphics
  • Increased productivity
