GPU Computing: Pervasive Massively Multithreaded Processors
Michael C. Shebanow, Sr. Architecture Manager, GPUs
Agenda
● GPU Computing
● Tesla Products
● CUDA
● SM: Thread Multiprocessor
● Application Performance Model
GPU Computing
● GPU = graphics processing unit
  ● NVIDIA’s latest products = 88xx/98xx series
  ● Accelerates graphics
● GPU Computing
  ● Using GPUs for general-purpose computing other than graphics
GPU Computing is Pervasive
● Huge numbers of deployed parallel computing engines
  ● NVIDIA has shipped >70M CUDA-capable GPUs to date
  ● NVIDIA is currently shipping more than 1M CUDA-capable GPUs per week
● Wide range of products, from laptops to servers
  ● Coming soon: cell phones, PDAs, …
GPU Computing Key Concepts
● Hardware (HW) thread management
  ● HW thread launch and monitoring
  ● HW thread switching
  ● Tens of thousands of lightweight, concurrent threads
  ● Real threads: PC, private registers, …
● SIMT execution model
● Multiple memory scopes
  ● Per-thread private memory
  ● Per-thread-block shared memory
  ● Global memory
● Using threads to hide memory latency
● Coarse-grain thread synchronization
SIMT Multithreaded Execution
● SIMT: Single-Instruction, Multi-Thread; executes one instruction across many independent threads
● Warp: a set of 32 parallel threads that execute a SIMT instruction
● SIMT provides easy single-thread scalar programming with SIMD efficiency
● Hardware implements zero-overhead warp and thread scheduling
● SIMT threads can execute independently
  ● A SIMT warp diverges and converges when threads branch independently
  ● Best efficiency and performance when threads of a warp execute together
[Figure: the SIMT instruction scheduler interleaves warps over time, e.g. warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, warp 8 instruction 12, …, warp 3 instruction 96]
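To make divergence concrete, here is a minimal sketch (not from the slides; the kernel name and arithmetic are illustrative) of a kernel in which threads of the same warp take different branches:

// Hypothetical kernel: even- and odd-numbered threads take different paths.
// Within a warp the two paths are serialized one after the other, and the
// warp reconverges after the if/else.
__global__ void divergeExample(const float* in, float* out, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) {
        if (i % 2 == 0)
            out[i] = in[i] * 2.0f;   // path taken by even threads
        else
            out[i] = in[i] + 1.0f;   // path taken by odd threads
    }
}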
Multiple Memory Scopes
● Per-thread private memory
  ● Each thread has its own local memory
  ● Stacks, other private data
● Per-thread-block shared memory
  ● Small memory close to the processor, low latency
  ● Allocated per thread block
● Main memory
  ● GPU frame buffer
  ● Can be accessed by any thread in any thread block
[Figure: per-thread local memory, per-block shared memory, and per-device global memory shared by sequential kernel grids (Kernel 0, Kernel 1, …)]
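A minimal sketch (not from the slides; the names, the 256-thread block size, and an input length that is a multiple of the block size are assumptions) showing all three scopes in one kernel:

// g_in/g_out live in per-device global memory; 'tile' is per-block shared
// memory; 't' is a per-thread private value (register/local memory).
__global__ void scopesExample(const float* g_in, float* g_out)
{
    __shared__ float tile[256];                     // per-block shared memory
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    float t = g_in[i];                              // per-thread private value

    tile[threadIdx.x] = t;                          // visible to the whole block
    __syncthreads();

    g_out[i] = tile[blockDim.x - 1 - threadIdx.x];  // read another thread's write
}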
Hiding LOAD Latency
● Principle: Little’s Law, N = λ × L
  ● N = “number in flight”
  ● λ = arrival rate
  ● L = memory latency
● Arrival rate λ is the product of:
  ● Desired execution rate (IPC)
  ● Density of LOAD instructions (%)
● N = # of threads needed to cover latency L
[Figure: timeline of LOAD requests issued by threads T0, T1, T2, … of warps W0 … Wk, with the matching LOAD responses arriving a latency L later; N = λ × L requests are in flight at any time]
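Plugging in illustrative numbers (these are examples, not T10 specifications): with a desired execution rate of 1 IPC, a LOAD density of 25%, and a memory latency of roughly 400 clocks,

  λ = 1 IPC × 25% = 0.25 loads/clock
  N = λ × L = 0.25 × 400 = 100 loads in flight

so on the order of 100 threads, each with one outstanding load, are needed to hide the latency.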
Hiding LOAD Latency w/ Fewer Threads
● Use batching
  ● Group independent LOADs together
● Modified law: N = (λ × L) / B
  ● B = batch size

// batch size 3 example
float *d_A, *d_B, *d_C;
float a, b, c, result;
a = *d_A; b = *d_B; c = *d_C;
result = a * b + c;

● The values ‘a’, ‘b’, and ‘c’ are loaded independently before being used
● Implication: we can execute 3 loads from one thread before the first use (‘a’ in this case) causes a stall
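Continuing the same illustrative numbers with a batch size of B = 3, each thread now keeps three loads in flight, so the thread count needed to cover the latency drops accordingly:

  N = (λ × L) / B = (0.25 × 400) / 3 ≈ 34 threads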
Thread Synchronization
● Barrier synchronization among threads of a block
  ● Fast single-instruction barrier in Tesla GPUs: void __syncthreads();
  ● Synchronizes all threads in a thread block
  ● Once all threads have reached this point, kernel execution resumes normally
  ● Use before reading shared memory written by another thread in the same block
● Global synchronization between dependent kernels
  ● Waits for all thread blocks of a kernel grid to complete
  ● Fast synchronization and kernel launch in Tesla GPUs
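As an illustration of the producer-consumer use of the barrier (a sketch, not from the slides; the kernel name and the 256-thread block size are assumptions), a per-block sum reduction in which every round of reads is separated from the preceding writes by __syncthreads():

// Each thread produces one element into shared memory; after the barrier,
// threads consume values written by other threads, halving the active set
// each round until thread 0 holds the block's sum.
__global__ void blockSum(const float* in, float* blockSums)
{
    __shared__ float s[256];                       // assumes 256 threads per block
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    s[threadIdx.x] = in[i];
    __syncthreads();                               // all writes visible before any read

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();                           // finish this round before the next
    }
    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = s[0];              // one partial sum per thread block
}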
Tesla Series Products
The Tesla 8-Series Processor: NVIDIA’s 1st-Generation CUDA Processor
● 681 million transistors
● 518 GFLOPS
● 128 cores (SPs), 12,288 threads max
● 384-bit, 800 MHz GDDR3
● 76 GB/sec peak memory bandwidth
The Tesla 10-Series Processor: NVIDIA’s 2nd-Generation CUDA Processor
● 1.4 billion transistors
● 1 Teraflop
● 240 cores (SPs), 30,720 threads max
● 512-bit, 800 MHz GDDR3
● 102 GB/sec peak memory bandwidth
Tesla C1060 Computing Processor
● Processor: 1 x Tesla T10
● Number of cores: 240
● Core clock: 1.33 GHz
● On-board memory: 4.0 GB
● Memory bandwidth: 102 GB/sec peak
● Memory I/O: 512-bit, 800 MHz GDDR3
● Form factor: full ATX (4.736” x 10.5”), dual-slot wide
● System I/O: PCIe x16 Gen2
● Typical power: 160 W
Tesla S1070 1U System
● Processors: 4 x Tesla T10
● Number of cores: 960
● Core clock: 1.5 GHz
● Performance: 4 Teraflops
● Total system memory: 16.0 GB (4.0 GB per T10)
● Memory bandwidth: 408 GB/sec peak (102 GB/sec per T10)
● Memory I/O: 2048-bit, 800 MHz GDDR3 (512-bit per T10)
● Form factor: 1U (EIA 19” rack)
● System I/O: 2 x PCIe x16 Gen2
● Typical power: 700 W
Example Speedups
● 146X: interactive visualization of volumetric white matter connectivity
● 36X: ionic placement for molecular dynamics simulation on GPU
● 18X: transcoding an HD video stream to H.264
● 17X: simulation in Matlab using a .mex-file CUDA function
● 100X: astrophysics N-body simulation
● 149X: financial simulation of the LIBOR model with swaptions
● 47X: GLAME@lab, an M-script API for linear algebra operations on GPU
● 20X: ultrasound medical imaging for cancer diagnostics
● 24X: highly optimized object-oriented molecular dynamics
● 30X: Cmatch exact string matching to find similar proteins and gene sequences
Example: Fluid Simulation
CUDA port of: Jos Stam, “Stable Fluids”, in SIGGRAPH 99 Conference Proceedings, Annual Conference Series, August 1999, pp. 121-128.
CUDA N-Body Simulation
● GeForce 8800 GTX GPU: 10 billion interactions/s
● 16K bodies at 44 FPS × 20 FLOPS/interaction × 16K² interactions/frame ≈ 240 GFLOP/s
● ≈ 50x a tuned CPU implementation on an Intel Core 2 Duo
● Highly parallel, high arithmetic intensity
N-Body Physics on CUDA
● All-pairs gravitational N-body physics of 16,384 stars
● 240 GFLOPS on an NVIDIA GeForce 8800; see GPU Gems 3
CUDA
CUDA Programming Model
● Minimal extension of the C and C++ languages
● Write a serial program that calls parallel kernels
● Serial portions execute on the host CPU
● A kernel executes as parallel threads on the GPU device
  ● Kernels may be simple functions or full programs
  ● Many threads execute each kernel
[Figure: CUDA compilation flow: a C CUDA application is compiled by NVCC into CPU code plus virtual PTX code; a PTX-to-target compiler generates physical target code for each GPU (G80 … GTX)]
CUDA Thread Model
● Thread
  ● Computes result elements
  ● threadIdx is the thread id number
● Thread Block
  ● Computes a result data block
  ● 1 to 512 threads per thread block
  ● blockIdx is the block id number
● Grid of Blocks
  ● Computes many result blocks
  ● 1 to many blocks per grid
● Sequential Grids
  ● Compute sequential problem steps
[Figure: hierarchy of a thread, a thread block (threads t0 t1 t2 … tm), and a grid of blocks (Block 0, Block 1, Block 2, …, Block n)]
CUDA: Basics
● Declaration specifiers to indicate where things live
  __global__ void KernelFunc(...);   // kernel callable from host
  __device__ void DeviceFunc(...);   // function callable on device
  __device__ int GlobalVar;          // variable in device memory
  __shared__ int SharedVar;          // in per-block shared memory
● Extended function invocation syntax for parallel kernel launch
  KernelFunc<<<500, 128>>>(...);     // 500 blocks, 128 threads each
● Special variables for thread identification in kernels
  dim3 threadIdx; dim3 blockIdx; dim3 blockDim;
● Intrinsics that expose specific operations in kernel code
  __syncthreads();                   // barrier synchronization
CUDA: Extended Features
● Standard mathematical functions
  sinf, powf, atanf, ceil, etc.
● Built-in vector types
  float4, int4, uint4, etc. for dimensions 1..4
● Atomics
  atomicAdd(int *pmem, int value), etc.; add, sub, min, max, and, or, xor, …
● Texture accesses in kernels
  texture<float,2> my_texture;                // declare texture reference
  float4 texel = texfetch(my_texture, u, v);  // fetch a texel
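As an illustration of the atomics listed above (a sketch; the kernel name and the 256-bin layout are assumptions, and global-memory atomics require a GPU that supports them):

// Each thread classifies one byte and atomically increments the matching bin
// in global memory, so concurrent updates from different threads do not race.
__global__ void histogram256(const unsigned char* data, int n, unsigned int* bins)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);
}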
CUDA Memory Management
// allocate host memory
unsigned int numBytes = N * sizeof(float);
float* h_A = (float*) malloc(numBytes);

// allocate device memory
float* d_A = 0;
cudaMalloc((void**)&d_A, numBytes);

// copy data from host to device
cudaMemcpy(d_A, h_A, numBytes, cudaMemcpyHostToDevice);

// copy data from device back to host
cudaMemcpy(h_A, d_A, numBytes, cudaMemcpyDeviceToHost);

// free device memory
cudaFree(d_A);
Example: Vector Addition Kernel

// Compute vector sum C = A + B
// Each thread performs one pair-wise addition

// Device code
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}

// Host code
int main()
{
    // Run N/256 blocks of 256 threads each
    vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
}
Example: Host code for vecAdd

// allocate and initialize host (CPU) memory
float *h_A = …, *h_B = …;

// allocate device (GPU) memory
float *d_A, *d_B, *d_C;
cudaMalloc((void**)&d_A, N * sizeof(float));
cudaMalloc((void**)&d_B, N * sizeof(float));
cudaMalloc((void**)&d_C, N * sizeof(float));

// copy host memory to device
cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice);

// execute the kernel on N/256 blocks of 256 threads each
vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
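One detail the launch above glosses over: N/256 blocks cover N elements only when N is a multiple of 256. A common guarded variant (added here for illustration, not part of the original slide):

__global__ void vecAdd(float* A, float* B, float* C, int N)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < N)                                    // guard the partial last block
        C[i] = A[i] + B[i];
}

// round the block count up so every element is covered
vecAdd<<<(N + 255) / 256, 256>>>(d_A, d_B, d_C, N);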
Example #2: Adding matrices with 2D grids
// CPU C program
void addMatrix(float *a, float *b, float *c, int N)
{
    int i, j, index;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            index = i + j * N;
            c[index] = a[index] + b[index];
        }
    }
}

void main()
{
    .....
    addMatrix(a, b, c, N);
}

// CUDA C program
__global__ void addMatrix(float *a, float *b, float *c, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int index = i + j * N;
    if (i < N && j < N)
        c[index] = a[index] + b[index];
}

void main()
{
    .....
    dim3 dimBlk(blocksize, blocksize);
    dim3 dimGrd(N / dimBlk.x, N / dimBlk.y);
    addMatrix<<<dimGrd, dimBlk>>>(a, b, c, N);
}
The SM: Thread Multiprocessor
Tesla 10-Series Architecture
● Scales parallel performance 2X beyond the Tesla 8-Series
● 240 multithreaded thread processors at 1.5 GHz, 1 TFLOPS peak
[Figure: Tesla T10 block diagram: the host CPU and system memory connect through a bridge to the GPU’s work distribution unit and thread processor array, with an interconnection network to eight DRAM partitions, each with ROP and L2]
T10 Multithreaded Multiprocessor
● Scalar register-based ISA
● Multithreaded instruction unit
  ● 1024 threads, hardware multithreaded
  ● 32 SIMT warps of 32 threads
  ● Independent thread execution
  ● Hardware thread scheduling
● 8 SP thread processors
  ● IEEE 754 32-bit floating point
  ● 32-bit and 64-bit integer
  ● 16K 32-bit registers
● 2 SFU special function units
  ● RCP, RSQRT, EXP2, LOG2, SIN, COS
  ● 2 SFUs per SM yield ¼ instruction throughput
  ● Accuracy ranges from 22.5 to 24.0 bits
● 1 DP double-precision unit
  ● IEEE 754 64-bit floating point
  ● Fused multiply-add
  ● Full-speed denormalized operands and results
● 16 KB shared memory
  ● Concurrent threads share data
  ● Low-latency load/store
[Figure: one SM within a TPC: I-cache, MT issue, C-cache, 8 SPs, the DP unit, 2 SFUs, and shared memory]
SM Conceptual Block Diagram
[Figure: the single-instruction (SI) front end, with instruction cache, fetch unit, per-warp PCs and instruction buffers (Warp 0 … Warp K), and scheduling unit, feeding the multithreaded (MT) register files, ALUs, and load/store unit (LSU)]
Application Performance
Limiter Theory
● The SM is a form of queuing system
● Use “limiter theory” to predict SM performance
● There are three types of limits on the performance of the SM:
  ● Bandwidth resource limiters
  ● Per-thread-block space limiters
  ● Per-thread space limiters
● The most constraining limiter is called the critical limiter
  ● Critical limiter = min(all limiters)
[Figure: SM conceptual block diagram, repeated from the previous slide]
Bandwidth Limiters
● Thread blocks arrive at some rate λTB
● Threads are composed of some distribution of operations
● Each arriving thread block of S threads contributes a distribution of operations to be performed
● Per operation type, the offered load, or BW DEMAND, is the product of:
  ● Thread block arrival rate λTB
  ● # of threads S in a block
  ● Operation density δ in each thread
● BW CAPACITY is an upper bound on BW DEMAND
[Chart: per-thread operation counts for the vecAdd kernel above, broken down by FADD, IADD, IMUL, LOAD, and STORE]
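Restating the bullets above as a formula (the notation follows the slide’s λTB, S, and δ):

  demand(op) = λTB × S × δ(op)  ≤  capacity(op), for every operation type op

The sustainable thread-block rate is therefore bounded by the operation type for which capacity(op) / (S × δ(op)) is smallest; that unit is the bandwidth critical limiter.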
Space Limiters
● The SM also has space resources. Examples:
  ● Finite limit on warp count
  ● Finite limit on register file space
  ● Finite limit on shared memory size
● Space resources:
  ● Allocated on thread block launch
  ● Deallocated on thread block completion
  ● Consumption computed using Little’s Law (N = λ × L)
● Thread latency (L):
  ● Complex computation
  ● Varies with memory behavior
[Figure: SM warps 0–31 partitioned between Thread Block 0 and Thread Block 1, each with its own allocation of registers and shared memory]
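A worked example (the kernel’s resource usage is hypothetical; the per-SM limits are the T10 figures quoted earlier: 1024 threads, 16K 32-bit registers, 16 KB shared memory). Suppose each thread uses 32 registers and each 256-thread block uses 4 KB of shared memory:

  registers: 16384 registers / 32 regs per thread = 512 threads
  warp count: 1024 threads
  shared memory: (16 KB / 4 KB per block) × 256 threads per block = 1024 threads

The critical limiter is min(512, 1024, 1024) = 512 threads (two blocks), so the register file, not the warp count, bounds this kernel’s occupancy.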
Limitations of Limiter Theory
● Limiter theory assumes uniform workloads
● It breaks down under “traffic jam” behavior
● Limiter theory is an acceptable 1st-order approximation
[Figure: vecAdd kernel, its per-thread operation-count chart, and the SM conceptual block diagram, repeated from earlier slides]
Implications of Limiter Theory
● Kernel code has to pay careful attention to operation “mix”
  ● Math-to-memory operation ratios, for example
  ● Do not bottleneck on one function unit while leaving other units idling
  ● Ideal: all units equally critical
● Don’t “freeway jam” kernel code
  ● Making thread blocks so large that only a few execute on the SM at a time is a bad idea
  ● “Bunching” operations of a similar type in one section of a kernel will aggravate the problem
  ● Ideal: lots of small thread blocks with a uniform distribution of operation densities
● Focus on space resource consumption
  ● Ideal: use as few resources as necessary to “load the SM”
Final Thoughts
● Threads are free
  ● A common mistake in GPU computing kernels is to make threads do too much
  ● Keep them short and sweet
  ● Example: one thread per vector element
  ● HW provides LOTS of them (tens of thousands)
  ● HW launch => near-zero overhead to create them
  ● HW context switching => near-zero overhead scheduling
● Barriers are cheap
  ● Single instruction
  ● HW synchronization of thread blocks
  ● Partition kernel code into producer-consumer stages
  ● DON’T use spin locks!
  ● Partition on results, not sources
Questions?
[email protected]
http://www.nvidia.com/CUDA
http://www.nvidia.com/TESLA