Intro to CUDA

GPU Algorithms, by David Hauck

Uploaded by david-hauck on 22-Jan-2017

TRANSCRIPT

Page 1: Intro to Cuda

GPU Algorithms

David Hauck

github.com/davidhauck

@david_hauck_mke

davidhauck40.blogspot.com

dhauck@skylinetechnologies.com

Page 2: Intro to Cuda

Graphics Processing Unit

Page 3: Intro to Cuda
Page 4: Intro to Cuda
Page 5: Intro to Cuda

Why?

Page 6: Intro to Cuda
Page 7: Intro to Cuda

Graphics Processing Unit

Page 8: Intro to Cuda

Graphics Processing Unit

General Purpose

Page 9: Intro to Cuda

TERMS

Page 10: Intro to Cuda

HOST

Page 11: Intro to Cuda

DEVICE

Page 12: Intro to Cuda
Page 13: Intro to Cuda

PCI Bus

Copy initial data to DEVICE

Page 14: Intro to Cuda

PCI Bus

Run DEVICE Executable

Page 15: Intro to Cuda

PCI Bus

Copy Results Back To HOST

Page 16: Intro to Cuda

Still Running on CPU

Page 17: Intro to Cuda

Still Running on CPU

GPU is a Resource

Page 18: Intro to Cuda
Page 19: Intro to Cuda

MEMORY CONSCIOUSNESS

Page 20: Intro to Cuda

HOST POINTERS

DEVICE POINTERS

Page 21: Intro to Cuda

int *a;

Page 22: Intro to Cuda

int *a;

int *d_a;

Page 23: Intro to Cuda

arr = malloc(size);

Page 24: Intro to Cuda

arr = malloc(size);

cudaMalloc(&d_arr, size);

Page 25: Intro to Cuda

free(arr);

Page 26: Intro to Cuda

free(arr);

cudaFree(d_arr);

Page 27: Intro to Cuda

memcpy(dest, source, size);

Page 28: Intro to Cuda

memcpy(dest, source, size);

cudaMemcpy(dest, src, size, …);

Page 29: Intro to Cuda

1: HOST → DEVICE

2: EXECUTE

3: DEVICE → HOST

Page 30: Intro to Cuda

1: HOST → DEVICE

3: DEVICE → HOST

cudaMemcpy();

Page 31: Intro to Cuda

1: HOST → DEVICE

cudaMemcpy(dest, source, size, cudaMemcpyHostToDevice);
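Both copy directions use the same call with a different direction flag; `cudaMemcpyHostToDevice` and `cudaMemcpyDeviceToHost` are the actual runtime enum values, while `arr`/`d_arr` follow the naming convention from the earlier slides:

```cuda
// Steps 1 and 3 of the flow, as a sketch:
cudaMemcpy(d_arr, arr, size, cudaMemcpyHostToDevice); // 1: HOST -> DEVICE
// ... step 2: launch the kernel on d_arr ...
cudaMemcpy(arr, d_arr, size, cudaMemcpyDeviceToHost); // 3: DEVICE -> HOST
```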

Page 32: Intro to Cuda

EXECUTION

Page 33: Intro to Cuda

__global__ void myKernel(int *a){}

Page 34: Intro to Cuda

myKernel<<<1,1>>>(d_arr);
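The two numbers inside `<<< >>>` are the launch configuration: number of blocks, then threads per block. A few illustrative launches (assuming `d_arr` is a device pointer, as above):

```cuda
// <<<blocks, threadsPerBlock>>> sets how many copies of the kernel run.
myKernel<<<1, 1>>>(d_arr);    // one block, one thread (serial)
myKernel<<<1, 256>>>(d_arr);  // one block of 256 threads
myKernel<<<8, 256>>>(d_arr);  // 8 blocks x 256 threads = 2048 threads
```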

Page 35: Intro to Cuda

Let’s do an example

Page 36: Intro to Cuda

abcd

+

efgh

=

ijkl

Page 37: Intro to Cuda

abcd

+

efgh

=

ijkl

Page 38: Intro to Cuda

abcd

+

efgh

=

ijkl

threadIdx.x

0

1

2

3

Page 39: Intro to Cuda

int index = threadIdx.x;c[index] =

a[index] + b[index];
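Putting the pieces together, here is a complete, compilable vector-add program along the lines the slides build up. The array contents and the kernel name `add` are illustrative, not from the talk:

```cuda
#include <stdio.h>

#define N 4

// One thread per element, exactly as on the previous slide.
__global__ void add(int *a, int *b, int *c)
{
    int index = threadIdx.x;
    c[index] = a[index] + b[index];
}

int main(void)
{
    int a[N] = {1, 2, 3, 4}, b[N] = {10, 20, 30, 40}, c[N];
    int *d_a, *d_b, *d_c;
    size_t size = N * sizeof(int);

    cudaMalloc(&d_a, size);                            // DEVICE allocations
    cudaMalloc(&d_b, size);
    cudaMalloc(&d_c, size);

    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);  // 1: HOST -> DEVICE
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    add<<<1, N>>>(d_a, d_b, d_c);                      // 2: EXECUTE, N threads

    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);  // 3: DEVICE -> HOST

    for (int i = 0; i < N; i++)
        printf("%d ", c[i]);                           // 11 22 33 44
    printf("\n");

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return 0;
}
```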

Page 40: Intro to Cuda

Let’s invent an ALGORITHM

Page 41: Intro to Cuda

K-Means Clustering
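The talk's K-Means code is not reproduced in this transcript. As a rough sketch of how one step maps to CUDA: the assignment step ("label each point with its nearest centroid") parallelizes naturally as one thread per point. This hypothetical kernel (names and the 1-D simplification are mine, not the talk's) illustrates the idea:

```cuda
#include <math.h>

// Hypothetical sketch: K-Means assignment step for 1-D points,
// one thread per point.
__global__ void assignClusters(const float *points, const float *centroids,
                               int *labels, int numPoints, int k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numPoints) return;                 // guard threads past the end

    int best = 0;
    float bestDist = fabsf(points[i] - centroids[0]);
    for (int c = 1; c < k; c++) {
        float d = fabsf(points[i] - centroids[c]);
        if (d < bestDist) { bestDist = d; best = c; }
    }
    labels[i] = best;                           // nearest centroid wins
}
```

The update step (recomputing centroids) needs a reduction across points, which is where the shared-memory techniques later in the deck come in.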

Page 42: Intro to Cuda
Page 43: Intro to Cuda
Page 44: Intro to Cuda
Page 45: Intro to Cuda
Page 46: Intro to Cuda
Page 47: Intro to Cuda
Page 48: Intro to Cuda
Page 49: Intro to Cuda
Page 50: Intro to Cuda
Page 51: Intro to Cuda
Page 52: Intro to Cuda
Page 53: Intro to Cuda
Page 54: Intro to Cuda
Page 55: Intro to Cuda

CODE

Page 56: Intro to Cuda

Shared Memory

• ~48 KB
• Multiple GB of device memory (100x higher latency)
• Access memory in order:
  1 2 3
  4 5 6
  7 8 9
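As a sketch of how the ~48 KB of shared memory gets used: a block can read device memory once, in order (coalesced), into shared memory, then reuse the staged values cheaply. This illustrative kernel (mine, not the talk's) assumes a block size of 256:

```cuda
// Sketch: stage a tile of device memory into fast shared memory,
// then let neighboring threads reuse it without re-reading device memory.
__global__ void sumNeighbors(const int *in, int *out, int n)
{
    __shared__ int tile[256];                  // per-block fast storage
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (i < n) ? in[i] : 0;   // one in-order (coalesced) read
    __syncthreads();                           // wait until the tile is full

    if (i < n && threadIdx.x + 1 < blockDim.x)
        out[i] = tile[threadIdx.x] + tile[threadIdx.x + 1]; // reuse from shared
}
```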

Page 57: Intro to Cuda

Considerations

• Transistors are allocated to arithmetic, not memory. Sometimes it is better to recompute rather than cache.
• Copying to/from the host takes a while. Sometimes sequential operations can stay on the GPU.
• Avoid serialization (shared memory bank conflicts).
• Asynchronous memory operations.
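The asynchronous-memory point usually means streams plus pinned (page-locked) host memory, so copies can overlap kernel execution. A hedged sketch, with `myKernel`, `blocks`, `threads`, `size`, and `d_a` assumed to be defined as in the earlier slides:

```cuda
// Sketch: queue a copy-in, a kernel, and a copy-out on one stream,
// and only block the host when the results are actually needed.
cudaStream_t stream;
cudaStreamCreate(&stream);

int *h_a;
cudaMallocHost(&h_a, size);   // pinned host memory; plain malloc'd memory
                              // cannot overlap with async copies

cudaMemcpyAsync(d_a, h_a, size, cudaMemcpyHostToDevice, stream);
myKernel<<<blocks, threads, 0, stream>>>(d_a);   // 4th launch arg = stream
cudaMemcpyAsync(h_a, d_a, size, cudaMemcpyDeviceToHost, stream);

cudaStreamSynchronize(stream);   // host waits here, not at each step
cudaStreamDestroy(stream);
cudaFreeHost(h_a);
```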