Intro to CUDA
TRANSCRIPT
GPU Algorithms
David Hauck
github.com/davidhauck
@david_hauck_mke
davidhauck40.blogspot.com
Graphics Processing Unit
Why?
General Purpose Graphics Processing Unit (GPGPU)
HOST ↔ DEVICE, connected by the PCI Bus

Copy initial data to DEVICE (over the PCI Bus)
Run DEVICE executable
Copy results back to HOST (over the PCI Bus)
Still Running on CPU
GPU is a Resource
MEMORY CONSCIOUSNESS
HOST POINTERS                DEVICE POINTERS
int *a;                      int *d_a;
arr = malloc(size);          cudaMalloc(&d_arr, size);
free(arr);                   cudaFree(d_arr);
memcpy(dest, source, size);  cudaMemcpy(dest, src, size, …);
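A minimal sketch of how these calls mirror each other; the names (arr, d_arr, n) are illustrative and error checking is omitted:

int n = 1024;                      // illustrative element count
size_t size = n * sizeof(int);

int *arr = (int *)malloc(size);    // HOST allocation
int *d_arr;
cudaMalloc(&d_arr, size);          // DEVICE allocation

// ... fill arr, copy it over, run kernels ...

free(arr);                         // HOST cleanup
cudaFree(d_arr);                   // DEVICE cleanup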
1: HOST → DEVICE
2: EXECUTE
3: DEVICE → HOST

Steps 1 and 3 both use cudaMemcpy(); only the direction flag differs:

1: HOST → DEVICE
cudaMemcpy(dest, source, size, cudaMemcpyHostToDevice);
EXECUTION
__global__ void myKernel(int *a){}
myKernel<<<1,1>>>(d_arr);
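Putting the three steps and the kernel launch together, a minimal sketch (illustrative names and sizes, error checking omitted) might look like:

__global__ void myKernel(int *a) {
    // ... operate on a ...
}

int main(void) {
    int n = 1024;                   // illustrative element count
    size_t size = n * sizeof(int);
    int *arr = (int *)malloc(size); // HOST buffer
    int *d_arr;
    cudaMalloc(&d_arr, size);       // DEVICE buffer

    // 1: HOST → DEVICE
    cudaMemcpy(d_arr, arr, size, cudaMemcpyHostToDevice);

    // 2: EXECUTE (one block, one thread here)
    myKernel<<<1, 1>>>(d_arr);

    // 3: DEVICE → HOST
    cudaMemcpy(arr, d_arr, size, cudaMemcpyDeviceToHost);

    free(arr);
    cudaFree(d_arr);
    return 0;
}

The <<<1,1>>> configuration launches a single thread; the next example widens it to one thread per element.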
Let’s do an example
abcd
+
efgh
=
ijkl
threadIdx.x:  0  1  2  3
int index = threadIdx.x;
c[index] = a[index] + b[index];
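A hedged, self-contained version of this example, launching one thread per element; the 4-element size mirrors the abcd/efgh slides, with integers standing in for the letters:

#include <stdio.h>

__global__ void add(const int *a, const int *b, int *c) {
    int index = threadIdx.x;       // 0, 1, 2, 3: one thread per element
    c[index] = a[index] + b[index];
}

int main(void) {
    const int N = 4;
    int a[N] = {1, 2, 3, 4}, b[N] = {5, 6, 7, 8}, c[N];
    int *d_a, *d_b, *d_c;
    size_t size = N * sizeof(int);

    cudaMalloc(&d_a, size);
    cudaMalloc(&d_b, size);
    cudaMalloc(&d_c, size);

    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    add<<<1, N>>>(d_a, d_b, d_c);  // one block, N threads

    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++) printf("%d ", c[i]);   // prints 6 8 10 12

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}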
Let’s invent an ALGORITHM
K-Means Clustering
CODE
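The live code isn't captured in the transcript. As a sketch of how the assignment step of k-means maps onto the GPU — one thread per point, with all names (points, centroids, labels) assumed for illustration:

// Assignment step: each thread labels one point with its nearest centroid.
// Layout assumption: points is n x dim, centroids is k x dim, row-major.
__global__ void assignClusters(const float *points, const float *centroids,
                               int *labels, int n, int k, int dim) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int best = 0;
    float bestDist = 1e30f;
    for (int c = 0; c < k; c++) {
        float dist = 0.0f;                      // squared distance to centroid c
        for (int d = 0; d < dim; d++) {
            float diff = points[i * dim + d] - centroids[c * dim + d];
            dist += diff * diff;
        }
        if (dist < bestDist) { bestDist = dist; best = c; }
    }
    labels[i] = best;
}

The update step (recomputing each centroid as the mean of its assigned points) is a reduction, typically done with atomics or a second kernel; it is omitted here.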
Shared Memory
• ~48 KB of shared memory (see the sketch below)
• Multiple GB of device memory (100x higher latency)
• Access memory in order:
  1 2 3
  4 5 6
  7 8 9
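Since every thread reads the whole centroid table, it is a natural fit for shared memory; a hedged variant of the assignment kernel above, assuming k * dim floats fit in the ~48 KB budget:

__global__ void assignClustersShared(const float *points, const float *centroids,
                                     int *labels, int n, int k, int dim) {
    extern __shared__ float sCentroids[];      // k * dim floats
    for (int j = threadIdx.x; j < k * dim; j += blockDim.x)
        sCentroids[j] = centroids[j];          // cooperative copy into shared memory
    __syncthreads();                           // all threads reach this barrier

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int best = 0;
    float bestDist = 1e30f;
    for (int c = 0; c < k; c++) {
        float dist = 0.0f;
        for (int d = 0; d < dim; d++) {
            float diff = points[i * dim + d] - sCentroids[c * dim + d];
            dist += diff * diff;
        }
        if (dist < bestDist) { bestDist = dist; best = c; }
    }
    labels[i] = best;
}

// Launch with the dynamic shared memory size as the third <<<>>> argument:
// assignClustersShared<<<blocks, threads, k * dim * sizeof(float)>>>(...);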
Considerations
• Transistors are allocated to arithmetic, not memory. Sometimes it is better to recompute rather than cache.
• Copying to/from the HOST takes a while. Sometimes sequential operations can stay on the GPU.
• Avoid serialization (shared memory bank conflicts).
• Asynchronous memory operations (a sketch follows below).
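As a sketch of the last bullet: cudaMemcpyAsync on a stream lets transfers overlap with other work, provided the host buffer is pinned. The kernel name and sizes reuse the earlier illustrative myKernel:

int n = 1 << 20;                   // illustrative element count
size_t size = n * sizeof(int);

int *h_buf;
cudaMallocHost(&h_buf, size);      // pinned HOST memory: required for
                                   // truly asynchronous copies
int *d_buf;
cudaMalloc(&d_buf, size);

cudaStream_t stream;
cudaStreamCreate(&stream);

// Copy, kernel, and copy-back are queued in order on the stream and can
// overlap with independent work on other streams or on the CPU.
cudaMemcpyAsync(d_buf, h_buf, size, cudaMemcpyHostToDevice, stream);
myKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_buf);
cudaMemcpyAsync(h_buf, d_buf, size, cudaMemcpyDeviceToHost, stream);

cudaStreamSynchronize(stream);     // wait for all queued work to finish
cudaStreamDestroy(stream);
cudaFreeHost(h_buf);
cudaFree(d_buf);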