CUDA Programming
Basavaraj Talawar


Page 1:

CUDA Programming

Page 2:

Outline

● Basic CUDA program

Page 3:

CUDA C - Terminology

● Host – the CPU and its memory (host memory)
● Device – the GPU and its memory (device memory)

Page 4:

Compute Unified Device Architecture

● Hybrid CPU/GPU code
● Low-latency code runs on the CPU
– Result is immediately available
● High-latency, high-throughput code runs on the GPU
– Result travels back over the bus
– The GPU has many more cores than the CPU

Page 5:

Hello, World!

● The standard C program runs on the host
● NVIDIA’s compiler (nvcc) will not complain about CUDA programs with no device code
● At its simplest, CUDA C is just C!

int main( void ) {
    printf( "Hello, World!\n" );
    return 0;
}

Page 6:

Hello, World! with Device Code

__global__ void kernel( void ) {
}

int main( void ) {
    kernel<<<1,1>>>();
    printf( "Hello, World!\n" );
    return 0;
}

kernel() runs on the device and is called from the host.

Page 7:

Hello, World! with Device Code

● nvcc splits the source file into host and device components
– NVIDIA’s compiler handles device functions, e.g. kernel()
– The standard host compiler handles host functions like main()

Page 8:

Hello, World! with Device Code

● Triple angle brackets mark a call from host code to device code – a kernel launch
● The kernel executes on the GPU

Page 9:

CUDA/OpenCL – Execution Model

[Figure: a CUDA program's execution alternates between CPU serial code on the host and parallel kernels on the device, launched as KernelA<<< nBlk, nTid >>>(args)]

Page 10:

CUDA/OpenCL – Execution Model

● Heterogeneous host + device application C program
– Serial parts in host C code
– Parallel parts in device SPMD kernel C code

[Figure: serial host phases alternating with parallel kernel launches, as on the previous page]

Page 11:

A von Neumann Processor

● A thread is a “virtualized” or “abstracted” von Neumann processor

Page 12:

CUDA Program Example

Page 13:

Vector Addition Example

for (i = 0; i < N; i++) {
    C[i] = A[i] + B[i];
}

Page 14:

Vector Addition Example

for (i = 0; i < N; i++) {
    C[i] = A[i] + B[i];
}

[Figure: A[0] A[1] A[2] … A[N-1] added element-wise to B[0] B[1] B[2] … B[N-1], producing C[0] C[1] C[2] … C[N-1]]

Page 15:

Vector Addition Example

● Each thread adds one element from A[] and B[] and updates one element in C[]
– Data-level parallelism!

Page 16:

Parallel Threads

[Figure: arrays A, B, and C of 1024 elements each, indices 0 … 1023]

Page 17:

Parallel Threads

[Figure: one thread per element]

i = threadIdx.x;
C[i] = A[i] + B[i];

Page 18:

Arrays of Parallel Threads

[Figure: arrays A, B, and C of 1024 elements, divided into four 256-element segments: 0–255, 256–511, 512–767, 768–1023]

Page 19:

Arrays of Parallel Threads

[Figure: the four segments are covered by thread blocks with Block IDs 0, 1, 2, 3; thread IDs within each block are 0, 1, …, 255; block dimension = 256 (in the X direction)]

Page 20:

Arrays of Parallel Threads

[Figure: four blocks of 256 threads each]

Kernel_Call<<<No_of_Blocks, ThreadsPerBlock>>>(params)

Page 21:

Arrays of Parallel Threads

[Figure: four blocks of 256 threads each]

Kernel_Call<<<4, 256>>>(params)

Page 22:

Arrays of Parallel Threads

i = blockIdx.x * blockDim.x + threadIdx.x;
C[i] = A[i] + B[i];

Page 23:

Arrays of Parallel Threads

● A CUDA kernel is executed by a grid (array) of threads
– All threads in a grid run the same kernel code (SPMD)
– Each thread has indices that it uses to compute memory addresses and make control decisions

Page 24:

Thread Blocks: Scalable Cooperation

[Figure: the thread array divided into Thread Block 0, Thread Block 1, …, Thread Block N-1, each holding threads 0 … 255]

Page 25:

Thread Blocks: Scalable Cooperation

● The thread array is divided into multiple blocks
● Threads within a block cooperate via shared memory, atomic operations and barrier synchronization
● Threads in different blocks do not interact
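To make the cooperation concrete, here is a small sketch (our example, not from the slides) in which the threads of one block stage data in shared memory and synchronize with a barrier before reading elements written by other threads of the same block – something that is impossible across blocks:

```cuda
// Assumed example: each block reverses its own 256-element segment of d[].
#define BLOCK 256

__global__ void reverseSegments(float *d, int n)
{
    __shared__ float s[BLOCK];                 // visible to this block only
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) s[threadIdx.x] = d[i];
    __syncthreads();                           // barrier: all loads are done

    int j = blockDim.x - 1 - threadIdx.x;      // partner thread's slot
    int base = blockIdx.x * blockDim.x;
    if (i < n && base + j < n)                 // guard a partial last block
        d[i] = s[j];
}
```

Without the __syncthreads() barrier, a thread could read s[j] before its partner had written it.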

Page 26:

blockIdx and threadIdx

[Figure: a 1D block with threads 0 1 2 3, indexed by threadIdx.x]

Page 27:

blockIdx and threadIdx

[Figure: a 1D block (threadIdx.x = 0 … 3) alongside a 2D block whose threads are indexed (threadIdx.x, threadIdx.y) = (0,0) … (3,1)]

Page 28:

blockIdx and threadIdx

[Figure: 1D, 2D, and 3D blocks; threads are indexed by threadIdx.x, (threadIdx.x, threadIdx.y), or (threadIdx.x, threadIdx.y, threadIdx.z)]

Page 29:

blockIdx and threadIdx

[Figure: as before, with one block highlighted as Block (1,0)]

Page 30:

blockIdx and threadIdx

[Figure: a 2D arrangement of blocks – Block (0,0), (1,0), (0,1), (1,1) – indexed by (blockIdx.x, blockIdx.y)]

Page 31:

blockIdx and threadIdx

[Figure: the four blocks together form a grid]

Page 32:

blockIdx and threadIdx

[Figure: the grid of blocks resides on the device]

Page 33:

blockIdx and threadIdx

● Each thread uses indices to decide what data to work on

Page 34:

blockIdx and threadIdx

● Each thread uses indices to decide what data to work on
● blockIdx: 1D, 2D, or 3D (CUDA 4.0)

Page 35:

blockIdx and threadIdx

● Each thread uses indices to decide what data to work on
● blockIdx: 1D, 2D, or 3D (CUDA 4.0)
● threadIdx: 1D, 2D, or 3D

Page 36:

blockIdx and threadIdx

● Each thread uses indices to decide what data to work on
● blockIdx: 1D, 2D, or 3D (CUDA 4.0)
● threadIdx: 1D, 2D, or 3D
● Simplifies memory addressing when processing multidimensional data
– Image processing
– Solving PDEs on volumes
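As an illustration of 2D indexing (the kernel below is our sketch, not part of the slides), each thread handles one pixel of a w × h image stored row-major:

```cuda
// 2D blocks and a 2D grid map naturally onto image pixels.
__global__ void brighten(unsigned char *img, int w, int h, int delta)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (x < w && y < h) {                            // guard partial tiles
        int v = img[y * w + x] + delta;
        img[y * w + x] = v > 255 ? 255 : v;          // clamp to 8 bits
    }
}

// Launch (assumed configuration):
// dim3 block(16, 16);
// dim3 grid((w + 15) / 16, (h + 15) / 16);
// brighten<<<grid, block>>>(d_img, w, h, 40);
```

The y * w + x address computation is exactly the flattening that 2D indices make easy to express.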

Page 37:

CUDA C – Vector Addition Kernel

Page 38:

Vector Addition – C Code

int main()
{
    // Memory allocation for h_A, h_B, and h_C
    // I/O to read h_A and h_B, N elements
    ...
    vecAdd(h_A, h_B, h_C, N);
}

Page 39:

Vector Addition – C Code

// Compute vector sum C = A + B
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int i;
    for (i = 0; i < n; i++)
        h_C[i] = h_A[i] + h_B[i];
}

int main()
{
    // Memory allocation for h_A, h_B, and h_C
    // I/O to read h_A and h_B, N elements
    ...
    vecAdd(h_A, h_B, h_C, N);
}

Page 40:

vecAdd – CUDA Version

[Figure: host memory and CPU on one side, device memory and GPU on the other; A[], B[], C[] reside in host memory]

Step 1. Allocate space for the data in GPU memory

Page 41:

vecAdd – CUDA Version

[Figure: allocations for A, B, and C now exist in device memory]

Step 1. Allocate space for the data in GPU memory

Page 42:

vecAdd – CUDA Version

[Figure: A[] and B[] copied from host memory to device memory]

Step 1. Allocate space for the data in GPU memory
Step 2. Copy the data to GPU memory

Page 43:

vecAdd – CUDA Version

[Figure: the GPU computes on A[], B[], C[] in device memory]

Step 1. Allocate space for the data in GPU memory
Step 2. Copy the data to GPU memory
Step 3. Kernel launch

Page 44:

vecAdd – CUDA Version

[Figure: C[] copied from device memory back to host memory]

Step 1. Allocate space for the data in GPU memory
Step 2. Copy the data to GPU memory
Step 3. Kernel launch
Step 4. Copy the result back to host main memory

Page 45:

vecAdd – CUDA Version

[Figure: device allocations released]

Step 1. Allocate space for the data in GPU memory
Step 2. Copy the data to GPU memory
Step 3. Kernel launch
Step 4. Copy the result back to host main memory
Step 5. Free device memory

Page 46:

vecAdd – CUDA Host Code

#include <cuda.h>

void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    // 1. Allocate device memory for A, B, and C;
    //    copy A and B to device memory

    // 2. Kernel launch code – the device performs the
    //    actual vector addition

    // 3. Copy C from the device memory;
    //    free the device vectors
}

Page 47:

CUDA Memories – Quick Overview

Page 48:

CUDA Memories – Quick Overview

● Device code can:
– R/W per-thread registers
– R/W all-shared global memory

Page 49:

CUDA Memories – Quick Overview

● Device code can:
– R/W per-thread registers
– R/W all-shared global memory
● Host code can:
– Transfer data to/from per-grid global memory

Page 50:

CUDA Device Memory Management API

cudaMalloc()
– Allocates an object in device global memory
– Two parameters:
• Address of a pointer to the allocated object
• Size of the allocated object in bytes

cudaMalloc((void **) &d_A, size);

Page 51:

CUDA Device Memory Management API

cudaMalloc()
– Allocates an object in device global memory
– Two parameters:
• Address of a pointer to the allocated object
• Size of the allocated object in bytes

cudaFree()
– Frees an object from device global memory
– One parameter: pointer to the freed object

cudaMalloc((void **) &d_A, size);
cudaFree(d_A);

Page 52:

Host-Device Data Transfer API

cudaMemcpy()
– Memory data transfer
– Four parameters:
• Pointer to destination, pointer to source, number of bytes copied, type/direction of transfer
– Transfer to device is asynchronous

cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);

Page 53:

Vector Addition Host Code

void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    cudaMalloc((void **) &d_A, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &d_B, size);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &d_C, size);

    // Kernel invocation code – not shown here

    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}

Page 54:

Error Checking for API

cudaError_t err = cudaMalloc((void **) &d_A, size);

if (err != cudaSuccess) {
    printf("%s in %s at line %d\n",
           cudaGetErrorString(err), __FILE__, __LINE__);
    exit(EXIT_FAILURE);
}
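Checking every call this way is verbose, so a common convention (our suggestion, not shown in the slides) is to wrap the pattern in a macro:

```cuda
#include <stdio.h>
#include <stdlib.h>

// Wrap any CUDA API call so failures report the offending file and line.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s in %s at line %d\n",                  \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage:
// CUDA_CHECK(cudaMalloc((void **) &d_A, size));
// CUDA_CHECK(cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice));
```

The do { ... } while (0) wrapper makes the macro behave like a single statement, so it is safe inside an unbraced if/else.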

Page 55:

Vector Addition Kernel

// Compute vector sum C = A + B
// Each thread performs one pair-wise addition
__global__
void vecAddKernel(float *A, float *B, float *C, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n)
        C[i] = A[i] + B[i];
}

Page 56:

Host Code – Kernel Launch

int vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    // ..., cudaMalloc(), cudaMemcpy(), ...

    // Run ceil(n/256.0) blocks of 256 threads each
    vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);
    // ...
}

Page 57:

Better Kernel Launch Code

int vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    // ..., cudaMalloc(), cudaMemcpy(), ...

    // Run ceil(n/256.0) blocks of 256 threads each
    dim3 DimGrid((n - 1) / 256 + 1, 1, 1);
    dim3 DimBlock(256, 1, 1);
    vecAddKernel<<<DimGrid, DimBlock>>>(d_A, d_B, d_C, n);
    // ...
}

Page 58:

Kernel Execution

__host__
int vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    ...
    dim3 DimGrid((n - 1) / 256 + 1, 1, 1);
    dim3 DimBlock(256, 1, 1);
    vecAddKernel<<<DimGrid, DimBlock>>>(d_A, d_B, d_C, n);
    ...
}

__global__
void vecAddKernel(float *A, float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        C[i] = A[i] + B[i];
}
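Putting the pieces from the last few pages together, a complete compilable program might look like this (main() and the test data are our additions; the kernel and vecAdd() follow the slides):

```cuda
#include <stdio.h>
#include <stdlib.h>

__global__ void vecAddKernel(float *A, float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    cudaMalloc((void **) &d_A, size);                     // Step 1: allocate
    cudaMalloc((void **) &d_B, size);
    cudaMalloc((void **) &d_C, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);   // Step 2: copy in
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    dim3 DimGrid((n - 1) / 256 + 1, 1, 1);                // Step 3: launch
    dim3 DimBlock(256, 1, 1);
    vecAddKernel<<<DimGrid, DimBlock>>>(d_A, d_B, d_C, n);

    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);   // Step 4: copy out
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);          // Step 5: free
}

int main(void)
{
    int n = 1024;
    float *h_A = (float *) malloc(n * sizeof(float));
    float *h_B = (float *) malloc(n * sizeof(float));
    float *h_C = (float *) malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) { h_A[i] = i; h_B[i] = 2.0f * i; }

    vecAdd(h_A, h_B, h_C, n);

    printf("C[10] = %f\n", h_C[10]);
    free(h_A); free(h_B); free(h_C);
    return 0;
}
```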

Page 59:

CUDA Function Declarations

● __global__ defines a kernel function
– A kernel function must return void
● __device__ and __host__ can be used together
● __host__ is optional if used alone
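Combining the qualifiers lets one definition serve both sides; a small sketch (the function names are ours, not from the slides):

```cuda
// clampf() is compiled twice by nvcc: once for the host, once for the device.
__host__ __device__ float clampf(float x, float lo, float hi)
{
    return x < lo ? lo : (x > hi ? hi : x);
}

__global__ void clampKernel(float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] = clampf(v[i], 0.0f, 1.0f);   // device-side call
}

// Host code may also call clampf(0.5f, 0.0f, 1.0f) as ordinary C.
```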

Page 60:

Compiling a CUDA Program

[Flow: integrated C programs with CUDA extensions → NVCC compiler → host code goes to the host C compiler/linker, device code to the device just-in-time compiler → heterogeneous computing platform with CPUs, GPUs, etc.]
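In practice the whole pipeline above is driven by one command (the file and program names here are our examples; this assumes the CUDA toolkit is installed):

```shell
# nvcc separates host and device code, compiles each, and links one binary.
nvcc vecadd.cu -o vecadd
./vecadd
```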