Basic CUDA Programming
Computer Architecture 2014 (Prof. Chih-Wei Liu)
Final Project – CUDA Tutorial
TA Cheng-Yen Yang ([email protected])
From Graphics to General Purpose Processing – CPU vs GPU
CPU: general purpose computation (SISD)
GPU: data-parallel computation (SIMD)
What is CUDA?
Compute Unified Device Architecture
Hardware or software? A programming model; a parallel computing platform.
[Figure 3.2 (courtesy NVIDIA): an example of CUDA thread organization. The host runs Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid contains blocks, e.g. Block(0,0) through Block(1,1), and each block contains threads, e.g. Thread(0,0,0) through Thread(3,1,0).]
[Figure: the G80 hardware in CUDA mode. The host feeds an input assembler and a thread execution manager; processor clusters with parallel data caches and texture units reach global memory through load/store units.]
[Figure: a thread program runs as an array of threads with IDs 0, 1, 2, 3, …, m; the software stack layers the Application on CUDA on the Platform.]
Heterogeneous computing: CPU+GPU Cooperation
[Figure: a CPU (host) working with a GPU that has its own local DRAM (the device)]
Host-Device Architecture
Heterogeneous computing: NVIDIA CUDA Compiler (NVCC)
NVCC separates the source code into host and device parts.
For host code, NVCC invokes a typical C compiler such as GCC, the Intel C compiler, or the MS C compiler.
All device code is compiled by NVCC itself; device source files should use the ".cu" extension.
Every executable with CUDA code requires:
the CUDA core library (cuda)
the CUDA runtime library (cudart)
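For example, a typical build might look like the following (the file and program names are illustrative, not from the slides); NVCC compiles the .cu files itself and hands the host code to the host C compiler:

nvcc -o lab main.cu device.cu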
Heterogeneous computing: CUDA Code Execution (1/2)
[Figure: the host alternates serial code with kernel launches; memory transfers move data to the device before each parallel section and back after it, while the kernels run on the device.]
Heterogeneous computing: CUDA Code Execution (2/2)
Heterogeneous computing: NVIDIA G80 series
Texture Processor Cluster (TPC)
Streaming Multiprocessor (SM)
Streaming Processor (SP) / CUDA core
Special Function Unit (SFU)
For the NV GeForce 8800 (G80), the number of SPs is 128.
Heterogeneous computing: NVIDIA G80 series – CUDA mode
[Figure: the G80 diagram again: host, input assembler, thread execution manager, parallel data caches, texture units, and load/store paths to global memory.]
CUDA Programming Model (1/7)
CUDA defines:
a programming model
a memory model
These help developers map current applications or algorithms onto CUDA devices more easily and clearly.
NVIDIA GPUs have a different architecture from common CPUs; it is important to follow CUDA's programming model to obtain higher program-execution performance.
CUDA Programming Model (2/7)
C/C++ for CUDA: a subset of C with extensions, plus C++ templates for GPU code.
CUDA goals:
Scale code to 100s of cores and 1000s of parallel threads.
Facilitate heterogeneous computing: CPU + GPU.
CUDA defines:
a programming model
a memory model
CUDA Programming Model (3/7)
CUDA Kernels and Threads:
Parallel portions of an application are executed on the device as kernels, and only one kernel is executed at a time.
All threads execute the same kernel at any given time.
Differences between CUDA and CPU threads:
CUDA threads are extremely lightweight: very little creation overhead, fast switching.
CUDA uses 1000s of threads to achieve efficiency; multi-core CPUs can only use a few.
CUDA Programming Model (4/7)
Arrays of Parallel Threads:
A CUDA kernel is executed by an array of threads.
All threads run the same code.
Each thread has an ID that it uses to compute memory addresses and make control decisions.
CUDA Programming Model (5/7)
Thread Batching:
A kernel launches a grid of thread blocks.
Threads within a block cooperate via shared memory.
Threads in different blocks cannot cooperate.
This allows programs to transparently scale to different GPUs.
CUDA Programming Model (6/7)
CUDA Programming Model:
A kernel is executed by a grid of thread blocks; the grid can be 1D or 2D.
A thread block is a batch of threads; the threads in a block can be arranged in 1D, 2D, or 3D.
Within a block, data can be shared through shared memory and execution can be synchronized (see the sketch below).
But threads from different blocks cannot cooperate.
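As a minimal sketch of block-level cooperation (not from the original slides; the 256-thread block size is an assumption), the kernel below stages data in shared memory and synchronizes before using it:

__global__ void reverse_in_block( int *d_out, const int *d_in )
{
    __shared__ int tile[256];                       // visible to the whole block
    int base = blockIdx.x * blockDim.x;
    tile[threadIdx.x] = d_in[base + threadIdx.x];   // each thread loads one element
    __syncthreads();                                // wait until the tile is complete
    d_out[base + threadIdx.x] = tile[blockDim.x - 1 - threadIdx.x];
}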
CUDA Programming Model (7/7)
Memory Model:
Registers: per thread; data lifetime = thread lifetime.
Local memory: per-thread off-chip memory (physically in device DRAM); data lifetime = thread lifetime.
Shared memory: per-thread-block on-chip memory; data lifetime = block lifetime.
Global (device) memory: accessible by all threads as well as the host (CPU); data lifetime = from allocation to deallocation.
Host (CPU) memory: not directly accessible by CUDA threads.
CUDA C Basic
GPU Memory Allocation/Release
Memory allocation on the GPU:
cudaMalloc(void **pointer, size_t nbytes)
Preset a specific memory area to a value:
cudaMemset(void *pointer, int value, size_t count)
Release a memory allocation:
cudaFree(void *pointer)

int n = 1024;
int nbytes = n * sizeof(int);
int *d_a = 0;
cudaMalloc( (void**)&d_a, nbytes );
cudaMemset( d_a, 0, nbytes );
cudaFree( d_a );
Data Copies
cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction);
direction specifies the locations (host or device) of src and dst.
Blocks the CPU thread: returns only after the copy is complete.
Doesn't start copying until previous CUDA calls complete.
enum cudaMemcpyKind:
cudaMemcpyHostToDevice
cudaMemcpyDeviceToHost
cudaMemcpyDeviceToDevice
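A minimal usage sketch (buffer names are illustrative):

int h_a[256];                      // host buffer
int *d_a = 0;                      // device buffer
cudaMalloc( (void**)&d_a, sizeof(h_a) );
cudaMemcpy( d_a, h_a, sizeof(h_a), cudaMemcpyHostToDevice );   // host -> device
// ... launch kernels that read/write d_a ...
cudaMemcpy( h_a, d_a, sizeof(h_a), cudaMemcpyDeviceToHost );   // device -> host
cudaFree( d_a );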
Function Qualifiers
__global__ : invoked from within host (CPU) code; cannot be called from device (GPU) code; must return void.
__device__ : called from other GPU functions; cannot be called from host (CPU) code.
__host__ : can only be executed by the CPU; called from the host.
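A minimal sketch showing the three qualifiers together (function names are illustrative):

__device__ float square( float x ) { return x * x; }      // GPU-only helper

__global__ void square_all( float *a, int N )              // launched from host code
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = square(a[idx]);
}

__host__ void run( float *d_a, int N )                     // ordinary CPU function
{
    square_all<<< (N + 255) / 256, 256 >>>(d_a, N);
}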
Variable Qualifiers (GPU code)
__device__ : stored in device memory (large capacity, high latency, uncached).
Allocated with cudaMalloc (the __device__ qualifier is implied).
Accessible by all threads.
Lifetime: application.
__shared__ : stored in on-chip shared memory (SRAM, low latency).
Allocated by the execution configuration or at compile time.
Accessible by all threads in the same thread block.
Lifetime: duration of the thread block.
Unqualified variables: scalars and built-in vector types are stored in registers.
Arrays may be in registers or local memory (registers are not addressable).
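A minimal sketch of the qualifiers in use (names are illustrative; assumes a launch with 256 threads per block):

__device__ float d_scale = 2.0f;           // device memory, lifetime = application

__global__ void block_sum( const float *in, float *out )
{
    __shared__ float s_buf[256];           // on-chip, lifetime = this thread block
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // unqualified scalar: a register
    s_buf[threadIdx.x] = in[idx] * d_scale;
    __syncthreads();
    if (threadIdx.x == 0) {
        float sum = 0.0f;                  // also a register
        for (int i = 0; i < blockDim.x; i++) sum += s_buf[i];
        out[blockIdx.x] = sum;
    }
}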
CUDA Built-in Device Variables
All __global__ and __device__ functions have access to these automatically defined variables
dim3 gridDim; Dimensions of the grid in blocks (at most 2D)
dim3 blockDim; Dimensions of the block in threads
dim3 blockIdx; Block index within the grid
dim3 threadIdx; Thread index within the block
Executing Code on the GPU
Kernels are C functions with some restrictions:
Can only access GPU memory.
Must have a void return type.
No variable number of arguments ("varargs").
Not recursive.
No static variables.
Function arguments are automatically copied from CPU to GPU memory.
Launching Kernels
Modified C function-call syntax:
kernel<<<dim3 grid, dim3 block>>>(…);
Execution configuration ("<<< >>>"):
Grid dimensions: x and y.
Thread-block dimensions: x, y, and z.

dim3 grid(16,16);
dim3 block(16,16);
kernel1<<<grid, block>>>(…);
kernel2<<<32, 512>>>(…);
Data Decomposition
Often we want each thread in a kernel to access a different element of an array.
[Figure: with blockDim.x = 5, three blocks cover global indices 0 through 14]
idx = blockIdx.x * blockDim.x + threadIdx.x
Data Decomposition Example: Increment Array Elements
Increment an N-element vector a by a scalar b.

CPU program:
void increment_cpu(float *a, float b, int N)
{
    for (int idx = 0; idx < N; idx++)
        a[idx] = a[idx] + b;
}

void main()
{
    …
    increment_cpu(a, b, N);
}

CUDA program:
__global__ void increment_gpu(float *a, float b, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] = a[idx] + b;
}

void main()
{
    …
    dim3 dimBlock(blocksize);
    dim3 dimGrid(ceil(N / (float)blocksize));
    increment_gpu<<<dimGrid, dimBlock>>>(a, b, N);
}
LAB0: Setup CUDA Environment & Device Query
CUDA Environment Setup
Check your NVIDIA GPU's compute capability (i.e., the GPU's generation):
https://developer.nvidia.com/cuda-gpus
Download the CUDA development files:
https://developer.nvidia.com/cuda-toolkit-archive
CUDA driver
CUDA toolkit
(PC Room ED417B used version 2.3.)
CUDA SDK (optional)
Install CUDA, then test it with Device Query (check it in the sample codes).
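A stripped-down sketch of what such a query does with the runtime API:

#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s, compute capability %d.%d\n",
               i, prop.name, prop.major, prop.minor);
    }
    return 0;
}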
Setup CUDA for MS Visual Studio (Windows XP / VS2005~2010)
In PC Room ED417B: CUDA device: NV 8400GS
CUDA toolkit version: 2.0 or 2.3
VS2005 or VS2008
From scratch: http://forums.nvidia.com/index.php?showtopic=30273
CUDA VS Wizard: http://sourceforge.net/projects/cudavswizard/
Or modify an existing project.
Setup CUDA for MS Visual Studio 2012 (Win7/VS2012)
MS Visual Studio 2012 installed.
Download and install CUDA toolkit 5.0:
https://developer.nvidia.com/cuda-toolkit-50-archive
Follow the instructions on this blog:
http://blog.norture.com/2012/10/gpu-parallel-programming-in-vs2012-with-nvidia-cuda/
You can start writing a CUDA program by creating a new project from VS2012's template.
If you have projects from a previous version of VS, it is recommended to re-create them using VS2012's new template.
CUDA Device Query Example (CUDA toolkit 6.0)
Lab1: First CUDA Program
Yet Another Random Number Sequence Generator
Yet Another Random Number Sequence Generator
Implemented on both the CPU and the GPU. Functionality:
Given a random integer array A holding 8192 elements, generated by rand().
Re-generate each random number by repeated multiplication, 256 times, without regard to any overflow that occurs:
B[i]new = B[i]old * A[i] (CPU)
C[i]new = C[i]old * A[i] (GPU)
Check the consistency between the CPU and GPU results.
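As a reference point, the CPU side can be sketched as below. This is an illustrative sketch of the spec above, assuming B starts as a copy of A and that integer overflow simply wraps:

void CPU_kernel( int *B, int *A )
{
    for ( int i = 0 ; i < SIZE ; i++ ) {
        B[i] = A[i];                 // assumption: B is initialized from A
        for ( int j = 0 ; j < 256 ; j++ )
            B[i] = B[i] * A[i];      // B[i]new = B[i]old * A[i], 256 times
    }
}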
Data Manipulation between Host and Device
cudaError_t cudaMalloc( void** devPtr, size_t count )
Allocates count bytes of linear memory on the device and returns a pointer to the allocated memory in *devPtr.
cudaError_t cudaMemcpy( void* dst, const void* src, size_t count, enum cudaMemcpyKind kind )
Copies count bytes from the memory area pointed to by src to the memory area pointed to by dst.
kind indicates the type of memory transfer: cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice.
cudaError_t cudaFree( void* devPtr )
Frees the memory space pointed to by devPtr.
float GPU_kernel(int *B, int *A) {
    // Create pointers for memory space on device
    int *dA, *dB;

    // Allocate memory space on device
    cudaMalloc( (void**) &dA, sizeof(int)*SIZE );
    cudaMalloc( (void**) &dB, sizeof(int)*SIZE );

    // Copy data to be calculated
    cudaMemcpy( dA, A, sizeof(int)*SIZE, cudaMemcpyHostToDevice );
    cudaMemcpy( dB, B, sizeof(int)*SIZE, cudaMemcpyHostToDevice );

    // Launch kernel
    cuda_kernel<<<1,1>>>(dB, dA);

    // Copy output back
    cudaMemcpy( B, dB, sizeof(int)*SIZE, cudaMemcpyDeviceToHost );

    // Free memory spaces on device
    cudaFree( dA );
    cudaFree( dB );

    return 0.0f;   // placeholder; the lab skeleton presumably returns a timing value here
}
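For reference, the single-thread kernel that this <<<1,1>>> launch expects could look like the sketch below; it is an illustrative sketch based on the lab spec (B[i]new = B[i]old * A[i], 256 times), not the distributed solution:

__global__ void cuda_kernel( int *B, int *A )
{
    // One thread walks the entire array.
    for ( int i = 0 ; i < SIZE ; i++ )
        for ( int j = 0 ; j < 256 ; j++ )
            B[i] = B[i] * A[i];
}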
Now, go and finish your first CUDA program !!!
Download http://twins.ee.nctu.edu.tw/~skchen/lab1.zip
Open the project with Visual C++ 2008 ( lab1/cuda_lab/cuda_lab.vcproj )
main.cu: random input generation, output validation, result reporting
device.cu: launches the GPU kernel; contains the GPU kernel code
parameter.h
Fill in the appropriate APIs in GPU_kernel() in device.cu.
Lab2: Make the Parallel Code Faster
Yet Another Random Number Sequence Generator
Parallel Processing in CUDA
Parallel code can be partitioned into blocks and threads:
cuda_kernel<<<nBlk, nTid>>>(…)
Multiple tasks will be initialized, each with a different block ID and thread ID.
The tasks are dynamically scheduled:
Tasks within the same block will be scheduled on the same streaming multiprocessor.
Each task takes care of a single data partition according to its block ID and thread ID.
Locate Data Partition by Built-in Variables
Built-in variables:
gridDim: x, y
blockIdx: x, y
blockDim: x, y, z
threadIdx: x, y, z
[Figure 3.2 repeated: grids of thread blocks and blocks of threads, as shown earlier]
Data Partition for Previous Example
When processing 64 integer data with cuda_kernel<<<2, 2>>>(…), four tasks are created:
TASK 0: blockIdx.x = 0, threadIdx.x = 0
TASK 1: blockIdx.x = 0, threadIdx.x = 1
TASK 2: blockIdx.x = 1, threadIdx.x = 0
TASK 3: blockIdx.x = 1, threadIdx.x = 1
Each task covers a contiguous slice of the array described by a head index and a length:
int total_task = gridDim.x * blockDim.x;
int task_sn = blockIdx.x * blockDim.x + threadIdx.x;
int length = SIZE / total_task;
int head = task_sn * length;
Processing Single Data Partition
__global__ void cuda_kernel( int *B, int *A )
{
    int total_task = gridDim.x * blockDim.x;
    int task_sn = blockDim.x * blockIdx.x + threadIdx.x;
    int length = SIZE / total_task;
    int head = task_sn * length;
    // Each task walks its own contiguous slice; the slide's "B[i] = A[i]^256"
    // update is written out here as 256 successive multiplications.
    for ( int i = head ; i < head + length ; i++ ) {
        for ( int j = 0 ; j < 256 ; j++ )
            B[i] = B[i] * A[i];
    }
    return;
}
Parallelize Your Program !!!
Partition the kernel into threads:
Increase nTid from 1 to 512.
Keep nBlk = 1.
Group threads into blocks:
Adjust nBlk and see if it helps.
Maintain the total number of threads below 512, and make sure that 8192 is divisible by that number.
e.g. nBlk * nTid < 512
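To see which configuration is faster, each launch can be timed with CUDA events; a minimal sketch using the runtime event API (variable names follow the lab):

cudaEvent_t start, stop;
float elapsed_ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);               // enqueue before the launch
cuda_kernel<<<nBlk, nTid>>>(dB, dA);
cudaEventRecord(stop, 0);                // enqueue after the launch
cudaEventSynchronize(stop);              // wait for the kernel to finish
cudaEventElapsedTime(&elapsed_ms, start, stop);   // milliseconds between events
cudaEventDestroy(start);
cudaEventDestroy(stop);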
Lab3: Resolve Memory Contention
Yet Another Random Number Sequence Generator
Parallel Memory Architecture
Memory is divided into banks to achieve high bandwidth
Each bank can service one address per cycle
Successive 32-bit words are assigned to successive banks
[Figure: shared memory banks BANK0 through BANK15]
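For a device with 16 banks (such as the G80 generation), the bank holding a given 32-bit word is simply:

int bank = word_index % 16;   // successive 32-bit words map to successive banks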
Lab2 Review
When processing 64 integer data with cuda_kernel<<<1, 4>>>(…), each thread walks a contiguous 16-element slice, so in every iteration the four threads hit the same bank:
Iteration 1: THREAD0..THREAD3 access A[0], A[16], A[32], A[48], which all fall in the same bank. CONFLICT!
Iteration 2: THREAD0..THREAD3 access A[1], A[17], A[33], A[49], again all in one bank. CONFLICT!
How about Interleaved Accessing?
When processing 64 integer data with cuda_kernel<<<1, 4>>>(…), the threads access consecutive elements, which land in different banks:
Iteration 1: THREAD0..THREAD3 access A[0], A[1], A[2], A[3]. NO CONFLICT.
Iteration 2: THREAD0..THREAD3 access A[4], A[5], A[6], A[7]. NO CONFLICT.
Implementation of Interleaved Accessing
With cuda_kernel<<<1, 4>>>(…), each task starts at its own serial number and strides through the array by the total number of tasks:
head = task_sn
stripe = total_task
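Putting the pieces together, an interleaved version of the kernel might look like the sketch below, consistent with the head/stripe definitions above (again a sketch, not the distributed solution):

__global__ void cuda_kernel( int *B, int *A )
{
    int total_task = gridDim.x * blockDim.x;
    int task_sn = blockDim.x * blockIdx.x + threadIdx.x;
    int head = task_sn;        // each task starts at its own index
    int stripe = total_task;   // and strides by the total task count
    for ( int i = head ; i < SIZE ; i += stripe )
        for ( int j = 0 ; j < 256 ; j++ )
            B[i] = B[i] * A[i];
}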
Improve Your Program !!!
Modify the original kernel code in an interleaving manner:
cuda_kernel() in device.cu
Adjust nBlk and nTid as in Lab2 and examine the effect.
Maintain the total number of threads below 512, and make sure that 8192 is divisible by that number.
e.g. nBlk * nTid < 512
Thank You
http://twins.ee.nctu.edu.tw/~skchen/lab3.zip
Final project issues:
Subject: porting & optimizing any algorithm on any multi-core platform
Demo: 1 week after the final exam @ ED412
Group: 1~2 people per group
* Group members & demo time should be registered after the final exam @ ED412