Basic CUDA Programming
Computer Architecture 2014 (Prof. Chih-Wei Liu)
Final Project – CUDA Tutorial
TA Cheng-Yen Yang ([email protected])
From Graphics to General Purpose Processing – CPU vs GPU
CPU: general purpose computation (SISD)
GPU: data-parallel computation (SIMD)
What is CUDA?
Compute Unified Device Architecture
Hardware or software? A programming model; a parallel computing platform.
[Figure 3.2 (courtesy NVIDIA): an example of CUDA thread organization. The host runs Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid contains blocks, e.g. Block(0,0) through Block(1,1), and each block contains threads, e.g. Thread(0,0,0) through Thread(3,1,0).]
[Figure: the G80 hardware in CUDA mode. The host feeds an input assembler and a thread execution manager; processor clusters with parallel data caches and texture units reach global memory through load/store units.]
[Figure: a thread program runs as an array of threads with IDs 0, 1, 2, 3, …, m; the software stack layers the Application on CUDA on the Platform.]
Heterogeneous computing: CPU+GPU Cooperation
[Figure: a CPU (host) working with a GPU that has its own local DRAM (the device)]
Host-Device Architecture
Heterogeneous computing: NVIDIA CUDA Compiler (NVCC)
NVCC separates the source code into host and device parts.
For host code, NVCC invokes a typical C compiler such as GCC, the Intel C compiler, or the MS C compiler.
All device code is compiled by NVCC itself; device source files should use the ".cu" extension.
Every executable with CUDA code requires:
the CUDA core library (cuda)
the CUDA runtime library (cudart)
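For example, a typical build might look like the following (the file and program names are illustrative, not from the slides); NVCC compiles the .cu files itself and hands the host code to the host C compiler:

nvcc -o lab main.cu device.cu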
Heterogeneous computing: CUDA Code Execution (1/2)
[Figure: the host alternates serial code with kernel launches; memory transfers move data to the device before each parallel section and back after it, while the kernels run on the device.]
Heterogeneous computing: CUDA Code Execution (2/2)
Heterogeneous computing: NVIDIA G80 series
Texture Processor Cluster (TPC)
Streaming Multiprocessor (SM)
Streaming Processor (SP) / CUDA core
Special Function Unit (SFU)
For the NV GeForce 8800 (G80), the number of SPs is 128.
Heterogeneous computing: NVIDIA G80 series – CUDA mode
[Figure: the G80 diagram again: host, input assembler, thread execution manager, parallel data caches, texture units, and load/store paths to global memory.]
CUDA Programming Model (1/7)
CUDA defines:
a programming model
a memory model
These help developers map current applications or algorithms onto CUDA devices more easily and clearly.
NVIDIA GPUs have a different architecture from common CPUs; it is important to follow CUDA's programming model to obtain higher program-execution performance.
CUDA Programming Model (2/7)
C/C++ for CUDA: a subset of C with extensions, plus C++ templates for GPU code.
CUDA goals:
Scale code to 100s of cores and 1000s of parallel threads.
Facilitate heterogeneous computing: CPU + GPU.
CUDA defines:
a programming model
a memory model
CUDA Programming Model (3/7)
CUDA Kernels and Threads:
Parallel portions of an application are executed on the device as kernels, and only one kernel is executed at a time.
All threads execute the same kernel at any given time.
Differences between CUDA and CPU threads:
CUDA threads are extremely lightweight: very little creation overhead, fast switching.
CUDA uses 1000s of threads to achieve efficiency; multi-core CPUs can only use a few.
CUDA Programming Model (4/7)
Arrays of Parallel Threads:
A CUDA kernel is executed by an array of threads.
All threads run the same code.
Each thread has an ID that it uses to compute memory addresses and make control decisions.
CUDA Programming Model (5/7)
Thread Batching:
A kernel launches a grid of thread blocks.
Threads within a block cooperate via shared memory.
Threads in different blocks cannot cooperate.
This allows programs to transparently scale to different GPUs.
CUDA Programming Model (6/7)
CUDA Programming Model:
A kernel is executed by a grid of thread blocks; the grid can be 1D or 2D.
A thread block is a batch of threads; the threads in a block can be arranged in 1D, 2D, or 3D.
Within a block, data can be shared through shared memory and execution can be synchronized (see the sketch below).
But threads from different blocks cannot cooperate.
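As a minimal sketch of block-level cooperation (not from the original slides; the 256-thread block size is an assumption), the kernel below stages data in shared memory and synchronizes before using it:

__global__ void reverse_in_block( int *d_out, const int *d_in )
{
    __shared__ int tile[256];                       // visible to the whole block
    int base = blockIdx.x * blockDim.x;
    tile[threadIdx.x] = d_in[base + threadIdx.x];   // each thread loads one element
    __syncthreads();                                // wait until the tile is complete
    d_out[base + threadIdx.x] = tile[blockDim.x - 1 - threadIdx.x];
}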
CUDA Programming Model (7/7)
Memory Model:
Registers: per thread; data lifetime = thread lifetime.
Local memory: per-thread off-chip memory (physically in device DRAM); data lifetime = thread lifetime.
Shared memory: per-thread-block on-chip memory; data lifetime = block lifetime.
Global (device) memory: accessible by all threads as well as the host (CPU); data lifetime = from allocation to deallocation.
Host (CPU) memory: not directly accessible by CUDA threads.
CUDA C Basic
GPU Memory Allocation/Release
Memory allocation on the GPU:
cudaMalloc(void **pointer, size_t nbytes)
Preset a specific memory area to a value:
cudaMemset(void *pointer, int value, size_t count)
Release a memory allocation:
cudaFree(void *pointer)

int n = 1024;
int nbytes = n * sizeof(int);
int *d_a = 0;
cudaMalloc( (void**)&d_a, nbytes );
cudaMemset( d_a, 0, nbytes );
cudaFree( d_a );
Data Copies
cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction);
direction specifies the locations (host or device) of src and dst.
Blocks the CPU thread: returns only after the copy is complete.
Doesn't start copying until previous CUDA calls complete.
enum cudaMemcpyKind:
cudaMemcpyHostToDevice
cudaMemcpyDeviceToHost
cudaMemcpyDeviceToDevice
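A minimal usage sketch (buffer names are illustrative):

int h_a[256];                      // host buffer
int *d_a = 0;                      // device buffer
cudaMalloc( (void**)&d_a, sizeof(h_a) );
cudaMemcpy( d_a, h_a, sizeof(h_a), cudaMemcpyHostToDevice );   // host -> device
// ... launch kernels that read/write d_a ...
cudaMemcpy( h_a, d_a, sizeof(h_a), cudaMemcpyDeviceToHost );   // device -> host
cudaFree( d_a );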
Function Qualifiers
__global__ : invoked from within host (CPU) code; cannot be called from device (GPU) code; must return void.
__device__ : called from other GPU functions; cannot be called from host (CPU) code.
__host__ : can only be executed by the CPU; called from the host.
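A minimal sketch showing the three qualifiers together (function names are illustrative):

__device__ float square( float x ) { return x * x; }      // GPU-only helper

__global__ void square_all( float *a, int N )              // launched from host code
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = square(a[idx]);
}

__host__ void run( float *d_a, int N )                     // ordinary CPU function
{
    square_all<<< (N + 255) / 256, 256 >>>(d_a, N);
}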
Variable Qualifiers (GPU code)
__device__ : stored in device memory (large capacity, high latency, uncached).
Allocated with cudaMalloc (the __device__ qualifier is implied).
Accessible by all threads.
Lifetime: application.
__shared__ : stored in on-chip shared memory (SRAM, low latency).
Allocated by the execution configuration or at compile time.
Accessible by all threads in the same thread block.
Lifetime: duration of the thread block.
Unqualified variables: scalars and built-in vector types are stored in registers.
Arrays may be in registers or local memory (registers are not addressable).
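A minimal sketch of the qualifiers in use (names are illustrative; assumes a launch with 256 threads per block):

__device__ float d_scale = 2.0f;           // device memory, lifetime = application

__global__ void block_sum( const float *in, float *out )
{
    __shared__ float s_buf[256];           // on-chip, lifetime = this thread block
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // unqualified scalar: a register
    s_buf[threadIdx.x] = in[idx] * d_scale;
    __syncthreads();
    if (threadIdx.x == 0) {
        float sum = 0.0f;                  // also a register
        for (int i = 0; i < blockDim.x; i++) sum += s_buf[i];
        out[blockIdx.x] = sum;
    }
}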
CUDA Built-in Device Variables
All __global__ and __device__ functions have access to these automatically defined variables
dim3 gridDim; Dimensions of the grid in blocks (at most 2D)
dim3 blockDim; Dimensions of the block in threads
dim3 blockIdx; Block index within the grid
dim3 threadIdx; Thread index within the block
Executing Code on the GPU
Kernels are C functions with some restrictions:
Can only access GPU memory.
Must have a void return type.
No variable number of arguments ("varargs").
Not recursive.
No static variables.
Function arguments are automatically copied from CPU to GPU memory.
Launching Kernels
Modified C function-call syntax:
kernel<<<dim3 grid, dim3 block>>>(…);
Execution configuration ("<<< >>>"):
Grid dimensions: x and y.
Thread-block dimensions: x, y, and z.

dim3 grid(16,16);
dim3 block(16,16);
kernel1<<<grid, block>>>(…);
kernel2<<<32, 512>>>(…);
Data Decomposition
Often we want each thread in a kernel to access a different element of an array.
[Figure: with blockDim.x = 5, three blocks cover global indices 0 through 14]
idx = blockIdx.x * blockDim.x + threadIdx.x
Data Decomposition Example: Increment Array Elements
Increment an N-element vector a by a scalar b.

CPU program:
void increment_cpu(float *a, float b, int N)
{
    for (int idx = 0; idx < N; idx++)
        a[idx] = a[idx] + b;
}

void main()
{
    …
    increment_cpu(a, b, N);
}

CUDA program:
__global__ void increment_gpu(float *a, float b, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] = a[idx] + b;
}

void main()
{
    …
    dim3 dimBlock(blocksize);
    dim3 dimGrid(ceil(N / (float)blocksize));
    increment_gpu<<<dimGrid, dimBlock>>>(a, b, N);
}
LAB0: Setup CUDA Environment & Device Query
CUDA Environment Setup
Check your NVIDIA GPU's compute capability (i.e., the GPU's generation):
https://developer.nvidia.com/cuda-gpus
Download the CUDA development files:
https://developer.nvidia.com/cuda-toolkit-archive
CUDA driver
CUDA toolkit
(PC Room ED417B used version 2.3.)
CUDA SDK (optional)
Install CUDA, then test it with Device Query (check it in the sample codes).
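A stripped-down sketch of what such a query does with the runtime API:

#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s, compute capability %d.%d\n",
               i, prop.name, prop.major, prop.minor);
    }
    return 0;
}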
Setup CUDA for MS Visual Studio (Windows XP / VS2005~2010)
In PC Room ED417B: CUDA device: NV 8400GS
CUDA toolkit version: 2.0 or 2.3
VS2005 or VS2008
From scratch: http://forums.nvidia.com/index.php?showtopic=30273
CUDA VS Wizard: http://sourceforge.net/projects/cudavswizard/
Or modify an existing project.
Setup CUDA for MS Visual Studio 2012 (Win7/VS2012)
MS Visual Studio 2012 installed.
Download and install CUDA toolkit 5.0:
https://developer.nvidia.com/cuda-toolkit-50-archive
Follow the instructions on this blog:
http://blog.norture.com/2012/10/gpu-parallel-programming-in-vs2012-with-nvidia-cuda/
You can start writing a CUDA program by creating a new project from VS2012's template.
If you have projects from a previous version of VS, it is recommended to re-create them using VS2012's new template.
CUDA Device Query Example (CUDA toolkit 6.0)
Lab1: First CUDA Program
Yet Another Random Number Sequence Generator
Yet Another Random Number Sequence Generator
Implemented on both the CPU and the GPU. Functionality:
Given a random integer array A holding 8192 elements, generated by rand().
Re-generate each random number by repeated multiplication, 256 times, without regard to any overflow that occurs:
B[i]new = B[i]old * A[i] (CPU)
C[i]new = C[i]old * A[i] (GPU)
Check the consistency between the CPU and GPU results.
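As a reference point, the CPU side can be sketched as below. This is an illustrative sketch of the spec above, assuming B starts as a copy of A and that integer overflow simply wraps:

void CPU_kernel( int *B, int *A )
{
    for ( int i = 0 ; i < SIZE ; i++ ) {
        B[i] = A[i];                 // assumption: B is initialized from A
        for ( int j = 0 ; j < 256 ; j++ )
            B[i] = B[i] * A[i];      // B[i]new = B[i]old * A[i], 256 times
    }
}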
Data Manipulation between Host and Device
cudaError_t cudaMalloc( void** devPtr, size_t count )
Allocates count bytes of linear memory on the device and returns a pointer to the allocated memory in *devPtr.
cudaError_t cudaMemcpy( void* dst, const void* src, size_t count, enum cudaMemcpyKind kind )
Copies count bytes from the memory area pointed to by src to the memory area pointed to by dst.
kind indicates the type of memory transfer: cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice.
cudaError_t cudaFree( void* devPtr )
Frees the memory space pointed to by devPtr.
float GPU_kernel(int *B, int *A) {
    // Create pointers for memory space on device
    int *dA, *dB;

    // Allocate memory space on device
    cudaMalloc( (void**) &dA, sizeof(int)*SIZE );
    cudaMalloc( (void**) &dB, sizeof(int)*SIZE );

    // Copy data to be calculated
    cudaMemcpy( dA, A, sizeof(int)*SIZE, cudaMemcpyHostToDevice );
    cudaMemcpy( dB, B, sizeof(int)*SIZE, cudaMemcpyHostToDevice );

    // Launch kernel
    cuda_kernel<<<1,1>>>(dB, dA);

    // Copy output back
    cudaMemcpy( B, dB, sizeof(int)*SIZE, cudaMemcpyDeviceToHost );

    // Free memory spaces on device
    cudaFree( dA );
    cudaFree( dB );

    return 0.0f;   // placeholder; the lab skeleton presumably returns a timing value here
}
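For reference, the single-thread kernel that this <<<1,1>>> launch expects could look like the sketch below; it is an illustrative sketch based on the lab spec (B[i]new = B[i]old * A[i], 256 times), not the distributed solution:

__global__ void cuda_kernel( int *B, int *A )
{
    // One thread walks the entire array.
    for ( int i = 0 ; i < SIZE ; i++ )
        for ( int j = 0 ; j < 256 ; j++ )
            B[i] = B[i] * A[i];
}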
Now, go and finish your first CUDA program !!!
Download http://twins.ee.nctu.edu.tw/~skchen/lab1.zip
Open the project with Visual C++ 2008 ( lab1/cuda_lab/cuda_lab.vcproj )
main.cu: random input generation, output validation, result reporting
device.cu: launches the GPU kernel; contains the GPU kernel code
parameter.h
Fill in the appropriate APIs in GPU_kernel() in device.cu.
Lab2: Make the Parallel Code Faster
Yet Another Random Number Sequence Generator
Parallel Processing in CUDA
Parallel code can be partitioned into blocks and threads:
cuda_kernel<<<nBlk, nTid>>>(…)
Multiple tasks will be initialized, each with a different block ID and thread ID.
The tasks are dynamically scheduled:
Tasks within the same block will be scheduled on the same streaming multiprocessor.
Each task takes care of a single data partition according to its block ID and thread ID.
Locate Data Partition by Built-in Variables
Built-in variables:
gridDim: x, y
blockIdx: x, y
blockDim: x, y, z
threadIdx: x, y, z
[Figure 3.2 repeated: grids of thread blocks and blocks of threads, as shown earlier]
Data Partition for Previous Example
When processing 64 integer data with cuda_kernel<<<2, 2>>>(…), four tasks are created:
TASK 0: blockIdx.x = 0, threadIdx.x = 0
TASK 1: blockIdx.x = 0, threadIdx.x = 1
TASK 2: blockIdx.x = 1, threadIdx.x = 0
TASK 3: blockIdx.x = 1, threadIdx.x = 1
Each task covers a contiguous slice of the array described by a head index and a length:
int total_task = gridDim.x * blockDim.x;
int task_sn = blockIdx.x * blockDim.x + threadIdx.x;
int length = SIZE / total_task;
int head = task_sn * length;
Processing Single Data Partition
__global__ void cuda_kernel( int *B, int *A )
{
    int total_task = gridDim.x * blockDim.x;
    int task_sn = blockDim.x * blockIdx.x + threadIdx.x;
    int length = SIZE / total_task;
    int head = task_sn * length;
    // Each task walks its own contiguous slice; the slide's "B[i] = A[i]^256"
    // update is written out here as 256 successive multiplications.
    for ( int i = head ; i < head + length ; i++ ) {
        for ( int j = 0 ; j < 256 ; j++ )
            B[i] = B[i] * A[i];
    }
    return;
}
Parallelize Your Program !!!
Partition the kernel into threads:
Increase nTid from 1 to 512.
Keep nBlk = 1.
Group threads into blocks:
Adjust nBlk and see if it helps.
Maintain the total number of threads below 512, and make sure that 8192 is divisible by that number.
e.g. nBlk * nTid < 512
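To see which configuration is faster, each launch can be timed with CUDA events; a minimal sketch using the runtime event API (variable names follow the lab):

cudaEvent_t start, stop;
float elapsed_ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);               // enqueue before the launch
cuda_kernel<<<nBlk, nTid>>>(dB, dA);
cudaEventRecord(stop, 0);                // enqueue after the launch
cudaEventSynchronize(stop);              // wait for the kernel to finish
cudaEventElapsedTime(&elapsed_ms, start, stop);   // milliseconds between events
cudaEventDestroy(start);
cudaEventDestroy(stop);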
Lab3: Resolve Memory Contention
Yet Another Random Number Sequence Generator
Parallel Memory Architecture
Memory is divided into banks to achieve high bandwidth
Each bank can service one address per cycle
Successive 32-bit words are assigned to successive banks
[Figure: shared memory banks BANK0 through BANK15]
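For a device with 16 banks (such as the G80 generation), the bank holding a given 32-bit word is simply:

int bank = word_index % 16;   // successive 32-bit words map to successive banks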
Lab2 Review
When processing 64 integer data with cuda_kernel<<<1, 4>>>(…), each thread walks a contiguous 16-element slice, so in every iteration the four threads hit the same bank:
Iteration 1: THREAD0..THREAD3 access A[0], A[16], A[32], A[48], which all fall in the same bank. CONFLICT!
Iteration 2: THREAD0..THREAD3 access A[1], A[17], A[33], A[49], again all in one bank. CONFLICT!
How about Interleaved Accessing?
When processing 64 integer data with cuda_kernel<<<1, 4>>>(…), the threads access consecutive elements, which land in different banks:
Iteration 1: THREAD0..THREAD3 access A[0], A[1], A[2], A[3]. NO CONFLICT.
Iteration 2: THREAD0..THREAD3 access A[4], A[5], A[6], A[7]. NO CONFLICT.
Implementation of Interleaved Accessing
With cuda_kernel<<<1, 4>>>(…), each task starts at its own serial number and strides through the array by the total number of tasks:
head = task_sn
stripe = total_task
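Putting the pieces together, an interleaved version of the kernel might look like the sketch below, consistent with the head/stripe definitions above (again a sketch, not the distributed solution):

__global__ void cuda_kernel( int *B, int *A )
{
    int total_task = gridDim.x * blockDim.x;
    int task_sn = blockDim.x * blockIdx.x + threadIdx.x;
    int head = task_sn;        // each task starts at its own index
    int stripe = total_task;   // and strides by the total task count
    for ( int i = head ; i < SIZE ; i += stripe )
        for ( int j = 0 ; j < 256 ; j++ )
            B[i] = B[i] * A[i];
}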
Improve Your Program !!!
Modify the original kernel code in an interleaving manner:
cuda_kernel() in device.cu
Adjust nBlk and nTid as in Lab2 and examine the effect.
Maintain the total number of threads below 512, and make sure that 8192 is divisible by that number.
e.g. nBlk * nTid < 512
Thank You
http://twins.ee.nctu.edu.tw/~skchen/lab3.zip
Final project issues:
Subject: porting & optimizing any algorithm on any multi-core platform
Demo: 1 week after the final exam @ ED412
Group: 1~2 people per group
* Group members & demo time should be registered after the final exam @ ED412