1
Basics of CUDA Programming
Weijun Xiao
Department of Electrical and Computer Engineering
University of Minnesota
2
Outline
• What’s GPU computing?
• CUDA programming model
• Basic Memory Management
• Basic Kernels and Execution
• CPU and GPU Coordination
• CUDA debugging and profiling
• Conclusions
3
What is GPU?
• Graphic Processing Unit
(Diagram: a logical representation of visual information goes into the GPU, which produces the output signal.)
4
Performance Gap between GPUs and CPUs
5
GPU = Fast Parallel Machine
• GPU speed is increasing at a faster pace than Moore’s Law.
• This is a consequence of the data-parallel streaming aspects of the GPU.
• The gaming market stimulates the development of GPUs.
• GPUs are cheap! Put enough together and you can get a supercomputer.
So can we use the GPU for general-purpose computing?
6
Sure, thousands of Applications
• Large matrix/vector operations (BLAS)
• Protein Folding (Molecular Dynamics)
• FFT (signal processing)
• VMD (Visual Molecular Dynamics)
• Speech Recognition (Hidden Markov Models, Neural nets)
• Databases
• Sort/Search
• Storage
• MRI
• …
7
Why are We Interested in GPU?
• High-performance Computing
• High Parallelism
• Low Cost
• GPUs are Programmable
• GPGPU
8
Growth and Development of GPU
• A quiet revolution and potential build-up
  – Calculation: 367 GFLOPS vs. 32 GFLOPS
  – Memory Bandwidth: 86.4 GB/s vs. 8.4 GB/s
  – Before CUDA, programmed through graphics API
  – GPU in every PC and workstation – massive volume and potential impact
(Chart: GFLOPS over time for successive NVIDIA GPUs: NV30 = GeForce FX 5800, NV35 = GeForce FX 5950 Ultra, NV40 = GeForce 6800 Ultra, G70 = GeForce 7800 GTX, G71 = GeForce 7900 GTX, G80 = GeForce 8800 GTX.)
9
GeForce 8800
(Block diagram: the Host feeds an Input Assembler and Thread Execution Manager; an array of multiprocessors, each with its own Texture unit and Parallel Data Cache, connects through Load/Store paths to Global Memory.)
16 highly threaded multiprocessors, 128 cores, 367 GFLOPS, 768 MB DRAM, 86.4 GB/s memory BW, 4 GB/s BW to CPU
10
Tesla 2050: 14 multiprocessors, 448 cores, 1.03 TFLOPS single precision / 515 GFLOPS double precision, 3 GB GDDR5 DRAM with ECC, 144 GB/s memory BW, PCIe 2 x16 (8 GB/s BW to CPU)
11
GPU Languages
• Assembly
• Cg (NVIDIA)
- C for Graphics
• GLSL (OpenGL)
- OpenGL Shading Language
• HLSL (Microsoft)
- High-level Shading language
• Brook C/C++ (AMD)
• CUDA (NVIDIA)
• OpenCL
12
How Did GPGPU Work before CUDA?
• Follow the graphics pipeline
• Pretend to be graphics
• Take advantage of the massive parallelism of the GPU
• Disguise data as textures or geometry
• Disguise the algorithm as render passes
• Fool the graphics pipeline into doing computation
13
CUDA Programming Model
• Compute Unified Device Architecture
• Simple and general-purpose programming model
• Standalone driver to load computation programs into the GPU
• Graphics-free API
• Data sharing with OpenGL buffer objects
• Easy to use with a low learning curve
14
CUDA – C with no shader limitations!
• Integrated host+device C application program
  – Serial or modestly parallel parts in host C code
  – Highly parallel parts in device SPMD kernel C code
Serial Code (host)
    . . .
Parallel Kernel (device):  KernelA<<< nBlk, nTid >>>(args);
Serial Code (host)
    . . .
Parallel Kernel (device):  KernelB<<< nBlk, nTid >>>(args);
15 © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign
CUDA Devices and Threads
• A compute device
  – Is a coprocessor to the CPU or host
  – Has its own DRAM (device memory)
  – Runs many threads in parallel
  – Is typically a GPU but can also be another type of parallel processing device
• Data-parallel portions of an application are expressed as device kernels which run on many threads
• Differences between GPU and CPU threads
  – GPU threads are extremely lightweight
    • Very little creation overhead
  – GPU needs 1000s of threads for full efficiency
    • Multi-core CPU needs only a few
16
Extended C
• Declspecs
  – global, device, shared, local, constant
• Keywords
  – threadIdx, blockIdx
• Intrinsics
  – __syncthreads
• Runtime API
  – Memory, symbol, execution management
• Function launch
__device__ float filter[N];

__global__ void convolve (float *image) {
    __shared__ float region[M];
    ...
    region[threadIdx] = image[i];
    __syncthreads();
    ...
    image[j] = result;
}

// Allocate GPU memory
void *myimage = cudaMalloc(bytes);

// 100 blocks, 10 threads per block
convolve<<<100, 10>>> (myimage);
17
Compiling a CUDA Program
(Flow diagram: a C/C++ CUDA Application is compiled by NVCC into CPU Code plus virtual PTX Code; a PTX-to-Target Compiler then produces physical target code for a G80 or other GPU.)

• Parallel Thread eXecution (PTX)
  – Virtual machine and ISA
  – Programming model
  – Execution resources and state
float4 me = gx[gtid];
me.x += me.y * me.z;

ld.global.v4.f32 {$f1,$f3,$f5,$f7}, [$r9+0];
mad.f32 $f1, $f5, $f3, $f1;
18 © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign
Arrays of Parallel Threads
• A CUDA kernel is executed by an array of threads
  – All threads run the same code (SPMD)
  – Each thread has an ID that it uses to compute memory addresses and make control decisions

(Figure: threads with threadID 0 through 7 all executing the same body:)
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
19 © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign
Thread Blocks: Scalable Cooperation
• Divide the monolithic thread array into multiple blocks
  – Threads within a block cooperate via shared memory, atomic operations and barrier synchronization (see the sketch below)
  – Threads in different blocks cannot cooperate

(Figure: Thread Block 0, Thread Block 1, …, Thread Block N-1, each containing threads with threadID 0 through 7 and each running the same body:)
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
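As a hedged illustration of intra-block cooperation (not from the original slides; the kernel name, tile size, and buffers are made up), each block below stages its tile of the input in shared memory, hits a barrier, and then reads a neighbour's element:

    #define BLOCK 256

    // Each block reverses its own BLOCK-sized tile of d_in into d_out.
    // Threads cooperate through shared memory and __syncthreads(); threads
    // in different blocks never touch the same tile, so blocks stay independent.
    __global__ void reverse_tiles(const int *d_in, int *d_out)
    {
        __shared__ int tile[BLOCK];

        int gid = blockIdx.x*blockDim.x + threadIdx.x;
        tile[threadIdx.x] = d_in[gid];            // stage this block's tile
        __syncthreads();                          // barrier: tile fully loaded

        d_out[gid] = tile[blockDim.x - 1 - threadIdx.x];   // read another thread's element
    }

    // Launch with one thread per element, e.g. for n a multiple of BLOCK:
    // reverse_tiles<<< n/BLOCK, BLOCK >>>(d_in, d_out);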
20
(Figure 3.2: An example of CUDA thread organization. The Host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2 of the Device; Grid 1 contains Blocks (0,0), (1,0), (0,1), and (1,1); Block (1,1) is expanded to show its threads, indexed in three dimensions from Thread (0,0,0) to Thread (3,1,0), with a further layer (0,0,1) … (3,0,1) behind them. Courtesy: NVIDIA.)
Block IDs and Thread IDs
• Each thread uses IDs to decide what data to work on
  – Block ID: 1D or 2D
  – Thread ID: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data
  – Image processing (see the sketch below)
  – Solving PDEs on volumes
  – …
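A small hedged sketch of the idea (the kernel and parameter names are invented for illustration): block and thread IDs combine into a 2D pixel coordinate, with a bounds check for partial blocks at the image edges.

    __global__ void brighten(unsigned char *img, int width, int height)
    {
        int x = blockIdx.x*blockDim.x + threadIdx.x;   // column index
        int y = blockIdx.y*blockDim.y + threadIdx.y;   // row index

        if (x < width && y < height)                   // guard the image borders
            img[y*width + x] += 10;                    // simple per-pixel update
    }

    // e.g. dim3 block(16,16); dim3 grid((width+15)/16, (height+15)/16);
    // brighten<<<grid, block>>>(d_img, width, height);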
21 © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign
CUDA Memory Model
• Global memory
  – Main means of communicating R/W data between host and device
  – Contents visible to all threads
  – Long latency access

(Figure: the Host and a Grid of Blocks (0,0) and (1,0); each block has its own Shared Memory, each thread its own Registers, and all of them access Global Memory.)
22
Basic Memory Management
23
Memory Spaces
• CPU and GPU have separate memory spaces
  – Data is moved across the PCIe bus
  – Use functions to allocate/set/copy memory on the GPU
    • Very similar to the corresponding C functions
• Pointers are just addresses
  – Can’t tell from the pointer value whether the address is on the CPU or GPU
  – Must exercise care when dereferencing:
    • Dereferencing a CPU pointer on the GPU will likely crash
    • Same for vice versa
24
GPU Memory Allocation / Release
• Host (CPU) manages device (GPU) memory:
  – cudaMalloc (void ** pointer, size_t nbytes)
  – cudaMemset (void * pointer, int value, size_t count)
  – cudaFree (void* pointer)
int n = 1024;
int nbytes = n*sizeof(int);
int * d_a = 0;
cudaMalloc( (void**)&d_a, nbytes );
cudaMemset( d_a, 0, nbytes);
cudaFree(d_a);
25
Data Copies
• cudaMemcpy( void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction);
  – returns after the copy is complete
  – blocks the CPU thread until all bytes have been copied
  – doesn’t start copying until previous CUDA calls complete
• enum cudaMemcpyKind
  – cudaMemcpyHostToDevice
  – cudaMemcpyDeviceToHost
  – cudaMemcpyDeviceToDevice
• Non-blocking memcopies are provided (see the sketch below)
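A hedged sketch of a non-blocking copy (buffer names and sizes are illustrative): cudaMemcpyAsync returns immediately and is ordered within its stream; truly asynchronous transfers need page-locked (pinned) host memory from cudaMallocHost.

    int *h_buf = 0, *d_buf = 0;
    size_t nbytes = 1024*sizeof(int);
    cudaStream_t stream;

    cudaStreamCreate(&stream);
    cudaMallocHost((void**)&h_buf, nbytes);      // pinned host memory
    cudaMalloc((void**)&d_buf, nbytes);

    cudaMemcpyAsync(d_buf, h_buf, nbytes, cudaMemcpyHostToDevice, stream);
    // ... independent CPU work can run here while the copy is in flight ...
    cudaStreamSynchronize(stream);               // wait for the copy to finish

    cudaStreamDestroy(stream);
    cudaFreeHost(h_buf);
    cudaFree(d_buf);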
26
Code Walkthrough 1
• Allocate CPU memory for n integers
• Allocate GPU memory for n integers
• Initialize GPU memory to 0s
• Copy from GPU to CPU
• Print the values
27
Code Walkthrough 1

#include <stdio.h>

int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);

    int *d_a=0, *h_a=0; // device and host pointers
28
Code Walkthrough 1

#include <stdio.h>

int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);

    int *d_a=0, *h_a=0; // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

    if( 0==h_a || 0==d_a ) {
        printf("couldn't allocate memory\n");
        return 1;
    }
29
Code Walkthrough 1

#include <stdio.h>

int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);

    int *d_a=0, *h_a=0; // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

    if( 0==h_a || 0==d_a ) {
        printf("couldn't allocate memory\n");
        return 1;
    }

    cudaMemset( d_a, 0, num_bytes );
    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );
30
Code Walkthrough 1

#include <stdio.h>

int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);

    int *d_a=0, *h_a=0; // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

    if( 0==h_a || 0==d_a ) {
        printf("couldn't allocate memory\n");
        return 1;
    }

    cudaMemset( d_a, 0, num_bytes );
    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

    for(int i=0; i<dimx; i++)
        printf("%d ", h_a[i] );
    printf("\n");

    free( h_a );
    cudaFree( d_a );

    return 0;
}
31
Basic Kernels and Execution on GPU
32 © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign
CUDA Function Declarations
                                   Executed on the:    Only callable from the:
__device__ float DeviceFunc()      device              device
__global__ void  KernelFunc()      device              host
__host__   float HostFunc()        host                host

• __global__ defines a kernel function
  – Must return void
• __device__ and __host__ can be used together (see the sketch below)
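A minimal sketch of the combined qualifiers (the helper and kernel names are made up): the same function is compiled for both the CPU and the GPU.

    __host__ __device__ float squared(float x)
    {
        return x*x;                    // compiled for both host and device
    }

    __global__ void square_all(float *a, int n)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n)
            a[i] = squared(a[i]);      // device-side call
    }

    // On the host the very same helper can be called directly:
    // float y = squared(3.0f);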
33 © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign
CUDA Function Declarations (cont.)
• __device__ functions cannot have their address taken
• For functions executed on the device:
  – Can only access GPU memory
  – No recursion
  – No static variable declarations inside the function
  – No variable number of arguments
34
Code Walkthrough 2
• Build on Walkthrough 1
• Write a kernel to initialize integers
• Copy the result back to the CPU
• Print the values
35
Kernel Code (executed on GPU)

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = 7;
}
36
Launching kernels on GPU
• Launch parameters:
  – grid dimensions (up to 2D), dim3 type
  – thread-block dimensions (up to 3D), dim3 type
  – shared memory: number of bytes per block
    • for extern smem variables declared without size (see the sketch below)
    • Optional, 0 by default
  – stream ID
    • Optional, 0 by default

dim3 grid(16, 16);
dim3 block(16, 16);
kernel<<<grid, block, 0, 0>>>(...);

kernel<<<32, 512>>>(...);
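A hedged sketch of the optional third launch parameter (the kernel name and sizes are illustrative): it gives the number of bytes of dynamic shared memory per block, which back an extern __shared__ array declared without a size.

    __global__ void scale(float *data, float factor)
    {
        extern __shared__ float buf[];             // size set at launch time

        int i = blockIdx.x*blockDim.x + threadIdx.x;
        buf[threadIdx.x] = data[i] * factor;       // stage through shared memory
        __syncthreads();
        data[i] = buf[threadIdx.x];
    }

    // 64 blocks of 128 threads, 128 floats of dynamic shared memory per block:
    // scale<<< 64, 128, 128*sizeof(float) >>>(d_data, 2.0f);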
37
#include <stdio.h>
__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = 7;
}

int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);

    int *d_a=0, *h_a=0; // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

    if( 0==h_a || 0==d_a ) {
        printf("couldn't allocate memory\n");
        return 1;
    }

    cudaMemset( d_a, 0, num_bytes );

    dim3 grid, block;
    block.x = 4;
    grid.x  = dimx / block.x;

    kernel<<<grid, block>>>( d_a );

    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

    for(int i=0; i<dimx; i++)
        printf("%d ", h_a[i] );
    printf("\n");

    free( h_a );
    cudaFree( d_a );

    return 0;
}
38
Kernel Variations and Output

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = 7;
}
Output: 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = blockIdx.x;
}
Output: 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = threadIdx.x;
}
Output: 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
39
Code Walkthrough 3
• Build on Walkthrough 2
• Write a kernel to increment n×m integers
• Copy the result back to the CPU
• Print the values
40
Kernel with 2D Indexing

__global__ void kernel( int *a, int dimx, int dimy )
{
    int ix  = blockIdx.x*blockDim.x + threadIdx.x;
    int iy  = blockIdx.y*blockDim.y + threadIdx.y;
    int idx = iy*dimx + ix;

    a[idx] = a[idx]+1;
}
41
int main()
{
    int dimx = 16;
    int dimy = 16;
    int num_bytes = dimx*dimy*sizeof(int);

    int *d_a=0, *h_a=0; // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

    if( 0==h_a || 0==d_a ) {
        printf("couldn't allocate memory\n");
        return 1;
    }

    cudaMemset( d_a, 0, num_bytes );

    dim3 grid, block;
    block.x = 4;
    block.y = 4;
    grid.x  = dimx / block.x;
    grid.y  = dimy / block.y;

    kernel<<<grid, block>>>( d_a, dimx, dimy );

    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

    for(int row=0; row<dimy; row++) {
        for(int col=0; col<dimx; col++)
            printf("%d ", h_a[row*dimx+col] );
        printf("\n");
    }

    free( h_a );
    cudaFree( d_a );

    return 0;
}

__global__ void kernel( int *a, int dimx, int dimy )
{
    int ix  = blockIdx.x*blockDim.x + threadIdx.x;
    int iy  = blockIdx.y*blockDim.y + threadIdx.y;
    int idx = iy*dimx + ix;

    a[idx] = a[idx]+1;
}
42
Blocks must be independent
• Any possible interleaving of blocks should be valid
  – presumed to run to completion without pre-emption
  – can run in any order
  – can run concurrently OR sequentially
• Blocks may coordinate but not synchronize
  – shared queue pointer: OK (see the sketch below)
  – shared lock: BAD … can easily deadlock
• Independence requirement gives scalability
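As a hedged sketch of the "shared queue pointer: OK" case (names are invented): each thread atomically claims the next work item with atomicAdd, so blocks coordinate without ever waiting on one another.

    __global__ void drain_queue(int *queue_head, int num_items, float *items)
    {
        // Global atomics need compute capability 1.1 or higher.
        while (true) {
            int my_item = atomicAdd(queue_head, 1);   // claim the next index
            if (my_item >= num_items)
                break;                                // queue exhausted
            items[my_item] *= 2.0f;                   // "process" the item
        }
    }

    // A global lock, by contrast, can deadlock: blocks spinning on it may
    // prevent the block that holds it from ever being scheduled.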
43
Blocks must be independent
• Thread blocks can run in any order
  – Concurrently or sequentially
  – Facilitates scaling of the same code across many devices

(Figure: Scalability. The same grid of blocks can be scheduled onto devices with different numbers of multiprocessors.)
44
Coordinating CPU and GPU Execution
45
Synchronizing GPU and CPU
• All kernel launches are asynchronous
  – control returns to the CPU immediately
  – kernel starts executing once all previous CUDA calls have completed
• Memcopies are synchronous
  – control returns to the CPU once the copy is complete
  – copy starts once all previous CUDA calls have completed
• cudaThreadSynchronize()
  – blocks until all previous CUDA calls complete (see the sketch below)
• Asynchronous CUDA calls provide:
  – non-blocking memcopies
  – ability to overlap memcopies and kernel execution
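A short hedged sketch of that pattern, reusing the names from the earlier walkthrough (do_independent_cpu_work is a hypothetical placeholder):

    kernel<<<grid, block>>>( d_a );      // asynchronous: control returns at once

    do_independent_cpu_work();           // hypothetical host work, overlaps the kernel

    cudaThreadSynchronize();             // block until all previous CUDA calls finish

    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );   // blocking copy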
46
CUDA Error Reporting to CPU
• All CUDA calls return an error code:
  – except kernel launches
  – cudaError_t type
• cudaError_t cudaGetLastError(void)
  – returns the code for the last error ("no error" has a code)
• char* cudaGetErrorString(cudaError_t code)
  – returns a null-terminated character string describing the error

printf("%s\n", cudaGetErrorString( cudaGetLastError() ) );
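A hedged convenience sketch (the CUDA_CHECK macro is not part of the CUDA API, just a common wrapper around the calls above):

    #define CUDA_CHECK(call)                                           \
        do {                                                           \
            cudaError_t err = (call);                                  \
            if (err != cudaSuccess)                                    \
                printf("CUDA error at %s:%d: %s\n",                    \
                       __FILE__, __LINE__, cudaGetErrorString(err));   \
        } while (0)

    // Runtime calls return an error code directly:
    //   CUDA_CHECK( cudaMemcpy(h_a, d_a, num_bytes, cudaMemcpyDeviceToHost) );
    // Kernel launches do not, so query the last error after the launch:
    //   kernel<<<grid, block>>>( d_a );
    //   CUDA_CHECK( cudaGetLastError() );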
47
Device Management
• CPU can query and select GPU devices
  – cudaGetDeviceCount( int* count )
  – cudaSetDevice( int device )
  – cudaGetDevice( int *current_device )
  – cudaGetDeviceProperties( cudaDeviceProp* prop, int device )
  – cudaChooseDevice( int *device, cudaDeviceProp* prop )
• Multi-GPU setup:
  – device 0 is used by default
  – one CPU thread can control one GPU
    • multiple CPU threads can control the same GPU; calls are serialized by the driver
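A minimal sketch of device enumeration using the calls listed above (the output formatting is illustrative):

    int count = 0;
    cudaGetDeviceCount(&count);

    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s, %d multiprocessors, compute %d.%d\n",
               dev, prop.name, prop.multiProcessorCount, prop.major, prop.minor);
    }

    cudaSetDevice(0);    // explicit, though device 0 is already the default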
48
CUDA Debugging and Profiling
49
What’s cuda-gdb?
• All-in-one debugging tool
• Debugs both host and CUDA code
• Extension to Linux gdb
• 32/64-bit Linux
• 4.0 release
50
Debug Compilation
• -g -G
• nvcc -g -G foo.cu -o foo
• Fermi:
  -gencode arch=compute_20,code=sm_20
• Makefile
• CUDA-GDB error: undefined reference to '$gpu_registers' (2.2 beta or previous)
• ptxvars.cu
  nvcc "/usr/local/cuda/bin/ptxvars.cu" -g -G --host-compilation=c -c -define-always-macro _DEVICE_LAUNCH_PARAMETERS_H__ -Xptxas -fext
51
Extension to GDB
• Debug both host and GPU code seamlessly
• GPU memory is seen as an extension of host memory
• GPU threads and blocks are seen as extensions of host threads
• Breakpoints at any host and/or device function symbol or source file line number
• Single-step individual warps
52
Debug commands
• thread <<<(BX,BY),(TX,TY,TZ)>>>
  thread <<<170>>>
  thread <<<2,(10,10)>>>
• cuda block (n,m) thread (x,y,z)
• info cuda state (being replaced by info cuda devices, kernels, system, warp, sm, …)
53
Debugging Commands (cont.)
• break
• continue
• next
• step
• quit
• set args
…
• GDB quick reference
http://users.ece.utexas.edu/~adnan/gdb-refcard.pdf
54
Example code
• 8-bit bit reverse
• 00011101 -> 10111000
• 10010111 -> 11101001
55
Algorithms
r = 0;
for (int i = 0; i < 8; i++) {
    r = r << 1;
    if (x % 2) r += 1;
    x = x >> 1;
}

x = ((0xf0 & x) >> 4) | ((0x0f & x) << 4);
x = ((0xcc & x) >> 2) | ((0x33 & x) << 2);
x = ((0xaa & x) >> 1) | ((0x55 & x) << 1);
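A hedged host-side sketch (plain C, no CUDA) checking that the loop version and the mask/shift version above agree for all 256 byte values:

    #include <stdio.h>

    unsigned char reverse_loop(unsigned char x)
    {
        unsigned char r = 0;
        for (int i = 0; i < 8; i++) {
            r = r << 1;
            if (x % 2) r += 1;   // append the low bit of x
            x = x >> 1;
        }
        return r;
    }

    unsigned char reverse_masks(unsigned char x)
    {
        x = ((0xf0 & x) >> 4) | ((0x0f & x) << 4);   // swap nibbles
        x = ((0xcc & x) >> 2) | ((0x33 & x) << 2);   // swap bit pairs
        x = ((0xaa & x) >> 1) | ((0x55 & x) << 1);   // swap adjacent bits
        return x;
    }

    int main(void)
    {
        for (int i = 0; i < 256; i++)
            if (reverse_loop((unsigned char)i) != reverse_masks((unsigned char)i))
                printf("mismatch at %d\n", i);
        return 0;     // prints nothing: the two algorithms agree
    }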
56
Code

 1 #include <stdio.h>
 2 #include <stdlib.h>
 3
 4 // Simple 8-bit bit reversal Compute test
 5
 6 #define N 256
 7
 8 __global__ void bitreverse(unsigned int *data)
 9 {
10     unsigned int *idata = data;
11
12     unsigned int x = idata[threadIdx.x];
13
14     x = ((0xf0f0f0f0 & x) >> 4) | ((0x0f0f0f0f & x) << 4);
15     x = ((0xcccccccc & x) >> 2) | ((0x33333333 & x) << 2);
16     x = ((0xaaaaaaaa & x) >> 1) | ((0x55555555 & x) << 1);
17
18     idata[threadIdx.x] = x;
19 }
20
21 int main(void)
22 {
23     unsigned int *d = NULL; int i;
24     unsigned int idata[N], odata[N];
25
26     for (i = 0; i < N; i++)
27         idata[i] = (unsigned int)i;
28
29     cudaMalloc((void**)&d, sizeof(int)*N);
30     cudaMemcpy(d, idata, sizeof(int)*N,
31                cudaMemcpyHostToDevice);
32
33     bitreverse<<<1, N>>>(d);
34
35     cudaMemcpy(odata, d, sizeof(int)*N,
36                cudaMemcpyDeviceToHost);
37
38     for (i = 0; i < N; i++)
39         printf("%u -> %u\n", idata[i], odata[i]);
40
41     cudaFree((void*)d);
42     return 0;
43 }
57
cuda-gdb Supported Platforms
• Host platform
  – X11 cannot be running on the GPU used for debugging
  – One GPU: disable X11
  – Two or more GPUs
• GPU requirements
  – All CUDA-enabled GPUs except 8800 GTS, 8800 GTX, 8800 Ultra, FX 4600, and FX 5600
58
Debugging example code
• Step 1
nvcc -g -G bitreverse.cu -o bitreverse
• Step 2
cuda-gdb ./bitreverse
• Step 3
Set breakpoints (break main, break bitreverse, break 18)
• Step 4
Run CUDA application
(cuda-gdb) run
• Step 5
Continue and watch variables
(cuda-gdb) continue
(cuda-gdb) thread
(cuda-gdb) print x
59
Profiling Tools
• CUDA memcheck
• Occupancy Calculator
• Visual Profiler
60
CUDA Visual Profiler
61
CUDA Counters
62
Profiler Counters for Fermi
• branch, divergent branch
• instruction issued, instruction executed
• sm cta launched
• gld request, gst request
• local load, local store
• share load, share store
• warps launched, threads launched
• l1 global load hit, l1 global load miss
• l1 local load hit, l1 local load miss
• l1 local store hit, l1 local store miss
• l1 share bank conflicts
• uncached global load transaction
• global store transaction
• l2 read requests, l2 write requests
• l2 read misses, l2 write misses
• dram reads, dram writes
• tex cache requests, tex cache misses
63
Memory Throughput
• Compute capability < 2.0:
  Global read throughput  = (((gld_32*32) + (gld_64*64) + (gld_128*128)) * TPC) / gputime
  Global write throughput = (((gst_32*32) + (gst_64*64) + (gst_128*128)) * TPC) / gputime
• Compute capability >= 2.0:
  Global read throughput  = (dram reads * 32) / gputime
  Global write throughput = (dram writes * 32) / gputime
• Gmem overall throughput = read throughput + write throughput
• Tesla C2050: theoretical bandwidth 144 GB/s
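A hedged worked example of the compute-capability >= 2.0 formula (the counter values and kernel time are made up; each DRAM transaction moves 32 bytes, and gputime is in microseconds):

    double dram_reads  = 1.5e6;    /* profiler counter (assumed value)   */
    double dram_writes = 0.5e6;    /* profiler counter (assumed value)   */
    double gputime_us  = 800.0;    /* kernel time in microseconds        */

    /* bytes per microsecond divided by 1e3 gives GB/s (decimal GB)      */
    double read_gbs  = (dram_reads  * 32.0) / (gputime_us * 1e3);   /* = 60 GB/s */
    double write_gbs = (dram_writes * 32.0) / (gputime_us * 1e3);   /* = 20 GB/s */
    double total_gbs = read_gbs + write_gbs;                        /* = 80 GB/s */

    /* roughly 80 GB/s of the 144 GB/s theoretical peak of a Tesla C2050 */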
64
Conclusions
• GPU as an accelerator for HPC
• CUDA programming model
• CUDA threads and kernels
• CUDA example codes
• CUDA debugging and profiling