TRANSCRIPT
© APC
CUDA/OpenACC course at DKRZ Day 1
Alex Shevchenko, APC
APC | 2
Introduction to CUDA
APC | 3
GPGPU & CUDA
GPU - Graphics Processing Unit
GPGPU - General-Purpose computing on GPU • Nvidia's first GPGPU-capable GPU was the GeForce 8800 (G80 chip, 2006)
CUDA - Compute Unified Device Architecture: a parallel computing platform and programming model created by Nvidia for its graphics processing units
APC | 4
Entertainment
Professional graphics
HPC
Nvidia GPUs
APC | 5
GPGPU Revolution in HPC
With regard to:
• Price / Performance
• Performance / Energy consumption
APC | 6
GPGPU Revolution in HPC
APC | 7
Acceleration via GPU
APC | 8
Hardware Architecture of CUDA-Enabled GPU
APC | 9
CPU Intel Core I-7 Features
Several high-performance independent cores
• 2, 4, 6, or 8 cores, 2.66-3.6 GHz each
• Each physical core is seen by the system as 2 logical cores and can execute two threads concurrently (Hyper-Threading)
3 cache levels, large L3 cache
• Per core: L1 = 32 KB (data) + 32 KB (instructions), L2 = 256 KB
• Shared L3, up to 20 MB
Memory requests are managed separately for each thread/process
Core i7-3960X: 6 cores, 15 MB L3
APC | 10
GPU Streaming Multiprocessor (SMX)
The device's basic building block (similar to a core in a CPU)
Consists of:
• 192 scalar cores (CUDA Cores), ~1 GHz each
• 4 warp schedulers
• Register file, 256 KB
• 3 caches – texture, global (L1), constant (uniform)
• 32 Special Function Units (SFU) – interpolation and transcendental single-precision math
APC | 11
Chip in Maximum Configuration (K20X)
• 14(15) SMX
• 2688 CUDA Cores
• Cache L2 1.5 MB
• 384-bit GDDR5
• PCI-E 3.0
APC | 12
GPU vs. CPU
Hundreds of simplified computational cores working at low clock frequencies, ~1 GHz (instead of 2-8 GHz in a CPU)
Small caches
• 192 cores share L1 (16-48 KB)
• L2 shared between all cores, 1.5 MB; no L3
GDDR5 with high bandwidth and high latency
• Optimized for massively parallel access (throughput rather than latency)
Support for millions of virtual threads, fast (hardware) context switching for groups of threads
APC | 14
Memory Latency Utilization
Purpose: keep all cores loaded
Problem: memory latency
Solution:
• CPU: a complex cache hierarchy
• GPU: thousands of threads working while memory transactions are in flight
With hundreds of cores and support for millions of threads, the GPU is better able to utilize all of its memory bandwidth
APC | 15
Theoretical Bandwidth and Performance: GPU vs. CPU
APC | 16
Development systems
• Deeper knowledge
• Possibly, better performance
APC | 17
SIMT Model
How to execute millions of threads on a GPU?
APC | 18
CUDA in Flynn's Classification
Flynn's classification of computer architectures: SISD, SIMD, MISD, MIMD
• SIMD – all processors execute one instruction on multiple data
• MIMD – each processor executes independently
• SMP – all processors have equal rights to access the memory (one of the MIMD variants, alongside MPP, NUMA, cc-NUMA)
APC | 19
CUDA in Flynn's Classification
Nvidia has its own computational model, with features of both SIMD and MIMD (SMP):
Nvidia SIMT: Single Instruction – Multiple Thread
APC | 20
SIMT: Virtual Threads, Blocks
All threads virtually:
• operate in parallel (MIMD)
• have equal privileges to access the memory (MIMD: SMP)
Threads are divided into groups of equal size (blocks):
• In general, global synchronization of all threads is not possible
• Local synchronization within a block is available; the threads of a single block can communicate through a special memory
Threads do not migrate between blocks: each thread stays in its block from start to finish
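The «special memory» is per-block shared memory. A minimal sketch of block-local communication (the kernel name and sizes are illustrative, not from the slides; launch with blocks of 256 threads):
// Threads of one block cooperate through shared memory and a barrier
__global__ void block_sum(int *data, int *blockSums) {
    __shared__ int buf[256];            // memory visible to the whole block
    buf[threadIdx.x] = data[blockIdx.x * blockDim.x + threadIdx.x];
    __syncthreads();                    // local synchronization within the block
    if (threadIdx.x == 0) {             // one thread combines the block's values
        int s = 0;
        for (int i = 0; i < blockDim.x; ++i) s += buf[i];
        blockSums[blockIdx.x] = s;
    }
}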
APC | 21
SIMT: Hardware Implementation
All threads of a single block are executed on a single multiprocessor (SMX)
Maximum number of threads in a block – 1024
Blocks can't switch SMX
Allocation of blocks between multiprocessors is unpredictable
Each SMX operates independently
[Figure: program blocks mapped onto virtual blocks of threads]
APC | 22
SIMT: Hardware Implementation
Thread blocks are divided into groups of 32 threads called warps
All threads of a warp simultaneously execute a single common instruction (exactly SIMD execution)
On each execution cycle, the warp scheduler selects a warp whose threads are ready for execution and launches the whole warp
[Figure: warp scheduler dispatching warps of a virtual block of threads]
APC | 23
Branching
All threads of a warp simultaneously perform the same instruction
What if some of the threads should not execute this instruction?
• e.g. if (<condition>), where the condition differs between the threads of a warp
They will be idle
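A minimal sketch of such divergence (an illustrative kernel, not from the slides): within one warp the two branch paths are serialized, and each thread idles through the path it does not take:
__global__ void diverge(int *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)          // even and odd lanes of the same warp take
        out[i] = i * 2;      // different paths, so the warp executes both
    else                     // branches one after the other, masking the
        out[i] = i * 3;      // threads that are inactive in each
}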
APC | 24
Branching
[Figure: instruction stream of a warp at a branch – the two paths are executed one after another while the threads on the inactive path idle]
APC | 25
Several Blocks on a Single SM
An SM can concurrently execute several blocks:
• Maximum number of blocks per SM – 16
• Maximum number of resident threads per multiprocessor – 2048
[Figure: several virtual blocks of threads resident on a single SM]
APC | 26
Occupancy
The more threads are active on a multiprocessor, the more efficiency can be reached:
• Blocks of 1024 threads – 2 blocks per SM, 2048 threads, 100% of maximum
• Blocks of 50 threads – 16 blocks per SM, 800 threads, 39%
• Blocks of 768 threads – 2 blocks per SM, 1536 threads, 75%
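Newer CUDA toolkits (6.5 and later, after the era of these slides) can estimate this figure for a concrete kernel; a sketch, using sum_kernel from a later slide and a block size of 256 as examples:
int numBlocks;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, sum_kernel, 256, 0);
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
// fraction of the SM's resident-thread limit actually used
float occupancy = (float)(numBlocks * 256) / prop.maxThreadsPerMultiProcessor;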
APC | 27
SIMT and Scaling
Virtual
• GPU supports millions of threads
• Virtual blocks are independent, so code can be executed on any number of SMs
Hardware
• SMs are independent
• Different GPUs contain different numbers of SMs
APC | 28
Summing Up
Nvidia SIMT – all the threads of a warp simultaneously execute one instruction; warps run independently
SIMD – all the threads simultaneously perform one instruction
MIMD – each thread runs independently
SMP – all the threads have an equal opportunity to access the memory
[Figure: hierarchy thread → warp → block → program; a warp executes as SIMD, warps and blocks relate to each other as MIMD]
APC | 29
CUDA: Heterogeneous Parallel Programming
APC | 30
Calculations using GPU
A program that uses the GPU consists of:
• Code for the GPU (device code), containing the computational instructions and memory access handling
• Code for the CPU (host code), which handles:
  GPU memory management – allocation / release
  Data exchange between GPU and CPU
  GPU code launches
  Processing of the results and other serial code
APC | 31
Calculations using GPU
The GPU is regarded as a peripheral device controlled by the CPU
• The GPU is «passive», i.e. it cannot give itself work to do
Device code can be launched from anywhere in the program like a normal function
• This enables 'incremental' program optimization
APC | 32
Device Code
CUDA code uses C++ with some add-ons:
• Attributes for functions, variables and structures
• Built-in functions
  Mathematics implemented on GPU
  Synchronization, collective operations
• Vector data types
• Built-in variables threadIdx, blockIdx, gridDim, blockDim
• Templates for working with textures
• …
Compiled by a special compiler, cicc
APC | 33
Host Code
There is a special syntax for launching multiple instances of the kernel process on the GPU
• In the simplest form it looks like:
kernel_routine<<<gridDim, blockDim>>>(args);
Code for the CPU is compiled with an ordinary compiler
• The exception is the kernel launch construction <<< ... >>>
Functions are linked from dynamic libraries
APC | 34
CUDA Kernel
A special function, the entry point for code executed on the GPU
Doesn't return anything (void)
Declared with the qualifier __global__
Can only access GPU memory
No static variables
Parameter declarations and their use are the same as for normal functions
The host launches «kernels», the device executes them
__global__ void kernel(int *ptr) {
    ptr = ptr + 1;   // pointer arithmetic on GPU memory
    ptr[0] = 100;    // write to GPU memory
    …                // other code for GPU
}
APC | 35
CUDA Grid
Multiple instances of a kernel are executed by a set of virtual threads
A kernel launch creates hierarchical groups of threads
• Threads are grouped into blocks, and blocks into a grid
• Threads and blocks represent different levels of parallelism
Grid – multiple blocks of the same size
Threads within a block and blocks within a grid are indexed in a special way
APC | 36
CUDA Grid
Thread position in a block and block position in a grid are indexed in 3 dimensions (x,y,z)
Grid is specified by the number of blocks in x,y,z (grid size in blocks) and the size of each block in x,y,z
If grid and block sizes in z are equal to 1, we get a flat rectangular grid of threads
APC | 37
CUDA Grid Example
2D grid of 3D blocks
• Logical index z of every block equals zero
• Each block consists of N 2D 'slices' of threads, corresponding to z = 0, …, N-1
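A possible launch configuration for such a grid (the concrete sizes are assumptions for illustration; dim3 is described on the Kernel Launch slide below):
dim3 grid(4, 3);       // grid size in z defaults to 1, so blockIdx.z == 0
dim3 block(8, 8, 4);   // N = 4 'slices' of 8x8 threads: 256 per block (limit 1024)
some_kernel<<<grid, block>>>(args);  // hypothetical kernel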
APC | 38
Orientation in Grid
Performed with the help of built-in variables:
• threadIdx.x threadIdx.y threadIdx.z – thread indexes in a block
• blockIdx.x blockIdx.y blockIdx.z – block indexes in the grid
• blockDim.x blockDim.y blockDim.z – block sizes in threads
• gridDim.x gridDim.y gridDim.z – grid sizes in blocks
• Linear index of a thread in the grid:
int gridSizeX = blockDim.x * gridDim.x;
int gridSizeY = blockDim.y * gridDim.y;
int globalX = blockIdx.x * blockDim.x + threadIdx.x;
int globalY = blockIdx.y * blockDim.y + threadIdx.y;
int globalZ = blockIdx.z * blockDim.z + threadIdx.z;
int threadLinearIdx = (globalZ * gridSizeY + globalY) * gridSizeX + globalX;
APC | 39
Warps and Blocks
Blocks are divided into warps
• Linear index of a thread in its block:
threadIndex =
    (threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x + threadIdx.x;
• Then the index of the warp containing the thread is threadIndex / 32
• The thread's index within its warp is threadIndex % 32
As if the block were pulled out row by row into a line and cut into segments of 32 threads
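A device-side sketch of this computation (the kernel name is illustrative; device printf requires compute capability 2.0 or newer):
#include <cstdio>   // for device-side printf
__global__ void warp_of_thread() {
    int threadIndex = (threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x
                    + threadIdx.x;   // linear index within the block
    int warpId = threadIndex / 32;   // index of the warp containing this thread
    int lane   = threadIndex % 32;   // this thread's index within its warp
    printf("thread %d -> warp %d, lane %d\n", threadIndex, warpId, lane);
}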
APC | 40
One-Dimensional Vectors Addition
[Figure: each thread loads (ld) one element of Vector A and one of Vector B and stores (st) one element of the Result]
APC | 41
One-Dimensional Vectors Addition
Each thread:
• Receives a copy of the parameters
  In this example it receives pointers to the vectors on the GPU
• Determines its position in the grid, threadLinearIdx
• Reads the elements of the input vectors with index threadLinearIdx and writes their sum to the output vector element with index threadLinearIdx
  Thus it calculates one element of the output vector
__global__ void sum_kernel( int *A, int *B, int *C )
{
    int threadLinearIdx =
        blockIdx.x * blockDim.x + threadIdx.x; // calculate its index
    int elemA = A[threadLinearIdx]; // read the required element of A
    int elemB = B[threadLinearIdx]; // read the required element of B
    C[threadLinearIdx] = elemA + elemB; // write the summation result
}
APC | 42
Host Code
• Select a device
  By default – the device with index 0
• Allocate memory on the GPU
• Copy input data to the GPU
• Specify grid and block sizes
  These depend on the problem size
• Launch the kernel
• Copy output data back to the host
A sketch of these steps follows below.
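A minimal host-side sketch of these steps for sum_kernel (error checking omitted; the API calls used here are described on the following slides):
int n = 1024;                                  // illustrative problem size
size_t bytes = n * sizeof(int);
int *hA, *hB, *hC, *dA, *dB, *dC;
hA = (int*)malloc(bytes); hB = (int*)malloc(bytes); hC = (int*)malloc(bytes);
// … fill hA, hB …
cudaSetDevice(0);                              // select a device
cudaMalloc((void**)&dA, bytes);                // allocate memory on the GPU
cudaMalloc((void**)&dB, bytes);
cudaMalloc((void**)&dC, bytes);
cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);  // copy input data
cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);
int blockSize = 256;                           // grid and block sizes
int gridSize = (n + blockSize - 1) / blockSize;
sum_kernel<<<gridSize, blockSize>>>(dA, dB, dC);    // launch the kernel
cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);  // copy output data
cudaFree(dA); cudaFree(dB); cudaFree(dC);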
APC | 43
Device Memory Allocation
cudaError_t cudaMalloc ( void** devPtr, size_t size )
• Allocates size bytes of linear memory on the GPU and returns a pointer to the allocated memory in *devPtr. The memory is not zero-initialized. The memory address is aligned to 512 bytes
cudaError_t cudaFree ( void* devPtr )
• Frees the device memory pointed to by devPtr
APC | 44
Memory Transfer
cudaError_t cudaMemcpy ( void* dst, const void* src, size_t count, cudaMemcpyKind kind )
• Copies count bytes from the memory pointed to by src to the memory pointed to by dst; kind specifies the transfer direction
  cudaMemcpyHostToHost – transfer from host to host
  cudaMemcpyHostToDevice – transfer from host to device
  cudaMemcpyDeviceToHost – transfer from device to host
  cudaMemcpyDeviceToDevice – transfer within the device
• Calling cudaMemcpy() with a kind inconsistent with dst and src results in unpredictable behaviour
APC | 45
Kernel Launch
kernel<<< execution configuration >>>(params);
• "kernel" – kernel function name
• "params" – kernel parameters; each thread gets a copy of them
Execution configuration (basic) – Dg, Db
• dim3 Dg – grid size in blocks; Dg.x * Dg.y * Dg.z – the number of blocks
• dim3 Db – size of each block; Db.x * Db.y * Db.z – the number of threads in a block
struct dim3 – a structure defined in the CUDA Toolkit
• Three fields: unsigned x, y, z
• Constructor dim3(unsigned x=1, unsigned y=1, unsigned z=1)
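A common configuration pattern (a sketch; the rounded-up division guarantees enough blocks to cover all n elements):
dim3 block(256);                          // 256 threads per block; y and z default to 1
dim3 grid((n + block.x - 1) / block.x);   // round up so the grid covers all n elements
sum_kernel<<<grid, block>>>(dA, dB, dC);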
APC | 46
Error Checking
All runtime functions return an error code
The code of every occurring error is automatically stored in a special host variable of type enum cudaError_t
• Thus, at each moment, this variable stores the code of the last error that occurred
• cudaError_t cudaPeekAtLastError() – returns this variable
• cudaError_t cudaGetLastError() – returns this variable and resets it to cudaSuccess
• const char* cudaGetErrorString ( cudaError_t error ) – returns the message string for an error code
The list of possible errors can be found in CUDA_Toolkit_reference_manual.pdf
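A widely used checking pattern (a sketch, not from the slides; needs <cstdio> and <cstdlib>):
#define CUDA_CHECK(call) do {                                   \
    cudaError_t err = (call);                                   \
    if (err != cudaSuccess) {                                   \
        fprintf(stderr, "CUDA error %s at %s:%d\n",             \
                cudaGetErrorString(err), __FILE__, __LINE__);   \
        exit(1);                                                \
    }                                                           \
} while (0)
// Usage: CUDA_CHECK(cudaMalloc((void**)&dA, bytes));
// For kernels (which return nothing): CUDA_CHECK(cudaGetLastError());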
APC | 47
Compiling and Running
APC | 48
Special File Extension *.cu
CUDA extends C++ in several ways:
• Kernel call construction <<< … >>>
• Built-in variables threadIdx, blockIdx
• Qualifiers __global__, __device__, etc.
• …
These extensions can only be processed in *.cu files!
• cudafe doesn't run on files with other extensions
• A *.cu file doesn't need #include <cuda_runtime.h>
Calls to library functions beginning with 'cuda*' can be placed in *.cpp files
• They will be linked by an ordinary linker, provided the library libcudart.so is linked in
APC | 49
Host Code Compiling
test.cpp contains:
The main host code. The kernel launch construction cannot be placed in *.cpp, so we place it in a separate function defined in a *.cu file
#include <cuda_runtime.h> // Toolkit functions declarations
void launchKernel(params); // define this function in *.cu
int main() {
    … // typical host code
    cudaSetDevice(0); // usage of cudart library functions is allowed
    … // typical host code
    launchKernel(params); // this function contains the kernel invocation,
                          // defined in *.cu
    … // typical host code
}
Compilation:
g++ -I /toolkit_install_dir/include test.cpp -c -o test.o
• /toolkit_install_dir/include is the CUDA Toolkit path with the required includes
• -c -o test.o produces an object file
If using nvcc, the include path may be omitted: nvcc test.cpp -c -o test.o
APC | 50
Device Code Compilation
kernel.cu contains:
The kernel definition and a function to launch it. The launch function configures the launch parameters and outputs the kernel's execution time
__global__ void kernel(params) {
    … // kernel code
}
void launchKernel(params) {
    … // launch parameters configuration
    … // creating events (used to time the kernel)
    kernel<<< configuration >>>(params); // kernel launch
}
Compilation: nvcc -arch=sm_35 -Xptxas -v kernel.cu -c -o kernel.o
APC | 51
Project Linking
g++ -L/toolkit_install_dir/lib64 -lcudart test.o kernel.o -o test
• Links with libcudart.so, pointing to its possible location
nvcc test.o kernel.o -o test
• nvcc -v test.o kernel.o -o test shows which command was actually invoked
It is also possible to place all the code in the *.cu file and avoid using *.cpp at all
For details, see CUDA_Compiler_Driver_NVCC.pdf
APC | 52
Running CUDA Program
After compilation and linking we get an ordinary executable file
Run it from the command line:
• $ ./test 1024
APC | 53
Conclusion
Tasks that parallelize well on a GPU:
• Have data parallelism
• Can be divided into sub-problems of similar difficulty
• Each sub-problem can be solved independently
• The number of arithmetic operations is large compared to the number of memory accesses
  so that memory latency is covered by computations
• If an algorithm is iterative, its implementation can be organized without memory transfers between the host and the GPU after each iteration (see the sketch below)
  Data transfers between the host and the GPU are expensive
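A minimal sketch of that last point (kernel and variable names are illustrative):
cudaMemcpy(dX, hX, bytes, cudaMemcpyHostToDevice);   // transfer input once
for (int it = 0; it < nIter; ++it)
    step_kernel<<<grid, block>>>(dX, n);             // no host/device copies
                                                     // inside the loop
cudaMemcpy(hX, dX, bytes, cudaMemcpyDeviceToHost);   // transfer results once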
APC | 54
Beyond the Scope
Memory hierarchy
Asynchronous operations, CUDA streams
Time measurement
Dealing with multiple GPUs