TRANSCRIPT
© APC
CUDA/OpenACC course at DKRZ Day 1
Alex Shevchenko, APC
APC | 2
Introduction to CUDA
APC | 3
GPGPU & CUDA
GPU - Graphics Processing Unit
GPGPU - General-Purpose computing on GPU • Nvidia's first GPGPU-capable GPU was the GeForce 8800 (G80 chip, 2006)
CUDA - Compute Unified Device Architecture: a parallel computing platform and programming model created by Nvidia for its graphics processing units
APC | 4
Entertainment
Professional graphics
HPC
Nvidia GPUs
APC | 5
GPGPU Revolution in HPC
With regard to:
• Price / Performance
• Performance / Energy consumption
APC | 6
GPGPU Revolution in HPC
APC | 7
Acceleration via GPU
APC | 8
Hardware Architecture of CUDA-Enabled GPU
APC | 9
CPU Intel Core I-7 Features
Several high-performance independent cores
• 2, 4, 6, or 8 cores, 2.66-3.6 GHz each
• Each physical core is seen by the system as 2 logical cores and can execute two threads concurrently (Hyper-Threading)
3 cache levels, large L3 cache
• Per core: L1 = 32 KB (data) + 32 KB (instructions), L2 = 256 KB
• Shared L3, up to 20 MB
Memory requests are managed separately for each thread/process
Core i7-3960X: 6 cores, 15 MB L3
APC | 10
GPU Streaming Multiprocessor (SMX)
The device's basic building block (similar to a core in a CPU)
Consists of:
• 192 scalar cores (CUDA Cores), ~1 GHz each
• 4 warp schedulers
• Register file, 256 KB
• 3 caches – texture, global (L1), constant (uniform)
• 32 Special Function Units (SFU) – interpolation and transcendental single-precision math
APC | 11
Chip in Maximum Configuration (K20X)
• 14(15) SMX
• 2688 CUDA Cores
• Cache L2 1.5 MB
• 384-bit GDDR5
• PCI-E 3.0
APC | 12
GPU vs. CPU
Hundreds of simplified computational cores working at low clock frequencies, ~1 GHz (instead of 2-8 GHz in a CPU)
Small caches
• 192 cores share L1 (16-48 KB)
• L2 shared between all cores, 1.5 MB; no L3
GDDR5 with high bandwidth and high latency
• Optimized for massively parallel access (throughput rather than latency)
Support for millions of virtual threads, fast (hardware) context switching for groups of threads
APC | 14
Memory Latency Utilization
Purpose: keep all cores loaded
Problem: memory latency
Solution:
• CPU: a complex cache hierarchy
• GPU: thousands of threads working while memory transactions are in flight
With hundreds of cores and support for millions of threads, the GPU is better able to utilize all of its memory bandwidth
APC | 15
Theoretical Bandwidth and Performance: GPU vs. CPU
APC | 16
Development systems
• Deeper knowledge
• Possibly, better performance
APC | 17
SIMT Model
How to execute millions of threads on a GPU?
APC | 18
CUDA in Flynn's Classification
Flynn's classification of computer architectures: SISD, SIMD, MISD, MIMD
• SIMD – all processors execute one instruction on multiple data
• MIMD – each processor executes independently
• SMP – all processors have equal rights to access the memory (one of the MIMD variants, alongside MPP, NUMA, cc-NUMA)
APC | 19
CUDA in Flynn's Classification
Nvidia has its own computational model, with features of both SIMD and MIMD (SMP):
Nvidia SIMT: Single Instruction – Multiple Thread
APC | 20
SIMT: Virtual Threads, Blocks
All threads virtually:
• operate in parallel (MIMD)
• have equal privileges to access the memory (MIMD: SMP)
Threads are divided into groups of equal size (blocks):
• In general, global synchronization of all threads is not possible
• Local synchronization within a block is available; the threads of a single block can communicate through a special memory
Threads do not migrate between blocks: each thread stays in its block from start to finish
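The «special memory» is per-block shared memory. A minimal sketch of block-local communication (the kernel name and sizes are illustrative, not from the slides; launch with blocks of 256 threads):
// Threads of one block cooperate through shared memory and a barrier
__global__ void block_sum(int *data, int *blockSums) {
    __shared__ int buf[256];            // memory visible to the whole block
    buf[threadIdx.x] = data[blockIdx.x * blockDim.x + threadIdx.x];
    __syncthreads();                    // local synchronization within the block
    if (threadIdx.x == 0) {             // one thread combines the block's values
        int s = 0;
        for (int i = 0; i < blockDim.x; ++i) s += buf[i];
        blockSums[blockIdx.x] = s;
    }
}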
APC | 21
SIMT: Hardware Implementation
All threads of a single block are executed on a single multiprocessor (SMX)
Maximum number of threads in a block – 1024
Blocks can't switch SMX
Allocation of blocks between multiprocessors is unpredictable
Each SMX operates independently
[Figure: program blocks mapped onto virtual blocks of threads]
APC | 22
SIMT: Hardware Implementation
Thread blocks are divided into groups of 32 threads called warps
All threads of a warp simultaneously execute a single common instruction (exactly SIMD execution)
On each execution cycle, the warp scheduler selects a warp whose threads are ready for execution and launches the whole warp
[Figure: warp scheduler dispatching warps of a virtual block of threads]
APC | 23
Branching
All threads of a warp simultaneously perform the same instruction
What if some of the threads should not execute this instruction?
• e.g. if (<condition>), where the condition differs between the threads of a warp
They will be idle
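A minimal sketch of such divergence (an illustrative kernel, not from the slides): within one warp the two branch paths are serialized, and each thread idles through the path it does not take:
__global__ void diverge(int *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)          // even and odd lanes of the same warp take
        out[i] = i * 2;      // different paths, so the warp executes both
    else                     // branches one after the other, masking the
        out[i] = i * 3;      // threads that are inactive in each
}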
APC | 24
Branching
[Figure: instruction stream of a warp at a branch – the two paths are executed one after another while the threads on the inactive path idle]
APC | 25
Several Blocks on a Single SM
An SM can concurrently execute several blocks:
• Maximum number of blocks per SM – 16
• Maximum number of resident threads per multiprocessor – 2048
[Figure: several virtual blocks of threads resident on a single SM]
APC | 26
Occupancy
The more threads are active on a multiprocessor, the more efficiency can be reached:
• Blocks of 1024 threads – 2 blocks per SM, 2048 threads, 100% of maximum
• Blocks of 50 threads – 16 blocks per SM, 800 threads, 39%
• Blocks of 768 threads – 2 blocks per SM, 1536 threads, 75%
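Newer CUDA toolkits (6.5 and later, after the era of these slides) can estimate this figure for a concrete kernel; a sketch, using sum_kernel from a later slide and a block size of 256 as examples:
int numBlocks;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, sum_kernel, 256, 0);
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
// fraction of the SM's resident-thread limit actually used
float occupancy = (float)(numBlocks * 256) / prop.maxThreadsPerMultiProcessor;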
APC | 27
SIMT and Scaling
Virtual
• GPU supports millions of threads
• Virtual blocks are independent, so code can be executed on any number of SMs
Hardware
• SMs are independent
• Different GPUs contain different numbers of SMs
APC | 28
Summing Up
Nvidia SIMT – all the threads of a warp simultaneously execute one instruction; warps run independently
SIMD – all the threads simultaneously perform one instruction
MIMD – each thread runs independently
SMP – all the threads have an equal opportunity to access the memory
[Figure: hierarchy thread → warp → block → program; a warp executes as SIMD, warps and blocks relate to each other as MIMD]
APC | 29
CUDA: Heterogeneous Parallel Programming
APC | 30
Calculations using GPU
A program that uses the GPU consists of:
• Code for the GPU (device code), containing the computational instructions and memory access handling
• Code for the CPU (host code), which handles:
  GPU memory management – allocation / release
  Data exchange between GPU and CPU
  GPU code launches
  Processing of the results and other serial code
APC | 31
Calculations using GPU
The GPU is regarded as a peripheral device controlled by the CPU
• The GPU is «passive», i.e. it cannot give itself work to do
Device code can be launched from anywhere in the program like a normal function
• This enables 'incremental' program optimization
APC | 32
Device Code
CUDA code uses C++ with some add-ons:
• Attributes for functions, variables and structures
• Built-in functions
  Mathematics implemented on GPU
  Synchronization, collective operations
• Vector data types
• Built-in variables threadIdx, blockIdx, gridDim, blockDim
• Templates for working with textures
• …
Compiled by a special compiler, cicc
APC | 33
Host Code
There is a special syntax for launching multiple instances of the kernel process on the GPU
• In the simplest form it looks like:
kernel_routine<<<gridDim, blockDim>>>(args);
Code for the CPU is compiled with an ordinary compiler
• The exception is the kernel launch construction <<< ... >>>
Functions are linked from dynamic libraries
APC | 34
CUDA Kernel
A special function, the entry point for code executed on the GPU
Doesn't return anything (void)
Declared with the qualifier __global__
Can only access GPU memory
No static variables
Parameter declarations and their use are the same as for normal functions
The host launches «kernels», the device executes them
__global__ void kernel(int *ptr) {
    ptr = ptr + 1;   // pointer arithmetic on GPU memory
    ptr[0] = 100;    // write to GPU memory
    …                // other code for GPU
}
APC | 35
CUDA Grid
Multiple instances of a kernel are executed by a set of virtual threads
A kernel launch creates hierarchical groups of threads
• Threads are grouped into blocks, and blocks into a grid
• Threads and blocks represent different levels of parallelism
Grid – multiple blocks of the same size
Threads within a block and blocks within a grid are indexed in a special way
APC | 36
CUDA Grid
Thread position in a block and block position in a grid are indexed in 3 dimensions (x,y,z)
Grid is specified by the number of blocks in x,y,z (grid size in blocks) and the size of each block in x,y,z
If grid and block sizes in z are equal to 1, we get a flat rectangular grid of threads
APC | 37
CUDA Grid Example
2D grid of 3D blocks
• Logical index z of every block equals zero
• Each block consists of N 2D 'slices' of threads, corresponding to z = 0, …, N-1
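A possible launch configuration for such a grid (the concrete sizes are assumptions for illustration; dim3 is described on the Kernel Launch slide below):
dim3 grid(4, 3);       // grid size in z defaults to 1, so blockIdx.z == 0
dim3 block(8, 8, 4);   // N = 4 'slices' of 8x8 threads: 256 per block (limit 1024)
some_kernel<<<grid, block>>>(args);  // hypothetical kernel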
APC | 38
Orientation in Grid
Performed with the help of built-in variables:
• threadIdx.x threadIdx.y threadIdx.z – thread indexes in a block
• blockIdx.x blockIdx.y blockIdx.z – block indexes in the grid
• blockDim.x blockDim.y blockDim.z – block sizes in threads
• gridDim.x gridDim.y gridDim.z – grid sizes in blocks
• Linear index of a thread in the grid:
int gridSizeX = blockDim.x * gridDim.x;
int gridSizeY = blockDim.y * gridDim.y;
int globalX = blockIdx.x * blockDim.x + threadIdx.x;
int globalY = blockIdx.y * blockDim.y + threadIdx.y;
int globalZ = blockIdx.z * blockDim.z + threadIdx.z;
int threadLinearIdx = (globalZ * gridSizeY + globalY) * gridSizeX + globalX;
APC | 39
Warps and Blocks
Blocks are divided into warps
• Linear index of a thread in its block:
threadIndex =
    (threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x + threadIdx.x;
• Then the index of the warp containing the thread is threadIndex / 32
• The thread's index within its warp is threadIndex % 32
As if the block were pulled out row by row into a line and cut into segments of 32 threads
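A device-side sketch of this computation (the kernel name is illustrative; device printf requires compute capability 2.0 or newer):
#include <cstdio>   // for device-side printf
__global__ void warp_of_thread() {
    int threadIndex = (threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x
                    + threadIdx.x;   // linear index within the block
    int warpId = threadIndex / 32;   // index of the warp containing this thread
    int lane   = threadIndex % 32;   // this thread's index within its warp
    printf("thread %d -> warp %d, lane %d\n", threadIndex, warpId, lane);
}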
APC | 40
One-Dimensional Vectors Addition
[Figure: each thread loads (ld) one element of Vector A and one of Vector B and stores (st) one element of the Result]
APC | 41
One-Dimensional Vectors Addition
Each thread:
• Receives a copy of the parameters
  In this example it receives pointers to the vectors on the GPU
• Determines its position in the grid, threadLinearIdx
• Reads the elements of the input vectors with index threadLinearIdx and writes their sum to the output vector element with index threadLinearIdx
  Thus it calculates one element of the output vector
__global__ void sum_kernel( int *A, int *B, int *C )
{
    int threadLinearIdx =
        blockIdx.x * blockDim.x + threadIdx.x; // calculate its index
    int elemA = A[threadLinearIdx]; // read the required element of A
    int elemB = B[threadLinearIdx]; // read the required element of B
    C[threadLinearIdx] = elemA + elemB; // write the summation result
}
APC | 42
Host Code
• Select a device
  By default – the device with index 0
• Allocate memory on the GPU
• Copy input data to the GPU
• Specify grid and block sizes
  These depend on the problem size
• Launch the kernel
• Copy output data back to the host
A sketch of these steps follows below.
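A minimal host-side sketch of these steps for sum_kernel (error checking omitted; the API calls used here are described on the following slides):
int n = 1024;                                  // illustrative problem size
size_t bytes = n * sizeof(int);
int *hA, *hB, *hC, *dA, *dB, *dC;
hA = (int*)malloc(bytes); hB = (int*)malloc(bytes); hC = (int*)malloc(bytes);
// … fill hA, hB …
cudaSetDevice(0);                              // select a device
cudaMalloc((void**)&dA, bytes);                // allocate memory on the GPU
cudaMalloc((void**)&dB, bytes);
cudaMalloc((void**)&dC, bytes);
cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);  // copy input data
cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);
int blockSize = 256;                           // grid and block sizes
int gridSize = (n + blockSize - 1) / blockSize;
sum_kernel<<<gridSize, blockSize>>>(dA, dB, dC);    // launch the kernel
cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);  // copy output data
cudaFree(dA); cudaFree(dB); cudaFree(dC);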
APC | 43
Device Memory Allocation
cudaError_t cudaMalloc ( void** devPtr, size_t size )
• Allocates size bytes of linear memory on the GPU and returns a pointer to the allocated memory in *devPtr. The memory is not zero-initialized. The memory address is aligned to 512 bytes
cudaError_t cudaFree ( void* devPtr )
• Frees the device memory pointed to by devPtr
APC | 44
Memory Transfer
cudaError_t cudaMemcpy ( void* dst, const void* src, size_t count, cudaMemcpyKind kind )
• Copies count bytes from the memory pointed to by src to the memory pointed to by dst; kind specifies the transfer direction
  cudaMemcpyHostToHost – transfer from host to host
  cudaMemcpyHostToDevice – transfer from host to device
  cudaMemcpyDeviceToHost – transfer from device to host
  cudaMemcpyDeviceToDevice – transfer within the device
• Calling cudaMemcpy() with a kind inconsistent with dst and src results in unpredictable behaviour
APC | 45
Kernel Launch
kernel<<< execution configuration >>>(params);
• "kernel" – kernel function name
• "params" – kernel parameters; each thread gets a copy of them
Execution configuration (basic) – Dg, Db
• dim3 Dg – grid size in blocks; Dg.x * Dg.y * Dg.z – the number of blocks
• dim3 Db – size of each block; Db.x * Db.y * Db.z – the number of threads in a block
struct dim3 – a structure defined in the CUDA Toolkit
• Three fields: unsigned x, y, z
• Constructor dim3(unsigned x=1, unsigned y=1, unsigned z=1)
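A common configuration pattern (a sketch; the rounded-up division guarantees enough blocks to cover all n elements):
dim3 block(256);                          // 256 threads per block; y and z default to 1
dim3 grid((n + block.x - 1) / block.x);   // round up so the grid covers all n elements
sum_kernel<<<grid, block>>>(dA, dB, dC);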
APC | 46
Error Checking
All runtime functions return an error code
The code of every occurring error is automatically stored in a special host variable of type enum cudaError_t
• Thus, at each moment, this variable stores the code of the last error that occurred
• cudaError_t cudaPeekAtLastError() – returns this variable
• cudaError_t cudaGetLastError() – returns this variable and resets it to cudaSuccess
• const char* cudaGetErrorString ( cudaError_t error ) – returns the message string for an error code
The list of possible errors can be found in CUDA_Toolkit_reference_manual.pdf
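A widely used checking pattern (a sketch, not from the slides; needs <cstdio> and <cstdlib>):
#define CUDA_CHECK(call) do {                                   \
    cudaError_t err = (call);                                   \
    if (err != cudaSuccess) {                                   \
        fprintf(stderr, "CUDA error %s at %s:%d\n",             \
                cudaGetErrorString(err), __FILE__, __LINE__);   \
        exit(1);                                                \
    }                                                           \
} while (0)
// Usage: CUDA_CHECK(cudaMalloc((void**)&dA, bytes));
// For kernels (which return nothing): CUDA_CHECK(cudaGetLastError());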
APC | 47
Compiling and Running
APC | 48
Special File Extension *.cu
CUDA extends C++ in several ways:
• Kernel call construction <<< … >>>
• Built-in variables threadIdx, blockIdx
• Qualifiers __global__, __device__, etc.
• …
These extensions can only be processed in *.cu files!
• cudafe doesn't run on files with other extensions
• A *.cu file doesn't need #include <cuda_runtime.h>
Calls to library functions beginning with 'cuda*' can be placed in *.cpp files
• They will be linked by an ordinary linker, provided the library libcudart.so is linked in
APC | 49
Host Code Compiling
test.cpp contains:
The main host code. The kernel launch construction cannot be placed in *.cpp, so we place it in a separate function defined in a *.cu file
#include <cuda_runtime.h> // Toolkit functions declarations
void launchKernel(params); // define this function in *.cu
int main() {
    … // typical host code
    cudaSetDevice(0); // usage of cudart library functions is allowed
    … // typical host code
    launchKernel(params); // this function contains the kernel invocation,
                          // defined in *.cu
    … // typical host code
}
Compilation:
g++ -I /toolkit_install_dir/include test.cpp -c -o test.o
• /toolkit_install_dir/include is the CUDA Toolkit path with the required includes
• -c -o test.o produces an object file
If using nvcc, the include path may be omitted: nvcc test.cpp -c -o test.o
APC | 50
Device Code Compilation
kernel.cu contains:
The kernel definition and a function to launch it. The launch function configures the launch parameters and outputs the kernel's execution time
__global__ void kernel(params) {
    … // kernel code
}
void launchKernel(params) {
    … // launch parameters configuration
    … // creating events (used to time the kernel)
    kernel<<< configuration >>>(params); // kernel launch
}
Compilation: nvcc -arch=sm_35 -Xptxas -v kernel.cu -c -o kernel.o
APC | 51
Project Linking
g++ -L/toolkit_install_dir/lib64 -lcudart test.o kernel.o -o test
• Links with libcudart.so, pointing to its possible location
nvcc test.o kernel.o -o test
• nvcc -v test.o kernel.o -o test shows which command was actually invoked
It is also possible to place all the code in the *.cu file and avoid using *.cpp at all
For details, see CUDA_Compiler_Driver_NVCC.pdf
APC | 52
Running CUDA Program
After compilation and linking we get an ordinary executable file
Run it from the command line:
• $ ./test 1024
APC | 53
Conclusion
Tasks that parallelize well on a GPU:
• Have data parallelism
• Can be divided into sub-problems of similar difficulty
• Each sub-problem can be solved independently
• The number of arithmetic operations is large compared to the number of memory accesses
  so that memory latency is covered by computations
• If an algorithm is iterative, its implementation can be organized without memory transfers between the host and the GPU after each iteration (see the sketch below)
  Data transfers between the host and the GPU are expensive
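A minimal sketch of that last point (kernel and variable names are illustrative):
cudaMemcpy(dX, hX, bytes, cudaMemcpyHostToDevice);   // transfer input once
for (int it = 0; it < nIter; ++it)
    step_kernel<<<grid, block>>>(dX, n);             // no host/device copies
                                                     // inside the loop
cudaMemcpy(hX, dX, bytes, cudaMemcpyDeviceToHost);   // transfer results once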
APC | 54
Beyond the Scope
Memory hierarchy
Asynchronous operations, CUDA streams
Time measurement
Dealing with multiple GPUs