![Page 1: CUDA/OpenACC course at DKRZ...CUDA in Flynn's Classification Computer Architecture SIMD – all processes execute one instruction on multiple data MIMD – each process is executed](https://reader033.vdocuments.net/reader033/viewer/2022041912/5e681f16feae0635b22470ae/html5/thumbnails/1.jpg)
© APC
CUDA/OpenACC course at DKRZ Day 1
Alex Shevchenko, APC
Introduction to CUDA
GPGPU & CUDA
GPU - Graphics Processing Unit
GPGPU - General-Purpose computing on GPUs
• The first GPGPU-capable Nvidia GPU was the G80 chip of the GeForce 8800 series (2006)
CUDA - Compute Unified Device Architecture, a parallel computing platform and programming model created by Nvidia and implemented on its GPUs
Nvidia GPUs
• Entertainment
• Professional graphics
• HPC
GPGPU Revolution in HPC
In terms of
• Price / Performance
• Performance / Energy consumption
Acceleration via GPU
Hardware Architecture of CUDA-Enabled GPU
CPU Intel Core i7 Features
Several high-performance independent cores
• 2, 4, 6 or 8 cores, 2.66-3.6 GHz each
• Each physical core appears to the system as 2 logical cores and can execute two threads concurrently (Hyper-Threading)
3 cache levels with a large L3
• Per core: L1 = 32 KB (data) + 32 KB (instructions), L2 = 256 KB
• Shared L3, up to 20 MB
Memory requests are managed separately for each thread/process
Example: Core i7-3960X, 6 cores, 15 MB L3
GPU Streaming Multiprocessor (SMX)
The indivisible building block of the device (similar to a core in a CPU)
Consists of
• 192 scalar cores (CUDA cores), ~1 GHz each
• 4 warp schedulers
• Register file, 256 KB
• 3 caches: texture, global (L1), constant (uniform)
• 32 Special Function Units (SFUs) for interpolation and single-precision transcendental math
Chip in Maximum Configuration (K20X)
• 14 SMX enabled (of 15 on the chip)
• 2688 CUDA Cores
• Cache L2 1.5 MB
• 384-bit GDDR5
• PCI-E 3.0
GPU vs. CPU
Hundreds of simplified computational cores running at low clock frequencies, ~1 GHz (instead of 2-8 fast cores in a CPU)
Small caches
• 192 cores share L1 (16-48 KB)
• L2 shared between all cores, 1.5 MB, no L3
GDDR5 memory with high bandwidth and high latency
• Optimized for collective access by many threads
Support for millions of virtual threads, fast (hardware) context switching for groups of threads
Memory Latency Utilization
Goal: keep all cores busy
Problem: memory latency
Solution:
• CPU: complex cache hierarchy
• GPU: thousands of threads do useful work while memory transactions are in flight
With hundreds of cores and support for millions of threads, a GPU can hide memory latency and make use of its full memory bandwidth
Theoretical Bandwidth and Performance: GPU vs. CPU
Development systems
Deeper knowledge
Possibly, better performance
SIMT Model
How to execute millions of threads on GPU?
CUDA in Flynn's Classification
Computer Architecture
SIMD – all processes execute one instruction on multiple data
MIMD – each process is executed independently
SMP – all processes have equal rights to access the memory
MPP NUMA cc-NUMA
MISD SISD
CUDA in Flynn's Classification
Nvidia has its own computational model, which combines features of SIMD and MIMD (SMP):
Nvidia SIMT: Single Instruction – Multiple Thread
SIMT: Virtual Threads, Blocks
All threads virtually:
• operate in parallel (MIMD)
• have equal privileges to access the memory (MIMD: SMP)
Threads are divided into groups of equal size (blocks):
• In general, global synchronization of all threads is not possible
• Threads of a single block can synchronize locally and communicate through a special (shared) memory
Threads do not migrate between blocks. Each thread stays in its block from start to finish
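In CUDA C++ the block-local synchronization and the special memory described above correspond to `__syncthreads()` and `__shared__` variables. The following is a minimal sketch, not part of the original slides: the kernel name `reverseInBlock` and the block size of 256 are assumptions made for illustration. Each block reverses its own chunk of an array, which only works because all threads of the block wait at the barrier before reading each other's values.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 256   // assumed block size for this illustration

// Hypothetical kernel: every block reverses its own BLOCK_SIZE-element chunk.
// It must be launched with exactly BLOCK_SIZE threads per block.
__global__ void reverseInBlock(int *data)
{
    __shared__ int tile[BLOCK_SIZE];        // visible to the threads of one block only
    int base = blockIdx.x * blockDim.x;     // first element of this block's chunk

    tile[threadIdx.x] = data[base + threadIdx.x];
    __syncthreads();                        // wait until the whole block has written its element
    data[base + threadIdx.x] = tile[blockDim.x - 1 - threadIdx.x];
}
```

A launch such as `reverseInBlock<<<numBlocks, BLOCK_SIZE>>>(devData)` would process `numBlocks * BLOCK_SIZE` elements. Without the barrier, a thread could read a `tile` slot that a thread from another warp has not written yet.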
SIMT: Hardware Implementation
All threads of a single block are executed on a single multiprocessor (SMX)
Maximum number of threads in a block – 1024
Blocks cannot move from one SMX to another
The assignment of blocks to multiprocessors is unpredictable
Each SMX operates independently
[Figure labels: Virtual blocks of threads; Program blocks]
SIMT: Hardware Implementation
Thread blocks are divided into groups of 32 threads called warps
All threads of a warp simultaneously execute a single common instruction (true SIMD execution)
On each cycle, the warp scheduler selects a warp whose threads are ready to execute and issues the instruction for the whole warp
[Figure labels: Virtual block of threads; warp; Warp Scheduler]
Branching
All threads of a warp simultaneously execute the same instruction
What if some of the threads should not execute this instruction?
• if (<condition>), where the condition evaluates differently for threads within one warp
Those threads will be idle
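To make the effect concrete, here is a small sketch with two hypothetical kernels (their names and the 1D block layout are assumptions for illustration). In the first, neighbouring threads of a warp disagree on the condition, so the warp has to go through both paths with part of its threads idle each time. In the second, the condition is the same for all 32 threads of a warp, so every warp follows a single path. For branches this short the compiler may well generate predicated instructions instead of real jumps, so the sketch shows the pattern rather than a measurable slowdown.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: even and odd threads take different branches,
// so the condition differs inside every warp (divergence).
__global__ void divergentBranch(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)
        out[i] = 1.0f;
    else
        out[i] = 2.0f;
}

// Hypothetical kernel: the condition is constant within each group of
// 32 consecutive threads of a 1D block, so no warp diverges.
__global__ void uniformPerWarp(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / 32) % 2 == 0)
        out[i] = 1.0f;
    else
        out[i] = 2.0f;
}
```

Both kernels expect a 1D launch such as `divergentBranch<<<blocks, 256>>>(devOut)`, with `devOut` holding `blocks * 256` floats. They write different patterns; only the branching structure is being compared.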
Several Blocks on a Single SM
An SM can execute several blocks concurrently
• Maximum number of resident blocks per SM – 16
• Maximum number of resident threads per SM – 2048
[Figure: several virtual blocks of threads resident on one SM]
Occupancy
The more threads are active on a multiprocessor, the higher the efficiency that can be reached
• Blocks of 1024 threads – 2 blocks per SM, 2048 threads, 100% of maximum
• Blocks of 50 threads – 16 blocks per SM, 800 threads, 39%
• Blocks of 768 threads – 2 blocks per SM, 1536 threads, 75%
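The three figures follow from the two per-SM limits quoted earlier (16 resident blocks, 2048 resident threads). The host-only sketch below reproduces that arithmetic; the function name `residentThreads` is hypothetical, and the calculation deliberately ignores register and shared-memory limits, which can lower occupancy further in real kernels.

```cuda
#include <algorithm>
#include <cstdio>

// Per-SM limits taken from the slides (assumed constants, not queried from a device).
const int MAX_BLOCKS_PER_SM  = 16;
const int MAX_THREADS_PER_SM = 2048;

// Number of resident threads on one SM for a given block size,
// ignoring register and shared-memory constraints (a simplification).
int residentThreads(int blockSize)
{
    int blocks = std::min(MAX_BLOCKS_PER_SM, MAX_THREADS_PER_SM / blockSize);
    return blocks * blockSize;
}

int main()
{
    const int sizes[3] = {1024, 50, 768};
    for (int k = 0; k < 3; ++k) {
        int threads = residentThreads(sizes[k]);
        printf("block size %4d: %4d resident threads, %.0f%% of maximum\n",
               sizes[k], threads, 100.0 * threads / MAX_THREADS_PER_SM);
    }
    return 0;
}
```

For an actual kernel, the CUDA runtime offers `cudaOccupancyMaxActiveBlocksPerMultiprocessor`, which also accounts for the kernel's resource usage.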
SIMT and Scaling
Virtual
• The GPU supports millions of threads
• Virtual blocks are independent
Code can be executed on any number of SMs
Hardware
• SMs are independent
Different GPUs contain different numbers of SMs
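The number of SMs and the related limits of a particular GPU can be read at run time. The sketch below uses `cudaGetDeviceProperties` from the CUDA runtime; choosing device 0 is an assumption made for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    // Device index 0 is assumed; a system may have several devices.
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("No usable CUDA device found\n");
        return 1;
    }
    printf("%s: %d SMs, warp size %d\n",
           prop.name, prop.multiProcessorCount, prop.warpSize);
    printf("Max threads per block: %d, max resident threads per SM: %d\n",
           prop.maxThreadsPerBlock, prop.maxThreadsPerMultiProcessor);
    return 0;
}
```

The same binary can be run on GPUs with different SM counts; only the printed values change.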
Summing Up
Nvidia SIMT – all threads of a warp simultaneously execute one instruction; warps run independently
SIMD – all threads simultaneously execute one instruction
MIMD – each thread runs independently
SMP – all threads have an equal opportunity to access the memory
[Figure labels: thread, warp, block, program; MIMD, SIMD]
CUDA: Heterogeneous Parallel Programming
Calculations using GPU
A program that uses GPU consists of:
• Code for GPU (device code), containing the computational instructions and memory accesses
• Code for CPU (host code), which performs:
  GPU memory management – allocation / release
  Data exchange between GPU and CPU
  GPU code launch
  Processing of the results and other serial code
Calculations using GPU
The GPU is regarded as a peripheral device controlled by the CPU
• The GPU is «passive», i.e. it cannot give itself work
Device code can be launched from anywhere in the program, like a normal function
• This allows ‘incremental’ program optimization
Device Code
CUDA code uses C++ with some add-ons:
• Attributes for functions, variables and structures
• Built-in functions
  Mathematics implemented on GPU
  Synchronization, collective operations
• Vector data types
• Built-in variables threadIdx, blockIdx, gridDim, blockDim
• Templates for working with textures
• …
Compiled by a special compiler, cicc
Host Code
There is a special syntax for launching multiple instances of a kernel on the GPU
• In the simplest form it looks like:
kernel_routine<<<gridDim, blockDim>>>(args);
Code for the CPU is compiled with a typical compiler
• Except for the kernel launch construction <<< ... >>>
Functions are linked from dynamic libraries
CUDA Kernel
Special function, an entry point for code executed on GPU
Doesn’t return anything (void)
Declared with the qualifier __global__
Can only access GPU memory
No static variables
Parameters are declared and used in the same way as for normal functions
Host launches «kernels», device executes them
__global__ void kernel(int *ptr) {
    ptr = ptr + 1;   // advance this thread's copy of the pointer by one element
    ptr[0] = 100;    // writes to the second element of the original array
    // ... other code for GPU
}
CUDA Grid
Multiple instances of a kernel are executed by a set of virtual threads
A kernel launch creates hierarchical groups of threads
• Threads are grouped into blocks, and blocks into a grid
• Threads and blocks represent different levels of parallelism
Grid – multiple blocks of the same size
Threads within block and blocks in grid are indexed in a special way
CUDA Grid
Thread position in a block and block position in a grid are indexed in 3 dimensions (x,y,z)
Grid is specified by the number of blocks in x,y,z (grid size in blocks) and the size of each block in x,y,z
If grid and block sizes in z are equal to 1, we get a flat rectangular grid of threads
CUDA Grid Example
2D grid of 3D blocks
• Logical index z of any block is equal to zero
• Each block consists of N 2D ‘slices’ of threads, corresponding to z = 0, …, N-1
Orientation in Grid
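Performed with the help of built-in variables:
• threadIdx.x threadIdx.y threadIdx.z – thread indexes in block
• blockIdx.x blockIdx.y blockIdx.z – block indexes in grid
• blockDim.x blockDim.y blockDim.z – block sizes in threads
• gridDim.x gridDim.y gridDim.z – grid sizes in blocks
• Linear index of a thread in grid:
int gridSizeX = blockDim.x * gridDim.x;   // grid size in threads along x
int gridSizeY = blockDim.y * gridDim.y;   // grid size in threads along y
int gridSizeZ = blockDim.z * gridDim.z;   // grid size in threads along z
int gridSizeAll = gridSizeX * gridSizeY * gridSizeZ;   // total number of threads
int x = blockIdx.x * blockDim.x + threadIdx.x;   // thread coordinates in the whole grid
int y = blockIdx.y * blockDim.y + threadIdx.y;
int z = blockIdx.z * blockDim.z + threadIdx.z;
int threadLinearIdx = (z * gridSizeY + y) * gridSizeX + x;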
Warps and Blocks
Blocks are divided into warps
• Linear index of a thread in block:
threadIndex = (threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x + threadIdx.x
• Then the index of the warp containing the thread is threadIndex / 32
• The thread index within the warp is threadIndex % 32
It is as if the block were stretched row by row into a line and cut into segments of 32 threads
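As an illustration of this layout, the sketch below records the warp and the position inside the warp for every thread of one block. The kernel name `warpLayout` and its output arrays are hypothetical; the built-in device variable `warpSize` is used in place of the literal 32.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel, intended for a single block: stores for each thread
// the index of its warp and its position (lane) inside that warp.
__global__ void warpLayout(int *warpOf, int *laneOf)
{
    // Linear index of the thread in its block (x changes fastest).
    int threadIndex = (threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x
                      + threadIdx.x;
    warpOf[threadIndex] = threadIndex / warpSize;
    laneOf[threadIndex] = threadIndex % warpSize;
}
```

With a 16 x 4 block, launched as `warpLayout<<<1, dim3(16, 4)>>>(devWarp, devLane)`, the rows y = 0 and y = 1 form warp 0 and the rows y = 2 and y = 3 form warp 1, because each row contributes 16 consecutive linear indexes.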
One-Dimensional Vectors Addition
[Figure: each thread loads (ld) one element of Vector A and one element of Vector B and stores (st) one element of Result]
One-Dimensional Vectors Addition
Each thread
• Receives a copy of the parameters
  In this example it receives pointers to vectors on GPU
• Determines its position in the grid, threadLinearIdx
• Reads the elements of the input vectors with index threadLinearIdx and writes their sum to the output vector element with index threadLinearIdx
  Calculates one element of the output vector
__global__ void sum_kernel( int *A, int *B, int *C )
{
int threadLinearIdx =
blockIdx.x * blockDim.x + threadIdx.x; //calculate its index
int elemA = A[threadLinearIdx ]; //read the required element of A
int elemB = B[threadLinearIdx ]; // read the required element of B
C[threadLinearIdx ] = elemA + elemB; //write the summation result
}
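The kernel above is correct only when the number of launched threads equals the vector length. When the length is not a multiple of the block size, the grid has to be rounded up and the surplus threads must be kept from touching memory. A common way to handle this is sketched below; the extra parameter `n` and the name `sum_kernel_checked` are additions made for illustration and are not part of the course code.

```cuda
#include <cuda_runtime.h>

// Variant with an explicit vector length n (an added, hypothetical parameter).
__global__ void sum_kernel_checked(const int *A, const int *B, int *C, int n)
{
    int threadLinearIdx = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadLinearIdx < n)   // threads past the end of the vectors do nothing
        C[threadLinearIdx] = A[threadLinearIdx] + B[threadLinearIdx];
}
```

It would typically be launched with `(n + blockSize - 1) / blockSize` blocks, so that the last, partially filled block is included.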
Host Code
Select a device
  By default – device with index 0
Allocate memory on GPU
Copy input data to GPU
Specify grid and block sizes
  They depend on the problem size
Launch a kernel
Copy output data to host
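Put together, these steps could look like the following sketch for the vector addition example. The helper name `addOnGpu`, the block size of 256 and the problem size of 1024 are assumptions; the vector length is assumed to be a multiple of the block size, and return codes are ignored to keep the sequence of steps visible.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void sum_kernel(int *A, int *B, int *C)
{
    int threadLinearIdx = blockIdx.x * blockDim.x + threadIdx.x;
    C[threadLinearIdx] = A[threadLinearIdx] + B[threadLinearIdx];
}

// Hypothetical helper: adds two host vectors of length n on the GPU.
// n is assumed to be a multiple of the block size (256).
void addOnGpu(const int *a, const int *b, int *c, int n)
{
    size_t bytes = n * sizeof(int);

    cudaSetDevice(0);                                   // select a device

    int *dA = NULL, *dB = NULL, *dC = NULL;
    cudaMalloc((void **)&dA, bytes);                    // allocate memory on GPU
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);

    cudaMemcpy(dA, a, bytes, cudaMemcpyHostToDevice);   // copy input data to GPU
    cudaMemcpy(dB, b, bytes, cudaMemcpyHostToDevice);

    dim3 block(256);                                    // block size in threads
    dim3 grid(n / block.x);                             // grid size in blocks
    sum_kernel<<<grid, block>>>(dA, dB, dC);            // launch the kernel

    cudaMemcpy(c, dC, bytes, cudaMemcpyDeviceToHost);   // copy output data to host

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
}

int main()
{
    const int n = 1024;                                 // assumed problem size
    std::vector<int> a(n), b(n), c(n);
    for (int i = 0; i < n; ++i) { a[i] = i; b[i] = 2 * i; }

    addOnGpu(&a[0], &b[0], &c[0], n);
    printf("c[10] = %d\n", c[10]);
    return 0;
}
```

The final `cudaMemcpy` waits for the kernel to finish before copying, so no explicit synchronization is needed in this simple sequence.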
Device Memory Allocation
cudaError_t cudaMalloc ( void** devPtr, size_t size )
• Allocates size bytes of linear memory on the GPU and returns a pointer to the allocated memory in *devPtr. The memory is not set to zero. The memory address is aligned to 512 bytes
cudaError_t cudaFree ( void* devPtr )
• Frees the device memory pointed to by devPtr.
Memory Transfer
cudaError_t cudaMemcpy ( void* dst, const void* src, size_t count, cudaMemcpyKind kind )
• Copies count bytes from the memory pointed to by src to the memory pointed to by dst; kind specifies the transfer direction
  cudaMemcpyHostToHost – transferring data from host to host
  cudaMemcpyHostToDevice – transferring data from host to device
  cudaMemcpyDeviceToHost – transferring data from device to host
  cudaMemcpyDeviceToDevice – transferring data within device
• Calling cudaMemcpy() with a kind that does not match dst and src results in undefined behaviour
Kernel Launch
kernel<<< execution configuration >>>(params);
• “kernel” – kernel function name
• “params” – kernel parameters, each thread gets a copy of them
execution configuration (basic) – Dg, Db
• dim3 Dg – grid size in blocks, Dg.x * Dg.y * Dg.z is the number of blocks
• dim3 Db – size of each block, Db.x * Db.y * Db.z is the number of threads in a block
struct dim3 – structure defined in the CUDA Toolkit
• Three fields: unsigned x, y, z
• Constructor dim3(unsigned x=1, unsigned y=1, unsigned z=1)
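As an example of building Dg and Db, the sketch below covers a width x height array with 16 x 16 blocks. The kernel `fill2D`, the helper `launchFill2D` and the block shape are assumptions chosen for illustration. Because the constructor defaults unspecified components to 1, `dim3 Db(16, 16)` describes a flat block with z = 1.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: one thread per element of a width x height array.
__global__ void fill2D(float *data, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)          // the grid may be larger than the array
        data[y * width + x] = 1.0f;
}

// Hypothetical helper: builds the execution configuration and launches the kernel.
void launchFill2D(float *devData, int width, int height)
{
    dim3 Db(16, 16);                      // 256 threads per block, Db.z == 1
    dim3 Dg((width  + Db.x - 1) / Db.x,   // round up so that the grid
            (height + Db.y - 1) / Db.y);  // covers the whole array
    fill2D<<<Dg, Db>>>(devData, width, height);
}
```

For a 100 x 50 array this gives Dg = (7, 4, 1), that is 28 blocks of 256 threads, of which the threads outside the array are masked by the bounds check.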
Error Checking
All runtime functions return an error code
The code of the most recent error is automatically stored in a special host variable of type enum cudaError_t
• Thus, at any moment, this variable holds the code of the last error that occurred
• cudaError_t cudaPeekAtLastError() – returns this variable
• cudaError_t cudaGetLastError() – returns this variable and resets it to cudaSuccess
• const char* cudaGetErrorString (cudaError_t error) – returns the message string for an error code
The list of possible errors can be found in CUDA_Toolkit_reference_manual.pdf
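In practice these functions are often wrapped in a small helper so that every call is checked. The macro `CUDA_CHECK` below is a hypothetical convenience, not something provided by the toolkit. Since a kernel launch itself returns nothing, the sketch calls `cudaGetLastError()` right after the launch and then `cudaDeviceSynchronize()` to surface errors that occur while the kernel runs.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical convenience macro: print the error string and abort on failure.
#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err = (call);                                  \
        if (err != cudaSuccess) {                                  \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err));                      \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

__global__ void emptyKernel() {}

int main()
{
    int *dev = NULL;
    CUDA_CHECK(cudaMalloc((void **)&dev, 1024 * sizeof(int)));

    emptyKernel<<<1, 32>>>();
    CUDA_CHECK(cudaGetLastError());        // errors from the launch itself
    CUDA_CHECK(cudaDeviceSynchronize());   // errors raised during kernel execution

    CUDA_CHECK(cudaFree(dev));
    return 0;
}
```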
Compiling and Running
Special File Extension *.cu
CUDA extends C++ in several ways:
• Kernel call construction <<< … >>>
• Built-in variables threadIdx, blockIdx
• Qualifiers __global__, __device__, etc.
• …
These extensions can only be processed in *.cu files!
• cudafe doesn’t run on files with other extensions
• Such a file doesn’t need #include <cuda_runtime.h>
Calls to library functions beginning with ‘cuda*’ can be placed in *.cpp files
• They will be linked by a typical linker if it is given the library libcudart.so
Host Code Compiling
test.cpp contains:
The main host code. The kernel launch construction cannot be placed in *.cpp, so we place it in a separate function defined in *.cu
#include <cuda_runtime.h> // Toolkit function declarations
void launchKernel(params); // declared here, defined in *.cu
int main() {
    // ... typical host code
    cudaSetDevice(0); // cudart library functions may be used here
    // ... typical host code
    launchKernel(params); // this function contains the kernel invocation, defined in *.cu
    // ... typical host code
}
Compilation:
g++ -I /toolkit_install_dir/include test.cpp -c -o test.o
• /toolkit_install_dir/include is the CUDA toolkit directory with the required includes
• -c -o test.o produces an object file
If using nvcc, the include path may be omitted:
nvcc test.cpp -c -o test.o
Device Code Compilation
kernel.cu contains:
The kernel definition and a function to launch it. The launch function configures the launch parameters and outputs the kernel execution time
__global__ void kernel(params) {
    // ... kernel code
}
void launchKernel(params) {
    // ... configure launch parameters
    // ... create events
    kernel<<< configuration >>>(params); // kernel launch
}
Compilation:
nvcc -arch=sm_35 -Xptxas -v kernel.cu -c -o kernel.o
Project Linking
g++ test.o kernel.o -L/toolkit_install_dir/lib64 -lcudart -o test
• Links with libcudart.so; -L points to the directory where it is located
nvcc test.o kernel.o -o test
• nvcc -v test.o kernel.o -o test shows which commands are actually invoked
It is also possible to place all the code in a *.cu file and avoid using *.cpp at all
For details, see CUDA_Compiler_Driver_NVCC.pdf
Running CUDA Program
After compilation and linking we get an ordinary executable file
Run it from the command line:
• $ ./test 1024
Conclusion
Tasks that parallelize well on GPU:
• Have data parallelism
• Can be divided into sub-problems of similar difficulty
• Each sub-problem can be performed independently
• The number of arithmetic operations is large compared to the number of memory accesses
  so that computation covers the memory latency
• If the algorithm is iterative, it can be implemented without memory transfers between host and GPU after each iteration
  Data transfers between the host and GPU are expensive
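The last point can be sketched as follows: the data is copied to the GPU once, all iterations run on the device, and the result is copied back once. The update rule in `step` and the helper name `iterateOnGpu` are placeholders invented for this illustration, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Hypothetical iteration step; the update rule is a placeholder.
__global__ void step(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] = 0.5f * x[i] + 1.0f;
}

// Runs the given number of iterations with one copy in and one copy out.
void iterateOnGpu(float *hostX, int n, int iterations)
{
    size_t bytes = n * sizeof(float);
    float *devX = NULL;
    cudaMalloc((void **)&devX, bytes);
    cudaMemcpy(devX, hostX, bytes, cudaMemcpyHostToDevice);   // once, before the loop

    int block = 256;
    int grid = (n + block - 1) / block;
    for (int it = 0; it < iterations; ++it)
        step<<<grid, block>>>(devX, n);                       // data stays on the GPU

    cudaMemcpy(hostX, devX, bytes, cudaMemcpyDeviceToHost);   // once, after the loop
    cudaFree(devX);
}
```

Kernels launched one after another in this way run in launch order, so each iteration sees the result of the previous one without any host involvement.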
Beyond the scope
Memory hierarchy
Asynchronous operations, CUDA streams
Time measurement
Working with multiple GPUs