gpu-p - portal.tpu.ru

GPU-PROGRAMMING WITH CUDA

J. Keller March 12, 2018

Overview

• GPU Hardware

• CUDA Programming Fundamentals

• CUDA Programming Examples

• Summary

Graphics Processing Units are Complex

Figure: DirectX 10 Pipeline

• Vertex-Shader: Transformation 3D-2D Coordinates, Point Position+Color

• Geometry-Shader (DirectX 10): triangulation, adds e.g. line segments to improve curve representation

• Pixel-Shader: modifies color and shade of pixels

Start of GPU-Computing

First Languages for Shading
• RenderMan Shading Language (Pixar 1988)
• Stanford Real-Time Shading Language (2001)

Standardized Shading Languages (SL)
• GLSL (OpenGL Shading Language)
• HLSL (High Level Shading Language, Microsoft)
• Cg (NVIDIA, GL and D3D)

General Purpose Graphics Processing Units

• GPU Computing or GPGPU means: usage of the GPU for normal computation (not for computation of images for display)

• GPU Computing: CPU and GPU both participate in computation

• CPU: control-flow intensive part; GPU: data-intensive part

GPU vs. CPU

• CPU
  – Small to medium number of strong general purpose cores
  – High performance for single to medium number of threads

• GPU (Graphics Processing Unit)
  – Large number of small, specialized cores
  – High performance for very large number of threads

• Now: Integration of GPU and CPU on same die

• Examples: AMD Trinity; Intel HD2000, HD4000 for i{3, 5, 7}-3K; Nvidia Tegra

SIMD Processing

CPU instruction set extensions
• x86 SSE or AVX, 3DNow!, PowerPC AltiVec
• Knights Ferry/Corner/Landing (MIC, Larrabee)
• Typical SIMD width: 2–8

GPU
• Implicit vectorization through hardware
• NVIDIA GPUs: SIMD warps
• AMD GPUs: VLIW wavefronts
• Typical SIMD width: 16–64 (warp=32)

Programming APIs

• CUDA (NVIDIA, leading but proprietary)

• OpenCL (Open Computing Language, open standard)

• DirectCompute (Microsoft)

• OpenACC, pragma-based standard like OpenMP

• PGI (The Portland Group, Inc.) Accelerator Compiler, implicitly parallel programming language

• Shader in OpenGL/DirectX (GLSL, HLSL, open standards)

• Brook ⇒ Brook+, RapidMind ⇒ CAL (Compute Abstraction Layer, AMD)

• FireStream (AMD), HMPP, GPUSs, StarPU, QUARK, OpenMPC

General Organization and Connection to Host (CPU)

General Organization of GPU

• Many simple cores, streaming processors (SP)
• Grouped (32) into multiprocessors (SM)
• Many registers
• Small, fast local shared memories
• Large, slow global memory (access in 400 to 800 cycles)
• Optimized for data-parallel processing
  – Serialization upon control flow divergence
  – No branch prediction

NVidia GeForce-8 Architecture

[Figure: Streaming Multiprocessor #4 with Streaming Processors (SP, fed from the instruction unit) and Texture Processors (TP); further labels: L1 D-Cache, L2, FB, SLI]

• Floats and doubles available

• Doubles slow

• Missing exceptions (e.g. divide by 0)

SLI: Scalable Link Interface

Example: NVidia GeForce 8800 GT

Fermi Streaming Multiprocessor (SM)
- 2 warp schedulers
- Warp: 32 parallel threads
- 2 dispatch units
- 4*8=32 cores
- 32K 32-bit registers
- 4 Special Function Units: SIN, COS, EXP, RCP, etc.

Nvidia Tesla Series

• T10P (GT200 generation, the predecessor of the Fermi architecture):
  – 240 Stream Processors (SP) @ 1.33 GHz
  – 4 GB GDDR3 memory @ 800 MHz
  – 512-bit memory bus
  – 1.4 × 10^9 transistors
  – ~1 × 10^12 FLOPS (1 TFLOPS)

Tesla 10 Series

• Tesla C1060 Computing Processor
  – PCIe 2.0 card (x16)
  – 1 T10P processor with 240 SPs @ 1.33 GHz
  – 4 GB GDDR3 memory @ 800 MHz
  – 512-bit memory bus
  – 102 GB/s memory bandwidth (theoretical)
  – ~160 W power consumption

• Tesla S1070 1U System
  – 1U form factor
  – 4 T10P processors @ 1.5 GHz
  – I.e. 960 SPs, 16 GB RAM
  – 2048-bit memory bus
  – Host connection through 2 PCIe 2.0 cables
  – ~700 W power consumption
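
The theoretical figures on these slides can be reproduced with simple arithmetic. The following is a minimal sketch in plain C, not vendor code: the function names are hypothetical, the factor of 2 transfers per clock assumes double-data-rate GDDR3, and the 3 floating-point operations per SP per clock (one multiply-add plus one multiply) is an assumption about how the ~1 TFLOPS peak is counted.

```c
#include <assert.h>

/* Theoretical memory bandwidth in GB/s (1 GB = 1e9 bytes).
 * transfers_per_clock = 2 is an ASSUMPTION for double-data-rate GDDR3. */
double peak_bandwidth_gbs(int bus_width_bits, double memory_clock_mhz,
                          int transfers_per_clock)
{
    assert(bus_width_bits > 0 && memory_clock_mhz > 0.0 && transfers_per_clock > 0);
    double bytes_per_transfer = bus_width_bits / 8.0;
    double transfers_per_second = memory_clock_mhz * 1.0e6 * transfers_per_clock;
    return bytes_per_transfer * transfers_per_second / 1.0e9;
}

/* Peak single-precision GFLOPS.
 * flops_per_core_per_clock is an ASSUMPTION (3 = multiply-add + multiply). */
double peak_gflops(int cores, double core_clock_ghz, int flops_per_core_per_clock)
{
    assert(cores > 0 && core_clock_ghz > 0.0 && flops_per_core_per_clock > 0);
    return cores * core_clock_ghz * flops_per_core_per_clock;
}
```

Under these assumptions, peak_bandwidth_gbs(512, 800.0, 2) gives 102.4, in line with the 102 GB/s quoted for the C1060, and peak_gflops(240, 1.33, 3) gives roughly 958, i.e. about 1 TFLOPS.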

Compute Capabilities

Compute Capability                            1.0    1.1    1.2    1.3    2.0    2.1    3.0    3.5
Threads/Warp                                   32     32     32     32     32     32     32     32
Warps/Multiprocessor                           24     24     32     32     48     48     64     64
Threads/Multiprocessor                        768    768   1024   1024   1536   1536   2048   2048
Thread Blocks/Multiprocessor                    8      8      8      8      8      8     16     16
Max Shared Memory/Multiprocessor (Bytes)    16384  16384  16384  16384  49152  49152  49152  49152
Register File Size                           8192   8192  16384  16384  32768  32768  65536  65536
Register Allocation Unit Size                 256    256    512    512     64     64    256    256
Allocation Granularity                      block  block  block  block   warp   warp   warp   warp
Max. Registers/Thread                         124    124    124    124     63     63     63    255
Shared Memory Allocation Unit Size            512    512    512    512    128    128    256    256
Warp Allocation Granularity                     2      2      2      2      2      2      4      4
Max. Thread Block Size                        512    512    512    512   1024   1024   1024   1024
Shared Memory Size Configurations (Bytes)   16384  16384  16384  16384  49152  49152  49152  49152

Warp register allocation granularities: 64 64 256 256
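
One way to read the table is to ask how many registers each thread may use before the register file, rather than the thread limit, caps the number of resident threads. The sketch below is plain C with hypothetical function names; it deliberately ignores the allocation unit sizes and granularities listed in the table, so it is only a first approximation.

```c
#include <assert.h>

/* Registers each thread may use if the multiprocessor is to hold its
 * maximum number of threads (allocation granularity is ignored here). */
int registers_per_thread_at_full_occupancy(int register_file_size,
                                           int max_threads_per_sm)
{
    assert(register_file_size > 0 && max_threads_per_sm > 0);
    return register_file_size / max_threads_per_sm;
}

/* Number of threads that fit on one multiprocessor for a given register
 * usage per thread, capped by the thread limit of the multiprocessor. */
int resident_threads_by_registers(int register_file_size,
                                  int registers_per_thread,
                                  int max_threads_per_sm)
{
    assert(register_file_size > 0 && registers_per_thread > 0);
    int by_registers = register_file_size / registers_per_thread;
    return by_registers < max_threads_per_sm ? by_registers : max_threads_per_sm;
}
```

With the table values for compute capability 2.0 (32768 registers, 1536 threads), a kernel may use about 21 registers per thread at full occupancy; a kernel that uses the maximum of 63 registers leaves room for only 520 resident threads.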

CUDA = Compute Unified Device Architecture
• Programming API for GPUs by Nvidia
• Latest version: http://developer.nvidia.com/cuda/cuda-downloads
• Currently: CUDA 9.1 (March 2018)
• Toolkit contains:
  – NVIDIA Performance Primitives (NPP) library
  – Support for: Eclipse, Visual Studio
  – LLVM-based compiler (nvcc) (LLVM = Low Level Virtual Machine)
  – Visual Profiler
  – cuda-gdb debugger (Linux & MacOS)
  – GPU disassembler (cuobjdump)
  – Examples with source text

Libraries (from CUDA Toolkit)
• cuFFT: Fast Fourier Transformation
• cuBLAS: Complete BLAS (Basic Linear Algebra Subprograms)
• cuSPARSE: Sparse Matrix Computations
• cuRAND: Random Number Generation (RNG)
• NPP: Performance Primitives for Image and Video Processing
• nvcuvid: Video Decoding
• nvcuvenc: Video Encoding
• Thrust: Templated Parallel Algorithms & Data Structures, e.g. parallel sorting, parallel summation, data structures for vectors

CUDA

• CUDA extends C language

• Compiled through nvcc compiler

• High portability between different CUDA architectures

Compilation

• CUDA C functions → compiler nvcc → PTX code → PTX for target architecture (object code)

• C program (without CUDA) → compiler, e.g. gcc → CPU object files

• Both paths end in the CPU/GPU executable

PTX: Parallel Thread eXecution, a virtual instruction set architecture

CUDA Notation

• Device
  – Graphics card with GPU and graphics memory

• Kernel
  – Program that runs on device
  – Kernel can only access GPU memory
  – New CUDA versions can run several kernels simultaneously

• Host
  – CPU, which starts kernels on device

CUDA Programming Model (1)

• Programming in C
• Some functions executed on GPU in kernels
• Program partitioned in host code (Standard C) and device code (CUDA)
• Distinguish functions by qualifiers
    __host__     Functions on CPU (default)
    __device__   GPU functions
    __global__   Entry points into CUDA code, define kernel

CUDA Further Notation

• Kernel: mapped onto Grid

• Grid: 2D-mesh of thread blocks

Kernel start (CUDA):
    kernel<<<dim3 gridS, dim3 blockS, size_t smem, cudaStream_t str>>>(…)

    dim3 gridS (max. 65535 × 65535)      Dimension of grid (2D)
    dim3 blockS (max. 512 × 512 × 64)    Dimension of block (3D); maxima per dimension, the product is bounded by the max. thread block size
    size_t smem (optional)               Shared memory per block
    cudaStream_t str (optional)          Stream in which the kernel is started

• Block: groups threads as 3D-mesh
  – Blocks must be independent
  – Threads in a thread block can be synchronized
  – Shared memory can be used to exchange data

• Threads: have unique ID, 1-3D
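
The per-dimension maxima of a block and the limit on its total size are separate conditions, which is easy to overlook. As an illustration, here is a small host-side check in plain C; the function names are hypothetical, and the limits are passed in as parameters because they depend on the compute capability of the device.

```c
#include <assert.h>

/* Number of threads in one block of shape bx × by × bz. */
unsigned long long threads_in_block(unsigned bx, unsigned by, unsigned bz)
{
    return (unsigned long long)bx * by * bz;
}

/* A block shape is usable only if every dimension respects its own
 * maximum AND the product respects the maximum thread block size. */
int block_shape_is_valid(unsigned bx, unsigned by, unsigned bz,
                         unsigned max_x, unsigned max_y, unsigned max_z,
                         unsigned max_threads_per_block)
{
    if (bx == 0 || by == 0 || bz == 0) return 0;
    if (bx > max_x || by > max_y || bz > max_z) return 0;
    return threads_in_block(bx, by, bz) <= max_threads_per_block;
}

/* Total number of threads started by a 2D grid of 3D blocks. */
unsigned long long threads_in_grid(unsigned gx, unsigned gy,
                                   unsigned bx, unsigned by, unsigned bz)
{
    return (unsigned long long)gx * gy * threads_in_block(bx, by, bz);
}
```

For example, a 16 × 16 × 1 block (256 threads) passes with the limits 512 × 512 × 64 and 512 threads per block, whereas 32 × 32 × 1 (1024 threads) respects each dimension limit but exceeds the block size limit of 512.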

Scheduling of Kernel onto GPU

• Kernel is Grid, which contains blocks

• Each block assigned to Streaming Multiprocessor (SM)

• SM partitions block into warps (warp: 32 threads)

• All threads of one warp are executed simultaneously on the Streaming Processors (SPs) of the SM

• Assignment of threads → SPs is dynamic

• Warp executed in SIMD style
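
The partitioning of a block into warps is plain integer arithmetic on the linear thread ID. The following sketch in C uses hypothetical helper names and assumes that threads are assigned to warps in order of increasing linear ID, 32 per warp.

```c
#include <assert.h>

#define WARP_SIZE 32

/* Warps needed for one block; the last warp may be only partly filled. */
int warps_per_block(int threads_per_block)
{
    assert(threads_per_block > 0);
    return (threads_per_block + WARP_SIZE - 1) / WARP_SIZE;
}

/* Warp that a thread belongs to, given its linear ID within the block. */
int warp_of_thread(int linear_thread_id)
{
    assert(linear_thread_id >= 0);
    return linear_thread_id / WARP_SIZE;
}

/* Position (lane) of the thread inside its warp. */
int lane_of_thread(int linear_thread_id)
{
    assert(linear_thread_id >= 0);
    return linear_thread_id % WARP_SIZE;
}

/* Lanes that stay unused because the block size is not a multiple of 32. */
int idle_lanes(int threads_per_block)
{
    return warps_per_block(threads_per_block) * WARP_SIZE - threads_per_block;
}
```

A block of 100 threads therefore occupies 4 warps and leaves 28 lanes of the last warp idle, which is one reason to choose block sizes that are multiples of 32.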

Memory Hierarchy
• Register: per thread, small capacity (KB), small latency
• Shared Memory: per block, medium capacity (KB), medium latency, can coordinate threads
• Global Memory: per grid, large capacity (GB), large latency, necessary for I/O

CUDA Memory Model

[Figure: a grid with blocks (0,0), (0,1), …, (0,n-1), each with its own Shared Memory; Texture Memory, Constant Memory and Global Memory belong to the whole grid]

Memory Model
• Memory specified through qualifiers
    __device__   in global memory on GPU
    __shared__   in shared memory on SMs
• Operations on global memory:
  – Allocation: cudaMalloc(void **ptr, size_t bytes)
  – Deallocation: cudaFree(void *ptr)
  – Set memory: cudaMemset(void *ptr, int value, size_t bytes)
• Transfer System-RAM ↔ GPU-RAM: cudaMemcpy(void *dst, const void *src, size_t bytes, …)
• In kernels: transfer global memory ↔ shared memory

Caching
- Default: enabled
- Access first L1, then L2, then global mem
- Granularity: 128-byte cache line

Non-caching
- Activate with option -Xptxas -dlcm=cg in the Nvidia compiler
- Access first L2, then global mem
- No access to L1; if present in L1: invalidate cache line
- Granularity: 32 bytes
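
The two granularities matter most when the threads of a warp read addresses that are far apart. The sketch below is a simplified model in plain C, not a description of the actual hardware: it assumes that thread number `lane` of a warp reads one element at `base_address + lane * stride_bytes` and counts how many segments of the given granularity the whole warp touches. All names are hypothetical.

```c
#include <assert.h>

#define WARP_SIZE 32

/* Number of distinct memory segments (cache lines) touched when each of
 * the 32 threads of a warp reads element_bytes at
 * base_address + lane * stride_bytes. Addresses grow with the lane
 * because the stride is non-negative, so segments appear in order. */
int segments_touched(long long base_address, int stride_bytes,
                     int element_bytes, int segment_bytes)
{
    assert(base_address >= 0 && stride_bytes >= 0);
    assert(element_bytes > 0 && segment_bytes > 0);
    int count = 0;
    long long last_segment = -1;
    for (int lane = 0; lane < WARP_SIZE; lane++) {
        long long first_byte = base_address + (long long)lane * stride_bytes;
        long long last_byte = first_byte + element_bytes - 1;
        for (long long s = first_byte / segment_bytes;
             s <= last_byte / segment_bytes; s++) {
            if (s > last_segment) {   /* segment not counted yet */
                count++;
                last_segment = s;
            }
        }
    }
    return count;
}

/* Bytes moved for the warp: whole segments are always transferred. */
long long bytes_transferred(long long base_address, int stride_bytes,
                            int element_bytes, int segment_bytes)
{
    return (long long)segments_touched(base_address, stride_bytes,
                                       element_bytes, segment_bytes)
           * segment_bytes;
}
```

In this model, 32 consecutive floats starting at address 0 need one 128-byte line or four 32-byte segments, so both modes move 128 bytes. With a stride of 128 bytes every thread hits its own segment: 32 × 128 = 4096 bytes in the caching mode versus 32 × 32 = 1024 bytes in the non-caching mode, for only 128 useful bytes.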

Accessing Array Elements

Size of 2-dim block:
    blockDim.x    // size of 2-dim block (X coord)
    blockDim.y    // size of 2-dim block (Y coord)

Identifying thread within 2-dim block:
    threadIdx.x   // Thread ID in block (0 to blockDim.x-1)
    threadIdx.y   // Thread ID in block (0 to blockDim.y-1)

For block dimension (3,4):
    threadIdx.x = 0,1,2
    threadIdx.y = 0,1,2,3

Accessing Array Elements

Identifying block within 1-dim grid:
    blockIdx.x    // block ID in grid

Access to array element (1-dim grid, 2-dim blocks):
    blocksize = blockDim.x * blockDim.y;          // no. threads in block
    tid = threadIdx.y * blockDim.x + threadIdx.x; // linear ID
    index = blockIdx.x * blocksize + tid;         // array element index

Access to array element (1-dim grid, 1-dim blocks):
    blocksize = blockDim.x;                       // no. threads in block
    tid = threadIdx.x;                            // linear ID
    index = blockIdx.x * blocksize + tid;         // array element index
or
    index = blockIdx.x * blockDim.x + threadIdx.x;
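
The index formula for a 1-dim grid of 2-dim blocks can be checked on the host. The following plain C sketch passes the values that CUDA provides as built-ins (blockIdx, blockDim, threadIdx) as ordinary parameters; the function names are hypothetical. The second function enumerates all threads and confirms that every array index is produced exactly once.

```c
#include <assert.h>
#include <stdlib.h>

/* Array index of one thread, following the formula for a
 * 1-dim grid of 2-dim blocks. */
int linear_index(int block_idx_x, int block_dim_x, int block_dim_y,
                 int thread_idx_x, int thread_idx_y)
{
    int blocksize = block_dim_x * block_dim_y;           /* threads per block */
    int tid = thread_idx_y * block_dim_x + thread_idx_x; /* linear ID in block */
    return block_idx_x * blocksize + tid;
}

/* Returns 1 if the threads of the whole grid produce each index in
 * [0, grid_dim_x * block_dim_x * block_dim_y) exactly once, else 0. */
int covers_each_index_once(int grid_dim_x, int block_dim_x, int block_dim_y)
{
    assert(grid_dim_x > 0 && block_dim_x > 0 && block_dim_y > 0);
    int total = grid_dim_x * block_dim_x * block_dim_y;
    int *hits = calloc((size_t)total, sizeof *hits);
    if (hits == NULL) return 0;
    for (int b = 0; b < grid_dim_x; b++)
        for (int ty = 0; ty < block_dim_y; ty++)
            for (int tx = 0; tx < block_dim_x; tx++)
                hits[linear_index(b, block_dim_x, block_dim_y, tx, ty)]++;
    int ok = 1;
    for (int i = 0; i < total; i++)
        if (hits[i] != 1) ok = 0;
    free(hits);
    return ok;
}
```

For block dimension (3,4), the thread with threadIdx = (1,2) in block 2 gets index 2 * 12 + (2 * 3 + 1) = 31.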

First Parallel Code – GPU Part

• Add two vectors of length N: 1-dim grid of N/B blocks, each 1-dim block consisting of B threads

• blockDim.x: first dimension of block (i.e. B)

__global__ void vecAdd(float *a, float *b, float *c, int N) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N) {
        c[i] = a[i] + b[i];
    }
}

First Parallel Code – CPU Part

#define N 65536                        // N is the dimension of the vectors
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    size_t size = N * sizeof(float);
    float *dA, *dB, *dC;                // vectors in device memory
    static float hA[N], hB[N], hC[N];   // vectors in host memory
    for (int i = 0; i < N; i++) {
        hA[i] = (float)i;
        hB[i] = (float)(N - i);
    }
    cudaMalloc((void**)&dA, size);      // alloc vectors in cuda mem
    cudaMalloc((void**)&dB, size);
    cudaMalloc((void**)&dC, size);
    cudaMemcpy(dA, hA, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, size, cudaMemcpyHostToDevice);
    int threadS = 256;                              // Threads per block
    int blockS = (N + threadS - 1) / threadS;       // Blocks per grid
    vecAdd<<<blockS, threadS>>>(dA, dB, dC, N);
    cudaMemcpy(hC, dC, size, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    for (int i = 0; i < N; i++)
        if (hC[i] != N) printf("Wrong result at %d\n", i);
    return 0;
}
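
The expression (N + threadS - 1) / threadS rounds up, so the grid may contain more threads than vector elements; the test i < N in the kernel keeps those surplus threads from writing outside the arrays. The following plain C sketch models the launch sequentially on the CPU to make this visible. The function names are hypothetical, and the two loops stand in for the blocks and threads that the GPU would run in parallel.

```c
#include <assert.h>

/* Blocks needed so that blocks * threads_per_block >= n (rounds up). */
int blocks_for(int n, int threads_per_block)
{
    assert(n >= 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

/* Sequential model of vecAdd<<<blocks, threads_per_block>>>(a, b, c, n).
 * Returns the number of threads for which the guard i < n fails. */
int vec_add_model(const float *a, const float *b, float *c,
                  int n, int threads_per_block)
{
    int blocks = blocks_for(n, threads_per_block);
    int idle_threads = 0;
    for (int block = 0; block < blocks; block++) {
        for (int thread = 0; thread < threads_per_block; thread++) {
            int i = threads_per_block * block + thread; /* as in the kernel */
            if (i < n)
                c[i] = a[i] + b[i];
            else
                idle_threads++;   /* surplus thread: must not touch memory */
        }
    }
    return idle_threads;
}
```

For N = 65536 and 256 threads per block the division is exact (256 blocks, no idle threads). For N = 1000 the same block size needs 4 blocks, and 24 of the 1024 threads do nothing.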

// Matrix Multiplication
// DIM (matrix dimension, assumed to be a multiple of 16) and the contents
// of a and b are expected to be set elsewhere.
int main() {
    dim3 threads(16, 16, 1);                          // blocksize=256, really 2-dim
    dim3 grid(DIM / threads.x, DIM / threads.y, 1);   // 2-dim
    float a[DIM*DIM], b[DIM*DIM], res[DIM*DIM];
    float *devA, *devB, *devRes;
    int matSize = DIM * DIM * sizeof(float);
    cudaMalloc((void**)&devA, matSize);               // cudaMalloc expects a void**
    cudaMalloc((void**)&devB, matSize);
    cudaMalloc((void**)&devRes, matSize);
    cudaMemcpy(devA, a, matSize, cudaMemcpyHostToDevice);
    cudaMemcpy(devB, b, matSize, cudaMemcpyHostToDevice);
    matMul<<<grid, threads>>>(devA, devB, devRes);
    cudaDeviceSynchronize();                          // replaces deprecated cudaThreadSynchronize()
    cudaMemcpy(res, devRes, matSize, cudaMemcpyDeviceToHost);
    for (int i = 0; i < DIM; i++) {
        for (int j = 0; j < DIM; j++)
            printf("%f ", res[i*DIM + j]);            // row i, column j; %f since res holds floats
        printf("\n");
    }
    return 0;
}

Matrix Multiplication

• Dimension of matrix: DIM (global variable)

• In each of 2 dimensions: DIM threads

__global__ void matMul(float *a, float *b, float *c) {
    int2 id;
    id.x = blockDim.x * blockIdx.x + threadIdx.x;   // column
    id.y = blockDim.y * blockIdx.y + threadIdx.y;   // row
    float sum = 0;
    for (int z = 0; z < DIM; z++) {
        sum += a[id.y*DIM + z] * b[z*DIM + id.x];
    }
    c[id.y*DIM + id.x] = sum;
}
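
A result computed on the GPU is usually compared against a simple CPU version. The following is a minimal reference sketch in plain C with hypothetical function names. It uses the same row-major layout as the kernel, with the two outer loops taking the place of id.y and id.x, and takes the dimension as a parameter instead of the global DIM.

```c
#include <assert.h>

/* CPU reference: c = a * b for dim × dim matrices in row-major order.
 * The loops over y and x replace the 2D thread index of the kernel. */
void mat_mul_reference(const float *a, const float *b, float *c, int dim)
{
    assert(dim > 0);
    for (int y = 0; y < dim; y++) {
        for (int x = 0; x < dim; x++) {
            float sum = 0.0f;
            for (int z = 0; z < dim; z++)
                sum += a[y * dim + z] * b[z * dim + x];
            c[y * dim + x] = sum;
        }
    }
}

/* Largest absolute difference between two arrays, e.g. between the
 * GPU result and the CPU reference. */
float max_abs_difference(const float *x, const float *y, int count)
{
    float worst = 0.0f;
    for (int i = 0; i < count; i++) {
        float d = x[i] - y[i];
        if (d < 0.0f) d = -d;
        if (d > worst) worst = d;
    }
    return worst;
}
```

Since the order of floating-point additions can differ between CPU and GPU, a comparison of the two results would normally accept a small tolerance on max_abs_difference rather than demand exact equality.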

Choice of Block size

• Number of threads per block: multiple of warp size (32)

• SMs can execute up to 8 thread blocks in parallel (16 from compute capability 3.0 on, see Compute Capabilities)

• Small block size: prevents high utilization

• Large block size: less flexibility

• Typical: ~128 to 256 threads per block

• Depends on application; experiments necessary
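
The effect of the block size on utilization can be estimated from the two limits per multiprocessor, warps and thread blocks. The sketch below in plain C uses hypothetical function names and considers only these two limits; registers and shared memory, which can lower the result further, are left out.

```c
#include <assert.h>

#define WARP_SIZE 32

/* Warps occupied by one block (partly filled warps count fully). */
int warps_per_block(int threads_per_block)
{
    assert(threads_per_block > 0);
    return (threads_per_block + WARP_SIZE - 1) / WARP_SIZE;
}

/* Blocks that fit on one multiprocessor at the same time, limited by
 * the warp limit and by the block limit of the multiprocessor. */
int resident_blocks(int threads_per_block, int max_warps_per_sm,
                    int max_blocks_per_sm)
{
    int by_warps = max_warps_per_sm / warps_per_block(threads_per_block);
    return by_warps < max_blocks_per_sm ? by_warps : max_blocks_per_sm;
}

/* Resident warps as a percentage of the warp limit (rounded down). */
int occupancy_percent(int threads_per_block, int max_warps_per_sm,
                      int max_blocks_per_sm)
{
    int warps = resident_blocks(threads_per_block, max_warps_per_sm,
                                max_blocks_per_sm)
                * warps_per_block(threads_per_block);
    return 100 * warps / max_warps_per_sm;
}
```

With the limits of compute capability 2.0 (48 warps, 8 blocks), blocks of 32 threads reach only 16 %, blocks of 128 threads 66 %, and blocks of 192 or 256 threads 100 %. A single block of 1024 threads again reaches only 66 %, because a second block of that size no longer fits.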

Avoid Control Flow Divergence

• Typical code for CPU:
    if (idx & 1) a[idx] = b[idx];
    else         a[idx] = b[idx] + 1;

• On GPU: performance loss by factor 2!
    First: all odd threads in warp execute a[idx] = b[idx];
    Then: all even threads in warp execute a[idx] = b[idx] + 1;

• Better: a[idx] = b[idx] + 1 - (idx & 1);

• Not always so simple to detect and to cure!
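
When a branch is replaced by arithmetic, it is worth confirming that both forms compute the same values. In the branching code the odd indices receive b[idx] and the even indices receive b[idx] + 1, so a matching branch-free form is b[idx] + 1 - (idx & 1). The following plain C sketch, with hypothetical function names, runs both variants over an array and counts the differences.

```c
#include <assert.h>

/* Branching variant: odd indices copy b, even indices add 1. */
void update_branching(int *a, const int *b, int n)
{
    for (int idx = 0; idx < n; idx++) {
        if (idx & 1)
            a[idx] = b[idx];
        else
            a[idx] = b[idx] + 1;
    }
}

/* Branch-free variant: (idx & 1) is 1 for odd and 0 for even indices,
 * so 1 - (idx & 1) adds 1 exactly for the even indices. */
void update_branch_free(int *a, const int *b, int n)
{
    for (int idx = 0; idx < n; idx++)
        a[idx] = b[idx] + 1 - (idx & 1);
}

/* Number of positions at which two arrays differ. */
int count_mismatches(const int *x, const int *y, int n)
{
    int mismatches = 0;
    for (int i = 0; i < n; i++)
        if (x[i] != y[i]) mismatches++;
    return mismatches;
}
```

On a GPU all threads of a warp execute the single statement of the branch-free variant together, so there are no longer two serialized passes.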

Comparison OpenCL and CUDA

• OpenCL
  – Compiles kernel at runtime for actual platform (~)
  – Supports cards of several manufacturers (++)
  – Supports also non-GPU devices (++)
  – Open standard (++)
  – Available for: Windows, Mac, Linux
  – Larger setup code (-)
  – Less programming comfort (-)
  – Command queues resemble CUDA streams (~)
  – API commands similar, partly different parameters

Comparison OpenCL and CUDA

• CUDA
  – Abstraction for general purpose computing on GPU (GPGPU)
  – Good high-level API (+)
  – Uniform programming model (+)
  – Low level and high level thread synchronization (+)
  – Stream synchronization (+)
  – Support for atomic operations (+)
  – Very good documentation (+)
  – Works only with NVIDIA hardware (-)
  – Manual memory management (-)
  – Available for: Windows, Mac, Linux

Summary

• Use GPUs as co-processors for massively parallel problems with regular structure (few control flow statements)

• CUDA: most popular programming environment

• Many more issues are not mentioned in this introduction; see the CUDA manual and textbooks
