GPU-PROGRAMMING WITH CUDA
J. Keller March 12, 2018
Overview
• GPU Hardware
• CUDA Programming Fundamentals
• CUDA Programming Examples
• Summary
Graphics Processing Units are Complex
Figure: DirectX 10 pipeline
• Vertex shader: transforms 3D coordinates to 2D; determines point position and color
• Geometry shader (DirectX 10): triangulation; adds, e.g., line segments to improve curve representation
• Pixel shader: modifies color and shade of pixels
Start of GPU Computing
First languages for shading
• RenderMan Shading Language (Pixar, 1988)
• Stanford Real-Time Shading Language (2001)
Standardized shading languages (SL)
• GLSL (OpenGL Shading Language)
• HLSL (High Level Shading Language, Microsoft)
• Cg (NVIDIA, for GL and D3D)
General Purpose Graphics Processing Units
• GPU computing or GPGPU means: use of the GPU for general computation (not for computing images for display)
• GPU computing: CPU and GPU both participate in the computation
  – CPU: control-flow-intensive part
  – GPU: data-intensive part
GPU vs. CPU
• CPU
  – Small to medium number of strong general-purpose cores
  – High performance for a single to medium number of threads
• GPU (Graphics Processing Unit)
  – Large number of small, specialized cores
  – High performance for a very large number of threads
• Now: integration of GPU and CPU on the same die
• Examples: AMD Trinity; Intel HD2000 and HD4000 for i{3, 5, 7}-3K; Nvidia Tegra
SIMD Processing
CPU instruction set extensions
• x86 SSE or AVX, 3DNow!, PowerPC AltiVec
• Knights Ferry/Corner/Landing (MIC, Larrabee)
• Typical SIMD width: 2–8
GPU
• Implicit vectorization through hardware
• NVIDIA GPUs: SIMD warps
• AMD GPUs: VLIW wavefronts
• Typical SIMD width: 16–64 (warp = 32)
Programming APIs
• CUDA (NVIDIA, leading but proprietary)
• OpenCL (Open Computing Language, open standard)
• DirectCompute (Microsoft)
• OpenACC, pragma-based standard like OpenMP
• PGI (The Portland Group, Inc.) Accelerator Compiler, implicitly parallel programming language
• Shader in OpenGL/DirectX (GLSL, HLSL, open standards)
• Brook ⇒ Brook+, RapidMind ⇒ CAL (Compute Abstraction Layer, AMD)
• FireStream (AMD), HMPP, GPUSs, StarPU, QUARK, OpenMPC
General Organization and Connection to Host(CPU)
General Organization of GPU
• Many simple cores, streaming processors (SPs)
• Grouped (32) into multiprocessors (SMs)
• Many registers
• Small, fast local shared memories
• Large, slow global memory (access takes 400 to 800 cycles)
• Optimized for data-parallel processing
  – Serialization upon control flow divergence
  – No branch prediction
NVidia GeForce-8 Architecture
Figure: block diagram showing Streaming Multiprocessor #4 with its streaming processors (SP, fed from the instruction unit), texture processors (TP), L1 D-cache, L2, FB and SLI
• Floats and doubles available
• Doubles slow
• Missing exceptions (e.g. divide by 0)
SLI: Scalable Link Interface
Example: NVidia GeForce 8800 GT
Fermi Streaming Multiprocessor (SM)
• 2 warp schedulers
• Warp: 32 parallel threads
• 2 dispatch units
• 4*8 = 32 cores
• 32K 32-bit registers
• 4 special function units (SFUs): SIN, COS, EXP, RCP, etc.
Nvidia Tesla Series
• Based on Fermi Architecture
• T10P:
  – 240 stream processors (SPs) @ 1.33 GHz
  – 4 GB GDDR3 memory @ 800 MHz
  – 512-bit memory bus
  – 1.4 × 10^9 transistors
  – ~1 × 10^12 FLOPS (1 TFLOPS)
Tesla 10 Series
• Tesla C1060 Computing Processor
  – PCIe 2.0 card (x16)
  – 1 T10P processor with 240 SPs @ 1.33 GHz
  – 4 GB GDDR3 memory @ 800 MHz
  – 512-bit memory bus
  – 102 GB/s memory bandwidth (theoretical)
  – ~160 W power consumption
• Tesla S1070 1U System
  – 1U form factor
  – 4 T10P processors @ 1.5 GHz
  – I.e. 960 SPs, 16 GB RAM
  – 2048-bit memory bus
  – Host connection through 2 PCIe 2.0 cables
  – ~700 W power consumption
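The theoretical bandwidth quoted for the Tesla C1060 can be reproduced from the bus width and the memory clock. The short derivation below assumes that GDDR3 transfers data on both clock edges (double data rate), which the slide does not state explicitly:

```latex
\begin{align*}
\text{bytes per transfer} &= \frac{512\ \text{bit}}{8\ \text{bit/byte}} = 64\ \text{byte} \\
\text{transfers per second} &= 2 \times 800\ \text{MHz} = 1.6 \times 10^{9}\ \text{s}^{-1} \\
\text{bandwidth} &= 64\ \text{byte} \times 1.6 \times 10^{9}\ \text{s}^{-1} = 102.4\ \text{GB/s}
\end{align*}
```

The S1070 combines four such processors, which matches the listed aggregate bus width of 4 × 512 = 2048 bit.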
Compute Capabilities
Compute Capability                              1.0    1.1    1.2    1.3    2.0    2.1    3.0    3.5
Threads / Warp                                   32     32     32     32     32     32     32     32
Warps / Multiprocessor                           24     24     32     32     48     48     64     64
Threads / Multiprocessor                        768    768   1024   1024   1536   1536   2048   2048
Thread Blocks / Multiprocessor                    8      8      8      8      8      8     16     16
Max Shared Memory / Multiprocessor (Bytes)    16384  16384  16384  16384  49152  49152  49152  49152
Register File Size                             8192   8192  16384  16384  32768  32768  65536  65536
Register Allocation Unit Size                   256    256    512    512     64     64    256    256
Allocation Granularity                        block  block  block  block   warp   warp   warp   warp
Max. Registers / Thread                         124    124    124    124     63     63     63    255
Shared Memory Allocation Unit Size              512    512    512    512    128    128    256    256
Warp Allocation Granularity                       2      2      2      2      2      2      4      4
Max. Thread Block Size                          512    512    512    512   1024   1024   1024   1024
Shared Memory Size Configurations (Bytes)     16384  16384  16384  16384  49152  49152  49152  49152
Warp Register Allocation Granularities            -      -      -      -     64     64    256    256
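Several of the limits in the table can be read from the installed device at run time. The following is a minimal sketch using the CUDA runtime call cudaGetDeviceProperties; the selection of printed fields is only an illustration and covers just part of the table.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp p;
        if (cudaGetDeviceProperties(&p, dev) != cudaSuccess)
            continue;                                  // skip devices we cannot query
        printf("Device %d: %s\n", dev, p.name);
        printf("  compute capability       : %d.%d\n", p.major, p.minor);
        printf("  multiprocessors (SMs)    : %d\n", p.multiProcessorCount);
        printf("  threads per warp         : %d\n", p.warpSize);
        printf("  max threads per SM       : %d\n", p.maxThreadsPerMultiProcessor);
        printf("  max threads per block    : %d\n", p.maxThreadsPerBlock);
        printf("  shared memory per block  : %zu bytes\n", p.sharedMemPerBlock);
        printf("  32-bit registers / block : %d\n", p.regsPerBlock);
    }
    return 0;
}
```

Compiled with nvcc, this prints one section per installed GPU; on a machine without a CUDA device it reports the error and exits.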
CUDA = Compute Unified Device Architecture
• Programming API for GPUs by NVIDIA
• Latest version: http://developer.nvidia.com/cuda/cuda-downloads
• Currently: CUDA 9.1 (March 2018)
• Toolkit contains:
  – NVIDIA Performance Primitives (NPP) library
  – Support for Eclipse and Visual Studio
  – LLVM-based compiler (nvcc) (LLVM = Low Level Virtual Machine)
  – Visual Profiler
  – cuda-gdb debugger (Linux & macOS)
  – GPU disassembler (cuobjdump)
  – Examples with source code
Libraries (from CUDA Toolkit)
• cuFFT: Fast Fourier Transform
• cuBLAS: complete BLAS (Basic Linear Algebra Subprograms)
• cuSPARSE: sparse matrix computations
• cuRAND: random number generation (RNG)
• NPP: performance primitives for image and video processing
• nvcuvid: video decoding
• nvcuvenc: video encoding
• Thrust: templated parallel algorithms and data structures, e.g. parallel sorting, parallel summation, data structures for vectors
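As an illustration of the Thrust entry in the list, the sketch below sorts and sums a small vector on the GPU. The input values are arbitrary example data; Thrust is a C++ template library, so the file is compiled as C++ by nvcc.

```cuda
#include <stdio.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

int main(void)
{
    // Arbitrary example data on the host.
    const int values[8] = {5, 3, 8, 1, 9, 2, 7, 4};
    thrust::host_vector<int> h(8);
    for (int i = 0; i < 8; i++)
        h[i] = values[i];

    thrust::device_vector<int> d = h;                   // copy host -> device
    thrust::sort(d.begin(), d.end());                   // parallel sort on the GPU
    int sum = thrust::reduce(d.begin(), d.end(), 0);    // parallel summation on the GPU

    thrust::host_vector<int> sorted = d;                // copy device -> host
    for (int i = 0; i < 8; i++)
        printf("%d ", (int)sorted[i]);
    printf("\nsum = %d\n", sum);
    return 0;
}
```

No explicit cudaMalloc or cudaMemcpy calls are needed here: the vector classes allocate and transfer memory themselves.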
CUDA
• CUDA extends the C language
• Compiled with the nvcc compiler
• High portability between different CUDA architectures
Compilation
Figure: compilation flow
• CUDA C functions → compiler nvcc → PTX code → PTX for target architecture (object code)
• C program (without CUDA) → compiler, e.g. gcc → CPU object files
• Both paths end in one CPU/GPU executable
PTX: Parallel Thread eXecution, a virtual instruction set architecture
CUDA Notation
• Device
  – Graphics card with GPU and graphics memory
• Kernel
  – Program that runs on the device
  – A kernel can only access GPU memory
  – Newer CUDA versions can run several kernels simultaneously
• Host
  – CPU, which starts kernels on the device
CUDA Programming Model (1)
• Programming in C
• Some functions are executed on the GPU in kernels
• Program is partitioned into host code (standard C) and device code (CUDA)
• Functions are distinguished by qualifiers
  __host__      Functions on CPU (default)
  __device__    GPU functions
  __global__    Entry points into CUDA code, define kernels
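The three function qualifiers can be seen together in a small program. This is a minimal sketch; the function names square, squarePlusOne and fill are made up for illustration.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Compiled for both sides: callable from host code and from device code.
__host__ __device__ float square(float x)
{
    return x * x;
}

// Device-only helper: callable from kernels, not from the host.
__device__ float squarePlusOne(float x)
{
    return square(x) + 1.0f;
}

// Kernel: entry point that the host launches and the GPU executes.
__global__ void fill(float *out, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        out[i] = squarePlusOne((float)i);
}

int main(void)
{
    const int n = 8;
    float h[n];
    float *d = NULL;

    if (cudaMalloc((void **)&d, n * sizeof(float)) != cudaSuccess)
        return 1;
    fill<<<1, n>>>(d, n);                               // one block of n threads
    if (cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost) != cudaSuccess)
        return 1;
    cudaFree(d);

    // The host recomputes the same values with the __host__ __device__ function.
    for (int i = 0; i < n; i++)
        printf("%d: gpu = %.1f, cpu = %.1f\n", i, h[i], square((float)i) + 1.0f);
    return 0;
}
```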
CUDA Further Notation
• Kernel: mapped onto a grid
• Grid: 2D mesh of thread blocks
• Kernel start (CUDA): kernel<<<dim3 gridS, dim3 blockS, size_t smem, cudaStream_t str>>>(…)
  – dim3 gridS (max. 65535 × 65535): dimension of grid (2D)
  – dim3 blockS (max. 512 × 512 × 64 on compute capability 1.x, at most 512 threads per block in total): dimension of block (3D)
  – size_t smem (optional): shared memory per block
  – cudaStream_t str (optional): stream the kernel is issued to
• Block: groups threads as a 3D mesh
  – Blocks must be independent
  – Threads in a thread block can be synchronized
  – Shared memory can be used to exchange data
• Threads: have a unique ID, 1D to 3D
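A launch that uses all four configuration parameters might look as follows. This is a sketch under assumptions: the kernel reverseInBlock and the sizes (4 blocks of 64 threads) are invented for illustration. Each block reverses its own chunk of the array, using dynamically sized shared memory and block-level synchronization.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Each block reverses its own chunk of data via shared memory.
__global__ void reverseInBlock(int *data)
{
    extern __shared__ int tile[];            // size comes from the 3rd launch parameter
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    tile[t] = data[base + t];
    __syncthreads();                         // wait until the whole block has written its tile
    data[base + t] = tile[blockDim.x - 1 - t];
}

int main(void)
{
    const int threads = 64, blocks = 4, n = threads * blocks;
    int h[n];
    for (int i = 0; i < n; i++)
        h[i] = i;

    int *d = NULL;
    cudaStream_t str;
    if (cudaMalloc((void **)&d, n * sizeof(int)) != cudaSuccess) return 1;
    if (cudaStreamCreate(&str) != cudaSuccess) return 1;

    dim3 gridS(blocks, 1, 1);                // 1-dim grid
    dim3 blockS(threads, 1, 1);              // 1-dim blocks
    size_t smem = threads * sizeof(int);     // shared memory per block

    // Copy in, launch and copy out are all issued to the same stream, so they run in order.
    cudaMemcpyAsync(d, h, n * sizeof(int), cudaMemcpyHostToDevice, str);
    reverseInBlock<<<gridS, blockS, smem, str>>>(d);
    cudaMemcpyAsync(h, d, n * sizeof(int), cudaMemcpyDeviceToHost, str);
    if (cudaStreamSynchronize(str) != cudaSuccess) return 1;

    int errors = 0;
    for (int i = 0; i < n; i++) {
        int b = i / threads, t = i % threads;
        if (h[i] != b * threads + (threads - 1 - t))
            errors++;
    }
    printf("%d mismatches\n", errors);

    cudaStreamDestroy(str);
    cudaFree(d);
    return errors != 0;
}
```

The blocks never read each other's chunk, which is what makes them independent; only threads of the same block exchange data.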
Scheduling of Kernel onto GPU
• A kernel is executed as a grid, which contains blocks
• Each block is assigned to a Streaming Multiprocessor (SM)
• The SM partitions the block into warps (warp: 32 threads)
• All threads of one warp are executed simultaneously on the Streaming Processors (SPs) of the SM
• Assignment of threads to SPs is dynamic
• A warp is executed in SIMD style
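How a block is cut into warps can be made visible from inside a kernel. In this sketch (the kernel name warpInfo and the block size of 96 are arbitrary choices), each thread of a 1-dim block records which warp it belongs to and its position within that warp, using the built-in variable warpSize.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// For a 1-dim block, consecutive thread IDs are grouped into warps.
__global__ void warpInfo(int *warp, int *lane)
{
    int t = threadIdx.x;
    warp[t] = t / warpSize;                  // index of the warp within the block
    lane[t] = t % warpSize;                  // position of the thread within its warp
}

int main(void)
{
    const int threads = 96;                  // 96 threads -> 3 warps of 32 threads
    int hWarp[threads], hLane[threads];
    int *dWarp = NULL, *dLane = NULL;

    if (cudaMalloc((void **)&dWarp, threads * sizeof(int)) != cudaSuccess) return 1;
    if (cudaMalloc((void **)&dLane, threads * sizeof(int)) != cudaSuccess) return 1;

    warpInfo<<<1, threads>>>(dWarp, dLane);
    if (cudaMemcpy(hWarp, dWarp, threads * sizeof(int), cudaMemcpyDeviceToHost) != cudaSuccess) return 1;
    if (cudaMemcpy(hLane, dLane, threads * sizeof(int), cudaMemcpyDeviceToHost) != cudaSuccess) return 1;

    for (int t = 0; t < threads; t += 16)    // print a sample of the threads
        printf("thread %2d -> warp %d, lane %2d\n", t, hWarp[t], hLane[t]);

    cudaFree(dWarp);
    cudaFree(dLane);
    return 0;
}
```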
Memory Hierarchy
• Register: per thread, small capacity (KB), small latency
• Shared memory: per block, medium capacity (KB), medium latency, can coordinate threads
• Global memory: per grid, large capacity (GB), large latency, necessary for I/O
CUDA Memory Model
Figure: a grid with blocks (0,0), (0,1), …, (0,n-1), each with its own shared memory; texture memory, constant memory and global memory belong to the grid as a whole
Memory Model
• Memory is specified through qualifiers
  __device__    in global memory on GPU
  __shared__    in shared memory on SMs
• Operations on global memory:
  – Allocation: cudaMalloc(void **ptr, size_t bytes)
  – Deallocation: cudaFree(void *ptr)
  – Set memory: cudaMemset(void *ptr, int value, size_t bytes)
• Transfer system RAM ↔ GPU RAM: cudaMemcpy(*dst, *src, size_t bytes, …)
• In kernels: transfer global memory ↔ shared memory
Caching
• Default: enabled
• Access first L1, then L2, then global memory
• Granularity: 128-byte cache line
Non-caching
• Activate with option -Xptxas -dlcm=cg in the NVIDIA compiler
• Access first L2, then global memory
• No access to L1; if present in L1: invalidate cache line
• Granularity: 32 bytes
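The global-memory operations listed above can be exercised in a few lines. The sketch below wraps each call in CHECK, a hypothetical helper macro defined here (it is not part of CUDA), and also shows that cudaMemset fills bytes, not ints.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper macro (not part of CUDA): abort with a message on any CUDA error.
#define CHECK(call)                                                 \
    do {                                                            \
        cudaError_t e_ = (call);                                    \
        if (e_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,      \
                    cudaGetErrorString(e_));                        \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

int main(void)
{
    const size_t n = 1024, bytes = n * sizeof(int);
    int *h = (int *)malloc(bytes);
    int *d = NULL;
    if (h == NULL)
        return 1;

    CHECK(cudaMalloc((void **)&d, bytes));                      // allocate global memory
    CHECK(cudaMemset(d, 0xFF, bytes));                          // set every byte to 0xFF
    CHECK(cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost));     // GPU RAM -> system RAM
    CHECK(cudaFree(d));                                         // release global memory

    // With all bits set, each int reads back as -1 (two's complement).
    int ok = 1;
    for (size_t i = 0; i < n; i++)
        if (h[i] != -1)
            ok = 0;
    printf("%s\n", ok ? "memset verified" : "unexpected contents");

    free(h);
    return ok ? 0 : 1;
}
```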
Accessing Array Elements
Size of 2-dim block:
  blockDim.x     // size of 2-dim block (X coord)
  blockDim.y     // size of 2-dim block (Y coord)
Identifying thread within 2-dim block:
  threadIdx.x    // thread ID in block (0 to blockDim.x-1)
  threadIdx.y    // thread ID in block (0 to blockDim.y-1)
For block dimension (3,4):
  threadIdx.x = 0,1,2
  threadIdx.y = 0,1,2,3
Identifying block within 1-dim grid:
  blockIdx.x     // block ID in grid
Access to array element (1-dim grid, 2-dim blocks):
  blocksize = blockDim.x * blockDim.y;             // no. of threads in block
  tid = threadIdx.y * blockDim.x + threadIdx.x;    // linear ID
  index = blockIdx.x * blocksize + tid;            // array element index
Access to array element (1-dim grid, 1-dim blocks):
  blocksize = blockDim.x;                          // no. of threads in block
  tid = threadIdx.x;                               // linear ID
  index = blockIdx.x * blocksize + tid;            // array element index
or
  index = blockIdx.x * blockDim.x + threadIdx.x;
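The index formula for a 1-dim grid of 2-dim blocks can be checked with a kernel in which every thread writes its own index. This is a sketch; the kernel name writeIndex and the sizes (5 blocks of dimension (3,4)) are chosen only for illustration. If the formula is right, every array element is written exactly once with its own position.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// 1-dim grid, 2-dim blocks: each thread stores its linear array index.
__global__ void writeIndex(int *out, int n)
{
    int blocksize = blockDim.x * blockDim.y;               // no. of threads in block
    int tid = threadIdx.y * blockDim.x + threadIdx.x;      // linear ID within the block
    int index = blockIdx.x * blocksize + tid;              // array element index
    if (index < n)
        out[index] = index;
}

int main(void)
{
    const int blocks = 5;
    const int n = blocks * 3 * 4;            // 5 blocks of 3 x 4 = 12 threads -> 60 elements
    dim3 block(3, 4);
    int h[n];
    int *d = NULL;

    if (cudaMalloc((void **)&d, n * sizeof(int)) != cudaSuccess) return 1;
    cudaMemset(d, 0xFF, n * sizeof(int));    // pre-fill with -1 so untouched elements show up
    writeIndex<<<blocks, block>>>(d, n);
    if (cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost) != cudaSuccess) return 1;
    cudaFree(d);

    int errors = 0;
    for (int i = 0; i < n; i++)
        if (h[i] != i)
            errors++;
    printf("%d of %d elements wrong\n", errors, n);
    return errors != 0;
}
```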
First parallel Code – GPU part
• Add two vectors of length N
• 1-dim grid of N/B blocks, each 1-dim block consisting of B threads
• blockDim.x: first dimension of block (i.e. B)

  __global__ void vecAdd(float *a, float *b, float *c, int N)
  {
      int i = blockDim.x * blockIdx.x + threadIdx.x;
      if (i < N) {
          c[i] = a[i] + b[i];
      }
  }
First Parallel Code – CPU Part

  #include <stdio.h>
  #include <cuda_runtime.h>

  #define N 65536                                  // N is the dimension of the vectors

  // The kernel vecAdd from the GPU part goes here, in the same .cu file.

  int main(void)
  {
      size_t size = N * sizeof(float);
      float *dA, *dB, *dC;                         // vectors in GPU memory
      static float hA[N], hB[N], hC[N];            // host vectors; static keeps 768 KB off the stack
      for (int i = 0; i < N; i++) {
          hA[i] = (float)i;
          hB[i] = (float)(N - i);
      }
      cudaMalloc((void **)&dA, size);              // allocate vectors in CUDA memory
      cudaMalloc((void **)&dB, size);
      cudaMalloc((void **)&dC, size);
      cudaMemcpy(dA, hA, size, cudaMemcpyHostToDevice);
      cudaMemcpy(dB, hB, size, cudaMemcpyHostToDevice);
      int threadS = 256;                           // threads per block
      int blockS = (N + threadS - 1) / threadS;    // blocks per grid, rounded up
      vecAdd<<<blockS, threadS>>>(dA, dB, dC, N);
      cudaMemcpy(hC, dC, size, cudaMemcpyDeviceToHost);   // waits for the kernel to finish
      cudaFree(dA);
      cudaFree(dB);
      cudaFree(dC);
      for (int i = 0; i < N; i++)
          if (hC[i] != N)                          // i + (N - i) must equal N
              printf("Wrong result at %d\n", i);
      return 0;
  }
Matrix Multiplication
• Dimension of matrix: DIM (global constant)
• In each of the 2 dimensions: DIM threads

  #include <stdio.h>
  #include <cuda_runtime.h>

  #define DIM 32    // example value (assumption, not from the slides); must be a multiple of 16

  __global__ void matMul(float *a, float *b, float *c)
  {
      int2 id;
      id.x = blockDim.x * blockIdx.x + threadIdx.x;    // column
      id.y = blockDim.y * blockIdx.y + threadIdx.y;    // row
      float sum = 0;
      for (int z = 0; z < DIM; z++) {
          sum += a[id.y * DIM + z] * b[z * DIM + id.x];
      }
      c[id.y * DIM + id.x] = sum;
  }

  int main()
  {
      dim3 threads(16, 16, 1);                         // block size = 256, really 2-dim
      dim3 grid(DIM / threads.x, DIM / threads.y, 1);  // 2-dim
      static float a[DIM * DIM], b[DIM * DIM], res[DIM * DIM];
      float *devA, *devB, *devRes;
      int matSize = DIM * DIM * sizeof(float);
      for (int i = 0; i < DIM * DIM; i++) {            // example input; the slide leaves a and b unset
          a[i] = 1.0f;
          b[i] = 2.0f;
      }
      cudaMalloc((void **)&devA, matSize);
      cudaMalloc((void **)&devB, matSize);
      cudaMalloc((void **)&devRes, matSize);
      cudaMemcpy(devA, a, matSize, cudaMemcpyHostToDevice);
      cudaMemcpy(devB, b, matSize, cudaMemcpyHostToDevice);
      matMul<<<grid, threads>>>(devA, devB, devRes);
      cudaDeviceSynchronize();                         // replaces the deprecated cudaThreadSynchronize()
      cudaMemcpy(res, devRes, matSize, cudaMemcpyDeviceToHost);
      cudaFree(devA);
      cudaFree(devB);
      cudaFree(devRes);
      for (int i = 0; i < DIM; i++) {
          for (int j = 0; j < DIM; j++)
              printf("%f ", res[i * DIM + j]);         // row i, column j
          printf("\n");
      }
      return 0;
  }
Choice of Block size
• Number of threads per block: multiple of warp size (32)
• SMs can execute up to 8 thread blocks in parallel (16 from compute capability 3.0 on)
• Small block size: prevents high utilization
• Large block size: less flexibility
• Typical: ~128 to 256 threads per block
• Depends on the application; experiments are necessary
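As a starting point for such experiments, the CUDA runtime (version 6.5 or later) can suggest a block size for a given kernel. The sketch below uses cudaOccupancyMaxPotentialBlockSize and cudaOccupancyMaxActiveBlocksPerMultiprocessor; the kernel scale and the problem size are placeholders. The suggested value maximizes occupancy only and is not necessarily the fastest choice.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Placeholder kernel whose resource usage the runtime inspects.
__global__ void scale(float *x, float f, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        x[i] *= f;
}

int main(void)
{
    int minGridSize = 0, blockSize = 0, blocksPerSM = 0;

    // Block size that maximizes occupancy for this kernel (no dynamic shared memory, no size limit).
    if (cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, scale, 0, 0) != cudaSuccess)
        return 1;
    // Number of blocks of that size that can be resident on one SM at the same time.
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, scale, blockSize, 0) != cudaSuccess)
        return 1;

    const int n = 1 << 20;                                // example problem size
    int gridSize = (n + blockSize - 1) / blockSize;       // enough blocks to cover n elements

    printf("suggested block size     : %d threads\n", blockSize);
    printf("min. grid size (full GPU): %d blocks\n", minGridSize);
    printf("resident blocks per SM   : %d\n", blocksPerSM);
    printf("grid size for n = %d: %d blocks\n", n, gridSize);
    return 0;
}
```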
Avoid Control Flow Divergence
• Typical code for CPU:
  if (idx & 1) a[idx] = b[idx];
  else a[idx] = b[idx] + 1;
• On GPU: performance loss by a factor of 2!
  – First: all odd threads in the warp execute a[idx] = b[idx];
  – Then: all even threads in the warp execute a[idx] = b[idx] + 1;
• Better: a[idx] = b[idx] + !(idx & 1);
• Not always so simple to detect and to cure!
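The two variants can be put side by side and checked for equal results. In this sketch the kernel names branching and branchFree are made up; the branch-free form has to add 1 for even idx and 0 for odd idx to match the if/else version. Whether a speed difference is measurable depends on the compiler, which may turn such a short branch into predicated instructions on its own, so the program only verifies equivalence.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Odd and even threads of a warp take different paths.
__global__ void branching(int *a, const int *b, int n)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx >= n) return;
    if (idx & 1) a[idx] = b[idx];
    else         a[idx] = b[idx] + 1;
}

// Same result without a data-dependent branch: !(idx & 1) is 1 for even idx, 0 for odd idx.
__global__ void branchFree(int *a, const int *b, int n)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < n)
        a[idx] = b[idx] + !(idx & 1);
}

int main(void)
{
    const int n = 1024, threads = 256, blocks = (n + threads - 1) / threads;
    static int hB[n], h1[n], h2[n];
    int *dB = NULL, *dA1 = NULL, *dA2 = NULL;
    size_t bytes = n * sizeof(int);

    for (int i = 0; i < n; i++)
        hB[i] = 3 * i;                       // arbitrary input data

    if (cudaMalloc((void **)&dB, bytes) != cudaSuccess) return 1;
    if (cudaMalloc((void **)&dA1, bytes) != cudaSuccess) return 1;
    if (cudaMalloc((void **)&dA2, bytes) != cudaSuccess) return 1;
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    branching<<<blocks, threads>>>(dA1, dB, n);
    branchFree<<<blocks, threads>>>(dA2, dB, n);
    if (cudaMemcpy(h1, dA1, bytes, cudaMemcpyDeviceToHost) != cudaSuccess) return 1;
    if (cudaMemcpy(h2, dA2, bytes, cudaMemcpyDeviceToHost) != cudaSuccess) return 1;

    int differences = 0;
    for (int i = 0; i < n; i++)
        if (h1[i] != h2[i])
            differences++;
    printf("%d differences between the two variants\n", differences);

    cudaFree(dB);
    cudaFree(dA1);
    cudaFree(dA2);
    return differences != 0;
}
```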
Comparison OpenCL and CUDA
• OpenCL
  – Compiles kernels at runtime for the actual platform (~)
  – Supports cards of several manufacturers (++)
  – Also supports non-GPU devices (++)
  – Open standard (++)
  – Available for: Windows, Mac, Linux
  – Larger setup code (-)
  – Less programming comfort (-)
  – Command queues resemble CUDA streams (~)
  – API commands similar, partly different parameters
Comparison OpenCL and CUDA
• CUDA
  – Abstraction for general-purpose computing on GPU (GPGPU)
  – Good high-level API (+)
  – Uniform programming model (+)
  – Low-level and high-level thread synchronization (+)
  – Stream synchronization (+)
  – Support for atomic operations (+)
  – Very good documentation (+)
  – Works only with NVIDIA hardware (-)
  – Manual memory management (-)
  – Available for: Windows, Mac, Linux
Summary
• Use GPUs as co-processors for massively parallel problems with regular structure (few control flow statements)
• CUDA: most popular programming environment
• Many more issues are not covered in this introduction; see the CUDA manual and textbooks