100M CUDA GPUs
CUDA
Oil & Gas, Finance, Medical, Biophysics, Numerics, Audio, Video, Imaging
Heterogeneous Computing
(Figure: CPUs and GPUs working side by side.)
Joy Lee Senior SW Engineer, Development & Technology
Optimization
3 Steps to Port your C/C++ code to CUDA
Step 1: Single Thread. Port your C/C++ code to a single-thread CUDA kernel, and verify that the output is correct.
Focus on data movement between device and host memory.
Step 2: Single Block. Port the single-thread kernel to a single-block kernel, and verify that the output is still correct.
Focus on parallelizing with the thread index.
Step 3: Multiple Blocks & Threads. Port the single-block kernel to a multi-block kernel, and verify that the output is still correct.
Focus on the two-layer index system (block index and thread index), and determine the best index utilization.
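As a concrete illustration of the three stages (a hypothetical vector-add example, not from the slides):

```cuda
// Step 1: single-thread kernel -- one thread does all the work,
// exactly like the original C loop. Launch as add_v1<<<1, 1>>>(...).
__global__ void add_v1(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; ++i) c[i] = a[i] + b[i];
}

// Step 2: single-block kernel -- the loop is parallelized over threadIdx.x.
// Launch as add_v2<<<1, 256>>>(...).
__global__ void add_v2(const float* a, const float* b, float* c, int n) {
    for (int i = threadIdx.x; i < n; i += blockDim.x) c[i] = a[i] + b[i];
}

// Step 3: multi-block kernel -- the two-layer index combines blockIdx and
// threadIdx. Launch as add_v3<<<(n + 255) / 256, 256>>>(...).
__global__ void add_v3(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}
```

Comparing the output against the CPU result after each stage catches indexing mistakes before the kernel gets complicated.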
3 Steps to optimize your CUDA kernels
Step 1: Set up timers to measure kernel time. Use CUDA events to measure kernel execution time.
Use clock() inside the kernel to measure the time weight of each part in detail.
Step 2: Divide the kernel into parts and find the bottlenecks. Analyze your kernel and divide it into multiple parts.
Determine the bottleneck in each part.
Use a profiler to help identify bottlenecks.
Step 3: Optimize the parts. Optimize each part one by one, starting with the most time-consuming part.
Make sure the output is still correct after each optimization.
Make sure the kernel execution time actually becomes shorter after each optimization.
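A minimal sketch of the CUDA-event timing from Step 1 (host-side fragment; `ker_xxx`, `grid`, `block`, `d_a`, and `d_b` are assumed to exist):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);                      // enqueue start marker
ker_xxx<<<grid, block>>>(d_a, d_b);          // the kernel being measured
cudaEventRecord(stop);                       // enqueue stop marker
cudaEventSynchronize(stop);                  // wait until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);      // elapsed GPU time in milliseconds
printf("kernel time: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Events are recorded on the GPU itself, so this avoids the host-side timing errors you get from wrapping an asynchronous kernel launch with CPU timers.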
Bottlenecks Division I
PCIe bound: too much cudaMemcpy traffic between host and device memory.
Memory bound: limited by global (device) memory bandwidth, or by a non-coalesced memory access pattern.
Compute bound: limited by computing power (FLOPS).
Branch bound: too many branch conditions.
Thread bound: too few threads.
Register bound: too few available registers (coupled with thread bound).
Bottlenecks Division II
PCIe bound: try keeping all data in device memory as long as possible.
Try using CUDA streams to make data movement asynchronous.
Memory bound: try using texture memory, shared memory, constant memory, or cache (Fermi and later) to reduce direct global memory I/O.
Compute bound: try reducing the number of operations in your algorithm.
Try using intrinsic functions.
Try turning on the --use_fast_math compiler option.
If you have tried every possible approach, you have hit the hardware limit, which means this part is essentially fully optimized; the only remaining option is a faster card.
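For the memory-bound case, a minimal shared-memory sketch (a hypothetical 1D three-point stencil, not from the slides): each block stages its tile of global memory into `__shared__` once, so neighboring threads re-read it from fast on-die memory instead of DRAM.

```cuda
#define TILE 256

// Each thread computes out[i] = in[i-1] + in[i] + in[i+1].
// Launch with blockDim.x == TILE.
__global__ void stencil3(const float* in, float* out, int n) {
    __shared__ float s[TILE + 2];           // tile plus one halo cell on each side
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x + 1;

    if (i < n) s[t] = in[i];                // one coalesced global load per thread
    if (threadIdx.x == 0 && i > 0)                  s[0]     = in[i - 1];  // left halo
    if (threadIdx.x == blockDim.x - 1 && i < n - 1) s[t + 1] = in[i + 1];  // right halo
    __syncthreads();                        // tile fully staged before anyone reads it

    if (i > 0 && i < n - 1)
        out[i] = s[t - 1] + s[t] + s[t + 1];  // three reads, all from shared memory
}
```

Without the staging, each element would be fetched from global memory up to three times; with it, roughly once.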
Bottlenecks Division III
Branch bound: reduce the number of branches, especially diverged branches.
Thread bound: use the compiler option -Xptxas -v to see the number of registers used per thread and the amount of smem (shared memory) used per block.
If the total register count per block exceeds the hardware limit, try --maxrregcount=<N> to cap register usage per thread; note that this spills to local memory (DRAM), which is a performance drawback.
Register bound: note that the number of variables declared in a kernel is not equal to the number of registers used; the compiler optimizes them into a smaller register count and drops unused variables.
Try reducing the number of variables in your algorithm.
Try changing the computation order; this can shorten the lifetime of some variables.
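A sketch of the register-related options (the kernel name is hypothetical; flag spellings are per nvcc):

```cuda
// Inspect per-thread register and per-block shared-memory usage:
//   nvcc -Xptxas -v kernel.cu
// ptxas then prints usage figures such as registers and smem bytes per kernel.
//
// Cap register usage for every kernel in the file (may spill to local memory):
//   nvcc --maxrregcount=32 kernel.cu

// Alternatively, __launch_bounds__ gives the compiler a per-kernel hint:
// at most 256 threads per block, and aim for at least 4 resident blocks per SM,
// which pressures the compiler to keep register usage low.
__global__ void __launch_bounds__(256, 4)
ker_capped(float* data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = data[i] * 2.0f;   // placeholder body
}
```

`__launch_bounds__` is usually preferable to --maxrregcount because it applies per kernel rather than to the whole compilation unit.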
Warp & Branches I
A warp = 32 threads. SP: MUL & MAD. SFU: sin(x), cos(x), …
One block is divided into multiple warps.
An SM can execute one warp in 4 clocks.
Warp & Branches II
A branch makes a warp diverge; each branch path is executed serially, in time order.
More diverged branches will be slower.
(Figure: non-diverged, 2-fold diverged, and 3-fold diverged warps.)
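A minimal divergence sketch (hypothetical kernels): branching on the thread index splits a warp, while branching on a warp-uniform value does not, because all 32 threads of a warp take the same path.

```cuda
__global__ void diverged(float* x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // 2-fold diverged: within one warp, threads take both paths,
    // so the two halves execute one after the other.
    if (threadIdx.x % 2 == 0) x[i] = x[i] * 2.0f;
    else                      x[i] = x[i] + 1.0f;
}

__global__ void uniform(float* x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Warp-uniform: every thread in a warp shares the same warp index,
    // so each warp takes exactly one path and never diverges.
    if ((threadIdx.x / warpSize) % 2 == 0) x[i] = x[i] * 2.0f;
    else                                   x[i] = x[i] + 1.0f;
}
```

Both kernels contain the same `if/else`; only the first one pays the serialization cost.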
Warp & Branches III
Fat & slim diverged warps: if there are common instructions inside a diverged warp, moving them out of the branch saves execution time.
Generally speaking, making each branch as slim as possible saves time.
Common work such as data loads/stores duplicated inside the paths makes a diverged warp fatter.
(Figure: a fat diverged warp repeats the common instruction in every path; a slim diverged warp hoists it out.)
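The fat-vs-slim idea in code (hypothetical kernels): hoisting the common load and store out of the branch leaves only the genuinely different instructions inside it.

```cuda
// Fat diverged warp: the load and store are duplicated in both paths,
// so the serialized divergent section is longer than it needs to be.
__global__ void fat(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0) {
        float v = in[i];          // common load, inside the branch
        out[i] = v * 2.0f;        // common store, inside the branch
    } else {
        float v = in[i];          // duplicated
        out[i] = v + 1.0f;        // duplicated
    }
}

// Slim diverged warp: only the differing arithmetic stays in the branch.
__global__ void slim(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = in[i];              // common load hoisted out
    float r;
    if (threadIdx.x % 2 == 0) r = v * 2.0f;
    else                      r = v + 1.0f;
    out[i] = r;                   // common store hoisted out
}
```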
Estimate the computing throughput
Computing weight: isolate the computing parts and measure their percentage of the kernel with the GPU clock().
Kernel execution time: measure the kernel execution time and calculate the time spent in those computing parts.
Used computing throughput in FLOPS: count the total arithmetic operations and divide by this execution time.
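A sketch of the clock() technique (hypothetical kernel fragment, using clock64(), the 64-bit variant of clock()): each block records cycle counts around the part being measured and writes them out for the host to aggregate.

```cuda
__global__ void ker_timed(float* data, long long* cycles, int n) {
    long long t0 = clock64();            // cycle counter at the start of the part
    // ... the computing part being measured ...
    long long t1 = clock64();
    if (threadIdx.x == 0)
        cycles[blockIdx.x] = t1 - t0;    // one sample per block
}
// Host side: average the per-block cycle counts and divide by the GPU clock
// rate to get this part's share of the kernel time; then
// total arithmetic operations / that time = achieved FLOPS.
```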
Example of intrinsic functions
__mul24(x, y): faster than a full 32-bit product; computes the product of the 24 least significant bits of the integer parameters.
__sinf(x), __cosf(x), __expf(x): very fast, single precision.
Less precise, but still acceptable for many uses.
__sincosf(x, sptr, cptr): computes sine and cosine in one call.
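A small usage sketch (hypothetical kernel; all arrays are device pointers of length n):

```cuda
__global__ void intrinsics_demo(const float* x, float* s, float* c, int* p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    p[i] = __mul24(i, 3);             // 24-bit product; valid while both
                                      // operands fit in 24 bits
    __sincosf(x[i], &s[i], &c[i]);    // fast single-precision sine and cosine
                                      // in a single call
}
```

The precision trade-off should be validated against the reference output, as in Step 3 of the optimization workflow.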
Memory system
Thread scope: Register: on die, fastest, default.
Local memory: DRAM, not cached, slow (400-600 clocks).
Block scope: Shared memory: on die, fast (4-6 clocks), qualifier __shared__.
Grid scope: Global memory: DRAM, not cached, slow (400-600 clocks).
Constant memory: on die, small (64 KB total), qualifier __constant__.
Texture memory: read-only, DRAM + cache, fast on a cache hit.
Cache (Fermi only): read/write cache.
Count memory bandwidth (exercise)
Memory access weight: isolate the memory access parts and measure their percentage of the kernel with the GPU clock().
Kernel execution time: measure the kernel execution time and calculate the memory access time within the kernel.
Used memory bandwidth: count the total bytes accessed and divide by the access time.
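The arithmetic as a host-side sketch (the element count and measured time are hypothetical): if the kernel moves B bytes in access time t, the achieved bandwidth is B / t.

```cuda
int n = 1 << 24;                          // 16M elements (hypothetical)
double bytes = 2.0 * n * sizeof(float);   // one read + one write per element
double access_ms = 10.0;                  // memory-access share of kernel time,
                                          // obtained via the clock() weighting
double gbps = bytes / (access_ms * 1e-3) / 1e9;
// About 13.4 GB/s with these numbers; compare against the card's
// theoretical peak to judge how memory-bound this part is.
```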
Coalesced global memory I/O
Threads in a half-warp share the same memory controller. If the memory access pattern within the half-warp is densely localized in memory, performance is good because the accesses form a single transaction; we call this coalesced I/O.
If the accesses diverge into different memory segments, performance drops because of multiple transactions; we call this non-coalesced I/O.
How many threads in a warp share the same memory controller may differ by hardware specification.
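A minimal sketch of the two patterns (hypothetical copy kernels):

```cuda
// Coalesced: consecutive threads touch consecutive addresses, so a
// half-warp's accesses fall in one memory segment and form one transaction.
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Non-coalesced: the stride scatters the half-warp across many segments,
// turning what could be one transaction into many.
__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```

Timing both with CUDA events at increasing strides makes the bandwidth penalty of non-coalesced access directly visible.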
Wrap kernel as standard C/C++ functions
This compiles kernels into standard object files or a library.
Link from other languages: Java, Fortran, MATLAB, …
It is not necessary to rewrite all non-C code in CUDA; we can call kernels from any other language.
Example: Wrap Kernel to C
__global__ void ker_xxx(int* a, int* b){ //some CUDA kernel
…
}
extern "C" { //export with standard C linkage
void xxx(int* a, int* b);
}
void xxx(int* a, int* b){ //wrap the kernel in a C function
…
ker_xxx<<<grid,block>>>(a,b);
…
}
Multi-GPU operations
Before CUDA 4.0, or on non-Tesla cards: one CUDA context can control only one GPU (to send/receive data and launch kernels). This means one CPU thread controls one GPU, since one CPU thread owns one CUDA context.
We can use MPI, OpenMP, or pthreads to create multiple CPU threads, then use cudaSetDevice() to assign each CPU thread to a GPU.
Data communication: copy data from global memory back to system memory, and transfer it through MPI, OpenMP, or pthread mechanisms.
CUDA 4.0: UVA, unified virtual addressing (all GPUs and the CPU can see each other's data).
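A pre-CUDA-4.0-style sketch of the one-thread-per-GPU pattern (assuming OpenMP; `ker_xxx` and the per-GPU data split are hypothetical):

```cuda
#include <omp.h>

void run_on_all_gpus(int* a, int* b) {
    int ngpu = 0;
    cudaGetDeviceCount(&ngpu);
    #pragma omp parallel num_threads(ngpu)
    {
        int dev = omp_get_thread_num();
        cudaSetDevice(dev);   // bind this CPU thread's CUDA context to one GPU
        // ... cudaMalloc / cudaMemcpy this GPU's share of a and b ...
        // ker_xxx<<<grid, block>>>(d_a, d_b);
        // ... copy results back; inter-GPU exchange goes via host memory ...
    }
}
```

Each OpenMP thread owns its own context, so the GPUs run their kernels concurrently.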
Hardware (SPA, Streaming Processors Array)
(Figure: SPA diagram composed of TPCs.)
Hardware (TPC, Texture/Processors Cluster)
Hardware (SM, Streaming Multiprocessor)
A warp = 32 threads. SP: MUL & MAD. SFU: sin(x), cos(x), …
One block is divided into multiple warps.
An SM can execute one warp in 4 clocks.
SPA, Streaming Processors Array
(Figure: SM diagram with a double-precision unit, Special Function Unit (SFU), TP array, and shared memory.)
How to use so many cores?
240 SP thread processors
30 DP thread processors
Full scalar processor
IEEE 754 double-precision floating point
(Figure: Thread Processor Array (TPA) diagram: each Thread Processor (TP) contains an FP/Int unit; the array adds a double-precision unit, a Special Function Unit (SFU), special-op ALUs, a multi-banked register file, and shared memory.)