100M CUDA GPUs
CUDA
Oil & Gas, Finance, Medical, Biophysics, Numerics, Audio, Video, Imaging
Heterogeneous Computing
(Figure: CPUs and GPUs working side by side.)
Joy Lee Senior SW Engineer, Development & Technology
Optimization
3 Steps to Port your C/C++ code to CUDA
Step 1: Single Thread. Port your C/C++ code to a single-thread CUDA kernel, and verify that the output is correct.
Focus on data movement between device and host memory.
Step 2: Single Block. Port the single-thread kernel to a single-block kernel, and verify that the output is still correct.
Focus on parallelizing with the thread index.
Step 3: Multiple Blocks & Threads. Port the single-block kernel to a multi-block kernel, and verify that the output is still correct.
Focus on the two-layer index system (block index and thread index), and determine the best index utilization.
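As a concrete illustration of the three stages (a hypothetical vector-add example, not from the slides):

```cuda
// Step 1: single-thread kernel -- one thread does all the work,
// exactly like the original C loop. Launch as add_v1<<<1, 1>>>(...).
__global__ void add_v1(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; ++i) c[i] = a[i] + b[i];
}

// Step 2: single-block kernel -- the loop is parallelized over threadIdx.x.
// Launch as add_v2<<<1, 256>>>(...).
__global__ void add_v2(const float* a, const float* b, float* c, int n) {
    for (int i = threadIdx.x; i < n; i += blockDim.x) c[i] = a[i] + b[i];
}

// Step 3: multi-block kernel -- the two-layer index combines blockIdx and
// threadIdx. Launch as add_v3<<<(n + 255) / 256, 256>>>(...).
__global__ void add_v3(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}
```

Comparing the output against the CPU result after each stage catches indexing mistakes before the kernel gets complicated.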
3 Steps to optimize your CUDA kernels
Step 1: Set up timers to measure kernel time. Use CUDA events to measure kernel execution time.
Use clock() inside the kernel to measure the time weight of each part in detail.
Step 2: Divide the kernel into parts and find the bottlenecks. Analyze your kernel and divide it into multiple parts.
Determine the bottleneck in each part.
Use a profiler to help identify bottlenecks.
Step 3: Optimize the parts. Optimize each part one by one, starting with the most time-consuming part.
Make sure the output is still correct after each optimization.
Make sure the kernel execution time actually becomes shorter after each optimization.
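A minimal sketch of the CUDA-event timing from Step 1 (host-side fragment; `ker_xxx`, `grid`, `block`, `d_a`, and `d_b` are assumed to exist):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);                      // enqueue start marker
ker_xxx<<<grid, block>>>(d_a, d_b);          // the kernel being measured
cudaEventRecord(stop);                       // enqueue stop marker
cudaEventSynchronize(stop);                  // wait until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);      // elapsed GPU time in milliseconds
printf("kernel time: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Events are recorded on the GPU itself, so this avoids the host-side timing errors you get from wrapping an asynchronous kernel launch with CPU timers.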
Bottlenecks Division I
PCIe bound: too much cudaMemcpy traffic between host and device memory.
Memory bound: limited by global (device) memory bandwidth, or by a non-coalesced memory access pattern.
Compute bound: limited by computing power (FLOPS).
Branch bound: too many branch conditions.
Thread bound: too few threads.
Register bound: too few available registers (coupled with thread bound).
Bottlenecks Division II
PCIe bound: try keeping all data in device memory as long as possible.
Try using CUDA streams to make data movement asynchronous.
Memory bound: try using texture memory, shared memory, constant memory, or cache (Fermi and later) to reduce direct global memory I/O.
Compute bound: try reducing the number of operations in your algorithm.
Try using intrinsic functions.
Try turning on the --use_fast_math compiler option.
If you have tried every possible approach, you have hit the hardware limit, which means this part is essentially fully optimized; the only remaining option is a faster card.
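For the memory-bound case, a minimal shared-memory sketch (a hypothetical 1D three-point stencil, not from the slides): each block stages its tile of global memory into `__shared__` once, so neighboring threads re-read it from fast on-die memory instead of DRAM.

```cuda
#define TILE 256

// Each thread computes out[i] = in[i-1] + in[i] + in[i+1].
// Launch with blockDim.x == TILE.
__global__ void stencil3(const float* in, float* out, int n) {
    __shared__ float s[TILE + 2];           // tile plus one halo cell on each side
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x + 1;

    if (i < n) s[t] = in[i];                // one coalesced global load per thread
    if (threadIdx.x == 0 && i > 0)                  s[0]     = in[i - 1];  // left halo
    if (threadIdx.x == blockDim.x - 1 && i < n - 1) s[t + 1] = in[i + 1];  // right halo
    __syncthreads();                        // tile fully staged before anyone reads it

    if (i > 0 && i < n - 1)
        out[i] = s[t - 1] + s[t] + s[t + 1];  // three reads, all from shared memory
}
```

Without the staging, each element would be fetched from global memory up to three times; with it, roughly once.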
Bottlenecks Division III
Branch bound: reduce the number of branches, especially diverged branches.
Thread bound: use the compiler option -Xptxas -v to see the number of registers used per thread and the amount of smem (shared memory) used per block.
If the total register count per block exceeds the hardware limit, try --maxrregcount=<N> to cap register usage per thread; note that this spills to local memory (DRAM), which is a performance drawback.
Register bound: note that the number of variables declared in a kernel is not equal to the number of registers used; the compiler optimizes them into a smaller register count and drops unused variables.
Try reducing the number of variables in your algorithm.
Try changing the computation order; this can shorten the lifetime of some variables.
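A sketch of the register-related options (the kernel name is hypothetical; flag spellings are per nvcc):

```cuda
// Inspect per-thread register and per-block shared-memory usage:
//   nvcc -Xptxas -v kernel.cu
// ptxas then prints usage figures such as registers and smem bytes per kernel.
//
// Cap register usage for every kernel in the file (may spill to local memory):
//   nvcc --maxrregcount=32 kernel.cu

// Alternatively, __launch_bounds__ gives the compiler a per-kernel hint:
// at most 256 threads per block, and aim for at least 4 resident blocks per SM,
// which pressures the compiler to keep register usage low.
__global__ void __launch_bounds__(256, 4)
ker_capped(float* data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = data[i] * 2.0f;   // placeholder body
}
```

`__launch_bounds__` is usually preferable to --maxrregcount because it applies per kernel rather than to the whole compilation unit.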
Warp & Branches I
A warp = 32 threads. SP: MUL & MAD. SFU: sin(x), cos(x), …
One block is divided into multiple warps.
An SM can execute one warp in 4 clocks.
Warp & Branches II
A branch makes a warp diverge; each branch path is executed serially, in time order.
More diverged branches will be slower.
(Figure: non-diverged, 2-fold diverged, and 3-fold diverged warps.)
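A minimal divergence sketch (hypothetical kernels): branching on the thread index splits a warp, while branching on a warp-uniform value does not, because all 32 threads of a warp take the same path.

```cuda
__global__ void diverged(float* x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // 2-fold diverged: within one warp, threads take both paths,
    // so the two halves execute one after the other.
    if (threadIdx.x % 2 == 0) x[i] = x[i] * 2.0f;
    else                      x[i] = x[i] + 1.0f;
}

__global__ void uniform(float* x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Warp-uniform: every thread in a warp shares the same warp index,
    // so each warp takes exactly one path and never diverges.
    if ((threadIdx.x / warpSize) % 2 == 0) x[i] = x[i] * 2.0f;
    else                                   x[i] = x[i] + 1.0f;
}
```

Both kernels contain the same `if/else`; only the first one pays the serialization cost.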
Warp & Branches III
Fat & slim diverged warps: if there are common instructions inside a diverged warp, moving them out of the branch saves execution time.
Generally speaking, making each branch as slim as possible saves time.
Common work such as data loads/stores duplicated inside the paths makes a diverged warp fatter.
(Figure: a fat diverged warp repeats the common instruction in every path; a slim diverged warp hoists it out.)
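The fat-vs-slim idea in code (hypothetical kernels): hoisting the common load and store out of the branch leaves only the genuinely different instructions inside it.

```cuda
// Fat diverged warp: the load and store are duplicated in both paths,
// so the serialized divergent section is longer than it needs to be.
__global__ void fat(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0) {
        float v = in[i];          // common load, inside the branch
        out[i] = v * 2.0f;        // common store, inside the branch
    } else {
        float v = in[i];          // duplicated
        out[i] = v + 1.0f;        // duplicated
    }
}

// Slim diverged warp: only the differing arithmetic stays in the branch.
__global__ void slim(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = in[i];              // common load hoisted out
    float r;
    if (threadIdx.x % 2 == 0) r = v * 2.0f;
    else                      r = v + 1.0f;
    out[i] = r;                   // common store hoisted out
}
```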
Estimate the computing throughput
Computing weight: isolate the computing parts and measure their percentage of the kernel with the GPU clock().
Kernel execution time: measure the kernel execution time and calculate the time spent in those computing parts.
Used computing throughput in FLOPS: count the total arithmetic operations and divide by this execution time.
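A sketch of the clock() technique (hypothetical kernel fragment, using clock64(), the 64-bit variant of clock()): each block records cycle counts around the part being measured and writes them out for the host to aggregate.

```cuda
__global__ void ker_timed(float* data, long long* cycles, int n) {
    long long t0 = clock64();            // cycle counter at the start of the part
    // ... the computing part being measured ...
    long long t1 = clock64();
    if (threadIdx.x == 0)
        cycles[blockIdx.x] = t1 - t0;    // one sample per block
}
// Host side: average the per-block cycle counts and divide by the GPU clock
// rate to get this part's share of the kernel time; then
// total arithmetic operations / that time = achieved FLOPS.
```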
Example of intrinsic functions
__mul24(x, y): faster than a full 32-bit product; computes the product of the 24 least significant bits of the integer parameters.
__sinf(x), __cosf(x), __expf(x): very fast, single precision.
Less precise, but still acceptable for many uses.
__sincosf(x, sptr, cptr): computes sine and cosine in one call.
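A small usage sketch (hypothetical kernel; all arrays are device pointers of length n):

```cuda
__global__ void intrinsics_demo(const float* x, float* s, float* c, int* p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    p[i] = __mul24(i, 3);             // 24-bit product; valid while both
                                      // operands fit in 24 bits
    __sincosf(x[i], &s[i], &c[i]);    // fast single-precision sine and cosine
                                      // in a single call
}
```

The precision trade-off should be validated against the reference output, as in Step 3 of the optimization workflow.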
Memory system
Thread scope: Register: on die, fastest, default.
Local memory: DRAM, not cached, slow (400-600 clocks).
Block scope: Shared memory: on die, fast (4-6 clocks), qualifier __shared__.
Grid scope: Global memory: DRAM, not cached, slow (400-600 clocks).
Constant memory: on die, small (64 KB total), qualifier __constant__.
Texture memory: read-only, DRAM + cache, fast on a cache hit.
Cache (Fermi only): read/write cache.
Count memory bandwidth (exercise)
Memory access weight: isolate the memory access parts and measure their percentage of the kernel with the GPU clock().
Kernel execution time: measure the kernel execution time and calculate the memory access time within the kernel.
Used memory bandwidth: count the total bytes accessed and divide by the access time.
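The arithmetic as a host-side sketch (the element count and measured time are hypothetical): if the kernel moves B bytes in access time t, the achieved bandwidth is B / t.

```cuda
int n = 1 << 24;                          // 16M elements (hypothetical)
double bytes = 2.0 * n * sizeof(float);   // one read + one write per element
double access_ms = 10.0;                  // memory-access share of kernel time,
                                          // obtained via the clock() weighting
double gbps = bytes / (access_ms * 1e-3) / 1e9;
// About 13.4 GB/s with these numbers; compare against the card's
// theoretical peak to judge how memory-bound this part is.
```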
Coalesced global memory I/O
Threads in a half-warp share the same memory controller. If the memory access pattern within the half-warp is densely localized in memory, performance is good because the accesses form a single transaction; we call this coalesced I/O.
If the accesses diverge into different memory segments, performance drops because of multiple transactions; we call this non-coalesced I/O.
How many threads in a warp share the same memory controller may differ by hardware specification.
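A minimal sketch of the two patterns (hypothetical copy kernels):

```cuda
// Coalesced: consecutive threads touch consecutive addresses, so a
// half-warp's accesses fall in one memory segment and form one transaction.
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Non-coalesced: the stride scatters the half-warp across many segments,
// turning what could be one transaction into many.
__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```

Timing both with CUDA events at increasing strides makes the bandwidth penalty of non-coalesced access directly visible.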
Wrap kernel as standard C/C++ functions
This compiles kernels into standard object files or a library.
Link from other languages: Java, Fortran, MATLAB, …
It is not necessary to rewrite all non-C code in CUDA; we can call kernels from any other language.
Example: Wrap Kernel to C
__global__ void ker_xxx(int* a, int* b){ //some CUDA kernel
…
}
extern "C" { //export with standard C linkage
void xxx(int* a, int* b);
}
void xxx(int* a, int* b){ //wrap the kernel in a C function
…
ker_xxx<<<grid,block>>>(a,b);
…
}
Multi-GPU operations
Before CUDA 4.0, or on non-Tesla cards: one CUDA context can control only one GPU (to send/receive data and launch kernels). This means one CPU thread controls one GPU, since one CPU thread owns one CUDA context.
We can use MPI, OpenMP, or pthreads to create multiple CPU threads, then use cudaSetDevice() to assign each CPU thread to a GPU.
Data communication: copy data from global memory back to system memory, and transfer it through MPI, OpenMP, or pthread mechanisms.
CUDA 4.0: UVA, unified virtual addressing (all GPUs and the CPU can see each other's data).
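A pre-CUDA-4.0-style sketch of the one-thread-per-GPU pattern (assuming OpenMP; `ker_xxx` and the per-GPU data split are hypothetical):

```cuda
#include <omp.h>

void run_on_all_gpus(int* a, int* b) {
    int ngpu = 0;
    cudaGetDeviceCount(&ngpu);
    #pragma omp parallel num_threads(ngpu)
    {
        int dev = omp_get_thread_num();
        cudaSetDevice(dev);   // bind this CPU thread's CUDA context to one GPU
        // ... cudaMalloc / cudaMemcpy this GPU's share of a and b ...
        // ker_xxx<<<grid, block>>>(d_a, d_b);
        // ... copy results back; inter-GPU exchange goes via host memory ...
    }
}
```

Each OpenMP thread owns its own context, so the GPUs run their kernels concurrently.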
Hardware (SPA, Streaming Processors Array)
(Figure: SPA diagram composed of TPCs.)
Hardware (TPC, Texture/Processors Cluster)
Hardware (SM, Streaming Multiprocessor)
A warp = 32 threads. SP: MUL & MAD. SFU: sin(x), cos(x), …
One block is divided into multiple warps.
An SM can execute one warp in 4 clocks.
SPA, Streaming Processors Array
(Figure: SM diagram with a double-precision unit, Special Function Unit (SFU), TP array, and shared memory.)
How to use so many cores?
240 SP thread processors
30 DP thread processors
Full scalar processor
IEEE 754 double-precision floating point
(Figure: Thread Processor Array (TPA) diagram: each Thread Processor (TP) contains an FP/Int unit; the array adds a double-precision unit, a Special Function Unit (SFU), special-op ALUs, a multi-banked register file, and shared memory.)