Optimization on Kepler
Zehuan Wang, [email protected]
Fundamental Optimization
Optimization Overview
- GPU architecture
- Kernel optimization
  - Memory optimization
  - Latency optimization
  - Instruction optimization
- CPU-GPU interaction optimization
  - Overlapped execution using streams
GPU High Level View
[Diagram: several SMs, each paired with shared memory (SMEM), connected to Global Memory on the GPU; the GPU talks to the CPU through the chipset over PCIe]
Streaming Multiprocessor
GK110 SM:
- Control unit: 4 warp schedulers, 8 instruction dispatchers
- Execution units: 192 single-precision CUDA cores, 64 double-precision CUDA cores, 32 SFUs, 32 LD/ST units
- Memory: 64K 32-bit registers; caches: L1 + shared memory (64 KB), texture, constant
- Connected to global memory
GPU and Programming Model
Software-to-hardware mapping:
- Thread <-> CUDA core: threads are executed by CUDA cores
- Thread block <-> multiprocessor: thread blocks are executed on multiprocessors; thread blocks do not migrate; several concurrent thread blocks can reside on one multiprocessor, limited by multiprocessor resources
- Grid <-> device: a kernel is launched as a grid of thread blocks; up to 16 kernels can execute on a device at one time
Warp
A warp is a group of 32 successive threads in a block.
- E.g. blockDim = 160: automatically divided into 5 warps by the GPU
- E.g. blockDim = 161: when blockDim is not a multiple of 32, the remaining threads occupy one more warp
Block 0 with blockDim = 160: Warp 0 (0~31), Warp 1 (32~63), Warp 2 (64~95), Warp 3 (96~127), Warp 4 (128~159)
Block 0 with blockDim = 161: Warp 0 (0~31) through Warp 4 (128~159), plus Warp 5 (160)
[Diagram: a block is divided into consecutive groups of 32 threads; each group of 32 threads is one warp]
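The rounding-up rule as code, a minimal sketch (warpsPerBlock is a hypothetical helper, not from the slides):

// Hypothetical helper: warps per block is blockDim rounded up to
// the next multiple of 32.
__host__ __device__ int warpsPerBlock(int blockDimX)
{
    return (blockDimX + 31) / 32;   // 160 -> 5 warps, 161 -> 6 warps
}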
Warp
SIMD: Single Instruction, Multiple Data. The threads in the same warp always execute the same instruction; instructions are issued to the execution units per warp.
[Diagram: two warp schedulers, each interleaving instructions from several warps in issue order, e.g. Warp Scheduler 0 issues warp 8 instruction 11, warp 2 instruction 42, warp 14 instruction 95, warp 8 instruction 12, ...; Warp Scheduler 1 does the same for warps 9, 3, and 15]
Warp
Latency is caused by dependencies between neighboring instructions in the same warp. During the waiting time, instructions from other warps can be executed. Context switching is free, so having a lot of warps can hide memory latency.
Kepler Memory Hierarchy
- Registers (spill to local memory)
- Caches: shared memory, L1 cache, L2 cache, constant cache, texture cache
- Global memory
Kepler/Fermi Memory Hierarchy
[Diagram: each SM has registers, L1 & shared memory, a texture cache, and a constant cache; all SMs share the L2 cache in front of global memory. Speed increases from global memory (slow) up to registers (fast).]
Wrong View of Optimization
Trying every optimization method in the book: optimization is endless, and this approach has low efficiency.
General Optimization Strategies: Measurement
Find out the limiting factor in kernel performance:
- Memory bandwidth bound (memory optimization)
- Instruction throughput bound (instruction optimization)
- Latency bound (configuration optimization)
Measure effective memory/instruction throughput with the NVIDIA Visual Profiler.
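As a sketch of what measuring effective throughput means, memory bandwidth can also be estimated by hand with CUDA events; myKernel, bytesRead, and bytesWritten are placeholders:

// Sketch: estimating effective memory bandwidth with CUDA events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
myKernel<<<grid, block>>>(d_in, d_out, n);   // placeholder kernel
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);      // elapsed time in milliseconds
// (bytes read + bytes written) / elapsed seconds, reported in GB/s
double effectiveGBs = (bytesRead + bytesWritten) / (ms / 1e3) / 1e9;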
[Flowchart: find the limiter. Memory bound: compare to the effective value in GB/s, then memory optimization. Instruction bound: compare to the effective value in inst/s, then instruction optimization. Latency bound: configuration optimization. Iterate until the limiter is resolved: done!]
Optimization Overview
- GPU architecture
- Kernel optimization
  - Memory optimization
  - Latency optimization
  - Instruction optimization
- CPU-GPU interaction optimization
  - Overlapped execution using streams
Memory Optimization
Apply when the code is memory bound and the effective memory throughput is much lower than the peak.
Purpose: access only data that are absolutely necessary.
Major techniques:
- Improve the access pattern to reduce wasted transactions
- Reduce redundant access: read-only cache, shared memory
Reduce Wasted Transactions
- Memory accesses are per warp, and memory is accessed in discrete chunks
- On Kepler, L1 is reserved for register spills and stack data; global loads go directly to L2 (invalidating the line in L1) and, on an L2 miss, to DRAM
- Memory is transported in segments of 32 B (same as for writes)
- If a warp cannot use all of the data in the segments it touches, the rest of the memory transaction is wasted
Reduce Wasted Transactions
Scenario: warp requests 32 aligned, consecutive 4-byte words.
- Addresses fall within 4 segments: no replays, bus utilization 100%
- Warp needs 128 bytes; 128 bytes move across the bus on a miss
[Diagram: the warp's addresses cover 4 consecutive 32 B segments exactly]
Reduce Wasted Transactions
Scenario: warp requests 32 aligned, permuted 4-byte words.
- Addresses fall within 4 segments: no replays, bus utilization 100%
- Warp needs 128 bytes; 128 bytes move across the bus on a miss
[Diagram: the warp's addresses are shuffled within the same 4 segments]
Reduce Wasted Transactions
Scenario: warp requests 32 consecutive 4-byte words, offset from perfect alignment.
- Addresses fall within at most 5 segments: 1 replay (2 transactions), bus utilization at least 80%
- Warp needs 128 bytes; at most 160 bytes move across the bus
- Some misaligned patterns still fall within 4 segments, giving 100% utilization
[Diagram: the warp's addresses are shifted across a segment boundary]
Reduce Wasted Transactions
Scenario: all threads in a warp request the same 4-byte word.
- Addresses fall within a single segment: no replays, bus utilization 12.5%
- Warp needs 4 bytes; 32 bytes move across the bus on a miss
[Diagram: all of the warp's addresses point at one word in a single segment]
Reduce Wasted Transactions
Scenario: warp requests 32 scattered 4-byte words (a concrete sketch follows).
- Addresses fall within N segments: (N-1) replays (N transactions); could be fewer if some segments can be combined into a single transaction
- Bus utilization: 128 / (N*32) (4x higher than with 128 B caching loads)
- Warp needs 128 bytes; N*32 bytes move across the bus on a miss
[Diagram: the warp's addresses are scattered across many segments]
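To make the best and worst cases concrete, here is a minimal sketch (my illustration, not from the slides) of a fully coalesced pattern versus a strided one:

// Coalesced: thread i reads word i, so a warp touches 32 consecutive
// 4-byte words = 4 segments, 100% bus utilization.
__global__ void coalescedCopy(float* out, const float* in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// Strided: thread i reads word i*stride; for stride >= 8 each thread
// hits its own 32 B segment and most of each segment is wasted.
__global__ void stridedCopy(float* out, const float* in, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];
}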
Read-Only Cache
- An alternative to L1 when accessing DRAM
- Also known as the texture cache: all texture accesses use this cache
- CC 3.5 and higher can also route global memory reads through it
- Should not be used if a kernel reads from and writes to the same addresses
- Compared to L1:
  - Generally better for scattered reads
  - Caching is at 32 B granularity (L1, when caching, operates at 128 B granularity)
  - Does not require replays for multiple transactions (L1 does)
  - Higher latency than L1 reads; also tends to increase register use
Read-Only Cache
Annotate eligible kernel pointer parameters with const and __restrict__; the compiler will automatically map those loads to the read-only data cache path.

__global__ void saxpy(float x, float y,
                      const float* __restrict__ input,
                      float* output)
{
    size_t offset = threadIdx.x + (blockIdx.x * blockDim.x);
    // Compiler will automatically use the read-only (texture) path for "input"
    output[offset] = (input[offset] * x) + y;
}
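On CC 3.5 and higher, the same path can also be requested explicitly with the __ldg() intrinsic; this variant is my addition, not from the slides:

// Explicit read through the read-only cache (CC 3.5+), same effect:
output[offset] = (__ldg(&input[offset]) * x) + y;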
Shared Memory
- Low latency: a few cycles
- High throughput
- Main uses:
  - Inter-thread communication within a block
  - User-managed cache to reduce redundant global memory accesses
  - Avoiding non-coalesced access
Shared Memory Example: Matrix Multiplication
C = A x B: every thread computes one entry of C.
[Diagram: matrices A, B, and C]
Naive Kernel
__global__ void simpleMultiply(float* a, float* b, float* c, int N)
{
    int row = threadIdx.x + blockIdx.x * blockDim.x;
    int col = threadIdx.y + blockIdx.y * blockDim.y;
    float sum = 0.0f;
    for (int i = 0; i < N; i++) {
        sum += a[row*N+i] * b[i*N+col];
    }
    c[row*N+col] = sum;
}

Every thread computes one entry of C, but each thread re-reads an entire row of a and column of b from global memory.
Blocked Matrix Multiplication
C = A x B computed tile by tile: data loaded into shared memory is reused by all threads of the block in the blocked version.
[Diagram: a tile of A and a tile of B combining into a tile of C]
Blocked and cached kernel:

__global__ void coalescedMultiply(float* a, float* b, float* c, int N)
{
    __shared__ float aTile[TILE_DIM][TILE_DIM];
    __shared__ float bTile[TILE_DIM][TILE_DIM];
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    for (int k = 0; k < N; k += TILE_DIM) {
        // Each thread stages one element of the A and B tiles
        aTile[threadIdx.y][threadIdx.x] = a[row*N + threadIdx.x + k];
        bTile[threadIdx.y][threadIdx.x] = b[(threadIdx.y + k)*N + col];
        __syncthreads();   // wait until the whole tile is loaded
        for (int i = 0; i < TILE_DIM; i++) {
            sum += aTile[threadIdx.y][i] * bTile[i][threadIdx.x];
        }
        __syncthreads();   // wait before the tiles are overwritten
    }
    c[row*N+col] = sum;
}
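A hypothetical launch configuration for this kernel; TILE_DIM = 16 and the assumption that N is a multiple of TILE_DIM are mine:

#define TILE_DIM 16

dim3 block(TILE_DIM, TILE_DIM);          // block shape must match the tile shape
dim3 grid(N / TILE_DIM, N / TILE_DIM);   // assumes N % TILE_DIM == 0
coalescedMultiply<<<grid, block>>>(d_a, d_b, d_c, N);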
Optimization Overview
- GPU architecture
- Kernel optimization
  - Memory optimization
  - Latency optimization
  - Instruction optimization
- CPU-GPU interaction optimization
  - Overlapped execution using streams
Latency Optimization
Apply when the code is latency bound: both the memory and instruction throughputs are far from the peak.
Latency hiding works by switching threads: a thread blocks when one of its operands isn't ready, and other warps execute in the meantime.
Purpose: have enough warps to hide latency
Major techniques: increase active warps
Enough Blocks and a Large Enough Block Size
- Number of blocks >> number of SMs: > 100 to scale well to future devices
- Block size: minimum 64; I generally use 128 or 256, but use whatever is best for your app
- It depends on the problem: do experiments!
Occupancy & Active Warps
Occupancy: ratio of active warps per SM to the maximum number of allowed warps
Maximum number: 48 on Fermi, 64 on Kepler
We need the occupancy to be high enough to hide latency
Occupancy is limited by resource usage
Dynamic Partitioning of SM Resources
- Shared memory is partitioned among blocks
- Registers are partitioned among threads: <= 255 per thread
- Thread block slots: <= 16
- Thread slots: <= 2048
- Any of these can be the limiting factor on how many threads can be launched at the same time on an SM
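For example, assuming GK110's 64K 32-bit registers per SM: a kernel that uses 64 registers per thread limits the SM to 65536 / 64 = 1024 threads, i.e. 32 of the 64 possible warps, so occupancy caps at 50% regardless of block size.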
Occupancy Calculator
http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls
Occupancy Optimizations
Know the current occupancy:
- Visual Profiler
- --ptxas-options=-v: outputs resource usage info; input for the Occupancy Calculator
Adjust resource usage to increase occupancy:
- Change the block size
- Limit register usage: compiler option -maxrregcount=n (per file) or __launch_bounds__ (per kernel, see the sketch below)
- Allocate shared memory dynamically
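A minimal sketch of the per-kernel route; scaleKernel and its body are placeholders:

// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor) tells
// the compiler to cap register use so that at least 4 blocks of 256
// threads can be resident on one SM.
__global__ void __launch_bounds__(256, 4) scaleKernel(float* data, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= s;   // placeholder body
}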
Optimization Overview
- GPU architecture
- Kernel optimization
  - Memory optimization
  - Latency optimization
  - Instruction optimization
- CPU-GPU interaction optimization
  - Overlapped execution using streams
Instruction Optimization
Apply when you find the code is instruction bound.
- A compute-intensive algorithm can easily become memory bound if you are not careful enough
- Typically, worry about instruction optimization after memory and execution-configuration optimizations
Purpose: reduce the instruction count; use fewer instructions to get the same job done.
Major techniques:
- Use high-throughput instructions
- Reduce wasted instructions: branch divergence, etc.
Reduce Instruction Count
- Use float if precision allows
  - Add the "f" suffix to floating-point literals (e.g. 1.0f), because the default is double
- Fast math functions: two types of runtime math library functions (see the sketch below)
  - func(): slower but higher accuracy (5 ulp or less)
  - __func(): faster but lower accuracy (see the programming guide for full details)
  - -use_fast_math: forces every func() to __func()
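A small illustration of both points (mine, not from the slides):

__global__ void fastMathExample(float* out, const float* in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // The "f" suffix keeps the literal, and hence the multiply, in single precision
    out[i] = __sinf(in[i]) * 0.5f;   // __sinf(): fast intrinsic; sinf(): slower, <= 5 ulp
}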
Control Flow
Divergent branches: threads within a single warp take different paths.
- Example with divergence: if (threadIdx.x > 2) {...} else {...} (branch granularity < warp size)
- Different execution paths within a warp are serialized
- Different warps can execute different code with no impact on performance
Avoid diverging within a warp:
- Example without divergence: if (threadIdx.x / WARP_SIZE > 2) {...} else {...} (branch granularity is a whole multiple of the warp size; full kernels sketched below)
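The two examples as complete kernels, a sketch with hypothetical names:

#define WARP_SIZE 32

// Divergent: within warp 0, threads 0-2 and 3-31 take different paths,
// so the two paths are serialized for that warp.
__global__ void divergent(float* data)
{
    if (threadIdx.x > 2) data[threadIdx.x] *= 2.0f;
    else                 data[threadIdx.x] += 1.0f;
}

// Non-divergent: all threads of any given warp take the same path.
__global__ void warpAligned(float* data)
{
    if (threadIdx.x / WARP_SIZE > 2) data[threadIdx.x] *= 2.0f;
    else                             data[threadIdx.x] += 1.0f;
}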
Kernel Optimization Workflow
[Flowchart: find the limiter. Memory bound: compare to peak GB/s, then memory optimization. Instruction bound: compare to peak inst/s, then instruction optimization. Latency bound: configuration optimization. Iterate until the limiter is resolved: done!]
Optimization Overview
- GPU architecture
- Kernel optimization
  - Memory optimization
  - Latency optimization
  - Instruction optimization
- CPU-GPU interaction optimization
  - Overlapped execution using streams
Minimizing CPU-GPU Data Transfer
Host<->device data transfer has much lower bandwidth than global memory access:
16 GB/s (PCIe x16 Gen3) vs 250 GB/s & 3.95 Tinst/s (GK110)
Minimize transfers:
- Intermediate data can be allocated, operated on, and de-allocated directly on the GPU
- Sometimes it is even better to recompute on the GPU
Group transfers (sketch below):
- One large transfer is much better than many small ones
- Overlap memory transfers with computation
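A sketch of the grouping point; the buffer names are placeholders:

// Many small transfers: each cudaMemcpy pays launch latency and setup cost.
for (int i = 0; i < n; ++i)
    cudaMemcpy(d_a + i, h_a + i, sizeof(float), cudaMemcpyHostToDevice);

// One large transfer of the same data: much better.
cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);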
Streams and Async API
Default API:
- Kernel launches are asynchronous with the CPU
- Memcopies (D2H, H2D) block the CPU thread
- CUDA calls are serialized by the driver
Streams and async functions provide:
- Memcopies (D2H, H2D) asynchronous with the CPU
- The ability to execute a kernel and a memcopy concurrently
- Concurrent kernels (since Fermi)
Stream = a sequence of operations that execute in issue order on the GPU:
- Operations from different streams can be interleaved
- A kernel and a memcopy from different streams can be overlapped
Pinned (non-pageable) memory
Pinned memory enables memcopies that are asynchronous with both the CPU and the GPU.
Usage: cudaHostAlloc / cudaFreeHost instead of malloc / free (sketch below).
Note: pinned memory is essentially removed from virtual memory paging, and cudaHostAlloc is typically very expensive.
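A usage sketch; nBytes, d_data, and stream are placeholders:

float* h_data;   // pinned host buffer
cudaHostAlloc((void**)&h_data, nBytes, cudaHostAllocDefault);

// ... fill h_data, then copy asynchronously against a stream ...
cudaMemcpyAsync(d_data, h_data, nBytes, cudaMemcpyHostToDevice, stream);

cudaFreeHost(h_data);   // release the pinned buffer when done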
Overlap kernel and memory copy
Requirements:
- D2H or H2D memcopy from pinned memory
- Device with compute capability ≥ 1.1 (G84 and later)
- Kernel and memcopy in different, non-0 streams
Code:

cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

cudaMemcpyAsync(dst, src, size, dir, stream1);   // potentially
kernel<<<grid, block, 0, stream2>>>(...);        // overlapped
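Extending this, a common pattern (my sketch, not from the slides; assumes pinned h_in/h_out buffers and two already-created streams) chunks the data so one chunk's copies overlap another chunk's kernel:

// Pipeline: chunk i's H2D copy, kernel, and D2H copy all go to the same
// stream, so chunk 0's kernel can overlap chunk 1's H2D copy, and so on.
for (int i = 0; i < nChunks; ++i) {
    size_t off = (size_t)i * chunkElems;
    cudaMemcpyAsync(d_in + off, h_in + off, chunkBytes,
                    cudaMemcpyHostToDevice, streams[i % 2]);
    process<<<grid, block, 0, streams[i % 2]>>>(d_in + off, d_out + off);
    cudaMemcpyAsync(h_out + off, d_out + off, chunkBytes,
                    cudaMemcpyDeviceToHost, streams[i % 2]);
}
cudaDeviceSynchronize();   // wait for all streams to finish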