Optimization on Kepler
Zehuan Wang <zehuan@nvidia.com>

TRANSCRIPT

Page 1: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

Optimization on Kepler

Zehuan [email protected]

Page 2: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

Fundamental Optimization

Page 3: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

Optimization Overview

GPU architecture
Kernel optimization
  Memory optimization
  Latency optimization
  Instruction optimization
CPU-GPU interaction optimization
  Overlapped execution using streams

Page 4: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

Optimization Overview

GPU architecture
Kernel optimization
  Memory optimization
  Latency optimization
  Instruction optimization
CPU-GPU interaction optimization
  Overlapped execution using streams

Page 5: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

GPU High Level View

[Figure: GPU high-level view — an array of streaming multiprocessors (SMs), each with its own shared memory, attached to global memory; the GPU connects to the CPU and chipset over PCIe.]

Page 6: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

GK110 SM

Control unit
  4 warp schedulers, 8 instruction dispatch units
Execution units
  192 single-precision CUDA cores, 64 double-precision units, 32 SFUs, 32 LD/ST units
Memory
  Registers: 64K 32-bit
  Caches: L1 + shared memory (64 KB), texture, constant

Page 7: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

GPU and Programming Model

Software          GPU
Thread            CUDA Core
  Threads are executed by CUDA cores
Thread Block      Multiprocessor
  Thread blocks are executed on multiprocessors
  Thread blocks do not migrate
  Several concurrent thread blocks can reside on one multiprocessor, limited by multiprocessor resources
Grid              Device
  A kernel is launched as a grid of thread blocks
  Up to 16 kernels can execute on a device at one time

Page 8: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

Warp

A warp is a group of 32 consecutive threads in a block.
  E.g. blockDim = 160: the block is automatically divided into 5 warps by the GPU.
  E.g. blockDim = 161: if blockDim is not a multiple of 32, the remaining threads occupy one more warp (here warp 5 holds only thread 160).

[Figure: a block split into warps of 32 threads — warp 0 (threads 0~31), warp 1 (32~63), warp 2 (64~95), warp 3 (96~127), warp 4 (128~159); for blockDim = 161 an extra warp 5 holds thread 160.]

The warp count can be computed as in the sketch below.
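A minimal sketch of the warp-count calculation for a 1-D block; the helper name is only for illustration:

// Sketch: number of warps needed for a 1-D block of blockDimX threads.
// Any blockDimX that is not a multiple of 32 rounds up to an extra warp.
int warpsPerBlock(int blockDimX)
{
    return (blockDimX + 31) / 32;   // e.g. 160 -> 5 warps, 161 -> 6 warps
}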

Page 9: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

Warp

SIMD: Single Instruction, Multiple Data
  Threads in the same warp always execute the same instruction.
  Instructions are issued to the execution units warp by warp.

[Figure: two warp schedulers, each issuing instructions per warp — scheduler 0 interleaves instructions from warps 8, 2, 14, ...; scheduler 1 interleaves instructions from warps 9, 3, 15, ...]

Page 10: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

Warp

Latency is caused by dependencies between neighboring instructions in the same warp.
While one warp waits, instructions from other warps can be executed.
Context switching is free.
Many warps can hide memory latency.

[Figure: the same two-scheduler diagram as the previous slide — while a warp stalls on a dependency, the schedulers issue instructions from other ready warps.]

Page 11: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

Kepler Memory Hierarchy

Registers
  Spill to local memory
Caches
  Shared memory
  L1 cache
  L2 cache
  Constant cache
  Texture cache

Global memory

Page 12: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

Kepler/Fermi Memory Hierarchy

[Figure: each SM has registers, a constant cache, L1 & shared memory, and a texture/read-only cache; all SMs share the L2 cache, which sits in front of global memory. Levels closer to the SM are faster; global memory is the slowest.]

Page 13: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

Wrong View To Optimization

Trying every optimization method in the book
  Optimization is endless
  Low efficiency

Page 14: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

General Optimization Strategies: Measurement

Find out the limiting factor in kernel performance
  Memory bandwidth bound (memory optimization)
  Instruction throughput bound (instruction optimization)
  Latency bound (configuration optimization)

Measure effective memory/instruction throughput: NVIDIA Visual Profiler (an example measurement is sketched below)
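Outside the profiler, effective memory throughput can also be estimated by hand with CUDA events. A sketch, where myKernel, d_data, grid, block and n are hypothetical and bytesMoved counts the data the kernel actually reads plus writes:

// Time a kernel with CUDA events and derive its effective bandwidth.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
myKernel<<<grid, block>>>(d_data, n);   // hypothetical kernel launch
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
double bytesMoved = 2.0 * n * sizeof(float);        // one read + one write per element
double effectiveGBps = bytesMoved / (ms * 1.0e6);   // bytes per ms -> GB/s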

Page 15: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

[Flowchart: find the limiter, then compare the measured value against the effective value.
  Memory bound → compare to the effective GB/s; if far below (<<), do memory optimization.
  Instruction bound → compare to the effective inst/s; if far below (<<), do instruction optimization.
  Latency bound → do configuration optimization.
  Once the measured value is close (~) to the effective value, the limiter is resolved: done!]

Page 16: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

Optimization Overview

GPU architecture
Kernel optimization
  Memory optimization
  Latency optimization
  Instruction optimization
CPU-GPU interaction optimization
  Overlapped execution using streams

Page 17: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

Memory Optimization

If the code is memory-bound and effective memory throughput is much lower than the peak

Purpose: access only data that are absolutely necessary

Major techniques
  Improve access patterns to reduce wasted transactions (see the access-pattern sketch below)
  Reduce redundant accesses: read-only cache, shared memory
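As an illustration of improving the access pattern, a common fix is moving from an array-of-structures layout (strided, wasted transactions) to a structure-of-arrays layout (coalesced). The struct and kernel names below are hypothetical:

// Array-of-structures: thread i reads p[i].x, so consecutive threads touch
// addresses 16 bytes apart and each 32 B segment is only partly used.
struct Particle { float x, y, z, w; };
__global__ void aosKernel(const Particle* p, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = p[i].x;          // strided: wasted transactions
}

// Structure-of-arrays: consecutive threads read consecutive floats,
// so a warp's 128-byte request maps onto whole 32 B segments.
__global__ void soaKernel(const float* x, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = x[i];            // coalesced
}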

Page 18: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

Reduce Wasted Transactions

Memory accesses are per warp
Memory is accessed in discrete chunks
L1 is reserved for register spills and stack data on Kepler
  Global loads go directly to L2 (the line is invalidated in L1); on an L2 miss they go to DRAM
Memory is transported in 32 B segments (same as for writes)
If a warp cannot use all of the data in the segments it touches, the rest of the memory transaction is wasted

Page 19: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

Kepler/Fermi Memory Hierarchy

[Figure: the same memory hierarchy as before — per-SM registers, constant cache, L1 & shared memory, and texture/read-only cache, with a shared L2 cache in front of global memory.]

Page 20: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

Reduce Wasted Transactions

Scenario: warp requests 32 aligned, consecutive 4-byte words
  Addresses fall within 4 segments
  No replays
  Bus utilization: 100%
    Warp needs 128 bytes; 128 bytes move across the bus on a miss

[Figure: the warp's addresses cover four consecutive 32 B segments of the memory address space.]

Page 21: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

Reduce Wasted Transactions

Scenario: warp requests 32 aligned, permuted 4-byte words
  Addresses fall within 4 segments
  No replays
  Bus utilization: 100%
    Warp needs 128 bytes; 128 bytes move across the bus on a miss

[Figure: the warp's addresses are shuffled within the same four 32 B segments.]

Page 22: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

Reduce Wasted Transactions

Scenario: warp requests 32 consecutive 4-byte words, offset from perfect alignment
  Addresses fall within at most 5 segments
  1 replay (2 transactions)
  Bus utilization: at least 80%
    Warp needs 128 bytes; at most 160 bytes move across the bus
    Some misaligned patterns still fall within 4 segments, giving 100% utilization

[Figure: the warp's addresses straddle a segment boundary and touch five 32 B segments.]

Page 23: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

Reduce Wasted Transactions

Scenario: all threads in a warp request the same 4-byte word
  Addresses fall within a single segment
  No replays
  Bus utilization: 12.5%
    Warp needs 4 bytes; 32 bytes move across the bus on a miss

[Figure: all 32 addresses point at one word inside a single 32 B segment.]

Page 24: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

Reduce Wasted Transactions

Scenario: warp requests 32 scattered 4-byte words
  Addresses fall within N segments
  (N-1) replays (N transactions); could be lower if some segments can be merged into a single transaction
  Bus utilization: 128 / (N*32) (4x higher than with 128 B caching loads)
    Warp needs 128 bytes; N*32 bytes move across the bus on a miss

[Figure: the warp's addresses are scattered across N different 32 B segments.]

Page 25: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

Read-Only Cache

An alternative to L1 when accessing DRAM
  Also known as the texture cache: all texture accesses use this cache
  CC 3.5 and higher can also use it for global memory reads
  Should not be used if a kernel reads and writes the same addresses

Compared to L1:
  Generally better for scattered reads than L1
  Caching is at 32 B granularity (L1, when caching, operates at 128 B granularity)
  Does not require replays for multiple transactions (L1 does)
  Higher latency than L1 reads; also tends to increase register use

Page 26: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

Read-Only Cache

Annotate eligible kernel parameters with const __restrict__

The compiler will automatically map loads to the read-only data cache path

__global__ void saxpy(float x, float y,
                      const float * __restrict__ input,
                      float * output)
{
    size_t offset = threadIdx.x + (blockIdx.x * blockDim.x);

    // Compiler will automatically use the read-only (texture) path for "input"
    output[offset] = (input[offset] * x) + y;
}
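Where the restrict annotation is not practical, a load can also be routed through the read-only cache explicitly with the __ldg() intrinsic (CC 3.5 and higher). A minimal sketch of the same kernel, not from the deck:

__global__ void saxpy_ldg(float x, float y, const float* input, float* output)
{
    size_t offset = threadIdx.x + (blockIdx.x * blockDim.x);

    // __ldg() forces the load of input[offset] through the read-only data cache
    output[offset] = (__ldg(&input[offset]) * x) + y;
}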

Page 27: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

Shared Memory

Low latency: a few cycles
High throughput
Main uses
  Inter-thread communication within a block
  User-managed cache to reduce redundant global memory accesses
  Avoid non-coalesced access

Page 28: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

Shared Memory Example: Matrix Multiplication

C = A x B

[Figure: matrices A, B and C; every thread corresponds to one entry of C.]

Page 29: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

Naive Kernel

__global__ void simpleMultiply(float* a, float* b, float* c, int N)
{
    int row = threadIdx.x + blockIdx.x*blockDim.x;
    int col = threadIdx.y + blockIdx.y*blockDim.y;
    float sum = 0.0f;
    for (int i = 0; i < N; i++) {
        sum += a[row*N+i] * b[i*N+col];
    }
    c[row*N+col] = sum;
}

Every thread corresponds to one entry in C.

Page 30: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

Blocked Matrix Multiplication

C = A x B

[Figure: A, B and C split into square tiles; each tile of A and B is loaded once and reused by a whole block of threads.]

Data reuse in the blocked version

Page 31: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

Blocked and cached kernel

__global__ void coalescedMultiply(double* a, double* b, double* c, int N)
{
    // TILE_DIM must equal blockDim.x and blockDim.y
    __shared__ double aTile[TILE_DIM][TILE_DIM];
    __shared__ double bTile[TILE_DIM][TILE_DIM];

    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    double sum = 0.0;
    for (int k = 0; k < N; k += TILE_DIM) {
        aTile[threadIdx.y][threadIdx.x] = a[row*N + threadIdx.x + k];
        bTile[threadIdx.y][threadIdx.x] = b[(threadIdx.y + k)*N + col];
        __syncthreads();
        for (int i = 0; i < TILE_DIM; i++) {
            sum += aTile[threadIdx.y][i] * bTile[i][threadIdx.x];
        }
        __syncthreads();   // tiles must not be overwritten before every thread is done with them
    }
    c[row*N + col] = sum;
}

Page 32: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

Optimization Overview

GPU architecture
Kernel optimization
  Memory optimization
  Latency optimization
  Instruction optimization
CPU-GPU interaction optimization
  Overlapped execution using streams

Page 33: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

Latency Optimization

When the code is latency bound
  Both the memory and instruction throughputs are far from the peak

Latency hiding: switching threads
  A thread stalls when one of its operands isn't ready

Purpose: have enough warps to hide latency

Major techniques: increase active warps

Page 34: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

Enough Blocks and Block Size

# of blocks >> # of SMs
  > 100 blocks to scale well to future devices
Block size
  Minimum: 64. I generally use 128 or 256, but use whatever is best for your app.
  Depends on the problem; do experiments! (A common grid-size calculation is sketched below.)
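A typical way to pick the launch configuration for a 1-D problem of n elements, with the block size left as a tunable parameter; the kernel and variable names are illustrative, not from the deck:

// Choose enough blocks to cover n elements with a tunable block size.
const int threadsPerBlock = 256;                                          // tune per app: 64, 128, 256, ...
const int blocksPerGrid   = (n + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division
myKernel<<<blocksPerGrid, threadsPerBlock>>>(d_data, n);                  // hypothetical kernel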

Page 35: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

Occupancy & Active Warps

Occupancy: ratio of active warps per SM to the maximum number of allowed warps

Maximum number: 48 on Fermi, 64 on Kepler

We need the occupancy to be high enough to hide latency

Occupancy is limited by resource usage

Page 36: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

Dynamic Partitioning of SM Resources

Shared memory is partitioned among blocks
Registers are partitioned among threads: <= 255 per thread
Thread block slots: <= 16
Thread slots: <= 2048
Any of these can be the limiting factor on how many threads can run at the same time on an SM (see the worked example below)
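A quick worked example of how register use limits occupancy on a Kepler SM (64K 32-bit registers per SM, 2048 thread slots): a kernel using 64 registers per thread can keep at most 65536 / 64 = 1024 threads resident, i.e. 32 of the 64 possible warps, so occupancy is capped at 50% regardless of block size.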

Page 37: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

Occupancy Calculator

http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls

Page 38: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

Occupancy Optimizations

Know the current occupancy
  Visual Profiler
  --ptxas-options=-v: outputs resource usage info; input to the Occupancy Calculator

Adjust resource usage to increase occupancy
  Change block size
  Limit register usage
    Compiler option -maxrregcount=n: per file
    __launch_bounds__: per kernel (see the sketch below)
  Allocate shared memory dynamically
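A minimal sketch of the per-kernel route, using a hypothetical kernel: the __launch_bounds__ qualifier tells the compiler the largest block size the kernel will be launched with and, optionally, the minimum number of resident blocks per SM it should budget registers for.

// At most 256 threads per block; aim for at least 4 resident blocks per SM.
__global__ void __launch_bounds__(256, 4) boundedKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}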

Page 39: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

Optimization Overview

GPU architecture
Kernel optimization
  Memory optimization
  Latency optimization
  Instruction optimization
CPU-GPU interaction optimization
  Overlapped execution using streams

Page 40: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

Instruction Optimization

If you find that the code is instruction bound
  A compute-intensive algorithm can easily become memory-bound if one is not careful
  Typically, worry about instruction optimization after memory and execution-configuration optimizations

Purpose: reduce the instruction count
  Use fewer instructions to get the same job done

Major techniques
  Use high-throughput instructions
  Reduce wasted instructions: branch divergence, etc.

Page 41: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

Reduce Instruction Count

Use float if precision allows
  Add "f" to floating-point literals (e.g. 1.0f), because the default is double

Fast math functions
  Two types of runtime math library functions
    func(): slower but higher accuracy (5 ulp or less)
    __func(): fast but lower accuracy (see the programming guide for full details)
  -use_fast_math: forces every func() to __func()

A short example follows.
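A small illustration of both points, not from the deck: the fast intrinsic trades accuracy for speed, and the "f" suffix keeps the literal (and the arithmetic) in single precision.

__global__ void attenuate(float* signal, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // 0.5f keeps the multiply in single precision (0.5 would promote to double);
        // __expf() is the fast, lower-accuracy intrinsic, expf() the accurate version
        signal[i] = 0.5f * __expf(-signal[i]);
    }
}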

Page 42: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

Control Flow

Divergent branches:
  Threads within a single warp take different paths
  Example with divergence:
    if (threadIdx.x > 2) {...} else {...}
    Branch granularity < warp size
  Different execution paths within a warp are serialized

Different warps can execute different code with no impact on performance

Avoid diverging within a warp
  Example without divergence:
    if (threadIdx.x / WARP_SIZE > 2) {...} else {...}
    Branch granularity is a whole multiple of warp size

Page 43: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

Kernel Optimization Workflow

[Flowchart: find the limiter.
  Memory bound → compare achieved throughput to peak GB/s; if far below (<<), do memory optimization.
  Instruction bound → compare achieved throughput to peak inst/s; if far below (<<), do instruction optimization.
  Latency bound → do configuration optimization.
  Once the achieved value is close (~) to the peak, done!]

Page 44: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

Optimization Overview

GPU architecture
Kernel optimization
  Memory optimization
  Latency optimization
  Instruction optimization
CPU-GPU interaction optimization
  Overlapped execution using streams

Page 45: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

Minimizing CPU-GPU Data Transfer

Host<->device data transfer has much lower bandwidth than global memory access
  16 GB/s (PCIe x16 Gen3) vs 250 GB/s & 3.95 Tinst/s (GK110)

Minimize transfers
  Intermediate data can be allocated, operated on, and de-allocated directly on the GPU
  Sometimes it is even better to recompute on the GPU

Group transfers
  One large transfer is much better than many small ones (see the sketch below)
  Overlap memory transfers with computation
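A small sketch of grouping transfers: instead of copying three separate arrays with three cudaMemcpy calls, they can be packed into one contiguous host buffer and moved in a single transfer. The buffer names are hypothetical:

// Three logical arrays of n floats each, packed back-to-back in one buffer.
size_t bytes = 3 * n * sizeof(float);
float* h_packed = (float*)malloc(bytes);
memcpy(h_packed + 0*n, h_positions,  n * sizeof(float));
memcpy(h_packed + 1*n, h_velocities, n * sizeof(float));
memcpy(h_packed + 2*n, h_masses,     n * sizeof(float));

float* d_packed;
cudaMalloc(&d_packed, bytes);
cudaMemcpy(d_packed, h_packed, bytes, cudaMemcpyHostToDevice);  // one large transfer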

Page 46: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

Streams and Async API

Default API:
  Kernel launches are asynchronous with the CPU
  Memcopies (D2H, H2D) block the CPU thread
  CUDA calls are serialized by the driver

Streams and async functions provide:
  Memcopies (D2H, H2D) asynchronous with the CPU
  Ability to concurrently execute a kernel and a memcopy
  Concurrent kernels since Fermi

Stream = sequence of operations that execute in issue-order on the GPU
  Operations from different streams can be interleaved
  A kernel and a memcopy from different streams can be overlapped

Page 47: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

Pinned (non-pageable) memory

Pinned memory enables:
  Memcopies asynchronous with the CPU & GPU

Usage (see the sketch below)
  cudaHostAlloc / cudaFreeHost instead of malloc / free

Note:
  Pinned memory is essentially removed from virtual memory (it cannot be paged out)
  cudaHostAlloc is typically very expensive
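A minimal sketch of allocating and releasing a pinned host buffer; the pointer and size names are illustrative:

float* h_buf = NULL;
size_t bytes = n * sizeof(float);

// Pinned (page-locked) host allocation; expensive, so allocate once and reuse
cudaHostAlloc((void**)&h_buf, bytes, cudaHostAllocDefault);

// ... use h_buf as the source/destination of cudaMemcpyAsync calls ...

cudaFreeHost(h_buf);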

Page 48: Optimization on Kepler Zehuan Wang zehuan@nvidia.com

Overlap kernel and memory copy

Requirements:
  D2H or H2D memcopy from pinned memory
  Device with compute capability ≥ 1.1 (G84 and later)
  Kernel and memcopy in different, non-0 streams

Code:

cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

cudaMemcpyAsync(dst, src, size, dir, stream1);   // potentially
kernel<<<grid, block, 0, stream2>>>(...);        // overlapped
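Extending the idea, a common pattern is to split the input into chunks and pipeline transfer and compute across two streams, so the copy of one chunk overlaps the kernel on another. A sketch assuming a hypothetical process() kernel, a pinned host buffer h_in, and a chunk count that divides n evenly:

int nStreams = 2;
cudaStream_t streams[2];
for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

int chunk = n / nChunks;                          // assume n divides evenly
for (int i = 0; i < nChunks; ++i) {
    int offset = i * chunk;
    cudaStream_t st = streams[i % nStreams];
    // H2D copy of this chunk (h_in must be pinned for the copy to be asynchronous)
    cudaMemcpyAsync(d_in + offset, h_in + offset, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, st);
    // Kernel on the same chunk, issued to the same stream, so it runs after its copy
    process<<<chunk / 256, 256, 0, st>>>(d_in + offset, chunk);
}
cudaDeviceSynchronize();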