Optimization on Kepler
Zehuan Wang, [email protected]
Fundamental Optimization
Optimization Overview
- GPU architecture
- Kernel optimization
  - Memory optimization
  - Latency optimization
  - Instruction optimization
- CPU-GPU interaction optimization
  - Overlapped execution using streams
GPU High Level View
[Diagram: several SMs, each paired with shared memory (SMEM), connected to Global Memory on the GPU; the GPU talks to the CPU through the chipset over PCIe]
Streaming Multiprocessor
GK110 SM:
- Control unit: 4 warp schedulers, 8 instruction dispatchers
- Execution units: 192 single-precision CUDA cores, 64 double-precision CUDA cores, 32 SFUs, 32 LD/ST units
- Memory: 64K 32-bit registers; caches: L1 + shared memory (64 KB), texture, constant
- Connected to global memory
GPU and Programming Model
Software-to-hardware mapping:
- Thread <-> CUDA core: threads are executed by CUDA cores
- Thread block <-> multiprocessor: thread blocks are executed on multiprocessors; thread blocks do not migrate; several concurrent thread blocks can reside on one multiprocessor, limited by multiprocessor resources
- Grid <-> device: a kernel is launched as a grid of thread blocks; up to 16 kernels can execute on a device at one time
Warp
A warp is a group of 32 successive threads in a block.
- E.g. blockDim = 160: automatically divided into 5 warps by the GPU
- E.g. blockDim = 161: when blockDim is not a multiple of 32, the remaining threads occupy one more warp
Block 0 with blockDim = 160: Warp 0 (0~31), Warp 1 (32~63), Warp 2 (64~95), Warp 3 (96~127), Warp 4 (128~159)
Block 0 with blockDim = 161: Warp 0 (0~31) through Warp 4 (128~159), plus Warp 5 (160)
[Diagram: a block is divided into consecutive groups of 32 threads; each group of 32 threads is one warp]
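The rounding-up rule as code, a minimal sketch (warpsPerBlock is a hypothetical helper, not from the slides):

// Hypothetical helper: warps per block is blockDim rounded up to
// the next multiple of 32.
__host__ __device__ int warpsPerBlock(int blockDimX)
{
    return (blockDimX + 31) / 32;   // 160 -> 5 warps, 161 -> 6 warps
}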
Warp
SIMD: Single Instruction, Multiple Data. The threads in the same warp always execute the same instruction; instructions are issued to the execution units per warp.
[Diagram: two warp schedulers, each interleaving instructions from several warps in issue order, e.g. Warp Scheduler 0 issues warp 8 instruction 11, warp 2 instruction 42, warp 14 instruction 95, warp 8 instruction 12, ...; Warp Scheduler 1 does the same for warps 9, 3, and 15]
Warp
Latency is caused by dependencies between neighboring instructions in the same warp. During the waiting time, instructions from other warps can be executed. Context switching is free, so having a lot of warps can hide memory latency.
Kepler Memory Hierarchy
- Registers (spill to local memory)
- Caches: shared memory, L1 cache, L2 cache, constant cache, texture cache
- Global memory
Kepler/Fermi Memory Hierarchy
[Diagram: each SM has registers, L1 & shared memory, a texture cache, and a constant cache; all SMs share the L2 cache in front of global memory. Speed increases from global memory (slow) up to registers (fast).]
Wrong View of Optimization
Trying every optimization method in the book: optimization is endless, and this approach has low efficiency.
General Optimization Strategies: Measurement
Find out the limiting factor in kernel performance:
- Memory bandwidth bound (memory optimization)
- Instruction throughput bound (instruction optimization)
- Latency bound (configuration optimization)
Measure effective memory/instruction throughput with the NVIDIA Visual Profiler.
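As a sketch of what measuring effective throughput means, memory bandwidth can also be estimated by hand with CUDA events; myKernel, bytesRead, and bytesWritten are placeholders:

// Sketch: estimating effective memory bandwidth with CUDA events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
myKernel<<<grid, block>>>(d_in, d_out, n);   // placeholder kernel
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);      // elapsed time in milliseconds
// (bytes read + bytes written) / elapsed seconds, reported in GB/s
double effectiveGBs = (bytesRead + bytesWritten) / (ms / 1e3) / 1e9;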
[Flowchart: find the limiter. Memory bound: compare to the effective value in GB/s, then memory optimization. Instruction bound: compare to the effective value in inst/s, then instruction optimization. Latency bound: configuration optimization. Iterate until the limiter is resolved: done!]
Optimization Overview
- GPU architecture
- Kernel optimization
  - Memory optimization
  - Latency optimization
  - Instruction optimization
- CPU-GPU interaction optimization
  - Overlapped execution using streams
Memory Optimization
Apply when the code is memory bound and the effective memory throughput is much lower than the peak.
Purpose: access only data that are absolutely necessary.
Major techniques:
- Improve the access pattern to reduce wasted transactions
- Reduce redundant access: read-only cache, shared memory
Reduce Wasted Transactions
- Memory accesses are per warp, and memory is accessed in discrete chunks
- On Kepler, L1 is reserved for register spills and stack data; global loads go directly to L2 (invalidating the line in L1) and, on an L2 miss, to DRAM
- Memory is transported in segments of 32 B (same as for writes)
- If a warp cannot use all of the data in the segments it touches, the rest of the memory transaction is wasted
Reduce Wasted Transactions
Scenario: warp requests 32 aligned, consecutive 4-byte words.
- Addresses fall within 4 segments: no replays, bus utilization 100%
- Warp needs 128 bytes; 128 bytes move across the bus on a miss
[Diagram: the warp's addresses cover 4 consecutive 32 B segments exactly]
Reduce Wasted Transactions
Scenario: warp requests 32 aligned, permuted 4-byte words.
- Addresses fall within 4 segments: no replays, bus utilization 100%
- Warp needs 128 bytes; 128 bytes move across the bus on a miss
[Diagram: the warp's addresses are shuffled within the same 4 segments]
Reduce Wasted Transactions
Scenario: warp requests 32 consecutive 4-byte words, offset from perfect alignment.
- Addresses fall within at most 5 segments: 1 replay (2 transactions), bus utilization at least 80%
- Warp needs 128 bytes; at most 160 bytes move across the bus
- Some misaligned patterns still fall within 4 segments, giving 100% utilization
[Diagram: the warp's addresses are shifted across a segment boundary]
Reduce Wasted Transactions
Scenario: all threads in a warp request the same 4-byte word.
- Addresses fall within a single segment: no replays, bus utilization 12.5%
- Warp needs 4 bytes; 32 bytes move across the bus on a miss
[Diagram: all of the warp's addresses point at one word in a single segment]
Reduce Wasted Transactions
Scenario: warp requests 32 scattered 4-byte words (a concrete sketch follows).
- Addresses fall within N segments: (N-1) replays (N transactions); could be fewer if some segments can be combined into a single transaction
- Bus utilization: 128 / (N*32) (4x higher than with 128 B caching loads)
- Warp needs 128 bytes; N*32 bytes move across the bus on a miss
[Diagram: the warp's addresses are scattered across many segments]
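To make the best and worst cases concrete, here is a minimal sketch (my illustration, not from the slides) of a fully coalesced pattern versus a strided one:

// Coalesced: thread i reads word i, so a warp touches 32 consecutive
// 4-byte words = 4 segments, 100% bus utilization.
__global__ void coalescedCopy(float* out, const float* in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// Strided: thread i reads word i*stride; for stride >= 8 each thread
// hits its own 32 B segment and most of each segment is wasted.
__global__ void stridedCopy(float* out, const float* in, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];
}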
Read-Only Cache
- An alternative to L1 when accessing DRAM
- Also known as the texture cache: all texture accesses use this cache
- CC 3.5 and higher can also route global memory reads through it
- Should not be used if a kernel reads from and writes to the same addresses
- Compared to L1:
  - Generally better for scattered reads
  - Caching is at 32 B granularity (L1, when caching, operates at 128 B granularity)
  - Does not require replays for multiple transactions (L1 does)
  - Higher latency than L1 reads; also tends to increase register use
Read-Only Cache
Annotate eligible kernel pointer parameters with const and __restrict__; the compiler will automatically map those loads to the read-only data cache path.

__global__ void saxpy(float x, float y,
                      const float* __restrict__ input,
                      float* output)
{
    size_t offset = threadIdx.x + (blockIdx.x * blockDim.x);
    // Compiler will automatically use the read-only (texture) path for "input"
    output[offset] = (input[offset] * x) + y;
}
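On CC 3.5 and higher, the same path can also be requested explicitly with the __ldg() intrinsic; this variant is my addition, not from the slides:

// Explicit read through the read-only cache (CC 3.5+), same effect:
output[offset] = (__ldg(&input[offset]) * x) + y;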
Shared Memory
- Low latency: a few cycles
- High throughput
- Main uses:
  - Inter-thread communication within a block
  - User-managed cache to reduce redundant global memory accesses
  - Avoiding non-coalesced access
Shared Memory Example: Matrix Multiplication
C = A x B: every thread computes one entry of C.
[Diagram: matrices A, B, and C]
Naive Kernel
__global__ void simpleMultiply(float* a, float* b, float* c, int N)
{
    int row = threadIdx.x + blockIdx.x * blockDim.x;
    int col = threadIdx.y + blockIdx.y * blockDim.y;
    float sum = 0.0f;
    for (int i = 0; i < N; i++) {
        sum += a[row*N+i] * b[i*N+col];
    }
    c[row*N+col] = sum;
}

Every thread computes one entry of C, but each thread re-reads an entire row of a and column of b from global memory.
Blocked Matrix Multiplication
C = A x B computed tile by tile: data loaded into shared memory is reused by all threads of the block in the blocked version.
[Diagram: a tile of A and a tile of B combining into a tile of C]
Blocked and cached kernel:

__global__ void coalescedMultiply(float* a, float* b, float* c, int N)
{
    __shared__ float aTile[TILE_DIM][TILE_DIM];
    __shared__ float bTile[TILE_DIM][TILE_DIM];
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    for (int k = 0; k < N; k += TILE_DIM) {
        // Each thread stages one element of the A and B tiles
        aTile[threadIdx.y][threadIdx.x] = a[row*N + threadIdx.x + k];
        bTile[threadIdx.y][threadIdx.x] = b[(threadIdx.y + k)*N + col];
        __syncthreads();   // wait until the whole tile is loaded
        for (int i = 0; i < TILE_DIM; i++) {
            sum += aTile[threadIdx.y][i] * bTile[i][threadIdx.x];
        }
        __syncthreads();   // wait before the tiles are overwritten
    }
    c[row*N+col] = sum;
}
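A hypothetical launch configuration for this kernel; TILE_DIM = 16 and the assumption that N is a multiple of TILE_DIM are mine:

#define TILE_DIM 16

dim3 block(TILE_DIM, TILE_DIM);          // block shape must match the tile shape
dim3 grid(N / TILE_DIM, N / TILE_DIM);   // assumes N % TILE_DIM == 0
coalescedMultiply<<<grid, block>>>(d_a, d_b, d_c, N);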
Optimization Overview
- GPU architecture
- Kernel optimization
  - Memory optimization
  - Latency optimization
  - Instruction optimization
- CPU-GPU interaction optimization
  - Overlapped execution using streams
Latency Optimization
Apply when the code is latency bound: both the memory and instruction throughputs are far from the peak.
Latency hiding works by switching threads: a thread blocks when one of its operands isn't ready, and other warps execute in the meantime.
Purpose: have enough warps to hide latency
Major techniques: increase active warps
Enough Blocks and a Large Enough Block Size
- Number of blocks >> number of SMs: > 100 to scale well to future devices
- Block size: minimum 64; I generally use 128 or 256, but use whatever is best for your app
- It depends on the problem: do experiments!
Occupancy & Active Warps
Occupancy: ratio of active warps per SM to the maximum number of allowed warps
Maximum number: 48 on Fermi, 64 on Kepler
We need the occupancy to be high enough to hide latency
Occupancy is limited by resource usage
Dynamic Partitioning of SM Resources
- Shared memory is partitioned among blocks
- Registers are partitioned among threads: <= 255 per thread
- Thread block slots: <= 16
- Thread slots: <= 2048
- Any of these can be the limiting factor on how many threads can be launched at the same time on an SM
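For example, assuming GK110's 64K 32-bit registers per SM: a kernel that uses 64 registers per thread limits the SM to 65536 / 64 = 1024 threads, i.e. 32 of the 64 possible warps, so occupancy caps at 50% regardless of block size.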
Occupancy Calculator
http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls
Occupancy Optimizations
Know the current occupancy:
- Visual Profiler
- --ptxas-options=-v: outputs resource usage info; input for the Occupancy Calculator
Adjust resource usage to increase occupancy:
- Change the block size
- Limit register usage: compiler option -maxrregcount=n (per file) or __launch_bounds__ (per kernel, see the sketch below)
- Allocate shared memory dynamically
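A minimal sketch of the per-kernel route; scaleKernel and its body are placeholders:

// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor) tells
// the compiler to cap register use so that at least 4 blocks of 256
// threads can be resident on one SM.
__global__ void __launch_bounds__(256, 4) scaleKernel(float* data, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= s;   // placeholder body
}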
Optimization Overview
- GPU architecture
- Kernel optimization
  - Memory optimization
  - Latency optimization
  - Instruction optimization
- CPU-GPU interaction optimization
  - Overlapped execution using streams
Instruction Optimization
Apply when you find the code is instruction bound.
- A compute-intensive algorithm can easily become memory bound if you are not careful enough
- Typically, worry about instruction optimization after memory and execution-configuration optimizations
Purpose: reduce the instruction count; use fewer instructions to get the same job done.
Major techniques:
- Use high-throughput instructions
- Reduce wasted instructions: branch divergence, etc.
Reduce Instruction Count
- Use float if precision allows
  - Add the "f" suffix to floating-point literals (e.g. 1.0f), because the default is double
- Fast math functions: two types of runtime math library functions (see the sketch below)
  - func(): slower but higher accuracy (5 ulp or less)
  - __func(): faster but lower accuracy (see the programming guide for full details)
  - -use_fast_math: forces every func() to __func()
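A small illustration of both points (mine, not from the slides):

__global__ void fastMathExample(float* out, const float* in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // The "f" suffix keeps the literal, and hence the multiply, in single precision
    out[i] = __sinf(in[i]) * 0.5f;   // __sinf(): fast intrinsic; sinf(): slower, <= 5 ulp
}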
Control Flow
Divergent branches: threads within a single warp take different paths.
- Example with divergence: if (threadIdx.x > 2) {...} else {...} (branch granularity < warp size)
- Different execution paths within a warp are serialized
- Different warps can execute different code with no impact on performance
Avoid diverging within a warp:
- Example without divergence: if (threadIdx.x / WARP_SIZE > 2) {...} else {...} (branch granularity is a whole multiple of the warp size; full kernels sketched below)
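The two examples as complete kernels, a sketch with hypothetical names:

#define WARP_SIZE 32

// Divergent: within warp 0, threads 0-2 and 3-31 take different paths,
// so the two paths are serialized for that warp.
__global__ void divergent(float* data)
{
    if (threadIdx.x > 2) data[threadIdx.x] *= 2.0f;
    else                 data[threadIdx.x] += 1.0f;
}

// Non-divergent: all threads of any given warp take the same path.
__global__ void warpAligned(float* data)
{
    if (threadIdx.x / WARP_SIZE > 2) data[threadIdx.x] *= 2.0f;
    else                             data[threadIdx.x] += 1.0f;
}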
Kernel Optimization Workflow
[Flowchart: find the limiter. Memory bound: compare to peak GB/s, then memory optimization. Instruction bound: compare to peak inst/s, then instruction optimization. Latency bound: configuration optimization. Iterate until the limiter is resolved: done!]
Optimization Overview
- GPU architecture
- Kernel optimization
  - Memory optimization
  - Latency optimization
  - Instruction optimization
- CPU-GPU interaction optimization
  - Overlapped execution using streams
Minimizing CPU-GPU Data Transfer
Host<->device data transfer has much lower bandwidth than global memory access:
16 GB/s (PCIe x16 Gen3) vs 250 GB/s & 3.95 Tinst/s (GK110)
Minimize transfers:
- Intermediate data can be allocated, operated on, and de-allocated directly on the GPU
- Sometimes it is even better to recompute on the GPU
Group transfers (sketch below):
- One large transfer is much better than many small ones
- Overlap memory transfers with computation
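A sketch of the grouping point; the buffer names are placeholders:

// Many small transfers: each cudaMemcpy pays launch latency and setup cost.
for (int i = 0; i < n; ++i)
    cudaMemcpy(d_a + i, h_a + i, sizeof(float), cudaMemcpyHostToDevice);

// One large transfer of the same data: much better.
cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);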
Streams and Async API
Default API:
- Kernel launches are asynchronous with the CPU
- Memcopies (D2H, H2D) block the CPU thread
- CUDA calls are serialized by the driver
Streams and async functions provide:
- Memcopies (D2H, H2D) asynchronous with the CPU
- The ability to execute a kernel and a memcopy concurrently
- Concurrent kernels (since Fermi)
Stream = a sequence of operations that execute in issue order on the GPU:
- Operations from different streams can be interleaved
- A kernel and a memcopy from different streams can be overlapped
Pinned (non-pageable) memory
Pinned memory enables memcopies that are asynchronous with both the CPU and the GPU.
Usage: cudaHostAlloc / cudaFreeHost instead of malloc / free (sketch below).
Note: pinned memory is essentially removed from virtual memory paging, and cudaHostAlloc is typically very expensive.
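A usage sketch; nBytes, d_data, and stream are placeholders:

float* h_data;   // pinned host buffer
cudaHostAlloc((void**)&h_data, nBytes, cudaHostAllocDefault);

// ... fill h_data, then copy asynchronously against a stream ...
cudaMemcpyAsync(d_data, h_data, nBytes, cudaMemcpyHostToDevice, stream);

cudaFreeHost(h_data);   // release the pinned buffer when done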
Overlap kernel and memory copy
Requirements:
- D2H or H2D memcopy from pinned memory
- Device with compute capability ≥ 1.1 (G84 and later)
- Kernel and memcopy in different, non-0 streams
Code:

cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

cudaMemcpyAsync(dst, src, size, dir, stream1);   // potentially
kernel<<<grid, block, 0, stream2>>>(...);        // overlapped
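Extending this, a common pattern (my sketch, not from the slides; assumes pinned h_in/h_out buffers and two already-created streams) chunks the data so one chunk's copies overlap another chunk's kernel:

// Pipeline: chunk i's H2D copy, kernel, and D2H copy all go to the same
// stream, so chunk 0's kernel can overlap chunk 1's H2D copy, and so on.
for (int i = 0; i < nChunks; ++i) {
    size_t off = (size_t)i * chunkElems;
    cudaMemcpyAsync(d_in + off, h_in + off, chunkBytes,
                    cudaMemcpyHostToDevice, streams[i % 2]);
    process<<<grid, block, 0, streams[i % 2]>>>(d_in + off, d_out + off);
    cudaMemcpyAsync(h_out + off, d_out + off, chunkBytes,
                    cudaMemcpyDeviceToHost, streams[i % 2]);
}
cudaDeviceSynchronize();   // wait for all streams to finish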