
Programming with CUDA
WS 08/09

Lecture 9
Thu, 20 Nov, 2008

Previously

CUDA Runtime Component
– Common Component
– Device Component
– Host Component: runtime & driver APIs

Today

Memory & instruction optimizations
Final projects - reminder

Instruction Performance

Instruction Processing

To execute an instruction on a warp of threads, the SM
– Reads in the instruction operands for each thread
– Executes the instruction on all threads
– Writes the result of each thread

Instruction Throughput

Maximized when
– Use of low throughput instructions is minimized
– Available memory bandwidth is used maximally
– The thread scheduler overlaps compute & memory operations
  – Programs have a high arithmetic intensity per memory operation
  – Each SM has many active threads


Instruction Throughput

Avoid low throughput instructions
– Be aware of clock cycles used per instruction
– There are often faster alternatives for math functions, e.g. sinf and __sinf
– Size of operands (24 bit, 32 bit) also makes a difference
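
A rough sketch of that trade-off: __sinf is a faster but less accurate device intrinsic than sinf, and __mul24 uses the 24-bit integer multiplier; the kernel itself is made up for illustration.

    // Sketch only: trades accuracy for instruction throughput
    __global__ void fastMathExample(float* out, const float* in, int n)
    {
        int tid = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;  // 24-bit multiply
        if (tid < n)
            out[tid] = __sinf(in[tid]);   // faster, less precise than sinf(in[tid])
    }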

Instruction Throughput

Avoid low throughput instructions
– Integer division and modulo are expensive
  – Use bitwise operations (>>, &) instead
– Type conversion costs cycles
  – char/short => int
  – double <=> float
– Define float quantities with f, e.g. 1.0f
– Use float functions, e.g. expf
– Some devices (<= 1.2) demote double to float
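
A small illustrative kernel combining these points, assuming the divisor/modulus is a known power of two (here 32):

    __global__ void indexMath(float* out, const float* in)
    {
        int i    = blockIdx.x * blockDim.x + threadIdx.x;
        int warp = i >> 5;     // instead of i / 32
        int lane = i & 31;     // instead of i % 32
        // the f suffix and expf keep the computation in single precision
        out[i] = 0.5f * expf(in[i]) + (float)(warp + lane);
    }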

Instruction Throughput

Avoid branching
– Diverging threads in a warp are serialized
– Try to minimize the number of divergent warps
– Loop unrolling by the compiler can be controlled using #pragma unroll
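
A hedged sketch of both points: the condition below depends only on which warp a thread belongs to, so threads within one warp never diverge, and #pragma unroll steers the compiler's unrolling (kernel and numbers are illustrative).

    __global__ void branchAndUnroll(float* data)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;

        // Uniform within each aligned warp of 32 threads: no divergence
        if ((tid & 32) == 0)
            data[tid] *= 2.0f;
        else
            data[tid] *= 0.5f;
        // A per-thread condition such as (tid % 2 == 0) would instead split
        // every warp into two serialized paths.

        #pragma unroll 4              // ask the compiler to unroll this loop
        for (int k = 0; k < 8; ++k)
            data[tid] += 1.0f;
    }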

Instruction Throughput

Avoid high latency memory instructions
– An SM takes 4 clock cycles to issue a memory instruction to a warp
– In the case of local/global memory, there is an additional overhead of 400 to 600 cycles

  __shared__ float shared;
  __device__ float device;
  shared = device;

  4 + 4 + [400,600] cycles

Instruction Throughput

Avoid high latency memory instructions
– If local/global memory has to be accessed, surround it with independent arithmetic instructions
  – The SM can do math while accessing memory
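
A minimal sketch of the idea: the arithmetic below does not depend on the loaded value, so the SM can schedule it while the global read is in flight (names and constants are illustrative).

    __global__ void overlapLoad(float* out, const float* in, float a, float b)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;

        float loaded = in[tid];    // global read: 400-600 cycles of latency
        float x = a * b + 3.0f;    // independent of the load, can execute meanwhile
        x = x * x - a;             // still independent

        out[tid] = loaded + x;     // first use of the loaded value
    }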

Instruction Throughput

Cost of __syncthreads()
– The instruction itself takes 4 clock cycles for a warp
– Additional cycles are spent waiting for threads to catch up

Instruction Throughput

Maximized when
– Use of low throughput instructions is minimized
– Available memory bandwidth is used maximally
– The thread scheduler overlaps compute & memory operations
  – Programs have a high arithmetic intensity per memory operation
  – Each SM has many active threads

Instruction Throughput

Effective memory bandwidth of each memory space (global, local, shared) depends on the memory access pattern

Device memory has higher latency and lower bandwidth than on-chip memory
– Minimize use of device memory

Instruction Throughput

Typical execution
– Each thread loads data from device to shared memory
– Synch threads, if necessary
– Each thread processes data in shared memory
– Synch threads, if necessary
– Write data from shared to device memory
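
A minimal sketch of this pattern, assuming one element per thread and a block size of 256 (kernel and names are illustrative, not from the slides):

    __global__ void scaleViaShared(float* out, const float* in, float factor)
    {
        __shared__ float tile[256];                  // block size of 256 threads assumed
        int tid = blockIdx.x * blockDim.x + threadIdx.x;

        tile[threadIdx.x] = in[tid];                 // device -> shared
        __syncthreads();                             // sync before other threads read the tile

        float result = tile[threadIdx.x] * factor;   // process data in shared memory
        __syncthreads();                             // only needed if the tile is reused

        out[tid] = result;                           // write back to device memory
    }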

Instruction Throughput

Global memory
– High latency, low bandwidth
– Not cached
– Right access patterns are crucial

Instruction Throughput

Global memory: alignment
– Supported word sizes: 4, 8, 16 bytes

  __device__ type device[32];
  type data = device[tid];

  compiles to a single load instruction if
  – type has a supported size
  – type variables are aligned to sizeof(type): the address of the variable should be a multiple of sizeof(type)

Instruction Throughput

Global memory: alignment
– Alignment requirement is automatically fulfilled for built-in types
– For self-defined structures, alignment can be forced

  struct __align__(8)  { float a, b; }    myStruct8;
  struct __align__(16) { float a, b, c; } myStruct12;

Instruction Throughput

Global memory: alignment
– Addresses of global variables are aligned to 256 bytes
– Align structures cleverly

  struct { float a, b, c, d, e; } myStruct20;

  Five 32-bit load instructions

Instruction Throughput

Global memory: alignment
– Addresses of global variables are aligned to 256 bytes
– Align structures cleverly

  struct __align__(8) { float a, b, c, d, e; } myStruct20;

  Three 64-bit load instructions

Instruction Throughput

Global memory: alignment
– Addresses of global variables are aligned to 256 bytes
– Align structures cleverly

  struct __align__(16) { float a, b, c, d, e; } myStruct20;

  Two 128-bit load instructions

Instruction Throughput

Global memory: coalescing
– Size of a memory transaction on global memory can be 32 (>= 1.2), 64 or 128 bytes
– Used most efficiently when simultaneous memory accesses by threads in a half-warp can be coalesced into a single memory transaction
– Coalescing varies with compute capability

Instruction Throughput

Global memory: coalescing, <= 1.1
– Global memory accesses by threads in a half-warp are coalesced if
  – Each thread accesses words of size
    – 4 bytes: one 64-byte memory operation
    – 8 bytes: one 128-byte memory operation
    – 16 bytes: two 128-byte memory operations
  – All 16 words lie in the same (aligned) segment in global memory
  – Threads access words in sequence

Instruction Throughput

Global memory: coalescing, <= 1.1
– If any of the conditions is violated by a half-warp, thread memory accesses are serialized
– Coalesced access at larger word sizes is slower than coalesced access at smaller sizes
  – Still a lot more efficient than non-coalesced access

Instruction Throughput

Global memory: coalescing, >= 1.2
– Global memory accesses by threads in a half-warp are coalesced if the accessed words lie in the same aligned segment of the required size
  – 32 bytes for 2-byte words
  – 64 bytes for 4-byte words
  – 128 bytes for 8/16-byte words
– Any access pattern is allowed
  – Lower CC cards restrict access patterns

Instruction Throughput

Global memory: coalescing, >= 1.2
– If a half-warp addresses words in N different segments, N memory transactions are issued
  – Lower CC cards issue 16
– Hardware automatically detects and optimizes for unused words, e.g. if the requested words lie in the lower or upper half of a 128-byte segment, a 64-byte operation is issued

Instruction Throughput

Global memory: coalescing, >= 1.2
– Summary for memory transactions by threads in a half-warp
  – Find the memory segment containing the address requested by the lowest numbered active thread
  – Find all other active threads requesting addresses in the same segment
  – Reduce the transaction size, if possible
  – Do the transaction, mark those threads as inactive
  – Repeat until all threads are serviced

Instruction Throughput

Global memory: coalescing
– General patterns

  TYPE* BaseAddress; // 1D array
  // thread reads BaseAddress + tid

  – TYPE must meet the size and alignment requirements
  – If TYPE is larger than 16 bytes, split it into smaller objects that meet the requirements
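
A minimal sketch of this 1D pattern with float as TYPE (kernel name is illustrative):

    __global__ void copy1D(float* out, const float* BaseAddress, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n)
            out[tid] = BaseAddress[tid];  // thread tid reads BaseAddress + tid:
                                          // consecutive threads hit consecutive 4-byte words,
                                          // so a half-warp's loads coalesce into one transaction
    }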

Instruction Throughput

Global memory: coalescing
– General patterns

  TYPE* BaseAddress;
  // 2D array of size: width x height
  // read BaseAddress + tx*width + ty

  – Size and alignment requirements hold

Instruction Throughput

Global memory: coalescing
– General patterns
  – Memory coalescing is achieved for all half-warps in a block if
    – the width of the block is a multiple of 16
    – width (of the array) is a multiple of 16
  – Arrays whose width is a multiple of 16 are accessed more efficiently
    – Useful to pad arrays up to multiples of 16
    – Done automatically by the cuMemAllocPitch / cudaMallocPitch functions
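
A hedged host-side sketch using the runtime API: cudaMallocPitch pads each row and returns the resulting pitch in bytes (sizes and names are illustrative):

    size_t width  = 500;             // elements per row (not a multiple of 16)
    size_t height = 100;
    float* devPtr;
    size_t pitch;                    // padded row length in bytes, chosen by the runtime

    cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(float), height);

    // In a kernel, rows are then addressed through the pitch, not the logical width:
    //   float* row = (float*)((char*)devPtr + r * pitch);
    //   float  v   = row[c];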

Instruction Throughput

Local memory
– Used for some internal variables
– Not cached
– As expensive as global memory
– As accesses are, by definition, per-thread, they are automatically coalesced

Instruction Throughput

Constant memory
– Cached
  – Costs one memory read from device memory on a cache miss
  – Otherwise, one cache read
– For threads in a half-warp, the cost of reading the cache is proportional to the number of different addresses read
  – Recommended to have all threads in a half-warp read the same address
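
A small sketch of typical constant-memory usage (array name, size and kernel are illustrative):

    __constant__ float coeffs[16];      // resides in constant memory, cached on-chip

    __global__ void applyCoeff(float* data)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        // All threads of a half-warp read the same address,
        // so a single cache read serves them all
        data[tid] *= coeffs[0];
    }

    // Host side: fill the constant symbol before launching, e.g.
    //   cudaMemcpyToSymbol(coeffs, hostCoeffs, 16 * sizeof(float));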

Instruction Throughput

Texture memory
– Cached
  – Costs one memory read from device memory on a cache miss
  – Otherwise, one cache read
– The texture cache is optimized for 2D spatial locality
  – Recommended for threads in a warp to read neighboring texture addresses
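
A small sketch using the texture reference API of this CUDA generation (names are illustrative):

    texture<float, 1, cudaReadModeElementType> texRef;   // texture reference at file scope

    __global__ void readThroughTexture(float* out, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n)
            out[tid] = tex1Dfetch(texRef, tid);          // read goes through the texture cache
    }

    // Host side: bind a linear device buffer to the reference before launching, e.g.
    //   cudaBindTexture(NULL, texRef, devPtr, n * sizeof(float));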

Instruction Throughput

Shared memory
– On-chip
  – As fast as registers, provided there are no bank conflicts between threads
– Divided into equally-sized modules, called banks
  – If N requests fall in N separate banks, they are processed concurrently
  – If N requests fall in the same bank, there is an N-way bank conflict
    – The N requests are serialized

Instruction Throughput

Shared memory: banks
– Successive 32-bit words are assigned to successive banks
– Bandwidth: 32 bits per 2 clock cycles per bank
– Requests from a warp are split according to half-warps
  – Threads in different half-warps cannot conflict

Instruction Throughput

Shared memory: bank conflicts

  __shared__ char shared[32];
  char data = shared[BaseIndex + tId];

– Why?

Instruction Throughput

Shared memory: bank conflicts

  __shared__ char shared[32];
  char data = shared[BaseIndex + tId];

– Multiple array members, e.g. shared[0], shared[1], shared[2] and shared[3], lie in the same bank
– Can be resolved as

  char data = shared[BaseIndex + 4*tId];

Instruction Throughput

Shared memory: bank conflicts

  __shared__ double shared[32];
  double data = shared[BaseIndex + tId];

– Why?

Instruction Throughput

Shared memory: bank conflicts

  __shared__ double shared[32];
  double data = shared[BaseIndex + tId];

– 2-way bank conflict because of a stride of two 32-bit words

Instruction Throughput

Shared memory: bank conflicts

  __shared__ TYPE shared[32];
  TYPE data = shared[BaseIndex + tId];

Instruction Throughput

Shared memory: bank conflicts

  __shared__ TYPE shared[32];
  TYPE data = shared[BaseIndex + tId];

– Three separate memory reads with no bank conflicts if

  struct TYPE { float x, y, z; };

– Stride of three 32-bit words

Instruction Throughput

Shared memory: bank conflicts

  __shared__ TYPE shared[32];
  TYPE data = shared[BaseIndex + tId];

– Two separate memory reads, each with a 2-way bank conflict, if

  struct TYPE { float x, y; };

– Stride of two 32-bit words, similar to double

Final Projects

Reminder
– Form groups by next lecture
– Think of project ideas for your group
  – You are encouraged to submit several ideas
– For each idea, submit a short text describing
  – the problem you want to solve
  – why you think it is suited for parallel computation
– Jens and I will assign you one of your suggested topics

Final Projects

Reminder
– If some people have not formed groups, Jens and I will assign you randomly to groups
– If you cannot think of any ideas, Jens and I will assign you some
– We will float around some write-ups of our own ideas; you may choose one of those

Final Projects

Time-line
– Thu, 20 Nov (today): Float write-ups on ideas of Jens & Waqar
– Tue, 25 Nov: Suggest groups and topics
– Thu, 27 Nov: Groups and topics assigned
– Tue, 2 Dec: Last chance to change groups/topics; groups and topics finalized

All for today

Next time
– More on bank conflicts
– Other optimizations

See you next week!