![Page 1: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/1.jpg)
Solving Common Problems with GPU: A Case Study with Sparse Matrices
17 Mar 2011 ARTS Talk
Sean Baxter (324)
![Page 2: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/2.jpg)
GPU Architecture
• Why GPU?
• Scan
• Idiomatic GPU Programming
• Sparse Matrix Optimization
![Page 3: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/3.jpg)
Why GPU?
![Page 4: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/4.jpg)
Why GPU?
Consider advance of CMOS process technology
![Page 5: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/5.jpg)
CMOS Process Tech
• 1972 – Intel 8008 – 10000nm – 3500 trans
• 1985 – Intel 386 – 1000nm – 275k trans
• 1989 – Intel 486 – 800nm – 1.18m trans
• 1997 – Pentium II – 350nm – 7.5m trans
• 2000 – Pentium 4 – 180nm – 42m trans
• 2006 – Core 2 Duo – 65nm – 291m trans
• 2011 – Sandy Bridge – 32nm – 995m trans
![Page 6: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/6.jpg)
CMOS Process Tech
Moore’s Law is Ending because…
![Page 7: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/7.jpg)
CMOS Process Tech
…we’re running out of atoms
![Page 8: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/8.jpg)
CMOS Process Tech
• Current node is 28nm
• 28nm is half the distance between the starts of repeating features
• Radius of Si is 0.11nm
• Only about 250 Si atoms between the starts of repeating features
• Not many more nodes before we run out of atoms
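The atom count above follows from the slide's two numbers. A minimal sketch of the arithmetic, assuming (as a simplification not stated in the talk) that atoms sit as touching spheres along a straight line:

```python
# Rough count of silicon atoms across one feature pitch.
# Inputs are the slide's figures; the touching-spheres model is an assumption.
half_pitch_nm = 28.0   # "28nm is half distance between start of repeating features"
si_radius_nm = 0.11    # slide's value for the Si atomic radius

pitch_nm = 2 * half_pitch_nm                    # full distance between repeats
atoms_across = pitch_nm / (2 * si_radius_nm)    # one atom occupies one diameter

print(f"pitch = {pitch_nm:.0f} nm, roughly {atoms_across:.0f} Si atoms across")
```

The result lands within a few atoms of the slide's rounded figure of 250.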
![Page 9: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/9.jpg)
Why GPU?
• Feature density will plateau
• To get continuously improving performance we need to improve efficiency
• Focus on features that actually do work
• CPU pipeline length enables high clockspeed
• CPU pipeline length makes big, inefficient chip
• GPU uses latency hiding to manage pipeline
![Page 10: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/10.jpg)
Instruction Pipeline
• Old school RISC pipeline:
1 Instruction fetch
2 Instruction decode
3 Execute
4 Memory access
5 Write back to register
![Page 11: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/11.jpg)
Instruction Pipeline
• Cycle 0: Fet In0
• Cycle 1: Dec In0, Fet In1
• Cycle 2: Ex In0, Dec In1, Fet In2
• Cycle 3: Mem In0, Ex In1, Dec In2, Fet In3
• Cycle 4: WB In0, Mem In1, Ex In2, Dec In3, Fet In4
• Cycle 5: WB In1, Mem In2, Ex In3, Dec In4, Fet In5
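The cycle-by-cycle occupancy of an ideal, stall-free five-stage pipeline can be generated mechanically. This is a sketch for illustration only; the function name `pipeline_schedule` is made up here, and it assumes one instruction enters fetch per cycle with no hazards:

```python
STAGES = ["Fet", "Dec", "Ex", "Mem", "WB"]

def pipeline_schedule(num_cycles):
    """Return, for each cycle, the (stage, instruction) pairs in flight.

    Instruction i enters fetch on cycle i and advances one stage per
    cycle, so on cycle c stage s holds instruction c - s (if non-negative).
    """
    rows = []
    for cycle in range(num_cycles):
        active = []
        # Walk from the deepest stage to the shallowest, as the slide lists them.
        for s in reversed(range(len(STAGES))):
            instr = cycle - s
            if instr >= 0:
                active.append((STAGES[s], instr))
        rows.append(active)
    return rows

schedule = pipeline_schedule(6)
for cycle, active in enumerate(schedule):
    print(f"Cycle {cycle}: " + ", ".join(f"{st} In{i}" for st, i in active))
```

From cycle 4 onward all five stages are busy, which is the throughput benefit the next slide describes.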
![Page 12: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/12.jpg)
Instruction Pipeline
• Pipelining allows concurrent execution of instructions
• Latency increases but so does throughput
• Longer pipelines allow higher clockspeeds for increased throughput
• Longer pipelines require more parallel circuitry to implement all stages
![Page 13: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/13.jpg)
Pipeline Hazards
0: x3 <- x2 op x1
1: x5 <- x4 op x3
Result of In0 is source of In1: Pipeline stall
NOPs are silently inserted into pipeline
Also stall on memory contention or branch
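The data hazard in this slide is a read-after-write dependency between adjacent instructions. A small sketch of how such a dependency might be spotted, assuming a made-up three-operand tuple format `(dest, src_a, src_b)` and checking only the immediately preceding instruction:

```python
def raw_stalls(program):
    """Indices of instructions that read the result of the instruction just before them."""
    stalls = []
    for i in range(1, len(program)):
        prev_dest = program[i - 1][0]
        _, src_a, src_b = program[i]
        if prev_dest in (src_a, src_b):
            stalls.append(i)
    return stalls

# The slide's two instructions: x3 <- x2 op x1, then x5 <- x4 op x3.
program = [("x3", "x2", "x1"), ("x5", "x4", "x3")]
print(raw_stalls(program))   # instruction 1 depends on instruction 0
```

A real pipeline would look further back than one instruction; the one-deep check is just enough to flag the slide's example.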
![Page 14: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/14.jpg)
Pipeline Performance
• Mitigate hazards with longer pipelines
• Out-of-order execution
• Branch prediction
• Speculative execution
• Register renaming
• Microfusion
• Micro-op cache (1500 uOp on Sandy Bridge)
![Page 15: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/15.jpg)
Superscalar
• Instruction Level Parallelism (ILP) on CPU
• Retire multiple instructions per cycle (IPC)
• Intel Core architecture retires 4 IPC
• Build multiple pipelines and have each fetch, decode, execute multiple instructions
![Page 16: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/16.jpg)
Instruction Pipeline
• Pipeline length drove MHz in 90s
• CMOS process improvements enabled longer pipelines with increasing feature density
• Circuitry to support long pipelines grew faster than circuitry to execute
• Superscalar not actually all that effective:
– Measured performance more like 1-1.5 IPC
![Page 17: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/17.jpg)
End of Long Pipelines
• Pentium 4 Prescott (90nm Netburst)
• Longest pipeline of any consumer chip:
– Pentium 4 Northwood 20 pipeline stages
– Pentium 4 Prescott 31 pipeline stages
• Pipeline irony
– Longer pipelines increase latency, make hazards harder to resolve
• After Prescott, all vendors retreated from long pipelines
![Page 18: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/18.jpg)
GPU Revisionist History
• CPU pipeline is very complex to ensure single-threaded program correctness
• GPU drops single threaded support and implements latency hiding
• Scalar, in-order execution
• Long pipelines for high MHz
• Almost all circuitry is in execution blocks for doing work, not decoding, re-ordering, or analyzing instructions
![Page 19: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/19.jpg)
GPU Pipeline
• NVIDIA DX11 part “Fermi” (GF100-GF118)
• GTX 580 (GF110) 16 Streaming Multiprocessors (SM)
• Each SM operates independently, akin to cores on CPU
• Within each SM, 32 active threads
• Within each SM, 1536 threads in flight
• GTX 580 is like 16-core exotic sequential processor
![Page 20: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/20.jpg)
GPU SM Pipeline
• Instruction dispatch grouped into warps
• On NV warp = 32 threads, on ATI warp = 64 threads
• Threads within warp execute in parallel
• In0 completes on all threads of warp 0 prior to In1 executing on any thread of warp 0
• ATI and G200 execute quarter warp per cycle
• Fermi executes full warp per cycle
![Page 21: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/21.jpg)
GPU SM Pipeline
• Fake RISC-like pipeline on GPU
• Cycle 0:
– Tid 0-31 fetch In0
• Cycle 1:
– Tid 0-31 decode In0
– Tid 32-63 fetch In0
• Cycle 2:
– Tid 0-31 execute In0
– Tid 32-63 decode In0
– Tid 64-95 fetch In0
![Page 22: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/22.jpg)
GPU SM Pipeline
• On Fermi, up to 48 warps in flight
• For 100% SM occupancy, 48 cycle instruction latency can be hidden with zero pipeline stalls
• Actual instruction latency suspected around 22 cycles – launch at least 704/1536 threads
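The 704-thread figure comes from multiplying the suspected latency by the warp width. A quick check, assuming (as the slide's model implies) that one warp instruction issues per cycle, so hiding L cycles takes L warps in flight:

```python
WARP_SIZE = 32
MAX_THREADS_PER_SM = 1536     # Fermi figure from the slides

def threads_to_hide(latency_cycles, warp_size=WARP_SIZE):
    """Threads needed so a different warp can issue on every cycle of the latency."""
    return latency_cycles * warp_size

max_warps = MAX_THREADS_PER_SM // WARP_SIZE
needed = threads_to_hide(22)          # 22 cycles is the slide's suspected latency
print(max_warps, needed, f"{needed / MAX_THREADS_PER_SM:.0%} occupancy")
```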
![Page 23: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/23.jpg)
GPU SM Pipeline Performance
• 32 instructions executed per cycle
• Core speed 1544MHz
• 16 MPs on device
• FMA (2 flops) x 1544MHz x 32 IPC x 16 MP = 1.58 TFlop/s
• ATI is VLIW4 for almost 3 TFlop/s
![Page 24: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/24.jpg)
GPU SM Pipeline
• The programmer has little control over the CPU pipeline
• The programmer has great control over the GPU pipeline, as expressed by the shape of thread execution
• Branching over a small number of warps stalls the pipeline
![Page 25: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/25.jpg)
Memory Bandwidth
• AMD Phenom II: 14.6 GB/s
• Intel Sandy Bridge Quad: 21.3 GB/s
• Itanium 2 (NASA Columbia): 6.4 GB/s
• NVidia Geforce GTX 580: 192 GB/s
Rule of thumb: GPU has 10x memory bandwidth and 50x arithmetic throughput compared to CPU
![Page 26: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/26.jpg)
GPU Memory Architecture
• Memory segmented into aligned 128 byte intervals
• On global memory I/O, number of segments addressed by threads in a warp is computed
• Each segment is a memory transaction
• On coalesced r/w, each thread in warp addresses different 4byte address in segment
• Coalesced r/w means only 1 transaction per 128 bytes
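The segment-counting rule described above is easy to model on the CPU. The sketch below is not a hardware simulator; `transactions` is a made-up helper that simply counts the distinct aligned 128-byte segments touched by one warp-wide access:

```python
SEGMENT_BYTES = 128
WARP_SIZE = 32

def transactions(byte_addresses, segment_bytes=SEGMENT_BYTES):
    """Number of distinct aligned segments touched by one warp-wide access."""
    return len({addr // segment_bytes for addr in byte_addresses})

coalesced = [4 * tid for tid in range(WARP_SIZE)]        # consecutive 4-byte words
misaligned = [4 * tid + 4 for tid in range(WARP_SIZE)]   # same, shifted by one word
strided = [4 * 9 * tid for tid in range(WARP_SIZE)]      # each thread 9 words apart

print(transactions(coalesced), transactions(misaligned), transactions(strided))
```

The coalesced pattern costs one transaction, a one-word misalignment doubles it, and a 9-word stride spreads the warp over nine segments.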
![Page 27: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/27.jpg)
Coalescing
• For bandwidth-bound kernels, memory coalescing is #1 performance priority
• For 4 byte types, transactions are serviced for full warps
• For 8 byte types, transactions are serviced for half warps
• For 16 byte types (like float4 vecs in D3D), transactions are serviced for quarter warps
• For larger types, penalties apply when loading from typed pointers
• To r/w large structures (>16bytes), spread transactions over multiple threads
![Page 28: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/28.jpg)
GPU Memory Architecture
• GTX 580 memory clock 1002 MHz
• GDDR5 controller is quad-pumped
• Memory controller is 384 bits (6x64) (48 byte)
• 1002MHz x 48 byte x 4 = 192.3e9 byte/s
Thread switching enables rest of warps on SM to execute while some are waiting on memory
![Page 29: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/29.jpg)
GPU Core Speed
• 16 MPs
• 1544 MHz shader speed
• 32 IPC
• 16 MPs x 1544 MHz x 32 IPC = 790e9 IPS
• If each thread reads 4 bytes, 3162e9 byte/s
• 3162e9 / 192.3e9 ≈ 16.4 cycles/read
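The bandwidth and core-speed figures from these two slides combine into a single ratio. A short check using only the slides' numbers (variable names are illustrative):

```python
# Memory side: clock x bus width x transfers per clock.
mem_clock_hz = 1002e6
bus_bytes = 384 // 8          # 384-bit controller = 48 bytes
pumps = 4                     # quad-pumped GDDR5
bandwidth = mem_clock_hz * bus_bytes * pumps

# Compute side: instructions issued per second across the device.
sms, shader_hz, ipc = 16, 1544e6, 32
instr_per_s = sms * shader_hz * ipc

# If every instruction were a 4-byte read, this is the bandwidth it would demand.
demand = instr_per_s * 4
ratio = demand / bandwidth

print(f"bandwidth   = {bandwidth / 1e9:.1f} GB/s")
print(f"issue rate  = {instr_per_s / 1e9:.1f}e9 instr/s")
print(f"cycles/read = {ratio:.1f}")
```

The ratio is the budget of instructions per thread between global memory operations at which the memory controller is exactly saturated.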
![Page 30: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/30.jpg)
GPU Memory Bandwidth
• On GTX 580, you need to do a fully coalesced memory op every 16 cycles to saturate the controller
• The high arithmetic throughput is there to enable this!
• Latency hiding averages this out – may either stream data in or load in all data, do all ops, write all data – still get max throughput if ratio is met
![Page 31: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/31.jpg)
GPU Memory Bandwidth
• GPU has expansive fields of ALUs to allow fast read-execute-write turnaround and see max performance in real problems
• CPU has 1/10th memory bandwidth because it doesn’t have the arithmetic performance to do work on data even if it had more bandwidth
![Page 32: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/32.jpg)
Worst Idea for Sparse Matrix
• Consider a CSR encoded MxN sparse matrix
• Launch M threads (one per row)
• Each thread reads an interval from the Row array
• Each thread dynamically loops over all non-zero values in the row, reads from the Col array, and uses that to index into the dense vector texture
• Accumulate products and write
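For reference, the one-thread-per-row scheme looks like this when written out sequentially. This is a CPU model in which each outer iteration stands in for one GPU thread; the 3x4 matrix and vector are made-up test data:

```python
def spmv_row_per_thread(row, col, val, vec):
    """CSR y = A*x where each outer-loop iteration plays the role of one thread."""
    out = []
    for r in range(len(row) - 1):               # one "thread" per matrix row
        total = 0.0
        for k in range(row[r], row[r + 1]):     # trip count depends on the row
            total += val[k] * vec[col[k]]       # gather from the dense vector
        out.append(total)
    return out

# Hypothetical matrix:
#   [10  0  0 20]
#   [ 0 30  0  0]
#   [ 0  0 40 50]
row = [0, 2, 3, 5]
col = [0, 3, 1, 2, 3]
val = [10.0, 20.0, 30.0, 40.0, 50.0]
vec = [1.0, 2.0, 3.0, 4.0]

y = spmv_row_per_thread(row, col, val, vec)
print(y)   # [90.0, 60.0, 320.0]
```

The data-dependent inner loop is the part that maps badly onto a warp: neighbouring threads start at unrelated offsets and finish at different times.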
![Page 33: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/33.jpg)
Worst Idea for Sparse Matrix
• Consider memory transactions on GPU hardware
• Threads 0-31 (computing rows 0-31) run concurrently
• If average row density is 9 values, the warp reads over an interval of 32*9 = 288 values
• 288 values span 10 segments
• Memory throughput is 1/10th of peak
![Page 34: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/34.jpg)
Worst Idea for Sparse Matrix
• Threads 0-31 (computing rows 0-31) run concurrently
• All threads take as long as the thread with the most work
• If all threads in warp have rows with 5 values, except one thread with 40 values, all threads wait through 40 values
• Runs 1/8th efficiency!
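The cost of the uneven row can be estimated by comparing useful work with the lane-iterations the warp occupies. A sketch under the simplifying assumption that every loop iteration costs the same and the whole warp runs for as long as its longest row:

```python
def warp_efficiency(row_lengths):
    """Useful multiply-adds divided by lane-iterations the warp consumes."""
    longest = max(row_lengths)
    return sum(row_lengths) / (len(row_lengths) * longest)

lengths = [5] * 31 + [40]        # the slide's example: one heavy row in the warp
eff = warp_efficiency(lengths)
print(f"{eff:.3f}")
```

The result is slightly above the slide's 1/8 because the one long row keeps its own lane fully busy; the other 31 lanes each do 5 useful iterations out of 40.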
![Page 35: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/35.jpg)
Understand This
MPs are not coarse-grained parallel processors!
(They are exotic sequential processors)
![Page 36: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/36.jpg)
Sean’s Sparse Matrix Performance
• Utilize special encoding of sparse matrices to fully saturate the memory controller
• For double precision, each matrix element is 12 bytes, each vector element is 8 bytes
• How many dot-product components computed at 20bytes/component?
• On GTX 570 (peak bandwidth 141.5 GB/s) I see up to 197 GB/s throughput. Thanks texture cache! Incredible CG method speed
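The question on this slide has a direct answer: divide bandwidth by bytes per component. The sketch assumes the 12-byte matrix element is an 8-byte double plus a 4-byte column index, which the slide does not spell out:

```python
BYTES_PER_COMPONENT = 12 + 8     # matrix element + dense vector element

def components_per_second(bytes_per_second):
    """Dot-product components streamed per second at a given memory throughput."""
    return bytes_per_second / BYTES_PER_COMPONENT

at_peak = components_per_second(141.5e9)    # slide's quoted peak for GTX 570
observed = components_per_second(197e9)     # slide's best measured throughput

print(f"{at_peak / 1e9:.2f}e9 components/s at quoted peak")
print(f"{observed / 1e9:.2f}e9 components/s observed")
print(f"{2 * observed / 1e9:.1f} GFlop/s (one multiply and one add each)")
```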
![Page 37: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/37.jpg)
Scan
The GPGPU algorithm for everything
![Page 38: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/38.jpg)
Scan
• GPU is fast YAY
• Your old code won’t work BOO
• GPU is really hard to program
• GPU is fairly easy to optimize
• Throw away algorithms book
• One algorithm to rule them all: scan
![Page 39: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/39.jpg)
Scan
• Not a callable routine – more like Batman’s utility belt
• At its simplest, adds up sequence of numbers:
– Inclusive scan transforms sequence:
(3, 2, 1, 4, 5) -> (3, 5, 6, 10, 15)
– Exclusive scan transforms sequence:
(3, 2, 1, 4, 5) -> (0, 3, 5, 6, 10)
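Both flavours of scan are one-liners sequentially, which makes a handy reference when checking a parallel version. A minimal sketch (function names are illustrative):

```python
from itertools import accumulate

def inclusive_scan(values):
    """Element i is the sum of values[0..i]."""
    return list(accumulate(values))

def exclusive_scan(values):
    """Element i is the sum of values[0..i-1]; the first element is 0."""
    out, total = [], 0
    for v in values:
        out.append(total)
        total += v
    return out

print(inclusive_scan([3, 2, 1, 4, 5]))   # [3, 5, 6, 10, 15]
print(exclusive_scan([3, 2, 1, 4, 5]))   # [0, 3, 5, 6, 10]
```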
![Page 40: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/40.jpg)
// Basic 32-element scan in cs_5_0 HLSL
#define NUM_THREADS 32
#define sync GroupMemoryBarrierWithGroupSync

// Read-only input is bound as an SRV (t0), matching the disassembly.
Buffer<uint> readbuf : register(t0);
RWBuffer<uint> writebuf : register(u0);

groupshared uint sharedArray[NUM_THREADS];

[numthreads(NUM_THREADS, 1, 1)]
void WarpScan(uint tid : SV_GroupIndex) {
    uint x = readbuf[tid];
    sharedArray[tid] = x;
    sync();

    [unroll]
    for(uint offset = 1; offset < NUM_THREADS; offset <<= 1) {
        // Wrap the index so threads near the left edge still read in bounds.
        uint left = (NUM_THREADS - 1) & (tid - offset);
        uint y = sharedArray[left];
        sync();
        // Only threads with a real left neighbor at this distance accumulate.
        if(offset <= tid) x += y;
        sharedArray[tid] = x;
        sync();
    }
    writebuf[tid] = x;
}
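The shader's loop can be followed on the CPU by treating each barrier as a point where every thread has finished the previous phase. The sketch below is a lock-step model, not a translation of the compiled shader; it assumes the input length is a power of two, as NUM_THREADS is:

```python
from itertools import accumulate

def warp_scan(values):
    """Lock-step model of the loop: all threads read, then all threads write."""
    n = len(values)                       # must be a power of two
    shared = list(values)
    offset = 1
    while offset < n:
        # Phase 1: every thread reads its left neighbour (index wraps like the & mask).
        reads = [shared[(tid - offset) & (n - 1)] for tid in range(n)]
        # Phase 2: threads at least `offset` from the left edge accumulate and store.
        shared = [shared[tid] + reads[tid] if offset <= tid else shared[tid]
                  for tid in range(n)]
        offset <<= 1
    return shared

data = list(range(1, 33))
print(warp_scan(data) == list(accumulate(data)))
```

Each pass doubles the span every element has summed, so 32 elements need log2(32) = 5 passes, but every pass touches all 32 lanes. That is the work inefficiency the talk mentions later.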
![Page 41: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/41.jpg)
cs_5_0
dcl_globalFlags refactoringAllowed
dcl_resource_buffer (uint,uint,uint,uint) t0
dcl_uav_typed_buffer (uint,uint,uint,uint) u0
dcl_input vThreadIDInGroupFlattened
dcl_temps 3
dcl_tgsm_structured g0, 4, 32
dcl_thread_group 32, 1, 1
ld_indexable(buffer)(uint,uint,uint,uint) r0.x, vThreadIDInGroupFlattened.xxxx, t0.xyzw
store_structured g0.x, vThreadIDInGroupFlattened.x, l(0), r0.x
sync_g_t
iadd r1.xyzw, vThreadIDInGroupFlattened.xxxx, l(-1, -2, -4, -8)
and r1.xyzw, r1.xyzw, l(31, 31, 31, 31)
ld_structured r0.y, r1.x, l(0), g0.xxxx
iadd r0.y, r0.x, r0.y
sync_g_t
uge r2.xyzw, vThreadIDInGroupFlattened.xxxx, l(1, 2, 4, 8)
movc r0.x, r2.x, r0.y, r0.x
store_structured g0.x, vThreadIDInGroupFlattened.x, l(0), r0.x
sync_g_t
ld_structured r0.y, r1.y, l(0), g0.xxxx
iadd r0.y, r0.x, r0.y
movc r0.x, r2.y, r0.y, r0.x
(cont)
![Page 42: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/42.jpg)
sync_g_t
store_structured g0.x, vThreadIDInGroupFlattened.x, l(0), r0.x
sync_g_t
ld_structured r0.y, r1.z, l(0), g0.xxxx
iadd r0.y, r0.x, r0.y
movc r0.x, r2.z, r0.y, r0.x
sync_g_t
store_structured g0.x, vThreadIDInGroupFlattened.x, l(0), r0.x
sync_g_t
ld_structured r0.y, r1.w, l(0), g0.xxxx
iadd r0.y, r0.x, r0.y
movc r0.x, r2.w, r0.y, r0.x
sync_g_t
store_structured g0.x, vThreadIDInGroupFlattened.x, l(0), r0.x
sync_g_t
iadd r0.y, vThreadIDInGroupFlattened.x, l(-16)
and r0.y, r0.y, l(31)
ld_structured r0.y, r0.y, l(0), g0.xxxx
iadd r0.y, r0.x, r0.y
uge r0.z, vThreadIDInGroupFlattened.x, l(16)
movc r0.x, r0.z, r0.y, r0.x
store_uav_typed u0.xyzw, vThreadIDInGroupFlattened.xxxx, r0.xxxx
ret
// Approximately 38 instruction slots used
![Page 43: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/43.jpg)
Scan
• Parallel scan is inefficient for adding numbers, yet critical for idiomatic GPU programming
• Complex predicates allow many problems to be solved
• Essential for load balancing jagged problems across simple threads
• A well-used scan broadcasts information-dense values to all threads in the warp or block
![Page 44: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/44.jpg)
Segmented Scan
• Sum from left-to-right within segments
• Same as above code but with a modified predicate:
[unroll]
for(uint offset = 1; offset < NUM_THREADS; offset <<= 1) {
    uint left = (NUM_THREADS - 1) & (tid - offset);
    uint y = sharedArray[left];
    sync();
    if(offset <= delta) x += y;
    sharedArray[tid] = x;
    sync();
}
• delta is distance from tid to start of segment
![Page 45: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/45.jpg)
Segmented Scan
• Values of the same color are in the same segment (segment boundaries are marked with | here):
(2 1 2 | 0 3 4 5 | 1 | 2 1 | 0 4 5 2 1)
• Segmented scan performs a complete scan within each segment:
(2 3 5 | 0 3 7 12 | 1 | 2 3 | 0 4 9 11 12)
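The `offset <= delta` predicate can be exercised on this slide's numbers. The slide's color coding does not survive in text, so the segment lengths below (3, 4, 1, 2, 5) are an assumption inferred from the scanned result; the helper names are made up:

```python
def starts_from_lengths(lengths):
    """For every element, the index at which its segment begins."""
    starts, begin = [], 0
    for n in lengths:
        starts.extend([begin] * n)
        begin += n
    return starts

def segmented_warp_scan(values, seg_start):
    """Lock-step model of the parallel loop with the segmented predicate."""
    n = len(values)
    x = list(values)
    delta = [tid - seg_start[tid] for tid in range(n)]   # distance to segment start
    offset = 1
    while offset < n:
        reads = [x[tid - offset] if tid >= offset else 0 for tid in range(n)]
        # A thread accumulates only while its partner is still inside its segment.
        x = [x[tid] + reads[tid] if offset <= delta[tid] else x[tid]
             for tid in range(n)]
        offset <<= 1
    return x

values = [2, 1, 2, 0, 3, 4, 5, 1, 2, 1, 0, 4, 5, 2, 1]
seg_start = starts_from_lengths([3, 4, 1, 2, 5])   # inferred segmentation
result = segmented_warp_scan(values, seg_start)
print(result)
```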
![Page 46: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/46.jpg)
Sparse Matrix
• Sparse matrix * vector is a very obvious segmented scan problem
• There is one segment per matrix row; each segment’s size is the number of non-zero elements in that row
• Each element is the product of a non-zero element and its corresponding component from the dense vector
![Page 47: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/47.jpg)
Sparse Matrix
• Consider data in CSR (Compressed Sparse Row) format:
– Row: (0 3 5 6 9 14)
– Col: (5 4 3 1 2 3 4 6 3 1 2 1 5 4)
• Index vec from col to get vector values for the matrix values, and multiply into the matrix values:
– Mat*vec: (x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14)
![Page 48: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/48.jpg)
• Run a segmented scan – the last value in each segment is the dot product of a sparse matrix row and the dense vector
(x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14)
scans to
x1 x1+x2 x1+x2+x3
x4 x4+x5
x6
x7 x7+x8 x7+x8+x9
x10 x10+x11 x10+x11+x12 x10+x11+x12+x13 x10+x11+x12+x13+x14
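Putting the last two slides together: multiply, run a segmented scan, then read the last element of each segment. The sketch uses the slide's Row and Col arrays; the matrix values and the 7-element dense vector are made up, and the segmented scan is done sequentially for clarity:

```python
def segmented_scan(products, row):
    """Running sum that restarts at every row boundary listed in `row`."""
    out = list(products)
    starts = set(row[:-1])
    for k in range(1, len(out)):
        if k not in starts:
            out[k] += out[k - 1]
    return out

def spmv_by_scan(row, col, val, vec):
    products = [v * vec[c] for v, c in zip(val, col)]      # the x1..x14 of the slide
    scanned = segmented_scan(products, row)
    # The last element of each segment is that row's dot product (0 for empty rows).
    return [scanned[row[r + 1] - 1] if row[r + 1] > row[r] else 0.0
            for r in range(len(row) - 1)]

row = [0, 3, 5, 6, 9, 14]
col = [5, 4, 3, 1, 2, 3, 4, 6, 3, 1, 2, 1, 5, 4]
val = [float(k + 1) for k in range(14)]    # hypothetical matrix values 1..14
vec = [float(j) for j in range(7)]         # hypothetical dense vector 0..6

y = spmv_by_scan(row, col, val, vec)
print(y)   # [22.0, 14.0, 18.0, 103.0, 165.0]
```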
![Page 49: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/49.jpg)
Sparse Matrix Challenges
• Sparse matrix * vector should be bandwidth limited – each component requires only a multiply-and-add.
• To achieve peak bandwidth, we need to issue a coalesced read every 16 cycles
• DP is nerfed on Geforce series – it runs at only 1/4 the speed of the same die on a Quadro/Tesla part, so extra efficiency during reduction is essential
• Parallel scan has poor work efficiency
• Matrix data not in a convenient order for processing multiple values per thread
![Page 50: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/50.jpg)
Idiomatic GPU Programming
GPU architecture guides software design
![Page 51: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/51.jpg)
Fermi SM Programming Environment
• 16 MPs per device
• Up to 1536 threads in flight
• 32768 32bit registers
– For 100% SM occupancy, only 20 regs per thread
– More regs means lower occupancy
• Up to 8 blocks per SM
• Max block size 1024 (DX11 requirement)
– 256 threads/block may give 100% occupancy on all architectures
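The register budget translates into occupancy by simple division. The sketch below is a naive model that considers registers only and ignores allocation granularity, block-count limits, and shared memory, so it should be read as an upper bound rather than what the hardware will actually schedule:

```python
REGISTERS_PER_SM = 32768
MAX_THREADS_PER_SM = 1536

def occupancy(regs_per_thread):
    """Fraction of the SM's thread slots that fit within the register file."""
    resident = min(MAX_THREADS_PER_SM, REGISTERS_PER_SM // regs_per_thread)
    return resident / MAX_THREADS_PER_SM

for regs in (20, 32, 63):
    print(f"{regs} regs/thread -> {occupancy(regs):.0%}")
```

Plain division would allow 21 registers per thread at full occupancy; the slide's figure of 20 is slightly lower, presumably because registers are allocated in coarser units than this sketch models.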
![Page 52: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/52.jpg)
Fermi SM Programming Environment
• 48KB shared memory
• Shared memory supported by 32 32-bit banks
– For 100% SM occupancy, 32bytes shared memory per thread
• Each thread in warp must access different bank of shared memory to avoid bank conflicts
• N-way conflict takes N cycles to resolve
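Bank conflicts depend only on which 32-bit words a warp touches. A sketch of the counting rule, assuming word `w` lives in bank `w % 32` and that threads reading the very same word are served by a broadcast rather than serialized (an assumption about Fermi that the slides do not state):

```python
NUM_BANKS = 32
WARP_SIZE = 32

def conflict_degree(word_indices):
    """Largest number of distinct words that fall into a single bank."""
    per_bank = {}
    for w in set(word_indices):            # identical words count once (broadcast)
        per_bank[w % NUM_BANKS] = per_bank.get(w % NUM_BANKS, 0) + 1
    return max(per_bank.values())

for stride in (1, 2, 32, 33):
    words = [stride * tid for tid in range(WARP_SIZE)]
    print(f"stride {stride:2d}: {conflict_degree(words)}-way")

# Shared memory budget per thread at full occupancy, from the slide's figures.
bytes_per_thread = 48 * 1024 // 1536
print(bytes_per_thread, "bytes per thread")
```

A stride of 33 words is conflict-free, which is why padding a 32-wide shared array by one element is a common trick.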
![Page 53: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/53.jpg)
Fermi SM Programming Environment
• Shared memory is primary mechanism for inter-thread communication
• Intra-warp communication requires only volatile shared mem pointer
• Inter-warp communication requires __syncthreads() call
– __syncthreads() flushes pipeline
• Inter-block communication requires global memory access and __threadfence or new kernel launch
![Page 54: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/54.jpg)
Parallel hierarchy
• Prioritize computation:
– Thread sequential (90%)
  • Fast sequential algorithms
  • Runs at high occupancy
  • Compute information-dense values and store in shared memory
– Intra-warp communication (8%)
  • Slower parallel algorithms
  • Fast shared mem communication with volatile pointer
  • Runs at high occupancy – no pipeline issues
![Page 55: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/55.jpg)
Parallel hierarchy (2)
• Prioritize computation (2):
– Inter-warp communication (1.9%)
  • Slower parallel algorithms
  • Sync between shared mem access requires pipeline flush
  • May include divergent branches (such as in multiscan)
– Inter-block communication (0.1%)
  • If all blocks are running concurrently, __threadfence can sync, otherwise new kernel launch is required
  • Must share data through global memory
![Page 56: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/56.jpg)
A simple model
• Thread sequential work is ‘vertical’
• Load multiple values per thread and process from top to bottom
• Thread parallel work is ‘horizontal’
• Combine information-dense values from vertical stage from left to right with scan
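The vertical-then-horizontal split can be shown with a plain (unsegmented) scan. In this CPU sketch each "thread" owns a contiguous chunk; the function name and the requirement that the input length be an exact multiple of `values_per_thread` are assumptions made for brevity:

```python
from itertools import accumulate

def two_level_scan(values, values_per_thread):
    num_threads = len(values) // values_per_thread   # assumes an exact multiple

    # Vertical: each thread runs a cheap sequential scan over its own values.
    local = []
    for t in range(num_threads):
        chunk = values[t * values_per_thread:(t + 1) * values_per_thread]
        local.append(list(accumulate(chunk)))

    # Horizontal: only one value per thread (its total) goes through the
    # cross-thread scan, here an exclusive scan of the totals.
    totals = [sums[-1] for sums in local]
    offsets = [0] + list(accumulate(totals))[:-1]

    # Each thread adds the sum of everything to its left.
    return [x + offsets[t] for t in range(num_threads) for x in local[t]]

data = list(range(1, 33))
print(two_level_scan(data, 4) == list(accumulate(data)))
```

With 4 values per thread, the expensive parallel step handles 8 numbers instead of 32, which is the point of doing most of the work vertically.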
![Page 57: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/57.jpg)
Sparse Matrix Optimization
Think like the machine
![Page 58: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/58.jpg)
Sparse Matrix on CUDA
• www.github.com/seanbaxter/sparse/
• My open source SpMxV library
• Performance 1.6-4.3x faster than CUSPARSE
• Full pre-conditioned CG solver in a few weeks
• Super fast radix sort included in next update
• Hits 197 GB/s throughput, usually over 190GB/s for DP
![Page 59: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/59.jpg)
Sparse Matrix on CUDA
• Dense vector stored in 1D texture
– GPU’s 768KB texture cache pushes sparse throughput over theoretical peak
• Texture cache critical in graphics and is also available in GPGPU
• Only bilinear filters in CUDA
• All sampler states available in D3D11 (it’s better in many ways)
![Page 60: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/60.jpg)
Sparse Matrix on CUDA
Texture cache misses cause of poor performance
pwtk.mtx (wind tunnel)
Height = 217,918
nz = 11,634,424
nz / h = 53.389
Max throughput = 196 GB/s
Matrix is dense and well banded
![Page 61: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/61.jpg)
pdb1HYS.mtx (Protein)
height = 36,417
nz = 4,344,765
nz / h = 119.306
Max throughput = 197 GB/s
Row density brings high throughput
scircuit.mtx (Circuit)
height = 170,998
nz = 958,936
nz / h = 5.608
Max throughput = 105 GB/s
High matrix bandwidth and low density cause texture cache misses, impairing performance
![Page 62: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/62.jpg)
Reformat the Matrix
• Uses special matrix encoding to accelerate scans by baking offsets and flags into unused top bits of col indices
• SpMxV is performed thousands of times to solve CG problem – reformatting is slow, but is only done once, and doubles multiplication throughput
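Baking flags into the spare high bits of a 32-bit column index is ordinary bit packing. The layout below (bit 30 for FIRST, bit 31 for LAST, low 30 bits for the index) is purely hypothetical; the talk does not say which bits the library actually uses:

```python
# Hypothetical bit layout, chosen only for illustration.
FLAG_FIRST = 1 << 30
FLAG_LAST = 1 << 31
INDEX_MASK = (1 << 30) - 1

def pack(col, first, last):
    """Combine a column index and two flags into one 32-bit word."""
    if not 0 <= col <= INDEX_MASK:
        raise ValueError("column index collides with the flag bits")
    return col | (FLAG_FIRST if first else 0) | (FLAG_LAST if last else 0)

def unpack(word):
    """Recover (column index, first flag, last flag) from a packed word."""
    return word & INDEX_MASK, bool(word & FLAG_FIRST), bool(word & FLAG_LAST)

word = pack(217917, first=True, last=False)
print(hex(word), unpack(word))
```

Because the flags ride along with data the kernel must load anyway, they cost no extra memory traffic; matrices with fewer than 2^30 columns lose nothing.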
![Page 63: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/63.jpg)
Strided Order vs Thread Order
Warps making multiple coalesced reads receive data in strided order
WARP_SIZE = 8, VALUES_PER_THREAD = 4
Threads don’t hold sequential values!
a0 a1 a2 a3 a4 a5 a6 b0
b1 b2 c0 c1 d0 d1 d2 d3
d4 d5 e0 e1 e2 e3 e4 f0
f1 f2 g0 g1 g2 g3 g4 g5
![Page 64: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/64.jpg)
Transposed Format
Transpose each group’s data so that coalesced reads put values in thread order:
a0 a4 b1 d0 d4 e2 f1 g2
a1 a5 b2 d1 d5 e3 f2 g3
a2 a6 c0 d2 e0 e4 g0 g4
a3 b0 c1 d3 e1 f0 g1 g5
With data in thread order we can perform sequential scan within threads, then parallel scan between them
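The transposition is an index remapping done once, when the matrix is reformatted. A sketch using the slide's 8-wide warp and 4 values per thread; the function names are illustrative:

```python
WARP_SIZE = 8
VALUES_PER_THREAD = 4

def to_transposed(stream):
    """Store value j of thread t at position j * WARP_SIZE + t."""
    out = [None] * len(stream)
    for t in range(WARP_SIZE):
        for j in range(VALUES_PER_THREAD):
            out[j * WARP_SIZE + t] = stream[t * VALUES_PER_THREAD + j]
    return out

def coalesced_load(memory, tid):
    """What thread `tid` receives from VALUES_PER_THREAD warp-wide coalesced reads."""
    return [memory[j * WARP_SIZE + tid] for j in range(VALUES_PER_THREAD)]

stream = ("a0 a1 a2 a3 a4 a5 a6 b0 b1 b2 c0 c1 d0 d1 d2 d3 "
          "d4 d5 e0 e1 e2 e3 e4 f0 f1 f2 g0 g1 g2 g3 g4 g5").split()

memory = to_transposed(stream)
print(" ".join(memory[:WARP_SIZE]))        # first coalesced read of the warp
print(coalesced_load(memory, 1))           # thread 1 ends up with a4 a5 a6 b0
```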
![Page 65: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/65.jpg)
Locating scan buckets
tid 0: a0 a1 a2 a3
tid 1: a4 a5 a6 b0
tid 2: b1 b2 c0 c1
tid 3: d0 d1 d2 d3
tid 4: d4 d5 e0 e1
tid 5: e2 e3 e4 f0
tid 6: f1 f2 g0 g1
tid 7: g2 g3 g4 g5
Locate the start and end of each bucket (matrix row) within each thread
Left underline indicates first value in a bucket in the thread
Right underline indicates last value in a bucket in the thread
Summing within threads with sequential scan is fast, provided each thread doesn’t need to calculate matrix geometry
![Page 66: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/66.jpg)
Encoding scan buckets
tid 0: TF FF FF FT scanOffset = 0
tid 1: TF FF FT TT scanOffset = 1
tid 2: TF FT TF TT scanOffset = 3
tid 3: TF FF FF FT scanOffset = 5
tid 4: TF FT TF FT scanOffset = 6
tid 5: TF FF FT TT scanOffset = 8
tid 6: TF FT TF FT scanOffset = 10
tid 7: TF FF FF FT scanOffset = 12
Encode first and last value bits into column indices.
scanOffset is the position in shared memory in which to store the first “last” value for each thread. These are dot product fragments.
![Page 67: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/67.jpg)
Sequential Scan Results
tid 0: a0 a0+a1 a0+a1+a2 a0+a1+a2+a3 s[0] = a0+a1+a2+a3
tid 1: a4 a4+a5 a4+a5+a6 b0 s[1] = a4+a5+a6 s[2] = b0
tid 2: b1 b1+b2 c0 c0+c1 s[3] = b1+b2 s[4] = c0+c1
tid 3: d0 d0+d1 d0+d1+d2 d0+d1+d2+d3 s[5] = d0+d1+d2+d3
tid 4: d4 d4+d5 e0 e0+e1 s[6] = d4+d5 s[7] = e0+e1
tid 5: e2 e2+e3 e2+e3+e4 f0 s[8] = e2+e3+e4 s[9] = f0
tid 6: f1 f1+f2 g0 g0+g1 s[10] = f1+f2 s[11] = g0+g1
tid 7: g2 g2+g3 g2+g3+g4 g2+g3+g4+g5 s[12] = g2+g3+g4+g5
• Stream in data and compute products
• Use sequential segmented scan (i.e. just add the current value to the previous total)
• Write to shared memory at sharedOffset and increment sharedOffset if LAST flag is set
• Clear preceding total before adding if FIRST flag is set
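The four steps above can be sketched as a host-side Python simulation. This is an illustration, not the talk's kernel: the helper names are hypothetical, every product is set to 1.0 so that each fragment equals the number of values it covers, and the `if` statements stand in for what the slides describe as branch-free flag handling on the GPU.

```python
WARP_SIZE = 8
VALUES_PER_THREAD = 4

def thread_flags(rows):
    """FIRST/LAST flags per value, with bucket boundaries clipped to each thread."""
    first, last = [], []
    for i, row in enumerate(rows):
        j = i % VALUES_PER_THREAD
        first.append(j == 0 or rows[i - 1] != row)
        last.append(j == VALUES_PER_THREAD - 1 or rows[i + 1] != row)
    return first, last

def scan_offsets_from(last):
    """Exclusive prefix sum of the number of LAST flags per thread."""
    offsets, total = [], 0
    for tid in range(WARP_SIZE):
        offsets.append(total)
        total += sum(last[tid * VALUES_PER_THREAD:(tid + 1) * VALUES_PER_THREAD])
    return offsets

def sequential_scan(products, first, last, scan_offsets):
    shared = [0.0] * (2 * WARP_SIZE)
    for tid in range(WARP_SIZE):             # on the GPU these run in lockstep
        total = 0.0
        shared_offset = scan_offsets[tid]
        for j in range(VALUES_PER_THREAD):
            i = tid * VALUES_PER_THREAD + j
            if first[i]:
                total = 0.0                  # clear preceding total
            total += products[i]             # add current value
            if last[i]:
                shared[shared_offset] = total
                shared_offset += 1
    return shared

rows = list("a" * 7 + "b" * 3 + "c" * 2 + "d" * 6 + "e" * 5 + "f" * 3 + "g" * 6)
products = [1.0] * len(rows)                 # stand-in for value * x[col]
first, last = thread_flags(rows)
shared = sequential_scan(products, first, last, scan_offsets_from(last))
print(shared[:13])
```

The 13 fragments line up with s[0] through s[12] in the table: for example s[1] covers a4, a5 and a6, so it comes out as 3.0.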
![Page 68: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/68.jpg)
Parallel Scan
• A parallel segmented scan merges partial dot products
• There are at most 2 * WARP_SIZE partial dot products, so each thread handles 2 elements in the parallel scan: tid and WARP_SIZE + tid
• Bake delta offsets into unused bits of two of the column indices (recall slide 44)
![Page 69: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/69.jpg)
Parallel Scan
For convenience, let:
A0 = a0+a1+a2+a3 A1 = a4+a5+a6 B0 = b0
B1 = b1+b2 C0 = c0+c1 D0 = d0+d1+d2+d3
D1 = d4+d5 E0 = e0+e1 E1 = e2+e3+e4
F0 = f0 F1 = f1+f2 G0 = g0+g1
G1 = g2+g3+g4+g5
2 * WARP_SIZE sharedArrays = [ A0 A1 B0 B1 C0 D0 D1 E0 E1 F0 F1 G0 G1 XX XX XX ]
tid 0: A0 E1 deltaX = 0 deltaY = 1
tid 1: A1 F0 deltaX = 1 deltaY = 0
tid 2: B0 F1 deltaX = 0 deltaY = 1
tid 3: B1 G0 deltaX = 1 deltaY = 0
tid 4: C0 G1 deltaX = 0 deltaY = 1
tid 5: D0 XX deltaX = 0 deltaY = 0
tid 6: D1 XX deltaX = 1 deltaY = 0
tid 7: E0 XX deltaX = 0 deltaY = 0
![Page 70: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/70.jpg)
Parallel Scan
• After parallel scan, the shared array holds:
s[0] = A0
s[1] = A0+A1
s[2] = B0
s[3] = B0+B1
s[4] = C0
s[5] = D0
s[6] = D0+D1
s[7] = E0
s[8] = E0+E1
s[9] = F0
s[10] = F0+F1
s[11] = G0
s[12] = G0+G1
s[13] = XX
s[14] = XX
s[15] = XX
• The completed dot products are at indices 1, 3, 4, 6, 8, 10, and 12
• Bake these indices into the unused high bits of column indices
• Each thread writes up to 1 value to global memory
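One way to realise the parallel step is a Hillis-Steele style segmented scan driven by the delta offsets. The Python sketch below is an assumption-laden illustration, not the talk's kernel: it reads each delta as the distance from a fragment back to the first fragment of its row (0 or 1 in this example), and the function names are hypothetical.

```python
WARP_SIZE = 8

def deltas_from_rows(fragment_rows):
    """Distance from each fragment back to the first fragment of its row."""
    deltas, head = [], 0
    for i, row in enumerate(fragment_rows):
        if i == 0 or row != fragment_rows[i - 1]:
            head = i
        deltas.append(i - head)
    return deltas

def parallel_segmented_scan(shared, deltas):
    s = list(shared)
    offset = 1
    while offset < len(s):
        prev = list(s)  # snapshot models all threads reading before any writes
        for i in range(len(s)):
            if deltas[i] >= offset:          # i - offset is still inside the row
                s[i] = prev[i] + prev[i - offset]
        offset *= 2
    return s

def completed_indices(fragment_rows):
    """Index of the last fragment of each row, where the full sum ends up."""
    return [i for i, row in enumerate(fragment_rows)
            if i + 1 == len(fragment_rows) or fragment_rows[i + 1] != row]

# Fragments A0 A1 B0 B1 C0 D0 D1 E0 E1 F0 F1 G0 G1, with every product set to 1.0.
fragment_rows = list("aabbcddeeffgg")
fragments = [4.0, 3.0, 1.0, 2.0, 2.0, 4.0, 2.0, 2.0, 3.0, 1.0, 2.0, 2.0, 4.0]
pad = 2 * WARP_SIZE - len(fragments)         # the XX slots
deltas = deltas_from_rows(fragment_rows) + [0] * pad
scanned = parallel_segmented_scan(fragments + [0.0] * pad, deltas)

done = completed_indices(fragment_rows)
print(done)                        # [1, 3, 4, 6, 8, 10, 12]
print([scanned[i] for i in done])  # [7.0, 3.0, 2.0, 6.0, 5.0, 3.0, 6.0]
```

Thread tid owns slots tid and WARP_SIZE + tid, so its deltaX and deltaY are deltas[tid] and deltas[WARP_SIZE + tid]. The values at the completed indices are the row lengths 7, 3, 2, 6, 5, 3, 6, as expected when every product is 1.0.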
![Page 71: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/71.jpg)
The final pass
• Blocks are constant size
• The last row in each block may spill over into the next block, causing two or more partial dot products to be written to global memory
• These are summed by a simple final pass
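A minimal sketch of such a final pass, in Python. The (row, value) pair representation of the partial dot products is a hypothetical layout chosen for illustration; the talk does not say how the partials are stored.

```python
def final_pass(partials, num_rows):
    """Sum partial dot products that belong to the same output row.

    partials: (row, value) pairs as written by the blocks, in any order.
    """
    y = [0.0] * num_rows
    for row, value in partials:
        y[row] += value
    return y

# Row 1 straddles a block boundary, so two blocks each wrote a partial for it.
partials = [(0, 1.5), (1, 2.0), (1, 0.5), (2, 4.0)]
print(final_pass(partials, 3))  # [1.5, 2.5, 4.0]
```

Only rows that cross a block boundary contribute more than one pair, so this pass touches far less data than the main kernel.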
![Page 72: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/72.jpg)
Performance Analysis
• Because each thread writes no more than 1 partial dot product, each group cannot process more than WARP_SIZE (32) unique rows
• VALUES_PER_THREAD should be raised toward the mean number of values per row to maximize fast sequential processing
![Page 73: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/73.jpg)
Performance Analysis
• All global memory loads are coalesced
• Most of the sum is computed with efficient sequential segmented scan as opposed to inefficient parallel scan
• There is no branching in the kernel when computing the sum
• Segmented scan flags and offsets are baked into col indices to accelerate inner loops
• Low register and shared mem usage delivers high SM occupancy
![Page 74: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/74.jpg)
Performance Analysis
• Exceptionally high memory bandwidth of GPUs makes them the obvious choice for iterative algorithms like CG
• Switched fabric in clusters results in high latency, slowing process and reducing effectiveness of parallelism
• With GTX 580, expect 12.5 billion double-precision dot product components per second
• A 50 million element DP matrix can be multiplied 250 times per second with a single card
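The two throughput figures are consistent if the 12.5 billion figure is read as a per-second rate, which the quick check below assumes. The bytes-per-component value is also an assumption and not from the talk: an 8-byte double plus a 4-byte column index, ignoring reads of the source vector.

```python
components_per_second = 12_500_000_000  # figure quoted for the GTX 580
matrix_nonzeros = 50_000_000            # the "50 million element" matrix

multiplies_per_second = components_per_second // matrix_nonzeros
print(multiplies_per_second)            # 250

# Assumption (not from the talk): 8-byte value plus 4-byte column index.
BYTES_PER_COMPONENT = 8 + 4
matrix_traffic_gb_per_s = components_per_second * BYTES_PER_COMPONENT / 1e9
print(matrix_traffic_gb_per_s)          # 150.0
```

Under those assumptions the kernel streams roughly 150 GB of matrix data per second, which is why memory bandwidth rather than arithmetic is the limiting resource here.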
![Page 75: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/75.jpg)
Optimization Wrap-up
• GPUs are not coarse-grained parallel systems
• Prioritize for coalesced r/w
• Favor sequential operations
• Use intra-warp reduction
• Maintain high SM occupancy to avoid pipeline stalls by reducing register usage and warp-divergent branches
• Keep optimizing until 1/16th of instructions are global memory ops
![Page 76: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/76.jpg)
GPU is Enabling Tech
• Low cost
• Fast
• Simple deployment
• Energy efficient
• Makes rendering simple
• Low latency encourages interactivity, bringing researchers closer to their data and models
![Page 77: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/77.jpg)
GPU is Enabling Tech
• No way for clusters with conventional nodes to compete
• Clusters with obsolete hardware (like Columbia) will get crushed by a single GPU in iterative processes
• On-die GPU integration (Fusion) will soon support GPGPU computation in system memory for low-cost mobile systems
![Page 78: Solving Common Problems with GPU: A Case Study with Sparse Matrices 17 Mar 2011 ARTS Talk](https://reader031.vdocuments.net/reader031/viewer/2022012917/568161eb550346895dd21cf7/html5/thumbnails/78.jpg)
Thank you
github.com/seanbaxter/sparse/