Scan Primitives for GPU Computing
University of California Davis, *NVIDIA Corporation
Shubho Sengupta, Mark Harris*, Yao Zhang, John Owens

Motivation
• Raw compute power and bandwidth of GPUs are increasing rapidly
• Programmable unified shader cores
• Ability to program outside the graphics framework
• However, efficient data-parallel primitives and algorithms are lacking

Motivation
• Current efficient algorithms either have streaming access
• 1:1 relationship between input and output elements
Input:  3 1 7 0 4 1 6 3
Output: 4 2 8 1 5 2 7 4

Motivation
• Or have small “neighborhood” access
• N:1 relationship between input and output elements, where N is a small constant
Input:  3 1 7 0 4 1 6 3
Output: 4 8 7 4 5 7 9 3

Motivation
• However, interesting problems require more general access patterns
• Changing one element affects everybody
• Stream Compaction
Input:  3 0 7 0 4 1 0 3
Output: 3 7 4 1 3
(0 = null)

Motivation
• Split
• Needed for Sort
Input:  T F F T F F T F
Output: F F F F F T T T

Motivation
• Common scenarios in parallel computing
• Variable output per thread
• Threads want to perform a split – radix sort, building trees
• “What came before/after me?”
• “Where do I start writing my data?”
• Scan answers these questions

System Overview
Low Level Primitives: Scan and variants
Higher Level Primitives: Enumerate, Distribute, …
Algorithms: Sort, Sparse matrix operations, …
Libraries and Abstractions for data parallel programming

Scan
• Each element is the sum of all the elements to its left (Exclusive)
• Each element is the sum of all the elements to its left and itself (Inclusive)
Input:     3 1 7 0 4 1 6 3
Exclusive: 0 3 4 11 11 15 16 22
Inclusive: 3 4 11 11 15 16 22 25
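The two definitions can be checked with a short serial reference. This is a minimal sketch in Python, not the GPU implementation; the function names are chosen here for illustration.

```python
def exclusive_scan(values):
    """Each output is the sum of the inputs strictly to its left."""
    out = []
    running = 0
    for v in values:
        out.append(running)   # write the sum seen so far, then add this element
        running += v
    return out


def inclusive_scan(values):
    """Each output is the sum of the inputs to its left plus itself."""
    out = []
    running = 0
    for v in values:
        running += v          # add this element first, then write
        out.append(running)
    return out


data = [3, 1, 7, 0, 4, 1, 6, 3]
print(exclusive_scan(data))  # [0, 3, 4, 11, 11, 15, 16, 22]
print(inclusive_scan(data))  # [3, 4, 11, 11, 15, 16, 22, 25]
```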

Scan – the past
• First proposed in APL (1962)
• Used as a data parallel primitive in the Connection Machine (1990)
• Guy Blelloch used scan as a primitive for various parallel algorithms (1990)

Scan – the present
• First GPU implementation by Daniel Horn (2004), O(n log n)
• Subsequent GPU implementations by Hensley (2005) O(n log n), Sengupta (2006) O(n), Greß (2006) O(n) 2D
• NVIDIA CUDA implementation by Mark Harris (2007), O(n), space efficient

Scan – the implementation
• O(n) algorithm – same work complexity as the serial version
• Space efficient – needs O(n) storage
• Has two stages – reduce and down-sweep

Scan - Reduce
Input:  3 1 7 0 4 1 6 3
Step 1: 3 4 7 7 4 5 6 9
Step 2: 3 4 7 11 4 5 6 14
Step 3: 3 4 7 11 4 5 6 25
• log n steps
• Work halves each step
• O(n) work
• In place, space efficient

Scan - Down Sweep
Reduce result:         3 4 7 11 4 5 6 25
Last element set to 0: 3 4 7 11 4 5 6 0
Step 1:                3 4 7 0 4 5 6 11
Step 2:                3 0 7 4 4 11 6 16
Step 3:                0 3 4 11 11 15 16 22
• log n steps
• Work doubles each step
• O(n) work
• In place, space efficient
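The reduce and down-sweep stages can be sketched serially as follows. This is an illustrative Python version that assumes a power-of-two input length. Each inner `for` loop stands in for work that the GPU threads would perform in parallel. The function name is an assumption made for this example.

```python
def reduce_downsweep_exclusive_scan(values):
    """Exclusive sum scan in two in-place stages: reduce, then down-sweep."""
    a = list(values)
    n = len(a)
    assert n > 0 and n & (n - 1) == 0, "this sketch assumes a power-of-two length"

    # Reduce: log n steps, the number of additions halves each step.
    stride = 1
    while stride < n:
        for i in range(2 * stride - 1, n, 2 * stride):
            a[i] += a[i - stride]
        stride *= 2

    # Down-sweep: clear the last element, then log n steps in which the
    # number of active positions doubles each step.
    a[n - 1] = 0
    stride = n // 2
    while stride >= 1:
        for i in range(2 * stride - 1, n, 2 * stride):
            left = a[i - stride]
            a[i - stride] = a[i]   # left child receives the parent's value
            a[i] += left           # right child receives parent + old left
        stride //= 2
    return a


print(reduce_downsweep_exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))
# [0, 3, 4, 11, 11, 15, 16, 22]
```

Tracing the example input reproduces the intermediate arrays listed on the Reduce and Down Sweep slides.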

Segmented Scan
• Input: 3 1 | 7 0 4 | 1 6 3
• Scan within each segment in parallel
• Output: 0 3 | 0 7 7 | 0 1 7
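A serial reference for the example above, as a minimal Python sketch. It assumes segments are marked by head flags (1 at the first element of each segment); the function name is illustrative.

```python
def segmented_exclusive_scan(values, head_flags):
    """Exclusive sum scan that restarts at every segment head."""
    out = []
    running = 0
    for v, is_head in zip(values, head_flags):
        if is_head:
            running = 0       # a new segment starts: forget the previous sum
        out.append(running)
        running += v
    return out


values = [3, 1, 7, 0, 4, 1, 6, 3]
heads  = [1, 0, 1, 0, 0, 1, 0, 0]   # segments: 3 1 | 7 0 4 | 1 6 3
print(segmented_exclusive_scan(values, heads))  # [0, 3, 0, 7, 7, 0, 1, 7]
```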

Segmented Scan
• Introduced by Schwartz (1980)
• Forms the basis for a wide variety of algorithms
• Quicksort, Sparse Matrix-Vector Multiply, Convex Hull

Segmented Scan - Challenges
• Representing segments
• Efficiently storing and propagating information about segments
• Scans over all segments should happen in parallel
• Overall work and space complexity should be O(n) regardless of the number of segments

Representing Segments
• Possible representations are
• Vector of segment lengths
• Vector of indices which are segment heads
• Vector of flags: 1 for segment head, 0 if not
• The first two approaches are hard to parallelize because they differ in size from the input
• We use the third as it is the same size as the input
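To make the three representations concrete, the sketch below converts the first two into the flag vector. It is a plain Python illustration with hypothetical helper names, and it uses the three-segment example 3 1 | 7 0 4 | 1 6 3.

```python
def flags_from_lengths(lengths):
    """Segment lengths -> one flag per element (1 marks a segment head)."""
    flags = []
    for length in lengths:
        if length > 0:
            flags.extend([1] + [0] * (length - 1))
    return flags


def flags_from_head_indices(head_indices, n):
    """Indices of segment heads -> one flag per element."""
    flags = [0] * n
    for i in head_indices:
        flags[i] = 1
    return flags


# All three describe the same segmentation of an 8-element input.
print(flags_from_lengths([2, 3, 3]))             # [1, 0, 1, 0, 0, 1, 0, 0]
print(flags_from_head_indices([0, 2, 5], n=8))   # [1, 0, 1, 0, 0, 1, 0, 0]
```

The lengths and index vectors have one entry per segment, while the flag vector has one entry per input element, so each thread can read the flag at its own position.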

Segmented Scan – Flag Storage
• Space-inefficient to store one flag in an integer
• Store one flag in a byte, striped across 32 words
• Reduces bank conflicts

Segmented Scan – implementation
• Similar to Scan
• O(n) space and work complexity
• Has two stages – reduce and down-sweep

Segmented Scan – implementation
• Unique to segmented scan
• Requires an additional flag per element for intermediate computation
• Additional flags get set in reduce step
• Additional book-keeping with flags in down-sweep
• These flags prevent data movement between segments
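One common way to see why flags stop data from crossing segment boundaries is to treat segmented scan as an ordinary scan over (flag, value) pairs with a modified operator. The sketch below is an assumption-labeled illustration of that formulation in Python, not the slide's tree-based kernel; `seg_op` is a name introduced here.

```python
from itertools import accumulate


def seg_op(left, right):
    """Combine two (flag, value) pairs.

    If the right operand starts a segment, the left value is discarded,
    so nothing flows across the boundary. The combined flag records that
    a segment head has been seen somewhere in the combined range.
    """
    left_flag, left_value = left
    right_flag, right_value = right
    value = right_value if right_flag else left_value + right_value
    return (left_flag | right_flag, value)


def segmented_inclusive_scan(values, head_flags):
    pairs = zip(head_flags, values)
    return [value for _, value in accumulate(pairs, seg_op)]


values = [3, 1, 7, 0, 4, 1, 6, 3]
heads  = [1, 0, 1, 0, 0, 1, 0, 0]
print(segmented_inclusive_scan(values, heads))  # [3, 4, 7, 7, 11, 1, 7, 10]
```

Because `seg_op` is associative, the pairs can be combined in a tree, which is what a reduce and down-sweep structure requires.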

Platform – NVIDIA CUDA and G80
• Threads grouped into blocks
• Threads in a block can cooperate through fast on-chip memory
• Hence the programmer must partition the problem into multiple blocks to use the fast memory
• Adds complexity but usually yields much faster code

Segmented Scan – Large Input

Segmented Scan – Advantages
• Operations in parallel over all the segments
• Irregular workload since segments can be of any length
• Can simulate divide-and-conquer recursion since additional segments can be generated

Primitives - Enumerate
• Input: a true/false vector
• Output: count of true values to the left of each element
• Useful in stream compaction
• Output for each true element is the address for that element in the compacted array
Input:  F F T F T T T F
Output: 0 0 0 1 1 2 3 3
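Enumerate is an exclusive scan over the flags read as 0/1, and stream compaction follows by scattering each kept element to its enumerated address. A minimal serial sketch in Python, with illustrative function names and the stream-compaction example from the Motivation slides (0 treated as null):

```python
def enumerate_true(flags):
    """For each position, count the true flags strictly to its left."""
    out = []
    count = 0
    for f in flags:
        out.append(count)
        if f:
            count += 1
    return out


def compact(values, keep):
    """Stream compaction: each kept element is written to its enumerate address."""
    address = enumerate_true(keep)
    out = [None] * sum(1 for k in keep if k)
    for v, k, a in zip(values, keep, address):
        if k:
            out[a] = v        # scatter; on a GPU every kept element writes in parallel
    return out


values = [3, 0, 7, 0, 4, 1, 0, 3]
keep = [v != 0 for v in values]
print(enumerate_true(keep))   # [0, 1, 1, 2, 2, 3, 4, 4]
print(compact(values, keep))  # [3, 7, 4, 1, 3]
```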

Primitives - Distribute
• Input: a vector with segments
• Output: the first element of a segment copied over all other elements
Input:  3 1 7 | 4 0 1 | 6 3
Output: 3 3 3 | 4 4 4 | 6 6

Primitives – Distribute
• Set all elements except the head elements to zero
• Do inclusive segmented scan
• Used in quicksort to distribute pivot
Heads only:                     3 0 0 | 4 0 0 | 6 0
After inclusive segmented scan: 3 3 3 | 4 4 4 | 6 6
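The two steps translate directly into code. The sketch below is a serial Python stand-in, with a simple loop in place of the parallel segmented scan; the function names are illustrative.

```python
def segmented_inclusive_scan(values, head_flags):
    """Inclusive sum scan that restarts at every segment head."""
    out = []
    running = 0
    for v, is_head in zip(values, head_flags):
        running = v if is_head else running + v
        out.append(running)
    return out


def distribute(values, head_flags):
    """Copy the first element of each segment over the whole segment."""
    # Step 1: keep only the head elements, zero everything else.
    heads_only = [v if is_head else 0 for v, is_head in zip(values, head_flags)]
    # Step 2: an inclusive segmented sum scan then carries each head value along.
    return segmented_inclusive_scan(heads_only, head_flags)


values = [3, 1, 7, 4, 0, 1, 6, 3]
heads  = [1, 0, 0, 1, 0, 0, 1, 0]   # segments: 3 1 7 | 4 0 1 | 6 3
print(distribute(values, heads))    # [3, 3, 3, 4, 4, 4, 6, 6]
```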

Primitives – Split and Segment
• Input: a vector with true/false elements, possibly segmented
• Output: stable split within each segment – falses on the left, trues on the right
Input:  3 1 7 0 | 4 1 6 3
Output: 3 0 | 1 7 | 1 6 | 4 3 (within each input segment: falses, then trues)

Primitives – Split and Segment
• Can be implemented with Enumerate
• One enumerate for the falses going left to right
• One enumerate for the trues going right to left
• Used in quicksort
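A sketch of the two-enumerate construction, shown for a single segment to keep it short (the segmented version would restart both counts at each segment head). This is illustrative Python with a hypothetical function name, not the GPU kernel.

```python
def split(values, flags):
    """Stable split: falses move left, trues move right, order preserved."""
    n = len(values)

    # Enumerate the falses going left to right: destination of each false.
    false_address = []
    count = 0
    for f in flags:
        false_address.append(count)
        if not f:
            count += 1

    # Enumerate the trues going right to left: trues to the right of each element.
    trues_to_right = [0] * n
    count = 0
    for i in range(n - 1, -1, -1):
        trues_to_right[i] = count
        if flags[i]:
            count += 1

    out = [None] * n
    for i in range(n):
        if flags[i]:
            out[n - 1 - trues_to_right[i]] = values[i]
        else:
            out[false_address[i]] = values[i]
    return out


values = [5, 3, 7, 4, 6, 8, 9, 3]
flags = [v > 5 for v in values]
print(split(values, flags))  # [5, 3, 4, 3, 7, 6, 8, 9]
```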

Applications – Quicksort
• Traditional algorithm is GPU-unfriendly
• Recursive
• Segments vary in length, unequal workload
• Primitives built on segmented scan solve both problems
• Allows operations on all segments in parallel
• Simulates recursion by generating new segments in each iteration

Applications – Quicksort
Input:             5 3 7 4 6 8 9 3
Distribute pivot:  5 5 5 5 5 5 5 5
Input > pivot:     F F T F T T T F
Split and segment: 5 3 4 3 | 7 6 8 9

Applications – Quicksort
Input:             5 3 4 3 | 7 6 8 9
Distribute pivot:  5 5 5 5 | 7 7 7 7
Input ≥ pivot:     T F F F | T F T T
Split and segment: 3 4 3 | 5 | 6 | 7 8 9

Applications – Quicksort
Input:             3 4 3 | 5 | 6 | 7 8 9
Distribute pivot:  3 3 3 | 5 | 6 | 7 7 7
Input > pivot:     F T F | F | F | F T T
Split and segment: 3 3 | 4 | 5 | 6 | 7 | 8 9
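The iterations above can be replayed with a small serial model. In this sketch the loop over segments stands in for the segmented primitives (distribute, compare, split and segment), which on the GPU act on all segments at once. The function names, the alternation flag, and the sortedness check used as the stopping condition are assumptions made for this example.

```python
def quicksort_step(values, heads, strict):
    """One iteration: distribute pivot, compare, split and segment."""
    n = len(values)
    starts = [i for i in range(n) if heads[i]] + [n]
    out, new_heads = [], []
    for start, end in zip(starts, starts[1:]):
        segment = values[start:end]
        pivot = segment[0]                      # distribute: first element of segment
        flags = [(x > pivot) if strict else (x >= pivot) for x in segment]
        falses = [x for x, f in zip(segment, flags) if not f]
        trues = [x for x, f in zip(segment, flags) if f]
        out += falses + trues                   # stable split within the segment
        segment_heads = [False] * len(segment)
        segment_heads[0] = True
        if falses and trues:
            segment_heads[len(falses)] = True   # new segment starts at the trues
        new_heads += segment_heads
    return out, new_heads


def segmented_quicksort(values):
    a = list(values)
    heads = [i == 0 for i in range(len(a))]
    strict = True                               # alternate > and >= each iteration
    while any(x > y for x, y in zip(a, a[1:])):
        a, heads = quicksort_step(a, heads, strict)
        strict = not strict
    return a


print(segmented_quicksort([5, 3, 7, 4, 6, 8, 9, 3]))  # [3, 3, 4, 5, 6, 7, 8, 9]
```

Alternating between > and ≥ matters: a segment whose pivot is its maximum does not split under >, but does under ≥ unless all of its elements are equal.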

Applications – Sparse M-V multiply
• Dense matrix operations are much faster on GPU than CPU
• However, sparse matrix operations on GPU are much slower
• Hard to implement on GPU
• Non-zero entries per row vary in number

Applications – Sparse M-V multiply
• Three different approaches
• Rows sorted by number of non-zero entries [Bolz]
• Matrix stored as diagonals, which are processed in sequence [Krüger]
• Rows computed in parallel, but runtime determined by longest row [Brook]

Applications – Sparse M-V multiply
[y0]   [a 0 b] [x0]
[y1] = [c d e] [x1]
[y2]   [0 0 f] [x2]
Non-zero elements: a b | c d e | f
Column index:      0 2 0 1 2 2
Row begin index:   0 2 5

Applications – Sparse M-V multiply
Non-zero elements:             a b | c d e | f
Column index:                  0 2 0 1 2 2
x (gathered by column index):  x0 x2 | x0 x1 x2 | x2
Element-wise products:         ax0 bx2 | cx0 dx1 ex2 | fx2
Backward inclusive segmented scan, then pick first element in segment:
ax0 + bx2 | cx0 + dx1 + ex2 | fx2
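The same pipeline, written as a serial sketch in Python. It assumes the three-array storage shown above (non-zero values, column index, row begin index) and that no row is empty. The function name and the numeric test values are made up for illustration.

```python
def spmv_segmented(nonzeros, col_index, row_begin, x):
    """y = A x with A stored as non-zero values, column indices, row begin indices."""
    n = len(nonzeros)

    # Gather x by column index and multiply element-wise.
    products = [nonzeros[k] * x[col_index[k]] for k in range(n)]

    # Each row is one segment; its head is the row's first non-zero.
    heads = [False] * n
    for begin in row_begin:
        heads[begin] = True      # assumes every row has at least one non-zero

    # Backward inclusive segmented scan: sums run right to left within a row,
    # so the row total ends up at the segment head.
    sums = [0] * n
    running = 0
    for k in range(n - 1, -1, -1):
        running += products[k]
        sums[k] = running
        if heads[k]:
            running = 0

    # Pick the first element of each segment.
    return [sums[begin] for begin in row_begin]


# a..f = 1..6 in the 3x3 example matrix, x = (1, 10, 100)
y = spmv_segmented([1, 2, 3, 4, 5, 6], [0, 2, 0, 1, 2, 2], [0, 2, 5], [1, 10, 100])
print(y)  # [201, 543, 600]
```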

Applications – Tridiagonal Solver
• Implemented Kass and Miller’s shallow water solver
• Water surface described as a 2D array of heights
• Global movement of data
• From one end to the other and back
• Suits the Reduce/Down-sweep structure of scan

Applications – Tridiagonal Solver
• Tridiagonal system of n rows solved in parallel
• Then for each of the m columns in parallel
• Read pattern is similar to but more complex than scan

Results - Scan
[Figure: bar chart, y-axis Time (Normalized). Bars: Forward, Backward, Forward, Backward, grouped under Scan and Segmented Scan. Bar annotations: 1.1x slower, 3x slower, 4.8x slower. Callouts: extra computation for sequential memory access; packing and unpacking flags, non-sequential I/O, saving state (shown twice).]

Results – Sparse M-V Multiply
• Input: “raefsky” matrix, 3242 x 3242, 294276 elements
• GPU (215 MFLOPS) half as fast as CPU “oski” (522 MFLOPS)
• Hard to do irregular computation
• Most time spent in backward segmented scan
![Page 42: Scan Primitives for GPU Computing](https://reader036.vdocuments.net/reader036/viewer/2022081504/56815164550346895dbf8e82/html5/thumbnails/42.jpg)
Results - SortResults - Sort
2x slower
13x slower
4x slower
Global Block GPU CPU
Radix Sort Quick Sort
Tim
e (
Norm
aliz
ed)
Slow Merge
Slow MergePacking/Unpacking Flags
Complex Kernel

Results – Tridiagonal solver
• 256 x 256 grid: 367 simulation steps per second
• Dominated by the overhead of mapping and unmapping vertex buffers
• 3x faster than a CPU cyclic reduction solver
• 12x faster when using shared memory

Improved Results Since Publication
• Twice as fast for all variants of scan and sparse matrix-vector multiply
• Scan
• More work per thread – 8 elements vs 2 before
• Segmented Scan
• No packing of flags
• Sequential memory access
• More optimizations possible

Contribution and Future Work
• Algorithm and implementation of segmented scan on GPU
• First implementation of quicksort on GPU
• Primitives appropriate for complex algorithms
• Global data movement, unbalanced workload, recursive
• Scan never occurs in serial computation
• Tiered approach, standard library and interfaces

Acknowledgments
• Jim Ahrens, Guy Blelloch, Jeff Inman, Pat McCormick
• David Luebke, Ian Buck
• Jeff Bolz
• Eric Lengyel
• Department of Energy
• National Science Foundation

Shallow Water Simulation