CUB “collective” software primitives
Duane Merrill NVIDIA Research
2
What is CUB?
1. A design model for “collective” primitives
How to make reusable SIMT software constructs
2. A library of collective primitives
Block-reduce, block-sort, block-histogram, warp-scan, warp-reduce, etc.
3. A library of global primitives
Device-reduce, device-sort, device-scan, etc.
Constructed from collective primitives
Demonstrate performance and performance-portability
3
Software reuse
4
Software reuse: abstraction & composability are fundamental
Reducing redundant programmer effort…
Saves time, energy, money
Reduces buggy software
Encapsulating complexity…
Empowers ordinary programmers
Insulates applications from underlying hardware
Software reuse empowers a durable programming model
6
“Collective” primitives
7
Parallel programming is hard…
8
Parallel decomposition and grain sizing
Synchronization
Deadlock, livelock, and data races
Plurality of state
Plurality of flow control (divergence, etc.)
Bookkeeping control structures
Memory access conflicts, coalescing, etc.
Occupancy constraints from SMEM, RF, etc.
Algorithm selection and instruction scheduling
Special hardware functionality, instructions, etc.
Cooperative parallel programming is hard…
10
CUDA today
[Stack diagram: application thread -> CUDA stub -> threadblocks]
11
CUDA today: “collective primitives” are the missing layer in today’s CUDA software stack
[Stack diagram: application thread -> CUDA stub -> threadblocks, each running a BlockSort collective]
12
What do these have in common?
[Figure: parallel sparse graph traversal, parallel radix sort, parallel BWT compression, parallel SpMV]
13
What do these have in common? Block-wide prefix-scan
Queue management
Segmented reduction
Recurrence solver
Partitioning
[Figure: the same four examples: parallel sparse graph traversal, parallel radix sort, parallel BWT compression, parallel SpMV]
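To see why block-wide prefix scan keeps reappearing across these workloads, here is a minimal host-side C++ sketch (plain CPU code, not CUB): an exclusive prefix sum over per-item "keep" flags yields scatter offsets, which is exactly the pattern behind partitioning, stream compaction, and queue management.

```cpp
#include <cassert>
#include <vector>

// Host-side sketch (not CUB code): an exclusive prefix sum over per-item
// "keep" flags turns into scatter offsets, the core of partitioning,
// stream compaction, and queue slot allocation.
std::vector<int> compact(const std::vector<int>& in, const std::vector<int>& keep)
{
    // Exclusive prefix sum of the flags: each kept item's output slot
    std::vector<int> offset(in.size());
    int running = 0;
    for (int i = 0; i < (int)in.size(); ++i) {
        offset[i] = running;      // sum of flags strictly before i
        running += keep[i];
    }
    // Scatter kept items to their compacted positions
    std::vector<int> out(running);
    for (int i = 0; i < (int)in.size(); ++i)
        if (keep[i]) out[offset[i]] = in[i];
    return out;
}
```

On a GPU, each thread computes its flag and the block-wide scan assigns slots; the sequential loops above stand in for that collective step.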
14
Examples of parallel scan data flow: 16 threads contributing 4 items each
[Figure: two scan data-flow networks over 64 items]
Work-efficient Brent-Kung hybrid (~130 binary ops)
Depth-efficient Kogge-Stone hybrid (~170 binary ops)
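The work/depth trade-off between the two data flows can be sketched sequentially. The following host-side C++ sketch (an illustration, not CUB's implementation) follows the work-efficient Brent-Kung/Blelloch upsweep-downsweep pattern, which spends roughly 2n binary ops at about twice the depth of the ~n*log2(n)-op Kogge-Stone flow.

```cpp
#include <cassert>
#include <vector>

// Host-side sketch of the work-efficient (Brent-Kung/Blelloch) data flow:
// an upsweep builds a reduction tree, then a downsweep distributes partial
// prefixes back down. Roughly 2n binary ops total, versus ~n*log2(n) for
// the depth-efficient Kogge-Stone flow. n must be a power of two here.
std::vector<int> brent_kung_exclusive_scan(std::vector<int> x)
{
    int n = (int)x.size();
    for (int d = 1; d < n; d *= 2)          // upsweep (reduce)
        for (int i = 2 * d - 1; i < n; i += 2 * d)
            x[i] += x[i - d];
    x[n - 1] = 0;                           // clear the root
    for (int d = n / 2; d >= 1; d /= 2)     // downsweep (distribute)
        for (int i = 2 * d - 1; i < n; i += 2 * d) {
            int t = x[i - d];
            x[i - d] = x[i];
            x[i] += t;
        }
    return x;                               // exclusive prefix sums
}
```

The hybrids in the figure mix both flows (e.g., Kogge-Stone within a warp, Brent-Kung across warps) to balance op count against depth.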
15
CUDA today: kernel programming is complicated
[Stack diagram: application thread -> CUDA stub -> threadblocks]
16
CUDA today: “collective primitives” are the missing layer in today’s CUDA software stack
[Stack diagram: application thread -> CUDA stub -> threadblocks, each running a BlockSort collective]
17
Collective design & usage
18
Collective design criteria: components are easily nested & sequenced
[Diagram: within a threadblock, BlockSort nests BlockRadixRank (which nests BlockScan, which nests WarpScan) and BlockExchange; BlockSort instances run per threadblock beneath the application thread and CUDA stub]
19
Collective design criteria: flexible interfaces that scale (& tune) to different block sizes, thread granularities, etc.
[Diagram: the same nested BlockSort composition instantiated across thread blocks of different sizes]
20
Collective interface design: three parameter fields separated by concerns, plus reflected shared resource types

1. Static specialization interface: params dictate storage layout and unrolling of algorithmic steps; allows data placement in fast registers
2. Reflected shared resource types: reflection enables compile-time allocation and tuning
3. Collective construction interface: optional params concerning inter-thread communication; orthogonal to function behavior
4. Operational function interface: method-specific inputs/outputs

__global__ void ExampleKernel()
{
    // Specialize cub::BlockScan for 128 threads
    typedef cub::BlockScan<int, 128> BlockScanT;

    // Allocate temporary storage in shared memory
    __shared__ typename BlockScanT::TempStorage scan_storage;

    // Obtain a segment of 512 items blocked across 128 threads
    int items[4];
    ...

    // Compute block-wide prefix sum
    BlockScanT(scan_storage).ExclusiveSum(items, items);
}
21
Collective primitive design: simplified block-wide prefix sum

template <typename T, int BLOCK_THREADS>
class BlockScan
{
    // Type of shared memory needed by BlockScan
    typedef T TempStorage[BLOCK_THREADS];

    // Per-thread data (shared storage reference)
    TempStorage &temp_storage;

    // Constructor
    BlockScan(TempStorage &storage) : temp_storage(storage) {}

    // Prefix sum operation (each thread contributes its own data item)
    T Sum(T thread_data)
    {
        for (int i = 1; i < BLOCK_THREADS; i *= 2)
        {
            temp_storage[tid] = thread_data;
            __syncthreads();
            if (tid - i >= 0)
                thread_data += temp_storage[tid - i];
            __syncthreads();
        }
        return thread_data;
    }
};
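The doubling scan above can be checked by simulating all lanes in lockstep on the host. This is a plain C++ sketch, not CUB code; the two inner loops stand in for the store/load phases separated by __syncthreads().

```cpp
#include <cassert>
#include <vector>

// Host-side lockstep simulation of the doubling scan: each round, every
// lane tid adds the value that lane tid - i held at the start of the
// round. After ceil(log2 n) rounds each lane holds the inclusive prefix
// sum of lanes 0..tid.
std::vector<int> simulate_block_sum(std::vector<int> thread_data)
{
    int n = (int)thread_data.size();
    std::vector<int> temp(n);
    for (int i = 1; i < n; i *= 2) {
        for (int tid = 0; tid < n; ++tid)    // "store, then __syncthreads()"
            temp[tid] = thread_data[tid];
        for (int tid = 0; tid < n; ++tid)    // "read neighbor, then sync"
            if (tid - i >= 0)
                thread_data[tid] += temp[tid - i];
    }
    return thread_data;
}
```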
22
Sequencing CUB primitives: using cub::BlockLoad and cub::BlockScan

__global__ void ExampleKernel(int *d_in)
{
    // Specialize for 128 threads owning 4 integers each
    typedef cub::BlockLoad<int*, 128, 4> BlockLoadT;
    typedef cub::BlockScan<int, 128> BlockScanT;

    // Allocate temporary storage in shared memory
    __shared__ union {
        typename BlockLoadT::TempStorage load;
        typename BlockScanT::TempStorage scan;
    } temp_storage;

    // Use coalesced (thread-striped) loads and a subsequent local exchange to
    // block a global segment of 512 items across 128 threads
    int items[4];
    BlockLoadT(temp_storage.load).Load(d_in, items);
    __syncthreads();

    // Compute block-wide prefix sum
    BlockScanT(temp_storage.scan).ExclusiveSum(items, items);
    ...
}
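What ExclusiveSum computes over a "blocked" arrangement can be written down as a host-side reference (illustrative C++, not CUB itself): thread t owns the contiguous items [t*4, t*4 + 4), and each output is the sum of all items that precede it in the flattened 512-item tile.

```cpp
#include <cassert>

// Host-side reference for an exclusive prefix sum over a "blocked"
// arrangement: thread t owns items [t*ITEMS, t*ITEMS + ITEMS), and each
// output is the sum of all predecessors in the flattened tile.
const int THREADS = 128, ITEMS = 4;

void exclusive_sum_blocked(const int in[THREADS][ITEMS], int out[THREADS][ITEMS])
{
    int running = 0;
    for (int t = 0; t < THREADS; ++t)        // threads in rank order
        for (int i = 0; i < ITEMS; ++i) {    // each thread's items in order
            out[t][i] = running;             // exclusive: predecessors only
            running += in[t][i];
        }
}
```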
23
Nested composition of CUB primitives: cub::BlockScan
cub::BlockScan
  cub::WarpScan
24
Nested composition of CUB primitives: cub::BlockRadixSort
cub::BlockRadixSort
  cub::BlockRadixRank
    cub::BlockScan
      cub::WarpScan
  cub::BlockExchange
25
Nested composition of CUB primitives: cub::BlockHistogram (specialized for the BLOCK_HISTO_SORT algorithm)
cub::BlockHistogram
  cub::BlockRadixSort
    cub::BlockRadixRank
      cub::BlockScan
        cub::WarpScan
    cub::BlockExchange
  cub::BlockDiscontinuity
26
Block-wide and warp-wide CUB primitives
cub::BlockDiscontinuity
cub::BlockExchange
cub::BlockLoad & cub::BlockStore
cub::BlockRadixSort
cub::WarpReduce & cub::BlockReduce
cub::WarpScan & cub::BlockScan
cub::BlockHistogram
… and more at the CUB project on GitHub
http://nvlabs.github.com/cub
27
Tuning with flexible collectives
28
Example: radix sorting throughput (initial GT200 effort ~2011)
[Bar chart: millions of 32-bit keys/s for NVIDIA GTX 580 [1], GTX 480 [1], Tesla C2050 [1], GTX 280 [1], 9800 GTX+ [1], Intel MIC Knight's Ferry [4], Intel Core i7 Nehalem 3.2 GHz [2], AMD Radeon HD 6970 [3]]
[1] Merrill. Back40 GPU Primitives (2012)
[2] Satish et al. Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort (2010)
[3] Harada and Howes. Introduction to GPU Radix Sort (2011)
[4] Satish et al. Fast Sort on CPUs, GPUs, and Intel MIC Architectures. Intel Labs (2010)
29
Radix sorting throughput (current)
[Bar chart: millions of 32-bit keys/s for NVIDIA GTX 580 [1], GTX 480 [1], Tesla C2050 [1], GTX 280 [1], 9800 GTX+ [1], Intel MIC Knight's Ferry [4], Intel Core i7 Nehalem 3.2 GHz [2], AMD Radeon HD 6970 [3]]
[1] Merrill. Back40, CUB GPU Primitives (2013)
[2] Satish et al. Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort (2010)
[3] Harada and Howes. Introduction to GPU Radix Sort (2011)
[4] Satish et al. Fast Sort on CPUs, GPUs, and Intel MIC Architectures. Intel Labs (2010)
30
Fine-tuning primitives: tiled prefix sum

/**
 * Simple CUDA kernel for computing tiled partial sums
 */
template <int BLOCK_THREADS, int ITEMS_PER_THREAD, LoadAlgorithm LOAD_ALGO, ScanAlgorithm SCAN_ALGO>
__global__ void ScanTilesKernel(int *d_in, int *d_out)
{
    // Specialize collective types for problem context
    typedef cub::BlockLoad<int*, BLOCK_THREADS, ITEMS_PER_THREAD, LOAD_ALGO> BlockLoadT;
    typedef cub::BlockScan<int, BLOCK_THREADS, SCAN_ALGO> BlockScanT;

    // Allocate on-chip temporary storage
    __shared__ union {
        typename BlockLoadT::TempStorage load;
        typename BlockScanT::TempStorage scan;
    } temp_storage;

    // Load data per thread
    int thread_data[ITEMS_PER_THREAD];
    int offset = blockIdx.x * (BLOCK_THREADS * ITEMS_PER_THREAD);
    BlockLoadT(temp_storage.load).Load(d_in + offset, thread_data);
    __syncthreads();

    // Compute the block-wide prefix sum
    BlockScanT(temp_storage.scan).Sum(thread_data);
    …
}
Data is striped across threads for memory accesses; data is blocked across threads for scanning.
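The striped-to-blocked exchange behind coalesced loading can be sketched on the host (illustrative C++ with tiny toy sizes, not CUB's BlockExchange): a striped load hands thread t the items at global ranks t, t + T, t + 2T, ..., and the exchange rearranges them so thread t owns a contiguous run for scanning.

```cpp
#include <cassert>

// Host-side sketch of a striped-to-blocked exchange: under striping,
// thread t's i-th item has global rank i*T + t; after the exchange,
// thread t owns the contiguous run [t*I, t*I + I).
const int T = 4, I = 2;   // 4 "threads", 2 items each (toy sizes)

void striped_to_blocked(const int striped[T][I], int blocked[T][I])
{
    for (int t = 0; t < T; ++t)
        for (int i = 0; i < I; ++i) {
            int rank = i * T + t;                 // global rank under striping
            blocked[rank / I][rank % I] = striped[t][i];
        }
}
```

On the GPU this round-trips through shared memory; the address arithmetic is the same.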
[Figure: scan data flow tiled from warpscans]
31
CUB: device-wide performance-portability vs. Thrust and NPP across the last three major NVIDIA arch families (Tesla, Fermi, Kepler)
Primitive (billions/sec)              Library          Tesla C1060   Tesla C2050   Tesla K20C
Global radix sort (32-bit keys)       CUB              0.50          1.05          1.40
                                      Thrust v1.7.1    0.51          0.71          0.66
Global reduction (32-bit items)       CUB              22            33            45
                                      Thrust v1.7.1    19            30            37
Global prefix scan (32-bit items)     CUB              8             14            21
                                      Thrust v1.7.1    4             6             6
Global histogram (8-bit items)        CUB              2.7           16.2          19.3
                                      NPP              0             2             2
Global partition-if (32-bit inputs)   CUB              4.2           8.6           16.4
                                      Thrust v1.7.1    1.7           2.2           2.4
32
Summary
33
Summary: benefits of using CUB primitives
Simplicity of composition
Kernels are simply sequences of primitives (e.g., BlockLoad -> BlockSort -> BlockReduceByKey)
High performance
CUB uses the best known algorithms, abstractions, strategies, and techniques
Performance portability
CUB is specialized for the target hardware (e.g., memory conflict rules, special instructions, etc.)
Simplicity of tuning
CUB adapts to various grain sizes (threads per block, items per thread, etc.)
CUB provides alternative algorithms
Robustness and durability
CUB supports arbitrary data types and block sizes
Questions?
Please visit the CUB project on GitHub: http://nvlabs.github.com/cub
Duane Merrill (dumerrill@nvidia.com)
[Figure: single-pass device-wide prefix scan over partitions p0 … pP-1, each reduced and then scanned; a partition obtains its exclusive prefix by adaptive look-back: it publishes a status flag (A = aggregate available, P = inclusive prefix available, X = not yet valid) together with its aggregate (e.g., 256) and inclusive prefix (e.g., 768), and successors accumulate predecessors' aggregates, stopping at the first available inclusive prefix]
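The look-back step in the figure can be sketched sequentially (illustrative C++; the real kernel runs this concurrently with memory fences, and an 'X' flag would mean spin-waiting, which is omitted here).

```cpp
#include <cassert>
#include <vector>

// Sequential sketch of adaptive look-back: each partition publishes a
// status flag ('A' = aggregate available, 'P' = inclusive prefix
// available), its aggregate, and eventually its inclusive prefix.
struct Descriptor { char flag; int aggregate; int incl_prefix; };

// Partition p's exclusive prefix: scan predecessors backwards, summing
// aggregates, and stop early at the first published inclusive prefix.
int exclusive_prefix(const std::vector<Descriptor>& d, int p)
{
    int sum = 0;
    for (int q = p - 1; q >= 0; --q) {
        if (d[q].flag == 'P')
            return sum + d[q].incl_prefix;   // covers all earlier partitions
        sum += d[q].aggregate;               // 'A': take aggregate, keep going
    }
    return sum;                              // walked back to partition 0
}
```

With per-partition aggregates of 256 and partition 1's inclusive prefix published as 512, partition 3's look-back accumulates partition 2's aggregate and stops at partition 1, yielding 768.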