![Page 1: “collective” software primitiveson-demand.gputechconf.com/gtc/2014/...1. A design model for “collective” primitives How to make reusable SIMT software constructs 2. A library](https://reader033.vdocuments.net/reader033/viewer/2022050303/5f7e3bbeac598c36b0792893/html5/thumbnails/1.jpg)
CUB “collective” software primitives
Duane Merrill NVIDIA Research
What is CUB?
1. A design model for “collective” primitives
How to make reusable SIMT software constructs
2. A library of collective primitives
Block-reduce, block-sort, block-histogram, warp-scan, warp-reduce, etc.
3. A library of global primitives
Device-reduce, device-sort, device-scan, etc.
Constructed from collective primitives
Demonstrate performance, performance-portability
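As a rough illustration of the "global primitives" layer, the sketch below calls `cub::DeviceReduce::Sum` from host code. The helper name `device_sum` and the input data are assumptions made for this example, and CUDA error checking is omitted for brevity. It follows CUB's two-call convention: the first call with a null workspace only reports the temporary storage size, and the second call does the work.

```cuda
#include <cub/cub.cuh>
#include <cstdio>
#include <vector>

// Hypothetical helper: sums a host vector on the GPU with a device-wide primitive.
int device_sum(const std::vector<int>& h_in)
{
    int num_items = static_cast<int>(h_in.size());
    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, num_items * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in.data(), num_items * sizeof(int), cudaMemcpyHostToDevice);

    // First call: d_temp is null, so only temp_bytes is written
    void *d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, num_items);
    cudaMalloc(&d_temp, temp_bytes);

    // Second call: performs the reduction
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, num_items);

    int h_out = 0;
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_temp);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}

int main()
{
    std::vector<int> data(1000, 1);
    printf("sum = %d\n", device_sum(data));
    return 0;
}
```

The caller never sees thread blocks or shared memory here; the device-wide call is built from the block-wide and warp-wide collectives internally.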
Software reuse
Software reuse Abstraction & composability are fundamental
Reducing redundant programmer effort…
Saves time, energy, money
Reduces buggy software
Encapsulating complexity…
Empowers ordinary programmers
Insulates applications from underlying hardware
Software reuse empowers a durable programming model
“Collective” primitives
Parallel programming is hard…
Parallel decomposition and grain sizing
Synchronization
Deadlock, livelock, and data races
Plurality of state
Plurality of flow control (divergence, etc.)
Bookkeeping control structures
Memory access conflicts, coalescing, etc.
Occupancy constraints from SMEM, RF, etc.
Algorithm selection and instruction scheduling
Special hardware functionality, instructions, etc.
Cooperative parallel programming is hard…
CUDA today
threadblock threadblock threadblock
CUDA stub
application thread
…
CUDA today “Collective primitives” are the missing layer in today’s CUDA software stack
threadblock threadblock threadblock
BlockSort BlockSort BlockSort …
CUDA stub
application thread
What do these have in common?
Parallel sparse graph traversal
Parallel radix sort
Parallel BWT compression
Parallel SpMV
What do these have in common?
Block-wide prefix-scan
Queue management
Segmented reduction
Recurrence solver
Partitioning
Parallel sparse graph traversal
Parallel radix sort
Parallel BWT compression
Parallel SpMV
Examples of parallel scan data flow
16 threads contributing 4 items each
[Figure: two scan data-flow diagrams over threads t0 to t15]
Work-efficient Brent-Kung hybrid (~130 binary ops)
Depth-efficient Kogge-Stone hybrid (~170 binary ops)
CUDA today Kernel programming is complicated
threadblock threadblock threadblock
CUDA stub
application thread
…
Collective design & usage
Collective design criteria Components are easily nested & sequenced
threadblock
BlockSort
BlockRadixRank
BlockScan
WarpScan
BlockExchange
threadblock threadblock threadblock
BlockSort BlockSort BlockSort …
CUDA stub
application thread
Collective design criteria Flexible interfaces that scale (& tune) to different block sizes, thread-granularities, etc.
threadblock
BlockSort
BlockRadixRank
BlockScan
WarpScan
BlockExchange
thread block
thread block
thread block
BlockSort
BlockSort
BlockSort
…
CUDA stub
application thread
Collective interface design
- 3 parameter fields separated by concerns
- Reflected shared resource types
1. Static specialization interface
Params dictate storage layout and unrolling of algorithmic steps
Allows data placement in fast registers
2. Reflected shared resource types
Reflection enables compile-time allocation and tuning
3. Collective construction interface
Optional params concerning inter-thread communication
Orthogonal to function behavior
4. Operational function interface
Method-specific inputs/outputs
__global__ void ExampleKernel()
{
    // (1) Specialize cub::BlockScan for 128 threads
    typedef cub::BlockScan<int, 128> BlockScanT;
    // (2) Allocate temporary storage in shared memory
    __shared__ typename BlockScanT::TempStorage scan_storage;
    // Obtain a segment of 512 items blocked across 128 threads
    int items[4];
    ...
    // Compute block-wide prefix sum: (3) construct the collective, (4) call the operation
    BlockScanT(scan_storage).ExclusiveSum(items, items);
}
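The same four interface pieces can be sketched with a different collective, `cub::BlockReduce`. This is an illustrative example only: the kernel name `BlockSumKernel`, the helper `block_sums`, and the all-ones input are assumptions, and error checking is omitted.

```cuda
#include <cub/cub.cuh>
#include <cstdio>
#include <vector>

constexpr int kBlockThreads = 128;

__global__ void BlockSumKernel(const int *d_in, int *d_out)
{
    // 1. Static specialization: value type and block size fixed at compile time
    typedef cub::BlockReduce<int, kBlockThreads> BlockReduceT;
    // 2. Reflected shared resource type: the collective reports what it needs
    __shared__ typename BlockReduceT::TempStorage temp_storage;

    int thread_data = d_in[blockIdx.x * kBlockThreads + threadIdx.x];

    // 3. Collective construction (binds the shared storage), then
    // 4. the operational call; the aggregate is only valid in thread 0
    int aggregate = BlockReduceT(temp_storage).Sum(thread_data);

    if (threadIdx.x == 0) d_out[blockIdx.x] = aggregate;
}

// Hypothetical host helper: one partial sum per thread block over all-ones input.
std::vector<int> block_sums(int num_blocks)
{
    int n = num_blocks * kBlockThreads;
    std::vector<int> h_in(n, 1), h_out(num_blocks, 0);
    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, num_blocks * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    BlockSumKernel<<<num_blocks, kBlockThreads>>>(d_in, d_out);
    cudaMemcpy(h_out.data(), d_out, num_blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}

int main()
{
    std::vector<int> sums = block_sums(4);
    printf("block 0 sum = %d\n", sums[0]);
    return 0;
}
```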
Collective primitive design
Simplified block-wide prefix sum
template <typename T, int BLOCK_THREADS>
class BlockScan
{
public:
    // Type of shared memory needed by BlockScan
    typedef T TempStorage[BLOCK_THREADS];
    // Per-thread data (shared storage reference)
    TempStorage &temp_storage;
    // Constructor
    __device__ BlockScan(TempStorage &storage) : temp_storage(storage) {}
    // Prefix sum operation (each thread contributes its own data item)
    __device__ T Sum(T thread_data)
    {
        int tid = threadIdx.x;
        for (int i = 1; i < BLOCK_THREADS; i *= 2)
        {
            // Publish the current partial, then add the partial i slots to the left
            temp_storage[tid] = thread_data;
            __syncthreads();
            if (tid - i >= 0) thread_data += temp_storage[tid - i];
            __syncthreads();
        }
        return thread_data;
    }
};
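To show how a collective of this shape might be used from a kernel, here is a minimal self-contained sketch. `SimpleBlockScan`, `ScanKernel`, and `inclusive_scan_of_ones` are hypothetical names introduced for this example, not part of CUB. The class repeats the simplified design with `threadIdx.x` as the thread index, and the doubling-stride loop is a plain Kogge-Stone style inclusive scan, not CUB's tuned implementation.

```cuda
#include <cstdio>
#include <vector>

// Hypothetical simplified collective modeled on the slide (not the real cub::BlockScan)
template <typename T, int BLOCK_THREADS>
class SimpleBlockScan
{
public:
    typedef T TempStorage[BLOCK_THREADS];

    __device__ SimpleBlockScan(TempStorage &storage) : temp_storage(storage) {}

    // Inclusive prefix sum: one item per thread
    __device__ T InclusiveSum(T thread_data)
    {
        int tid = threadIdx.x;
        for (int i = 1; i < BLOCK_THREADS; i *= 2)
        {
            temp_storage[tid] = thread_data;   // publish current partial
            __syncthreads();
            if (tid >= i) thread_data += temp_storage[tid - i];
            __syncthreads();                   // protect storage before next round
        }
        return thread_data;
    }

private:
    TempStorage &temp_storage;
};

constexpr int kThreads = 128;

__global__ void ScanKernel(const int *d_in, int *d_out)
{
    typedef SimpleBlockScan<int, kThreads> BlockScanT;
    __shared__ typename BlockScanT::TempStorage temp_storage;
    d_out[threadIdx.x] = BlockScanT(temp_storage).InclusiveSum(d_in[threadIdx.x]);
}

// Hypothetical host helper: scans kThreads ones, so element i should be i + 1.
std::vector<int> inclusive_scan_of_ones()
{
    std::vector<int> h_in(kThreads, 1), h_out(kThreads, 0);
    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, kThreads * sizeof(int));
    cudaMalloc(&d_out, kThreads * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), kThreads * sizeof(int), cudaMemcpyHostToDevice);
    ScanKernel<<<1, kThreads>>>(d_in, d_out);
    cudaMemcpy(h_out.data(), d_out, kThreads * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}

int main()
{
    std::vector<int> out = inclusive_scan_of_ones();
    printf("out[0] = %d, out[%d] = %d\n", out[0], kThreads - 1, out[kThreads - 1]);
    return 0;
}
```

The kernel only names the type, allocates the reflected `TempStorage`, and calls one method; the synchronization and indexing stay inside the collective.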
Sequencing CUB primitives
Using cub::BlockLoad and cub::BlockScan
Steps: Specialize, Allocate, Load, Scan
__global__ void ExampleKernel(int *d_in)
{
    // Specialize for 128 threads owning 4 integers each
    typedef cub::BlockLoad<int*, 128, 4> BlockLoadT;
    typedef cub::BlockScan<int, 128> BlockScanT;
    // Allocate temporary storage in shared memory
    // (a union, because the load and the scan never use it at the same time)
    __shared__ union {
        typename BlockLoadT::TempStorage load;
        typename BlockScanT::TempStorage scan;
    } temp_storage;
    // Use coalesced (thread-striped) loads and a subsequent local exchange to
    // block a global segment of 512 items across 128 threads
    int items[4];
    BlockLoadT(temp_storage.load).Load(d_in, items);
    // Barrier before the shared storage is reused by the scan
    __syncthreads();
    // Compute block-wide prefix sum
    BlockScanT(temp_storage.scan).ExclusiveSum(items, items);
    ...
}
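A complete version of this load-then-scan sequence, extended with a `cub::BlockStore` step, might look like the sketch below. It assumes a recent CUB release, where the first template parameter of `cub::BlockLoad` is the item type (`int`) and not an iterator type as on the slide. The kernel and helper names are hypothetical, and error checking is omitted.

```cuda
#include <cub/cub.cuh>
#include <cstdio>
#include <vector>

constexpr int kBlockThreads = 128;
constexpr int kItemsPerThread = 4;
constexpr int kTileItems = kBlockThreads * kItemsPerThread;

__global__ void LoadScanStoreKernel(const int *d_in, int *d_out)
{
    // Specialize the three collectives for the same block shape
    typedef cub::BlockLoad<int, kBlockThreads, kItemsPerThread,
                           cub::BLOCK_LOAD_WARP_TRANSPOSE> BlockLoadT;
    typedef cub::BlockScan<int, kBlockThreads> BlockScanT;
    typedef cub::BlockStore<int, kBlockThreads, kItemsPerThread,
                            cub::BLOCK_STORE_WARP_TRANSPOSE> BlockStoreT;

    // The phases run one after another, so their storage can be overlaid
    __shared__ union {
        typename BlockLoadT::TempStorage load;
        typename BlockScanT::TempStorage scan;
        typename BlockStoreT::TempStorage store;
    } temp_storage;

    int items[kItemsPerThread];
    BlockLoadT(temp_storage.load).Load(d_in, items);
    __syncthreads();  // storage is about to be reused by the scan
    BlockScanT(temp_storage.scan).ExclusiveSum(items, items);
    __syncthreads();  // storage is about to be reused by the store
    BlockStoreT(temp_storage.store).Store(d_out, items);
}

// Hypothetical host helper: exclusive prefix sum over one tile of ones.
std::vector<int> exclusive_scan_tile_of_ones()
{
    std::vector<int> h_in(kTileItems, 1), h_out(kTileItems, -1);
    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, kTileItems * sizeof(int));
    cudaMalloc(&d_out, kTileItems * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), kTileItems * sizeof(int), cudaMemcpyHostToDevice);
    LoadScanStoreKernel<<<1, kBlockThreads>>>(d_in, d_out);
    cudaMemcpy(h_out.data(), d_out, kTileItems * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}

int main()
{
    std::vector<int> out = exclusive_scan_tile_of_ones();
    printf("out[0] = %d, out[%d] = %d\n", out[0], kTileItems - 1, out[kTileItems - 1]);
    return 0;
}
```

With all-ones input, the exclusive prefix sum at position i is i, which makes the result easy to check by eye.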
Nested composition of CUB primitives cub::BlockScan
cub::BlockScan
cub::WarpScan
Nested composition of CUB primitives cub::BlockRadixSort
cub::BlockRadixSort
cub::BlockRadixRank
cub::BlockScan
cub::WarpScan
cub::BlockExchange
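From the caller's side, the nesting listed above stays hidden behind a single collective. The sketch below sorts one tile of keys with `cub::BlockRadixSort`. The kernel name, the helper `sort_descending_tile`, and the descending test input are assumptions for illustration, and error checking is omitted.

```cuda
#include <cub/cub.cuh>
#include <algorithm>
#include <cstdio>
#include <vector>

constexpr int kBlockThreads = 128;
constexpr int kItemsPerThread = 4;
constexpr int kTileItems = kBlockThreads * kItemsPerThread;

__global__ void BlockSortKernel(int *d_keys)
{
    typedef cub::BlockRadixSort<int, kBlockThreads, kItemsPerThread> BlockRadixSortT;
    __shared__ typename BlockRadixSortT::TempStorage temp_storage;

    // Blocked arrangement: thread t owns keys [4t, 4t + 3]
    int keys[kItemsPerThread];
    int base = threadIdx.x * kItemsPerThread;
    for (int i = 0; i < kItemsPerThread; ++i) keys[i] = d_keys[base + i];

    // One call; ranking, scanning, and exchanging happen inside the collective
    BlockRadixSortT(temp_storage).Sort(keys);

    // Sorted keys come back in the same blocked arrangement
    for (int i = 0; i < kItemsPerThread; ++i) d_keys[base + i] = keys[i];
}

// Hypothetical host helper: sorts the keys 511, 510, ..., 0 within one block.
std::vector<int> sort_descending_tile()
{
    std::vector<int> h_keys(kTileItems);
    for (int i = 0; i < kTileItems; ++i) h_keys[i] = kTileItems - 1 - i;
    int *d_keys = nullptr;
    cudaMalloc(&d_keys, kTileItems * sizeof(int));
    cudaMemcpy(d_keys, h_keys.data(), kTileItems * sizeof(int), cudaMemcpyHostToDevice);
    BlockSortKernel<<<1, kBlockThreads>>>(d_keys);
    cudaMemcpy(h_keys.data(), d_keys, kTileItems * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_keys);
    return h_keys;
}

int main()
{
    std::vector<int> keys = sort_descending_tile();
    printf("sorted: %s\n", std::is_sorted(keys.begin(), keys.end()) ? "yes" : "no");
    return 0;
}
```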
cub::BlockHistogram
Nested composition of CUB primitives cub::BlockHistogram (specialized for BLOCK_HISTO_SORT algorithm)
cub::BlockRadixSort
cub::BlockRadixRank
cub::BlockScan
cub::WarpScan
cub::BlockExchange
cub::BlockDiscontinuity
Block-wide and warp-wide CUB primitives cub::BlockDiscontinuity
cub::BlockExchange
cub::BlockLoad & cub::BlockStore
cub::BlockRadixSort
cub::WarpReduce & cub::BlockReduce
cub::WarpScan & cub::BlockScan
cub::BlockHistogram
… and more at the CUB project on GitHub
http://nvlabs.github.com/cub
Tuning with flexible collectives
Example: radix sorting throughput (initial GT200 effort ~2011)
[Bar chart: throughput in millions of 32-bit keys/s (axis 0 to 2000) for NVIDIA GTX 580 [1], NVIDIA GTX 480 [1], NVIDIA Tesla C2050 [1], NVIDIA GTX 280 [1], NVIDIA 9800 GTX+ [1], Intel MIC Knights Ferry [4], Intel Core i7 Nehalem 3.2 GHz [2], AMD Radeon HD 6970 [3]]
[1] Merrill. Back40 GPU Primitives (2012)
[2] Satish et al. Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort (2010)
[3] T. Harada and L. Howes. Introduction to GPU Radix Sort (2011)
[4] Satish et al. Fast Sort on CPUs, GPUs, and Intel MIC Architectures. Intel Labs, 2010.
Radix sorting throughput (current)
[Bar chart: throughput in millions of 32-bit keys/s (axis 0 to 2000) for NVIDIA GTX 580 [1], NVIDIA GTX 480 [1], NVIDIA Tesla C2050 [1], NVIDIA GTX 280 [1], NVIDIA 9800 GTX+ [1], Intel MIC Knights Ferry [4], Intel Core i7 Nehalem 3.2 GHz [2], AMD Radeon HD 6970 [3]]
[1] Merrill. Back40, CUB GPU Primitives (2013)
[2] Satish et al. Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort (2010)
[3] T. Harada and L. Howes. Introduction to GPU Radix Sort (2011)
[4] Satish et al. Fast Sort on CPUs, GPUs, and Intel MIC Architectures. Intel Labs, 2010.
Fine-tuning primitives Tiled prefix sum
/**
 * Simple CUDA kernel for computing tiled partial sums
 */
template <int BLOCK_THREADS, int ITEMS_PER_THREAD, cub::BlockLoadAlgorithm LOAD_ALGO, cub::BlockScanAlgorithm SCAN_ALGO>
__global__ void ScanTilesKernel(int *d_in, int *d_out)
{
    // Specialize collective types for problem context
    typedef cub::BlockLoad<int*, BLOCK_THREADS, ITEMS_PER_THREAD, LOAD_ALGO> BlockLoadT;
    typedef cub::BlockScan<int, BLOCK_THREADS, SCAN_ALGO> BlockScanT;
    // Allocate on-chip temporary storage (shared by the load and scan phases)
    __shared__ union {
        typename BlockLoadT::TempStorage load;
        typename BlockScanT::TempStorage scan;
    } temp_storage;
    // Load this block's tile of data, ITEMS_PER_THREAD items per thread
    int thread_data[ITEMS_PER_THREAD];
    int offset = blockIdx.x * (BLOCK_THREADS * ITEMS_PER_THREAD);
    BlockLoadT(temp_storage.load).Load(d_in + offset, thread_data);
    // Barrier before the shared storage is reused by the scan
    __syncthreads();
    // Compute the block-wide prefix sum
    BlockScanT(temp_storage.scan).ExclusiveSum(thread_data, thread_data);
    …
}
[Figure: three panels]
Data is striped across threads for memory accesses
Data is blocked across threads for scanning
Scan data flow tiled from warpscans
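One way to picture the tuning knobs is to instantiate the same kernel under two configurations and confirm that only the performance characteristics change, not the result. The sketch below assumes a recent CUB release (item type as the first `cub::BlockLoad` parameter) and uses an inclusive sum. The two configurations, the blocked write-back loop, and the helper `run_config` are illustrative choices, not recommended tunings, and error checking is omitted.

```cuda
#include <cub/cub.cuh>
#include <cstdio>
#include <vector>

template <int BLOCK_THREADS, int ITEMS_PER_THREAD,
          cub::BlockLoadAlgorithm LOAD_ALGO, cub::BlockScanAlgorithm SCAN_ALGO>
__global__ void ScanTilesKernel(const int *d_in, int *d_out)
{
    typedef cub::BlockLoad<int, BLOCK_THREADS, ITEMS_PER_THREAD, LOAD_ALGO> BlockLoadT;
    typedef cub::BlockScan<int, BLOCK_THREADS, SCAN_ALGO> BlockScanT;

    __shared__ union {
        typename BlockLoadT::TempStorage load;
        typename BlockScanT::TempStorage scan;
    } temp_storage;

    int thread_data[ITEMS_PER_THREAD];
    int offset = blockIdx.x * (BLOCK_THREADS * ITEMS_PER_THREAD);
    BlockLoadT(temp_storage.load).Load(d_in + offset, thread_data);
    __syncthreads();
    BlockScanT(temp_storage.scan).InclusiveSum(thread_data, thread_data);

    // Simple blocked write-back of each thread's items
    for (int i = 0; i < ITEMS_PER_THREAD; ++i)
        d_out[offset + threadIdx.x * ITEMS_PER_THREAD + i] = thread_data[i];
}

// Hypothetical host helper; assumes h_in.size() is a whole number of tiles.
template <int BLOCK_THREADS, int ITEMS_PER_THREAD,
          cub::BlockLoadAlgorithm LOAD_ALGO, cub::BlockScanAlgorithm SCAN_ALGO>
std::vector<int> run_config(const std::vector<int> &h_in)
{
    const int n = static_cast<int>(h_in.size());
    const int num_tiles = n / (BLOCK_THREADS * ITEMS_PER_THREAD);
    std::vector<int> h_out(n, 0);
    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    ScanTilesKernel<BLOCK_THREADS, ITEMS_PER_THREAD, LOAD_ALGO, SCAN_ALGO>
        <<<num_tiles, BLOCK_THREADS>>>(d_in, d_out);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}

int main()
{
    std::vector<int> in(2048, 1);  // four tiles of 512 items under both configurations
    std::vector<int> a =
        run_config<128, 4, cub::BLOCK_LOAD_DIRECT, cub::BLOCK_SCAN_RAKING>(in);
    std::vector<int> b =
        run_config<64, 8, cub::BLOCK_LOAD_WARP_TRANSPOSE, cub::BLOCK_SCAN_WARP_SCANS>(in);
    printf("configurations agree: %s\n", a == b ? "yes" : "no");
    return 0;
}
```

Both configurations use 512-item tiles, so their per-tile prefix sums are directly comparable; a tuning sweep would time each instantiation and keep the fastest.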
CUB: device-wide performance-portability
vs. Thrust and NPP across the last three major NVIDIA arch families (Tesla, Fermi, Kepler)
[Bar charts; values below are listed for Tesla C1060, Tesla C2050, Tesla K20C, in that order]
Global radix sort (billions of 32-bit keys/sec): CUB 0.50, 1.05, 1.40; Thrust v1.7.1 0.51, 0.71, 0.66
Global reduction (billions of 32-bit items/sec): CUB 22, 33, 45; Thrust v1.7.1 19, 30, 37
Global prefix scan (billions of 32-bit items/sec): CUB 8, 14, 21; Thrust v1.7.1 4, 6, 6
Global histogram (billions of 8-bit items/sec): CUB 2.7, 16.2, 19.3; NPP 0, 2, 2
Global partition-if (billions of 32-bit inputs/sec): CUB 4.2, 8.6, 16.4; Thrust v1.7.1 1.7, 2.2, 2.4
Summary
Summary: benefits of using CUB primitives
Simplicity of composition
Kernels are simply sequences of primitives (e.g., BlockLoad -> BlockSort -> BlockReduceByKey)
High performance
CUB uses the best known algorithms, abstractions, strategies, and techniques
Performance portability
CUB is specialized for the target hardware (e.g., memory conflict rules, special instructions, etc.)
Simplicity of tuning
CUB adapts to various grain sizes (threads per block, items per thread, etc.)
CUB provides alternative algorithms
Robustness and durability
CUB supports arbitrary data types and block sizes
Questions? Please visit the CUB project on GitHub http://nvlabs.github.com/cub Duane Merrill ([email protected])
[Figure: reduce-then-scan across processors p0, p1, p2, …, pP-1 with inputs x0, x1, x2, … and outputs y0, y1, y2, …. Each processor runs a reduce step; the partial results form prefix0:0, prefix0:1, prefix0:2, …, prefix0:P-2; each processor then runs a scan step.]
[Figure: scan with adaptive look-back across processors p0, p1, p2, …, pP-1 with inputs x0, x1, x2, … and outputs y0, y1, y2, …. Each processor runs a reduce step, records its aggregate and incl-prefix, performs an adaptive look-back, and then runs a scan step.]
Per-partition descriptors (columns in the order recovered from the figure):
Partition: 1, 2, 0, …, P-1
Status flag: A, P, P, …, X
Aggregate: 256, 256, 256, …, -
Inclusive prefix: -, 768, 256, …, -