CUB “collective” software primitives
Duane Merrill NVIDIA Research
2
What is CUB?
1. A design model for “collective” primitives
How to make reusable SIMT software constructs
2. A library of collective primitives
Block-reduce, block-sort, block-histogram, warp-scan, warp-reduce, etc.
3. A library of global primitives
Device-reduce, device-sort, device-scan, etc.
Constructed from collective primitives
Demonstrate performance and performance-portability
3
Software reuse
4
Software reuse: abstraction & composability are fundamental
Reducing redundant programmer effort…
Saves time, energy, money
Reduces buggy software
Encapsulating complexity…
Empowers ordinary programmers
Insulates applications from underlying hardware
Software reuse empowers a durable programming model
6
“Collective” primitives
7
Parallel programming is hard…
8
Parallel decomposition and grain sizing
Synchronization
Deadlock, livelock, and data races
Plurality of state
Plurality of flow control (divergence, etc.)
Bookkeeping control structures
Memory access conflicts, coalescing, etc.
Occupancy constraints from SMEM, RF, etc.
Algorithm selection and instruction scheduling
Special hardware functionality, instructions, etc.
Cooperative parallel programming is hard…
10
CUDA today
[Stack diagram: application thread -> CUDA stub -> threadblocks]
11
CUDA today: “collective primitives” are the missing layer in today’s CUDA software stack
[Stack diagram: application thread -> CUDA stub -> threadblocks, each running a BlockSort collective]
12
What do these have in common?
[Figure: parallel sparse graph traversal, parallel radix sort, parallel BWT compression, parallel SpMV]
13
What do these have in common? Block-wide prefix-scan
Queue management
Segmented reduction
Recurrence solver
Partitioning
[Figure: the same four examples: parallel sparse graph traversal, parallel radix sort, parallel BWT compression, parallel SpMV]
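To see why block-wide prefix scan keeps reappearing across these workloads, here is a minimal host-side C++ sketch (plain CPU code, not CUB): an exclusive prefix sum over per-item "keep" flags yields scatter offsets, which is exactly the pattern behind partitioning, stream compaction, and queue management.

```cpp
#include <cassert>
#include <vector>

// Host-side sketch (not CUB code): an exclusive prefix sum over per-item
// "keep" flags turns into scatter offsets, the core of partitioning,
// stream compaction, and queue slot allocation.
std::vector<int> compact(const std::vector<int>& in, const std::vector<int>& keep)
{
    // Exclusive prefix sum of the flags: each kept item's output slot
    std::vector<int> offset(in.size());
    int running = 0;
    for (int i = 0; i < (int)in.size(); ++i) {
        offset[i] = running;      // sum of flags strictly before i
        running += keep[i];
    }
    // Scatter kept items to their compacted positions
    std::vector<int> out(running);
    for (int i = 0; i < (int)in.size(); ++i)
        if (keep[i]) out[offset[i]] = in[i];
    return out;
}
```

On a GPU, each thread computes its flag and the block-wide scan assigns slots; the sequential loops above stand in for that collective step.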
14
Examples of parallel scan data flow: 16 threads contributing 4 items each
[Figure: two scan data-flow networks over 64 items]
Work-efficient Brent-Kung hybrid (~130 binary ops)
Depth-efficient Kogge-Stone hybrid (~170 binary ops)
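The work/depth trade-off between the two data flows can be sketched sequentially. The following host-side C++ sketch (an illustration, not CUB's implementation) follows the work-efficient Brent-Kung/Blelloch upsweep-downsweep pattern, which spends roughly 2n binary ops at about twice the depth of the ~n*log2(n)-op Kogge-Stone flow.

```cpp
#include <cassert>
#include <vector>

// Host-side sketch of the work-efficient (Brent-Kung/Blelloch) data flow:
// an upsweep builds a reduction tree, then a downsweep distributes partial
// prefixes back down. Roughly 2n binary ops total, versus ~n*log2(n) for
// the depth-efficient Kogge-Stone flow. n must be a power of two here.
std::vector<int> brent_kung_exclusive_scan(std::vector<int> x)
{
    int n = (int)x.size();
    for (int d = 1; d < n; d *= 2)          // upsweep (reduce)
        for (int i = 2 * d - 1; i < n; i += 2 * d)
            x[i] += x[i - d];
    x[n - 1] = 0;                           // clear the root
    for (int d = n / 2; d >= 1; d /= 2)     // downsweep (distribute)
        for (int i = 2 * d - 1; i < n; i += 2 * d) {
            int t = x[i - d];
            x[i - d] = x[i];
            x[i] += t;
        }
    return x;                               // exclusive prefix sums
}
```

The hybrids in the figure mix both flows (e.g., Kogge-Stone within a warp, Brent-Kung across warps) to balance op count against depth.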
15
CUDA today: kernel programming is complicated
[Stack diagram: application thread -> CUDA stub -> threadblocks]
16
CUDA today: “collective primitives” are the missing layer in today’s CUDA software stack
[Stack diagram: application thread -> CUDA stub -> threadblocks, each running a BlockSort collective]
17
Collective design & usage
18
Collective design criteria: components are easily nested & sequenced
[Diagram: within a threadblock, BlockSort nests BlockRadixRank (which nests BlockScan, which nests WarpScan) and BlockExchange; BlockSort instances run per threadblock beneath the application thread and CUDA stub]
19
Collective design criteria: flexible interfaces that scale (& tune) to different block sizes, thread granularities, etc.
[Diagram: the same nested BlockSort composition instantiated across thread blocks of different sizes]
20
Collective interface design: three parameter fields separated by concerns, plus reflected shared resource types

1. Static specialization interface: params dictate storage layout and unrolling of algorithmic steps; allows data placement in fast registers
2. Reflected shared resource types: reflection enables compile-time allocation and tuning
3. Collective construction interface: optional params concerning inter-thread communication; orthogonal to function behavior
4. Operational function interface: method-specific inputs/outputs

__global__ void ExampleKernel()
{
    // Specialize cub::BlockScan for 128 threads
    typedef cub::BlockScan<int, 128> BlockScanT;

    // Allocate temporary storage in shared memory
    __shared__ typename BlockScanT::TempStorage scan_storage;

    // Obtain a segment of 512 items blocked across 128 threads
    int items[4];
    ...

    // Compute block-wide prefix sum
    BlockScanT(scan_storage).ExclusiveSum(items, items);
}
21
Collective primitive design: simplified block-wide prefix sum

template <typename T, int BLOCK_THREADS>
class BlockScan
{
    // Type of shared memory needed by BlockScan
    typedef T TempStorage[BLOCK_THREADS];

    // Per-thread data (shared storage reference)
    TempStorage &temp_storage;

    // Constructor
    BlockScan(TempStorage &storage) : temp_storage(storage) {}

    // Prefix sum operation (each thread contributes its own data item)
    T Sum(T thread_data)
    {
        for (int i = 1; i < BLOCK_THREADS; i *= 2)
        {
            temp_storage[tid] = thread_data;
            __syncthreads();
            if (tid - i >= 0)
                thread_data += temp_storage[tid - i];
            __syncthreads();
        }
        return thread_data;
    }
};
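The doubling scan above can be checked by simulating all lanes in lockstep on the host. This is a plain C++ sketch, not CUB code; the two inner loops stand in for the store/load phases separated by __syncthreads().

```cpp
#include <cassert>
#include <vector>

// Host-side lockstep simulation of the doubling scan: each round, every
// lane tid adds the value that lane tid - i held at the start of the
// round. After ceil(log2 n) rounds each lane holds the inclusive prefix
// sum of lanes 0..tid.
std::vector<int> simulate_block_sum(std::vector<int> thread_data)
{
    int n = (int)thread_data.size();
    std::vector<int> temp(n);
    for (int i = 1; i < n; i *= 2) {
        for (int tid = 0; tid < n; ++tid)    // "store, then __syncthreads()"
            temp[tid] = thread_data[tid];
        for (int tid = 0; tid < n; ++tid)    // "read neighbor, then sync"
            if (tid - i >= 0)
                thread_data[tid] += temp[tid - i];
    }
    return thread_data;
}
```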
22
Sequencing CUB primitives: using cub::BlockLoad and cub::BlockScan

__global__ void ExampleKernel(int *d_in)
{
    // Specialize for 128 threads owning 4 integers each
    typedef cub::BlockLoad<int*, 128, 4> BlockLoadT;
    typedef cub::BlockScan<int, 128> BlockScanT;

    // Allocate temporary storage in shared memory
    __shared__ union {
        typename BlockLoadT::TempStorage load;
        typename BlockScanT::TempStorage scan;
    } temp_storage;

    // Use coalesced (thread-striped) loads and a subsequent local exchange to
    // block a global segment of 512 items across 128 threads
    int items[4];
    BlockLoadT(temp_storage.load).Load(d_in, items);
    __syncthreads();

    // Compute block-wide prefix sum
    BlockScanT(temp_storage.scan).ExclusiveSum(items, items);
    ...
}
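What ExclusiveSum computes over a "blocked" arrangement can be written down as a host-side reference (illustrative C++, not CUB itself): thread t owns the contiguous items [t*4, t*4 + 4), and each output is the sum of all items that precede it in the flattened 512-item tile.

```cpp
#include <cassert>

// Host-side reference for an exclusive prefix sum over a "blocked"
// arrangement: thread t owns items [t*ITEMS, t*ITEMS + ITEMS), and each
// output is the sum of all predecessors in the flattened tile.
const int THREADS = 128, ITEMS = 4;

void exclusive_sum_blocked(const int in[THREADS][ITEMS], int out[THREADS][ITEMS])
{
    int running = 0;
    for (int t = 0; t < THREADS; ++t)        // threads in rank order
        for (int i = 0; i < ITEMS; ++i) {    // each thread's items in order
            out[t][i] = running;             // exclusive: predecessors only
            running += in[t][i];
        }
}
```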
23
Nested composition of CUB primitives: cub::BlockScan
cub::BlockScan
  cub::WarpScan
24
Nested composition of CUB primitives: cub::BlockRadixSort
cub::BlockRadixSort
  cub::BlockRadixRank
    cub::BlockScan
      cub::WarpScan
  cub::BlockExchange
25
Nested composition of CUB primitives: cub::BlockHistogram (specialized for the BLOCK_HISTO_SORT algorithm)
cub::BlockHistogram
  cub::BlockRadixSort
    cub::BlockRadixRank
      cub::BlockScan
        cub::WarpScan
    cub::BlockExchange
  cub::BlockDiscontinuity
26
Block-wide and warp-wide CUB primitives
cub::BlockDiscontinuity
cub::BlockExchange
cub::BlockLoad & cub::BlockStore
cub::BlockRadixSort
cub::WarpReduce & cub::BlockReduce
cub::WarpScan & cub::BlockScan
cub::BlockHistogram
… and more at the CUB project on GitHub
http://nvlabs.github.com/cub
27
Tuning with flexible collectives
28
Example: radix sorting throughput (initial GT200 effort ~2011)
[Bar chart: millions of 32-bit keys/s for NVIDIA GTX 580 [1], GTX 480 [1], Tesla C2050 [1], GTX 280 [1], 9800 GTX+ [1], Intel MIC Knight's Ferry [4], Intel Core i7 Nehalem 3.2 GHz [2], AMD Radeon HD 6970 [3]]
[1] Merrill. Back40 GPU Primitives (2012)
[2] Satish et al. Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort (2010)
[3] Harada and Howes. Introduction to GPU Radix Sort (2011)
[4] Satish et al. Fast Sort on CPUs, GPUs, and Intel MIC Architectures. Intel Labs (2010)
29
Radix sorting throughput (current)
[Bar chart: millions of 32-bit keys/s for NVIDIA GTX 580 [1], GTX 480 [1], Tesla C2050 [1], GTX 280 [1], 9800 GTX+ [1], Intel MIC Knight's Ferry [4], Intel Core i7 Nehalem 3.2 GHz [2], AMD Radeon HD 6970 [3]]
[1] Merrill. Back40, CUB GPU Primitives (2013)
[2] Satish et al. Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort (2010)
[3] Harada and Howes. Introduction to GPU Radix Sort (2011)
[4] Satish et al. Fast Sort on CPUs, GPUs, and Intel MIC Architectures. Intel Labs (2010)
30
Fine-tuning primitives: tiled prefix sum

/**
 * Simple CUDA kernel for computing tiled partial sums
 */
template <int BLOCK_THREADS, int ITEMS_PER_THREAD, LoadAlgorithm LOAD_ALGO, ScanAlgorithm SCAN_ALGO>
__global__ void ScanTilesKernel(int *d_in, int *d_out)
{
    // Specialize collective types for problem context
    typedef cub::BlockLoad<int*, BLOCK_THREADS, ITEMS_PER_THREAD, LOAD_ALGO> BlockLoadT;
    typedef cub::BlockScan<int, BLOCK_THREADS, SCAN_ALGO> BlockScanT;

    // Allocate on-chip temporary storage
    __shared__ union {
        typename BlockLoadT::TempStorage load;
        typename BlockScanT::TempStorage scan;
    } temp_storage;

    // Load data per thread
    int thread_data[ITEMS_PER_THREAD];
    int offset = blockIdx.x * (BLOCK_THREADS * ITEMS_PER_THREAD);
    BlockLoadT(temp_storage.load).Load(d_in + offset, thread_data);
    __syncthreads();

    // Compute the block-wide prefix sum
    BlockScanT(temp_storage.scan).Sum(thread_data);
    …
}
Data is striped across threads for memory accesses; data is blocked across threads for scanning.
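The striped-to-blocked exchange behind coalesced loading can be sketched on the host (illustrative C++ with tiny toy sizes, not CUB's BlockExchange): a striped load hands thread t the items at global ranks t, t + T, t + 2T, ..., and the exchange rearranges them so thread t owns a contiguous run for scanning.

```cpp
#include <cassert>

// Host-side sketch of a striped-to-blocked exchange: under striping,
// thread t's i-th item has global rank i*T + t; after the exchange,
// thread t owns the contiguous run [t*I, t*I + I).
const int T = 4, I = 2;   // 4 "threads", 2 items each (toy sizes)

void striped_to_blocked(const int striped[T][I], int blocked[T][I])
{
    for (int t = 0; t < T; ++t)
        for (int i = 0; i < I; ++i) {
            int rank = i * T + t;                 // global rank under striping
            blocked[rank / I][rank % I] = striped[t][i];
        }
}
```

On the GPU this round-trips through shared memory; the address arithmetic is the same.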
[Figure: scan data flow tiled from warpscans]
31
CUB: device-wide performance-portability vs. Thrust and NPP across the last three major NVIDIA arch families (Tesla, Fermi, Kepler)
Primitive (billions/sec)              Library          Tesla C1060   Tesla C2050   Tesla K20C
Global radix sort (32-bit keys)       CUB              0.50          1.05          1.40
                                      Thrust v1.7.1    0.51          0.71          0.66
Global reduction (32-bit items)       CUB              22            33            45
                                      Thrust v1.7.1    19            30            37
Global prefix scan (32-bit items)     CUB              8             14            21
                                      Thrust v1.7.1    4             6             6
Global histogram (8-bit items)        CUB              2.7           16.2          19.3
                                      NPP              0             2             2
Global partition-if (32-bit inputs)   CUB              4.2           8.6           16.4
                                      Thrust v1.7.1    1.7           2.2           2.4
32
Summary
33
Summary: benefits of using CUB primitives
Simplicity of composition
Kernels are simply sequences of primitives (e.g., BlockLoad -> BlockSort -> BlockReduceByKey)
High performance
CUB uses the best known algorithms, abstractions, strategies, and techniques
Performance portability
CUB is specialized for the target hardware (e.g., memory conflict rules, special instructions, etc.)
Simplicity of tuning
CUB adapts to various grain sizes (threads per block, items per thread, etc.)
CUB provides alternative algorithms
Robustness and durability
CUB supports arbitrary data types and block sizes
Questions?
Please visit the CUB project on GitHub: http://nvlabs.github.com/cub
Duane Merrill (dumerrill@nvidia.com)
[Figure: single-pass device-wide prefix scan over partitions p0 … pP-1, each reduced and then scanned; a partition obtains its exclusive prefix by adaptive look-back: it publishes a status flag (A = aggregate available, P = inclusive prefix available, X = not yet valid) together with its aggregate (e.g., 256) and inclusive prefix (e.g., 768), and successors accumulate predecessors' aggregates, stopping at the first available inclusive prefix]
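The look-back step in the figure can be sketched sequentially (illustrative C++; the real kernel runs this concurrently with memory fences, and an 'X' flag would mean spin-waiting, which is omitted here).

```cpp
#include <cassert>
#include <vector>

// Sequential sketch of adaptive look-back: each partition publishes a
// status flag ('A' = aggregate available, 'P' = inclusive prefix
// available), its aggregate, and eventually its inclusive prefix.
struct Descriptor { char flag; int aggregate; int incl_prefix; };

// Partition p's exclusive prefix: scan predecessors backwards, summing
// aggregates, and stop early at the first published inclusive prefix.
int exclusive_prefix(const std::vector<Descriptor>& d, int p)
{
    int sum = 0;
    for (int q = p - 1; q >= 0; --q) {
        if (d[q].flag == 'P')
            return sum + d[q].incl_prefix;   // covers all earlier partitions
        sum += d[q].aggregate;               // 'A': take aggregate, keep going
    }
    return sum;                              // walked back to partition 0
}
```

With per-partition aggregates of 256 and partition 1's inclusive prefix published as 512, partition 3's look-back accumulates partition 2's aggregate and stops at partition 1, yielding 768.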