Source: on-demand.gputechconf.com/gtc/2014/...
CUB “collective” software primitives Duane Merrill NVIDIA Research


TRANSCRIPT

Page 1: “collective” software primitiveson-demand.gputechconf.com/gtc/2014/...1. A design model for “collective” primitives How to make reusable SIMT software constructs 2. A library

CUB “collective” software primitives

Duane Merrill NVIDIA Research

Page 2

What is CUB?

1. A design model for “collective” primitives

How to make reusable SIMT software constructs

2. A library of collective primitives

Block-reduce, block-sort, block-histogram, warp-scan, warp-reduce, etc.

3. A library of global primitives

Device-reduce, device-sort, device-scan, etc.

Constructed from collective primitives

Demonstrate performance, performance-portability

Page 3

Software reuse

Page 4

Software reuse Abstraction & composability are fundamental

Reducing redundant programmer effort…

Saves time, energy, money

Reduces buggy software

Encapsulating complexity…

Empowers ordinary programmers

Insulates applications from underlying hardware

Software reuse empowers a durable programming model


Page 6

“Collective” primitives

Page 7

Parallel programming is hard…

Page 8

Parallel decomposition and grain sizing
Synchronization
Deadlock, livelock, and data races
Plurality of state
Plurality of flow control (divergence, etc.)
Bookkeeping control structures
Memory access conflicts, coalescing, etc.
Occupancy constraints from SMEM, RF, etc.
Algorithm selection and instruction scheduling
Special hardware functionality, instructions, etc.

Cooperative parallel programming is hard…

Page 9

Parallel decomposition and grain sizing
Synchronization
Deadlock, livelock, and data races
Plurality of state
Plurality of flow control (divergence, etc.)
Bookkeeping control structures
Memory access conflicts, coalescing, etc.
Occupancy constraints from SMEM, RF, etc.
Algorithm selection and instruction scheduling
Special hardware functionality, instructions, etc.

Parallel programming is hard…

Page 10

CUDA today

threadblock threadblock threadblock

CUDA stub

application thread

Page 11

CUDA today “Collective primitives” are the missing layer in today’s CUDA software stack

threadblock threadblock threadblock

BlockSort BlockSort BlockSort …

CUDA stub

application thread

Page 12

What do these have in common?

[Figure: four panels: parallel sparse graph traversal (a graph with vertex depth labels 0, 1, 2, 3, ∞), parallel radix sort, parallel BWT compression, parallel SpMV]

Page 13

What do these have in common? Block-wide prefix scan

Queue management
Segmented reduction
Recurrence solver
Partitioning

[Figure: the same four panels (parallel sparse graph traversal, parallel radix sort, parallel BWT compression, parallel SpMV)]

Page 14

Examples of parallel scan data flow (16 threads contributing 4 items each)

[Figure: two scan data-flow networks over 64 items]

Work-efficient Brent-Kung hybrid (~130 binary ops)
Depth-efficient Kogge-Stone hybrid (~170 binary ops)

Page 15

CUDA today: kernel programming is complicated

threadblock threadblock threadblock

CUDA stub

application thread

Page 16

CUDA today “Collective primitives” are the missing layer in today’s CUDA software stack

threadblock threadblock threadblock

BlockSort BlockSort BlockSort …

CUDA stub

application thread

Page 17

Collective design & usage

Page 18

Collective design criteria Components are easily nested & sequenced

threadblock

BlockSort

BlockRadixRank

BlockScan

WarpScan

BlockExchange

threadblock threadblock threadblock

BlockSort BlockSort BlockSort …

CUDA stub

application thread

Page 19

Collective design criteria Flexible interfaces that scale (& tune) to different block sizes, thread-granularities, etc.

threadblock

BlockSort

BlockRadixRank

BlockScan

WarpScan

BlockExchange

thread block

thread block

thread block

BlockSort

BlockSort

BlockSort

CUDA stub

application thread

Page 20

Collective interface design: 3 parameter fields separated by concerns, plus reflected shared resource types

1. Static specialization interface: params dictate storage layout and unrolling of algorithmic steps; allows data placement in fast registers

2. Reflected shared resource types: reflection enables compile-time allocation and tuning

3. Collective construction interface: optional params concerning inter-thread communication; orthogonal to function behavior

4. Operational function interface: method-specific inputs/outputs

__global__ void ExampleKernel()
{
    // (1) Specialize cub::BlockScan for 128 threads
    typedef cub::BlockScan<int, 128> BlockScanT;

    // (2) Allocate temporary storage in shared memory
    __shared__ typename BlockScanT::TempStorage scan_storage;

    // Obtain a segment of 512 items blocked across 128 threads
    int items[4];
    ...

    // (3) Construct with the storage, (4) compute block-wide prefix sum
    BlockScanT(scan_storage).ExclusiveSum(items, items);
}

Page 21

template <typename T, int BLOCK_THREADS>
class BlockScan
{
public:
    // Type of shared memory needed by BlockScan
    typedef T TempStorage[BLOCK_THREADS];

private:
    // Per-thread data (shared storage reference)
    TempStorage &temp_storage;

public:
    // Constructor
    BlockScan(TempStorage &storage) : temp_storage(storage) {}

    // Prefix sum operation (each thread contributes its own data item)
    T Sum(T thread_data)
    {
        int tid = threadIdx.x;
        for (int i = 1; i < BLOCK_THREADS; i *= 2)
        {
            temp_storage[tid] = thread_data;
            __syncthreads();
            if (tid - i >= 0)
                thread_data += temp_storage[tid - i];
            __syncthreads();
        }
        return thread_data;
    }
};

Collective primitive design Simplified block-wide prefix sum

Page 22

Sequencing CUB primitives Using cub::BlockLoad and cub::BlockScan

__global__ void ExampleKernel(int *d_in)
{
    // Specialize for 128 threads owning 4 integers each
    typedef cub::BlockLoad<int*, 128, 4> BlockLoadT;
    typedef cub::BlockScan<int, 128> BlockScanT;

    // Allocate temporary storage in shared memory
    __shared__ union {
        typename BlockLoadT::TempStorage load;
        typename BlockScanT::TempStorage scan;
    } temp_storage;

    // Use coalesced (thread-striped) loads and a subsequent local exchange to
    // block a global segment of 512 items across 128 threads
    int items[4];
    BlockLoadT(temp_storage.load).Load(d_in, items);

    __syncthreads();

    // Compute block-wide prefix sum
    BlockScanT(temp_storage.scan).ExclusiveSum(items, items);
    ...
}

Load, Scan

Specialize, Allocate

Page 23

Nested composition of CUB primitives cub::BlockScan

cub::BlockScan

cub::WarpScan

Page 24

Nested composition of CUB primitives cub::BlockRadixSort

cub::BlockRadixSort

cub::BlockRadixRank

cub::BlockScan

cub::WarpScan

cub::BlockExchange

Page 25


Nested composition of CUB primitives cub::BlockHistogram (specialized for BLOCK_HISTO_SORT algorithm)

cub::BlockRadixSort

cub::BlockRadixRank

cub::BlockScan

cub::WarpScan

cub::BlockExchange

cub::BlockDiscontinuity

Page 26

Block-wide and warp-wide CUB primitives cub::BlockDiscontinuity

cub::BlockExchange

cub::BlockLoad & cub::BlockStore

cub::BlockRadixSort

cub::WarpReduce & cub::BlockReduce

cub::WarpScan & cub::BlockScan

cub::BlockHistogram

[Figure: thread-to-item mappings and the L2/Tex memory path for the primitives above]

… and more at the CUB project on GitHub

http://nvlabs.github.com/cub

Page 27

Tuning with flexible collectives

Page 28

Example: radix sorting throughput (initial GT200 effort ~2011)


[Chart: millions of 32-bit keys/s on NVIDIA GTX 580 [1], GTX 480 [1], Tesla C2050 [1], GTX 280 [1], 9800 GTX+ [1], Intel MIC Knight's Ferry [4], Intel Core i7 Nehalem 3.2 GHz [2], and AMD Radeon HD 6970 [3]]

[1] Merrill. Back40 GPU Primitives (2012)
[2] Satish et al. Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort (2010)
[3] T. Harada and L. Howes. Introduction to GPU Radix Sort (2011)
[4] Satish et al. Fast Sort on CPUs, GPUs, and Intel MIC Architectures. Intel Labs (2010)

Page 29

Radix sorting throughput (current)


[Chart: millions of 32-bit keys/s on NVIDIA GTX 580 [1], GTX 480 [1], Tesla C2050 [1], GTX 280 [1], 9800 GTX+ [1], Intel MIC Knight's Ferry [4], Intel Core i7 Nehalem 3.2 GHz [2], and AMD Radeon HD 6970 [3]]

[1] Merrill. Back40, CUB GPU Primitives (2013)
[2] Satish et al. Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort (2010)
[3] T. Harada and L. Howes. Introduction to GPU Radix Sort (2011)
[4] Satish et al. Fast Sort on CPUs, GPUs, and Intel MIC Architectures. Intel Labs (2010)

Page 30

Fine-tuning primitives Tiled prefix sum

/**
 * Simple CUDA kernel for computing tiled partial sums
 */
template <int BLOCK_THREADS, int ITEMS_PER_THREAD, LoadAlgorithm LOAD_ALGO, ScanAlgorithm SCAN_ALGO>
__global__ void ScanTilesKernel(int *d_in, int *d_out)
{
    // Specialize collective types for problem context
    typedef cub::BlockLoad<int*, BLOCK_THREADS, ITEMS_PER_THREAD, LOAD_ALGO> BlockLoadT;
    typedef cub::BlockScan<int, BLOCK_THREADS, SCAN_ALGO> BlockScanT;

    // Allocate on-chip temporary storage
    __shared__ union {
        typename BlockLoadT::TempStorage load;
        typename BlockScanT::TempStorage scan;
    } temp_storage;

    // Load data per thread
    int thread_data[ITEMS_PER_THREAD];
    int offset = blockIdx.x * (BLOCK_THREADS * ITEMS_PER_THREAD);
    BlockLoadT(temp_storage.load).Load(d_in + offset, thread_data);

    __syncthreads();

    // Compute the block-wide prefix sum
    BlockScanT(temp_storage.scan).Sum(thread_data);
}

Data is striped across threads for memory accesses; data is blocked across threads for scanning.

[Figure: striped vs. blocked thread-to-item mappings, and the scan data flow tiled from warpscans]

Page 31

CUB: device-wide performance-portability vs. Thrust and NPP across the last three major NVIDIA arch families (Tesla, Fermi, Kepler)

(CUB vs. reference, per device: Tesla C1060 / C2050 / K20C)

Global radix sort (billions of 32-bit keys/sec):      CUB 0.50 / 1.05 / 1.40    Thrust v1.7.1 0.51 / 0.71 / 0.66
Global reduction (billions of 32-bit items/sec):      CUB 22 / 33 / 45          Thrust v1.7.1 19 / 30 / 37
Global prefix scan (billions of 32-bit items/sec):    CUB 8 / 14 / 21           Thrust v1.7.1 4 / 6 / 6
Global histogram (billions of 8-bit items/sec):       CUB 2.7 / 16.2 / 19.3     NPP 0 / 2 / 2
Global partition-if (billions of 32-bit inputs/sec):  CUB 4.2 / 8.6 / 16.4      Thrust v1.7.1 1.7 / 2.2 / 2.4

Page 32

Summary

Page 33

Summary: benefits of using CUB primitives

Simplicity of composition

Kernels are simply sequences of primitives (e.g., BlockLoad -> BlockSort -> BlockReduceByKey)

High performance

CUB uses the best known algorithms, abstractions, strategies, and techniques

Performance portability

CUB is specialized for the target hardware (e.g., memory conflict rules, special instructions, etc.)

Simplicity of tuning

CUB adapts to various grain sizes (threads per block, items per thread, etc.)

CUB provides alternative algorithms

Robustness and durability

CUB supports arbitrary data types and block sizes

Page 34

Questions? Please visit the CUB project on GitHub http://nvlabs.github.com/cub Duane Merrill ([email protected])

Page 35

[Figure: two-pass device-wide scan. Partitions p0 … pP-1 each reduce their inputs x0, x1, x2, …; an exclusive scan of the partition totals produces prefix0:0 … prefix0:P-2; each partition then scans its inputs, seeded with its prefix, to produce y0, y1, y2, …]

Page 36

[Figure: single-pass scan with adaptive look-back. Each partition p0 … pP-1 publishes a descriptor holding a status flag (A = aggregate available, P = inclusive prefix available, X = invalid), its aggregate (e.g., 256), and its inclusive prefix (e.g., 768); each partition resolves its exclusive prefix by adaptively looking back across its predecessors' descriptors instead of waiting for a separate scan pass]