GPU computing with C++ AMP
Kenneth Domino
Domem Technologies, Inc.
http://domemtech.com
NB: This presentation is based on Daniel Moth’s presentation at http://channel9.msdn.com/Events/BUILD/BUILD2011/TOOL-802T
October 24, 2011
http://domemtech.com/?p=1025
CPUs vs GPUs today
CPU
• Low memory bandwidth
• Higher power consumption
• Medium level of parallelism
• Deep execution pipelines
• Random accesses
• Supports general code
• Mainstream programming
CPUs vs GPUs today
GPU
• High memory bandwidth
• Lower power consumption
• High level of parallelism
• Shallow execution pipelines
• Sequential accesses
• Supports data-parallel code
• Niche programming
• CPUs and GPUs coming closer together…
• …nothing settled in this space, things still in motion…
• C++ AMP is designed as a mainstream solution not only for today, but also for tomorrow
Tomorrow…
image source: AMD
C++ AMP
• Part of Visual C++
• Visual Studio integration
• STL-like library for multidimensional data
• Builds on Direct3D

performance, portability, productivity
C++ AMP vs. CUDA vs. OpenCL
OpenCL
(Stone knives and bearskins – C99!! YUCK!)
CUDA – Wow, classes!
C++ AMP
Lambda functions (1936 Church—back to the future)
Hello World: Array Addition
void AddArrays(int n, int * pA, int * pB, int * pC)
{
    for (int i = 0; i < n; i++)
    {
        pC[i] = pA[i] + pB[i];
    }
}
How do we take the serial code on the left that runs on the CPU and convert it to run on an accelerator like the GPU?
void AddArrays(int n, int * pA, int * pB, int * pC)
{
    for (int i = 0; i < n; i++)
    {
        pC[i] = pA[i] + pB[i];
    }
}

#include <amp.h>
using namespace concurrency;

void AddArrays(int n, int * pA, int * pB, int * pC)
{
    array_view<int,1> a(n, pA);
    array_view<int,1> b(n, pB);
    array_view<int,1> sum(n, pC);
    parallel_for_each(
        sum.grid,
        [=](index<1> i) restrict(direct3d)
        {
            sum[i] = a[i] + b[i];
        });
}
Hello World: Array Addition
void AddArrays(int n, int * pA, int * pB, int * pC)
{
    for (int i = 0; i < n; i++)
    {
        pC[i] = pA[i] + pB[i];
    }
}
Basic Elements of C++ AMP coding
void AddArrays(int n, int * pA, int * pB, int * pC)
{
    array_view<int,1> a(n, pA);
    array_view<int,1> b(n, pB);
    array_view<int,1> sum(n, pC);
    parallel_for_each(
        sum.grid,
        [=](index<1> idx) restrict(direct3d)
        {
            sum[idx] = a[idx] + b[idx];
        });
}
array_view variables captured and associated data copied to accelerator (on demand)
restrict(direct3d): tells the compiler to check that this code can execute on Direct3D hardware (aka accelerator)
parallel_for_each: execute the lambda on the accelerator once per thread
grid: the number and shape of threads to execute the lambda
index: the thread ID that is running the lambda, used to index into data
array_view: wraps the data to operate on the accelerator
grid<N>, extent<N>, and index<N>
• index<N>: represents an N-dimensional point
• extent<N>: the number of units in each dimension of an N-dimensional space
• grid<N>: an origin (index<N>) plus an extent<N>
• N can be any number
Examples: grid, extent, and index
index<1> i(2);
index<2> i(0,2);
index<3> i(2,0,1);

extent<1> e(6);
extent<2> e(3,4);
extent<3> e(3,2,2);

grid<1> g(e);
grid<2> g(e);
grid<3> g(e);
vector<int> v(96);
extent<2> e(8,12);                      // e[0] == 8; e[1] == 12
array<int,2> a(e, v.begin(), v.end());

// in the body of my lambda
index<2> i(3,9);                        // i[0] == 3; i[1] == 9
int o = a[i];                           // or a[i] = 16;
// int o = a(i[0], i[1]);               // equivalent function-call syntax
array<T,N>
• Multi-dimensional array of rank N with element type T
• Storage lives on the accelerator
[Figure: an 8 × 12 two-dimensional array, rows 0–7 by columns 0–11]
array_view<T,N>
• View on existing data on the CPU or GPU
• array_view<T,N>
• array_view<const T,N>
vector<int> v(10);
extent<2> e(2,5);
array_view<int,2> a(e, v);

// the above two lines can also be written
// array_view<int,2> a(2,5,v);
Data Classes Comparison
array<T,N>
• Rank at compile time
• Extent at runtime
• Rectangular
• Dense
• Container for data
• Explicit copy
• Capture by reference [&]

array_view<T,N>
• Rank at compile time
• Extent at runtime
• Rectangular
• Dense in one dimension
• Wrapper for data
• Implicit copy
• Capture by value [=]
KED Note: array_view seems faster than array<>. Could it be because of array_view's on-demand copying?
parallel_for_each(
    g,                                   // g is of type grid<N>
    [=](index<N> idx) restrict(direct3d)
    {
        // kernel code
    });
parallel_for_each• Executes the lambda for each point in the extent• As-if synchronous in terms of visible side-effects
restrict(…)
• Applies to functions (including lambdas)
• Why restrict?
  • Target-specific language restrictions
  • Optimizations or special code-gen behavior
  • Future-proofing
• Functions can have multiple restrictions
• In the 1st release we are implementing direct3d and cpu
  • cpu is the implicit default
restrict(direct3d) restrictions
• Can only call other restrict(direct3d) functions
• All functions must be inlinable
• Only direct3d-supported types
  • int, unsigned int, float, double
  • structs & arrays of these types
• Pointers and references
  • Lambdas cannot capture by reference, nor capture pointers
  • References and single-indirection pointers are supported only as local variables and function arguments
restrict(direct3d) restrictions
• No
  • recursion
  • 'volatile'
  • virtual functions
  • pointers to functions
  • pointers to member functions
  • pointers in structs
  • pointers to pointers
  • goto or labeled statements
  • throw, try, catch
  • globals or statics
  • dynamic_cast or typeid
  • asm declarations
  • varargs
  • unsupported types (e.g. char, short, long double)
Example: restrict overloading

double bar( double ) restrict(cpu,direct3d); // 1: same code for both targets
double cos( double );                        // 2a: general code
double cos( double ) restrict(direct3d);     // 2b: specific code

void SomeMethod(array<double,2> c)
{
    parallel_for_each(c.grid, [=](index<2> idx) restrict(direct3d)
    {
        //…
        double d1 = bar(c[idx]); // ok
        double d2 = cos(c[idx]); // ok, chooses the direct3d overload
        //…
    });
}
accelerator, accelerator_view
• accelerator
  • e.g. a DX11 GPU, REF, or the CPU
• accelerator_view
  • a context for scheduling and memory management
[Figure: host CPUs with system memory, connected over PCIe to GPU accelerators, each with its own memory (host/accelerator example)]
• Data transfers between accelerator and host
  • could be optimized away for an integrated memory architecture
Example: accelerator
// Identify an accelerator based on its Windows device ID
accelerator myAcc("PCI\\VEN_1002&DEV_9591&CC_0300");
// …or enumerate all accelerators (not shown)
// Allocate an array on my accelerator
array<int> myArray(10, myAcc.default_view);

// …or launch a kernel on my accelerator
parallel_for_each(myAcc.default_view, myArrayView.grid, ...);
C++ AMP at a Glance (so far)
• restrict(direct3d, cpu)
• parallel_for_each
• class array<T,N>
• class array_view<T,N>
• class index<N>
• class extent<N>, grid<N>
• class accelerator
• class accelerator_view
Achieving maximum performance gains
• Schedule threads in tiles
  • Avoid thread index remapping
  • Gain the ability to use tile_static memory
• The parallel_for_each overload for tiles accepts
  • tiled_grid<D0>, tiled_grid<D0,D1>, or tiled_grid<D0,D1,D2>
  • a lambda which accepts tiled_index<D0>, tiled_index<D0,D1>, or tiled_index<D0,D1,D2>
extent<2> e(8,6);
grid<2> g(e);
[Figure: the 8 × 6 grid shown untiled, tiled as g.tile<4,3>(), and tiled as g.tile<2,2>()]
tiled_grid, tiled_index
• Given the code below, when the lambda is executed for thread T:
  • t_idx.global      = index<2>(6,3)
  • t_idx.local       = index<2>(0,1)
  • t_idx.tile        = index<2>(3,1)
  • t_idx.tile_origin = index<2>(6,2)

array_view<int,2> data(8, 6, p_my_data);
parallel_for_each(
    data.grid.tile<2,2>(),
    [=](tiled_index<2,2> t_idx) …
    {
        …
    });
[Figure: the 8 × 6 grid tiled 2 × 2, with thread T highlighted at global index (6,3)]
tile_static, tile_barrier
• Within the tiled parallel_for_each lambda we can use
  • the tile_static storage class for local variables
    • indicates that the variable is allocated in fast cache memory, i.e. shared by the threads in a tile
    • only applicable in restrict(direct3d) functions
  • class tile_barrier
    • synchronizes all threads within a tile
    • e.g. t_idx.barrier.wait();
C++ AMP at a Glance
• restrict(direct3d, cpu)
• parallel_for_each
• class array<T,N>
• class array_view<T,N>
• class index<N>
• class extent<N>, grid<N>
• class accelerator
• class accelerator_view
• tile_static storage class
• class tiled_grid< , , >
• class tiled_index< , , >
• class tile_barrier
EXAMPLES!!!
• The beloved, ubiquitous matrix multiplication
• Bitonic sort
Example: Matrix Multiplication

void Multiply_Serial(Matrix * C, Matrix * A, Matrix * B)
{
    int wA = A->cols; int hA = A->rows;
    int wB = B->cols; int hB = B->rows;
    int wC = C->cols; int hC = C->rows;
    for (int gr = 0; gr < hA; ++gr)     // row
        for (int gc = 0; gc < wB; ++gc) // col
        {
            float sum = 0;
            for (int k = 0; k < hB; ++k)
                sum += Data(A)[gr * wA + k] * Data(B)[k * wB + gc];
            Data(C)[gr * wC + gc] = sum;
        }
}
void MultiplySimple(Matrix * C, Matrix * A, Matrix * B)
{
    int wA = A->cols; int hA = A->rows;
    int wB = B->cols; int hB = B->rows;
    int wC = C->cols; int hC = C->rows;
    array_view<const float,1> a(hA * wA, Data(A));
    array_view<const float,1> b(hB * wB, Data(B));
    array_view<writeonly<float>,1> c(hC * wC, Data(C));
    extent<2> e(hC, wC);
    grid<2> g(e);
    parallel_for_each(g, [=](index<2> idx) restrict(direct3d)
    {
        int gr = idx[0];
        int gc = idx[1];
        float sum = 0.0f;
        for (int k = 0; k < hB; k++)
            sum += a[gr * wA + k] * b[k * wB + gc];
        c[gr * wC + gc] = sum;
    });
}
Example: Matrix Multiplication (tiled, shared memory)

void MultiplyTiled(Matrix * C, Matrix * A, Matrix * B)
{
    int wA = A->cols; int hA = A->rows;
    int wB = B->cols; int hB = B->rows;
    int wC = C->cols; int hC = C->rows;
    array_view<const float,1> a(hA * wA, Data(A));
    array_view<const float,1> b(hB * wB, Data(B));
    array_view<writeonly<float>,1> c(hC * wC, Data(C));
    extent<2> e(hC, wC);
    grid<2> g(e);
    const int TS = 16;
    parallel_for_each(g.tile<TS,TS>(), [=](tiled_index<TS,TS> idx) restrict(direct3d)
    {
        int lr = idx.local[0];  int lc = idx.local[1];
        int gr = idx.global[0]; int gc = idx.global[1];
        float sum = 0.0f;
        for (int i = 0; i < hB; i += TS)
        {
            tile_static float locA[TS][TS], locB[TS][TS];
            locA[lr][lc] = a[gr * wA + lc + i];
            locB[lr][lc] = b[(lr + i) * wB + gc];
            idx.barrier.wait();
            for (int k = 0; k < TS; k++)
                sum += locA[lr][k] * locB[k][lc];
            idx.barrier.wait();
        }
        c[gr * wC + gc] = sum;
    });
}
Performance
C++ AMP performs as well as CUDA or OpenCL for matrix multiplication.
Implementation   Sequential        Implicit (unspecified tile)   Explicit (16 x 16 tile)   Shared mem (16 x 16 tile)
C++ AMP          1.317 ± 0.006 s   0.035 ± 0.008 s               0.030 ± 0.001 s           0.015 ± 0.002 s
CUDA             1.454 ± 0.008 s   n.a.                          0.046 ± 0.001 s           0.0150 ± 0.0003 s
OpenCL           1.448 ± 0.003 s   n.a.                          0.061 ± 0.002 s           0.033 ± 0.002 s

Random matrices A (480 x 640) x B (640 x 960) = C (480 x 960), single-precision floats.
Environment: NVIDIA GeForce GTX 470, an ATI Radeon HD 6450 (not used), and an Intel Q6600 @ 2.51 GHz (overclocked), 4 GB RAM, run on Windows 7 64-bit.
10 runs, first run thrown out (for a total of 9). Average ± S.E. of sample.
NB: this comparison is “Developer Preview vs. RTM”.
Bitonic sort

void bitonicSortSequential(int * data, int length)
{
    unsigned int log2length = log2(length);
    unsigned int checklength = pow2(log2length);
    for (int phase = 0; phase < log2length; ++phase)
    {
        int compares = length / 2;
        unsigned int phase2 = pow2((unsigned int)phase);
        for (int ig = 0; ig < compares; ++ig)
        {
            int cross, paired;
            orange_box(ig, phase2, cross, paired);
            swap(data[cross], data[paired]);
        }
        for (int level = phase - 1; level >= 0; --level)
        {
            unsigned int level2 = pow2((unsigned int)level);
            for (int ig = 0; ig < compares; ++ig)
            {
                int cross, paired;
                red_box(ig, level2, cross, paired);
                swap(data[cross], data[paired]);
            }
        }
    }
}
void bitonicSortSimple(int * data, int length)
{
    unsigned int log2length = log2(length);
    unsigned int checklength = pow2(log2length);
    static const int TS = 1024;
    array_view<int,1> a(length, data);
    for (int phase = 0; phase < log2length; ++phase)
    {
        int compares = length / 2;
        extent<1> e(compares);
        grid<1> g(e);
        unsigned int phase2 = pow2((unsigned int)phase);
        parallel_for_each(g.tile<TS>(), [phase2, a](tiled_index<TS> idx) restrict(direct3d)
        {
            int ig = idx.global[0];
            int cross, paired;
            orange_box(ig, phase2, cross, paired);
            swap(a[cross], a[paired]);
        });
        for (int level = phase - 1; level >= 0; --level)
        {
            unsigned int level2 = pow2((unsigned int)level);
            parallel_for_each(g.tile<TS>(), [level2, a](tiled_index<TS> idx) restrict(direct3d)
            {
                int ig = idx.global[0];
                int cross, paired;
                red_box(ig, level2, cross, paired);
                swap(a[cross], a[paired]);
            });
        }
    }
}
Performance
C++ AMP performs almost as well as CUDA for (naïve) bitonic sort.
Implementation   Sequential       Explicit (1024 tile)
C++ AMP          12.5 ± 0.4 s     0.42 ± 0.04 s
CUDA             12.6 ± 0.01 s    0.372 ± 0.002 s

Array of length 1024 * 32 * 16 * 16 = 8388608 integers.
Environment: NVIDIA GeForce GTX 470, an ATI Radeon HD 6450 (not used), and an Intel Q6600 @ 2.51 GHz (overclocked), 4 GB RAM, run on Windows 7 64-bit.
10 runs, first run thrown out (for a total of 9). Average ± S.E. of sample.
Bitonic sort with subsets of bitonic merge
Peters, H., O. Schulz-Hildebrandt, et al. (2011). "Fast in-place, comparison-based sorting with CUDA: a study with bitonic sort." Concurrency and Computation: Practice and Experience 23(7): 681-693.
Optimized bitonic sort
for (int i = 0; i < msize; ++i)
    mm[i] = a[base + memory[i]];
for (int i = 0; i < csize; i += 2)
{
    int cross = normalized_compares[i];
    int paired = normalized_compares[i+1];
    swap(mm[cross], mm[paired]);
}
for (int i = 0; i < msize; ++i)
    a[base + memory[i]] = mm[i];
• Optimization: compute multiple compares per thread using subsets of the bitonic merge of a given degree.
• Load the portion of the subset bitonic merge into registers.
• Perform the swaps in registers.
• Store the registers back to global memory.
• NB: the loop must be unrolled manually due to a compiler bug with #pragma unroll!
Performance
C++ AMP performs almost as well as CUDA for the optimized bitonic sort.
Implementation   Explicit (512 tile)   Degree opt., explicit (512 tile)
C++ AMP          0.285 ± 0.4 s         0.239 ± 0.001 s
CUDA             0.280 ± 0.001 s       0.187 ± 0.001 s

Array of length 1024 * 32 * 16 * 16 = 8388608 integers.
Environment: NVIDIA GeForce GTX 470, an ATI Radeon HD 6450 (not used), and an Intel Q6600 @ 2.51 GHz (overclocked), 4 GB RAM, run on Windows 7 64-bit.
10 runs, first run thrown out (for a total of 9). Average ± S.E. of sample.
Issues
All GPU memory must be an array_view<> (or array<>), so you must use an array index!!!
foo_bar[…] = …;
Pointer to struct/class unavailable, but struct/class is OK:

class foo {};
foo f;
…
parallel_for_each…
    f.bar();      // error: cannot convert 'this' pointer from 'const foo' to 'foo &'

foo f;
array_view<foo,1> xxx(1, &f);
parallel_for_each…
    xxx[0].bar(); // OK, but must use []'s. YUCK!
Issues
Easy to forget that after a parallel_for_each, results must be fetched via
1) calling array_view's synchronize() on the variable;
2) destruction of the array_view variable; or
3) an explicit copy for array<>.
Take care in comparing CUDA/OpenCL with C++ AMP:
1) C++ AMP may not pick the optimal accelerator if you have two GPUs;
2) C++ AMP picks the tile size if you don't specify it.

If you do pick a tile size, it must divide the grid extent exactly!
Not Covered
• Math library
  • e.g. acosf
• Atomic operations library
  • e.g. atomic_fetch_add
• Direct3D intrinsics
  • debugging (e.g. direct3d_printf), fences (e.g. __dp_d3d_all_memory_fence), float math (e.g. __dp_d3d_absf)
• Direct3D interop
  • *get_device, create_accelerator_view, make_array, *get_buffer
Visual Studio 11
• Organize
• Edit
• Design
• Build
• Browse
• Debug
• Profile
NB: VS Express does not contain C++ AMP! If you install Windows 8, make sure to install VS Ultimate 11 Developer preview.
C++ AMP Parallel Debugger
• Well-known Visual Studio debugging features
  • Launch, Attach, Break, Stepping, Breakpoints, DataTips
  • Tool windows
    • Processes, Debug Output, Modules, Disassembly, Call Stack, Memory, Registers, Locals, Watch, Quick Watch
• New features (for both CPU and GPU)
  • Parallel Stacks window, Parallel Watch window, Barrier
• New GPU-specific features
  • Emulator, GPU Threads window, race detection
I could not break on a statement in the kernel
Visual Studio 11 - Profiler
• Labels the statements that are most expensive in a routine.
Visual Studio 11 - Profiler
• Labels statements that are most expensive in a routine.
• Does not seem to label kernel code.
Concurrency Visualizer for GPU
• Direct3D-centric
  • Supports any library/programming model built on it
• Integrated GPU and CPU view
• Goal is to analyze high-level performance metrics
  • Memory copy overheads
  • Synchronization overheads across CPU/GPU
  • GPU activity and contention with other processes
Where is my GPU???
Summary
• Democratization of parallel hardware programmability
• Performance for the mainstream
• High-level abstractions in C++ (not C)
• State-of-the-art Visual Studio IDE
• Hardware abstraction platform
• Intent is to make C++ AMP an open specification
Ken’s blog comparing C++ AMP, CUDA, OpenCL
• http://domemtech.com/?p=1025
Daniel Moth's blog (PM of C++ AMP)
• http://www.danielmoth.com/Blog/
MSDN Native parallelism blog (team blog)
• http://blogs.msdn.com/b/nativeconcurrency/
MSDN Dev Center for Parallel Computing
• http://msdn.com/concurrency
MSDN Forums to ask questions
• http://social.msdn.microsoft.com/Forums/en/parallelcppnative/threads