GPU computing with C++ AMP
Kenneth Domino
Domem Technologies, Inc.
http://domemtech.com
NB: This presentation is based on Daniel Moth’s presentation at http://channel9.msdn.com/Events/BUILD/BUILD2011/TOOL-802T
October 24, 2011
http://domemtech.com/?p=1025
CPUs vs GPUs today
CPU
• Low memory bandwidth
• Higher power consumption
• Medium level of parallelism
• Deep execution pipelines
• Random accesses
• Supports general code
• Mainstream programming
CPUs vs GPUs today
GPU
• High memory bandwidth
• Lower power consumption
• High level of parallelism
• Shallow execution pipelines
• Sequential accesses
• Supports data-parallel code
• Niche programming
• CPUs and GPUs coming closer together…
• …nothing settled in this space, things still in motion…
• C++ AMP is designed as a mainstream solution not only for today, but also for tomorrow
Tomorrow…
image source: AMD
C++ AMP
• Part of Visual C++
• Visual Studio integration
• STL-like library for multidimensional data
• Builds on Direct3D

performance, portability, productivity
C++ AMP vs. CUDA vs. OpenCL
OpenCL
(Stone knives and bearskins – C99!! YUCK!)
CUDA – Wow, classes!
C++ AMP
Lambda functions (1936 Church—back to the future)
Hello World: Array Addition
void AddArrays(int n, int * pA, int * pB, int * pC)
{
    for (int i = 0; i < n; i++)
    {
        pC[i] = pA[i] + pB[i];
    }
}
How do we take the serial code on the left that runs on the CPU and convert it to run on an accelerator like the GPU?
void AddArrays(int n, int * pA, int * pB, int * pC)
{
    for (int i = 0; i < n; i++)
    {
        pC[i] = pA[i] + pB[i];
    }
}

#include <amp.h>
using namespace concurrency;

void AddArrays(int n, int * pA, int * pB, int * pC)
{
    array_view<int,1> a(n, pA);
    array_view<int,1> b(n, pB);
    array_view<int,1> sum(n, pC);
    parallel_for_each(
        sum.grid,
        [=](index<1> i) restrict(direct3d)
        {
            sum[i] = a[i] + b[i];
        });
}
Hello World: Array Addition
void AddArrays(int n, int * pA, int * pB, int * pC)
{
    for (int i = 0; i < n; i++)
    {
        pC[i] = pA[i] + pB[i];
    }
}
Basic Elements of C++ AMP coding
void AddArrays(int n, int * pA, int * pB, int * pC)
{
    array_view<int,1> a(n, pA);
    array_view<int,1> b(n, pB);
    array_view<int,1> sum(n, pC);
    parallel_for_each(
        sum.grid,
        [=](index<1> idx) restrict(direct3d)
        {
            sum[idx] = a[idx] + b[idx];
        });
}
array_view variables captured and associated data copied to accelerator (on demand)
restrict(direct3d): tells the compiler to check that this code can execute on Direct3D hardware (aka accelerator)
parallel_for_each: execute the lambda on the accelerator once per thread
grid: the number and shape of threads to execute the lambda
index: the thread ID that is running the lambda, used to index into data
array_view: wraps the data to operate on the accelerator
grid<N>, extent<N>, and index<N>
• index<N>: represents an N-dimensional point
• extent<N>: the number of units in each dimension of an N-dimensional space
• grid<N>: an origin (index<N>) plus an extent<N>
• N can be any number
Examples: grid, extent, and index
index<1> i(2);
index<2> i(0,2);
index<3> i(2,0,1);

extent<1> e(6);
extent<2> e(3,4);
extent<3> e(3,2,2);

grid<1> g(e);
grid<2> g(e);
grid<3> g(e);
vector<int> v(96);
extent<2> e(8,12);                      // e[0] == 8; e[1] == 12
array<int,2> a(e, v.begin(), v.end());

// in the body of my lambda
index<2> i(3,9);                        // i[0] == 3; i[1] == 9
int o = a[i];                           // or a[i] = 16;
// int o = a(i[0], i[1]);               // equivalent function-call syntax
array<T,N>
• Multi-dimensional array of rank N with element type T
• Storage lives on the accelerator
[Figure: an 8 × 12 two-dimensional array, rows 0–7 by columns 0–11]
array_view<T,N>
• View on existing data on the CPU or GPU
• array_view<T,N>
• array_view<const T,N>
vector<int> v(10);
extent<2> e(2,5);
array_view<int,2> a(e, v);

// the above two lines can also be written
// array_view<int,2> a(2,5,v);
Data Classes Comparison
array<T,N>
• Rank at compile time
• Extent at runtime
• Rectangular
• Dense
• Container for data
• Explicit copy
• Capture by reference [&]

array_view<T,N>
• Rank at compile time
• Extent at runtime
• Rectangular
• Dense in one dimension
• Wrapper for data
• Implicit copy
• Capture by value [=]
KED Note: array_view seems faster than array<>. Could it be because of array_view's on-demand copying?
parallel_for_each(
    g,                                   // g is of type grid<N>
    [=](index<N> idx) restrict(direct3d)
    {
        // kernel code
    });
parallel_for_each• Executes the lambda for each point in the extent• As-if synchronous in terms of visible side-effects
restrict(…)
• Applies to functions (including lambdas)
• Why restrict?
  • Target-specific language restrictions
  • Optimizations or special code-gen behavior
  • Future-proofing
• Functions can have multiple restrictions
• In the 1st release we are implementing direct3d and cpu
  • cpu is the implicit default
restrict(direct3d) restrictions
• Can only call other restrict(direct3d) functions
• All functions must be inlinable
• Only direct3d-supported types
  • int, unsigned int, float, double
  • structs & arrays of these types
• Pointers and references
  • Lambdas cannot capture by reference, nor capture pointers
  • References and single-indirection pointers are supported only as local variables and function arguments
restrict(direct3d) restrictions
• No
  • recursion
  • 'volatile'
  • virtual functions
  • pointers to functions
  • pointers to member functions
  • pointers in structs
  • pointers to pointers
  • goto or labeled statements
  • throw, try, catch
  • globals or statics
  • dynamic_cast or typeid
  • asm declarations
  • varargs
  • unsupported types (e.g. char, short, long double)
Example: restrict overloading

double bar( double ) restrict(cpu,direct3d); // 1: same code for both targets
double cos( double );                        // 2a: general code
double cos( double ) restrict(direct3d);     // 2b: specific code

void SomeMethod(array<double,2> c)
{
    parallel_for_each(c.grid, [=](index<2> idx) restrict(direct3d)
    {
        //…
        double d1 = bar(c[idx]); // ok
        double d2 = cos(c[idx]); // ok, chooses the direct3d overload
        //…
    });
}
accelerator, accelerator_view
• accelerator
  • e.g. a DX11 GPU, REF, or the CPU
• accelerator_view
  • a context for scheduling and memory management
[Figure: host CPUs with system memory, connected over PCIe to GPU accelerators, each with its own memory (host/accelerator example)]
• Data transfers between accelerator and host
  • could be optimized away for an integrated memory architecture
Example: accelerator
// Identify an accelerator based on its Windows device ID
accelerator myAcc("PCI\\VEN_1002&DEV_9591&CC_0300");
// …or enumerate all accelerators (not shown)
// Allocate an array on my accelerator
array<int> myArray(10, myAcc.default_view);

// …or launch a kernel on my accelerator
parallel_for_each(myAcc.default_view, myArrayView.grid, ...);
C++ AMP at a Glance (so far)
• restrict(direct3d, cpu)
• parallel_for_each
• class array<T,N>
• class array_view<T,N>
• class index<N>
• class extent<N>, grid<N>
• class accelerator
• class accelerator_view
Achieving maximum performance gains
• Schedule threads in tiles
  • Avoid thread index remapping
  • Gain the ability to use tile_static memory
• The parallel_for_each overload for tiles accepts
  • tiled_grid<D0>, tiled_grid<D0,D1>, or tiled_grid<D0,D1,D2>
  • a lambda which accepts tiled_index<D0>, tiled_index<D0,D1>, or tiled_index<D0,D1,D2>
extent<2> e(8,6);
grid<2> g(e);
[Figure: the 8 × 6 grid shown untiled, tiled as g.tile<4,3>(), and tiled as g.tile<2,2>()]
tiled_grid, tiled_index
• Given the code below, when the lambda is executed for thread T:
  • t_idx.global      = index<2>(6,3)
  • t_idx.local       = index<2>(0,1)
  • t_idx.tile        = index<2>(3,1)
  • t_idx.tile_origin = index<2>(6,2)

array_view<int,2> data(8, 6, p_my_data);
parallel_for_each(
    data.grid.tile<2,2>(),
    [=](tiled_index<2,2> t_idx) …
    {
        …
    });
[Figure: the 8 × 6 grid tiled 2 × 2, with thread T highlighted at global index (6,3)]
tile_static, tile_barrier
• Within the tiled parallel_for_each lambda we can use
  • the tile_static storage class for local variables
    • indicates that the variable is allocated in fast cache memory, i.e. shared by the threads in a tile
    • only applicable in restrict(direct3d) functions
  • class tile_barrier
    • synchronizes all threads within a tile
    • e.g. t_idx.barrier.wait();
C++ AMP at a Glance
• restrict(direct3d, cpu)
• parallel_for_each
• class array<T,N>
• class array_view<T,N>
• class index<N>
• class extent<N>, grid<N>
• class accelerator
• class accelerator_view
• tile_static storage class
• class tiled_grid< , , >
• class tiled_index< , , >
• class tile_barrier
EXAMPLES!!!
• The beloved, ubiquitous matrix multiplication
• Bitonic sort
Example: Matrix Multiplication

void Multiply_Serial(Matrix * C, Matrix * A, Matrix * B)
{
    int wA = A->cols; int hA = A->rows;
    int wB = B->cols; int hB = B->rows;
    int wC = C->cols; int hC = C->rows;
    for (int gr = 0; gr < hA; ++gr)     // row
        for (int gc = 0; gc < wB; ++gc) // col
        {
            float sum = 0;
            for (int k = 0; k < hB; ++k)
                sum += Data(A)[gr * wA + k] * Data(B)[k * wB + gc];
            Data(C)[gr * wC + gc] = sum;
        }
}
void MultiplySimple(Matrix * C, Matrix * A, Matrix * B)
{
    int wA = A->cols; int hA = A->rows;
    int wB = B->cols; int hB = B->rows;
    int wC = C->cols; int hC = C->rows;
    array_view<const float,1> a(hA * wA, Data(A));
    array_view<const float,1> b(hB * wB, Data(B));
    array_view<writeonly<float>,1> c(hC * wC, Data(C));
    extent<2> e(hC, wC);
    grid<2> g(e);
    parallel_for_each(g, [=](index<2> idx) restrict(direct3d)
    {
        int gr = idx[0];
        int gc = idx[1];
        float sum = 0.0f;
        for (int k = 0; k < hB; k++)
            sum += a[gr * wA + k] * b[k * wB + gc];
        c[gr * wC + gc] = sum;
    });
}
Example: Matrix Multiplication (tiled, shared memory)

void MultiplyTiled(Matrix * C, Matrix * A, Matrix * B)
{
    int wA = A->cols; int hA = A->rows;
    int wB = B->cols; int hB = B->rows;
    int wC = C->cols; int hC = C->rows;
    array_view<const float,1> a(hA * wA, Data(A));
    array_view<const float,1> b(hB * wB, Data(B));
    array_view<writeonly<float>,1> c(hC * wC, Data(C));
    extent<2> e(hC, wC);
    grid<2> g(e);
    const int TS = 16;
    parallel_for_each(g.tile<TS,TS>(), [=](tiled_index<TS,TS> idx) restrict(direct3d)
    {
        int lr = idx.local[0];  int lc = idx.local[1];
        int gr = idx.global[0]; int gc = idx.global[1];
        float sum = 0.0f;
        for (int i = 0; i < hB; i += TS)
        {
            tile_static float locA[TS][TS], locB[TS][TS];
            locA[lr][lc] = a[gr * wA + lc + i];
            locB[lr][lc] = b[(lr + i) * wB + gc];
            idx.barrier.wait();
            for (int k = 0; k < TS; k++)
                sum += locA[lr][k] * locB[k][lc];
            idx.barrier.wait();
        }
        c[gr * wC + gc] = sum;
    });
}
Performance
C++ AMP performs as well as CUDA or OpenCL for matrix multiplication.
Implementation   Sequential        Implicit (unspecified tile)   Explicit (16 x 16 tile)   Shared mem (16 x 16 tile)
C++ AMP          1.317 ± 0.006 s   0.035 ± 0.008 s               0.030 ± 0.001 s           0.015 ± 0.002 s
CUDA             1.454 ± 0.008 s   n.a.                          0.046 ± 0.001 s           0.0150 ± 0.0003 s
OpenCL           1.448 ± 0.003 s   n.a.                          0.061 ± 0.002 s           0.033 ± 0.002 s

Random matrices A (480 x 640) x B (640 x 960) = C (480 x 960), single-precision floats.
Environment: NVIDIA GeForce GTX 470, an ATI Radeon HD 6450 (not used), and an Intel Q6600 @ 2.51 GHz (overclocked), 4 GB RAM, run on Windows 7 64-bit.
10 runs, first run thrown out (for a total of 9). Average ± S.E. of sample.
NB: this comparison is “Developer Preview vs. RTM”.
Bitonic sort

void bitonicSortSequential(int * data, int length)
{
    unsigned int log2length = log2(length);
    unsigned int checklength = pow2(log2length);
    for (int phase = 0; phase < log2length; ++phase)
    {
        int compares = length / 2;
        unsigned int phase2 = pow2((unsigned int)phase);
        for (int ig = 0; ig < compares; ++ig)
        {
            int cross, paired;
            orange_box(ig, phase2, cross, paired);
            swap(data[cross], data[paired]);
        }
        for (int level = phase - 1; level >= 0; --level)
        {
            unsigned int level2 = pow2((unsigned int)level);
            for (int ig = 0; ig < compares; ++ig)
            {
                int cross, paired;
                red_box(ig, level2, cross, paired);
                swap(data[cross], data[paired]);
            }
        }
    }
}
void bitonicSortSimple(int * data, int length)
{
    unsigned int log2length = log2(length);
    unsigned int checklength = pow2(log2length);
    static const int TS = 1024;
    array_view<int,1> a(length, data);
    for (int phase = 0; phase < log2length; ++phase)
    {
        int compares = length / 2;
        extent<1> e(compares);
        grid<1> g(e);
        unsigned int phase2 = pow2((unsigned int)phase);
        parallel_for_each(g.tile<TS>(), [phase2, a](tiled_index<TS> idx) restrict(direct3d)
        {
            int ig = idx.global[0];
            int cross, paired;
            orange_box(ig, phase2, cross, paired);
            swap(a[cross], a[paired]);
        });
        for (int level = phase - 1; level >= 0; --level)
        {
            unsigned int level2 = pow2((unsigned int)level);
            parallel_for_each(g.tile<TS>(), [level2, a](tiled_index<TS> idx) restrict(direct3d)
            {
                int ig = idx.global[0];
                int cross, paired;
                red_box(ig, level2, cross, paired);
                swap(a[cross], a[paired]);
            });
        }
    }
}
Performance
C++ AMP performs almost as well as CUDA for (naïve) bitonic sort.
Implementation   Sequential       Explicit (1024 tile)
C++ AMP          12.5 ± 0.4 s     0.42 ± 0.04 s
CUDA             12.6 ± 0.01 s    0.372 ± 0.002 s

Array of length 1024 * 32 * 16 * 16 = 8388608 integers.
Environment: NVIDIA GeForce GTX 470, an ATI Radeon HD 6450 (not used), and an Intel Q6600 @ 2.51 GHz (overclocked), 4 GB RAM, run on Windows 7 64-bit.
10 runs, first run thrown out (for a total of 9). Average ± S.E. of sample.
Bitonic sort with subsets of bitonic merge
Peters, H., O. Schulz-Hildebrandt, et al. (2011). "Fast in-place, comparison-based sorting with CUDA: a study with bitonic sort." Concurrency and Computation: Practice and Experience 23(7): 681-693.
Optimized bitonic sort
for (int i = 0; i < msize; ++i)
    mm[i] = a[base + memory[i]];
for (int i = 0; i < csize; i += 2)
{
    int cross = normalized_compares[i];
    int paired = normalized_compares[i+1];
    swap(mm[cross], mm[paired]);
}
for (int i = 0; i < msize; ++i)
    a[base + memory[i]] = mm[i];
• Optimization: compute multiple compares per thread using subsets of the bitonic merge of a given degree.
• Load the portion of the subset bitonic merge into registers.
• Perform the swaps in registers.
• Store the registers back to global memory.
• NB: the loop must be unrolled manually due to a compiler bug with #pragma unroll!
Performance
C++ AMP performs almost as well as CUDA for the optimized bitonic sort.
Implementation   Explicit (512 tile)   Degree opt., explicit (512 tile)
C++ AMP          0.285 ± 0.4 s         0.239 ± 0.001 s
CUDA             0.280 ± 0.001 s       0.187 ± 0.001 s

Array of length 1024 * 32 * 16 * 16 = 8388608 integers.
Environment: NVIDIA GeForce GTX 470, an ATI Radeon HD 6450 (not used), and an Intel Q6600 @ 2.51 GHz (overclocked), 4 GB RAM, run on Windows 7 64-bit.
10 runs, first run thrown out (for a total of 9). Average ± S.E. of sample.
Issues
All GPU memory must be an array_view<> (or array<>), so you must use an array index!!!
foo_bar[…] = …;
Pointer to struct/class unavailable, but struct/class is OK:

class foo {};
foo f;
…
parallel_for_each…
    f.bar();      // error: cannot convert 'this' pointer from 'const foo' to 'foo &'

foo f;
array_view<foo,1> xxx(1, &f);
parallel_for_each…
    xxx[0].bar(); // OK, but must use []'s. YUCK!
Issues
Easy to forget that after a parallel_for_each, results must be fetched via
1) calling array_view's synchronize() on the variable;
2) destruction of the array_view variable; or
3) an explicit copy for array<>.
Take care in comparing CUDA/OpenCL with C++ AMP:
1) C++ AMP may not pick the optimal accelerator if you have two GPUs;
2) C++ AMP picks the tile size if you don't specify it.

If you do pick a tile size, it must divide the grid extent exactly!
Not Covered
• Math library
  • e.g. acosf
• Atomic operations library
  • e.g. atomic_fetch_add
• Direct3D intrinsics
  • debugging (e.g. direct3d_printf), fences (e.g. __dp_d3d_all_memory_fence), float math (e.g. __dp_d3d_absf)
• Direct3D interop
  • *get_device, create_accelerator_view, make_array, *get_buffer
Visual Studio 11
• Organize
• Edit
• Design
• Build
• Browse
• Debug
• Profile
NB: VS Express does not contain C++ AMP! If you install Windows 8, make sure to install VS Ultimate 11 Developer preview.
C++ AMP Parallel Debugger
• Well-known Visual Studio debugging features
  • Launch, Attach, Break, Stepping, Breakpoints, DataTips
  • Tool windows
    • Processes, Debug Output, Modules, Disassembly, Call Stack, Memory, Registers, Locals, Watch, Quick Watch
• New features (for both CPU and GPU)
  • Parallel Stacks window, Parallel Watch window, Barrier
• New GPU-specific features
  • Emulator, GPU Threads window, race detection
I could not break on a statement in the kernel
Visual Studio 11 - Profiler
• Labels the statements that are most expensive in a routine.
Visual Studio 11 - Profiler
• Labels statements that are most expensive in a routine.
• Does not seem to label kernel code.
Concurrency Visualizer for GPU
• Direct3D-centric
  • Supports any library/programming model built on it
• Integrated GPU and CPU view
• Goal is to analyze high-level performance metrics
  • Memory copy overheads
  • Synchronization overheads across CPU/GPU
  • GPU activity and contention with other processes
Where is my GPU???
Summary
• Democratization of parallel hardware programmability
• Performance for the mainstream
• High-level abstractions in C++ (not C)
• State-of-the-art Visual Studio IDE
• Hardware abstraction platform
• Intent is to make C++ AMP an open specification
Ken’s blog comparing C++ AMP, CUDA, OpenCL
• http://domemtech.com/?p=1025
Daniel Moth's blog (PM of C++ AMP)
• http://www.danielmoth.com/Blog/
MSDN Native parallelism blog (team blog)
• http://blogs.msdn.com/b/nativeconcurrency/
MSDN Dev Center for Parallel Computing
• http://msdn.com/concurrency
MSDN Forums to ask questions
• http://social.msdn.microsoft.com/Forums/en/parallelcppnative/threads