Daniel Moth, Parallel Computing Platform, Microsoft
Heterogeneous platform support in Visual Studio
Context
Code
Closing thoughts
146X: Interactive visualization of volumetric white matter connectivity
36X: Ionic placement for molecular dynamics simulation on GPU
19X: Transcoding HD video stream to H.264
17X: Simulation in Matlab using .mex file CUDA function
100X: Astrophysics N-body simulation
149X: Financial simulation of LIBOR model with swaptions
47X: GLAME@lab: An M-script API for linear algebra operations on GPU
20X: Ultrasound medical imaging for cancer diagnostics
24X: Highly optimized object oriented molecular dynamics
30X: Cmatch exact string matching to find similar proteins and gene sequences
| CPU | GPU |
| --- | --- |
| Low memory bandwidth | High memory bandwidth |
| Higher power consumption | Lower power consumption |
| Medium level of parallelism | High level of parallelism |
| Deep execution pipelines | Shallow execution pipelines |
| Random accesses | Sequential accesses |
| Supports general code | Supports data-parallel code |
| Mainstream programming | Niche/exotic programming |
CPUs and GPUs coming closer together…
…nothing settled in this space, things still in motion…
We have designed a mainstream solution not only for today, but also for tomorrow
Part of Visual C++
Visual Studio integration
STL-like library for multidimensional data
Builds on DirectX
performance
portability
productivity
Context
Code
Closing thoughts
    void AddArrays(int n, int * pA, int * pB, int * pC)
    {
        for (int i=0; i<n; i++)
        {
            pC[i] = pA[i] + pB[i];
        }
    }
How do we take the serial code on the left that runs on the CPU and convert it to run on the GPU?
    #include <amp.h>
    using namespace concurrency;

    void AddArrays(int n, int * pA, int * pB, int * pC)
    {
        array_view<int,1> a(n, pA);
        array_view<int,1> b(n, pB);
        array_view<int,1> sum(n, pC);
        parallel_for_each(
            sum.grid,
            [=](index<1> idx) restrict(direct3d)
            {
                sum[idx] = a[idx] + b[idx];
            }
        );
    }
    void AddArrays(int n, int * pA, int * pB, int * pC)
    {
        array_view<int,1> a(n, pA);
        array_view<int,1> b(n, pB);
        array_view<int,1> sum(n, pC);
        parallel_for_each(
            sum.grid,
            [=](index<1> idx) restrict(direct3d)
            {
                sum[idx] = a[idx] + b[idx];
            }
        );
    }
array_view variables captured and copied to device (on demand)
restrict(direct3d): tells the compiler to check that this code can execute on DirectX hardware
parallel_for_each: execute the lambda on the accelerator once per thread
grid: the number and shape of threads to execute the lambda
index: the thread ID that is running the lambda, used to index into captured arrays
array_view: wraps the data so that it can be operated on by the accelerator
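The execution model described by these annotations can be imitated on the CPU. The following is only a sketch in plain standard C++, not C++ AMP: `IntView` and `for_each_index` are hypothetical stand-ins for `array_view<int,1>` and `parallel_for_each`, and the loop runs the kernel serially where an accelerator would run one thread per index.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for array_view<int,1>: a non-owning pointer plus a length.
struct IntView {
    int* data;
    std::size_t size;
    int& operator[](std::size_t i) const { return data[i]; }
};

// Hypothetical serial stand-in for parallel_for_each: invokes the kernel once
// per index, the way the accelerator would run one thread per index.
template <typename Kernel>
void for_each_index(std::size_t count, Kernel kernel) {
    for (std::size_t idx = 0; idx < count; ++idx) {
        kernel(idx);
    }
}

void AddArraysEmulated(int n, int* pA, int* pB, int* pC) {
    const std::size_t count = static_cast<std::size_t>(n);
    IntView a{pA, count};
    IntView b{pB, count};
    IntView sum{pC, count};
    // The views are captured by value; each invocation handles one element.
    for_each_index(sum.size, [=](std::size_t idx) {
        sum[idx] = a[idx] + b[idx];
    });
}
```

The kernel body is the same single statement as in the slide; only the scheduling differs.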
index<N>: represents an N-dimensional point
extent<N>: number of elements in each dimension of an N-dimensional array
grid<N>: origin (index<N>) plus extent<N>
N can be any number
conveniences for up to 3 dimensions (z,y,x)
    index<1> i1(2);
    index<2> i2(0,2);
    index<3> i3(2,0,1);

    extent<3> e3(3,2,2);
    extent<2> e2(3,4);
    extent<1> e1(6);

    grid<3> g(index<3>(47,58,12), extent<3>(3,2,2));

    grid<3> g3(e3);
    grid<2> g2(e2);
    grid<1> g1(e1);

    // cubic indices from (-1,-1,-1) through (98,98,98)
    grid<3> g(index<3>(-1,-1,-1), extent<3>(100,100,100));
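To see why an origin of (-1,-1,-1) with an extent of (100,100,100) ends at (98,98,98), the arithmetic can be written out in plain C++. `Grid3`, `last_index` and `point_count` below are hypothetical names for this sketch and are not part of the library shown on the slides.

```cpp
#include <array>

// Hypothetical plain-C++ model of grid<3>: an origin point plus an extent.
struct Grid3 {
    std::array<int, 3> origin;  // first index in each dimension
    std::array<int, 3> extent;  // number of elements in each dimension
};

// Last valid index in each dimension: origin + extent - 1.
std::array<int, 3> last_index(const Grid3& g) {
    std::array<int, 3> last{};
    for (int d = 0; d < 3; ++d) {
        last[d] = g.origin[d] + g.extent[d] - 1;
    }
    return last;
}

// Total number of points the grid describes (one thread per point).
long long point_count(const Grid3& g) {
    return 1LL * g.extent[0] * g.extent[1] * g.extent[2];
}
```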
array<T,N>: multi-dimensional array of rank N with element type T
Storage lives on accelerator
    vector<int> v(96);
    extent<2> e(8,12);    // e.y == 8; e.x == 12;
    array<int,2> a(e, v.begin(), v.end());

    // in my lambda
    index<2> i(3,9);      // i.y == 3; i.x == 9;
    int o = a[i];         // = a(i[0], i[1]);
                          // = a(i.y, i.x)
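As a rough illustration of which element of `v` the index (3,9) would select, here is the offset arithmetic in plain C++. It assumes dense row-major storage with x varying fastest, which the slide does not state explicitly but which matches the `vA[y * W + i]` indexing in the serial matrix-multiply code later in the deck. `linear_offset` is a hypothetical helper.

```cpp
#include <cstddef>

// Hypothetical helper: linear offset of (y, x) in a dense 2-D block with
// `width` elements per row. Assumes row-major storage (x varies fastest).
constexpr std::size_t linear_offset(std::size_t y, std::size_t x, std::size_t width) {
    return y * width + x;
}

// For extent (8,12) and index (3,9): 3 * 12 + 9 = 45.
static_assert(linear_offset(3, 9, 12) == 45, "index (3,9) maps to element 45");
// The last index (7,11) maps to element 95, the last of the 96 elements.
static_assert(linear_offset(7, 11, 12) == 95, "index (7,11) maps to element 95");
```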
View on existing data on the CPU or GPU
Usage considerations
array_view<T,N>
array_view<const T,N>
array_view<writeonly<T>,N>
    vector<int> v(10);

    extent<2> e(2,5);
    array_view<int,2> a(e, v);

    // the above two lines can be written as
    // array_view<int,2> a(2,5,v);
| array<T,N> | array_view<T,N> |
| --- | --- |
| Rank at compile time | Rank at compile time |
| Extent at runtime | Extent at runtime |
| Rectangular | Rectangular |
| Dense | Dense in one dimension |
| Origin always at zero | Origin can be non-zero |
| Container for data | Wrapper for data |
| Explicit copy | Future proof design |
| Capture by reference [&] | Capture by value [=] |
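The last two rows of the comparison (container versus wrapper, capture by reference versus capture by value) can be illustrated with ordinary C++ lambdas. This is an analogy only: `std::vector` stands in for the owning `array`, and the hypothetical `FloatView` stands in for `array_view`. Copying a view copies just a pointer and a length, so a by-value capture still reaches the original storage.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical non-owning wrapper, standing in for array_view<float,1>.
struct FloatView {
    float* data;
    std::size_t size;
    float& operator[](std::size_t i) const { return data[i]; }
};

// Container: captured by reference, so the lambda modifies the original storage
// instead of a copy of every element.
void scale_container(std::vector<float>& storage, float factor) {
    auto kernel = [&storage, factor](std::size_t i) { storage[i] *= factor; };
    for (std::size_t i = 0; i < storage.size(); ++i) kernel(i);
}

// Wrapper: captured by value. The copy is shallow, so writes through it
// still land in the storage the view points at.
void scale_view(FloatView view, float factor) {
    auto kernel = [=](std::size_t i) { view[i] *= factor; };
    for (std::size_t i = 0; i < view.size; ++i) kernel(i);
}
```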
    parallel_for_each(
        grid<N>,
        [ ](index<N>) restrict(direct3d)
        {
            // kernel code
        }
    );
Executes the lambda for each point in the grid
As-if synchronous in terms of visible side-effects
Applies to functions (including lambdas)
Why restrict
Target-specific language restrictions
Optimizations or special code-gen behavior
Functions can have multiple restrictions
In 1st release we are implementing “direct3d” and “cpu”
“cpu” – the implicit default
Can only call other restrict(direct3d) functions
All functions must be inlinable
Only direct3d-supported types
int, unsigned int, float, double
structs & arrays of these types
Pointers and References
Lambdas cannot capture by reference, nor capture pointers
References and single-indirection pointers supported only as local variables and function arguments
No
recursion
'volatile'
virtual functions
pointers to functions
pointers to member functions
pointers in structs
pointers to pointers
No
goto or labeled statements
throw, try, catch
globals or statics
dynamic_cast or typeid
asm declarations
varargs
unsupported types, e.g. bool, char, short, long double
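As a small illustration of how these rules shape helper functions, the sketch below rewrites a recursive helper as a loop. It is plain C++ with hypothetical function names and carries no `restrict` clause, since that keyword belongs to the Visual C++ extension shown on the slides. The loop form uses only `float` and `unsigned int`, takes its arguments by value, and touches no globals or statics.

```cpp
// Recursive form: fine as ordinary CPU code, but recursion is on the "No" list.
float power_recursive(float base, unsigned int exponent) {
    return exponent == 0 ? 1.0f : base * power_recursive(base, exponent - 1);
}

// Loop form: same result for this input range, with no recursion,
// only float and unsigned int locals, and no globals or statics.
float power_iterative(float base, unsigned int exponent) {
    float result = 1.0f;
    for (unsigned int i = 0; i < exponent; ++i) {
        result *= base;
    }
    return result;
}
```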
    void MatrixMultiply(vector<float>& vC,
                        const vector<float>& vA,
                        const vector<float>& vB,
                        int M, int N, int W)
    {
        for (int y = 0; y < M; y++) {
            for (int x = 0; x < N; x++) {
                float sum = 0;
                for (int i = 0; i < W; i++)
                    sum += vA[y * W + i] * vB[i * N + x];
                vC[y * N + x] = sum;
            }
        }
    }
    void MatrixMultiply(vector<float>& vC,
                        const vector<float>& vA,
                        const vector<float>& vB,
                        int M, int N, int W)
    {
        array_view<const float,2> a(M,W,vA), b(W,N,vB);
        array_view<writeonly<float>,2> c(M,N,vC);
        parallel_for_each(c.grid,
            [=](index<2> idx) restrict(direct3d) {
                float sum = 0;
                for (int i = 0; i < a.extent.x; i++)
                    sum += a(idx.y, i) * b(i, idx.x);
                c[idx] = sum;
            }
        );
    }
restrict(direct3d, cpu)
parallel_for_each
class array<T,N>
class array_view<T,N>
class index<N>
class extent<N>
class grid<N>
class accelerator
class accelerator_view
Schedule threads in a tiled manner
Avoid thread index remapping
Gain ability to use tile static memory
parallel_for_each overload for tiles accepts
tiled_grid<X> or tiled_grid<Y,X> or tiled_grid<Z,Y,X>
a lambda which accepts
tiled_index<X> or tiled_index<Y,X> or tiled_index<Z,Y,X>
[Figure: two 8 x 6 thread grids, rows 0 to 7 and columns 0 to 5]
Given

    array_view<int,2> data(8, 6, pMyData);
    parallel_for_each(
        data.grid.tile<2,2>(),
        [=] (tiled_index<2,2> t_idx)… { … });

When the lambda is executed by thread T
t_idx.global = index<2> (6,3)
t_idx.local = index<2> (0,1)
t_idx.tile = index<2> (3,1)
t_idx.tile_origin = index<2> (6,2)
[Figure: 8 x 6 thread grid (rows 0 to 7, columns 0 to 5) with thread T marked in row 6]
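The four values listed for thread T follow from integer division and remainder on the global index. The sketch below reproduces them in plain C++, assuming a grid whose origin is at zero and non-negative indices. `Index2`, `TiledIndex2` and `decompose` are hypothetical names for this illustration only.

```cpp
#include <array>

using Index2 = std::array<int, 2>;  // (y, x), most significant dimension first

// Hypothetical plain-C++ mirror of the four tiled_index fields on the slide.
struct TiledIndex2 {
    Index2 global;       // position in the whole grid
    Index2 local;        // position inside the tile
    Index2 tile;         // which tile, counted in tiles
    Index2 tile_origin;  // global position of the tile's first element
};

// Assumes a grid origin of (0,0) and a non-negative global index.
TiledIndex2 decompose(Index2 global, int tile_y, int tile_x) {
    TiledIndex2 t{};
    t.global = global;
    t.tile = {global[0] / tile_y, global[1] / tile_x};
    t.local = {global[0] % tile_y, global[1] % tile_x};
    t.tile_origin = {t.tile[0] * tile_y, t.tile[1] * tile_x};
    return t;
}
```

With `decompose({6, 3}, 2, 2)` this gives local (0,1), tile (3,1) and tile_origin (6,2), the same values as on the slide.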
Within the kernel we can use
tile_static storage class
only applicable in restrict(direct3d)
indicates that the local variable is allocated in shared memory, i.e. shared by each thread in a tile of threads
class tile_barrier
synchronize all threads within a tile
e.g. myTiledIndex.barrier.wait();
    void MatrixMultiplySimple(float* A, float* B, float* C, int M, int N, int W)
    {
        extent<2> eA(M, N), eB(N, W), eC(M, W);
        grid<2> g(eC);
        array<float,2> mA(eA, A), mB (eB, B), mC (eC);

        parallel_for_each(g,
            [=, &mA, &mB, &mC] (index<2> idx) restrict(direct3d)
            {
                float temp = 0;

                for(int k = 0; k < N; k++)
                    temp += mA(idx.y, k) * mB(k, idx.x);

                mC(idx) = temp;
            }
        );
        copy(mC, C);
    }

    void MatrixMultiplyTiled(float* A, float* B, float* C, int M, int N, int W)
    {
        static const int TS = 16;
        extent<2> eA(M, N), eB(N, W), eC(M, W);
        grid<2> g(eC);
        array<float,2> mA (eA, A), mB (eB, B), mC (eC);

        parallel_for_each(g.tile< TS, TS >(),
            [=, &mA, &mB, &mC] (tiled_index< TS, TS> t_idx) restrict(direct3d)
            {
                float temp = 0;
                index<2> locIdx = t_idx.local;
                index<2> globIdx = t_idx.global;

                for (int i = 0; i < N; i += TS)
                {
                    tile_static float locB[TS][TS], locA[TS][TS];
                    locA[locIdx.y][locIdx.x] = mA(globIdx.y, i + locIdx.x);
                    locB[locIdx.y][locIdx.x] = mB(i + locIdx.y, globIdx.x);
                    t_idx.barrier.wait();

                    for (int k = 0; k < TS; k++)
                        temp += locA[locIdx.y][k] * locB[k][locIdx.x];
                    t_idx.barrier.wait();
                }
                mC[t_idx] = temp;
            }
        );
        copy(mC, C);
    }
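The tiling idea can be tried out on the CPU with a blocked loop nest. The sketch below is plain C++, not C++ AMP: the small local buffers only stand in for `tile_static` storage, the two inner loop nests play the role of the threads of one tile before and after the first barrier, and the function names are hypothetical. Like the tiled kernel as written, it assumes that M, N and W are multiples of the tile size.

```cpp
#include <cstddef>
#include <vector>

// Reference: C (M x W) = A (M x N) * B (N x W), all stored row-major.
std::vector<float> multiply_simple(const std::vector<float>& A,
                                   const std::vector<float>& B,
                                   int M, int N, int W) {
    std::vector<float> C(static_cast<std::size_t>(M) * W, 0.0f);
    for (int y = 0; y < M; ++y)
        for (int x = 0; x < W; ++x)
            for (int k = 0; k < N; ++k)
                C[y * W + x] += A[y * N + k] * B[k * W + x];
    return C;
}

// Blocked version: for each TS x TS tile of C, walk the shared dimension in
// chunks of TS and stage one block of A and one block of B in local buffers.
// Assumes M, N and W are multiples of TS.
template <int TS>
std::vector<float> multiply_tiled(const std::vector<float>& A,
                                  const std::vector<float>& B,
                                  int M, int N, int W) {
    std::vector<float> C(static_cast<std::size_t>(M) * W, 0.0f);
    for (int ty = 0; ty < M; ty += TS)
        for (int tx = 0; tx < W; tx += TS)
            for (int i = 0; i < N; i += TS) {
                float locA[TS][TS], locB[TS][TS];
                // Load phase: each (ly, lx) pair acts as one thread of the tile.
                for (int ly = 0; ly < TS; ++ly)
                    for (int lx = 0; lx < TS; ++lx) {
                        locA[ly][lx] = A[(ty + ly) * N + (i + lx)];
                        locB[ly][lx] = B[(i + ly) * W + (tx + lx)];
                    }
                // On the accelerator a barrier separates the two phases.
                // Compute phase: accumulate this chunk's products into C.
                for (int ly = 0; ly < TS; ++ly)
                    for (int lx = 0; lx < TS; ++lx)
                        for (int k = 0; k < TS; ++k)
                            C[(ty + ly) * W + (tx + lx)] += locA[ly][k] * locB[k][lx];
            }
    return C;
}
```

Both functions add the products for a given output element in the same order of k, so comparing their results on small inputs is a quick sanity check of the tiling arithmetic.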
restrict(direct3d, cpu)
parallel_for_each
class array<T,N>
class array_view<T,N>
class index<N>
class extent<N>
class grid<N>
class accelerator
class accelerator_view
class tiled_grid<Z,Y,X>
class tiled_index<Z,Y,X>
class tile_barrier
tile_static storage class
Context
Code
Closing thoughts
Organize
Edit
Design
Build
Browse
Debug
Profile
We are looking for developers wanting to use C++ AMP to participate in a study on the API and tools
For 45 minutes of your time you get
To peek at what we are thinking
To influence our product direction and our development team
A Microsoft product as a “thank you”
Sign up at the Microsoft Lounge Information Desk
Democratization of parallel hardware programmability
Performance for the mainstream
High-level abstractions in C++ (not C)
State-of-the-art Visual Studio IDE
Hardware abstraction platform
Intent is to make C++ AMP an open specification
AMD FUSION DEVELOPER SUMMIT | June 2011
Disclaimer & Attribution

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.

NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners.

The contents of this presentation were provided by individual(s) and/or company listed on the title page. The information and opinions presented in this presentation may not represent AMD’s positions, strategies or opinions. Unless explicitly stated, AMD is not responsible for the content herein and no endorsements are implied.