TRANSCRIPT
Automatic Task-based Code Generation for High Performance DSEL
Antoine Tran Tan, Joel Falcou, Daniel Etiemble, Hartmut Kaiser
Université Paris-Sud - LRI - ParSys - Inria Postale; STE||AR Group - LSU
NumScale SAS
Meeting C++ 2014
1 of 28
The Expressiveness/Efficiency War

[Figure: three Performance-vs-Expressiveness charts. Single-Core Era: C/Fortran, C++, Java. Multi-Core/SIMD Era: Sequential, Threads, SIMD. Heterogeneous Era: Sequential, SIMD, Threads, GPU/Phi, Distributed.]

As the complexity of parallel systems grows, the expressiveness gap turns into an ocean.
2 of 28
Designing tools for Scientific Computing
Challenges
1. Be non-disruptive
2. Enable domain-driven optimizations
3. Provide an intuitive API for the user
4. Support a wide architectural landscape
Our Approach
- Design tools as C++ libraries (1)
- Design them as Domain Specific Embedded Languages (DSELs) (2+3)
- Use Generative Programming to deliver performance (4)
Objectives of the Talk
- Present one such tool we designed
- Expose some limitations of the DSEL medium
- Propose a different take on such tools
3 of 28
Outline
Introduction
The NT2 Library
Redesign for task parallelism
Benchmarks
Conclusion and Perspectives
4 of 28
NT2: The Numerical Template Toolbox

NT2 as a Scientific Computing Library
- Provides a simple, MATLAB-like DSEL
- Provides high-performance computing entities and primitives
- Is easily extendable

Components
- Uses Boost.SIMD for in-core optimizations
- Uses recursive parallel skeletons
- Supports task parallelism through Futures

5 of 28
The Numerical Template Toolbox

Principles
- table<T,S> is a simple, multidimensional array object that exactly mimics MATLAB array behavior and functionality
- 500+ functions usable directly either on table or on any scalar value, as in MATLAB

How does it work
- Take a .m file, copy it to a .cpp file
- Add #include <nt2/nt2.hpp> and make cosmetic changes
- Compile the file and link with libnt2.a

6 of 28
NT2 - From MATLAB to C++

MATLAB code:

    A1 = 1:1000;
    A2 = A1 + randn(size(A1));
    X = lu(A1*A1');
    rms = sqrt( sum(sqr(A1(:) - A2(:))) / numel(A1) );

NT2 code:

    table<double> A1 = _(1., 1000.);
    table<double> A2 = A1 + randn(size(A1));
    table<double> X = lu( mtimes(A1, trans(A1)) );
    double rms = sqrt( sum(sqr(A1(_) - A2(_))) / numel(A1) );
7 of 28
Domain Specific Embedded Languages

Domain Specific Languages
- Non-Turing-complete declarative languages
- Solve a single class of problems
- Express what to do instead of how to do it
- E.g.: SQL, MATLAB, …

From DSL to DSEL [Abrahams 2004]
- A DSL incorporates domain-specific notation, constructs, and abstractions as fundamental design considerations.
- A Domain Specific Embedded Language (DSEL) is simply a library that meets the same criteria
- Generative Programming is one way to design such libraries
8 of 28
Meta-programming as a Tool

Definition
Meta-programming is the writing of computer programs that analyse, transform and generate other programs (or themselves) as their data.

Meta-programmable Languages
- MetaOCaml: runtime code generation via code quoting
- Template Haskell: compile-time code generation via templates
- C++: compile-time code generation via templates

C++ meta-programming
- Relies on the Turing-complete C++ template sub-language
- Handles types and integral constants at compile time
- Template classes and functions act as code quoting
9 of 28
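The template sub-language can be seen in action in a minimal compile-time computation (a generic illustration, not NT2 code): the compiler itself evaluates the recursion, so the result is a compile-time constant.

```cpp
// Classic compile-time computation: class templates act as a pure
// functional language that the compiler evaluates during compilation.
template <unsigned N>
struct factorial
{
  static const unsigned value = N * factorial<N - 1>::value;
};

template <>
struct factorial<0>          // template specialization = base case
{
  static const unsigned value = 1;
};

// Checked entirely at compile time, before any code runs.
static_assert(factorial<5>::value == 120, "evaluated by the compiler");
```

The same mechanism that computes integers here can compute with types, which is what Expression Templates exploit below.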
The Expression Templates Idiom

Principles
- Relies on extensive operator overloading
- Carries semantic information around code fragments
- Introduces DSLs without disrupting the dev. chain

    matrix x(h,w), a(h,w), b(h,w);
    x = cos(a) + (b*a);

is captured at compile time as the expression tree

    expr< assign, expr<matrix&>
                , expr< plus, expr<cos, expr<matrix&>>
                            , expr<multiplies, expr<matrix&>, expr<matrix&>>>>(x, a, b);

and evaluated as a single loop nest:

    #pragma omp parallel for
    for(int j=0; j<h; ++j)
      for(int i=0; i<w; ++i)
        x(j,i) = cos(a(j,i)) + ( b(j,i) * a(j,i) );

Arbitrary transforms are applied on the meta-AST.

General Principles of Expression Templates

10 of 28
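A minimal sketch of the idiom, assuming a toy Vec type rather than NT2's actual table: operator+ builds a lightweight node that records the operation instead of computing it, and assignment evaluates the whole tree element by element, in one fused loop without temporaries.

```cpp
#include <cstddef>
#include <vector>

// Expression node: holds references to its operands and computes
// element i only when asked. Nothing is evaluated at construction.
template <typename L, typename R>
struct Add
{
  L const& l;
  R const& r;
  double operator[](std::size_t i) const { return l[i] + r[i]; }
  std::size_t size() const { return l.size(); }
};

struct Vec
{
  std::vector<double> data;
  explicit Vec(std::size_t n) : data(n) {}

  double  operator[](std::size_t i) const { return data[i]; }
  double& operator[](std::size_t i)       { return data[i]; }
  std::size_t size() const { return data.size(); }

  // Assignment walks the expression tree: one loop, no temporaries.
  template <typename E>
  Vec& operator=(E const& e)
  {
    for (std::size_t i = 0; i < size(); ++i) data[i] = e[i];
    return *this;
  }
};

// operator+ returns an Add node, not a Vec: the "code quoting" step.
template <typename L, typename R>
Add<L, R> operator+(L const& l, R const& r) { return {l, r}; }
```

With this in place, `c = a + b + a;` builds an `Add<Add<Vec,Vec>,Vec>` at compile time and evaluates it in a single pass; a real library like NT2 additionally pattern-matches such trees to pick optimized loop nests.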
The Expression Templates Idiom

Advantages
- Generic implementations become self-aware of optimizations
- The API abstraction level is arbitrarily high
- Accessible through high-level tools like Boost.Proto

10 of 28
Why Parallel Programming Models?

Limits of regular tools
- Unstructured parallelism is error-prone
- Low-level parallel tools are non-composable
- Both contribute to the Expressiveness Gap

Available Models
- Performance-centric: PRAM, LogP, BSP
- Pattern-centric: Futures, Skeletons
- Data-centric: HTA, PGAS

11 of 28
Parallel Skeletons [Cole 89]

Principles
- There are recurring patterns in parallel applications
- Those patterns can be generalized as Skeletons
- Applications are assembled as combinations of such patterns

Functional point of view
- Skeletons are Higher-Order Functions
- Skeletons support a compositional semantics
- Applications become compositions of stateless functions

12 of 28
Classical Skeletons
- Data-parallel: map, fold, scan
- Task-parallel: par, pipe, farm
- More complex: Distributable Homomorphisms, Divide & Conquer, …

12 of 28
Parallel Skeletons: The heart of NT2

Parallel Skeletons in NT2
- transform: apply an n-ary function over a subset of data
- fold: perform an n-ary reduction over a subset of data
- scan: perform an n-ary prefix reduction over a subset of data

Tied to families of loop nests
- Elementwise: calls to transform; can be nested with elementwise operations
- Reduction & scan: calls to fold or scan respectively; successive reductions or prefix scans are not nestable, but can contain a nested elementwise loop nest

13 of 28
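As a rough illustration (not NT2's implementation), the two data-parallel skeletons named above can be sketched as higher-order functions over sequential std:: algorithms; NT2's real versions dispatch the same interface to SIMD and multi-threaded back-ends.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// transform skeleton: apply a function elementwise over a data set.
// (Sequential stand-in for a parallel/SIMD loop nest.)
template <typename In, typename Out, typename F>
void transform_skel(In const& in, Out& out, F f)
{
  std::transform(in.begin(), in.end(), out.begin(), f);
}

// fold skeleton: reduce a data set with a binary function.
// (Sequential stand-in for a tree-shaped parallel reduction.)
template <typename In, typename T, typename F>
T fold_skel(In const& in, T init, F f)
{
  return std::accumulate(in.begin(), in.end(), init, f);
}
```

Because the user code only names the skeleton, the library is free to swap the loop body shown here for a parallel one without touching call sites.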
Parallel Skeletons extraction process

A = B / sum(C+D);

[AST diagram: = at the root with children A and / ; / has children B and sum; sum has child + with children C and D. The sum sub-tree maps to a fold skeleton, the surrounding expression to a transform skeleton.]
14 of 28
Parallel Skeletons extraction process

A = B / sum(C+D);

[The statement is split in two: the fold sub-tree is extracted first as tmp = sum(C+D), then the remaining transform is evaluated as A = B / tmp.]
15 of 28
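Spelled out by hand, the two extracted skeletons for this statement correspond to the following loop nests (an illustrative equivalent written over std::vector, not NT2-generated code):

```cpp
#include <cstddef>
#include <vector>

// Hand-written equivalent of the extraction of A = B / sum(C + D):
// first the fold producing the scalar tmp, then the elementwise
// transform that consumes it.
std::vector<double> eval(std::vector<double> const& B,
                         std::vector<double> const& C,
                         std::vector<double> const& D)
{
  // fold skeleton: tmp = sum(C + D)
  double tmp = 0.0;
  for (std::size_t i = 0; i < C.size(); ++i) tmp += C[i] + D[i];

  // transform skeleton: A = B / tmp
  std::vector<double> A(B.size());
  for (std::size_t i = 0; i < B.size(); ++i) A[i] = B[i] / tmp;
  return A;
}
```

The dependency is explicit: the transform cannot start before the fold completes, which is exactly the ordering the task-parallel redesign later relaxes to per-chunk granularity.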
The limits

Limits of Expression Templates
- Synchronization costs due to implicit barriers
- Under-exploitation of potential parallelism (e.g. concurrency, pipelining)
- Poor data locality and no inter-statement optimization

Actions!
- Upgrade NT2 to enable task parallelism
- Adapt the current skeletons for taskification
- Derive a dependency graph between statements

16 of 28
Outline
Introduction
The NT2 Library
Redesign for task parallelism
Benchmarks
Conclusion and Perspectives
16 of 28
Futures programming model

    future<int> f1 = async(foo1);
    ...
    int result = f1.get();

[Sequence diagram: the Client calls async(), which new()s a Promise and immediately returns a Future; get() blocks while the value is not ready; once the Promise set()s the value, get() returns it.]

17 of 28
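With standard C++11 facilities the protocol above looks as follows; foo1 is a placeholder task, and get() is the only point where the client may block.

```cpp
#include <future>

int foo1() { return 42; }   // placeholder asynchronous task

int demo()
{
  // async() sets up the promise/future pair and returns immediately.
  std::future<int> f1 = std::async(std::launch::async, foo1);

  // ... the client is free to do other work here ...

  // get() is the rendezvous: it blocks only if the value is not
  // ready yet, then returns it.
  return f1.get();
}
```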
Sequential composition

    future<int> f1 = async(foo1);
    future<int> f2 = f1.then(foo2);

[Sequence diagram: async() new()s a Promise and returns future1; then() new()s a second promise/future pair and returns future2; when the first promise is set(), foo2 runs and set()s the second, fulfilling future2.]

18 of 28
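future::then is an HPX / Concurrency TS facility, not standard C++11; the chaining above can be emulated with a second std::async that waits on the first future (at the cost of blocking a worker thread, which a real continuation avoids). foo1 and foo2 are placeholder tasks.

```cpp
#include <future>

int foo1()      { return 10; }     // placeholder first task
int foo2(int x) { return x + 1; }  // placeholder continuation

int demo()
{
  std::future<int> f1 = std::async(std::launch::async, foo1);

  // Emulate f1.then(foo2): launch a task that blocks on f1, then
  // applies foo2 to its result.
  std::future<int> f2 = std::async(std::launch::async,
                                   [&f1] { return foo2(f1.get()); });
  return f2.get();
}
```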
Parallel composition

    typedef future<int> F;

    F f1 = async(foo1);
    F f2 = async(foo2);

    future< tuple<F,F> > f3 = when_all(f1, f2);

[Diagram: foo1 and foo2 run concurrently; f3 becomes ready when both complete.]

19 of 28
Parallel composition

    typedef future<int> F;

    vector<F> join;
    join.push_back(async(foo1));
    join.push_back(async(foo2));

    F f3 = when_all(join).then(foo3);

[Diagram: foo1 and foo2 run concurrently; foo3 runs once both have completed.]

19 of 28
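Like then, when_all is an HPX / Concurrency TS primitive; in plain C++11 the final pattern above can be emulated by waiting on every future of the join set before invoking the continuation. foo1, foo2 and foo3 are placeholder tasks.

```cpp
#include <future>
#include <vector>

int foo1() { return 1; }              // placeholder parallel task
int foo2() { return 2; }              // placeholder parallel task
int foo3(int a, int b) { return a + b; }  // placeholder continuation

int demo()
{
  std::vector<std::future<int>> join;
  join.push_back(std::async(std::launch::async, foo1));
  join.push_back(std::async(std::launch::async, foo2));

  // Emulate when_all(join).then(foo3): both tasks run concurrently;
  // we only synchronize when the continuation needs their results.
  int a = join[0].get();
  int b = join[1].get();
  return foo3(a, b);
}
```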
Parallel Skeletons extraction process

A = B / sum(C+D);

[The two statements from the extraction step again: tmp = sum(C+D) as a fold, then A = B / tmp as a transform.]
20 of 28
Parallel Skeletons extraction process

A = B / sum(C+D);

[Diagram: both statements are split column-wise into chunks; a spawner (transform, Futures) launches per-chunk workers: a worker (fold, SIMD) computes tmp(k) = sum(C(:,k)+D(:,k)) and a worker (transform, SIMD) computes A(:,k) = B(:,k) / tmp(k), for k = 1..3.]
21 of 28
Outline
Introduction
The NT2 Library
Redesign for task parallelism
Benchmarks
Conclusion and Perspectives
21 of 28
Black and Scholes Option Pricing

NT2 Code

    table<float> blackscholes( table<float> const& Sa, table<float> const& Xa
                             , table<float> const& Ta, table<float> const& ra
                             , table<float> const& va )
    {
      table<float> da = sqrt(Ta);
      table<float> d1 = log(Sa/Xa) + (sqr(va)*0.5f+ra)*Ta/(va*da);
      table<float> d2 = d1 - va*da;

      return Sa*normcdf(d1) - Xa*exp(-ra*Ta)*normcdf(d2);
    }
22 of 28
Black and Scholes Option Pricing

NT2 Code - Manual Loop Fusion

    table<float> blackscholes( table<float> const& Sa, table<float> const& Xa
                             , table<float> const& Ta, table<float> const& ra
                             , table<float> const& va )
    {
      // Preallocate temporary tables
      table<float> da(extent(Ta)), d1(extent(Ta)), d2(extent(Ta)), R(extent(Ta));

      // tie merges the loop nests and increases cache locality
      tie(da,d1,d2,R) = tie( sqrt(Ta)
                           , log(Sa/Xa) + (sqr(va)*0.5f+ra)*Ta/(va*da)
                           , d1 - va*da
                           , Sa*normcdf(d1) - Xa*exp(-ra*Ta)*normcdf(d2) );
      return R;
    }
23 of 28
Black and Scholes performance for size 256M

[Bar chart, execution time in cycles/element per implementation: scalar 581, SIMD 140, NT2 23.]
24 of 28
Tiled LU algorithm

[Diagram: the matrix, stored as tiles A00..A22, is factorized over 7 steps; the per-tile kernels are DGETRF, DGESSM, DTSTRF and DSSSSM.]
25 of 28
LU performance for Problem Size 8000

[Line chart: performance in GFlop/s (0-150) versus number of cores (0-50) for NT2, PLASMA and Intel MKL.]
26 of 28
Outline
Introduction
The NT2 Library
Redesign for task parallelism
Benchmarks
Conclusion and Perspectives
26 of 28
Conclusion and Perspectives

Conclusion
- Extraction of asynchrony from DSEL statements
- Combination of Future-based task management with parallel skeletons
- Efficient for coarse-grain tasks

Perspectives
- Integration of application/memory-aware cost models
- Generalization of Futures to distributed systems & GPGPUs
- Follow us at http://www.github.com/MetaScale/nt2
27 of 28
Thanks for your attention
28 of 28