software abstractions for parallel hardware
TRANSCRIPT
Software Abstractions for Parallel Architectures
Joel Falcou
LRI - CNRS - INRIA
HDR Thesis Defense, 12/01/2014
The Paradigm Change in Science
From Experiments to Simulations
- Simulation is now an integral part of the Scientific Method
- Scientific Computing enables larger, faster, more accurate research
- Fast Simulation is Time Travel, as scientific results are now more readily available
Local Galaxy Cluster Simulation - Illustris project
Computing is first and foremost a mainstream science tool
The Parallel Hell
- Heat Wall: growing cores instead of GHz
- Hierarchical and heterogeneous parallel systems are the norm
- The Free Lunch is over, as hardware complexity rises faster than the average developer's skills
The real challenge in HPC is the Expressiveness/Efficiency War
2 of 41
The Expressiveness/Efficiency War
[Figure: Expressiveness vs. Performance across three eras.
Single Core Era: C/Fortran, C++, Java.
Multi-Core/SIMD Era: Sequential, Threads, SIMD.
Heterogeneous Era: Sequential, SIMD, Threads, GPU/Phi, Distributed.]
As the complexity of parallel systems grows, the expressiveness gap turns into an ocean
3 of 41
Designing tools for Scientific Computing
Objectives
1. Be non-disruptive
2. Domain driven optimizations
3. Provide intuitive API for the user
4. Support a wide architectural landscape
5. Be efficient
Our Approach
- Design tools as C++ libraries (1)
- Design these libraries as Domain Specific Embedded Languages (DSEL) (2+3)
- Use Parallel Programming Abstractions as parallel components (4)
- Use Generative Programming to deliver performance (5)
4 of 41
Talk Layout
Introduction
Abstractions & Efficiency
Experimental Results
Conclusion
5 of 41
Why Parallel Programming Models ?
Limits of regular tools
- Unstructured parallelism is error-prone
- Low-level parallel tools are non-composable
- Both contribute to the Expressiveness Gap
Available Models
- Performance-centric: PRAM, LogP, BSP
- Pattern-centric: Futures, Skeletons
- Data-centric: HTA, PGAS
6 of 41
Bulk Synchronous Parallelism [Valiant, McColl 90]
Principles
- Machine Model
- Execution Model
- Analytic Cost Model
[Figure: BSP Execution Model. Processors P0..P3 alternate a computation phase (cost w_max), a communication phase (cost h.g) and a barrier (cost L) within each superstep T, T+1, ...]
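The analytic cost model can be stated directly in code; a small sketch (the function name and parameters are mine, not from the slides):

```cpp
#include <algorithm>
#include <vector>

// BSP cost of one superstep: T = w_max + h*g + L, where
//   w_max : longest local computation among the processors
//   h     : largest number of words sent/received by any processor
//   g     : network cost per word
//   L     : barrier synchronisation latency
double superstep_cost(const std::vector<double>& work, double h, double g, double L) {
    double w_max = *std::max_element(work.begin(), work.end());
    return w_max + h * g + L;
}
```

The total cost of a program is then the sum of its superstep costs, which is what makes BSP programs amenable to pencil-and-paper performance reasoning.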
Advantages
- Simple set of primitives
- Implementable on any kind of hardware
- Possibility to reason about BSP programs
7 of 41
Parallel Skeletons [Cole 89]
Principles
- There are patterns in parallel applications
- Those patterns can be generalized as Skeletons
- Applications are assembled as combinations of such patterns
Functional point of view
- Skeletons are Higher-Order Functions
- Skeletons support a compositional semantics
- Applications become compositions of stateless functions
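A sequential sketch of the higher-order-function view, with std::transform and std::accumulate standing in for the two basic skeletons (illustrative names, not the API of any of the tools discussed later):

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

namespace skel {

// map skeleton: apply a stateless function to every element.
// Each application is independent, so this parallelises per element.
template<typename T, typename F>
std::vector<T> map(const std::vector<T>& in, F f) {
    std::vector<T> out(in.size());
    std::transform(in.begin(), in.end(), out.begin(), f);
    return out;
}

// fold skeleton: reduce with an associative operator.
// Associativity allows a tree-shaped parallel reduction.
template<typename T, typename Op>
T fold(const std::vector<T>& in, T init, Op op) {
    return std::accumulate(in.begin(), in.end(), init, op);
}

} // namespace skel
```

A sum of squares is then skel::fold(skel::map(v, square), 0, plus): a composition of stateless pieces that a parallel back-end can re-implement without touching user code.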
Classical Skeletons
- Data-parallel: map, fold, scan
- Task-parallel: par, pipe, farm
- More complex: Distributable Homomorphism, Divide & Conquer, ...
8 of 41
Relevance to our Objectives
Why use Parallel Skeletons?
- Write code independent of parallel programming minutiae
- Composability supports hierarchical architectures
- Code is scalable and easy to maintain
Why use BSP?
- The cost model guides development
- Few primitives keep the intellectual burden low
- A good medium for developing skeletons
How do we ensure the performance of these models' implementations?
9 of 41
Domain Specific Embedded Languages
Domain Specific Languages
- Non-Turing-complete declarative languages
- Solve a single class of problems
- Express what to do instead of how to do it
- E.g. SQL, MATLAB, ...
From DSL to DSEL [Abrahams 2004]
- A DSL incorporates domain-specific notation, constructs, and abstractions as fundamental design considerations.
- A Domain Specific Embedded Language (DSEL) is simply a library that meets the same criteria.
� Generative Programming is one way to design such libraries
10 of 41
Generative Programming [Eisenecker 97]
[Figure: a Domain-Specific Application Description enters a Translator (the Generative Component), which instantiates Parametric Sub-components to produce the Concrete Application]
11 of 41
Meta-programming as a Tool
Definition
Meta-programming is the writing of computer programs that analyse, transform and generate other programs (or themselves) as their data.
Meta-programmable Languages
- metaOCaml: runtime code generation via code quoting
- Template Haskell: compile-time code generation via templates
- C++: compile-time code generation via templates
C++ meta-programming
- Relies on the Turing-complete C++ template sub-language
- Handles types and integral constants at compile time
- Classes and functions act as code quoting
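As a minimal illustration of this compile-time sub-language, the classic template factorial (my example, not from the slides): the recursion is unrolled entirely during compilation, leaving a constant in the binary.

```cpp
// Compile-time computation: factorial<5>::value is evaluated by the
// compiler through template instantiation; no runtime cost remains.
template<unsigned N>
struct factorial {
    static const unsigned value = N * factorial<N - 1>::value;
};

// Explicit specialisation terminates the recursion.
template<>
struct factorial<0> {
    static const unsigned value = 1;
};

static_assert(factorial<5>::value == 120, "evaluated at compile time");
```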
12 of 41
The Expression Templates Idiom
Principles
- Relies on extensive operator overloading
- Carries semantic information around code fragments
- Introduces DSLs without disrupting the dev. chain
matrix x(h,w), a(h,w), b(h,w);
x = cos(a) + (b*a);

builds the compile-time AST:

expr< assign
    , expr<matrix&>
    , expr< plus
          , expr< cos, expr<matrix&> >
          , expr< multiplies, expr<matrix&>, expr<matrix&> >
          >
    >(x, a, b)

[Figure: AST of x = cos(a) + (b*a)]

which code generation turns into:

#pragma omp parallel for
for(int j=0; j<h; ++j)
  for(int i=0; i<w; ++i)
    x(j,i) = cos(a(j,i)) + ( b(j,i) * a(j,i) );

Arbitrary transforms applied on the meta-AST
General Principles of Expression Templates
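The mechanism fits in a few dozen lines; a minimal sketch (the names Vec and Plus are mine, far simpler than Boost.Proto) showing how operator overloading builds the AST and how assignment evaluates it in one fused loop:

```cpp
#include <cstddef>
#include <vector>

namespace et {

// AST node produced by operator+: stores references to its operands and
// evaluates lazily, element by element, when indexed.
template<typename L, typename R>
struct Plus {
    const L& l;
    const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

struct Vec {
    std::vector<double> data;
    explicit Vec(std::size_t n) : data(n) {}
    double  operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i)       { return data[i]; }

    // Assignment from any expression: one fused loop, no temporary vectors.
    template<typename E>
    Vec& operator=(const E& e) {
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = e[i];
        return *this;
    }
};

// Found by ADL only when an operand lives in namespace et.
template<typename L, typename R>
Plus<L, R> operator+(const L& l, const R& r) { return {l, r}; }

} // namespace et
```

With this, x = a + b never allocates an intermediate vector: the Plus node is a compile-time description of the computation, consumed by the assignment loop.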
Advantages
- The generic implementation becomes self-aware of optimizations
- The API abstraction level is arbitrarily high
- Accessible through high-level tools like Boost.Proto
13 of 41
Our Contributions
Our Strategy
- Apply DSEL generation techniques to parallel programming
- Maintain the low cost of abstraction through meta-programming
- Maintain the abstraction level via modern library design
Our contributions
Tools       Pub.      Scope                 Applications
Quaff       ParCo'06  MPI Skeletons         Real-time 3D reconstruction
SkellPU     PACT'08   Skeletons on Cell BE  Real-time image processing
BSP++       IJPP'12   MPI/OpenMP BSP        Bioinformatics, Model Checking
NT2         JPDC'14   Data-parallel MATLAB  Fluid Dynamics, Vision
14 of 41
Example of BSP++ Application
Khaled Hamidouche, PhD 2008-2011, in collab. with Univ. Brasilia

BSP Smith & Waterman
- S&W computes DNA sequence alignments
- The BSP++ implementation was written once and ran on 7 different hardware platforms
- Efficiency of 95+%, even on a 6000-core supercomputer
Platform              MaxSize     # Elements      Speedup  GCUPs
cluster (MPI)         1,072,950   128 cores       73x      6.53
cluster (MPI/OpenMP)  1,072,950   128 cores       116x     10.41
OpenMP                1,072,950   16 cores        16x      0.40
Cell BE               85,603      8 SPEs          -        0.14
cluster of Cell BEs   85,603      24 SPEs (8:24)  2.8x     0.37
Hopper (MPI)          5,303,436   3072 cores      260x     3.09
Hopper (MPI+OpenMP)   24,894,269  6144 cores      5664x    15.5
15 of 41
Second Look at our Contributions
Development Limitations
- DSELs are mostly tied to the domain model
- Architecture support is often an afterthought
- Extensibility is difficult, as many refactorings are required per architecture
- Example: no proper way to support GPUs with these implementation techniques
Proposed Method
- Extend Generative Programming to take the architecture into account
- Provide an architecture description DSEL
- Integrate this description into the code generation process
16 of 41
Architecture Aware Generative Programming
17 of 41
Software refactoring
Tools       Issues                     Changes
Quaff       Raw skeletons API          Re-engineered as part of NT2
SkellPU     Too architecture-specific  Re-engineered as part of NT2
BSP++       Integration issues         Integrate hybrid code generation
NT2         Not easily extendable      Integrate Quaff skeleton models
Boost.SIMD  -                          Side product of the NT2 restructuring

Conclusion
- Skeletons are fine as parallel middleware
- Model-based abstractions are not high-level enough
- For low-level architectures, the simplest model is often the best
18 of 41
Boost.SIMD
Pierre Estérie, PhD 2010-2014
Principles
- Provides a simple C++ API over SIMD extensions
- Supports every Intel, PPC and ARM instruction set
- Fully integrates with modern C++ idioms

Sparse Tridiagonal Solver: collaboration with M. Baboulin and Y. Wang
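The flavour of such an API can be suggested with a hand-written pack type (an illustrative sketch; Boost.SIMD's pack<T> exposes this style of interface but lowers each operator to SSE/AVX/NEON/AltiVec intrinsics rather than a scalar loop):

```cpp
#include <array>
#include <cstddef>

// Illustrative fixed-width "pack": one object holds a SIMD register's worth
// of values, and arithmetic is defined element-wise over the whole pack.
template<typename T, std::size_t N>
struct pack {
    std::array<T, N> v;

    pack operator+(const pack& o) const {
        pack r;
        for (std::size_t i = 0; i < N; ++i) r.v[i] = v[i] + o.v[i];
        return r;
    }
    pack operator*(const pack& o) const {
        pack r;
        for (std::size_t i = 0; i < N; ++i) r.v[i] = v[i] * o.v[i];
        return r;
    }
};
```

User code written against the pack abstraction stays identical whatever the target instruction set; only the lowering changes.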
19 of 41
Talk Layout
Introduction
Abstractions & Efficiency
Experimental Results
Conclusion
20 of 41
The Numerical Template Toolbox
Pierre Estérie, PhD 2010-2014
NT2 as a Scientific Computing Library
- Provides a simple, MATLAB-like interface for users
- Provides high-performance computing entities and primitives
- Is easily extendable
Components
- Uses Boost.SIMD for in-core optimizations
- Uses recursive parallel skeletons
- Supports task parallelism through Futures
21 of 41
The Numerical Template Toolbox
Principles
- table<T,S> is a simple, multidimensional array object that exactly mimics MATLAB array behaviour and functionality
- 500+ functions usable directly either on table or on any scalar value, as in MATLAB

How does it work?
- Take a .m file, copy it to a .cpp file
- Add #include <nt2/nt2.hpp> and make cosmetic changes
- Compile the file and link with libnt2.a
22 of 41
NT2 - From MATLAB to C++

MATLAB code:
A1 = 1:1000;
A2 = A1 + randn(size(A1));
X = lu(A1*A1');
rms = sqrt( sum(sqr(A1(:) - A2(:))) / numel(A1) );

NT2 code:
table<double> A1 = _(1., 1000.);
table<double> A2 = A1 + randn(size(A1));
table<double> X  = lu( mtimes(A1, trans(A1)) );
double rms = sqrt( sum(sqr(A1(_) - A2(_))) / numel(A1) );
23 of 41
Parallel Skeletons extraction process
A = B / sum(C+D);
[Figure: AST of A = B / sum(C+D); the sum node maps to a fold skeleton, the division to a transform skeleton]
24 of 41
[Figure: the statement is split into tmp = sum(C+D), executed by a fold skeleton, followed by A = B / tmp, executed by a transform skeleton]
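What the two extracted skeletons compute, written out sequentially (an illustrative sketch of the semantics, not NT2 code):

```cpp
#include <cstddef>
#include <vector>

// A = B / sum(C+D): a fold computes tmp = sum(C+D) in one fused pass,
// then a transform computes A[i] = B[i] / tmp element-wise. Both loops
// are embarrassingly parallel once tmp is known.
std::vector<double> evaluate(const std::vector<double>& B,
                             const std::vector<double>& C,
                             const std::vector<double>& D) {
    double tmp = 0.0;                                    // fold: sum(C+D)
    for (std::size_t i = 0; i < C.size(); ++i) tmp += C[i] + D[i];

    std::vector<double> A(B.size());                     // transform: B / tmp
    for (std::size_t i = 0; i < B.size(); ++i) A[i] = B[i] / tmp;
    return A;
}
```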
25 of 41
From data to task parallelism
Antoine Tran Tan, PhD 2012-2015
Limits of the fork-join model
- Synchronization costs due to implicit barriers
- Under-exploitation of potential parallelism
- Poor data locality and no inter-statement optimization
Skeletons from the Future
- Adapt current skeletons for taskification
- Use Futures (or HPX) to automatically pipeline statements
- Derive a dependency graph between statements
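The pipelining idea in standard C++ terms, using std::async and std::future (a sketch of the principle; NT2 uses its own workers and spawners, or HPX):

```cpp
#include <future>

// Two dependent statements pipelined with futures: stage 2 blocks only on
// the future it consumes, not on a global barrier, so independent work can
// overlap. Stage 1 plays the role of the fold, stage 2 of the transform.
double pipelined() {
    std::future<double> tmp = std::async(std::launch::async, [] {
        return 1.0 + 2.0;                   // stage 1: fold-like reduction
    });
    std::future<double> res = std::async(std::launch::async, [&tmp] {
        return 10.0 / tmp.get();            // stage 2: waits only on stage 1
    });
    return res.get();
}
```

The dependency graph is implicit in which futures each task reads; no barrier separates the statements.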
26 of 41
Parallel Skeletons extraction process - Take 2
A = B / sum(C+D);
[Figure: as before, the statement is split into tmp = sum(C+D), executed by a fold skeleton, and A = B / tmp, executed by a transform skeleton]
27 of 41
[Figure: the computation is tiled by column; for each tile k, a fold/SIMD worker computes tmp(k) = sum(C(:,k) + D(:,k)) and a transform/SIMD worker computes A(:,k) = B(:,k) / tmp(k); transform/OpenMP spawners distribute the workers over cores]
28 of 41
Motion Detection
Lacassagne et al., ICIP 2009

- Sigma-Delta algorithm based on background subtraction
- Uses a local Gaussian model of lightness variation to detect motion
- Challenge: very low arithmetic density
- Challenge: integer-based implementation with a small value range
29 of 41
Motion Detection
table<char> sigma_delta( table<char>& background
                       , table<char> const& frame
                       , table<char>& variance )
{
  // Estimate raw movement
  background = selinc( background < frame
                     , seldec(background > frame, background) );

  table<char> diff = dist(background, frame);

  // Compute local variance
  table<char> sig3 = muls(diff, 3);
  variance = if_else( diff != 0
                    , selinc( variance < sig3
                            , seldec(variance > sig3, variance) )
                    , variance );

  // Generate movement label
  return if_zero_else_one( diff < variance );
}
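For reference, the per-pixel logic that the NT2 version vectorises, written as scalar C++ with the saturated arithmetic spelled out by hand (a sketch of the algorithm; the function name is mine):

```cpp
#include <cstdint>

// Sigma-Delta per pixel: the background estimate steps by +/-1 toward the
// current frame, the variance steps by +/-1 toward 3*|difference|, and the
// pixel is labelled "motion" (1) when the difference reaches the variance.
std::uint8_t sigma_delta_pixel(std::uint8_t& background,
                               std::uint8_t  frame,
                               std::uint8_t& variance) {
    if (background < frame)      ++background;
    else if (background > frame) --background;

    std::uint8_t diff = background > frame ? background - frame
                                           : frame - background;
    if (diff != 0) {
        // saturated 3*diff so the 8-bit range is never exceeded
        std::uint8_t sig3 = diff > 85 ? 255 : std::uint8_t(3 * diff);
        if (variance < sig3)      ++variance;
        else if (variance > sig3) --variance;
    }
    return diff < variance ? 0 : 1;
}
```

Every operation is a compare, increment or small multiply on 8-bit data, which is why the arithmetic density is so low and SIMD over packed bytes pays off.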
30 of 41
Motion Detection
[Figure: cycles/element for 512x512 and 1024x1024 images, comparing scalar, half core, full core, SIMD (JRTIP2008), SIMD + half core and SIMD + full core versions; speedups over scalar range from x2.1 to x18]
31 of 41
Black and Scholes Option Pricing
NT2 Code

table<float> blackscholes( table<float> const& Sa, table<float> const& Xa
                         , table<float> const& Ta, table<float> const& ra
                         , table<float> const& va )
{
  table<float> da = sqrt(Ta);
  table<float> d1 = (log(Sa/Xa) + (sqr(va)*0.5f+ra)*Ta) / (va*da);
  table<float> d2 = d1 - va*da;
  return Sa*normcdf(d1) - Xa*exp(-ra*Ta)*normcdf(d2);
}
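For reference, the same price computed for a single option in scalar C++, with normcdf expressed via std::erfc (a sketch, not NT2 code):

```cpp
#include <cmath>

// Standard normal CDF via the complementary error function.
double normcdf(double x) { return 0.5 * std::erfc(-x / std::sqrt(2.0)); }

// Black & Scholes call price:
//   d1 = (ln(S/X) + (v^2/2 + r) T) / (v sqrt(T)),  d2 = d1 - v sqrt(T)
//   C  = S N(d1) - X e^{-rT} N(d2)
double blackscholes(double S, double X, double T, double r, double v) {
    double d  = std::sqrt(T);
    double d1 = (std::log(S / X) + (v * v * 0.5 + r) * T) / (v * d);
    double d2 = d1 - v * d;
    return S * normcdf(d1) - X * std::exp(-r * T) * normcdf(d2);
}
```

The NT2 version computes exactly this, but over whole tables at once, each statement becoming a candidate loop nest for SIMD and multicore code generation.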
32 of 41
Black and Scholes Option Pricing
NT2 Code with loop fusion

table<float> blackscholes( table<float> const& Sa, table<float> const& Xa
                         , table<float> const& Ta, table<float> const& ra
                         , table<float> const& va )
{
  // Preallocate temporary tables
  table<float> da(extent(Ta)), d1(extent(Ta)), d2(extent(Ta)), R(extent(Ta));

  // tie merges the loop nests and increases cache locality
  tie(da, d1, d2, R) = tie( sqrt(Ta)
                          , (log(Sa/Xa) + (sqr(va)*0.5f+ra)*Ta) / (va*da)
                          , d1 - va*da
                          , Sa*normcdf(d1) - Xa*exp(-ra*Ta)*normcdf(d2) );
  return R;
}
32 of 41
Black and Scholes Option Pricing
Performance

[Figure: cycles/value at size 1,000,000: scalar, SSE2 (x1.89), AVX2 (x2.91), SSE2 + 4 cores (x5.58), AVX2 + 4 cores (x6.30)]
33 of 41
Black and Scholes Option Pricing
Performance with loop fusion/futurisation

[Figure: cycles/value at size 1,000,000: scalar, SSE2 (x2.27), AVX2 (x4.13), SSE2 + 4 cores (x8.05), AVX2 + 4 cores (x11.12)]
34 of 41
LU Decomposition
Algorithm

[Figure: tiled LU decomposition in seven steps over the tile grid A00..A22, using the kernels DGETRF (diagonal tile factorisation), DGESSM (row update), DTSTRF (panel factorisation) and DSSSSM (trailing-submatrix update)]
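The kernels above tile the classical elimination; for reference, a minimal unblocked, unpivoted LU in plain C++ (a sketch of the underlying algorithm, not the NT2 implementation):

```cpp
#include <cstddef>
#include <vector>

// In-place LU factorisation without pivoting of an n x n row-major matrix:
// on return, the upper triangle holds U and the strict lower triangle holds
// L (unit diagonal). The tiled algorithm performs this same elimination
// tile by tile through the DGETRF/DGESSM/DTSTRF/DSSSSM kernels.
void lu(std::vector<double>& A, std::size_t n) {
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t i = k + 1; i < n; ++i) {
            A[i * n + k] /= A[k * n + k];                    // multiplier l_ik
            for (std::size_t j = k + 1; j < n; ++j)
                A[i * n + j] -= A[i * n + k] * A[k * n + j]; // trailing update
        }
}
```

Tiling turns each of these updates into a task on a tile, which is what exposes the inter-task parallelism exploited in the next slide's benchmark.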
35 of 41
LU Decomposition
Performance

[Figure: median GFLOPS vs. number of cores (up to 48) for an 8000 x 8000 LU decomposition, comparing NT2 with Intel MKL]
36 of 41
Talk Layout
Introduction
Abstractions & Efficiency
Experimental Results
Conclusion
37 of 41
Conclusion
Parallel Computing for Scientists
- Software libraries built as generic and generative components can solve a large chunk of parallelism-related problems while remaining easy to use
- Like regular languages, DSELs need information about the hardware system
- Integrating hardware descriptions as generic components increases tool portability and re-targetability

Our Achievements
- A new method for parallel software development
- Efficient libraries working on a large subset of hardware
- A high level of performance across a wide application spectrum
38 of 41
Works in Progress
Application to Accelerators
- Exploration of proper skeleton implementations on GPUs
- Adaptation of the Future-based code generator
- In progress with Ian Masliah's PhD thesis
Parallelism within C++
- SIMD as part of the standard library
- Proposal N3571 for standard SIMD computation
- Interoperability with the current C++ parallel model
39 of 41
Perspectives
DSEL as a C++ first-class idiom
- Build partial evaluation into the language
- Ease the transition between regular and meta C++
- Mid-term prospect: metaOCaml-like quoting for C++
DSEL and compilers relationship
- C++ DSELs hit a limit in their applicability
- Compilers often lack the high-level information needed for proper optimization
- Mid-term prospect: hybrid library/compiler approaches for DSELs
40 of 41
Thanks for your attention