domain speci c embedded languages in c++ · joel falcou numscale may 27, 2016. unlocked software...

50
Domain Specic Embedded Languages in C++ Contributions to HPC Unlocked software performance Joel Falcou NumScale May 27, 2016

Upload: others

Post on 21-May-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Domain Specic Embedded Languages in C++Contributions to HPC

Unlocked software performance

Joel Falcou

NumScale

May 27, 2016

Page 2: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

The Paradigm Change in Science

From Experiments to Simulations

� Simulations is now an integralpart of the Scientic Method

� Scientic Computing enableslarger, faster, more accurateResearch

� Fast Simulation is Time Travel asscientic results are now morereadily available

Local Galaxy Cluster Simulation - Illustris project

Computing is rst and foremost a mainstream science tool

2 of 37

Page 3: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

The Paradigm Change in Science

The Parallel Hell

� Heat Wall: Growing coresinstead of GHz

� Hierarchical and heterogeneousparallel systems are the norm

� The Free Lunch is over ashardware complexity rises fasterthan the average developer skills

Local Galaxy Cluster Simulation - Illustris project

The real challenge in HPC is the Expressiveness/Efficiency War

2 of 37

Page 4: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

The Expressiveness/Efficiency War

Single Core Era

Performance

Expressiveness

C/Fort.

C++

Java

Multi-Core/SIMD Era

Performance

Expressiveness

Sequential

Threads

SIMD

Heterogenous Era

Performance

Expressiveness

Sequential

SIMD

Threads

GPUPhi

Distributed

As parallel systems complexity grows, the expressiveness gap turns into an ocean

3 of 37

Page 5: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

Designing tools for Scientic Computing

Objectives

1. Be non-disruptive

2. Domain driven optimizations

3. Provide intuitive API for the user

4. Support a wide architectural landscape

5. Be efficient

Our Approach

� Design tools as C++ libraries (1)� Design these libraries as Domain Specic Embedded Languages (DSEL) (2+3)� Use Parallel Programming Abstractions as parallel components (4)� Use Generative Programming to deliver performance (5)

4 of 37

Page 6: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

Designing tools for Scientic Computing

Objectives

1. Be non-disruptive

2. Domain driven optimizations

3. Provide intuitive API for the user

4. Support a wide architectural landscape

5. Be efficient

Our Approach

� Design tools as C++ libraries (1)� Design these libraries as Domain Specic Embedded Languages (DSEL) (2+3)� Use Parallel Programming Abstractions as parallel components (4)� Use Generative Programming to deliver performance (5)

4 of 37

Page 7: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

Talk Layout

Introduction

Abstractions & Efficiency

Experimental Results

Conclusion

5 of 37

Page 8: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

Why Parallel Programming Models ?

Limits of regular tools

� Unstructured parallelism is error-prone� Low level parallel tools are non-composable� Contribute to the Expressiveness Gap

Available Models

� Performance centric: P-RAM, LOG-P, BSP� Pattern centric: Futures, Skeletons� Data centric: HTA, PGAS

6 of 37

Page 9: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

Why Parallel Programming Models ?

Limits of regular tools

� Unstructured parallelism is error-prone� Low level parallel tools are non-composable� Contribute to the Expressiveness Gap

Available Models

� Performance centric: P-RAM, LOG-P, BSP� Pattern centric: Futures, Skeletons� Data centric: HTA, PGAS

6 of 37

Page 10: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

Why Parallel Programming Models ?

Limits of regular tools

� Unstructured parallelism is error-prone� Low level parallel tools are non-composable� Contribute to the Expressiveness Gap

Available Models

� Performance centric: P-RAM, LOG-P, BSP� Pattern centric: Futures, Skeletons� Data centric: HTA, PGAS

6 of 37

Page 11: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

Parallel Skeletons [Cole 89]

Principles

� There are patterns in parallel applications� Those patterns can be generalized in Skeletons� Applications are assembled as a combination of such patterns

Functional point of view

� Skeletons are Higher-Order Functions� Skeletons support a compositionnal semantic� Applications become composition of state-less functions

7 of 37

Page 12: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

Parallel Skeletons [Cole 89]

Principles

� There are patterns in parallel applications� Those patterns can be generalized in Skeletons� Applications are assembled as a combination of such patterns

Classical Skeletons

� Data parallel: map, fold, scan� Task parallel: par, pipe, farm� More complex: Distribuable Homomorphism, Divide & Conquer, …

7 of 37

Page 13: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

Domain Specic Embedded Languages

Domain Specic Languages

� Non-Turing complete declarative languages� Solve a single type of problems� Express what to do instead of how to do it� E.g: SQL, M, M, …

From DSL to DSEL [Abrahams 2004]

� A DSL incorporates domain-specic notation, constructs, and abstractions asfundamental design considerations.

� A Domain Specic Embedded Languages (DSEL) is simply a library that meets thesame criteria

� Generative Programming is one way to design such libraries

8 of 37

Page 14: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

Generative Programming [Eisenecker 97]

Domain SpecificApplication Description

Generative Component Concrete Application

Translator

Parametric Sub-components

9 of 37

Page 15: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

Meta-programming as a Tool

DenitionMeta-programming is the writing of computer programs that analyse, transform andgenerate other programs (or themselves) as their data.

Meta-programmable Languages

� metaOCAML : runtime code generation via code quoting� Template Haskell : compile-time code generation via templates� C++ : compile-time code generation via templates

C++ meta-programming

� Relies on the Turing-complete C++ sub-language� Handles types and integral constants at compile-time� classes and functions act as code quoting

10 of 37

Page 16: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

The Expression Templates Idiom

Principles

� Relies on extensive operatoroverloading

� Carries semantic informationaround code fragment

� Introduces DSLs withoutdisrupting dev. chain

matrix x(h,w),a(h,w),b(h,w);

x = cos(a) + (b*a);

expr<assign ,expr<matrix&> ,expr<plus , expr<cos ,expr<matrix&> > , expr<multiplies ,expr<matrix&> ,expr<matrix&> > >(x,a,b);

+

*cos

a ab

=

x

#pragma omp parallel forfor(int j=0;j<h;++j){ for(int i=0;i<w;++i) { x(j,i) = cos(a(j,i)) + ( b(j,i) * a(j,i) ); }}

Arbitrary Transforms appliedon the meta-AST

General Principles of Expression Templates

11 of 37

Page 17: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

The Expression Templates Idiom

Advantages

� Generic implementation becomesself-aware of optimizations

� API abstraction level is arbitraryhigh

� Accessible through high-leveltools like B.P

matrix x(h,w),a(h,w),b(h,w);

x = cos(a) + (b*a);

expr<assign ,expr<matrix&> ,expr<plus , expr<cos ,expr<matrix&> > , expr<multiplies ,expr<matrix&> ,expr<matrix&> > >(x,a,b);

+

*cos

a ab

=

x

#pragma omp parallel forfor(int j=0;j<h;++j){ for(int i=0;i<w;++i) { x(j,i) = cos(a(j,i)) + ( b(j,i) * a(j,i) ); }}

Arbitrary Transforms appliedon the meta-AST

General Principles of Expression Templates

11 of 37

Page 18: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

Our Contributions

Our Strategy

� Applies DSEL generation techniques to parallel programming� Maintains low cost of abstractions through meta-programming� Maintains abstraction level via modern library design

Our contributions

Tools Pub. Scope ApplicationsQuaff ParCo’06 MPI Skeletons Real-time 3D reconstructionSkellBE PACT’08 Skeleton on Cell BE Real-time Image processingBSP++ IJPP’12 MPI/OpenMP BSP Bioinformatics, Model CheckingNT2 JPDC’14 Data Parallel Matlab Fluid Dynamics, Vision

12 of 37

Page 19: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

Second Look at our Contributions

Development Limitations

� DSELs are mostly tied to the domain model� Architecture support is often an afterthought� Extensibility is difficult as many refactoring are required per architecture� Example : No proper way to support GPUs with those implementation techniques

Proposed Method

� Extends Generative Programming to take this architecture into account� Provides an architecture description DSEL� Integrates this description in the code generation process

13 of 37

Page 20: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

Second Look at our Contributions

Development Limitations

� DSELs are mostly tied to the domain model� Architecture support is often an afterthought� Extensibility is difficult as many refactoring are required per architecture� Example : No proper way to support GPUs with those implementation techniques

Proposed Method

� Extends Generative Programming to take this architecture into account� Provides an architecture description DSEL� Integrates this description in the code generation process

13 of 37

Page 21: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

Architecture Aware Generative Programming

14 of 37

Page 22: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

Software refactoring

Tools Issues ChangesQuaff Raw skeletons API Re-engineered as part of NT2

SkellBE Too architecture specic Re-engineered as part of NT2

BSP++ Integration issues Integrate hybrid code generationNT2 Not easily extendable Integrate Quaff Skeleton modelsBoost.SIMD - Side product of NT2 restructuration

Conclusion� Skeletons are ne as parallel middleware� Model based abstractions are not high level enough� For low level architectures, the simplest model is often the best

15 of 37

Page 23: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

Boost.SIMDPierre Estérie PHD 2010-2014

Principles

� Provides simple C++ API over SIMDextensions

� Supports every Intel and PPCinstructions sets

� Fully integrates with modern C++idioms

Sparse Tridiagonal Solver - collaboration with M. Baboulin and Y. wang

16 of 37

Page 24: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

Talk Layout

Introduction

Abstractions & Efficiency

Experimental Results

Conclusion

17 of 37

Page 25: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

The Numerical Template ToolboxPierre Estérie PHD 2010-2014

NT2 as a Scientic Computing Library

� Provides a simple, M-like interface for users� Provides high-performance computing entities and primitives� Is easily extendable

Components

� Uses Boost.SIMD for in-core optimizations� Uses recursive parallel skeletons� Supports task parallelism through Futures

18 of 37

Page 26: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

The Numerical Template Toolbox

Principles

� table<T,S> is a simple, multidimensional array object that exactlymimics M array behavior and functionalities

� 500+ functions usable directly either on table or on any scalar valuesas in M

How does it works

� Take a .m le, copy to a .cpp le� Add #include <nt2/nt2.hpp> and do cosmetic changes� Compile the le and link with libnt2.a

19 of 37

Page 27: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

The Numerical Template Toolbox

Principles

� table<T,S> is a simple, multidimensional array object that exactlymimics M array behavior and functionalities

� 500+ functions usable directly either on table or on any scalar valuesas in M

How does it works

� Take a .m le, copy to a .cpp le

� Add #include <nt2/nt2.hpp> and do cosmetic changes� Compile the le and link with libnt2.a

19 of 37

Page 28: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

The Numerical Template Toolbox

Principles

� table<T,S> is a simple, multidimensional array object that exactlymimics M array behavior and functionalities

� 500+ functions usable directly either on table or on any scalar valuesas in M

How does it works

� Take a .m le, copy to a .cpp le� Add #include <nt2/nt2.hpp> and do cosmetic changes

� Compile the le and link with libnt2.a

19 of 37

Page 29: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

The Numerical Template Toolbox

Principles

� table<T,S> is a simple, multidimensional array object that exactlymimics M array behavior and functionalities

� 500+ functions usable directly either on table or on any scalar valuesas in M

How does it works

� Take a .m le, copy to a .cpp le� Add #include <nt2/nt2.hpp> and do cosmetic changes� Compile the le and link with libnt2.a

19 of 37

Page 30: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

NT2 - From M to C++

M codeA1 = 1:1000;A2 = A1 + randn(size(A1));X = lu(A1*A1’);

rms = sqrt( sum(sqr(A1(:) - A2(:))) / numel(A1) );

NT2 codetable <double > A1 = _(1. ,1000.);table <double > A2 = A1 + randn(size(A1));table <double > X = lu( mtimes(A1 , trans(A1) );

double rms = sqrt( sum(sqr(A1(_) - A2(_))) / numel(A1) );

20 of 37

Page 31: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

Parallel Skeletons extraction process

A = B / sum(C+D);

=

A /

B sum

+

C D

fold

transform

21 of 37

Page 32: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

Parallel Skeletons extraction process

A = B / sum(C+D);

; ;

=

A /

B sum

+

C D

fold

transform

=

tmp sum

+

C D

fold

⇒=

A /

B tmp

transform

22 of 37

Page 33: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

From data to task parallelismAntoine Tran Tan PHD, 2012-2015

Limits of the fork-join model

� Synchronization cost due to implicit barriers� Under-exploitation of potential parallelism� Poor data locality and no inter-statement optimization

Skeletons from the Future

� Adapt current skeletons for taskication� Use Futures ( or HPX) to automatically pipeline� Derive a dependency graph between statements

23 of 37

Page 34: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

From data to task parallelismAntoine Tran Tan PHD, 2012-2015

Limits of the fork-join model

� Synchronization cost due to implicit barriers� Under-exploitation of potential parallelism� Poor data locality and no inter-statement optimization

Skeletons from the Future

� Adapt current skeletons for taskication� Use Futures ( or HPX) to automatically pipeline� Derive a dependency graph between statements

23 of 37

Page 35: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

Parallel Skeletons extraction process - Take 2

A = B / sum(C+D);

; ;

=

tmp sum

+

C D

fold

=

A /

B tmp

transform

24 of 37

Page 36: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

Parallel Skeletons extraction process - Take 2

A = B / sum(C+D);

fold

=

tmp(3) sum(3)

+

C(:, 3) D(:, 3)transform

=

A(:, 3) /

B(:, 3) tmp(3)

fold

=

tmp(2) sum(2)

+

C(:, 2) D(:, 2)transform

=

A(:, 2) /

B(:, 2) tmp(2)

workerfold,simd

=

tmp(1) sum

+

C(:, 1) D(:, 1)workertransform,simd

=

A(:, 1) /

B(:, 1) tmp(1)

spawnertransform,OpenMP

spawnertransform,OpenMP

;

25 of 37

Page 37: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

Motion DetectionLacassagne et al., ICIP 2009

� Sigma-Delta algorithm based on background substraction� Use local gaussian model of lightness variation to detect motion� Challenge: Very low arithmetic density� Challenge: Integer-based implementation with small range

26 of 37

Page 38: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

Motion Detection

table <char > sigma_delta( table <char >& background, table <char > const& frame, table <char >& variance)

{// Estimate Raw Movementbackground = selinc( background < frame

, seldec(background > frame , background));

table <char > diff = dist(background , frame);

// Compute Local Variancetable <char > sig3 = muls(diff ,3);

var = if_else( diff != 0, selinc( variance < sig3

, seldec( var > sig3 , variance))

, variance);

// Generate Movement Labelreturn if_zero_else_one( diff < variance );

}

27 of 37

Page 39: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

Motion Detection

0

2

4

6

8

10

12

14

16

18

512x512 1024x1024

cycle

s/e

lem

ent

Image Size (N x N)

x6.8

x14.8

x16.5

x2.1

x3.6

x6.7

x15.3

x18

x2.3

x3.9

9

x10.8

x10.8

SCALARHALF COREFULL CORE

SIMDJRTIP2008

SIMD + HALF CORESIMD + FULL CORE

28 of 37

Page 40: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

Black and Scholes Option Pricing

NT2 Codetable <float > blackscholes( table <float > const& Sa, table <float > const& Xa

, table <float > const& Ta, table <float > const& ra, table <float > const& va)

{table <float > da = sqrt(Ta);table <float > d1 = log(Sa/Xa) + (sqr(va)*0.5f+ra)*Ta/(va*da);table <float > d2 = d1-va*da;

return Sa*normcdf(d1)- Xa*exp(-ra*Ta)*normcdf(d2);}

29 of 37

Page 41: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

Black and Scholes Option Pricing

NT2 Code with loop fusiontable <float > blackscholes( table <float > const& Sa, table <float > const& Xa

, table <float > const& Ta, table <float > const& ra, table <float > const& va)

{// Preallocate temporary tablestable <float > da(extent(Ta)), d1(extent(Ta)), d2(extent(Ta)), R(extent(Ta));

// tie merge loop nest and increase cache localitytie(da,d1 ,d2,R) = tie( sqrt(Ta)

, log(Sa/Xa) + (sqr(va)*0.5f+ra)*Ta/(va*da), d1-va*da, Sa*normcdf(d1)- Xa*exp(-ra*Ta)*normcdf(d2));

return R;}

29 of 37

Page 42: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

Black and Scholes Option PricingPerformance

1000000

0

50

100

150

x1.8

9

x2.9

1

x5.5

8

x6.3

0

Size

cycle/value

scalar

SSE2

AVX2

SSE2, 4 cores

AVX2, 4 cores

30 of 37

Page 43: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

Black and Scholes Option PricingPerformance with loop fusion/futurisation

1000000

0

50

100

150

x2.2

7

x4.1

3

x8.0

5

x11.

12

Size

cycle/value

scalar

SSE2

AVX2

SSE2, 4 cores

AVX2, 4 cores

31 of 37

Page 44: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

LU DecompositionAlgorithm

A00

A01 A02A10

A20 A11

A21

A12

A22 A11

A12A21

A22

A22

step 1

step 2

step 3

step 4

step 5

step 6

step 7

DGETRF

DGESSM

DTSTRF

DSSSSM

32 of 37

Page 45: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

LU DecompositionPerformance

0 10 20 30 40 50

0

50

100

Number of cores

Med

ian

GFL

OPS

8000× 8000 LU decomposition

NT2Intel MKL

33 of 37

Page 46: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

Talk Layout

Introduction

Abstractions & Efficiency

Experimental Results

Conclusion

34 of 37

Page 47: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

Conclusion

Parallel Computing for Scientist

� Software Libraries built as Generic and Generative components can solve a largechunk of parallelism related problems while being easy to use.

� Like regular language, DSEL needs informations about the hardware system� Integrating hardware descriptions as Generic components increases tools portability

and re-targetability

Our Achievements� A new method for parallel software development� Efficient libraries working on large subset of hardware� High level of performances across a wide application spectrum

35 of 37

Page 48: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

Conclusion

Parallel Computing for Scientist

� Software Libraries built as Generic and Generative components can solve a largechunk of parallelism related problems while being easy to use.

� Like regular language, DSEL needs informations about the hardware system� Integrating hardware descriptions as Generic components increases tools portability

and re-targetability

Our Achievements� A new method for parallel software development� Efficient libraries working on large subset of hardware� High level of performances across a wide application spectrum

35 of 37

Page 49: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Unlocked software performance

Perspectives

DSEL as C++ rst class idiom

� Build partial evaluation into the language� Ease transition between regular and meta C++� Mid-term Prospect: metaOCAML like quoting for C++

DSEL and compilers relationship

� C++ DSEL hits a limit on their applicability� Compilers often lack high level informations for proper optimization� Mid-term Prospect: Hybrid library/compiler approaches for DSEL

36 of 37

Page 50: Domain Speci c Embedded Languages in C++ · Joel Falcou NumScale May 27, 2016. Unlocked software performance The Paradigm Change in Science From Experiments to Simulations Simulations

Thanks for your attention