TRANSCRIPT
Automatic Task-based Code Generation for High Performance DSEL
Antoine Tran Tan, Joel Falcou, Daniel Etiemble, Hartmut Kaiser
Université Paris-Sud - LRI - ParSys - Inria Postale; STE||AR Group - LSU
NumScale SAS
Meeting C++ 2014
1 of 28
The Expressiveness/Efficiency War

[Figure: three Performance-vs-Expressiveness charts. Single-Core Era: C/Fortran, C++, Java. Multi-Core/SIMD Era: Sequential, Threads, SIMD. Heterogeneous Era: Sequential, SIMD, Threads, GPU/Phi, Distributed.]

As the complexity of parallel systems grows, the expressiveness gap turns into an ocean.
2 of 28
Designing tools for Scientific Computing
Challenges
1. Be non-disruptive
2. Enable domain-driven optimizations
3. Provide an intuitive API for the user
4. Support a wide architectural landscape
Our Approach
- Design tools as C++ libraries (1)
- Design them as Domain Specific Embedded Languages (DSELs) (2+3)
- Use Generative Programming to deliver performance (4)
Objectives of the Talk
- Present one such tool we designed
- Expose some limitations of the DSEL medium
- Propose a different take on such tools
3 of 28
Outline
Introduction
The NT2 Library
Redesign for task parallelism
Benchmarks
Conclusion and Perspectives
4 of 28
NT2: The Numerical Template Toolbox

NT2 as a Scientific Computing Library
- Provides a simple, MATLAB-like DSEL
- Provides high-performance computing entities and primitives
- Is easily extendable

Components
- Uses Boost.SIMD for in-core optimizations
- Uses recursive parallel skeletons
- Supports task parallelism through Futures

5 of 28
The Numerical Template Toolbox

Principles
- table<T,S> is a simple, multidimensional array object that exactly mimics MATLAB array behavior and functionality
- 500+ functions usable directly either on table or on any scalar value, as in MATLAB

How does it work
- Take a .m file, copy it to a .cpp file
- Add #include <nt2/nt2.hpp> and make cosmetic changes
- Compile the file and link with libnt2.a

6 of 28
NT2 - From MATLAB to C++

MATLAB code:

    A1 = 1:1000;
    A2 = A1 + randn(size(A1));
    X = lu(A1*A1');
    rms = sqrt( sum(sqr(A1(:) - A2(:))) / numel(A1) );

NT2 code:

    table<double> A1 = _(1., 1000.);
    table<double> A2 = A1 + randn(size(A1));
    table<double> X = lu( mtimes(A1, trans(A1)) );
    double rms = sqrt( sum(sqr(A1(_) - A2(_))) / numel(A1) );
7 of 28
Domain Specific Embedded Languages

Domain Specific Languages
- Non-Turing-complete declarative languages
- Solve a single class of problems
- Express what to do instead of how to do it
- E.g.: SQL, MATLAB, …

From DSL to DSEL [Abrahams 2004]
- A DSL incorporates domain-specific notation, constructs, and abstractions as fundamental design considerations.
- A Domain Specific Embedded Language (DSEL) is simply a library that meets the same criteria
- Generative Programming is one way to design such libraries
8 of 28
Meta-programming as a Tool

Definition
Meta-programming is the writing of computer programs that analyse, transform and generate other programs (or themselves) as their data.

Meta-programmable Languages
- MetaOCaml: runtime code generation via code quoting
- Template Haskell: compile-time code generation via templates
- C++: compile-time code generation via templates

C++ meta-programming
- Relies on the Turing-complete C++ template sub-language
- Handles types and integral constants at compile time
- Template classes and functions act as code quoting
9 of 28
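The template sub-language can be seen in action in a minimal compile-time computation (a generic illustration, not NT2 code): the compiler itself evaluates the recursion, so the result is a compile-time constant.

```cpp
// Classic compile-time computation: class templates act as a pure
// functional language that the compiler evaluates during compilation.
template <unsigned N>
struct factorial
{
  static const unsigned value = N * factorial<N - 1>::value;
};

template <>
struct factorial<0>          // template specialization = base case
{
  static const unsigned value = 1;
};

// Checked entirely at compile time, before any code runs.
static_assert(factorial<5>::value == 120, "evaluated by the compiler");
```

The same mechanism that computes integers here can compute with types, which is what Expression Templates exploit below.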
The Expression Templates Idiom

Principles
- Relies on extensive operator overloading
- Carries semantic information around code fragments
- Introduces DSLs without disrupting the dev. chain

    matrix x(h,w), a(h,w), b(h,w);
    x = cos(a) + (b*a);

is captured at compile time as the expression tree

    expr< assign, expr<matrix&>
                , expr< plus, expr<cos, expr<matrix&>>
                            , expr<multiplies, expr<matrix&>, expr<matrix&>>>>(x, a, b);

and evaluated as a single loop nest:

    #pragma omp parallel for
    for(int j=0; j<h; ++j)
      for(int i=0; i<w; ++i)
        x(j,i) = cos(a(j,i)) + ( b(j,i) * a(j,i) );

Arbitrary transforms are applied on the meta-AST.

General Principles of Expression Templates

10 of 28
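A minimal sketch of the idiom, assuming a toy Vec type rather than NT2's actual table: operator+ builds a lightweight node that records the operation instead of computing it, and assignment evaluates the whole tree element by element, in one fused loop without temporaries.

```cpp
#include <cstddef>
#include <vector>

// Expression node: holds references to its operands and computes
// element i only when asked. Nothing is evaluated at construction.
template <typename L, typename R>
struct Add
{
  L const& l;
  R const& r;
  double operator[](std::size_t i) const { return l[i] + r[i]; }
  std::size_t size() const { return l.size(); }
};

struct Vec
{
  std::vector<double> data;
  explicit Vec(std::size_t n) : data(n) {}

  double  operator[](std::size_t i) const { return data[i]; }
  double& operator[](std::size_t i)       { return data[i]; }
  std::size_t size() const { return data.size(); }

  // Assignment walks the expression tree: one loop, no temporaries.
  template <typename E>
  Vec& operator=(E const& e)
  {
    for (std::size_t i = 0; i < size(); ++i) data[i] = e[i];
    return *this;
  }
};

// operator+ returns an Add node, not a Vec: the "code quoting" step.
template <typename L, typename R>
Add<L, R> operator+(L const& l, R const& r) { return {l, r}; }
```

With this in place, `c = a + b + a;` builds an `Add<Add<Vec,Vec>,Vec>` at compile time and evaluates it in a single pass; a real library like NT2 additionally pattern-matches such trees to pick optimized loop nests.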
The Expression Templates Idiom

Advantages
- Generic implementations become self-aware of optimizations
- The API abstraction level is arbitrarily high
- Accessible through high-level tools like Boost.Proto

10 of 28
Why Parallel Programming Models?

Limits of regular tools
- Unstructured parallelism is error-prone
- Low-level parallel tools are non-composable
- Both contribute to the Expressiveness Gap

Available Models
- Performance-centric: PRAM, LogP, BSP
- Pattern-centric: Futures, Skeletons
- Data-centric: HTA, PGAS

11 of 28
Parallel Skeletons [Cole 89]

Principles
- There are recurring patterns in parallel applications
- Those patterns can be generalized as Skeletons
- Applications are assembled as combinations of such patterns

Functional point of view
- Skeletons are Higher-Order Functions
- Skeletons support a compositional semantics
- Applications become compositions of stateless functions

12 of 28
Classical Skeletons
- Data-parallel: map, fold, scan
- Task-parallel: par, pipe, farm
- More complex: Distributable Homomorphisms, Divide & Conquer, …

12 of 28
Parallel Skeletons: The heart of NT2

Parallel Skeletons in NT2
- transform: apply an n-ary function over a subset of data
- fold: perform an n-ary reduction over a subset of data
- scan: perform an n-ary prefix reduction over a subset of data

Tied to families of loop nests
- Elementwise: calls to transform; can be nested with elementwise operations
- Reduction & scan: calls to fold or scan respectively; successive reductions or prefix scans are not nestable, but can contain a nested elementwise loop nest

13 of 28
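As a rough illustration (not NT2's implementation), the two data-parallel skeletons named above can be sketched as higher-order functions over sequential std:: algorithms; NT2's real versions dispatch the same interface to SIMD and multi-threaded back-ends.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// transform skeleton: apply a function elementwise over a data set.
// (Sequential stand-in for a parallel/SIMD loop nest.)
template <typename In, typename Out, typename F>
void transform_skel(In const& in, Out& out, F f)
{
  std::transform(in.begin(), in.end(), out.begin(), f);
}

// fold skeleton: reduce a data set with a binary function.
// (Sequential stand-in for a tree-shaped parallel reduction.)
template <typename In, typename T, typename F>
T fold_skel(In const& in, T init, F f)
{
  return std::accumulate(in.begin(), in.end(), init, f);
}
```

Because the user code only names the skeleton, the library is free to swap the loop body shown here for a parallel one without touching call sites.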
Parallel Skeletons extraction process

A = B / sum(C+D);

[AST diagram: = at the root with children A and / ; / has children B and sum; sum has child + with children C and D. The sum sub-tree maps to a fold skeleton, the surrounding expression to a transform skeleton.]
14 of 28
Parallel Skeletons extraction process

A = B / sum(C+D);

[The statement is split in two: the fold sub-tree is extracted first as tmp = sum(C+D), then the remaining transform is evaluated as A = B / tmp.]
15 of 28
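Spelled out by hand, the two extracted skeletons for this statement correspond to the following loop nests (an illustrative equivalent written over std::vector, not NT2-generated code):

```cpp
#include <cstddef>
#include <vector>

// Hand-written equivalent of the extraction of A = B / sum(C + D):
// first the fold producing the scalar tmp, then the elementwise
// transform that consumes it.
std::vector<double> eval(std::vector<double> const& B,
                         std::vector<double> const& C,
                         std::vector<double> const& D)
{
  // fold skeleton: tmp = sum(C + D)
  double tmp = 0.0;
  for (std::size_t i = 0; i < C.size(); ++i) tmp += C[i] + D[i];

  // transform skeleton: A = B / tmp
  std::vector<double> A(B.size());
  for (std::size_t i = 0; i < B.size(); ++i) A[i] = B[i] / tmp;
  return A;
}
```

The dependency is explicit: the transform cannot start before the fold completes, which is exactly the ordering the task-parallel redesign later relaxes to per-chunk granularity.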
The limits

Limits of Expression Templates
- Synchronization costs due to implicit barriers
- Under-exploitation of potential parallelism (e.g. concurrency, pipelining)
- Poor data locality and no inter-statement optimization

Actions!
- Upgrade NT2 to enable task parallelism
- Adapt the current skeletons for taskification
- Derive a dependency graph between statements

16 of 28
Outline
Introduction
The NT2 Library
Redesign for task parallelism
Benchmarks
Conclusion and Perspectives
16 of 28
Futures programming model

    future<int> f1 = async(foo1);
    ...
    int result = f1.get();

[Sequence diagram: the Client calls async(), which new()s a Promise and immediately returns a Future; get() blocks while the value is not ready; once the Promise set()s the value, get() returns it.]

17 of 28
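With standard C++11 facilities the protocol above looks as follows; foo1 is a placeholder task, and get() is the only point where the client may block.

```cpp
#include <future>

int foo1() { return 42; }   // placeholder asynchronous task

int demo()
{
  // async() sets up the promise/future pair and returns immediately.
  std::future<int> f1 = std::async(std::launch::async, foo1);

  // ... the client is free to do other work here ...

  // get() is the rendezvous: it blocks only if the value is not
  // ready yet, then returns it.
  return f1.get();
}
```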
Sequential composition

    future<int> f1 = async(foo1);
    future<int> f2 = f1.then(foo2);

[Sequence diagram: async() new()s a Promise and returns future1; then() new()s a second promise/future pair and returns future2; when the first promise is set(), foo2 runs and set()s the second, fulfilling future2.]

18 of 28
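future::then is an HPX / Concurrency TS facility, not standard C++11; the chaining above can be emulated with a second std::async that waits on the first future (at the cost of blocking a worker thread, which a real continuation avoids). foo1 and foo2 are placeholder tasks.

```cpp
#include <future>

int foo1()      { return 10; }     // placeholder first task
int foo2(int x) { return x + 1; }  // placeholder continuation

int demo()
{
  std::future<int> f1 = std::async(std::launch::async, foo1);

  // Emulate f1.then(foo2): launch a task that blocks on f1, then
  // applies foo2 to its result.
  std::future<int> f2 = std::async(std::launch::async,
                                   [&f1] { return foo2(f1.get()); });
  return f2.get();
}
```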
Parallel composition

    typedef future<int> F;

    F f1 = async(foo1);
    F f2 = async(foo2);

    future< tuple<F,F> > f3 = when_all(f1, f2);

[Diagram: foo1 and foo2 run concurrently; f3 becomes ready when both complete.]

19 of 28
Parallel composition

    typedef future<int> F;

    vector<F> join;
    join.push_back(async(foo1));
    join.push_back(async(foo2));

    F f3 = when_all(join).then(foo3);

[Diagram: foo1 and foo2 run concurrently; foo3 runs once both have completed.]

19 of 28
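Like then, when_all is an HPX / Concurrency TS primitive; in plain C++11 the final pattern above can be emulated by waiting on every future of the join set before invoking the continuation. foo1, foo2 and foo3 are placeholder tasks.

```cpp
#include <future>
#include <vector>

int foo1() { return 1; }              // placeholder parallel task
int foo2() { return 2; }              // placeholder parallel task
int foo3(int a, int b) { return a + b; }  // placeholder continuation

int demo()
{
  std::vector<std::future<int>> join;
  join.push_back(std::async(std::launch::async, foo1));
  join.push_back(std::async(std::launch::async, foo2));

  // Emulate when_all(join).then(foo3): both tasks run concurrently;
  // we only synchronize when the continuation needs their results.
  int a = join[0].get();
  int b = join[1].get();
  return foo3(a, b);
}
```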
Parallel Skeletons extraction process

A = B / sum(C+D);

[The two statements from the extraction step again: tmp = sum(C+D) as a fold, then A = B / tmp as a transform.]
20 of 28
Parallel Skeletons extraction process

A = B / sum(C+D);

[Diagram: both statements are split column-wise into chunks; a spawner (transform, Futures) launches per-chunk workers: a worker (fold, SIMD) computes tmp(k) = sum(C(:,k)+D(:,k)) and a worker (transform, SIMD) computes A(:,k) = B(:,k) / tmp(k), for k = 1..3.]
21 of 28
Outline
Introduction
The NT2 Library
Redesign for task parallelism
Benchmarks
Conclusion and Perspectives
21 of 28
Black and Scholes Option Pricing

NT2 Code

    table<float> blackscholes( table<float> const& Sa, table<float> const& Xa
                             , table<float> const& Ta, table<float> const& ra
                             , table<float> const& va )
    {
      table<float> da = sqrt(Ta);
      table<float> d1 = log(Sa/Xa) + (sqr(va)*0.5f+ra)*Ta/(va*da);
      table<float> d2 = d1 - va*da;

      return Sa*normcdf(d1) - Xa*exp(-ra*Ta)*normcdf(d2);
    }
22 of 28
Black and Scholes Option Pricing

NT2 Code - Manual Loop Fusion

    table<float> blackscholes( table<float> const& Sa, table<float> const& Xa
                             , table<float> const& Ta, table<float> const& ra
                             , table<float> const& va )
    {
      // Preallocate temporary tables
      table<float> da(extent(Ta)), d1(extent(Ta)), d2(extent(Ta)), R(extent(Ta));

      // tie merges the loop nests and increases cache locality
      tie(da,d1,d2,R) = tie( sqrt(Ta)
                           , log(Sa/Xa) + (sqr(va)*0.5f+ra)*Ta/(va*da)
                           , d1 - va*da
                           , Sa*normcdf(d1) - Xa*exp(-ra*Ta)*normcdf(d2) );
      return R;
    }
23 of 28
Black and Scholes performance for size 256M

[Bar chart, execution time in cycles/element per implementation: scalar 581, SIMD 140, NT2 23.]
24 of 28
Tiled LU algorithm

[Diagram: the matrix, stored as tiles A00..A22, is factorized over 7 steps; the per-tile kernels are DGETRF, DGESSM, DTSTRF and DSSSSM.]
25 of 28
LU performance for Problem Size 8000

[Line chart: performance in GFlop/s (0-150) versus number of cores (0-50) for NT2, PLASMA and Intel MKL.]
26 of 28
Outline
Introduction
The NT2 Library
Redesign for task parallelism
Benchmarks
Conclusion and Perspectives
26 of 28
Conclusion and Perspectives

Conclusion
- Extraction of asynchrony from DSEL statements
- Combination of Future-based task management with parallel skeletons
- Efficient for coarse-grain tasks

Perspectives
- Integration of application/memory-aware cost models
- Generalization of Futures to distributed systems & GPGPUs
- Follow us at http://www.github.com/MetaScale/nt2
27 of 28
Thanks for your attention
28 of 28