software abstractions for parallel hardware
TRANSCRIPT
Software Abstractions for Parallel Architectures
Joel Falcou
LRI - CNRS - INRIA
HDR Thesis Defense, 12/01/2014
The Paradigm Change in Science
From Experiments to Simulations
- Simulation is now an integral part of the Scientific Method
- Scientific Computing enables larger, faster, more accurate research
- Fast Simulation is Time Travel, as scientific results are now more readily available
Local Galaxy Cluster Simulation - Illustris project
Computing is first and foremost a mainstream science tool
The Parallel Hell
- Heat Wall: growing cores instead of GHz
- Hierarchical and heterogeneous parallel systems are the norm
- The Free Lunch is over, as hardware complexity rises faster than the average developer's skills
The real challenge in HPC is the Expressiveness/Efficiency War
2 of 41
The Expressiveness/Efficiency War
[Figure: Expressiveness vs. Performance across three eras.
Single Core Era: C/Fortran, C++, Java.
Multi-Core/SIMD Era: Sequential, Threads, SIMD.
Heterogeneous Era: Sequential, SIMD, Threads, GPU/Phi, Distributed.]
As the complexity of parallel systems grows, the expressiveness gap turns into an ocean
3 of 41
Designing tools for Scientific Computing
Objectives
1. Be non-disruptive
2. Domain driven optimizations
3. Provide intuitive API for the user
4. Support a wide architectural landscape
5. Be efficient
Our Approach
- Design tools as C++ libraries (1)
- Design these libraries as Domain Specific Embedded Languages (DSEL) (2+3)
- Use Parallel Programming Abstractions as parallel components (4)
- Use Generative Programming to deliver performance (5)
4 of 41
Talk Layout
Introduction
Abstractions & Efficiency
Experimental Results
Conclusion
5 of 41
Why Parallel Programming Models ?
Limits of regular tools
- Unstructured parallelism is error-prone
- Low-level parallel tools are non-composable
- Both contribute to the Expressiveness Gap
Available Models
- Performance-centric: PRAM, LogP, BSP
- Pattern-centric: Futures, Skeletons
- Data-centric: HTA, PGAS
6 of 41
Bulk Synchronous Parallelism [Valiant, McColl 90]
Principles
- Machine Model
- Execution Model
- Analytic Cost Model
[Figure: BSP Execution Model. Processors P0..P3 alternate a computation phase (cost w_max), a communication phase (cost h.g) and a barrier (cost L) within each superstep T, T+1, ...]
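The analytic cost model can be stated directly in code; a small sketch (the function name and parameters are mine, not from the slides):

```cpp
#include <algorithm>
#include <vector>

// BSP cost of one superstep: T = w_max + h*g + L, where
//   w_max : longest local computation among the processors
//   h     : largest number of words sent/received by any processor
//   g     : network cost per word
//   L     : barrier synchronisation latency
double superstep_cost(const std::vector<double>& work, double h, double g, double L) {
    double w_max = *std::max_element(work.begin(), work.end());
    return w_max + h * g + L;
}
```

The total cost of a program is then the sum of its superstep costs, which is what makes BSP programs amenable to pencil-and-paper performance reasoning.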
Advantages
- Simple set of primitives
- Implementable on any kind of hardware
- Possibility to reason about BSP programs
7 of 41
Parallel Skeletons [Cole 89]
Principles
- There are patterns in parallel applications
- Those patterns can be generalized as Skeletons
- Applications are assembled as combinations of such patterns
Functional point of view
- Skeletons are Higher-Order Functions
- Skeletons support a compositional semantics
- Applications become compositions of stateless functions
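A sequential sketch of the higher-order-function view, with std::transform and std::accumulate standing in for the two basic skeletons (illustrative names, not the API of any of the tools discussed later):

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

namespace skel {

// map skeleton: apply a stateless function to every element.
// Each application is independent, so this parallelises per element.
template<typename T, typename F>
std::vector<T> map(const std::vector<T>& in, F f) {
    std::vector<T> out(in.size());
    std::transform(in.begin(), in.end(), out.begin(), f);
    return out;
}

// fold skeleton: reduce with an associative operator.
// Associativity allows a tree-shaped parallel reduction.
template<typename T, typename Op>
T fold(const std::vector<T>& in, T init, Op op) {
    return std::accumulate(in.begin(), in.end(), init, op);
}

} // namespace skel
```

A sum of squares is then skel::fold(skel::map(v, square), 0, plus): a composition of stateless pieces that a parallel back-end can re-implement without touching user code.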
Classical Skeletons
- Data-parallel: map, fold, scan
- Task-parallel: par, pipe, farm
- More complex: Distributable Homomorphism, Divide & Conquer, ...
8 of 41
Relevance to our Objectives
Why use Parallel Skeletons?
- Write code independent of parallel programming minutiae
- Composability supports hierarchical architectures
- Code is scalable and easy to maintain
Why use BSP?
- The cost model guides development
- Few primitives keep the intellectual burden low
- A good medium for developing skeletons
How do we ensure the performance of these models' implementations?
9 of 41
Domain Specific Embedded Languages
Domain Specific Languages
- Non-Turing-complete declarative languages
- Solve a single class of problems
- Express what to do instead of how to do it
- E.g. SQL, MATLAB, ...
From DSL to DSEL [Abrahams 2004]
- A DSL incorporates domain-specific notation, constructs, and abstractions as fundamental design considerations.
- A Domain Specific Embedded Language (DSEL) is simply a library that meets the same criteria.
� Generative Programming is one way to design such libraries
10 of 41
Generative Programming [Eisenecker 97]
[Figure: a Domain-Specific Application Description enters a Translator (the Generative Component), which instantiates Parametric Sub-components to produce the Concrete Application]
11 of 41
Meta-programming as a Tool
Definition
Meta-programming is the writing of computer programs that analyse, transform and generate other programs (or themselves) as their data.
Meta-programmable Languages
- metaOCaml: runtime code generation via code quoting
- Template Haskell: compile-time code generation via templates
- C++: compile-time code generation via templates
C++ meta-programming
- Relies on the Turing-complete C++ template sub-language
- Handles types and integral constants at compile time
- Classes and functions act as code quoting
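As a minimal illustration of this compile-time sub-language, the classic template factorial (my example, not from the slides): the recursion is unrolled entirely during compilation, leaving a constant in the binary.

```cpp
// Compile-time computation: factorial<5>::value is evaluated by the
// compiler through template instantiation; no runtime cost remains.
template<unsigned N>
struct factorial {
    static const unsigned value = N * factorial<N - 1>::value;
};

// Explicit specialisation terminates the recursion.
template<>
struct factorial<0> {
    static const unsigned value = 1;
};

static_assert(factorial<5>::value == 120, "evaluated at compile time");
```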
12 of 41
The Expression Templates Idiom
Principles
- Relies on extensive operator overloading
- Carries semantic information around code fragments
- Introduces DSLs without disrupting the dev. chain
matrix x(h,w), a(h,w), b(h,w);
x = cos(a) + (b*a);

builds the compile-time AST:

expr< assign
    , expr<matrix&>
    , expr< plus
          , expr< cos, expr<matrix&> >
          , expr< multiplies, expr<matrix&>, expr<matrix&> >
          >
    >(x, a, b)

[Figure: AST of x = cos(a) + (b*a)]

which code generation turns into:

#pragma omp parallel for
for(int j=0; j<h; ++j)
  for(int i=0; i<w; ++i)
    x(j,i) = cos(a(j,i)) + ( b(j,i) * a(j,i) );

Arbitrary transforms applied on the meta-AST
General Principles of Expression Templates
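The mechanism fits in a few dozen lines; a minimal sketch (the names Vec and Plus are mine, far simpler than Boost.Proto) showing how operator overloading builds the AST and how assignment evaluates it in one fused loop:

```cpp
#include <cstddef>
#include <vector>

namespace et {

// AST node produced by operator+: stores references to its operands and
// evaluates lazily, element by element, when indexed.
template<typename L, typename R>
struct Plus {
    const L& l;
    const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

struct Vec {
    std::vector<double> data;
    explicit Vec(std::size_t n) : data(n) {}
    double  operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i)       { return data[i]; }

    // Assignment from any expression: one fused loop, no temporary vectors.
    template<typename E>
    Vec& operator=(const E& e) {
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = e[i];
        return *this;
    }
};

// Found by ADL only when an operand lives in namespace et.
template<typename L, typename R>
Plus<L, R> operator+(const L& l, const R& r) { return {l, r}; }

} // namespace et
```

With this, x = a + b never allocates an intermediate vector: the Plus node is a compile-time description of the computation, consumed by the assignment loop.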
Advantages
- The generic implementation becomes self-aware of optimizations
- The API abstraction level is arbitrarily high
- Accessible through high-level tools like Boost.Proto
13 of 41
Our Contributions
Our Strategy
- Apply DSEL generation techniques to parallel programming
- Maintain the low cost of abstraction through meta-programming
- Maintain the abstraction level via modern library design
Our contributions
Tools       Pub.      Scope                 Applications
Quaff       ParCo'06  MPI Skeletons         Real-time 3D reconstruction
SkellPU     PACT'08   Skeletons on Cell BE  Real-time image processing
BSP++       IJPP'12   MPI/OpenMP BSP        Bioinformatics, Model Checking
NT2         JPDC'14   Data-parallel MATLAB  Fluid Dynamics, Vision
14 of 41
Example of BSP++ Application
Khaled Hamidouche, PhD 2008-2011, in collab. with Univ. Brasilia

BSP Smith & Waterman
- S&W computes DNA sequence alignments
- The BSP++ implementation was written once and ran on 7 different hardware platforms
- Efficiency of 95+%, even on a 6000-core supercomputer
Platform              MaxSize     # Elements      Speedup  GCUPs
cluster (MPI)         1,072,950   128 cores       73x      6.53
cluster (MPI/OpenMP)  1,072,950   128 cores       116x     10.41
OpenMP                1,072,950   16 cores        16x      0.40
Cell BE               85,603      8 SPEs          -        0.14
cluster of Cell BEs   85,603      24 SPEs (8:24)  2.8x     0.37
Hopper (MPI)          5,303,436   3072 cores      260x     3.09
Hopper (MPI+OpenMP)   24,894,269  6144 cores      5664x    15.5
15 of 41
Second Look at our Contributions
Development Limitations
- DSELs are mostly tied to the domain model
- Architecture support is often an afterthought
- Extensibility is difficult, as many refactorings are required per architecture
- Example: no proper way to support GPUs with these implementation techniques
Proposed Method
- Extend Generative Programming to take the architecture into account
- Provide an architecture description DSEL
- Integrate this description into the code generation process
16 of 41
Architecture Aware Generative Programming
17 of 41
Software refactoring
Tools       Issues                     Changes
Quaff       Raw skeletons API          Re-engineered as part of NT2
SkellPU     Too architecture-specific  Re-engineered as part of NT2
BSP++       Integration issues         Integrate hybrid code generation
NT2         Not easily extendable      Integrate Quaff skeleton models
Boost.SIMD  -                          Side product of the NT2 restructuring

Conclusion
- Skeletons are fine as parallel middleware
- Model-based abstractions are not high-level enough
- For low-level architectures, the simplest model is often the best
18 of 41
Boost.SIMD
Pierre Estérie, PhD 2010-2014
Principles
- Provides a simple C++ API over SIMD extensions
- Supports every Intel, PPC and ARM instruction set
- Fully integrates with modern C++ idioms

Sparse Tridiagonal Solver: collaboration with M. Baboulin and Y. Wang
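The flavour of such an API can be suggested with a hand-written pack type (an illustrative sketch; Boost.SIMD's pack<T> exposes this style of interface but lowers each operator to SSE/AVX/NEON/AltiVec intrinsics rather than a scalar loop):

```cpp
#include <array>
#include <cstddef>

// Illustrative fixed-width "pack": one object holds a SIMD register's worth
// of values, and arithmetic is defined element-wise over the whole pack.
template<typename T, std::size_t N>
struct pack {
    std::array<T, N> v;

    pack operator+(const pack& o) const {
        pack r;
        for (std::size_t i = 0; i < N; ++i) r.v[i] = v[i] + o.v[i];
        return r;
    }
    pack operator*(const pack& o) const {
        pack r;
        for (std::size_t i = 0; i < N; ++i) r.v[i] = v[i] * o.v[i];
        return r;
    }
};
```

User code written against the pack abstraction stays identical whatever the target instruction set; only the lowering changes.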
19 of 41
Talk Layout
Introduction
Abstractions & Efficiency
Experimental Results
Conclusion
20 of 41
The Numerical Template Toolbox
Pierre Estérie, PhD 2010-2014
NT2 as a Scientific Computing Library
- Provides a simple, MATLAB-like interface for users
- Provides high-performance computing entities and primitives
- Is easily extendable
Components
- Uses Boost.SIMD for in-core optimizations
- Uses recursive parallel skeletons
- Supports task parallelism through Futures
21 of 41
The Numerical Template Toolbox
Principles
- table<T,S> is a simple, multidimensional array object that exactly mimics MATLAB array behaviour and functionality
- 500+ functions usable directly either on table or on any scalar value, as in MATLAB

How does it work?
- Take a .m file, copy it to a .cpp file
- Add #include <nt2/nt2.hpp> and make cosmetic changes
- Compile the file and link with libnt2.a
22 of 41
NT2 - From MATLAB to C++

MATLAB code:
A1 = 1:1000;
A2 = A1 + randn(size(A1));
X = lu(A1*A1');
rms = sqrt( sum(sqr(A1(:) - A2(:))) / numel(A1) );

NT2 code:
table<double> A1 = _(1., 1000.);
table<double> A2 = A1 + randn(size(A1));
table<double> X  = lu( mtimes(A1, trans(A1)) );
double rms = sqrt( sum(sqr(A1(_) - A2(_))) / numel(A1) );
23 of 41
Parallel Skeletons extraction process
A = B / sum(C+D);
[Figure: AST of A = B / sum(C+D); the sum node maps to a fold skeleton, the division to a transform skeleton]
24 of 41
[Figure: the statement is split into tmp = sum(C+D), executed by a fold skeleton, followed by A = B / tmp, executed by a transform skeleton]
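What the two extracted skeletons compute, written out sequentially (an illustrative sketch of the semantics, not NT2 code):

```cpp
#include <cstddef>
#include <vector>

// A = B / sum(C+D): a fold computes tmp = sum(C+D) in one fused pass,
// then a transform computes A[i] = B[i] / tmp element-wise. Both loops
// are embarrassingly parallel once tmp is known.
std::vector<double> evaluate(const std::vector<double>& B,
                             const std::vector<double>& C,
                             const std::vector<double>& D) {
    double tmp = 0.0;                                    // fold: sum(C+D)
    for (std::size_t i = 0; i < C.size(); ++i) tmp += C[i] + D[i];

    std::vector<double> A(B.size());                     // transform: B / tmp
    for (std::size_t i = 0; i < B.size(); ++i) A[i] = B[i] / tmp;
    return A;
}
```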
25 of 41
From data to task parallelism
Antoine Tran Tan, PhD 2012-2015
Limits of the fork-join model
- Synchronization costs due to implicit barriers
- Under-exploitation of potential parallelism
- Poor data locality and no inter-statement optimization
Skeletons from the Future
- Adapt current skeletons for taskification
- Use Futures (or HPX) to automatically pipeline statements
- Derive a dependency graph between statements
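The pipelining idea in standard C++ terms, using std::async and std::future (a sketch of the principle; NT2 uses its own workers and spawners, or HPX):

```cpp
#include <future>

// Two dependent statements pipelined with futures: stage 2 blocks only on
// the future it consumes, not on a global barrier, so independent work can
// overlap. Stage 1 plays the role of the fold, stage 2 of the transform.
double pipelined() {
    std::future<double> tmp = std::async(std::launch::async, [] {
        return 1.0 + 2.0;                   // stage 1: fold-like reduction
    });
    std::future<double> res = std::async(std::launch::async, [&tmp] {
        return 10.0 / tmp.get();            // stage 2: waits only on stage 1
    });
    return res.get();
}
```

The dependency graph is implicit in which futures each task reads; no barrier separates the statements.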
26 of 41
Parallel Skeletons extraction process - Take 2
A = B / sum(C+D);
[Figure: as before, the statement is split into tmp = sum(C+D), executed by a fold skeleton, and A = B / tmp, executed by a transform skeleton]
27 of 41
[Figure: the computation is tiled by column; for each tile k, a fold/SIMD worker computes tmp(k) = sum(C(:,k) + D(:,k)) and a transform/SIMD worker computes A(:,k) = B(:,k) / tmp(k); transform/OpenMP spawners distribute the workers over cores]
28 of 41
Motion Detection
Lacassagne et al., ICIP 2009

- Sigma-Delta algorithm based on background subtraction
- Uses a local Gaussian model of lightness variation to detect motion
- Challenge: very low arithmetic density
- Challenge: integer-based implementation with a small value range
29 of 41
Motion Detection
table<char> sigma_delta( table<char>& background
                       , table<char> const& frame
                       , table<char>& variance )
{
  // Estimate raw movement
  background = selinc( background < frame
                     , seldec(background > frame, background) );

  table<char> diff = dist(background, frame);

  // Compute local variance
  table<char> sig3 = muls(diff, 3);
  variance = if_else( diff != 0
                    , selinc( variance < sig3
                            , seldec(variance > sig3, variance) )
                    , variance );

  // Generate movement label
  return if_zero_else_one( diff < variance );
}
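For reference, the per-pixel logic that the NT2 version vectorises, written as scalar C++ with the saturated arithmetic spelled out by hand (a sketch of the algorithm; the function name is mine):

```cpp
#include <cstdint>

// Sigma-Delta per pixel: the background estimate steps by +/-1 toward the
// current frame, the variance steps by +/-1 toward 3*|difference|, and the
// pixel is labelled "motion" (1) when the difference reaches the variance.
std::uint8_t sigma_delta_pixel(std::uint8_t& background,
                               std::uint8_t  frame,
                               std::uint8_t& variance) {
    if (background < frame)      ++background;
    else if (background > frame) --background;

    std::uint8_t diff = background > frame ? background - frame
                                           : frame - background;
    if (diff != 0) {
        // saturated 3*diff so the 8-bit range is never exceeded
        std::uint8_t sig3 = diff > 85 ? 255 : std::uint8_t(3 * diff);
        if (variance < sig3)      ++variance;
        else if (variance > sig3) --variance;
    }
    return diff < variance ? 0 : 1;
}
```

Every operation is a compare, increment or small multiply on 8-bit data, which is why the arithmetic density is so low and SIMD over packed bytes pays off.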
30 of 41
Motion Detection
[Figure: cycles/element for 512x512 and 1024x1024 images, comparing scalar, half core, full core, SIMD (JRTIP2008), SIMD + half core and SIMD + full core versions; speedups over scalar range from x2.1 to x18]
31 of 41
Black and Scholes Option Pricing
NT2 Code

table<float> blackscholes( table<float> const& Sa, table<float> const& Xa
                         , table<float> const& Ta, table<float> const& ra
                         , table<float> const& va )
{
  table<float> da = sqrt(Ta);
  table<float> d1 = (log(Sa/Xa) + (sqr(va)*0.5f+ra)*Ta) / (va*da);
  table<float> d2 = d1 - va*da;
  return Sa*normcdf(d1) - Xa*exp(-ra*Ta)*normcdf(d2);
}
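For reference, the same price computed for a single option in scalar C++, with normcdf expressed via std::erfc (a sketch, not NT2 code):

```cpp
#include <cmath>

// Standard normal CDF via the complementary error function.
double normcdf(double x) { return 0.5 * std::erfc(-x / std::sqrt(2.0)); }

// Black & Scholes call price:
//   d1 = (ln(S/X) + (v^2/2 + r) T) / (v sqrt(T)),  d2 = d1 - v sqrt(T)
//   C  = S N(d1) - X e^{-rT} N(d2)
double blackscholes(double S, double X, double T, double r, double v) {
    double d  = std::sqrt(T);
    double d1 = (std::log(S / X) + (v * v * 0.5 + r) * T) / (v * d);
    double d2 = d1 - v * d;
    return S * normcdf(d1) - X * std::exp(-r * T) * normcdf(d2);
}
```

The NT2 version computes exactly this, but over whole tables at once, each statement becoming a candidate loop nest for SIMD and multicore code generation.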
32 of 41
Black and Scholes Option Pricing
NT2 Code with loop fusion

table<float> blackscholes( table<float> const& Sa, table<float> const& Xa
                         , table<float> const& Ta, table<float> const& ra
                         , table<float> const& va )
{
  // Preallocate temporary tables
  table<float> da(extent(Ta)), d1(extent(Ta)), d2(extent(Ta)), R(extent(Ta));

  // tie merges the loop nests and increases cache locality
  tie(da, d1, d2, R) = tie( sqrt(Ta)
                          , (log(Sa/Xa) + (sqr(va)*0.5f+ra)*Ta) / (va*da)
                          , d1 - va*da
                          , Sa*normcdf(d1) - Xa*exp(-ra*Ta)*normcdf(d2) );
  return R;
}
32 of 41
Black and Scholes Option Pricing
Performance

[Figure: cycles/value at size 1,000,000: scalar, SSE2 (x1.89), AVX2 (x2.91), SSE2 + 4 cores (x5.58), AVX2 + 4 cores (x6.30)]
33 of 41
Black and Scholes Option Pricing
Performance with loop fusion/futurisation

[Figure: cycles/value at size 1,000,000: scalar, SSE2 (x2.27), AVX2 (x4.13), SSE2 + 4 cores (x8.05), AVX2 + 4 cores (x11.12)]
34 of 41
LU Decomposition
Algorithm

[Figure: tiled LU decomposition in seven steps over the tile grid A00..A22, using the kernels DGETRF (diagonal tile factorisation), DGESSM (row update), DTSTRF (panel factorisation) and DSSSSM (trailing-submatrix update)]
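The kernels above tile the classical elimination; for reference, a minimal unblocked, unpivoted LU in plain C++ (a sketch of the underlying algorithm, not the NT2 implementation):

```cpp
#include <cstddef>
#include <vector>

// In-place LU factorisation without pivoting of an n x n row-major matrix:
// on return, the upper triangle holds U and the strict lower triangle holds
// L (unit diagonal). The tiled algorithm performs this same elimination
// tile by tile through the DGETRF/DGESSM/DTSTRF/DSSSSM kernels.
void lu(std::vector<double>& A, std::size_t n) {
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t i = k + 1; i < n; ++i) {
            A[i * n + k] /= A[k * n + k];                    // multiplier l_ik
            for (std::size_t j = k + 1; j < n; ++j)
                A[i * n + j] -= A[i * n + k] * A[k * n + j]; // trailing update
        }
}
```

Tiling turns each of these updates into a task on a tile, which is what exposes the inter-task parallelism exploited in the next slide's benchmark.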
35 of 41
LU Decomposition
Performance

[Figure: median GFLOPS vs. number of cores (up to 48) for an 8000 x 8000 LU decomposition, comparing NT2 with Intel MKL]
36 of 41
Talk Layout
Introduction
Abstractions & Efficiency
Experimental Results
Conclusion
37 of 41
Conclusion
Parallel Computing for Scientists
- Software libraries built as generic and generative components can solve a large chunk of parallelism-related problems while remaining easy to use
- Like regular languages, DSELs need information about the hardware system
- Integrating hardware descriptions as generic components increases tool portability and re-targetability

Our Achievements
- A new method for parallel software development
- Efficient libraries working on a large subset of hardware
- A high level of performance across a wide application spectrum
38 of 41
Works in Progress
Application to Accelerators
- Exploration of proper skeleton implementations on GPUs
- Adaptation of the Future-based code generator
- In progress with Ian Masliah's PhD thesis
Parallelism within C++
- SIMD as part of the standard library
- Proposal N3571 for standard SIMD computation
- Interoperability with the current C++ parallel model
39 of 41
Perspectives
DSEL as a C++ first-class idiom
- Build partial evaluation into the language
- Ease the transition between regular and meta C++
- Mid-term prospect: metaOCaml-like quoting for C++
DSEL and compilers relationship
- C++ DSELs hit a limit in their applicability
- Compilers often lack the high-level information needed for proper optimization
- Mid-term prospect: hybrid library/compiler approaches for DSELs
40 of 41
Thanks for your attention