Over 30 years of mathematical excellence
The Numerical Algorithms Group: Combining mathematics and technology for enhanced performance
Unlocking the Power of OpenMP
Dr Stef Salvini
NAG Ltd
stef.salvini@nag.co.uk
EWOMP 2003 - 22-26 September 2003
Acknowledgements
Lawrence Mulholland, Edward Smyth, Themos Tsikas, Anne E. Trefethen, Jeremy Du Croz, Robert Tong, and too many others to mention here!
Contents
- Opening Remarks
- SMP Systems, Parallelism and OpenMP
- Tuning OpenMP: General Points; NAG Specific Case: Needs, Solutions and Case Studies
- OpenMP and Hybrid Parallelism
- Some Other Considerations and Conclusions
Contents
- Opening Remarks
- SMP Systems, Parallelism and OpenMP
- Tuning OpenMP: General Points; NAG Specific Case: Needs, Solutions and Case Studies
- OpenMP in Hybrid Parallelism
- Some Other Considerations and Conclusions
NAG and HPC
- NAG Products: Parallel Library Release 3 (MPI); NAG SMP Library Release 2 (OpenMP)
- Collaborations with external agencies: vendors (e.g. ACML Library for AMD Opteron); research and academic institutions; industrial, commercial and financial concerns
- Consultancy activities
Who Can Benefit from Parallelism?
Anybody with large computationally intensive problems:
- Academic institutions
- Industry (aerospace, car, etc.)
- Financial institutions (forecasts, etc.)
- Increasingly, commercial systems: databases, on-line transactions, data mining, web servers, etc.
They want solutions!
Some thoughts…
Short life cycle: hardware
Long life cycle: scientific software
Is software the real capital investment?
The Changing World of HPC
- The "(ir)resistible" rise of PC clusters: price/performance (claims or substantial?); originally in-house built, now part of the mainstream; increasing penetration of the server market
- MPI as the new legacy: is parallelism crystallised into MPI? Is anybody interested in other types of parallelism?
- Hybrid systems: the de facto standard for high-end systems. Do we need multi-level parallelism?
Contents
- Opening Remarks
- SMP Systems, Parallelism and OpenMP
- Tuning OpenMP: General Points; NAG Specific Case: Needs, Solutions and Case Studies
- OpenMP in Hybrid Parallelism
- Some Other Considerations and Conclusions
Why SMPs?
- Hardware: re-usable technology; modular technology; increasingly partitionable hardware; reliable technology (minimum downtime)
- Commercial applications: databases, OLTP, web servers
- Numerical and scientific applications: tremendous potential
SMP Model
[Diagram: four CPUs, each with a private cache, attached through an interconnect subsystem to a single memory]
Single memory visible to all processors
Memory can be physically partitioned (NUMA systems)
SMP Parallelism in a Nutshell
Multi-threaded parallelism (parallelism-on-demand):
- Serial execution; threads are spawned at the start of a parallel region and destroyed at its end; serial execution resumes
- Multi-threading: parallel execution behind serial interfaces; details of parallelism are hidden outside the parallel region
- Parallelism carried out in distinct parallel regions
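A minimal C sketch of the fork-join pattern above (illustrative only, not NAG library code): threads exist only inside the parallel region, behind an ordinary serial interface.

```c
/* Fork-join in a nutshell: serial execution, a parallel region in
   which threads are spawned and destroyed, then serial execution
   again. The caller sees a plain serial interface. */
double sum_squares(const double *x, int n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)   /* spawn threads */
    for (int i = 0; i < n; i++)
        sum += x[i] * x[i];
    return sum;                                 /* threads joined */
}
```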
The Data View
[Diagram: memory hierarchy, from registers through primary cache, secondary cache and local memory to global memory; speed of data access decreases, and size of available data space increases, moving away from the registers]
Data must migrate through the different levels of memory in a very coordinated fashion
Memory and Data Transfer
- Memory structure: multiple levels (increasing); caches are invaluable, essential and difficult
- Some difficulties with data access:
- Single-processor effects: cache misses and thrashing; TLB misses
- Multi-processor effects: false sharing; required synchronisations
- NUMA systems: data allocation and distribution; page misses and migration
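False sharing, one of the multi-processor effects above, can be avoided by padding per-thread data onto separate cache lines. A hedged C sketch (the 64-byte line size and the counting task are assumptions for illustration):

```c
#ifdef _OPENMP
#include <omp.h>
#endif

#define NTHREADS 4
#define CACHE_LINE 64   /* assumed cache-line size */

/* Padding keeps each thread's counter on its own cache line, so one
   thread's updates do not invalidate the others' cached copies. */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};

long count_even(const int *a, int n) {
    struct padded_counter c[NTHREADS];
    for (int t = 0; t < NTHREADS; t++) c[t].value = 0;

    #pragma omp parallel for num_threads(NTHREADS)
    for (int i = 0; i < n; i++) {
        int t = 0;
    #ifdef _OPENMP
        t = omp_get_thread_num();
    #endif
        if (a[i] % 2 == 0)
            c[t].value++;            /* no false sharing */
    }

    long total = 0;
    for (int t = 0; t < NTHREADS; t++) total += c[t].value;
    return total;
}
```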
SMP Parallelism: Dynamic View of Data
[Diagram: the same data partitioned differently across processors 1-4 at computation stages 1, 2 and 3]
From SMP to OpenMP
- OpenMP embodies SMP mechanisms: concise notation, though not all parallel structures are represented
- Simple to implement (hence wide acceptance by vendors) and to understand (?)
- Compiler directives compile cleanly on serial systems
- However: "local" references only; some system calls
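The "compile cleanly on serial systems" point in practice: a serial compiler simply ignores the pragmas, and only the runtime calls need guarding with the standard _OPENMP macro (a minimal sketch):

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* With a serial compiler the directives compile cleanly (they are
   just pragmas); only runtime calls need the _OPENMP guard. */
int max_threads(void) {
#ifdef _OPENMP
    return omp_get_max_threads();   /* OpenMP build */
#else
    return 1;                       /* serial build: one thread */
#endif
}
```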
Contents
- Opening Remarks
- SMP Systems, Parallelism and OpenMP
- Tuning OpenMP: General Points; NAG Specific Case: Needs, Solutions and Case Studies
- OpenMP in Hybrid Parallelism
- Some Other Considerations and Conclusions
Ensuring Efficiency on Modern Systems
- Algorithms must take into account parallelism and data access (multi-level memory layout)
- Algorithms must have: dynamic load balancing; some strategy for serial or quasi-serial bottlenecks (beating Amdahl's law?); parametrisation for easy configuration and porting
- Should algorithms also take into account: multi-level memory (e.g. NUMA, clusters, etc.)? Contingent (history-of-the-computation) data layout?
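Dynamic load balancing, as called for above, is available directly in OpenMP via the schedule clause. A toy sketch (the chunk size of 4 and the simulated workload are arbitrary illustrative choices):

```c
/* Iterations with very different costs are handed out in small
   chunks as threads become free (schedule(dynamic)), instead of a
   fixed static split that would leave some threads idle. */
double total_work(const int *cost, int n) {
    double total = 0.0;
    #pragma omp parallel for schedule(dynamic, 4) reduction(+:total)
    for (int i = 0; i < n; i++) {
        double w = 0.0;
        for (int k = 0; k < cost[i]; k++)   /* simulated uneven work */
            w += 1.0;
        total += w;
    }
    return total;
}
```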
Levels of Parallelism
Coarse grain:
- Application driven; more potential for parallelism; closer to optimal performance; less overhead
- But: design complexity; implementation costs; maintenance costs; removed from the "serial" world; "non-local" data access problem; non-modular design
- Expandable to non-SMP systems
Fine grain:
- Close to "elementary" algorithms; modular design; direct path from the "serial" world; top-down refinement
- But: higher overheads; less potential for parallelism; serial or quasi-serial bottlenecks
Level of Parallelism
Coarse/fine trade-off dictated by: nature of the application; technological feasibility; availability of components; expertise and experience; time scale and resources for development; deadlines for results
The “reductionist approach”
Postulate: parallelise the basic computational kernels
- BLAS (Basic Linear Algebra Subroutines): Level-3 (matrix-matrix product); Level-2 (matrix-vector product); Level-1 (vector operations)
- Basic FFTs
Theorem: anything built on them will have adequate parallelism
Proof: by oral tradition
Level-3 BLAS
N^3 operations on N^2 data references: good data re-use (cache friendly), good parallelisability
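The N^3-operations-on-N^2-data property can be seen in a naive C matrix product (a sketch, not a tuned BLAS): every element of A and B is reused n times, and the outer loop parallelises without synchronisation.

```c
/* C += A*B, n x n row-major: O(n^3) flops on O(n^2) data, so each
   element is reused n times (cache friendly). Distinct threads write
   distinct rows of C, so the outer loop needs no synchronisation. */
void matmul(int n, const double *A, const double *B, double *C) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double aik = A[i*n + k];     /* reused across the j loop */
            for (int j = 0; j < n; j++)
                C[i*n + j] += aik * B[k*n + j];
        }
}
```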
Level-2 and Level-1 BLAS
N^2 and N operations on N^2 and N data references: poor data re-use, dubious parallelisability. Problems if sequences of BLAS are applied to the same data space.
DSYMV (Symm. Matrix Vector Product)
Requires synchronisations, etc. Many vendors do not parallelise it!
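Why DSYMV needs synchronisation: with only one triangle stored, each a_ij contributes to two entries of y, so naive parallel loops race. One common remedy (a sketch, not the NAG implementation) is per-thread partial result vectors combined at the end:

```c
#include <stdlib.h>
#include <string.h>

/* y = A*x with A symmetric, lower triangle stored (row-major n x n).
   Each a_ij (i > j) contributes to both y_i and y_j, so threads
   would race on y; here every thread accumulates into a private copy
   of y, and the copies are summed afterwards (a reduction). */
void dsymv_lower(int n, const double *A, const double *x, double *y) {
    memset(y, 0, n * sizeof(double));
    #pragma omp parallel
    {
        double *yp = calloc(n, sizeof(double));  /* private partial y */
        #pragma omp for nowait
        for (int i = 0; i < n; i++) {
            yp[i] += A[i*n + i] * x[i];
            for (int j = 0; j < i; j++) {
                yp[i] += A[i*n + j] * x[j];
                yp[j] += A[i*n + j] * x[i];  /* the symmetric "echo" */
            }
        }
        #pragma omp critical             /* combine partial results */
        for (int i = 0; i < n; i++) y[i] += yp[i];
        free(yp);
    }
}
```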
Contents
- Opening Remarks
- SMP Systems, Parallelism and OpenMP
- Tuning OpenMP: General Points; NAG Specific Case: Needs, Solutions and Case Studies
- OpenMP in Hybrid Parallelism
- Some Other Considerations and Conclusions
NAG SMP Library
Our needs:
- Fill the gap between serial and parallel codes
- Allow maximum reuse of our products (NAG Library)
- Best performance achievable (parallelism must make a difference!)
Our approach:
- Build the library on our (serial) NAG Library
- Keep identical serial interfaces
- Keep identical functionality and numerics
- Hide all details of parallelism
- Use parametrisable algorithms for easy porting
That dictates our level of granularity!
NAG SMP Library Release 2
- All the NAG Library Mark 19: over 1200 user-callable components
- Parallelised numerical routines in: dense linear algebra; sparse linear algebra; FFTs; random-number generation; all other routines dependent on the above
New at NAG SMP Library Release 3 (Soon)
- Sparse technology: direct methods; preconditioners for iterative solvers; enhanced iterative solvers; band solvers; eigensolution of large sparse matrices
- Extended linear algebra coverage
- Multi-dimensional quadrature
LU Factorisation: A “Serial” Algorithm
LU Factorisation: LAPACK Style
[Diagram: already factorised L and U factors, the pivot block, and the active (trailing) submatrix]
- Factorise the pivot block (Level-2 BLAS)
- Permute the rows (Level-1 BLAS)
- Solve the triangular system (Level-3 BLAS)
- Update the trailing submatrix (Level-3 BLAS)
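The four steps above, in a compact C sketch of right-looking blocked LU (row pivoting omitted for brevity, so this is illustrative only, not the LAPACK routine):

```c
/* Right-looking blocked LU: factorise a pivot panel, solve a
   triangular system for the U block, then update the trailing
   submatrix. Pivoting is omitted for brevity. A is n x n row-major;
   unit-lower L and U overwrite A. */
void lu_blocked(int n, int nb, double *A) {
    for (int k = 0; k < n; k += nb) {
        int kb = (k + nb < n) ? nb : n - k;
        /* factorise the pivot block, column by column (Level-2 BLAS) */
        for (int j = k; j < k + kb; j++)
            for (int i = j + 1; i < n; i++) {
                A[i*n + j] /= A[j*n + j];
                for (int c = j + 1; c < k + kb; c++)
                    A[i*n + c] -= A[i*n + j] * A[j*n + c];
            }
        /* solve the triangular system for the U block (Level-3 BLAS) */
        for (int c = k + kb; c < n; c++)
            for (int i = k + 1; i < k + kb; i++)
                for (int j = k; j < i; j++)
                    A[i*n + c] -= A[i*n + j] * A[j*n + c];
        /* update the trailing submatrix (Level-3 BLAS): almost all
           the flops live here, and rows update independently */
        #pragma omp parallel for
        for (int i = k + kb; i < n; i++)
            for (int j = k; j < k + kb; j++)
                for (int c = k + kb; c < n; c++)
                    A[i*n + c] -= A[i*n + j] * A[j*n + c];
    }
}
```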
About serial or quasi-serial bottlenecks
That is what we do:
- Identify serial bottlenecks
- Identify memory access bottlenecks
- Remove the above, or ...
- "Hide" them using a "look-ahead" strategy
- "Locally asynchronous" algorithms?
About serial or quasi-serial bottlenecks: live with them ... (the reductionist approach)
- Further parallelisation of the bottlenecks: perhaps nested parallelism?
- Use knowledge about the algorithms: predecessor/successor relationships; task queues
- Use tools: instrument the code to generate a stack of tasks; profile and analyse previous runs
- "Hide" them using a "look-ahead" strategy: "locally asynchronous" algorithms
Look-ahead (Beating Amdahl’s Law?)
[Diagram: a serial phase flanked by two parallelisable phases, executed three ways: serial execution; parallel execution on processors 1-3, which idle during the serial phase; and parallel execution with look-ahead, where the serial phase is overlapped with parallelisable work]
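A miniature of the look-ahead idea using OpenMP sections: while the bulk "update" of step k runs, another thread already performs the small serial "factorisation" of step k+1. The two work routines here are hypothetical stand-ins, not a real factorisation.

```c
/* Look-ahead in miniature: overlap the serial part of step k+1 with
   the parallelisable part of step k, hiding the serial bottleneck.
   factorise_panel and update_trailing are toy stand-ins. */
double factorise_panel(int k) { return (double)k; }   /* serial part  */
double update_trailing(int k) { return 10.0 * k; }    /* bulk of work */

double lu_with_lookahead(int nsteps) {
    double sum = 0.0;
    double next_panel = factorise_panel(0);
    for (int k = 0; k < nsteps; k++) {
        double panel = next_panel;
        #pragma omp parallel sections reduction(+:sum)
        {
            #pragma omp section
            sum += update_trailing(k) + panel;        /* current step */
            #pragma omp section
            if (k + 1 < nsteps)
                next_panel = factorise_panel(k + 1);  /* look ahead   */
        }
    }
    return sum;
}
```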
LU Factorisation: LAPACK Style
[Diagram: already factorised L and U factors, the pivot block, and the active (trailing) submatrix]
- Factorise the pivot block (Level-2 BLAS)
- Permute the rows (Level-1 BLAS)
- Solve the triangular system (Level-3 BLAS)
- Update the trailing submatrix (Level-3 BLAS)
LU Factorisation: SMP Style (1)
[Diagram: active submatrix split into panels 1, 2 and 3; already factorised part and U factor]
- On ALL processors: permute the rows, solve the triangular system, update
- Pivot block: factorise on 1, after the update on 1 is done
LU Factorisation: SMP Style (2)
[Diagram: cyclic assignment of panels to processors]
Apply all the permutations from the right in parallel
LU Factorisation
[Chart: performance (Mflops) vs problem size N = 100 to 8000, on 1 to 48 processors; Sun F15K, 1050 MHz; NAG SMP Release 2 vs Sun Perflib]
QR Factorisation
[Chart: performance (Mflops) vs problem size N = 100 to 8000, on 1 to 20 processors; Sun E6800, 900 MHz; NAG SMP Release 2 vs Sun Perflib and LAPACK]
NEC SX4, LU Factorization
[Chart: performance (Mflops) vs problem size n = 1000 to 4000, on 1 to 14 processors; NAG SMP Library vs LAPACK]
LU Factorisation
[Chart: execution time (secs) vs number of processors (1 to 48), for problem sizes N = 100 to 8000; Sun F15K, 1050 MHz; Sun Perflib]
Other cases
- Tridiagonalisation: half the operations through DSYMV, half through a rank-k symmetric update (Level-3 BLAS); the gateway to the symmetric eigenproblem
- Full SVD (QR algorithm) of a bidiagonal matrix: computational kernel is line rotations (Level-1 BLAS DROT); statistics, LLS problems, rank deficiency, etc.
Tridiagonalisation (Upper variant)
[Chart: performance (Mflops) vs problem size N = 100 to 4000, on 1 to 48 processors; Sun F15K, 1050 MHz; NAG SMP Release 2 vs Sun Perflib]
Full SVD of Bidiagonal Matrix (QR Algor.)
[Chart: performance (Mflops) vs problem size N = 100 to 4000, on 1 to 48 processors; Sun F15K, 1050 MHz; NAG SMP Release 2 vs Sun Perflib]
Full SVD of Bidiagonal Matrix (QR Algor.)
[Chart: execution time (secs) vs number of processors (1 to 48), for problem sizes N = 100 to 4000; Sun F15K, 1050 MHz; NAG SMP Release 2 vs Sun Perflib]
Contents
- Opening Remarks
- SMP Systems, Parallelism and OpenMP
- Tuning OpenMP: General Points; NAG Specific Case: Needs, Solutions and Case Studies
- OpenMP and Hybrid Parallelism
- Some Other Considerations and Conclusions
Clusters of SMPs
[Diagram: SMP nodes connected by an interconnection sub-system]
- Future: high-end, medium-end, low-end
- Technology: re-usable, upgradeable; Linux boxes?
- Hybrid (mixed) model? NAG currently actively involved
Some Considerations
- Enormous increase in CPU performance; less marked improvements to memory subsystems
- Using modular components: relatively higher latency than currently
Hybrid Model Paradigm
Currently: all processors the same (MPI, etc.): "flattening mountains" ...
Hybrid Parallelism: Why?
- High-latency systems
- Increased levels of memory
- Part-serialisation of message-passing
- Increased number of processors competing for communication
[Diagram: SMP nodes sharing an interconnection sub-system]
Mixed Mode Parallelism: A Model’s Goals
- Maximise code re-use: e.g., retain the message-passing main code architecture; use existing SMP techniques and technology
- Allow some form of top-down refinement: identify bottlenecks in isolation from the rest of the code and improve their efficiency
- Exploit a problem's different levels of granularity: coarse granularity mapped onto message-passing; fine granularity mapped onto SMP
- "Hide" communication costs (look-ahead again); reduce load imbalance
- Perhaps best with problems consisting of loosely coupled components
SMP on Clusters
- OS-level SSDMS
- Compiler-level, "architecture-aware" OpenMP: explicit page allocation, etc. Retracing HPF?
- Also, very much the topic of tomorrow's panel
Contents
- Opening Remarks
- SMP Systems, Parallelism and OpenMP
- Tuning OpenMP: General Points; NAG Specific Case: Needs, Solutions and Case Studies
- OpenMP in Hybrid Parallelism
- Some Other Considerations and Conclusions
Performance of Real Applications
- Memory bound: MPI "faster" than OpenMP
- Data "segregation": better access to memory; limited cross-processor memory effects
- We need: block algorithms; better data locality; user-specified prefetching
Example: Very Sparse Large Problems
Random matrix entries, almost diagonally dominant. Diagonal preconditioning: virtually removes the effects of preconditioning from the performance analysis.
4 case studies:
- Neither matrix nor vectors fit in secondary cache: N = 1000000, NNZ = 10000000, random pattern; N = 1000000, NNZ = 10000000, random pattern within a narrow band (bandwidth = 200)
- Matrix does not fit in secondary cache: N = 153600, NNZ = 15000000, random pattern; N = 153600, NNZ = 15000000, random pattern within a narrow band (bandwidth = 2000)
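The kernel behind these case studies is the sparse matrix-vector product; a CSR sketch (an assumed storage format, not the NAG routine) shows why it is memory-bound: with roughly ten nonzeros per row, the accesses to x follow the sparsity pattern and get little cache re-use.

```c
/* y = A*x for a sparse matrix in compressed-row (CSR) storage. The
   row loop parallelises cleanly, since each thread writes its own
   rows of y, but the gathers from x are pattern-driven, so the
   kernel is bound by memory access rather than arithmetic. */
void spmv_csr(int n, const int *row_ptr, const int *col_ind,
              const double *val, const double *x, double *y) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) {
        double s = 0.0;
        for (int p = row_ptr[i]; p < row_ptr[i+1]; p++)
            s += val[p] * x[col_ind[p]];    /* pattern-driven gather */
        y[i] = s;
    }
}
```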
Some Scalability Results
[Chart: speed-up vs number of processors (1 to 4) for:
- TFQMR, random, n = 1000000
- TFQMR, narrow band, n = 1000000
- Bi-CGSTAB(4), random, n = 153600
- Bi-CGSTAB(4), narrow band, n = 153600]
Performance of Matrix-Vector Product
[Chart: relative CPU times (arbitrary units) for Cases 1 to 4]
Some Food for Thought
- Algorithm producers: new algorithms required; block algorithms that are latency-tolerant, parallel-adaptive and dynamically load-balanceable
- Other aspects: expertise; a dearth of SMP and mixed-mode parallel expertise, and an increasing need for it
What do we need in OpenMP?
- More flexible synchronisations: "level crossings" (some wait, one releases)? Partial barriers, relationships of precedence?
- More flexible work-sharing mechanisms
- Data allocation/distribution: avoid HPF constructs; is good page migration sufficient on NUMA?
- User-specified prefetching: essential for performance; a portable API (or part of OpenMP); compiler writers rather against it
- Some message-passing mechanism?
Multi-Level Parallelism: Do We Need it?
- Multi-level applications: loosely coupled components; SMP nested parallelism?
- Heterogeneous applications: very difficult to map onto OpenMP, currently; SMP nested parallelism?
- Hardware requirements (hybrid systems): clusters of SMPs
Summary
- Tuning of OpenMP: difficult but feasible; considerable gains; a gateway between the serial and parallel worlds
- Current algorithms may need revision; good prefetching essential in the future
- Challenges ahead: developing "look-ahead" strategies for numerical algorithms; mapping existing numerical algorithms to future architectures (clusters of SMPs); developing new "multi-level" algorithms
Thank you for your attention
Stef Salvini
stef.salvini@nag.co.uk